Hands-On: PHP and Big Data Development in Practice

Posted by 懂天明 on 2019-05-18

Original URL: https://www.cnblogs.com/xinlangboke/p/10887734.html

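Big data is a term for using tools and techniques to process large and complex data sets. One technique for processing such volumes of data is MapReduce.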


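When to use MapReduce

MapReduce is particularly well suited to problems that involve large amounts of data. It splits the work into smaller chunks that can then be processed by multiple systems. Because MapReduce divides a problem into pieces and works on them in parallel, it can reach a solution faster than a traditional system.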

Typical scenarios where MapReduce is applied include:

1 Counting and statistics
2 Collating
3 Filtering
4 Sorting

Apache Hadoop

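In this article we will use Apache Hadoop.

Hadoop is the recommended choice for developing MapReduce solutions: it is the de facto standard, and it is free, open-source software.
You can also rent or set up a Hadoop cluster with cloud providers such as Amazon, Google, and Microsoft.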

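It has several other advantages:

Scalable: new processing nodes can be added easily without changing a single line of code.
Cost-effective: no specialized or exotic hardware is needed, because the software runs well on ordinary hardware.
Flexible: it is schema-less. It can process any data structure and can even combine multiple data sources without much trouble.
Fault-tolerant: if a node fails, other nodes take over its work and the cluster as a whole keeps processing.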

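In addition, Hadoop supports a utility called "streaming", which gives users the freedom to choose the scripting language used to write the mapper and the reducer.

In this article we will use PHP as the main development language.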

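Installing Hadoop

Installing and configuring Apache Hadoop is beyond the scope of this article. You can easily find plenty of articles online for your platform. To keep things simple, we discuss only the parts related to big data.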

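Mapper

The mapper's task is to transform the input into a series of key-value pairs. In a word counter, for example, the input is a series of lines. We split the lines into words and turn each word into a key-value pair (key: word, value: 1), which looks like this: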

the 1
water 1
on 1
on 1
water 1
on 1
... 1

These pairs are then sent to the reducer for the next step.
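The mapping step can be tried in isolation. The following is a minimal sketch and not one of the scripts used later; the function name map_line is hypothetical and chosen only for illustration. It turns one line of text into word/1 pairs like the ones listed above.

```php
<?php
// Hypothetical helper for illustration: maps one line of text
// to a list of [word, 1] pairs, as a word-count mapper would.
function map_line(string $line): array
{
    // split on runs of whitespace and drop empty pieces
    $words = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
    $pairs = [];
    foreach ($words as $word) {
        // every occurrence of a word counts as 1
        $pairs[] = [$word, 1];
    }
    return $pairs;
}

// print the pairs in the tab-delimited form used by the scripts in this article
foreach (map_line("the water on on water on") as [$word, $count]) {
    printf("%s\t%d\n", $word, $count);
}
```

Note that the mapper does no counting of its own: a word that occurs three times is simply emitted three times, and the summing is left to the reducer.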

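Reducer

The reducer's task is to receive the (sorted) pairs, iterate over them, and transform them into the desired output. In the word-counter example, it takes the counts (values) and adds them up to get each word (key) and its final count, as follows: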

water 2
the 1
on 3
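The reducing step can be sketched the same way. This is an illustrative sketch under the assumption that the pairs arrive sorted by key; the function name reduce_sorted_pairs is hypothetical, and the usort call is only a stand-in for the sorting that Hadoop performs between the two steps.

```php
<?php
// Hypothetical helper for illustration: sums the counts of
// consecutive pairs that share a key. The input must already be
// sorted by key, as it is when Hadoop hands it to a reducer.
function reduce_sorted_pairs(array $pairs): array
{
    $totals = [];
    $last_key = null;
    $running_total = 0;
    foreach ($pairs as [$key, $count]) {
        if ($key === $last_key) {
            // same key as before: keep adding
            $running_total += $count;
            continue;
        }
        if ($last_key !== null) {
            // key changed: the previous key is complete
            $totals[] = [$last_key, $running_total];
        }
        $last_key = $key;
        $running_total = $count;
    }
    if ($last_key !== null) {
        // flush the final key
        $totals[] = [$last_key, $running_total];
    }
    return $totals;
}

$pairs = [["the", 1], ["water", 1], ["on", 1], ["on", 1], ["water", 1], ["on", 1]];
// stand-in for Hadoop's sort phase: order the pairs by key
usort($pairs, fn($a, $b) => strcmp($a[0], $b[0]));
foreach (reduce_sorted_pairs($pairs) as [$key, $total]) {
    printf("%s\t%d\n", $key, $total);
}
```

Because equal keys are adjacent after sorting, the reducer only has to remember one key and one running total at a time, no matter how large the input is.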

The whole process of mapping and reducing looks roughly like the following diagram:

[Diagram: the mapping and reducing process]

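A word counter in PHP

We start with the "Hello World" of the MapReduce world: a simple word-counter implementation. We need some data to process, so we use the public-domain book Moby Dick for the experiment.

Run the following command to download the book: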

wget http://www.gutenberg.org/cache ... 1.txt

Create a working directory in HDFS (Hadoop Distributed File System):

hadoop dfs -mkdir wordcount

Our PHP code starts with the mapper:

#!/usr/bin/php
<?php
    // iterate through lines
    while (($line = fgets(STDIN)) !== false) {
        // remove leading and trailing whitespace
        $line = trim($line);

        // split the line in words
        $words = preg_split('/\s/', $line, -1, PREG_SPLIT_NO_EMPTY);
        // iterate through words
        foreach( $words as $key ) {
            // print word (key) to standard output
            // the output will be used in the
            // reduce (reducer.php) step
            // word (key) tab-delimited wordcount (1)
            printf("%s\t%d\n", $key, 1);
        }
    }
?>

Below is the reducer code.

#!/usr/bin/php
<?php
    $last_key = null;
    $running_total = 0;

    // iterate through lines
    while (($line = fgets(STDIN)) !== false) {
        // remove leading and trailing whitespace
        $line = trim($line);
        // skip blank lines
        if ($line === '') {
            continue;
        }
        // split line into key and count
        list($key, $count) = explode("\t", $line);
        // this if/else structure works because
        // hadoop sorts the mapper output by its keys
        // before sending it to the reducer
        // if the last key retrieved is the same
        // as the current key that has been received
        if ($last_key === $key) {
            // increase running total of the key
            $running_total += (int) $count;
        } else {
            if ($last_key !== null) {
                // output previous key and its running total
                printf("%s\t%d\n", $last_key, $running_total);
            }
            // reset last key and running total
            // by assigning the new key and its value
            $last_key = $key;
            $running_total = (int) $count;
        }
    }

    // output the last key, which the loop above never prints
    if ($last_key !== null) {
        printf("%s\t%d\n", $last_key, $running_total);
    }
?>

You can easily test the scripts locally by combining a few commands with pipes.

head -n1000 pg2701.txt | ./mapper.php | sort | ./reducer.php

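Now we run it on an Apache Hadoop cluster (the -input path below assumes the book has already been copied to HDFS as hello/mobydick.txt):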

hadoop jar /usr/hadoop/2.5.1/libexec/lib/hadoop-streaming-2.5.1.jar \
 -mapper "./mapper.php" \
 -reducer "./reducer.php" \
 -input "hello/mobydick.txt" \
 -output "hello/result"

The output is stored in the folder hello/result and can be viewed by running the following command:

hdfs dfs -cat hello/result/part-00000

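Calculating the average yearly gold price

The next example is more practical. The data set is relatively small, but the same logic can easily be applied to far larger sets of data points. We will try to calculate the average yearly gold price over the past fifty years.

We download the data set: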

wget https://raw.githubusercontent. ... a.csv

Create a working directory in HDFS (Hadoop Distributed File System):

hadoop dfs -mkdir goldprice

Copy the downloaded data set to HDFS:

hadoop dfs -copyFromLocal ./data.csv goldprice/data.csv

My mapper looks like this:

#!/usr/bin/php
<?php
    // iterate through lines
    while (($line = fgets(STDIN)) !== false) {
        // remove leading and trailing whitespace
        $line = trim($line);

        // regular expression to capture year and gold value
        preg_match("/^(.*?)\-(?:.*),(.*)$/", $line, $matches);

        if ($matches) {
            // key: year, value: gold price
            printf("%s\t%.3f\n", $matches[1], $matches[2]);
        }
    }
?>
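To see what the regular expression captures, here is a small sketch that applies the same pattern to a single row. It assumes rows of the form "YYYY-MM,price"; that format, the sample value, and the function name parse_row are assumptions made for illustration and are not taken from the data set itself.

```php
<?php
// Hypothetical helper for illustration: extracts [year, price]
// from one CSV row, or returns null when the row does not match.
function parse_row(string $row): ?array
{
    // same pattern as the mapper: the text before the first "-" is
    // the year, the text after the last "," is the price
    if (preg_match('/^(.*?)\-(?:.*),(.*)$/', trim($row), $matches) !== 1) {
        return null;
    }
    return [$matches[1], (float) $matches[2]];
}

// assumed row format: "YYYY-MM,price"
var_dump(parse_row("1950-01,34.73"));
// a row without "-" (such as a header) does not match and is skipped
var_dump(parse_row("Date,Price"));
```

The month is matched by a non-capturing group, so only the year survives as the key. All rows of the same year therefore share one key and end up next to each other at the reducer.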

The reducer is also slightly modified, because we need to track the number of items and compute the average.

#!/usr/bin/php
<?php
    $last_key = null;
    $running_total = 0;
    $running_average = 0;
    $number_of_items = 0;

    // iterate through lines
    while (($line = fgets(STDIN)) !== false) {
        // remove leading and trailing whitespace
        $line = trim($line);
        // skip blank lines
        if ($line === '') {
            continue;
        }

        // split line into key (year) and value (price)
        list($key, $value) = explode("\t", $line);

        // if the last key retrieved is the same
        // as the current key that has been received
        if ($last_key === $key) {
            // increase number of items
            $number_of_items++;
            // increase running total of the key
            $running_total += (float) $value;
            // (re)calculate average for that key
            $running_average = $running_total / $number_of_items;
        } else {
            if ($last_key !== null) {
                // output previous key and its running average
                printf("%s\t%.3f\n", $last_key, $running_average);
            }
            // reset key, running total, running average
            // and number of items
            $last_key = $key;
            $number_of_items = 1;
            $running_total   = (float) $value;
            $running_average = (float) $value;
        }
    }

    // output the last key and its running average
    if ($last_key !== null) {
        printf("%s\t%.3f\n", $last_key, $running_average);
    }
?>
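The averaging logic can also be checked without Hadoop. The sketch below is a variation for illustration, not the script above: it keeps only a sum and an item count per key and divides once when the key changes, instead of recalculating the average on every row. The function name average_sorted_pairs and the sample values are hypothetical.

```php
<?php
// Hypothetical helper for illustration: averages the values of
// consecutive pairs that share a key (input sorted by key).
function average_sorted_pairs(array $pairs): array
{
    $averages = [];
    $last_key = null;
    $total = 0.0;
    $items = 0;
    foreach ($pairs as [$key, $value]) {
        if ($key !== $last_key) {
            if ($last_key !== null) {
                // key changed: emit the average of the previous key
                $averages[] = [$last_key, $total / $items];
            }
            $last_key = $key;
            $total = 0.0;
            $items = 0;
        }
        $total += $value;
        $items++;
    }
    if ($last_key !== null) {
        // flush the final key
        $averages[] = [$last_key, $total / $items];
    }
    return $averages;
}

// made-up values, not taken from the data set
$pairs = [["1950", 10.0], ["1950", 20.0], ["1951", 30.0]];
foreach (average_sorted_pairs($pairs) as [$year, $average]) {
    printf("%s\t%.3f\n", $year, $average);
}
```

Both approaches give the same result; dividing once per key just avoids repeated work on years with many rows.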

As with the word-count example, we can also test locally:

head -n1000 data.csv | ./mapper.php | sort | ./reducer.php

Finally, we run it on the Hadoop cluster:

hadoop jar /usr/hadoop/2.5.1/libexec/lib/hadoop-streaming-2.5.1.jar \
 -mapper "./mapper.php" \
 -reducer "./reducer.php" \
 -input "goldprice/data.csv" \
 -output "goldprice/result"

View the averages:

hdfs dfs -cat goldprice/result/part-00000

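Bonus: generating a chart

We often want to turn results into a chart. For this demo I will use gnuplot, but you can use any other tool you find interesting.

First, fetch the result to the local machine: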

hdfs dfs -get goldprice/result/part-00000 gold.dat

Create a gnuplot configuration file (gold.plot) and copy the following content into it:

# Gnuplot script file for generating gold prices
set terminal jpeg
set output "chart.jpg"
set style data lines
set nokey
set grid
set title "Gold prices"
set xlabel "Year"
set ylabel "Price"
plot "gold.dat"

Generate the chart:

gnuplot gold.plot

This produces a file named chart.jpg.
