Hadoop Streaming (Python Edition)

Posted by NullSpider on 2017-11-28

Feedback

  • If anything is unclear during deployment or use, feel free to contact me
    • WeChat: Leo-sunhailin
    • QQ: 379978424


Environment

  • OS: Windows 10 (64-bit) <-- must be 64-bit
    • Linux is not covered here, since deployment on Linux is simpler than on Windows
  • Java version: Java 1.8.0_144
  • Hadoop version: Apache Hadoop 2.7.4
    • (This tutorial uses 2.7.4 as its example)
  • Python version: Python 3.4.4

Downloads

  • Hadoop download:
    • Prefer a domestic mirror if you can; historical versions are only available from the official source.
    • 1. Aliyun mirror: link
    • 2. Tsinghua mirror: link
    • 3. Official: link
  • Winutils (Linux users can skip this):
    • 1. The winutils build must match your Hadoop version
    • 2. Otherwise you will have to hunt for a winutils build someone else compiled, which is likely to be buggy.
  • Download Java and Python on your own
    • JDK 1.8
    • Python 3.4.4

Deployment and testing

  • Setting up the Python environment is not covered here
  • Step 0 (Java environment variables)
    • The single most important issue!
      • On Windows, the Java install path must contain no spaces.
      • After installing Java, copy the install directory out to a path with no spaces at all.
      • Then configure the Java environment variables
  • Step 1 (setting up Hadoop)
    • Note: the deployment below is single-machine, single-node
    • 1.1. Extract Hadoop into a directory of your choice
      • I created the path D:/bigdata/ on my D drive
      • After extraction the Hadoop path is D:/bigdata/hadoop-2.7.4
    • 1.2. Go into etc\hadoop\ under the Hadoop root directory
    • 1.3. Edit core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml (mapred-site.xml originally carries a .template suffix; rename it to remove the suffix)
      • core-site.xml:
      <configuration>
          <!-- NameNode address -->
          <property>
              <name>fs.defaultFS</name>
              <value>hdfs://localhost:9000</value>
              <description>HDFS URI: filesystem://namenode-host:port</description>
          </property>

          <!-- Directory for files generated while running Hadoop -->
          <property>
              <name>hadoop.tmp.dir</name>
              <value>/D:/bigdata/hadoop-2.7.4/workplace/tmp</value>
              <description>Local Hadoop temp directory on the namenode</description>
          </property>
      </configuration>
      • hdfs-site.xml:
      <configuration>
          <!-- Number of replicas for HDFS data blocks -->
          <property>
              <name>dfs.replication</name>
              <value>1</value>
              <description>Replica count; the default is 3, and it should not exceed the number of datanodes</description>
          </property>

          <property>
              <name>dfs.name.dir</name>
              <value>/D:/bigdata/hadoop-2.7.4/workplace/name</value>
              <description>Where the namenode stores the HDFS namespace metadata</description>
          </property>

          <property>
              <name>dfs.data.dir</name>
              <value>/D:/bigdata/hadoop-2.7.4/workplace/data</value>
              <description>Physical storage location of data blocks on the datanode</description>
          </property>

          <property>
              <name>dfs.webhdfs.enabled</name>
              <value>true</value>
              <description>Enable the WebHDFS interface</description>
          </property>

          <property>
              <name>dfs.permissions</name>
              <value>false</value>
              <description>HDFS permission checking</description>
          </property>
      </configuration>
      • yarn-site.xml:
      <configuration>
          <!-- The NodeManager serves map output to reducers via the shuffle service -->
          <property>
              <name>yarn.nodemanager.aux-services</name>
              <value>mapreduce_shuffle</value>
          </property>

          <!-- My machine has 8 GB of RAM; around 6 GB is typical, but for convenience I used the full 8 GB -->
          <property>
              <name>yarn.nodemanager.resource.memory-mb</name>
              <value>8192</value>
          </property>

          <!-- Minimum memory allocated to an MR task container -->
          <property>
              <name>yarn.scheduler.minimum-allocation-mb</name>
              <value>1536</value>
          </property>

          <!-- Maximum memory allocated to an MR task container -->
          <property>
              <name>yarn.scheduler.maximum-allocation-mb</name>
              <value>4096</value>
          </property>

          <!-- Number of CPU vcores available to MR tasks -->
          <property>
              <name>yarn.nodemanager.resource.cpu-vcores</name>
              <value>2</value>
          </property>

          <property>
              <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
              <value>org.apache.hadoop.mapred.ShuffleHandler</value>
          </property>
      </configuration>
      • mapred-site.xml:
      <configuration>
          <!-- Tell Hadoop to run MapReduce on YARN -->
          <property>
              <name>mapreduce.framework.name</name>
              <value>yarn</value>
          </property>
          <property>
              <name>mapreduce.map.memory.mb</name>
              <value>2048</value>
          </property>
          <property>
              <name>mapreduce.reduce.memory.mb</name>
              <value>2048</value>
          </property>
          <property>
              <name>mapreduce.jobtracker.http.address</name>
              <value>localhost:50030</value>
          </property>
          <property>
              <name>mapreduce.jobhistory.address</name>
              <value>localhost:10020</value>
          </property>
          <property>
              <name>mapreduce.jobhistory.webapp.address</name>
              <value>localhost:19888</value>
          </property>
          <property>
              <name>mapred.job.tracker</name>
              <value>http://localhost:9001</value>
          </property>
      </configuration>
    • 1.4. Back in the Hadoop root directory, create the folders referenced in hdfs-site.xml and core-site.xml
    • 1.5. Configure environment variables:
      • 1.5.1. Add a system variable HADOOP_CONF_DIR
      D:\bigdata\hadoop-2.7.4\etc\hadoop\
      • 1.5.2. Add a system variable HADOOP_HOME
      D:\bigdata\hadoop-2.7.4
      • 1.5.3. Append to Path:
      D:\bigdata\hadoop-2.7.4\bin
    • 1.6. Format the namenode
      hadoop namenode -format
    • 1.7. (Once 1.6 completes without errors) start Hadoop
      cd /d D:\bigdata\hadoop-2.7.4\sbin
      dir

      ## Recommended startup sequence
      start-dfs.cmd
      start-yarn.cmd

      ## The blunt way
      start-all.cmd
    • 1.8. (Once everything above runs without errors) open a new cmd window
      jps

      # Check whether each component started successfully and holds a process ID
    • 1.9. Verify the web UIs (once both respond, move on): http://localhost:50070 http://localhost:8088
    • 2.0. Check that the Hadoop MapReduce examples run:

      # Lists the available example programs
      hadoop jar D:\bigdata\hadoop-2.7.4\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.4.jar -info

      # Use the most common test, pi (3 tasks, 100 samples per task; their product is the total sample count)
      hadoop jar D:\bigdata\hadoop-2.7.4\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.4.jar pi 3 100
    • 2.1. Once that runs successfully, start writing the Python mapper and reducer
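
The checks in steps 1.5 and 1.9 can be scripted. The sketch below (standard library only) prints the state of the environment variables this guide sets and probes the two web UIs on the ports used above; treat it as a convenience check, not part of the official setup.

```python
import os
import shutil
from urllib.request import urlopen
from urllib.error import URLError


def ui_is_up(url, timeout=3):
    """Return True if the web UI at `url` answers an HTTP request."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False


# Variables configured in steps 0 and 1.5
for var in ("JAVA_HOME", "HADOOP_HOME", "HADOOP_CONF_DIR"):
    print(var, "=", os.environ.get(var, "<not set>"))

# `hadoop` should resolve once %HADOOP_HOME%\bin is on Path
print("hadoop on PATH:", shutil.which("hadoop") is not None)

# NameNode and ResourceManager web UIs from step 1.9
for url in ("http://localhost:50070", "http://localhost:8088"):
    print(url, "->", "up" if ui_is_up(url) else "down")
```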

Code example

  • Usage text from the official Hadoop Streaming jar:
Usage: $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar [options]
Options:
  -input          <path> DFS input file(s) for the Map step.
  -output         <path> DFS output directory for the Reduce step.
  -mapper         <cmd|JavaClassName> Optional. Command to be run as mapper.
  -combiner       <cmd|JavaClassName> Optional. Command to be run as combiner.
  -reducer        <cmd|JavaClassName> Optional. Command to be run as reducer.
  -file           <file> Optional. File/dir to be shipped in the Job jar file.
                  Deprecated. Use generic option "-files" instead.
  -inputformat    <TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName>
                  Optional. The input format class.
  -outputformat   <TextOutputFormat(default)|JavaClassName>
                  Optional. The output format class.
  -partitioner    <JavaClassName>  Optional. The partitioner class.
  -numReduceTasks <num> Optional. Number of reduce tasks.
  -inputreader    <spec> Optional. Input recordreader spec.
  -cmdenv         <n>=<v> Optional. Pass env.var to streaming commands.
  -mapdebug       <cmd> Optional. To run this script when a map task fails.
  -reducedebug    <cmd> Optional. To run this script when a reduce task fails.
  -io             <identifier> Optional. Format to use for input to and output
                  from mapper/reducer commands
  -lazyOutput     Optional. Lazily create Output.
  -background     Optional. Submit the job and don't wait till it completes.
  -verbose        Optional. Print verbose output.
  -info           Optional. Print detailed usage.
  -help           Optional. Print help message.

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]


Usage tips:
In -input: globbing on <path> is supported and can have multiple -input

Default Map input format: a line is a record in UTF-8 the key part ends at first
  TAB, the rest of the line is the value

To pass a Custom input format:
  -inputformat package.MyInputFormat

Similarly, to pass a custom output format:
  -outputformat package.MyOutputFormat

The files with extensions .class and .jar/.zip, specified for the -file
  argument[s], end up in "classes" and "lib" directories respectively inside
  the working directory when the mapper and reducer are run. All other files
  specified for the -file argument[s] end up in the working directory when the
  mapper and reducer are run. The location of this working directory is
  unspecified.

To set the number of reduce tasks (num. of output files) as, say 10:
  Use -numReduceTasks 10
To skip the sort/combine/shuffle/sort/reduce step:
  Use -numReduceTasks 0
  Map output then becomes a 'side-effect output' rather than a reduce input.
  This speeds up processing. This also feels more like "in-place" processing
  because the input filename and the map input order are preserved.
  This is equivalent to -reducer NONE

To speed up the last maps:
  -D mapreduce.map.speculative=true
To speed up the last reduces:
  -D mapreduce.reduce.speculative=true
To name the job (appears in the JobTracker Web UI):
  -D mapreduce.job.name='My Job'
To change the local temp directory:
  -D dfs.data.dir=/tmp/dfs
  -D stream.tmpdir=/tmp/streaming
Additional local temp directories with -jt local:
  -D mapreduce.cluster.local.dir=/tmp/local
  -D mapreduce.jobtracker.system.dir=/tmp/system
  -D mapreduce.cluster.temp.dir=/tmp/temp
To treat tasks with non-zero exit status as SUCCEDED:
  -D stream.non.zero.exit.is.failure=false
Use a custom hadoop streaming build along with standard hadoop install:
  $HADOOP_PREFIX/bin/hadoop jar /path/my-hadoop-streaming.jar [...]\
    [...] -D stream.shipped.hadoopstreaming=/path/my-hadoop-streaming.jar
For more details about jobconf parameters see:
  http://wiki.apache.org/hadoop/JobConfFile
To set an environement variable in a streaming command:
   -cmdenv EXAMPLE_DIR=/home/example/dictionaries/

Shortcut:
   setenv HSTREAMING "$HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar"

Example: $HSTREAMING -mapper "/usr/local/bin/perl5 filter.pl"
           -file /local/filter.pl -input "/logs/0604*/*" [...]
  Ships a script, invokes the non-shipped perl interpreter. Shipped files go to
  the working directory so filter.pl is found by perl. Input files are all the
  daily logs for days in month 2006-04
  • Data preparation (a books.json dataset I found online)
["milton-paradise.txt", "[ Paradise Lost by John Milton 1667 ] Book I Of Man ' s first disobedience , and the fruit Of that forbidden tree whose mortal taste Brought death into the World , and all our woe , With loss of Eden , till one greater Man Restore us , and regain the blissful seat , Sing , Heavenly Muse , that , on the secret top Of Oreb , or of Sinai , didst inspire That shepherd who first taught the chosen seed In the beginning how the heavens and earth Rose out of Chaos : or , if Sion hill Delight thee more , and Siloa ' s brook that flowed Fast by the oracle of God , I thence Invoke thy aid to my adventurous song , That with no middle flight intends to soar Above th ' Aonian mount , while it pursues Things unattempted yet in prose or rhyme ."]
["edgeworth-parents.txt", "[ The Parent ' s Assistant , by Maria Edgeworth ] THE ORPHANS . Near the ruins of the castle of Rossmore , in Ireland , is a small cabin , in which there once lived a widow and her four children . As long as she was able to work , she was very industrious , and was accounted the best spinner in the parish ; but she overworked herself at last , and fell ill , so that she could not sit to her wheel as she used to do , and was obliged to give it up to her eldest daughter , Mary ."]
["austen-emma.txt", "[ Emma by Jane Austen 1816 ] VOLUME I CHAPTER I Emma Woodhouse , handsome , clever , and rich , with a comfortable home and happy disposition , seemed to unite some of the best blessings of existence ; and had lived nearly twenty - one years in the world with very little to distress or vex her . She was the youngest of the two daughters of a most affectionate , indulgent father ; and had , in consequence of her sister ' s marriage , been mistress of his house from a very early period . Her mother had died too long ago for her to have more than an indistinct remembrance of her caresses ; and her place had been supplied by an excellent woman as governess , who had fallen little short of a mother in affection ."]
["chesterton-ball.txt", "[ The Ball and The Cross by G . K . Chesterton 1909 ] I . A DISCUSSION SOMEWHAT IN THE AIR The flying ship of Professor Lucifer sang through the skies like a silver arrow ; the bleak white steel of it , gleaming in the bleak blue emptiness of the evening . That it was far above the earth was no expression for it ; to the two men in it , it seemed to be far above the stars . The professor had himself invented the flying machine , and had also invented nearly everything in it ."]
["bible-kjv.txt", "[ The King James Bible ] The Old Testament of the King James Bible The First Book of Moses : Called Genesis 1 : 1 In the beginning God created the heaven and the earth . 1 : 2 And the earth was without form , and void ; and darkness was upon the face of the deep . And the Spirit of God moved upon the face of the waters . 1 : 3 And God said , Let there be light : and there was light . 1 : 4 And God saw the light , that it was good : and God divided the light from the darkness . 1 : 5 And God called the light Day , and the darkness he called Night . And the evening and the morning were the first day ."]
["chesterton-thursday.txt", "[ The Man Who Was Thursday by G . K . Chesterton 1908 ] To Edmund Clerihew Bentley A cloud was on the mind of men , and wailing went the weather , Yea , a sick cloud upon the soul when we were boys together . Science announced nonentity and art admired decay ; The world was old and ended : but you and I were gay ; Round us in antic order their crippled vices came -- Lust that had lost its laughter , fear that had lost its shame . Like the white lock of Whistler , that lit our aimless gloom , Men showed their own white feather as proudly as a plume . Life was a fly that faded , and death a drone that stung ; The world was very old indeed when you and I were young ."]
["blake-poems.txt", "[ Poems by William Blake 1789 ] SONGS OF INNOCENCE AND OF EXPERIENCE and THE BOOK of THEL SONGS OF INNOCENCE INTRODUCTION Piping down the valleys wild , Piping songs of pleasant glee , On a cloud I saw a child , And he laughing said to me : \" Pipe a song about a Lamb !\" So I piped with merry cheer . \" Piper , pipe that song again ;\" So I piped : he wept to hear . \" Drop thy pipe , thy happy pipe ; Sing thy songs of happy cheer :!\" So I sang the same again , While he wept with joy to hear . \" Piper , sit thee down and write In a book , that all may read .\" So he vanish ' d from my sight ; And I pluck ' d a hollow reed , And I made a rural pen , And I stain ' d the water clear , And I wrote my happy songs Every child may joy to hear ."]
["shakespeare-caesar.txt", "[ The Tragedie of Julius Caesar by William Shakespeare 1599 ] Actus Primus . Scoena Prima . Enter Flauius , Murellus , and certaine Commoners ouer the Stage . Flauius . Hence : home you idle Creatures , get you home : Is this a Holiday ? What , know you not ( Being Mechanicall ) you ought not walke Vpon a labouring day , without the signe Of your Profession ? Speake , what Trade art thou ? Car . Why Sir , a Carpenter Mur . Where is thy Leather Apron , and thy Rule ? What dost thou with thy best Apparrell on ? You sir , what Trade are you ? Cobl . Truely Sir , in respect of a fine Workman , I am but as you would say , a Cobler Mur . But what Trade art thou ? Answer me directly Cob . A Trade Sir , that I hope I may vse , with a safe Conscience , which is indeed Sir , a Mender of bad soules Fla ."]
["whitman-leaves.txt", "[ Leaves of Grass by Walt Whitman 1855 ] Come , said my soul , Such verses for my Body let us write , ( for we are one ,) That should I after return , Or , long , long hence , in other spheres , There to some group of mates the chants resuming , ( Tallying Earth ' s soil , trees , winds , tumultuous waves ,) Ever with pleas ' d smile I may keep on , Ever and ever yet the verses owning -- as , first , I here and now Signing for Soul and Body , set to them my name , Walt Whitman [ BOOK I . INSCRIPTIONS ] } One ' s - Self I Sing One ' s - self I sing , a simple separate person , Yet utter the word Democratic , the word En - Masse ."]
["melville-moby_dick.txt", "[ Moby Dick by Herman Melville 1851 ] ETYMOLOGY . ( Supplied by a Late Consumptive Usher to a Grammar School ) The pale Usher -- threadbare in coat , heart , body , and brain ; I see him now . He was ever dusting his old lexicons and grammars , with a queer handkerchief , mockingly embellished with all the gay flags of all the known nations of the world . He loved to dust his old grammars ; it somehow mildly reminded him of his mortality ."]
  • Task requirements:

    • 1. The data is JSON; tokenize the text content of each txt entry
    • 2. Extract each word and output, one line per word, the file or files it appears in, e.g.:
    Data:
        ["test_1.txt", "[ apple pipe ]"]
        ["test_2.txt", "[ apple company ]"]

    Result:
        apple    ["test_1.txt", "test_2.txt"]
        pipe    ["test_1.txt"]
        company    ["test_2.txt"]
  • mapper.py

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
Created on 2017-10-30
@author: Leo
"""

# Python standard library
import sys
import json


for line in sys.stdin:
    line = line.strip()
    if not line:
        # skip blank lines, which json.loads would choke on
        continue
    record = json.loads(line)
    file_name = record[0]
    value = record[1]
    words = value.split()
    for word in words:
        print("%s\t%s" % (word, file_name))

  • reducer.py
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
Created on 2017-10-30
@author: Leo
"""

# Python standard library
import sys

media = {}
word_in_media = {}

# Collect, for each word, the list of files it was emitted from
for line in sys.stdin:
    (word, file_name) = line.strip().split('\t', 1)
    media.setdefault(word, [])
    media[word].append(file_name)

# Deduplicate the file list for each word
for word in media:
    word_in_media.setdefault(word, list(set(media[word])))

for word in word_in_media:
    print("%s\t%s" % (word, word_in_media[word]))
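
Before submitting to Hadoop, the whole pipeline can be tested locally with `type books.json | python mapper.py | sort | python reducer.py` (on Linux, `cat` instead of `type`); the `sort` in the middle stands in for Hadoop's shuffle. The snippet below simulates the same three phases in-process, on the two toy records from the task description, so you can see what each phase contributes.

```python
import json
from itertools import groupby

# Toy records in the books.json line format (from the task example above)
records = [
    '["test_1.txt", "[ apple pipe ]"]',
    '["test_2.txt", "[ apple company ]"]',
]

# Map phase: emit (word, file_name) pairs, like mapper.py
pairs = []
for line in records:
    file_name, value = json.loads(line)
    for word in value.split():
        pairs.append((word, file_name))

# Shuffle phase: Hadoop sorts mapper output by key; locally `sort` plays this role
pairs.sort()

# Reduce phase: collect the distinct files per word, like reducer.py
index = {}
for word, group in groupby(pairs, key=lambda p: p[0]):
    index[word] = sorted(f for _, f in set(group))

for word in sorted(index):
    print("%s\t%s" % (word, index[word]))
```

Note that the brackets in the sample text come out as "words" too, because the mapper splits on whitespace only.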
  • Upload books.json to HDFS

    • 1. Upload commands
    # If you have not created a directory on HDFS yet, create one first
    # The path below is the one I used; pick your own
    hdfs dfs -mkdir -p /user/Leo/input

    # Copy from the local filesystem to HDFS
    hdfs dfs -copyFromLocal <absolute path>\books.json /user/Leo/input/

    # Open the HDFS UI at localhost:50070 and check that the file is there
    • 2. Delete the output folder
    # If the MapReduce job fails, remember to delete the output folder before rerunning
    hdfs dfs -rm -r /user/Leo/output
    • 3. Leave HDFS safe mode
    # Some abnormal operation may have put HDFS into safe mode
    hdfs dfsadmin -safemode leave
    • 4. Run the Hadoop Streaming command (don't forget the leading dashes!)
    # The command is long, so it is shown with Linux-style line continuations
    hadoop jar D:/bigdata/hadoop-2.7.4/share/hadoop/tools/lib/hadoop-streaming-2.7.4.jar \
    -D stream.non.zero.exit.is.failure=false \
    -input /user/Leo/input/books.json \
    -output /user/Leo/output \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -file C:/Users/Administrator/Desktop/MingDong_Work/Work_2/mapper.py \
    -file C:/Users/Administrator/Desktop/MingDong_Work/Work_2/reducer.py

    # Explanation:
    1. After jar comes the path to the streaming jar; the docs recommend an environment variable plus relative path, but I use the absolute path here for clarity
    2. -D stream.non.zero.exit.is.failure=false means a mapper or reducer that does not exit with 0 is not treated as a failure; it skips the exit-status check
    3. -input: the input file on HDFS
    4. -output: where the results are stored once the M-R job finishes
    5. -mapper: the script or command to run as the mapper
    6. -reducer: the script or command to run as the reducer
    7. -file: ships the script to the cluster (use -files for multiple files; I use the older -file form here for clarity)
    • 5. After a successful run, a file named part-00000 appears under the specified output path on HDFS; it holds the results and can be downloaded from the web UI or from the command line:
    hdfs dfs -get <HDFS file path>(/user/Leo/output/part-00000) <local path>
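
Once part-00000 is on the local disk it can be loaded back into a dict. Because reducer.py prints a Python list repr rather than JSON, `ast.literal_eval` is needed to parse the value side. The sketch below runs over an inline sample in that format rather than the real file:

```python
import ast

# Two sample lines in the format reducer.py produces: word, TAB, Python list repr
sample = "apple\t['test_1.txt', 'test_2.txt']\npipe\t['test_1.txt']"

index = {}
for line in sample.splitlines():
    word, files = line.split("\t", 1)
    index[word] = ast.literal_eval(files)  # parse the list repr back into a list

print(index["apple"])  # → ['test_1.txt', 'test_2.txt']
```

Having the reducer print `json.dumps(...)` instead would make the output parseable from any language, not just Python.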

Summary

  • Overall, Hadoop Streaming is easy to pick up; the hard part is understanding the MapReduce model.
  • You need to understand:
    • The Map-Reduce (i.e. Map-Shuffle-Reduce) workflow
    • How Hadoop Streaming splits each line into key and value
    • Your own business requirements
    • The characteristics of the scripting language you use to implement them

Advantages: (convenient and quick)

  • Any language that supports stdin/stdout works (Unix style)
  • Flexible; well suited to one-off tasks, with no changes to the project's code structure
  • Easy to debug locally

Disadvantages: (mostly performance)

  • Precisely because data is exchanged over stdin/stdout, types inevitably have to be converted to and from text, which adds to the job's running time.

Addendum

  • A strange error encountered on Windows
    (screenshot of the error message omitted)
    1. By default administrators can create symbolic links, so you can start the Hadoop applications from an administrator command prompt
    2. Alternatively, change the user policy as follows:
      1. Win+R -> gpedit.msc
      2. Computer Configuration -> Windows Settings -> Security Settings -> Local Policies -> User Rights Assignment -> Create symbolic links
      3. Add your user there, then reboot or log off
