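This article is for testing and learning only. I do not recommend running 2.0 in production: 2.0 uses YARN, and tools that depend on map/reduce V1, such as hive, hbase and mahout, may not work with hadoop 2.0 or may behave unexpectedly.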

On May 23, apache released a test version of hadoop 2.0. I happened to be at home with nothing to do, so I had a small taste of map/reduce V2.

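Environment: an ubuntu server 12.04 virtual machine in virtual box, with openjdk-7.
A brief introduction: 2.0.0 grew out of hadoop 0.23.x. The jobtracker and tasktracker have been removed, or rather, their duties have moved into YARN, which runs tasks inside containers. YARN takes over from the original map/reduce runtime.
YARN is billed as the second generation of map/reduce: faster than the first generation and able to support more servers per cluster. hadoop 0.20.x and the 1.0.x line derived from it are recommended for clusters of around 3000 machines, with a maximum of about 4000. hadoop 2.0 with YARN claims support for 6000-10000 machines and 200000 CPU cores. In terms of cluster size and computing capacity, that seems like a considerable improvement. It also adds HA (high availability) for the namenode. I say "seems" because I have not measured the speed in a real production environment. As for namenode HA, this was a virtual machine test, so I did not try it and only took a brief look.
The directory layout of 2.0 has changed from 1.0 and is much clearer. Executables are under bin/, the server start scripts have moved to sbin/, and the jars for map/red, streaming and pipes are under share/. Everything is easy to find.
After unpacking the tarball, go into etc/hadoop/ first and set up the configuration files as you would for a single-node install. core-site.xml and hdfs-site.xml are there, but mapred-site.xml is gone; yarn-site.xml takes its place.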
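For readers who have not done the single-node configuration before, here is a minimal Python 3 sketch that renders the two HDFS config files. The property values are my assumptions rather than the settings used in this test: hdfs://localhost:9000 is taken from the job log at the end of this article, and a replication factor of 1 is simply the usual choice for a one-machine setup. `render_site_xml` is a hypothetical helper name.

```python
# Sketch: render single-node Hadoop config files (Python 3).
# Assumed values: hdfs://localhost:9000 (seen in the job log) and dfs.replication=1.
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Return a Hadoop *-site.xml document for a dict of name -> value."""
    root = ET.Element("configuration")
    for name in sorted(properties):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(properties[name])
    return ET.tostring(root, encoding="unicode")


# fs.defaultFS is the 2.0 name of the old fs.default.name key.
core_site = render_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})
hdfs_site = render_site_xml({"dfs.replication": 1})

print(core_site)
print(hdfs_site)
```

Writing the two strings to etc/hadoop/core-site.xml and etc/hadoop/hdfs-site.xml would give a starting point; yarn-site.xml is left out because I have not verified which properties this alpha release needs.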
Assuming the single-node configuration is in place, go to the $HADOOP_HOME/bin/ directory
and run the following:
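./hadoop namenode -format
#format the namenode first
cd ../sbin/
#go to the sbin directory, which holds all the server start scripts
./hadoop-daemon.sh start namenode
./hadoop-daemon.sh start datanode
./hadoop-daemon.sh start secondarynamenode

#The secondarynamenode is optional here; everything works without it. Note that it only checkpoints the namenode metadata and is not the standby used by the HA feature.
#The next part matters more: 2.0 dropped the jobtracker and tasktracker in favor of YARN, so running something like start jobtracker reports an error.
#Also, hadoop, hdfs and map/reduce now each have their own scripts, so hadoop-daemon.sh can no longer start everything.
./yarn-daemon.sh start resourcemanager
#This is roughly the equivalent of the old jobtracker: the process that allocates computing resources. It can share a machine with the namenode.
./yarn-daemon.sh start nodemanager
#This is roughly the equivalent of the old tasktracker and must be started on every datanode (slave) server.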
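The start order above can be captured in a small Python sketch, which is handy if you restart the test machine often. It only builds the command lines and executes nothing; `start_commands` is a hypothetical helper, and /opt/hadoop is assumed as the install path because it is the one used later in this article.

```python
# Sketch: the daemon start order from the steps above, as argv lists.

# (script, daemon) pairs in the order they are started in this article.
START_ORDER = [
    ("hadoop-daemon.sh", "namenode"),
    ("hadoop-daemon.sh", "datanode"),
    ("hadoop-daemon.sh", "secondarynamenode"),  # optional on a test box
    ("yarn-daemon.sh", "resourcemanager"),
    ("yarn-daemon.sh", "nodemanager"),
]


def start_commands(hadoop_home, skip=()):
    """Return one argv list per daemon; nothing is executed here."""
    return [
        ["%s/sbin/%s" % (hadoop_home, script), "start", daemon]
        for script, daemon in START_ORDER
        if daemon not in skip
    ]


for argv in start_commands("/opt/hadoop"):
    print(" ".join(argv))
```

Passing each list to `subprocess.check_call` would actually start the daemon; keeping the builder separate makes the order easy to inspect first.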

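Run ps aux. If you see 4 java processes (5 if you also started the secondarynamenode), the startup succeeded. Visit http://localhost:50070 to check the state of hdfs. Since the jobtracker is gone, there is no longer a port 50030 for viewing job status; more on that another time.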
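Counting java processes by eye is easy to get wrong, so here is a rough Python sketch that looks for each daemon's main class in the ps aux text instead. The class names are the standard Hadoop main classes; the sample ps lines are made up for illustration, and `running_daemons` is a hypothetical helper.

```python
# Sketch: decide which Hadoop daemons are running from `ps aux` text.
DAEMON_CLASSES = {
    "namenode": "org.apache.hadoop.hdfs.server.namenode.NameNode",
    "datanode": "org.apache.hadoop.hdfs.server.datanode.DataNode",
    "secondarynamenode": "org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode",
    "resourcemanager": "org.apache.hadoop.yarn.server.resourcemanager.ResourceManager",
    "nodemanager": "org.apache.hadoop.yarn.server.nodemanager.NodeManager",
}


def running_daemons(ps_output):
    """Return the set of daemon names whose main class appears as a token."""
    found = set()
    for line in ps_output.splitlines():
        tokens = line.split()
        for daemon, main_class in DAEMON_CLASSES.items():
            if main_class in tokens:
                found.add(daemon)
    return found


# Made-up ps lines, shortened for illustration.
sample = (
    "root 1201 0.5 4.0 java -Dproc_namenode org.apache.hadoop.hdfs.server.namenode.NameNode\n"
    "root 1302 0.4 3.1 java -Dproc_datanode org.apache.hadoop.hdfs.server.datanode.DataNode\n"
    "root 1455 0.1 0.0 grep java\n"
)
print(sorted(running_daemons(sample)))
```

On a real machine the text would come from something like `subprocess.check_output(["ps", "aux"])`, decoded to a string.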
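Next, let's write a first map/reduce V2 program. In terms of how it is written, it is no different from V1; only the way it is launched changes slightly. To preserve compatibility, hadoop 2.0 keeps the user-facing interface the same as before.
Take data like this:
20120503        04      2012-05-03 04:49:22                     222.139.35.72   Log_ASF ProductVer="5.12.0425.2111"
20120503        04      2012-05-03 04:49:21                     113.232.38.239  Log_ASF ProductVer="5.09.0119.1112"
Assume only these 2 distinct lines, 20 records in total.
I'll write the map/red scripts in python again.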
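#!/usr/bin/python
# -*- encoding: UTF-8 -*-
# map.py
import sys

debug = True
# lzo shifts every field index by one when not in debug mode
if debug:
    lzo = 0
else:
    lzo = 1

count = '0'
for line in sys.stdin:
    try:
        # fields are tab-separated (the tabs appear as spaces in the sample above)
        flags = line.rstrip('\n').split('\t')
        # skip records too short to hold the version field at index 5+lzo
        if len(flags) < 6 + lzo:
            continue

        stat_date = flags[2 + lzo].split(' ')[0]
        version = flags[5 + lzo].split('"')[1]

        # key and value are separated by a tab, as streaming expects
        print(stat_date + ',' + version + '\t' + count)

    except Exception as e:
        # report bad records on stderr so they stay out of the map output
        sys.stderr.write('%s\n' % e)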

——————————————————————

#!/usr/bin/python
# -*- encoding: UTF-8 -*-
# red.py
import sys

# dictionary of key -> number of records seen
res = {}

for line in sys.stdin:
    try:
        # each input line is "key<TAB>value"; the value itself is ignored
        flags = line.rstrip('\n').split('\t')
        if len(flags) != 2:
            continue
        field_key = flags[0]
        if field_key not in res:
            res[field_key] = 0
        res[field_key] += 1
    except Exception:
        pass

# sorted so the output order is stable
for key in sorted(res.keys()):
    print(key + ',' + '%s' % (res[key]))
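Before going to hdfs, the same logic can be checked locally by mimicking what streaming does: map each line, sort the map output, then reduce. The Python 3 sketch below does that in one process. The sample rows are rebuilt under my assumption that the fields are tab-separated with the ProductVer field at index 5; the real log layout may differ, and `map_line`, `reduce_lines` and `make_row` are hypothetical names.

```python
# Sketch: mimic Hadoop streaming locally (map -> sort -> reduce).
# Assumption: tab-separated fields, ProductVer field at index 5.
from collections import Counter


def map_line(line):
    """Return 'date,version<TAB>0' for a well-formed record, else None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        return None
    try:
        stat_date = fields[2].split(" ")[0]
        version = fields[5].split('"')[1]
    except IndexError:
        return None
    return "%s,%s\t0" % (stat_date, version)


def reduce_lines(mapped):
    """Count records per key, ignoring the value, like the reducer script."""
    counts = Counter()
    for line in mapped:
        parts = line.split("\t")
        if len(parts) == 2:
            counts[parts[0]] += 1
    return ["%s,%d" % (key, counts[key]) for key in sorted(counts)]


def make_row(ip, version):
    """Build one sample record in the guessed six-field layout."""
    return '20120503\t04\t2012-05-03 04:49:22\t\t%s\tLog_ASF ProductVer="%s"\n' % (
        ip,
        version,
    )


rows = [make_row("222.139.35.72", "5.12.0425.2111")] * 18
rows += [make_row("113.232.38.239", "5.09.0119.1112")] * 2

mapped = sorted(m for m in (map_line(r) for r in rows) if m is not None)
result = reduce_lines(mapped)
print("\n".join(result))
```

With the actual scripts, the shell equivalent is to pipe the sample file through the mapper, sort, and the reducer.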

Then copy the sample data onto hdfs with:
./hadoop fs -mkdir /tmp
./hadoop fs -copyFromLocal /root/asf /tmp/asf

Now run a test. It works the same as earlier hadoop versions, and streaming can be launched in either of two ways:
./hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.0.0-alpha.jar -mapper /opt/hadoop/mrs/map.py -reducer /opt/hadoop/mrs/red.py -input /tmp/asf -output /asf

or

./yarn jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.0.0-alpha.jar -mapper /opt/hadoop/mrs/map.py -reducer /opt/hadoop/mrs/red.py -input /tmp/asf -output /asf
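The two invocations differ only in the launcher script. A small Python sketch makes that explicit by building both argument lists from the same pieces; `streaming_argv` is a hypothetical helper and nothing is executed here.

```python
# Sketch: build the two streaming command lines shown above.
JAR = "/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.0.0-alpha.jar"


def streaming_argv(launcher, mapper, reducer, input_path, output_path, jar=JAR):
    """Return the argv list for one streaming job."""
    return [
        launcher, "jar", jar,
        "-mapper", mapper,
        "-reducer", reducer,
        "-input", input_path,
        "-output", output_path,
    ]


job = ("/opt/hadoop/mrs/map.py", "/opt/hadoop/mrs/red.py", "/tmp/asf", "/asf")
via_hadoop = streaming_argv("./hadoop", *job)
via_yarn = streaming_argv("./yarn", *job)

print(" ".join(via_hadoop))
print(" ".join(via_yarn))
```

Either list could be handed to `subprocess.check_call` with the working directory set to the bin/ directory.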

Then look at the output file:
./hadoop fs -cat /asf/part-00000
2012-05-03,5.09.0119.1112,2
2012-05-03,5.12.0425.2111,18
The result is correct.
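Appendix: the map/reduce V2 run log. Note that the job ID job_local_0001 and the mapred.LocalJobRunner lines indicate that this particular run was executed by the local job runner rather than submitted to the resourcemanager.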
root@localhost:/opt/hadoop/bin# ./yarn jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.0.0-alpha.jar -mapper /opt/hadoop/mrs/map.py -reducer /opt/hadoop/mrs/red.py -input /tmp/asf -output /asf
12/06/01 23:26:40 WARN util.KerberosName: Kerberos krb5 configuration not found, setting default realm to empty
12/06/01 23:26:41 WARN conf.Configuration: session.id is deprecated. Instead, use dfs.metrics.session-id
12/06/01 23:26:41 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/06/01 23:26:41 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
12/06/01 23:26:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/06/01 23:26:42 WARN snappy.LoadSnappy: Snappy native library not loaded
12/06/01 23:26:42 INFO mapred.FileInputFormat: Total input paths to process : 1
12/06/01 23:26:42 INFO mapreduce.JobSubmitter: number of splits:1
12/06/01 23:26:42 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
12/06/01 23:26:42 WARN conf.Configuration: mapred.create.symlink is deprecated. Instead, use mapreduce.job.cache.symlink.create
12/06/01 23:26:42 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
12/06/01 23:26:42 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
12/06/01 23:26:42 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
12/06/01 23:26:42 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
12/06/01 23:26:42 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
12/06/01 23:26:42 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
12/06/01 23:26:42 WARN conf.Configuration: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
12/06/01 23:26:42 WARN conf.Configuration: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
12/06/01 23:26:42 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
12/06/01 23:26:42 WARN mapred.LocalDistributedCacheManager: LocalJobRunner does not support symlinking into current working dir.
12/06/01 23:26:42 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
12/06/01 23:26:42 INFO mapreduce.Job: Running job: job_local_0001
12/06/01 23:26:42 INFO mapred.LocalJobRunner: OutputCommitter set in config null
12/06/01 23:26:42 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
12/06/01 23:26:42 INFO mapred.LocalJobRunner: Waiting for map tasks
12/06/01 23:26:42 INFO mapred.LocalJobRunner: Starting task: attempt_local_0001_m_000000_0
12/06/01 23:26:42 INFO mapred.Task:    Using ResourceCalculatorPlugin : org.apache.hadoop.yarn.util.LinuxResourceCalculatorPlugin@52b5ef94
12/06/01 23:26:42 INFO mapred.MapTask: numReduceTasks: 1
12/06/01 23:26:42 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
12/06/01 23:26:42 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
12/06/01 23:26:42 INFO mapred.MapTask: soft limit at 83886080
12/06/01 23:26:42 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
12/06/01 23:26:42 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
12/06/01 23:26:42 INFO streaming.PipeMapRed: PipeMapRed exec [/opt/hadoop/mrs/map.py]
12/06/01 23:26:42 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
12/06/01 23:26:42 WARN conf.Configuration: user.name is deprecated. Instead, use mapreduce.job.user.name
12/06/01 23:26:42 WARN conf.Configuration: map.input.start is deprecated. Instead, use mapreduce.map.input.start
12/06/01 23:26:42 WARN conf.Configuration: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
12/06/01 23:26:42 WARN conf.Configuration: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
12/06/01 23:26:42 WARN conf.Configuration: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
12/06/01 23:26:42 WARN conf.Configuration: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
12/06/01 23:26:42 WARN conf.Configuration: map.input.length is deprecated. Instead, use mapreduce.map.input.length
12/06/01 23:26:42 WARN conf.Configuration: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
12/06/01 23:26:42 WARN conf.Configuration: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir
12/06/01 23:26:42 WARN conf.Configuration: map.input.file is deprecated. Instead, use mapreduce.map.input.file
12/06/01 23:26:42 WARN conf.Configuration: mapred.job.id is deprecated. Instead, use mapreduce.job.id
12/06/01 23:26:43 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
12/06/01 23:26:43 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
12/06/01 23:26:43 INFO streaming.PipeMapRed: MRErrorThread done
12/06/01 23:26:43 INFO streaming.PipeMapRed: Records R/W=20/1
12/06/01 23:26:43 INFO streaming.PipeMapRed: mapRedFinished
12/06/01 23:26:43 INFO mapred.LocalJobRunner:
12/06/01 23:26:43 INFO mapred.MapTask: Starting flush of map output
12/06/01 23:26:43 INFO mapred.MapTask: Spilling map output
12/06/01 23:26:43 INFO mapred.MapTask: bufstart = 0; bufend = 560; bufvoid = 104857600
12/06/01 23:26:43 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214320(104857280); length = 77/6553600
12/06/01 23:26:43 INFO mapred.MapTask: Finished spill 0
12/06/01 23:26:43 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of committing
12/06/01 23:26:43 INFO mapred.LocalJobRunner: Records R/W=20/1
12/06/01 23:26:43 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/06/01 23:26:43 INFO mapred.LocalJobRunner: Finishing task: attempt_local_0001_m_000000_0
12/06/01 23:26:43 INFO mapred.LocalJobRunner: Map task executor complete.
12/06/01 23:26:43 INFO mapred.Task:    Using ResourceCalculatorPlugin : org.apache.hadoop.yarn.util.LinuxResourceCalculatorPlugin@25d71236
12/06/01 23:26:43 INFO mapred.Merger: Merging 1 sorted segments
12/06/01 23:26:43 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 574 bytes
12/06/01 23:26:43 INFO mapred.LocalJobRunner:
12/06/01 23:26:43 INFO streaming.PipeMapRed: PipeMapRed exec [/opt/hadoop/mrs/red.py]
12/06/01 23:26:43 WARN conf.Configuration: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
12/06/01 23:26:43 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
12/06/01 23:26:43 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
12/06/01 23:26:43 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
12/06/01 23:26:43 INFO streaming.PipeMapRed: Records R/W=20/1
12/06/01 23:26:43 INFO streaming.PipeMapRed: MRErrorThread done
12/06/01 23:26:43 INFO streaming.PipeMapRed: mapRedFinished
12/06/01 23:26:43 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of committing
12/06/01 23:26:43 INFO mapred.LocalJobRunner:
12/06/01 23:26:43 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
12/06/01 23:26:43 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9000/asf/_temporary/0/task_local_0001_r_000000
12/06/01 23:26:43 INFO mapred.LocalJobRunner: Records R/W=20/1 > reduce
12/06/01 23:26:43 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
12/06/01 23:26:43 INFO mapreduce.Job: Job job_local_0001 running in uber mode : false
12/06/01 23:26:43 INFO mapreduce.Job:    map 100% reduce 100%
12/06/01 23:26:43 INFO mapreduce.Job: Job job_local_0001 completed successfully
12/06/01 23:26:43 INFO mapreduce.Job: Counters: 32
                File System Counters
                                FILE: Number of bytes read=205938
                                FILE: Number of bytes written=452840
                                FILE: Number of read operations=0
                                FILE: Number of large read operations=0
                                FILE: Number of write operations=0
                                HDFS: Number of bytes read=252230
                                HDFS: Number of bytes written=59
                                HDFS: Number of read operations=13
                                HDFS: Number of large read operations=0
                                HDFS: Number of write operations=4
                Map-Reduce Framework
                                Map input records=20
                                Map output records=20
                                Map output bytes=560
                                Map output materialized bytes=606
                                Input split bytes=81
                                Combine input records=0
                                Combine output records=0
                                Reduce input groups=2
                                Reduce shuffle bytes=0
                                Reduce input records=20
                                Reduce output records=2
                                Spilled Records=40
                                Shuffled Maps =0
                                Failed Shuffles=0
                                Merged Map outputs=0
                                GC time elapsed (ms)=12
                                CPU time spent (ms)=0
                                Physical memory (bytes) snapshot=0
                                Virtual memory (bytes) snapshot=0
                                Total committed heap usage (bytes)=396361728
                File Input Format Counters
                                Bytes Read=126115
                File Output Format Counters
                                Bytes Written=59
12/06/01 23:26:43 INFO streaming.StreamJob: Output directory: /asf
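The counters at the end of the log are a quick sanity check on the job. Here is a Python sketch that pulls the indented name=value lines into a dict and compares a few of them; the excerpt is copied from the log above, while `parse_counters` and the regular expression are my own hypothetical helper, not anything shipped with hadoop.

```python
# Sketch: read the "name=value" counter lines out of a job log.
import re

# Counter lines are indented and end in an integer; timestamped log lines are not indented.
COUNTER_RE = re.compile(r"^\s+(?P<name>[^=]+?)\s*=\s*(?P<value>\d+)\s*$")


def parse_counters(log_text):
    """Return {counter name: int value} for every counter line found."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_RE.match(line)
        if match:
            counters[match.group("name")] = int(match.group("value"))
    return counters


SAMPLE = """\
12/06/01 23:26:43 INFO mapreduce.Job: Counters: 32
                Map-Reduce Framework
                                Map input records=20
                                Map output records=20
                                Reduce input groups=2
                                Reduce input records=20
                                Reduce output records=2
                                Shuffled Maps =0
"""

counters = parse_counters(SAMPLE)
# Every map output record should reach the reducer, one output line per key.
print(counters["Map output records"] == counters["Reduce input records"])
print(counters["Reduce input groups"] == counters["Reduce output records"])
```

Here 20 input records become 20 map outputs, which fall into 2 groups and produce 2 output lines, matching the two versions in the result file.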

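Of course there is more to map/reduce V2 than this, and it deserves a deeper look. Although 2.0 grew out of 0.23, it differs from 0.23 in places: for example, 0.23 has an ApplicationManager, while in 2.0 it no longer seems to be exposed on the outside. Perhaps it has been wrapped into the container as well. The xml configuration options also seem to differ a lot from 0.20.x, though I have not looked at them in detail. As far as I understand it, the HA feature supports more than one namenode and lets you switch manually from one namenode to another, which is how high availability is achieved; automatic failure detection and switchover is said to be coming in the future. Having several namenodes that each manage their own part of the file system appears to be a separate feature (federation) rather than part of HA.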