Hadoop TeraSort Benchmark Experiment

Posted by Zhan Kunlin on 2011-04-01


Author: zhankunlin
Date: 2011-4-1
Key words: Hadoop, TeraSort

 

<1> Introduction to TeraSort

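A 1 TB sort is commonly used to measure the data-processing capability of a distributed data-processing framework. TeraSort is a sorting job that ships with Hadoop. In 2008, Hadoop took first place in the 1 TB sort benchmark with a time of 209 seconds.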

<2> Related Material

Testing the scalability of Hadoop MapReduce:  http://cloud.csdn.net/a/20100901/278934.html
Implementing Hadoop's Map/Reduce TeraSort with MPI:  http://emonkey.blog.sohu.com/166546157.html
Analysis of the TeraSort algorithm in Hadoop:  http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/
Hadoop's 1 TB sort, terasort:  http://hi.baidu.com/dtzw/blog/item/cffc8e1830f908b94bedbc12.html
Sort Benchmark:  http://sortbenchmark.org/
Trie tree:  http://www.cnblogs.com/cherish_yimi/archive/2009/10/12/1581666.html

<3> Experiment

(0) Source code location
    /local/zkl/hadoop/hadoop-0.20.1/hadoop-0.20.1/src/examples/org/apache/hadoop/examples/terasort

(1) First, run teragen to generate the data

[root@gd86 hadoop-0.20.1]# /local/zkl/hadoop/hadoop-0.20.1/hadoop-0.20.1/bin/hadoop jar hadoop-0.20.1-examples.jar teragen 1000000 terasort/1000000-input

Inspect the generated data:

[root@gd86 hadoop-0.20.1]# /local/zkl/hadoop/hadoop-0.20.1/hadoop-0.20.1/bin/hadoop fs -ls  /user/root/terasort/1000000-input
Found 3 items
drwxr-xr-x   - root supergroup          0 2011-03-31 16:21 /user/root/terasort/1000000-input/_logs
-rw-r--r--   3 root supergroup   50000000 2011-03-31 16:21 /user/root/terasort/1000000-input/part-00000  
-rw-r--r--   3 root supergroup   50000000 2011-03-31 16:21 /user/root/terasort/1000000-input/part-00001

Two data files are generated, each 50,000,000 B = 50 MB.

[root@gd86 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar teragen 10 terasort/1000000-input
This generates two 500 B files, which add up to 1000 B = 1 KB.

Each generated row is 100 B. The argument 10 means 10 rows, for 1000 B in total; 1,000,000 rows come to 100,000,000 B = 100 MB.

teragen uses two map tasks to generate the data, and each map writes one file. The two files total 100 MB, so each one is 50 MB.
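
The size arithmetic above can be checked with a short sketch. It assumes the 100 B row size and the two-map default described here, and that the rows divide evenly among the maps. The function name `teragen_layout` is made up for illustration and is not part of Hadoop.

```python
ROW_BYTES = 100  # each teragen row is 100 B, as stated above


def teragen_layout(rows, num_maps=2):
    """Return (total_bytes, bytes_per_file) for a teragen run.

    Assumes the rows divide evenly among the maps, as in the runs shown here.
    """
    total = rows * ROW_BYTES
    return total, total // num_maps


for rows in (10, 1_000_000, 10_000_000):
    total, per_file = teragen_layout(rows)
    print(f"{rows} rows -> {total} B total, {per_file} B per file")
```

For 1,000,000 rows this gives 100,000,000 B in total and 50,000,000 B per file, matching the `hadoop fs -ls` listing.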

[root@gd86 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar teragen 10000000 terasort/1G-input

This produces 1 GB of data. With a block size of 64 MB, the data is split into 16 blocks, so running terasort launches 16 map tasks.
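
A rough sketch of the block count behind this figure. It assumes a 64 MiB (67,108,864 B) block size, two 500,000,000 B input files as written by the two teragen maps, and one map task per block. Hadoop's real input-split logic has more details than this, and `blocks_for` is a hypothetical helper name.

```python
BLOCK_BYTES = 64 * 1024 * 1024  # assumed HDFS block size of 64 MiB


def blocks_for(file_bytes, block_bytes=BLOCK_BYTES):
    """Number of blocks needed to hold one file (the last block may be partial)."""
    return (file_bytes + block_bytes - 1) // block_bytes  # integer ceiling


file_sizes = [500_000_000, 500_000_000]  # the two teragen part files
per_file = [blocks_for(size) for size in file_sizes]
print(per_file, sum(per_file))  # prints: [8, 8] 16
```

Counting per file matters here: treating the 1,000,000,000 B as a single file would give 15 blocks rather than 16.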

[root@gd86 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar teragen 10000000 terasort/1G-input
Generating 10000000 using 2 maps with step of 5000000
11/04/01 17:02:46 INFO mapred.JobClient: Running job: job_201103311423_0005
11/04/01 17:02:47 INFO mapred.JobClient:  map 0% reduce 0%
11/04/01 17:03:00 INFO mapred.JobClient:  map 19% reduce 0%
11/04/01 17:03:01 INFO mapred.JobClient:  map 41% reduce 0%
11/04/01 17:03:03 INFO mapred.JobClient:  map 52% reduce 0%
11/04/01 17:03:04 INFO mapred.JobClient:  map 63% reduce 0%
11/04/01 17:03:06 INFO mapred.JobClient:  map 74% reduce 0%
11/04/01 17:03:10 INFO mapred.JobClient:  map 91% reduce 0%
11/04/01 17:03:12 INFO mapred.JobClient:  map 100% reduce 0%
11/04/01 17:03:14 INFO mapred.JobClient: Job complete: job_201103311423_0005
11/04/01 17:03:14 INFO mapred.JobClient: Counters: 6
11/04/01 17:03:14 INFO mapred.JobClient:   Job Counters
11/04/01 17:03:14 INFO mapred.JobClient:     Launched map tasks=2
11/04/01 17:03:14 INFO mapred.JobClient:   FileSystemCounters
11/04/01 17:03:14 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1000000000
11/04/01 17:03:14 INFO mapred.JobClient:   Map-Reduce Framework
11/04/01 17:03:14 INFO mapred.JobClient:     Map input records=10000000
11/04/01 17:03:14 INFO mapred.JobClient:     Spilled Records=0
11/04/01 17:03:14 INFO mapred.JobClient:     Map input bytes=10000000
11/04/01 17:03:14 INFO mapred.JobClient:     Map output records=10000000


(2) Run terasort to sort the data

Running the terasort program launches 16 map tasks.

root@gd38 hadoop-0.20.1# bin/hadoop jar hadoop-0.20.1-examples.jar terasort terasort/1G-input terasort/1G-output

11/03/31 17:12:49 INFO terasort.TeraSort: starting
11/03/31 17:12:49 INFO mapred.FileInputFormat: Total input paths to process : 2
11/03/31 17:13:05 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/03/31 17:13:05 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/03/31 17:13:05 INFO compress.CodecPool: Got brand-new compressor
Making 1 from 100000 records
Step size is 100000.0
11/03/31 17:13:06 INFO mapred.JobClient: Running job: job_201103311423_0006
11/03/31 17:13:07 INFO mapred.JobClient:  map 0% reduce 0%
11/03/31 17:13:20 INFO mapred.JobClient:  map 12% reduce 0%
11/03/31 17:13:21 INFO mapred.JobClient:  map 37% reduce 0%
11/03/31 17:13:29 INFO mapred.JobClient:  map 50% reduce 2%
11/03/31 17:13:30 INFO mapred.JobClient:  map 75% reduce 2%
11/03/31 17:13:32 INFO mapred.JobClient:  map 75% reduce 12%
11/03/31 17:13:36 INFO mapred.JobClient:  map 87% reduce 12%
11/03/31 17:13:38 INFO mapred.JobClient:  map 100% reduce 12%
11/03/31 17:13:41 INFO mapred.JobClient:  map 100% reduce 25%
11/03/31 17:13:44 INFO mapred.JobClient:  map 100% reduce 31%
11/03/31 17:13:53 INFO mapred.JobClient:  map 100% reduce 33%
11/03/31 17:14:02 INFO mapred.JobClient:  map 100% reduce 68%
11/03/31 17:14:05 INFO mapred.JobClient:  map 100% reduce 71%
11/03/31 17:14:08 INFO mapred.JobClient:  map 100% reduce 75%
11/03/31 17:14:11 INFO mapred.JobClient:  map 100% reduce 79%
11/03/31 17:14:14 INFO mapred.JobClient:  map 100% reduce 82%
11/03/31 17:14:17 INFO mapred.JobClient:  map 100% reduce 86%
11/03/31 17:14:20 INFO mapred.JobClient:  map 100% reduce 90%
11/03/31 17:14:23 INFO mapred.JobClient:  map 100% reduce 93%
11/03/31 17:14:26 INFO mapred.JobClient:  map 100% reduce 97%
11/03/31 17:14:32 INFO mapred.JobClient:  map 100% reduce 100%
11/03/31 17:14:34 INFO mapred.JobClient: Job complete: job_201103311423_0006
11/03/31 17:14:34 INFO mapred.JobClient: Counters: 18
11/03/31 17:14:34 INFO mapred.JobClient:   Job Counters
11/03/31 17:14:34 INFO mapred.JobClient:     Launched reduce tasks=1
11/03/31 17:14:34 INFO mapred.JobClient:     Launched map tasks=16
11/03/31 17:14:34 INFO mapred.JobClient:     Data-local map tasks=16
11/03/31 17:14:34 INFO mapred.JobClient:   FileSystemCounters
11/03/31 17:14:34 INFO mapred.JobClient:     FILE_BYTES_READ=2382257412
11/03/31 17:14:34 INFO mapred.JobClient:     HDFS_BYTES_READ=1000057358
11/03/31 17:14:34 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3402255956
11/03/31 17:14:34 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1000000000
11/03/31 17:14:34 INFO mapred.JobClient:   Map-Reduce Framework
11/03/31 17:14:34 INFO mapred.JobClient:     Reduce input groups=10000000
11/03/31 17:14:34 INFO mapred.JobClient:     Combine output records=0
11/03/31 17:14:34 INFO mapred.JobClient:     Map input records=10000000
11/03/31 17:14:34 INFO mapred.JobClient:     Reduce shuffle bytes=951549012
11/03/31 17:14:34 INFO mapred.JobClient:     Reduce output records=10000000
11/03/31 17:14:34 INFO mapred.JobClient:     Spilled Records=33355441
11/03/31 17:14:34 INFO mapred.JobClient:     Map output bytes=1000000000
11/03/31 17:14:34 INFO mapred.JobClient:     Map input bytes=1000000000
11/03/31 17:14:34 INFO mapred.JobClient:     Combine input records=0
11/03/31 17:14:34 INFO mapred.JobClient:     Map output records=10000000
11/03/31 17:14:34 INFO mapred.JobClient:     Reduce input records=10000000
11/03/31 17:14:34 INFO terasort.TeraSort: done
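
The counters in the log above can be pulled into a dictionary for quick checks. The sketch below copies three counter lines from the log and computes how many records were spilled per map output record. The helper name `parse_counters` and the regular expression are illustrative assumptions, not a Hadoop API.

```python
import re

# A few counter lines copied from the terasort log above
LOG = """\
11/03/31 17:14:34 INFO mapred.JobClient:     Launched map tasks=16
11/03/31 17:14:34 INFO mapred.JobClient:     Spilled Records=33355441
11/03/31 17:14:34 INFO mapred.JobClient:     Map output records=10000000
"""


def parse_counters(log_text):
    """Collect 'name=value' counters from JobClient log lines into a dict."""
    counters = {}
    for line in log_text.splitlines():
        match = re.search(r"JobClient:\s+(.+)=(\d+)$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


counters = parse_counters(LOG)
ratio = counters["Spilled Records"] / counters["Map output records"]
print(f"spilled records per map output record: {ratio:.2f}")
```

A ratio above 1 suggests that records were written to local disk more than once on average, which would be consistent with the large FILE_BYTES_WRITTEN value. Treat that reading as an assumption rather than a diagnosis.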

The run completes and the data is sorted. The output is still 1 GB:

root@gd38 hadoop-0.20.1# bin/hadoop fs -ls terasort/1G-output
Found 2 items
drwxr-xr-x   - root supergroup          0 2011-03-31 17:13 /user/root/terasort/1G-output/_logs
-rw-r--r--   1 root supergroup 1000000000 2011-03-31 17:13 /user/root/terasort/1G-output/part-00000
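
The elapsed time of this run can be read off the first and last lines of the terasort log. The sketch below parses the two timestamps and derives a rough throughput figure. It assumes the log timestamps are in yy/MM/dd HH:mm:ss form and uses the HDFS_BYTES_WRITTEN counter as the amount of data sorted; `log_time` and `elapsed_seconds` are hypothetical helper names.

```python
from datetime import datetime

# First and last lines of the terasort log shown earlier
START_LINE = "11/03/31 17:12:49 INFO terasort.TeraSort: starting"
END_LINE = "11/03/31 17:14:34 INFO terasort.TeraSort: done"
SORTED_BYTES = 1_000_000_000  # HDFS_BYTES_WRITTEN from the job counters


def log_time(line):
    """Parse the leading 17-character 'yy/MM/dd HH:mm:ss' timestamp of a log line."""
    return datetime.strptime(line[:17], "%y/%m/%d %H:%M:%S")


def elapsed_seconds(start_line, end_line):
    """Wall-clock seconds between two log lines."""
    return (log_time(end_line) - log_time(start_line)).total_seconds()


seconds = elapsed_seconds(START_LINE, END_LINE)
print(f"elapsed: {seconds:.0f} s")
print(f"throughput: {SORTED_BYTES / seconds / 1e6:.2f} MB/s")
```

This is client-side wall-clock time, so it includes the setup between "starting" at 17:12:49 and the job beginning to run at 17:13:06.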
