Hadoop TeraSort Benchmark Experiment
Author: zhankunlin
Date: 2011-4-1
Key words: Hadoop, TeraSort
<1> Introduction to TeraSort
A 1 TB sort is commonly used to measure the data-processing capability of a distributed data-processing framework. TeraSort is a sorting job that ships with Hadoop; in 2008 Hadoop won first place in the 1 TB sort benchmark, finishing in 209 seconds.
<2> Related Material
Hadoop MapReduce scalability test: http://cloud.csdn.net/a/20100901/278934.html
Implementing Hadoop Map/Reduce TeraSort with MPI: http://emonkey.blog.sohu.com/166546157.html
Analysis of the TeraSort algorithm in Hadoop: http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/
Hadoop's 1TB sort (terasort): http://hi.baidu.com/dtzw/blog/item/cffc8e1830f908b94bedbc12.html
Sort Benchmark: http://sortbenchmark.org/
Trie tree: http://www.cnblogs.com/cherish_yimi/archive/2009/10/12/1581666.html
<3> Experiment
(0) Source code location
/local/zkl/hadoop/hadoop-0.20.1/hadoop-0.20.1/src/examples/org/apache/hadoop/examples/terasort
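That directory holds the example's main classes; in hadoop-0.20.1 these should be TeraGen.java, TeraSort.java, TeraValidate.java, TeraInputFormat.java and TeraOutputFormat.java (verify against your own source tree).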
(1) First, run teragen to generate the data
[root@gd86 hadoop-0.20.1]# /local/zkl/hadoop/hadoop-0.20.1/hadoop-0.20.1/bin/hadoop jar hadoop-0.20.1-examples.jar teragen 1000000 terasort/1000000-input
View the generated data:
[root@gd86 hadoop-0.20.1]# /local/zkl/hadoop/hadoop-0.20.1/hadoop-0.20.1/bin/hadoop fs -ls /user/root/terasort/1000000-input
Found 3 items
drwxr-xr-x - root supergroup 0 2011-03-31 16:21 /user/root/terasort/1000000-input/_logs
-rw-r--r-- 3 root supergroup 50000000 2011-03-31 16:21 /user/root/terasort/1000000-input/part-00000
-rw-r--r-- 3 root supergroup 50000000 2011-03-31 16:21 /user/root/terasort/1000000-input/part-00001
Two files are generated, each 50,000,000 B = 50 MB.
[root@gd86 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar teragen 10 terasort/1000000-input
This will generate two 500 B files, 1,000 B = 1 KB in total.
Each generated row is 100 B; the argument 10 means 10 rows are produced, 1,000 B in total, so 1,000,000 rows amount to 100,000,000 B = 100 MB.
teragen uses two map tasks to generate the data, each map writing one file; the two files together hold 100 MB, i.e. 50 MB each.
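A minimal sketch of this size arithmetic, in illustrative Java (not part of the Hadoop source), assuming the fixed 100-byte row format and the 2 map tasks used in this run:

// Illustrative sketch of the teragen output-size arithmetic (not Hadoop code).
public class TeraGenSizeEstimate {
    static final long ROW_BYTES = 100L;            // every TeraGen row is exactly 100 bytes

    public static void main(String[] args) {
        long rows = 1000000L;                      // first argument passed to teragen
        int maps = 2;                              // teragen used 2 map tasks in this run
        long totalBytes = rows * ROW_BYTES;        // 100,000,000 B = 100 MB
        long perMapBytes = totalBytes / maps;      // 50,000,000 B = 50 MB per output file
        System.out.println("total: " + totalBytes + " B, per map file: " + perMapBytes + " B");
    }
}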
[root@gd86 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar teragen 10000000 terasort/1G-input
This produces 1 GB of data. With the 64 MB block size it is stored as 16 blocks (each of the two 500 MB output files splits into 8 blocks), so running terasort on it launches 16 map tasks.
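The 16 map tasks can be predicted from the input layout; a rough sketch, assuming the default 64 MB block size and one map task per input split:

// Rough estimate of the number of input splits / map tasks (illustrative only).
public class SplitEstimate {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;                          // default 64 MB HDFS block
        long fileSize = 500000000L;                                  // each teragen output file, ~500 MB
        long splitsPerFile = (fileSize + blockSize - 1) / blockSize; // ceil(500 MB / 64 MB) = 8
        System.out.println("map tasks for 2 files: " + 2 * splitsPerFile); // 16
    }
}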
[root@gd86 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar teragen 10000000 terasort/1G-input
Generating 10000000 using 2 maps with step of 5000000
11/04/01 17:02:46 INFO mapred.JobClient: Running job: job_201103311423_0005
11/04/01 17:02:47 INFO mapred.JobClient: map 0% reduce 0%
11/04/01 17:03:00 INFO mapred.JobClient: map 19% reduce 0%
11/04/01 17:03:01 INFO mapred.JobClient: map 41% reduce 0%
11/04/01 17:03:03 INFO mapred.JobClient: map 52% reduce 0%
11/04/01 17:03:04 INFO mapred.JobClient: map 63% reduce 0%
11/04/01 17:03:06 INFO mapred.JobClient: map 74% reduce 0%
11/04/01 17:03:10 INFO mapred.JobClient: map 91% reduce 0%
11/04/01 17:03:12 INFO mapred.JobClient: map 100% reduce 0%
11/04/01 17:03:14 INFO mapred.JobClient: Job complete: job_201103311423_0005
11/04/01 17:03:14 INFO mapred.JobClient: Counters: 6
11/04/01 17:03:14 INFO mapred.JobClient: Job Counters
11/04/01 17:03:14 INFO mapred.JobClient: Launched map tasks=2
11/04/01 17:03:14 INFO mapred.JobClient: FileSystemCounters
11/04/01 17:03:14 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1000000000
11/04/01 17:03:14 INFO mapred.JobClient: Map-Reduce Framework
11/04/01 17:03:14 INFO mapred.JobClient: Map input records=10000000
11/04/01 17:03:14 INFO mapred.JobClient: Spilled Records=0
11/04/01 17:03:14 INFO mapred.JobClient: Map input bytes=10000000
11/04/01 17:03:14 INFO mapred.JobClient: Map output records=10000000
(2) Run terasort
Running the terasort program launches 16 map tasks:
root@gd38 hadoop-0.20.1# bin/hadoop jar hadoop-0.20.1-examples.jar terasort terasort/1G-input terasort/1G-output
11/03/31 17:12:49 INFO terasort.TeraSort: starting
11/03/31 17:12:49 INFO mapred.FileInputFormat: Total input paths to process : 2
11/03/31 17:13:05 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/03/31 17:13:05 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/03/31 17:13:05 INFO compress.CodecPool: Got brand-new compressor
Making 1 from 100000 records
Step size is 100000.0
11/03/31 17:13:06 INFO mapred.JobClient: Running job: job_201103311423_0006
11/03/31 17:13:07 INFO mapred.JobClient: map 0% reduce 0%
11/03/31 17:13:20 INFO mapred.JobClient: map 12% reduce 0%
11/03/31 17:13:21 INFO mapred.JobClient: map 37% reduce 0%
11/03/31 17:13:29 INFO mapred.JobClient: map 50% reduce 2%
11/03/31 17:13:30 INFO mapred.JobClient: map 75% reduce 2%
11/03/31 17:13:32 INFO mapred.JobClient: map 75% reduce 12%
11/03/31 17:13:36 INFO mapred.JobClient: map 87% reduce 12%
11/03/31 17:13:38 INFO mapred.JobClient: map 100% reduce 12%
11/03/31 17:13:41 INFO mapred.JobClient: map 100% reduce 25%
11/03/31 17:13:44 INFO mapred.JobClient: map 100% reduce 31%
11/03/31 17:13:53 INFO mapred.JobClient: map 100% reduce 33%
11/03/31 17:14:02 INFO mapred.JobClient: map 100% reduce 68%
11/03/31 17:14:05 INFO mapred.JobClient: map 100% reduce 71%
11/03/31 17:14:08 INFO mapred.JobClient: map 100% reduce 75%
11/03/31 17:14:11 INFO mapred.JobClient: map 100% reduce 79%
11/03/31 17:14:14 INFO mapred.JobClient: map 100% reduce 82%
11/03/31 17:14:17 INFO mapred.JobClient: map 100% reduce 86%
11/03/31 17:14:20 INFO mapred.JobClient: map 100% reduce 90%
11/03/31 17:14:23 INFO mapred.JobClient: map 100% reduce 93%
11/03/31 17:14:26 INFO mapred.JobClient: map 100% reduce 97%
11/03/31 17:14:32 INFO mapred.JobClient: map 100% reduce 100%
11/03/31 17:14:34 INFO mapred.JobClient: Job complete: job_201103311423_0006
11/03/31 17:14:34 INFO mapred.JobClient: Counters: 18
11/03/31 17:14:34 INFO mapred.JobClient: Job Counters
11/03/31 17:14:34 INFO mapred.JobClient: Launched reduce tasks=1
11/03/31 17:14:34 INFO mapred.JobClient: Launched map tasks=16
11/03/31 17:14:34 INFO mapred.JobClient: Data-local map tasks=16
11/03/31 17:14:34 INFO mapred.JobClient: FileSystemCounters
11/03/31 17:14:34 INFO mapred.JobClient: FILE_BYTES_READ=2382257412
11/03/31 17:14:34 INFO mapred.JobClient: HDFS_BYTES_READ=1000057358
11/03/31 17:14:34 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3402255956
11/03/31 17:14:34 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1000000000
11/03/31 17:14:34 INFO mapred.JobClient: Map-Reduce Framework
11/03/31 17:14:34 INFO mapred.JobClient: Reduce input groups=10000000
11/03/31 17:14:34 INFO mapred.JobClient: Combine output records=0
11/03/31 17:14:34 INFO mapred.JobClient: Map input records=10000000
11/03/31 17:14:34 INFO mapred.JobClient: Reduce shuffle bytes=951549012
11/03/31 17:14:34 INFO mapred.JobClient: Reduce output records=10000000
11/03/31 17:14:34 INFO mapred.JobClient: Spilled Records=33355441
11/03/31 17:14:34 INFO mapred.JobClient: Map output bytes=1000000000
11/03/31 17:14:34 INFO mapred.JobClient: Map input bytes=1000000000
11/03/31 17:14:34 INFO mapred.JobClient: Combine input records=0
11/03/31 17:14:34 INFO mapred.JobClient: Map output records=10000000
11/03/31 17:14:34 INFO mapred.JobClient: Reduce input records=10000000
11/03/31 17:14:34 INFO terasort.TeraSort: done
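The lines "Making 1 from 100000 records" and "Step size is 100000.0" near the top of this log come from TeraSort's input sampler: before the job is submitted, it samples keys from the input and picks reduce-partition cut points at a fixed step so that each reducer receives a contiguous, roughly equal key range (with only 1 reducer here there is no cut point to pick). A minimal sketch of that idea, with hypothetical names rather than the actual code in TeraInputFormat:

import java.util.Arrays;
import java.util.List;

// Sketch of choosing partition cut points from sampled keys
// (hypothetical names, not the real TeraSort implementation).
public class CutPointSketch {
    static String[] pickCutPoints(List<String> sampledKeys, int numPartitions) {
        String[] samples = sampledKeys.toArray(new String[0]);
        Arrays.sort(samples);
        float stepSize = samples.length / (float) numPartitions;   // "Step size is ..."
        String[] cuts = new String[numPartitions - 1];             // "Making N from M records"
        for (int i = 1; i < numPartitions; i++) {
            cuts[i - 1] = samples[Math.round(stepSize * i)];
        }
        return cuts;   // reducer i handles the key range between consecutive cut points
    }
}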
The run completes; the sorted output is still 1 GB:
root@gd38 hadoop-0.20.1# bin/hadoop fs -ls terasort/1G-output
Found 2 items
drwxr-xr-x - root supergroup 0 2011-03-31 17:13 /user/root/terasort/1G-output/_logs
-rw-r--r-- 1 root supergroup 1000000000 2011-03-31 17:13 /user/root/terasort/1G-output/part-00000
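To check that the output really is globally sorted, the same examples jar also ships a teravalidate job; in hadoop-0.20.1 it should be invoked along these lines (adjust paths to your setup):
bin/hadoop jar hadoop-0.20.1-examples.jar teravalidate terasort/1G-output terasort/1G-validate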