Comparing equi-join performance: Hive vs. Oracle

Posted by YallonKing on 2012-03-31
/*
Test summary:
Measure equi-join performance on 100,000 rows in Oracle and on 65,535 rows in Hive, then compare the results.
In Hive, the test covers both row-oriented and column-oriented (RCFile) storage layouts.
Test environment:
Oracle: physical server + Oracle Enterprise Linux 6.0 + Oracle 11g R2
Hive: ordinary desktop PC + VMware 8.0 + Oracle Enterprise Linux 6.0 + Hadoop 0.20.2 + Hive 0.7.1
*/

-- Appendix: the detailed test procedure is as follows:
--------------------------Operations in Oracle--------------------------
-- Create test table 1
SQL> create table hive1 (id number,name varchar2(20));
Table created.
SQL> begin
  2  for i in 1..100000
  3  loop
  4  insert into hive1 values(i,'hive1');
  5  commit;
  6  end loop;
  7  end;
  8   /
PL/SQL procedure successfully completed.
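Incidentally, the loop above commits once per inserted row. As a sketch only (not part of the original test), the same load with a single commit after the loop avoids 100,000 separate log-sync waits and is typically much faster for a bulk insert of this size:

```sql
-- Sketch: identical data load, but commit once at the end instead of per row.
begin
  for i in 1..100000 loop
    insert into hive1 values (i, 'hive1');
  end loop;
  commit;  -- single commit after the loop finishes
end;
/
```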
-- Create test table 2
SQL> create table hive2 (id number,name varchar2(20));
Table created.
Elapsed: 00:00:00.05
SQL> begin
  2  for i in 1..100000
  3  loop
  4  insert into hive2 values(i,'hive2');
  5  commit;
  6  end loop;
  7  end;
  8  /
PL/SQL procedure successfully completed.
-- Test equi-join performance on the two tables
SQL> select a.*,b.name as name2 from hive1 a,hive2 b where a.id=b.id;
100000 rows selected.
Elapsed: 00:00:00.98
Execution Plan
----------------------------------------------------------
Plan hash value: 2251506592

------------------------------------------------------------------------------------
| Id  | Operation          | Name  | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |   102K|  5002K|       |   456   (1)| 00:00:06 |
|*  1 |  HASH JOIN         |       |   102K|  5002K|  3704K|   456   (1)| 00:00:06 |
|   2 |   TABLE ACCESS FULL| HIVE2 |   102K|  2501K|       |    45   (3)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| HIVE1 |   105K|  2587K|       |    45   (3)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("A"."ID"="B"."ID")
Note
-----
   - dynamic sampling used for this statement (level=2)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       7149  consistent gets
          0  physical reads
          0  redo size
    1836054  bytes sent via SQL*Net to client
      73742  bytes received via SQL*Net from client
       6668  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     100000  rows processed
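The Note section above shows the optimizer fell back on dynamic sampling because neither table had statistics (hence the approximate 102K/105K row estimates). A natural follow-up step, not performed in the original test, would be to gather statistics with the standard DBMS_STATS package so the cardinality estimates become exact:

```sql
-- Sketch: gather optimizer statistics on both test tables so subsequent
-- plans no longer need dynamic sampling.
exec dbms_stats.gather_table_stats(user, 'HIVE1');
exec dbms_stats.gather_table_stats(user, 'HIVE2');
```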
--------------------------Operations in Hive--------------------------
-- Create row-storage test table test_hive2 [test table test_hive3 is built the same way]
hive> create table test_hive2 (id int,id2 int,name string)
    > row format delimited
    > fields terminated by '\t';
OK
Time taken: 0.055 seconds
hive> load data local inpath '/root/tmp001.txt'
    > overwrite into table test_hive2;
Copying data from file:/root/tmp001.txt
Copying file: file:/root/tmp001.txt
Loading data to table default.test_hive2
Deleted file:/user/hive/warehouse/test_hive2
OK
Time taken: 0.342 seconds
-- Test full-table-scan performance
hive> select count(*) from test_hive2;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Execution log at: /tmp/root/root_20120331160606_9f242fd6-e321-4413-92a7-b4d92207f14e.log
Job running in-process (local Hadoop)
2012-03-31 16:06:10,931 null map = 100%,  reduce = 100%
Ended Job = job_local_0001
OK
65535
Time taken: 2.36 seconds
-- Test equi-join performance of the two row-storage tables
hive> select test_hive2.*,test_hive3.name from test_hive2 join test_hive3 on ( test_hive2.id=test_hive3.id);
Time taken: 27.749 seconds
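One tuning option worth noting, though not tried in the original test: for tables this small, Hive's MAPJOIN hint loads the smaller table into memory on the map side and skips the shuffle/reduce phase entirely, which usually cuts join time sharply. A hedged sketch against the same tables:

```sql
-- Sketch: map-side join; test_hive3 is small enough to fit in memory,
-- so the reduce stage of the join can be avoided.
select /*+ MAPJOIN(test_hive3) */ test_hive2.*, test_hive3.name
from test_hive2 join test_hive3
on (test_hive2.id = test_hive3.id);
```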
-- The column-storage two-table equi-join test follows
[root@oraking ~]# hive
Hive history file=/tmp/root/hive_job_log_root_201203311553_471153629.txt
-- Create column-storage test table test_hive
hive> create table test_hive (id int,id2 int,name string)
    > row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
    > STORED AS RCFILE;
OK
Time taken: 2.944 seconds
-- Insert data
hive> insert overwrite table test_hive
    > select id,id2,name               
    > from test_hive2;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Execution log at: /tmp/root/root_20120331164444_44ac1bdb-f603-45e2-b50a-4681e69d330f.log
Job running in-process (local Hadoop)
2012-03-31 16:44:31,709 null map = 100%,  reduce = 0%
Ended Job = job_local_0001
Ended Job = 171641075, job is filtered out (removed at runtime).
Moving data to: file:/tmp/hive-root/hive_2012-03-31_16-44-28_048_8584030323247636747/-ext-10000
Loading data to table default.test_hive
Deleted file:/user/hive/warehouse/test_hive
Table default.test_hive stats: [num_partitions: 0, num_files: 1, num_rows: 65535, total_size: 961048]
OK
Time taken: 3.946 seconds
-- Create column-storage test table test_hive4
hive> create table test_hive4 (id int,id2 int,name string)
    > row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
    > STORED AS RCFILE;
OK
Time taken: 0.019 seconds
-- Insert data
hive> insert overwrite table test_hive4
    > select id,id2,name
    > from test_hive2;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Execution log at: /tmp/root/root_20120331165555_13476cce-d845-4277-9505-38e6441e8e77.log
Job running in-process (local Hadoop)
2012-03-31 16:55:11,786 null map = 100%,  reduce = 0%
Ended Job = job_local_0001
Ended Job = -1613004717, job is filtered out (removed at runtime).
Moving data to: file:/tmp/hive-root/hive_2012-03-31_16-55-09_157_510206187744689585/-ext-10000
Loading data to table default.test_hive4
Deleted file:/user/hive/warehouse/test_hive4
Table default.test_hive4 stats: [num_partitions: 0, num_files: 1, num_rows: 65535, total_size: 961048]
OK
Time taken: 2.863 seconds
-- Test column-storage full-table-scan performance
hive> select count(*) from test_hive;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Execution log at: /tmp/root/root_20120331164848_c82b81f6-ee21-48fa-bbb8-c5a52b5c01bb.log
Job running in-process (local Hadoop)
2012-03-31 16:48:45,111 null map = 100%,  reduce = 100%
Ended Job = job_local_0001
OK
65535
Time taken: 2.357 seconds
hive> select count(*) from test_hive4;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Execution log at: /tmp/root/root_20120331165555_0e89d023-1969-4272-8ab5-a68c48db7073.log
Job running in-process (local Hadoop)
2012-03-31 16:55:45,199 null map = 100%,  reduce = 100%
Ended Job = job_local_0001
OK
65535
Time taken: 2.377 seconds
-- Test equi-join performance of the two column-storage tables
hive> select test_hive.*,test_hive4.name from test_hive join test_hive4 on (test_hive.id=test_hive4.id);   
Time taken: 28.306 seconds
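Column storage mainly pays off when a query reads only a few columns; `select test_hive.*` forces the RCFile reader to materialize every column, so the columnar layout gains little here. As a hedged illustration (not part of the original test), projecting only the join key and one name column would let the columnar reader skip the remaining columns:

```sql
-- Sketch: narrow projection so RCFile only has to decode two column groups.
select test_hive.id, test_hive4.name
from test_hive join test_hive4
on (test_hive.id = test_hive4.id);
```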
/*
Summary: overall, Oracle was clearly faster than Hive in this test, regardless of whether Hive used row or column storage.
Note: the Oracle test joined two tables of 100,000 rows each, while the Hive test joined two tables of 65,535 rows each.
Preliminary analysis:
1. Hive ran inside a virtual machine on an ordinary desktop PC while Oracle ran on a physical server, so the underlying hardware alone may account for much of the gap;
2. Hive's execution model adds overhead of its own, since every query is compiled into map and reduce stages before a result set is produced;
3. The author's limited experience with the newer technology may have left tuning opportunities unused;
4. Other uncertain factors.
*/
 

From the ITPUB blog. Link: http://blog.itpub.net/26143577/viewspace-720093/. Please credit the source when republishing; unauthorized reproduction may incur legal liability.