hive partition
partition是hive提供的一種機制:使用者透過指定一個或多個partition key,決定資料存放方式,進而最佳化資料的查詢
一個表可以指定多個partition key,每個partition在hive中以資料夾的形式存在。
例項(static partition):
編輯檔案:/home/work/data/test3.txt; /home/work/data/test4.txt;
$ cat /home/work/data/test3.txt
1,zxm
2,ljz
3,cds
4,mac
5,android
6,symbian
7,wp
$ cat /home/work/data/test4.txt
8,zxm
9,ljz
10,cds
11,mac
12,android
13,symbian
14,wp
建表:
hive> create table student_tmp(id INT, name STRING)
> partitioned by(academy STRING, class STRING)
> row format delimited fields terminated by ',';
OK
Time taken: 6.505 seconds
id,name是真實列,partition列academy和class是偽列
load資料:(此處直接load資料進partition,在hive 0.6之前的版本,必須先建立好partition,資料才能匯入)
hive> load data local inpath '/home/work/data/test3.txt' into table student_tmp partition(academy='computer', class='034');
Copying data from file:/home/work/data/test3.txt
Copying file: file:/home/work/data/test3.txt
Loading data to table default.student_tmp partition (academy=computer, class=034)
OK
Time taken: 0.898 seconds
hive> load data local inpath '/home/work/data/test3.txt' into table student_tmp partition(academy='physics', class='034');
Copying data from file:/home/work/data/test3.txt
Copying file: file:/home/work/data/test3.txt
Loading data to table default.student_tmp partition (academy=physics, class=034)
OK
Time taken: 0.256 seconds
檢視hive檔案結構:
$ hadoop fs -ls /user/hive/warehouse/student_tmp/
Found 2 items
drwxr-xr-x - work supergroup 0 2012-07-30 18:47 /user/hive/warehouse/student_tmp/academy=computer
drwxr-xr-x - work supergroup 0 2012-07-30 19:00 /user/hive/warehouse/student_tmp/academy=physics
$ hadoop fs -ls /user/hive/warehouse/student_tmp/academy=computer
Found 1 items
drwxr-xr-x - work supergroup 0 2012-07-30 18:47 /user/hive/warehouse/student_tmp/academy=computer/class=034
查詢資料:
hive> select * from student_tmp where academy='physics';
OK
1 zxm physics 034
2 ljz physics 034
3 cds physics 034
4 mac physics 034
5 android physics 034
6 symbian physics 034
7 wp physics 034
Time taken: 0.139 seconds
以上是static partition的示例,static partition即由使用者指定資料所在的partition,在load資料時,指定partition(academy='computer', class='034');
static partition常適用於使用處理時間作為partition key的例子。
但是,我們也常常會遇到需要向分割槽表中插入大量資料,並且插入前不清楚資料歸宿的partition,此時,我們需要dynamic partition。
使用動態分割槽需要設定hive.exec.dynamic.partition引數值為true。
可以設定部分列為dynamic partition列,例如:partition(academy='computer', class);
也可以設定所有列為dynamic partition列,例如partition(academy, class);
設定所有列為dynamic partition列時,需要設定hive.exec.dynamic.partition.mode=nonstrict
需要注意的是,主分割槽為dynamic partition列,而副分割槽為static partition列是不允許的,例如partition(academy, class=‘034’);是不允許的
示例(dynamic partition):
建表
hive> create table student(id INT, name STRING)
> partitioned by(academy STRING, class STRING)
> row format delimited fields terminated by ',';
OK
Time taken: 0.393 seconds
設定引數
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.exec.dynamic.partition=true;
匯入資料:
hive> insert overwrite table student partition(academy, class)
> select id,name,academy,class from student_tmp
> where class='034';
Total MapReduce jobs = 2
.........
OK
Time taken: 29.616 seconds
查詢資料:
hive> select * from student where academy='physics';
OK
1 zxm physics 034
2 ljz physics 034
3 cds physics 034
4 mac physics 034
5 android physics 034
6 symbian physics 034
7 wp physics 034
Time taken: 0.165 seconds
檢視檔案:
$ hadoop fs -ls /user/hive/warehouse/student/
Found 2 items
drwxr-xr-x - work supergroup 0 2012-07-30 19:22 /user/hive/warehouse/student/academy=computer
drwxr-xr-x - work supergroup 0 2012-07-30 19:22 /user/hive/warehouse/student/academy=physics
總結:
hive partition是透過將資料拆分成不同的partition放入不同的檔案,從而減少查詢操作時資料處理規模的手段。
例如,Hive Select查詢中,如果沒有建partition,則會掃描整個表內容,這樣計算量巨大。如果我們在相應維度做了partition,則處理資料規模可能會大大減少。
|
附partition相關引數:
hive.exec.dynamic.partition(預設false): 設定為true允許使用dynamic partition
hive.exec.dynamic.partition.mode(預設strick):設定dynamic partition模式(nostrict允許所有partition列都為dynamic partition,strict不允許)
hive.exec.max.dynamic.partitions.pernode (預設100):每一個mapreduce job允許建立的分割槽的最大數量,如果超過了這個數量就會報錯
hive.exec.max.dynamic.partitions (預設1000):一個dml語句允許建立的所有分割槽的最大數量
hive.exec.max.created.files (預設100000):所有的mapreduce job允許建立的檔案的最大數量
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/26613085/viewspace-1131406/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- Partition Pruning和Partition-Wise Joins
- Clique Partition
- oracle partition by group by,詳解partition by和group by對比Oracle
- 分割槽Partition
- 7.74 DATAOBJ_TO_PARTITIONOBJ
- 86. Partition List
- oracle partition by 語法Oracle
- B. Range and Partition
- Peace or partition? Cyprus - Espresso EconomistEspresso
- 7.73 DATAOBJ_TO_MAT_PARTITIONOBJ
- 3-Partition 問題
- Codeforces 1948E Clique Partition
- 分割槽partition知識點
- Partition|Disk Utility 如何分割磁碟
- ROWNUMBER() OVER( PARTITION BY COL1
- Hive --------- hive 的優化Hive優化
- [Hive]Hive排序優化Hive排序優化
- 【Hive】hive資料遷移Hive
- Spark學習——分割槽Partition數Spark
- [LeetCode] 416. Partition Equal Subset SumLeetCode
- HIVEHive
- 【Hive一】Hive安裝及配置Hive
- Flume和Hive整合之hive sinkHive
- Oracle Partition 分割槽詳細總結Oracle
- PostgreSQL DBA(94) - PG 12 Improving Partition(Select)SQL
- PostgreSQL DBA(93) - PG 12 Improving Partition(Insert)SQL
- 分割槽函式Partition By的基本用法函式
- Kafka分割槽分配策略(Partition Assignment Strategy)Kafka
- Hive學習之Hive的安裝Hive
- Hive -------- hive常見查詢練習Hive
- HIVE基本語法以及HIVE分割槽Hive
- [Hive]Hive實現抽樣查詢Hive
- flink實戰--讀寫Hive(Flink on Hive)Hive
- hive匯出到csv hive匯出到excelHiveExcel
- spark with hiveSparkHive
- [hive]hive資料模型中四種表Hive模型
- [Hive]hive分割槽設定注意事項Hive
- How to Add a New Disk new partition in centos7CentOS
- PostgreSQL DBA(95) - PG 12 Partition(out of shared memory)SQL