hive分割槽分桶

哇塞兒發表於2021-02-26

原文網址 : https://www.cnblogs.com/wasaier/p/14451069.html

Hive

1、分割槽

如果一個表中資料很多，我們查詢時就很慢，耗費大量時間，如果要查詢其中部分資料該怎麼辦呢，這時我們引入分割槽的概念。

Hive 中的分割槽表分為兩種：靜態分割槽和動態分割槽。

1.1、靜態分割槽

可以根據 PARTITIONED BY 建立分割槽表。

一個表可以擁有一個或者多個分割槽，每個分割槽以資料夾的形式單獨存在表資料夾的目錄下。

分割槽列是虛擬列，它們不是資料本身的一部分，而是在載入時派生的。

1.1.1、一個分割槽

-- 建表
create table order_table_s
(
    order_id int,         -- 訂單id
    product_name string,  -- 產品名稱
    price int             -- 產品價格
)
partitioned by (deal_day string)  -- 交易日期YYYYMM
row format delimited
fields terminated by "\t";

-- 檢視錶結構
hive> desc order_table_s;
OK
order_id                int                                         
product_name            string                                      
price                   int                                         
deal_day                string                                      
                 
# Partition Information          
# col_name              data_type               comment             
deal_day                string  


-- 源資料
-- order-201901.txt
1	cellphone	2000
2	tv	3000
3	sofa	8000
4	cabinet	5000
5	bicycle	1000
6	truck	20000

-- order-201902.txt
1	apple	10
2	banana	8
3	milk	70
4	liquor	150

-- 匯入資料
load data local inpath '/root/data/order-201901.txt' overwrite into table order_table_s partition(deal_day='201901');
load data local inpath '/root/data/order-201902.txt' overwrite into table order_table_s partition(deal_day='201902');

-- 檢視錶
hive> select * from order_table_s;
1       cellphone       2000    201901
2       tv      3000    201901
3       sofa    8000    201901
4       cabinet 5000    201901
5       bicycle 1000    201901
6       truck   20000   201901
1       apple   10      201902
2       banana  8       201902
3       milk    70      201902
4       liquor  150     201902

-- 檢視201902的資料
hive> select * from order_table_s where deal_day='201902';
1       apple   10      201902
2       banana  8       201902
3       milk    70      201902
4       liquor  150     201902

-- 檢視hdfs目錄
[root@zgg data]# hadoop fs -ls /user/hive/warehouse/order_table_s
Found 2 items
drwxr-xr-x   - root supergroup          0 2021-01-08 16:05 /user/hive/warehouse/order_table_s/deal_day=201901
drwxr-xr-x   - root supergroup          0 2021-01-08 16:27 /user/hive/warehouse/order_table_s/deal_day=201902
[root@zgg data]# hadoop fs -ls /user/hive/warehouse/order_table_s/deal_day=201901
Found 1 items
-rw-r--r--   1 root supergroup         56 2021-01-08 18:23 /user/hive/warehouse/order_table_s/deal_day=201901/000000_0

[root@zgg data]# hadoop fs -ls /user/hive/warehouse/order_table_s/deal_day=201901
Found 1 items
-rw-r--r--   1 root supergroup         83 2021-01-08 16:05 /user/hive/warehouse/order_table_s/deal_day=201901/order-201901.txt

-- 檢視mysql中的後設資料
mysql> select * from PARTITION_KEY_VALS;
+---------+--------------+-------------+
| PART_ID | PART_KEY_VAL | INTEGER_IDX |
+---------+--------------+-------------+
|      16 | 201901       |           0 |
|      21 | 201902       |           0 |
+---------+--------------+-------------+


-- 刪除分割槽201902
-- ALTER TABLE table_name DROP partition_spec, partition_spec,...
hive> alter table order_table_s drop partition(deal_day='201902');


-- 刪除後，檢視錶
hive> select * from order_table_s;
1       cellphone       2000    201901
2       tv      3000    201901
3       sofa    8000    201901
4       cabinet 5000    201901
5       bicycle 1000    201901
6       truck   20000   201901

-- 刪除後，檢視hdfs目錄
[root@zgg data]# hadoop fs -ls /user/hive/warehouse/order_table_s
Found 1 items
drwxr-xr-x   - root supergroup          0 2021-01-08 16:05 /user/hive/warehouse/order_table_s/deal_day=201901

-- 刪除後，檢視mysql中的後設資料
mysql> select * from PARTITION_KEY_VALS;
+---------+--------------+-------------+
| PART_ID | PART_KEY_VAL | INTEGER_IDX |
+---------+--------------+-------------+
|      16 | 201901       |           0 |
+---------+--------------+-------------+


-- 刪除後，資料和後設資料都被刪除了
-- ---------------------------------------------------


-- 重新新增分割槽201902
-- 這裡的location是這個分割槽的資料存放的位置，預設是`/user/hive/warehouse/order_table_s`
alter table order_table_s add partition(deal_day='201902') location '/in/order';

-- 把order-201902.txt移動到'/in/order'目錄下
-- 所以匯入資料，也可以直接複製到對應的目錄下。
[root@zgg data]# hadoop fs -mv /in/order-201902.txt /in/order

-- 檢視201902的資料
hive> select * from order_table_s where deal_day='201902';
1       apple   10      201902
2       banana  8       201902
3       milk    70      201902
4       liquor  150     201902

-- hive中查詢分割槽
hive> show partitions order_table_s;
deal_day=201901
deal_day=201902

-- 如果直接將資料複製到分割槽目錄下，select沒有資料的話，可以執行`msck repair table order_table_s` 重新同步hdfs上的分割槽資訊。

1.1.2、多個分割槽

-- 建表:兩個分割槽
create table order_table_d
(
    order_id int,         -- 訂單id
    product_name string,  -- 產品名稱
    price int             -- 產品價格
)
partitioned by (deal_day string,category string)  -- 交易日期YYYYMM，產品類別
row format delimited
fields terminated by "\t";

-- 檢視錶結構
hive> desc order_table_d;
OK
order_id                int                                         
product_name            string                                      
price                   int                                         
deal_day                string                                      
category                string                                      
                 
# Partition Information          
# col_name              data_type               comment             
deal_day                string                                      
category                string           

-- 源資料
-- order-201901-electronicproducts.txt
1	cellphone	2000
2	tv	3000

-- order-201901-car.txt
1	bicycle	1000
2	truck	20000

-- order-201902-fruit.txt
1	apple	10
2	banana	8

-- order-201902-drinks.txt
1	milk	70
2	liquor	150

-- 匯入資料
load data local inpath '/root/data/order-201901-electronicproducts.txt' overwrite into table order_table_d partition(deal_day='201901',category='electronicproducts');

load data local inpath '/root/data/order-201901-car.txt' overwrite into table order_table_d partition(deal_day='201901',category='car');

load data local inpath '/root/data/order-201902-fruit.txt' overwrite into table order_table_d partition(deal_day='201902',category='fruit');

load data local inpath '/root/data/order-201902-drinks.txt' overwrite into table order_table_d partition(deal_day='201902',category='drinks');

-- 檢視
hive> select * from order_table_d;
OK
1       bicycle 1000    201901  car
2       truck   20000   201901  car
1       cellphone       2000    201901  electronicproducts
2       tv      3000    201901  electronicproducts
1       milk    70      201902  drinks
2       liquor  150     201902  drinks
1       apple   10      201902  fruit
2       banana  8       201902  fruit

hive> show partitions order_table_d;
OK
deal_day=201901/category=car
deal_day=201901/category=electronicproducts
deal_day=201902/category=drinks
deal_day=201902/category=fruit

-- 檢視hdfs目錄
[root@zgg data]# hadoop fs -ls /user/hive/warehouse/order_table_d
Found 2 items
drwxr-xr-x   - root supergroup          0 2021-01-08 17:00 /user/hive/warehouse/order_table_d/deal_day=201901
drwxr-xr-x   - root supergroup          0 2021-01-08 17:00 /user/hive/warehouse/order_table_d/deal_day=201902

[root@zgg data]# hadoop fs -ls /user/hive/warehouse/order_table_d/deal_day=201901
Found 2 items
drwxr-xr-x   - root supergroup          0 2021-01-08 17:00 /user/hive/warehouse/order_table_d/deal_day=201901/category=car
drwxr-xr-x   - root supergroup          0 2021-01-08 17:00 /user/hive/warehouse/order_table_d/deal_day=201901/category=electronicproducts

-- 檢視mysql中的後設資料
mysql> select * from PARTITION_KEY_VALS;
+---------+--------------------+-------------+
| PART_ID | PART_KEY_VAL       | INTEGER_IDX |
+---------+--------------------+-------------+
|      23 | 201901             |           0 |
|      23 | electronicproducts |           1 |
|      24 | 201901             |           0 |
|      24 | car                |           1 |
|      25 | 201902             |           0 |
|      25 | fruit              |           1 |
|      26 | 201902             |           0 |
|      26 | drinks             |           1 |
+---------+--------------------+-------------+

-- 刪除分割槽deal_day=201902/category=fruit
hive> alter table order_table_d drop partition(deal_day='201902',category='fruit');

-- 刪除後，檢視錶分割槽
hive> show partitions order_table_d;
OK
deal_day=201901/category=car
deal_day=201901/category=electronicproducts
deal_day=201902/category=drinks

-- 刪除後，檢視hdfs目錄
[root@zgg data]# hadoop fs -ls /user/hive/warehouse/order_table_d/deal_day=201902
Found 1 items
drwxr-xr-x   - root supergroup          0 2021-01-08 17:00 /user/hive/warehouse/order_table_d/deal_day=201902/category=drinks

-- 刪除後，檢視mysql中的後設資料
mysql> select * from PARTITION_KEY_VALS;
+---------+--------------------+-------------+
| PART_ID | PART_KEY_VAL       | INTEGER_IDX |
+---------+--------------------+-------------+
|      23 | 201901             |           0 |
|      23 | electronicproducts |           1 |
|      24 | 201901             |           0 |
|      24 | car                |           1 |
|      26 | 201902             |           0 |
|      26 | drinks             |           1 |
+---------+--------------------+-------------+

-- 重新新增分割槽deal_day=201902/category=fruit
hive> alter table order_table_d add partition(deal_day='201902',category='fruit');

-- 重新匯入
hive> load data local inpath '/root/data/order-201902-drinks.txt' overwrite into table order_table_d partition(deal_day='201902',category='drinks');

-- 檢視
hive> select * from order_table_d;
1       bicycle 1000    201901  car
2       truck   20000   201901  car
1       cellphone       2000    201901  electronicproducts
2       tv      3000    201901  electronicproducts
1       milk    70      201902  drinks
2       liquor  150     201902  drinks

hive> show partitions order_table_d;
deal_day=201901/category=car
deal_day=201901/category=electronicproducts
deal_day=201902/category=drinks
deal_day=201902/category=fruit

-- 匯入資料到分割槽表，還可以使用 insert
-- 將order_table_d的201901分割槽中的資料插入到order_table_s的201901分割槽中

-- 檢視order_table_s資料
hive> select * from order_table_s;
1       apple   10      201902
2       banana  8       201902
3       milk    70      201902
4       liquor  150     201902

hive> insert into table order_table_s partition(deal_day='201901') select order_id,product_name,price from order_table_d where deal_day='201901';

-- 檢視結果
hive> select * from order_table_s where deal_day='201901';
1       bicycle 1000    201901
2       truck   20000   201901
1       cellphone       2000    201901
2       tv      3000    201901

1.2、動態分割槽

上面展示瞭如何使用 insert 插入到分割槽表，如果再插入分割槽 '201902' 資料，需要再寫一條 insert 語句。

而動態分割槽可以直接使用一條 insert 語句完成。

下面利用動態分割槽進行演示。

演示前先進行設定：

hive 中預設是靜態分割槽，想要使用動態分割槽，需要設定如下引數，可以使用臨時設定，也可以寫在配置檔案（hive-site.xml）裡，永久生效。臨時配置如下:

//開啟動態分割槽 預設為false，不開啟
set hive.exec.dynamic.partition=true;

//指定動態分割槽模式，預設為strict，即必須指定至少一個分割槽為靜態分割槽，
//nonstrict模式表示允許所有的分割槽欄位都可以使用動態分割槽  
set hive.exec.dynamic.partition.mode=nonstrict;

-- 建立動態分割槽表
create table order_table_dy
(
    order_id int,         -- 訂單id
    product_name string,  -- 產品名稱
    price int             -- 產品價格
)
partitioned by (deal_day string)  -- 交易日期YYYYMM，產品類別
row format delimited
fields terminated by "\t";

hive> select * from order_table_s;
1       bicycle 1000    201901
2       truck   20000   201901
1       cellphone       2000    201901
2       tv      3000    201901
1       apple   10      201902
2       banana  8       201902
3       milk    70      201902
4       liquor  150     201902

-- 將order_table_s表裡的兩個分割槽的資料插入到order_table_dy表
hive> insert into table order_table_dy(deal_day) select order_id,product_name,price,deal_day from order_table_s;

-- 檢視結果
hive> select * from order_table_dy where deal_day=201901;
OK
1       bicycle 1000    201901
2       truck   20000   201901
1       cellphone       2000    201901
2       tv      3000    201901

hive> select * from order_table_dy where deal_day=201902;
OK
1       apple   10      201902
2       banana  8       201902
3       milk    70      201902
4       liquor  150     201902

hive> show partitions order_table_dy;
OK
deal_day=201901
deal_day=201902

靜態分割槽(SP)列:在涉及多個分割槽列的 DML/DDL 中，這些列的值在編譯時已知(由使用者給出)。

動態分割槽(DP)列:在執行時才知道其值的列。

DP 列的指定方式與 SP 列的指定方式相同：在 partition 子句中。唯一的區別是 DP 列沒有值，而 SP 列有。在 partition 子句中，我們需要指定所有分割槽列，即使它們都是 DP 列。

在 INSERT ... SELECT ... 查詢時，動態分割槽列必須在 SELECT 語句中的最後一個列中指定，且順序與它們在 PARTITION() 子句中出現的順序相同。

-- 正確
INSERT OVERWRITE TABLE T PARTITION (ds='2010-03-03', hr)
SELECT key, value, /*ds,*/ hr FROM srcpart WHERE ds is not null and hr>10;

-- 錯誤
INSERT OVERWRITE TABLE T PARTITION (ds, hr = 11)
SELECT key, value, ds/*, hr*/ FROM srcpart WHERE ds is not null and hr=11;

2、分桶

通過計算表的某些列的雜湊值，分割槽中的資料再被劃分到桶中。

例如，page_views 表根據 userid 分桶，userid 是 page_view 表的列之一，而不是分割槽列。

將表或分割槽組織成桶有以下幾個目的：

(1)抽樣更高效，因為在處理大規模的資料集時，在開發、測試階段將所有的資料全部處理一遍可能不太現實，這時抽樣就必不可少。

抽樣：
    SELECT * FROM table_name TABLESAMPLE(n PERCENT)； 
    抽樣出n%的資料，會全表掃描。

有了桶之後呢？

    如果在 TABLESAMPLE 子句中指定的列與 CLUSTERED BY 子句中的列相匹配，則 TABLESAMPLE 只掃描表中要求的雜湊分割槽【就是具體的桶】

    SELECT * FROM film TABLESAMPLE(BUCKET x OUTOF y)
       
        如果表是用 `CLUSTERED BY id INTO 32 BUCKETS` 建立的。

        TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)：
        將挑選出第 3 和第 19 個聚類，因為每個桶將由 (32/16)=2 個聚類組成。
        【每個桶有2個聚類，一共64個聚類，取出第 3 和第 19 個聚類】

        TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)：
        將挑選出第 3 個聚類的一半，因為每個桶將由 (32/64)=1/2 個聚類組成

(2)更好的查詢處理效率。

大表在JOIN的時候，效率低下。如果對兩個表先分別按id分桶，那麼相同id都會歸入一個桶。
那麼此時再進行JOIN的時候是按照桶來JOIN的，那麼大大減少了JOIN的數量。

在建立桶之前，需要設定 set hive.enforce.bucketing=true;，使得 hive 能識別桶。【僅版本 0.x 和 1.x，對於2.x版本不再需要】

hive> select * from order_table_s;
OK
1       bicycle 1000    201901
2       truck   20000   201901
1       cellphone       2000    201901
2       tv      3000    201901
1       apple   10      201902
2       banana  8       201902
3       milk    70      201902
4       liquor  150     201902

-- 建立分桶
create table order_table_buckets
(
    order_id int,         -- 訂單id
    product_name string,  -- 產品名稱
    price int             -- 產品價格
)
partitioned by (deal_day string)  -- 交易日期YYYYMM，產品類別
clustered by (order_id) into 4 buckets
row format delimited
fields terminated by "\t";

-- 匯入資料：
hive> insert into table order_table_buckets partition(deal_day) select order_id,product_name,price,deal_day from order_table_s where order_id=1;

-- 檢視錶
hive> select * from order_table_buckets;
OK
1       cellphone       2000    201901
1       bicycle 1000    201901
1       apple   10      201902

-- 檢視hdfs中的目錄
[root@zgg ~]# hadoop fs -ls /user/hive/warehouse/order_table_buckets
Found 2 items
drwxr-xr-x   - root supergroup          0 2021-01-08 18:49 /user/hive/warehouse/order_table_buckets/deal_day=201901
drwxr-xr-x   - root supergroup          0 2021-01-08 18:49 /user/hive/warehouse/order_table_buckets/deal_day=201902

[root@zgg ~]# hadoop fs -ls /user/hive/warehouse/order_table_buckets/deal_day=201901
Found 4 items
-rw-r--r--   1 root supergroup          0 2021-01-08 18:49 /user/hive/warehouse/order_table_buckets/deal_day=201901/000000_0
-rw-r--r--   1 root supergroup         32 2021-01-08 18:49 /user/hive/warehouse/order_table_buckets/deal_day=201901/000001_0
-rw-r--r--   1 root supergroup          0 2021-01-08 18:49 /user/hive/warehouse/order_table_buckets/deal_day=201901/000002_0
-rw-r--r--   1 root supergroup          0 2021-01-08 18:49 /user/hive/warehouse/order_table_buckets/deal_day=201901/000003_0

參考：

https://blog.51cto.com/10814168/2135046

hive分割槽和分桶你熟悉嗎？
2024-03-10
Hive
hive 分割槽表和分桶表區別
2020-09-26
Hive
hive從入門到放棄(四)——分割槽與分桶
2022-04-02
Hive
Apache Spark：分割槽和分桶 - Nivedita
2022-05-30
ApacheSpark
好程式設計師大資料學習路線分享hive分割槽和分桶
2019-08-20
程式設計師大資料Hive
Hive動態分割槽
2018-03-13
Hive
HIVE基本語法以及HIVE分割槽
2018-09-20
Hive
Hive的靜態分割槽與動態分割槽
2018-05-03
Hive
Spark操作Hive分割槽表
2018-12-07
SparkHive
Hive的分割槽和排序
2024-11-13
Hive排序
Hive和Spark分割槽策略
2021-06-27
HiveSpark
[Hive]hive分割槽設定注意事項
2018-08-16
Hive
Hive——分桶的理解
2020-10-21
Hive
hive分桶表排序
2020-09-26
Hive排序
Hive中靜態分割槽和動態分割槽總結
2021-03-31
Hive
Hive動態分割槽詳解
2020-12-23
Hive
一文教你玩轉 Apache Doris 分割槽分桶新功能
2023-02-10
Apache
Hive分桶之BUCKET詳解
2019-03-24
Hive
hive Sql的動態分割槽問題
2024-04-01
HiveSQL
hive 動態分割槽插入資料表
2020-12-18
Hive
一起學Hive——使用MSCK命令修復Hive分割槽
2021-09-09
Hive
【趙渝強老師】Hive的分割槽表
2024-10-28
Hive
hive迷案之消失的分割槽檔案
2021-09-09
Hive
hive學習筆記之四：分割槽表
2021-07-02
Hive筆記
Oracle分割槽表基礎運維-01分割槽表分類
2020-05-14
Oracle運維
hive學習筆記之五：分桶
2021-07-03
Hive筆記
MySql分表、分庫、分片和分割槽
2021-06-17
MySql
Oracle分割槽表基礎運維-06分割槽表索引
2020-05-18
Oracle運維索引
好程式設計師大資料培訓分享Hive的靜態分割槽與動態分割槽
2020-06-05
程式設計師大資料Hive
好程式設計師大資料開發之掌握Hive的靜態分割槽與動態分割槽
2019-03-29
程式設計師大資料Hive
Linux分割槽方案、分割槽建議
2024-11-04
Linux
Seven 儲存結構與磁碟劃分主分割槽交換分割槽的作用！
2020-10-17
oracle分割槽表和分割槽表exchange
2024-03-15
Oracle
PostgreSQL/LightDB 分割槽表之分割槽裁剪
2022-07-14
SQL
win10硬碟分割槽怎麼分_win10系統如何對硬碟分割槽
2019-12-04
Win10硬碟
Linux 分割槽擴容（根分割槽擴容，SWAP 分割槽擴容，掛載新分割槽為目錄）
2021-08-27
Linux
win10分割槽好了怎麼刪除_win10分割槽完後如何刪除
2020-06-05
Win10
hive建立分割槽表報錯AccessControlException Permission denied: user=NONE, access=WRITE, inode
2020-09-23
HiveExceptionNone