hive後設資料和mysql表的對應

王詩齡發表於2014-07-04
Hive後設資料解析

本文是筆者初步整理的Hive後設資料表,有不準確的地方請輕拍,後續我會補充.
1. Hive 0.11 後設資料表彙總
線上Hive 0.11 metastore包括下述39個表,主要分為以下幾類 :

Database相關
Table相關
資料儲存相關SDS
COLUMN相關
SERDE相關(序列化)
Partition相關(分割槽)
SKEW相關(資料傾斜)
BUCKET相關(分桶)
PRIVS相關(許可權管理)
mysql> show tables;
+---------------------------+
| Tables_in_hive_yz_test |
+---------------------------+
| BUCKETING_COLS |
| CDS |
| COLUMNS_V2 |
| DATABASE_PARAMS |
| DBS |
| DB_PRIVS |
| GLOBAL_PRIVS |
| IDXS |
| INDEX_PARAMS |
| NUCLEUS_TABLES |
| PARTITIONS |
| PARTITION_EVENTS |
| PARTITION_KEYS |
| PARTITION_KEY_VALS |
| PARTITION_PARAMS |
| PART_COL_PRIVS |
| PART_COL_STATS |
| PART_PRIVS |
| ROLES |
| ROLE_MAP |
| SDS |
| SD_PARAMS |
| SEQUENCE_TABLE |
| SERDES |
| SERDE_PARAMS |
| SKEWED_COL_NAMES |
| SKEWED_COL_VALUE_LOC_MAP |
| SKEWED_STRING_LIST |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES |
| SORT_COLS |
| TABLE_PARAMS |
| TAB_COL_STATS |
| TBLS |
| TBL_COL_PRIVS |
| TBL_PRIVS |
| TYPES |
| TYPE_FIELDS |
| VERSION |
+---------------------------+
39 rows in set (0.00 sec)
2.各個表的含義
2.1 Database表:DBS

描述 該表儲存Hive Database的後設資料資訊,DB_ID是資料庫ID,NAME是庫名,DB_LOCATION_URI是資料庫在HDFS中的位置,DESC為資料庫的描述資訊。
mysql> desc DBS;
+-----------------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+---------------+------+-----+---------+-------+
| DB_ID | bigint(20) | NO | PRI | NULL | |
| DESC | varchar(4000) | YES | | NULL | |
| DB_LOCATION_URI | varchar(4000) | NO | | NULL | |
| NAME | varchar(128) | YES | UNI | NULL | |
+-----------------+---------------+------+-----+---------+-------+
例子:acorn_3g資料庫的資訊如下,表示該庫ID為81,路徑為hdfs://YZSJHL19-42.opi.com/warehouse.test/acorn_3g.db

mysql> select * from DBS where NAME='acorn_3g';
+-------+------+-------------------------------------------------------+----------+
| DB_ID | DESC | DB_LOCATION_URI | NAME |
+-------+------+-------------------------------------------------------+----------+
| 81 | NULL | hdfs://YZSJHL19-42.opi.com/warehouse.test/acorn_3g.db | acorn_3g |
+-------+------+-------------------------------------------------------+----------+
2. 2 Table 表

描述:

TBLS 儲存Hive Table的後設資料資訊,每個表有唯一的TBL_ID
SD_ID外來鍵指向所屬的Database,SD_IID關聯SDS表的主鍵。 其中SDS儲存列(CD_ID)等資訊。TBLS.SD_ID關聯SDS.SD_ID, SDS.SD_ID關聯CDS.CD_ID,
CDS.CD_ID關聯COLUMNS_V2.CD_ID
例子:*acorn_3g.user_act表的資訊: TBL_ID為41231,TBL_TYPE為MANAGED_TABLE普通表(若值為EXTERNAL,表示外部表),DB_ID為81,表示隸屬DB_ID=81的Database。
mysql> select * from TBLS where TBL_NAME='user_act' and DB_ID=81 \G
*************************** 1. row ***************************
TBL_ID: 41231
CREATE_TIME: 1366188055
DB_ID: 81
LAST_ACCESS_TIME: 0
OWNER: xianbing.liu
RETENTION: 0
SD_ID: 263311
TBL_NAME: user_act
TBL_TYPE: MANAGED_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
2.3 SDS表(資料儲存表)

描述:

SDS表儲存了Hive資料倉儲所有的HDFS資料檔案資訊,每個SD_ID唯一標記一個資料儲存記錄
CD_ID關聯COLUMN_V2.CD_ID,指定該資料的欄位資訊
SERDE_ID關聯SERDES.SERDE_ID,指定該資料的序列化資訊(如是否是序列化表,DELIMITED欄位等)
例子:

根據SDS表找到acorn_3g.user_act表的CD_ID是263311, SERDE_ID是263301,預設儲存位置為
hdfs://YZSJHL19-42.opi.com/warehouse.test/acorn_3g.db/user_act

mysql> select * from SDS where SD_ID=263311 \G
*************************** 1. row ***************************
SD_ID: 263311
CD_ID: 263311
INPUT_FORMAT: org.apache.hadoop.mapred.TextInputFormat
IS_COMPRESSED:
LOCATION: hdfs://YZSJHL19-42.opi.com/warehouse.test/acorn_3g.db/user_act
NUM_BUCKETS: -1
OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
SERDE_ID: 263301
IS_STOREDASSUBDIRECTORIES:
2.4 CDS 和 COLUMN_V2 (列資訊)

CDS表
描述:

該表只有一個欄位CD_ID,永遠儲存整個Hive資料倉儲中的CD_ID.

例子:

可以看到acorn_3g.user_act表對應的CD_ID記錄在CDS中
mysql> desc CDS;
+-------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| CD_ID | bigint(20) | NO | PRI | NULL | |
+-------+------------+------+-----+---------+-------+
1 row in set (0.00 sec)

mysql> select * FROM CDS where CD_ID=263311;
+--------+
| CD_ID |
+--------+
| 263311 |
+--------+
1 row in set (0.00 sec)
COLUMN_V2表
描述:

該表儲存了一個CD_ID對應的所有欄位資訊

例子:

檢視acorn_3g.user_act表的COLUMN資訊;我們可以看到acorn_3g.user_act表有14個欄位,COLUMN_NAME為欄位名,TYPE_NAME為欄位型別,INTEGER_IDX為欄位序號
mysql> select * from COLUMNS_V2 where CD_ID=263311 order by integer_idx;
+--------+---------+---------------+-----------+-------------+
| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
+--------+---------+---------------+-----------+-------------+
| 263311 | NULL | id | bigint | 0 |
| 263311 | NULL | action_id | int | 1 |
| 263311 | NULL | user_id | bigint | 2 |
| 263311 | NULL | request | string | 3 |
| 263311 | NULL | visit_time | string | 4 |
| 263311 | NULL | source_id | int | 5 |
| 263311 | NULL | sess_id | string | 6 |
| 263311 | NULL | mobile_number | string | 7 |
| 263311 | NULL | from_id | string | 8 |
| 263311 | NULL | app_id | string | 9 |
| 263311 | NULL | version | string | 10 |
| 263311 | NULL | reg_type | int | 11 |
| 263311 | NULL | uniqid | string | 12 |
| 263311 | NULL | failure | int | 13 |
+--------+---------+---------------+-----------+-------------+
2.5 SERDES和SERDE_PARAMS (序列化)
描述:

SERDES儲存了所有的序列化資訊(SERDE_ID,SLIB),SLIB表示序列化所採用的Java類

SERDES_PARAMS 儲存序列化具體的引數及值

例子: acorn_3g.user_act表對應的SERDE_ID=263301表示 採用hive預設序列化類org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe , DELIMITED字元為'\t' (即建立表時指定的 ...DELIMITED BY '\t' ...)

mysql> select * FROM SERDES where SERDE_ID=263301;
+----------+------+----------------------------------------------------+
| SERDE_ID | NAME | SLIB |
+----------+------+----------------------------------------------------+
| 263301 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
+----------+------+----------------------------------------------------+

mysql> select SERDE_ID,PARAM_KEY,REPLACE(PARAM_VALUE,'\t','\\t') from SERDE_PARAMS where SERDE_ID=263301;
+----------+----------------------+---------------------------------+
| SERDE_ID | PARAM_KEY | REPLACE(PARAM_VALUE,'\t','\\t') |
+----------+----------------------+---------------------------------+
| 263301 | field.delim | \t |
| 263301 | serialization.format | \t |
+----------+----------------------+---------------------------------+
2.6 PARTITIONS PARTITION_KEY 和 PARTITION_KEY_VALS (分割槽)

PARTITION_KEYS
描述:

PARTITION_KEYS 儲存了所有分割槽表用於分割槽的欄位

例子:

通過PARTITION_KEYS檢視acorn_3g.user_act的分割槽資訊,可看到該表是一個分割槽表,分割槽欄位為log_date,其中INTEGER_IDX為分割槽欄位的序號,和分割槽欄位一一對應
mysql> desc PARTITION_KEYS;
+--------------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+---------------+------+-----+---------+-------+
| TBL_ID | bigint(20) | NO | PRI | NULL | |
| PKEY_COMMENT | varchar(4000) | YES | | NULL | |
| PKEY_NAME | varchar(128) | NO | PRI | NULL | |
| PKEY_TYPE | varchar(767) | NO | | NULL | |
| INTEGER_IDX | int(11) | NO | | NULL | |
+--------------+---------------+------+-----+---------+-------+

mysql> select * FROM PARTITION_KEYS WHERE TBL_ID=41231;
+--------+--------------+-----------+-----------+-------------+
| TBL_ID | PKEY_COMMENT | PKEY_NAME | PKEY_TYPE | INTEGER_IDX |
+--------+--------------+-----------+-----------+-------------+
| 41231 | NULL | log_date | string | 0 |
+--------+--------------+-----------+-----------+-------------+
PARTITIONS
描述:

PARTITIONS儲存了Hive資料倉儲總所有的分割槽資訊,每個分割槽由PART_ID標識,其中TBL_ID為隸屬的Table,SD_ID為隸屬的SDS(見2.3)

例子:

通過PARTITIONS表檢視acorn_3g.user_act表的分割槽資訊,如PART_ID為168301,名字為log_date=2013-03-01,SD_ID為231621
mysql> desc PARTITIONS;
+------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| PART_ID | bigint(20) | NO | PRI | NULL | |
| CREATE_TIME | int(11) | NO | | NULL | |
| LAST_ACCESS_TIME | int(11) | NO | | NULL | |
| PART_NAME | varchar(767) | YES | MUL | NULL | |
| SD_ID | bigint(20) | YES | MUL | NULL | |
| TBL_ID | bigint(20) | YES | MUL | NULL | |
+------------------+--------------+------+-----+---------+-------+

mysql> select * FROM PARTITIONS WHERE TBL_ID=41231 order by PART_NAME limit 5;
+---------+-------------+------------------+---------------------+--------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME | SD_ID | TBL_ID |
+---------+-------------+------------------+---------------------+--------+--------+
| 168301 | 1366259946 | 0 | log_date=2013-03-01 | 231621 | 41231 |
| 168321 | 1366260063 | 0 | log_date=2013-03-02 | 231641 | 41231 |
| 168331 | 1366260176 | 0 | log_date=2013-03-03 | 231651 | 41231 |
| 168346 | 1366260298 | 0 | log_date=2013-03-04 | 231666 | 41231 |
| 168361 | 1366260398 | 0 | log_date=2013-03-05 | 231681 | 41231 |
+---------+-------------+------------------+---------------------+--------+--------+
PARTITION_KEY_VALS
描述:

PARTITION_KEY_VALS 儲存了PARTITION_KEY中描述的分割槽欄位的值,通常配合PARTITIONS 和PARTITION_KEYS表使用。

例子:

檢視分割槽PART_ID=168301各分割槽欄位的值,其中PARTITION_KEY_VALS儲存了所有分割槽欄位序號(INTEGER_IDX)和值(PART_KEY_VAL)間的對應關係。本例中acorn_3g.user_act PART_ID=168301分割槽log_date欄位的值為‘2013-03-01’
mysql> select pk.PKEY_NAME,pk.PKEY_TYPE,pk.INTEGER_IDX,pkv.PART_KEY_VAL from PARTITION_KEYS pk,PARTITION_KEY_VALS pkv
where pk.INTEGER_IDX=pkv.INTEGER_IDX and pk.TBL_ID=41231 and pkv.PART_ID=168301;
+-----------+-----------+-------------+--------------+
| PKEY_NAME | PKEY_TYPE | INTEGER_IDX | PART_KEY_VAL |
+-----------+-----------+-------------+--------------+
| log_date | string | 0 | 2013-03-01 |
+-----------+-----------+-------------+--------------+
2.7 BUCKET 相關表

描述

BUCKETING_COLS表描述了所有采用了分桶技術的SDS,目前公司未用BUCKET,//TODO
mysql> desc BUCKETING_COLS;
+-----------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+--------------+------+-----+---------+-------+
| SD_ID | bigint(20) | NO | PRI | NULL | |
| BUCKET_COL_NAME | varchar(256) | YES | | NULL | |
| INTEGER_IDX | int(11) | NO | PRI | NULL | |
+-----------------+--------------+------+-----+---------+-------+
2.8 PRIVS 許可權管理相關表

TBL_PRIVS DB_PRIVS PART_PRIVS 等,目前Hive在許可權管理方面遠不及關聯式資料庫,公司也未對許可權進行統一管理。

2.9 SKEW 資料傾斜相關表

相比0.8版本,0.11後設資料增加了資料傾斜相關的表 SKEWED_COL_NAMES SKEWED_COL_VALUE_LOC_MAP SKEWED_STRING_LIST SKEWED_STRING_LIST_VALUES SKEWED_VALUES ,這些高階特性還在測試階段,目前公司沒有用到。

2.3 其他

如VERSION 描述版本資訊,這類表開發者不用太關心。

相關文章