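Splitting on a time column is still the most common use of partitioning (or table sharding). This section looks at how to do that well, covering partitioning by year, by month, and by day, along with the details to watch out for in each case.
First: partitioning by year
The right granularity for a date column depends on how the business queries the data. For example, if you keep 10 years of data and every query filters on one specific year, partitioning by year is the natural choice. Take the following SQL: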
select * from ytt_pt1 where log_date >='2018-01-01' and log_date < '2019-01-01';
Here is a concrete example of partitioning by year alone: table ytt_pt1 holds 10 million rows and is partitioned at yearly granularity.
mysql> create table ytt_pt1(id bigint, log_date date);
Query OK, 0 rows affected (0.18 sec)
mysql> insert into ytt_pt1 select id,log_date from ytt_p1 limit 10000000;
Query OK, 10000000 rows affected (3 min 49.53 sec)
Records: 10000000 Duplicates: 0 Warnings: 0
mysql> ALTER TABLE ytt_pt1 PARTITION BY RANGE (year(log_date))
-> (
-> PARTITION p0001 VALUES LESS THAN (2012),
-> PARTITION p0002 VALUES LESS THAN (2013),
-> PARTITION p0003 VALUES LESS THAN (2014),
-> PARTITION p0004 VALUES LESS THAN (2015),
-> PARTITION p0005 VALUES LESS THAN (2016),
-> PARTITION p0006 VALUES LESS THAN (2017),
-> PARTITION p0007 VALUES LESS THAN (2018),
-> PARTITION p0008 VALUES LESS THAN (2019),
-> PARTITION p0009 VALUES LESS THAN (2020),
-> PARTITION p0010 VALUES LESS THAN (2021),
-> PARTITION p_max VALUES LESS THAN (maxvalue)
-> );
Query OK, 10000000 rows affected (2 min 33.31 sec)
Records: 10000000 Duplicates: 0 Warnings: 0
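Now the effect on a year-level query: the SQL below goes straight to partition p0008 and takes 0.91 seconds. That is not especially fast; adding further filter conditions later can bring it down.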
mysql> select count(*) from ytt_pt1 where log_date >='2018-01-01' and log_date < '2019-01-01';
+----------+
| count(*) |
+----------+
| 1000204 |
+----------+
1 row in set (0.91 sec)
mysql> explain select count(*) from ytt_pt1 where log_date >='2018-01-01' and log_date < '2019-01-01'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: ytt_pt1
partitions: p0008
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 998002
filtered: 11.11
Extra: Using where
1 row in set, 1 warning (0.00 sec)
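To make the EXPLAIN result above easier to reason about, here is a small Python sketch of which partitions a half-open date range can touch under this yearly layout. It is a simplified model, not MySQL's actual pruning code; the bounds are copied from the ALTER TABLE statement, and the function names are made up for illustration.

```python
from bisect import bisect_right
from datetime import date, timedelta

# Exclusive upper bounds copied from the ALTER TABLE above; p_max is the catch-all.
BOUNDS = [2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]
NAMES = [f"p{n:04d}" for n in range(1, 11)] + ["p_max"]


def partition_for_year(year: int) -> str:
    """Return the first partition whose VALUES LESS THAN bound exceeds `year`."""
    return NAMES[bisect_right(BOUNDS, year)]


def partitions_for_range(start: date, end_exclusive: date) -> list[str]:
    """Partitions that a filter log_date >= start AND log_date < end_exclusive can touch."""
    last_day = end_exclusive - timedelta(days=1)
    first = bisect_right(BOUNDS, start.year)
    last = bisect_right(BOUNDS, last_day.year)
    return NAMES[first:last + 1]


# The query from the text: all of 2018 lands in a single partition.
print(partitions_for_range(date(2018, 1, 1), date(2019, 1, 1)))
```

Under this model a range that crosses a year boundary returns one partition per year it touches, which is why year-aligned filters are the cheapest ones for this layout.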
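Note that the query has to filter on the column directly. If the filter applies an expression to the column, MySQL cannot tell which partition to use and scans all of them; the remedies are the same as for a query against an ordinary, non-partitioned table. For example: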
select count(*) from ytt_pt1 where year(log_date) = '2018' ;
Here is how it runs: MySQL scans every partition and the query takes more than 9 seconds.
mysql> select count(*) from ytt_pt1 where year(log_date) = '2018' ;
+----------+
| count(*) |
+----------+
| 1000204 |
+----------+
1 row in set (9.19 sec)
mysql> explain select count(*) from ytt_pt1 where year(log_date) = '2018' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: ytt_pt1
partitions: p0001,p0002,p0003,p0004,p0005,p0006,p0007,p0008,p0009,p0010,p_max
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 9982648
filtered: 100.00
Extra: Using where
1 row in set, 1 warning (0.00 sec)
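If you insist on writing the query this way, you can either give the optimizer a hint by naming the partition to read, or create a virtual column based on the column expression. Naming the partition looks like this: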
mysql> select count(*) from ytt_pt1 partition(p0008) where year(log_date) = '2018' ;
+----------+
| count(*) |
+----------+
| 1000204 |
+----------+
1 row in set (0.84 sec)
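Another way out, when the application builds the SQL, is to turn the year into the half-open date range used earlier, so the filter is on the bare column again. The helper below is a hypothetical sketch (the function name is not from the original article), and it interpolates the column name directly, so it is only suitable for trusted, hard-coded identifiers.

```python
from datetime import date


def year_filter_as_range(column: str, year: int) -> str:
    """Build a half-open range predicate equivalent to YEAR(column) = year."""
    start = date(year, 1, 1)
    end = date(year + 1, 1, 1)
    return f"{column} >= '{start.isoformat()}' AND {column} < '{end.isoformat()}'"


# Rebuilds the partition-friendly form of the year(log_date) = '2018' query.
sql = "select count(*) from ytt_pt1 where " + year_filter_as_range("log_date", 2018)
print(sql)
```

The generated statement matches the range query shown at the start of this section, which EXPLAIN reported as reading only p0008.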
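If queries filter by month most of the time, partitioning by month is the better fit. For example, to pull certain records from a given month of 2020 for downstream processing, the SQL looks roughly like this:
select * from ytt_pt1_partitioned_by_month where log_date in ('2020-01-01','2020-01-02',...)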
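Second: partitioning by month
There are two ways to partition by month.
Option 1: split directly by month into 12 partitions. Table ytt_pt1_month1 below uses LIST partitioning, computed directly with the month function.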
mysql> show create table ytt_pt1_month1\G
*************************** 1. row ***************************
Table: ytt_pt1_month1
Create Table: CREATE TABLE `ytt_pt1_month1` (
`id` bigint DEFAULT NULL,
`log_date` date DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
/*!50100 PARTITION BY LIST (month(`log_date`))
(PARTITION p0001 VALUES IN (1) ENGINE = InnoDB,
PARTITION p0002 VALUES IN (2) ENGINE = InnoDB,
PARTITION p0003 VALUES IN (3) ENGINE = InnoDB,
PARTITION p0004 VALUES IN (4) ENGINE = InnoDB,
PARTITION p0005 VALUES IN (5) ENGINE = InnoDB,
PARTITION p0006 VALUES IN (6) ENGINE = InnoDB,
PARTITION p0007 VALUES IN (7) ENGINE = InnoDB,
PARTITION p0008 VALUES IN (8) ENGINE = InnoDB,
PARTITION p0009 VALUES IN (9) ENGINE = InnoDB,
PARTITION p0010 VALUES IN (10) ENGINE = InnoDB,
PARTITION p0011 VALUES IN (11) ENGINE = InnoDB,
PARTITION p0012 VALUES IN (12) ENGINE = InnoDB) */
1 row in set (0.00 sec)
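Take a query for the first half of January 2020. It is confined to partition p0001, but the time is not great at 0.66 seconds, because p0001 holds the January rows of every year.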
mysql> select count(*) from ytt_pt1_month1 where log_date in ('2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05','2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10','2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15');
+----------+
| count(*) |
+----------+
| 41540 |
+----------+
1 row in set (0.66 sec)
mysql> explain select count(*) from ytt_pt1_month1 where log_date in ('2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05','2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10','2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15')\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: ytt_pt1_month1
partitions: p0001
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 848224
filtered: 50.00
Extra: Using where
1 row in set, 1 warning (0.00 sec)
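Option 2: give each year its own 12 partitions, that is, partition on the combined year-and-month dimension, so each partition holds the data for one specific month of one specific year. For 2010 through 2020 that comes to 132 monthly partitions (11 years × 12 months), plus a catch-all p_max partition.
Adding these partitions is a little more work. Clone table ytt_pt1_month1 as ytt_pt1_month2, then use a stored procedure to add the partition definitions: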
DELIMITER $$
USE `ytt`$$
DROP PROCEDURE IF EXISTS `sp_add_partition_ytt_pt1_month2`$$
CREATE DEFINER=`root`@`%` PROCEDURE `sp_add_partition_ytt_pt1_month2`()
BEGIN
    DECLARE i,j INT UNSIGNED DEFAULT 1;
    DECLARE v_tmp_date DATE;
    SET @stmt = '';
    SET @stmt_begin = 'ALTER TABLE ytt_pt1_month2 PARTITION BY RANGE COLUMNS (log_date)(';
    -- i walks the years, j walks the months within each year
    SET i = 2010;
    WHILE i <= 2020 DO
        SET j = 1;
        WHILE j <= 12 DO
            SET v_tmp_date = CONCAT(i,'-01-01');
            -- the exclusive upper bound of month j is the first day of month j+1
            SET @stmt = CONCAT(@stmt,'PARTITION p',i,'_',LPAD(j,2,'0'),' VALUES LESS THAN (''',DATE_ADD(v_tmp_date,INTERVAL j MONTH),'''),');
            SET j = j + 1;
        END WHILE;
        SET i = i + 1;
    END WHILE;
    -- catch-all partition for anything on or after 2021-01-01
    SET @stmt_end = 'PARTITION p_max VALUES LESS THAN (maxvalue))';
    SET @stmt = CONCAT(@stmt_begin,@stmt,@stmt_end);
    PREPARE s1 FROM @stmt;
    EXECUTE s1;
    DROP PREPARE s1;
    -- clear the session variables used to build the statement
    SET @stmt = NULL;
    SET @stmt_begin = NULL;
    SET @stmt_end = NULL;
END$$
DELIMITER ;
mysql> call sp_add_partition_ytt_pt1_month2;
Query OK, 0 rows affected (2 min 20.48 sec)
The result looks like this:
PARTITION p2010_01 VALUES LESS THAN ('2010-02-01') ENGINE = InnoDB,
...
PARTITION p2010_12 VALUES LESS THAN ('2011-01-01') ENGINE = InnoDB,
PARTITION p2011_01 VALUES LESS THAN ('2011-02-01') ENGINE = InnoDB,
...
PARTITION p2011_12 VALUES LESS THAN ('2012-01-01') ENGINE = InnoDB,
...
PARTITION p2020_12 VALUES LESS THAN ('2021-01-01') ENGINE = InnoDB,
PARTITION p_max VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB)
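If you would rather generate the DDL from a script than from a stored procedure, the same statement can be assembled outside the database. The following Python sketch mirrors the procedure's logic under the same assumptions (table ytt_pt1_month2, years 2010 through 2020); the function names are illustrative, and the script only builds the statement text without executing it.

```python
from datetime import date


def month_partitions(year_start: int, year_end: int) -> list[str]:
    """One 'PARTITION pYYYY_MM VALUES LESS THAN (...)' clause per calendar month."""
    clauses = []
    for year in range(year_start, year_end + 1):
        for month in range(1, 13):
            # Exclusive upper bound: the first day of the following month.
            upper = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
            clauses.append(
                f"PARTITION p{year}_{month:02d} VALUES LESS THAN ('{upper.isoformat()}')"
            )
    return clauses


def build_alter(table: str, year_start: int, year_end: int) -> str:
    """Assemble the full ALTER TABLE statement, ending with the p_max catch-all."""
    clauses = month_partitions(year_start, year_end)
    clauses.append("PARTITION p_max VALUES LESS THAN (maxvalue)")
    return (
        f"ALTER TABLE {table} PARTITION BY RANGE COLUMNS (log_date)(\n  "
        + ",\n  ".join(clauses)
        + "\n)"
    )


ddl = build_alter("ytt_pt1_month2", 2010, 2020)
print(len(month_partitions(2010, 2020)), "monthly partitions plus p_max")
```

Reviewing the generated text before running it against the server is a cheap way to catch an off-by-one in the bounds.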
With the partitions in place, look again at the query for the first half of January 2020:
mysql> select count(*) from ytt_pt1_month2 where log_date in ('2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05','2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10','2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15');
+----------+
| count(*) |
+----------+
| 41540 |
+----------+
1 row in set (0.06 sec)
mysql> explain select count(*) from ytt_pt1_month2 where log_date in ('2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05','2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10','2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15')\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: ytt_pt1_month2
partitions: p2020_01
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 85498
filtered: 50.00
Extra: Using where
1 row in set, 1 warning (0.00 sec)
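The query takes 0.06 seconds, roughly ten times faster than the 0.66 seconds of the first layout.
Does that mean partitioning by year and month always beats partitioning by month alone? Not necessarily. If the filter targets the same fixed month across years, for example December of every year, partitioning by month alone is the better choice, because all of that data sits in one partition rather than one partition per year.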
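The trade-off can be put in numbers with a short Python sketch that lists which partitions hold the rows for a given set of months under each layout. It only models where the data lives; whether MySQL actually prunes down to those partitions still depends on how the filter is written, as discussed for the yearly layout. The function names are hypothetical.

```python
def list_by_month_partitions(months: set[int]) -> list[str]:
    """ytt_pt1_month1 layout: PARTITION BY LIST (month(log_date)), p0001..p0012."""
    return [f"p{m:04d}" for m in sorted(months)]


def range_by_year_month_partitions(years: range, months: set[int]) -> list[str]:
    """ytt_pt1_month2 layout: one pYYYY_MM partition per calendar month."""
    return [f"p{y}_{m:02d}" for y in years for m in sorted(months)]


years = range(2010, 2021)  # 2010..2020, the years covered by the stored procedure

# "December of every year": a single LIST partition versus one RANGE partition per year.
print(list_by_month_partitions({12}))
print(len(range_by_year_month_partitions(years, {12})))

# "January 2020 only": the LIST partition still mixes every year's January rows.
print(list_by_month_partitions({1}))
print(range_by_year_month_partitions(range(2020, 2021), {1}))
```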
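Third: partitioning by day
For queries that mostly filter by day, partitioning by day works best. For example, to fetch only the data for January 1, 2020, the SQL looks roughly like this: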
select * from ytt_pt1 where log_date = '2020-01-01'
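As with the combined year-and-month layout, write a script or a stored procedure to add the partitions. The one thing to keep in mind is that MySQL limits the number of partitions per table to 8192. Daily partitions for 10 years of data come to about 3,650 (a few more once leap days are counted), which is still within the limit.
Here is the earlier stored procedure, adapted for daily partitions: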
DELIMITER $$
USE `ytt`$$
DROP PROCEDURE IF EXISTS `sp_add_partition_ytt_pt1_day`$$
CREATE DEFINER=`root`@`%` PROCEDURE `sp_add_partition_ytt_pt1_day`(
    IN f_year_start YEAR,
    IN f_year_end YEAR
)
BEGIN
    DECLARE v_days INT UNSIGNED DEFAULT 365;
    DECLARE v_year DATE DEFAULT '2010-01-01';
    DECLARE v_partition_name VARCHAR(64) DEFAULT '';
    DECLARE v_log_date DATE;
    DECLARE i,j INT UNSIGNED DEFAULT 1;
    SET @stmt = '';
    SET @stmt_begin = 'ALTER TABLE ytt_pt1_day PARTITION BY RANGE COLUMNS (log_date)(';
    -- i walks the years, j walks the days within each year
    SET i = f_year_start;
    WHILE i <= f_year_end DO
        SET v_year = CONCAT(i,'-01-01');
        -- number of days in year i: 365, or 366 in a leap year
        SET v_days = DATEDIFF(DATE_ADD(v_year,INTERVAL 1 YEAR),v_year);
        SET j = 1;
        WHILE j <= v_days DO
            -- the exclusive upper bound of day j is the following day
            SET v_log_date = DATE_ADD(v_year,INTERVAL j DAY);
            SET v_partition_name = CONCAT('p',i,'_',LPAD(j,3,'0'));
            SET @stmt = CONCAT(@stmt,'PARTITION ',v_partition_name,' VALUES LESS THAN (''',v_log_date,'''),');
            SET j = j + 1;
        END WHILE;
        SET i = i + 1;
    END WHILE;
    SET @stmt_end = 'PARTITION p_max VALUES LESS THAN (maxvalue))';
    SET @stmt = CONCAT(@stmt_begin,@stmt,@stmt_end);
    PREPARE s1 FROM @stmt;
    EXECUTE s1;
    DROP PREPARE s1;
    -- clear the session variables used to build the statement
    SELECT NULL,NULL,NULL INTO @stmt,@stmt_begin,@stmt_end;
END$$
DELIMITER ;
mysql> CALL sp_add_partition_ytt_pt1_day('2010','2020');
Query OK, 1 row affected (14 min 13.69 sec)
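Before running a statement this large, it can be worth checking the partition count against the 8192 limit. The Python sketch below reproduces the procedure's day-by-day bounds for the same call (2010 through 2020) and counts the clauses; the function name and the MAX_PARTITIONS constant are illustrative, and nothing here is executed against MySQL.

```python
from datetime import date, timedelta

MAX_PARTITIONS = 8192  # per-table limit cited in the text


def day_partitions(year_start: int, year_end: int) -> list[str]:
    """One 'PARTITION pYYYY_DDD VALUES LESS THAN (...)' clause per calendar day."""
    clauses = []
    for year in range(year_start, year_end + 1):
        first = date(year, 1, 1)
        days_in_year = (date(year + 1, 1, 1) - first).days  # 365 or 366
        for j in range(1, days_in_year + 1):
            upper = first + timedelta(days=j)  # exclusive bound: the next day
            clauses.append(
                f"PARTITION p{year}_{j:03d} VALUES LESS THAN ('{upper.isoformat()}')"
            )
    return clauses


clauses = day_partitions(2010, 2020)
total = len(clauses) + 1  # the daily partitions plus p_max
print(total, total <= MAX_PARTITIONS)
```

For 2010 through 2020 the count is 11 years of days including the three leap days, so the table stays well under the limit, but the same check shows how quickly a longer retention window would approach it.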
With daily partitions, a query for a single day is the fastest of all the layouts in this article, at just 0.01 seconds.
mysql> select count(*) from ytt_pt1_day where log_date = '2020-01-01';
+----------+
| count(*) |
+----------+
| 2675 |
+----------+
1 row in set (0.01 sec)
The same query against the year-based or month-based layouts is clearly not optimal:
mysql> select count(*) from ytt_pt1 where log_date = '2020-01-01';
+----------+
| count(*) |
+----------+
| 2675 |
+----------+
1 row in set (0.68 sec)
mysql> select count(*) from ytt_pt1_month1 where log_date = '2020-01-01';
+----------+
| count(*) |
+----------+
| 2675 |
+----------+
1 row in set (0.87 sec)
mysql> select count(*) from ytt_pt1_month2 where log_date = '2020-01-01';
+----------+
| count(*) |
+----------+
| 2675 |
+----------+
1 row in set (0.09 sec)
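As you can see, this kind of query takes noticeably longer under the other partitioning schemes (0.68, 0.87, and 0.09 seconds) than under daily partitioning (0.01 seconds).
To summarize:
This article covered the data-splitting approach used most often in day-to-day work: splitting by time. A time column is usually partitioned at one of three granularities (year, month, or day), and we walked through how to implement each one and when it fits.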