資料庫表中資料行去重複
0.起因
我在建立唯一索引時,提示有重複資料,建立失敗。於是,我得先去重,然後再建立唯一索引。
1.建表
sql
CREATE TABLE `demo_table` ( `id` int(10) unsigned NOT NULL, `name` char(255) NOT NULL, `email` char(255) NOT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
2.插入資料,並嘗試建立唯一索引
sql
INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('1', 'u1', 'u1@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('2', 'u2', 'u2@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('3', 'u3', 'u3@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('4', 'u4', 'u4@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('5', 'u5', 'u5@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('6', 'u6', 'u6@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('7', 'u7', 'u7@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('8', 'u8', 'u8@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('9', 'u9', 'u9@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('10', 'u10', 'u10@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('11', 'u11', 'u11@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('12', 'u12', 'u12@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('13', 'u13', 'u13@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('14', 'u14', 'u14@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('15', 'u15', 'u15@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('16', 'u16', 'u16@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('17', 'u17', 'u17@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('18', 'u18', 'u18@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('19', 'u19', 'u19@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('20', 'u20', 'u20@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('21', 'u21', 'u1@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('22', 'u22', 'u2@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('23', 'u23', 'u3@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('24', 'u24', 'u4@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('25', 'u25', 'u5@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('26', 'u26', 'u1@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('27', 'u27', 'u2@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('28', 'u28', 'u3@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('29', 'u29', 'u1@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('30', 'u30', 'u2@email.com');
建立唯一索引
sql
CREATE UNIQUE INDEX idx_email ON demo_table(email);
執行結果:ERROR 1062 (23000): Duplicate entry 'u2@email.com' for key 'idx_email'
提示有重複資料,導致建立唯一索引失敗。
3.看看我們插入的資料
sql
select * from demo_table;
執行結果:
sql
+----+------+---------------+ | id | name | email | +----+------+---------------+ | 1 | u1 | u1@email.com | | 2 | u2 | u2@email.com | | 3 | u3 | u3@email.com | | 4 | u4 | u4@email.com | | 5 | u5 | u5@email.com | | 6 | u6 | u6@email.com | | 7 | u7 | u7@email.com | | 8 | u8 | u8@email.com | | 9 | u9 | u9@email.com | | 10 | u10 | u10@email.com | | 11 | u11 | u11@email.com | | 12 | u12 | u12@email.com | | 13 | u13 | u13@email.com | | 14 | u14 | u14@email.com | | 15 | u15 | u15@email.com | | 16 | u16 | u16@email.com | | 17 | u17 | u17@email.com | | 18 | u18 | u18@email.com | | 19 | u19 | u19@email.com | | 20 | u20 | u20@email.com | | 21 | u21 | u1@email.com | | 22 | u22 | u2@email.com | | 23 | u23 | u3@email.com | | 24 | u24 | u4@email.com | | 25 | u25 | u5@email.com | | 26 | u26 | u1@email.com | | 27 | u27 | u2@email.com | | 28 | u28 | u3@email.com | | 29 | u29 | u1@email.com | | 30 | u30 | u2@email.com | +----+------+---------------+ 30 rows in set (0.00 sec)
可以看到,我們插入的email欄位的確有重複資料。當然,我們是故意的。
4.使用group by看看email欄位完全不重複的資料。為了讓結果看起來比較順眼,我們使用order by對結果排一下序,這個不是必選項啦。
sql
select * from demo_table group by email order by id;
執行結果:
sql
+----+------+---------------+ | id | name | email | +----+------+---------------+ | 1 | u1 | u1@email.com | | 2 | u2 | u2@email.com | | 3 | u3 | u3@email.com | | 4 | u4 | u4@email.com | | 5 | u5 | u5@email.com | | 6 | u6 | u6@email.com | | 7 | u7 | u7@email.com | | 8 | u8 | u8@email.com | | 9 | u9 | u9@email.com | | 10 | u10 | u10@email.com | | 11 | u11 | u11@email.com | | 12 | u12 | u12@email.com | | 13 | u13 | u13@email.com | | 14 | u14 | u14@email.com | | 15 | u15 | u15@email.com | | 16 | u16 | u16@email.com | | 17 | u17 | u17@email.com | | 18 | u18 | u18@email.com | | 19 | u19 | u19@email.com | | 20 | u20 | u20@email.com | +----+------+---------------+ 20 rows in set (0.00 sec)
可以發現,表中原本有30條記錄,group by得出的結果是隻有20條記錄,這就說明有10條重複記錄。
不相信?好吧,我們再使用having看看吧。group by後,分組記錄條數是多少就表示該欄位存在多少條。大於1就表示出現了重複。
sql
select *,count(id) from demo_table group by email having count(id)>1 order by id;
執行結果:
sql
+----+------+--------------+-----------+ | id | name | email | count(id) | +----+------+--------------+-----------+ | 1 | u1 | u1@email.com | 4 | | 2 | u2 | u2@email.com | 4 | | 3 | u3 | u3@email.com | 3 | | 4 | u4 | u4@email.com | 2 | | 5 | u5 | u5@email.com | 2 | +----+------+--------------+-----------+ 5 rows in set (0.00 sec)
這下子,我們不僅僅知道重複記錄的條數是10條,而且還知道是哪些email出現了重複,並且出現了多次數。
還不相信?好吧,那你自己數吧
5.刪除重複資料
我們只保留id最小的資料,其他重複的都給刪掉。
sql
delete from demo_table where id not in (select min(id) from demo_table group by email);
執行結果:ERROR 1093 (HY000): You can't specify target table 'demo_table' for update in FROM clause
貌似是MySQL有保護機制,不可以這樣操作。
題外話:
如果不是純使用SQL方式處理,而是通過程式設計處理,可以把select min(id) from demo_table group by email查詢出來的id結果集儲存在變數中,然後再執行delete from demo_table where id not id結果集。有一點需要注意的是,可能id結果集很大很大,儲存在變數中會吃記憶體。
6.把需要保留的資料id存在臨時表中,然後刪除重複資料。
sql
create table tmp_table as select min(id) from demo_table group by email;
執行結果:Query OK, 20 rows affected (0.43 sec)
Records: 20 Duplicates: 0 Warnings: 0
看看我們建立的臨時表:
sql
select * from tmp_table;
執行結果:
sql
+---------+ | min(id) | +---------+ | 10 | | 11 | | 12 | | 13 | | 14 | | 15 | | 16 | | 17 | | 18 | | 19 | | 1 | | 20 | | 2 | | 3 | | 4 | | 5 | | 6 | | 7 | | 8 | | 9 | +---------+ 20 rows in set (0.00 sec)
這就是我們要保留的記錄的id列表。
刪除重複資料:
sql
delete from demo_table where id not in (select * from tmp_table);
執行結果:Query OK, 10 rows affected (0.05 sec)
讓我們再看一眼demo_table表吧:
sql
select * from demo_table;
執行結果:
sql
+----+------+---------------+ | id | name | email | +----+------+---------------+ | 1 | u1 | u1@email.com | | 2 | u2 | u2@email.com | | 3 | u3 | u3@email.com | | 4 | u4 | u4@email.com | | 5 | u5 | u5@email.com | | 6 | u6 | u6@email.com | | 7 | u7 | u7@email.com | | 8 | u8 | u8@email.com | | 9 | u9 | u9@email.com | | 10 | u10 | u10@email.com | | 11 | u11 | u11@email.com | | 12 | u12 | u12@email.com | | 13 | u13 | u13@email.com | | 14 | u14 | u14@email.com | | 15 | u15 | u15@email.com | | 16 | u16 | u16@email.com | | 17 | u17 | u17@email.com | | 18 | u18 | u18@email.com | | 19 | u19 | u19@email.com | | 20 | u20 | u20@email.com | +----+------+---------------+ 20 rows in set (0.00 sec)
好的,重複的資料已經被我們成功幹掉了。大功告成。等等,我們還得把那個已經沒有用處的臨時表給刪除。貌似有點卸磨殺驢的感覺啊。
sql
drop table tmp_table;
執行結果:Query OK, 0 rows affected (0.12 sec)
7.總結一下吧
其實就三條SQL語句:
sql
create table tmp_table as select min(id) from demo_table group by email; delete from demo_table where id not in (select * from tmp_table); drop table tmp_table;
可能有童鞋要問,如果要去重的不是一個欄位,而是多個怎麼辦?比如,某2條記錄的email相同並且name也相同,就算為重複。
處理方式:
sql
group by email,name
再說說資料列去重複吧
1.故意插入一些列重複的資料
sql
INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('31', '31@email.com', '31@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('32', '32@email.com', '32@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('33', '33@email.com', '33@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('34', '34@email.com', '34@email.com'); INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('35', '35@email.com', '35@email.com');
查詢表中name欄位和email欄位相同的資料:
sql
select * from demo_table as a, demo_table as b where a.id=b.id and a.name=b.email;
執行結果:
sql
+----+--------------+--------------+----+--------------+--------------+ | id | name | email | id | name | email | +----+--------------+--------------+----+--------------+--------------+ | 31 | 31@email.com | 31@email.com | 31 | 31@email.com | 31@email.com | | 32 | 32@email.com | 32@email.com | 32 | 32@email.com | 32@email.com | | 33 | 33@email.com | 33@email.com | 33 | 33@email.com | 33@email.com | | 34 | 34@email.com | 34@email.com | 34 | 34@email.com | 34@email.com | | 35 | 35@email.com | 35@email.com | 35 | 35@email.com | 35@email.com | +----+--------------+--------------+----+--------------+--------------+ 5 rows in set (0.00 sec)
重複記錄的id都已經找出來了,刪除就參考上述方式處理了。