三體PCC大賽題目-facebook微博like場景資料庫設計與效能壓測

德哥發表於2017-05-12

資料庫

背景

高可用架構的一個PCC大賽，看了一下比賽規則，發現PostgreSQL很適合做這個場景，原樣復刻，使用PG實現以及效能表現到底如何？

比賽內容介紹如下

https://github.com/archnotes/PCC

實現類似 facebook 中的 like 功能，需要：

可以對一個物件（一條feed、文章、或者url）進行 like 操作，禁止 like 兩次，第二次 like 返回錯誤碼
有 isLike 介面，返回引數指定的物件有沒有被當前使用者 like 過
需要看到一個物件的 like 計數
可以看到一個物件的 like 使用者列表（類似 QQ 空間）；
上述列表加分項：Like優先顯示我的好友列表(social list)。
資料量：每天新增的 like 物件數為 1 千萬，每秒 like 計數器查詢量為 30 萬次 / 秒。

資料量

使用者數量級1億，好友數量級1~1萬，單個物件的Like數量1-100萬
提供比賽資料集（純文字格式），需要參賽人員自行匯入到自己資料庫

測試資料集格式定義

測試資料下載：https://github.com/archnotes/PCC/tree/master/data （非壓測資料）

使用者資料格式

uid為uint64，1億條

1,Tom    
2,Jerry

使用者好友資料格式

uid, friend_id為uint64，只存在雙向好友關係，1億個使用者*1000,好友關係通常是一個長尾分佈，90% 100個以下，8%長尾分佈300-1000,2% 1000-10000

1,2

物件Like列表資料格式

oid,uid為uint64，2億個objects, 每個1-100w

101:[1,2,3,4,5]

資料庫設計

人與人的關係為關注，或者相互關注的關係。

人與物件為喜歡或者不喜歡的關係。

在設計時，分明細資料和統計資料，統計資料為了更快速的查詢關係，被關注數。

明細可以記錄在日誌，也可以記錄在資料庫中。統計資料（關係，計數，被LIKE等），以流處理的方式寫入資料庫。

關係設計

結構設計

create table user_like(uid int8 primary key, like_who int8[]);   -- 關注誰？    
    
create table user_liked(uid int8 primary key, who_like int8[]);  -- 被誰關注了？本場景未涉及（如果需要的話，建立反向關係表）。    
    
create table obj(id int8 primary key, who_like int8[], like_cnt int8);   -- 物件被誰關注，總共被多少人關注，like_cnt 峰值每秒被查詢30萬次    
    
create table user_like_obj(uid int8 primary key, ids int8[]);   -- 使用者like了哪些物件，本場景未涉及（如果需要的話，建立反向關係表）。

查詢實現

1. 人關注了哪些人，

select like_who from user_like where uid=?;

2. 人被哪些人關注，本場景未涉及（如果需要的話，建立反向關係表）。

select who_like from user_liked where uid=?;

3. 人LIKE了哪些物件，本場景未涉及（如果需要的話，建立反向關係表）。

select ids from user_like_obj where uid=?;

4. 物件被哪些人LIKE，

select who_like from obj where id=?;

5. 物件被LIKE了多少次

select like_cnt from obj where id=?;

6. LIKE某物件的使用者中，哪些是我的好友？

CREATE FUNCTION array_intersect(anyarray, anyarray)    
  RETURNS anyarray    
  language sql    
as $FUNCTION$    
    SELECT ARRAY(    
        SELECT UNNEST($1)    
        INTERSECT    
        SELECT UNNEST($2)    
    );    
$FUNCTION$;    
    
select array_intersect(t1.who_like, t2.like_who) from     
(    
select who_like from obj where id=?    
) t1    
,    
(    
select array[like_who] as like_who from user_like where uid=?    
) t2;

DEMO

建立流，關注的行為將寫入流，同時寫入明細（可選）。

create STREAM user_like(uid int8, like_who int8);   -- 關注誰？    
    
create STREAM obj(id int8, who_like int8);   -- 物件被誰關注？

建立持續檢視，根據關注行為實時統計。

create CONTINUOUS VIEW cv_user_like as select uid, array_agg(like_who) as like_who from user_like group by uid;   -- 關注誰？    
create unique index idx1_cv_user_like on cv_user_like(uid);    
    
create CONTINUOUS VIEW cv_obj as select id, array_agg(who_like) as who_like, count(*) as like_cnt from obj group by id;  -- 物件被誰關注，總共被多少人關注，like_cnt 峰值每秒被查詢30萬次    
create unique index idx1_cv_obj on cv_obj(id);

啟用流計算

pipeline=# activate ;    
ACTIVATE

關注（LIKE）操作函式，判斷是否已關注，如果已關注，返回異常，否則關注。(這個也可以寫在程式中，但是需要與資料庫互動多次，差評)

函式可以根據實際需求進行調整，比如需要返回被LIKE後的陣列，查詢一下continue view即可。

create or replace function f_user_like(i_uid int8, i_like_who int8) returns void as     
$$    
declare    
begin    
  perform 1 from cv_user_like where uid=i_uid and like_who @> array[i_like_who];  -- 未關注則關注   
  if not found then    
    insert into user_like values (i_uid, i_like_who);    
  end if;  
end;    
$$ language plpgsql strict;    
    
create or replace function f_obj(i_id int8, i_who_like int8) returns void as     
$$    
declare    
begin    
  perform 1 from cv_obj where id=i_id and who_like @> array[i_who_like];    
  if not found then    
    insert into obj values (i_id, i_who_like);    
  end if;    
end;    
$$ language plpgsql strict;

測試

1喜歡1號文章    
pipeline=# select f_obj(1,1);    
 f_obj     
-------    
     
(1 row)    
    
1喜歡1號文章    
pipeline=# select f_obj(1,1);    
 f_obj     
-------    
     
(1 row)    
    
1號文章被誰喜歡    
pipeline=# select * from cv_obj;    
 id | who_like | like_cnt     
----+----------+----------    
  1 | {1}      |        1    
(1 row)    
    
2喜歡1號文章    
pipeline=# select f_obj(1,2);    
 f_obj     
-------    
     
(1 row)    
    
1號文章被誰喜歡    
pipeline=# select * from cv_obj;    
 id | who_like | like_cnt     
----+----------+----------    
  1 | {1,2}    |        2    
(1 row)    
    
100喜歡1號文章    
pipeline=# select f_obj(1,100);    
 f_obj     
-------    
     
(1 row)    
    
1號文章被誰喜歡    
pipeline=# select * from cv_obj;    
 id | who_like  | like_cnt     
----+-----------+----------    
  1 | {1,2,100} |        3    
(1 row)

壓測1

1. 使用者ID範圍

1-1億

2. 文章ID範圍

1-2億

3. 熱點文章ID範圍

總共2億文章，使用高斯分佈進行LIKE，分佈在以鐘鼎為中心的2.0/xx這個區間內的文章ID，覆蓋了95%的出現概率。分佈在1.0/xx這個區間的文章ID覆蓋了67%的出現概率。

橫座標越靠近鈡的頂端的值（即文章ID=1億），產生的概率越高。

xx越小，鈡越尖，也就是說熱點文章越少。

原理參考

《生成泊松、高斯、指數、隨機分佈資料 – PostgreSQL pg_bench 》

4. 隨機使用者喜歡隨機文章

5. 隨機使用者喜歡熱點文章

首先根據以上要求生成基礎資料

壓測指令碼，LIKE文章，使用高斯分佈產生文章ID，經過長時間的壓測，文章被LIKE的次數呈現高斯分佈，鐘鼎的文章被LIKE的次數最多。

xx設定為10.0，表示以鐘鼎為中心的20%這個區間內的文章ID，覆蓋了95%的出現概率。分佈在10%這個區間的文章ID覆蓋了67%的出現概率。

xx越大，鐘鼎的文章ID概率越高。

vi test.sql    
setrandom uid 1 100000000    
setrandom id 1 200000000 gaussian 10.0  
select f_obj(:id,:uid);

256個連線進行壓測，測試結果，每秒產生17.7萬次LIKE請求。

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 256 -j 256 -T 120    
    
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 256
number of threads: 256
duration: 120 s
number of transactions actually processed: 21331348
latency average: 1.438 ms
latency stddev: 0.591 ms
tps = 177652.080934 (including connections establishing)
tps = 177665.827969 (excluding connections establishing)
statement latencies in milliseconds:
        0.002267        setrandom uid 1 100000000  
        0.002384        setrandom id 1 200000000 gaussian 10.0
        1.433405        select f_obj(:id,:uid);

階段性壓測後文章數

pipeline=# select count(*) from cv_obj;
  count   
----------
 86842876
(1 row) 
    
-- 查詢鐘鼎附近的詞被LIKE的次數  
  
pipeline=# select like_cnt from cv_obj where id=100000000;
 like_cnt 
----------
    18317
(1 row)

pipeline=# select like_cnt from cv_obj where id=100000001;
 like_cnt 
----------
    18410
(1 row)

pipeline=# select like_cnt from cv_obj where id=100000002;
 like_cnt 
----------
    18566
(1 row)

pipeline=# select like_cnt from cv_obj where id=100000000-1;
 like_cnt 
----------
    18380
(1 row)

pipeline=# select like_cnt from cv_obj where id=100000000-2;
 like_cnt 
----------
    18399
(1 row)
  
鈡的底部邊緣被LIKE就很少  
pipeline=# select * from cv_obj where id>199999990;  
    id     |  who_like  | like_cnt   
-----------+------------+----------  
 199999991 | {89433428} |        1  
(1 row)

符合預期，繼續壓測。(或者我們也可以選擇指數分佈進行測試)

暫時沒有進行優化的情況下，CPU使用情況如下

Cpu(s): 35.2%us, 17.4%sy, 13.8%ni, 33.2%id,  0.3%wa,  0.0%hi,  0.1%si,  0.0%st    
    
主機上其他不相干程式的開銷    
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                             
16725 digoal    20   0 18.4g  11m  948 S 320.6  0.0   1:15.63 pgbench -M prepared -n -r -P 1 -f ./test.sql -c 256 -j 256 -T 120                                                                                                                  
18411 root      20   0  445m  58m 7048 R 98.8  0.0   0:03.25                                                            
18434 root      20   0  445m  58m 7040 R 87.5  0.0   0:02.71

持續壓測like，產生2億文章的LIKE資料，然後進入測試2。

生成使用者關係資料

1. 使用者ID範圍

1-1億

2. 使用者好友分佈

90% 100個以下，8%長尾分佈300-1000, 2% 1000-10000

關係表    
create table user_like_agg(uid int8 primary key, like_who int8[]);    
    
產生指定元素個數範圍的陣列    
create or replace function gen_uids(c1 int, c2 int) returns int8[] as    
$$     
select array(select (random()*100000000)::int8 from generate_series(1, c1+(random()*(c2-c1))::int));    
$$ language sql strict;    
    
序列    
create sequence seq cache 100;

產生90%的使用者關係

vi gen90.sql    
insert into user_like_agg select nextval(`seq`), gen_uids(1,100);    
    
pgbench -M prepared -n -r -P 1 -f ./gen90.sql -c 100 -j 100 -t 900000

產生8%的使用者關係

vi gen8.sql    
insert into user_like_agg select nextval(`seq`), gen_uids(300,1000);    
    
pgbench -M prepared -n -r -P 1 -f ./gen8.sql -c 100 -j 100 -t 80000

產生2%的使用者關係

vi gen2.sql    
insert into user_like_agg select nextval(`seq`), gen_uids(1000,10000);    
    
pgbench -M prepared -n -r -P 1 -f ./gen2.sql -c 100 -j 100 -t 20000

最終生成1億使用者，佔用123GB空間，2.7GB索引。

pipeline=#     
pipeline=# dt+ user_like_agg     
                        List of relations    
 Schema |     Name      | Type  |  Owner   |  Size  | Description     
--------+---------------+-------+----------+--------+-------------    
 public | user_like_agg | table | postgres | 123 GB |     
(1 row)    
    
pipeline=# di+ user_like_agg_pkey     
                                   List of relations    
 Schema |        Name        | Type  |  Owner   |     Table     |  Size   | Description     
--------+--------------------+-------+----------+---------------+---------+-------------    
 public | user_like_agg_pkey | index | postgres | user_like_agg | 2706 MB |     
(1 row)    
    
pipeline=# select count(*) from user_like_agg ;    
   count       
-----------    
 100000000    
(1 row)

壓測2

1. 查詢文章被誰like？

2. 查詢文章被like了多少次？

3. 查詢LIKE某文章的使用者中，哪些是我的好友？

壓測指令碼1, 查詢文章被誰like？

vi test1.sql    
setrandom id 1 200000000    
select who_like from cv_obj where id=:id;    
    
pgbench -M prepared -n -r -P 1 -f ./test1.sql -c 128 -j 128 -T 120

壓測指令碼2, 查詢文章被like了多少次？

vi test2.sql    
setrandom id 1 200000000    
select like_cnt from cv_obj where id=:id;    
    
pgbench -M prepared -n -r -P 1 -f ./test2.sql -c 128 -j 128 -T 120

壓測指令碼3, 查詢LIKE某文章的使用者中，哪些是我的好友？

vi test3.sql    
setrandom id 1 200000000    
setrandom uid 1 100000000    
select array_intersect(t1.who_like, t2.like_who) from (select who_like from cv_obj where id=:id) t1,(select array[like_who] as like_who from user_like_agg where uid=:uid) t2;    
    
pgbench -M prepared -n -r -P 1 -f ./test3.sql -c 128 -j 128 -T 120

壓測結果1，查詢文章被誰like？達到 101萬/s 並不意外。

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 121935264
latency average: 0.125 ms
latency stddev: 0.203 ms
tps = 1016035.198013 (including connections establishing)
tps = 1016243.580731 (excluding connections establishing)
statement latencies in milliseconds:
        0.001589        setrandom id 1 1000000000
        0.123249        select who_like  from cv_obj where id=:id;

壓測結果2，查詢文章被like了多少次？ 104萬/s。

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 124966713
latency average: 0.122 ms
latency stddev: 0.204 ms
tps = 1041268.730790 (including connections establishing)
tps = 1041479.852625 (excluding connections establishing)
statement latencies in milliseconds:
        0.001708        setrandom id 1 1000000000
        0.120069        select like_cnt from cv_obj where id=:id;

壓測結果3，查詢LIKE某文的使用者中，哪些是我的好友？ 64.8萬/s。

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 128
number of threads: 128
duration: 120 s
number of transactions actually processed: 77802915
latency average: 0.196 ms
latency stddev: 1.649 ms
tps = 648273.025370 (including connections establishing)
tps = 648368.477278 (excluding connections establishing)
statement latencies in milliseconds:
        0.001719        setrandom id 1 1000000000
        0.000695        setrandom uid 1 100000000
        0.193728        select array_intersect(t1.who_like, t2.like_who) from (select who_like from cv_obj where id=:id) t1,(select array[like_who] as like_who from user_like_agg where uid=:uid) t2;

優化思路

1. 陣列越長，一條記錄佔用的空間會越大，使用TOAST切片儲存，可以有效的提高查詢非陣列欄位的效率。

例如

alter table cv_obj alter column who_like set (storage=extended);

2. profiling，針對性的優化。

小結

微博、facebook最常用的操作：

1. 關注人或者喜歡某條訊息、微博等。

這個屬於寫操作，要求寫入快，並且要求寫入（LIKE或關注）後立即反映出來。

2. 查詢好友列表

為了查詢快速，最快的方法是PK查詢，但是一個人可能關注很多人，如果是查詢多條記錄，很顯然會比較慢。

因此考慮使用陣列儲存好友列表。

但是，使用陣列儲存列表，又需要考慮寫入速度的問題。

所以使用流計算聚合是最好的，因為PG有流計算外掛，可以在資料庫中完成流計算。

3. 查詢被關注的好友列表

反向好友關係，同樣要求查詢快速，使用正向關係一樣的方法。

4. 查詢文章（微博）被關注的次數，被關注人，被關注的人裡有哪些是我的好友。

首先被關注的次數，實際上就是個計數器。為了提高查詢速度，它必須是一個VALUE而不是查詢時使用COUNT(*)進行聚合。

查詢文章（微博）被關注的人，為了提高查詢速度，同樣考慮使用陣列儲存。使用PG內建的流計算進行聚合。

被關注的人裡面有哪些是我的好友，這個問題就很簡單了，好友關係與文章（微博）被關注人的兩個陣列求交集即可。

使用PG的流計算解決了實時寫入，實時聚合的問題。

同時由於資料實時被聚合，所以幾個查詢需求就顯得非常輕鬆。

測試得到的效能指標（未優化）：

1. 關注微博（文章）

17.7萬/s，預計可以優化到30萬。

2. 查詢文章被誰like？

101.6萬/s

3. 查詢文章被like了多少次？

104.1萬/s

4. 查詢LIKE某文章的使用者中，哪些是我的好友？

64.8萬/s

5. 機器:

（10W左右價位的X86，12*8TB SATA盤，1塊SSD作為BCACHE）

資料庫內建流計算功能，是一件不錯的事情。

參考

《facebook linkbench 測試PostgreSQL社交關係圖譜場景效能》

《流計算風雲再起 – PostgreSQL攜PipelineDB力挺IoT》

《PostgreSQL on Linux 最佳部署手冊》