人分九等,數有階梯-PostgreSQL階品(顆粒)分析函式width_bucket,kmean應用

德哥發表於2017-07-16

標籤

PostgreSQL , width_bucket , 資料分佈 , 包裹篩選 , 顆粒度篩選 , Oracle , 階級分佈 , kmean


背景

從古至今,現實社會中,到處充滿了等級劃分,例如東漢史學家、文學家班固《漢書》《古今人表》把人分九等。九品量表”之中,分為上(上智).中(中人).下(下愚)三等。在每個等級中又分為:上上.上中.上下,中上.中中.中下,以及下上.下中和下下三等。

現代的一些見解:

第一等:聖人。已參破紅塵卻仍然悲天憫人,已近神界卻不孤芳自賞。

第二等:英雄。胸懷遠大,智慧超群,忍辱負重,力挽狂瀾,解民眾於水火。

第三等:才俊。獨立性強,才智過人,在科學、藝術領域對人類貢獻極大。

第四等:志士。徒有一腔熱血,滿腹才華,無奈命運多舛,加上自身的侷限性,雖也做了一些事情,終是曇花一現。代表人物:屈原、鄭和、譚嗣同、魯迅、遇洛克、原《南方週末》有良知、有骨氣的記者及網上有正義感的遊俠。

第五等:仁人。善良、純淨,急民之所需,痛民之所痛。 願意放棄優越生活,去支援貧苦百姓的人。

第六等:大眾。善良純樸,見淺識陋,奔波勞碌,人窮志短。雖有美好意願,卻無正義精神,可為“希望工程”捐款,而遇到邪惡立刻明哲保身。既可推動文明,亦能充當犧牲品或殺人工具。

第七等:賤民。生活在底層,偷雞摸狗,渾渾噩噩,苟延殘喘,著實齷齪。代表人物:妓女、乞丐、騙子。如遇良好教育尚有希望。

第八等:暴君。上帝的玩偶,派來警示人類的。

第九等:小人。輕則小有才華,無病呻吟,蔑視苦難,以醜為美,誤導大眾,譁眾取寵,玷汙智慧,成就庸俗,如王家衛、金庸、轉型後的張藝謀、明星、娛記、《Vogue》、紈絝子弟太子黨,重則奴顏婢膝,脅肩諂笑,口是心非,生性陰暗見不得陽光,如呂后、郭沫若,以及多數中國當代官僚。

pic

以上參考自豆瓣

除了人分九等,實際上我們還能看到各種的等級分佈。例如:

1、對全國高考成績進行分佈透視,你的高考成績拿出來,會落在第幾等呢?

2、氣溫分佈,你所在的城市,全年的平均氣溫會在全國排第幾等呢?

3、雨量分佈,你所在的城市,全年的降雨量會在全國排在第幾等呢?

4、包裹分揀,按重量、按體積進行分揀,方便物流的運輸。

5、商品顆粒篩選,例如大米、枸杞、羅漢果、水果、大閘蟹等商品,按顆粒度的大小,分為不同的等級。

pic

6、收入等級,你的薪資水平落在第幾等呢?是不是戳中小心臟了,是不是又拖全國平均工資後腿了呢?

7、用水用電等級,現在水電都是階梯價,不同的階梯,定價都不一樣。

8、交稅也按收入分等級。

9、按每年接待的遊客數分幾個檔,評選出不同級別的景區。

pic

10、對玩王者榮耀的時間進行統計,按遊戲時長,對人群進行歸類。

11、對淘寶店鋪的點選率、銷量資料進行統計,劃分店鋪等級。

還有好多分類或分級的例子。分級是為了更好的對資料進行歸類,方便資料的透視。

在資料庫中,儲存的通常是明細資料,如何進行等級劃分或者分揀呢?

接下來隆重推出PostgreSQL的兩大歸類分析利器。

1、width_bucket

1、指定預設邊界和等級個數,返回VALUE所處等級。

適合求均勻分佈的等級劃分,例如超時的商品,按單價進行均勻分佈的劃分,看看每種商品落在哪個消費區間。

2、指定預設邊界陣列,返回VALUE所處等級。

適合求非均勻分佈的等級劃分,例如求收入水平、學習成績的非均勻分佈資料。60分以下為不及格,60-80為中,80-90為良,90-97為優,97以上為拔尖。

2、kmean

一個聚類演算法。

pic

PostgreSQL kmeans外掛:

《K-Means 資料聚集演算法》

一、分類利器1 width_bucket

例子1

淘寶店鋪每天都有點選率,銷量資料。對淘寶店鋪在100 ~ 5000次瀏覽量均勻劃分為10個等級,低於100為0等,高於5000為11等。

1、設計表結構

create table test(    
  sid int,   -- 店鋪ID    
  cnt_date date,  -- 日期    
  cnt int,  -- 瀏覽量    
  primary key (sid,cnt_date)  -- 主鍵約束    
);    

2、生成正態分佈的銷量資料

vi test.sql    
    
set cnt random_gaussian(0,10000,4)    
set sid random(1,10000000)    
insert into test values (:sid, `2017-07-15`, :cnt) on conflict (sid,cnt_date) do nothing;    

寫入測試資料

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 120    
    
    

3、等級統計透視

在100 ~ 5000次瀏覽量均勻劃分為10個等級,低於100為0等,高於5000為11等。

postgres=# select sid, cnt_date, cnt, width_bucket(cnt, 100, 5000, 10) as wb from test limit 10;    
   sid   |  cnt_date  | cnt  | wb     
---------+------------+------+----    
 1799658 | 2017-07-15 | 5708 | 11    
 9549703 | 2017-07-15 | 5016 | 11    
 2122532 | 2017-07-15 | 4413 |  9    
 7663952 | 2017-07-15 | 6199 | 11    
 7047657 | 2017-07-15 | 5655 | 11    
 8485951 | 2017-07-15 | 6902 | 11    
 5135164 | 2017-07-15 | 5929 | 11    
 5592226 | 2017-07-15 | 4213 |  9    
 3389938 | 2017-07-15 | 4091 |  9    
 1372024 | 2017-07-15 | 5505 | 11    
(10 rows)    
    
postgres=# select width_bucket(cnt, 100, 5000, 10) as wb, count(*) from test where cnt_date=`2017-07-15` group by 1 order by 1;    
 wb |  count      
----+---------    
  0 |     129    
  1 |    1635    
  2 |    6368    
  3 |   21686    
  4 |   62661    
  5 |  155530    
  6 |  332831    
  7 |  610253    
  8 |  961658    
  9 | 1303200    
 10 | 1517335    
 11 | 4975181    
(12 rows)    

4、等級佔比透視

select wb, cnt, round(100*(cnt/(sum(cnt) over ())), 2)||` %` as ratio from     
  (select width_bucket(cnt, 100, 5000, 10) as wb, count(*) cnt from test where cnt_date=`2017-07-15` group by 1) t order by wb;    
    
 wb |   cnt   |  ratio      
----+---------+---------    
  0 |     129 | 0.00 %    
  1 |    1635 | 0.02 %    
  2 |    6368 | 0.06 %    
  3 |   21686 | 0.22 %    
  4 |   62661 | 0.63 %    
  5 |  155530 | 1.56 %    
  6 |  332831 | 3.35 %    
  7 |  610253 | 6.13 %    
  8 |  961658 | 9.67 %    
  9 | 1303200 | 13.10 %    
 10 | 1517335 | 15.25 %    
 11 | 4975181 | 50.01 %    
(12 rows)    

例子2

還是以上資料,但是不按等值分佈,而是用陣列表示自定義區間分佈。分佈如下:

<100    
    
[100, 500)    
    
[500,1000)    
    
[1000,2000)    
    
[2000,5000)    
    
[5000,8000)    
    
>=8000    
postgres=# select width_bucket(cnt, array[100,500,1000,2000,5000,8000]) as wb, count(*) as cnt from test where cnt_date=`2017-07-15` group by 1 order by 1;    
 wb |   cnt      
----+---------    
  0 |     129    
  1 |    1152    
  2 |    5245    
  3 |   74644    
  4 | 4892116    
  5 | 4893590    
  6 |   81591    
(7 rows)    
select wb, cnt, round(100*(cnt/(sum(cnt) over ())), 2)||` %` as ratio from     
  (select width_bucket(cnt, array[100,500,1000,2000,5000,8000]) as wb, count(*) as cnt from test where cnt_date=`2017-07-15` group by 1) t order by wb;    
    
 wb |   cnt   |  ratio      
----+---------+---------    
  0 |     129 | 0.00 %    
  1 |    1152 | 0.01 %    
  2 |    5245 | 0.05 %    
  3 |   74644 | 0.75 %    
  4 | 4892116 | 49.17 %    
  5 | 4893590 | 49.19 %    
  6 |   81591 | 0.82 %    
(7 rows)    

二、分類利器2 kmeans

請參考 《K-Means 資料聚集演算法》 瞭解背景知識以及PostgreSQL kmeans外掛。

例子1

還是使用其那面的測試資料。對店鋪銷量劃分為10類,第一次劃分不使用種子,很多點都被歸為噪點,所以分級集中在正態資料分佈的部分。

postgres=# select sid,cnt_date,cnt,kmeans(array[cnt], 10) over () k from test where cnt_date=`2017-07-15` limit 10;    
   sid   |  cnt_date  | cnt  | k     
---------+------------+------+---    
 1799658 | 2017-07-15 | 5708 | 6    
 9549703 | 2017-07-15 | 5016 | 5    
 2122532 | 2017-07-15 | 4413 | 3    
 7663952 | 2017-07-15 | 6199 | 7    
 7047657 | 2017-07-15 | 5655 | 6    
 8485951 | 2017-07-15 | 6902 | 8    
 5135164 | 2017-07-15 | 5929 | 6    
 5592226 | 2017-07-15 | 4213 | 3    
 3389938 | 2017-07-15 | 4091 | 3    
 1372024 | 2017-07-15 | 5505 | 5    
(10 rows)    
    
postgres=# select k, count(*) as cnt from (select sid,cnt_date,cnt,kmeans(array[cnt], 10) over () k from test where cnt_date=`2017-07-15`) t group by 1 order by 1;    
 k |   cnt       
---+---------    
 0 |  244257    
 1 |  674861    
 2 | 1084717    
 3 | 1399515    
 4 | 1569936    
 5 | 1568358    
 6 | 1399082    
 7 | 1087206    
 8 |  675811    
 9 |  244724    
(10 rows)    

第二次劃分,使用聚類種子(中心點)。

postgres=# select sid,cnt_date,cnt,kmeans(array[cnt], 6, array[100,500,1000,2000,5000,8000]) over () k from test where cnt_date=`2017-07-15` limit 10;    
   sid   |  cnt_date  | cnt  | k     
---------+------------+------+---    
 1799658 | 2017-07-15 | 5708 | 3    
 9549703 | 2017-07-15 | 5016 | 3    
 2122532 | 2017-07-15 | 4413 | 2    
 7663952 | 2017-07-15 | 6199 | 4    
 7047657 | 2017-07-15 | 5655 | 3    
 8485951 | 2017-07-15 | 6902 | 5    
 5135164 | 2017-07-15 | 5929 | 4    
 5592226 | 2017-07-15 | 4213 | 2    
 3389938 | 2017-07-15 | 4091 | 1    
 1372024 | 2017-07-15 | 5505 | 3    
(10 rows)    
    
postgres=# select k, count(*) as cnt from (select sid,cnt_date,cnt,kmeans(array[cnt], 6, array[100,500,1000,2000,5000,8000]) over () k from test where cnt_date=`2017-07-15`) t group by 1 order by 1;    
 k |   cnt       
---+---------    
 0 |  731010    
 1 | 1788146    
 2 | 2428678    
 3 | 2438930    
 4 | 1813311    
 5 |  748392    
(6 rows)    

透視

select k, cnt, round(100*(cnt/(sum(cnt) over ())), 2)||` %` as ratio from     
  (select k, count(*) as cnt from (select sid,cnt_date,cnt,kmeans(array[cnt], 10) over () k from test where cnt_date=`2017-07-15`) t group by 1) t order by k;    
 k |   cnt   |  ratio      
---+---------+---------    
 0 |  244257 | 2.46 %    
 1 |  674861 | 6.78 %    
 2 | 1084717 | 10.90 %    
 3 | 1399515 | 14.07 %    
 4 | 1569936 | 15.78 %    
 5 | 1568358 | 15.76 %    
 6 | 1399082 | 14.06 %    
 7 | 1087206 | 10.93 %    
 8 |  675811 | 6.79 %    
 9 |  244724 | 2.46 %    
(10 rows)    
    
    
select k, cnt, round(100*(cnt/(sum(cnt) over ())), 2)||` %` as ratio from     
  (select k, count(*) as cnt from (select sid,cnt_date,cnt,kmeans(array[cnt], 6, array[0,100,500,2000,5000,10000]) over () k from test where cnt_date=`2017-07-15`) t group by 1) t order by k;    
 k |   cnt   |  ratio      
---+---------+---------    
 0 |  731010 | 7.35 %    
 1 | 1788146 | 17.97 %    
 2 | 2428678 | 24.41 %    
 3 | 2438930 | 24.52 %    
 4 | 1813311 | 18.23 %    
 5 |  748392 | 7.52 %    
(6 rows)    

三、MADlib機器學習庫

除了使用kmeans外掛進行分類,還可以使用MADlib外掛,MADlib外掛是PostgreSQL的開源機器學習庫。

https://github.com/apache/incubator-madlib

http://madlib.incubator.apache.org/

Classification

When the desired output is categorical in nature, we use classification methods to build a model that predicts which of the various categories a new result would fall into. The goal of classification is to be able to correctly label incoming records with the correct class for the record.

Example: If we had data that described various demographic data and other features of individuals applying for loans, and we had historical data that included what past loans had defaulted, then we could build a model that described the likelihood that a new set of demographic data would result in a loan default. In this case, the categories are “will default” or “won’t default” which are two discrete classes of output.

Regression

When the desired output is continuous in nature, we use regression methods to build a model that predicts the output value.

Example: If we had data that described properties of real estate listings, then we could build a model to predict the sale value for homes based on the known characteristics of the houses. This is a regression problem because the output response is continuous in nature, rather than categorical.

Clustering

Here we are trying to identify groups of data such that the items within one cluster are more similar to each other than they are to the items in any other cluster.

Example: In customer segmentation analysis, the goal is to identify specific groups of customers that behave in a similar fashion, so that various marketing campaigns can be designed to reach these markets. When the customer segments are known in advance this would be a supervised classification task. When we let the data itself identify the segments, this becomes a clustering task.

Topic Modeling

Topic modeling is similar to clustering in that it attempts to identify clusters of documents that are similar to each other, but it is more specific to the text domain where it is also trying to identify the main themes of those documents.

Association Rule Mining

Also called market basket analysis or frequent itemset mining, this is attempting to identify which items tend to occur together more frequently than random chance would indicate, suggesting an underlying relationship between the items.

Example: In an online web store, association rule mining can be used to identify what products tend to be purchased together. This can then be used as input into a product recommendation engine to suggest items that may be of interest to the customer and provide upsell opportunities.

Descriptive Statistics

Descriptive statistics don’t provide a model and thus are not considered a learning method. However, they can be helpful in providing information to an analyst to understand the underlying data, and can provide valuable insights into the data that may influence choice of data model.

Example: Calculating the distribution of data within each variable of a dataset can help an analyst understand which variables should be treated as categorical variables, and which should be treated as continuous variables, including the sort of distribution the values fall in.

Validation

Using a model without understanding the accuracy of that model can lead to a poor outcome. For that reason, it is important to understand the error of a model and to evaluate the model for accuracy on test data. Frequently in data analysis, a separation is made between training data and test data solely for the purpose of providing statistically valid analysis of the validity of the model, and assessment that the model is not over-fitting the training data. N-fold cross validation is also frequently utilized.

四、為什麼PostgreSQL比Oracle先進

1、Oracle width_bucket不支援陣列,只支援均勻分佈透視,不支援非均勻分佈的資料透視。

例如高考成績分佈,分數從0分到750分都有,如果只能均勻透視,沒法真正區分有意義的等級區間。

而使用PostgreSQL width_bucket陣列解決這個問題。例如 array[300, 400, 520, 580, 630, 690] 這樣可以根據實際情況進行透視,出來的透視結果是比較有意義的。

2、PostgreSQL支援眾多機器學習演算法。

3、PostgreSQL 支援生成正態分佈,隨機分佈,泊松分佈的測試資料,便於測試。

pic

https://www.postgresql.org/docs/10/static/pgbench.html

更多Mathematical函式

https://www.postgresql.org/docs/10/static/functions-math.html

參考

https://www.postgresql.org/docs/10/static/functions-math.html

https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions214.htm

壓測資料,正泰分佈

https://www.postgresql.org/docs/9.6/static/pgbench.html

《在PostgreSQL中如何生成測試kmean演算法的資料》

《K-Means 資料聚集演算法》

《生成泊松、高斯、指數、隨機分佈資料 – PostgreSQL 9.5 new feature – pg_bench improve, gaussian (standard normal) & exponential distribution》

《PostgreSQL 9.5 new feature – width_bucket return the bucket number》

《PostgreSQL FDW mongo_fdw usage》


相關文章