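Cluster Analysis Case Study: Clustering and Exploratory Analysis of Customer Features

Posted by Zerogoy on 2020-09-28

Source: Song Tianlong, 《PYTHON資料分析與資料化運營》 (Python Data Analysis and Data-Driven Operations). The notes below are fairly rough and are kept mainly for future reference.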

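Scenario:
One day the business department brought a set of customer data to the data department. They had no obvious starting point for analysis and hoped the data department could analyze the data and offer some insights and suggestions.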

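The data source has the following features:
USER_ID: user ID column, integer
AVG_ORDERS: average number of orders per user, float
AVG_MONEY: average order value, float
IS_ACTIVE: whether the user is active, string
SEX: gender, coded 0, 1, 2 for unknown, male, female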

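Analysis:
IS_ACTIVE is a string-typed categorical variable and SEX is a numerically coded categorical variable; with a distance-based approach both would need one-hot encoding (OneHotEncoder).
AVG_ORDERS and AVG_MONEY are on clearly different scales and need to be normalized.
However, one-hot encoding splits IS_ACTIVE and SEX into several indicator columns, so the final output can be messy and no longer matches the business need to analyze the original variables.
A different approach is therefore used below:
The categorical features IS_ACTIVE and SEX are analyzed directly. Instead of judging similarity by distance, the frequency of each category value within each cluster is computed. Note that KMeans in sklearn works only on numeric features with Euclidean distance, so in the code below the clustering itself uses the two scaled numeric features, and the categorical frequencies are computed per cluster afterwards.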
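
To make the trade-off concrete, here is a minimal sketch on a made-up five-row table. The values, the `active`/`inactive` strings, and the `labels` column are illustrative assumptions, not the article's data. It shows how one-hot encoding spreads IS_ACTIVE and SEX over several indicator columns, and how the same feature can instead be summarized as value frequencies inside each cluster.

```python
import pandas as pd

# Hypothetical toy data: column names follow the article, values are invented.
toy = pd.DataFrame({
    'IS_ACTIVE': ['active', 'inactive', 'active', 'active', 'inactive'],
    'SEX': [0, 1, 2, 1, 1],
    'labels': [0, 0, 1, 1, 1],  # pretend cluster assignments
})

# Option 1: one-hot encoding turns two features into five indicator columns.
one_hot = pd.get_dummies(toy[['IS_ACTIVE', 'SEX']].astype(str))
print(one_hot.columns.tolist())

# Option 2: keep the original feature and report, per cluster,
# how often each category value occurs.
active_freq = (toy.groupby('labels')['IS_ACTIVE']
                  .value_counts(normalize=True)
                  .unstack(fill_value=0))
print(active_freq)
```

The second table has one row per cluster and one column per original category value, which is usually easier to read back to a business audience than a set of 0/1 indicator columns.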

Python implementation
1. Import libraries and read the data

import pandas as pd  # pandas
import numpy as np
import matplotlib.pyplot as plt  # plotting
from sklearn.preprocessing import MinMaxScaler  # min-max scaling
from sklearn.cluster import KMeans  # k-means clustering
from sklearn.metrics import silhouette_score, calinski_harabasz_score  # evaluation metrics
# Read the data
raw_data = pd.read_csv('cluster.txt')  # load the data file
print(raw_data.head())

2. Data processing and model training

# Scale the data
numeric_features = raw_data.iloc[:,1:3]  # numeric features (AVG_ORDERS, AVG_MONEY)
scaler = MinMaxScaler()
scaled_numeric_features = scaler.fit_transform(numeric_features)
print(scaled_numeric_features[:,:2])  # preview the scaled values
# Train the clustering model
n_clusters = 3  # number of clusters
model_kmeans = KMeans(n_clusters=n_clusters, random_state=0)  # create the clustering model
model_kmeans.fit(scaled_numeric_features)  # fit the clustering model
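
With its default `feature_range=(0, 1)`, `MinMaxScaler` rescales each column to the range [0, 1]. As a rough illustration of what that means, the sketch below reproduces the arithmetic by hand with NumPy on invented values (the array `X` is an assumption for illustration, not the article's data).

```python
import numpy as np

# Hypothetical values: order counts are small, order amounts are large.
X = np.array([[1.0, 100.0],
              [3.0, 500.0],
              [5.0, 900.0]])

# Min-max scaling per column: (x - min) / (max - min).
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled)
```

After scaling, both columns contribute on the same footing to the Euclidean distances that k-means relies on. This hand-written version would divide by zero for a constant column, so in practice the library scaler is the safer choice.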

3. Evaluate the model

# Evaluate model quality
# Total number of samples and features
n_samples, n_features = raw_data.iloc[:,1:].shape
print('samples: %d \t features: %d' % (n_samples, n_features))
# Unsupervised evaluation metrics
silhouette_s = silhouette_score(scaled_numeric_features, model_kmeans.labels_, metric='euclidean')  # mean silhouette coefficient
calinski_harabasz_s = calinski_harabasz_score(scaled_numeric_features, model_kmeans.labels_)  # Calinski-Harabasz score
unsupervised_data = {'silh':[silhouette_s],'c&h':[calinski_harabasz_s]}
unsupervised_score = pd.DataFrame.from_dict(unsupervised_data)
print('\n','unsupervised score:','\n','-'*60)
print(unsupervised_score)

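The first metric, the silhouette coefficient, ranges from -1 to 1; a value above 0.5 indicates reasonably good clustering quality.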
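
The article fixes the number of clusters at 3. One common way to sanity-check such a choice is to compare the silhouette coefficient across several candidate values of k. The sketch below does this on synthetic data from `make_blobs`, which is only a stand-in for the scaled features; the blob centers, the candidate range and `n_init=10` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled features: three well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

# Fit k-means for each candidate k and record the mean silhouette coefficient.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric='euclidean')

# Pick the k with the highest score.
best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```

On real customer data the scores are rarely this clear-cut, so the result is best read together with the business interpretation of each cluster.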
4. Merge the results and analyze

# Merge the data with the cluster labels
# Get the cluster label of each sample
kmeans_labels = pd.DataFrame(model_kmeans.labels_,columns=['labels'])
# Combine the raw data with the labels
kmeans_data = pd.concat((raw_data,kmeans_labels),axis=1)
print(kmeans_data.head())
# Compute the sample count and share of each cluster
label_count = kmeans_data.groupby(['labels'])['SEX'].count()  # count per cluster
label_count_rate = label_count/ kmeans_data.shape[0] # share per cluster
kmeans_record_count = pd.concat((label_count,label_count_rate),axis=1)
kmeans_record_count.columns=['record_count','record_rate']
print(kmeans_record_count.head())
# Compute the mean of the numeric features per cluster
# (a list of columns is required; tuple-style selection fails in recent pandas)
kmeans_numeric_features = kmeans_data.groupby(['labels'])[['AVG_ORDERS','AVG_MONEY']].mean()
print(kmeans_numeric_features.head())

# Compute the categorical features per cluster
active_list = []
sex_gb_list = []
unique_labels = np.unique(model_kmeans.labels_)
print('labels:',unique_labels)
for each_label in unique_labels:
    each_data = kmeans_data[kmeans_data['labels']==each_label]

    # Share of each IS_ACTIVE / SEX value within this cluster
    active_list.append(each_data.groupby(['IS_ACTIVE'])['USER_ID'].count()/each_data.shape[0])
    sex_gb_list.append(each_data.groupby(['SEX'])['USER_ID'].count()/each_data.shape[0])

kmeans_active_pd = pd.DataFrame(active_list)
kmeans_sex_gb_pd = pd.DataFrame(sex_gb_list)
kmeans_string_features = pd.concat((kmeans_active_pd,kmeans_sex_gb_pd),axis=1)
kmeans_string_features.index = unique_labels
# Column names: inactive, active, unknown, male, female
# (this relies on the sorted order of the category values)
kmeans_string_features.columns=['不活躍','活躍','未知','男','女']
print(kmeans_string_features.head())

# Merge the analysis results for all clusters
features_all = pd.concat((kmeans_record_count,kmeans_numeric_features,kmeans_string_features),axis=1)
print(features_all.head())
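
As an aside, the per-cluster shares of a categorical feature can also be computed without an explicit loop. The sketch below uses `pd.crosstab` with `normalize='index'` on an invented table; `demo` and its values are hypothetical stand-ins for `kmeans_data`, not the article's data.

```python
import pandas as pd

# Hypothetical stand-in for kmeans_data; values are invented for illustration.
demo = pd.DataFrame({
    'labels': [0, 0, 1, 1, 1, 2],
    'SEX':    [0, 1, 1, 1, 2, 0],
})

# Rows are clusters, columns are SEX values, and each row sums to 1.
sex_share = pd.crosstab(demo['labels'], demo['SEX'], normalize='index')
print(sex_share)
```

A category value that never occurs in a cluster shows up as 0 here, whereas building a DataFrame from per-cluster Series leaves NaN in that position.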

5. Visualization

# Plot the results
# part 1: global configuration
fig = plt.figure(figsize=(10, 7))
titles = ['RECORD_RATE','AVG_ORDERS','AVG_MONEY','IS_ACTIVE','SEX'] # shared column titles
line_index,col_index = 3,5 # grid size: rows, columns
ax_ids = np.arange(1,16).reshape(line_index,col_index) # subplot index grid
plt.rcParams['font.sans-serif']=['SimHei'] # font needed to render the Chinese labels

# part 2: pie charts of each cluster's share
pie_fracs = features_all['record_rate'].tolist()
for ind in range(len(pie_fracs)):
    ax = fig.add_subplot(line_index, col_index, ax_ids[:,0][ind])
    init_labels = ['','',''] # start with empty labels
    init_labels[ind] = 'cluster_{0}'.format(ind) # label only the current cluster
    init_colors = ['lightgray', 'lightgray', 'lightgray']
    init_colors[ind] = 'g' # highlight the current cluster's slice
    ax.pie(x=pie_fracs, autopct='%3.0f %%',labels=init_labels,colors=init_colors)
    ax.set_aspect('equal') # draw the pie as a circle
    if ind == 0:
        ax.set_title(titles[0])
    
# part 3: bar charts of the AVG_ORDERS mean
avg_orders_label = 'AVG_ORDERS'
avg_orders_fraces = features_all[avg_orders_label]
for ind, frace in enumerate(avg_orders_fraces):
    ax = fig.add_subplot(line_index, col_index, ax_ids[:,1][ind])
    ax.bar(x=unique_labels,height=[0,frace,0]) # draw a single centered bar
    ax.set_ylim((0, max(avg_orders_fraces)*1.2))
    ax.set_xticks([])
    ax.set_yticks([])
    if ind == 0: # set the column title
        ax.set_title(titles[1])
    # Add the value label and the x-axis label for each bar
    ax.text(unique_labels[1],frace+0.4,s='{:.2f}'.format(frace),ha='center',va='top')
    ax.text(unique_labels[1],-0.4,s=avg_orders_label,ha='center',va='bottom')
        
# part 4: bar charts of the AVG_MONEY mean
avg_money_label = 'AVG_MONEY'
avg_money_fraces = features_all[avg_money_label]
for ind, frace in enumerate(avg_money_fraces):
    ax = fig.add_subplot(line_index, col_index, ax_ids[:,2][ind])
    ax.bar(x=unique_labels,height=[0,frace,0]) # draw a single centered bar
    ax.set_ylim((0, max(avg_money_fraces)*1.2))
    ax.set_xticks([])
    ax.set_yticks([])
    if ind == 0: # set the column title
        ax.set_title(titles[2])
    # Add the value label and the x-axis label for each bar
    ax.text(unique_labels[1],frace+4,s='{:.0f}'.format(frace),ha='center',va='top')
    ax.text(unique_labels[1],-4,s=avg_money_label,ha='center',va='bottom')
        
# part 5: bar charts of the active / inactive shares
activity_labels = ['不活躍','活躍'] # column names: inactive, active
x_ticket = [i for i in range(len(activity_labels))]
activity_data = features_all[activity_labels]
ylim_max = activity_data.values.max()
for ind,each_data in enumerate(activity_data.values):
    ax = fig.add_subplot(line_index, col_index, ax_ids[:,3][ind])
    ax.bar(x=x_ticket,height=each_data) # draw the bars
    ax.set_ylim((0, ylim_max*1.2))
    ax.set_xticks([])
    ax.set_yticks([])
    if ind == 0: # set the column title
        ax.set_title(titles[3])
    # Add the value label and the x-axis label for each bar
    activity_values = ['{:.1%}'.format(i) for i in each_data]
    for i in range(len(x_ticket)):
        ax.text(x_ticket[i],each_data[i]+0.05,s=activity_values[i],ha='center',va='top')
        ax.text(x_ticket[i],-0.05,s=activity_labels[i],ha='center',va='bottom')
        
# part 6: bar charts of the gender distribution
sex_data = features_all.iloc[:,-3:]
x_ticket = [i for i in range(sex_data.shape[1])] # one bar per SEX value
sex_labels = ['SEX_{}'.format(i) for i in range(3)]
ylim_max = sex_data.values.max()
for ind,each_data in enumerate(sex_data.values):
    ax = fig.add_subplot(line_index, col_index, ax_ids[:,4][ind])
    ax.bar(x=x_ticket,height=each_data) # draw the bars
    ax.set_ylim((0, ylim_max*1.2))
    ax.set_xticks([])
    ax.set_yticks([])
    if ind == 0: # set the column title
        ax.set_title(titles[4])
    # Add the value label and the x-axis label for each bar
    sex_values = ['{:.1%}'.format(i) for i in each_data]
    for i in range(len(x_ticket)):
        ax.text(x_ticket[i],each_data[i]+0.1,s=sex_values[i],ha='center',va='top')
        ax.text(x_ticket[i],-0.1,s=sex_labels[i],ha='center',va='bottom')

plt.tight_layout(pad=0.8) # tighten the spacing between subplots
plt.show() # display the figure when run as a script
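
The plotting code relies on matplotlib's subplot numbering, which starts at 1 in the upper-left corner and increases row by row. The short sketch below, using only NumPy, shows how the index grid maps each chart type (a column) and each cluster (a row) to a subplot id; the names `n_rows` and `n_cols` are illustrative stand-ins for `line_index` and `col_index`.

```python
import numpy as np

n_rows, n_cols = 3, 5  # same grid as line_index, col_index in the article
ax_ids = np.arange(1, n_rows * n_cols + 1).reshape(n_rows, n_cols)
print(ax_ids)

# Column j holds the subplot ids of the j-th chart type, one per cluster row.
first_col = ax_ids[:, 0].tolist()
last_col = ax_ids[:, 4].tolist()
print(first_col, last_col)
```

So `ax_ids[:,0][ind]` in part 2 places the pie chart of cluster `ind` in the first column of row `ind`, and the other parts follow the same pattern with columns 1 to 4.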
