Unsupervised Learning Project: Creating Customer Segments

Posted by dicksonjyl560101 on 2019-05-02

Original URL: http://blog.itpub.net/29829936/viewspace-2643138/

https://www.toutiao.com/a6685626606284702220/

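Throughout this project, we will analyze the spending behavior of a number of customers across several product categories. The main goals of the project are:

Group customers into clusters with similar spending characteristics.

Describe the variation within the different clusters in order to find the best delivery structure for each group.

To carry out this project, we will use the wholesale customers dataset available from the UCI Machine Learning Repository.

We will focus our analysis on the six product categories recorded for customers, excluding the 'Channel' and 'Region' fields.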


    # Import libraries necessary for this project
    import numpy as np
    import pandas as pd
    from IPython.display import display  # Allows the use of display() for DataFrames

    # Import supplementary visualizations code visuals.py
    import visuals as vs

    # Pretty display for notebooks (IPython/Jupyter magic)
    %matplotlib inline

    # Load the wholesale customers dataset
    try:
        data = pd.read_csv("customers.csv")
        data.drop(['Region', 'Channel'], axis=1, inplace=True)
        print("Wholesale customers dataset has {} samples with {} features each.".format(*data.shape))
    except FileNotFoundError:
        print("Dataset could not be loaded. Is the dataset missing?")

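Data Exploration

We will now explore the dataset through visualizations and code in order to understand the relationships between features. We will also compute a statistical description of the dataset and consider the overall relevance of each feature.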


    # Display a description of the dataset
    display(data.describe())

    # Display the head of the dataset
    data.head()

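Selecting Samples

To better understand our dataset and how the data is transformed through the analysis, we will select a few sample points and explore them in detail.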


    # Select three indices to sample from the dataset
    indices = [85, 181, 338]

    # Create a DataFrame of the chosen samples
    samples = pd.DataFrame(data.loc[indices], columns=data.keys()).reset_index(drop=True)
    print("Chosen samples of wholesale customers dataset:")
    display(samples)

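Considerations

Let us now consider the total purchase cost of each product category, together with the statistical description of the dataset above, for our sample customers. Suppose we had to predict what kind of establishment each of the three chosen samples represents.

Consider the mean values:

Fresh: 12000.2977
Milk: 5796.2
Grocery: 7951.3
Frozen: 3071.9
Detergents_Paper: 2881.4
Delicatessen: 1524.8

We can make the following predictions:

Index 85: Retailer

The highest spending in the whole dataset on detergents and paper and on grocery, which are typically household products.
Above-average spending on milk.
Below-average spending on frozen products.

Index 181: Large market

High spending in almost every product category.
The highest spending on fresh products in the whole dataset, so it is likely a large market.
Low spending on detergents.

Index 338: Restaurant

The amount of every product is significantly lower than for the previous two customers.
Spending on fresh products is the lowest in the whole dataset.
Spending on milk and on detergents and paper is in the bottom quartile.
It may be a small, inexpensive restaurant that needs grocery and frozen food to serve its meals.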

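Feature Relevance

We will now analyze the relevance of the features in order to understand customers' purchasing behavior. In other words, we want to determine whether a customer who purchases some amount of one product category will necessarily purchase a proportional amount of another category.

We will study this by removing one feature, training a supervised regression learner on the remaining features, and then evaluating how well that model can predict the removed feature.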


    # Display the first row of the dataset
    data.head(1)

    # Make a copy of the DataFrame, using the 'drop' function to drop the given feature
    new_data = data.drop('Grocery', axis=1)

    # Split the data into training and testing sets (25% for testing),
    # using the dropped feature as the target. Set a random state.
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(
        new_data, data.Grocery, test_size=0.25, random_state=42)

    # Create a decision tree regressor and fit it to the training set
    # (random_state is fixed so repeated runs give the same score)
    from sklearn.tree import DecisionTreeRegressor
    regressor = DecisionTreeRegressor(random_state=42)
    regressor = regressor.fit(X_train, y_train)
    prediction = regressor.predict(X_test)

    # Report the score of the prediction using the testing set
    from sklearn.metrics import r2_score
    score = r2_score(y_test, prediction)
    print("Prediction score is: {}".format(score))

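We attempted to predict the Grocery feature.
The reported prediction score (R²) was 67.25%.
A high score indicates a good fit. This feature is therefore easy to predict from the other spending habits, so it is not strictly necessary for identifying customers' spending habits.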
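To make the score above easier to interpret, here is a minimal sketch of how the coefficient of determination is defined. The helper name `r2_manual` and the toy values are assumptions for illustration only; they are not part of the original project.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical target values, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r2_manual(y_true, y_true)                   # exact predictions
baseline = r2_manual(y_true, [6.0, 6.0, 6.0, 6.0])    # always predict the mean
print(perfect, baseline)
```

A model that always predicts the mean scores 0 and a perfect model scores 1, so a score around 0.67 means the other features account for roughly two thirds of the variation in the removed feature on the test set.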

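Visualizing Feature Distributions

To better understand our dataset, we will display a scatter matrix of all the product features.

Product features that show a correlation in the scatter matrix are relevant for predicting other features.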


    # Produce a scatter matrix for each pair of features in the data
    # (pd.scatter_matrix is no longer available in current pandas; use pd.plotting.scatter_matrix)
    pd.plotting.scatter_matrix(data, alpha=0.3, figsize=(14, 8), diagonal='kde');

    # Display a correlation matrix
    import seaborn as sns
    sns.heatmap(data.corr())

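Using the scatter matrix and the correlation matrix as references, we can infer the following:

The data is not normally distributed. It is positively skewed and resembles a log-normal distribution.
In most plots, most of the data points lie near the origin and show almost no correlation with each other.
From the scatter plots and the correlation heatmap, we can see a strong correlation between the 'Grocery' and 'Detergents_Paper' features. The 'Grocery' and 'Milk' features also show a good degree of correlation.
This correlation confirms my guess about the relevance of the 'Grocery' feature, which can be predicted accurately from the 'Detergents_Paper' feature. It is therefore not an absolutely necessary feature in the dataset.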
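As a small illustration of what the correlation matrix measures, the sketch below computes Pearson correlations with NumPy. The three columns are synthetic stand-ins (an assumption, not the real customer data): `paper` is deliberately built to move with `grocery`, while `fresh` is independent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for three spending columns (not the real dataset)
grocery = rng.lognormal(mean=8.0, sigma=1.0, size=500)
paper = 0.4 * grocery + rng.normal(0.0, 200.0, size=500)   # built to follow grocery
fresh = rng.lognormal(mean=8.0, sigma=1.0, size=500)        # independent of grocery

# Rows are variables; corr[i, j] is the Pearson correlation of row i and row j
corr = np.corrcoef(np.vstack([grocery, paper, fresh]))
print(np.round(corr, 2))
```

A pair like `grocery` and `paper` shows up as a near-diagonal band in a scatter matrix and as a bright cell in a heatmap, which is the pattern described above for 'Grocery' and 'Detergents_Paper'.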

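Data Preprocessing

This step is critical for ensuring that the results obtained are meaningful. We will preprocess the data by scaling it and detecting potential outliers.

Feature Scaling

When data is not normally distributed, especially when the mean and median differ significantly (which indicates a large skew), it is usually most appropriate to apply a non-linear scaling. This is particularly true for financial data.

One way to achieve this scaling is the Box-Cox transformation, which computes the best power transformation of the data to reduce skewness. A simpler approach that works in most cases is to apply the natural logarithm.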


    # Scale the data using the natural logarithm
    log_data = np.log(data)

    # Scale the sample data using the natural logarithm
    log_samples = np.log(samples)

    # Produce a scatter matrix for each pair of newly-transformed features
    pd.plotting.scatter_matrix(log_data, alpha=0.3, figsize=(14, 8), diagonal='kde');

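Summary

After applying natural-logarithm scaling to the data, the distribution of each feature appears much more normal. For the feature pairs we previously identified as correlated, the correlation is still present (it is worth checking whether it is now stronger or weaker than before).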
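The effect of the log transform on skew can also be checked numerically rather than only by eye. The following is a minimal sketch on synthetic right-skewed data (an assumption; it does not use the customer dataset), with a hypothetical `skewness` helper that computes the mean cubed z-score.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the mean of the cubed z-scores."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

rng = np.random.default_rng(42)
spend = rng.lognormal(mean=8.0, sigma=1.0, size=2000)   # synthetic, right-skewed

skew_raw = skewness(spend)           # strongly positive
skew_log = skewness(np.log(spend))   # close to zero
print(round(skew_raw, 2), round(skew_log, 2))
```

A skewness near zero after the transform is consistent with the more symmetric, bell-shaped diagonals seen in the second scatter matrix.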

Display the log-transformed sample data:


    # Display the log-transformed sample data
    display(log_samples)

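Outlier Detection

Detecting outliers in the data is extremely important in the preprocessing step of any analysis. The presence of outliers can often skew results that take these data points into account.

Here we will use Tukey's method for identifying outliers: an outlier step is calculated as 1.5 times the interquartile range (IQR). A data point with a feature value more than one outlier step below Q1 or above Q3 for that feature is considered an outlier.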
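Before applying the rule to every feature, it may help to see it on a single small array. This is a minimal sketch; the function name `tukey_outlier_mask` and the toy values are assumptions for illustration.

```python
import numpy as np

def tukey_outlier_mask(values, k=1.5):
    """Boolean mask of points lying more than k * IQR below Q1 or above Q3."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    step = k * (q3 - q1)                      # the "outlier step"
    return (values < q1 - step) | (values > q3 + step)

# Hypothetical log-spending values with one unusually low entry
toy = np.array([8.1, 8.4, 8.6, 8.9, 9.0, 9.2, 9.5, 2.0])
mask = tukey_outlier_mask(toy)
print(toy[mask])
```

Only the value 2.0 falls outside the fences here. The project code applies the same test feature by feature and collects the row indices that fail it.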


    outliers = []

    # For each feature find the data points with extreme high or low values
    for feature in log_data.keys():

        # Calculate Q1 (25th percentile of the data) for the given feature
        Q1 = np.percentile(log_data[feature], 25)

        # Calculate Q3 (75th percentile of the data) for the given feature
        Q3 = np.percentile(log_data[feature], 75)

        # Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
        step = 1.5 * (Q3 - Q1)

        # Display the outliers
        print("Data points considered outliers for the feature '{}':".format(feature))
        is_outlier = ~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))
        display(log_data[is_outlier])

        # Keep the row indices of this feature's outliers
        lista = log_data[is_outlier].index.tolist()
        outliers.append(lista)

    outliers


    # Detecting outliers that appear in more than one product
    seen = {}
    dupes = []

    for lista in outliers:
        for index in lista:
            if index not in seen:
                seen[index] = 1
            else:
                # Record the index only the first time it repeats
                if seen[index] == 1:
                    dupes.append(index)
                seen[index] += 1

    dupes = sorted(dupes)
    dupes

    # Removing outliers
    good_data = log_data.drop(dupes, axis=0).reset_index(drop=True)

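Summary

The data points considered outliers for more than one feature are: 65, 66, 75, 128, 154.
K-Means is strongly affected by outliers, because they significantly increase the loss function that the algorithm tries to minimize. That loss is the sum of squared distances from each data point to its centroid, so an outlier that is far enough away will pull the centroid to the wrong position. The outliers should therefore be removed.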
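The pull that a single outlier exerts on a centroid is easy to see with a few points. The sketch below uses made-up 2-D coordinates (an assumption) and a hypothetical helper `centroid_and_sse` that returns a cluster's mean and its sum of squared distances.

```python
import numpy as np

def centroid_and_sse(points):
    """Return the mean of the points and the sum of squared distances to it."""
    center = points.mean(axis=0)
    sse = np.sum((points - center) ** 2)
    return center, sse

# A tight hypothetical cluster and one far-away point
cluster = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [1.0, 1.0]])
outlier = np.array([[9.0, 9.0]])

c_clean, sse_clean = centroid_and_sse(cluster)
c_dirty, sse_dirty = centroid_and_sse(np.vstack([cluster, outlier]))
print(c_clean, sse_clean)
print(c_dirty, sse_dirty)
```

With the outlier included, the centroid moves from (1.0, 1.0) to (2.6, 2.6), a location where no regular point lies, and the loss grows by several orders of magnitude.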

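Feature Transformation

We will now use principal component analysis (PCA) to draw conclusions about the hidden structure of the dataset. PCA computes the dimensions that maximize variance, so we will find the combinations of features that best describe each customer.

Principal Component Analysis (PCA)

Once the data has been scaled to a more normal distribution and the necessary outliers have been removed, we can apply PCA to good_data to discover which dimensions of the data best maximize the variance of the features involved.

In addition to finding these dimensions, PCA also reports the explained variance ratio of each dimension, that is, how much of the variance in the data is explained by that dimension alone.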
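To show where the explained variance ratio comes from, here is a minimal NumPy sketch that derives it from the singular values of the centered data. It runs on synthetic data (an assumption), and `pca_explained_variance_ratio` is a hypothetical helper, not the scikit-learn API used in the project.

```python
import numpy as np

def pca_explained_variance_ratio(X):
    """Explained variance ratio per principal component, via SVD of centered data."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # PCA works on centered data
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / (X.shape[0] - 1)               # variance along each component
    return var / var.sum(), vt                    # ratios and component directions

rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 1))
# Two columns driven by the same latent factor, plus one independent column
X = np.hstack([3.0 * latent, 2.0 * latent, rng.normal(size=(300, 1))])
X = X + rng.normal(scale=0.1, size=X.shape)

ratio, components = pca_explained_variance_ratio(X)
print(np.round(ratio, 3))
```

Because two of the three columns share one underlying factor, the first component captures most of the variance. The ratios always sum to 1 and come out in decreasing order.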


    # Get the shape of the log_samples
    log_samples.shape

    # Apply PCA by fitting the good data with the same number of dimensions as features
    from sklearn.decomposition import PCA
    pca = PCA(n_components=good_data.shape[1])
    pca = pca.fit(good_data)

    # Transform log_samples using the PCA fit above
    pca_samples = pca.transform(log_samples)

    # Generate PCA results plot
    pca_results = vs.pca_results(good_data, pca)

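Summary

The first two principal components explain 70.68% of the total variance.
The first four principal components explain 93.11% of the total variance.

Dimensions Discussion

Dimension 1: In terms of negative weights, this dimension mainly represents Detergents_Paper, Milk and Grocery: mostly products for everyday consumption.
Dimension 2: In terms of negative weights, this dimension mainly represents Fresh, Frozen and Delicatessen: mostly food products.
Dimension 3: This dimension is well represented by a positive weight on Delicatessen and a negative weight on Fresh: food to be consumed the same day.
Dimension 4: This dimension is well represented by a positive weight on Frozen and a negative weight on Delicatessen: food that can be stored.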


    # Display sample log-data after having a PCA transformation applied
    display(pd.DataFrame(np.round(pca_samples, 4), columns=pca_results.index.values))

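Dimensionality Reduction

When using principal component analysis, one of the main goals is to reduce the dimensionality of the data.

Dimensionality reduction comes at a cost: using fewer dimensions means that less of the total variance in the data is explained. The cumulative explained variance ratio is therefore very important for knowing how many dimensions the problem needs. In addition, if a significant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.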
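The trade-off can be made concrete with a cumulative sum. In this sketch the ratios are hypothetical placeholders (not the values from this project), and `n_components_for` is an illustrative helper that returns the smallest number of leading components reaching a target share of variance.

```python
import numpy as np

def n_components_for(ratios, threshold):
    """Smallest number of leading components whose cumulative
    explained variance ratio reaches `threshold`."""
    cumulative = np.cumsum(np.asarray(ratios, dtype=float))
    # Index of the first cumulative value >= threshold, converted to a count
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))

# Hypothetical ratios, in the decreasing order PCA reports them
ratios = [0.45, 0.25, 0.12, 0.10, 0.05, 0.03]
k65 = n_components_for(ratios, 0.65)
k90 = n_components_for(ratios, 0.90)
print(k65, k90)
```

With a real fit, the same helper could be fed `pca.explained_variance_ratio_` from scikit-learn to decide how many dimensions to keep.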


    # Apply PCA by fitting the good data with only two dimensions
    pca = PCA(n_components=2).fit(good_data)

    # Transform the good data using the PCA fit above
    reduced_data = pca.transform(good_data)

    # Transform log_samples using the PCA fit above
    pca_samples = pca.transform(log_samples)

    # Create a DataFrame for the reduced data
    reduced_data = pd.DataFrame(reduced_data, columns=['Dimension 1', 'Dimension 2'])

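The cell below shows how the log-transformed sample data changes after applying the PCA transformation with only two dimensions. Observe how the values for the first two dimensions remain unchanged compared with the PCA transformation in six dimensions.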
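The reason the first two columns stay the same is that keeping two components simply drops the remaining directions from the same decomposition. A minimal NumPy sketch on synthetic six-column data (an assumption, not the customer data) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic data with six columns of decreasing spread
X = rng.normal(size=(200, 6)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

# Rows of vt are the principal directions, ordered by explained variance
_, _, vt = np.linalg.svd(Xc, full_matrices=False)

full_scores = Xc @ vt.T       # projection onto all six components
two_scores = Xc @ vt[:2].T    # projection onto the first two only

same = np.allclose(full_scores[:, :2], two_scores)
print(same)
```

In scikit-learn the two fits are separate objects, so agreement there relies on the solver producing the same leading components, which is what the comparison above invites the reader to observe.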


    # Display sample log-data after applying PCA transformation in two dimensions
    display(pd.DataFrame(np.round(pca_samples, 4), columns=['Dimension 1', 'Dimension 2']))

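Visualizing a Biplot

A biplot is a scatter plot in which each data point is represented by its scores along the principal components. The axes are the principal components.

The biplot also shows the projection of the original features along the components. It can help us interpret the reduced dimensions of the data and discover relationships between the principal components and the original features.


    # Create a biplot
    vs.biplot(good_data, reduced_data, pca)

Once we have the original feature projections (in red), it is easier to interpret the relative position of each data point in the scatter plot.

For example, a point in the lower right corner of the plot is likely to correspond to a customer who spends a lot on 'Milk', 'Grocery' and 'Detergents_Paper', but not so much on the other product categories.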

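Clustering

In this section we will choose either the K-Means clustering algorithm or a Gaussian Mixture Model (GMM) clustering algorithm to identify the various customer segments hidden in the data.

We will then recover specific data points from the clusters to understand their significance by transforming them back into their original dimensions and scale.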

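K-Means vs GMM

The main advantages of using K-Means as a clustering algorithm are:

It is easy to implement.
With a large number of variables, it may be computationally faster than hierarchical clustering (if K is small).
It is consistent and scale-invariant.
It is guaranteed to converge.

The main advantages of using a Gaussian Mixture Model as a clustering algorithm are:

It is much more flexible in terms of cluster covariance. Each cluster can have an unconstrained covariance structure. In other words, whereas K-Means assumes that every cluster is spherical, GMM allows elliptical clusters.
Points can belong to different clusters with different levels of membership. This level of membership is the probability that each point belongs to each cluster.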
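The idea of soft membership can be sketched in a few lines. The example below computes membership probabilities for a one-dimensional mixture of two Gaussians whose parameters are fixed by hand (an assumption for illustration; a real GMM estimates them from the data). The helper names are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, evaluated elementwise."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def responsibilities(x, weights, means, stds):
    """Probability that each point belongs to each mixture component."""
    x = np.asarray(x, dtype=float)[:, None]                  # shape (n, 1)
    dens = np.asarray(weights) * gaussian_pdf(x, np.asarray(means), np.asarray(stds))
    return dens / dens.sum(axis=1, keepdims=True)            # each row sums to 1

# Two components centered at 0 and 4, and three points to score
resp = responsibilities([0.0, 2.0, 4.0], weights=[0.5, 0.5],
                        means=[0.0, 4.0], stds=[1.0, 1.0])
hard = resp.argmax(axis=1)   # a K-Means-style hard assignment, for comparison
print(np.round(resp, 3))
```

The point halfway between the two centers gets a membership of 0.5 in each component, information that a hard assignment discards.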

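Choosing the algorithm:

The chosen algorithm is the Gaussian Mixture Model, because the data does not split into clear, distinct clusters and we do not know in advance how many clusters there are.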

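Creating Clusters

When the number of clusters is not known in advance, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data.

However, we can quantify the "goodness" of a clustering by computing the silhouette coefficient of each data point. The silhouette coefficient of a data point measures how similar it is to its assigned cluster, from -1 (dissimilar) to 1 (similar). Calculating the mean silhouette coefficient provides a simple scoring method for a given clustering.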
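For readers who want to see what the coefficient computes, here is a minimal NumPy sketch of the usual definition, (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the nearest other cluster. The function name and toy points are assumptions; the project itself uses scikit-learn's `silhouette_score`.

```python
import numpy as np

def silhouette_samples_np(X, labels):
    """Silhouette coefficient of every point, using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue                                  # singleton clusters score 0
        a = dist[i, own].sum() / (n_own - 1)          # dist[i, i] is 0, so self is excluded
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_samples_np(points, [0, 0, 1, 1]).mean()   # sensible grouping
bad = silhouette_samples_np(points, [0, 1, 0, 1]).mean()    # mixed-up grouping
print(round(good, 3), round(bad, 3))
```

The sensible grouping scores close to 1 while the mixed-up one is negative, which is why the mean coefficient works as a simple way to compare different numbers of clusters.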


    # Import the necessary libraries
    from sklearn.mixture import GaussianMixture
    from sklearn.metrics import silhouette_score

    scores = {}
    for i in range(2, 7):

        print('Number of clusters: ' + str(i))

        # Apply your clustering algorithm of choice to the reduced data
        clusterer = GaussianMixture(random_state=42, n_components=i)
        clusterer.fit(reduced_data)

        # Predict the cluster for each data point
        preds = clusterer.predict(reduced_data)

        # Find the cluster centers
        centers = clusterer.means_
        print('Cluster Center: ' + str(centers))

        # Predict the cluster for each transformed sample data point
        sample_preds = clusterer.predict(pca_samples)
        print('Sample predictions: ' + str(sample_preds))

        # Calculate the mean silhouette coefficient for the number of clusters chosen
        score = silhouette_score(reduced_data, preds)
        scores[i] = score
        print('Silhouette score is: ' + str(score), '\n')

    print('Scores: ' + str(scores))

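The number of clusters with the best silhouette score is 2, with a score of 0.42.

Cluster Visualization

Once we have chosen the optimal number of clusters for the clustering algorithm using the scoring metric above, we can visualize the results in the code block below.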


    # Apply your clustering algorithm of choice to the reduced data
    clusterer = GaussianMixture(random_state=42, n_components=2)
    clusterer.fit(reduced_data)

    # Predict the cluster for each data point
    preds = clusterer.predict(reduced_data)

    # Find the cluster centers
    centers = clusterer.means_
    print('Cluster Center: ' + str(centers))

    # Predict the cluster for each transformed sample data point
    sample_preds = clusterer.predict(pca_samples)
    print('Sample predictions: ' + str(sample_preds))

    # Calculate the mean silhouette coefficient for the number of clusters chosen
    # (the scores dict is not updated here, since the loop variable i is stale)
    score = silhouette_score(reduced_data, preds)
    print('Silhouette score is: ' + str(score), '\n')

    # Display the results of the clustering from implementation
    vs.cluster_results(reduced_data, preds, centers, pca_samples)

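Data Recovery

Each cluster in the visualization above has a central point. These centers are not specific data points from the data, but rather the averages of all the data points predicted to be in the respective clusters.

For the problem of creating customer segments, a cluster's center point corresponds to the average customer of that segment. Since the data is currently reduced in dimension and log-scaled, we can recover the representative customer spending from these data points by applying the inverse transformations.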
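The two inverse steps can be written out by hand to show what they do: undo the PCA projection by multiplying by the components and adding back the mean, then undo the logarithm with an exponential. The sketch below uses synthetic positive data and a stand-in "center" (both assumptions); it mirrors the idea behind `pca.inverse_transform` followed by `np.exp` rather than reproducing the project's numbers.

```python
import numpy as np

rng = np.random.default_rng(3)
spend = rng.lognormal(mean=8.0, sigma=0.5, size=(100, 4))   # synthetic positive spending
log_spend = np.log(spend)

# Fit a two-component PCA by hand: center, then take the top two directions
mean = log_spend.mean(axis=0)
_, _, vt = np.linalg.svd(log_spend - mean, full_matrices=False)
components = vt[:2]
reduced = (log_spend - mean) @ components.T

# Stand-in for a cluster center: the average of points on one side of component 1
subset = reduced[reduced[:, 0] > 0]
center_reduced = subset.mean(axis=0, keepdims=True)

center_log = center_reduced @ components + mean   # undo the PCA projection
center_spend = np.exp(center_log)                 # undo the log scaling
print(np.round(center_spend))
```

The recovered center is in the original units, so it can be compared directly with the statistics from `data.describe()`. Information in the discarded components is not recovered, so the result is an approximation of a typical customer.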


    # Inverse transform the centers
    log_centers = pca.inverse_transform(centers)

    # Exponentiate the centers
    true_centers = np.exp(log_centers)

    # Display the true centers
    segments = ['Segment {}'.format(i) for i in range(0, len(centers))]
    true_centers = pd.DataFrame(np.round(true_centers), columns=data.keys())
    true_centers.index = segments
    display(true_centers)

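Segment 0 may represent a fresh food market, since every feature except Frozen and Fresh is below the median.
Segment 1 may represent a supermarket, since every feature except Fresh and Frozen is above the median.

The code below shows which cluster each sample point is predicted to belong to.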


    # Display the predictions
    for i, pred in enumerate(sample_preds):
        print("Sample point", i, "predicted to be in Cluster", pred)

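Conclusions

Sample point 0 → supermarket, whereas the original guess was a retailer. The difference may be explained by the size of the cluster, which is quite large.
Sample point 1 → supermarket, which agrees with the original guess.
Sample point 2 → fresh food market, whereas the original guess was a restaurant. This is reasonable considering the spending amounts across the features.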

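Summary

How can the wholesale distributor label new customers using only their estimated product spending and the customer segment data?

A supervised learning algorithm can be used, with the estimated product spending as the features and the customer segment as the target variable, which makes it a classification problem. Since there is no clear mathematical relationship between customer segment and product spending, KNN could be a good choice of algorithm.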
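As a rough sketch of that idea, the code below implements a tiny k-nearest-neighbors classifier in NumPy. The training rows, segment labels and new customers are invented for illustration (an assumption), and `knn_predict` is a hypothetical helper; in practice scikit-learn's `KNeighborsClassifier` would be the natural choice.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """Label each new point by majority vote among its k nearest training points.
    Labels are assumed to be non-negative integers (segment ids)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    preds = []
    for x in np.asarray(X_new, dtype=float):
        dist = np.linalg.norm(X_train - x, axis=1)      # Euclidean distances
        nearest = y_train[np.argsort(dist)[:k]]          # labels of the k closest
        preds.append(np.bincount(nearest).argmax())      # majority vote
    return np.array(preds)

# Hypothetical log-spending on two categories, labeled with segment ids
X_train = [[9.0, 7.0], [9.2, 7.1], [8.9, 6.8],
           [7.0, 9.0], [7.2, 9.1], [6.9, 8.8]]
y_train = [0, 0, 0, 1, 1, 1]

new_customers = [[9.1, 7.0], [7.1, 9.0]]
new_labels = knn_predict(X_train, y_train, new_customers, k=3)
print(new_labels)
```

The segment labels produced by the clustering step would play the role of `y_train`, so each new customer inherits the segment of the existing customers with the most similar estimated spending.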

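Visualizing Underlying Distributions

At the beginning of this project, we discussed that the 'Channel' and 'Region' features would be excluded from the dataset so that the customer product categories were emphasized in the analysis. By reintroducing the 'Channel' feature into the dataset, an interesting structure emerges when considering the same PCA dimensionality reduction applied earlier to the original dataset.

The code block below shows how each data point is labeled either 'HoReCa' (Hotel/Restaurant/Cafe) or 'Retail' in the reduced space.


    # Display the clustering results based on 'Channel' data
    vs.channel_results(reduced_data, preds, pca_samples)

We can observe that the clustering algorithm does a good job of clustering the data in line with the underlying distribution, since cluster 0 can be associated closely with retailers and cluster 1 with HoReCa (Hotel/Restaurant/Cafe) customers.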

From the ITPUB blog, link: http://blog.itpub.net/29829936/viewspace-2643138/. If you repost this article, please credit the source; otherwise legal liability may be pursued.
