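Grouping Supermarket Customers with K-Means
Main steps:
- 1. Import packages
- 2. Import the dataset
- 3. Choose the optimal K with the elbow method
- 4. Cluster with K=5
- 5. Visualize the clusters
- 6. Take action
- 7. Generate a Swiss roll and cluster it
Dataset link: https://www.heywhale.com/mw/dataset/6230697d5f17950018ee88b5/file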
1. Import packages
In [1]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
2. Import the dataset
In [2]:
# Import the dataset
dataset = pd.read_csv('Mall_Customers.csv')
dataset
Out[2]:
So that the clusters can be plotted in two dimensions, only two columns are used: Annual Income (k$) and Spending Score (1-100).
In [3]:
X = dataset.iloc[:, [3, 4]].values
X[:3, :]
Out[3]:
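Selecting columns by position with iloc depends on the column order in the CSV file. The sketch below shows the same selection done by column name. The small DataFrame is made up for illustration (it is not the real dataset), and the names of its first three columns are assumptions; only the two feature column names come from the text above.

```python
import pandas as pd

# Hypothetical rows that only mimic the column layout (not real data).
# The first three column names are placeholders chosen for this sketch.
sample = pd.DataFrame({
    'col_0': [1, 2, 3],
    'col_1': ['a', 'b', 'c'],
    'col_2': [20, 35, 50],
    'Annual Income (k$)': [15, 60, 120],
    'Spending Score (1-100)': [40, 55, 80],
})

# Select the two features by name instead of by position
features = ['Annual Income (k$)', 'Spending Score (1-100)']
X_by_name = sample[features].to_numpy()

# The positional selection used in the notebook
X_by_position = sample.iloc[:, [3, 4]].values

# Both approaches give the same array when the columns sit at positions 3 and 4
print((X_by_name == X_by_position).all())
```

Selecting by name keeps working if columns are reordered, and it fails loudly if a column is missing.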
3. Choose the optimal K with the elbow method
In [4]:
# Fit K-Means for K = 1..10 and record the WCSS (inertia_) for each K
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', n_init=10, max_iter=300, random_state = 0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
In [5]:
# Plot the number of clusters against WCSS
plt.figure()
plt.plot(range(1, 11), wcss, 'ro-')
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
From K=5 onward the drop in WCSS is no longer pronounced, so K=5 is taken as the elbow and used as the number of clusters.
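The WCSS used in the elbow plot is the within-cluster sum of squares: the sum, over all points, of the squared distance to the centroid of the cluster each point is assigned to. As a minimal sketch, the code below recomputes that value by hand and compares it with the inertia_ attribute. It uses synthetic data from make_blobs as a stand-in (an assumption for this sketch, not the mall dataset), and the names X_demo, km and wcss_manual are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D stand-in data (assumption: not the mall dataset)
X_demo, _ = make_blobs(n_samples=200, centers=5, n_features=2, random_state=0)

km = KMeans(n_clusters=5, init='k-means++', n_init=10, max_iter=300, random_state=0)
km.fit(X_demo)

# Look up the centroid assigned to every point
assigned_centroids = km.cluster_centers_[km.labels_]

# WCSS: sum of squared Euclidean distances from each point to its centroid
wcss_manual = float(((X_demo - assigned_centroids) ** 2).sum())

print(wcss_manual, km.inertia_)
```

The two printed values should agree up to floating-point rounding.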
4. Cluster with K=5
In [6]:
# Run K-Means with the chosen K
kmeans = KMeans(n_clusters = 5, init = 'k-means++', n_init=10, max_iter=300, random_state = 0)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
In [7]:
y_kmeans
Out[7]:
5. Visualize the clusters
In [8]:
# Plot each cluster in its own color, plus the centroids
plt.figure()
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
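6. Take action
- Cluster 1: medium income, medium spending;
- Cluster 2: low income, high spending; look into which products this group mainly buys;
- Cluster 3: high income, high spending;
- Cluster 4: low income, low spending;
- Cluster 5: high income, low spending; offer this group coupons or discount cards to encourage them to spend.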
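The label numbers K-Means assigns are arbitrary, so it helps to confirm which cluster is which before acting on a description like the ones above. One possible way, sketched here, is to tabulate the mean income and spending score per cluster. The data below is randomly generated around five hypothetical group centres (an assumption for illustration, not the real customers), and names such as centres, points and profile are made up for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical (income, score) group centres, chosen only for illustration
centres = np.array([[55, 50], [25, 80], [85, 80], [25, 20], [85, 20]])
points = np.vstack([c + rng.normal(scale=3.0, size=(40, 2)) for c in centres])

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(points)

df = pd.DataFrame(points, columns=['Annual Income (k$)', 'Spending Score (1-100)'])
df['Cluster'] = labels + 1  # 1-based, to match the 'Cluster 1'..'Cluster 5' legend

# Mean of each feature and number of customers per cluster
profile = df.groupby('Cluster').mean().round(1)
profile['Customers'] = df.groupby('Cluster').size()
print(profile)
```

Reading the rows of profile shows which numbered cluster is, for example, the high-income, low-spending group.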
7. Generate a Swiss roll and cluster it
In [10]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn import manifold, datasets
import matplotlib.pyplot as plt
# Generate a Swiss roll dataset (noise defaults to 0.0, so no noise is added)
X,color = datasets.make_swiss_roll(n_samples=3000)
# Partition the data into 3 K-Means clusters
clusters_swiss_roll = KMeans(n_clusters=3,random_state=1).fit_predict(X)
fig2 = plt.figure(figsize=(10,10))
ax = fig2.add_subplot(111,projection='3d')
ax.scatter(X[:,0],X[:,1],X[:,2],c = clusters_swiss_roll,cmap = 'Spectral')
plt.show()
As the plot above shows, K-Means groups the points into 3 clusters based on Euclidean distance.
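Because the grouping is based on straight-line distance in 3-D, a cluster is not guaranteed to follow the rolled-up sheet. As a rough check, the sketch below prints, for each cluster, how much of the roll's position parameter it spans. That parameter is the second value returned by make_swiss_roll (named color in the cell above and t here). The names X_roll, t and labels_roll, and the choice of random_state=0 for data generation, are assumptions for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_swiss_roll

# t is each sample's position along the rolled-up sheet
X_roll, t = make_swiss_roll(n_samples=3000, random_state=0)

labels_roll = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_roll)

# For each cluster, report its size and the range of t it covers
for k in range(3):
    t_k = t[labels_roll == k]
    print(f'cluster {k}: {t_k.size} points, t from {t_k.min():.2f} to {t_k.max():.2f}')
```

If the t ranges of different clusters overlap heavily, the clusters cut across layers of the roll rather than following it.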