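Project 2: Telecom Customer Churn Analysis and Prediction

Posted by 红酒人生 on 2024-07-28

Telecom Customer Churn Analysis and Prediction

Background
Questions
Understanding the Data
Data Cleaning
Visual Analysis
Churn Prediction
Conclusions and Recommendations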

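1. Background

There is a well-known claim about user retention: cutting the churn rate by 5% can raise company profit by 25%-85%. Persistently high customer acquisition costs have pushed mobile operators up against a ceiling, and some now struggle to acquire customers at all. As the market saturates, operators urgently need to increase user stickiness and extend the user lifecycle, which makes churn analysis and prediction essential. The dataset is the "mobile operator customer dataset" from kesci.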

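2. Questions

Analyze the relationship between user characteristics and churn.
Overall, what characteristics do churned users typically share?
Try to find a suitable model for predicting churn.
Give targeted recommendations for increasing user stickiness and preventing churn.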

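3. Understanding the Data

The dataset has 21 fields and 7,043 records. Each record describes the characteristics of one unique customer.
Our goal is to find the relationship between the first 20 columns (a customer ID plus 19 features) and the last column, which records whether the customer churned.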

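4. Data Cleaning

Data cleaning follows four checks, remembered in Chinese by the mnemonic "完全合一" (completeness, comprehensiveness, validity, uniqueness):

Completeness: does any record contain null values, and are all the fields needed for the statistics present?
Comprehensiveness: look at all the values in a column and use common sense to judge whether something is off, for example in the data definition, the units, or the values themselves.
Validity: are the type, content, and magnitude of the data legitimate? For example, non-ASCII characters in the data, a gender recorded as unknown, or an age above 150.
Uniqueness: are there duplicate records? Data is often merged from several channels, so duplicates are common. Both rows and columns should be unique.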
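The four checks above can be sketched in pandas. This is only an illustration on a small made-up frame (the `toy` data, the allowed gender values, and the 0-72 month tenure range are assumptions for the example, not part of the original analysis):

```python
import pandas as pd

# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "customerID": ["A1", "A2", "A2", "A4"],
    "gender": ["Male", "Female", "Female", "Unknown"],
    "tenure": [1, 24, 24, 200],
    "TotalCharges": ["29.85", "1889.5", "1889.5", " "],
})

# Completeness: count true nulls in the whole frame.
null_total = toy.isnull().sum().sum()

# Comprehensiveness: blank strings are not nulls, so look at the values themselves.
blank_total = toy["TotalCharges"].str.strip().eq("").sum()

# Validity: values outside an allowed set or a plausible range.
bad_gender = (~toy["gender"].isin(["Male", "Female"])).sum()
bad_tenure = (~toy["tenure"].between(0, 72)).sum()  # assumed plausible range

# Uniqueness: fully duplicated rows and duplicated IDs.
dup_rows = toy.duplicated().sum()
dup_ids = toy["customerID"].duplicated().sum()

print(null_total, blank_total, bad_gender, bad_tenure, dup_rows, dup_ids)
```

Note that the blank `TotalCharges` entry passes the null check and is only caught by looking at the values, which is the same situation the real dataset runs into later.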
Dataset download:
Link: https://pan.baidu.com/s/1NIg-4X_ajfeaMr7hB1rScQ?pwd=49kk
Extraction code: 49kk

# 1. Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

# 2. Load the dataset file
customerDF = pd.read_csv('./data/WA_Fn-UseC_-Telco-Customer-Churn.csv')

# 3. Check the dataset size
customerDF.shape

[Output]:

(7043, 21)

# 4. Show all columns without truncation
pd.set_option('display.max_columns',None)

# 5. View the first 10 rows
customerDF.head(10)

[Output]:

# 6. Count null values in each column
pd.isnull(customerDF).sum()

[Output]:

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

# 7.1 View data types; info() here and dtypes in 7.2 report the same types
customerDF.info()

[Output]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

# 7.2 View data types
customerDF.dtypes

[Output]:

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

# 8. Convert 'TotalCharges' (total charges) to float
# × 8.1 Error: some strings cannot be converted to numbers, ValueError: could not convert string to float
customerDF[['TotalCharges']].astype(float)

[Error output]:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-7c93c9019d13> in <module>
      1 # 8. Convert 'TotalCharges' (total charges) to float
      2 # × 8.1 Error: some strings cannot be converted to numbers, ValueError: could not convert string to float
----> 3 customerDF[['TotalCharges']].astype(float)

D:\mysoft\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors)
   5875         else:
   5876             # else, only a single dtype is given
-> 5877             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5878             return self._constructor(new_data).__finalize__(self, method="astype")
   5879 

D:\mysoft\anaconda3\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, copy, errors)
    629         self, dtype, copy: bool = False, errors: str = "raise"
    630     ) -> "BlockManager":
--> 631         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    632 
    633     def convert(

D:\mysoft\anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    425                     applied = b.apply(f, **kwargs)
    426                 else:
--> 427                     applied = getattr(b, f)(**kwargs)
    428             except (TypeError, NotImplementedError):
    429                 if not ignore_failures:

D:\mysoft\anaconda3\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors)
    671             vals1d = values.ravel()
    672             try:
--> 673                 values = astype_nansafe(vals1d, dtype, copy=True)
    674             except (ValueError, TypeError):
    675                 # e.g. astype_nansafe can fail on object-dtype of strings

D:\mysoft\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
   1095     if copy or is_object_dtype(arr) or is_object_dtype(dtype):
   1096         # Explicit copy, or required since NumPy can't view from / to object.
-> 1097         return arr.astype(dtype, copy=True)
   1098 
   1099     return arr.view(dtype)

ValueError: could not convert string to float: ''

# 8.2 Check each column's dtype, values, and counts in turn. "TotalCharges" (total charges) turns out to have 11 blank entries
# Show the value counts of every column
for x in customerDF.columns:
    counts = customerDF[x].value_counts()
    print('{0} row count: {1}'.format(x, counts.sum()))
    print('{0} dtype: {1}'.format(x, customerDF[x].dtype))
    print('{0} values:\n{1}\n'.format(x, counts))

[Output]:

customerID row count: 7043
customerID dtype: object
customerID values:
0463-TXOAK    1
1025-FALIX    1
7176-WIONM    1
5180-UCIIQ    1
2260-USTRB    1
             ..
6017-PPLPX    1
9588-YRFHY    1
0112-QAWRZ    1
9985-MWVIX    1
5095-AESKG    1
Name: customerID, Length: 7043, dtype: int64

gender row count: 7043
gender dtype: object
gender values:
Male      3555
Female    3488
Name: gender, dtype: int64

SeniorCitizen row count: 7043
SeniorCitizen dtype: int64
SeniorCitizen values:
0    5901
1    1142
Name: SeniorCitizen, dtype: int64

Partner row count: 7043
Partner dtype: object
Partner values:
No     3641
Yes    3402
Name: Partner, dtype: int64

Dependents row count: 7043
Dependents dtype: object
Dependents values:
No     4933
Yes    2110
Name: Dependents, dtype: int64

tenure row count: 7043
tenure dtype: int64
tenure values:
1     613
72    362
2     238
3     200
4     176
     ... 
28     57
39     56
44     51
36     50
0      11
Name: tenure, Length: 73, dtype: int64

PhoneService row count: 7043
PhoneService dtype: object
PhoneService values:
Yes    6361
No      682
Name: PhoneService, dtype: int64

MultipleLines row count: 7043
MultipleLines dtype: object
MultipleLines values:
No                  3390
Yes                 2971
No phone service     682
Name: MultipleLines, dtype: int64

InternetService row count: 7043
InternetService dtype: object
InternetService values:
Fiber optic    3096
DSL            2421
No             1526
Name: InternetService, dtype: int64

OnlineSecurity row count: 7043
OnlineSecurity dtype: object
OnlineSecurity values:
No                     3498
Yes                    2019
No internet service    1526
Name: OnlineSecurity, dtype: int64

OnlineBackup row count: 7043
OnlineBackup dtype: object
OnlineBackup values:
No                     3088
Yes                    2429
No internet service    1526
Name: OnlineBackup, dtype: int64

DeviceProtection row count: 7043
DeviceProtection dtype: object
DeviceProtection values:
No                     3095
Yes                    2422
No internet service    1526
Name: DeviceProtection, dtype: int64

TechSupport row count: 7043
TechSupport dtype: object
TechSupport values:
No                     3473
Yes                    2044
No internet service    1526
Name: TechSupport, dtype: int64

StreamingTV row count: 7043
StreamingTV dtype: object
StreamingTV values:
No                     2810
Yes                    2707
No internet service    1526
Name: StreamingTV, dtype: int64

StreamingMovies row count: 7043
StreamingMovies dtype: object
StreamingMovies values:
No                     2785
Yes                    2732
No internet service    1526
Name: StreamingMovies, dtype: int64

Contract row count: 7043
Contract dtype: object
Contract values:
Month-to-month    3875
Two year          1695
One year          1473
Name: Contract, dtype: int64

PaperlessBilling row count: 7043
PaperlessBilling dtype: object
PaperlessBilling values:
Yes    4171
No     2872
Name: PaperlessBilling, dtype: int64

PaymentMethod row count: 7043
PaymentMethod dtype: object
PaymentMethod values:
Electronic check             2365
Mailed check                 1612
Bank transfer (automatic)    1544
Credit card (automatic)      1522
Name: PaymentMethod, dtype: int64

MonthlyCharges row count: 7043
MonthlyCharges dtype: float64
MonthlyCharges values:
20.05     61
19.85     45
19.90     44
19.95     44
19.65     43
          ..
87.65      1
35.30      1
114.85     1
56.50      1
97.25      1
Name: MonthlyCharges, Length: 1585, dtype: int64

TotalCharges row count: 7043
TotalCharges dtype: object
TotalCharges values:
           11
20.2       11
19.75       9
20.05       8
19.65       8
           ..
5166.2      1
1133.65     1
934.8       1
385.55      1
5832        1
Name: TotalCharges, Length: 6531, dtype: int64

Churn row count: 7043
Churn dtype: object
Churn values:
No     5174
Yes    1869
Name: Churn, dtype: int64

# 8.3 Force-convert "TotalCharges" (total charges) to float
# × Error: AttributeError: 'Series' object has no attribute 'convert_objects'
# × convert_objects was deprecated and is no longer available in pandas
customerDF['TotalCharges']=customerDF['TotalCharges'].convert_objects(convert_numeric=True)

[Error output]:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-bdd74b9e37bd> in <module>
      2 # × Error: AttributeError: 'Series' object has no attribute 'convert_objects'
      3 # × convert_objects was deprecated and is no longer available in pandas
----> 4 customerDF['TotalCharges']=customerDF['TotalCharges'].convert_objects(convert_numeric=True)

D:\mysoft\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5463             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5464                 return self[name]
-> 5465             return object.__getattribute__(self, name)
   5466 
   5467     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'convert_objects'

# 8.3.1 √ Fix for current pandas versions: convert with pd.to_numeric
customerDF['TotalCharges'] = pd.to_numeric(customerDF['TotalCharges'], errors='coerce')
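To see what `errors='coerce'` does in isolation, here is a small sketch on a made-up Series (the values are illustrative, not rows from the dataset). Entries that cannot be parsed as numbers become NaN instead of raising an error:

```python
import pandas as pd

# Illustrative values: two numeric strings, one blank, one empty string.
s = pd.Series(["29.85", " ", "1889.5", ""])

# Unparseable entries are coerced to NaN; the result is a float column.
converted = pd.to_numeric(s, errors="coerce")

print(converted)
print("missing after conversion:", converted.isna().sum())
```

The trade-off is that coercion hides the reason a value failed to parse, so it is worth counting the resulting NaN values right away, as the next step does.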

# 9. After conversion, 11 users have a missing (NaN) "TotalCharges" (total charges) value.
test=customerDF.loc[:,'TotalCharges'].value_counts().sort_index()
print(test.sum())

print(customerDF.loc[customerDF['TotalCharges'].isnull(), 'tenure'])

[Output]:

7032
488     0
753     0
936     0
1082    0
1340    0
3331    0
3826    0
4380    0
5218    0
6670    0
6754    0
Name: tenure, dtype: int64

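On inspection, all 11 of these users have a 'tenure' (months on the network) of 0, so they are presumably customers who joined in the current month.
As a rule of thumb, a user who churns in the month they sign up still pays that month's fee. We therefore set tenure to 1 for these 11 users and fill their total charges with their monthly charges, which matches what would actually be billed.

# 9.1 Check for null values and print the affected rows
print(customerDF.isnull().any())
print(customerDF.loc[customerDF['TotalCharges'].isnull(), ['tenure','MonthlyCharges','TotalCharges']])

[Output]: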

customerID          False
gender              False
SeniorCitizen       False
Partner             False
Dependents          False
tenure              False
PhoneService        False
MultipleLines       False
InternetService     False
OnlineSecurity      False
OnlineBackup        False
DeviceProtection    False
TechSupport         False
StreamingTV         False
StreamingMovies     False
Contract            False
PaperlessBilling    False
PaymentMethod       False
MonthlyCharges      False
TotalCharges         True
Churn               False
dtype: bool
      tenure  MonthlyCharges  TotalCharges
488        0           52.55           NaN
753        0           20.25           NaN
936        0           80.85           NaN
1082       0           25.75           NaN
1340       0           56.05           NaN
3331       0           19.85           NaN
3826       0           25.35           NaN
4380       0           20.00           NaN
5218       0           19.70           NaN
6670       0           73.35           NaN
6754       0           61.90           NaN

# 9.2 × Fill total charges with monthly charges; this raises ValueError: Series.replace cannot use dict-value and non-None to_replace
customerDF.loc[:,'TotalCharges'].replace(to_replace=np.nan,value=customerDF.loc[:,'MonthlyCharges'],inplace=True)

[Error output]:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-6230086838ba> in <module>
      1 # 9.2 × Fill total charges with monthly charges; this raises ValueError: Series.replace cannot use dict-value and non-None to_replace
----> 2 customerDF.loc[:,'TotalCharges'].replace(to_replace=np.nan,value=customerDF.loc[:,'MonthlyCharges'],inplace=True)

D:\mysoft\anaconda3\lib\site-packages\pandas\core\series.py in replace(self, to_replace, value, inplace, limit, regex, method)
   4507         method="pad",
   4508     ):
-> 4509         return super().replace(
   4510             to_replace=to_replace,
   4511             value=value,

D:\mysoft\anaconda3\lib\site-packages\pandas\core\generic.py in replace(self, to_replace, value, inplace, limit, regex, method)
   6921                     # Operate column-wise
   6922                     if self.ndim == 1:
-> 6923                         raise ValueError(
   6924                             "Series.replace cannot use dict-value and "
   6925                             "non-None to_replace"

ValueError: Series.replace cannot use dict-value and non-None to_replace

# 9.3 √ Working version of the previous step: fill total charges with monthly charges
# Option 1: use fillna
customerDF['TotalCharges'] = customerDF['TotalCharges'].fillna(customerDF['MonthlyCharges'])

# Option 2: run the two lines of code below
# Collect the index of every row that needs replacing, as a list
# pan1 = customerDF[customerDF['TotalCharges'].isnull()].index.to_list()

# # Then replace the matching rows
# customerDF.loc[pan1,'TotalCharges'] = customerDF.loc[pan1,'MonthlyCharges']

# 9.4 Check that the replacement worked
customerDF[customerDF['tenure']==0][['tenure','MonthlyCharges','TotalCharges']]

[Output]:

# 10. Change 'tenure' from 0 to 1 (assign the result back instead of using inplace on a .loc slice)
customerDF['tenure'] = customerDF['tenure'].replace(to_replace=0, value=1)
print(pd.isnull(customerDF['TotalCharges']).sum())
print(customerDF['TotalCharges'].dtypes)

[Output]:

0
float64

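# 11. Descriptive statistics for the numeric columns
customerDF.describe()

[Output]:

5. Visual Analysis

Following common practice, we group the user features into user attributes, service attributes, and contract attributes, and visualize each of these three dimensions.

# 12. Number and share of churned users
plt.rcParams['figure.figsize']=6,6
plt.pie(customerDF['Churn'].value_counts(),labels=customerDF['Churn'].value_counts().index,autopct='%1.2f%%',explode=(0.1,0))
plt.title('Churn(Yes/No) Ratio')
plt.show()

[Output]: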

# 13. Bar chart of churned vs. retained customer counts
churn_counts = customerDF['Churn'].value_counts()
plt.bar(churn_counts.index, churn_counts.values, width=0.5, color='c')
plt.title('Churn(Yes/No) Num')
plt.show()

[Output]:

The dataset is imbalanced: churned users account for 26.54% of all customers.
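The 26.54% figure follows directly from the Churn counts printed earlier (1869 Yes, 5174 No). The sketch below redoes that arithmetic and also derives the accuracy of always predicting "No", which is a useful baseline for any classifier trained on this data (the baseline framing is an addition for illustration):

```python
# Counts taken from the Churn value counts shown earlier in the article.
churned, retained = 1869, 5174
total = churned + retained

churn_rate = churned / total           # share of customers who churned
baseline_accuracy = retained / total   # accuracy of always predicting "No"

print(f"churn rate: {churn_rate:.2%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

A model is only adding value if its accuracy clearly exceeds this majority-class baseline.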

(1) User attributes

import matplotlib.ticker as ticker
def barplot_percentages(feature, orient='v', axis_name="percentage of customers"):
    # Share of all customers in each (feature value, Churn) group
    g = (customerDF.groupby(feature)["Churn"].value_counts() / len(customerDF)).rename(axis_name).reset_index()

    if orient == 'v':
        ax = sns.barplot(x=feature, y=axis_name, hue='Churn', data=g, orient=orient)
        # Show the value axis as percentages
        ax.yaxis.set_major_formatter(ticker.PercentFormatter(xmax=1, decimals=0))
        plt.rcParams.update({'font.size': 13})
    else:
        ax = sns.barplot(x=axis_name, y=feature, hue='Churn', data=g, orient=orient)
        ax.xaxis.set_major_formatter(ticker.PercentFormatter(xmax=1, decimals=0))
        plt.legend(fontsize=10)
    plt.title('Churn(Yes/No) Ratio as {0}'.format(feature))
    plt.show()
barplot_percentages("SeniorCitizen")
barplot_percentages("gender")

[Output]:

customerDF['churn_rate'] = customerDF['Churn'].map({"No": 0, "Yes": 1})
g = sns.FacetGrid(customerDF, col="SeniorCitizen", height=4, aspect=.9)
ax = g.map(sns.barplot, "gender", "churn_rate", palette = "Blues_d", order= ['Female', 'Male'])
plt.rcParams.update({'font.size': 13})
plt.show()

[Output]:

Summary:
Churn is essentially unrelated to gender;
Senior users churn at a markedly higher rate than younger users.
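The bar charts above show each group as a share of all customers, so a numeric within-group churn rate can make the comparison easier to read. A minimal sketch using `pd.crosstab` with row normalization, on made-up data (the `toy` frame is an assumption for illustration, not the real dataset):

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "SeniorCitizen": [0, 0, 0, 0, 1, 1, 1, 1],
    "Churn": ["No", "No", "No", "Yes", "No", "Yes", "Yes", "No"],
})

# normalize="index" makes each row sum to 1, i.e. the churn rate within each group.
rates = pd.crosstab(toy["SeniorCitizen"], toy["Churn"], normalize="index")
print(rates)
```

On the real data the same call would be applied to `customerDF["SeniorCitizen"]` and `customerDF["Churn"]`.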

fig, axis = plt.subplots(1, 2, figsize=(12,4))
axis[0].set_title("Has Partner")
axis[1].set_title("Has Dependents")
axis_y = "percentage of customers"

# Plot Partner column
gp_partner = (customerDF.groupby('Partner')["Churn"].value_counts()/len(customerDF)).rename(axis_y).reset_index()
ax1 = sns.barplot(x='Partner', y= axis_y, hue='Churn', data=gp_partner, ax=axis[0])
ax1.legend(fontsize=10)

# Plot Dependents column
gp_dep = (customerDF.groupby('Dependents')["Churn"].value_counts()/len(customerDF)).rename(axis_y).reset_index()
ax2 = sns.barplot(x='Dependents', y= axis_y, hue='Churn', data=gp_dep, ax=axis[1])

# Set the font size
plt.rcParams.update({'font.size': 20})
ax2.legend(fontsize=10)

plt.show()

[Output]:

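# Kernel density estimation (KDE)
def kdeplot(feature,xlabel):
    plt.figure(figsize=(9, 4))
    plt.title("KDE for {0}".format(feature))
    # fill=True replaces the deprecated shade argument
    ax0 = sns.kdeplot(customerDF[customerDF['Churn'] == 'No'][feature].dropna(), color= 'navy', label= 'Churn: No', fill=True)
    ax1 = sns.kdeplot(customerDF[customerDF['Churn'] == 'Yes'][feature].dropna(), color= 'orange', label= 'Churn: Yes', fill=True)
    plt.xlabel(xlabel)
    # Set the font size
    plt.rcParams.update({'font.size': 20})
    plt.legend(fontsize=10)
kdeplot('tenure','tenure')
plt.show()

[Output]:

Summary:

Users with a partner churn at a lower rate than users without one;
Fewer users have dependents than not;
Users with dependents churn at a lower rate than users without;
The longer the tenure, the lower the churn rate, as one would expect;
Once tenure reaches about three months, the churn density falls below the retention density, suggesting that users typically settle in after roughly three months.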
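The tenure pattern can also be checked numerically by bucketing tenure and computing a churn rate per bucket. A minimal sketch on made-up data (the `toy` frame, the bin edges, and the labels are all assumptions chosen for illustration):

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 5, 10, 14, 30, 48, 60, 72],
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
})

# Right-closed bins: (0, 6], (6, 12], (12, 24], (24, 72]
bins = [0, 6, 12, 24, 72]
labels = ["1-6", "7-12", "13-24", "25-72"]
toy["tenure_band"] = pd.cut(toy["tenure"], bins=bins, labels=labels)

# Mean of a 0/1 flag is the churn rate within each band.
toy["churned"] = (toy["Churn"] == "Yes").astype(int)
band_rates = toy.groupby("tenure_band", observed=True)["churned"].mean()
print(band_rates)
```

Unlike a KDE curve, the per-band table gives explicit rates that can be quoted when deciding where a "stable period" begins.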

(2) Service attributes

plt.figure(figsize=(9, 4.5))
barplot_percentages("MultipleLines", orient='h')

[Output]:

plt.figure(figsize=(9, 4.5))
barplot_percentages("InternetService", orient="h")

[Output]:

cols = ["PhoneService","MultipleLines","OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies"]
df1 = pd.melt(customerDF[customerDF["InternetService"] != "No"][cols])
df1.rename(columns={'value': 'Has service'},inplace=True)
plt.figure(figsize=(20, 8))
# Fix the hue order so the legend labels below match the bars
ax = sns.countplot(data=df1, x='variable', hue='Has service', hue_order=['No', 'Yes'])
ax.set(xlabel='Internet Additional service', ylabel='Num of customers')
plt.rcParams.update({'font.size':20})
plt.legend( labels = ['No Service', 'Has Service'],fontsize=15)
plt.title('Num of Customers as Internet Additional Service')
plt.show()

[Output]:

plt.figure(figsize=(20, 8))
df1 = customerDF[(customerDF.InternetService != "No") & (customerDF.Churn == "Yes")]
df1 = pd.melt(df1[cols])
df1.rename(columns={'value': 'Has service'}, inplace=True)
ax = sns.countplot(data=df1, x='variable', hue='Has service', hue_order=['No', 'Yes'])
ax.set(xlabel='Internet Additional service', ylabel='Churn Num')
plt.rcParams.update({'font.size':20})
plt.legend( labels = ['No Service', 'Has Service'],fontsize=15)
plt.title('Num of Churn Customers as Internet Additional Service')
plt.show()

[Output]:

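Summary:

Phone service overall has little effect on churn;
Fiber optic users have a relatively high churn share;
Fiber users who also take security, backup, device protection, and tech support services churn less;
Fiber users who add streaming TV and movie services have a relatively high churn share.

(3) Contract attributes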

plt.figure(figsize=(9, 4.5))
barplot_percentages("PaymentMethod",orient='h')

g = sns.FacetGrid(customerDF, col="PaperlessBilling", height=6, aspect=.9)
ax = g.map(sns.barplot, "Contract", "churn_rate", palette = "Blues_d", order= ['Month-to-month', 'One year', 'Two year'])
plt.rcParams.update({'font.size':18})
plt.show()

[Output]:

kdeplot('MonthlyCharges','MonthlyCharges')
kdeplot('TotalCharges','TotalCharges')
plt.show()

[Output]:

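Summary:

Users who pay by electronic check have the highest churn rate, possibly because that payment experience is mediocre;
Churn by contract type ranks month-to-month > one year > two year, which suggests long-term contracts retain customers best;
Users with monthly charges of roughly 70-110 churn at a higher rate;
In the long run, the higher a user's total charges, the lower the churn rate, as one would expect.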

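6. Churn Prediction

We clean the dataset further and extract features, reduce dimensionality through feature selection, apply machine learning models to the test set, and then analyze the accuracy of the resulting classifiers.

(1) Data cleaning

customerID=customerDF['customerID']
customerDF.drop(['customerID'],axis=1, inplace=True)

Looking at the data types, only "tenure", "MonthlyCharges", and "TotalCharges" are continuous features; all the others are discrete. Continuous features are handled by standardization. Discrete features whose values have no ordering are one-hot encoded; those whose values are ordered are mapped to numbers.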
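The three treatments described above can be sketched side by side on a tiny made-up frame. The data and the choice to treat contract length as ordered are assumptions for illustration only; the article's own code below one-hot encodes every multi-valued column:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data for illustration only.
toy = pd.DataFrame({
    "PaymentMethod": ["Electronic check", "Mailed check", "Electronic check"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "MonthlyCharges": [20.0, 60.0, 100.0],
})

# Unordered categories: one indicator column per category (one-hot).
onehot = pd.get_dummies(toy["PaymentMethod"], prefix="PaymentMethod")

# Ordered categories: map to integers that preserve the order (assumed ordering).
contract_order = {"Month-to-month": 0, "One year": 1, "Two year": 2}
contract_code = toy["Contract"].map(contract_order)

# Continuous feature: standardize to zero mean and unit variance.
scaled = StandardScaler().fit_transform(toy[["MonthlyCharges"]]).ravel()

print(onehot.columns.tolist())
print(contract_code.tolist())
print(scaled)
```

In a real pipeline the scaler should be fitted on the training split only, so that test-set statistics do not leak into training.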

# Collect the discrete features
cateCols = [c for c in customerDF.columns if customerDF[c].dtype == 'object' or c == 'SeniorCitizen']
dfCate = customerDF[cateCols].copy()
dfCate.head(3)

[Output]:

# Encode the features
for col in cateCols:
    if dfCate[col].nunique() == 2:
        # Two-valued columns become 0/1 codes
        dfCate[col] = pd.factorize(dfCate[col])[0]
    else:
        # Multi-valued columns are one-hot encoded
        dfCate = pd.get_dummies(dfCate, columns=[col])
# Add the continuous features back
dfCate['tenure'] = customerDF['tenure']
dfCate['MonthlyCharges'] = customerDF['MonthlyCharges']
dfCate['TotalCharges'] = customerDF['TotalCharges']

# Correlation of each feature with Churn
plt.figure(figsize=(16,8))
dfCate.corr()['Churn'].sort_values(ascending=False).plot(kind='bar')
plt.show()

[Output]:

(2) Feature selection

# Feature selection
dropFea = ['gender','PhoneService',
           'OnlineSecurity_No internet service', 'OnlineBackup_No internet service',
           'DeviceProtection_No internet service', 'TechSupport_No internet service',
           'StreamingTV_No internet service', 'StreamingMovies_No internet service',
           #'OnlineSecurity_No', 'OnlineBackup_No',
           #'DeviceProtection_No','TechSupport_No',
           #'StreamingTV_No', 'StreamingMovies_No',
           ]
dfCate.drop(dropFea, inplace=True, axis =1)
# The Churn column is the label
target = dfCate['Churn'].values
# List of column names: the features plus the label
columns = dfCate.columns.tolist()

Build the training and test sets

# List of feature names only
columns.remove('Churn')
# Feature matrix
features = dfCate[columns].values
# 30% goes to the test set, the rest to the training set
# random_state = 1 keeps the random split the same across repeated runs
# stratify = target keeps the class proportions of the label the same in the training and test sets
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(features, target, test_size=0.30, stratify = target, random_state = 1)
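To illustrate what `stratify` buys on an imbalanced label, the sketch below splits a synthetic label vector with a 25% positive class (the array sizes and class ratio are assumptions for the example) and compares the positive share in each split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 200 samples, 25% positive class.
X = np.arange(400).reshape(200, 2)
y = np.array([1] * 50 + [0] * 150)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# With stratification both splits keep the original class ratio.
print("train positive share:", y_train.mean())
print("test positive share:", y_test.mean())
```

Without `stratify`, a random split could by chance leave the test set with noticeably more or fewer churners than the training set, which would distort the reported accuracy.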

(3) Building the models

Construct several classifiers

# Import the classification algorithms below; if scikit-learn is missing, run pip install scikit-learn
from sklearn.svm import SVC     # C-support vector classifier
from sklearn.tree import DecisionTreeClassifier      # decision tree classifier
from sklearn.ensemble import RandomForestClassifier   # random forest classifier
from sklearn.neighbors import KNeighborsClassifier   # k-nearest neighbors (KNN) classifier
from sklearn.ensemble import AdaBoostClassifier    # AdaBoost classifier

# Construct the classifiers
classifiers = [
    SVC(random_state = 1, kernel = 'rbf'),
    DecisionTreeClassifier(random_state = 1, criterion = 'gini'),
    RandomForestClassifier(random_state = 1, criterion = 'gini'),
    KNeighborsClassifier(metric = 'minkowski'),
    AdaBoostClassifier(random_state = 1),
]
# Classifier names (used as pipeline step names)
classifier_names = [
            'svc',
            'decisiontreeclassifier',
            'randomforestclassifier',
            'kneighborsclassifier',
            'adaboostclassifier',
]
# Classifier parameter grids
# Note the key format: GridSearchCV expects "<pipeline step name>" + "__" + "<parameter name>"
classifier_param_grid = [
            {'svc__C':[0.1], 'svc__gamma':[0.01]},
            {'decisiontreeclassifier__max_depth':[6,9,11]},
            {'randomforestclassifier__n_estimators':range(1,11)} ,
            {'kneighborsclassifier__n_neighbors':[4,6,8]},
            {'adaboostclassifier__n_estimators':[70,80,90]}
]
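The double-underscore key format can be verified directly on a pipeline. This small sketch builds a one-step pipeline whose step is named `svc` (matching the name used above) and shows that the tunable parameter names carry that prefix:

```python
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# One-step pipeline; the step name "svc" becomes the parameter prefix.
pipe = Pipeline([("svc", SVC(kernel="rbf", random_state=1))])

# get_params() lists every tunable name, including "<step>__<param>" entries.
svc_params = sorted(k for k in pipe.get_params() if k.startswith("svc__"))
print(svc_params)

# The same names work with set_params, which is what GridSearchCV uses internally.
pipe.set_params(svc__C=0.1, svc__gamma=0.01)
print(pipe.named_steps["svc"].C, pipe.named_steps["svc"].gamma)
```

If a grid key does not match one of these names, GridSearchCV raises an error at fit time, so printing `get_params()` is a quick way to debug a misnamed key.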

(4) Hyperparameter tuning and evaluation

After tuning and evaluating each classifier, AdaBoostClassifier(n_estimators=70) performs best.

# Pipeline chains the data-processing steps and a learner so that one call processes the data and trains the learner
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV   # GridSearchCV performs hyperparameter tuning
from sklearn.metrics import accuracy_score   # accuracy_score computes model accuracy

# Tune one classifier with GridSearchCV
def GridSearchCV_work(pipeline, train_x, train_y, test_x, test_y, param_grid, score = 'accuracy'):
    response = {}
    gridsearch = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=3, scoring = score)
    # Find the best parameters and the best cross-validated accuracy
    search = gridsearch.fit(train_x, train_y)
    print("GridSearch best params:", search.best_params_)
    print("GridSearch best score: %0.4f" % search.best_score_)
    # Predict the test-set labels with the estimator refit on the best parameters
    predict_y = gridsearch.predict(test_x)
    print(" Accuracy %0.4f" % accuracy_score(test_y, predict_y))
    response['predict_y'] = predict_y
    response['accuracy_score'] = accuracy_score(test_y,predict_y)
    return response

for model, model_name, model_param_grid in zip(classifiers, classifier_names, classifier_param_grid):
    # Scaling with StandardScaler (zero mean, unit variance) and PCA are left commented out in this run
    pipeline = Pipeline([
            #('scaler', StandardScaler()),
            #('pca',PCA),
            (model_name, model)
    ])
    result = GridSearchCV_work(pipeline, train_x, train_y, test_x, test_y, model_param_grid , score = 'accuracy')

[Output]:

GridSearch best params: {'svc__C': 0.1, 'svc__gamma': 0.01}
GridSearch best score: 0.7560
 Accuracy 0.7591
GridSearch best params: {'decisiontreeclassifier__max_depth': 6}
GridSearch best score: 0.7777
 Accuracy 0.7927
GridSearch best params: {'randomforestclassifier__n_estimators': 10}
GridSearch best score: 0.7702
 Accuracy 0.7842
GridSearch best params: {'kneighborsclassifier__n_neighbors': 8}
GridSearch best score: 0.7690
 Accuracy 0.7870
GridSearch best params: {'adaboostclassifier__n_estimators': 70}
GridSearch best score: 0.7998
 Accuracy 0.8050
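Because only about a quarter of customers churn, accuracy alone can hide how many churners a model misses. The following is a minimal sketch, on synthetic data from `make_classification` rather than the real dataset, of how a confusion matrix, recall, and F1 could complement accuracy; the sample size, feature count, and class weights are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (roughly 27% positives); illustration only.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.73, 0.27], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

clf = AdaBoostClassifier(n_estimators=70, random_state=1).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, pred)
recall = recall_score(y_te, pred)  # share of true positives that were caught
f1 = f1_score(y_te, pred)          # harmonic mean of precision and recall

print(cm)
print("recall: %.4f, F1: %.4f" % (recall, f1))
```

For a retention campaign, recall on the churn class is often the number to watch, since a missed churner cannot be targeted at all.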

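7. Conclusions and Recommendations

From the analysis above, high-churn users have the following characteristics:

User attributes: senior users, users without a partner, and users without dependents churn more readily;
Service attributes: tenure under half a year, phone service, fiber optic service (especially with add-on streaming TV or movie services), and no Internet value-added services;
Contract attributes: customers on short contracts, paying by electronic check, using paperless billing, with monthly charges of about 70-110 churn more readily; other attributes have little effect on churn, and the features above are treated as independent of one another.

Based on these conclusions, here are recommendations from a business perspective:
Use the prediction model to build a list of users at high risk of churning. Through user research, launch a minimum viable product feature and invite seed users to try it.

Users: for senior users and users without dependents or a partner, launch tailored offerings such as family plans and "warm care" plans, both to strengthen their ties with other users and to give these groups personalized service.
Services: for newly registered users, push half-year offers such as vouchers to carry them through the peak churn period. For fiber users and users with add-on streaming TV or movie services, focus on improving the network and value-added service experience: push the technical team to improve network metrics, and promise users free network upgrades and complimentary monthly TV or movie packages to increase stickiness. Value-added services such as online security, online backup, device protection, and tech support should be actively promoted to users, for example with a free first month or half-year trial.
Contracts: for month-to-month users, offer discounts on annual contracts to convert them into annual-contract users, lengthening their tenure and improving retention. For users paying by electronic check, send targeted coupons for other payment methods to steer them toward a different way of paying.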
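The recommendation to build a list of high-risk users from the prediction model could look roughly like the sketch below. It uses synthetic data and hypothetical customer IDs (both assumptions for illustration), and ranks customers by the model's predicted churn probability:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for the encoded feature matrix; illustration only.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.73, 0.27], random_state=1
)
ids = [f"C{i:04d}" for i in range(len(X))]  # hypothetical customer IDs

model = AdaBoostClassifier(n_estimators=70, random_state=1).fit(X, y)

# Column 1 of predict_proba is the probability of the positive (churn) class.
churn_prob = model.predict_proba(X)[:, 1]

# Top 20 customers by predicted churn probability.
risk_list = (
    pd.DataFrame({"customerID": ids, "churn_probability": churn_prob})
    .sort_values("churn_probability", ascending=False)
    .head(20)
)
print(risk_list.head())
```

In practice the model would be scored on current customers it was not trained on, and the cutoff (here the top 20) would depend on the budget of the retention campaign.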
