Author: 韓信子 @ShowMeAI
Data Analysis in Practice series: https://www.showmeai.tech/tutorials/40
Machine Learning in Practice series: https://www.showmeai.tech/tutorials/41
This article: https://www.showmeai.tech/article-detail/401
Notice: All rights reserved. For reprints, please contact the platform and the author and cite the source.
Bookmark ShowMeAI for more great content.
Introduction
Customer satisfaction with airlines had been climbing steadily over the past few years. As air travel restarts after the pause caused by the COVID-19 pandemic, attention to the travel experience is growing again, and customers keep giving feedback on familiar pain points such as uncomfortable seats, cramped space, delays, and substandard facilities.
Airlines, in turn, are paying more attention to customer satisfaction and working to improve it. Excellent customer service is key to sales and retention; conversely, poor service ratings lead to customer churn and a damaged reputation.
In this project we analyze and model airline satisfaction data, predict passenger satisfaction, and identify the core factors that drive it.
Data & Environment
The main development environment is Jupyter Notebook on Python 3.9. The libraries we rely on include Pandas, NumPy, Seaborn, and Matplotlib for exploratory data analysis, XGBoost and Scikit-Learn for modeling and tuning, and SHAP for model interpretability.
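If you want to reproduce the environment, the dependencies can be installed directly in a notebook cell, roughly as follows (a sketch: the package list simply mirrors the libraries named above, and versions are left unpinned).

```python
# Install the dependencies used in this article (versions unpinned, an assumption)
!pip install pandas numpy scipy seaborn matplotlib scikit-learn xgboost shap
```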
ShowMeAI has covered these libraries in detail in earlier hands-on articles; see the tutorial series and articles listed below.
The dataset we use is the Kaggle Airline Passenger Satisfaction dataset. It is stored as CSV files, pre-split into an 80% training set and a 20% test set, with "satisfaction" as the target column. You can also download it via ShowMeAI's Baidu Netdisk link.
Dataset download (Baidu Netdisk): reply "實戰" to the WeChat account "ShowMeAI研究中心", or click here to get the dataset for article [36] "Airline Passenger Satisfaction: data analysis, modeling, and business attribution" (the Airline Passenger Satisfaction dataset).
ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub
The data columns are described below:
Field | Description | Details |
---|---|---|
Gender | Passenger gender | Female, Male |
Customer Type | Customer type | Loyal customer, disloyal customer |
Age | Passenger age | -- |
Type of Travel | Purpose of travel | Personal Travel, Business Travel |
Class | Cabin class | Business, Eco, Eco Plus |
Flight distance | Flight distance | -- |
Inflight wifi service | Satisfaction with in-flight Wi-Fi | 0: Not Applicable; 1-5 |
Departure/Arrival time convenient | Satisfaction with departure/arrival time convenience | -- |
Ease of Online booking | Satisfaction with online booking | -- |
Gate location | Satisfaction with gate location | -- |
Food and drink | Satisfaction with food and drink | -- |
Online boarding | Satisfaction with online boarding | -- |
Seat comfort | Satisfaction with seat comfort | -- |
Inflight entertainment | Satisfaction with in-flight entertainment | -- |
On-board service | Satisfaction with on-board service | -- |
Leg room service | Satisfaction with leg room | -- |
Baggage handling | Satisfaction with baggage handling | -- |
Check-in service | Satisfaction with check-in service | -- |
Inflight service | Satisfaction with in-flight service | -- |
Cleanliness | Satisfaction with cleanliness | -- |
Departure Delay in Minutes | Departure delay (minutes) | -- |
Arrival Delay in Minutes | Arrival delay (minutes) | -- |
Satisfaction | Overall satisfaction | satisfied, neutral or dissatisfied |
Data Overview and Cleaning
Data Overview
We first import the libraries, apply some basic settings, and load the data.
# Import libraries
import pandas as pd
import numpy as np
import scipy.stats as sp
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
# Visualization settings
from matplotlib import rcParams
# Font size
rcParams['font.size'] = 12
# Figure size
rcParams['figure.figsize'] = 7, 5
# Load the data
air_train_df = pd.read_csv('air-train.csv')
air_test_df = pd.read_csv('air-test.csv')
air_train_df.head()
air_train_df.satisfaction.value_counts()
neutral or dissatisfied 58879
satisfied 45025
Name: satisfaction, dtype: int64
air_train_df.info()
air_test_df.info()
The info output is shown below. The full dataset contains 129,880 rows and 25 columns, pre-split into a training set of 103,904 rows (19.8 MB) and a test set of 25,976 rows (5 MB).
Training Data Set (air_train_df):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 103904 non-null int64
1 id 103904 non-null int64
2 Gender 103904 non-null object
3 Customer Type 103904 non-null object
4 Age 103904 non-null int64
5 Type of Travel 103904 non-null object
6 Class 103904 non-null object
7 Flight Distance 103904 non-null int64
8 Inflight wifi service 103904 non-null int64
9 Departure/Arrival time convenient 103904 non-null int64
10 Ease of Online booking 103904 non-null int64
11 Gate location 103904 non-null int64
12 Food and drink 103904 non-null int64
13 Online boarding 103904 non-null int64
14 Seat comfort 103904 non-null int64
15 Inflight entertainment 103904 non-null int64
16 On-board service 103904 non-null int64
17 Leg room service 103904 non-null int64
18 Baggage handling 103904 non-null int64
19 Checkin service 103904 non-null int64
20 Inflight service 103904 non-null int64
21 Cleanliness 103904 non-null int64
22 Departure Delay in Minutes 103904 non-null int64
23 Arrival Delay in Minutes 103594 non-null float64
24 satisfaction 103904 non-null object
dtypes: float64(1), int64(19), object(5)
memory usage: 19.8+ MB
Testing Set (air_test_df):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25976 entries, 0 to 25975
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 25976 non-null int64
1 id 25976 non-null int64
2 Gender 25976 non-null object
3 Customer Type 25976 non-null object
4 Age 25976 non-null int64
5 Type of Travel 25976 non-null object
6 Class 25976 non-null object
7 Flight Distance 25976 non-null int64
8 Inflight wifi service 25976 non-null int64
9 Departure/Arrival time convenient 25976 non-null int64
10 Ease of Online booking 25976 non-null int64
11 Gate location 25976 non-null int64
12 Food and drink 25976 non-null int64
13 Online boarding 25976 non-null int64
14 Seat comfort 25976 non-null int64
15 Inflight entertainment 25976 non-null int64
16 On-board service 25976 non-null int64
17 Leg room service 25976 non-null int64
18 Baggage handling 25976 non-null int64
19 Checkin service 25976 non-null int64
20 Inflight service 25976 non-null int64
21 Cleanliness 25976 non-null int64
22 Departure Delay in Minutes 25976 non-null int64
23 Arrival Delay in Minutes 25893 non-null float64
24 satisfaction 25976 non-null object
dtypes: float64(1), int64(19), object(5)
memory usage: 5.0+ MB
The dataset contains 19 integer columns, 1 float column, and 5 categorical (object) columns.
Data Cleaning
Next we clean the data:

- The `id` and `Unnamed: 0` columns carry no information, so we drop them.
- The arrival delay column is a float while the departure delay column is an integer; before further analysis we cast both to float for consistency.
- For categorical variables we normalize both column names and values (lowercasing everything) so that they can be encoded cleanly later in the modeling pipeline.
- The arrival delay column also has missing values: 310 in the training set and 83 in the test set (see the quick check below). We fill them with the simplest option, the column mean.
- The satisfaction-rating columns are supposed to be scored on a 1-to-5 scale. Rows containing 0 in these columns are dirty data, so we drop them.
- Finally, we aggregate the delay information into unified columns: whether the flight was delayed at all (departure or arrival) and the total delay time.
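Before writing the cleaning function, the missing-value counts can be verified on the raw frames (a small check sketch using the dataframes loaded above):

```python
# Count missing arrival-delay values in the raw train/test frames
print(air_train_df['Arrival Delay in Minutes'].isnull().sum())  # expected: 310
print(air_test_df['Arrival Delay in Minutes'].isnull().sum())   # expected: 83
```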
def clean_data(orig_df):
    '''
    This function applies 7 steps to the dataframe to clean the data.
    1. Dropping of unnecessary columns
    2. Uniformizing datatypes in the delay columns
    3. Normalizing column names
    4. Normalizing text values in columns
    5. Imputing numeric null values with the mean value of the column
    6. Dropping "zero" values from ranked categorical variables
    7. Creating aggregated flight delay columns
    Return: Cleaned DataFrame, ready for analysis - final encoding still to be applied.
    '''
    df = orig_df.copy()

    '''1. Dropping unnecessary columns'''
    df.drop(['Unnamed: 0', 'id'], axis = 1, inplace = True)

    '''2. Uniformizing datatype in the delay columns'''
    df['Departure Delay in Minutes'] = df['Departure Delay in Minutes'].astype(float)

    '''3. Normalizing column names'''
    df.columns = df.columns.str.lower()
    '''Replacing spaces and other characters with underscores, so the columns are
    easier to work with and can be accessed using dot notation.'''
    special_chars = "/ -"
    for special_char in special_chars:
        df.columns = [col.replace(special_char, '_') for col in df.columns]

    '''4. Normalizing text values in columns'''
    cat_cols = ['gender', 'customer_type', 'class', 'type_of_travel', 'satisfaction']
    for column in cat_cols:
        df[column] = df[column].str.lower()

    '''5. Imputing the nulls in the arrival delay column with the mean.
    Since we cannot safely equate these nulls to a zero value, the mean value of the
    column is the most sensible method of replacement.'''
    df['arrival_delay_in_minutes'].fillna(df['arrival_delay_in_minutes'].mean(), inplace = True)
    df = df.round({'arrival_delay_in_minutes' : 1})

    '''6. Dropping rows from ranked value columns where "zero" exists as a value.
    Since these columns are meant to be ranked on a scale from 1 to 5, a zero value
    does not make sense nor does it help us in any way.'''
    rank_list = ["inflight_wifi_service", "departure_arrival_time_convenient", "ease_of_online_booking", "gate_location",
                 "food_and_drink", "online_boarding", "seat_comfort", "inflight_entertainment", "on_board_service",
                 "leg_room_service", "baggage_handling", "checkin_service", "inflight_service", "cleanliness"]
    for col in rank_list:
        df.drop(df.loc[df[col] == 0].index, inplace = True)

    '''7. Creating aggregated and categorical flight delay columns'''
    df['total_delay_time'] = df['departure_delay_in_minutes'] + df['arrival_delay_in_minutes']
    df['was_flight_delayed'] = np.where(df['total_delay_time'] > 0, 'yes', 'no')

    return df
Exploratory Data Analysis
With the data loaded and cleaned, we dig further into it through EDA (exploratory data analysis).
How is the target variable (customer satisfaction) distributed?
We start with the target variable, customer satisfaction. It is the label we will eventually model against, and it is categorical.
air_train_cleaned = clean_data(air_train_df)
air_test_cleaned = clean_data(air_test_df)
fig = plt.figure(figsize = (10,7))
air_train_cleaned.satisfaction.value_counts(normalize = True).plot(kind='bar', alpha = 0.9, rot=0)
plt.title('Customer satisfaction')
plt.ylabel('Percent')
plt.show()
Overall the labels are fairly balanced: roughly 55% neutral or dissatisfied versus 45% satisfied. With this label ratio, no resampling is needed.
Gender and Customer Type vs. Satisfaction
with sns.axes_style(style = 'ticks'):
d = sns.histplot(x = "gender", hue= 'satisfaction', data = air_train_cleaned,
stat = 'percent', multiple="dodge", palette = 'Set1')
Broken down by gender, there is little difference between male and female passengers; overall satisfaction is more likely driven by other factors.
with sns.axes_style(style = 'ticks'):
d = sns.histplot(x = "customer_type", hue= 'satisfaction', data = air_train_cleaned,
stat = 'percent', multiple="dodge", palette = 'Set1')
From the customer-loyalty angle, loyal customers show a somewhat higher satisfaction rate, which matches intuition.
Cabin Class vs. Satisfaction
with sns.axes_style(style = 'ticks'):
d = sns.histplot(x = "class", hue= 'satisfaction', data = air_train_cleaned,
stat = 'percent', multiple="dodge", palette = 'Set1')
Looking at satisfaction separately for economy, premium, and business cabins, the distribution above shows a fundamental difference between passengers flying in the premium (business) cabin and those in the lower cabins (Eco or Eco Plus).
Next, let us compare passengers traveling for personal or leisure reasons with those traveling on business.
with sns.axes_style(style = 'ticks'):
d = sns.histplot(x = "type_of_travel", hue= 'satisfaction', data = air_train_cleaned,
stat = 'percent', multiple="dodge", palette = 'Set1')
The analysis above shows a very pronounced satisfaction gap between business travelers and leisure travelers.
Age vs. Satisfaction
with sns.axes_style('white'):
g = sns.catplot(x = 'age', data = air_train_cleaned,
kind = 'count', hue = 'satisfaction', order = range(7, 80),
height = 8.27, aspect=18.7/8.27, legend = False,
palette = 'Set1')
plt.legend(loc='upper right');
sns.violinplot(data = air_train_cleaned, x = "satisfaction", y = "age", palette='Set1')
The plots above relate age to satisfaction, and the result is quite interesting: the 37-61 age group differs markedly from the other groups (these passengers are far more satisfied with the experience than the rest). We also observe that within this band, satisfaction rises steadily with age.
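To put a number on this age-band gap, here is a minimal sketch that bins age and computes the satisfied share per band (the band edges are an assumption, chosen to bracket the 37-61 group):

```python
# Share of satisfied passengers per age band (illustrative band edges)
age_bins = pd.cut(air_train_cleaned['age'], bins=[7, 20, 36, 61, 80])
satisfied = (air_train_cleaned['satisfaction'] == 'satisfied')
print(satisfied.groupby(age_bins).mean().round(3))
```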
Flight Distance vs. Satisfaction
sns.violinplot(data = air_train_cleaned, x = "satisfaction", y = "flight_distance", palette = 'Set1')
Along the flight-distance dimension alone we see no striking difference in satisfaction, and the vast majority of passengers fly routes of 1,000 miles or less.
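The short-haul claim can be checked directly from the quantiles of the distance column (a quick sketch on the cleaned training frame):

```python
# Distribution of flight distances: most flights are short-haul
print(air_train_cleaned['flight_distance'].quantile([0.25, 0.5, 0.75, 0.9]))
```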
Flight Distance vs. Individual Service Ratings
score_cols = ["inflight_wifi_service", "departure_arrival_time_convenient", "ease_of_online_booking",
"gate_location","food_and_drink", "online_boarding", "seat_comfort", "inflight_entertainment",
"on_board_service","leg_room_service", "baggage_handling", "checkin_service", "inflight_service","cleanliness"]
plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)
# Loop through scored columns
for n, score_col in enumerate(score_cols):
# Add a new subplot iteratively
ax = plt.subplot(4, 4, n + 1)
# Filter df and plot scored column on new axis
sns.violinplot(data = air_train_cleaned,
x = score_col,
y = 'flight_distance',
hue = "satisfaction",
split = True,
ax = ax,
palette = 'Set1')
# Chart formatting
ax.set_title(score_col)
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
fancybox=True, shadow=True, ncol=5)
ax.set_xlabel("")
Above, we use violin plots to cross flight distance against the ratings passengers give each service dimension; flight distance does interact noticeably with customer satisfaction across these dimensions.
Age vs. Individual Service Ratings
plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)
# Loop through scored columns
for n, score_col in enumerate(score_cols):
# Add a new subplot iteratively
ax = plt.subplot(4, 4, n + 1)
# Filter df and plot scored column on new axis
sns.violinplot(data = air_train_cleaned,
x = score_col,
y = 'age',
hue = "satisfaction",
split = True,
ax = ax,
palette = 'Set1')
# Chart formatting
ax.set_title(score_col)
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
fancybox=True, shadow=True, ncol=5)
ax.set_xlabel("")
In the same way, the plots above break the per-dimension satisfaction ratings down by age. In most of these distributions the 37-60 age group shows a clear peak.
Cabin Class and Purpose of Travel vs. Individual Service Ratings
plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)
# Loop through scored columns
for n, score_col in enumerate(score_cols):
# Add a new subplot iteratively
ax = plt.subplot(4, 4, n + 1)
# Filter df and plot scored column on new axis
sns.violinplot(data = air_train_cleaned,
x = 'class',
y = score_col,
hue = "satisfaction",
split = True,
ax = ax,
palette = 'Set1')
# Chart formatting
ax.set_title(score_col)
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
fancybox=True, shadow=True, ncol=5)
ax.set_xlabel("")
plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)
# Loop through scored columns
for n, score_col in enumerate(score_cols):
# Add a new subplot iteratively
ax = plt.subplot(4, 4, n + 1)
# Filter df and plot scored column on new axis
sns.violinplot(data = air_train_cleaned,
x = 'type_of_travel',
y = score_col,
hue = "satisfaction",
split = True,
ax = ax,
palette = 'Set1')
# Chart formatting
ax.set_title(score_col)
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
fancybox=True, shadow=True, ncol=5)
ax.set_xlabel("")
Likewise, the plots above cross cabin class and purpose of travel against the per-dimension ratings. These two attributes clearly influence passenger satisfaction: pronounced peaks of both satisfaction and dissatisfaction appear for in-flight Wi-Fi, online boarding, seat comfort, in-flight entertainment, on-board service, and leg room.
One interesting detail is the in-flight Wi-Fi column: satisfaction with Wi-Fi appears to strongly affect overall trip satisfaction for Eco and Eco Plus customers, yet it seems to matter little for business-class travelers.
Data Processing and Feature Selection
Data Processing / Feature Engineering
Before feeding the data into a model, it must be encoded. For the categorical variables we use ordinal encoding, mapped as in the code below. (Since the category values carry an inherent order, and since we will use non-linear models such as XGBoost, ordinal encoding is acceptable here.)
For more on feature engineering, see the ShowMeAI tutorial series:
from sklearn.preprocessing import OrdinalEncoder
def encode_data(orig_df):
    '''
    Encodes the remaining categorical variables of the dataframe so it is ready for model ingestion.
    Inputs:
        DataFrame
    Manipulations:
        Encoding of categorical variables.
    Return:
        Encoded DataFrame
    '''
    df = orig_df.copy()

    # Ordinal encoding of the scored rating columns.
    encoder = OrdinalEncoder()
    for j in score_cols:
        df[j] = encoder.fit_transform(df[[j]])

    # Replacement of binary categories.
    df.was_flight_delayed.replace({'no': 0, 'yes': 1}, inplace = True)
    df['satisfaction'].replace({'neutral or dissatisfied': 0, 'satisfied': 1}, inplace = True)
    df.customer_type.replace({'disloyal customer': 0, 'loyal customer': 1}, inplace = True)
    df.type_of_travel.replace({'personal travel': 0, 'business travel': 1}, inplace = True)
    df.gender.replace({'male': 0, 'female': 1}, inplace = True)

    # One-hot encode cabin class.
    encoded_df = pd.get_dummies(df, columns = ['class'])
    return encoded_df
# Encode the training and test sets
air_train_encoded = encode_data(air_train_cleaned)
air_test_encoded = encode_data(air_test_cleaned)

# Correlation between each feature and the target column
train_corr = air_train_encoded.corr()[['satisfaction']]
plt.figure(figsize=(10, 12))
heatmap = sns.heatmap(train_corr.sort_values(by='satisfaction', ascending=False),
                      vmin=-1, vmax=1, annot=True, cmap='Blues')
heatmap.set_title('Feature Correlation with Target Variable', fontdict={'fontsize':14});
Feature Selection
For better results and higher efficiency, we run feature selection after feature engineering. Here we use Scikit-Learn's built-in SelectKBest filter with the chi-square statistic as the scoring criterion (chi-square is a reasonable choice because the dataset contains several categorical variables).
# Pre-processing and scaling dataset for feature selection
from sklearn import preprocessing
r_scaler = preprocessing.MinMaxScaler()
r_scaler.fit(air_train_encoded)
air_train_scaled = pd.DataFrame(r_scaler.transform(air_train_encoded), columns = air_train_encoded.columns)
air_train_scaled.head()
# Feature selection: apply SelectKBest with chi2 to keep the 10 most important features
from sklearn.feature_selection import SelectKBest, chi2
X = air_train_scaled.loc[:,air_train_scaled.columns!='satisfaction']
y = air_train_scaled[['satisfaction']]
selector = SelectKBest(chi2, k = 10)
selector.fit(X, y)
X_new = selector.transform(X)
features = (X.columns[selector.get_support(indices=True)])
features
Output:
Index(['type_of_travel', 'inflight_wifi_service', 'online_boarding',
'seat_comfort', 'inflight_entertainment', 'on_board_service',
'leg_room_service', 'cleanliness', 'class_business', 'class_eco'],
dtype='object')
The features retained by the K-Best filter are type of travel, in-flight Wi-Fi service, online boarding, seat comfort, in-flight entertainment, on-board service, leg room, cleanliness, and cabin class (business or economy).
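If you want to see the scores behind this selection rather than only the surviving names, the fitted selector exposes them (a small sketch using the `selector` and `X` objects fitted above):

```python
# Chi-square score per feature behind the SelectKBest ranking
chi2_scores = pd.Series(selector.scores_, index=X.columns)
print(chi2_scores.sort_values(ascending=False).head(15))
```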
Modeling
Next we build models on the prepared data. We train a Logistic Regression model, an AdaBoost classifier, a Random Forest classifier, a Naive Bayes classifier, and an XGBoost classifier, and evaluate them on training accuracy, test accuracy, precision, recall, and ROC AUC.
Library Imports and Data Preparation
import sklearn
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import CategoricalNB
import xgboost
from xgboost import XGBClassifier
# Features as selected by SelectKBest above (the `features` Index from the previous step)
# Specifying target variable
target = ['satisfaction']
# Selecting features and target from the pre-split train and test sets
X_train = air_train_encoded[features].to_numpy()
X_test = air_test_encoded[features]
y_train = air_train_encoded[target].to_numpy()
y_test = air_test_encoded[target]
Computing Model Evaluation Metrics
import time
from resource import getrusage, RUSAGE_SELF
from sklearn.metrics import accuracy_score, roc_auc_score, plot_confusion_matrix, plot_roc_curve, precision_score, recall_score

# Model evaluation and result plotting
def get_model_metrics(model, X_train, X_test, y_train, y_test):
    '''
    Model evaluation function: takes a model as a parameter and returns the metrics specified below.
    Inputs:
        model, X_train, y_train, X_test, y_test
    Output:
        Model output metrics, confusion matrix, ROC AUC curve
    '''
    # Mark the time at which the model started running
    t0 = time.time()
    # Fit the model on the training data and run predictions on the test data
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    # Training accuracy as a comparative metric
    train_score = model.score(X_train, y_train)
    # Test accuracy
    accuracy = accuracy_score(y_test, y_pred)
    # Precision of the test predictions
    precision = precision_score(y_test, y_pred)
    # Recall of the test predictions
    recall = recall_score(y_test, y_pred)
    # ROC AUC from the predicted probabilities
    roc = roc_auc_score(y_test, y_pred_proba)
    # Wall-clock time spent running the model
    time_taken = time.time() - t0
    # Peak memory used while running the model (ru_maxrss is reported in KB on Linux)
    memory_used = int(getrusage(RUSAGE_SELF).ru_maxrss / 1024)
    # Print the performance metrics
    print("Accuracy on Training = {}".format(train_score))
    print("Accuracy on Test = {} • Precision = {}".format(accuracy, precision))
    print("Recall = {} • ROC Area under Curve = {}".format(recall, roc))
    print("Time taken = {} seconds • Memory consumed = {} MB".format(time_taken, memory_used))
    # Plot the confusion matrix of the model's predictions
    plot_confusion_matrix(model, X_test, y_test, cmap = plt.cm.Blues, normalize = 'all')
    # Plot the ROC curve of the model
    plot_roc_curve(model, X_test, y_test)
    plt.show()
    return model, train_score, accuracy, precision, recall, roc, time_taken, memory_used
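Note that `plot_confusion_matrix` and `plot_roc_curve` were removed in scikit-learn 1.2. If you run a newer version, a sketch of the equivalent calls with the Display classes (a drop-in replacement for the two plotting lines inside the function above):

```python
# Equivalent plots on scikit-learn >= 1.2 (Display classes replace the plot_* helpers)
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap=plt.cm.Blues, normalize='all')
RocCurveDisplay.from_estimator(model, X_test, y_test)
plt.show()
```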
Modeling and Tuning
① Logistic Regression
# Build the model and run hyperparameter search
clf = LogisticRegression()
params = {'C': [0.1, 0.5, 1, 5, 10]}
rscv = RandomizedSearchCV(estimator = clf,
param_distributions = params,
scoring = 'f1',
n_iter = 10,
verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)
# Parameter object to be passed through to function activation
params = rscv.best_params_
print("Best parameters:", params)
model_lr = LogisticRegression(**params)
model_lr, train_lr, accuracy_lr, precision_lr, recall_lr, roc_lr, tt_lr, mu_lr = get_model_metrics(model_lr, X_train, X_test, y_train, y_test)
② Random Forest
clf = RandomForestClassifier()
params = { 'max_depth': [5, 10, 15, 20, 25, 30],
           'max_leaf_nodes': [10, 20, 30, 40, 50],
           'min_samples_split': [2, 3, 4, 5]}
rscv = RandomizedSearchCV(estimator = clf,
param_distributions = params,
scoring = 'f1',
n_iter = 10,
verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)
# Parameter object to be passed through to function activation
params = rscv.best_params_
print("Best parameters:", params)
model_rf = RandomForestClassifier(**params)
model_rf, train_rf, accuracy_rf, precision_rf, recall_rf, roc_rf, tt_rf, mu_rf = get_model_metrics(model_rf, X_train, X_test, y_train, y_test)
③ AdaBoost
clf = AdaBoostClassifier()
params = { 'n_estimators': [25, 50, 75, 100, 125, 150],
'learning_rate': [0.2, 0.4, 0.6, 0.8, 1.0]}
rscv = RandomizedSearchCV(estimator = clf,
param_distributions = params,
scoring = 'f1',
n_iter = 10,
verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)
# Parameter object to be passed through to function activation
params = rscv.best_params_
print("Best parameters:", params)
model_ada = AdaBoostClassifier(**params)
# Saving output metrics
model_ada, train_ada, accuracy_ada, precision_ada, recall_ada, roc_ada, tt_ada, mu_ada = get_model_metrics(model_ada, X_train, X_test, y_train, y_test)
④ Naive Bayes
clf = CategoricalNB()
params = { 'alpha': [0.0001, 0.001, 0.1, 1, 10, 100, 1000],
'min_categories': [6, 8, 10]}
rscv = RandomizedSearchCV(estimator = clf,
param_distributions = params,
scoring = 'f1',
n_iter = 10,
verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)
# Parameter object to be passed through to function activation
params = rscv.best_params_
print("Best parameters:", params)
model_cnb = CategoricalNB(**params)
# Saving Output Metrics
model_cnb, train_cnb, accuracy_cnb, precision_cnb, recall_cnb, roc_cnb, tt_cnb, mu_cnb = get_model_metrics(model_cnb, X_train, X_test, y_train, y_test)
⑤ XGBoost
clf = XGBClassifier()
params = { 'max_depth': [3, 5, 6, 10, 15, 20],
'learning_rate': [0.01, 0.1, 0.2, 0.3],
'n_estimators': [100, 500, 1000]}
rscv = RandomizedSearchCV(estimator = clf,
param_distributions = params,
scoring = 'f1',
n_iter = 10,
verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)
# Parameter object to be passed through to function activation
params = rscv.best_params_
print("Best parameters:", params)
model_xgb = XGBClassifier(**params)
# Saving Output Metrics
model_xgb, train_xgb, accuracy_xgb, precision_xgb, recall_xgb, roc_xgb, tt_xgb, mu_xgb = get_model_metrics(model_xgb, X_train, X_test, y_train, y_test)
Overall Comparison
Below we compare the models. With tuned parameters, every model reaches at least 88% accuracy on the training data and at least 87% on the test data.
training_scores = [train_lr, train_rf, train_ada, train_cnb, train_xgb]
accuracy = [accuracy_lr, accuracy_rf, accuracy_ada, accuracy_cnb, accuracy_xgb]
roc_scores = [roc_lr, roc_rf, roc_ada, roc_cnb, roc_xgb]
precision = [precision_lr, precision_rf, precision_ada, precision_cnb, precision_xgb]
recall = [recall_lr, recall_rf, recall_ada, recall_cnb, recall_xgb]
time_scores = [tt_lr, tt_rf, tt_ada, tt_cnb, tt_xgb]
memory_scores = [mu_lr, mu_rf, mu_ada, mu_cnb, mu_xgb]
model_data = {'Model': ['Logistic Regression', 'Random Forest', 'Adaptive Boost',
'Categorical Bayes', 'Extreme Gradient Boost'],
'Accuracy on Training' : training_scores,
'Accuracy on Test' : accuracy,
'ROC AUC Score' : roc_scores,
'Precision' : precision,
'Recall' : recall,
'Time Elapsed (seconds)' : time_scores,
              'Memory Consumed (MB)': memory_scores}
model_data = pd.DataFrame(model_data)
model_data
We finally choose XGBoost: it performs best, with strong results on both training and test data, reaching a ROC AUC of about 0.98, precision of about 0.95, and recall of about 0.92 on the test set.
plt.rcParams["figure.figsize"] = (25,15)
ax1 = model_data.plot.bar(x = 'Model', y = ["Accuracy on Training", "Accuracy on Test", "ROC AUC Score",
"Precision", "Recall"],
cmap = 'coolwarm')
ax1.legend()
ax1.set_title("Model Comparison", fontsize = 18)
ax1.set_xlabel('Model', fontsize = 14)
ax1.set_ylabel('Result', fontsize = 14, color = 'Black');
Model Interpretability
Beyond obtaining a model that performs well, another essential part of applied machine learning is interpreting it in the business context, which helps the business improve afterwards. We can do this with XGBoost's built-in feature importance and with SHAP.
For an introduction to the SHAP library, see the ShowMeAI article:
XGBoost Feature Importance
from xgboost import plot_importance
model_xgb.get_booster().feature_names = ['type_of_travel', 'inflight_wifi_service', 'online_boarding',
'seat_comfort', 'inflight_entertainment', 'on_board_service',
'leg_room_service', 'cleanliness', 'class_business', 'class_eco']
plot_importance(model_xgb)
plt.show()
The most important features according to XGBoost are, in order: seat comfort, online boarding, in-flight entertainment, on-board service, leg room, in-flight Wi-Fi, and cleanliness.
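Note that `plot_importance` ranks features by "weight" (how often a feature is used to split) by default; gain-based importance often tells a slightly different story for tree models. A sketch of the alternative call:

```python
# Rank features by average gain instead of split count
plot_importance(model_xgb, importance_type='gain', show_values=False)
plt.show()
```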
SHAP Model and Feature Interpretability
To analyze feature effects with SHAP, we first serialize the model with Python's pickle library, then build a SHAP explainer from the model and the selected features and apply it to the X_train data.
import shap
import pickle

# Saving the trained model
pickle.dump(model_xgb, open('./Models/model_xgb.pkl', 'wb'))
explainer = shap.Explainer(model_xgb, feature_names = features)
shap_values = explainer(X_train)
shap.initjs()
shap.summary_plot(shap_values, X_train, class_names=model_xgb.classes_)
Taking the mean |SHAP| value as our measure of feature importance, in-flight Wi-Fi service is the most influential feature in the data, closely followed by type of travel and online boarding. For nearly every feature, high values (which mostly indicate higher satisfaction with that service dimension) push the prediction toward "satisfied", while low values push it toward "dissatisfied".
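The same ranking can be drawn directly as a bar chart of mean |SHAP| values (a one-line sketch on the `shap_values` computed above):

```python
# Global feature importance: mean absolute SHAP value per feature
shap.plots.bar(shap_values)
```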
Feature Effect Analysis: In-flight Wi-Fi Service
shap.plots.scatter(shap_values[:, "inflight_wifi_service"], color=shap_values)
We take the most important feature, in-flight Wi-Fi, and examine it further. In the plot above, the x-axis is the Wi-Fi satisfaction score, the y-axis is the SHAP value, and color encodes the type of travel (personal travel encoded as 0, business travel as 1).
We observe:

- Personal travelers: a high Wi-Fi score contributes strongly to predicting high overall satisfaction, while a low Wi-Fi score contributes strongly to predicting dissatisfaction.
- Business travelers: a portion of them are satisfied regardless of their Wi-Fi experience (positive SHAP values outweigh negative ones).
Feature Effect Analysis: Online Boarding
shap.plots.scatter(shap_values[:, "online_boarding"], color=shap_values)
The SHAP analysis of the online boarding feature is shown above. Whether traveling for personal or business reasons, a low score for the online boarding process pushes the predicted satisfaction down.
Summary
In this article we worked through the air-travel scenario end to end: a detailed analysis of flight passenger satisfaction data, predictive modeling, and model interpretability analysis.
Our best model reaches 95% accuracy and a 0.987 AUC. On the interpretation side, the most important factors behind satisfaction are in-flight Wi-Fi service, online boarding, in-flight entertainment quality, food and drink, seat comfort, cabin cleanliness, and leg room.
References

- Airline Passenger Satisfaction dataset (Kaggle)
- Analysis of why US airline passengers are dissatisfied (CNN)
- News: passenger satisfaction drops as planes fill up and fares rise (CNBC)
- Data Analysis in Practice: hands-on Python data analysis tutorials: https://www.showmeai.tech/tutorials/40
- Machine Learning in Practice: hands-on machine learning tutorials: https://www.showmeai.tech/tutorials/41
- Hands-on machine learning interpretability with SHAP: https://showmeai.tech/article-detail/337
- Machine Learning in Practice | A complete guide to feature engineering: https://showmeai.tech/article-detail/208
Recommended Reading

- Data Analysis in Practice series: https://www.showmeai.tech/tutorials/40
- Machine Learning in Practice series: https://www.showmeai.tech/tutorials/41
- Deep Learning in Practice series: https://www.showmeai.tech/tutorials/42
- TensorFlow in Practice series: https://www.showmeai.tech/tutorials/43
- PyTorch in Practice series: https://www.showmeai.tech/tutorials/44
- NLP in Practice series: https://www.showmeai.tech/tutorials/45
- CV in Practice series: https://www.showmeai.tech/tutorials/46
- AI Interview Question Bank series: https://www.showmeai.tech/tutorials/48