Machine Learning (7) - A Small Exercise on Logistic Regression (Binary Classification)

Posted by Rachel on 2019-06-07

Original URL: https://learnku.com/articles/29508

Task: help HR analyze which employees stay and which leave

Step 1: Prepare the data

1. Load the base data

import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

df = pd.read_csv('/Users/rachel/Downloads/py-master/ML/7_logistic_reg/Exercise/HR_comma_sep.csv')
df.head()

Output:

[Table: the first five rows of df]

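Each column in the table above is a statistic describing an employee along one dimension. Whether the employee stayed or left is recorded in the left column, which takes only the values 0 and 1: 0 means the employee stayed, 1 means the employee left.
First, we use the existing data to find out which dimensions have a relatively large effect on whether an employee leaves (that is, on the value of the left column).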

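2. Count how many employees left and how many stayed

# Select the rows for employees who left
left = df[df.left==1]
left.shape

Output:

(3571, 10)

A dataframe's shape attribute gives the number of rows and columns, so here we have 3571 rows × 10 columns. In other words, 3571 employees left.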

# Select the rows for employees who stayed
retained = df[df.left==0]
retained.shape

Output:

(11428, 10)

This means 11428 employees stayed.
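Both counts can also be read off in a single call. The following is a minimal sketch, assuming a small made-up dataframe named toy (its values are illustrative and not taken from HR_comma_sep.csv) with a left column like the one above:

```python
import pandas as pd

# Toy data for illustration only; not taken from HR_comma_sep.csv
toy = pd.DataFrame({"left": [1, 0, 0, 1, 0, 0, 0]})

# value_counts() tallies how often each distinct value occurs
counts = toy["left"].value_counts()
stayed = int(counts.loc[0])    # rows where left == 0
departed = int(counts.loc[1])  # rows where left == 1

# normalize=True returns proportions instead of raw counts
shares = toy["left"].value_counts(normalize=True)

print(stayed, departed)
print(shares)
```

On the real data, df.left.value_counts() should give the same two numbers as the shape checks above.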

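3. Analyze which fields actually affect whether employees leave

Group by the left field and take the mean of each numeric column

# numeric_only=True skips the text columns (Department, salary),
# which recent pandas versions refuse to average
df.groupby('left').mean(numeric_only=True)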

Output:

[Table: mean of each numeric column for left = 0 and left = 1]
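To make the grouped table easier to read, one option is to subtract the two rows and look at the gap per column. This is a sketch on made-up data (the toy dataframe and its values are assumptions for illustration, not the HR dataset):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "satisfaction_level": [0.9, 0.8, 0.3, 0.4],
    "salary": ["high", "low", "low", "medium"],
    "left": [0, 0, 1, 1],
})

# One row per value of left, one column per numeric field;
# the text column salary is skipped by numeric_only=True
means = toy.groupby("left").mean(numeric_only=True)

# Gap between the groups: mean for left == 1 minus mean for left == 0
gap = means.loc[1] - means.loc[0]

print(means)
print(gap)
```

A column whose gap is large relative to its typical values is a candidate feature; a column whose two means are almost equal probably carries little signal on its own.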

Use the pandas crosstab function to analyze the effect of salary on left, and plot the result as a bar chart

pd.crosstab(df.salary, df.left).plot(kind='bar')

[Bar chart: number of employees per salary level, split by left]

The chart shows that a larger share of high-salary employees stayed.
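Because the salary groups differ in size, raw counts can be hard to compare. A sketch of the same crosstab expressed as per-row proportions, again on made-up data (the toy dataframe is an assumption for illustration):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "salary": ["high", "high", "low", "low", "low", "medium"],
    "left":   [0, 0, 1, 1, 0, 0],
})

# normalize="index" divides each row by its row total, so every
# salary level sums to 1 across the left = 0 and left = 1 columns
rates = pd.crosstab(toy["salary"], toy["left"], normalize="index")

print(rates)
```

The column labelled 1 is then the fraction of each salary group that left, which is the quantity the bar chart is being used to judge.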

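Use the pandas crosstab function to analyze the effect of Department on left, and plot the result as a bar chart

pd.crosstab(df.Department, df.left).plot(kind='bar')

[Bar chart: number of employees per department, split by left]

The chart shows that the department has no obvious effect on left.

The analysis above looked at how the individual fields affect left. In summary, the fields selected in the next step have a relatively large effect on left, so they will be used to train the model.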

4. Keep the useful fields and build a new dataframe

df_new = df[['satisfaction_level', 'average_montly_hours', 'Work_accident', 'promotion_last_5years', 'salary']]
df_new.head()

Output:

[Table: the first five rows of df_new]

5. Use the pandas get_dummies function to turn the salary column into numeric columns

salary_dummies = pd.get_dummies(df.salary, prefix='salary')
df_with_dummies = pd.concat([df_new, salary_dummies], axis='columns')
df_with_dummies.head()

Output:

[Table: the first five rows of df_with_dummies, with the added salary_high, salary_low and salary_medium columns]

# Drop the original text column salary
df_with_dummies.drop('salary', axis='columns', inplace=True)
df_with_dummies.head()

Output:

[Table: the first five rows of df_with_dummies without the salary column]

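# Drop salary_medium: it is redundant, because a row with
# salary_high == 0 and salary_low == 0 must be medium
df_with_dummies.drop('salary_medium',axis='columns', inplace=True)
df_with_dummies.head()

Output:

[Table: the first five rows of df_with_dummies without the salary_medium column]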
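The reason one dummy column can be dropped without losing information is that the dummy columns of a row always sum to 1. A minimal sketch on made-up data (the toy dataframe is an assumption for illustration):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"salary": ["low", "medium", "high", "low"]})

# dtype=int gives 0/1 columns (newer pandas versions default to True/False)
dummies = pd.get_dummies(toy["salary"], prefix="salary", dtype=int)

# Drop one of the three columns, as in the step above
reduced = dummies.drop("salary_medium", axis="columns")

# The dropped column can be rebuilt from the two that remain
rebuilt_medium = 1 - reduced["salary_high"] - reduced["salary_low"]

print(reduced)
print(rebuilt_medium.tolist())
```

pd.get_dummies also accepts drop_first=True, which drops the first category automatically; dropping by name, as the article does, makes it explicit which level serves as the reference.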

6. Prepare the data used to train the model

X = df_with_dummies
X.head()

Output:

[Table: the first five rows of X]

y = df.left
y.head()

Output:

0    1
1    1
2    1
3    1
4    1
Name: left, dtype: int64

Step 2: Train the model

1. Hold out 20% of the data for testing

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
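The split above is random, so each run produces a different test set and a slightly different score. A sketch of a reproducible variant, assuming made-up data and two optional arguments the article does not use (random_state and stratify):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 15 rows with left == 0, 5 with left == 1
toy_X = pd.DataFrame({"feature": range(20)})
toy_y = pd.Series([0] * 15 + [1] * 5, name="left")

# random_state fixes the shuffle so the split is the same on every run;
# stratify keeps the 0/1 ratio the same in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)

print(len(X_tr), len(X_te))
print(y_te.value_counts().to_dict())
```

Stratifying is worth considering here because the classes are unbalanced (3571 employees left versus 11428 who stayed).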

2. Train the model with LogisticRegression

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
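Once fitted, the model can return both a class label and a probability for each row. A small sketch on made-up one-feature data (X_toy, y_toy and clf are hypothetical names used only for this illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data for illustration only: low values of the feature go with
# left == 1, high values with left == 0
X_toy = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression()
clf.fit(X_toy, y_toy)

new_rows = np.array([[0.15], [0.85]])

# predict() returns the class label for each row
pred = clf.predict(new_rows)

# predict_proba() returns one column per class, in the order of
# clf.classes_; each row sums to 1
proba = clf.predict_proba(new_rows)

print(pred)
print(proba)
```

On the HR data, model.predict_proba(X_test) would give, for each employee, the estimated probability of staying (first column) and of leaving (second column).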

3. Evaluate the model's accuracy

model.score(X_test, y_test)  # output: 0.7773333333333333
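To put this score in context, it can be compared with the accuracy of a trivial model that always predicts 0 (stayed). The sketch below uses only the two counts reported earlier in this article; the exact score above will vary from run to run because the train/test split is random.

```python
# Counts reported earlier in this article
n_left = 3571
n_retained = 11428

# Accuracy of a trivial model that always predicts 0 (stayed),
# measured over the whole dataset
baseline = n_retained / (n_left + n_retained)

print(round(baseline, 4))
```

This prints 0.7619, so the reported score of about 0.777 is roughly 1.5 percentage points above the majority-class baseline. Accuracy alone is therefore a weak measure on this dataset, and it may be worth also checking how many of the employees who left the model actually identifies.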
