清洗資料,我習慣用這 7 步!
資料清洗 (data cleaning) 是機器學習和深度學習進入演算法步前的一項重要任務,我平時比較習慣使用的 7 個步驟,總結如下:
Step1 : read csv
Step2 : preview data
Step3: check null value for every column
Step4: complete null value
Step5: feature engineering
Step 5.1: delete some features
Step 5.2: create new feature
Step6: encode for categories columns
Step 6.1: Sklearn LabelEncode
Step 6.2: Pandas get_dummies
Step 7: check for data cleaning
今天使用泰坦尼克資料集,完整介紹以上 7 步的具體操作過程。
1 讀入資料
這不廢話嗎,第一步就是讀入資料。
data_raw = pd.read_csv('../input/titanicdataset-traincsv/train.csv')
data_raw
結果:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q
891 rows × 12 columns
2 資料預覽
data_raw.info()
data_raw.describe(include='all')
結果:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
count 891.000000 891.000000 891.000000 891 891 714.000000 891.000000 891.000000 891 891.000000 204 889
unique NaN NaN NaN 891 2 NaN NaN NaN 681 NaN 147 3
top NaN NaN NaN Hakkarainen, Mr. Pekka Pietari male NaN NaN NaN 1601 NaN G6 S
freq NaN NaN NaN 1 577 NaN NaN NaN 7 NaN 4 644
mean 446.000000 0.383838 2.308642 NaN NaN 29.699118 0.523008 0.381594 NaN 32.204208 NaN NaN
std 257.353842 0.486592 0.836071 NaN NaN 14.526497 1.102743 0.806057 NaN 49.693429 NaN NaN
min 1.000000 0.000000 1.000000 NaN NaN 0.420000 0.000000 0.000000 NaN 0.000000 NaN NaN
25% 223.500000 0.000000 2.000000 NaN NaN 20.125000 0.000000 0.000000 NaN 7.910400 NaN NaN
50% 446.000000 0.000000 3.000000 NaN NaN 28.000000 0.000000 0.000000 NaN 14.454200 NaN NaN
75% 668.500000 1.000000 3.000000 NaN NaN 38.000000 1.000000 0.000000 NaN 31.000000 NaN NaN
max 891.000000 1.000000 3.000000 NaN NaN 80.000000 8.000000 6.000000 NaN 512.329200 NaN N
3 檢查null值
data1 = data_raw.copy(deep=True)
data1.isnull().sum()
結果:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Age 列 177 個空值,Cabin 687 個空值,一共才 891 行,估計沒啥價值了!Embarked 2 個。
4 補全空值
data1['Age'].fillna(data1['Age'].median(), inplace = True)
data1['Embarked'].fillna(data1['Embarked'].mode()[0], inplace = True)
data1.isnull().sum()
補全操作check:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
dtype: int64
5 特徵工程
5.1 幹掉 3 列:
drop_column = ['PassengerId','Cabin', 'Ticket']
data1.drop(drop_column, axis=1, inplace = True)
5.2 增加 3 列
增加一列 FamilySize
data1['FamilySize'] = data1 ['SibSp'] + data1['Parch'] + 1
data1
列印結果:
Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S 2
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C 2
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S 1
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S 2
4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S 1
... ... ... ... ... ... ... ... ... ... ...
886 0 2 Montvila, Rev. Juozas male 27.0 0 0 13.0000 S 1
887 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 30.0000 S 1
888 0 3 Johnston, Miss. Catherine Helen "Carrie" female 28.0 1 2 23.4500 S 4
889 1 1 Behr, Mr. Karl Howell male 26.0 0 0 30.0000 C 1
890 0 3 Dooley, Mr. Patrick male 32.0 0 0 7.7500 Q 1
891 rows × 10 columns
再建立一列:
data1['IsAlone'] = np.where(data1['FamilySize'] > 1,0,1)
再建立一列:
data1['Title'] = data1['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
data1
結果:
Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize IsAlone Title
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S 2 0 Mr
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C 2 0 Mrs
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S 1 1 Miss
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S 2 0 Mrs
4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S 1 1 Mr
... ... ... ... ... ... ... ... ... ... ... ... ...
886 0 2 Montvila, Rev. Juozas male 27.0 0 0 13.0000 S 1 1 Rev
887 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 30.0000 S 1 1 Miss
888 0 3 Johnston, Miss. Catherine Helen "Carrie" female 28.0 1 2 23.4500 S 4 0 Miss
889 1 1 Behr, Mr. Karl Howell male 26.0 0 0 30.0000 C 1 1 Mr
890 0 3 Dooley, Mr. Patrick male 32.0 0 0 7.7500 Q 1 1 Mr
891 rows × 12 columns
5.3 分箱走起
data1['FareCut'] = pd.qcut(data1['Fare'], 4)
data1['AgeCut'] = pd.cut(data1['Age'].astype(int), 6)
data1
結果:
Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize IsAlone Title FareCut AgeCut
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S 2 0 Mr (-0.001, 7.91] (13.333, 26.667]
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C 2 0 Mrs (31.0, 512.329] (26.667, 40.0]
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S 1 1 Miss (7.91, 14.454] (13.333, 26.667]
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S 2 0 Mrs (31.0, 512.329] (26.667, 40.0]
4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S 1 1 Mr (7.91, 14.454] (26.667, 40.0]
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 0 2 Montvila, Rev. Juozas male 27.0 0 0 13.0000 S 1 1 Rev (7.91, 14.454] (26.667, 40.0]
887 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 30.0000 S 1 1 Miss (14.454, 31.0] (13.333, 26.667]
888 0 3 Johnston, Miss. Catherine Helen "Carrie" female 28.0 1 2 23.4500 S 4 0 Miss (14.454, 31.0] (26.667, 40.0]
889 1 1 Behr, Mr. Karl Howell male 26.0 0 0 30.0000 C 1 1 Mr (14.454, 31.0] (13.333, 26.667]
890 0 3 Dooley, Mr. Patrick male 32.0 0 0 7.7500 Q 1 1 Mr (-0.001, 7.91] (26.667, 40.0]
891 rows × 14 columns
6 編碼
6.1 LabelEncoder 方法
使用 Sklearn 的 LabelEncoder
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
data1['Sex_Code'] = label.fit_transform(data1['Sex'])
data1['Embarked_Code'] = label.fit_transform(data1['Embarked'])
data1['Title_Code'] = label.fit_transform(data1['Title'])
data1['AgeBin_Code'] = label.fit_transform(data1['AgeCut'])
data1['FareBin_Code'] = label.fit_transform(data1['FareCut'])
data1
結果 data1 選取某些列,演算法模型終於能認出它們了,多不容易!
6.2 get_dummies 方法
get_dummies 將長 DataFrame 變為寬 DataFrame:
pd.get_dummies(data1['Sex'])
結果:
female male
0 0 1
1 1 0
2 1 0
3 1 0
4 0 1
... ... ...
886 0 1
887 1 0
888 1 0
889 0 1
890 0 1
891 rows × 2 columns
而 LabelEncoder 編碼後,僅僅是把 Female 編碼為 0, male 編碼為 1.
label.fit_transform(data1['Sex'])
0 1
1 0
2 0
3 0
4 1
..
886 1
887 0
888 0
889 1
890 1
Name: Sex_Code, Length: 891, dtype: int64
7 再 check
# Step 7: data cleaning check
data1[data1_x_alg].info()
print('-'*50)
data1_dummy.info()
結果:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Sex_Code 891 non-null int64
Pclass 891 non-null int64
Embarked_Code 891 non-null int64
Title_Code 891 non-null int64
SibSp 891 non-null int64
Parch 891 non-null int64
Age 891 non-null float64
Fare 891 non-null float64
dtypes: float64(2), int64(6)
memory usage: 55.8 KB
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 29 columns):
Pclass 891 non-null int64
SibSp 891 non-null int64
Parch 891 non-null int64
Age 891 non-null float64
Fare 891 non-null float64
FamilySize 891 non-null int64
IsAlone 891 non-null int64
Sex_female 891 non-null uint8
Sex_male 891 non-null uint8
Embarked_C 891 non-null uint8
Embarked_Q 891 non-null uint8
Embarked_S 891 non-null uint8
Title_Capt 891 non-null uint8
Title_Col 891 non-null uint8
Title_Don 891 non-null uint8
Title_Dr 891 non-null uint8
Title_Jonkheer 891 non-null uint8
Title_Lady 891 non-null uint8
Title_Major 891 non-null uint8
Title_Master 891 non-null uint8
Title_Miss 891 non-null uint8
Title_Mlle 891 non-null uint8
Title_Mme 891 non-null uint8
Title_Mr 891 non-null uint8
Title_Mrs 891 non-null uint8
Title_Ms 891 non-null uint8
Title_Rev 891 non-null uint8
Title_Sir 891 non-null uint8
Title_the Countess 891 non-null uint8
dtypes: float64(2), int64(5), uint8(22)
memory usage: 68.0 KB
Great !
Done~
原創不易,點個在看
相關文章
- 還在為資料清洗抓狂?這裡有一個簡單實用的清洗程式碼集
- 資料清洗
- 機器學習-資料清洗機器學習
- 用了這麼多年MySql,這些好習慣你用過哪些MySql
- tensorflow資料清洗
- 資料清洗經驗
- 乾貨來了!快速教你資料清洗的步驟及方法
- 資料清洗和資料處理
- 八個機器學習資料清洗機器學習
- 資料清洗有哪些方法?
- 資料清洗如何測試?
- 實用好習慣,讓一個人每天進步
- 我的前端編碼習慣 —— html篇前端HTML
- 大資料技術與應用課堂測試-資料清洗同步大資料
- 資料治理為什麼要清洗資料
- 資料質量管理--資料抽取和清洗
- 資料清洗處理-常用操作
- 資料清洗的方法有哪些?
- 掌握這些技巧,讓Excel批次資料清洗變得簡單高效!Excel
- 7個高效的API和服務管理習慣API
- 資料管理:業務資料清洗,落地實現方案
- 高效工作,這些習慣你值得擁有!
- 資料清洗和準備 (待更新)
- python 操作 excel 之資料清洗PythonExcel
- 常用資料清洗方法大盤點
- 一個spark清洗資料的demoSpark
- 使用Mysql工具進行資料清洗MySql
- 機器學習之資料清洗與特徵提取機器學習特徵
- 畢業設計三:資料清洗
- 深度學習學習7步驟深度學習
- 我的編碼習慣 —— Controller規範Controller
- 優秀程式設計師7個好習慣程式設計師
- 低效程式設計師的7個壞習慣程式設計師
- 專注深度學習模型精度和效能提升、資料採集清洗深度學習模型
- Scikit-Learn 與 TensorFlow 機器學習實用指南學習筆記 3 —— 資料獲取與清洗機器學習筆記
- 資料分析-pandas資料處理清洗常用總結
- Linux入門,這七大習慣得有!Linux
- 全球手機使用者使用習慣——資料資訊圖