To build a good model, you first need to clean your data. This is a step-by-step guide to data cleaning in Python.
Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It identifies the incomplete, inaccurate, or irrelevant parts of the data, then replaces, modifies, or deletes that dirty data.
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/a557e06365a29abdd1d95857033ef37a15c9603bb09795a845a7b20fc1758f0b.jpg)
- Missing data;
- Irregular data (outliers);
- Unnecessary data: repetitive data, duplicate data, and so on;
- Inconsistent data: capitalization, addresses, and so on.

A toy example illustrating these four problems follows the figure below.
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/682d03c747ac8f6b9a98f43090cca5ee6c7d7c450d0b4087d65c026c1c645a31.png)
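To make the four categories concrete, here is a tiny DataFrame invented purely for illustration (the article itself works with the Sberbank housing data) that exhibits each problem at once:

```python
import pandas as pd
import numpy as np

# a toy frame, invented for illustration only
toy = pd.DataFrame({
    'city': ['Toronto', 'toronto', 'Toronto', None],  # inconsistent capitalization + a missing value
    'price': [500000, 500000, 99999999, np.nan],      # a duplicate value, an outlier, and a missing value
})
print(toy)
```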
This article was written by Lianne & Justin.
```python
# import packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib

plt.style.use('ggplot')
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12, 8)
pd.options.mode.chained_assignment = None

# read the data
df = pd.read_csv('sberbank.csv')

# shape and data types of the data
print(df.shape)
print(df.dtypes)

# select numeric columns
df_numeric = df.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
print(numeric_cols)

# select non numeric columns
df_non_numeric = df.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
print(non_numeric_cols)
```
```python
cols = df.columns[:30]  # first 30 columns
# specify the colours - yellow is missing, blue is not missing.
colours = ['#000099', '#ffff00']
sns.heatmap(df[cols].isnull(), cmap=sns.color_palette(colours))
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/eb5f71c20b2159d0cc5a21680f75be480b5cba8848c39f9a0e2201020558910c.jpg)
```python
# if it's a larger dataset and the visualization takes too long, do this instead.
# % of missing per column.
for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing * 100)))
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/8ccce06b4f63a4c60ca9fa123d9c6c36f91e302bca694fc63a3bb1ecfe1ea836.png)
Percentage of missing data for the first 30 features
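As a shorthand (not in the original notebook), the same percentages can be computed in a single pandas expression:

```python
# equivalent one-liner: percentage of missing values per column, largest first
print((df.isnull().mean() * 100).round().sort_values(ascending=False))
```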
```python
# first create a missing indicator for features with missing data
for col in df.columns:
    missing = df[col].isnull()
    num_missing = np.sum(missing)
    if num_missing > 0:
        print('created missing indicator for: {}'.format(col))
        df['{}_ismissing'.format(col)] = missing

# then, based on the indicators, plot the histogram of missing values per row
ismissing_cols = [col for col in df.columns if 'ismissing' in col]
df['num_missing'] = df[ismissing_cols].sum(axis=1)
df['num_missing'].value_counts().sort_index().plot.bar()
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/cbafb7dcf1291a9ccf22821da8db95588796941f747d1c6a7b338eff502b6313.png)
Histogram of missing data
```python
# drop rows with a lot of missing values.
ind_missing = df[df['num_missing'] > 35].index
df_less_missing_rows = df.drop(ind_missing, axis=0)
```
```python
# hospital_beds_raion has a lot of missing values.
# if we want to drop the whole column:
cols_to_drop = ['hospital_beds_raion']
df_less_hos_beds_raion = df.drop(cols_to_drop, axis=1)
```
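As an aside, pandas' built-in `dropna` can express both kinds of drop declaratively. This is a sketch of that alternative, not what the original notebook does; the 50% column cut-off in particular is an assumption:

```python
# keep rows that have at least (number of columns - 35) non-null values,
# roughly mirroring the num_missing > 35 rule above.
df_less_missing_rows_alt = df.dropna(axis=0, thresh=df.shape[1] - 35)

# drop any column that is more than 50% missing (threshold is an assumption).
df_less_sparse_cols = df.dropna(axis=1, thresh=int(len(df) * 0.5))
```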
```python
# replace missing values with the median.
med = df['life_sq'].median()
print(med)
df['life_sq'] = df['life_sq'].fillna(med)
```
```python
# impute the missing values and create missing-value indicator variables for each numeric column.
df_numeric = df.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values

for col in numeric_cols:
    missing = df[col].isnull()
    num_missing = np.sum(missing)
    if num_missing > 0:  # only impute columns that have missing values.
        print('imputing missing values for: {}'.format(col))
        df['{}_ismissing'.format(col)] = missing
        med = df[col].median()
        df[col] = df[col].fillna(med)
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/fdbe697e2d7459928eff1feed1d5b3eaf3de8a0e76abb50e3936d74a79edb121.png)
```python
# impute the missing values and create missing-value indicator variables for each non-numeric column.
df_non_numeric = df.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values

for col in non_numeric_cols:
    missing = df[col].isnull()
    num_missing = np.sum(missing)
    if num_missing > 0:  # only impute columns that have missing values.
        print('imputing missing values for: {}'.format(col))
        df['{}_ismissing'.format(col)] = missing
        top = df[col].describe()['top']  # impute with the most frequent value.
        df[col] = df[col].fillna(top)
```
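The two loops above can also be expressed with scikit-learn's `SimpleImputer`. This is a sketch of that alternative (scikit-learn is not used in the original, and it assumes no column is entirely missing):

```python
from sklearn.impute import SimpleImputer

# median for numeric columns, most frequent value for the rest
num_cols = df.select_dtypes(include=[np.number]).columns
cat_cols = df.select_dtypes(exclude=[np.number]).columns

df[num_cols] = SimpleImputer(strategy='median').fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(df[cat_cols])
```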
```python
# categorical
df['sub_area'] = df['sub_area'].fillna('_MISSING_')

# numeric
df['life_sq'] = df['life_sq'].fillna(-999)
```
```python
# histogram of life_sq.
df['life_sq'].hist(bins=100)
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/257d4ee867af6c4644a0601204a91d775d9d74118a1fb9a205d8a4a8c635a071.png)
Histogram
```python
# box plot.
df.boxplot(column=['life_sq'])
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/f2c514c776b3869d0a7c732596067459a7b29309d6f1013f54c9827f4f6581ce.png)
Box plot
```python
df['life_sq'].describe()
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/7afb6b0e2a1dc89de9394b111d17a24be561c9ae82d2956f762457febb9ea13a.png)
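Beyond eyeballing the histogram, box plot, and descriptive statistics, a common follow-up (an addition here, not part of the original) is to flag outliers numerically with the 1.5 × IQR rule:

```python
# flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as potential outliers
q1 = df['life_sq'].quantile(0.25)
q3 = df['life_sq'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df['life_sq'] < lower) | (df['life_sq'] > upper)]
print('{} potential outliers in life_sq'.format(len(outliers)))
```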
```python
# bar chart - distribution of a categorical variable
df['ecology'].value_counts().plot.bar()
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/e7846052b2fdfc77c1c83893f425abf6f9552551e0edb919c0994b5a1a6f3baf.png)
```python
num_rows = len(df.index)
low_information_cols = []

for col in df.columns:
    cnts = df[col].value_counts(dropna=False)
    top_pct = (cnts / num_rows).iloc[0]
    if top_pct > 0.95:
        low_information_cols.append(col)
        print('{0}: {1:.5f}%'.format(col, top_pct * 100))
        print(cnts)
        print()
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/bbf602454147ef07ed9c4d496d717e567f5ef68c7d523844b00b32814fc2b152.png)
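Whether to drop these columns is a judgment call that depends on the modeling task; if they are deemed uninformative, removing them is one line (a sketch, not a prescription):

```python
# drop the columns flagged above as low-information
df_higher_information = df.drop(columns=low_information_cols)
print(df_higher_information.shape)
```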
```python
# we know that column 'id' is unique, but what if we drop it before de-duplicating?
df_dedupped = df.drop('id', axis=1).drop_duplicates()

# there were duplicate rows
print(df.shape)
print(df_dedupped.shape)
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/1d713a96dcd6363a89d58d9de800e6b3e8ebf575e8c5dbb459d57fb4b825bbb8.png)
```python
key = ['timestamp', 'full_sq', 'life_sq', 'floor', 'build_year', 'num_room', 'price_doc']
df.fillna(-999).groupby(key)['id'].count().sort_values(ascending=False).head(20)
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/a9c67fb03f783c4aed58278c56c951fa36ab1de771ab9ae45233063b470439fd.png)
```python
# drop duplicates based on a subset of variables.
key = ['timestamp', 'full_sq', 'life_sq', 'floor', 'build_year', 'num_room', 'price_doc']
df_dedupped2 = df.drop_duplicates(subset=key)

print(df.shape)
print(df_dedupped2.shape)
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/4f3bccaae917e110a772886cb75cd119557dddfdb38f6aa9d104a1d49376ff75.png)
```python
df['sub_area'].value_counts(dropna=False)
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/f215e4c5843d190866bc3d16fd07dc196051e889233155106ad4e4b0294ad373.png)
```python
# make everything lower case.
df['sub_area_lower'] = df['sub_area'].str.lower()
df['sub_area_lower'].value_counts(dropna=False)
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/6df8bcfd57a029ee0fe8c5866b9a006a259b9227efbb57a8071977dca65a2c08.png)
```python
df
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/752d7f748153cee719934d6534caf6ec74a1bdf58da9a44aa1d859c67e465e1b.png)
```python
df['timestamp_dt'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d')
df['year'] = df['timestamp_dt'].dt.year
df['month'] = df['timestamp_dt'].dt.month
df['weekday'] = df['timestamp_dt'].dt.weekday

print(df['year'].value_counts(dropna=False))
print()
print(df['month'].value_counts(dropna=False))
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/5e1fb49295073f6d7160f90d7faa66db0f0104668a165ec8fc7c49a1ba974b25.png)
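One defensive variant worth knowing (an addition here, not in the original): `errors='coerce'` turns unparseable timestamps into `NaT` instead of raising, so malformed rows can be inspected rather than crashing the conversion:

```python
# coerce unparseable timestamps to NaT, then count how many failed to parse
df['timestamp_dt'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d', errors='coerce')
print(df['timestamp_dt'].isnull().sum(), 'timestamps failed to parse')
```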
```python
from nltk.metrics import edit_distance

df_city_ex = pd.DataFrame(data={'city': ['torontoo', 'toronto', 'tronto', 'vancouver',
                                         'vancover', 'vancouvr', 'montreal', 'calgary']})

df_city_ex['city_distance_toronto'] = df_city_ex['city'].map(lambda x: edit_distance(x, 'toronto'))
df_city_ex['city_distance_vancouver'] = df_city_ex['city'].map(lambda x: edit_distance(x, 'vancouver'))
df_city_ex
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/9d1620d9205d60bb16ebb4b0e8e6f8a2ff81a7ba0ca6b91ca6c9011c6e9899b6.png)
```python
msk = df_city_ex['city_distance_toronto'] <= 2
df_city_ex.loc[msk, 'city'] = 'toronto'

msk = df_city_ex['city_distance_vancouver'] <= 2
df_city_ex.loc[msk, 'city'] = 'vancouver'

df_city_ex
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/9013713baba3ce082d21f33ed55b2b72d864c2e5f05d036bc5d24972566eaf49.png)
```python
# there is no address column in the housing dataset, so create one to demonstrate the code.
df_add_ex = pd.DataFrame(['123 MAIN St Apartment 15', '123 Main Street Apt 12 ',
                          '543 FirSt Av', ' 876 FIRst Ave.'], columns=['address'])
df_add_ex
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/b5c179adfd765a7e223a3fd408308c9fde75be60855f50e1e15785db39bc1b85.png)
```python
df_add_ex['address_std'] = df_add_ex['address'].str.lower()
df_add_ex['address_std'] = df_add_ex['address_std'].str.strip()  # remove leading and trailing whitespace.
df_add_ex['address_std'] = df_add_ex['address_std'].str.replace(r'\.', '', regex=True)  # remove periods.
df_add_ex['address_std'] = df_add_ex['address_std'].str.replace(r'\bstreet\b', 'st', regex=True)  # replace street with st.
df_add_ex['address_std'] = df_add_ex['address_std'].str.replace(r'\bapartment\b', 'apt', regex=True)  # replace apartment with apt.
df_add_ex['address_std'] = df_add_ex['address_std'].str.replace(r'\bav\b', 'ave', regex=True)  # replace av with ave.
df_add_ex
```
![Missing, messy, or duplicate data? The ultimate data cleaning guide](https://i.iter01.com/images/890888e13d388c3ebc76d1c80e19503287aeb0157d9af463a50640227e964871.png)