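pandas Study Task 07: Missing Data

Posted by 埋在地裡的小土豆 on 2021-01-03

Original URL: https://blog.csdn.net/qq_36559719/article/details/112162441

These are my study notes on chapter 7 of the pandas course, missing data, from the Datawhale study group. They are for reference only, so please be kind if they are not to your taste.
DataWhale

Chapter 7: Missing Data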

import numpy as np
import pandas as pd

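I. Counting and Dropping Missing Values

1. Counting missing values

Use isna or isnull (the two functions are identical) to check whether each cell is missing. Combined with mean, this gives the proportion of missing values in each column: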

df = pd.read_csv(r'C:\Users\zhoukaiwei\Desktop\joyful-pandas\data\learn_pandas.csv',
                     usecols = ['Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer'])
df.head()

	Grade	Name	Gender	Height	Weight	Transfer
0	Freshman	Gaopeng Yang	Female	158.9	46.0	N
1	Freshman	Changqiang You	Male	166.5	70.0	N
2	Senior	Mei Sun	Male	188.9	89.0	N
3	Sophomore	Xiaojuan Sun	Female	NaN	41.0	N
4	Sophomore	Gaojuan You	Male	174.0	74.0	N

df.isna().head()

	Grade	Name	Gender	Height	Weight	Transfer
0	False	False	False	False	False	False
1	False	False	False	False	False	False
2	False	False	False	False	False	False
3	False	False	False	True	False	False
4	False	False	False	False	False	False

df.isna().mean()  # proportion of missing values per column

Grade       0.000
Name        0.000
Gender      0.000
Height      0.085
Weight      0.055
Transfer    0.060
dtype: float64
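
As a minimal sketch of the same idea that does not depend on the CSV file, the toy frame below (its values are made up for illustration) shows that isna().sum() gives the per-column count of missing cells, while isna().mean() gives the proportion:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from learn_pandas.csv
toy = pd.DataFrame({
    'Height': [158.9, np.nan, 188.9, np.nan],
    'Weight': [46.0, 70.0, np.nan, 41.0],
})

na_count = toy.isna().sum()   # number of missing cells per column
na_ratio = toy.isna().mean()  # share of missing cells per column

print(na_count)
print(na_ratio)
```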

To view the rows where a particular column is or is not missing, use isna or notna on the Series as a boolean index. For example, the rows with a missing height:

df[df.Height.isna()].head()

	Grade	Name	Gender	Height	Weight	Transfer
3	Sophomore	Xiaojuan Sun	Female	NaN	41.0	N
12	Senior	Peng You	Female	NaN	48.0	NaN
26	Junior	Yanli You	Female	NaN	48.0	N
36	Freshman	Xiaojuan Qin	Male	NaN	79.0	Y
60	Freshman	Yanpeng Lv	Male	NaN	65.0	N

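To retrieve, across several columns at once, the rows where all values are missing, at least one is missing, or none is missing, combine isna and notna with any and all.

sub_set = df[['Height', 'Weight', 'Transfer']]
df[sub_set.isna().all(axis=1)]  # all missing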

	Grade	Name	Gender	Height	Weight	Transfer
102	Junior	Chengli Zhao	Male	NaN	NaN	NaN

df[sub_set.isna().any(axis=1)].head()  # at least one missing

	Grade	Name	Gender	Height	Weight	Transfer
3	Sophomore	Xiaojuan Sun	Female	NaN	41.0	N
9	Junior	Juan Xu	Female	164.8	NaN	N
12	Senior	Peng You	Female	NaN	48.0	NaN
21	Senior	Xiaopeng Shen	Male	166.0	62.0	NaN
26	Junior	Yanli You	Female	NaN	48.0	N

df[sub_set.notna().all(axis=1)].head()  # none missing

	Grade	Name	Gender	Height	Weight	Transfer
0	Freshman	Gaopeng Yang	Female	158.9	46.0	N
1	Freshman	Changqiang You	Male	166.5	70.0	N
2	Senior	Mei Sun	Male	188.9	89.0	N
4	Sophomore	Gaojuan You	Male	174.0	74.0	N
5	Freshman	Xiaoli Qian	Female	158.0	51.0	N
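
The three selections can be checked on a small made-up frame (the labels and values are illustrative assumptions). Passing the axis by keyword, as in axis=1, avoids relying on a positional argument:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    'Height': [158.9, np.nan, np.nan],
    'Weight': [46.0, 70.0, np.nan],
}, index=['x', 'y', 'z'])

mask = toy.isna()
all_missing = toy[mask.all(axis=1)]           # every column is missing
any_missing = toy[mask.any(axis=1)]           # at least one column is missing
none_missing = toy[toy.notna().all(axis=1)]   # no column is missing

print(list(all_missing.index), list(any_missing.index), list(none_missing.index))
```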

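2. Dropping missing values

Data processing often requires dropping rows (samples) or columns (features) according to the number or proportion of missing values, or some other criterion. pandas provides the dropna function for this.

The main parameters of dropna are the axis direction axis (default 0, which drops rows), the drop mode how, the non-missing threshold thresh (a row or column along the chosen axis is dropped if it has fewer non-missing values than this), and the candidate subset subset. how accepts either any or all.

# Drop rows where at least one of Height and Weight is missing:
res = df.dropna(how = 'any', subset = ['Height', 'Weight'])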
res.shape

(174, 6)

# Drop columns with more than 15 missing values:
res = df.dropna(axis=1, thresh=df.shape[0]-15)  # Height is dropped
res.head()

	Grade	Name	Gender	Weight	Transfer
0	Freshman	Gaopeng Yang	Female	46.0	N
1	Freshman	Changqiang You	Male	70.0	N
2	Senior	Mei Sun	Male	89.0	N
3	Sophomore	Xiaojuan Sun	Female	41.0	N
4	Sophomore	Gaojuan You	Male	74.0	N
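
To make the role of how, subset and thresh concrete, here is a minimal sketch on a made-up frame (column names and values are assumptions for illustration), where column b has two missing values and column c has one:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'b' has 2 missing values, 'c' has 1
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, 3.0, 4.0],
    'c': [1.0, np.nan, 3.0, 4.0],
})

rows_any = toy.dropna(how='any')                 # drop rows with any missing value
rows_sub = toy.dropna(how='any', subset=['c'])   # only column 'c' is inspected
# Keep columns with at least len(toy) - 1 non-missing values,
# i.e. drop columns with more than 1 missing value
cols_thresh = toy.dropna(axis=1, thresh=len(toy) - 1)

print(rows_any.shape, rows_sub.shape, list(cols_thresh.columns))
```

Only column b exceeds the allowed number of missing values, so it is the only column removed by the thresh call.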

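II. Filling and Interpolating Missing Values

1. Filling with fillna

Three parameters of fillna are commonly used: value, method, and limit. value is the fill value, either a scalar or a dict mapping index labels to values. method is the fill strategy, either ffill (fill with the preceding element) or bfill (fill with the following element). limit is the maximum number of consecutive missing values to fill.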

s = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan],list('aaabcd'))
s

a    NaN
a    1.0
a    NaN
b    NaN
c    2.0
d    NaN
dtype: float64

s.fillna(method='ffill')  # propagate the previous value forward

a    NaN
a    1.0
a    1.0
b    1.0
c    2.0
d    2.0
dtype: float64

s.fillna(method='ffill', limit=1)  # fill at most one value in each run of consecutive missing values

a    NaN
a    1.0
a    1.0
b    NaN
c    2.0
d    2.0
dtype: float64

s.fillna(s.mean())  # value is a scalar

a    1.5
a    1.0
a    1.5
b    1.5
c    2.0
d    1.5
dtype: float64

s.fillna({'a': 100, 'd': 200})  # fill values mapped by index label

a    100.0
a      1.0
a    100.0
b      NaN
c      2.0
d    200.0
dtype: float64
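
Forward and backward filling are also available as the dedicated methods ffill and bfill. As far as I am aware, recent pandas releases warn that the method argument of fillna is deprecated in favor of them, so the sketch below repeats the fills on the same Series using those methods:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan], list('aaabcd'))

forward = s.ffill()               # propagate the previous valid value
forward_once = s.ffill(limit=1)   # at most one fill per run of missing values
backward = s.bfill()              # fill with the next valid value

print(forward.tolist())
print(forward_once.tolist())
print(backward.tolist())
```

The leading NaN survives the forward fills because nothing precedes it, and the trailing NaN survives the backward fill for the mirror-image reason.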

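2. Interpolation

Besides the interpolation method (linear by default), interpolate has two commonly used parameters similar to those of fillna: limit_direction, which controls the direction, and limit, which caps the number of consecutive missing values to interpolate. The direction defaults to forward, similar to ffill in the method parameter of fillna; specify backward or both for backward-limited or bidirectional interpolation.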

s = pd.Series([np.nan, np.nan, 1,np.nan, np.nan, np.nan,2, np.nan, np.nan])
s.values

array([nan, nan,  1., nan, nan, nan,  2., nan, nan])

# With the default linear method, interpolate in both directions, filling at most 1 value from each side of a run of consecutive missing values:
res = s.interpolate(limit_direction='both', limit=1)
res.values

array([ nan, 1.  , 1.  , 1.25,  nan, 1.75, 2.  , 2.  ,  nan])

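The second common kind of interpolation is nearest-neighbor, where a missing element takes the value of the closest non-missing element (this method requires SciPy to be installed):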

s.interpolate('nearest').values

array([nan, nan,  1.,  1.,  1.,  2.,  2., nan, nan])

Finally, index interpolation performs linear interpolation according to the index values:

s = pd.Series([0,np.nan,10],index=[0,1,10])
s

0      0.0
1      NaN
10    10.0
dtype: float64

s.interpolate()  # default linear interpolation treats the points as equally spaced, so it returns the midpoint

0      0.0
1      5.0
10    10.0
dtype: float64

s.interpolate(method='index')  # interpolates according to the index values

0      0.0
1      1.0
10    10.0
dtype: float64
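
The difference between the two results can be verified by hand. The sketch below recomputes the index-based value with the usual two-point linear formula; the variable names are my own:

```python
import numpy as np
import pandas as pd

s = pd.Series([0, np.nan, 10], index=[0, 1, 10])

linear = s.interpolate()                   # positions treated as equally spaced
by_index = s.interpolate(method='index')   # index values used as x-coordinates

# Manual check of the index-based value at label 1:
# y = y0 + (x - x0) / (x1 - x0) * (y1 - y0)
manual = 0 + (1 - 0) / (10 - 0) * (10 - 0)

print(linear.loc[1], by_index.loc[1], manual)
```

Label 1 lies one tenth of the way from 0 to 10, so the index-based result is 1.0 rather than the midpoint 5.0.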

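III. Nullable Types

1. Missing-value markers and their shortcomings

In Python, a missing value is represented by None, which is equal to itself but unequal to any other element: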

None == None

True

None == False

False

None == []

False

None == ''

False

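Note that although np.nan positions return False when the elements of a Series or DataFrame containing missing values are compared, the equals function, which tests whether two DataFrames or two Series are identical, skips positions where both sides are missing and treats them as equal, so it can still return True: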

s1 = pd.Series([1, np.nan])
s2 = pd.Series([1, 2])
s3 = pd.Series([1, np.nan])
s1 == 1

0     True
1    False
dtype: bool

s1.equals(s2)

False

s1.equals(s3)

True
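
The contrast is easiest to see by comparing an element-wise test with equals on the same pair of Series. A minimal sketch, reusing s1 and s3 from above:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, np.nan])
s3 = pd.Series([1, np.nan])

elementwise = (s1 == s3)        # NaN == NaN is False at position 1
all_equal = elementwise.all()   # so the element-wise test fails
same = s1.equals(s3)            # equals treats NaN in the same position as equal

print(np.nan == np.nan, all_equal, same)
```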

2. Properties of Nullable types

pd.Series([np.nan, 1], dtype = 'Int64')  # note the capital "I"

0    <NA>
1       1
dtype: Int64

pd.Series([np.nan, True], dtype = 'boolean')

0    <NA>
1    True
dtype: boolean

pd.Series([np.nan, 'my_str'], dtype = 'string')

0      <NA>
1    my_str
dtype: string

# For an Int64 Series, results stay Nullable whenever possible:
pd.Series([np.nan, 0], dtype = 'Int64') + 1

0    <NA>
1       1
dtype: Int64

pd.Series([np.nan, 0], dtype = 'Int64') == 0

0    <NA>
1    True
dtype: boolean

pd.Series([np.nan, 0], dtype = 'Int64') * 0.5  # the result must be floating point (float64 here; newer pandas versions return the nullable Float64)

0    NaN
1    0.0
dtype: float64
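
One practical consequence is worth spelling out: with the default dtypes, a single missing value forces an integer Series to float64, whereas Int64 keeps the integers and marks the gap with <NA>. A minimal sketch of the contrast (variable names are my own):

```python
import numpy as np
import pandas as pd

default = pd.Series([1, np.nan])                   # upcast to float64
nullable = pd.Series([1, np.nan], dtype='Int64')   # stays integer, with <NA>

plus_one = nullable + 1   # arithmetic keeps the Int64 dtype
is_one = nullable == 1    # comparison yields the nullable boolean dtype

print(default.dtype, nullable.dtype, plus_one.dtype, is_one.dtype)
```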

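A boolean Series differs from a bool Series in two main ways:

First, a boolean list containing missing values cannot be used for selection in an indexer, whereas a boolean Series treats missing values as False: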

s = pd.Series(['a', 'b'])
s_bool = pd.Series([True, np.nan])
s_boolean = pd.Series([True, np.nan]).astype('boolean')
s[s_boolean]

0    a
dtype: object

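Second, in logical operations the bool type always returns False at missing positions, whereas boolean returns a value depending on whether the operation has a uniquely determined result. What does that mean? A simple example: True | pd.NA must be True whatever the missing value is; the result of False | pd.NA depends on the missing value, so it returns pd.NA; False & pd.NA must be False whatever the missing value is.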

s_boolean & True

0    True
1    <NA>
dtype: boolean

s_boolean | True

0    True
1    True
dtype: boolean

~s_boolean  # the negation of a missing value cannot be determined either

0    False
1     <NA>
dtype: boolean
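
The same rule can be tried directly on scalars. The sketch below evaluates the four combinations of True/False with pd.NA under | and &; the variable names are my own:

```python
import pandas as pd

# Results that do not depend on the unknown value are determined
or_true = True | pd.NA      # True, whatever the missing value is
and_false = False & pd.NA   # False, whatever the missing value is

# Results that depend on the unknown value stay missing
or_false = False | pd.NA    # <NA>
and_true = True & pd.NA     # <NA>

print(or_true, and_false, or_false, and_true)
```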

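3. Computing and grouping with missing data

When sum and prod are called, missing values are treated as 0 for addition and 1 for multiplication respectively, so they do not change the result:

s = pd.Series([2,3,np.nan,4,5])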
s.sum()

14.0

s.prod()

120.0

s.cumsum()  # cumulative functions skip missing values and leave NaN in place

0     2.0
1     5.0
2     NaN
3     9.0
4    14.0
dtype: float64
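
A quick way to confirm the "treated as 0 and 1" reading is to fill the gaps explicitly with those identity elements and compare. The sketch also shows the skipna parameter, which makes the missing value propagate instead:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, np.nan, 4, 5])

# Skipping NaN gives the same totals as filling with the identity element
sum_default = s.sum()
sum_filled = s.fillna(0).sum()
prod_default = s.prod()
prod_filled = s.fillna(1).prod()

# With skipna=False the missing value propagates to the result
sum_strict = s.sum(skipna=False)

print(sum_default, sum_filled, prod_default, prod_filled, sum_strict)
```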

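In operations on a single scalar, every result is missing except for np.nan ** 0 and 1 ** np.nan, which have definite values (pd.NA behaves the same way). In comparisons, np.nan always returns False, whereas pd.NA returns pd.NA: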

np.nan == 0

False

pd.NA == 0

<NA>

np.nan > 0

False

pd.NA > 0

<NA>

np.nan + 1

nan

np.log(np.nan)

nan

np.add(np.nan, 1)

nan

np.nan ** 0

1.0

pd.NA ** 0

1

1 ** np.nan

1.0

1 ** pd.NA

1

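diff and pct_change are functionally similar but handle missing values differently. The former sets every result that involves a missing value to missing, while the latter gives the missing position a 0% rate of change, because in the pandas version used here the missing value is forward-filled by default before the change is computed: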

s.diff()

0    NaN
1    1.0
2    NaN
3    NaN
4    1.0
dtype: float64

s.pct_change()

0         NaN
1    0.500000
2    0.000000
3    0.333333
4    0.250000
dtype: float64
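
The default filling behavior of pct_change appears to differ between pandas releases (an observation about newer versions, not something these notes cover), so the sketch below reproduces both outputs explicitly with shift and ffill rather than relying on that default:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, np.nan, 4, 5])

# diff: any difference that involves a missing value is missing
manual_diff = s - s.shift()

# pct_change as shown above: forward-fill first, then compare with the previous value
filled = s.ffill()
manual_pct = filled / filled.shift() - 1

print(manual_diff.tolist())
print(manual_pct.round(6).tolist())
```

After the forward fill, position 2 holds the same value as position 1, which is where the 0% change comes from.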
