Pandas高階教程之:處理缺失資料

flydean發表於2021-06-24

原文網址 : https://www.cnblogs.com/flydean/p/14925464.html

簡介

在資料處理中，Pandas會將無法解析的資料或者缺失的資料使用NaN來表示。雖然所有的資料都有了相應的表示，但是NaN很明顯是無法進行數學運算的。

本文將會講解Pandas對於NaN資料的處理方法。

NaN的例子

上面講到了缺失的資料會被表現為NaN，我們來看一個具體的例子：

我們先來構建一個DF：

In [1]: df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   ...:                   columns=['one', 'two', 'three'])
   ...: 

In [2]: df['four'] = 'bar'

In [3]: df['five'] = df['one'] > 0

In [4]: df
Out[4]: 
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
c -1.135632  1.212112 -0.173215  bar  False
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
h  0.721555 -0.706771 -1.039575  bar   True

上面DF只有acefh這幾個index，我們重新index一下資料：

In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

In [6]: df2
Out[6]: 
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
b       NaN       NaN       NaN  NaN    NaN
c -1.135632  1.212112 -0.173215  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
g       NaN       NaN       NaN  NaN    NaN
h  0.721555 -0.706771 -1.039575  bar   True

資料缺失，就會產生很多NaN。

為了檢測是否NaN，可以使用isna()或者notna() 方法。

In [7]: df2['one']
Out[7]: 
a    0.469112
b         NaN
c   -1.135632
d         NaN
e    0.119209
f   -2.104569
g         NaN
h    0.721555
Name: one, dtype: float64

In [8]: pd.isna(df2['one'])
Out[8]: 
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [9]: df2['four'].notna()
Out[9]: 
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: four, dtype: bool

注意在Python中None是相等的：

In [11]: None == None                                                 # noqa: E711
Out[11]: True

但是np.nan是不等的：

In [12]: np.nan == np.nan
Out[12]: False

整數型別的缺失值

NaN預設是float型別的，如果是整數型別，我們可以強制進行轉換：

In [14]: pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
Out[14]: 
0       1
1       2
2    <NA>
3       4
dtype: Int64

Datetimes 型別的缺失值

時間型別的缺失值使用NaT來表示：

In [15]: df2 = df.copy()

In [16]: df2['timestamp'] = pd.Timestamp('20120101')

In [17]: df2
Out[17]: 
        one       two     three four   five  timestamp
a  0.469112 -0.282863 -1.509059  bar   True 2012-01-01
c -1.135632  1.212112 -0.173215  bar  False 2012-01-01
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h  0.721555 -0.706771 -1.039575  bar   True 2012-01-01

In [18]: df2.loc[['a', 'c', 'h'], ['one', 'timestamp']] = np.nan

In [19]: df2
Out[19]: 
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [20]: df2.dtypes.value_counts()
Out[20]: 
float64           3
datetime64[ns]    1
bool              1
object            1
dtype: int64

None 和 np.nan 的轉換

對於數字型別的，如果賦值為None，那麼會轉換為相應的NaN型別：

In [21]: s = pd.Series([1, 2, 3])

In [22]: s.loc[0] = None

In [23]: s
Out[23]: 
0    NaN
1    2.0
2    3.0
dtype: float64

如果是物件型別，使用None賦值，會保持原樣：

In [24]: s = pd.Series(["a", "b", "c"])

In [25]: s.loc[0] = None

In [26]: s.loc[1] = np.nan

In [27]: s
Out[27]: 
0    None
1     NaN
2       c
dtype: object

缺失值的計算

缺失值的數學計算還是缺失值：

In [28]: a
Out[28]: 
        one       two
a       NaN -0.282863
c       NaN  1.212112
e  0.119209 -1.044236
f -2.104569 -0.494929
h -2.104569 -0.706771

In [29]: b
Out[29]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [30]: a + b
Out[30]: 
        one  three       two
a       NaN    NaN -0.565727
c       NaN    NaN  2.424224
e  0.238417    NaN -2.088472
f -4.209138    NaN -0.989859
h       NaN    NaN -1.413542

但是在統計中會將NaN當成0來對待。

In [31]: df
Out[31]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [32]: df['one'].sum()
Out[32]: -1.9853605075978744

In [33]: df.mean(1)
Out[33]: 
a   -0.895961
c    0.519449
e   -0.595625
f   -0.509232
h   -0.873173
dtype: float64

如果是在cumsum或者cumprod中，預設是會跳過NaN，如果不想統計NaN，可以加上引數skipna=False

In [34]: df.cumsum()
Out[34]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  0.929249 -1.682273
e  0.119209 -0.114987 -2.544122
f -1.985361 -0.609917 -1.472318
h       NaN -1.316688 -2.511893

In [35]: df.cumsum(skipna=False)
Out[35]: 
   one       two     three
a  NaN -0.282863 -1.509059
c  NaN  0.929249 -1.682273
e  NaN -0.114987 -2.544122
f  NaN -0.609917 -1.472318
h  NaN -1.316688 -2.511893

使用fillna填充NaN資料

資料分析中，如果有NaN資料，那麼需要對其進行處理，一種處理方法就是使用fillna來進行填充。

下面填充常量：

In [42]: df2
Out[42]: 
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [43]: df2.fillna(0)
Out[43]: 
        one       two     three four   five            timestamp
a  0.000000 -0.282863 -1.509059  bar   True                    0
c  0.000000  1.212112 -0.173215  bar  False                    0
e  0.119209 -1.044236 -0.861849  bar   True  2012-01-01 00:00:00
f -2.104569 -0.494929  1.071804  bar  False  2012-01-01 00:00:00
h  0.000000 -0.706771 -1.039575  bar   True                    0

還可以指定填充方法，比如pad：

In [45]: df
Out[45]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [46]: df.fillna(method='pad')
Out[46]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h -2.104569 -0.706771 -1.039575

可以指定填充的行數：

In [48]: df.fillna(method='pad', limit=1)

fill方法統計：

方法名	描述
pad / ffill	向前填充
bfill / backfill	向後填充

可以使用PandasObject來填充：

In [53]: dff
Out[53]: 
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3       NaN  0.577046 -1.715002
4       NaN       NaN -1.157892
5 -1.344312       NaN       NaN
6 -0.109050  1.643563       NaN
7  0.357021 -0.674600       NaN
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

In [54]: dff.fillna(dff.mean())
Out[54]: 
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3 -0.140857  0.577046 -1.715002
4 -0.140857 -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050  1.643563 -0.293543
7  0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

In [55]: dff.fillna(dff.mean()['B':'C'])
Out[55]: 
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3       NaN  0.577046 -1.715002
4       NaN -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050  1.643563 -0.293543
7  0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

上面操作等同於：

In [56]: dff.where(pd.notna(dff), dff.mean(), axis='columns')

使用dropna刪除包含NA的資料

除了fillna來填充資料之外，還可以使用dropna刪除包含na的資料。

In [57]: df
Out[57]: 
   one       two     three
a  NaN -0.282863 -1.509059
c  NaN  1.212112 -0.173215
e  NaN  0.000000  0.000000
f  NaN  0.000000  0.000000
h  NaN -0.706771 -1.039575

In [58]: df.dropna(axis=0)
Out[58]: 
Empty DataFrame
Columns: [one, two, three]
Index: []

In [59]: df.dropna(axis=1)
Out[59]: 
        two     three
a -0.282863 -1.509059
c  1.212112 -0.173215
e  0.000000  0.000000
f  0.000000  0.000000
h -0.706771 -1.039575

In [60]: df['one'].dropna()
Out[60]: Series([], Name: one, dtype: float64)

插值interpolation

資料分析時候，為了資料的平穩，我們需要一些插值運算interpolate() ，使用起來很簡單：

In [61]: ts
Out[61]: 
2000-01-31    0.469112
2000-02-29         NaN
2000-03-31         NaN
2000-04-28         NaN
2000-05-31         NaN
                ...   
2007-12-31   -6.950267
2008-01-31   -7.904475
2008-02-29   -6.441779
2008-03-31   -8.184940
2008-04-30   -9.011531
Freq: BM, Length: 100, dtype: float64

In [64]: ts.interpolate()
Out[64]: 
2000-01-31    0.469112
2000-02-29    0.434469
2000-03-31    0.399826
2000-04-28    0.365184
2000-05-31    0.330541
                ...   
2007-12-31   -6.950267
2008-01-31   -7.904475
2008-02-29   -6.441779
2008-03-31   -8.184940
2008-04-30   -9.011531
Freq: BM, Length: 100, dtype: float64

插值函式還可以新增引數，指定插值的方法，比如按時間插值：

In [67]: ts2
Out[67]: 
2000-01-31    0.469112
2000-02-29         NaN
2002-07-31   -5.785037
2005-01-31         NaN
2008-04-30   -9.011531
dtype: float64

In [68]: ts2.interpolate()
Out[68]: 
2000-01-31    0.469112
2000-02-29   -2.657962
2002-07-31   -5.785037
2005-01-31   -7.398284
2008-04-30   -9.011531
dtype: float64

In [69]: ts2.interpolate(method='time')
Out[69]: 
2000-01-31    0.469112
2000-02-29    0.270241
2002-07-31   -5.785037
2005-01-31   -7.190866
2008-04-30   -9.011531
dtype: float64

按index的float value進行插值：

In [70]: ser
Out[70]: 
0.0      0.0
1.0      NaN
10.0    10.0
dtype: float64

In [71]: ser.interpolate()
Out[71]: 
0.0      0.0
1.0      5.0
10.0    10.0
dtype: float64

In [72]: ser.interpolate(method='values')
Out[72]: 
0.0      0.0
1.0      1.0
10.0    10.0
dtype: float64

除了插值Series，還可以插值DF：

In [73]: df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   ....:                    'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
   ....: 

In [74]: df
Out[74]: 
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [75]: df.interpolate()
Out[75]: 
     A      B
0  1.0   0.25
1  2.1   1.50
2  3.4   2.75
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

interpolate還接收limit引數，可以指定插值的個數。

In [95]: ser.interpolate(limit=1)
Out[95]: 
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5     NaN
6    13.0
7    13.0
8     NaN
dtype: float64

使用replace替換值

replace可以替換常量，也可以替換list：

In [102]: ser = pd.Series([0., 1., 2., 3., 4.])

In [103]: ser.replace(0, 5)
Out[103]: 
0    5.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

In [104]: ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
Out[104]: 
0    4.0
1    3.0
2    2.0
3    1.0
4    0.0
dtype: float64

可以替換DF中特定的數值：

In [106]: df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})

In [107]: df.replace({'a': 0, 'b': 5}, 100)
Out[107]: 
     a    b
0  100  100
1    1    6
2    2    7
3    3    8
4    4    9

可以使用插值替換：

In [108]: ser.replace([1, 2, 3], method='pad')
Out[108]: 
0    0.0
1    0.0
2    0.0
3    0.0
4    4.0
dtype: float64

本文已收錄於 http://www.flydean.com/07-python-pandas-missingdata/

最通俗的解讀，最深刻的乾貨，最簡潔的教程，眾多你不知道的小技巧等你來發現！

歡迎關注我的公眾號:「程式那些事」,懂技術，更懂你！

Pandas高階教程之:處理text資料
2021-06-23
Pandas高階教程之:時間處理
2021-10-11
Pandas 基礎 (5) - 處理缺失的資料
2019-03-08
Pandas高階教程之:稀疏資料結構
2021-07-20
資料結構
Pandas高階教程之:category資料型別
2021-06-28
Go資料型別
【Pandas學習筆記02】-資料處理高階用法
2021-12-01
筆記
Pandas高階教程之:window操作
2021-07-19
Pandas高階教程之:GroupBy用法
2021-07-12
Pandas缺失值處理 | 輕鬆玩轉Pandas（3）
2018-07-24
Python 資料處理庫 pandas 進階教程
2018-04-18
Python
Pandas高階教程之:統計方法
2021-07-08
Pandas高階教程之:自定義選項
2021-07-22
Pandas高階教程之:Dataframe的合併
2021-06-14
Pandas高階教程之:plot畫圖詳解
2021-07-07
資料處理--pandas問題
2024-08-04
Python資料處理-pandas用法
2020-12-17
Python
Pandas高階教程之:Dataframe的重排和旋轉
2021-06-15
【Python資料分析基礎】: 資料缺失值處理
2018-07-28
Python
Python資料分析基礎: 資料缺失值處理
2020-10-31
Python
pandas-task07-缺失資料.md
2021-01-02
Excel高階應用教程：資料處理與資料分析
2018-05-25
Excel
pandas學習task07缺失資料
2021-01-03
資料的規範化——Pandas處理
2024-04-07
Python利用pandas處理資料與分析
2024-03-25
Python
資料預處理之 pandas 讀表
2020-03-01
特徵工程之資料預處理（下）
2019-02-13
特徵工程
pandas（進階操作）-- 處理非數值型資料 -- 資料分析三劍客(核心)
2023-10-02
Go高階特性 17 | SliceHeader：slice 高效處理資料
2021-03-02
GoHeader
[Python] Pandas 對資料進行查詢、替換、篩選、排序、重複值和缺失值處理
2021-02-11
Python排序
pandas 資料處理一些常用操作
2023-05-15
處理pandas讀取資料為nan時
2024-06-24
NaN
Python 資料處理庫 pandas 入門教程
2018-04-17
Python
處理資料缺失的結構化解決辦法
2018-10-26
pandas 處理資料和crc16計算
2020-09-26
資料清洗與預處理：使用 Python Pandas 庫
2024-07-26
Python
Pandas之:Pandas高階教程以鐵達尼號真實資料為例
2021-06-07
機器學習第2篇：資料預處理（缺失值）
2020-12-27
機器學習
機器學習中資料缺失的處理及建模方法
2021-01-31
機器學習