一、資料結構:Series 和 DataFrame
(一)Series
Series 是個定長的字典序列。說是定長是因為在儲存的時候,相當於兩個 ndarray,這也是和字典結構最大的不同。因為在字典的結構裡,元素的個數是不固定的。
Series 有兩個基本屬性:index 和 values。在 Series 結構中,index 預設是 0,1,2,……遞增的整數序列,當然我們也可以自己來指定索引,比如 index=[‘a’, ‘b’, ‘c’, ‘d’]。
import pandas as pd
x1 = pd. Series( [ 1 , 2 , 3 , 4 ] )
x1
0 1
1 2
2 3
3 4
dtype: int64
x2 = pd. Series( [ 1 , 2 , 3 , 4 ] , index= [ 'a' , 'b' , 'c' , 'd' ] )
x2
a 1
b 2
c 3
d 4
dtype: int64
d = { 'a' : 1 , 'b' : 2 , 'c' : 3 , 'd' : 4 }
d
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
x3 = pd. Series( d)
x3
a 1
b 2
c 3
d 4
dtype: int64
(二)DataFrame 型別資料結構類似資料庫表
import pandas as pd
data = { 'Chinese' : [ 66 , 95 , 93 , 90 , 80 ] , 'English' : [ 65 , 85 , 92 , 88 , 90 ] , 'Math' : [ 30 , 98 , 96 , 77 , 90 ] }
data
{'Chinese': [66, 95, 93, 90, 80],
'English': [65, 85, 92, 88, 90],
'Math': [30, 98, 96, 77, 90]}
df1= pd. DataFrame( data)
df1
Chinese English Math 0 66 65 30 1 95 85 98 2 93 92 96 3 90 88 77 4 80 90 90
df2 = pd. DataFrame( data, index= [ 'ZhangFei' , 'GuanYu' , 'ZhaoYun' , 'HuangZhong' , 'DianWei' ] )
df2
Chinese English Math ZhangFei 66 65 30 GuanYu 95 85 98 ZhaoYun 93 92 96 HuangZhong 90 88 77 DianWei 80 90 90
二、資料匯入和輸出
df2. to_excel( 'data2.xlsx' )
三、資料清洗
(一)刪除資料
df2
Chinese English Math ZhangFei 66 65 30 GuanYu 95 85 98 ZhaoYun 93 92 96 HuangZhong 90 88 77 DianWei 80 90 90
df2. drop( 'Math' , axis= 1 )
Chinese English ZhangFei 66 65 GuanYu 95 85 ZhaoYun 93 92 HuangZhong 90 88 DianWei 80 90
df2. drop( columns= [ 'Math' ] )
Chinese English ZhangFei 66 65 GuanYu 95 85 ZhaoYun 93 92 HuangZhong 90 88 DianWei 80 90
df2. drop( index= [ 'DianWei' ] )
Chinese English Math ZhangFei 66 65 30 GuanYu 95 85 98 ZhaoYun 93 92 96 HuangZhong 90 88 77
(二)重新命名列
df2. rename( columns= { 'Chinese' : 'YuWen' , 'English' : 'Yingyu' } , inplace= True )
df2
YuWen Yingyu Math ZhangFei 66 65 30 GuanYu 95 85 98 ZhaoYun 93 92 96 HuangZhong 90 88 77 DianWei 80 90 90
(三)去掉重複行
df2. drop_duplicates( )
YuWen Yingyu Math ZhangFei 66 65 30 GuanYu 95 85 98 ZhaoYun 93 92 96 HuangZhong 90 88 77 DianWei 80 90 90
(四)更改資料格式
df2[ 'YuWen' ] . dtype
dtype('int64')
df2[ 'YuWen' ] = df2[ 'YuWen' ] . astype( 'str' )
df2[ 'YuWen' ] . dtype
dtype('O')
df2[ 'YuWen' ]
ZhangFei 66
GuanYu 95
ZhaoYun 93
HuangZhong 90
DianWei 80
Name: YuWen, dtype: object
df2[ 'Yingyu' ]
ZhangFei 65
GuanYu 85
ZhaoYun 92
HuangZhong 88
DianWei 90
Name: Yingyu, dtype: int64
(五)大小寫轉換
df2. columns = df2. columns. str . upper( )
df2
YUWEN YINGYU MATH ZhangFei 66 65 30 GuanYu 95 85 98 ZhaoYun 93 92 96 HuangZhong 90 88 77 DianWei 80 90 90
df2. columns = df2. columns. str . title( )
df2. columns
Index(['Yuwen', 'Yingyu', 'Math'], dtype='object')
df2. columns = df2. columns. str . lower( )
df2
yuwen yingyu math ZhangFei 66 65 30 GuanYu 95 85 98 ZhaoYun 93 92 96 HuangZhong 90 88 77 DianWei 80 90 90
(六)查詢值
import numpy as np
df2. loc[ 'ZhangFei' , 'yuwen' ] = np. nan
df2
yuwen yingyu math ZhangFei NaN 65 30 GuanYu 95 85 98 ZhaoYun 93 92 96 HuangZhong 90 88 77 DianWei 80 90 90
df2. isnull( )
yuwen yingyu math ZhangFei True False False GuanYu False False False ZhaoYun False False False HuangZhong False False False DianWei False False False
df2. isnull( ) . any ( )
yuwen True
yingyu False
math False
dtype: bool
df2. isnull( ) . any ( axis= 1 )
ZhangFei True
GuanYu False
ZhaoYun False
HuangZhong False
DianWei False
dtype: bool
df2. isnull( ) . any ( axis= None )
True
(七)apply對資料進行清洗
df3 = df2. reset_index( )
df3
index yuwen yingyu math 0 ZhangFei NaN 65 30 1 GuanYu 95 85 98 2 ZhaoYun 93 92 96 3 HuangZhong 90 88 77 4 DianWei 80 90 90
df3[ 'index' ] = df3[ 'index' ] . apply ( str . upper)
df3
index yuwen yingyu math 0 ZHANGFEI NaN 65 30 1 GUANYU 95 85 98 2 ZHAOYUN 93 92 96 3 HUANGZHONG 90 88 77 4 DIANWEI 80 90 90
def plus ( df, n, m) :
df[ 'new1' ] = ( df[ 'math' ] + df[ 'yingyu' ] ) * m
df[ 'new2' ] = ( df[ 'math' ] + df[ 'yingyu' ] ) * n
return df
df3 = df3. apply ( plus, axis= 1 , args= ( 2 , 3 , ) )
df3
index yuwen yingyu math new1 new2 0 ZHANGFEI NaN 65 30 285 190 1 GUANYU 95 85 98 549 366 2 ZHAOYUN 93 92 96 564 376 3 HUANGZHONG 90 88 77 495 330 4 DIANWEI 80 90 90 540 360
其中 axis=1 代表按照列為軸進行操作,axis=0 代表按照行為軸進行操作,args 是傳遞的兩個引數,即 n=2, m=3,在 plus 函式中使用到了 n 和 m,從而生成新的 df。資料統計
四、資料統計
Pandas 和 NumPy 一樣,都有常用的統計函式,如果遇到空值 NaN,會自動排除。
df1 = pd. DataFrame( { 'name' : [ 'ZhangFei' , 'GuanYu' , 'a' , 'b' , 'c' ] , 'data1' : range ( 5 ) } )
df1
name data1 0 ZhangFei 0 1 GuanYu 1 2 a 2 3 b 3 4 c 4
df1. describe( )
data1 count 5.000000 mean 2.000000 std 1.581139 min 0.000000 25% 1.000000 50% 2.000000 75% 3.000000 max 4.000000
df3. describe( )
yingyu math new1 new2 count 5.000000 5.000000 5.000000 5.00000 mean 84.000000 78.200000 486.600000 324.40000 std 10.931606 28.163807 115.595415 77.06361 min 65.000000 30.000000 285.000000 190.00000 25% 85.000000 77.000000 495.000000 330.00000 50% 88.000000 90.000000 540.000000 360.00000 75% 90.000000 96.000000 549.000000 366.00000 max 92.000000 98.000000 564.000000 376.00000
五、資料表合併
df1 = pd. DataFrame( { 'name' : [ 'ZhangFei' , 'GuanYu' , 'a' , 'b' , 'c' ] , 'data1' : range ( 5 ) } )
df2 = pd. DataFrame( { 'name' : [ 'ZhangFei' , 'GuanYu' , 'A' , 'B' , 'C' ] , 'data2' : range ( 5 ) } )
df1
name data1 0 ZhangFei 0 1 GuanYu 1 2 a 2 3 b 3 4 c 4
df2
name data2 0 ZhangFei 0 1 GuanYu 1 2 A 2 3 B 3 4 C 4
(一)基於指定列進行連線
df3 = pd. merge( df1, df2, on= 'name' )
df3
name data1 data2 0 ZhangFei 0 0 1 GuanYu 1 1
(二)inner 內連線
df3 = pd. merge( df1, df2, how= 'inner' )
df3
name data1 data2 0 ZhangFei 0 0 1 GuanYu 1 1
(三)left 左連線
df3 = pd. merge( df1, df2, how= 'left' )
df3
name data1 data2 0 ZhangFei 0 0.0 1 GuanYu 1 1.0 2 a 2 NaN 3 b 3 NaN 4 c 4 NaN
(四)right 右連線
df3 = pd. merge( df1, df2, how= 'right' )
df3
name data1 data2 0 ZhangFei 0.0 0 1 GuanYu 1.0 1 2 A NaN 2 3 B NaN 3 4 C NaN 4
(五)outer 外連線
df3 = pd. merge( df1, df2, how= 'outer' )
df3
name data1 data2 0 ZhangFei 0.0 0.0 1 GuanYu 1.0 1.0 2 a 2.0 NaN 3 b 3.0 NaN 4 c 4.0 NaN 5 A NaN 2.0 6 B NaN 3.0 7 C NaN 4.0
六、如何用 SQL 方式開啟 Pandas
import pandas as pd
from pandas import DataFrame
from pandasql import sqldf, load_meat, load_births
df1 = DataFrame( { 'name' : [ 'ZhangFei' , 'GuanYu' , 'a' , 'b' , 'c' ] , 'data1' : range ( 5 ) } )
pysqldf = lambda sql: sqldf( sql, globals ( ) )
sql = "select * from df1 where name ='ZhangFei'"
pysqldf( sql)
sqldf( sql, globals ( ) )
七、練習題
data = { '語文' : [ 66 , 95 , 93 , 90 , 80 , 80 ] , '英語' : [ 65 , 85 , 92 , 88 , 90 , 90 ] , '數學' : [ None , 98 , 96 , 77 , 90 , 90 ] }
index= [ '張飛' , '關羽' , '趙雲' , '黃忠' , '典韋' , '典韋' ]
import pandas as pd
df = pd. DataFrame( data, index)
df
語文 英語 數學 張飛 66 65 NaN 關羽 95 85 98.0 趙雲 93 92 96.0 黃忠 90 88 77.0 典韋 80 90 90.0 典韋 80 90 90.0
填充空值
df = df. fillna( 0 )
df
語文 英語 數學 張飛 66 65 0.0 關羽 95 85 98.0 趙雲 93 92 96.0 黃忠 90 88 77.0 典韋 80 90 90.0 典韋 80 90 90.0
去掉重複行
df = df. drop_duplicates( )
df
語文 英語 數學 張飛 66 65 0.0 關羽 95 85 98.0 趙雲 93 92 96.0 黃忠 90 88 77.0 典韋 80 90 90.0
totle = df. sum ( axis= 1 )
totle
張飛 131.0
關羽 278.0
趙雲 281.0
黃忠 255.0
典韋 260.0
dtype: float64
df[ '總和' ] = totle
d:\app\miniconda\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
df
語文 英語 數學 總和 張飛 66 65 0.0 131.0 關羽 95 85 98.0 278.0 趙雲 93 92 96.0 281.0 黃忠 90 88 77.0 255.0 典韋 80 90 90.0 260.0
八、收穫
學到了apply更復雜的用法,傳入的函式還能帶其他引數,以及指定axis的值,df3 = df3.apply(plus,axis=1,args=(2,3,)); 對pandas的瞭解也夠多了,但是沒有整理,需要整理所學,查缺補漏。