pandas學習筆記

aboater發表於2020-10-01

一、資料結構:Series 和 DataFrame

(一)Series

Series 是個定長的字典序列。說是定長是因為在儲存的時候,相當於兩個 ndarray,這也是和字典結構最大的不同。因為在字典的結構裡,元素的個數是不固定的。

Series 有兩個基本屬性:index 和 values。在 Series 結構中,index 預設是 0,1,2,……遞增的整數序列,當然我們也可以自己來指定索引,比如 index=[‘a’, ‘b’, ‘c’, ‘d’]。

import pandas as pd
x1 = pd.Series([1,2,3,4])
x1
0    1
1    2
2    3
3    4
dtype: int64
x2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
x2
a    1
b    2
c    3
d    4
dtype: int64
d = {'a':1, 'b':2, 'c':3, 'd':4}
d
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
x3 = pd.Series(d)
x3
a    1
b    2
c    3
d    4
dtype: int64

(二)DataFrame 型別資料結構類似資料庫表

import pandas as pd
data = {'Chinese': [66, 95, 93, 90,80],'English': [65, 85, 92, 88, 90],'Math': [30, 98, 96, 77, 90]}
data
{'Chinese': [66, 95, 93, 90, 80],
 'English': [65, 85, 92, 88, 90],
 'Math': [30, 98, 96, 77, 90]}
df1= pd.DataFrame(data)
df1
ChineseEnglishMath
0666530
1958598
2939296
3908877
4809090
df2 = pd.DataFrame(data, index=['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei'])
df2
ChineseEnglishMath
ZhangFei666530
GuanYu958598
ZhaoYun939296
HuangZhong908877
DianWei809090

二、資料匯入和輸出

df2.to_excel('data2.xlsx')

三、資料清洗

(一)刪除資料

df2
ChineseEnglishMath
ZhangFei666530
GuanYu958598
ZhaoYun939296
HuangZhong908877
DianWei809090
df2.drop('Math', axis=1)
ChineseEnglish
ZhangFei6665
GuanYu9585
ZhaoYun9392
HuangZhong9088
DianWei8090
df2.drop(columns=['Math'])
ChineseEnglish
ZhangFei6665
GuanYu9585
ZhaoYun9392
HuangZhong9088
DianWei8090
df2.drop(index=['DianWei'])
ChineseEnglishMath
ZhangFei666530
GuanYu958598
ZhaoYun939296
HuangZhong908877

(二)重新命名列

df2.rename(columns={'Chinese': 'YuWen', 'English': 'Yingyu'}, inplace=True)
df2
YuWenYingyuMath
ZhangFei666530
GuanYu958598
ZhaoYun939296
HuangZhong908877
DianWei809090

(三)去掉重複行

df2.drop_duplicates()   # duplicates 多重紀錄
YuWenYingyuMath
ZhangFei666530
GuanYu958598
ZhaoYun939296
HuangZhong908877
DianWei809090

(四)更改資料格式

df2['YuWen'].dtype
dtype('int64')
df2['YuWen'] = df2['YuWen'].astype('str')
df2['YuWen'].dtype
dtype('O')
df2['YuWen']
ZhangFei      66
GuanYu        95
ZhaoYun       93
HuangZhong    90
DianWei       80
Name: YuWen, dtype: object
df2['Yingyu']
ZhangFei      65
GuanYu        85
ZhaoYun       92
HuangZhong    88
DianWei       90
Name: Yingyu, dtype: int64

(五)大小寫轉換

df2.columns = df2.columns.str.upper()
df2
YUWENYINGYUMATH
ZhangFei666530
GuanYu958598
ZhaoYun939296
HuangZhong908877
DianWei809090
df2.columns = df2.columns.str.title()
df2.columns
Index(['Yuwen', 'Yingyu', 'Math'], dtype='object')
df2.columns = df2.columns.str.lower()
df2
yuwenyingyumath
ZhangFei666530
GuanYu958598
ZhaoYun939296
HuangZhong908877
DianWei809090

(六)查詢值

import numpy as np
df2.loc['ZhangFei', 'yuwen'] = np.nan
df2
yuwenyingyumath
ZhangFeiNaN6530
GuanYu958598
ZhaoYun939296
HuangZhong908877
DianWei809090
df2.isnull()
yuwenyingyumath
ZhangFeiTrueFalseFalse
GuanYuFalseFalseFalse
ZhaoYunFalseFalseFalse
HuangZhongFalseFalseFalse
DianWeiFalseFalseFalse
df2.isnull().any()
yuwen      True
yingyu    False
math      False
dtype: bool
df2.isnull().any(axis=1)
ZhangFei       True
GuanYu        False
ZhaoYun       False
HuangZhong    False
DianWei       False
dtype: bool
df2.isnull().any(axis=None)
True

(七)apply對資料進行清洗

df3 = df2.reset_index()
df3
indexyuwenyingyumath
0ZhangFeiNaN6530
1GuanYu958598
2ZhaoYun939296
3HuangZhong908877
4DianWei809090
df3['index'] = df3['index'].apply(str.upper)
df3
indexyuwenyingyumath
0ZHANGFEINaN6530
1GUANYU958598
2ZHAOYUN939296
3HUANGZHONG908877
4DIANWEI809090
def plus(df,n,m):
    df['new1'] = (df['math']+df['yingyu']) * m
    df['new2'] = (df['math']+df['yingyu']) * n
    return df
df3 = df3.apply(plus,axis=1,args=(2,3,))
df3
indexyuwenyingyumathnew1new2
0ZHANGFEINaN6530285190
1GUANYU958598549366
2ZHAOYUN939296564376
3HUANGZHONG908877495330
4DIANWEI809090540360

其中 axis=1 代表按照列為軸進行操作,axis=0 代表按照行為軸進行操作,args 是傳遞的兩個引數,即 n=2, m=3,在 plus 函式中使用到了 n 和 m,從而生成新的 df。資料統計

四、資料統計

Pandas 和 NumPy 一樣,都有常用的統計函式,如果遇到空值 NaN,會自動排除。

df1 = pd.DataFrame({'name':['ZhangFei', 'GuanYu', 'a', 'b', 'c'], 'data1':range(5)})
df1
namedata1
0ZhangFei0
1GuanYu1
2a2
3b3
4c4
df1.describe()
data1
count5.000000
mean2.000000
std1.581139
min0.000000
25%1.000000
50%2.000000
75%3.000000
max4.000000
df3.describe()   # 只對數值列進行統計
yingyumathnew1new2
count5.0000005.0000005.0000005.00000
mean84.00000078.200000486.600000324.40000
std10.93160628.163807115.59541577.06361
min65.00000030.000000285.000000190.00000
25%85.00000077.000000495.000000330.00000
50%88.00000090.000000540.000000360.00000
75%90.00000096.000000549.000000366.00000
max92.00000098.000000564.000000376.00000

五、資料表合併

df1 = pd.DataFrame({'name':['ZhangFei', 'GuanYu', 'a', 'b', 'c'], 'data1':range(5)})
df2 = pd.DataFrame({'name':['ZhangFei', 'GuanYu', 'A', 'B', 'C'], 'data2':range(5)})
df1
namedata1
0ZhangFei0
1GuanYu1
2a2
3b3
4c4
df2
namedata2
0ZhangFei0
1GuanYu1
2A2
3B3
4C4

(一)基於指定列進行連線

df3 = pd.merge(df1, df2, on='name')
df3
namedata1data2
0ZhangFei00
1GuanYu11

(二)inner 內連線

df3 = pd.merge(df1, df2, how='inner')
df3
namedata1data2
0ZhangFei00
1GuanYu11

(三)left 左連線

df3 = pd.merge(df1, df2, how='left')
df3
namedata1data2
0ZhangFei00.0
1GuanYu11.0
2a2NaN
3b3NaN
4c4NaN

(四)right 右連線

df3 = pd.merge(df1, df2, how='right')
df3
namedata1data2
0ZhangFei0.00
1GuanYu1.01
2ANaN2
3BNaN3
4CNaN4

(五)outer 外連線

df3 = pd.merge(df1, df2, how='outer')
df3
namedata1data2
0ZhangFei0.00.0
1GuanYu1.01.0
2a2.0NaN
3b3.0NaN
4c4.0NaN
5ANaN2.0
6BNaN3.0
7CNaN4.0

六、如何用 SQL 方式開啟 Pandas

import pandas as pd
from pandas import DataFrame
from pandasql import sqldf, load_meat, load_births
df1 = DataFrame({'name':['ZhangFei', 'GuanYu', 'a', 'b', 'c'], 'data1':range(5)})
pysqldf = lambda sql: sqldf(sql, globals())
sql = "select * from df1 where name ='ZhangFei'"
# globals() # 返回全域性變數的字典
pysqldf(sql)
namedata1
0ZhangFei0
sqldf(sql, globals())
namedata1
0ZhangFei0

七、練習題

data = {'語文': [66, 95, 93, 90,80, 80],'英語': [65, 85, 92, 88, 90, 90],'數學': [None, 98, 96, 77, 90, 90]}
index=['張飛', '關羽', '趙雲', '黃忠', '典韋', '典韋']
import pandas as pd
df = pd.DataFrame(data, index)
df
語文英語數學
張飛6665NaN
關羽958598.0
趙雲939296.0
黃忠908877.0
典韋809090.0
典韋809090.0
  1. 填充空值
df = df.fillna(0)   # 缺考0分
df
語文英語數學
張飛66650.0
關羽958598.0
趙雲939296.0
黃忠908877.0
典韋809090.0
典韋809090.0
  1. 去掉重複行
df = df.drop_duplicates()
df
語文英語數學
張飛66650.0
關羽958598.0
趙雲939296.0
黃忠908877.0
典韋809090.0
totle = df.sum(axis=1)
totle
張飛    131.0
關羽    278.0
趙雲    281.0
黃忠    255.0
典韋    260.0
dtype: float64
df['總和'] = totle
d:\app\miniconda\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
df
語文英語數學總和
張飛66650.0131.0
關羽958598.0278.0
趙雲939296.0281.0
黃忠908877.0255.0
典韋809090.0260.0

八、收穫

  1. 學到了apply更復雜的用法,傳入的函式還能帶其他引數,以及指定axis的值,df3 = df3.apply(plus,axis=1,args=(2,3,));
  2. 對pandas的瞭解也夠多了,但是沒有整理,需要整理所學,查缺補漏。

相關文章