Pandas基礎
# 匯入包
import numpy as np
import pandas as pd
檔案的讀取和寫入
df_csv = pd.read_csv('train.csv')
df_csv
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
df_txt = pd.read_table('study_pandas.txt')
df_txt
apple |
---|
df_excel = pd.read_excel('study_pandas.xlsx')
df_excel
abalabala |
---|
header = None表示第一行不作為列名
index_col表示把某一列或幾列作為索引
usecols表示讀取列的集合,預設讀取所有列
nrows表示讀取的資料行數
pd.read_table('study_pandas.txt', header = None)
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3.0 | |
1 | 1 | apple | quit | lalala | NaN |
2 | 2 | row | key | ting | NaN |
3 | 3 | oppo | vivo | apple | NaN |
pd.read_csv('train.csv', index_col=['PassengerId'])
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
PassengerId | |||||||||||
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 11 columns
pd.read_table('study_pandas.txt', usecols=['0', '2'])
0 | 2 | |
---|---|---|
0 | apple | lalala |
1 | row | ting |
2 | oppo | apple |
pd.read_csv('train.csv', nrows=2)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
在讀取txt檔案時, 經常遇到分割符非空格的情況, read_table有一個分割引數sep, 它使得使用者可以自定義分割符號,進行txt資料的讀取
pd.read_table('study_pandas.txt', sep='\|\|\|\|', engine='python')
Unnamed: 0 | 0 | 1 | 2 | 3 | |
---|---|---|---|---|---|
0 | 1 | apple | quit | lalala | None |
1 | 4 | lili | lala | pipi | pion |
sep是正則引數
當索引沒有特殊意義時,把index設定為False
基本資料結構
pandas 中具有兩種基本的資料儲存結構,儲存一維 values 的 Series 和儲存二維 values 的 DataFrame
Series 一般由四個部分組成,分別是序列的值 data 、索引 index 、儲存型別 dtype 、序列的名字 name 。其中,索引也可以指定它的名字,預設為空。
s = pd.Series(data = [100, 'a', {'dic1':5}],
index = pd.Index(['id1', 20, 'third'], name='my_idx'),
dtype = 'object',
name = 'my_name')
s
my_idx
id1 100
20 a
third {'dic1': 5}
Name: my_name, dtype: object
s.values
array([100, 'a', {'dic1': 5}], dtype=object)
s.index
Index(['id1', 20, 'third'], dtype='object', name='my_idx')
s.dtype
dtype('O')
s.name
'my_name'
s.shape
(3,)
如果想要取出單個索引對應的值,可以透過 [index_item] 可以取出。
s['third']
{'dic1': 5}
DataFrame 在 Series 的基礎上增加了列索引,一個資料框可以由二維的 data 與行列索引來構造:
data = [[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.2]]
df = pd.DataFrame(data = data,
index = ['row_%d'%i for i in range(3)],
columns = ['col_0', 'col_1', 'col_2'])
df
col_0 | col_1 | col_2 | |
---|---|---|---|
row_0 | 1 | a | 1.2 |
row_1 | 2 | b | 2.2 |
row_2 | 3 | c | 3.2 |
採用從列索引名到資料的對映來構造資料框,同時再加上行索引
df = pd.DataFrame(data = {'col_0': [1, 2, 3], 'col_1':list('abc'),
'col_2' : [1.2, 2.2, 3.2]},
index = ['row_%d'%i for i in range(3)])
df
col_0 | col_1 | col_2 | |
---|---|---|---|
row_0 | 1 | a | 1.2 |
row_1 | 2 | b | 2.2 |
row_2 | 3 | c | 3.2 |
由於這種對映關係,在 DataFrame 中可以用 [col_name] 與 [col_list] 來取出相應的列與由多個列組成的表,結果分別為 Series 和 DataFrame
df['col_0']
row_0 1
row_1 2
row_2 3
Name: col_0, dtype: int64
df[['col_0', 'col_1']]
col_0 | col_1 | |
---|---|---|
row_0 | 1 | a |
row_1 | 2 | b |
row_2 | 3 | c |
透過 .T 可以把 DataFrame 進行轉置:
df.T
row_0 | row_1 | row_2 | |
---|---|---|---|
col_0 | 1 | 2 | 3 |
col_1 | a | b | c |
col_2 | 1.2 | 2.2 | 3.2 |
常用基本函式
df = pd.read_csv('train.csv')
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
head, tail 函式分別表示返回表或者序列的前 n 行和後 n 行,其中 n 預設為5:
df.head(2)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
df.tail(3)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |
info, describe 分別返回表的 資訊概況 和表中 數值列對應的主要統計量 :
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
df.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
在 Series 和 DataFrame 上定義了許多統計函式,最常見的是 sum, mean, median, var, std, max, min
df_demo = df[['Age', 'Pclass']]
df_demo.mean()
Age 29.699118
Pclass 2.308642
dtype: float64
df_demo.max()
Age 80.0
Pclass 3.0
dtype: float64
需要介紹的是 quantile, count, idxmax 這三個函式,它們分別返回的是分位數、非缺失值個數、最大值對應的索引
df_demo.quantile(0.75)
Age 38.0
Pclass 3.0
Name: 0.75, dtype: float64
df_demo.count()
Age 714
Pclass 891
dtype: int64
df_demo.idxmax()
Age 630
Pclass 0
dtype: int64
操作後返回的是標量,所以又稱為聚合函式,它們有一個公共引數 axis ,預設為0代表逐列聚合,如果設定為1則表示逐行聚合
df_demo.mean(axis=1).head() # 這是Age和Pclass的均值,沒什麼意義
0 12.5
1 19.5
2 14.5
3 18.0
4 19.0
dtype: float64
對序列使用 unique 和 nunique 可以分別得到其唯一值組成的列表和唯一值的個數:
df['Age'].unique()
array([22. , 38. , 26. , 35. , nan, 54. , 2. , 27. , 14. ,
4. , 58. , 20. , 39. , 55. , 31. , 34. , 15. , 28. ,
8. , 19. , 40. , 66. , 42. , 21. , 18. , 3. , 7. ,
49. , 29. , 65. , 28.5 , 5. , 11. , 45. , 17. , 32. ,
16. , 25. , 0.83, 30. , 33. , 23. , 24. , 46. , 59. ,
71. , 37. , 47. , 14.5 , 70.5 , 32.5 , 12. , 9. , 36.5 ,
51. , 55.5 , 40.5 , 44. , 1. , 61. , 56. , 50. , 36. ,
45.5 , 20.5 , 62. , 41. , 52. , 63. , 23.5 , 0.92, 43. ,
60. , 10. , 64. , 13. , 48. , 0.75, 53. , 57. , 80. ,
70. , 24.5 , 6. , 0.67, 30.5 , 0.42, 34.5 , 74. ])
df['Age'].nunique()
88
value_counts 可以得到唯一值和其對應出現的頻數
df['Embarked'].value_counts()
Embarked
S 644
C 168
Q 77
Name: count, dtype: int64
如果想要觀察多個列組合的唯一值,可以使用 drop_duplicates 。其中的關鍵引數是 keep ,預設值 first 表示每個組合保留第一次出現的所在行, last 表示保留最後一次出現的所在行, False 表示把所有重複組合所在的行剔除。
df_demo = df[['Pclass', 'Embarked', 'PassengerId']]
df_demo.drop_duplicates(['Pclass', 'Embarked'])
Pclass | Embarked | PassengerId | |
---|---|---|---|
0 | 3 | S | 1 |
1 | 1 | C | 2 |
3 | 1 | S | 4 |
5 | 3 | Q | 6 |
9 | 2 | C | 10 |
15 | 2 | S | 16 |
19 | 3 | C | 20 |
61 | 1 | NaN | 62 |
245 | 1 | Q | 246 |
303 | 2 | Q | 304 |
duplicated 和 drop_duplicates 的功能類似,但前者返回了是否為唯一值的布林列表,其 keep 引數與後者一致。其返回的序列,把重複元素設為 True ,否則為 False 。 drop_duplicates 等價於把 duplicated 為 True 的對應行剔除。
df_demo.duplicated(['Pclass', 'Embarked']).head()
0 False
1 False
2 True
3 False
4 True
dtype: bool
pandas 中的替換函式可以歸納為三類:對映替換、邏輯替換、數值替換,此處介紹 replace 的用法。
df['Sex'].replace({'female':0, 'male':1}).head()
C:\Users\dell\AppData\Local\Temp\ipykernel_20744\89183440.py:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df['Sex'].replace({'female':0, 'male':1}).infer_objects(copy=False).head()
0 1
1 0
2 0
3 0
4 1
Name: Sex, dtype: int64
df['Sex'].replace(['female', 'male'], [0, 1]).head()
C:\Users\dell\AppData\Local\Temp\ipykernel_20744\1075419171.py:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df['Sex'].replace(['female', 'male'], [0, 1]).head()
0 1
1 0
2 0
3 0
4 1
Name: Sex, dtype: int64
replace 還有一種特殊的方向替換,指定 method 引數為 ffill 則為用前面一個最近的未被替換的值進行替換, bfill 則使用後面最近的未被替換的值進行替換。
s = pd.Series(['a', 1, 'b', 2, 1, 1, 'a'])
s
0 a
1 1
2 b
3 2
4 1
5 1
6 a
dtype: object
s.replace([1, 2], method='ffill')
C:\Users\dell\AppData\Local\Temp\ipykernel_20744\3681488121.py:1: FutureWarning: The 'method' keyword in Series.replace is deprecated and will be removed in a future version.
s.replace([1, 2], method='ffill')
0 a
1 a
2 b
3 b
4 b
5 b
6 a
dtype: object
s.replace([1, 2], method='bfill')
C:\Users\dell\AppData\Local\Temp\ipykernel_20744\1140877935.py:1: FutureWarning: The 'method' keyword in Series.replace is deprecated and will be removed in a future version.
s.replace([1, 2], method='bfill')
0 a
1 b
2 b
3 a
4 a
5 a
6 a
dtype: object
邏輯替換包括了 where 和 mask ,這兩個函式是完全對稱的: where 函式在傳入條件為 False 的對應行進行替換,而 mask 在傳入條件為 True 的對應行進行替換,當不指定替換值時,替換為缺失值。
s =pd.Series([-1, 1.2345, 100, -50])
s.where(s < 0, 100)
0 -1.0
1 100.0
2 100.0
3 -50.0
dtype: float64
s.mask(s<0, -50)
0 -50.0000
1 1.2345
2 100.0000
3 -50.0000
dtype: float64
傳入的條件只需是與被呼叫的 Series 索引一致的布林序列即可:
s_condition = pd.Series([True, False, False, True], index = s.index)
s.mask(s_condition, -50)
0 -50.0000
1 1.2345
2 100.0000
3 -50.0000
dtype: float64
數值替換包含了 round, abs, clip 方法,它們分別表示按照給定精度四捨五入、取絕對值和截斷:
s = pd.Series([-1, 1.2345, 100, -50])
s.round(2)
0 -1.00
1 1.23
2 100.00
3 -50.00
dtype: float64
s.abs()
0 1.0000
1 1.2345
2 100.0000
3 50.0000
dtype: float64
s.clip(-1, 100)
0 -1.0000
1 1.2345
2 100.0000
3 -1.0000
dtype: float64
排序共有兩種方式,其一為值排序,其二為索引排序,對應的函式是 sort_values 和 sort_index 。
為了演示排序函式,下面先利用 set_index 方法把年級和姓名兩列作為索引,對年齡進行排序,預設引數 ascending=True 為升序:
df_demo = df[['Sex', 'Age', 'Fare', 'Name']].set_index(['Sex', 'Name'])
df_demo.sort_values('Age').head()
Age | Fare | ||
---|---|---|---|
Sex | Name | ||
male | Thomas, Master. Assad Alexander | 0.42 | 8.5167 |
Hamalainen, Master. Viljo | 0.67 | 14.5000 | |
female | Baclini, Miss. Eugenie | 0.75 | 19.2583 |
Baclini, Miss. Helene Barbara | 0.75 | 19.2583 | |
male | Caldwell, Master. Alden Gates | 0.83 | 29.0000 |
df_demo.sort_values('Fare', ascending=False).head()
Age | Fare | ||
---|---|---|---|
Sex | Name | ||
female | Ward, Miss. Anna | 35.0 | 512.3292 |
male | Lesurer, Mr. Gustave J | 35.0 | 512.3292 |
Cardeza, Mr. Thomas Drake Martinez | 36.0 | 512.3292 | |
female | Fortune, Miss. Mabel Helen | 23.0 | 263.0000 |
male | Fortune, Mr. Charles Alexander | 19.0 | 263.0000 |
在排序中,經常遇到多列排序的問題,比如在年齡相同的情況下,對票價進行排序,並且保持票價降序排列,年齡降序排列:
df_demo.sort_values(['Age', 'Fare'], ascending=[False, False]).head()
Age | Fare | ||
---|---|---|---|
Sex | Name | ||
male | Barkworth, Mr. Algernon Henry Wilson | 80.0 | 30.0000 |
Svensson, Mr. Johan | 74.0 | 7.7750 | |
Artagaveytia, Mr. Ramon | 71.0 | 49.5042 | |
Goldschmidt, Mr. George B | 71.0 | 34.6542 | |
Connors, Mr. Patrick | 70.5 | 7.7500 |
索引排序的用法和值排序完全一致,只不過元素的值在索引中,此時需要指定索引層的名字或者層號,用引數 level 表示。另外,需要注意的是字串的排列順序由字母順序決定。
df_demo.sort_index(level = ['Sex', 'Name'], ascending=[True, False]).head()
Age | Fare | ||
---|---|---|---|
Sex | Name | ||
female | de Messemaeker, Mrs. Guillaume Joseph (Emma) | 36.0 | 17.4000 |
Zabour, Miss. Thamine | NaN | 14.4542 | |
Zabour, Miss. Hileni | 14.5 | 14.4542 | |
Yrois, Miss. Henriette ("Mrs Harbeck") | 24.0 | 13.0000 | |
Young, Miss. Marie Grice | 36.0 | 135.6333 |
apply 方法常用於 DataFrame 的行迭代或者列迭代,它的 axis 含義與第2小節中的統計聚合函式一致, apply 的引數往往是一個以序列為輸入的函式。例如對於 .mean() ,使用 apply 可以如下地寫出
df_demo = df[['Age', 'Fare']]
def my_mean(x):
res = x.mean()
return res
df_demo.apply(my_mean)
Age 29.699118
Fare 32.204208
dtype: float64
df_demo.apply(lambda x:x.mean())
Age 29.699118
Fare 32.204208
dtype: float64