Pandas基礎

0214jx發表於2024-07-22

Pandas基礎

# 匯入包
import numpy as np
import pandas as pd

檔案的讀取和寫入

df_csv = pd.read_csv('train.csv')
df_csv
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

df_txt = pd.read_table('study_pandas.txt')
df_txt
apple
df_excel = pd.read_excel('study_pandas.xlsx')
df_excel
abalabala

header = None表示第一行不作為列名
index_col表示把某一列或幾列作為索引
usecols表示讀取列的集合,預設讀取所有列
nrows表示讀取的資料行數

pd.read_table('study_pandas.txt', header = None)
0 1 2 3 4
0 0 1 2 3.0
1 1 apple quit lalala NaN
2 2 row key ting NaN
3 3 oppo vivo apple NaN
pd.read_csv('train.csv', index_col=['PassengerId'])
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 11 columns

pd.read_table('study_pandas.txt', usecols=['0', '2'])
0 2
0 apple lalala
1 row ting
2 oppo apple
pd.read_csv('train.csv', nrows=2)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C

在讀取txt檔案時, 經常遇到分割符非空格的情況, read_table有一個分割引數sep, 它使得使用者可以自定義分割符號,進行txt資料的讀取

pd.read_table('study_pandas.txt', sep='\|\|\|\|', engine='python')
Unnamed: 0 0 1 2 3
0 1 apple quit lalala None
1 4 lili lala pipi pion

sep是正則引數

當索引沒有特殊意義時,把index設定為False

基本資料結構

pandas 中具有兩種基本的資料儲存結構,儲存一維 values 的 Series 和儲存二維 values 的 DataFrame

Series 一般由四個部分組成,分別是序列的值 data 、索引 index 、儲存型別 dtype 、序列的名字 name 。其中,索引也可以指定它的名字,預設為空。

s = pd.Series(data = [100, 'a', {'dic1':5}],
             index = pd.Index(['id1', 20, 'third'], name='my_idx'),
             dtype = 'object', 
             name = 'my_name')
s
my_idx
id1              100
20                 a
third    {'dic1': 5}
Name: my_name, dtype: object
s.values
array([100, 'a', {'dic1': 5}], dtype=object)
s.index
Index(['id1', 20, 'third'], dtype='object', name='my_idx')
s.dtype
dtype('O')
s.name
'my_name'
s.shape
(3,)

如果想要取出單個索引對應的值,可以透過 [index_item] 可以取出。

s['third']
{'dic1': 5}

DataFrame 在 Series 的基礎上增加了列索引,一個資料框可以由二維的 data 與行列索引來構造:

data = [[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.2]]
df = pd.DataFrame(data = data,
                  index = ['row_%d'%i for i in range(3)],
                  columns = ['col_0', 'col_1', 'col_2'])
df
col_0 col_1 col_2
row_0 1 a 1.2
row_1 2 b 2.2
row_2 3 c 3.2

採用從列索引名到資料的對映來構造資料框,同時再加上行索引

df = pd.DataFrame(data = {'col_0': [1, 2, 3], 'col_1':list('abc'),
                         'col_2' : [1.2, 2.2, 3.2]},
                 index = ['row_%d'%i for i in range(3)])
df
col_0 col_1 col_2
row_0 1 a 1.2
row_1 2 b 2.2
row_2 3 c 3.2

由於這種對映關係,在 DataFrame 中可以用 [col_name] 與 [col_list] 來取出相應的列與由多個列組成的表,結果分別為 Series 和 DataFrame

df['col_0']
row_0    1
row_1    2
row_2    3
Name: col_0, dtype: int64
df[['col_0', 'col_1']]
col_0 col_1
row_0 1 a
row_1 2 b
row_2 3 c

透過 .T 可以把 DataFrame 進行轉置:

df.T
row_0 row_1 row_2
col_0 1 2 3
col_1 a b c
col_2 1.2 2.2 3.2

常用基本函式

df = pd.read_csv('train.csv')
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

head, tail 函式分別表示返回表或者序列的前 n 行和後 n 行,其中 n 預設為5:

df.head(2)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
df.tail(3)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q

info, describe 分別返回表的 資訊概況 和表中 數值列對應的主要統計量 :

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
df.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

在 Series 和 DataFrame 上定義了許多統計函式,最常見的是 sum, mean, median, var, std, max, min

df_demo = df[['Age', 'Pclass']]
df_demo.mean()
Age       29.699118
Pclass     2.308642
dtype: float64
df_demo.max()
Age       80.0
Pclass     3.0
dtype: float64

需要介紹的是 quantile, count, idxmax 這三個函式,它們分別返回的是分位數、非缺失值個數、最大值對應的索引

df_demo.quantile(0.75)
Age       38.0
Pclass     3.0
Name: 0.75, dtype: float64
df_demo.count()
Age       714
Pclass    891
dtype: int64
df_demo.idxmax()
Age       630
Pclass      0
dtype: int64

操作後返回的是標量,所以又稱為聚合函式,它們有一個公共引數 axis ,預設為0代表逐列聚合,如果設定為1則表示逐行聚合

df_demo.mean(axis=1).head() # 這是Age和Pclass的均值,沒什麼意義
0    12.5
1    19.5
2    14.5
3    18.0
4    19.0
dtype: float64

對序列使用 unique 和 nunique 可以分別得到其唯一值組成的列表和唯一值的個數:

df['Age'].unique()
array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])
df['Age'].nunique()
88

value_counts 可以得到唯一值和其對應出現的頻數

df['Embarked'].value_counts()
Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

如果想要觀察多個列組合的唯一值,可以使用 drop_duplicates 。其中的關鍵引數是 keep ,預設值 first 表示每個組合保留第一次出現的所在行, last 表示保留最後一次出現的所在行, False 表示把所有重複組合所在的行剔除。

df_demo = df[['Pclass', 'Embarked', 'PassengerId']]
df_demo.drop_duplicates(['Pclass', 'Embarked'])
Pclass Embarked PassengerId
0 3 S 1
1 1 C 2
3 1 S 4
5 3 Q 6
9 2 C 10
15 2 S 16
19 3 C 20
61 1 NaN 62
245 1 Q 246
303 2 Q 304

duplicated 和 drop_duplicates 的功能類似,但前者返回了是否為唯一值的布林列表,其 keep 引數與後者一致。其返回的序列,把重複元素設為 True ,否則為 False 。 drop_duplicates 等價於把 duplicated 為 True 的對應行剔除。

df_demo.duplicated(['Pclass', 'Embarked']).head()
0    False
1    False
2     True
3    False
4     True
dtype: bool

pandas 中的替換函式可以歸納為三類:對映替換、邏輯替換、數值替換,此處介紹 replace 的用法。

df['Sex'].replace({'female':0, 'male':1}).head()
C:\Users\dell\AppData\Local\Temp\ipykernel_20744\89183440.py:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df['Sex'].replace({'female':0, 'male':1}).infer_objects(copy=False).head()





0    1
1    0
2    0
3    0
4    1
Name: Sex, dtype: int64
df['Sex'].replace(['female', 'male'], [0, 1]).head()
C:\Users\dell\AppData\Local\Temp\ipykernel_20744\1075419171.py:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df['Sex'].replace(['female', 'male'], [0, 1]).head()





0    1
1    0
2    0
3    0
4    1
Name: Sex, dtype: int64

replace 還有一種特殊的方向替換,指定 method 引數為 ffill 則為用前面一個最近的未被替換的值進行替換, bfill 則使用後面最近的未被替換的值進行替換。

s = pd.Series(['a', 1, 'b', 2, 1, 1, 'a'])
s
0    a
1    1
2    b
3    2
4    1
5    1
6    a
dtype: object
s.replace([1, 2], method='ffill')
C:\Users\dell\AppData\Local\Temp\ipykernel_20744\3681488121.py:1: FutureWarning: The 'method' keyword in Series.replace is deprecated and will be removed in a future version.
  s.replace([1, 2], method='ffill')





0    a
1    a
2    b
3    b
4    b
5    b
6    a
dtype: object
s.replace([1, 2], method='bfill')
C:\Users\dell\AppData\Local\Temp\ipykernel_20744\1140877935.py:1: FutureWarning: The 'method' keyword in Series.replace is deprecated and will be removed in a future version.
  s.replace([1, 2], method='bfill')





0    a
1    b
2    b
3    a
4    a
5    a
6    a
dtype: object

邏輯替換包括了 where 和 mask ,這兩個函式是完全對稱的: where 函式在傳入條件為 False 的對應行進行替換,而 mask 在傳入條件為 True 的對應行進行替換,當不指定替換值時,替換為缺失值。

s =pd.Series([-1, 1.2345, 100, -50])
s.where(s < 0, 100)
0     -1.0
1    100.0
2    100.0
3    -50.0
dtype: float64
s.mask(s<0, -50)
0    -50.0000
1      1.2345
2    100.0000
3    -50.0000
dtype: float64

傳入的條件只需是與被呼叫的 Series 索引一致的布林序列即可:

s_condition = pd.Series([True, False, False, True], index = s.index)
s.mask(s_condition, -50)
0    -50.0000
1      1.2345
2    100.0000
3    -50.0000
dtype: float64

數值替換包含了 round, abs, clip 方法,它們分別表示按照給定精度四捨五入、取絕對值和截斷:

s = pd.Series([-1, 1.2345, 100, -50])
s.round(2)
0     -1.00
1      1.23
2    100.00
3    -50.00
dtype: float64
s.abs()
0      1.0000
1      1.2345
2    100.0000
3     50.0000
dtype: float64
s.clip(-1, 100)
0     -1.0000
1      1.2345
2    100.0000
3     -1.0000
dtype: float64

排序共有兩種方式,其一為值排序,其二為索引排序,對應的函式是 sort_values 和 sort_index 。

為了演示排序函式,下面先利用 set_index 方法把年級和姓名兩列作為索引,對年齡進行排序,預設引數 ascending=True 為升序:

df_demo = df[['Sex', 'Age', 'Fare', 'Name']].set_index(['Sex', 'Name'])
df_demo.sort_values('Age').head()
Age Fare
Sex Name
male Thomas, Master. Assad Alexander 0.42 8.5167
Hamalainen, Master. Viljo 0.67 14.5000
female Baclini, Miss. Eugenie 0.75 19.2583
Baclini, Miss. Helene Barbara 0.75 19.2583
male Caldwell, Master. Alden Gates 0.83 29.0000
df_demo.sort_values('Fare', ascending=False).head()
Age Fare
Sex Name
female Ward, Miss. Anna 35.0 512.3292
male Lesurer, Mr. Gustave J 35.0 512.3292
Cardeza, Mr. Thomas Drake Martinez 36.0 512.3292
female Fortune, Miss. Mabel Helen 23.0 263.0000
male Fortune, Mr. Charles Alexander 19.0 263.0000

在排序中,經常遇到多列排序的問題,比如在年齡相同的情況下,對票價進行排序,並且保持票價降序排列,年齡降序排列:

df_demo.sort_values(['Age', 'Fare'], ascending=[False, False]).head()
Age Fare
Sex Name
male Barkworth, Mr. Algernon Henry Wilson 80.0 30.0000
Svensson, Mr. Johan 74.0 7.7750
Artagaveytia, Mr. Ramon 71.0 49.5042
Goldschmidt, Mr. George B 71.0 34.6542
Connors, Mr. Patrick 70.5 7.7500

索引排序的用法和值排序完全一致,只不過元素的值在索引中,此時需要指定索引層的名字或者層號,用引數 level 表示。另外,需要注意的是字串的排列順序由字母順序決定。

df_demo.sort_index(level = ['Sex', 'Name'], ascending=[True, False]).head()
Age Fare
Sex Name
female de Messemaeker, Mrs. Guillaume Joseph (Emma) 36.0 17.4000
Zabour, Miss. Thamine NaN 14.4542
Zabour, Miss. Hileni 14.5 14.4542
Yrois, Miss. Henriette ("Mrs Harbeck") 24.0 13.0000
Young, Miss. Marie Grice 36.0 135.6333

apply 方法常用於 DataFrame 的行迭代或者列迭代,它的 axis 含義與第2小節中的統計聚合函式一致, apply 的引數往往是一個以序列為輸入的函式。例如對於 .mean() ,使用 apply 可以如下地寫出

df_demo = df[['Age', 'Fare']]
def my_mean(x):
    res = x.mean()
    return res
df_demo.apply(my_mean)
Age     29.699118
Fare    32.204208
dtype: float64

df_demo.apply(lambda x:x.mean())
Age     29.699118
Fare    32.204208
dtype: float64

相關文章