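Pandas: A Deep Dive into Pandas Data Structures

Posted by flydean on 2021-06-11

Original URL: https://www.cnblogs.com/flydean/p/14873681.html

Introduction

This article covers the two basic data types in Pandas, Series and DataFrame, and walks through how to create them, index them, and perform other basic operations on them.

To use Pandas, import the following libraries: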

In [1]: import numpy as np

In [2]: import pandas as pd

Series

A Series is a one-dimensional labeled array; its axis labels are collectively called the index. A Series is created like this:

>>> s = pd.Series(data, index=index)

Here data can be a Python dict, a NumPy ndarray, or a scalar.

index is a list of axis labels. The following sections show each way of creating a Series.

Creating from an ndarray

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

s
Out[67]: 
a   -1.300797
b   -2.044172
c   -1.170739
d   -0.445290
e    1.208784
dtype: float64

Use the index attribute to get the index:

s.index
Out[68]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Creating from a dict

d = {'b': 1, 'a': 0, 'c': 2}

pd.Series(d)
Out[70]: 
b    1
a    0
c    2
dtype: int64
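
When an index is passed together with a dict, the values are looked up by label rather than taken in dict order. The following is a minimal sketch of that behavior; the variable name s_from_dict and the extra label 'd' are made up for illustration.

```python
import numpy as np
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

# Passing index selects and orders the values by label.
# 'd' is not a key of the dict, so its value becomes NaN,
# and the dtype is upcast from int64 to float64 to hold it.
s_from_dict = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(s_from_dict)
```

Labels that exist in the dict but not in the index are simply dropped, so this form is also a quick way to pick a subset of keys.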

Creating from a scalar

pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[71]: 
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

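Series and ndarray

A Series is very similar to an ndarray. Indexing it with integer positions, slices, boolean masks, or lists of positions behaves just as it would on an ndarray: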

s[0]
Out[72]: -1.3007972194268396

s[:3]
Out[73]: 
a   -1.300797
b   -2.044172
c   -1.170739
dtype: float64

s[s > s.median()]
Out[74]: 
d   -0.445290
e    1.208784
dtype: float64

s[[4, 3, 1]]
Out[75]: 
e    1.208784
d   -0.445290
b   -2.044172
dtype: float64
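
Newer pandas releases (2.x, as far as the release notes indicate) warn when an integer key such as s[0] is treated as a position on a Series whose index holds labels. A sketch of the explicit positional form, using .iloc and a made-up Series s_pos, might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with fixed values so the results are predictable.
s_pos = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0],
                  index=['a', 'b', 'c', 'd', 'e'])

first = s_pos.iloc[0]            # value at position 0
picked = s_pos.iloc[[4, 3, 1]]   # positions 4, 3, 1 -> labels e, d, b
arr = s_pos.to_numpy()           # the values as a plain ndarray

print(first)
print(picked)
print(arr)
```

.iloc is always positional and .loc is always label-based, which avoids any ambiguity when the index itself contains integers.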

Series and dict

When accessed by label, a Series behaves much like a dict:

s['a']
Out[80]: -1.3007972194268396

s['e'] = 12.

s
Out[82]: 
a    -1.300797
b    -2.044172
c    -1.170739
d    -0.445290
e    12.000000
dtype: float64
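
The dict analogy extends to membership tests and to get. The sketch below uses a small made-up Series, s_dict, to show what happens when a label is missing.

```python
import numpy as np
import pandas as pd

s_dict = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

has_a = 'a' in s_dict               # True: 'in' checks the index labels
has_f = 'f' in s_dict               # False

missing = s_dict.get('f')           # None, like dict.get
fallback = s_dict.get('f', np.nan)  # explicit default value

# Plain bracket access with a missing label raises KeyError.
try:
    s_dict['f']
    raised = False
except KeyError:
    raised = True

print(has_a, has_f, missing, fallback, raised)
```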

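Vectorized operations and label alignment

Series support vectorized operations, so there is no need to loop over the values one by one: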

s + s
Out[83]: 
a    -2.601594
b    -4.088344
c    -2.341477
d    -0.890581
e    24.000000
dtype: float64

s * 2
Out[84]: 
a    -2.601594
b    -4.088344
c    -2.341477
d    -0.890581
e    24.000000
dtype: float64

np.exp(s)
Out[85]: 
a         0.272315
b         0.129487
c         0.310138
d         0.640638
e    162754.791419
dtype: float64
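
The examples above all operate on Series that share the same index. Label alignment becomes visible when the indexes differ: values are matched by label, and labels present on only one side produce NaN. A minimal sketch with a made-up Series s_align:

```python
import numpy as np
import pandas as pd

s_align = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

left = s_align.iloc[1:]    # labels b, c, d
right = s_align.iloc[:-1]  # labels a, b, c

# The result index is the union of both indexes.
# 'a' and 'd' exist on only one side, so they become NaN.
total = left + right
print(total)
# a    NaN
# b    4.0
# c    6.0
# d    NaN
# dtype: float64
```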

The name attribute

A Series also has a name attribute, which can be set when the Series is created:

s = pd.Series(np.random.randn(5), name='something')

s
Out[88]: 
0    0.192272
1    0.110410
2    1.442358
3   -0.375792
4    1.228111
Name: something, dtype: float64

A Series also has a rename method, which returns a new Series with a different name:

s2 = s.rename("different")
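
A short sketch, with made-up names s_named and s_renamed, showing that rename with a scalar argument leaves the original Series untouched:

```python
import pandas as pd

s_named = pd.Series([1, 2, 3], name='something')

# rename with a scalar returns a new Series carrying the new name.
s_renamed = s_named.rename('different')

print(s_named.name)    # something
print(s_renamed.name)  # different
```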

DataFrame

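A DataFrame is a two-dimensional labeled data structure made up of Series; you can think of it as an Excel spreadsheet. A DataFrame can be created from the following kinds of data:

Dicts of one-dimensional ndarrays, lists, dicts, or Series
Structured arrays
Two-dimensional numpy.ndarray
Another DataFrame

Creating from Series

A DataFrame can be created from a dict of Series: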

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

df
Out[92]: 
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

Selecting and reordering rows with index:

pd.DataFrame(d, index=['d', 'b', 'a'])
Out[93]: 
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

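Selecting and reordering columns with columns (a column that is not in the data, such as 'three', is filled with NaN):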

pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[94]: 
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
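
The row and column labels of the resulting DataFrame are available through the index and columns attributes. The sketch below rebuilds the same dict of Series and inspects them; the row index is the union of the indexes of the individual Series.

```python
import numpy as np
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

print(df.index)    # row labels: a, b, c, d
print(df.columns)  # column labels: one, two

# 'one' has no value for label 'd', so that cell is NaN.
print(df.loc['d', 'one'])
```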

Creating from ndarrays and lists

d = {'one': [1., 2., 3., 4.],'two': [4., 3., 2., 1.]}

pd.DataFrame(d)
Out[96]: 
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
Out[97]: 
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
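
The list of supported inputs also mentions a two-dimensional numpy.ndarray, which has no example of its own here. A minimal sketch, with made-up names arr2d and df2d, could look like this: each row of the array becomes a row of the DataFrame, and the labels are supplied separately.

```python
import numpy as np
import pandas as pd

# A 3x2 array: [[0, 1], [2, 3], [4, 5]]
arr2d = np.arange(6).reshape(3, 2)

# index and columns must match the array's shape (3 rows, 2 columns).
df2d = pd.DataFrame(arr2d, index=['a', 'b', 'c'], columns=['one', 'two'])
print(df2d)
#    one  two
# a    0    1
# b    2    3
# c    4    5
```

If index and columns are omitted, pandas falls back to integer labels starting at 0 for both axes.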

Creating from a structured array

A DataFrame can be created from a structured array:

In [47]: data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])

In [48]: data[:] = [(1, 2., 'Hello'), (2, 3., "World")]

In [49]: pd.DataFrame(data)
Out[49]: 
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'

In [50]: pd.DataFrame(data, index=['first', 'second'])
Out[50]: 
        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'

In [51]: pd.DataFrame(data, columns=['C', 'A', 'B'])
Out[51]: 
          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0

Creating from a list of dicts

In [52]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [53]: pd.DataFrame(data2)
Out[53]: 
   a   b     c
0  1   2   NaN
1  5  10  20.0

In [54]: pd.DataFrame(data2, index=['first', 'second'])
Out[54]: 
        a   b     c
first   1   2   NaN
second  5  10  20.0

In [55]: pd.DataFrame(data2, columns=['a', 'b'])
Out[55]: 
   a   b
0  1   2
1  5  10

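Creating from tuples

Passing a dict keyed by tuples creates a more complex DataFrame, with a MultiIndex on both rows and columns: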

In [56]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
   ....:               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
   ....:               ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
   ....:               ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
   ....:               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
   ....: 
Out[56]: 
       a              b      
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0
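
To make the resulting structure a little more concrete, here is a smaller sketch of the same idea. The name df_multi and the reduced set of keys are chosen for illustration; the point is that both axes become a MultiIndex and can be addressed level by level or with full tuples.

```python
import pandas as pd

df_multi = pd.DataFrame({
    ('a', 'a'): {('A', 'B'): 4, ('A', 'C'): 3},
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('b', 'a'): {('A', 'B'): 8, ('A', 'C'): 7},
})

# Both the columns and the row index are MultiIndex objects.
print(type(df_multi.columns).__name__)
print(type(df_multi.index).__name__)

# Selecting the outer column level 'a' returns the inner columns.
sub = df_multi['a']
print(sub)

# A full (row tuple, column tuple) pair addresses a single cell.
val = df_multi.loc[('A', 'B'), ('a', 'b')]
print(val)
```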

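Selecting, adding, and deleting columns

Columns of a DataFrame can be read and assigned with dict-like syntax, much as with a Series: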

In [64]: df['one']
Out[64]: 
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [65]: df['three'] = df['one'] * df['two']

In [66]: df['flag'] = df['one'] > 2

In [67]: df
Out[67]: 
   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False

Columns can be deleted with del, or removed and returned with pop:

In [68]: del df['two']

In [69]: three = df.pop('three')

In [70]: df
Out[70]: 
   one   flag
a  1.0  False
b  2.0  False
c  3.0   True
d  NaN  False
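
Both del and pop modify the DataFrame in place. The sketch below, on a made-up frame df_cols, shows that pop hands back the removed column as a Series, and that drop can be used when the original should stay intact.

```python
import pandas as pd

df_cols = pd.DataFrame({'one': [1., 2.], 'two': [3., 4.], 'three': [5., 6.]})

# drop returns a new DataFrame and leaves df_cols unchanged.
without_two = df_cols.drop(columns=['two'])

del df_cols['two']             # removes the column in place
three = df_cols.pop('three')   # removes the column and returns it as a Series

print(list(df_cols.columns))      # ['one']
print(three.name)                 # three
print(list(without_two.columns))  # ['one', 'three']
```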

When a scalar is assigned, it is broadcast to fill the entire column:

In [71]: df['foo'] = 'bar'

In [72]: df
Out[72]: 
   one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  NaN  False  bar

By default a new column is appended as the last column of the DataFrame; use insert to place it at a specific position:

In [75]: df.insert(1, 'bar', df['one'])

In [76]: df
Out[76]: 
   one  bar   flag  foo
a  1.0  1.0  False  bar
b  2.0  2.0  False  bar
c  3.0  3.0   True  bar
d  NaN  NaN  False  bar
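
A small sketch of how insert behaves, using a made-up frame df_ins: the first argument is the zero-based position of the new column, the call modifies the frame in place, and inserting a name that already exists raises ValueError unless allow_duplicates=True is passed.

```python
import pandas as pd

df_ins = pd.DataFrame({'one': [1., 2.], 'flag': [False, True]})

# Insert 'bar' at position 1, between 'one' and 'flag'.
df_ins.insert(1, 'bar', df_ins['one'])
print(list(df_ins.columns))  # ['one', 'bar', 'flag']

# Inserting the same column name again is rejected by default.
try:
    df_ins.insert(0, 'bar', 0)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)    # True
```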

Use assign to derive new columns from existing ones:

In [77]: iris = pd.read_csv('data/iris.data')

In [78]: iris.head()
Out[78]: 
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

In [79]: (iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength'])
   ....:      .head())
   ....: 
Out[79]: 
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

Note that assign returns a new DataFrame; the original DataFrame is left unchanged.
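
The iris example depends on a local CSV file. The self-contained sketch below types in the first three rows shown above by hand (the name iris_head is made up) and passes a callable to assign, which receives the DataFrame being assigned to. This is one way to confirm that the original is not modified.

```python
import pandas as pd

# First three rows of the two sepal columns, entered by hand.
iris_head = pd.DataFrame({'SepalLength': [5.1, 4.9, 4.7],
                          'SepalWidth': [3.5, 3.0, 3.2]})

# The lambda is evaluated against the DataFrame passed to assign.
with_ratio = iris_head.assign(
    sepal_ratio=lambda x: x['SepalWidth'] / x['SepalLength'])

print(list(with_ratio.columns))  # includes 'sepal_ratio'
print(list(iris_head.columns))   # unchanged: no 'sepal_ratio'
```

The callable form is convenient in method chains, where the intermediate DataFrame has no name to refer to.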

The following table summarizes indexing and selection on a DataFrame:

Operation	Syntax	Result
Select a column	`df[col]`	Series
Select a row by label	`df.loc[label]`	Series
Select a row by integer position	`df.iloc[loc]`	Series
Slice rows	`df[5:10]`	DataFrame
Select rows by boolean vector	`df[bool_vec]`	DataFrame
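
Each row of the table can be tried out on a small frame. The sketch below uses a made-up DataFrame df_sel and prints the type returned by each form of selection.

```python
import pandas as pd

df_sel = pd.DataFrame({'one': [1., 2., 3., 4.],
                       'two': [4., 3., 2., 1.]},
                      index=['a', 'b', 'c', 'd'])

col = df_sel['one']                    # select a column    -> Series
row_label = df_sel.loc['b']            # row by label       -> Series
row_pos = df_sel.iloc[2]               # row by position    -> Series
rows_slice = df_sel[1:3]               # slice rows         -> DataFrame
rows_bool = df_sel[df_sel['one'] > 2]  # boolean row filter -> DataFrame

for result in (col, row_label, row_pos, rows_slice, rows_bool):
    print(type(result).__name__)
```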

This article is also available at http://www.flydean.com/03-python-pandas-data-structures/
