Pandas Library Basics: Creating and Accessing Data

Posted by 元宵大師 on 2019-02-16

Original URL: https://flycode.co/archives/80374

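Preface

Pandas is the best-known data analysis package in the Python ecosystem. It is built on top of NumPy and provides higher-level data structures and tools for data analysis. Pandas revolves around two core data structures, Series and DataFrame. This article covers the basic ways to create and access both.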

Series

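A Series is an object similar to a one-dimensional array. It consists of a set of data (a one-dimensional ndarray) and an associated set of data labels (the index).
Note: NumPy (Numerical Python) gives Python support for multi-dimensional arrays through the ndarray object, which supports vectorized operations and is fast and memory-efficient.

(1) The pandas documentation describes the Series as follows: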

"""
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. The result
index will be the sorted union of the two indexes.

Parameters
----------
data : array-like, dict, or scalar value
    Contains data stored in Series
index : array-like or Index (1d)
    Values must be hashable and have the same length as `data`.
    Non-unique index values are allowed. Will default to
    RangeIndex(len(data)) if not provided. If both a dict and index
    sequence are used, the index will override the keys found in the
    dict.
dtype : numpy.dtype or None
    If None, dtype will be inferred
copy : boolean, default False
    Copy input data
"""
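
The docstring mentions two behaviors that are easier to see in a small example: arithmetic between Series aligns on index labels, and statistical methods skip missing values. The following is a minimal sketch; the Series names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two Series with different lengths and only partly overlapping labels.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["c", "a"])

# Addition aligns on labels, not on position.
# "b" has no partner in s2, so its result is NaN.
total = s1 + s2
print(total)

# Statistical methods skip NaN by default.
print(total.sum())
print(total.mean())
```

The result index is the union of both indexes, so a label that appears in only one operand stays in the result as a missing value.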

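(2) The basic way to create a Series is shown below. The data can be an array-like object (list, ndarray), a dict, or a scalar value. The examples assume that NumPy and pandas have been imported as np and pd (import numpy as np, import pandas as pd).

s = pd.Series(data, index=index)


s = pd.Series([-1.55666192, -0.75414753, 0.47251231, -1.37775038, -1.64899442], index=['a', 'b', 'c', 'd', 'e'], dtype='int8')
a   -1
b    0
c    0
d   -1
e   -1
dtype: int8

s = pd.Series(['a', -0.75414753, 123, 66666, -1.64899442], index=['a', 'b', 'c', 'd', 'e'])
a           a
b   -0.754148
c         123
d       66666
e    -1.64899
dtype: object

Note: Series supports the NumPy dtypes, including integers, floats, complex numbers, booleans, and strings. As when creating an ndarray, if no dtype is specified pandas tries to infer a suitable one. In the examples above, data that mixes numbers and strings is inferred as object, and data created with dtype='int8' is stored and displayed as int8. The int8 output shown here comes from an older pandas release; recent releases may raise ValueError instead of silently truncating float values in the constructor.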
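
The dtype behavior described in the note can be checked with a short sketch. It assumes a reasonably recent pandas release and uses astype for the integer conversion, which is a substitute for passing dtype='int8' to the constructor, not what the original example does.

```python
import pandas as pd

values = [-1.55666192, -0.75414753, 0.47251231, -1.37775038, -1.64899442]

# With no dtype given, pandas infers float64 from the float values.
s_float = pd.Series(values, index=["a", "b", "c", "d", "e"])
print(s_float.dtype)

# astype("int8") converts explicitly; the fractional part is truncated toward zero.
s_int = s_float.astype("int8")
print(s_int)

# Mixing strings and numbers falls back to the generic object dtype.
s_mixed = pd.Series(["a", -0.75414753, 123], index=["a", "b", "c"])
print(s_mixed.dtype)
```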

s = pd.Series(np.random.randn(5))
0    0.485468
1   -0.912130
2    0.771970
3   -1.058117
4    0.926649
dtype: float64

s.index
RangeIndex(start=0, stop=5, step=1)

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
a    0.485468
b   -0.912130
c    0.771970
d   -1.058117
e    0.926649
dtype: float64

Note: When no index is specified, Series automatically creates an integer index (a RangeIndex starting at 0).


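s = pd.Series({'a': 0., 'b': 1., 'c': 2.})
a    0.0
b    1.0
c    2.0
dtype: float64

s = pd.Series({'a': 0., 'b': 1., 'c': 2.}, index=['b', 'c', 'd', 'a'])
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

Note: A Series created from a Python dict can be thought of as a fixed-length, ordered dict. If only a dict is passed, the dict keys become the index of the Series. If an index is also passed, the values whose keys match the index labels are placed at the corresponding positions, and labels with no matching key get NaN.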
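
One way to read the dict-plus-index behavior is as "build from the dict, then reindex". The sketch below checks that reading and shows how to find the labels that had no matching key; the variable names are illustrative.

```python
import pandas as pd

d = {"a": 0.0, "b": 1.0, "c": 2.0}
labels = ["b", "c", "d", "a"]

s = pd.Series(d, index=labels)

# Building from the dict first and then reindexing gives the same result.
s_reindexed = pd.Series(d).reindex(labels)
print(s.equals(s_reindexed))

# isna() flags the label that had no matching key in the dict.
missing_labels = s[s.isna()].index.tolist()
print(missing_labels)
```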


s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

Note: A scalar value is repeated to match the length of the index.

(3) Accessing the elements and index of a Series


s = pd.Series({'a': 0., 'b': 1., 'c': 2.}, index=['b', 'c', 'd', 'a'])
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

s.values
[  1.   2.  nan   0.]

s.index
Index([u'b', u'c', u'd', u'a'], dtype='object')

Note: The values and index attributes of a Series return its array representation and its index object.
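
As a small supplement, the sketch below pulls the same two pieces out of a Series with to_numpy() and Index.tolist(), which are existing pandas methods but are not used in the original article.

```python
import numpy as np
import pandas as pd

s = pd.Series({"a": 0.0, "b": 1.0, "c": 2.0}, index=["b", "c", "d", "a"])

arr = s.to_numpy()          # the data as a NumPy ndarray
labels = s.index.tolist()   # the labels as a plain Python list

print(type(arr).__name__, arr.dtype)
print(labels)

# The index supports membership tests on labels.
print("d" in s.index, "z" in s.index)
```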


s['a']
0.0

s[['a', 'b']]
a    0.0
b    1.0
dtype: float64

s[['a', 'b', 'c']]
a    0.0
b    1.0
c    2.0
dtype: float64

s[:2]
b    1.0
c    2.0
dtype: float64

Note: Indexing selects a single value or a group of values from a Series: a single label, a list of labels, or a slice.
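
Slices behave differently depending on whether they use positions or labels. The following sketch makes the distinction explicit with iloc and loc, and adds a boolean mask as a third way to select a group of values.

```python
import pandas as pd

s = pd.Series({"a": 0.0, "b": 1.0, "c": 2.0}, index=["b", "c", "d", "a"])

# Position slice: the end position is excluded.
by_position = s.iloc[:2]

# Label slice: the end label is included.
by_label = s.loc["b":"d"]

print(by_position.index.tolist())
print(by_label.index.tolist())

# A boolean mask keeps the elements that satisfy a condition.
# NaN compares as False, so "d" is dropped.
above_half = s[s > 0.5]
print(above_half.index.tolist())
```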

DataFrame

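A DataFrame is a tabular (two-dimensional) data structure that holds an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be viewed as a dict of Series that share the same index.

(1) The pandas documentation describes the DataFrame as follows: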

"""
Two-dimensional size-mutable, potentially heterogeneous tabular
data structure with labeled axes (rows and columns). Arithmetic
operations align on both row and column labels. Can be thought of as a
dict-like container for Series objects. The primary pandas data
structure

Parameters
----------
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects
index : Index or array-like
    Index to use for resulting frame. Will default to np.arange(n) if
    no indexing information part of input data and no index provided
columns : Index or array-like
    Column labels to use for resulting frame. Will default to
    np.arange(n) if no column labels are provided
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input
"""

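(2) The basic way to create a DataFrame is shown below. The data can be a dict of lists, one-dimensional ndarrays, or Series (the sequences must have the same length), a two-dimensional ndarray, a dict of dicts, and so on.

df = pd.DataFrame(data, index=index)


df = pd.DataFrame({'one': [1., 2., 3., 5], 'two': [1., 2., 3., 4.]})
   one  two
0  1.0  1.0
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0

Note: When created from a dict of lists, each sequence becomes one column of the DataFrame. Passing bare lists inside braces, as in df = pd.DataFrame({[1., 2., 3., 5], [1., 2., 3., 4.]}), is not supported: braces without keys form a set literal, and lists are unhashable, so Python raises TypeError before pandas is even called.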
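
The failure mentioned in the note comes from Python itself rather than from pandas. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

columns = [[1.0, 2.0, 3.0, 5.0], [1.0, 2.0, 3.0, 4.0]]

# Braces around bare lists build a set, and set members must be hashable.
error_message = None
try:
    bad = {columns[0], columns[1]}
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Giving each list a key turns the braces into a dict, which DataFrame accepts.
df = pd.DataFrame({"one": columns[0], "two": columns[1]})
print(df)
```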


df = pd.DataFrame([[1., 2., 3., 5], [1., 2., 3., 4.]], index=['a', 'b'], columns=['one', 'two', 'three', 'four'])
   one  two  three  four
a  1.0  2.0    3.0   5.0
b  1.0  2.0    3.0   4.0

Note: A nested list creates a table with 2 rows and 4 columns. The index and columns arguments specify the row labels and the column names.


data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'a10')])
[(0,  0., '') (0,  0., '')]

Note: zeros(shape, dtype=float, order='C') returns a new array of the given shape and dtype, filled with zeros.


data[:] = [(1, 2., 'Hello'), (2, 3., 'World')]
df = pd.DataFrame(data)
   A    B      C
0  1  2.0  Hello
1  2  3.0  World

df = pd.DataFrame(data, index=['first', 'second'])
        A    B      C
first   1  2.0  Hello
second  2  3.0  World

df = pd.DataFrame(data, columns=['C', 'A', 'B'])
       C  A    B
0  Hello  1  2.0
1  World  2  3.0

Note: As with Series, a DataFrame automatically adds an index when none is specified. When columns is specified, the columns are arranged in the given order.
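
The structured-array example can be reproduced as a self-contained sketch. One assumption here differs from the original: the text field uses the Unicode type code 'U10' instead of the byte-string code 'a10', so that the column holds ordinary Python 3 strings.

```python
import numpy as np
import pandas as pd

# Each field of the structured dtype becomes one DataFrame column.
# "U10" (an assumption, not the article's 'a10') stores up to 10 Unicode characters.
data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "U10")])
data[:] = [(1, 2.0, "Hello"), (2, 3.0, "World")]

df = pd.DataFrame(data, index=["first", "second"])
print(df)
print(df.dtypes)
```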


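data = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

Note: When created from a dict of Series, each Series becomes one column. If no index is specified explicitly, the indexes of the Series are merged (as a union) into the row index of the result, and NaN fills the positions where a column has no data.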


df = pd.DataFrame(data, index=['d', 'b', 'a'])
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

df = pd.DataFrame(data, index=['d', 'b', 'a'], columns=['two', 'three'])
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data2)
   a   b     c
0  1   2   NaN
1  5  10  20.0

Note: When created from a list of dicts, each dict becomes one row of the DataFrame, and the union of the dict keys becomes the column labels.


df = pd.DataFrame(data2, index=['first', 'second'])
        a   b     c
first   1   2   NaN
second  5  10  20.0

df = pd.DataFrame(data2, columns=['a', 'b'])
   a   b
0  1   2
1  5  10

df = pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
                   ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
                   ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
                   ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
                   ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
       a              b
       a    b    c    a     b
A B  4.0  1.0  5.0  8.0  10.0
  C  3.0  2.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0

Note: When created from a dict of dicts, the outer keys are merged into the column index, each inner dict becomes one column, and the inner keys are merged into the row index. Because the keys here are tuples, both the row index and the column index are hierarchical (multi-level).
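
With tuple keys on both levels, cells are addressed with tuples as well. The sketch below uses a reduced version of the dict above (three columns instead of five) to keep it short; the column order in the printed output may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
    ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
    ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
})

# Both axes carry two index levels.
print(df.columns.nlevels, df.index.nlevels)

# A full tuple on each axis addresses one cell.
cell = df.loc[("A", "B"), ("a", "b")]
print(cell)

# The outer column key alone selects every column underneath it.
under_a = df["a"]
print(under_a)
```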

(3) Accessing the elements and index of a DataFrame


data = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

df['one'] or df.one
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

Note: A column of a DataFrame can be retrieved as a Series either with dict-like bracket notation or as an attribute. The returned Series has the same index as the original DataFrame, and its name attribute is set to the column name.
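
The two access styles are not fully interchangeable. The sketch below uses hypothetical column names ("count" and "two words") chosen to show where attribute access stops working.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0], "count": [10, 20], "two words": [5, 6]},
    index=["a", "b"],
)

# Both forms return the same Series when the name is a plain identifier.
same = df["one"].equals(df.one)
print(same, df["one"].name)

# "count" is also a DataFrame method, so the attribute form returns the method.
print(callable(df.count))
print(df["count"].tolist())

# Names containing spaces can only be reached with brackets.
print(df["two words"].tolist())
```

Bracket notation works for every column name, which makes it the safer default in scripts.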


df[0:1]
   one  two
a  1.0  1.0

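Note: Slicing with df[0:1] selects rows by position. The end position is excluded, so this returns only the first row.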


df.loc['a']
one    1.0
two    1.0
Name: a, dtype: float64

df.loc[:, ['one', 'two']]
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

df.loc[['a'], ['one', 'two']]
   one  two
a  1.0  1.0

df.loc['a', 'one']
1.0

Note: loc selects data by label.


df.iloc[0:2,0:1]  
   one
a  1.0
b  2.0

df.iloc[0:2]  
   one  two
a  1.0  1.0
b  2.0  2.0

df.iloc[[0, 2], [0, 1]]  # pick arbitrary row positions and column positions
   one  two
a  1.0  1.0
c  3.0  3.0

Note: iloc selects data by integer position.
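
A detail worth keeping in mind when switching between the two indexers is how slices treat their end point. The following sketch rebuilds a frame with the same shape as the one above and compares a label slice with a position slice.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, None], "two": [1.0, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

# Label slice with loc: the end label "c" is included.
by_label = df.loc["a":"c", "one"]

# Position slice with iloc: the end position 2 is excluded.
by_position = df.iloc[0:2, 0]

print(by_label.index.tolist())
print(by_position.index.tolist())
```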


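df.ix['a']
one    1.0
two    1.0
Name: a, dtype: float64

df.ix['a', ['one', 'two']]
one    1.0
two    1.0
Name: a, dtype: float64

df.ix['a', [0, 1]]
one    1.0
two    1.0
Name: a, dtype: float64

df.ix[['a', 'b'], [0, 1]]
   one  two
a  1.0  1.0
b  2.0  2.0

df.ix[1, [0, 1]]
one    2.0
two    2.0
Name: b, dtype: float64

df.ix[[0, 1], [0, 1]]
   one  two
a  1.0  1.0
b  2.0  2.0

Note: The ix indexer accepts a mix of labels and integer positions to select data. It was deprecated in pandas 0.20.0 and removed in pandas 1.0, so these examples only run on older versions; on current versions use loc (labels) or iloc (positions) instead.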
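
For readers on a pandas version without ix, the selections above can be written with loc and iloc. This is one possible translation, not the only one; each line is annotated with the ix call it stands in for.

```python
import pandas as pd

df = pd.DataFrame({
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
})

row_a = df.loc["a"]                                # df.ix['a']
row_a_cols = df.loc["a", ["one", "two"]]           # df.ix['a', ['one', 'two']]
mixed = df.loc[["a", "b"], df.columns[[0, 1]]]     # df.ix[['a', 'b'], [0, 1]]
row_1 = df.iloc[1, [0, 1]]                         # df.ix[1, [0, 1]]
block = df.iloc[[0, 1], [0, 1]]                    # df.ix[[0, 1], [0, 1]]

print(mixed)
print(row_1)
```

When row labels and column positions are mixed, the positions are first turned into labels with df.columns[...], so that a single loc call can handle both axes.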


df.ix[df.one > 1, :1]
   one
b  2.0
c  3.0

Note: Selection by condition: this picks the rows where column one is greater than 1, restricted to the first column.
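
The same conditional selection can be sketched without ix by combining a boolean mask with loc. The names mask and first_col are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
})

# Row condition: NaN compares as False, so row "d" is not selected.
mask = df["one"] > 1

# Label of the first column, looked up by position.
first_col = df.columns[:1]

selected = df.loc[mask, first_col]
print(selected)
```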


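df['one'] = 16.8
    one  two
a  16.8  1.0
b  16.8  2.0
c  16.8  3.0
d  16.8  4.0

val = pd.Series([2, 2, 2], index=['b', 'c', 'd'])
df['one'] = val
   one  two
a  NaN  1.0
b  2.0  2.0
c  2.0  3.0
d  2.0  4.0

Note: A column can be modified by assignment. A scalar is repeated for every row. When a list or array is assigned to a column, its length must match the length of the DataFrame. When a Series is assigned, its values are matched exactly against the DataFrame's index, and positions with no matching label are filled with NaN.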
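
The difference between assigning a Series and assigning a list can be sketched as follows. The column names from_series and from_list and the extra label "z" are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0]}, index=["a", "b", "c", "d"])

# A Series is aligned on its index: "a" and "d" have no match and become NaN,
# and the extra label "z" is ignored.
df["from_series"] = pd.Series([2, 2, 2], index=["b", "c", "z"])
print(df)

# A list is assigned by position, so its length must equal the number of rows.
length_error = None
try:
    df["from_list"] = [1, 2, 3]
except ValueError as exc:
    length_error = str(exc)
    print(length_error)
```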


df['four'] = [3, 3, 3, 3]
   one  two  four
a  NaN  1.0     3
b  2.0  2.0     3
c  2.0  3.0     3
d  2.0  4.0     3

Note: Assigning to a column that does not exist creates a new column.


df.index.get_loc('a')
0

df.index.get_loc('b')
1

df.columns.get_loc('one')
0

Note: get_loc on the row or column index returns the integer position of a label.
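
One use for these positions is to combine a label lookup with positional arithmetic. A minimal sketch, using a small frame whose values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

# Translate labels into integer positions.
row_pos = df.index.get_loc("b")
col_pos = df.columns.get_loc("two")

# The positions can be passed to iloc, for example to reach the row after "b".
at_b = df.iloc[row_pos, col_pos]
after_b = df.iloc[row_pos + 1, col_pos]
print(at_b, after_b)
```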
