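Preface
Pandas is one of the most widely used data analysis packages in the Python ecosystem. It is built on NumPy and provides higher-level data structures and tools for data analysis. Pandas revolves around two core data structures, Series and DataFrame. This article covers the basic ways to create and access these two structures.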
Series
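A Series is an object similar to a one-dimensional array. It consists of a set of data (a one-dimensional ndarray) and an associated set of data labels (the index).
Note: NumPy (Numerical Python) provides Python's support for multidimensional array objects through ndarray, which supports vectorized operations and is fast and memory-efficient.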
(1) The pandas documentation describes Series as follows:
""" One-dimensional ndarray with axis labels (including time series).
Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. The result
index will be the sorted union of the two indexes.

Parameters
----------
data : array-like, dict, or scalar value
    Contains data stored in Series
index : array-like or Index (1d)
    Values must be hashable and have the same length as `data`.
    Non-unique index values are allowed. Will default to
    RangeIndex(len(data)) if not provided. If both a dict and index
    sequence are used, the index will override the keys found in the
    dict.
dtype : numpy.dtype or None
    If None, dtype will be inferred
copy : boolean, default False
    Copy input data
"""
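The index alignment mentioned in the docstring can be seen in a small sketch. The Series names and values below are made up for illustration.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative values).
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "d"])

# Addition aligns on labels, not on position. Labels that appear in
# only one operand produce NaN in the result.
total = s1 + s2
print(total)
```

Only the label 'b' exists in both operands, so it is the only entry that holds a number after the addition.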
(2) The basic way to create a Series is shown below. The data can be an array-like (list, ndarray), a dict, or a scalar value.
s = pd.Series(data, index=index)
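# Cast after construction: recent pandas versions may reject a lossy float-to-int dtype in the constructor.
s = pd.Series([-1.55666192, -0.75414753, 0.47251231, -1.37775038, -1.64899442], index=['a', 'b', 'c', 'd', 'e']).astype('int8')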
a -1
b 0
c 0
d -1
e -1
dtype: int8
s = pd.Series(['a', -0.75414753, 123, 66666, -1.64899442], index=['a', 'b', 'c', 'd', 'e'])
a a
b -0.754148
c 123
d 66666
e -1.64899
dtype: object
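Note: Series supports the numpy.dtype types, including integers, floats, complex numbers, booleans, and strings. As with creating an ndarray, if no dtype is given, pandas tries to infer a suitable one. In the examples above, data that mixes numbers and strings is inferred as object, and data converted to int8 is truncated toward zero and shown as int8.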
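A minimal sketch of the dtype inference described in the note. The variable names and values are illustrative only.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])          # all integers
floats = pd.Series([1.5, 2, 3])      # integers mixed with a float
mixed = pd.Series(["a", 1.5, 123])   # strings mixed with numbers

print(ints.dtype, floats.dtype, mixed.dtype)

# Converting floats to an integer dtype drops the fractional part.
truncated = pd.Series([-1.56, 0.47]).astype("int8")
print(truncated.tolist())
```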
s = pd.Series(np.random.randn(5))
0 0.485468
1 -0.912130
2 0.771970
3 -1.058117
4 0.926649
dtype: float64
s.index
RangeIndex(start=0, stop=5, step=1)
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
a 0.485468
b -0.912130
c 0.771970
d -1.058117
e 0.926649
dtype: float64
Note: when no index is specified, Series automatically creates an integer index (a RangeIndex starting at 0).
s = pd.Series({'a': 0., 'b': 1., 'c': 2.})
a 0.0
b 1.0
c 2.0
dtype: float64
s = pd.Series({'a': 0., 'b': 1., 'c': 2.}, index=['b', 'c', 'd', 'a'])
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
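Note: a Series created from a Python dict can be viewed as a fixed-length, ordered dict. If only a dict is passed, the Series index consists of the dict's keys. If an index is also passed, values are looked up by matching index labels and placed in the corresponding positions; labels with no matching key get NaN.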
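The matching behaviour can be checked with a short sketch. The dict name `prices` and its values are assumptions for illustration.

```python
import pandas as pd

prices = {"a": 0.0, "b": 1.0, "c": 2.0}   # illustrative data
s = pd.Series(prices, index=["b", "c", "d", "a"])

# 'd' has no entry in the dict, so its value is NaN.
missing = s[s.isna()].index.tolist()
print(missing)

# dropna() returns a copy without the unmatched labels.
kept = s.dropna().index.tolist()
print(kept)
```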
s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64
Note: the scalar value is repeated to match the length of the index.
(3) Accessing elements and the index of a Series
s = pd.Series({'a': 0., 'b': 1., 'c': 2.}, index=['b', 'c', 'd', 'a'])
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
s.values
[ 1. 2. nan 0.]
s.index
Index(['b', 'c', 'd', 'a'], dtype='object')
Note: the values and index attributes return the Series data as an array and its index object, respectively.
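As a small sketch of working with these two attributes: the example below converts them to plain NumPy and Python objects. `to_numpy()` is used as an alternative to `values`; it assumes pandas 0.24 or later.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 0.0], index=["b", "c", "a"])

arr = s.to_numpy()           # NumPy array holding the data
labels = s.index.tolist()    # plain Python list of the labels

print(type(arr).__name__, arr)
print(labels)
```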
s['a']
0.0
s[['a', 'b']]
a 0.0
b 1.0
dtype: float64
s[['a', 'b', 'c']]
a 0.0
b 1.0
c 2.0
dtype: float64
s[:2]
b 1.0
c 2.0
dtype: float64
Note: one or more values can be selected from a Series by indexing.
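Plain square brackets accept both labels and slices, which can be ambiguous. A hedged sketch of the more explicit accessors follows; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 0.0], index=["b", "c", "a"])

by_label = s.loc["a"]           # label-based: the value stored under 'a'
by_position = s.iloc[0]         # position-based: the first value
label_slice = s.loc["b":"c"]    # label slices include both endpoints
positive = s[s > 0]             # boolean mask keeps matching entries

print(by_label, by_position)
print(label_slice.tolist(), positive.index.tolist())
```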
DataFrame
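A DataFrame is a tabular (two-dimensional) data structure. It contains an ordered set of columns, and each column can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be viewed as a dict of Series that share the same index.
(1) The pandas documentation describes DataFrame as follows: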
""" Two-dimensional size-mutable, potentially heterogeneous tabular
data structure with labeled axes (rows and columns). Arithmetic
operations align on both row and column labels. Can be thought of as a
dict-like container for Series objects. The primary pandas data
structure

Parameters
----------
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects
index : Index or array-like
    Index to use for resulting frame. Will default to np.arange(n) if
    no indexing information part of input data and no index provided
columns : Index or array-like
    Column labels to use for resulting frame. Will default to
    np.arange(n) if no column labels are provided
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input
"""
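(2) The basic way to create a DataFrame is shown below. The data can be a dict of lists, one-dimensional ndarrays, or Series (lists and ndarrays must have the same length), a two-dimensional ndarray, a dict of dicts, and so on.
df = pd.DataFrame(data, index=index)
df = pd.DataFrame({'one': [1., 2., 3., 5], 'two': [1., 2., 3., 4.]})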
one two
0 1.0 1.0
1 2.0 2.0
2 3.0 3.0
3 5.0 4.0
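Note: when created from a dict of lists, each sequence becomes one column of the DataFrame. Passing the lists without keys, as in df = pd.DataFrame({[1., 2., 3., 5], [1., 2., 3., 4.]}), is not supported: the braces then form a set literal, and a list is unhashable, so it cannot be a set element.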
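The failure described in the note happens in plain Python, before pandas is involved. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

# Braces without keys build a set, and lists cannot be set elements.
try:
    data = {[1.0, 2.0, 3.0, 5.0], [1.0, 2.0, 3.0, 4.0]}
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# A plain nested list works instead: each inner list becomes one row.
df = pd.DataFrame([[1.0, 2.0, 3.0, 5.0], [1.0, 2.0, 3.0, 4.0]])
print(df.shape)
```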
df = pd.DataFrame([[1., 2., 3., 5], [1., 2., 3., 4.]], index=['a', 'b'], columns=['one', 'two', 'three', 'four'])
one two three four
a 1.0 2.0 3.0 5.0
b 1.0 2.0 3.0 4.0
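Note: a nested list creates a table with 2 rows and 4 columns; the index and columns arguments specify the row labels and the column names.
data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'U10')])
[(0, 0., '') (0, 0., '')]
Note: zeros(shape, dtype=float, order='C') returns an array of the given shape and type, filled with zeros. The 'U10' field stores text of up to 10 characters (the 'a10' code seen in older examples stores bytes instead).
data[:] = [(1, 2., 'Hello'), (2, 3., 'World')]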
df = pd.DataFrame(data)
A B C
0 1 2.0 Hello
1 2 3.0 World
df = pd.DataFrame(data, index=['first', 'second'])
A B C
first 1 2.0 Hello
second 2 3.0 World
df = pd.DataFrame(data, columns=['C', 'A', 'B'])
C A B
0 Hello 1 2.0
1 World 2 3.0
Note: as with Series, a DataFrame gets a default index when none is specified; when columns are specified, the columns appear in the given order.
data = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
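Note: when created from a dict of Series, each Series becomes one column. If no index is given explicitly, the indexes of the Series are merged (as a union) into the row index of the result. NaN fills in the missing data.
df = pd.DataFrame(data, index=['d', 'b', 'a'])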
one two
d NaN 4.0
b 2.0 2.0
a 1.0 1.0
df = pd.DataFrame(data, index=['d', 'b', 'a'], columns=['two', 'three'])
two three
d 4.0 NaN
b 2.0 NaN
a 1.0 NaN
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data2)
a b c
0 1 2 NaN
1 5 10 20.0
Note: when created from a list of dicts, each dict becomes one row of the DataFrame, and the union of the dict keys becomes the column labels.
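One side effect of the missing key is worth a quick sketch: a column that contains NaN is stored as float64 even when the values present are integers. The name `records` is illustrative.

```python
import pandas as pd

records = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
df = pd.DataFrame(records)

# 'c' is missing from the first record, so that cell is NaN and the
# column becomes float64; the complete columns stay integer.
print(df.dtypes)
```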
df = pd.DataFrame(data2, index=['first', 'second'])
a b c
first 1 2 NaN
second 5 10 20.0
df = pd.DataFrame(data2, columns=['a', 'b'])
a b
0 1 2
1 5 10
df = pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
                   ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
                   ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
                   ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
                   ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
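       a              b
       a    b    c    a     b
A B  4.0  1.0  5.0  8.0  10.0
  C  3.0  2.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0
Note: when created from a dict of dicts, the outer keys are merged into the column index, each inner dict becomes one column, and the inner keys are merged into the row index. Tuple keys, as used here, produce a MultiIndex on the corresponding axis. Newer pandas versions keep the dict's insertion order for columns, so the column order may differ from the output above.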
(3) Accessing elements and the index of a DataFrame
data = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
df['one']  # or df.one
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
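Note: a DataFrame column can be retrieved as a Series using either dict-like notation or attribute access. The returned Series has the same index as the original DataFrame, and its name attribute is set accordingly.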
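The two access styles can be compared in a short sketch. The frame below is a cut-down, illustrative version of the one used above.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [1.0, 2.0, 3.0]},
                  index=["a", "b", "c"])

col = df["one"]

# Attribute access returns the same column, but it only works when the
# column name is a valid Python identifier that does not clash with an
# existing DataFrame attribute or method.
same = df.one

print(col.name, col.index.tolist())
```

Bracket access is the safer default, since it works for any column name.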
df[0:1]
one two
a 1.0 1.0
Note: slicing with integers selects rows by position; df[0:1] returns the first row.
df.loc['a']
one 1.0
two 1.0
Name: a, dtype: float64
df.loc[:, ['one', 'two']]
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
df.loc[['a'], ['one', 'two']]
one two
a 1.0 1.0
df.loc['a', 'one']
1.0
Note: loc selects data by label.
df.iloc[0:2,0:1]
one
a 1.0
b 2.0
df.iloc[0:2]
one two
a 1.0 1.0
b 2.0 2.0
df.iloc[[0, 2], [0, 1]]  # freely pick row positions and the matching column positions
one two
a 1.0 1.0
c 3.0 3.0
Note: iloc selects data by position.
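One practical difference between the two accessors is how slices treat the end point. A minimal sketch with an illustrative single-column frame:

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0]}, index=["a", "b", "c", "d"])

by_label = df.loc["a":"c"]    # label slice: 'c' is included
by_position = df.iloc[0:2]    # position slice: row 2 is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```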
df.ix['a']
one 1.0
two 1.0
Name: a, dtype: float64
df.ix['a', ['one', 'two']]
one 1.0
two 1.0
Name: a, dtype: float64
df.ix['a', [0, 1]]
one 1.0
two 1.0
Name: a, dtype: float64
df.ix[['a', 'b'], [0, 1]]
one two
a 1.0 1.0
b 2.0 2.0
df.ix[1,[0,1]]
one 2.0
two 2.0
Name: b, dtype: float64
df.ix[[0,1],[0,1]]
one two
a 1.0 1.0
b 2.0 2.0
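Note: ix accepts both labels and integer positions, so rows and columns can be selected with any combination of the two. ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the calls above only run on older versions; on current versions use loc or iloc instead.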
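For readers on a pandas version without ix, the selections above can be sketched with loc and iloc. This is one possible translation, not the only one; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0, float("nan")],
                   "two": [1.0, 2.0, 3.0, 4.0]},
                  index=["a", "b", "c", "d"])

# Labels on both axes: use loc.
row_a = df.loc["a", ["one", "two"]]

# Positions on both axes: use iloc.
first_two = df.iloc[[0, 1], [0, 1]]

# Row label with column positions: turn the positions into labels first.
mixed = df.loc["a", df.columns[[0, 1]]]

# Boolean row mask combined with the first column.
filtered = df.loc[df["one"] > 1, df.columns[:1]]

print(first_two)
print(filtered)
```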
df.ix[df.one > 1, :1]
one
b 2.0
c 3.0
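Note: selection by condition. This picks the rows where column one is greater than 1, together with the first column. On pandas versions without ix, df.loc[df.one > 1, ['one']] gives the same result.
df['one'] = 16.8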
one two
a 16.8 1.0
b 16.8 2.0
c 16.8 3.0
d 16.8 4.0
val = pd.Series([2, 2, 2], index=['b', 'c', 'd'])
df['one'] = val
one two
a NaN 1.0
b 2.0 2.0
c 2.0 3.0
d 2.0 4.0
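Note: a column can be modified by assignment. When a list or array is assigned to a column, its length must match the length of the DataFrame. When a Series is assigned, it is aligned exactly on the DataFrame's index, and positions without a match are filled with NaN.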
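Both rules from the note can be exercised in a short sketch. The column name `bad` is a hypothetical name used only to trigger the length check.

```python
import pandas as pd

df = pd.DataFrame({"two": [1.0, 2.0, 3.0, 4.0]}, index=["a", "b", "c", "d"])

# A Series is aligned on its index, so row 'a' receives NaN.
df["one"] = pd.Series([2, 2, 2], index=["b", "c", "d"])

# A list carries no index, so its length must equal the number of rows.
try:
    df["bad"] = [1, 2, 3]
    length_error = None
except ValueError as exc:
    length_error = str(exc)

print(df["one"].tolist())
print(length_error)
```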
df['four'] = [3, 3, 3, 3]
one two four
a NaN 1.0 3
b 2.0 2.0 3
c 2.0 3.0 3
d 2.0 4.0 3
Note: assigning to a column that does not exist creates a new column.
df.index.get_loc('a')
0
df.index.get_loc('b')
1
df.columns.get_loc('one')
0
Note: get_loc returns the integer position of a row or column label.
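A common use of get_loc is to combine a known label with positional selection. The sketch below assumes unique labels, in which case get_loc returns a plain integer; with duplicate labels it can return a slice or a boolean mask instead.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
                  index=["a", "b", "c"])

row_pos = df.index.get_loc("b")        # integer position of row label 'b'
col_pos = df.columns.get_loc("two")    # integer position of column 'two'

# The positions address the same cell as df.loc['b', 'two'].
value = df.iloc[row_pos, col_pos]
print(row_pos, col_pos, value)
```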