Data Analysis with Python: pandas (Part 1)

Posted by 小胖樂 on 2019-06-04

Original URL: https://juejin.im/post/5cd3d90d51882535786bc430

Series, DataFrame, and Timestamps

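Creating a Series

import numpy as np
import pandas as pd

# From a dict: the keys become the index
a = {'a': 1, 'b': 2, 'c': 3}
S = pd.Series(a)

# From an array
S1 = pd.Series(np.random.randn(4))

# From a scalar
S2 = pd.Series(10, index=range(4))
# The scalar is repeated once per index label, so the length of index sets the length of the Series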

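Indexing a Series

s = pd.Series(np.random.rand(5)*100, index=['a', 'b', 'c', 'd', 'e'])
print(s[['a', 'c', 'e']])  # pick exactly the labels you want
print(s[1:3])              # positional slice: left-closed, right-open
print(s['a':'c'])          # label slice: both ends are included
print(s.iloc[-1])          # last element; s[-1] is no longer treated as a position in recent pandas
print(s[::2])              # step of 2

# Boolean indexing
bs1 = s > 50
bs2 = s.isnull()
bs3 = s.notnull()
S2 = s[s.notnull()]  # keeps only the values of s that are not null
# A comparison returns a boolean Series; indexing with it filters the data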
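To make the slicing rules above easier to check by eye, here is a small sketch with fixed values instead of random ones. The numbers 10 to 50 and the variable names are made up for illustration.

```python
import pandas as pd

# Fixed, made-up values so the results are predictable
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

by_position = s.iloc[1:3]   # positions 1 and 2; the end position is excluded
by_label = s.loc['a':'c']   # labels 'a' through 'c'; the end label is included
mask = s > 25               # boolean Series, one True/False per element
filtered = s[mask]          # keeps only the elements where mask is True

print(by_position.tolist())  # [20, 30]
print(by_label.tolist())     # [10, 20, 30]
print(filtered.tolist())     # [30, 40, 50]
```

Writing iloc and loc explicitly, rather than bare brackets, removes any doubt about whether a slice is positional or label-based.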

Common Series methods

s = pd.Series(np.random.rand(3), index=['a', 'b', 'c'])
s.head()  # first five rows (here all three)
s.tail()  # last five rows
s1 = s.reindex(['b', 'a', 'c', 'd'], fill_value=0)  # labels that reindex adds ('d') would be NaN; fill_value=0 fills them with 0
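A quick sketch of what fill_value changes in reindex, using fixed values (the data here is made up for illustration):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

r_nan = s.reindex(['b', 'a', 'c', 'd'])                 # 'd' is new, so it gets NaN
r_zero = s.reindex(['b', 'a', 'c', 'd'], fill_value=0)  # 'd' gets 0 instead

print(r_nan.isnull().tolist())  # [False, False, False, True]
print(r_zero.tolist())          # [2.0, 1.0, 3.0, 0.0]
```

Note that reindex also reorders the existing values to follow the new label order.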

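Automatic alignment of Series

s1 = pd.Series(np.random.rand(3), index=['Jack', 'Marry', 'Tom'])
s2 = pd.Series(np.random.rand(3), index=['Wang', 'Jack', 'Marry'])
print(s1 + s2)

# pandas automatically matches identical labels and adds their values; a label that appears in only one Series, or whose value is missing, gives NaN in the result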
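The alignment rule can be seen with fixed numbers. The sketch below also uses Series.add with fill_value, which is one way (not mentioned in the original post) to treat a missing label as 0 instead of getting NaN:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['Jack', 'Marry', 'Tom'])
s2 = pd.Series([10, 20, 30], index=['Wang', 'Jack', 'Marry'])

total = s1 + s2                    # labels are matched first, then added
filled = s1.add(s2, fill_value=0)  # a label missing on one side counts as 0

print(total['Jack'])    # 21.0  (1 + 20)
print(total['Tom'])     # nan   ('Tom' exists only in s1)
print(filled['Tom'])    # 3.0
print(filled['Wang'])   # 10.0
```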

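Adding, deleting, and modifying Series elements

s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5), index=list('ngjur'))
# Add: assigning to a new label appends an element
s2['a'] = 100
s1[5] = 30
# Delete: drop returns a new Series, so assign the result to keep it
s1 = s1.drop(1)
s2 = s2.drop('n')
# Modify
s2['a'] = 1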

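Five ways to create a DataFrame

# Method 1: from a dict of lists
data = {'a': [1, 2, 3],
        'b': np.random.rand(3)
}
df = pd.DataFrame(data, index=['one', 'two', 'three'])
# Note: the length of index must equal the number of rows; columns may name anything, and a name that is not in the dict becomes an all-NaN column

# Method 2: from a dict of Series
data = {'one': pd.Series(np.random.rand(3)),
        'two': pd.Series(np.random.rand(2))
}
df = pd.DataFrame(data, index=[1, 2, 3])  # each Series is aligned to this index; missing labels become NaN

# Method 3: from a 2-D array
data = np.random.rand(9).reshape(3, 3)
df = pd.DataFrame(data)

# Method 4: from a list of dicts
data = [{'one': 1, 'two': 2}, {'one': 5, 'two': 10, 'three': 20}]
df1 = pd.DataFrame(data)

# Method 5: from a dict of dicts
data = {'key1': {'math': 45, 'eng': 56, 'art': 65},
        'key2': {'math': 23, 'eng': 12, 'art': 87}
}
df1 = pd.DataFrame(data)
# The outer keys (key1, key2) become the columns; the inner keys become the row index

df1.index    # row index of the DataFrame
df.columns   # column index of the DataFrame
df.values    # the elements of the DataFrame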
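Methods 4 and 5 are the ones where orientation is easiest to mix up. The sketch below reuses the same data as above and checks which keys end up as rows and which as columns:

```python
import pandas as pd

# List of dicts: each dict is one row; keys missing from a row become NaN
rows = [{'one': 1, 'two': 2}, {'one': 5, 'two': 10, 'three': 20}]
df_rows = pd.DataFrame(rows)

# Dict of dicts: outer keys are columns, inner keys are the row index
nested = {'key1': {'math': 45, 'eng': 56, 'art': 65},
          'key2': {'math': 23, 'eng': 12, 'art': 87}}
df_nested = pd.DataFrame(nested)

print(df_rows)                          # row 0 has NaN under 'three'
print(list(df_nested.columns))          # ['key1', 'key2']
print(df_nested.loc['math', 'key2'])    # 23
print(df_nested.T)                      # transpose if you want key1/key2 as rows
```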

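Row and column selection, slicing, and boolean indexing in pandas

df = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100,
                  columns=list('ABCD'), index=['one', 'two', 'three', 'four'])
# Selecting columns
df['A']  # index directly by column name
df[['A', 'C']]
# df['A':'C']  # not allowed for columns: a slice inside [] selects rows
# One column name returns a Series; a list of names returns a DataFrame
# Selecting rows
df.loc['one':'three', 'A':'B']  # by label, both ends included
df.iloc[1:3, 0]                 # by position; iloc does not accept a label such as 'A'
# df.ix[1:3, 'B']               # ix was removed in pandas 1.0; use loc or iloc instead

# Boolean indexing
print(df > 20)
print(df[df > 20])                # values that fail the condition become NaN
print(df[df > 20][['A', 'B']])    # values > 20 in columns A and B; this indexes again into the result of df[df > 20]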

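# The difference between ix, loc, and iloc: iloc selects by integer position only and does not accept labels; loc selects by label; ix accepted both positions and labels, but it was deprecated in pandas 0.20 and removed in 1.0. iloc is often described as the fastest of the three because it skips the label lookup.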
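A minimal sketch of label-based versus position-based selection on the same frame. The frame uses np.arange so the values are predictable; that choice is for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  columns=list('ABCD'),
                  index=['one', 'two', 'three', 'four'])

by_label = df.loc['one':'three', 'A':'B']   # labels, both ends included
by_position = df.iloc[0:3, 0:2]             # positions, end excluded

print(by_label.equals(by_position))  # True: both pick the same 3x2 block
print(df.loc['two', 'C'])            # 6
print(df.iloc[1, 2])                 # 6
```

The off-by-one between the two slices (three labels versus positions 0 to 3) is the usual source of mistakes when switching between loc and iloc.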

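A way to think about it: you can treat a DataFrame as a collection of Series, so much of its usage mirrors Series. Each column is itself a Series, and to select two columns you write df[['A','B']]. It follows that adding, deleting, and modifying a DataFrame works much like it does for a Series.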

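df = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100,
                  columns=['a', 'b', 'c', 'd'])
# Add
df['e'] = 10    # add a column
df.loc[4] = 1   # add a row
print(df)
# Modify
df[['a', 'c']] = 100   # modify columns
df.iloc[3] = 1         # modify a row
# Delete
del df['a']   # changes df itself
print(df.drop(['b', 'c'], axis=1))
print(df.drop(0))
print(df.drop([1, 2]))
# Note: drop returns a new DataFrame and leaves the original df unchanged. To change df itself, pass inplace=True, for example:
df.drop(['b'], axis=1, inplace=True)   # 'a' was already removed by del above, so drop 'b' here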
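The copy-versus-inplace behaviour of drop is easy to verify on a tiny frame (the column names and values below are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

smaller = df.drop(['b'], axis=1)       # returns a new DataFrame
cols_before = list(df.columns)         # df itself still has 'b'

df.drop(['b'], axis=1, inplace=True)   # modifies df and returns None
cols_after = list(df.columns)

print(cols_before)            # ['a', 'b', 'c']
print(list(smaller.columns))  # ['a', 'c']
print(cols_after)             # ['a', 'c']
```

Reassigning the result (df = df.drop(...)) has the same effect as inplace=True and is often easier to read in a chain of operations.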

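Common DataFrame methods

# unique(): df['column_name'].unique() returns the distinct values of that column

# Sorting with sort_values()
df1 = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100,
                   columns=['a', 'b', 'c', 'd'])
print(df1)
# print(df1.sort_values(['a'], ascending=True))  # ascending
print(df1.sort_values(['a', 'c'], ascending=False))  # descending; the default is ascending

# Sorting by index with sort_index
df1.sort_index(ascending=True, inplace=True)
# After these examples the pattern is clear: ascending sets the sort direction, and inplace says whether to operate on the original data; with inplace=False a new DataFrame is returned

# value_counts() counts how many times each value occurs in a column
df1['a'].value_counts()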
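With random floats every value is unique, so unique and value_counts are not very telling. Here is a sketch on a small frame with repeated values (made-up data) that shows what each method returns:

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 1, 2, 3], 'c': [9, 8, 7, 6]})

# Sort by 'a' descending; ties in 'a' are broken by 'c', also descending
ordered = df.sort_values(['a', 'c'], ascending=False)
distinct = df['a'].unique()       # distinct values, in order of first appearance
counts = df['a'].value_counts()   # occurrences per value, most frequent first

print(ordered.index.tolist())  # [3, 0, 2, 1]
print(distinct.tolist())       # [2, 1, 3]
print(counts.loc[2])           # 2
```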

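Timestamps

The main things to learn are datetime, Timestamp, DatetimeIndex, Period, and the indexing and resampling of time series. In e-commerce and finance a lot of data is indexed by time, so a solid grasp of timestamps is essential.

For datetime, the key pieces are datetime.date(), datetime.datetime(), and datetime.timedelta()

import datetime
print(datetime.date.today())      # today's date
print(datetime.date(2019, 1, 2))  # a date of your choosing

# datetime.datetime
t1 = datetime.datetime(2018, 3, 5)
t2 = datetime.datetime(2017, 2, 14, 15, 13, 45)
print(t1, t2)
# Unlike datetime.date, datetime.datetime also carries hours, minutes, and seconds

# datetime.timedelta(): a time difference
today = datetime.date(2016, 2, 5)
yesterday = today - datetime.timedelta(1)
print(today, yesterday)

# Date parsing: convert a string into a datetime
from dateutil.parser import parse
t = '2018/2/23'
date = parse(t)
print(date)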
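The three building blocks above can be combined in a few lines. This sketch reuses the dates from the examples; the variable names are made up.

```python
import datetime
from dateutil.parser import parse

day = datetime.date(2016, 2, 5)
previous = day - datetime.timedelta(days=1)   # one day earlier

# Subtracting two datetimes gives a timedelta
gap = datetime.datetime(2018, 3, 5) - datetime.datetime(2017, 2, 14, 15, 13, 45)

parsed = parse('2018/2/23')   # a datetime.datetime at midnight

print(previous)   # 2016-02-04
print(gap.days)   # 383
print(parsed)     # 2018-02-23 00:00:00
```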

Timestamp

# pd.Timestamp can also be precise down to minutes and seconds
t1 = pd.Timestamp('2017-3-2')
t2 = pd.Timestamp('1919-2-4')
print(t1, t2)

# pd.to_datetime
date = pd.to_datetime('2014-1-2')
print(type(date))
date_index = pd.to_datetime(['2012/2/2', '2013/2/3'])
print(type(date_index))
# Note: a single string gives a Timestamp; a list of strings gives a DatetimeIndex
# When you meet a new data type for the first time, printing its type helps you understand it

date = ['2017-2-1', '2017-2-2', '2017-2-3', 'hello world!', '2017-2-5', '2017-2-6']
t2 = pd.to_datetime(date, errors='coerce')
print(type(t2))  # DatetimeIndex; 'hello world!' becomes NaT
# t1 = pd.to_datetime(date, errors='ignore')  # deprecated since pandas 2.2
# The parameter is spelled errors. errors='ignore' skips the error and returns the input in its original form,
# while errors='coerce' forces the result into a DatetimeIndex
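A common follow-up to errors='coerce' is to find and drop the entries that failed to parse. A minimal sketch, with a shortened made-up list:

```python
import pandas as pd

raw = ['2017-02-01', '2017-02-02', 'hello world!', '2017-02-05']

# errors='coerce' turns anything unparseable into NaT instead of raising
parsed = pd.to_datetime(raw, errors='coerce')
bad = parsed.isna()        # True where parsing failed
clean = parsed.dropna()    # keep only the valid timestamps

print(type(parsed).__name__)  # DatetimeIndex
print(bad.tolist())           # [False, False, True, False]
print(len(clean))             # 3
```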

DatetimeIndex

Using time as the index

rng = pd.DatetimeIndex(['12/1/2017', '12/2/2017', '12/3/2017', '12/4/2017', '12/5/2017'])
data = pd.Series(np.random.rand(5), index=rng)

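Generating date ranges with pd.date_range

# Generate a date range
'''
pd.date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False, name=None, closed=None, **kwargs)

start: start time
end: end time
periods: number of timestamps to generate
freq: frequency
tz: time zone
normalize: normalize the start/end to midnight
closed: which ends of the interval to include; both by default. Usage: closed='left' or closed='right'
        (pandas 1.4 renamed this parameter to inclusive, and pandas 2.0 removed closed)
'''
t1 = pd.date_range('2017/2/1', '2017/3/2')
# print(t1)  # dates from 2/1 through 3/2; the default frequency is daily
t2 = pd.date_range(start='2017/2/1', periods=10)
t3 = pd.date_range(end='2017/2/1', periods=10)
t4 = pd.date_range(start='2017/2', periods=10, freq='M')  # month end; pandas 2.2+ spells this alias 'ME'. Output:
'''
'2017-02-28', '2017-03-31', '2017-04-30', '2017-05-31',
               '2017-06-30', '2017-07-31', '2017-08-31', '2017-09-30',
               '2017-10-31', '2017-11-30'],
'''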
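Any two of start, end, and periods are enough to pin down a daily range. The sketch below shows the three combinations used above on a shorter span so the output fits on one line:

```python
import pandas as pd

by_end = pd.date_range('2017-02-01', '2017-02-05')       # start and end, daily
forward = pd.date_range(start='2017-02-01', periods=3)   # 3 days counting from start
backward = pd.date_range(end='2017-02-01', periods=3)    # 3 days ending at end

print(len(by_end))                             # 5: both endpoints are included
print(forward.strftime('%Y-%m-%d').tolist())   # ['2017-02-01', '2017-02-02', '2017-02-03']
print(backward.strftime('%Y-%m-%d').tolist())  # ['2017-01-30', '2017-01-31', '2017-02-01']
```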

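The following is especially useful in finance: the different values of freq and how to use them.

pd.date_range('2015/2/5', '2015/2/6', freq='D')
'''
freq defaults to 'D' (daily).
Basic values:
'M': month
'Y': year
'T' or 'min': minute
'S': second
'L': millisecond
'U': microsecond
(pandas 2.2 renames several of these: 'M' to 'ME', 'Y' to 'YE', 'T' to 'min', 'S' to 's', 'L' to 'ms', 'U' to 'us')
'''
# Other freq values
'''
'W-MON': weekly, anchored on the given weekday (here every Monday)
'WOM-2MON': the n-th given weekday of each month (here the second Monday)
'''
# Examples
print(pd.date_range('2018/1/1', '2018/2/1', freq='W-MON'))      # one-week interval
print(pd.date_range('2018/1/1', '2018/5/1', freq='WOM-2MON'))   # one-month interval

'''
M: last calendar day of each month
Q-DEC: Q-<month>, quarterly with the given month as the end of the year; the last calendar day of the last month of each quarter
        quarter-end months: 1-4-7-10 (Q-JAN), 2-5-8-11 (Q-FEB), ...
A-DEC: last calendar day of the given month, once a year
'''
print(pd.date_range('2017', '2018', freq='M'))
print(pd.date_range('2017', '2020', freq='Q-DEC'))
print(pd.date_range('2017', '2018', freq='A-DEC'))
# Monthly, quarterly, and yearly intervals respectively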

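print(pd.date_range('2017', '2018', freq='BMS'))
print(pd.date_range('2017', '2020', freq='BQS-DEC'))
print(pd.date_range('2017', '2020', freq='BAS-DEC'))  # first business day of each month, quarter, and year ('BAS' is spelled 'BYS' in pandas 2.2+)

Summary

A freq alias combines up to three parts: B, one of (M, Q, A), and S. B restricts the dates to business days; M, Q, and A set the frequency to monthly, quarterly, or yearly; S picks the start of each period (with B, the first business day) instead of the end.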
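To see how the parts of an alias combine, the sketch below compares a few frequencies on dates whose weekdays are easy to check (1 January 2018 is a Monday, 1 January 2017 a Sunday). It sticks to aliases whose spelling has not changed across pandas versions; the helper days is made up for this example.

```python
import pandas as pd

def days(index):
    """Format a DatetimeIndex as a list of 'YYYY-MM-DD' strings."""
    return index.strftime('%Y-%m-%d').tolist()

mondays = pd.date_range('2018-01-01', '2018-02-01', freq='W-MON')
second_mondays = pd.date_range('2018-01-01', '2018-05-01', freq='WOM-2MON')
month_starts = pd.date_range('2017-01-01', periods=4, freq='MS')
business_month_starts = pd.date_range('2017-01-01', periods=4, freq='BMS')

print(days(mondays))                # every Monday in January 2018
print(days(second_mondays))         # ['2018-01-08', '2018-02-12', '2018-03-12', '2018-04-09']
print(days(month_starts))           # ['2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01']
print(days(business_month_starts))  # ['2017-01-02', '2017-02-01', '2017-03-01', '2017-04-03']
```

The last two lines differ only where the first of the month falls on a weekend, which is exactly what the B prefix changes.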

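# Frequency conversion
# Series.asfreq()
rng = pd.date_range('2018', '2019')
ts = pd.Series(np.random.rand(len(rng)), index=rng)
r = ts.asfreq('4H', method='ffill')  # one row every 4 hours, forward-filled ('4h' in pandas 2.2+)

# asfreq on a Period: frequency conversion

p = pd.Period('2017', 'A-DEC')
print(p)
print(p.asfreq('M', how='start'))  # how='s' also works
print(p.asfreq('D', how='end'))    # how='e' also works
# Period.asfreq(freq, how='E') converts the period to another frequency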
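A compact sketch of both uses of asfreq with values that can be checked by hand. It uses the 'Y' and '12h' spellings, which are assumptions chosen to work on both older and newer pandas; the data is made up.

```python
import pandas as pd

# Period.asfreq: pick the first or last sub-period inside a larger one
p = pd.Period('2017', freq='Y')            # the whole year 2017
first_month = p.asfreq('M', how='start')   # first month of the year
last_day = p.asfreq('D', how='end')        # last day of the year

# Series.asfreq: move the data onto a new regular grid, without aggregating
ts = pd.Series([1.0, 2.0], index=pd.date_range('2018-01-01', periods=2, freq='D'))
half_days = ts.asfreq('12h', method='ffill')   # new rows copy the previous value

print(first_month)         # 2017-01
print(last_day)            # 2017-12-31
print(half_days.tolist())  # [1.0, 1.0, 2.0]
```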

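pandas Period

p = pd.Period('2017', freq='M')
p + 1  # move forward by one period (here one month)

pd.date_range('2017', '2018', freq='M')
pd.period_range('2017', periods=10, freq='M')
# A date range holds points in time and a period range holds spans of time, so a period range is the coarser of the two

# to_period() and to_timestamp()
p = pd.date_range('2017', '2018', freq='M')
p1 = pd.period_range('2019', periods=6, freq='M')
print(p)
print(p.to_period())
print(p1.to_timestamp())
# to_period converts timestamps to periods and to_timestamp converts periods to timestamps; each conversion goes between the two types, not from a type to itself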
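A round trip makes the relationship between the two types concrete. This is a sketch with month-start timestamps (the 'MS' alias) so that converting back gives the same dates:

```python
import pandas as pd

stamps = pd.date_range('2017-01-01', periods=3, freq='MS')  # month-start timestamps
periods = stamps.to_period('M')                              # the months they fall in
back = periods.to_timestamp()                                # start of each month again

print([str(p) for p in periods])             # ['2017-01', '2017-02', '2017-03']
print(back.strftime('%Y-%m-%d').tolist())    # ['2017-01-01', '2017-02-01', '2017-03-01']
```

to_timestamp returns the start of each period by default, so a round trip from month-end timestamps would not give back the original dates.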

Slicing and indexing

Slicing and indexing time-based data works much like slicing and indexing a list.
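As a rough illustration, here is a sketch of slicing a time-indexed Series. Beyond list-style positional slices, pandas also accepts date strings as labels, including partial ones such as a year and month; the data and names below are made up.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2017-01-01', periods=60, freq='D')
ts = pd.Series(np.arange(60), index=rng)

first_three = ts.iloc[:3]                     # positional slice, as with a list
one_day = ts.loc['2017-01-15']                # a single value by date label
january = ts.loc['2017-01']                   # every row in January 2017
window = ts.loc['2017-01-30':'2017-02-02']    # label slice, both ends included

print(first_three.tolist())  # [0, 1, 2]
print(one_day)               # 14
print(len(january))          # 31
print(window.tolist())       # [29, 30, 31, 32]
```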

Resampling

rng = pd.date_range('20170101', periods=12)
ts = pd.Series(np.arange(12), index=rng)
re_ts = ts.resample('5D').sum()
print(ts.resample('5D').mean(), '→ mean\n')
print(ts.resample('5D').max(), '→ maximum\n')
print(ts.resample('5D').min(), '→ minimum\n')
print(ts.resample('5D').median(), '→ median\n')
print(ts.resample('5D').first(), '→ first value\n')
print(ts.resample('5D').last(), '→ last value\n')
print(ts.resample('5D').ohlc(), '→ OHLC resampling\n')
# OHLC: a time-series aggregation used in finance → open, high (maximum), low (minimum), close
# Resampling amounts to changing the frequency and then applying an aggregation function, much like groupby

# The closed and label parameters of resample

print(ts.resample('5D', closed='left').sum())   # bins by day of month: [1,2,3,4,5], [6,7,8,9,10], [11,12]
print(ts.resample('5D', closed='right').sum())
# closed='right' makes the right edge of each interval the inclusive end → [1], [2,3,4,5,6], [7,8,9,10,11], [12]
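The effect of closed is easiest to see on the sums. The sketch below uses the same series as above (values 0 to 11 on 1 to 12 January 2017), so each value is the day of the month minus one:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2017-01-01', periods=12, freq='D')
ts = pd.Series(np.arange(12), index=rng)

left = ts.resample('5D', closed='left').sum()     # bins include their left edge
right = ts.resample('5D', closed='right').sum()   # bins include their right edge

# label only changes which bin edge is used as the row label, not the values
right_labeled = ts.resample('5D', closed='right', label='right').sum()

print(left.tolist())    # [10, 35, 21]
print(right.tolist())   # [0, 15, 40, 11]
print(right_labeled.index)
```

With closed='right' the first bin holds only 1 January, which is why a fourth bin appears.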

Computing the growth rate
df = pd.DataFrame(np.arange(0, 16).reshape(4, 4), index=pd.date_range('2017/1/1', '2017/1/04'), columns=list('ABCD'))
ts = df.shift(-1)  # shift the rows up by one
print(df/ts)

per = df/df.shift(1) - 1   # growth relative to the previous row
print(per.dropna())
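The shift-based formula above matches pandas' built-in pct_change, which the original post does not mention. A sketch with made-up prices that avoids zeros, since dividing by 0 gives inf:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [100.0, 110.0, 121.0], 'B': [50.0, 25.0, 50.0]},
                  index=pd.date_range('2017-01-01', periods=3))

manual = df / df.shift(1) - 1   # (current row / previous row) - 1
builtin = df.pct_change()       # pandas' built-in equivalent

same = np.allclose(manual.dropna().to_numpy(), builtin.dropna().to_numpy())

print(manual.dropna())   # A grows 10% twice; B halves, then doubles
print(same)              # True
```

The first row is always NaN in both versions because it has no previous row to compare against.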
