pandas Study Summary

大樹2, published 2018-01-07

Author: csj
Updated: 2017.12.31

email:59888745@qq.com

Note: there is a lot of material, so the xxx study summaries will be updated over time.


1. Introduction to pandas
2. pandas data structures
    Series
    DataFrame
    Index
    Reading and writing CSV files
3. Common functions:
    Group by
    Aggregate
    concat
    merge
    join
etc.

-------------------------------------------------------------------------------------

1. Introduction to pandas

    pandas is a Python library built specifically for data analysis.
    It is built on numpy (operations on ndarray).
    Using it feels like doing Excel/SQL/R work in Python.
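2. pandas data structures
    2.1 Series:
        A Series is a one-dimensional data structure. By default its index runs from 0 to n-1, but we can also specify the index ourselves.
    The index can be thought of as the keys of a dict.
    s = pd.Series([1, 'a', 2, 'b', 2, 20])
    s2 = pd.Series([1, 'a', 2, 'b', 2, 20], index=[1, 2, 3, 4, 5, 6])  # one label per value
    s4 = pd.Series({'a': 1, 'b': 2}, name='price')
    print(s)
    s3 = pd.concat([s, s2])  # originally s.append(s2); Series.append was removed in pandas 2.0
    A Series behaves like a dict: the index defined above is what selects the data.
    (s1 below stands for any existing Series.)
    Elements of a Series can be assigned: s1[0] = 3
    Arithmetic: s1 + s2, s4 / 2, s4 ** 2; membership test on the index: r = 'a' in s4
    s4.median()  # median
    s4.mean()
    s4.max()
    s4.min(); s4[s4 > s4.mean()]
    Missing data:
    Use the isnull and notnull functions to test for missing values: s4.isnull(), s4.notnull()
    Fill the missing entries with the mean: s4[s4.isnull()] = s4.mean()
    Selecting data: s[1], s1[[2, 3, 1]], s1[1:], s1[:-1], s1.get('1'), s1[s1 < 2], s1[s1.index != 2]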
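
The Series operations above can be tried on a small, purely numeric example. This is a minimal sketch: the variable names and values are made up for illustration, and pd.concat stands in for the removed Series.append.

```python
import pandas as pd

# Default index 0..n-1; an explicit index needs one label per value.
s = pd.Series([1, 2, 3])
s2 = pd.Series([10.0, None, 30.0], index=["a", "b", "c"])

# Building from a dict uses the keys as the index.
price = pd.Series({"a": 1, "b": 2}, name="price")

# pd.concat replaces Series.append, which was removed in pandas 2.0.
combined = pd.concat([s, s2])

# Fill missing entries with the mean of the non-missing ones.
filled = s2.fillna(s2.mean())

# Boolean masks select by value; `in` tests index labels, not values.
above_15 = filled[filled > 15]
has_a = "a" in price
```

Note that `1 in price` is False here even though 1 is one of the values, because membership is checked against the index.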

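2.2 DataFrame:
    A DataFrame is a table: a Series is a one-dimensional array, and a DataFrame is a two-dimensional one.
    The names and order of the columns can be specified.
    A DataFrame can be built from several Series.
    A DataFrame can also be built from a list of dicts.
    data = {'t': [1, 2, 3], 'f': [4, 5, 6]}
    pd1 = pd.DataFrame(data)
    df = pd.DataFrame(data, columns=['t', 'f'], index=['one', 'two', 'three'])  # columns labels the columns, index labels the rows
    Assigning DataFrame elements:
    Assign a whole column: df["t"] = 400
    Assign a whole row: df.loc["one"] = 100  (df["one"] = 100 would create a new column named "one")
    Use loc (by label) or iloc (by position) to select a row:
    df.loc['one']; df.iloc[0]; pd1[1:2]
    Append df2 or a Series to pd1: pd.concat([pd1, df2]); pd.concat([pd1, series1.to_frame().T])  (DataFrame.append was removed in pandas 2.0)
    Add a column to pd1: pd1['newcol'] = 200
    df.loc["one", "t"] = 300  # set a single cell; avoids chained indexing
    df["colname"] = 'newcolumn'
    df.columns
    df.index
    df.info()
    df.index.name = "city"
    df.columns.name = "info"
    Use isin to test whether a value is one of the listed values: df["t"].isin([30, 200])
    df.where(df["t"] > 10)
    Fill each NaN with the value from the previous record: df.ffill()  (fillna(method="ffill") is deprecated in recent pandas)
    df.sub(row, axis=1)  # difference between every row and row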
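
To make the row and column operations concrete, here is a small sketch on an invented 3x2 table. The column names "a", "b", "c" and all values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

data = {"a": [1, 2, 3], "b": [4.0, np.nan, 6.0]}
df = pd.DataFrame(data, columns=["a", "b"], index=["one", "two", "three"])

row_by_label = df.loc["one"]     # select a row by index label
row_by_position = df.iloc[0]     # select the same row by position

df.loc["one", "a"] = 300         # set one cell without chained indexing
df["c"] = 200                    # add a column filled with one value

filled = df.ffill()              # NaN in "b" takes the value from the previous row
diff = filled.sub(filled.loc["one"], axis=1)  # every row minus row "one"
```

With axis=1 the Series passed to sub is matched against the column labels, so each row has row "one" subtracted from it.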
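2.3 Index:
    Declare the index when creating a Series
    Index and slice by index label
    Indexing a DataFrame works much like indexing a Series: df.loc[row, column]
2.4 reindex:
    Rearranges a Series or DataFrame to follow a new index order;
    Use drop to remove index labels from a Series or DataFrame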
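
A short sketch of reindex and drop, using made-up labels and values:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# reindex reorders by label; a label that did not exist ("d") becomes NaN.
reordered = s.reindex(["c", "a", "d"])

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["one", "two"])

without_row = df.drop("one")         # axis=0 is the default: drop a row label
without_col = df.drop("y", axis=1)   # axis=1 drops a column
```

Both methods return new objects and leave s and df unchanged.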

    2.5 Reading and writing CSV files:
        read_csv
        to_csv
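
A minimal round trip through to_csv and read_csv. To keep the sketch self-contained it writes to a string and reads from an in-memory buffer instead of a file on disk; the column names are invented.

```python
import io
import pandas as pd

df = pd.DataFrame({"city": ["bj", "sz"], "price": [100, 200]})

# With no path, to_csv returns the CSV text; index=False omits the row labels.
csv_text = df.to_csv(index=False)

# read_csv accepts any file-like object, so StringIO stands in for a file.
restored = pd.read_csv(io.StringIO(csv_text))
```

With a real file the calls would be df.to_csv(path, index=False) and pd.read_csv(path).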
3. Common functions:
        Group by
        Aggregate
        concat
        merge
        join
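
The five functions listed above can be sketched on two tiny tables. The tables, column names, and values here are assumptions made up for the example.

```python
import pandas as pd

sales = pd.DataFrame({
    "city": ["bj", "bj", "sz", "sz"],
    "amount": [100, 150, 200, 50],
})
regions = pd.DataFrame({"city": ["bj", "sz"], "region": ["north", "south"]})

# Group by: split the rows by city, then reduce each group.
totals = sales.groupby("city")["amount"].sum()

# Aggregate: compute several statistics per group at once.
stats = sales.groupby("city")["amount"].agg(["sum", "mean"])

# concat: stack frames row-wise; ignore_index renumbers the rows.
more = pd.DataFrame({"city": ["gz"], "amount": [75]})
stacked = pd.concat([sales, more], ignore_index=True)

# merge: SQL-style join on a column.
merged = sales.merge(regions, on="city", how="left")

# join: align two frames on their index.
joined = stats.join(regions.set_index("city"))
```

merge matches on column values, while join matches on the index, which is why regions is re-indexed by city before joining.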

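demo: data cleaning

Problems found in the data

No column headers
One column holds several values
Inconsistent units within a column
Missing values
Empty rows
Duplicate records
Non-ASCII characters
Some column headers are really data and should not be column names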

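1. Data cleaning. Inspecting the data:
• Basic statistics for one column: data.columnname.describe()
• Select one column: data['columnname']
• Select the first n rows of one column: data['columnname'][:n]
• Select several columns: data[['column1','column2']]
• Filter with a where-style condition: data[data['columnname'] > condition]

Handling missing data:
• Give missing values a default: data.country.fillna('')  # or 0, or the mean
• Drop rows with missing data: data.dropna()
  data.dropna(how='all') drops rows in which every value is NA
  data.dropna(thresh=5) keeps only rows with at least 5 non-null values
• Drop columns with a high share of missing values

• Add a default value: data.country = data.country.fillna('')

• Drop incomplete rows: data.dropna(axis=0, how='any')

• Drop incomplete columns. Drop columns that are entirely NA: data.dropna(axis=1, how='all', inplace=True). The row examples use axis=0 because that is the default when no axis argument is passed. Drop any column that contains a null: data.dropna(axis=1, how='any', inplace=True)

• Normalize data types: data = pd.read_csv('../data/moive_metadata.csv', dtype={'duration': int})

• Necessary conversions:
  typos
  inconsistent capitalization of English words: data['movie_title'].str.upper()
  extra whitespace: data['movie_title'].str.strip()

• Rename columns: data = data.rename(columns={'title_year': 'release_date', 'movie_facebook_likes': 'facebook_likes'})

• Save the result: data.to_csv('cleanfile.csv', encoding='utf-8')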
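
The dropna variants in the checklist differ only in which rows or columns survive. A minimal sketch on an invented four-row table (the column names and values are assumptions):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "title": ["A", "B", None, None],
    "country": ["US", None, None, None],
    "duration": [90.0, 120.0, np.nan, 100.0],
    "notes": [np.nan] * 4,
})

any_missing = data.dropna()                     # drop rows containing any NaN
all_missing = data.dropna(how="all")            # drop rows where every value is NaN
at_least_two = data.dropna(thresh=2)            # keep rows with >= 2 non-null values
no_empty_cols = data.dropna(axis=1, how="all")  # drop columns that are entirely NaN
```

Because the "notes" column is empty everywhere, the plain dropna() removes every row, which is why checking for all-NaN columns first is often worthwhile.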

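demo: df.head()

. No column headers
# add column headers
column_names = ['id', 'name', 'age', 'weight','m0006','m0612','m1218','f0006','f0612','f1218']
df = pd.read_csv('../data/patient_heart_rate.csv', names = column_names)

. One column holds several values; split it into multiple columns
df[['first_name','last_name']] = df['name'].str.split(expand=True)
df.drop('name', axis=1, inplace=True)

. Inconsistent units within a column
rows_with_lbs = df['weight'].str.contains('lbs').fillna(False)
df[rows_with_lbs]
for i,lbs_row in df[rows_with_lbs].iterrows():
    weight = int(float(lbs_row['weight'][:-3])/2.2)
    df.at[i,'weight'] = '{}kgs'.format(weight)

. Missing values: data.country.fillna('')
• Delete: drop the records with missing data (see the article "Data Cleaning: Cleaning 'Dirty' Data with Pandas (Part 1)")
• Placeholder: substitute a legal initial value, such as 0 for numeric types and the empty string "" for strings
• Mean: use the mean of the column
• Mode: use the most frequent value in the column
• Fix at the source: if you can talk to the team that collects the data, investigate the problem together and look for a solution.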
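
The placeholder, mean, and mode strategies can be sketched as follows. The table is invented, and which strategy is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "country": ["US", None, "US"],
})

# Placeholder: a fixed legal default per column.
placeholder = df.fillna({"age": 0, "country": ""})

# Mean: the average of the non-missing values in the column.
mean_filled = df["age"].fillna(df["age"].mean())

# Mode: the most frequent value in the column.
mode_filled = df["country"].fillna(df["country"].mode()[0])
```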

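. Empty rows
# drop rows that are entirely empty
df.dropna(how='all',inplace=True)
df.dropna(how='any',inplace=True)  # stricter: drop rows with any missing value

. Duplicate records
# drop duplicate rows
df.drop_duplicates(['first_name','last_name'],inplace=True)

. Non-ASCII characters
There are several ways to handle non-ASCII data:
• delete
• replace
• only warn

Here we delete:

# remove non-ASCII characters
df['first_name'] = df['first_name'].replace({r'[^\x00-\x7F]+':''}, regex=True)
df['last_name'] = df['last_name'].replace({r'[^\x00-\x7F]+':''}, regex=True)

. Some column headers are really data and should not be column names; merge those columns into one
# drop rows with no heart rate
row_with_dashes = df['puls_rate'].str.contains('-').fillna(False)
df.drop(df[row_with_dashes].index, inplace=True)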

import pandas as pd
# add column headers
column_names= ['id', 'name', 'age', 'weight','m0006','m0612','m1218','f0006','f0612','f1218']
df = pd.read_csv('../data/patient_heart_rate.csv', names = column_names)

# split the name, then drop the source column
df[['first_name','last_name']] = df['name'].str.split(expand=True)
df.drop('name', axis=1, inplace=True)

# select the rows whose weight is given in lbs
rows_with_lbs = df['weight'].str.contains('lbs').fillna(False)
df[rows_with_lbs]

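# generic selection syntax ('columnsname' is a placeholder, so these lines stay commented out)
# select rows by position
# df[1:5]

# select a column
# df['columnsname']

# select rows of a column
# df['columnsname'][1:3]

# convert the lbs values to kgs
for i,lbs_row in df[rows_with_lbs].iterrows():
    weight = int(float(lbs_row['weight'][:-3])/2.2)
    df.at[i,'weight'] = '{}kgs'.format(weight)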

# drop rows that are entirely empty
df.dropna(how='all',inplace=True)

# drop duplicate rows
df.drop_duplicates(['first_name','last_name'],inplace=True)

# remove non-ASCII characters
df['first_name'] = df['first_name'].replace({r'[^\x00-\x7F]+':''}, regex=True)
df['last_name'] = df['last_name'].replace({r'[^\x00-\x7F]+':''}, regex=True)

# melt the sex/hour columns into rows, then split sex_hour into a sex column and an hour column
sorted_columns = ['id','age','weight','first_name','last_name']
df = pd.melt(df, id_vars=sorted_columns, var_name='sex_hour', value_name='puls_rate').sort_values(sorted_columns)
df[['sex','hour']] = df['sex_hour'].apply(lambda x:pd.Series(([x[:1],'{}-{}'.format(x[1:3],x[3:])])))[[0,1]]
df.drop('sex_hour', axis=1, inplace=True)

# drop rows with no heart rate
row_with_dashes = df['puls_rate'].str.contains('-').fillna(False)
df.drop(df[row_with_dashes].index, inplace=True)

# reset the index; optional, it just makes the output tidier
df = df.reset_index(drop=True)
print(df)
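
The melt step is the least obvious part of the script, so here it is in isolation on a two-row table. The data is invented, and the vectorised .str slicing is an alternative to the apply/lambda used above, not the original author's method.

```python
import pandas as pd

wide = pd.DataFrame({
    "id": [1, 2],
    "m0006": ["72", "-"],
    "f0006": ["-", "85"],
})

# One row per (id, column) pair; the old column names land in "sex_hour".
tidy = pd.melt(wide, id_vars=["id"], var_name="sex_hour", value_name="puls_rate")

# Slice the encoded name into its two parts with vectorised string methods.
tidy["sex"] = tidy["sex_hour"].str[:1]
tidy["hour"] = tidy["sex_hour"].str[1:3] + "-" + tidy["sex_hour"].str[3:]
tidy = tidy.drop(columns="sex_hour")

# Keep only the rows that carry a real reading.
tidy = tidy[tidy["puls_rate"] != "-"].reset_index(drop=True)
```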

------ added 2018-01-10 by 59888745@qq.com ------

# Demo 1  of pandas 
# bike project
# stocks project
# credit project

import pandas as pd
import numpy as np
bikes = pd.read_csv('data/bikes.csv', sep=';', parse_dates=['Date'], encoding='latin1', dayfirst=True, index_col="Date")
bikes.head()
#bikes.dropna()
#bikes.dropna(how='all').head()
df0= bikes.dropna(axis = 1,how='all').head()
bikes.shape

# DataFrame.apply(func, axis=0, ...) applies func to each column (axis=0, the default) or to each row (axis=1).
# Count how many values are missing in each column:
#bikes.apply(lambda x:sum(x.isnull()))
bikes.isnull().sum()

# How to fill in missing data
# row = bikes.iloc[0].copy()
# bikes.fillna(row)

# m holds, for each row, the mean across all columns
m = bikes.mean(axis=1)
# use m to fill the NaN values in bikes, one column at a time
# bikes.iloc[:, i] is column i of bikes, all rows
for i, col in enumerate(bikes):
    bikes.iloc[:, i] = bikes.iloc[:, i].fillna(m)
    
bikes.head()

# Total the Berri 1 ridership for each day of the week, Monday to Sunday, then plot it.
# Double brackets keep a one-column DataFrame, so a weekday column can be added below.
berri_bikes = bikes[['Berri 1']].copy()
berri_bikes.head()
# berri_bikes.index
# berri_bikes["weekday"] = berri_bikes.index.weekday
# weekday_counts = berri_bikes.groupby('weekday').aggregate("sum")
# weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# weekday_counts
# weekday_counts.plot(kind="bar")

# to_frame converts a Series into a DataFrame
# bikes.sum(axis=1) adds up all the routes for each day
# bikes.sum(axis=1).to_frame().head()
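
Since data/bikes.csv is not included here, the weekday totals can be sketched on synthetic counts instead. The dates and values below are made up; only the column name "Berri 1" comes from the demo.

```python
import pandas as pd

# Hypothetical counts for 14 consecutive days; 2012-01-02 is a Monday.
dates = pd.date_range("2012-01-02", periods=14, freq="D")
berri = pd.Series(range(14), index=dates, name="Berri 1").to_frame()

# DatetimeIndex.weekday numbers the days Monday=0 ... Sunday=6.
berri["weekday"] = berri.index.weekday

# Sum the counts for each day of the week.
weekday_counts = berri.groupby("weekday")["Berri 1"].sum()
weekday_counts.index = ["Monday", "Tuesday", "Wednesday", "Thursday",
                        "Friday", "Saturday", "Sunday"]
```

With matplotlib installed, weekday_counts.plot(kind="bar") draws the chart the demo mentions.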

#Demo 2 

    # goog = pd.read_csv("data/GOOG.csv", index_col=0)
    # goog.index = pd.to_datetime(goog.index)
    # goog["Adj Close"].plot(grid = True)

    # goog.shift(1).head() # shift moves the values down one row, so each date carries the previous row's data

    # aapl = pd.read_csv("data/AAPL.csv", index_col=0)
    # aapl.index = pd.to_datetime(aapl.index)
    # #aapl.head()
    # aapl["Adj Close"].plot(grid=True)


#Demo 3
    # df = pd.read_csv("data/credit-data.csv")
    # df.head()
    # for i, val in enumerate(df):
    #     print(val)
    #     print(df[val].value_counts())

    # # fill the missing incomes first, so that every row falls into a bin
    # df["monthly_income"] = df["monthly_income"].fillna(df["monthly_income"].mean())

    # df['income_bins'] = pd.cut(df.monthly_income, bins=15, labels=False)
    # df.income_bins.value_counts()

    # df["income_bins"] = df["income_bins"].astype("int")
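
The binning step of Demo 3 can be sketched on four invented incomes. The values and the choice of 4 bins are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"monthly_income": [1000.0, 2000.0, np.nan, 9000.0]})

# Fill first: a NaN income would otherwise produce a NaN bin.
df["monthly_income"] = df["monthly_income"].fillna(df["monthly_income"].mean())

# bins=4 cuts the range 1000..9000 into four equal-width intervals;
# labels=False returns the integer bin number instead of the interval.
df["income_bins"] = pd.cut(df["monthly_income"], bins=4, labels=False)

counts = df["income_bins"].value_counts()
```

The missing income is filled with the mean (4000), which lands in bin 1, while the two lowest incomes share bin 0.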

    
#pandas content list 

# Series
    # # construct and initialize a Series
    # s = pd.Series({"bj":100,"sz":200})
    # s['bj']
    # s2=pd.Series([1,2,3,4])
    # s2[0]
    # s3=pd.Series(("1",2,3),index=[1,2,3])
    # s4 = pd.concat([s3, s2], ignore_index=True)  # originally s3.append(s2, ignore_index=True); append was removed in pandas 2.0
    # s4
    # s5 =s4.drop(1)
    # s5
    # s6 =pd.Series(np.random.randn(5),index=['a','b','c','d','e'])
    # s6
    # s7 = pd.Series(np.random.randn(5),index=range(6,11))
    # s7

    # # select data from a Series
    # # using subscript syntax
    # s7[6]
    # s7[[6,8]]
    # s7[:-1]
    # s7[1:-1]
    # # addition aligns on the index: matching labels are added, unmatched ones become NaN
    # s8 =s7[1:] + s7[:-1]
    # # data can be looked up with subscript syntax or dict-style syntax
    # s['bj']
    # s[['bj','sz']]

    # t = 'bj' in s
    # t
    # t2 =6 in s8
    # t2
    # t3 =s8.get(7)
    # t3
    # t4 = s8[s8 <8]
    # t4

    # s8[s8.index !=9]

# Series: add, delete, update, query

tm=pd.Series({'hefei':1000})
tm2=pd.Series({'hefei2':2000})
tm3 = pd.concat([tm, tm2])  # originally tm.append(tm2); Series.append was removed in pandas 2.0
tm4 =tm3.drop('hefei')
'hefei2' in tm4
tm4.get('hefei2')
tm3['hefei2']=3000
tm3
tm3 / 2
tm3 **2
np.log(tm3)

tm5 =tm3+tm2*2
tm5
tm5.isnull()
tm5[tm5.isnull()] = tm5.mean()  # fill the missing entry with the mean
tm5
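
The NaN in tm5 comes from index alignment: "hefei" exists in tm3 but not in tm2. A small sketch of that behaviour, with Series.add and fill_value shown as one possible alternative to filling afterwards:

```python
import pandas as pd

tm3 = pd.Series({"hefei": 1000, "hefei2": 3000})
tm2 = pd.Series({"hefei2": 2000})

# "+" aligns on labels; "hefei" has no partner in tm2, so the result is NaN there.
tm5 = tm3 + tm2 * 2

# add(..., fill_value=0) treats a missing partner as 0 instead.
aligned = tm3.add(tm2 * 2, fill_value=0)
```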


# DataFrame
    # data=['beijin',1000,'shenzhen',2000]
    # df=pd.DataFrame(data,columns=['city'],index=['a','b','c','d'])
    # df.loc['b']
    # data2 = [{"a": 999999, "b": 50000, "c": 1000}, {"a": 99999, "b": 8000, "c": 200}]
    # data3 = [ "a",999999, "b", 50000, "c", 1000]
    # df2= pd.DataFrame(data2,index=['one','two'],columns=['a','b','c'])
    # #df2= pd.DataFrame(data3,index=['one','two','three','four','five','six'],columns=['a'])
    # df3 =df2.drop('c',axis=1)
    # df3
    # df4 =df2.drop('one',axis=0)
    # df4
    # # how to change the order of the columns?
    # df5= pd.DataFrame(data2,index=['one','two'],columns=['a','c','b'])
    # df5
    # df.dropna() drops every row that contains an NA
    # df.dropna(how='all').head() drops rows in which all values are NA
    # df.dropna(how='any').head() drops rows in which any value is NA
    
    # For several Series to combine cleanly into one DataFrame, they should share the same row index.
    # s_product=pd.Series([10.0,12.0,14.0],index=['apple','bear','banana'])
    # s_unit=pd.Series(['kg','kg','kg'],index=['apple','bear','banana'])
    # s_num=pd.Series([100,200,300],index=['apple','bear','banana'])
    # #df = pd.DataFrame({'prices':s_product,'unit':s_unit,'num':s_num},columns=['prices','unit','num'],index=['one','two','three'])
    # df = pd.DataFrame({'prices':s_product,'unit':s_unit,'num':s_num},columns=['num','prices','unit'])
    # df
    # df2 = pd.DataFrame(s_product,columns=['prices'])
    # df2['num']=100
    # s=['a','b','c']
    # df_s=pd.DataFrame(s)
    # # With ignore_index=True a fresh index is built, which avoids clashes; with False, duplicate index labels can remain.
    # df3 = pd.concat([df2, df_s], ignore_index=True)  # originally df2.append(df_s, ignore_index=True); append was removed in pandas 2.0
    # df3 
    # df3[['prices','num']].loc[0:1]
