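Python Data Processing: 12 Useful Pandas Techniques

Posted by a58797 on 2016-07-27

Introduction

Python is fast becoming the preferred language of data scientists, and for good reason: it offers a broad programming ecosystem together with deep, well-performing scientific computing libraries. If you are new to Python, start with the Python learning path.

Among Python's scientific computing libraries, Pandas is the one best suited to data science work. Together with Scikit-learn, it provides almost everything a data scientist needs. This article covers 12 ways of manipulating data in Python. Tips and tricks shared in earlier articles should also help you work more efficiently.

Before starting, it is worth looking through some data exploration code first. To make the article easier to follow, we use a single dataset to demonstrate every operation.

Dataset: this article uses the dataset from the Loan Prediction problem. Download it to follow along.

Let's Get Started

First, import the modules and load the dataset into the Python environment: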

import pandas as pd
import numpy as np
data = pd.read_csv("train.csv", index_col="Loan_ID")

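#1 – Boolean Indexing

What do you do when you need to filter the values of one column based on conditions on other columns? For example, suppose we want a list of all women who are not graduates and who got a loan. Boolean indexing handles this: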

data.loc[(data["Gender"]=="Female") & (data["Education"]=="Not Graduate") & (data["Loan_Status"]=="Y"), 
["Gender","Education","Loan_Status"]]
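To see the mechanics without the loan file, here is a minimal sketch on a few made-up rows. The `toy` frame, its index labels, and the `mask` and `selected` names are assumptions for illustration only; the column names follow the dataset used above.

```python
import pandas as pd

# Hypothetical toy rows that mimic three columns of the loan dataset.
toy = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Female", "Female"],
        "Education": ["Not Graduate", "Not Graduate", "Graduate", "Not Graduate"],
        "Loan_Status": ["Y", "Y", "Y", "N"],
    },
    index=["LP001", "LP002", "LP003", "LP004"],
)

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than ==.
mask = (
    (toy["Gender"] == "Female")
    & (toy["Education"] == "Not Graduate")
    & (toy["Loan_Status"] == "Y")
)

# .loc takes the row mask first and the list of columns second.
selected = toy.loc[mask, ["Gender", "Education", "Loan_Status"]]
print(selected)
```

Only the row that satisfies all three conditions survives the filter.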

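Read more: Selecting and Indexing with Pandas

#2 – Apply Function

Apply is one of the most commonly used functions for manipulating data and creating new variables. It passes each row or column of a DataFrame to a given function and returns the results. The function can be a built-in or a user-defined one. In the example below, it is used to count the missing values in each column and each row.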

#Create a new function:
def num_missing(x):
  return sum(x.isnull())

#Applying per column:
print("Missing values per column:")
print(data.apply(num_missing, axis=0)) #axis=0 defines that function is to be applied on each column

#Applying per row:
print("\nMissing values per row:")
print(data.apply(num_missing, axis=1).head()) #axis=1 defines that function is to be applied on each row

This gives us the desired result.

Note: the head() function is used in the second output because it contains many rows. With no argument, head() returns the first 5 rows.
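The difference between axis=0 and axis=1 is easier to check on a frame small enough to count by hand. The sketch below uses a made-up `toy` frame (an assumption, not the loan data) and also shows that the per-column count can be obtained without apply.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with a few missing entries.
toy = pd.DataFrame(
    {
        "Gender": ["Male", None, "Female"],
        "LoanAmount": [120.0, np.nan, np.nan],
        "Loan_Status": ["Y", "N", "Y"],
    }
)

def num_missing(x):
    # x is one column (axis=0) or one row (axis=1), passed in as a Series.
    return x.isnull().sum()

per_column = toy.apply(num_missing, axis=0)  # one count per column
per_row = toy.apply(num_missing, axis=1)     # one count per row

# The per-column counts can also be computed without apply.
vectorized = toy.isnull().sum()

print(per_column)
print(per_row)
```

On large frames the vectorized form is usually the faster of the two, while apply stays useful when the per-row or per-column logic is more involved.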

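Read more: Pandas Reference (apply)

#3 – Imputing Missing Values

fillna() does this in one go. It replaces the missing values in a column with a value you supply, typically the column's mean, mode, or median. Below, the missing values in the "Gender", "Married", and "Self_Employed" columns are filled with each column's mode.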

#First we import a function to determine the mode
from scipy.stats import mode
mode(data['Gender'])

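Output: ModeResult(mode=array(['Male'], dtype=object), count=array([489]))

The result contains both the mode and its count. Remember that the mode can be an array, since several values can share the highest frequency. By default we take the first one: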

mode(data['Gender']).mode[0]

Now we can fill the missing values and check the result using technique #2.

#Impute the values (assign back instead of calling fillna in place on a column selection):
data['Gender'] = data['Gender'].fillna(mode(data['Gender']).mode[0])
data['Married'] = data['Married'].fillna(mode(data['Married']).mode[0])
data['Self_Employed'] = data['Self_Employed'].fillna(mode(data['Self_Employed']).mode[0])

#Now check the #missing values again to confirm:
print(data.apply(num_missing, axis=0))

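This confirms that the missing values have been filled. Note that this is the most basic form of imputation. More sophisticated approaches, such as modeling the missing values or filling with group-wise averages (mean, mode, or median), will be covered in upcoming articles.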
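Newer SciPy releases may reject non-numeric input to scipy.stats.mode, so it can help to know a pandas-only route. The sketch below swaps in Series.mode() for the SciPy call; it is an alternative, not the method used above, and the `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
toy = pd.DataFrame({"Gender": ["Male", "Female", np.nan, "Male", np.nan]})

# Series.mode() skips NaN by default and returns a Series, because
# several values can tie for the highest count; iloc[0] takes the first.
most_common = toy["Gender"].mode().iloc[0]

# Assigning the result back avoids modifying a column selection in place.
toy["Gender"] = toy["Gender"].fillna(most_common)

print(toy["Gender"].tolist())
```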

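Read more: Pandas Reference (fillna)

#4 – Pivot Table

Pandas can create MS Excel style pivot tables. In this example, the key column "LoanAmount" has missing values. We can fill them with the mean amount of each group defined by "Gender", "Married", and "Self_Employed". The group means of "LoanAmount" are computed as follows: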

#Determine pivot table
impute_grps = data.pivot_table(values=["LoanAmount"], index=["Gender","Married","Self_Employed"], aggfunc="mean")
print(impute_grps)

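Read more: Pandas Reference (pivot table)

#5 – MultiIndex

If you look closely at the output of #4, it has an unusual property: each index entry is a combination of three values. This is called a MultiIndex, and it helps perform operations quickly.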
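A small sketch makes the three-level index concrete. The `toy` frame and the names `group_means` and `male_married` are assumptions for illustration; the pivot call mirrors the one used for the loan data.

```python
import pandas as pd

# Hypothetical rows: two groups share all three keys, two are on their own.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Female", "Female"],
        "Married": ["Yes", "Yes", "No", "No"],
        "Self_Employed": ["No", "No", "No", "Yes"],
        "LoanAmount": [100.0, 140.0, 90.0, 200.0],
    }
)

group_means = toy.pivot_table(
    values=["LoanAmount"],
    index=["Gender", "Married", "Self_Employed"],
    aggfunc="mean",
)

# The row index has three levels, one per grouping column.
print(group_means.index.nlevels)

# A tuple with one value per level selects a single group.
male_married = group_means.loc[("Male", "Yes", "No"), "LoanAmount"]
print(male_married)
```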

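Continuing the example from #4, we have the value for each group but have not yet imputed anything. The imputation combines several of the techniques learned so far.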

#iterate only through rows with missing LoanAmount
for i,row in data.loc[data['LoanAmount'].isnull(),:].iterrows():
  ind = tuple([row['Gender'],row['Married'],row['Self_Employed']])
  data.loc[i,'LoanAmount'] = impute_grps.loc[ind].values[0]

#Now check the #missing values again to confirm:
print(data.apply(num_missing, axis=0))

Note:

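1. A multi-index requires a tuple to define the group of index values in the loc statement. This is the tuple built inside the loop.

2. The .values[0] suffix is needed because by default a Series is returned whose index does not match that of the DataFrame. Direct assignment would raise an error in that case.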
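The row-by-row loop can also be written without iteration. The sketch below uses groupby with transform, which is a different technique from the pivot-table lookup above; the `toy` frame is made up, and the approach assumes the grouping columns themselves contain no missing values (as is the case after #3).

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one missing LoanAmount in each of two groups.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Male", "Female", "Female"],
        "Married": ["Yes", "Yes", "Yes", "No", "No"],
        "Self_Employed": ["No", "No", "No", "No", "No"],
        "LoanAmount": [100.0, 140.0, np.nan, 90.0, np.nan],
    }
)

keys = ["Gender", "Married", "Self_Employed"]

# transform("mean") returns one value per original row (the mean of that
# row's group), aligned to toy's index, so fillna can use it directly.
group_mean = toy.groupby(keys)["LoanAmount"].transform("mean")
toy["LoanAmount"] = toy["LoanAmount"].fillna(group_mean)

print(toy["LoanAmount"].tolist())
```

Because the group means come back aligned to the original index, no tuple lookup or .values[0] is needed in this form.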

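#6 – Crosstab

This function gives an initial feel (view) of the data and lets us validate some basic hypotheses. In this case, "Credit_History" is expected to affect loan status significantly. The cross-tabulation below tests this: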

pd.crosstab(data["Credit_History"],data["Loan_Status"],margins=True)

These are absolute numbers, but percentages give quicker insight. The apply function can do the conversion:

def percConvert(ser):
  #Divide every entry of the row by its last entry (the "All" margin)
  return ser/float(ser.iloc[-1])

pd.crosstab(data["Credit_History"],data["Loan_Status"],margins=True).apply(percConvert, axis=1)

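It is now evident that applicants with a credit history have a much higher chance of getting a loan: about 80% of them were approved, compared with only about 9% of those without one.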
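Row percentages can also be requested from crosstab directly through its normalize argument. The sketch below is an alternative to the apply-based conversion, shown on a made-up `toy` frame whose proportions are not those of the loan data.

```python
import pandas as pd

# Hypothetical applicants: four with a credit history, two without.
toy = pd.DataFrame(
    {
        "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
        "Loan_Status": ["Y", "Y", "Y", "N", "N", "N"],
    }
)

# normalize="index" divides every row by its own total,
# so each row of the result sums to 1.
shares = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")
print(shares)
```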

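But that is not all; the table tells a more interesting story. Since we know a credit history is so important, what if we predict loan status from it alone: Y for applicants with a credit history and N otherwise? Surprisingly, we would be right 82+378=460 times out of 614, a whopping 75%!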
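The arithmetic behind that figure is short enough to check directly. The variable names below are assumptions; the three numbers are the ones quoted in the text.

```python
# The two correct-prediction counts and the total quoted in the text.
correct_counts = [82, 378]
total = 614

accuracy = sum(correct_counts) / total
print(f"{accuracy:.1%}")  # prints 74.9%
```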

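I won't blame you if you are wondering why we need statistical models at all. But trust me, increasing the accuracy by even 0.001% beyond this mark is a challenging task. Would you take up the challenge?

Note: the 75% figure is on the training set. The test set will differ slightly, but the result should be close. I also hope this gives some intuition for why even a 0.05% increase in accuracy can mean a jump of 500 ranks on the Kaggle leaderboard.

Read more: Pandas Reference (crosstab)

#7 – Merging DataFrames

Merging DataFrames becomes essential when information from different sources has to be brought together. Consider a hypothetical case in which the average property rate (INR per square meter) differs by property type. Define a DataFrame as follows: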

prop_rates = pd.DataFrame([1000, 5000, 12000], index=['Rural','Semiurban','Urban'],columns=['rates'])
prop_rates

Now we can merge this information with the original DataFrame:

data_merged = data.merge(right=prop_rates, how='inner',left_on='Property_Area',right_index=True, sort=False)
data_merged.pivot_table(values='Credit_History',index=['Property_Area','rates'], aggfunc=len)

The pivot table confirms that the merge succeeded. Note that the 'values' argument is irrelevant here because we are simply counting rows.
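The same merge can be tried end to end on a few rows. In the sketch below the `loans` frame and its index labels are made up; `prop_rates` is the frame defined above, and the row count per group is taken with groupby().size() rather than a pivot table.

```python
import pandas as pd

# Hypothetical stand-in for the loan data: only the key column is needed.
loans = pd.DataFrame(
    {"Property_Area": ["Urban", "Rural", "Urban", "Semiurban"]},
    index=["LP001", "LP002", "LP003", "LP004"],
)
prop_rates = pd.DataFrame(
    [1000, 5000, 12000],
    index=["Rural", "Semiurban", "Urban"],
    columns=["rates"],
)

# left_on names the key column in loans; right_index=True says the key
# on the right side is prop_rates' index rather than a column.
merged = loans.merge(
    prop_rates, how="inner", left_on="Property_Area", right_index=True, sort=False
)

# Counting rows per (area, rate) pair confirms each area got exactly one rate.
counts = merged.groupby(["Property_Area", "rates"]).size()
print(counts)
```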

Read more: Pandas Reference (merge)

#8 – Sorting DataFrames

Pandas makes it easy to sort on multiple columns:

data_sorted = data.sort_values(['ApplicantIncome','CoapplicantIncome'], ascending=False)
data_sorted[['ApplicantIncome','CoapplicantIncome']].head(10)

Note: the Pandas "sort" function is no longer in use; use "sort_values" instead.
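How the second column breaks ties is easiest to see on a handful of rows. The `toy` frame below is made up for illustration; it also shows that ascending accepts one flag per column.

```python
import pandas as pd

# Hypothetical incomes with ties in the first column.
toy = pd.DataFrame(
    {
        "ApplicantIncome": [5000, 8000, 5000, 8000],
        "CoapplicantIncome": [0.0, 1500.0, 2000.0, 500.0],
    }
)

# Rows are ordered by the first column; the second column only breaks ties.
both_desc = toy.sort_values(["ApplicantIncome", "CoapplicantIncome"], ascending=False)

# ascending also accepts one flag per column.
mixed = toy.sort_values(["ApplicantIncome", "CoapplicantIncome"], ascending=[False, True])

print(both_desc)
print(mixed)
```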

Read more: Pandas Reference (sort_values)

#9 – Plotting (Boxplot and Histogram)

Many people may not realize that boxplots and histograms can be drawn directly from Pandas, without calling matplotlib separately. It takes a single command. For example, to compare the distribution of ApplicantIncome by Loan_Status:

import matplotlib.pyplot as plt
%matplotlib inline
data.boxplot(column="ApplicantIncome",by="Loan_Status")

data.hist(column="ApplicantIncome",by="Loan_Status",bins=30)

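These plots show that income is not a major deciding factor on its own, since there is no appreciable difference in income between the people who received a loan and those who were denied.

Read more: Pandas Reference (hist) | (boxplot)

#10 – Binning with the Cut Function

Sometimes numerical values make more sense when clustered together. For example, suppose we are modeling traffic (the number of cars on the road) by time of day in minutes. The exact minute matters less for predicting traffic than the period of the day, such as "Morning", "Afternoon", "Evening", "Night", or "Late Night". Modeling traffic this way is more intuitive and helps avoid overfitting.

The example below defines a simple reusable function that makes it easy to bin any variable.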

#Binning:
def binning(col, cut_points, labels=None):
  #Define min and max values:
  minval = col.min()
  maxval = col.max()

  #create list by adding min and max to cut_points
  break_points = [minval] + cut_points + [maxval]

  #if no labels provided, use default labels 0 ... (n-1)
  if not labels:
    labels = range(len(cut_points)+1)

  #Binning using cut function of pandas
  colBin = pd.cut(col,bins=break_points,labels=labels,include_lowest=True)
  return colBin

#Binning LoanAmount:
cut_points = [90,140,190]
labels = ["low","medium","high","very high"]
data["LoanAmount_Bin"] = binning(data["LoanAmount"], cut_points, labels)
print(data["LoanAmount_Bin"].value_counts(sort=False))
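Which bin a boundary value lands in is a common source of confusion, so here is a minimal sketch that calls pd.cut directly. The `amounts` values are made up; the cut points and labels are the ones used above.

```python
import pandas as pd

# Hypothetical loan amounts, including values that sit exactly on a cut point.
amounts = pd.Series([50, 90, 120, 140, 170, 250])

# Bin edges: the observed minimum, the three cut points, the observed maximum.
edges = [amounts.min(), 90, 140, 190, amounts.max()]
labels = ["low", "medium", "high", "very high"]

# Intervals are right-closed by default, so 90 falls in "low" and 140 in
# "medium"; include_lowest=True keeps the minimum itself inside the first bin.
binned = pd.cut(amounts, bins=edges, labels=labels, include_lowest=True)
counts = binned.value_counts(sort=False)

print(binned.tolist())
print(counts)
```

Note that this construction requires every cut point to lie strictly between the minimum and the maximum, since pd.cut expects increasing bin edges.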

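#11 – Coding Nominal Data

We often need to re-code the categories of a nominal variable. Possible reasons include:

1. Some algorithms (such as logistic regression) require all inputs to be numeric, so nominal variables are usually coded as 0, 1, ..., (n-1).

2. Sometimes one category is represented in two ways. Temperature might be recorded as "High", "Medium", "Low", "H", or "low". Here "High" and "H" refer to the same category, and "Low" and "low" differ only in case, yet Python treats them all as different levels.

3. Some categories may occur very rarely, so it is generally a good idea to combine them.

The example below defines a generic function that takes a dictionary as input and codes the values using the Pandas 'replace' function.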

#Define a generic function using Pandas replace function
def coding(col, codeDict):
  colCoded = pd.Series(col, copy=True)
  for key, value in codeDict.items():
    colCoded = colCoded.replace(key, value)
  return colCoded

#Coding LoanStatus as Y=1, N=0:
print('Before Coding:')
print(data["Loan_Status"].value_counts())
data["Loan_Status_Coded"] = coding(data["Loan_Status"], {'N':0,'Y':1})
print('\nAfter Coding:')
print(data["Loan_Status_Coded"].value_counts())

The counts are the same before and after, which confirms the coding is correct.
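The temperature case from reason 2 can be handled with the same dictionary idea. The sketch below uses Series.map instead of replace, which is a swapped-in technique; the `temperature` values and the `code_dict` mapping are assumptions for illustration.

```python
import pandas as pd

# Hypothetical readings with inconsistent spellings of the same category.
temperature = pd.Series(["High", "H", "Medium", "low", "Low", "High"])

# Map every spelling variant to one integer code per category.
code_dict = {"Low": 0, "low": 0, "Medium": 1, "High": 2, "H": 2}

# Series.map looks each value up in the dictionary; a value missing from
# the dictionary becomes NaN, which makes unmapped spellings easy to spot.
coded = temperature.map(code_dict)

print(coded.tolist())
```

Unlike replace, map does not leave unknown values unchanged, so checking the result for NaN is a quick way to verify that the dictionary covers every spelling.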

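Read more: Pandas Reference (replace)

#12 – Iterating over the Rows of a DataFrame

This is not a frequently used operation, but you don't want to get stuck on it, right? Sometimes you need to iterate over all rows with a for loop. One common problem is the incorrect treatment of variables in Python, which generally happens when:

1. Nominal variables with numeric categories are treated as numerical.

2. Numeric variables containing characters in some rows (due to a data error) are treated as categorical.

So it is generally a good idea to define the column types manually. If we check the data types of all columns: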

#Check current type:
data.dtypes

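we see that the nominal variable Credit_History is shown as float. A good way to tackle such issues is to create a csv file with the column names and their types. We can then write a generic function that reads the file and assigns the column data types. For this example, a csv file "datatypes.csv" has been created.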

#Load the file:
colTypes = pd.read_csv('datatypes.csv')
print(colTypes)

After loading the file, we can iterate over each row and assign the data type from the 'type' column to the variable named in the 'feature' column.

#Iterate through each row and assign variable type.
#Note: astype is used to assign types

for i, row in colTypes.iterrows():  #i: dataframe index; row: each row in series format
    if row['type']=="categorical":
        data[row['feature']]=data[row['feature']].astype(object)  #np.object was removed from NumPy
    elif row['type']=="continuous":
        data[row['feature']]=data[row['feature']].astype(float)  #np.float was removed from NumPy
print(data.dtypes)

The Credit_History column is now of type 'object', which Pandas uses to represent nominal variables.
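The same loop can be tried without the csv file. In the sketch below the `col_types` frame is built inline as a stand-in for datatypes.csv, and the `toy` frame is made up; only the 'feature' and 'type' column names come from the text.

```python
import pandas as pd

# Hypothetical data: Credit_History is read as float, LoanAmount as int.
toy = pd.DataFrame({"Credit_History": [1.0, 0.0, 1.0], "LoanAmount": [120, 66, 141]})

# Stand-in for datatypes.csv: one row per column, with the intended type.
col_types = pd.DataFrame(
    {"feature": ["Credit_History", "LoanAmount"], "type": ["categorical", "continuous"]}
)

# iterrows yields (index, row) pairs; each row is a Series keyed by column name.
for _, row in col_types.iterrows():
    if row["type"] == "categorical":
        toy[row["feature"]] = toy[row["feature"]].astype(object)
    elif row["type"] == "continuous":
        toy[row["feature"]] = toy[row["feature"]].astype(float)

print(toy.dtypes)
```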

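Read more: Pandas Reference (iterrows)

Conclusion

In this article we covered several Pandas functions that make data exploration and feature engineering easier. We also defined some generic, reusable functions that achieve similar results on different datasets.

If you have any questions about Pandas or Python, feel free to discuss them with us.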
