5 Practical Pandas Tips for Data Analysis

Posted by K'illCode on 2020-12-06

Original URL: https://blog.csdn.net/Dome_/article/details/110748946

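Tip 1: How to use map for feature engineering on a column

First, create the data:

import pandas as pd

d = {
    "gender": ["male", "female", "male", "female"],
    "color": ["red", "green", "blue", "green"],
    "age": [25, 30, 15, 32]
}

df = pd.DataFrame(d)
df

On the gender column, use the map method to quickly apply the following mapping: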

d = {"male": 0, "female": 1}
df["gender2"] = df["gender"].map(d)
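One detail worth keeping in mind: map returns NaN for any value that is missing from the dictionary. The sketch below illustrates this with a made-up "unknown" value and a hypothetical fallback code of -1; neither comes from the original example.

```python
import pandas as pd

# "unknown" is an invented value used only to show the unmapped case.
df = pd.DataFrame({"gender": ["male", "female", "unknown"]})
mapping = {"male": 0, "female": 1}

# Values that are not keys of the dictionary become NaN after map().
encoded = df["gender"].map(mapping)

# Assumed fallback: code unmapped values as -1, then restore an integer dtype.
df["gender2"] = encoded.fillna(-1).astype(int)
print(df)
```

Checking encoded.isna() before filling is a quick way to spot categories the mapping forgot.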

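Tip 2: Clean data with replace and a regular expression

Pandas is strong at data analysis, and data analysis always involves data cleaning.

A quick cleaning trick: call replace with a regular expression on a column to clean its values in one pass.

Source data: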

d = {"customer": ["A", "B", "C", "D"],
"sales":[1100, "950.5RMB", "$400", " $1250.75"]}

df = pd.DataFrame(d)
df

Output:

customer sales
0 A 1100
1 B 950.5RMB
2 C $400
3 D $1250.75

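The sales column holds a mix of types: an integer, a float followed by RMB (which makes it a string), a dollar sign followed by an integer, and a dollar sign followed by a float.

Our goal: strip the RMB and $ symbols and convert the column to float.

One line of code does it:

df["sales"] = (
    df["sales"]
    .replace("[$,RMB]", "", regex=True)
    .astype("float")
)

The pattern [$,RMB] is a regular-expression character class: it matches any single one of the characters $, comma, R, M, or B, and every match is replaced with the empty string "".

Finally, astype converts the column to float.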

Output:

customer sales
0 A 1100.00
1 B 950.50
2 C 400.00
3 D 1250.75

If you want to be sure, check the type of each value:

df["sales"].apply(type)

Output:

0    <class 'float'>
1    <class 'float'>
2    <class 'float'>
3    <class 'float'>
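When the column may also contain thousands separators or entries that are not numbers at all, a more defensive variant can help. The following is only a sketch under that assumption: the "1,250.75" and "n/a" values are invented for illustration, and the approach (strip every character that cannot be part of a number, then use pd.to_numeric with errors="coerce") is an alternative to the one-liner above, not the author's method.

```python
import pandas as pd

# The last two values are invented to show separators and unparseable input.
df = pd.DataFrame({
    "customer": ["A", "B", "C", "D", "E"],
    "sales": [1100, "950.5RMB", "$400", " $1,250.75", "n/a"],
})

# Cast everything to str, then drop every character that is not a digit,
# a decimal point, or a minus sign.
cleaned = df["sales"].astype(str).str.replace(r"[^\d.\-]", "", regex=True)

# errors="coerce" turns whatever still is not a number into NaN
# instead of raising an exception.
df["sales"] = pd.to_numeric(cleaned, errors="coerce")
print(df)
```

The trade-off is that bad input is silently turned into NaN, so it is worth counting the missing values afterwards.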

Tip 3: How to use melt to reshape (unpivot) data?

Construct a DataFrame:

d = {
    "district_code": [12345, 56789, 101112, 131415],
    "apple": [5.2, 2.4, 4.2, 3.6],
    "banana": [3.5, 1.9, 4.0, 2.3],
    "orange": [8.0, 7.5, 6.4, 3.9]
}

df = pd.DataFrame(d)
df

Output:

   district_code  apple  banana  orange
0          12345    5.2     3.5     8.0
1          56789    2.4     1.9     7.5
2         101112    4.2     4.0     6.4
3         131415    3.6     2.3     3.9

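5.2 is the price of apple in district 12345. The apple, banana, and orange columns all hold the same kind of value, the price of a fruit, so how do we merge these three columns into one?

Use melt (available both as df.melt and pd.melt).

The parameter values can be worked out from this example:

df = df.melt(
    id_vars="district_code",
    var_name="fruit_name",
    value_name="price",
)
df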

Output:

    district_code fruit_name  price
0           12345      apple    5.2
1           56789      apple    2.4
2          101112      apple    4.2
3          131415      apple    3.6
4           12345     banana    3.5
5           56789     banana    1.9
6          101112     banana    4.0
7          131415     banana    2.3
8           12345     orange    8.0
9           56789     orange    7.5
10         101112     orange    6.4
11         131415     orange    3.9

The result above is a long DataFrame; the original one is a wide DataFrame.
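To make the wide/long relationship concrete, here is a small sketch that melts a two-district subset of the data and then reverses the operation with pivot. The variable names wide, long, and restored are illustrative and do not appear in the original.

```python
import pandas as pd

wide = pd.DataFrame({
    "district_code": [12345, 56789],
    "apple": [5.2, 2.4],
    "banana": [3.5, 1.9],
})

# Wide to long: one row per (district, fruit) pair.
long = wide.melt(
    id_vars="district_code",
    var_name="fruit_name",
    value_name="price",
)

# Long to wide: pivot() spreads fruit_name back out into one column per fruit.
restored = long.pivot(
    index="district_code", columns="fruit_name", values="price"
).reset_index()
restored.columns.name = None  # drop the leftover "fruit_name" axis label

print(long)
print(restored)
```

pivot requires each (index, columns) pair to be unique; with duplicates, pivot_table and an aggregation function would be needed instead.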

Tip 4: Given year and dayofyear, how do you convert to datetime?

Original DataFrame:

d = {
    "year": [2019, 2019, 2020],
    "day_of_year": [350, 365, 1]
}
df = pd.DataFrame(d)
df

Output:

   year  day_of_year
0  2019          350
1  2019          365
2  2020            1

The trick for converting to datetime

Step 1: build an integer

df["int_number"] = df["year"] * 1000 + df["day_of_year"]

Print df:

   year  day_of_year  int_number
0  2019          350     2019350
1  2019          365     2019365
2  2020            1     2020001

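Step 2: to_datetime

df["date"] = pd.to_datetime(df["int_number"], format="%Y%j")

Note the %j directive in "%Y%j": it parses the day of the year as a three-digit number (001 to 366), which is why step 1 multiplies the year by 1000.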

Output:

   year  day_of_year  int_number       date
0  2019          350     2019350 2019-12-16
1  2019          365     2019365 2019-12-31
2  2020            1     2020001 2020-01-01
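The same dates can also be derived without the intermediate integer. The sketch below is an alternative route, not the original method: it parses January 1 of each year and adds day_of_year - 1 days as a timedelta. The name jan_first is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"year": [2019, 2019, 2020], "day_of_year": [350, 365, 1]})

# January 1 of each year, parsed from the year written as a string.
jan_first = pd.to_datetime(df["year"].astype(str), format="%Y")

# Day 1 is January 1 itself, so the offset is day_of_year - 1 days.
df["date"] = jan_first + pd.to_timedelta(df["day_of_year"] - 1, unit="D")

# Round-trip check: the day of year read back from the date matches the input.
assert (df["date"].dt.dayofyear == df["day_of_year"]).all()
print(df)
```

The round-trip assertion is a cheap sanity check that works for either approach.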

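Tip 5: How to group rarely occurring category values into "Others"?

This is another task we often face in data cleaning and feature construction.

Consider the following DataFrame: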

d = {"name":['Jone','Alica','Emily','Robert','Tomas',
             'Zhang','Liu','Wang','Jack','Wsx','Guo'],
     "categories": ["A", "C", "A", "D", "A", 
                    "B", "B", "C", "A", "E", "F"]}
df = pd.DataFrame(d)
df

Result:

 name categories
0 Jone A
1 Alica C
2 Emily A
3 Robert D
4 Tomas A
5 Zhang B
6 Liu B
7 Wang C
8 Jack A
9 Wsx E
10 Guo F

D, E, and F each appear only once in the categories column, while A appears most often.

Step 1: count the frequencies and normalize them

frequencies = df["categories"].value_counts(normalize = True)
frequencies

Result:

A    0.363636
B    0.181818
C    0.181818
F    0.090909
E    0.090909
D    0.090909
Name: categories, dtype: float64
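For readers unsure what normalize=True does, the short sketch below compares it with dividing the raw counts by the number of rows. The names categories, counts, shares, and manual are illustrative only.

```python
import pandas as pd

categories = pd.Series(
    ["A", "C", "A", "D", "A", "B", "B", "C", "A", "E", "F"]
)

counts = categories.value_counts()                # raw occurrences per value
shares = categories.value_counts(normalize=True)  # relative frequencies

# normalize=True is equivalent to dividing each count by the total row count.
manual = counts / len(categories)

print(counts["A"], round(shares["A"], 6), round(manual["A"], 6))
```

A appears 4 times out of 11 rows, which gives the 0.363636 shown in the result above.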

Step 2: set a threshold and select the values whose frequency falls below it

threshold = 0.1
small_categories = frequencies[frequencies < threshold].index
small_categories

Result:

Index(['F', 'E', 'D'], dtype='object')

Step 3: replace the values

df["categories"] = df["categories"] \
.replace(small_categories, "Others")

DataFrame after the replacement:

 name categories
0 Jone A
1 Alica C
2 Emily A
3 Robert Others
4 Tomas A
5 Zhang B
6 Liu B
7 Wang C
8 Jack A
9 Wsx Others
10 Guo Others
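The three steps can be wrapped into one reusable helper. The function below is a sketch with a hypothetical name (lump_rare) and hypothetical defaults; it uses isin and where rather than replace, which is a swapped-in technique that yields the same result on this data.

```python
import pandas as pd

def lump_rare(series, threshold=0.1, other="Others"):
    """Replace values whose relative frequency is below `threshold`."""
    freq = series.value_counts(normalize=True)
    rare = freq[freq < threshold].index
    # where() keeps entries where the condition is True and puts
    # `other` everywhere else, i.e. on the rare values.
    return series.where(~series.isin(rare), other)

s = pd.Series(["A", "C", "A", "D", "A", "B", "B", "C", "A", "E", "F"])
print(lump_rare(s).value_counts())
```

Usage on the DataFrame from this tip would look like df["categories"] = lump_rare(df["categories"]). The threshold is a judgment call and should be tuned to the data at hand.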
