Advanced Pandas Tutorial: Working with Text Data

Posted by flydean on 2021-06-23

Original URL: https://www.cnblogs.com/flydean/p/14921314.html

Introduction

Before pandas 1.0, text data could only be stored with the object dtype. Version 1.0 added a dedicated dtype for text, StringDtype. This article covers how to work with text data in pandas.

Creating text data

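First, a common example of building a Series from text. The examples in this article assume pandas is imported as pd and NumPy as np: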

In [1]: pd.Series(['a', 'b', 'c'])
Out[1]: 
0    a
1    b
2    c
dtype: object

To use the new StringDtype, specify it explicitly:

In [2]: pd.Series(['a', 'b', 'c'], dtype="string")
Out[2]: 
0    a
1    b
2    c
dtype: string

In [3]: pd.Series(['a', 'b', 'c'], dtype=pd.StringDtype())
Out[3]: 
0    a
1    b
2    c
dtype: string

Or convert an existing Series with astype:

In [4]: s = pd.Series(['a', 'b', 'c'])

In [5]: s
Out[5]: 
0    a
1    b
2    c
dtype: object

In [6]: s.astype("string")
Out[6]: 
0    a
1    b
2    c
dtype: string
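As a minimal sketch of what the conversion changes, the snippet below converts a Series that contains a missing value and inspects the result. The variable names are illustrative, not from the original article.

```python
import pandas as pd

# A Series of Python strings with one missing entry (default dtype).
s = pd.Series(["a", "b", None])

# Convert to the dedicated string dtype.
s_string = s.astype("string")

# The result uses StringDtype, and the missing entry is reported by isna().
print(isinstance(s_string.dtype, pd.StringDtype))
print(s_string.isna().tolist())
```

With the string dtype, missing values are displayed as <NA>, which matches the outputs shown in the rest of this article.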

String methods

Strings can be converted to lower case or upper case, and their lengths can be computed:

In [24]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
   ....:               dtype="string")
   ....: 

In [25]: s.str.lower()
Out[25]: 
0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string

In [26]: s.str.upper()
Out[26]: 
0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string

In [27]: s.str.len()
Out[27]: 
0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64

You can also strip whitespace:

In [28]: idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])

In [29]: idx.str.strip()
Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')

In [30]: idx.str.lstrip()
Out[30]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')

In [31]: idx.str.rstrip()
Out[31]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')
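The same methods work on a Series, and strip also accepts a set of characters to remove instead of whitespace. A small sketch, with made-up sample values:

```python
import pandas as pd

# Whitespace stripping on a Series rather than an Index.
names = pd.Series([" jack", "jill ", " jesse ", "frank"], dtype="string")
stripped = names.str.strip()

# Passing characters strips any of them from both ends of each string.
decorated = pd.Series(["--jack--", "__jill"], dtype="string")
cleaned = decorated.str.strip("-_")

print(stripped.tolist())
print(cleaned.tolist())
```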

String operations on columns

Because column labels are strings, they can be manipulated with the usual string methods:

In [32]: df = pd.DataFrame(np.random.randn(3, 2),
   ....:                   columns=[' Column A ', ' Column B '], index=range(3))
   ....: 

In [33]: df
Out[33]: 
    Column A    Column B 
0    0.469112   -0.282863
1   -1.509059   -1.135632
2    1.212112   -0.173215

In [34]: df.columns.str.strip()
Out[34]: Index(['Column A', 'Column B'], dtype='object')

In [35]: df.columns.str.lower()
Out[35]: Index([' column a ', ' column b '], dtype='object')
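These calls return a new Index and do not modify the DataFrame. One possible way to clean the labels in place is to chain the methods and assign the result back to df.columns. This is a sketch; the zero-filled data is only a placeholder.

```python
import numpy as np
import pandas as pd

# Placeholder data; only the column labels matter here.
df = pd.DataFrame(np.zeros((3, 2)), columns=[" Column A ", " Column B "])

# Strip surrounding spaces, lower-case, then replace inner spaces with underscores.
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

print(list(df.columns))  # ['column_a', 'column_b']
```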

Splitting and replacing strings

split breaks each string into a list of pieces.

In [38]: s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")

In [39]: s2.str.split('_')
Out[39]: 
0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

To access an element of each split list, use get or [] indexing:

In [40]: s2.str.split('_').str.get(1)
Out[40]: 
0       b
1       d
2    <NA>
3       g
dtype: object

In [41]: s2.str.split('_').str[1]
Out[41]: 
0       b
1       d
2    <NA>
3       g
dtype: object

Passing expand=True expands the split lists into multiple columns:

In [42]: s2.str.split('_', expand=True)
Out[42]: 
      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h

The n parameter limits the number of splits:

In [43]: s2.str.split('_', expand=True, n=1)
Out[43]: 
      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h
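When the number of splits is limited, the direction matters. The sketch below contrasts split with rsplit (listed in the method summary at the end of this article), which works from the end of the string.

```python
import pandas as pd

s2 = pd.Series(["a_b_c", "c_d_e", None, "f_g_h"], dtype="string")

# One split from the left: first piece, then the remainder.
left = s2.str.split("_", n=1, expand=True)

# One split from the right: the remainder, then the last piece.
right = s2.str.rsplit("_", n=1, expand=True)

print(left.iloc[0].tolist())   # ['a', 'b_c']
print(right.iloc[0].tolist())  # ['a_b', 'c']
```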

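replace substitutes text, and the pattern can be a regular expression. Since pandas 2.0 the pattern is treated as a literal string by default, so pass regex=True when it is a regular expression (s3 here stands for a Series of strings):

s3.str.replace('^.a|dog', 'XX-XX ', case=False, regex=True)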
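Here is a self-contained sketch of the same call. The contents of s3 are an assumption chosen for illustration, since the article does not show how s3 was built.

```python
import pandas as pd

# Hypothetical sample data standing in for s3.
s3 = pd.Series(["A", "Aaba", "Baca", None, "CABA", "dog", "cat"], dtype="string")

# Regex replacement: any character followed by "a" at the start, or "dog",
# matched case-insensitively.
replaced = s3.str.replace(r"^.a|dog", "XX-XX ", case=False, regex=True)

# Literal replacement: "." is treated as a plain dot, not "any character".
literal = pd.Series(["a.b"], dtype="string").str.replace(".", "-", regex=False)

print(replaced.tolist())
print(literal.tolist())  # ['a-b']
```

Passing regex explicitly keeps the behaviour the same across pandas versions, because the default changed in 2.0.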

Concatenating strings

Use cat to concatenate strings:

In [64]: s = pd.Series(['a', 'b', 'c', 'd'], dtype="string")

In [65]: s.str.cat(sep=',')
Out[65]: 'a,b,c,d'
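cat also handles missing values and can join element-wise with another list-like. The following is a brief sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["a", "b", None, "d"], dtype="string")

# By default, missing values are skipped when joining into one string.
joined_skip = s.str.cat(sep=",")

# na_rep supplies a placeholder for missing values instead.
joined_fill = s.str.cat(sep=",", na_rep="-")

# Passing another list-like concatenates element by element.
pairs = s.str.cat(["A", "B", "C", "D"], na_rep="-")

print(joined_skip)     # a,b,d
print(joined_fill)     # a,b,-,d
print(pairs.tolist())  # ['aA', 'bB', '-C', 'dD']
```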

Indexing with .str

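Indexing with .str[...] returns a new Series. If the Series holds strings, you can index each element by position to get individual characters; positions past the end of a string yield <NA>. For example: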

In [99]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,
   ....:                'CABA', 'dog', 'cat'],
   ....:               dtype="string")
   ....: 

In [100]: s.str[0]
Out[100]: 
0       A
1       B
2       C
3       A
4       B
5    <NA>
6       C
7       d
8       c
dtype: string

In [101]: s.str[1]
Out[101]: 
0    <NA>
1    <NA>
2    <NA>
3       a
4       a
5    <NA>
6       A
7       o
8       a
dtype: string
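Slices work the same way as single positions. A short sketch (sample values are illustrative):

```python
import pandas as pd

s = pd.Series(["Aaba", "B", None, "dog"], dtype="string")

first = s.str[0]        # first character of each string
prefix = s.str[:2]      # up to the first two characters
second = s.str.get(1)   # same as s.str[1]; missing where the string is too short

print(first.dropna().tolist())   # ['A', 'B', 'd']
print(prefix.dropna().tolist())  # ['Aa', 'B', 'do']
print(second.isna().tolist())    # [False, True, True, False]
```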

extract

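extract pulls the parts of each string captured by a regular expression. It takes an expand parameter, which defaulted to False before version 0.23. With expand=False, extract returns a Series, an Index, or a DataFrame, depending on the input and the number of capture groups. With expand=True it always returns a DataFrame. Since version 0.23 the default is True.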

extract is used together with a regular expression:

In [102]: pd.Series(['a1', 'b2', 'c3'],
   .....:           dtype="string").str.extract(r'([ab])(\d)', expand=False)
   .....: 
Out[102]: 
      0     1
0     a     1
1     b     2
2  <NA>  <NA>

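The example above breaks each string apart according to the regular expression: the first group captures a letter and the second a digit. 'c3' does not match [ab], so its row is <NA>.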

Note that only the parts of the regular expression inside capture groups are extracted.

The following extracts only the digit:

In [106]: pd.Series(['a1', 'b2', 'c3'],
   .....:           dtype="string").str.extract(r'[ab](\d)', expand=False)
   .....: 
Out[106]: 
0       1
1       2
2    <NA>
dtype: string

Named groups set the column names:

In [103]: pd.Series(['a1', 'b2', 'c3'],
   .....:           dtype="string").str.extract(r'(?P<letter>[ab])(?P<digit>\d)',
   .....:                                       expand=False)
   .....: 
Out[103]: 
  letter digit
0      a     1
1      b     2
2   <NA>  <NA>
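To make the effect of expand concrete, this sketch checks the return type for one and two capture groups:

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"], dtype="string")

# One capture group: expand decides between DataFrame and Series.
one_group_frame = s.str.extract(r"[ab](\d)", expand=True)
one_group_series = s.str.extract(r"[ab](\d)", expand=False)

# Two capture groups: the result is a DataFrame even with expand=False.
two_groups = s.str.extract(r"([ab])(\d)", expand=False)

print(type(one_group_frame).__name__)   # DataFrame
print(type(one_group_series).__name__)  # Series
print(type(two_groups).__name__)        # DataFrame
```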

extractall

extractall is similar to extract. The difference is that extract returns only the first match in each string, while extractall returns every match. For example:

In [112]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"],
   .....:               dtype="string")
   .....: 

In [113]: s
Out[113]: 
A    a1a2
B      b1
C      c1
dtype: string

In [114]: two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'

In [115]: s.str.extract(two_groups, expand=True)
Out[115]: 
  letter digit
A      a     1
B      b     1
C      c     1

extract stops after matching a1.

In [116]: s.str.extractall(two_groups)
Out[116]: 
        letter digit
  match             
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1

extractall matches a1 and then goes on to match a2. The extra index level, match, numbers the matches within each string.
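One way to see how the two results relate is to select match number 0 from the extractall output with xs, which should give the same rows as extract. A sketch using the same data:

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"

all_matches = s.str.extractall(two_groups)  # one row per match
first_only = s.str.extract(two_groups)      # one row per string

# Keep only the first match (match == 0) of each string.
first_matches = all_matches.xs(0, level="match")

print(len(all_matches))               # 4
print(first_matches.values.tolist())  # [['a', '1'], ['b', '1'], ['c', '1']]
```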

contains and match

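contains, match, and fullmatch test each string against a pattern. contains looks for the pattern anywhere in the string, match requires it at the start, and fullmatch requires the whole string to match:

In [126]: pattern = r'[0-9][a-z]'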

In [127]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
   .....:           dtype="string").str.contains(pattern)
   .....: 
Out[127]: 
0    False
1    False
2     True
3     True
4     True
5     True
dtype: boolean

In [128]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
   .....:           dtype="string").str.match(pattern)
   .....: 
Out[128]: 
0    False
1    False
2     True
3     True
4    False
5     True
dtype: boolean

In [129]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
   .....:           dtype="string").str.fullmatch(pattern)
   .....: 
Out[129]: 
0    False
1    False
2     True
3     True
4    False
5    False
dtype: boolean
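The three methods behave like re.search, re.match, and re.fullmatch applied to each element. The sketch below compares them; the pattern r"[0-9][a-z]" is an assumption that is consistent with the outputs above.

```python
import re
import pandas as pd

pattern = r"[0-9][a-z]"  # assumed pattern: a digit followed by a lowercase letter
values = ["1", "2", "3a", "3b", "03c", "4dx"]
s = pd.Series(values, dtype="string")

contains = s.str.contains(pattern).tolist()
match = s.str.match(pattern).tolist()
fullmatch = s.str.fullmatch(pattern).tolist()

# The same checks written with the standard re module.
by_search = [re.search(pattern, v) is not None for v in values]
by_match = [re.match(pattern, v) is not None for v in values]
by_fullmatch = [re.fullmatch(pattern, v) is not None for v in values]

print(contains == by_search, match == by_match, fullmatch == by_fullmatch)
```

Because the result is a boolean Series, it can be used directly as a filter, for example s[s.str.contains(pattern)].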

Summary of string methods

Finally, here is a summary of the string methods:

Method	Description
cat()	Concatenate strings
split()	Split strings on delimiter
rsplit()	Split strings on delimiter working from the end of the string
get()	Index into each element (retrieve i-th element)
join()	Join strings in each element of the Series with passed separator
get_dummies()	Split strings on the delimiter returning DataFrame of dummy variables
contains()	Return boolean array if each string contains pattern/regex
replace()	Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
repeat()	Duplicate values (s.str.repeat(3) equivalent to x * 3)
pad()	Add whitespace to left, right, or both sides of strings
center()	Equivalent to str.center
ljust()	Equivalent to str.ljust
rjust()	Equivalent to str.rjust
zfill()	Equivalent to str.zfill
wrap()	Split long strings into lines with length less than a given width
slice()	Slice each string in the Series
slice_replace()	Replace slice in each string with passed value
count()	Count occurrences of pattern
startswith()	Equivalent to str.startswith(pat) for each element
endswith()	Equivalent to str.endswith(pat) for each element
findall()	Compute list of all occurrences of pattern/regex for each string
match()	Call re.match on each element, returning a boolean
extract()	Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group
extractall()	Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group
len()	Compute string lengths
strip()	Equivalent to str.strip
rstrip()	Equivalent to str.rstrip
lstrip()	Equivalent to str.lstrip
partition()	Equivalent to str.partition
rpartition()	Equivalent to str.rpartition
lower()	Equivalent to str.lower
casefold()	Equivalent to str.casefold
upper()	Equivalent to str.upper
find()	Equivalent to str.find
rfind()	Equivalent to str.rfind
index()	Equivalent to str.index
rindex()	Equivalent to str.rindex
capitalize()	Equivalent to str.capitalize
swapcase()	Equivalent to str.swapcase
normalize()	Return Unicode normal form. Equivalent to unicodedata.normalize
translate()	Equivalent to str.translate
isalnum()	Equivalent to str.isalnum
isalpha()	Equivalent to str.isalpha
isdigit()	Equivalent to str.isdigit
isspace()	Equivalent to str.isspace
islower()	Equivalent to str.islower
isupper()	Equivalent to str.isupper
istitle()	Equivalent to str.istitle
isnumeric()	Equivalent to str.isnumeric
isdecimal()	Equivalent to str.isdecimal
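A few of the methods in the table that were not demonstrated above can be tried out as follows. This is only an illustrative sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c"], dtype="string")

digit_counts = s.str.count(r"\d")   # number of digits in each string
starts_with_a = s.str.startswith("a")
zero_padded = s.str.zfill(4)        # left-pad with "0" to width 4

print(digit_counts.tolist())   # [2, 1, 0]
print(starts_with_a.tolist())  # [True, False, False]
print(zero_padded.tolist())    # ['a1a2', '00b1', '000c']
```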

This article is also available at http://www.flydean.com/06-python-pandas-text/

