Pandas 2.2 Official Tutorials and Guides, Chinese Edition (Part 24)

Posted by 绝不原创的飞龙 on 2024-04-24

Original URL: https://www.cnblogs.com/apachecn/p/18154713

Source: pandas.pydata.org/docs/

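Scaling to large datasets

Source: pandas.pydata.org/docs/user_guide/scale.html

pandas provides data structures for in-memory analytics, which makes using pandas to analyze datasets that are larger than memory somewhat tricky. Even datasets that are a sizable fraction of memory become unwieldy, as some pandas operations need to make intermediate copies.

This document provides a few recommendations for scaling your analysis to larger datasets. It complements Enhancing performance, which focuses on speeding up analysis for datasets that fit in memory.

Load less data

Suppose our raw dataset on disk has many columns.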

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: def make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):
 ...:    index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
 ...:    n = len(index)
 ...:    state = np.random.RandomState(seed)
 ...:    columns = {
 ...:        "name": state.choice(["Alice", "Bob", "Charlie"], size=n),
 ...:        "id": state.poisson(1000, size=n),
 ...:        "x": state.rand(n) * 2 - 1,
 ...:        "y": state.rand(n) * 2 - 1,
 ...:    }
 ...:    df = pd.DataFrame(columns, index=index, columns=sorted(columns))
 ...:    if df.index[-1] == end:
 ...:        df = df.iloc[:-1]
 ...:    return df
 ...: 

In [4]: timeseries = [
 ...:    make_timeseries(freq="1min", seed=i).rename(columns=lambda x: f"{x}_{i}")
 ...:    for i in range(10)
 ...: ]
 ...: 

In [5]: ts_wide = pd.concat(timeseries, axis=1)

In [6]: ts_wide.head()
Out[6]: 
 id_0 name_0       x_0  ...   name_9       x_9       y_9
timestamp                                   ... 
2000-01-01 00:00:00   977  Alice -0.821225  ...  Charlie -0.957208 -0.757508
2000-01-01 00:01:00  1018    Bob -0.219182  ...    Alice -0.414445 -0.100298
2000-01-01 00:02:00   927  Alice  0.660908  ...  Charlie -0.325838  0.581859
2000-01-01 00:03:00   997    Bob -0.852458  ...      Bob  0.992033 -0.686692
2000-01-01 00:04:00   965    Bob  0.717283  ...  Charlie -0.924556 -0.184161

[5 rows x 40 columns]

In [7]: ts_wide.to_parquet("timeseries_wide.parquet")

To load the columns we want, we have two options. Option 1 loads in all the data and then filters to what we need.

In [8]: columns = ["id_0", "name_0", "x_0", "y_0"]

In [9]: pd.read_parquet("timeseries_wide.parquet")[columns]
Out[9]: 
 id_0 name_0       x_0       y_0
timestamp 
2000-01-01 00:00:00   977  Alice -0.821225  0.906222
2000-01-01 00:01:00  1018    Bob -0.219182  0.350855
2000-01-01 00:02:00   927  Alice  0.660908 -0.798511
2000-01-01 00:03:00   997    Bob -0.852458  0.735260
2000-01-01 00:04:00   965    Bob  0.717283  0.393391
...                   ...    ...       ...       ...
2000-12-30 23:56:00  1037    Bob -0.814321  0.612836
2000-12-30 23:57:00   980    Bob  0.232195 -0.618828
2000-12-30 23:58:00   965  Alice -0.231131  0.026310
2000-12-30 23:59:00   984  Alice  0.942819  0.853128
2000-12-31 00:00:00  1003  Alice  0.201125 -0.136655

[525601 rows x 4 columns]

Option 2 only loads the columns we request.

In [10]: pd.read_parquet("timeseries_wide.parquet", columns=columns)
Out[10]: 
 id_0 name_0       x_0       y_0
timestamp 
2000-01-01 00:00:00   977  Alice -0.821225  0.906222
2000-01-01 00:01:00  1018    Bob -0.219182  0.350855
2000-01-01 00:02:00   927  Alice  0.660908 -0.798511
2000-01-01 00:03:00   997    Bob -0.852458  0.735260
2000-01-01 00:04:00   965    Bob  0.717283  0.393391
...                   ...    ...       ...       ...
2000-12-30 23:56:00  1037    Bob -0.814321  0.612836
2000-12-30 23:57:00   980    Bob  0.232195 -0.618828
2000-12-30 23:58:00   965  Alice -0.231131  0.026310
2000-12-30 23:59:00   984  Alice  0.942819  0.853128
2000-12-31 00:00:00  1003  Alice  0.201125 -0.136655

[525601 rows x 4 columns]

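If we were to measure the memory usage of the two calls, we would see that specifying columns uses about 1/10th the memory in this case.

With pandas.read_csv(), you can specify usecols to limit the columns read into memory. Not all file formats that can be read by pandas provide an option to read a subset of columns.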
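As a minimal sketch of the usecols option, the snippet below parses only two columns of a small CSV. The CSV text and its column names are made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a wide file on disk.
csv_text = "id,name,x,y,extra\n1,Alice,0.1,0.2,foo\n2,Bob,0.3,0.4,bar\n"

# Only the listed columns are parsed into the resulting DataFrame.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["id", "name"])
print(subset.columns.tolist())
```

The columns that are not listed never become part of the DataFrame, so they do not add to its memory footprint.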

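Use efficient datatypes

The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as "low-cardinality" data). By using more efficient data types, you can store larger datasets in memory.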

In [11]: ts = make_timeseries(freq="30s", seed=0)

In [12]: ts.to_parquet("timeseries.parquet")

In [13]: ts = pd.read_parquet("timeseries.parquet")

In [14]: ts
Out[14]: 
 id     name         x         y
timestamp 
2000-01-01 00:00:00  1041    Alice  0.889987  0.281011
2000-01-01 00:00:30   988      Bob -0.455299  0.488153
2000-01-01 00:01:00  1018    Alice  0.096061  0.580473
2000-01-01 00:01:30   992      Bob  0.142482  0.041665
2000-01-01 00:02:00   960      Bob -0.036235  0.802159
...                   ...      ...       ...       ...
2000-12-30 23:58:00  1022    Alice  0.266191  0.875579
2000-12-30 23:58:30   974    Alice -0.009826  0.413686
2000-12-30 23:59:00  1028  Charlie  0.307108 -0.656789
2000-12-30 23:59:30  1002    Alice  0.202602  0.541335
2000-12-31 00:00:00   987    Alice  0.200832  0.615972

[1051201 rows x 4 columns]

Now, let's inspect the data types and memory usage to see where we should focus our attention.

In [15]: ts.dtypes
Out[15]: 
id        int64
name     object
x       float64
y       float64
dtype: object

In [16]: ts.memory_usage(deep=True)  # memory usage in bytes
Out[16]: 
Index     8409608
id        8409608
name     65176434
x         8409608
y         8409608
dtype: int64

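The name column is taking up much more memory than any other. It has just a few unique values, so it's a good candidate for converting to a pandas.Categorical. With a pandas.Categorical, we store each unique name once and use space-efficient integers to know which specific name is used in each row.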

In [17]: ts2 = ts.copy()

In [18]: ts2["name"] = ts2["name"].astype("category")

In [19]: ts2.memory_usage(deep=True)
Out[19]: 
Index    8409608
id       8409608
name     1051495
x        8409608
y        8409608
dtype: int64

We can go a bit further and downcast the numeric columns to their smallest types using pandas.to_numeric().

In [20]: ts2["id"] = pd.to_numeric(ts2["id"], downcast="unsigned")

In [21]: ts2[["x", "y"]] = ts2[["x", "y"]].apply(pd.to_numeric, downcast="float")

In [22]: ts2.dtypes
Out[22]: 
id        uint16
name    category
x        float32
y        float32
dtype: object

In [23]: ts2.memory_usage(deep=True)
Out[23]: 
Index    8409608
id       2102402
name     1051495
x        4204804
y        4204804
dtype: int64

In [24]: reduction = ts2.memory_usage(deep=True).sum() / ts.memory_usage(deep=True).sum()

In [25]: print(f"{reduction:0.2f}")
0.20

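In all, we've reduced the in-memory footprint of this dataset to 1/5 of its original size.

See Categorical data for more on pandas.Categorical and dtypes for an overview of all of pandas' dtypes.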
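The conversions above happen after loading. One possible alternative, sketched here under the assumption that the data arrives as CSV, is to request the compact dtypes at read time through the dtype parameter of pandas.read_csv(). The CSV text below is invented for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk.
csv_text = "id,name,x\n1000,Alice,0.5\n1001,Bob,-0.25\n999,Alice,0.75\n"

# Ask the parser for compact dtypes directly instead of converting afterwards.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": "uint16", "name": "category", "x": "float32"},
)
print(typed.dtypes)
```

This avoids materializing the wider default dtypes first, at the cost of having to know suitable types in advance.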

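Use chunking

Some workloads can be achieved with chunking by splitting a large problem into a bunch of small problems. For example, converting an individual CSV file into a Parquet file and repeating that for each file in a directory. As long as each chunk fits in memory, you can work with datasets that are much larger than memory.

Note

Chunking works well when the operation you're performing requires zero or minimal coordination between chunks. For more complicated workflows, you're better off using other libraries.

Suppose we have an even larger "logical dataset" on disk that's a directory of parquet files. Each file in the directory represents a different year of the entire dataset.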

In [26]: import pathlib

In [27]: N = 12

In [28]: starts = [f"20{i:>02d}-01-01" for i in range(N)]

In [29]: ends = [f"20{i:>02d}-12-13" for i in range(N)]

In [30]: pathlib.Path("data/timeseries").mkdir(parents=True, exist_ok=True)  # parents=True also creates "data" if it is missing

In [31]: for i, (start, end) in enumerate(zip(starts, ends)):
 ....:    ts = make_timeseries(start=start, end=end, freq="1min", seed=i)
 ....:    ts.to_parquet(f"data/timeseries/ts-{i:0>2d}.parquet")
 ....:

data
└── timeseries
    ├── ts-00.parquet
    ├── ts-01.parquet
    ├── ts-02.parquet
    ├── ts-03.parquet
    ├── ts-04.parquet
    ├── ts-05.parquet
    ├── ts-06.parquet
    ├── ts-07.parquet
    ├── ts-08.parquet
    ├── ts-09.parquet
    ├── ts-10.parquet
    └── ts-11.parquet

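Now we'll implement an out-of-core pandas.Series.value_counts(). The peak memory usage of this workflow is the single largest chunk, plus a small series storing the unique value counts up to this point. As long as each individual file fits in memory, this will work for arbitrary-sized datasets.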

In [32]: %%time
 ....: files = pathlib.Path("data/timeseries/").glob("ts*.parquet")
 ....: counts = pd.Series(dtype=int)
 ....: for path in files:
 ....:    df = pd.read_parquet(path)
 ....:    counts = counts.add(df["name"].value_counts(), fill_value=0)
 ....: counts.astype(int)
 ....: 
CPU times: user 760 ms, sys: 26.1 ms, total: 786 ms
Wall time: 559 ms
Out[32]: 
name
Alice      1994645
Bob        1993692
Charlie    1994875
dtype: int64

Some readers, like pandas.read_csv(), offer parameters to control the chunksize when reading a single file.
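A minimal sketch of the chunksize parameter follows, applying the same running value_counts() idea to a single CSV source. The CSV content is made up, and the chunk size of 2 is chosen only so that the tiny example yields several chunks.

```python
import io

import pandas as pd

# Hypothetical single-column CSV standing in for a large file.
names = ["Alice", "Bob", "Alice", "Charlie", "Bob", "Alice"]
csv_text = "name\n" + "\n".join(names) + "\n"

counts = pd.Series(dtype="int64")
# With chunksize set, read_csv returns an iterator of DataFrames
# instead of one DataFrame holding the whole file.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    counts = counts.add(chunk["name"].value_counts(), fill_value=0)

counts = counts.astype("int64")
print(counts)
```

Only one chunk and the small running counts Series are held in memory at any time.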

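Manually chunking is an OK option for workflows that don't require overly sophisticated operations. Some operations, like pandas.DataFrame.groupby(), are much harder to do chunkwise. In these cases, you may be better switching to a different library that implements these out-of-core algorithms for you.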
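To illustrate why grouped operations need more care, here is a sketch of a chunked group mean. A per-chunk mean cannot simply be averaged across chunks, so the example accumulates per-group sums and counts and divides at the end. The data, the column names, and the chunk size are assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical data: Alice has values 1, 3, 5 and Bob has values 2, 6.
csv_text = "name,x\nAlice,1.0\nBob,2.0\nAlice,3.0\nBob,6.0\nAlice,5.0\n"

sums = pd.Series(dtype="float64")
sizes = pd.Series(dtype="float64")
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    grouped = chunk.groupby("name")["x"]
    # Keep running totals; groups missing from a chunk contribute 0.
    sums = sums.add(grouped.sum(), fill_value=0)
    sizes = sizes.add(grouped.count(), fill_value=0)

# Combine the partial results only once all chunks have been seen.
means = sums / sizes
print(means)
```

This works because sums and counts combine across chunks by simple addition. Operations without such a decomposition (a median, for instance) are where a dedicated library becomes the better choice.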

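Use other libraries

There are other libraries which provide APIs similar to pandas and work nicely with pandas DataFrame. They can scale large dataset processing and analytics through parallel runtimes, distributed memory, clustering, and so on. You can find more information on the ecosystem page.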

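Sparse data structures

Source: pandas.pydata.org/docs/user_guide/sparse.html

pandas provides data structures for efficiently storing sparse data. These are not necessarily sparse in the typical "mostly 0" sense. Rather, you can view these objects as being "compressed", where any data matching a specific value (NaN / missing value, though any value can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.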

In [1]: arr = np.random.randn(10)

In [2]: arr[2:-2] = np.nan

In [3]: ts = pd.Series(pd.arrays.SparseArray(arr))

In [4]: ts
Out[4]: 
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: Sparse[float64, nan]

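Notice the dtype, Sparse[float64, nan]. The nan means that elements in the array that are nan aren't actually stored, only the non-nan elements are. Those non-nan elements have a float64 dtype.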
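As a small sketch of the "any value can be chosen" point, the array below uses 0 rather than NaN as the fill value. The input values are arbitrary.

```python
import pandas as pd

# With fill_value=0, the zeros are the "compressed" entries.
sparse_zeros = pd.arrays.SparseArray([0, 0, 5, 0, 7, 0], fill_value=0)

print(sparse_zeros.sp_values)  # only the values that are actually stored
print(sparse_zeros.npoints)    # number of stored (non-fill) values
print(sparse_zeros.dtype)
```

Only the two non-zero entries are stored; the remaining four positions are represented implicitly by the fill value.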

The sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:

In [5]: df = pd.DataFrame(np.random.randn(10000, 4))

In [6]: df.iloc[:9998] = np.nan

In [7]: sdf = df.astype(pd.SparseDtype("float", np.nan))

In [8]: sdf.head()
Out[8]: 
 0    1    2    3
0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN

In [9]: sdf.dtypes
Out[9]: 
0    Sparse[float64, nan]
1    Sparse[float64, nan]
2    Sparse[float64, nan]
3    Sparse[float64, nan]
dtype: object

In [10]: sdf.sparse.density
Out[10]: 0.0002

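As you can see, the density (the fraction of values that have not been "compressed") is extremely low. This sparse object takes up much less memory on disk (pickled) and in the Python interpreter.

In [11]: 'dense : {:0.2f} kB'.format(df.memory_usage().sum() / 1e3)
Out[11]: 'dense : 320.13 kB'

In [12]: 'sparse: {:0.2f} kB'.format(sdf.memory_usage().sum() / 1e3)
Out[12]: 'sparse: 0.22 kB'

Functionally, their behavior should be nearly identical to their dense counterparts.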
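A brief sketch of that equivalence, using a made-up Series: a reduction on the sparse version gives the same answer as on the dense original, and converting back recovers the original values.

```python
import numpy as np
import pandas as pd

dense = pd.Series([1.0, np.nan, np.nan, 4.0])
sparse = dense.astype(pd.SparseDtype("float", np.nan))

# Reductions skip the missing values in both representations.
print(dense.sum(), sparse.sum())

# Converting back to dense recovers the original Series.
round_trip = sparse.sparse.to_dense()
print(round_trip.equals(dense))
```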

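SparseArray

arrays.SparseArray is an ExtensionArray for storing an array of sparse values (see dtypes for more on extension arrays). It is a one-dimensional ndarray-like object storing only values distinct from the fill_value: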

In [13]: arr = np.random.randn(10)

In [14]: arr[2:5] = np.nan

In [15]: arr[7:8] = np.nan

In [16]: sparr = pd.arrays.SparseArray(arr)

In [17]: sparr
Out[17]: 
[-1.9556635297215477, -1.6588664275960427, nan, nan, nan, 1.1589328886422277, 0.14529711373305043, nan, 0.6060271905134522, 1.3342113401317768]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)

A sparse array can be converted to a regular (dense) ndarray with numpy.asarray()

In [18]: np.asarray(sparr)
Out[18]: 
array([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
 nan,  0.606 ,  1.3342]) 

Sparse dtypes

The SparseArray.dtype property stores two pieces of information:

1.  The dtype of the non-sparse values

2.  The scalar fill value

In [19]: sparr.dtype
Out[19]: Sparse[float64, nan]

A SparseDtype may be constructed by passing only a dtype

In [20]: pd.SparseDtype(np.dtype('datetime64[ns]'))
Out[20]: Sparse[datetime64[ns], numpy.datetime64('NaT')]

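in which case a default fill value will be used (for NumPy dtypes this is often the "missing" value for that dtype). To override this default, an explicit fill value may be passed instead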

In [21]: pd.SparseDtype(np.dtype('datetime64[ns]'),
 ....:               fill_value=pd.Timestamp('2017-01-01'))
 ....: 
Out[21]: Sparse[datetime64[ns], Timestamp('2017-01-01 00:00:00')]
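As a quick sketch of the defaults mentioned above, the snippet below prints the fill value pandas picks for a few common NumPy dtypes when none is given. The choice of dtypes is only illustrative.

```python
import numpy as np
import pandas as pd

# No fill_value is passed, so each SparseDtype falls back to its default.
int_fill = pd.SparseDtype("int64").fill_value
bool_fill = pd.SparseDtype("bool").fill_value
float_fill = pd.SparseDtype("float64").fill_value

print(int_fill, bool_fill, float_fill)
```

Integer and boolean dtypes have no missing-value marker of their own, which is why their defaults are 0 and False rather than NaN.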

Finally, the string alias 'Sparse[dtype]' may be used to specify a sparse dtype in many places

In [22]: pd.array([1, 0, 0, 2], dtype='Sparse[int]')
Out[22]: 
[1, 0, 0, 2]
Fill: 0
IntIndex
Indices: array([0, 3], dtype=int32) 
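
Sparse accessor

pandas provides a .sparse accessor, similar to .str for string data, .cat for categorical data, and .dt for datetime-like data. This namespace provides attributes and methods that are specific to sparse data.
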
In [23]: s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")

In [24]: s.sparse.density
Out[24]: 0.5

In [25]: s.sparse.fill_value
Out[25]: 0

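This accessor is available only on data with SparseDtype, and on the Series class itself for creating a Series with sparse data from a scipy COO matrix.

A .sparse accessor has been added for DataFrame as well. See Sparse accessor for more.

Sparse calculation

You can apply NumPy ufuncs to arrays.SparseArray and get an arrays.SparseArray as a result.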

In [26]: arr = pd.arrays.SparseArray([1., np.nan, np.nan, -2., np.nan])

In [27]: np.abs(arr)
Out[27]: 
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)

The ufunc is also applied to fill_value. This is needed to get the correct dense result.

In [28]: arr = pd.arrays.SparseArray([1., -1, -1, -2., -1], fill_value=-1)

In [29]: np.abs(arr)
Out[29]: 
[1, 1, 1, 2.0, 1]
Fill: 1
IntIndex
Indices: array([3], dtype=int32)

In [30]: np.abs(arr).to_dense()
Out[30]: array([1., 1., 1., 2., 1.])

Conversion

To convert data from sparse to dense, use the .sparse accessor

In [31]: sdf.sparse.to_dense()
Out[31]: 
 0         1         2         3
0          NaN       NaN       NaN       NaN
1          NaN       NaN       NaN       NaN
2          NaN       NaN       NaN       NaN
3          NaN       NaN       NaN       NaN
4          NaN       NaN       NaN       NaN
...        ...       ...       ...       ...
9995       NaN       NaN       NaN       NaN
9996       NaN       NaN       NaN       NaN
9997       NaN       NaN       NaN       NaN
9998  0.509184 -0.774928 -1.369894 -0.382141
9999  0.280249 -1.648493  1.490865 -0.890819

[10000 rows x 4 columns]

From dense to sparse, use DataFrame.astype() with a SparseDtype.

In [32]: dense = pd.DataFrame({"A": [1, 0, 0, 1]})

In [33]: dtype = pd.SparseDtype(int, fill_value=0)

In [34]: dense.astype(dtype)
Out[34]: 
 A
0  1
1  0
2  0
3  1 
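The two conversions can be combined into a round trip. The following is a small sketch with an invented two-column frame, assuming 0 as the fill value.

```python
import pandas as pd

dense = pd.DataFrame({"A": [1, 0, 0, 1], "B": [0, 0, 0, 2]}, dtype="int64")

# Dense -> sparse: zeros become the implicit fill value.
sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))
print(sparse.sparse.density)  # fraction of values that are actually stored

# Sparse -> dense: the original frame is recovered.
restored = sparse.sparse.to_dense()
print(restored.equals(dense))
```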

Interaction with scipy.sparse

Use DataFrame.sparse.from_spmatrix() to create a DataFrame with sparse values from a sparse matrix.

In [35]: from scipy.sparse import csr_matrix

In [36]: arr = np.random.random(size=(1000, 5))

In [37]: arr[arr < .9] = 0

In [38]: sp_arr = csr_matrix(arr)

In [39]: sp_arr
Out[39]: 
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
 with 517 stored elements in Compressed Sparse Row format>

In [40]: sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)

In [41]: sdf.head()
Out[41]: 
 0  1  2         3  4
0   0.95638  0  0         0  0
1         0  0  0         0  0
2         0  0  0         0  0
3         0  0  0         0  0
4  0.999552  0  0  0.956153  0

In [42]: sdf.dtypes
Out[42]: 
0    Sparse[float64, 0]
1    Sparse[float64, 0]
2    Sparse[float64, 0]
3    Sparse[float64, 0]
4    Sparse[float64, 0]
dtype: object

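All sparse formats are supported, but matrices that are not in COOrdinate format will be converted, copying data as needed. To convert back to a sparse SciPy matrix in COO format, you can use the DataFrame.sparse.to_coo() method: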

In [43]: sdf.sparse.to_coo()
Out[43]: 
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
 with 517 stored elements in COOrdinate format>

Series.sparse.to_coo() is implemented for transforming a Series with sparse values indexed by a MultiIndex to a scipy.sparse.coo_matrix.

The method requires a MultiIndex with two or more levels.

In [44]: s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])

In [45]: s.index = pd.MultiIndex.from_tuples(
 ....:    [
 ....:        (1, 2, "a", 0),
 ....:        (1, 2, "a", 1),
 ....:        (1, 1, "b", 0),
 ....:        (1, 1, "b", 1),
 ....:        (2, 1, "b", 0),
 ....:        (2, 1, "b", 1),
 ....:    ],
 ....:    names=["A", "B", "C", "D"],
 ....: )
 ....: 

In [46]: ss = s.astype('Sparse')

In [47]: ss
Out[47]: 
A  B  C  D
1  2  a  0    3.0
 1    NaN
 1  b  0    1.0
 1    3.0
2  1  b  0    NaN
 1    NaN
dtype: Sparse[float64, nan]

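In the example below, we transform the Series to a sparse representation of a 2-d array by specifying that the first and second MultiIndex levels define labels for the rows and the third and fourth levels define labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.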

In [48]: A, rows, columns = ss.sparse.to_coo(
 ....:    row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True
 ....: )
 ....: 

In [49]: A
Out[49]: 
<3x4 sparse matrix of type '<class 'numpy.float64'>'
 with 3 stored elements in COOrdinate format>

In [50]: A.todense()
Out[50]: 
matrix([[0., 0., 1., 3.],
 [3., 0., 0., 0.],
 [0., 0., 0., 0.]])

In [51]: rows
Out[51]: [(1, 1), (1, 2), (2, 1)]

In [52]: columns
Out[52]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]

Specifying different row and column labels (and not sorting them) yields a different sparse matrix:

In [53]: A, rows, columns = ss.sparse.to_coo(
 ....:    row_levels=["A", "B", "C"], column_levels=["D"], sort_labels=False
 ....: )
 ....: 

In [54]: A
Out[54]: 
<3x2 sparse matrix of type '<class 'numpy.float64'>'
 with 3 stored elements in COOrdinate format>

In [55]: A.todense()
Out[55]: 
matrix([[3., 0.],
 [1., 3.],
 [0., 0.]])

In [56]: rows
Out[56]: [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]

In [57]: columns
Out[57]: [(0,), (1,)]

A convenience method Series.sparse.from_coo() is implemented for creating a Series with sparse values from a scipy.sparse.coo_matrix.

In [58]: from scipy import sparse

In [59]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4))

In [60]: A
Out[60]: 
<3x4 sparse matrix of type '<class 'numpy.float64'>'
 with 3 stored elements in COOrdinate format>

In [61]: A.todense()
Out[61]: 
matrix([[0., 0., 1., 2.],
 [3., 0., 0., 0.],
 [0., 0., 0., 0.]])

The default behaviour (with dense_index=False) simply returns a Series containing only the non-null entries.

In [62]: ss = pd.Series.sparse.from_coo(A)

In [63]: ss
Out[63]: 
0  2    1.0
 3    2.0
1  0    3.0
dtype: Sparse[float64, nan]

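Specifying dense_index=True will result in an index that is the Cartesian product of the row and column coordinates of the matrix. Note that this will consume a significant amount of memory (relative to dense_index=False) if the sparse matrix is large (and sparse) enough.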

In [64]: ss_dense = pd.Series.sparse.from_coo(A, dense_index=True)

In [65]: ss_dense
Out[65]: 
1  0    3.0
 2    NaN
 3    NaN
0  0    NaN
 2    1.0
 3    2.0
 0    NaN
 2    1.0
 3    2.0
dtype: Sparse[float64, nan] 

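Frequently Asked Questions (FAQ)

Source: pandas.pydata.org/docs/user_guide/gotchas.html

DataFrame memory usage

The memory usage of a DataFrame (including the index) is shown when calling info(). A configuration option, display.memory_usage (see the list of options), specifies whether the DataFrame memory usage will be displayed when invoking the info() method.

For example, the memory usage of the DataFrame below is shown when calling info():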

In [1]: dtypes = [
 ...:    "int64",
 ...:    "float64",
 ...:    "datetime64[ns]",
 ...:    "timedelta64[ns]",
 ...:    "complex128",
 ...:    "object",
 ...:    "bool",
 ...: ]
 ...: 

In [2]: n = 5000

In [3]: data = {t: np.random.randint(100, size=n).astype(t) for t in dtypes}

In [4]: df = pd.DataFrame(data)

In [5]: df["categorical"] = df["object"].astype("category")

In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   int64            5000 non-null   int64 
 1   float64          5000 non-null   float64 
 2   datetime64[ns]   5000 non-null   datetime64[ns] 
 3   timedelta64[ns]  5000 non-null   timedelta64[ns]
 4   complex128       5000 non-null   complex128 
 5   object           5000 non-null   object 
 6   bool             5000 non-null   bool 
 7   categorical      5000 non-null   category 
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 288.2+ KB

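The + symbol indicates that the true memory usage could be higher, because pandas does not count the memory used by values in columns with dtype=object.

Passing memory_usage='deep' will enable a more accurate memory usage report, accounting for the full usage of the contained objects. This is optional as it can be expensive to do this deeper introspection.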

In [7]: df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   int64            5000 non-null   int64 
 1   float64          5000 non-null   float64 
 2   datetime64[ns]   5000 non-null   datetime64[ns] 
 3   timedelta64[ns]  5000 non-null   timedelta64[ns]
 4   complex128       5000 non-null   complex128 
 5   object           5000 non-null   object 
 6   bool             5000 non-null   bool 
 7   categorical      5000 non-null   category 
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 424.7 KB

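By default the display option is set to True, but it can be explicitly overridden by passing the memory_usage argument when invoking info().

The memory usage of each column can be found by calling the memory_usage() method. This returns a Series whose index holds the column names and whose values are the memory usage of each column in bytes. For the DataFrame above, the memory usage of each column and the total memory usage can be found with the memory_usage() method: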

In [8]: df.memory_usage()
Out[8]: 
Index                128
int64              40000
float64            40000
datetime64[ns]     40000
timedelta64[ns]    40000
complex128         80000
object             40000
bool                5000
categorical         9968
dtype: int64

# total memory usage of dataframe
In [9]: df.memory_usage().sum()
Out[9]: 295096

By default the memory usage of the DataFrame index is included in the returned Series; it can be suppressed by passing the index=False argument:

In [10]: df.memory_usage(index=False)
Out[10]: 
int64              40000
float64            40000
datetime64[ns]     40000
timedelta64[ns]    40000
complex128         80000
object             40000
bool                5000
categorical         9968
dtype: int64

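The memory usage displayed by the info() method uses the memory_usage() method to determine the memory usage of a DataFrame, while also formatting the output in human-readable units (base-2 representation; i.e. 1KB = 1024 bytes).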
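The difference between the shallow and deep reports can be reproduced with memory_usage() directly. The following is a minimal sketch, not part of the original guide: the frame and its column names (numbers, labels) are made up for illustration, and the string column is forced to dtype=object so that the deep measurement has Python objects to inspect.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one fixed-width numeric column and one object column.
frame = pd.DataFrame(
    {
        "numbers": np.arange(1000, dtype="int64"),
        "labels": pd.Series([f"item-{i}" for i in range(1000)], dtype=object),
    }
)

# Shallow: an object column is counted as an array of pointers only.
shallow = frame.memory_usage(index=False)
# Deep: the Python objects the pointers refer to are measured as well.
deep = frame.memory_usage(index=False, deep=True)

# Convert the deep total to the base-2 unit that info() reports (1 KB = 1024 bytes).
total_kb = deep.sum() / 1024
print(f"deep total: {total_kb:.1f} KB")
```

The numeric column reports the same size in both modes, while the object column grows once its strings are counted, which is what the + sign in the info() output hints at.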

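See also Categorical memory usage.

Using if/truth statements with pandas

pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the boolean operations and, or, and not. It is not clear what the result of the following code should be: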

>>> if pd.Series([False, True, False]):
...     pass

Should it be True because it is not zero-length, or False because there are False values? It is unclear, so pandas raises a ValueError instead:

In [11]: if pd.Series([False, True, False]):
 ....:    print("I was true")
 ....: 
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
<ipython-input-11-5c782b38cd2f> in ?()
----> 1 if pd.Series([False, True, False]):
  2     print("I was true")

~/work/pandas/pandas/pandas/core/generic.py in ?(self)
  1575     @final
  1576     def __nonzero__(self) -> NoReturn:
-> 1577         raise ValueError(
  1578             f"The truth value of a {type(self).__name__} is ambiguous. "
  1579             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
  1580         )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

You need to be explicit about what you want to do with the DataFrame, e.g. use any(), all() or the empty attribute. Or, you might want to check whether the pandas object is None:

In [12]: if pd.Series([False, True, False]) is not None:
 ....:    print("I was not None")
 ....: 
I was not None

Below is how to check whether any of the values are True:

In [13]: if pd.Series([False, True, False]).any():
 ....:    print("I am any")
 ....: 
I am any
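The other explicit choices can be sketched the same way. This is an illustrative example rather than text from the original guide; the variable names are assumptions. It also shows the element-wise & operator, which is the usual replacement for the Python and keyword when combining conditions on a Series.

```python
import pandas as pd

flags = pd.Series([False, True, False])
empty_series = pd.Series([], dtype=bool)

any_true = bool(flags.any())   # at least one element is True
all_true = bool(flags.all())   # every element is True
is_empty = empty_series.empty  # `empty` is a property, so no parentheses

# Combine conditions element-wise with & (or |), not `and` (or `or`).
values = pd.Series([1, 5, 10])
in_range = (values > 2) & (values < 8)

# The `and` keyword calls bool() on the left-hand Series and raises.
try:
    (values > 2) and (values < 8)
except ValueError as err:
    message = str(err)
```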

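Bitwise boolean

Bitwise boolean operators like == and != return a boolean Series that holds the result of an element-wise comparison when compared to a scalar.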

In [14]: s = pd.Series(range(5))

In [15]: s == 4
Out[15]: 
0    False
1    False
2    False
3    False
4     True
dtype: bool

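See boolean comparisons for more examples.

Using the `in` operator

Using the Python in operator on a Series tests for membership in the index, not membership among the values.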

In [16]: s = pd.Series(range(5), index=list("abcde"))

In [17]: 2 in s
Out[17]: False

In [18]: 'b' in s
Out[18]: True

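If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and Series are dict-like. To test for membership in the values, use the method isin():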

In [19]: s.isin([2])
Out[19]: 
a    False
b    False
c     True
d    False
e    False
dtype: bool

In [20]: s.isin([2]).any()
Out[20]: True
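The key-versus-value distinction can be summarized in one place. This is a small sketch for illustration only; the DataFrame df and the variable names are assumptions. It relies on the fact that in on a DataFrame tests the column labels, and uses isin() whenever values are the target.

```python
import pandas as pd

s = pd.Series(range(5), index=list("abcde"))
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

label_in_index = "b" in s            # tests the index labels
value_in_index = 2 in s              # 2 is a value, not a label
value_in_values = 2 in s.to_numpy()  # tests the underlying values

column_present = "a" in df           # tests the column labels
# isin() is element-wise; reduce over both axes to get a single answer.
cell_present = bool(df.isin([5]).any().any())
```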

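For DataFrame, likewise, in applies to the column axis, testing for membership in the list of column names.

Mutating with User Defined Function (UDF) methods

This section applies to pandas methods that take a UDF. In particular, the methods DataFrame.apply(), DataFrame.aggregate(), DataFrame.transform(), and DataFrame.filter().

It is a general rule in programming that one should not mutate a container while it is being iterated over. Mutation will invalidate the iterator, causing unexpected behavior. Consider the example: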

In [21]: values = [0, 1, 2, 3, 4, 5]

In [22]: n_removed = 0

In [23]: for k, value in enumerate(values):
 ....:    idx = k - n_removed
 ....:    if value % 2 == 1:
 ....:        del values[idx]
 ....:        n_removed += 1
 ....:    else:
 ....:        values[idx] = value + 1
 ....: 

In [24]: values
Out[24]: [1, 4, 5]

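One probably would have expected that the result would be [1, 3, 5]. When using a pandas method that takes a UDF, internally pandas is often iterating over the DataFrame or other pandas object. Therefore, if the UDF mutates (changes) the DataFrame, unexpected behavior can arise.

Here is a similar example with DataFrame.apply():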

In [25]: def f(s):
 ....:    s.pop("a")
 ....:    return s
 ....: 

In [26]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

In [27]: df.apply(f, axis="columns")
---------------------------------------------------------------------------
KeyError  Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
  3804 try:
-> 3805     return self._engine.get_loc(casted_key)
  3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'a'

The above exception was the direct cause of the following exception:

KeyError  Traceback (most recent call last)
Cell In[27], line 1
----> 1 df.apply(f, axis="columns")

File ~/work/pandas/pandas/pandas/core/frame.py:10374, in DataFrame.apply(self, func, axis, raw, result_type, args, by_row, engine, engine_kwargs, **kwargs)
  10360 from pandas.core.apply import frame_apply
  10362 op = frame_apply(
  10363     self,
  10364     func=func,
   (...)
  10372     kwargs=kwargs,
  10373 )
> 10374 return op.apply().__finalize__(self, method="apply")

File ~/work/pandas/pandas/pandas/core/apply.py:916, in FrameApply.apply(self)
  913 elif self.raw:
  914     return self.apply_raw(engine=self.engine, engine_kwargs=self.engine_kwargs)
--> 916 return self.apply_standard()

File ~/work/pandas/pandas/pandas/core/apply.py:1063, in FrameApply.apply_standard(self)
  1061 def apply_standard(self):
  1062     if self.engine == "python":
-> 1063         results, res_index = self.apply_series_generator()
  1064     else:
  1065         results, res_index = self.apply_series_numba()

File ~/work/pandas/pandas/pandas/core/apply.py:1081, in FrameApply.apply_series_generator(self)
  1078 with option_context("mode.chained_assignment", None):
  1079     for i, v in enumerate(series_gen):
  1080         # ignore SettingWithCopy here in case the user mutates
-> 1081         results[i] = self.func(v, *self.args, **self.kwargs)
  1082         if isinstance(results[i], ABCSeries):
  1083             # If we have a view on v, we need to make a copy because
  1084             #  series_generator will swap out the underlying data
  1085             results[i] = results[i].copy(deep=False)

Cell In[25], line 2, in f(s)
  1 def f(s):
----> 2     s.pop("a")
  3     return s

File ~/work/pandas/pandas/pandas/core/series.py:5391, in Series.pop(self, item)
  5366 def pop(self, item: Hashable) -> Any:
  5367  """
  5368 Return item and drops from series. Raise KeyError if not found.
  5369  
 (...)
  5389 dtype: int64
  5390 """
-> 5391     return super().pop(item=item)

File ~/work/pandas/pandas/pandas/core/generic.py:947, in NDFrame.pop(self, item)
  946 def pop(self, item: Hashable) -> Series | Any:
--> 947     result = self[item]
  948     del self[item]
  950     return result

File ~/work/pandas/pandas/pandas/core/series.py:1121, in Series.__getitem__(self, key)
  1118     return self._values[key]
  1120 elif key_is_scalar:
-> 1121     return self._get_value(key)
  1123 # Convert generator to list before going through hashable part
  1124 # (We will iterate through the generator there to check for slices)
  1125 if is_iterator(key):

File ~/work/pandas/pandas/pandas/core/series.py:1237, in Series._get_value(self, label, takeable)
  1234     return self._values[label]
  1236 # Similar to Index.get_value, but we do not fall back to positional
-> 1237 loc = self.index.get_loc(label)
  1239 if is_integer(loc):
  1240     return self._values[loc]

File ~/work/pandas/pandas/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
  3807     if isinstance(casted_key, slice) or (
  3808         isinstance(casted_key, abc.Iterable)
  3809         and any(isinstance(x, slice) for x in casted_key)
  3810     ):
  3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
  3813 except TypeError:
  3814     # If we have a listlike key, _check_indexing_error will raise
  3815     #  InvalidIndexError. Otherwise we fall through and re-raise
  3816     #  the TypeError.
  3817     self._check_indexing_error(key)

KeyError: 'a'

To resolve this issue, one can make a copy so that the mutation does not apply to the container being iterated over.

In [28]: values = [0, 1, 2, 3, 4, 5]

In [29]: n_removed = 0

In [30]: for k, value in enumerate(values.copy()):
 ....:    idx = k - n_removed
 ....:    if value % 2 == 1:
 ....:        del values[idx]
 ....:        n_removed += 1
 ....:    else:
 ....:        values[idx] = value + 1
 ....: 

In [31]: values
Out[31]: [1, 3, 5]

In [32]: def f(s):
 ....:    s = s.copy()
 ....:    s.pop("a")
 ....:    return s
 ....: 

In [33]: df = pd.DataFrame({"a": [1, 2, 3], 'b': [4, 5, 6]})

In [34]: df.apply(f, axis="columns")
Out[34]: 
 b
0  4
1  5
2  6
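An alternative to copying inside the UDF is to avoid mutation altogether by calling a method that returns a new object. The sketch below is an illustration, not the guide's own recommendation; the function name drop_a is hypothetical. It uses Series.drop(), which returns a new Series and leaves the row it was given unchanged.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})


def drop_a(row: pd.Series) -> pd.Series:
    # Series.drop returns a new Series; the row passed in is not modified.
    return row.drop("a")


result = df.apply(drop_a, axis="columns")
print(result)
```

Because nothing is modified in place, the original df still has both columns after the call.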

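Missing value representation for NumPy types

`np.nan` as the `NA` representation for NumPy types

For lack of NA (missing) support from the ground up in NumPy and Python in general, NA could have been represented with:

A masked array solution: an array of data and an array of boolean values indicating whether a value is there or is missing.
Using a special sentinel value, bit pattern, or set of sentinel values to denote NA across the dtypes.

The special value np.nan (Not-A-Number) was chosen as the NA value for NumPy types, and there are API functions like DataFrame.isna() and DataFrame.notna() which can be used across the dtypes to detect NA values. However, this choice has the downside of coercing missing integer data to float types, as shown in Support for integer NA.

`NA` type promotions for NumPy types

When introducing NAs into an existing Series or DataFrame via reindex() or some other means, boolean and integer types will be promoted to a different dtype in order to store the NAs. The promotions are summarized in this table: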

Typeclass	Promotion dtype for storing NAs
`floating`	no change
`object`	no change
`integer`	cast to `float64`
`boolean`	cast to `object`
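The promotions in the table can be observed directly. The following is a minimal sketch assuming the default NumPy-backed dtypes; the series names and the missing label "z" are made up for illustration.

```python
import pandas as pd

ints = pd.Series([1, 2, 3], index=list("abc"))
bools = pd.Series([True, False, True], index=list("abc"))
floats = pd.Series([1.5, 2.5, 3.5], index=list("abc"))

# "z" is not in the original index, so reindexing introduces a missing value.
new_index = ["a", "b", "z"]
promoted = {
    "integer": ints.reindex(new_index).dtype,
    "boolean": bools.reindex(new_index).dtype,
    "floating": floats.reindex(new_index).dtype,
}
for typeclass, dtype in promoted.items():
    print(typeclass, "->", dtype)
```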

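Support for integer `NA`

In the absence of high performance NA support being built into NumPy from the ground up, the primary casualty is the ability to represent NAs in integer arrays. For example: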

In [35]: s = pd.Series([1, 2, 3, 4, 5], index=list("abcde"))

In [36]: s
Out[36]: 
a    1
b    2
c    3
d    4
e    5
dtype: int64

In [37]: s.dtype
Out[37]: dtype('int64')

In [38]: s2 = s.reindex(["a", "b", "c", "f", "u"])

In [39]: s2
Out[39]: 
a    1.0
b    2.0
c    3.0
f    NaN
u    NaN
dtype: float64

In [40]: s2.dtype
Out[40]: dtype('float64')

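This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be "numeric".

If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes provided by pandas or pyarrow: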

Int8Dtype
Int16Dtype
Int32Dtype
Int64Dtype
ArrowDtype

In [41]: s_int = pd.Series([1, 2, 3, 4, 5], index=list("abcde"), dtype=pd.Int64Dtype())

In [42]: s_int
Out[42]: 
a    1
b    2
c    3
d    4
e    5
dtype: Int64

In [43]: s_int.dtype
Out[43]: Int64Dtype()

In [44]: s2_int = s_int.reindex(["a", "b", "c", "f", "u"])

In [45]: s2_int
Out[45]: 
a       1
b       2
c       3
f    <NA>
u    <NA>
dtype: Int64

In [46]: s2_int.dtype
Out[46]: Int64Dtype()

In [47]: s_int_pa = pd.Series([1, 2, None], dtype="int64[pyarrow]")

In [48]: s_int_pa
Out[48]: 
0       1
1       2
2    <NA>
dtype: int64[pyarrow]

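See Nullable integer data type and PyArrow Functionality for more.

Why not make NumPy like R?

Many people have suggested that NumPy should simply emulate the NA support present in the more domain-specific statistical programming language R. Part of the reason it does not is the NumPy type hierarchy: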

Typeclass	Dtypes
`numpy.floating`	`float16, float32, float64, float128`
`numpy.integer`	`int8, int16, int32, int64`
`numpy.unsignedinteger`	`uint8, uint16, uint32, uint64`
`numpy.object_`	`object_`
`numpy.bool_`	`bool_`
`numpy.character`	`bytes_, str_`

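The R language, by contrast, only has a handful of built-in data types: integer, numeric (floating-point), character, and boolean. NA types are implemented by reserving special bit patterns for each type to be used as the missing value. While doing this with the full NumPy type hierarchy would be possible, it would be a more substantial trade-off (especially for the 8- and 16-bit data types) and a larger implementation undertaking.

However, R NA semantics are now available by using masked NumPy types such as Int64Dtype or PyArrow types (ArrowDtype).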

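Differences with NumPy

For Series and DataFrame objects, var() normalizes by N-1 to produce unbiased estimates of the population variance, while NumPy's numpy.var() normalizes by N, which measures the variance of the sample. Note that cov() normalizes by N-1 in both pandas and NumPy.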
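The two normalizations can be compared on a small series, and either library can be made to match the other through the ddof argument. This is an illustrative sketch; the data values are arbitrary.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
arr = s.to_numpy()

pandas_default = s.var()       # divides by N-1 (ddof=1)
numpy_default = np.var(arr)    # divides by N (ddof=0)

# Passing ddof explicitly makes the two libraries agree.
pandas_as_numpy = s.var(ddof=0)
numpy_as_pandas = np.var(arr, ddof=1)

print(pandas_default, numpy_default)
```

The sum of squared deviations here is 5, so the default results are 5/3 for pandas and 5/4 for NumPy.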

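Thread-safety

pandas is not 100% thread safe. The known issues relate to the copy() method. If you are doing a lot of copying of DataFrame objects shared among threads, we recommend holding locks inside the threads where the data copying occurs.

See this link for more information.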
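One way to follow this recommendation is to guard every copy of the shared object with a single lock. The sketch below is only an illustration under assumptions: the names shared, copy_lock, and worker are hypothetical, and the standard-library threading.Lock is one possible choice of lock, not something the guide prescribes.

```python
import threading

import pandas as pd

shared = pd.DataFrame({"a": range(1000), "b": range(1000)})
copy_lock = threading.Lock()  # hypothetical lock guarding every copy of `shared`
results = []


def worker() -> None:
    # Hold the lock only while copying; work on the private copy afterwards.
    with copy_lock:
        local = shared.copy()
    results.append(int(local["a"].sum()))


threads = [threading.Thread(target=worker) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Keeping the locked region limited to the copy() call lets the threads do the rest of their work in parallel on their own copies.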

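Byte-ordering issues

Occasionally you may have to deal with data that were created on a machine with a different byte order than the one on which you are running Python. A common symptom of this issue is an error like: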

Traceback
    ...
ValueError: Big-endian buffer not supported on little-endian compiler

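To deal with this issue you should convert the underlying NumPy array to the native system byte order before passing it to the Series or DataFrame constructors, using something similar to the following: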

In [49]: x = np.array(list(range(10)), ">i4")  # big endian

In [50]: newx = x.byteswap().view(x.dtype.newbyteorder())  # force native byteorder

In [51]: s = pd.Series(newx)

See the NumPy documentation on byte order for more details.
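As an alternative sketch, the conversion can also be written with astype() and a native-order dtype, which lets NumPy handle the byte swap and skips it when the array is already native. This is an illustration rather than the guide's own recipe; the variable names are assumptions.

```python
import numpy as np
import pandas as pd

x = np.array(list(range(10)), ">i4")  # big-endian 32-bit integers

# "=" requests the byte order of the machine running the code.
native_dtype = x.dtype.newbyteorder("=")
newx = x.astype(native_dtype)  # values are preserved, bytes are reordered if needed

s = pd.Series(newx)
print(newx.dtype.isnative, s.dtype)
```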
