Random Sampling and Probability Distributions in Python (Part 1)

Posted by orion on 2021-12-05

Original URL: https://www.cnblogs.com/orion-orion/p/15647408.html

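Python, together with the NumPy package, offers many probability routines, ranging from basic random sampling to generators for many classic probability distributions. This series introduces several probability functions that are commonly used in machine learning. We start with the most basic capability: random sampling.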

1. `random.choice`

If we only need to draw a single sample from a sequence, with every element equally likely to be drawn, random.choice is enough:

import random
res1 = random.choice([0, 1, 2, 3, 4])
print(res1) # e.g. 3
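
As a quick sanity check of the equal-probability claim, the sketch below draws many samples with random.choice and counts how often each value appears. The seed and the number of draws are arbitrary choices made here for illustration only.

```python
import random
from collections import Counter

random.seed(0)  # arbitrary seed, only to make the run repeatable
population = [0, 1, 2, 3, 4]
n_draws = 50_000  # arbitrary number of draws

# Count how many times each element is returned by random.choice.
counts = Counter(random.choice(population) for _ in range(n_draws))
freqs = {x: counts[x] / n_draws for x in population}
print(freqs)  # each frequency should be close to 0.2
```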

2. `random.choices` (with replacement)

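Often, though, we need more than one sample, and we want each item in the sequence to have a different probability of being drawn. For this we can use random.choices, which samples from a sequence with replacement (that is, the same item can be drawn more than once). Its signature is: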

random.choices(population, weights=None, *, cum_weights=None, k=1)

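population: the sequence to sample from
weights: the weight assigned to each item (also called relative weights), which determines its probability of being drawn, e.g. [10, 0, 30, 60, 0]
cum_weights: cumulative weights; the relative weights [10, 0, 30, 60, 0] are equivalent to the cumulative weights [10, 10, 40, 100, 100]
k: the number of samples to draw (default 1)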

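Drawing 3 samples from [0, 1, 2, 3, 4] according to relative weights:

res2 = random.choices([0, 1, 2, 3, 4], weights=[10, 0, 30, 60, 0], k=3)
# Note: population may be passed positionally or by keyword, whereas cum_weights and k are keyword-only (they follow the bare * in the signature)
# For keyword and positional arguments, see my post "Python Techniques 2: Advanced Usage of Function Arguments" https://www.cnblogs.com/orion-orion/p/15647408.html
print(res2) # e.g. [3, 3, 2]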

Drawing 3 samples from [0, 1, 2, 3, 4] according to cumulative weights:

res3 = random.choices([0, 1, 2, 3, 4], cum_weights=[10, 10, 40, 100, 100], k=3)
print(res3) # e.g. [0, 3, 3]
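
The relationship between the two kinds of weights can be checked directly: itertools.accumulate turns relative weights into cumulative weights. The sketch below also compares the two calling styles under the same seed. In CPython, random.choices accumulates weights internally, so the draws are expected to match; treat that as an observation about the current implementation rather than a documented guarantee. The seed and k are arbitrary.

```python
import random
from itertools import accumulate

population = [0, 1, 2, 3, 4]
weights = [10, 0, 30, 60, 0]

# Running totals of the relative weights give the cumulative weights.
cum_weights = list(accumulate(weights))
print(cum_weights)  # [10, 10, 40, 100, 100]

seed = 42  # arbitrary seed used for both generators
by_weights = random.Random(seed).choices(population, weights=weights, k=5)
by_cum = random.Random(seed).choices(population, cum_weights=cum_weights, k=5)
print(by_weights == by_cum)
```

Items with weight 0 (here 1 and 4) are never drawn in either form.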

Note that the relative weights (weights) and the cumulative weights (cum_weights) cannot both be passed; doing so raises a TypeError with the message 'Cannot specify both weights and cumulative weights'.
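
A small sketch of that failure mode, catching the exception so the script keeps running (the population and weights are made up for illustration):

```python
import random

error_message = None
try:
    # Passing both weights and cum_weights is rejected.
    random.choices([0, 1, 2], weights=[1, 1, 1], cum_weights=[1, 2, 3], k=2)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```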

3. `random.sample` (without replacement)

If we need sampling without replacement (each item can be drawn at most once), we should use random.sample. Note that this function does not let us specify per-item weights. Its signature is:

random.sample(population, k, *, counts=None)

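population: the sequence to sample from
k: the number of elements to draw
counts: used when population is a multiset (elements may repeat); it gives the number of repetitions of each element (available since Python 3.9). sample(['red', 'blue'], counts=[4, 2], k=5) is equivalent to sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)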

Sampling 3 items from the sequence [0, 1, 2, 3, 4] without replacement:

res3 = random.sample([0, 1, 2, 3, 4], k=3)
print(res3) # e.g. [3, 2, 1]

Sampling 3 items without replacement from the multiset [0, 1, 1, 2, 2, 3, 3, 4]:

res4 = random.sample([0, 1, 2, 3, 4], k=3, counts=[1, 2, 2, 2, 1])
print(res4) # e.g. [3, 2, 2]

If the length of counts differs from the length of population, a ValueError is raised: "The number of counts does not match the population".
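
The behaviour of counts can be illustrated with a short sketch (it assumes Python 3.9 or later, where counts is available; the seed is arbitrary). Drawing as many items as the multiset contains returns the whole multiset in shuffled order, and a counts list of the wrong length is rejected.

```python
import random
from collections import Counter

population = [0, 1, 2, 3, 4]
counts = [1, 2, 2, 2, 1]  # 0 and 4 appear once, the others twice

rng = random.Random(7)  # arbitrary seed
# k equals sum(counts), so the whole multiset is drawn in random order.
drawn = rng.sample(population, k=8, counts=counts)
print(sorted(drawn))  # [0, 1, 1, 2, 2, 3, 3, 4]

# No element appears more often than its count allows.
limits = dict(zip(population, counts))
print(all(n <= limits[x] for x, n in Counter(drawn).items()))  # True

mismatch_message = None
try:
    rng.sample(population, k=3, counts=[1, 2])  # counts has the wrong length
except ValueError as exc:
    mismatch_message = str(exc)
    print(mismatch_message)
```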

4. `rng.choices` and `rng.sample`

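Another way to sample, with or without replacement, is one I learned from the code [2] of the paper [1]: first create a random number generator, then call its choices or sample method. These methods are used in the same way as the random.choices and random.sample functions.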

rng_seed = 1234
rng = random.Random(rng_seed)
res5 = rng.choices(
     population=[0,1,2,3,4],
     weights=[0.1, 0, 0.3, 0.6, 0],
     k=3,
)
print(res5) # [3, 3, 0]

res6 = rng.sample(
     population=[0, 1, 2, 3, 4],
     k=3,
)
print(res6) # [4, 0, 2]
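
One practical reason to use a dedicated generator is reproducibility: two random.Random instances created with the same seed produce the same sequence, independently of the module-level generator. A minimal sketch, reusing the seed from the example above:

```python
import random

population = [0, 1, 2, 3, 4]
weights = [0.1, 0, 0.3, 0.6, 0]

rng_a = random.Random(1234)
rng_b = random.Random(1234)

draws_a = rng_a.choices(population, weights=weights, k=3)
random.random()  # using the module-level generator does not affect rng_b
draws_b = rng_b.choices(population, weights=weights, k=3)

print(draws_a == draws_b)  # True: same seed, same sequence
```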

In the implementation [2] of the paper [1], these two methods are used to randomly select the task nodes (clients):

def sample_clients(self):
    """
    sample a list of clients without repetition

    """
    # Note: `seed` is not defined in this excerpt; it has to be supplied by the enclosing code
    rng_seed = (seed if (seed is not None and seed >= 0) else int(time.time()))
    self.rng = random.Random(rng_seed)

    if self.sample_with_replacement:
        self.sampled_clients = \
            self.rng.choices(
                population=self.clients,
                weights=self.clients_weights,
                k=self.n_clients_per_round,
            )
    else:
        self.sampled_clients = self.rng.sample(self.clients, k=self.n_clients_per_round)
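
To try this pattern in isolation, here is a self-contained sketch. The class name ClientSampler, its constructor arguments, the client names, and the weights are all hypothetical and are not taken from [2]; only the sampling logic mirrors the excerpt above.

```python
import random
import time


class ClientSampler:
    """Hypothetical helper for illustration; not the class used in [2]."""

    def __init__(self, clients, clients_weights, n_clients_per_round,
                 sample_with_replacement=False, seed=None):
        self.clients = clients
        self.clients_weights = clients_weights
        self.n_clients_per_round = n_clients_per_round
        self.sample_with_replacement = sample_with_replacement
        # Fall back to the current time when no valid seed is given.
        rng_seed = seed if (seed is not None and seed >= 0) else int(time.time())
        self.rng = random.Random(rng_seed)

    def sample_clients(self):
        if self.sample_with_replacement:
            # Weighted draw; the same client may be picked more than once.
            return self.rng.choices(
                population=self.clients,
                weights=self.clients_weights,
                k=self.n_clients_per_round,
            )
        # Uniform draw; every client is picked at most once.
        return self.rng.sample(self.clients, k=self.n_clients_per_round)


sampler = ClientSampler(
    clients=["c0", "c1", "c2", "c3"],      # made-up client names
    clients_weights=[0.4, 0.3, 0.2, 0.1],  # made-up weights
    n_clients_per_round=2,
    seed=0,
)
round_clients = sampler.sample_clients()
print(round_clients)
```

In this sketch the generator is created once in the constructor. If it were re-created with the same fixed seed on every call, each round would select the same clients.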

5. `numpy.random.choice`

Weighted sampling from a sequence can also be done with numpy.random.choice. Its signature is:

numpy.random.choice(a, size=None, replace=True, p=None)

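a: 1-D array-like or int. If it is a 1-D array-like, samples are drawn from its elements. If it is an int, samples are drawn from np.arange(a).
size: int or tuple of ints, optional. The output shape; for a shape \((m, n, k)\), \(m\times n\times k\) samples are drawn. Defaults to None, in which case a single scalar is returned.
replace: boolean, optional. Whether to sample with or without replacement. If replace=True, sampling is with replacement (a value can be drawn multiple times); otherwise it is without replacement (a value can be drawn only once).
p: 1-D array-like, optional. The probability of each entry of a being drawn; the probabilities must sum to 1. If not given, a uniform distribution over the entries of a is assumed (every entry is equally likely).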
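
The effect of a and size is easiest to see on the shapes of the results. A brief sketch (the seed and the list of letters are arbitrary choices for illustration):

```python
import numpy as np

np.random.seed(0)  # arbitrary seed

scalar = np.random.choice(5)                    # size=None: a single value from np.arange(5)
grid = np.random.choice(5, size=(2, 3))         # 2 x 3 = 6 draws arranged as a 2 x 3 array
letters = np.random.choice(["a", "b", "c"], 4)  # drawing from an array-like instead of an int

print(grid.shape)     # (2, 3)
print(letters.shape)  # (4,)
```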

Sampling 3 times from [0, 1, 2, 3, 4] (that is, np.arange(5)), with and without repetition:

import numpy as np
res1 = np.random.choice(5, 3, replace=True)
print(res1) # e.g. [1 1 4]

res2 = np.random.choice(5, 3, replace=False)
print(res2) # e.g. [2 1 4]

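Again sampling 3 times from [0, 1, 2, 3, 4], with and without repetition, this time assigning a different probability to each item:

res3 = np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
print(res3)  # e.g. [2 3 3]

res4 = np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
print(res4) # e.g. [3 2 0] (always some ordering of 0, 2 and 3, the only entries with nonzero probability)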
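
NumPy also provides a newer Generator interface, created with np.random.default_rng, whose choice method accepts the same a, size, replace and p parameters. It plays a role similar to random.Random in section 4: the generator carries its own seed. The following is a minimal sketch with an arbitrary seed, not a replacement for the examples above.

```python
import numpy as np

rng = np.random.default_rng(1234)  # arbitrary seed
p = [0.1, 0, 0.3, 0.6, 0]

with_repl = rng.choice(5, size=3, p=p)
without_repl = rng.choice(5, size=3, replace=False, p=p)

print(with_repl)
print(sorted(without_repl.tolist()))  # [0, 2, 3]: only three entries have nonzero probability

too_many_message = None
try:
    rng.choice(5, size=6, replace=False)  # asks for more distinct values than exist
except ValueError as exc:
    too_many_message = str(exc)
    print(too_many_message)
```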

References

[1] Marfoq O, Neglia G, Bellet A, et al. Federated multi-task learning under a mixture of distributions[J]. Advances in Neural Information Processing Systems, 2021, 34.
[2] https://github.com/omarfoq/FedEM
[3] https://www.python.org/
[4] https://numpy.org/
