高效能的Python擴充套件（3）

Janzou發表於2014-11-21

簡介

本文是這個系列的第三篇，我們關注於使用NumPy API為Python編寫高效能的C擴充套件模組。在本文中，我們將使用OpenMP來並行第二部分中的實現。

回顧

Wrold是儲存N體狀態的一個類。我們的模擬將演化一系列時間步長下的狀態。

class World(object):
    """World is a structure that holds the state of N bodies and
    additional variables.

    threads : (int) The number of threads to use for multithreaded
              implementations.
    dt      : (float) The time-step. 

    STATE OF THE WORLD: 

    N : (int) The number of bodies in the simulation.
    m : (1D ndarray) The mass of each body.
    r : (2D ndarray) The position of each body.
    v : (2D ndarray) The velocity of each body.
    F : (2D ndarray) The force on each body.

    TEMPORARY VARIABLES:

    Ft : (3D ndarray) A 2D force array for each thread's local storage.
    s  : (2D ndarray) The vectors from one body to all others. 
    s3 : (1D ndarray) The norm of each s vector. 

    NOTE: Ft is used by parallel algorithms for thread-local
          storage. s and s3 are only used by the Python
          implementation.
    """
    def __init__(self, N, threads=1, 
                 m_min=1, m_max=30.0, r_max=50.0, v_max=4.0, dt=1e-3):
        self.threads = threads
        self.N  = N
        self.m  = np.random.uniform(m_min, m_max, N)
        self.r  = np.random.uniform(-r_max, r_max, (N, 2))
        self.v  = np.random.uniform(-v_max, v_max, (N, 2))
        self.F  = np.zeros_like(self.r)
        self.Ft = np.zeros((threads, N, 2))
        self.s  = np.zeros_like(self.r)
        self.s3 = np.zeros_like(self.m)
        self.dt = dt

class World(object):

"""World is a structure that holds the state of N bodies and

additional variables.

threads : (int) The number of threads to use for multithreaded

implementations.

dt : (float) The time-step.

STATE OF THE WORLD:

N : (int) The number of bodies in the simulation.

m : (1D ndarray) The mass of each body.

r : (2D ndarray) The position of each body.

v : (2D ndarray) The velocity of each body.

F : (2D ndarray) The force on each body.

TEMPORARY VARIABLES:

Ft : (3D ndarray) A 2D force array for each thread's local storage.

s : (2D ndarray) The vectors from one body to all others.

s3 : (1D ndarray) The norm of each s vector.

NOTE: Ft is used by parallel algorithms for thread-local

storage. s and s3 are only used by the Python

implementation.

"""

def __init__(self, N, threads=1,

m_min=1, m_max=30.0, r_max=50.0, v_max=4.0, dt=1e-3):

self.threads = threads

self.N = N

self.m = np.random.uniform(m_min, m_max, N)

self.r = np.random.uniform(-r_max, r_max, (N, 2))

self.v = np.random.uniform(-v_max, v_max, (N, 2))

self.F = np.zeros_like(self.r)

self.Ft = np.zeros((threads, N, 2))

self.s = np.zeros_like(self.r)

self.s3 = np.zeros_like(self.m)

self.dt = dt

在開始模擬時，N體被隨機分配質量m，位置r和速度v。對於每個時間步長，接下來的計算有：

合力F，每個體上的合力根據所有其他體的計算。
速度v，由於力的作用每個體的速度被改變。
位置r，由於速度每個體的位置被改變。

計算力：序列程式碼

下面是之前文章實現中（全部的原始碼在這裡）的compute_F函式。這個函式計算模擬中每對體之間的相互作用力，其複雜度為O(N^2)。

static inline void compute_F(npy_int64 N,
                             npy_float64 *m,
                             __m128d *r,
                             __m128d *F) {
  npy_int64 i, j;
  __m128d s, s2, tmp;
  npy_float64 s3;

  // Set all forces to zero.
  for(i = 0; i < N; ++i) {
    F[i] = _mm_set1_pd(0);
  }

  // Compute forces between pairs of bodies. 
  for(i = 0; i < N; ++i) {
    for(j = i + 1; j < N; ++j) {
      s = r[j] - r[i];

      s2 = s * s;
      s3 = sqrt(s2[0] + s2[1]);
      s3 *= s3 * s3;

      tmp = s * m[i] * m[j] / s3;
      F[i] += tmp;
      F[j] -= tmp;
    }
  }
}

static inline void compute_F(npy_int64 N,

npy_float64 *m,

__m128d *r,

__m128d *F) {

npy_int64 i, j;

__m128d s, s2, tmp;

npy_float64 s3;

// Set all forces to zero.

for(i = 0; i < N; ++i) {

F[i] = _mm_set1_pd(0);

}

// Compute forces between pairs of bodies.

for(i = 0; i < N; ++i) {

for(j = i + 1; j < N; ++j) {

s = r[j] - r[i];

s2 = s * s;

s3 = sqrt(s2[0] + s2[1]);

s3 *= s3 * s3;

tmp = s * m[i] * m[j] / s3;

F[i] += tmp;

F[j] -= tmp;

}

效能

使用這一系列的實現，我們的程式碼在i5桌上型電腦上能每秒執行26427個時間步長。

計算力：並行程式碼

下面是重新實現的compute_F函式，它使用OpenMP來並行迴圈。完整的原始檔src/simdomp1.c可以在github上獲得。

static inline void compute_F(npy_int64 threads,                            
                             npy_int64 N,
                             npy_float64 *m,
                             __m128d *r,
                             __m128d *F, 
                             __m128d *Ft) {
  npy_int64 id, i, j, Nid;
  __m128d s, s2, tmp;
  npy_float64 s3;

#pragma omp parallel private(id, i, j, s, s2, s3, tmp, Nid) 
  {
    id = omp_get_thread_num();
    Nid = N * id; // Zero-index in thread-local array Ft. 

    // Zero out the thread-local force arrays. 
    for(i = 0; i < N; i++) {
      Ft[Nid + i] = _mm_set1_pd(0);
    }

    // Compute forces between pairs of bodies.
#pragma omp for schedule(dynamic, 8) 
    for(i = 0; i < N; ++i) {
      F[i] = _mm_set1_pd(0);

      for(j = i + 1; j < N; ++j) {

        s = r[j] - r[i];
        s2 = s * s;
        s3 = sqrt(s2[0] + s2[1]);
        s3 *= s3 * s3;

        tmp = s * m[i] * m[j] / s3;

        Ft[Nid + i] += tmp;
        Ft[Nid + j] -= tmp;
      }
    }

    // Sum the thread-local forces computed above.
#pragma omp for 
    for(i = 0; i < N; ++i) {
      for(id = 0; id < threads; ++id) {
        F[i] += Ft[N*id + i];
      }
    }
  }
}

static inline void compute_F(npy_int64 threads,

npy_int64 N,

npy_float64 *m,

__m128d *r,

__m128d *F,

__m128d *Ft) {

npy_int64 id, i, j, Nid;

__m128d s, s2, tmp;

npy_float64 s3;

#pragma omp parallel private(id, i, j, s, s2, s3, tmp, Nid)

{

id = omp_get_thread_num();

Nid = N * id; // Zero-index in thread-local array Ft.

// Zero out the thread-local force arrays.

for(i = 0; i < N; i++) {

Ft[Nid + i] = _mm_set1_pd(0);

}

// Compute forces between pairs of bodies.

#pragma omp for schedule(dynamic, 8)

for(i = 0; i < N; ++i) {

F[i] = _mm_set1_pd(0);

for(j = i + 1; j < N; ++j) {

s = r[j] - r[i];

s2 = s * s;

s3 = sqrt(s2[0] + s2[1]);

s3 *= s3 * s3;

tmp = s * m[i] * m[j] / s3;

Ft[Nid + i] += tmp;

Ft[Nid + j] -= tmp;

}

// Sum the thread-local forces computed above.

#pragma omp for

for(i = 0; i < N; ++i) {

for(id = 0; id < threads; ++id) {

F[i] += Ft[N*id + i];

}

執行緒區域性儲存

序列版本和並行版本之間最明顯的不同在於Ft陣列的出現。Ft陣列通過每個同時執行的執行緒為力的計算提供執行緒區域性儲存。這些區域性的陣列然後在一個單獨的迴圈中相加，從而得到一個力F的陣列。區域性儲存對於避免競爭條件是有必要的。可以想象一下，如果一個執行緒正在執行i=0和j=1，而此時另一個執行緒正在使用i=1和j=2。在我們的演算法中，它們都試圖修改F[1]，這導致了一種競爭。你可以在維基百科上閱讀有關競爭條件（race conditions）的內容。

OpenMP

OpenMP API通過使用如omp_get_thread_num這樣的函式和如#pragma omp parallel這樣的指令來執行。在上面的程式碼中，我們建立了許多有#pragma omp parallel指令的執行緒。private指令列出了每個執行緒的私有變數。所有其他變數是公有的。 #pragma omp指令允許for迴圈並行進行，每個執行緒處理不同的指數。我們使用schedule(dynamic, 8)指令來告訴編譯器使用8塊大小的動態排程。當每個迴圈花費的時間不同時，動態排程是有用的，正如這裡的這種情況。需要注意的是，在每個並行for迴圈的末尾有一個隱式屏障。所有的執行緒在任何可以提前之前都必須完成迴圈。在OpenMP 網站上參見有關該API的更多資訊。

效能

因為我們增加了一個額外的for迴圈來相加區域性執行緒的力的陣列，我們可以預期單核效能會稍受損失。在我們的測試中，在使用單執行緒時，OpenMP版本的執行速度比沒有OpenMP的版本慢約3%。在Intel i5桌上型電腦上，使用四核及四執行緒，OpenMP使用SIMD指令執行，每秒80342個時間步長。這比我們原來使用NumPy的Python實現塊312倍，比我們最快的序列實現快3倍。

擴充套件性

上圖顯示OpenMP程式碼是如何在四核i5桌上型電腦上擴充套件執行緒數量的。標有”OMP+SIMD”的點對應的是本文中的實現。一個類似的版本可以在這裡獲得，沒有明確的向量指令標有”OMP”。使用Python和NumPy的原始版本被用於比較（左下）。效能可以很好的達到物理核的數量，並大幅減少額外執行緒的增加。請注意，這不是一般的事實，所以使用不同數量的執行緒來衡量你的程式碼總是一個好主意。

環境變數

這裡有許多環境變數可以改變使用OpenMp程式的行為。這些都對效能有重大的影響，特別是對有多個物理處理器的計算機。請參閱GNU libgomp 文件瞭解詳細資訊。

效能總結：所有實現

這裡是這個系列文章中所有實現版本的基準。這個執行在一個四核Intel i5桌上型電腦上，在Debian Wheezy系統下使用gcc 4.7.2。

實現	時間步長/每秒	加速
Python+Numpy	257	1
Simple C 1	17974	70
Simple C 2	26136	102
SIMD	26423	103
OMP 4 cores	76657	298
OMP+SIMD 4 cores	80342	313

附加測試

DigitalOcean20核虛擬機器

在DigitalOcean20核虛擬機器例項上執行上面相同的最快的實現。3個因素的最好的效能比在四核四執行緒的i5桌上型電腦上執行時的效能差。我不知道這個差異的原因是否是因為虛擬化和共享核心。

RunAbove176執行緒虛擬機器

RunAbove對他們IBM Power 8架構免費一個月的使用進行推廣。為了好玩，我在他們176執行緒的例項上執行了相同的實現。之前我從來沒有使用過PowerPC架構，我很高興程式碼執行不需要任何修改。

Cython

Matthew Honnibal提交了一個Cythond的實現，可以在這裡獲得。這個實現和我們當初在第一部分中的C語言實現有些類似。

結論

當效能至關重要時，一個C語言擴充套件可以顯著提升效能。使用SIMD指令可以使效能受益並簡化程式碼，但有移植的成本。OpenMP的支援可以輕鬆地增加現有程式碼，並能提高在多核系統中的效能。如果您有任何疑問，意見，建議或更正，請通過聯絡連結告訴我。

kotlin 擴充套件（擴充套件函式和擴充套件屬性）
2019-02-26
Kotlin套件函式
PHP 高效能 Excel 擴充套件 1.2.7 釋出
2019-09-09
PHPExcel套件
使用ctypes來擴充套件Python
2019-05-11
套件Python
使用cython擴充套件python庫
2024-11-27
套件Python
Yaconf – 一個高效能的配置管理擴充套件
2019-02-16
套件
3 種擴充套件 Kubernetes 能力的方式
2018-03-27
套件
burpsuite擴充套件開發之Python
2020-08-19
UI套件Python
Python為什麼能擴充套件
2021-09-11
Python套件
【Kotlin】擴充套件屬性、擴充套件函式
2024-04-08
Kotlin套件函式
VS Code 的 Python 擴充套件有大更新
2019-04-26
Python套件
python算術運算子的擴充套件功能
2021-09-11
Python套件
?用Chrome擴充套件管理器, 管理你的擴充套件
2019-03-18
Chrome套件
PHP擴充套件開發就是一個自己的PHP擴充套件
2020-11-05
PHP套件
推薦3款超好用的Chrome擴充套件
2018-12-12
Chrome套件
python擴充套件檔案import失敗
2024-04-27
Python套件Import
練習安裝Python擴充套件庫
2024-04-27
Python套件
擴充套件工具
2020-11-21
套件
Sanic 擴充套件
2019-04-21
套件
Mybatis擴充套件
2024-08-14
MyBatis套件
SpringMVC 擴充套件
2024-08-17
SpringMVC套件
ORACLE 擴充套件
2021-02-26
Oracle套件
正則的擴充套件
2019-01-19
套件
SRAM的容量擴充套件
2020-12-01
套件
使用Kotlin擴充套件函式擴充套件Spring Data案例
2021-11-11
Kotlin套件函式Spring
JMeter 擴充套件開發：擴充套件 TCP 取樣器
2022-12-01
JMeter套件TCP
vue3 setup語法糖擴充套件
2024-09-12
Vue套件
ASP.NET Core擴充套件庫之Http通用擴充套件
2021-04-19
ASP.NET套件HTTP
使用 Cython 為 Python 編寫更快的 C 擴充套件
2019-05-15
Python套件
Python+numpy實現矩陣的行列擴充套件
2018-12-27
Python矩陣套件
C-如何快速生成Python的C擴充套件.md
2022-01-09
Python套件
Solon詳解（六）- Solon的校驗擴充套件框架使用與擴充套件
2020-09-09
套件框架
[擴充套件推薦]Aliyun-oss-laravel —— Laravel最好的OSS Storage擴充套件
2021-03-23
套件Laravel
[譯] 構建高效能和可擴充套件性 Node.js 應用的最佳實踐 [第 3/3 部分]
2019-03-03
套件Node.js
iOS 通知擴充套件
2018-10-08
iOS套件
swift擴充套件Extensions
2018-09-05
Swift套件
擴充套件BSGS/exBSGS
2024-03-10
套件
Json擴充套件方法
2024-03-23
JSON套件
分類擴充套件
2019-05-11
套件
提高擴充套件性
2024-11-29
套件

高效能的Python擴充套件（3）

回顧

計算力：序列程式碼

計算力：並行程式碼

執行緒區域性儲存

OpenMP

效能

擴充套件性

環境變數

效能總結：所有實現

附加測試

DigitalOcean20核虛擬機器

RunAbove176執行緒虛擬機器

Cython

結論

相關文章