CUDA 的隨機數演算法 API

Aurelius84發表於2023-05-06

原文網址 : https://www.cnblogs.com/CocoML/p/17376020.html

參考自 Nvidia cuRand 官方 API 檔案

一、具體使用場景

如下是是在 dropout 最佳化中手寫的 uniform_random 的 Kernel：

#include <cuda_runtime.h>
#include <curand_kernel.h>

__device__ inline float cinn_nvgpu_uniform_random_fp32(int seed){
  curandStatePhilox4_32_10_t state;
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  curand_init(seed, idx, 1, &state);
  return curand_uniform(&state);
}

二、API 解析

我們首先來看 curand_init 函式的簽名和語義：

__device__ void
curand_init(unsigned long long seed,
            unsigned long long subsequence,
            unsigned long long offset,
            curandStatePhilox4_32_10_t *state)

給定相同的seed、sequence、offset 引數下，curand_init 會保證產生相同的其實狀態 state。另外此函式會在呼叫 2^67 ⋅ sequence + offset次 cu_rand API 之後「重置」為起始狀態。關於 sequence 和 offset 的如何生效的機制，參考 StackOverflow。

注意：在 seed 相同、sequence 不同時，一般不會產生有統計學上關聯的結果。原文：Sequences generated with the same seed and different sequence numbers will not have statistically correlated values.

另外在CUDA 並行產生隨機數實踐上，也有一些經驗上的建議：

若保證高質量的偽隨機數，建議使用不同的 seed
若是並行在一個「實驗」裡，建議指定不同的sequence引數，且最好「單調遞增」
若果執行緒裡的config都是一樣，即 state 一樣，則可以把「隨機狀態量」放到 global memory裡，以減少 setup 開銷

參考原文：For the highest quality parallel pseudorandom number generation, each experiment should be assigned a unique seed. Within an experiment, each thread of computation should be assigned a unique sequence number. If an experiment spans multiple kernel launches, it is recommended that threads between kernel launches be given the same seed, and sequence numbers be assigned in a monotonically increasing way. If the same configuration of threads is launched, random state can be preserved in global memory between launches to avoid state setup time.

然後我們看下Nvidia 主要提供了哪些常用的隨機數生成API：

__device__ float
curand_uniform (curandState_t *state);   // <----  It may return from 0.0 to 1.0, where 1.0 is included and 0.0 is excluded.

__device__ float
curand_normal (curandState_t *state);    // <-----  returns a single normally distributed float with mean 0.0 and standard deviation 1.0.

__device__ float
curand_log_normal (curandState_t *state, float mean, float stddev); // <----- returns a single log-normally distributed float based on a normal distribution with the given mean and standard deviation.

// 如下是上述 3 個API 的 double 版本
__device__ double
curand_uniform_double (curandState_t *state)；

__device__ double
curand_normal_double (curandState_t *state)；

__device__ double
curand_log_normal_double (curandState_t *state, double mean, double stddev);

上面的 device API 在每次呼叫時，只會生成一個 float/double 的隨機數。Nvidia 同樣提供了一次可以生成 2個或4個 device API：

__device__ uint4
curand4 (curandStatePhilox4_32_10_t *state)；

__device__ float4
curand_uniform4 (curandStatePhilox4_32_10_t *state)；

__device__ float4
curand_normal4 (curandStatePhilox4_32_10_t *state)；

__device__ float4
curand_log_normal4 (curandStatePhilox4_32_10_t *state, float mean, float stddev)；

從上面的函式介面以及 Nvidia 的檔案來看，在初始化某種型別的 state 狀態量後，每次呼叫類似 curand() 的 API 後，state 都會自動進行 offset 偏移。

因此，Nvidia 官網上也提供了單獨對 state 進行位移的 API，其效果等價於呼叫多次無返回值的 curand() API，且效能更好：

__device__ void
skipahead(unsigned long long n, curandState_t *state); // <----- == calls n*curand()

__device__ void
skipahead_sequence(unsigned long long n, curandState_t *state); // <----- == calls n*2^67 curand()

三、效能分析

Nvidia 的官網明確指出了存在的效能問題，給開發者實現高效能 Kernel 提供了充分的經驗指導：

curand_init()要比curand()和curand_uniform()慢！
curand_init()在 offset 比較大時效能也會比小 offset 差！
save/load操作 state 比每次重複建立起始 state 效能要快很多！

原文如下：Calls to curand_init() are slower than calls to curand() or curand_uniform(). Large offsets to curand_init() take more time than smaller offsets. It is much faster to save and restore random generator state than to recalculate the starting state repeatedly.

對於上述第三點，Nvidia 建議可以將 state 存放到 global memory 中，如下是一個樣例程式碼：

__global__ void example(curandState *global_state)
{
    curandState local_state;
    local_state = global_state[threadIdx.x];
    for(int i = 0; i < 10000; i++) {
        unsigned int x = curand(&local_state);
        ...
    }
    global_state[threadIdx.x] = local_state;
}

從另一個維度來講，相對於A產生隨機數操作的API，初始化 state 會佔用更多的「暫存器」和 local memory 資源。因此 nvidia 建議將 curand_init 和 curand() API 拆分放到不同的 Kernel 中，可以獲得最大的效能收益；

原文：Initialization of the random generator state generally requires more registers and local memory than random number generation. It may be beneficial to separate calls to curand_init() and curand() into separate kernels for maximum performance.
State setup can be an expensive operation. One way to speed up the setup is to use different seeds for each thread and a constant sequence number of 0. This can be especially helpful if many generators need to be created. While faster to set up, this method provides less guarantees about the mathematical properties of the generated sequences. If there happens to be a bad interaction between the hash function that initializes the generator state from the seed and the periodicity of the generators, there might be threads with highly correlated outputs for some seed values. We do not know of any problem values; if they do exist they are likely to be rare.

Nvidia 提供的 API 樣例程式碼如下：

#include <stdio.h>
#include <stdlib.h>

#include <cuda_runtime.h>
#include <curand_kernel.h>

#define CUDA_CALL(x) do { if((x) != cudaSuccess) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__); \
    return EXIT_FAILURE;}} while(0)

__global__ void setup_kernel(curandState *state)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    /* Each thread gets same seed, a different sequence
       number, no offset */
    curand_init(1234, id, 0, &state[id]);
}

__global__ void generate_uniform_kernel(curandStatePhilox4_32_10_t *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int count = 0;
    float x;
    /* Copy state to local memory for efficiency */
    curandStatePhilox4_32_10_t localState = state[id];
    /* Generate pseudo-random uniforms */
    for(int i = 0; i < n; i++) {
        x = curand_uniform(&localState);
        /* Check if > .5 */
        if(x > .5) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

硬核 - Java 隨機數相關 API 的演進與思考（上）
2022-01-10
Java隨機API
明明的隨機數
2020-09-05
隨機
隨機演算法
2024-10-09
隨機演算法
Matlab 隨機生成兩個數值之間的隨機數
2018-09-03
Matlab隨機
隨機數
2024-08-16
隨機
RandomForest 隨機森林演算法與模型引數的調優
2021-01-19
randomREST隨機森林演算法模型
python生成隨機數、隨機字串
2018-12-18
Python隨機字串
Python如何隨機生成1到100的隨機數?
2024-02-01
Python隨機
Docker Compose部署隨機圖API
2024-08-10
Docker隨機API
【API】隨機獲取圖片
2020-12-08
API隨機
隨機森林演算法
2019-03-11
隨機森林演算法
【隨機演算法】洗牌
2020-10-28
隨機演算法
隨機範圍小數和隨機範圍整數
2024-06-01
隨機
Linux Shell 生成隨機數和隨機字串
2019-02-08
Linux隨機字串
洗牌演算法擴充（從n個數中隨機m個數）
2020-11-13
演算法隨機
JavaScript隨機數的應用
2019-02-16
JavaScript隨機
GoLang 中的隨機數 tips
2018-03-20
Golang隨機
更安全的隨機數生成
2022-01-04
隨機
隨機森林演算法梳理
2019-03-01
隨機森林演算法
javaScript隨機排序演算法
2018-07-20
JavaScript隨機排序演算法
JavaScript 生成隨機數
2019-04-10
JavaScript隨機
C++ 隨機數
2018-07-27
C++隨機
Js Math隨機數
2024-03-07
JS隨機
隨機數漫談
2024-07-03
隨機
演算法隨筆——數位DP
2024-06-01
演算法
Java 控制隨機數出現的機率
2021-12-25
Java隨機
區塊鏈中的隨機數
2019-02-27
區塊鏈隨機
整理記錄一些好用的隨機圖API
2021-11-22
隨機API
偽隨機數是什麼？偽隨機數生成方法有哪些？
2022-11-24
隨機
Java 隨機數探祕
2019-01-22
Java隨機
matlab 生成隨機數序列
2018-06-24
Matlab隨機
C++生成隨機數
2024-04-12
C++隨機
隨機數生成器
2020-07-19
隨機
MATLAB設定隨機數
2019-07-01
Matlab隨機
java隨機數生成原理
2021-09-11
Java隨機
R產生隨機數
2020-12-31
隨機
numpy各種生成隨機數的方法
2019-03-23
隨機
Solidity陷阱：以太坊的隨機數生成
2018-12-07
Solid隨機

CUDA 的隨機數演算法 API

一、具體使用場景

二、API 解析

三、效能分析

相關文章