[原始碼解析] NVIDIA HugeCTR，GPU 版本引數伺服器---(7) ---Distributed Hash之前向傳播

為了更好的說明，我們給出一個實際例子，假定一共有兩個slot（User ID 和 Item ID），每個slot內部最長為4個元素，稠密向量長度 embedding_vec_size 是8。下面CSR檔案之中，每行是一個slot，所以一共有兩個樣本，每個樣本兩行，假定batch size = 2，所以這兩個樣本一起訓練。

*   40,50,10,20
*   30,50,10
*   30,20
*   10
* Will be convert to the form of:
* row offset: 0,4,7,9,10
* value: 40,50,10,20,30,50,10,30,20,10

第一個樣本包括：

40,50,10,20 # slot 1
30,50,10 # slot 2

第二個樣本是

30,20 # slot 1
10 # slot 2

0x02 總體邏輯

前向傳播的總體功能是：Embedded_lookuper負責本地gpu計算和查詢嵌入向量，即使用者輸入->嵌入向量。這裡只考慮 *train* 名字的各種變數，忽略 *evalute* 名字的各種變數，即只看訓練邏輯。

2.1 註釋&思路

原始碼之中的註釋如下：

/**
 * All the CUDA kernel functions used by embedding layer are defined in this file, including
 * forward propagation, backward propagation. The functions are defined by propagation type
 * and combiner type(sum or mean) as below:
 *   1) forward
 *        sum: calling forward_sum_kernel()
 *        mean: calling foward_sum_kernel() + forward_scale_kernel()
 *   2) backward:
 *        calculating wgrad:
 *          sum: calling backward_sum_kernel()
 *          mean: calling backward_mean_kernel()
 *        update embedding table: including several steps as below,
 *          step1: expand sample IDs, calling sample_id_expand_kernel()
 *          step2: get value_index by key (will call hash_table->get_insert() in nv_hashtable lib)
 *          step3: sort by value_index (will call cub::DeviceRadixSort::SortPairs in cub lib)
 *          step4: count the number for each unduplicated value_index, calling value_count_kernel()
 *          step5: use optimizer method to compute deltaw, and record corresponding, including three
 * types of optimizer: Adam: caling opt_adam_kernel() Momentum sgd: calling
 * opt_momentum_sgd_kernel() Nesterov: calling opt_nesterov_kernel() step6: update embedding table
 * by deltaw, calling update_kernel()
 */

我們翻譯梳理邏輯如下 Read data from input_buffers_ -> look up -> write to output_tensors，具體就是：

從input_buffers_讀取資料。具體是通過 filter_keys_per_gpu 來完成對 embedding_data_ 的一系列配置。
從embedding之中進行 look up，即呼叫 functors_.forward_per_gpu 從本gpu的hashmap做lookup操作。
- 由 DistributedSlotSparseEmbeddingHash 的特點我們知道，因為當前gpu對應的資料key都在此gpu，所以此時不需要做節點間通訊。
- 這裡 hash_tables_[i]，hash_table_value_tensors_[i]，hash_value_index_tensors_[i] 就是本地第 i 個GPU對應的hashmap組合。
- embedding_data_.get_value_tensors(is_train)[i] 就是從我們之前提到的GPU sparse input 內部提取的輸入訓練資料。
- 進行本地規約。
做reduce scatter操作。每個gpu的資料是batch size條，但是每條資料裡的每個slot只是一部分key，需要做reduce scatter操作，做完reduce scatter後，資料才是完整的，此時每個gpu上分到完整資料的一部分。
寫到output_tensors。

具體一些成員變數的定義需要回憶一下。

hash_value_index_tensors_ ：embedding vector表的row index。就是低維矩陣的 row offset。
- 需要注意，其型別是 Tensors2，其型別是 std::vector<Tensor2>，所以每一個GPU對應該vector之中的一個元素。
- index 和 value 的行數相關。
- 內容是hash table value_index(row index of embedding)。
hash_table_value_tensors_ ：embedding vector表的value。就是低維矩陣。
- 需要注意，其型別是 Tensors2，其型別是 std::vector<Tensor2>，所以每一個GPU對應該vector之中的一個元素。
- 其內容是embedding vector。
- 用hash_value_index_tensors_的結果在這裡查詢一個 embedding vector。

後續我們依然做簡化，忽略多個 worker，多個 GPU 的情況。

2.2 總體程式碼

前向傳播總體程式碼如下：

本地多個GPU並行前向傳播，每個執行緒對應一個GPU，多GPU進行。
呼叫 filter_keys_per_gpu 完成完成了對 EmbeddingData 的配置，這裡i就是GPU index，拿到本GPU對應的輸入資料。
呼叫 forward_per_gpu 從本gpu的hashmap做lookup操作。
reduce scatter，做了之後，資料才是完整的，每個gpu上分到完整資料的一部分。
all_reduce 操作，這是combiner=mean時需要繼續處理。
forward_scale 操作，做平均。

  /**
   * The forward propagation of embedding layer.
   */
  void forward(bool is_train, int eval_batch = -1) override {
    // Read data from input_buffers_ -> look up -> write to output_tensors

#pragma omp parallel num_threads(embedding_data_.get_resource_manager().get_local_gpu_count())
    { // 本地多個GPU並行前向傳播
      // 每個執行緒對應一個GPU，多GPU進行
      size_t i = omp_get_thread_num(); // 拿到本執行緒序號
      CudaDeviceContext context(embedding_data_.get_local_gpu(i).get_device_id());
      
      if (embedding_data_.embedding_params_.is_data_parallel) {
        // 這裡完成了對 EmbeddingData 的配置，這裡i就是GPU index
        filter_keys_per_gpu(is_train, i, embedding_data_.get_local_gpu(i).get_global_id(),
                            embedding_data_.get_resource_manager().get_global_gpu_count());
      }
      // 從本gpu的hashmap做lookup操作
      // 這裡 hash_tables_[i]，hash_table_value_tensors_[i]，hash_value_index_tensors_[i] 就是對應的hashmap
      functors_.forward_per_gpu(embedding_data_.embedding_params_.get_batch_size(is_train),
                                embedding_data_.embedding_params_.slot_num,
                                embedding_data_.embedding_params_.embedding_vec_size, 0, is_train,
                                embedding_data_.get_row_offsets_tensors(is_train)[i],
                                embedding_data_.get_value_tensors(is_train)[i],
                                *embedding_data_.get_nnz_array(is_train)[i], *hash_tables_[i],
                                hash_table_value_tensors_[i], hash_value_index_tensors_[i],
                                embedding_feature_tensors_[i],
                                embedding_data_.get_local_gpu(i).get_stream());
    }

    // do reduce scatter
    // 做了之後，資料才是完整的，每個gpu上分到完整資料的一部分
    size_t recv_count = embedding_data_.get_batch_size_per_gpu(is_train) *
                        embedding_data_.embedding_params_.slot_num *
                        embedding_data_.embedding_params_.embedding_vec_size;
    functors_.reduce_scatter(recv_count, embedding_feature_tensors_,
                             embedding_data_.get_output_tensors(is_train),
                             embedding_data_.get_resource_manager());

    // scale for combiner=mean after reduction
    if (embedding_data_.embedding_params_.combiner == 1) {
      size_t send_count = embedding_data_.embedding_params_.get_batch_size(is_train) *
                              embedding_data_.embedding_params_.slot_num +
                          1;
      functors_.all_reduce(send_count, embedding_data_.get_row_offsets_tensors(is_train),
                           row_offset_allreduce_tensors_, embedding_data_.get_resource_manager());

      // do average
      functors_.forward_scale(
          embedding_data_.embedding_params_.get_batch_size(is_train),
          embedding_data_.embedding_params_.slot_num,
          embedding_data_.embedding_params_.embedding_vec_size, row_offset_allreduce_tensors_,
          embedding_data_.get_output_tensors(is_train), embedding_data_.get_resource_manager());
    }

    return;
  }

具體流程是：

0x03 配置資料

之前，在EmbeddingData 初始化時候，只是配置了其成員函式 train_keys_，train_keys_ 就是前面提到的 sparse_input，就是CSR format對應的稀疏張量。

template <typename TypeKey, typename TypeEmbeddingComp>
class EmbeddingData {
 public:
  const Embedding_t embedding_type_;
  SparseEmbeddingHashParams embedding_params_; /**< Sparse embedding hash params. */

  std::vector<std::shared_ptr<GeneralBuffer2<CudaAllocator>>>
      bufs_;                                         /**< The buffer for storing output tensors. */
  Tensors2<TypeEmbeddingComp> train_output_tensors_; /**< The output tensors. */
  Tensors2<TypeEmbeddingComp> evaluate_output_tensors_; /**< The output tensors. */
  Tensors2<TypeKey> train_row_offsets_tensors_; /**< The row_offsets tensors of the input data. */
  Tensors2<TypeKey> train_value_tensors_;       /**< The value tensors of the input data. */
  std::vector<std::shared_ptr<size_t>> train_nnz_array_;
  Tensors2<TypeKey>
      evaluate_row_offsets_tensors_;         /**< The row_offsets tensors of the input data. */
  Tensors2<TypeKey> evaluate_value_tensors_; /**< The value tensors of the input data. */
  std::vector<std::shared_ptr<size_t>> evaluate_nnz_array_;

  std::shared_ptr<ResourceManager> resource_manager_; /**< The GPU device resources. */

  SparseTensors<TypeKey> train_keys_;
  SparseTensors<TypeKey> evaluate_keys_;
  Tensors2<TypeKey> embedding_offsets_;
}

此時資料如下, embedding_offsets_ 和 train_output_tensors_ 都是預先分配的，我們假設 CSR 資料為：40,50,10,20,30,50,10,30,20，CSR row offset 是 0,4,7,9。

3.1 CUB函式

我們首先要介紹幾個cub庫的方法，這是NVIDIA提供的函式庫，用來操作CUDA，把一些常見方法用並行化來實現，比如陣列求和，不併行計算就是從頭查到尾，如果CUDA並行，則可以高速實現。其網址為：https://docs.nvidia.com/cuda/cub/index.html，配置資料中就採用其中了幾個方法。

3.1.1 cub::DeviceScan::InclusiveSum

此函式作用是使用GPU來計算inclusive prefix sum。

使用舉例如下：

 * int  num_items;      // e.g., 7
 * int  *d_in;          // e.g., [8, 6, 7, 5, 3, 0, 9]
 * int  *d_out;         // e.g., [ ,  ,  ,  ,  ,  ,  ]
 * ...
 *
 * // Determine temporary device storage requirements for inclusive prefix sum
 * void     *d_temp_storage = NULL;
 * size_t   temp_storage_bytes = 0;
 * cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
 *
 * // Allocate temporary storage for inclusive prefix sum
 * cudaMalloc(&d_temp_storage, temp_storage_bytes);
 *
 * // Run inclusive prefix sum
 * cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
 *
 * // d_out <-- [8, 14, 21, 26, 29, 29, 38]

函式實現為：

/**
* \brief Computes a device-wide inclusive prefix sum.
*
* \par
* - Supports non-commutative sum operators.
* - Provides "run-to-run" determinism for pseudo-associative reduction
*   (e.g., addition of floating point types) on the same GPU device.
*   However, results for pseudo-associative reduction may be inconsistent
*   from one device to a another device of a different compute-capability
*   because CUB can employ different tile-sizing for different architectures.
* - \devicestorage
*/
template <
    typename            InputIteratorT,
    typename            OutputIteratorT>
CUB_RUNTIME_FUNCTION
static cudaError_t InclusiveSum(
    void*               d_temp_storage,                 ///< [in] %Device-accessible allocation of temporary storage.  When NULL, the required allocation size is written to \p temp_storage_bytes and no work is done.
    size_t&             temp_storage_bytes,             ///< [in,out] Reference to size in bytes of \p d_temp_storage allocation
    InputIteratorT      d_in,                           ///< [in] Pointer to the input sequence of data items
    OutputIteratorT     d_out,                          ///< [out] Pointer to the output sequence of data items
    int                 num_items,                      ///< [in] Total number of input items (i.e., the length of \p d_in)
    cudaStream_t        stream             = 0,         ///< [in] <b>[optional]</b> CUDA stream to launch kernels within.  Default is stream<sub>0</sub>.
    bool                debug_synchronous  = false)     ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors.  May cause significant slowdown.  Default is \p false.
{
    // Signed integer type for global offsets
    typedef int OffsetT;

    return DispatchScan<InputIteratorT, OutputIteratorT, Sum, NullType, OffsetT>::Dispatch(
        d_temp_storage,
        temp_storage_bytes,
        d_in,
        d_out,
        Sum(),
        NullType(),
        num_items,
        stream,
        debug_synchronous);
}

如果想深入研究，可以參見 https://nvlabs.github.io/cub/structcub_1_1_device_scan.html 。

3.1.2 cub::DeviceSelect::If

此函式作用是：使用 select_op 函式，將相應的元素從 d_in 分割到一個分割槽序列 d_out。被複制到第一個分割槽的元素總數被寫入 d_num_selected_out。

具體使用方法為，此例子中，小於7的放在第一個分割槽，分割槽內元素數目為5.

 * // Functor type for selecting values less than some criteria
 * struct LessThan
 * {
 *     int compare;
 *
 *     CUB_RUNTIME_FUNCTION __forceinline__
 *     LessThan(int compare) : compare(compare) {}
 *
 *     CUB_RUNTIME_FUNCTION __forceinline__
 *     bool operator()(const int &a) const {
 *         return (a < compare);
 *     }
 * };
 *
 * // Declare, allocate, and initialize device-accessible pointers for input and output
 * int      num_items;              // e.g., 8
 * int      *d_in;                  // e.g., [0, 2, 3, 9, 5, 2, 81, 8]
 * int      *d_out;                 // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
 * int      *d_num_selected_out;    // e.g., [ ]
 * LessThan select_op(7);
 * ...
 *
 * // Determine temporary device storage requirements
 * void     *d_temp_storage = NULL;
 * size_t   temp_storage_bytes = 0;
 * cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items, select_op);
 *
 * // Allocate temporary storage
 * cudaMalloc(&d_temp_storage, temp_storage_bytes);
 *
 * // Run selection
 * cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, d_in, d_out, d_num_selected_out, num_items, select_op);
 *
 * // d_out                 <-- [0, 2, 3, 5, 2, 8, 81, 9]
 * // d_num_selected_out    <-- [5]

函式實現是

    /**
     * \brief Uses the \p select_op functor to split the corresponding items from \p d_in into a partitioned sequence \p d_out.  The total number of items copied into the first partition is written to \p d_num_selected_out. 
     */
    template <
        typename                    InputIteratorT,
        typename                    OutputIteratorT,
        typename                    NumSelectedIteratorT,
        typename                    SelectOp>
    CUB_RUNTIME_FUNCTION __forceinline__
    static cudaError_t If(
        void*               d_temp_storage,                ///< [in] %Device-accessible allocation of temporary storage.  When NULL, the required allocation size is written to \p temp_storage_bytes and no work is done.
        size_t                      &temp_storage_bytes,            ///< [in,out] Reference to size in bytes of \p d_temp_storage allocation
        InputIteratorT              d_in,                           ///< [in] Pointer to the input sequence of data items
        OutputIteratorT             d_out,                          ///< [out] Pointer to the output sequence of partitioned data items
        NumSelectedIteratorT        d_num_selected_out,             ///< [out] Pointer to the output total number of items selected (i.e., the offset of the unselected partition)
        int                         num_items,                      ///< [in] Total number of items to select from
        SelectOp                    select_op,                      ///< [in] Unary selection operator
        cudaStream_t                stream             = 0,         ///< [in] <b>[optional]</b> CUDA stream to launch kernels within.  Default is stream<sub>0</sub>.
        bool                        debug_synchronous  = false)     ///< [in] <b>[optional]</b> Whether or not to synchronize the stream after every kernel launch to check for errors.  May cause significant slowdown.  Default is \p false.
    {
        typedef int                     OffsetT;         // Signed integer type for global offsets
        typedef NullType*               FlagIterator;   // FlagT iterator type (not used)
        typedef NullType                EqualityOp;     // Equality operator (not used)

        return DispatchSelectIf<InputIteratorT, FlagIterator, OutputIteratorT, NumSelectedIteratorT, SelectOp, EqualityOp, OffsetT, true>::Dispatch(
            d_temp_storage,
            temp_storage_bytes,
            d_in,
            NULL,
            d_out,
            d_num_selected_out,
            select_op,
            EqualityOp(),
            num_items,
            stream,
            debug_synchronous);
    }

};

如果想深入研究，參見 https://nvlabs.github.io/cub/structcub_1_1_device_select.html

3.1.3 臨時儲存

前面cub方法之中，都要有一個臨時儲存區域，因此 DistributedSlotSparseEmbeddingHash 之中有一個 DistributedFilterKeyStorage 就是用來達到這個目的。

template <typename TypeHashKey, typename TypeEmbeddingComp>
class DistributedSlotSparseEmbeddingHash : public IEmbedding {
  using NvHashTable = HashTable<TypeHashKey, size_t>;

  std::vector<DistributedFilterKeyStorage<TypeHashKey>> filter_keys_storage_;
}

DistributedFilterKeyStorage 定義如下：

template <typename TypeHashKey>
struct DistributedFilterKeyStorage {
  Tensor2<size_t> value_select_num;
  Tensor2<void> temp_value_select_storage;

  Tensor2<TypeHashKey> rowoffset_select;
  Tensor2<void> temp_rowoffset_select_scan_storage;

  DistributedFilterKeyStorage(const std::shared_ptr<GeneralBuffer2<CudaAllocator>> &buf,
                              size_t max_nnz, size_t rowoffset_count, size_t global_id,
                              size_t global_num);
};

具體構建方法如下：

template <typename TypeHashKey>
DistributedFilterKeyStorage<TypeHashKey>::DistributedFilterKeyStorage(
    const std::shared_ptr<GeneralBuffer2<CudaAllocator>> &buf, size_t max_nnz,
    size_t rowoffset_count, size_t global_id, size_t global_num) {
  buf->reserve({1}, &value_select_num);
  // select value
  {
    distributed_embedding_kernels::HashOp<TypeHashKey> select_op{global_id, global_num};
    size_t size_in_bytes = 0;
    cub::DeviceSelect::If(nullptr, size_in_bytes, (TypeHashKey *)nullptr, (TypeHashKey *)nullptr,
                          (size_t *)nullptr, max_nnz, select_op);
    buf->reserve({size_in_bytes}, &temp_value_select_storage);
  }

  // count rowoffset
  {
    size_t size_in_bytes = 0;
    cub::DeviceScan::InclusiveSum(nullptr, size_in_bytes, (TypeHashKey *)nullptr,
                                  (TypeHashKey *)nullptr, rowoffset_count);
    buf->reserve({size_in_bytes}, &temp_rowoffset_select_scan_storage);
  }
  buf->reserve({rowoffset_count}, &rowoffset_select);
}

3.2 配置資料

在前向傳播之中，首先就是在 filter_keys_per_gpu 之中使用 train_keys_ 來對其他成員變數進行配置，目的是拿到本GPU上 DistributedSlotSparseEmbeddingHash 對應的輸入資料。回憶一下，EmbeddingData 的這幾個成員變數 get_output_tensors，get_input_keys，get_row_offsets_tensors，get_value_tensors，get_nnz_array 都返回引用，這說明大部分成員變數都是可以被直接修改的。具體配置程式碼如下：

template <typename TypeHashKey, typename TypeEmbeddingComp>
void DistributedSlotSparseEmbeddingHash<TypeHashKey, TypeEmbeddingComp>::filter_keys_per_gpu(
    bool is_train, size_t id, size_t global_id, size_t global_num) {
  
  // 對當前GPU進行配置
  
  // 得到 train_keys_，利用它來配置row offsets和value
  const SparseTensor<TypeHashKey> &all_gather_key = embedding_data_.get_input_keys(is_train)[id];
  // 得到 embedding_data_.train_row_offsets_tensors_，修改 rowoffset_tensor 就是修改此成員變數
  Tensor2<TypeHashKey> rowoffset_tensor = embedding_data_.get_row_offsets_tensors(is_train)[id];
  // 得到 embedding_data_.train_value_tensors_，修改 value_tensor 就是修改此成員變數
  Tensor2<TypeHashKey> value_tensor = embedding_data_.get_value_tensors(is_train)[id];
  std::shared_ptr<size_t> nnz_ptr = embedding_data_.get_nnz_array(is_train)[id];
  auto &filter_keys_storage = filter_keys_storage_[id];
  auto &stream = embedding_data_.get_local_gpu(id).get_stream();

  size_t batch_size = embedding_data_.embedding_params_.get_batch_size(is_train);
  size_t slot_num = (all_gather_key.rowoffset_count() - 1) / batch_size;
  size_t rowoffset_num = batch_size * slot_num + 1;
  size_t rowoffset_num_without_zero = rowoffset_num - 1;

  // select value
  {
    distributed_embedding_kernels::HashOp<TypeHashKey> select_op{global_id, global_num};

    size_t size_in_bytes = filter_keys_storage.temp_value_select_storage.get_size_in_bytes();
    // 配置 embedding_data_.train_value_tensors_
    cub::DeviceSelect::If(filter_keys_storage.temp_value_select_storage.get_ptr(), size_in_bytes,
                          all_gather_key.get_value_ptr(), value_tensor.get_ptr(),
                          filter_keys_storage.value_select_num.get_ptr(), all_gather_key.nnz(),
                          select_op, stream);
  }

  // select rowoffset
  {
    cudaMemsetAsync(filter_keys_storage.rowoffset_select.get_ptr(), 0,
                    filter_keys_storage.rowoffset_select.get_size_in_bytes(), stream);
    {
      constexpr int block_size = 512;
      int grid_size = (rowoffset_num_without_zero - 1) / block_size + 1;
      distributed_embedding_kernels::select_rowoffset<<<grid_size, block_size, 0, stream>>>(
          all_gather_key.get_rowoffset_ptr(), rowoffset_num_without_zero,
          all_gather_key.get_value_ptr(), filter_keys_storage.rowoffset_select.get_ptr(), global_id,
          global_num);
    }
    {
      size_t size_in_bytes =
          filter_keys_storage.temp_rowoffset_select_scan_storage.get_size_in_bytes();
      // 配置row offset，就是拷貝到 rowoffset_tensor之中。
      cub::DeviceScan::InclusiveSum(
          filter_keys_storage.temp_rowoffset_select_scan_storage.get_ptr(), size_in_bytes,
          filter_keys_storage.rowoffset_select.get_ptr(), rowoffset_tensor.get_ptr(), rowoffset_num,
          stream);
    }

    // select nnz
    // 直接拷貝即可
    cudaMemcpyAsync(nnz_ptr.get(), filter_keys_storage.value_select_num.get_ptr(), sizeof(size_t),
                    cudaMemcpyDeviceToHost, stream);
  }
}

配置完成之後，得到如下，其中 train_value_tensors_ 對應了csr value，train_row_offsets_tensors_ 對應了csr row offset，從SparseTensor拷貝到 EmbeddingData。

結合我們例子，最後前向傳播輸入訓練資料是：

0x04 Lookup操作

此部分就是完成嵌入表 look up操作。現在EmbeddingData得到了各種配置，就是sparse input引數，所以可以利用其作為key，得到embedding vector了。這部分是在 forward_per_gpu 內部完成的。

  functors_.forward_per_gpu(embedding_data_.embedding_params_.get_batch_size(is_train),
                            embedding_data_.embedding_params_.slot_num,
                            embedding_data_.embedding_params_.embedding_vec_size, 0, is_train,
                            embedding_data_.get_row_offsets_tensors(is_train)[i],
                            embedding_data_.get_value_tensors(is_train)[i],
                            *embedding_data_.get_nnz_array(is_train)[i], *hash_tables_[i],
                            hash_table_value_tensors_[i], hash_value_index_tensors_[i],
                            embedding_feature_tensors_[i],
                            embedding_data_.get_local_gpu(i).get_stream());
}

4.1 提取資料

這裡用到了比如 get_row_offsets_tensors 這樣的方法從 embedding_data_ 之中提取輸入資料。從input_buffers_讀取資料對應的提取資料程式碼如下，就是從GPU的sparse input csr資料中讀取到輸入資料，作為後續在hash table查詢的key：

Tensors2<TypeKey>& get_value_tensors(bool is_train) {
  if (is_train) {
    return train_value_tensors_;
  } else {
    return evaluate_value_tensors_;
  }
}

從 CSR 讀取 offset 程式碼如下：

Tensors2<TypeKey>& get_row_offsets_tensors(bool is_train) {
  if (is_train) {
    return train_row_offsets_tensors_;
  } else {
    return evaluate_row_offsets_tensors_;
  }
}

因為輸入有幾千萬個，但是可能其中只有幾百個才非零，所以hash表就是把這幾千萬個輸入做第一次對映，可以減少大量記憶體空間。

4.2 查詢

目前程式碼來到了這裡，就是利用雜湊表來從輸入CSR得到對應的嵌入向量。

forward_per_gpu 分為兩部分：查詢和內部規約。

4.2.1 查詢運算元

forward_per_gpu 函式是用來具體做lookup的。從其註釋可以看到其用途，就是我們之前分析過的。

@param row_offset row_offset (CSR format of input sparse tensors)
@param hash_key value (CSR format of input sparse tensors)
@param nnz non-zero feature number per batch
@param hash_table hash table, pairs of <key, value_index>
@param hash_table_value hash table value, which represents embedding vector
@param hash_value_index hash table value_index(row index of embedding)

這裡的引數都是引用，可以修改外部資料，具體思路是：

首先使用 hash_key value (CSR format of input sparse tensors) 來呼叫 get_insert 去 hash table 之中查詢，如果找到了，得到的就是 hash_value_index。這個value 是低維 embedding表的 row index。這部分程式碼是 hash_table.get_insert 相關。其實，這裡沒有用到get_insert 返回值，而是把 hash_key value 插進雜湊表內部，得到一個對映，具體如何查詢是通過 csr row offset完成。
hash_table.get_insert 如果在 hash_table 的內部資料結構之中找到了，就返回，如果沒有找到，就插入一個遞增的數值，這個數值被設定到 hash_value_index 之中。
然後通過 hash_value_index 作為 index，在 hash_table_value 之中得到最終的 embedding vector，並且先在slot內部做reduce。這部分程式碼是 forward_sum 和 forward_mean 相關。

所以 hash_table_value_tensors_[i], hash_value_index_tensors_ 這兩部分何時設定？其實是在forward_per_gpu完成的，具體邏輯如圖：

具體程式碼是：

/**
 * forward propagation on each GPU for LocalizedSlotSparseEmbeddingHash
 * @param batch_size batch size for the current mini-batch computation.
 * @param slot_num the number of slots for current GPU
 * @param embedding_vec_size embedding vector size.
 * @param combiner 0-sum; 1-mean
 * @param row_offset row_offset (CSR format of input sparse tensors)
 * @param hash_key value (CSR format of input sparse tensors)
 * @param nnz non-zero feature number per batch
 * @param hash_table hash table, pairs of <key, value_index>
 * @param hash_table_value hash table value, which represents embedding vector
 * @param hash_value_index hash table value_index(row index of embedding)
 * @param embedding_feature embedding feature (output)
 * @param stream cuda stream
 */
template <typename TypeHashKey, typename TypeEmbeddingComp>
void SparseEmbeddingFunctors::forward_per_gpu(
    size_t batch_size, size_t slot_num, size_t embedding_vec_size, int combiner, bool train,
    const Tensor2<TypeHashKey> &row_offset, const Tensor2<TypeHashKey> &hash_key, size_t nnz,
    HashTable<TypeHashKey, size_t> &hash_table, const Tensor2<float> &hash_table_value,
    Tensor2<size_t> &hash_value_index, Tensor2<TypeEmbeddingComp> &embedding_feature,
    cudaStream_t stream) {
  try {
    if (train) { // 訓練會來到這裡
      // 先從hash_table之中依據 hash_key 得到hash_value_index 之中對應的位置，作用就是讓 hash_value_index 之中包含所有key對應的內部hash_value_index
      // 其實，這裡是否返回不重要，重要的是把hash_key value插進雜湊表內部，具體如何查詢是通過csr row offset完成
      hash_table.get_insert(hash_key.get_ptr(), hash_value_index.get_ptr(), nnz, stream);
    } else {
      hash_table.get_mark(hash_key.get_ptr(), hash_value_index.get_ptr(), nnz, stream);
    }

    // do sum reduction
    if (combiner == 0) { // 0-sum; 1-mean
      // 然後利用 hash_value_index 從 hash_table_value 之中得到 value，再進行操作
      forward_sum(batch_size, slot_num, embedding_vec_size, row_offset.get_ptr(),
                  hash_value_index.get_ptr(), hash_table_value.get_ptr(),
                  embedding_feature.get_ptr(), stream);
    } else if (combiner == 1) {
      // 然後利用 hash_value_index 從 hash_table_value 之中得到 value，再進行操作
      forward_mean(batch_size, slot_num, embedding_vec_size, row_offset.get_ptr(),
                   hash_value_index.get_ptr(), hash_table_value.get_ptr(),
                   embedding_feature.get_ptr(), stream);
    } else {
      CK_THROW_(Error_t::WrongInput, "Invalid combiner type ");
    }
  } catch (const std::runtime_error &rt_err) {
    std::cerr << rt_err.what() << std::endl;
    throw;
  }

  return;
}

運算元內部也分為 get_insert 來處理雜湊表，和 combiner 處理，我們一一看看。

4.2.2 get_insert

前面我們分析了雜湊表的 get 和 insert 操作，這裡是合而為一，就是如果找不到就插入。開始訓練時候，不需要給雜湊表賦初值，而是在訓練過程之中使用get_insert動態插入。

我們再回憶下原理。

比如一共有1億個單詞，40表示第40個單詞。如果想表示 10，30，40，50，20在這一億個單詞是有效的，最常見的辦法是弄個1億長度陣列，把40，50，20，30，10這5個位置設定為1，其他位置設定為0。對應嵌入矩陣也是一個高維矩陣，比如 1億 x 64 維度矩陣。

如果想省空間，就弄會構建一個小資料結構（低維矩陣）來儲存這些有意義的值，弄一個一個hash函式 m_hf來做"從高維矩陣到低維矩陣的轉換"，就是10 -->?，20 --> ? 等。

假如是選取十位數數為key，對於我們的例子，就是

m_hf(10)=1
m_hf(20)=2
m_hf(30)=3
m_hf(40)=4
m_hf(50)=5

1，2，3，4，5 就是內部的hash_value，叫做 hash_value（對應下面程式碼），對應的內部儲存陣列叫做 hashtbl_values。但是因為分桶了，所以在雜湊表內部是放置在hashtbl_values之中（這裡我們做了一個簡化，就是 hashtbl_values[i] = i）。

hashtbl_values[1] = 1，hashtbl_values[2] = 2, hashtbl_values[3] = 3，...

以上說的是雜湊表，我們回到 DistributedSlotSparseEmbeddingHash 本身，於是1，2，3 （陣列之中的內容，不是陣列index，簡化成恰好相等）就是DistributedSlotSparseEmbeddingHash 想得到的 10, 20, 30 對應的資料，就是10 放在低維嵌入表第一個位置，20放在低維嵌入表第二個位置，就是就是低維矩陣的row offset。即，hash_value_index 的內容是 [1,2,3,4,5]，這些是原始輸入資料 10，20，30，40，50 分別在 hash_table_value 之中對應的index，因此，10 對應的就是 hash_table_value[1]，20 對應就是 hash_table_value[2]，依此類推。

再返回雜湊表，NvHashTable 的 get_insert 程式碼如下。

template <typename KeyType, typename ValType>
void NvHashTable<KeyType, ValType>::get_insert(const void *d_keys, void *d_vals, size_t len, cudaStream_t stream) {
    const KeyType *_d_keys = reinterpret_cast<const KeyType*>(d_keys);
    ValType *_d_vals = reinterpret_cast<ValType*>(d_vals);
    return hashtable_.get_insert(_d_keys, _d_vals, len, stream);
}

HashTable 的 get_insert 位於 sparse_operation_kit/kit_cc/kit_cc_infra/src/hashtable/nv_hashtable.cu。這裡是在GPU進行並行操作，提取value。

template <typename Table>
__global__ void get_insert_kernel(Table* table, const typename Table::key_type* const keys,
                                  typename Table::mapped_type* const vals, size_t len,
                                  size_t* d_counter) {
  ReplaceOp<typename Table::mapped_type> op;
  const size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < len) {
    auto it = table->get_insert(keys[i], op, d_counter);
    vals[i] = it->second; // 在這裡對外面 hash_value_index做了設定
  }
}

template <typename KeyType, typename ValType>
void HashTable<KeyType, ValType>::get_insert(const KeyType* d_keys, ValType* d_vals, size_t len,
                                             cudaStream_t stream) {
  if (len == 0) {
    return;
  }
  const int grid_size = (len - 1) / BLOCK_SIZE_ + 1;
  get_insert_kernel<<<grid_size, BLOCK_SIZE_, 0, stream>>>(container_, d_keys, d_vals, len,
                                                           d_counter_);
}

最後還是來到 HugeCTR/include/hashtable/cudf/concurrent_unordered_map.cuh。如果沒有value，就生成一個value。

// __forceinline__ 的意思是編譯為行內函數
// __host__ __device__ 表示是此函式同時為主機和裝置編譯
template <typename aggregation_type, typename counter_type, class comparison_type = key_equal,
          typename hash_value_type = typename Hasher::result_type>
__forceinline__ __device__ iterator get_insert(const key_type& k, aggregation_type op,
                                               counter_type* value_counter,
                                               comparison_type keys_equal = key_equal(),
                                               bool precomputed_hash = false,
                                               hash_value_type precomputed_hash_value = 0) {
  const size_type hashtbl_size = m_hashtbl_size;
  value_type* hashtbl_values = m_hashtbl_values;

  hash_value_type hash_value{0};

  // If a precomputed hash value has been passed in, then use it to determine
  // the write location of the new key
  if (true == precomputed_hash) {
    hash_value = precomputed_hash_value;
  }
  // Otherwise, compute the hash value from the new key
  else {
    hash_value = m_hf(k); // 3356作為key，得到了一個hash_value
  }

  size_type current_index = hash_value % hashtbl_size; // 找到哪個位置
  value_type* current_hash_bucket = &(hashtbl_values[current_index]); // 找到該位置的bucket
  const key_type insert_key = k;
  bool insert_success = false;
  size_type counter = 0;

  while (false == insert_success) {
    // Situation %5: No slot: All slot in the hashtable is occupied by other key, both get and
    // insert fail. Return empty iterator
    // hash表已經滿了
    if (counter++ >= hashtbl_size) {
      return end();
    }

    key_type& existing_key = current_hash_bucket->first; // 這個才是table key
    volatile mapped_type& existing_value = current_hash_bucket->second; // 這個才是table value

    // 如果 existing_key == unused_key時，則當前雜湊位置為空，所以existing_key由atomicCAS更新為insert_key。
    // 如果 existing_key == insert_key時，這個位置已經被插入這個key了。
    // 在任何一種情況下，都要執行existing_value和insert_value的atomic聚合，因為雜湊表是用聚合操作的標識值初始化的，所以在existing_value仍具有其初始值時，執行該操作是安全的     
    // Try and set the existing_key for the current hash bucket to insert_key
    const key_type old_key = atomicCAS(&existing_key, unused_key, insert_key);

    // If old_key == unused_key, the current hash bucket was empty
    // and existing_key was updated to insert_key by the atomicCAS.
    // If old_key == insert_key, this key has already been inserted.
    // In either case, perform the atomic aggregation of existing_value and insert_value
    // Because the hash table is initialized with the identity value of the aggregation
    // operation, it is safe to perform the operation when the existing_value still
    // has its initial value
    // TODO: Use template specialization to make use of native atomic functions
    // TODO: How to handle data types less than 32 bits?

    // Situation #1: Empty slot: this key never exist in the table, ready to insert.
    if (keys_equal(unused_key, old_key)) { // 如果沒有找到hash key
      existing_value = (mapped_type)(atomicAdd(value_counter, 1)); // hash value 就遞增
      break;

    }  // Situation #2+#3: Target slot: This slot is the slot for this key
    else if (keys_equal(insert_key, old_key)) {
      while (existing_value == m_unused_element) {
        // Situation #2: This slot is inserting by another CUDA thread and the value is not yet
        // ready, just wait
      }
      // Situation #3: This slot is already ready, get successfully and return (iterator of) the
      // value
      break;
    }
    // Situation 4: Wrong slot: This slot is occupied by other key, get fail, do nothing and
    // linear probing to next slot.

    // 此位置已經被其他key佔了，只能向後遍歷
    current_index = (current_index + 1) % hashtbl_size;
    current_hash_bucket = &(hashtbl_values[current_index]);
  }

  return iterator(m_hashtbl_values, m_hashtbl_values + hashtbl_size, current_hash_bucket);
}

具體邏輯演進如下：

4.3 combiner

拿到了多個向量之後，需要做聚合，因為此處過於繁瑣，因此我們單獨拿出來說一下，把它提升到和查詢一個級別，大家不要誤會。

4.3.1 為何要聚合

在CTR領域，人們通常會把多個embedding向量合併成一個向量，這就是pooling。比如使用者看了3本藝術書，2本體育書，所以讀書習慣 = 3 * 藝術 + 2 * 體育。這種聚合經常使用加權的pooling，而不是concat。因為雖然concat效果更好，但是pooling更快，而且這樣做好處就是即使向量長度不同，也可以生成一個同樣長度的新張量。比如：特徵的embeddingSize是10，現在所有Field的個數是50，其中5個Field是序列形式的特徵（對於序列長度的上限取40）。此時你有兩種處理方式：

mean/sum pooling ：embedding層的引數量是10 * 50 = 500
concat ：embedding層的引數量是 10*(50-5) + 40 * 10 * 5 = 2450

如果使用 concat，則embedding層的引數量直接漲了4倍左右，實際ctr模型種引數量最大部分一般就是embedding -> MLP的這一層，所以concat會直接拖慢線上inference的速度。

4.3.2 設計準則

我們回憶一下前面提到的設計準則：嵌入表可以被分割成多個槽（或feature fields）。為了在不同的嵌入上獲得最佳效能，可以選擇不同的嵌入層實現。

LocalizedSlotEmbeddingHash：同一個槽（特徵域）中的特徵會儲存在一個GPU中，這就是為什麼它被稱為“本地化槽”，根據槽的索引號，不同的槽可能儲存在不同的GPU中。
DistributedSlotEmbeddingHash：所有特徵都儲存於不同特徵域/槽上，不管槽索引號是多少，這些特徵都根據特徵的索引號分佈到不同的GPU上。這意味著同一插槽中的特徵可能儲存在不同的 GPU 中，這就是將其稱為“分散式插槽”的原因。由於需要全域性規約，所以DistributedSlotEmbedding 適合 embedding 大於 GPU 記憶體大小的情況，因而DistributedSlotEmbedding 在 GPU 之間有更多的記憶體交換。

一定要注意，LocalizedSlotEmbeddingHash 和 DistributedSlotEmbeddingHash 的區別在於同一個槽（特徵域）中的特徵 是不是 會儲存在同一個GPU中。比如，有 2 張GPU卡，有4個slot。

local模式：GPU0存slot0和slot1，GPU1存slot2和slot3。
distribute模式：每個GPU都會存所有slot的一部分引數，通過雜湊方法決定如何將一個引數分配到哪個GPU上。

在嵌入查詢過程中，屬於同一槽的稀疏特徵輸入在分別轉換為相應的密集嵌入向量後，被簡化為單個嵌入向量。然後，來自不同槽的嵌入向量連線在一起。這個就是前面提到的combiner操作。

4.3.3 Combiner程式碼

現在已經拿到了 embedding table 的 index，需要看看如何拿到 embedding vector，如何僅需操作。

// do sum reduction
if (combiner == 0) { // 0-sum; 1-mean 這裡是combiner型別
  // 然後利用 hash_value_index 從 hash_table_value 之中得到 value，再進行操作
  forward_sum(batch_size, slot_num, embedding_vec_size, row_offset.get_ptr(),
              hash_value_index.get_ptr(), hash_table_value.get_ptr(),
              embedding_feature.get_ptr(), stream);
} else if (combiner == 1) {
  // 然後利用 hash_value_index 從 hash_table_value 之中得到 value，再進行操作
  forward_mean(batch_size, slot_num, embedding_vec_size, row_offset.get_ptr(),
               hash_value_index.get_ptr(), hash_table_value.get_ptr(),
               embedding_feature.get_ptr(), stream);
}

具體是通過 forward_sum 和 forward_mean 完成，我們用 forward_sum 舉例看看。

// do sum reduction
template <typename TypeHashKey>
void forward_sum(size_t batch_size, size_t slot_num, size_t embedding_vec_size,
                 const TypeHashKey *row_offset, const size_t *hash_value_index,
                 const float *hash_table_value, __half *embedding_feature, cudaStream_t stream) {
  
  const size_t grid_size = batch_size;  // each block corresponds to a sample
  if (embedding_vec_size % 2 == 0) {
    const size_t block_size = embedding_vec_size / 2;
    forward_sum_align2_kernel<<<grid_size, block_size, 0, stream>>>(
        batch_size, slot_num, embedding_vec_size / 2, row_offset, hash_value_index,
        hash_table_value, embedding_feature);
  } else {
    const size_t block_size =
        embedding_vec_size;  // each thread corresponds to one element in an embedding vector
    forward_sum_kernel<<<grid_size, block_size, 0, stream>>>(
        batch_size, slot_num, embedding_vec_size, row_offset, hash_value_index, hash_table_value,
        embedding_feature);
  }
}

上面程式碼之中需要注意兩個註釋

grid_size = batch_size; // each block corresponds to a sample
const size_t block_size = embedding_vec_size; // each thread corresponds to one element in an embedding vector

4.3.3.1 例子

回憶我們的例子：

*   40,50,10,20
*   30,50,10
*   30,20
*   10
* Will be convert to the form of:
* row offset: 0,4,7,9,10
* value: 40,50,10,20,30,50,10,30,20,10

第一個樣本包括：

40,50,10,20 # slot 1
30,50,10 # slot 2

第二個樣本是

30,20 # slot 1
10 # slot 2

所以，應該得到10個稠密向量，比如40有一個稠密向量，50有一個稠密向量。

怎麼知道 40 對應低維嵌入表的哪一行呢？通過一個雜湊表來處理的，假如雜湊函式是選取十位數為key，則得到：

m_hf(40)=4

所以，就知道了，40應該在低維嵌入表的第4行（我們對雜湊表做了簡化）。

4.3.3.2 要點

forward_sum_kernel 的程式碼如下，這裡程式碼很燒腦，需要結合註釋仔細分析，

第一個要點是回憶一下hash_value_index_tensors_的使用：

細心讀者可能有疑問，如果雜湊表能從高維offset對映到低維offset，這個hash_value_index_tensors_ 應該就沒有用了吧？這裡解釋如下：

事實上，因為解耦合的原因，hash_value_index_tensors_ 並不應該知道雜湊表內部把高維矩陣的維度對映了多大的低維矩陣，而 hash_value_index_tensors_ 大小也不應該隨之變化。
所以，hash_value_index_tensors_ 大小被固定為：batch_size * nnz_per_slot，可以認為就是CSR之中元素個數。所以 hash_value_index_tensors_ 實際上記錄了每個元素對應的低維矩陣offset 數值，hash_value_index_tensors_ 事實上就是和CSR之中元素位置一一對應。
因此，最終嵌入表查詢時候，是通過CSR row offset 來找到 CSR之中每個元素，從而也找到了hash_value_index_tensors_ 這個表的index，從而就能找到其低維矩陣offset。
針對我們的例子，hash_value_index_tensors_ 的數值就是 4,5,1,2,3,5,1,3,2,1。

其餘幾個要點是：

bid 是第幾個樣本。
tid 是最終嵌入向量的第幾個元素，一個執行緒處理嵌入向量的一個元素。
hash_value_index 是低維嵌入表的offset表的指標。
- hash_value_index 是一張表，就是上面說的hash_value_index_tensors_。
row_offset 是CSR offset，例子就是 0,4,7,9,10，所以對於第二個樣本，row offset 是 7，9。
hash_table_value 可以認為是一個陣列，低維嵌入矩陣是儲存在這個陣列之中。hash_table_value[value_index * embedding_vec_size] 就是對應的稠密向量。

4.3.3.3 註釋版程式碼

// forward kernel funcion: for both combiner=sum and combiner=mean
template <typename TypeKey, typename TypeEmbeddingComp>
__global__ void forward_sum_kernel(int batch_size, int slot_num, int embedding_vec_size,
                                   const TypeKey *row_offset, const size_t *hash_value_index,
                                   const float *hash_table_value,
                                   TypeEmbeddingComp *embedding_feature) {
  
  // bid是第幾個樣本，假如是1，那麼就是第二個樣本
  int bid = blockIdx.x;   // each block corresponding to one sample
  // tid最終是嵌入向量的第幾個元素，一個執行緒處理嵌入向量的一個元素
  int tid = threadIdx.x;  // each thread corresponding to one element in the embedding vector

  if (bid < batch_size && tid < embedding_vec_size) { // batch_size = 2
    for (int i = 0; i < slot_num; i++) { // slot_num = 2
      // 得到當前行對應的在row offset之中的位置，比如是2或者3，就是從 0,4,7,9,10 之中找第2，第3個
      int feature_row_index = bid * slot_num + i; // feature_row_index 範圍是 2,3
      // 得到當前行在CSR內的元素偏移，行0，行1 是第一個樣本，行2，行3是第二個樣本
      TypeKey value_offset = row_offset[feature_row_index]; // 行2的偏移value_offset是7，行3是9
      // 每行有多少元素，行2對應的元素個數是9-7=2，行3對應的元素個數是10-9=1
      TypeKey feature_num = 
          row_offset[feature_row_index + 1] - value_offset;  // number of hash values in one slot

      float sum = 0.0f;

      // reduce in a slot
      for (int j = 0; j < feature_num; j++) { // 行內元素個數，行2是2，行3是1
        // 假如是行2，則value是30，20，則取出hash_value_index的第7，8個位置的數值，分別是3，2
        size_t value_index = hash_value_index[value_offset + j];
        // 取出hash_table_value的第3，2個元素的數值，進行計算
        // value_index 就是具體哪一個 CSR user ID 在 hash_table_value 之中的起始位置，即hash_value_index記錄了哪一個 CSR user ID 在hash_table_value的第幾行
        // hash_table_value[value_index * embedding_vec_size] 就是 CSR user ID對應的稠密向量
        // hash_table_value[value_index * embedding_vec_size + tid] 就是 CSR user ID對應的稠密向量的第tid個element        
        sum += (value_index != std::numeric_limits<size_t>::max())
                   ? hash_table_value[value_index * embedding_vec_size + tid]
                   : 0.0f;
      }

      // store the embedding vector
      // 這裡就對應了2，3兩行，就是一個樣本的兩個slots 會順序排在一起，最終輸出稠密向量的每個元素值是樣本之中所有元素稠密向量對應位置元素的和
      embedding_feature[feature_row_index * embedding_vec_size + tid] =
          TypeConvertFunc<TypeEmbeddingComp, float>::convert(sum);
    }
  }
}

4.3.3.4 並行操作

關於並行操作，留意點是：

bid是第幾個樣本。
tid 是最終嵌入向量的第幾個元素，一個執行緒處理嵌入向量的一個元素。
hash_table_value[value_index * embedding_vec_size] 就是 CSR user ID對應的稠密向量。
hash_table_value[value_index * embedding_vec_size + tid] 就是 CSR user ID對應的稠密向量的第 tid 個element。

之前說了，應該是兩個樣本一共10個元素 40,50,10,20,30,50,10,30,20,10，應該對應10個稠密向量。但是在GPU之中會啟動tid個執行緒並行操作，會在一個slot之中進行reduce，然後把結果存入到 embedding_feature 之中。GPU並行體現在同時生成一個稠密向量的所有元素。就是每一個sample生成 slot_num 個稠密向量。稠密向量的每個元素都是根據樣本內部元素計算出來的。

比如第一個樣本是：

40,50,10,20 # slot 1
30,50,10 # slot 2

slot 1 應該輸出 40 對應的稠密向量 + 50 對應的稠密向量 + 10 對應的稠密向量 + 20 對應的稠密向量。
slot 2 應該輸出 30 對應的稠密向量 + 50 對應的稠密向量 + 10 對應的稠密向量。

但經過 combiner之後，樣本1輸出了兩個稠密向量，分別對應兩個slot，假定每個稠密向量長度是8，計算方式是：

稠密向量1 = 40 對應的稠密向量 + 50 對應的稠密向量 + 10 對應的稠密向量 + 20 對應的稠密向量
稠密向量2 = 30 對應的稠密向量 + 50 對應的稠密向量 + 10 對應的稠密向量

稠密向量1內部8個元素分別是由40，50，10，20對應的稠密向量8個同位置上元素的和構成。即稠密向量1的[0] = sum(40 對應的稠密向量的[0], 50 對應的稠密向量的[0], 10 對應的稠密向量的[0], 20 對應的稠密向量的[0] )。可以看到，其確實轉成了嵌入式向量，但並不是用矩陣乘法，而是用了自己一套機制，具體入下圖：

4.3.4 嵌入表大小

我們已經知道可以通過雜湊表來進行縮減嵌入表大小，現在又知道其實還可以通過combine來繼續化簡，所以在已經有了雜湊表基礎之上，我們需要先問幾個問題。

目前 hash_table_value 究竟有多大？就是權重矩陣（稠密矩陣）究竟多大？
embedding_feature （嵌入層前向傳播的輸出）究竟有多大？就是輸出的規約之後的矩陣應該有多大？
embedding_feature 的每一個元素是怎麼計算出來的？
實際矩陣有多大？

我們解答一下。

第一個問題hash_table_value 究竟有多大？

前文之中有分析 hash_table_value 大小是：max_vocabulary_size_per_gpu_ = embedding_data_.embedding_params_.max_vocabulary_size_per_gpu;

實際上，大致可以認為，hash_table_value 的大小是：(value number in CSR) * (embedding_vec_size) 。

hash_table_value 的數值是隨機初始化的。每一個原始的 CSR user ID 對應了其中的 embedding_vec_size 個元素。hash_value_index 和 row_offset 湊在一起，就可以找到每一個原始的 CSR user ID 對應了其中的 embedding_vec_size 個元素。

第二個問題：embedding_feature 究竟有多大？就是邏輯上的稠密矩陣究竟有多大？從程式碼可以看到，

  embedding_feature[feature_row_index * embedding_vec_size + tid] =
      TypeConvertFunc<TypeEmbeddingComp, float>::convert(sum);

可見，embedding_feature 的大小是：(row number in CSR) * (embedding_vec_size) 。因此，對於 embedding_feature_tensors_，我們抽象一下，輸入假設是4行 CSR格式，則輸出就是4行稠密向量格式。

第三個問題：embedding_feature 的每一個元素是怎麼計算出來的？

是遍歷slot和element，進行計算。

    sum += (value_index != std::numeric_limits<size_t>::max())
               ? hash_table_value[value_index * embedding_vec_size + tid]
               : 0.0f;

第四個問題：實際embedding矩陣，或者說工程上的稠密矩陣有多大？

其實就是 slot_num * embedding_vec_size。row number 其實就是 slot_num。從下面輸出可以看到。

以 deep_data 為例，其slot num 是26，embedding_vec_size = 16，最後輸出的一條樣本大小是 [26 x 16]。

model.add(hugectr.Input(label_dim = 1, label_name = label,
                        dense_dim = 13, dense_name = dense,
                        data_reader_sparse_param_array = 
                        [hugectr.DataReaderSparseParam(wide_data, 30, True, 1),
                        hugectr.DataReaderSparseParam(deep_data, 2, False, 26)]))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 23,
                            embedding_vec_size = 1,
                            combiner = sum,
                            sparse_embedding_name = sparse_embedding2,
                            bottom_name = wide_data,
                            optimizer = optimizer))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 358,
                            embedding_vec_size = 16,
                            combiner = sum,
                            sparse_embedding_name = sparse_embedding1,
                            bottom_name = deep_data,
                            optimizer = optimizer))

輸出：

"------------------------------------------------------------------------------------------------------------------\n",
"Layer Type                              Input Name                    Output Name                   Output Shape                  \n",
"------------------------------------------------------------------------------------------------------------------\n",
"DistributedSlotSparseEmbeddingHash      wide_data                     sparse_embedding2             (None, 1, 1)                  \n",
"DistributedSlotSparseEmbeddingHash      deep_data                     sparse_embedding1             (None, 26, 16)

0x05 Reduce Scatter

現在每個GPU之上都得到了自己樣本對應的稠密向量，記錄在 embedding_feature_tensors_ 之上。每個GPU的資料是 batch size 條，每條有 slot number 個稠密向量，我們現在回憶一下：

DistributedSlotEmbeddingHash：所有特徵都儲存於不同特徵域/槽上，不管槽索引號是多少，這些特徵都根據特徵的索引號分佈到不同的GPU上。這意味著同一插槽中的特徵可能儲存在不同的 GPU 中，這就是將其稱為“分散式插槽”的原因。由於需要全域性規約，所以DistributedSlotEmbedding 適合 embedding 大於 GPU 記憶體大小的情況，因而DistributedSlotEmbedding 在 GPU 之間有更多的記憶體交換。

GPU之上每個樣本資料之中的slot只是slot的一部分資料，我們給出一個例子。我們假設一共有2個gpu，batch size為2，一共3個slot。有兩個樣本，拿第一個樣本為例，slot 1有兩個key，分別是GPU 1 上的1，GPU 2上的7。所以需要把這兩個key進行歸併操作。具體如下：

每條資料裡面的每個slot都只是一部分key，同一插槽中的特徵可能儲存在不同的 GPU 中，這些特徵都根據特徵的索引號分佈到不同的GPU上。這樣就需要把GPU 1，GPU 2之上的資料進行合併，做完reduce scatter後，資料應該是完整的，並且每個gpu上只分到一部分完整的資料。

5.1 背景知識

關於 Reduce Scatter 的原理，請參見 https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/operations.html。這裡只是大致介紹。

Reduce操作的作用是對所有計算節點上的數值進行歸約操作，並且只將歸約後的結果儲存到主節點上。和 AllReduce 的操作其實一樣，只不過是把結果只放到root上而已。

ReduceScatter 操作執行與Reduce操作相同的操作，只是結果分散在rank之間的相等塊中，每個rank根據其rank索引獲得一塊資料，即每個rank只受到reduce結果的一部分資料。

或者說，ReduceScatter 就是先做Scatter，將資料切分成同等大小的資料塊，再依據Rank Index 對每一個rank所獲得的資料做Reduce。這類似於全聚集，但是並不是將資料簡單拼接到一起而是做了規約操作（比如，求和或最大值操作）。

或者參見下圖，來自NVIDIA文件 https://images.nvidia.cn/events/sc15/pdfs/NCCL-Woolley.pdf。對所有GPU上的資料進行reduce操作，這裡是sum，然後將結果切分到所有的GPU上。

5.2 程式碼

具體程式碼如下，是對 embedding_feature_tensors_ 進行 reduce scatter，結果放在 embedding_data_.get_output_tensors(is_train) 之上。

    // do reduce scatter
    // 做了之後，資料才是完整的，每個gpu上分到完整資料的一部分
    size_t recv_count = embedding_data_.get_batch_size_per_gpu(is_train) *
                        embedding_data_.embedding_params_.slot_num *
                        embedding_data_.embedding_params_.embedding_vec_size;
    functors_.reduce_scatter(recv_count, embedding_feature_tensors_,
                             embedding_data_.get_output_tensors(is_train),
                             embedding_data_.get_resource_manager());

reduce_scatter 運算元程式碼是，這裡是sum操作：

template void SparseEmbeddingFunctors::reduce_scatter<float>(
    size_t recv_count, const Tensors2<float> &send_tensors, Tensors2<float> &recv_tensors,
    const ResourceManager &resource_manager);

template <typename TypeEmbeddingComp>
void SparseEmbeddingFunctors::reduce_scatter(size_t recv_count,
                                             const Tensors2<TypeEmbeddingComp> &send_tensors,
                                             Tensors2<TypeEmbeddingComp> &recv_tensors,
                                             const ResourceManager &resource_manager) {
  size_t local_gpu_count = resource_manager.get_local_gpu_count();
  size_t total_gpu_count = resource_manager.get_global_gpu_count();

  // need to know the type of TypeHashKey here
  ncclDataType_t type;
  switch (sizeof(TypeEmbeddingComp)) {
    case 2:
      type = ncclHalf;
      break;
    case 4:
      type = ncclFloat;
      break;
    default:
      CK_THROW_(Error_t::WrongInput, "Error: TypeHashKey not support by now");
  }

  // for multi GPUs, use NCCL to do Reduce-Scatter(supporting multi-node GPU servers)
  if (total_gpu_count > 1) {
    CK_NCCL_THROW_(ncclGroupStart());
    for (size_t id = 0; id < local_gpu_count; id++) {
      const auto &local_gpu = resource_manager.get_local_gpu(id);
      CK_NCCL_THROW_(ncclReduceScatter(send_tensors[id].get_ptr(),  // send buf
                                       recv_tensors[id].get_ptr(),  // recv buff
                                       recv_count, type, ncclSum, local_gpu->get_nccl(),
                                       local_gpu->get_stream()));
    }
    CK_NCCL_THROW_(ncclGroupEnd());
  }
  // for single GPU, just do memcpyD2D
  else {  // total_gpu_count == 1
    const auto &local_gpu = resource_manager.get_local_gpu(0);
    CudaDeviceContext context(local_gpu->get_device_id());
    CK_CUDA_THROW_(cudaMemcpyAsync(recv_tensors[0].get_ptr(), send_tensors[0].get_ptr(),
                                   recv_count * sizeof(TypeEmbeddingComp), cudaMemcpyDeviceToDevice,
                                   local_gpu->get_stream()));
  }

  return;
}

我們用圖例來展示一下目前過程，為了更好的理解，這裡我們可以把Reduce-Scatter分段考慮，

Reduce 就是類似AllReduce操作，這個之後，所有GPU之上擁有所有資料。
Scatter 則按照 rank 來對樣本進行分配，所以GPU 1 之上是Sample 1，GPU 2之上是Sample 2。

我們最後歸納整體如下：

0x06 Combiner

如果需要做 mean pooling，則需需要做兩個操作。

 *   1) forward
 *        sum: calling forward_sum_kernel()
 *        mean: calling foward_sum_kernel() + forward_scale_kernel()

第一個操作是對CSR row offset 做一個AllReduce，這樣就相當於是一個全域性offset了，就可以拿到每個sample每個slot裡的key的總個數。

第二個操作是Forward Scale，就是把embedding的值除以這個"個數"，也就等於做了平均。

    // scale for combiner=mean after reduction
    if (embedding_data_.embedding_params_.combiner == 1) {
      size_t send_count = embedding_data_.embedding_params_.get_batch_size(is_train) *
                              embedding_data_.embedding_params_.slot_num +
                          1;
      functors_.all_reduce(send_count, embedding_data_.get_row_offsets_tensors(is_train),
                           row_offset_allreduce_tensors_, embedding_data_.get_resource_manager());

      // do average
      functors_.forward_scale(
          embedding_data_.embedding_params_.get_batch_size(is_train),
          embedding_data_.embedding_params_.slot_num,
          embedding_data_.embedding_params_.embedding_vec_size, row_offset_allreduce_tensors_,
          embedding_data_.get_output_tensors(is_train), embedding_data_.get_resource_manager());
    }

6.1 AllReduce

AllReduce 結果如下：

回憶一下 CSR 例子。

*   40,50,10,20
*   30,50,10
*   30,20
*   10
* Will be convert to the form of:
* row offset: 0,4,7,9,10
* value: 40,50,10,20,30,50,10,30,20,10

row_offset 的數字就是：第一行起始位置是0，第二行起始位置是4，第三行起始位置是7..... 我們假設這是在Node 1之上。

如果Node 2的row_offset為 0,5,7,10,11，說明在這個Node之上，第一行起始位置是0，第二行起始位置是5，第三行起始位置是7.....，對應CSR是：

*   40,50,10,20,30
*   30,50
*   30,20,40
*   10
* Will be convert to the form of:
* row offset: 0,5,7,10,11
* value: 40,50,10,20,30,50,10,30,20,10

做了AllReduce之後，得到：0,9,14,19,21。這樣就知道第一個行總個數是9個，第二行總個是是7+7-9 = 5個。

具體運算元如下：

/**
 * collection communication: all_reduce.
 * @param send_count the count of elements will be sent.
 * @param send_tensors the send tensors of multi GPUs.
 * @param recv_tensors the recv tensors of multi GPUs.
 * @param device_resources all gpus device resources.
 * @param context gpu device context, for switching device.
 */
template <typename TypeHashKey>
void SparseEmbeddingFunctors::all_reduce(size_t send_count,
                                         const Tensors2<TypeHashKey> &send_tensors,
                                         Tensors2<TypeHashKey> &recv_tensors,
                                         const ResourceManager &resource_manager) {
  size_t local_gpu_count = resource_manager.get_local_gpu_count();
  size_t total_gpu_count = resource_manager.get_global_gpu_count();

  // need to know the type of Type here
  ncclDataType_t type;
  switch (sizeof(TypeHashKey)) {
    case 4:
      type = ncclUint32;
      break;
    case 8:
      type = ncclUint64;
      break;
    default:
      CK_THROW_(Error_t::WrongInput, "Error: Type not support by now");
  }

  // for multi GPUs, use NCCL to do all_reduce (supporting multi-node GPU servers)
  if (total_gpu_count > 1) {
    CK_NCCL_THROW_(ncclGroupStart());
    for (size_t id = 0; id < local_gpu_count; id++) {
      const auto &local_gpu = resource_manager.get_local_gpu(id);
      // ALLReduce操作
      CK_NCCL_THROW_(ncclAllReduce(send_tensors[id].get_ptr(), recv_tensors[id].get_ptr(),
                                   send_count, type, ncclSum, local_gpu->get_nccl(),
                                   local_gpu->get_stream()));
    }
    CK_NCCL_THROW_(ncclGroupEnd());
  }
  // for single GPU, just do memcpyD2D
  else {  // total_gpu_count == 1
    const auto &local_gpu = resource_manager.get_local_gpu(0);
    CudaDeviceContext context(local_gpu->get_device_id());
    CK_CUDA_THROW_(cudaMemcpyAsync(recv_tensors[0].get_ptr(), send_tensors[0].get_ptr(),
                                   send_count * sizeof(TypeHashKey), cudaMemcpyDeviceToDevice,
                                   local_gpu->get_stream()));
  }

  return;
}

6.2 Forward Scale

最後要做一步 Forward Scale 操作。

  // do average
  functors_.forward_scale(
      embedding_data_.embedding_params_.get_batch_size(is_train),
      embedding_data_.embedding_params_.slot_num,
      embedding_data_.embedding_params_.embedding_vec_size, row_offset_allreduce_tensors_,
      embedding_data_.get_output_tensors(is_train), embedding_data_.get_resource_manager());

前面我們做了AllReduce之後，得到 row_offset_allreduce_tensors_ 是 0,9,14,19,21。這樣就知道第一個行總個數是9個，第二行總個是是7+7-9 = 5個。就可以對embedding_data_.get_output_tensors(is_train)的每個元素進行操作，每個元素都除以本slot的元素總數，就是做mean了。

運算元如下：

// forward kernel function: this is an additional function for combiner=mean (only for Distributed
// Embedding)
template <typename TypeKey, typename TypeEmbeddingComp>
__global__ void forward_scale_kernel(int batch_size, int slot_num, int embedding_vec_size,
                                     const TypeKey *row_offset,
                                     TypeEmbeddingComp *embedding_feature) {
  int bid = blockIdx.x;
  int tid = threadIdx.x;

  if (bid < batch_size && tid < embedding_vec_size) {
    for (int i = 0; i < slot_num; i++) {
      size_t feature_row_index = bid * slot_num + i;
      // 本slot元素總數
      int feature_num = row_offset[feature_row_index + 1] - row_offset[feature_row_index];
      // 輸出矩陣的row offset
      size_t feature_index = feature_row_index * embedding_vec_size + tid;
      float feature =
          TypeConvertFunc<float, TypeEmbeddingComp>::convert(embedding_feature[feature_index]);
      float scaler = 1.0f;
      if (feature_num > 1) {
        scaler = 1.0f / (float)feature_num; // 除數
      }

      embedding_feature[feature_index] = // 設定
          TypeConvertFunc<TypeEmbeddingComp, float>::convert(feature * scaler);
    }
  }
}

template <typename TypeKey, typename TypeEmbeddingComp>
void do_forward_scale(size_t batchsize_per_gpu, size_t slot_num, size_t embedding_vec_size,
                      const TypeKey *row_offset, TypeEmbeddingComp *embedding_feature,
                      cudaStream_t stream) {
  const size_t grid_size = batchsize_per_gpu;
  const size_t block_size = embedding_vec_size;
  forward_scale_kernel<<<grid_size, block_size, 0, stream>>>(
      batchsize_per_gpu, slot_num, embedding_vec_size, row_offset, embedding_feature);
};