[原始碼解析] NVIDIA HugeCTR,GPU版本引數伺服器--- (4)

羅西的思考發表於2022-02-21

[原始碼解析] NVIDIA HugeCTR,GPU版本引數伺服器--- (4)

0x00 摘要

在這個系列中,我們介紹了 HugeCTR,這是一個面向行業的推薦系統訓練框架,針對具有模型並行嵌入和資料並行密集網路的大規模 CTR 模型進行了優化。

本文主要介紹流水線的前兩級,最後一級將會獨立成文。其中借鑑了HugeCTR原始碼閱讀 這篇大作,特此感謝。

本系列其他文章如下:

[原始碼解析] NVIDIA HugeCTR,GPU 版本引數伺服器 --(1)

[原始碼解析] NVIDIA HugeCTR,GPU版本引數伺服器--- (2)

[原始碼解析] NVIDIA HugeCTR,GPU版本引數伺服器---(3)

0x01 總體流程

由於高效的資料交換和三級流水線,HugeCTR的可擴充套件性和活躍GPU的數量都有所增加。此流水線包括三級:

  • 從檔案讀取資料。
  • 從主機到裝置的資料傳輸(節點間和節點內)。
  • 利用GPU計算。

的資料讀取重疊,並訓練GPU。下圖顯示了HugeCTR的可擴充套件性,批量大小為16384,在DGX1伺服器上有七層。

0x02 DataReader

DataReader 被用來把資料從資料集拷貝到嵌入層。其是流水線的入口,包括了流水線的前面兩步驟:讀取檔案和拷貝到GPU。

此圖顯示了“讀取檔案”、“複製到 GPU”和“訓練”階段如何重疊三個批次以提高 GPU 資源利用率。

Figure 5. HugeCTR training pipeline with its data reader.

2.1 定義

為了分析需要,我們只給出成員變數,方法我們會在使用時候具體介紹。

從動態角度看,成員變數之中重要的是以下兩個:

  • worker_group :工作執行緒組,負責把資料從dataset檔案讀取到記憶體之中,這個可以認為是流水線的第一級。之前的版本之中有一個HeapEx資料結構用來做中間快取,目前這個資料結構已經移除。
  • data_collector_ :擁有一個執行緒,負責把資料拷貝到GPU之中。這個可以認為是流水線的第二級

從靜態角度看,主要是以下三個buffer:

  • std::vector<std::shared_ptr<ThreadBuffer>> thread_buffers_:執行緒內部使用的buffer。
  • std::shared_ptr<BroadcastBuffer> broadcast_buffer_:用來後續和collector互動,collector 把它作為中間buffer。
  • std::shared_ptr<DataReaderOutput> output_:reader的輸出,訓練最後讀取的是這裡。

以上三個buffer的資料流動是:ThreadBuffer --> BroadcastBuffer ---> DataReaderOutput

從資源角度看,則是:

  • std::shared_ptr resource_manager_ :這是 Session 的成員變數,在DataReader建構函式之中傳遞進來的。
  • const std::vector params_ :這是依據配置檔案整理出來的sparse引數元資訊。
/**
 * @brief Data reading controller.
 *
 * Control the data reading from data set to embedding.
 * An instance of DataReader will maintain independent
 * threads for data reading (IDataReaderWorker)
 * from dataset to heap. Meanwhile one independent
 * thread consumes the data (DataCollector),
 * and copy the data to GPU buffer.
 */
template <typename TypeKey>
class DataReader : public IDataReader {
 private:
  std::vector<std::shared_ptr<ThreadBuffer>> thread_buffers_;  // gpu_id -> thread_idx
  std::shared_ptr<BroadcastBuffer> broadcast_buffer_;
  std::shared_ptr<DataReaderOutput> output_;

  std::shared_ptr<DataReaderWorkerGroup> worker_group_;
  std::shared_ptr<DataCollector<TypeKey>> data_collector_; /**< pointer of DataCollector */

  /* Each gpu will have several csr output for different embedding */
  const std::vector<DataReaderSparseParam> params_;
  std::shared_ptr<ResourceManager> resource_manager_; /**< gpu resource used in this data reader*/
  const size_t batchsize_;                            /**< batch size */
  const size_t label_dim_; /**< dimention of label e.g. 1 for BinaryCrossEntropy */
  const size_t dense_dim_; /**< dimention of dense */
  long long current_batchsize_;

  bool repeat_;
  std::string file_name_;
  SourceType_t source_type_;
}

2.2 構建

對DataReader的構建分為兩部分:

  • 在建構函式之中會:
    • 對各種buffer進行配置。
    • 對構建DataCollector。
  • 在create_datareader之中會分別處理 train_data_reader和 evaluate_data_reader,也就是用於訓練和評估的兩個reader。然後會為他們建立workgroup。

我們先省略對建構函式的分析,因為其牽扯到一系列資料結構。等介紹完資料結構之後,再進行論述。

2.3 DataReaderSparseParam

2.3.1 定義

DataReaderSparseParam 是依據配置得到的Sparse引數的元資訊,其主要成員變數如下:

  • sparse_name是其後續層引用的稀疏輸入張量的名稱。沒有預設值,應由使用者指定。

  • nnz_per_slot是每個插槽的指定sparse輸入的最大特徵數。

    • 'nnz_per_slot'可以是'int',即每個slot的平均nnz,因此每個例項的最大功能數應該是'nnz_per_slot*slot_num'。
    • 或者可以使用List[int]初始化'nnz_per_slot',則每個樣本的最大特徵數應為'sum(nnz_per_slot)',在這種情況下,陣列'nnz_per_slot'的長度應與'slot_num'相同。
  • 'is_fixed_length'用於標識所有樣本中每個插槽的categorical inputs是否具有相同的長度。如果不同的樣本對於每個插槽具有相同數量的特徵,則使用者可以設定“is_fixed_length=True”,Hugetr可以使用此資訊來減少資料傳輸時間。

  • slot_num指定用於資料集中此稀疏輸入的插槽數。

    • 注意:如果指定了多個'DataReaderSparseParam',則任何一對'DataReaderSparseParam'之間都不應有重疊。比如,在[wdl樣本](../samples/wdl/wdl.py)中,我們總共有27個插槽;我們將第一個插槽指定為"wide_data",將接下來的26個插槽指定為"deep_data"。
struct DataReaderSparseParam {
  std::string top_name;
  std::vector<int> nnz_per_slot;
  bool is_fixed_length;
  int slot_num;

  DataReaderSparse_t type;
  int max_feature_num;
  int max_nnz;

  DataReaderSparseParam() {}
  DataReaderSparseParam(const std::string& top_name_, const std::vector<int>& nnz_per_slot_,
                        bool is_fixed_length_, int slot_num_)
      : top_name(top_name_),
        nnz_per_slot(nnz_per_slot_),
        is_fixed_length(is_fixed_length_),
        slot_num(slot_num_),
        type(DataReaderSparse_t::Distributed) {
    max_feature_num = std::accumulate(nnz_per_slot.begin(), nnz_per_slot.end(), 0);
    max_nnz = *std::max_element(nnz_per_slot.begin(), nnz_per_slot.end());
  }

  DataReaderSparseParam(const std::string& top_name_, const int nnz_per_slot_,
                        bool is_fixed_length_, int slot_num_)
      : top_name(top_name_),
        nnz_per_slot(slot_num_, nnz_per_slot_),
        is_fixed_length(is_fixed_length_),
        slot_num(slot_num_),
        type(DataReaderSparse_t::Distributed) {
    max_feature_num = std::accumulate(nnz_per_slot.begin(), nnz_per_slot.end(), 0);
    max_nnz = *std::max_element(nnz_per_slot.begin(), nnz_per_slot.end());
  }
};

2.3.2 使用

之前提到了Parser是解析配置檔案,HugeCTR 也支援程式碼設定,比如下面就設定了兩個DataReaderSparseParam,也有對應的DistributedSlotSparseEmbeddingHash。

model = hugectr.Model(solver, reader, optimizer)
model.add(hugectr.Input(label_dim = 1, label_name = "label",
                        dense_dim = 13, dense_name = "dense",
                        data_reader_sparse_param_array = 
                        [hugectr.DataReaderSparseParam("wide_data", 30, True, 1),
                        hugectr.DataReaderSparseParam("deep_data", 2, False, 26)]))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 23,
                            embedding_vec_size = 1,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding2",
                            bottom_name = "wide_data",
                            optimizer = optimizer))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 358,
                            embedding_vec_size = 16,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding1",
                            bottom_name = "deep_data",
                            optimizer = optimizer))

0x03 DataReader Buffer 機制

我們接下來看看 DataReader 的若干Buffer,依賴於這些buffer,HugeCTR實現了流水線的前兩級。

3.1 比對

我們首先要做一個歷史對比,看看這部分程式碼的發展脈絡。我們先看看3.1版本的程式碼。DataReader 我們選取了部分成員變數。3.1 版本之前使用了一個heap進行操作,即下面的csr_heap_

class DataReader : public IDataReader {
  std::shared_ptr<HeapEx<CSRChunk<TypeKey>>> csr_heap_; /**< heap to cache the data set */
  Tensors2<float> label_tensors_;                       /**< Label tensors for the usage of loss */
  std::vector<TensorBag2> dense_tensors_;               /**< Dense tensors for the usage of loss */
  /* Each gpu will have several csr output for different embedding */
  Tensors2<TypeKey> csr_buffers_; /**< csr_buffers contains row_offset_tensor and value_tensors */
  Tensors2<TypeKey> row_offsets_tensors_; /**< row offset tensors*/
  Tensors2<TypeKey> value_tensors_;       /**< value tensors */
  std::vector<std::shared_ptr<size_t>> nnz_array_;

  const size_t label_dim_; /**< dimention of label e.g. 1 for BinaryCrossEntropy */
  const size_t dense_dim_; /**< dimention of dense */
}

我們再看看3.2.1版本的程式碼,也選取了部分成員變數。

template <typename TypeKey>
class DataReader : public IDataReader {
  std::vector<std::shared_ptr<ThreadBuffer>> thread_buffers_;  // gpu_id -> thread_idx
  std::shared_ptr<BroadcastBuffer> broadcast_buffer_;
  std::shared_ptr<DataReaderOutput> output_;

  const size_t label_dim_; /**< dimention of label e.g. 1 for BinaryCrossEntropy */
  const size_t dense_dim_; /**< dimention of dense */
}

3.2.1 這裡是:

  • label_tensors_, dense_tensors_ 移動到 AsyncReader。
  • 把 csr_heap_ 用 thread_buffers_broadcast_buffer_output_ 等進行替代。
  • 把 row_offsets_tensors_,value_tensors_,nnz_array_ 等等用 ThreadBuffer,BroadcastBuffer,DataReaderOutput 之中的 SparseTensorBag 來包括,統一管理 CSR。

3.2 Buffer 相關類

我們依據上面的歷史版本比對來看看。

  • 在之前版本(比如3.1)之中,存在一個 HeapEX 類,其實現了 CPU 到 GPU 之間的一個資料快取功能。
  • 在最新版本之中,改為一系列 buffer 相關類,比如 ThreadBuffer 和 BroadcastBuffer,其狀態都是由 BufferState 實現的。
enum class BufferState : int { FileEOF, Reading, ReadyForRead, Writing, ReadyForWrite };

以下是三個buffer的定義。

struct ThreadBuffer {
  std::vector<SparseTensorBag> device_sparse_buffers;  // same number as embedding number
  std::vector<unsigned char> is_fixed_length;          // same number as embedding number
  TensorBag2 device_dense_buffers;
  std::atomic<BufferState> state;
  long long current_batch_size;
  int batch_size;
  size_t param_num;
  int label_dim;
  int dense_dim;
  int batch_size_start_idx;  // dense buffer
  int batch_size_end_idx;
};

struct BroadcastBuffer {
  std::vector<SparseTensorBag>
      sparse_buffers;  // same number as (embedding number * local device number)
  std::vector<unsigned char> is_fixed_length;        // same number as embedding number
  std::vector<TensorBag2> dense_tensors;             // same number as local device number
  std::vector<cudaEvent_t> finish_broadcast_events;  // same number as local device number
  std::atomic<BufferState> state;
  long long current_batch_size;
  size_t param_num;
};

struct DataReaderOutput {
  std::map<std::string, std::vector<SparseTensorBag>> sparse_tensors_map;
  std::vector<std::string> sparse_name_vec;
  std::vector<TensorBag2> label_tensors;
  std::vector<TensorBag2> dense_tensors;
  bool use_mixed_precision;
  int label_dense_dim;
};

以上這些類,對應了 DataReader 的以下成員變數。

class DataReader : public IDataReader {
 private:
  std::vector<std::shared_ptr<ThreadBuffer>> thread_buffers_;  // gpu_id -> thread_idx
  std::shared_ptr<BroadcastBuffer> broadcast_buffer_;
  std::shared_ptr<DataReaderOutput> output_;
}

接下來,我們就一一分析。

3.3 DataReader構造

前面跳過了 DataReader 的建構函式,接下來我們接下來對建構函式進行分析,其主要功能就是為三種buffer來預留空間,分配記憶體,最後構建了collector。

DataReader(int batchsize, size_t label_dim, int dense_dim,
           std::vector<DataReaderSparseParam> &params,
           const std::shared_ptr<ResourceManager> &resource_manager, bool repeat, int num_threads,
           bool use_mixed_precision)
    : broadcast_buffer_(new BroadcastBuffer()),
      output_(new DataReaderOutput()),
      params_(params),
      resource_manager_(resource_manager),
      batchsize_(batchsize),
      label_dim_(label_dim),
      dense_dim_(dense_dim),
      repeat_(repeat) {
  size_t local_gpu_count = resource_manager_->get_local_gpu_count();
  size_t total_gpu_count = resource_manager_->get_global_gpu_count();

  // batchsize_ is a multiple of total_gpu_count
  size_t batch_size_per_gpu = batchsize_ / total_gpu_count;
        
  // 1. 生成了一個臨時變數buffs,用來具體分配記憶體,裡面是若干 CudaAllocator,每個CudaAllocator對應了i個GPU 
  std::vector<std::shared_ptr<GeneralBuffer2<CudaAllocator>>> buffs;
  // 先預留部分記憶體空間      
  buffs.reserve(local_gpu_count);
  // 為每個GPU初始化一個GeneralBuffer2   
  for (size_t i = 0; i < local_gpu_count; ++i) {
    buffs.push_back(GeneralBuffer2<CudaAllocator>::create());
  }

  // 2.預留buffer 
  // 處理 thread_buffers_     
  thread_buffers_.reserve(num_threads);
  for (int i = 0; i < num_threads; ++i) { 
    // a worker may maintain multiple buffers on device i % local_gpu_count
    auto local_gpu = resource_manager_->get_local_gpu(i % local_gpu_count);
    CudaCPUDeviceContext context(local_gpu->get_device_id());
    auto &buff = buffs[i % local_gpu_count]; // 找到對應GPU對應的CudaAllocator,進行分配
    std::shared_ptr<ThreadBuffer> current_thread_buffer = std::make_shared<ThreadBuffer>();
    thread_buffers_.push_back(current_thread_buffer);

    current_thread_buffer->device_sparse_buffers.reserve(params.size());
    current_thread_buffer->is_fixed_length.reserve(params.size()); // vector的reserve
    for (size_t param_id = 0; param_id < params.size(); ++param_id) {
      auto &param = params_[param_id];
      SparseTensor<TypeKey> temp_sparse_tensor;
      // 預留記憶體
      buff->reserve({(size_t)batchsize, (size_t)param.max_feature_num}, param.slot_num,
                    &temp_sparse_tensor);
      current_thread_buffer->device_sparse_buffers.push_back(temp_sparse_tensor.shrink());
      current_thread_buffer->is_fixed_length.push_back(param.is_fixed_length);
    }
    Tensor2<float> temp_dense_tensor;
    // 預留記憶體
    buff->reserve({batch_size_per_gpu * local_gpu_count, label_dim + dense_dim},
                  &temp_dense_tensor);
    current_thread_buffer->device_dense_buffers = temp_dense_tensor.shrink();
    current_thread_buffer->state.store(BufferState::ReadyForWrite);
    current_thread_buffer->current_batch_size = 0;
    current_thread_buffer->batch_size = batchsize;
    current_thread_buffer->param_num = params.size();
    current_thread_buffer->label_dim = label_dim;
    current_thread_buffer->dense_dim = dense_dim;
    current_thread_buffer->batch_size_start_idx =
        batch_size_per_gpu * resource_manager_->get_gpu_global_id_from_local_id(0);
    current_thread_buffer->batch_size_end_idx =
        current_thread_buffer->batch_size_start_idx + batch_size_per_gpu * local_gpu_count;
  }

  // 處理 broadcast buffer,注意這裡的reserve是 vector資料結構的方法,不是預留記憶體      
  broadcast_buffer_->sparse_buffers.reserve(local_gpu_count * params.size());
  broadcast_buffer_->is_fixed_length.reserve(local_gpu_count * params.size());
  broadcast_buffer_->dense_tensors.reserve(local_gpu_count);
  broadcast_buffer_->finish_broadcast_events.resize(local_gpu_count);
  broadcast_buffer_->state.store(BufferState::ReadyForWrite);
  broadcast_buffer_->current_batch_size = 0;
  broadcast_buffer_->param_num = params.size();
        
  // 處理 output buffer,注意這裡的reserve是 vector資料結構的方法,不是預留記憶體
  output_->dense_tensors.reserve(local_gpu_count);
  output_->label_tensors.reserve(local_gpu_count);
  output_->use_mixed_precision = use_mixed_precision;
  output_->label_dense_dim = label_dim + dense_dim;
  // 預留sparse tensor,注意這裡的reserve是 vector資料結構的方法,不是預留記憶體      
  for (size_t param_id = 0; param_id < params.size(); ++param_id) {
    auto &param = params_[param_id];
    output_->sparse_tensors_map[param.top_name].reserve(local_gpu_count);
    output_->sparse_name_vec.push_back(param.top_name);
  }

  // 遍歷本地的 GPU       
  for (size_t local_id = 0; local_id < local_gpu_count; ++local_id) {
    // 還是需要針對每一個GPU,找到對應的CudaAllocator進行分配
    auto local_gpu = resource_manager_->get_local_gpu(local_id);
    CudaDeviceContext ctx(local_gpu->get_device_id());
    auto &buff = buffs[local_id];

    for (size_t param_id = 0; param_id < params.size(); ++param_id) {
      auto &param = params_[param_id];
      SparseTensor<TypeKey> temp_sparse_tensor;
      // 給broadcast_buffer_分配記憶體
      buff->reserve({(size_t)batchsize, (size_t)param.max_feature_num}, param.slot_num,
                    &temp_sparse_tensor);
      broadcast_buffer_->sparse_buffers.push_back(temp_sparse_tensor.shrink());
      broadcast_buffer_->is_fixed_length.push_back(param.is_fixed_length);
    }
    Tensor2<float> temp_dense_tensor;
    buff->reserve({batch_size_per_gpu, label_dim + dense_dim}, &temp_dense_tensor);
    broadcast_buffer_->dense_tensors.push_back(temp_dense_tensor.shrink());

    CK_CUDA_THROW_(cudaEventCreateWithFlags(&broadcast_buffer_->finish_broadcast_events[local_id],
                                            cudaEventDisableTiming));

    for (size_t param_id = 0; param_id < params.size(); ++param_id) {
      auto &param = params_[param_id];
      SparseTensor<TypeKey> temp_sparse_tensor;
      // 預留記憶體
      buff->reserve({(size_t)batchsize, (size_t)param.max_feature_num}, param.slot_num,
                    &temp_sparse_tensor);
      output_->sparse_tensors_map[param.top_name].push_back(temp_sparse_tensor.shrink());
    }

    Tensor2<float> label_tensor;
    // 預留記憶體
    buff->reserve({batch_size_per_gpu, label_dim}, &label_tensor);
    output_->label_tensors.push_back(label_tensor.shrink());

    if (use_mixed_precision) {
      Tensor2<__half> dense_tensor;
      // 預留記憶體
      buff->reserve({(size_t)batch_size_per_gpu, (size_t)dense_dim}, &dense_tensor);
      output_->dense_tensors.push_back(dense_tensor.shrink());
    } else {
      Tensor2<float> dense_tensor;
      // 預留記憶體
      buff->reserve({(size_t)batch_size_per_gpu, (size_t)dense_dim}, &dense_tensor);
      output_->dense_tensors.push_back(dense_tensor.shrink());
    }

    buff->allocate(); // 3. 分配記憶體
  }

  // 4. 構建DataCollector     
  data_collector_ = std::make_shared<DataCollector<TypeKey>>(thread_buffers_, broadcast_buffer_,
                                                             output_, resource_manager);
  return;
}

我們接下來會仔細分一下構造程式碼之中的各個部分。

3.3.1 輔助 GeneralBuffer2

首先我們分析上面程式碼之中buffs部分,這個變數作用就是統一分配記憶體。

  // 1. 生成了一個臨時變數buffs    
  std::vector<std::shared_ptr<GeneralBuffer2<CudaAllocator>>> buffs;
  // 先預留部分容量大小     
  buffs.reserve(local_gpu_count);
  // 為每個GPU初始化一個GeneralBuffer2   
  for (size_t i = 0; i < local_gpu_count; ++i) {
    buffs.push_back(GeneralBuffer2<CudaAllocator>::create());
  }

3.3.2 ThreadBuffer

然後我們看看處理 thread_buffers_ 部分,這裡是為執行緒buffer進行處理。我們首先獲取ThreadBuffer類定義如下,後面分析時候可以比對。

struct ThreadBuffer {
  std::vector<SparseTensorBag> device_sparse_buffers;  // same number as embedding number
  std::vector<unsigned char> is_fixed_length;          // same number as embedding number
  TensorBag2 device_dense_buffers;
  std::atomic<BufferState> state;
  long long current_batch_size;
  int batch_size;
  size_t param_num;
  int label_dim;
  int dense_dim;
  int batch_size_start_idx;  // dense buffer
  int batch_size_end_idx;
};

其次,具體構建函式中的邏輯如下:

  • 首先,對於 thread_buffers_ 這個vector,會擴充 vector 容量到執行緒數大小。
  • 拿到本執行緒(或者說是本GPU)在buffs之中對應的buffer,賦值到 buff。
  • 對於每一個執行緒,會生成一個ThreadBuffer,命名為current_thread_buffer,放入到 thread_buffers_ 之中。
  • 對於每一個 ThreadBuffer,預留 ThreadBuffer 的device_sparse_buffers 和 is_fixed_length 這兩個 vector 的容量大小。
  • 遍歷sparse引數,對於每一個引數,會建立一個臨時張量,並且通過 buff 預留記憶體(CPU或者GPU),然後把此臨時張量放入device_sparse_buffers。
  • 建立一個針對dense的張量,並且通過 buff 預留張量記憶體,把臨時張量放入device_dense_buffers。
  • 設定current_thread_buffer 狀態。
  • 設定 current_thread_buffer 其他資訊。
  // 處理 thread_buffers_,會擴充 vector 容量到執行緒數大小 
  thread_buffers_.reserve(num_threads);
  for (int i = 0; i < num_threads; ++i) {  // 遍歷執行緒
    // a worker may maintain multiple buffers on device i % local_gpu_count
    auto local_gpu = resource_manager_->get_local_gpu(i % local_gpu_count);
    CudaCPUDeviceContext context(local_gpu->get_device_id());
    auto &buff = buffs[i % local_gpu_count]; // 拿到本執行緒(或者說是本GPU)在buffs之中對應的buffer
    // 生成一個ThreadBuffer,存入到thread_buffers_
    std::shared_ptr<ThreadBuffer> current_thread_buffer = std::make_shared<ThreadBuffer>();
    thread_buffers_.push_back(current_thread_buffer);

    // 預留 ThreadBuffer 的device_sparse_buffers 和 is_fixed_length 這兩個 vector 的容量大小
    current_thread_buffer->device_sparse_buffers.reserve(params.size());
    current_thread_buffer->is_fixed_length.reserve(params.size());
    
    // 遍歷引數
    for (size_t param_id = 0; param_id < params.size(); ++param_id) {
      auto &param = params_[param_id];
      SparseTensor<TypeKey> temp_sparse_tensor;
      // 建立一個臨時張量,並且預留記憶體(CPU或者GPU)
      buff->reserve({(size_t)batchsize, (size_t)param.max_feature_num}, param.slot_num,
                    &temp_sparse_tensor);
      // 把張量放入device_sparse_buffers
      current_thread_buffer->device_sparse_buffers.push_back(temp_sparse_tensor.shrink());
      current_thread_buffer->is_fixed_length.push_back(param.is_fixed_length);
    }
    
    // 建立一個針對dense的張量
    Tensor2<float> temp_dense_tensor;
    // 預留張量記憶體
    buff->reserve({batch_size_per_gpu * local_gpu_count, label_dim + dense_dim},
                  &temp_dense_tensor);
    // 把臨時張量放入device_dense_buffers
    current_thread_buffer->device_dense_buffers = temp_dense_tensor.shrink();
    // 設定狀態
    current_thread_buffer->state.store(BufferState::ReadyForWrite);
    // 設定其他資訊
    current_thread_buffer->current_batch_size = 0;
    current_thread_buffer->batch_size = batchsize;
    current_thread_buffer->param_num = params.size();
    current_thread_buffer->label_dim = label_dim;
    current_thread_buffer->dense_dim = dense_dim;
    current_thread_buffer->batch_size_start_idx =
        batch_size_per_gpu * resource_manager_->get_gpu_global_id_from_local_id(0);
    current_thread_buffer->batch_size_end_idx =
        current_thread_buffer->batch_size_start_idx + batch_size_per_gpu * local_gpu_count;
  }

此時如下,注意,DataReader 包括多個 ThreadBuffer。

3.3.3 BroadcastBuffer

接下來看看如何構建BroadcastBuffer。

BroadcastBuffer定義如下:

struct BroadcastBuffer {
  std::vector<SparseTensorBag>
      sparse_buffers;  // same number as (embedding number * local device number)
  std::vector<unsigned char> is_fixed_length;        // same number as embedding number
  std::vector<TensorBag2> dense_tensors;             // same number as local device number
  std::vector<cudaEvent_t> finish_broadcast_events;  // same number as local device number
  std::atomic<BufferState> state;
  long long current_batch_size;
  size_t param_num;
};

按照構建程式碼來說,這裡只是做了一些預留和設定,沒有涉及記憶體,記憶體在後續會統一處理。

  // 處理 broadcast buffer      
  // 預留vector的容量
  broadcast_buffer_->sparse_buffers.reserve(local_gpu_count * params.size());
  // 預留vector的容量
  broadcast_buffer_->is_fixed_length.reserve(local_gpu_count * params.size());
  // 預留vector的容量
  broadcast_buffer_->dense_tensors.reserve(local_gpu_count);
  broadcast_buffer_->finish_broadcast_events.resize(local_gpu_count);
  // 設定狀態
  broadcast_buffer_->state.store(BufferState::ReadyForWrite);
  broadcast_buffer_->current_batch_size = 0;
  broadcast_buffer_->param_num = params.size();

3.3.4 DataReaderOutput

我們接著看看如何構建DataReaderOutput。

struct DataReaderOutput {
  std::map<std::string, std::vector<SparseTensorBag>> sparse_tensors_map;
  std::vector<std::string> sparse_name_vec;
  std::vector<TensorBag2> label_tensors;
  std::vector<TensorBag2> dense_tensors;
  bool use_mixed_precision;
  int label_dense_dim;
};

按照構建程式碼來說,這裡只是做了一些預留和設定,沒有涉及記憶體,記憶體在後續會統一處理。

output_->dense_tensors.reserve(local_gpu_count); // 預留vector的容量
output_->label_tensors.reserve(local_gpu_count); // 預留vector的容量
output_->use_mixed_precision = use_mixed_precision;
output_->label_dense_dim = label_dim + dense_dim;
for (size_t param_id = 0; param_id < params.size(); ++param_id) {
  auto &param = params_[param_id];

  output_->sparse_tensors_map[param.top_name].reserve(local_gpu_count);
  output_->sparse_name_vec.push_back(param.top_name);
}

3.3.5 預留和分配

這裡會對 broadcast 和 output 進行預留,這裡統一分配記憶體。

for (size_t local_id = 0; local_id < local_gpu_count; ++local_id) { // 遍歷GPU
  auto local_gpu = resource_manager_->get_local_gpu(local_id);
  CudaDeviceContext ctx(local_gpu->get_device_id());
  auto &buff = buffs[local_id]; // 獲取臨時buffs之中對應某一個本地gpu的allocator

  for (size_t param_id = 0; param_id < params.size(); ++param_id) {
    auto &param = params_[param_id];
    SparseTensor<TypeKey> temp_sparse_tensor;
    // 分配sparse記憶體
    buff->reserve({(size_t)batchsize, (size_t)param.max_feature_num}, param.slot_num,
                  &temp_sparse_tensor);
    // 賦值到broadcast 之上
    broadcast_buffer_->sparse_buffers.push_back(temp_sparse_tensor.shrink());
    broadcast_buffer_->is_fixed_length.push_back(param.is_fixed_length);
  }
  // 分配dense記憶體
  Tensor2<float> temp_dense_tensor;
  buff->reserve({batch_size_per_gpu, label_dim + dense_dim}, &temp_dense_tensor);
  // 賦值到broadcast 之上
  broadcast_buffer_->dense_tensors.push_back(temp_dense_tensor.shrink());

  CK_CUDA_THROW_(cudaEventCreateWithFlags(&broadcast_buffer_->finish_broadcast_events[local_id],
                                          cudaEventDisableTiming));

  for (size_t param_id = 0; param_id < params.size(); ++param_id) {
    auto &param = params_[param_id];

    // 分配sparse記憶體
    SparseTensor<TypeKey> temp_sparse_tensor;
    buff->reserve({(size_t)batchsize, (size_t)param.max_feature_num}, param.slot_num,
                  &temp_sparse_tensor);
    // 賦值到output之上
    output_->sparse_tensors_map[param.top_name].push_back(temp_sparse_tensor.shrink());
  }

  // 分配label的記憶體
  Tensor2<float> label_tensor;
  buff->reserve({batch_size_per_gpu, label_dim}, &label_tensor);
  // 賦值到output之上
  output_->label_tensors.push_back(label_tensor.shrink());

  if (use_mixed_precision) {
    Tensor2<__half> dense_tensor;
    // 分配dense記憶體
    buff->reserve({(size_t)batch_size_per_gpu, (size_t)dense_dim}, &dense_tensor);
    // 賦值到output之上
    output_->dense_tensors.push_back(dense_tensor.shrink());
  } else {
    Tensor2<float> dense_tensor;
    // 分配dense記憶體
    buff->reserve({(size_t)batch_size_per_gpu, (size_t)dense_dim}, &dense_tensor);
    // 賦值到output之上
    output_->dense_tensors.push_back(dense_tensor.shrink());
  }

  buff->allocate(); // 統一分配
}

預留buffer的具體邏輯如下:

分配之後如下,需要注意的是,這裡都是簡化版本,沒有體現出來多個本地GPU的狀態。比如下面三個類的成員變數都會分配到多個本地GPU之上。

// embedding number 指的是本模型之中,DataReaderSparseParam 的個數,就是有幾個 embedding 層
struct ThreadBuffer {
  std::vector<SparseTensorBag> device_sparse_buffers;  // same number as embedding number
  // device_sparse_buffers 會分配在多個本地GPU之上
  
struct BroadcastBuffer {
  std::vector<SparseTensorBag>
      sparse_buffers;  // same number as (embedding number * local device number)
  // sparse_buffers 也會分配在多個本地GPU之上

struct DataReaderOutput {
  std::map<std::string, std::vector<SparseTensorBag>> sparse_tensors_map;
  // 每個 sparse_tensors_map[param.top_name] 都會分配在多個本地GPU之上
  // 比如 output_->sparse_tensors_map[param.top_name].reserve(local_gpu_count);

如下簡化版本之中都只體現了一個GPU,這些buffer都是位於GPU之上。

現在 DataReader 有了一系列buffer,我們接下來看看如何使用。

0x04 DataReaderWorkerGroup

DataReaderWorkerGroup 負責具體讀資料操作。

4.1 構建

在 create_datareader 之中,有如下程式碼建立 DataReaderWorkerGroup,分別對應了三種group。

    switch (format) {
      case DataReaderType_t::Norm: {
        train_data_reader->create_drwg_norm(source_data, check_type, start_right_now);
        evaluate_data_reader->create_drwg_norm(eval_source, check_type, start_right_now);
        break;
      }
      case DataReaderType_t::Raw: {
        train_data_reader->create_drwg_raw(source_data, num_samples, float_label_dense, true,
                                           false);
        evaluate_data_reader->create_drwg_raw(eval_source, eval_num_samples, float_label_dense,
                                              false, false);
        break;
      }
      case DataReaderType_t::Parquet: {
        train_data_reader->create_drwg_parquet(source_data, slot_offset, true);
        evaluate_data_reader->create_drwg_parquet(eval_source, slot_offset, true);
        break;
      }

我們用create_drwg_norm來繼續分析,發現其構建了DataReaderWorkerGroupNorm。即,配置了 DataReader 之中的成員變數 worker_group_ 為一個 DataReaderWorkerGroupNorm。

注意,這裡傳入的是thread_buffers_說明 DataReaderWorkerGroup 操作的就是DataReader 的 thread_buffers_

void create_drwg_norm(std::string file_name, Check_t check_type,
                      bool start_reading_from_beginning = true) override {
  source_type_ = SourceType_t::FileList;
  worker_group_.reset(new DataReaderWorkerGroupNorm<TypeKey>(
      thread_buffers_, resource_manager_, file_name, repeat_, check_type, params_,
      start_reading_from_beginning));
  file_name_ = file_name;
}

4.2 DataReaderWorkerGroup 定義

我們只看其成員變數,主要是 IDataReaderWorker,這就是具體讀資料的wroker。

class DataReaderWorkerGroup {
  std::vector<std::thread> data_reader_threads_; /**< A vector of the pointers of data reader .*/
 protected:
  int data_reader_loop_flag_{0}; /**< p_loop_flag a flag to control the loop */
  DataReaderType_t data_reader_type_;
  std::vector<std::shared_ptr<IDataReaderWorker>>
      data_readers_; /**< A vector of DataReaderWorker' pointer.*/
  std::shared_ptr<ResourceManager> resource_manager_;
}

4.3 DataReaderWorkerGroupNorm

我們使用 DataReaderWorkerGroupNorm 來分析,其最重要的是構建 DataReaderWorker 時候,設定了每個DataReaderWorker 對應哪些GPU資源

template <typename TypeKey>
class DataReaderWorkerGroupNorm : public DataReaderWorkerGroup {
  std::string file_list_; /**< file list of data set */

  std::shared_ptr<Source> create_source(size_t worker_id, size_t num_worker,
                                        const std::string &file_name, bool repeat) override {
    return std::make_shared<FileSource>(worker_id, num_worker, file_name, repeat);
  }

 public:
  // Ctor
  DataReaderWorkerGroupNorm(const std::vector<std::shared_ptr<ThreadBuffer>> &output_buffers,
                            const std::shared_ptr<ResourceManager> &resource_manager_,
                            std::string file_list, bool repeat, Check_t check_type,
                            const std::vector<DataReaderSparseParam> &params,
                            bool start_reading_from_beginning = true)
      : DataReaderWorkerGroup(start_reading_from_beginning, DataReaderType_t::Norm) {

    int num_threads = output_buffers.size();
    size_t local_gpu_count = resource_manager_->get_local_gpu_count();

    // create data reader workers
    int max_feature_num_per_sample = 0;
    for (auto &param : params) {
      max_feature_num_per_sample += param.max_feature_num;
    }

    set_resource_manager(resource_manager_);
    for (int i = 0; i < num_threads; i++) {
      std::shared_ptr<IDataReaderWorker> data_reader(new DataReaderWorker<TypeKey>(
          // 這裡設定了每個 DataReaderWorker 對應的 GPU 資源
          i, num_threads, resource_manager_->get_local_gpu(i % local_gpu_count),
          &data_reader_loop_flag_, output_buffers[i], file_list, max_feature_num_per_sample, repeat,
          check_type, params));
      data_readers_.push_back(data_reader);
    }
    create_data_reader_threads(); // 建立了多個工作執行緒
  }
};

4.4 建立執行緒

create_data_reader_threads 建立了多個工作執行緒,設定了每個執行緒對應的 GPU 資源。

  /**
   * Create threads to run data reader workers
   */
  void create_data_reader_threads() {
    size_t local_gpu_count = resource_manager_->get_local_gpu_count();

    for (size_t i = 0; i < data_readers_.size(); ++i) {
      // 這裡設定了每個執行緒對應的 GPU 資源
      auto local_gpu = resource_manager_->get_local_gpu(i % local_gpu_count);
      // 指定了執行緒主體函式
      data_reader_threads_.emplace_back(data_reader_thread_func_, data_readers_[i],
                                        &data_reader_loop_flag_, local_gpu->get_device_id());
    }
  }

4.5 執行緒主體函式

data_reader_thread_func_ 是工作執行緒的主體函式,裡面設定了本執行緒的裝置,然後呼叫了 IDataReaderWorker 完成讀取資料。

/**
 * A helper function to read data from dataset to heap in a new thread.
 * @param data_reader a pointer of data_reader.
 * @param p_loop_flag a flag to control the loop,
          and break loop when IDataReaderWorker is destroyed.
 */
static void data_reader_thread_func_(const std::shared_ptr<IDataReaderWorker>& data_reader,
                                     int* p_loop_flag, int device_id) {
  try {
    CudaCPUDeviceContext context(device_id); // 設定了本執行緒的裝置

    while ((*p_loop_flag) == 0) {
      usleep(2);
    }

    while (*p_loop_flag) {
      data_reader->read_a_batch(); // 然後開始讀取檔案資料
    }
  } catch (const std::runtime_error& rt_err) {
    std::cerr << rt_err.what() << std::endl;
  }
}

所以,這裡就設定了哪個樣本應該放到哪個卡上,例如,下面4個執行緒,分別對應了 GPU 0 和 GPU 1。

4.6 DataReaderWorker

DataReaderWorker 是解析資料的業務模組。IDataReaderWorker 是 基類,其buffer_是關鍵,其指向了ThreadBuffer。

class IDataReaderWorker {
  std::shared_ptr<Source> source_; /**< source: can be file or network */

  int worker_id_;
  int worker_num_;
  std::shared_ptr<GPUResource> gpu_resource_; // 這是本worker的GPU資源

  bool is_eof_;
  int *loop_flag_;

  std::shared_ptr<ThreadBuffer> buffer_;
  
  IDataReaderWorker(const int worker_id, const int worker_num,
                    const std::shared_ptr<GPUResource> &gpu_resource, bool is_eof, int *loop_flag,
                    const std::shared_ptr<ThreadBuffer> &buff)
      : worker_id_(worker_id),
        worker_num_(worker_num),
        gpu_resource_(gpu_resource), // 設定GPU資源
        is_eof_(is_eof),
        loop_flag_(loop_flag),
        buffer_(buff) {}  
};

DataReaderWorker 具體定義如下:

template <class T>
class DataReaderWorker : public IDataReaderWorker {
 private:
  DataSetHeader
      data_set_header_;  /**< the header of data set, which has main informations of a data file */
  size_t buffer_length_; /**< buffer size for internal use */
  Check_t check_type_;   /**< check type for data set */
  std::vector<DataReaderSparseParam> params_; /**< configuration of data reader sparse input */
  std::shared_ptr<Checker> checker_; /**< checker aim to perform error check of the input data */
  bool skip_read_{false};            /**< set to true when you want to stop the data reading */
  int current_record_index_{0};
  size_t total_slot_num_;
  std::vector<size_t> last_batch_nnz_;

  Tensor2<float> temp_host_dense_buffer_;  // read data to make checker move
  Tensor2<float> host_dense_buffer_;
  std::vector<CSR<T>> host_sparse_buffer_;
}

其構建程式碼如下,需要注意,

  • 有一個繼承於基類的變數 std::shared_ptr buffer_ 指向的是 ThreadBuffer。
  • 變數 host_sparse_buffer_ 是構建在 Host 之上,而非GPU之上,這個 host_sparse_buffer_ 作用是檔案中讀取資料,解析成csr,放置到 host_sparse_buffer_ 之上。
  • 關於變數 DataReaderSparseParam 的說明,這是一個DataReaderSparseParam 陣列,如果做如下設定,則 params_ 包含三個元素,分別對應分了 user, good, cate。
model.add(hugectr.Input(label_dim = 1, label_name = "label",
                        dense_dim = 0, dense_name = "dense",
                        data_reader_sparse_param_array =
                        [hugectr.DataReaderSparseParam("UserID", 1, True, 1),
                        hugectr.DataReaderSparseParam("GoodID", 1, True, 11),
                        hugectr.DataReaderSparseParam("CateID", 1, True, 11)]))

DataReaderWorker 具體定義如下:

DataReaderWorker(const int worker_id, const int worker_num,
                 const std::shared_ptr<GPUResource>& gpu_resource, int* loop_flag,
                 const std::shared_ptr<ThreadBuffer>& buffer, const std::string& file_list,
                 size_t buffer_length, bool repeat, Check_t check_type,
                 const std::vector<DataReaderSparseParam>& params)
    : IDataReaderWorker(worker_id, worker_num, gpu_resource, !repeat, loop_flag, buffer),
      buffer_length_(buffer_length),
      check_type_(check_type),
      params_(params),
      total_slot_num_(0),
      last_batch_nnz_(params.size(), 0) {

  total_slot_num_ = 0;
  for (auto& p : params) {
    total_slot_num_ += p.slot_num;
  }
  source_ = std::make_shared<FileSource>(worker_id, worker_num, file_list, repeat);
  create_checker();

  int batch_size = buffer->batch_size;
  int batch_size_start_idx = buffer->batch_size_start_idx;
  int batch_size_end_idx = buffer->batch_size_end_idx;
  int label_dim = buffer->label_dim;
  int dense_dim = buffer->dense_dim;

  CudaCPUDeviceContext ctx(gpu_resource->get_device_id()); // 得到了本worker對應哪個GPU
  std::shared_ptr<GeneralBuffer2<CudaHostAllocator>> buff =
      GeneralBuffer2<CudaHostAllocator>::create();

  buff->reserve({static_cast<size_t>(batch_size_end_idx - batch_size_start_idx),
                 static_cast<size_t>(label_dim + dense_dim)},
                &host_dense_buffer_);
  buff->reserve({static_cast<size_t>(label_dim + dense_dim)}, &temp_host_dense_buffer_);

  for (auto& param : params) {
    host_sparse_buffer_.emplace_back(batch_size * param.slot_num,
                                     batch_size * param.max_feature_num);
  }

  buff->allocate();
}

具體擴充如下,其中每個thread裡面含有一個worker:

或者我們進一步簡化幾個記憶體類,得到如下,DataReaderWorker 操作 DataReader 之中的一個 ThreadBuffer,

4.7 讀取資料

Reader構建時候,會建立一個 checker_,用來從檔案讀取資料。

4.7.1 Checker

void create_checker() {
  switch (check_type_) {
    case Check_t::Sum:
      checker_ = std::make_shared<CheckSum>(*source_);
      break;
    case Check_t::None:
      checker_ = std::make_shared<CheckNone>(*source_);
      break;
    default:
      assert(!"Error: no such Check_t && should never get here!!");
  }
}

以 CheckNone 為例,可以看到其就是讀取檔案。

class CheckNone : public Checker {
 private:
  const int MAX_TRY{10};

 public:
  CheckNone(Source& src) : Checker(src) {}
  /**
   * Read "bytes_to_read" byte to the memory associated to ptr.
   * Users don't need to manualy maintain the check bit offset, just specify
   * number of bytes you really want to see in ptr.
   * @param ptr pointer to user located buffer
   * @param bytes_to_read bytes to read
   * @return `DataCheckError` `OutOfBound` `Success` `UnspecificError`
   */
  Error_t read(char* ptr, size_t bytes_to_read) noexcept {
    try {
      Checker::src_.read(ptr, bytes_to_read);
      return Error_t::Success;
    } catch (const std::runtime_error& rt_err) {
      std::cerr << rt_err.what() << std::endl;
      return Error_t::BrokenFile;
    }
  }

  /**
   * Start a new file to read.
   * @return `FileCannotOpen` or `UnspecificError`
   */
  Error_t next_source() {
    for (int i = MAX_TRY; i > 0; i--) {
      Error_t flag_eof = Checker::src_.next_source();
      if (flag_eof == Error_t::Success || flag_eof == Error_t::EndOfFile) {
        return flag_eof;
      }
    }
    CK_THROW_(Error_t::FileCannotOpen, "Checker::src_.next_source() == Error_t::Success failed");
    return Error_t::FileCannotOpen;  // to elimate compile error
  }
};

4.7.2 CSR 樣例

我們從 samples/ncf/preprocess-1m.py 之中找出一個程式碼來看看 csr 檔案的格式。

def write_hugeCTR_data(huge_ctr_data, filename='huge_ctr_data.dat'):
    with open(filename, 'wb') as f:
        #write header
        f.write(ll(0)) # 0: no error check; 1: check_num
        f.write(ll(huge_ctr_data.shape[0])) # the number of samples in this data file
        f.write(ll(1)) # dimension of label
        f.write(ll(1)) # dimension of dense feature
        f.write(ll(2)) # long long slot_num
        for _ in range(3): f.write(ll(0)) # reserved for future use

        for i in tqdm.tqdm(range(huge_ctr_data.shape[0])):
            f.write(c_float(huge_ctr_data[i,2])) # float label[label_dim];
            f.write(c_float(0)) # dummy dense feature
            f.write(c_int(1)) # slot 1 nnz: user ID
            f.write(c_uint(huge_ctr_data[i,0]))
            f.write(c_int(1)) # slot 2 nnz: item ID
            f.write(c_uint(huge_ctr_data[i,1]))

4.7.3 讀取批次資料

read_a_batch 完成具體解析資料集工作。

  • 首先從檔案讀取資料。
  • 等待 ThreadBuffer(就是DataReader的thread_buffers_成員變數)的狀態變成ReadyForWrite。
  • 解析成csr,放入到 host_dense_buffer_。
  • 呼叫 wait_until_h2d_ready 等待拷貝完成。
  • 其次呼叫cudaMemcpyAsync把資料從 host_dense_buffer_ 拷貝到 ThreadBuffer 之中。這裡有兩點很重要:
    • 目前資料在 host_sparse_buffer_(CPU)之上,需要拷貝到 GPU(目標是 ThreadBuffer 的 device_sparse_buffers 成員變數)。
    • 而且,host_sparse_buffer_ 是 CSR 格式,ThreadBuffer 的 device_sparse_buffers 成員變數是SparseTensor格式,需要轉換。
    • 這裡是通過拷貝就進行了轉換。

有幾點如下:

  • nnz 的意思是:non-zero feature number。
  • 每一個slot資料對應了一個CSR row。

具體程式碼如下:

  /**
   * read a batch of data from data set to heap.
   */
  void read_a_batch() {
    // 得到各種配置
    long long current_batch_size = buffer_->batch_size;
    int label_dim = buffer_->label_dim;
    int dense_dim = buffer_->dense_dim;
    int label_dense_dim = label_dim + dense_dim;
    int batch_size_start_idx = buffer_->batch_size_start_idx;
    int batch_size_end_idx = buffer_->batch_size_end_idx;

    try {
      if (!checker_->is_open()) {
        read_new_file(); // 讀一個新檔案
      }
    } catch (const internal_runtime_error& rt_err) {
      Error_t err = rt_err.get_error();
      if (err == Error_t::EndOfFile) { // 檔案讀完了
        if (!wait_until_h2d_ready()) return;  // 等待 buffer_ 狀態變為 ReadyForWrite
        buffer_->current_batch_size = 0;
        assert(buffer_->state.load() == BufferState::Writing); // 設定
        is_eof_ = true;
        buffer_->state.store(BufferState::ReadyForRead); // 設定狀態為可讀

        while (buffer_->state.load() != BufferState::ReadyForWrite) {
          usleep(2);
          if (*loop_flag_ == 0) return;  // in case main thread exit
        }
        return;  // need this return to run from begining
      } else {
        throw;
      }
    }

    // if the EOF is faced, the current batch size can be changed later
    
    for (auto& each_csr : host_sparse_buffer_) {
      each_csr.reset();
    }
    // batch loop
    for (int batch_idx = 0; batch_idx < buffer_->batch_size; ++batch_idx) {//讀取batch中一個
      if (batch_idx >= current_batch_size) { // 如果已經讀取batch之中的全部資料了
        for (size_t param_id = 0; param_id < params_.size(); ++param_id) { // 多個embedding
          // 如果是前面那個例子,這裡遍歷的就是user, good, cate
          auto& param = params_[param_id];
          // host_sparse_buffer_型別是std::vector<CSR<T>>
          auto& current_csr = host_sparse_buffer_[param_id]; 
          for (int k = 0; k < param.slot_num; k++) { // slot數目就是行數
            current_csr.new_row(); // 增加一行
          }
        }
        if (batch_idx >= batch_size_start_idx &&
            batch_idx < batch_size_end_idx) {  // only read local device dense data
          // 設定dense
          float* ptr =
              host_dense_buffer_.get_ptr() + (batch_idx - batch_size_start_idx) * label_dense_dim;

          for (int j = 0; j < label_dense_dim; j++) {
            ptr[j] = 0.f;
          }
        }
        continue;
      }
      try {
        try {
          if (batch_idx >= batch_size_start_idx &&
              batch_idx < batch_size_end_idx) {  // only read local device dense data
            // 讀取dense引數
            CK_THROW_(checker_->read(reinterpret_cast<char*>(host_dense_buffer_.get_ptr() +
                                                             (batch_idx - batch_size_start_idx) *
                                                                 label_dense_dim),
                                     sizeof(float) * label_dense_dim),
                      "failure in reading label_dense");
          } else {
            // 讀取dense引數
            CK_THROW_(checker_->read(reinterpret_cast<char*>(temp_host_dense_buffer_.get_ptr()),
                                     sizeof(float) * label_dense_dim),
                      "failure in reading label_dense");
          }

          for (size_t param_id = 0; param_id < params_.size(); ++param_id) {
            auto& current_csr = host_sparse_buffer_[param_id];
            current_csr.set_check_point();
          }
          // 讀取sparse引數
          for (size_t param_id = 0; param_id < params_.size(); ++param_id) {
            auto& param = params_[param_id];
            auto& current_csr = host_sparse_buffer_[param_id];
            for (int k = 0; k < param.slot_num; k++) {
              int nnz; // 讀取一個int到nnz,就是得到nnz的大小,non-zero feature number
              CK_THROW_(checker_->read(reinterpret_cast<char*>(&nnz), sizeof(int)),
                        "failure in reading nnz");
              current_csr.new_row(); // 換行
              size_t num_value = current_csr.get_num_values();
              // 讀取nnz個資料
              CK_THROW_(checker_->read(reinterpret_cast<char*>(
                                           current_csr.get_value_tensor().get_ptr() + num_value),
                                       sizeof(T) * nnz),
                        "failure in reading feature_ids_");
              current_csr.update_value_size(nnz);
            }
          }
        } catch (const internal_runtime_error& rt_err) { // 回退
          batch_idx--;  // restart i-th sample
          for (auto& each_csr : host_sparse_buffer_) {
            each_csr.roll_back();
          }
          Error_t err = rt_err.get_error();
          if (err == Error_t::DataCheckError) {
            ERROR_MESSAGE_("Error_t::DataCheckError");
          } else {            // Error_t::BrokenFile, Error_t::UnspecificEror, ...
            read_new_file();  // can throw Error_t::EOF
          }
        }

        current_record_index_++;

        // start a new file when finish one file read
        if (current_record_index_ >= data_set_header_.number_of_records) {
          read_new_file();  // can throw Error_t::EOF
        }
      } catch (const internal_runtime_error& rt_err) {
        Error_t err = rt_err.get_error();
        if (err == Error_t::EndOfFile) {
          current_batch_size = batch_idx + 1;
        } else {
          throw;
        }
      }
    }

    for (auto& each_csr : host_sparse_buffer_) {
      each_csr.new_row();
    }
    
    // do h2d
    // wait buffer and schedule
		// 目前資料在 host_sparse_buffer_(CPU)之上,需要拷貝到 GPU(目標是 ThreadBuffer 的 device_sparse_buffers 成員變數),使用 cudaMemcpyHostToDevice
    // 而且,host_sparse_buffer_ 是 CSR<T> 格式,ThreadBuffer 的 device_sparse_buffers 成員變數是SparseTensor<T>格式,需要轉換
    if (!wait_until_h2d_ready()) return;
    buffer_->current_batch_size = current_batch_size;
    {
      CudaCPUDeviceContext context(gpu_resource_->get_device_id());
      // 目標是 ThreadBuffer 的 device_sparse_buffers 成員變數
      auto dst_dense_tensor = Tensor2<float>::stretch_from(buffer_->device_dense_buffers);
      CK_CUDA_THROW_(cudaMemcpyAsync(dst_dense_tensor.get_ptr(), host_dense_buffer_.get_ptr(),
                                     host_dense_buffer_.get_size_in_bytes(), cudaMemcpyHostToDevice,
                                     gpu_resource_->get_memcpy_stream()));

      for (size_t param_id = 0; param_id < params_.size(); ++param_id) { // 遍歷嵌入層
        auto dst_sparse_tensor =
            SparseTensor<T>::stretch_from(buffer_->device_sparse_buffers[param_id]);
        if (buffer_->is_fixed_length[param_id] &&
            last_batch_nnz_[param_id] == host_sparse_buffer_[param_id].get_num_values()) {
          // 拷貝到GPU,同時也進行了轉換,提取了CSR的成員變數,拷貝到了SparseTensor的對應地址
          CK_CUDA_THROW_(cudaMemcpyAsync(dst_sparse_tensor.get_value_ptr(),
                                         host_sparse_buffer_[param_id].get_value_tensor().get_ptr(),
                                         host_sparse_buffer_[param_id].get_num_values() * sizeof(T),
                                         cudaMemcpyHostToDevice,
                                         gpu_resource_->get_memcpy_stream()));
        } else {
          // 拷貝到GPU
          sparse_tensor_helper::cuda::copy_async(dst_sparse_tensor, host_sparse_buffer_[param_id],
                                                 gpu_resource_->get_memcpy_stream());
          last_batch_nnz_[param_id] = host_sparse_buffer_[param_id].get_num_values();
        }
      }
      // 進行同步
      CK_CUDA_THROW_(cudaStreamSynchronize(gpu_resource_->get_memcpy_stream()));
    }
    assert(buffer_->state.load() == BufferState::Writing);
    buffer_->state.store(BufferState::ReadyForRead);
  }
};
4.7.3.1 等待

這裡wait_until_h2d_ready會等待。

bool wait_until_h2d_ready() {
  BufferState expected = BufferState::ReadyForWrite;
  while (!buffer_->state.compare_exchange_weak(expected, BufferState::Writing)) {
    expected = BufferState::ReadyForWrite;
    usleep(2);
    if (*loop_flag_ == 0) return false;  // in case main thread exit
  }
  return true;
}
4.7.3.2 讀取檔案

read_new_file 完成了對檔案的讀取。

void read_new_file() {
  constexpr int MAX_TRY = 10;
  for (int i = 0; i < MAX_TRY; i++) {
    if (checker_->next_source() == Error_t::EndOfFile) {
      throw internal_runtime_error(Error_t::EndOfFile, "EndOfFile");
    }

    Error_t err =
        checker_->read(reinterpret_cast<char*>(&data_set_header_), sizeof(DataSetHeader));
    current_record_index_ = 0;
    if (!(data_set_header_.error_check == 0 && check_type_ == Check_t::None) &&
        !(data_set_header_.error_check == 1 && check_type_ == Check_t::Sum)) {
      ERROR_MESSAGE_("DataHeaderError");
      continue;
    }
    if (static_cast<size_t>(data_set_header_.slot_num) != total_slot_num_) {
      ERROR_MESSAGE_("DataHeaderError");
      continue;
    }
    if (err == Error_t::Success) {
      return;
    }
  }
  CK_THROW_(Error_t::BrokenFile, "failed to read a file");
}

4.7.4 小結

我們總結邏輯如下,執行緒一直呼叫 data_reader_thread_func_ 來迴圈讀取:

另外一個邏輯視角是:

  1. 多執行緒呼叫 data_reader_thread_func_,其使用 read_a_batch 從資料檔案之中讀取資料解析為CSR。每一個embedding層 對應一個CSR。
  2. CSR 被放入 DataReaderWorker 的 host_sparse_buffer_。
  3. 隨著batch不斷讀取,CSR 行數在不斷增加,每一個slot對應了一行,所以一個batch的行數就是 batch_size * slot_num。
  4. 使用 cudaMemcpyAsync 把CSR從 host_sparse_buffer_ 拷貝到ThreadBuffer(位於GPU)。ThreadBuffer是 SparseTensor 型別了。
  5. 目前CSR資料就在 GPU 之上了

這裡簡化了多GPU,多worker 的情況。

0x05 讀取到embedding

我們接下來看看 DataCollector,就是流水線的第二級,就是這裡的黃色框 "Copy to GPU"。其實其內部文字修改為:Copy To Embedding 更合適。

此圖顯示了“讀取檔案”、“複製到 GPU”和“訓練”階段如何重疊三個批次以提高 GPU 資源利用率。

5.1 DataCollector

我們首先看看DataCollector的定義,這裡省略了成員函式,主要成員變數是。

  • std::shared_ptr broadcast_buffer_ : CPU 資料拷貝到 GPU 之上,GPU 上就在這裡。
  • std::shared_ptr output_buffer_ :這個就是 DataReaderOutput,就是 Reader 的成員變數,複製到這裡是為了 collector 操作方便
  • BackgroundDataCollectorThread background_collector_ :執行緒主體,主要包括 ThreadBuffer 和 BroadcastBuffer,會把資料從 ThreadBuffer 拷貝到 BroadcastBuffer 之上
  • std::thread background_collector_thread_ :工作執行緒。
/**
 * @brief A helper class of data reader.
 *
 * This class implement asynchronized data collecting from heap
 * to output of data reader, thus data collection and training
 * can work in a pipeline.
 */
template <typename T>
class DataCollector {
  
  class BackgroundDataCollectorThread {
    std::vector<std::shared_ptr<ThreadBuffer>> thread_buffers_;
    std::shared_ptr<BroadcastBuffer> broadcast_buffer_;

    std::atomic<bool> loop_flag_;
    int counter_;
    std::vector<size_t> last_batch_nnz_;  // local_gpu_count * embedding number
    std::vector<char> worker_status_;
    int eof_worker_num_;

    std::shared_ptr<ResourceManager> resource_manager_;
  }
  
  std::shared_ptr<BroadcastBuffer> broadcast_buffer_;
  std::shared_ptr<DataReaderOutput> output_buffer_;

  BackgroundDataCollectorThread background_collector_;
  std::thread background_collector_thread_;

  std::atomic<bool> loop_flag_;
  std::vector<size_t> last_batch_nnz_;

  std::shared_ptr<ResourceManager> resource_manager_;
};

目前具體如下,Collector 之中的 broadcast_buffer_ 和 output_buffer_ 都指向了GPU,但GPU之中尚且沒有資料:

5.2 ThreadBuffer 2 BroadBuffer

5.2.1 工作執行緒

BackgroundDataCollectorThread 的作用是把資料從 DataReader 的thread_buffers_拷貝到 broadcast_buffer_

class BackgroundDataCollectorThread {
  std::vector<std::shared_ptr<ThreadBuffer>> thread_buffers_;
  std::shared_ptr<BroadcastBuffer> broadcast_buffer_;

  std::atomic<bool> loop_flag_;
  int counter_;
  std::vector<size_t> last_batch_nnz_;  // local_gpu_count * embedding number
  std::vector<char> worker_status_;
  int eof_worker_num_;

  std::shared_ptr<ResourceManager> resource_manager_;

 public:
  BackgroundDataCollectorThread(const std::vector<std::shared_ptr<ThreadBuffer>> &thread_buffers,
                                const std::shared_ptr<BroadcastBuffer> &broadcast_buffer,
                                const std::shared_ptr<ResourceManager> &resource_manager)
      : thread_buffers_(thread_buffers),
        broadcast_buffer_(broadcast_buffer),
        loop_flag_{true},
        counter_{0},
        last_batch_nnz_(
            broadcast_buffer->is_fixed_length.size() * resource_manager->get_local_gpu_count(),
            0),
        worker_status_(thread_buffers.size(), 0),
        eof_worker_num_(0),
        resource_manager_(resource_manager) {}
  
  void start() {
    
    while (loop_flag_.load()) {
      // threadbuffer是源資料,broadcast buffer是目標資料
      auto &current_src_buffer = thread_buffers_[counter_];
      auto &dst_buffer = broadcast_buffer_;
      auto src_expected = BufferState::ReadyForRead; // 期望源資料是這個狀態
      auto dst_expected = BufferState::ReadyForWrite; // 期望目標資料是這個狀態

      if (worker_status_[counter_]) {
        counter_ = (counter_ + 1) % thread_buffers_.size();
        continue;
      }

      if ((current_src_buffer->state.load() == BufferState::Reading ||
           current_src_buffer->state.compare_exchange_weak(src_expected, BufferState::Reading)) &&
          (dst_buffer->state.load() == BufferState::Writing ||
           dst_buffer->state.compare_exchange_weak(dst_expected, BufferState::Writing))) {

        // 如果源資料是可讀或者正在讀,並且,目標資料是可寫或者正在寫,則可以操作
        
        if (current_src_buffer->current_batch_size == 0) {
          worker_status_[counter_] = 1;
          eof_worker_num_ += 1;
          current_src_buffer->state.store(BufferState::FileEOF);
        }
        if (static_cast<size_t>(eof_worker_num_) != thread_buffers_.size() &&
            current_src_buffer->current_batch_size == 0) {
          counter_ = (counter_ + 1) % thread_buffers_.size();
          dst_buffer->state.store(BufferState::ReadyForWrite); // 設定目標資料的狀態
          continue;
        }
        dst_buffer->current_batch_size = current_src_buffer->current_batch_size;
        if (current_src_buffer->current_batch_size != 0) {
          // 進行廣播操作
          broadcast<T>(current_src_buffer, dst_buffer, last_batch_nnz_, resource_manager_);

          current_src_buffer->state.store(BufferState::ReadyForWrite); // 設定目標資料的狀態
          counter_ = (counter_ + 1) % thread_buffers_.size();
        } else {
          memset(worker_status_.data(), 0, sizeof(char) * worker_status_.size());
          eof_worker_num_ = 0;
          counter_ = 0;
        }

        dst_buffer->state.store(BufferState::ReadyForRead); // 會通知源資料可以繼續讀取了
      } else {
        usleep(2); // 否則等待一會
      }
    }
  }

  void stop() { loop_flag_.store(false); }
};

5.2.2 拷貝操作

這裡就是從源資料拷貝到目標資料,並且是逐個引數進行拷貝。這個是裝置之內的拷貝。

template <typename T>
void broadcast(const std::shared_ptr<ThreadBuffer>& thread_buffer,
               std::shared_ptr<BroadcastBuffer>& broadcast_buffer,
               std::vector<size_t>& last_batch_nnz_,
               const std::shared_ptr<ResourceManager>& resource_manager) {
  int param_num = thread_buffer->param_num;
  int dense_dim = thread_buffer->dense_dim;
  int label_dim = thread_buffer->label_dim;
  int batch_size = thread_buffer->batch_size;
  int batch_size_per_gpu = batch_size / resource_manager->get_global_gpu_count();
  int local_gpu_count = resource_manager->get_local_gpu_count();

#pragma omp parallel for num_threads(local_gpu_count)
  for (int i = 0; i < local_gpu_count; ++i) { // 遍歷本地的GPU
    
    auto local_gpu = resource_manager->get_local_gpu(i);
    CudaDeviceContext ctx(local_gpu->get_device_id());

    for (int param_id = 0; param_id < param_num; ++param_id) { // 遍歷嵌入層
      // 從 thread_buffer 拷貝到 broadcast_buffer
      auto src_sparse_tensor =
          SparseTensor<T>::stretch_from(thread_buffer->device_sparse_buffers[param_id]);
      auto dst_sparse_tensor =
          SparseTensor<T>::stretch_from(broadcast_buffer->sparse_buffers[i * param_num + param_id]);

      // 拷貝sparse引數
      if (thread_buffer->is_fixed_length[param_id] &&
          last_batch_nnz_[i * param_num + param_id] == src_sparse_tensor.nnz()) {
        CK_CUDA_THROW_(cudaMemcpyAsync(dst_sparse_tensor.get_value_ptr(),
                                       src_sparse_tensor.get_value_ptr(),
                                       src_sparse_tensor.nnz() * sizeof(T),
                                       cudaMemcpyDeviceToDevice, local_gpu->get_p2p_stream()));
      } else {
        sparse_tensor_helper::cuda::copy_async(dst_sparse_tensor, src_sparse_tensor,
                                               cudaMemcpyDeviceToDevice,
                                               local_gpu->get_p2p_stream());
        last_batch_nnz_[i * param_num + param_id] = src_sparse_tensor.nnz();
      }
    }

    // 拷貝dense引數
    auto dst_dense_tensor = Tensor2<float>::stretch_from(broadcast_buffer->dense_tensors[i]);
    auto src_dense_tensor = Tensor2<float>::stretch_from(thread_buffer->device_dense_buffers);
    CK_CUDA_THROW_(cudaMemcpyAsync(
        dst_dense_tensor.get_ptr(),
        src_dense_tensor.get_ptr() + i * batch_size_per_gpu * (label_dim + dense_dim),
        batch_size_per_gpu * (label_dim + dense_dim) * sizeof(float), cudaMemcpyDeviceToDevice,
        local_gpu->get_p2p_stream()));
    
    // 同步
    CK_CUDA_THROW_(cudaStreamSynchronize(local_gpu->get_p2p_stream()));
  }
}

邏輯如下,多了一步從 ThreadBuffer 到 BroadcastBuffer 的操作。

5.3 讀取到output

目前的流程是:DataFile ---> Host buffer ----> ThreadBuffer ----> BroadcastBuffer。

現在資料已經拷貝到了 GPU 之上的 BroadcastBuffer,我們需要看看最後訓練時候怎麼拿到資料。

5.3.1 Train

我們首先回到 train 函式,其呼叫了 read_a_batch_to_device_delay_release 來從 BroadcastBuffer 拷貝資料。

bool Session::train() {
  try {
    // 確保 train_data_reader_ 已經啟動
    if (train_data_reader_->is_started() == false) {
      CK_THROW_(Error_t::IllegalCall,
                "Start the data reader first before calling Session::train()");
    }

#ifndef DATA_READING_TEST
    // 需要 reader 先讀取一個 batchsize 的資料。
    long long current_batchsize = train_data_reader_->read_a_batch_to_device_delay_release(); // 讀取資料
    if (!current_batchsize) {
      return false; // 讀不到就退出,沒有資料了
    }
    #pragma omp parallel num_threads(networks_.size()) //其後語句將被networks_.size()個執行緒並行執行
    { 
      
      size_t id = omp_get_thread_num();
      CudaCPUDeviceContext ctx(resource_manager_->get_local_gpu(id)->get_device_id());
      cudaStreamSynchronize(resource_manager_->get_local_gpu(id)->get_stream());
    }
    // reader 可以開始解析資料
    train_data_reader_->ready_to_collect();
#ifdef ENABLE_PROFILING
    global_profiler.iter_check();
#endif

    // If true we're gonna use overlaping, if false we use default
    if (solver_config_.use_overlapped_pipeline) {
      train_overlapped();
    } else {
      for (const auto& one_embedding : embeddings_) {
        one_embedding->forward(true); // 嵌入層進行前向傳播,即從引數伺服器讀取embedding,進行處理
      }

      // Network forward / backward
      if (networks_.size() > 1) {
        // 單機多卡或多機多卡
        // execute dense forward and backward with multi-cpu threads
        #pragma omp parallel num_threads(networks_.size())
        {
          // dense網路的前向反向
          size_t id = omp_get_thread_num();
          long long current_batchsize_per_device =
              train_data_reader_->get_current_batchsize_per_device(id);
          networks_[id]->train(current_batchsize_per_device); // 前向操作
          const auto& local_gpu = resource_manager_->get_local_gpu(id);
          local_gpu->set_compute_event_sync(local_gpu->get_stream());
          local_gpu->wait_on_compute_event(local_gpu->get_comp_overlap_stream());
        }
      } else if (resource_manager_->get_global_gpu_count() > 1) {
        // 多機單卡
        long long current_batchsize_per_device =
            train_data_reader_->get_current_batchsize_per_device(0);
        networks_[0]->train(current_batchsize_per_device); // 前向操作
        const auto& local_gpu = resource_manager_->get_local_gpu(0);
        local_gpu->set_compute_event_sync(local_gpu->get_stream());
        local_gpu->wait_on_compute_event(local_gpu->get_comp_overlap_stream());
      } else {
        // 單機單卡
        long long current_batchsize_per_device =
            train_data_reader_->get_current_batchsize_per_device(0);
        networks_[0]->train(current_batchsize_per_device); // 前向操作
        const auto& local_gpu = resource_manager_->get_local_gpu(0);
        local_gpu->set_compute_event_sync(local_gpu->get_stream());
        local_gpu->wait_on_compute_event(local_gpu->get_comp_overlap_stream());
        networks_[0]->update_params();
      }

      // Embedding backward
      for (const auto& one_embedding : embeddings_) {
        one_embedding->backward(); // 嵌入層反向操作
      }

      // Exchange wgrad and update params
      if (networks_.size() > 1) {
        #pragma omp parallel num_threads(networks_.size())
        {
          size_t id = omp_get_thread_num();
          exchange_wgrad(id); // 多卡之間交換dense引數的梯度
          networks_[id]->update_params();
        }
      } else if (resource_manager_->get_global_gpu_count() > 1) {
        exchange_wgrad(0);
        networks_[0]->update_params(); 
      } 
      for (const auto& one_embedding : embeddings_) {
        one_embedding->update_params(); // 嵌入層更新sparse引數
      }

      // Join streams
      if (networks_.size() > 1) {
        #pragma omp parallel num_threads(networks_.size())
        {
          size_t id = omp_get_thread_num();
          const auto& local_gpu = resource_manager_->get_local_gpu(id);
          local_gpu->set_compute2_event_sync(local_gpu->get_comp_overlap_stream());
          local_gpu->wait_on_compute2_event(local_gpu->get_stream());
        }
      }
      else {
        const auto& local_gpu = resource_manager_->get_local_gpu(0);
        local_gpu->set_compute2_event_sync(local_gpu->get_comp_overlap_stream());
        local_gpu->wait_on_compute2_event(local_gpu->get_stream());
      }
      return true;
    }
#else
      data_reader_->read_a_batch_to_device();
#endif

  } catch (const internal_runtime_error& err) {
    std::cerr << err.what() << std::endl;
    throw err;
  } catch (const std::exception& err) {
    std::cerr << err.what() << std::endl;
    throw err;
  }
  return true;
}

5.3.2 read_a_batch_to_device_delay_release

read_a_batch_to_device_delay_release 是最終配置好embedding資料的地方。

long long read_a_batch_to_device_delay_release() override {
  current_batchsize_ = data_collector_->read_a_batch_to_device();
  return current_batchsize_;
}

我們看看 read_a_batch_to_device。這裡 read_a_batch_to_device_delay_release 和 read_a_batch_to_device 是沿用舊版本命名,已經和目前狀況不符合。

具體邏輯是:看看 broadcast_buffer_ 的狀態是不是可以讀取 ReadyForRead,如果不可以,就等一會。如果可以,就繼續,即遍歷GPU,逐個從broadcast拷貝到output(也是裝置之間的拷貝),也對 label 和 dense 進行split。

  long long read_a_batch_to_device() {

    BufferState expected = BufferState::ReadyForRead;
    while (!broadcast_buffer_->state.compare_exchange_weak(expected, BufferState::Reading)) {
      expected = BufferState::ReadyForRead;
      usleep(2);
    }
    long long current_batch_size = broadcast_buffer_->current_batch_size;
    if (current_batch_size != 0) {
      int local_gpu_count = resource_manager_->get_local_gpu_count();

#pragma omp parallel for num_threads(local_gpu_count)
      for (int i = 0; i < local_gpu_count; ++i) {
        auto local_gpu = resource_manager_->get_local_gpu(i);
        CudaDeviceContext ctx(local_gpu->get_device_id());

        // wait until last iteration finish
        auto label_tensor = Tensor2<float>::stretch_from(output_buffer_->label_tensors[i]);
        auto label_dense_tensor = Tensor2<float>::stretch_from(broadcast_buffer_->dense_tensors[i]);

        // 遍歷 sparse 引數
        for (size_t param_id = 0; param_id < output_buffer_->sparse_name_vec.size(); ++param_id) {
          const auto &top_name = output_buffer_->sparse_name_vec[param_id];
          int idx_broadcast = i * broadcast_buffer_->param_num + param_id;
          // broadcast 的是源
          auto src_sparse_tensor =
              SparseTensor<T>::stretch_from(broadcast_buffer_->sparse_buffers[idx_broadcast]);
          if (output_buffer_->sparse_tensors_map.find(top_name) ==
              output_buffer_->sparse_tensors_map.end()) {
            CK_THROW_(Error_t::IllegalCall, "can not find sparse name");
          }
          // output是目標
          auto dst_sparse_tensor =
              SparseTensor<T>::stretch_from(output_buffer_->sparse_tensors_map[top_name][i]);

          // 從broadcast拷貝到output
          if (broadcast_buffer_->is_fixed_length[idx_broadcast] &&
              last_batch_nnz_[idx_broadcast] == src_sparse_tensor.nnz()) {
            CK_CUDA_THROW_(cudaMemcpyAsync(dst_sparse_tensor.get_value_ptr(),
                                           src_sparse_tensor.get_value_ptr(),
                                           src_sparse_tensor.nnz() * sizeof(T),
                                           cudaMemcpyDeviceToDevice, local_gpu->get_stream()));
          } else {
            // 從broadcast拷貝到output
            sparse_tensor_helper::cuda::copy_async(dst_sparse_tensor, src_sparse_tensor,
                                                   cudaMemcpyDeviceToDevice,
                                                   local_gpu->get_stream());
            last_batch_nnz_[idx_broadcast] = src_sparse_tensor.nnz();
          }
        }
        const int label_dense_dim = output_buffer_->label_dense_dim;

        // 拷貝label和dense
        if (output_buffer_->use_mixed_precision) {
          auto dense_tensor = Tensor2<__half>::stretch_from(output_buffer_->dense_tensors[i]);
          // 進行分塊
          split(label_tensor, dense_tensor, label_dense_tensor, label_dense_dim,
                local_gpu->get_stream());
        } else {
          auto dense_tensor = Tensor2<float>::stretch_from(output_buffer_->dense_tensors[i]);
          split(label_tensor, dense_tensor, label_dense_tensor, label_dense_dim,
                local_gpu->get_stream());
        }
      }
    } else {
      broadcast_buffer_->state.store(BufferState::ReadyForWrite);
    }
    return current_batch_size;
  }

5.3.3 split

label 和 dense 早已經拷貝到了GPU之上,這步做的是分成block,然後使用 GPU thread 進行操作。

template <typename TypeComp>
__global__ void split_kernel__(int batchsize, float* label_ptr, int label_dim, TypeComp* dense_ptr,
                               int dense_dim, const float* label_dense, int label_dense_dim) {
  int idx = blockDim.x * blockIdx.x + threadIdx.x;
  if (idx < batchsize * label_dense_dim) {
    const int in_col = idx % label_dense_dim;
    const int in_row = idx / label_dense_dim;
    const int out_row = in_row;
    if (in_col < label_dim) {
      const int out_col = in_col;
      label_ptr[out_row * label_dim + out_col] = label_dense[idx];
    } else {
      const int out_col = in_col - label_dim;
      dense_ptr[out_row * dense_dim + out_col] = label_dense[idx];
    }
  }
  return;
}

template <typename TypeComp>
void split(Tensor2<float>& label_tensor, Tensor2<TypeComp>& dense_tensor,
           const Tensor2<float>& label_dense_buffer, const int label_dense_dim,
           cudaStream_t stream) {
  // check the input size
  assert(label_tensor.get_dimensions()[0] == dense_tensor.get_dimensions()[0]);
  assert(label_tensor.get_num_elements() + dense_tensor.get_num_elements() ==
         label_dense_buffer.get_num_elements());

  const int batchsize = label_tensor.get_dimensions()[0];
  const int label_dim = label_tensor.get_dimensions()[1];
  const int dense_dim = dense_tensor.get_dimensions()[1];
  const int BLOCK_DIM = 256;
  const int GRID_DIM = (label_dense_buffer.get_num_elements() - 1) / BLOCK_DIM + 1;

  if (dense_dim > 0) {
    split_kernel__<<<GRID_DIM, BLOCK_DIM, 0, stream>>>(
        batchsize, label_tensor.get_ptr(), label_dim, dense_tensor.get_ptr(), dense_dim,
        label_dense_buffer.get_ptr(), label_dense_dim);
  } else if (dense_dim == 0) {
    split_kernel__<<<GRID_DIM, BLOCK_DIM, 0, stream>>>(
        batchsize, label_tensor.get_ptr(), label_dim, (TypeComp*)0, 0, label_dense_buffer.get_ptr(),
        label_dense_dim);

  } else {
    CK_THROW_(Error_t::WrongInput, "dense_dim < 0");
  }

  return;
}

這樣後續就可以訓練了,後續是通過 finalize_batch 之中進行讀取。

void finalize_batch() {
  for (size_t i = 0; i < resource_manager_->get_local_gpu_count(); i++) {
    const auto &local_gpu = resource_manager_->get_local_gpu(i);
    CudaDeviceContext context(local_gpu->get_device_id());
    CK_CUDA_THROW_(cudaStreamSynchronize(local_gpu->get_stream()));
  }

  broadcast_buffer_->state.store(BufferState::ReadyForWrite);
}

template <typename SparseType>
void AsyncReader<SparseType>::ready_to_collect() {
  auto raw_device_id = reader_impl_->get_last_batch_device();
  auto local_gpu = resource_manager_->get_local_gpu(raw_device_id);
  CudaDeviceContext ctx(local_gpu->get_device_id());
  CK_CUDA_THROW_(cudaEventRecord(completion_events_[raw_device_id], local_gpu->get_stream()));

  reader_impl_->finalize_batch(&completion_events_[raw_device_id]);
}

0x06 總結

具體邏輯如下,本章節之中,各個buffer之間拷貝,是依據其狀態是 ReadyForRead 和 ReadyForWrite 來完成的。最終sparse 引數的embedding是在DataReaderOutput,即後續 GPU 上的計算是從output開始的。

0xFF 參考

https://developer.nvidia.com/blog/introducing-merlin-hugectr-training-framework-dedicated-to-recommender-systems/

https://developer.nvidia.com/blog/announcing-nvidia-merlin-application-framework-for-deep-recommender-systems/

https://developer.nvidia.com/blog/accelerating-recommender-systems-training-with-nvidia-merlin-open-beta/

HugeCTR原始碼閱讀

embedding層如何反向傳播

https://web.eecs.umich.edu/~justincj/teaching/eecs442/notes/linear-backprop.html

稀疏矩陣儲存格式總結+儲存效率對比:COO,CSR,DIA,ELL,HYB

相關文章