How exactly is W4A16 or W8A16 computed in GPTQ?

Posted by 沉淀fc on 2024-08-19

Original URL: https://www.cnblogs.com/chenenenen/p/18367849

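After studying quantization in some depth, I felt I understood both dynamic quant and static quant. From reading experts' write-ups on quantization, I also understood how W8A8 is computed in trt (TensorRT) and why quantization speeds things up there. What I could not work out was whether GPTQ's W8A16 or W4A16 counts as dynamic quant or static quant, and I was stuck on this for quite a while. After reading the GPTQ source code, I realized that the quantized weight is first dequantized back to fp16, and only then is the W*X product computed. See the GPTQ source code for the details.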
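The dequantize-then-multiply flow described above can be sketched on the CPU as follows. This is a minimal illustration and not the GPTQ source: the helper names `unpack8` and `dequant_dot` are hypothetical, a single scale and zero point are assumed for the whole column, and `float` stands in for fp16.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: extract the idx-th 8-bit value (idx in 0..3)
// from a 32-bit word that packs four quantized weights.
inline std::uint32_t unpack8(std::uint32_t packed, int idx) {
    return (packed >> (idx * 8)) & 0xFFu;
}

// Hypothetical helper: dot product between activations x and one column of
// 8-bit weights packed four per 32-bit word.
// Assumption: one scale and one zero point cover the whole column.
inline float dequant_dot(const std::vector<float>& x,
                         const std::vector<std::uint32_t>& packed_w,
                         float scale, float zero) {
    float acc = 0.0f;
    for (std::size_t k = 0; k < x.size(); ++k) {
        const std::uint32_t q = unpack8(packed_w[k / 4], static_cast<int>(k % 4));
        const float w = scale * (static_cast<float>(q) - zero);  // dequantize
        acc += w * x[k];                                         // floating-point multiply-accumulate
    }
    return acc;
}
```

The weights stay packed as integers in memory, and the multiplication itself happens in floating point, which is why the activations remain 16-bit in the W8A16 and W4A16 naming.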

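But after reading it I was puzzled again: since there is a dequantization step before the multiplication, wouldn't that make things slower? Why does everyone say it gets faster? Where does the speedup come from? I am still puzzled.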
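One commonly cited explanation, which the original post does not state, is that small-batch inference is limited by how fast the weights can be read from GPU memory rather than by arithmetic, so shrinking the weights can outweigh the extra dequantization work. The sketch below only counts weight bytes. The function name `weight_bytes` and the 4096 x 4096 layer size are assumptions for illustration, and scales and zero points are ignored.

```cpp
#include <cstdint>

// Hypothetical helper: bytes read when a rows x cols weight matrix is
// streamed once at the given bit width (scales and zero points ignored).
constexpr std::uint64_t weight_bytes(std::uint64_t rows, std::uint64_t cols, unsigned bits) {
    return rows * cols * bits / 8;
}

// Assumed example layer: 4096 x 4096.
constexpr std::uint64_t kFp16Bytes = weight_bytes(4096, 4096, 16);  // 32 MiB
constexpr std::uint64_t kInt8Bytes = weight_bytes(4096, 4096, 8);   // 16 MiB
constexpr std::uint64_t kInt4Bytes = weight_bytes(4096, 4096, 4);   //  8 MiB
```

Under these assumptions an 8-bit layer moves half the weight bytes of fp16 and a 4-bit layer a quarter. Whether that translates into an actual speedup depends on the kernel, the batch size, and the hardware.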


void vecquant8matmul_cuda(
  torch::Tensor vec,
  torch::Tensor mat,
  torch::Tensor mul,
  torch::Tensor scales,
  torch::Tensor zeros,
  torch::Tensor g_idx
) {
  int batch = vec.size(0);
  int vec_height = vec.size(1);
  int height = mat.size(0);
  int width = mat.size(1);
  int zero_width = zeros.size(1);

  dim3 blocks(
    (height + BLOCKHEIGHT8 - 1) / BLOCKHEIGHT8,
    (width + BLOCKWIDTH - 1) / BLOCKWIDTH
  );
  dim3 threads(BLOCKWIDTH); // threads per block: one thread per output column

  AT_DISPATCH_FLOATING_TYPES(
    vec.type(), "vecquant8matmul_cuda", ([&] {
      VecQuant8MatMulKernel<<<blocks, threads>>>( // launch the actual CUDA kernel with the grid and block sizes set above
        vec.data<scalar_t>(), mat.data<int>(), mul.data<scalar_t>(),
        scales.data<scalar_t>(), zeros.data<int>(), g_idx.data<int>(), 
        batch, vec_height, height, width, zero_width
      );
    })
  );
}

template <typename scalar_t> // type template parameter
__global__ void VecQuant8MatMulKernel(
    const  scalar_t* __restrict__ vec, // x (activations)
    const       int* __restrict__ mat, // w (packed quantized weights)
           scalar_t* __restrict__ mul, // result of w*x
    const  scalar_t* __restrict__ scales, // scales from quantizing w
    const       int* __restrict__ zeros, // zero points from quantizing w (packed)
    const   	int* __restrict__ g_idx,
    int batch,
    int vec_height,
    int height,
    int width,
	int zero_width
) {
  int h = BLOCKHEIGHT8 * blockIdx.x;
  int w = BLOCKWIDTH * blockIdx.y + threadIdx.x;
  
  __shared__ scalar_t blockvec[BLOCKWIDTH];
  int i = width * h + w;
  int g_h = h * 4;
  int k;
  unsigned int g;
  scalar_t w_tmp;
  
  int z_w = w / 4; 
  int z_mod = (w % 4) * 8;
  
  float weight[BLOCKWIDTH];
  
  for (k = 0; k <  BLOCKWIDTH; ++k){	
	int k_w = (k / 4); 
	int k_bit = (k % 4) * 8;
	
    g = as_int(g_idx[g_h + k]);
    scalar_t scale = scales[g * width + w]; // fetch the scale for group g (type scalar_t)
    scalar_t zero = scalar_t((((as_unsigned(zeros[g * zero_width + z_w]) >> z_mod) & 0xFF) + 1) & 0x0f);
	
    w_tmp = ((as_unsigned(mat[i + (k_w * width)]) >> k_bit) & 0xFF);
    
	  weight[k] = scale * (w_tmp - zero); // dequantize
  }

  scalar_t res;
  for (int b = 0; b < batch; ++b){	
	res = 0;
	
    blockvec[threadIdx.x] = vec[b * vec_height + blockIdx.x * BLOCKWIDTH + threadIdx.x];
    __syncthreads();
	for (k = 0; k <  BLOCKWIDTH; ++k){	
	  res += weight[k] * blockvec[k]; // multiply-accumulate
    }
    atomicAdd(&mul[b * width + w], res); // add this block's partial sum into the output
    __syncthreads();
  }
}
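To make the index arithmetic in the kernel easier to follow, here is a CPU sketch that computes a single output element `mul[w]` for one input vector, without blocks, threads, or shared memory. The function name `ref_output_8bit` and the tensor shapes listed in the comments are assumptions inferred from how the kernel indexes its arguments, not something the post states.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed layouts (inferred from the kernel's indexing):
//   mat    : (in_features / 4) x width, four 8-bit weights per 32-bit word along the rows
//   scales : groups x width
//   zeros  : groups x zero_width, four 8-bit zero points per 32-bit word along the columns
//   g_idx  : in_features entries, mapping each input row to its quantization group
inline float ref_output_8bit(const std::vector<float>& vec,
                             const std::vector<std::uint32_t>& mat,
                             const std::vector<float>& scales,
                             const std::vector<std::uint32_t>& zeros,
                             const std::vector<int>& g_idx,
                             int width, int zero_width, int w) {
    const int z_w = w / 4;          // packed word that holds this column's zero point
    const int z_mod = (w % 4) * 8;  // bit offset of the zero point inside that word
    float acc = 0.0f;
    for (std::size_t k = 0; k < vec.size(); ++k) {
        const int g = g_idx[k];  // group of input row k
        const float scale = scales[g * width + w];
        // Same expression as the kernel, including its final & 0x0f mask.
        const std::uint32_t zero =
            (((zeros[g * zero_width + z_w] >> z_mod) & 0xFFu) + 1u) & 0x0Fu;
        const std::size_t word =
            (k / 4) * static_cast<std::size_t>(width) + static_cast<std::size_t>(w);
        const std::uint32_t q = (mat[word] >> ((k % 4) * 8)) & 0xFFu;
        const float weight = scale * (static_cast<float>(q) - static_cast<float>(zero));  // dequantize
        acc += weight * vec[k];  // multiply-accumulate
    }
    return acc;
}
```

In the kernel the same work is split up: each block dequantizes a tile of `BLOCKWIDTH` input rows for its columns once, reuses the tile for every batch row, and combines the per-block partial sums with `atomicAdd`.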
