LLM Optimization: Reducing the GPU Memory and Host Memory Usage of the Open-Source iFlytekSpark-13B

Posted by mengrennwpu on 2024-04-28

Original URL: https://www.cnblogs.com/mengrennwpu/p/18164027

1. Background

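Over the past couple of days I picked up a task: deploy several open-source models and compare their output quality against a model we fine-tuned locally with full-parameter fine-tuning.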

The open-source models to deploy include iFlytekSpark-13B, Baichuan2-13B, ChatGLM6B, and others.

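The other two models are packaged on top of the transformers library, so bringing up their inference services went smoothly. iFlytekSpark-13B, however, is implemented on the Megatron-DeepSpeed framework (repository: https://gitee.com/iflytekopensource/iFlytekSpark-13B). While starting its inference service I found that the 13B model occupied 71 GB to 78 GB of GPU memory, which is counterintuitive.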

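This post documents how I investigated and then reduced the GPU memory and host memory usage of the open-source iFlytekSpark-13B. Which open-source model produces the best results is outside the scope of this post.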

2. Root-Cause Analysis

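Intuitively, a 13B-parameter model stored in bf16 should need roughly 26 GB of GPU memory (13 billion parameters at 2 bytes each), yet iFlytekSpark-13B takes more than 70 GB out of the box. That is hard to believe, and it makes it easy to see why there is so little discussion of the open-source Spark model online: with a footprint this large, it only fits on multiple GPUs or on an 80 GB card such as the A800. Those of us on a tight hardware budget do not have that much to spare.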
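The 26 GB figure is plain arithmetic, sketched below. The helper name `weight_memory_gb` and the dtype table are illustrative, not part of the Spark code base, and the estimate covers the weights only (no activations, KV cache, or CUDA context).

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed to hold the weights alone, in decimal GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# 13 billion parameters in bf16 -> about 26 GB for the weights.
bf16_gb = weight_memory_gb(13e9, "bf16")
# The same model in fp32 would need twice as much.
fp32_gb = weight_memory_gb(13e9, "fp32")
print(f"bf16: {bf16_gb:.0f} GB, fp32: {fp32_gb:.0f} GB")
```

Anything well above this baseline at load time suggests extra copies of the weights or cached allocations rather than the model itself.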

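Tracking down the cause meant debugging and reading the source code. In run_iFlytekSpark_text_generation.py, the file that starts the inference service, the model_provider function is where the model is initialized and the checkpoint file is loaded.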

def model_provider(pre_process=True, post_process=True):
    """Build the model."""
    print_rank_0('building iFlytekSpark model ...')
    args = get_args()
    config = core_transformer_config_from_args(args)
    
    # Initialize the iFlytekSpark model.
    model = iFlytekSparkModel(
        config,
        num_tokentypes=0,
        parallel_output=False,
        pre_process=pre_process,
        post_process=post_process,
        return_moe_loss=False
    )


    if args.from_pretrained is not None:
        assert os.path.exists(args.from_pretrained)
        ckpt_path = get_checkpoint_name(args.from_pretrained)
        print_rank_0('Loading from {} '.format(
                args.from_pretrained))
        # Load the checkpoint weights (mapped straight onto the current GPU).
        state_dict = torch.load(ckpt_path, map_location=f"cuda:{torch.cuda.current_device()}")
        if 'module' in state_dict:
            state_dict = state_dict['module']
        model.load_state_dict(state_dict)
    return model

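Looking at the checkpoint loading, the state_dict is mapped directly onto the GPU instead of being loaded onto the CPU first and moved to the GPU afterwards with the to method. This is one potential optimization point.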

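Stepping into iFlytekSparkModel, the vocabulary embedding layer, the linear layers, and other modules also allocate their weight tensors directly on the GPU when they are initialized. For example: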

class RowParallelLinear(torch.nn.Module):
    def __init__(self, input_size: int, output_size: int, *,
                 config: ModelParallelConfig,
                 init_method: Callable,
                 bias: bool = True,
                 input_is_parallel: bool = False,
                 stride: int = 1,
                 keep_master_weight_for_test: bool = False,
                 skip_bias_add: bool = False,
                 moe=False, enable_expert_tensor_parallelism=False):
        super(RowParallelLinear, self).__init__()

        # .........
        
        if config.use_cpu_initialization:
            self.weight = Parameter(torch.empty(self.output_size,
                                                self.input_size_per_partition,
                                                dtype=config.params_dtype))
            if config.perform_initialization:
                self.master_weight = _initialize_affine_weight_cpu(
                    self.weight, self.output_size, self.input_size,
                    self.input_size_per_partition, 1, init_method,
                    stride=stride, return_master_weight=keep_master_weight_for_test,
                    params_dtype=config.params_dtype)
        else:
            # With the default launch shell script, this branch is taken.
            self.weight = Parameter(torch.empty(
                self.output_size, self.input_size_per_partition,
                device=get_accelerator().current_device_name(), dtype=config.params_dtype))
            if config.perform_initialization:
                _initialize_affine_weight_gpu(self.weight, init_method,
                                              partition_dim=1, stride=stride)
        if bias:
            if config.use_cpu_initialization:
                self.bias = Parameter(torch.empty(self.output_size,
                                                  dtype=config.params_dtype))
            else:
                # With the default launch shell script, this branch is taken.
                self.bias = Parameter(torch.empty(
                    self.output_size, device=get_accelerator().current_device_name(),
                    dtype=config.params_dtype))
            setattr(self.bias, 'sequence_parallel', self.sequence_parallel)

            if config.perform_initialization:
                # Always initialize bias to zero.
                with torch.no_grad():
                    self.bias.zero_()
        else:
            self.register_parameter('bias', None)
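Taken together, the two observations above mean that two full copies of the weights exist while load_state_dict runs: the parameters allocated by the model constructor and the tensors produced by torch.load. The sketch below is a back-of-the-envelope count under that assumption, using the 26 GB per-copy figure for bf16. The helper `peak_gb` is hypothetical, and the count does not itemize the full 71 GB to 78 GB observed, since it ignores the CUDA context, activations, and memory that PyTorch's caching allocator keeps reserved after tensors are freed.

```python
PER_COPY_GB = 26  # 13B parameters x 2 bytes (bf16), weights only


def peak_gb(init_device: str, load_device: str) -> dict:
    """Weight memory resident on each device while load_state_dict() runs,
    i.e. before the temporary state_dict has been released."""
    copies = {"gpu": 0, "cpu": 0}
    copies[init_device] += 1  # parameters allocated by the model constructor
    copies[load_device] += 1  # tensors materialized by torch.load
    return {device: n * PER_COPY_GB for device, n in copies.items()}


# Default behaviour described above: both copies land on the GPU.
original_peak = peak_gb(init_device="gpu", load_device="gpu")

# Initialize and load on the CPU instead: the duplicate lives in host memory,
# and the GPU later receives a single copy when the model is moved over.
cpu_first_peak = peak_gb(init_device="cpu", load_device="cpu")

print(original_peak)
print(cpu_first_peak)
```

The trade-off is that the temporary duplicate moves into host memory, so the machine needs enough RAM to hold both copies for a short time.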

3. Optimization

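1. When the model is initialized, the weights of the embedding and linear layers are allocated directly on the GPU. This can be changed so that these weights are allocated on the CPU first.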

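The fix is simple. As the source above shows, when the use_cpu_initialization option is set, the weights are initialized on the CPU, so it is enough to add the --use-cpu-initialization argument to the script that starts the inference service.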
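For readers unfamiliar with how a dashed command-line flag becomes the underscored attribute checked in the layer code, here is a minimal argparse sketch. It assumes the flag is a plain store_true option; the real argument parser in the framework defines many more options and may declare this one differently.

```python
import argparse

# Hypothetical, stripped-down parser with only the flag discussed here.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--use-cpu-initialization",
    action="store_true",
    help="Allocate and initialize weights on the CPU instead of the GPU.",
)

# Without the flag, the attribute is False and the GPU branch would be taken.
default_args = parser.parse_args([])

# With the flag, argparse exposes it as `use_cpu_initialization`
# (dashes in the option name become underscores in the attribute name).
cpu_args = parser.parse_args(["--use-cpu-initialization"])

print(default_args.use_cpu_initialization, cpu_args.use_cpu_initialization)
```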

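2. The checkpoint is loaded directly onto the GPU, even though the get_model function in run_iFlytekSpark_text_generation.py moves the model to the GPU and applies the FP16/BF16 conversion anyway once the model has been built and loaded, as the following code shows.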

def get_model(model_provider_func, model_type=ModelType.encoder_or_decoder, wrap_with_ddp=True):
    """Build the model."""
    args = get_args()
    args.model_type = model_type

    # ..........

    # GPU allocation.
    for model_module in model:
        model_module.to(get_accelerator().current_device_name())
 

    # Fp16 conversion.
    if args.fp16 or args.bf16:
        model = [Float16Module(model_module, args) for model_module in model]

    # .......

    return model

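The optimization is therefore straightforward: load the checkpoint onto the CPU first, let get_model perform its default move to the GPU, and once loading has finished, use garbage collection to release the host memory the checkpoint occupied.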

Without further ado, the optimized code is as follows:

def model_provider(pre_process=True, post_process=True):
    """Build the model."""
    print_rank_0('building iFlytekSpark model ...')
    args = get_args()
    config = core_transformer_config_from_args(args)
    model = iFlytekSparkModel(
        config,
        num_tokentypes=0,
        parallel_output=False,
        pre_process=pre_process,
        post_process=post_process,
        return_moe_loss=False
    )


    if args.from_pretrained is not None:
        print(args.from_pretrained)
        assert os.path.exists(args.from_pretrained)
        ckpt_path = get_checkpoint_name(args.from_pretrained)
        print_rank_0('Loading from {} '.format(
                args.from_pretrained))

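        # Original (loads straight onto the GPU):
        # state_dict = torch.load(ckpt_path, map_location=f"cuda:{torch.cuda.current_device()}")
        # Load the checkpoint into host (CPU) memory instead.
        state_dict = torch.load(ckpt_path, map_location="cpu")
        if 'module' in state_dict:
            state_dict = state_dict['module']
        model.load_state_dict(state_dict)

        # Loading is done: drop the temporary state_dict and run garbage collection.
        # Note: this requires `import gc` at the top of the file.
        del state_dict
        gc.collect()
        torch.cuda.empty_cache()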

    return model
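The same load-on-CPU pattern can be tried in isolation on a toy model. The sketch below is not the Spark code: `load_checkpoint_via_cpu` is a hypothetical helper, the two `torch.nn.Linear` layers stand in for the real model, and the checkpoint is written to a temporary file just for the demonstration.

```python
import gc
import os
import tempfile

import torch


def load_checkpoint_via_cpu(model: torch.nn.Module, ckpt_path: str) -> torch.nn.Module:
    """Load weights into `model` through host memory, then free the temporary copy."""
    state_dict = torch.load(ckpt_path, map_location="cpu")
    if "module" in state_dict:  # same unwrapping step as in model_provider
        state_dict = state_dict["module"]
    model.load_state_dict(state_dict)
    del state_dict  # drop the only reference to the loaded tensors
    gc.collect()    # reclaim the host memory they occupied
    return model


# Toy stand-ins for the real model and checkpoint (illustration only).
torch.manual_seed(0)
source = torch.nn.Linear(4, 2)
target = torch.nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.pt")
    torch.save({"module": source.state_dict()}, path)
    load_checkpoint_via_cpu(target, path)

# Only now move the model to the accelerator, if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
target.to(device)
print(next(target.parameters()).device)
```

Because the checkpoint never touches the GPU, the device only ever holds the single copy of the weights that the final move places there.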