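Introduction to ggml

Published by HuggingFace on 2024-08-29

ggml is a machine learning library written in C and C++ that focuses on inference for Transformer models. The project is fully open source, is under active development, and has a growing developer community. ggml is similar to machine learning libraries such as PyTorch and TensorFlow, but it is still at an early stage of development, so some of its low-level design is still changing.

Alongside projects such as llama.cpp and whisper.cpp, ggml has steadily gained popularity. Many projects, including ollama, jan, and LM Studio, use ggml internally to run large language models on device.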

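Compared with other libraries, ggml offers the following advantages:

Minimal implementation: The core library is self-contained in just 5 files. GPU support requires additional files, but it is optional.
Easy compilation: No fancy build tools are needed. Without GPU support, GCC or Clang alone is enough.
Lightweight: The compiled binary is under 1 MB, which is tiny compared with PyTorch (usually hundreds of MB).
Good compatibility: It supports many kinds of hardware, including x86_64, ARM, Apple Silicon, and CUDA.
Support for quantized tensors: Tensors can be quantized to save memory, and in some cases this also improves performance.
Extremely memory efficient: The overhead of storing tensors and performing computations is minimal.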
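
To make the quantization point concrete, here is a minimal sketch of block-wise 8-bit quantization in plain Python. It is an illustration only and does not reproduce any of ggml's actual quantization formats; the helper names `quantize_block` and `dequantize_block` are hypothetical.

```python
def quantize_block(values):
    """Map a block of floats to integer codes in [-127, 127] plus one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the codes and the shared scale."""
    return [scale * c for c in codes]


block = [0.5, -1.0, 0.25, 0.75]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)

# Largest difference between an original value and its reconstruction
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, max_error)
```

Each value now needs one byte instead of the four bytes of a 32-bit float, plus one scale per block, at the cost of a rounding error of at most half a quantization step.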

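Of course, ggml also has some drawbacks at present. If you choose to develop with ggml, you should be aware of the following (they may improve in future versions):

Not every tensor operation is supported on every backend. For example, some operations that run on the CPU may not be supported on CUDA yet.
Developing with ggml may not be straightforward, because it requires fairly deep knowledge of low-level programming.
The project is under active development, so breaking changes may occur.

This article is an introduction to developing with ggml. It does not cover higher-level projects such as LLM inference with llama.cpp. Instead, it focuses on ggml's core concepts and basic usage, to give developers a foundation for more advanced work later.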

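Getting started

Let's start by compiling. For simplicity, we use Ubuntu as the example platform, although ggml can be compiled on many platforms (including Windows, macOS, and BSD). The commands are: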

# Start by installing build dependencies
# "gdb" is optional, but is recommended
sudo apt install build-essential cmake git gdb

# Then, clone the repository
git clone https://github.com/ggerganov/ggml.git
cd ggml

# Try compiling one of the examples
cmake -B build
cmake --build build --config Release --target simple-ctx

# Run the example
./build/bin/simple-ctx

Expected output:

mul mat (4 x 3) (transposed result):
[ 60.00 55.00 50.00 110.00
 90.00 54.00 54.00 126.00
 42.00 29.00 28.00 64.00 ]

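If you see the expected output, let's move on.

Terminology and concepts

First, let's learn some of ggml's core concepts. If you come from PyTorch or TensorFlow, these may feel like a big jump. But because ggml is a low-level library, understanding them gives you much more control over performance.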

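ggml_context: A "container" that holds objects such as tensors, graphs, and optionally data.
ggml_cgraph: A representation of the computation graph. Think of it as the "order of computation" that will be handed to the backend.
ggml_backend: An interface for executing computation graphs. There are many types of backends: CPU (the default), CUDA, Metal (Apple Silicon), Vulkan, RPC, and so on.
ggml_backend_buffer_type: Represents a type of buffer. Think of it as a "memory allocator" attached to each ggml_backend. For example, to run a computation on a GPU, you need to allocate memory on the GPU through a buffer_type (usually abbreviated as buft).
ggml_backend_buffer: Represents a buffer allocated through a buffer_type. Note that one buffer can hold the data of multiple tensors.
ggml_gallocr: Represents a graph memory allocator, used to allocate memory efficiently for the tensors in a computation graph.
ggml_backend_sched: A scheduler that allows multiple backends to be used concurrently. When handling large models or multi-GPU inference, it can distribute the computation across different hardware (for example, CPU plus GPU). The scheduler can also automatically assign operations that the GPU does not support to the CPU, which helps both resource utilization and compatibility.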
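
The fallback behaviour described for ggml_backend_sched can be pictured with a toy model. The sketch below is not ggml code: the capability table `SUPPORTED_OPS`, the operation name `CUSTOM_OP`, and the function `assign_backends` are all hypothetical, and the real scheduler is considerably more involved.

```python
# Hypothetical capability table: which operations each backend can run.
SUPPORTED_OPS = {
    "gpu": {"MUL_MAT", "ADD"},
    "cpu": {"MUL_MAT", "ADD", "CUSTOM_OP"},
}


def assign_backends(ops, preference=("gpu", "cpu")):
    """Give each operation to the first backend in `preference` that supports it."""
    assignment = []
    for op in ops:
        for backend in preference:
            if op in SUPPORTED_OPS[backend]:
                assignment.append((op, backend))
                break
        else:
            # No backend in the preference list can run this operation
            raise ValueError(f"no backend supports {op}")
    return assignment


plan = assign_backends(["MUL_MAT", "CUSTOM_OP", "ADD"])
print(plan)
```

In this model the GPU is preferred, and only the operation it cannot run is placed on the CPU.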

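Simple example

This example reproduces the program run by the last command of the first section (simple-ctx). We create two matrices and multiply them to get the result. With PyTorch, the code might look like this: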

import torch

# Create two matrices
matrix1 = torch.tensor([
  [2, 8],
  [5, 1],
  [4, 2],
  [8, 6],
])
matrix2 = torch.tensor([
  [10, 5],
  [9, 9],
  [5, 4],
])

# Perform matrix multiplication
result = torch.matmul(matrix1, matrix2.T)
print(result.T)
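
For readers without PyTorch installed, the same arithmetic can be cross-checked with plain Python lists. This is only a sketch of the computation; the helper names `matmul_bt` and `transpose` are made up for this example.

```python
matrix1 = [[2, 8], [5, 1], [4, 2], [8, 6]]
matrix2 = [[10, 5], [9, 9], [5, 4]]


def matmul_bt(a, b):
    """Return a @ b.T: the dot product of every row of a with every row of b."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


result_t = transpose(matmul_bt(matrix1, matrix2))
for row in result_t:
    print(row)
```

The three printed rows hold the same values as the expected output of the simple-ctx example.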

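With ggml, the same result requires the following steps:

1. Allocate a ggml_context to store tensor data
2. Create the tensors and set their data
3. Create a ggml_cgraph for the matrix multiplication
4. Run the computation
5. Retrieve the results
6. Free memory and exit

Note: In this example, the tensor data is allocated directly inside the ggml_context. In practice, the memory should be allocated as a device buffer, which we cover in the next section.

First create a new directory examples/demo (for example with mkdir -p examples/demo), then run the following commands to create the C source file and the CMake file.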

cd ggml # make sure you're in the project root

# create C source and CMakeLists file
touch examples/demo/demo.c
touch examples/demo/CMakeLists.txt

The code for this example is based on simple-ctx.cpp.

Edit examples/demo/demo.c and add the following code:

#include "ggml.h"
#include <string.h>
#include <stdio.h>

int main(void) {
    // initialize data of matrices to perform matrix multiplication
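    enum { rows_A = 4, cols_A = 2 }; // enum constants: in C, a const int size would make the array a VLA, which cannot take an initializer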
    float matrix_A[rows_A * cols_A] = {
        2, 8,
        5, 1,
        4, 2,
        8, 6
    };
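    enum { rows_B = 3, cols_B = 2 }; // enum constants: in C, a const int size would make the array a VLA, which cannot take an initializer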
    float matrix_B[rows_B * cols_B] = {
        10, 5,
        9, 9,
        5, 4
    };

    // 1. Allocate `ggml_context` to store tensor data
    // Calculate the size needed to allocate
    size_t ctx_size = 0;
    ctx_size += rows_A * cols_A * ggml_type_size(GGML_TYPE_F32); // tensor a
    ctx_size += rows_B * cols_B * ggml_type_size(GGML_TYPE_F32); // tensor b
    ctx_size += rows_A * rows_B * ggml_type_size(GGML_TYPE_F32); // result
    ctx_size += 3 * ggml_tensor_overhead(); // metadata for 3 tensors
    ctx_size += ggml_graph_overhead(); // compute graph
    ctx_size += 1024; // some overhead (exact calculation omitted for simplicity)

    // Allocate `ggml_context` to store tensor data
    struct ggml_init_params params = {
        /*.mem_size =*/ ctx_size,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // 2. Create tensors and set data
    struct ggml_tensor * tensor_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_A, rows_A);
    struct ggml_tensor * tensor_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_B, rows_B);
    memcpy(tensor_a->data, matrix_A, ggml_nbytes(tensor_a));
    memcpy(tensor_b->data, matrix_B, ggml_nbytes(tensor_b));


    // 3. Create a `ggml_cgraph` for mul_mat operation
    struct ggml_cgraph * gf = ggml_new_graph(ctx);

    // result = a*b^T
    // Pay attention: ggml_mul_mat(A, B) ==> B will be transposed internally
    // the result is transposed
    struct ggml_tensor * result = ggml_mul_mat(ctx, tensor_a, tensor_b);

    // Mark the "result" tensor to be computed
    ggml_build_forward_expand(gf, result);

    // 4. Run the computation
    int n_threads = 1; // Optional: number of threads to perform some operations with multi-threading
    ggml_graph_compute_with_ctx(ctx, gf, n_threads);

    // 5. Retrieve results (output tensors)
    float * result_data = (float *) result->data;
    printf("mul mat (%d x %d) (transposed result):\n[", (int) result->ne[0], (int) result->ne[1]);
    for (int j = 0; j < result->ne[1]/* rows */; j++) {
        if (j > 0) {
            printf("\n");
        }

        for (int i = 0; i < result->ne[0]/* cols */; i++) {
            printf(" %.2f", result_data[j * result->ne[0] + i]);
        }
    }
    printf(" ]\n");

    // 6. Free memory and exit
    ggml_free(ctx);
    return 0;
}
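
The size computation in step 1 can be checked by hand. The sketch below adds up only the tensor data part (three F32 tensors, 4 bytes per element) and models the context as a fixed-size pool. The `Arena` class is a hypothetical teaching model, not how ggml_context is implemented; the real context also needs the per-tensor and graph overhead that the C code adds.

```python
F32_SIZE = 4  # bytes per 32-bit float element


class Arena:
    """Toy fixed-size pool: hands out byte offsets and never grows."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def alloc(self, nbytes):
        if self.used + nbytes > self.capacity:
            raise MemoryError("arena too small")
        offset = self.used
        self.used += nbytes
        return offset


# Data bytes for tensor a (4x2), tensor b (3x2) and the result (4x3)
data_bytes = (4 * 2 + 3 * 2 + 4 * 3) * F32_SIZE

arena = Arena(data_bytes)
off_a = arena.alloc(4 * 2 * F32_SIZE)
off_b = arena.alloc(3 * 2 * F32_SIZE)
off_r = arena.alloc(4 * 3 * F32_SIZE)
print(data_bytes, off_a, off_b, off_r)
```

Once the pool is full, any further allocation fails, which is why the size has to be estimated up front and why the example adds some slack.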

Then write the following into examples/demo/CMakeLists.txt:

set(TEST_TARGET demo)
add_executable(${TEST_TARGET} demo.c)
target_link_libraries(${TEST_TARGET} PRIVATE ggml)

Edit examples/CMakeLists.txt and add this line at the end:

add_subdirectory(demo)

Then compile and run:

cmake -B build
cmake --build build --config Release --target demo

# Run it
./build/bin/demo

The expected result looks like this:

mul mat (4 x 3) (transposed result):
[ 60.00 55.00 50.00 110.00
 90.00 54.00 54.00 126.00
 42.00 29.00 28.00 64.00 ]
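
The output has 3 rows of 4 values because the result tensor has ne[0] = 4 and ne[1] = 3, and the print loop reads element (j, i) at flat index j * ne[0] + i. The following sketch mimics that layout with flat Python lists. It is a model of the indexing only, not ggml's implementation.

```python
A = [2, 8, 5, 1, 4, 2, 8, 6]  # tensor_a data as stored, ne = [2, 4]
B = [10, 5, 9, 9, 5, 4]       # tensor_b data as stored, ne = [2, 3]
K = 2                         # shared inner dimension (ne[0] of both inputs)
ne0, ne1 = 4, 3               # result ne[0] and ne[1]

# Element (j, i) is the dot product of row i of A with row j of B
flat = [0.0] * (ne0 * ne1)
for j in range(ne1):
    for i in range(ne0):
        flat[j * ne0 + i] = float(sum(A[i * K + k] * B[j * K + k] for k in range(K)))

# Split the flat buffer into ne1 rows of ne0 values, as the print loop does
rows = [flat[j * ne0:(j + 1) * ne0] for j in range(ne1)]
for row in rows:
    print(row)
```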

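Example with a backend

In ggml, a "backend" is an interface that can handle tensor operations. A backend can be CPU, CUDA, Vulkan, etc.

Backends abstract the execution of computation graphs. Once a graph is defined, it can be computed on the available hardware using the corresponding backend implementation. Note that during this process ggml automatically reserves memory for the intermediate results it needs and optimizes memory usage based on their lifetimes.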
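
To see why lifetimes matter for memory usage, consider a toy model in which an intermediate tensor's memory is released right after its last consumer has run. The function `peak_memory` and the four-node chain below are hypothetical and much simpler than what ggml actually does.

```python
def peak_memory(nodes):
    """nodes: list of (name, size_bytes, input_names) in execution order."""
    sizes = {name: size for name, size, _ in nodes}

    # Record the last step at which each tensor is read
    last_use = {}
    for step, (_, _, inputs) in enumerate(nodes):
        for inp in inputs:
            last_use[inp] = step

    live = 0
    peak = 0
    for step, (name, size, inputs) in enumerate(nodes):
        live += size              # the output of this node must exist
        peak = max(peak, live)
        for inp in set(inputs):   # release inputs that are no longer needed
            if inp in sizes and last_use[inp] == step:
                live -= sizes[inp]
    return peak


chain = [
    ("t1", 100, []),
    ("t2", 100, ["t1"]),
    ("t3", 100, ["t2"]),
    ("out", 100, ["t3"]),
]
naive = sum(size for _, size, _ in chain)
print(peak_memory(chain), naive)
```

In this chain, keeping every tensor alive would need 400 bytes, while releasing tensors after their last use needs only 200 at the peak.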

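The basic steps for computation or inference with a backend are:

1. Initialize a ggml_backend
2. Allocate a ggml_context to store the tensor metadata (the tensor data does not need to be allocated at this point)
3. Create the tensor metadata (only shapes and data types)
4. Allocate a ggml_backend_buffer to store all tensors
5. Copy the tensor data from main memory (RAM) to the backend buffer
6. Create a ggml_cgraph for the matrix multiplication
7. Create a ggml_gallocr to allocate memory for the graph
8. Optional: schedule the graph with ggml_backend_sched
9. Run the computation
10. Retrieve the results, that is, the outputs of the graph
11. Free memory and exit

The code for this example is based on simple-backend.cpp: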

#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#ifdef GGML_USE_CUDA
#include "ggml-cuda.h"
#endif

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    // initialize data of matrices to perform matrix multiplication
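    enum { rows_A = 4, cols_A = 2 }; // enum constants: in C, a const int size would make the array a VLA, which cannot take an initializer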
    float matrix_A[rows_A * cols_A] = {
        2, 8,
        5, 1,
        4, 2,
        8, 6
    };
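    enum { rows_B = 3, cols_B = 2 }; // enum constants: in C, a const int size would make the array a VLA, which cannot take an initializer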
    float matrix_B[rows_B * cols_B] = {
        10, 5,
        9, 9,
        5, 4
    };

    // 1. Initialize backend
    ggml_backend_t backend = NULL;
#ifdef GGML_USE_CUDA
    fprintf(stderr, "%s: using CUDA backend\n", __func__);
    backend = ggml_backend_cuda_init(0); // init device 0
    if (!backend) {
        fprintf(stderr, "%s: ggml_backend_cuda_init() failed\n", __func__);
    }
#endif
    // if no GPU backend is available, fall back to the CPU backend
    if (!backend) {
        backend = ggml_backend_cpu_init();
    }

    // Calculate the size needed to allocate
    size_t ctx_size = 0;
    ctx_size += 2 * ggml_tensor_overhead(); // tensors
    // no need to allocate anything else!

    // 2. Allocate `ggml_context` to store tensor data
    struct ggml_init_params params = {
        /*.mem_size =*/ ctx_size,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc =*/ true, // the tensors will be allocated later by ggml_backend_alloc_ctx_tensors()
    };
    struct ggml_context * ctx = ggml_init(params);

    // 3. Create tensors metadata (only their shapes and data types)
    struct ggml_tensor * tensor_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_A, rows_A);
    struct ggml_tensor * tensor_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_B, rows_B);

    // 4. Allocate a `ggml_backend_buffer` to store all tensors
    ggml_backend_buffer_t buffer = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // 5. Copy tensor data from main memory (RAM) to backend buffer
    ggml_backend_tensor_set(tensor_a, matrix_A, 0, ggml_nbytes(tensor_a));
    ggml_backend_tensor_set(tensor_b, matrix_B, 0, ggml_nbytes(tensor_b));

    // 6. Create a `ggml_cgraph` for mul_mat operation
    struct ggml_cgraph * gf = NULL;
    struct ggml_context * ctx_cgraph = NULL;
    {
        // create a temporary context to build the graph
        struct ggml_init_params params0 = {
            /*.mem_size =*/ ggml_tensor_overhead()*GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead(),
            /*.mem_buffer =*/ NULL,
            /*.no_alloc =*/ true, // the tensors will be allocated later by ggml_gallocr_alloc_graph()
        };
        ctx_cgraph = ggml_init(params0);
        gf = ggml_new_graph(ctx_cgraph);

        // result = a*b^T
        // Pay attention: ggml_mul_mat(A, B) ==> B will be transposed internally
        // the result is transposed
        struct ggml_tensor * result0 = ggml_mul_mat(ctx_cgraph, tensor_a, tensor_b);

        // Add "result" tensor and all of its dependencies to the cgraph
        ggml_build_forward_expand(gf, result0);
    }

    // 7. Create a `ggml_gallocr` for cgraph computation
    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(allocr, gf);

    // (we skip step 8. Optionally: schedule the cgraph using `ggml_backend_sched`)

    // 9. Run the computation
    int n_threads = 1; // Optional: number of threads to perform some operations with multi-threading
    if (ggml_backend_is_cpu(backend)) {
        ggml_backend_cpu_set_n_threads(backend, n_threads);
    }
    ggml_backend_graph_compute(backend, gf);

    // 10. Retrieve results (output tensors)
    // in this example, output tensor is always the last tensor in the graph
    struct ggml_tensor * result = gf->nodes[gf->n_nodes - 1];
    float * result_data = malloc(ggml_nbytes(result));
    // because the tensor data is stored in device buffer, we need to copy it back to RAM
    ggml_backend_tensor_get(result, result_data, 0, ggml_nbytes(result));
    printf("mul mat (%d x %d) (transposed result):\n[", (int) result->ne[0], (int) result->ne[1]);
    for (int j = 0; j < result->ne[1]/* rows */; j++) {
        if (j > 0) {
            printf("\n");
        }

        for (int i = 0; i < result->ne[0]/* cols */; i++) {
            printf(" %.2f", result_data[j * result->ne[0] + i]);
        }
    }
    printf(" ]\n");
    free(result_data);

    // 11. Free memory and exit
    ggml_free(ctx_cgraph);
    ggml_gallocr_free(allocr);
    ggml_free(ctx);
    ggml_backend_buffer_free(buffer);
    ggml_backend_free(backend);
    return 0;
}

Put this code in examples/demo/demo.c (replacing the previous example), then compile and run:

cmake -B build
cmake --build build --config Release --target demo

# Run it
./build/bin/demo

The expected result should be the same as in the previous example:

mul mat (4 x 3) (transposed result):
[ 60.00 55.00 50.00 110.00
 90.00 54.00 54.00 126.00
 42.00 29.00 28.00 64.00 ]

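Printing the computation graph

A ggml_cgraph represents the computation graph, which defines the order in which the backend executes the operations. Printing the graph is a very useful debugging tool, especially for complex models.

You can print the graph with ggml_graph_print: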

...

// Mark the "result" tensor to be computed
ggml_build_forward_expand(gf, result0);

// Print the cgraph
ggml_graph_print(gf);

Running the program prints:

=== GRAPH ===
n_nodes = 1
 - 0: [     4, 3, 1] MUL_MAT
n_leafs = 2
 - 0: [     2, 4] NONE leaf_0
 - 1: [     2, 3] NONE leaf_1
========================================

You can also dump the graph in Graphviz dot format:

ggml_graph_dump_dot(gf, NULL, "debug.dot");

You can then render debug.dot into an image with the dot command or with an online Graphviz viewer:

[Image: ggml-debug]
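
For readers unfamiliar with the dot format, the sketch below builds a small DOT description of the graph printed earlier (two leaves feeding one MUL_MAT node). It only illustrates the file format: the function `to_dot` is hypothetical, and the file that ggml_graph_dump_dot writes contains more detail and will look different.

```python
def to_dot(edges):
    """Build a minimal Graphviz DOT description from (source, target) pairs."""
    lines = ["digraph G {"]
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot([("leaf_0", "MUL_MAT"), ("leaf_1", "MUL_MAT")])
print(dot_text)
```

With Graphviz installed, a dot file can be rendered from the command line with dot -Tpng debug.dot -o debug.png.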

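Conclusion

This article introduced ggml, covering its basic concepts, a simple example, and an example with a backend. Beyond these basics, there is much more to learn about ggml.

We plan to publish more articles covering further ggml topics, including GGUF format models, model quantization, and how multiple backends work together. You can also refer to the ggml examples directory for more advanced usage and sample programs. Stay tuned for more ggml content.

Original English article: https://hf.co/blog/introduction-to-ggml

Original authors: Xuan Son NGUYEN, Georgi Gerganov, slaren

Translator: hugging-hoi2022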
