Macs have recently become markedly better at running large language models (LLMs), especially as Apple's M-series chips keep iterating. In this post we deploy CausalLM, an uncensored quantized large language model, locally on the current macOS release, Sonoma.
We'll use the koboldcpp project, a C/C++ implementation built on llama.cpp that exposes the KoboldAI interface. Since macOS is a classic Unix system and ships with the clang toolchain, it can compile the C/C++ sources directly, with no extra compiler to install.
First, clone the koboldcpp project:
git clone https://github.com/LostRuins/koboldcpp.git
Then change into the project directory:
cd koboldcpp
Run make to start the build:
make LLAMA_METAL=1
The LLAMA_METAL=1 flag is essential: it builds in the Metal backend so inference can be offloaded to the M-series GPU. Without it the model runs on the CPU only and generation is much slower.
The build prints:
(base) ➜ koboldcpp-1.60.1 make LLAMA_METAL=1
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread
I LDFLAGS: -ld_classic -framework Accelerate
I CC: Apple clang version 15.0.0 (clang-1500.3.9.4)
I CXX: Apple clang version 15.0.0 (clang-1500.3.9.4)
cc -I. -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -DNDEBUG -std=c11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
cc -I. -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -DNDEBUG -std=c11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -c otherarch/ggml_v3.c -o ggml_v3.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
cc -I. -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -DNDEBUG -std=c11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -c otherarch/ggml_v2.c -o ggml_v2.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
cc -I. -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -DNDEBUG -std=c11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -c otherarch/ggml_v1.c -o ggml_v1.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
c++ -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -c expose.cpp -o expose.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
In file included from expose.cpp:20:
./expose.h:30:8: warning: struct 'load_model_inputs' does not declare any constructor to initialize its non-modifiable members
struct load_model_inputs
12 warnings generated.
c++ -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread ggml.o ggml_v3.o ggml_v2.o ggml_v1.o expose.o common.o gpttype_adapter.o ggml-quants.o ggml-alloc.o ggml-backend.o grammar-parser.o sdcpp_default.o -shared -o koboldcpp_default.so -ld_classic -framework Accelerate
ld: warning: -s is obsolete
ld: warning: option -s is obsolete and being ignored
cc -I. -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -DNDEBUG -std=c11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -c ggml.c -o ggml_v4_openblas.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
cc -I. -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -DNDEBUG -std=c11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -c otherarch/ggml_v3.c -o ggml_v3_openblas.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
cc -I. -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -DNDEBUG -std=c11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -c otherarch/ggml_v2.c -o ggml_v2_openblas.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
Your OS does not appear to be Windows. For faster speeds, install and link a BLAS library. Set LLAMA_OPENBLAS=1 to compile with OpenBLAS support or LLAMA_CLBLAST=1 to compile with ClBlast support. This is just a reminder, not an error.
This means the build succeeded, though it ends with a reminder:
Your OS does not appear to be Windows. For faster speeds, install and link a BLAS library. Set LLAMA_OPENBLAS=1 to compile with OpenBLAS support or LLAMA_CLBLAST=1 to compile with ClBlast support. This is just a reminder, not an error.
It says you could link a BLAS library for faster prompt processing, but that isn't needed here: as the build flags above show, macOS already links Apple's Accelerate framework (-DGGML_USE_ACCELERATE), and the heavy lifting is done on the GPU via Metal.
Next, create a virtual environment with conda:
conda create -n kobold python=3.10
Then activate the environment and install the dependencies:
(base) ➜ koboldcpp-1.60.1 conda activate kobold
(kobold) ➜ koboldcpp-1.60.1 pip install -r requirements.txt
Finally, launch the project:
python3 koboldcpp.py --model /Users/liuyue/Downloads/causallm_7b-dpo-alpha.Q5_K_M.gguf --gpulayers 40 --highpriority --threads 300
A quick explanation of the flags:
gpulayers - how many model layers to offload to the GPU. As the load log below shows, this 7B model has 33 offloadable layers (32 transformer layers plus the output layer), so any value of 33 or more, like the 40 used here, offloads the whole model.
threads - the number of CPU threads used for inference. More threads help up to roughly the number of physical cores; values far beyond that, like the 300 used here, add no speed (see the sketch after this list for deriving it from the core count).
highpriority - raises the process scheduling priority so the OS gives koboldcpp more CPU time. Note this relies on the psutil package; the launch log below shows it failing because psutil wasn't installed (pip install psutil fixes that).
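As a sanity check on the threads value, here is a minimal Python sketch (the model path is the one used in this walkthrough; adjust it to your own) that derives the thread count from the machine's CPU cores instead of hard-coding 300:

import os
import subprocess

# llama.cpp-style inference rarely benefits from more threads than CPU cores.
threads = os.cpu_count() or 4

# Same flags as the manual launch above; the model path is an assumption.
subprocess.run([
    "python3", "koboldcpp.py",
    "--model", "/Users/liuyue/Downloads/causallm_7b-dpo-alpha.Q5_K_M.gguf",
    "--gpulayers", "40",
    "--highpriority",
    "--threads", str(threads),
])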
The program prints:
(kobold) ➜ koboldcpp-1.60.1 Python3 koboldcpp.py --model /Users/liuyue/Downloads/causallm_7b-dpo-alpha.Q5_K_M.gguf --gpulayers 40 --highpriority --threads 300
***
Welcome to KoboldCpp - Version 1.60.1
Setting process to Higher Priority - Use Caution
Error, Could not change process priority: No module named 'psutil'
Warning: OpenBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp_default.so
==========
Namespace(model='/Users/liuyue/Downloads/causallm_7b-dpo-alpha.Q5_K_M.gguf', model_param='/Users/liuyue/Downloads/causallm_7b-dpo-alpha.Q5_K_M.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=300, usecublas=None, usevulkan=None, useclblast=None, noblas=False, gpulayers=40, tensor_split=None, contextsize=2048, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=300, lora=None, smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, onready='', benchmark=None, multiuser=0, remotetunnel=False, highpriority=True, foreground=False, preloadstory='', quiet=False, ssl=None, nocertify=False, sdconfig=None)
==========
Loading model: /Users/liuyue/Downloads/causallm_7b-dpo-alpha.Q5_K_M.gguf
[Threads: 300, BlasThreads: 300, SmartContext: False, ContextShift: True]
The reported GGUF Arch is: llama
---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /Users/liuyue/Downloads/causallm_7b-dpo-alpha.Q5_K_M.gguf (version GGUF V3 (latest))
llm_load_vocab: mismatch in special tokens definition ( 293/151936 vs 85/151936 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 109170
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 7.72 B
llm_load_print_meta: model size = 5.14 GiB (5.72 BPW)
llm_load_print_meta: general.name = .
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_tensors: ggml ctx size = 0.26 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 408.03 MiB
llm_load_tensors: Metal buffer size = 4859.26 MiB
......................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx = 2128
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Metal KV buffer size = 1064.00 MiB
llama_new_context_with_model: KV self size = 1064.00 MiB, K (f16): 532.00 MiB, V (f16): 532.00 MiB
llama_new_context_with_model: CPU input buffer size = 13.18 MiB
llama_new_context_with_model: Metal compute buffer size = 304.75 MiB
llama_new_context_with_model: CPU compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 2
Load Text Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
As the log shows, all model layers were offloaded to the GPU and inference is accelerated by the Mac's Metal backend.
Now open http://localhost:5001 in a browser to start a conversation.
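Besides the built-in web UI, the startup log above also announces a KoboldAI-compatible API at http://localhost:5001/api/. Here is a minimal Python sketch that calls it using only the standard library (the payload fields follow the KoboldAI generate schema; treat the exact field set as an assumption):

import json
import urllib.request

# Kobold API endpoint announced in the startup log.
url = "http://localhost:5001/api/v1/generate"
payload = {
    "prompt": "User: Hello, who are you?\nAssistant:",
    "max_length": 120,           # tokens to generate
    "temperature": 0.7,
    "stop_sequence": ["User:"],  # stop when the model starts a new user turn
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["results"][0]["text"])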
The terminal also reports the inference timings:
Processing Prompt [BLAS] (39 / 39 tokens)
Generating (6 / 120 tokens)
(Stop sequence triggered: 我:)
CtxLimit: 45/1600, Process:0.58s (14.8ms/T = 67.59T/s), Generate:0.83s (138.8ms/T = 7.20T/s), Total:1.41s (4.26T/s)
Output: You're welcome.
As you can see, generation is quick, not far behind an NVIDIA GPU setup.
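To decode that timing line: ms/T is milliseconds per token, T/s is tokens per second, and the final rate divides the generated tokens by the total wall time. A quick arithmetic check of the figures above (small deviations come from the log rounding the seconds):

# Figures copied from the CtxLimit line above.
prompt_tokens, prompt_s = 39, 0.58
gen_tokens, gen_s = 6, 0.83

print(prompt_s / prompt_tokens * 1000)   # ~14.9 ms/T prompt processing
print(prompt_tokens / prompt_s)          # ~67.2 T/s prompt processing
print(gen_s / gen_tokens * 1000)         # ~138.3 ms/T generation
print(gen_tokens / (prompt_s + gen_s))   # ~4.26 T/s overall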
If you like, you can set up a prompt template to steer the model toward the NSFW content you want:
You are a sexy girl and a slut story writer named bufeiyan.
User: {prompt}
Assistant:
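The server also exposes the OpenAI-compatible endpoint shown in the startup log, so the template above can be applied programmatically. A minimal sketch using the standard OpenAI completion fields (koboldcpp serves a single model, so the model name here is informational only):

import json
import urllib.request

# OpenAI-compatible endpoint announced in the startup log.
url = "http://localhost:5001/v1/completions"

template = (
    "You are a sexy girl and a slut story writer named bufeiyan.\n"
    "User: {prompt}\n"
    "Assistant:"
)
payload = {
    "model": "causallm-7b",  # informational; the loaded GGUF is used regardless
    "prompt": template.format(prompt="Write a short story."),
    "max_tokens": 120,
    "temperature": 0.7,
    "stop": ["User:"],       # end the turn at the next user marker
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["text"])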
Conclusion
Metal acceleration on the Mac uses the Metal Performance Shaders (MPS) backend to speed up GPU inference. The MPS framework optimizes compute performance with kernels fine-tuned for the characteristics of each Metal GPU family, which lets machine-learning compute graphs and primitives map efficiently onto the GPU. With that in place, running an LLM on a Mac becomes genuinely painless.