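OpenAI's open-source Whisper model, a real gift to the industry, is the leader in open-source speech-to-text. Its one flaw is that it cannot take advantage of Apple's M-series chips to speed up transcription. Whisper.cpp is a C/C++ port of the Whisper model with no dependencies and low memory usage. Most importantly, it adds Core ML support, which makes it a very good fit for Apple M-series chips.
Whisper.cpp's tensor operators are heavily optimized for the CPU of Apple's M-series chips. Depending on the size of the computation, they use either Arm Neon SIMD intrinsics or CBLAS routines from the Accelerate framework. The latter is especially effective at larger sizes, because the Accelerate framework can use the dedicated AMX coprocessor available in Apple M-series chips.
Setting up Whisper.cpp
As usual, clone the Whisper.cpp project with git: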
git clone https://github.com/ggerganov/whisper.cpp.git
Then enter the project directory:
cd whisper.cpp
The project's default base model does not support Chinese, so the medium model is recommended here. Download it with the bundled shell script:
bash ./models/download-ggml-model.sh medium
Once the download finishes, the model file ggml-medium.bin is saved in the project's models directory. It is about 1.53 GB in size:
whisper.cpp git:(master) cd models
➜ models git:(master) ll
total 3006000
-rw-r--r-- 1 liuyue staff 3.2K 4 21 07:21 README.md
-rw-r--r-- 1 liuyue staff 7.2K 4 21 07:21 convert-h5-to-ggml.py
-rw-r--r-- 1 liuyue staff 9.2K 4 21 07:21 convert-pt-to-ggml.py
-rw-r--r-- 1 liuyue staff 13K 4 21 07:21 convert-whisper-to-coreml.py
drwxr-xr-x 4 liuyue staff 128B 4 22 00:33 coreml-encoder-medium.mlpackage
-rwxr-xr-x 1 liuyue staff 2.1K 4 21 07:21 download-coreml-model.sh
-rw-r--r-- 1 liuyue staff 1.3K 4 21 07:21 download-ggml-model.cmd
-rwxr-xr-x 1 liuyue staff 2.0K 4 21 07:21 download-ggml-model.sh
-rw-r--r-- 1 liuyue staff 562K 4 21 07:21 for-tests-ggml-base.bin
-rw-r--r-- 1 liuyue staff 573K 4 21 07:21 for-tests-ggml-base.en.bin
-rw-r--r-- 1 liuyue staff 562K 4 21 07:21 for-tests-ggml-large.bin
-rw-r--r-- 1 liuyue staff 562K 4 21 07:21 for-tests-ggml-medium.bin
-rw-r--r-- 1 liuyue staff 573K 4 21 07:21 for-tests-ggml-medium.en.bin
-rw-r--r-- 1 liuyue staff 562K 4 21 07:21 for-tests-ggml-small.bin
-rw-r--r-- 1 liuyue staff 573K 4 21 07:21 for-tests-ggml-small.en.bin
-rw-r--r-- 1 liuyue staff 562K 4 21 07:21 for-tests-ggml-tiny.bin
-rw-r--r-- 1 liuyue staff 573K 4 21 07:21 for-tests-ggml-tiny.en.bin
-rwxr-xr-x 1 liuyue staff 1.4K 4 21 07:21 generate-coreml-interface.sh
-rwxr-xr-x@ 1 liuyue staff 769B 4 21 07:21 generate-coreml-model.sh
-rw-r--r-- 1 liuyue staff 1.4G 3 22 16:04 ggml-medium.bin
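The 1.53 GB quoted for the model and the 1.4G in the listing are the same size expressed in different units. As a rough check, the sketch below takes the "model size = 1462.12 MB" figure that the loader prints later in this post and assumes that it is counted in units of 1024 * 1024 bytes (an assumption, not something the log states). The helper name mib_to_units is hypothetical.

```python
MIB = 1024 ** 2  # bytes in one mebibyte

def mib_to_units(size_mib):
    """Convert a size in MiB to (decimal gigabytes, binary gibibytes)."""
    size_bytes = size_mib * MIB
    return size_bytes / 1e9, size_bytes / 1024 ** 3

# 1462.12 is the "model size" value from the loader log (assumed to be MiB).
gb, gib = mib_to_units(1462.12)
print(round(gb, 2), round(gib, 2))
```

Under that assumption the two results round to 1.53 and 1.43, which lines up with both the quoted size and the directory listing.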
With the model downloaded, build the executables from the project root:
make
Output:
➜ whisper.cpp git:(master) make
I whisper.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread examples/bench/bench.cpp ggml.o whisper.o -o bench -framework Accelerate
Whisper.cpp is now set up.
A quick test
Now let's transcribe a speech sample and see how it does:
./main -osrt -m ./models/ggml-medium.bin -f samples/jfk.wav
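This command uses the ggml-medium.bin model downloaded earlier to recognize samples/jfk.wav, an audio file that ships with the project. The clip is from a famous speech by John F. Kennedy, the US president who was later assassinated. Output: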
➜ whisper.cpp git:(master) ./main -osrt -m ./models/ggml-medium.bin -f samples/jfk.wav
whisper_init_from_file_no_state: loading model from './models/ggml-medium.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 4
whisper_model_load: mem required = 1725.00 MB (+ 43.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 1462.35 MB
whisper_model_load: model size = 1462.12 MB
whisper_init_state: kv self size = 42.00 MB
whisper_init_state: kv cross size = 140.62 MB
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 |
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
output_srt: saving output to 'samples/jfk.wav.srt'
The clip is 11 seconds long, and the subtitles are also written to the file samples/jfk.wav.srt.
The English transcription is 100% accurate.
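The bracketed lines that main prints to the console have a regular shape: a start time, an arrow, an end time, then the recognized text. If you want to post-process that console output, a minimal Python sketch could look like the following. parse_segment is a hypothetical helper written for this post, not part of Whisper.cpp, and it only handles the console format shown above.

```python
import re

# Matches console lines such as:
# [00:00:00.000 --> 00:00:11.000] And so, my fellow Americans, ...
SEGMENT_RE = re.compile(
    r"\[(\d+):(\d{2}):(\d{2})\.(\d{3}) --> (\d+):(\d{2}):(\d{2})\.(\d{3})\]\s*(.*)"
)

def parse_segment(line):
    """Return (start_seconds, end_seconds, text), or None if the line is not a segment."""
    m = SEGMENT_RE.match(line.strip())
    if m is None:
        return None
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in m.groups()[:8])
    start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
    end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
    return start, end, m.group(9)
```

Lines that are not segments, such as the whisper_model_load messages, simply return None, so the whole console output can be fed through the function line by line.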
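Now let's switch to Chinese speech. You can record any clip you like. Note that Whisper.cpp only accepts audio files in WAV format, so first convert the mp3 file to WAV with ffmpeg. The options below resample to 16 kHz, downmix to mono, and encode as 16-bit PCM: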
ffmpeg -i ./test1.mp3 -ar 16000 -ac 1 -c:a pcm_s16le ./test1.wav
Output:
ffmpeg version 5.1.2 Copyright (c) 2000-2022 the FFmpeg developers
built with Apple clang version 14.0.0 (clang-1400.0.29.202)
configuration: --prefix=/opt/homebrew/Cellar/ffmpeg/5.1.2_1 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags= --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libbluray --enable-libdav1d --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libspeex --enable-libsoxr --enable-libzmq --enable-libzimg --disable-libjack --disable-indev=jack --enable-videotoolbox --enable-neon
libavutil 57. 28.100 / 57. 28.100
libavcodec 59. 37.100 / 59. 37.100
libavformat 59. 27.100 / 59. 27.100
libavdevice 59. 7.100 / 59. 7.100
libavfilter 8. 44.100 / 8. 44.100
libswscale 6. 7.100 / 6. 7.100
libswresample 4. 7.100 / 4. 7.100
libpostproc 56. 6.100 / 56. 6.100
[mp3 @ 0x130e05580] Estimating duration from bitrate, this may be inaccurate
Input #0, mp3, from './test1.mp3':
Duration: 00:05:41.33, start: 0.000000, bitrate: 48 kb/s
Stream #0:0: Audio: mp3, 24000 Hz, mono, fltp, 48 kb/s
Stream mapping:
Stream #0:0 -> #0:0 (mp3 (mp3float) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to './test1.wav':
Metadata:
ISFT : Lavf59.27.100
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Metadata:
encoder : Lavc59.37.100 pcm_s16le
[mp3float @ 0x132004260] overread, skip -6 enddists: -4 -4ed=N/A
Last message repeated 1 times
[mp3float @ 0x132004260] overread, skip -7 enddists: -1 -1
[mp3float @ 0x132004260] overread, skip -7 enddists: -2 -2
[mp3float @ 0x132004260] overread, skip -7 enddists: -1 -1
[mp3float @ 0x132004260] overread, skip -9 enddists: -2 -2
[mp3float @ 0x132004260] overread, skip -5 enddists: -1 -1
Last message repeated 1 times
[mp3float @ 0x132004260] overread, skip -7 enddists: -3 -3
[mp3float @ 0x132004260] overread, skip -8 enddists: -5 -5
[mp3float @ 0x132004260] overread, skip -5 enddists: -2 -2
[mp3float @ 0x132004260] overread, skip -6 enddists: -1 -1
[mp3float @ 0x132004260] overread, skip -7 enddists: -3 -3
[mp3float @ 0x132004260] overread, skip -6 enddists: -2 -2
[mp3float @ 0x132004260] overread, skip -6 enddists: -3 -3
[mp3float @ 0x132004260] overread, skip -7 enddists: -6 -6
[mp3float @ 0x132004260] overread, skip -9 enddists: -6 -6
[mp3float @ 0x132004260] overread, skip -5 enddists: -3 -3
[mp3float @ 0x132004260] overread, skip -5 enddists: -2 -2
[mp3float @ 0x132004260] overread, skip -5 enddists: -3 -3
[mp3float @ 0x132004260] overread, skip -7 enddists: -1 -1
size= 10667kB time=00:05:41.32 bitrate= 256.0kbits/s speed=2.08e+03x
video:0kB audio:10666kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000714%
This converts a clip of five minutes and forty-one seconds into a WAV file.
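Before transcribing, it can be worth confirming that the converted file really has the format requested by the ffmpeg options (-ar 16000 -ac 1 -c:a pcm_s16le). The sketch below uses only Python's standard wave module. wav_format and is_whisper_ready are hypothetical helper names, and the check simply mirrors the ffmpeg options rather than any official validation in Whisper.cpp.

```python
import wave

def wav_format(path):
    """Return (sample_rate_hz, channels, bits_per_sample) for a PCM WAV file."""
    with wave.open(path, "rb") as w:
        return w.getframerate(), w.getnchannels(), w.getsampwidth() * 8

def is_whisper_ready(path):
    # 16 kHz, mono, 16-bit PCM: the values requested by
    # -ar 16000 -ac 1 -c:a pcm_s16le in the ffmpeg command.
    return wav_format(path) == (16000, 1, 16)
```

For the file produced above, calling is_whisper_ready("./test1.wav") should return True if the conversion went as the ffmpeg log suggests.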
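Then run the transcription command (the path below assumes test1.wav has been placed in the samples directory):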
./main -osrt -m ./models/ggml-medium.bin -f samples/test1.wav -l zh
The -l option is needed here to tell the program that the speech is Chinese. Output:
➜ whisper.cpp git:(master) ./main -osrt -m ./models/ggml-medium.bin -f samples/test1.wav -l zh
whisper_init_from_file_no_state: loading model from './models/ggml-medium.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 4
whisper_model_load: mem required = 1725.00 MB (+ 43.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 1462.35 MB
whisper_model_load: model size = 1462.12 MB
whisper_init_state: kv self size = 42.00 MB
whisper_init_state: kv cross size = 140.62 MB
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 |
main: processing 'samples/test1.wav' (5461248 samples, 341.3 sec), 4 threads, 1 processors, lang = zh, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:03.340] Hello 大家好,這裡是劉越的技術部落格。
[00:00:03.340 --> 00:00:05.720] 最近的事情大家都曉得了,
[00:00:05.720 --> 00:00:07.880] 某公司技術經理魅上欺下,
[00:00:07.880 --> 00:00:10.380] 打工人應對進隊,不易快災,
[00:00:10.380 --> 00:00:12.020] 不易壯災,
[00:00:12.020 --> 00:00:14.280] 所謂魅上者必欺下,
[00:00:14.280 --> 00:00:16.020] 古人誠不我竊。
[00:00:16.020 --> 00:00:17.360] 技術經理者,
[00:00:17.360 --> 00:00:20.160] 公然在聊天群裡大玩職場PUA,
[00:00:20.160 --> 00:00:22.400] 氣焰囂張,有恃無恐,
[00:00:22.400 --> 00:00:23.700] 最終引發眾目,
[00:00:23.700 --> 00:00:26.500] 嘿嘿,技術經理,團隊領導,
[00:00:26.500 --> 00:00:29.300] 原來團隊領導這四個字是這麼用的,
[00:00:29.300 --> 00:00:31.540] 奴媚顯達,構陷下屬,
[00:00:31.540 --> 00:00:32.780] 人文巨損,
[00:00:32.780 --> 00:00:33.840] 逢迎上意,
[00:00:33.840 --> 00:00:34.980] 傲然下欺,
[00:00:34.980 --> 00:00:36.080] 裝腔作勢,
[00:00:36.080 --> 00:00:37.180] 極盡投機,
[00:00:37.180 --> 00:00:38.320] 負他人之負,
[00:00:38.320 --> 00:00:39.620] 康他人之愷,
[00:00:39.620 --> 00:00:42.180] 如此者,可謂團隊領導也。
[00:00:42.180 --> 00:00:43.980] 中國的所謂傳統文化,
[00:00:43.980 --> 00:00:45.320] 除了仁義理智性,
[00:00:45.320 --> 00:00:46.620] 除了金石子極,
[00:00:46.620 --> 00:00:47.820] 除了爭爭風骨,
[00:00:47.820 --> 00:00:49.560] 其實還有很多別的東西,
[00:00:49.560 --> 00:00:52.020] 被大家或有意或無意的忽視了,
[00:00:52.020 --> 00:00:53.300] 比如功利實用,
[00:00:53.300 --> 00:00:54.300] 屈顏附示,
[00:00:54.300 --> 00:00:55.360] 以兼至善,
[00:00:55.360 --> 00:01:01.000] 官本位和錢規則的傳統,在某種程度上,傳統文化這沒硬幣的另一面,
[00:01:01.000 --> 00:01:03.900] 才是更需要我們去面對和正視的,
[00:01:03.900 --> 00:01:07.140] 我以為,這在目前盛行實惠價值觀的時候,
[00:01:07.140 --> 00:01:08.940] 提一提還是必要的,
[00:01:08.940 --> 00:01:10.240] 有的人說了,
[00:01:10.240 --> 00:01:13.740] 在開發群裡對領導,非常痛快,非常爽,
[00:01:13.740 --> 00:01:17.180] 但是,然後呢,有用嗎?
[00:01:17.180 --> 00:01:19.260] 倒黴的還不是自己,
[00:01:19.260 --> 00:01:22.520] 沒錯,這就是功利且實用的傳統,
[00:01:22.520 --> 00:01:28.780] 各種精神,思辨,反抗,憤怒,都抵不過三個字,有用嗎?
[00:01:28.780 --> 00:01:31.820] 事實上,但凡叫做某種精神的,
[00:01:31.820 --> 00:01:33.320] 那就是哲學思辨,
[00:01:33.320 --> 00:01:36.220] 就是一種相對無用的思辨和學術,
[00:01:36.220 --> 00:01:39.180] 而中國職場有很強的實用傳統,
[00:01:39.180 --> 00:01:42.140] 但這不是學術思辨,也沒有理論構架,
[00:01:42.140 --> 00:01:44.380] 僅僅是一種短視的經驗論,
[00:01:44.380 --> 00:01:47.220] 所以,功利主義,是密爾,
[00:01:47.220 --> 00:01:48.980] 編慶的倫理價值學說,
[00:01:48.980 --> 00:01:52.700] 強調的是,追求幸福,如何獲得最大效用,
[00:01:52.700 --> 00:01:55.580] 實用主義,是西方的一個學術流派,
[00:01:55.580 --> 00:01:58.260] 比如杜威,胡適,就是代表,
[00:01:58.260 --> 00:02:01.180] 實用主義的另一個名字,叫人本主義,
[00:02:01.180 --> 00:02:04.780] 意思是,以人作為經驗和萬物的尺度,
[00:02:04.780 --> 00:02:06.080] 換句話說,
[00:02:06.080 --> 00:02:09.420] 功利主義,反對的正是那種短視的功利,
[00:02:09.420 --> 00:02:13.220] 實用主義,反對的也正是那種凡是看對自己,
[00:02:13.220 --> 00:02:15.220] 是不是有利的侷限判斷,
[00:02:15.220 --> 00:02:17.260] 而在中國職場功利,
[00:02:17.260 --> 00:02:21.060] 實用的傳統中,恰恰是不會有這些理論構架的,
[00:02:21.060 --> 00:02:23.700] 並且,不僅沒有理論構架,
[00:02:23.700 --> 00:02:26.140] 還要對那些無用的,思辨的,
[00:02:26.140 --> 00:02:29.980] 純粹的精神,視如避喜,吃之以鼻,
[00:02:29.980 --> 00:02:32.260] 沒錯,在技術團隊裡,
[00:02:32.260 --> 00:02:35.260] 我們重視技術,重視實用的科學,
[00:02:35.260 --> 00:02:38.900] 但是主流職場並不鼓勵去搞那些看似無用的東西,
[00:02:38.900 --> 00:02:41.380] 比如普通勞動者的合法權益,
[00:02:41.380 --> 00:02:43.580] 張義謀的滿江紅,
[00:02:43.580 --> 00:02:45.220] 大家想必也都看了的,
[00:02:45.220 --> 00:02:46.820] 人們總覺得很奇怪,
[00:02:46.820 --> 00:02:48.300] 為什麼那麼壞的人,
[00:02:48.300 --> 00:02:50.020] 皇帝為啥不罷免他?
[00:02:50.020 --> 00:02:53.140] 為什麼小人能當權來構陷好人呢?
[00:02:53.140 --> 00:02:55.980] 當我們瞭解了傳統文化中的法家思想,
[00:02:55.980 --> 00:02:57.300] 就瞭然了,
[00:02:57.300 --> 00:02:59.260] 在法家的思想規則下,
[00:02:59.260 --> 00:03:01.660] 小人得是,忠良備辱,
[00:03:01.660 --> 00:03:03.140] 事事所必然,
[00:03:03.140 --> 00:03:04.900] 因為他一開始的設定,
[00:03:04.900 --> 00:03:07.540] 就使得劣幣驅逐良幣的遊戲規則,
[00:03:07.540 --> 00:03:09.940] 所以,在這種觀念下,
[00:03:09.940 --> 00:03:12.460] 古代常見的一種職場智慧就是,
[00:03:12.460 --> 00:03:14.820] 自汙名節,以求自保,
[00:03:14.820 --> 00:03:16.420] 在這種環境下,
[00:03:16.420 --> 00:03:17.780] 要想生存,
[00:03:17.780 --> 00:03:19.260] 就只有一條出路,
[00:03:19.260 --> 00:03:20.900] 那就是依附權力,
[00:03:20.900 --> 00:03:23.700] 並且,誰能擁有更大的權力,
[00:03:23.700 --> 00:03:25.700] 誰就能生存得更好,
[00:03:25.700 --> 00:03:27.500] 如何依附權力呢?
[00:03:27.500 --> 00:03:29.180] 那就是現在正在發生的,
[00:03:29.180 --> 00:03:31.900] 肆無忌憚的大腕職場PUA,
[00:03:31.900 --> 00:03:33.060] 除此之外,
[00:03:33.060 --> 00:03:34.340] 這種權力關係,
[00:03:34.340 --> 00:03:36.900] 在古代會滲透到方方面面,
[00:03:36.900 --> 00:03:40.300] 因為權力系統是一個複雜而高效的執行機器,
[00:03:40.300 --> 00:03:42.940] CPU,記憶體,硬碟,
[00:03:42.940 --> 00:03:44.900] 甚至一顆C面底螺絲釘,
[00:03:44.900 --> 00:03:47.140] 都是權力機器上的一個環節,
[00:03:47.140 --> 00:03:48.060] 於是,
[00:03:48.060 --> 00:03:50.420] 官僚體系之外的一切職場人,
[00:03:50.420 --> 00:03:52.340] 都會面臨一個尷尬的處境,
[00:03:52.340 --> 00:03:54.340] 一方面遭遇權力的打壓,
[00:03:54.340 --> 00:03:55.340] 另一方面,
[00:03:55.340 --> 00:03:57.900] 也都會多少嚐到權力的甜頭,
[00:03:57.900 --> 00:03:58.900] 於是乎,
[00:03:58.900 --> 00:04:01.420] 權力的細胞滲透到角角落落,
[00:04:01.420 --> 00:04:02.980] 即便沒有組織權力,
[00:04:02.980 --> 00:04:04.620] 也要追求文化權力,
[00:04:04.620 --> 00:04:05.500] 父權,
[00:04:05.500 --> 00:04:06.380] 夫權,
[00:04:06.380 --> 00:04:07.460] 家長權力,
[00:04:07.460 --> 00:04:08.580] 宗族權力,
[00:04:08.580 --> 00:04:09.660] 老師權力,
[00:04:09.660 --> 00:04:10.780] 公司權力,
[00:04:10.780 --> 00:04:12.140] 團隊領導權力,
[00:04:12.140 --> 00:04:13.100] 點點滴滴,
[00:04:13.100 --> 00:04:15.580] 滴滴點點,追逐權力,
[00:04:15.580 --> 00:04:18.140] 幾乎成為人們生活的全部意義,
[00:04:18.140 --> 00:04:18.980] 故而,
[00:04:18.980 --> 00:04:19.980] 服從權力,
[00:04:19.980 --> 00:04:21.180] 服從上級,
[00:04:21.180 --> 00:04:22.420] 不得罪同事,
[00:04:22.420 --> 00:04:23.660] 不得罪朋友,
[00:04:23.660 --> 00:04:25.060] 不得罪陌生人,
[00:04:25.060 --> 00:04:26.100] 因為你不知道,
[00:04:26.100 --> 00:04:28.260] 他們背後有什麼的權力關係,
[00:04:28.260 --> 00:04:30.940] 他們又會不會用這個權力來對付你,
[00:04:30.940 --> 00:04:31.940] 沒錯,
[00:04:31.940 --> 00:04:34.380] 當我們解構群裡那位領導的行為時,
[00:04:34.380 --> 00:04:36.220] 我們也在解構我們自己,
[00:04:36.220 --> 00:04:37.420] 毫無疑問,
[00:04:37.420 --> 00:04:39.380] 對於這位敢於發聲的職場人,
[00:04:39.380 --> 00:04:41.180] 深安職場底層邏輯的,
[00:04:41.180 --> 00:04:43.220] 我們一定能猜到他的結局,
[00:04:43.220 --> 00:04:44.700] 他的結局是註定的,
[00:04:44.700 --> 00:04:46.220] 同時也是悲哀的,
[00:04:46.220 --> 00:04:47.340] 問題是,
[00:04:47.340 --> 00:04:48.540] 這樣做,
[00:04:48.540 --> 00:04:49.660] 值得嗎?
[00:04:49.660 --> 00:04:52.580] 香港著名導演王家衛拍過一部電影,
[00:04:52.580 --> 00:04:54.420] 叫做東邪西毒,
[00:04:54.420 --> 00:04:56.340] 電影中有這樣一個情節,
[00:04:56.340 --> 00:04:59.620] 有個女人的弟弟被太尉府的一群刀客殺了,
[00:04:59.620 --> 00:05:00.860] 他想報仇,
[00:05:00.860 --> 00:05:02.300] 可自己沒有武功,
[00:05:02.300 --> 00:05:04.060] 只能請刀客出手,
[00:05:04.060 --> 00:05:05.540] 但家裡窮沒錢,
[00:05:05.540 --> 00:05:08.540] 最有價值的資產是一籃子雞蛋,
[00:05:08.540 --> 00:05:09.260] 於是,
[00:05:09.260 --> 00:05:10.900] 他提著那一籃子雞蛋,
[00:05:10.900 --> 00:05:13.420] 天天站在刀客劍客們經過的路口,
[00:05:13.420 --> 00:05:14.700] 請求他們出手,
[00:05:14.700 --> 00:05:16.220] 報仇就是雞蛋,
[00:05:16.220 --> 00:05:17.860] 沒有人願意為了雞蛋,
[00:05:17.860 --> 00:05:20.020] 去單挑太尉府的刀客,
[00:05:20.020 --> 00:05:21.460] 除了洪七,
[00:05:21.460 --> 00:05:24.260] 洪七獨自力戰太尉府那幫刀客,
[00:05:24.260 --> 00:05:26.780] 所得的報仇是一個雞蛋,
[00:05:26.780 --> 00:05:29.020] 但是洪七付出的代價太大,
[00:05:29.020 --> 00:05:30.060] 混戰中,
[00:05:30.060 --> 00:05:32.700] 洪七被對手砍斷了一根手指,
[00:05:32.700 --> 00:05:33.820] 為了一個雞蛋,
[00:05:33.820 --> 00:05:35.500] 而失去一隻手指,
[00:05:35.500 --> 00:05:36.740] 值得嗎?
[00:05:36.740 --> 00:05:37.860] 不值得,
[00:05:37.860 --> 00:05:39.300] 但是我覺得痛快,
[00:05:39.300 --> 00:05:40.540] 因為這才是我自己
output_srt: saving output to 'samples/test1.wav.srt'
whisper_print_timings: load time = 978.82 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 438.81 ms
whisper_print_timings: sample time = 980.66 ms / 2343 runs ( 0.42 ms per run)
whisper_print_timings: encode time = 31476.10 ms / 13 runs ( 2421.24 ms per run)
whisper_print_timings: decode time = 47833.70 ms / 2343 runs ( 20.42 ms per run)
whisper_print_timings: total time = 81797.88 ms
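A clip of more than five minutes (341.3 seconds) is transcribed in about 82 seconds, a little over a minute. Full marks for efficiency.
Accuracy still leaves room for improvement, of course. The large model can improve accuracy, but transcription time will increase accordingly.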
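To put a number on the speed, the timings printed above can be turned into a real-time factor, that is, how many seconds of audio are processed per second of wall-clock time. The sketch below only does the arithmetic on two values copied from the log (341.3 sec of audio and total time = 81797.88 ms). realtime_factor is a hypothetical helper name.

```python
def realtime_factor(audio_seconds, total_ms):
    """Seconds of audio processed per second of wall-clock time."""
    return audio_seconds / (total_ms / 1000.0)

# Values copied from the log above.
factor = realtime_factor(341.3, 81797.88)
print(round(factor, 2))  # about 4.17
```

On this machine the medium model runs a little over four times faster than real time with the CPU-only build.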
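Model conversion for Apple M-series chips
Good news for Mac users: through Core ML, Whisper.cpp can run encoder inference on the Apple Neural Engine (ANE), which can be more than three times faster than running on the CPU alone.
First install the conversion dependencies: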
pip install ane_transformers
pip install openai-whisper
pip install coremltools
Then run the conversion script:
./models/generate-coreml-model.sh medium
The argument is the name of the model.
Output:
➜ models git:(master) python3 convert-whisper-to-coreml.py --model medium --encoder-only True
scikit-learn version 1.2.0 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
ModelDimensions(n_mels=80, n_audio_ctx=1500, n_audio_state=1024, n_audio_head=16, n_audio_layer=24, n_vocab=51865, n_text_ctx=448, n_text_state=1024, n_text_head=16, n_text_layer=24)
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:97: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
scale = (n_state // self.n_head) ** -0.25
Converting PyTorch Frontend ==> MIL Ops: 100%|▉| 1971/1972 [00:00<00:00, 3247.25
Running MIL frontend_pytorch pipeline: 100%|█| 5/5 [00:00<00:00, 54.69 passes/s]
Running MIL default pipeline: 100%|████████| 57/57 [00:09<00:00, 6.29 passes/s]
Running MIL backend_mlprogram pipeline: 100%|█| 10/10 [00:00<00:00, 444.13 passe
done converting
After the conversion, rebuild:
make clean
WHISPER_COREML=1 make -j
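Then transcribe as before. The command still points at the ggml model file, and the Core ML build uses the converted encoder alongside it: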
./main -m models/ggml-medium.bin -f samples/jfk.wav
With that, Mac users are promoted to first-class citizens.
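One way to check which build you are running is the system_info line that main prints. In the CPU-only runs earlier in this post it ends with COREML = 0. A build made with WHISPER_COREML=1 is expected to report COREML = 1, although that is an assumption based on the flag's name and is not shown in the logs here. The sketch below turns that line into a dictionary. parse_system_info is a hypothetical helper, not part of Whisper.cpp.

```python
def parse_system_info(line):
    """Turn 'system_info: n_threads = 4 / 10 | AVX = 0 | ...' into a dict of strings."""
    _, _, body = line.partition("system_info:")
    flags = {}
    for field in body.split("|"):
        if "=" in field:
            key, _, value = field.partition("=")
            flags[key.strip()] = value.strip()
    return flags

# Shortened copy of the system_info line from the CPU-only run above.
line = "system_info: n_threads = 4 / 10 | NEON = 1 | BLAS = 1 | COREML = 0 |"
print(parse_system_info(line)["COREML"])
```

For the sample line this prints 0, matching the CPU-only log. Running the same check on the output of the rebuilt binary shows whether Core ML support was compiled in.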
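Conclusion
Whisper.cpp is Whisper reborn in both spirit and body. It inherits all of Whisper's functionality and, on top of that, improves the speed and efficiency of speech transcription as well as cross-platform portability. The rapid progress of open-source technology makes one thing clear: spreading high-quality technology is worth even more than the technology itself.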