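CosyVoice: a multilingual speech model with timbre and emotion control and one-shot (zero-shot) voice cloning, deployed locally on Windows and Mac; open-sourced by Tongyi Lab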

Posted by Liu Yue's Tech Blog (刘悦的技术博客) on 2024-07-07

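Alibaba's Tongyi Lab recently open-sourced the CosyVoice speech model. It generates natural-sounding speech with control over language, timbre, and emotion, and it performs strongly at multilingual speech generation, zero-shot speech generation, cross-lingual voice synthesis, and instruction following.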

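CosyVoice was trained on more than 150,000 hours of data in total and can synthesize speech in five languages: Chinese, English, Japanese, Cantonese, and Korean. Its output is markedly better than that of traditional speech synthesis models.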

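CosyVoice supports one-shot voice cloning: from just 3 to 10 seconds of source audio it can generate a convincing imitation of the voice, down to details such as prosody and emotion. It also holds up well in cross-lingual synthesis.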

Because the official release does not yet support Windows or Mac, this post walks through deploying CosyVoice locally on each of the two platforms.

Windows Platform

Starting on Windows, clone the project:

git clone https://github.com/v3ucn/CosyVoice_For_Windows

Enter the project directory:

cd CosyVoice_For_Windows

Fetch the bundled submodules:

git submodule update --init --recursive

Then install the dependencies:

conda create -n cosyvoice python=3.11  
conda activate cosyvoice  
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

The officially recommended Python version is 3.8, but in practice 3.11 also runs, and in theory 3.11 performs better.

Next, download the Windows build of deepspeed and install it:

https://github.com/S95Sedan/Deepspeed-Windows/releases/tag/v14.0%2Bpy311
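
The release linked above is tagged for Python 3.11 (the `py311` suffix), so it is worth confirming that the active conda environment matches before installing it. Below is a minimal sketch of such a check. The helper names are hypothetical, and treating the part after `+` in the release tag as the Python tag is an assumption based only on how this one URL is written.

```python
import sys


def python_wheel_tag(version_info=None):
    """Return a short tag such as 'py311' for a (major, minor, ...) version."""
    major, minor = (version_info or sys.version_info)[:2]
    return f"py{major}{minor}"


def matches_release(release_tag, version_info=None):
    """True if the release tag ends with '+<tag for this interpreter>'.

    Assumption: the suffix after '+' names the Python version the build targets.
    """
    return release_tag.endswith("+" + python_wheel_tag(version_info))


# "%2B" in the URL is the URL-encoded form of "+".
print(python_wheel_tag(), matches_release("v14.0+py311"))
```

If the second value printed is False, the environment is not running Python 3.11 and this particular build is probably not the right one for it.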

Finally, install the GPU build of torch:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

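CUDA 12 is chosen here (the cu121 index corresponds to CUDA 12.1); a CUDA 11 build can be installed instead.

Then download the models: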
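
After installing, a quick way to confirm that the GPU build was picked up is to ask torch directly. The sketch below is not part of the original project. The helper name `describe_torch` is made up for illustration, and it returns a plain dictionary even when torch is not installed, so it can be run at any point in the setup.

```python
def describe_torch():
    """Report the installed torch build; returns a dict even if torch is absent."""
    try:
        import torch
    except ImportError:
        return {"installed": False, "version": None, "cuda_available": False}
    return {
        "installed": True,
        # Wheels from the cu121 index typically carry a "+cu121" version suffix.
        "version": torch.__version__,
        # False here usually means a CPU-only wheel or a driver problem.
        "cuda_available": torch.cuda.is_available(),
    }


print(describe_torch())
```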

# Download the models with git; make sure git lfs is installed first  
mkdir -p pretrained_models  
git clone https://www.modelscope.cn/iic/CosyVoice-300M.git pretrained_models/CosyVoice-300M  
git clone https://www.modelscope.cn/iic/CosyVoice-300M-SFT.git pretrained_models/CosyVoice-300M-SFT  
git clone https://www.modelscope.cn/iic/CosyVoice-300M-Instruct.git pretrained_models/CosyVoice-300M-Instruct  
git clone https://www.modelscope.cn/speech_tts/speech_kantts_ttsfrd.git pretrained_models/speech_kantts_ttsfrd

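Because the models are hosted on ModelScope (魔搭), a repository inside China, the download is very fast from within the country.

Finally, set the environment variable (the set syntax below is for a cmd prompt):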
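
Since the clones are large, it can help to confirm that all four directories actually arrived before moving on. Here is a minimal sketch that uses the directory names from the commands above. The function name and the "absent or empty" criterion are assumptions made for illustration, and it does not verify that git lfs replaced the pointer files with the real weights.

```python
from pathlib import Path

# Directory names taken from the git clone commands in this post.
EXPECTED_MODELS = [
    "CosyVoice-300M",
    "CosyVoice-300M-SFT",
    "CosyVoice-300M-Instruct",
    "speech_kantts_ttsfrd",
]


def missing_models(root="pretrained_models", expected=EXPECTED_MODELS):
    """Return the expected model directories that are absent or empty under root."""
    root = Path(root)
    missing = []
    for name in expected:
        path = root / name
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


print("missing:", missing_models())
```

Run it from the project root. An empty list means every expected directory exists and contains at least one entry.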

set PYTHONPATH=third_party/AcademiCodec;third_party/Matcha-TTS
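
The two entries are joined with a semicolon because that is the path-list separator on Windows; on macOS and Linux it is a colon. The sketch below builds the same value portably with `os.pathsep`, and also shows an alternative that adds the directories to `sys.path` at runtime instead of relying on the environment variable. Both helper names are hypothetical and not part of the project.

```python
import os
import sys

# Relative paths from the command above.
THIRD_PARTY = ["third_party/AcademiCodec", "third_party/Matcha-TTS"]


def build_pythonpath(entries=THIRD_PARTY, sep=os.pathsep):
    """Join entries with the platform's path-list separator (';' or ':')."""
    return sep.join(entries)


def add_third_party(repo_root="."):
    """Prepend the third-party directories to sys.path; return those added."""
    added = []
    for rel in THIRD_PARTY:
        path = os.path.abspath(os.path.join(repo_root, rel))
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added


print(build_pythonpath())
```

Calling `add_third_party()` at the top of a script, before importing `cosyvoice`, is one way to avoid having to remember the environment variable in every new shell.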

Basic usage:

from cosyvoice.cli.cosyvoice import CosyVoice  
from cosyvoice.utils.file_utils import load_wav  
import torchaudio  
  
cosyvoice = CosyVoice('speech_tts/CosyVoice-300M-SFT')  
# sft usage  
print(cosyvoice.list_avaliable_spks())  
output = cosyvoice.inference_sft('你好,我是通義生成式語音大模型,請問有什麼可以幫您的嗎?', '中文女')  
torchaudio.save('sft.wav', output['tts_speech'], 22050)  
  
cosyvoice = CosyVoice('speech_tts/CosyVoice-300M')  
# zero_shot usage  
prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)  
output = cosyvoice.inference_zero_shot('收到好友從遠方寄來的生日禮物,那份意外的驚喜與深深的祝福讓我心中充滿了甜蜜的快樂,笑容如花兒般綻放。', '希望你以後能夠做的比我還好呦。', prompt_speech_16k)  
torchaudio.save('zero_shot.wav', output['tts_speech'], 22050)  
# cross_lingual usage  
prompt_speech_16k = load_wav('cross_lingual_prompt.wav', 16000)  
output = cosyvoice.inference_cross_lingual('<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that\'s coming into the family is a reason why sometimes we don\'t buy the whole thing.', prompt_speech_16k)  
torchaudio.save('cross_lingual.wav', output['tts_speech'], 22050)  
  
cosyvoice = CosyVoice('speech_tts/CosyVoice-300M-Instruct')  
# instruct usage  
output = cosyvoice.inference_instruct('在面對挑戰時,他展現了非凡的<strong>勇氣</strong>與<strong>智慧</strong>。', '中文男', 'Theo \'Crimson\', is a fiery, passionate rebel leader. Fights with fervor for justice, but struggles with impulsiveness.')  
torchaudio.save('instruct.wav', output['tts_speech'], 22050)
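
The zero-shot and cross-lingual calls depend on a prompt recording, and as noted at the top of this post, roughly 3 to 10 seconds of audio is what the model expects. A small pre-check like the one below can catch clips that are too short or too long before they are passed to the model. This is only a sketch: the helper names are made up, and it relies on Python's standard `wave` module, which reads uncompressed PCM WAV files only.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_prompt(path, min_s=3.0, max_s=10.0):
    """True if the clip length falls inside the 3-10 s range quoted in this post."""
    return min_s <= wav_duration_seconds(path) <= max_s


# Example (assumes the file exists):
# print(is_usable_prompt("zero_shot_prompt.wav"))
```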

The web UI is recommended here, since it is more intuitive and convenient:

python3 webui.py --port 9886 --model_dir ./pretrained_models/CosyVoice-300M

Then visit http://localhost:9886

Note that the official project uses sox as the torchaudio backend; this port switches it to soundfile:
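
If the page does not load, the first thing to rule out is that the server never started listening. The sketch below simply tries to open a TCP connection to the port. It assumes the port 9886 used in the command above, and the function name is hypothetical.

```python
import socket


def is_port_open(host="localhost", port=9886, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


print("web UI reachable:", is_port_open())
```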

torchaudio.set_audio_backend('soundfile')
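
The soundfile backend relies on the separate `soundfile` Python package, so a missing package is one possible cause if audio loading fails at this step. Here is a minimal sketch for checking that a package is importable without actually importing it. The helper name is made up for illustration.

```python
import importlib.util


def module_available(name):
    """True if a top-level module with this name can be found for import."""
    return importlib.util.find_spec(name) is not None


if not module_available("soundfile"):
    print("soundfile is not installed; try: pip install soundfile")
else:
    print("soundfile is available")
```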

There may be some bugs, so keep an eye on the official project for updates.

macOS Platform

Now for macOS. Again, start by cloning the project:

git clone https://github.com/v3ucn/CosyVoice_for_MacOs.git

Install the dependencies:

cd CosyVoice_for_MacOs  
conda create -n cosyvoice python=3.8  
conda activate cosyvoice  
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

Next, install sox through Homebrew:

brew install sox

That completes the setup, but don't forget to set the environment variable:

export PYTHONPATH=third_party/AcademiCodec:third_party/Matcha-TTS

Usage is the same as on Windows.

Again, the web UI is recommended:

python3 webui.py --port 50000 --model_dir speech_tts/CosyVoice-300M

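Then visit http://localhost:50000

Conclusion

In fairness, CosyVoice lives up to its big-company pedigree. The model quality is hard to fault and, in my view, represents the best of Chinese AI; Tongyi Lab has earned its reputation. It would of course be even better if the production-engineered code were open-sourced as well. With libtorch optimization, I believe this model could become the go-to choice for open-source TTS.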
