CosyVoice: Local Deployment (Win/Mac) of Tongyi Lab's Open-Source Multilingual, Timbre- and Emotion-Controllable Model for One-Shot Zero-Shot Voice Cloning

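Alibaba's Tongyi Lab recently open-sourced CosyVoice, a speech model that generates natural-sounding speech with support for multiple languages and control over timbre and emotion. It performs strongly in multilingual speech generation, zero-shot speech generation, cross-lingual voice synthesis, and instruction following.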

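CosyVoice was trained on more than 150,000 hours of data in total and can synthesize five languages: Mandarin Chinese, English, Japanese, Cantonese, and Korean. Its synthesis quality is markedly better than that of conventional speech synthesis models.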

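CosyVoice supports one-shot voice cloning: from only 3 to 10 seconds of source audio it can generate a cloned voice, reproducing even details such as prosody and emotion. It also performs well in cross-lingual speech synthesis.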
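Because one-shot cloning works from only 3 to 10 seconds of source audio, it can help to check a prompt's length before using it. The sketch below is an illustration only, built on Python's standard wave module; it assumes an uncompressed PCM WAV prompt (wave cannot read float or compressed WAV files), and the helper names prompt_duration_seconds and is_usable_prompt are hypothetical, not part of CosyVoice.

```python
import wave

# Prompt length window quoted in the text above (3-10 s).
MIN_PROMPT_SECONDS = 3.0
MAX_PROMPT_SECONDS = 10.0


def prompt_duration_seconds(path):
    """Duration of an uncompressed PCM WAV file, in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_prompt(path):
    """True if the prompt's length falls inside the 3-10 s window."""
    return MIN_PROMPT_SECONDS <= prompt_duration_seconds(path) <= MAX_PROMPT_SECONDS
```

For example, is_usable_prompt('zero_shot_prompt.wav') returns False for a prompt that is too short or too long, so it can be trimmed or re-recorded first.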

Because the official release does not yet support Windows or macOS, this article deploys CosyVoice locally on each of these two platforms.

Windows Platform

Starting with Windows, clone the project:

git clone https://github.com/v3ucn/CosyVoice_For_Windows

Enter the project directory:

cd CosyVoice_For_Windows

Initialize the bundled submodules:

git submodule update --init --recursive

Then install the dependencies:

conda create -n cosyvoice python=3.11  
conda activate cosyvoice  
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

The officially recommended Python version is 3.8, but 3.11 also works in practice, and in principle 3.11 should run faster.

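Then download the Windows build of DeepSpeed from the release page below and install it (the release is tagged py311, i.e. built for Python 3.11, matching the environment created above):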

https://github.com/S95Sedan/Deepspeed-Windows/releases/tag/v14.0%2Bpy311

Next, install the GPU (CUDA) build of PyTorch:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

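This installs the CUDA 12.1 builds (cu121); CUDA 11 builds can be installed instead, for example by replacing cu121 with cu118 in the index URL.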
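To confirm that the environment picked up the intended interpreter and the CUDA build of PyTorch, a small report like the one below may help. It is a hedged sketch: describe_environment is a hypothetical helper, the Python 3.11 comparison only reflects the py311 tag in the DeepSpeed release URL above, and the only PyTorch attributes it relies on are torch.__version__, torch.version.cuda, and torch.cuda.is_available().

```python
import sys


def describe_environment():
    """Report the Python version and, if PyTorch is installed, its CUDA status."""
    info = {
        "python": "%d.%d" % sys.version_info[:2],
        # The DeepSpeed wheel linked above is tagged py311 (built for Python 3.11).
        "python_matches_deepspeed_wheel": sys.version_info[:2] == (3, 11),
        "torch": None,
        "torch_cuda": None,
        "cuda_available": False,
    }
    try:
        import torch
    except ImportError:
        return info  # PyTorch is not installed yet
    info["torch"] = torch.__version__
    info["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Run it inside the activated cosyvoice environment; on a machine with a working NVIDIA driver and the cu121 wheels, cuda_available should be True and torch_cuda should start with 12.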

Then download the models:

# Download the models with git; make sure git lfs is installed first
mkdir pretrained_models
git clone https://www.modelscope.cn/iic/CosyVoice-300M.git pretrained_models/CosyVoice-300M
git clone https://www.modelscope.cn/iic/CosyVoice-300M-SFT.git pretrained_models/CosyVoice-300M-SFT
git clone https://www.modelscope.cn/iic/CosyVoice-300M-Instruct.git pretrained_models/CosyVoice-300M-Instruct
git clone https://www.modelscope.cn/speech_tts/speech_kantts_ttsfrd.git pretrained_models/speech_kantts_ttsfrd

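Because the models are hosted on ModelScope, a China-based model repository, downloads from within China are very fast.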
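If git lfs was not active during cloning, the large weight files end up as small text pointer files instead of real checkpoints. The sketch below (hypothetical helpers missing_models and lfs_pointer_files, standard library only) checks that the four directories cloned above exist and flags any file that is still an LFS pointer; such pointer files begin with the line "version https://git-lfs.github.com/spec/v1".

```python
from pathlib import Path

# The four directories cloned by the git commands above.
EXPECTED_MODELS = (
    "CosyVoice-300M",
    "CosyVoice-300M-SFT",
    "CosyVoice-300M-Instruct",
    "speech_kantts_ttsfrd",
)

# First line of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def missing_models(root="pretrained_models"):
    """Names of expected model directories that are absent or empty under root."""
    base = Path(root)
    return [
        name
        for name in EXPECTED_MODELS
        if not (base / name).is_dir() or not any((base / name).iterdir())
    ]


def lfs_pointer_files(root="pretrained_models"):
    """Paths of files under root that are still LFS pointers, not real weights."""
    pointers = []
    for path in Path(root).rglob("*"):
        if ".git" in path.parts:  # skip git's own metadata
            continue
        # Pointer files are tiny text files, so only small files are inspected.
        if path.is_file() and path.stat().st_size < 1024:
            with path.open("rb") as fh:
                if fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                    pointers.append(str(path))
    return pointers
```

Run from the project root; an empty list from both functions suggests the downloads are complete. If pointer files show up, running git lfs pull inside the affected model directory is the usual fix.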

Finally, set the environment variable so that Python can find the bundled third-party modules:

set PYTHONPATH=third_party/AcademiCodec;third_party/Matcha-TTS
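The set command only lasts for the current Command Prompt session, and the path separator differs by OS (';' on Windows, ':' on macOS). As an alternative sketch, not part of the project, the same two directories can be added to sys.path from Python itself before cosyvoice is imported; the function names below are hypothetical, and the script is assumed to run from the project root.

```python
import os
import sys

# The same two directories as in the PYTHONPATH line above.
THIRD_PARTY_DIRS = ("third_party/AcademiCodec", "third_party/Matcha-TTS")


def pythonpath_value(project_root="."):
    """PYTHONPATH string joined with this OS's separator (';' on Windows, ':' elsewhere)."""
    return os.pathsep.join(os.path.join(project_root, d) for d in THIRD_PARTY_DIRS)


def add_third_party_to_sys_path(project_root="."):
    """In-process alternative to PYTHONPATH: put both directories at the front of sys.path."""
    # Insert in reverse so AcademiCodec ends up first, matching the PYTHONPATH order.
    for d in reversed(THIRD_PARTY_DIRS):
        path = os.path.abspath(os.path.join(project_root, d))
        if path not in sys.path:
            sys.path.insert(0, path)
```

Calling add_third_party_to_sys_path() at the top of a script, before the from cosyvoice... imports, has the same effect as the environment variable but works the same way on both platforms.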

Basic usage:

# Run from the project root with PYTHONPATH set as above
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav
import torchaudio

# SFT usage: synthesize with a built-in speaker (models were downloaded to pretrained_models/)
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT')
print(cosyvoice.list_avaliable_spks())  # the method name is spelled this way in CosyVoice
# '中文女' is the built-in Chinese female speaker
output = cosyvoice.inference_sft('你好，我是通義生成式語音大模型，請問有什麼可以幫您的嗎？', '中文女')
torchaudio.save('sft.wav', output['tts_speech'], 22050)  # saved at 22,050 Hz

# Zero-shot usage: clone the voice of a prompt recording loaded at 16 kHz
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M')
prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)
# Arguments: text to synthesize, transcript of the prompt audio, prompt audio
output = cosyvoice.inference_zero_shot('收到好友從遠方寄來的生日禮物，那份意外的驚喜與深深的祝福讓我心中充滿了甜蜜的快樂，笑容如花兒般綻放。', '希望你以後能夠做的比我還好呦。', prompt_speech_16k)
torchaudio.save('zero_shot.wav', output['tts_speech'], 22050)

# Cross-lingual usage: <|en|> tags the language of the text; no prompt transcript is passed
prompt_speech_16k = load_wav('cross_lingual_prompt.wav', 16000)
output = cosyvoice.inference_cross_lingual('<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that\'s coming into the family is a reason why sometimes we don\'t buy the whole thing.', prompt_speech_16k)
torchaudio.save('cross_lingual.wav', output['tts_speech'], 22050)

# Instruct usage: '中文男' is the built-in Chinese male speaker, <strong> marks emphasized words,
# and the last argument describes the speaker's persona and style in natural language
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-Instruct')
output = cosyvoice.inference_instruct('在面對挑戰時，他展現了非凡的<strong>勇氣</strong>與<strong>智慧</strong>。', '中文男', 'Theo \'Crimson\', is a fiery, passionate rebel leader. Fights with fervor for justice, but struggles with impulsiveness.')
torchaudio.save('instruct.wav', output['tts_speech'], 22050)
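Each inference method in the example is called on a specific checkpoint: inference_sft on CosyVoice-300M-SFT, inference_zero_shot and inference_cross_lingual on the base CosyVoice-300M, and inference_instruct on CosyVoice-300M-Instruct. The lookup table below simply encodes that pairing from the example; MODEL_FOR_METHOD and model_dir_for are hypothetical names, and the pretrained_models root assumes the download layout used above.

```python
# Checkpoint each inference method is called on in the example above.
MODEL_FOR_METHOD = {
    "inference_sft": "CosyVoice-300M-SFT",
    "inference_zero_shot": "CosyVoice-300M",
    "inference_cross_lingual": "CosyVoice-300M",
    "inference_instruct": "CosyVoice-300M-Instruct",
}


def model_dir_for(method, root="pretrained_models"):
    """Model directory to pass to CosyVoice(...) before calling the given method."""
    if method not in MODEL_FOR_METHOD:
        raise ValueError(f"unknown inference method: {method!r}")
    return f"{root}/{MODEL_FOR_METHOD[method]}"
```

Usage: cosyvoice = CosyVoice(model_dir_for("inference_zero_shot")) loads the base model before calling inference_zero_shot.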

The web UI is recommended here, since it is more intuitive and convenient:

python webui.py --port 9886 --model_dir ./pretrained_models/CosyVoice-300M

Then open http://localhost:9886 in a browser.
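Loading the model can take a while, so the page may not answer right after launch. The sketch below polls the URL with the standard library until the server responds or a timeout passes; wait_for_webui is a hypothetical helper, and its default URL assumes the --port 9886 setting above.

```python
import time
import urllib.error
import urllib.request


def wait_for_webui(url="http://localhost:9886", timeout=60.0, interval=1.0):
    """Poll url until the server answers or timeout seconds pass; return True if it answered."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            return True  # an HTTP error status still means the server is up
        except OSError:
            pass  # connection refused or timed out (URLError is an OSError subclass)
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For the macOS setup, pass the other port, e.g. wait_for_webui("http://localhost:50000").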

Note that the official code uses sox as torchaudio's audio backend; this port switches it to soundfile:

torchaudio.set_audio_backend('soundfile')

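There may still be some bugs, so keep an eye on updates to the official project. Also note that in torchaudio 2.1 and later, set_audio_backend is deprecated and no longer changes the backend; there, a backend can be requested per call instead, e.g. torchaudio.save('sft.wav', output['tts_speech'], 22050, backend='soundfile').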

macOS Platform

Now for macOS. Again, start by cloning the project:

git clone https://github.com/v3ucn/CosyVoice_for_MacOs.git

Install the dependencies:

cd CosyVoice_for_MacOs  
conda create -n cosyvoice python=3.8  
conda activate cosyvoice  
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

Next, install sox via Homebrew:

brew install sox
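To confirm that Homebrew put sox on the PATH, the sketch below (hypothetical helper sox_version, standard library only) runs sox --version and returns its first line. It only checks the command-line tool; it does not prove that torchaudio can load the sox library.

```python
import shutil
import subprocess


def sox_version():
    """First line of `sox --version`, or None if no sox executable is on PATH."""
    exe = shutil.which("sox")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True, check=False)
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    print(sox_version() or "sox not found on PATH")
```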

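That completes the setup, but don't forget to set the environment variable (on macOS the paths are separated by ':' rather than the ';' used on Windows):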

export PYTHONPATH=third_party/AcademiCodec:third_party/Matcha-TTS

Usage is the same as on Windows.

Again, the web UI is recommended:

python3 webui.py --port 50000 --model_dir speech_tts/CosyVoice-300M

Then open http://localhost:50000 in a browser.

Conclusion

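To be fair, CosyVoice lives up to its big-lab pedigree: the model quality speaks for itself and represents the top tier of AI in China, and Tongyi Lab fully deserves its reputation. Of course, it would be even better if the engineering-optimized code were open-sourced as well; I believe that with libtorch optimization, this model would become the obvious choice for open-source TTS.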
