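Alibaba's Tongyi Lab recently open-sourced the CosyVoice speech model. It supports natural speech generation with control over language, timbre, and emotion, and performs strongly in multilingual speech generation, zero-shot speech generation, cross-lingual voice synthesis, and instruction following.
CosyVoice was trained on more than 150,000 hours of data in total and can synthesize five languages: Chinese, English, Japanese, Cantonese, and Korean. Its output is noticeably better than that of traditional speech synthesis models.
CosyVoice supports one-shot voice cloning: only 3 to 10 seconds of source audio are needed to generate a matching voice, including details such as prosody and emotion. It also performs well in cross-lingual synthesis.
Because the official release does not yet support Windows or macOS, this post walks through deploying CosyVoice locally on both platforms.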
Windows
Starting on Windows, clone the project:
git clone https://github.com/v3ucn/CosyVoice_For_Windows
Enter the project directory:
cd CosyVoice_For_Windows
Fetch the bundled submodules:
git submodule update --init --recursive
Then install the dependencies:
conda create -n cosyvoice python=3.11
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
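The officially recommended Python version is 3.8, but 3.11 also runs in practice, and in theory 3.11 performs better.
Next, download the Windows build of DeepSpeed from the release page below and install it: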
https://github.com/S95Sedan/Deepspeed-Windows/releases/tag/v14.0%2Bpy311
Finally, install the GPU build of torch:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
The CUDA 12 build (cu121) is used here; a CUDA 11 build also works.
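After the install, it can help to confirm which torch build actually ended up in the environment. The following is a minimal sketch, not part of the project: the helper name `describe_torch` is made up for illustration, and it only relies on `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`.

```python
import importlib.util


def describe_torch() -> str:
    """Summarise the installed torch build, or report that torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch  # imported lazily so a missing install is reported, not raised

    cuda_build = torch.version.cuda  # None for CPU-only wheels
    gpu_visible = torch.cuda.is_available()
    return (
        f"torch {torch.__version__}, "
        f"CUDA build: {cuda_build}, GPU visible: {gpu_visible}"
    )


print(describe_torch())
```

If the CUDA build is reported as None, a CPU-only wheel was probably installed (for example by requirements.txt), and rerunning the pip command above should replace it.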
Next, download the models:
# Download the models with git; make sure git lfs is installed first
mkdir pretrained_models
git clone https://www.modelscope.cn/iic/CosyVoice-300M.git pretrained_models/CosyVoice-300M
git clone https://www.modelscope.cn/iic/CosyVoice-300M-SFT.git pretrained_models/CosyVoice-300M-SFT
git clone https://www.modelscope.cn/iic/CosyVoice-300M-Instruct.git pretrained_models/CosyVoice-300M-Instruct
git clone https://www.modelscope.cn/speech_tts/speech_kantts_ttsfrd.git pretrained_models/speech_kantts_ttsfrd
Because these come from ModelScope, which is hosted in China, the download is very fast there.
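If git lfs was not installed when the repositories were cloned, the large weight files are left as small text stubs instead of real data. Below is a rough sketch of how to spot that; the helper name `find_lfs_pointers` is hypothetical, and the check relies only on the fact that Git LFS pointer files begin with a fixed `version https://git-lfs.github.com/spec/v1` line.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this exact line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(root):
    """Return files under root that are still Git LFS pointer stubs."""
    root = Path(root)
    stubs = []
    for path in sorted(root.rglob("*")):
        # Skip directories and anything inside a repository's .git folder.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as handle:
            head = handle.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            stubs.append(path)
    return stubs


if __name__ == "__main__":
    for stub in find_lfs_pointers("pretrained_models"):
        print("not downloaded yet:", stub)
```

Any file it lists can usually be fetched by running `git lfs pull` inside the corresponding model directory.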
Finally, set the environment variable:
set PYTHONPATH=third_party/AcademiCodec;third_party/Matcha-TTS
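The variable only lasts for the current terminal session. As an alternative sketch, the same two directories can be added from inside a script before importing cosyvoice; the helper name `add_third_party_paths` is made up here, and it assumes the script runs from (or is given) the repository root.

```python
import sys
from pathlib import Path

# The same two directories that the PYTHONPATH setting above points at.
THIRD_PARTY_DIRS = ("third_party/AcademiCodec", "third_party/Matcha-TTS")


def add_third_party_paths(repo_root="."):
    """Prepend the bundled third-party packages to sys.path, once each."""
    added = []
    for relative in THIRD_PARTY_DIRS:
        full = str(Path(repo_root, relative).resolve())
        if full not in sys.path:
            sys.path.insert(0, full)
            added.append(full)
    return added


if __name__ == "__main__":
    print(add_third_party_paths("."))
```

Calling it at the top of a script, before `from cosyvoice.cli.cosyvoice import CosyVoice`, should have the same effect as the shell setting.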
Basic usage:
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav
import torchaudio
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT')
# sft usage
print(cosyvoice.list_avaliable_spks())
output = cosyvoice.inference_sft('你好,我是通義生成式語音大模型,請問有什麼可以幫您的嗎?', '中文女')
torchaudio.save('sft.wav', output['tts_speech'], 22050)
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M')
# zero_shot usage
prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)
output = cosyvoice.inference_zero_shot('收到好友從遠方寄來的生日禮物,那份意外的驚喜與深深的祝福讓我心中充滿了甜蜜的快樂,笑容如花兒般綻放。', '希望你以後能夠做的比我還好呦。', prompt_speech_16k)
torchaudio.save('zero_shot.wav', output['tts_speech'], 22050)
# cross_lingual usage
prompt_speech_16k = load_wav('cross_lingual_prompt.wav', 16000)
output = cosyvoice.inference_cross_lingual('<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that\'s coming into the family is a reason why sometimes we don\'t buy the whole thing.', prompt_speech_16k)
torchaudio.save('cross_lingual.wav', output['tts_speech'], 22050)
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-Instruct')
# instruct usage
output = cosyvoice.inference_instruct('在面對挑戰時,他展現了非凡的<strong>勇氣</strong>與<strong>智慧</strong>。', '中文男', 'Theo \'Crimson\', is a fiery, passionate rebel leader. Fights with fervor for justice, but struggles with impulsiveness.')
torchaudio.save('instruct.wav', output['tts_speech'], 22050)
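The zero-shot and cross-lingual calls above depend on a prompt recording, and the introduction quotes 3 to 10 seconds as the useful length. The sketch below checks a prompt file against that range using only the standard library; the helper names are hypothetical, and the `wave` module reads uncompressed PCM WAV files only.

```python
import wave

# Range quoted in the introduction for voice-cloning prompts.
MIN_SECONDS, MAX_SECONDS = 3.0, 10.0


def prompt_duration_seconds(path):
    """Return the length of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def prompt_length_ok(path):
    """True when the recording falls inside the 3 to 10 second range."""
    return MIN_SECONDS <= prompt_duration_seconds(path) <= MAX_SECONDS
```

For example, `prompt_length_ok('zero_shot_prompt.wav')` could be called before `load_wav` to catch clips that are too short or too long.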
The web UI is recommended here, since it is more intuitive and convenient:
python webui.py --port 9886 --model_dir ./pretrained_models/CosyVoice-300M
Then open http://localhost:9886
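If the page does not load, a quick way to tell whether the server has started listening is to try a plain TCP connection to the port. This is a generic sketch with a made-up helper name; it says nothing about whether the model itself loaded correctly.

```python
import socket


def port_is_open(host="localhost", port=9886, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    print("webui reachable:", port_is_open())
```

Loading the model can take a while, so the port may only open some time after the command is launched.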
Note that the official project uses sox as the torchaudio backend; this port switches it to soundfile:
torchaudio.set_audio_backend('soundfile')
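One caveat, stated as an assumption about the installed version: newer torchaudio releases may deprecate or no longer expose `set_audio_backend`. A defensive sketch is to call it only when it exists; the helper name `select_soundfile_backend` is made up for illustration.

```python
def select_soundfile_backend(torchaudio_module):
    """Call set_audio_backend('soundfile') if this torchaudio still exposes it.

    Returns True when the call was made and False when the function is absent.
    """
    setter = getattr(torchaudio_module, "set_audio_backend", None)
    if setter is None:
        return False
    setter("soundfile")
    return True


# Typical use, assuming torchaudio is installed:
#   import torchaudio
#   select_soundfile_backend(torchaudio)
```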
There may be some bugs, so keep an eye on the official project for updates.
macOS
Now for macOS. Again, start by cloning the project:
git clone https://github.com/v3ucn/CosyVoice_for_MacOs.git
Install the dependencies:
cd CosyVoice_for_MacOs
conda create -n cosyvoice python=3.8
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
Then install sox through Homebrew:
brew install sox
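To confirm that the sox binary is actually on the PATH after the Homebrew install, a small standard-library check is enough. The helper name `missing_tools` is hypothetical.

```python
import shutil


def missing_tools(names=("sox",)):
    """Return the command names from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", missing if missing else "nothing")
```

The same function can be given other names, for example `missing_tools(("git", "git-lfs", "sox"))`, to check several prerequisites at once.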
That completes the setup, but do not forget to set the environment variable:
export PYTHONPATH=third_party/AcademiCodec:third_party/Matcha-TTS
Usage is the same as in the Windows version.
The web UI is again the recommended option:
python3 webui.py --port 50000 --model_dir speech_tts/CosyVoice-300M
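Then open http://localhost:50000
Conclusion
In fairness, CosyVoice lives up to its big-company origins. The model quality is excellent and represents the top level of AI work in China, and Tongyi Lab's reputation is well earned. It would be even better if the production-engineered code were open-sourced as well; I believe that with libtorch optimization this model will become the go-to choice for open-source TTS.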