Hi everyone, I am Pizi Heng, a guy who takes technology seriously. Today I'd like to introduce the speech recognition (ASR) implementation in the birth of the speech-processing tool Jays-PySPEECH.
Speech recognition is the core feature of Jays-PySPEECH. It is built on top of the SpeechRecognition library together with the CMU Sphinx engine. In this post I'll walk through how speech recognition is implemented in Jays-PySPEECH.
1. Overview of the SpeechRecognition library
SpeechRecognition is a Python library for performing speech recognition. It was created by Anthony Zhang (Uberi) and has been actively maintained since its first release in 2014; Jays-PySPEECH uses SpeechRecognition 3.8.1.
The official pages of SpeechRecognition are:
- SpeechRecognition homepage: https://github.com/Uberi/speech_recognition
- SpeechRecognition installation: https://pypi.org/project/SpeechRecognition/
SpeechRecognition does not perform speech recognition by itself; instead it delegates to third-party recognition engines. It supports quite a few of them, eight in total:
- CMU Sphinx (works offline)
- Google Speech Recognition
- Google Cloud Speech API
- Wit.ai
- Microsoft Bing Voice Recognition
- Houndify API
- IBM Speech to Text
- Snowboy Hotword Detection (works offline)
Whichever engine you choose, the calling interface in SpeechRecognition stays the same. To see how SpeechRecognition is used, let's look at audio_transcribe.py, an example that transcribes an audio file into text. An excerpt of audio_transcribe.py follows:
import speech_recognition as sr

# Point at the audio file to transcribe (english.wav)
from os import path
AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "english.wav")

# Create a SpeechRecognition Recognizer object and read the data from the audio file (english.wav)
r = sr.Recognizer()
with sr.AudioFile(AUDIO_FILE) as source:
    audio = r.record(source)  # read the entire audio file

# Recognize the audio with the CMU Sphinx engine
try:
    print("Sphinx thinks you said " + r.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Sphinx could not understand audio")
except sr.RequestError as e:
    print("Sphinx error; {0}".format(e))

# Recognize the audio with the Microsoft Bing Voice Recognition engine
BING_KEY = "INSERT BING API KEY HERE"  # Microsoft Bing Voice Recognition API keys are 32-character lowercase hexadecimal strings
try:
    print("Microsoft Bing Voice Recognition thinks you said " + r.recognize_bing(audio, key=BING_KEY))
except sr.UnknownValueError:
    print("Microsoft Bing Voice Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Microsoft Bing Voice Recognition service; {0}".format(e))

# Recognize the audio with other engines
# ... ...
Doesn't SpeechRecognition look remarkably easy to use? It does, and that is exactly where its strength lies. More examples can be found at https://github.com/Uberi/speech_recognition/tree/master/examples.
1.1 Choosing the CMU Sphinx engine
As I mentioned above, SpeechRecognition does not recognize speech by itself, so we need to install a recognition engine for it. For Jays-PySPEECH I chose CMU Sphinx, which can work offline.
CMU Sphinx is an open-source speech recognition engine developed at Carnegie Mellon University. It works offline and supports multiple languages (English, Chinese, French, and more). Its official pages are:
- CMU Sphinx homepage: https://cmusphinx.github.io/
- CMU Sphinx downloads: https://sourceforge.net/projects/cmusphinx/
Since Jays-PySPEECH is developed in Python, we cannot use CMU Sphinx directly. What to do? No worries: Dmitry Prazdnichnov wrote a Python binding for CMU Sphinx, namely PocketSphinx, whose official pages are:
- PocketSphinx homepage: https://github.com/bambocher/pocketsphinx-python
- PocketSphinx installation: https://pypi.org/project/pocketsphinx/
We already installed SpeechRecognition and PocketSphinx in the first post of this Jays-PySPEECH series, which covered setting up the environment; on my machine they live under C:\tools_mcu\Python27\Lib\site-packages, in \speech_recognition and \pocketsphinx respectively. With these two packages installed, the engine is in place.
1.2 Adding a Mandarin language pack to PocketSphinx
By default, PocketSphinx only recognizes US English; the C:\tools_mcu\Python27\Lib\site-packages\speech_recognition\pocketsphinx-data directory contains nothing but an en-US folder. Let's look at what is inside that folder first:
\pocketsphinx-data\en-US
    \acoustic-model                    -- acoustic model
        \feat.params                   -- feature parameters of the HMM
        \mdef                          -- model definition file
        \means                         -- means of the Gaussian mixture models
        \mixture_weights               -- mixture weights
        \noisedict                     -- noise dictionary, i.e. non-speech words
        \sendump                       -- mixture weights dumped from the acoustic model
        \transition_matrices           -- HMM state transition matrices
        \variances                     -- variances of the Gaussian mixture models
    \language-model.lm.bin             -- language model
    \pronounciation-dictionary.dict    -- pronunciation dictionary
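A new language pack has to reproduce the top-level layout above: an acoustic-model folder plus the language-model and pronunciation-dictionary files (note that upstream really does spell it "pronounciation"). As a sanity check, here is a small sketch of mine, not part of Jays-PySPEECH, that reports which of those required entries are missing from a language folder; the helper name missing_entries is hypothetical:

```python
import os

# Top-level entries a pocketsphinx-data language folder is expected to contain,
# per the en-US layout above.
REQUIRED_ENTRIES = (
    "acoustic-model",
    "language-model.lm.bin",
    "pronounciation-dictionary.dict",
)

def missing_entries(lang_dir):
    """Return the required entries absent from lang_dir (all of them if it doesn't exist)."""
    present = set(os.listdir(lang_dir)) if os.path.isdir(lang_dir) else set()
    return [name for name in REQUIRED_ENTRIES if name not in present]
```

Running missing_entries() against the en-US folder of a default install should come back empty; running it against a half-built zh-CN folder tells you what is still to do.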
This pile of files may look a little cryptic; it relates to how the CMU Sphinx engine recognizes speech internally, which we won't dig into here. As API users, all we need to know is how to add another language pack (e.g. a Mandarin one) to CMU Sphinx.
To add another language we first need its language pack data. The CMU Sphinx homepage offers downloads for 12 mainstream language packs at https://sourceforge.net/projects/cmusphinx/files/Acoustic_and_Language_Models/. Since Jays-PySPEECH needs to recognize Mandarin, we download the following three files under \Mandarin:
\Mandarin
    \zh_broadcastnews_16k_ptm256_8000.tar.bz2    -- acoustic model
    \zh_broadcastnews_64000_utf8.DMP             -- language model
    \zh_broadcastnews_utf8.dic                   -- pronunciation dictionary
With the Mandarin pack data in hand, we follow the steps described in "Notes on using PocketSphinx", which I summarize as:
- Create a zh-CN folder under the \speech_recognition\pocketsphinx-data directory
- Extract zh_broadcastnews_16k_ptm256_8000.tar.bz2 and put all the extracted files into the \zh-CN\acoustic-model folder
- Rename zh_broadcastnews_utf8.dic to pronounciation-dictionary.dict and put it in the \zh-CN folder
- Use the SphinxBase tools to convert zh_broadcastnews_64000_utf8.DMP into language-model.lm.bin and put it in the \zh-CN folder
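Assuming the three Mandarin files have been downloaded and the .DMP has already been converted to language-model.lm.bin (step 4, covered below), steps 1 to 3 plus the final copy can be sketched in Python. This is my own illustration, not Jays-PySPEECH code, and install_mandarin_pack with its parameters are hypothetical names:

```python
import os
import shutil
import tarfile

def install_mandarin_pack(acoustic_tarball, dic_file, lm_bin_file, pocketsphinx_data_dir):
    """Lay out a zh-CN language pack in the structure PocketSphinx expects."""
    zh_dir = os.path.join(pocketsphinx_data_dir, "zh-CN")
    acoustic_dir = os.path.join(zh_dir, "acoustic-model")
    # Step 1: create the zh-CN folder (and the acoustic-model folder inside it)
    if not os.path.isdir(acoustic_dir):
        os.makedirs(acoustic_dir)
    # Step 2: extract the acoustic model tarball into \zh-CN\acoustic-model
    with tarfile.open(acoustic_tarball, "r:bz2") as tar:
        tar.extractall(acoustic_dir)
    # Step 3: store the dictionary under the name PocketSphinx looks for
    shutil.copy(dic_file, os.path.join(zh_dir, "pronounciation-dictionary.dict"))
    # Step 4 result: drop in the already-converted language model
    shutil.copy(lm_bin_file, os.path.join(zh_dir, "language-model.lm.bin"))
    return zh_dir
```

Doing this by hand, as described above, works just as well; the sketch only pins down the exact target names and layout.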
For the SphinxBase tools mentioned in step 4, download the source from https://github.com/cmusphinx/sphinxbase, open the \sphinxbase\sphinxbase.sln solution with Visual Studio 2010 (or later) and do a Rebuild All; the following six tools are then generated under \sphinxbase\bin\Release\x64:
\sphinxbase\bin\Release\x64
\sphinx_cepview.exe
\sphinx_fe.exe
\sphinx_jsgf2fsg.exe
\sphinx_lm_convert.exe
\sphinx_pitch.exe
\sphinx_seg.exe
We mainly use sphinx_lm_convert.exe to do the conversion and generate language-model.lm.bin. It takes two runs: first convert the DMP file to the ARPA text format, then convert that to the binary format. The commands are:
PS C:\tools_mcu\sphinxbase\bin\Release\x64> .\sphinx_lm_convert.exe -i .\zh_broadcastnews_64000_utf8.DMP -o language-model.lm -ofmt arpa
Current configuration:
[NAME]      [DEFLT]     [VALUE]
-case
-help       no          no
-i                      .\zh_broadcastnews_64000_utf8.DMP
-ifmt
-logbase    1.0001      1.000100e+00
-mmap       no          no
-o                      language-model.lm
-ofmt                   arpa

INFO: ngram_model_trie.c(354): Trying to read LM in trie binary format
INFO: ngram_model_trie.c(365): Header doesn't match
INFO: ngram_model_trie.c(177): Trying to read LM in arpa format
INFO: ngram_model_trie.c(70): No \data\ mark in LM file
INFO: ngram_model_trie.c(445): Trying to read LM in dmp format
INFO: ngram_model_trie.c(527): ngrams 1=63944, 2=16600781, 3=20708460
INFO: lm_trie.c(474): Training quantizer
INFO: lm_trie.c(482): Building LM trie
PS C:\tools_mcu\sphinxbase\bin\Release\x64> .\sphinx_lm_convert.exe -i .\language-model.lm -o language-model.lm.bin
Current configuration:
[NAME]      [DEFLT]     [VALUE]
-case
-help       no          no
-i                      .\language-model.lm
-ifmt
-logbase    1.0001      1.000100e+00
-mmap       no          no
-o                      language-model.lm.bin
-ofmt

INFO: ngram_model_trie.c(354): Trying to read LM in trie binary format
INFO: ngram_model_trie.c(365): Header doesn't match
INFO: ngram_model_trie.c(177): Trying to read LM in arpa format
INFO: ngram_model_trie.c(193): LM of order 3
INFO: ngram_model_trie.c(195): #1-grams: 63944
INFO: ngram_model_trie.c(195): #2-grams: 16600781
INFO: ngram_model_trie.c(195): #3-grams: 20708460
INFO: lm_trie.c(474): Training quantizer
INFO: lm_trie.c(482): Building LM trie
2. The ASR implementation in Jays-PySPEECH
The ASR code itself is straightforward: we simply call the APIs in speech_recognition. At the moment only the CMU Sphinx engine is wired up, and only English and Mandarin are supported. In Jays-PySPEECH this boils down to implementing audioSpeechRecognition(), the callback of the "ASR" button in the GUI: once the user has selected the configuration parameters (language type and ASR engine type) and clicks the "ASR" button, audioSpeechRecognition() is triggered. The code is as follows:
import speech_recognition

class mainWin(win.speech_win):

    def getLanguageSelection(self):
        languageType = self.m_choice_lang.GetString(self.m_choice_lang.GetSelection())
        if languageType == 'Mandarin Chinese':
            languageType = 'zh-CN'
            languageName = 'Chinese'
        else:  # languageType == 'US English'
            languageType = 'en-US'
            languageName = 'English'
        return languageType, languageName

    def audioSpeechRecognition(self, event):
        if os.path.isfile(self.wavPath):
            # Create the speech_recognition recognizer object asrObj
            asrObj = speech_recognition.Recognizer()
            # Read the speech data from the wav file
            with speech_recognition.AudioFile(self.wavPath) as source:
                speechAudio = asrObj.record(source)
            self.m_textCtrl_asrttsText.Clear()
            # Get the selected speech language (English/Chinese)
            languageType, languageName = self.getLanguageSelection()
            engineType = self.m_choice_asrEngine.GetString(self.m_choice_asrEngine.GetSelection())
            if engineType == 'CMU Sphinx':
                try:
                    # Call recognize_sphinx to do the actual recognition
                    speechText = asrObj.recognize_sphinx(speechAudio, language=languageType)
                    # Show the recognition result in the asrttsText text box
                    self.m_textCtrl_asrttsText.write(speechText)
                    self.statusBar.SetStatusText("ASR Conversation Info: Successfully")
                    # Save the recognition result into the specified file
                    fileName = self.m_textCtrl_asrFileName.GetLineText(0)
                    if fileName == '':
                        fileName = 'asr_untitled1.txt'
                    asrFilePath = os.path.join(os.path.dirname(os.path.abspath(os.path.dirname(__file__))), 'conv', 'asr', fileName)
                    asrFileObj = open(asrFilePath, 'wb')
                    asrFileObj.write(speechText)
                    asrFileObj.close()
                except speech_recognition.UnknownValueError:
                    self.statusBar.SetStatusText("ASR Conversation Info: Sphinx could not understand audio")
                except speech_recognition.RequestError as e:
                    self.statusBar.SetStatusText("ASR Conversation Info: Sphinx error; {0}".format(e))
            else:
                self.statusBar.SetStatusText("ASR Conversation Info: Unavailable ASR Engine")
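The language mapping inside getLanguageSelection() is easy to pull out and check as a pure function, independent of the wxPython GUI. This little sketch of mine mirrors the two choices Jays-PySPEECH currently offers:

```python
def map_language(choice):
    """Map a GUI language selection to (PocketSphinx language tag, display name).

    Mirrors getLanguageSelection() above: anything other than 'Mandarin Chinese'
    falls through to US English, exactly like the original else branch.
    """
    if choice == 'Mandarin Chinese':
        return 'zh-CN', 'Chinese'
    return 'en-US', 'English'
```

The returned tag ('zh-CN' or 'en-US') is what recognize_sphinx() receives as its language argument, which is why it must match a folder name under pocketsphinx-data.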
That wraps up my introduction to the speech recognition implementation in the birth of the speech-processing tool Jays-PySPEECH. Where's the applause~~~