funasr

Posted by lightsong on 2024-10-16

https://www.funasr.com/#/

https://github.com/modelscope/FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.

FunASR hopes to build a bridge between academic research and industrial applications on speech recognition. By supporting the training & finetuning of the industrial-grade speech recognition model, researchers and developers can conduct research and production of speech recognition models more conveniently, and promote the development of speech recognition ecology. ASR for Fun!

  • FunASR is a fundamental speech recognition toolkit that offers a variety of features, including speech recognition (ASR), Voice Activity Detection (VAD), Punctuation Restoration, Language Models, Speaker Verification, Speaker Diarization and multi-talker ASR. FunASR provides convenient scripts and tutorials, supporting inference and fine-tuning of pre-trained models.
  • We have released a vast collection of academic and industrial pretrained models on ModelScope and Hugging Face, which can be accessed through our Model Zoo. The representative Paraformer-large, a non-autoregressive end-to-end speech recognition model, has the advantages of high accuracy, high efficiency, and convenient deployment, supporting the rapid construction of speech recognition services. For more details on service deployment, please refer to the service deployment document.
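The pre-trained models above are loaded through FunASR's `AutoModel` interface. A minimal sketch, assuming `funasr` is installed (`pip install funasr`) and using model names from the ModelScope model zoo; the audio path is a placeholder, and the import is deferred so the sketch can be read without the library present:

```python
def transcribe(wav_path: str) -> str:
    """Sketch of FunASR AutoModel inference, per the FunASR README.

    Combines the Paraformer ASR model with a VAD front-end and a
    punctuation model, as described in the feature list above.
    """
    from funasr import AutoModel  # deferred: funasr may not be installed

    model = AutoModel(
        model="paraformer-zh",   # Paraformer-large, non-autoregressive ASR
        vad_model="fsmn-vad",    # VAD splits long audio into segments
        punc_model="ct-punc",    # punctuation restoration on the transcript
    )
    result = model.generate(input=wav_path)  # returns a list of result dicts
    return result[0]["text"]
```

On first use the models are downloaded automatically from ModelScope, so the first call is slow; subsequent calls reuse the local cache.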

https://cloud.baidu.com/article/3347080

3. Multi-language support

As globalization advances, multi-language support has become an essential capability for speech recognition technology. FunASR supports Chinese, English, Japanese, and other mainstream languages, and can be customized to meet the speech recognition needs of different countries and regions. This makes FunASR broadly applicable in settings such as multinational enterprises and international communication.

demo

https://www.funasr.com/static/offline/index.html

tutorial

https://www.cnblogs.com/LaoDie1/p/18183024

https://www.cnblogs.com/v3ucn/p/17956926

VAD

https://blog.ailemon.net/2021/02/18/introduction-to-vad-theory/

First, let's clarify the basic concept: a Voice Activity Detection (VAD, also called Voice Activation Detection) algorithm detects whether the current audio signal contains human speech. By analyzing the input signal, it separates speech segments from segments of background noise, so that the two kinds of signal can be handled with different processing methods.

There are many feature extraction methods for VAD. The simplest and most direct is energy-based: measure the short-time energy (STE) and the short-time zero-crossing rate (ZCC, zero-crossing counter). Short-time energy is the energy of one frame of the speech signal; the zero-crossing rate is the number of times the time-domain signal within a frame crosses zero (the time axis). In general, a high-accuracy VAD combines several kinds of features for its decision, such as energy-based, frequency-domain, cepstral, harmonic, and long-term features [1]. The final speech/non-speech decision is then made by comparing against thresholds, or by statistical or machine-learning methods.
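The energy-based scheme above can be sketched in a few lines. This is an illustrative toy, not FunASR's actual VAD (FunASR ships a neural FSMN-VAD model): it frames the signal, computes STE and ZCC per frame, and flags a frame as speech when energy is high and the zero-crossing count is below a noise-like range. The thresholds and the synthetic signals are assumptions chosen for the demo:

```python
import math

def frame_features(signal, frame_len=160):
    """Split a signal into non-overlapping frames and compute
    (short-time energy, zero-crossing count) per frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        ste = sum(x * x for x in frame)                          # energy
        zcc = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
        feats.append((ste, zcc))
    return feats

def simple_vad(signal, frame_len=160, ste_thresh=1.0, zcc_thresh=40):
    """Flag each frame as speech (True) when its energy exceeds the
    threshold and its zero-crossing count is below the noise-like range."""
    return [ste > ste_thresh and zcc < zcc_thresh
            for ste, zcc in frame_features(signal, frame_len)]

# Demo: one frame of quiet high-frequency noise, then one frame of a
# loud low-frequency tone (stand-in for voiced speech), at fs = 8 kHz.
fs = 8000
noise = [0.001 * math.sin(2 * math.pi * 3000 * n / fs) for n in range(160)]
tone = [0.5 * math.sin(2 * math.pi * 200 * n / fs) for n in range(160)]
decisions = simple_vad(noise + tone)
print(decisions)  # → [False, True]
```

Real VADs add frame overlap, smoothing/hangover logic, and adaptive thresholds; as the paragraph above notes, accurate systems combine many features or learn the decision statistically.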