A Must-Read for Voice Agent Developers: A Roundup of 2024's Cutting-Edge Speech Models

Published by the RTE Developer Community (RTE开发者社区) on 2024-12-04

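Today's recommendation is a GitHub repository created by our community member BoJack. If you are following Voice Agent development and want to know which speech models sit at the cutting edge, the list in this repository is well worth a look.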

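BoJack is pursuing a PhD at Shanghai Jiao Tong University, where his research covers speech multimodality, spoken interaction systems, and self-supervised pretraining. He is also one of the authors of two recently released models: the full-duplex speech model LSLM and the text-to-speech (TTS) model F5-TTS.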

Repository:
https://github.com/ddlBoJack/Awesome-Speech-Language-Model

Awesome-Speech-Language-Model

Papers, code, and resources for speech language models and end-to-end speech dialogue systems.

Universal Speech, Audio and Music Understanding

Model

  • LTU: Listen, Think, and Understand - ICLR 2024

https://arxiv.org/abs/2305.10790

  • SALMONN: Towards Generic Hearing Abilities for Large Language Models - ICLR 2024

https://arxiv.org/abs/2310.13289

  • LTU-AS: Joint Audio and Speech Understanding - ASRU 2023

https://arxiv.org/abs/2309.14405

  • Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models - arXiv 2023

https://arxiv.org/abs/2311.07919

  • Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities - ICML 2024

https://arxiv.org/abs/2402.01831

  • Qwen2-Audio Technical Report - arXiv 2024

https://arxiv.org/abs/2407.10759

  • WavLLM: Towards Robust and Adaptive Speech Large Language Model - EMNLP 2024

https://arxiv.org/abs/2404.00656

  • DiVA: Distilling an End-to-End Voice Assistant Without Instruction Training Data - arXiv 2024

https://arxiv.org/abs/2410.02678

Benchmark

  • Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech - ICASSP 2024

https://arxiv.org/abs/2309.09510

  • AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension - ACL 2024

https://arxiv.org/abs/2402.07729

  • SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words - arXiv 2024

https://arxiv.org/abs/2406.13340

  • AudioBench: A Universal Benchmark for Audio Large Language Models - arXiv 2024

https://arxiv.org/abs/2406.16020

  • SALMon: A Suite for Acoustic Language Model Evaluation - arXiv 2024

https://arxiv.org/abs/2409.07437

  • MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark - arXiv 2024

https://www.arxiv.org/abs/2410.19168

  • Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks - ICLR 2025 open review

https://openreview.net/forum?id=s7lzZpAW7T

End2End Speech Dialogue System

Model

  • SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities - EMNLP 2023

https://arxiv.org/abs/2305.11000

  • GPT-4o Voice Mode - API 2024

https://openai.com/index/hello-gpt-4o/

  • PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems - EMNLP 2024

  • VITA: Towards Open-Source Interactive Omni Multimodal LLM - arXiv 2024

https://www.arxiv.org/abs/2408.05211

  • Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming - arXiv 2024

https://arxiv.org/abs/2408.16725

  • LLaMA-Omni: Seamless Speech Interaction with Large Language Models - arXiv 2024

https://arxiv.org/abs/2409.06666

  • Moshi: a speech-text foundation model for real-time dialogue - arXiv 2024

https://arxiv.org/abs/2410.00037

  • Westlake-Omni - GitHub 2024

https://github.com/xinchen-ai/Westlake-Omni

  • EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions - arXiv 2024

https://arxiv.org/abs/2409.18042

  • IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities - arXiv 2024

https://arxiv.org/abs/2410.08035

  • MooER-omni - GitHub 2024

https://github.com/MooreThreads/MooER

  • GLM-4-Voice - GitHub 2024

https://github.com/THUDM/GLM-4-Voice

  • Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM - arXiv 2024

https://arxiv.org/abs/2411.00774

  • Hertz-dev - GitHub 2024

https://github.com/Standard-Intelligence/hertz-dev

  • Fish Agent - GitHub 2024

https://github.com/fishaudio/fish-speech

  • Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities - arXiv 2024

https://arxiv.org/abs/2410.11190

Benchmark

  • VoiceBench: Benchmarking LLM-Based Voice Assistants - arXiv 2024

https://arxiv.org/abs/2410.17196

Full Duplex Modeling

  • A Full-duplex Speech Dialogue Scheme Based On Large Language Models - NeurIPS 2024

https://arxiv.org/abs/2405.19487

  • MiniCPM-duplex: Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models - EMNLP 2024

https://arxiv.org/abs/2406.15718

  • LSLM: Language Model Can Listen While Speaking - arXiv 2024

https://arxiv.org/abs/2408.02622

  • SyncLLM: Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents - arXiv 2024

https://arxiv.org/abs/2409.15594

  • Enabling Real-Time Conversations with Minimal Training Costs - arXiv 2024

https://arxiv.org/abs/2409.11727

Survey

  • Towards audio language modeling -- an overview - arXiv 2024

https://arxiv.org/abs/2402.13236

  • Recent Advances in Speech Language Models: A Survey - arXiv 2024

https://arxiv.org/abs/2410.03751

  • Speech Trident - GitHub

https://github.com/ga642381/speech-trident

  • A Survey on Speech Large Language Models - arXiv 2024

https://arxiv.org/abs/2410.18908

Editors: Lin Ruili (林瑞麗), Fu Fengyuan (傅豐元)

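More Voice Agent study notes:

From developer tools to AI call centers: this Voice Agent company already serves 100+ customers

The creator of WebRTC has just joined OpenAI: how does he think about the future of voice AI?

A roadmap to human-level voice AI | Voice Agent study notes

The voice AI revolution: in the future, consumers may prefer talking to AI over human customer service

Voice AI is taking off, yet underestimated opportunities remain | RTE2024 audio technology and Voice AI session

Next-generation AI companionship | Equal relationships, long-term memory, and shared context | Podcast 《編碼人聲》

Voice-first: reflections from going heads-down to build a voice product | Community submission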
