With the growing demand for large language models (LLMs) in long-context scenarios, their core attention mechanism has drawn considerable interest. Attention computes the interactions between input tokens within a given span, which is how the model understands context. As applications evolve, the need to process ever longer inputs efficiently keeps growing [1][2], and with it two cost challenges: the high computational cost of attention itself and the ever-growing key-value cache (KV-Cache). Sparse attention can effectively alleviate these memory and throughput challenges.

However, existing sparse attention methods typically apply a uniform sparse pattern, i.e., the same sparsity for every attention head and every input length. Such a uniform scheme fails to capture the diverse attention patterns found in LLMs and ignores the fact that different heads have different accuracy-cost trade-offs.

Recently, a research team from Tsinghua University, Infinigence AI, and Shanghai Jiao Tong University published "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression", which mixes attention heads of different sparsity levels: at only 25% attention density, the model can recall nearly 100% of its context. The work is now open-sourced, and the authors welcome discussion.
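To make the idea of heterogeneous, per-head sparsity concrete, below is a minimal PyTorch sketch. It is not the authors' released implementation: each head simply attends through its own sliding-window causal mask, and the window sizes in the example are chosen purely for illustration. This dense-mask version only visualizes the pattern; an efficient kernel would compute just the unmasked entries.

```python
import torch
import torch.nn.functional as F

def sliding_window_mask(seq_len: int, window: int, device=None) -> torch.Tensor:
    """Causal mask that keeps only the most recent `window` keys for each query."""
    i = torch.arange(seq_len, device=device).unsqueeze(1)  # query positions
    j = torch.arange(seq_len, device=device).unsqueeze(0)  # key positions
    return (j <= i) & (i - j < window)                     # True = attend

def mixture_of_sparse_attention(q, k, v, windows):
    """
    q, k, v: [batch, num_heads, seq_len, head_dim]
    windows: one window size per head (illustrative assumption), so each head
             gets its own sparsity instead of one uniform pattern.
    """
    b, h, n, d = q.shape
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # [b, h, n, n]
    # Build a different mask per head: this is the heterogeneity MoA exploits.
    mask = torch.stack(
        [sliding_window_mask(n, w, device=q.device) for w in windows]
    )                                                       # [h, n, n]
    scores = scores.masked_fill(~mask.unsqueeze(0), float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Illustrative usage: 4 heads with increasingly wide windows.
q = k = v = torch.randn(1, 4, 512, 64)
out = mixture_of_sparse_attention(q, k, v, windows=[64, 128, 256, 512])
print(out.shape)  # torch.Size([1, 4, 512, 64])
```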
Figure: Efficiency analysis of different frameworks on 7B and 13B models. The efficiency gain from each MoA design is broken down into four parts via an ablation study. All sparse attention methods use 50% attention density, and decoding throughput is evaluated at the maximum batch size that fits in an A100-80GB GPU's memory.

About the authors

The co-first authors of this paper are Tianyu Fu, Haofeng Huang, and Xuefei Ning of the NICS-EFC lab in the Department of Electronic Engineering at Tsinghua University; they are members of the lab's EffAlg team and of Infinigence AI. The NICS-EFC lab is led by Professor Yu Wang, and its Efficient Algorithm Team (EffAlg) is led by Assistant Research Fellow Xuefei Ning. The EffAlg team's main research direction is efficient deep learning; the team website is https://nics-effalg.com/

References

[1] Chen, Shouyuan, et al. "Extending Context Window of Large Language Models via Positional Interpolation." arXiv, 2023, abs/2306.15595, https://api.semanticscholar.org/CorpusID:259262376.
[2] Tworkowski, Szymon, et al. "Focused Transformer: Contrastive Training for Context Scaling." arXiv, 2023, abs/2307.03170, https://api.semanticscholar.org/CorpusID:259360592.
[3] Vaswani, Ashish, et al. "Attention Is All You Need." Advances in Neural Information Processing Systems, vol. 30, 2017.
[4] Xiao, Guangxuan, et al. "Efficient Streaming Language Models with Attention Sinks." The Twelfth International Conference on Learning Representations, 2024.
[5] Han, Chi, et al. "LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models." arXiv preprint arXiv:2308.16137, 2023.
[6] Zaheer, Manzil, et al. "Big Bird: Transformers for Longer Sequences." Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 17283-17297.
[7] Li, Dacheng, et al. "How Long Can Open-Source LLMs Truly Promise on Context Length?" lmsys.org, June 2023, https://lmsys.org/blog/2023-06-29-longchat.
[8] Fu, Yao, et al. "Data Engineering for Scaling Language Models to 128K Context." arXiv, 2024, abs/2402.10171, https://api.semanticscholar.org/CorpusID:267682361.
[9] Zhang, Zhenyu (Allen), et al. "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models." arXiv, 2023, abs/2306.14048, https://api.semanticscholar.org/CorpusID:259263947.
[10] Dao, Tri, et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." Advances in Neural Information Processing Systems, 2022.
[11] Kwon, Woosuk, et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." Proceedings of the 29th Symposium on Operating Systems Principles, 2023, https://api.semanticscholar.org/CorpusID:261697361.
[12] Xiao, Chaojun, et al. "InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory." 2024.