vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
https://blog.vllm.ai/2023/06/20/vllm.html
LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. Today we are excited to introduce vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention sets a new state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes.
vLLM has been developed at UC Berkeley and deployed at Chatbot Arena and Vicuna Demo for the past two months. It is the core technology that makes LLM serving affordable even for a small research team like LMSYS with limited compute resources. Try out vLLM now with a single command at our GitHub repository.
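For a taste of what that looks like in practice, here is a minimal offline-inference sketch along the lines of vLLM's quickstart, after installing the library with pip install vllm (the small facebook/opt-125m model is used purely as an example):

from vllm import LLM, SamplingParams

# Example prompts and sampling settings; the values here are illustrative.
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load a model and generate completions for the whole batch of prompts.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)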
https://arxiv.org/abs/2309.06180
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4× with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at this https URL
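To make the paging analogy concrete, here is a deliberately simplified Python sketch of the bookkeeping behind a paged KV cache. It is not vLLM's actual implementation; the class, the block size of 16, and the method names are all assumptions for illustration. The point is that KV memory is handed out in fixed-size blocks through a per-request block table, so blocks are allocated only when needed and returned to a shared pool when a request finishes:

# Illustrative sketch of paged KV-cache bookkeeping (not vLLM's real code).
BLOCK_SIZE = 16  # tokens per KV block; 16 is an example value

class PagedKVCache:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))  # pool of physical blocks
        self.block_tables = {}  # request_id -> list of physical block ids

    def append_token(self, request_id, token_index):
        """Allocate a new physical block only when a request crosses a block boundary."""
        table = self.block_tables.setdefault(request_id, [])
        if token_index % BLOCK_SIZE == 0:           # current block is full (or this is the first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; request must wait or be preempted")
            table.append(self.free_blocks.pop())    # map the next logical block to a free physical block
        return table[token_index // BLOCK_SIZE]     # physical block holding this token's KV entries

    def free(self, request_id):
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))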
Why is serving LLMs so challenging?
Computational Resources
Because an LLM uses an enormous number of parameters for every prediction, starting at around 7B and going up to 321B, deploying such a model requires intensive resources and far more optimization than deploying a traditional machine learning model.
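As a rough, back-of-the-envelope illustration of why the resources are intensive (the bytes-per-parameter figure is an assumption for fp16/bf16 weights, and the estimate ignores the KV cache, activations, and framework overhead):

def weight_memory_gb(num_params, bytes_per_param=2):
    # 2 bytes per parameter roughly corresponds to fp16/bf16 weights.
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(7e9))    # ~14 GB of weights alone for a 7B-parameter model
print(weight_memory_gb(70e9))   # ~140 GB for a 70B-parameter model, before any KV cache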
Latency
When a prompt is long or complex, computing a result for the client can take several minutes, which becomes a problem at large scale or in a real-world business. For instance, a company may build a product Q&A chatbot on an LLM; if every question gets a slow response, users become frustrated. Applying methods to reduce latency is therefore good practice.
Cost
A large-scale system, or one that runs multiple LLMs, consumes a large budget, since LLMs need substantial resources to process requests. As an MLE, finding ways to utilize those resources efficiently brings a financial benefit to the system, for instance by lowering the cost per request.
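As a purely hypothetical illustration of the cost-per-request point (all numbers below are made up, not benchmarks):

# Hypothetical numbers showing how higher throughput lowers cost per request.
gpu_cost_per_hour = 2.50              # assumed hourly price of one GPU instance
requests_per_hour_baseline = 1_000    # assumed throughput of a naive deployment
requests_per_hour_optimized = 10_000  # assumed throughput after serving optimizations

print(gpu_cost_per_hour / requests_per_hour_baseline)   # $0.0025 per request
print(gpu_cost_per_hour / requests_per_hour_optimized)  # $0.00025 per request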
What is vLLM?
This project comes from UC Berkeley students with a passion for optimizing LLM serving performance. Many systems spend a lot of resources on serving LLMs, yet still get poor response times when the models are deployed in a straightforward way. The vLLM team therefore proposes a new method, borrowing the OS's virtual memory design, that can improve LLM serving performance by around 24 times while using half the GPU memory of the traditional approach. vLLM also provides a simple Python interface that machine learning engineers (MLEs) can integrate into their systems without fancy packages or dependencies.
https://blog.monsterapi.ai/blogs/what-is-vllm-and-how-to-implement-it/
Serving large language models (LLMs) in production environments poses significant challenges, including high memory consumption, latency issues, and the need for efficient resource management. These challenges often result in suboptimal performance and scalability problems, hindering the deployment of LLMs in real-world applications.
vLLM addresses these challenges by optimizing memory management and dynamically adjusting batch sizes, ensuring efficient execution and improved throughput for large language models.
What is the Core Idea of vLLM?
The core idea of vLLM (Virtual Large Language Model) is to optimize the serving and execution of large language models (LLMs) by utilizing efficient memory management techniques. Here are the key aspects:
- Optimized Memory Management: vLLM uses sophisticated memory allocation and management strategies to maximize the utilization of available hardware resources. This allows for the efficient execution of large language models without running into memory bottlenecks.
- Dynamic Batching: vLLM dynamically adjusts the batch sizes and sequences to better fit the memory and compute capacity of the hardware. This flexibility leads to improved throughput and reduced latency during inference (see the sketch after this list).
- Modular Design: The architecture of vLLM is designed to be modular, allowing for easy integration with various hardware accelerators and scaling across multiple devices or clusters.
- Efficient Resource Utilization: By managing resources such as CPU, GPU, and memory more effectively, vLLM can serve larger models and handle more simultaneous requests, making it suitable for production environments where scalability and performance are critical.
- Seamless Integration: vLLM aims to integrate seamlessly with existing machine learning frameworks and libraries, providing a user-friendly interface for deploying and serving large language models in various applications.
Overall, the core idea of vLLM is to enhance the performance, scalability, and efficiency of large language model deployment through advanced memory and resource management techniques.
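To make the dynamic batching idea above concrete, here is a deliberately simplified scheduling loop in Python. It is a conceptual sketch of continuous batching in general, not vLLM's actual scheduler; names such as decode_step, prompt_tokens, output_tokens, is_finished, and max_batch_tokens are hypothetical and used only for illustration:

from collections import deque

def serve(model, waiting: deque, max_batch_tokens=8192):
    # Requests currently being decoded together in one batch.
    running = []
    while waiting or running:
        # Admit new requests whenever the batch is empty or has spare token budget.
        while waiting and (not running or
                           batch_tokens(running) + len(waiting[0].prompt_tokens) <= max_batch_tokens):
            running.append(waiting.popleft())
        # One decode step produces the next token for every running request at once.
        for request, token in zip(running, model.decode_step(running)):
            request.output_tokens.append(token)
        # Finished requests leave immediately, freeing room for new ones mid-flight.
        running = [r for r in running if not r.is_finished()]

def batch_tokens(requests):
    return sum(len(r.prompt_tokens) + len(r.output_tokens) for r in requests)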
How to Use vLLM?
We will now walk you through the steps to effectively use vLLM for serving large language models (LLMs) in production. We'll cover integration, configuration, deployment, and maintenance steps.
For those looking for a quicker alternative, we also introduce a ready-to-use service leveraging vLLM at the end of this topic.
Here’s a step-wise Workflow for Using vLLM:
Integration and Configuration:
Option 1: Self-Configuration:
- Integrate vLLM into your existing machine learning framework or library (e.g., PyTorch, TensorFlow) by following the provided installation and setup guidelines.
- Configure memory management settings and adjust batching strategies, including batch size and sequence length, to match your hardware resources and optimize performance.
- Load your pre-trained large language model (LLM) into vLLM, ensuring it is properly initialized and ready for inference tasks.
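For Option 1, a minimal Python sketch of these three steps might look like the following; the model name and the parameter values are only examples, and gpu_memory_utilization, max_model_len, and tensor_parallel_size are engine options you would tune to your own hardware:

from vllm import LLM, SamplingParams

# Load a pre-trained model and size the engine to the available hardware (example values).
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",  # any Hugging Face model id or local path
    gpu_memory_utilization=0.90,        # fraction of GPU memory available for weights and KV cache
    max_model_len=4096,                 # upper bound on prompt plus generation length
    tensor_parallel_size=1,             # number of GPUs to shard the model across
)

# Batching of concurrent prompts is handled by the engine; you only choose sampling behaviour.
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain paged attention in one sentence."], params)
print(outputs[0].outputs[0].text)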
Option 2: vLLM Docker Container:
- Use the ready-to-use vLLM Docker container for a simplified setup.
- Follow the instructions to pull the Docker image, configure the necessary settings, and deploy your LLM within the container environment.
Here’s an example command for running a vLLM docker container with Mistral 7B LLM:
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-v0.1
You can find more details on Docker deployments in vLLM's official docs.
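Once the container is running, it exposes an OpenAI-compatible API on port 8000. A minimal Python client sketch using the requests library (the prompt and sampling values are examples) could look like this:

import requests

# Query the OpenAI-compatible completions endpoint served by the vLLM container.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "mistralai/Mistral-7B-v0.1",  # must match the --model passed to the container
        "prompt": "San Francisco is a",
        "max_tokens": 64,
        "temperature": 0.7,
    },
)
print(response.json()["choices"][0]["text"])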
Understanding vLLM for Increasing LLM Throughput
https://www.e2enetworks.com/blog/understanding-vllm-for-increasing-llm-throughput