FastChat vs vLLM

Posted by lightsong on 2024-07-20

vLLM

https://github.com/vllm-project/vllm

https://docs.vllm.ai/en/latest/

Inference and serving, but with a stronger emphasis on inference.

vLLM is a fast and easy-to-use library for LLM inference and serving.
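
As a quick orientation, here is a minimal sketch of vLLM's offline Python API; the model name is only an illustrative placeholder:

```python
from vllm import LLM, SamplingParams

# Load any Hugging Face model that vLLM supports (placeholder model for illustration).
llm = LLM(model="facebook/opt-125m")

# SamplingParams also configures the decoding algorithms mentioned below,
# e.g. n > 1 requests parallel sampling.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The capital of France is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```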

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache (see the sketch after this list)
  • Optimized CUDA kernels
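
For the quantization bullet, loading quantized weights is a single constructor argument; a minimal sketch, assuming an AWQ checkpoint is available on the Hugging Face Hub (the repo name is an assumption for illustration):

```python
from vllm import LLM

# quantization="awq" tells vLLM to load AWQ-quantized weights, trading a
# little accuracy for a much smaller weight-memory footprint.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
```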

Performance benchmark: We include a performance benchmark that compares the performance of vLLM against other LLM serving engines (TensorRT-LLM, text-generation-inference and lmdeploy).

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism and pipeline parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server (see the sketch after this list)
  • Supports NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, and PowerPC CPUs
  • (Experimental) Prefix caching support
  • (Experimental) Multi-LoRA support
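
A sketch of the OpenAI-compatible server from the list above: start it as its own process, then talk to it with the official openai client (port and model name here are assumptions matching vLLM's defaults):

```python
# Start the server first, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
# It listens on port 8000 by default and speaks the OpenAI REST protocol.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="facebook/opt-125m",  # must match the --model the server was launched with
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```

Distributed inference reuses the same entry point: adding --tensor-parallel-size N shards the model across N GPUs.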

vLLM seamlessly supports most popular open-source models on HuggingFace, including:

  • Transformer-like LLMs (e.g., Llama)
  • Mixture-of-Experts LLMs (e.g., Mixtral)
  • Multi-modal LLMs (e.g., LLaVA)

Find the full list of supported models here.

FastChat

https://github.com/lm-sys/FastChat

It is responsible for model training, serving, and evaluation.

Its most popular use is the serving functionality, i.e., deployment (distributed deployment providing a web UI and an OpenAI-compatible REST API), and the backend can integrate vLLM to accelerate inference.

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.


FastChat is an open platform for training, serving, and evaluating large language model based chatbots.

  • FastChat powers Chatbot Arena (https://chat.lmsys.org/), serving over 10 million chat requests for 70+ LLMs.
  • Chatbot Arena has collected over 500K human votes from side-by-side LLM battles to compile an online LLM Elo leaderboard.

FastChat's core features include:

  • The training and evaluation code for state-of-the-art models (e.g., Vicuna, MT-Bench).
  • A distributed multi-model serving system with web UI and OpenAI-compatible RESTful APIs (see the sketch after this list).
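
A minimal sketch of that serving system, following FastChat's documented entry points (model path and port are illustrative):

```python
# FastChat's distributed serving stack is three cooperating processes:
#
#   python3 -m fastchat.serve.controller                                       # worker registry / router
#   python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5   # hosts the model
#   python3 -m fastchat.serve.openai_api_server --host localhost --port 8000   # OpenAI-compatible REST API
#
# (fastchat.serve.gradio_web_server adds the web UI on top of the same controller.)
# Once they are up, any OpenAI client can talk to the stack:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="vicuna-7b-v1.5",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```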

https://rudeigerc.dev/posts/llm-inference-with-fastchat/

VS

https://fastchat.mintlify.app/vllm_integration
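
Per the integration doc above, plugging vLLM into FastChat amounts to swapping the default worker for the vLLM worker; the controller and API server from the stack sketched earlier stay unchanged:

```
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.5
```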

https://github.com/lm-sys/FastChat/issues/1775
