production-stack

vLLM Production Stack: reference stack for production vLLM deployment

Community Events

We host weekly community meetings, alternating between two meeting times each week to accommodate different time zones. All are welcome to join!

Introduction

The vLLM Production Stack project provides a reference implementation of how to build an inference stack on top of vLLM, which allows you to:

  1. Scale from a single vLLM instance to a distributed deployment without changing any application code.
  2. Monitor the deployment through a web dashboard.
  3. Benefit from request routing and KV cache offloading for higher performance.

Step-By-Step Tutorials

  1. How to Install Kubernetes (kubectl, helm, minikube, etc.)?
  2. How to Deploy Production Stack on Major Cloud Platforms (AWS, GCP, Lambda Labs, Azure)?
  3. How to Set Up a Minimal vLLM Production Stack?
  4. How to Customize vLLM Configs (optional)?
  5. How to Load Your LLM Weights?
  6. How to Launch Different LLMs in vLLM Production Stack?
  7. How to Enable KV Cache Offloading with LMCache? (A configuration sketch follows this list.)
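
For reference, tutorial 7 enables LMCache-based KV cache offloading through the serving engine's values file. The sketch below illustrates the idea; the lmcacheConfig field names are assumptions taken from the tutorial's example values files and should be verified against values.yaml:

# Sketch: enable LMCache CPU offloading for a model, then install the
# chart with this values file (see the Deployment section below)
cat > values-lmcache.yaml <<'EOF'
servingEngineSpec:
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"
    replicaCount: 1
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1
    lmcacheConfig:
      enabled: true
      cpuOffloadingBufferSize: "20"   # CPU offloading buffer, in GB
EOF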

Architecture

The stack is set up using Helm and contains the following key parts:

  1. Serving engine: the vLLM instances that serve the LLMs.
  2. Request router: distributes incoming requests across the serving engines.
  3. Observability stack: collects engine metrics and visualizes them in the Grafana dashboard.

Architecture of the stack

Roadmap

We are actively developing this project and will release new features soon. Please stay tuned!

Deploying the stack via Helm

Prerequisites

A running Kubernetes environment with GPU support, plus the kubectl and helm command-line tools (see tutorial 1 above for installation steps).

Deployment

vLLM Production Stack can be deployed via Helm charts. Clone the repository and run the following commands for a minimal deployment:

# Clone the repository to get the example values files
git clone https://github.com/vllm-project/production-stack.git
cd production-stack/

# Add the production-stack Helm repository and install the chart
helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml
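
Before sending any traffic, you can wait for the router and serving engine pods to become ready:

# Watch the pods until they report Running and Ready
kubectl get pods --watch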

The deployed stack exposes the same OpenAI-compatible API as vLLM and can be accessed through a Kubernetes service.

To validate the installation and send a query to the stack, refer to this tutorial.
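
As a minimal sketch of such a query, assuming the router service is named vllm-router-service (the naming used for a Helm release called vllm) and that the minimal example serves facebook/opt-125m:

# Forward the router service to a local port (check `kubectl get services`
# for the actual service name in your cluster)
kubectl port-forward svc/vllm-router-service 30080:80

# In a second terminal: list the models served by the stack
curl http://localhost:30080/v1/models

# Send an OpenAI-style completion request
curl http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m", "prompt": "Once upon a time,", "max_tokens": 20}'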

For more information about customizing the Helm chart, please refer to values.yaml and our other tutorials.
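
As an illustration, a custom values file that swaps in a different model and replica count might look like the sketch below; the field names follow the tutorial's example values files, and the model choice is hypothetical:

# Sketch of a customized deployment (verify field names against values.yaml)
cat > my-values.yaml <<'EOF'
servingEngineSpec:
  modelSpec:
  - name: "llama3"                                # hypothetical name
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical model choice
    replicaCount: 2
    requestCPU: 10
    requestMemory: "16Gi"
    requestGPU: 1
EOF

# Apply the customized values on top of the chart defaults
helm upgrade --install vllm vllm/vllm-stack -f my-values.yaml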

Uninstall

helm uninstall vllm

Grafana Dashboard

Features

The Grafana dashboard provides the following insights:

  1. Available vLLM Instances: Displays the number of healthy instances.
  2. Request Latency Distribution: Visualizes end-to-end request latency.
  3. Time-to-First-Token (TTFT) Distribution: Monitors how long requests wait before the first output token is generated.
  4. Number of Running Requests: Tracks the number of active requests per instance.
  5. Number of Pending Requests: Tracks requests waiting to be processed.
  6. GPU KV Usage Percent: Monitors GPU KV cache usage.
  7. GPU KV Cache Hit Rate: Displays the hit rate for the GPU KV cache.

Grafana dashboard to monitor the deployment

Configuration

See the details in observability/README.md
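
As one possible way to reach the dashboard locally (the Grafana service name and namespace below are assumptions; find the actual ones with kubectl get svc --all-namespaces):

# Port-forward the Grafana service and open http://localhost:3000
kubectl --namespace monitoring port-forward svc/kube-prom-stack-grafana 3000:80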

Router

The router ensures efficient request distribution among backends. It supports:

  1. Multiple routing algorithms, such as round-robin and session-ID based routing.
  2. Routing requests to endpoints that serve different models.
  3. Automatic discovery of serving engines through the Kubernetes API.

Please refer to the router documentation for more details.
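
As an illustration, the routing algorithm can be selected through the chart's router values; the routerSpec field names below are assumptions to verify against values.yaml:

# Sketch: switch the router to session-based routing (field names assumed)
cat > router-values.yaml <<'EOF'
routerSpec:
  routingLogic: "session"   # assumed alternative to round-robin
  sessionKey: "x-user-id"   # hypothetical header used to pin sessions
EOF

helm upgrade vllm vllm/vllm-stack --reuse-values -f router-values.yaml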

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

License

This project is licensed under Apache License 2.0. See the LICENSE file for details.

Sponsors

We are grateful to our sponsors who support our development and benchmarking efforts:

GMI Cloud


For any issues or questions, feel free to open an issue or contact us (@ApostaC, @YuhanLiu11, @Shaoting-Feng).