Serving Llms Vllm — vLLM: high-throughput LLM serving, OpenAI API, quantization
Skill metadata
| Source | Bundled (installed by default) |
| Path | skills/mlops/inference/vllm |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | vllm, torch, transformers |
| Platforms | linux, macos |
| Tags | vLLM, Inference Serving, PagedAttention, Continuous Batching, High Throughput, Production, OpenAI API, Quantization, Tensor Parallelism |
Reference: full SKILL.md
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
vLLM - High-Performance LLM Serving
When to use
Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
Quick start
vLLM achieves up to 24x higher throughput than HuggingFace Transformers through PagedAttention (a KV cache managed in fixed-size blocks) and continuous batching (scheduling prefill and decode requests together in the same batch).
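To make the block-based KV cache idea concrete, here is a toy sketch of a block table. It is not vLLM's implementation: the class name `ToyBlockAllocator`, its method names, and the block size of 16 tokens are assumptions chosen for illustration. The point it shows is that a sequence only claims a new block when its current blocks are full, and that finished sequences hand their blocks back for reuse.

```python
class ToyBlockAllocator:
    """Toy model of a paged KV cache: sequences own lists of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # existing blocks are full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = ToyBlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("req-a")
blocks_used = len(alloc.block_tables["req-a"])  # 17 tokens span 2 blocks
alloc.free("req-a")
```

Because memory is claimed per block rather than per maximum sequence length, short requests leave room for others, which is what lets the scheduler keep more requests in flight.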
Installation:
pip install vllm
Basic offline inference:
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)
OpenAI-compatible server:
vllm serve meta-llama/Llama-3-8B-Instruct
# Query with OpenAI SDK
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
    model='meta-llama/Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"
Common workflows
Workflow 1: Production API deployment
Copy this checklist and track progress:
Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics
Step 1: Configure server settings
Choose configuration based on your model size:
# For 7B-13B models on single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --port 8000
# For 30B-70B models with tensor parallelism
# (--quantization awq requires an AWQ-quantized checkpoint)
vllm serve TheBloke/Llama-2-70B-AWQ \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --quantization awq \
  --port 8000
# For production with prefix caching
# (Prometheus metrics are served at /metrics on the API port; no extra flag is needed)
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching \
  --port 8000 \
  --host 0.0.0.0
Step 2: Test with limited traffic
Run load test before production:
# Install load testing tool
pip install locust
# Create test_load.py with sample requests
# Run: locust -f test_load.py --host http://localhost:8000
Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
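Once a load test has produced timings, the two targets above can be checked mechanically. The following is a minimal sketch under stated assumptions: the function name `summarize_load_test` is hypothetical, the median is used as the TTFT statistic (the targets above do not say which percentile is meant), and the sample numbers are invented for illustration rather than measured.

```python
import statistics


def summarize_load_test(ttft_seconds, total_requests, duration_seconds,
                        ttft_target_s=0.5, throughput_target_rps=100.0):
    """Compare measured TTFT and throughput against the stated targets."""
    median_ttft = statistics.median(ttft_seconds)
    throughput = total_requests / duration_seconds
    return {
        "median_ttft_s": median_ttft,
        "throughput_rps": throughput,
        "ttft_ok": median_ttft < ttft_target_s,              # TTFT < 500 ms
        "throughput_ok": throughput > throughput_target_rps,  # > 100 req/sec
    }


# Made-up measurements for illustration only, not benchmark results.
report = summarize_load_test(
    [0.21, 0.35, 0.28, 0.40], total_requests=6600, duration_seconds=60
)
```

A stricter check would swap the median for a high percentile such as p95, since tail latency is usually what users notice.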
Step 3: Enable monitoring
vLLM serves Prometheus metrics at the /metrics path of the API server itself (port 8000 here), not on a separate metrics port:
curl http://localhost:8000/metrics | grep vllm
Key metrics to monitor:
- vllm:time_to_first_token_seconds - Latency
- vllm:num_requests_running - Active requests
- vllm:gpu_cache_usage_perc - KV cache utilization
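For scripted checks, the metrics text can be parsed directly instead of piped through grep. This is a sketch, assuming the endpoint returns the standard Prometheus text exposition format without trailing timestamps; `parse_vllm_metrics` is a hypothetical helper, and the sample text is hand-written for illustration rather than captured from a real server.

```python
def parse_vllm_metrics(text: str) -> dict:
    """Collect samples whose metric name starts with 'vllm:' from Prometheus text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # The value is the last space-separated field (assumes no timestamp).
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]  # drop the {label="..."} part
        if name.startswith("vllm:"):
            samples.setdefault(name, []).append(float(value))
    return samples


# Hand-written sample in Prometheus text format; labels and values are illustrative.
sample = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.42
process_cpu_seconds_total 12.5
"""
metrics = parse_vllm_metrics(sample)
```

Each metric maps to a list because one name can appear several times with different label sets, for example once per served model.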
Step 4: Deploy to production
Use Docker for consistent deployment:
# Run vLLM in Docker
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching
Step 5: Verify performance metrics
Check that deployment meets targets:
- TTFT < 500ms (for short prompts)
- Throughput > target req/sec
- GPU utilization > 80%
- No OOM errors in logs
Workflow 2: Offline batch inference
For processing large datasets without server overhead.
Copy this checklist:
Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results
Step 1: Prepare input data
# Load prompts from file, one per line, skipping blank lines
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]
print(f"Loaded {len(prompts)} prompts")
Step 2: Configure LLM engine
from vllm import LLM, SamplingParams
llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,  # Use 2 GPUs
    gpu_memory_utilization=0.9,
    max_model_len=4096
)
sampling = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "\n\n"]
)
Step 3: Run batch inference
vLLM automatically batches requests for efficiency:
# Process all prompts in one call
outputs = llm.generate(prompts, sampling)
# vLLM handles batching internally
# No need to manually chunk prompts
Step 4: Process results
import json
# Extract generated text
results = []
for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    results.append({
        "prompt": prompt,
        "generated": generated,
        "tokens": len(output.outputs[0].token_ids)
    })
# Save to file, one JSON object per line
with open("results.jsonl", "w") as f:
    for result in results:
        f.write(json.dumps(result) + "\n")
print(f"Processed {len(results)} prompts")
Workflow 3: Quantized model serving
Fit large models in limited GPU memory.
Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy
Step 1: Choose quantization method
- AWQ: Best for 70B models, minimal accuracy loss
- GPTQ: Wide model support, good compression
- FP8: Fastest on H100 GPUs
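When weighing these options, a rough weight-memory estimate shows what each bit width buys. The sketch below is simple arithmetic, not a vLLM API: `weight_memory_gb` is a hypothetical helper, it assumes AWQ and GPTQ checkpoints store about 4 bits per weight, and it ignores quantization scales, activations, and the KV cache, so real usage is somewhat higher.

```python
def weight_memory_gb(num_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring all overhead."""
    bytes_per_weight = bits_per_weight / 8
    return num_params_billion * bytes_per_weight


fp16_gb = weight_memory_gb(70, 16)  # unquantized 70B model
fp8_gb = weight_memory_gb(70, 8)    # FP8
int4_gb = weight_memory_gb(70, 4)   # 4-bit AWQ/GPTQ (assumed bit width)
```

For a 70B model this gives 140 GB at FP16, 70 GB at FP8, and 35 GB at 4 bits, before overhead.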
Step 2: Find or create quantized model
Use pre-quantized models from HuggingFace:
# Search for AWQ models
# Example: TheBloke/Llama-2-70B-AWQ
Step 3: Launch with quantization flag
# Using pre-quantized model
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95
# Result: 70B model in ~40GB VRAM
Step 4: Verify accuracy
Test outputs match expected quality:
# Compare quantized vs non-quantized responses
# Verify task-specific performance unchanged
When to use vs alternatives
Use vLLM when:
- Deploying production LLM APIs (100+ req/sec)
- Serving OpenAI-compatible endpoints
- Limited GPU memory but need large models
- Multi-user applications (chatbots, assistants)
- Need low latency with high throughput
Use alternatives instead:
- llama.cpp: CPU/edge inference, single-user
- HuggingFace transformers: Research, prototyping, one-off generation
- TensorRT-LLM: NVIDIA-only, need absolute maximum performance
- Text-Generation-Inference: Already in HuggingFace ecosystem
Common issues
Issue: Out of memory during model loading
Reduce memory usage:
vllm serve MODEL \
  --gpu-memory-utilization 0.7 \
  --max-model-len 4096
Or serve a quantized checkpoint (MODEL must already be AWQ-quantized):
vllm serve MODEL --quantization awq
Issue: Slow first token (TTFT > 1 second)
Enable prefix caching for repeated prompts:
vllm serve MODEL --enable-prefix-caching
For long prompts, enable chunked prefill:
vllm serve MODEL --enable-chunked-prefill
Issue: Model not found error
Use --trust-remote-code for custom models:
vllm serve MODEL --trust-remote-code
Issue: Low throughput (<50 req/sec)
Increase concurrent sequences:
vllm serve MODEL --max-num-seqs 512
Check GPU utilization with nvidia-smi - should be >80%.
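The utilization check can be scripted rather than read off the terminal. This is a sketch, assuming `nvidia-smi` is on the PATH and prints one integer per GPU for the query shown; the function names are hypothetical, and the sample string stands in for real output. Some GPUs report `[N/A]` for this field, which this parser does not handle.

```python
import subprocess


def parse_gpu_utilization(csv_output: str) -> list:
    """Parse one utilization percentage per line (noheader,nounits format)."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]


def underutilized_gpus(utilizations, threshold=80):
    """Return indices of GPUs at or below the utilization threshold."""
    return [i for i, util in enumerate(utilizations) if util <= threshold]


def query_gpu_utilization() -> list:
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_utilization(result.stdout)


# Illustrative values standing in for real nvidia-smi output.
low = underutilized_gpus(parse_gpu_utilization("91\n45\n88\n"))
```

Utilization is a point-in-time sample, so it is worth polling several times under load before concluding that a GPU is idle.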
Issue: Inference slower than expected
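Verify that the tensor-parallel size evenly divides the model's number of attention heads; powers of 2 are the usual choice:
vllm serve MODEL --tensor-parallel-size 4  # Not 3
Enable speculative decoding for faster generation (the flag names for this feature have changed between vLLM releases, so confirm with vllm serve --help):
vllm serve MODEL --speculative-model DRAFT_MODEL
Advanced topics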
Server deployment patterns: See references/server-deployment.md for Docker, Kubernetes, and load balancing configurations.
Performance optimization: See references/optimization.md for PagedAttention tuning, continuous batching details, and benchmark results.
Quantization guide: See references/quantization.md for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.
Troubleshooting: See references/troubleshooting.md for detailed error messages, debugging steps, and performance diagnostics.
Hardware requirements
- Small models (7B-13B): 1x A10 (24GB) or A100 (40GB)
- Medium models (30B-40B): 2x A100 (40GB) with tensor parallelism
- Large models (70B+): 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ
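The sizes above are driven mostly by model weights, but the KV cache competes for the same GPU memory and grows with context length. The sketch below estimates its per-token cost; `kv_cache_bytes_per_token` is a hypothetical helper, and the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte cache values) is an assumed Llama-3-8B-style configuration that should be checked against the model's config.json.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed shape: 32 layers, 8 KV heads, head dimension 128, FP16 cache values.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_sequence_gib = per_token * 8192 / 2**30  # one full 8192-token sequence
```

Under these assumptions each token costs 128 KiB and a full 8192-token sequence costs 1 GiB, which is why lowering --max-model-len frees room for more concurrent requests.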
Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs
Resources
- Official docs: https://docs.vllm.ai
- GitHub: https://github.com/vllm-project/vllm
- Paper: “Efficient Memory Management for Large Language Model Serving with PagedAttention” (SOSP 2023)
- Community: https://discuss.vllm.ai