TensorRT-LLM — Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency
TensorRT-LLM
Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use it for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100× faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
Skill metadata
| Field | Value |
| --- | --- |
| Source | Optional — install with `hermes skills install official/mlops/tensorrt-llm` |
| Path | `optional-skills/mlops/tensorrt-llm` |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | `tensorrt-llm`, `torch` |
| Tags | Inference Serving, TensorRT-LLM, NVIDIA, Inference Optimization, High Throughput, Low Latency, Production, FP8, INT4, In-Flight Batching, Multi-GPU |
Reference: full SKILL.md
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
TensorRT-LLM
NVIDIA’s open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.
When to use TensorRT-LLM
Use TensorRT-LLM when:
- Deploying on NVIDIA GPUs (A100, H100, GB200)
- Need maximum throughput (24,000+ tokens/sec on Llama 3)
- Require low latency for real-time applications
- Working with quantized models (FP8, INT4, FP4)
- Scaling across multiple GPUs or nodes
Use vLLM instead when:
- Need simpler setup and Python-first API
- Want PagedAttention without TensorRT compilation
- Working with AMD GPUs or non-NVIDIA hardware
Use llama.cpp instead when:
- Deploying on CPU or Apple Silicon
- Need edge deployment without NVIDIA GPUs
- Want simpler GGUF quantization format
Quick start
Installation
Section titled “Installation”# Docker (recommended)docker pull nvidia/tensorrt_llm:latest
# pip installpip install tensorrt_llm==1.2.0rc3
# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12Basic inference
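If the wheel's CUDA or TensorRT dependencies do not match the machine, the failure usually shows up at import time, so a quick sanity check (a convenience snippet added here, not part of the upstream skill) is simply:

```python
# Sanity-check the installation: importing tensorrt_llm fails fast if the
# CUDA / TensorRT libraries on the machine do not match the installed wheel.
import tensorrt_llm

print(tensorrt_llm.__version__)  # e.g. 1.2.0rc3
```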
Basic inference
```python
from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(max_tokens=100, temperature=0.7, top_p=0.9)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```
Serving with trtllm-serve
```bash
# Start server (automatic model download and compilation)
# --tp_size 4 enables tensor parallelism across 4 GPUs
trtllm-serve meta-llama/Meta-Llama-3-8B \
  --tp_size 4 \
  --max_batch_size 256 \
  --max_num_tokens 4096

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
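Because trtllm-serve exposes an OpenAI-compatible endpoint, the same request can be issued from Python with the `openai` client package (a separate install, not a TensorRT-LLM dependency). A minimal sketch, assuming the server above is listening on localhost:8000 and no API key is enforced:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local trtllm-serve endpoint.
# The api_key value is a placeholder; the local server ignores it (assumption).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)
```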
Key features
Performance optimizations
- In-flight batching: Dynamic batching during generation (a configuration sketch follows this list)
- Paged KV cache: Efficient memory management
- Flash Attention: Optimized attention kernels
- Quantization: FP8, INT4, FP4 for 2-4× faster inference
- CUDA graphs: Reduced kernel launch overhead
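As a rough illustration of how these optimizations surface in the Python LLM API, the sketch below combines the `dtype="fp8"` and `max_num_tokens` settings used in the patterns later in this skill with a paged-KV-cache configuration. `KvCacheConfig`, its import path, and its field names are assumptions based on recent TensorRT-LLM releases, so verify them against the docs for your installed version.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig  # assumed import path; check your version

# Paged KV cache: cap the cache at ~90% of free GPU memory and reuse cached
# blocks across requests (field names are assumptions, see lead-in above).
kv_cache = KvCacheConfig(
    free_gpu_memory_fraction=0.9,
    enable_block_reuse=True,
)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    dtype="fp8",               # FP8 quantization, as in the FP8 pattern below
    max_num_tokens=8192,       # upper bound on tokens packed into one in-flight batch
    kv_cache_config=kv_cache,  # assumed keyword argument
)
```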
Parallelism
- Tensor parallelism (TP): Split model across GPUs (see the sketch after this list)
- Pipeline parallelism (PP): Layer-wise distribution
- Expert parallelism: For Mixture-of-Experts models
- Multi-node: Scale beyond single machine
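A minimal sketch of combining these modes through the LLM API: `tensor_parallel_size` appears in the multi-GPU pattern below, while `pipeline_parallel_size` is an assumed keyword argument mirroring it, so confirm the exact name for your installed version.

```python
from tensorrt_llm import LLM

# Spread one model over 8 GPUs: 4-way tensor parallelism within a node,
# combined with 2-way pipeline parallelism (4 x 2 = 8 ranks in total).
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,
    pipeline_parallel_size=2,  # assumed keyword; verify against your version
)
```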
Advanced features
- Speculative decoding: Faster generation with draft models
- LoRA serving: Efficient multi-adapter deployment
- Disaggregated serving: Separate prefill and generation
Common patterns
Quantized model (FP8)
```python
from tensorrt_llm import LLM

# Load FP8 quantized model (2× faster, 50% memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192,
)

# Inference same as before
outputs = llm.generate(["Summarize this article..."])
```
Multi-GPU deployment
```python
# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    dtype="fp8",
)
```
Batch inference
Section titled “Batch inference”# Process 100 prompts efficientlyprompts = [f"Question {i}: ..." for i in range(100)]
outputs = llm.generate( prompts, sampling_params=SamplingParams(max_tokens=200))
# Automatic in-flight batching for maximum throughputPerformance benchmarks
Meta Llama 3-8B (H100 GPU):
- Throughput: 24,000 tokens/sec
- Latency: ~10ms per token
- vs PyTorch: 100× faster
Llama 3-70B (8× A100 80GB):
- FP8 quantization: 2× faster than FP16
- Memory: 50% reduction with FP8
Supported models
- LLaMA family: Llama 2, Llama 3, CodeLlama
- GPT family: GPT-2, GPT-J, GPT-NeoX
- Qwen: Qwen, Qwen2, QwQ
- DeepSeek: DeepSeek-V2, DeepSeek-V3
- Mixtral: Mixtral-8x7B, Mixtral-8x22B
- Vision: LLaVA, Phi-3-vision
- 100+ models on HuggingFace
References
- Optimization Guide - Quantization, batching, KV cache tuning
- Multi-GPU Setup - Tensor/pipeline parallelism, multi-node
- Serving Guide - Production deployment, monitoring, autoscaling