vLLM
High-throughput LLM inference and serving engine
Overview
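vLLM is a high-throughput, memory-efficient inference and serving engine for large language models (LLMs). It uses PagedAttention, which stores the attention key-value (KV) cache in fixed-size blocks rather than in one contiguous buffer per sequence, to reduce wasted GPU memory.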
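To make the block-based idea concrete, here is a minimal pure-Python sketch of a paged allocator. It is a toy illustration, not vLLM's implementation: the class `ToyPagedKVCache`, its methods, and the `block_size` parameter are hypothetical names chosen for this example, and it tracks only block bookkeeping, not actual tensors.

```python
class ToyPagedKVCache:
    """Toy block allocator; illustrative only, not vLLM's actual code."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size                 # tokens per block (assumed)
        self.free_blocks = list(range(num_blocks))   # physical block ids
        self.block_tables = {}                       # seq_id -> list of block ids
        self.seq_lens = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed when the sequence has no block yet
        # or its last block is completely full.
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = ToyPagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("request-a")
print(cache.block_tables["request-a"])  # two block ids for three tokens
```

Because blocks are handed out on demand, a sequence never reserves more than one partially filled block, and blocks released by a finished sequence can be reused by any other sequence.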
Key Features
- PagedAttention: Efficient KV cache management
- Continuous batching: Maximizes GPU utilization
- Tensor parallelism: Scale across multiple GPUs
- Quantization: GPTQ, AWQ, FP8 support
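The utilization benefit of continuous batching can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not vLLM's scheduler: the function names are hypothetical, every request is assumed to need a positive number of decode steps, and each step is assumed to produce one token for every running request.

```python
from collections import deque


def continuous_batching_steps(request_lengths, max_batch_size):
    """Count decode steps when finished requests are replaced immediately."""
    waiting = deque(request_lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Fill any free batch slots from the waiting queue before each step.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode step generates one token per running request;
        # requests with no tokens left leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


def static_batching_steps(request_lengths, max_batch_size):
    """Count decode steps when each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(request_lengths), max_batch_size):
        steps += max(request_lengths[i:i + max_batch_size])
    return steps


lengths = [8, 2, 2, 2]
print(continuous_batching_steps(lengths, max_batch_size=2))  # 8
print(static_batching_steps(lengths, max_batch_size=2))      # 10
```

In this toy workload the short requests slot in next to the long one as soon as a slot frees up, so the batch stays full for more of the run and the total step count drops from 10 to 8.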