vLLM

High-throughput LLM inference and serving engine

Overview

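vLLM is a high-throughput, memory-efficient inference and serving engine for large language models (LLMs). Its core technique, PagedAttention, stores the attention key-value (KV) cache in fixed-size blocks that need not be contiguous in memory, much like paging in an operating system's virtual memory. This reduces fragmentation and wasted GPU memory.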
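To make the block-based memory management idea concrete, here is a minimal pure-Python sketch of a paged KV cache allocator. It is a toy model and not vLLM's actual implementation: the names BlockAllocator, Sequence, and demo, and the block size of 16 tokens, are assumptions chosen for illustration. It only tracks which physical blocks a sequence owns and stores no tensors.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block; an illustrative value, not a vLLM default claim


@dataclass
class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool (hypothetical helper)."""
    num_blocks: int
    free: list = field(init=False)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV cache pool exhausted")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


@dataclass
class Sequence:
    """Tracks one request's mapping from logical to physical blocks."""
    allocator: BlockAllocator
    num_tokens: int = 0
    block_table: list = field(default_factory=list)

    def append_token(self) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self) -> None:
        # Return every block to the pool so other sequences can reuse it.
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0


def demo() -> tuple:
    pool = BlockAllocator(num_blocks=8)
    seq = Sequence(pool)
    for _ in range(40):  # 40 tokens need ceil(40 / 16) = 3 blocks
        seq.append_token()
    used = len(seq.block_table)
    seq.free()
    return used, len(pool.free)
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, instead of reserving memory for its maximum possible length up front.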

Key Features

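PagedAttention: Manages the KV cache in fixed-size blocks to reduce memory waste
Continuous batching: Admits new requests and retires finished ones at each generation step, keeping the GPU busy
Tensor parallelism: Shards model weights across multiple GPUs
Quantization: Supports GPTQ, AWQ, and FP8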
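The features above are typically selected through arguments to vLLM's Python entry point. The sketch below shows one way this might look, assuming vLLM is installed and suitable GPUs are available. The helper names engine_kwargs and generate are hypothetical wrappers written for this example, and the sampling values are arbitrary. The vLLM import is deferred into generate so that the argument-building helper can be used on a machine without vLLM.

```python
from typing import Optional


def engine_kwargs(model: str, num_gpus: int = 1,
                  quantization: Optional[str] = None) -> dict:
    """Collect constructor arguments for vllm.LLM (hypothetical helper)."""
    kwargs = {"model": model, "tensor_parallel_size": num_gpus}
    if quantization is not None:
        # Must match how the checkpoint was quantized, e.g. "awq", "gptq", "fp8".
        kwargs["quantization"] = quantization
    return kwargs


def generate(prompts: list, **kwargs) -> list:
    """Run offline batched generation and return one completion per prompt."""
    # Imported lazily: requires vLLM and a supported accelerator at call time.
    from vllm import LLM, SamplingParams

    llm = LLM(**kwargs)
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
    # All prompts are submitted together; the engine handles batching internally.
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]
```

A call might look like generate(["Hello"], **engine_kwargs("your-org/your-awq-model", num_gpus=2, quantization="awq")), where the model name is a placeholder for a checkpoint quantized with the matching method.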