vLLM

High-throughput LLM inference and serving engine

Overview

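vLLM is a high-throughput, memory-efficient inference and serving engine for large language models (LLMs). It uses PagedAttention, which stores the attention key-value (KV) cache in fixed-size blocks rather than one contiguous region per sequence, to reduce wasted GPU memory.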
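
To make the paging idea concrete, here is a toy sketch in plain Python. It is not vLLM's implementation: the class name ToyBlockAllocator, the BLOCK_SIZE value, and the method names are all hypothetical, chosen only to illustrate how a per-sequence block table can map growing sequences onto fixed-size cache blocks that are allocated on demand.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


@dataclass
class ToyBlockAllocator:
    """Hypothetical allocator that hands out fixed-size KV cache blocks."""

    num_blocks: int
    free_blocks: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)  # seq_id -> physical block ids
    seq_lens: dict = field(default_factory=dict)      # seq_id -> tokens stored

    def __post_init__(self):
        # Initially every physical block is free.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % BLOCK_SIZE == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = ToyBlockAllocator(num_blocks=4)
for _ in range(17):          # 17 tokens need two 16-token blocks
    alloc.append_token("a")
print(len(alloc.block_tables["a"]), len(alloc.free_blocks))
```

Because blocks are claimed one at a time, at most one partially filled block per sequence is left unused in this sketch, and blocks freed by a finished sequence can be reused by any other sequence.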

Key Features

  • PagedAttention: Manages the KV cache in fixed-size blocks to reduce memory waste
  • Continuous batching: Admits new requests into the running batch to keep the GPU busy
  • Tensor parallelism: Splits a model across multiple GPUs
  • Quantization: Supports GPTQ, AWQ, and FP8
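
As an illustration of the continuous batching feature listed above, the following toy simulation shows the general scheduling idea: a waiting request joins the batch as soon as a slot frees up, instead of waiting for the whole batch to finish. This is a simplified sketch, not vLLM's scheduler; the function name continuous_batching and its parameters are hypothetical, and each step is assumed to produce exactly one token per running request.

```python
from collections import deque


def continuous_batching(requests, max_batch_size):
    """Simulate iteration-level scheduling (toy model).

    requests: list of (request_id, tokens_to_generate) pairs.
    Returns a dict mapping each request_id to the step at which it finished.
    """
    waiting = deque(requests)
    running = {}       # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Before every step, fill free batch slots from the waiting queue.
        while waiting and len(running) < max_batch_size:
            req_id, n_tokens = waiting.popleft()
            running[req_id] = n_tokens
        step += 1
        # One decode step yields one token for every running request.
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] <= 0:
                del running[req_id]
                finished_at[req_id] = step
    return finished_at


result = continuous_batching([("a", 1), ("b", 3), ("c", 2)], max_batch_size=2)
print(result)  # {'a': 1, 'b': 3, 'c': 3}
```

In this example request "c" takes over the slot freed by "a" at step 2 and finishes at step 3. If the batch of "a" and "b" had to finish completely before "c" could start, "c" would not finish until step 5.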