What is vLLM?
vLLM is a high-throughput LLM inference engine built around four core insights:
PagedAttention
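Manages KV cache memory in fixed-size pages (blocks), like virtual memory in an OS. This eliminates external fragmentation, confines wasted space to the last, partially filled block of each sequence, and lets requests share blocks that hold a common prompt prefix.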
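The block-table idea can be sketched in a few lines. This is a minimal illustration, not vLLM's allocator: `BlockPool`, `Sequence`, and the block size of 16 are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


@dataclass
class BlockPool:
    """Hypothetical pool that hands out physical block ids from a free list."""
    num_blocks: int
    free: list[int] = field(init=False)

    def __post_init__(self) -> None:
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


@dataclass
class Sequence:
    """Hypothetical per-request state: logical-to-physical block mapping."""
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_tokens(self, n: int, pool: BlockPool) -> None:
        self.num_tokens += n
        blocks_needed = -(-self.num_tokens // BLOCK_SIZE)  # ceiling division
        # Allocate only when the sequence grows past its last block.
        while len(self.block_table) < blocks_needed:
            self.block_table.append(pool.allocate())

    def slot(self, token_idx: int) -> tuple[int, int]:
        """Physical (block id, offset) for a logical token position."""
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE


pool = BlockPool(num_blocks=8)
seq = Sequence()
seq.append_tokens(20, pool)  # 20 tokens need 2 blocks
seq.append_tokens(1, pool)   # 21 tokens still fit in 2 blocks
```

Because blocks are allocated on demand and need not be contiguous, the only unused space is the tail of the sequence's last block.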
Continuous Batching
Requests are processed iteration-by-iteration. New requests join and finished ones leave at each step, keeping GPU utilization high without waiting for a full batch.
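A toy simulation shows the effect of iteration-level scheduling. The function name and the request tuples are assumptions made for this sketch, and each request is assumed to generate at least one token.

```python
from collections import deque


def run_continuous_batching(requests, max_batch_size):
    """Simulate iteration-level scheduling.

    requests: list of (request_id, tokens_to_generate) in arrival order.
    Returns the total number of steps and the step at which each
    request finished.
    """
    waiting = deque(requests)
    running = {}      # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < max_batch_size:
            req_id, remaining = waiting.popleft()
            running[req_id] = remaining
        step += 1
        # One decode iteration: every running request emits one token.
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] == 0:
                del running[req_id]
                finished_at[req_id] = step
    return step, finished_at


steps, finished_at = run_continuous_batching(
    [("a", 2), ("b", 5), ("c", 3)], max_batch_size=2
)
```

Here request `c` takes over the slot freed by `a` at step 3, so all three requests finish in 5 steps. Running `a` and `b` as one fixed batch and only then starting `c` would take 5 + 3 = 8 steps.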
Chunked Prefill
Long prompts are split into chunks and interleaved with decode steps, balancing latency and throughput rather than blocking on prefill.
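One way to picture the interleaving is a per-step token budget that decodes draw from first, with the remainder going to prefill chunks. The policy and the name `schedule_step` below are assumptions for illustration, not a description of vLLM's scheduler.

```python
def schedule_step(decoding, prefilling, token_budget):
    """Plan one step: decodes first, then prefill chunks with what is left.

    decoding:   list of request ids that are generating (1 token each).
    prefilling: dict request_id -> prompt tokens not yet processed
                (updated in place).
    Returns dict request_id -> number of tokens scheduled this step.
    """
    plan = {}
    budget = token_budget
    for req_id in decoding:
        if budget == 0:
            break
        plan[req_id] = 1
        budget -= 1
    for req_id, remaining in prefilling.items():
        if budget == 0:
            break
        chunk = min(remaining, budget)  # take only what the budget allows
        plan[req_id] = chunk
        prefilling[req_id] = remaining - chunk
        budget -= chunk
    return plan


prefilling = {"long": 20}
step1 = schedule_step(["d1", "d2"], prefilling, token_budget=8)
step2 = schedule_step(["d1", "d2"], prefilling, token_budget=8)
```

With a budget of 8 tokens, the 20-token prompt is processed 6 tokens at a time while `d1` and `d2` keep producing one token per step instead of stalling until the prefill completes.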
Prefix Caching
Identical prompt prefixes share KV cache blocks via content hashing. Repeated system prompts or RAG contexts are computed only once.
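Content hashing of blocks can be sketched as a hash chain, where each full block's key covers its own tokens plus the key of the block before it. The hashing scheme, the block size of 4, and the `PrefixCache` class are assumptions for this example rather than vLLM's implementation.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block (illustrative value, an assumption)


def block_hashes(token_ids):
    """Hash each full block together with its parent's hash."""
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Hypothetical cache mapping block hashes to physical block ids."""

    def __init__(self):
        self.blocks = {}
        self.next_id = 0

    def lookup_or_insert(self, token_ids):
        """Return (block ids, number of blocks reused from the cache)."""
        ids, hits = [], 0
        for h in block_hashes(token_ids):
            if h in self.blocks:
                hits += 1
            else:
                self.blocks[h] = self.next_id
                self.next_id += 1
            ids.append(self.blocks[h])
        return ids, hits


cache = PrefixCache()
system = [1, 2, 3, 4, 5, 6, 7, 8]  # shared system prompt, 2 full blocks
ids_a, hits_a = cache.lookup_or_insert(system + [9, 10, 11, 12])
ids_b, hits_b = cache.lookup_or_insert(system + [20, 21, 22, 23])
```

The second request reuses the two blocks that hold the shared system prompt and only needs a new block for its own suffix. Chaining the parent hash means a block matches only when everything before it matches too.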
End-to-End Data Flow
Key Data Structures
| Struct | Direction | File | Purpose |
|---|---|---|---|
| SamplingParams | in | sampling_params.py | User-facing: temperature, top_p, max_tokens, stop… |
| EngineCoreRequest | in | v1/engine/__init__.py | Serializable IPC request (msgspec) |
| Request | internal | v1/request.py | Scheduler-internal state with lifecycle tracking |
| SchedulerOutput | internal | v1/core/sched/output.py | Batch description + block assignments sent to executor |
| ModelRunnerOutput | internal | v1/outputs.py | Raw logits + sampled tokens from device |
| EngineCoreOutput | out | v1/engine/__init__.py | Per-request new tokens + finish reason |
| RequestOutput | out | outputs.py | User-facing: decoded text, logprobs, metrics |
Source Layout
vllm/
├── entrypoints/                 # LLM, OpenAI API, gRPC server
│   ├── llm.py                   # Offline inference class
│   └── openai/                  # HTTP API server (FastAPI)
├── v1/                          # V1 engine (production)
│   ├── engine/
│   │   ├── core.py              # EngineCore orchestrator
│   │   ├── core_client.py       # IPC client / proxy
│   │   ├── async_llm.py         # Async engine (online serving)
│   │   ├── llm_engine.py        # Sync engine (offline)
│   │   ├── input_processor.py
│   │   └── output_processor.py
│   ├── core/
│   │   ├── sched/
│   │   │   ├── scheduler.py     # Main scheduler
│   │   │   ├── output.py        # SchedulerOutput
│   │   │   └── request_queue.py
│   │   ├── kv_cache_manager.py  # Block allocator
│   │   └── kv_cache_utils.py
│   ├── executor/                # Distributed execution adapters
│   ├── worker/
│   │   ├── gpu_worker.py
│   │   └── gpu_model_runner.py  # Model forward + sampling
│   ├── sample/
│   │   ├── sampler.py
│   │   └── rejection_sampler.py
│   ├── attention/backend.py
│   └── kv_cache_interface.py
├── model_executor/
│   ├── layers/attention/        # Attention layer with paged KV
│   └── model_loader.py
├── config/
│   └── vllm.py                  # VllmConfig (central config)
├── sampling_params.py
└── outputs.py
Class Responsibility Map
| Class | Area | Role |
|---|---|---|
| LLM | entry-points | User API: offline inference |
| AsyncLLM | entry-points | User API: online async serving |
| EngineCore | engine | Orchestrates scheduler + executor each step |
| Scheduler | scheduler | Batches requests, manages KV allocation |
| KVCacheManager | kv-cache | PagedAttention block allocation |
| Executor | execution | Dispatches to workers (single/multi/Ray) |
| LLMModelRunner | execution | Model forward pass, CUDA graphs |
| Attention | attention | QKV attention with paged KV cache |
| Sampler | sampling | Token sampling from logits |
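The division of labour between these roles can be sketched with toy stand-ins. Every class below (`ToyRequest`, `ToyScheduler`, `ToyExecutor`, `ToyEngineCore`) is hypothetical and far simpler than the real classes; the sketch only shows the hand-off of one step: schedule a batch, execute it, feed the sampled tokens back.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    request_id: str
    max_tokens: int
    output: list[int] = field(default_factory=list)


class ToyScheduler:
    """Stand-in for the scheduler: picks which requests run this step."""

    def __init__(self, max_batch: int) -> None:
        self.max_batch = max_batch
        self.requests: list[ToyRequest] = []

    def add(self, req: ToyRequest) -> None:
        self.requests.append(req)

    def schedule(self) -> list[ToyRequest]:
        return self.requests[: self.max_batch]

    def update(self, sampled: dict[str, int]) -> list[ToyRequest]:
        """Record new tokens and return the requests that just finished."""
        done = []
        for req in list(self.requests):
            if req.request_id in sampled:
                req.output.append(sampled[req.request_id])
                if len(req.output) >= req.max_tokens:
                    self.requests.remove(req)
                    done.append(req)
        return done


class ToyExecutor:
    """Stand-in for executor + model runner + sampler."""

    def execute(self, batch: list[ToyRequest]) -> dict[str, int]:
        # A real runner would do a forward pass and sample from logits;
        # here the "token" is just the current output length.
        return {req.request_id: len(req.output) for req in batch}


class ToyEngineCore:
    """Stand-in orchestrator: one call to step() is one engine iteration."""

    def __init__(self, scheduler: ToyScheduler, executor: ToyExecutor) -> None:
        self.scheduler = scheduler
        self.executor = executor

    def step(self) -> list[ToyRequest]:
        batch = self.scheduler.schedule()
        sampled = self.executor.execute(batch)
        return self.scheduler.update(sampled)


engine = ToyEngineCore(ToyScheduler(max_batch=2), ToyExecutor())
engine.scheduler.add(ToyRequest("r1", max_tokens=1))
engine.scheduler.add(ToyRequest("r2", max_tokens=2))
first = engine.step()
second = engine.step()
```

Keeping scheduling state in one place and model execution in another is what lets the orchestrator stay a thin loop over the two.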
External References
- PagedAttention Paper (SOSP 2023) — core memory management idea
- vLLM Blog: High-throughput LLM Serving — PagedAttention motivation
- FlashAttention (Dao et al., 2022) — IO-efficient attention kernel
- FlashAttention-2 — improved parallelism
- FlashInfer — attention kernel used by vLLM for decode
- Orca: Continuous Batching (OSDI 2022) — iteration-level scheduling idea
- vLLM Documentation — official docs
- vLLM Source @ d400445 — pinned commit for all code links