vLLM Architecture

What is vLLM?

vLLM is a high-throughput LLM inference engine built around four core insights:

PagedAttention

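PagedAttention manages KV cache memory in fixed-size pages (blocks), much as an OS manages virtual memory. This removes external fragmentation, limits waste to the last partially filled block of each sequence, and lets requests share blocks that hold a common prompt prefix.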
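The paging idea can be illustrated with a toy block table. This is a minimal sketch, not vLLM's allocator: `BlockPool`, `Sequence`, and the block size of 16 are hypothetical names and values chosen for illustration. It shows a sequence mapping logical token positions to physical blocks and allocating a new block only when the current one is full.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

BLOCK_SIZE = 16  # tokens per KV block (assumed value, for illustration only)


class BlockPool:
    """Hypothetical free-list allocator over a fixed number of physical blocks."""

    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("out of KV cache blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


@dataclass
class Sequence:
    num_tokens: int = 0
    # block_table[i] is the physical block holding logical block i
    block_table: List[int] = field(default_factory=list)

    def append_tokens(self, n: int, pool: BlockPool) -> None:
        """Grow the sequence by n tokens, allocating blocks only when needed."""
        self.num_tokens += n
        blocks_needed = -(-self.num_tokens // BLOCK_SIZE)  # ceiling division
        while len(self.block_table) < blocks_needed:
            self.block_table.append(pool.allocate())

    def slot(self, token_idx: int) -> Tuple[int, int]:
        """Map a token position to (physical block id, offset within the block)."""
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE


pool = BlockPool(num_blocks=8)
seq = Sequence()
seq.append_tokens(20, pool)  # a 20-token prompt needs two blocks
seq.append_tokens(1, pool)   # one decoded token still fits in the second block
```

Because the physical blocks need not be contiguous, any free block can serve any sequence, which is what keeps fragmentation low in this scheme.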

Continuous Batching

Requests are processed iteration-by-iteration. New requests join and finished ones leave at each step, keeping GPU utilization high without waiting for a full batch.
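A small simulation makes the iteration-level behavior concrete. The function below is a hypothetical sketch (not vLLM's scheduler) that assumes every running request produces exactly one token per step and that each request needs at least one token; it records which requests share the batch at each step.

```python
from collections import deque


def run_continuous_batching(requests, max_batch_size):
    """Simulate iteration-level scheduling.

    requests: list of (request_id, tokens_to_generate), each count >= 1.
    Returns the list of request ids that were in the batch at every step.
    """
    waiting = deque(requests)
    running = {}  # request_id -> tokens still to generate
    steps = []
    while waiting or running:
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < max_batch_size:
            request_id, num_tokens = waiting.popleft()
            running[request_id] = num_tokens
        steps.append(sorted(running))
        # One decode iteration: each running request emits one token.
        for request_id in list(running):
            running[request_id] -= 1
            if running[request_id] == 0:
                del running[request_id]  # finished requests leave immediately
    return steps


steps = run_continuous_batching([("a", 3), ("b", 1), ("c", 2)], max_batch_size=2)
```

In this run, request `c` takes over the slot freed by `b` after the first step, so the work finishes in three steps. A static batch of `a` and `b` followed by `c` alone would take five.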

Chunked Prefill

Long prompts are split into chunks and interleaved with decode steps, balancing latency and throughput rather than blocking on prefill.
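One way to picture the interleaving is a per-step token budget. The sketch below assumes a policy in which decode requests are scheduled first (one token each) and the remaining budget goes to prefill chunks; the function name, the policy, and the budget of 4 are illustrative assumptions rather than vLLM's actual scheduler logic.

```python
def schedule_step(decoding, prefilling, token_budget):
    """Plan one step under a token budget.

    decoding: list of request ids in the decode phase (one token each).
    prefilling: dict of request id -> prompt tokens not yet processed.
    Returns a dict of request id -> number of tokens scheduled this step.
    """
    scheduled = {}
    budget = token_budget
    for request_id in decoding:
        if budget == 0:
            break
        scheduled[request_id] = 1
        budget -= 1
    for request_id, remaining in prefilling.items():
        if budget == 0:
            break
        chunk = min(remaining, budget)  # take only what fits in this step
        scheduled[request_id] = chunk
        budget -= chunk
    return scheduled


prefilling = {"long": 10}  # a 10-token prompt still waiting for prefill
decoding = ["chat"]        # a request that is already generating
plan = []
while prefilling:
    step = schedule_step(decoding, prefilling, token_budget=4)
    plan.append(step)
    for request_id, num_tokens in step.items():
        if request_id in prefilling:
            prefilling[request_id] -= num_tokens
            if prefilling[request_id] == 0:
                del prefilling[request_id]
                decoding.append(request_id)  # prefill done, start decoding
```

Here the long prompt is processed over four steps, and the `chat` request keeps receiving a decode token in every one of them instead of stalling behind the prefill.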

Prefix Caching

Identical prompt prefixes share KV cache blocks via content hashing. Repeated system prompts or RAG contexts are computed only once.
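Content hashing of blocks can be sketched as a hash chain, where each block's hash also covers the hash of the block before it, so a match identifies the entire prefix up to that block. The code below is an illustrative assumption: `PrefixCache`, the block size of 4, and the use of SHA-256 are not taken from the vLLM source, and eviction is omitted.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block (small assumed value to keep the example short)


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash every full block, chaining in the parent hash."""
    hashes = []
    parent = b""
    for i in range(len(token_ids) // block_size):
        block = tuple(token_ids[i * block_size:(i + 1) * block_size])
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class PrefixCache:
    """Hypothetical cache mapping block hashes to block ids (no eviction)."""

    def __init__(self):
        self.blocks = {}  # block hash -> block id
        self.next_block_id = 0

    def lookup_or_allocate(self, token_ids):
        """Return (block ids for the full blocks, number of cached tokens)."""
        block_ids = []
        hits = 0
        for digest in block_hashes(token_ids):
            if digest in self.blocks:
                hits += 1  # this block's KV was already computed
            else:
                self.blocks[digest] = self.next_block_id
                self.next_block_id += 1
            block_ids.append(self.blocks[digest])
        return block_ids, hits * BLOCK_SIZE


cache = PrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
first_ids, first_cached = cache.lookup_or_allocate(system_prompt + [100, 101, 102, 103, 104])
second_ids, second_cached = cache.lookup_or_allocate(system_prompt + [200, 201])
```

The second request reuses the two blocks that cover the shared system prompt, so only its own suffix would need a fresh prefill.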

End-to-End Data Flow

User Code

LLM.generate(prompt, SamplingParams)

Offline: LLM class | Online: OpenAI-compatible HTTP API

↓

Input Processor

vllm/v1/engine/input_processor.py

Tokenize · multimodal preprocessing · build EngineCoreRequest

↓

EngineCore

vllm/v1/engine/core.py

Central orchestrator — coordinates Scheduler → Executor → post-processing

↓

Scheduler

vllm/v1/core/sched/scheduler.py

Batches requests · allocates KV cache blocks · emits SchedulerOutput

↓

Executor → Worker → ModelRunner

vllm/v1/executor/ · vllm/v1/worker/gpu_model_runner.py

Distributed execution · model forward pass · CUDA graphs

↓

Attention Layer + KV Cache

vllm/model_executor/layers/attention/attention.py

PagedAttention · FlashInfer · block tables · prefill vs. decode

↓

Sampler

vllm/v1/sample/sampler.py

Temperature · top-p/k · penalties · sample next token

↓

Output Processor

vllm/v1/engine/output_processor.py

Detokenize · assemble RequestOutput · stream to caller
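The middle of the flow above (Scheduler, then Executor, then post-processing) can be sketched as a single step function. Everything here is a toy stand-in: the `Toy*` classes, their method names, and the fake "next token is last token plus one" model are hypothetical and only mirror the shape of the loop, not the real classes in `vllm/v1/`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # compare by identity so list.remove() targets this object
class ToyRequest:
    request_id: str
    prompt_token_ids: List[int]
    max_tokens: int
    output_token_ids: List[int] = field(default_factory=list)


class ToyScheduler:
    def __init__(self) -> None:
        self.running: List[ToyRequest] = []

    def add_request(self, request: ToyRequest) -> None:
        self.running.append(request)

    def schedule(self) -> List[ToyRequest]:
        # Stand-in for building a SchedulerOutput: batch every running request.
        return list(self.running)

    def update(self, batch: List[ToyRequest], sampled: List[int]) -> List[ToyRequest]:
        """Record sampled tokens and return the requests that just finished."""
        finished = []
        for request, token in zip(batch, sampled):
            request.output_token_ids.append(token)
            if len(request.output_token_ids) >= request.max_tokens:
                self.running.remove(request)
                finished.append(request)
        return finished


class ToyExecutor:
    def execute_model(self, batch: List[ToyRequest]) -> List[int]:
        # Fake forward pass plus sampling: next token = last token + 1.
        return [(r.output_token_ids or r.prompt_token_ids)[-1] + 1 for r in batch]


class ToyEngineCore:
    def __init__(self) -> None:
        self.scheduler = ToyScheduler()
        self.executor = ToyExecutor()

    def add_request(self, request: ToyRequest) -> None:
        self.scheduler.add_request(request)

    def step(self) -> List[ToyRequest]:
        batch = self.scheduler.schedule()
        if not batch:
            return []
        sampled = self.executor.execute_model(batch)
        return self.scheduler.update(batch, sampled)


engine = ToyEngineCore()
engine.add_request(ToyRequest("r1", [1, 2, 3], max_tokens=2))
engine.add_request(ToyRequest("r2", [10], max_tokens=1))
first = engine.step()   # both requests run; r2 reaches max_tokens
second = engine.step()  # only r1 is left in the batch
```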

Key Data Structures

Struct	Direction	File	Purpose
`SamplingParams`	in	sampling_params.py	User-facing: temperature, top_p, max_tokens, stop…
`EngineCoreRequest`	in	v1/engine/__init__.py	Serializable IPC request (msgspec)
`Request`	internal	v1/request.py	Scheduler-internal state with lifecycle tracking
`SchedulerOutput`	internal	v1/core/sched/output.py	Batch description + block assignments sent to executor
`ModelRunnerOutput`	internal	v1/outputs.py	Raw logits + sampled tokens from device
`EngineCoreOutput`	out	v1/engine/__init__.py	Per-request new tokens + finish reason
`RequestOutput`	out	outputs.py	User-facing: decoded text, logprobs, metrics

Source Layout

vllm/
├── entrypoints/          # LLM, OpenAI API, gRPC server
│   ├── llm.py            # Offline inference class
│   └── openai/           # HTTP API server (FastAPI)
├── v1/                   # V1 engine (production)
│   ├── engine/
│   │   ├── core.py       # EngineCore orchestrator
│   │   ├── core_client.py # IPC client / proxy
│   │   ├── async_llm.py  # Async engine (online serving)
│   │   ├── llm_engine.py # Sync engine (offline)
│   │   ├── input_processor.py
│   │   └── output_processor.py
│   ├── core/
│   │   ├── sched/
│   │   │   ├── scheduler.py     # Main scheduler
│   │   │   ├── output.py        # SchedulerOutput
│   │   │   └── request_queue.py
│   │   ├── kv_cache_manager.py  # Block allocator
│   │   └── kv_cache_utils.py
│   ├── executor/         # Distributed execution adapters
│   ├── worker/
│   │   ├── gpu_worker.py
│   │   └── gpu_model_runner.py  # Model forward + sampling
│   ├── sample/
│   │   ├── sampler.py
│   │   └── rejection_sampler.py
│   ├── attention/backend.py
│   └── kv_cache_interface.py
├── model_executor/
│   ├── layers/attention/ # Attention layer with paged KV
│   └── model_loader.py
├── config/
│   └── vllm.py           # VllmConfig (central config)
├── sampling_params.py
└── outputs.py

Class Responsibility Map

Class	Area	Role
`LLM`	entry-points	User API — offline inference
`AsyncLLM`	entry-points	User API — online async serving
`EngineCore`	engine	Orchestrates scheduler + executor each step
`Scheduler`	scheduler	Batches requests, manages KV allocation
`KVCacheManager`	kv-cache	PagedAttention block allocation
`Executor`	execution	Dispatches to workers (single/multi/Ray)
`GPUModelRunner`	execution	Model forward pass, CUDA graphs
`Attention`	attention	QKV attention with paged KV cache
`Sampler`	sampling	Token sampling from logits
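The Sampler's role, turning logits into a token under temperature, top-k, and top-p controls, can be sketched in plain Python. This is an illustrative implementation under stated assumptions: the function name `sample_token` is hypothetical, top-k is applied before top-p, penalties are omitted, and a temperature of 0 is treated as greedy decoding. The real sampler works on batched tensors on the device.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Sample one token index from a list of logits.

    top_k=0 disables top-k filtering; top_p=1.0 disables nucleus filtering.
    """
    rng = rng or random.Random()
    if temperature == 0.0:
        # Greedy decoding: pick the highest logit.
        return max(range(len(logits)), key=logits.__getitem__)

    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank candidates from most to least likely.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Nucleus filter: keep the smallest set whose probability mass reaches top_p.
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # random.choices renormalizes the weights of the surviving candidates.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


token = sample_token([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_k=3, top_p=0.9,
                     rng=random.Random(0))
```

Passing a seeded `random.Random` keeps runs reproducible, which is convenient when comparing sampling settings.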

External References

PagedAttention Paper (SOSP 2023) — core memory management idea
vLLM Blog: High-throughput LLM Serving — PagedAttention motivation
FlashAttention (Dao et al., 2022) — IO-efficient attention kernel
FlashAttention-2 — improved parallelism
FlashInfer — attention kernel used by vLLM for decode
Orca: Continuous Batching (OSDI 2022) — iteration-level scheduling idea
vLLM Documentation — official docs
vLLM Source @ d400445 — pinned commit for all code links
