vLLM Internals

Engine Layer

The engine is the central orchestrator. It owns the request lifecycle, drives the scheduler, dispatches work to the executor, and coordinates the iterative generation loop. Two layers exist: the EngineCore (compute side) and the front end (LLMEngine / AsyncLLM), connected via IPC.

Architecture: Two-Process Split

Front End vs. Engine Core

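In production, vLLM separates the HTTP/user-facing front end from the GPU-facing engine core into distinct processes. Front-end work such as request handling, input processing, and output processing then no longer competes with the engine loop for the Python GIL, so the schedule/execute loop can run without being interrupted by it.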

┌──────────────────────────────────┐    IPC (msgspec over socket)
│  FRONT-END PROCESS               │◄──────────────────────────────►
│  LLMEngine / AsyncLLM            │
│  ├─ InputProcessor               │    ┌──────────────────────────┐
│  ├─ OutputProcessor              │    │  ENGINE-CORE PROCESS     │
│  └─ EngineCoreClient (proxy)     │    │  EngineCore              │
└──────────────────────────────────┘    │  ├─ Scheduler            │
                                        │  ├─ KVCacheManager       │
                                        │  └─ Executor → Workers   │
                                        └──────────────────────────┘

Single-process fallback: both sides run in one process (debugging/testing)
Ray mode: engine core lives on Ray actors

The boundary is EngineCoreClient. On the client side it serializes EngineCoreRequest structs; on the server side EngineCore deserializes and processes them.

EngineCore

EngineCore — The Main Loop

Source: vllm/v1/engine/core.py

EngineCore is instantiated once. It holds the scheduler, executor, and KV cache config. Its step() method is the main generation loop — called repeatedly until all requests are done.

class EngineCore:
    def __init__(self, vllm_config: VllmConfig, ...):
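        # Order matters: the executor must exist before the KV cache can be
        # sized, and the scheduler needs the resulting KV cache config.
        self.model_executor = Executor.get_class(vllm_config)(vllm_config)
        self.kv_cache_config: KVCacheConfig = self._init_kv_cache()
        self.scheduler = Scheduler(vllm_config, self.kv_cache_config, ...)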

    def add_request(self, request: EngineCoreRequest) -> None:
        # Deserialize and hand to scheduler
        self.scheduler.add_request(Request.from_engine_core_request(request))

    def step(self) -> tuple[dict[str, EngineCoreOutputs], bool]:
        # 1. Scheduler decides what to run this iteration
        scheduler_output = self.scheduler.schedule()

        # 2. Executor runs the model (may span multiple GPUs)
        model_output = self.model_executor.execute_model(scheduler_output)

        # 3. Scheduler updates state from results (free blocks, mark done)
        engine_core_outputs = self.scheduler.update_from_output(
            scheduler_output, model_output
        )
        return engine_core_outputs, self.scheduler.has_unfinished_requests()
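
The schedule / execute / update cycle above can be tried out with a toy version. This is a minimal sketch, not vLLM code: ToyRequest, ToyScheduler, ToyEngineCore, and run_to_completion are hypothetical stand-ins, the "model" just emits a counter instead of sampling tokens, and a request finishes when it reaches its max_tokens.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    # Hypothetical, heavily simplified request record.
    request_id: str
    max_tokens: int
    output_token_ids: list = field(default_factory=list)


class ToyScheduler:
    """Hypothetical stand-in: FIFO admission with a fixed batch size."""

    def __init__(self, max_num_seqs: int = 2):
        self.max_num_seqs = max_num_seqs
        self.waiting = deque()
        self.running = []

    def add_request(self, request: ToyRequest) -> None:
        self.waiting.append(request)

    def schedule(self) -> list:
        # Admit waiting requests until the batch is full.
        while self.waiting and len(self.running) < self.max_num_seqs:
            self.running.append(self.waiting.popleft())
        return list(self.running)

    def update_from_output(self, batch: list, new_tokens: list) -> list:
        # Record one new token per request and retire finished requests.
        finished = []
        for request, token in zip(batch, new_tokens):
            request.output_token_ids.append(token)
            if len(request.output_token_ids) >= request.max_tokens:
                self.running.remove(request)
                finished.append(request.request_id)
        return finished

    def has_unfinished_requests(self) -> bool:
        return bool(self.waiting or self.running)


class ToyEngineCore:
    def __init__(self, max_num_seqs: int = 2):
        self.scheduler = ToyScheduler(max_num_seqs)

    def add_request(self, request: ToyRequest) -> None:
        self.scheduler.add_request(request)

    def step(self) -> tuple:
        # 1. Decide what runs this iteration.
        batch = self.scheduler.schedule()
        # 2. "Run the model": the fake token is the current output length.
        new_tokens = [len(r.output_token_ids) for r in batch]
        # 3. Fold the results back into scheduler state.
        finished = self.scheduler.update_from_output(batch, new_tokens)
        return finished, self.scheduler.has_unfinished_requests()


def run_to_completion(core: ToyEngineCore) -> tuple:
    """Call step() until nothing is left; return (finish order, step count)."""
    finished_order, steps = [], 0
    has_more = core.scheduler.has_unfinished_requests()
    while has_more:
        finished, has_more = core.step()
        finished_order.extend(finished)
        steps += 1
    return finished_order, steps
```

With a batch size of 2 and three requests, the third request only enters the batch once a slot frees up, which is the iteration-level batching behavior the loop is designed around.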

Initialization sequence

EngineCore.__init__
↓
Executor.get_class(vllm_config)
Select: UniProc / Multiproc / Ray / External
↓
executor.determine_num_available_blocks()
GPU memory profiling to decide KV cache size
↓
executor.initialize_cache(num_gpu_blocks)
Allocate GPU KV cache tensors
↓
Scheduler(vllm_config, kv_cache_config)
Ready to accept requests
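
To give a feel for the sizing step, here is a back-of-the-envelope calculation. It is a sketch under stated assumptions rather than what the profiler does: it uses the usual dense-attention estimate (one key and one value vector per layer, per KV head, per token), and the helper names kv_block_bytes and num_gpu_blocks as well as the 20 GiB figure for weights and activations are made up for illustration. The real profiling run measures peak memory instead of estimating it.

```python
GIB = 1024 ** 3


def kv_block_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   block_size: int, dtype_bytes: int) -> int:
    """Bytes needed to hold one KV cache block across all layers.

    The factor 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * block_size * dtype_bytes


def num_gpu_blocks(total_bytes: int, gpu_memory_utilization: float,
                   non_kv_bytes: int, block_bytes: int) -> int:
    """Blocks that fit in the memory budget left after weights/activations."""
    budget = int(total_bytes * gpu_memory_utilization) - non_kv_bytes
    return max(budget // block_bytes, 0)


# Illustrative model shape: 32 layers, 8 KV heads, head_dim 128, fp16.
block_bytes = kv_block_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                             block_size=16, dtype_bytes=2)

# Illustrative device: 80 GiB, 90% usable, 20 GiB assumed for non-KV memory.
blocks = num_gpu_blocks(total_bytes=80 * GIB, gpu_memory_utilization=0.9,
                        non_kv_bytes=20 * GIB, block_bytes=block_bytes)

print(block_bytes // (1024 ** 2), "MiB per block")
print(blocks, "blocks,", blocks * 16, "cacheable tokens")
```

Under these assumptions each block takes 2 MiB, so the 52 GiB left for the cache holds 26624 blocks, or 425984 tokens at a block size of 16.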

Request Lifecycle in EngineCore

Requests move through states managed by Request (vllm/v1/request.py):

class RequestStatus(enum.Enum):
    WAITING        = "waiting"    # in request queue
    RUNNING        = "running"    # being processed this step
    PREEMPTED      = "preempted"  # evicted due to memory pressure
    FINISHED_STOP  = "finished_stop"
    FINISHED_LENGTH = "finished_length"
    FINISHED_ABORT = "finished_abort"

class Request:
    request_id: str
    prompt_token_ids: list[int]
    sampling_params: SamplingParams
    status: RequestStatus

    # Tracking progress
    num_computed_tokens: int   # tokens whose KV is in cache
    output_token_ids: list[int]
    stop_reason: int | str | None

WAITING
In RequestQueue, awaiting scheduler pickup
↓ scheduler.schedule() picks it
RUNNING
Assigned KV blocks, included in current batch
↓ EOS / max_tokens / stop string hit
FINISHED_*
Blocks freed, EngineCoreOutput emitted

Under memory pressure, RUNNING → PREEMPTED (blocks reclaimed) → WAITING (requeued for re-prefill).
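
The lifecycle can also be written down as an explicit transition table. The sketch below is illustrative only: ALLOWED_TRANSITIONS, transition, and the is_finished property are hypothetical helpers, and allowing an abort from WAITING or PREEMPTED is an assumption that the text above does not state.

```python
import enum


class RequestStatus(enum.Enum):
    WAITING = "waiting"
    RUNNING = "running"
    PREEMPTED = "preempted"
    FINISHED_STOP = "finished_stop"
    FINISHED_LENGTH = "finished_length"
    FINISHED_ABORT = "finished_abort"

    @property
    def is_finished(self) -> bool:
        # Hypothetical helper: every terminal state shares the same prefix.
        return self.value.startswith("finished_")


S = RequestStatus

# Which states may follow which. Terminal states have no outgoing edges.
ALLOWED_TRANSITIONS = {
    S.WAITING: {S.RUNNING, S.FINISHED_ABORT},    # abort here is an assumption
    S.RUNNING: {S.PREEMPTED, S.FINISHED_STOP,
                S.FINISHED_LENGTH, S.FINISHED_ABORT},
    S.PREEMPTED: {S.WAITING, S.FINISHED_ABORT},  # abort here is an assumption
}


def transition(current: RequestStatus, new: RequestStatus) -> RequestStatus:
    """Return the new status, or raise if the move is not in the table."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current.name} -> {new.name}")
    return new


# A request that is preempted once and then finishes on a stop condition.
status = S.WAITING
for nxt in (S.RUNNING, S.PREEMPTED, S.WAITING, S.RUNNING, S.FINISHED_STOP):
    status = transition(status, nxt)
```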

EngineCoreClient — IPC Layer

Cross-Process Communication

Source: vllm/v1/engine/core_client.py

The client abstracts the communication mechanism. vLLM supports multiple backends:

Mode               Class                     Use Case
In-process         UniProcEngineCoreClient   Debugging, single-GPU, tests
Multiprocessing    MPClient                  Default multi-GPU on single node
Ray                RayClient                 Multi-node distributed serving
External launcher  ExternalLauncherClient    Kubernetes / custom orchestration

Serialization

# msgspec is used for fast, typed binary (MessagePack) serialization
import msgspec

encoder = msgspec.msgpack.Encoder()
decoder = msgspec.msgpack.Decoder(EngineCoreRequest)

# `sock` stands for whichever IPC socket connects the two processes.

# Send (front-end side): encode the request and write it to the socket
wire_bytes = encoder.encode(engine_core_request)
sock.send(wire_bytes)

# Receive (engine-core side): read the bytes and decode a typed request
engine_core_request = decoder.decode(sock.recv())

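msgspec is generally much faster than pickle or standard-library JSON for structured data, although the exact speedup depends on message shape and size. This matters in high-QPS scenarios, where serialization sits on the critical path of every request and every output.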

LLMEngine (Sync) & AsyncLLM (Async)

LLMEngine — Synchronous Wrapper

Source: vllm/v1/engine/llm_engine.py

LLMEngine is the sync wrapper used by the offline LLM class. It drives the engine core in a loop.

class LLMEngine:
    def __init__(self, vllm_config: VllmConfig):
        self.engine_core = EngineCoreClient.get_class(vllm_config)(vllm_config)
        self.input_processor = InputProcessor(vllm_config, ...)
        self.output_processor = OutputProcessor(...)

    def add_request(self, request_id: str, prompt, params):
        engine_request = self.input_processor.process(prompt, params, ...)
        self.engine_core.add_request(engine_request)

    def step(self) -> list[RequestOutput]:
        # Drive one engine iteration
        engine_core_outputs, has_more = self.engine_core.step()
        return self.output_processor.process_outputs(engine_core_outputs)

    def has_unfinished_requests(self) -> bool: ...
    def abort_request(self, request_id: str): ...
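
The way an offline caller drives this interface can be sketched with a stub. StubEngine and run_offline are hypothetical and exist only to make the loop runnable: the stub "generates" by echoing each prompt back one word per step, and its step() returns (request_id, token, finished) tuples rather than real RequestOutput objects.

```python
class StubEngine:
    """Hypothetical stand-in exposing the same three calls as LLMEngine."""

    def __init__(self):
        self._pending = {}

    def add_request(self, request_id: str, prompt: str, params=None) -> None:
        # Queue the words to "generate"; an empty prompt yields one empty token.
        self._pending[request_id] = prompt.split() or [""]

    def step(self) -> list:
        # One iteration: every unfinished request produces exactly one token.
        outputs = []
        for request_id in list(self._pending):
            words = self._pending[request_id]
            token = words.pop(0)
            finished = not words
            outputs.append((request_id, token, finished))
            if finished:
                del self._pending[request_id]
        return outputs

    def has_unfinished_requests(self) -> bool:
        return bool(self._pending)


def run_offline(engine, prompts: list) -> list:
    """Submit all prompts, then step until the engine reports it is drained."""
    request_ids = [str(i) for i in range(len(prompts))]
    for request_id, prompt in zip(request_ids, prompts):
        engine.add_request(request_id, prompt)

    tokens = {request_id: [] for request_id in request_ids}
    while engine.has_unfinished_requests():
        for request_id, token, _finished in engine.step():
            tokens[request_id].append(token)

    # Return results in submission order, not completion order.
    return [" ".join(tokens[request_id]) for request_id in request_ids]
```

Outputs arrive interleaved across requests, so the caller keys them by request ID and restores submission order at the end.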

Configuration — VllmConfig

Source: vllm/config/vllm.py

VllmConfig is the central configuration object passed everywhere. It composes all sub-configs:

class VllmConfig:
    model_config: ModelConfig           # model, dtype, quantization
    cache_config: CacheConfig           # block_size, num_gpu_blocks
    parallel_config: ParallelConfig     # tensor/pipeline parallelism
    scheduler_config: SchedulerConfig   # max_num_seqs, max_num_batched_tokens
    device_config: DeviceConfig         # cuda, cpu, tpu, …
    speculative_config: SpeculativeConfig
    attention_config: AttentionConfig   # backend choice
    compilation_config: CompilationConfig  # CUDA graphs, torch.compile
    observability_config: ObservabilityConfig
    kv_transfer_config: KVTransferConfig | None  # disaggregated prefill/decode
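
The composition pattern can be shown with plain dataclasses. This is a reduced sketch with hypothetical classes: ToyVllmConfig and max_cached_tokens are invented for the example, only two sub-configs are modeled, and the default values are placeholders rather than vLLM's defaults. Treating num_gpu_blocks as unset until memory profiling has run is also an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CacheConfig:
    block_size: int = 16                # tokens per KV cache block
    num_gpu_blocks: int | None = None   # assumed unset until profiling runs


@dataclass
class SchedulerConfig:
    max_num_seqs: int = 256             # placeholder value
    max_num_batched_tokens: int = 8192  # placeholder value


@dataclass
class ToyVllmConfig:
    # Each component receives the whole object and reads its own sub-config.
    cache_config: CacheConfig = field(default_factory=CacheConfig)
    scheduler_config: SchedulerConfig = field(default_factory=SchedulerConfig)


def max_cached_tokens(config: ToyVllmConfig) -> int:
    """Total tokens the KV cache can hold once its size is known."""
    cache = config.cache_config
    if cache.num_gpu_blocks is None:
        raise ValueError("num_gpu_blocks is not set; run profiling first")
    return cache.block_size * cache.num_gpu_blocks


config = ToyVllmConfig()
config.cache_config.num_gpu_blocks = 26624  # e.g. filled in after profiling
capacity = max_cached_tokens(config)
```

Passing one composed object keeps constructor signatures stable: a new sub-config can be added without touching every call site that already takes the top-level config.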