Executor, Worker & Model Runner
The execution layer runs the actual transformer model. It has three tiers: the Executor abstracts distribution strategy, the Worker manages one device (GPU), and the ModelRunner runs the model forward pass and sampling on that device.
Execution Tier Architecture
Three-Tier Stack
Executor
Executor Implementations
Source: vllm/v1/executor/abstract.py
The executor is selected at startup based on parallel_config.distributed_executor_backend:
| Executor | Backend | When |
|---|---|---|
| UniProcExecutor | in-process | Single GPU, debugging |
| MultiprocExecutor | multiprocessing | Multi-GPU on one node (default) |
| RayDistributedExecutor | Ray | Multi-node cluster |
| ExternalLauncherExecutor | external | Custom launcher (Kubernetes, SLURM) |
class Executor(ABC):
    @classmethod
    def get_class(cls, vllm_config: VllmConfig) -> type["Executor"]:
        backend = vllm_config.parallel_config.distributed_executor_backend
        if backend == "mp":
            return MultiprocExecutor
        elif backend == "ray":
            return RayDistributedExecutor
        ...

    @abstractmethod
    def execute_model(
        self, scheduler_output: SchedulerOutput
    ) -> ModelRunnerOutput:
        # Broadcast input to all workers, gather outputs
        ...
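The selection step can also be written as a lookup table instead of an if/elif chain. The following is a minimal sketch of that alternative, not the vLLM implementation: the `register` decorator, the `_registry` dict, and the toy `execute_model` bodies are hypothetical, and `get_class` takes the backend string directly rather than a `VllmConfig`.

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Toy base class; the registry is a hypothetical stand-in."""

    _registry: dict[str, type["Executor"]] = {}

    @classmethod
    def register(cls, backend: str):
        # Decorator that maps a backend string to an executor class.
        def decorator(impl):
            cls._registry[backend] = impl
            return impl
        return decorator

    @classmethod
    def get_class(cls, backend: str) -> type["Executor"]:
        try:
            return cls._registry[backend]
        except KeyError:
            raise ValueError(f"unknown executor backend: {backend!r}") from None

    @abstractmethod
    def execute_model(self, scheduler_output): ...


@Executor.register("mp")
class MultiprocExecutor(Executor):
    def execute_model(self, scheduler_output):
        return f"mp:{scheduler_output}"


@Executor.register("ray")
class RayDistributedExecutor(Executor):
    def execute_model(self, scheduler_output):
        return f"ray:{scheduler_output}"


executor = Executor.get_class("mp")()
result = executor.execute_model("step-0")
```

A table keeps the backend strings and classes in one place, at the cost of requiring every implementation to be imported before lookup.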
Tensor Parallelism
For large models split across multiple GPUs with tensor parallelism (TP), the executor launches one worker per GPU. Workers communicate via NCCL all_reduce during the forward pass (after each attention and MLP block). The executor delivers each step's input to the workers and reads the sampled output back from rank 0 only.
Each GPU holds a shard of every transformer layer. After each attention and MLP block, an all_reduce sums the partial results across GPUs, so every rank ends up with the full activation.
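Why summing partial results recovers the full output can be checked with a small pure-Python simulation. This is only an illustrative sketch: `matvec`, `all_reduce_sum`, and `row_parallel_matvec` are hypothetical helpers, lists stand in for GPU tensors, and the "ranks" are loop iterations rather than processes. Each rank holds a slice of the input and the matching rows of the weight matrix, computes a partial product, and the simulated all_reduce adds the partials.

```python
def matvec(x, w):
    """Compute y = x @ w for a vector x (length k) and a k-by-n matrix w."""
    n = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n)]


def all_reduce_sum(partials):
    """Stand-in for an all_reduce: every rank receives the element-wise sum."""
    total = [sum(values) for values in zip(*partials)]
    return [list(total) for _ in partials]


def row_parallel_matvec(x, w, tp_size):
    """Shard the inner dimension across tp_size ranks, then all_reduce."""
    k = len(x)
    assert k % tp_size == 0, "inner dimension must divide evenly"
    shard = k // tp_size
    partials = []
    for rank in range(tp_size):
        lo, hi = rank * shard, (rank + 1) * shard
        partials.append(matvec(x[lo:hi], w[lo:hi]))
    return all_reduce_sum(partials)


x = [1, 2, 3, 4]
w = [[1, 0], [0, 1], [2, 2], [1, 3]]
full = matvec(x, w)                               # single-device result
per_rank = row_parallel_matvec(x, w, tp_size=2)   # one copy per rank
```

Both entries of `per_rank` equal `full`, which is the property the all_reduce provides after each sharded block.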
Worker
GPU Worker
Source: vllm/v1/worker/gpu_worker.py | worker_base.py
One worker process per GPU. Lifecycle:
class GPUWorker(WorkerBase):
    def init_device(self):
        # Set CUDA device, init distributed group (NCCL)
        torch.cuda.set_device(self.local_rank)
        dist.init_process_group(backend="nccl", ...)
        self.gpu_model_runner = LLMModelRunner(self.vllm_config, ...)

    def load_model(self):
        # Download from HuggingFace, load weights into GPU tensors
        self.gpu_model_runner.load_model()

    def initialize_cache(self, num_gpu_blocks: int):
        # Allocate KV cache tensors on GPU
        self.gpu_model_runner.initialize_kv_cache(num_gpu_blocks)

    def compile_or_warm_up_model(self):
        # Run dummy inference to trigger torch.compile / CUDA graph capture
        self.gpu_model_runner.capture_model()

    def execute_model(
        self,
        scheduler_output: SchedulerOutput,
    ) -> ModelRunnerOutput:
        return self.gpu_model_runner.execute_model(scheduler_output)
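The lifecycle can be exercised without a GPU using a stand-in class. The sketch below assumes the setup methods are called in the order they are listed above; `ToyWorker`, `SETUP_ORDER`, and the order check are hypothetical and exist only to make the sequence explicit.

```python
class ToyWorker:
    """Hypothetical worker that only records lifecycle calls; no GPU involved."""

    SETUP_ORDER = (
        "init_device",
        "load_model",
        "initialize_cache",
        "compile_or_warm_up_model",
    )

    def __init__(self):
        self.completed = []

    def _step(self, name):
        # Each setup step must be the next one in SETUP_ORDER.
        if len(self.completed) >= len(self.SETUP_ORDER):
            raise RuntimeError(f"{name} called after setup finished")
        expected = self.SETUP_ORDER[len(self.completed)]
        if name != expected:
            raise RuntimeError(f"expected {expected}, got {name}")
        self.completed.append(name)

    def init_device(self):
        self._step("init_device")

    def load_model(self):
        self._step("load_model")

    def initialize_cache(self, num_gpu_blocks):
        self._step("initialize_cache")
        self.num_gpu_blocks = num_gpu_blocks

    def compile_or_warm_up_model(self):
        self._step("compile_or_warm_up_model")

    def execute_model(self, scheduler_output):
        # Serving is only allowed once every setup step has run.
        if len(self.completed) != len(self.SETUP_ORDER):
            raise RuntimeError("worker is not ready")
        return {"echo": scheduler_output}


worker = ToyWorker()
worker.init_device()
worker.load_model()
worker.initialize_cache(num_gpu_blocks=128)
worker.compile_or_warm_up_model()
result = worker.execute_model("step-0")
```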
LLMModelRunner
Model Forward Pass
Source: vllm/v1/worker/gpu_model_runner.py
LLMModelRunner is the core of GPU-side execution. It prepares the input tensors from the scheduler's abstract description and runs them through the model.
class LLMModelRunner(nn.Module):
    def execute_model(
        self, scheduler_output: SchedulerOutput
    ) -> ModelRunnerOutput:
        # 1. Build input tensors from scheduler description
        input_ids, positions, attn_metadata = self._prepare_inputs(scheduler_output)

        # 2. Forward pass through transformer
        hidden_states = self.model(
            input_ids=input_ids,
            positions=positions,
            kv_caches=self.kv_cache,
            attn_metadata=attn_metadata,
        )

        # 3. Select only the *last* hidden state per sequence (for sampling)
        #    (earlier prefill positions don't produce output tokens)
        logits = self.compute_logits(hidden_states, scheduler_output)

        # 4. Sample next tokens
        # Per-request sampling parameters; assumed here to live on the
        # runner's persistent batch state.
        sampling_metadata = self.input_batch.sampling_metadata
        sampler_output = self.sampler(logits, sampling_metadata)

        return ModelRunnerOutput(
            sampled_token_ids=sampler_output.sampled_tokens,
            logprobs=sampler_output.logprobs,
        )
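Step 3 relies on knowing where each sequence's last token sits in the flattened batch. One way to compute those indices, sketched here in plain Python under the assumption that tokens are concatenated request by request, is a running sum over the per-request query lengths. `last_token_indices` is a hypothetical helper, and strings stand in for hidden-state rows.

```python
from itertools import accumulate


def last_token_indices(query_lens):
    """Index of the final token of each request in the flattened batch."""
    return [end - 1 for end in accumulate(query_lens)]


# One 4-token prefill followed by one 1-token decode.
hidden_states = ["h_p0", "h_p1", "h_p2", "h_p3", "h_d0"]
indices = last_token_indices([4, 1])
selected = [hidden_states[i] for i in indices]
```

Only the rows in `selected` need to go through the output projection, which keeps the logits tensor at one row per sequence instead of one row per token.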
_prepare_inputs()
This method converts the request-level SchedulerOutput into flat tensors the model can consume:
def _prepare_inputs(self, scheduler_output: SchedulerOutput):
    # input_ids:    [total_tokens]: all prompt + decode tokens concatenated
    # positions:    [total_tokens]: position indices for RoPE
    # block_tables: [num_seqs, max_blocks]: KV cache lookup table
    # seq_lens:     [num_seqs]: actual length of each sequence
    # query_lens:   [num_seqs]: how many query tokens each seq contributes
    #
    # Example (2 requests: prefill=4tok, decode=1tok):
    # input_ids  = [p0, p1, p2, p3, d0]
    # positions  = [ 0,  1,  2,  3,  7]   # d0 is at position 7
    # seq_lens   = [4, 8]
    # query_lens = [4, 1]   # req0 contributes 4 queries, req1 contributes 1
    ...
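The example in the comments can be reproduced with a few lines of plain Python. This is a sketch under stated assumptions, not the vLLM code: `ScheduledRequest` and its two fields are hypothetical, lists stand in for tensors, and block tables are omitted.

```python
from dataclasses import dataclass


@dataclass
class ScheduledRequest:
    """Hypothetical per-request view of what the scheduler asked for."""
    token_ids: list[int]        # new tokens to run in this step
    num_computed_tokens: int    # tokens already present in the KV cache


def prepare_inputs(requests):
    input_ids, positions, seq_lens, query_lens = [], [], [], []
    for req in requests:
        n = len(req.token_ids)
        start = req.num_computed_tokens
        input_ids.extend(req.token_ids)             # concatenate all tokens
        positions.extend(range(start, start + n))   # absolute positions
        seq_lens.append(start + n)                  # length after this step
        query_lens.append(n)                        # tokens contributed now
    return input_ids, positions, seq_lens, query_lens


batch = [
    ScheduledRequest(token_ids=[101, 102, 103, 104], num_computed_tokens=0),  # prefill
    ScheduledRequest(token_ids=[205], num_computed_tokens=7),                 # decode
]
input_ids, positions, seq_lens, query_lens = prepare_inputs(batch)
```

The decode request contributes a single token at position 7 because seven tokens are already cached, which is why its sequence length is 8 while its query length is 1.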
CUDA Graph Execution
For decode-only steps (no prefill), vLLM captures CUDA graphs to eliminate Python overhead and kernel launch latency.
# During warm-up (capture phase):
for batch_size in [1, 2, 4, 8, 16, 32, ...]:
    # Capture the decode-only forward pass for this batch size.
    # Inputs and outputs live in static buffers so replay can reuse them.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        self.output_buf[:batch_size] = model(self.input_ids_buf[:batch_size])
    self.graphs[batch_size] = graph  # store

# During inference (replay phase):
if is_decode_only and batch_size in self.graphs:
    # Update input buffer in-place
    self.input_ids_buf[:batch_size] = input_ids
    # Replay graph: no Python overhead, ~10-30% throughput gain
    self.graphs[batch_size].replay()
    return self.output_buf[:batch_size]
else:
    # Eager mode for prefill or unseen batch sizes
    return model(input_ids, ...)
CUDA graphs eliminate per-kernel launch overhead (~5-10µs per kernel). For small decode batches where GPU compute is minimal, this overhead dominates — graphs make it nearly zero.
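A rough back-of-the-envelope estimate shows why this matters for small batches. The per-kernel range below comes from the text; the kernel count per decode step is an invented figure for illustration only, and `launch_overhead_ms` is a hypothetical helper.

```python
def launch_overhead_ms(num_kernels, per_kernel_us):
    """Total launch overhead in milliseconds for one forward pass."""
    return num_kernels * per_kernel_us / 1000


NUM_KERNELS = 300  # assumed kernel count per decode step, not a measured value

low = launch_overhead_ms(NUM_KERNELS, 5)    # 5 µs per kernel
high = launch_overhead_ms(NUM_KERNELS, 10)  # 10 µs per kernel
```

Under that assumption the launch cost alone is between `low` and `high` milliseconds per step, which can exceed the actual compute time of a small decode batch.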
ModelRunnerOutput
Source: vllm/v1/outputs.py
@dataclass
class ModelRunnerOutput:
    # One token ID per sequence (the newly sampled token)
    sampled_token_ids: torch.Tensor  # shape: [num_seqs]

    # Log-probabilities (only if requested by SamplingParams)
    logprobs: LogprobsTensors | None

    # Optional: hidden states for embedding tasks
    hidden_states: torch.Tensor | None

    # Output from KV transfer connector (disaggregated prefill/decode)
    kv_connector_output: KVConnectorOutput | None
This is collected from all workers (only rank 0 returns meaningful sampled tokens). The scheduler's update_from_output() consumes it to update request state and emit EngineCoreOutput.
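How a scheduler might consume the sampled tokens can be sketched as follows. This is a simplified illustration, not the vLLM `update_from_output()`: the `Request` dataclass, its fields, the stop conditions, and the tuple-shaped output are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical scheduler-side request state."""
    request_id: str
    output_token_ids: list[int] = field(default_factory=list)
    max_tokens: int = 16
    finished: bool = False


def update_from_output(requests, sampled_token_ids, eos_token_id):
    """Apply one step of sampled tokens; sampled_token_ids[i] belongs to requests[i]."""
    outputs = []
    for req, token in zip(requests, sampled_token_ids):
        req.output_token_ids.append(token)
        # Stop on end-of-sequence or when the length budget is used up.
        if token == eos_token_id or len(req.output_token_ids) >= req.max_tokens:
            req.finished = True
        outputs.append((req.request_id, token, req.finished))
    return outputs


running = [Request("a"), Request("b", max_tokens=1)]
events = update_from_output(running, [5, 9], eos_token_id=2)
```

Request "b" finishes after a single token because its budget is one token, while "a" stays in the running set for the next step.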
Pipeline Parallelism
With pipeline parallelism (PP), layers are split across GPUs in sequence. GPU 0 runs layers 0-7, GPU 1 runs 8-15, etc. The activation (hidden states) is passed between GPUs via point-to-point sends/receives.
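# PP communication: send/recv between adjacent ranks
# GPU 0 → GPU 1: hidden_states after layer 7
dist.send(hidden_states, dst=next_rank)
# GPU 1 receives into a preallocated buffer of the same shape and dtype
# (dist.recv fills the tensor in place and returns the sender's rank):
dist.recv(hidden_states, src=prev_rank)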
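The layer split itself is simple arithmetic. The sketch below assumes an even contiguous split with any remainder given to the earliest stages; `partition_layers` is a hypothetical helper, and the actual partitioning policy may differ.

```python
def partition_layers(num_layers, pp_size):
    """Return a half-open (start, end) layer range for each pipeline stage."""
    base, extra = divmod(num_layers, pp_size)
    ranges, start = [], 0
    for rank in range(pp_size):
        # The first `extra` stages take one additional layer.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


stages = partition_layers(num_layers=16, pp_size=2)
```

With 16 layers and two stages this yields layers 0-7 on the first GPU and 8-15 on the second, matching the split described above.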
PP is useful for very large models that don't fit on a single node even with TP. TP and PP can be combined (e.g., TP=4, PP=2 for an 8-GPU setup).
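For the combined case, each global rank maps to a pipeline stage and a tensor-parallel position within that stage. The mapping below is a sketch that assumes TP ranks are contiguous within a stage; `rank_to_coords` is a hypothetical helper and the real rank ordering is configuration-dependent.

```python
def rank_to_coords(rank, tp_size, pp_size):
    """Map a global rank to (pp_rank, tp_rank), with TP ranks contiguous."""
    if not 0 <= rank < tp_size * pp_size:
        raise ValueError(f"rank {rank} out of range")
    return divmod(rank, tp_size)


# TP=4, PP=2 on 8 GPUs.
layout = [rank_to_coords(r, tp_size=4, pp_size=2) for r in range(8)]
```

Under this assumption ranks 0-3 form the first stage and ranks 4-7 the second, so the all_reduce traffic stays within a group of four while only stage boundaries use point-to-point transfers.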