vLLM — Entry Points & I/O Pipeline

This page traces how a user request enters vLLM, is preprocessed and dispatched to the engine, and how the raw model output is converted back into readable text. This layer is pure orchestration; it contains no GPU code.

Entry Points

LLM — Offline Inference

The LLM class in vllm/entrypoints/llm.py is the primary user-facing class for offline (batch) inference. It wraps the V1 LLMEngine and exposes a simple generate() interface.

# Typical usage
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B")
outputs = llm.generate(
    ["Tell me a joke", "What is 2+2?"],
    SamplingParams(temperature=0.8, max_tokens=100)
)
for o in outputs:
    print(o.outputs[0].text)

How generate() works internally

The call chain for offline inference is synchronous:

LLM.generate(prompts, sampling_params)    [vllm/entrypoints/llm.py]
  ↓ _validate_and_add_requests()
InputProcessor.process()    [vllm/v1/engine/input_processor.py]
    Tokenize + build EngineCoreRequest per prompt
  ↓ engine_core.add_request()
LLMEngine.step() loop    [vllm/v1/engine/llm_engine.py]
    Runs until all requests finish
  ↓ OutputProcessor.process_outputs()
List[RequestOutput]    [vllm/outputs.py]
    Decoded text + metadata returned to caller

Source: vllm/entrypoints/llm.py
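
The shape of this synchronous loop can be illustrated without vLLM at all. The following is a minimal sketch, not the actual implementation: ToyRequest, ToyEngine and the free function generate() are hypothetical stand-ins, and the "model" simply emits a counter as its token IDs. It only shows the pattern of adding every request first and then stepping until nothing is left unfinished.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request tracked by the engine."""
    request_id: str
    prompt_token_ids: list[int]
    max_tokens: int
    output_token_ids: list[int] = field(default_factory=list)

    @property
    def finished(self) -> bool:
        return len(self.output_token_ids) >= self.max_tokens


class ToyEngine:
    """Hypothetical engine: every step() gives each running request one fake token."""

    def __init__(self) -> None:
        self.unfinished: dict[str, ToyRequest] = {}

    def add_request(self, request: ToyRequest) -> None:
        self.unfinished[request.request_id] = request

    def has_unfinished_requests(self) -> bool:
        return bool(self.unfinished)

    def step(self) -> list[ToyRequest]:
        # One iteration: append a token to every request, return those that finished.
        done = []
        for request in list(self.unfinished.values()):
            request.output_token_ids.append(len(request.output_token_ids))
            if request.finished:
                done.append(self.unfinished.pop(request.request_id))
        return done


def generate(prompts: list[list[int]], max_tokens: int) -> list[ToyRequest]:
    engine = ToyEngine()
    for i, token_ids in enumerate(prompts):
        engine.add_request(ToyRequest(str(i), token_ids, max_tokens))
    finished: list[ToyRequest] = []
    while engine.has_unfinished_requests():  # blocks the caller until all are done
        finished.extend(engine.step())
    # Results are returned in submission order, not completion order.
    return sorted(finished, key=lambda r: int(r.request_id))
```

Because the caller blocks inside the while loop, this pattern suits batch jobs but not a server that must accept new requests while others are still generating.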

AsyncLLM — Online Serving

AsyncLLM is used for online serving. It runs the engine loop in a background thread and exposes an async generator per request, so many requests can be in flight at once and each one can stream its output as tokens arrive.

# Simplified sketch: AsyncLLM drives the OpenAI-compatible API server
class AsyncLLM:
    async def generate(
        self,
        prompt: PromptType,
        sampling_params: SamplingParams,
        request_id: str,
    ) -> AsyncGenerator[RequestOutput, None]:
        # Yields RequestOutput for each new token
        ...

    def _run_engine_loop(self):
        # Background thread: calls engine_core.step() continuously
        while True:
            outputs = self.engine_core.step()
            self._process_outputs(outputs)

Concurrency model

The engine runs in a dedicated thread. Each generate() call is a Python async generator that receives its results via a per-request queue. This decouples the engine's sync execution from the async HTTP handler.

Source: vllm/v1/engine/async_llm.py
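
The hand-off between a synchronous engine thread and async generators can be sketched with the standard library alone. Everything below is an assumption made for illustration (ToyAsyncEngine, the inbox/outbox names, the counter used as tokens); it is not vLLM's code. The one detail worth noting is that asyncio.Queue is not thread-safe, so the engine thread pushes results through loop.call_soon_threadsafe().

```python
import asyncio
import queue
import threading
from collections.abc import AsyncGenerator

_DONE = object()  # sentinel marking the end of one request's stream


class ToyAsyncEngine:
    """Hypothetical sketch: one sync engine thread feeding per-request asyncio queues."""

    def __init__(self, loop: asyncio.AbstractEventLoop) -> None:
        self._loop = loop
        self._inbox: queue.Queue = queue.Queue()  # thread-safe: handlers -> engine
        self._outboxes: dict[str, asyncio.Queue] = {}
        threading.Thread(target=self._run_engine_loop, daemon=True).start()

    def _run_engine_loop(self) -> None:
        # Runs in the engine thread. Results are scheduled onto the event loop
        # rather than written to the asyncio.Queue directly.
        while True:
            request_id, n_tokens = self._inbox.get()
            outbox = self._outboxes[request_id]
            for token in range(n_tokens):
                self._loop.call_soon_threadsafe(outbox.put_nowait, token)
            self._loop.call_soon_threadsafe(outbox.put_nowait, _DONE)

    async def generate(self, request_id: str, n_tokens: int) -> AsyncGenerator[int, None]:
        outbox: asyncio.Queue = asyncio.Queue()
        self._outboxes[request_id] = outbox  # register before submitting
        self._inbox.put((request_id, n_tokens))
        try:
            while (item := await outbox.get()) is not _DONE:
                yield item
        finally:
            self._outboxes.pop(request_id, None)


async def collect(engine: ToyAsyncEngine, request_id: str, n_tokens: int) -> list[int]:
    return [token async for token in engine.generate(request_id, n_tokens)]


async def main() -> list[list[int]]:
    engine = ToyAsyncEngine(asyncio.get_running_loop())
    # Two requests are awaited concurrently on the same engine.
    return await asyncio.gather(collect(engine, "a", 3), collect(engine, "b", 2))
```

Running asyncio.run(main()) drives both generators on one event loop while the toy engine thread does the blocking work.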

OpenAI-Compatible API Server

The HTTP API server is in vllm/entrypoints/openai/. It uses FastAPI and handles /v1/completions, /v1/chat/completions, and other OpenAI-compatible endpoints.

# Start the server
vllm serve meta-llama/Meta-Llama-3-8B

# Under the hood:
#   FastAPI app → AsyncLLM.generate() → streaming SSE response

The server creates one AsyncLLM instance shared across all requests. Each HTTP request becomes a unique request_id in the engine.
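
For the streaming case, each incremental result is framed as a Server-Sent Event: a "data:" line followed by a blank line, with a final "data: [DONE]" marker as used by OpenAI-style streaming. The sketch below shows only that framing. The function name sse_events and the reduced payload shape are assumptions for illustration, not the server's actual response schema.

```python
import json
from collections.abc import Iterable, Iterator


def sse_events(request_id: str, text_deltas: Iterable[str]) -> Iterator[str]:
    """Frame text deltas as Server-Sent Events (hypothetical, reduced payload)."""
    for delta in text_deltas:
        payload = {"id": request_id, "choices": [{"index": 0, "text": delta}]}
        # An SSE event is one or more "field: value" lines ended by a blank line.
        yield f"data: {json.dumps(payload)}\n\n"
    # OpenAI-style streams close with a literal [DONE] marker instead of JSON.
    yield "data: [DONE]\n\n"
```

In a FastAPI application, a generator like this would typically be wrapped in a StreamingResponse with media type text/event-stream.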

Input Processing

InputProcessor

Source: vllm/v1/engine/input_processor.py

Converts a user's raw prompt (string or token IDs) into an EngineCoreRequest. This is the first transformation in the pipeline.

Steps performed

1. Tokenize the prompt string → list[int]
2. Preprocess multimodal inputs (images, audio) → MultiModalFeatureSpec
3. Validate SamplingParams against model config
4. Resolve LoRA adapter request
5. Wrap everything in EngineCoreRequest
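
The text-only part of these steps can be sketched as follows. This is a toy under stated assumptions: ToySamplingParams, ToyEngineRequest, toy_tokenize() and process() are hypothetical names, the tokenizer maps one UTF-8 byte to one token, and the validation rules are illustrative rather than the checks vLLM performs. Multimodal preprocessing and LoRA resolution are omitted.

```python
import time
from dataclasses import dataclass
from typing import Union


@dataclass
class ToySamplingParams:
    temperature: float = 1.0
    max_tokens: int = 16


@dataclass
class ToyEngineRequest:
    request_id: str
    prompt_token_ids: list[int]
    sampling_params: ToySamplingParams
    arrival_time: float


def toy_tokenize(text: str) -> list[int]:
    # Placeholder tokenizer: one token per UTF-8 byte.
    return list(text.encode("utf-8"))


def process(
    request_id: str,
    prompt: Union[str, list[int]],
    params: ToySamplingParams,
    max_model_len: int,
) -> ToyEngineRequest:
    # Step 1: accept either raw text or pre-tokenized input.
    token_ids = toy_tokenize(prompt) if isinstance(prompt, str) else list(prompt)
    # Step 3: illustrative validation against the "model config".
    if not token_ids:
        raise ValueError("prompt must contain at least one token")
    if len(token_ids) >= max_model_len:
        raise ValueError("prompt does not fit in the model context")
    if params.temperature < 0 or params.max_tokens < 1:
        raise ValueError("invalid sampling parameters")
    # Step 5: wrap everything in one request object.
    return ToyEngineRequest(request_id, token_ids, params, arrival_time=time.time())
```

Rejecting bad input here, before the request reaches the engine, keeps validation errors on the front end where they can be reported to the caller directly.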

EngineCoreRequest fields

# vllm/v1/engine/__init__.py
class EngineCoreRequest(msgspec.Struct):
    request_id: str
    prompt_token_ids: list[int] | None
    mm_features: list[MultiModalFeatureSpec]  # images, audio, etc.
    sampling_params: SamplingParams | None
    pooling_params: PoolingParams | None
    arrival_time: float
    lora_request: LoRARequest | None
    prompt_embeds: torch.Tensor | None        # for embedding-level inputs
    prompt_is_token_ids: list[bool] | None    # position mask for mixed token-ID / embedding prompts

This struct uses msgspec for zero-copy serialization when requests cross process boundaries (e.g. from HTTP server process to engine process).

Output Processing

OutputProcessor

Source: vllm/v1/engine/output_processor.py

Converts EngineCoreOutput (raw token IDs from the engine) into RequestOutput (decoded text for the user). This runs on the front-end side.

EngineCoreOutput → RequestOutput transformation

# Per-iteration result from the engine (one per request)
class EngineCoreOutput(msgspec.Struct):
    request_id: str
    new_token_ids: list[int]          # tokens produced this iteration
    new_logprobs: LogprobsLists | None
    finish_reason: FinishReason | None  # STOP | LENGTH | ABORT | ERROR
    stop_reason: int | str | None

# User-facing output
class RequestOutput:
    request_id: str
    prompt: str
    prompt_token_ids: list[int]
    outputs: list[CompletionOutput]   # one per beam / sample
    finished: bool
    metrics: RequestMetrics

class CompletionOutput:
    text: str                         # decoded text (accumulated)
    token_ids: list[int]
    cumulative_logprob: float | None
    logprobs: SampleLogprobs | None
    finish_reason: str | None         # "stop" | "length" | "abort"
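
The key point of this transformation is that engine outputs are per-iteration deltas, while the user-facing output is cumulative. A minimal sketch of that accumulation, using hypothetical ToyCoreOutput and ToyRequestState classes and a free function process_outputs() rather than the real types:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyCoreOutput:
    """Hypothetical per-iteration delta for one request."""
    request_id: str
    new_token_ids: list[int]
    finish_reason: Optional[str] = None


@dataclass
class ToyRequestState:
    """Hypothetical cumulative state kept on the front end."""
    token_ids: list[int] = field(default_factory=list)
    finish_reason: Optional[str] = None

    @property
    def finished(self) -> bool:
        return self.finish_reason is not None


def process_outputs(
    states: dict[str, ToyRequestState], outputs: list[ToyCoreOutput]
) -> list[str]:
    """Fold one iteration's deltas into per-request state; return IDs that finished."""
    finished = []
    for out in outputs:
        state = states.setdefault(out.request_id, ToyRequestState())
        state.token_ids.extend(out.new_token_ids)  # accumulate, never replace
        if out.finish_reason is not None:
            state.finish_reason = out.finish_reason
            finished.append(out.request_id)
    return finished
```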

Detokenization

The output processor maintains an incremental detokenizer per request. Rather than decoding the entire sequence each iteration, it appends new tokens and handles multi-byte characters and special tokens correctly.
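
The multi-byte problem is easy to reproduce: a character such as "é" is two bytes in UTF-8, and a byte-level vocabulary may split it across two tokens that arrive in different iterations. The sketch below assumes a hypothetical vocabulary that maps token IDs to raw bytes and uses Python's incremental UTF-8 decoder to hold back incomplete sequences. It illustrates the buffering idea only, not vLLM's detokenizer.

```python
import codecs


class ToyIncrementalDetokenizer:
    """Hypothetical byte-level detokenizer: each token ID maps to a byte string."""

    def __init__(self, vocab: dict[int, bytes]) -> None:
        self._vocab = vocab
        # The incremental decoder keeps partial multi-byte sequences in its own buffer.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self.text = ""

    def update(self, new_token_ids: list[int]) -> str:
        """Decode only the new tokens and return the newly completed text."""
        raw = b"".join(self._vocab[token_id] for token_id in new_token_ids)
        delta = self._decoder.decode(raw)
        self.text += delta
        return delta


# Usage: "é" (bytes C3 A9) is split across tokens 1 and 2.
vocab = {0: b"caf", 1: b"\xc3", 2: b"\xa9", 3: b"!"}
detok = ToyIncrementalDetokenizer(vocab)
first = detok.update([0, 1])   # the dangling C3 byte is held back
second = detok.update([2])     # the character completes here
```

Each call costs time proportional to the number of new tokens, not to the length of the sequence so far, which is the reason for decoding incrementally.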

Source: vllm/outputs.py | vllm/v1/engine/__init__.py

SamplingParams — User Configuration

Source: vllm/sampling_params.py

Controls how the model generates text. Passed in by the user and propagated all the way to the sampler.

Field	Type	Effect
temperature	float	Randomness; 0 = greedy, 1 = sample from the unscaled softmax, >1 = flatter distribution
top_p	float	Nucleus sampling: keep the smallest set of top tokens whose cumulative probability reaches top_p
top_k	int	Keep only the k most likely tokens before sampling
max_tokens	int	Maximum number of new tokens to generate
stop	list[str]	Stop when any of these strings appears in the output
stop_token_ids	list[int]	Stop on specific token IDs (e.g. EOS)
repetition_penalty	float	Penalize repeated tokens (1.0 = no penalty)
presence_penalty	float	Penalize tokens that have already appeared in the output
frequency_penalty	float	Penalize tokens in proportion to how often they have appeared
seed	int | None	Random seed for reproducibility
logprobs	int | None	Return top-N logprobs per token
n	int	Number of completions to generate
best_of	int	Generate this many candidates and return the best n
guided_decoding	GuidedDecodingParams	Constrained decoding (JSON, grammar, regex)
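
How temperature, top_k and top_p interact is easier to see on a tiny example. The following pure-Python sketch is an illustration under assumptions, not vLLM's sampler: the function names are hypothetical, top-k is applied before top-p, and the real sampler operates on batched GPU tensors.

```python
import math
import random


def filtered_probs(
    logits: list[float], temperature: float = 1.0, top_k: int = 0, top_p: float = 1.0
) -> dict[int, float]:
    """Return the renormalized distribution left after temperature, top-k and top-p."""
    if temperature == 0:
        # Greedy decoding: all probability mass on the arg-max token.
        best = max(range(len(logits)), key=logits.__getitem__)
        return {best: 1.0}
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    ranked = sorted(((w, i) for i, w in enumerate(weights)), reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    total = sum(w for w, _ in ranked)
    kept, cumulative = [], 0.0
    for w, i in ranked:
        kept.append((w, i))
        cumulative += w / total
        if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
            break
    norm = sum(w for w, _ in kept)
    return {i: w / norm for w, i in kept}


def sample_token(logits: list[float], rng: random.Random, **kwargs: float) -> int:
    probs = filtered_probs(logits, **kwargs)
    token_ids = list(probs)
    return rng.choices(token_ids, weights=[probs[i] for i in token_ids], k=1)[0]
```

With logits [2.0, 1.0, 0.0], the unfiltered probabilities are roughly 0.67, 0.24 and 0.09, so top_p=0.5 keeps only the first token while top_p=0.9 keeps the first two.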

Full I/O Pipeline Summary

# --- FRONT END (CPU, per-request) ---

User: llm.generate("Hello", SamplingParams(max_tokens=50))

InputProcessor.process(prompt, sampling_params, request_id)
  → tokenizer.encode(prompt) → [15043, ...]       # Hugging Face tokenizer
  → build EngineCoreRequest(
        request_id="req-001",
        prompt_token_ids=[15043, ...],
        sampling_params=SamplingParams(...),
        arrival_time=time.time()
    )

EngineCoreClient.add_request(engine_core_request)
  → serialize with msgspec → send over IPC socket  # if multiprocessing

# --- BACK END (Engine loop, potentially separate process) ---

EngineCore.add_request(engine_core_request)
  → Scheduler.add_request(Request(...))

# ... many iterations later, engine calls ...

EngineCore.step()
  → scheduler.schedule() → SchedulerOutput
  → executor.execute_model(scheduler_output) → ModelRunnerOutput
  → [extract EngineCoreOutput per request]

# --- FRONT END (output side) ---

OutputProcessor.process_outputs([engine_core_output, ...])
  → incremental_detokenizer.decode(new_token_ids)
  → RequestOutput(
        request_id="req-001",
        outputs=[CompletionOutput(text=" world!", token_ids=[995, 0])],
        finished=True
    )

# Returned to user or streamed via SSE