Entry Points & I/O Pipeline
How a user request enters vLLM, is preprocessed and dispatched to the engine, and how the raw model output is converted back into readable text. This layer is pure orchestration; it contains no GPU code.
Entry Points
LLM — Offline Inference
The LLM class in vllm/entrypoints/llm.py is the primary user-facing class for offline (batch) inference. It wraps the V1 LLMEngine and exposes a simple generate() interface.
# Typical usage
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Meta-Llama-3-8B")
outputs = llm.generate(
    ["Tell me a joke", "What is 2+2?"],
    SamplingParams(temperature=0.8, max_tokens=100),
)
for o in outputs:
    print(o.outputs[0].text)
How generate() works internally
The call chain for offline inference is synchronous: generate() blocks until every prompt has finished, then returns the complete list of outputs.
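A minimal sketch of that blocking pattern, assuming a toy engine that emits one token per request per step. ToyEngine and its methods are hypothetical stand-ins for illustration, not vLLM's classes; the point is only the shape of the loop: enqueue everything, then step until nothing is pending.

```python
from collections import deque


class ToyEngine:
    """Hypothetical stand-in for the engine: one token per request per step."""

    def __init__(self) -> None:
        self._pending = deque()

    def add_request(self, request_id: str, prompt: str) -> None:
        # "Generation" here just echoes the prompt's words back, one per step.
        self._pending.append((request_id, prompt.split() or [""]))

    def has_unfinished_requests(self) -> bool:
        return bool(self._pending)

    def step(self) -> list:
        """Advance every pending request by one token.

        Returns a list of (request_id, token, finished) tuples.
        """
        results = []
        still_pending = deque()
        for request_id, words in self._pending:
            token = words.pop(0)
            finished = not words
            results.append((request_id, token, finished))
            if not finished:
                still_pending.append((request_id, words))
        self._pending = still_pending
        return results


def generate(engine: ToyEngine, prompts: list) -> list:
    """Blocking call: enqueue all prompts, then step until the engine is idle."""
    texts = {}
    for i, prompt in enumerate(prompts):
        request_id = str(i)
        texts[request_id] = []
        engine.add_request(request_id, prompt)

    while engine.has_unfinished_requests():
        for request_id, token, _finished in engine.step():
            texts[request_id].append(token)

    # Return results in the same order the prompts were given.
    return [" ".join(texts[str(i)]) for i in range(len(prompts))]


results = generate(ToyEngine(), ["a b c", "x"])
```

Because the caller owns the loop, nothing is returned until the longest request completes, which is acceptable for batch jobs but not for streaming.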
Source: vllm/entrypoints/llm.py
AsyncLLM — Online Serving
AsyncLLM is used for online serving. It runs the engine loop in a background thread and exposes an async generator per request, so many requests can be in flight at once and each one can stream tokens as they are produced.
# AsyncLLM drives the OpenAI-compatible API server
class AsyncLLM:
    async def generate(
        self,
        prompt: PromptType,
        sampling_params: SamplingParams,
        request_id: str,
    ) -> AsyncGenerator[RequestOutput, None]:
        # Yields RequestOutput for each new token
        ...
    def _run_engine_loop(self):
        # Background thread: calls engine_core.step() continuously
        while True:
            outputs = self.engine_core.step()
            self._process_outputs(outputs)
Concurrency model
The engine runs in a dedicated thread. Each generate() call is a Python async generator that receives its results via a per-request queue. This decouples the engine's sync execution from the async HTTP handler.
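The thread-plus-queue handoff described here can be sketched with the standard library alone. Everything below (ToyAsyncEngine, the _DONE sentinel, the word-echo "model") is a hypothetical illustration of the pattern, not AsyncLLM's actual code; the toy engine also serves requests one after another rather than batching them.

```python
import asyncio
import queue
import threading
from collections.abc import AsyncGenerator

_DONE = object()  # sentinel marking the end of one request's stream


class ToyAsyncEngine:
    def __init__(self) -> None:
        self._inbox = queue.Queue()  # thread-safe: async side -> engine thread
        threading.Thread(target=self._run_engine_loop, daemon=True).start()

    def _run_engine_loop(self) -> None:
        # Engine thread: ordinary blocking code, no asyncio in here.
        while True:
            prompt, loop, out_q = self._inbox.get()
            for token in prompt.split():
                # asyncio.Queue is not thread-safe, so schedule the put
                # on the event loop that owns the queue.
                loop.call_soon_threadsafe(out_q.put_nowait, token)
            loop.call_soon_threadsafe(out_q.put_nowait, _DONE)

    async def generate(self, prompt: str) -> AsyncGenerator[str, None]:
        out_q = asyncio.Queue()  # one queue per request
        self._inbox.put((prompt, asyncio.get_running_loop(), out_q))
        while (item := await out_q.get()) is not _DONE:
            yield item


async def collect(engine: ToyAsyncEngine, prompt: str) -> list:
    return [token async for token in engine.generate(prompt)]


async def main() -> list:
    engine = ToyAsyncEngine()
    # Two requests share one engine; each reads only from its own queue.
    return await asyncio.gather(collect(engine, "a b c"), collect(engine, "x y"))


streams = asyncio.run(main())
```

The per-request queue is what keeps streams from mixing: the engine thread never needs to know which coroutine is waiting, only which queue belongs to which request.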
Source: vllm/v1/engine/async_llm.py
OpenAI-Compatible API Server
The HTTP API server is in vllm/entrypoints/openai/. It uses FastAPI and handles /v1/completions, /v1/chat/completions, and other OpenAI-compatible endpoints.
# Start the server
vllm serve meta-llama/Meta-Llama-3-8B
# Under the hood:
# FastAPI app → AsyncLLM.generate() → streaming SSE response
The server creates one AsyncLLM instance shared across all requests. Each HTTP request becomes a unique request_id in the engine.
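On the client side, a streamed response arrives as server-sent events: lines of the form "data: <json>", terminated by "data: [DONE]" in the OpenAI streaming convention. The sketch below parses such a stream from an iterable of lines. The helper name iter_sse_payloads and the abbreviated sample chunks are illustrative assumptions; real chunks carry more fields.

```python
import json
from collections.abc import Iterable, Iterator


def iter_sse_payloads(lines: Iterable) -> Iterator:
    """Yield the JSON payload of each `data:` event until the [DONE] marker."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        yield json.loads(data)


# Abbreviated, hand-written sample of a /v1/completions stream.
sample_stream = [
    'data: {"choices": [{"index": 0, "text": "Hello"}]}',
    "",
    'data: {"choices": [{"index": 0, "text": " world"}]}',
    "",
    "data: [DONE]",
]

text = "".join(
    payload["choices"][0]["text"] for payload in iter_sse_payloads(sample_stream)
)
```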
Input Processing
InputProcessor
Source: vllm/v1/engine/input_processor.py
Converts a user's raw prompt (string or token IDs) into an EngineCoreRequest. This is the first transformation in the pipeline.
Steps performed
- Tokenize the prompt string → list[int]
- Preprocess multimodal inputs (images, audio) → MultiModalFeatureSpec
- Validate SamplingParams against model config
- Resolve LoRA adapter request
- Wrap everything in EngineCoreRequest
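The steps above can be sketched as a single function. This is a simplified illustration under stated assumptions: the byte-level toy_tokenize, the Toy* dataclasses, and the max_model_len check are all hypothetical stand-ins, and the multimodal and LoRA steps are omitted.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class ToySamplingParams:
    temperature: float = 1.0
    max_tokens: int = 16


@dataclass
class ToyEngineRequest:
    request_id: str
    prompt_token_ids: list[int]
    sampling_params: ToySamplingParams
    arrival_time: float = field(default_factory=time.time)


def toy_tokenize(text: str) -> list[int]:
    # Hypothetical tokenizer: one token ID per UTF-8 byte.
    return list(text.encode("utf-8"))


def process_input(
    request_id: str,
    prompt: str | list[int],
    params: ToySamplingParams,
    max_model_len: int,
) -> ToyEngineRequest:
    # 1. Tokenize strings; pass pre-tokenized prompts through unchanged.
    token_ids = toy_tokenize(prompt) if isinstance(prompt, str) else list(prompt)
    if not token_ids:
        raise ValueError("prompt must contain at least one token")

    # 2. Validate the sampling parameters against the (toy) model limits.
    if params.temperature < 0:
        raise ValueError("temperature must be non-negative")
    if len(token_ids) + params.max_tokens > max_model_len:
        raise ValueError("prompt plus max_tokens exceeds the model length")

    # 3. Wrap everything in one request object for the engine.
    return ToyEngineRequest(request_id, token_ids, params)


request = process_input(
    "req-001", "Hi", ToySamplingParams(max_tokens=4), max_model_len=8
)
```

Rejecting bad requests here, before they reach the engine, keeps validation errors on the front-end side where they can be reported to the caller directly.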
EngineCoreRequest fields
# vllm/v1/engine/__init__.py
class EngineCoreRequest(msgspec.Struct):
    request_id: str
    prompt_token_ids: list[int] | None
    mm_features: list[MultiModalFeatureSpec]  # images, audio, etc.
    sampling_params: SamplingParams | None
    pooling_params: PoolingParams | None
    arrival_time: float
    lora_request: LoRARequest | None
    prompt_embeds: torch.Tensor | None  # for embedding-level inputs
    prompt_is_token_ids: list[bool] | None  # position mask for mixed
This struct uses msgspec for zero-copy serialization when requests cross process boundaries (e.g. from HTTP server process to engine process).
Output Processing
OutputProcessor
Source: vllm/v1/engine/output_processor.py
Converts EngineCoreOutput (raw token IDs from the engine) into RequestOutput (decoded text for the user). This runs on the front-end side.
EngineCoreOutput → RequestOutput transformation
# Per-iteration result from the engine (one per request)
class EngineCoreOutput(msgspec.Struct):
    request_id: str
    new_token_ids: list[int]  # tokens produced this iteration
    new_logprobs: LogprobsLists | None
    finish_reason: FinishReason | None  # STOP | LENGTH | ABORT | ERROR
    stop_reason: int | str | None
# User-facing output
class RequestOutput:
    request_id: str
    prompt: str
    prompt_token_ids: list[int]
    outputs: list[CompletionOutput]  # one per beam / sample
    finished: bool
    metrics: RequestMetrics
class CompletionOutput:
    text: str  # decoded text (accumulated)
    token_ids: list[int]
    cumulative_logprob: float | None
    logprobs: SampleLogprobs | None
    finish_reason: str | None  # "stop" | "length" | "abort"
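The key difference between the two shapes is that the engine emits deltas (new_token_ids for one iteration) while the user-facing output is cumulative. A minimal sketch of that fold, using hypothetical ToyCoreOutput and ToyRequestState classes rather than the real structs:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyCoreOutput:
    """Delta from one engine iteration (toy version of EngineCoreOutput)."""
    request_id: str
    new_token_ids: list[int]
    finish_reason: str | None = None


@dataclass
class ToyRequestState:
    """Cumulative per-request state kept on the front-end side."""
    token_ids: list[int] = field(default_factory=list)
    finish_reason: str | None = None

    @property
    def finished(self) -> bool:
        return self.finish_reason is not None


def process_outputs(
    states: dict[str, ToyRequestState], outputs: list[ToyCoreOutput]
) -> list[str]:
    """Fold one iteration's deltas into the states; return IDs that finished."""
    finished_ids = []
    for out in outputs:
        state = states.setdefault(out.request_id, ToyRequestState())
        state.token_ids.extend(out.new_token_ids)
        if out.finish_reason is not None:
            state.finish_reason = out.finish_reason
            finished_ids.append(out.request_id)
    return finished_ids


states: dict[str, ToyRequestState] = {}
first = process_outputs(states, [ToyCoreOutput("a", [1]), ToyCoreOutput("b", [7])])
second = process_outputs(states, [ToyCoreOutput("a", [2, 3], "length")])
```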
Detokenization
The output processor maintains an incremental detokenizer per request. Rather than decoding the entire sequence each iteration, it appends new tokens and handles multi-byte characters and special tokens correctly.
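The multi-byte problem is easy to reproduce with a byte-level toy, where each token ID is one UTF-8 byte. That tokenization is an assumption made purely for illustration, and the class below is not vLLM's detokenizer. It relies on Python's incremental UTF-8 decoder, which buffers an incomplete character instead of emitting broken text.

```python
import codecs


class ToyIncrementalDetokenizer:
    """Byte-level toy: every token ID is a single UTF-8 byte."""

    def __init__(self) -> None:
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self.text = ""  # accumulated output so far

    def update(self, new_token_ids: list) -> str:
        """Decode only the new tokens and return the newly completed text."""
        delta = self._decoder.decode(bytes(new_token_ids))
        self.text += delta
        return delta


detok = ToyIncrementalDetokenizer()
euro = list("€".encode("utf-8"))  # three bytes: 226, 130, 172

first = detok.update(euro[:2])           # character incomplete, nothing emitted
second = detok.update(euro[2:] + [33])   # completes the euro sign, then "!"
```

Each call touches only the new tokens, so the cost per iteration stays proportional to the delta rather than to the full sequence length.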
Source: vllm/outputs.py | vllm/v1/engine/__init__.py
SamplingParams — User Configuration
Source: vllm/sampling_params.py
Controls how the model generates text. Passed in by the user and propagated all the way to the sampler.
| Field | Type | Effect |
|---|---|---|
| temperature | float | Randomness; 0 = greedy, 1 = unscaled softmax distribution, >1 = flatter |
| top_p | float | Nucleus sampling: keep the smallest set of most-likely tokens whose cumulative probability reaches top_p |
| top_k | int | Keep only top-k tokens before sampling |
| max_tokens | int | Maximum new tokens to generate |
| stop | list[str] | Stop when any of these strings appear in the output |
| stop_token_ids | list[int] | Stop on specific token IDs (e.g. EOS) |
| repetition_penalty | float | Penalize repeated tokens (1.0 = no penalty) |
| presence_penalty | float | Penalize tokens already present in output |
| frequency_penalty | float | Penalize tokens proportionally to frequency |
| seed | int \| None | Random seed for reproducibility |
| logprobs | int \| None | Return top-N logprobs per token |
| n | int | Number of completions to generate |
| best_of | int | Generate this many and return the best n |
| guided_decoding | GuidedDecodingParams | Constrained decoding (JSON, grammar, regex) |
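To make the temperature, top_k, and top_p rows concrete, here is a pure-Python sketch of how the three filters can combine over a list of logits. It follows one common ordering (top-k, then temperature-scaled softmax, then top-p) and is an illustration only, not vLLM's sampler; the function names are hypothetical, and temperature is assumed to be non-negative.

```python
import math
import random


def filtered_probs(
    logits: list, temperature: float = 1.0, top_k: int = 0, top_p: float = 1.0
) -> dict:
    """Return {token_id: probability} after top-k, temperature, and top-p."""
    # Rank token IDs from most to least likely.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if temperature == 0:
        return {order[0]: 1.0}  # greedy: all mass on the best token
    if top_k > 0:
        order = order[:top_k]

    # Temperature-scaled softmax over the surviving tokens.
    scaled = [logits[i] / temperature for i in order]
    peak = scaled[0]  # largest value, subtracted for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)

    # Nucleus: keep tokens until their cumulative probability reaches top_p.
    kept = {}
    mass = 0.0
    for token_id, weight in zip(order, weights):
        prob = weight / total
        kept[token_id] = prob
        mass += prob
        if mass >= top_p:
            break

    norm = sum(kept.values())  # renormalize what is left
    return {token_id: prob / norm for token_id, prob in kept.items()}


def sample(logits: list, seed=None, **kwargs) -> int:
    """Draw one token ID; a fixed seed makes the draw reproducible."""
    probs = filtered_probs(logits, **kwargs)
    rng = random.Random(seed)
    return rng.choices(list(probs), weights=list(probs.values()))[0]


logits = [2.0, 1.0, 0.0]
greedy = filtered_probs(logits, temperature=0)
top_two = filtered_probs(logits, top_k=2)
nucleus = filtered_probs(logits, top_p=0.5)
```

With these logits the most likely token already holds more than half the probability mass, so top_p=0.5 collapses the distribution to that single token.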
Full I/O Pipeline Summary
# --- FRONT END (CPU, per-request) ---
User: llm.generate("Hello", SamplingParams(max_tokens=50))
InputProcessor.process(prompt, sampling_params, request_id)
  → tokenizer.encode(prompt) → [15043, ...]  # Hugging Face tokenizer
  → build EngineCoreRequest(
        request_id="req-001",
        prompt_token_ids=[15043, ...],
        sampling_params=SamplingParams(...),
        arrival_time=time.monotonic()
    )
EngineCoreClient.add_request(engine_core_request)
  → serialize with msgspec → send over IPC socket  # if multiprocessing
# --- BACK END (Engine loop, potentially separate process) ---
EngineCore.add_request(engine_core_request)
  → Scheduler.add_request(Request(...))
# ... many iterations later, engine calls ...
EngineCore.step()
  → scheduler.schedule() → SchedulerOutput
  → executor.execute_model(scheduler_output) → ModelRunnerOutput
  → [extract EngineCoreOutput per request]
# --- FRONT END (output side) ---
OutputProcessor.process_outputs([engine_core_output, ...])
  → incremental_detokenizer.decode(new_token_ids)
  → RequestOutput(
        request_id="req-001",
        outputs=[CompletionOutput(text=" world!", token_ids=[995, 0])],
        finished=True
    )
# Returned to user or streamed via SSE