Sampling Pipeline

How raw logits become the next token: the composable sampler chain.

Sources: src/llama-sampling.cpp · llama.h#L1190 (sampler API) · Top-P Sampling (Holtzman 2019) · Mirostat (Basu 2021)

The Sampler Chain Pattern

llama.cpp uses a composable chain of samplers. Each sampler in the chain receives the current token candidates array, optionally filters or transforms it, and passes it to the next. The last sampler in the chain must be a "selecting" sampler that returns a single token ID.

Why a chain? Different use cases want different combinations: code generation wants greedy (temperature=0); creative writing wants high temperature + top-P; chat assistants want repetition penalty + top-K + temperature. The chain pattern lets you compose exactly what you need without writing custom logic.
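
To make the pattern concrete, here is a minimal standalone sketch of a sampler chain. It does not use the llama.cpp API: `Candidate`, `Candidates`, `Stage`, and `run_chain` are hypothetical names chosen for illustration, and each stage is modeled as a plain function object rather than a vtable. It assumes a non-empty candidate list and a temperature above zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-ins for llama_token_data / llama_token_data_array.
struct Candidate { int id; float logit; };
struct Candidates {
    std::vector<Candidate> data;
    long selected = -1;  // index into data; -1 until a selecting stage runs
};

// A stage filters, transforms, or selects; the chain just runs stages in order.
using Stage = std::function<void(Candidates &)>;

Stage top_k(std::size_t k) {
    return [k](Candidates & c) {
        if (k == 0 || k >= c.data.size()) return;  // k == 0 means disabled
        std::partial_sort(c.data.begin(), c.data.begin() + k, c.data.end(),
            [](const Candidate & a, const Candidate & b) { return a.logit > b.logit; });
        c.data.resize(k);  // hard cutoff: drop everything after the top k
    };
}

Stage temperature(float t) {
    return [t](Candidates & c) {
        for (auto & cand : c.data) cand.logit /= t;  // caller must pass t > 0
    };
}

Stage greedy() {
    return [](Candidates & c) {
        auto it = std::max_element(c.data.begin(), c.data.end(),
            [](const Candidate & a, const Candidate & b) { return a.logit < b.logit; });
        c.selected = it - c.data.begin();
    };
}

// Runs every stage in order and returns the selected token id.
// The last stage must be a selecting stage, otherwise `selected` stays at -1.
int run_chain(const std::vector<Stage> & chain, Candidates c) {
    for (const auto & stage : chain) stage(c);
    return c.data[c.selected].id;
}
```

Composing a different behavior is then a matter of building a different vector, for example `{top_k(40), temperature(0.8f), greedy()}`, which mirrors how `llama_sampler_chain_add()` is used in the API section.
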

API: Building and Using a Chain

llama_sampler_chain_init() and the full interface
// include/llama.h#L1190 — sampler interface

// Core sampler vtable (every sampler implements some of these):
struct llama_sampler_i {
    const char *           (*name)  (const struct llama_sampler * smpl);
    void                   (*accept)(struct llama_sampler * smpl, llama_token token);
    void                   (*apply) (struct llama_sampler * smpl,
                                     llama_token_data_array * cur_p); // ← filters candidates
    void                   (*reset) (struct llama_sampler * smpl);
    struct llama_sampler * (*clone) (const struct llama_sampler * smpl);
    void                   (*free)  (struct llama_sampler * smpl);
};

// llama_token_data_array: the candidates being filtered
typedef struct llama_token_data_array {
    llama_token_data * data;   // array of {id, logit, p} for each token
    size_t   size;             // current number of candidates (shrinks as filtered)
    int64_t  selected;         // index of selected token (-1 if not yet sampled)
    bool     sorted;           // true if sorted by probability (descending)
} llama_token_data_array;

// Build a typical chat sampler chain:
struct llama_sampler * smpl =
    llama_sampler_chain_init(llama_sampler_chain_default_params());

// Arguments: penalty_last_n, penalty_repeat, penalty_freq, penalty_present
// (signature in recent llama.h; older releases took additional arguments)
llama_sampler_chain_add(smpl, llama_sampler_init_penalties(64, 1.1f, 0.0f, 0.0f));
llama_sampler_chain_add(smpl, llama_sampler_init_top_k(40));
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(0.95f, 1));  // p, min_keep
llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.8f));
llama_sampler_chain_add(smpl, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
//                                                    ↑ final selector: sample from distribution

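// During generation loop:
llama_token id = llama_sampler_sample(smpl, ctx, -1);  // -1 = last token's logits
// llama_sampler_sample() already calls llama_sampler_accept(smpl, id) internally,
// so the repetition state is updated; accepting again would count the token twice.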

Sampler Chain Execution Flow

1. Get logits from context
   Data: logits[n_vocab] (float32, raw unnormalized scores)
   Call: llama_get_logits_ith(ctx, idx) → float* of length n_vocab

2. Repetition Penalty
   Operation: for each token t seen in the last last_n tokens, logit[t] /= penalty if logit[t] > 0, else logit[t] *= penalty
   Effect: reduces the probability of recently generated tokens to prevent loops
   Parameters: penalty=1.1, last_n=64, freq_penalty, presence_penalty

3. Top-K Filter
   Operation: keep the top K tokens by logit value; discard the rest
   Effect: removes long-tail low-probability tokens. Hard cutoff.
   Parameters: k=40 typical. k=1 = greedy. k=0 = disabled.

4. Top-P (Nucleus Sampling)
   Operation: softmax → sort ↓ → keep tokens until Σp ≥ p_threshold
   Effect: dynamically adjusts the cutoff by removing candidates beyond the "nucleus" of the distribution
   Parameters: p=0.95, min_keep=1

5. Temperature
   Operation: logit[i] /= T (then re-softmax at sample time)
   Effect: T<1: sharper (more confident), T>1: flatter (more random), T=0: greedy
   Parameters: temp=0.8

6. Categorical Sample (dist)
   Operation: softmax → cumulative sum → look up a uniform random draw (inverse-CDF sampling)
   Effect: selects one token ID from the filtered, temperature-scaled distribution
   Parameters: seed (for reproducibility)
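
Step 6 can be sketched in isolation. The following is an illustrative standalone version of softmax followed by inverse-CDF sampling, not the llama.cpp implementation; `softmax` and `sample_index` are hypothetical helper names, and the uniform draw `u` is passed in so the behavior is easy to check by hand.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Numerically stable softmax: subtract the max logit before exponentiating.
std::vector<float> softmax(const std::vector<float> & logits) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); i++) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (float & p : probs) p /= sum;
    return probs;
}

// Inverse-CDF sampling: u is a uniform draw in [0, 1).
// Returns the index of the first entry whose cumulative probability exceeds u.
std::size_t sample_index(const std::vector<float> & probs, float u) {
    std::vector<float> cdf(probs.size());
    std::partial_sum(probs.begin(), probs.end(), cdf.begin());
    auto it = std::upper_bound(cdf.begin(), cdf.end(), u);  // binary search
    if (it == cdf.end()) --it;  // guard against rounding leaving cdf.back() below u
    return static_cast<std::size_t>(it - cdf.begin());
}
```

In a real loop, `u` would come from a seeded generator such as `std::mt19937` combined with `std::uniform_real_distribution<float>`, which is what makes a fixed seed reproduce the same token sequence.
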

Individual Sampler Deep Dives

Temperature — the most important parameter

Temperature controls the sharpness of the output distribution. Dividing logits by T before softmax is equivalent to raising each probability to the power 1/T and renormalizing.

// src/llama-sampling.cpp: llama_sampler_init_temp
// Applied in apply() when temp > 0 (temp <= 0 is special-cased to keep only the
// top-logit token, because the division below would be undefined):
for (size_t i = 0; i < cur_p->size; i++) {
    cur_p->data[i].logit /= temp;  // scale logits
}
// Softmax happens at the selection step

// Effect:
// temp=0.0 → greedy (argmax, deterministic)
// temp=0.5 → sharper than the raw softmax, less variety
// temp=1.0 → raw softmax probabilities
// temp=2.0 → flatter, more random
// temp>10  → approaches uniform → essentially random token selection
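
The equivalence stated above (dividing logits by T versus raising probabilities to the power 1/T and renormalizing) can be checked numerically. This is a standalone sketch with hypothetical helper names `softmax_with_temperature` and `sharpen`; it assumes T > 0 and is not taken from the llama.cpp source.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// softmax(logits / T), computed stably. Requires T > 0.
std::vector<float> softmax_with_temperature(const std::vector<float> & logits, float T) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); i++) {
        probs[i] = std::exp((logits[i] - max_logit) / T);
        sum += probs[i];
    }
    for (float & p : probs) p /= sum;
    return probs;
}

// The same distribution derived from T=1 probabilities: p_i^(1/T), renormalized.
std::vector<float> sharpen(const std::vector<float> & probs, float T) {
    std::vector<float> out(probs.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < probs.size(); i++) {
        out[i] = std::pow(probs[i], 1.0f / T);
        sum += out[i];
    }
    for (float & p : out) p /= sum;
    return out;
}
```

For logits `{2, 1, 0}`, both routes give the same result at T=0.5 up to rounding, with the top token's probability rising from about 0.67 at T=1 to about 0.87. At T=2 it falls to about 0.51.
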
Top-P Nucleus Sampling

The "nucleus" is the smallest set of top-ranked tokens whose cumulative probability reaches p. This adapts to the model's confidence: when the model is very sure (peaked distribution), the nucleus is small; when uncertain, it's large.

// Pseudocode for top-p:
// 1. Compute softmax of all logits
// 2. Sort tokens by probability, descending
// 3. Walk down the sorted list, accumulating probability
// 4. Stop when cumulative sum >= p (and at least min_keep tokens are kept)
// 5. Drop all tokens beyond the cutoff by shrinking cur_p->size

// Assumes cur_p is already sorted and cur_p->data[i].p holds softmax probabilities
float cumsum = 0.0f;
for (size_t i = 0; i < cur_p->size; i++) {
    cumsum += cur_p->data[i].p;
    if (cumsum >= p && i + 1 >= min_keep) {
        cur_p->size = i + 1;  // keep only top i+1 tokens
        break;
    }
}
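
A self-contained version of the same walk, operating on a plain vector, may make the adaptive behavior easier to see. `Candidate` and `top_p_filter` are hypothetical names for this sketch, which assumes the probabilities have already been computed and sum to 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for llama_token_data with probabilities already filled in.
struct Candidate { int id; float p; };

// Keeps the smallest prefix of the sorted candidates whose mass reaches top_p,
// but never fewer than min_keep entries.
void top_p_filter(std::vector<Candidate> & cands, float top_p, std::size_t min_keep) {
    std::sort(cands.begin(), cands.end(),
        [](const Candidate & a, const Candidate & b) { return a.p > b.p; });
    float cumsum = 0.0f;
    std::size_t keep = cands.size();  // keep everything if the threshold is never reached
    for (std::size_t i = 0; i < cands.size(); i++) {
        cumsum += cands[i].p;
        if (cumsum >= top_p && i + 1 >= min_keep) {
            keep = i + 1;
            break;
        }
    }
    cands.resize(keep);
}
```

With p=0.9, a peaked distribution such as `{0.92, 0.05, 0.02, 0.01}` collapses to a single candidate, while a flat one such as four tokens at 0.25 keeps all four.
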
Repetition Penalty (+ frequency and presence penalty)

Three related penalties discourage the model from repeating tokens from recent context. The token history is tracked via llama_sampler_accept() after each generation step.

// For each token t in the candidate list:
// if t appeared in the last `penalty_last_n` tokens:
//
//   Base penalty (multiplicative):
//   logit[t] = (logit[t] >= 0) ? logit[t] / penalty_repeat
//                               : logit[t] * penalty_repeat
//
//   Frequency penalty (subtractive, scales with count):
//   logit[t] -= count(t) * penalty_freq
//
//   Presence penalty (subtractive, binary — just appeared or not):
//   logit[t] -= penalty_present  (if t appeared at all in window)

// Typical values:
// penalty_repeat  = 1.1  (applied once per penalized token, regardless of count;
//                         1.0 disables it)
// penalty_freq    = 0.0  (disabled by default)
// penalty_present = 0.0  (disabled by default)
Min-P sampling

Min-P (introduced 2023) is an alternative to Top-P. Instead of a fixed cumulative threshold, it removes tokens whose probability is below min_p × max_prob. This scales the cutoff with the model's confidence.

// float max_prob = max of all softmax probabilities
// Keep token t if: prob[t] >= min_p * max_prob
//
// Example:
// max_prob = 0.6 (model very confident)
// min_p    = 0.1
// cutoff   = 0.06  → removes all tokens with prob < 6%
//
// max_prob = 0.05 (model unsure)
// cutoff   = 0.005 → much more permissive

// Benefit: prunes aggressively when the model is confident,
//          stays permissive when it is uncertain
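
The rule above fits in a few lines. This standalone sketch uses hypothetical names (`Candidate`, `min_p_filter`) and assumes the probabilities are already normalized; the `min_keep` parameter is an illustrative addition that mirrors the top-p example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for llama_token_data with probabilities already filled in.
struct Candidate { int id; float p; };

// Keeps every token with p >= min_p * max_prob, and at least min_keep tokens.
void min_p_filter(std::vector<Candidate> & cands, float min_p, std::size_t min_keep) {
    if (cands.empty()) return;
    std::sort(cands.begin(), cands.end(),
        [](const Candidate & a, const Candidate & b) { return a.p > b.p; });
    const float cutoff = min_p * cands.front().p;  // front() is max_prob after sorting
    // After sorting, the survivors form a prefix of the list.
    std::size_t keep = 0;
    while (keep < cands.size() && (cands[keep].p >= cutoff || keep < min_keep)) keep++;
    cands.resize(keep);
}
```

With min_p=0.1, a distribution of `{0.6, 0.3, 0.05, 0.05}` has a cutoff of 0.06 and keeps two tokens, while twenty tokens at 0.05 each have a cutoff of 0.005 and all survive.
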
Mirostat — entropy-targeting sampler

Mirostat targets a fixed average surprise (cross-entropy, which corresponds to a target perplexity) across generated tokens. Rather than changing the temperature, it adjusts a truncation threshold μ each step to keep the "surprise level" of generation stable, avoiding both repetitive and incoherent text.

// Mirostat v2 algorithm (Basu 2021):
// τ = 3.0   (target surprise, in bits per token)
// η = 0.1   (learning rate for the μ update)
// μ = 2τ    (initial surprise threshold)

// Each step:
// 1. Compute softmax probabilities and sort tokens by probability
// 2. Drop every token whose surprise -log2(p) exceeds μ (always keep at least one)
// 3. Renormalize and sample from the survivors
// 4. Compute surprise s = -log2(p_selected)
// 5. Update: μ = μ - η * (s - τ)
//    → if sampled token was surprising: lower μ next step (more focused)
//    → if too predictable: raise μ (more diverse)
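
One step of the feedback loop can be written out as follows. This is a standalone sketch of the v2 variant as commonly described (truncate by surprise, sample, update μ), not a copy of the llama.cpp code. `mirostat_v2_step` is a hypothetical name, the input is assumed to be non-empty and sorted in descending order, and the uniform draw `u` is passed in for testability.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One Mirostat v2 step over probabilities sorted in descending order.
// u is a uniform draw in [0, 1); mu is updated in place. Returns the chosen index.
std::size_t mirostat_v2_step(const std::vector<float> & sorted_probs,
                             float tau, float eta, float & mu, float u) {
    // 1. Truncate: keep the prefix whose surprise -log2(p) does not exceed mu.
    std::size_t keep = 0;
    while (keep < sorted_probs.size() && -std::log2(sorted_probs[keep]) <= mu) keep++;
    if (keep == 0) keep = 1;  // always keep the most likely token

    // 2. Renormalize the survivors and sample by walking the cumulative sum.
    float mass = 0.0f;
    for (std::size_t i = 0; i < keep; i++) mass += sorted_probs[i];
    std::size_t chosen = keep - 1;  // fallback if rounding leaves cumsum below u
    float cumsum = 0.0f;
    for (std::size_t i = 0; i < keep; i++) {
        cumsum += sorted_probs[i] / mass;
        if (u < cumsum) { chosen = i; break; }
    }

    // 3. Feedback: compare the observed surprise with the target and nudge mu.
    const float surprise = -std::log2(sorted_probs[chosen] / mass);
    mu -= eta * (surprise - tau);
    return chosen;
}
```

With probabilities `{0.5, 0.25, 0.125, 0.125}`, τ=3, η=0.1, and μ=6, a draw of u=0.6 selects index 1 (surprise 2 bits). That is below the target, so μ rises to 6.1.
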
Greedy sampling — deterministic argmax
// llama_sampler_init_greedy()
// Selects the token with highest logit value unconditionally
// Equivalent to temperature → 0

// apply():
int64_t best = 0;
for (size_t i = 1; i < cur_p->size; i++) {
    if (cur_p->data[i].logit > cur_p->data[best].logit) best = i;
}
cur_p->selected = best;

// Used for: code generation, structured output, reproducible runs
// Problem: can get stuck in repetition loops (mitigate with a repetition penalty)

Grammar-Constrained Sampling

Forcing structured output with GBNF grammars

llama.cpp includes a sampler that enforces a GBNF (GGML BNF) grammar. It maintains a parser state machine and sets logits to -infinity for any token that would produce an invalid continuation.

// Grammar sampler — applied BEFORE top-K/P/temp to constrain the space
struct llama_sampler * grammar =
    llama_sampler_init_grammar(llama_model_get_vocab(model),
                               grammar_str,    // GBNF grammar text
                               "root");        // start rule

// During apply(), the grammar parser:
// 1. Identifies which tokens are valid next tokens given the current parse state
// 2. Sets logit = -INFINITY for all invalid tokens
// 3. The subsequent top-K/P samplers then only see valid tokens

// Example GBNF for JSON object with "name" field:
// root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
// string ::= "\"" char* "\""
// char   ::= [a-zA-Z0-9 ]
// ws     ::= [ \t\n]*
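
The masking step itself (step 2 above) is simple once the set of valid tokens is known. The sketch below shows only that step. `Candidate` and `mask_disallowed` are hypothetical names, and `allowed` stands in for the set a grammar parser would derive from its current state, which is the difficult part and is not shown.

```cpp
#include <cmath>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for llama_token_data.
struct Candidate { int id; float logit; };

// Sets the logit of every token outside `allowed` to -infinity, so that any
// later softmax assigns it zero probability.
void mask_disallowed(std::vector<Candidate> & cands, const std::unordered_set<int> & allowed) {
    for (auto & c : cands) {
        if (allowed.count(c.id) == 0) c.logit = -INFINITY;
    }
}
```

Because exp(-infinity) is 0, masked tokens drop out of the distribution without the later samplers needing to know that a grammar is active.
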

After Sampling

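The selected token_id is returned from llama_sampler_sample(), which has already accepted it into the sampler's history (for repetition tracking). The caller then:

// 1. Accept the token yourself only if you drive llama_sampler_apply() by hand
//    instead of calling llama_sampler_sample():
// llama_sampler_accept(smpl, token_id);

// 2. Convert to text:
char piece[64];
int n = llama_token_to_piece(vocab, token_id, piece, sizeof(piece), 0, true);
// → " Paris" (n bytes, not NUL-terminated; a negative n means the buffer was too small)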

// 3. Feed back for next step:
struct llama_batch next_batch = llama_batch_get_one(&token_id, 1);
llama_decode(ctx, next_batch);  // KV cache grows +1, computes new logits