How llama.cpp builds a lazy computation graph for the transformer forward pass, then executes it.
ggml uses a deferred computation model. When you call ggml_mul_mat(ctx, a, b), no math happens — instead a ggml_tensor node is created with op=MUL_MAT and pointers to its inputs. The full transformer is described as a DAG of such nodes. Only when you call ggml_graph_compute() does execution begin.
// ggml/include/ggml.h#L666
struct ggml_tensor {
enum ggml_type type; // F32 | F16 | BF16 | Q4_0 | Q8_0 | ...
// determines per-element byte size
int64_t ne[4]; // shape: [cols, rows, batch, batch2]
// ne[0]=cols is contiguous (innermost)
// e.g. a 4096x4096 matrix: ne={4096,4096,1,1}
size_t nb[4]; // byte strides: nb[0]=element_size, nb[1]=row_bytes, ...
// enables non-contiguous views without copying
enum ggml_op op; // NONE | MUL_MAT | ADD | SOFT_MAX | ROPE | NORM | ...
struct ggml_tensor * src[GGML_MAX_SRC]; // input tensors → graph edges
// src[0], src[1] for binary ops
void * data; // pointer to raw bytes (on CPU or device)
char name[64]; // human-readable for debugging
};
ne[4] = dimensions. For a 2D weight matrix W of shape (n_embd × n_vocab): ne = {n_embd, n_vocab, 1, 1}.
// include/llama.h#L953
LLAMA_API int32_t llama_decode(
struct llama_context * ctx,
struct llama_batch batch);
// Internal call chain (src/llama-context.cpp):
// llama_decode()
// → ctx->decode(batch)
// → memory->init_batch(ubatch) // allocate KV cache slots
// → build_graph(ubatch) // construct ggml_cgraph
// → ggml_backend_sched_graph_compute(sched, gf) // execute
// → extract logits from output tensor
Batch splitting: if n_tokens > n_ubatch, the batch is split into micro-batches and decoded sequentially. The KV cache is updated after each micro-batch.
The computation graph for each forward pass implements the decoder-only transformer. Each layer applies attention and FFN with residual connections.
// src/llama-graph.cpp — llm_graph_input_embd // tok_embd: weight matrix [n_embd, n_vocab] (transposed for efficiency) // For each token in the batch, gather its row: struct ggml_tensor * inpL = ggml_get_rows(ctx, model.tok_embd, inp_tokens); // inpL shape: [n_embd, n_tokens] — each col is one token's embedding
ggml_get_rows is essentially an embedding lookup: it gathers the rows of tok_embd at the indices given by inp_tokens (the token IDs from the batch). No matrix multiply — just indexed copies.
RMSNorm replaces LayerNorm in LLaMA-family models. It normalizes by RMS (root-mean-square) without subtracting the mean, which is cheaper and empirically as good.
// Formula: x_norm = x / rms(x) * weight
// where rms(x) = sqrt(mean(x²) + ε)
// ggml/include/ggml.h#L1370
struct ggml_tensor * ggml_rms_norm(
struct ggml_context * ctx,
struct ggml_tensor * a,
float eps);
// In graph builder:
struct ggml_tensor * cur = ggml_rms_norm(ctx0, inpL, hparams.f_norm_rms_eps);
cur = ggml_mul(ctx0, cur, model.layers[il].attn_norm); // scale by weight
// Project hidden state to Q, K, V // Weight shapes: W_q [n_embd_head_k * n_head, n_embd] // W_k [n_embd_head_k * n_kv, n_embd] // W_v [n_embd_head_v * n_kv, n_embd] struct ggml_tensor * Q = ggml_mul_mat(ctx0, model.layers[il].wq, cur); struct ggml_tensor * K = ggml_mul_mat(ctx0, model.layers[il].wk, cur); struct ggml_tensor * V = ggml_mul_mat(ctx0, model.layers[il].wv, cur);
Instead of adding positional encodings to embeddings, RoPE rotates Q and K vectors by an angle proportional to their position. This gives the dot product Q·K a natural position-dependent phase, enabling better length generalization.
// ggml/include/ggml.h#L1772
// Rotates each pair of dimensions by angle = pos * θ^(-2i/d)
struct ggml_tensor * ggml_rope(
struct ggml_context * ctx,
struct ggml_tensor * a, // Q or K tensor
struct ggml_tensor * b, // positions vector
int n_dims,
int mode); // GGML_ROPE_TYPE_NEOX for LLaMA
Q = ggml_rope(ctx0, Q, inp_pos, n_rot, rope_type);
K = ggml_rope(ctx0, K, inp_pos, n_rot, rope_type);
// Store K, V into the KV cache for this layer + sequence position // (KV cache is a pre-allocated tensor; we write into it at the right offset) ggml_build_forward_expand(gf, ggml_cpy(ctx0, K, k_cache_view)); ggml_build_forward_expand(gf, ggml_cpy(ctx0, V, v_cache_view)); // Retrieve all cached K, V (past + current positions) struct ggml_tensor * K_all = kv_cache.k_layer[il]; // [n_embd_gqa, n_ctx] struct ggml_tensor * V_all = kv_cache.v_layer[il]; // [n_ctx, n_embd_gqa] // Attention: softmax(Q @ K^T / sqrt(d_k)) @ V // ggml_mul_mat handles: Q [n_head, n_tokens, head_dim] // x K^T [n_kv, n_ctx, head_dim]^T // = scores [n_head, n_tokens, n_ctx] struct ggml_tensor * KQ = ggml_mul_mat(ctx0, K_all, Q); // Scale by 1/sqrt(head_dim): KQ = ggml_soft_max_ext(ctx0, KQ, mask, scale, max_bias); // Weighted sum of V: struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V_all, KQ);
LLaMA uses SwiGLU (a gated variant of SiLU) instead of the original FFN. The intermediate dimension is typically 2.67× the embedding dimension, and two projections gate each other multiplicatively.
// SwiGLU FFN:
// ffn_out = (SiLU(x @ W_gate) * (x @ W_up)) @ W_down
struct ggml_tensor * gate = ggml_mul_mat(ctx0, model.layers[il].ffn_gate, cur);
struct ggml_tensor * up = ggml_mul_mat(ctx0, model.layers[il].ffn_up, cur);
gate = ggml_silu(ctx0, gate); // SiLU(x) = x * sigmoid(x)
// ggml/include/ggml.h#L1184
struct ggml_tensor * ffn_inp = ggml_mul(ctx0, gate, up); // element-wise
struct ggml_tensor * ffn_out =
ggml_mul_mat(ctx0, model.layers[il].ffn_down, ffn_inp);
For a LLaMA-7B: n_embd=4096, n_ff=11008. The three FFN matrices are: gate [4096×11008], up [4096×11008], down [11008×4096]. These are the largest tensors per layer.
// After all layers, apply final norm + project to vocab
struct ggml_tensor * result_norm = ggml_rms_norm(ctx0, cur, eps);
result_norm = ggml_mul(ctx0, result_norm, model.output_norm);
// output: [n_embd, n_vocab] weight matrix
struct ggml_tensor * result_output =
ggml_mul_mat(ctx0, model.output, result_norm);
// Shape: [n_vocab, n_tokens]
// Only compute logits for tokens with batch.logits[i] != 0
// (usually just the last token during generation)
Without a KV cache, processing the N-th token would require recomputing attention over all N past tokens — O(N²) total work. The KV cache stores key and value tensors from all past positions so only the new token's Q is computed each step.
// KV cache layout (per layer): // K cache: [n_embd_gqa, n_ctx, n_layer] dtype = type_k (F16 default) // V cache: [n_ctx, n_embd_gqa, n_layer] dtype = type_v // During prefill (processing the prompt): // Write K, V for positions 0..n_prompt-1 into the cache // During generation (one new token at a time): // Write K, V for position n_prompt+i at step i // Read full K_cache[0..n_prompt+i], V_cache[0..n_prompt+i] // Compute attention of new Q against ALL cached K // Memory: 2 * n_layer * n_ctx * n_embd_gqa * sizeof(dtype) // LLaMA-7B (F16, 4096 ctx): 2 * 32 * 4096 * 1024 * 2 = 512 MB
The context manages KV cache slots per sequence. Multiple sequences share the same physical cache with different slot assignments, enabling true parallel inference (the server uses this for batching requests).
After graph execution, the logits tensor lives on the backend device (GPU memory for CUDA/Metal). They must be copied to CPU RAM before the sampler can use them.
// src/llama-context.cpp — after ggml_backend_sched_graph_compute(): // t_logits is the output tensor from the graph struct ggml_tensor * t_logits = /* last node in graph */; // Async copy from device → host ggml_backend_tensor_get_async(backend, t_logits, logits_ptr, offset, size); // For the server: logits for ALL sequences' last tokens are stored: // ctx->logits[i * n_vocab .. (i+1)*n_vocab] for each output token i // Access via public API: float * llama_get_logits_ith(struct llama_context * ctx, int32_t i);
LLaMA-2+, Mistral, Qwen2, etc. use Grouped Query Attention (GQA): fewer K/V heads than Q heads. This drastically reduces KV cache size and memory bandwidth without much quality loss.
// Vanilla MHA: n_head_kv == n_head // GQA: n_head_kv < n_head (e.g. n_head=32, n_head_kv=8) // MQA: n_head_kv == 1 (extreme case of GQA) // In llama-graph.cpp, K and V are expanded by repeating heads: K = ggml_repeat(ctx0, K, Q); // broadcast K from n_kv to n_head V = ggml_repeat(ctx0, V, Q); // DeepSeek uses MLA (Multi-head Latent Attention): // Compresses K/V into a low-rank latent space → even smaller cache
Once logits are on the CPU, they're passed to the sampling pipeline.