Computation Graph & Transformer

How llama.cpp builds a lazy computation graph for the transformer forward pass, then executes it.

src/llama-graph.cpp src/llama-context.cpp ggml/include/ggml.h Attention Is All You Need RoPE paper RMSNorm paper SwiGLU / GLU Variants

The Two-Phase Design: Build then Execute

ggml uses a deferred computation model. When you call ggml_mul_mat(ctx, a, b), no math happens — instead a ggml_tensor node is created with op=MUL_MAT and pointers to its inputs. The full transformer is described as a DAG of such nodes. Only when you call ggml_graph_compute() does execution begin.

Why lazy graphs? The scheduler can inspect the entire graph before running it, enabling: optimal memory allocation, node assignment to the best device (CPU/GPU), kernel fusion, and graph caching to avoid rebuild cost on subsequent decode steps.

ggml_tensor — The Graph Node

ggml_tensor structure — shape, operation, and data pointer

// ggml/include/ggml.h#L666
struct ggml_tensor {
    enum ggml_type type;    // F32 | F16 | BF16 | Q4_0 | Q8_0 | ...
                            //   determines per-element byte size

    int64_t ne[4];          // shape: [cols, rows, batch, batch2]
                            //   ne[0]=cols is contiguous (innermost)
                            //   e.g. a 4096x4096 matrix: ne={4096,4096,1,1}

    size_t  nb[4];          // byte strides: nb[0]=element_size, nb[1]=row_bytes, ...
                            //   enables non-contiguous views without copying

    enum ggml_op op;        // NONE | MUL_MAT | ADD | SOFT_MAX | ROPE | NORM | ...

    struct ggml_tensor * src[GGML_MAX_SRC];  // input tensors → graph edges
                                              //   src[0], src[1] for binary ops

    void * data;            // pointer to raw bytes (on CPU or device)
    char   name[64];        // human-readable for debugging
};

ne[4] = dimensions. For a 2D weight matrix W of shape (n_embd × n_vocab): ne = {n_embd, n_vocab, 1, 1}.

ggml.h#L666 — ggml_tensor

The llama_decode() Entry Point

What happens when you call llama_decode(ctx, batch)

// include/llama.h#L953
LLAMA_API int32_t llama_decode(
    struct llama_context * ctx,
    struct llama_batch     batch);

// Internal call chain (src/llama-context.cpp):
// llama_decode()
//   → ctx->decode(batch)
//     → memory->init_batch(ubatch)   // allocate KV cache slots
//     → build_graph(ubatch)          // construct ggml_cgraph
//     → ggml_backend_sched_graph_compute(sched, gf)  // execute
//     → extract logits from output tensor

Batch splitting: if n_tokens > n_ubatch, the batch is split into micro-batches and decoded sequentially. The KV cache is updated after each micro-batch.

llama-context.cpp — decode() llama.h#L953

Transformer Architecture (LLaMA-style)

The computation graph for each forward pass implements the decoder-only transformer. Each layer applies attention and FFN with residual connections.

Token IDs → tok_embd lookup → [n_tokens, n_embd]

↓ repeated × n_layer

RMSNorm → Q, K, V projections → RoPE → Attention + KV cache

O projection → + residual

RMSNorm → FFN: gate, up → SiLU × up → down → + residual

↓

final RMSNorm → output projection → [n_outputs, n_vocab] ← logits

Graph Construction: Key Operations

1. Embedding lookup — token IDs → vectors

// src/llama-graph.cpp — llm_graph_input_embd

// tok_embd: weight matrix [n_embd, n_vocab] (transposed for efficiency)
// For each token in the batch, gather its row:
struct ggml_tensor * inpL = ggml_get_rows(ctx, model.tok_embd, inp_tokens);
// inpL shape: [n_embd, n_tokens]  — each col is one token's embedding

ggml_get_rows is essentially an embedding lookup: it gathers the rows of tok_embd at the indices given by inp_tokens (the token IDs from the batch). No matrix multiply — just indexed copies.

ggml.h — ggml_get_rows()

2. RMSNorm — normalize hidden states

RMSNorm replaces LayerNorm in LLaMA-family models. It normalizes by RMS (root-mean-square) without subtracting the mean, which is cheaper and empirically as good.

// Formula: x_norm = x / rms(x) * weight
// where rms(x) = sqrt(mean(x²) + ε)

// ggml/include/ggml.h#L1370
struct ggml_tensor * ggml_rms_norm(
    struct ggml_context * ctx,
    struct ggml_tensor  * a,
    float                 eps);

// In graph builder:
struct ggml_tensor * cur = ggml_rms_norm(ctx0, inpL, hparams.f_norm_rms_eps);
cur = ggml_mul(ctx0, cur, model.layers[il].attn_norm);  // scale by weight

ggml.h#L1370 — ggml_rms_norm() RMSNorm paper (Zhang 2019)

3. Multi-Head Attention with RoPE

Q, K, V Projections

// Project hidden state to Q, K, V
// Weight shapes: W_q [n_embd_head_k * n_head, n_embd]
//                W_k [n_embd_head_k * n_kv,   n_embd]
//                W_v [n_embd_head_v * n_kv,   n_embd]

struct ggml_tensor * Q = ggml_mul_mat(ctx0, model.layers[il].wq, cur);
struct ggml_tensor * K = ggml_mul_mat(ctx0, model.layers[il].wk, cur);
struct ggml_tensor * V = ggml_mul_mat(ctx0, model.layers[il].wv, cur);

RoPE — Rotary Position Embedding

Instead of adding positional encodings to embeddings, RoPE rotates Q and K vectors by an angle proportional to their position. This gives the dot product Q·K a natural position-dependent phase, enabling better length generalization.

// ggml/include/ggml.h#L1772
// Rotates each pair of dimensions by angle = pos * θ^(-2i/d)
struct ggml_tensor * ggml_rope(
    struct ggml_context * ctx,
    struct ggml_tensor  * a,    // Q or K tensor
    struct ggml_tensor  * b,    // positions vector
    int                   n_dims,
    int                   mode); // GGML_ROPE_TYPE_NEOX for LLaMA

Q = ggml_rope(ctx0, Q, inp_pos, n_rot, rope_type);
K = ggml_rope(ctx0, K, inp_pos, n_rot, rope_type);

ggml.h#L1772 — ggml_rope() RoPE paper (Su 2021)

KV Cache and Attention Score

// Store K, V into the KV cache for this layer + sequence position
// (KV cache is a pre-allocated tensor; we write into it at the right offset)
ggml_build_forward_expand(gf, ggml_cpy(ctx0, K, k_cache_view));
ggml_build_forward_expand(gf, ggml_cpy(ctx0, V, v_cache_view));

// Retrieve all cached K, V (past + current positions)
struct ggml_tensor * K_all = kv_cache.k_layer[il]; // [n_embd_gqa, n_ctx]
struct ggml_tensor * V_all = kv_cache.v_layer[il]; // [n_ctx, n_embd_gqa]

// Attention: softmax(Q @ K^T / sqrt(d_k)) @ V
// ggml_mul_mat handles: Q [n_head, n_tokens, head_dim]
//                  x K^T [n_kv, n_ctx, head_dim]^T
//                  = scores [n_head, n_tokens, n_ctx]
struct ggml_tensor * KQ = ggml_mul_mat(ctx0, K_all, Q);
// Scale by 1/sqrt(head_dim):
KQ = ggml_soft_max_ext(ctx0, KQ, mask, scale, max_bias);
// Weighted sum of V:
struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V_all, KQ);

ggml.h#L1716 — ggml_soft_max() ggml.h#L1417 — ggml_mul_mat() Attention Is All You Need FlashAttention

4. Feed-Forward Network with SwiGLU

LLaMA uses SwiGLU (a gated variant of SiLU) instead of the original FFN. The intermediate dimension is typically 2.67× the embedding dimension, and two projections gate each other multiplicatively.

// SwiGLU FFN:
// ffn_out = (SiLU(x @ W_gate) * (x @ W_up)) @ W_down

struct ggml_tensor * gate = ggml_mul_mat(ctx0, model.layers[il].ffn_gate, cur);
struct ggml_tensor * up   = ggml_mul_mat(ctx0, model.layers[il].ffn_up,   cur);

gate = ggml_silu(ctx0, gate);       // SiLU(x) = x * sigmoid(x)
                                    // ggml/include/ggml.h#L1184
struct ggml_tensor * ffn_inp = ggml_mul(ctx0, gate, up);  // element-wise

struct ggml_tensor * ffn_out =
    ggml_mul_mat(ctx0, model.layers[il].ffn_down, ffn_inp);

For a LLaMA-7B: n_embd=4096, n_ff=11008. The three FFN matrices are: gate [4096×11008], up [4096×11008], down [11008×4096]. These are the largest tensors per layer.

ggml.h#L1184 — ggml_silu() SwiGLU / GLU Variants (Noam Shazeer)

5. Output projection → logits

// After all layers, apply final norm + project to vocab
struct ggml_tensor * result_norm = ggml_rms_norm(ctx0, cur, eps);
result_norm = ggml_mul(ctx0, result_norm, model.output_norm);

// output: [n_embd, n_vocab] weight matrix
struct ggml_tensor * result_output =
    ggml_mul_mat(ctx0, model.output, result_norm);
// Shape: [n_vocab, n_tokens]

// Only compute logits for tokens with batch.logits[i] != 0
// (usually just the last token during generation)

llama-graph.cpp — build output

KV Cache

How the KV cache enables O(1) per-token cost

Without a KV cache, processing the N-th token would require recomputing attention over all N past tokens — O(N²) total work. The KV cache stores key and value tensors from all past positions so only the new token's Q is computed each step.

// KV cache layout (per layer):
// K cache: [n_embd_gqa, n_ctx, n_layer]  dtype = type_k (F16 default)
// V cache: [n_ctx, n_embd_gqa, n_layer]  dtype = type_v

// During prefill (processing the prompt):
//   Write K, V for positions 0..n_prompt-1 into the cache

// During generation (one new token at a time):
//   Write K, V for position n_prompt+i at step i
//   Read full K_cache[0..n_prompt+i], V_cache[0..n_prompt+i]
//   Compute attention of new Q against ALL cached K

// Memory: 2 * n_layer * n_ctx * n_embd_gqa * sizeof(dtype)
// LLaMA-7B (F16, 4096 ctx): 2 * 32 * 4096 * 1024 * 2 = 512 MB

The context manages KV cache slots per sequence. Multiple sequences share the same physical cache with different slot assignments, enabling true parallel inference (the server uses this for batching requests).

src/llama-kv-cache.cpp FlashAttention (efficient attention)

Logits Extraction

How output is copied from device to host

After graph execution, the logits tensor lives on the backend device (GPU memory for CUDA/Metal). They must be copied to CPU RAM before the sampler can use them.

// src/llama-context.cpp — after ggml_backend_sched_graph_compute():

// t_logits is the output tensor from the graph
struct ggml_tensor * t_logits = /* last node in graph */;

// Async copy from device → host
ggml_backend_tensor_get_async(backend, t_logits, logits_ptr, offset, size);

// For the server: logits for ALL sequences' last tokens are stored:
// ctx->logits[i * n_vocab .. (i+1)*n_vocab]  for each output token i

// Access via public API:
float * llama_get_logits_ith(struct llama_context * ctx, int32_t i);

llama.h#L988 — llama_get_logits() llama-context.cpp

Grouped-Query Attention (GQA) & MLA

How modern attention variants differ from vanilla MHA

LLaMA-2+, Mistral, Qwen2, etc. use Grouped Query Attention (GQA): fewer K/V heads than Q heads. This drastically reduces KV cache size and memory bandwidth without much quality loss.

// Vanilla MHA: n_head_kv == n_head
// GQA:         n_head_kv < n_head  (e.g. n_head=32, n_head_kv=8)
// MQA:         n_head_kv == 1       (extreme case of GQA)

// In llama-graph.cpp, K and V are expanded by repeating heads:
K = ggml_repeat(ctx0, K, Q);  // broadcast K from n_kv to n_head
V = ggml_repeat(ctx0, V, Q);

// DeepSeek uses MLA (Multi-head Latent Attention):
// Compresses K/V into a low-rank latent space → even smaller cache

GQA paper (Ainslie 2023) DeepSeek-V2 MLA

Once logits are on the CPU, they're passed to the sampling pipeline.

→ Sampling Pipeline → GGML Backends (how kernels run)