How the scheduler maps computation graph nodes to CPU, CUDA, Metal, and Vulkan device kernels.
Each backend implements a common interface via vtable-like function pointers. The rest of llama.cpp never calls CUDA or Metal directly — it calls through this interface.
// ggml/include/ggml-backend.h
// A backend = a device + its allocator + its compute engine
typedef struct ggml_backend * ggml_backend_t;
// Core operations every backend implements:
struct ggml_backend_i {
const char * (*get_name)(ggml_backend_t backend);
// Allocate a tensor's data on this device:
void (*buffer_alloc)(ggml_backend_buffer_t buf, struct ggml_tensor * t);
// Execute a computation graph on this backend:
enum ggml_status (*graph_compute)(
ggml_backend_t backend,
struct ggml_cgraph * cgraph);
// Async copy: host ↔ device
void (*tensor_set)(ggml_backend_t, struct ggml_tensor *, const void * data, size_t offset, size_t size);
void (*tensor_get)(ggml_backend_t, const struct ggml_tensor *, void * data, size_t offset, size_t size);
// Synchronize (wait for all async ops to complete):
void (*synchronize)(ggml_backend_t backend);
};
The scheduler is the key piece enabling split inference across devices. It inspects the full graph, decides which device handles each node, allocates memory on the right device, and inserts copy nodes where tensors cross device boundaries.
// ggml/src/ggml-backend.cpp
// Create scheduler with list of backends (in priority order):
ggml_backend_t backends[] = { cuda_backend, cpu_backend };
ggml_backend_sched_t sched =
ggml_backend_sched_new(backends, /*n_backends=*/2, max_nodes);
// Assign: the scheduler calls each backend's can_compute(node) function.
// CUDA backend takes: MUL_MAT, ROPE, SOFT_MAX, NORM, SILU, ADD, ...
// CPU backend takes: ops not supported by GPU, or ops on CPU-only tensors.
// Execute the full graph:
ggml_backend_sched_graph_compute(sched, cgraph);
Models too large to fit in VRAM can be split: the first N layers go to GPU, the rest stay on CPU. This is controlled by n_gpu_layers in llama_model_params.
// include/llama.h#L291
struct llama_model_params {
int32_t n_gpu_layers; // how many transformer layers to offload to GPU
// -1 = all, 0 = CPU only
enum llama_split_mode split_mode;
// LLAMA_SPLIT_MODE_NONE = single GPU
// LLAMA_SPLIT_MODE_LAYER = split layers across GPUs
// LLAMA_SPLIT_MODE_ROW = split individual tensors row-wise
};
// At model load, each layer's tensors are allocated on the assigned device:
// layers[0..n_gpu_layers-1] → CUDA/Metal buffer
// layers[n_gpu_layers..end] → CPU buffer
// tok_embd, output → follow last GPU or CPU
During a forward pass with split layers, the scheduler automatically inserts CPU↔GPU transfer nodes at layer boundaries. The hidden state tensor travels GPU→CPU→GPU as it crosses from offloaded to non-offloaded layers.
// LLAMA_SPLIT_MODE_LAYER: round-robin across GPUs // GPU 0: layers 0..15 // GPU 1: layers 16..31 // LLAMA_SPLIT_MODE_ROW: each GPU holds part of every weight matrix // GPU 0: W[:n_embd/2, :] // GPU 1: W[n_embd/2:, :] // Requires NCCL for all-reduce across GPUs
llama.cpp stores model weights in quantized formats to reduce memory and increase throughput. The backend handles dequantization on-the-fly during matrix multiplication.
| Type | Bits/weight | Size vs F32 | Notes |
|---|---|---|---|
| F32 | 32 | 1× | Full precision, largest |
| F16 | 16 | 0.5× | Standard GPU inference dtype |
| BF16 | 16 | 0.5× | Better range, same size as F16 |
| Q8_0 | 8 | 0.25× | Near-lossless, good for KV cache |
| Q4_K_M | ~4.5 | ~0.14× | Best quality/size for most models |
| Q4_0 | 4 | 0.125× | Faster, slightly lower quality |
| Q3_K_M | ~3.4 | ~0.11× | Very small, some quality loss |
| Q2_K | ~2.5 | ~0.08× | Extreme compression, noticeable quality drop |
| IQ4_XS | ~4.25 | ~0.13× | Importance-matrix quant, better than Q4_0 |
K-quants (Q4_K, Q5_K, Q6_K): group weights into blocks of 256, store a per-block scale + min value. Better quality than uniform quantization at same bit width.
// ggml/include/ggml.h#L389 — quantization types enum
enum ggml_type {
GGML_TYPE_F32 = 0,
GGML_TYPE_F16 = 1,
GGML_TYPE_Q4_0 = 2,
GGML_TYPE_Q4_1 = 3,
GGML_TYPE_Q5_0 = 6,
GGML_TYPE_Q5_1 = 7,
GGML_TYPE_Q8_0 = 8,
// ... K-quants:
GGML_TYPE_Q2_K = 10,
GGML_TYPE_Q3_K = 11,
GGML_TYPE_Q4_K = 12,
GGML_TYPE_Q5_K = 13,
GGML_TYPE_Q6_K = 14,
// imatrix quants:
GGML_TYPE_IQ4_XS = 22,
// ...
};
The CPU backend uses a threadpool. Each ggml_mul_mat is split into row blocks, one per thread. Within each thread, SIMD intrinsics handle the inner loop.
// ggml/src/ggml-cpu/ contains type-specific kernels: // // ggml_vec_dot_q4_0_q8_0(): dot product of Q4_0 × Q8_0 vectors // → depack 4-bit weights to 8-bit // → multiply-accumulate with 8-bit inputs // → uses AVX2: _mm256_maddubs_epi16, _mm256_madd_epi16 // // ggml_vec_dot_f32(): F32 dot product // → AVX2: _mm256_fmadd_ps // → ARM NEON: vfmaq_f32 // Thread count set in llama_context_params.n_threads // (default: physical core count) // For prompt processing (large batch): n_threads_batch is used // For single-token generation: n_threads is used // Recommendation: n_threads = physical cores, n_threads_batch = all cores
// ggml/src/ggml-cuda/ — major kernel files: // // mmq.cu — quantized matrix multiplication (MUL_MAT with Q types) // mul_mat_vec.cu — matrix-vector product (single token generation) // softmax.cu — attention softmax with causal mask // rope.cu — rotary position embedding // norm.cu — RMS normalization // fattn/ — Flash Attention kernel variants // fattn-vec.cuh — vector attention (prefill with small batch) // fattn-tile.cuh — tiled attention (large batch prefill) // Key insight: llama.cpp uses different kernels for: // PREFILL (processing full prompt): // batch size = n_tokens (e.g. 512) // uses large GEMM (batched matmul) kernels → high GPU utilization // // GENERATION (one new token per step): // batch size = 1 (or small with parallel requests) // uses GEMV (matrix-vector) kernels → bandwidth bound, not compute bound
// ggml/src/ggml-metal/ggml-metal.m — Objective-C Metal interface // ggml/src/ggml-metal/ggml-metal.metal — MSL shader kernels // Metal shaders for key operations: // kernel_mul_mat_q4_0_f32 — Q4_0 × F32 matmul // kernel_mul_mat_f16_f32 — F16 × F32 matmul // kernel_soft_max — causal softmax // kernel_rope_norm — RoPE with normalization // kernel_rms_norm — RMS normalization // Apple Silicon advantage: unified memory // CPU and GPU share the same physical RAM → no PCIe transfer overhead // Model weights loaded once into memory, GPU accesses them directly // Flash Attention on Metal: // kernel_flash_attn_ext_f16 — tiled flash attention kernel // Very efficient on M-series due to large GPU L2 cache
llama.cpp memory-maps (mmap) the GGUF file by default. This means weight tensors point directly into the mapped file pages — no upfront copy into RAM. Pages are loaded on demand by the OS as the GPU reads them.
// src/llama-model.cpp — model loading // Default: use_mmap=true // Weight tensor data pointers → mmap'd file pages // OS loads pages lazily when GPU DMA-reads them // When use_mmap=false: // All tensors explicitly copied to heap // Slower startup but avoids page-fault latency during inference // For GPU offloading: // CPU-resident tensors: mmap pages (or heap copy) // GPU-resident tensors: allocated in VRAM, weights copied during load // (ggml_backend_tensor_set copies from host mmap → device VRAM) // Memory pinning: for faster host→GPU transfers, CPU buffers // can be pinned (CUDA: cudaHostRegister) to enable DMA
After device kernels execute and logits are on CPU, sampling selects the next token.