Tokenization

Converting a raw text string into a sequence of integer token IDs — and back again.

Sources: src/llama-vocab.cpp, src/llama-vocab.h, llama.h#L1125 (llama_tokenize()). Further reading: BPE paper (Sennrich 2016), SentencePiece paper.

What is a Token?

Language models don't operate on characters or words — they operate on tokens: variable-length subword units chosen to balance vocabulary size against sequence length. The mapping between strings and token IDs is fixed after training.

Why subword tokenization? Pure character models need very long sequences; pure word models can't handle rare or novel words. BPE and SentencePiece find a middle ground: common words become single tokens, rare words split into meaningful pieces. "tokenization" → ["token", "ization"].

Tokenizer Types in llama.cpp

The tokenizer type is stored in the GGUF file and read at model load time. llama.cpp supports 6+ types covering essentially all modern LLM families.

BPE: Byte-Pair Encoding. Iteratively merges the most frequent adjacent pair of symbols (starting from bytes) until the target vocabulary size is reached. Models: GPT-2, GPT-4, LLaMA-3, Qwen, Mistral.
SPM: SentencePiece Model. Processes raw text without language-specific pre-splitting, so it is language-agnostic; supports byte fallback for characters missing from the vocabulary. Models: LLaMA-1/2, Gemma.
WPM: WordPiece. Similar to BPE but chooses merges by a likelihood criterion instead of raw frequency. Models: BERT, DistilBERT.
UGM: Unigram Language Model. Trains a probabilistic model over candidate pieces and prunes the vocabulary down to the target size. Models: T5, mT5, ALBERT.
RWKV: Greedy tokenizer. Simple and fast. Models: RWKV.
NONE: No tokenizer. The model expects raw token IDs (embeddings-only mode). Models: embedding models.
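
As a rough illustration of how the type could be selected at load time, the sketch below maps the tokenizer.ggml.model string to a tokenizer type. The enum ToyVocabType and the function vocab_type_from_model are hypothetical names invented for this example, and the string values beyond "gpt2", "llama" and "bert" are assumptions about what GGUF files typically contain, not an exhaustive or authoritative list.

```cpp
#include <string>

// Hypothetical enum for illustration only; llama.cpp defines its own type.
enum class ToyVocabType { NONE, SPM, BPE, WPM, UGM, RWKV, UNKNOWN };

// Map the GGUF "tokenizer.ggml.model" string to a tokenizer type.
// The accepted strings are assumptions, see the note above.
ToyVocabType vocab_type_from_model(const std::string & model) {
    if (model == "no_vocab") return ToyVocabType::NONE;
    if (model == "llama")    return ToyVocabType::SPM;
    if (model == "gpt2")     return ToyVocabType::BPE;
    if (model == "bert")     return ToyVocabType::WPM;
    if (model == "t5")       return ToyVocabType::UGM;
    if (model == "rwkv")     return ToyVocabType::RWKV;
    return ToyVocabType::UNKNOWN;  // unrecognized value: caller should fail the load
}
```

Returning an explicit UNKNOWN value instead of guessing keeps a mismatched file from being tokenized with the wrong algorithm.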

Tokenization Example

How "Hello world" gets tokenized with a LLaMA-3 BPE vocabulary:

"Hello world"
"Hello"  → 9906
" world" → 1917

Notice the space is part of the token " world" — BPE encodes whitespace as part of the following token, which is why the leading space matters for correct round-tripping.
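
To make the round-trip point concrete, here is a minimal sketch that detokenizes by plain concatenation of token pieces. The two-entry table kToyPieces and the function toy_detokenize are hypothetical; only the two IDs come from the example above.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Toy vocabulary holding only the two pieces from the example.
// Note that the second piece carries its leading space.
static const std::unordered_map<int32_t, std::string> kToyPieces = {
    {9906, "Hello"},
    {1917, " world"},
};

// Detokenize by concatenating pieces; unknown IDs are skipped.
std::string toy_detokenize(const std::vector<int32_t> & ids) {
    std::string out;
    for (int32_t id : ids) {
        auto it = kToyPieces.find(id);
        if (it != kToyPieces.end()) {
            out += it->second;
        }
    }
    return out;
}
```

Because the space lives inside the piece, no separator has to be inserted between tokens; storing "world" without it would yield "Helloworld".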

API: Tokenize & Detokenize

llama_tokenize() — text → token IDs
// include/llama.h#L1125
LLAMA_API int32_t llama_tokenize(
    const struct llama_vocab * vocab,
    const char * text,
    int32_t      text_len,
    llama_token * tokens,     // output array
    int32_t       n_tokens_max,
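    bool          add_special, // add BOS/EOS if the model is configured to use them
    bool          parse_special); // treat <|im_start|> etc as tokens?

Returns the number of tokens written on success. If the output buffer is too small (including tokens=NULL with n_tokens_max=0), it returns the negative of the required count, so negate the result of a NULL call to size the buffer. Pass the real byte length of the text in text_len.

// Usage pattern (text is a NUL-terminated const char *; needs <cstring> and <vector>):
const int32_t text_len = (int32_t) strlen(text);
const int32_t n = -llama_tokenize(vocab, text, text_len, NULL, 0, true, false);  // query count
std::vector<llama_token> ids(n);
if (llama_tokenize(vocab, text, text_len, ids.data(), n, true, false) < 0) { /* handle error */ }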
llama_token_to_piece() — token ID → text
// include/llama.h#L1147
LLAMA_API int32_t llama_token_to_piece(
    const struct llama_vocab * vocab,
    llama_token   token,
    char        * buf,
    int32_t       length,
    int32_t       lstrip,    // skip up to this many leading spaces
    bool          special);  // render special tokens as text?

// Returns number of bytes written; the buffer is not null-terminated.
// Negative return = buffer too small (abs value = needed size)
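
The negative-return convention suggests a resize-and-retry wrapper. The sketch below shows that pattern against a stand-in function, toy_token_to_piece, which only imitates the return convention described above and is not the real llama_token_to_piece(); piece_to_string is likewise a hypothetical helper name.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Stand-in with the same return convention: bytes written on success,
// -(needed size) if the buffer is too small. NOT the real llama.cpp call.
int32_t toy_token_to_piece(int32_t token, char * buf, int32_t length) {
    const char * piece = (token == 9906) ? "Hello"
                       : (token == 1917) ? " world"
                       : "";
    const int32_t n = (int32_t) std::strlen(piece);
    if (n > length) {
        return -n;
    }
    std::memcpy(buf, piece, (size_t) n);  // no null terminator is written
    return n;
}

// Convert one token to a std::string, growing the buffer if needed.
std::string piece_to_string(int32_t token) {
    std::string buf(4, '\0');  // deliberately small first guess
    int32_t n = toy_token_to_piece(token, &buf[0], (int32_t) buf.size());
    if (n < 0) {               // too small: -n is the required size
        buf.resize((size_t) -n);
        n = toy_token_to_piece(token, &buf[0], (int32_t) buf.size());
    }
    buf.resize((size_t) n);    // trim to the bytes actually written
    return buf;
}
```

Swapping the stand-in for the real function would mean passing the vocab pointer plus the lstrip and special arguments; the retry logic itself would stay the same.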

Internal Tokenization Flow (BPE)

Step-by-step: how BPE tokenizes a string

BPE tokenization is a two-stage greedy process:

Stage 1: Pre-tokenization (regex splitting)

The raw string is first split by a regex pattern (model-specific) into "words". For GPT-2/LLaMA-3 style BPE, this separates punctuation, handles whitespace, and prevents merges across word boundaries.

// Example regex (truncated; LLaMA-3 / cl100k-style pattern):
// "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|..."
// "Hello world!" → ["Hello", " world", "!"]
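
The splitting step can be approximated without a Unicode-aware regex engine. The following is a simplified, ASCII-only sketch: it groups runs of letters, digits, whitespace, and other characters, and lets a single space attach to the run that follows it. The names CharClass, classify, and pre_tokenize_ascii are hypothetical, and the real patterns handle contractions, Unicode categories, and trailing whitespace differently.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class CharClass { Letter, Digit, Space, Other };

static CharClass classify(unsigned char c) {
    if (std::isalpha(c)) return CharClass::Letter;
    if (std::isdigit(c)) return CharClass::Digit;
    if (std::isspace(c)) return CharClass::Space;
    return CharClass::Other;
}

// Split ASCII text into pre-tokens: runs of one character class,
// optionally preceded by exactly one space.
std::vector<std::string> pre_tokenize_ascii(const std::string & text) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < text.size()) {
        const std::size_t start = i;
        // A single space attaches to the non-space run that follows it.
        if (text[i] == ' ' && i + 1 < text.size() &&
            classify(text[i + 1]) != CharClass::Space) {
            ++i;
        }
        const CharClass cls = classify(text[i]);
        while (i < text.size() && classify(text[i]) == cls) {
            ++i;
        }
        out.push_back(text.substr(start, i - start));
    }
    return out;
}
```

For the input "Hello world!" this yields the same three pieces as the example: "Hello", " world", and "!".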

Stage 2: BPE merge sequence

Each pre-token is converted to UTF-8 bytes, then the BPE merge rules (learned during training) are applied greedily in priority order:

// "Hello" as bytes: ['H','e','l','l','o']
// Merge table lookup (highest priority first):
// ('H','e') → 'He'       → ['He','l','l','o']
// ('l','l') → 'll'       → ['He','ll','o']
// ('He','ll') → 'Hell'   → ['Hell','o']
// ('Hell','o') → 'Hello' → ['Hello']   ← single token!

// Result: token_id for "Hello" = 9906
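
The merge steps above can be sketched as a small loop: start from single bytes, repeatedly find the adjacent pair with the best (lowest) rank in the merge table, and fuse it until no known pair remains. This is a minimal illustration, not llama.cpp's implementation; MergeRanks and bpe_merge are hypothetical names, and a production tokenizer would use a priority queue instead of rescanning the whole word each round.

```cpp
#include <climits>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Merge table: (left, right) -> rank, where a lower rank means higher priority.
using MergeRanks = std::map<std::pair<std::string, std::string>, int>;

// Apply BPE merges to one pre-token and return the resulting pieces.
std::vector<std::string> bpe_merge(const std::string & word, const MergeRanks & ranks) {
    std::vector<std::string> syms;
    for (char c : word) {
        syms.push_back(std::string(1, c));  // start from single bytes
    }
    while (syms.size() > 1) {
        int best_rank = INT_MAX;
        std::size_t best_pos = 0;
        for (std::size_t i = 0; i + 1 < syms.size(); ++i) {
            auto it = ranks.find(std::make_pair(syms[i], syms[i + 1]));
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_pos = i;
            }
        }
        if (best_rank == INT_MAX) {
            break;  // no adjacent pair appears in the merge table
        }
        syms[best_pos] += syms[best_pos + 1];
        syms.erase(syms.begin() + (std::ptrdiff_t) best_pos + 1);
    }
    return syms;
}
```

With the four merges from the walkthrough ranked 0 to 3, "Hello" collapses to a single piece, while a word such as "Help" stops at "He", "l", "p" because no further pair is in the table.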
SentencePiece (SPM) differences

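SPM doesn't pre-tokenize with a regex: spaces are escaped and the whole input is processed as one sequence of UTF-8 characters. llama.cpp's SPM tokenizer then greedily merges adjacent pieces, always choosing the pair whose merged piece has the highest score in the vocabulary. Viterbi segmentation under a unigram language model is a different algorithm; it is what the UGM tokenizer type uses.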

// SPM special handling:
// 1. Normalizes unicode (NFKC or custom rules)
// 2. Replaces each space with '▁' (U+2581), usually adding one at the start of the input
// 3. Byte fallback: unknown bytes → <0xHH> tokens

// "Hello world" with SPM:
// → ['▁Hello', '▁world']   (if both are in vocab)
// → ['▁He', 'llo', '▁world']  (if '▁Hello' not in vocab)
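
Two of the SPM conventions listed above are easy to show in isolation: whitespace escaping and the textual form of byte-fallback tokens. The sketch below assumes a marker is always prefixed to the input and that fallback tokens use two uppercase hex digits; spm_escape and byte_fallback_piece are hypothetical helper names, and normalization is left out entirely.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Replace each ASCII space with U+2581 (UTF-8 bytes E2 96 81) and
// prefix one marker, so "Hello world" becomes "▁Hello▁world".
std::string spm_escape(const std::string & text) {
    const std::string marker = "\xE2\x96\x81";
    std::string out = marker;
    for (char c : text) {
        if (c == ' ') {
            out += marker;
        } else {
            out += c;
        }
    }
    return out;
}

// Format the byte-fallback token text for one byte, e.g. 0x0A -> "<0x0A>".
std::string byte_fallback_piece(uint8_t byte) {
    char buf[8];
    std::snprintf(buf, sizeof(buf), "<0x%02X>", (unsigned) byte);
    return std::string(buf);
}
```

A character missing from the vocabulary would be emitted as one such fallback token per UTF-8 byte, which is what keeps the mapping lossless.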

Special Tokens

BOS, EOS, PAD and chat template tokens

Every model defines special tokens stored in the GGUF vocabulary metadata. These are not subwords learned from the training text but reserved IDs with specific semantic meaning.

// include/llama.h — special token accessors
LLAMA_API llama_token llama_vocab_bos(const struct llama_vocab * vocab); // begin-of-sequence
LLAMA_API llama_token llama_vocab_eos(const struct llama_vocab * vocab); // end-of-sequence
LLAMA_API llama_token llama_vocab_eot(const struct llama_vocab * vocab); // end-of-turn
LLAMA_API llama_token llama_vocab_sep(const struct llama_vocab * vocab); // separator

// Check if token signals end of generation:
LLAMA_API bool llama_vocab_is_eog(const struct llama_vocab * vocab, llama_token token);

Chat-format models (instruct/assistant variants) additionally use role-delimiter tokens like <|im_start|>, <|im_end|>, [INST], <|eot_id|>. The chat template (Jinja2-based, stored in the GGUF metadata) inserts them into the prompt as text; they map to their reserved IDs only when the prompt is tokenized with parse_special=true.
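
One way to picture special-token parsing is as a partitioning pass that runs before ordinary tokenization: the text is cut at every occurrence of a known special-token string, and only the remaining fragments go through BPE or SPM. The sketch below is a simplified illustration of that idea under the assumption that specials are matched as literal substrings; Fragment and split_on_specials are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Fragment {
    std::string text;
    bool        is_special;  // true: maps directly to a reserved token ID
};

// Split text into special-token fragments and ordinary text fragments.
// On ties at the same position, the longest special string wins.
std::vector<Fragment> split_on_specials(const std::string & text,
                                        const std::vector<std::string> & specials) {
    std::vector<Fragment> out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t best = std::string::npos;
        std::size_t best_len = 0;
        for (const std::string & s : specials) {
            if (s.empty()) continue;
            const std::size_t p = text.find(s, pos);
            if (p == std::string::npos) continue;
            if (p < best || (p == best && s.size() > best_len)) {
                best = p;
                best_len = s.size();
            }
        }
        if (best == std::string::npos) {
            out.push_back({text.substr(pos), false});  // no more specials
            break;
        }
        if (best > pos) {
            out.push_back({text.substr(pos, best - pos), false});
        }
        out.push_back({text.substr(best, best_len), true});
        pos = best + best_len;
    }
    return out;
}
```

Skipping this pass leaves a string such as <|im_start|> to be tokenized as ordinary characters, so it reaches the model as several unrelated tokens instead of the single reserved ID.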

Vocabulary Structure in GGUF

How tokenizer data is stored in the model file

The GGUF file format stores tokenizer data as key-value metadata entries. These are loaded before any tensors.

// GGUF metadata keys for vocabulary:
"tokenizer.ggml.model"        → "gpt2" | "llama" | "bert" | ...
"tokenizer.ggml.tokens"       → array of token strings (length = n_vocab)
"tokenizer.ggml.scores"       → array of log-probabilities (for SPM)
"tokenizer.ggml.token_type"   → NORMAL | UNKNOWN | CONTROL | USER_DEFINED | BYTE
"tokenizer.ggml.merges"       → BPE merge rules (ordered by priority)
"tokenizer.ggml.bos_token_id" → integer
"tokenizer.ggml.eos_token_id" → integer
"tokenizer.chat_template"     → Jinja2 template string

At load time, llama_vocab builds hash maps from token strings to IDs and vice versa, plus the merge priority table for BPE.
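
A minimal sketch of the lookup structures described above, built from a token array and a merge array, might look like this. ToyVocab and build_toy_vocab are hypothetical names, and the sketch assumes each merge entry is stored as "left right" with a single space separator; llama.cpp's actual containers and parsing details differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct ToyVocab {
    std::vector<std::string>                           id_to_token;  // ID -> string
    std::unordered_map<std::string, int32_t>           token_to_id;  // string -> ID
    std::map<std::pair<std::string, std::string>, int> merge_rank;   // lower = earlier
};

// Build the lookup tables from the two metadata arrays.
ToyVocab build_toy_vocab(const std::vector<std::string> & tokens,
                         const std::vector<std::string> & merges) {
    ToyVocab v;
    v.id_to_token = tokens;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        v.token_to_id[tokens[i]] = (int32_t) i;  // array index is the token ID
    }
    for (std::size_t r = 0; r < merges.size(); ++r) {
        const std::size_t sp = merges[r].find(' ');
        if (sp == std::string::npos) {
            continue;  // malformed entry: skipped in this sketch
        }
        v.merge_rank[std::make_pair(merges[r].substr(0, sp),
                                    merges[r].substr(sp + 1))] = (int) r;
    }
    return v;
}
```

The position of a merge in the array doubles as its priority, which is why the order of tokenizer.ggml.merges must be preserved when a model is converted.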

What Happens After Tokenization

The integer token IDs produced here are placed into llama_batch.token[]. During the forward pass, each token ID is used as an index into the tok_embd weight matrix to retrieve that token's learned embedding vector — the first step of the transformer computation.
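
The embedding lookup amounts to selecting one row of a matrix. The sketch below shows this on a flat, row-major float array; embed_token is a hypothetical helper, and the real lookup runs inside the compute graph on tensors that may be quantized, so this is only a conceptual picture.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Return a copy of the embedding row for one token.
// tok_embd is a row-major [n_vocab x n_embd] matrix stored flat.
std::vector<float> embed_token(const std::vector<float> & tok_embd,
                               std::size_t n_embd, int32_t token) {
    if (n_embd == 0 || token < 0 ||
        ((std::size_t) token + 1) * n_embd > tok_embd.size()) {
        throw std::out_of_range("token ID outside embedding matrix");
    }
    const std::ptrdiff_t offset = (std::ptrdiff_t) ((std::size_t) token * n_embd);
    return std::vector<float>(tok_embd.begin() + offset,
                              tok_embd.begin() + offset + (std::ptrdiff_t) n_embd);
}
```

The bounds check matters in practice: a token ID produced with one vocabulary and fed to a model with a smaller one would otherwise read past the end of the matrix.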
