Stage 5 — B-Tree Engine
btree.c — Manages tables and indexes as B-trees over fixed-size database pages
B-Tree Fundamentals

SQLite stores every table and index as a separate B-tree within the database file. The file is divided into pages of a fixed size (default 4096 bytes). Each page is either an interior node (holds only keys + child pointers) or a leaf node (holds keys + data).

ROOT PAGE 2
key ranges
+ child ptrs
↙ ↓ ↘
INTERIOR
keys+ptrs
INTERIOR
keys+ptrs
LEAF
rowid + record
LEAF
rowid + record
LEAF
rowid + record

SQLite uses a variant called a B*-tree: all data lives in leaf pages; interior pages hold only separator keys and child page numbers. Tables are sorted by rowid (or the PRIMARY KEY for WITHOUT ROWID tables); indexes are sorted by the indexed column values.

Two B-tree flavors: Table B-trees use the 64-bit rowid as the key and store full row records in leaves. Index B-trees use the indexed column(s) plus rowid as the key; leaves contain no additional data (the rowid is used to look up the full record in the table B-tree).
Key Functions
btree.c:4775 — Open a cursor on a B-tree by root page number
int sqlite3BtreeCursor(
  Btree *p,              /* the btree */
  Pgno iTable,           /* root page of the table/index */
  int wrFlag,            /* 1 = write cursor, 0 = read-only */
  struct KeyInfo *pKeyInfo, /* column affinity/collation (indexes only) */
  BtCursor *pCur         /* OUT: write new cursor here */
){
  if( p->sharable ){
    return btreeCursorWithLock(p, iTable, wrFlag, pKeyInfo, pCur);
  }else{
    return btreeCursor(p, iTable, wrFlag, pKeyInfo, pCur);
  }
}
/* The cursor tracks:
   - a page stack from root → leaf (the "path")
   - the cell index within the current leaf page
   - whether the cursor is valid (pointing at an existing entry) */
btree.c:6387 — Advance cursor to the next entry
int sqlite3BtreeNext(BtCursor *pCur, int flags){
  /* Moves within the current leaf page first.
     If we reach the last cell, ascend to the parent,
     then descend into the next subtree. */
  MemPage *pPage = pCur->pPage;
  int idx = pCur->ix;

  idx++;
  if( idx >= pPage->nCell ){
    /* move to parent or return SQLITE_DONE */
    return moveToParent(pCur);
  }
  pCur->ix = idx;
  return SQLITE_OK;
}
BtCursor — Position Tracking

A BtCursor tracks the cursor's current position in the B-tree as a stack of (page, cell-index) pairs — the path from the root down to the current leaf entry.

/* btreeInt.h — BtCursor (simplified) */
struct BtCursor {
  u8 eState;            /* CURSOR_VALID, CURSOR_INVALID, CURSOR_SKIPNEXT */
  u8 curFlags;          /* read-only? etc. */
  i16 iPage;            /* index of current level in apPage[] */
  u16 ix;               /* current cell index on apPage[iPage] */
  Pgno pgnoRoot;        /* root page of this B-tree */
  BtShared *pBt;        /* shared B-tree state */
  MemPage *pPage;       /* current page (= apPage[iPage]) */
  MemPage *apPage[BTCURSOR_MAX_DEPTH]; /* page stack root→leaf */
  u16     aiIdx[BTCURSOR_MAX_DEPTH];  /* cell index at each level */
  KeyInfo *pKeyInfo;    /* collation for index cursors */
  ...
};
Page Layout

Each page follows a defined binary format. The first 100 bytes of page 1 are the database file header. Every page starts with a small page header followed by a cell pointer array (offsets into the page content area), then the cell content itself at the end of the page.

Page layout (4096 bytes):
┌─────────────────────────────────┐ 0
│  Page header (8 or 12 bytes)    │
│  flags, freeblocks, ncells,     │
│  content area offset, ...       │
├─────────────────────────────────┤ 8/12
│  Cell pointer array             │
│  [offset₀, offset₁, offset₂…]  │ 2 bytes each
├─────────────────────────────────┤
│  Unallocated space              │
│  (grows down ↓)                 │
├─────────────────────────────────┤
│  Cell content area              │ ← cells stored right-to-left
│  [cell₂ | cell₁ | cell₀]       │
└─────────────────────────────────┘ 4096

For a table leaf page, each cell encodes: varint(payload_length) + varint(rowid) + payload_data. For an index leaf page, each cell encodes: varint(payload_length) + payload_data (key columns + rowid packed into the key).

Record Format

Row data (the "payload" in leaf cells) uses SQLite's packed record format. A record header tells the type and size of each column value; column values follow in order.

Record format:
┌────────────────┬──────────────────────────────┬──────────────────┐
│ header_size    │ serial_type₀ serial_type₁ …  │ value₀ value₁ … │
│ (varint)       │ (varint each)                 │ (variable width) │
└────────────────┴──────────────────────────────┴──────────────────┘

Serial type encoding:
  0  → NULL
  1  → 1-byte int
  2  → 2-byte int
  ...
  7  → IEEE 754 double
  8  → integer 0 (no bytes in value section)
  9  → integer 1 (no bytes in value section)
  N≥12, N even  → BLOB, length = (N-12)/2 bytes
  N≥13, N odd   → TEXT, length = (N-13)/2 bytes

The VDBE OP_MakeRecord encodes registers into this format; OP_Column decodes it back, skipping directly to the requested column offset.

Insert and Page Splits

Inserting a new entry moves the cursor to the correct leaf position using a binary search, then calls insertCell(). When a leaf fills up, balance() redistributes cells across sibling pages (a B-tree balance/split operation). Interior nodes are updated to reflect the new separator keys.

int sqlite3BtreeInsert(
  BtCursor *pCur,    /* cursor pointing at insertion location */
  const BtreePayload *pX, /* key + data to insert */
  int flags,         /* BTREE_SAVEPOSITION etc. */
  int seekResult     /* result of prior seek (0 if unknown) */
){
  /* 1. Ensure write transaction is active */
  /* 2. Move cursor to correct leaf position */
  /* 3. insertCell() — put the encoded cell onto the leaf page */
  /* 4. balance()    — split/redistribute if page overflows */
  ...
}
Next Stage

The B-tree engine never reads or writes disk directly. It asks the Pager for a DbPage* (an in-memory copy of a page) via sqlite3PagerGet(), modifies it, and marks it dirty via sqlite3PagerWrite(). The Pager handles journaling and eventual disk flush.