▶🗺 Write Path Diagram
  scrape.Commit()
        │
        ▼
  ┌───────────────────────────────────────────┐
  │         fanoutStorage.Appender            │  storage/fanout.go:29
  │   fanoutAppender.Append() fans to ALL     │
  └────────┬──────────────────────────────────┘
           │
  ┌────────┴──────────┐
  │ local (primary)   │   remote (secondary, best-effort)
  ▼                   ▼
  ┌──────────────────────────────────────────────┐
  │                tsdb.DB                       │  tsdb/db.go:291
  │  Appender() → initAppender → headAppender    │
  └────────────────────┬─────────────────────────┘
                       │
          ┌────────────┴────────────────┐
          │                             │
          ▼                             ▼
  ┌──────────────┐            ┌─────────────────────┐
  │  WAL (wlog)  │            │   Head              │  tsdb/head.go:71
  │ tsdb/wlog/   │            │   (in-memory)       │
  │ wlog.go:182  │            │                     │
  │              │            │  stripeSeries map   │
  │  SERIES rec  │            │  └─ memSeries #ID   │  tsdb/head.go:2508
  │  SAMPLE rec  │            │     └─ headChunks   │
  │  EXEMPLAR rec│            │     └─ mmappedChunks│
  │  HISTOGRAM   │            └──────────┬──────────┘
  └──────────────┘                       │  chunk full (120 samples)
                                         ▼
                              ┌─────────────────────┐
                              │  Head Chunk Files   │
                              │  (m-mapped)         │
                              │  data/chunks_head/  │
                              └──────────┬──────────┘
                                         │  compaction trigger
                                         ▼
                              ┌─────────────────────┐
                              │  Block (on disk)    │  tsdb/block.go
                              │  data/<ulid>/       │
                              │  ├── chunks/        │
                              │  ├── index          │
                              │  ├── tombstones     │
                              │  └── meta.json      │
                              └─────────────────────┘
▶📡 fanoutStorage: Write Multiplexer
storage/fanout.go: fanout struct (L29)
type fanout struct {
    logger      *slog.Logger
    primary     Storage   // local TSDB; an error here aborts the write
    secondaries []Storage // remote storage; committed only after the primary
}

// fanoutAppender.Commit() order:
//  1. primary.Commit()         ← runs first
//  2. secondaries[i].Commit()  ← runs only if everything before it succeeded;
//     after a failure the remaining secondaries are rolled back

The fanout commits to local storage before any secondary is touched, so a sample is durable locally before remote storage is involved. In a stock Prometheus setup remote-write delivery is decoupled further, because the remote-write queues read from the WAL rather than from the appender.
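
The commit ordering can be sketched in a few lines of Go. This is an illustration only, not the code in storage/fanout.go: the `committer` interface, `fakeStore`, and `commitAll` are hypothetical names, and the sketch assumes that secondaries are rolled back once an earlier commit has failed. How a secondary failure is surfaced (returned, logged, or ignored) is a policy choice; here it is simply returned.

```go
package main

import (
	"errors"
	"fmt"
)

// committer is a hypothetical, cut-down stand-in for storage.Appender.
type committer interface {
	Commit() error
	Rollback() error
}

// fakeStore records every call in a shared event list so the ordering is observable.
type fakeStore struct {
	name string
	fail bool
	log  *[]string
}

func (s fakeStore) Commit() error {
	*s.log = append(*s.log, s.name+":commit")
	if s.fail {
		return errors.New(s.name + " commit failed")
	}
	return nil
}

func (s fakeStore) Rollback() error {
	*s.log = append(*s.log, s.name+":rollback")
	return nil
}

// commitAll commits the primary first. Secondaries are committed only while
// no error has occurred; after the first error the remaining ones are rolled back.
func commitAll(primary committer, secondaries []committer) error {
	err := primary.Commit()
	for _, s := range secondaries {
		if err == nil {
			err = s.Commit()
			continue
		}
		_ = s.Rollback() // rollback errors are dropped in this sketch
	}
	return err
}

func main() {
	var events []string
	err := commitAll(
		fakeStore{name: "local", log: &events},
		[]committer{fakeStore{name: "remote", log: &events}},
	)
	fmt.Println(events, err)
}
```

Because the primary is always first in the sequence, a failing secondary can never prevent the local commit from having happened.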

▶🗄 tsdb.DB: Top-level Database
tsdb/db.go: DB struct (key fields, L291)
type DB struct {
    dir    string
    locker *tsdbutil.DirLocker

    logger    *slog.Logger
    opts      *Options
    compactor Compactor

    mtx    sync.RWMutex
    blocks []*Block      // persisted, immutable blocks sorted by time

    head *Head           // mutable in-memory block

    compactc chan struct{} // signal to trigger compaction
    stopc    chan struct{}
    donec    chan struct{}

    autoCompact bool
    ...
}
tsdb/head_append.go: initAppender (entry point, L50)
// initAppender is handed out while the Head is still empty. It defers creating
// the real headAppender until the first Append() call, so that the Head's
// min/max time can be initialized from that first sample.
type initAppender struct {
    app  storage.Appender // nil until the first append
    head *Head
    ...
}

func (a *initAppender) Append(ref storage.SeriesRef, lset labels.Labels,
    t int64, v float64) (storage.SeriesRef, error) {
    if a.app != nil {
        return a.app.Append(ref, lset, t, v)
    }
    // First append: initialize the Head's time bounds from t, then create
    // the real headAppender and delegate to it.
    a.head.initTime(t)
    a.app = a.head.appender()
    return a.app.Append(ref, lset, t, v)
}
▶📝 WAL: Write-Ahead Log

The WAL provides durability. Every sample is persisted to disk before it enters the in-memory Head. On crash recovery Prometheus replays the WAL to rebuild the Head.
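
The log-then-apply ordering, and the replay it makes possible, can be illustrated with a toy model. Everything in this sketch is hypothetical (`memLog`, `head`, `write`, `replay`); the real WAL writes framed, optionally compressed records to segment files rather than keeping a slice in memory.

```go
package main

import "fmt"

type sample struct {
	ref uint64
	t   int64
	v   float64
}

// memLog is a hypothetical in-memory stand-in for the on-disk WAL.
type memLog struct{ records []sample }

func (l *memLog) Log(s sample) { l.records = append(l.records, s) }

// head is a hypothetical stand-in for the in-memory Head.
type head struct{ series map[uint64][]sample }

func newHead() *head { return &head{series: map[uint64][]sample{}} }

func (h *head) apply(s sample) { h.series[s.ref] = append(h.series[s.ref], s) }

// write logs the sample first and only then applies it to the head.
func write(l *memLog, h *head, s sample) {
	l.Log(s)
	h.apply(s)
}

// replay rebuilds a fresh head from the log, as crash recovery would.
func replay(l *memLog) *head {
	h := newHead()
	for _, s := range l.records {
		h.apply(s)
	}
	return h
}

func main() {
	l, h := &memLog{}, newHead()
	write(l, h, sample{ref: 1, t: 1000, v: 0.5})
	write(l, h, sample{ref: 1, t: 2000, v: 0.7})

	recovered := replay(l) // pretend the process restarted here
	fmt.Println(len(h.series[1]), len(recovered.series[1]))
}
```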

tsdb/wlog/wlog.go: WL struct (L182)
type WL struct {
    dir            string
    logger         *slog.Logger
    segmentSize    int    // default 128 MiB
    mtx            sync.RWMutex
    segment        *Segment  // active segment file
    donePages      int
    page           [pageSize]byte  // 32 KiB page buffer
    actorc         chan func()
    stopc          chan chan struct{}
    compress       CompressionType // snappy or zstd
    ...
}

Record Types

| Record | Written when | Content |
|---|---|---|
| SERIES | First time a label set is seen | series ref + labels.Labels |
| SAMPLES | Every Append() | []RefSample{ref, t, v} |
| EXEMPLARS | AppendExemplar() | []RefExemplar{ref, exemplar} |
| HISTOGRAMS | AppendHistogram() | []RefHistogramSample{ref, t, h} |
| METADATA | metadata updates | []RefMetadata{ref, type, unit, help} |
| TOMBSTONES | Delete() intervals | []Stone{ref, {mint,maxt}} |
| MMAPMARKERS | chunk m-mapped | refs of flushed chunks |

Segment Layout

data/wal/
├── 00000001    ← completed segment (128 MiB)
├── 00000002    ← completed segment
└── 00000003    ← active segment (being written)

data/chunks_head/
└── 000001      ← m-mapped head chunks

WAL segments are deleted once the Head has been truncated past the data they hold, that is, after that time range has been compacted into a Block and a checkpoint covers them. The checkpoint (wlog.Checkpoint()) rewrites the oldest segments into a compact checkpoint directory, keeping only the records that are still needed.
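
As a rough illustration of segment cleanup, the sketch below formats segment names as zero-padded integers, matching the 8-digit names in the layout above, and picks the segments at or below a given index. `segmentName` and `obsolete` are hypothetical helpers written for this example; they are not part of the wlog package, and the cut-off index is assumed to come from whatever the last checkpoint covered.

```go
package main

import (
	"fmt"
	"strconv"
)

// segmentName formats a segment index as an 8-digit, zero-padded file name.
func segmentName(i int) string { return fmt.Sprintf("%08d", i) }

// obsolete returns the names whose index is at or below upTo, i.e. the
// segments that a checkpoint covering everything up to upTo makes redundant.
// Names that are not plain integers (such as checkpoint directories) are skipped.
func obsolete(names []string, upTo int) []string {
	var out []string
	for _, n := range names {
		i, err := strconv.Atoi(n)
		if err != nil {
			continue
		}
		if i <= upTo {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	names := []string{segmentName(1), segmentName(2), segmentName(3), "checkpoint.00000002"}
	fmt.Println(obsolete(names, 2))
}
```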
▶🧠 Head: In-Memory Block
tsdb/head.go: Head struct (key fields, L71)
type Head struct {
    chunkRange atomic.Int64   // maximum time range for a chunk (default 2h)
    numSeries  atomic.Uint64
    minTime, maxTime atomic.Int64

    wal, wbl *wlog.WL  // WAL and Write-Behind Log (OOO samples)

    exemplars ExemplarStorage

    // Hash-sharded series map, split into stripes (16,384 by default) to reduce lock contention.
    series *stripeSeries

    // Pools to recycle slices without GC pressure.
    floatsPool          zeropool.Pool[[]record.RefSample]
    histogramsPool      zeropool.Pool[[]record.RefHistogramSample]
    ...
}

The stripeSeries structure is a hash-sharded map from HeadSeriesRef to *memSeries with one lock per stripe (16,384 stripes by default, set by DefaultStripeSize). Concurrent appenders that touch different series therefore rarely contend for the same lock.
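
The idea behind a striped map can be shown with a small, self-contained sketch. The names `stripedMap` and `series` are hypothetical, and the stripe count is a constructor parameter here; the real stripeSeries also keeps a second index keyed by label hash, which this sketch omits.

```go
package main

import (
	"fmt"
	"sync"
)

// series is a hypothetical placeholder for *memSeries.
type series struct{ ref uint64 }

// stripedMap shards a ref -> series map across independently locked stripes.
type stripedMap struct {
	size   uint64 // must be a power of two so that ref & (size-1) selects a stripe
	locks  []sync.RWMutex
	shards []map[uint64]*series
}

func newStripedMap(size uint64) *stripedMap {
	m := &stripedMap{
		size:   size,
		locks:  make([]sync.RWMutex, size),
		shards: make([]map[uint64]*series, size),
	}
	for i := range m.shards {
		m.shards[i] = map[uint64]*series{}
	}
	return m
}

// set stores s under its ref, locking only the stripe that owns it.
func (m *stripedMap) set(s *series) {
	i := s.ref & (m.size - 1)
	m.locks[i].Lock()
	m.shards[i][s.ref] = s
	m.locks[i].Unlock()
}

// get returns the series for ref, or nil if it is unknown.
func (m *stripedMap) get(ref uint64) *series {
	i := ref & (m.size - 1)
	m.locks[i].RLock()
	defer m.locks[i].RUnlock()
	return m.shards[i][ref]
}

func main() {
	m := newStripedMap(8)
	var wg sync.WaitGroup
	for ref := uint64(0); ref < 100; ref++ {
		wg.Add(1)
		go func(r uint64) {
			defer wg.Done()
			m.set(&series{ref: r}) // writers on different stripes do not block each other
		}(ref)
	}
	wg.Wait()
	fmt.Println(m.get(42).ref)
}
```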

▶📈 memSeries: Per-Series Storage
tsdb/head.go: memSeries struct (L2508)
type memSeries struct {
    // Immutable after construction; no lock needed.
    ref       chunks.HeadSeriesRef
    shardHash uint64

    sync.Mutex  // guards everything below

    lset labels.Labels

    // mmappedChunks: completed chunks flushed to disk (memory-mapped).
    // firstChunkID is the HeadChunkID of mmappedChunks[0]; it advances as
    // old chunks are truncated after compaction.
    mmappedChunks []*mmappedChunk
    firstChunkID  chunks.HeadChunkID

    // headChunks: linked list of in-memory chunks still being written.
    // headChunks → headChunks.prev → ... (most recent first)
    headChunks *memChunk

    ooo *memSeriesOOOFields  // out-of-order sample state
    ...
}

Chunk Lifecycle for a memSeries

  1. First sample → allocate memChunk with XOR encoder; attach as headChunks
  2. Samples appended to active headChunk via XOR encoding (~120 samples max)
  3. Chunk full or time range exceeded → flush to chunks_head/ m-map file via chunkDiskMapper
  4. Flushed chunk pointer stored in mmappedChunks; memory freed
  5. Compaction moves mmappedChunks into a Block; firstChunkID advances
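
The lifecycle above can be modelled with a toy type. This is a sketch under simplifying assumptions: `seriesChunks` is a hypothetical stand-in for memSeries, a chunk is reduced to its time bounds and sample count, and "flushing" just moves that summary into a slice instead of writing to chunks_head/.

```go
package main

import "fmt"

const samplesPerChunk = 120 // per-chunk target taken from the lifecycle above

type chunk struct {
	minT, maxT int64
	n          int
}

// seriesChunks is a hypothetical, simplified stand-in for memSeries.
type seriesChunks struct {
	head    *chunk  // chunk currently being appended to
	flushed []chunk // stand-in for mmappedChunks
	rangeMs int64   // maximum time span of one chunk, in milliseconds
}

// append adds a sample timestamp, cutting a new head chunk when the current
// one is full or the sample falls outside its time range.
func (s *seriesChunks) append(t int64) {
	if s.head != nil && (s.head.n >= samplesPerChunk || t-s.head.minT >= s.rangeMs) {
		s.flushed = append(s.flushed, *s.head) // "flush": keep only the summary
		s.head = nil
	}
	if s.head == nil {
		s.head = &chunk{minT: t}
	}
	s.head.maxT = t
	s.head.n++
}

// truncate drops flushed chunks that end before mint, as happens after compaction.
func (s *seriesChunks) truncate(mint int64) (dropped int) {
	for len(s.flushed) > 0 && s.flushed[0].maxT < mint {
		s.flushed = s.flushed[1:]
		dropped++
	}
	return dropped
}

func main() {
	s := &seriesChunks{rangeMs: 2 * 60 * 60 * 1000}
	for i := int64(0); i < 300; i++ {
		s.append(i * 15_000) // one sample every 15s
	}
	fmt.Println("flushed:", len(s.flushed), "in head chunk:", s.head.n)
	fmt.Println("dropped:", s.truncate(s.flushed[1].minT))
}
```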

XOR Chunk Encoding

Prometheus uses the Gorilla XOR encoding for float samples, adapted from Facebook's paper:

| Component | Encoding | Typical size |
|---|---|---|
| timestamp delta | delta-of-delta, variable bits | 1-3 bytes |
| float value | XOR of previous, leading/trailing zeros compressed | 0-9 bytes |
| per sample average | combined | ~1.37 bytes |

Implementation: tsdb/chunkenc/xor.go
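
To see why this compresses well, it helps to compute the two quantities the encoder works with. The sketch below is not the bit-level encoder in tsdb/chunkenc/xor.go; `deltaOfDeltas`, `xorBits`, and `zeroRuns` are hypothetical helpers that only expose the intermediate values. For evenly spaced scrapes the delta-of-delta is mostly zero, and for slowly changing values the XOR has long runs of leading and trailing zero bits that need not be stored.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// deltaOfDeltas returns, for each timestamp from the third onward, the change
// in the interval between consecutive samples.
func deltaOfDeltas(ts []int64) []int64 {
	if len(ts) < 3 {
		return nil
	}
	out := make([]int64, 0, len(ts)-2)
	prevDelta := ts[1] - ts[0]
	for i := 2; i < len(ts); i++ {
		delta := ts[i] - ts[i-1]
		out = append(out, delta-prevDelta)
		prevDelta = delta
	}
	return out
}

// xorBits XORs the IEEE 754 bit patterns of two consecutive values.
// Identical values yield 0.
func xorBits(prev, cur float64) uint64 {
	return math.Float64bits(prev) ^ math.Float64bits(cur)
}

// zeroRuns reports the leading and trailing zero-bit counts of x; only the
// bits in between carry information.
func zeroRuns(x uint64) (leading, trailing int) {
	return bits.LeadingZeros64(x), bits.TrailingZeros64(x)
}

func main() {
	ts := []int64{0, 15000, 30000, 45001, 60000} // scrape timestamps in ms, with jitter
	fmt.Println(deltaOfDeltas(ts))

	lead, trail := zeroRuns(xorBits(1.0, 2.0))
	fmt.Println(lead, trail)
}
```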

▶✏ headAppender: Transactional Append

The headAppender buffers the samples of one scrape batch in memory. On Commit() it writes the WAL records first and then appends the samples to their memSeries, so a batch reaches the Head only after it has been logged.

tsdb/head_append.go: headAppender.Append() (L434)
func (a *headAppender) Append(ref storage.SeriesRef,
    lset labels.Labels, t int64, v float64) (storage.SeriesRef, error) {

    // 1. Look up the series by ref (fast path for already-known series).
    s := a.head.series.getByID(chunks.HeadSeriesRef(ref))
    if s == nil {
        // 2. Unknown ref: find the series by label hash, or create it.
        var (
            created bool
            err     error
        )
        s, created, err = a.head.getOrCreate(lset.Hash(), lset)
        if err != nil {
            return 0, err
        }
        if created {
            // 3. Schedule a WAL SERIES record for the new series.
            a.series = append(a.series, record.RefSeries{
                Ref:    s.ref,
                Labels: lset,
            })
        }
    }

    // 4. Accumulate sample for batch WAL write.
    a.samples = append(a.samples, record.RefSample{
        Ref: s.ref, T: t, V: v,
    })
    a.sampleSeries = append(a.sampleSeries, s)
    return storage.SeriesRef(s.ref), nil
}

func (a *headAppender) Commit() error {
    // 5. Write WAL SERIES + SAMPLES records atomically.
    // 6. For each sample: s.append(t, v, ...) → XOR encode into headChunk.
    // 7. If chunk full: enqueue for m-map flush.
    ...
}
▶🔧 Compaction: Head → Block

When the Head spans more than chunkRange * 3/2 (3h with the default 2h chunkRange), a compaction is triggered. The oldest portion of the Head (one chunkRange, 2h by default) is written to a new immutable on-disk Block, and the Head is then truncated.
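
The trigger condition is a single comparison. The following sketch assumes millisecond timestamps, as TSDB uses; `compactable` here is a free function written for illustration, not the method on Head.

```go
package main

import (
	"fmt"
	"time"
)

// compactable reports whether the span [minT, maxT] exceeds 1.5x chunkRange.
// All arguments are in milliseconds.
func compactable(minT, maxT, chunkRange int64) bool {
	return maxT-minT > chunkRange/2*3
}

func main() {
	chunkRange := (2 * time.Hour).Milliseconds()
	minT := int64(0)
	for _, span := range []time.Duration{2 * time.Hour, 3 * time.Hour, 3*time.Hour + time.Minute} {
		maxT := minT + span.Milliseconds()
		fmt.Printf("head spans %v: compactable=%v\n", span, compactable(minT, maxT, chunkRange))
	}
}
```

Keeping half a chunkRange of slack means the newest data stays in the Head and remains appendable while the older 2h window is written out.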

tsdb/compact.go: LeveledCompactor
// Block directory layout after compaction:
data/
├── 01HJXXXXXXXXXXXXX/       ← ULID (time-sortable unique ID)
│   ├── chunks/
│   │   ├── 000001           ← raw XOR-compressed chunk data
│   │   └── 000002
│   ├── index                ← inverted index: label → posting list
│   ├── tombstones           ← delete intervals
│   └── meta.json            ← {ulid, minTime, maxTime, stats, compaction}
└── 01HJYYYYY.../

Block Merge (Level Compaction)

| Level | Time range | Triggered by |
|---|---|---|
| 0 (head flush) | ≤ 2h | Head min time advancing |
| 1 | ≤ 2h × 3 = 6h | 3 adjacent L0 blocks |
| 2 | ≤ 18h | 3 adjacent L1 blocks |
| N | ≤ 2h × 3^N | cascading merge |

Retention is enforced after compaction. Blocks with maxTime < now - retentionDuration are marked for deletion and removed from DB.blocks.
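
The level ranges follow a geometric progression, which a short loop makes concrete. `levelRanges` is a hypothetical helper for this example; the base of 2h and the factor of 3 are the values from the table, and a real deployment may cap the largest range.

```go
package main

import (
	"fmt"
	"time"
)

// levelRanges returns the maximum block time range for levels 0..levels-1,
// starting at base and multiplying by factor at each level.
func levelRanges(base time.Duration, factor, levels int) []time.Duration {
	out := make([]time.Duration, 0, levels)
	cur := base
	for i := 0; i < levels; i++ {
		out = append(out, cur)
		cur *= time.Duration(factor)
	}
	return out
}

func main() {
	for level, r := range levelRanges(2*time.Hour, 3, 4) {
		fmt.Printf("level %d: up to %v\n", level, r)
	}
}
```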
▶⏪ Out-of-Order (OOO) Samples

Since Prometheus 2.39, OOO samples (samples whose timestamp is older than the newest sample already stored for their series) are buffered in the memSeriesOOOFields structure and logged to a separate Write-Behind Log (WBL), then merged into regular blocks at compaction time.

tsdb/head.go: OOO fields
type memSeriesOOOFields struct {
    oooMmappedChunks []*mmappedChunk  // flushed OOO chunks
    oooHeadChunk     *oooHeadChunk    // current in-memory OOO chunk
    firstOOOChunkID  chunks.HeadChunkID
}

// OOO write path:
//  headAppender.Append() → detects t < s.maxTime
//  → s.appendOOO(t, v)
//  → written to wbl (Write-Behind Log, separate WAL)
//  → OOO compaction merges into regular blocks

OOO ingestion is enabled by setting out_of_order_time_window in the TSDB section of the configuration file; --storage.tsdb.allow-overlapping-compaction controls whether the resulting overlapping blocks are merged. Samples older than the window are rejected with an error (storage.ErrTooOldSample) rather than stored.
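
The window check can be sketched as a three-way classification. This is a simplified model, not the logic in head_append.go: `classify` and the `verdict` values are hypothetical, the window is assumed to be measured back from the Head's maximum timestamp, and duplicate-timestamp handling is ignored.

```go
package main

import "fmt"

type verdict string

const (
	inOrder    verdict = "in-order"
	outOfOrder verdict = "out-of-order"
	tooOld     verdict = "too-old"
)

// classify decides how a sample with timestamp t would be treated.
// seriesMaxT is the newest timestamp stored for the series, headMaxT the
// newest timestamp in the whole head, window the OOO time window (0 = disabled).
// All values are in milliseconds.
func classify(t, seriesMaxT, headMaxT, window int64) verdict {
	switch {
	case t >= seriesMaxT:
		return inOrder
	case window > 0 && t >= headMaxT-window:
		return outOfOrder
	default:
		return tooOld
	}
}

func main() {
	const window = 30 * 60 * 1000 // 30m, a hypothetical setting
	headMaxT := int64(10_000_000)
	seriesMaxT := int64(9_990_000)
	for _, t := range []int64{9_995_000, 9_000_000, 5_000_000} {
		fmt.Println(t, classify(t, seriesMaxT, headMaxT, window))
	}
}
```

With the window set to 0 every late sample falls into the last case, which corresponds to OOO ingestion being disabled.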