Prometheus Data Flow — PromQL Engine

▶🗺 Query Execution Flow

  HTTP GET /api/v1/query?query=rate(http_requests_total[5m])&time=...
        │
        ▼
  ┌─────────────────────────────────────────────────────────────┐
  │                  web/api/v1/api.go                          │
  │  queryHandler → engine.NewInstantQuery(queryable, expr, t)  │
  └──────────────────────────┬──────────────────────────────────┘
                             │
                             ▼
  ┌─────────────────────────────────────────────────────────────┐
  │                   promql.Engine                             │  promql/engine.go:348
  │                                                             │
  │  1. parser.ParseExpr(exprStr)  →  AST                       │
  │  2. Validate AST (functions, types)                         │
  │  3. activeQueryTracker.Insert()  (concurrency limit)        │
  │  4. newEvalNodeHelper per-step                              │
  └──────────────────────────┬──────────────────────────────────┘
                             │
                             ▼
  ┌─────────────────────────────────────────────────────────────┐
  │                    evaluator.eval(AST)                      │
  │                                                             │
  │  Walk AST nodes recursively:                                │
  │  • VectorSelector  → select from storage                   │
  │  • MatrixSelector  → select range from storage             │
  │  • BinaryExpr      → join two vectors (label matching)     │
  │  • AggregateExpr   → sum/avg/topk/… over label sets        │
  │  • Call (function) → rate/increase/histogram_quantile/…    │
  └──────────────────────────┬──────────────────────────────────┘
                             │  (for each VectorSelector)
                             ▼
  ┌─────────────────────────────────────────────────────────────┐
  │                 storage.Querier.Select()                    │
  │                                                             │
  │  FanoutQuerier fans reads across:                           │
  │  • headQuerier  (in-memory Head block)                      │
  │  • blockQueriers[] (one per on-disk Block in time range)    │
  └──────────────────────────┬──────────────────────────────────┘
                             │  SeriesSet
                             ▼
  ┌────────────────────┐    ┌────────────────────────────────────┐
  │   headQuerier      │    │      blockQuerier                  │
  │   tsdb/head_read.go│    │      tsdb/querier.go               │
  │                    │    │                                    │
  │ • stripeSeries map │    │ • index.Reader (postings search)   │
  │ • memChunks        │    │ • chunks.Reader (decompress)       │
  └────────────────────┘    └────────────────────────────────────┘
                             │  []chunks.Meta → Iterator
                             ▼
                    XOR-decode samples at each step timestamp
                             │
                             ▼
                       *Result{Value, Err}
                    (Scalar | Vector | Matrix | String)
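
The first hop in the flow above is an ordinary HTTP request. As a minimal sketch, the query URL could be assembled as follows; the helper name instantQueryURL and the server address are assumptions made for illustration, while the /api/v1/query path and the query and time parameters come from the diagram.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// instantQueryURL builds the URL of an instant query. base is an assumed
// server address; unixSeconds is the evaluation timestamp.
func instantQueryURL(base, expr string, unixSeconds int64) string {
	params := url.Values{}
	params.Set("query", expr) // percent-encoded by Encode below
	params.Set("time", strconv.FormatInt(unixSeconds, 10))
	return base + "/api/v1/query?" + params.Encode()
}

func main() {
	fmt.Println(instantQueryURL("http://localhost:9090", "rate(http_requests_total[5m])", 1700000000))
}
```

The expression has to be percent-encoded because parentheses and brackets are not safe in a query string; url.Values takes care of that.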

▶⚙ Engine — Configuration & Limits

promql/engine.go — Engine struct (line 348)

type Engine struct {
    logger             *slog.Logger
    timeout            time.Duration     // max query duration
    maxSamplesPerQuery int               // OOM guard
    lookbackDelta      time.Duration     // default 5m

    activeQueryTracker QueryTracker      // cap on concurrent queries
    queryLogger        QueryLogger       // slow query log

    noStepSubqueryIntervalFn func(rangeMillis int64) int64
    enableAtModifier         bool        // @timestamp modifier
    enableNegativeOffset     bool        // offset -5m
    enablePerStepStats       bool        // per-step sample counts
    enableDelayedNameRemoval bool        // __name__ removal timing
    enableTypeAndUnitLabels  bool        // unit/type label propagation
    parser                   parser.Parser
}

Key Engine Options

Option	Default	Guards against
timeout	2m	Long-running query denial-of-service
maxSamplesPerQuery	50 000 000	OOM from huge result sets
lookbackDelta	5m	Returning values from series that have stopped reporting
maxConcurrentQueries	20	CPU saturation from concurrent queries
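
To make the two resource guards concrete, here is a rough sketch of a concurrency cap and a sample budget. The types queryGate and sampleBudget are hypothetical and are not the engine's own; in particular, this gate rejects a caller when it is full, which is a simplification chosen to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

// errTooManySamples is a hypothetical error for an exhausted sample budget.
var errTooManySamples = errors.New("query would load too many samples")

// queryGate is a hypothetical concurrency cap: a buffered channel whose
// capacity is the maximum number of queries allowed to run at once.
type queryGate chan struct{}

// tryEnter reports whether a slot was free. Callers that get true must
// call leave when the query finishes.
func (g queryGate) tryEnter() bool {
	select {
	case g <- struct{}{}:
		return true
	default:
		return false
	}
}

// leave frees the slot taken by a successful tryEnter.
func (g queryGate) leave() { <-g }

// sampleBudget is a hypothetical stand-in for maxSamplesPerQuery.
type sampleBudget struct{ used, max int }

// add charges n samples and fails once the budget would be exceeded.
func (b *sampleBudget) add(n int) error {
	if b.used+n > b.max {
		return errTooManySamples
	}
	b.used += n
	return nil
}

func main() {
	gate := make(queryGate, 2)
	fmt.Println(gate.tryEnter(), gate.tryEnter(), gate.tryEnter()) // third caller is rejected
	gate.leave()

	budget := &sampleBudget{max: 100}
	fmt.Println(budget.add(60), budget.add(60))
}
```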

▶🌳 Parsing — PromQL AST

The PromQL lexer and parser live in promql/parser. The parser produces an Expr interface value representing the AST.

promql/parser/ast.go — key Expr types

// Example: rate(http_requests_total{job="api"}[5m])
//
// AST:
//   Call{
//     Func: "rate",
//     Args: [
//       MatrixSelector{
//         VectorSelector{
//           Name: "http_requests_total",
//           LabelMatchers: [{job="api"}],
//         },
//         Range: 5m,
//       }
//     ]
//   }

type VectorSelector struct {
    Name           string
    LabelMatchers  []*labels.Matcher
    Offset         time.Duration
    Timestamp      *int64           // @modifier
    StartOrEnd     ItemType
    Series         []storage.Series // populated during eval
    ...
}

type BinaryExpr struct {
    Op       ItemType          // +, -, *, /, ==, and, or, …
    LHS, RHS Expr
    VectorMatching *VectorMatching
    ReturnBool bool
}

type AggregateExpr struct {
    Op       ItemType          // sum, avg, topk, count, …
    Expr     Expr
    Grouping []string          // by/without label list
    Without  bool
    Param    Expr              // for topk(N, …) etc.
}
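
The tree shape is easier to see with a toy model. The sketch below defines simplified node types that only loosely mirror the structs above (all names are illustrative, and matchers are reduced to equality pairs) and walks the tree to list the metrics a query selects.

```go
package main

import (
	"fmt"
	"time"
)

// expr is a toy stand-in for the parser's Expr interface.
type expr interface{}

type vectorSelector struct {
	name     string
	matchers map[string]string // equality matchers only, for brevity
}

type matrixSelector struct {
	vs  *vectorSelector
	rng time.Duration
}

type call struct {
	fn   string
	args []expr
}

type binaryExpr struct {
	op       string
	lhs, rhs expr
}

// selectors walks the tree depth-first and returns every selected metric name.
func selectors(e expr) []string {
	switch n := e.(type) {
	case *vectorSelector:
		return []string{n.name}
	case *matrixSelector:
		return selectors(n.vs)
	case *call:
		var out []string
		for _, a := range n.args {
			out = append(out, selectors(a)...)
		}
		return out
	case *binaryExpr:
		return append(selectors(n.lhs), selectors(n.rhs)...)
	}
	return nil
}

func main() {
	// rate(http_requests_total{job="api"}[5m])
	q := &call{fn: "rate", args: []expr{
		&matrixSelector{
			vs:  &vectorSelector{name: "http_requests_total", matchers: map[string]string{"job": "api"}},
			rng: 5 * time.Minute,
		},
	}}
	fmt.Println(selectors(q))
}
```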

▶🔎 Storage Select — Index Lookup

A VectorSelector is resolved by calling querier.Select(). The querier merges results from the Head and from every on-disk block that overlaps the query time range.

tsdb/querier.go — blockQuerier.Select()

// Label matcher lookup in the inverted index:
//  1. For each Matcher, call indexReader.Postings(name, value)
//     → returns a sorted list of series references (posting list)
//  2. Matchers in one selector are always ANDed → Intersect(postings...)
//  3. A matcher that accepts several values (e.g. job=~"a|b")
//     → Merge(postings...) over the per-value lists
//  4. Negative matchers (!=, !~) → Without(base, match); AllPostings()
//     is the base when the selector has no positive matcher
//  5. Iterate refs → load labels from index
//  6. Return as SeriesSet (lazy; chunks loaded on iteration)
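
Because posting lists are sorted, all three set operations can be done in a single linear pass. The following is an illustrative sketch over plain []uint64 slices, not the tsdb/index implementation; the function names intersect, merge, and without are chosen here only to echo the steps above.

```go
package main

import "fmt"

// intersect returns the refs present in both sorted lists (AND).
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// merge returns the sorted union of two sorted lists.
func merge(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) || j < len(b) {
		switch {
		case j == len(b) || (i < len(a) && a[i] < b[j]):
			out = append(out, a[i])
			i++
		case i == len(a) || b[j] < a[i]:
			out = append(out, b[j])
			j++
		default: // equal refs: emit once
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// without returns the refs in a that do not appear in b (negative matcher).
func without(a, b []uint64) []uint64 {
	var out []uint64
	j := 0
	for _, ref := range a {
		for j < len(b) && b[j] < ref {
			j++
		}
		if j == len(b) || b[j] != ref {
			out = append(out, ref)
		}
	}
	return out
}

func main() {
	jobAPI := []uint64{1, 3, 5, 7}  // job="api"
	envProd := []uint64{3, 4, 5, 8} // env="prod"
	canary := []uint64{5}           // instance="canary"

	// {job="api", env="prod", instance!="canary"}
	fmt.Println(without(intersect(jobAPI, envProd), canary))
	fmt.Println(merge(jobAPI, envProd))
}
```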

Index Structure

tsdb/index/index.go — Reader / Writer

Block index file layout:
  ┌─────────────────────────────────────────────┐
  │  Symbol table  (label names + values)       │
  │  Series table  (ref → labels + chunk metas) │
  │  Postings      (label=value → []seriesRef)  │
  │  Postings offset table                      │
  │  TOC           (offsets of above sections)  │
  └─────────────────────────────────────────────┘

Implementation: tsdb/index/index.go

▶📦 Chunk Reading & Decoding

Once series references are resolved, the evaluator iterates their chunks for the query time range.

tsdb/chunkenc/xor.go — XorIterator

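// Simplified view of how a series iterator walks its chunks over [mint, maxt]:
for _, meta := range series.chunks {
    if meta.MaxTime < mint || meta.MinTime > maxt {
        continue // skip chunks outside the time range
    }
    chk, err := chunkReader.Chunk(meta)
    if err != nil {
        return err // chunk missing or unreadable
    }
    it := chk.Iterator(reuse) // XOR decoder; reuse avoids an allocation
    for it.Next() == chunkenc.ValFloat { // XOR chunks yield only float samples
        ts, val := it.At() // decode next (t, v) pair
        // feed (ts, val) into the evaluator matrix
    }
    if err := it.Err(); err != nil {
        return err // decoding stopped on an error, not at the end of the chunk
    }
}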
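
The name "XOR" refers to how float values are compressed: each value is XORed with its predecessor, and only the non-zero middle bits need to be stored. The sketch below shows just that one step using the standard library; the helper names xorDelta and decode are made up for this example, and the bit-packing and timestamp encoding that a real chunk also performs are left out.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// xorDelta XORs the bit patterns of two consecutive values and reports how
// many leading and trailing zero bits the result has. Similar values share
// sign, exponent, and high mantissa bits, so most of the delta is zero.
func xorDelta(prev, cur float64) (delta uint64, leading, trailing int) {
	delta = math.Float64bits(prev) ^ math.Float64bits(cur)
	return delta, bits.LeadingZeros64(delta), bits.TrailingZeros64(delta)
}

// decode recovers the current value from the previous one and the delta.
func decode(prev float64, delta uint64) float64 {
	return math.Float64frombits(math.Float64bits(prev) ^ delta)
}

func main() {
	d, lead, trail := xorDelta(12.0, 12.0)
	fmt.Println(d, lead, trail) // an unchanged value XORs to zero

	d, lead, trail = xorDelta(12.0, 24.0)
	fmt.Printf("%064b lead=%d trail=%d\n", d, lead, trail)
	fmt.Println(decode(12.0, d))
}
```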

Iterator Value Types

ValType	Method	Data
ValFloat	At()	(int64 ts, float64 v)
ValHistogram	AtHistogram()	(int64 ts, *histogram.Histogram)
ValFloatHistogram	AtFloatHistogram()	(int64 ts, *histogram.FloatHistogram)

▶🧮 Expression Evaluation

The evaluator recurses from the root of the AST and evaluates children before their parents, so values flow bottom-up. Each node is evaluated at every step timestamp:

promql/engine.go — evaluator.eval() dispatch

func (ev *evaluator) eval(ctx context.Context, expr parser.Expr) (parser.Value, annotations.Annotations) {
    switch e := expr.(type) {

    case *parser.AggregateExpr:
        // Evaluate inner expression, then group + aggregate.
        return ev.evalAggregation(ctx, e)

    case *parser.Call:
        // Evaluate arguments, then call registered function.
        // e.g. "rate" → FunctionCalls["rate"] (funcRate)
        return ev.evalCall(ctx, e)

    case *parser.BinaryExpr:
        // Evaluate LHS + RHS, then vector matching + binary op.
        return ev.evalBinaryExpr(ctx, e)

    case *parser.VectorSelector:
        // Already pre-populated with series; sample lookup by timestamp.
        return ev.evalVectorSelector(ctx, e, ...)

    case *parser.MatrixSelector:
        // Return a matrix (series → []sample window).
        return ev.evalMatrixSelector(ctx, e, ...)
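    // Cases for literals, ParenExpr, UnaryExpr, SubqueryExpr, … are omitted from this excerpt.
    default:
        panic(fmt.Errorf("unhandled expression of type: %T", expr))
    }
}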
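
As a concrete instance of the "group + aggregate" step, here is a rough sketch of sum by (...) over an instant vector. The sample type and the string grouping key are simplifications invented for this example and should not be read as the engine's actual data structures.

```go
package main

import (
	"fmt"
	"sort"
)

// sample is a toy instant-vector element: a label set and a value.
type sample struct {
	labels map[string]string
	value  float64
}

// sumBy groups samples by the values of the given labels and sums each group.
// The group key is a plain string such as "job=api,".
func sumBy(in []sample, by ...string) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range in {
		key := ""
		for _, name := range by {
			key += name + "=" + s.labels[name] + ","
		}
		out[key] += s.value
	}
	return out
}

func main() {
	vec := []sample{
		{labels: map[string]string{"job": "api", "instance": "a"}, value: 1},
		{labels: map[string]string{"job": "api", "instance": "b"}, value: 2},
		{labels: map[string]string{"job": "db", "instance": "c"}, value: 5},
	}
	res := sumBy(vec, "job") // sum by (job) (...)

	keys := make([]string, 0, len(res))
	for k := range res {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map order is random; sort for stable output
	for _, k := range keys {
		fmt.Println(k, res[k])
	}
}
```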

rate() / increase() Function

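// rate(v[d]) = extrapolated per-second increase over window d.
// Implemented in: promql/functions.go — funcRate()
//
// Algorithm:
//  1. Take the samples in (t-d, t] from the range vector (at least two are needed).
//  2. Compute (last - first), adding back the pre-reset value at each counter reset.
//  3. Extrapolate toward the window boundaries; extrapolation is capped when
//     the first or last sample lies far from a boundary.
//  4. Divide by the window duration in seconds (increase() skips this step).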

Function registry: promql/functions.go
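
The counter-reset handling is the least obvious part of the algorithm above, so here is a deliberately simplified sketch of it. It is not funcRate: it leaves out extrapolation entirely and therefore divides by the time actually covered by the samples instead of the window duration. The names point and simpleRate are hypothetical.

```go
package main

import "fmt"

// point is one sample of a counter; t is in milliseconds.
type point struct {
	t int64
	v float64
}

// simpleRate returns the per-second increase between the first and last
// sample, corrected for counter resets. It reports false when fewer than
// two samples (or no elapsed time) are available.
func simpleRate(pts []point) (float64, bool) {
	if len(pts) < 2 {
		return 0, false
	}
	first, last := pts[0], pts[len(pts)-1]
	seconds := float64(last.t-first.t) / 1000
	if seconds <= 0 {
		return 0, false
	}
	increase := last.v - first.v
	for i := 1; i < len(pts); i++ {
		if pts[i].v < pts[i-1].v {
			// The counter restarted from zero: add back what was lost.
			increase += pts[i-1].v
		}
	}
	return increase / seconds, true
}

func main() {
	// The counter climbs to 10, resets, then climbs from 4 to 14.
	pts := []point{{0, 0}, {10000, 10}, {20000, 4}, {30000, 14}}
	r, ok := simpleRate(pts)
	fmt.Println(r, ok)
}
```

In this example the raw difference (14 - 0) would undercount: the true increase is 10 before the reset plus 14 after it, which is what the correction restores.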

▶📤 Result Types

Type	PromQL produces	API resultType
Vector	instant query on an instant vector selector, function call, or aggregation	"vector"
Matrix	range query, or instant query whose top-level expression is a range selector or subquery	"matrix"
Scalar	numeric literal or scalar() function	"scalar"
String	string literal, e.g. "some text"	"string"

promql/value.go — result structs

// Vector — instant query result
type Vector []Sample      // one Sample per matching series

type Sample struct {
    Metric labels.Labels
    T      int64
    F      float64
    H      *histogram.FloatHistogram
}

// Matrix — range query result
type Matrix []Series

type Series struct {
    Metric  labels.Labels
    Floats  []FPoint        // (t, float64) pairs
    Histograms []HPoint     // (t, *FloatHistogram) pairs
}

▶⏱ Staleness & lookbackDelta

For an instant query at time t, a series must have a sample in (t - lookbackDelta, t] (default 5 min) to appear in the result. This prevents showing stale gauge values from targets that went away.

lookbackDelta can be overridden globally (--query.lookback-delta) or per-query via the lookback_delta query parameter. Range selectors use their explicit window instead.

The stale marker (special NaN written by the scrape loop) makes a series immediately invisible regardless of lookbackDelta — it short-circuits the window check.
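
The two rules combine into a small lookup routine. The sketch below is an approximation written for this page, not the engine's code: instantValue and point are hypothetical names, timestamps are in milliseconds, and the stale marker is recognised by its bit pattern (the value Prometheus defines as StaleNaN).

```go
package main

import (
	"fmt"
	"math"
)

// staleNaNBits is the bit pattern of the Prometheus staleness marker.
const staleNaNBits uint64 = 0x7ff0000000000002

// staleNaN is the marker as a float64, for building test data.
var staleNaN = math.Float64frombits(staleNaNBits)

// point is one sample; t is in milliseconds.
type point struct {
	t int64
	v float64
}

// isStale compares bit patterns, because NaN never equals itself.
func isStale(v float64) bool { return math.Float64bits(v) == staleNaNBits }

// instantValue returns the newest sample at or before t, provided it lies in
// (t-lookbackMs, t] and is not a stale marker. pts must be sorted by time.
func instantValue(pts []point, t, lookbackMs int64) (float64, bool) {
	for i := len(pts) - 1; i >= 0; i-- {
		p := pts[i]
		if p.t > t {
			continue // sample is in the future relative to t
		}
		if p.t <= t-lookbackMs || isStale(p.v) {
			return 0, false // too old, or explicitly marked stale
		}
		return p.v, true
	}
	return 0, false
}

func main() {
	const lookback = 5 * 60 * 1000 // 5m in milliseconds
	live := []point{{0, 1}, {60000, 2}}
	gone := []point{{0, 1}, {60000, staleNaN}}

	fmt.Println(instantValue(live, 120000, lookback)) // recent sample is returned
	fmt.Println(instantValue(live, 400000, lookback)) // outside the lookback window
	fmt.Println(instantValue(gone, 90000, lookback))  // stale marker hides the series
}
```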
