← overview

Codegen

Stage 4 of 6 · tinygrad/codegen/
Overview pipeline

Codegen receives a single kernel AST (the SINK UOp from the scheduler) and produces a compiled PROGRAM UOp containing device-executable bytes. It does this through a sequence of graph-rewrite passes followed by linearization and rendering.

do_to_program(ast: UOp, renderer: Renderer) -> UOp
  │
  ├─ full_rewrite_to_sink(ast, renderer)   # optimize + lower to indexed loops
  │
  ├─ do_linearize(renderer, prg, sink)     # toposort UOps for execution
  │
  ├─ do_render(renderer, prg, lin)         # generate source string
  │    └─ renderer.render(uops) -> str
  │
  └─ do_compile(renderer, prg, source)     # compile to binary
       └─ renderer.compiler.compile(src) -> bytes
codegen/__init__.py:176 — do_to_program
full_rewrite_to_sink — the optimization pipeline codegen/__init__.py:27

This function applies a cascade of PatternMatcher passes, each transforming the UOp AST closer to device-runnable code.

Phase 1 — Early movement ops
  • pm_mops: Propagate/eliminate RESHAPE, EXPAND, PERMUTE, SHRINK through math ops
  • pm_syntactic_sugar: Simplify redundant index/cast chains
  • pm_store_ranges: Attach loop range info to STORE nodes
Phase 2 — Optimization (when optimize=True)
  • Load collapse: eliminate redundant memory reads within a kernel
  • Range splitting: divide loops for parallelism (local/global split)
  • Symbolic simplification: constant-fold index arithmetic
  • BEAM search or hand-coded heuristics to pick best tiling/unrolling
  • Post-range opts: loop reordering, cache tiling
Phase 3 — Expander
  • pm_pre_expander: Prepare ops for vectorization/unrolling
  • pm_group_for_reduce: Identify reduction groups for warp/block operations
  • expander: Expand complex ops (tensor cores, vector) into constituent UOps
Phase 4 — Local buffer & range concretization
  • pm_add_buffers_local: Insert DEFINE_LOCAL for shared memory allocations
  • rangeify_codegen: Finalize RANGE/END pairs with concrete integer bounds
Phase 5 — Devectorize & lower
  • pm_reduce: Convert REDUCE ops to explicit loop + accumulator patterns
  • pm_add_loads: Insert explicit LOAD for each variable use
  • Devectorize passes: break vector ops into scalar instructions
  • Index dtype lowering: replace symbolic shapes with int32/int64
  • Op decomposition: rewrite ops not in renderer.supported_ops to equivalents
  • Transcendental lowering: replace sin/cos/exp2 with polynomial approximations if needed
codegen/__init__.py:27 — full_rewrite_to_sink
do_linearize — instruction ordering codegen/__init__.py:132

Converts the DAG SINK back into a flat ordered list of UOps (list[UOp]) that a renderer can iterate over sequentially.

1
Topological sort with run-count priority
UOps inside loops execute many times. Scheduling considers multiplicity — ops inside nested loops should come after their loop RANGE nodes.
2
PLACE ordering
Ensures control-flow structure (RANGE/END pairs, IF/ENDIF) is correct and that variable definitions precede uses.
codegen/__init__.py:132 — do_linearize codegen/late/linearizer.py
do_render & do_compile codegen/__init__.py:155–170
R
do_render
Calls renderer.render(uops_list), which walks the linearized list and emits the device-specific source string (C, PTX, WGSL, Metal, LLVM IR, etc.).
codegen/__init__.py:155 — do_render
C
do_compile
Calls renderer.compiler.compile_cached(source), which invokes the device compiler (nvrtc for CUDA, Metal compiler for Metal, LLVM for CPU…) and returns the compiled binary bytes as a BINARY UOp.
codegen/__init__.py:160 — do_compile
Compilation results are cached by source hash in Compiler.compile_cached(), so identical kernels (same ops, same shapes) are only compiled once per device per process.
Transformation example: a simple elementwise kernel walkthrough
# Input AST (from scheduler):
SINK(
  STORE(BUFFER(out, shape=(1024,)),
        ADD(LOAD(BUFFER(a, shape=(1024,))),
            LOAD(BUFFER(b, shape=(1024,)))))
)

# After full_rewrite_to_sink:
SINK(
  RANGE(i, 0, 1024) ─────────────────────────────┐
    STORE(INDEX(BUFFER(out), i),                  │
          ADD(LOAD(INDEX(BUFFER(a), i)),           │
              LOAD(INDEX(BUFFER(b), i))))          │
  END(RANGE) ◄──────────────────────────────────┘
)

# After linearize: flat list
[RANGE(i,0,1024), LOAD(a[i]), LOAD(b[i]), ADD, STORE(out[i]), END]

# After render (OpenCL C example):
"""
kernel void r_1024(global float* out, global float* a, global float* b) {
  int i = get_global_id(0);
  out[i] = a[i] + b[i];
}
"""

# After compile: bytes of .ptx / .metallib / .so etc.