tinygrad — Scheduler

What the scheduler does overview ▼

The scheduler takes the raw tensor-level UOp DAG (with RESHAPE, EXPAND, REDUCE, BUFFER etc.) and transforms it into a LINEAR UOp — an ordered, dependency-respecting list of kernel CALL nodes ready to compile and run.

Key concerns: fusion (merging kernels to avoid memory round-trips), buffer allocation, multi-device routing, and linearizing the dependency graph.

create_linear_with_vars — top-level entry schedule/__init__.py:122 ▼

Called by Tensor.linear_with_vars() and Tensor.realize(). Accepts the big SINK UOp, returns a compiled execution plan.

def create_linear_with_vars(big_sink: UOp) -> tuple[UOp, dict[str, int]]:
  # 1. Apply pattern matchers that fuse ops and lower movement ops
  # 2. Create a schedule (ordered CALL list) via create_schedule()
  # 3. Extract symbolic variable values from BIND nodes
  # 4. Return (LINEAR uop, var_vals dict)

schedule/__init__.py:122

Returns LINEAR A UOp(Ops.LINEAR) whose src is an ordered tuple of CALL nodes Returns var_vals dict[str, int] — concrete values for any symbolic dimension variables

create_schedule — build the execution DAG schedule/__init__.py:21 ▼

Converts the tensor UOp graph into a topologically-sorted list of CALL nodes.

1

Build AFTER dependency edges

Traverse SINK, find all AFTER nodes (which represent data dependencies between kernels), collect them into a dependency graph.

2

In-degree topological sort

Use a priority queue to emit kernels in the best order, ensuring each kernel only runs after its data sources are ready.

3

Produce LINEAR UOp

Wrap the ordered CALL sequence in a UOp(Ops.LINEAR) node.

schedule/__init__.py:21 — create_schedule

lower_sink_to_linear — kernel boundary decisions schedule/__init__.py:93 ▼

Decides how to split the UOp graph into separate kernels. The core fusion heuristic lives here: ops that share the same loop structure can be merged into one kernel.

def lower_sink_to_linear(function: UOp) -> UOp|None:
  # Returns a LINEAR UOp for a single fused kernel group,
  # or None if this function doesn't need a kernel
  # (e.g. pure in-memory copies or views)

schedule/__init__.py:93

Supporting schedule modules schedule/ ▼

schedule/rangeify.py

Converts tensor index arithmetic into RANGE/END loop pairs. Turns abstract shapes into concrete loop bounds.

schedule/indexing.py

Maps multi-dimensional tensor indices to flat buffer offsets. Handles strides, padding, and permutations.

schedule/memory.py

Buffer allocation and layout decisions. Determines which buffers can be reused or aliased.

schedule/multi.py

Multi-device sharding logic. Handles MSELECT/MSTACK ops for distributing tensors across devices.

schedule/allreduce.py

Cross-device all-reduce for distributed training. Produces communication primitives in the schedule.

Schedule caching (SCACHE) schedule/__init__.py:91 ▼

When SCACHE=1 is set, the computed schedule is cached by the UOp key hash. On repeated calls with identical graphs (same shapes, same ops), the cached LINEAR is returned without re-scheduling.

Schedule caching is separate from kernel compilation caching (runtime_cache). Both are required for JIT-style repeated execution to be fast.

schedule/__init__.py:91 — schedule_cache

Example: two-op schedule walkthrough ▼

# Tensor graph (input to scheduler):
# SINK
#   └─ BUFFER(out) ← REDUCE(ADD) ← MUL ← BUFFER(a), BUFFER(b)

# After scheduling, produces:
LINEAR(
  # Kernel 1: fused mul+reduce in a single pass
  CALL(
    PROGRAM(ast=SINK(STORE(buf_out, REDUCE(MUL(LOAD(a), LOAD(b)))))),
    buf_a,     # input
    buf_b,     # input
    buf_out    # output (newly allocated)
  )
)

# var_vals = {} (no symbolic dims here)

# If buf_a or buf_b needed an upload first, the schedule would also contain:
# CALL(COPY, host_data, buf_a)  ← before the kernel call