← overview

Scheduler

Stage 3 of 6 · tinygrad/schedule/
What the scheduler does overview

The scheduler takes the raw tensor-level UOp DAG (with RESHAPE, EXPAND, REDUCE, BUFFER etc.) and transforms it into a LINEAR UOp — an ordered, dependency-respecting list of kernel CALL nodes ready to compile and run.

Key concerns: fusion (merging kernels to avoid memory round-trips), buffer allocation, multi-device routing, and linearizing the dependency graph.

create_linear_with_vars — top-level entry schedule/__init__.py:122

Called by Tensor.linear_with_vars() and Tensor.realize(). Accepts the big SINK UOp, returns a compiled execution plan.

def create_linear_with_vars(big_sink: UOp) -> tuple[UOp, dict[str, int]]:
  # 1. Apply pattern matchers that fuse ops and lower movement ops
  # 2. Create a schedule (ordered CALL list) via create_schedule()
  # 3. Extract symbolic variable values from BIND nodes
  # 4. Return (LINEAR uop, var_vals dict)
schedule/__init__.py:122
Returns LINEAR A UOp(Ops.LINEAR) whose src is an ordered tuple of CALL nodes Returns var_vals dict[str, int] — concrete values for any symbolic dimension variables
create_schedule — build the execution DAG schedule/__init__.py:21

Converts the tensor UOp graph into a topologically-sorted list of CALL nodes.

1
Build AFTER dependency edges
Traverse SINK, find all AFTER nodes (which represent data dependencies between kernels), collect them into a dependency graph.
2
In-degree topological sort
Use a priority queue to emit kernels in the best order, ensuring each kernel only runs after its data sources are ready.
3
Produce LINEAR UOp
Wrap the ordered CALL sequence in a UOp(Ops.LINEAR) node.
schedule/__init__.py:21 — create_schedule
lower_sink_to_linear — kernel boundary decisions schedule/__init__.py:93

Decides how to split the UOp graph into separate kernels. The core fusion heuristic lives here: ops that share the same loop structure can be merged into one kernel.

def lower_sink_to_linear(function: UOp) -> UOp|None:
  # Returns a LINEAR UOp for a single fused kernel group,
  # or None if this function doesn't need a kernel
  # (e.g. pure in-memory copies or views)
schedule/__init__.py:93
Supporting schedule modules schedule/
schedule/rangeify.py
Converts tensor index arithmetic into RANGE/END loop pairs. Turns abstract shapes into concrete loop bounds.
schedule/indexing.py
Maps multi-dimensional tensor indices to flat buffer offsets. Handles strides, padding, and permutations.
schedule/memory.py
Buffer allocation and layout decisions. Determines which buffers can be reused or aliased.
schedule/multi.py
Multi-device sharding logic. Handles MSELECT/MSTACK ops for distributing tensors across devices.
schedule/allreduce.py
Cross-device all-reduce for distributed training. Produces communication primitives in the schedule.
Schedule caching (SCACHE) schedule/__init__.py:91

When SCACHE=1 is set, the computed schedule is cached by the UOp key hash. On repeated calls with identical graphs (same shapes, same ops), the cached LINEAR is returned without re-scheduling.

Schedule caching is separate from kernel compilation caching (runtime_cache). Both are required for JIT-style repeated execution to be fast.
schedule/__init__.py:91 — schedule_cache
Example: two-op schedule walkthrough
# Tensor graph (input to scheduler):
# SINK
#   └─ BUFFER(out) ← REDUCE(ADD) ← MUL ← BUFFER(a), BUFFER(b)

# After scheduling, produces:
LINEAR(
  # Kernel 1: fused mul+reduce in a single pass
  CALL(
    PROGRAM(ast=SINK(STORE(buf_out, REDUCE(MUL(LOAD(a), LOAD(b)))))),
    buf_a,     # input
    buf_b,     # input
    buf_out    # output (newly allocated)
  )
)

# var_vals = {} (no symbolic dims here)

# If buf_a or buf_b needed an upload first, the schedule would also contain:
# CALL(COPY, host_data, buf_a)  ← before the kernel call