The scheduler takes the raw tensor-level UOp DAG (with RESHAPE, EXPAND, REDUCE, BUFFER etc.) and transforms it into a LINEAR UOp — an ordered, dependency-respecting list of kernel CALL nodes ready to compile and run.
Key concerns: fusion (merging kernels to avoid memory round-trips), buffer allocation, multi-device routing, and linearizing the dependency graph.
Called by Tensor.linear_with_vars() and Tensor.realize(). Accepts the big SINK UOp, returns a compiled execution plan.
def create_linear_with_vars(big_sink: UOp) -> tuple[UOp, dict[str, int]]: # 1. Apply pattern matchers that fuse ops and lower movement ops # 2. Create a schedule (ordered CALL list) via create_schedule() # 3. Extract symbolic variable values from BIND nodes # 4. Return (LINEAR uop, var_vals dict)schedule/__init__.py:122
UOp(Ops.LINEAR) whose src is an ordered tuple of CALL nodes
Returns var_vals
dict[str, int] — concrete values for any symbolic dimension variables
Converts the tensor UOp graph into a topologically-sorted list of CALL nodes.
UOp(Ops.LINEAR) node.Decides how to split the UOp graph into separate kernels. The core fusion heuristic lives here: ops that share the same loop structure can be merged into one kernel.
def lower_sink_to_linear(function: UOp) -> UOp|None: # Returns a LINEAR UOp for a single fused kernel group, # or None if this function doesn't need a kernel # (e.g. pure in-memory copies or views)schedule/__init__.py:93
RANGE/END loop pairs. Turns abstract shapes into concrete loop bounds.MSELECT/MSTACK ops for distributing tensors across devices.When SCACHE=1 is set, the computed schedule is cached by the UOp key hash. On repeated calls with identical graphs (same shapes, same ops), the cached LINEAR is returned without re-scheduling.
runtime_cache). Both are required for JIT-style repeated execution to be fast.# Tensor graph (input to scheduler):
# SINK
# └─ BUFFER(out) ← REDUCE(ADD) ← MUL ← BUFFER(a), BUFFER(b)
# After scheduling, produces:
LINEAR(
# Kernel 1: fused mul+reduce in a single pass
CALL(
PROGRAM(ast=SINK(STORE(buf_out, REDUCE(MUL(LOAD(a), LOAD(b)))))),
buf_a, # input
buf_b, # input
buf_out # output (newly allocated)
)
)
# var_vals = {} (no symbolic dims here)
# If buf_a or buf_b needed an upload first, the schedule would also contain:
# CALL(COPY, host_data, buf_a) ← before the kernel call