
Runtime & Execution

Stage 6 of 6 · tinygrad/engine/ · tinygrad/device.py · tinygrad/runtime/
run_linear — the execution entry point · engine/realize.py:252

Called at the end of Tensor.realize(). Takes the LINEAR UOp produced by the scheduler and executes each CALL node against the device.

def run_linear(
  linear:    UOp,                    # LINEAR with ordered CALLs
  var_vals:  dict[str, int] | None,  # symbolic variable values
  input_uops: tuple[UOp, ...] = (), # externally-provided buffers
  update_stats: bool = True,
  jit:  bool = False,
  wait: bool = False,
):
  ctx = ExecContext(var_vals or {}, ...)
  for call in linear.src:
    pm_exec.rewrite(call, ctx)       # dispatch by CALL type
engine/realize.py:252 — run_linear
compile_linear — deferred compilation · engine/realize.py:246

Compiles all kernels in the LINEAR plan without executing them. Used by the JIT to pre-warm the kernel cache.

1. Apply pm_compile pattern matcher
   Walks the LINEAR and compiles any CALL nodes that contain un-compiled PROGRAM ASTs via do_to_program(). Converts SINK ASTs to BINARY UOps.
2. BEAM search (optional)
   When BEAM>0, compile many tiling candidates, benchmark them, and keep the fastest. The winner's compiled binary is cached.
3. Validate (optional)
   When VALIDATE_WITH_CPU=1, run the same computation on CPU and compare outputs. Useful for debugging backend correctness.
engine/realize.py:246 — compile_linear
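The compile-without-execute idea and the "benchmark candidates, keep the fastest" step can be illustrated with a small stdlib-only sketch. KernelCache, prewarm, and keep_fastest are hypothetical names chosen for this example, not tinygrad functions, and the real search is more involved than scoring a fixed list of candidates.

```python
from typing import Callable

class KernelCache:
    """Toy kernel cache: compiles each distinct source string at most once."""
    def __init__(self, compiler: Callable[[str], bytes]):
        self.compiler = compiler
        self.binaries: dict = {}
        self.compiles = 0  # how many times the compiler actually ran

    def get(self, source: str) -> bytes:
        if source not in self.binaries:
            self.binaries[source] = self.compiler(source)
            self.compiles += 1
        return self.binaries[source]

def prewarm(cache: KernelCache, sources: list) -> None:
    """Compile every kernel in the plan without running any of them."""
    for source in sources:
        cache.get(source)

def keep_fastest(candidates: list, benchmark: Callable[[str], float]) -> str:
    """Return the candidate with the lowest measured time."""
    return min(candidates, key=benchmark)

# Usage: duplicates in the plan compile once; later lookups are cache hits.
cache = KernelCache(lambda src: src.encode())
prewarm(cache, ["kernel_a", "kernel_b", "kernel_a"])
timings = {"tile8": 3.0, "tile16": 1.5, "tile32": 2.0}  # made-up timings
winner = keep_fastest(list(timings), timings.get)
```

Passing the benchmark as a callable keeps the selection step testable with fixed timings, independent of any real device.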
pm_exec — execution dispatch (pattern matcher) · engine/realize.py:233

Each CALL node is dispatched to the appropriate executor via a PatternMatcher. The CALL type determines which function handles it.

exec_kernel — run a compiled GPU/CPU kernel
Resolves buffer arguments to concrete Buffer objects, looks up or creates the kernel runtime via get_runtime(), then calls runtime(bufs, var_vals, global_size, local_size).
engine/realize.py:170 — exec_kernel
exec_copy — buffer data transfer
Handles host→device uploads, device→host downloads, and device→device copies. Dispatches to allocator's copyin()/copyout()/_transfer().
engine/realize.py:156 — exec_copy
exec_view — buffer alias/slice
Creates a BUFFER_VIEW — a zero-copy view into an existing buffer at an offset. Used for slices and tensor views.
engine/realize.py:149 — exec_view
exec_graph — batched kernel graph
Submits multiple kernels as a single batched command graph (CUDA Graph, Metal command buffer, HCQ graph). Reduces per-kernel CPU overhead for training loops.
engine/realize.py:200 — exec_graph
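The dispatch-by-CALL-type loop can be modeled roughly as follows. This is a toy under stated assumptions: it uses a plain dictionary instead of a PatternMatcher, and Call, Ctx, EXECUTORS, and run_plan are hypothetical names. Only the executor names mirror the ones listed above, and their bodies here are stand-ins that operate on host memory.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Call:
    kind: str   # hypothetical tags: "kernel", "copy", "view"
    args: dict

@dataclass
class Ctx:
    var_vals: dict
    buffers: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

def exec_kernel(call: Call, ctx: Ctx) -> None:
    # A symbolic launch dimension is stored as a variable name and resolved here.
    size = call.args["global_size"]
    resolved = ctx.var_vals[size] if isinstance(size, str) else size
    ctx.log.append(f"kernel {call.args['name']} global_size={resolved}")

def exec_copy(call: Call, ctx: Ctx) -> None:
    # Copy the source bytes into a fresh destination buffer.
    ctx.buffers[call.args["dest"]] = bytearray(ctx.buffers[call.args["src"]])
    ctx.log.append(f"copy {call.args['src']}->{call.args['dest']}")

def exec_view(call: Call, ctx: Ctx) -> None:
    # A memoryview slice aliases the base buffer without copying.
    base = memoryview(ctx.buffers[call.args["base"]])
    off, size = call.args["offset"], call.args["size"]
    ctx.buffers[call.args["dest"]] = base[off:off + size]
    ctx.log.append(f"view {call.args['base']}[{off}:{off + size}]")

EXECUTORS: dict = {"kernel": exec_kernel, "copy": exec_copy, "view": exec_view}

def run_plan(calls: list, var_vals: dict = None, buffers: dict = None) -> Ctx:
    """Execute each call in order, choosing the handler from its kind."""
    ctx = Ctx(var_vals or {}, dict(buffers or {}))
    for call in calls:
        EXECUTORS[call.kind](call, ctx)
    return ctx

# Usage: copy a buffer, alias two bytes of it, then "launch" a kernel.
plan = [
    Call("copy", {"src": "a", "dest": "b"}),
    Call("view", {"base": "b", "offset": 1, "size": 2, "dest": "v"}),
    Call("kernel", {"name": "add", "global_size": "n"}),
]
ctx = run_plan(plan, {"n": 8}, {"a": bytearray(b"\x01\x02\x03\x04")})
```

Because the view aliases its base, a later write to buffer "b" is visible through "v", which is the property a zero-copy view relies on.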
Buffer — device memory abstraction · device.py:99

Every tensor's data lives in a Buffer. Allocation is lazy — the underlying device memory is only allocated on first use.

device              Device string this buffer lives on ("CUDA:0", "CPU", etc.)
size                Number of elements
dtype               Element type
allocate()          Trigger actual device memory allocation via the device's allocator
copyin(mv)          Copy data from a Python memoryview into device memory
copyout(mv)         Copy device memory back to a Python memoryview
view(offset, size)  Create a zero-copy alias starting at the given element offset
ref(count)          Adjust reference count for kernel-held buffers (prevents premature free)
device.py:99 — class Buffer
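Lazy allocation and zero-copy views can be sketched with host memory standing in for device memory. ToyBuffer is a hypothetical class written for this example. It borrows the method names from the table above but is not tinygrad's Buffer, and it only accepts byte-format data.

```python
from typing import Optional

class ToyBuffer:
    """Hypothetical lazy buffer backed by a bytearray (not tinygrad's Buffer)."""
    def __init__(self, size: int, itemsize: int = 4,
                 base: Optional["ToyBuffer"] = None, offset: int = 0):
        self.size, self.itemsize = size, itemsize
        self.base, self.offset = base, offset  # offset is counted in elements
        self._mem = None                       # nothing is allocated yet

    @property
    def nbytes(self) -> int:
        return self.size * self.itemsize

    def is_allocated(self) -> bool:
        return self._mem is not None

    def allocate(self) -> "ToyBuffer":
        if self._mem is None:
            if self.base is None:
                self._mem = memoryview(bytearray(self.nbytes))
            else:
                # A view slices its base's memory, so no bytes are copied.
                start = self.offset * self.itemsize
                self._mem = self.base.allocate()._mem[start:start + self.nbytes]
        return self

    def copyin(self, mv) -> None:
        # Allocation happens on first use; mv must hold exactly nbytes bytes.
        self.allocate()._mem[:] = mv

    def copyout(self, mv):
        mv[:] = self.allocate()._mem
        return mv

    def view(self, offset: int, size: int) -> "ToyBuffer":
        return ToyBuffer(size, self.itemsize, base=self, offset=offset)

# Usage: the buffer owns no memory until the first copyin.
buf = ToyBuffer(4, itemsize=1)
allocated_before = buf.is_allocated()
buf.copyin(memoryview(b"\x01\x02\x03\x04"))
window = buf.view(1, 2)
```

Writes through the view land in the base buffer, since both share the same underlying bytes.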
Device singleton & Compiled class · device.py:15, 287

Device["CUDA"] returns the singleton for that backend, which provides the allocator, compiler, and renderer triplet.

Device[name]    Returns cached device instance, loading the runtime module on first access
Device.DEFAULT  Current default device string from env or first available
allocator       Device-specific memory allocator (e.g. CUDAAllocator, MetalAllocator)
compiler        Code compiler for this device (nvrtc, clang, Metal, etc.)
renderer        Source code renderer for this device
runtime(prg)    Creates a callable kernel from compiled bytes
graph()         Creates a batched graph executor for JIT training loops
device.py:15 — class _Device · device.py:287 — class Compiled
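The "cached instance, loaded on first access" behaviour of Device[name] amounts to a lookup table with a lazy fill. The sketch below assumes nothing about tinygrad internals: DeviceRegistry and ToyDevice are hypothetical, and the string attributes are placeholders for real allocator, compiler, and renderer objects.

```python
class ToyDevice:
    """Placeholder device holding the allocator/compiler/renderer triplet."""
    def __init__(self, name: str):
        self.name = name
        self.allocator = f"{name}-allocator"  # stand-ins for real objects
        self.compiler = f"{name}-compiler"
        self.renderer = f"{name}-renderer"

class DeviceRegistry:
    """Hypothetical registry: opens each backend once, then reuses it."""
    def __init__(self, default: str = "CPU"):
        self.DEFAULT = default
        self._opened: dict = {}
        self.loads = 0  # counts how many backends were actually opened

    def __getitem__(self, name: str) -> ToyDevice:
        if name not in self._opened:
            self.loads += 1
            self._opened[name] = ToyDevice(name)
        return self._opened[name]

# Usage: the second lookup returns the same object without reloading.
registry = DeviceRegistry()
first = registry["CUDA"]
second = registry["CUDA"]
```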
Runtime backends · runtime/ops_*.py
CUDA
runtime/ops_cuda.py
nvrtc JIT compiler, CUDAAllocator, CUDA stream management.
Metal (Apple)
runtime/ops_metal.py
Metal command queues, MTLBuffer allocation, MSL compilation.
AMD (HIP/HSA)
runtime/ops_amd.py
HCQ-based direct HSA queue submission. Low-overhead GPU dispatch.
NV (low-level)
runtime/ops_nv.py
Direct NVIDIA HCQ (hardware command queue) submission without CUDA driver overhead.
CPU (clang)
runtime/ops_cpu.py
Clang JIT compiler, mmap allocator, shared library loading and calling.
OpenCL
runtime/ops_opencl.py
OpenCL platform/device enumeration, cl_mem allocation, kernel build and enqueue.
WebGPU
runtime/ops_webgpu.py
Browser-native GPU via the WebGPU API. Runs in Node or browser environments.
DISK
runtime/ops_disk.py
Memory-mapped file buffers. Used for loading model weights directly from disk.
Stats tracking & profiling · engine/realize.py:49

Every kernel execution updates global counters and can emit profiling events.

GlobalCounters  Tracks kernel count, total FLOPS, memory bandwidth, and wall time across all executions
track_stats()   Records per-kernel execution time, op count, memory bytes
ProfileEvent    Emitted for each kernel/copy; used by the profiler for timeline visualization
estimate_uop()  Static analysis of a CALL to estimate FLOPS and memory before running
engine/realize.py:49 — track_stats · engine/realize.py:38 — estimate_uop
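A minimal model of per-kernel stats accumulation, assuming the op and byte counts are supplied by a prior estimate. Counters and tracked are hypothetical names for this sketch, not tinygrad's GlobalCounters or track_stats. The timer is injectable so the arithmetic can be checked without a real clock.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Counters:
    kernel_count: int = 0
    ops: int = 0
    mem_bytes: int = 0
    time_s: float = 0.0

    def gflops(self) -> float:
        """Throughput over everything recorded so far, in GFLOP/s."""
        return self.ops / self.time_s / 1e9 if self.time_s > 0 else 0.0

def tracked(counters: Counters, fn: Callable, ops: int, mem_bytes: int,
            timer: Callable[[], float] = time.perf_counter):
    """Run fn, then add its wall time and estimated cost to the counters."""
    start = timer()
    result = fn()
    counters.time_s += timer() - start
    counters.kernel_count += 1
    counters.ops += ops
    counters.mem_bytes += mem_bytes
    return result

# Usage with a fake clock: one "kernel" of 1e9 ops that takes 0.5 s.
ticks = iter([0.0, 0.5])
counters = Counters()
value = tracked(counters, lambda: 42, ops=1_000_000_000, mem_bytes=4096,
                timer=lambda: next(ticks))
```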
Caching layers summary · multi-level cache
Level 1: UOp deduplication
  UOpMetaClass caches UOp instances by (op, dtype, src, arg)
  → the same graph node is returned if it already exists

Level 2: Schedule cache (SCACHE=1)
  schedule_cache[uop_key] = linear_uop
  → skip re-scheduling for identical tensor graphs

Level 3: Kernel source cache
  Compiler.compile_cached(source) hashes source string
  → skip recompiling identical kernels within a process

Level 4: Disk cache (~/.cache/tinygrad/)
  Compiled binaries persisted to disk
  → survive process restarts; keyed on source hash + device

Level 5: Runtime cache
  runtime_cache[(key, device)] = compiled_kernel_callable
  → skip kernel loading overhead on repeated runs

Level 6: Graph cache (JIT)
  graph_cache caches batched CUDA Graph / Metal command buffer
  → near-zero CPU overhead per training step
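Two of these levels are easy to model in isolation: instance deduplication through a metaclass (Level 1) and a disk cache keyed on a hash of device plus source (Level 4). The sketch below is an illustration only. Interned, Node, and compile_cached are hypothetical, the toy holds strong references to every node, and the key layout and file naming are assumptions rather than tinygrad's actual scheme.

```python
import hashlib
import os
import tempfile

class Interned(type):
    """Metaclass returning the existing instance for an identical key."""
    _instances: dict = {}

    def __call__(cls, op, src=(), arg=None):
        key = (cls, op, src, arg)
        if key not in Interned._instances:
            Interned._instances[key] = super().__call__(op, src, arg)
        return Interned._instances[key]

class Node(metaclass=Interned):
    def __init__(self, op, src=(), arg=None):
        self.op, self.src, self.arg = op, src, arg

def compile_cached(source, device, cache_dir, compiler):
    """Return a binary for source, reusing an on-disk copy when one exists."""
    key = hashlib.sha256(f"{device}\n{source}".encode()).hexdigest()
    path = os.path.join(cache_dir, key)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    binary = compiler(source)
    with open(path, "wb") as f:
        f.write(binary)
    return binary

# Usage: structurally identical graphs collapse to one object.
a = Node("ADD", (Node("CONST", arg=1), Node("CONST", arg=2)))
b = Node("ADD", (Node("CONST", arg=1), Node("CONST", arg=2)))

# Usage: the second request is served from disk, so the compiler runs once.
compiler_calls = []
def toy_compiler(source):
    compiler_calls.append(source)
    return source.encode()

with tempfile.TemporaryDirectory() as cache_dir:
    first = compile_cached("kernel body", "CPU", cache_dir, toy_compiler)
    second = compile_cached("kernel body", "CPU", cache_dir, toy_compiler)
```

Since child nodes are themselves deduplicated, comparing the src tuple by identity is enough to detect a structurally equal parent.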