# ADR-0046: TLContext, the Kernel-side `tl.*` API Contract

## Status

Accepted (2026-05-22).

Documents the set of `tl.*` primitives exposed by `src/kernbench/triton_emu/`'s `TLContext`, their semantics, and the two execution-mode contracts (command-list / greenlet runner). ADR-0014 and ADR-0020 define the PE pipeline and the 2-pass execution model, but **the `tl.*` surface that bench kernel functions call** had no ADR-level coverage.

## First action

When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id, num_cubes, scratch_base, scratch_size)` is instantiated, the first action is to initialize six categories of state:

- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes`: values that `tl.program_id` / `tl.num_programs` will return.
- `self._dispatch_cycles`: cycle count emitted as `PeCpuOverheadCmd(cycles)` at the start of every non-metadata `tl.*` API call (D6).
- `self._runner`: `KernelRunner` instance (present → greenlet mode; absent → command-list mode).
- `self._commands: list[PeCommand] = []`: command-list accumulator (command-list mode only).
- `self._handle_counter = 0`, `self._completion_counter = 0`: counters for generating TensorHandle / CompletionHandle ids.
- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0`: PE-local scratch region (used for math/dot/composite output handle addresses).

In short, **TLContext's first act is "record where (sip/cube/pe) and at what scale (num_programs/num_cubes) this kernel instance runs, and pick its dispatch mode (runner present or not)"**. No SimPy event is created and no command is emitted at this moment.

The first runtime action happens when the kernel function makes its first `tl.*` call. The standard entry sequence for a command-emitting `tl.*` API is:

1. Call `self._emit_dispatch_overhead()`: if `dispatch_cycles > 0`, it immediately `_emit`s a `PeCpuOverheadCmd(dispatch_cycles)`.
2. Per-API processing (TensorHandle creation, command construction).
3. `self._emit(cmd)`: in runner mode this `greenlet.switch()`es the cmd to SimPy; in command-list mode it appends to `self._commands`.

## Context

The `tl.*` surface consists of `TLContext`'s methods, and the `tl` parameter received by a kernel function is a `TLContext` object. The contract the user (bench author) sees:

- Which primitives exist.
- What data flow each primitive triggers (DMA / compute / IPCQ / metadata-only).
- How a TensorHandle's `space` and `addr` are decided.
- The difference between command-list and greenlet modes.

ADR-0014 (PE pipeline) defines the PeCommands consumed by PE_SCHEDULER, but how `tl.*` emits them is a code-only convention. ADR-0020 (2-pass data execution) mentions greenlet mode in D3 but does not pin down the signature difference (return-value handling) between the runner / non-runner paths. This ADR fills the gap.

## Decision

### D1. The `tl` parameter is a `TLContext` instance

A bench kernel function has the signature:

```python
def _kernel(arg1, arg2, ..., tl, **kwargs):
    ...
```

`tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name imitates real Triton's `triton.language` module; the actual Triton module is **not** passed in.

The kernel is plain Python, with no `yield` or `async`. `tl.*` calls produce SimPy events, but to the caller they appear synchronous because in greenlet mode the KernelRunner relays between SimPy and the kernel (ADR-0020 D3).

### D2. Two execution modes: command-list / greenlet runner

- **Command-list mode (`runner is None`)**: `tl.*` calls append a PeCommand to `self._commands`. DMA / GEMM / Math consume no SimPy time and return metadata-only TensorHandles (`data=None`). PE_SCHEDULER / sim_engine later replays the command sequence in time.
- **Greenlet runner mode (`runner is not None`)**: `tl.*` calls `self._emit(cmd)` → `runner.switch_to_simpy(cmd)`, handing control to the parent greenlet (SimPy).
  The parent distributes the cmd to components, consumes SimPy time, and (for DMA reads) returns real numpy data. The kernel receives the result and continues to the next line (the data-aware execution model from ADR-0020 D3).

The choice of mode is decided by whether a KernelRunner is injected into the TLContext. The `tl.*` methods themselves are mode-blind: they go through `_emit()` uniformly.

### D3. Primitive categories

#### D3.1. Reference (no DMA, metadata only)

- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: create a handle referencing HBM data without issuing DMA. Used when the scheduler streams the data per-tile (e.g., the `b` operand of a composite GEMM).

#### D3.2. Data movement (blocking, DMA engine)

- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → handle. Emits `DmaReadCmd`. In greenlet mode the returned handle's `.data` carries real numpy data; in command-list mode it is a placeholder. The handle has `space="hbm"`, `pinned=True`.
- `tl.store(ptr, handle) -> None`: TCM → HBM. Emits `DmaWriteCmd`. In greenlet mode, when `handle.data` is present, `_store.write("hbm", ptr, data)` runs first (visibility = issue time, ADR-0020 D3).

#### D3.3. GEMM / compute (blocking)

- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must live in TCM; shapes `(M,K) × (K,N) → (M,N)`. Emits `GemmCmd`; the output handle is allocated from PE-local scratch via `_make_compute_out(shape, dtype)`.
- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None, acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking tiled pipeline. Emits `CompositeCmd`. `epilogue` is a list of dicts, each with `"op"` plus op-specific fields and an optional `"scope"` (k_tile / output_tile). Unknown ops or missing fields raise ValueError immediately. The returned CompletionHandle is synchronized via `tl.wait(h)`.

#### D3.4. Math: unary (blocking)

- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`, `tl.cos(x)`, `tl.sin(x)`: each emits `MathCmd(op=..., inputs=(x,), out=...)`. `out` is scratch-allocated with the same shape/dtype as `x`.

#### D3.5. Math: binary (blocking)

- `tl.maximum(a, b)`, `tl.minimum(a, b)`: go through `_binary_math`.
- `tl.fma(a, b, c)`: `a*b + c`. Three inputs.
- `tl.clamp(x, min, max)`: `MathCmd(op="clamp", inputs=(x, min, max))`.
- `tl.where(cond, a, b)`: `MathCmd(op="where", inputs=(cond, a, b))`.
- `tl.softmax(x, axis=-1)`: a single `MathCmd(op="softmax")`, so that timing is accounted as one dispatch. Phase 2 DataExecutor expands it to the canonical (x-max → exp → sum → div) sequence.

#### D3.6. Reduction (blocking)

- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)`: return an output handle with the axis size collapsed to 1. Emit `MathCmd(op=..., inputs=(x,), out=..., axis=axis)`.

#### D3.7. Index / scalar (PE_CPU, no engine)

- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE index), `axis==1` → cube_id (ADR-0022).
- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per cube), `axis==1` → num_cubes.
- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range in TCM. No command emitted.
- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape, value, dtype="f16") -> TensorHandle`: TCM placeholder. No command emitted.

#### D3.8. Scalar helpers (no command, no engine)

- `TLContext.cdiv(a, b) -> int` (static): ceiling division `-(-a // b)`. Mirrors real Triton's `tl.cdiv`.

#### D3.9. Metadata-only (no compute, no DMA)

- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims swapped. Shares `addr` and `data`; no command emitted.

#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)

- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None, dtype="f16", space="tcm") -> None`: blocking send. Accepts either handle form or raw-address form. Emits `IpcqSendCmd`.
  The handle's `.data` snapshot rides along on the command, which avoids the race where a later inbound IPCQ overwrites the slot before the outbound PE_DMA reads it.
- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None, dst_space=None) -> TensorHandle`: blocking recv. Providing both `dst_addr` and `dst_space` enters "copy_to_dst" mode; otherwise "return_slot" mode. In greenlet mode the handle's `.data` carries the real data.
- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`: **DIAGNOSTIC ONLY**. Has the same blocking-arrival semantics as `tl.recv` but skips the slot-read latency charge (slot-IO + PE↔bank fabric drain). Used in the pe2pe overview plot for an apples-to-apples comparison against `tl.store`. Production kernels MUST NOT use it; the diagnostic flag is isolated in its own command branch (`consume=False`) so it cannot be accidentally enabled.
- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking recv. Returns a `RecvFuture`; resolved later by `tl.wait(future)`.

#### D3.11. Composite + control

- `tl.composite(...)`: see D3.3.
- `tl.wait(handle=None)`: wait on a `CompletionHandle` (composite), a `RecvFuture` (async recv), or `None` (all pending composites).
- `tl.cycles(n)`: declare a scalar PE_CPU overhead. Emits `PeCpuOverheadCmd(cycles=n)`.

### D4. TensorHandle arithmetic operators: thread-local TLContext

At module load, `tl_context.py::_enable_tensor_ops()` runs and patches `TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__`. Each operator calls `_binary_math` on the active TLContext stored in a module-level thread-local `_ctx`.

So inside a kernel, `c = a + b` is equivalent to emitting `MathCmd(op="add", inputs=(a, b), out=...)` and returning a new TensorHandle.

Active-TLContext management:

- `TLContext._set_active(ctx)`: set the active ctx for the current thread/greenlet.
- `TLContext._get_active()`: read it (RuntimeError if unset).
- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper.
  Sets active on entry, runs the kernel, restores `None` on exit.

`KernelRunner` re-asserts `_set_active(tl)` inside its `_switch_kernel` just before resuming the kernel, so a sibling PE runner that overwrote the thread-local context is correctly recovered.

### D5. Scratch allocator: compute output handles

Ops that produce a result (`tl.dot`, `tl.exp`, `tl.add` via TensorHandle `__add__`, etc.) call `_make_compute_out(shape, dtype)` to obtain a 16-byte-aligned scratch address. The address is published with `space="tcm"`, so the handle can later be the source of a `tl.send` / `tl.store`.

When `_scratch_base == 0` (e.g., command-list mode), the address is 0 and the handle cannot be a send/store source (in that case, only `tl.load`-returned handles are valid sources).

When the cursor exceeds `_scratch_size` (default 1 MiB), a RuntimeError is raised. The cursor must reset between kernel invocations (current code naturally satisfies this: KernelRunner creates a fresh TLContext each time).

### D6. Dispatch overhead: `PeCpuOverheadCmd(dispatch_cycles)`

Every non-metadata `tl.*` call starts with `_emit_dispatch_overhead()`, which emits `PeCpuOverheadCmd(dispatch_cycles)` when `dispatch_cycles > 0`. This models the cycles PE_CPU spends dispatching the command.

Defaults:

- `TLContext.__init__`'s `dispatch_cycles` parameter default: `1` cycle.
- TLContext built by `KernelRunner`: `0` cycles (greenlet mode handles cycle accounting differently, in line with the intent of ADR-0020 D3).

### D7. Kernel registry (`triton_emu/registry.py`)

A separate `_kernels: dict[str, Callable]` holds the name → function mapping:

- `register_kernel(name, fn)`: ValueError on duplicate.
- `get_kernel(name)`: KeyError if missing.
- `clear_registry()`: test-only.

`RuntimeContext.launch(kernel_name, kernel_fn, *args)` overwrites `_kernels[kernel_name] = kernel_fn` on every call (last-call-wins, idempotent), consistent with ADR-0045 D8's `launch` behavior.
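
The registry semantics described above can be illustrated with a minimal sketch. This is not the actual `triton_emu/registry.py`: the function bodies are assumptions, and `launch_overwrite` is a hypothetical helper that stands in for the registration step of `RuntimeContext.launch`. Only the names `_kernels`, `register_kernel`, `get_kernel`, `clear_registry` and their error behavior come from this ADR.

```python
from typing import Callable

# Module-level name -> kernel function mapping (name taken from D7).
_kernels: dict[str, Callable] = {}


def register_kernel(name: str, fn: Callable) -> None:
    """Register a kernel; a duplicate name is rejected with ValueError."""
    if name in _kernels:
        raise ValueError(f"kernel {name!r} is already registered")
    _kernels[name] = fn


def get_kernel(name: str) -> Callable:
    """Look up a kernel; a missing name raises KeyError from the dict lookup."""
    return _kernels[name]


def clear_registry() -> None:
    """Test-only reset of the mapping."""
    _kernels.clear()


def launch_overwrite(name: str, fn: Callable) -> None:
    """Hypothetical stand-in for the launch path: overwrite unconditionally,
    so repeated launches under the same name never raise (last call wins)."""
    _kernels[name] = fn
```

The split between a strict `register_kernel` and an overwriting launch path means explicit registration catches accidental name collisions, while re-launching the same bench kernel stays idempotent.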
PE_CPU looks up `KernelRef.name` in the registry and runs the function through KernelRunner.

## Alternatives Considered

### A1. Fold `tl.*` into ADR-0014 / ADR-0020

Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption of PeCommands); ADR-0020 covers 2-pass execution (Phase 1 timing / Phase 2 data). The `tl.*` surface is what the kernel author touches; a dedicated ADR improves findability and onboarding.

### A2. Deprecate command-list mode

Rejected (for now). Simple unit tests and kernel verification benefit from the lighter command-list path: it exposes a PeCommand sequence inspector without requiring greenlet machinery. When greenlet-mode semantics (real data, Phase 2) are needed, D2 explicitly selects them.

### A3. Remove TensorHandle arithmetic operators

Rejected. They mimic real Triton kernel ergonomics (e.g., `c = a + b`), and the thread-local active-ctx pattern works cleanly. The explicit function-form (`tl.add(a, b)`) is also exposed in D3.5, so the operators are syntactic sugar.

### A4. Expand softmax into the explicit sequence (max → exp → sum → div)

Partially adopted. `tl.softmax` is a single `MathCmd(op="softmax")` for timing accounting (D3.5), but Phase 2 DataExecutor expands it to the canonical sequence for real-data computation. The timing model is atomic and the data model is expanded; the split is intentional.

## Consequences

- Every `tl.*` primitive a bench author meets is classified and defined in a single ADR. Paired with ADR-0045 D8's host-side surface (`torch.empty` etc.), the inside-kernel and outside-kernel authoring guides are now complete.
- The command-list / greenlet difference is pinned in D2, so any new `tl.*` primitive that follows the `_emit()` pattern auto-supports both modes.
- The thread-local active-ctx pattern (D4) is justified at ADR level, clarifying who owns the reset responsibility when multiple PE runners share a thread (KernelRunner.run's contract restores active inside `_switch_kernel`).
- `tl.recv_no_consume`'s diagnostic isolation (D3.10) is hardened in ADR form: accidental production use is blocked by a separate command branch.
- The registry (D7) gets its own D-section, formalizing the name-collision and dynamic-re-registration semantics.
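
To make the `_emit()` contract from D2 and D6 concrete, here is a minimal, self-contained sketch. `MiniTL`, `FakeRunner`, and the two dataclasses are hypothetical, simplified stand-ins written for this example; they are not the real `TLContext`, `KernelRunner`, or `PeCommand` classes, the command fields are assumed, and real greenlet switching is replaced by a plain method call. The pointer arithmetic in `_kernel` is likewise only illustrative.

```python
from dataclasses import dataclass


@dataclass
class PeCpuOverheadCmd:  # stand-in; the real command class may differ
    cycles: int


@dataclass
class DmaReadCmd:  # stand-in with assumed fields
    ptr: int
    shape: tuple


class FakeRunner:
    """Hypothetical runner: records commands instead of switching greenlets."""

    def __init__(self):
        self.seen = []

    def switch_to_simpy(self, cmd):
        self.seen.append(cmd)
        return None  # a real runner could hand back data for DMA reads


class MiniTL:
    """Simplified model of the dual-mode dispatch, not the real TLContext."""

    def __init__(self, pe_id, num_programs, dispatch_cycles=1, runner=None):
        self._pe_id = pe_id
        self._num_programs = num_programs
        self._dispatch_cycles = dispatch_cycles
        self._runner = runner
        self._commands = []

    def _emit(self, cmd):
        if self._runner is not None:  # greenlet-runner mode
            return self._runner.switch_to_simpy(cmd)
        self._commands.append(cmd)  # command-list mode
        return None

    def _emit_dispatch_overhead(self):
        if self._dispatch_cycles > 0:
            self._emit(PeCpuOverheadCmd(self._dispatch_cycles))

    def program_id(self, axis=0):
        return self._pe_id  # axis 1 (cube_id) is omitted in this sketch

    def load(self, ptr, shape):
        # Standard entry: dispatch overhead first, then the command itself.
        self._emit_dispatch_overhead()
        return self._emit(DmaReadCmd(ptr, shape))


def _kernel(a_ptr, tl):
    # Plain Python kernel: no yield/async; the same body runs in both modes.
    tl.load(a_ptr + tl.program_id(0) * 64, (8, 8))


list_tl = MiniTL(pe_id=2, num_programs=4)  # no runner: command-list mode
_kernel(0x1000, tl=list_tl)

runner = FakeRunner()
run_tl = MiniTL(pe_id=2, num_programs=4, dispatch_cycles=0, runner=runner)
_kernel(0x1000, tl=run_tl)
```

In this sketch the command-list context accumulates an overhead command followed by the DMA read in `list_tl._commands`, while the runner-backed context (built with `dispatch_cycles=0`, mirroring the KernelRunner default in D6) forwards only the DMA read to `runner.seen` and leaves its own command list empty. The kernel body is identical in both cases, which is the property the second consequence relies on.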