kernbench2/docs/adr/ADR-0046-prog-tl-context-contract.md

# ADR-0046: TLContext — Kernel-side `tl.*` API Contract

## Status

Accepted (2026-05-22).

Documents the set of `tl.*` primitives exposed by
`src/kernbench/triton_emu/`'s `TLContext`, their semantics, and the two
execution-mode contracts (command-list / greenlet runner). ADR-0014/0020
defines the PE pipeline and the 2-pass execution model, but **the `tl.*`
surface that bench kernel functions call** had no ADR-level coverage.

## First action

When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id,
num_cubes, scratch_base, scratch_size)` is instantiated, the first action
is to initialize six categories of state:

- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes` —
  values that `tl.program_id` / `tl.num_programs` will return.
- `self._dispatch_cycles` — cycle count emitted as `PeCpuOverheadCmd(cycles)`
  at the start of every `tl.*` API call.
- `self._runner` — `KernelRunner` instance (present → greenlet mode;
  absent → command-list mode).
- `self._commands: list[PeCommand] = []` — command-list accumulator
  (command-list mode only).
- `self._handle_counter = 0`, `self._completion_counter = 0` — counters
  for generating TensorHandle / CompletionHandle ids.
- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0` —
  PE-local scratch region (used for math/dot/composite output handle
  addresses).

In short, **TLContext's first act is "record where (sip/cube/pe) and at
what scale (num_programs/num_cubes) this kernel instance runs, and pick
its dispatch mode (runner present or not)"**. No SimPy event is created
and no command is emitted at this moment.

The runtime first action happens when the kernel function first calls a
`tl.<api>()`. The standard entry for every `tl.*` API is:

1. Call `self._emit_dispatch_overhead()` — if `dispatch_cycles > 0`,
   immediately `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
2. Per-API processing (TensorHandle creation, command construction).
3. `self._emit(cmd)` — in runner mode this `greenlet.switch()`es the cmd
   to SimPy; in command-list mode it appends to `self._commands`.

## Context

The `tl.*` surface consists of `TLContext`'s methods, and the `tl`
parameter received by a kernel function is one of these objects. The
contract the user (bench author) sees:

- Which primitives exist.
- What data flow each primitive triggers (DMA / compute / IPCQ /
  metadata-only).
- How a TensorHandle's `space` and `addr` are decided.
- The difference between command-list and greenlet modes.

ADR-0014 (PE pipeline) defines the PeCommands consumed by PE_SCHEDULER,
but how `tl.*` emits them is a code-only convention. ADR-0020 (2-pass
data execution) mentions greenlet mode in D3 but does not pin down the
signature difference (return-value handling) between the runner /
non-runner paths. This ADR fills the gap.

## Decision

### D1. The `tl` parameter is a `TLContext` instance

A bench kernel function has the signature:

```python
def _kernel(arg1, arg2, ..., tl, **kwargs):
    ...
```

`tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name
imitates real Triton's `triton.language` module; the actual Triton
module is **not** passed in.

The kernel is plain Python — no `yield` or `async`. `tl.*` calls produce
SimPy events, but to the caller they appear synchronous because in
greenlet mode the KernelRunner relays between SimPy and the kernel
(ADR-0020 D3).

### D2. Two execution modes — command-list / greenlet runner

- **Command-list mode (`runner is None`)**: `tl.*` calls append PeCommand
  to `self._commands`. DMA / GEMM / Math consume no SimPy time and return
  metadata-only TensorHandles (`data=None`). PE_SCHEDULER / sim_engine
  later replays the command sequence in time.

- **Greenlet runner mode (`runner is not None`)**: `tl.*` calls
  `self._emit(cmd)` → `runner.switch_to_simpy(cmd)`, handing control to
  the parent greenlet (SimPy). The parent distributes the cmd to
  components, consumes SimPy time, and (for DMA reads) returns real numpy
  data. The kernel receives the result and continues to the next line
  (the data-aware execution model from ADR-0020 D3).

The choice of mode is decided by whether a KernelRunner is injected into
the TLContext. The `tl.*` methods themselves are mode-blind — they go
through `_emit()` uniformly.

### D3. Primitive categories

#### D3.1. Reference (no DMA, metadata only)

- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: create a handle
  referencing HBM data without issuing DMA. Used when the scheduler
  streams the data per-tile (e.g., the b operand of a composite GEMM).

#### D3.2. Data movement (blocking, DMA engine)

- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → handle.
  Emits `DmaReadCmd`. In greenlet mode the returned handle's `.data`
  carries real numpy data; in command-list mode it is a placeholder.
  The handle has `space="hbm"`, `pinned=True`.
- `tl.store(ptr, handle) -> None`: TCM → HBM. Emits `DmaWriteCmd`. In
  greenlet mode, when `handle.data` is present, `_store.write("hbm",
  ptr, data)` runs first (visibility = issue time, ADR-0020 D3).

#### D3.3. GEMM / compute (blocking)

- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must live in
  TCM; shapes `(M,K) × (K,N) → (M,N)`. Emits `GemmCmd`; the output
  handle is allocated from PE-local scratch via
  `_make_compute_out(shape, dtype)`.
- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
  acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking
  tiled pipeline. Emits `CompositeCmd`. `epilogue` is a list of dicts,
  each with `"op"` plus op-specific fields and an optional `"scope"`
  (k_tile / output_tile). Unknown ops or missing fields raise
  ValueError immediately. The returned CompletionHandle synchronizes
  via `tl.wait(h)`.

#### D3.4. Math: unary (blocking)

- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
  `tl.cos(x)`, `tl.sin(x)` — each emits `MathCmd(op=<name>,
  inputs=(x,), out=)`. `out` is scratch-allocated with the same
  shape/dtype as `x`.

#### D3.5. Math: binary (blocking)

- `tl.maximum(a, b)`, `tl.minimum(a, b)` — `_binary_math`.
- `tl.fma(a, b, c)` — `a*b + c`. Three inputs.
- `tl.clamp(x, min, max)` — `MathCmd(op="clamp", inputs=(x, min, max))`.
- `tl.where(cond, a, b)` — `MathCmd(op="where", inputs=(cond, a, b))`.
- `tl.softmax(x, axis=-1)` — a single `MathCmd(op="softmax")` so timing
  accounts at one dispatch. Phase 2 DataExecutor expands it to the
  canonical (x-max → exp → sum → div) sequence.

#### D3.6. Reduction (blocking)

- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)` — return an
  output handle with the axis size collapsed to 1. Emit
  `MathCmd(op=<name>, inputs=(x,), out=, axis=axis)`.

#### D3.7. Index / scalar (PE_CPU, no engine)

- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE
  index), `axis==1` → cube_id (ADR-0022).
- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per
  cube), `axis==1` → num_cubes.
- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range
  in TCM. No command emitted.
- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape,
  value, dtype="f16") -> TensorHandle`: TCM placeholder. No command
  emitted.

#### D3.8. Scalar helpers (no command, no engine)

- `TLContext.cdiv(a, b) -> int` (static): ceiling division
  `-(-a // b)`. Mirrors real Triton's `tl.cdiv`.

#### D3.9. Metadata-only (no compute, no DMA)

- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims
  swapped. Shares `addr` and `data`; no command emitted.

#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)

- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
  dtype="f16", space="tcm") -> None`: blocking send. Accepts either
  handle form or raw-address form. Emits `IpcqSendCmd`. The handle's
  `.data` snapshot rides along on the command — avoiding the race
  where a later inbound IPCQ overwrites the slot before the outbound
  PE_DMA reads it.
- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
  dst_space=None) -> TensorHandle`: blocking recv. Providing both
  `dst_addr` and `dst_space` enters "copy_to_dst" mode; otherwise
  "return_slot" mode. In greenlet mode the handle's `.data` carries
  the real data.
- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
  **DIAGNOSTIC ONLY**. Has the same blocking-arrival semantics as
  `tl.recv` but skips the slot-read latency charge (slot-IO + PE↔bank
  fabric drain). Used in the pe2pe overview plot for an apples-to-apples
  comparison against `tl.store`. Production kernels MUST NOT use it —
  the diagnostic flag is isolated in its own command branch
  (`consume=False`) so it cannot be accidentally enabled.
- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
  recv. Returns a `RecvFuture`; resolved later by `tl.wait(future)`.

#### D3.11. Composite + control

- `tl.composite(...)`: see D3.3.
- `tl.wait(handle=None)`: wait on a `CompletionHandle` (composite), a
  `RecvFuture` (async recv), or `None` (all pending composites).
- `tl.cycles(n)`: declare a scalar PE_CPU overhead. Emits
  `PeCpuOverheadCmd(cycles=n)`.

### D4. TensorHandle arithmetic operators — thread-local TLContext

At module load, `tl_context.py::_enable_tensor_ops()` runs and patches
`TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__`. Each
operator calls `_binary_math` on the active TLContext stored in a
module-level thread-local `_ctx`.

So inside a kernel, `c = a + b` is equivalent to emitting
`MathCmd(op="add", inputs=(a, b), out=)` and returning a new
TensorHandle.

Active-TLContext management:

- `TLContext._set_active(ctx)`: set the active ctx for the current
  thread/greenlet.
- `TLContext._get_active()`: read it (RuntimeError if unset).
- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets active
  on entry, runs the kernel, restores `None` on exit.

`KernelRunner` re-asserts `_set_active(tl)` inside its `_switch_kernel`
just before resuming the kernel, so a sibling PE runner that overwrote
the thread-local context is correctly recovered.

### D5. Scratch allocator — compute output handles

Ops that produce a result — `tl.dot`, `tl.exp`, `tl.add` (via
TensorHandle `__add__`), etc. — call `_make_compute_out(shape, dtype)`
to obtain a 16-byte-aligned scratch address. The address is published
with `space="tcm"`, so the handle can later be the source of a
`tl.send` / `tl.store`.

When `_scratch_base == 0` (e.g., command-list mode), the address is 0
and the handle cannot be a send/store source (in that case, only
`tl.load`-returned handles are valid sources).

When the cursor exceeds `_scratch_size` (default 1 MiB), a
RuntimeError is raised. The cursor must reset between kernel
invocations (current code naturally satisfies this: KernelRunner
creates a fresh TLContext each time).

### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`

Every non-metadata `tl.*` call starts with `_emit_dispatch_overhead()`,
which — when `dispatch_cycles > 0` — emits
`PeCpuOverheadCmd(dispatch_cycles)`. This models the cycles PE_CPU
spends dispatching the command.

Defaults:

- `TLContext.__init__`'s `dispatch_cycles` parameter default: `1` cycle.
- TLContext built by `KernelRunner`: `0` cycles (greenlet mode handles
  cycle accounting differently — aligned with ADR-0020 D3 intent).

### D7. Kernel registry (`triton_emu/registry.py`)

A separate `_kernels: dict[str, Callable]` holds the name → function
mapping:

- `register_kernel(name, fn)`: ValueError on duplicate.
- `get_kernel(name)`: KeyError if missing.
- `clear_registry()`: test-only.

`RuntimeContext.launch(kernel_name, kernel_fn, *args)` overwrites
`_kernels[kernel_name] = kernel_fn` on every call (last-call-wins,
idempotent) — consistent with ADR-0045 D8's `launch` behavior.

PE_CPU looks up `KernelRef.name` in the registry and runs the function
through KernelRunner.

## Alternatives Considered

### A1. Fold `tl.*` into ADR-0014 / ADR-0020

Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption
of PeCommands); ADR-0020 covers 2-pass execution (Phase 1 timing /
Phase 2 data). The `tl.*` surface is what the kernel author touches; a
dedicated ADR improves findability and onboarding.

### A2. Deprecate command-list mode

Rejected (currently). Simple unit tests and kernel verification benefit
from the lighter command-list path — it exposes a PeCommand sequence
inspector without requiring greenlet machinery. When greenlet-mode
semantics (real data, Phase 2) are needed, D2 explicitly selects them.

### A3. Remove TensorHandle arithmetic operators

Rejected. They mimic real Triton kernel ergonomics (e.g., `c = a + b`),
and the thread-local active-ctx pattern works cleanly. The explicit
function-form (`tl.add(a, b)`) is also exposed in D3.5, so the
operators are syntactic sugar.

### A4. Expand softmax into the explicit sequence (max → exp → sum → div)

Partially adopted. `tl.softmax` is a single `MathCmd(op="softmax")` for
timing accounting (D3.5), but Phase 2 DataExecutor expands it to the
canonical sequence for real-data computation. Timing model atomic,
data model expanded — the two split intentionally.

## Consequences

- Every `tl.*` primitive a bench author meets is classified and defined
  in a single ADR. Paired with ADR-0045 D8's host-side surface
  (`torch.empty` etc.), the inside-kernel and outside-kernel authoring
  guides are now complete.
- The command-list / greenlet difference is pinned in D2, so any new
  `tl.*` primitive that follows the `_emit()` pattern auto-supports
  both modes.
- The thread-local active-ctx pattern (D4) is justified at ADR level,
  clarifying who owns the reset responsibility when multiple PE
  runners share a thread (KernelRunner.run's contract restores active
  inside `_switch_kernel`).
- `tl.recv_no_consume`'s diagnostic isolation (D3.10) is hardened in
  ADR form — accidental production use is blocked by a separate
  command branch.
- The registry (D7) gets its own D-section, formalizing the
  name-collision and dynamic-re-registration semantics.