Files
kernbench2/docs/adr/ADR-0046-prog-tl-context-contract.md
T
ywkang 9a02955770 adr: add ADR-0046-0049 — close G4 coverage gaps from /report
Documents four cross-cutting surfaces that previously had no ADR backing,
each surfaced as a G4 candidate by /report:

- 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates
  all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...),
  the two execution modes (command-list vs greenlet runner), scratch
  allocator semantics, dispatch-overhead model, and the kernel registry.

- 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group
  (backend="ahbm") install path. world_size priority (algorithm >
  defaults > topology), the 4-step init sequence (load ccl.yaml, import
  algorithm module, derive world_size, install SFR + IPCQ), greenlet-
  local rank registry, all_reduce dispatch via _defer_wait, barrier
  no-op rationale, and the explicit list of unsupported dist.* APIs.

- 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator
  free-list semantics. Offset-keyed first-fit with coalescing, the
  no-validation trust model for free(), HBM/TCM channel separation,
  page-aligned VA allocation, the page_size dual-default
  (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and
  one-allocator-per-sub-unit rule.

- 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog.
  H2D / D2H / PE DMA categories with their exact cube-index choices,
  the 32 KiB reference size, the 5-point utilization sweep, the
  formula vs actual column meanings, automatic invariant checks
  (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine
  isolation, and the human-readable (not machine-parsable) output
  contract.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:25:04 -07:00

14 KiB
Raw Blame History

ADR-0046: TLContext — Kernel-side tl.* API Contract

Status

Accepted (2026-05-22).

Documents the set of tl.* primitives exposed by src/kernbench/triton_emu/'s TLContext, their semantics, and the two execution-mode contracts (command-list / greenlet runner). ADR-0014/0020 defines the PE pipeline and the 2-pass execution model, but the tl.* surface that bench kernel functions call had no ADR-level coverage.

First action

When TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id, num_cubes, scratch_base, scratch_size) is instantiated, the first action is to initialize six categories of state:

  • self._pe_id, self._num_programs, self._cube_id, self._num_cubes — values that tl.program_id / tl.num_programs will return.
  • self._dispatch_cycles — cycle count emitted as PeCpuOverheadCmd(cycles) at the start of every tl.* API call.
  • self._runnerKernelRunner instance (present → greenlet mode; absent → command-list mode).
  • self._commands: list[PeCommand] = [] — command-list accumulator (command-list mode only).
  • self._handle_counter = 0, self._completion_counter = 0 — counters for generating TensorHandle / CompletionHandle ids.
  • self._scratch_base, self._scratch_size, self._scratch_cursor = 0 — PE-local scratch region (used for math/dot/composite output handle addresses).

In short, TLContext's first act is "record where (sip/cube/pe) and at what scale (num_programs/num_cubes) this kernel instance runs, and pick its dispatch mode (runner present or not)". No SimPy event is created and no command is emitted at this moment.

The runtime first action happens when the kernel function first calls a tl.<api>(). The standard entry for every tl.* API is:

  1. Call self._emit_dispatch_overhead() — if dispatch_cycles > 0, immediately _emit a PeCpuOverheadCmd(dispatch_cycles).
  2. Per-API processing (TensorHandle creation, command construction).
  3. self._emit(cmd) — in runner mode this greenlet.switch()es the cmd to SimPy; in command-list mode it appends to self._commands.

Context

The tl.* surface consists of TLContext's methods, and the tl parameter received by a kernel function is one of these objects. The contract the user (bench author) sees:

  • Which primitives exist.
  • What data flow each primitive triggers (DMA / compute / IPCQ / metadata-only).
  • How a TensorHandle's space and addr are decided.
  • The difference between command-list and greenlet modes.

ADR-0014 (PE pipeline) defines the PeCommands consumed by PE_SCHEDULER, but how tl.* emits them is a code-only convention. ADR-0020 (2-pass data execution) mentions greenlet mode in D3 but does not pin down the signature difference (return-value handling) between the runner / non-runner paths. This ADR fills the gap.

Decision

D1. The tl parameter is a TLContext instance

A bench kernel function has the signature:

def _kernel(arg1, arg2, ..., tl, **kwargs):
    ...

tl is a kernbench.triton_emu.tl_context.TLContext instance. The name imitates real Triton's triton.language module; the actual Triton module is not passed in.

The kernel is plain Python — no yield or async. tl.* calls produce SimPy events, but to the caller they appear synchronous because in greenlet mode the KernelRunner relays between SimPy and the kernel (ADR-0020 D3).

D2. Two execution modes — command-list / greenlet runner

  • Command-list mode (runner is None): tl.* calls append PeCommand to self._commands. DMA / GEMM / Math consume no SimPy time and return metadata-only TensorHandles (data=None). PE_SCHEDULER / sim_engine later replays the command sequence in time.

  • Greenlet runner mode (runner is not None): tl.* calls self._emit(cmd)runner.switch_to_simpy(cmd), handing control to the parent greenlet (SimPy). The parent distributes the cmd to components, consumes SimPy time, and (for DMA reads) returns real numpy data. The kernel receives the result and continues to the next line (the data-aware execution model from ADR-0020 D3).

The choice of mode is decided by whether a KernelRunner is injected into the TLContext. The tl.* methods themselves are mode-blind — they go through _emit() uniformly.

D3. Primitive categories

D3.1. Reference (no DMA, metadata only)

  • tl.ref(ptr, shape, dtype="f16") -> TensorHandle: create a handle referencing HBM data without issuing DMA. Used when the scheduler streams the data per-tile (e.g., the b operand of a composite GEMM).

D3.2. Data movement (blocking, DMA engine)

  • tl.load(ptr, shape, dtype="f16") -> TensorHandle: HBM → handle. Emits DmaReadCmd. In greenlet mode the returned handle's .data carries real numpy data; in command-list mode it is a placeholder. The handle has space="hbm", pinned=True.
  • tl.store(ptr, handle) -> None: TCM → HBM. Emits DmaWriteCmd. In greenlet mode, when handle.data is present, _store.write("hbm", ptr, data) runs first (visibility = issue time, ADR-0020 D3).

D3.3. GEMM / compute (blocking)

  • tl.dot(a, b) -> TensorHandle: a @ b. Both operands must live in TCM; shapes (M,K) × (K,N) → (M,N). Emits GemmCmd; the output handle is allocated from PE-local scratch via _make_compute_out(shape, dtype).
  • tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None, acc_dtype=None, tile_shape=None) -> CompletionHandle: non-blocking tiled pipeline. Emits CompositeCmd. epilogue is a list of dicts, each with "op" plus op-specific fields and an optional "scope" (k_tile / output_tile). Unknown ops or missing fields raise ValueError immediately. The returned CompletionHandle synchronizes via tl.wait(h).

D3.4. Math: unary (blocking)

  • tl.exp(x), tl.log(x), tl.sqrt(x), tl.abs(x), tl.sigmoid(x), tl.cos(x), tl.sin(x) — each emits MathCmd(op=<name>, inputs=(x,), out=). out is scratch-allocated with the same shape/dtype as x.

D3.5. Math: binary (blocking)

  • tl.maximum(a, b), tl.minimum(a, b)_binary_math.
  • tl.fma(a, b, c)a*b + c. Three inputs.
  • tl.clamp(x, min, max)MathCmd(op="clamp", inputs=(x, min, max)).
  • tl.where(cond, a, b)MathCmd(op="where", inputs=(cond, a, b)).
  • tl.softmax(x, axis=-1) — a single MathCmd(op="softmax") so timing accounts at one dispatch. Phase 2 DataExecutor expands it to the canonical (x-max → exp → sum → div) sequence.

D3.6. Reduction (blocking)

  • tl.sum(x, axis), tl.max(x, axis), tl.min(x, axis) — return an output handle with the axis size collapsed to 1. Emit MathCmd(op=<name>, inputs=(x,), out=, axis=axis).

D3.7. Index / scalar (PE_CPU, no engine)

  • tl.program_id(axis=0) -> int: axis==0 → pe_id (cube-local PE index), axis==1 → cube_id (ADR-0022).
  • tl.num_programs(axis=0) -> int: axis==0 → num_programs (PEs per cube), axis==1 → num_cubes.
  • tl.arange(start, end, dtype="i32") -> TensorHandle: an index range in TCM. No command emitted.
  • tl.zeros(shape, dtype="f16") -> TensorHandle, tl.full(shape, value, dtype="f16") -> TensorHandle: TCM placeholder. No command emitted.

D3.8. Scalar helpers (no command, no engine)

  • TLContext.cdiv(a, b) -> int (static): ceiling division -(-a // b). Mirrors real Triton's tl.cdiv.

D3.9. Metadata-only (no compute, no DMA)

  • tl.trans(x) -> TensorHandle: a new handle with the last two dims swapped. Shares addr and data; no command emitted.

D3.10. IPCQ (CCL) primitives (ADR-0023 D4)

  • tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None, dtype="f16", space="tcm") -> None: blocking send. Accepts either handle form or raw-address form. Emits IpcqSendCmd. The handle's .data snapshot rides along on the command — avoiding the race where a later inbound IPCQ overwrites the slot before the outbound PE_DMA reads it.
  • tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None, dst_space=None) -> TensorHandle: blocking recv. Providing both dst_addr and dst_space enters "copy_to_dst" mode; otherwise "return_slot" mode. In greenlet mode the handle's .data carries the real data.
  • tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle: DIAGNOSTIC ONLY. Has the same blocking-arrival semantics as tl.recv but skips the slot-read latency charge (slot-IO + PE↔bank fabric drain). Used in the pe2pe overview plot for an apples-to-apples comparison against tl.store. Production kernels MUST NOT use it — the diagnostic flag is isolated in its own command branch (consume=False) so it cannot be accidentally enabled.
  • tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture: non-blocking recv. Returns a RecvFuture; resolved later by tl.wait(future).

D3.11. Composite + control

  • tl.composite(...): see D3.3.
  • tl.wait(handle=None): wait on a CompletionHandle (composite), a RecvFuture (async recv), or None (all pending composites).
  • tl.cycles(n): declare a scalar PE_CPU overhead. Emits PeCpuOverheadCmd(cycles=n).

D4. TensorHandle arithmetic operators — thread-local TLContext

At module load, tl_context.py::_enable_tensor_ops() runs and patches TensorHandle.__add__, __sub__, __mul__, __truediv__. Each operator calls _binary_math on the active TLContext stored in a module-level thread-local _ctx.

So inside a kernel, c = a + b is equivalent to emitting MathCmd(op="add", inputs=(a, b), out=) and returning a new TensorHandle.

Active-TLContext management:

  • TLContext._set_active(ctx): set the active ctx for the current thread/greenlet.
  • TLContext._get_active(): read it (RuntimeError if unset).
  • run_kernel(kernel_fn, tl_ctx, *args, **kwargs): helper. Sets active on entry, runs the kernel, restores None on exit.

KernelRunner re-asserts _set_active(tl) inside its _switch_kernel just before resuming the kernel, so a sibling PE runner that overwrote the thread-local context is correctly recovered.

D5. Scratch allocator — compute output handles

Ops that produce a result — tl.dot, tl.exp, tl.add (via TensorHandle __add__), etc. — call _make_compute_out(shape, dtype) to obtain a 16-byte-aligned scratch address. The address is published with space="tcm", so the handle can later be the source of a tl.send / tl.store.

When _scratch_base == 0 (e.g., command-list mode), the address is 0 and the handle cannot be a send/store source (in that case, only tl.load-returned handles are valid sources).

When the cursor exceeds _scratch_size (default 1 MiB), a RuntimeError is raised. The cursor must reset between kernel invocations (current code naturally satisfies this: KernelRunner creates a fresh TLContext each time).

D6. Dispatch overhead — PeCpuOverheadCmd(dispatch_cycles)

Every non-metadata tl.* call starts with _emit_dispatch_overhead(), which — when dispatch_cycles > 0 — emits PeCpuOverheadCmd(dispatch_cycles). This models the cycles PE_CPU spends dispatching the command.

Defaults:

  • TLContext.__init__'s dispatch_cycles parameter default: 1 cycle.
  • TLContext built by KernelRunner: 0 cycles (greenlet mode handles cycle accounting differently — aligned with ADR-0020 D3 intent).

D7. Kernel registry (triton_emu/registry.py)

A separate _kernels: dict[str, Callable] holds the name → function mapping:

  • register_kernel(name, fn): ValueError on duplicate.
  • get_kernel(name): KeyError if missing.
  • clear_registry(): test-only.

RuntimeContext.launch(kernel_name, kernel_fn, *args) overwrites _kernels[kernel_name] = kernel_fn on every call (last-call-wins, idempotent) — consistent with ADR-0045 D8's launch behavior.

PE_CPU looks up KernelRef.name in the registry and runs the function through KernelRunner.

Alternatives Considered

A1. Fold tl.* into ADR-0014 / ADR-0020

Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption of PeCommands); ADR-0020 covers 2-pass execution (Phase 1 timing / Phase 2 data). The tl.* surface is what the kernel author touches; a dedicated ADR improves findability and onboarding.

A2. Deprecate command-list mode

Rejected (currently). Simple unit tests and kernel verification benefit from the lighter command-list path — it exposes a PeCommand sequence inspector without requiring greenlet machinery. When greenlet-mode semantics (real data, Phase 2) are needed, D2 explicitly selects them.

A3. Remove TensorHandle arithmetic operators

Rejected. They mimic real Triton kernel ergonomics (e.g., c = a + b), and the thread-local active-ctx pattern works cleanly. The explicit function-form (tl.add(a, b)) is also exposed in D3.5, so the operators are syntactic sugar.

A4. Expand softmax into the explicit sequence (max → exp → sum → div)

Partially adopted. tl.softmax is a single MathCmd(op="softmax") for timing accounting (D3.5), but Phase 2 DataExecutor expands it to the canonical sequence for real-data computation. Timing model atomic, data model expanded — the two split intentionally.

Consequences

  • Every tl.* primitive a bench author meets is classified and defined in a single ADR. Paired with ADR-0045 D8's host-side surface (torch.empty etc.), the inside-kernel and outside-kernel authoring guides are now complete.
  • The command-list / greenlet difference is pinned in D2, so any new tl.* primitive that follows the _emit() pattern auto-supports both modes.
  • The thread-local active-ctx pattern (D4) is justified at ADR level, clarifying who owns the reset responsibility when multiple PE runners share a thread (KernelRunner.run's contract restores active inside _switch_kernel).
  • tl.recv_no_consume's diagnostic isolation (D3.10) is hardened in ADR form — accidental production use is blocked by a separate command branch.
  • The registry (D7) gets its own D-section, formalizing the name-collision and dynamic-re-registration semantics.