Files
kernbench2/docs/adr/ADR-0046-prog-tl-context-contract.md
T
ywkang 9a02955770 adr: add ADR-0046-0049 — close G4 coverage gaps from /report
Documents four cross-cutting surfaces that previously had no ADR backing,
each surfaced as a G4 candidate by /report:

- 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates
  all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...),
  the two execution modes (command-list vs greenlet runner), scratch
  allocator semantics, dispatch-overhead model, and the kernel registry.

- 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group
  (backend="ahbm") install path. world_size priority (algorithm >
  defaults > topology), the 4-step init sequence (load ccl.yaml, import
  algorithm module, derive world_size, install SFR + IPCQ), greenlet-
  local rank registry, all_reduce dispatch via _defer_wait, barrier
  no-op rationale, and the explicit list of unsupported dist.* APIs.

- 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator
  free-list semantics. Offset-keyed first-fit with coalescing, the
  no-validation trust model for free(), HBM/TCM channel separation,
  page-aligned VA allocation, the page_size dual-default
  (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and
  one-allocator-per-sub-unit rule.

- 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog.
  H2D / D2H / PE DMA categories with their exact cube-index choices,
  the 32 KiB reference size, the 5-point utilization sweep, the
  formula vs actual column meanings, automatic invariant checks
  (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine
  isolation, and the human-readable (not machine-parsable) output
  contract.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:25:04 -07:00

328 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ADR-0046: TLContext — Kernel-side `tl.*` API Contract
## Status
Accepted (2026-05-22).
Documents the set of `tl.*` primitives exposed by
`src/kernbench/triton_emu/`'s `TLContext`, their semantics, and the two
execution-mode contracts (command-list / greenlet runner). ADR-0014/0020
defines the PE pipeline and the 2-pass execution model, but **the `tl.*`
surface that bench kernel functions call** had no ADR-level coverage.
## First action
When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id,
num_cubes, scratch_base, scratch_size)` is instantiated, the first action
is to initialize six categories of state:
- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes`
values that `tl.program_id` / `tl.num_programs` will return.
- `self._dispatch_cycles` — cycle count emitted as `PeCpuOverheadCmd(cycles)`
at the start of every `tl.*` API call.
- `self._runner``KernelRunner` instance (present → greenlet mode;
absent → command-list mode).
- `self._commands: list[PeCommand] = []` — command-list accumulator
(command-list mode only).
- `self._handle_counter = 0`, `self._completion_counter = 0` — counters
for generating TensorHandle / CompletionHandle ids.
- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0`
PE-local scratch region (used for math/dot/composite output handle
addresses).
In short, **TLContext's first act is "record where (sip/cube/pe) and at
what scale (num_programs/num_cubes) this kernel instance runs, and pick
its dispatch mode (runner present or not)"**. No SimPy event is created
and no command is emitted at this moment.
The runtime first action happens when the kernel function first calls a
`tl.<api>()`. The standard entry for every `tl.*` API is:
1. Call `self._emit_dispatch_overhead()` — if `dispatch_cycles > 0`,
immediately `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
2. Per-API processing (TensorHandle creation, command construction).
3. `self._emit(cmd)` — in runner mode this `greenlet.switch()`es the cmd
to SimPy; in command-list mode it appends to `self._commands`.
## Context
The `tl.*` surface consists of `TLContext`'s methods, and the `tl`
parameter received by a kernel function is one of these objects. The
contract the user (bench author) sees:
- Which primitives exist.
- What data flow each primitive triggers (DMA / compute / IPCQ /
metadata-only).
- How a TensorHandle's `space` and `addr` are decided.
- The difference between command-list and greenlet modes.
ADR-0014 (PE pipeline) defines the PeCommands consumed by PE_SCHEDULER,
but how `tl.*` emits them is a code-only convention. ADR-0020 (2-pass
data execution) mentions greenlet mode in D3 but does not pin down the
signature difference (return-value handling) between the runner /
non-runner paths. This ADR fills the gap.
## Decision
### D1. The `tl` parameter is a `TLContext` instance
A bench kernel function has the signature:
```python
def _kernel(arg1, arg2, ..., tl, **kwargs):
...
```
`tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name
imitates real Triton's `triton.language` module; the actual Triton
module is **not** passed in.
The kernel is plain Python — no `yield` or `async`. `tl.*` calls produce
SimPy events, but to the caller they appear synchronous because in
greenlet mode the KernelRunner relays between SimPy and the kernel
(ADR-0020 D3).
### D2. Two execution modes — command-list / greenlet runner
- **Command-list mode (`runner is None`)**: `tl.*` calls append PeCommand
to `self._commands`. DMA / GEMM / Math consume no SimPy time and return
metadata-only TensorHandles (`data=None`). PE_SCHEDULER / sim_engine
later replays the command sequence in time.
- **Greenlet runner mode (`runner is not None`)**: `tl.*` calls
`self._emit(cmd)``runner.switch_to_simpy(cmd)`, handing control to
the parent greenlet (SimPy). The parent distributes the cmd to
components, consumes SimPy time, and (for DMA reads) returns real numpy
data. The kernel receives the result and continues to the next line
(the data-aware execution model from ADR-0020 D3).
The choice of mode is decided by whether a KernelRunner is injected into
the TLContext. The `tl.*` methods themselves are mode-blind — they go
through `_emit()` uniformly.
### D3. Primitive categories
#### D3.1. Reference (no DMA, metadata only)
- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: create a handle
referencing HBM data without issuing DMA. Used when the scheduler
streams the data per-tile (e.g., the b operand of a composite GEMM).
#### D3.2. Data movement (blocking, DMA engine)
- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → handle.
Emits `DmaReadCmd`. In greenlet mode the returned handle's `.data`
carries real numpy data; in command-list mode it is a placeholder.
The handle has `space="hbm"`, `pinned=True`.
- `tl.store(ptr, handle) -> None`: TCM → HBM. Emits `DmaWriteCmd`. In
greenlet mode, when `handle.data` is present, `_store.write("hbm",
ptr, data)` runs first (visibility = issue time, ADR-0020 D3).
#### D3.3. GEMM / compute (blocking)
- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must live in
TCM; shapes `(M,K) × (K,N) → (M,N)`. Emits `GemmCmd`; the output
handle is allocated from PE-local scratch via
`_make_compute_out(shape, dtype)`.
- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking
tiled pipeline. Emits `CompositeCmd`. `epilogue` is a list of dicts,
each with `"op"` plus op-specific fields and an optional `"scope"`
(k_tile / output_tile). Unknown ops or missing fields raise
ValueError immediately. The returned CompletionHandle synchronizes
via `tl.wait(h)`.
#### D3.4. Math: unary (blocking)
- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
`tl.cos(x)`, `tl.sin(x)` — each emits `MathCmd(op=<name>,
inputs=(x,), out=)`. `out` is scratch-allocated with the same
shape/dtype as `x`.
#### D3.5. Math: binary (blocking)
- `tl.maximum(a, b)`, `tl.minimum(a, b)` — `_binary_math`.
- `tl.fma(a, b, c)` — `a*b + c`. Three inputs.
- `tl.clamp(x, min, max)` — `MathCmd(op="clamp", inputs=(x, min, max))`.
- `tl.where(cond, a, b)` — `MathCmd(op="where", inputs=(cond, a, b))`.
- `tl.softmax(x, axis=-1)` — a single `MathCmd(op="softmax")` so timing
accounts at one dispatch. Phase 2 DataExecutor expands it to the
canonical (x-max → exp → sum → div) sequence.
#### D3.6. Reduction (blocking)
- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)` — return an
output handle with the axis size collapsed to 1. Emit
`MathCmd(op=<name>, inputs=(x,), out=, axis=axis)`.
#### D3.7. Index / scalar (PE_CPU, no engine)
- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE
index), `axis==1` → cube_id (ADR-0022).
- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per
cube), `axis==1` → num_cubes.
- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range
in TCM. No command emitted.
- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape,
value, dtype="f16") -> TensorHandle`: TCM placeholder. No command
emitted.
#### D3.8. Scalar helpers (no command, no engine)
- `TLContext.cdiv(a, b) -> int` (static): ceiling division
`-(-a // b)`. Mirrors real Triton's `tl.cdiv`.
#### D3.9. Metadata-only (no compute, no DMA)
- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims
swapped. Shares `addr` and `data`; no command emitted.
#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
dtype="f16", space="tcm") -> None`: blocking send. Accepts either
handle form or raw-address form. Emits `IpcqSendCmd`. The handle's
`.data` snapshot rides along on the command — avoiding the race
where a later inbound IPCQ overwrites the slot before the outbound
PE_DMA reads it.
- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
dst_space=None) -> TensorHandle`: blocking recv. Providing both
`dst_addr` and `dst_space` enters "copy_to_dst" mode; otherwise
"return_slot" mode. In greenlet mode the handle's `.data` carries
the real data.
- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
**DIAGNOSTIC ONLY**. Has the same blocking-arrival semantics as
`tl.recv` but skips the slot-read latency charge (slot-IO + PE↔bank
fabric drain). Used in the pe2pe overview plot for an apples-to-apples
comparison against `tl.store`. Production kernels MUST NOT use it —
the diagnostic flag is isolated in its own command branch
(`consume=False`) so it cannot be accidentally enabled.
- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
recv. Returns a `RecvFuture`; resolved later by `tl.wait(future)`.
#### D3.11. Composite + control
- `tl.composite(...)`: see D3.3.
- `tl.wait(handle=None)`: wait on a `CompletionHandle` (composite), a
`RecvFuture` (async recv), or `None` (all pending composites).
- `tl.cycles(n)`: declare a scalar PE_CPU overhead. Emits
`PeCpuOverheadCmd(cycles=n)`.
### D4. TensorHandle arithmetic operators — thread-local TLContext
At module load, `tl_context.py::_enable_tensor_ops()` runs and patches
`TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__`. Each
operator calls `_binary_math` on the active TLContext stored in a
module-level thread-local `_ctx`.
So inside a kernel, `c = a + b` is equivalent to emitting
`MathCmd(op="add", inputs=(a, b), out=)` and returning a new
TensorHandle.
Active-TLContext management:
- `TLContext._set_active(ctx)`: set the active ctx for the current
thread/greenlet.
- `TLContext._get_active()`: read it (RuntimeError if unset).
- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets active
on entry, runs the kernel, restores `None` on exit.
`KernelRunner` re-asserts `_set_active(tl)` inside its `_switch_kernel`
just before resuming the kernel, so a sibling PE runner that overwrote
the thread-local context is correctly recovered.
### D5. Scratch allocator — compute output handles
Ops that produce a result — `tl.dot`, `tl.exp`, `tl.add` (via
TensorHandle `__add__`), etc. — call `_make_compute_out(shape, dtype)`
to obtain a 16-byte-aligned scratch address. The address is published
with `space="tcm"`, so the handle can later be the source of a
`tl.send` / `tl.store`.
When `_scratch_base == 0` (e.g., command-list mode), the address is 0
and the handle cannot be a send/store source (in that case, only
`tl.load`-returned handles are valid sources).
When the cursor exceeds `_scratch_size` (default 1 MiB), a
RuntimeError is raised. The cursor must reset between kernel
invocations (current code naturally satisfies this: KernelRunner
creates a fresh TLContext each time).
### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`
Every non-metadata `tl.*` call starts with `_emit_dispatch_overhead()`,
which — when `dispatch_cycles > 0` — emits
`PeCpuOverheadCmd(dispatch_cycles)`. This models the cycles PE_CPU
spends dispatching the command.
Defaults:
- `TLContext.__init__`'s `dispatch_cycles` parameter default: `1` cycle.
- TLContext built by `KernelRunner`: `0` cycles (greenlet mode handles
cycle accounting differently — aligned with ADR-0020 D3 intent).
### D7. Kernel registry (`triton_emu/registry.py`)
A separate `_kernels: dict[str, Callable]` holds the name → function
mapping:
- `register_kernel(name, fn)`: ValueError on duplicate.
- `get_kernel(name)`: KeyError if missing.
- `clear_registry()`: test-only.
`RuntimeContext.launch(kernel_name, kernel_fn, *args)` overwrites
`_kernels[kernel_name] = kernel_fn` on every call (last-call-wins,
idempotent) — consistent with ADR-0045 D8's `launch` behavior.
PE_CPU looks up `KernelRef.name` in the registry and runs the function
through KernelRunner.
## Alternatives Considered
### A1. Fold `tl.*` into ADR-0014 / ADR-0020
Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption
of PeCommands); ADR-0020 covers 2-pass execution (Phase 1 timing /
Phase 2 data). The `tl.*` surface is what the kernel author touches; a
dedicated ADR improves findability and onboarding.
### A2. Deprecate command-list mode
Rejected (currently). Simple unit tests and kernel verification benefit
from the lighter command-list path — it exposes a PeCommand sequence
inspector without requiring greenlet machinery. When greenlet-mode
semantics (real data, Phase 2) are needed, D2 explicitly selects them.
### A3. Remove TensorHandle arithmetic operators
Rejected. They mimic real Triton kernel ergonomics (e.g., `c = a + b`),
and the thread-local active-ctx pattern works cleanly. The explicit
function-form (`tl.add(a, b)`) is also exposed in D3.5, so the
operators are syntactic sugar.
### A4. Expand softmax into the explicit sequence (max → exp → sum → div)
Partially adopted. `tl.softmax` is a single `MathCmd(op="softmax")` for
timing accounting (D3.5), but Phase 2 DataExecutor expands it to the
canonical sequence for real-data computation. Timing model atomic,
data model expanded — the two split intentionally.
## Consequences
- Every `tl.*` primitive a bench author meets is classified and defined
in a single ADR. Paired with ADR-0045 D8's host-side surface
(`torch.empty` etc.), the inside-kernel and outside-kernel authoring
guides are now complete.
- The command-list / greenlet difference is pinned in D2, so any new
`tl.*` primitive that follows the `_emit()` pattern auto-supports
both modes.
- The thread-local active-ctx pattern (D4) is justified at ADR level,
clarifying who owns the reset responsibility when multiple PE
runners share a thread (KernelRunner.run's contract restores active
inside `_switch_kernel`).
- `tl.recv_no_consume`'s diagnostic isolation (D3.10) is hardened in
ADR form — accidental production use is blocked by a separate
command branch.
- The registry (D7) gets its own D-section, formalizing the
name-collision and dynamic-re-registration semantics.