9a02955770
Documents four cross-cutting surfaces that previously had no ADR backing, each surfaced as a G4 candidate by /report: - 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...), the two execution modes (command-list vs greenlet runner), scratch allocator semantics, dispatch-overhead model, and the kernel registry. - 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group (backend="ahbm") install path. world_size priority (algorithm > defaults > topology), the 4-step init sequence (load ccl.yaml, import algorithm module, derive world_size, install SFR + IPCQ), greenlet- local rank registry, all_reduce dispatch via _defer_wait, barrier no-op rationale, and the explicit list of unsupported dist.* APIs. - 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator free-list semantics. Offset-keyed first-fit with coalescing, the no-validation trust model for free(), HBM/TCM channel separation, page-aligned VA allocation, the page_size dual-default (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and one-allocator-per-sub-unit rule. - 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog. H2D / D2H / PE DMA categories with their exact cube-index choices, the 32 KiB reference size, the 5-point utilization sweep, the formula vs actual column meanings, automatic invariant checks (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine isolation, and the human-readable (not machine-parsable) output contract. Bilingual pair verifier passes for all four EN/KO pairs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
328 lines
14 KiB
Markdown
328 lines
14 KiB
Markdown
# ADR-0046: TLContext — Kernel-side `tl.*` API Contract
|
||
|
||
## Status
|
||
|
||
Accepted (2026-05-22).
|
||
|
||
Documents the set of `tl.*` primitives exposed by
|
||
`src/kernbench/triton_emu/`'s `TLContext`, their semantics, and the two
|
||
execution-mode contracts (command-list / greenlet runner). ADR-0014/0020
|
||
defines the PE pipeline and the 2-pass execution model, but **the `tl.*`
|
||
surface that bench kernel functions call** had no ADR-level coverage.
|
||
|
||
## First action
|
||
|
||
When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id,
|
||
num_cubes, scratch_base, scratch_size)` is instantiated, the first action
|
||
is to initialize six categories of state:
|
||
|
||
- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes` —
|
||
values that `tl.program_id` / `tl.num_programs` will return.
|
||
- `self._dispatch_cycles` — cycle count emitted as `PeCpuOverheadCmd(cycles)`
|
||
at the start of every `tl.*` API call.
|
||
- `self._runner` — `KernelRunner` instance (present → greenlet mode;
|
||
absent → command-list mode).
|
||
- `self._commands: list[PeCommand] = []` — command-list accumulator
|
||
(command-list mode only).
|
||
- `self._handle_counter = 0`, `self._completion_counter = 0` — counters
|
||
for generating TensorHandle / CompletionHandle ids.
|
||
- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0` —
|
||
PE-local scratch region (used for math/dot/composite output handle
|
||
addresses).
|
||
|
||
In short, **TLContext's first act is "record where (sip/cube/pe) and at
|
||
what scale (num_programs/num_cubes) this kernel instance runs, and pick
|
||
its dispatch mode (runner present or not)"**. No SimPy event is created
|
||
and no command is emitted at this moment.
|
||
|
||
The runtime first action happens when the kernel function first calls a
|
||
`tl.<api>()`. The standard entry for every `tl.*` API is:
|
||
|
||
1. Call `self._emit_dispatch_overhead()` — if `dispatch_cycles > 0`,
|
||
immediately `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
|
||
2. Per-API processing (TensorHandle creation, command construction).
|
||
3. `self._emit(cmd)` — in runner mode this `greenlet.switch()`es the cmd
|
||
to SimPy; in command-list mode it appends to `self._commands`.
|
||
|
||
## Context
|
||
|
||
The `tl.*` surface consists of `TLContext`'s methods, and the `tl`
|
||
parameter received by a kernel function is one of these objects. The
|
||
contract the user (bench author) sees:
|
||
|
||
- Which primitives exist.
|
||
- What data flow each primitive triggers (DMA / compute / IPCQ /
|
||
metadata-only).
|
||
- How a TensorHandle's `space` and `addr` are decided.
|
||
- The difference between command-list and greenlet modes.
|
||
|
||
ADR-0014 (PE pipeline) defines the PeCommands consumed by PE_SCHEDULER,
|
||
but how `tl.*` emits them is a code-only convention. ADR-0020 (2-pass
|
||
data execution) mentions greenlet mode in D3 but does not pin down the
|
||
signature difference (return-value handling) between the runner /
|
||
non-runner paths. This ADR fills the gap.
|
||
|
||
## Decision
|
||
|
||
### D1. The `tl` parameter is a `TLContext` instance
|
||
|
||
A bench kernel function has the signature:
|
||
|
||
```python
|
||
def _kernel(arg1, arg2, ..., tl, **kwargs):
|
||
...
|
||
```
|
||
|
||
`tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name
|
||
imitates real Triton's `triton.language` module; the actual Triton
|
||
module is **not** passed in.
|
||
|
||
The kernel is plain Python — no `yield` or `async`. `tl.*` calls produce
|
||
SimPy events, but to the caller they appear synchronous because in
|
||
greenlet mode the KernelRunner relays between SimPy and the kernel
|
||
(ADR-0020 D3).
|
||
|
||
### D2. Two execution modes — command-list / greenlet runner
|
||
|
||
- **Command-list mode (`runner is None`)**: `tl.*` calls append PeCommand
|
||
to `self._commands`. DMA / GEMM / Math consume no SimPy time and return
|
||
metadata-only TensorHandles (`data=None`). PE_SCHEDULER / sim_engine
|
||
later replays the command sequence in time.
|
||
|
||
- **Greenlet runner mode (`runner is not None`)**: `tl.*` calls
|
||
`self._emit(cmd)` → `runner.switch_to_simpy(cmd)`, handing control to
|
||
the parent greenlet (SimPy). The parent distributes the cmd to
|
||
components, consumes SimPy time, and (for DMA reads) returns real numpy
|
||
data. The kernel receives the result and continues to the next line
|
||
(the data-aware execution model from ADR-0020 D3).
|
||
|
||
The choice of mode is decided by whether a KernelRunner is injected into
|
||
the TLContext. The `tl.*` methods themselves are mode-blind — they go
|
||
through `_emit()` uniformly.
|
||
|
||
### D3. Primitive categories
|
||
|
||
#### D3.1. Reference (no DMA, metadata only)
|
||
|
||
- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: create a handle
|
||
referencing HBM data without issuing DMA. Used when the scheduler
|
||
streams the data per-tile (e.g., the b operand of a composite GEMM).
|
||
|
||
#### D3.2. Data movement (blocking, DMA engine)
|
||
|
||
- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → handle.
|
||
Emits `DmaReadCmd`. In greenlet mode the returned handle's `.data`
|
||
carries real numpy data; in command-list mode it is a placeholder.
|
||
The handle has `space="hbm"`, `pinned=True`.
|
||
- `tl.store(ptr, handle) -> None`: TCM → HBM. Emits `DmaWriteCmd`. In
|
||
greenlet mode, when `handle.data` is present, `_store.write("hbm",
|
||
ptr, data)` runs first (visibility = issue time, ADR-0020 D3).
|
||
|
||
#### D3.3. GEMM / compute (blocking)
|
||
|
||
- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must live in
|
||
TCM; shapes `(M,K) × (K,N) → (M,N)`. Emits `GemmCmd`; the output
|
||
handle is allocated from PE-local scratch via
|
||
`_make_compute_out(shape, dtype)`.
|
||
- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
|
||
acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking
|
||
tiled pipeline. Emits `CompositeCmd`. `epilogue` is a list of dicts,
|
||
each with `"op"` plus op-specific fields and an optional `"scope"`
|
||
(k_tile / output_tile). Unknown ops or missing fields raise
|
||
ValueError immediately. The returned CompletionHandle synchronizes
|
||
via `tl.wait(h)`.
|
||
|
||
#### D3.4. Math: unary (blocking)
|
||
|
||
- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
|
||
`tl.cos(x)`, `tl.sin(x)` — each emits `MathCmd(op=<name>,
|
||
inputs=(x,), out=)`. `out` is scratch-allocated with the same
|
||
shape/dtype as `x`.
|
||
|
||
#### D3.5. Math: binary (blocking)
|
||
|
||
- `tl.maximum(a, b)`, `tl.minimum(a, b)` — `_binary_math`.
|
||
- `tl.fma(a, b, c)` — `a*b + c`. Three inputs.
|
||
- `tl.clamp(x, min, max)` — `MathCmd(op="clamp", inputs=(x, min, max))`.
|
||
- `tl.where(cond, a, b)` — `MathCmd(op="where", inputs=(cond, a, b))`.
|
||
- `tl.softmax(x, axis=-1)` — a single `MathCmd(op="softmax")` so timing
|
||
accounts at one dispatch. Phase 2 DataExecutor expands it to the
|
||
canonical (x-max → exp → sum → div) sequence.
|
||
|
||
#### D3.6. Reduction (blocking)
|
||
|
||
- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)` — return an
|
||
output handle with the axis size collapsed to 1. Emit
|
||
`MathCmd(op=<name>, inputs=(x,), out=, axis=axis)`.
|
||
|
||
#### D3.7. Index / scalar (PE_CPU, no engine)
|
||
|
||
- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE
|
||
index), `axis==1` → cube_id (ADR-0022).
|
||
- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per
|
||
cube), `axis==1` → num_cubes.
|
||
- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range
|
||
in TCM. No command emitted.
|
||
- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape,
|
||
value, dtype="f16") -> TensorHandle`: TCM placeholder. No command
|
||
emitted.
|
||
|
||
#### D3.8. Scalar helpers (no command, no engine)
|
||
|
||
- `TLContext.cdiv(a, b) -> int` (static): ceiling division
|
||
`-(-a // b)`. Mirrors real Triton's `tl.cdiv`.
|
||
|
||
#### D3.9. Metadata-only (no compute, no DMA)
|
||
|
||
- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims
|
||
swapped. Shares `addr` and `data`; no command emitted.
|
||
|
||
#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
|
||
|
||
- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
|
||
dtype="f16", space="tcm") -> None`: blocking send. Accepts either
|
||
handle form or raw-address form. Emits `IpcqSendCmd`. The handle's
|
||
`.data` snapshot rides along on the command — avoiding the race
|
||
where a later inbound IPCQ overwrites the slot before the outbound
|
||
PE_DMA reads it.
|
||
- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
|
||
dst_space=None) -> TensorHandle`: blocking recv. Providing both
|
||
`dst_addr` and `dst_space` enters "copy_to_dst" mode; otherwise
|
||
"return_slot" mode. In greenlet mode the handle's `.data` carries
|
||
the real data.
|
||
- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
|
||
**DIAGNOSTIC ONLY**. Has the same blocking-arrival semantics as
|
||
`tl.recv` but skips the slot-read latency charge (slot-IO + PE↔bank
|
||
fabric drain). Used in the pe2pe overview plot for an apples-to-apples
|
||
comparison against `tl.store`. Production kernels MUST NOT use it —
|
||
the diagnostic flag is isolated in its own command branch
|
||
(`consume=False`) so it cannot be accidentally enabled.
|
||
- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
|
||
recv. Returns a `RecvFuture`; resolved later by `tl.wait(future)`.
|
||
|
||
#### D3.11. Composite + control
|
||
|
||
- `tl.composite(...)`: see D3.3.
|
||
- `tl.wait(handle=None)`: wait on a `CompletionHandle` (composite), a
|
||
`RecvFuture` (async recv), or `None` (all pending composites).
|
||
- `tl.cycles(n)`: declare a scalar PE_CPU overhead. Emits
|
||
`PeCpuOverheadCmd(cycles=n)`.
|
||
|
||
### D4. TensorHandle arithmetic operators — thread-local TLContext
|
||
|
||
At module load, `tl_context.py::_enable_tensor_ops()` runs and patches
|
||
`TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__`. Each
|
||
operator calls `_binary_math` on the active TLContext stored in a
|
||
module-level thread-local `_ctx`.
|
||
|
||
So inside a kernel, `c = a + b` is equivalent to emitting
|
||
`MathCmd(op="add", inputs=(a, b), out=)` and returning a new
|
||
TensorHandle.
|
||
|
||
Active-TLContext management:
|
||
|
||
- `TLContext._set_active(ctx)`: set the active ctx for the current
|
||
thread/greenlet.
|
||
- `TLContext._get_active()`: read it (RuntimeError if unset).
|
||
- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets active
|
||
on entry, runs the kernel, restores `None` on exit.
|
||
|
||
`KernelRunner` re-asserts `_set_active(tl)` inside its `_switch_kernel`
|
||
just before resuming the kernel, so a sibling PE runner that overwrote
|
||
the thread-local context is correctly recovered.
|
||
|
||
### D5. Scratch allocator — compute output handles
|
||
|
||
Ops that produce a result — `tl.dot`, `tl.exp`, `tl.add` (via
|
||
TensorHandle `__add__`), etc. — call `_make_compute_out(shape, dtype)`
|
||
to obtain a 16-byte-aligned scratch address. The address is published
|
||
with `space="tcm"`, so the handle can later be the source of a
|
||
`tl.send` / `tl.store`.
|
||
|
||
When `_scratch_base == 0` (e.g., command-list mode), the address is 0
|
||
and the handle cannot be a send/store source (in that case, only
|
||
`tl.load`-returned handles are valid sources).
|
||
|
||
When the cursor exceeds `_scratch_size` (default 1 MiB), a
|
||
RuntimeError is raised. The cursor must reset between kernel
|
||
invocations (current code naturally satisfies this: KernelRunner
|
||
creates a fresh TLContext each time).
|
||
|
||
### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`
|
||
|
||
Every non-metadata `tl.*` call starts with `_emit_dispatch_overhead()`,
|
||
which — when `dispatch_cycles > 0` — emits
|
||
`PeCpuOverheadCmd(dispatch_cycles)`. This models the cycles PE_CPU
|
||
spends dispatching the command.
|
||
|
||
Defaults:
|
||
|
||
- `TLContext.__init__`'s `dispatch_cycles` parameter default: `1` cycle.
|
||
- TLContext built by `KernelRunner`: `0` cycles (greenlet mode handles
|
||
cycle accounting differently — aligned with ADR-0020 D3 intent).
|
||
|
||
### D7. Kernel registry (`triton_emu/registry.py`)
|
||
|
||
A separate `_kernels: dict[str, Callable]` holds the name → function
|
||
mapping:
|
||
|
||
- `register_kernel(name, fn)`: ValueError on duplicate.
|
||
- `get_kernel(name)`: KeyError if missing.
|
||
- `clear_registry()`: test-only.
|
||
|
||
`RuntimeContext.launch(kernel_name, kernel_fn, *args)` overwrites
|
||
`_kernels[kernel_name] = kernel_fn` on every call (last-call-wins,
|
||
idempotent) — consistent with ADR-0045 D8's `launch` behavior.
|
||
|
||
PE_CPU looks up `KernelRef.name` in the registry and runs the function
|
||
through KernelRunner.
|
||
|
||
## Alternatives Considered
|
||
|
||
### A1. Fold `tl.*` into ADR-0014 / ADR-0020
|
||
|
||
Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption
|
||
of PeCommands); ADR-0020 covers 2-pass execution (Phase 1 timing /
|
||
Phase 2 data). The `tl.*` surface is what the kernel author touches; a
|
||
dedicated ADR improves findability and onboarding.
|
||
|
||
### A2. Deprecate command-list mode
|
||
|
||
Rejected (currently). Simple unit tests and kernel verification benefit
|
||
from the lighter command-list path — it exposes a PeCommand sequence
|
||
inspector without requiring greenlet machinery. When greenlet-mode
|
||
semantics (real data, Phase 2) are needed, D2 explicitly selects them.
|
||
|
||
### A3. Remove TensorHandle arithmetic operators
|
||
|
||
Rejected. They mimic real Triton kernel ergonomics (e.g., `c = a + b`),
|
||
and the thread-local active-ctx pattern works cleanly. The explicit
|
||
function-form (`tl.add(a, b)`) is also exposed in D3.5, so the
|
||
operators are syntactic sugar.
|
||
|
||
### A4. Expand softmax into the explicit sequence (max → exp → sum → div)
|
||
|
||
Partially adopted. `tl.softmax` is a single `MathCmd(op="softmax")` for
|
||
timing accounting (D3.5), but Phase 2 DataExecutor expands it to the
|
||
canonical sequence for real-data computation. Timing model atomic,
|
||
data model expanded — the two split intentionally.
|
||
|
||
## Consequences
|
||
|
||
- Every `tl.*` primitive a bench author meets is classified and defined
|
||
in a single ADR. Paired with ADR-0045 D8's host-side surface
|
||
(`torch.empty` etc.), the inside-kernel and outside-kernel authoring
|
||
guides are now complete.
|
||
- The command-list / greenlet difference is pinned in D2, so any new
|
||
`tl.*` primitive that follows the `_emit()` pattern auto-supports
|
||
both modes.
|
||
- The thread-local active-ctx pattern (D4) is justified at ADR level,
|
||
clarifying who owns the reset responsibility when multiple PE
|
||
runners share a thread (KernelRunner.run's contract restores active
|
||
inside `_switch_kernel`).
|
||
- `tl.recv_no_consume`'s diagnostic isolation (D3.10) is hardened in
|
||
ADR form — accidental production use is blocked by a separate
|
||
command branch.
|
||
- The registry (D7) gets its own D-section, formalizing the
|
||
name-collision and dynamic-re-registration semantics.
|