adr: add ADR-0046-0049 — close G4 coverage gaps from /report
Documents four cross-cutting surfaces that previously had no ADR backing, each surfaced as a G4 candidate by /report: - 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...), the two execution modes (command-list vs greenlet runner), scratch allocator semantics, dispatch-overhead model, and the kernel registry. - 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group (backend="ahbm") install path. world_size priority (algorithm > defaults > topology), the 4-step init sequence (load ccl.yaml, import algorithm module, derive world_size, install SFR + IPCQ), greenlet- local rank registry, all_reduce dispatch via _defer_wait, barrier no-op rationale, and the explicit list of unsupported dist.* APIs. - 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator free-list semantics. Offset-keyed first-fit with coalescing, the no-validation trust model for free(), HBM/TCM channel separation, page-aligned VA allocation, the page_size dual-default (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and one-allocator-per-sub-unit rule. - 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog. H2D / D2H / PE DMA categories with their exact cube-index choices, the 32 KiB reference size, the 5-point utilization sweep, the formula vs actual column meanings, automatic invariant checks (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine isolation, and the human-readable (not machine-parsable) output contract. Bilingual pair verifier passes for all four EN/KO pairs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,327 @@
|
||||
# ADR-0046: TLContext — Kernel-side `tl.*` API Contract
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
Documents the set of `tl.*` primitives exposed by
|
||||
`src/kernbench/triton_emu/`'s `TLContext`, their semantics, and the two
|
||||
execution-mode contracts (command-list / greenlet runner). ADR-0014/0020
|
||||
defines the PE pipeline and the 2-pass execution model, but **the `tl.*`
|
||||
surface that bench kernel functions call** had no ADR-level coverage.
|
||||
|
||||
## First action
|
||||
|
||||
When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id,
|
||||
num_cubes, scratch_base, scratch_size)` is instantiated, the first action
|
||||
is to initialize six categories of state:
|
||||
|
||||
- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes` —
|
||||
values that `tl.program_id` / `tl.num_programs` will return.
|
||||
- `self._dispatch_cycles` — cycle count emitted as `PeCpuOverheadCmd(cycles)`
|
||||
at the start of every `tl.*` API call.
|
||||
- `self._runner` — `KernelRunner` instance (present → greenlet mode;
|
||||
absent → command-list mode).
|
||||
- `self._commands: list[PeCommand] = []` — command-list accumulator
|
||||
(command-list mode only).
|
||||
- `self._handle_counter = 0`, `self._completion_counter = 0` — counters
|
||||
for generating TensorHandle / CompletionHandle ids.
|
||||
- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0` —
|
||||
PE-local scratch region (used for math/dot/composite output handle
|
||||
addresses).
|
||||
|
||||
In short, **TLContext's first act is "record where (sip/cube/pe) and at
|
||||
what scale (num_programs/num_cubes) this kernel instance runs, and pick
|
||||
its dispatch mode (runner present or not)"**. No SimPy event is created
|
||||
and no command is emitted at this moment.
|
||||
|
||||
The runtime first action happens when the kernel function first calls a
|
||||
`tl.<api>()`. The standard entry for every `tl.*` API is:
|
||||
|
||||
1. Call `self._emit_dispatch_overhead()` — if `dispatch_cycles > 0`,
|
||||
immediately `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
|
||||
2. Per-API processing (TensorHandle creation, command construction).
|
||||
3. `self._emit(cmd)` — in runner mode this `greenlet.switch()`es the cmd
|
||||
to SimPy; in command-list mode it appends to `self._commands`.
|
||||
|
||||
## Context
|
||||
|
||||
The `tl.*` surface consists of `TLContext`'s methods, and the `tl`
|
||||
parameter received by a kernel function is one of these objects. The
|
||||
contract the user (bench author) sees:
|
||||
|
||||
- Which primitives exist.
|
||||
- What data flow each primitive triggers (DMA / compute / IPCQ /
|
||||
metadata-only).
|
||||
- How a TensorHandle's `space` and `addr` are decided.
|
||||
- The difference between command-list and greenlet modes.
|
||||
|
||||
ADR-0014 (PE pipeline) defines the PeCommands consumed by PE_SCHEDULER,
|
||||
but how `tl.*` emits them is a code-only convention. ADR-0020 (2-pass
|
||||
data execution) mentions greenlet mode in D3 but does not pin down the
|
||||
signature difference (return-value handling) between the runner /
|
||||
non-runner paths. This ADR fills the gap.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. The `tl` parameter is a `TLContext` instance
|
||||
|
||||
A bench kernel function has the signature:
|
||||
|
||||
```python
|
||||
def _kernel(arg1, arg2, ..., tl, **kwargs):
|
||||
...
|
||||
```
|
||||
|
||||
`tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name
|
||||
imitates real Triton's `triton.language` module; the actual Triton
|
||||
module is **not** passed in.
|
||||
|
||||
The kernel is plain Python — no `yield` or `async`. `tl.*` calls produce
|
||||
SimPy events, but to the caller they appear synchronous because in
|
||||
greenlet mode the KernelRunner relays between SimPy and the kernel
|
||||
(ADR-0020 D3).
|
||||
|
||||
### D2. Two execution modes — command-list / greenlet runner
|
||||
|
||||
- **Command-list mode (`runner is None`)**: `tl.*` calls append PeCommand
|
||||
to `self._commands`. DMA / GEMM / Math consume no SimPy time and return
|
||||
metadata-only TensorHandles (`data=None`). PE_SCHEDULER / sim_engine
|
||||
later replays the command sequence in time.
|
||||
|
||||
- **Greenlet runner mode (`runner is not None`)**: `tl.*` calls
|
||||
`self._emit(cmd)` → `runner.switch_to_simpy(cmd)`, handing control to
|
||||
the parent greenlet (SimPy). The parent distributes the cmd to
|
||||
components, consumes SimPy time, and (for DMA reads) returns real numpy
|
||||
data. The kernel receives the result and continues to the next line
|
||||
(the data-aware execution model from ADR-0020 D3).
|
||||
|
||||
The choice of mode is decided by whether a KernelRunner is injected into
|
||||
the TLContext. The `tl.*` methods themselves are mode-blind — they go
|
||||
through `_emit()` uniformly.
|
||||
|
||||
### D3. Primitive categories
|
||||
|
||||
#### D3.1. Reference (no DMA, metadata only)
|
||||
|
||||
- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: create a handle
|
||||
referencing HBM data without issuing DMA. Used when the scheduler
|
||||
streams the data per-tile (e.g., the b operand of a composite GEMM).
|
||||
|
||||
#### D3.2. Data movement (blocking, DMA engine)
|
||||
|
||||
- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → handle.
|
||||
Emits `DmaReadCmd`. In greenlet mode the returned handle's `.data`
|
||||
carries real numpy data; in command-list mode it is a placeholder.
|
||||
The handle has `space="hbm"`, `pinned=True`.
|
||||
- `tl.store(ptr, handle) -> None`: TCM → HBM. Emits `DmaWriteCmd`. In
|
||||
greenlet mode, when `handle.data` is present, `_store.write("hbm",
|
||||
ptr, data)` runs first (visibility = issue time, ADR-0020 D3).
|
||||
|
||||
#### D3.3. GEMM / compute (blocking)
|
||||
|
||||
- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must live in
|
||||
TCM; shapes `(M,K) × (K,N) → (M,N)`. Emits `GemmCmd`; the output
|
||||
handle is allocated from PE-local scratch via
|
||||
`_make_compute_out(shape, dtype)`.
|
||||
- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
|
||||
acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking
|
||||
tiled pipeline. Emits `CompositeCmd`. `epilogue` is a list of dicts,
|
||||
each with `"op"` plus op-specific fields and an optional `"scope"`
|
||||
(k_tile / output_tile). Unknown ops or missing fields raise
|
||||
ValueError immediately. The returned CompletionHandle synchronizes
|
||||
via `tl.wait(h)`.
|
||||
|
||||
#### D3.4. Math: unary (blocking)
|
||||
|
||||
- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
|
||||
`tl.cos(x)`, `tl.sin(x)` — each emits `MathCmd(op=<name>,
|
||||
inputs=(x,), out=)`. `out` is scratch-allocated with the same
|
||||
shape/dtype as `x`.
|
||||
|
||||
#### D3.5. Math: binary (blocking)
|
||||
|
||||
- `tl.maximum(a, b)`, `tl.minimum(a, b)` — `_binary_math`.
|
||||
- `tl.fma(a, b, c)` — `a*b + c`. Three inputs.
|
||||
- `tl.clamp(x, min, max)` — `MathCmd(op="clamp", inputs=(x, min, max))`.
|
||||
- `tl.where(cond, a, b)` — `MathCmd(op="where", inputs=(cond, a, b))`.
|
||||
- `tl.softmax(x, axis=-1)` — a single `MathCmd(op="softmax")` so timing
|
||||
accounts at one dispatch. Phase 2 DataExecutor expands it to the
|
||||
canonical (x-max → exp → sum → div) sequence.
|
||||
|
||||
#### D3.6. Reduction (blocking)
|
||||
|
||||
- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)` — return an
|
||||
output handle with the axis size collapsed to 1. Emit
|
||||
`MathCmd(op=<name>, inputs=(x,), out=, axis=axis)`.
|
||||
|
||||
#### D3.7. Index / scalar (PE_CPU, no engine)
|
||||
|
||||
- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE
|
||||
index), `axis==1` → cube_id (ADR-0022).
|
||||
- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per
|
||||
cube), `axis==1` → num_cubes.
|
||||
- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range
|
||||
in TCM. No command emitted.
|
||||
- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape,
|
||||
value, dtype="f16") -> TensorHandle`: TCM placeholder. No command
|
||||
emitted.
|
||||
|
||||
#### D3.8. Scalar helpers (no command, no engine)
|
||||
|
||||
- `TLContext.cdiv(a, b) -> int` (static): ceiling division
|
||||
`-(-a // b)`. Mirrors real Triton's `tl.cdiv`.
|
||||
|
||||
#### D3.9. Metadata-only (no compute, no DMA)
|
||||
|
||||
- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims
|
||||
swapped. Shares `addr` and `data`; no command emitted.
|
||||
|
||||
#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
|
||||
|
||||
- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
|
||||
dtype="f16", space="tcm") -> None`: blocking send. Accepts either
|
||||
handle form or raw-address form. Emits `IpcqSendCmd`. The handle's
|
||||
`.data` snapshot rides along on the command — avoiding the race
|
||||
where a later inbound IPCQ overwrites the slot before the outbound
|
||||
PE_DMA reads it.
|
||||
- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
|
||||
dst_space=None) -> TensorHandle`: blocking recv. Providing both
|
||||
`dst_addr` and `dst_space` enters "copy_to_dst" mode; otherwise
|
||||
"return_slot" mode. In greenlet mode the handle's `.data` carries
|
||||
the real data.
|
||||
- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
|
||||
**DIAGNOSTIC ONLY**. Has the same blocking-arrival semantics as
|
||||
`tl.recv` but skips the slot-read latency charge (slot-IO + PE↔bank
|
||||
fabric drain). Used in the pe2pe overview plot for an apples-to-apples
|
||||
comparison against `tl.store`. Production kernels MUST NOT use it —
|
||||
the diagnostic flag is isolated in its own command branch
|
||||
(`consume=False`) so it cannot be accidentally enabled.
|
||||
- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
|
||||
recv. Returns a `RecvFuture`; resolved later by `tl.wait(future)`.
|
||||
|
||||
#### D3.11. Composite + control
|
||||
|
||||
- `tl.composite(...)`: see D3.3.
|
||||
- `tl.wait(handle=None)`: wait on a `CompletionHandle` (composite), a
|
||||
`RecvFuture` (async recv), or `None` (all pending composites).
|
||||
- `tl.cycles(n)`: declare a scalar PE_CPU overhead. Emits
|
||||
`PeCpuOverheadCmd(cycles=n)`.
|
||||
|
||||
### D4. TensorHandle arithmetic operators — thread-local TLContext
|
||||
|
||||
At module load, `tl_context.py::_enable_tensor_ops()` runs and patches
|
||||
`TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__`. Each
|
||||
operator calls `_binary_math` on the active TLContext stored in a
|
||||
module-level thread-local `_ctx`.
|
||||
|
||||
So inside a kernel, `c = a + b` is equivalent to emitting
|
||||
`MathCmd(op="add", inputs=(a, b), out=)` and returning a new
|
||||
TensorHandle.
|
||||
|
||||
Active-TLContext management:
|
||||
|
||||
- `TLContext._set_active(ctx)`: set the active ctx for the current
|
||||
thread/greenlet.
|
||||
- `TLContext._get_active()`: read it (RuntimeError if unset).
|
||||
- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets active
|
||||
on entry, runs the kernel, restores `None` on exit.
|
||||
|
||||
`KernelRunner` re-asserts `_set_active(tl)` inside its `_switch_kernel`
|
||||
just before resuming the kernel, so a sibling PE runner that overwrote
|
||||
the thread-local context is correctly recovered.
|
||||
|
||||
### D5. Scratch allocator — compute output handles
|
||||
|
||||
Ops that produce a result — `tl.dot`, `tl.exp`, `tl.add` (via
|
||||
TensorHandle `__add__`), etc. — call `_make_compute_out(shape, dtype)`
|
||||
to obtain a 16-byte-aligned scratch address. The address is published
|
||||
with `space="tcm"`, so the handle can later be the source of a
|
||||
`tl.send` / `tl.store`.
|
||||
|
||||
When `_scratch_base == 0` (e.g., command-list mode), the address is 0
|
||||
and the handle cannot be a send/store source (in that case, only
|
||||
`tl.load`-returned handles are valid sources).
|
||||
|
||||
When the cursor exceeds `_scratch_size` (default 1 MiB), a
|
||||
RuntimeError is raised. The cursor must reset between kernel
|
||||
invocations (current code naturally satisfies this: KernelRunner
|
||||
creates a fresh TLContext each time).
|
||||
|
||||
### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`
|
||||
|
||||
Every non-metadata `tl.*` call starts with `_emit_dispatch_overhead()`,
|
||||
which — when `dispatch_cycles > 0` — emits
|
||||
`PeCpuOverheadCmd(dispatch_cycles)`. This models the cycles PE_CPU
|
||||
spends dispatching the command.
|
||||
|
||||
Defaults:
|
||||
|
||||
- `TLContext.__init__`'s `dispatch_cycles` parameter default: `1` cycle.
|
||||
- TLContext built by `KernelRunner`: `0` cycles (greenlet mode handles
|
||||
cycle accounting differently — aligned with ADR-0020 D3 intent).
|
||||
|
||||
### D7. Kernel registry (`triton_emu/registry.py`)
|
||||
|
||||
A separate `_kernels: dict[str, Callable]` holds the name → function
|
||||
mapping:
|
||||
|
||||
- `register_kernel(name, fn)`: ValueError on duplicate.
|
||||
- `get_kernel(name)`: KeyError if missing.
|
||||
- `clear_registry()`: test-only.
|
||||
|
||||
`RuntimeContext.launch(kernel_name, kernel_fn, *args)` overwrites
|
||||
`_kernels[kernel_name] = kernel_fn` on every call (last-call-wins,
|
||||
idempotent) — consistent with ADR-0045 D8's `launch` behavior.
|
||||
|
||||
PE_CPU looks up `KernelRef.name` in the registry and runs the function
|
||||
through KernelRunner.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Fold `tl.*` into ADR-0014 / ADR-0020
|
||||
|
||||
Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption
|
||||
of PeCommands); ADR-0020 covers 2-pass execution (Phase 1 timing /
|
||||
Phase 2 data). The `tl.*` surface is what the kernel author touches; a
|
||||
dedicated ADR improves findability and onboarding.
|
||||
|
||||
### A2. Deprecate command-list mode
|
||||
|
||||
Rejected (currently). Simple unit tests and kernel verification benefit
|
||||
from the lighter command-list path — it exposes a PeCommand sequence
|
||||
inspector without requiring greenlet machinery. When greenlet-mode
|
||||
semantics (real data, Phase 2) are needed, D2 explicitly selects them.
|
||||
|
||||
### A3. Remove TensorHandle arithmetic operators
|
||||
|
||||
Rejected. They mimic real Triton kernel ergonomics (e.g., `c = a + b`),
|
||||
and the thread-local active-ctx pattern works cleanly. The explicit
|
||||
function-form (`tl.add(a, b)`) is also exposed in D3.5, so the
|
||||
operators are syntactic sugar.
|
||||
|
||||
### A4. Expand softmax into the explicit sequence (max → exp → sum → div)
|
||||
|
||||
Partially adopted. `tl.softmax` is a single `MathCmd(op="softmax")` for
|
||||
timing accounting (D3.5), but Phase 2 DataExecutor expands it to the
|
||||
canonical sequence for real-data computation. Timing model atomic,
|
||||
data model expanded — the two split intentionally.
|
||||
|
||||
## Consequences
|
||||
|
||||
- Every `tl.*` primitive a bench author meets is classified and defined
|
||||
in a single ADR. Paired with ADR-0045 D8's host-side surface
|
||||
(`torch.empty` etc.), the inside-kernel and outside-kernel authoring
|
||||
guides are now complete.
|
||||
- The command-list / greenlet difference is pinned in D2, so any new
|
||||
`tl.*` primitive that follows the `_emit()` pattern auto-supports
|
||||
both modes.
|
||||
- The thread-local active-ctx pattern (D4) is justified at ADR level,
|
||||
clarifying who owns the reset responsibility when multiple PE
|
||||
runners share a thread (KernelRunner.run's contract restores active
|
||||
inside `_switch_kernel`).
|
||||
- `tl.recv_no_consume`'s diagnostic isolation (D3.10) is hardened in
|
||||
ADR form — accidental production use is blocked by a separate
|
||||
command branch.
|
||||
- The registry (D7) gets its own D-section, formalizing the
|
||||
name-collision and dynamic-re-registration semantics.
|
||||
Reference in New Issue
Block a user