adr: add ADR-0046-0049 — close G4 coverage gaps from /report

Documents four cross-cutting surfaces that previously had no ADR backing, each surfaced as a G4 candidate by /report: - 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...), the two execution modes (command-list vs greenlet runner), scratch allocator semantics, dispatch-overhead model, and the kernel registry. - 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group (backend="ahbm") install path. world_size priority (algorithm > defaults > topology), the 4-step init sequence (load ccl.yaml, import algorithm module, derive world_size, install SFR + IPCQ), greenlet- local rank registry, all_reduce dispatch via _defer_wait, barrier no-op rationale, and the explicit list of unsupported dist.* APIs. - 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator free-list semantics. Offset-keyed first-fit with coalescing, the no-validation trust model for free(), HBM/TCM channel separation, page-aligned VA allocation, the page_size dual-default (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and one-allocator-per-sub-unit rule. - 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog. H2D / D2H / PE DMA categories with their exact cube-index choices, the 32 KiB reference size, the 5-point utilization sweep, the formula vs actual column meanings, automatic invariant checks (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine isolation, and the human-readable (not machine-parsable) output contract. Bilingual pair verifier passes for all four EN/KO pairs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:25:04 -07:00
parent 5f8dd688f5
commit 9a02955770
8 changed files with 2154 additions and 0 deletions
@@ -0,0 +1,327 @@
+# ADR-0046: TLContext — Kernel-side `tl.*` API Contract
+
+## Status
+
+Accepted (2026-05-22).
+
+Documents the set of `tl.*` primitives exposed by
+`src/kernbench/triton_emu/`'s `TLContext`, their semantics, and the two
+execution-mode contracts (command-list / greenlet runner). ADR-0014/0020
+defines the PE pipeline and the 2-pass execution model, but **the `tl.*`
+surface that bench kernel functions call** had no ADR-level coverage.
+
+## First action
+
+When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id,
+num_cubes, scratch_base, scratch_size)` is instantiated, the first action
+is to initialize six categories of state:
+
+- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes` —
+  values that `tl.program_id` / `tl.num_programs` will return.
+- `self._dispatch_cycles` — cycle count emitted as `PeCpuOverheadCmd(cycles)`
+  at the start of every `tl.*` API call.
+- `self._runner` — `KernelRunner` instance (present → greenlet mode;
+  absent → command-list mode).
+- `self._commands: list[PeCommand] = []` — command-list accumulator
+  (command-list mode only).
+- `self._handle_counter = 0`, `self._completion_counter = 0` — counters
+  for generating TensorHandle / CompletionHandle ids.
+- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0` —
+  PE-local scratch region (used for math/dot/composite output handle
+  addresses).
+
+In short, **TLContext's first act is "record where (sip/cube/pe) and at
+what scale (num_programs/num_cubes) this kernel instance runs, and pick
+its dispatch mode (runner present or not)"**. No SimPy event is created
+and no command is emitted at this moment.
+
+The runtime first action happens when the kernel function first calls a
+`tl.<api>()`. The standard entry for every `tl.*` API is:
+
+1. Call `self._emit_dispatch_overhead()` — if `dispatch_cycles > 0`,
+   immediately `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
+2. Per-API processing (TensorHandle creation, command construction).
+3. `self._emit(cmd)` — in runner mode this `greenlet.switch()`es the cmd
+   to SimPy; in command-list mode it appends to `self._commands`.
+
+## Context
+
+The `tl.*` surface consists of `TLContext`'s methods, and the `tl`
+parameter received by a kernel function is one of these objects. The
+contract the user (bench author) sees:
+
+- Which primitives exist.
+- What data flow each primitive triggers (DMA / compute / IPCQ /
+  metadata-only).
+- How a TensorHandle's `space` and `addr` are decided.
+- The difference between command-list and greenlet modes.
+
+ADR-0014 (PE pipeline) defines the PeCommands consumed by PE_SCHEDULER,
+but how `tl.*` emits them is a code-only convention. ADR-0020 (2-pass
+data execution) mentions greenlet mode in D3 but does not pin down the
+signature difference (return-value handling) between the runner /
+non-runner paths. This ADR fills the gap.
+
+## Decision
+
+### D1. The `tl` parameter is a `TLContext` instance
+
+A bench kernel function has the signature:
+
+```python
+def _kernel(arg1, arg2, ..., tl, **kwargs):
+    ...
+```
+
+`tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name
+imitates real Triton's `triton.language` module; the actual Triton
+module is **not** passed in.
+
+The kernel is plain Python — no `yield` or `async`. `tl.*` calls produce
+SimPy events, but to the caller they appear synchronous because in
+greenlet mode the KernelRunner relays between SimPy and the kernel
+(ADR-0020 D3).
+
+### D2. Two execution modes — command-list / greenlet runner
+
+- **Command-list mode (`runner is None`)**: `tl.*` calls append PeCommand
+  to `self._commands`. DMA / GEMM / Math consume no SimPy time and return
+  metadata-only TensorHandles (`data=None`). PE_SCHEDULER / sim_engine
+  later replays the command sequence in time.
+
+- **Greenlet runner mode (`runner is not None`)**: `tl.*` calls
+  `self._emit(cmd)` → `runner.switch_to_simpy(cmd)`, handing control to
+  the parent greenlet (SimPy). The parent distributes the cmd to
+  components, consumes SimPy time, and (for DMA reads) returns real numpy
+  data. The kernel receives the result and continues to the next line
+  (the data-aware execution model from ADR-0020 D3).
+
+The choice of mode is decided by whether a KernelRunner is injected into
+the TLContext. The `tl.*` methods themselves are mode-blind — they go
+through `_emit()` uniformly.
+
+### D3. Primitive categories
+
+#### D3.1. Reference (no DMA, metadata only)
+
+- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: create a handle
+  referencing HBM data without issuing DMA. Used when the scheduler
+  streams the data per-tile (e.g., the b operand of a composite GEMM).
+
+#### D3.2. Data movement (blocking, DMA engine)
+
+- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → handle.
+  Emits `DmaReadCmd`. In greenlet mode the returned handle's `.data`
+  carries real numpy data; in command-list mode it is a placeholder.
+  The handle has `space="hbm"`, `pinned=True`.
+- `tl.store(ptr, handle) -> None`: TCM → HBM. Emits `DmaWriteCmd`. In
+  greenlet mode, when `handle.data` is present, `_store.write("hbm",
+  ptr, data)` runs first (visibility = issue time, ADR-0020 D3).
+
+#### D3.3. GEMM / compute (blocking)
+
+- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must live in
+  TCM; shapes `(M,K) × (K,N) → (M,N)`. Emits `GemmCmd`; the output
+  handle is allocated from PE-local scratch via
+  `_make_compute_out(shape, dtype)`.
+- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
+  acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking
+  tiled pipeline. Emits `CompositeCmd`. `epilogue` is a list of dicts,
+  each with `"op"` plus op-specific fields and an optional `"scope"`
+  (k_tile / output_tile). Unknown ops or missing fields raise
+  ValueError immediately. The returned CompletionHandle synchronizes
+  via `tl.wait(h)`.
+
+#### D3.4. Math: unary (blocking)
+
+- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
+  `tl.cos(x)`, `tl.sin(x)` — each emits `MathCmd(op=<name>,
+  inputs=(x,), out=)`. `out` is scratch-allocated with the same
+  shape/dtype as `x`.
+
+#### D3.5. Math: binary (blocking)
+
+- `tl.maximum(a, b)`, `tl.minimum(a, b)` — `_binary_math`.
+- `tl.fma(a, b, c)` — `a*b + c`. Three inputs.
+- `tl.clamp(x, min, max)` — `MathCmd(op="clamp", inputs=(x, min, max))`.
+- `tl.where(cond, a, b)` — `MathCmd(op="where", inputs=(cond, a, b))`.
+- `tl.softmax(x, axis=-1)` — a single `MathCmd(op="softmax")` so timing
+  accounts at one dispatch. Phase 2 DataExecutor expands it to the
+  canonical (x-max → exp → sum → div) sequence.
+
+#### D3.6. Reduction (blocking)
+
+- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)` — return an
+  output handle with the axis size collapsed to 1. Emit
+  `MathCmd(op=<name>, inputs=(x,), out=, axis=axis)`.
+
+#### D3.7. Index / scalar (PE_CPU, no engine)
+
+- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE
+  index), `axis==1` → cube_id (ADR-0022).
+- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per
+  cube), `axis==1` → num_cubes.
+- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range
+  in TCM. No command emitted.
+- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape,
+  value, dtype="f16") -> TensorHandle`: TCM placeholder. No command
+  emitted.
+
+#### D3.8. Scalar helpers (no command, no engine)
+
+- `TLContext.cdiv(a, b) -> int` (static): ceiling division
+  `-(-a // b)`. Mirrors real Triton's `tl.cdiv`.
+
+#### D3.9. Metadata-only (no compute, no DMA)
+
+- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims
+  swapped. Shares `addr` and `data`; no command emitted.
+
+#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
+
+- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
+  dtype="f16", space="tcm") -> None`: blocking send. Accepts either
+  handle form or raw-address form. Emits `IpcqSendCmd`. The handle's
+  `.data` snapshot rides along on the command — avoiding the race
+  where a later inbound IPCQ overwrites the slot before the outbound
+  PE_DMA reads it.
+- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
+  dst_space=None) -> TensorHandle`: blocking recv. Providing both
+  `dst_addr` and `dst_space` enters "copy_to_dst" mode; otherwise
+  "return_slot" mode. In greenlet mode the handle's `.data` carries
+  the real data.
+- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
+  **DIAGNOSTIC ONLY**. Has the same blocking-arrival semantics as
+  `tl.recv` but skips the slot-read latency charge (slot-IO + PE↔bank
+  fabric drain). Used in the pe2pe overview plot for an apples-to-apples
+  comparison against `tl.store`. Production kernels MUST NOT use it —
+  the diagnostic flag is isolated in its own command branch
+  (`consume=False`) so it cannot be accidentally enabled.
+- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
+  recv. Returns a `RecvFuture`; resolved later by `tl.wait(future)`.
+
+#### D3.11. Composite + control
+
+- `tl.composite(...)`: see D3.3.
+- `tl.wait(handle=None)`: wait on a `CompletionHandle` (composite), a
+  `RecvFuture` (async recv), or `None` (all pending composites).
+- `tl.cycles(n)`: declare a scalar PE_CPU overhead. Emits
+  `PeCpuOverheadCmd(cycles=n)`.
+
+### D4. TensorHandle arithmetic operators — thread-local TLContext
+
+At module load, `tl_context.py::_enable_tensor_ops()` runs and patches
+`TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__`. Each
+operator calls `_binary_math` on the active TLContext stored in a
+module-level thread-local `_ctx`.
+
+So inside a kernel, `c = a + b` is equivalent to emitting
+`MathCmd(op="add", inputs=(a, b), out=)` and returning a new
+TensorHandle.
+
+Active-TLContext management:
+
+- `TLContext._set_active(ctx)`: set the active ctx for the current
+  thread/greenlet.
+- `TLContext._get_active()`: read it (RuntimeError if unset).
+- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets active
+  on entry, runs the kernel, restores `None` on exit.
+
+`KernelRunner` re-asserts `_set_active(tl)` inside its `_switch_kernel`
+just before resuming the kernel, so a sibling PE runner that overwrote
+the thread-local context is correctly recovered.
+
+### D5. Scratch allocator — compute output handles
+
+Ops that produce a result — `tl.dot`, `tl.exp`, `tl.add` (via
+TensorHandle `__add__`), etc. — call `_make_compute_out(shape, dtype)`
+to obtain a 16-byte-aligned scratch address. The address is published
+with `space="tcm"`, so the handle can later be the source of a
+`tl.send` / `tl.store`.
+
+When `_scratch_base == 0` (e.g., command-list mode), the address is 0
+and the handle cannot be a send/store source (in that case, only
+`tl.load`-returned handles are valid sources).
+
+When the cursor exceeds `_scratch_size` (default 1 MiB), a
+RuntimeError is raised. The cursor must reset between kernel
+invocations (current code naturally satisfies this: KernelRunner
+creates a fresh TLContext each time).
+
+### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`
+
+Every non-metadata `tl.*` call starts with `_emit_dispatch_overhead()`,
+which — when `dispatch_cycles > 0` — emits
+`PeCpuOverheadCmd(dispatch_cycles)`. This models the cycles PE_CPU
+spends dispatching the command.
+
+Defaults:
+
+- `TLContext.__init__`'s `dispatch_cycles` parameter default: `1` cycle.
+- TLContext built by `KernelRunner`: `0` cycles (greenlet mode handles
+  cycle accounting differently — aligned with ADR-0020 D3 intent).
+
+### D7. Kernel registry (`triton_emu/registry.py`)
+
+A separate `_kernels: dict[str, Callable]` holds the name → function
+mapping:
+
+- `register_kernel(name, fn)`: ValueError on duplicate.
+- `get_kernel(name)`: KeyError if missing.
+- `clear_registry()`: test-only.
+
+`RuntimeContext.launch(kernel_name, kernel_fn, *args)` overwrites
+`_kernels[kernel_name] = kernel_fn` on every call (last-call-wins,
+idempotent) — consistent with ADR-0045 D8's `launch` behavior.
+
+PE_CPU looks up `KernelRef.name` in the registry and runs the function
+through KernelRunner.
+
+## Alternatives Considered
+
+### A1. Fold `tl.*` into ADR-0014 / ADR-0020
+
+Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption
+of PeCommands); ADR-0020 covers 2-pass execution (Phase 1 timing /
+Phase 2 data). The `tl.*` surface is what the kernel author touches; a
+dedicated ADR improves findability and onboarding.
+
+### A2. Deprecate command-list mode
+
+Rejected (currently). Simple unit tests and kernel verification benefit
+from the lighter command-list path — it exposes a PeCommand sequence
+inspector without requiring greenlet machinery. When greenlet-mode
+semantics (real data, Phase 2) are needed, D2 explicitly selects them.
+
+### A3. Remove TensorHandle arithmetic operators
+
+Rejected. They mimic real Triton kernel ergonomics (e.g., `c = a + b`),
+and the thread-local active-ctx pattern works cleanly. The explicit
+function-form (`tl.add(a, b)`) is also exposed in D3.5, so the
+operators are syntactic sugar.
+
+### A4. Expand softmax into the explicit sequence (max → exp → sum → div)
+
+Partially adopted. `tl.softmax` is a single `MathCmd(op="softmax")` for
+timing accounting (D3.5), but Phase 2 DataExecutor expands it to the
+canonical sequence for real-data computation. Timing model atomic,
+data model expanded — the two split intentionally.
+
+## Consequences
+
+- Every `tl.*` primitive a bench author meets is classified and defined
+  in a single ADR. Paired with ADR-0045 D8's host-side surface
+  (`torch.empty` etc.), the inside-kernel and outside-kernel authoring
+  guides are now complete.
+- The command-list / greenlet difference is pinned in D2, so any new
+  `tl.*` primitive that follows the `_emit()` pattern auto-supports
+  both modes.
+- The thread-local active-ctx pattern (D4) is justified at ADR level,
+  clarifying who owns the reset responsibility when multiple PE
+  runners share a thread (KernelRunner.run's contract restores active
+  inside `_switch_kernel`).
+- `tl.recv_no_consume`'s diagnostic isolation (D3.10) is hardened in
+  ADR form — accidental production use is blocked by a separate
+  command branch.
+- The registry (D7) gets its own D-section, formalizing the
+  name-collision and dynamic-re-registration semantics.