adr: add ADR-0046-0049 — close G4 coverage gaps from /report

Documents four cross-cutting surfaces that previously had no ADR backing,
each surfaced as a G4 candidate by /report:

- 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates
  all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...),
  the two execution modes (command-list vs greenlet runner), scratch
  allocator semantics, dispatch-overhead model, and the kernel registry.

- 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group
  (backend="ahbm") install path. world_size priority (algorithm >
  defaults > topology), the 4-step init sequence (load ccl.yaml, import
  algorithm module, derive world_size, install SFR + IPCQ), greenlet-
  local rank registry, all_reduce dispatch via _defer_wait, barrier
  no-op rationale, and the explicit list of unsupported dist.* APIs.

- 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator
  free-list semantics. Offset-keyed first-fit with coalescing, the
  no-validation trust model for free(), HBM/TCM channel separation,
  page-aligned VA allocation, the page_size dual-default
  (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and
  one-allocator-per-sub-unit rule.

- 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog.
  H2D / D2H / PE DMA categories with their exact cube-index choices,
  the 32 KiB reference size, the 5-point utilization sweep, the
  formula vs actual column meanings, automatic invariant checks
  (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine
  isolation, and the human-readable (not machine-parsable) output
  contract.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-22 10:25:04 -07:00
parent 5f8dd688f5
commit 9a02955770
8 changed files with 2154 additions and 0 deletions
@@ -0,0 +1,327 @@
# ADR-0046: TLContext — Kernel-side `tl.*` API Contract
## Status
Accepted (2026-05-22).
Documents the set of `tl.*` primitives exposed by
`src/kernbench/triton_emu/`'s `TLContext`, their semantics, and the two
execution-mode contracts (command-list / greenlet runner). ADR-0014/0020
defines the PE pipeline and the 2-pass execution model, but **the `tl.*`
surface that bench kernel functions call** had no ADR-level coverage.
## First action
When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id,
num_cubes, scratch_base, scratch_size)` is instantiated, the first action
is to initialize six categories of state:
- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes`
values that `tl.program_id` / `tl.num_programs` will return.
- `self._dispatch_cycles` — cycle count emitted as `PeCpuOverheadCmd(cycles)`
at the start of every `tl.*` API call.
- `self._runner``KernelRunner` instance (present → greenlet mode;
absent → command-list mode).
- `self._commands: list[PeCommand] = []` — command-list accumulator
(command-list mode only).
- `self._handle_counter = 0`, `self._completion_counter = 0` — counters
for generating TensorHandle / CompletionHandle ids.
- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0`
PE-local scratch region (used for math/dot/composite output handle
addresses).
In short, **TLContext's first act is "record where (sip/cube/pe) and at
what scale (num_programs/num_cubes) this kernel instance runs, and pick
its dispatch mode (runner present or not)"**. No SimPy event is created
and no command is emitted at this moment.
The runtime first action happens when the kernel function first calls a
`tl.<api>()`. The standard entry for every `tl.*` API is:
1. Call `self._emit_dispatch_overhead()` — if `dispatch_cycles > 0`,
immediately `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
2. Per-API processing (TensorHandle creation, command construction).
3. `self._emit(cmd)` — in runner mode this `greenlet.switch()`es the cmd
to SimPy; in command-list mode it appends to `self._commands`.
## Context
The `tl.*` surface consists of `TLContext`'s methods, and the `tl`
parameter received by a kernel function is one of these objects. The
contract the user (bench author) sees:
- Which primitives exist.
- What data flow each primitive triggers (DMA / compute / IPCQ /
metadata-only).
- How a TensorHandle's `space` and `addr` are decided.
- The difference between command-list and greenlet modes.
ADR-0014 (PE pipeline) defines the PeCommands consumed by PE_SCHEDULER,
but how `tl.*` emits them is a code-only convention. ADR-0020 (2-pass
data execution) mentions greenlet mode in D3 but does not pin down the
signature difference (return-value handling) between the runner /
non-runner paths. This ADR fills the gap.
## Decision
### D1. The `tl` parameter is a `TLContext` instance
A bench kernel function has the signature:
```python
def _kernel(arg1, arg2, ..., tl, **kwargs):
...
```
`tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name
imitates real Triton's `triton.language` module; the actual Triton
module is **not** passed in.
The kernel is plain Python — no `yield` or `async`. `tl.*` calls produce
SimPy events, but to the caller they appear synchronous because in
greenlet mode the KernelRunner relays between SimPy and the kernel
(ADR-0020 D3).
### D2. Two execution modes — command-list / greenlet runner
- **Command-list mode (`runner is None`)**: `tl.*` calls append PeCommand
to `self._commands`. DMA / GEMM / Math consume no SimPy time and return
metadata-only TensorHandles (`data=None`). PE_SCHEDULER / sim_engine
later replays the command sequence in time.
- **Greenlet runner mode (`runner is not None`)**: `tl.*` calls
`self._emit(cmd)``runner.switch_to_simpy(cmd)`, handing control to
the parent greenlet (SimPy). The parent distributes the cmd to
components, consumes SimPy time, and (for DMA reads) returns real numpy
data. The kernel receives the result and continues to the next line
(the data-aware execution model from ADR-0020 D3).
The choice of mode is decided by whether a KernelRunner is injected into
the TLContext. The `tl.*` methods themselves are mode-blind — they go
through `_emit()` uniformly.
### D3. Primitive categories
#### D3.1. Reference (no DMA, metadata only)
- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: create a handle
referencing HBM data without issuing DMA. Used when the scheduler
streams the data per-tile (e.g., the b operand of a composite GEMM).
#### D3.2. Data movement (blocking, DMA engine)
- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → handle.
Emits `DmaReadCmd`. In greenlet mode the returned handle's `.data`
carries real numpy data; in command-list mode it is a placeholder.
The handle has `space="hbm"`, `pinned=True`.
- `tl.store(ptr, handle) -> None`: TCM → HBM. Emits `DmaWriteCmd`. In
greenlet mode, when `handle.data` is present, `_store.write("hbm",
ptr, data)` runs first (visibility = issue time, ADR-0020 D3).
#### D3.3. GEMM / compute (blocking)
- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must live in
TCM; shapes `(M,K) × (K,N) → (M,N)`. Emits `GemmCmd`; the output
handle is allocated from PE-local scratch via
`_make_compute_out(shape, dtype)`.
- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking
tiled pipeline. Emits `CompositeCmd`. `epilogue` is a list of dicts,
each with `"op"` plus op-specific fields and an optional `"scope"`
(k_tile / output_tile). Unknown ops or missing fields raise
ValueError immediately. The returned CompletionHandle synchronizes
via `tl.wait(h)`.
#### D3.4. Math: unary (blocking)
- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
`tl.cos(x)`, `tl.sin(x)` — each emits `MathCmd(op=<name>,
inputs=(x,), out=)`. `out` is scratch-allocated with the same
shape/dtype as `x`.
#### D3.5. Math: binary (blocking)
- `tl.maximum(a, b)`, `tl.minimum(a, b)` — `_binary_math`.
- `tl.fma(a, b, c)` — `a*b + c`. Three inputs.
- `tl.clamp(x, min, max)` — `MathCmd(op="clamp", inputs=(x, min, max))`.
- `tl.where(cond, a, b)` — `MathCmd(op="where", inputs=(cond, a, b))`.
- `tl.softmax(x, axis=-1)` — a single `MathCmd(op="softmax")` so timing
accounts at one dispatch. Phase 2 DataExecutor expands it to the
canonical (x-max → exp → sum → div) sequence.
#### D3.6. Reduction (blocking)
- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)` — return an
output handle with the axis size collapsed to 1. Emit
`MathCmd(op=<name>, inputs=(x,), out=, axis=axis)`.
#### D3.7. Index / scalar (PE_CPU, no engine)
- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE
index), `axis==1` → cube_id (ADR-0022).
- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per
cube), `axis==1` → num_cubes.
- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range
in TCM. No command emitted.
- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape,
value, dtype="f16") -> TensorHandle`: TCM placeholder. No command
emitted.
#### D3.8. Scalar helpers (no command, no engine)
- `TLContext.cdiv(a, b) -> int` (static): ceiling division
`-(-a // b)`. Mirrors real Triton's `tl.cdiv`.
#### D3.9. Metadata-only (no compute, no DMA)
- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims
swapped. Shares `addr` and `data`; no command emitted.
#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
dtype="f16", space="tcm") -> None`: blocking send. Accepts either
handle form or raw-address form. Emits `IpcqSendCmd`. The handle's
`.data` snapshot rides along on the command — avoiding the race
where a later inbound IPCQ overwrites the slot before the outbound
PE_DMA reads it.
- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
dst_space=None) -> TensorHandle`: blocking recv. Providing both
`dst_addr` and `dst_space` enters "copy_to_dst" mode; otherwise
"return_slot" mode. In greenlet mode the handle's `.data` carries
the real data.
- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
**DIAGNOSTIC ONLY**. Has the same blocking-arrival semantics as
`tl.recv` but skips the slot-read latency charge (slot-IO + PE↔bank
fabric drain). Used in the pe2pe overview plot for an apples-to-apples
comparison against `tl.store`. Production kernels MUST NOT use it —
the diagnostic flag is isolated in its own command branch
(`consume=False`) so it cannot be accidentally enabled.
- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
recv. Returns a `RecvFuture`; resolved later by `tl.wait(future)`.
#### D3.11. Composite + control
- `tl.composite(...)`: see D3.3.
- `tl.wait(handle=None)`: wait on a `CompletionHandle` (composite), a
`RecvFuture` (async recv), or `None` (all pending composites).
- `tl.cycles(n)`: declare a scalar PE_CPU overhead. Emits
`PeCpuOverheadCmd(cycles=n)`.
### D4. TensorHandle arithmetic operators — thread-local TLContext
At module load, `tl_context.py::_enable_tensor_ops()` runs and patches
`TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__`. Each
operator calls `_binary_math` on the active TLContext stored in a
module-level thread-local `_ctx`.
So inside a kernel, `c = a + b` is equivalent to emitting
`MathCmd(op="add", inputs=(a, b), out=)` and returning a new
TensorHandle.
Active-TLContext management:
- `TLContext._set_active(ctx)`: set the active ctx for the current
thread/greenlet.
- `TLContext._get_active()`: read it (RuntimeError if unset).
- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets active
on entry, runs the kernel, restores `None` on exit.
`KernelRunner` re-asserts `_set_active(tl)` inside its `_switch_kernel`
just before resuming the kernel, so a sibling PE runner that overwrote
the thread-local context is correctly recovered.
### D5. Scratch allocator — compute output handles
Ops that produce a result — `tl.dot`, `tl.exp`, `tl.add` (via
TensorHandle `__add__`), etc. — call `_make_compute_out(shape, dtype)`
to obtain a 16-byte-aligned scratch address. The address is published
with `space="tcm"`, so the handle can later be the source of a
`tl.send` / `tl.store`.
When `_scratch_base == 0` (e.g., command-list mode), the address is 0
and the handle cannot be a send/store source (in that case, only
`tl.load`-returned handles are valid sources).
When the cursor exceeds `_scratch_size` (default 1 MiB), a
RuntimeError is raised. The cursor must reset between kernel
invocations (current code naturally satisfies this: KernelRunner
creates a fresh TLContext each time).
### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`
Every non-metadata `tl.*` call starts with `_emit_dispatch_overhead()`,
which — when `dispatch_cycles > 0` — emits
`PeCpuOverheadCmd(dispatch_cycles)`. This models the cycles PE_CPU
spends dispatching the command.
Defaults:
- `TLContext.__init__`'s `dispatch_cycles` parameter default: `1` cycle.
- TLContext built by `KernelRunner`: `0` cycles (greenlet mode handles
cycle accounting differently — aligned with ADR-0020 D3 intent).
### D7. Kernel registry (`triton_emu/registry.py`)
A separate `_kernels: dict[str, Callable]` holds the name → function
mapping:
- `register_kernel(name, fn)`: ValueError on duplicate.
- `get_kernel(name)`: KeyError if missing.
- `clear_registry()`: test-only.
`RuntimeContext.launch(kernel_name, kernel_fn, *args)` overwrites
`_kernels[kernel_name] = kernel_fn` on every call (last-call-wins,
idempotent) — consistent with ADR-0045 D8's `launch` behavior.
PE_CPU looks up `KernelRef.name` in the registry and runs the function
through KernelRunner.
## Alternatives Considered
### A1. Fold `tl.*` into ADR-0014 / ADR-0020
Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption
of PeCommands); ADR-0020 covers 2-pass execution (Phase 1 timing /
Phase 2 data). The `tl.*` surface is what the kernel author touches; a
dedicated ADR improves findability and onboarding.
### A2. Deprecate command-list mode
Rejected (currently). Simple unit tests and kernel verification benefit
from the lighter command-list path — it exposes a PeCommand sequence
inspector without requiring greenlet machinery. When greenlet-mode
semantics (real data, Phase 2) are needed, D2 explicitly selects them.
### A3. Remove TensorHandle arithmetic operators
Rejected. They mimic real Triton kernel ergonomics (e.g., `c = a + b`),
and the thread-local active-ctx pattern works cleanly. The explicit
function-form (`tl.add(a, b)`) is also exposed in D3.5, so the
operators are syntactic sugar.
### A4. Expand softmax into the explicit sequence (max → exp → sum → div)
Partially adopted. `tl.softmax` is a single `MathCmd(op="softmax")` for
timing accounting (D3.5), but Phase 2 DataExecutor expands it to the
canonical sequence for real-data computation. Timing model atomic,
data model expanded — the two split intentionally.
## Consequences
- Every `tl.*` primitive a bench author meets is classified and defined
in a single ADR. Paired with ADR-0045 D8's host-side surface
(`torch.empty` etc.), the inside-kernel and outside-kernel authoring
guides are now complete.
- The command-list / greenlet difference is pinned in D2, so any new
`tl.*` primitive that follows the `_emit()` pattern auto-supports
both modes.
- The thread-local active-ctx pattern (D4) is justified at ADR level,
clarifying who owns the reset responsibility when multiple PE
runners share a thread (KernelRunner.run's contract restores active
inside `_switch_kernel`).
- `tl.recv_no_consume`'s diagnostic isolation (D3.10) is hardened in
ADR form — accidental production use is blocked by a separate
command branch.
- The registry (D7) gets its own D-section, formalizing the
name-collision and dynamic-re-registration semantics.
+259
View File
@@ -0,0 +1,259 @@
# ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
## Status
Accepted (2026-05-22).
Pins down what `runtime_api/distributed.py`'s `AhbmCCLBackend` +
`DistributedContext` actually install — i.e., the entry point
`torch.distributed.init_process_group(backend="ahbm")` — and how
`all_reduce`/`barrier`/`get_rank` etc. are implemented. ADR-0023 D11
mentions the "torch.distributed compatibility" intent, but **the backend
itself** had no ADR-level coverage.
## First action
`RuntimeContext.__post_init__` automatically constructs a
`DistributedContext()` and attaches it to `self.distributed`. The first
action at that moment:
1. `self._backend: AhbmCCLBackend | None = None` — uninitialized.
2. `self._rank_by_greenlet: dict = {}` — greenlet-local rank registry
(ADR-0024 D2).
3. The caller (RuntimeContext) sets `dc._ctx_ref = self` so subsequent
`init_process_group` can reach `ctx.engine` / `ctx.spec` / `ctx.launch`.
In short, **DistributedContext's first act is "attach to RuntimeContext
with a back-reference and leave the backend slot empty"**. Actual
backend installation (IPCQ install, world_size derivation, algorithm
module import) happens only when user code calls
`torch.distributed.init_process_group(backend="ahbm")`.
At that moment, `init_process_group`'s first action is:
1. If `backend != "ahbm"`, raise `ValueError("Unsupported backend ...")`
immediately.
2. If `getattr(self, "_ctx_ref", None)` is None,
`RuntimeError("DistributedContext not bound to a RuntimeContext")`.
3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)` — inside this
constructor, ccl.yaml is loaded, the algorithm module is imported,
world_size is derived, SFR is configured, and IPCQ is installed.
4. `self._backend._dist_ctx = self` — the backend gets a back-reference
so it can read `_rank_by_greenlet`.
## Context
The `AhbmCCLBackend` exists so that PyTorch DDP collective calls
(`init_process_group`, `all_reduce`, etc.) work unchanged and bench code
reads identically to a real DDP training script (in line with
ADR-0024 + ADR-0027's launcher model).
The backend's responsibilities:
- At `init_process_group` time, install the **IPCQ neighbor table once**
(analogous to NCCL communicator creation).
- For each `all_reduce(tensor, op="sum")`, dispatch the configured
algorithm's kernel function via `ctx.launch(...)`.
- Answer `get_world_size` / `get_rank` consistently from the
greenlet-local rank registry plus ccl.yaml/topology.
ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) touch
parts of this, but **the backend's own responsibility scope and decision
order** are not pinned anywhere. This ADR fills that gap.
## Decision
### D1. The backend is created only at `init_process_group(backend="ahbm")` time
`DistributedContext` starts with `_backend = None`. The backend object
does not exist until the user calls
`dist.init_process_group(backend="ahbm")`. Any other API
(`is_initialized`, `get_world_size`, `all_reduce`, `barrier`) called
while `_backend` is None raises
`RuntimeError("Default process group has not been initialized...")` via
the `_ensure_initialized` helper.
`backend != "ahbm"` raises `ValueError` immediately. Other backend names
(`nccl`, `gloo`, etc.) are not recognized.
### D2. world_size resolution priority — algorithm > defaults > topology
`AhbmCCLBackend._resolve_world_size` (ADR-0024 D1):
1. If `ccl.yaml`'s algorithm entry has `world_size`, use it.
2. Else if `defaults.world_size` is set, use it.
3. Else fall back to `spec.system.sips.count` (the topology's SIP count).
The default interpretation is **rank = SIP** (ADR-0024). Cube/PE-level
parallelism is expressed inside each rank via DPPolicy and does not
affect world_size. An explicit `ccl.yaml` override is preserved for the
legacy "rank = flat PE index" test path.
User arguments to `init_process_group(world_size=..., rank=...)` are
**accepted but ignored** (same as real PyTorch's `RANK` / `WORLD_SIZE`
env vars).
### D3. `init_process_group` performs four installation steps
Inside `AhbmCCLBackend.__init__`, in order:
1. **Load ccl.yaml**: `kernbench.ccl.install.load_ccl_config()`
`resolve_algorithm_config(_cfg_all)` produces the merged config for
`defaults.algorithm` (or the user-specified algorithm).
2. **Import algorithm module**:
`importlib.import_module(self._merged["module"])`. The module must
expose a `kernel` function, a `kernel_args(world_size, n_elem,
cube_w, cube_h)` helper, and optionally a `TOPO_NAME_TO_KIND` map.
3. **Resolve world_size** (D2).
4. **Collect topology metadata** from `spec`: `n_sips`, `sip_topo`
(`ring_1d` default), `cube_w`/`cube_h`, `sips.w`/`sips.h`. When the
SIP topology is not `ring_1d`, derive `_sip_topo_w/h` from explicit
`w`/`h` or via square-root (require `w*h == n_sips`). Mismatch raises
`ValueError`.
5. **Install SFR + IPCQ**:
`kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine,
spec, self._merged)`. This pushes IPCQ neighbor tables to every
SIP/cube's pe0 (one-time setup analogous to NCCL communicator
creation).
If the order changes (e.g., SFR runs before the algorithm module
loads), partial initialization can result. So D3 is treated as an
atomic 4-step block — on failure the backend remains uninstalled.
### D4. Greenlet-local rank binding (ADR-0024 D2)
`DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps spawned
worker greenlets to their ranks. When the bench launcher (e.g.,
`torch.multiprocessing.spawn`) spawns a worker, it registers via
`dc._bind_rank(g, rank)`.
`get_rank()` looks up `getcurrent()`'s greenlet. Unregistered greenlets
fall back to 0 — preserves single-driver / test compatibility.
The backend reads the current greenlet's rank from
`_dist_ctx._rank_by_greenlet` during `all_reduce` (D5).
### D5. `all_reduce(tensor, op="sum")` behavior
Validation:
- `op != "sum"``NotImplementedError`. Current kernels only
implement add reduction.
- `tensor._handle is None``RuntimeError("not deployed")`.
- `tensor._handle.shards` empty → `RuntimeError("no shards")`.
Preparation:
- `n_elem = shards[0].nbytes // tensor.itemsize` — element count of a
single shard.
- `kernel_fn = self._algo_module.kernel` — the algorithm module's entry
function (imported in D3).
- Decide effective cube dims: if the first SIP has just 1 cube, use
`(1, 1)`; otherwise use the topology's `cube_w`/`cube_h`. This
naturally absorbs TP runs that use only a subset of cubes.
- `kernel_args = self._algo_module.kernel_args(world_size, n_elem,
cube_w, cube_h)` — the algorithm decides which arguments to pass to
its kernel.
Dispatch:
- Resolve the current greenlet's rank via
`_rank_by_greenlet.get(g, 0)`.
- Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w,
sip_topo_h)`.
- `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor,
*kernel_args, *extra_args, _defer_wait=True)` — `_defer_wait=True`
delegates collective drain to the main scheduler (ADR-0027 D0.4).
Drain:
- If the parent greenlet is alive (multi-greenlet mode), enqueue
`_pending_collective_handles` and switch to parent. The main
scheduler drains after all ranks have launched.
- If single-driver mode, drain inline:
`for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
### D6. `barrier()` is a no-op (single-driver model)
kernbench runs all ranks as greenlets inside a single Python process,
so no cross-process synchronization is needed. `barrier()` is callable
but does no synchronization. Kept for real-PyTorch API compatibility so
callers don't get `NotImplementedError`.
If multi-process kernbench (SimPy event loop per process) is introduced
in the future, D6 needs a superseding ADR.
### D7. Semantics of `get_rank` / `get_world_size` / `get_backend`
- `get_rank()` (D4): the current greenlet's bound rank; unregistered → 0.
- `get_world_size()` (D2): the world_size resolved by the backend in D3.
- `get_backend()`: always the literal string `"ahbm"`. Calling before
backend exists triggers `_ensure_initialized`'s RuntimeError.
Differences vs. real PyTorch:
- Real PyTorch `get_rank()` is a process-global value; here it is
greenlet-local. Inside a spawned worker → the worker's rank; in the
main thread → 0. Bench authors should expect meaningful ranks only
inside worker functions.
### D8. Supported API surface (final)
`DistributedContext` exposes:
- `init_process_group(backend="ahbm", world_size=None, rank=None,
**kwargs)`
- `is_initialized() -> bool`
- `get_world_size() -> int`
- `get_rank() -> int`
- `get_backend() -> str`
- `all_reduce(tensor, op="sum") -> None`
- `barrier() -> None`
- (internal) `_bind_rank(g, rank)`
Other PyTorch distributed APIs (`broadcast`, `reduce`, `all_gather`,
`gather`, `scatter`, point-to-point `send/recv`, etc.) are **not
implemented**. Kernel-level expression is available via
`tl.send`/`tl.recv` (ADR-0046 D3.10), but the `dist.*` surface does not
expose them. If additional collectives are needed, add a paired
(algorithm module, `DistributedContext` method) and extend D8.
## Alternatives Considered
### A1. Create the backend in `RuntimeContext.__init__`
Rejected. If `ccl.yaml` is missing or the algorithm module can't be
imported, RuntimeContext construction would fail even when the bench
does not use distributed features. Lazy creation at call time (D1) is
the right semantics.
### A2. Always derive world_size from topology (no override)
Rejected. ADR-0024 D1's "explicit override" path is used by legacy
tests. Diagnostic scenarios that define PE-level ranks within a single
SIP also need this escape hatch.
### A3. Silent fallback for unsupported `op`
Rejected. If the user intends `op="prod"` / `"max"` / `"avg"` and silent
`sum` runs instead, result validation gets very hard. Explicit
`NotImplementedError` is safer.
### A4. Implement `barrier` as a SimPy event
Rejected (currently). With single-driver semantics there is no
cross-process synchronization to express, so a no-op is meaningfully
correct. A fake-barrier SimPy event would add code complexity for no
semantic gain. Revisit when multi-process kernbench arrives.
## Consequences
- The 4-step installation (D3) for
`torch.distributed.init_process_group(backend="ahbm")` is locked in,
making clear where future collective algorithms must hook.
- The priority order in D2 (algorithm > defaults > topology) makes the
blast radius of ccl.yaml changes quickly knowable.
- The no-op `barrier` (D6) is recorded so multi-process kernbench, if
introduced, must explicitly supersede this ADR.
- D8's list of unsupported APIs explicitly grounds the rejection
message when users call, e.g., `dist.broadcast(...)`.
@@ -0,0 +1,278 @@
# ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
## Status
Accepted (2026-05-22).
Pins down the free-list algorithm, page alignment, and coalescing rules
used by `policy/address/allocator.py`'s `_FreeList` / `PEMemAllocator`
and `va_allocator.py`'s `VirtualAllocator`. ADR-0001 (PhysAddr layout)
and ADR-0011 (PA/VA/LA models) define the address schemes; the
**allocation algorithms** had no ADR-level coverage.
## First action
### `_FreeList(capacity)`
On construction: `self._capacity = capacity`, `self._used = 0`,
`self._free = [(0, capacity)]`. The first act is **establishing the
entire region as one free block** — the tuple `(offset=0,
size=capacity)` is the sole entry in the free list.
### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`
On construction, builds two `_FreeList`s:
- `self._hbm = _FreeList(cfg.hbm_slice_bytes)` — the size of this PE's
HBM slice (`hbm_bytes_per_cube // hbm_slices_per_cube`).
- `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)` — equals
`tcm_bytes_per_pe - tcm_scheduler_reserved_bytes` (the scheduler
reservation is pre-deducted).
So PEMemAllocator's first act is **constructing single-free-block
HBM-slice and TCM regions for this PE**.
### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`
On construction: `self._va_base = va_base`, `self._va_size = va_size`,
`self._page_size = page_size`, `self._used = 0`, `self._free =
[(va_base, va_size)]`. The first act is **establishing one block from
va_base to va_size and stashing page_size**.
## Context
`runtime_api/context.py::_ensure_allocators` builds the allocator set
in these stages:
1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`,
per-target_device SIP range, etc. from `spec`.
2. Pack everything into a frozen `AddressConfig`.
3. For every combination in the target SIP range × cubes × PEs,
construct one `PEMemAllocator(sip, cube, pe, cfg)` instance.
4. Construct one `VirtualAllocator(va_base=0x1_0000_0000, va_size=64
GiB, page_size=pe_mmu.page_size)`.
Allocator responsibilities:
- **PEMemAllocator**: PA-space allocation in the PE-local HBM slice /
TCM (including PhysAddr encoding).
- **VirtualAllocator**: device-wide VA allocation, page-aligned.
`RuntimeContext._create_tensor` then pushes VA → PA mappings to
components via `MmuMapMsg`.
These algorithms are:
- **First-fit**, kept simple.
- The free-block list is **sorted by start offset**.
- On `free()`, **adjacent blocks coalesce**.
The rationale was not documented anywhere, so when someone asks "why
not best-fit?", "why not a buddy allocator?", "why does partial-overlap
free pass silently?", there was no anchor to answer from. This ADR
provides it.
## Decision
### D1. `_FreeList` — offset-keyed first-fit + coalescing
`policy/address/allocator.py::_FreeList`:
- Internal representation: `list[tuple[int, int]] = [(start_offset,
size), ...]` — sorted by start offset.
- `alloc(nbytes)`:
1. Iterate the free list from the front (first-fit).
2. Take from the first block with `size >= nbytes`.
3. Exact match → drop the block; otherwise shrink it to `(start +
nbytes, size - nbytes)`.
4. `_used += nbytes`; return the taken `start`.
5. If no block fits, `AllocationError("overflow ... largest free
block ...")`.
- `free(offset, nbytes)`:
1. `_used -= nbytes`.
2. `bisect_left(self._free, (offset,))` finds the insertion index.
3. If adjacent to the previous block (`prev_start + prev_size ==
offset`), merge.
4. If adjacent to the next block (`offset + nbytes == next_start`),
merge.
5. Insert the coalesced range at the right sorted position.
This algorithm is weaker than best-fit / buddy on fragmentation, but
the simulator's workload (mostly stack-like deploy/free) tolerates it.
If the workload shape changes, D1 is a supersession candidate.
### D2. Partial-overlap free is **not** validated
`_FreeList.free(offset, nbytes)` trusts the caller to pass the exact
`(offset, nbytes)`. It does **not** verify:
- That the range was actually allocated.
- That the range does not overlap another allocated region.
Reason: in a simulator context, callers always store the return value
of `alloc()` and pass it back to `free()` — there is no external user
input. Adding a safety check would cost O(N) per free and impact
simulation wall-clock.
If this trust model breaks (e.g., a code path lets two tensors point
at the same PA), this ADR must be revisited.
### D3. `PEMemAllocator` — two channels for HBM/TCM
`PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList`s:
- `_hbm`: size `cfg.hbm_slice_bytes`.
- `_tcm`: size `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe -
tcm_scheduler_reserved_bytes`).
`alloc_hbm(nbytes) -> PhysAddr`:
- `_hbm.alloc(nbytes)` → offset.
- `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id,
pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes)`.
- Failure raises `AllocationError("HBM overflow ...")`.
`free_hbm(pa, nbytes)`:
- Recover PE-local offset via `pa.hbm_offset - pe_id *
cfg.hbm_slice_bytes`.
- `_hbm.free(offset, nbytes)`.
`alloc_tcm(nbytes) -> PhysAddr`: similar; uses `PhysAddr.pe_tcm_addr`.
`free_tcm(pa, nbytes)`: uses `pa.sub_offset` directly (TCM's PE-local
offset equals its sub_offset).
The allocator does not see the scheduler-reserved TCM region
(`cfg.tcm_scheduler_reserved_bytes`) — it is pre-subtracted from the
`_tcm` capacity. This is consistent with ADR-0014's PE_SCHEDULER
internal-buffer reservation.
### D4. `VirtualAllocator` — page-aligned first-fit + coalescing
`policy/address/va_allocator.py::VirtualAllocator`:
- Internal representation: same sorted `list[tuple[int, int]]` as
`_FreeList`. Initially `[(va_base, va_size)]`.
- `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
- `alloc(nbytes) -> int`:
1. `aligned = _align_up(nbytes)`.
2. First-fit a block with `size >= aligned`.
3. Take `aligned` from the block's front; remove if exact.
4. `_used += aligned`. Return the block's `start` (which is page-
aligned).
5. Failure → `VaAllocationError`.
- `free(va, nbytes)`: free `_align_up(nbytes)` worth. Coalesces with
the same algorithm as `_FreeList`.
`page_size` has different defaults in two places:
- `VirtualAllocator.__init__`'s parameter default: `2 MiB`. Direct-call
tests receive this.
- `RuntimeContext._ensure_allocators` when constructing the instance:
`pe_mmu.attrs.get("page_size", 4096)` — uses
`topology.yaml`'s `pe_mmu.attrs.page_size` if set, else falls back
to `4 KiB`.
The two defaults differ on purpose: `VirtualAllocator`'s standalone
default (`2 MiB`) aligns with ADR-0039's PE_MMU stopgap default for
direct-test ergonomics; the context fallback (`4 KiB`) is the safe
minimum when `topology.yaml` doesn't specify a page size. The
production path is always the latter (via `_ensure_allocators`), and
when `topology.yaml` sets `page_size`, that value flows consistently
into both the MMU and the VA allocator.
If consistency breaks (e.g., VirtualAllocator instantiated with a
page_size different from PE_MMU's), MMU `map()` falls into the
sub-page region mode (ADR-0039 D3).
VA range defaults: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64
GiB`. These are hardcoded in `_ensure_allocators` and have no
semantic meaning in ADR-0011's VA model — they simply reserve enough
device-wide space without colliding with host code.
### D5. Lifecycle of allocator instances
- `RuntimeContext._ensure_allocators` is lazy — called on the first
`_create_tensor`.
- The allocator dict (`self._allocators`) lives for the
RuntimeContext's lifetime. A second deploy in the same process
does not construct new objects.
- `RuntimeContext.cleanup()` walks living tensors and calls
`_free_tensor()`, which issues MMU unmaps + `va_allocator.free` +
`pemem_allocator.free_hbm` — restoring the free lists. A subsequent
RuntimeContext starts fresh.
This per-RuntimeContext isolation guarantees deterministic deploy →
cleanup → deploy sequences within a single process.
### D6. Allocator failure raises (no silent OOM)
Both `_FreeList.alloc` and `VirtualAllocator.alloc` raise
`AllocationError` / `VaAllocationError` when no block fits. The message
includes "required size + largest available block" to distinguish
fragmentation from true OOM.
A silent fallback (e.g., allocating only as much as the largest free
block) is strictly forbidden — a partially-allocated tensor reaching
SimPy would cause routing / DMA to see incorrect PAs and silently
corrupt simulation results.
### D7. One allocator per address space
Physical address spaces are separated by PhysAddr sub-units (ADR-0001
D2.3); each sub-unit gets its own allocator instance:
- HBM slice → `PEMemAllocator._hbm`.
- PE TCM → `PEMemAllocator._tcm`.
- (Currently unused) M_CPU local memory, CUBE SRAM → would need their
own allocators. Today these are handled as IPCQ-only slots (ADR-0023
D9.7) and do not share PA space, so no free-list exists for them.
When a cube-level SRAM allocator is needed,
`_FreeList(cfg.sram_bytes_per_cube)` is added per-cube
(`cfg.sram_bytes_per_cube` is already defined in `AddressConfig` —
the data model is ready).
## Alternatives Considered
### A1. Best-fit / buddy allocator
Rejected (currently). The workload's alloc/free pattern is stack-like
(deploy order ≈ free order), so first-fit + coalescing controls
fragmentation well enough. If long-running fragmentation appears in LLM
kernel sweeps, a buddy-allocator ADR will replace D1.
### A2. Add partial-overlap free validation
Rejected. D2's trust model plus the O(N) per-free cost makes this
unattractive. A debug mode (e.g., `KERNBENCH_DEBUG` env var) that
enables the check could be added later.
### A3. A unified allocator for VA and PA
Rejected. VA space (64 GiB device-wide) and PA space (per-slice ~6
GiB) have different semantic dimensions — VA is the kernel's view, PA
is the device sub-unit's view. ADR-0011's VA model (MMU maps between
the two) calls for separated allocators.
### A4. Multi-tier page sizes (large pages + small pages)
Rejected (currently). A single page size (2 MiB) matches LLM kernel
tensor sizes (a few MiB to GiB); smaller mappings are absorbed by
ADR-0039 D3's sub-page region mode. Multi-tier paging would require
extending the MMU model itself — a separate ADR candidate.
## Consequences
- The allocator algorithm is pinned at ADR level (D1, D3, D4), so any
future simulation scenario hitting fragmentation has a clear "we're
using first-fit + coalescing" anchor to inspect.
- D2's trust model is explicit, so any future code path that exposes
alloc/free to direct user input will trigger this ADR's supersession
early.
- D7's one-allocator-per-sub-unit mapping is on record, so when M_CPU
or SRAM need their own free-list, the addition point is obvious.
- D4 captures the page_size dual-default and its production path
(`_ensure_allocators` always wins), letting future `topology.yaml`
`page_size` changes be assessed against ADR-0039's stopgap
interaction quickly.
+247
View File
@@ -0,0 +1,247 @@
# ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
## Status
Accepted (2026-05-22).
Pins down the traffic-pattern catalog, formula-vs-actual comparison, and
invariant checks (monotonicity, D2H ≥ H2D, etc.) exposed by
`probes/probe.py::run_probe(...)`. ADR-0010 (CLI surface) enumerates the
`kernbench probe` subcommand, but **what probe actually measures** and
**which invariants it judges PASS/FAIL** had no ADR-level coverage.
## First action
`run_probe(topology_path, case_filter=None)` performs four startup steps:
1. `Path(topology_path).expanduser().resolve()` → absolute path.
2. `load_topology(path)``TopologyGraph` (graph + spec).
3. `_build_edge_map(graph)` → a `{(src, dst): Edge}` lookup table.
4. Instantiate `AddressResolver(graph)` + `PathRouter(graph)`.
Then it sets `nbytes = 32768` (= 32 KiB, the summary-table reference
size) and `show_all = (case_filter is None or case_filter == "all")`.
In short, **probe's first act is "load the topology once and prepare
edge map / resolver / router, plus pin 32 KiB as the standard measurement
size"**. After that, the H2D → D2H → PE DMA categories execute in
separate `GraphEngine` instances (no cross-talk between cases).
## Context
`kernbench probe` was introduced as a verification tool for these
purposes:
- **Manual ground truth**: when a real-simulation result (`kernbench run
--bench ...`) shows abnormal latency, derive the answer for a simple
traffic pattern in isolation and compare.
- **Formula vs actual**: check whether the analytical model
(wire latency + overhead + drain) matches the simulator's
`total_ns`. A mismatch points to which simplifying assumption in
ADR-0033 is missing.
- **Monotonicity check**: latency should grow monotonically with hop
count.
- **Utilization sweep**: a BW-utilization table across data sizes
(4 KiB ~ 1 MiB).
Without an ADR for this tool:
- Adding a new traffic-pattern category (e.g., MCpuDma, IPCQ) is hard
because the table format / measurement units of existing categories
aren't documented at the ADR level.
- The basis for the monotonicity check (hop count? cube distance? wire
length?) is ambiguous.
- The reference size 32 KiB and the sweep `[4 KiB, 16 KiB, 64 KiB, 256
KiB, 1 MiB]` are only discoverable by reading source.
## Decision
### D1. Three case categories — H2D / D2H / PE DMA
Each category has a distinct data path in the topology and gets its own
summary table + sweep table + route-detail block.
- **H2D (Host → Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube,
dst_pe=0, pattern="zero")` flows along `pcie_ep → io_cpu → m_cpu →
hbm_ctrl`. The cube index varies the hop count:
- h2d-1hop: cube=0, hops=1
- h2d-2hop: cube=4, hops=2
- h2d-3hop: cube=8, hops=3
- h2d-4hop: cube=12, hops=4
- **D2H (Device → Host Read)**: `MemoryReadMsg(src_sip=0, src_cube,
src_pe=0)`. Total latency = forward command path + reverse data path.
Same 4-hops category as H2D.
- **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe,
dst_pa)`. Five cases cover varying cube/PE positions:
- pe-local-hbm: same cube, same PE
- pe-same-half-hbm: same cube, different PE (PE 1)
- pe-cross-half-hbm: same cube, far PE (PE 4)
- pe-cross-cube-hbm-best: adjacent cube (cube 1)
- pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
The cube indices 4/8/12 (H2D) and 1/4/15 (PE DMA) are meaningful for a
4 × 4 cube mesh (`sip.cube_mesh.w=4, h=4`); changes to the mesh size
require these to be updated in lockstep.
### D2. Standard measurement size — `nbytes = 32768` (32 KiB)
Every case in the summary table runs once with `nbytes=32768`. 32 KiB
was chosen because:
- DMA overhead and BW drain are balanced — neither dominates.
- It compares cleanly against the one-shot transfer size of several
sub-units (TCM, register file).
Per-size utilization variations are shown in a separate sweep table
(D3).
### D3. Utilization sweep — `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`
`SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`,
`SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]`. Per size:
```
drain = nbytes / bottleneck_bw
total = overhead + wire + drain
eff_bw = nbytes / total
util% = eff_bw / bottleneck_bw × 100
```
When `bn_bw is None or <= 0`, the column shows 0.0 %. The intent: the
table shows in one view how small transfers become overhead-bound and
large transfers become drain-bound as hop count rises.
### D4. Measured columns — actual / formula / breakdown
Per-case columns:
- `Actual` (total_ns): the SimPy run's `trace["total_ns"]`.
- `Ovhd`: sum of `node.attrs["overhead_ns"]` along the path (formula).
- `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
- `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
- `Ovhd%` / `Drain%`: each portion as a percentage of Actual. Wire is
usually too small to display.
- `Eff.BW`: `nbytes / total_ns` (measured BW).
- `BN.BW`: bottleneck bandwidth (formula). The minimum edge BW along
the path. Missing edge BW shows "-".
- `Util%`: `Eff.BW / BN.BW × 100`. 100 % means the single-stream BW
upper bound is reached.
A large gap between the formula sum (`wire + ovhd + drain`) and Actual
signals a factor the simplified model misses (a place to inspect
ADR-0033's assumptions).
### D5. Automatic invariant checks — PASS/FAIL
The following invariants are reported with `[v] PASS` / `[x] FAIL`:
- **H2D / D2H monotonic increase**: as hop count rises, actual latency
must grow monotonically. `all(lats[i] < lats[i+1] for ...)`.
- **D2H ≥ H2D**: for the same hop index, D2H ≥ H2D (D2H has both
forward command and reverse data legs). `all(d2h[i].total >=
h2d[i].total)`.
- **PE DMA best < worst**: cross-cube best (adjacent) latency must be
less than cross-cube worst (diagonal).
- **PE DMA local vs remote**: prints the local BN BW vs remote BN BW
side-by-side (informational, not PASS/FAIL).
When a check fails, a single clear line surfaces the regression for
human review.
### D6. Route detail — per-hop timestamp trace
After the summary and sweep tables, each case's path and cumulative
per-hop timestamps (`_hop_timestamps`) appear in a separate section:
- H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3
(`m_cpu → hbm_ctrl`) + per-hop trace.
- D2H: forward (cmd, no data) and reverse (data) traces shown
separately.
- PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.
Each hop's timestamp is cumulative `wire_ns + overhead_ns`. The
terminal hop's annotation appends `drain:Xns`. Bottleneck edges are
marked `<BN:XXGB/s>` so they are visually identifiable.
### D7. Semantics of the `case_filter` argument
- `None` or `"all"`: run all cases (default).
- Other strings: run only the case whose name matches exactly. Example:
`kernbench probe --case h2d-2hop`.
Within a category, cases with `name != case_filter` are skipped; if
only one data point remains, the category's monotonicity / D2H ≥ H2D
comparisons are naturally skipped.
The CLI parser's `--case` default is `"all"`, so omitting it runs
everything.
### D8. Fresh GraphEngine per case
Each of the 4 H2D, 4 D2H, and 5 PE DMA cases runs in **its own
GraphEngine** (`engine = GraphEngine(graph)`). Reasons:
- Isolate accumulated state (op_log, completion tracking, allocators)
so cases do not cross-talk.
- Guarantee one case's traffic does not perturb another case's BW
measurement.
This isolation lets probe results be interpreted as **single-flow**
per-case latency. Multi-flow contention measurement is handled by
separate tooling (e.g., the `pe2pe_overview` plot or ADR-0033's
multi-flow merging model).
### D9. Output-format stability
probe's stdout is meant for humans; precise column widths, separators,
and whitespace are **not** a machine-readable contract. Automated tools
that wish to parse probe output should use a separate JSON-output mode
(not yet implemented).
The `[v]` / `[x]` prefix on PASS/FAIL lines is a stable CI grep anchor.
## Alternatives Considered
### A1. Register probe as another bench (`@bench(name="probe")`)
Rejected. probe is a verification tool, not a bench — multi-engine
execution for sweeps/analysis and PASS/FAIL invariant output are
essential, none of which fits ADR-0045's "single device + single
RuntimeContext" bench model.
### A2. Exit code 1 on monotonicity violation
Rejected (currently). probe is positioned as a human inspection tool —
PASS/FAIL is printed and exit is 0. A wrapper can `grep "\[x\]"` to
decide. A future `--strict` flag could opt into non-zero exits.
### A3. Externalize the case catalog to YAML
Rejected (currently). The 8 cases (4 H2D + 4 D2H + 5 PE DMA = 13 total)
are hardcoded and their semantics are tightly bound to the mesh
topology. Moving cube-index meaning (4, 8, 12 / 1, 4, 15) into YAML
would require separate documentation and lose cohesion. Externalize
only when case additions become frequent.
### A4. Add multi-flow contention measurement
Rejected (out of probe scope). D8's single-flow isolation is probe's
core intent. Multi-flow contention belongs in a different area of the
ADR-0033 latency model — either a separate tool or a new case
category.
## Consequences
- probe's case catalog (D1) and measurement units (D2/D3) are pinned at
ADR level, so new traffic categories know which table format to
follow.
- The semantics of the formula-vs-actual columns (D4) are locked in, so
questions like "why is Drain% 5 % or 70 %?" can quickly be linked to
ADR-0033 assumption checks.
- Automatic invariant checks (D5) are pinned, so latency-model changes
immediately catch monotonicity / D2H ≥ H2D regressions.
- D8's case-isolation is explicit, so probe results are safe to read as
single-flow measurements. If multi-flow is needed, a separate tool
track is clearly required.
- A2's strict-mode flag is recorded as a follow-up so CI integration
has a minimal change path when requested.