adr: add ADR-0046-0049 — close G4 coverage gaps from /report
Documents four cross-cutting surfaces that previously had no ADR backing, each surfaced as a G4 candidate by /report: - 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...), the two execution modes (command-list vs greenlet runner), scratch allocator semantics, dispatch-overhead model, and the kernel registry. - 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group (backend="ahbm") install path. world_size priority (algorithm > defaults > topology), the 4-step init sequence (load ccl.yaml, import algorithm module, derive world_size, install SFR + IPCQ), greenlet- local rank registry, all_reduce dispatch via _defer_wait, barrier no-op rationale, and the explicit list of unsupported dist.* APIs. - 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator free-list semantics. Offset-keyed first-fit with coalescing, the no-validation trust model for free(), HBM/TCM channel separation, page-aligned VA allocation, the page_size dual-default (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and one-allocator-per-sub-unit rule. - 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog. H2D / D2H / PE DMA categories with their exact cube-index choices, the 32 KiB reference size, the 5-point utilization sweep, the formula vs actual column meanings, automatic invariant checks (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine isolation, and the human-readable (not machine-parsable) output contract. Bilingual pair verifier passes for all four EN/KO pairs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,327 @@
|
||||
# ADR-0046: TLContext — Kernel-side `tl.*` API Contract
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
Documents the set of `tl.*` primitives exposed by
|
||||
`src/kernbench/triton_emu/`'s `TLContext`, their semantics, and the two
|
||||
execution-mode contracts (command-list / greenlet runner). ADR-0014/0020
|
||||
defines the PE pipeline and the 2-pass execution model, but **the `tl.*`
|
||||
surface that bench kernel functions call** had no ADR-level coverage.
|
||||
|
||||
## First action
|
||||
|
||||
When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id,
|
||||
num_cubes, scratch_base, scratch_size)` is instantiated, the first action
|
||||
is to initialize six categories of state:
|
||||
|
||||
- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes` —
|
||||
values that `tl.program_id` / `tl.num_programs` will return.
|
||||
- `self._dispatch_cycles` — cycle count emitted as `PeCpuOverheadCmd(cycles)`
|
||||
at the start of every `tl.*` API call.
|
||||
- `self._runner` — `KernelRunner` instance (present → greenlet mode;
|
||||
absent → command-list mode).
|
||||
- `self._commands: list[PeCommand] = []` — command-list accumulator
|
||||
(command-list mode only).
|
||||
- `self._handle_counter = 0`, `self._completion_counter = 0` — counters
|
||||
for generating TensorHandle / CompletionHandle ids.
|
||||
- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0` —
|
||||
PE-local scratch region (used for math/dot/composite output handle
|
||||
addresses).
|
||||
|
||||
In short, **TLContext's first act is "record where (sip/cube/pe) and at
|
||||
what scale (num_programs/num_cubes) this kernel instance runs, and pick
|
||||
its dispatch mode (runner present or not)"**. No SimPy event is created
|
||||
and no command is emitted at this moment.
|
||||
|
||||
The runtime first action happens when the kernel function first calls a
|
||||
`tl.<api>()`. The standard entry for every `tl.*` API is:
|
||||
|
||||
1. Call `self._emit_dispatch_overhead()` — if `dispatch_cycles > 0`,
|
||||
immediately `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
|
||||
2. Per-API processing (TensorHandle creation, command construction).
|
||||
3. `self._emit(cmd)` — in runner mode this `greenlet.switch()`es the cmd
|
||||
to SimPy; in command-list mode it appends to `self._commands`.
|
||||
|
||||
## Context
|
||||
|
||||
The `tl.*` surface consists of `TLContext`'s methods, and the `tl`
|
||||
parameter received by a kernel function is one of these objects. The
|
||||
contract the user (bench author) sees:
|
||||
|
||||
- Which primitives exist.
|
||||
- What data flow each primitive triggers (DMA / compute / IPCQ /
|
||||
metadata-only).
|
||||
- How a TensorHandle's `space` and `addr` are decided.
|
||||
- The difference between command-list and greenlet modes.
|
||||
|
||||
ADR-0014 (PE pipeline) defines the PeCommands consumed by PE_SCHEDULER,
|
||||
but how `tl.*` emits them is a code-only convention. ADR-0020 (2-pass
|
||||
data execution) mentions greenlet mode in D3 but does not pin down the
|
||||
signature difference (return-value handling) between the runner /
|
||||
non-runner paths. This ADR fills the gap.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. The `tl` parameter is a `TLContext` instance
|
||||
|
||||
A bench kernel function has the signature:
|
||||
|
||||
```python
|
||||
def _kernel(arg1, arg2, ..., tl, **kwargs):
|
||||
...
|
||||
```
|
||||
|
||||
`tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name
|
||||
imitates real Triton's `triton.language` module; the actual Triton
|
||||
module is **not** passed in.
|
||||
|
||||
The kernel is plain Python — no `yield` or `async`. `tl.*` calls produce
|
||||
SimPy events, but to the caller they appear synchronous because in
|
||||
greenlet mode the KernelRunner relays between SimPy and the kernel
|
||||
(ADR-0020 D3).
|
||||
|
||||
### D2. Two execution modes — command-list / greenlet runner
|
||||
|
||||
- **Command-list mode (`runner is None`)**: `tl.*` calls append PeCommand
|
||||
to `self._commands`. DMA / GEMM / Math consume no SimPy time and return
|
||||
metadata-only TensorHandles (`data=None`). PE_SCHEDULER / sim_engine
|
||||
later replays the command sequence in time.
|
||||
|
||||
- **Greenlet runner mode (`runner is not None`)**: `tl.*` calls
|
||||
`self._emit(cmd)` → `runner.switch_to_simpy(cmd)`, handing control to
|
||||
the parent greenlet (SimPy). The parent distributes the cmd to
|
||||
components, consumes SimPy time, and (for DMA reads) returns real numpy
|
||||
data. The kernel receives the result and continues to the next line
|
||||
(the data-aware execution model from ADR-0020 D3).
|
||||
|
||||
The choice of mode is decided by whether a KernelRunner is injected into
|
||||
the TLContext. The `tl.*` methods themselves are mode-blind — they go
|
||||
through `_emit()` uniformly.
|
||||
|
||||
### D3. Primitive categories
|
||||
|
||||
#### D3.1. Reference (no DMA, metadata only)
|
||||
|
||||
- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: create a handle
|
||||
referencing HBM data without issuing DMA. Used when the scheduler
|
||||
streams the data per-tile (e.g., the b operand of a composite GEMM).
|
||||
|
||||
#### D3.2. Data movement (blocking, DMA engine)
|
||||
|
||||
- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → handle.
|
||||
Emits `DmaReadCmd`. In greenlet mode the returned handle's `.data`
|
||||
carries real numpy data; in command-list mode it is a placeholder.
|
||||
The handle has `space="hbm"`, `pinned=True`.
|
||||
- `tl.store(ptr, handle) -> None`: TCM → HBM. Emits `DmaWriteCmd`. In
|
||||
greenlet mode, when `handle.data` is present, `_store.write("hbm",
|
||||
ptr, data)` runs first (visibility = issue time, ADR-0020 D3).
|
||||
|
||||
#### D3.3. GEMM / compute (blocking)
|
||||
|
||||
- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must live in
|
||||
TCM; shapes `(M,K) × (K,N) → (M,N)`. Emits `GemmCmd`; the output
|
||||
handle is allocated from PE-local scratch via
|
||||
`_make_compute_out(shape, dtype)`.
|
||||
- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
|
||||
acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking
|
||||
tiled pipeline. Emits `CompositeCmd`. `epilogue` is a list of dicts,
|
||||
each with `"op"` plus op-specific fields and an optional `"scope"`
|
||||
(k_tile / output_tile). Unknown ops or missing fields raise
|
||||
ValueError immediately. The returned CompletionHandle synchronizes
|
||||
via `tl.wait(h)`.
|
||||
|
||||
#### D3.4. Math: unary (blocking)
|
||||
|
||||
- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
|
||||
`tl.cos(x)`, `tl.sin(x)` — each emits `MathCmd(op=<name>,
|
||||
inputs=(x,), out=)`. `out` is scratch-allocated with the same
|
||||
shape/dtype as `x`.
|
||||
|
||||
#### D3.5. Math: binary (blocking)
|
||||
|
||||
- `tl.maximum(a, b)`, `tl.minimum(a, b)` — `_binary_math`.
|
||||
- `tl.fma(a, b, c)` — `a*b + c`. Three inputs.
|
||||
- `tl.clamp(x, min, max)` — `MathCmd(op="clamp", inputs=(x, min, max))`.
|
||||
- `tl.where(cond, a, b)` — `MathCmd(op="where", inputs=(cond, a, b))`.
|
||||
- `tl.softmax(x, axis=-1)` — a single `MathCmd(op="softmax")` so timing
|
||||
accounts at one dispatch. Phase 2 DataExecutor expands it to the
|
||||
canonical (x-max → exp → sum → div) sequence.
|
||||
|
||||
#### D3.6. Reduction (blocking)
|
||||
|
||||
- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)` — return an
|
||||
output handle with the axis size collapsed to 1. Emit
|
||||
`MathCmd(op=<name>, inputs=(x,), out=, axis=axis)`.
|
||||
|
||||
#### D3.7. Index / scalar (PE_CPU, no engine)
|
||||
|
||||
- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE
|
||||
index), `axis==1` → cube_id (ADR-0022).
|
||||
- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per
|
||||
cube), `axis==1` → num_cubes.
|
||||
- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range
|
||||
in TCM. No command emitted.
|
||||
- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape,
|
||||
value, dtype="f16") -> TensorHandle`: TCM placeholder. No command
|
||||
emitted.
|
||||
|
||||
#### D3.8. Scalar helpers (no command, no engine)
|
||||
|
||||
- `TLContext.cdiv(a, b) -> int` (static): ceiling division
|
||||
`-(-a // b)`. Mirrors real Triton's `tl.cdiv`.
|
||||
|
||||
#### D3.9. Metadata-only (no compute, no DMA)
|
||||
|
||||
- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims
|
||||
swapped. Shares `addr` and `data`; no command emitted.
|
||||
|
||||
#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
|
||||
|
||||
- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
|
||||
dtype="f16", space="tcm") -> None`: blocking send. Accepts either
|
||||
handle form or raw-address form. Emits `IpcqSendCmd`. The handle's
|
||||
`.data` snapshot rides along on the command — avoiding the race
|
||||
where a later inbound IPCQ overwrites the slot before the outbound
|
||||
PE_DMA reads it.
|
||||
- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
|
||||
dst_space=None) -> TensorHandle`: blocking recv. Providing both
|
||||
`dst_addr` and `dst_space` enters "copy_to_dst" mode; otherwise
|
||||
"return_slot" mode. In greenlet mode the handle's `.data` carries
|
||||
the real data.
|
||||
- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
|
||||
**DIAGNOSTIC ONLY**. Has the same blocking-arrival semantics as
|
||||
`tl.recv` but skips the slot-read latency charge (slot-IO + PE↔bank
|
||||
fabric drain). Used in the pe2pe overview plot for an apples-to-apples
|
||||
comparison against `tl.store`. Production kernels MUST NOT use it —
|
||||
the diagnostic flag is isolated in its own command branch
|
||||
(`consume=False`) so it cannot be accidentally enabled.
|
||||
- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
|
||||
recv. Returns a `RecvFuture`; resolved later by `tl.wait(future)`.
|
||||
|
||||
#### D3.11. Composite + control
|
||||
|
||||
- `tl.composite(...)`: see D3.3.
|
||||
- `tl.wait(handle=None)`: wait on a `CompletionHandle` (composite), a
|
||||
`RecvFuture` (async recv), or `None` (all pending composites).
|
||||
- `tl.cycles(n)`: declare a scalar PE_CPU overhead. Emits
|
||||
`PeCpuOverheadCmd(cycles=n)`.
|
||||
|
||||
### D4. TensorHandle arithmetic operators — thread-local TLContext
|
||||
|
||||
At module load, `tl_context.py::_enable_tensor_ops()` runs and patches
|
||||
`TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__`. Each
|
||||
operator calls `_binary_math` on the active TLContext stored in a
|
||||
module-level thread-local `_ctx`.
|
||||
|
||||
So inside a kernel, `c = a + b` is equivalent to emitting
|
||||
`MathCmd(op="add", inputs=(a, b), out=)` and returning a new
|
||||
TensorHandle.
|
||||
|
||||
Active-TLContext management:
|
||||
|
||||
- `TLContext._set_active(ctx)`: set the active ctx for the current
|
||||
thread/greenlet.
|
||||
- `TLContext._get_active()`: read it (RuntimeError if unset).
|
||||
- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets active
|
||||
on entry, runs the kernel, restores `None` on exit.
|
||||
|
||||
`KernelRunner` re-asserts `_set_active(tl)` inside its `_switch_kernel`
|
||||
just before resuming the kernel, so a sibling PE runner that overwrote
|
||||
the thread-local context is correctly recovered.
|
||||
|
||||
### D5. Scratch allocator — compute output handles
|
||||
|
||||
Ops that produce a result — `tl.dot`, `tl.exp`, `tl.add` (via
|
||||
TensorHandle `__add__`), etc. — call `_make_compute_out(shape, dtype)`
|
||||
to obtain a 16-byte-aligned scratch address. The address is published
|
||||
with `space="tcm"`, so the handle can later be the source of a
|
||||
`tl.send` / `tl.store`.
|
||||
|
||||
When `_scratch_base == 0` (e.g., command-list mode), the address is 0
|
||||
and the handle cannot be a send/store source (in that case, only
|
||||
`tl.load`-returned handles are valid sources).
|
||||
|
||||
When the cursor exceeds `_scratch_size` (default 1 MiB), a
|
||||
RuntimeError is raised. The cursor must reset between kernel
|
||||
invocations (current code naturally satisfies this: KernelRunner
|
||||
creates a fresh TLContext each time).
|
||||
|
||||
### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`
|
||||
|
||||
Every non-metadata `tl.*` call starts with `_emit_dispatch_overhead()`,
|
||||
which — when `dispatch_cycles > 0` — emits
|
||||
`PeCpuOverheadCmd(dispatch_cycles)`. This models the cycles PE_CPU
|
||||
spends dispatching the command.
|
||||
|
||||
Defaults:
|
||||
|
||||
- `TLContext.__init__`'s `dispatch_cycles` parameter default: `1` cycle.
|
||||
- TLContext built by `KernelRunner`: `0` cycles (greenlet mode handles
|
||||
cycle accounting differently — aligned with ADR-0020 D3 intent).
|
||||
|
||||
### D7. Kernel registry (`triton_emu/registry.py`)
|
||||
|
||||
A separate `_kernels: dict[str, Callable]` holds the name → function
|
||||
mapping:
|
||||
|
||||
- `register_kernel(name, fn)`: ValueError on duplicate.
|
||||
- `get_kernel(name)`: KeyError if missing.
|
||||
- `clear_registry()`: test-only.
|
||||
|
||||
`RuntimeContext.launch(kernel_name, kernel_fn, *args)` overwrites
|
||||
`_kernels[kernel_name] = kernel_fn` on every call (last-call-wins,
|
||||
idempotent) — consistent with ADR-0045 D8's `launch` behavior.
|
||||
|
||||
PE_CPU looks up `KernelRef.name` in the registry and runs the function
|
||||
through KernelRunner.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Fold `tl.*` into ADR-0014 / ADR-0020
|
||||
|
||||
Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption
|
||||
of PeCommands); ADR-0020 covers 2-pass execution (Phase 1 timing /
|
||||
Phase 2 data). The `tl.*` surface is what the kernel author touches; a
|
||||
dedicated ADR improves findability and onboarding.
|
||||
|
||||
### A2. Deprecate command-list mode
|
||||
|
||||
Rejected (currently). Simple unit tests and kernel verification benefit
|
||||
from the lighter command-list path — it exposes a PeCommand sequence
|
||||
inspector without requiring greenlet machinery. When greenlet-mode
|
||||
semantics (real data, Phase 2) are needed, D2 explicitly selects them.
|
||||
|
||||
### A3. Remove TensorHandle arithmetic operators
|
||||
|
||||
Rejected. They mimic real Triton kernel ergonomics (e.g., `c = a + b`),
|
||||
and the thread-local active-ctx pattern works cleanly. The explicit
|
||||
function-form (`tl.add(a, b)`) is also exposed in D3.5, so the
|
||||
operators are syntactic sugar.
|
||||
|
||||
### A4. Expand softmax into the explicit sequence (max → exp → sum → div)
|
||||
|
||||
Partially adopted. `tl.softmax` is a single `MathCmd(op="softmax")` for
|
||||
timing accounting (D3.5), but Phase 2 DataExecutor expands it to the
|
||||
canonical sequence for real-data computation. Timing model atomic,
|
||||
data model expanded — the two split intentionally.
|
||||
|
||||
## Consequences
|
||||
|
||||
- Every `tl.*` primitive a bench author meets is classified and defined
|
||||
in a single ADR. Paired with ADR-0045 D8's host-side surface
|
||||
(`torch.empty` etc.), the inside-kernel and outside-kernel authoring
|
||||
guides are now complete.
|
||||
- The command-list / greenlet difference is pinned in D2, so any new
|
||||
`tl.*` primitive that follows the `_emit()` pattern auto-supports
|
||||
both modes.
|
||||
- The thread-local active-ctx pattern (D4) is justified at ADR level,
|
||||
clarifying who owns the reset responsibility when multiple PE
|
||||
runners share a thread (KernelRunner.run's contract restores active
|
||||
inside `_switch_kernel`).
|
||||
- `tl.recv_no_consume`'s diagnostic isolation (D3.10) is hardened in
|
||||
ADR form — accidental production use is blocked by a separate
|
||||
command branch.
|
||||
- The registry (D7) gets its own D-section, formalizing the
|
||||
name-collision and dynamic-re-registration semantics.
|
||||
@@ -0,0 +1,259 @@
|
||||
# ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
Pins down what `runtime_api/distributed.py`'s `AhbmCCLBackend` +
|
||||
`DistributedContext` actually install — i.e., the entry point
|
||||
`torch.distributed.init_process_group(backend="ahbm")` — and how
|
||||
`all_reduce`/`barrier`/`get_rank` etc. are implemented. ADR-0023 D11
|
||||
mentions the "torch.distributed compatibility" intent, but **the backend
|
||||
itself** had no ADR-level coverage.
|
||||
|
||||
## First action
|
||||
|
||||
`RuntimeContext.__post_init__` automatically constructs a
|
||||
`DistributedContext()` and attaches it to `self.distributed`. The first
|
||||
action at that moment:
|
||||
|
||||
1. `self._backend: AhbmCCLBackend | None = None` — uninitialized.
|
||||
2. `self._rank_by_greenlet: dict = {}` — greenlet-local rank registry
|
||||
(ADR-0024 D2).
|
||||
3. The caller (RuntimeContext) sets `dc._ctx_ref = self` so subsequent
|
||||
`init_process_group` can reach `ctx.engine` / `ctx.spec` / `ctx.launch`.
|
||||
|
||||
In short, **DistributedContext's first act is "attach to RuntimeContext
|
||||
with a back-reference and leave the backend slot empty"**. Actual
|
||||
backend installation (IPCQ install, world_size derivation, algorithm
|
||||
module import) happens only when user code calls
|
||||
`torch.distributed.init_process_group(backend="ahbm")`.
|
||||
|
||||
At that moment, `init_process_group`'s first action is:
|
||||
|
||||
1. If `backend != "ahbm"`, raise `ValueError("Unsupported backend ...")`
|
||||
immediately.
|
||||
2. If `getattr(self, "_ctx_ref", None)` is None,
|
||||
`RuntimeError("DistributedContext not bound to a RuntimeContext")`.
|
||||
3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)` — inside this
|
||||
constructor, ccl.yaml is loaded, the algorithm module is imported,
|
||||
world_size is derived, SFR is configured, and IPCQ is installed.
|
||||
4. `self._backend._dist_ctx = self` — the backend gets a back-reference
|
||||
so it can read `_rank_by_greenlet`.
|
||||
|
||||
## Context
|
||||
|
||||
The `AhbmCCLBackend` exists so that PyTorch DDP collective calls
|
||||
(`init_process_group`, `all_reduce`, etc.) work unchanged and bench code
|
||||
reads identically to a real DDP training script (in line with
|
||||
ADR-0024 + ADR-0027's launcher model).
|
||||
|
||||
The backend's responsibilities:
|
||||
|
||||
- At `init_process_group` time, install the **IPCQ neighbor table once**
|
||||
(analogous to NCCL communicator creation).
|
||||
- For each `all_reduce(tensor, op="sum")`, dispatch the configured
|
||||
algorithm's kernel function via `ctx.launch(...)`.
|
||||
- Answer `get_world_size` / `get_rank` consistently from the
|
||||
greenlet-local rank registry plus ccl.yaml/topology.
|
||||
|
||||
ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) touch
|
||||
parts of this, but **the backend's own responsibility scope and decision
|
||||
order** are not pinned anywhere. This ADR fills that gap.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. The backend is created only at `init_process_group(backend="ahbm")` time
|
||||
|
||||
`DistributedContext` starts with `_backend = None`. The backend object
|
||||
does not exist until the user calls
|
||||
`dist.init_process_group(backend="ahbm")`. Any other API
|
||||
(`is_initialized`, `get_world_size`, `all_reduce`, `barrier`) called
|
||||
while `_backend` is None raises
|
||||
`RuntimeError("Default process group has not been initialized...")` via
|
||||
the `_ensure_initialized` helper.
|
||||
|
||||
`backend != "ahbm"` raises `ValueError` immediately. Other backend names
|
||||
(`nccl`, `gloo`, etc.) are not recognized.
|
||||
|
||||
### D2. world_size resolution priority — algorithm > defaults > topology
|
||||
|
||||
`AhbmCCLBackend._resolve_world_size` (ADR-0024 D1):
|
||||
|
||||
1. If `ccl.yaml`'s algorithm entry has `world_size`, use it.
|
||||
2. Else if `defaults.world_size` is set, use it.
|
||||
3. Else fall back to `spec.system.sips.count` (the topology's SIP count).
|
||||
|
||||
The default interpretation is **rank = SIP** (ADR-0024). Cube/PE-level
|
||||
parallelism is expressed inside each rank via DPPolicy and does not
|
||||
affect world_size. An explicit `ccl.yaml` override is preserved for the
|
||||
legacy "rank = flat PE index" test path.
|
||||
|
||||
User arguments to `init_process_group(world_size=..., rank=...)` are
|
||||
**accepted but ignored** (same as real PyTorch's `RANK` / `WORLD_SIZE`
|
||||
env vars).
|
||||
|
||||
### D3. `init_process_group` performs four installation steps
|
||||
|
||||
Inside `AhbmCCLBackend.__init__`, in order:
|
||||
|
||||
1. **Load ccl.yaml**: `kernbench.ccl.install.load_ccl_config()` →
|
||||
`resolve_algorithm_config(_cfg_all)` produces the merged config for
|
||||
`defaults.algorithm` (or the user-specified algorithm).
|
||||
2. **Import algorithm module**:
|
||||
`importlib.import_module(self._merged["module"])`. The module must
|
||||
expose a `kernel` function, a `kernel_args(world_size, n_elem,
|
||||
cube_w, cube_h)` helper, and optionally a `TOPO_NAME_TO_KIND` map.
|
||||
3. **Resolve world_size** (D2).
|
||||
4. **Collect topology metadata** from `spec`: `n_sips`, `sip_topo`
|
||||
(`ring_1d` default), `cube_w`/`cube_h`, `sips.w`/`sips.h`. When the
|
||||
SIP topology is not `ring_1d`, derive `_sip_topo_w/h` from explicit
|
||||
`w`/`h` or via square-root (require `w*h == n_sips`). Mismatch raises
|
||||
`ValueError`.
|
||||
5. **Install SFR + IPCQ**:
|
||||
`kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine,
|
||||
spec, self._merged)`. This pushes IPCQ neighbor tables to every
|
||||
SIP/cube's pe0 (one-time setup analogous to NCCL communicator
|
||||
creation).
|
||||
|
||||
If the order changes (e.g., SFR runs before the algorithm module
|
||||
loads), partial initialization can result. So D3 is treated as an
|
||||
atomic 4-step block — on failure the backend remains uninstalled.
|
||||
|
||||
### D4. Greenlet-local rank binding (ADR-0024 D2)
|
||||
|
||||
`DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps spawned
|
||||
worker greenlets to their ranks. When the bench launcher (e.g.,
|
||||
`torch.multiprocessing.spawn`) spawns a worker, it registers via
|
||||
`dc._bind_rank(g, rank)`.
|
||||
|
||||
`get_rank()` looks up `getcurrent()`'s greenlet. Unregistered greenlets
|
||||
fall back to 0 — preserves single-driver / test compatibility.
|
||||
|
||||
The backend reads the current greenlet's rank from
|
||||
`_dist_ctx._rank_by_greenlet` during `all_reduce` (D5).
|
||||
|
||||
### D5. `all_reduce(tensor, op="sum")` behavior
|
||||
|
||||
Validation:
|
||||
|
||||
- `op != "sum"` → `NotImplementedError`. Current kernels only
|
||||
implement add reduction.
|
||||
- `tensor._handle is None` → `RuntimeError("not deployed")`.
|
||||
- `tensor._handle.shards` empty → `RuntimeError("no shards")`.
|
||||
|
||||
Preparation:
|
||||
|
||||
- `n_elem = shards[0].nbytes // tensor.itemsize` — element count of a
|
||||
single shard.
|
||||
- `kernel_fn = self._algo_module.kernel` — the algorithm module's entry
|
||||
function (imported in D3).
|
||||
- Decide effective cube dims: if the first SIP has just 1 cube, use
|
||||
`(1, 1)`; otherwise use the topology's `cube_w`/`cube_h`. This
|
||||
naturally absorbs TP runs that use only a subset of cubes.
|
||||
- `kernel_args = self._algo_module.kernel_args(world_size, n_elem,
|
||||
cube_w, cube_h)` — the algorithm decides which arguments to pass to
|
||||
its kernel.
|
||||
|
||||
Dispatch:
|
||||
|
||||
- Resolve the current greenlet's rank via
|
||||
`_rank_by_greenlet.get(g, 0)`.
|
||||
- Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w,
|
||||
sip_topo_h)`.
|
||||
- `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor,
|
||||
*kernel_args, *extra_args, _defer_wait=True)` — `_defer_wait=True`
|
||||
delegates collective drain to the main scheduler (ADR-0027 D0.4).
|
||||
|
||||
Drain:
|
||||
|
||||
- If the parent greenlet is alive (multi-greenlet mode), enqueue
|
||||
`_pending_collective_handles` and switch to parent. The main
|
||||
scheduler drains after all ranks have launched.
|
||||
- If single-driver mode, drain inline:
|
||||
`for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
|
||||
|
||||
### D6. `barrier()` is a no-op (single-driver model)
|
||||
|
||||
kernbench runs all ranks as greenlets inside a single Python process,
|
||||
so no cross-process synchronization is needed. `barrier()` is callable
|
||||
but does no synchronization. Kept for real-PyTorch API compatibility so
|
||||
callers don't get `NotImplementedError`.
|
||||
|
||||
If multi-process kernbench (SimPy event loop per process) is introduced
|
||||
in the future, D6 needs a superseding ADR.
|
||||
|
||||
### D7. Semantics of `get_rank` / `get_world_size` / `get_backend`
|
||||
|
||||
- `get_rank()` (D4): the current greenlet's bound rank; unregistered → 0.
|
||||
- `get_world_size()` (D2): the world_size resolved by the backend in D3.
|
||||
- `get_backend()`: always the literal string `"ahbm"`. Calling before
|
||||
backend exists triggers `_ensure_initialized`'s RuntimeError.
|
||||
|
||||
Differences vs. real PyTorch:
|
||||
|
||||
- Real PyTorch `get_rank()` is a process-global value; here it is
|
||||
greenlet-local. Inside a spawned worker → the worker's rank; in the
|
||||
main thread → 0. Bench authors should expect meaningful ranks only
|
||||
inside worker functions.
|
||||
|
||||
### D8. Supported API surface (final)
|
||||
|
||||
`DistributedContext` exposes:
|
||||
|
||||
- `init_process_group(backend="ahbm", world_size=None, rank=None,
|
||||
**kwargs)`
|
||||
- `is_initialized() -> bool`
|
||||
- `get_world_size() -> int`
|
||||
- `get_rank() -> int`
|
||||
- `get_backend() -> str`
|
||||
- `all_reduce(tensor, op="sum") -> None`
|
||||
- `barrier() -> None`
|
||||
- (internal) `_bind_rank(g, rank)`
|
||||
|
||||
Other PyTorch distributed APIs (`broadcast`, `reduce`, `all_gather`,
|
||||
`gather`, `scatter`, point-to-point `send/recv`, etc.) are **not
|
||||
implemented**. Kernel-level expression is available via
|
||||
`tl.send`/`tl.recv` (ADR-0046 D3.10), but the `dist.*` surface does not
|
||||
expose them. If additional collectives are needed, add a paired
|
||||
(algorithm module, `DistributedContext` method) and extend D8.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Create the backend in `RuntimeContext.__init__`
|
||||
|
||||
Rejected. If `ccl.yaml` is missing or the algorithm module can't be
|
||||
imported, RuntimeContext construction would fail even when the bench
|
||||
does not use distributed features. Lazy creation at call time (D1) is
|
||||
the right semantics.
|
||||
|
||||
### A2. Always derive world_size from topology (no override)
|
||||
|
||||
Rejected. ADR-0024 D1's "explicit override" path is used by legacy
|
||||
tests. Diagnostic scenarios that define PE-level ranks within a single
|
||||
SIP also need this escape hatch.
|
||||
|
||||
### A3. Silent fallback for unsupported `op`
|
||||
|
||||
Rejected. If the user intends `op="prod"` / `"max"` / `"avg"` and silent
|
||||
`sum` runs instead, result validation gets very hard. Explicit
|
||||
`NotImplementedError` is safer.
|
||||
|
||||
### A4. Implement `barrier` as a SimPy event
|
||||
|
||||
Rejected (currently). With single-driver semantics there is no
|
||||
cross-process synchronization to express, so a no-op is meaningfully
|
||||
correct. A fake-barrier SimPy event would add code complexity for no
|
||||
semantic gain. Revisit when multi-process kernbench arrives.
|
||||
|
||||
## Consequences
|
||||
|
||||
- The 4-step installation (D3) for
|
||||
`torch.distributed.init_process_group(backend="ahbm")` is locked in,
|
||||
making clear where future collective algorithms must hook.
|
||||
- The priority order in D2 (algorithm > defaults > topology) makes the
|
||||
blast radius of ccl.yaml changes quickly knowable.
|
||||
- The no-op `barrier` (D6) is recorded so multi-process kernbench, if
|
||||
introduced, must explicitly supersede this ADR.
|
||||
- D8's list of unsupported APIs explicitly grounds the rejection
|
||||
message when users call, e.g., `dist.broadcast(...)`.
|
||||
@@ -0,0 +1,278 @@
|
||||
# ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
Pins down the free-list algorithm, page alignment, and coalescing rules
|
||||
used by `policy/address/allocator.py`'s `_FreeList` / `PEMemAllocator`
|
||||
and `va_allocator.py`'s `VirtualAllocator`. ADR-0001 (PhysAddr layout)
|
||||
and ADR-0011 (PA/VA/LA models) define the address schemes; the
|
||||
**allocation algorithms** had no ADR-level coverage.
|
||||
|
||||
## First action
|
||||
|
||||
### `_FreeList(capacity)`
|
||||
|
||||
On construction: `self._capacity = capacity`, `self._used = 0`,
|
||||
`self._free = [(0, capacity)]`. The first act is **establishing the
|
||||
entire region as one free block** — the tuple `(offset=0,
|
||||
size=capacity)` is the sole entry in the free list.
|
||||
|
||||
### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`
|
||||
|
||||
On construction, builds two `_FreeList`s:
|
||||
|
||||
- `self._hbm = _FreeList(cfg.hbm_slice_bytes)` — the size of this PE's
|
||||
HBM slice (`hbm_bytes_per_cube // hbm_slices_per_cube`).
|
||||
- `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)` — equals
|
||||
`tcm_bytes_per_pe - tcm_scheduler_reserved_bytes` (the scheduler
|
||||
reservation is pre-deducted).
|
||||
|
||||
So PEMemAllocator's first act is **constructing single-free-block
|
||||
HBM-slice and TCM regions for this PE**.
|
||||
|
||||
### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`
|
||||
|
||||
On construction: `self._va_base = va_base`, `self._va_size = va_size`,
|
||||
`self._page_size = page_size`, `self._used = 0`, `self._free =
|
||||
[(va_base, va_size)]`. The first act is **establishing one block from
|
||||
va_base to va_size and stashing page_size**.
|
||||
|
||||
## Context
|
||||
|
||||
`runtime_api/context.py::_ensure_allocators` builds the allocator set
|
||||
in these stages:
|
||||
|
||||
1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`,
|
||||
per-target_device SIP range, etc. from `spec`.
|
||||
2. Pack everything into a frozen `AddressConfig`.
|
||||
3. For every combination in the target SIP range × cubes × PEs,
|
||||
construct one `PEMemAllocator(sip, cube, pe, cfg)` instance.
|
||||
4. Construct one `VirtualAllocator(va_base=0x1_0000_0000, va_size=64
|
||||
GiB, page_size=pe_mmu.page_size)`.
|
||||
|
||||
Allocator responsibilities:
|
||||
|
||||
- **PEMemAllocator**: PA-space allocation in the PE-local HBM slice /
|
||||
TCM (including PhysAddr encoding).
|
||||
- **VirtualAllocator**: device-wide VA allocation, page-aligned.
|
||||
`RuntimeContext._create_tensor` then pushes VA → PA mappings to
|
||||
components via `MmuMapMsg`.
|
||||
|
||||
These algorithms are:
|
||||
|
||||
- **First-fit**, kept simple.
|
||||
- The free-block list is **sorted by start offset**.
|
||||
- On `free()`, **adjacent blocks coalesce**.
|
||||
|
||||
The rationale was not documented anywhere, so when someone asks "why
|
||||
not best-fit?", "why not a buddy allocator?", "why does partial-overlap
|
||||
free pass silently?", there was no anchor to answer from. This ADR
|
||||
provides it.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. `_FreeList` — offset-keyed first-fit + coalescing
|
||||
|
||||
`policy/address/allocator.py::_FreeList`:
|
||||
|
||||
- Internal representation: `list[tuple[int, int]] = [(start_offset,
|
||||
size), ...]` — sorted by start offset.
|
||||
- `alloc(nbytes)`:
|
||||
1. Iterate the free list from the front (first-fit).
|
||||
2. Take from the first block with `size >= nbytes`.
|
||||
3. Exact match → drop the block; otherwise shrink it to `(start +
|
||||
nbytes, size - nbytes)`.
|
||||
4. `_used += nbytes`; return the taken `start`.
|
||||
5. If no block fits, `AllocationError("overflow ... largest free
|
||||
block ...")`.
|
||||
- `free(offset, nbytes)`:
|
||||
1. `_used -= nbytes`.
|
||||
2. `bisect_left(self._free, (offset,))` finds the insertion index.
|
||||
3. If adjacent to the previous block (`prev_start + prev_size ==
|
||||
offset`), merge.
|
||||
4. If adjacent to the next block (`offset + nbytes == next_start`),
|
||||
merge.
|
||||
5. Insert the coalesced range at the right sorted position.
|
||||
|
||||
This algorithm is weaker than best-fit / buddy on fragmentation, but
|
||||
the simulator's workload (mostly stack-like deploy/free) tolerates it.
|
||||
If the workload shape changes, D1 is a supersession candidate.
|
||||
|
||||
### D2. Partial-overlap free is **not** validated
|
||||
|
||||
`_FreeList.free(offset, nbytes)` trusts the caller to pass the exact
|
||||
`(offset, nbytes)`. It does **not** verify:
|
||||
|
||||
- That the range was actually allocated.
|
||||
- That the range does not overlap another allocated region.
|
||||
|
||||
Reason: in a simulator context, callers always store the return value
|
||||
of `alloc()` and pass it back to `free()` — there is no external user
|
||||
input. Adding a safety check would cost O(N) per free and impact
|
||||
simulation wall-clock.
|
||||
|
||||
If this trust model breaks (e.g., a code path lets two tensors point
|
||||
at the same PA), this ADR must be revisited.
|
||||
|
||||
### D3. `PEMemAllocator` — two channels for HBM/TCM
|
||||
|
||||
`PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList`s:
|
||||
|
||||
- `_hbm`: size `cfg.hbm_slice_bytes`.
|
||||
- `_tcm`: size `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe -
|
||||
tcm_scheduler_reserved_bytes`).
|
||||
|
||||
`alloc_hbm(nbytes) -> PhysAddr`:
|
||||
|
||||
- `_hbm.alloc(nbytes)` → offset.
|
||||
- `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id,
|
||||
pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes)`.
|
||||
- Failure raises `AllocationError("HBM overflow ...")`.
|
||||
|
||||
`free_hbm(pa, nbytes)`:
|
||||
|
||||
- Recover PE-local offset via `pa.hbm_offset - pe_id *
|
||||
cfg.hbm_slice_bytes`.
|
||||
- `_hbm.free(offset, nbytes)`.
|
||||
|
||||
`alloc_tcm(nbytes) -> PhysAddr`: similar; uses `PhysAddr.pe_tcm_addr`.
|
||||
|
||||
`free_tcm(pa, nbytes)`: uses `pa.sub_offset` directly (TCM's PE-local
|
||||
offset equals its sub_offset).
|
||||
|
||||
The allocator does not see the scheduler-reserved TCM region
|
||||
(`cfg.tcm_scheduler_reserved_bytes`) — it is pre-subtracted from the
|
||||
`_tcm` capacity. This is consistent with ADR-0014's PE_SCHEDULER
|
||||
internal-buffer reservation.
|
||||
|
||||
### D4. `VirtualAllocator` — page-aligned first-fit + coalescing
|
||||
|
||||
`policy/address/va_allocator.py::VirtualAllocator`:
|
||||
|
||||
- Internal representation: same sorted `list[tuple[int, int]]` as
|
||||
`_FreeList`. Initially `[(va_base, va_size)]`.
|
||||
- `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
|
||||
- `alloc(nbytes) -> int`:
|
||||
1. `aligned = _align_up(nbytes)`.
|
||||
2. First-fit a block with `size >= aligned`.
|
||||
3. Take `aligned` from the block's front; remove if exact.
|
||||
4. `_used += aligned`. Return the block's `start` (which is page-
|
||||
aligned).
|
||||
5. Failure → `VaAllocationError`.
|
||||
- `free(va, nbytes)`: free `_align_up(nbytes)` worth. Coalesces with
|
||||
the same algorithm as `_FreeList`.
|
||||
|
||||
`page_size` has different defaults in two places:
|
||||
|
||||
- `VirtualAllocator.__init__`'s parameter default: `2 MiB`. Direct-call
|
||||
tests receive this.
|
||||
- `RuntimeContext._ensure_allocators` when constructing the instance:
|
||||
`pe_mmu.attrs.get("page_size", 4096)` — uses
|
||||
`topology.yaml`'s `pe_mmu.attrs.page_size` if set, else falls back
|
||||
to `4 KiB`.
|
||||
|
||||
The two defaults differ on purpose: `VirtualAllocator`'s standalone
|
||||
default (`2 MiB`) aligns with ADR-0039's PE_MMU stopgap default for
|
||||
direct-test ergonomics; the context fallback (`4 KiB`) is the safe
|
||||
minimum when `topology.yaml` doesn't specify a page size. The
|
||||
production path is always the latter (via `_ensure_allocators`), and
|
||||
when `topology.yaml` sets `page_size`, that value flows consistently
|
||||
into both the MMU and the VA allocator.
|
||||
|
||||
If consistency breaks (e.g., VirtualAllocator instantiated with a
|
||||
page_size different from PE_MMU's), MMU `map()` falls into the
|
||||
sub-page region mode (ADR-0039 D3).
|
||||
|
||||
VA range defaults: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64
|
||||
GiB`. These are hardcoded in `_ensure_allocators` and have no
|
||||
semantic meaning in ADR-0011's VA model — they simply reserve enough
|
||||
device-wide space without colliding with host code.
|
||||
|
||||
### D5. Lifecycle of allocator instances
|
||||
|
||||
- `RuntimeContext._ensure_allocators` is lazy — called on the first
|
||||
`_create_tensor`.
|
||||
- The allocator dict (`self._allocators`) lives for the
|
||||
RuntimeContext's lifetime. A second deploy in the same process
|
||||
does not construct new objects.
|
||||
- `RuntimeContext.cleanup()` walks living tensors and calls
|
||||
`_free_tensor()`, which issues MMU unmaps + `va_allocator.free` +
|
||||
`pemem_allocator.free_hbm` — restoring the free lists. A subsequent
|
||||
RuntimeContext starts fresh.
|
||||
|
||||
This per-RuntimeContext isolation guarantees deterministic deploy →
|
||||
cleanup → deploy sequences within a single process.
|
||||
|
||||
### D6. Allocator failure raises (no silent OOM)
|
||||
|
||||
Both `_FreeList.alloc` and `VirtualAllocator.alloc` raise
|
||||
`AllocationError` / `VaAllocationError` when no block fits. The message
|
||||
includes "required size + largest available block" to distinguish
|
||||
fragmentation from true OOM.
|
||||
|
||||
A silent fallback (e.g., allocating only as much as the largest free
|
||||
block) is strictly forbidden — a partially-allocated tensor reaching
|
||||
SimPy would cause routing / DMA to see incorrect PAs and silently
|
||||
corrupt simulation results.
|
||||
|
||||
### D7. One allocator per address space
|
||||
|
||||
Physical address spaces are separated by PhysAddr sub-units (ADR-0001
|
||||
D2.3); each sub-unit gets its own allocator instance:
|
||||
|
||||
- HBM slice → `PEMemAllocator._hbm`.
|
||||
- PE TCM → `PEMemAllocator._tcm`.
|
||||
- (Currently unused) M_CPU local memory, CUBE SRAM → would need their
|
||||
own allocators. Today these are handled as IPCQ-only slots (ADR-0023
|
||||
D9.7) and do not share PA space, so no free-list exists for them.
|
||||
|
||||
When a cube-level SRAM allocator is needed,
|
||||
`_FreeList(cfg.sram_bytes_per_cube)` is added per-cube
|
||||
(`cfg.sram_bytes_per_cube` is already defined in `AddressConfig` —
|
||||
the data model is ready).
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Best-fit / buddy allocator
|
||||
|
||||
Rejected (currently). The workload's alloc/free pattern is stack-like
|
||||
(deploy order ≈ free order), so first-fit + coalescing controls
|
||||
fragmentation well enough. If long-running fragmentation appears in LLM
|
||||
kernel sweeps, a buddy-allocator ADR will replace D1.
|
||||
|
||||
### A2. Add partial-overlap free validation
|
||||
|
||||
Rejected. D2's trust model plus the O(N) per-free cost makes this
|
||||
unattractive. A debug mode (e.g., `KERNBENCH_DEBUG` env var) that
|
||||
enables the check could be added later.
|
||||
|
||||
### A3. A unified allocator for VA and PA
|
||||
|
||||
Rejected. VA space (64 GiB device-wide) and PA space (per-slice ~6
|
||||
GiB) have different semantic dimensions — VA is the kernel's view, PA
|
||||
is the device sub-unit's view. ADR-0011's VA model (MMU maps between
|
||||
the two) calls for separated allocators.
|
||||
|
||||
### A4. Multi-tier page sizes (large pages + small pages)
|
||||
|
||||
Rejected (currently). A single page size (2 MiB) matches LLM kernel
|
||||
tensor sizes (a few MiB to GiB); smaller mappings are absorbed by
|
||||
ADR-0039 D3's sub-page region mode. Multi-tier paging would require
|
||||
extending the MMU model itself — a separate ADR candidate.
|
||||
|
||||
## Consequences
|
||||
|
||||
- The allocator algorithm is pinned at ADR level (D1, D3, D4), so any
|
||||
future simulation scenario hitting fragmentation has a clear "we're
|
||||
using first-fit + coalescing" anchor to inspect.
|
||||
- D2's trust model is explicit, so any future code path that exposes
|
||||
alloc/free to direct user input will trigger this ADR's supersession
|
||||
early.
|
||||
- D7's one-allocator-per-sub-unit mapping is on record, so when M_CPU
|
||||
or SRAM need their own free-list, the addition point is obvious.
|
||||
- D4 captures the page_size dual-default and its production path
|
||||
(`_ensure_allocators` always wins), letting future `topology.yaml`
|
||||
`page_size` changes be assessed against ADR-0039's stopgap
|
||||
interaction quickly.
|
||||
@@ -0,0 +1,247 @@
|
||||
# ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
Pins down the traffic-pattern catalog, formula-vs-actual comparison, and
|
||||
invariant checks (monotonicity, D2H ≥ H2D, etc.) exposed by
|
||||
`probes/probe.py::run_probe(...)`. ADR-0010 (CLI surface) enumerates the
|
||||
`kernbench probe` subcommand, but **what probe actually measures** and
|
||||
**which invariants it judges PASS/FAIL** had no ADR-level coverage.
|
||||
|
||||
## First action
|
||||
|
||||
`run_probe(topology_path, case_filter=None)` performs four startup steps:
|
||||
|
||||
1. `Path(topology_path).expanduser().resolve()` → absolute path.
|
||||
2. `load_topology(path)` → `TopologyGraph` (graph + spec).
|
||||
3. `_build_edge_map(graph)` → a `{(src, dst): Edge}` lookup table.
|
||||
4. Instantiate `AddressResolver(graph)` + `PathRouter(graph)`.
|
||||
|
||||
Then it sets `nbytes = 32768` (= 32 KiB, the summary-table reference
|
||||
size) and `show_all = (case_filter is None or case_filter == "all")`.
|
||||
|
||||
In short, **probe's first act is "load the topology once and prepare
|
||||
edge map / resolver / router, plus pin 32 KiB as the standard measurement
|
||||
size"**. After that, the H2D → D2H → PE DMA categories execute in
|
||||
separate `GraphEngine` instances (no cross-talk between cases).
|
||||
|
||||
## Context
|
||||
|
||||
`kernbench probe` was introduced as a verification tool for these
|
||||
purposes:
|
||||
|
||||
- **Manual ground truth**: when a real-simulation result (`kernbench run
|
||||
--bench ...`) shows abnormal latency, derive the answer for a simple
|
||||
traffic pattern in isolation and compare.
|
||||
- **Formula vs actual**: check whether the analytical model
|
||||
(wire latency + overhead + drain) matches the simulator's
|
||||
`total_ns`. A mismatch points to which simplifying assumption in
|
||||
ADR-0033 is missing.
|
||||
- **Monotonicity check**: latency should grow monotonically with hop
|
||||
count.
|
||||
- **Utilization sweep**: a BW-utilization table across data sizes
|
||||
(4 KiB ~ 1 MiB).
|
||||
|
||||
Without an ADR for this tool:
|
||||
|
||||
- Adding a new traffic-pattern category (e.g., MCpuDma, IPCQ) is hard
|
||||
because the table format / measurement units of existing categories
|
||||
aren't documented at the ADR level.
|
||||
- The basis for the monotonicity check (hop count? cube distance? wire
|
||||
length?) is ambiguous.
|
||||
- The reference size 32 KiB and the sweep `[4 KiB, 16 KiB, 64 KiB, 256
|
||||
KiB, 1 MiB]` are only discoverable by reading source.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Three case categories — H2D / D2H / PE DMA
|
||||
|
||||
Each category has a distinct data path in the topology and gets its own
|
||||
summary table + sweep table + route-detail block.
|
||||
|
||||
- **H2D (Host → Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube,
|
||||
dst_pe=0, pattern="zero")` flows along `pcie_ep → io_cpu → m_cpu →
|
||||
hbm_ctrl`. The cube index varies the hop count:
|
||||
- h2d-1hop: cube=0, hops=1
|
||||
- h2d-2hop: cube=4, hops=2
|
||||
- h2d-3hop: cube=8, hops=3
|
||||
- h2d-4hop: cube=12, hops=4
|
||||
- **D2H (Device → Host Read)**: `MemoryReadMsg(src_sip=0, src_cube,
|
||||
src_pe=0)`. Total latency = forward command path + reverse data path.
|
||||
Same 4-hops category as H2D.
|
||||
- **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe,
|
||||
dst_pa)`. Five cases cover varying cube/PE positions:
|
||||
- pe-local-hbm: same cube, same PE
|
||||
- pe-same-half-hbm: same cube, different PE (PE 1)
|
||||
- pe-cross-half-hbm: same cube, far PE (PE 4)
|
||||
- pe-cross-cube-hbm-best: adjacent cube (cube 1)
|
||||
- pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
|
||||
|
||||
The cube indices 4/8/12 (H2D) and 1/4/15 (PE DMA) are meaningful for a
|
||||
4 × 4 cube mesh (`sip.cube_mesh.w=4, h=4`); changes to the mesh size
|
||||
require these to be updated in lockstep.
|
||||
|
||||
### D2. Standard measurement size — `nbytes = 32768` (32 KiB)
|
||||
|
||||
Every case in the summary table runs once with `nbytes=32768`. 32 KiB
|
||||
was chosen because:
|
||||
|
||||
- DMA overhead and BW drain are balanced — neither dominates.
|
||||
- It compares cleanly against the one-shot transfer size of several
|
||||
sub-units (TCM, register file).
|
||||
|
||||
Per-size utilization variations are shown in a separate sweep table
|
||||
(D3).
|
||||
|
||||
### D3. Utilization sweep — `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`
|
||||
|
||||
`SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`,
|
||||
`SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]`. Per size:
|
||||
|
||||
```
|
||||
drain = nbytes / bottleneck_bw
|
||||
total = overhead + wire + drain
|
||||
eff_bw = nbytes / total
|
||||
util% = eff_bw / bottleneck_bw × 100
|
||||
```
|
||||
|
||||
When `bn_bw is None or <= 0`, the column shows 0.0 %. The intent: the
|
||||
table shows in one view how small transfers become overhead-bound and
|
||||
large transfers become drain-bound as hop count rises.
|
||||
|
||||
### D4. Measured columns — actual / formula / breakdown
|
||||
|
||||
Per-case columns:
|
||||
|
||||
- `Actual` (total_ns): the SimPy run's `trace["total_ns"]`.
|
||||
- `Ovhd`: sum of `node.attrs["overhead_ns"]` along the path (formula).
|
||||
- `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
|
||||
- `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
|
||||
- `Ovhd%` / `Drain%`: each portion as a percentage of Actual. Wire is
|
||||
usually too small to display.
|
||||
- `Eff.BW`: `nbytes / total_ns` (measured BW).
|
||||
- `BN.BW`: bottleneck bandwidth (formula). The minimum edge BW along
|
||||
the path. Missing edge BW shows "-".
|
||||
- `Util%`: `Eff.BW / BN.BW × 100`. 100 % means the single-stream BW
|
||||
upper bound is reached.
|
||||
|
||||
A large gap between the formula sum (`wire + ovhd + drain`) and Actual
|
||||
signals a factor the simplified model misses (a place to inspect
|
||||
ADR-0033's assumptions).
|
||||
|
||||
### D5. Automatic invariant checks — PASS/FAIL
|
||||
|
||||
The following invariants are reported with `[v] PASS` / `[x] FAIL`:
|
||||
|
||||
- **H2D / D2H monotonic increase**: as hop count rises, actual latency
|
||||
must grow monotonically. `all(lats[i] < lats[i+1] for ...)`.
|
||||
- **D2H ≥ H2D**: for the same hop index, D2H ≥ H2D (D2H has both
|
||||
forward command and reverse data legs). `all(d2h[i].total >=
|
||||
h2d[i].total)`.
|
||||
- **PE DMA best < worst**: cross-cube best (adjacent) latency must be
|
||||
less than cross-cube worst (diagonal).
|
||||
- **PE DMA local vs remote**: prints the local BN BW vs remote BN BW
|
||||
side-by-side (informational, not PASS/FAIL).
|
||||
|
||||
When a check fails, a single clear line surfaces the regression for
|
||||
human review.
|
||||
|
||||
### D6. Route detail — per-hop timestamp trace
|
||||
|
||||
After the summary and sweep tables, each case's path and cumulative
|
||||
per-hop timestamps (`_hop_timestamps`) appear in a separate section:
|
||||
|
||||
- H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3
|
||||
(`m_cpu → hbm_ctrl`) + per-hop trace.
|
||||
- D2H: forward (cmd, no data) and reverse (data) traces shown
|
||||
separately.
|
||||
- PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.
|
||||
|
||||
Each hop's timestamp is cumulative `wire_ns + overhead_ns`. The
|
||||
terminal hop's annotation appends `drain:Xns`. Bottleneck edges are
|
||||
marked `<BN:XXGB/s>` so they are visually identifiable.
|
||||
|
||||
### D7. Semantics of the `case_filter` argument
|
||||
|
||||
- `None` or `"all"`: run all cases (default).
|
||||
- Other strings: run only the case whose name matches exactly. Example:
|
||||
`kernbench probe --case h2d-2hop`.
|
||||
|
||||
Within a category, cases with `name != case_filter` are skipped; if
|
||||
only one data point remains, the category's monotonicity / D2H ≥ H2D
|
||||
comparisons are naturally skipped.
|
||||
|
||||
The CLI parser's `--case` default is `"all"`, so omitting it runs
|
||||
everything.
|
||||
|
||||
### D8. Fresh GraphEngine per case
|
||||
|
||||
Each of the 4 H2D, 4 D2H, and 5 PE DMA cases runs in **its own
|
||||
GraphEngine** (`engine = GraphEngine(graph)`). Reasons:
|
||||
|
||||
- Isolate accumulated state (op_log, completion tracking, allocators)
|
||||
so cases do not cross-talk.
|
||||
- Guarantee one case's traffic does not perturb another case's BW
|
||||
measurement.
|
||||
|
||||
This isolation lets probe results be interpreted as **single-flow**
|
||||
per-case latency. Multi-flow contention measurement is handled by
|
||||
separate tooling (e.g., the `pe2pe_overview` plot or ADR-0033's
|
||||
multi-flow merging model).
|
||||
|
||||
### D9. Output-format stability
|
||||
|
||||
probe's stdout is meant for humans; precise column widths, separators,
|
||||
and whitespace are **not** a machine-readable contract. Automated tools
|
||||
that wish to parse probe output should use a separate JSON-output mode
|
||||
(not yet implemented).
|
||||
|
||||
The `[v]` / `[x]` prefix on PASS/FAIL lines is a stable CI grep anchor.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Register probe as another bench (`@bench(name="probe")`)
|
||||
|
||||
Rejected. probe is a verification tool, not a bench — multi-engine
|
||||
execution for sweeps/analysis and PASS/FAIL invariant output are
|
||||
essential, none of which fits ADR-0045's "single device + single
|
||||
RuntimeContext" bench model.
|
||||
|
||||
### A2. Exit code 1 on monotonicity violation
|
||||
|
||||
Rejected (currently). probe is positioned as a human inspection tool —
|
||||
PASS/FAIL is printed and exit is 0. A wrapper can `grep "\[x\]"` to
|
||||
decide. A future `--strict` flag could opt into non-zero exits.
|
||||
|
||||
### A3. Externalize the case catalog to YAML
|
||||
|
||||
Rejected (currently). The 8 cases (4 H2D + 4 D2H + 5 PE DMA = 13 total)
|
||||
are hardcoded and their semantics are tightly bound to the mesh
|
||||
topology. Moving cube-index meaning (4, 8, 12 / 1, 4, 15) into YAML
|
||||
would require separate documentation and lose cohesion. Externalize
|
||||
only when case additions become frequent.
|
||||
|
||||
### A4. Add multi-flow contention measurement
|
||||
|
||||
Rejected (out of probe scope). D8's single-flow isolation is probe's
|
||||
core intent. Multi-flow contention belongs in a different area of the
|
||||
ADR-0033 latency model — either a separate tool or a new case
|
||||
category.
|
||||
|
||||
## Consequences
|
||||
|
||||
- probe's case catalog (D1) and measurement units (D2/D3) are pinned at
|
||||
ADR level, so new traffic categories know which table format to
|
||||
follow.
|
||||
- The semantics of the formula-vs-actual columns (D4) are locked in, so
|
||||
questions like "why is Drain% 5 % or 70 %?" can quickly be linked to
|
||||
ADR-0033 assumption checks.
|
||||
- Automatic invariant checks (D5) are pinned, so latency-model changes
|
||||
immediately catch monotonicity / D2H ≥ H2D regressions.
|
||||
- D8's case-isolation is explicit, so probe results are safe to read as
|
||||
single-flow measurements. If multi-flow is needed, a separate tool
|
||||
track is clearly required.
|
||||
- A2's strict-mode flag is recorded as a follow-up so CI integration
|
||||
has a minimal change path when requested.
|
||||
Reference in New Issue
Block a user