adr: add ADR-0046-0049 — close G4 coverage gaps from /report

Documents four cross-cutting surfaces that previously had no ADR backing,
each surfaced as a G4 candidate by /report:

- 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates
  all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...),
  the two execution modes (command-list vs greenlet runner), scratch
  allocator semantics, dispatch-overhead model, and the kernel registry.
- 0047 par-ahbm-ccl-backend:
  torch.distributed.init_process_group(backend="ahbm") install path.
  world_size priority (algorithm > defaults > topology), the 4-step init
  sequence (load ccl.yaml, import algorithm module, derive world_size,
  install SFR + IPCQ), greenlet-local rank registry, all_reduce dispatch
  via _defer_wait, barrier no-op rationale, and the explicit list of
  unsupported dist.* APIs.
- 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator
  free-list semantics. Offset-keyed first-fit with coalescing, the
  no-validation trust model for free(), HBM/TCM channel separation,
  page-aligned VA allocation, the page_size dual-default
  (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and
  one-allocator-per-sub-unit rule.
- 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog.
  H2D / D2H / PE DMA categories with their exact cube-index choices, the
  32 KiB reference size, the 5-point utilization sweep, the formula vs
  actual column meanings, automatic invariant checks (monotonicity,
  D2H >= H2D, best < worst), per-case GraphEngine isolation, and the
  human-readable (not machine-parsable) output contract.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:25:04 -07:00
parent 5f8dd688f5
commit 9a02955770
8 changed files with 2154 additions and 0 deletions
@@ -0,0 +1,259 @@
+# ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR pins down what `AhbmCCLBackend` and `DistributedContext` in
+`runtime_api/distributed.py` actually install, i.e., what happens behind
+the entry point `torch.distributed.init_process_group(backend="ahbm")`,
+and how `all_reduce`, `barrier`, `get_rank`, etc. are implemented.
+ADR-0023 D11 mentions the "torch.distributed compatibility" intent, but
+**the backend itself** had no ADR-level coverage.
+
+## First action
+
+`RuntimeContext.__post_init__` automatically constructs a
+`DistributedContext()` and attaches it to `self.distributed`. The first
+action at that moment:
+
+1. `self._backend: AhbmCCLBackend | None = None` — uninitialized.
+2. `self._rank_by_greenlet: dict = {}` — greenlet-local rank registry
+   (ADR-0024 D2).
+3. The caller (RuntimeContext) sets `dc._ctx_ref = self` so subsequent
+   `init_process_group` can reach `ctx.engine` / `ctx.spec` / `ctx.launch`.
+
+In short, **DistributedContext's first act is "attach to RuntimeContext
+with a back-reference and leave the backend slot empty"**. Actual
+backend installation (IPCQ install, world_size derivation, algorithm
+module import) happens only when user code calls
+`torch.distributed.init_process_group(backend="ahbm")`.
+
+When that call arrives, `init_process_group` does the following, in
+order:
+
+1. If `backend != "ahbm"`, raise `ValueError("Unsupported backend ...")`
+   immediately.
+2. If `getattr(self, "_ctx_ref", None)` is None, raise
+   `RuntimeError("DistributedContext not bound to a RuntimeContext")`.
+3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)` — inside this
+   constructor, ccl.yaml is loaded, the algorithm module is imported,
+   world_size is derived, SFR is configured, and IPCQ is installed.
+4. `self._backend._dist_ctx = self` — the backend gets a back-reference
+   so it can read `_rank_by_greenlet`.
+
+## Context
+
+The `AhbmCCLBackend` exists so that PyTorch DDP collective calls
+(`init_process_group`, `all_reduce`, etc.) work unchanged and bench code
+reads identically to a real DDP training script (in line with
+ADR-0024 + ADR-0027's launcher model).
+
+The backend's responsibilities:
+
+- At `init_process_group` time, install the **IPCQ neighbor table once**
+  (analogous to NCCL communicator creation).
+- For each `all_reduce(tensor, op="sum")`, dispatch the configured
+  algorithm's kernel function via `ctx.launch(...)`.
+- Answer `get_world_size` / `get_rank` consistently from the
+  greenlet-local rank registry plus ccl.yaml/topology.
+
+ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) touch
+parts of this, but **the backend's own responsibility scope and decision
+order** are not pinned anywhere. This ADR fills that gap.
+
+## Decision
+
+### D1. The backend is created only at `init_process_group(backend="ahbm")` time
+
+`DistributedContext` starts with `_backend = None`. The backend object
+does not exist until the user calls
+`dist.init_process_group(backend="ahbm")`. Any other API
+(`is_initialized`, `get_world_size`, `all_reduce`, `barrier`) called
+while `_backend` is None raises
+`RuntimeError("Default process group has not been initialized...")` via
+the `_ensure_initialized` helper.
+
+`backend != "ahbm"` raises `ValueError` immediately. Other backend names
+(`nccl`, `gloo`, etc.) are not recognized.
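The lazy-creation rule in D1 can be illustrated with a minimal sketch. `DistributedContextSketch` is a hypothetical stand-in written for this example, not the real `DistributedContext`; a placeholder object takes the place of `AhbmCCLBackend`, and the error strings are abbreviated.

```python
class DistributedContextSketch:
    """Hypothetical stand-in that shows only the D1 guard logic."""

    def __init__(self):
        self._backend = None   # stays empty until init_process_group runs
        self._ctx_ref = None   # set by the owning RuntimeContext

    def init_process_group(self, backend="ahbm", **kwargs):
        # Any backend name other than "ahbm" is rejected up front.
        if backend != "ahbm":
            raise ValueError(f"Unsupported backend {backend!r}")
        if getattr(self, "_ctx_ref", None) is None:
            raise RuntimeError(
                "DistributedContext not bound to a RuntimeContext")
        # Assumption: a bare object stands in for AhbmCCLBackend, whose
        # constructor performs the D3 installation in the real code.
        self._backend = object()

    def _ensure_initialized(self):
        if self._backend is None:
            raise RuntimeError(
                "Default process group has not been initialized")
        return self._backend

    def get_backend(self):
        self._ensure_initialized()
        return "ahbm"
```

Calling `get_backend()` on a fresh instance raises `RuntimeError`; after binding `_ctx_ref` and calling `init_process_group(backend="ahbm")`, it returns `"ahbm"`.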
+
+### D2. world_size resolution priority — algorithm > defaults > topology
+
+`AhbmCCLBackend._resolve_world_size` (ADR-0024 D1):
+
+1. If `ccl.yaml`'s algorithm entry has `world_size`, use it.
+2. Else if `defaults.world_size` is set, use it.
+3. Else fall back to `spec.system.sips.count` (the topology's SIP count).
+
+The default interpretation is **rank = SIP** (ADR-0024). Cube/PE-level
+parallelism is expressed inside each rank via DPPolicy and does not
+affect world_size. An explicit `ccl.yaml` override is preserved for the
+legacy "rank = flat PE index" test path.
+
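+User arguments to `init_process_group(world_size=..., rank=...)` are
+**accepted but ignored**. The effect resembles a real PyTorch script
+launched with `env://` initialization, where rank and world size
+typically come from the `RANK` / `WORLD_SIZE` environment variables
+rather than from explicit arguments.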
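The D2 priority order can be sketched as a small pure function. The name `resolve_world_size` and its dictionary-shaped parameters are assumptions for illustration; the real `_resolve_world_size` reads from the merged ccl.yaml config and `spec`. Treating a `None` value as "not set" is also an assumption.

```python
def resolve_world_size(algo_cfg, defaults, topology_sip_count):
    """Sketch of D2: algorithm entry > defaults > topology SIP count."""
    # 1. The algorithm entry in ccl.yaml wins when it carries world_size.
    if algo_cfg.get("world_size") is not None:
        return int(algo_cfg["world_size"])
    # 2. Otherwise fall back to defaults.world_size.
    if defaults.get("world_size") is not None:
        return int(defaults["world_size"])
    # 3. Otherwise use the topology's SIP count (rank = SIP).
    return int(topology_sip_count)


# Explicit override on the algorithm entry beats everything else.
override = resolve_world_size({"world_size": 8}, {"world_size": 4}, 2)
# With no override anywhere, the SIP count is used.
from_topology = resolve_world_size({}, {}, 2)
```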
+
+### D3. `init_process_group` performs four installation steps
+
+Inside `AhbmCCLBackend.__init__`, in order:
+
+1. **Load ccl.yaml**: `kernbench.ccl.install.load_ccl_config()` →
+   `resolve_algorithm_config(_cfg_all)` produces the merged config for
+   `defaults.algorithm` (or the user-specified algorithm).
+2. **Import algorithm module**:
+   `importlib.import_module(self._merged["module"])`. The module must
+   expose a `kernel` function, a `kernel_args(world_size, n_elem,
+   cube_w, cube_h)` helper, and optionally a `TOPO_NAME_TO_KIND` map.
+3. **Resolve world_size** (D2).
+4. **Collect topology metadata** from `spec`: `n_sips`, `sip_topo`
+   (`ring_1d` default), `cube_w`/`cube_h`, `sips.w`/`sips.h`. When the
+   SIP topology is not `ring_1d`, derive `_sip_topo_w/h` from explicit
+   `w`/`h` or via square-root (require `w*h == n_sips`). Mismatch raises
+   `ValueError`.
+5. **Install SFR + IPCQ**:
+   `kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine,
+   spec, self._merged)`. This pushes IPCQ neighbor tables to every
+   SIP/cube's pe0 (one-time setup analogous to NCCL communicator
+   creation).
+
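+If the order changes (e.g., SFR configuration runs before the algorithm
+module loads), the backend can end up partially initialized. D3 is
+therefore treated as one atomic block: on failure the backend remains
+uninstalled. The "four steps" are load, import, resolve world_size, and
+install SFR + IPCQ; the topology metadata collection in item 4 is not
+counted separately.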
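The width/height derivation described in item 4 can be sketched as follows. `derive_sip_topo_dims` is a hypothetical helper name; using the integer square root and falling back to it whenever either `w` or `h` is missing are assumptions, since the text only says "explicit `w`/`h` or via square-root".

```python
import math


def derive_sip_topo_dims(n_sips, w=None, h=None):
    """Sketch: pick (w, h) for a non-ring SIP topology and validate it."""
    if w is not None and h is not None:
        dims = (int(w), int(h))
    else:
        # Assumption: a square layout derived from the integer square root.
        side = math.isqrt(n_sips)
        dims = (side, side)
    # The product must cover exactly n_sips, otherwise the layout is invalid.
    if dims[0] * dims[1] != n_sips:
        raise ValueError(
            f"SIP topology {dims[0]}x{dims[1]} does not match n_sips={n_sips}")
    return dims
```

With this sketch, 16 SIPs and no explicit dimensions resolve to a 4x4 layout, while 6 SIPs without explicit `w`/`h` are rejected because no square layout fits.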
+
+### D4. Greenlet-local rank binding (ADR-0024 D2)
+
+`DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps spawned
+worker greenlets to their ranks. When the bench launcher (e.g.,
+`torch.multiprocessing.spawn`) spawns a worker, it registers via
+`dc._bind_rank(g, rank)`.
+
+`get_rank()` looks up the greenlet returned by `getcurrent()` in this
+map. Unregistered greenlets fall back to rank 0, which preserves
+single-driver and test compatibility.
+
+The backend reads the current greenlet's rank from
+`_dist_ctx._rank_by_greenlet` during `all_reduce` (D5).
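A minimal sketch of the D4 registry, under stated assumptions: `RankRegistry` is a hypothetical class, and `get_rank` takes the current execution context as a parameter so the example runs without the `greenlet` package. In the real code the key would be the object returned by `getcurrent()`.

```python
class RankRegistry:
    """Sketch of the greenlet-to-rank map with a default of rank 0."""

    def __init__(self):
        self._rank_by_greenlet = {}

    def bind_rank(self, g, rank):
        # Called by the launcher when it spawns a worker.
        self._rank_by_greenlet[g] = rank

    def get_rank(self, current):
        # Unregistered contexts (e.g., the single driver) resolve to 0.
        return self._rank_by_greenlet.get(current, 0)


# Plain objects stand in for greenlets in this example.
registry = RankRegistry()
worker_a, worker_b, driver = object(), object(), object()
registry.bind_rank(worker_a, 1)
registry.bind_rank(worker_b, 2)
```

Here `registry.get_rank(worker_b)` yields 2, while `registry.get_rank(driver)` yields 0 because the driver was never bound.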
+
+### D5. `all_reduce(tensor, op="sum")` behavior
+
+Validation:
+
+- `op != "sum"` → `NotImplementedError`. Current kernels only
+  implement add reduction.
+- `tensor._handle is None` → `RuntimeError("not deployed")`.
+- `tensor._handle.shards` empty → `RuntimeError("no shards")`.
+
+Preparation:
+
+- `n_elem = shards[0].nbytes // tensor.itemsize` — element count of a
+  single shard.
+- `kernel_fn = self._algo_module.kernel` — the algorithm module's entry
+  function (imported in D3).
+- Decide effective cube dims: if the first SIP has just 1 cube, use
+  `(1, 1)`; otherwise use the topology's `cube_w`/`cube_h`. This
+  naturally absorbs TP runs that use only a subset of cubes.
+- `kernel_args = self._algo_module.kernel_args(world_size, n_elem,
+  cube_w, cube_h)` — the algorithm decides which arguments to pass to
+  its kernel.
+
+Dispatch:
+
+- Resolve the current greenlet's rank via
+  `_rank_by_greenlet.get(g, 0)`.
+- Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w,
+  sip_topo_h)`.
+- `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor,
+  *kernel_args, *extra_args, _defer_wait=True)` — `_defer_wait=True`
+  delegates collective drain to the main scheduler (ADR-0027 D0.4).
+
+Drain:
+
+- If the parent greenlet is alive (multi-greenlet mode), enqueue the
+  pending handles on `_pending_collective_handles` and switch to the
+  parent. The main scheduler drains them after all ranks have launched.
+- If single-driver mode, drain inline:
+  `for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
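The validation and preparation stages of D5 can be sketched without the launcher. `Shard`, `Handle`, `FakeTensor`, and `prepare_all_reduce` are hypothetical names introduced for this example; only the checks, the `n_elem` formula, and the single-cube rule come from the text above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Shard:
    nbytes: int


@dataclass
class Handle:
    shards: list = field(default_factory=list)


@dataclass
class FakeTensor:
    itemsize: int
    _handle: Optional[Handle] = None


def prepare_all_reduce(tensor, op, cubes_in_first_sip, cube_w, cube_h):
    """Sketch: run the D5 checks, then compute n_elem and cube dims."""
    if op != "sum":
        raise NotImplementedError(f"op={op!r} is not supported")
    if tensor._handle is None:
        raise RuntimeError("not deployed")
    if not tensor._handle.shards:
        raise RuntimeError("no shards")
    # Element count of a single shard.
    n_elem = tensor._handle.shards[0].nbytes // tensor.itemsize
    # A single-cube SIP collapses the effective cube grid to (1, 1).
    if cubes_in_first_sip == 1:
        cube_dims = (1, 1)
    else:
        cube_dims = (cube_w, cube_h)
    return n_elem, cube_dims
```

For a 4096-byte shard of 2-byte elements, the sketch reports 2048 elements per shard; the cube dimensions follow the topology unless the first SIP has exactly one cube.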
+
+### D6. `barrier()` is a no-op (single-driver model)
+
+kernbench runs all ranks as greenlets inside a single Python process,
+so no cross-process synchronization is needed. `barrier()` is callable
+but does no synchronization. Kept for real-PyTorch API compatibility so
+callers don't get `NotImplementedError`.
+
+If multi-process kernbench (SimPy event loop per process) is introduced
+in the future, D6 needs a superseding ADR.
+
+### D7. Semantics of `get_rank` / `get_world_size` / `get_backend`
+
+- `get_rank()` (D4): the current greenlet's bound rank; unregistered → 0.
+- `get_world_size()` (D2): the world_size resolved by the backend in D3.
+- `get_backend()`: always the literal string `"ahbm"`. Calling before
+  backend exists triggers `_ensure_initialized`'s RuntimeError.
+
+Differences vs. real PyTorch:
+
+- Real PyTorch `get_rank()` is a process-global value; here it is
+  greenlet-local. Inside a spawned worker it returns the worker's rank;
+  in the main (unregistered) greenlet it returns 0. Bench authors should
+  expect meaningful ranks only inside worker functions.
+
+### D8. Supported API surface (final)
+
+`DistributedContext` exposes:
+
+- `init_process_group(backend="ahbm", world_size=None, rank=None,
+  **kwargs)`
+- `is_initialized() -> bool`
+- `get_world_size() -> int`
+- `get_rank() -> int`
+- `get_backend() -> str`
+- `all_reduce(tensor, op="sum") -> None`
+- `barrier() -> None`
+- (internal) `_bind_rank(g, rank)`
+
+Other PyTorch distributed APIs (`broadcast`, `reduce`, `all_gather`,
+`gather`, `scatter`, point-to-point `send/recv`, etc.) are **not
+implemented**. Kernels can still express such transfers via
+`tl.send`/`tl.recv` (ADR-0046 D3.10), but the `dist.*` surface does not
+expose them. If additional collectives are needed, add an (algorithm
+module, `DistributedContext` method) pair and extend D8.
+
+## Alternatives Considered
+
+### A1. Create the backend in `RuntimeContext.__init__`
+
+Rejected. If `ccl.yaml` is missing or the algorithm module can't be
+imported, RuntimeContext construction would fail even when the bench
+does not use distributed features. Lazy creation at call time (D1) is
+the right semantics.
+
+### A2. Always derive world_size from topology (no override)
+
+Rejected. ADR-0024 D1's "explicit override" path is used by legacy
+tests. Diagnostic scenarios that define PE-level ranks within a single
+SIP also need this escape hatch.
+
+### A3. Silent fallback for unsupported `op`
+
+Rejected. If the user intends `op="prod"` / `"max"` / `"avg"` and a
+silent `sum` runs instead, the result is wrong without any error, which
+makes validation very hard. An explicit `NotImplementedError` is safer.
+
+### A4. Implement `barrier` as a SimPy event
+
+Rejected (currently). With single-driver semantics there is no
+cross-process synchronization to express, so a no-op is semantically
+correct. A fake-barrier SimPy event would add code complexity for no
+semantic gain. Revisit when multi-process kernbench arrives.
+
+## Consequences
+
+- The 4-step installation (D3) behind
+  `torch.distributed.init_process_group(backend="ahbm")` is locked in,
+  which makes clear where future collective algorithms must hook in.
+- The priority order in D2 (algorithm > defaults > topology) makes it
+  easy to determine the blast radius of a ccl.yaml change.
+- The no-op `barrier` (D6) is recorded so multi-process kernbench, if
+  introduced, must explicitly supersede this ADR.
+- D8's list of unsupported APIs gives an explicit basis for the
+  rejection message when users call, e.g., `dist.broadcast(...)`.