adr: add ADR-0046-0049 — close G4 coverage gaps from /report

Documents four cross-cutting surfaces that previously had no ADR backing,
each surfaced as a G4 candidate by /report:

- 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates all
  primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...), the
  two execution modes (command-list vs greenlet runner), scratch allocator
  semantics, dispatch-overhead model, and the kernel registry.
- 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group
  (backend="ahbm") install path. world_size priority (algorithm >
  defaults > topology), the 4-step init sequence (load ccl.yaml, import
  algorithm module, derive world_size, install SFR + IPCQ),
  greenlet-local rank registry, all_reduce dispatch via _defer_wait,
  barrier no-op rationale, and the explicit list of unsupported dist.*
  APIs.
- 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator
  free-list semantics. Offset-keyed first-fit with coalescing, the
  no-validation trust model for free(), HBM/TCM channel separation,
  page-aligned VA allocation, the page_size dual-default (VirtualAllocator
  2 MiB / _ensure_allocators 4 KiB fallback), and one-allocator-per-sub-unit
  rule.
- 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog.
  H2D / D2H / PE DMA categories with their exact cube-index choices, the
  32 KiB reference size, the 5-point utilization sweep, the formula vs
  actual column meanings, automatic invariant checks (monotonicity,
  D2H >= H2D, best < worst), per-case GraphEngine isolation, and the
  human-readable (not machine-parsable) output contract.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,259 @@
# ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim

## Status

Accepted (2026-05-22).

Pins down what `AhbmCCLBackend` and `DistributedContext` in
`runtime_api/distributed.py` actually install, i.e., the entry point
`torch.distributed.init_process_group(backend="ahbm")`, and how
`all_reduce`, `barrier`, `get_rank`, etc. are implemented. ADR-0023 D11
mentions the "torch.distributed compatibility" intent, but **the backend
itself** had no ADR-level coverage.

## First action

`RuntimeContext.__post_init__` automatically constructs a
`DistributedContext()` and attaches it to `self.distributed`. At that
moment the following happens:

1. `self._backend: AhbmCCLBackend | None = None`: the backend slot is
   uninitialized.
2. `self._rank_by_greenlet: dict = {}`: the greenlet-local rank registry
   (ADR-0024 D2).
3. The caller (`RuntimeContext`) sets `dc._ctx_ref = self` so that a later
   `init_process_group` can reach `ctx.engine` / `ctx.spec` / `ctx.launch`.

In short, **DistributedContext's first act is "attach to RuntimeContext
with a back-reference and leave the backend slot empty"**. Actual
backend installation (IPCQ install, world_size derivation, algorithm
module import) happens only when user code calls
`torch.distributed.init_process_group(backend="ahbm")`.

At that point, `init_process_group` does the following, in order:

1. If `backend != "ahbm"`, raise `ValueError("Unsupported backend ...")`
   immediately.
2. If `getattr(self, "_ctx_ref", None)` is None, raise
   `RuntimeError("DistributedContext not bound to a RuntimeContext")`.
3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)`: inside this
   constructor, ccl.yaml is loaded, the algorithm module is imported,
   world_size is derived, SFR is configured, and IPCQ is installed.
4. `self._backend._dist_ctx = self`: the backend gets a back-reference
   so it can read `_rank_by_greenlet`.
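
The guard order above can be sketched in a few lines. This is a minimal stand-in, not the code in `runtime_api/distributed.py`: the `AhbmCCLBackend` body is a placeholder for the real installation work, and the text after "Unsupported backend" is an assumption because the ADR elides the exact message.

```python
class AhbmCCLBackend:
    """Placeholder backend; the real constructor runs the D3 installation."""

    def __init__(self, torch_ctx):
        self.ctx = torch_ctx
        self._dist_ctx = None  # filled in by DistributedContext


class DistributedContext:
    def __init__(self):
        self._backend = None          # backend slot starts empty
        self._rank_by_greenlet = {}   # greenlet -> rank registry
        # _ctx_ref is deliberately not set here; RuntimeContext assigns it.

    def init_process_group(self, backend="ahbm", world_size=None, rank=None, **kwargs):
        # 1. Reject any backend name other than "ahbm" up front.
        if backend != "ahbm":
            # Message suffix is an assumption; the ADR elides it.
            raise ValueError(f"Unsupported backend {backend!r}")
        # 2. Require the back-reference installed by RuntimeContext.
        ctx = getattr(self, "_ctx_ref", None)
        if ctx is None:
            raise RuntimeError("DistributedContext not bound to a RuntimeContext")
        # 3. Construct the backend (the real one installs SFR + IPCQ here).
        self._backend = AhbmCCLBackend(torch_ctx=ctx)
        # 4. Give the backend access to the rank registry.
        self._backend._dist_ctx = self
```

Because the backend name is checked before the binding, an unsupported name fails with `ValueError` even on an unbound context.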

## Context

`AhbmCCLBackend` exists so that PyTorch DDP collective calls
(`init_process_group`, `all_reduce`, etc.) work unchanged and bench code
reads identically to a real DDP training script (in line with the
launcher model of ADR-0024 and ADR-0027).

The backend's responsibilities:

- At `init_process_group` time, install the **IPCQ neighbor table once**
  (analogous to NCCL communicator creation).
- For each `all_reduce(tensor, op="sum")`, dispatch the configured
  algorithm's kernel function via `ctx.launch(...)`.
- Answer `get_world_size` / `get_rank` consistently from the
  greenlet-local rank registry plus ccl.yaml/topology.

ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) touch
parts of this, but **the backend's own responsibility scope and decision
order** are not pinned down anywhere. This ADR fills that gap.

## Decision

### D1. The backend is created only at `init_process_group(backend="ahbm")` time

`DistributedContext` starts with `_backend = None`. The backend object
does not exist until the user calls
`dist.init_process_group(backend="ahbm")`. Any other API
(`is_initialized`, `get_world_size`, `all_reduce`, `barrier`) called
while `_backend` is None raises
`RuntimeError("Default process group has not been initialized...")` via
the `_ensure_initialized` helper.

`backend != "ahbm"` raises `ValueError` immediately. Other backend names
(`nccl`, `gloo`, etc.) are not recognized.
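
A lazy-initialization guard of this kind might look like the sketch below. The class is a reduced stand-in, not the real implementation: only `get_backend` is shown, and the error text is cut at the point where the ADR elides the rest of the message.

```python
class DistributedContext:
    def __init__(self):
        self._backend = None  # no backend until init_process_group runs

    def _ensure_initialized(self):
        # Shared guard: every post-init API funnels through this check.
        if self._backend is None:
            # The ADR elides the tail of this message; only the prefix is used.
            raise RuntimeError("Default process group has not been initialized")
        return self._backend

    def get_backend(self):
        self._ensure_initialized()
        return "ahbm"
```

Routing every public method through one helper keeps the error message identical across the API surface.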

### D2. world_size resolution priority: algorithm > defaults > topology

`AhbmCCLBackend._resolve_world_size` (ADR-0024 D1):

1. If `ccl.yaml`'s algorithm entry has `world_size`, use it.
2. Else if `defaults.world_size` is set, use it.
3. Else fall back to `spec.system.sips.count` (the topology's SIP count).

The default interpretation is **rank = SIP** (ADR-0024). Cube/PE-level
parallelism is expressed inside each rank via DPPolicy and does not
affect world_size. An explicit `ccl.yaml` override is preserved for the
legacy "rank = flat PE index" test path.

User arguments to `init_process_group(world_size=..., rank=...)` are
**accepted but ignored**: world_size comes from the resolution above and
rank from the greenlet registry (D4), much as a real PyTorch job can take
them from the `RANK` / `WORLD_SIZE` env vars instead of from call
arguments.
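
The priority chain can be illustrated with a small standalone function. The function name, the dict-shaped inputs, and the reading of "set" as "present and not `None`" are assumptions for illustration; the real `_resolve_world_size` reads from the merged ccl.yaml config and `spec`.

```python
def resolve_world_size(algo_cfg, defaults, topology_sip_count):
    """Return world_size using the priority algorithm > defaults > topology."""
    # 1. Per-algorithm override in ccl.yaml wins.
    if algo_cfg.get("world_size") is not None:
        return int(algo_cfg["world_size"])
    # 2. Otherwise the ccl.yaml defaults block.
    if defaults.get("world_size") is not None:
        return int(defaults["world_size"])
    # 3. Otherwise the topology's SIP count (rank = SIP).
    return int(topology_sip_count)


# A per-algorithm override beats both the defaults and the topology.
print(resolve_world_size({"world_size": 16}, {"world_size": 8}, 4))
```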

### D3. `init_process_group` performs five installation steps

Inside `AhbmCCLBackend.__init__`, in order:

1. **Load ccl.yaml**: `kernbench.ccl.install.load_ccl_config()` →
   `resolve_algorithm_config(_cfg_all)` produces the merged config for
   `defaults.algorithm` (or the user-specified algorithm).
2. **Import algorithm module**:
   `importlib.import_module(self._merged["module"])`. The module must
   expose a `kernel` function, a `kernel_args(world_size, n_elem,
   cube_w, cube_h)` helper, and optionally a `TOPO_NAME_TO_KIND` map.
3. **Resolve world_size** (D2).
4. **Collect topology metadata** from `spec`: `n_sips`, `sip_topo`
   (`ring_1d` default), `cube_w`/`cube_h`, `sips.w`/`sips.h`. When the
   SIP topology is not `ring_1d`, derive `_sip_topo_w/h` from explicit
   `w`/`h` or via square root (requiring `w*h == n_sips`). A mismatch
   raises `ValueError`.
5. **Install SFR + IPCQ**:
   `kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine,
   spec, self._merged)`. This pushes IPCQ neighbor tables to every
   SIP/cube's pe0 (one-time setup analogous to NCCL communicator
   creation).

If the order changes (e.g., SFR runs before the algorithm module
loads), partial initialization can result. D3 is therefore treated as an
atomic five-step block: on failure the backend remains uninstalled.
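
The topology-dimension rule in the metadata step can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, `"mesh_2d"` is a made-up non-ring topology name, returning `None` for `ring_1d` is a placeholder (the ADR does not say what is stored in that case), and explicit dims are only honored when both `w` and `h` are given.

```python
import math


def derive_sip_topo_dims(n_sips, sip_topo="ring_1d", w=None, h=None):
    """Derive (w, h) for a non-ring SIP topology, or None for ring_1d."""
    if sip_topo == "ring_1d":
        return None  # assumption: no 2-D dims are needed for a 1-D ring
    if w is None or h is None:
        # No explicit dims: assume a square grid via integer square root.
        w = h = math.isqrt(n_sips)
    if w * h != n_sips:
        raise ValueError(f"SIP topology {w}x{h} does not match n_sips={n_sips}")
    return w, h


print(derive_sip_topo_dims(4, "mesh_2d"))            # square-root path
print(derive_sip_topo_dims(6, "mesh_2d", w=3, h=2))  # explicit dims
```

A non-square SIP count without explicit dims fails the `w*h == n_sips` check, which is the mismatch case the step describes.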

### D4. Greenlet-local rank binding (ADR-0024 D2)

`DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps spawned
worker greenlets to their ranks. When the bench launcher (e.g.,
`torch.multiprocessing.spawn`) spawns a worker, it registers the worker
via `dc._bind_rank(g, rank)`.

`get_rank()` looks up the greenlet returned by `getcurrent()`.
Unregistered greenlets fall back to rank 0, which preserves
single-driver / test compatibility.

The backend reads the current greenlet's rank from
`_dist_ctx._rank_by_greenlet` during `all_reduce` (D5).
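
The registry behavior can be sketched without depending on the `greenlet` package by injecting the "current task" lookup. The `RankRegistry` class and its `getcurrent` parameter are illustrative assumptions; in the real code the key would come from `greenlet.getcurrent()` and the methods live on `DistributedContext`.

```python
class RankRegistry:
    """Maps task objects (greenlets in the real code) to ranks."""

    def __init__(self, getcurrent):
        self._getcurrent = getcurrent   # stand-in for greenlet.getcurrent
        self._rank_by_greenlet = {}

    def _bind_rank(self, g, rank):
        # Called by the launcher when it spawns a worker.
        self._rank_by_greenlet[g] = rank

    def get_rank(self):
        # Unregistered callers (main driver, tests) fall back to rank 0.
        return self._rank_by_greenlet.get(self._getcurrent(), 0)


# Simulate switching between the main driver and a spawned worker.
current = {"g": object()}
registry = RankRegistry(getcurrent=lambda: current["g"])
worker = object()
registry._bind_rank(worker, 3)
print(registry.get_rank())   # main driver is unregistered
current["g"] = worker
print(registry.get_rank())   # now running "inside" the worker
```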

### D5. `all_reduce(tensor, op="sum")` behavior

Validation:

- `op != "sum"` → `NotImplementedError`. Current kernels only
  implement add reduction.
- `tensor._handle is None` → `RuntimeError("not deployed")`.
- `tensor._handle.shards` empty → `RuntimeError("no shards")`.

Preparation:

- `n_elem = shards[0].nbytes // tensor.itemsize`: the element count of a
  single shard.
- `kernel_fn = self._algo_module.kernel`: the algorithm module's entry
  function (imported in D3).
- Decide effective cube dims: if the first SIP has just 1 cube, use
  `(1, 1)`; otherwise use the topology's `cube_w`/`cube_h`. This
  naturally absorbs TP runs that use only a subset of cubes.
- `kernel_args = self._algo_module.kernel_args(world_size, n_elem,
  cube_w, cube_h)`: the algorithm decides which arguments to pass to
  its kernel.
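
The validation and preparation rules can be condensed into one helper for illustration. The `Shard`, `Handle`, and `Tensor` dataclasses are minimal stand-ins for the real types, and `prepare_all_reduce` with its `first_sip_cube_count` parameter is a hypothetical name; the NotImplementedError message text is likewise an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Shard:
    nbytes: int


@dataclass
class Handle:
    shards: List[Shard]


@dataclass
class Tensor:
    itemsize: int
    _handle: Optional[Handle] = None


def prepare_all_reduce(tensor, op, first_sip_cube_count, cube_w, cube_h):
    """Validate inputs, then return (n_elem, effective_cube_dims)."""
    if op != "sum":
        raise NotImplementedError(f"op={op!r} is not supported; only 'sum' is")
    if tensor._handle is None:
        raise RuntimeError("not deployed")
    if not tensor._handle.shards:
        raise RuntimeError("no shards")
    # Element count of a single shard, not of the whole tensor.
    n_elem = tensor._handle.shards[0].nbytes // tensor.itemsize
    # A single-cube SIP collapses the cube grid to (1, 1).
    dims = (1, 1) if first_sip_cube_count == 1 else (cube_w, cube_h)
    return n_elem, dims


t = Tensor(itemsize=2, _handle=Handle([Shard(nbytes=4096)]))
print(prepare_all_reduce(t, "sum", first_sip_cube_count=16, cube_w=4, cube_h=4))
```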

Dispatch:

- Resolve the current greenlet's rank via
  `_rank_by_greenlet.get(g, 0)`.
- Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w,
  sip_topo_h)`.
- `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor,
  *kernel_args, *extra_args, _defer_wait=True)`: `_defer_wait=True`
  delegates the collective drain to the main scheduler (ADR-0027 D0.4).

Drain:

- If the parent greenlet is alive (multi-greenlet mode), enqueue the
  handles on `_pending_collective_handles` and switch to the parent. The
  main scheduler drains after all ranks have launched.
- In single-driver mode, drain inline:
  `for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
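
The two drain paths can be sketched with the greenlet and context machinery replaced by plain callables. Everything here other than the `(h, _sip_id, meta)` tuple shape and the `wait(h, _meta=meta)` call is an assumption: the function name, the boolean `parent_alive` flag, the use of `extend` for enqueueing, and the returned mode string exist only for illustration.

```python
def drain_or_defer(pending, parent_alive, pending_collective_handles,
                   wait, switch_to_parent):
    """Defer the drain to the main scheduler, or drain inline."""
    if parent_alive:
        # Multi-greenlet mode: hand the handles over and yield to the parent,
        # which drains once every rank has launched.
        pending_collective_handles.extend(pending)
        switch_to_parent()
        return "deferred"
    # Single-driver mode: nobody else will drain, so wait right here.
    for h, _sip_id, meta in pending:
        wait(h, _meta=meta)
    return "inline"


pending = [("h0", 0, {"tag": "a"}), ("h1", 1, {"tag": "b"})]
waited = []
mode = drain_or_defer(
    pending,
    parent_alive=False,
    pending_collective_handles=[],
    wait=lambda h, _meta: waited.append(h),
    switch_to_parent=lambda: None,
)
print(mode, waited)
```

Deferring matters in multi-greenlet mode because a collective cannot complete until every rank has launched its part; waiting inline in one worker would block the others from launching.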

### D6. `barrier()` is a no-op (single-driver model)

kernbench runs all ranks as greenlets inside a single Python process,
so no cross-process synchronization is needed. `barrier()` is callable
but performs no synchronization. It is kept for real-PyTorch API
compatibility so callers don't get `NotImplementedError`.

If multi-process kernbench (SimPy event loop per process) is introduced
in the future, D6 needs a superseding ADR.

### D7. Semantics of `get_rank` / `get_world_size` / `get_backend`

- `get_rank()` (D4): the current greenlet's bound rank; unregistered → 0.
- `get_world_size()` (D2): the world_size resolved by the backend in D3.
- `get_backend()`: always the literal string `"ahbm"`. Calling it before
  the backend exists triggers `_ensure_initialized`'s RuntimeError.

Differences vs. real PyTorch:

- Real PyTorch `get_rank()` is a process-global value; here it is
  greenlet-local. Inside a spawned worker it returns the worker's rank;
  in the main thread it returns 0. Bench authors should expect
  meaningful ranks only inside worker functions.

### D8. Supported API surface (final)

`DistributedContext` exposes:

- `init_process_group(backend="ahbm", world_size=None, rank=None,
  **kwargs)`
- `is_initialized() -> bool`
- `get_world_size() -> int`
- `get_rank() -> int`
- `get_backend() -> str`
- `all_reduce(tensor, op="sum") -> None`
- `barrier() -> None`
- (internal) `_bind_rank(g, rank)`

Other PyTorch distributed APIs (`broadcast`, `reduce`, `all_gather`,
`gather`, `scatter`, point-to-point `send/recv`, etc.) are **not
implemented**. Kernel-level expression is available via
`tl.send`/`tl.recv` (ADR-0046 D3.10), but the `dist.*` surface does not
expose them. If additional collectives are needed, add a paired
(algorithm module, `DistributedContext` method) and extend D8.
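
One way a shim could turn calls to the unsupported names into an explicit error is sketched below. This is purely hypothetical: the ADR does not say how rejection is implemented, and the `DistributedSurface` class, the `UNSUPPORTED` set, and the message wording are all assumptions.

```python
UNSUPPORTED = frozenset(
    {"broadcast", "reduce", "all_gather", "gather", "scatter", "send", "recv"}
)


class DistributedSurface:
    """Hypothetical dist.* facade that rejects unsupported collectives."""

    def barrier(self):
        return None  # supported, and a no-op per D6

    def __getattr__(self, name):
        # __getattr__ only runs when normal attribute lookup fails.
        if name in UNSUPPORTED:
            def _rejected(*args, **kwargs):
                raise NotImplementedError(
                    f"dist.{name} is not implemented by the ahbm backend"
                )
            return _rejected
        raise AttributeError(name)


dist = DistributedSurface()
try:
    dist.broadcast(object(), src=0)
except NotImplementedError as exc:
    print(exc)
```

Returning a callable that raises, rather than raising on attribute access, keeps `hasattr` checks on unknown names well-behaved.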

## Alternatives Considered

### A1. Create the backend in `RuntimeContext.__init__`

Rejected. If `ccl.yaml` is missing or the algorithm module can't be
imported, RuntimeContext construction would fail even when the bench
does not use distributed features. Lazy creation at call time (D1) is
the right semantics.

### A2. Always derive world_size from topology (no override)

Rejected. ADR-0024 D1's "explicit override" path is used by legacy
tests. Diagnostic scenarios that define PE-level ranks within a single
SIP also need this escape hatch.

### A3. Silent fallback for unsupported `op`

Rejected. If the user intends `op="prod"` / `"max"` / `"avg"` and a
`sum` silently runs instead, validating results becomes very hard. An
explicit `NotImplementedError` is safer.

### A4. Implement `barrier` as a SimPy event

Rejected (currently). With single-driver semantics there is no
cross-process synchronization to express, so a no-op is semantically
correct. A fake-barrier SimPy event would add code complexity for no
semantic gain. Revisit when multi-process kernbench arrives.

## Consequences

- The five-step installation (D3) for
  `torch.distributed.init_process_group(backend="ahbm")` is locked in,
  making clear where future collective algorithms must hook in.
- The priority order in D2 (algorithm > defaults > topology) makes the
  blast radius of ccl.yaml changes easy to determine.
- The no-op `barrier` (D6) is recorded so that multi-process kernbench,
  if introduced, must explicitly supersede this ADR.
- D8's list of unsupported APIs explicitly grounds the rejection
  message when users call, e.g., `dist.broadcast(...)`.