kernbench2/docs/adr/ADR-0047-par-ahbm-ccl-backend.md
ywkang 9a02955770 adr: add ADR-0046-0049 — close G4 coverage gaps from /report
Documents four cross-cutting surfaces that previously had no ADR backing,
each surfaced as a G4 candidate by /report:

- 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates
  all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...),
  the two execution modes (command-list vs greenlet runner), scratch
  allocator semantics, dispatch-overhead model, and the kernel registry.

- 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group
  (backend="ahbm") install path. world_size priority (algorithm >
  defaults > topology), the 4-step init sequence (load ccl.yaml, import
  algorithm module, derive world_size, install SFR + IPCQ), greenlet-
  local rank registry, all_reduce dispatch via _defer_wait, barrier
  no-op rationale, and the explicit list of unsupported dist.* APIs.

- 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator
  free-list semantics. Offset-keyed first-fit with coalescing, the
  no-validation trust model for free(), HBM/TCM channel separation,
  page-aligned VA allocation, the page_size dual-default
  (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and
  one-allocator-per-sub-unit rule.

- 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog.
  H2D / D2H / PE DMA categories with their exact cube-index choices,
  the 32 KiB reference size, the 5-point utilization sweep, the
  formula vs actual column meanings, automatic invariant checks
  (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine
  isolation, and the human-readable (not machine-parsable) output
  contract.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:25:04 -07:00

ADR-0047: AHBM CCL Backend — torch.distributed-compat shim

Status

Accepted (2026-05-22).

Pins down what runtime_api/distributed.py's AhbmCCLBackend + DistributedContext actually install — i.e., the entry point torch.distributed.init_process_group(backend="ahbm") — and how all_reduce/barrier/get_rank etc. are implemented. ADR-0023 D11 mentions the "torch.distributed compatibility" intent, but the backend itself had no ADR-level coverage.

First action

RuntimeContext.__post_init__ automatically constructs a DistributedContext() and attaches it to self.distributed. The first action at that moment:

  1. self._backend: AhbmCCLBackend | None = None — uninitialized.
  2. self._rank_by_greenlet: dict = {} — greenlet-local rank registry (ADR-0024 D2).
  3. The caller (RuntimeContext) sets dc._ctx_ref = self so subsequent init_process_group can reach ctx.engine / ctx.spec / ctx.launch.

In short, DistributedContext's first act is "attach to RuntimeContext with a back-reference and leave the backend slot empty". Actual backend installation (IPCQ install, world_size derivation, algorithm module import) happens only when user code calls torch.distributed.init_process_group(backend="ahbm").

At that moment, init_process_group's first action is:

  1. If backend != "ahbm", raise ValueError("Unsupported backend ...") immediately.
  2. If getattr(self, "_ctx_ref", None) is None, raise RuntimeError("DistributedContext not bound to a RuntimeContext").
  3. self._backend = AhbmCCLBackend(torch_ctx=ctx) — inside this constructor, ccl.yaml is loaded, the algorithm module is imported, world_size is derived, SFR is configured, and IPCQ is installed.
  4. self._backend._dist_ctx = self — the backend gets a back-reference so it can read _rank_by_greenlet.
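The lifecycle described above (empty backend slot at construction, guarded installation at init_process_group time) can be sketched as follows. This is a minimal illustration, not the code in runtime_api/distributed.py: DistributedContextSketch, StubBackend, and the backend_factory parameter are hypothetical names introduced here so the example runs without kernbench.

```python
class StubBackend:
    """Hypothetical stand-in for AhbmCCLBackend (no ccl.yaml, SFR, or IPCQ work)."""

    def __init__(self, torch_ctx):
        self.torch_ctx = torch_ctx
        self._dist_ctx = None


class DistributedContextSketch:
    def __init__(self, backend_factory=StubBackend):
        self._backend = None            # empty slot until init_process_group
        self._rank_by_greenlet = {}     # greenlet -> rank registry
        self._ctx_ref = None            # set by the owning RuntimeContext
        self._backend_factory = backend_factory  # assumption: injected for testing

    def init_process_group(self, backend="ahbm", world_size=None, rank=None, **kwargs):
        # 1. Only the "ahbm" backend name is recognized.
        if backend != "ahbm":
            raise ValueError(f"Unsupported backend {backend!r}")
        # 2. The context must already be attached to a RuntimeContext.
        ctx = getattr(self, "_ctx_ref", None)
        if ctx is None:
            raise RuntimeError("DistributedContext not bound to a RuntimeContext")
        # 3. Construct first, assign second: if construction raises,
        #    self._backend stays None and the backend remains uninstalled.
        new_backend = self._backend_factory(torch_ctx=ctx)
        # 4. Give the backend a back-reference for rank lookups.
        new_backend._dist_ctx = self
        self._backend = new_backend

    def _ensure_initialized(self):
        if self._backend is None:
            raise RuntimeError("Default process group has not been initialized")
        return self._backend
```

Assigning self._backend only after the constructor returns is one simple way to keep a failed installation from leaving a half-built backend behind.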

Context

The AhbmCCLBackend exists so that PyTorch DDP collective calls (init_process_group, all_reduce, etc.) work unchanged and bench code reads identically to a real DDP training script (in line with ADR-0024 + ADR-0027's launcher model).

The backend's responsibilities:

  • At init_process_group time, install the IPCQ neighbor table once (analogous to NCCL communicator creation).
  • For each all_reduce(tensor, op="sum"), dispatch the configured algorithm's kernel function via ctx.launch(...).
  • Answer get_world_size / get_rank consistently from the greenlet-local rank registry plus ccl.yaml/topology.

ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) touch parts of this, but the backend's own responsibility scope and decision order are not pinned anywhere. This ADR fills that gap.

Decision

D1. The backend is created only at init_process_group(backend="ahbm") time

DistributedContext starts with _backend = None. The backend object does not exist until the user calls dist.init_process_group(backend="ahbm"). Any other API (is_initialized, get_world_size, all_reduce, barrier) called while _backend is None raises RuntimeError("Default process group has not been initialized...") via the _ensure_initialized helper.

backend != "ahbm" raises ValueError immediately. Other backend names (nccl, gloo, etc.) are not recognized.

D2. world_size resolution priority — algorithm > defaults > topology

AhbmCCLBackend._resolve_world_size (ADR-0024 D1):

  1. If ccl.yaml's algorithm entry has world_size, use it.
  2. Else if defaults.world_size is set, use it.
  3. Else fall back to spec.system.sips.count (the topology's SIP count).
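The priority order can be written as a short function. This is a sketch under the assumption that the algorithm entry and the defaults section are plain dicts after ccl.yaml is loaded; resolve_world_size and its parameter names are illustrative, not the signature of AhbmCCLBackend._resolve_world_size.

```python
def resolve_world_size(algo_cfg: dict, defaults: dict, n_sips: int) -> int:
    """Return world_size using the priority algorithm > defaults > topology."""
    # 1. An explicit world_size on the algorithm entry wins.
    if algo_cfg.get("world_size") is not None:
        return int(algo_cfg["world_size"])
    # 2. Otherwise use defaults.world_size when it is set.
    if defaults.get("world_size") is not None:
        return int(defaults["world_size"])
    # 3. Otherwise fall back to the topology's SIP count (rank = SIP).
    return int(n_sips)
```

For example, resolve_world_size({}, {}, 4) falls through to the topology and yields 4, while an algorithm entry with world_size: 8 overrides both other sources.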

The default interpretation is rank = SIP (ADR-0024). Cube/PE-level parallelism is expressed inside each rank via DPPolicy and does not affect world_size. An explicit ccl.yaml override is preserved for the legacy "rank = flat PE index" test path.

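User arguments to init_process_group(world_size=..., rank=...) are accepted for signature compatibility but ignored: world_size always comes from the priority order above, and rank comes from the greenlet binding (D4). This mirrors real PyTorch scripts that leave these arguments unset and rely on the RANK / WORLD_SIZE environment variables.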

D3. init_process_group performs four installation steps

Inside AhbmCCLBackend.__init__, in order:

  1. Load ccl.yaml: kernbench.ccl.install.load_ccl_config() → resolve_algorithm_config(_cfg_all) produces the merged config for defaults.algorithm (or the user-specified algorithm).
  2. Import algorithm module: importlib.import_module(self._merged["module"]). The module must expose a kernel function, a kernel_args(world_size, n_elem, cube_w, cube_h) helper, and optionally a TOPO_NAME_TO_KIND map.
  3. Resolve world_size (D2).
  4. Collect topology metadata from spec: n_sips, sip_topo (ring_1d default), cube_w/cube_h, sips.w/sips.h. When the SIP topology is not ring_1d, derive _sip_topo_w/h from explicit w/h or via square-root (require w*h == n_sips). Mismatch raises ValueError.
  5. Install SFR + IPCQ: kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine, spec, self._merged). This pushes IPCQ neighbor tables to every SIP/cube's pe0 (one-time setup analogous to NCCL communicator creation).

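The order matters: if it changes (e.g., SFR configuration runs before the algorithm module loads), the backend can end up partially initialized. D3 is therefore treated as a single atomic block, and on failure the backend remains uninstalled. The "four steps" in the heading count load, import, resolve, and install; the topology metadata collection in item 4 is preparation for the install step.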
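The width/height derivation in item 4 can be illustrated with a small helper. This is a sketch, not the backend's code: derive_sip_topo_dims is a hypothetical name, and treating a partially specified w/h pair the same as a missing one is an assumption the ADR does not settle.

```python
import math


def derive_sip_topo_dims(n_sips, w=None, h=None):
    """Return (w, h) for a non-ring SIP topology, requiring w * h == n_sips."""
    if w is None or h is None:
        # Assumption: without an explicit pair, use a square layout.
        side = math.isqrt(n_sips)
        w, h = side, side
    if w * h != n_sips:
        raise ValueError(
            f"SIP topology {w}x{h} does not match n_sips={n_sips}"
        )
    return w, h
```

With n_sips=4 and no explicit dimensions this yields (2, 2); with n_sips=6 the square-root path gives 2x2, which fails the product check and raises ValueError.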

D4. Greenlet-local rank binding (ADR-0024 D2)

DistributedContext._rank_by_greenlet: dict[greenlet, int] maps spawned worker greenlets to their ranks. When the bench launcher (e.g., torch.multiprocessing.spawn) spawns a worker, it registers via dc._bind_rank(g, rank).

get_rank() looks up the greenlet returned by getcurrent(). Unregistered greenlets fall back to rank 0, which preserves single-driver and test compatibility.

The backend reads the current greenlet's rank from _dist_ctx._rank_by_greenlet during all_reduce (D5).
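The registry behavior in D4 can be sketched without the greenlet package by injecting the "who is running now" lookup. RankRegistry and its current parameter are hypothetical; in the real code that role is played by greenlet's getcurrent(), and the registry lives on DistributedContext.

```python
class RankRegistry:
    """Maps worker identities to ranks; unregistered callers get rank 0."""

    def __init__(self, current):
        # `current` is a stand-in for greenlet.getcurrent: a zero-argument
        # callable returning a hashable object identifying the running worker.
        self._current = current
        self._rank_by_greenlet = {}

    def _bind_rank(self, g, rank):
        # Called by the launcher when it spawns worker `g`.
        self._rank_by_greenlet[g] = rank

    def get_rank(self):
        # Unregistered identities (e.g. the main driver) fall back to 0.
        return self._rank_by_greenlet.get(self._current(), 0)


# Usage: simulate switching between two workers and an unregistered driver.
active = {"g": None}
registry = RankRegistry(current=lambda: active["g"])
worker_a, worker_b = object(), object()
registry._bind_rank(worker_a, 0)
registry._bind_rank(worker_b, 1)
active["g"] = worker_b
rank_in_worker_b = registry.get_rank()
active["g"] = object()  # not registered
rank_in_driver = registry.get_rank()
```

Keying the dict by the worker object itself is what makes the rank local to each greenlet rather than global to the process.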

D5. all_reduce(tensor, op="sum") behavior

Validation:

  • op != "sum" → NotImplementedError. Current kernels only implement add reduction.
  • tensor._handle is None → RuntimeError("not deployed").
  • tensor._handle.shards empty → RuntimeError("no shards").

Preparation:

  • n_elem = shards[0].nbytes // tensor.itemsize — element count of a single shard.
  • kernel_fn = self._algo_module.kernel — the algorithm module's entry function (imported in D3).
  • Decide effective cube dims: if the first SIP has just 1 cube, use (1, 1); otherwise use the topology's cube_w/cube_h. This naturally absorbs TP runs that use only a subset of cubes.
  • kernel_args = self._algo_module.kernel_args(world_size, n_elem, cube_w, cube_h) — the algorithm decides which arguments to pass to its kernel.
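The validation and preparation steps can be sketched as one function. The tensor here is a stand-in built with SimpleNamespace that exposes only the attributes the ADR names (_handle, shards, nbytes, itemsize); prepare_all_reduce and the cubes_in_first_sip parameter are hypothetical names for illustration.

```python
from types import SimpleNamespace


def prepare_all_reduce(tensor, op, cubes_in_first_sip, cube_w, cube_h):
    """Validate an all_reduce call and return (n_elem, (eff_w, eff_h))."""
    # Validation: fail loudly instead of silently running a sum.
    if op != "sum":
        raise NotImplementedError(f"op={op!r} is not supported; only 'sum'")
    if tensor._handle is None:
        raise RuntimeError("not deployed")
    shards = tensor._handle.shards
    if not shards:
        raise RuntimeError("no shards")
    # Preparation: element count of a single shard.
    n_elem = shards[0].nbytes // tensor.itemsize
    # Effective cube dims: a one-cube SIP collapses to (1, 1).
    effective = (1, 1) if cubes_in_first_sip == 1 else (cube_w, cube_h)
    return n_elem, effective


# Usage with a fake 2-byte-element tensor holding one 4096-byte shard.
fake_tensor = SimpleNamespace(
    itemsize=2,
    _handle=SimpleNamespace(shards=[SimpleNamespace(nbytes=4096)]),
)
n_elem, dims = prepare_all_reduce(fake_tensor, "sum", 4, cube_w=2, cube_h=2)
```

The returned values correspond to the n_elem, cube_w, and cube_h arguments that the algorithm module's kernel_args helper receives alongside world_size.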

Dispatch:

  • Resolve the current greenlet's rank via _rank_by_greenlet.get(g, 0).
  • Append extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h).
  • pending = self.ctx.launch(algorithm_name, kernel_fn, tensor, *kernel_args, *extra_args, _defer_wait=True). Passing _defer_wait=True delegates collective drain to the main scheduler (ADR-0027 D0.4).

Drain:

  • If the parent greenlet is alive (multi-greenlet mode), enqueue _pending_collective_handles and switch to parent. The main scheduler drains after all ranks have launched.
  • If single-driver mode, drain inline: for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta).
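The two drain paths can be sketched as a single decision. Everything here is illustrative: drain_or_defer, parent_alive, pending_queue, wait, and switch_to_parent are hypothetical parameters standing in for the greenlet parent check, _pending_collective_handles, ctx.wait, and the greenlet switch.

```python
def drain_or_defer(pending, parent_alive, pending_queue, wait, switch_to_parent):
    """Hand pending handles to the scheduler, or wait on them inline."""
    if parent_alive:
        # Multi-greenlet mode: queue the handles and yield to the parent,
        # which drains once every rank has launched.
        pending_queue.extend(pending)
        switch_to_parent()
        return "deferred"
    # Single-driver mode: nobody else will drain, so wait right here.
    for handle, _sip_id, meta in pending:
        wait(handle, _meta=meta)
    return "inline"


# Usage: record calls instead of touching a real engine.
waited, switches, queue = [], [], []
pending = [("h0", 0, {"tag": "a"}), ("h1", 1, {"tag": "b"})]

inline_mode = drain_or_defer(
    pending, False, queue,
    wait=lambda h, _meta: waited.append((h, _meta["tag"])),
    switch_to_parent=lambda: switches.append("switch"),
)
deferred_mode = drain_or_defer(
    pending, True, queue,
    wait=lambda h, _meta: waited.append((h, _meta["tag"])),
    switch_to_parent=lambda: switches.append("switch"),
)
```

Deferring in multi-greenlet mode matters because a collective cannot complete until all ranks have launched; waiting inline inside one worker would block before the others get to run.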

D6. barrier() is a no-op (single-driver model)

kernbench runs all ranks as greenlets inside a single Python process, so no cross-process synchronization is needed. barrier() is callable but does no synchronization. Kept for real-PyTorch API compatibility so callers don't get NotImplementedError.

If multi-process kernbench (SimPy event loop per process) is introduced in the future, D6 needs a superseding ADR.

D7. Semantics of get_rank / get_world_size / get_backend

  • get_rank() (D4): the current greenlet's bound rank; unregistered → 0.
  • get_world_size() (D2): the world_size resolved by the backend in D3.
  • get_backend(): always the literal string "ahbm". Calling before backend exists triggers _ensure_initialized's RuntimeError.

Differences vs. real PyTorch:

  • Real PyTorch get_rank() is a process-global value; here it is greenlet-local. Inside a spawned worker → the worker's rank; in the main thread → 0. Bench authors should expect meaningful ranks only inside worker functions.

D8. Supported API surface (final)

DistributedContext exposes:

  • init_process_group(backend="ahbm", world_size=None, rank=None, **kwargs)
  • is_initialized() -> bool
  • get_world_size() -> int
  • get_rank() -> int
  • get_backend() -> str
  • all_reduce(tensor, op="sum") -> None
  • barrier() -> None
  • (internal) _bind_rank(g, rank)

Other PyTorch distributed APIs (broadcast, reduce, all_gather, gather, scatter, point-to-point send/recv, etc.) are not implemented. Kernel-level expression is available via tl.send/tl.recv (ADR-0046 D3.10), but the dist.* surface does not expose them. If additional collectives are needed, add a paired (algorithm module, DistributedContext method) and extend D8.

Alternatives Considered

A1. Create the backend in RuntimeContext.__init__

Rejected. If ccl.yaml is missing or the algorithm module can't be imported, RuntimeContext construction would fail even when the bench does not use distributed features. Lazy creation at call time (D1) is the right semantics.

A2. Always derive world_size from topology (no override)

Rejected. ADR-0024 D1's "explicit override" path is used by legacy tests. Diagnostic scenarios that define PE-level ranks within a single SIP also need this escape hatch.

A3. Silent fallback for unsupported op

Rejected. If the user intends op="prod" / "max" / "avg" and a sum silently runs instead, the wrong result is very hard to trace back to its cause during validation. An explicit NotImplementedError is safer.

A4. Implement barrier as a SimPy event

Rejected (currently). With single-driver semantics there is no cross-process synchronization to express, so a no-op is semantically correct. A fake-barrier SimPy event would add code complexity for no semantic gain. Revisit when multi-process kernbench arrives.

Consequences

  • The 4-step installation (D3) for torch.distributed.init_process_group(backend="ahbm") is locked in, making clear where future collective algorithms must hook.
  • The priority order in D2 (algorithm > defaults > topology) makes it easy to determine the blast radius of a ccl.yaml change.
  • The no-op barrier (D6) is recorded so multi-process kernbench, if introduced, must explicitly supersede this ADR.
  • D8's explicit list of unsupported APIs gives a documented basis for the rejection when users call, e.g., dist.broadcast(...).