kernbench2/docs/adr/ADR-0047-par-ahbm-ccl-backend.md

# ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim

## Status

Accepted (2026-05-22).

Pins down what `runtime_api/distributed.py`'s `AhbmCCLBackend` +
`DistributedContext` actually install — i.e., the entry point
`torch.distributed.init_process_group(backend="ahbm")` — and how
`all_reduce`/`barrier`/`get_rank` etc. are implemented. ADR-0023 D11
mentions the "torch.distributed compatibility" intent, but **the backend
itself** had no ADR-level coverage.

## First action

`RuntimeContext.__post_init__` automatically constructs a
`DistributedContext()` and attaches it to `self.distributed`. The first
action at that moment:

1. `self._backend: AhbmCCLBackend | None = None` — uninitialized.
2. `self._rank_by_greenlet: dict = {}` — greenlet-local rank registry
   (ADR-0024 D2).
3. The caller (RuntimeContext) sets `dc._ctx_ref = self` so subsequent
   `init_process_group` can reach `ctx.engine` / `ctx.spec` / `ctx.launch`.

In short, **DistributedContext's first act is "attach to RuntimeContext
with a back-reference and leave the backend slot empty"**. Actual
backend installation (IPCQ install, world_size derivation, algorithm
module import) happens only when user code calls
`torch.distributed.init_process_group(backend="ahbm")`.

At that moment, `init_process_group`'s first action is:

1. If `backend != "ahbm"`, raise `ValueError("Unsupported backend ...")`
   immediately.
2. If `getattr(self, "_ctx_ref", None)` is None,
   `RuntimeError("DistributedContext not bound to a RuntimeContext")`.
3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)` — inside this
   constructor, ccl.yaml is loaded, the algorithm module is imported,
   world_size is derived, SFR is configured, and IPCQ is installed.
4. `self._backend._dist_ctx = self` — the backend gets a back-reference
   so it can read `_rank_by_greenlet`.

## Context

The `AhbmCCLBackend` exists so that PyTorch DDP collective calls
(`init_process_group`, `all_reduce`, etc.) work unchanged and bench code
reads identically to a real DDP training script (in line with
ADR-0024 + ADR-0027's launcher model).

The backend's responsibilities:

- At `init_process_group` time, install the **IPCQ neighbor table once**
  (analogous to NCCL communicator creation).
- For each `all_reduce(tensor, op="sum")`, dispatch the configured
  algorithm's kernel function via `ctx.launch(...)`.
- Answer `get_world_size` / `get_rank` consistently from the
  greenlet-local rank registry plus ccl.yaml/topology.

ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) touch
parts of this, but **the backend's own responsibility scope and decision
order** are not pinned anywhere. This ADR fills that gap.

## Decision

### D1. The backend is created only at `init_process_group(backend="ahbm")` time

`DistributedContext` starts with `_backend = None`. The backend object
does not exist until the user calls
`dist.init_process_group(backend="ahbm")`. Any other API
(`is_initialized`, `get_world_size`, `all_reduce`, `barrier`) called
while `_backend` is None raises
`RuntimeError("Default process group has not been initialized...")` via
the `_ensure_initialized` helper.

`backend != "ahbm"` raises `ValueError` immediately. Other backend names
(`nccl`, `gloo`, etc.) are not recognized.

### D2. world_size resolution priority — algorithm > defaults > topology

`AhbmCCLBackend._resolve_world_size` (ADR-0024 D1):

1. If `ccl.yaml`'s algorithm entry has `world_size`, use it.
2. Else if `defaults.world_size` is set, use it.
3. Else fall back to `spec.system.sips.count` (the topology's SIP count).

The default interpretation is **rank = SIP** (ADR-0024). Cube/PE-level
parallelism is expressed inside each rank via DPPolicy and does not
affect world_size. An explicit `ccl.yaml` override is preserved for the
legacy "rank = flat PE index" test path.

User arguments to `init_process_group(world_size=..., rank=...)` are
**accepted but ignored** (same as real PyTorch's `RANK` / `WORLD_SIZE`
env vars).

### D3. `init_process_group` performs four installation steps

Inside `AhbmCCLBackend.__init__`, in order:

1. **Load ccl.yaml**: `kernbench.ccl.install.load_ccl_config()` →
   `resolve_algorithm_config(_cfg_all)` produces the merged config for
   `defaults.algorithm` (or the user-specified algorithm).
2. **Import algorithm module**:
   `importlib.import_module(self._merged["module"])`. The module must
   expose a `kernel` function, a `kernel_args(world_size, n_elem,
   cube_w, cube_h)` helper, and optionally a `TOPO_NAME_TO_KIND` map.
3. **Resolve world_size** (D2).
4. **Collect topology metadata** from `spec`: `n_sips`, `sip_topo`
   (`ring_1d` default), `cube_w`/`cube_h`, `sips.w`/`sips.h`. When the
   SIP topology is not `ring_1d`, derive `_sip_topo_w/h` from explicit
   `w`/`h` or via square-root (require `w*h == n_sips`). Mismatch raises
   `ValueError`.
5. **Install SFR + IPCQ**:
   `kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine,
   spec, self._merged)`. This pushes IPCQ neighbor tables to every
   SIP/cube's pe0 (one-time setup analogous to NCCL communicator
   creation).

If the order changes (e.g., SFR runs before the algorithm module
loads), partial initialization can result. So D3 is treated as an
atomic 4-step block — on failure the backend remains uninstalled.

### D4. Greenlet-local rank binding (ADR-0024 D2)

`DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps spawned
worker greenlets to their ranks. When the bench launcher (e.g.,
`torch.multiprocessing.spawn`) spawns a worker, it registers via
`dc._bind_rank(g, rank)`.

`get_rank()` looks up `getcurrent()`'s greenlet. Unregistered greenlets
fall back to 0 — preserves single-driver / test compatibility.

The backend reads the current greenlet's rank from
`_dist_ctx._rank_by_greenlet` during `all_reduce` (D5).

### D5. `all_reduce(tensor, op="sum")` behavior

Validation:

- `op != "sum"` → `NotImplementedError`. Current kernels only
  implement add reduction.
- `tensor._handle is None` → `RuntimeError("not deployed")`.
- `tensor._handle.shards` empty → `RuntimeError("no shards")`.

Preparation:

- `n_elem = shards[0].nbytes // tensor.itemsize` — element count of a
  single shard.
- `kernel_fn = self._algo_module.kernel` — the algorithm module's entry
  function (imported in D3).
- Decide effective cube dims: if the first SIP has just 1 cube, use
  `(1, 1)`; otherwise use the topology's `cube_w`/`cube_h`. This
  naturally absorbs TP runs that use only a subset of cubes.
- `kernel_args = self._algo_module.kernel_args(world_size, n_elem,
  cube_w, cube_h)` — the algorithm decides which arguments to pass to
  its kernel.

Dispatch:

- Resolve the current greenlet's rank via
  `_rank_by_greenlet.get(g, 0)`.
- Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w,
  sip_topo_h)`.
- `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor,
  *kernel_args, *extra_args, _defer_wait=True)` — `_defer_wait=True`
  delegates collective drain to the main scheduler (ADR-0027 D0.4).

Drain:

- If the parent greenlet is alive (multi-greenlet mode), enqueue
  `_pending_collective_handles` and switch to parent. The main
  scheduler drains after all ranks have launched.
- If single-driver mode, drain inline:
  `for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.

### D6. `barrier()` is a no-op (single-driver model)

kernbench runs all ranks as greenlets inside a single Python process,
so no cross-process synchronization is needed. `barrier()` is callable
but does no synchronization. Kept for real-PyTorch API compatibility so
callers don't get `NotImplementedError`.

If multi-process kernbench (SimPy event loop per process) is introduced
in the future, D6 needs a superseding ADR.

### D7. Semantics of `get_rank` / `get_world_size` / `get_backend`

- `get_rank()` (D4): the current greenlet's bound rank; unregistered → 0.
- `get_world_size()` (D2): the world_size resolved by the backend in D3.
- `get_backend()`: always the literal string `"ahbm"`. Calling before
  backend exists triggers `_ensure_initialized`'s RuntimeError.

Differences vs. real PyTorch:

- Real PyTorch `get_rank()` is a process-global value; here it is
  greenlet-local. Inside a spawned worker → the worker's rank; in the
  main thread → 0. Bench authors should expect meaningful ranks only
  inside worker functions.

### D8. Supported API surface (final)

`DistributedContext` exposes:

- `init_process_group(backend="ahbm", world_size=None, rank=None,
  **kwargs)`
- `is_initialized() -> bool`
- `get_world_size() -> int`
- `get_rank() -> int`
- `get_backend() -> str`
- `all_reduce(tensor, op="sum") -> None`
- `barrier() -> None`
- (internal) `_bind_rank(g, rank)`

Other PyTorch distributed APIs (`broadcast`, `reduce`, `all_gather`,
`gather`, `scatter`, point-to-point `send/recv`, etc.) are **not
implemented**. Kernel-level expression is available via
`tl.send`/`tl.recv` (ADR-0046 D3.10), but the `dist.*` surface does not
expose them. If additional collectives are needed, add a paired
(algorithm module, `DistributedContext` method) and extend D8.

## Alternatives Considered

### A1. Create the backend in `RuntimeContext.__init__`

Rejected. If `ccl.yaml` is missing or the algorithm module can't be
imported, RuntimeContext construction would fail even when the bench
does not use distributed features. Lazy creation at call time (D1) is
the right semantics.

### A2. Always derive world_size from topology (no override)

Rejected. ADR-0024 D1's "explicit override" path is used by legacy
tests. Diagnostic scenarios that define PE-level ranks within a single
SIP also need this escape hatch.

### A3. Silent fallback for unsupported `op`

Rejected. If the user intends `op="prod"` / `"max"` / `"avg"` and silent
`sum` runs instead, result validation gets very hard. Explicit
`NotImplementedError` is safer.

### A4. Implement `barrier` as a SimPy event

Rejected (currently). With single-driver semantics there is no
cross-process synchronization to express, so a no-op is meaningfully
correct. A fake-barrier SimPy event would add code complexity for no
semantic gain. Revisit when multi-process kernbench arrives.

## Consequences

- The 4-step installation (D3) for
  `torch.distributed.init_process_group(backend="ahbm")` is locked in,
  making clear where future collective algorithms must hook.
- The priority order in D2 (algorithm > defaults > topology) makes the
  blast radius of ccl.yaml changes quickly knowable.
- The no-op `barrier` (D6) is recorded so multi-process kernbench, if
  introduced, must explicitly supersede this ADR.
- D8's list of unsupported APIs explicitly grounds the rejection
  message when users call, e.g., `dist.broadcast(...)`.