kernbench2/docs/adr/ADR-0032-algo-intercube-allreduce.md

# ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange

## Status

Accepted (supersedes ADR-0029).

## Context

### Goal

Define a single all-reduce algorithm that exploits the topology hierarchy:
cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.

### Why replace ADR-0029 (hierarchical 3-level)

ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
where every PE in the system participates. In practice this adds the
intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast)
without matching the common workload pattern where the tensor is sharded
**per cube** (not per PE within a cube).

Moreover, the hierarchical design required:
- per-PE neighbor graph installation (`_build_pe_installs` multi-level)
- multi-level topology schema (`hierarchical_3level`)
- `all_pes` mapper + `multi_pe_sip_local` validator infrastructure

The intercube algorithm below removes all of that: **pe0-only same-lane
intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the
root cube, then broadcast back. Simpler kernel, simpler wiring, same
bandwidth characteristics for the common per-cube DP workload.

### Current state

- `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` — kernel
- `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip`
- `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this
  automatically at `init_process_group` time.
- Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`,
  `hierarchical_allreduce` modules and their tests are **removed**.

---

## Decision

### D1. Algorithm structure — 5 phases (center-root, bidirectional)

The root cube sits at the geometric **center** of the cube mesh:

```
root_col  = cube_w // 2
root_row  = cube_h // 2
root_cube = root_row * cube_w + root_col   # center; 10 on a 4×4 mesh
```

Each reduce/broadcast phase converges/diverges **bidirectionally** toward
this center, halving the intra-SIP critical path versus a corner-root walk
(4×4 mesh: 4 hops reduce + 4 hops broadcast vs 6+6 with an SE-corner root).

For each SIP (launched concurrently by `mp.spawn`):

```
Phase 1 — Row reduce converging at col == root_col (cube mesh, pe0 only):
    left half (col < root_col) walks W→E; right half (col > root_col)
    walks E→W; the root_col cube merges both sides → holds row sum.

Phase 2 — Col reduce on col == root_col converging at row == root_row:
    above (row < root_row) walks N→S; below (row > root_row) walks S→N;
    the root cube merges both → holds the full SIP sum.

Phase 3 — Inter-SIP exchange on cube_id == root_cube (pe0 only):
    Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
    selected by sip_topo_kind (from topology.yaml sips.topology).

Phase 4 — Col broadcast on col == root_col, outward from root_row.

Phase 5 — Row broadcast outward from root_col across the cube mesh.
```

After all phases every cube's pe0 holds the global sum.

**Single-cube fast-path**: when `cube_w == cube_h == 1` (one cube per rank,
the common TP case), the intra-SIP reduce/broadcast phases are skipped and
the kernel goes straight to the Phase 3 inter-SIP exchange.

The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
across topologies; only phase 3 branches. Helper functions
`_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the
three exchange patterns.

### D2. Tensor layout (rank = SIP, per-worker)

Per ADR-0024 rank = SIP at the process-group level. Each worker allocates
its own cube-mesh-spanning tensor:

```python
dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
```

Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`.

### D3. SFR / IPCQ wiring — `configure_sfr_intercube_multisip`

Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor
tables for **every cube's pe0 across every SIP** — regardless of which
cube is the root or which SIP topology is selected. This lets the kernel
elect the root cube at runtime and supports topology switches without
re-wiring.

| Level | Direction labels | Scope |
|---|---|---|
| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per `sips.topology` |

Inter-SIP directions use the `global_*` prefix to keep the namespace
disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended
with `global_E ↔ global_W` and `global_N ↔ global_S` so the reverse-
direction resolver handles 2-SIP bidirectional rings correctly.

Internally the function calls `install_ipcq` with:
- `world_size = n_sips × n_cubes`
- `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
- A closure-captured `neighbors()` function that builds the map above.

This `world_size` is internal to IPCQ wiring and does not leak to the
process-group rank.

### D4. SIP topology — from `topology.yaml`

```yaml
system:
  sips:
    count: 2
    topology: ring_1d       # or torus_2d, mesh_2d_no_wrap
```

- `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
- `torus_2d`: `w × h` wrapping mesh. Row ring on `global_E/W` then col
  ring on `global_S/N`.
- `mesh_2d_no_wrap`: `w × h` mesh without wrap-around. Chain reduce +
  broadcast per dimension.

2D grid dims `(w, h)` come from `system.sips.w/h` (ADR-0024 D5). A square
fallback (`round(sqrt(n_sips))²`) applies **only** when `w/h` are omitted,
so rectangular grids (e.g. 6 SIPs as `3×2`) are supported by giving
explicit `w/h`.

### D5. Process-group integration — `AhbmCCLBackend`

At `init_process_group` time the backend:

1. Loads `ccl.yaml` + `topology.yaml`.
2. Derives `sip_topo_kind` from `system.sips.topology` via the algorithm
   module's `TOPO_NAME_TO_KIND`, and `sip_topo_w, sip_topo_h` from
   `system.sips.w/h` with a square-only fallback (ADR-0024 D5).
3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time
   SFR wiring, mirrors NCCL communicator creation.

At each `dist.all_reduce(tensor)` call:

1. Resolves `kernel_fn` from `cfg["module"]`.
2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from
   `kernel_args(world_size, n_elem)`.
3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
   `sip_rank` is the current greenlet's bound rank.
4. Launches with `_defer_wait=True`; the main scheduler drains pending
   handles after all workers submit (per ADR-0027 D0.4).

### D6. Config schema

`ccl.yaml`:

```yaml
defaults:
  algorithm: lrab_hierarchical_allreduce
  buffer_kind: tcm
  ...

algorithms:
  lrab_hierarchical_allreduce:
    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
    root_cube: 15   # NOT read today — the kernel elects the root dynamically
                    # as the geometric center (see D1). Kept as a placeholder
                    # for a future explicit-root override / runtime election.
```

`topology.yaml`:

```yaml
system:
  sips:
    count: 2
    topology: ring_1d
sip:
  cube_mesh: { w: 4, h: 4 }
```

### D7. Algorithm module contract

Modules loaded via `cfg["module"]` must export:

| Name | Purpose |
|---|---|
| `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
| `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) |
| `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code |
| `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |

---

## Dependencies

- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
- **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
- **ADR-0025**: Address-based IPCQ direction matching; extended
  `_OPPOSITE_DIR` with `global_*` pairs.
- **ADR-0027**: Worker-wait / collective-pending drain in main scheduler.

## Non-goals

- **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the
  workload for this algorithm is per-cube DP.
- **Square-grid fallback requires `n_sips = k²`**: rectangular SIP grids
  (non-square mesh/torus) are supported, but only when `system.sips.w/h`
  are given explicitly (ADR-0024 D5). With `w/h` omitted, 2D topologies
  fall back to a square grid and still require `n_sips = k²`.
- **Pipelined chunks**: single-tile per cube, no pipelining yet.
- **Root cube runtime election**: the kernel currently uses
  `root_cube = (mesh_h // 2) * mesh_w + (mesh_w // 2)` — the geometric
  center, chosen to minimize the intra-SIP critical path. SFR wiring
  covers all cubes, so electing a different root at runtime is a pure
  kernel change when needed.

---

## Consequences

### Positive

- **Single kernel, single install path** for all-reduce — replaces four
  removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
- **Topology-agnostic kernel**: ring / torus / mesh selected via one
  integer param, no kernel duplication.
- **Automatic via `dist.all_reduce`**: no bench-level or user-level
  algorithm selection needed; config-driven end-to-end.
- **Full SFR wiring**: every cube on every SIP has inter-SIP links
  available — supports future dynamic root-cube election.

### Negative

- **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
  shard within one cube across 8 PEs are not addressable by this kernel.
  Such workloads would need a separate intra-cube all-reduce path (not
  yet implemented).
- **`configure_sfr_intercube_multisip` always wires all pe0s**: even if a
  given run only needs a subset (e.g. 1 SIP, ring only). Install cost is
  small but not zero.

---

## Affected files

| File | Change |
|---|---|
| `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
| `ccl.yaml` | Single `lrab_hierarchical_allreduce` entry |
| `topology.yaml` | Added `system.sips.topology` |
| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
| `tests/sccl/` (test package) | Config-driven ring/torus/mesh correctness + full `dist.all_reduce` path + latency/buffer-kind sweeps (evaluation harness — ADR-0043) |
| `tests/test_intercube_sfr_config.py` | SFR wiring verification |
| Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |