kernbench2/docs/adr/ADR-0032-intercube-allreduce.md

# ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange

## Status

Accepted (supersedes ADR-0029).

## Context

### Goal

Define a single all-reduce algorithm that exploits the topology hierarchy:
cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.

### Why replace ADR-0029 (hierarchical 3-level)

ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
in which every PE in the system participates. In practice, the intra-cube
PE-to-PE stage (bidirectional reduce + chain broadcast) adds complexity
without matching the common workload pattern, where the tensor is sharded
**per cube** (not per PE within a cube).

Moreover, the hierarchical design required:
- per-PE neighbor graph installation (`_build_pe_installs` multi-level)
- multi-level topology schema (`hierarchical_3level`)
- `all_pes` mapper + `multi_pe_sip_local` validator infrastructure

The intercube algorithm below removes all of that: **pe0-only same-lane
intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the
root cube, then broadcast back. The result is a simpler kernel and simpler
wiring, with the same bandwidth characteristics for the common per-cube DP
workload.

### Current state

- `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
- `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip`
- `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this
  automatically at `init_process_group` time.
- Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`,
  `hierarchical_allreduce` modules and their tests are **removed**.

---

## Decision

### D1. Algorithm structure — 5 phases

For each SIP (launched concurrently by `mp.spawn`):

```
Phase 1 — Row reduce W → E (cube mesh, pe0 only):
    col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.

Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
    row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
    holds the full SIP sum.

Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
    Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
    selected by sip_topo_kind (from topology.yaml sips.topology).

Phase 4 — Col broadcast S → N on rightmost column.

Phase 5 — Row broadcast E → W across the cube mesh.
```

After all phases every cube's pe0 holds the global sum.

The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
across topologies; only phase 3 branches. Helper functions
`_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the
three exchange patterns.
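
The five phases can be checked with a host-side simulation. The sketch below is not the kernel: it models each cube's pe0 shard as a single number, assumes row-major cube numbering with row 0 at the north edge, and implements phase 3 only for `ring_1d`, assuming each SIP forwards the value it received in the previous round. The function name `simulate_intercube_allreduce` is hypothetical.

```python
def simulate_intercube_allreduce(sips, mesh_w=4, mesh_h=4):
    """Simulate the 5 phases; sips[s][cube] is the value held by pe0 of that cube."""
    state = [list(vals) for vals in sips]  # work on copies
    east = mesh_w - 1
    root = (mesh_h - 1) * mesh_w + east    # SE corner cube

    for vals in state:
        # Phase 1: row reduce W -> E, each column accumulates its west neighbor.
        for row in range(mesh_h):
            for col in range(1, mesh_w):
                vals[row * mesh_w + col] += vals[row * mesh_w + col - 1]
        # Phase 2: column reduce N -> S on the rightmost column.
        for row in range(1, mesh_h):
            vals[row * mesh_w + east] += vals[(row - 1) * mesh_w + east]

    # Phase 3 (ring_1d only): n_sips - 1 rounds of send east / receive from west.
    n_sips = len(state)
    acc = [vals[root] for vals in state]
    in_flight = list(acc)
    for _ in range(n_sips - 1):
        in_flight = [in_flight[(s - 1) % n_sips] for s in range(n_sips)]
        acc = [a + r for a, r in zip(acc, in_flight)]
    for s, vals in enumerate(state):
        vals[root] = acc[s]

    for vals in state:
        # Phase 4: column broadcast S -> N on the rightmost column.
        for row in range(mesh_h - 2, -1, -1):
            vals[row * mesh_w + east] = vals[(row + 1) * mesh_w + east]
        # Phase 5: row broadcast E -> W across the mesh.
        for row in range(mesh_h):
            for col in range(mesh_w - 2, -1, -1):
                vals[row * mesh_w + col] = vals[row * mesh_w + col + 1]
    return state
```

With two SIPs whose cubes all hold 1 and 2 respectively, every entry of the result equals 16 × 1 + 16 × 2 = 48, which matches the claim that every cube's pe0 ends with the global sum.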

### D2. Tensor layout (rank = SIP, per-worker)

Per ADR-0024, rank = SIP at the process-group level. Each worker allocates
its own cube-mesh-spanning tensor:

```python
dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
```

Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`, where the
factor 2 is the size in bytes of one `f16` element.
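
As a worked instance of the addressing formula, the sketch below computes the byte address of each cube's shard. The helper name `shard_addr` and the base address `0x1000` are illustrative assumptions.

```python
F16_BYTES = 2  # one f16 element occupies 2 bytes


def shard_addr(t_ptr: int, cube_id: int, n_elem: int, elem_bytes: int = F16_BYTES) -> int:
    """Byte address of the shard that pe0 of `cube_id` owns."""
    return t_ptr + cube_id * n_elem * elem_bytes


# Illustrative base address; shards are laid out back to back, 16 bytes apart for n_elem = 8.
addrs = [shard_addr(0x1000, cube, n_elem=8) for cube in range(16)]
```

With `n_elem = 8`, consecutive shards are 16 bytes apart, so cube 15 starts 240 bytes past `t_ptr`.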

### D3. SFR / IPCQ wiring — `configure_sfr_intercube_multisip`

Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor
tables for **every cube's pe0 across every SIP**, regardless of which
cube is the root or which SIP topology is selected. Because the wiring is
complete, the kernel could elect the root cube at runtime (it is currently
fixed, see Non-goals), and topology switches need no re-wiring.

| Level | Direction labels | Scope |
|---|---|---|
| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on SIP A → pe0 of cube c on peer SIP per `sips.topology` |

Inter-SIP directions use the `global_*` prefix to keep the namespace
disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended
with `global_E ↔ global_W` and `global_N ↔ global_S` so the
reverse-direction resolver handles 2-SIP bidirectional rings correctly.
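
A minimal sketch of the extended opposite-direction table follows. The real `_OPPOSITE_DIR` lives in `src/kernbench/ccl/install.py` and is not reproduced in this ADR; the intercube entries and the helper `reverse_direction` shown here are assumptions for illustration.

```python
# Assumed shape of the table: intercube pairs plus the new global_* pairs.
OPPOSITE_DIR = {
    "N": "S", "S": "N", "E": "W", "W": "E",
    "global_N": "global_S", "global_S": "global_N",
    "global_E": "global_W", "global_W": "global_E",
}


def reverse_direction(direction: str) -> str:
    """Return the direction label the peer uses for the same link."""
    return OPPOSITE_DIR[direction]
```

Because the `global_*` keys never collide with `N/S/E/W`, a 2-SIP ring in which the same peer is both the `global_E` and the `global_W` neighbor still resolves each link to a distinct label.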

Internally the function calls `install_ipcq` with:
- `world_size = n_sips × n_cubes`
- `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
- A closure-captured `neighbors()` function that builds the map above.

This `world_size` is internal to IPCQ wiring and does not leak to the
process-group rank.
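
The internal rank mapping and neighbor table can be sketched as follows. This is an illustration under stated assumptions, not the body of `configure_sfr_intercube_multisip`: it assumes row-major cube numbering with row 0 at the north edge, covers only the `ring_1d` inter-SIP case with `global_E` pointing at SIP `+1`, and returns neighbors as internal IPCQ ranks. The callback shape that `install_ipcq` actually expects is not specified here, and the helper names are hypothetical.

```python
def build_rank_to_pe(n_sips: int, n_cubes: int) -> list:
    """Internal IPCQ rank -> (sip, cube, pe); only pe0 of each cube is wired."""
    return [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]


def mesh_neighbors(cube: int, mesh_w: int, mesh_h: int) -> dict:
    """Intercube neighbors on a mesh without wrap-around (edges have fewer links)."""
    row, col = divmod(cube, mesh_w)
    out = {}
    if row > 0:
        out["N"] = cube - mesh_w
    if row < mesh_h - 1:
        out["S"] = cube + mesh_w
    if col > 0:
        out["W"] = cube - 1
    if col < mesh_w - 1:
        out["E"] = cube + 1
    return out


def ring_neighbors(sip: int, n_sips: int) -> dict:
    """Inter-SIP peers for ring_1d; a single SIP has no peers."""
    if n_sips < 2:
        return {}
    return {"global_E": (sip + 1) % n_sips, "global_W": (sip - 1) % n_sips}


def neighbors(rank: int, n_sips: int, mesh_w: int, mesh_h: int) -> dict:
    """Direction label -> internal rank of the neighbor's pe0."""
    n_cubes = mesh_w * mesh_h
    sip, cube = divmod(rank, n_cubes)
    table = {d: sip * n_cubes + c for d, c in mesh_neighbors(cube, mesh_w, mesh_h).items()}
    # Inter-SIP links connect cube c on this SIP to the same cube c on the peer SIP.
    table.update({d: peer * n_cubes + cube for d, peer in ring_neighbors(sip, n_sips).items()})
    return table
```

For 2 SIPs on a 4×4 mesh the internal world size is 32, while the process group still sees 2 ranks.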

### D4. SIP topology — from `topology.yaml`

```yaml
system:
  sips:
    count: 2
    topology: ring_1d       # or torus_2d, mesh_2d_no_wrap
```

- `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
- `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on
  `global_E/W` then col ring on `global_S/N`.
- `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce +
  broadcast per dimension.

2D variants require `n_sips` to be a perfect square.
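
The perfect-square constraint and the mapping from topology name to kernel parameters can be sketched like this. The integer codes follow D1 and D7; the function name `derive_sip_topology` and the choice of `(n_sips, 1)` as the grid shape for `ring_1d` are assumptions.

```python
import math

SIP_TOPO_RING, SIP_TOPO_TORUS, SIP_TOPO_MESH = 0, 1, 2
TOPO_NAME_TO_KIND = {
    "ring_1d": SIP_TOPO_RING,
    "torus_2d": SIP_TOPO_TORUS,
    "mesh_2d_no_wrap": SIP_TOPO_MESH,
}


def derive_sip_topology(name: str, n_sips: int) -> tuple:
    """Return (sip_topo_kind, sip_topo_w, sip_topo_h) for a topology name."""
    if name not in TOPO_NAME_TO_KIND:
        raise ValueError(f"unknown SIP topology: {name!r}")
    kind = TOPO_NAME_TO_KIND[name]
    if kind == SIP_TOPO_RING:
        # Assumption: a 1D ring is described as an n_sips x 1 grid.
        return kind, n_sips, 1
    side = math.isqrt(n_sips)
    if side * side != n_sips:
        raise ValueError(f"{name} requires a perfect-square SIP count, got {n_sips}")
    return kind, side, side
```

Rejecting a non-square count up front keeps the failure at configuration time rather than inside the kernel's phase 3 branch.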

### D5. Process-group integration — `AhbmCCLBackend`

At `init_process_group` time the backend:

1. Loads `ccl.yaml` + `topology.yaml`.
2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
   `system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`.
3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time
   SFR wiring, mirrors NCCL communicator creation.

At each `dist.all_reduce(tensor)` call:

1. Resolves `kernel_fn` from `cfg["module"]`.
2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from
   `kernel_args(world_size, n_elem)`.
3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
   `sip_rank` is the current greenlet's bound rank.
4. Launches with `_defer_wait=True`; the main scheduler drains pending
   handles after all workers submit (per ADR-0024 D7 / ADR-0027 D0.4).
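
Steps 2 and 3 amount to concatenating two tuples in the order of the kernel signature from D7. The sketch below uses a stand-in `kernel_args` with an assumed 4×4 cube mesh; the real function lives in the algorithm module, and `build_launch_args` is a hypothetical name.

```python
def kernel_args(world_size: int, n_elem: int, cube_w: int = 4, cube_h: int = 4) -> tuple:
    """Stand-in for the module's kernel_args: (n_elem, cube_w, cube_h, n_sips).

    Assumes world_size equals the SIP count, since rank = SIP.
    """
    return (n_elem, cube_w, cube_h, world_size)


def build_launch_args(world_size: int, n_elem: int, sip_rank: int, sip_topo: tuple) -> tuple:
    """Per-tensor scalars first, then the per-worker rank and topology scalars."""
    sip_topo_kind, sip_topo_w, sip_topo_h = sip_topo
    return kernel_args(world_size, n_elem) + (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
```

The tensor pointer and the trailing `tl` argument are supplied by the launcher and are left out of this sketch.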

### D6. Config schema

`ccl.yaml`:

```yaml
defaults:
  algorithm: intercube_allreduce
  buffer_kind: tcm
  ...

algorithms:
  intercube_allreduce:
    module: kernbench.ccl.algorithms.intercube_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
    root_cube: 15
```

`topology.yaml`:

```yaml
system:
  sips:
    count: 2
    topology: ring_1d
sip:
  cube_mesh: { w: 4, h: 4 }
```
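
One consistency check these two files invite is that `root_cube` matches the SE corner of `cube_mesh`, since the kernel currently fixes the root there. The sketch below works on already-parsed dictionaries holding only the keys shown above; the helper names are hypothetical and the loader is not claimed to perform this check.

```python
CCL_CFG = {
    "defaults": {"algorithm": "intercube_allreduce", "buffer_kind": "tcm"},
    "algorithms": {
        "intercube_allreduce": {
            "module": "kernbench.ccl.algorithms.intercube_allreduce",
            "topology": "none",
            "buffer_kind": "tcm",
            "n_elem": 8,
            "root_cube": 15,
        }
    },
}
TOPOLOGY_CFG = {
    "system": {"sips": {"count": 2, "topology": "ring_1d"}},
    "sip": {"cube_mesh": {"w": 4, "h": 4}},
}


def resolve_algorithm(ccl_cfg: dict) -> tuple:
    """Return (name, entry) for the default algorithm."""
    name = ccl_cfg["defaults"]["algorithm"]
    return name, ccl_cfg["algorithms"][name]


def se_corner_cube(topology_cfg: dict) -> int:
    """Index of the SE corner cube under row-major numbering."""
    mesh = topology_cfg["sip"]["cube_mesh"]
    return (mesh["h"] - 1) * mesh["w"] + (mesh["w"] - 1)


def check_root_cube(ccl_cfg: dict, topology_cfg: dict) -> int:
    """Raise if the configured root_cube is not the SE corner the kernel uses."""
    name, algo = resolve_algorithm(ccl_cfg)
    expected = se_corner_cube(topology_cfg)
    if algo["root_cube"] != expected:
        raise ValueError(f"{name}: root_cube={algo['root_cube']}, expected {expected}")
    return expected
```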

### D7. Algorithm module contract

Modules loaded via `cfg["module"]` must export:

| Name | Purpose |
|---|---|
| `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
| `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) |
| `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code |
| `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |
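
A loader could verify this contract before wiring anything. The following sketch checks only that the exports exist and that `kernel` takes the ten parameters listed above; `contract_violations` and `example_module` are hypothetical, and a decorated kernel might need a looser arity check.

```python
import inspect
from types import SimpleNamespace

REQUIRED_NAMES = (
    "kernel", "kernel_args", "TOPO_NAME_TO_KIND",
    "SIP_TOPO_RING", "SIP_TOPO_TORUS", "SIP_TOPO_MESH",
)
KERNEL_ARITY = 10  # t_ptr, 4 per-tensor scalars, 4 per-worker scalars, tl


def contract_violations(module) -> list:
    """Return human-readable problems; an empty list means the contract holds."""
    problems = [f"missing export: {name}" for name in REQUIRED_NAMES if not hasattr(module, name)]
    kernel = getattr(module, "kernel", None)
    if callable(kernel):
        n_params = len(inspect.signature(kernel).parameters)
        if n_params != KERNEL_ARITY:
            problems.append(f"kernel takes {n_params} parameters, expected {KERNEL_ARITY}")
    return problems


def _stub_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                 sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl):
    """Placeholder with the documented signature; does nothing."""
    return None


# Stand-in for an imported algorithm module, used to exercise the check.
example_module = SimpleNamespace(
    kernel=_stub_kernel,
    kernel_args=lambda world_size, n_elem: (n_elem, 4, 4, world_size),
    TOPO_NAME_TO_KIND={"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2},
    SIP_TOPO_RING=0, SIP_TOPO_TORUS=1, SIP_TOPO_MESH=2,
)
```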

---

## Dependencies

- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
- **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
- **ADR-0025**: Address-based IPCQ direction matching; extended
  `_OPPOSITE_DIR` with `global_*` pairs.
- **ADR-0027**: Worker-wait / collective-pending drain in main scheduler.

## Non-goals

- **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the
  workload for this algorithm is per-cube DP.
- **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and
  `mesh_2d_no_wrap` require `n_sips = k²`.
- **Pipelined chunks**: single-tile per cube, no pipelining yet.
- **Root cube runtime election**: the kernel currently hardcodes the root to
  the SE corner, `root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` (cube 15
  on the 4×4 mesh). SFR wiring covers all cubes, so runtime election is a
  pure kernel change when needed.

---

## Consequences

### Positive

- **Single kernel, single install path** for all-reduce — replaces four
  removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
- **Topology-agnostic kernel**: ring / torus / mesh selected via one
  integer param, no kernel duplication.
- **Automatic via `dist.all_reduce`**: no bench-level or user-level
  algorithm selection needed; config-driven end-to-end.
- **Full SFR wiring**: every cube on every SIP has inter-SIP links
  available — supports future dynamic root-cube election.

### Negative

- **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
  shard within one cube across 8 PEs are not addressable by this kernel.
  Such workloads would need a separate intra-cube all-reduce path (not
  yet implemented).
- **`configure_sfr_intercube_multisip` always wires all pe0s**, even if a
  given run needs only a subset (e.g. 1 SIP, ring only). The install cost
  is small but not zero.

---

## Affected files

| File | Change |
|---|---|
| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
| `ccl.yaml` | Single `intercube_allreduce` entry |
| `topology.yaml` | Added `system.sips.topology` |
| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
| `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests | Removed |