# ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange ## Status Accepted (supersedes ADR-0029). ## Context ### Goal Define a single all-reduce algorithm that exploits the topology hierarchy: cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel, one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`. ### Why replace ADR-0029 (hierarchical 3-level) ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm where every PE in the system participates. In practice this adds the intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast) without matching the common workload pattern where the tensor is sharded **per cube** (not per PE within a cube). Moreover, the hierarchical design required: - per-PE neighbor graph installation (`_build_pe_installs` multi-level) - multi-level topology schema (`hierarchical_3level`) - `all_pes` mapper + `multi_pe_sip_local` validator infrastructure The intercube algorithm below removes all of that: **pe0-only same-lane intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the root cube, then broadcast back. Simpler kernel, simpler wiring, same bandwidth characteristics for the common per-cube DP workload. ### Current state - `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` — kernel - `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip` - `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this automatically at `init_process_group` time. - Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`, `hierarchical_allreduce` modules and their tests are **removed**. --- ## Decision ### D1. Algorithm structure — 5 phases (center-root, bidirectional) The root cube sits at the geometric **center** of the cube mesh: ``` root_col = cube_w // 2 root_row = cube_h // 2 root_cube = root_row * cube_w + root_col # center; 10 on a 4×4 mesh ``` Each reduce/broadcast phase converges/diverges **bidirectionally** toward this center, halving the intra-SIP critical path versus a corner-root walk (4×4 mesh: 4 hops reduce + 4 hops broadcast vs 6+6 with an SE-corner root). For each SIP (launched concurrently by `mp.spawn`): ``` Phase 1 — Row reduce converging at col == root_col (cube mesh, pe0 only): left half (col < root_col) walks W→E; right half (col > root_col) walks E→W; the root_col cube merges both sides → holds row sum. Phase 2 — Col reduce on col == root_col converging at row == root_row: above (row < root_row) walks N→S; below (row > root_row) walks S→N; the root cube merges both → holds the full SIP sum. Phase 3 — Inter-SIP exchange on cube_id == root_cube (pe0 only): Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast — selected by sip_topo_kind (from topology.yaml sips.topology). Phase 4 — Col broadcast on col == root_col, outward from root_row. Phase 5 — Row broadcast outward from root_col across the cube mesh. ``` After all phases every cube's pe0 holds the global sum. **Single-cube fast-path**: when `cube_w == cube_h == 1` (one cube per rank, the common TP case), the intra-SIP reduce/broadcast phases are skipped and the kernel goes straight to the Phase 3 inter-SIP exchange. The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}` (ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical across topologies; only phase 3 branches. Helper functions `_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the three exchange patterns. ### D2. Tensor layout (rank = SIP, per-worker) Per ADR-0024 rank = SIP at the process-group level. Each worker allocates its own cube-mesh-spanning tensor: ```python dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1) tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp) ``` Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`. ### D3. SFR / IPCQ wiring — `configure_sfr_intercube_multisip` Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor tables for **every cube's pe0 across every SIP** — regardless of which cube is the root or which SIP topology is selected. This lets the kernel elect the root cube at runtime and supports topology switches without re-wiring. | Level | Direction labels | Scope | |---|---|---| | Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) | | Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per `sips.topology` | Inter-SIP directions use the `global_*` prefix to keep the namespace disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended with `global_E ↔ global_W` and `global_N ↔ global_S` so the reverse- direction resolver handles 2-SIP bidirectional rings correctly. Internally the function calls `install_ipcq` with: - `world_size = n_sips × n_cubes` - `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]` - A closure-captured `neighbors()` function that builds the map above. This `world_size` is internal to IPCQ wiring and does not leak to the process-group rank. ### D4. SIP topology — from `topology.yaml` ```yaml system: sips: count: 2 topology: ring_1d # or torus_2d, mesh_2d_no_wrap ``` - `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`. - `torus_2d`: `w × h` wrapping mesh. Row ring on `global_E/W` then col ring on `global_S/N`. - `mesh_2d_no_wrap`: `w × h` mesh without wrap-around. Chain reduce + broadcast per dimension. 2D grid dims `(w, h)` come from `system.sips.w/h` (ADR-0024 D5). A square fallback (`round(sqrt(n_sips))²`) applies **only** when `w/h` are omitted, so rectangular grids (e.g. 6 SIPs as `3×2`) are supported by giving explicit `w/h`. ### D5. Process-group integration — `AhbmCCLBackend` At `init_process_group` time the backend: 1. Loads `ccl.yaml` + `topology.yaml`. 2. Derives `sip_topo_kind` from `system.sips.topology` via the algorithm module's `TOPO_NAME_TO_KIND`, and `sip_topo_w, sip_topo_h` from `system.sips.w/h` with a square-only fallback (ADR-0024 D5). 3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time SFR wiring, mirrors NCCL communicator creation. At each `dist.all_reduce(tensor)` call: 1. Resolves `kernel_fn` from `cfg["module"]`. 2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from `kernel_args(world_size, n_elem)`. 3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where `sip_rank` is the current greenlet's bound rank. 4. Launches with `_defer_wait=True`; the main scheduler drains pending handles after all workers submit (per ADR-0027 D0.4). ### D6. Config schema `ccl.yaml`: ```yaml defaults: algorithm: lrab_hierarchical_allreduce buffer_kind: tcm ... algorithms: lrab_hierarchical_allreduce: module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce topology: none buffer_kind: tcm n_elem: 8 root_cube: 15 # NOT read today — the kernel elects the root dynamically # as the geometric center (see D1). Kept as a placeholder # for a future explicit-root override / runtime election. ``` `topology.yaml`: ```yaml system: sips: count: 2 topology: ring_1d sip: cube_mesh: { w: 4, h: 4 } ``` ### D7. Algorithm module contract Modules loaded via `cfg["module"]` must export: | Name | Purpose | |---|---| | `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` | | `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) | | `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code | | `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) | --- ## Dependencies - **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return). - **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank. - **ADR-0025**: Address-based IPCQ direction matching; extended `_OPPOSITE_DIR` with `global_*` pairs. - **ADR-0027**: Worker-wait / collective-pending drain in main scheduler. ## Non-goals - **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the workload for this algorithm is per-cube DP. - **Square-grid fallback requires `n_sips = k²`**: rectangular SIP grids (non-square mesh/torus) are supported, but only when `system.sips.w/h` are given explicitly (ADR-0024 D5). With `w/h` omitted, 2D topologies fall back to a square grid and still require `n_sips = k²`. - **Pipelined chunks**: single-tile per cube, no pipelining yet. - **Root cube runtime election**: the kernel currently uses `root_cube = (mesh_h // 2) * mesh_w + (mesh_w // 2)` — the geometric center, chosen to minimize the intra-SIP critical path. SFR wiring covers all cubes, so electing a different root at runtime is a pure kernel change when needed. --- ## Consequences ### Positive - **Single kernel, single install path** for all-reduce — replaces four removed modules (`ring`, `mesh`, `tree`, `hierarchical`). - **Topology-agnostic kernel**: ring / torus / mesh selected via one integer param, no kernel duplication. - **Automatic via `dist.all_reduce`**: no bench-level or user-level algorithm selection needed; config-driven end-to-end. - **Full SFR wiring**: every cube on every SIP has inter-SIP links available — supports future dynamic root-cube election. ### Negative - **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that shard within one cube across 8 PEs are not addressable by this kernel. Such workloads would need a separate intra-cube all-reduce path (not yet implemented). - **`configure_sfr_intercube_multisip` always wires all pe0s**: even if a given run only needs a subset (e.g. 1 SIP, ring only). Install cost is small but not zero. --- ## Affected files | File | Change | |---|---| | `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` | | `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` | | `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` | | `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs | | `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args | | `ccl.yaml` | Single `lrab_hierarchical_allreduce` entry | | `topology.yaml` | Added `system.sips.topology` | | `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout | | `tests/sccl/` (test package) | Config-driven ring/torus/mesh correctness + full `dist.all_reduce` path + latency/buffer-kind sweeps (evaluation harness — ADR-0043) | | `tests/test_intercube_sfr_config.py` | SFR wiring verification | | Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |