Files
kernbench2/docs/adr/ADR-0032-intercube-allreduce.md
T
mukesh 1c33afec55 ADR-0032 + intra_* opposite directions in IPCQ install
Add intra_N/S/E/W to install.py _OPPOSITE_DIR table so the intra-cube
PE-to-PE namespace is symmetrical with intercube N/S/E/W. ADR-0032
documents the intercube allreduce algorithm (supersedes ADR-0029).
Refresh ADR-0024/0025/0029 cross-refs and update
test_intercube_sfr_config.py to cover the new intra_* mappings. Drop
the obsolete test_ccl_round_robin_recv.py (replaced by intercube tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:43:01 -07:00

257 lines
9.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
## Status
Accepted (supersedes ADR-0029).
## Context
### Goal
Define a single all-reduce algorithm that exploits the topology hierarchy:
cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.
### Why replace ADR-0029 (hierarchical 3-level)
ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
where every PE in the system participates. In practice this adds the
intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast)
without matching the common workload pattern where the tensor is sharded
**per cube** (not per PE within a cube).
Moreover, the hierarchical design required:
- per-PE neighbor graph installation (`_build_pe_installs` multi-level)
- multi-level topology schema (`hierarchical_3level`)
- `all_pes` mapper + `multi_pe_sip_local` validator infrastructure
The intercube algorithm below removes all of that: **pe0-only same-lane
intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the
root cube, then broadcast back. Simpler kernel, simpler wiring, same
bandwidth characteristics for the common per-cube DP workload.
### Current state
- `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
- `src/kernbench/ccl/sfr_config.py``configure_sfr_intercube_multisip`
- `src/kernbench/runtime_api/distributed.py``AhbmCCLBackend` wires this
automatically at `init_process_group` time.
- Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`,
`hierarchical_allreduce` modules and their tests are **removed**.
---
## Decision
### D1. Algorithm structure — 5 phases
For each SIP (launched concurrently by `mp.spawn`):
```
Phase 1 — Row reduce W → E (cube mesh, pe0 only):
col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
holds the full SIP sum.
Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
selected by sip_topo_kind (from topology.yaml sips.topology).
Phase 4 — Col broadcast S → N on rightmost column.
Phase 5 — Row broadcast E → W across the cube mesh.
```
After all phases every cube's pe0 holds the global sum.
The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
across topologies; only phase 3 branches. Helper functions
`_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the
three exchange patterns.
### D2. Tensor layout (rank = SIP, per-worker)
Per ADR-0024 rank = SIP at the process-group level. Each worker allocates
its own cube-mesh-spanning tensor:
```python
dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
```
Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`.
### D3. SFR / IPCQ wiring — `configure_sfr_intercube_multisip`
Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor
tables for **every cube's pe0 across every SIP** — regardless of which
cube is the root or which SIP topology is selected. This lets the kernel
elect the root cube at runtime and supports topology switches without
re-wiring.
| Level | Direction labels | Scope |
|---|---|---|
| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per `sips.topology` |
Inter-SIP directions use the `global_*` prefix to keep the namespace
disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended
with `global_E ↔ global_W` and `global_N ↔ global_S` so the reverse-
direction resolver handles 2-SIP bidirectional rings correctly.
Internally the function calls `install_ipcq` with:
- `world_size = n_sips × n_cubes`
- `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
- A closure-captured `neighbors()` function that builds the map above.
This `world_size` is internal to IPCQ wiring and does not leak to the
process-group rank.
### D4. SIP topology — from `topology.yaml`
```yaml
system:
sips:
count: 2
topology: ring_1d # or torus_2d, mesh_2d_no_wrap
```
- `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
- `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on
`global_E/W` then col ring on `global_S/N`.
- `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce +
broadcast per dimension.
2D variants require `n_sips` to be a perfect square.
### D5. Process-group integration — `AhbmCCLBackend`
At `init_process_group` time the backend:
1. Loads `ccl.yaml` + `topology.yaml`.
2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
`system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`.
3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time
SFR wiring, mirrors NCCL communicator creation.
At each `dist.all_reduce(tensor)` call:
1. Resolves `kernel_fn` from `cfg["module"]`.
2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from
`kernel_args(world_size, n_elem)`.
3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
`sip_rank` is the current greenlet's bound rank.
4. Launches with `_defer_wait=True`; the main scheduler drains pending
handles after all workers submit (per ADR-0024 D7 / ADR-0027 D0.4).
### D6. Config schema
`ccl.yaml`:
```yaml
defaults:
algorithm: intercube_allreduce
buffer_kind: tcm
...
algorithms:
intercube_allreduce:
module: kernbench.ccl.algorithms.intercube_allreduce
topology: none
buffer_kind: tcm
n_elem: 8
root_cube: 15
```
`topology.yaml`:
```yaml
system:
sips:
count: 2
topology: ring_1d
sip:
cube_mesh: { w: 4, h: 4 }
```
### D7. Algorithm module contract
Modules loaded via `cfg["module"]` must export:
| Name | Purpose |
|---|---|
| `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
| `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) |
| `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code |
| `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |
---
## Dependencies
- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
- **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
- **ADR-0025**: Address-based IPCQ direction matching; extended
`_OPPOSITE_DIR` with `global_*` pairs.
- **ADR-0027**: Worker-wait / collective-pending drain in main scheduler.
## Non-goals
- **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the
workload for this algorithm is per-cube DP.
- **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and
`mesh_2d_no_wrap` require `n_sips = k²`.
- **Pipelined chunks**: single-tile per cube, no pipelining yet.
- **Root cube runtime election**: the kernel currently uses
`root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
corner. SFR wiring covers all cubes, so runtime election is a pure kernel
change when needed.
---
## Consequences
### Positive
- **Single kernel, single install path** for all-reduce — replaces four
removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
- **Topology-agnostic kernel**: ring / torus / mesh selected via one
integer param, no kernel duplication.
- **Automatic via `dist.all_reduce`**: no bench-level or user-level
algorithm selection needed; config-driven end-to-end.
- **Full SFR wiring**: every cube on every SIP has inter-SIP links
available — supports future dynamic root-cube election.
### Negative
- **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
shard within one cube across 8 PEs are not addressable by this kernel.
Such workloads would need a separate intra-cube all-reduce path (not
yet implemented).
- **`configure_sfr_intercube_multisip` always wires all pe0s**: even if a
given run only needs a subset (e.g. 1 SIP, ring only). Install cost is
small but not zero.
---
## Affected files
| File | Change |
|---|---|
| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
| `ccl.yaml` | Single `intercube_allreduce` entry |
| `topology.yaml` | Added `system.sips.topology` |
| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
| Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |