fd56b6cacd
Document the allreduce + GEMM evaluation harnesses and bring the affected allreduce ADRs in line with the refactored code. New (Accepted, EN + KO): - ADR-0043 — allreduce evaluation harness (tests/sccl/): distributed-driven correctness, latency/buffer-kind sweeps, sessionfinish plot aggregators, topology + FSIM-comparison figures. Verified against the implementation. - ADR-0044 — GEMM evaluation harness (scripts/gemm_sweep.py + tests/gemm/): heavy-script data gen vs. fast test-rendered figures, slow regenerator, the 3-figure set. Records two limitations as open questions: the theoretical-model constants are inherited (not yet traced to ADR-0033/ 0014), and the *_measured figure is a naming misnomer. Updated (EN + KO): - ADR-0024 — add D5: SIP grid w/h resolution (explicit sips.w/h, square fallback, fail-loud), documenting the AhbmCCLBackend fix. - ADR-0032 — D4/D5/Non-goals reconciled: rectangular SIP grids (e.g. 6 SIPs as 3x2) are supported via explicit w/h; the square requirement now applies only to the fallback. Affected-files repointed to tests/sccl/. Verification: ADR-0023 and ADR-0042 confirmed still matching the code (no change). verify_adr_lang_pairs.py passes (EN/KO Status blocks byte-equal). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
282 lines
11 KiB
Markdown
282 lines
11 KiB
Markdown
# ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
|
||
|
||
## Status
|
||
|
||
Accepted (supersedes ADR-0029).
|
||
|
||
## Context
|
||
|
||
### Goal
|
||
|
||
Define a single all-reduce algorithm that exploits the topology hierarchy:
|
||
cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
|
||
one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.
|
||
|
||
### Why replace ADR-0029 (hierarchical 3-level)
|
||
|
||
ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
|
||
where every PE in the system participates. In practice this adds the
|
||
intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast)
|
||
without matching the common workload pattern where the tensor is sharded
|
||
**per cube** (not per PE within a cube).
|
||
|
||
Moreover, the hierarchical design required:
|
||
- per-PE neighbor graph installation (`_build_pe_installs` multi-level)
|
||
- multi-level topology schema (`hierarchical_3level`)
|
||
- `all_pes` mapper + `multi_pe_sip_local` validator infrastructure
|
||
|
||
The intercube algorithm below removes all of that: **pe0-only same-lane
|
||
intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the
|
||
root cube, then broadcast back. Simpler kernel, simpler wiring, same
|
||
bandwidth characteristics for the common per-cube DP workload.
|
||
|
||
### Current state
|
||
|
||
- `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` — kernel
|
||
- `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip`
|
||
- `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this
|
||
automatically at `init_process_group` time.
|
||
- Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`,
|
||
`hierarchical_allreduce` modules and their tests are **removed**.
|
||
|
||
---
|
||
|
||
## Decision
|
||
|
||
### D1. Algorithm structure — 5 phases (center-root, bidirectional)
|
||
|
||
The root cube sits at the geometric **center** of the cube mesh:
|
||
|
||
```
|
||
root_col = cube_w // 2
|
||
root_row = cube_h // 2
|
||
root_cube = root_row * cube_w + root_col # center; 10 on a 4×4 mesh
|
||
```
|
||
|
||
Each reduce/broadcast phase converges/diverges **bidirectionally** toward
|
||
this center, halving the intra-SIP critical path versus a corner-root walk
|
||
(4×4 mesh: 4 hops reduce + 4 hops broadcast vs 6+6 with an SE-corner root).
|
||
|
||
For each SIP (launched concurrently by `mp.spawn`):
|
||
|
||
```
|
||
Phase 1 — Row reduce converging at col == root_col (cube mesh, pe0 only):
|
||
left half (col < root_col) walks W→E; right half (col > root_col)
|
||
walks E→W; the root_col cube merges both sides → holds row sum.
|
||
|
||
Phase 2 — Col reduce on col == root_col converging at row == root_row:
|
||
above (row < root_row) walks N→S; below (row > root_row) walks S→N;
|
||
the root cube merges both → holds the full SIP sum.
|
||
|
||
Phase 3 — Inter-SIP exchange on cube_id == root_cube (pe0 only):
|
||
Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
|
||
selected by sip_topo_kind (from topology.yaml sips.topology).
|
||
|
||
Phase 4 — Col broadcast on col == root_col, outward from root_row.
|
||
|
||
Phase 5 — Row broadcast outward from root_col across the cube mesh.
|
||
```
|
||
|
||
After all phases every cube's pe0 holds the global sum.
|
||
|
||
**Single-cube fast-path**: when `cube_w == cube_h == 1` (one cube per rank,
|
||
the common TP case), the intra-SIP reduce/broadcast phases are skipped and
|
||
the kernel goes straight to the Phase 3 inter-SIP exchange.
|
||
|
||
The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
|
||
(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
|
||
across topologies; only phase 3 branches. Helper functions
|
||
`_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the
|
||
three exchange patterns.
|
||
|
||
### D2. Tensor layout (rank = SIP, per-worker)
|
||
|
||
Per ADR-0024 rank = SIP at the process-group level. Each worker allocates
|
||
its own cube-mesh-spanning tensor:
|
||
|
||
```python
|
||
dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
|
||
tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
|
||
```
|
||
|
||
Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
|
||
each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`.
|
||
|
||
### D3. SFR / IPCQ wiring — `configure_sfr_intercube_multisip`
|
||
|
||
Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor
|
||
tables for **every cube's pe0 across every SIP** — regardless of which
|
||
cube is the root or which SIP topology is selected. This lets the kernel
|
||
elect the root cube at runtime and supports topology switches without
|
||
re-wiring.
|
||
|
||
| Level | Direction labels | Scope |
|
||
|---|---|---|
|
||
| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
|
||
| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per `sips.topology` |
|
||
|
||
Inter-SIP directions use the `global_*` prefix to keep the namespace
|
||
disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended
|
||
with `global_E ↔ global_W` and `global_N ↔ global_S` so the reverse-
|
||
direction resolver handles 2-SIP bidirectional rings correctly.
|
||
|
||
Internally the function calls `install_ipcq` with:
|
||
- `world_size = n_sips × n_cubes`
|
||
- `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
|
||
- A closure-captured `neighbors()` function that builds the map above.
|
||
|
||
This `world_size` is internal to IPCQ wiring and does not leak to the
|
||
process-group rank.
|
||
|
||
### D4. SIP topology — from `topology.yaml`
|
||
|
||
```yaml
|
||
system:
|
||
sips:
|
||
count: 2
|
||
topology: ring_1d # or torus_2d, mesh_2d_no_wrap
|
||
```
|
||
|
||
- `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
|
||
- `torus_2d`: `w × h` wrapping mesh. Row ring on `global_E/W` then col
|
||
ring on `global_S/N`.
|
||
- `mesh_2d_no_wrap`: `w × h` mesh without wrap-around. Chain reduce +
|
||
broadcast per dimension.
|
||
|
||
2D grid dims `(w, h)` come from `system.sips.w/h` (ADR-0024 D5). A square
|
||
fallback (`round(sqrt(n_sips))²`) applies **only** when `w/h` are omitted,
|
||
so rectangular grids (e.g. 6 SIPs as `3×2`) are supported by giving
|
||
explicit `w/h`.
|
||
|
||
### D5. Process-group integration — `AhbmCCLBackend`
|
||
|
||
At `init_process_group` time the backend:
|
||
|
||
1. Loads `ccl.yaml` + `topology.yaml`.
|
||
2. Derives `sip_topo_kind` from `system.sips.topology` via the algorithm
|
||
module's `TOPO_NAME_TO_KIND`, and `sip_topo_w, sip_topo_h` from
|
||
`system.sips.w/h` with a square-only fallback (ADR-0024 D5).
|
||
3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time
|
||
SFR wiring, mirrors NCCL communicator creation.
|
||
|
||
At each `dist.all_reduce(tensor)` call:
|
||
|
||
1. Resolves `kernel_fn` from `cfg["module"]`.
|
||
2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from
|
||
`kernel_args(world_size, n_elem)`.
|
||
3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
|
||
`sip_rank` is the current greenlet's bound rank.
|
||
4. Launches with `_defer_wait=True`; the main scheduler drains pending
|
||
handles after all workers submit (per ADR-0027 D0.4).
|
||
|
||
### D6. Config schema
|
||
|
||
`ccl.yaml`:
|
||
|
||
```yaml
|
||
defaults:
|
||
algorithm: lrab_hierarchical_allreduce
|
||
buffer_kind: tcm
|
||
...
|
||
|
||
algorithms:
|
||
lrab_hierarchical_allreduce:
|
||
module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
|
||
topology: none
|
||
buffer_kind: tcm
|
||
n_elem: 8
|
||
root_cube: 15 # NOT read today — the kernel elects the root dynamically
|
||
# as the geometric center (see D1). Kept as a placeholder
|
||
# for a future explicit-root override / runtime election.
|
||
```
|
||
|
||
`topology.yaml`:
|
||
|
||
```yaml
|
||
system:
|
||
sips:
|
||
count: 2
|
||
topology: ring_1d
|
||
sip:
|
||
cube_mesh: { w: 4, h: 4 }
|
||
```
|
||
|
||
### D7. Algorithm module contract
|
||
|
||
Modules loaded via `cfg["module"]` must export:
|
||
|
||
| Name | Purpose |
|
||
|---|---|
|
||
| `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
|
||
| `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) |
|
||
| `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code |
|
||
| `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |
|
||
|
||
---
|
||
|
||
## Dependencies
|
||
|
||
- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
|
||
- **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
|
||
- **ADR-0025**: Address-based IPCQ direction matching; extended
|
||
`_OPPOSITE_DIR` with `global_*` pairs.
|
||
- **ADR-0027**: Worker-wait / collective-pending drain in main scheduler.
|
||
|
||
## Non-goals
|
||
|
||
- **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the
|
||
workload for this algorithm is per-cube DP.
|
||
- **Square-grid fallback requires `n_sips = k²`**: rectangular SIP grids
|
||
(non-square mesh/torus) are supported, but only when `system.sips.w/h`
|
||
are given explicitly (ADR-0024 D5). With `w/h` omitted, 2D topologies
|
||
fall back to a square grid and still require `n_sips = k²`.
|
||
- **Pipelined chunks**: single-tile per cube, no pipelining yet.
|
||
- **Root cube runtime election**: the kernel currently uses
|
||
`root_cube = (mesh_h // 2) * mesh_w + (mesh_w // 2)` — the geometric
|
||
center, chosen to minimize the intra-SIP critical path. SFR wiring
|
||
covers all cubes, so electing a different root at runtime is a pure
|
||
kernel change when needed.
|
||
|
||
---
|
||
|
||
## Consequences
|
||
|
||
### Positive
|
||
|
||
- **Single kernel, single install path** for all-reduce — replaces four
|
||
removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
|
||
- **Topology-agnostic kernel**: ring / torus / mesh selected via one
|
||
integer param, no kernel duplication.
|
||
- **Automatic via `dist.all_reduce`**: no bench-level or user-level
|
||
algorithm selection needed; config-driven end-to-end.
|
||
- **Full SFR wiring**: every cube on every SIP has inter-SIP links
|
||
available — supports future dynamic root-cube election.
|
||
|
||
### Negative
|
||
|
||
- **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
|
||
shard within one cube across 8 PEs are not addressable by this kernel.
|
||
Such workloads would need a separate intra-cube all-reduce path (not
|
||
yet implemented).
|
||
- **`configure_sfr_intercube_multisip` always wires all pe0s**: even if a
|
||
given run only needs a subset (e.g. 1 SIP, ring only). Install cost is
|
||
small but not zero.
|
||
|
||
---
|
||
|
||
## Affected files
|
||
|
||
| File | Change |
|
||
|---|---|
|
||
| `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
|
||
| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
|
||
| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
|
||
| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
|
||
| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
|
||
| `ccl.yaml` | Single `lrab_hierarchical_allreduce` entry |
|
||
| `topology.yaml` | Added `system.sips.topology` |
|
||
| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
|
||
| `tests/sccl/` (test package) | Config-driven ring/torus/mesh correctness + full `dist.all_reduce` path + latency/buffer-kind sweeps (evaluation harness — ADR-0043) |
|
||
| `tests/test_intercube_sfr_config.py` | SFR wiring verification |
|
||
| Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
|