CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots

Rename the intercube all-reduce identity to lrab_hierarchical_allreduce
(module, config key, distributed test) so the name reflects both levels
it implements: LRAB intra-SIP (local reduce to center root + broadcast)
and the hierarchical inter-SIP topology exchange (ring/torus/mesh).
ADR-0032 slug kept as the stable decision id; pure rename, no logic change.

Also in this batch:
- ADR-0032 (EN+KO): document the shipped center-root bidirectional reduce
  (doc was stale corner-root); annotate ccl.yaml root_cube as a placeholder.
- Rename allreduce + pe2pe latency plots to descriptive, title-matching
  filenames and retitle the in-plot headings; drop overview/overview_log.
- Point the PPTX image refs at the new plot names.

Doc + derived-artifact + rename only; no simulation behavior changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 20:50:48 -07:00
parent e77e4a1703
commit ff7d727ddd
38 changed files with 259 additions and 272 deletions
+40 -20
@@ -32,7 +32,7 @@ bandwidth characteristics for the common per-cube DP workload.
### Current state
- `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
- `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` — kernel
- `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip`
- `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this
automatically at `init_process_group` time.
@@ -43,29 +43,46 @@ bandwidth characteristics for the common per-cube DP workload.
## Decision
### D1. Algorithm structure — 5 phases
### D1. Algorithm structure — 5 phases (center-root, bidirectional)
The root cube sits at the geometric **center** of the cube mesh:
```
root_col = cube_w // 2
root_row = cube_h // 2
root_cube = root_row * cube_w + root_col # center; 10 on a 4×4 mesh
```
Each reduce/broadcast phase converges/diverges **bidirectionally** toward
this center, shortening the intra-SIP critical path versus a corner-root walk
(4×4 mesh: 4 hops reduce + 4 hops broadcast vs 6+6 with an SE-corner root;
the saving approaches one half as the mesh grows).
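The hop counts quoted above can be checked with a short sketch. It is a minimal model, not the kernel's code: `center_root` and `reduce_hops` are hypothetical helper names, and the model assumes one hop per cube-to-cube send with the row phase and column phase running back to back.

```python
def center_root(cube_w: int, cube_h: int) -> int:
    """Cube id of the geometric center, using the formula from D1."""
    root_col = cube_w // 2
    root_row = cube_h // 2
    return root_row * cube_w + root_col


def reduce_hops(cube_w: int, cube_h: int, root_col: int, root_row: int) -> int:
    """Critical-path hops of a row reduce followed by a column reduce.

    Both sides of the root walk concurrently, so each phase costs the
    distance to the farther edge (assumption: one hop per send).
    """
    row_hops = max(root_col, cube_w - 1 - root_col)
    col_hops = max(root_row, cube_h - 1 - root_row)
    return row_hops + col_hops


w = h = 4
center = reduce_hops(w, h, w // 2, h // 2)   # center root
corner = reduce_hops(w, h, w - 1, h - 1)     # SE-corner root
print(center_root(w, h), center, corner)     # prints "10 4 6"
```

The broadcast phases retrace the same paths outward, so the same counts apply to them.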
For each SIP (launched concurrently by `mp.spawn`):
```
Phase 1 — Row reduce W → E (cube mesh, pe0 only):
col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
Phase 1 — Row reduce converging at col == root_col (cube mesh, pe0 only):
left half (col < root_col) walks W→E; right half (col > root_col)
walks E→W; the root_col cube merges both sides → holds row sum.
Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
holds the full SIP sum.
Phase 2 — Col reduce on col == root_col converging at row == root_row:
above (row < root_row) walks N→S; below (row > root_row) walks S→N;
the root cube merges both → holds the full SIP sum.
Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
Phase 3 — Inter-SIP exchange on cube_id == root_cube (pe0 only):
Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
selected by sip_topo_kind (from topology.yaml sips.topology).
Phase 4 — Col broadcast S → N on rightmost column.
Phase 4 — Col broadcast on col == root_col, outward from root_row.
Phase 5 — Row broadcast E → W across the cube mesh.
Phase 5 — Row broadcast outward from root_col across the cube mesh.
```
After all phases every cube's pe0 holds the global sum.
**Single-cube fast-path**: when `cube_w == cube_h == 1` (one cube per rank,
the common TP case), the intra-SIP reduce/broadcast phases are skipped and
the kernel goes straight to the Phase 3 inter-SIP exchange.
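As a rough illustration of the phase structure, the sketch below simulates the data flow on plain Python integers. It is not the kernel: the function names are hypothetical, the Phase 3 inter-SIP exchange is replaced by a plain sum (the ring/torus/mesh exchange is not modelled), and Phases 4-5 are modelled as simply copying the total to every cube.

```python
def lrab_intra_sip_reduce(vals, cube_w, cube_h):
    """Phases 1-2 for one SIP: returns (root_cube, sip_sum)."""
    root_col, root_row = cube_w // 2, cube_h // 2
    grid = [[vals[r * cube_w + c] for c in range(cube_w)] for r in range(cube_h)]

    # Phase 1: every row converges on root_col from both sides.
    for r in range(cube_h):
        for c in range(root_col):                  # left half walks W->E
            grid[r][c + 1] += grid[r][c]
        for c in range(cube_w - 1, root_col, -1):  # right half walks E->W
            grid[r][c - 1] += grid[r][c]

    # Phase 2: the root column converges on root_row from both sides.
    for r in range(root_row):                      # above walks N->S
        grid[r + 1][root_col] += grid[r][root_col]
    for r in range(cube_h - 1, root_row, -1):      # below walks S->N
        grid[r - 1][root_col] += grid[r][root_col]

    return root_row * cube_w + root_col, grid[root_row][root_col]


def simulate_allreduce(per_sip_vals, cube_w, cube_h):
    """All five phases; per_sip_vals holds one list of cube values per SIP."""
    n_cubes = cube_w * cube_h
    if n_cubes == 1:
        # Single-cube fast path: no intra-SIP reduce is needed.
        sip_sums = [vals[0] for vals in per_sip_vals]
    else:
        sip_sums = [lrab_intra_sip_reduce(v, cube_w, cube_h)[1]
                    for v in per_sip_vals]
    total = sum(sip_sums)  # stand-in for the Phase 3 inter-SIP exchange
    # Phases 4-5: broadcast outward from the root to every cube.
    return [[total] * n_cubes for _ in per_sip_vals]


result = simulate_allreduce([list(range(16)), [1] * 16], cube_w=4, cube_h=4)
```

With two SIPs holding `0..15` and sixteen ones, every entry of `result` is 136 (120 + 16), matching the statement that every cube's pe0 ends up with the global sum.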
The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
across topologies; only phase 3 branches. Helper functions
@@ -154,17 +171,19 @@ At each `dist.all_reduce(tensor)` call:
```yaml
defaults:
algorithm: intercube_allreduce
algorithm: lrab_hierarchical_allreduce
buffer_kind: tcm
...
algorithms:
intercube_allreduce:
module: kernbench.ccl.algorithms.intercube_allreduce
lrab_hierarchical_allreduce:
module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
topology: none
buffer_kind: tcm
n_elem: 8
root_cube: 15
root_cube: 15 # NOT read today — the kernel elects the root dynamically
# as the geometric center (see D1). Kept as a placeholder
# for a future explicit-root override / runtime election.
```
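One practical consequence of the rename is that lookups by the old key stop resolving. The sketch below mirrors the shown `ccl.yaml` fragment as a Python dict and resolves the default algorithm to its module path. `resolve_algorithm` is a hypothetical helper written for illustration, not the loader kernbench ships, and the elided `...` keys are omitted.

```python
# Dict mirror of the ccl.yaml fragment above (elided keys left out).
CCL_CFG = {
    "defaults": {
        "algorithm": "lrab_hierarchical_allreduce",
        "buffer_kind": "tcm",
    },
    "algorithms": {
        "lrab_hierarchical_allreduce": {
            "module": "kernbench.ccl.algorithms.lrab_hierarchical_allreduce",
            "topology": "none",
            "buffer_kind": "tcm",
            "n_elem": 8,
            "root_cube": 15,  # placeholder; the kernel uses the center root
        },
    },
}


def resolve_algorithm(cfg, name=None):
    """Return (name, entry) for `name`, or for the configured default."""
    name = name or cfg["defaults"]["algorithm"]
    if name not in cfg["algorithms"]:
        known = sorted(cfg["algorithms"])
        raise KeyError(f"unknown CCL algorithm {name!r}; known: {known}")
    return name, cfg["algorithms"][name]


name, entry = resolve_algorithm(CCL_CFG)
print(name, "->", entry["module"])
```

Under this sketch, a caller that still passes `"intercube_allreduce"` gets a `KeyError` listing the known names, which is a quick way to find references the rename missed.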
`topology.yaml`:
@@ -207,9 +226,10 @@ Modules loaded via `cfg["module"]` must export:
`mesh_2d_no_wrap` require `n_sips = k²`.
- **Pipelined chunks**: single-tile per cube, no pipelining yet.
- **Root cube runtime election**: the kernel currently uses
`root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
corner. SFR wiring covers all cubes, so runtime election is a pure kernel
change when needed.
`root_cube = (mesh_h // 2) * mesh_w + (mesh_w // 2)` — the geometric
center, chosen to minimize the intra-SIP critical path. SFR wiring
covers all cubes, so electing a different root at runtime is a pure
kernel change when needed.
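The `n_sips = k²` constraint noted above can be validated before launch. The following is a minimal sketch under stated assumptions: `TOPO_KIND_SKETCH` and `sip_grid_side` are hypothetical names, the integer codes follow the order in which D1 lists the topologies, and `torus_2d` is assumed to share the square-count requirement with `mesh_2d_no_wrap`.

```python
import math

# Assumed name -> kind mapping, following the order listed in D1.
TOPO_KIND_SKETCH = {"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2}


def sip_grid_side(topology: str, n_sips: int):
    """Return the SIP grid side k for 2D topologies, or None for a ring.

    Raises ValueError for an unknown topology or a non-square SIP count.
    """
    if topology not in TOPO_KIND_SKETCH:
        raise ValueError(f"unknown SIP topology {topology!r}")
    if topology == "ring_1d":
        return None  # a ring places no shape constraint on n_sips
    k = math.isqrt(n_sips)
    if n_sips < 1 or k * k != n_sips:
        raise ValueError(f"{topology} requires n_sips = k^2, got {n_sips}")
    return k


print(sip_grid_side("torus_2d", 16))  # prints "4"
```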
---
@@ -242,15 +262,15 @@ Modules loaded via `cfg["module"]` must export:
| File | Change |
|---|---|
| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
| `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
| `ccl.yaml` | Single `intercube_allreduce` entry |
| `ccl.yaml` | Single `lrab_hierarchical_allreduce` entry |
| `topology.yaml` | Added `system.sips.topology` |
| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
| `tests/test_distributed_lrab_hierarchical_allreduce.py` (new) | Full `dist.all_reduce` path |
| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
| Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |