CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots
Rename the intercube all-reduce identity to lrab_hierarchical_allreduce (module, config key, distributed test) so the name reflects both levels it implements: LRAB intra-SIP (local reduce to center root + broadcast) and the hierarchical inter-SIP topology exchange (ring/torus/mesh). ADR-0032 slug kept as the stable decision id; pure rename, no logic change. Also in this batch: - ADR-0032 (EN+KO): document the shipped center-root bidirectional reduce (doc was stale corner-root); annotate ccl.yaml root_cube as a placeholder. - Rename allreduce + pe2pe latency plots to descriptive, title-matching filenames and retitle the in-plot headings; drop overview/overview_log. - Point the PPTX image refs at the new plot names. Doc + derived-artifact + rename only; no simulation behavior changed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -32,7 +32,7 @@ bandwidth characteristics for the common per-cube DP workload.
|
||||
|
||||
### Current state
|
||||
|
||||
- `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
|
||||
- `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` — kernel
|
||||
- `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip`
|
||||
- `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this
|
||||
automatically at `init_process_group` time.
|
||||
@@ -43,29 +43,46 @@ bandwidth characteristics for the common per-cube DP workload.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Algorithm structure — 5 phases
|
||||
### D1. Algorithm structure — 5 phases (center-root, bidirectional)
|
||||
|
||||
The root cube sits at the geometric **center** of the cube mesh:
|
||||
|
||||
```
|
||||
root_col = cube_w // 2
|
||||
root_row = cube_h // 2
|
||||
root_cube = root_row * cube_w + root_col # center; 10 on a 4×4 mesh
|
||||
```
|
||||
|
||||
Each reduce/broadcast phase converges/diverges **bidirectionally** toward
|
||||
this center, halving the intra-SIP critical path versus a corner-root walk
|
||||
(4×4 mesh: 4 hops reduce + 4 hops broadcast vs 6+6 with an SE-corner root).
|
||||
|
||||
For each SIP (launched concurrently by `mp.spawn`):
|
||||
|
||||
```
|
||||
Phase 1 — Row reduce W → E (cube mesh, pe0 only):
|
||||
col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
|
||||
Phase 1 — Row reduce converging at col == root_col (cube mesh, pe0 only):
|
||||
left half (col < root_col) walks W→E; right half (col > root_col)
|
||||
walks E→W; the root_col cube merges both sides → holds row sum.
|
||||
|
||||
Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
|
||||
row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
|
||||
holds the full SIP sum.
|
||||
Phase 2 — Col reduce on col == root_col converging at row == root_row:
|
||||
above (row < root_row) walks N→S; below (row > root_row) walks S→N;
|
||||
the root cube merges both → holds the full SIP sum.
|
||||
|
||||
Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
|
||||
Phase 3 — Inter-SIP exchange on cube_id == root_cube (pe0 only):
|
||||
Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
|
||||
selected by sip_topo_kind (from topology.yaml sips.topology).
|
||||
|
||||
Phase 4 — Col broadcast S → N on rightmost column.
|
||||
Phase 4 — Col broadcast on col == root_col, outward from root_row.
|
||||
|
||||
Phase 5 — Row broadcast E → W across the cube mesh.
|
||||
Phase 5 — Row broadcast outward from root_col across the cube mesh.
|
||||
```
|
||||
|
||||
After all phases every cube's pe0 holds the global sum.
|
||||
|
||||
**Single-cube fast-path**: when `cube_w == cube_h == 1` (one cube per rank,
|
||||
the common TP case), the intra-SIP reduce/broadcast phases are skipped and
|
||||
the kernel goes straight to the Phase 3 inter-SIP exchange.
|
||||
|
||||
The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
|
||||
(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
|
||||
across topologies; only phase 3 branches. Helper functions
|
||||
@@ -154,17 +171,19 @@ At each `dist.all_reduce(tensor)` call:
|
||||
|
||||
```yaml
|
||||
defaults:
|
||||
algorithm: intercube_allreduce
|
||||
algorithm: lrab_hierarchical_allreduce
|
||||
buffer_kind: tcm
|
||||
...
|
||||
|
||||
algorithms:
|
||||
intercube_allreduce:
|
||||
module: kernbench.ccl.algorithms.intercube_allreduce
|
||||
lrab_hierarchical_allreduce:
|
||||
module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
|
||||
topology: none
|
||||
buffer_kind: tcm
|
||||
n_elem: 8
|
||||
root_cube: 15
|
||||
root_cube: 15 # NOT read today — the kernel elects the root dynamically
|
||||
# as the geometric center (see D1). Kept as a placeholder
|
||||
# for a future explicit-root override / runtime election.
|
||||
```
|
||||
|
||||
`topology.yaml`:
|
||||
@@ -207,9 +226,10 @@ Modules loaded via `cfg["module"]` must export:
|
||||
`mesh_2d_no_wrap` require `n_sips = k²`.
|
||||
- **Pipelined chunks**: single-tile per cube, no pipelining yet.
|
||||
- **Root cube runtime election**: the kernel currently uses
|
||||
`root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
|
||||
corner. SFR wiring covers all cubes, so runtime election is a pure kernel
|
||||
change when needed.
|
||||
`root_cube = (mesh_h // 2) * mesh_w + (mesh_w // 2)` — the geometric
|
||||
center, chosen to minimize the intra-SIP critical path. SFR wiring
|
||||
covers all cubes, so electing a different root at runtime is a pure
|
||||
kernel change when needed.
|
||||
|
||||
---
|
||||
|
||||
@@ -242,15 +262,15 @@ Modules loaded via `cfg["module"]` must export:
|
||||
|
||||
| File | Change |
|
||||
|---|---|
|
||||
| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
|
||||
| `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
|
||||
| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
|
||||
| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
|
||||
| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
|
||||
| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
|
||||
| `ccl.yaml` | Single `intercube_allreduce` entry |
|
||||
| `ccl.yaml` | Single `lrab_hierarchical_allreduce` entry |
|
||||
| `topology.yaml` | Added `system.sips.topology` |
|
||||
| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
|
||||
| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
|
||||
| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
|
||||
| `tests/test_distributed_lrab_hierarchical_allreduce.py` (new) | Full `dist.all_reduce` path |
|
||||
| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
|
||||
| Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
|
||||
|
||||
Reference in New Issue
Block a user