Intercube allreduce: center root + bidirectional reduce

Move the algorithmic root cube from the corner (cube_w-1,
cube_h-1) to the geometric center (cube_w//2, cube_h//2) and
have each phase converge bidirectionally so the intra-SIP
critical path drops from ~12 hops to ~8 hops on a 4×4 mesh
(left half W→E + right half E→W in row reduce; top half N→S +
bottom half S→N in col reduce; mirrored on broadcast).

Result on torus_2d 6 SIPs at 96 KB / PE on TCM:
  before (corner root)  : 22.0 µs
  after  (center root)  : 17.2 µs   (−22%)

Same shape on ring_1d (−7%) and mesh_2d_no_wrap (−12%); also
holds across SRAM and HBM (~−20% each).

Phase 1 test (test_intercube_root_center.py) asserts the
torus_2d 96 KB latency drops below 20.5 µs and that all 96
cubes still validate (correctness preserved).

Plot updates:
- overview.png: replace constant 10.6 µs theoretical line with
  user-supplied hand-derived curve (per-cube packet count =
  bytes_per_pe × 8 PEs ÷ 128 B; 1346 ns startup + 1.20 ns/pkt).
- All summary.csv numbers and per-topology PNGs regenerated.
- pe2pe_latency_plots and ipcq diagram emitter PNGs refreshed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-27 21:28:58 -07:00
parent 84a1325e5c
commit 1c5752a9ec
16 changed files with 324 additions and 157 deletions
+139
View File
@@ -0,0 +1,139 @@
"""Phase 1 test for moving the intercube_allreduce root cube from the
bottom-right corner (3,3) to the geometric center (2,2).
Today's algorithm (intercube_allreduce.py) hardcodes
``root_cube = (cube_h-1) * cube_w + (cube_w-1)`` (= cube 15 in 4×4).
The intra-SIP critical path for one allreduce is therefore::
Phase 1 (row reduce W→E to col 3) : 3 hops
Phase 2 (col reduce N→S to row 3 on col 3): 3 hops
Phase 3 (inter-SIP at root) : (separate)
Phase 4 (col broadcast S→N) : 3 hops
Phase 5 (row broadcast E→W) : 3 hops
Total intra-SIP critical path : 12 hops
Moving the root to (2,2) and using BIDIRECTIONAL convergence (cols 0..2
go W→E, col 3 goes E→W in parallel; rows 0..2 go N→S, row 3 goes S→N
in parallel) cuts each phase's critical path from 3 hops to 2::
Phase 1 critical path : max(2, 1) = 2 hops
Phase 2 critical path : max(2, 1) = 2 hops
Phase 4 critical path : 2 hops
Phase 5 critical path : 2 hops
Total intra-SIP critical path : 8 hops
Per-hop cost at 96 KB on TCM ≈ 600 ns (slot IO write+read 384 ns +
fabric drain ~217 ns). 4 fewer hops ⇒ ~2.4 µs reduction.
EXPECTED Phase 1 outcome:
- Today (root = corner) : ~22.0 µs ← test FAILS (> 20500 ns)
- After Phase 2 (root = center) : ~19.6 µs ← test PASSES (< 20500 ns)
"""
from __future__ import annotations
from pathlib import Path
import pytest
from kernbench.runtime_api.context import RuntimeContext
from kernbench.runtime_api.types import DeviceSelector
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
from tests.test_allreduce_multidevice import (
_write_temp_configs,
run_allreduce,
)
def _run_torus_96kb(tmp_path: Path) -> float:
"""Run torus_2d 6-SIP allreduce at 96 KB / slot, return critical-path
pe_exec_ns. Fixed at TCM (the project default)."""
sub = tmp_path / "torus_root_center"
sub.mkdir()
topo_path, ccl_path = _write_temp_configs(
sub,
sip_topology="torus_2d",
n_sips=6,
algorithm="intercube_allreduce",
sip_w=3, sip_h=2,
n_elem_override=49152, # 49152 × 2 = 96 KB / slot
)
topo = resolve_topology(topo_path)
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id="root_center_phase1",
spec=spec,
) as ctx:
result = run_allreduce(
ctx, engine, spec,
algorithm="intercube_allreduce", ccl_yaml=ccl_path,
)
assert result["ok_cubes"] > 0
pe_exec_vals = [
float(tr.get("pe_exec_ns", 0.0) or 0.0)
for _, (_, tr) in engine._results.items()
if isinstance(tr, dict)
]
return max(pe_exec_vals) if pe_exec_vals else 0.0
def test_intra_sip_critical_path_at_96k_below_threshold(tmp_path):
"""Post-Phase-2 (root=center, bidirectional reduce) the torus_2d
96 KB allreduce on TCM should drop below 20.5 µs.
Today's value: ~22.0 µs (12-hop critical path with corner root).
Expected post-Phase-2: ~19.6 µs (8-hop critical path with
center root) — model estimate, ~11% reduction end-to-end.
"""
lat_ns = _run_torus_96kb(tmp_path)
THRESHOLD_NS = 20_500.0
assert lat_ns < THRESHOLD_NS, (
f"torus_2d 6-SIP 96 KB allreduce should land below "
f"{THRESHOLD_NS:.0f} ns post-Phase-2 (root=center, "
f"bidirectional reduce). got {lat_ns:.1f} ns "
f"({lat_ns / 1000:.2f} µs)"
)
def test_correctness_preserved(tmp_path):
"""Smoke check: at small n_elem the new algorithm must still produce
the correct sum across all 96 cubes. ``run_allreduce`` validates
every cube against the expected reduce result (``ok_cubes`` must be
96 = 6 SIPs × 16 cubes).
This guards against the obvious Phase 2 risk: bidirectional reduce
sums each contribution exactly once. If implemented wrong (double-
counting or skipping the right edge column / bottom row), the
asserts inside run_allreduce fail.
"""
sub = tmp_path / "correctness"
sub.mkdir()
topo_path, ccl_path = _write_temp_configs(
sub,
sip_topology="torus_2d",
n_sips=6,
algorithm="intercube_allreduce",
sip_w=3, sip_h=2,
n_elem_override=128, # tiny payload to keep this fast
)
topo = resolve_topology(topo_path)
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id="root_center_correctness",
spec=spec,
) as ctx:
result = run_allreduce(
ctx, engine, spec,
algorithm="intercube_allreduce", ccl_yaml=ccl_path,
)
n_cubes = 6 * 16 # 6 SIPs × 16 cubes/SIP
assert result["ok_cubes"] == n_cubes, (
f"all 96 cubes must validate; got {result['ok_cubes']} OK"
)