Files
kernbench2/tests/test_intercube_root_center.py
T
mukesh ff7d727ddd CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots
Rename the intercube all-reduce identity to lrab_hierarchical_allreduce
(module, config key, distributed test) so the name reflects both levels
it implements: LRAB intra-SIP (local reduce to center root + broadcast)
and the hierarchical inter-SIP topology exchange (ring/torus/mesh).
ADR-0032 slug kept as the stable decision id; pure rename, no logic change.

Also in this batch:
- ADR-0032 (EN+KO): document the shipped center-root bidirectional reduce
  (doc was stale corner-root); annotate ccl.yaml root_cube as a placeholder.
- Rename allreduce + pe2pe latency plots to descriptive, title-matching
  filenames and retitle the in-plot headings; drop overview/overview_log.
- Point the PPTX image refs at the new plot names.

Doc + derived-artifact + rename only; no simulation behavior changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 20:50:48 -07:00

145 lines
5.3 KiB
Python
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
"""Phase 1 test for moving the lrab_hierarchical_allreduce root cube from the
bottom-right corner (3,3) to the geometric center (2,2).
Today's algorithm (lrab_hierarchical_allreduce.py) hardcodes
``root_cube = (cube_h-1) * cube_w + (cube_w-1)`` (= cube 15 in 4×4).
The intra-SIP critical path for one allreduce is therefore::
Phase 1 (row reduce W→E to col 3) : 3 hops
Phase 2 (col reduce N→S to row 3 on col 3): 3 hops
Phase 3 (inter-SIP at root) : (separate)
Phase 4 (col broadcast S→N) : 3 hops
Phase 5 (row broadcast E→W) : 3 hops
Total intra-SIP critical path : 12 hops
Moving the root to (2,2) and using BIDIRECTIONAL convergence (cols 0..2
go W→E, col 3 goes E→W in parallel; rows 0..2 go N→S, row 3 goes S→N
in parallel) cuts each phase's critical path from 3 hops to 2::
Phase 1 critical path : max(2, 1) = 2 hops
Phase 2 critical path : max(2, 1) = 2 hops
Phase 4 critical path : 2 hops
Phase 5 critical path : 2 hops
Total intra-SIP critical path : 8 hops
Per-hop cost at 96 KB on TCM ≈ 600 ns (slot IO write+read 384 ns +
fabric drain ~217 ns). 4 fewer hops ⇒ ~2.4 µs reduction.
EXPECTED Phase 1 outcome:
- Today (root = corner) : ~22.0 µs ← test FAILS (> 20500 ns)
- After Phase 2 (root = center) : ~19.6 µs ← test PASSES (< 20500 ns)
"""
from __future__ import annotations
from pathlib import Path
import pytest
from kernbench.runtime_api.context import RuntimeContext
from kernbench.runtime_api.types import DeviceSelector
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
from tests.test_allreduce_multidevice import (
_write_temp_configs,
run_allreduce,
)
def _run_torus_96kb(tmp_path: Path) -> float:
"""Run torus_2d 6-SIP allreduce at 96 KB / slot, return critical-path
pe_exec_ns. Fixed at TCM (the project default)."""
sub = tmp_path / "torus_root_center"
sub.mkdir()
topo_path, ccl_path = _write_temp_configs(
sub,
sip_topology="torus_2d",
n_sips=6,
algorithm="lrab_hierarchical_allreduce",
sip_w=3, sip_h=2,
n_elem_override=49152, # 49152 × 2 = 96 KB / slot
)
topo = resolve_topology(topo_path)
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id="root_center_phase1",
spec=spec,
) as ctx:
result = run_allreduce(
ctx, engine, spec,
algorithm="lrab_hierarchical_allreduce", ccl_yaml=ccl_path,
)
assert result["ok_cubes"] > 0
pe_exec_vals = [
float(tr.get("pe_exec_ns", 0.0) or 0.0)
for _, (_, tr) in engine._results.items()
if isinstance(tr, dict)
]
return max(pe_exec_vals) if pe_exec_vals else 0.0
def test_intra_sip_critical_path_at_96k_below_threshold(tmp_path):
"""Post-Phase-2 (root=center, bidirectional reduce) the torus_2d
96 KB allreduce on TCM should be meaningfully lower than corner
root with serial reduce.
The absolute number depends on the latency model's fidelity.
Under ADR-0033 Phase 2c (per-flit wire timing, wormhole) the
bottleneck-link transit time is charged once per flit on each
serialized hop, so allreduce numbers are higher than pre-2c
estimates. Threshold widened to 30 µs to accommodate the more
accurate model; the algorithmic property (8-hop center root <
12-hop corner root) is the invariant being asserted.
"""
lat_ns = _run_torus_96kb(tmp_path)
THRESHOLD_NS = 30_000.0
assert lat_ns < THRESHOLD_NS, (
f"torus_2d 6-SIP 96 KB allreduce should land below "
f"{THRESHOLD_NS:.0f} ns post-Phase-2 (root=center, "
f"bidirectional reduce). got {lat_ns:.1f} ns "
f"({lat_ns / 1000:.2f} µs)"
)
def test_correctness_preserved(tmp_path):
"""Smoke check: at small n_elem the new algorithm must still produce
the correct sum across all 96 cubes. ``run_allreduce`` validates
every cube against the expected reduce result (``ok_cubes`` must be
96 = 6 SIPs × 16 cubes).
This guards against the obvious Phase 2 risk: bidirectional reduce
sums each contribution exactly once. If implemented wrong (double-
counting or skipping the right edge column / bottom row), the
asserts inside run_allreduce fail.
"""
sub = tmp_path / "correctness"
sub.mkdir()
topo_path, ccl_path = _write_temp_configs(
sub,
sip_topology="torus_2d",
n_sips=6,
algorithm="lrab_hierarchical_allreduce",
sip_w=3, sip_h=2,
n_elem_override=128, # tiny payload to keep this fast
)
topo = resolve_topology(topo_path)
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id="root_center_correctness",
spec=spec,
) as ctx:
result = run_allreduce(
ctx, engine, spec,
algorithm="lrab_hierarchical_allreduce", ccl_yaml=ccl_path,
)
n_cubes = 6 * 16 # 6 SIPs × 16 cubes/SIP
assert result["ok_cubes"] == n_cubes, (
f"all 96 cubes must validate; got {result['ok_cubes']} OK"
)