Files
kernbench2/tests/sccl/test_intercube_root_center.py
mukesh b610cb0d9a sccl: drive allreduce tests via torch.distributed; reorganize into tests/sccl/
Convert the multidevice allreduce correctness + latency/buffer-kind sweeps
to run through the real PyTorch-distributed path
(init_process_group(backend="ahbm") -> mp.spawn -> dist.all_reduce) instead
of direct ctx.launch, and reorganize the CCL/allreduce tests into a
tests/sccl/ package split one test per file.

Production change (required for the distributed path on non-square SIP grids):
- AhbmCCLBackend now reads explicit system.sips.w/h from the spec, with a
  square-only sqrt fallback that raises on ambiguity, instead of silently
  guessing round(sqrt(count)). This fixes the 2x3 / 3x2 torus + mesh cases,
  which previously resolved to a wrong 2x2 grid. Mirrors the test helper's
  _sip_topo_dims precedence (explicit w/h > square fallback > raise).

Test reorganization (tests/sccl/):
- _allreduce_helpers.py: shared plumbing (distributed driver, config writers,
  direct-launch run_allreduce parity reference, sweep/buffer-kind constants,
  plot aggregators, topology-diagram + FSIM-comparison emitters).
- test_allreduce_ring_torus_mesh.py: correctness across ring/torus/mesh.
- test_distributed_default_topology.py: full distributed path on topology.yaml.
- test_plot_latency_sweep.py / test_plot_buffer_kind_sweep.py: sweep rows.
- test_plot_topology_diagram.py / test_plot_comparison_fsim.py: plot emitters.
- test_intercube_root_center.py: moved in (ADR-0032 center-root latency guard).

Also:
- Move the FSIM comparison plot generator out of scripts/ into the sccl suite.
- Delete superseded test files (test_allreduce_multidevice,
  test_distributed_lrab_hierarchical_allreduce, test_allreduce_buffer_kind_sweep)
  and repoint conftest aggregators + the ipcq buffer-kind importers.
- Regenerate the allreduce_latency_plots derived artifacts from the full sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 22:24:43 -07:00

145 lines
5.3 KiB
Python
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
"""Phase 1 test for moving the lrab_hierarchical_allreduce root cube from the
bottom-right corner (3,3) to the geometric center (2,2).
Today's algorithm (lrab_hierarchical_allreduce.py) hardcodes
``root_cube = (cube_h-1) * cube_w + (cube_w-1)`` (= cube 15 in 4×4).
The intra-SIP critical path for one allreduce is therefore::
Phase 1 (row reduce W→E to col 3) : 3 hops
Phase 2 (col reduce N→S to row 3 on col 3): 3 hops
Phase 3 (inter-SIP at root) : (separate)
Phase 4 (col broadcast S→N) : 3 hops
Phase 5 (row broadcast E→W) : 3 hops
Total intra-SIP critical path : 12 hops
Moving the root to (2,2) and using BIDIRECTIONAL convergence (cols 0..2
go W→E, col 3 goes E→W in parallel; rows 0..2 go N→S, row 3 goes S→N
in parallel) cuts each phase's critical path from 3 hops to 2::
Phase 1 critical path : max(2, 1) = 2 hops
Phase 2 critical path : max(2, 1) = 2 hops
Phase 4 critical path : 2 hops
Phase 5 critical path : 2 hops
Total intra-SIP critical path : 8 hops
Per-hop cost at 96 KB on TCM ≈ 600 ns (slot IO write+read 384 ns +
fabric drain ~217 ns). 4 fewer hops ⇒ ~2.4 µs reduction.
EXPECTED Phase 1 outcome:
- Today (root = corner) : ~22.0 µs ← test FAILS (> 20500 ns)
- After Phase 2 (root = center) : ~19.6 µs ← test PASSES (< 20500 ns)
"""
from __future__ import annotations
from pathlib import Path
import pytest
from kernbench.runtime_api.context import RuntimeContext
from kernbench.runtime_api.types import DeviceSelector
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
from tests.sccl._allreduce_helpers import (
_write_temp_configs,
run_allreduce,
)
def _run_torus_96kb(tmp_path: Path) -> float:
"""Run torus_2d 6-SIP allreduce at 96 KB / slot, return critical-path
pe_exec_ns. Fixed at TCM (the project default)."""
sub = tmp_path / "torus_root_center"
sub.mkdir()
topo_path, ccl_path = _write_temp_configs(
sub,
sip_topology="torus_2d",
n_sips=6,
algorithm="lrab_hierarchical_allreduce",
sip_w=3, sip_h=2,
n_elem_override=49152, # 49152 × 2 = 96 KB / slot
)
topo = resolve_topology(topo_path)
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id="root_center_phase1",
spec=spec,
) as ctx:
result = run_allreduce(
ctx, engine, spec,
algorithm="lrab_hierarchical_allreduce", ccl_yaml=ccl_path,
)
assert result["ok_cubes"] > 0
pe_exec_vals = [
float(tr.get("pe_exec_ns", 0.0) or 0.0)
for _, (_, tr) in engine._results.items()
if isinstance(tr, dict)
]
return max(pe_exec_vals) if pe_exec_vals else 0.0
def test_intra_sip_critical_path_at_96k_below_threshold(tmp_path):
"""Post-Phase-2 (root=center, bidirectional reduce) the torus_2d
96 KB allreduce on TCM should be meaningfully lower than corner
root with serial reduce.
The absolute number depends on the latency model's fidelity.
Under ADR-0033 Phase 2c (per-flit wire timing, wormhole) the
bottleneck-link transit time is charged once per flit on each
serialized hop, so allreduce numbers are higher than pre-2c
estimates. Threshold widened to 30 µs to accommodate the more
accurate model; the algorithmic property (8-hop center root <
12-hop corner root) is the invariant being asserted.
"""
lat_ns = _run_torus_96kb(tmp_path)
THRESHOLD_NS = 30_000.0
assert lat_ns < THRESHOLD_NS, (
f"torus_2d 6-SIP 96 KB allreduce should land below "
f"{THRESHOLD_NS:.0f} ns post-Phase-2 (root=center, "
f"bidirectional reduce). got {lat_ns:.1f} ns "
f"({lat_ns / 1000:.2f} µs)"
)
def test_correctness_preserved(tmp_path):
"""Smoke check: at small n_elem the new algorithm must still produce
the correct sum across all 96 cubes. ``run_allreduce`` validates
every cube against the expected reduce result (``ok_cubes`` must be
96 = 6 SIPs × 16 cubes).
This guards against the obvious Phase 2 risk: bidirectional reduce
sums each contribution exactly once. If implemented wrong (double-
counting or skipping the right edge column / bottom row), the
asserts inside run_allreduce fail.
"""
sub = tmp_path / "correctness"
sub.mkdir()
topo_path, ccl_path = _write_temp_configs(
sub,
sip_topology="torus_2d",
n_sips=6,
algorithm="lrab_hierarchical_allreduce",
sip_w=3, sip_h=2,
n_elem_override=128, # tiny payload to keep this fast
)
topo = resolve_topology(topo_path)
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id="root_center_correctness",
spec=spec,
) as ctx:
result = run_allreduce(
ctx, engine, spec,
algorithm="lrab_hierarchical_allreduce", ccl_yaml=ccl_path,
)
n_cubes = 6 * 16 # 6 SIPs × 16 cubes/SIP
assert result["ok_cubes"] == n_cubes, (
f"all 96 cubes must validate; got {result['ok_cubes']} OK"
)