Convert the multidevice allreduce correctness + latency/buffer-kind sweeps
to run through the real PyTorch-distributed path
(init_process_group(backend="ahbm") -> mp.spawn -> dist.all_reduce) instead
of direct ctx.launch, and reorganize the CCL/allreduce tests into a
tests/sccl/ package split one test per file.
Production change (required for the distributed path on non-square SIP grids):
- AhbmCCLBackend now reads explicit system.sips.w/h from the spec, with a
square-only sqrt fallback that raises on ambiguity, instead of silently
guessing round(sqrt(count)). This fixes the 2x3 / 3x2 torus + mesh cases,
which previously resolved to a wrong 2x2 grid. Mirrors the test helper's
_sip_topo_dims precedence (explicit w/h > square fallback > raise).
Test reorganization (tests/sccl/):
- _allreduce_helpers.py: shared plumbing (distributed driver, config writers,
direct-launch run_allreduce parity reference, sweep/buffer-kind constants,
plot aggregators, topology-diagram + FSIM-comparison emitters).
- test_allreduce_ring_torus_mesh.py: correctness across ring/torus/mesh.
- test_distributed_default_topology.py: full distributed path on topology.yaml.
- test_plot_latency_sweep.py / test_plot_buffer_kind_sweep.py: sweep rows.
- test_plot_topology_diagram.py / test_plot_comparison_fsim.py: plot emitters.
- test_intercube_root_center.py: moved in (ADR-0032 center-root latency guard).
Also:
- Move the FSIM comparison plot generator out of scripts/ into the sccl suite.
- Delete superseded test files (test_allreduce_multidevice,
test_distributed_lrab_hierarchical_allreduce, test_allreduce_buffer_kind_sweep)
and repoint conftest aggregators + the ipcq buffer-kind importers.
- Regenerate the allreduce_latency_plots derived artifacts from the full sweep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refactor the latency sweep from one giant test into 36 parametrized
cases that run in parallel under xdist (~6-8x faster: 1:49 instead of
~10 min). Each case writes a JSON row to a staging dir; conftest
sessionfinish hook aggregates rows on the controller node into
summary.csv and the per-topology + overview plots.
Aggregator gains a CSV fallback so plot-only tweaks no longer require
re-running the sweep.
Overview plot updates:
- 96 KB explicit x-axis marker with vertical dotted line
- horizontal theoretical 2D-torus reference (10600 ns)
- annotation showing both theoretical and simulated values at 96 KB
- drop overlapping 128 KB tick
New topology.png: 2x2 panel diagram showing device-level topology
(ring, torus 2x3, mesh 2x3) and the cube-level reduction inside SIP 0.
Wrap arrows anchor on box edges and arc outside rows/columns so they
do not overlap any SIP.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Provides a shared `topology` fixture that caches the parsed
topology.yaml result per pytest-xdist worker session. Tests that
build a GraphEngine can accept `topology` instead of calling
resolve_topology("topology.yaml") repeatedly.
Topology parsing costs ~32ms, so the practical saving per worker is
modest (<1s across all tests). The fixture is mainly for architectural
cleanliness — keeping the "parse once, build engine many" pattern
explicit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>