sccl: drive allreduce tests via torch.distributed; reorganize into tests/sccl/

Convert the multidevice allreduce correctness + latency/buffer-kind sweeps
to run through the real PyTorch-distributed path
(init_process_group(backend="ahbm") -> mp.spawn -> dist.all_reduce) instead
of direct ctx.launch, and reorganize the CCL/allreduce tests into a
tests/sccl/ package split one test per file.

Production change (required for the distributed path on non-square SIP grids):
- AhbmCCLBackend now reads explicit system.sips.w/h from the spec (raising
  if w*h != sips.count) and falls back to sqrt(count) only when the count is
  a perfect square, raising otherwise, instead of silently guessing
  round(sqrt(count)). This fixes the 2x3 / 3x2 torus + mesh cases, which
  previously resolved to a wrong 2x2 grid. Mirrors the test helper's
  _sip_topo_dims precedence (explicit w/h > square fallback > raise).
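
  The precedence above can be sketched as a standalone function. This is an
  illustrative sketch only, not the backend code: resolve_sip_dims and the
  "torus_2d" / "mesh_2d" topology names are hypothetical, and only "ring_1d"
  and the sips.w/h/count keys come from this change.

```python
import math


def resolve_sip_dims(topo, n_sips, sips):
    """Hypothetical helper: resolve (w, h) as explicit w/h > square fallback > raise."""
    # A 1-D ring has no 2-D layout, so both dimensions are reported as 0.
    if topo == "ring_1d":
        return 0, 0

    w, h = sips.get("w"), sips.get("h")
    if w is not None and h is not None:
        # Explicit dimensions win, but they must agree with the SIP count.
        w, h = int(w), int(h)
        if w * h != n_sips:
            raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
        return w, h

    # Square-only fallback: accept the count only if it is a perfect square.
    side = math.isqrt(n_sips)
    if side * side != n_sips:
        raise ValueError(
            f"SIP topology '{topo}' requires square sips.count "
            f"or explicit sips.w/h, got {n_sips}"
        )
    return side, side


# Assumed example inputs; the topology names are placeholders.
print(resolve_sip_dims("torus_2d", 6, {"w": 2, "h": 3}))
print(resolve_sip_dims("mesh_2d", 4, {}))
```

  With 6 SIPs and no explicit w/h, the sketch raises ValueError rather than
  rounding sqrt(6) down to a 2x2 grid.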

Test reorganization (tests/sccl/):
- _allreduce_helpers.py: shared plumbing (distributed driver, config writers,
  direct-launch run_allreduce parity reference, sweep/buffer-kind constants,
  plot aggregators, topology-diagram + FSIM-comparison emitters).
- test_allreduce_ring_torus_mesh.py: correctness across ring/torus/mesh.
- test_distributed_default_topology.py: full distributed path on topology.yaml.
- test_plot_latency_sweep.py / test_plot_buffer_kind_sweep.py: sweep rows.
- test_plot_topology_diagram.py / test_plot_comparison_fsim.py: plot emitters.
- test_intercube_root_center.py: moved in (ADR-0032 center-root latency guard).

Also:
- Move the FSIM comparison plot generator out of scripts/ into the sccl suite.
- Delete superseded test files (test_allreduce_multidevice,
  test_distributed_lrab_hierarchical_allreduce, test_allreduce_buffer_kind_sweep)
  and repoint conftest aggregators + the ipcq buffer-kind importers.
- Regenerate the allreduce_latency_plots derived artifacts from the full sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 22:24:43 -07:00
parent ff7d727ddd
commit b610cb0d9a
22 changed files with 745 additions and 759 deletions
@@ -59,10 +59,23 @@ class AhbmCCLBackend:
            self._sip_topo_kind = topo_map.get(self._sip_topo, 0)
        else:
            self._sip_topo_kind = 0
        sips = spec.get("system", {}).get("sips", {})
        if self._sip_topo == "ring_1d":
            self._sip_topo_w, self._sip_topo_h = 0, 0
        elif sips.get("w") is not None and sips.get("h") is not None:
            w, h = int(sips["w"]), int(sips["h"])
            if w * h != self._n_sips:
                raise ValueError(
                    f"sip layout {w}x{h} != sips.count ({self._n_sips})"
                )
            self._sip_topo_w, self._sip_topo_h = w, h
        else:
            side = int(round(math.sqrt(self._n_sips)))
            if side * side != self._n_sips:
                raise ValueError(
                    f"SIP topology '{self._sip_topo}' requires square "
                    f"sips.count or explicit sips.w/h, got {self._n_sips}"
                )
            self._sip_topo_w, self._sip_topo_h = side, side
        # IPCQ install: wire all pe0s across all cubes and SIPs