Convert the multidevice allreduce correctness + latency/buffer-kind sweeps
to run through the real PyTorch-distributed path
(init_process_group(backend="ahbm") -> mp.spawn -> dist.all_reduce) instead
of direct ctx.launch, and reorganize the CCL/allreduce tests into a
tests/sccl/ package split one test per file.
Production change (required for the distributed path on non-square SIP grids):
- AhbmCCLBackend now reads explicit system.sips.w/h from the spec, with a
square-only sqrt fallback that raises on ambiguity, instead of silently
guessing round(sqrt(count)). This fixes the 2x3 / 3x2 torus + mesh cases,
which previously resolved to a wrong 2x2 grid. Mirrors the test helper's
_sip_topo_dims precedence (explicit w/h > square fallback > raise).
Test reorganization (tests/sccl/):
- _allreduce_helpers.py: shared plumbing (distributed driver, config writers,
direct-launch run_allreduce parity reference, sweep/buffer-kind constants,
plot aggregators, topology-diagram + FSIM-comparison emitters).
- test_allreduce_ring_torus_mesh.py: correctness across ring/torus/mesh.
- test_distributed_default_topology.py: full distributed path on topology.yaml.
- test_plot_latency_sweep.py / test_plot_buffer_kind_sweep.py: sweep rows.
- test_plot_topology_diagram.py / test_plot_comparison_fsim.py: plot emitters.
- test_intercube_root_center.py: moved in (ADR-0032 center-root latency guard).
Also:
- Move the FSIM comparison plot generator out of scripts/ into the sccl suite.
- Delete superseded test files (test_allreduce_multidevice,
test_distributed_lrab_hierarchical_allreduce, test_allreduce_buffer_kind_sweep)
and repoint conftest aggregators + the ipcq buffer-kind importers.
- Regenerate the allreduce_latency_plots derived artifacts from the full sweep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Allreduce + pe2pe + ipcq + pe_view auto-regenerated by test sweeps
running against the new chunk-streaming wire timing (per-flit
wormhole) — absolute numbers shift upward to reflect bottleneck-link
transit charged once per flit (instead of the previous cut-through
subtraction at HBM CTRL).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the algorithmic root cube from the corner (cube_w-1,
cube_h-1) to the geometric center (cube_w//2, cube_h//2) and
have each phase converge bidirectionally so the intra-SIP
critical path drops from ~12 hops to ~8 hops on a 4×4 mesh
(left half W→E + right half E→W in row reduce; top half N→S +
bottom half S→N in col reduce; mirrored on broadcast).
Result on torus_2d 6 SIPs at 96 KB / PE on TCM:
before (corner root) : 22.0 µs
after (center root) : 17.2 µs (−22%)
Same shape on ring_1d (−7%) and mesh_2d_no_wrap (−12%); also
holds across SRAM and HBM (~−20% each).
Phase 1 test (test_intercube_root_center.py) asserts the
torus_2d 96 KB latency drops below 20.5 µs and that all 96
cubes still validate (correctness preserved).
Plot updates:
- overview.png: replace constant 10.6 µs theoretical line with
user-supplied hand-derived curve (per-cube packet count =
bytes_per_pe × 8 PEs ÷ 128 B; 1346 ns startup + 1.20 ns/pkt).
- All summary.csv numbers and per-topology PNGs regenerated.
- pe2pe_latency_plots and ipcq diagram emitter PNGs refreshed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plot output dirs now live under docs/diagrams/ (the canonical
"derived artifacts" location per CLAUDE.md):
tests/allreduce_latency_plots/ → docs/diagrams/allreduce_latency_plots/
tests/pe2pe_latency_plots/ → docs/diagrams/pe2pe_latency_plots/
+ new docs/diagrams/ipcq_diagram_plots/ with two presentation diagrams
(ipcq_send_recv.png, ipcq_two_pe_dma.png)
New test tests/test_emit_ipcq_diagram.py renders the two IPCQ
diagrams from a static description (no simulation); it exists so
the diagrams can be regenerated reproducibly.
Path references updated in tests/test_pe_to_pe_latency.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>