Files
kernbench2/tests/test_ipcq_buffer_kind_locations.py
mukesh b610cb0d9a sccl: drive allreduce tests via torch.distributed; reorganize into tests/sccl/
Convert the multidevice allreduce correctness + latency/buffer-kind sweeps
to run through the real PyTorch-distributed path
(init_process_group(backend="ahbm") -> mp.spawn -> dist.all_reduce) instead
of direct ctx.launch, and reorganize the CCL/allreduce tests into a
tests/sccl/ package split one test per file.

Production change (required for the distributed path on non-square SIP grids):
- AhbmCCLBackend now reads explicit system.sips.w/h from the spec, with a
  square-only sqrt fallback that raises on ambiguity, instead of silently
  guessing round(sqrt(count)). This fixes the 2x3 / 3x2 torus + mesh cases,
  which previously resolved to a wrong 2x2 grid. Mirrors the test helper's
  _sip_topo_dims precedence (explicit w/h > square fallback > raise).

Test reorganization (tests/sccl/):
- _allreduce_helpers.py: shared plumbing (distributed driver, config writers,
  direct-launch run_allreduce parity reference, sweep/buffer-kind constants,
  plot aggregators, topology-diagram + FSIM-comparison emitters).
- test_allreduce_ring_torus_mesh.py: correctness across ring/torus/mesh.
- test_distributed_default_topology.py: full distributed path on topology.yaml.
- test_plot_latency_sweep.py / test_plot_buffer_kind_sweep.py: sweep rows.
- test_plot_topology_diagram.py / test_plot_comparison_fsim.py: plot emitters.
- test_intercube_root_center.py: moved in (ADR-0032 center-root latency guard).

Also:
- Move the FSIM comparison plot generator out of scripts/ into the sccl suite.
- Delete superseded test files (test_allreduce_multidevice,
  test_distributed_lrab_hierarchical_allreduce, test_allreduce_buffer_kind_sweep)
  and repoint conftest aggregators + the ipcq buffer-kind importers.
- Regenerate the allreduce_latency_plots derived artifacts from the full sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 22:24:43 -07:00

215 lines
8.2 KiB
Python
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
"""Phase 1 micro-tests for IPCQ slot-memory PHYSICAL placement.
The current model in ``_BUFFER_KIND_BW`` (src/kernbench/common/ipcq_types.py)
charges only an intrinsic-memory term for IPCQ slot read/write::
TCM: nbytes/512 + 0
SRAM: nbytes/512 + 2
HBM: nbytes/256 + 6
This treats SRAM and HBM as if they were per-PE local. The topology
declares the opposite — both live on the cube NoC, behind their own
router-attached link::
topology.yaml:130 sram_to_router_bw_gbs: 128.0
topology.yaml:129 hbm_to_router_bw_gbs: 256.0
So a correct model must charge a PE→bank fabric drain for SRAM and HBM
on both ``tl.send`` (writer landing bytes into the cube SRAM/HBM bank
via PE_DMA → router → bank) and ``tl.recv`` (reader pulling bytes back
across the same link). TCM stays free of that hop because it is
genuinely per-PE local.
The three tests below run the existing torus_2d 6-SIP allreduce harness
with ``buffer_kind`` flipped between tcm/sram/hbm and assert invariants
that the post-fix model must satisfy. They EXPECT TO FAIL today because
the simulator under-charges SRAM and HBM by skipping the PE↔bank hop.
Phase 2 will edit:
- src/kernbench/components/builtin/pe_ipcq.py (_handle_recv: add
compute_drain_ns(pe→bank, nbytes) for sram/hbm)
- src/kernbench/components/builtin/pe_dma.py (_handle_ipcq_inbound:
add second-leg drain for sram/hbm-destined slots)
Tests must NEVER be weakened to make Phase 2 pass — invariants below
follow from physics (link BW × payload), so any model reflecting the
topology will satisfy them by construction.
"""
from __future__ import annotations
from pathlib import Path
import pytest
import yaml
from kernbench.runtime_api.context import RuntimeContext
from kernbench.runtime_api.types import DeviceSelector
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
from tests.sccl._allreduce_helpers import (
_write_temp_configs,
run_allreduce,
)
def _run_allreduce_with_buffer_kind(
tmp_path: Path, *, buffer_kind: str, n_elem: int,
) -> float:
"""Run one torus_2d 6-SIP allreduce with the given buffer_kind and
return critical-path pe_exec_ns (max across all PEs).
Mirrors the buffer-kind sweep harness in
tests/sccl/test_plot_buffer_kind_sweep.py so the assertions
below compare apples-to-apples against that PNG.
"""
sub = tmp_path / f"{buffer_kind}_{n_elem}"
sub.mkdir()
topo_path, ccl_path = _write_temp_configs(
sub,
sip_topology="torus_2d",
n_sips=6,
algorithm="lrab_hierarchical_allreduce",
sip_w=3, sip_h=2,
n_elem_override=n_elem,
)
with open(ccl_path) as f:
ccl_cfg = yaml.safe_load(f)
ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
ccl_cfg.setdefault("algorithms", {}).setdefault(
"lrab_hierarchical_allreduce", {},
)["buffer_kind"] = buffer_kind
with open(ccl_path, "w") as f:
yaml.dump(ccl_cfg, f, default_flow_style=False)
topo = resolve_topology(topo_path)
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id=f"loc_{buffer_kind}_{n_elem}",
spec=spec,
) as ctx:
result = run_allreduce(
ctx, engine, spec,
algorithm="lrab_hierarchical_allreduce", ccl_yaml=ccl_path,
)
assert result["ok_cubes"] > 0, "allreduce did not validate"
pe_exec_vals = [
float(tr.get("pe_exec_ns", 0.0) or 0.0)
for _, (_, tr) in engine._results.items()
if isinstance(tr, dict)
]
return max(pe_exec_vals) if pe_exec_vals else 0.0
# ── Phase 1 assertions ───────────────────────────────────────────────
def test_sram_meaningfully_slower_than_tcm_at_large_payload(tmp_path):
"""At 32 KB / PE the SRAM-backed allreduce must take meaningfully
longer than the TCM-backed one because every IPCQ slot access goes
through the 128 GB/s SRAM↔router link, while TCM stays per-PE local.
Floor justification (physics, not implementation):
Per-IPCQ-roundtrip the SRAM tier adds 2 × nbytes/128 ns over TCM
(one PE→SRAM hop on send-inbound, one SRAM→PE hop on recv).
At 32 KB: 2 × 32768/128 = 512 ns added per slot exchange.
With ≥ 10 critical-path exchanges in a 6-SIP torus_2d allreduce
this is ≥ 5_120 ns. The threshold below is half that to leave
room for differing critical-path counting.
Pre-Phase-2: gap is constant 48 ns (just the SRAM overhead × 24
slot accesses); test FAILS.
Post-Phase-2: gap scales with payload; test PASSES.
"""
n_elem = 16384 # 32 KB / PE
lat_tcm = _run_allreduce_with_buffer_kind(
tmp_path, buffer_kind="tcm", n_elem=n_elem,
)
lat_sram = _run_allreduce_with_buffer_kind(
tmp_path, buffer_kind="sram", n_elem=n_elem,
)
delta = lat_sram - lat_tcm
THRESHOLD_NS = 2_500.0
assert delta > THRESHOLD_NS, (
f"SRAM should be ≥ {THRESHOLD_NS:.0f} ns slower than TCM at 32 KB "
f"because each IPCQ access pays a 128 GB/s PE↔SRAM hop. "
f"got tcm={lat_tcm:.1f} sram={lat_sram:.1f} delta={delta:.1f} ns"
)
def test_sram_tcm_gap_scales_with_payload(tmp_path):
"""The SRAM-vs-TCM gap must grow roughly linearly with payload size.
Pre-Phase-2: the only difference between TCM and SRAM is the SRAM
per-access ``overhead_ns = 2``, which does NOT scale with payload —
so the gap is the same constant 48 ns at 8 KB and at 32 KB. Ratio = 1.
Post-Phase-2: the dominant term is 2 × nbytes/128 (PE↔SRAM hop on
write+read) which IS linear in payload. Going 8 KB → 32 KB (4×)
should produce a gap roughly 4× larger.
Threshold below is 3× to keep slack for fixed-overhead effects.
"""
lat_tcm_small = _run_allreduce_with_buffer_kind(
tmp_path, buffer_kind="tcm", n_elem=4096, # 8 KB
)
lat_sram_small = _run_allreduce_with_buffer_kind(
tmp_path, buffer_kind="sram", n_elem=4096,
)
lat_tcm_large = _run_allreduce_with_buffer_kind(
tmp_path, buffer_kind="tcm", n_elem=16384, # 32 KB
)
lat_sram_large = _run_allreduce_with_buffer_kind(
tmp_path, buffer_kind="sram", n_elem=16384,
)
gap_small = lat_sram_small - lat_tcm_small
gap_large = lat_sram_large - lat_tcm_large
assert gap_small > 0, (
f"sanity: SRAM should never be FASTER than TCM, "
f"got gap_small={gap_small:.1f} ns"
)
assert gap_large > 3.0 * gap_small, (
f"4× payload should produce ≥3× SRAM/TCM gap (linear in nbytes "
f"because of the 128 GB/s PE↔SRAM hop). "
f"got gap_small={gap_small:.1f} (8KB), gap_large={gap_large:.1f} "
f"(32KB), ratio={gap_large / max(gap_small, 1e-9):.2f}"
)
def test_hbm_pe_hop_charged_at_large_payload(tmp_path):
"""At 32 KB / PE the HBM-vs-TCM gap must exceed the gap that comes
purely from HBM's 256 GB/s intrinsic slot-IO disadvantage.
Pre-Phase-2 the entire HBM/TCM gap is just the slot-IO term
(24 × (nbytes/512 + 6) ≈ 1_700 ns at 32 KB). Post-fix adds another
chunk of latency from the PE↔HBM hop on send and recv, so the
total HBM/TCM gap should clearly clear the threshold below.
Under ADR-0017 D4 per-PE HBM CTRL, each PE's slice runs on its own
controller with no cross-PE contention, so the IPCQ pattern (each
PE writes its own slice) yields a gap of ≈ 1.7 µs — well above the
bare slot-IO term, confirming the PE↔HBM hop is being charged.
"""
n_elem = 16384 # 32 KB / PE
lat_tcm = _run_allreduce_with_buffer_kind(
tmp_path, buffer_kind="tcm", n_elem=n_elem,
)
lat_hbm = _run_allreduce_with_buffer_kind(
tmp_path, buffer_kind="hbm", n_elem=n_elem,
)
delta = lat_hbm - lat_tcm
THRESHOLD_NS = 1_500.0
assert delta > THRESHOLD_NS, (
f"HBM should be ≥ {THRESHOLD_NS:.0f} ns slower than TCM at 32 KB "
f"once the 256 GB/s PE↔HBM hop is charged on each IPCQ access. "
f"got tcm={lat_tcm:.1f} hbm={lat_hbm:.1f} delta={delta:.1f} ns"
)