ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots

Cube SRAM and HBM live on the cube NoC behind router-attached links (sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256). Previously the slot-IO model treated them as if they were per-PE local, so the buffer_kind sweep showed TCM ≈ SRAM at 64 KB / PE. pe_ipcq._handle_recv and pe_dma._handle_ipcq_inbound now charge a PE→bank compute_drain_ns on top of the intrinsic slot-IO for SRAM/HBM. TCM stays free of this hop. Adds an internal IpcqRecvCmd.consume field that gates the recv-side hop+slot-IO charges (used by a follow-up diagnostic API; default True keeps current behavior). Post-fix at 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs. SRAM is slowest because its 128 GB/s bank link is the narrowest in the system — narrower than HBM's 256 GB/s. The existing ordering test is rewritten from tcm<sram<hbm to tcm<hbm<sram and a new test_ipcq_buffer_kind_locations adds 3 invariants on the gap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:20:28 -07:00
parent 533e699299
commit 9c129d6131
7 changed files with 317 additions and 44 deletions
@@ -0,0 +1,208 @@
+"""Phase 1 micro-tests for IPCQ slot-memory PHYSICAL placement.
+
+The current model in ``_BUFFER_KIND_BW`` (src/kernbench/common/ipcq_types.py)
+charges only an intrinsic-memory term for IPCQ slot read/write::
+
+    TCM:  nbytes/512 + 0
+    SRAM: nbytes/512 + 2
+    HBM:  nbytes/256 + 6
+
+This treats SRAM and HBM as if they were per-PE local. The topology
+declares the opposite — both live on the cube NoC, behind their own
+router-attached link::
+
+    topology.yaml:130   sram_to_router_bw_gbs: 128.0
+    topology.yaml:129   hbm_to_router_bw_gbs:  256.0
+
+So a correct model must charge a PE→bank fabric drain for SRAM and HBM
+on both ``tl.send`` (writer landing bytes into the cube SRAM/HBM bank
+via PE_DMA → router → bank) and ``tl.recv`` (reader pulling bytes back
+across the same link). TCM stays free of that hop because it is
+genuinely per-PE local.
+
+The three tests below run the existing torus_2d 6-SIP allreduce harness
+with ``buffer_kind`` flipped between tcm/sram/hbm and assert invariants
+that the post-fix model must satisfy. They EXPECT TO FAIL today because
+the simulator under-charges SRAM and HBM by skipping the PE↔bank hop.
+
+Phase 2 will edit:
+  - src/kernbench/components/builtin/pe_ipcq.py   (_handle_recv: add
+        compute_drain_ns(pe→bank, nbytes) for sram/hbm)
+  - src/kernbench/components/builtin/pe_dma.py    (_handle_ipcq_inbound:
+        add second-leg drain for sram/hbm-destined slots)
+
+Tests must NEVER be weakened to make Phase 2 pass — invariants below
+follow from physics (link BW × payload), so any model reflecting the
+topology will satisfy them by construction.
+"""
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+import yaml
+
+from kernbench.runtime_api.context import RuntimeContext
+from kernbench.runtime_api.types import DeviceSelector
+from kernbench.sim_engine.engine import GraphEngine
+from kernbench.topology.builder import resolve_topology
+
+from tests.test_allreduce_multidevice import (
+    _write_temp_configs,
+    run_allreduce,
+)
+
+
+def _run_allreduce_with_buffer_kind(
+    tmp_path: Path, *, buffer_kind: str, n_elem: int,
+) -> float:
+    """Run one torus_2d 6-SIP allreduce with the given buffer_kind and
+    return critical-path pe_exec_ns (max across all PEs).
+
+    Mirrors the sweep harness in test_allreduce_buffer_kind_sweep.py
+    so the assertions below compare apples-to-apples against that PNG.
+    """
+    sub = tmp_path / f"{buffer_kind}_{n_elem}"
+    sub.mkdir()
+    topo_path, ccl_path = _write_temp_configs(
+        sub,
+        sip_topology="torus_2d",
+        n_sips=6,
+        algorithm="intercube_allreduce",
+        sip_w=3, sip_h=2,
+        n_elem_override=n_elem,
+    )
+
+    with open(ccl_path) as f:
+        ccl_cfg = yaml.safe_load(f)
+    ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
+    ccl_cfg.setdefault("algorithms", {}).setdefault(
+        "intercube_allreduce", {},
+    )["buffer_kind"] = buffer_kind
+    with open(ccl_path, "w") as f:
+        yaml.dump(ccl_cfg, f, default_flow_style=False)
+
+    topo = resolve_topology(topo_path)
+    engine = GraphEngine(topo.topology_obj, enable_data=True)
+    spec = topo.topology_obj.spec
+
+    with RuntimeContext(
+        engine=engine,
+        target_device=DeviceSelector("all"),
+        correlation_id=f"loc_{buffer_kind}_{n_elem}",
+        spec=spec,
+    ) as ctx:
+        result = run_allreduce(
+            ctx, engine, spec,
+            algorithm="intercube_allreduce", ccl_yaml=ccl_path,
+        )
+        assert result["ok_cubes"] > 0, "allreduce did not validate"
+
+    pe_exec_vals = [
+        float(tr.get("pe_exec_ns", 0.0) or 0.0)
+        for _, (_, tr) in engine._results.items()
+        if isinstance(tr, dict)
+    ]
+    return max(pe_exec_vals) if pe_exec_vals else 0.0
+
+
+# ── Phase 1 assertions ───────────────────────────────────────────────
+
+
+def test_sram_meaningfully_slower_than_tcm_at_large_payload(tmp_path):
+    """At 32 KB / PE the SRAM-backed allreduce must take meaningfully
+    longer than the TCM-backed one because every IPCQ slot access goes
+    through the 128 GB/s SRAM↔router link, while TCM stays per-PE local.
+
+    Floor justification (physics, not implementation):
+        Per-IPCQ-roundtrip the SRAM tier adds 2 × nbytes/128 ns over TCM
+        (one PE→SRAM hop on send-inbound, one SRAM→PE hop on recv).
+        At 32 KB:  2 × 32768/128 = 512 ns added per slot exchange.
+        With ≥ 10 critical-path exchanges in a 6-SIP torus_2d allreduce
+        this is ≥ 5_120 ns. The threshold below is half that to leave
+        room for differing critical-path counting.
+
+    Pre-Phase-2: gap is constant 48 ns (just the SRAM overhead × 24
+    slot accesses); test FAILS.
+    Post-Phase-2: gap scales with payload; test PASSES.
+    """
+    n_elem = 16384  # 32 KB / PE
+    lat_tcm = _run_allreduce_with_buffer_kind(
+        tmp_path, buffer_kind="tcm", n_elem=n_elem,
+    )
+    lat_sram = _run_allreduce_with_buffer_kind(
+        tmp_path, buffer_kind="sram", n_elem=n_elem,
+    )
+    delta = lat_sram - lat_tcm
+    THRESHOLD_NS = 2_500.0
+    assert delta > THRESHOLD_NS, (
+        f"SRAM should be ≥ {THRESHOLD_NS:.0f} ns slower than TCM at 32 KB "
+        f"because each IPCQ access pays a 128 GB/s PE↔SRAM hop. "
+        f"got tcm={lat_tcm:.1f} sram={lat_sram:.1f} delta={delta:.1f} ns"
+    )
+
+
+def test_sram_tcm_gap_scales_with_payload(tmp_path):
+    """The SRAM-vs-TCM gap must grow roughly linearly with payload size.
+
+    Pre-Phase-2: the only difference between TCM and SRAM is the SRAM
+    per-access ``overhead_ns = 2``, which does NOT scale with payload —
+    so the gap is the same constant 48 ns at 8 KB and at 32 KB. Ratio = 1.
+
+    Post-Phase-2: the dominant term is 2 × nbytes/128 (PE↔SRAM hop on
+    write+read) which IS linear in payload. Going 8 KB → 32 KB (4×)
+    should produce a gap roughly 4× larger.
+
+    Threshold below is 3× to keep slack for fixed-overhead effects.
+    """
+    lat_tcm_small = _run_allreduce_with_buffer_kind(
+        tmp_path, buffer_kind="tcm", n_elem=4096,    # 8 KB
+    )
+    lat_sram_small = _run_allreduce_with_buffer_kind(
+        tmp_path, buffer_kind="sram", n_elem=4096,
+    )
+    lat_tcm_large = _run_allreduce_with_buffer_kind(
+        tmp_path, buffer_kind="tcm", n_elem=16384,   # 32 KB
+    )
+    lat_sram_large = _run_allreduce_with_buffer_kind(
+        tmp_path, buffer_kind="sram", n_elem=16384,
+    )
+
+    gap_small = lat_sram_small - lat_tcm_small
+    gap_large = lat_sram_large - lat_tcm_large
+
+    assert gap_small > 0, (
+        f"sanity: SRAM should never be FASTER than TCM, "
+        f"got gap_small={gap_small:.1f} ns"
+    )
+    assert gap_large > 3.0 * gap_small, (
+        f"4× payload should produce ≥3× SRAM/TCM gap (linear in nbytes "
+        f"because of the 128 GB/s PE↔SRAM hop). "
+        f"got gap_small={gap_small:.1f} (8KB), gap_large={gap_large:.1f} "
+        f"(32KB), ratio={gap_large / max(gap_small, 1e-9):.2f}"
+    )
+
+
+def test_hbm_pe_hop_charged_at_large_payload(tmp_path):
+    """At 32 KB / PE the HBM-vs-TCM gap must exceed the gap that comes
+    purely from HBM's 256 GB/s intrinsic slot-IO disadvantage.
+
+    Pre-Phase-2 the entire HBM/TCM gap is just the slot-IO term
+    (24 × (nbytes/512 + 6) ≈ 1_700 ns at 32 KB). Post-fix adds another
+    24 × (nbytes/256) × 2 ≈ 6_144 ns from the PE↔HBM hop on send and
+    recv, so the total HBM/TCM gap should clearly clear 4 µs.
+    """
+    n_elem = 16384  # 32 KB / PE
+    lat_tcm = _run_allreduce_with_buffer_kind(
+        tmp_path, buffer_kind="tcm", n_elem=n_elem,
+    )
+    lat_hbm = _run_allreduce_with_buffer_kind(
+        tmp_path, buffer_kind="hbm", n_elem=n_elem,
+    )
+    delta = lat_hbm - lat_tcm
+    THRESHOLD_NS = 4_000.0
+    assert delta > THRESHOLD_NS, (
+        f"HBM should be ≥ {THRESHOLD_NS:.0f} ns slower than TCM at 32 KB "
+        f"once the 256 GB/s PE↔HBM hop is charged on each IPCQ access. "
+        f"got tcm={lat_tcm:.1f} hbm={lat_hbm:.1f} delta={delta:.1f} ns"
+    )