Files
kernbench2/tests/test_ipcq_buffer_kind_locations.py
T
ywkang b8213d43a9 ADR-0019 D1/D4: per-PE HBM CTRL partitioning
Restores per-PE HBM controller partitioning that was lost in
commit 5917b34 ("Replace xbar/bridge/single-NOC with explicit
router mesh"), which had over-consolidated the per-slice HBM CTRL
into a single cube-wide ``hbm_ctrl`` connected to every router —
the opposite of what ADR-0019 D1/D4 specifies.

Builder splits ``hbm_ctrl`` into 8 ``hbm_ctrl.pe{X}`` instances per
cube, each reachable ONLY through PE_X's attaching router via the
existing ``peX.hbm`` attach metadata from cube_mesh.yaml. Cube
aggregate BW now matches the spec (8 PEs × 8 PCs × 32 GB/s =
2048 GB/s) instead of collapsing to 256 GB/s.

AddressResolver decodes the target PE from the HBM PA's hbm_offset
(``offset // slice_size``) and returns ``hbm_ctrl.pe{X}``. PathRouter
uses the existing ``_adj_local`` adjacency for same-cube PE_DMA so
the cube's own UCIe port can no longer appear as a zero-distance
shortcut between routers — local PE_DMA now traverses the mesh,
restoring the ADR-0019 D4 worked example
``PE0.pe_dma → r0c0 → … → r1c4 → hbm_ctrl``.

Tests:
- New tests/test_per_pe_hbm_partition.py: 14 tests covering
  topology shape, per-PE router exclusivity, PA resolution,
  single-hop local path, cross-PE mesh traversal, and end-to-end
  latency monotonicity. Probe CLI now reports
  pe-local < pe-same-half < pe-cross-half (was uniform 141ns).
- Existing tests updated for new node ids and replaced two
  assertions that locked in the wrong consolidation:
  test_noc_mesh.test_hbm_connects_to_all_routers and
  test_topology_compile.test_hbm_ctrl_connects_all_routers are
  now per-PE exclusivity assertions; test_routing
  .test_all_pe_hbm_equidistant becomes
  test_cross_pe_hbm_distance_increases_with_mesh_hops.
- test_ipcq_buffer_kind_locations.test_hbm_pe_hop_charged_at_large_payload
  threshold recalibrated 4000→1500 ns: the prior figure reflected
  serialization on the over-consolidated single hbm_ctrl; per-PE
  partitioning removes that artificial contention so the gap
  shrinks to the genuine PE↔HBM-hop cost.

Full suite: 645 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 01:04:30 -07:00

217 lines
8.4 KiB
Python
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
"""Phase 1 micro-tests for IPCQ slot-memory PHYSICAL placement.
The current model in ``_BUFFER_KIND_BW`` (src/kernbench/common/ipcq_types.py)
charges only an intrinsic-memory term for IPCQ slot read/write::
TCM: nbytes/512 + 0
SRAM: nbytes/512 + 2
HBM: nbytes/256 + 6
This treats SRAM and HBM as if they were per-PE local. The topology
declares the opposite — both live on the cube NoC, behind their own
router-attached link::
topology.yaml:130 sram_to_router_bw_gbs: 128.0
topology.yaml:129 hbm_to_router_bw_gbs: 256.0
So a correct model must charge a PE→bank fabric drain for SRAM and HBM
on both ``tl.send`` (writer landing bytes into the cube SRAM/HBM bank
via PE_DMA → router → bank) and ``tl.recv`` (reader pulling bytes back
across the same link). TCM stays free of that hop because it is
genuinely per-PE local.
The three tests below run the existing torus_2d 6-SIP allreduce harness
with ``buffer_kind`` flipped between tcm/sram/hbm and assert invariants
that the post-fix model must satisfy. They EXPECT TO FAIL today because
the simulator under-charges SRAM and HBM by skipping the PE↔bank hop.
Phase 2 will edit:
- src/kernbench/components/builtin/pe_ipcq.py (_handle_recv: add
compute_drain_ns(pe→bank, nbytes) for sram/hbm)
- src/kernbench/components/builtin/pe_dma.py (_handle_ipcq_inbound:
add second-leg drain for sram/hbm-destined slots)
Tests must NEVER be weakened to make Phase 2 pass — invariants below
follow from physics (link BW × payload), so any model reflecting the
topology will satisfy them by construction.
"""
from __future__ import annotations
from pathlib import Path
import pytest
import yaml
from kernbench.runtime_api.context import RuntimeContext
from kernbench.runtime_api.types import DeviceSelector
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
from tests.test_allreduce_multidevice import (
_write_temp_configs,
run_allreduce,
)
def _run_allreduce_with_buffer_kind(
tmp_path: Path, *, buffer_kind: str, n_elem: int,
) -> float:
"""Run one torus_2d 6-SIP allreduce with the given buffer_kind and
return critical-path pe_exec_ns (max across all PEs).
Mirrors the sweep harness in test_allreduce_buffer_kind_sweep.py
so the assertions below compare apples-to-apples against that PNG.
"""
sub = tmp_path / f"{buffer_kind}_{n_elem}"
sub.mkdir()
topo_path, ccl_path = _write_temp_configs(
sub,
sip_topology="torus_2d",
n_sips=6,
algorithm="intercube_allreduce",
sip_w=3, sip_h=2,
n_elem_override=n_elem,
)
with open(ccl_path) as f:
ccl_cfg = yaml.safe_load(f)
ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
ccl_cfg.setdefault("algorithms", {}).setdefault(
"intercube_allreduce", {},
)["buffer_kind"] = buffer_kind
with open(ccl_path, "w") as f:
yaml.dump(ccl_cfg, f, default_flow_style=False)
topo = resolve_topology(topo_path)
engine = GraphEngine(topo.topology_obj, enable_data=True)
spec = topo.topology_obj.spec
with RuntimeContext(
engine=engine,
target_device=DeviceSelector("all"),
correlation_id=f"loc_{buffer_kind}_{n_elem}",
spec=spec,
) as ctx:
result = run_allreduce(
ctx, engine, spec,
algorithm="intercube_allreduce", ccl_yaml=ccl_path,
)
assert result["ok_cubes"] > 0, "allreduce did not validate"
pe_exec_vals = [
float(tr.get("pe_exec_ns", 0.0) or 0.0)
for _, (_, tr) in engine._results.items()
if isinstance(tr, dict)
]
return max(pe_exec_vals) if pe_exec_vals else 0.0
# ── Phase 1 assertions ───────────────────────────────────────────────
def test_sram_meaningfully_slower_than_tcm_at_large_payload(tmp_path):
"""At 32 KB / PE the SRAM-backed allreduce must take meaningfully
longer than the TCM-backed one because every IPCQ slot access goes
through the 128 GB/s SRAM↔router link, while TCM stays per-PE local.
Floor justification (physics, not implementation):
Per-IPCQ-roundtrip the SRAM tier adds 2 × nbytes/128 ns over TCM
(one PE→SRAM hop on send-inbound, one SRAM→PE hop on recv).
At 32 KB: 2 × 32768/128 = 512 ns added per slot exchange.
With ≥ 10 critical-path exchanges in a 6-SIP torus_2d allreduce
this is ≥ 5_120 ns. The threshold below is half that to leave
room for differing critical-path counting.
Pre-Phase-2: gap is constant 48 ns (just the SRAM overhead × 24
slot accesses); test FAILS.
Post-Phase-2: gap scales with payload; test PASSES.
"""
n_elem = 16384 # 32 KB / PE
lat_tcm = _run_allreduce_with_buffer_kind(
tmp_path, buffer_kind="tcm", n_elem=n_elem,
)
lat_sram = _run_allreduce_with_buffer_kind(
tmp_path, buffer_kind="sram", n_elem=n_elem,
)
delta = lat_sram - lat_tcm
THRESHOLD_NS = 2_500.0
assert delta > THRESHOLD_NS, (
f"SRAM should be ≥ {THRESHOLD_NS:.0f} ns slower than TCM at 32 KB "
f"because each IPCQ access pays a 128 GB/s PE↔SRAM hop. "
f"got tcm={lat_tcm:.1f} sram={lat_sram:.1f} delta={delta:.1f} ns"
)
def test_sram_tcm_gap_scales_with_payload(tmp_path):
"""The SRAM-vs-TCM gap must grow roughly linearly with payload size.
Pre-Phase-2: the only difference between TCM and SRAM is the SRAM
per-access ``overhead_ns = 2``, which does NOT scale with payload —
so the gap is the same constant 48 ns at 8 KB and at 32 KB. Ratio = 1.
Post-Phase-2: the dominant term is 2 × nbytes/128 (PE↔SRAM hop on
write+read) which IS linear in payload. Going 8 KB → 32 KB (4×)
should produce a gap roughly 4× larger.
Threshold below is 3× to keep slack for fixed-overhead effects.
"""
lat_tcm_small = _run_allreduce_with_buffer_kind(
tmp_path, buffer_kind="tcm", n_elem=4096, # 8 KB
)
lat_sram_small = _run_allreduce_with_buffer_kind(
tmp_path, buffer_kind="sram", n_elem=4096,
)
lat_tcm_large = _run_allreduce_with_buffer_kind(
tmp_path, buffer_kind="tcm", n_elem=16384, # 32 KB
)
lat_sram_large = _run_allreduce_with_buffer_kind(
tmp_path, buffer_kind="sram", n_elem=16384,
)
gap_small = lat_sram_small - lat_tcm_small
gap_large = lat_sram_large - lat_tcm_large
assert gap_small > 0, (
f"sanity: SRAM should never be FASTER than TCM, "
f"got gap_small={gap_small:.1f} ns"
)
assert gap_large > 3.0 * gap_small, (
f"4× payload should produce ≥3× SRAM/TCM gap (linear in nbytes "
f"because of the 128 GB/s PE↔SRAM hop). "
f"got gap_small={gap_small:.1f} (8KB), gap_large={gap_large:.1f} "
f"(32KB), ratio={gap_large / max(gap_small, 1e-9):.2f}"
)
def test_hbm_pe_hop_charged_at_large_payload(tmp_path):
"""At 32 KB / PE the HBM-vs-TCM gap must exceed the gap that comes
purely from HBM's 256 GB/s intrinsic slot-IO disadvantage.
Pre-Phase-2 the entire HBM/TCM gap is just the slot-IO term
(24 × (nbytes/512 + 6) ≈ 1_700 ns at 32 KB). Post-fix adds another
chunk of latency from the PE↔HBM hop on send and recv, so the
total HBM/TCM gap should clearly clear the threshold below.
Threshold history: the gap was 4 µs under the over-consolidated
single-hbm_ctrl model (commit 5917b34), inflated by serialization
on the shared HBM controller. With ADR-0019 D1 per-PE HBM CTRL
restored, each PE's slice runs on its own controller with no
cross-PE contention, so the IPCQ pattern (each PE writes its own
slice) drops the gap to ≈ 1.7 µs — still well above the bare
slot-IO term, confirming the PE↔HBM hop is being charged.
"""
n_elem = 16384 # 32 KB / PE
lat_tcm = _run_allreduce_with_buffer_kind(
tmp_path, buffer_kind="tcm", n_elem=n_elem,
)
lat_hbm = _run_allreduce_with_buffer_kind(
tmp_path, buffer_kind="hbm", n_elem=n_elem,
)
delta = lat_hbm - lat_tcm
THRESHOLD_NS = 1_500.0
assert delta > THRESHOLD_NS, (
f"HBM should be ≥ {THRESHOLD_NS:.0f} ns slower than TCM at 32 KB "
f"once the 256 GB/s PE↔HBM hop is charged on each IPCQ access. "
f"got tcm={lat_tcm:.1f} hbm={lat_hbm:.1f} delta={delta:.1f} ns"
)