kernbench2/tests/test_cross_sip_routing.py
ywkang b8213d43a9 ADR-0019 D1/D4: per-PE HBM CTRL partitioning
Restores per-PE HBM controller partitioning that was lost in
commit 5917b34 ("Replace xbar/bridge/single-NOC with explicit
router mesh"), which had over-consolidated the per-slice HBM CTRL
into a single cube-wide ``hbm_ctrl`` connected to every router —
the opposite of what ADR-0019 D1/D4 specifies.

Builder splits ``hbm_ctrl`` into 8 ``hbm_ctrl.pe{X}`` instances per
cube, each reachable ONLY through PE_X's attaching router via the
existing ``peX.hbm`` attach metadata from cube_mesh.yaml. Cube
aggregate BW now matches the spec (8 PEs × 8 PCs × 32 GB/s =
2048 GB/s) instead of collapsing to 256 GB/s.
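
The bandwidth figures above can be checked with a few lines of arithmetic.
This is a minimal sketch: the constant names are illustrative and not taken
from the codebase, and reading 256 GB/s as the bandwidth of one 8-PC
controller is an inference from the numbers, not a statement about how the
builder computes it.

```python
# Figures from the commit message; the names are hypothetical.
PES_PER_CUBE = 8    # hbm_ctrl.pe0 .. hbm_ctrl.pe7
PCS_PER_CTRL = 8    # "PCs" behind each controller (assumed: pseudo-channels)
GBPS_PER_PC = 32    # GB/s per PC

per_ctrl_bw = PCS_PER_CTRL * GBPS_PER_PC    # one controller on its own
cube_bw = PES_PER_CUBE * per_ctrl_bw        # eight per-PE controllers

print(f"single controller: {per_ctrl_bw} GB/s, cube aggregate: {cube_bw} GB/s")
```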

AddressResolver decodes the target PE from the HBM PA's hbm_offset
(``offset // slice_size``) and returns ``hbm_ctrl.pe{X}``. PathRouter
uses the existing ``_adj_local`` adjacency for same-cube PE_DMA so
the cube's own UCIe port can no longer appear as a zero-distance
shortcut between routers — local PE_DMA now traverses the mesh,
restoring the ADR-0019 D4 worked example
``PE0.pe_dma → r0c0 → … → r1c4 → hbm_ctrl``.
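
The PA decode described above can be sketched as follows. This is a
hypothetical illustration, not the AddressResolver code: the function name,
the 1 GiB slice size, and the ``cube_prefix`` argument are assumptions. Only
the ``offset // slice_size`` rule and the ``hbm_ctrl.pe{X}`` naming come from
the message.

```python
def resolve_hbm_ctrl(hbm_offset: int, slice_size: int,
                     pes_per_cube: int = 8,
                     cube_prefix: str = "sip0.cube0") -> str:
    """Map an HBM offset to the node id of the owning per-PE controller.

    Hypothetical helper: the real resolver's signature is not shown here.
    """
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    pe = hbm_offset // slice_size          # target PE index, per the message
    if not 0 <= pe < pes_per_cube:
        raise ValueError(f"offset {hbm_offset:#x} is outside the cube's HBM")
    return f"{cube_prefix}.hbm_ctrl.pe{pe}"


SLICE = 1 << 30  # assumed 1 GiB per PE slice, for illustration only
first = resolve_hbm_ctrl(0, SLICE)
fourth = resolve_hbm_ctrl(3 * SLICE + 5, SLICE)
```

Rejecting out-of-range offsets keeps a bad address from silently resolving to
a controller that does not exist; whether the real resolver does this is not
stated in the message.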

Tests:
- New tests/test_per_pe_hbm_partition.py: 14 tests covering
  topology shape, per-PE router exclusivity, PA resolution,
  single-hop local path, cross-PE mesh traversal, and end-to-end
  latency monotonicity. Probe CLI now reports
  pe-local < pe-same-half < pe-cross-half (was uniform 141ns).
- Existing tests updated for the new node ids. Two assertions that
  locked in the wrong consolidation,
  test_noc_mesh.test_hbm_connects_to_all_routers and
  test_topology_compile.test_hbm_ctrl_connects_all_routers, are
  now per-PE exclusivity assertions, and
  test_routing.test_all_pe_hbm_equidistant becomes
  test_cross_pe_hbm_distance_increases_with_mesh_hops.
- test_ipcq_buffer_kind_locations.test_hbm_pe_hop_charged_at_large_payload
  threshold recalibrated 4000→1500 ns: the prior figure reflected
  serialization on the over-consolidated single hbm_ctrl; per-PE
  partitioning removes that artificial contention so the gap
  shrinks to the genuine PE↔HBM-hop cost.

Full suite: 645 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 01:04:30 -07:00

"""Cross-SIP PE_DMA routing tests (ADR-0023, topology v2).
Verifies that PE_DMA in one SIP can route to PE_DMA in another SIP via
the bidirectional pcie_ep ↔ fabric.switch0 path. Required for IPCQ
multi-SIP collectives.
"""
from __future__ import annotations
import pytest
from kernbench.policy.routing.router import PathRouter, RoutingError
from kernbench.topology.builder import resolve_topology
def _topo():
return resolve_topology("topology.yaml").topology_obj
# ── New edge ────────────────────────────────────────────────────────
def test_pcie_ep_to_switch_edge_exists():
"""The reverse pcie_ep → switch edge must exist for outbound traffic."""
topo = _topo()
pairs = {(e.src, e.dst) for e in topo.edges}
assert ("sip0.io0.pcie_ep", "fabric.switch0") in pairs
assert ("sip1.io0.pcie_ep", "fabric.switch0") in pairs
def test_existing_switch_to_pcie_ep_still_present():
"""Host→device path must remain intact (regression)."""
topo = _topo()
pairs = {(e.src, e.dst) for e in topo.edges}
assert ("fabric.switch0", "sip0.io0.pcie_ep") in pairs
assert ("fabric.switch0", "sip1.io0.pcie_ep") in pairs
# ── Cross-SIP path ──────────────────────────────────────────────────
def test_router_finds_cross_sip_pe_dma_path():
topo = _topo()
r = PathRouter(topo)
path = r.find_path("sip0.cube0.pe0", "sip1.cube0.pe0.pe_dma")
assert len(path) > 0
assert path[0] == "sip0.cube0.pe0.pe_dma"
assert path[-1] == "sip1.cube0.pe0.pe_dma"
assert "fabric.switch0" in path
def test_router_finds_cross_sip_far_pe_path():
"""Last cube of sip0 → first cube of sip1."""
topo = _topo()
r = PathRouter(topo)
path = r.find_path("sip0.cube15.pe7", "sip1.cube0.pe0.pe_dma")
assert "fabric.switch0" in path
# ── Regression: intra-SIP routing unchanged ─────────────────────────
def test_router_intra_sip_path_unchanged():
topo = _topo()
r = PathRouter(topo)
path = r.find_path("sip0.cube0.pe0", "sip0.cube0.pe1.pe_dma")
assert "fabric.switch0" not in path # should not detour through switch
def test_router_intra_cube_path_unchanged():
topo = _topo()
r = PathRouter(topo)
path = r.find_path("sip0.cube0.pe0", "sip0.cube0.hbm_ctrl.pe0")
assert "fabric.switch0" not in path