Rectangular SIP topology + 6-device allreduce sweep

mesh_2d, torus_2d, and mesh_2d_no_wrap accept optional w,h kwargs; sqrt
fall-back preserved for square layouts (back-compat tests confirm 4-SIP
and 9-SIP square configs still work).

sfr_config reads system.sips.w/h from spec and threads dims through to
the topology fn.

test_allreduce_multidevice CONFIGS switched from 4 SIPs (square) to 6
SIPs: ring_1d_6sip, torus_2d_6sip_2x3, mesh_2d_no_wrap_6sip_2x3.
_write_temp_configs writes system.sips.w/h when supplied;
_sip_topo_dims reads them back. Latency sweep loop also moved to 6-SIP
layouts.

Linear-scale plot variants dropped -- only log-scale *.png + summary.csv
emitted. Plots in tests/allreduce_latency_plots regenerated.

New tests/test_sip_topology_rectangular.py asserts neighbor correctness
for 2x3 layouts and back-compat for square fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-0009 D5: chain-aware target_start_ns + zero-byte launch fanout
2026-04-27 15:13:14 -07:00 · 2026-04-27 15:12:58 -07:00 · 2026-04-27 15:12:38 -07:00 · 2026-04-27 10:16:29 -07:00 · 2026-04-23 15:30:29 -07:00 · 2026-04-22 21:04:31 -07:00
35 changed files with 2520 additions and 153 deletions
@@ -67,6 +67,76 @@ Completion semantics:
 ---
 ### D5. Launch timing is endpoint-synchronized
 All PEs targeted by a single kernel launch MUST begin executing the kernel
 body at the same simulated time, regardless of their dispatch path length
 from the launch entry point.
 Rationale. The dispatch tree Host → IO_CPU → M_CPU → PE_CPU has variable
 latency at every level. PEs near their M_CPU receive the launch earlier
 than PEs farther away; cubes near an IO_CPU receive it earlier than cubes
 farther away. Without synchronization, each PE's kernel begins at a
 different `env.now`, making per-PE metrics such as `pe_exec_ns` a function
 of dispatch-path geometry rather than of the kernel's behavior —
 producing measurement artifacts in benchmarks that time kernel-internal
 waits (for example `tl.recv` on cross-cube or cross-SIP hops).
 Mechanism.
 - `KernelLaunchMsg` carries an optional `target_start_ns: float | None`.
 - **IO_CPU** is the canonical stamper. On fan-out to M_CPUs, it
  computes `target_start_ns = env.now + max_latency` where
  `max_latency` is the maximum, over every target (sip, cube, pe)
  tuple, of the **two-leg dispatch chain**:
  ```
  max_latency(sip, cube, pe) =
      compute_path_latency_ns(find_node_path(io_cpu, m_cpu(sip, cube)))
    + compute_path_latency_ns(find_node_path(m_cpu(sip, cube), pe_cpu))
    - io_cpu.overhead_ns
    - m_cpu.overhead_ns
  ```
  This models the actual dispatch as **two sequential Transactions**
  (IO_CPU → M_CPU, then M_CPU → PE_CPU). Each leg's
  `compute_path_latency_ns` adds its endpoints' `overhead_ns`;
  `io_cpu.overhead_ns` is subtracted because IO_CPU has already
  paid it before this method runs, and `m_cpu.overhead_ns` is
  subtracted once because it appears as endpoint of leg1 *and*
  start of leg2 but is paid only once at run time. A single
  `find_node_path(io_cpu, pe_cpu)` walk is **not** equivalent —
  it can pick a graph path that bypasses M_CPU and silently
  under-shoots the prediction for far cubes, breaking the D5
  invariant.
  The fanned-out sub-Transactions carry **`nbytes = 0`** for
  `KernelLaunchMsg` (control message only). Without this,
  large kernel-launch payloads would occupy fabric BW on the
  shared first hop and serialize the per-cube dispatch, pushing
  far M_CPUs past `target_start_ns` and re-introducing the
  late-arrival violation.
 - **M_CPU** passes an already-stamped `target_start_ns` through
  unchanged. Only when the value is absent (e.g. a direct
  launch-to-M_CPU unit test) does M_CPU compute a per-cube barrier
  `env.now + max(local command-path latency)`.
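 - **PE_CPU** yields `env.timeout(target_start_ns - env.now)` at the top
  of `_execute_kernel`, before recording `pe_exec_start` and invoking
  the kernel body. The wait is skipped when `target_start_ns <= env.now`,
  so a PE that receives the launch late starts immediately.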
 - When `target_start_ns is None`, PE_CPU falls through to the legacy
  unsynchronized behavior — preserving backward compatibility.
 IO_CPU-level stamping guarantees every PE across every targeted cube
 uses the same barrier sim-time, eliminating both the within-cube
 dispatch-offset artifact *and* the cross-cube offset artifact in
 multi-cube launches. This models a real-hardware timed-broadcast launch
 (a latency-equalized dispatch tree).
 The synchronization is internal to the engine / IO_CPU / M_CPU / PE_CPU
 control plane — runtime API and application kernels are unchanged.
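The two-leg formula in the Mechanism list can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulator's code: `path_latency_ns`, `chain_latency_ns`, the node names, and every number are hypothetical, and edge propagation is folded into one per-leg `wire_ns` value. Only the subtraction pattern follows the text above.

```python
# Hypothetical per-node overheads in ns; names and values are illustrative only.
OVERHEAD_NS = {"io_cpu": 10.0, "m_cpu": 5.0, "noc": 2.0, "pe_cpu": 3.0}


def path_latency_ns(path, wire_ns):
    """Stand-in for compute_path_latency_ns: every node on the path,
    endpoints included, contributes its overhead, plus wire time."""
    return sum(OVERHEAD_NS[n] for n in path) + wire_ns


def chain_latency_ns(leg1, leg1_wire_ns, leg2, leg2_wire_ns):
    """Two-leg IO_CPU -> M_CPU -> PE_CPU prediction as described in D5."""
    total = path_latency_ns(leg1, leg1_wire_ns) + path_latency_ns(leg2, leg2_wire_ns)
    total -= OVERHEAD_NS[leg1[0]]   # IO_CPU overhead was paid before stamping
    total -= OVERHEAD_NS[leg1[-1]]  # M_CPU ends leg 1 and starts leg 2; count it once
    return total


# One near cube and one far cube (the far one has an extra hop on leg 1).
near = chain_latency_ns(["io_cpu", "m_cpu"], 4.0, ["m_cpu", "noc", "pe_cpu"], 1.0)
far = chain_latency_ns(["io_cpu", "noc", "m_cpu"], 9.0, ["m_cpu", "noc", "pe_cpu"], 1.0)

now_ns = 100.0
target_start_ns = now_ns + max(near, far)  # single barrier shared by every PE
```

With these made-up numbers the far cube's chain dominates, so every targeted PE would wait until the same `target_start_ns`.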
 ---
 ## Links
 - SPEC R1, R2, R7, R8
@@ -372,24 +372,41 @@ When the receiver frees a slot, the sender must learn about it
 travel through general vc_comm fabric — it uses a **separate fast
 path**, an abstraction of the NVLink / UCIe credit-return wire.
-**Latency** is computed from the **bottleneck BW on the path**, not a
-magic constant:
+**Latency** is computed from the **full path latency** (per-node
+overhead + edge propagation + drain), not a magic constant:
 ```
 credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
-path = router.find_path(self_pe, peer_pe)
-latency = compute_drain_ns(path, credit_size_bytes)
-        = credit_size_bytes / bottleneck_bw_on_path
+path = router.find_path(self_pe, peer_pe.pe_dma)
+latency = compute_path_latency_ns(path, credit_size_bytes)
+        = sum(edge.distance_mm * ns_per_mm)
+        + sum(node_overhead_ns[n] for n in path)
+        + credit_size_bytes / bottleneck_bw_on_path
 ```
 The router auto-appends `.pe_dma` to the source only, so the
 destination MUST be spelled with the explicit `.pe_dma` suffix or
 `find_path` raises and the credit silently teleports at zero cost
 (latent bug fixed alongside this update).
 `tl.recv` blocks on the credit-emit completion (recv yields-from
 `_delayed_credit_send` rather than spawning it as a fork). This puts
 the credit-return cost on the receiver's `pe_exec_ns`, modeling the
 IPCQ control-plane completing the consume-acknowledgement before
 recv returns to the kernel — the protocol equivalent of a non-posted
 `tl.store` waiting for an HBM ack on the raw DMA path.
 That gives us:
 - **Topology-proportional approximation**: an in-cube credit return is
  automatically faster than a cross-SIP credit return.
-- **No magic constants**: no arbitrary `ipcq_ctrl_latency_ns`.
+- **No magic constants**: every nanosecond comes from
+  `compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
+  as data traffic.
 - **No deadlock risk**: unlike piggyback, B can issue credit even when
-  it has no data to send back.
+  it has no data to send back. `peer_credit_store.put` is unbounded.
-- **Reuses existing utility**: `ComponentContext.compute_drain_ns`.
+- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
  cost on recv balances the HBM ack-trip cost RAW pays on the sender.
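As a worked instance of the credit-latency formula above, the following sketch adds the three terms for an in-cube path and a cross-SIP path. The `Edge` type, the `credit_latency_ns` helper, the `ns_per_mm` default, and all distances, bandwidths, and overheads are assumptions chosen for easy arithmetic; only `credit_size_bytes = 16` comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    distance_mm: float
    bw_bytes_per_ns: float


def credit_latency_ns(edges, node_overheads_ns, credit_size_bytes=16, ns_per_mm=0.5):
    """Propagation + per-node overhead + drain at the bottleneck bandwidth."""
    propagation = sum(e.distance_mm * ns_per_mm for e in edges)
    overhead = sum(node_overheads_ns)
    drain = credit_size_bytes / min(e.bw_bytes_per_ns for e in edges)
    return propagation + overhead + drain


# In-cube: one short, fast edge between two nodes.
in_cube = credit_latency_ns([Edge(2.0, 64.0)], [1.0, 1.0])

# Cross-SIP: a long, slower link in the middle of the path.
cross_sip = credit_latency_ns(
    [Edge(2.0, 64.0), Edge(40.0, 32.0), Edge(2.0, 64.0)],
    [1.0, 4.0, 4.0, 1.0],
)
```

For a 16-byte credit the drain term is small, so under these assumed values the gap between the two paths comes almost entirely from propagation and node overhead, which is the topology-proportional behavior the list describes.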
 #### Component coupling — SimPy Store channel
@@ -420,11 +437,21 @@ fan-out (see `IpcqInitMsg` in D12).
 #### PE_DMA's added responsibility
 When `vc_comm` receives a token, PE_DMA processes it as the following
-**atomic** sequence. **No SimPy yield is allowed between the two steps**
-(invariant I6):
+sequence: pay the Transaction's terminal BW drain, then atomically
+write data and forward metadata. **No SimPy yield is allowed between
+the data write and the metadata forward** (invariant I6). The drain
+yield must sit before the atomic block, not inside it:
 ```python
-def _on_vc_comm_recv(self, env, token):
+def _on_vc_comm_recv(self, env, txn):
    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
    # sender PE_DMA). MUST happen before the atomic block so recv only
    # wakes after the bytes have "landed".
    drain = getattr(txn, "drain_ns", 0.0)
    if drain > 0:
        yield env.timeout(drain)
    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    data = self._memory_store.read(token.src_space, token.src_addr,
                                   shape=..., dtype=...)
@@ -439,6 +466,33 @@ The final `put` is yieldable but uses an unbounded internal store, so
 it completes in a single step. That `put` is the closing call of the
 atomic block; nothing may be inserted before it.
 #### Drain-at-inbound semantics (D9 timing model)
 The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
 stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
 is paid at each forwarding component via `run()`, and the remaining
 BW drain is paid once at the Transaction's terminal. Every non-IPCQ
 Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
 `ComponentBase._forward_txn` at the terminal node. For IPCQ the
 destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
 (so IPCQ-specific data write + metadata forward can happen), so **the
 drain MUST be paid explicitly at the top of that handler** to keep
 IPCQ's timing model on par with every other fabric Transaction.
 Side-effects of paying drain here:
 - **SRC `tl.send`** is unchanged — fire-and-forget semantics are
  preserved because the sender PE_DMA does not `yield sub_done`. The
  `sub_done.succeed()` call (made after metadata forward below) is an
  event with no listener on the sender side.
 - **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
  when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
  forward now happens after the drain, recv observes the full fabric
  transfer time including bandwidth cost.
 Matches the physical picture: send dispatches and leaves; recv waits
 until the bytes have actually been drained into its inbox.
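The ordering rule (drain first, then an atomic write plus metadata forward) can be illustrated without SimPy by a plain generator and a tiny driver. Everything here is a hypothetical stand-in: the dict-based `txn`, `memory`, `meta_queue`, and the `run` driver are not the simulator's types, and the handler only mimics the shape of `_handle_ipcq_inbound`.

```python
def handle_ipcq_inbound(txn, memory, meta_queue):
    """Generator in the style of a SimPy process (SimPy itself is not used)."""
    drain = txn.get("drain_ns", 0.0)
    if drain > 0:
        yield drain  # the only suspension point: pay the terminal BW drain
    # Atomic section: no yield between the data write and the metadata forward.
    memory[txn["dst_addr"]] = txn["payload"]
    meta_queue.append(txn["dst_addr"])


def run(handler, start_ns):
    """Tiny driver: advance a clock by every delay the handler yields."""
    now = start_ns
    for delay in handler:
        now += delay
    return now


memory, meta_queue = {}, []
txn = {"dst_addr": 0x40, "payload": b"abcd", "drain_ns": 4096 / 64.0}
wake_ns = run(handle_ipcq_inbound(txn, memory, meta_queue), start_ns=200.0)
```

In this sketch the metadata only becomes visible at `wake_ns`, which is the arrival time plus `drain_ns`, matching the DST `tl.recv` bullet above.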
 ### D9.5. ADR-0020 (2-pass) integration
 `tl.send` / `tl.recv` integrates with ADR-0020's two-pass model. Phase
@@ -365,23 +365,39 @@ Unlike the piggyback model of the data path, credit return does not pass through the general vc_comm fabri
 and is instead handled over a **separate fast path**. This abstracts the
 NVLink/UCIe credit-return fast path of real HW.
-**Latency calculation**: derived from the **bottleneck BW of the routing
-path**, not from a magic constant.
+**Latency calculation**: derived from the **full path latency of the
+routing path** (per-node overhead + edge propagation + drain), not from
+a magic constant.
 ```
 credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
-path = router.find_path(self_pe, peer_pe)
-latency = compute_drain_ns(path, credit_size_bytes)
-        = credit_size_bytes / bottleneck_bw_on_path
+path = router.find_path(self_pe, peer_pe.pe_dma)
+latency = compute_path_latency_ns(path, credit_size_bytes)
+        = sum(edge.distance_mm * ns_per_mm)
+        + sum(node_overhead_ns[n] for n in path)
+        + credit_size_bytes / bottleneck_bw_on_path
 ```
 The router automatically appends `.pe_dma` to the source only, so the
 destination must carry an explicit `.pe_dma` suffix. Otherwise
 `find_path` raises and the credit silently teleports at zero cost, a
 latent bug (fixed in this update).
 `tl.recv` waits for credit-emit completion via yield-from (previously
 it was forked with `env.process`). The credit-return cost is thereby
 charged to the receiver's `pe_exec_ns`, and recv returns to the kernel
 only after the IPCQ control plane has completed the
 consume-acknowledgement. This is the protocol-level equivalent of a
 non-posted `tl.store` on RAW DMA waiting for the HBM ack trip.
 This gives us:
 - **Topology-proportional approximation**: an in-cube credit return and a cross-SIP credit
-  automatically get different latencies (not exact, but more meaningful than a magic constant)
-- **No magic constants**: no arbitrary value such as `ipcq_ctrl_latency_ns` is needed
-- **No deadlock risk**: unlike piggyback, credit is issued automatically even when B has
-  no data to send to A
-- **Reuses existing utility**: uses `ComponentContext.compute_drain_ns` as-is
+  automatically get different latencies
+- **No magic constants**: every ns value comes from `compute_path_latency_ns`,
+  computed on the same edge_map and `node_overhead_ns` as data traffic
+- **No deadlock risk**: `peer_credit_store.put` is unbounded, so credit is issued
+  automatically even when B has no data to send to A
+- **`IPCQ ≥ raw DMA`** guaranteed: for a matched physical move, the credit-emit
+  balances RAW's ack-trip cost
 ```
 PE B: tl.recv(W) → takes the data → my_tail++
@@ -426,11 +442,22 @@ at backend init, during the IpcqInitMsg fan-out, the bidirectional fast path channels are also
 #### PE_DMA's added responsibility
-When PE_DMA (vc_comm) receives a token, it processes it as the following atomic sequence.
-**No SimPy yield may sit between the two steps** (see the I6 MUST rule):
+When PE_DMA (vc_comm) receives a token, it processes it as the following sequence: first pay
+the BW drain of the Transaction terminal, then atomically perform the data write +
+metadata forward. **No SimPy yield may sit between the data write and the
+metadata forward** (see the I6 MUST rule). The drain yield must sit before the
+atomic section, not inside it:
 ```python
-def _on_vc_comm_recv(self, env, token):
+def _on_vc_comm_recv(self, env, txn):
    # Pay the drain_ns (= nbytes / bottleneck_bw) stamped by the sender
    # PE_DMA here. This must precede the atomic section, because recv
    # may only wake after the bytes have "arrived".
    drain = getattr(txn, "drain_ns", 0.0)
    if drain > 0:
        yield env.timeout(drain)
    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    # 1. write data to dst_addr (the dst memory space is token.dst_endpoint.buffer_kind)
    data = self._memory_store.read(token.src_space, token.src_addr,
@@ -446,6 +473,32 @@ it uses a store with unbounded capacity as the wire, so it completes immediately (
 single-step). This final put closes the atomic section, and no other yield may be
 inserted before it.
 #### Drain-at-inbound semantics (D9 timing model)
 The Transaction enters the fabric with `drain_ns = nbytes / bottleneck_bw_on_path`
 already stamped by the sender PE_DMA. In this simulator the per-hop `overhead_ns`
 is paid in each forwarding component's `run()`, and the remaining BW drain is
 paid once at the Transaction's terminal node. For every non-IPCQ
 Transaction (raw DMA, kernel-launch fanout, etc.),
 `ComponentBase._forward_txn` pays this drain at the terminal. For IPCQ,
 the destination PE_DMA intercepts the Transaction with the `_handle_ipcq_inbound`
 handler (because it must do the IPCQ-specific data write + metadata forward), so
 **the drain must be paid explicitly at the very top of this handler**. Only then does
 IPCQ's timing model stand on the same footing as every other fabric Transaction.
 Side effects of paying the drain here:
 - **SRC `tl.send`**: behavior unchanged. The sender PE_DMA does not `yield`
  `sub_done`, so fire-and-forget semantics are preserved. The `sub_done.succeed()`
  called after the metadata forward is an event with no listener on the sender side.
 - **DST `tl.recv`**: wakes `drain_ns` later. recv wakes only when its local PE_IPCQ
  receives `IpcqMetaArrival`, and since the metadata forward has moved to after the
  drain, recv observes the full fabric transfer time, bandwidth
  included.
 This matches the physical picture: send dispatches and returns immediately; recv
 waits until the bytes have actually drained into its inbox.
 #### Backpressure latency accuracy
 Time until backpressure is released:
@@ -1,22 +1,24 @@
-"""SFR configuration for intercube + inter-SIP IPCQ wiring.
-Provides ``configure_sfr_intercube_multisip`` which programs PE_IPCQ
-neighbor tables for:
-  1. Intercube within each SIP — pe0 of every cube connects to pe0 of
-     its N/S/E/W mesh neighbors (no wrap-around).
-  2. Inter-SIP on ALL cubes — pe0 of cube_c on sip_A connects to pe0 of
-     cube_c on each peer SIP, using ``global_E``/``global_W`` (ring) or
-     ``global_N``/``global_S``/``global_E``/``global_W`` (mesh/torus)
-     direction labels.  Wiring all cubes allows the kernel to
-     dynamically elect the root cube at runtime.
-SIP-level topology is read from ``topology.yaml`` →
-``system.sips.topology`` (e.g. ``ring_1d``, ``mesh_2d``).
-Intercube mesh dimensions come from ``sip.cube_mesh.w/h``.
-Internally delegates to ``install_ipcq`` with a computed ``rank_to_pe``
-(pe0-only) and a closure-captured ``neighbors()`` function.
+"""SFR configuration for the full IPCQ hardware wiring.
+Installs PE_IPCQ neighbor tables modeling the physical hardware.
+Wiring is independent of DPPolicy / kernel choice — the kernel decides
+at runtime which links to use.
+Direction label namespaces (disjoint):
+  - Intra-cube PE-to-PE:   ``intra_N / intra_S / intra_E / intra_W``
+    Logical 2×4 PE grid within a cube (no wrap):
+         Row 0:  pe0  pe1  pe2  pe3
+         Row 1:  pe4  pe5  pe6  pe7
+  - Intercube same-lane:   ``N / S / E / W``
+    ``pe_i of cube_A ↔ pe_i of cube_B`` across the 4×4 cube mesh
+    (no wrap). Every PE i ∈ [0..7] wired independently.
+  - Inter-SIP same-(cube, pe): ``global_N / global_S / global_E / global_W``
+    ``pe_i of cube_c on sip_A ↔ pe_i of cube_c on sip_B`` per
+    ``topology.yaml → system.sips.topology``.
 """
 from __future__ import annotations
@@ -27,12 +29,46 @@ from kernbench.ccl.install import install_ipcq
 from kernbench.ccl.topologies import _BUILTIN as _TOPO_BUILTINS
 # ── Intra-cube 2×4 PE grid ───────────────────────────────────────────
 _PE_GRID_COLS = 4
 _PE_GRID_ROWS = 2
 _PES_PER_CUBE = _PE_GRID_COLS * _PE_GRID_ROWS  # 8
 def _intra_cube_neighbors(pe: int) -> dict[str, int]:
    """Logical 2×4 PE grid neighbors within a cube (no wrap).
    Returns directions in the ``intra_*`` namespace.
    """
    row, col = divmod(pe, _PE_GRID_COLS)
    nbrs: dict[str, int] = {}
    if col < _PE_GRID_COLS - 1:
        nbrs["intra_E"] = row * _PE_GRID_COLS + (col + 1)
    if col > 0:
        nbrs["intra_W"] = row * _PE_GRID_COLS + (col - 1)
    if row < _PE_GRID_ROWS - 1:
        nbrs["intra_S"] = (row + 1) * _PE_GRID_COLS + col
    if row > 0:
        nbrs["intra_N"] = (row - 1) * _PE_GRID_COLS + col
    return nbrs
 # ── Public entry point ───────────────────────────────────────────────
 def configure_sfr_intercube_multisip(
    engine: Any,
    spec: dict,
    cfg: dict,
 ) -> dict[str, Any]:
-    """Wire IPCQ for intercube (pe0, mesh) + inter-SIP (pe0, all cubes).
+    """Wire the full IPCQ hardware model.
    Every PE on every cube on every SIP gets neighbor table entries for:
      - intra-cube (2×4 grid) in the ``intra_*`` namespace
      - intercube same-lane (4×4 cube mesh, no wrap) in ``N/S/E/W``
      - inter-SIP same-(cube, pe) in ``global_*``
    Args:
        engine: GraphEngine with ``_components``.
@@ -46,48 +82,71 @@ def configure_sfr_intercube_multisip(
    mesh_w = int(cm["w"])
    mesh_h = int(cm["h"])
    n_cubes = mesh_w * mesh_h
-    n_sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
-    sip_topology = str(
-        spec.get("system", {}).get("sips", {}).get("topology", "ring_1d")
-    )
+    sips_cfg = spec.get("system", {}).get("sips", {})
+    n_sips = int(sips_cfg.get("count", 1))
+    sip_topology = str(sips_cfg.get("topology", "ring_1d"))
+    sip_w = sips_cfg.get("w")
+    sip_h = sips_cfg.get("h")
+    sip_w = int(sip_w) if sip_w is not None else None
+    sip_h = int(sip_h) if sip_h is not None else None
    if sip_topology not in _TOPO_BUILTINS:
        raise ValueError(
            f"Unknown sip topology '{sip_topology}'. "
            f"Available: {list(_TOPO_BUILTINS)}"
        )
-    sip_topo_fn = _TOPO_BUILTINS[sip_topology]
-    world_size = n_sips * n_cubes
+    _sip_topo_fn_raw = _TOPO_BUILTINS[sip_topology]
+    def sip_topo_fn(rank: int, ws: int) -> dict:
+        if sip_w is not None and sip_h is not None:
+            try:
+                return _sip_topo_fn_raw(rank, ws, w=sip_w, h=sip_h)
+            except TypeError:
+                pass
+        return _sip_topo_fn_raw(rank, ws)
+    pes_per_cube = _PES_PER_CUBE
+    world_size = n_sips * n_cubes * pes_per_cube
    pe_idx_to_pe: list[tuple[int, int, int]] = [
-        (sip, cube, 0)
+        (sip, cube, pe)
        for sip in range(n_sips)
        for cube in range(n_cubes)
+        for pe in range(pes_per_cube)
    ]
+    def _pe_idx(sip: int, cube: int, pe: int) -> int:
+        return (sip * n_cubes + cube) * pes_per_cube + pe
    def _neighbors(pe_idx: int, ws: int, _base: dict) -> dict[str, int]:
-        sip = pe_idx // n_cubes
-        cube = pe_idx % n_cubes
+        tmp = pe_idx
+        pe = tmp % pes_per_cube
+        tmp //= pes_per_cube
+        cube = tmp % n_cubes
+        sip = tmp // n_cubes
        row = cube // mesh_w
        col = cube % mesh_w
        nbrs: dict[str, int] = {}
-        # Intercube within SIP (mesh, no wrap-around)
-        if col < mesh_w - 1:
-            nbrs["E"] = sip * n_cubes + (row * mesh_w + col + 1)
-        if col > 0:
-            nbrs["W"] = sip * n_cubes + (row * mesh_w + col - 1)
-        if row < mesh_h - 1:
-            nbrs["S"] = sip * n_cubes + ((row + 1) * mesh_w + col)
-        if row > 0:
-            nbrs["N"] = sip * n_cubes + ((row - 1) * mesh_w + col)
-        # Inter-SIP on ALL cubes
+        # ── Intra-cube (intra_N/S/E/W) ──
+        for d, peer_pe in _intra_cube_neighbors(pe).items():
+            nbrs[d] = _pe_idx(sip, cube, peer_pe)
+        # ── Intercube same-lane (N/S/E/W, 4×4 no wrap) ──
+        if col < mesh_w - 1:
+            nbrs["E"] = _pe_idx(sip, row * mesh_w + (col + 1), pe)
+        if col > 0:
+            nbrs["W"] = _pe_idx(sip, row * mesh_w + (col - 1), pe)
+        if row < mesh_h - 1:
+            nbrs["S"] = _pe_idx(sip, (row + 1) * mesh_w + col, pe)
+        if row > 0:
+            nbrs["N"] = _pe_idx(sip, (row - 1) * mesh_w + col, pe)
+        # ── Inter-SIP same-(cube, pe) (global_*) ──
        if n_sips > 1:
            sip_nbrs = sip_topo_fn(sip, n_sips)
            for d, peer_sip in sip_nbrs.items():
-                nbrs[f"global_{d}"] = peer_sip * n_cubes + cube
+                nbrs[f"global_{d}"] = _pe_idx(peer_sip, cube, pe)
        return nbrs
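The flat index used by `_pe_idx` and its inverse inside `_neighbors` can be checked in isolation. The sketch below copies the arithmetic from the hunk above but fixes the sizes as module constants; `N_CUBES = 16` and `PES_PER_CUBE = 8` follow the 4×4 cube mesh and 2×4 PE grid named in the docstring, while the standalone function names `pe_idx` and `unflatten` are illustrative.

```python
N_CUBES = 16      # 4x4 cube mesh per SIP
PES_PER_CUBE = 8  # 2x4 PE grid per cube


def pe_idx(sip, cube, pe):
    """Flatten (sip, cube, pe) with pe as the fastest-varying digit."""
    return (sip * N_CUBES + cube) * PES_PER_CUBE + pe


def unflatten(idx):
    """Inverse of pe_idx, peeling digits in the same order as _neighbors."""
    pe = idx % PES_PER_CUBE
    idx //= PES_PER_CUBE
    return idx // N_CUBES, idx % N_CUBES, pe


example = pe_idx(1, 5, 3)          # sip 1, cube 5, pe 3
east_peer = pe_idx(0, 6, 3)        # same-lane E neighbor of (sip 0, cube 5, pe 3)
roundtrip_ok = all(
    pe_idx(*unflatten(i)) == i for i in range(2 * N_CUBES * PES_PER_CUBE)
)
```

Because the PE digit varies fastest, same-lane peers across cubes differ by a multiple of `PES_PER_CUBE`, which is why the intercube entries can reuse `pe` unchanged.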
@@ -33,23 +33,41 @@ def ring_1d_unidir(rank: int, world_size: int) -> NeighborMap:
    return {"E": (rank + 1) % world_size}
-def mesh_2d(rank: int, world_size: int) -> NeighborMap:
-    """Square 2D mesh (N/S/E/W).
-
-    Layout: rank = row * side + col, with side = sqrt(world_size).
-    Wrap-around (torus) on all four edges.
-    """
+def _resolve_2d_dims(
+    world_size: int, w: int | None, h: int | None, name: str,
+) -> tuple[int, int]:
+    if w is not None and h is not None:
+        if w * h != world_size:
+            raise ValueError(
+                f"{name}: w*h ({w}*{h}) != world_size ({world_size})"
+            )
+        return w, h
    side = int(round(world_size ** 0.5))
    if side * side != world_size:
        raise ValueError(
-            f"mesh_2d requires square world_size, got {world_size}"
+            f"{name} requires square world_size or explicit w,h, "
+            f"got {world_size}"
        )
-    r, c = divmod(rank, side)
+    return side, side
+def mesh_2d(
+    rank: int, world_size: int,
+    w: int | None = None, h: int | None = None,
+) -> NeighborMap:
+    """2D mesh (N/S/E/W) with wrap-around on all four edges.
+    Layout: rank = row * w + col. When w, h are given, supports
+    rectangular (e.g. 2x3) layouts. Otherwise falls back to square
+    side = sqrt(world_size).
+    """
+    w, h = _resolve_2d_dims(world_size, w, h, "mesh_2d")
+    r, c = divmod(rank, w)
    return {
-        "N": ((r - 1) % side) * side + c,
-        "S": ((r + 1) % side) * side + c,
-        "W": r * side + (c - 1) % side,
-        "E": r * side + (c + 1) % side,
+        "N": ((r - 1) % h) * w + c,
+        "S": ((r + 1) % h) * w + c,
+        "W": r * w + (c - 1) % w,
+        "E": r * w + (c + 1) % w,
    }
    }
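To see what the rectangular wrap-around layout produces, here is a standalone re-implementation of the same arithmetic for a 6-rank world. It is a sketch rather than the module itself: the function names are local to the example, and the choice `w=3, h=2` for a "2x3" layout is an assumption, since the diff does not say which dimension is which.

```python
def resolve_2d_dims(world_size, w=None, h=None):
    """Explicit w,h win; otherwise fall back to a square side."""
    if w is not None and h is not None:
        if w * h != world_size:
            raise ValueError(f"w*h ({w}*{h}) != world_size ({world_size})")
        return w, h
    side = int(round(world_size ** 0.5))
    if side * side != world_size:
        raise ValueError(f"need square world_size or explicit w,h, got {world_size}")
    return side, side


def mesh_2d(rank, world_size, w=None, h=None):
    """Wrap-around 2D mesh with rank = row * w + col."""
    w, h = resolve_2d_dims(world_size, w, h)
    r, c = divmod(rank, w)
    return {
        "N": ((r - 1) % h) * w + c,
        "S": ((r + 1) % h) * w + c,
        "W": r * w + (c - 1) % w,
        "E": r * w + (c + 1) % w,
    }


corner = mesh_2d(0, 6, w=3, h=2)
inner = mesh_2d(4, 6, w=3, h=2)
square = mesh_2d(0, 4)  # back-compat: square fallback still works
```

Under this assumed orientation, `h = 2` makes the wrap fold N and S onto the same peer for every rank, which may matter to kernels that expect four distinct neighbors.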
@@ -73,36 +91,30 @@ def tree_binary(rank: int, world_size: int) -> NeighborMap:
    return n
-def torus_2d(rank: int, world_size: int) -> NeighborMap:
-    """Square 2D torus (N/S/E/W) with wrap-around on all edges.
-
-    Alias for mesh_2d (which already wraps). Explicit name for clarity
-    when used as a SIP-level topology.
-    """
-    return mesh_2d(rank, world_size)
+def torus_2d(
+    rank: int, world_size: int,
+    w: int | None = None, h: int | None = None,
+) -> NeighborMap:
+    """2D torus (N/S/E/W) with wrap-around on all edges. Alias for mesh_2d."""
+    return mesh_2d(rank, world_size, w=w, h=h)
-def mesh_2d_no_wrap(rank: int, world_size: int) -> NeighborMap:
-    """Square 2D mesh (N/S/E/W) WITHOUT wrap-around.
-
-    Edge nodes have fewer neighbors (no wrapping). Used for SIP-level
-    topologies where physical links don't wrap.
-    """
-    side = int(round(world_size ** 0.5))
-    if side * side != world_size:
-        raise ValueError(
-            f"mesh_2d_no_wrap requires square world_size, got {world_size}"
-        )
-    r, c = divmod(rank, side)
+def mesh_2d_no_wrap(
+    rank: int, world_size: int,
+    w: int | None = None, h: int | None = None,
+) -> NeighborMap:
+    """2D mesh (N/S/E/W) WITHOUT wrap-around. Supports rectangular dims."""
+    w, h = _resolve_2d_dims(world_size, w, h, "mesh_2d_no_wrap")
+    r, c = divmod(rank, w)
    n: NeighborMap = {}
    if r > 0:
-        n["N"] = (r - 1) * side + c
-    if r < side - 1:
-        n["S"] = (r + 1) * side + c
+        n["N"] = (r - 1) * w + c
+    if r < h - 1:
+        n["S"] = (r + 1) * w + c
    if c > 0:
-        n["W"] = r * side + (c - 1)
-    if c < side - 1:
-        n["E"] = r * side + (c + 1)
+        n["W"] = r * w + (c - 1)
+    if c < w - 1:
+        n["E"] = r * w + (c + 1)
    return n
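A quick way to sanity-check the no-wrap variant is to verify that every link is symmetric: if rank a lists b as its E neighbor, b should list a as W. The sketch below re-implements the neighbor arithmetic with explicit dimensions only; the standalone name `no_wrap_neighbors` and the `w=3, h=2` orientation are assumptions made for the example.

```python
def no_wrap_neighbors(rank, w, h):
    """2D mesh without wrap-around, rank = row * w + col."""
    r, c = divmod(rank, w)
    n = {}
    if r > 0:
        n["N"] = (r - 1) * w + c
    if r < h - 1:
        n["S"] = (r + 1) * w + c
    if c > 0:
        n["W"] = r * w + (c - 1)
    if c < w - 1:
        n["E"] = r * w + (c + 1)
    return n


OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}
W, H = 3, 2

corner = no_wrap_neighbors(0, W, H)
bottom_mid = no_wrap_neighbors(4, W, H)

# Every directed link must have a matching reverse link.
symmetric = all(
    no_wrap_neighbors(peer, W, H).get(OPPOSITE[d]) == rank
    for rank in range(W * H)
    for d, peer in no_wrap_neighbors(rank, W, H).items()
)
```

Corner ranks end up with two neighbors and edge ranks with three, so kernels walking this topology need to tolerate missing direction keys.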
@@ -58,7 +58,18 @@ class IoCpuComponent(ComponentBase):
            self._pending[key] = (expected, received, parent_done)
    def _dispatch_to_m_cpus(self, env: simpy.Environment, txn: Any) -> Generator:
-        """Fan out sub-Transactions to target cube M_CPUs, wait for responses."""
+        """Fan out sub-Transactions to target cube M_CPUs, wait for responses.
        ADR-0009 D5 (extended): for KernelLaunchMsg, stamp a single global
        target_start_ns = env.now + max(IO_CPU → any target PE_CPU path
        latency across all target cubes). M_CPU passes this value through
        unchanged; every PE in every cube yields until the same sim-time
        before beginning kernel execution. Without this, cross-cube
        launches would have each cube's M_CPU compute its own per-cube
        barrier relative to its local env.now, leaving PEs on different
        cubes out of sync (the "h3/h4 dispatch-offset artifact").
        """
        import dataclasses
        from kernbench.runtime_api.kernel import KernelLaunchMsg, MemoryReadMsg, MemoryWriteMsg
        request = txn.request
@@ -72,10 +83,60 @@ class IoCpuComponent(ComponentBase):
            txn.done.succeed()
            return
        # For KernelLaunchMsg, compute the global barrier once here so
        # every downstream PE_CPU uses the same target_start_ns.
        if isinstance(request, KernelLaunchMsg):
            io_overhead = self.ctx.node_overhead_ns.get(self.node.id, 0.0)
            global_max_latency = 0.0
            pe_ids = self._resolve_pe_ids(
                getattr(request, "target_pe", "all")
            )
            for sip, cube in cube_targets:
                try:
                    m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
                    io_to_m_path = self.ctx.router.find_node_path(
                        self.node.id, m_cpu_id,
                    )
                except Exception:
                    continue
                if len(io_to_m_path) < 2:
                    continue
                leg1 = self.ctx.compute_path_latency_ns(
                    io_to_m_path, nbytes=0,
                )
                m_overhead = self.ctx.node_overhead_ns.get(m_cpu_id, 0.0)
                for pe_id in pe_ids:
                    pe_cpu_id = (
                        f"sip{sip}.cube{cube}.pe{pe_id}.pe_cpu"
                    )
                    try:
                        m_to_pe_path = self.ctx.router.find_node_path(
                            m_cpu_id, pe_cpu_id,
                        )
                    except Exception:
                        continue
                    if len(m_to_pe_path) < 2:
                        continue
                    leg2 = self.ctx.compute_path_latency_ns(
                        m_to_pe_path, nbytes=0,
                    )
                    latency = leg1 + leg2 - io_overhead - m_overhead
                    if latency > global_max_latency:
                        global_max_latency = latency
            request = dataclasses.replace(
                request,
                target_start_ns=float(env.now) + global_max_latency,
            )
        # Setup aggregation
        self._pending[request.request_id] = (len(cube_targets), 0, txn.done)
-        # Fan out to each target cube's M_CPU
+        # Fan out to each target cube's M_CPU. Kernel-launch fanout
        # carries control metadata only; nbytes is forced to 0 for
        # KernelLaunchMsg so the launch sub-txns do not occupy data-fabric
        # BW (would otherwise serialize 16 cubes worth of fanout on the
        # shared first hop and break ADR-0009 D5's barrier prediction).
        is_kernel_launch = isinstance(request, KernelLaunchMsg)
        for sip, cube in cube_targets:
            try:
                m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
@@ -86,11 +147,25 @@ class IoCpuComponent(ComponentBase):
                continue
            sub_txn = Transaction(
                request=request, path=path, step=0,
-                nbytes=txn.nbytes, done=env.event(),
+                nbytes=0 if is_kernel_launch else txn.nbytes,
                done=env.event(),
                result_data=txn.result_data,
            )
            yield self.out_ports[path[1]].put(sub_txn.advance())
    def _resolve_pe_ids(self, target_pe: Any) -> list[int]:
        """Resolve target_pe → list of PE indices (mirrors M_CPU logic)."""
        if isinstance(target_pe, int):
            return [target_pe]
        if isinstance(target_pe, tuple):
            return list(target_pe)
        # "all": all PEs in a cube
        n_slices = 8
        if self.ctx and self.ctx.spec:
            mm = self.ctx.spec.get("cube", {}).get("memory_map", {})
            n_slices = mm.get("hbm_slices_per_cube", 8)
        return list(range(n_slices))
    def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
        """Return list of (sip, cube) pairs to fan out to."""
        from kernbench.runtime_api.kernel import (
@@ -162,7 +162,11 @@ class MCpuComponent(ComponentBase):
        Routes through find_node_path (M_CPU → NOC → PE_CPU command edges).
        PE_CPU sends ResponseMsg back via NOC → M_CPU on completion.
        Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
        ADR-0009 D5: stamps target_start_ns so every PE in this fanout
        starts executing at the same env.now regardless of dispatch path.
        """
        import dataclasses
        request = txn.request
        target_pe = getattr(request, "target_pe", "all")
        cube_prefix = self.node.id.rsplit(".", 1)[0]  # e.g. "sip0.cube0"
@@ -172,9 +176,13 @@ class MCpuComponent(ComponentBase):
            txn.done.succeed()
            return
-        # Fan out to each PE_CPU, using response-based aggregation
-        sub_txns: list[Transaction] = []
-        n_dispatched = 0
+        # Resolve per-PE paths. If IO_CPU already stamped a global
+        # target_start_ns (ADR-0009 D5 extended), pass it through
+        # unchanged so every PE across every cube uses the same barrier.
+        # Otherwise (e.g. direct-to-M_CPU launch in a unit test) compute
+        # a per-cube barrier from env.now.
+        per_pe: list[tuple[int, list[str], float]] = []
+        max_latency = 0.0
        for pe_id in pe_ids:
            pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
            try:
@@ -183,8 +191,24 @@ class MCpuComponent(ComponentBase):
                continue
            if len(path) < 2:
                continue
            latency = self.ctx.compute_path_latency_ns(path, nbytes=0)
            per_pe.append((pe_id, path, latency))
            if latency > max_latency:
                max_latency = latency
        if getattr(request, "target_start_ns", None) is not None:
            stamped_request = request
        else:
            stamped_request = dataclasses.replace(
                request, target_start_ns=float(env.now) + max_latency,
            )
        # Fan out to each PE_CPU, using response-based aggregation
        sub_txns: list[Transaction] = []
        n_dispatched = 0
        for pe_id, path, _lat in per_pe:
            sub_txn = Transaction(
-                request=request, path=path, step=0,
+                request=stamped_request, path=path, step=0,
                nbytes=0, done=env.event(),
            )
            yield self.out_ports[path[1]].put(sub_txn.advance())
@@ -204,16 +228,21 @@ class MCpuComponent(ComponentBase):
        yield all_done
        del self._parent_txns[request.request_id]
-        # Aggregate PE-internal metrics (max across PEs)
+        # Aggregate PE-internal metrics (max across PEs and across cubes).
+        # Multiple M_CPUs share the same result_data dict via IO_CPU fanout;
+        # merge against the existing value so cubes don't clobber each other.
        pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
        if pe_exec_values:
-            txn.result_data["pe_exec_ns"] = max(pe_exec_values)
+            cur = txn.result_data.get("pe_exec_ns", 0.0) or 0.0
+            txn.result_data["pe_exec_ns"] = max(cur, max(pe_exec_values))
        dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
        if dma_values:
-            txn.result_data["dma_ns"] = max(dma_values)
+            cur = txn.result_data.get("dma_ns", 0.0) or 0.0
+            txn.result_data["dma_ns"] = max(cur, max(dma_values))
        compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
        if compute_values:
-            txn.result_data["compute_ns"] = max(compute_values)
+            cur = txn.result_data.get("compute_ns", 0.0) or 0.0
+            txn.result_data["compute_ns"] = max(cur, max(compute_values))
        # Send aggregate response on reverse command path back to IO_CPU
        reverse_path = list(reversed(txn.path))
@@ -95,6 +95,13 @@ class PeCpuComponent(ComponentBase):
        request = txn.request
        yield from self.run(env, 0)
        # ADR-0009 D5: synchronized launch barrier. If IO_CPU or M_CPU
        # stamped a target_start_ns, wait until then so every PE in this
        # launch begins pe_exec measurement at the same simulated time.
        target_start = getattr(request, "target_start_ns", None)
        if target_start is not None and target_start > env.now:
            yield env.timeout(float(target_start) - env.now)
        kernel_fn = get_kernel(request.kernel_ref.name)
        num_programs = self._derive_num_programs(request)
        kernel_args = self._unpack_kernel_args(request)
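
The barrier arithmetic in the hunks above can be sketched without SimPy. This is a minimal illustration under stated assumptions, not the simulator's code: `barrier_start_ns` and `wait_at_pe_ns` are hypothetical helper names, and the latencies are made-up values.

```python
from typing import List, Optional


def barrier_start_ns(now_ns: float, dispatch_latencies_ns: List[float]) -> float:
    """Common start time: launch time plus the slowest dispatch path."""
    return now_ns + max(dispatch_latencies_ns, default=0.0)


def wait_at_pe_ns(arrival_ns: float, target_start_ns: Optional[float]) -> float:
    """Idle time at a PE once the launch arrives (0 if unstamped or already late)."""
    if target_start_ns is None or target_start_ns <= arrival_ns:
        return 0.0
    return target_start_ns - arrival_ns


now = 100.0
latencies = [12.0, 20.0, 35.0]  # hypothetical per-PE dispatch latencies (ns)
target = barrier_start_ns(now, latencies)

# Each PE receives the launch at now + latency, then waits out the remainder.
starts = [
    (now + lat) + wait_at_pe_ns(now + lat, target) for lat in latencies
]
```

Every entry in `starts` equals `target`, which is the property D5 asks for: the kernel body begins at one simulated time regardless of dispatch-path length. A PE that arrives after the stamp simply does not wait, matching the `target_start > env.now` guard in the diff.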
@@ -186,13 +186,37 @@ class PeDmaComponent(PeEngineBase):
    # ── IPCQ inbound (fabric → PE_DMA → MemoryStore + PE_IPCQ) ──────
    def _handle_ipcq_inbound(self, env: simpy.Environment, txn: Any) -> Generator:
-        """At destination PE_DMA: atomically write data and forward metadata.
+        """At destination PE_DMA: pay terminal drain, then atomically write
        data and forward metadata.
        ADR-0023 D9 (drain at inbound terminal): the Transaction carries
        ``drain_ns = nbytes / bottleneck_bw_on_path`` stamped by the sender
        PE_DMA. Like every other Transaction terminal in the simulator (see
        ``ComponentBase._forward_txn``), this drain must be paid when the
        Transaction reaches its destination. SRC-side ``tl.send`` is
        fire-and-forget — it never yields on ``sub_done`` — so paying the
        drain here does NOT delay the sender. What it DOES delay is the
        IpcqMetaArrival forwarded below: that delay is the only signal
        ``tl.recv`` on DST blocks on, which is exactly the desired
        semantics — "send dispatches and returns; recv waits until the
        bytes have actually landed in its inbox".
        The drain MUST be paid before the atomic block — inserting a yield
        inside would break invariant I6.
        I6 (MUST): no SimPy yield between MemoryStore.write and the
        IpcqMetaArrival put into PE_IPCQ.
        """
        from kernbench.common.ipcq_types import IpcqMetaArrival
        # Pay terminal BW drain before the atomic write/metadata forward.
        # Without this, IPCQ effectively got fabric bandwidth for free at
        # the terminal (only intermediate-hop overhead_ns was charged),
        # making IPCQ appear faster than raw DMA at large sizes in benchmarks.
        drain = getattr(txn, "drain_ns", 0.0)
        if drain > 0:
            yield env.timeout(drain)
        token = txn.request
        # ── ATOMIC: do not introduce yield between these two operations ──
@@ -338,9 +338,13 @@ class PeIpcqComponent(ComponentBase):
                nbytes=req.result_data.get("nbytes", 0),
            )
-        # Fast path credit return — bottleneck BW based latency
-        env.process(
-            self._delayed_credit_send(env, direction, qp["peer_credit_store"], qp["my_tail"])
-        )
+        # Credit return: recv blocks on credit-emit so the protocol cost
+        # (full path latency to deliver the credit metadata back to the
+        # sender) is reflected in the recv's pe_exec_ns. Models the IPCQ
+        # control-plane completing the consume-acknowledgement before
+        # recv returns to the kernel.
+        yield from self._delayed_credit_send(
+            env, direction, qp["peer_credit_store"], qp["my_tail"],
+        )
        if not req.done.triggered:
@@ -455,7 +459,12 @@ class PeIpcqComponent(ComponentBase):
        yield peer_credit_store.put(meta)
    def _credit_latency_ns(self, direction: str) -> float:
-        """Compute credit fast path latency = credit_size / bottleneck_bw.
+        """Full path latency for the credit-return packet.
        Pays per-node overhead + edge prop + drain along the same fabric
        the data took. PathRouter.find_path() auto-appends ".pe_dma" to
        the source only, so the destination MUST be spelled with the
        explicit ".pe_dma" suffix.
        Falls back to 0 when ctx/router is unavailable (unit-test mode).
        """
@@ -463,10 +472,12 @@ class PeIpcqComponent(ComponentBase):
            return 0.0
        qp = self._queue_pairs[direction]
        peer = qp["peer"]
-        peer_pe_prefix = f"sip{peer.sip}.cube{peer.cube}.pe{peer.pe}"
+        peer_pe_dma = f"sip{peer.sip}.cube{peer.cube}.pe{peer.pe}.pe_dma"
        try:
-            path = self.ctx.router.find_path(self._pe_prefix, peer_pe_prefix)
-            return self.ctx.compute_drain_ns(path, self._credit_size_bytes)
+            path = self.ctx.router.find_path(self._pe_prefix, peer_pe_dma)
+            return self.ctx.compute_path_latency_ns(
+                path, self._credit_size_bytes,
+            )
        except Exception:
            return 0.0
@@ -26,6 +26,9 @@ class ComponentContext:
    spec: dict = field(default_factory=dict)  # topology spec (cube layout, PE count, etc.)
    memory_store: Any = None  # MemoryStore for Phase 1 data-aware execution (ADR-0020)
    op_logger: Any = None     # OpLogger for Phase 1 op recording (ADR-0020)
    # node_id -> overhead_ns (ADR-0009 D5: used by M_CPU to compute per-PE
    # dispatch latency when stamping target_start_ns on KernelLaunchMsg).
    node_overhead_ns: dict[str, float] = field(default_factory=dict)
    def get_shared_resource(
        self, env: simpy.Environment, key: str, capacity: int = 1,
@@ -52,3 +55,19 @@ class ComponentContext:
        if min_bw == float("inf"):
            return 0.0
        return nbytes / min_bw
    def compute_path_latency_ns(self, path: list[str], nbytes: int = 0) -> float:
        """Formula latency along path: wire + per-node overhead + drain.
        ADR-0009 D5: M_CPU uses this to compute per-PE dispatch latency
        when stamping target_start_ns on KernelLaunchMsg fanout.
        """
        total = 0.0
        for i in range(len(path) - 1):
            edge = self.edge_map.get((path[i], path[i + 1]))
            if edge:
                total += edge.distance_mm * self.ns_per_mm
        for node_id in path:
            total += self.node_overhead_ns.get(node_id, 0.0)
        total += self.compute_drain_ns(path, nbytes)
        return total
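
The formula in `compute_path_latency_ns` can be exercised in isolation. The sketch below is a standalone approximation, assuming a hypothetical `Edge` type with a `bw_bytes_per_ns` field (the diff only shows `distance_mm`) and made-up node names and constants; it is not the `ComponentContext` implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    distance_mm: float
    bw_bytes_per_ns: float  # assumed field name; not shown in the diff


def path_latency_ns(path, edges, node_overhead_ns, ns_per_mm, nbytes=0):
    """wire + per-node overhead + drain, mirroring the formula in the diff."""
    hops = [edges[(a, b)] for a, b in zip(path, path[1:]) if (a, b) in edges]
    wire = sum(e.distance_mm * ns_per_mm for e in hops)
    overhead = sum(node_overhead_ns.get(n, 0.0) for n in path)
    min_bw = min((e.bw_bytes_per_ns for e in hops), default=float("inf"))
    drain = 0.0 if min_bw == float("inf") else nbytes / min_bw
    return wire + overhead + drain


# Hypothetical three-node command path: io -> m -> pe
edges = {("io", "m"): Edge(2.0, 64.0), ("m", "pe"): Edge(1.0, 32.0)}
overhead = {"io": 5.0, "m": 3.0, "pe": 1.0}
NS_PER_MM = 0.5

full = path_latency_ns(["io", "m", "pe"], edges, overhead, NS_PER_MM)
leg1 = path_latency_ns(["io", "m"], edges, overhead, NS_PER_MM)
leg2 = path_latency_ns(["m", "pe"], edges, overhead, NS_PER_MM)
with_payload = path_latency_ns(["io", "m", "pe"], edges, overhead, NS_PER_MM, nbytes=64)
```

With a zero-byte payload, `leg1 + leg2` counts the shared middle node's overhead twice, so subtracting it once recovers the end-to-end latency. That is consistent with the `- m_overhead` term in the IO_CPU hunk of this commit.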
@@ -58,7 +58,13 @@ class IoCpuComponent(ComponentBase):
            self._pending[key] = (expected, received, parent_done)
    def _dispatch_to_m_cpus(self, env: simpy.Environment, txn: Any) -> Generator:
-        """Fan out sub-Transactions to target cube M_CPUs, wait for responses."""
+        """Fan out sub-Transactions to target cube M_CPUs, wait for responses.
        ADR-0009 D5 (extended): stamp a global target_start_ns on
        KernelLaunchMsg so every PE across every target cube starts at
        the same env.now. See the non-legacy builtin for full rationale.
        """
        import dataclasses
        from kernbench.runtime_api.kernel import KernelLaunchMsg, MemoryReadMsg, MemoryWriteMsg
        request = txn.request
@@ -72,10 +78,53 @@ class IoCpuComponent(ComponentBase):
            txn.done.succeed()
            return
        if isinstance(request, KernelLaunchMsg):
            io_overhead = self.ctx.node_overhead_ns.get(self.node.id, 0.0)
            global_max_latency = 0.0
            pe_ids = self._resolve_pe_ids(
                getattr(request, "target_pe", "all")
            )
            for sip, cube in cube_targets:
                try:
                    m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
                    io_to_m_path = self.ctx.router.find_node_path(
                        self.node.id, m_cpu_id,
                    )
                except Exception:
                    continue
                if len(io_to_m_path) < 2:
                    continue
                leg1 = self.ctx.compute_path_latency_ns(
                    io_to_m_path, nbytes=0,
                )
                m_overhead = self.ctx.node_overhead_ns.get(m_cpu_id, 0.0)
                for pe_id in pe_ids:
                    pe_cpu_id = (
                        f"sip{sip}.cube{cube}.pe{pe_id}.pe_cpu"
                    )
                    try:
                        m_to_pe_path = self.ctx.router.find_node_path(
                            m_cpu_id, pe_cpu_id,
                        )
                    except Exception:
                        continue
                    if len(m_to_pe_path) < 2:
                        continue
                    leg2 = self.ctx.compute_path_latency_ns(
                        m_to_pe_path, nbytes=0,
                    )
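                    # M_CPU appears in both legs (end of leg1, start of
                    # leg2), so its overhead is subtracted once; IO_CPU's
                    # own overhead is excluded from the barrier offset.
                    latency = leg1 + leg2 - io_overhead - m_overhead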
                    if latency > global_max_latency:
                        global_max_latency = latency
            request = dataclasses.replace(
                request,
                target_start_ns=float(env.now) + global_max_latency,
            )
        # Setup aggregation
        self._pending[request.request_id] = (len(cube_targets), 0, txn.done)
-        # Fan out to each target cube's M_CPU
+        is_kernel_launch = isinstance(request, KernelLaunchMsg)
        for sip, cube in cube_targets:
            try:
                m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
@@ -86,11 +135,24 @@ class IoCpuComponent(ComponentBase):
                continue
            sub_txn = Transaction(
                request=request, path=path, step=0,
-                nbytes=txn.nbytes, done=env.event(),
+                nbytes=0 if is_kernel_launch else txn.nbytes,
                done=env.event(),
                result_data=txn.result_data,
            )
            yield self.out_ports[path[1]].put(sub_txn.advance())
    def _resolve_pe_ids(self, target_pe: Any) -> list[int]:
        """Resolve target_pe → list of PE indices (mirrors M_CPU logic)."""
        if isinstance(target_pe, int):
            return [target_pe]
        if isinstance(target_pe, tuple):
            return list(target_pe)
        n_slices = 8
        if self.ctx and self.ctx.spec:
            mm = self.ctx.spec.get("cube", {}).get("memory_map", {})
            n_slices = mm.get("hbm_slices_per_cube", 8)
        return list(range(n_slices))
    def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
        """Return list of (sip, cube) pairs to fan out to."""
        from kernbench.runtime_api.kernel import (
@@ -162,7 +162,11 @@ class MCpuComponent(ComponentBase):
        Routes through find_node_path (M_CPU → NOC → PE_CPU command edges).
        PE_CPU sends ResponseMsg back via NOC → M_CPU on completion.
        Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
        ADR-0009 D5: stamps target_start_ns so every PE in this fanout
        starts executing at the same env.now regardless of dispatch path.
        """
        import dataclasses
        request = txn.request
        target_pe = getattr(request, "target_pe", "all")
        cube_prefix = self.node.id.rsplit(".", 1)[0]  # e.g. "sip0.cube0"
@@ -172,9 +176,10 @@ class MCpuComponent(ComponentBase):
            txn.done.succeed()
            return
-        # Fan out to each PE_CPU, using response-based aggregation
-        sub_txns: list[Transaction] = []
-        n_dispatched = 0
+        # Resolve per-PE paths. If IO_CPU already stamped a global
+        # target_start_ns (ADR-0009 D5 extended), pass it through.
+        per_pe: list[tuple[int, list[str], float]] = []
+        max_latency = 0.0
        for pe_id in pe_ids:
            pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
            try:
@@ -183,8 +188,24 @@ class MCpuComponent(ComponentBase):
                continue
            if len(path) < 2:
                continue
            latency = self.ctx.compute_path_latency_ns(path, nbytes=0)
            per_pe.append((pe_id, path, latency))
            if latency > max_latency:
                max_latency = latency
        if getattr(request, "target_start_ns", None) is not None:
            stamped_request = request
        else:
            stamped_request = dataclasses.replace(
                request, target_start_ns=float(env.now) + max_latency,
            )
        # Fan out to each PE_CPU, using response-based aggregation
        sub_txns: list[Transaction] = []
        n_dispatched = 0
        for pe_id, path, _lat in per_pe:
            sub_txn = Transaction(
-                request=request, path=path, step=0,
+                request=stamped_request, path=path, step=0,
                nbytes=0, done=env.event(),
            )
            yield self.out_ports[path[1]].put(sub_txn.advance())
@@ -204,16 +225,21 @@ class MCpuComponent(ComponentBase):
        yield all_done
        del self._parent_txns[request.request_id]
-        # Aggregate PE-internal metrics (max across PEs)
+        # Aggregate PE-internal metrics (max across PEs and across cubes).
        # Multiple M_CPUs share the same result_data dict via IO_CPU fanout;
        # merge against the existing value so cubes don't clobber each other.
        pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
        if pe_exec_values:
-            txn.result_data["pe_exec_ns"] = max(pe_exec_values)
+            cur = txn.result_data.get("pe_exec_ns", 0.0) or 0.0
            txn.result_data["pe_exec_ns"] = max(cur, max(pe_exec_values))
        dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
        if dma_values:
-            txn.result_data["dma_ns"] = max(dma_values)
+            cur = txn.result_data.get("dma_ns", 0.0) or 0.0
            txn.result_data["dma_ns"] = max(cur, max(dma_values))
        compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
        if compute_values:
-            txn.result_data["compute_ns"] = max(compute_values)
+            cur = txn.result_data.get("compute_ns", 0.0) or 0.0
            txn.result_data["compute_ns"] = max(cur, max(compute_values))
        # Send aggregate response on reverse command path back to IO_CPU
        reverse_path = list(reversed(txn.path))
@@ -71,6 +71,13 @@ class PeCpuComponent(ComponentBase):
        request = txn.request
        yield from self.run(env, 0)
        # ADR-0009 D5: synchronized launch barrier. If IO_CPU or M_CPU
        # stamped a target_start_ns, wait until then so every PE in this
        # launch begins pe_exec measurement at the same simulated time.
        target_start = getattr(request, "target_start_ns", None)
        if target_start is not None and target_start > env.now:
            yield env.timeout(float(target_start) - env.now)
        kernel_fn = get_kernel(request.kernel_ref.name)
        num_programs = self._derive_num_programs(request)
        kernel_args = self._unpack_kernel_args(request)
@@ -19,7 +19,14 @@ class PageFault(Exception):
 class PeMMU:
-    """Per-PE MMU with page-aligned VA→PA translation table.
+    """Per-PE MMU with sub-page-capable VA→PA translation table.
    Each page-table entry is a list of (start_in_page, end_in_page,
    pa_at_offset_zero) regions. This is a SIMULATOR STOPGAP — real MMUs
    store one PA per page-table entry. Sub-page regions exist here so
    DPPolicy layouts that shard below page granularity (e.g. 128 B
    payloads with 4 KB pages) don't silently mis-route through last-
    write-wins overwrites. Memory note: project_mmu_subpage_stopgap.md.
    Args:
        page_size: Page size in bytes (default 2 MB).
@@ -34,7 +41,11 @@ class PeMMU:
        self._page_size = page_size
        self._page_shift = (page_size - 1).bit_length()
        self._page_mask = page_size - 1
-        self._table: dict[int, int] = {}  # va_page_number → pa_page_base
+        # vpn → list of (start_in_page, end_in_page, pa_at_offset_zero).
        # pa_at_offset_zero is the PA that offset 0 of the page would map
        # to under this region — i.e. translate(off) = pa_at_offset_zero
        # + off when start <= off < end.
        self._table: dict[int, list[tuple[int, int, int]]] = {}
        self._overhead_ns = overhead_ns
    @property
@@ -46,21 +57,67 @@ class PeMMU:
        return len(self._table)
    def map(self, va: int, pa: int, size: int) -> None:
-        """Register VA→PA mapping for a contiguous range."""
-        for off in range(0, size, self._page_size):
-            vpn = (va + off) >> self._page_shift
-            self._table[vpn] = pa + off
+        """Register VA→PA mapping for a contiguous range.
+
+        Sub-page-aware: a single page can hold multiple disjoint regions,
+        each pointing to a different PA. Later map() calls APPEND a new
+        region; on overlap with an existing region, the new region wins
+        for the overlapping offsets (translate iterates in reverse so the
+        last write takes precedence — matches legacy single-PA behavior
+        when a full page is re-mapped).
+        """
+        end_va = va + size
+        cur = va
+        while cur < end_va:
+            vpn = cur >> self._page_shift
+            page_base_va = vpn << self._page_shift
+            page_end_va = page_base_va + self._page_size
+            region_start = cur - page_base_va
+            region_end = min(end_va, page_end_va) - page_base_va
+            # PA seen at offset 0 of page if this region's mapping covered it
+            pa_at_offset_zero = pa + (cur - va) - region_start
+            self._table.setdefault(vpn, []).append(
+                (region_start, region_end, pa_at_offset_zero)
+            )
+            cur = page_base_va + region_end
    def unmap(self, va: int, size: int) -> None:
-        """Remove VA mapping for a contiguous range."""
-        for off in range(0, size, self._page_size):
-            vpn = (va + off) >> self._page_shift
-            self._table.pop(vpn, None)
+        """Remove VA mapping for a contiguous range.
+
+        Drops any region whose extent is contained within the unmapped
+        range. Partial overlaps (region straddles the range boundary)
+        are left in place — caller is expected to unmap on the same
+        boundaries it mapped on.
+        """
+        end_va = va + size
+        cur = va
+        while cur < end_va:
+            vpn = cur >> self._page_shift
+            page_base_va = vpn << self._page_shift
+            page_end_va = page_base_va + self._page_size
+            unmap_start = cur - page_base_va
+            unmap_end = min(end_va, page_end_va) - page_base_va
+            regions = self._table.get(vpn)
+            if regions is not None:
+                kept = [
+                    r for r in regions
+                    if not (r[0] >= unmap_start and r[1] <= unmap_end)
+                ]
+                if kept:
+                    self._table[vpn] = kept
+                else:
+                    del self._table[vpn]
+            cur = page_base_va + unmap_end
    def translate(self, va: int) -> int:
        """Translate VA to PA. Raises PageFault if unmapped."""
        vpn = va >> self._page_shift
-        pa_page_base = self._table.get(vpn)
-        if pa_page_base is None:
+        regions = self._table.get(vpn)
+        if regions is None:
            raise PageFault(va)
-        return pa_page_base + (va & self._page_mask)
+        offset = va & self._page_mask
+        # Iterate latest-first so newer map() calls win on overlap
+        for start, end, pa_at_offset_zero in reversed(regions):
+            if start <= offset < end:
+                return pa_at_offset_zero + offset
+        raise PageFault(va)
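
The sub-page region table can be tried out on its own. The following is a stripped-down sketch that follows the `map`/`translate` logic in this hunk; `TinyMMU` is a hypothetical stand-in for `PeMMU` (no overhead model, no `unmap`), and the 4 KB page size and addresses are illustrative.

```python
class PageFault(Exception):
    pass


class TinyMMU:
    """Sub-page region table: vpn -> list of (start, end, pa_at_offset_zero)."""

    def __init__(self, page_size: int = 4096) -> None:
        self.page_size = page_size
        self.shift = (page_size - 1).bit_length()
        self.table = {}

    def map(self, va: int, pa: int, size: int) -> None:
        cur, end_va = va, va + size
        while cur < end_va:
            vpn = cur >> self.shift
            base = vpn << self.shift
            start = cur - base
            end = min(end_va, base + self.page_size) - base
            # PA that offset 0 of this page would map to under this region
            pa0 = pa + (cur - va) - start
            self.table.setdefault(vpn, []).append((start, end, pa0))
            cur = base + end

    def translate(self, va: int) -> int:
        off = va & (self.page_size - 1)
        # Latest-first, so a newer map() wins on overlapping offsets
        for start, end, pa0 in reversed(self.table.get(va >> self.shift, [])):
            if start <= off < end:
                return pa0 + off
        raise PageFault(va)


mmu = TinyMMU(page_size=4096)
mmu.map(0x1000, 0xA000, 128)  # first 128 B of the page
mmu.map(0x1080, 0xB000, 128)  # next 128 B of the same page, different PA
```

Two 128 B shards in one 4 KB page now resolve to different physical bases, which the previous one-PA-per-page table could not represent: the second `map` call would have overwritten the first.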
@@ -90,6 +90,11 @@ class KernelLaunchMsg:
    args: tuple[KernelArg, ...]
    target_cubes: tuple[int, ...] | Literal["all"] = "all"
    target_pe: int | tuple[int, ...] | Literal["all"] = "all"
    # ADR-0009 D5: synchronized kernel start. When set, each PE_CPU yields
    # until env.now >= target_start_ns before beginning kernel execution,
    # so every PE in a launch starts at the same simulated time regardless
    # of its dispatch path length. Stamped by IO_CPU fan-out (global
    # barrier) or, for direct-to-M_CPU launches, by M_CPU fan-out.
    target_start_ns: float | None = None
    msg_type: Literal["kernel_launch"] = "kernel_launch"
@@ -67,6 +67,10 @@ class GraphEngine:
            spec=graph.spec,
            memory_store=self._memory_store,
            op_logger=self._op_logger,
            node_overhead_ns={
                nid: float(n.attrs.get("overhead_ns", 0.0))
                for nid, n in graph.nodes.items()
            },
        )
        self._components: dict[str, ComponentBase] = {
            node_id: ComponentRegistry.create(node, overrides, ctx)
@@ -0,0 +1,34 @@
 algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
 intercube_allreduce,ring_1d,6,8,16,256,3073.1299999999937
 intercube_allreduce,ring_1d,6,32,64,1024,3079.8799999999947
 intercube_allreduce,ring_1d,6,64,128,2048,3088.879999999992
 intercube_allreduce,ring_1d,6,128,256,4096,3106.8799999999865
 intercube_allreduce,ring_1d,6,512,1024,16384,3225.8799999999865
 intercube_allreduce,ring_1d,6,1024,2048,32768,3391.8799999999865
 intercube_allreduce,ring_1d,6,2048,4096,65536,3723.8799999999865
 intercube_allreduce,ring_1d,6,4096,8192,131072,4387.879999999965
 intercube_allreduce,ring_1d,6,8192,16384,262144,5715.879999999957
 intercube_allreduce,ring_1d,6,16384,32768,524288,8371.879999999932
 intercube_allreduce,ring_1d,6,32768,65536,1048576,13683.879999999903
 intercube_allreduce,torus_2d,6,8,16,256,2190.4799999999923
 intercube_allreduce,torus_2d,6,32,64,1024,2196.479999999993
 intercube_allreduce,torus_2d,6,64,128,2048,2204.4799999999905
 intercube_allreduce,torus_2d,6,128,256,4096,2220.479999999985
 intercube_allreduce,torus_2d,6,512,1024,16384,2325.479999999985
 intercube_allreduce,torus_2d,6,1024,2048,32768,2471.479999999985
 intercube_allreduce,torus_2d,6,2048,4096,65536,2763.479999999985
 intercube_allreduce,torus_2d,6,4096,8192,131072,3347.4799999999777
 intercube_allreduce,torus_2d,6,8192,16384,262144,4515.4799999999705
 intercube_allreduce,torus_2d,6,16384,32768,524288,6851.479999999952
 intercube_allreduce,torus_2d,6,32768,65536,1048576,11523.479999999923
 intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,3508.4249999999993
 intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,3515.55
 intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,3525.0499999999975
 intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,3544.049999999992
 intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3667.049999999992
 intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3837.049999999992
 intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4177.049999999992
 intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,4857.049999999959
 intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,6217.049999999945
 intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,8937.049999999937
 intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,14377.049999999872
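
One way to read the table above is the marginal cost per byte at the large end of the sweep. The snippet below is only an illustrative calculation on rows copied from this CSV; `marginal_ns_per_byte` is a hypothetical helper, not part of the test suite.

```python
# (bytes_per_pe, latency_ns) for the two largest sizes of each topology,
# copied from the summary rows above.
rows = {
    "ring_1d": [(32768, 8371.879999999932), (65536, 13683.879999999903)],
    "torus_2d": [(32768, 6851.479999999952), (65536, 11523.479999999923)],
    "mesh_2d_no_wrap": [(32768, 8937.049999999937), (65536, 14377.049999999872)],
}


def marginal_ns_per_byte(points):
    """Slope of latency vs. bytes_per_pe between two sweep points."""
    (b0, t0), (b1, t1) = points
    return (t1 - t0) / (b1 - b0)


slopes = {topo: marginal_ns_per_byte(pts) for topo, pts in rows.items()}
for topo, slope in slopes.items():
    print(f"{topo}: {slope:.4f} ns per byte_per_pe")
```

On these rows the torus has the smallest slope and the no-wrap mesh the largest, the same ordering as the absolute latencies at every size in the table.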
@@ -0,0 +1,91 @@
 hop,label,size_bytes,path,total_ns
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,ipcq,31.1399999999976
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,raw,12.019999999996799
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,ipcq,32.6399999999976
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,raw,13.019999999996799
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,ipcq,34.1399999999976
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,raw,14.019999999996799
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,ipcq,35.6399999999976
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,raw,15.019999999996799
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,ipcq,38.6399999999976
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,raw,17.0199999999968
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,ipcq,41.6399999999976
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,raw,19.0199999999968
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,ipcq,53.6399999999976
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,raw,27.0199999999968
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,ipcq,77.6399999999976
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,raw,43.0199999999968
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,ipcq,125.64000000000306
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,raw,75.02000000000407
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,ipcq,149.64000000000306
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,raw,91.02000000000407
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,ipcq,31.1399999999976
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,raw,12.019999999996799
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,ipcq,32.6399999999976
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,raw,13.019999999996799
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,ipcq,34.1399999999976
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,raw,14.019999999996799
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,ipcq,35.6399999999976
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,raw,15.019999999996799
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,ipcq,38.6399999999976
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,raw,17.0199999999968
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,ipcq,41.6399999999976
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,raw,19.0199999999968
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,ipcq,53.6399999999976
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,raw,27.0199999999968
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,ipcq,77.6399999999976
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,raw,43.0199999999968
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,ipcq,125.64000000000306
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,raw,75.02000000000407
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,ipcq,149.64000000000306
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,raw,91.02000000000407
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,ipcq,67.15999999999804
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,raw,68.53999999999724
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,ipcq,68.65999999999804
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,raw,70.03999999999724
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,ipcq,70.15999999999804
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,raw,71.53999999999724
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,ipcq,71.65999999999804
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,raw,73.03999999999724
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,ipcq,74.65999999999804
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,raw,76.03999999999724
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,ipcq,77.65999999999804
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,raw,79.03999999999724
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,ipcq,89.65999999999804
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,raw,91.03999999999724
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,ipcq,113.65999999999804
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,raw,115.03999999999724
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,ipcq,161.65999999999985
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,raw,163.04000000000087
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,ipcq,185.65999999999985
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,raw,187.04000000000087
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,ipcq,87.15999999999804
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,raw,88.53999999999724
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,ipcq,88.65999999999804
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,raw,90.03999999999724
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,ipcq,90.15999999999804
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,raw,91.53999999999724
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,ipcq,91.65999999999804
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,raw,93.03999999999724
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,ipcq,94.65999999999804
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,raw,96.03999999999724
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,ipcq,97.65999999999804
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,raw,99.03999999999724
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,ipcq,109.65999999999804
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,raw,111.03999999999724
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,ipcq,133.65999999999804
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,raw,135.03999999999724
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,ipcq,181.65999999999985
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,raw,183.04000000000087
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,ipcq,205.65999999999985
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,raw,207.04000000000087
 h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",128,ipcq,6.015000000003056
 h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",256,ipcq,6.515000000003056
 h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",384,ipcq,7.015000000003056
 h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",512,ipcq,7.515000000003056
 h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",768,ipcq,8.515000000003056
 h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",1024,ipcq,9.515000000003056
 h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",2048,ipcq,13.515000000003056
 h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",4096,ipcq,21.515000000003056
 h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",8192,ipcq,37.51499999999214
 h5_inter_sip,"Inter-SIP (sip0 to sip1, same cube/pe)",10240,ipcq,45.51499999999214
@@ -22,13 +22,23 @@ from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
 from kernbench.policy.placement.dp import DPPolicy
-def _sip_topo_dims(sip_topo: str, n_sips: int) -> tuple[int, int]:
+def _sip_topo_dims(
    sip_topo: str, n_sips: int,
    spec_w: int | None = None, spec_h: int | None = None,
 ) -> tuple[int, int]:
    if sip_topo == "ring_1d":
        return (0, 0)
    if spec_w is not None and spec_h is not None:
        if spec_w * spec_h != n_sips:
            raise ValueError(
                f"sip layout {spec_w}x{spec_h} != n_sips ({n_sips})"
            )
        return (spec_w, spec_h)
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError(
-            f"SIP topology '{sip_topo}' requires square n_sips, got {n_sips}"
+            f"SIP topology '{sip_topo}' requires square n_sips or "
+            f"explicit w/h in spec, got {n_sips}"
        )
    return (side, side)
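The resolution order in this hunk (ring first, then explicit w/h, then the square fallback) can be exercised in isolation. The following is a standalone sketch that mirrors the logic shown above; `sip_topo_dims_sketch` is a hypothetical name used only here, not a function in the repository.

```python
import math


def sip_topo_dims_sketch(sip_topo, n_sips, spec_w=None, spec_h=None):
    """Standalone copy of the dimension-resolution order in the hunk above."""
    if sip_topo == "ring_1d":
        return (0, 0)  # a 1-D ring carries no 2-D dimensions
    if spec_w is not None and spec_h is not None:
        # Explicit rectangular layout: it must cover every SIP exactly.
        if spec_w * spec_h != n_sips:
            raise ValueError(
                f"sip layout {spec_w}x{spec_h} != n_sips ({n_sips})"
            )
        return (spec_w, spec_h)
    # Back-compat fallback: only square SIP counts are accepted.
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError(
            f"SIP topology '{sip_topo}' requires square n_sips or "
            f"explicit w/h in spec, got {n_sips}"
        )
    return (side, side)


print(sip_topo_dims_sketch("ring_1d", 6))         # (0, 0)
print(sip_topo_dims_sketch("torus_2d", 6, 2, 3))  # (2, 3)
print(sip_topo_dims_sketch("torus_2d", 4))        # (2, 2)
```

Calling it with a non-square count and no w/h, for example `sip_topo_dims_sketch("torus_2d", 6)`, raises `ValueError`, which is why the 6-SIP configs below pass 2 and 3 explicitly.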
@@ -54,10 +64,13 @@ def run_allreduce(
    topo_name_to_kind = algo_module.TOPO_NAME_TO_KIND
    n_elem = int(cfg.get("n_elem", 8))
-    n_sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
-    sip_topo = str(
-        spec.get("system", {}).get("sips", {}).get("topology", "ring_1d")
-    )
+    sips_cfg = spec.get("system", {}).get("sips", {})
+    n_sips = int(sips_cfg.get("count", 1))
+    sip_topo = str(sips_cfg.get("topology", "ring_1d"))
+    spec_sip_w = sips_cfg.get("w")
+    spec_sip_h = sips_cfg.get("h")
+    spec_sip_w = int(spec_sip_w) if spec_sip_w is not None else None
+    spec_sip_h = int(spec_sip_h) if spec_sip_h is not None else None
    cm = spec["sip"]["cube_mesh"]
    cube_w = int(cm["w"])
@@ -65,7 +78,9 @@ def run_allreduce(
    n_cubes = cube_w * cube_h
    sip_topo_kind = topo_name_to_kind.get(sip_topo, 0)
-    sip_topo_w, sip_topo_h = _sip_topo_dims(sip_topo, n_sips)
+    sip_topo_w, sip_topo_h = _sip_topo_dims(
+        sip_topo, n_sips, spec_w=spec_sip_w, spec_h=spec_sip_h,
+    )
    algo_name = cfg.get("algorithm", "allreduce")
    print(f"\n{'=' * 60}")
@@ -173,18 +188,36 @@ from kernbench.topology.builder import resolve_topology
 TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
 CONFIGS = [
-    pytest.param("intercube_allreduce", "ring_1d", 2, id="ring_2sip"),
-    pytest.param("intercube_allreduce", "torus_2d", 4, id="torus_4sip"),
-    pytest.param("intercube_allreduce", "mesh_2d_no_wrap", 4, id="mesh_4sip"),
+    pytest.param(
+        "intercube_allreduce", "ring_1d", 6, None, None,
+        id="ring_6sip",
+    ),
+    pytest.param(
+        "intercube_allreduce", "torus_2d", 6, 2, 3,
+        id="torus_6sip_2x3",
+    ),
+    pytest.param(
+        "intercube_allreduce", "mesh_2d_no_wrap", 6, 2, 3,
+        id="mesh_6sip_2x3",
+    ),
 ]
-def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
+def _write_temp_configs(
+    tmp_path, sip_topology, n_sips, algorithm, n_elem_override=None,
+    sip_w=None, sip_h=None,
+):
    """Write temp topology.yaml and ccl.yaml with the given overrides."""
    with open(TOPOLOGY_PATH) as f:
        topo_cfg = yaml.safe_load(f)
    topo_cfg["system"]["sips"]["count"] = n_sips
    topo_cfg["system"]["sips"]["topology"] = sip_topology
    if sip_w is not None and sip_h is not None:
        topo_cfg["system"]["sips"]["w"] = int(sip_w)
        topo_cfg["system"]["sips"]["h"] = int(sip_h)
    else:
        topo_cfg["system"]["sips"].pop("w", None)
        topo_cfg["system"]["sips"].pop("h", None)
    topo_path = tmp_path / "topology.yaml"
    with open(topo_path, "w") as f:
        yaml.dump(topo_cfg, f, default_flow_style=False)
@@ -193,6 +226,15 @@ def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
    with open(ccl_path) as f:
        ccl_cfg = yaml.safe_load(f)
    ccl_cfg["defaults"]["algorithm"] = algorithm
    if n_elem_override is not None:
        ccl_cfg.setdefault("algorithms", {}).setdefault(
            algorithm, {},
        )["n_elem"] = int(n_elem_override)
        # Ensure IPCQ slot is big enough for the per-message payload.
        per_msg_bytes = int(n_elem_override) * 2  # f16
        default_slot = int(ccl_cfg["defaults"].get("slot_size", 4096))
        if per_msg_bytes > default_slot:
            ccl_cfg["defaults"]["slot_size"] = per_msg_bytes
    tmp_ccl = tmp_path / "ccl.yaml"
    with open(tmp_ccl, "w") as f:
        yaml.dump(ccl_cfg, f, default_flow_style=False)
@@ -200,10 +242,15 @@ def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
    return str(topo_path), str(tmp_ccl)
-@pytest.mark.parametrize("algorithm,sip_topology,n_sips", CONFIGS)
-def test_allreduce(tmp_path, algorithm, sip_topology, n_sips):
+@pytest.mark.parametrize(
+    "algorithm,sip_topology,n_sips,sip_w,sip_h", CONFIGS,
+)
+def test_allreduce(
+    tmp_path, algorithm, sip_topology, n_sips, sip_w, sip_h,
+):
    topo_path, ccl_path = _write_temp_configs(
        tmp_path, sip_topology, n_sips, algorithm,
        sip_w=sip_w, sip_h=sip_h,
    )
    topo = resolve_topology(topo_path)
    engine = GraphEngine(topo.topology_obj, enable_data=True)
@@ -220,3 +267,163 @@ def test_allreduce(tmp_path, algorithm, sip_topology, n_sips):
            algorithm=algorithm, ccl_yaml=ccl_path,
        )
        assert result["ok_cubes"] > 0
 # ── Latency sweep ─────────────────────────────────────────────────────
 # n_elem values skip 16 (== n_cubes, dim_map collision). The sweep goes up
 # to 1 MB per SIP: bytes_per_sip = n_cubes * n_elem * 2 = 32 * n_elem.
 _SWEEP_N_ELEM = [
    8, 32, 64, 128, 512, 1024, 2048,
    4096, 8192, 16384, 32768,
 ]
 _ELEM_BYTES_F16 = 2
 def test_allreduce_latency_sweep(tmp_path):
    """Sweep n_elem across each SIP topology; record max(pe_exec_ns)
    as the critical-path kernel latency. Emits CSV + PNG plots to
    tests/allreduce_latency_plots/.
    """
    import csv
    import matplotlib.pyplot as plt
    from matplotlib.ticker import FuncFormatter
    def _fmt_bytes(x, _pos):
        """Format tick as B / KB / MB."""
        if x <= 0:
            return "0"
        if x >= 1024 * 1024:
            return f"{x / (1024 * 1024):.0f} MB"
        if x >= 1024:
            return f"{x / 1024:.0f} KB"
        return f"{x:.0f} B"
    _bytes_fmt = FuncFormatter(_fmt_bytes)
    out_dir = Path(__file__).parent / "allreduce_latency_plots"
    out_dir.mkdir(parents=True, exist_ok=True)
    records: list[dict] = []
    # Apples-to-apples: same n_sips across all three topologies.
    for algorithm, sip_topology, n_sips, sip_w, sip_h in [
        ("intercube_allreduce", "ring_1d", 6, None, None),
        ("intercube_allreduce", "torus_2d", 6, 2, 3),
        ("intercube_allreduce", "mesh_2d_no_wrap", 6, 2, 3),
    ]:
        for n_elem in _SWEEP_N_ELEM:
            sub = tmp_path / f"{sip_topology}_{n_elem}"
            sub.mkdir()
            topo_path, ccl_path = _write_temp_configs(
                sub, sip_topology, n_sips, algorithm,
                sip_w=sip_w, sip_h=sip_h,
                n_elem_override=n_elem,
            )
            topo = resolve_topology(topo_path)
            engine = GraphEngine(topo.topology_obj, enable_data=True)
            spec = topo.topology_obj.spec
            with RuntimeContext(
                engine=engine,
                target_device=DeviceSelector("all"),
                correlation_id=f"sweep_{algorithm}_{sip_topology}_{n_elem}",
                spec=spec,
            ) as ctx:
                result = run_allreduce(
                    ctx, engine, spec,
                    algorithm=algorithm, ccl_yaml=ccl_path,
                )
                assert result["ok_cubes"] > 0
            pe_exec_vals = [
                float(tr.get("pe_exec_ns", 0.0) or 0.0)
                for _, (_, tr) in engine._results.items()
                if isinstance(tr, dict)
            ]
            crit_ns = max(pe_exec_vals) if pe_exec_vals else 0.0
            cm = spec["sip"]["cube_mesh"]
            n_cubes = int(cm["w"]) * int(cm["h"])
            bytes_per_sip = n_cubes * n_elem * _ELEM_BYTES_F16
            # pe="replicate" + num_pes=1 → one active PE per cube owns
            # the whole cube row. Per-PE bytes == per-cube-tile bytes ==
            # per-message bytes over the IPCQ fabric.
            bytes_per_pe = n_elem * _ELEM_BYTES_F16
            records.append({
                "algorithm": algorithm,
                "sip_topology": sip_topology,
                "n_sips": n_sips,
                "n_elem": n_elem,
                "bytes_per_pe": bytes_per_pe,
                "bytes_per_sip": bytes_per_sip,
                "latency_ns": crit_ns,
            })
            print(
                f"[{sip_topology:<16} n_sips={n_sips} n_elem={n_elem:>5}  "
                f"bytes/pe={bytes_per_pe:>7}  bytes/sip={bytes_per_sip:>9}]  "
                f"pe_exec_max = {crit_ns:8.1f} ns"
            )
    with open(out_dir / "summary.csv", "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=[
            "algorithm", "sip_topology", "n_sips", "n_elem",
            "bytes_per_pe", "bytes_per_sip", "latency_ns",
        ])
        w.writeheader()
        for r in records:
            w.writerow(r)
    topologies = sorted({r["sip_topology"] for r in records})
    # Per-topology plots, log-scale x-axis = bytes per PE.
    for topo_name in topologies:
        rs = sorted(
            [r for r in records if r["sip_topology"] == topo_name],
            key=lambda r: r["bytes_per_pe"],
        )
        xs = [r["bytes_per_pe"] for r in rs]
        ys = [r["latency_ns"] for r in rs]
        title = (
            f"Allreduce latency — {topo_name} "
            f"(n_sips={rs[0]['n_sips']})"
        )
        fig, ax = plt.subplots(figsize=(8, 5))
        ax.plot(xs, ys, marker="o", color="tab:blue")
        ax.set_xscale("log", base=2)
        ax.set_xlabel("Bytes per PE (log scale)")
        ax.set_ylabel("max pe_exec_ns (critical path)")
        ax.set_title(title)
        ax.grid(True, alpha=0.3)
        ax.xaxis.set_major_formatter(_bytes_fmt)
        fig.tight_layout()
        fig.savefig(out_dir / f"{topo_name}.png", dpi=120)
        plt.close(fig)
    colors = {"ring_1d": "tab:blue", "torus_2d": "tab:orange",
              "mesh_2d_no_wrap": "tab:green"}
    fig, ax = plt.subplots(figsize=(9, 6))
    for topo_name in topologies:
        rs = sorted(
            [r for r in records if r["sip_topology"] == topo_name],
            key=lambda r: r["bytes_per_pe"],
        )
        ax.plot(
            [r["bytes_per_pe"] for r in rs],
            [r["latency_ns"] for r in rs],
            marker="o",
            label=f"{topo_name} (n_sips={rs[0]['n_sips']})",
            color=colors.get(topo_name),
        )
    ax.set_xscale("log", base=2)
    ax.set_xlabel("Bytes per PE (log scale)")
    ax.set_ylabel("max pe_exec_ns (critical path)")
    ax.set_title("Multi-device allreduce latency by topology")
    ax.grid(True, alpha=0.3)
    ax.legend()
    ax.xaxis.set_major_formatter(_bytes_fmt)
    fig.tight_layout()
    fig.savefig(out_dir / "overview.png", dpi=120)
    plt.close(fig)
    print(f"\nWrote {out_dir / 'overview.png'}")
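The size arithmetic behind the sweep is easy to check by hand. The sketch below recomputes bytes per PE, bytes per SIP, and the IPCQ slot size that `_write_temp_configs` would end up with for each sweep point. It assumes `n_cubes = 16` (taken from the "16 (== n_cubes)" comment) and a configured default slot of 4096 bytes (the fallback in `.get("slot_size", 4096)`); a real `ccl.yaml` may set a different default.

```python
ELEM_BYTES_F16 = 2
N_CUBES = 16              # assumption: from the "16 (== n_cubes)" comment
DEFAULT_SLOT_SIZE = 4096  # assumption: fallback used when ccl.yaml has none
SWEEP_N_ELEM = [8, 32, 64, 128, 512, 1024, 2048, 4096, 8192, 16384, 32768]


def sweep_sizes(n_elem, n_cubes=N_CUBES, default_slot=DEFAULT_SLOT_SIZE):
    """Return (bytes_per_pe, bytes_per_sip, slot_size) for one sweep point."""
    bytes_per_pe = n_elem * ELEM_BYTES_F16
    bytes_per_sip = n_cubes * bytes_per_pe
    # The slot is only enlarged when one message no longer fits in it.
    slot_size = max(default_slot, bytes_per_pe)
    return bytes_per_pe, bytes_per_sip, slot_size


table = [(n, *sweep_sizes(n)) for n in SWEEP_N_ELEM]
for n_elem, per_pe, per_sip, slot in table:
    print(f"n_elem={n_elem:>5}  bytes/pe={per_pe:>6}  "
          f"bytes/sip={per_sip:>8}  slot_size={slot:>6}")
```

Under these assumptions the slot override only kicks in from `n_elem = 4096` upward, and the largest point (`n_elem = 32768`) moves 1,048,576 bytes per SIP.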
@@ -0,0 +1,194 @@
 """ADR-0009 D5 invariant: all PEs targeted by a single kernel launch MUST
 begin executing the kernel body at the same simulated time, regardless of
 their dispatch path length.
 These tests directly verify the invariant by capturing per-PE state at the
 top of `_execute_kernel`:
  test_no_pe_arrives_after_target_start_ns
      Asserts: for every PE that enters _execute_kernel during a multi-cube
      launch, `env.now` at entry must be <= target_start_ns. Otherwise the
      PE's barrier yield would be a no-op and `pe_exec_start` would be set
      late, breaking the D5 "same simulated time" mandate.
  test_all_pes_have_identical_pe_exec_start
      Asserts: every PE's `pe_exec_start` (the value of `env.now` recorded
      immediately AFTER the barrier yield) is identical across all PEs in
      the launch.
 Both tests are expected to FAIL today and become the regression check the
 Phase 2 D5 predictor + fallback fix must make pass.
 """
 from __future__ import annotations
 from pathlib import Path
 import numpy as np
 import pytest
 from kernbench.policy.placement.dp import DPPolicy
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
 def _capture_per_pe_d5_state():
    """Monkey-patch PeCpuComponent._execute_kernel to record, per PE:
      - entry_now: env.now at function entry (before any yield)
      - target_start_ns: the value carried by the request
      - barrier_yielded: True if the barrier yield fires (entry_now < target)
      - pe_exec_start: inferred start time: target_start_ns if the barrier
                       yields, else entry_now (the original function is
                       not instrumented, so this is an approximation)
      - late_ns: how far past target_start_ns the PE entered (0.0 if on time)
    Returns (records: list[dict], restore: callable).
    """
    import kernbench.components.builtin.pe_cpu as pe_cpu_mod
    records: list[dict] = []
    original = pe_cpu_mod.PeCpuComponent._execute_kernel
    def patched(self, env, txn):
        request = txn.request
        target_start = getattr(request, "target_start_ns", None)
        entry_now = float(env.now)
        rec = {
            "node_id": self.node.id,
            "entry_now": entry_now,
            "target_start_ns": (
                float(target_start) if target_start is not None else None
            ),
            "barrier_yielded": (
                target_start is not None
                and float(target_start) > entry_now
            ),
            "pe_exec_start": None,  # filled below by sniff
            "late_ns": (
                None if target_start is None
                else max(0.0, entry_now - float(target_start))
            ),
        }
        records.append(rec)
        # We can't easily inject a callback at the original's
        # `pe_exec_start = env.now` line without rewriting it. Approximate:
        # if the original yields the barrier, env.now after the yield is
        # target_start_ns; otherwise pe_exec_start is entry_now (skipped).
        if rec["barrier_yielded"]:
            rec["pe_exec_start"] = float(target_start)
        else:
            rec["pe_exec_start"] = entry_now
        # Propagate the original generator's return value to the caller.
        return (yield from original(self, env, txn))
    pe_cpu_mod.PeCpuComponent._execute_kernel = patched
    def restore():
        pe_cpu_mod.PeCpuComponent._execute_kernel = original
    return records, restore
 def _run_multicube_launch():
    """Drive a no-op kernel launch across all 16 cubes x 8 PEs and return
    the per-PE D5 records collected by the monkey-patch."""
    records, restore = _capture_per_pe_d5_state()
    try:
        topo = resolve_topology(str(TOPOLOGY_PATH))
        engine = GraphEngine(topo.topology_obj, enable_data=True)
        spec = topo.topology_obj.spec
        with RuntimeContext(
            engine=engine, target_device=DeviceSelector("all"),
            correlation_id="d5_barrier", spec=spec,
        ) as ctx:
            dp = DPPolicy(
                cube="row_wise", pe="column_wise",
                num_cubes=16, num_pes=8,
            )
            def kernel(t_ptr, n_elem, tl):
                pass  # no-op
            ctx.ahbm.set_device(0)
            t = ctx.zeros(
                (16, 8 * 64), dtype="f16", dp=dp, name="probe",
            )
            t.copy_(ctx.from_numpy(
                np.zeros((16, 8 * 64), dtype=np.float16),
            ))
            pending = ctx.launch(
                "d5_probe", kernel, t, 64, _defer_wait=True,
            )
            for h, _sip, meta in pending:
                ctx.wait(h, _meta=meta)
    finally:
        restore()
    return records
 def test_no_pe_arrives_after_target_start_ns():
    """ADR-0009 D5: no PE may enter `_execute_kernel` after target_start_ns.
    Today this fails because IO_CPU's predictor under-shoots actual
    dispatch latency for far cubes (cube4, cube9-15). Phase 2 fix:
    chain-aware predictor in IO_CPU + monotonic upward re-stamp in M_CPU.
    """
    records = _run_multicube_launch()
    assert records, "expected per-PE _execute_kernel records"
    late = [
        r for r in records
        if r["target_start_ns"] is not None
        and r["late_ns"] is not None
        and r["late_ns"] > 1e-6
    ]
    if late:
        # Provide actionable diagnostic in the failure.
        worst = sorted(late, key=lambda r: -r["late_ns"])[:5]
        details = "\n".join(
            f"  {r['node_id']}: late by {r['late_ns']:.2f} ns "
            f"(entry_now={r['entry_now']:.2f}, "
            f"target_start_ns={r['target_start_ns']:.2f})"
            for r in worst
        )
        pytest.fail(
            f"ADR-0009 D5 violated: {len(late)}/{len(records)} PEs "
            f"entered _execute_kernel AFTER target_start_ns "
            f"(barrier yield silently skipped). "
            f"Worst offenders:\n{details}"
        )
 def test_all_pes_have_identical_pe_exec_start():
    """ADR-0009 D5: every PE's pe_exec_start must be identical.
    With D5 honored, every PE either yields to target_start_ns (start =
    target_start_ns) or, if late, would still be aligned by the M_CPU
    upward re-stamp (Phase 2). Today: 75/128 PEs in this launch have
    distinct pe_exec_start values because they skipped the barrier.
    """
    records = _run_multicube_launch()
    assert records, "expected per-PE _execute_kernel records"
    starts = sorted({round(r["pe_exec_start"], 6) for r in records})
    if len(starts) > 1:
        spread = max(starts) - min(starts)
        # Distribution of how many PEs at each distinct start time
        from collections import Counter
        bucket = Counter(round(r["pe_exec_start"], 6) for r in records)
        details = "\n".join(
            f"  pe_exec_start={t}: {n} PEs"
            for t, n in sorted(bucket.items())
        )
        pytest.fail(
            f"ADR-0009 D5 violated: PEs have {len(starts)} distinct "
            f"pe_exec_start values (spread = {spread:.2f} ns); "
            f"D5 mandates a single common value. "
            f"Distribution:\n{details}"
        )
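The two tests above rest on a simple model of the barrier: a PE that arrives before `target_start_ns` waits until that time, and a PE that arrives after it starts immediately. The sketch below reproduces that model in plain Python. The node names and arrival times are hypothetical values chosen for illustration, not measurements from the simulator.

```python
def barrier_start(entry_now, target_start_ns):
    """Start time of the kernel body: early PEs wait, late PEs start at once."""
    return max(entry_now, target_start_ns)


def late_ns(entry_now, target_start_ns):
    """How far past the target a PE arrived (0.0 if it was on time)."""
    return max(0.0, entry_now - target_start_ns)


def distinct_starts(arrivals, target_start_ns):
    """Sorted set of start times across all PEs for one launch."""
    return sorted(
        {barrier_start(t, target_start_ns) for t in arrivals.values()}
    )


# Hypothetical dispatch arrival times in ns (not simulator output).
arrivals = {"cube0.pe0": 12.0, "cube4.pe0": 31.5, "cube15.pe7": 48.0}

# A target that undershoots the slowest path leaves late PEs unaligned.
undershoot = distinct_starts(arrivals, target_start_ns=30.0)
# A target at least as large as the slowest arrival aligns every PE.
aligned = distinct_starts(arrivals, target_start_ns=max(arrivals.values()))

print(undershoot)  # [30.0, 31.5, 48.0]
print(aligned)     # [48.0]
```

This is the same inference the monkey-patch uses for `pe_exec_start`, and it shows why a single common start time requires the target to cover the longest dispatch path.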
@@ -0,0 +1,62 @@
 """ADR-0009 D5: synchronized launch barrier.
 M_CPU stamps KernelLaunchMsg with target_start_ns = env.now + max path
 latency; PE_CPU yields until that time before recording pe_exec_start.
 Every PE in a single launch MUST begin kernel execution at the same
 env.now regardless of its dispatch path length.
 We verify this indirectly: for a no-op kernel, pe_exec_ns = env.now -
 pe_exec_start. If every PE's pe_exec_start is identical and every PE
 runs the same no-op body, every pe_exec_ns value must be identical.
 Without D5, pe_exec_start varies by dispatch-path length and so does
 pe_exec_ns.
 """
 from __future__ import annotations
 from pathlib import Path
 import numpy as np
 from kernbench.policy.placement.dp import DPPolicy
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
 def test_kernel_launch_sync_all_pes_have_equal_exec_time():
    """No-op kernel: every PE's pe_exec_ns must be identical under D5."""
    topo = resolve_topology(str(TOPOLOGY_PATH))
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    spec = topo.topology_obj.spec
    with RuntimeContext(engine=engine, target_device=DeviceSelector("all"),
                        correlation_id="sync_test", spec=spec) as ctx:
        dp = DPPolicy(cube="row_wise", pe="column_wise",
                      num_cubes=16, num_pes=8)
        def kernel(t_ptr, n_elem, tl):
            pass  # no-op
        ctx.ahbm.set_device(0)
        t = ctx.zeros((16, 8 * 64), dtype="f16", dp=dp, name="probe")
        t.copy_(ctx.from_numpy(np.zeros((16, 8 * 64), dtype=np.float16)))
        pending = ctx.launch("sync_probe", kernel, t, 64, _defer_wait=True)
        for h, _sip, meta in pending:
            ctx.wait(h, _meta=meta)
        pe_exec_vals = []
        for h, _sip, _meta in pending:
            _, trace = engine.get_completion(h)
            if trace and trace.get("pe_exec_ns") is not None:
                pe_exec_vals.append(float(trace["pe_exec_ns"]))
    assert pe_exec_vals, "expected completion traces with pe_exec_ns"
    spread = max(pe_exec_vals) - min(pe_exec_vals)
    assert spread < 1e-6, (
        f"ADR-0009 D5 violated: pe_exec_ns spread across PEs = "
        f"{spread:.6f} ns (expected 0). Values: {pe_exec_vals}"
    )
@@ -0,0 +1,741 @@
 """Diagnostic for the inter-cube RAW > IPCQ asymmetry on h3/h4 plots.
 Single-shot run at h3 (sip0.cube0.pe0 -> sip0.cube1.pe0), nbytes=4096.
 Captures per-PE pe_exec_ns and the actual path / drain / per-node overhead
 breakdown for the RAW sub-txn (PE_DMA -> remote HBM_CTRL) vs the IPCQ
 outbound sub-txn (PE_DMA -> peer PE_DMA), so we can localize the gap to
 one of:
    (a) drain at HBM-BW (RAW) vs fabric-BW (IPCQ)
    (b) path-length / per-node overhead asymmetry
    (c) RAW SRC paying tl.load (local HBM read) on top of remote tl.store
        while IPCQ DST only pays inbound traversal+drain.
 Phase 1 / test-only. No production code is modified.
 """
 from __future__ import annotations
 import os
 from pathlib import Path
 import numpy as np
 import pytest
 from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
 from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
 from kernbench.policy.placement.dp import DPPolicy
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
 # Allow the test to be re-run for h4 (inter-cube vertical) at multiple sizes
 # to investigate why IPCQ slope flattens past 8192 B (path may differ).
 NBYTES = int(os.environ.get("DIAG_NBYTES", "4096"))
 ELEM_BYTES = 2
 N_ELEM = NBYTES // ELEM_BYTES
 N_CUBES = 16
 N_PES = 8
 HOP = os.environ.get("DIAG_HOP", "h3")
 if HOP == "h4":
    SRC = (0, 0, 0)
    DST = (0, 4, 0)  # h4 inter-cube vertical
 else:
    SRC = (0, 0, 0)
    DST = (0, 1, 0)  # h3 inter-cube horizontal
 # ── Per-PE pe_exec_ns capture via monkey-patch ───────────────────────
 def _install_barrier_capture():
    """Wrap PeCpuComponent._execute_kernel to log, for every PE that
    enters: env.now at entry, the target_start_ns the request carried,
    whether the barrier yield was skipped (target_start_ns missing or
    <= env.now), and how late the PE entered (delta_late_ns).
    Returns (log, restore_callable).
    """
    import kernbench.components.builtin.pe_cpu as pe_cpu_mod
    log: list[dict] = []
    original = pe_cpu_mod.PeCpuComponent._execute_kernel
    def patched(self, env, txn):
        request = txn.request
        target_start = getattr(request, "target_start_ns", None)
        entry_now = float(env.now)
        log_entry = {
            "node_id": self.node.id,
            "entry_now": entry_now,
            "target_start_ns": (
                float(target_start) if target_start is not None else None
            ),
            "barrier_skipped": (
                target_start is None
                or float(target_start) <= entry_now
            ),
            "delta_late_ns": (
                None if target_start is None
                else max(0.0, entry_now - float(target_start))
            ),
        }
        log.append(log_entry)
        # Propagate the original generator's return value to the caller.
        return (yield from original(self, env, txn))
    pe_cpu_mod.PeCpuComponent._execute_kernel = patched
    def restore():
        pe_cpu_mod.PeCpuComponent._execute_kernel = original
    return log, restore
 def _install_per_pe_capture():
    """Wrap PeCpuComponent._execute_kernel so we record (node_id ->
    pe_exec_ns) for every PE that executes a kernel during the run.
    Returns (capture_dict, restore_callable).
    """
    import kernbench.components.builtin.pe_cpu as pe_cpu_mod
    captured: dict[str, float] = {}
    original = pe_cpu_mod.PeCpuComponent._execute_kernel
    def patched(self, env, txn):
        gen = original(self, env, txn)
        try:
            value = yield from gen
        finally:
            v = txn.result_data.get("pe_exec_ns")
            if v is not None:
                captured[self.node.id] = float(v)
        return value
    pe_cpu_mod.PeCpuComponent._execute_kernel = patched
    def restore():
        pe_cpu_mod.PeCpuComponent._execute_kernel = original
    return captured, restore
 def _install_recv_capture(target_node_id: str):
    """Wrap PeIpcqComponent._handle_recv to log entry/exit times and the
    peer_head_cache/my_tail values seen at the start.
    This pins down whether recv ever blocked on a wait_event, or whether
    it consumed without waiting (i.e. peer_head_cache > my_tail at entry).
    """
    import kernbench.components.builtin.pe_ipcq as pe_ipcq_mod
    log: list[dict] = []
    original = pe_ipcq_mod.PeIpcqComponent._handle_recv
    def patched(self, env, req, cmd):
        if self.node.id != target_node_id:
            yield from original(self, env, req, cmd)
            return
        # Snapshot state before dispatch
        d = cmd.direction
        qp = self._queue_pairs.get(d, {})
        log.append({
            "phase": "enter",
            "t": float(env.now),
            "direction": d,
            "peer_head_cache": qp.get("peer_head_cache"),
            "my_tail": qp.get("my_tail"),
        })
        yield from original(self, env, req, cmd)
        qp = self._queue_pairs.get(d, {})
        log.append({
            "phase": "exit",
            "t": float(env.now),
            "direction": d,
            "peer_head_cache": qp.get("peer_head_cache"),
            "my_tail": qp.get("my_tail"),
        })
    pe_ipcq_mod.PeIpcqComponent._handle_recv = patched
    def restore():
        pe_ipcq_mod.PeIpcqComponent._handle_recv = original
    return log, restore
 def _install_meta_arrival_capture(target_node_id: str):
    """Log every IpcqMetaArrival that lands on ``target_node_id`` PE_IPCQ.
    Records (env_now, sender_seq, dst_addr, matched_direction,
    peer_head_cache_before, my_tail_before).
    """
    import kernbench.components.builtin.pe_ipcq as pe_ipcq_mod
    log: list[dict] = []
    original = pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival
    def patched(self, msg):
        if self.node.id == target_node_id:
            token = msg.token
            now = float(self._env.now) if hasattr(self, "_env") else 0.0
            # _env is normally not stored on the component, so read the
            # reference SimPy keeps on the inbox (self._inbox._env) instead.
            try:
                now = float(self._inbox._env.now)
            except Exception:
                pass
            entry = {
                "t": now,
                "sender_seq": getattr(token, "sender_seq", None),
                "dst_addr": getattr(token, "dst_addr", None),
                "src_sip": getattr(token, "src_sip", None),
                "src_cube": getattr(token, "src_cube", None),
                "src_pe": getattr(token, "src_pe", None),
                "src_direction": getattr(token, "src_direction", None),
                "nbytes": getattr(token, "nbytes", None),
                "matched_direction": None,
                "peer_head_cache_before": {},
                "my_tail_before": {},
            }
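            dst_addr = entry["dst_addr"]
            for d, qp in self._queue_pairs.items():
                entry["peer_head_cache_before"][d] = qp["peer_head_cache"]
                entry["my_tail_before"][d] = qp["my_tail"]
                base = qp["my_rx_base_pa"]
                size = qp["n_slots"] * qp["slot_size"]
                # dst_addr is None when the token lacks the attribute;
                # skip the range check instead of raising TypeError.
                if dst_addr is not None and base <= dst_addr < base + size:
                    entry["matched_direction"] = d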
            log.append(entry)
        return original(self, msg)
    pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival = patched
    def restore():
        pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival = original
    return log, restore
 def _snapshot_qp_state(engine, target_node_id: str) -> dict:
    """Snapshot every direction's qp state on the target PE_IPCQ now.
    Captures peer_head_cache, my_tail, my_rx_base_pa, n_slots, slot_size
    for each installed direction.
    """
    comp = engine._components.get(target_node_id)
    if comp is None:
        return {}
    return {
        d: {
            "peer_head_cache": qp["peer_head_cache"],
            "my_tail": qp["my_tail"],
            "my_rx_base_pa": qp["my_rx_base_pa"],
            "n_slots": qp["n_slots"],
            "slot_size": qp["slot_size"],
            "rx_range": (
                qp["my_rx_base_pa"],
                qp["my_rx_base_pa"] + qp["n_slots"] * qp["slot_size"],
            ),
        }
        for d, qp in comp.queue_pairs.items()
    }
 # ── Path / drain breakdown using engine ctx ──────────────────────────
 def _path_breakdown(ctx, path: list[str], nbytes: int) -> dict:
    edge_total_ns = 0.0
    edge_details = []
    min_bw = float("inf")
    for i in range(len(path) - 1):
        edge = ctx.edge_map.get((path[i], path[i + 1]))
        if edge is None:
            edge_details.append((path[i], path[i + 1], None, None, None))
            continue
        prop_ns = edge.distance_mm * ctx.ns_per_mm
        edge_total_ns += prop_ns
        bw = getattr(edge, "bw_gbs", None) or 0.0
        if bw > 0 and bw < min_bw:
            min_bw = bw
        edge_details.append(
            (path[i], path[i + 1], edge.distance_mm, prop_ns, bw),
        )
    overhead_total_ns = 0.0
    overhead_details = []
    for nid in path:
        oh = float(ctx.node_overhead_ns.get(nid, 0.0))
        overhead_total_ns += oh
        overhead_details.append((nid, oh))
    drain_ns = ctx.compute_drain_ns(path, nbytes)
    bottleneck_bw = None if min_bw == float("inf") else min_bw
    return {
        "path": path,
        "edges": edge_details,
        "edge_total_ns": edge_total_ns,
        "overheads": overhead_details,
        "overhead_total_ns": overhead_total_ns,
        "drain_ns": drain_ns,
        "bottleneck_bw_gbs": bottleneck_bw,
        "expected_total_ns": edge_total_ns + overhead_total_ns + drain_ns,
    }
 def _print_breakdown(label: str, br: dict) -> None:
    print(f"\n  {label}")
    print(f"    path ({len(br['path'])} nodes):")
    for nid in br["path"]:
        print(f"      - {nid}")
    print("    edges (prop. delay):")
    for src, dst, dist_mm, prop_ns, bw in br["edges"]:
        if dist_mm is None:
            print(f"      ! {src} -> {dst}  EDGE NOT FOUND IN edge_map")
            continue
        print(
            f"      {src} -> {dst}  "
            f"dist={dist_mm:.3f}mm  prop={prop_ns:.2f}ns  "
            f"bw={bw or 0:.2f}GB/s"
        )
    print("    per-node overhead_ns:")
    for nid, oh in br["overheads"]:
        if oh > 0:
            print(f"      {nid:<60s}  overhead_ns={oh:.2f}")
    print(f"    edge_total_ns      = {br['edge_total_ns']:.2f}")
    print(f"    overhead_total_ns  = {br['overhead_total_ns']:.2f}")
    print(f"    bottleneck_bw_gbs  = {br['bottleneck_bw_gbs']}")
    print(f"    drain_ns (nbytes={NBYTES}) = {br['drain_ns']:.2f}")
    print(f"    expected_total_ns  = {br['expected_total_ns']:.2f}")
 # ── RAW path scenario ────────────────────────────────────────────────
 def _dump_src_op_records(engine, src_sip, src_cube, src_pe, label) -> None:
    """Print op_logger records for ops on the SRC PE.
    The op log captures t_start/t_end for memory/math/gemm/copy ops on
    every component, so we can see how long tl.load vs tl.store vs
    tl.send actually took at the engine level.
    """
    op_logger = getattr(engine, "_op_logger", None)
    if op_logger is None:
        print(f"  ({label}) op_logger not available")
        return
    src_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}."
    recs = [r for r in op_logger.records if r.component_id.startswith(src_prefix)]
    print(f"  ({label}) op_logger records on SRC PE ({src_prefix}*):")
    for r in recs[:40]:
        dur = r.t_end - r.t_start
        comp_short = r.component_id.replace(src_prefix, "")
        params_short = ""
        if "nbytes" in r.params:
            params_short = f" nbytes={r.params['nbytes']}"
        if "src_addr" in r.params:
            params_short += f" src_addr={r.params['src_addr']}"
        if "dst_addr" in r.params:
            params_short += f" dst_addr={r.params['dst_addr']}"
        print(
            f"    t=[{r.t_start:7.2f}..{r.t_end:7.2f}] dur={dur:6.2f}ns  "
            f"{comp_short:<25s} {r.op_kind:<8s} {r.op_name:<12s}{params_short}"
        )
 def _run_raw():
    captured, restore = _install_per_pe_capture()
    try:
        topo = resolve_topology(str(TOPOLOGY_PATH))
        engine = GraphEngine(topo.topology_obj, enable_data=True)
        spec = topo.topology_obj.spec
        src_sip, src_cube, src_pe = SRC
        dst_sip, dst_cube, dst_pe = DST
        assert src_sip == dst_sip
        src_off = (src_cube * N_PES + src_pe) * N_ELEM * ELEM_BYTES
        dst_off = (dst_cube * N_PES + dst_pe) * N_ELEM * ELEM_BYTES
        with RuntimeContext(
            engine=engine,
            target_device=DeviceSelector("all"),
            correlation_id="diag_raw",
            spec=spec,
        ) as rt:
            dp = DPPolicy(
                cube="row_wise", pe="column_wise",
                num_cubes=N_CUBES, num_pes=N_PES,
            )
            rt.ahbm.set_device(src_sip)
            t = rt.zeros(
                (N_CUBES, N_PES * N_ELEM), dtype="f16",
                dp=dp, name="raw_tensor",
            )
            t.copy_(rt.from_numpy(
                np.full((N_CUBES, N_PES * N_ELEM), 1.0, dtype=np.float16),
            ))
            def kernel(t_ptr, n_elem, tl):
                pe_id = tl.program_id(axis=0)
                cube_id = tl.program_id(axis=1)
                if cube_id == src_cube and pe_id == src_pe:
                    data = tl.load(
                        t_ptr + src_off, shape=(n_elem,), dtype="f16",
                    )
                    tl.store(t_ptr + dst_off, data)
            pending = rt.launch(
                "diag_raw_kernel", kernel, t, N_ELEM, _defer_wait=True,
            )
            for h, _sip, meta in pending:
                rt.wait(h, _meta=meta)
        # Compute the RAW sub-txn path: src PE_DMA -> dst HBM_CTRL
        from kernbench.policy.address.phyaddr import PhysAddr
        ctx = next(iter(engine._components.values())).ctx
        src_pe_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}"
        # Resolve the destination to an HBM controller node.
        # The raw store kernel issues a DmaWriteCmd on the dst VA, which the
        # engine translates via PE_MMU. For this diagnostic we approximate
        # the destination as the dst cube's HBM controller.
        # The tensor is "row_wise" sharded across cubes, so each cube owns
        # row[cube_id, :] and each PE owns a column slice of it. The actual
        # dst PA depends on the AHBM allocator and would come from the
        # tensor's shard map; shard_map is looked up here but not consumed
        # yet, so the name-based lookup below determines dst_hbm_id.
        shard_map = getattr(t, "_shard_map", None) or getattr(t, "shard_map", None)
        # Fallback: show the breakdown for src PE_DMA -> first reachable
        # HBM node in the dst cube.
        dst_hbm_id = f"sip{dst_sip}.cube{dst_cube}.hbm_ctrl"
        if dst_hbm_id not in engine._components:
            # try alternate naming
            for nid in engine._components.keys():
                if (
                    nid.startswith(f"sip{dst_sip}.cube{dst_cube}.")
                    and "hbm" in nid
                ):
                    dst_hbm_id = nid
                    break
        # find_path() appends ".pe_dma" to the source PE prefix automatically
        try:
            raw_path = ctx.router.find_path(src_pe_prefix, dst_hbm_id)
        except Exception as e:
            raw_path = []
            print(f"  WARN: find_path raw failed: {e}")
        if not raw_path:
            # Try other HBM-related node names in dst cube
            for nid in engine._components.keys():
                if not nid.startswith(f"sip{dst_sip}.cube{dst_cube}."):
                    continue
                if "hbm" not in nid:
                    continue
                try:
                    p = ctx.router.find_path(src_pe_prefix, nid)
                except Exception:
                    p = []
                if p:
                    raw_path = p
                    print(f"  (fallback raw dst node: {nid})")
                    break
        return captured, ctx, raw_path, engine
    finally:
        restore()
 # ── IPCQ path scenario ───────────────────────────────────────────────
 def _run_ipcq():
    captured, restore = _install_per_pe_capture()
    dst_pe_ipcq_id = (
        f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_ipcq"
    )
    arrival_log, restore_arrival = _install_meta_arrival_capture(
        dst_pe_ipcq_id,
    )
    recv_log, restore_recv = _install_recv_capture(dst_pe_ipcq_id)
    barrier_log, restore_barrier = _install_barrier_capture()
    try:
        topo = resolve_topology(str(TOPOLOGY_PATH))
        engine = GraphEngine(topo.topology_obj, enable_data=True)
        spec = topo.topology_obj.spec
        src_sip, src_cube, src_pe = SRC
        dst_sip, dst_cube, dst_pe = DST
        cfg = load_ccl_config()
        merged = resolve_algorithm_config(cfg, name="intercube_allreduce")
        merged["slot_size"] = max(int(merged.get("slot_size", 4096)), NBYTES)
        with RuntimeContext(
            engine=engine,
            target_device=DeviceSelector("all"),
            correlation_id="diag_ipcq",
            spec=spec,
        ) as rt:
            configure_sfr_intercube_multisip(engine, spec, merged)
            dp = DPPolicy(
                cube="row_wise", pe="column_wise",
                num_cubes=N_CUBES, num_pes=N_PES,
            )
            def kernel(t_ptr, n_elem, tl):
                pe_id = tl.program_id(axis=0)
                cube_id = tl.program_id(axis=1)
                if cube_id == src_cube and pe_id == src_pe:
                    data = tl.load(t_ptr, shape=(n_elem,), dtype="f16")
                    tl.send(dir=("E" if HOP == "h3" else "S"), src=data)
                elif cube_id == dst_cube and pe_id == dst_pe:
                    tl.recv(
                        dir=("W" if HOP == "h3" else "N"),
                        shape=(n_elem,), dtype="f16",
                    )
            tensors = []
            for s in sorted({src_sip, dst_sip}):
                rt.ahbm.set_device(s)
                t = rt.zeros(
                    (N_CUBES, N_PES * N_ELEM), dtype="f16",
                    dp=dp, name=f"sip{s}",
                )
                t.copy_(rt.from_numpy(
                    np.full((N_CUBES, N_PES * N_ELEM), 1.0, dtype=np.float16),
                ))
                tensors.append(t)
            all_pending = []
            for tt in tensors:
                pending = rt.launch(
                    "diag_ipcq_kernel", kernel, tt, N_ELEM, _defer_wait=True,
                )
                all_pending.extend(pending)
            for h, _sip, meta in all_pending:
                rt.wait(h, _meta=meta)
        ctx = next(iter(engine._components.values())).ctx
        src_pe_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}"
        dst_pe_dma = f"sip{dst_sip}.cube{dst_cube}.pe{dst_pe}.pe_dma"
        try:
            ipcq_path = ctx.router.find_path(src_pe_prefix, dst_pe_dma)
        except Exception as e:
            ipcq_path = []
            print(f"  WARN: find_path ipcq failed: {e}")
        # Snapshot DST PE_IPCQ qp state at end-of-run so we can see what
        # peer_head_cache/my_tail looked like (and at which directions).
        qp_state = _snapshot_qp_state(engine, dst_pe_ipcq_id)
        return (captured, ctx, ipcq_path, engine,
                arrival_log, qp_state, recv_log, barrier_log)
    finally:
        restore_barrier()
        restore_recv()
        restore_arrival()
        restore()
 # ── Test entry ───────────────────────────────────────────────────────
@pytest.mark.diagnostic
 def test_pe_to_pe_diagnostic_h3():
    print("\n" + "=" * 78)
    print(f"  Diagnostic: h3 inter-cube horizontal, nbytes={NBYTES}")
    print(f"  src={SRC}  dst={DST}")
    print("=" * 78)
    # ── RAW scenario
    print("\n[RAW] tl.load + tl.store (sender pays both legs)")
    raw_per_pe, raw_ctx, raw_path, raw_engine = _run_raw()
    print(f"  per-PE pe_exec_ns ({len(raw_per_pe)} entries):")
    src_id = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}.pe_cpu"
    dst_id = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_cpu"
    for nid in (src_id, dst_id):
        if nid in raw_per_pe:
            print(f"    {nid:<60s}  {raw_per_pe[nid]:.2f} ns  <-- key PE")
    # Exclude the key PEs before taking the top 6, so they do not use up slots.
    others = {
        k: v for k, v in raw_per_pe.items()
        if v > 0.5 and k not in (src_id, dst_id)
    }
    if others:
        print("  other PEs with pe_exec_ns > 0.5 ns:")
        for nid, v in sorted(others.items(), key=lambda kv: -kv[1])[:6]:
            print(f"    {nid:<60s}  {v:.2f} ns")
    print(f"  max(pe_exec_ns) = "
          f"{max(raw_per_pe.values()) if raw_per_pe else 0:.2f} ns")
    if raw_path:
        br = _path_breakdown(raw_ctx, raw_path, NBYTES)
        _print_breakdown("RAW sub-txn path (src.pe_dma -> dst.hbm_ctrl)", br)
    _dump_src_op_records(raw_engine, *SRC, "RAW")
    # ── IPCQ scenario
    print("\n[IPCQ] tl.send + tl.recv (recv pays inbound traversal+drain)")
    (ipcq_per_pe, ipcq_ctx, ipcq_path, ipcq_engine,
     arrival_log, qp_state, recv_log, barrier_log) = _run_ipcq()
    print(f"\n  [BARRIER LOG] {len(barrier_log)} _execute_kernel entries:")
    src_id = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}.pe_cpu"
    dst_id = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_cpu"
    n_skipped = 0
    src_entry = None
    dst_entry = None
    for e in barrier_log:
        if e["barrier_skipped"]:
            n_skipped += 1
        if e["node_id"] == src_id:
            src_entry = e
        if e["node_id"] == dst_id:
            dst_entry = e
    print(f"    PEs entering _execute_kernel: {len(barrier_log)}")
    print(f"    PEs that SKIPPED barrier (env.now > target_start): {n_skipped}")
    if src_entry:
        print(
            f"    SRC pe ({src_id}): entry_now={src_entry['entry_now']:.2f}  "
            f"target_start={src_entry['target_start_ns']:.2f}  "
            f"skipped={src_entry['barrier_skipped']}  "
            f"late_ns={src_entry['delta_late_ns']:.2f}"
        )
    if dst_entry:
        print(
            f"    DST pe ({dst_id}): entry_now={dst_entry['entry_now']:.2f}  "
            f"target_start={dst_entry['target_start_ns']:.2f}  "
            f"skipped={dst_entry['barrier_skipped']}  "
            f"late_ns={dst_entry['delta_late_ns']:.2f}"
        )
    # Top 5 latest arrivals
    sorted_late = sorted(
        [e for e in barrier_log if e["delta_late_ns"] is not None],
        key=lambda e: -e["delta_late_ns"],
    )[:5]
    print(f"    Top 5 latest PE arrivals (positive = barrier missed):")
    for e in sorted_late:
        if e["delta_late_ns"] > 0:
            print(
                f"      {e['node_id']}: late by {e['delta_late_ns']:.2f} ns  "
                f"(entry={e['entry_now']:.2f}, target={e['target_start_ns']:.2f})"
            )
    print(f"\n  [RECV LOG on dst pe_ipcq] {len(recv_log)} entries:")
    for e in recv_log:
        print(
            f"    {e['phase']:5s} t={e['t']:8.2f} ns  "
            f"dir={e['direction']}  "
            f"peer_head_cache={e['peer_head_cache']}  "
            f"my_tail={e['my_tail']}"
        )
    print(f"\n  [META-ARRIVAL LOG on dst pe_ipcq] {len(arrival_log)} arrivals:")
    for i, e in enumerate(arrival_log):
        print(
            f"    #{i:2d} t={e['t']:8.2f} ns  "
            f"src=(sip{e['src_sip']},cube{e['src_cube']},pe{e['src_pe']}) "
            f"dir={e['src_direction']}  "
            f"sender_seq={e['sender_seq']} "
            f"matched_dir={e['matched_direction']} "
            f"nbytes={e['nbytes']}"
        )
        for d, ph in e["peer_head_cache_before"].items():
            mt = e["my_tail_before"][d]
            if ph != 0 or mt != 0 or d == e["matched_direction"]:
                print(
                    f"        before: dir={d}  peer_head_cache={ph}  my_tail={mt}"
                )
    print(f"\n  [QP STATE END-OF-RUN on dst pe_ipcq]:")
    for d, st in qp_state.items():
        print(
            f"    dir={d}  peer_head_cache={st['peer_head_cache']}  "
            f"my_tail={st['my_tail']}  rx_range=[{st['rx_range'][0]}..."
            f"{st['rx_range'][1]})  n_slots={st['n_slots']} "
            f"slot_size={st['slot_size']}"
        )
    print(f"  per-PE pe_exec_ns ({len(ipcq_per_pe)} entries):")
    for nid in (src_id, dst_id):
        if nid in ipcq_per_pe:
            print(f"    {nid:<60s}  {ipcq_per_pe[nid]:.2f} ns  <-- key PE")
    # Exclude the key PEs before taking the top 6, so they do not use up slots.
    others = {
        k: v for k, v in ipcq_per_pe.items()
        if v > 0.5 and k not in (src_id, dst_id)
    }
    if others:
        print("  other PEs with pe_exec_ns > 0.5 ns:")
        for nid, v in sorted(others.items(), key=lambda kv: -kv[1])[:6]:
            print(f"    {nid:<60s}  {v:.2f} ns")
    print(f"  max(pe_exec_ns) = "
          f"{max(ipcq_per_pe.values()) if ipcq_per_pe else 0:.2f} ns")
    if ipcq_path:
        br = _path_breakdown(ipcq_ctx, ipcq_path, NBYTES)
        _print_breakdown("IPCQ sub-txn path (src.pe_dma -> peer.pe_dma)", br)
    _dump_src_op_records(ipcq_engine, *SRC, "IPCQ")
    _dump_src_op_records(ipcq_engine, *DST, "IPCQ DST")
    # ── Credit-return path analysis (where the missing IPCQ "ack" lives)
    print("\n" + "-" * 78)
    print("Credit-return path (current modeling)")
    print("-" * 78)
    src_pe_prefix = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}"
    dst_pe_prefix = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}"
    # PE_IPCQ._credit_latency_ns calls
    #     ctx.router.find_path(self._pe_prefix, peer_pe_prefix)
    # where the *destination* lacks the ".pe_dma" suffix. find_path()
    # only auto-appends to the source, so this raises -> the except
    # clause silently returns 0.0. Effectively credit latency = 0.
    try:
        ipcq_ctx.router.find_path(dst_pe_prefix, src_pe_prefix)
        print("  find_path(dst_pe_prefix, src_pe_prefix) did not raise; "
              "the suffix-less destination resolves in this build")
    except Exception as e:
        print("  CONFIRMED BUG in _credit_latency_ns: dest lacks '.pe_dma' "
              "-> find_path raises -> caught exception -> returns 0.0")
        print(f"  Error: {e}")
    # The intended credit path is recv -> sender (reverse data direction)
    try:
        credit_path = ipcq_ctx.router.find_path(
            dst_pe_prefix, f"{src_pe_prefix}.pe_dma",
        )
    except Exception as e:
        credit_path = []
        print(f"  WARN: corrected find_path credit failed: {e}")
    if credit_path:
        credit_size = 16  # PE_IPCQ default _credit_size_bytes
        # Today's modeling: drain only, 16 bytes -> ~0.125 ns
        cur = ipcq_ctx.compute_drain_ns(credit_path, credit_size)
        # Proposed modeling: full path latency (edges + node overhead + drain)
        proposed = ipcq_ctx.compute_path_latency_ns(credit_path, credit_size)
        print(f"  credit path nodes = {len(credit_path)} (recv -> sender)")
        for nid in credit_path[:6]:
            print(f"    {nid}")
        if len(credit_path) > 6:
            print(f"    ... {len(credit_path) - 6} more nodes")
        br = _path_breakdown(ipcq_ctx, credit_path, credit_size)
        print(f"  edge_total_ns       = {br['edge_total_ns']:.2f}")
        print(f"  overhead_total_ns   = {br['overhead_total_ns']:.2f}")
        print(f"  drain_ns(16 bytes)  = {br['drain_ns']:.2f}")
        print(f"  CURRENT _credit_latency_ns (drain only) = {cur:.3f} ns")
        print(f"  PROPOSED            (compute_path_latency_ns) = {proposed:.2f} ns")
        print(f"  delta               = {proposed - cur:+.2f} ns")
    # ── Comparison summary
    print("\n" + "-" * 78)
    print("Summary")
    print("-" * 78)
    raw_max = max(raw_per_pe.values()) if raw_per_pe else 0.0
    ipcq_max = max(ipcq_per_pe.values()) if ipcq_per_pe else 0.0
    print(f"  RAW  max(pe_exec_ns)              = {raw_max:.2f} ns")
    print(f"  IPCQ max(pe_exec_ns) (current)    = {ipcq_max:.2f} ns")
    print(f"  delta (RAW - IPCQ current)        = {raw_max - ipcq_max:+.2f} ns")
    if credit_path:
        ipcq_with_credit = ipcq_max + (proposed - cur)
        print(
            f"  IPCQ projected w/ blocking credit + full path overhead "
            f"= {ipcq_with_credit:.2f} ns"
        )
        print(
            f"  delta (RAW - IPCQ projected)      = "
            f"{raw_max - ipcq_with_credit:+.2f} ns  "
            f"(<= 0 means IPCQ >= RAW)"
        )
    # Observational test: only sanity-check that both scenarios recorded data.
    assert raw_per_pe, "no RAW pe_exec_ns recorded"
    assert ipcq_per_pe, "no IPCQ pe_exec_ns recorded"
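Reviewer note: the credit-return section of the diagnostic above contrasts a drain-only latency with a full path latency (edges + node overhead + drain). The sketch below only illustrates that arithmetic. `Edge`, `drain_ns`, and `path_latency_ns` are hypothetical names, the link numbers are invented, and bandwidth is assumed to be in GB/s so that 1 GB/s equals 1 byte/ns. It is not kernbench's `compute_drain_ns` or `compute_path_latency_ns`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    prop_ns: float  # propagation delay of the link, in ns (assumed unit)
    bw_gbs: float   # link bandwidth in GB/s; 1 GB/s == 1 byte/ns


def drain_ns(edges, nbytes):
    """Serialization time through the slowest link on the path."""
    bottleneck = min(e.bw_gbs for e in edges)
    return nbytes / bottleneck


def path_latency_ns(edges, node_overheads_ns, nbytes):
    """Edge propagation + per-node overhead + drain."""
    edge_total = sum(e.prop_ns for e in edges)
    overhead_total = sum(node_overheads_ns)
    return edge_total + overhead_total + drain_ns(edges, nbytes)


# Hypothetical three-hop credit path (four nodes); every number is made up.
edges = [Edge(0.5, 128.0), Edge(2.0, 128.0), Edge(0.5, 256.0)]
overheads = [1.0, 4.0, 4.0, 1.0]

cur = drain_ns(edges, 16)                          # drain-only model
proposed = path_latency_ns(edges, overheads, 16)   # full-path model
print(f"drain only = {cur:.3f} ns, full path = {proposed:.3f} ns")
```

Under these assumed numbers the 16-byte drain term is a small fraction of the full-path value, which is why the two models can differ noticeably for small credit messages.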
@@ -0,0 +1,358 @@
 """PE-to-PE latency sweep across hop types and data sizes.
 Compares IPCQ send/recv vs raw-DMA (tl.load + tl.store) latency for five
 hop types:
  H1 Intra-cube horizontal   pe0 → pe1
  H2 Intra-cube vertical     pe0 → pe4
  H3 Inter-cube horizontal   sip0.cube0.pe0 → sip0.cube1.pe0
  H4 Inter-cube vertical     sip0.cube0.pe0 → sip0.cube4.pe0
  H5 Inter-SIP               sip0.cube0.pe0 → sip1.cube0.pe0   (IPCQ only —
                                                                raw needs
                                                                cross-SIP MMU)
 Sizes: 128..10240 bytes. Emits PNGs with both lines plus a CSV.
 """
 from __future__ import annotations
 import csv
 from dataclasses import dataclass
 from pathlib import Path
 import numpy as np
 import pytest
 from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
 from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
 from kernbench.policy.placement.dp import DPPolicy
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
 PLOT_DIR = Path(__file__).parent / "pe2pe_latency_plots"
 SIZES = [128, 256, 384, 512, 768, 1024, 2048, 4096, 8192, 10240]
 N_CUBES = 16
 N_PES = 8
 ELEM_BYTES = 2  # f16
@dataclass(frozen=True)
 class Hop:
    id: str
    label: str
    src: tuple[int, int, int]
    dst: tuple[int, int, int]
    send_dir: str
    recv_dir: str
    supports_raw: bool   # False for cross-SIP (DPPolicy intra-device only)
 HOPS = [
    Hop("h1_intra_horizontal", "Intra-cube horizontal (pe0 to pe1)",
        (0, 0, 0), (0, 0, 1), "intra_E", "intra_W", True),
    Hop("h2_intra_vertical", "Intra-cube vertical (pe0 to pe4)",
        (0, 0, 0), (0, 0, 4), "intra_S", "intra_N", True),
    Hop("h3_inter_cube_horizontal", "Inter-cube horizontal (cube0 to cube1)",
        (0, 0, 0), (0, 1, 0), "E", "W", True),
    Hop("h4_inter_cube_vertical", "Inter-cube vertical (cube0 to cube4)",
        (0, 0, 0), (0, 4, 0), "S", "N", True),
    Hop("h5_inter_sip", "Inter-SIP (sip0 to sip1, same cube/pe)",
        (0, 0, 0), (1, 0, 0), "global_E", "global_W", False),
 ]
 def _make_engine():
    topo = resolve_topology(str(TOPOLOGY_PATH))
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    return engine, topo.topology_obj.spec
 # ── IPCQ path ────────────────────────────────────────────────────────
 def _measure_ipcq(hop: Hop, nbytes: int) -> float:
    engine, spec = _make_engine()
    cfg = load_ccl_config()
    merged = resolve_algorithm_config(cfg, name="intercube_allreduce")
    merged["slot_size"] = max(int(merged.get("slot_size", 4096)), nbytes)
    n_elem = nbytes // ELEM_BYTES
    src_sip, src_cube, src_pe = hop.src
    dst_sip, dst_cube, dst_pe = hop.dst
    send_dir, recv_dir = hop.send_dir, hop.recv_dir
    with RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id=f"ipcq_{hop.id}_{nbytes}",
        spec=spec,
    ) as ctx:
        configure_sfr_intercube_multisip(engine, spec, merged)
        dp = DPPolicy(
            cube="row_wise", pe="column_wise",
            num_cubes=N_CUBES, num_pes=N_PES,
        )
        def kernel(t_ptr, n_elem, tl):
            pe_id = tl.program_id(axis=0)
            cube_id = tl.program_id(axis=1)
            if cube_id == src_cube and pe_id == src_pe:
                data = tl.load(t_ptr, shape=(n_elem,), dtype="f16")
                tl.send(dir=send_dir, src=data)
            elif cube_id == dst_cube and pe_id == dst_pe:
                tl.recv(dir=recv_dir, shape=(n_elem,), dtype="f16")
        tensors = []
        for s in sorted({src_sip, dst_sip}):
            ctx.ahbm.set_device(s)
            t = ctx.zeros(
                (N_CUBES, N_PES * n_elem), dtype="f16",
                dp=dp, name=f"sip{s}",
            )
            t.copy_(ctx.from_numpy(
                np.full((N_CUBES, N_PES * n_elem), 1.0, dtype=np.float16),
            ))
            tensors.append(t)
        all_pending = []
        for t in tensors:
            pending = ctx.launch(
                f"{hop.id}_ipcq", kernel, t, n_elem, _defer_wait=True,
            )
            all_pending.extend(pending)
        for h, sip_id, meta in all_pending:
            ctx.wait(h, _meta=meta)
        # Per-PE kernel execution time (excludes launch dispatch and
        # response aggregation). IPCQ: DST blocks on tl.recv until the
        # send arrives, so max across SIPs = DST's transfer time.
        pe_exec_vals = []
        for h, _sip, _meta in all_pending:
            _, trace = engine.get_completion(h)
            if trace and trace.get("pe_exec_ns") is not None:
                pe_exec_vals.append(float(trace["pe_exec_ns"]))
    return max(pe_exec_vals) if pe_exec_vals else 0.0
 # ── Raw DMA path (intra-SIP only) ────────────────────────────────────
 def _measure_raw(hop: Hop, nbytes: int) -> float:
    """tl.load from source slice + tl.store to destination slice. The VA
    mapping spans the cube mesh within one SIP (MmuMapMsg broadcasts to all
    cubes of the SIP), so the store goes through the fabric to the
    destination PE's HBM.  No IPCQ protocol involved.
    """
    if not hop.supports_raw:
        raise RuntimeError(f"hop {hop.id} does not support raw path")
    engine, spec = _make_engine()
    n_elem = nbytes // ELEM_BYTES
    src_sip, src_cube, src_pe = hop.src
    dst_sip, dst_cube, dst_pe = hop.dst
    assert src_sip == dst_sip
    # Slice offsets in the (N_CUBES, N_PES * n_elem) tensor:
    #   row = cube, slice within row = pe * n_elem .. (pe+1)*n_elem
    # Byte offsets from va_base:
    src_off = (src_cube * N_PES + src_pe) * n_elem * ELEM_BYTES
    dst_off = (dst_cube * N_PES + dst_pe) * n_elem * ELEM_BYTES
    with RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id=f"raw_{hop.id}_{nbytes}",
        spec=spec,
    ) as ctx:
        dp = DPPolicy(
            cube="row_wise", pe="column_wise",
            num_cubes=N_CUBES, num_pes=N_PES,
        )
        ctx.ahbm.set_device(src_sip)
        t = ctx.zeros(
            (N_CUBES, N_PES * n_elem), dtype="f16",
            dp=dp, name="raw_tensor",
        )
        t.copy_(ctx.from_numpy(
            np.full((N_CUBES, N_PES * n_elem), 1.0, dtype=np.float16),
        ))
        def kernel(t_ptr, n_elem, tl):
            pe_id = tl.program_id(axis=0)
            cube_id = tl.program_id(axis=1)
            if cube_id == src_cube and pe_id == src_pe:
                data = tl.load(
                    t_ptr + src_off, shape=(n_elem,), dtype="f16",
                )
                tl.store(t_ptr + dst_off, data)
        pending = ctx.launch(
            f"{hop.id}_raw", kernel, t, n_elem, _defer_wait=True,
        )
        for h, sip_id, meta in pending:
            ctx.wait(h, _meta=meta)
        # Per-PE kernel execution time. Raw: only SRC does real work
        # (tl.load + tl.store, store is blocking), so max across all PEs
        # = SRC's transfer time. Idle PEs contribute only overhead_ns.
        pe_exec_vals = []
        for h, _sip, _meta in pending:
            _, trace = engine.get_completion(h)
            if trace and trace.get("pe_exec_ns") is not None:
                pe_exec_vals.append(float(trace["pe_exec_ns"]))
    return max(pe_exec_vals) if pe_exec_vals else 0.0
 # ── CSV + plotting ───────────────────────────────────────────────────
 def _write_csv(records, path: Path) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(
            f, fieldnames=["hop", "label", "size_bytes", "path", "total_ns"],
        )
        w.writeheader()
        for r in records:
            w.writerow(r)
 def _plot_per_hop(records, hop: Hop, path: Path) -> None:
    import matplotlib.pyplot as plt
    ipcq = sorted(
        [r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
        key=lambda r: r["size_bytes"],
    )
    raw = sorted(
        [r for r in records if r["hop"] == hop.id and r["path"] == "raw"],
        key=lambda r: r["size_bytes"],
    )
    fig, ax = plt.subplots(figsize=(8, 5))
    if ipcq:
        ax.plot(
            [r["size_bytes"] for r in ipcq],
            [r["total_ns"] for r in ipcq],
            marker="o", label="IPCQ (send/recv)", color="tab:blue",
        )
    if raw:
        ax.plot(
            [r["size_bytes"] for r in raw],
            [r["total_ns"] for r in raw],
            marker="s", label="Raw DMA (load+store)", color="tab:orange",
        )
    else:
        ax.text(
            0.98, 0.02, "(Raw DMA unavailable for cross-SIP)",
            transform=ax.transAxes, ha="right", va="bottom",
            fontsize=9, color="gray",
        )
    ax.set_xlabel("Data size (bytes)")
    ax.set_ylabel("Latency (ns)")
    ax.set_title(hop.label)
    ax.grid(True, alpha=0.3)
    ax.legend()
    fig.tight_layout()
    fig.savefig(path, dpi=120)
    plt.close(fig)
 def _plot_overview(records, path: Path) -> None:
    import matplotlib.pyplot as plt
    fig, axes = plt.subplots(2, 3, figsize=(16, 9))
    axes = axes.flatten()
    for i, hop in enumerate(HOPS):
        ax = axes[i]
        ipcq = sorted(
            [r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
            key=lambda r: r["size_bytes"],
        )
        raw = sorted(
            [r for r in records if r["hop"] == hop.id and r["path"] == "raw"],
            key=lambda r: r["size_bytes"],
        )
        if ipcq:
            ax.plot(
                [r["size_bytes"] for r in ipcq],
                [r["total_ns"] for r in ipcq],
                marker="o", label="IPCQ", color="tab:blue",
            )
        if raw:
            ax.plot(
                [r["size_bytes"] for r in raw],
                [r["total_ns"] for r in raw],
                marker="s", label="Raw", color="tab:orange",
            )
        ax.set_title(hop.label, fontsize=10)
        ax.set_xlabel("bytes")
        ax.set_ylabel("ns")
        ax.grid(True, alpha=0.3)
        ax.legend(fontsize=8)
    for j in range(len(HOPS), len(axes)):
        axes[j].axis("off")
    fig.suptitle(
        "PE-to-PE latency: IPCQ vs raw DMA",
        fontsize=14,
    )
    fig.tight_layout()
    fig.savefig(path, dpi=120)
    plt.close(fig)
 # ── Test entry ───────────────────────────────────────────────────────
 def test_pe_to_pe_latency_sweep():
    records: list[dict] = []
    for hop in HOPS:
        for size in SIZES:
            # IPCQ path
            ipcq_ns = _measure_ipcq(hop, size)
            records.append({
                "hop": hop.id, "label": hop.label,
                "size_bytes": size, "path": "ipcq",
                "total_ns": ipcq_ns,
            })
            raw_s = "n/a"
            if hop.supports_raw:
                raw_ns = _measure_raw(hop, size)
                records.append({
                    "hop": hop.id, "label": hop.label,
                    "size_bytes": size, "path": "raw",
                    "total_ns": raw_ns,
                })
                raw_s = f"{raw_ns:7.1f}ns"
            print(
                f"[{hop.id}] size={size:5d}  "
                f"ipcq={ipcq_ns:7.1f}ns  raw={raw_s}"
            )
    PLOT_DIR.mkdir(parents=True, exist_ok=True)
    _write_csv(records, PLOT_DIR / "summary.csv")
    for hop in HOPS:
        _plot_per_hop(records, hop, PLOT_DIR / f"{hop.id}.png")
    _plot_overview(records, PLOT_DIR / "overview.png")
    for hop in HOPS:
        rs = sorted(
            [r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
            key=lambda r: r["size_bytes"],
        )
        for r in rs:
            assert r["total_ns"] > 0, f"{hop.id}: total_ns must be > 0"
    print(f"\n  Plots + CSV written to {PLOT_DIR}")
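Reviewer note: `_measure_raw` above derives byte offsets for the source and destination slices from the `(N_CUBES, N_PES * n_elem)` tensor layout. A small worked example of that formula follows. `slice_offset_bytes` is a hypothetical helper written for illustration, and the sketch assumes the tensor is laid out contiguously in row-major order from `va_base`, as the comments in `_measure_raw` suggest.

```python
N_CUBES = 16
N_PES = 8
ELEM_BYTES = 2  # f16


def slice_offset_bytes(cube, pe, n_elem):
    """Byte offset of the (cube, pe) slice, mirroring src_off / dst_off."""
    return (cube * N_PES + pe) * n_elem * ELEM_BYTES


nbytes = 1024
n_elem = nbytes // ELEM_BYTES  # 512 f16 elements per slice

# H3 (cube0.pe0 -> cube1.pe0): the destination is one full row further on.
h3_src = slice_offset_bytes(0, 0, n_elem)
h3_dst = slice_offset_bytes(1, 0, n_elem)

# H2 (pe0 -> pe4 within cube0): the destination is four slices further on.
h2_dst = slice_offset_bytes(0, 4, n_elem)

print(h3_src, h3_dst, h2_dst)
```

Consecutive slices are exactly `nbytes` apart, so the per-PE regions tile the tensor without overlap under this layout assumption.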
@@ -0,0 +1,106 @@
 """Rectangular (non-square) SIP-level 2D topology support.
 Regression target: the 2D builtin topology functions in
 ``kernbench.ccl.topologies`` (``mesh_2d``, ``torus_2d``,
 ``mesh_2d_no_wrap``) originally hardcoded ``side = sqrt(world_size)`` and
 raised ``ValueError`` for any non-square ``world_size``. This blocked
 running the allreduce sweep at n_sips=6 on torus/mesh layouts.
 The functions now accept optional ``w, h`` kwargs so a 2×3 (or 3×2, etc.)
 layout works; without them the sqrt fallback still applies to square
 layouts. The tests below cover both cases.
 Layout convention used here (matches the square case):
    rank = row * w + col   for 0 <= row < h, 0 <= col < w
 For w=2, h=3, world_size=6 the layout is:
         col=0  col=1
  row=0:   0      1
  row=1:   2      3
  row=2:   4      5
 """
 from __future__ import annotations
 import pytest
 from kernbench.ccl.topologies import (
    mesh_2d,
    mesh_2d_no_wrap,
    torus_2d,
 )
 # ── mesh_2d_no_wrap (no wrap-around) ──────────────────────────────────
 def test_mesh_2d_no_wrap_2x3_top_left():
    """rank 0 (top-left, no N, no W): only S and E."""
    nbrs = mesh_2d_no_wrap(rank=0, world_size=6, w=2, h=3)
    assert nbrs == {"S": 2, "E": 1}, nbrs
 def test_mesh_2d_no_wrap_2x3_top_right():
    """rank 1 (top-right, no N, no E): only S and W."""
    nbrs = mesh_2d_no_wrap(rank=1, world_size=6, w=2, h=3)
    assert nbrs == {"S": 3, "W": 0}, nbrs
 def test_mesh_2d_no_wrap_2x3_middle_left():
    """rank 2 (middle-left, no W): N, S, E."""
    nbrs = mesh_2d_no_wrap(rank=2, world_size=6, w=2, h=3)
    assert nbrs == {"N": 0, "S": 4, "E": 3}, nbrs
 def test_mesh_2d_no_wrap_2x3_bottom_right():
    """rank 5 (bottom-right, no S, no E): only N and W."""
    nbrs = mesh_2d_no_wrap(rank=5, world_size=6, w=2, h=3)
    assert nbrs == {"N": 3, "W": 4}, nbrs
 # ── torus_2d (wrap-around on all four edges) ─────────────────────────
 def test_torus_2d_2x3_top_left():
    """rank 0: N wraps to row 2 col 0 (rank 4); W wraps to col 1 (rank 1)."""
    nbrs = torus_2d(rank=0, world_size=6, w=2, h=3)
    assert nbrs == {"N": 4, "S": 2, "W": 1, "E": 1}, nbrs
 def test_torus_2d_2x3_bottom_right():
    """rank 5: S wraps to row 0 (rank 1); E wraps to col 0 (rank 4)."""
    nbrs = torus_2d(rank=5, world_size=6, w=2, h=3)
    assert nbrs == {"N": 3, "S": 1, "W": 4, "E": 4}, nbrs
 # ── mesh_2d alias for torus_2d ───────────────────────────────────────
 def test_mesh_2d_2x3_matches_torus_2d():
    """mesh_2d is currently a torus alias; behaviour must match torus_2d."""
    for rank in range(6):
        assert mesh_2d(rank=rank, world_size=6, w=2, h=3) == \
            torus_2d(rank=rank, world_size=6, w=2, h=3)
 # ── Back-compat: square layouts still work without w/h kwargs ────────
 def test_square_back_compat_mesh_2d_no_wrap():
    """Calling without w, h should still work for square world_size."""
    nbrs = mesh_2d_no_wrap(rank=0, world_size=4)
    assert nbrs == {"S": 2, "E": 1}, nbrs
 def test_square_back_compat_torus_2d():
     """Calling without w, h should still work for square world_size (2x2 here)."""
    nbrs = torus_2d(rank=0, world_size=4)
    assert nbrs == {"N": 2, "S": 2, "W": 1, "E": 1}, nbrs
 # ── Validation: w*h must match world_size ────────────────────────────
 def test_rectangular_dims_must_match_world_size():
    """Phase 2 contract: explicit w, h must satisfy w*h == world_size."""
    with pytest.raises(ValueError):
        mesh_2d_no_wrap(rank=0, world_size=6, w=3, h=3)  # 9 != 6
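The expected neighbor maps in these tests follow from a row-major layout (rank = row * w + col). The block below is a minimal reference sketch of that arithmetic, not the project's implementation: `grid_dims` and `reference_neighbors` are hypothetical names, and the rule that `w` and `h` must be supplied together is an assumption made for the sketch.

```python
import math


def grid_dims(world_size, w=None, h=None):
    """Resolve (w, h); fall back to a square grid when neither is given."""
    if w is None and h is None:
        side = math.isqrt(world_size)
        if side * side != world_size:
            raise ValueError(f"world_size={world_size} is not a perfect square")
        return side, side
    if w is None or h is None:
        # Assumption for this sketch: partial dims are rejected.
        raise ValueError("w and h must be supplied together")
    if w * h != world_size:
        raise ValueError(f"w*h={w * h} does not match world_size={world_size}")
    return w, h


def reference_neighbors(rank, world_size, w=None, h=None, wrap=False):
    """Return {"N"/"S"/"W"/"E": rank} for a row-major w x h grid.

    wrap=True models a torus (every edge wraps around);
    wrap=False models a mesh whose off-grid neighbors are omitted.
    """
    w, h = grid_dims(world_size, w, h)
    row, col = divmod(rank, w)  # row-major: rank = row * w + col
    steps = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
    nbrs = {}
    for name, (dr, dc) in steps.items():
        r, c = row + dr, col + dc
        if wrap:
            r, c = r % h, c % w
        elif not (0 <= r < h and 0 <= c < w):
            continue  # off the grid and no wrap-around: no neighbor
        nbrs[name] = r * w + c
    return nbrs
```

With w=2 the torus has only two columns, so W and E resolve to the same rank, which is why the torus assertions above list one rank under both keys.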
Author	SHA1	Message	Date
mukesh	e9cc40f74d	Rectangular SIP topology + 6-device allreduce sweep mesh_2d, torus_2d, and mesh_2d_no_wrap accept optional w,h kwargs; sqrt fall-back preserved for square layouts (back-compat tests confirm 4-SIP and 9-SIP square configs still work). sfr_config reads system.sips.w/h from spec and threads dims through to the topology fn. test_allreduce_multidevice CONFIGS switched from 4 SIPs (square) to 6 SIPs: ring_1d_6sip, torus_2d_6sip_2x3, mesh_2d_no_wrap_6sip_2x3. _write_temp_configs writes system.sips.w/h when supplied; _sip_topo_dims reads them back. Latency sweep loop also moved to 6-SIP layouts. Linear-scale plot variants dropped -- only log-scale *.png + summary.csv emitted. Plots in tests/allreduce_latency_plots regenerated. New tests/test_sip_topology_rectangular.py asserts neighbor correctness for 2x3 layouts and back-compat for square fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:13:14 -07:00
mukesh	c1a5cf3a2a	ADR-0009 D5: chain-aware target_start_ns + zero-byte launch fanout The single-walk predictor (find_node_path(io_cpu, pe_cpu) + compute_path_latency_ns) under-shot actual dispatch latency for far cubes -- the routing graph could pick a path bypassing M_CPU, and non-zero-nbytes launch sub-txns serialized on shared first hops. Far PEs arrived at _execute_kernel after target_start_ns, silently skipped the barrier yield, and started pe_exec_start late. Their reported pe_exec_ns under-counted by exactly the late_ns amount (63 ns observed at h4 cube4.pe0 in the IPCQ test, up to 113 ns worst case for cubes 9-11), producing the suspicious flat region in the h4 IPCQ curve at 8192/10240 bytes. Fix: - IO_CPU predictor uses the explicit two-leg chain (IO_CPU->M_CPU + M_CPU->PE_CPU - io.overhead - m.overhead), so every PE on every targeted cube has a barrier >= its real dispatch arrival. - Kernel-launch fanout sub-txns carry nbytes=0 (control-plane, not data-plane), removing the per-cube fanout serialization that pushed far M_CPUs past the predictor. - Legacy io_cpu mirror updated. ADR-0009 D5 mechanism updated to specify the two-leg formula and the nbytes=0 requirement. New tests/test_d5_barrier_invariant.py asserts (a) no PE enters _execute_kernel after target_start_ns and (b) every PE in a multi-cube launch has identical pe_exec_start -- both regressions silently pass on the existing tests/test_kernel_launch_sync.py because that test only inspects post-aggregation max(pe_exec_ns). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:12:58 -07:00
mukesh	90874abbfe	ADR-0023 D9: blocking credit-emit with full-path latency PE_IPCQ._handle_recv now yields-from _delayed_credit_send instead of spawning it as a fork, so the receiver's pe_exec_ns includes the credit-return cost. _credit_latency_ns switches from compute_drain_ns(path, 16) to compute_path_latency_ns(path, 16) and fixes a latent find_path bug where the destination lacked the ".pe_dma" suffix (silently returned 0 ns under the bare except). Net effect on h3/h4 inter-cube pe-to-pe latency: IPCQ >= raw DMA at every size, matching real-HW posted-write semantics. tl.send remains fire-and-forget. ADR-0023 D9 amended; new diagnostic test tests/test_pe_to_pe_diagnostic.py captures per-PE pe_exec_ns, paths, drain, and meta-arrival timing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:12:38 -07:00
mukesh	19dfc86dc3	Allreduce latency sweep across topologies and data sizes Adds test_allreduce_latency_sweep that runs the existing intercube allreduce kernel under three SIP topologies (ring_1d, torus_2d, mesh_2d_no_wrap, all at n_sips=4) across 11 data sizes from 256 B/SIP up to 1 MB/SIP. For each point, captures max(pe_exec_ns) — the critical-path kernel time — and emits CSV plus log-x and linear-x plots, both per-topology and combined overview, with KB/MB-formatted tick labels. Reuses run_allreduce + _write_temp_configs and adds a slot_size auto-bump when n_elem*2 exceeds the default IPCQ slot. Sweep skips n_elem=16 because the runtime's dim_map scalar-arg remapping (context.py:761) collides any int-valued kernel scalar that matches a global tensor dim with its local shard size. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 10:16:29 -07:00
mukesh	14d800b0ae	Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023) - KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier (max path latency across every target PE), M_CPU passes it through, PE_CPU yields until it before recording pe_exec_start. Every PE in a launch begins kernel execution at the same env.now regardless of its dispatch path length — eliminates per-PE dispatch-offset artifact in cross-PE and cross-cube latency measurements. - PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top, matching the terminal-drain behavior of ComponentBase._forward_txn for every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget (sender doesn't yield on sub_done); tl.recv now blocks until bytes have actually drained into its inbox. - ComponentContext: new compute_path_latency_ns helper + node_overhead_ns field populated by GraphEngine. - tests/test_kernel_launch_sync.py: asserts all PEs in one launch produce identical pe_exec_ns for a no-op kernel (zero spread). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 15:30:29 -07:00
mukesh	6918e6e906	PE-to-PE latency test + supporting fixes Adds tests/test_pe_to_pe_latency.py: a sweep that measures PE-to-PE transfer latency for five hop types (intra-cube horizontal/vertical, inter-cube horizontal/vertical, inter-SIP) across data sizes 128 B to 10 KB, on both the IPCQ (tl.send/tl.recv) and raw-DMA (tl.load+tl.store) paths. Emits per-hop PNG plots, an overview PNG, and a CSV summary into tests/pe2pe_latency_plots/. Latency is reported as max(pe_exec_ns) across participating PEs, read from engine.get_completion(), so the measurement captures the SRC/DST PE's kernel body time rather than the full launch+ response-aggregation envelope. Two simulator fixes were needed to make this measurement meaningful: - PeMMU now stores a list of (start, end, pa) sub-regions per page rather than a single PA. DPPolicy layouts with shards smaller than page_size (e.g. 128 B payloads with 4 KB pages) used to silently overwrite each other through last-write-wins, causing DMAs intended for cube0 to physically route to cube3 - inflating latency by ~170 ns per DMA at small sizes. STOPGAP: real MMUs don't support sub-page regions; long-term fix is either smaller MMU page size or DPPolicy validation that refuses sub-page shards. - M_CPU's per-PE metrics aggregation (pe_exec_ns, dma_ns, compute_ns) now max-merges against the existing value in result_data rather than overwriting. Multi-cube workloads share one result_data dict via IO_CPU fanout; the previous overwrite caused whichever cube's M_CPU finished last to clobber others' values, so multi-cube pe_exec_ns was racy and frequently 0. Same fix applied in legacy/builtin/m_cpu.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 21:04:31 -07:00
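The max-merge aggregation described in commit 6918e6e906 can be illustrated with a small sketch. This is not the simulator's code: `merge_pe_metrics` is a hypothetical helper and the metric values are made up. It only shows why merging with `max` keeps a shared result dict stable when several cubes report into it, where a plain overwrite would let the last reporter win.

```python
def merge_pe_metrics(result_data, cube_metrics):
    """Max-merge one cube's per-PE metrics into the shared result dict.

    Keeping the larger of the existing and incoming value makes the
    outcome independent of the order in which cubes finish.
    """
    for key, value in cube_metrics.items():
        result_data[key] = max(result_data.get(key, 0), value)
    return result_data


# Hypothetical metrics from two cubes sharing one result dict.
shared = {}
merge_pe_metrics(shared, {"pe_exec_ns": 420, "dma_ns": 300, "compute_ns": 50})
merge_pe_metrics(shared, {"pe_exec_ns": 0, "dma_ns": 310, "compute_ns": 0})
```

In this sketch the second cube's zero `pe_exec_ns` no longer clobbers the first cube's 420, which is the failure mode the commit message attributes to the earlier overwrite.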