Fix all remaining test failures: single-cube allreduce + matplotlib dep

- intercube_allreduce: add single-cube fast path that skips intra-SIP mesh reduce and goes directly to inter-SIP exchange. Fixes IPCQ deadlock when TP launches kernel on one cube per SIP. - distributed.py: derive effective cube dims from tensor shard placement instead of hardcoding topology mesh size. - pyproject.toml: add matplotlib>=3.7 to dependencies. - pe_dma.py (prior commit): add MMU translation in pipeline DMA path. 577 passed, 0 failed (was 529 passed, 10 failed). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge: accept remote pe2pe summary.csv
2026-04-27 21:25:31 -07:00 · 2026-04-27 17:13:06 -07:00 · 2026-04-27 16:43:28 -07:00 · 2026-04-27 16:43:19 -07:00 · 2026-04-27 16:43:01 -07:00 · 2026-04-27 15:52:29 -07:00
70 changed files with 4209 additions and 584 deletions
@@ -1,25 +1,39 @@
-# ADR-0001: PhysAddr Layout & Address Decoding Contract
+# ADR-0001: 51-bit Physical Address Layout & Decoding Contract

 ## Status

-Accepted
+Accepted (Revision 2 — 2026-04-27: concrete bit layout, rack_id removal,
+Tray->SIP / SIP->DIE renaming, PE/MCPU/IOCPU sub-unit tables.
+Supersedes ADR-0031.)

 ## Date

-2026-02-27
+2026-04-27 (original: 2026-02-27)

 ## Context

-KernBench Graph Latency Simulator must route requests deterministically and compute end-to-end latency strictly by graph traversal.
-To model local vs remote traffic (same/different SIP, same/different CUBE, optional PE-group), requests need a stable, parsable address/location scheme that:
+KernBench requires a stable, parsable physical address scheme that:

- can be decoded into routing domains (SIP/CUBE/HBM/PE-resource, etc.)
+- can be decoded into routing domains (SIP / die / HBM / PE-resource / IOCPU)
 - remains topology-agnostic (no hardcoded counts)
- supports swappable policy and DI-first components without leaking topology assumptions into node implementations
+- supports swappable policy and DI-first components
+- covers multiple SIPs, AHBM dies, and IO chiplet dies in a unified space
+
+### History
+
+- Original ADR-0001 defined a 51-bit layout with `rack_id(4) + sip_id(4) +
+  sip_seg(5) + local_offset(38)`. `rack_id` was never used in practice.
+- ADR-0031 (stub) requested PE-resource range partition but was never
+  implemented.
+
+Revision 2 removes `rack_id`, renames `sip_seg -> die_id`, and provides
+concrete sub-unit tables for PE, MCPU, CUBE_SRAM, and IOCPU resources.
+ADR-0031 is superseded.

 ## Decision

-We define a **PhysAddr value object** and an **address decoding contract** that converts an integer address into routing domains.
+We define a **PhysAddr value object** and an **address decoding contract**
+that converts an integer address into routing domains.

 ### D1. PhysAddr is an immutable value object

@@ -27,82 +41,322 @@ We define a **PhysAddr value object** and an **address decoding contract** that
 - Any allocator returns a **fully specified PhysAddr** (not partial metadata).
 - No global state may be required to interpret a PhysAddr.

-### D2. PhysAddr fields (logical contract)
+### D2. 51-bit Physical Address Layout

-PhysAddr must be able to represent at least:
+A 51-bit physical address is adopted.

- `rack_id` (optional but reserved for scale-out)
- `sip_id`  (device / SIP domain)
- `sip_seg` (SIP-level segment/window selection, e.g., cube window)
- `local_offset` (offset within the chosen segment/window)
+#### 2.1 Top-Level Address Map

-Decoded/derived fields may include (optional):
+```text
+[50:47] sip_id        (4)     -- 16 SIPs
+[46:42] die_id        (5)     -- 32 dies per SIP
+[41: 0] local_offset  (42)    -- 4 TB per die
+```

- `cube_id`
- `kind` (e.g., HBM vs PE-resource vs raw)
- `unit_type` / `pe_id` (if PE-level addressing is modeled)
+```text
+50      47 46      42 41                      0
+---------+----------+-------------------------+
+| sip_id  | die_id   |      local_offset       |
+---------+----------+-------------------------+
+```

-**Important:** The exact bit allocation may evolve, but the *semantic fields above* must remain decodable without hidden assumptions.
+#### 2.2 die_id Allocation

-### D3. Decoding is deterministic and policy-compatible
+| die_id | Meaning |
+|--------|---------|
+| 0..15  | AHBM dies |
+| 16..20 | IOCHIPLET dies |
+| 21..31 | Reserved |

- Decoding must deterministically map an integer address to:
-  - destination SIP domain (`sip_id`)
-  - destination sub-domain (`cube_id` if applicable)
-  - destination target kind (HBM/PE-resource/other)
- Decoding must not depend on runtime topology sizes; it may depend on **explicit topology parameters** provided through configuration (e.g., segment size, slice size), and those parameters must live in the topology/config layer (not in random components).
+#### 2.3 AHBM Die Layout

-### D4. Topology-derived constants live in the topology layer
+Only lower 256 GB of the 4 TB die-local window is assigned.

-Constants such as segment sizes (e.g., HBM slice size / window size) are derived from topology configuration (YAML/JSON/dict) and are provided to the decoder via DI/config.
-They must not be hardcoded in node implementations.
+```text
+[41:38] MBZ            (4)
+[37]    addr_space      (1)    -- 0 = local resource, 1 = HBM memory
+[36: 0] sub-address    (37)
+```
+
+| addr_space | Meaning |
+|------------|---------|
+| 0 | Local resource |
+| 1 | HBM memory |
+
+##### 2.3.1 HBM Window (addr_space = 1)
+
+```text
+[36:0] hbm_offset     (37)    -- 128 GB decode window
+```
+
+The architectural decode window is fixed at 128 GB. Implemented capacity
+may be smaller depending on SKU/topology (see D4).
+
+##### 2.3.2 Resource Window (addr_space = 0)
+
+```text
+[36:34] resource_kind  (3)
+[33: 0] kind_local    (34)    -- 16 GB per kind
+```
+
+| resource_kind | Meaning |
+|---------------|---------|
+| 000 | PE_LOCAL |
+| 001 | MCPU_LOCAL |
+| 010 | CUBE_SRAM |
+| 011..111 | Reserved |
+
+Each kind gets a 16 GB decode region.
+
+##### 2.3.3 PE_LOCAL (resource_kind = 000)
+
+```text
+[33]    MBZ            (1)
+[32:29] pe_id          (4)     -- 0..15
+[28:25] pe_sub_unit    (4)
+[24: 0] sub_offset    (25)    -- 32 MB per slot
+```
+
+16 PEs x 16 sub-unit slots x 32 MB = 8 GB active decode.
+
+| pe_sub_unit | Name | Budget |
+|-------------|------|--------|
+| 0 | PE_CPU_DTCM | 8 KB |
+| 1 | MATH_ENGINE_DTCM | 8 KB |
+| 2 | IPCQ | 256 KB |
+| 3 | PE_CPU_SFR | 16 KB |
+| 4 | MATH_ENGINE_SFR | 16 KB |
+| 5 | DMA_ENGINE_SFR | 192 KB |
+| 6 | PE_TCM | 2 MB |
+| 7..15 | Reserved | -- |
+
+##### 2.3.4 MCPU_LOCAL (resource_kind = 001)
+
+```text
+[33:30] MBZ            (4)
+[29:25] mcpu_sub_unit  (5)
+[24: 0] sub_offset    (25)    -- 32 MB per slot
+```
+
+1 GB active decode.
+
+| mcpu_sub_unit | Name | Budget |
+|---------------|------|--------|
+| 0 | MCPU_ITCM | 512 KB |
+| 1 | MCPU_DTCM | 512 KB |
+| 2 | IPCQ | 256 KB |
+| 3 | MCPU_SFR | 8 KB |
+| 4 | MCPU_DMA_SFR | 16 KB |
+| 5 | MCPU_SRAM | 10 MB |
+| 6..31 | Reserved | -- |
+
+##### 2.3.5 CUBE_SRAM (resource_kind = 010)
+
+```text
+[33:25] MBZ            (9)
+[24: 0] sram_offset   (25)    -- flat 32 MB
+```
+
+#### 2.4 IOCHIPLET Die Layout
+
+Only lower 1 TB of the 4 TB die-local window is assigned.
+
+```text
+[41:40] MBZ            (2)
+[39: 0] chiplet_offset (40)   -- 1 TB
+```
+
+Region split by address range:
+
+| Range | Meaning | Decode condition |
+|-------|---------|------------------|
+| [0, 2 GB) | IOCPU resource | chiplet_offset < 0x8000_0000 |
+| [2 GB, 1 TB) | UAL | chiplet_offset >= 0x8000_0000 |
+
+##### 2.4.1 IOCPU Region
+
+```text
+[30:27] iocpu_sub_unit (4)
+[26: 0] sub_offset    (27)    -- 128 MB per slot
+```
+
+16 x 128 MB slots. 2 GB active decode.
+
+| iocpu_sub_unit | Name | Budget |
+|----------------|------|--------|
+| 0 | IOCPU_ITCM | 512 KB |
+| 1 | IOCPU_DTCM | 512 KB |
+| 2 | IPCQ | 2 MB |
+| 3 | IOCPU_SFR | 8 KB |
+| 4 | IO_DMA_SFR | 16 KB |
+| 5 | IO_SRAM | 64 MB |
+| 6..15 | Reserved | -- |
+
+##### 2.4.2 UAL Region
+
+Sub-layout TBD (separate ADR).
+
+#### 2.5 Addressing Rules
+
+1. MBZ bits must be zero. An address with non-zero MBZ bits is
+   **architecturally invalid**. Implementation may raise a decode fault
+   or return an error -- behavior is not prescribed by this ADR.
+2. Fixed slot sizes are chosen for simple hardware decode; actual
+   implemented capacity may be smaller than the slot.
+3. Access beyond a sub-unit's implemented budget within a slot is
+   **architecturally invalid** (same policy as MBZ).
+
+### D3. Bitfield decoding is deterministic
+
+Given an integer address, field extraction (`sip_id`, `die_id`, `kind`,
+`sub_unit`, `offset`) is purely positional. No runtime state is required.
+Decoding deterministically maps an integer address to destination domains:
+`sip_id`, `die_id`, target kind (HBM / PE_LOCAL / MCPU_LOCAL / CUBE_SRAM /
+IOCPU / UAL).
+
+### D4. Capacity validation may depend on topology config
+
+Whether a decoded address falls within **implemented capacity** (e.g.,
+HBM 96 GB on a specific SKU) is checked against topology parameters
+provided via DI/config. Decode itself (D3) never consults topology --
+only validation does. These parameters must live in the topology/config
+layer, not in node implementations.

 ### D5. Routing consumes decoded domains, not raw bits

 Routing policy uses decoded domains:

- `src` location (sip/cube/pe or node_id)
+- `src` location (sip / die / pe or node_id)
 - `dst` domains derived from PhysAddr decoding
 - `size_bytes` for size-aware link latency
-Routing must not inspect raw bit-fields directly except inside the decoding module.
+
+Routing must not inspect raw bit-fields directly except inside the
+decoding module.

 ## Alternatives Considered

-1) **Use raw integers everywhere, decode ad-hoc in routing**
+1. **Keep `rack_id` (4 bits)**: Rejected -- never used in practice,
+   consumes 4 bits that enable die-local expansion to 42 bits
+   (IOCHIPLET 1 TB).

- Rejected: leads to duplicated logic, inconsistent routing, and hidden assumptions embedded in multiple components.
+2. **Uniform 256 GB per die**: Rejected -- IOCHIPLET UAL requires ~1 TB.
+   Freed rack_id bits enable 42-bit local_offset.

-1) **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**
+3. **Variable-width die windows (AHBM 256 GB, CHIPLET 1 TB via multi-seg
+   spanning)**: Rejected -- complicates D3 (deterministic decoding).
+   Uniform 4 TB window with MBZ padding is simpler.

- Rejected: violates SPEC (R3) and breaks swappability and configuration-driven topologies.
+4. **Use raw integers everywhere, decode ad-hoc in routing**: Rejected --
+   leads to duplicated logic, inconsistent routing, and hidden
+   assumptions.

-1) **Put decoding inside memory controllers or routers**
+5. **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**:
+   Rejected -- violates SPEC R3 and breaks swappability.

- Rejected: leaks policy into components and undermines DI-first, swappable implementations (SPEC R4).
+6. **Put decoding inside memory controllers or routers**: Rejected --
+   leaks policy into components, violates SPEC R4 / D5.

 ## Consequences

 ### Positive

- Deterministic routing domains enable clear test invariants for local vs remote paths (SPEC R1, R5).
- Keeps topology variability (SPEC R3) while preserving consistent semantics.
- DI-first: decoder can be swapped or extended without changing components or tests (SPEC R4).
+- Simple hierarchical decoder: SIP -> die -> kind -> sub-unit.
+- Clean separation of memory (HBM) vs local resource (PE/MCPU/SRAM/IOCPU).
+- Deterministic routing domains enable clear test invariants (SPEC R1, R5).
+- Expandable: 11 reserved die_id slots, reserved resource_kind / sub-unit
+  slots, reserved MBZ bits.
+- DI-first: decoder can be swapped without changing components (SPEC R4).

-### Tradeoffs / Costs
+### Tradeoffs

- Requires explicit configuration for any topology-derived sizes.
- Introduces a single “blessed” decoding module that must remain stable and well-tested.
+- Sparse address holes due to power-of-2 slot alignment.
+- Large reserved/MBZ regions (intentional for future extension).
+- Requires explicit configuration for topology-derived sizes (D4).
+- Introduces a single "blessed" decoding module that must remain stable
+  and well-tested.
+
+## Supersedes
+
+- **ADR-0031 (PhysAddr PE-Resource Extension)**: stub status. The
+  PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables in D2.3.3-D2.3.5
+  fulfill ADR-0031's stated goals.

 ## Implementation Notes (Non-normative)

- Recommended module boundary:
-  - `src/kernbench/policy/address/phyaddr.py`
+- Recommended module: `src/kernbench/policy/address/phyaddr.py`
+- Tests should cover: encode/decode round-trip per kind, MBZ enforcement,
+  die_id dispatch (AHBM / IOCHIPLET / reserved), sub-unit boundary
+  values, backward compatibility of factory APIs.
+- Factory methods: `hbm_addr`, `pe_hbm_addr`, `pe_tcm_addr`,
+  `cube_sram_addr` retain signatures (minus `rack_id`); `cube_id`
+  parameter renamed to `die_id`.
+- New factories: `pe_resource_addr`, `mcpu_resource_addr`,
+  `iocpu_resource_addr`, `ual_addr`.

- Tests should cover:
-  - deterministic decoding
-  - local vs remote classification from decoded fields
-  - invariants: “allocator returns full PhysAddr”, “decoding requires no global state”
+## Appendix A. Address Examples
+
+### A.1 AHBM HBM access
+
+sip=2, die=5, HBM offset=0x1000
+
+```text
+sip_id     = 2       -> [50:47] = 0b0010
+die_id     = 5       -> [46:42] = 0b00101
+addr_space = 1       -> [37]    = 1 (HBM)
+hbm_offset = 0x1000  -> [36:0]
+
+51-bit addr = (2 << 47) | (5 << 42) | (1 << 37) | 0x1000
+```
+
+### A.2 AHBM PE_LOCAL -- PE3 PE_TCM, offset=0x400
+
+```text
+sip_id        = 0  -> [50:47] = 0
+die_id        = 0  -> [46:42] = 0
+addr_space    = 0  -> [37]    = 0
+resource_kind = 0  -> [36:34] = 000 (PE_LOCAL)
+pe_id         = 3  -> [32:29] = 0011
+pe_sub_unit   = 6  -> [28:25] = 0110 (PE_TCM)
+sub_offset    = 0x400 -> [24:0]
+
+local_offset = (0 << 34) | (3 << 29) | (6 << 25) | 0x400
+```
+
+### A.3 AHBM MCPU_LOCAL -- MCPU_SRAM, offset=0x0
+
+```text
+sip_id        = 1  -> [50:47] = 0001
+die_id        = 3  -> [46:42] = 00011
+addr_space    = 0  -> [37]    = 0
+resource_kind = 1  -> [36:34] = 001 (MCPU_LOCAL)
+mcpu_sub_unit = 5  -> [29:25] = 00101 (MCPU_SRAM)
+sub_offset    = 0  -> [24:0]  = 0
+
+local_offset = (1 << 34) | (5 << 25)
+```
+
+### A.4 IOCHIPLET -- IOCPU IPCQ, offset=0x20000
+
+```text
+sip_id         = 1   -> [50:47] = 0001
+die_id         = 17  -> [46:42] = 10001 (IOCHIPLET[1])
+iocpu_sub_unit = 2   -> [30:27] = 0010 (IPCQ)
+sub_offset     = 0x20000 -> [26:0]
+
+chiplet_offset = (2 << 27) | 0x20000
+                 (< 0x8000_0000 -> IOCPU region)
+```
+
+### A.5 IOCHIPLET -- UAL region, offset=4 GB
+
+```text
+sip_id         = 0   -> [50:47] = 0
+die_id         = 16  -> [46:42] = 10000 (IOCHIPLET[0])
+chiplet_offset = 0x1_0000_0000 (4 GB >= 2 GB -> UAL region)
+```

 ## Links

- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first), R5 (multi-domain comm)
+- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first),
+  R5 (multi-domain comm)
+- ADR-0031: Superseded
@@ -67,6 +67,76 @@ Completion semantics:

 ---

+### D5. Launch timing is endpoint-synchronized
+
+All PEs targeted by a single kernel launch MUST begin executing the kernel
+body at the same simulated time, regardless of their dispatch path length
+from the launch entry point.
+
+Rationale. The dispatch tree Host → IO_CPU → M_CPU → PE_CPU has variable
+latency at every level. PEs near their M_CPU receive the launch earlier
+than PEs farther away; cubes near an IO_CPU receive it earlier than cubes
+farther away. Without synchronization, each PE's kernel begins at a
+different `env.now`, making per-PE metrics such as `pe_exec_ns` a function
+of dispatch-path geometry rather than of the kernel's behavior —
+producing measurement artifacts in benchmarks that time kernel-internal
+waits (for example `tl.recv` on cross-cube or cross-SIP hops).
+
+Mechanism.
+
+- `KernelLaunchMsg` carries an optional `target_start_ns: float | None`.
+- **IO_CPU** is the canonical stamper. On fan-out to M_CPUs, it
+  computes `target_start_ns = env.now + max_latency` where
+  `max_latency` is the maximum, over every target (sip, cube, pe)
+  tuple, of the **two-leg dispatch chain**:
+
+  ```
+  max_latency(sip, cube, pe) =
+      compute_path_latency_ns(find_node_path(io_cpu, m_cpu(sip, cube)))
+    + compute_path_latency_ns(find_node_path(m_cpu(sip, cube), pe_cpu))
+    - io_cpu.overhead_ns
+    - m_cpu.overhead_ns
+  ```
+
+  This models the actual dispatch as **two sequential Transactions**
+  (IO_CPU → M_CPU, then M_CPU → PE_CPU). Each leg's
+  `compute_path_latency_ns` adds its endpoints' `overhead_ns`;
+  `io_cpu.overhead_ns` is subtracted because IO_CPU has already
+  paid it before this method runs, and `m_cpu.overhead_ns` is
+  subtracted once because it appears as endpoint of leg1 *and*
+  start of leg2 but is paid only once at run time. A single
+  `find_node_path(io_cpu, pe_cpu)` walk is **not** equivalent —
+  it can pick a graph path that bypasses M_CPU and silently
+  under-shoots the prediction for far cubes, breaking the D5
+  invariant.
+
+  The fanned-out sub-Transactions carry **`nbytes = 0`** for
+  `KernelLaunchMsg` (control message only). Without this,
+  large kernel-launch payloads would occupy fabric BW on the
+  shared first hop and serialize the per-cube dispatch, pushing
+  far M_CPUs past `target_start_ns` and re-introducing the
+  late-arrival violation.
+- **M_CPU** passes an already-stamped `target_start_ns` through
+  unchanged. Only when the value is absent (e.g. a direct
+  launch-to-M_CPU unit test) does M_CPU compute a per-cube barrier
+  `env.now + max(local command-path latency)`.
+- **PE_CPU** yields `env.timeout(target_start_ns - env.now)` at the top
+  of `_execute_kernel`, before recording `pe_exec_start` and invoking
+  the kernel body.
+- When `target_start_ns is None`, PE_CPU falls through to the legacy
+  unsynchronized behavior — preserving backward compatibility.
+
+IO_CPU-level stamping guarantees every PE across every targeted cube
+uses the same barrier sim-time, eliminating both the within-cube
+dispatch-offset artifact *and* the cross-cube offset artifact in
+multi-cube launches. Models a real-hardware timed-broadcast launch
+(latency-equalized dispatch tree).
+
+The synchronization is internal to the engine / IO_CPU / M_CPU / PE_CPU
+control plane — runtime API and application kernels are unchanged.
+
+---
+
 ## Links

 - SPEC R1, R2, R7, R8
@@ -372,24 +372,41 @@ When the receiver frees a slot, the sender must learn about it
 travel through general vc_comm fabric — it uses a **separate fast
 path**, an abstraction of the NVLink / UCIe credit-return wire.

-**Latency** is computed from the **bottleneck BW on the path**, not a
-magic constant:
+**Latency** is computed from the **full path latency** (per-node
+overhead + edge propagation + drain), not a magic constant:

 ```
 credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
-path = router.find_path(self_pe, peer_pe)
-latency = compute_drain_ns(path, credit_size_bytes)
-        = credit_size_bytes / bottleneck_bw_on_path
+path = router.find_path(self_pe, peer_pe.pe_dma)
+latency = compute_path_latency_ns(path, credit_size_bytes)
+        = sum(edge.distance_mm * ns_per_mm)
+        + sum(node_overhead_ns[n] for n in path)
+        + credit_size_bytes / bottleneck_bw_on_path
 ```

+The router auto-appends `.pe_dma` to the source only, so the
+destination MUST be spelled with the explicit `.pe_dma` suffix or
+`find_path` raises and the credit silently teleports at zero cost
+(latent bug fixed alongside this update).
+
+`tl.recv` blocks on the credit-emit completion (recv yields-from
+`_delayed_credit_send` rather than spawning it as a fork). This puts
+the credit-return cost on the receiver's `pe_exec_ns`, modeling the
+IPCQ control-plane completing the consume-acknowledgement before
+recv returns to the kernel — the protocol equivalent of a non-posted
+`tl.store` waiting for an HBM ack on the raw DMA path.
+
 That gives us:

 - **Topology-proportional approximation**: an in-cube credit return is
  automatically faster than a cross-SIP credit return.
- **No magic constants**: no arbitrary `ipcq_ctrl_latency_ns`.
+- **No magic constants**: every nanosecond comes from
+  `compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
+  as data traffic.
 - **No deadlock risk**: unlike piggyback, B can issue credit even when
-  it has no data to send back.
- **Reuses existing utility**: `ComponentContext.compute_drain_ns`.
+  it has no data to send back. `peer_credit_store.put` is unbounded.
+- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
+  cost on recv balances the HBM ack-trip cost RAW pays on the sender.

 #### Component coupling — SimPy Store channel

@@ -420,11 +437,21 @@ fan-out (see `IpcqInitMsg` in D12).
 #### PE_DMA's added responsibility

 When `vc_comm` receives a token, PE_DMA processes it as the following
-**atomic** sequence. **No SimPy yield is allowed between the two steps**
-(invariant I6):
+sequence: pay the Transaction's terminal BW drain, then atomically
+write data and forward metadata. **No SimPy yield is allowed between
+the data write and the metadata forward** (invariant I6). The drain
+yield must sit before the atomic block, not inside it:

 ```python
-def _on_vc_comm_recv(self, env, token):
+def _on_vc_comm_recv(self, env, txn):
+    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
+    # sender PE_DMA). MUST happen before the atomic block so recv only
+    # wakes after the bytes have "landed".
+    drain = getattr(txn, "drain_ns", 0.0)
+    if drain > 0:
+        yield env.timeout(drain)
+
+    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    data = self._memory_store.read(token.src_space, token.src_addr,
                                   shape=..., dtype=...)
@@ -439,6 +466,33 @@ The final `put` is yieldable but uses an unbounded internal store, so
 it completes in a single step. That `put` is the closing call of the
 atomic block; nothing may be inserted before it.

+#### Drain-at-inbound semantics (D9 timing model)
+
+The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
+stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
+is paid at each forwarding component via `run()`, and the remaining
+BW drain is paid once at the Transaction's terminal. Every non-IPCQ
+Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
+`ComponentBase._forward_txn` at the terminal node. For IPCQ the
+destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
+(so IPCQ-specific data write + metadata forward can happen), so **the
+drain MUST be paid explicitly at the top of that handler** to keep
+IPCQ's timing model on par with every other fabric Transaction.
+
+Side-effects of paying drain here:
+
+- **SRC `tl.send`** is unchanged — fire-and-forget semantics are
+  preserved because the sender PE_DMA does not `yield sub_done`. The
+  `sub_done.succeed()` call (made after metadata forward below) is an
+  event with no listener on the sender side.
+- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
+  when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
+  forward now happens after the drain, recv observes the full fabric
+  transfer time including bandwidth cost.
+
+Matches the physical picture: send dispatches and leaves; recv waits
+until the bytes have actually been drained into its inbox.
+
 ### D9.5. ADR-0020 (2-pass) integration

 `tl.send` / `tl.recv` integrates with ADR-0020's two-pass model. Phase
@@ -365,23 +365,39 @@ data 경로의 piggyback 모델과 달리, credit return은 일반 vc_comm fabri
 거치지 않고 **별도 fast path**로 처리한다. 이는 실제 HW의 NVLink/UCIe
 credit return fast path를 추상화한 것이다.

-**Latency 계산**: magic constant가 아니라 **라우팅 경로의 bottleneck BW**
-기준으로 산출한다.
+**Latency 계산**: magic constant가 아니라 **라우팅 경로의 full path
+latency** (per-node overhead + edge propagation + drain) 기준으로
+산출한다.

 ```
 credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
-path = router.find_path(self_pe, peer_pe)
-latency = compute_drain_ns(path, credit_size_bytes)
-        = credit_size_bytes / bottleneck_bw_on_path
+path = router.find_path(self_pe, peer_pe.pe_dma)
+latency = compute_path_latency_ns(path, credit_size_bytes)
+        = sum(edge.distance_mm * ns_per_mm)
+        + sum(node_overhead_ns[n] for n in path)
+        + credit_size_bytes / bottleneck_bw_on_path
 ```

+router는 source에만 `.pe_dma`를 자동 부여하므로 destination에는 반드시
+`.pe_dma` suffix를 명시해야 한다. 그렇지 않으면 `find_path`가 raise하고
+credit이 0 cost로 silently teleport되는 latent bug가 발생한다 (이번
+업데이트에서 수정됨).
+
+`tl.recv`는 credit-emit 완료를 yield-from으로 기다린다 (이전에는
+`env.process`로 fork). 이로써 credit-return cost가 receiver의
+`pe_exec_ns`에 반영되어, IPCQ control-plane이 consume-acknowledgement를
+완료한 뒤에야 recv가 kernel에 반환된다 — RAW DMA의 non-posted `tl.store`가
+HBM ack-trip을 기다리는 것의 protocol-level 등가물이다.
+
 이로써:
 - **토폴로지 비례 approximation**: cube 내 credit return과 cross-SIP credit이
-  자동으로 다른 latency를 가짐 (정확한 값은 아니지만 magic constant보다 의미 있음)
- **Magic constant 없음**: 별도 `ipcq_ctrl_latency_ns` 같은 임의 값 불필요
- **Deadlock 위험 없음**: piggyback과 달리 B가 A에게 보낼 데이터가 없어도
-  credit이 자동 발행됨
- **기존 utility 재사용**: `ComponentContext.compute_drain_ns` 그대로 사용
+  자동으로 다른 latency를 가짐
+- **Magic constant 없음**: 모든 ns 값이 데이터 트래픽과 동일한 edge_map
+  및 `node_overhead_ns`에서 산출되는 `compute_path_latency_ns`로부터 옴
+- **Deadlock 위험 없음**: `peer_credit_store.put`은 unbounded, B가 A에게
+  보낼 데이터가 없어도 credit이 자동 발행됨
+- **`IPCQ ≥ raw DMA`** 보장: matched physical move에 대해 credit-emit이
+  RAW의 ack-trip cost와 균형을 이룸

 ```
 PE B: tl.recv(W) → 데이터 가져감 → my_tail++
@@ -426,11 +442,22 @@ backend init에서 IpcqInitMsg fan-out 시 양방향 fast path channel을 함께

 #### PE_DMA의 책임 추가

-PE_DMA(vc_comm)는 token 수신 시 다음 atomic 시퀀스로 처리한다.
-**두 동작 사이에 SimPy yield를 두어서는 안 된다** (I6 MUST 규칙 참조):
+PE_DMA(vc_comm)는 token 수신 시 다음 시퀀스로 처리한다: Transaction
+terminal의 BW drain을 먼저 지불하고, 이어서 atomic하게 data write +
+metadata forward 수행. **data write와 metadata forward 사이에는 SimPy
+yield를 두어서는 안 된다** (I6 MUST 규칙 참조). drain yield는 atomic
+구간 안이 아니라 그 앞에 위치해야 한다:

 ```python
-def _on_vc_comm_recv(self, env, token):
+def _on_vc_comm_recv(self, env, txn):
+    # Sender PE_DMA가 찍어 둔 drain_ns (= nbytes / bottleneck_bw) 를
+    # 여기서 지불. atomic 구간보다 앞이어야 한다 — recv는 bytes가
+    # "도착"한 이후에만 깨어나야 하므로.
+    drain = getattr(txn, "drain_ns", 0.0)
+    if drain > 0:
+        yield env.timeout(drain)
+
+    token = txn.request
    # ── ATOMIC: 두 동작 사이에 yield 금지 ──
    # 1. data를 dst_addr에 write (dst의 메모리 공간은 token.dst_endpoint.buffer_kind)
    data = self._memory_store.read(token.src_space, token.src_addr,
@@ -446,6 +473,32 @@ wire로 capacity가 unbounded인 store를 사용하므로 즉시 완료된다 (
 single-step). 이 최종 put이 atomic 구간의 끝이며, 그 이전에 다른 yield가
 삽입되면 안 된다.

+#### Drain-at-inbound semantics (D9 timing model)
+
+Transaction은 sender PE_DMA가 `drain_ns = nbytes / bottleneck_bw_on_path`
+를 찍어 둔 상태로 fabric에 들어간다. 이 simulator에서 per-hop `overhead_ns`
+는 각 forwarding component의 `run()` 에서 지불되고, 남은 BW drain은
+Transaction의 terminal node에서 한 번 지불된다. IPCQ가 아닌 모든
+Transaction (raw DMA, kernel-launch fanout 등) 은
+`ComponentBase._forward_txn` 이 terminal에서 이 drain을 지불한다. IPCQ의
+경우 목적지 PE_DMA가 `_handle_ipcq_inbound` 핸들러로 Transaction을
+가로채서 (IPCQ 전용 data write + metadata forward를 해야 하므로)
+**이 핸들러 최상단에서 drain을 명시적으로 지불해야 한다** — 그래야 IPCQ의
+timing model이 다른 모든 fabric Transaction과 동일선상에 놓인다.
+
+여기서 drain을 지불할 때의 side-effect:
+
+- **SRC `tl.send`**: 동작 불변. sender PE_DMA가 `sub_done` 을 `yield`
+  하지 않으므로 fire-and-forget 의미가 보존된다. metadata forward 이후
+  호출되는 `sub_done.succeed()` 는 sender 입장에서 listener가 없는 이벤트.
+- **DST `tl.recv`**: `drain_ns` 만큼 늦게 깨어난다. recv는 local PE_IPCQ
+  의 `IpcqMetaArrival` 수신 시에만 wake되며, metadata forward가 drain
+  이후로 이동했으므로 recv는 bandwidth까지 포함한 전체 fabric transfer
+  시간을 관측하게 된다.
+
+물리적 그림과 일치: send는 dispatch하고 바로 반환; recv는 bytes가 실제로
+자신의 inbox로 drain될 때까지 대기.
+
 #### Backpressure latency 정확도

 backpressure 해제까지 걸리는 시간:
@@ -2,7 +2,14 @@

 ## Status

-Proposed (Revision 8 — Hierarchical content split out to ADR-0029)
+Accepted. rank = SIP process-group model stands. The allreduce algorithm
+path (mapper / validator / per-PE install machinery originally targeted at
+ADR-0029) has been replaced by ADR-0032: `AhbmCCLBackend` now calls
+`configure_sfr_intercube_multisip` at `init_process_group` time and the
+intercube kernel receives `(sip_rank, sip_topo_kind, sip_topo_w,
+sip_topo_h)` appended after the module's `kernel_args()`. The
+`leader_only` / `all_pes` mapper concepts in this document are no longer
+used by the default allreduce path.

 ## Context

@@ -89,7 +89,14 @@ direction_idx × bytes_per_direction). 따라서:
 `src/kernbench/ccl/install.py`:

 ```python
-_OPPOSITE_DIR = {"E": "W", "W": "E", "N": "S", "S": "N"}
+# Extended in ADR-0032 with global_* pairs for inter-SIP directions,
+# which were introduced by configure_sfr_intercube_multisip to keep
+# intercube (N/S/E/W) and inter-SIP (global_N/S/E/W) namespaces disjoint.
+_OPPOSITE_DIR = {
+    "E": "W", "W": "E", "N": "S", "S": "N",
+    "global_E": "global_W", "global_W": "global_E",
+    "global_N": "global_S", "global_S": "global_N",
+}

 def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
    """Find peer's direction that reciprocates my_dir→peer_rank.
@@ -2,7 +2,9 @@

 ## Status

-Proposed
+Superseded by ADR-0032 (Intercube all-reduce). The 3-level kernel and
+`hierarchical_allreduce.py` module have been removed. The cube-mesh
+intercube + inter-SIP path is now the single all-reduce algorithm.

 ## Context

@@ -2,7 +2,11 @@

 ## Status

-Stub (Blocker for ADR-0030 — specific range allocations TBD)
+Superseded by ADR-0001 (Revision 2, 2026-04-27).
+PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables are now defined in
+ADR-0001 D2.3.3-D2.3.5.
+
+Previous status: Stub (Blocker for ADR-0030 — specific range allocations TBD)

 ## Context

@@ -0,0 +1,256 @@
+# ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
+
+## Status
+
+Accepted (supersedes ADR-0029).
+
+## Context
+
+### Goal
+
+Define a single all-reduce algorithm that exploits the topology hierarchy:
+cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
+one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.
+
+### Why replace ADR-0029 (hierarchical 3-level)
+
+ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
+where every PE in the system participates. In practice this adds the
+intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast)
+without matching the common workload pattern where the tensor is sharded
+**per cube** (not per PE within a cube).
+
+Moreover, the hierarchical design required:
+- per-PE neighbor graph installation (`_build_pe_installs` multi-level)
+- multi-level topology schema (`hierarchical_3level`)
+- `all_pes` mapper + `multi_pe_sip_local` validator infrastructure
+
+The intercube algorithm below removes all of that: **pe0-only same-lane
+intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the
+root cube, then broadcast back. Simpler kernel, simpler wiring, same
+bandwidth characteristics for the common per-cube DP workload.
+
+### Current state
+
+- `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
+- `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip`
+- `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this
+  automatically at `init_process_group` time.
+- Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`,
+  `hierarchical_allreduce` modules and their tests are **removed**.
+
+---
+
+## Decision
+
+### D1. Algorithm structure — 5 phases
+
+For each SIP (launched concurrently by `mp.spawn`):
+
+```
+Phase 1 — Row reduce W → E (cube mesh, pe0 only):
+    col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
+
+Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
+    row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
+    holds the full SIP sum.
+
+Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
+    Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
+    selected by sip_topo_kind (from topology.yaml sips.topology).
+
+Phase 4 — Col broadcast S → N on rightmost column.
+
+Phase 5 — Row broadcast E → W across the cube mesh.
+```
+
+After all phases every cube's pe0 holds the global sum.
+
+The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
+(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
+across topologies; only phase 3 branches. Helper functions
+`_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the
+three exchange patterns.
+
+### D2. Tensor layout (rank = SIP, per-worker)
+
+Per ADR-0024 rank = SIP at the process-group level. Each worker allocates
+its own cube-mesh-spanning tensor:
+
+```python
+dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
+tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
+```
+
+Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
+each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`.
+
+### D3. SFR / IPCQ wiring — `configure_sfr_intercube_multisip`
+
+Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor
+tables for **every cube's pe0 across every SIP** — regardless of which
+cube is the root or which SIP topology is selected. This lets the kernel
+elect the root cube at runtime and supports topology switches without
+re-wiring.
+
+| Level | Direction labels | Scope |
+|---|---|---|
+| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
+| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per `sips.topology` |
+
+Inter-SIP directions use the `global_*` prefix to keep the namespace
+disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended
+with `global_E ↔ global_W` and `global_N ↔ global_S` so the reverse-
+direction resolver handles 2-SIP bidirectional rings correctly.
+
+Internally the function calls `install_ipcq` with:
+- `world_size = n_sips × n_cubes`
+- `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
+- A closure-captured `neighbors()` function that builds the map above.
+
+This `world_size` is internal to IPCQ wiring and does not leak to the
+process-group rank.
+
+### D4. SIP topology — from `topology.yaml`
+
+```yaml
+system:
+  sips:
+    count: 2
+    topology: ring_1d       # or torus_2d, mesh_2d_no_wrap
+```
+
+- `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
+- `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on
+  `global_E/W` then col ring on `global_S/N`.
+- `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce +
+  broadcast per dimension.
+
+2D variants require `n_sips` to be a perfect square.
+
+### D5. Process-group integration — `AhbmCCLBackend`
+
+At `init_process_group` time the backend:
+
+1. Loads `ccl.yaml` + `topology.yaml`.
+2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
+   `system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`.
+3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time
+   SFR wiring, mirrors NCCL communicator creation.
+
+At each `dist.all_reduce(tensor)` call:
+
+1. Resolves `kernel_fn` from `cfg["module"]`.
+2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from
+   `kernel_args(world_size, n_elem)`.
+3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
+   `sip_rank` is the current greenlet's bound rank.
+4. Launches with `_defer_wait=True`; the main scheduler drains pending
+   handles after all workers submit (per ADR-0024 D7 / ADR-0027 D0.4).
+
+### D6. Config schema
+
+`ccl.yaml`:
+
+```yaml
+defaults:
+  algorithm: intercube_allreduce
+  buffer_kind: tcm
+  ...
+
+algorithms:
+  intercube_allreduce:
+    module: kernbench.ccl.algorithms.intercube_allreduce
+    topology: none
+    buffer_kind: tcm
+    n_elem: 8
+    root_cube: 15
+```
+
+`topology.yaml`:
+
+```yaml
+system:
+  sips:
+    count: 2
+    topology: ring_1d
+sip:
+  cube_mesh: { w: 4, h: 4 }
+```
+
+### D7. Algorithm module contract
+
+Modules loaded via `cfg["module"]` must export:
+
+| Name | Purpose |
+|---|---|
+| `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
+| `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) |
+| `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code |
+| `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |
+
+---
+
+## Dependencies
+
+- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
+- **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
+- **ADR-0025**: Address-based IPCQ direction matching; extended
+  `_OPPOSITE_DIR` with `global_*` pairs.
+- **ADR-0027**: Worker-wait / collective-pending drain in main scheduler.
+
+## Non-goals
+
+- **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the
+  workload for this algorithm is per-cube DP.
+- **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and
+  `mesh_2d_no_wrap` require `n_sips = k²`.
+- **Pipelined chunks**: single-tile per cube, no pipelining yet.
+- **Root cube runtime election**: the kernel currently uses
+  `root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
+  corner. SFR wiring covers all cubes, so runtime election is a pure kernel
+  change when needed.
+
+---
+
+## Consequences
+
+### Positive
+
+- **Single kernel, single install path** for all-reduce — replaces four
+  removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
+- **Topology-agnostic kernel**: ring / torus / mesh selected via one
+  integer param, no kernel duplication.
+- **Automatic via `dist.all_reduce`**: no bench-level or user-level
+  algorithm selection needed; config-driven end-to-end.
+- **Full SFR wiring**: every cube on every SIP has inter-SIP links
+  available — supports future dynamic root-cube election.
+
+### Negative
+
+- **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
+  shard within one cube across 8 PEs are not addressable by this kernel.
+  Such workloads would need a separate intra-cube all-reduce path (not
+  yet implemented).
+- **`configure_sfr_intercube_multisip` always wires all pe0s**: even if a
+  given run only needs a subset (e.g. 1 SIP, ring only). Install cost is
+  small but not zero.
+
+---
+
+## Affected files
+
+| File | Change |
+|---|---|
+| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
+| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
+| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
+| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
+| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
+| `ccl.yaml` | Single `intercube_allreduce` entry |
+| `topology.yaml` | Added `system.sips.topology` |
+| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
+| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
+| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
+| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
+| Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
@@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta"
 name = "kernbench"
 version = "0.1.0"
 requires-python = ">=3.10"
-dependencies = ["pytest", "simpy", "pyyaml", "fastapi>=0.110", "uvicorn[standard]>=0.29", "websockets>=12", "numpy>=1.24", "greenlet>=3.0"]
+dependencies = ["pytest", "simpy", "pyyaml", "fastapi>=0.110", "uvicorn[standard]>=0.29", "websockets>=12", "numpy>=1.24", "greenlet>=3.0", "matplotlib>=3.7"]

 [project.scripts]
 kernbench = "kernbench.cli.main:main"
@@ -24,9 +24,7 @@ TOPO_NAME_TO_KIND = {
 }


-def kernel_args(world_size: int, n_elem: int) -> tuple:
-    cube_w = 4
-    cube_h = 4
+def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)


@@ -127,61 +125,79 @@ def allreduce_intercube_multidevice(
    row = cube_id // cube_w
    col = cube_id % cube_w
    nbytes = n_elem * 2
+    single_cube = (cube_w == 1 and cube_h == 1)

    pe_addr = t_ptr + cube_id * nbytes
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")

-    # ── Phase 1: row reduce W → E ──
-    if col == 0:
-        tl.send(dir="E", src=acc)
-    elif col < cube_w - 1:
-        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
-        acc = acc + recv
-        tl.send(dir="E", src=acc)
+    if single_cube:
+        # ── Single-cube mode: skip intra-SIP reduce, go directly to
+        #    inter-SIP exchange (TP use case: one cube per rank). ──
+        if n_sips > 1:
+            if sip_topo_kind == SIP_TOPO_RING:
+                acc = _inter_sip_ring(acc, n_sips, n_elem, tl)
+            elif sip_topo_kind == SIP_TOPO_TORUS:
+                acc = _inter_sip_torus_2d(
+                    acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
+            elif sip_topo_kind == SIP_TOPO_MESH:
+                acc = _inter_sip_mesh_2d(
+                    acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
    else:
-        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
-        acc = acc + recv
+        # ── Multi-cube mode: full mesh reduce + inter-SIP + broadcast ──

-    # ── Phase 2: col reduce N → S on rightmost column ──
-    if col == cube_w - 1:
-        if row == 0:
-            tl.send(dir="S", src=acc)
-        elif row < cube_h - 1:
-            recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
+        # Phase 1: row reduce W → E
+        if col == 0:
+            tl.send(dir="E", src=acc)
+        elif col < cube_w - 1:
+            recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
            acc = acc + recv
-            tl.send(dir="S", src=acc)
+            tl.send(dir="E", src=acc)
        else:
-            recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
+            recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
            acc = acc + recv

-    # ── Phase 3: inter-SIP exchange on root cube ──
-    root_cube = (cube_h - 1) * cube_w + (cube_w - 1)
-    if cube_id == root_cube and n_sips > 1:
-        if sip_topo_kind == SIP_TOPO_RING:
-            acc = _inter_sip_ring(acc, n_sips, n_elem, tl)
-        elif sip_topo_kind == SIP_TOPO_TORUS:
-            acc = _inter_sip_torus_2d(acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
-        elif sip_topo_kind == SIP_TOPO_MESH:
-            acc = _inter_sip_mesh_2d(acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
+        # Phase 2: col reduce N → S on rightmost column
+        if col == cube_w - 1:
+            if row == 0:
+                tl.send(dir="S", src=acc)
+            elif row < cube_h - 1:
+                recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
+                acc = acc + recv
+                tl.send(dir="S", src=acc)
+            else:
+                recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
+                acc = acc + recv

-    # ── Phase 4: col broadcast S → N on rightmost column ──
-    if col == cube_w - 1:
-        if row == cube_h - 1:
-            tl.send(dir="N", src=acc)
-        elif row > 0:
-            acc = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
-            tl.send(dir="N", src=acc)
+        # Phase 3: inter-SIP exchange on root cube
+        root_cube = (cube_h - 1) * cube_w + (cube_w - 1)
+        if cube_id == root_cube and n_sips > 1:
+            if sip_topo_kind == SIP_TOPO_RING:
+                acc = _inter_sip_ring(acc, n_sips, n_elem, tl)
+            elif sip_topo_kind == SIP_TOPO_TORUS:
+                acc = _inter_sip_torus_2d(
+                    acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
+            elif sip_topo_kind == SIP_TOPO_MESH:
+                acc = _inter_sip_mesh_2d(
+                    acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
+
+        # Phase 4: col broadcast S → N on rightmost column
+        if col == cube_w - 1:
+            if row == cube_h - 1:
+                tl.send(dir="N", src=acc)
+            elif row > 0:
+                acc = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
+                tl.send(dir="N", src=acc)
+            else:
+                acc = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
+
+        # Phase 5: row broadcast E → W
+        if col == cube_w - 1:
+            tl.send(dir="W", src=acc)
+        elif col > 0:
+            acc = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
+            tl.send(dir="W", src=acc)
        else:
-            acc = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
-
-    # ── Phase 5: row broadcast E → W ──
-    if col == cube_w - 1:
-        tl.send(dir="W", src=acc)
-    elif col > 0:
-        acc = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
-        tl.send(dir="W", src=acc)
-    else:
-        acc = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
+            acc = tl.recv(dir="E", shape=(n_elem,), dtype="f16")

    tl.store(pe_addr, acc)

@@ -221,6 +221,8 @@ def install_ipcq(

    _OPPOSITE_DIR = {
        "E": "W", "W": "E", "N": "S", "S": "N",
+        "intra_E": "intra_W", "intra_W": "intra_E",
+        "intra_N": "intra_S", "intra_S": "intra_N",
        "global_E": "global_W", "global_W": "global_E",
        "global_N": "global_S", "global_S": "global_N",
    }
@@ -1,22 +1,24 @@
-"""SFR configuration for intercube + inter-SIP IPCQ wiring.
+"""SFR configuration for the full IPCQ hardware wiring.

-Provides ``configure_sfr_intercube_multisip`` which programs PE_IPCQ
-neighbor tables for:
+Installs PE_IPCQ neighbor tables modeling the physical hardware.
+Wiring is independent of DPPolicy / kernel choice — the kernel decides
+at runtime which links to use.

-  1. Intercube within each SIP — pe0 of every cube connects to pe0 of
-     its N/S/E/W mesh neighbors (no wrap-around).
-  2. Inter-SIP on ALL cubes — pe0 of cube_c on sip_A connects to pe0 of
-     cube_c on each peer SIP, using ``global_E``/``global_W`` (ring) or
-     ``global_N``/``global_S``/``global_E``/``global_W`` (mesh/torus)
-     direction labels.  Wiring all cubes allows the kernel to
-     dynamically elect the root cube at runtime.
+Direction label namespaces (disjoint):

-SIP-level topology is read from ``topology.yaml`` →
-``system.sips.topology`` (e.g. ``ring_1d``, ``mesh_2d``).
-Intercube mesh dimensions come from ``sip.cube_mesh.w/h``.
+  - Intra-cube PE-to-PE:   ``intra_N / intra_S / intra_E / intra_W``
+    Logical 2×4 PE grid within a cube (no wrap):

-Internally delegates to ``install_ipcq`` with a computed ``rank_to_pe``
-(pe0-only) and a closure-captured ``neighbors()`` function.
+         Row 0:  pe0  pe1  pe2  pe3
+         Row 1:  pe4  pe5  pe6  pe7
+
+  - Intercube same-lane:   ``N / S / E / W``
+    ``pe_i of cube_A ↔ pe_i of cube_B`` across the 4×4 cube mesh
+    (no wrap). Every PE i ∈ [0..7] wired independently.
+
+  - Inter-SIP same-(cube, pe): ``global_N / global_S / global_E / global_W``
+    ``pe_i of cube_c on sip_A ↔ pe_i of cube_c on sip_B`` per
+    ``topology.yaml → system.sips.topology``.
 """
 from __future__ import annotations

@@ -27,12 +29,46 @@ from kernbench.ccl.install import install_ipcq
 from kernbench.ccl.topologies import _BUILTIN as _TOPO_BUILTINS


+# ── Intra-cube 2×4 PE grid ───────────────────────────────────────────
+
+_PE_GRID_COLS = 4
+_PE_GRID_ROWS = 2
+_PES_PER_CUBE = _PE_GRID_COLS * _PE_GRID_ROWS  # 8
+
+
+def _intra_cube_neighbors(pe: int) -> dict[str, int]:
+    """Logical 2×4 PE grid neighbors within a cube (no wrap).
+
+    Returns directions in the ``intra_*`` namespace.
+    """
+    row, col = divmod(pe, _PE_GRID_COLS)
+    nbrs: dict[str, int] = {}
+    if col < _PE_GRID_COLS - 1:
+        nbrs["intra_E"] = row * _PE_GRID_COLS + (col + 1)
+    if col > 0:
+        nbrs["intra_W"] = row * _PE_GRID_COLS + (col - 1)
+    if row < _PE_GRID_ROWS - 1:
+        nbrs["intra_S"] = (row + 1) * _PE_GRID_COLS + col
+    if row > 0:
+        nbrs["intra_N"] = (row - 1) * _PE_GRID_COLS + col
+    return nbrs
+
+
+# ── Public entry point ───────────────────────────────────────────────
+
+
 def configure_sfr_intercube_multisip(
    engine: Any,
    spec: dict,
    cfg: dict,
 ) -> dict[str, Any]:
-    """Wire IPCQ for intercube (pe0, mesh) + inter-SIP (pe0, all cubes).
+    """Wire the full IPCQ hardware model.
+
+    Every PE on every cube on every SIP gets neighbor table entries for:
+
+      - intra-cube (2×4 grid) in the ``intra_*`` namespace
+      - intercube same-lane (4×4 cube mesh, no wrap) in ``N/S/E/W``
+      - inter-SIP same-(cube, pe) in ``global_*``

    Args:
        engine: GraphEngine with ``_components``.
@@ -46,48 +82,71 @@ def configure_sfr_intercube_multisip(
    mesh_w = int(cm["w"])
    mesh_h = int(cm["h"])
    n_cubes = mesh_w * mesh_h
-    n_sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
-    sip_topology = str(
-        spec.get("system", {}).get("sips", {}).get("topology", "ring_1d")
-    )
+    sips_cfg = spec.get("system", {}).get("sips", {})
+    n_sips = int(sips_cfg.get("count", 1))
+    sip_topology = str(sips_cfg.get("topology", "ring_1d"))
+    sip_w = sips_cfg.get("w")
+    sip_h = sips_cfg.get("h")
+    sip_w = int(sip_w) if sip_w is not None else None
+    sip_h = int(sip_h) if sip_h is not None else None

    if sip_topology not in _TOPO_BUILTINS:
        raise ValueError(
            f"Unknown sip topology '{sip_topology}'. "
            f"Available: {list(_TOPO_BUILTINS)}"
        )
-    sip_topo_fn = _TOPO_BUILTINS[sip_topology]
+    _sip_topo_fn_raw = _TOPO_BUILTINS[sip_topology]

-    world_size = n_sips * n_cubes
+    def sip_topo_fn(rank: int, ws: int) -> dict:
+        if sip_w is not None and sip_h is not None:
+            try:
+                return _sip_topo_fn_raw(rank, ws, w=sip_w, h=sip_h)
+            except TypeError:
+                pass
+        return _sip_topo_fn_raw(rank, ws)
+
+    pes_per_cube = _PES_PER_CUBE
+    world_size = n_sips * n_cubes * pes_per_cube
    pe_idx_to_pe: list[tuple[int, int, int]] = [
-        (sip, cube, 0)
+        (sip, cube, pe)
        for sip in range(n_sips)
        for cube in range(n_cubes)
+        for pe in range(pes_per_cube)
    ]

+    def _pe_idx(sip: int, cube: int, pe: int) -> int:
+        return (sip * n_cubes + cube) * pes_per_cube + pe
+
    def _neighbors(pe_idx: int, ws: int, _base: dict) -> dict[str, int]:
-        sip = pe_idx // n_cubes
-        cube = pe_idx % n_cubes
+        tmp = pe_idx
+        pe = tmp % pes_per_cube
+        tmp //= pes_per_cube
+        cube = tmp % n_cubes
+        sip = tmp // n_cubes
        row = cube // mesh_w
        col = cube % mesh_w

        nbrs: dict[str, int] = {}

-        # Intercube within SIP (mesh, no wrap-around)
-        if col < mesh_w - 1:
-            nbrs["E"] = sip * n_cubes + (row * mesh_w + col + 1)
-        if col > 0:
-            nbrs["W"] = sip * n_cubes + (row * mesh_w + col - 1)
-        if row < mesh_h - 1:
-            nbrs["S"] = sip * n_cubes + ((row + 1) * mesh_w + col)
-        if row > 0:
-            nbrs["N"] = sip * n_cubes + ((row - 1) * mesh_w + col)
+        # ── Intra-cube (intra_N/S/E/W) ──
+        for d, peer_pe in _intra_cube_neighbors(pe).items():
+            nbrs[d] = _pe_idx(sip, cube, peer_pe)

-        # Inter-SIP on ALL cubes
+        # ── Intercube same-lane (N/S/E/W, 4×4 no wrap) ──
+        if col < mesh_w - 1:
+            nbrs["E"] = _pe_idx(sip, row * mesh_w + (col + 1), pe)
+        if col > 0:
+            nbrs["W"] = _pe_idx(sip, row * mesh_w + (col - 1), pe)
+        if row < mesh_h - 1:
+            nbrs["S"] = _pe_idx(sip, (row + 1) * mesh_w + col, pe)
+        if row > 0:
+            nbrs["N"] = _pe_idx(sip, (row - 1) * mesh_w + col, pe)
+
+        # ── Inter-SIP same-(cube, pe) (global_*) ──
        if n_sips > 1:
            sip_nbrs = sip_topo_fn(sip, n_sips)
            for d, peer_sip in sip_nbrs.items():
-                nbrs[f"global_{d}"] = peer_sip * n_cubes + cube
+                nbrs[f"global_{d}"] = _pe_idx(peer_sip, cube, pe)

        return nbrs

@@ -33,23 +33,41 @@ def ring_1d_unidir(rank: int, world_size: int) -> NeighborMap:
    return {"E": (rank + 1) % world_size}


-def mesh_2d(rank: int, world_size: int) -> NeighborMap:
-    """Square 2D mesh (N/S/E/W).
-
-    Layout: rank = row * side + col, with side = sqrt(world_size).
-    Wrap-around (torus) on all four edges.
-    """
+def _resolve_2d_dims(
+    world_size: int, w: int | None, h: int | None, name: str,
+) -> tuple[int, int]:
+    if w is not None and h is not None:
+        if w * h != world_size:
+            raise ValueError(
+                f"{name}: w*h ({w}*{h}) != world_size ({world_size})"
+            )
+        return w, h
    side = int(round(world_size ** 0.5))
    if side * side != world_size:
        raise ValueError(
-            f"mesh_2d requires square world_size, got {world_size}"
+            f"{name} requires square world_size or explicit w,h, "
+            f"got {world_size}"
        )
-    r, c = divmod(rank, side)
+    return side, side
+
+
+def mesh_2d(
+    rank: int, world_size: int,
+    w: int | None = None, h: int | None = None,
+) -> NeighborMap:
+    """2D mesh (N/S/E/W) with wrap-around on all four edges.
+
+    Layout: rank = row * w + col. When w, h are given, supports
+    rectangular (e.g. 2x3) layouts. Otherwise falls back to square
+    side = sqrt(world_size).
+    """
+    w, h = _resolve_2d_dims(world_size, w, h, "mesh_2d")
+    r, c = divmod(rank, w)
    return {
-        "N": ((r - 1) % side) * side + c,
-        "S": ((r + 1) % side) * side + c,
-        "W": r * side + (c - 1) % side,
-        "E": r * side + (c + 1) % side,
+        "N": ((r - 1) % h) * w + c,
+        "S": ((r + 1) % h) * w + c,
+        "W": r * w + (c - 1) % w,
+        "E": r * w + (c + 1) % w,
    }


@@ -73,36 +91,30 @@ def tree_binary(rank: int, world_size: int) -> NeighborMap:
    return n


-def torus_2d(rank: int, world_size: int) -> NeighborMap:
-    """Square 2D torus (N/S/E/W) with wrap-around on all edges.
-
-    Alias for mesh_2d (which already wraps). Explicit name for clarity
-    when used as a SIP-level topology.
-    """
-    return mesh_2d(rank, world_size)
+def torus_2d(
+    rank: int, world_size: int,
+    w: int | None = None, h: int | None = None,
+) -> NeighborMap:
+    """2D torus (N/S/E/W) with wrap-around on all edges. Alias for mesh_2d."""
+    return mesh_2d(rank, world_size, w=w, h=h)


-def mesh_2d_no_wrap(rank: int, world_size: int) -> NeighborMap:
-    """Square 2D mesh (N/S/E/W) WITHOUT wrap-around.
-
-    Edge nodes have fewer neighbors (no wrapping). Used for SIP-level
-    topologies where physical links don't wrap.
-    """
-    side = int(round(world_size ** 0.5))
-    if side * side != world_size:
-        raise ValueError(
-            f"mesh_2d_no_wrap requires square world_size, got {world_size}"
-        )
-    r, c = divmod(rank, side)
+def mesh_2d_no_wrap(
+    rank: int, world_size: int,
+    w: int | None = None, h: int | None = None,
+) -> NeighborMap:
+    """2D mesh (N/S/E/W) WITHOUT wrap-around. Supports rectangular dims."""
+    w, h = _resolve_2d_dims(world_size, w, h, "mesh_2d_no_wrap")
+    r, c = divmod(rank, w)
    n: NeighborMap = {}
    if r > 0:
-        n["N"] = (r - 1) * side + c
-    if r < side - 1:
-        n["S"] = (r + 1) * side + c
+        n["N"] = (r - 1) * w + c
+    if r < h - 1:
+        n["S"] = (r + 1) * w + c
    if c > 0:
-        n["W"] = r * side + (c - 1)
-    if c < side - 1:
-        n["E"] = r * side + (c + 1)
+        n["W"] = r * w + (c - 1)
+    if c < w - 1:
+        n["E"] = r * w + (c + 1)
    return n


@@ -23,7 +23,7 @@ def _hbm_pa(sip: int, cube: int, pe_id: int, spec: dict) -> int:
    mm = spec["cube"]["memory_map"]
    slice_bytes = mm["hbm_total_gb_per_cube"] * (1 << 30) // mm["hbm_slices_per_cube"]
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -58,7 +58,18 @@ class IoCpuComponent(ComponentBase):
            self._pending[key] = (expected, received, parent_done)

    def _dispatch_to_m_cpus(self, env: simpy.Environment, txn: Any) -> Generator:
-        """Fan out sub-Transactions to target cube M_CPUs, wait for responses."""
+        """Fan out sub-Transactions to target cube M_CPUs, wait for responses.
+
+        ADR-0009 D5 (extended): for KernelLaunchMsg, stamp a single global
+        target_start_ns = env.now + max(IO_CPU → any target PE_CPU path
+        latency across all target cubes). M_CPU passes this value through
+        unchanged; every PE in every cube yields until the same sim-time
+        before beginning kernel execution. Without this, cross-cube
+        launches would have each cube's M_CPU compute its own per-cube
+        barrier relative to its local env.now, leaving PEs on different
+        cubes out of sync (the "h3/h4 dispatch-offset artifact").
+        """
+        import dataclasses
        from kernbench.runtime_api.kernel import KernelLaunchMsg, MemoryReadMsg, MemoryWriteMsg

        request = txn.request
@@ -72,10 +83,60 @@ class IoCpuComponent(ComponentBase):
            txn.done.succeed()
            return

+        # For KernelLaunchMsg, compute the global barrier once here so
+        # every downstream PE_CPU uses the same target_start_ns.
+        if isinstance(request, KernelLaunchMsg):
+            io_overhead = self.ctx.node_overhead_ns.get(self.node.id, 0.0)
+            global_max_latency = 0.0
+            pe_ids = self._resolve_pe_ids(
+                getattr(request, "target_pe", "all")
+            )
+            for sip, cube in cube_targets:
+                try:
+                    m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
+                    io_to_m_path = self.ctx.router.find_node_path(
+                        self.node.id, m_cpu_id,
+                    )
+                except Exception:
+                    continue
+                if len(io_to_m_path) < 2:
+                    continue
+                leg1 = self.ctx.compute_path_latency_ns(
+                    io_to_m_path, nbytes=0,
+                )
+                m_overhead = self.ctx.node_overhead_ns.get(m_cpu_id, 0.0)
+                for pe_id in pe_ids:
+                    pe_cpu_id = (
+                        f"sip{sip}.cube{cube}.pe{pe_id}.pe_cpu"
+                    )
+                    try:
+                        m_to_pe_path = self.ctx.router.find_node_path(
+                            m_cpu_id, pe_cpu_id,
+                        )
+                    except Exception:
+                        continue
+                    if len(m_to_pe_path) < 2:
+                        continue
+                    leg2 = self.ctx.compute_path_latency_ns(
+                        m_to_pe_path, nbytes=0,
+                    )
+                    latency = leg1 + leg2 - io_overhead - m_overhead
+                    if latency > global_max_latency:
+                        global_max_latency = latency
+            request = dataclasses.replace(
+                request,
+                target_start_ns=float(env.now) + global_max_latency,
+            )
+
        # Setup aggregation
        self._pending[request.request_id] = (len(cube_targets), 0, txn.done)

-        # Fan out to each target cube's M_CPU
+        # Fan out to each target cube's M_CPU. Kernel-launch fanout
+        # carries control metadata only; nbytes is forced to 0 for
+        # KernelLaunchMsg so the launch sub-txns do not occupy data-fabric
+        # BW (would otherwise serialize 16 cubes worth of fanout on the
+        # shared first hop and break ADR-0009 D5's barrier prediction).
+        is_kernel_launch = isinstance(request, KernelLaunchMsg)
        for sip, cube in cube_targets:
            try:
                m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
@@ -86,11 +147,25 @@ class IoCpuComponent(ComponentBase):
                continue
            sub_txn = Transaction(
                request=request, path=path, step=0,
-                nbytes=txn.nbytes, done=env.event(),
+                nbytes=0 if is_kernel_launch else txn.nbytes,
+                done=env.event(),
                result_data=txn.result_data,
            )
            yield self.out_ports[path[1]].put(sub_txn.advance())

+    def _resolve_pe_ids(self, target_pe: Any) -> list[int]:
+        """Resolve target_pe → list of PE indices (mirrors M_CPU logic)."""
+        if isinstance(target_pe, int):
+            return [target_pe]
+        if isinstance(target_pe, tuple):
+            return list(target_pe)
+        # "all": all PEs in a cube
+        n_slices = 8
+        if self.ctx and self.ctx.spec:
+            mm = self.ctx.spec.get("cube", {}).get("memory_map", {})
+            n_slices = mm.get("hbm_slices_per_cube", 8)
+        return list(range(n_slices))
+
    def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
        """Return list of (sip, cube) pairs to fan out to."""
        from kernbench.runtime_api.kernel import (
@@ -145,10 +220,10 @@ class IoCpuComponent(ComponentBase):
        return []

    def _cube_from_pa(self, pa_val: int, fallback: int) -> int:
-        """Extract cube_id from a physical address, with fallback."""
+        """Extract die_id from a physical address, with fallback."""
        from kernbench.policy.address.phyaddr import PhysAddr
        try:
-            return PhysAddr.decode(pa_val).cube_id
+            return PhysAddr.decode(pa_val).die_id
        except Exception:
            return fallback

@@ -162,7 +162,11 @@ class MCpuComponent(ComponentBase):
        Routes through find_node_path (M_CPU → NOC → PE_CPU command edges).
        PE_CPU sends ResponseMsg back via NOC → M_CPU on completion.
        Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
+
+        ADR-0009 D5: stamps target_start_ns so every PE in this fanout
+        starts executing at the same env.now regardless of dispatch path.
        """
+        import dataclasses
        request = txn.request
        target_pe = getattr(request, "target_pe", "all")
        cube_prefix = self.node.id.rsplit(".", 1)[0]  # e.g. "sip0.cube0"
@@ -172,9 +176,13 @@ class MCpuComponent(ComponentBase):
            txn.done.succeed()
            return

-        # Fan out to each PE_CPU, using response-based aggregation
-        sub_txns: list[Transaction] = []
-        n_dispatched = 0
+        # Resolve per-PE paths. If IO_CPU already stamped a global
+        # target_start_ns (ADR-0009 D5 extended), pass it through
+        # unchanged so every PE across every cube uses the same barrier.
+        # Otherwise (e.g. direct-to-M_CPU launch in a unit test) compute
+        # a per-cube barrier from env.now.
+        per_pe: list[tuple[int, list[str], float]] = []
+        max_latency = 0.0
        for pe_id in pe_ids:
            pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
            try:
@@ -183,8 +191,24 @@ class MCpuComponent(ComponentBase):
                continue
            if len(path) < 2:
                continue
+            latency = self.ctx.compute_path_latency_ns(path, nbytes=0)
+            per_pe.append((pe_id, path, latency))
+            if latency > max_latency:
+                max_latency = latency
+
+        if getattr(request, "target_start_ns", None) is not None:
+            stamped_request = request
+        else:
+            stamped_request = dataclasses.replace(
+                request, target_start_ns=float(env.now) + max_latency,
+            )
+
+        # Fan out to each PE_CPU, using response-based aggregation
+        sub_txns: list[Transaction] = []
+        n_dispatched = 0
+        for pe_id, path, _lat in per_pe:
            sub_txn = Transaction(
-                request=request, path=path, step=0,
+                request=stamped_request, path=path, step=0,
                nbytes=0, done=env.event(),
            )
            yield self.out_ports[path[1]].put(sub_txn.advance())
@@ -204,16 +228,21 @@ class MCpuComponent(ComponentBase):
        yield all_done
        del self._parent_txns[request.request_id]

-        # Aggregate PE-internal metrics (max across PEs)
+        # Aggregate PE-internal metrics (max across PEs and across cubes).
+        # Multiple M_CPUs share the same result_data dict via IO_CPU fanout;
+        # merge against the existing value so cubes don't clobber each other.
        pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
        if pe_exec_values:
-            txn.result_data["pe_exec_ns"] = max(pe_exec_values)
+            cur = txn.result_data.get("pe_exec_ns", 0.0) or 0.0
+            txn.result_data["pe_exec_ns"] = max(cur, max(pe_exec_values))
        dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
        if dma_values:
-            txn.result_data["dma_ns"] = max(dma_values)
+            cur = txn.result_data.get("dma_ns", 0.0) or 0.0
+            txn.result_data["dma_ns"] = max(cur, max(dma_values))
        compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
        if compute_values:
-            txn.result_data["compute_ns"] = max(compute_values)
+            cur = txn.result_data.get("compute_ns", 0.0) or 0.0
+            txn.result_data["compute_ns"] = max(cur, max(compute_values))

        # Send aggregate response on reverse command path back to IO_CPU
        reverse_path = list(reversed(txn.path))
@@ -95,6 +95,13 @@ class PeCpuComponent(ComponentBase):
        request = txn.request
        yield from self.run(env, 0)

+        # ADR-0009 D5: synchronized launch barrier. If M_CPU stamped a
+        # target_start_ns, wait until then so every PE in this launch
+        # begins pe_exec measurement at the same simulated time.
+        target_start = getattr(request, "target_start_ns", None)
+        if target_start is not None and target_start > env.now:
+            yield env.timeout(float(target_start) - env.now)
+
        kernel_fn = get_kernel(request.kernel_ref.name)
        num_programs = self._derive_num_programs(request)
        kernel_args = self._unpack_kernel_args(request)
@@ -186,13 +186,37 @@ class PeDmaComponent(PeEngineBase):
    # ── IPCQ inbound (fabric → PE_DMA → MemoryStore + PE_IPCQ) ──────

    def _handle_ipcq_inbound(self, env: simpy.Environment, txn: Any) -> Generator:
-        """At destination PE_DMA: atomically write data and forward metadata.
+        """At destination PE_DMA: pay terminal drain, then atomically write
+        data and forward metadata.
+
+        ADR-0023 D9 (drain at inbound terminal): the Transaction carries
+        ``drain_ns = nbytes / bottleneck_bw_on_path`` stamped by the sender
+        PE_DMA. Like every other Transaction terminal in the simulator (see
+        ``ComponentBase._forward_txn``), this drain must be paid when the
+        Transaction reaches its destination. SRC-side ``tl.send`` is
+        fire-and-forget — it never yields on ``sub_done`` — so paying the
+        drain here does NOT delay the sender. What it DOES delay is the
+        IpcqMetaArrival forwarded below: that delay is the only signal
+        ``tl.recv`` on DST blocks on, which is exactly the desired
+        semantics — "send dispatches and returns; recv waits until the
+        bytes have actually landed in its inbox".
+
+        The drain MUST be paid before the atomic block — inserting a yield
+        inside would break invariant I6.

        I6 (MUST): no SimPy yield between MemoryStore.write and the
        IpcqMetaArrival put into PE_IPCQ.
        """
        from kernbench.common.ipcq_types import IpcqMetaArrival

+        # Pay terminal BW drain before the atomic write/metadata forward.
+        # Without this, IPCQ effectively got fabric bandwidth for free at
+        # the terminal (only intermediate-hop overhead_ns was charged),
+        # making IPCQ lower than raw DMA at large sizes in benchmarks.
+        drain = getattr(txn, "drain_ns", 0.0)
+        if drain > 0:
+            yield env.timeout(drain)
+
        token = txn.request

        # ── ATOMIC: do not introduce yield between these two operations ──
@@ -278,7 +302,16 @@ class PeDmaComponent(PeEngineBase):
            dma_res = self._dma_write if is_write else self._dma_read
            assert dma_res is not None

-            pa = PhysAddr.decode(addr)
+            # Translate VA → PA via MMU (same logic as non-pipeline path)
+            target_pa = addr
+            if self._mmu is not None:
+                from kernbench.policy.address.pe_mmu import PageFault
+                try:
+                    target_pa = self._mmu.translate(addr)
+                except PageFault:
+                    target_pa = addr  # fallback: treat as PA directly
+
+            pa = PhysAddr.decode(target_pa)
            dst_node = self.ctx.resolver.resolve(pa)
            path = self.ctx.router.find_path(self._pe_prefix, dst_node)
            drain_ns = self.ctx.compute_drain_ns(path, nbytes)
@@ -290,7 +323,7 @@ class PeDmaComponent(PeEngineBase):
                    correlation_id="pipeline",
                    request_id=f"tile_{token.tile_id}",
                    src_sip=0, src_cube=0, src_pe=0,
-                    dst_pa=addr, nbytes=nbytes,
+                    dst_pa=target_pa, nbytes=nbytes,
                    is_write=is_write,
                )
                sub_txn = Transaction(
@@ -338,9 +338,13 @@ class PeIpcqComponent(ComponentBase):
                nbytes=req.result_data.get("nbytes", 0),
            )

-        # Fast path credit return — bottleneck BW based latency
-        env.process(
-            self._delayed_credit_send(env, direction, qp["peer_credit_store"], qp["my_tail"])
+        # Credit return: recv blocks on credit-emit so the protocol cost
+        # (full path latency to deliver the credit metadata back to the
+        # sender) is reflected in the recv's pe_exec_ns. Models the IPCQ
+        # control-plane completing the consume-acknowledgement before
+        # recv returns to the kernel.
+        yield from self._delayed_credit_send(
+            env, direction, qp["peer_credit_store"], qp["my_tail"],
        )

        if not req.done.triggered:
@@ -455,7 +459,12 @@ class PeIpcqComponent(ComponentBase):
        yield peer_credit_store.put(meta)

    def _credit_latency_ns(self, direction: str) -> float:
-        """Compute credit fast path latency = credit_size / bottleneck_bw.
+        """Full path latency for the credit-return packet.
+
+        Pays per-node overhead + edge prop + drain along the same fabric
+        the data took. PathRouter.find_path() auto-appends ".pe_dma" to
+        the source only, so the destination MUST be spelled with the
+        explicit ".pe_dma" suffix.

        Falls back to 0 when ctx/router is unavailable (unit-test mode).
        """
@@ -463,10 +472,12 @@ class PeIpcqComponent(ComponentBase):
            return 0.0
        qp = self._queue_pairs[direction]
        peer = qp["peer"]
-        peer_pe_prefix = f"sip{peer.sip}.cube{peer.cube}.pe{peer.pe}"
+        peer_pe_dma = f"sip{peer.sip}.cube{peer.cube}.pe{peer.pe}.pe_dma"
        try:
-            path = self.ctx.router.find_path(self._pe_prefix, peer_pe_prefix)
-            return self.ctx.compute_drain_ns(path, self._credit_size_bytes)
+            path = self.ctx.router.find_path(self._pe_prefix, peer_pe_dma)
+            return self.ctx.compute_path_latency_ns(
+                path, self._credit_size_bytes,
+            )
        except Exception:
            return 0.0

@@ -26,6 +26,9 @@ class ComponentContext:
    spec: dict = field(default_factory=dict)  # topology spec (cube layout, PE count, etc.)
    memory_store: Any = None  # MemoryStore for Phase 1 data-aware execution (ADR-0020)
    op_logger: Any = None     # OpLogger for Phase 1 op recording (ADR-0020)
+    # node_id -> overhead_ns (ADR-0009 D5: used by M_CPU to compute per-PE
+    # dispatch latency when stamping target_start_ns on KernelLaunchMsg).
+    node_overhead_ns: dict[str, float] = field(default_factory=dict)

    def get_shared_resource(
        self, env: simpy.Environment, key: str, capacity: int = 1,
@@ -52,3 +55,19 @@ class ComponentContext:
        if min_bw == float("inf"):
            return 0.0
        return nbytes / min_bw
+
+    def compute_path_latency_ns(self, path: list[str], nbytes: int = 0) -> float:
+        """Formula latency along path: wire + per-node overhead + drain.
+
+        ADR-0009 D5: M_CPU uses this to compute per-PE dispatch latency
+        when stamping target_start_ns on KernelLaunchMsg fanout.
+        """
+        total = 0.0
+        for i in range(len(path) - 1):
+            edge = self.edge_map.get((path[i], path[i + 1]))
+            if edge:
+                total += edge.distance_mm * self.ns_per_mm
+        for node_id in path:
+            total += self.node_overhead_ns.get(node_id, 0.0)
+        total += self.compute_drain_ns(path, nbytes)
+        return total
@@ -58,7 +58,13 @@ class IoCpuComponent(ComponentBase):
            self._pending[key] = (expected, received, parent_done)

    def _dispatch_to_m_cpus(self, env: simpy.Environment, txn: Any) -> Generator:
-        """Fan out sub-Transactions to target cube M_CPUs, wait for responses."""
+        """Fan out sub-Transactions to target cube M_CPUs, wait for responses.
+
+        ADR-0009 D5 (extended): stamp a global target_start_ns on
+        KernelLaunchMsg so every PE across every target cube starts at
+        the same env.now. See the non-legacy builtin for full rationale.
+        """
+        import dataclasses
        from kernbench.runtime_api.kernel import KernelLaunchMsg, MemoryReadMsg, MemoryWriteMsg

        request = txn.request
@@ -72,10 +78,53 @@ class IoCpuComponent(ComponentBase):
            txn.done.succeed()
            return

+        if isinstance(request, KernelLaunchMsg):
+            io_overhead = self.ctx.node_overhead_ns.get(self.node.id, 0.0)
+            global_max_latency = 0.0
+            pe_ids = self._resolve_pe_ids(
+                getattr(request, "target_pe", "all")
+            )
+            for sip, cube in cube_targets:
+                try:
+                    m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
+                    io_to_m_path = self.ctx.router.find_node_path(
+                        self.node.id, m_cpu_id,
+                    )
+                except Exception:
+                    continue
+                if len(io_to_m_path) < 2:
+                    continue
+                leg1 = self.ctx.compute_path_latency_ns(
+                    io_to_m_path, nbytes=0,
+                )
+                m_overhead = self.ctx.node_overhead_ns.get(m_cpu_id, 0.0)
+                for pe_id in pe_ids:
+                    pe_cpu_id = (
+                        f"sip{sip}.cube{cube}.pe{pe_id}.pe_cpu"
+                    )
+                    try:
+                        m_to_pe_path = self.ctx.router.find_node_path(
+                            m_cpu_id, pe_cpu_id,
+                        )
+                    except Exception:
+                        continue
+                    if len(m_to_pe_path) < 2:
+                        continue
+                    leg2 = self.ctx.compute_path_latency_ns(
+                        m_to_pe_path, nbytes=0,
+                    )
+                    latency = leg1 + leg2 - io_overhead - m_overhead
+                    if latency > global_max_latency:
+                        global_max_latency = latency
+            request = dataclasses.replace(
+                request,
+                target_start_ns=float(env.now) + global_max_latency,
+            )
+
        # Setup aggregation
        self._pending[request.request_id] = (len(cube_targets), 0, txn.done)

-        # Fan out to each target cube's M_CPU
+        is_kernel_launch = isinstance(request, KernelLaunchMsg)
        for sip, cube in cube_targets:
            try:
                m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
@@ -86,11 +135,24 @@ class IoCpuComponent(ComponentBase):
                continue
            sub_txn = Transaction(
                request=request, path=path, step=0,
-                nbytes=txn.nbytes, done=env.event(),
+                nbytes=0 if is_kernel_launch else txn.nbytes,
+                done=env.event(),
                result_data=txn.result_data,
            )
            yield self.out_ports[path[1]].put(sub_txn.advance())

+    def _resolve_pe_ids(self, target_pe: Any) -> list[int]:
+        """Resolve target_pe → list of PE indices (mirrors M_CPU logic)."""
+        if isinstance(target_pe, int):
+            return [target_pe]
+        if isinstance(target_pe, tuple):
+            return list(target_pe)
+        n_slices = 8
+        if self.ctx and self.ctx.spec:
+            mm = self.ctx.spec.get("cube", {}).get("memory_map", {})
+            n_slices = mm.get("hbm_slices_per_cube", 8)
+        return list(range(n_slices))
+
    def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
        """Return list of (sip, cube) pairs to fan out to."""
        from kernbench.runtime_api.kernel import (
@@ -145,10 +207,10 @@ class IoCpuComponent(ComponentBase):
        return []

    def _cube_from_pa(self, pa_val: int, fallback: int) -> int:
-        """Extract cube_id from a physical address, with fallback."""
+        """Extract die_id from a physical address, with fallback."""
        from kernbench.policy.address.phyaddr import PhysAddr
        try:
-            return PhysAddr.decode(pa_val).cube_id
+            return PhysAddr.decode(pa_val).die_id
        except Exception:
            return fallback

@@ -162,7 +162,11 @@ class MCpuComponent(ComponentBase):
        Routes through find_node_path (M_CPU → NOC → PE_CPU command edges).
        PE_CPU sends ResponseMsg back via NOC → M_CPU on completion.
        Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
+
+        ADR-0009 D5: stamps target_start_ns so every PE in this fanout
+        starts executing at the same env.now regardless of dispatch path.
        """
+        import dataclasses
        request = txn.request
        target_pe = getattr(request, "target_pe", "all")
        cube_prefix = self.node.id.rsplit(".", 1)[0]  # e.g. "sip0.cube0"
@@ -172,9 +176,10 @@ class MCpuComponent(ComponentBase):
            txn.done.succeed()
            return

-        # Fan out to each PE_CPU, using response-based aggregation
-        sub_txns: list[Transaction] = []
-        n_dispatched = 0
+        # Resolve per-PE paths. If IO_CPU already stamped a global
+        # target_start_ns (ADR-0009 D5 extended), pass it through.
+        per_pe: list[tuple[int, list[str], float]] = []
+        max_latency = 0.0
        for pe_id in pe_ids:
            pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
            try:
@@ -183,8 +188,24 @@ class MCpuComponent(ComponentBase):
                continue
            if len(path) < 2:
                continue
+            latency = self.ctx.compute_path_latency_ns(path, nbytes=0)
+            per_pe.append((pe_id, path, latency))
+            if latency > max_latency:
+                max_latency = latency
+
+        if getattr(request, "target_start_ns", None) is not None:
+            stamped_request = request
+        else:
+            stamped_request = dataclasses.replace(
+                request, target_start_ns=float(env.now) + max_latency,
+            )
+
+        # Fan out to each PE_CPU, using response-based aggregation
+        sub_txns: list[Transaction] = []
+        n_dispatched = 0
+        for pe_id, path, _lat in per_pe:
            sub_txn = Transaction(
-                request=request, path=path, step=0,
+                request=stamped_request, path=path, step=0,
                nbytes=0, done=env.event(),
            )
            yield self.out_ports[path[1]].put(sub_txn.advance())
@@ -204,16 +225,21 @@ class MCpuComponent(ComponentBase):
        yield all_done
        del self._parent_txns[request.request_id]

-        # Aggregate PE-internal metrics (max across PEs)
+        # Aggregate PE-internal metrics (max across PEs and across cubes).
+        # Multiple M_CPUs share the same result_data dict via IO_CPU fanout;
+        # merge against the existing value so cubes don't clobber each other.
        pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
        if pe_exec_values:
-            txn.result_data["pe_exec_ns"] = max(pe_exec_values)
+            cur = txn.result_data.get("pe_exec_ns", 0.0) or 0.0
+            txn.result_data["pe_exec_ns"] = max(cur, max(pe_exec_values))
        dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
        if dma_values:
-            txn.result_data["dma_ns"] = max(dma_values)
+            cur = txn.result_data.get("dma_ns", 0.0) or 0.0
+            txn.result_data["dma_ns"] = max(cur, max(dma_values))
        compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
        if compute_values:
-            txn.result_data["compute_ns"] = max(compute_values)
+            cur = txn.result_data.get("compute_ns", 0.0) or 0.0
+            txn.result_data["compute_ns"] = max(cur, max(compute_values))

        # Send aggregate response on reverse command path back to IO_CPU
        reverse_path = list(reversed(txn.path))
@@ -71,6 +71,13 @@ class PeCpuComponent(ComponentBase):
        request = txn.request
        yield from self.run(env, 0)

+        # ADR-0009 D5: synchronized launch barrier. If M_CPU stamped a
+        # target_start_ns, wait until then so every PE in this launch
+        # begins pe_exec measurement at the same simulated time.
+        target_start = getattr(request, "target_start_ns", None)
+        if target_start is not None and target_start > env.now:
+            yield env.timeout(float(target_start) - env.now)
+
        kernel_fn = get_kernel(request.kernel_ref.name)
        num_programs = self._derive_num_programs(request)
        kernel_args = self._unpack_kernel_args(request)
@@ -89,11 +89,10 @@ class _FreeList:

 class PEMemAllocator:
    def __init__(
-        self, rack_id: int, sip_id: int, cube_id: int, pe_id: int, cfg: AddressConfig,
+        self, sip_id: int, die_id: int, pe_id: int, cfg: AddressConfig,
    ) -> None:
-        self._rack_id = rack_id
        self._sip_id = sip_id
-        self._cube_id = cube_id
+        self._die_id = die_id
        self._pe_id = pe_id
        self._cfg = cfg
        self._hbm = _FreeList(cfg.hbm_slice_bytes)
@@ -108,7 +107,7 @@ class PEMemAllocator:
                f"available {self._cfg.hbm_slice_bytes - self._hbm.used}"
            )
        return PhysAddr.pe_hbm_addr(
-            rack_id=self._rack_id, sip_id=self._sip_id, cube_id=self._cube_id,
+            sip_id=self._sip_id, die_id=self._die_id,
            pe_id=self._pe_id, pe_local_hbm_offset=offset,
            slice_size_bytes=self._cfg.hbm_slice_bytes,
        )
@@ -128,7 +127,7 @@ class PEMemAllocator:
                f"available {self._cfg.tcm_allocatable_bytes - self._tcm.used}"
            )
        return PhysAddr.pe_tcm_addr(
-            rack_id=self._rack_id, sip_id=self._sip_id, cube_id=self._cube_id,
+            sip_id=self._sip_id, die_id=self._die_id,
            pe_id=self._pe_id, tcm_offset=offset,
        )

@@ -19,7 +19,14 @@ class PageFault(Exception):


 class PeMMU:
-    """Per-PE MMU with page-aligned VA→PA translation table.
+    """Per-PE MMU with sub-page-capable VA→PA translation table.
+
+    Each page-table entry is a list of (start_in_page, end_in_page,
+    pa_at_offset_zero) regions. This is a SIMULATOR STOPGAP — real MMUs
+    store one PA per page-table entry. Sub-page regions exist here so
+    DPPolicy layouts that shard below page granularity (e.g. 128 B
+    payloads with 4 KB pages) don't silently mis-route through last-
+    write-wins overwrites. Memory note: project_mmu_subpage_stopgap.md.

    Args:
        page_size: Page size in bytes (default 2 MB).
@@ -34,7 +41,11 @@ class PeMMU:
        self._page_size = page_size
        self._page_shift = (page_size - 1).bit_length()
        self._page_mask = page_size - 1
-        self._table: dict[int, int] = {}  # va_page_number → pa_page_base
+        # vpn → list of (start_in_page, end_in_page, pa_at_offset_zero).
+        # pa_at_offset_zero is the PA that offset 0 of the page would map
+        # to under this region — i.e. translate(off) = pa_at_offset_zero
+        # + off when start <= off < end.
+        self._table: dict[int, list[tuple[int, int, int]]] = {}
        self._overhead_ns = overhead_ns

    @property
@@ -46,21 +57,67 @@ class PeMMU:
        return len(self._table)

    def map(self, va: int, pa: int, size: int) -> None:
-        """Register VA→PA mapping for a contiguous range."""
-        for off in range(0, size, self._page_size):
-            vpn = (va + off) >> self._page_shift
-            self._table[vpn] = pa + off
+        """Register VA→PA mapping for a contiguous range.
+
+        Sub-page-aware: a single page can hold multiple disjoint regions,
+        each pointing to a different PA. Later map() calls APPEND a new
+        region; on overlap with an existing region, the new region wins
+        for the overlapping offsets (translate iterates in reverse so the
+        last write takes precedence — matches legacy single-PA behavior
+        when a full page is re-mapped).
+        """
+        end_va = va + size
+        cur = va
+        while cur < end_va:
+            vpn = cur >> self._page_shift
+            page_base_va = vpn << self._page_shift
+            page_end_va = page_base_va + self._page_size
+            region_start = cur - page_base_va
+            region_end = min(end_va, page_end_va) - page_base_va
+            # PA seen at offset 0 of page if this region's mapping covered it
+            pa_at_offset_zero = pa + (cur - va) - region_start
+            self._table.setdefault(vpn, []).append(
+                (region_start, region_end, pa_at_offset_zero)
+            )
+            cur = page_base_va + region_end

    def unmap(self, va: int, size: int) -> None:
-        """Remove VA mapping for a contiguous range."""
-        for off in range(0, size, self._page_size):
-            vpn = (va + off) >> self._page_shift
-            self._table.pop(vpn, None)
+        """Remove VA mapping for a contiguous range.
+
+        Drops any region whose extent is contained within the unmapped
+        range. Partial overlaps (region straddles the range boundary)
+        are left in place — caller is expected to unmap on the same
+        boundaries it mapped on.
+        """
+        end_va = va + size
+        cur = va
+        while cur < end_va:
+            vpn = cur >> self._page_shift
+            page_base_va = vpn << self._page_shift
+            page_end_va = page_base_va + self._page_size
+            unmap_start = cur - page_base_va
+            unmap_end = min(end_va, page_end_va) - page_base_va
+            regions = self._table.get(vpn)
+            if regions is not None:
+                kept = [
+                    r for r in regions
+                    if not (r[0] >= unmap_start and r[1] <= unmap_end)
+                ]
+                if kept:
+                    self._table[vpn] = kept
+                else:
+                    del self._table[vpn]
+            cur = page_base_va + unmap_end

    def translate(self, va: int) -> int:
        """Translate VA to PA. Raises PageFault if unmapped."""
        vpn = va >> self._page_shift
-        pa_page_base = self._table.get(vpn)
-        if pa_page_base is None:
+        regions = self._table.get(vpn)
+        if regions is None:
            raise PageFault(va)
-        return pa_page_base + (va & self._page_mask)
+        offset = va & self._page_mask
+        # Iterate latest-first so newer map() calls win on overlap
+        for start, end, pa_at_offset_zero in reversed(regions):
+            if start <= offset < end:
+                return pa_at_offset_zero + offset
+        raise PageFault(va)
@@ -6,6 +6,47 @@ from typing import Literal

 MAX_51 = (1 << 51) - 1

+# ── Layout constants (ADR-0001 Rev 2) ────────────────────────────────
+# [50:47] sip_id (4)
+# [46:42] die_id (5)
+# [41: 0] local_offset (42)
+_SIP_SHIFT = 47
+_DIE_SHIFT = 42
+_LOCAL_BITS = 42
+_LOCAL_MASK = (1 << _LOCAL_BITS) - 1
+
+# AHBM die: [41:38] MBZ, [37] addr_space, [36:0] sub-address
+_AHBM_SEL_BIT = 37
+_AHBM_LOCAL_USED = 38  # bits actually meaningful for AHBM
+
+# Resource window: [36:34] resource_kind, [33:0] kind_local
+_RES_KIND_SHIFT = 34
+_RES_KIND_MASK = 0x7
+
+# PE_LOCAL: [32:29] pe_id, [28:25] pe_sub_unit, [24:0] sub_offset
+_PE_ID_SHIFT = 29
+_PE_SUB_SHIFT = 25
+_PE_SUB_OFFSET_BITS = 25
+
+# MCPU_LOCAL: [29:25] mcpu_sub_unit, [24:0] sub_offset
+_MCPU_SUB_SHIFT = 25
+
+# CUBE_SRAM: [24:0] sram_offset
+_SRAM_OFFSET_BITS = 25
+
+# IOCHIPLET: [41:40] MBZ, [39:0] chiplet_offset
+_CHIPLET_LOCAL_BITS = 40
+_IOCPU_BOUNDARY = 1 << 31  # 2 GB
+
+# IOCPU: [30:27] iocpu_sub_unit, [26:0] sub_offset
+_IOCPU_SUB_SHIFT = 27
+_IOCPU_SUB_OFFSET_BITS = 27
+
+# die_id ranges
+_AHBM_DIE_MAX = 15
+_CHIPLET_DIE_MIN = 16
+_CHIPLET_DIE_MAX = 20
+

 class PhysAddrError(Exception):
    pass
@@ -22,163 +63,278 @@ def _chk_max(name: str, v: int, maxv: int) -> None:


 class UnitType(IntEnum):
-    PE = 0
-    MCPU = 1
-    SRAM = 2
+    """resource_kind values for AHBM resource window."""
+    PE = 0       # PE_LOCAL
+    MCPU = 1     # MCPU_LOCAL
+    SRAM = 2     # CUBE_SRAM
+
+
+class PESubUnit(IntEnum):
+    PE_CPU_DTCM = 0
+    MATH_ENGINE_DTCM = 1
+    IPCQ = 2
+    PE_CPU_SFR = 3
+    MATH_ENGINE_SFR = 4
+    DMA_ENGINE_SFR = 5
+    PE_TCM = 6
+
+
+class MCPUSubUnit(IntEnum):
+    MCPU_ITCM = 0
+    MCPU_DTCM = 1
+    IPCQ = 2
+    MCPU_SFR = 3
+    MCPU_DMA_SFR = 4
+    MCPU_SRAM = 5
+
+
+class IOCPUSubUnit(IntEnum):
+    IOCPU_ITCM = 0
+    IOCPU_DTCM = 1
+    IPCQ = 2
+    IOCPU_SFR = 3
+    IO_DMA_SFR = 4
+    IO_SRAM = 5


@dataclass(frozen=True)
 class PhysAddr:
-    """
-    51-bit physical address value object.
+    """51-bit physical address value object (ADR-0001 Rev 2).

    Layout:
-      [50:47] rack_id  (4)
-      [46:43] sip_id   (4)
-      [42:38] sip_seg  (5)   # cube_id
-      [37:0]  local_offset (38) => each segment is 256GB
-
-    local_offset:
-      [37] selector: 1 = HBM window (128GB reserved), 0 = PE resource window
+      [50:47] sip_id        (4)   -- 16 SIPs
+      [46:42] die_id        (5)   -- 0..15 AHBM, 16..20 IOCHIPLET
+      [41: 0] local_offset  (42)  -- 4 TB per die
    """

-    rack_id: int
    sip_id: int
-    sip_seg: int
+    die_id: int
    local_offset: int

-    kind: Literal["hbm", "pe_resource", "raw"] = "raw"
-    cube_id: int = 0
+    kind: Literal["hbm", "pe_resource", "iocpu", "ual", "raw"] = "raw"
    unit_type: UnitType = UnitType.PE
    pe_id: int = 0
-    ext: int = 0
+    pe_sub_unit: int = 0
    sub_offset: int = 0
    hbm_offset: int = 0
+    iocpu_sub_unit: int = 0
+    chiplet_offset: int = 0
+    mcpu_sub_unit: int = 0

-    HBM_WINDOW_BYTES = 1 << 37  # 128GB
+    HBM_WINDOW_BYTES = 1 << 37  # 128 GB
+
+    # ── encode / decode ──────────────────────────────────────────────

    def encode(self) -> int:
-        _chk_range("rack_id", self.rack_id, 4)
        _chk_range("sip_id", self.sip_id, 4)
-        _chk_range("sip_seg", self.sip_seg, 5)
-        _chk_range("local_offset", self.local_offset, 38)
-        addr = (self.rack_id << 47) | (self.sip_id << 43) | (self.sip_seg << 38) | self.local_offset
-        if not (0 <= addr <= MAX_51):
-            raise PhysAddrError("address exceeds 51-bit space")
+        _chk_range("die_id", self.die_id, 5)
+        _chk_range("local_offset", self.local_offset, _LOCAL_BITS)
+        # MBZ enforcement
+        if self.die_id <= _AHBM_DIE_MAX:
+            mbz_top = (self.local_offset >> _AHBM_LOCAL_USED) & 0xF
+            if mbz_top != 0:
+                raise PhysAddrError("AHBM local_offset bits [41:38] must be zero")
+        elif _CHIPLET_DIE_MIN <= self.die_id <= _CHIPLET_DIE_MAX:
+            mbz_top = (self.local_offset >> _CHIPLET_LOCAL_BITS) & 0x3
+            if mbz_top != 0:
+                raise PhysAddrError("IOCHIPLET local_offset bits [41:40] must be zero")
+        addr = (self.sip_id << _SIP_SHIFT) | (self.die_id << _DIE_SHIFT) | self.local_offset
        return addr

    @staticmethod
    def decode(addr: int) -> PhysAddr:
        if not (0 <= addr <= MAX_51):
            raise PhysAddrError("addr must be a 51-bit value")
-        rack = (addr >> 47) & 0xF
-        sip_id = (addr >> 43) & 0xF
-        sip_seg = (addr >> 38) & 0x1F
-        off = addr & ((1 << 38) - 1)
-        cube_id = sip_seg
-        sel = (off >> 37) & 0x1
-        if sel == 1:
-            hbm_offset = int(off & ((1 << 37) - 1))
-            return PhysAddr(
-                rack_id=rack,
-                sip_id=sip_id,
-                sip_seg=sip_seg,
-                local_offset=off,
-                kind="hbm",
-                cube_id=cube_id,
-                hbm_offset=hbm_offset,
-            )
-        # PE resource decode
-        raw_ut = int((off >> 34) & 0x7)
-        try:
-            unit_type = UnitType(raw_ut)
-        except ValueError:
-            raise PhysAddrError(f"unknown unit_type: {raw_ut}") from None
-        pe_id = int((off >> 30) & 0xF)
-        ext = int((off >> 29) & 0x1)
-        sub_offset = int(off & ((1 << 29) - 1))
-        return PhysAddr(
-            rack_id=rack,
-            sip_id=sip_id,
-            sip_seg=sip_seg,
-            local_offset=off,
-            kind="pe_resource",
-            cube_id=cube_id,
-            unit_type=unit_type,
-            pe_id=pe_id,
-            ext=ext,
-            sub_offset=sub_offset,
-            hbm_offset=0,
-        )
+        sip_id = (addr >> _SIP_SHIFT) & 0xF
+        die_id = (addr >> _DIE_SHIFT) & 0x1F
+        local_offset = addr & _LOCAL_MASK
+
+        if die_id <= _AHBM_DIE_MAX:
+            return PhysAddr._decode_ahbm(sip_id, die_id, local_offset)
+        elif _CHIPLET_DIE_MIN <= die_id <= _CHIPLET_DIE_MAX:
+            return PhysAddr._decode_chiplet(sip_id, die_id, local_offset)
+        else:
+            raise PhysAddrError(f"die_id {die_id} is reserved (21..31)")

    @staticmethod
-    def hbm_addr(*, rack_id: int, sip_id: int, cube_id: int, hbm_offset: int) -> PhysAddr:
-        _chk_max("cube_id", cube_id, 31)
-        _chk_range("hbm_offset", hbm_offset, 37)
-        sip_seg = cube_id
-        local_offset = (1 << 37) | int(hbm_offset)
+    def _decode_ahbm(sip_id: int, die_id: int, local_offset: int) -> PhysAddr:
+        sel = (local_offset >> _AHBM_SEL_BIT) & 0x1
+        if sel == 1:
+            hbm_offset = int(local_offset & ((1 << _AHBM_SEL_BIT) - 1))
+            return PhysAddr(
+                sip_id=sip_id, die_id=die_id, local_offset=local_offset,
+                kind="hbm", hbm_offset=hbm_offset,
+            )
+        # Resource window
+        res_kind = int((local_offset >> _RES_KIND_SHIFT) & _RES_KIND_MASK)
+        try:
+            unit_type = UnitType(res_kind)
+        except ValueError:
+            raise PhysAddrError(f"unknown resource_kind: {res_kind}") from None
+
+        if unit_type == UnitType.PE:
+            pe_id = int((local_offset >> _PE_ID_SHIFT) & 0xF)
+            pe_sub = int((local_offset >> _PE_SUB_SHIFT) & 0xF)
+            sub_off = int(local_offset & ((1 << _PE_SUB_OFFSET_BITS) - 1))
+            return PhysAddr(
+                sip_id=sip_id, die_id=die_id, local_offset=local_offset,
+                kind="pe_resource", unit_type=unit_type,
+                pe_id=pe_id, pe_sub_unit=pe_sub, sub_offset=sub_off,
+            )
+        elif unit_type == UnitType.MCPU:
+            mcpu_sub = int((local_offset >> _MCPU_SUB_SHIFT) & 0x1F)
+            sub_off = int(local_offset & ((1 << _PE_SUB_OFFSET_BITS) - 1))
+            return PhysAddr(
+                sip_id=sip_id, die_id=die_id, local_offset=local_offset,
+                kind="pe_resource", unit_type=unit_type,
+                mcpu_sub_unit=mcpu_sub, sub_offset=sub_off,
+            )
+        else:  # SRAM
+            sub_off = int(local_offset & ((1 << _SRAM_OFFSET_BITS) - 1))
+            return PhysAddr(
+                sip_id=sip_id, die_id=die_id, local_offset=local_offset,
+                kind="pe_resource", unit_type=unit_type,
+                sub_offset=sub_off,
+            )
+
+    @staticmethod
+    def _decode_chiplet(sip_id: int, die_id: int, local_offset: int) -> PhysAddr:
+        chip_off = local_offset & ((1 << _CHIPLET_LOCAL_BITS) - 1)
+        if chip_off < _IOCPU_BOUNDARY:
+            iocpu_sub = int((chip_off >> _IOCPU_SUB_SHIFT) & 0xF)
+            sub_off = int(chip_off & ((1 << _IOCPU_SUB_OFFSET_BITS) - 1))
+            return PhysAddr(
+                sip_id=sip_id, die_id=die_id, local_offset=local_offset,
+                kind="iocpu", chiplet_offset=chip_off,
+                iocpu_sub_unit=iocpu_sub, sub_offset=sub_off,
+            )
+        else:
+            return PhysAddr(
+                sip_id=sip_id, die_id=die_id, local_offset=local_offset,
+                kind="ual", chiplet_offset=chip_off,
+            )
+
+    # ── AHBM factory methods ────────────────────────────────────────
+
+    @staticmethod
+    def hbm_addr(*, sip_id: int, die_id: int, hbm_offset: int) -> PhysAddr:
+        _chk_max("die_id", die_id, _AHBM_DIE_MAX)
+        _chk_range("hbm_offset", hbm_offset, _AHBM_SEL_BIT)
+        local_offset = (1 << _AHBM_SEL_BIT) | int(hbm_offset)
        return PhysAddr(
-            rack_id=rack_id,
-            sip_id=sip_id,
-            sip_seg=sip_seg,
-            local_offset=local_offset,
-            kind="hbm",
-            cube_id=cube_id,
-            hbm_offset=int(hbm_offset),
+            sip_id=sip_id, die_id=die_id, local_offset=local_offset,
+            kind="hbm", hbm_offset=int(hbm_offset),
        )

    @staticmethod
    def pe_hbm_addr(
-        *,
-        rack_id: int,
-        sip_id: int,
-        cube_id: int,
-        pe_id: int,
-        pe_local_hbm_offset: int,
-        slice_size_bytes: int,
+        *, sip_id: int, die_id: int,
+        pe_id: int, pe_local_hbm_offset: int, slice_size_bytes: int,
    ) -> PhysAddr:
-        _chk_max("cube_id", cube_id, 31)
+        _chk_max("die_id", die_id, _AHBM_DIE_MAX)
        _chk_range("pe_id", pe_id, 4)
        if not (0 <= pe_local_hbm_offset < slice_size_bytes):
            raise PhysAddrError("pe_local_hbm_offset out of PE local slice range")
        hbm_offset = int(pe_id) * int(slice_size_bytes) + int(pe_local_hbm_offset)
        if not (0 <= hbm_offset < PhysAddr.HBM_WINDOW_BYTES):
            raise PhysAddrError("HBM offset exceeds reserved 128GB window")
-        return PhysAddr.hbm_addr(
-            rack_id=rack_id, sip_id=sip_id, cube_id=cube_id, hbm_offset=hbm_offset
-        )
+        return PhysAddr.hbm_addr(sip_id=sip_id, die_id=die_id, hbm_offset=hbm_offset)

    @staticmethod
    def hbm_pe_id(hbm_offset: int, slice_size_bytes: int) -> int:
        return hbm_offset // slice_size_bytes

    @staticmethod
-    def cube_sram_addr(
-        *, rack_id: int, sip_id: int, cube_id: int, sram_offset: int,
+    def pe_tcm_addr(
+        *, sip_id: int, die_id: int, pe_id: int, tcm_offset: int,
    ) -> PhysAddr:
-        _chk_max("cube_id", cube_id, 31)
-        _chk_range("sram_offset", sram_offset, 29)
-        sip_seg = cube_id
-        local_offset = (UnitType.SRAM << 34) | sram_offset
-        return PhysAddr(
-            rack_id=rack_id, sip_id=sip_id, sip_seg=sip_seg,
-            local_offset=local_offset,
-            kind="pe_resource", cube_id=cube_id,
-            unit_type=UnitType.SRAM, sub_offset=sram_offset,
+        return PhysAddr.pe_resource_addr(
+            sip_id=sip_id, die_id=die_id, pe_id=pe_id,
+            pe_sub_unit=PESubUnit.PE_TCM, sub_offset=tcm_offset,
        )

    @staticmethod
-    def pe_tcm_addr(
-        *, rack_id: int, sip_id: int, cube_id: int, pe_id: int, tcm_offset: int,
+    def pe_resource_addr(
+        *, sip_id: int, die_id: int, pe_id: int,
+        pe_sub_unit: int, sub_offset: int,
    ) -> PhysAddr:
-        _chk_max("cube_id", cube_id, 31)
+        _chk_max("die_id", die_id, _AHBM_DIE_MAX)
        _chk_range("pe_id", pe_id, 4)
-        _chk_range("tcm_offset", tcm_offset, 29)
-        sip_seg = cube_id
-        local_offset = (UnitType.PE << 34) | (pe_id << 30) | tcm_offset
-        return PhysAddr(
-            rack_id=rack_id, sip_id=sip_id, sip_seg=sip_seg,
-            local_offset=local_offset,
-            kind="pe_resource", cube_id=cube_id,
-            unit_type=UnitType.PE, pe_id=pe_id, sub_offset=tcm_offset,
+        _chk_range("pe_sub_unit", pe_sub_unit, 4)
+        _chk_range("sub_offset", sub_offset, _PE_SUB_OFFSET_BITS)
+        local_offset = (
+            (UnitType.PE << _RES_KIND_SHIFT)
+            | (pe_id << _PE_ID_SHIFT)
+            | (pe_sub_unit << _PE_SUB_SHIFT)
+            | sub_offset
+        )
+        return PhysAddr(
+            sip_id=sip_id, die_id=die_id, local_offset=local_offset,
+            kind="pe_resource", unit_type=UnitType.PE,
+            pe_id=pe_id, pe_sub_unit=pe_sub_unit, sub_offset=sub_offset,
+        )
+
+    @staticmethod
+    def cube_sram_addr(
+        *, sip_id: int, die_id: int, sram_offset: int,
+    ) -> PhysAddr:
+        _chk_max("die_id", die_id, _AHBM_DIE_MAX)
+        _chk_range("sram_offset", sram_offset, _SRAM_OFFSET_BITS)
+        local_offset = (UnitType.SRAM << _RES_KIND_SHIFT) | sram_offset
+        return PhysAddr(
+            sip_id=sip_id, die_id=die_id, local_offset=local_offset,
+            kind="pe_resource", unit_type=UnitType.SRAM, sub_offset=sram_offset,
+        )
+
+    @staticmethod
+    def mcpu_resource_addr(
+        *, sip_id: int, die_id: int, mcpu_sub_unit: int, sub_offset: int,
+    ) -> PhysAddr:
+        _chk_max("die_id", die_id, _AHBM_DIE_MAX)
+        _chk_range("mcpu_sub_unit", mcpu_sub_unit, 5)
+        _chk_range("sub_offset", sub_offset, _PE_SUB_OFFSET_BITS)
+        local_offset = (
+            (UnitType.MCPU << _RES_KIND_SHIFT)
+            | (mcpu_sub_unit << _MCPU_SUB_SHIFT)
+            | sub_offset
+        )
+        return PhysAddr(
+            sip_id=sip_id, die_id=die_id, local_offset=local_offset,
+            kind="pe_resource", unit_type=UnitType.MCPU,
+            mcpu_sub_unit=mcpu_sub_unit, sub_offset=sub_offset,
+        )
+
+    # ── IOCHIPLET factory methods ────────────────────────────────────
+
+    @staticmethod
+    def iocpu_resource_addr(
+        *, sip_id: int, die_id: int, iocpu_sub_unit: int, sub_offset: int,
+    ) -> PhysAddr:
+        _chk_max("die_id", die_id, _CHIPLET_DIE_MAX)
+        if die_id < _CHIPLET_DIE_MIN:
+            raise PhysAddrError(
+                f"die_id {die_id} is not an IOCHIPLET "
+                f"(must be {_CHIPLET_DIE_MIN}..{_CHIPLET_DIE_MAX})"
+            )
+        _chk_range("iocpu_sub_unit", iocpu_sub_unit, 4)
+        _chk_range("sub_offset", sub_offset, _IOCPU_SUB_OFFSET_BITS)
+        chiplet_offset = (iocpu_sub_unit << _IOCPU_SUB_SHIFT) | sub_offset
+        if chiplet_offset >= _IOCPU_BOUNDARY:
+            raise PhysAddrError("IOCPU region overflow (must be < 2 GB)")
+        return PhysAddr(
+            sip_id=sip_id, die_id=die_id, local_offset=chiplet_offset,
+            kind="iocpu", chiplet_offset=chiplet_offset,
+            iocpu_sub_unit=iocpu_sub_unit, sub_offset=sub_offset,
+        )
+
+    @staticmethod
+    def ual_addr(*, sip_id: int, die_id: int, ual_offset: int) -> PhysAddr:
+        _chk_max("die_id", die_id, _CHIPLET_DIE_MAX)
+        if die_id < _CHIPLET_DIE_MIN:
+            raise PhysAddrError(f"die_id {die_id} is not an IOCHIPLET")
+        chiplet_offset = _IOCPU_BOUNDARY + ual_offset
+        _chk_range("chiplet_offset", chiplet_offset, _CHIPLET_LOCAL_BITS)
+        return PhysAddr(
+            sip_id=sip_id, die_id=die_id, local_offset=chiplet_offset,
+            kind="ual", chiplet_offset=chiplet_offset,
        )
@@ -27,16 +27,16 @@ class AddressResolver:

    def resolve(self, addr: PhysAddr) -> str:
        s = addr.sip_id
-        c = addr.cube_id
+        d = addr.die_id
        if addr.kind == "hbm":
-            node_id = f"sip{s}.cube{c}.hbm_ctrl"
+            node_id = f"sip{s}.cube{d}.hbm_ctrl"
        elif addr.kind == "pe_resource":
            if addr.unit_type == UnitType.PE:
-                node_id = f"sip{s}.cube{c}.pe{addr.pe_id}.pe_tcm"
+                node_id = f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
            elif addr.unit_type == UnitType.SRAM:
-                node_id = f"sip{s}.cube{c}.sram"
+                node_id = f"sip{s}.cube{d}.sram"
            elif addr.unit_type == UnitType.MCPU:
-                node_id = f"sip{s}.cube{c}.m_cpu"
+                node_id = f"sip{s}.cube{d}.m_cpu"
            else:
                raise RoutingError(f"unsupported unit_type: {addr.unit_type}")
        else:
@@ -385,7 +385,7 @@ class RuntimeContext:
            for cube_id in range(cubes_per_sip):
                for pe_id in range(pes_per_cube):
                    self._allocators[(sip_id, cube_id, pe_id)] = PEMemAllocator(
-                        rack_id=0, sip_id=sip_id, cube_id=cube_id, pe_id=pe_id, cfg=cfg,
+                        sip_id=sip_id, die_id=cube_id, pe_id=pe_id, cfg=cfg,
                    )

        # Initialize VA allocator (MMU mappings are installed via fabric MmuMapMsg)
@@ -113,7 +113,18 @@ class AhbmCCLBackend:
            )
        n_elem = shards[0].nbytes // tensor.itemsize
        kernel_fn = self._algo_module.kernel
-        kernel_args = self._algo_module.kernel_args(self._world_size, n_elem)
+        # Derive effective cube dims from tensor's actual shard placement
+        # (may differ from topology mesh when TP uses fewer cubes).
+        sip0_cubes = sorted({s.cube for s in shards if s.sip == shards[0].sip})
+        eff_n_cubes = len(sip0_cubes) if sip0_cubes else 1
+        if eff_n_cubes == 1:
+            eff_cube_w, eff_cube_h = 1, 1
+        else:
+            eff_cube_w, eff_cube_h = self._cube_w, self._cube_h
+        kernel_args = self._algo_module.kernel_args(
+            self._world_size, n_elem,
+            cube_w=eff_cube_w, cube_h=eff_cube_h,
+        )

        # Resolve sip_rank from the current greenlet's bound rank
        from greenlet import getcurrent as _gc
@@ -90,6 +90,11 @@ class KernelLaunchMsg:
    args: tuple[KernelArg, ...]
    target_cubes: tuple[int, ...] | Literal["all"] = "all"
    target_pe: int | tuple[int, ...] | Literal["all"] = "all"
+    # ADR-0009 D5: synchronized kernel start. When set, each PE_CPU yields
+    # until env.now >= target_start_ns before beginning kernel execution,
+    # so every PE in a launch starts at the same simulated time regardless
+    # of its M_CPU dispatch path length. Stamped by M_CPU fan-out.
+    target_start_ns: float | None = None
    msg_type: Literal["kernel_launch"] = "kernel_launch"


@@ -67,6 +67,10 @@ class GraphEngine:
            spec=graph.spec,
            memory_store=self._memory_store,
            op_logger=self._op_logger,
+            node_overhead_ns={
+                nid: float(n.attrs.get("overhead_ns", 0.0))
+                for nid, n in graph.nodes.items()
+            },
        )
        self._components: dict[str, ComponentBase] = {
            node_id: ComponentRegistry.create(node, overrides, ctx)
@@ -212,7 +212,7 @@ def _generate_probe_h2d(graph, edge_map) -> list[dict]:
    t_offset = 0.0
    for rid, (name, cube, hops) in enumerate(cases):
        pa = PhysAddr.pe_hbm_addr(
-            rack_id=0, sip_id=0, cube_id=cube, pe_id=0,
+            sip_id=0, die_id=cube, pe_id=0,
            pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
        )
        dst_node = resolver.resolve(pa)
@@ -256,7 +256,7 @@ def _generate_probe_d2h(graph, edge_map) -> list[dict]:
    t_offset = 0.0
    for rid, (name, cube, hops) in enumerate(cases):
        pa = PhysAddr.pe_hbm_addr(
-            rack_id=0, sip_id=0, cube_id=cube, pe_id=0,
+            sip_id=0, die_id=cube, pe_id=0,
            pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
        )
        dst_node = resolver.resolve(pa)
@@ -310,7 +310,7 @@ def _generate_probe_pe_dma(graph, edge_map) -> list[dict]:
    t_offset = 0.0
    for rid, (name, sip, src_cube, src_pe, dst_cube, dst_pe) in enumerate(cases):
        pa = PhysAddr.pe_hbm_addr(
-            rack_id=0, sip_id=sip, cube_id=dst_cube, pe_id=dst_pe,
+            sip_id=sip, die_id=dst_cube, pe_id=dst_pe,
            pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
        )
        dst_node = resolver.resolve(pa)
@@ -0,0 +1,37 @@
+algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
+intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,3508.4249999999993
+intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,3515.55
+intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,3525.0499999999975
+intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,3544.049999999992
+intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3667.049999999992
+intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3837.049999999992
+intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4177.049999999992
+intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,4857.049999999959
+intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,6217.049999999945
+intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,8937.049999999937
+intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,14377.049999999872
+intercube_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,19817.049999999872
+intercube_allreduce,ring_1d,6,8,16,256,3073.1299999999937
+intercube_allreduce,ring_1d,6,32,64,1024,3079.8799999999947
+intercube_allreduce,ring_1d,6,64,128,2048,3088.879999999992
+intercube_allreduce,ring_1d,6,128,256,4096,3106.8799999999865
+intercube_allreduce,ring_1d,6,512,1024,16384,3225.8799999999865
+intercube_allreduce,ring_1d,6,1024,2048,32768,3391.8799999999865
+intercube_allreduce,ring_1d,6,2048,4096,65536,3723.8799999999865
+intercube_allreduce,ring_1d,6,4096,8192,131072,4387.879999999965
+intercube_allreduce,ring_1d,6,8192,16384,262144,5715.879999999957
+intercube_allreduce,ring_1d,6,16384,32768,524288,8371.879999999932
+intercube_allreduce,ring_1d,6,32768,65536,1048576,13683.879999999903
+intercube_allreduce,ring_1d,6,49152,98304,1572864,18995.879999999917
+intercube_allreduce,torus_2d,6,8,16,256,2190.4799999999923
+intercube_allreduce,torus_2d,6,32,64,1024,2196.479999999993
+intercube_allreduce,torus_2d,6,64,128,2048,2204.4799999999905
+intercube_allreduce,torus_2d,6,128,256,4096,2220.479999999985
+intercube_allreduce,torus_2d,6,512,1024,16384,2325.479999999985
+intercube_allreduce,torus_2d,6,1024,2048,32768,2471.479999999985
+intercube_allreduce,torus_2d,6,2048,4096,65536,2763.479999999985
+intercube_allreduce,torus_2d,6,4096,8192,131072,3347.4799999999777
+intercube_allreduce,torus_2d,6,8192,16384,262144,4515.4799999999705
+intercube_allreduce,torus_2d,6,16384,32768,524288,6851.479999999952
+intercube_allreduce,torus_2d,6,32768,65536,1048576,11523.479999999923
+intercube_allreduce,torus_2d,6,49152,98304,1572864,16195.479999999952
@@ -7,11 +7,45 @@ stateful/SimPy-event-consuming and MUST NOT be shared).
 """
 from __future__ import annotations

+import os
+
 import pytest

 from kernbench.topology.builder import resolve_topology


+def pytest_sessionfinish(session, exitstatus):
+    """Aggregate parametrized sweep rows into combined CSV + PNG plots.
+
+    Runs on the controller node only (xdist worker processes set
+    ``PYTEST_XDIST_WORKER``; we skip those). Idempotent — does nothing
+    if no sweep rows are present (e.g., when the sweep was filtered out).
+    """
+    if os.environ.get("PYTEST_XDIST_WORKER"):
+        return
+    import importlib.util
+    import sys
+    from pathlib import Path
+
+    mod_path = Path(__file__).parent / "test_allreduce_multidevice.py"
+    if not mod_path.exists():
+        return
+    spec = importlib.util.spec_from_file_location(
+        "_test_allreduce_multidevice_for_aggregate", mod_path,
+    )
+    if spec is None or spec.loader is None:
+        return
+    mod = importlib.util.module_from_spec(spec)
+    sys.modules[spec.name] = mod
+    try:
+        spec.loader.exec_module(mod)
+        agg = getattr(mod, "_aggregate_sweep_plots", None)
+        if agg is not None:
+            agg()
+    except Exception as e:
+        print(f"[conftest] sweep aggregation failed: {e}")
+
+
@pytest.fixture(scope="session")
 def topology():
    """Session-scoped parsed topology (immutable graph + spec).
@@ -0,0 +1,81 @@
+hop,label,size_bytes,path,total_ns
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,ipcq,31.1399999999976
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,raw,12.019999999996799
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,ipcq,32.6399999999976
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,raw,13.019999999996799
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,ipcq,34.1399999999976
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,raw,14.019999999996799
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,ipcq,35.6399999999976
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,raw,15.019999999996799
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,ipcq,38.6399999999976
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,raw,17.0199999999968
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,ipcq,41.6399999999976
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,raw,19.0199999999968
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,ipcq,53.6399999999976
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,raw,27.0199999999968
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,ipcq,77.6399999999976
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,raw,43.0199999999968
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,ipcq,125.64000000000306
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,raw,75.02000000000407
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,ipcq,149.64000000000306
+h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,raw,91.02000000000407
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,ipcq,31.1399999999976
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,raw,12.019999999996799
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,ipcq,32.6399999999976
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,raw,13.019999999996799
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,ipcq,34.1399999999976
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,raw,14.019999999996799
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,ipcq,35.6399999999976
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,raw,15.019999999996799
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,ipcq,38.6399999999976
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,raw,17.0199999999968
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,ipcq,41.6399999999976
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,raw,19.0199999999968
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,ipcq,53.6399999999976
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,raw,27.0199999999968
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,ipcq,77.6399999999976
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,raw,43.0199999999968
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,ipcq,125.64000000000306
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,raw,75.02000000000407
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,ipcq,149.64000000000306
+h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,raw,91.02000000000407
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,ipcq,67.15999999999804
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,raw,68.53999999999724
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,ipcq,68.65999999999804
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,raw,70.03999999999724
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,ipcq,70.15999999999804
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,raw,71.53999999999724
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,ipcq,71.65999999999804
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,raw,73.03999999999724
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,ipcq,74.65999999999804
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,raw,76.03999999999724
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,ipcq,77.65999999999804
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,raw,79.03999999999724
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,ipcq,89.65999999999804
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,raw,91.03999999999724
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,ipcq,113.65999999999804
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,raw,115.03999999999724
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,ipcq,161.65999999999985
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,raw,163.04000000000087
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,ipcq,185.65999999999985
+h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,raw,187.04000000000087
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,ipcq,87.15999999999804
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,raw,88.53999999999724
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,ipcq,88.65999999999804
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,raw,90.03999999999724
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,ipcq,90.15999999999804
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,raw,91.53999999999724
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,ipcq,91.65999999999804
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,raw,93.03999999999724
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,ipcq,94.65999999999804
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,raw,96.03999999999724
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,ipcq,97.65999999999804
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,raw,99.03999999999724
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,ipcq,109.65999999999804
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,raw,111.03999999999724
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,ipcq,133.65999999999804
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,raw,135.03999999999724
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,ipcq,181.65999999999985
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,raw,183.04000000000087
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,ipcq,205.65999999999985
+h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,raw,207.04000000000087
@@ -149,7 +149,7 @@ def _make_tuple_allocators(
 ) -> dict[tuple[int, int, int], PEMemAllocator]:
    return {
        (s, c, p): PEMemAllocator(
-            rack_id=0, sip_id=s, cube_id=c, pe_id=p, cfg=_CFG,
+            sip_id=s, die_id=c, pe_id=p, cfg=_CFG,
        )
        for s in range(num_sips)
        for c in range(num_cubes)
@@ -22,13 +22,23 @@ from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
 from kernbench.policy.placement.dp import DPPolicy


-def _sip_topo_dims(sip_topo: str, n_sips: int) -> tuple[int, int]:
+def _sip_topo_dims(
+    sip_topo: str, n_sips: int,
+    spec_w: int | None = None, spec_h: int | None = None,
+) -> tuple[int, int]:
    if sip_topo == "ring_1d":
        return (0, 0)
+    if spec_w is not None and spec_h is not None:
+        if spec_w * spec_h != n_sips:
+            raise ValueError(
+                f"sip layout {spec_w}x{spec_h} != n_sips ({n_sips})"
+            )
+        return (spec_w, spec_h)
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError(
-            f"SIP topology '{sip_topo}' requires square n_sips, got {n_sips}"
+            f"SIP topology '{sip_topo}' requires square n_sips or "
+            f"explicit w/h in spec, got {n_sips}"
        )
    return (side, side)

@@ -54,10 +64,13 @@ def run_allreduce(
    topo_name_to_kind = algo_module.TOPO_NAME_TO_KIND

    n_elem = int(cfg.get("n_elem", 8))
-    n_sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
-    sip_topo = str(
-        spec.get("system", {}).get("sips", {}).get("topology", "ring_1d")
-    )
+    sips_cfg = spec.get("system", {}).get("sips", {})
+    n_sips = int(sips_cfg.get("count", 1))
+    sip_topo = str(sips_cfg.get("topology", "ring_1d"))
+    spec_sip_w = sips_cfg.get("w")
+    spec_sip_h = sips_cfg.get("h")
+    spec_sip_w = int(spec_sip_w) if spec_sip_w is not None else None
+    spec_sip_h = int(spec_sip_h) if spec_sip_h is not None else None

    cm = spec["sip"]["cube_mesh"]
    cube_w = int(cm["w"])
@@ -65,7 +78,9 @@ def run_allreduce(
    n_cubes = cube_w * cube_h

    sip_topo_kind = topo_name_to_kind.get(sip_topo, 0)
-    sip_topo_w, sip_topo_h = _sip_topo_dims(sip_topo, n_sips)
+    sip_topo_w, sip_topo_h = _sip_topo_dims(
+        sip_topo, n_sips, spec_w=spec_sip_w, spec_h=spec_sip_h,
+    )

    algo_name = cfg.get("algorithm", "allreduce")
    print(f"\n{'=' * 60}")
@@ -173,18 +188,36 @@ from kernbench.topology.builder import resolve_topology
 TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"

 CONFIGS = [
-    pytest.param("intercube_allreduce", "ring_1d", 2, id="ring_2sip"),
-    pytest.param("intercube_allreduce", "torus_2d", 4, id="torus_4sip"),
-    pytest.param("intercube_allreduce", "mesh_2d_no_wrap", 4, id="mesh_4sip"),
+    pytest.param(
+        "intercube_allreduce", "ring_1d", 6, None, None,
+        id="ring_6sip",
+    ),
+    pytest.param(
+        "intercube_allreduce", "torus_2d", 6, 2, 3,
+        id="torus_6sip_2x3",
+    ),
+    pytest.param(
+        "intercube_allreduce", "mesh_2d_no_wrap", 6, 2, 3,
+        id="mesh_6sip_2x3",
+    ),
 ]


-def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
+def _write_temp_configs(
+    tmp_path, sip_topology, n_sips, algorithm, n_elem_override=None,
+    sip_w=None, sip_h=None,
+):
    """Write temp topology.yaml and ccl.yaml with the given overrides."""
    with open(TOPOLOGY_PATH) as f:
        topo_cfg = yaml.safe_load(f)
    topo_cfg["system"]["sips"]["count"] = n_sips
    topo_cfg["system"]["sips"]["topology"] = sip_topology
+    if sip_w is not None and sip_h is not None:
+        topo_cfg["system"]["sips"]["w"] = int(sip_w)
+        topo_cfg["system"]["sips"]["h"] = int(sip_h)
+    else:
+        topo_cfg["system"]["sips"].pop("w", None)
+        topo_cfg["system"]["sips"].pop("h", None)
    topo_path = tmp_path / "topology.yaml"
    with open(topo_path, "w") as f:
        yaml.dump(topo_cfg, f, default_flow_style=False)
@@ -193,6 +226,15 @@ def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
    with open(ccl_path) as f:
        ccl_cfg = yaml.safe_load(f)
    ccl_cfg["defaults"]["algorithm"] = algorithm
+    if n_elem_override is not None:
+        ccl_cfg.setdefault("algorithms", {}).setdefault(
+            algorithm, {},
+        )["n_elem"] = int(n_elem_override)
+        # Ensure IPCQ slot is big enough for the per-message payload.
+        per_msg_bytes = int(n_elem_override) * 2  # f16
+        default_slot = int(ccl_cfg["defaults"].get("slot_size", 4096))
+        if per_msg_bytes > default_slot:
+            ccl_cfg["defaults"]["slot_size"] = per_msg_bytes
    tmp_ccl = tmp_path / "ccl.yaml"
    with open(tmp_ccl, "w") as f:
        yaml.dump(ccl_cfg, f, default_flow_style=False)
@@ -200,10 +242,15 @@ def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
    return str(topo_path), str(tmp_ccl)


-@pytest.mark.parametrize("algorithm,sip_topology,n_sips", CONFIGS)
-def test_allreduce(tmp_path, algorithm, sip_topology, n_sips):
+@pytest.mark.parametrize(
+    "algorithm,sip_topology,n_sips,sip_w,sip_h", CONFIGS,
+)
+def test_allreduce(
+    tmp_path, algorithm, sip_topology, n_sips, sip_w, sip_h,
+):
    topo_path, ccl_path = _write_temp_configs(
        tmp_path, sip_topology, n_sips, algorithm,
+        sip_w=sip_w, sip_h=sip_h,
    )
    topo = resolve_topology(topo_path)
    engine = GraphEngine(topo.topology_obj, enable_data=True)
@@ -220,3 +267,584 @@ def test_allreduce(tmp_path, algorithm, sip_topology, n_sips):
            algorithm=algorithm, ccl_yaml=ccl_path,
        )
        assert result["ok_cubes"] > 0
+
+
+# ── Latency sweep (parametrized + xdist-friendly) ─────────────────────
+
+# avoid 16 (== n_cubes, dim_map collision). Goes up to 96 KB per PE:
+# bytes_per_pe = n_elem * 2 (f16). 49152 elem * 2 = 96 KB / PE.
+_SWEEP_N_ELEM = [
+    8, 32, 64, 128, 512, 1024, 2048,
+    4096, 8192, 16384, 32768, 49152,
+]
+_ELEM_BYTES_F16 = 2
+
+_SWEEP_TOPOLOGIES = [
+    ("intercube_allreduce", "ring_1d", 6, None, None),
+    ("intercube_allreduce", "torus_2d", 6, 2, 3),
+    ("intercube_allreduce", "mesh_2d_no_wrap", 6, 2, 3),
+]
+
+# Shared on-disk staging dir for parametrized sweep rows. Each
+# parametrized invocation writes one JSON file here; the aggregator
+# (run from conftest.pytest_sessionfinish) reads them and emits the
+# combined CSV + PNG plots.
+_SWEEP_OUT_DIR = Path(__file__).parent / "allreduce_latency_plots"
+_SWEEP_ROWS_DIR = _SWEEP_OUT_DIR / "_rows"
+
+
+def _sweep_params():
+    out = []
+    for algorithm, sip_topology, n_sips, sip_w, sip_h in _SWEEP_TOPOLOGIES:
+        for n_elem in _SWEEP_N_ELEM:
+            out.append(pytest.param(
+                algorithm, sip_topology, n_sips, sip_w, sip_h, n_elem,
+                id=f"{sip_topology}-n_elem{n_elem}",
+            ))
+    return out
+
+
+@pytest.mark.parametrize(
+    "algorithm,sip_topology,n_sips,sip_w,sip_h,n_elem", _sweep_params(),
+)
+def test_allreduce_latency_one(
+    tmp_path, algorithm, sip_topology, n_sips, sip_w, sip_h, n_elem,
+):
+    """One config of the latency sweep. xdist parallelizes across params.
+
+    Writes a single JSON row to ``_SWEEP_ROWS_DIR``. The conftest
+    sessionfinish hook aggregates rows into CSV + plots after all
+    parametrized cases finish.
+    """
+    import json
+
+    topo_path, ccl_path = _write_temp_configs(
+        tmp_path, sip_topology, n_sips, algorithm,
+        sip_w=sip_w, sip_h=sip_h,
+        n_elem_override=n_elem,
+    )
+    topo = resolve_topology(topo_path)
+    engine = GraphEngine(topo.topology_obj, enable_data=True)
+    spec = topo.topology_obj.spec
+
+    with RuntimeContext(
+        engine=engine,
+        target_device=DeviceSelector("all"),
+        correlation_id=f"sweep_{algorithm}_{sip_topology}_{n_elem}",
+        spec=spec,
+    ) as ctx:
+        result = run_allreduce(
+            ctx, engine, spec,
+            algorithm=algorithm, ccl_yaml=ccl_path,
+        )
+        assert result["ok_cubes"] > 0
+
+    pe_exec_vals = [
+        float(tr.get("pe_exec_ns", 0.0) or 0.0)
+        for _, (_, tr) in engine._results.items()
+        if isinstance(tr, dict)
+    ]
+    crit_ns = max(pe_exec_vals) if pe_exec_vals else 0.0
+
+    cm = spec["sip"]["cube_mesh"]
+    n_cubes = int(cm["w"]) * int(cm["h"])
+    bytes_per_sip = n_cubes * n_elem * _ELEM_BYTES_F16
+    bytes_per_pe = n_elem * _ELEM_BYTES_F16
+
+    record = {
+        "algorithm": algorithm,
+        "sip_topology": sip_topology,
+        "n_sips": n_sips,
+        "n_elem": n_elem,
+        "bytes_per_pe": bytes_per_pe,
+        "bytes_per_sip": bytes_per_sip,
+        "latency_ns": crit_ns,
+    }
+
+    _SWEEP_ROWS_DIR.mkdir(parents=True, exist_ok=True)
+    row_path = _SWEEP_ROWS_DIR / f"{sip_topology}_{n_elem}.json"
+    with open(row_path, "w", encoding="utf-8") as f:
+        json.dump(record, f)
+
+
+def _aggregate_sweep_plots() -> bool:
+    """Read all per-config rows and emit CSV + PNG plots.
+
+    Called by ``conftest.pytest_sessionfinish`` (controller node only).
+    Returns True if any rows were aggregated, False otherwise.
+    """
+    import csv
+    import json
+
+    row_files = sorted(_SWEEP_ROWS_DIR.glob("*.json")) \
+        if _SWEEP_ROWS_DIR.exists() else []
+    records: list[dict] = []
+    if row_files:
+        for p in row_files:
+            with open(p, encoding="utf-8") as f:
+                records.append(json.load(f))
+    else:
+        # Fallback: replot from existing summary.csv (skip sweep re-run).
+        summary_path = _SWEEP_OUT_DIR / "summary.csv"
+        if not summary_path.exists():
+            return False
+        with open(summary_path, encoding="utf-8") as f:
+            for row in csv.DictReader(f):
+                records.append({
+                    "algorithm": row["algorithm"],
+                    "sip_topology": row["sip_topology"],
+                    "n_sips": int(row["n_sips"]),
+                    "n_elem": int(row["n_elem"]),
+                    "bytes_per_pe": int(row["bytes_per_pe"]),
+                    "bytes_per_sip": int(row["bytes_per_sip"]),
+                    "latency_ns": float(row["latency_ns"]),
+                })
+    if not records:
+        return False
+
+    import matplotlib.pyplot as plt
+    from matplotlib.ticker import FuncFormatter
+
+    def _fmt_bytes(x, _pos):
+        if x <= 0:
+            return "0"
+        if x >= 1024 * 1024:
+            return f"{x / (1024 * 1024):.0f} MB"
+        if x >= 1024:
+            return f"{x / 1024:.0f} KB"
+        return f"{x:.0f} B"
+
+    _bytes_fmt = FuncFormatter(_fmt_bytes)
+
+    _SWEEP_OUT_DIR.mkdir(parents=True, exist_ok=True)
+    with open(_SWEEP_OUT_DIR / "summary.csv", "w",
+              newline="", encoding="utf-8") as f:
+        w = csv.DictWriter(f, fieldnames=[
+            "algorithm", "sip_topology", "n_sips", "n_elem",
+            "bytes_per_pe", "bytes_per_sip", "latency_ns",
+        ])
+        w.writeheader()
+        for r in sorted(records, key=lambda r: (
+            r["sip_topology"], r["bytes_per_pe"],
+        )):
+            w.writerow(r)
+
+    topologies = sorted({r["sip_topology"] for r in records})
+    for topo_name in topologies:
+        rs = sorted(
+            [r for r in records if r["sip_topology"] == topo_name],
+            key=lambda r: r["bytes_per_pe"],
+        )
+        if not rs:
+            continue
+        xs = [r["bytes_per_pe"] for r in rs]
+        ys = [r["latency_ns"] for r in rs]
+        title = (
+            f"Allreduce latency — {topo_name} "
+            f"(n_sips={rs[0]['n_sips']})"
+        )
+        fig, ax = plt.subplots(figsize=(8, 5))
+        ax.plot(xs, ys, marker="o", color="tab:blue")
+        ax.set_xscale("log", base=2)
+        ax.set_xlabel("Bytes per PE (log scale)")
+        ax.set_ylabel("max pe_exec_ns (critical path)")
+        ax.set_title(title)
+        ax.grid(True, alpha=0.3)
+        ax.xaxis.set_major_formatter(_bytes_fmt)
+        fig.tight_layout()
+        fig.savefig(_SWEEP_OUT_DIR / f"{topo_name}.png", dpi=120)
+        plt.close(fig)
+
+    colors = {"ring_1d": "tab:blue", "torus_2d": "tab:orange",
+              "mesh_2d_no_wrap": "tab:green"}
+    THEORETICAL_TORUS_2D_6SIP_NS = 10600.0
+    fig, ax = plt.subplots(figsize=(9, 6))
+    for topo_name in topologies:
+        rs = sorted(
+            [r for r in records if r["sip_topology"] == topo_name],
+            key=lambda r: r["bytes_per_pe"],
+        )
+        if not rs:
+            continue
+        ax.plot(
+            [r["bytes_per_pe"] for r in rs],
+            [r["latency_ns"] for r in rs],
+            marker="o",
+            label=f"{topo_name} (n_sips={rs[0]['n_sips']})",
+            color=colors.get(topo_name),
+        )
+    ax.axhline(
+        y=THEORETICAL_TORUS_2D_6SIP_NS,
+        color="tab:red", linestyle="--", linewidth=1.5,
+        label=f"theoretical torus_2d (6 SIPs) = "
+              f"{THEORETICAL_TORUS_2D_6SIP_NS:.0f} ns",
+    )
+    BYTES_96KB = 96 * 1024
+    ax.axvline(
+        x=BYTES_96KB, ymin=0, ymax=1,
+        color="tab:red", linestyle=":", linewidth=1.2,
+    )
+    ax.plot(
+        [BYTES_96KB], [THEORETICAL_TORUS_2D_6SIP_NS],
+        marker="x", color="tab:red", markersize=10, markeredgewidth=2,
+    )
+    # Find simulated torus_2d latency at 96 KB (if present) for direct
+    # comparison with the theoretical value.
+    sim_torus_at_96kb = next(
+        (r["latency_ns"] for r in records
+         if r["sip_topology"] == "torus_2d" and r["bytes_per_pe"] == BYTES_96KB),
+        None,
+    )
+    if sim_torus_at_96kb is not None:
+        ax.plot(
+            [BYTES_96KB], [sim_torus_at_96kb],
+            marker="o", color="tab:orange",
+            markersize=10, markeredgecolor="black", markeredgewidth=1.2,
+        )
+        ax.annotate(
+            f"96 KB\n"
+            f"theoretical = {THEORETICAL_TORUS_2D_6SIP_NS:.0f} ns\n"
+            f"simulated   = {sim_torus_at_96kb:.0f} ns",
+            xy=(BYTES_96KB, sim_torus_at_96kb),
+            xytext=(10, -20), textcoords="offset points",
+            color="tab:red", fontsize=9,
+        )
+    else:
+        ax.annotate(
+            f"96 KB\n→ theoretical {THEORETICAL_TORUS_2D_6SIP_NS:.0f} ns",
+            xy=(BYTES_96KB, THEORETICAL_TORUS_2D_6SIP_NS),
+            xytext=(8, -20), textcoords="offset points",
+            color="tab:red", fontsize=9,
+        )
+    ax.set_xscale("log", base=2)
+    ax.set_xlabel("Bytes per PE (log scale)")
+    ax.set_ylabel("max pe_exec_ns (critical path)")
+    ax.set_title("Multi-device allreduce latency by topology")
+    ax.grid(True, alpha=0.3)
+
+    # Drop 128 KB tick (overlaps visually with the explicit 96 KB marker)
+    # and add 96 KB.
+    BYTES_128KB = 128 * 1024
+    existing_ticks = [t for t in ax.get_xticks() if int(t) != BYTES_128KB]
+    if BYTES_96KB not in existing_ticks:
+        existing_ticks.append(BYTES_96KB)
+    ax.set_xticks(sorted(existing_ticks))
+    ax.set_xlim(left=min(r["bytes_per_pe"] for r in records) / 2,
+                right=BYTES_96KB * 1.5)
+    ax.legend()
+    ax.xaxis.set_major_formatter(_bytes_fmt)
+    fig.tight_layout()
+    fig.savefig(_SWEEP_OUT_DIR / "overview.png", dpi=120)
+    plt.close(fig)
+
+    # Cleanup row staging dir so a partial future run doesn't pick up
+    # stale rows.
+    for p in row_files:
+        try:
+            p.unlink()
+        except OSError:
+            pass
+    try:
+        _SWEEP_ROWS_DIR.rmdir()
+    except OSError:
+        pass
+
+    print(f"\nWrote {_SWEEP_OUT_DIR / 'overview.png'} "
+          f"from {len(records)} rows")
+    return True
+
+
+# ── Topology diagram (device-level + cube-level reduction) ────────────
+
+# Convention: "rows × cols" everywhere, row-major rank assignment
+# (rank = row * n_cols + col). For the 2×3 inter-SIP grid, this means
+# 2 rows × 3 columns:  SIP 0 1 2 / SIP 3 4 5.
+
+_PALETTE_BG = "#fafbfd"
+_PALETTE_FRAME = "#3a3f4a"
+_PALETTE_BLUE = "#2c6fb6"
+_PALETTE_GREEN = "#2e8a4e"
+_PALETTE_TEXT = "#1f2530"
+_PALETTE_BOX_FILL = "#eaf2fb"
+_PALETTE_BOX_EDGE = "#2c4a78"
+_PALETTE_ROOT_FILL = "#ffd9b8"
+_PALETTE_ROOT_EDGE = "#bd5a14"
+
+
+def _arrow(ax, xy_from, xy_to, color="black", lw=1.4, alpha=1.0,
+           style="-|>", curve=0.0):
+    from matplotlib.patches import FancyArrowPatch
+    arrow = FancyArrowPatch(
+        xy_from, xy_to,
+        arrowstyle=style, mutation_scale=12,
+        color=color, lw=lw, alpha=alpha,
+        connectionstyle=f"arc3,rad={curve}",
+    )
+    ax.add_patch(arrow)
+
+
+def _draw_sip_box(ax, cx, cy, w, h, label, *, fill=_PALETTE_BOX_FILL,
+                  edge=_PALETTE_BOX_EDGE, text_color=_PALETTE_TEXT,
+                  font=10):
+    from matplotlib.patches import FancyBboxPatch
+    box = FancyBboxPatch(
+        (cx - w / 2, cy - h / 2), w, h,
+        boxstyle="round,pad=0.02,rounding_size=0.10",
+        linewidth=1.4, edgecolor=edge, facecolor=fill,
+    )
+    ax.add_patch(box)
+    ax.text(cx, cy, label, ha="center", va="center",
+            color=text_color, fontsize=font, fontweight="bold")
+
+
+def _frame_panel(ax, title, lim_x=10.0, lim_y=6.0):
+    """Set up a square-ish panel with a visible outer border."""
+    from matplotlib.patches import FancyBboxPatch
+    ax.set_xlim(0, lim_x)
+    ax.set_ylim(0, lim_y)
+    ax.set_aspect("equal")
+    ax.axis("off")
+    ax.set_facecolor(_PALETTE_BG)
+    border = FancyBboxPatch(
+        (0.05, 0.05), lim_x - 0.10, lim_y - 0.10,
+        boxstyle="round,pad=0.01,rounding_size=0.12",
+        linewidth=1.4, edgecolor=_PALETTE_FRAME, facecolor=_PALETTE_BG,
+        zorder=0,
+    )
+    ax.add_patch(border)
+    ax.set_title(title, fontsize=12, fontweight="bold",
+                 color=_PALETTE_TEXT, pad=8)
+
+
+def _draw_ring_topology(ax):
+    _frame_panel(ax, "ring_1d (6 SIPs)", lim_x=10.0, lim_y=6.0)
+
+    xs = [1.2, 2.7, 4.2, 5.7, 7.2, 8.7]
+    y = 3.1
+    box_w, box_h = 1.05, 0.9
+    for i, x in enumerate(xs):
+        _draw_sip_box(ax, x, y, box_w, box_h, f"SIP {i}")
+    # Forward ring (global_E) — adjacent neighbours, anchored to box edges.
+    for i in range(5):
+        _arrow(ax, (xs[i] + box_w / 2, y),
+               (xs[i + 1] - box_w / 2, y),
+               color=_PALETTE_BLUE, lw=1.6)
+    # Wrap (SIP 5 → SIP 0). Anchor at right-CENTER of SIP 5 and
+    # left-CENTER of SIP 0; arc OUTSIDE (above) the row so it does not
+    # overlap any of the SIP boxes in between.
+    _arrow(
+        ax,
+        (xs[5] + box_w / 2, y),
+        (xs[0] - box_w / 2, y),
+        color=_PALETTE_BLUE, lw=1.6, curve=-0.40,
+    )
+    ax.text(5.0, y + 2.0, "global_E  (ring)", ha="center",
+            color=_PALETTE_BLUE, fontsize=10, style="italic")
+    ax.text(5.0, y - 1.5,
+            "(global_W = reverse direction, used by the algorithm)",
+            ha="center", color="gray", fontsize=8, style="italic")
+
+
+def _draw_grid_topology(ax, kind, *, n_rows=2, n_cols=3):
+    """kind ∈ {'torus', 'mesh'}. Lays out as n_rows × n_cols (row-major).
+
+    For the sweep we use 2 rows × 3 cols → SIP layout::
+
+        row 0:  SIP 0   SIP 1   SIP 2
+        row 1:  SIP 3   SIP 4   SIP 5
+    """
+    title = f"torus_2d ({n_rows}×{n_cols}, 6 SIPs)" if kind == "torus" \
+        else f"mesh_2d_no_wrap ({n_rows}×{n_cols}, 6 SIPs)"
+    _frame_panel(ax, title, lim_x=10.0, lim_y=6.0)
+
+    col_xs = [2.0, 5.0, 8.0]  # 3 cols
+    row_ys = [4.3, 1.8]       # 2 rows
+    box_w, box_h = 1.3, 0.95
+    pos: dict[tuple[int, int], tuple[float, float]] = {}
+    for r in range(n_rows):
+        for c in range(n_cols):
+            rank = r * n_cols + c
+            x, y = col_xs[c], row_ys[r]
+            pos[(r, c)] = (x, y)
+            _draw_sip_box(ax, x, y, box_w, box_h, f"SIP {rank}")
+
+    # Row edges (E↔W) — between adjacent columns within each row.
+    for r in range(n_rows):
+        for c in range(n_cols - 1):
+            x0, y0 = pos[(r, c)]
+            x1, y1 = pos[(r, c + 1)]
+            _arrow(ax, (x0 + box_w / 2, y0 + 0.10),
+                   (x1 - box_w / 2, y1 + 0.10),
+                   color=_PALETTE_BLUE, lw=1.5)
+            _arrow(ax, (x1 - box_w / 2, y1 - 0.10),
+                   (x0 + box_w / 2, y0 - 0.10),
+                   color=_PALETTE_BLUE, lw=1.5)
+    # Col edges (N↔S) — between adjacent rows within each column.
+    for c in range(n_cols):
+        for r in range(n_rows - 1):
+            x0, y0 = pos[(r, c)]
+            x1, y1 = pos[(r + 1, c)]
+            _arrow(ax, (x0 - 0.12, y0 - box_h / 2),
+                   (x1 - 0.12, y1 + box_h / 2),
+                   color=_PALETTE_GREEN, lw=1.5)
+            _arrow(ax, (x1 + 0.12, y1 + box_h / 2),
+                   (x0 + 0.12, y0 - box_h / 2),
+                   color=_PALETTE_GREEN, lw=1.5)
+    # Wrap arrows for torus only — anchor to the centre of the OUTER
+    # edge of the end SIPs and arc OUTSIDE the row/column so they do
+    # not overlap the SIPs in between.
+    if kind == "torus":
+        # Row wrap: last col → first col. Top row arcs UP, bottom row
+        # arcs DOWN, so each wrap sits clearly outside its own row.
+        for r in range(n_rows):
+            x0, y0 = pos[(r, 0)]
+            x1, y1 = pos[(r, n_cols - 1)]
+            curve = -0.45 if r == 0 else 0.45
+            _arrow(
+                ax,
+                (x1 + box_w / 2, y1),
+                (x0 - box_w / 2, y0),
+                color=_PALETTE_BLUE, lw=1.5,
+                curve=curve, alpha=0.9,
+            )
+        # Col wrap: last row → first row. Leftmost col arcs LEFT,
+        # rightmost col arcs RIGHT. Middle col(s) get a small inline
+        # marker + legend note (drawing them through the panel would
+        # collide with the row arrows).
+        for c in range(n_cols):
+            x0, y0 = pos[(0, c)]
+            x1, y1 = pos[(n_rows - 1, c)]
+            if c == 0:
+                curve = 0.55
+            elif c == n_cols - 1:
+                curve = -0.55
+            else:
+                continue  # skip middle col — see legend note
+            _arrow(
+                ax,
+                (x1, y1 - box_h / 2),
+                (x0, y0 + box_h / 2),
+                color=_PALETTE_GREEN, lw=1.5,
+                curve=curve, alpha=0.9,
+            )
+
+    ax.text(0.7, 5.6, "global_E/W (row)", color=_PALETTE_BLUE,
+            fontsize=9, style="italic", fontweight="bold")
+    ax.text(0.7, 5.25, "global_N/S (col)", color=_PALETTE_GREEN,
+            fontsize=9, style="italic", fontweight="bold")
+    ax.text(0.7, 4.92,
+            "wrap = torus" if kind == "torus" else "no wrap = mesh",
+            color="gray", fontsize=8, style="italic")
+    if kind == "torus" and n_cols > 2:
+        ax.text(0.7, 0.3,
+                "(middle-col wrap omitted for clarity — every row "
+                "and every column wraps)",
+                color="gray", fontsize=7.5, style="italic")
+
+
+def _draw_cube_reduction(ax):
+    """4×4 cube grid inside SIP 0 — compact layout with phase legend."""
+    from matplotlib.patches import Rectangle
+    _frame_panel(ax, "Cube-level reduction inside SIP 0 (4×4 cubes)",
+                 lim_x=10.0, lim_y=6.0)
+
+    cube_w = 0.65
+    cube_gap = 0.18
+    # Center the 4×4 grid in the left half of the panel.
+    grid_total = 4 * cube_w + 3 * cube_gap
+    grid_x0 = 0.7
+    grid_y0 = 0.7
+    centers: dict[tuple[int, int], tuple[float, float]] = {}
+    for r in range(4):
+        for c in range(4):
+            cx = grid_x0 + c * (cube_w + cube_gap) + cube_w / 2
+            cy = grid_y0 + (3 - r) * (cube_w + cube_gap) + cube_w / 2
+            centers[(r, c)] = (cx, cy)
+            cube_id = r * 4 + c
+            is_root = (r == 3 and c == 3)
+            face = _PALETTE_ROOT_FILL if is_root else _PALETTE_BOX_FILL
+            edge = _PALETTE_ROOT_EDGE if is_root else _PALETTE_BOX_EDGE
+            rect = Rectangle(
+                (cx - cube_w / 2, cy - cube_w / 2), cube_w, cube_w,
+                linewidth=1.2, edgecolor=edge, facecolor=face,
+            )
+            ax.add_patch(rect)
+            label = f"c{cube_id}"
+            ax.text(cx, cy, label, ha="center", va="center",
+                    fontsize=7.5, fontweight="bold",
+                    color=_PALETTE_ROOT_EDGE if is_root
+                    else _PALETTE_TEXT)
+
+    # Phase 1: row reduce W→E.
+    for r in range(4):
+        for c in range(3):
+            x0, y0 = centers[(r, c)]
+            x1, y1 = centers[(r, c + 1)]
+            _arrow(ax, (x0 + cube_w / 2, y0), (x1 - cube_w / 2, y1),
+                   color=_PALETTE_BLUE, lw=1.5)
+    # Phase 2: col reduce N→S along rightmost column.
+    for r in range(3):
+        x0, y0 = centers[(r, 3)]
+        x1, y1 = centers[(r + 1, 3)]
+        _arrow(ax, (x0, y0 - cube_w / 2), (x1, y1 + cube_w / 2),
+               color=_PALETTE_GREEN, lw=1.7)
+
+    # Phase legend on the right side.
+    legend_x = grid_x0 + grid_total + 0.55
+    ax.text(legend_x, 5.0, "Phase 1: row reduce  (W → E)",
+            color=_PALETTE_BLUE, fontsize=10, fontweight="bold")
+    ax.text(legend_x, 4.55, "Phase 2: col reduce  (N → S, rightmost col)",
+            color=_PALETTE_GREEN, fontsize=10, fontweight="bold")
+    ax.text(legend_x, 4.10, "Phase 3: inter-SIP exchange at root cube",
+            color=_PALETTE_ROOT_EDGE, fontsize=10, fontweight="bold")
+    ax.text(legend_x, 3.65, "Phase 4: col broadcast  (S → N)",
+            color=_PALETTE_GREEN, fontsize=10, style="italic")
+    ax.text(legend_x, 3.20, "Phase 5: row broadcast  (E → W)",
+            color=_PALETTE_BLUE, fontsize=10, style="italic")
+    ax.text(legend_x, 2.55,
+            "(broadcast phases reverse phases 2 & 1)",
+            color="gray", fontsize=8.5, style="italic")
+    ax.text(legend_x, 1.7,
+            "Root cube (c15, bottom-right) is the only\n"
+            "cube that performs the inter-SIP exchange.",
+            color=_PALETTE_ROOT_EDGE, fontsize=9, style="italic")
+
+
+def emit_topology_diagram() -> str:
+    """Emit a 2×2-panel topology diagram into allreduce_latency_plots/.
+
+    Top row: ring_1d | torus_2d (2×3)
+    Bot row: mesh_2d_no_wrap (2×3) | cube-level reduction in SIP 0
+    """
+    import matplotlib.gridspec as gridspec
+    import matplotlib.pyplot as plt
+
+    _SWEEP_OUT_DIR.mkdir(parents=True, exist_ok=True)
+    fig = plt.figure(figsize=(16, 10), facecolor="white")
+    gs = gridspec.GridSpec(2, 2, figure=fig, hspace=0.30, wspace=0.10)
+    ax_ring = fig.add_subplot(gs[0, 0])
+    ax_torus = fig.add_subplot(gs[0, 1])
+    ax_mesh = fig.add_subplot(gs[1, 0])
+    ax_cube = fig.add_subplot(gs[1, 1])
+
+    _draw_ring_topology(ax_ring)
+    _draw_grid_topology(ax_torus, "torus", n_rows=2, n_cols=3)
+    _draw_grid_topology(ax_mesh, "mesh", n_rows=2, n_cols=3)
+    _draw_cube_reduction(ax_cube)
+
+    fig.suptitle(
+        "Allreduce topology — device-level (top: ring, torus, mesh) "
+        "and cube-level reduction in SIP 0",
+        fontsize=14, fontweight="bold", color=_PALETTE_TEXT, y=0.98,
+    )
+    out_path = _SWEEP_OUT_DIR / "topology.png"
+    fig.savefig(out_path, dpi=130, bbox_inches="tight",
+                facecolor=fig.get_facecolor())
+    plt.close(fig)
+    return str(out_path)
+
+
+def test_emit_topology_diagram():
+    """Emit topology.png alongside the sweep plots. Pure plotting; no sim."""
+    out = emit_topology_diagram()
+    assert Path(out).exists()
@@ -23,7 +23,7 @@ def _engine():
 def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -1,48 +0,0 @@
-"""Test that tl.recv() (no direction) works under the mock runtime
-and the SimPy PE_IPCQ component (ADR-0023 D4 weak fairness)."""
-from __future__ import annotations
-
-import numpy as np
-
-from kernbench.ccl.testing import run_kernel_in_mock
-
-
-def kernel_round_robin(t_ptr, n_elem, tl):
-    """Each PE sends one tile E then receives N-1 tiles via round-robin.
-    Uses TensorHandle math (PE_MATH) so Phase 2 produces correct HBM
-    contents under SimPy + op_log replay."""
-    rank = tl.program_id(axis=0)
-    world_size = tl.num_programs(axis=0)
-    nbytes = n_elem * 2
-
-    pe_addr = t_ptr + rank * nbytes
-    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
-    current = acc
-
-    for _step in range(world_size - 1):
-        tl.send(dir="E", src=current)
-        # No direction → round-robin
-        recv = tl.recv(shape=(n_elem,), dtype="f16")
-        acc = acc + recv
-        current = recv  # forward W's tile to E next round
-
-    tl.store(pe_addr, acc)
-
-
-def test_round_robin_recv_mock_runtime():
-    n_elem = 8
-    inputs = [
-        np.full((n_elem,), float(r + 1), dtype=np.float16)
-        for r in range(4)
-    ]
-    expected = sum(inputs)  # [10,...]
-
-    outputs = run_kernel_in_mock(
-        kernel_fn=kernel_round_robin,
-        world_size=4,
-        topology="ring_1d",
-        inputs=inputs,
-        kernel_args=(n_elem,),
-    )
-    for r in range(4):
-        assert np.allclose(outputs[r], expected)
@@ -30,7 +30,7 @@ def _graph():
 def _hbm_pa(pe_id: int = 0) -> int:
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=0, cube_id=0, pe_id=pe_id,
+        sip_id=0, die_id=0, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -0,0 +1,194 @@
+"""ADR-0009 D5 invariant: all PEs targeted by a single kernel launch MUST
+begin executing the kernel body at the same simulated time, regardless of
+their dispatch path length.
+
+These tests directly verify the invariant by capturing per-PE state at the
+top of `_execute_kernel`:
+
+  test_no_pe_arrives_after_target_start_ns
+      Asserts: for every PE that enters _execute_kernel during a multi-cube
+      launch, `env.now` at entry must be <= target_start_ns. Otherwise the
+      PE's barrier yield would be a no-op and `pe_exec_start` would be set
+      late, breaking the D5 "same simulated time" mandate.
+
+  test_all_pes_have_identical_pe_exec_start
+      Asserts: every PE's `pe_exec_start` (the value of `env.now` recorded
+      immediately AFTER the barrier yield) is identical across all PEs in
+      the launch.
+
+Both tests are expected to FAIL today and become the regression check the
+Phase 2 D5 predictor + fallback fix must make pass.
+"""
+from __future__ import annotations
+
+from pathlib import Path
+
+import numpy as np
+import pytest
+
+from kernbench.policy.placement.dp import DPPolicy
+from kernbench.runtime_api.context import RuntimeContext
+from kernbench.runtime_api.types import DeviceSelector
+from kernbench.sim_engine.engine import GraphEngine
+from kernbench.topology.builder import resolve_topology
+
+TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
+
+
+def _capture_per_pe_d5_state():
+    """Monkey-patch PeCpuComponent._execute_kernel to record, per PE:
+
+      - entry_now: env.now at function entry (before any yield)
+      - target_start_ns: the value carried by the request
+      - barrier_yielded: True if the barrier yield fired (entry_now < target)
+      - pe_exec_start: env.now immediately after the barrier check
+                       (i.e. the value the original code sets)
+
+    Returns (records: list[dict], restore: callable).
+    """
+    import kernbench.components.builtin.pe_cpu as pe_cpu_mod
+
+    records: list[dict] = []
+    original = pe_cpu_mod.PeCpuComponent._execute_kernel
+
+    def patched(self, env, txn):
+        request = txn.request
+        target_start = getattr(request, "target_start_ns", None)
+        entry_now = float(env.now)
+        rec = {
+            "node_id": self.node.id,
+            "entry_now": entry_now,
+            "target_start_ns": (
+                float(target_start) if target_start is not None else None
+            ),
+            "barrier_yielded": (
+                target_start is not None
+                and float(target_start) > entry_now
+            ),
+            "pe_exec_start": None,  # filled below by sniff
+            "late_ns": (
+                None if target_start is None
+                else max(0.0, entry_now - float(target_start))
+            ),
+        }
+        records.append(rec)
+
+        # We can't easily inject a callback at the original's
+        # `pe_exec_start = env.now` line without rewriting it. Approximate:
+        # if the original yields the barrier, env.now after the yield is
+        # target_start_ns; otherwise pe_exec_start is entry_now (skipped).
+        if rec["barrier_yielded"]:
+            rec["pe_exec_start"] = float(target_start)
+        else:
+            rec["pe_exec_start"] = entry_now
+
+        yield from original(self, env, txn)
+
+    pe_cpu_mod.PeCpuComponent._execute_kernel = patched
+
+    def restore():
+        pe_cpu_mod.PeCpuComponent._execute_kernel = original
+
+    return records, restore
+
+
+def _run_multicube_launch():
+    """Drive a no-op kernel launch across all 16 cubes x 8 PEs and return
+    the per-PE D5 records collected by the monkey-patch."""
+    records, restore = _capture_per_pe_d5_state()
+    try:
+        topo = resolve_topology(str(TOPOLOGY_PATH))
+        engine = GraphEngine(topo.topology_obj, enable_data=True)
+        spec = topo.topology_obj.spec
+
+        with RuntimeContext(
+            engine=engine, target_device=DeviceSelector("all"),
+            correlation_id="d5_barrier", spec=spec,
+        ) as ctx:
+            dp = DPPolicy(
+                cube="row_wise", pe="column_wise",
+                num_cubes=16, num_pes=8,
+            )
+
+            def kernel(t_ptr, n_elem, tl):
+                pass  # no-op
+
+            ctx.ahbm.set_device(0)
+            t = ctx.zeros(
+                (16, 8 * 64), dtype="f16", dp=dp, name="probe",
+            )
+            t.copy_(ctx.from_numpy(
+                np.zeros((16, 8 * 64), dtype=np.float16),
+            ))
+
+            pending = ctx.launch(
+                "d5_probe", kernel, t, 64, _defer_wait=True,
+            )
+            for h, _sip, meta in pending:
+                ctx.wait(h, _meta=meta)
+    finally:
+        restore()
+    return records
+
+
+def test_no_pe_arrives_after_target_start_ns():
+    """ADR-0009 D5: no PE may enter `_execute_kernel` after target_start_ns.
+
+    Today this fails because IO_CPU's predictor under-shoots actual
+    dispatch latency for far cubes (cube4, cube9-15). Phase 2 fix:
+    chain-aware predictor in IO_CPU + monotonic upward re-stamp in M_CPU.
+    """
+    records = _run_multicube_launch()
+    assert records, "expected per-PE _execute_kernel records"
+
+    late = [
+        r for r in records
+        if r["target_start_ns"] is not None
+        and r["late_ns"] is not None
+        and r["late_ns"] > 1e-6
+    ]
+
+    if late:
+        # Provide actionable diagnostic in the failure.
+        worst = sorted(late, key=lambda r: -r["late_ns"])[:5]
+        details = "\n".join(
+            f"  {r['node_id']}: late by {r['late_ns']:.2f} ns "
+            f"(entry_now={r['entry_now']:.2f}, "
+            f"target_start_ns={r['target_start_ns']:.2f})"
+            for r in worst
+        )
+        pytest.fail(
+            f"ADR-0009 D5 violated: {len(late)}/{len(records)} PEs "
+            f"entered _execute_kernel AFTER target_start_ns "
+            f"(barrier yield silently skipped). "
+            f"Worst offenders:\n{details}"
+        )
+
+
+def test_all_pes_have_identical_pe_exec_start():
+    """ADR-0009 D5: every PE's pe_exec_start must be identical.
+
+    With D5 honored, every PE either yields to target_start_ns (start =
+    target_start_ns) or, if late, would still be aligned by the M_CPU
+    upward re-stamp (Phase 2). Today: 75/128 PEs in this launch have
+    distinct pe_exec_start values because they skipped the barrier.
+    """
+    records = _run_multicube_launch()
+    assert records, "expected per-PE _execute_kernel records"
+
+    starts = sorted({round(r["pe_exec_start"], 6) for r in records})
+    if len(starts) > 1:
+        spread = max(starts) - min(starts)
+        # Distribution of how many PEs at each distinct start time
+        from collections import Counter
+        bucket = Counter(round(r["pe_exec_start"], 6) for r in records)
+        details = "\n".join(
+            f"  pe_exec_start={t}: {n} PEs"
+            for t, n in sorted(bucket.items())
+        )
+        pytest.fail(
+            f"ADR-0009 D5 violated: PEs have {len(starts)} distinct "
+            f"pe_exec_start values (spread = {spread:.2f} ns); "
+            f"D5 mandates a single common value. "
+            f"Distribution:\n{details}"
+        )
@@ -50,7 +50,7 @@ def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
    from kernbench.policy.address.phyaddr import PhysAddr
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -31,7 +31,7 @@ def _hbm_pa(sip=0, cube=0, pe_id=0):
    from kernbench.policy.address.phyaddr import PhysAddr
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -29,7 +29,7 @@ def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
    # 48 GB / 8 slices = 6 GB per slice
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -37,7 +37,7 @@ def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:

 def _sram_pa(sip: int = 0, cube: int = 0) -> int:
    """Create an SRAM physical address."""
-    pa = PhysAddr.cube_sram_addr(rack_id=0, sip_id=sip, cube_id=cube, sram_offset=0x800)
+    pa = PhysAddr.cube_sram_addr(sip_id=sip, die_id=cube, sram_offset=0x800)
    return pa.encode()


@@ -1,8 +1,9 @@
 """Tests for configure_sfr_intercube_multisip neighbor table wiring.

-Verifies that IPCQ neighbor tables are correctly installed for
-intercube (pe0, 4×4 mesh N/S/E/W) + inter-SIP (pe0, all cubes,
-global_E/global_W) communication.
+Verifies full IPCQ hardware wiring (independent of DPPolicy):
+  - intra-cube (2×4 PE grid)  → intra_N/S/E/W
+  - intercube same-lane       → N/S/E/W
+  - inter-SIP same-(cube, pe) → global_N/S/E/W
 """
 from __future__ import annotations

@@ -16,6 +17,7 @@ from kernbench.topology.builder import resolve_topology
 TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"

 N_CUBES = 16
+PES_PER_CUBE = 8


 def _engine_and_spec():
@@ -36,78 +38,102 @@ class TestConfigureSfrNeighborTables:
        plan = configure_sfr_intercube_multisip(engine, spec, cfg)

        n_sips = int(spec["system"]["sips"]["count"])
-        assert plan["world_size"] == n_sips * N_CUBES
-        assert len(plan["rank_to_pe"]) == n_sips * N_CUBES
-        for pe_idx, (sip, cube, pe) in enumerate(plan["rank_to_pe"]):
-            assert pe == 0, f"pe_idx {pe_idx}: pe must be 0, got {pe}"
+        expected = n_sips * N_CUBES * PES_PER_CUBE
+        assert plan["world_size"] == expected
+        assert len(plan["rank_to_pe"]) == expected

-    def test_corner_cube0_has_E_and_S_only(self):
-        """Cube 0 (row=0, col=0) is NW corner: only E and S neighbors."""
+    # ── Intra-cube (intra_N/S/E/W) ────────────────────────────────
+
+    def test_pe0_intra_cube_has_intra_E_and_intra_S(self):
+        """pe0 is NW of the 2×4 PE grid: intra_E=pe1, intra_S=pe4."""
        engine, spec = _engine_and_spec()
        cfg = _merged_cfg()
        configure_sfr_intercube_multisip(engine, spec, cfg)

-        ipcq = engine._components["sip0.cube0.pe0.pe_ipcq"]
-        qp = ipcq.queue_pairs
-        assert "E" in qp, "cube 0 must have E neighbor"
-        assert "S" in qp, "cube 0 must have S neighbor"
-        assert "W" not in qp, "cube 0 (col=0) must NOT have W neighbor"
-        assert "N" not in qp, "cube 0 (row=0) must NOT have N neighbor"
+        qp = engine._components["sip0.cube0.pe0.pe_ipcq"].queue_pairs
+        assert "intra_E" in qp
+        assert qp["intra_E"]["peer"].pe == 1
+        assert "intra_S" in qp
+        assert qp["intra_S"]["peer"].pe == 4
+        assert "intra_W" not in qp
+        assert "intra_N" not in qp
+
+    def test_pe5_intra_cube_has_all_four(self):
+        """pe5 (row=1, col=1 in 2×4 grid) has all 4 intra directions.
+
+        Intra neighbors: intra_N=pe1, intra_E=pe6, intra_W=pe4,
+        intra_S not present (row=1 is bottom row).
+        """
+        engine, spec = _engine_and_spec()
+        cfg = _merged_cfg()
+        configure_sfr_intercube_multisip(engine, spec, cfg)
+
+        qp = engine._components["sip0.cube0.pe5.pe_ipcq"].queue_pairs
+        assert qp["intra_N"]["peer"].pe == 1
+        assert qp["intra_E"]["peer"].pe == 6
+        assert qp["intra_W"]["peer"].pe == 4
+        assert "intra_S" not in qp  # bottom row
+
+    # ── Intercube same-lane (N/S/E/W) ─────────────────────────────
+
+    def test_corner_cube0_pe0_has_intercube_E_and_S(self):
+        """Cube 0 (NW mesh corner): intercube E→cube1, S→cube4."""
+        engine, spec = _engine_and_spec()
+        cfg = _merged_cfg()
+        configure_sfr_intercube_multisip(engine, spec, cfg)
+
+        qp = engine._components["sip0.cube0.pe0.pe_ipcq"].queue_pairs
        assert qp["E"]["peer"].cube == 1
+        assert qp["E"]["peer"].pe == 0  # same-lane
        assert qp["S"]["peer"].cube == 4
+        assert qp["S"]["peer"].pe == 0
+        assert "W" not in qp, "cube 0 has no west neighbor"
+        assert "N" not in qp, "cube 0 has no north neighbor"

-    def test_interior_cube5_has_all_four(self):
-        """Cube 5 (row=1, col=1) is interior: N/S/E/W all present."""
+    def test_interior_cube5_pe3_has_all_four_intercube_same_lane(self):
+        """Cube 5 interior, pe3: intercube N/S/E/W all present, same-lane."""
        engine, spec = _engine_and_spec()
        cfg = _merged_cfg()
        configure_sfr_intercube_multisip(engine, spec, cfg)

-        ipcq = engine._components["sip0.cube5.pe0.pe_ipcq"]
-        qp = ipcq.queue_pairs
-        assert qp["N"]["peer"].cube == 1
-        assert qp["S"]["peer"].cube == 9
-        assert qp["E"]["peer"].cube == 6
-        assert qp["W"]["peer"].cube == 4
+        qp = engine._components["sip0.cube5.pe3.pe_ipcq"].queue_pairs
+        for d, expected_cube in [("N", 1), ("S", 9), ("E", 6), ("W", 4)]:
+            assert qp[d]["peer"].cube == expected_cube
+            assert qp[d]["peer"].pe == 3  # same-lane

-    def test_root_cube15_has_inter_sip(self):
-        """Cube 15 (root, SE corner) has N, W + global_E/global_W."""
+    def test_all_pes_have_intercube_wiring(self):
+        """Every PE on every interior cube has intercube same-lane wiring."""
        engine, spec = _engine_and_spec()
        cfg = _merged_cfg()
        configure_sfr_intercube_multisip(engine, spec, cfg)

-        ipcq0 = engine._components["sip0.cube15.pe0.pe_ipcq"]
-        qp0 = ipcq0.queue_pairs
-        assert "N" in qp0
-        assert "W" in qp0
-        assert "E" not in qp0, "cube 15 (col=3) must NOT have E"
-        assert "S" not in qp0, "cube 15 (row=3) must NOT have S"
-        assert "global_E" in qp0, "root cube must have global_E"
-        assert "global_W" in qp0, "root cube must have global_W"
-        assert qp0["global_E"]["peer"].sip == 1
-        assert qp0["global_E"]["peer"].cube == 15
-
-        ipcq1 = engine._components["sip1.cube15.pe0.pe_ipcq"]
-        qp1 = ipcq1.queue_pairs
-        assert qp1["global_E"]["peer"].sip == 0
-        assert qp1["global_E"]["peer"].cube == 15
-
-    def test_all_cubes_have_inter_sip(self):
-        """ALL cubes (not just root) are wired for inter-SIP."""
-        engine, spec = _engine_and_spec()
-        cfg = _merged_cfg()
-        configure_sfr_intercube_multisip(engine, spec, cfg)
-
-        root_cube = int(cfg.get("root_cube", N_CUBES - 1))
-        for cube_id in range(N_CUBES):
-            ipcq = engine._components[f"sip0.cube{cube_id}.pe0.pe_ipcq"]
-            qp = ipcq.queue_pairs
-            assert "global_E" in qp, (
-                f"sip0.cube{cube_id}.pe0 missing global_E"
-            )
-            assert "global_W" in qp, (
-                f"sip0.cube{cube_id}.pe0 missing global_W"
-            )
-            if cube_id == root_cube:
-                assert qp["global_E"]["peer"].sip != 0, (
-                    f"root cube {root_cube} global_E must point to another SIP"
+        # Interior cube 5: every PE should have N/S/E/W same-lane.
+        for pe in range(PES_PER_CUBE):
+            qp = engine._components[f"sip0.cube5.pe{pe}.pe_ipcq"].queue_pairs
+            for d in ("N", "S", "E", "W"):
+                assert d in qp, f"sip0.cube5.pe{pe} missing intercube {d}"
+                assert qp[d]["peer"].pe == pe, (
+                    f"sip0.cube5.pe{pe} {d} not same-lane"
                )
+
+    # ── Inter-SIP (global_*) ──────────────────────────────────────
+
+    def test_every_pe_on_every_cube_has_inter_sip(self):
+        """All PEs on all cubes wired for inter-SIP via global_*."""
+        engine, spec = _engine_and_spec()
+        cfg = _merged_cfg()
+        configure_sfr_intercube_multisip(engine, spec, cfg)
+
+        for cube_id in range(N_CUBES):
+            for pe in range(PES_PER_CUBE):
+                qp = engine._components[
+                    f"sip0.cube{cube_id}.pe{pe}.pe_ipcq"
+                ].queue_pairs
+                assert "global_E" in qp, (
+                    f"sip0.cube{cube_id}.pe{pe} missing global_E"
+                )
+                assert "global_W" in qp
+                # Peer must be same (cube, pe) on another SIP.
+                assert qp["global_E"]["peer"].sip == 1
+                assert qp["global_E"]["peer"].cube == cube_id
+                assert qp["global_E"]["peer"].pe == pe
@@ -36,7 +36,7 @@ def _engine():
 def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -0,0 +1,62 @@
+"""ADR-0009 D5: synchronized launch barrier.
+
+M_CPU stamps KernelLaunchMsg with target_start_ns = env.now + max path
+latency; PE_CPU yields until that time before recording pe_exec_start.
+Every PE in a single launch MUST begin kernel execution at the same
+env.now regardless of its dispatch path length.
+
+We verify this indirectly: for a no-op kernel, pe_exec_ns = env.now -
+pe_exec_start. If every PE's pe_exec_start is identical and every PE
+runs the same no-op body, every pe_exec_ns value must be identical.
+Without D5, pe_exec_start varies by dispatch-path length and so does
+pe_exec_ns.
+"""
+from __future__ import annotations
+
+from pathlib import Path
+
+import numpy as np
+
+from kernbench.policy.placement.dp import DPPolicy
+from kernbench.runtime_api.context import RuntimeContext
+from kernbench.runtime_api.types import DeviceSelector
+from kernbench.sim_engine.engine import GraphEngine
+from kernbench.topology.builder import resolve_topology
+
+TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
+
+
+def test_kernel_launch_sync_all_pes_have_equal_exec_time():
+    """No-op kernel: every PE's pe_exec_ns must be identical under D5."""
+    topo = resolve_topology(str(TOPOLOGY_PATH))
+    engine = GraphEngine(topo.topology_obj, enable_data=True)
+    spec = topo.topology_obj.spec
+
+    with RuntimeContext(engine=engine, target_device=DeviceSelector("all"),
+                        correlation_id="sync_test", spec=spec) as ctx:
+        dp = DPPolicy(cube="row_wise", pe="column_wise",
+                      num_cubes=16, num_pes=8)
+
+        def kernel(t_ptr, n_elem, tl):
+            pass  # no-op
+
+        ctx.ahbm.set_device(0)
+        t = ctx.zeros((16, 8 * 64), dtype="f16", dp=dp, name="probe")
+        t.copy_(ctx.from_numpy(np.zeros((16, 8 * 64), dtype=np.float16)))
+
+        pending = ctx.launch("sync_probe", kernel, t, 64, _defer_wait=True)
+        for h, _sip, meta in pending:
+            ctx.wait(h, _meta=meta)
+
+        pe_exec_vals = []
+        for h, _sip, _meta in pending:
+            _, trace = engine.get_completion(h)
+            if trace and trace.get("pe_exec_ns") is not None:
+                pe_exec_vals.append(float(trace["pe_exec_ns"]))
+
+    assert pe_exec_vals, "expected completion traces with pe_exec_ns"
+    spread = max(pe_exec_vals) - min(pe_exec_vals)
+    assert spread < 1e-6, (
+        f"ADR-0009 D5 violated: pe_exec_ns spread across PEs = "
+        f"{spread:.6f} ns (expected 0). Values: {pe_exec_vals}"
+    )
@@ -38,7 +38,7 @@ def _engine():
 def _hbm_pa(sip=0, cube=0, pe_id=0):
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -53,7 +53,7 @@ def _engine():
 def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -0,0 +1,741 @@
+"""Diagnostic for the inter-cube RAW > IPCQ asymmetry on h3/h4 plots.
+
+Single-shot run at h3 (sip0.cube0.pe0 -> sip0.cube1.pe0), nbytes=4096.
+
+Captures per-PE pe_exec_ns and the actual path / drain / per-node overhead
+breakdown for the RAW sub-txn (PE_DMA -> remote HBM_CTRL) vs the IPCQ
+outbound sub-txn (PE_DMA -> peer PE_DMA), so we can localize the gap to
+one of:
+    (a) drain at HBM-BW (RAW) vs fabric-BW (IPCQ)
+    (b) path-length / per-node overhead asymmetry
+    (c) RAW SRC paying tl.load (local HBM read) on top of remote tl.store
+        while IPCQ DST only pays inbound traversal+drain.
+
+Phase 1 / test-only. No production code is modified.
+"""
+from __future__ import annotations
+
+from pathlib import Path
+
+import numpy as np
+import pytest
+
+from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
+from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
+from kernbench.policy.placement.dp import DPPolicy
+from kernbench.runtime_api.context import RuntimeContext
+from kernbench.runtime_api.types import DeviceSelector
+from kernbench.sim_engine.engine import GraphEngine
+from kernbench.topology.builder import resolve_topology
+
+TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
+
+import os
+
+# Allow the test to be re-run for h4 (inter-cube vertical) at multiple sizes
+# to investigate why IPCQ slope flattens past 8192 B (path may differ).
+NBYTES = int(os.environ.get("DIAG_NBYTES", "4096"))
+ELEM_BYTES = 2
+N_ELEM = NBYTES // ELEM_BYTES
+N_CUBES = 16
+N_PES = 8
+HOP = os.environ.get("DIAG_HOP", "h3")
+if HOP == "h4":
+    SRC = (0, 0, 0)
+    DST = (0, 4, 0)  # h4 inter-cube vertical
+else:
+    SRC = (0, 0, 0)
+    DST = (0, 1, 0)  # h3 inter-cube horizontal
+
+
+# ── Per-PE pe_exec_ns capture via monkey-patch ───────────────────────
+
+
+def _install_barrier_capture():
+    """Wrap PeCpuComponent._execute_kernel to log, for every PE that
+    enters: env.now at entry, target_start_ns the request carried,
+    whether the barrier yield fired (i.e. env.now < target_start_ns),
+    and env.now at pe_exec_start.
+    """
+    import kernbench.components.builtin.pe_cpu as pe_cpu_mod
+
+    log: list[dict] = []
+    original = pe_cpu_mod.PeCpuComponent._execute_kernel
+
+    def patched(self, env, txn):
+        request = txn.request
+        target_start = getattr(request, "target_start_ns", None)
+        entry_now = float(env.now)
+        log_entry = {
+            "node_id": self.node.id,
+            "entry_now": entry_now,
+            "target_start_ns": (
+                float(target_start) if target_start is not None else None
+            ),
+            "barrier_skipped": (
+                target_start is None
+                or float(target_start) <= entry_now
+            ),
+            "delta_late_ns": (
+                None if target_start is None
+                else max(0.0, entry_now - float(target_start))
+            ),
+        }
+        log.append(log_entry)
+        yield from original(self, env, txn)
+
+    pe_cpu_mod.PeCpuComponent._execute_kernel = patched
+
+    def restore():
+        pe_cpu_mod.PeCpuComponent._execute_kernel = original
+
+    return log, restore
+
+
+def _install_per_pe_capture():
+    """Wrap PeCpuComponent._execute_kernel so we record (node_id ->
+    pe_exec_ns) for every PE that executes a kernel during the run.
+
+    Returns (capture_dict, restore_callable).
+    """
+    import kernbench.components.builtin.pe_cpu as pe_cpu_mod
+
+    captured: dict[str, float] = {}
+    original = pe_cpu_mod.PeCpuComponent._execute_kernel
+
+    def patched(self, env, txn):
+        gen = original(self, env, txn)
+        try:
+            value = yield from gen
+        finally:
+            v = txn.result_data.get("pe_exec_ns")
+            if v is not None:
+                captured[self.node.id] = float(v)
+        return value
+
+    pe_cpu_mod.PeCpuComponent._execute_kernel = patched
+
+    def restore():
+        pe_cpu_mod.PeCpuComponent._execute_kernel = original
+
+    return captured, restore
+
+
+def _install_recv_capture(target_node_id: str):
+    """Wrap PeIpcqComponent._handle_recv to log entry/exit times and the
+    peer_head_cache/my_tail values seen at the start.
+
+    This pins down whether recv ever blocked on a wait_event, or whether
+    it consumed without waiting (i.e. peer_head_cache > my_tail at entry).
+    """
+    import kernbench.components.builtin.pe_ipcq as pe_ipcq_mod
+
+    log: list[dict] = []
+    original = pe_ipcq_mod.PeIpcqComponent._handle_recv
+
+    def patched(self, env, req, cmd):
+        if self.node.id != target_node_id:
+            yield from original(self, env, req, cmd)
+            return
+        # Snapshot state before dispatch
+        d = cmd.direction
+        qp = self._queue_pairs.get(d, {})
+        log.append({
+            "phase": "enter",
+            "t": float(env.now),
+            "direction": d,
+            "peer_head_cache": qp.get("peer_head_cache"),
+            "my_tail": qp.get("my_tail"),
+        })
+        yield from original(self, env, req, cmd)
+        qp = self._queue_pairs.get(d, {})
+        log.append({
+            "phase": "exit",
+            "t": float(env.now),
+            "direction": d,
+            "peer_head_cache": qp.get("peer_head_cache"),
+            "my_tail": qp.get("my_tail"),
+        })
+
+    pe_ipcq_mod.PeIpcqComponent._handle_recv = patched
+
+    def restore():
+        pe_ipcq_mod.PeIpcqComponent._handle_recv = original
+
+    return log, restore
+
+
+def _install_meta_arrival_capture(target_node_id: str):
+    """Log every IpcqMetaArrival that lands on ``target_node_id`` PE_IPCQ.
+
+    Records (env_now, sender_seq, dst_addr, matched_direction,
+    peer_head_cache_before, my_tail_before).
+    """
+    import kernbench.components.builtin.pe_ipcq as pe_ipcq_mod
+
+    log: list[dict] = []
+    original = pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival
+
+    def patched(self, msg):
+        if self.node.id == target_node_id:
+            token = msg.token
+            now = float(self._env.now) if hasattr(self, "_env") else 0.0
+            # _env is not stored on the component; use ctx? Fall back to
+            # introspection via self._inbox._env (SimPy stores reference).
+            try:
+                now = float(self._inbox._env.now)
+            except Exception:
+                pass
+            entry = {
+                "t": now,
+                "sender_seq": getattr(token, "sender_seq", None),
+                "dst_addr": getattr(token, "dst_addr", None),
+                "src_sip": getattr(token, "src_sip", None),
+                "src_cube": getattr(token, "src_cube", None),
+                "src_pe": getattr(token, "src_pe", None),
+                "src_direction": getattr(token, "src_direction", None),
+                "nbytes": getattr(token, "nbytes", None),
+                "matched_direction": None,
+                "peer_head_cache_before": {},
+                "my_tail_before": {},
+            }
+            for d, qp in self._queue_pairs.items():
+                entry["peer_head_cache_before"][d] = qp["peer_head_cache"]
+                entry["my_tail_before"][d] = qp["my_tail"]
+                base = qp["my_rx_base_pa"]
+                size = qp["n_slots"] * qp["slot_size"]
+                if base <= entry["dst_addr"] < base + size:
+                    entry["matched_direction"] = d
+            log.append(entry)
+        return original(self, msg)
+
+    pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival = patched
+
+    def restore():
+        pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival = original
+
+    return log, restore
+
+
+def _snapshot_qp_state(engine, target_node_id: str) -> dict:
+    """Snapshot every direction's qp state on the target PE_IPCQ now.
+
+    Captures peer_head_cache, my_tail, my_rx_base_pa, n_slots, slot_size
+    for each installed direction.
+    """
+    comp = engine._components.get(target_node_id)
+    if comp is None:
+        return {}
+    return {
+        d: {
+            "peer_head_cache": qp["peer_head_cache"],
+            "my_tail": qp["my_tail"],
+            "my_rx_base_pa": qp["my_rx_base_pa"],
+            "n_slots": qp["n_slots"],
+            "slot_size": qp["slot_size"],
+            "rx_range": (
+                qp["my_rx_base_pa"],
+                qp["my_rx_base_pa"] + qp["n_slots"] * qp["slot_size"],
+            ),
+        }
+        for d, qp in comp.queue_pairs.items()
+    }
+
+
+# ── Path / drain breakdown using engine ctx ──────────────────────────
+
+
+def _path_breakdown(ctx, path: list[str], nbytes: int) -> dict:
+    edge_total_ns = 0.0
+    edge_details = []
+    min_bw = float("inf")
+    for i in range(len(path) - 1):
+        edge = ctx.edge_map.get((path[i], path[i + 1]))
+        if edge is None:
+            edge_details.append((path[i], path[i + 1], None, None, None))
+            continue
+        prop_ns = edge.distance_mm * ctx.ns_per_mm
+        edge_total_ns += prop_ns
+        bw = getattr(edge, "bw_gbs", None) or 0.0
+        if bw > 0 and bw < min_bw:
+            min_bw = bw
+        edge_details.append(
+            (path[i], path[i + 1], edge.distance_mm, prop_ns, bw),
+        )
+
+    overhead_total_ns = 0.0
+    overhead_details = []
+    for nid in path:
+        oh = float(ctx.node_overhead_ns.get(nid, 0.0))
+        overhead_total_ns += oh
+        overhead_details.append((nid, oh))
+
+    drain_ns = ctx.compute_drain_ns(path, nbytes)
+    bottleneck_bw = None if min_bw == float("inf") else min_bw
+
+    return {
+        "path": path,
+        "edges": edge_details,
+        "edge_total_ns": edge_total_ns,
+        "overheads": overhead_details,
+        "overhead_total_ns": overhead_total_ns,
+        "drain_ns": drain_ns,
+        "bottleneck_bw_gbs": bottleneck_bw,
+        "expected_total_ns": edge_total_ns + overhead_total_ns + drain_ns,
+    }
+
+
+def _print_breakdown(label: str, br: dict) -> None:
+    print(f"\n  {label}")
+    print(f"    path ({len(br['path'])} nodes):")
+    for nid in br["path"]:
+        print(f"      - {nid}")
+    print(f"    edges (prop. delay):")
+    for src, dst, dist_mm, prop_ns, bw in br["edges"]:
+        if dist_mm is None:
+            print(f"      ! {src} -> {dst}  EDGE NOT FOUND IN edge_map")
+            continue
+        print(
+            f"      {src} -> {dst}  "
+            f"dist={dist_mm:.3f}mm  prop={prop_ns:.2f}ns  "
+            f"bw={bw or 0:.2f}GB/s"
+        )
+    print(f"    per-node overhead_ns:")
+    for nid, oh in br["overheads"]:
+        if oh > 0:
+            print(f"      {nid:<60s}  overhead_ns={oh:.2f}")
+    print(f"    edge_total_ns      = {br['edge_total_ns']:.2f}")
+    print(f"    overhead_total_ns  = {br['overhead_total_ns']:.2f}")
+    print(f"    bottleneck_bw_gbs  = {br['bottleneck_bw_gbs']}")
+    print(f"    drain_ns (nbytes={NBYTES}) = {br['drain_ns']:.2f}")
+    print(f"    expected_total_ns  = {br['expected_total_ns']:.2f}")
+
+
+# ── RAW path scenario ────────────────────────────────────────────────
+
+
+def _dump_src_op_records(engine, src_sip, src_cube, src_pe, label) -> None:
+    """Print op_logger records for ops on the SRC PE.
+
+    The op log captures t_start/t_end for memory/math/gemm/copy ops on
+    every component, so we can see how long tl.load vs tl.store vs
+    tl.send actually took at the engine level.
+    """
+    op_logger = getattr(engine, "_op_logger", None)
+    if op_logger is None:
+        print(f"  ({label}) op_logger not available")
+        return
+    src_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}."
+    recs = [r for r in op_logger.records if r.component_id.startswith(src_prefix)]
+    print(f"  ({label}) op_logger records on SRC PE ({src_prefix}*):")
+    for r in recs[:40]:
+        dur = r.t_end - r.t_start
+        comp_short = r.component_id.replace(src_prefix, "")
+        params_short = ""
+        if "nbytes" in r.params:
+            params_short = f" nbytes={r.params['nbytes']}"
+        if "src_addr" in r.params:
+            params_short += f" src_addr={r.params['src_addr']}"
+        if "dst_addr" in r.params:
+            params_short += f" dst_addr={r.params['dst_addr']}"
+        print(
+            f"    t=[{r.t_start:7.2f}..{r.t_end:7.2f}] dur={dur:6.2f}ns  "
+            f"{comp_short:<25s} {r.op_kind:<8s} {r.op_name:<12s}{params_short}"
+        )
+
+
+def _run_raw():
+    captured, restore = _install_per_pe_capture()
+    try:
+        topo = resolve_topology(str(TOPOLOGY_PATH))
+        engine = GraphEngine(topo.topology_obj, enable_data=True)
+        spec = topo.topology_obj.spec
+
+        src_sip, src_cube, src_pe = SRC
+        dst_sip, dst_cube, dst_pe = DST
+        assert src_sip == dst_sip
+
+        src_off = (src_cube * N_PES + src_pe) * N_ELEM * ELEM_BYTES
+        dst_off = (dst_cube * N_PES + dst_pe) * N_ELEM * ELEM_BYTES
+
+        with RuntimeContext(
+            engine=engine,
+            target_device=DeviceSelector("all"),
+            correlation_id="diag_raw",
+            spec=spec,
+        ) as rt:
+            dp = DPPolicy(
+                cube="row_wise", pe="column_wise",
+                num_cubes=N_CUBES, num_pes=N_PES,
+            )
+            rt.ahbm.set_device(src_sip)
+            t = rt.zeros(
+                (N_CUBES, N_PES * N_ELEM), dtype="f16",
+                dp=dp, name="raw_tensor",
+            )
+            t.copy_(rt.from_numpy(
+                np.full((N_CUBES, N_PES * N_ELEM), 1.0, dtype=np.float16),
+            ))
+
+            def kernel(t_ptr, n_elem, tl):
+                pe_id = tl.program_id(axis=0)
+                cube_id = tl.program_id(axis=1)
+                if cube_id == src_cube and pe_id == src_pe:
+                    data = tl.load(
+                        t_ptr + src_off, shape=(n_elem,), dtype="f16",
+                    )
+                    tl.store(t_ptr + dst_off, data)
+
+            pending = rt.launch(
+                "diag_raw_kernel", kernel, t, N_ELEM, _defer_wait=True,
+            )
+            for h, _sip, meta in pending:
+                rt.wait(h, _meta=meta)
+
+        # Compute the RAW sub-txn path: src PE_DMA -> dst HBM_CTRL
+        from kernbench.policy.address.phyaddr import PhysAddr
+        ctx = next(iter(engine._components.values())).ctx
+        src_pe_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}"
+        # Resolve dst PA to HBM controller node
+        # The raw store kernel issues DmaWriteCmd on dst VA; in the engine
+        # this is translated via PE_MMU. For diagnostic we approximate
+        # the destination as the dst cube's HBM controller for slice
+        # belonging to dst_pe.
+        # Use the resolver on a constructed PA matching the same memory
+        # slice the kernel writes to.
+        # The tensor is "row_wise" sharded across cubes, so each cube
+        # owns row[cube_id, :], with each PE owning a column slice.
+        # The actual dst PA depends on the AHBM allocator; we read it
+        # via the tensor's shard map.
+        shard_map = getattr(t, "_shard_map", None) or getattr(t, "shard_map", None)
+        # Fallback: query the resolver directly by constructing a PA in
+        # the dst cube's HBM region. If shard_map is unavailable, still
+        # show the breakdown for src-PE-DMA -> first reachable HBM_CTRL
+        # in dst cube.
+        dst_hbm_id = f"sip{dst_sip}.cube{dst_cube}.hbm_ctrl"
+        if dst_hbm_id not in engine._components:
+            # try alternate naming
+            for nid in engine._components.keys():
+                if (
+                    nid.startswith(f"sip{dst_sip}.cube{dst_cube}.")
+                    and "hbm" in nid
+                ):
+                    dst_hbm_id = nid
+                    break
+
+        # find_path() prepends ".pe_dma" to src_pe automatically
+        try:
+            raw_path = ctx.router.find_path(src_pe_prefix, dst_hbm_id)
+        except Exception as e:
+            raw_path = []
+            print(f"  WARN: find_path raw failed: {e}")
+        if not raw_path:
+            # Try other HBM-related node names in dst cube
+            for nid in engine._components.keys():
+                if not nid.startswith(f"sip{dst_sip}.cube{dst_cube}."):
+                    continue
+                if "hbm" not in nid:
+                    continue
+                try:
+                    p = ctx.router.find_path(src_pe_prefix, nid)
+                except Exception:
+                    p = []
+                if p:
+                    raw_path = p
+                    print(f"  (fallback raw dst node: {nid})")
+                    break
+
+        return captured, ctx, raw_path, engine
+    finally:
+        restore()
+
+
+# ── IPCQ path scenario ───────────────────────────────────────────────
+
+
+def _run_ipcq():
+    captured, restore = _install_per_pe_capture()
+    dst_pe_ipcq_id = (
+        f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_ipcq"
+    )
+    arrival_log, restore_arrival = _install_meta_arrival_capture(
+        dst_pe_ipcq_id,
+    )
+    recv_log, restore_recv = _install_recv_capture(dst_pe_ipcq_id)
+    barrier_log, restore_barrier = _install_barrier_capture()
+    try:
+        topo = resolve_topology(str(TOPOLOGY_PATH))
+        engine = GraphEngine(topo.topology_obj, enable_data=True)
+        spec = topo.topology_obj.spec
+
+        src_sip, src_cube, src_pe = SRC
+        dst_sip, dst_cube, dst_pe = DST
+
+        cfg = load_ccl_config()
+        merged = resolve_algorithm_config(cfg, name="intercube_allreduce")
+        merged["slot_size"] = max(int(merged.get("slot_size", 4096)), NBYTES)
+
+        with RuntimeContext(
+            engine=engine,
+            target_device=DeviceSelector("all"),
+            correlation_id="diag_ipcq",
+            spec=spec,
+        ) as rt:
+            configure_sfr_intercube_multisip(engine, spec, merged)
+            dp = DPPolicy(
+                cube="row_wise", pe="column_wise",
+                num_cubes=N_CUBES, num_pes=N_PES,
+            )
+
+            def kernel(t_ptr, n_elem, tl):
+                pe_id = tl.program_id(axis=0)
+                cube_id = tl.program_id(axis=1)
+                if cube_id == src_cube and pe_id == src_pe:
+                    data = tl.load(t_ptr, shape=(n_elem,), dtype="f16")
+                    tl.send(dir=("E" if HOP == "h3" else "S"), src=data)
+                elif cube_id == dst_cube and pe_id == dst_pe:
+                    tl.recv(
+                        dir=("W" if HOP == "h3" else "N"),
+                        shape=(n_elem,), dtype="f16",
+                    )
+
+            tensors = []
+            for s in sorted({src_sip, dst_sip}):
+                rt.ahbm.set_device(s)
+                t = rt.zeros(
+                    (N_CUBES, N_PES * N_ELEM), dtype="f16",
+                    dp=dp, name=f"sip{s}",
+                )
+                t.copy_(rt.from_numpy(
+                    np.full((N_CUBES, N_PES * N_ELEM), 1.0, dtype=np.float16),
+                ))
+                tensors.append(t)
+
+            all_pending = []
+            for tt in tensors:
+                pending = rt.launch(
+                    "diag_ipcq_kernel", kernel, tt, N_ELEM, _defer_wait=True,
+                )
+                all_pending.extend(pending)
+            for h, _sip, meta in all_pending:
+                rt.wait(h, _meta=meta)
+
+        ctx = next(iter(engine._components.values())).ctx
+        src_pe_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}"
+        dst_pe_dma = f"sip{dst_sip}.cube{dst_cube}.pe{dst_pe}.pe_dma"
+        try:
+            ipcq_path = ctx.router.find_path(src_pe_prefix, dst_pe_dma)
+        except Exception as e:
+            ipcq_path = []
+            print(f"  WARN: find_path ipcq failed: {e}")
+        # Snapshot DST PE_IPCQ qp state at end-of-run so we can see what
+        # peer_head_cache/my_tail looked like (and at which directions).
+        qp_state = _snapshot_qp_state(engine, dst_pe_ipcq_id)
+        return (captured, ctx, ipcq_path, engine,
+                arrival_log, qp_state, recv_log, barrier_log)
+    finally:
+        restore_barrier()
+        restore_recv()
+        restore_arrival()
+        restore()
+
+
+# ── Test entry ───────────────────────────────────────────────────────
+
+
+@pytest.mark.diagnostic
+def test_pe_to_pe_diagnostic_h3():
+    print("\n" + "=" * 78)
+    print(f"  Diagnostic: h3 inter-cube horizontal, nbytes={NBYTES}")
+    print(f"  src={SRC}  dst={DST}")
+    print("=" * 78)
+
+    # ── RAW scenario
+    print("\n[RAW] tl.load + tl.store (sender pays both legs)")
+    raw_per_pe, raw_ctx, raw_path, raw_engine = _run_raw()
+    print(f"  per-PE pe_exec_ns ({len(raw_per_pe)} entries):")
+    src_id = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}.pe_cpu"
+    dst_id = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_cpu"
+    for nid in (src_id, dst_id):
+        if nid in raw_per_pe:
+            print(f"    {nid:<60s}  {raw_per_pe[nid]:.2f} ns  <-- key PE")
+    nonzero = {k: v for k, v in raw_per_pe.items() if v > 0.5}
+    if nonzero:
+        print(f"  other PEs with pe_exec_ns > 0.5 ns:")
+        for nid, v in sorted(nonzero.items(), key=lambda kv: -kv[1])[:6]:
+            if nid not in (src_id, dst_id):
+                print(f"    {nid:<60s}  {v:.2f} ns")
+    print(f"  max(pe_exec_ns) = "
+          f"{max(raw_per_pe.values()) if raw_per_pe else 0:.2f} ns")
+
+    if raw_path:
+        br = _path_breakdown(raw_ctx, raw_path, NBYTES)
+        _print_breakdown("RAW sub-txn path (src.pe_dma -> dst.hbm_ctrl)", br)
+    _dump_src_op_records(raw_engine, *SRC, "RAW")
+
+    # ── IPCQ scenario
+    print("\n[IPCQ] tl.send + tl.recv (recv pays inbound traversal+drain)")
+    (ipcq_per_pe, ipcq_ctx, ipcq_path, ipcq_engine,
+     arrival_log, qp_state, recv_log, barrier_log) = _run_ipcq()
+    print(f"\n  [BARRIER LOG] {len(barrier_log)} _execute_kernel entries:")
+    src_id = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}.pe_cpu"
+    dst_id = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_cpu"
+    n_skipped = 0
+    src_entry = None
+    dst_entry = None
+    for e in barrier_log:
+        if e["barrier_skipped"]:
+            n_skipped += 1
+        if e["node_id"] == src_id:
+            src_entry = e
+        if e["node_id"] == dst_id:
+            dst_entry = e
+    print(f"    PEs entering _execute_kernel: {len(barrier_log)}")
+    print(f"    PEs that SKIPPED barrier (env.now > target_start): {n_skipped}")
+    if src_entry:
+        print(
+            f"    SRC pe ({src_id}): entry_now={src_entry['entry_now']:.2f}  "
+            f"target_start={src_entry['target_start_ns']:.2f}  "
+            f"skipped={src_entry['barrier_skipped']}  "
+            f"late_ns={src_entry['delta_late_ns']:.2f}"
+        )
+    if dst_entry:
+        print(
+            f"    DST pe ({dst_id}): entry_now={dst_entry['entry_now']:.2f}  "
+            f"target_start={dst_entry['target_start_ns']:.2f}  "
+            f"skipped={dst_entry['barrier_skipped']}  "
+            f"late_ns={dst_entry['delta_late_ns']:.2f}"
+        )
+    # Top 5 latest arrivals
+    sorted_late = sorted(
+        [e for e in barrier_log if e["delta_late_ns"] is not None],
+        key=lambda e: -e["delta_late_ns"],
+    )[:5]
+    print(f"    Top 5 latest PE arrivals (positive = barrier missed):")
+    for e in sorted_late:
+        if e["delta_late_ns"] > 0:
+            print(
+                f"      {e['node_id']}: late by {e['delta_late_ns']:.2f} ns  "
+                f"(entry={e['entry_now']:.2f}, target={e['target_start_ns']:.2f})"
+            )
+    print(f"\n  [RECV LOG on dst pe_ipcq] {len(recv_log)} entries:")
+    for e in recv_log:
+        print(
+            f"    {e['phase']:5s} t={e['t']:8.2f} ns  "
+            f"dir={e['direction']}  "
+            f"peer_head_cache={e['peer_head_cache']}  "
+            f"my_tail={e['my_tail']}"
+        )
+    print(f"\n  [META-ARRIVAL LOG on dst pe_ipcq] {len(arrival_log)} arrivals:")
+    for i, e in enumerate(arrival_log):
+        print(
+            f"    #{i:2d} t={e['t']:8.2f} ns  "
+            f"src=(sip{e['src_sip']},cube{e['src_cube']},pe{e['src_pe']}) "
+            f"dir={e['src_direction']}  "
+            f"sender_seq={e['sender_seq']} "
+            f"matched_dir={e['matched_direction']} "
+            f"nbytes={e['nbytes']}"
+        )
+        for d, ph in e["peer_head_cache_before"].items():
+            mt = e["my_tail_before"][d]
+            if ph != 0 or mt != 0 or d == e["matched_direction"]:
+                print(
+                    f"        before: dir={d}  peer_head_cache={ph}  my_tail={mt}"
+                )
+    print(f"\n  [QP STATE END-OF-RUN on dst pe_ipcq]:")
+    for d, st in qp_state.items():
+        print(
+            f"    dir={d}  peer_head_cache={st['peer_head_cache']}  "
+            f"my_tail={st['my_tail']}  rx_range=[{st['rx_range'][0]}..."
+            f"{st['rx_range'][1]})  n_slots={st['n_slots']} "
+            f"slot_size={st['slot_size']}"
+        )
+    print(f"  per-PE pe_exec_ns ({len(ipcq_per_pe)} entries):")
+    for nid in (src_id, dst_id):
+        if nid in ipcq_per_pe:
+            print(f"    {nid:<60s}  {ipcq_per_pe[nid]:.2f} ns  <-- key PE")
+    nonzero = {k: v for k, v in ipcq_per_pe.items() if v > 0.5}
+    if nonzero:
+        print(f"  other PEs with pe_exec_ns > 0.5 ns:")
+        for nid, v in sorted(nonzero.items(), key=lambda kv: -kv[1])[:6]:
+            if nid not in (src_id, dst_id):
+                print(f"    {nid:<60s}  {v:.2f} ns")
+    print(f"  max(pe_exec_ns) = "
+          f"{max(ipcq_per_pe.values()) if ipcq_per_pe else 0:.2f} ns")
+
+    if ipcq_path:
+        br = _path_breakdown(ipcq_ctx, ipcq_path, NBYTES)
+        _print_breakdown("IPCQ sub-txn path (src.pe_dma -> peer.pe_dma)", br)
+    _dump_src_op_records(ipcq_engine, *SRC, "IPCQ")
+    _dump_src_op_records(ipcq_engine, *DST, "IPCQ DST")
+
+    # ── Credit-return path analysis (where the missing IPCQ "ack" lives)
+    print("\n" + "-" * 78)
+    print("Credit-return path (current modeling)")
+    print("-" * 78)
+    src_pe_prefix = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}"
+    dst_pe_prefix = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}"
+    # PE_IPCQ._credit_latency_ns calls
+    #     ctx.router.find_path(self._pe_prefix, peer_pe_prefix)
+    # where the *destination* lacks the ".pe_dma" suffix. find_path()
+    # only auto-appends to the source, so this raises -> the except
+    # clause silently returns 0.0. Effectively credit latency = 0.
+    try:
+        ipcq_ctx.router.find_path(dst_pe_prefix, src_pe_prefix)
+        bug_caught = False
+    except Exception as e:
+        bug_caught = True
+        print(f"  CONFIRMED BUG in _credit_latency_ns: dest lacks '.pe_dma' "
+              f"-> find_path raises -> caught exception -> returns 0.0")
+        print(f"  Error: {e}")
+    # The intended credit path is recv -> sender (reverse data direction)
+    try:
+        credit_path = ipcq_ctx.router.find_path(
+            dst_pe_prefix, f"{src_pe_prefix}.pe_dma",
+        )
+    except Exception as e:
+        credit_path = []
+        print(f"  WARN: corrected find_path credit failed: {e}")
+    if credit_path:
+        credit_size = 16  # PE_IPCQ default _credit_size_bytes
+        # Today's modeling: drain only, 16 bytes -> ~0.125 ns
+        cur = ipcq_ctx.compute_drain_ns(credit_path, credit_size)
+        # Proposed modeling: full path latency (edges + node overhead + drain)
+        proposed = ipcq_ctx.compute_path_latency_ns(credit_path, credit_size)
+        print(f"  credit path nodes = {len(credit_path)} (recv -> sender)")
+        for nid in credit_path[:6]:
+            print(f"    {nid}")
+        if len(credit_path) > 6:
+            print(f"    ... {len(credit_path) - 6} more nodes")
+        br = _path_breakdown(ipcq_ctx, credit_path, credit_size)
+        print(f"  edge_total_ns       = {br['edge_total_ns']:.2f}")
+        print(f"  overhead_total_ns   = {br['overhead_total_ns']:.2f}")
+        print(f"  drain_ns(16 bytes)  = {br['drain_ns']:.2f}")
+        print(f"  CURRENT _credit_latency_ns (drain only) = {cur:.3f} ns")
+        print(f"  PROPOSED            (compute_path_latency_ns) = {proposed:.2f} ns")
+        print(f"  delta               = {proposed - cur:+.2f} ns")
+
+    # ── Comparison summary
+    print("\n" + "-" * 78)
+    print("Summary")
+    print("-" * 78)
+    raw_max = max(raw_per_pe.values()) if raw_per_pe else 0.0
+    ipcq_max = max(ipcq_per_pe.values()) if ipcq_per_pe else 0.0
+    print(f"  RAW  max(pe_exec_ns)              = {raw_max:.2f} ns")
+    print(f"  IPCQ max(pe_exec_ns) (current)    = {ipcq_max:.2f} ns")
+    print(f"  delta (RAW - IPCQ current)        = {raw_max - ipcq_max:+.2f} ns")
+    if credit_path:
+        ipcq_with_credit = ipcq_max + (proposed - cur)
+        print(
+            f"  IPCQ projected w/ blocking credit + full path overhead "
+            f"= {ipcq_with_credit:.2f} ns"
+        )
+        print(
+            f"  delta (RAW - IPCQ projected)      = "
+            f"{raw_max - ipcq_with_credit:+.2f} ns  "
+            f"(<= 0 means IPCQ >= RAW)"
+        )
+
+    # No assertions — this is observational.
+    assert raw_per_pe, "no RAW pe_exec_ns recorded"
+    assert ipcq_per_pe, "no IPCQ pe_exec_ns recorded"
@@ -0,0 +1,347 @@
+"""PE-to-PE latency sweep across hop types and data sizes.
+
+Compares IPCQ send/recv vs raw-DMA (tl.load + tl.store) latency for four
+hop types:
+
+  H1 Intra-cube horizontal   pe0 → pe1
+  H2 Intra-cube vertical     pe0 → pe4
+  H3 Inter-cube horizontal   sip0.cube0.pe0 → sip0.cube1.pe0
+  H4 Inter-cube vertical     sip0.cube0.pe0 → sip0.cube4.pe0
+
+Sizes: 128..10240 bytes. Emits PNGs with both lines plus a CSV.
+"""
+from __future__ import annotations
+
+import csv
+from dataclasses import dataclass
+from pathlib import Path
+
+import numpy as np
+import pytest
+
+from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
+from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
+from kernbench.policy.placement.dp import DPPolicy
+from kernbench.runtime_api.context import RuntimeContext
+from kernbench.runtime_api.types import DeviceSelector
+from kernbench.sim_engine.engine import GraphEngine
+from kernbench.topology.builder import resolve_topology
+
+TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
+PLOT_DIR = Path(__file__).parent / "pe2pe_latency_plots"
+
+SIZES = [128, 256, 384, 512, 768, 1024, 2048, 4096, 8192, 10240]
+
+N_CUBES = 16
+N_PES = 8
+ELEM_BYTES = 2  # f16
+
+
+@dataclass(frozen=True)
+class Hop:
+    id: str
+    label: str
+    src: tuple[int, int, int]
+    dst: tuple[int, int, int]
+    send_dir: str
+    recv_dir: str
+    supports_raw: bool
+
+
+HOPS = [
+    Hop("h1_intra_horizontal", "Intra-cube horizontal (pe0 to pe1)",
+        (0, 0, 0), (0, 0, 1), "intra_E", "intra_W", True),
+    Hop("h2_intra_vertical", "Intra-cube vertical (pe0 to pe4)",
+        (0, 0, 0), (0, 0, 4), "intra_S", "intra_N", True),
+    Hop("h3_inter_cube_horizontal", "Inter-cube horizontal (cube0 to cube1)",
+        (0, 0, 0), (0, 1, 0), "E", "W", True),
+    Hop("h4_inter_cube_vertical", "Inter-cube vertical (cube0 to cube4)",
+        (0, 0, 0), (0, 4, 0), "S", "N", True),
+]
+
+
+def _make_engine():
+    topo = resolve_topology(str(TOPOLOGY_PATH))
+    engine = GraphEngine(topo.topology_obj, enable_data=True)
+    return engine, topo.topology_obj.spec
+
+
+# ── IPCQ path ────────────────────────────────────────────────────────
+
+
+def _measure_ipcq(hop: Hop, nbytes: int) -> float:
+    engine, spec = _make_engine()
+
+    cfg = load_ccl_config()
+    merged = resolve_algorithm_config(cfg, name="intercube_allreduce")
+    merged["slot_size"] = max(int(merged.get("slot_size", 4096)), nbytes)
+
+    n_elem = nbytes // ELEM_BYTES
+    src_sip, src_cube, src_pe = hop.src
+    dst_sip, dst_cube, dst_pe = hop.dst
+    send_dir, recv_dir = hop.send_dir, hop.recv_dir
+
+    with RuntimeContext(
+        engine=engine,
+        target_device=DeviceSelector("all"),
+        correlation_id=f"ipcq_{hop.id}_{nbytes}",
+        spec=spec,
+    ) as ctx:
+        configure_sfr_intercube_multisip(engine, spec, merged)
+
+        dp = DPPolicy(
+            cube="row_wise", pe="column_wise",
+            num_cubes=N_CUBES, num_pes=N_PES,
+        )
+
+        def kernel(t_ptr, n_elem, tl):
+            pe_id = tl.program_id(axis=0)
+            cube_id = tl.program_id(axis=1)
+            if cube_id == src_cube and pe_id == src_pe:
+                data = tl.load(t_ptr, shape=(n_elem,), dtype="f16")
+                tl.send(dir=send_dir, src=data)
+            elif cube_id == dst_cube and pe_id == dst_pe:
+                tl.recv(dir=recv_dir, shape=(n_elem,), dtype="f16")
+
+        tensors = []
+        for s in sorted({src_sip, dst_sip}):
+            ctx.ahbm.set_device(s)
+            t = ctx.zeros(
+                (N_CUBES, N_PES * n_elem), dtype="f16",
+                dp=dp, name=f"sip{s}",
+            )
+            t.copy_(ctx.from_numpy(
+                np.full((N_CUBES, N_PES * n_elem), 1.0, dtype=np.float16),
+            ))
+            tensors.append(t)
+
+        all_pending = []
+        for t in tensors:
+            pending = ctx.launch(
+                f"{hop.id}_ipcq", kernel, t, n_elem, _defer_wait=True,
+            )
+            all_pending.extend(pending)
+        for h, sip_id, meta in all_pending:
+            ctx.wait(h, _meta=meta)
+
+        # Per-PE kernel execution time (excludes launch dispatch and
+        # response aggregation). IPCQ: DST blocks on tl.recv until the
+        # send arrives, so max across SIPs = DST's transfer time.
+        pe_exec_vals = []
+        for h, _sip, _meta in all_pending:
+            _, trace = engine.get_completion(h)
+            if trace and trace.get("pe_exec_ns") is not None:
+                pe_exec_vals.append(float(trace["pe_exec_ns"]))
+
+    return max(pe_exec_vals) if pe_exec_vals else 0.0
+
+
+# ── Raw DMA path (intra-SIP only) ────────────────────────────────────
+
+
+def _measure_raw(hop: Hop, nbytes: int) -> float:
+    """tl.load from source slice + tl.store to destination slice. The VA
+    mapping spans the cube mesh within one SIP (MmuMapMsg broadcasts to all
+    cubes of the SIP), so the store goes through the fabric to the
+    destination PE's HBM.  No IPCQ protocol involved.
+    """
+    if not hop.supports_raw:
+        raise RuntimeError(f"hop {hop.id} does not support raw path")
+
+    engine, spec = _make_engine()
+
+    n_elem = nbytes // ELEM_BYTES
+    src_sip, src_cube, src_pe = hop.src
+    dst_sip, dst_cube, dst_pe = hop.dst
+    assert src_sip == dst_sip
+
+    # Slice offsets in the (N_CUBES, N_PES * n_elem) tensor:
+    #   row = cube, slice within row = pe * n_elem .. (pe+1)*n_elem
+    # Byte offsets from va_base:
+    src_off = (src_cube * N_PES + src_pe) * n_elem * ELEM_BYTES
+    dst_off = (dst_cube * N_PES + dst_pe) * n_elem * ELEM_BYTES
+
+    with RuntimeContext(
+        engine=engine,
+        target_device=DeviceSelector("all"),
+        correlation_id=f"raw_{hop.id}_{nbytes}",
+        spec=spec,
+    ) as ctx:
+        dp = DPPolicy(
+            cube="row_wise", pe="column_wise",
+            num_cubes=N_CUBES, num_pes=N_PES,
+        )
+        ctx.ahbm.set_device(src_sip)
+        t = ctx.zeros(
+            (N_CUBES, N_PES * n_elem), dtype="f16",
+            dp=dp, name="raw_tensor",
+        )
+        t.copy_(ctx.from_numpy(
+            np.full((N_CUBES, N_PES * n_elem), 1.0, dtype=np.float16),
+        ))
+
+        def kernel(t_ptr, n_elem, tl):
+            pe_id = tl.program_id(axis=0)
+            cube_id = tl.program_id(axis=1)
+            if cube_id == src_cube and pe_id == src_pe:
+                data = tl.load(
+                    t_ptr + src_off, shape=(n_elem,), dtype="f16",
+                )
+                tl.store(t_ptr + dst_off, data)
+
+        pending = ctx.launch(
+            f"{hop.id}_raw", kernel, t, n_elem, _defer_wait=True,
+        )
+        for h, sip_id, meta in pending:
+            ctx.wait(h, _meta=meta)
+
+        # Per-PE kernel execution time. Raw: only SRC does real work
+        # (tl.load + tl.store, store is blocking), so max across all PEs
+        # = SRC's transfer time. Idle PEs contribute only overhead_ns.
+        pe_exec_vals = []
+        for h, _sip, _meta in pending:
+            _, trace = engine.get_completion(h)
+            if trace and trace.get("pe_exec_ns") is not None:
+                pe_exec_vals.append(float(trace["pe_exec_ns"]))
+
+    return max(pe_exec_vals) if pe_exec_vals else 0.0
+
+
+# ── CSV + plotting ───────────────────────────────────────────────────
+
+
+def _write_csv(records, path: Path) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    with open(path, "w", newline="", encoding="utf-8") as f:
+        w = csv.DictWriter(
+            f, fieldnames=["hop", "label", "size_bytes", "path", "total_ns"],
+        )
+        w.writeheader()
+        for r in records:
+            w.writerow(r)
+
+
+def _plot_per_hop(records, hop: Hop, path: Path) -> None:
+    import matplotlib.pyplot as plt
+
+    ipcq = sorted(
+        [r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
+        key=lambda r: r["size_bytes"],
+    )
+    raw = sorted(
+        [r for r in records if r["hop"] == hop.id and r["path"] == "raw"],
+        key=lambda r: r["size_bytes"],
+    )
+
+    fig, ax = plt.subplots(figsize=(8, 5))
+    if ipcq:
+        ax.plot(
+            [r["size_bytes"] for r in ipcq],
+            [r["total_ns"] for r in ipcq],
+            marker="o", label="IPCQ (send/recv)", color="tab:blue",
+        )
+    if raw:
+        ax.plot(
+            [r["size_bytes"] for r in raw],
+            [r["total_ns"] for r in raw],
+            marker="s", label="Raw DMA (load+store)", color="tab:orange",
+        )
+    ax.set_xlabel("Data size (bytes)")
+    ax.set_ylabel("Latency (ns)")
+    ax.set_title(hop.label)
+    ax.grid(True, alpha=0.3)
+    ax.legend()
+    fig.tight_layout()
+    fig.savefig(path, dpi=120)
+    plt.close(fig)
+
+
+def _plot_overview(records, path: Path) -> None:
+    import matplotlib.pyplot as plt
+
+    fig, axes = plt.subplots(2, 2, figsize=(13, 9))
+    axes = axes.flatten()
+    for i, hop in enumerate(HOPS):
+        ax = axes[i]
+        ipcq = sorted(
+            [r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
+            key=lambda r: r["size_bytes"],
+        )
+        raw = sorted(
+            [r for r in records if r["hop"] == hop.id and r["path"] == "raw"],
+            key=lambda r: r["size_bytes"],
+        )
+        if ipcq:
+            ax.plot(
+                [r["size_bytes"] for r in ipcq],
+                [r["total_ns"] for r in ipcq],
+                marker="o", label="IPCQ", color="tab:blue",
+            )
+        if raw:
+            ax.plot(
+                [r["size_bytes"] for r in raw],
+                [r["total_ns"] for r in raw],
+                marker="s", label="Raw", color="tab:orange",
+            )
+        ax.set_title(hop.label, fontsize=10)
+        ax.set_xlabel("bytes")
+        ax.set_ylabel("ns")
+        ax.grid(True, alpha=0.3)
+        ax.legend(fontsize=8)
+    for j in range(len(HOPS), len(axes)):
+        axes[j].axis("off")
+    fig.suptitle(
+        "PE-to-PE latency: IPCQ vs raw DMA",
+        fontsize=14,
+    )
+    fig.tight_layout()
+    fig.savefig(path, dpi=120)
+    plt.close(fig)
+
+
+# ── Test entry ───────────────────────────────────────────────────────
+
+
+def test_pe_to_pe_latency_sweep():
+    records: list[dict] = []
+
+    for hop in HOPS:
+        for size in SIZES:
+            # IPCQ path
+            ipcq_ns = _measure_ipcq(hop, size)
+            records.append({
+                "hop": hop.id, "label": hop.label,
+                "size_bytes": size, "path": "ipcq",
+                "total_ns": ipcq_ns,
+            })
+
+            raw_s = "n/a"
+            if hop.supports_raw:
+                raw_ns = _measure_raw(hop, size)
+                records.append({
+                    "hop": hop.id, "label": hop.label,
+                    "size_bytes": size, "path": "raw",
+                    "total_ns": raw_ns,
+                })
+                raw_s = f"{raw_ns:7.1f}ns"
+
+            print(
+                f"[{hop.id}] size={size:5d}  "
+                f"ipcq={ipcq_ns:7.1f}ns  raw={raw_s}"
+            )
+
+    PLOT_DIR.mkdir(parents=True, exist_ok=True)
+    _write_csv(records, PLOT_DIR / "summary.csv")
+    for hop in HOPS:
+        _plot_per_hop(records, hop, PLOT_DIR / f"{hop.id}.png")
+    _plot_overview(records, PLOT_DIR / "overview.png")
+
+    for hop in HOPS:
+        rs = sorted(
+            [r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
+            key=lambda r: r["size_bytes"],
+        )
+        for r in rs:
+            assert r["total_ns"] > 0, f"{hop.id}: total_ns must be > 0"
+
+    print(f"\n  Plots + CSV written to {PLOT_DIR}")
@@ -1,7 +1,10 @@
 import pytest

 from kernbench.policy.address.allocator import AddressConfig, AllocationError, PEMemAllocator
-from kernbench.policy.address.phyaddr import PhysAddr, PhysAddrError, UnitType
+from kernbench.policy.address.phyaddr import (
+    PhysAddr, PhysAddrError, UnitType,
+    PESubUnit, MCPUSubUnit, IOCPUSubUnit,
+)

 _MB = 1 << 20
 _GB = 1 << 30
@@ -23,13 +26,11 @@ _CFG = AddressConfig(


 def test_physaddr_immutable():
-    pa = PhysAddr.hbm_addr(rack_id=0, sip_id=0, cube_id=0, hbm_offset=0)
+    pa = PhysAddr.hbm_addr(sip_id=0, die_id=0, hbm_offset=0)
    with pytest.raises(AttributeError):
-        pa.rack_id = 1  # type: ignore[misc]
-    # hashable
-    {pa}
-    # comparable
-    pa2 = PhysAddr.hbm_addr(rack_id=0, sip_id=0, cube_id=0, hbm_offset=0)
+        pa.sip_id = 1  # type: ignore[misc]
+    {pa}  # hashable
+    pa2 = PhysAddr.hbm_addr(sip_id=0, die_id=0, hbm_offset=0)
    assert pa == pa2


@@ -37,120 +38,133 @@ def test_physaddr_immutable():


 def test_hbm_encode_decode_roundtrip():
-    pa = PhysAddr.hbm_addr(rack_id=2, sip_id=3, cube_id=5, hbm_offset=0x1000)
+    pa = PhysAddr.hbm_addr(sip_id=3, die_id=5, hbm_offset=0x1000)
    raw = pa.encode()
    dec = PhysAddr.decode(raw)
-    assert dec.rack_id == 2
    assert dec.sip_id == 3
-    assert dec.cube_id == 5
+    assert dec.die_id == 5
    assert dec.kind == "hbm"
    assert dec.hbm_offset == 0x1000


-# ── PE resource encode/decode roundtrip ─────────────────────────────
+# ── PE resource encode/decode roundtrip (new layout) ───────────────


 def test_pe_resource_encode_decode_roundtrip():
-    pa = PhysAddr(
-        rack_id=1, sip_id=2, sip_seg=7, local_offset=0,
-        kind="pe_resource", cube_id=7,
-        unit_type=UnitType.PE, pe_id=3, ext=1, sub_offset=0xFF,
+    pa = PhysAddr.pe_resource_addr(
+        sip_id=2, die_id=7, pe_id=3,
+        pe_sub_unit=PESubUnit.PE_TCM, sub_offset=0xFF,
    )
-    # manually build local_offset matching bit layout
-    local_offset = (UnitType.PE << 34) | (3 << 30) | (1 << 29) | 0xFF
-    pa2 = PhysAddr(
-        rack_id=1, sip_id=2, sip_seg=7, local_offset=local_offset,
-        kind="pe_resource", cube_id=7,
-        unit_type=UnitType.PE, pe_id=3, ext=1, sub_offset=0xFF,
-    )
-    raw = pa2.encode()
+    raw = pa.encode()
    dec = PhysAddr.decode(raw)
    assert dec.kind == "pe_resource"
    assert dec.unit_type == UnitType.PE
    assert dec.pe_id == 3
-    assert dec.ext == 1
+    assert dec.pe_sub_unit == PESubUnit.PE_TCM
    assert dec.sub_offset == 0xFF
+    assert dec.die_id == 7
+    assert dec.sip_id == 2
+
+
+def test_pe_resource_all_sub_units():
+    """Each PE sub-unit roundtrips correctly."""
+    for su in PESubUnit:
+        pa = PhysAddr.pe_resource_addr(
+            sip_id=0, die_id=0, pe_id=0,
+            pe_sub_unit=su, sub_offset=42,
+        )
+        dec = PhysAddr.decode(pa.encode())
+        assert dec.pe_sub_unit == su
+        assert dec.sub_offset == 42


 # ── pe_hbm_addr factory ────────────────────────────────────────────


 def test_pe_hbm_addr_factory():
-    SLICE = 6 * (1 << 30)  # 6 GB per PE slice
+    SLICE = 6 * _GB
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=0, cube_id=0,
+        sip_id=0, die_id=0,
        pe_id=2, pe_local_hbm_offset=1024, slice_size_bytes=SLICE,
    )
    assert pa.kind == "hbm"
-    assert pa.cube_id == 0
+    assert pa.die_id == 0
    assert pa.hbm_offset == 2 * SLICE + 1024


 def test_pe_hbm_addr_overflow():
-    SLICE = 6 * (1 << 30)
+    SLICE = 6 * _GB
    with pytest.raises(PhysAddrError, match="pe_local_hbm_offset"):
        PhysAddr.pe_hbm_addr(
-            rack_id=0, sip_id=0, cube_id=0,
+            sip_id=0, die_id=0,
            pe_id=0, pe_local_hbm_offset=SLICE, slice_size_bytes=SLICE,
        )


-# ── Invalid unit_type decode (fix #1) ──────────────────────────────
+# ── Invalid resource_kind decode ──────────────────────────────────


-def test_invalid_unit_type_raises():
-    # Craft a PE-resource address with unit_type=7 (invalid)
-    local_offset = (7 << 34) | (0 << 30) | 0
-    pa_raw = PhysAddr(
-        rack_id=0, sip_id=0, sip_seg=0, local_offset=local_offset,
-    )
+def test_invalid_resource_kind_raises():
+    # resource_kind=7 (invalid), addr_space=0
+    local_offset = (7 << 34) | 0
+    pa_raw = PhysAddr(sip_id=0, die_id=0, local_offset=local_offset)
    raw = pa_raw.encode()
-    with pytest.raises(PhysAddrError, match="unit_type"):
+    with pytest.raises(PhysAddrError, match="resource_kind"):
        PhysAddr.decode(raw)


-# ── hbm_pe_id utility (fix #3) ─────────────────────────────────────
+# ── hbm_pe_id utility ─────────────────────────────────────────────


 def test_hbm_pe_id_utility():
-    SLICE = 6 * (1 << 30)  # 6 GB
+    SLICE = 6 * _GB
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=0, cube_id=0,
+        sip_id=0, die_id=0,
        pe_id=5, pe_local_hbm_offset=256, slice_size_bytes=SLICE,
    )
    assert PhysAddr.hbm_pe_id(pa.hbm_offset, SLICE) == 5


-# ── UnitType.SRAM exists (fix #5) ──────────────────────────────────
+# ── UnitType / sub-unit enums ──────────────────────────────────────


 def test_sram_unit_type_exists():
    assert UnitType.SRAM == 2


+def test_pe_sub_unit_enum():
+    assert PESubUnit.PE_TCM == 6
+    assert PESubUnit.IPCQ == 2
+
+
+def test_mcpu_sub_unit_enum():
+    assert MCPUSubUnit.MCPU_SRAM == 5
+
+
+def test_iocpu_sub_unit_enum():
+    assert IOCPUSubUnit.IO_SRAM == 5
+
+
 # ── cube_sram_addr factory + roundtrip ──────────────────────────────


 def test_cube_sram_addr_roundtrip():
-    pa = PhysAddr.cube_sram_addr(
-        rack_id=0, sip_id=1, cube_id=3, sram_offset=0x800,
-    )
+    pa = PhysAddr.cube_sram_addr(sip_id=1, die_id=3, sram_offset=0x800)
    assert pa.kind == "pe_resource"
    assert pa.unit_type == UnitType.SRAM
-    assert pa.cube_id == 3
+    assert pa.die_id == 3
    assert pa.sub_offset == 0x800
-    # encode → decode roundtrip
    dec = PhysAddr.decode(pa.encode())
    assert dec.unit_type == UnitType.SRAM
-    assert dec.cube_id == 3
+    assert dec.die_id == 3
    assert dec.sub_offset == 0x800


 def test_cube_sram_addr_range_check():
    with pytest.raises(PhysAddrError):
        PhysAddr.cube_sram_addr(
-            rack_id=0, sip_id=0, cube_id=0,
-            sram_offset=(1 << 29),  # exceeds 29-bit sub_offset
+            sip_id=0, die_id=0,
+            sram_offset=(1 << 25),  # exceeds 25-bit sub_offset
        )


@@ -158,29 +172,137 @@ def test_cube_sram_addr_range_check():


 def test_pe_tcm_addr_roundtrip():
-    pa = PhysAddr.pe_tcm_addr(
-        rack_id=0, sip_id=0, cube_id=2, pe_id=7, tcm_offset=0x400,
-    )
+    pa = PhysAddr.pe_tcm_addr(sip_id=0, die_id=2, pe_id=7, tcm_offset=0x400)
    assert pa.kind == "pe_resource"
    assert pa.unit_type == UnitType.PE
    assert pa.pe_id == 7
-    assert pa.cube_id == 2
+    assert pa.die_id == 2
+    assert pa.pe_sub_unit == PESubUnit.PE_TCM
    assert pa.sub_offset == 0x400
-    # encode → decode roundtrip
    dec = PhysAddr.decode(pa.encode())
    assert dec.unit_type == UnitType.PE
    assert dec.pe_id == 7
+    assert dec.pe_sub_unit == PESubUnit.PE_TCM
    assert dec.sub_offset == 0x400


 def test_pe_tcm_addr_range_check():
    with pytest.raises(PhysAddrError):
        PhysAddr.pe_tcm_addr(
-            rack_id=0, sip_id=0, cube_id=0, pe_id=0,
-            tcm_offset=(1 << 29),  # exceeds 29-bit sub_offset
+            sip_id=0, die_id=0, pe_id=0,
+            tcm_offset=(1 << 25),  # exceeds 25-bit sub_offset
        )


+# ── MCPU resource factory + roundtrip ──────────────────────────────
+
+
+def test_mcpu_resource_roundtrip():
+    pa = PhysAddr.mcpu_resource_addr(
+        sip_id=0, die_id=1,
+        mcpu_sub_unit=MCPUSubUnit.MCPU_SRAM, sub_offset=0x100,
+    )
+    assert pa.kind == "pe_resource"
+    assert pa.unit_type == UnitType.MCPU
+    assert pa.mcpu_sub_unit == MCPUSubUnit.MCPU_SRAM
+    assert pa.sub_offset == 0x100
+    dec = PhysAddr.decode(pa.encode())
+    assert dec.unit_type == UnitType.MCPU
+    assert dec.mcpu_sub_unit == MCPUSubUnit.MCPU_SRAM
+    assert dec.sub_offset == 0x100
+
+
+# ── IOCHIPLET: IOCPU factory + roundtrip ────────────────────────────
+
+
+def test_iocpu_resource_roundtrip():
+    pa = PhysAddr.iocpu_resource_addr(
+        sip_id=1, die_id=17,
+        iocpu_sub_unit=IOCPUSubUnit.IPCQ, sub_offset=0x20000,
+    )
+    assert pa.kind == "iocpu"
+    assert pa.iocpu_sub_unit == IOCPUSubUnit.IPCQ
+    assert pa.sub_offset == 0x20000
+    dec = PhysAddr.decode(pa.encode())
+    assert dec.kind == "iocpu"
+    assert dec.iocpu_sub_unit == IOCPUSubUnit.IPCQ
+    assert dec.sub_offset == 0x20000
+    assert dec.die_id == 17
+
+
+def test_iocpu_die_range_check():
+    with pytest.raises(PhysAddrError, match="IOCHIPLET"):
+        PhysAddr.iocpu_resource_addr(
+            sip_id=0, die_id=5,  # not a chiplet die
+            iocpu_sub_unit=0, sub_offset=0,
+        )
+
+
+# ── IOCHIPLET: UAL factory + roundtrip ──────────────────────────────
+
+
+def test_ual_addr_roundtrip():
+    pa = PhysAddr.ual_addr(sip_id=0, die_id=16, ual_offset=0x1000)
+    assert pa.kind == "ual"
+    dec = PhysAddr.decode(pa.encode())
+    assert dec.kind == "ual"
+    assert dec.die_id == 16
+    assert dec.chiplet_offset >= (1 << 31)  # >= 2 GB boundary
+
+
+# ── die_id dispatch ────────────────────────────────────────────────
+
+
+def test_die_id_ahbm_range():
+    for die in [0, 15]:
+        pa = PhysAddr.hbm_addr(sip_id=0, die_id=die, hbm_offset=0)
+        dec = PhysAddr.decode(pa.encode())
+        assert dec.kind == "hbm"
+        assert dec.die_id == die
+
+
+def test_die_id_chiplet_range():
+    for die in [16, 20]:
+        pa = PhysAddr.iocpu_resource_addr(
+            sip_id=0, die_id=die,
+            iocpu_sub_unit=0, sub_offset=0,
+        )
+        dec = PhysAddr.decode(pa.encode())
+        assert dec.kind == "iocpu"
+        assert dec.die_id == die
+
+
+def test_die_id_reserved_raises():
+    raw = (0 << 47) | (21 << 42) | 0  # die_id=21 (reserved)
+    with pytest.raises(PhysAddrError, match="reserved"):
+        PhysAddr.decode(raw)
+
+
+# ── Boundary values ────────────────────────────────────────────────
+
+
+def test_sip_boundary():
+    pa = PhysAddr.hbm_addr(sip_id=15, die_id=0, hbm_offset=0)
+    dec = PhysAddr.decode(pa.encode())
+    assert dec.sip_id == 15
+
+
+def test_mbz_enforcement_ahbm():
+    """AHBM local_offset bits [41:38] must be zero."""
+    local_offset = (1 << 38) | (1 << 37)  # MBZ bit set + HBM
+    pa = PhysAddr(sip_id=0, die_id=0, local_offset=local_offset)
+    with pytest.raises(PhysAddrError, match="bits \\[41:38\\]"):
+        pa.encode()
+
+
+def test_mbz_enforcement_chiplet():
+    """IOCHIPLET local_offset bits [41:40] must be zero."""
+    local_offset = (1 << 40) | 0  # MBZ bit set
+    pa = PhysAddr(sip_id=0, die_id=16, local_offset=local_offset)
+    with pytest.raises(PhysAddrError, match="bits \\[41:40\\]"):
+        pa.encode()
+
+
 # ── AddressConfig ───────────────────────────────────────────────────


@@ -193,7 +315,7 @@ def test_address_config_derived_sizes():


 def _make_alloc(pe_id: int = 0) -> PEMemAllocator:
-    return PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=pe_id, cfg=_CFG)
+    return PEMemAllocator(sip_id=0, die_id=0, pe_id=pe_id, cfg=_CFG)


 def test_allocator_hbm_basic():
@@ -201,8 +323,7 @@ def test_allocator_hbm_basic():
    pa = a.alloc_hbm(4096)
    assert pa.kind == "hbm"
    assert pa.sip_id == 0
-    assert pa.cube_id == 0
-    # hbm_offset should be pe3's slice start
+    assert pa.die_id == 0
    assert pa.hbm_offset == 3 * 6 * _GB


@@ -210,8 +331,8 @@ def test_allocator_hbm_sequential():
    a = _make_alloc()
    pa1 = a.alloc_hbm(1024)
    pa2 = a.alloc_hbm(2048)
-    assert pa1.hbm_offset == 0  # pe0 slice start + 0
-    assert pa2.hbm_offset == 1024  # pe0 slice start + 1024
+    assert pa1.hbm_offset == 0
+    assert pa2.hbm_offset == 1024


 def test_allocator_hbm_overflow():
@@ -235,7 +356,6 @@ def test_allocator_tcm_basic():

 def test_allocator_tcm_respects_reserved():
    a = _make_alloc()
-    # allocatable = 12 MB, should succeed
    a.alloc_tcm(12 * _MB)
    assert a.tcm_used == 12 * _MB
    assert a.tcm_total == 12 * _MB
@@ -21,7 +21,7 @@ def _engine():
 def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -20,7 +20,7 @@ def test_resolve_hbm_addr():
    """HBM address -> sip{S}.cube{C}.hbm_ctrl (single controller per cube)."""
    g = _graph()
    resolver = AddressResolver(g)
-    pa = PhysAddr.hbm_addr(rack_id=0, sip_id=0, cube_id=3, hbm_offset=0x1000)
+    pa = PhysAddr.hbm_addr(sip_id=0, die_id=3, hbm_offset=0x1000)
    assert resolver.resolve(pa) == "sip0.cube3.hbm_ctrl"


@@ -28,33 +28,33 @@ def test_resolve_hbm_addr_high_offset():
    """HBM address with large offset still resolves to same hbm_ctrl."""
    g = _graph()
    resolver = AddressResolver(g)
-    pa = PhysAddr.hbm_addr(rack_id=0, sip_id=0, cube_id=0, hbm_offset=0x600000000)
+    pa = PhysAddr.hbm_addr(sip_id=0, die_id=0, hbm_offset=0x600000000)
    assert resolver.resolve(pa) == "sip0.cube0.hbm_ctrl"


 def test_resolve_pe_tcm_addr():
-    """PE TCM address → sip{S}.cube{C}.pe{P}.pe_tcm"""
+    """PE TCM address -> sip{S}.cube{C}.pe{P}.pe_tcm"""
    g = _graph()
    resolver = AddressResolver(g)
-    pa = PhysAddr.pe_tcm_addr(rack_id=0, sip_id=1, cube_id=5, pe_id=7, tcm_offset=0x400)
+    pa = PhysAddr.pe_tcm_addr(sip_id=1, die_id=5, pe_id=7, tcm_offset=0x400)
    assert resolver.resolve(pa) == "sip1.cube5.pe7.pe_tcm"


 def test_resolve_sram_addr():
-    """SRAM address → sip{S}.cube{C}.sram"""
+    """SRAM address -> sip{S}.cube{C}.sram"""
    g = _graph()
    resolver = AddressResolver(g)
-    pa = PhysAddr.cube_sram_addr(rack_id=0, sip_id=0, cube_id=10, sram_offset=0x800)
+    pa = PhysAddr.cube_sram_addr(sip_id=0, die_id=10, sram_offset=0x800)
    assert resolver.resolve(pa) == "sip0.cube10.sram"


 def test_resolve_mcpu_addr():
-    """MCPU pe_resource address → sip{S}.cube{C}.m_cpu"""
+    """MCPU pe_resource address -> sip{S}.cube{C}.m_cpu"""
    g = _graph()
    resolver = AddressResolver(g)
-    pa = PhysAddr(
-        rack_id=0, sip_id=0, sip_seg=2, local_offset=(UnitType.MCPU << 34),
-        kind="pe_resource", cube_id=2, unit_type=UnitType.MCPU,
+    pa = PhysAddr.mcpu_resource_addr(
+        sip_id=0, die_id=2,
+        mcpu_sub_unit=0, sub_offset=0,
    )
    assert resolver.resolve(pa) == "sip0.cube2.m_cpu"

@@ -64,7 +64,7 @@ def test_resolve_nonexistent_node():
    g = _graph()
    resolver = AddressResolver(g)
    # sip_id=15 doesn't exist in the 2-SIP topology
-    pa = PhysAddr.hbm_addr(rack_id=0, sip_id=15, cube_id=0, hbm_offset=0)
+    pa = PhysAddr.hbm_addr(sip_id=15, die_id=0, hbm_offset=0)
    with pytest.raises(RoutingError):
        resolver.resolve(pa)

@@ -73,7 +73,7 @@ def test_resolve_nonexistent_node():


 def test_path_local_hbm():
-    """PE0 -> hbm_ctrl: pe_dma → router → hbm_ctrl (through router mesh)."""
+    """PE0 -> hbm_ctrl: pe_dma -> router -> hbm_ctrl (through router mesh)."""
    g = _graph()
    router = PathRouter(g)
    path = router.find_path("sip0.cube0.pe0", "sip0.cube0.hbm_ctrl")
@@ -107,7 +107,7 @@ def test_all_pe_hbm_equidistant():
    """All PEs in a cube have equal routing distance to hbm_ctrl.

    With n_to_one mapping and high routing weight on HBM edges,
-    all PE→hbm_ctrl paths have the same accumulated distance.
+    all PE->hbm_ctrl paths have the same accumulated distance.
    """
    g = _graph()
    router = PathRouter(g)
@@ -151,7 +151,7 @@ def test_path_remote_cube_hbm():


 def test_path_sram_via_router_mesh():
-    """PE → SRAM must go through router mesh nodes."""
+    """PE -> SRAM must go through router mesh nodes."""
    g = _graph()
    router = PathRouter(g)
    path = router.find_path("sip0.cube0.pe0", "sip0.cube0.sram")
@@ -168,7 +168,7 @@ def test_path_sram_via_router_mesh():


 def test_path_local_tcm():
-    """PE0 → own TCM is PE-internal, not via router mesh."""
+    """PE0 -> own TCM is PE-internal, not via router mesh."""
    g = _graph()
    router = PathRouter(g)
    path = router.find_path("sip0.cube0.pe0", "sip0.cube0.pe0.pe_tcm")
@@ -0,0 +1,106 @@
+"""Rectangular (non-square) SIP-level 2D topology support.
+
+Phase 1 regression target: today the 2D builtin topology functions in
+``kernbench.ccl.topologies`` (``mesh_2d``, ``torus_2d``,
+``mesh_2d_no_wrap``) hardcode ``side = sqrt(world_size)`` and raise
+``ValueError`` for any non-square ``world_size``. This blocks running
+the allreduce sweep at n_sips=6 on torus/mesh layouts.
+
+Phase 2 will extend these functions to accept optional ``w, h`` kwargs
+so a 2×3 (or 3×2, etc.) layout works. Until then, every test below is
+expected to FAIL.
+
+Layout convention used here (matches non-rectangular case):
+    rank = row * w + col   for 0 <= row < h, 0 <= col < w
+
+For w=2, h=3, world_size=6 the layout is:
+
+         col=0  col=1
+  row=0:   0      1
+  row=1:   2      3
+  row=2:   4      5
+"""
+from __future__ import annotations
+
+import pytest
+
+from kernbench.ccl.topologies import (
+    mesh_2d,
+    mesh_2d_no_wrap,
+    torus_2d,
+)
+
+
+# ── mesh_2d_no_wrap (no wrap-around) ──────────────────────────────────
+
+
+def test_mesh_2d_no_wrap_2x3_top_left():
+    """rank 0 (top-left, no N, no W): only S and E."""
+    nbrs = mesh_2d_no_wrap(rank=0, world_size=6, w=2, h=3)
+    assert nbrs == {"S": 2, "E": 1}, nbrs
+
+
+def test_mesh_2d_no_wrap_2x3_top_right():
+    """rank 1 (top-right, no N, no E): only S and W."""
+    nbrs = mesh_2d_no_wrap(rank=1, world_size=6, w=2, h=3)
+    assert nbrs == {"S": 3, "W": 0}, nbrs
+
+
+def test_mesh_2d_no_wrap_2x3_middle_left():
+    """rank 2 (middle-left, no W): N, S, E."""
+    nbrs = mesh_2d_no_wrap(rank=2, world_size=6, w=2, h=3)
+    assert nbrs == {"N": 0, "S": 4, "E": 3}, nbrs
+
+
+def test_mesh_2d_no_wrap_2x3_bottom_right():
+    """rank 5 (bottom-right, no S, no E): only N and W."""
+    nbrs = mesh_2d_no_wrap(rank=5, world_size=6, w=2, h=3)
+    assert nbrs == {"N": 3, "W": 4}, nbrs
+
+
+# ── torus_2d (wrap-around on all four edges) ─────────────────────────
+
+
+def test_torus_2d_2x3_top_left():
+    """rank 0: N wraps to row 2 col 0 (rank 4); W wraps to col 1 (rank 1)."""
+    nbrs = torus_2d(rank=0, world_size=6, w=2, h=3)
+    assert nbrs == {"N": 4, "S": 2, "W": 1, "E": 1}, nbrs
+
+
+def test_torus_2d_2x3_bottom_right():
+    """rank 5: S wraps to row 0 (rank 1); E wraps to col 0 (rank 4)."""
+    nbrs = torus_2d(rank=5, world_size=6, w=2, h=3)
+    assert nbrs == {"N": 3, "S": 1, "W": 4, "E": 4}, nbrs
+
+
+# ── mesh_2d alias for torus_2d ───────────────────────────────────────
+
+
+def test_mesh_2d_2x3_matches_torus_2d():
+    """mesh_2d is currently a torus alias; behaviour must match torus_2d."""
+    for rank in range(6):
+        assert mesh_2d(rank=rank, world_size=6, w=2, h=3) == \
+            torus_2d(rank=rank, world_size=6, w=2, h=3)
+
+
+# ── Back-compat: square layouts still work without w/h kwargs ────────
+
+
+def test_square_back_compat_mesh_2d_no_wrap():
+    """Calling without w, h should still work for square world_size."""
+    nbrs = mesh_2d_no_wrap(rank=0, world_size=4)
+    assert nbrs == {"S": 2, "E": 1}, nbrs
+
+
+def test_square_back_compat_torus_2d():
+    nbrs = torus_2d(rank=0, world_size=4)
+    assert nbrs == {"N": 2, "S": 2, "W": 1, "E": 1}, nbrs
+
+
+# ── Validation: w*h must match world_size ────────────────────────────
+
+
+def test_rectangular_dims_must_match_world_size():
+    """Phase 2 contract: explicit w, h must satisfy w*h == world_size."""
+    with pytest.raises(ValueError):
+        mesh_2d_no_wrap(rank=0, world_size=6, w=3, h=3)  # 9 != 6
@@ -44,7 +44,7 @@ _CFG = AddressConfig(

 def _make_allocators(num_pe: int = 8) -> dict[tuple[int, int, int], PEMemAllocator]:
    return {
-        (0, 0, i): PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=i, cfg=_CFG)
+        (0, 0, i): PEMemAllocator(sip_id=0, die_id=0, pe_id=i, cfg=_CFG)
        for i in range(num_pe)
    }

@@ -55,7 +55,7 @@ def _make_ctx():

 def test_allocator_free_hbm_reclaims_space():
    """free_hbm returns HBM space; subsequent alloc can reuse it."""
-    a = PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=0, cfg=_CFG)
+    a = PEMemAllocator(sip_id=0, die_id=0, pe_id=0, cfg=_CFG)
    pa1 = a.alloc_hbm(4096)
    used_after_alloc = a.hbm_used
    a.free_hbm(pa1, 4096)
@@ -66,7 +66,7 @@ def test_allocator_free_hbm_reclaims_space():

 def test_allocator_free_tcm_reclaims_space():
    """free_tcm returns TCM space."""
-    a = PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=0, cfg=_CFG)
+    a = PEMemAllocator(sip_id=0, die_id=0, pe_id=0, cfg=_CFG)
    pa1 = a.alloc_tcm(256)
    used_after_alloc = a.tcm_used
    a.free_tcm(pa1, 256)
@@ -39,7 +39,7 @@ _CFG = AddressConfig(

 def _make_allocators(num_pe: int = 8) -> dict[tuple[int, int, int], PEMemAllocator]:
    return {
-        (0, 0, i): PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=i, cfg=_CFG)
+        (0, 0, i): PEMemAllocator(sip_id=0, die_id=0, pe_id=i, cfg=_CFG)
        for i in range(num_pe)
    }

@@ -70,7 +70,7 @@ def _make_standalone(shape, num_pe=NUM_PE):
        sram_bytes_per_cube=32 * _MB,
    )
    allocators = {
-        (0, 0, i): PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=i, cfg=cfg)
+        (0, 0, i): PEMemAllocator(sip_id=0, die_id=0, pe_id=i, cfg=cfg)
        for i in range(num_pe)
    }
    va_alloc = VirtualAllocator(va_base=0x1_0000_0000, va_size=64 * _GB, page_size=4096)
Author	SHA1	Message	Date
ywkang	fca24feac5	Fix all remaining test failures: single-cube allreduce + matplotlib dep - intercube_allreduce: add single-cube fast path that skips intra-SIP mesh reduce and goes directly to inter-SIP exchange. Fixes IPCQ deadlock when TP launches kernel on one cube per SIP. - distributed.py: derive effective cube dims from tensor shard placement instead of hardcoding topology mesh size. - pyproject.toml: add matplotlib>=3.7 to dependencies. - pe_dma.py (prior commit): add MMU translation in pipeline DMA path. 577 passed, 0 failed (was 529 passed, 10 failed). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-27 21:25:31 -07:00
ywkang	d55dc6cb4f	Merge: accept remote pe2pe summary.csv	2026-04-27 17:13:06 -07:00
mukesh	46291bf91b	PE-to-PE latency: drop h5 inter-SIP panel from overview Remove h5_inter_sip from the hop list and switch the overview grid from 2x3 to 2x2. RAW DMA was unavailable for the cross-SIP hop, so the panel only carried IPCQ data and was redundant with h4_inter_cube for the topology comparison. Regenerate pe2pe_latency_plots/overview.png and summary.csv; delete the obsolete h5_inter_sip.png. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:43:28 -07:00
mukesh	04c912f53e	Allreduce sweep: parametrized + xdist parallelism + topology diagram Refactor the latency sweep from one giant test into 36 parametrized cases that run in parallel under xdist (~6-8x faster: 1:49 instead of ~10 min). Each case writes a JSON row to a staging dir; conftest sessionfinish hook aggregates rows on the controller node into summary.csv and the per-topology + overview plots. Aggregator gains a CSV fallback so plot-only tweaks no longer require re-running the sweep. Overview plot updates: - 96 KB explicit x-axis marker with vertical dotted line - horizontal theoretical 2D-torus reference (10600 ns) - annotation showing both theoretical and simulated values at 96 KB - drop overlapping 128 KB tick New topology.png: 2x2 panel diagram showing device-level topology (ring, torus 2x3, mesh 2x3) and the cube-level reduction inside SIP 0. Wrap arrows anchor on box edges and arc outside rows/columns so they do not overlap any SIP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:43:19 -07:00
mukesh	1c33afec55	ADR-0032 + intra_* opposite directions in IPCQ install Add intra_N/S/E/W to install.py _OPPOSITE_DIR table so the intra-cube PE-to-PE namespace is symmetrical with intercube N/S/E/W. ADR-0032 documents the intercube allreduce algorithm (supersedes ADR-0029). Refresh ADR-0024/0025/0029 cross-refs and update test_intercube_sfr_config.py to cover the new intra_* mappings. Drop the obsolete test_ccl_round_robin_recv.py (replaced by intercube tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:43:01 -07:00
ywkang	81cc32c46b	ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables Remove rack_id (4 bits), rename sip_seg→die_id, shift fields to enable 42-bit local_offset (4 TB per die). Define PE_LOCAL/MCPU_LOCAL/CUBE_SRAM sub-unit tables for AHBM dies and IOCPU sub-unit table for IOCHIPLET dies (1 TB window). Supersedes ADR-0031. Also fixes latent VA/PA confusion in pe_dma pipeline DMA path where virtual addresses were decoded as physical addresses without MMU translation — previously masked by coincidental bit-position alignment. 529 passed (+6 recovered), 10 pre-existing failures unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-27 15:52:29 -07:00
mukesh	e9cc40f74d	Rectangular SIP topology + 6-device allreduce sweep mesh_2d, torus_2d, and mesh_2d_no_wrap accept optional w,h kwargs; sqrt fall-back preserved for square layouts (back-compat tests confirm 4-SIP and 9-SIP square configs still work). sfr_config reads system.sips.w/h from spec and threads dims through to the topology fn. test_allreduce_multidevice CONFIGS switched from 4 SIPs (square) to 6 SIPs: ring_1d_6sip, torus_2d_6sip_2x3, mesh_2d_no_wrap_6sip_2x3. _write_temp_configs writes system.sips.w/h when supplied; _sip_topo_dims reads them back. Latency sweep loop also moved to 6-SIP layouts. Linear-scale plot variants dropped -- only log-scale *.png + summary.csv emitted. Plots in tests/allreduce_latency_plots regenerated. New tests/test_sip_topology_rectangular.py asserts neighbor correctness for 2x3 layouts and back-compat for square fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:13:14 -07:00
mukesh	c1a5cf3a2a	ADR-0009 D5: chain-aware target_start_ns + zero-byte launch fanout The single-walk predictor (find_node_path(io_cpu, pe_cpu) + compute_path_latency_ns) under-shot actual dispatch latency for far cubes -- the routing graph could pick a path bypassing M_CPU, and non-zero-nbytes launch sub-txns serialized on shared first hops. Far PEs arrived at _execute_kernel after target_start_ns, silently skipped the barrier yield, and started pe_exec_start late. Their reported pe_exec_ns under-counted by exactly the late_ns amount (63 ns observed at h4 cube4.pe0 in the IPCQ test, up to 113 ns worst case for cubes 9-11), producing the suspicious flat region in the h4 IPCQ curve at 8192/10240 bytes. Fix: - IO_CPU predictor uses the explicit two-leg chain (IO_CPU->M_CPU + M_CPU->PE_CPU - io.overhead - m.overhead), so every PE on every targeted cube has a barrier >= its real dispatch arrival. - Kernel-launch fanout sub-txns carry nbytes=0 (control-plane, not data-plane), removing the per-cube fanout serialization that pushed far M_CPUs past the predictor. - Legacy io_cpu mirror updated. ADR-0009 D5 mechanism updated to specify the two-leg formula and the nbytes=0 requirement. New tests/test_d5_barrier_invariant.py asserts (a) no PE enters _execute_kernel after target_start_ns and (b) every PE in a multi-cube launch has identical pe_exec_start -- both regressions silently pass on the existing tests/test_kernel_launch_sync.py because that test only inspects post-aggregation max(pe_exec_ns). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:12:58 -07:00
mukesh	90874abbfe	ADR-0023 D9: blocking credit-emit with full-path latency PE_IPCQ._handle_recv now yields-from _delayed_credit_send instead of spawning it as a fork, so the receiver's pe_exec_ns includes the credit-return cost. _credit_latency_ns switches from compute_drain_ns(path, 16) to compute_path_latency_ns(path, 16) and fixes a latent find_path bug where the destination lacked the ".pe_dma" suffix (silently returned 0 ns under the bare except). Net effect on h3/h4 inter-cube pe-to-pe latency: IPCQ >= raw DMA at every size, matching real-HW posted-write semantics. tl.send remains fire-and-forget. ADR-0023 D9 amended; new diagnostic test tests/test_pe_to_pe_diagnostic.py captures per-PE pe_exec_ns, paths, drain, and meta-arrival timing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:12:38 -07:00
mukesh	19dfc86dc3	Allreduce latency sweep across topologies and data sizes Adds test_allreduce_latency_sweep that runs the existing intercube allreduce kernel under three SIP topologies (ring_1d, torus_2d, mesh_2d_no_wrap, all at n_sips=4) across 11 data sizes from 256 B/SIP up to 1 MB/SIP. For each point, captures max(pe_exec_ns) — the critical-path kernel time — and emits CSV plus log-x and linear-x plots, both per-topology and combined overview, with KB/MB-formatted tick labels. Reuses run_allreduce + _write_temp_configs and adds a slot_size auto-bump when n_elem*2 exceeds the default IPCQ slot. Sweep skips n_elem=16 because the runtime's dim_map scalar-arg remapping (context.py:761) collides any int-valued kernel scalar that matches a global tensor dim with its local shard size. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 10:16:29 -07:00
mukesh	14d800b0ae	Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023) - KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier (max path latency across every target PE), M_CPU passes it through, PE_CPU yields until it before recording pe_exec_start. Every PE in a launch begins kernel execution at the same env.now regardless of its dispatch path length — eliminates per-PE dispatch-offset artifact in cross-PE and cross-cube latency measurements. - PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top, matching the terminal-drain behavior of ComponentBase._forward_txn for every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget (sender doesn't yield on sub_done); tl.recv now blocks until bytes have actually drained into its inbox. - ComponentContext: new compute_path_latency_ns helper + node_overhead_ns field populated by GraphEngine. - tests/test_kernel_launch_sync.py: asserts all PEs in one launch produce identical pe_exec_ns for a no-op kernel (zero spread). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 15:30:29 -07:00
mukesh	6918e6e906	PE-to-PE latency test + supporting fixes Adds tests/test_pe_to_pe_latency.py: a sweep that measures PE-to-PE transfer latency for five hop types (intra-cube horizontal/vertical, inter-cube horizontal/vertical, inter-SIP) across data sizes 128 B to 10 KB, on both the IPCQ (tl.send/tl.recv) and raw-DMA (tl.load+tl.store) paths. Emits per-hop PNG plots, an overview PNG, and a CSV summary into tests/pe2pe_latency_plots/. Latency is reported as max(pe_exec_ns) across participating PEs, read from engine.get_completion(), so the measurement captures the SRC/DST PE's kernel body time rather than the full launch+ response-aggregation envelope. Two simulator fixes were needed to make this measurement meaningful: - PeMMU now stores a list of (start, end, pa) sub-regions per page rather than a single PA. DPPolicy layouts with shards smaller than page_size (e.g. 128 B payloads with 4 KB pages) used to silently overwrite each other through last-write-wins, causing DMAs intended for cube0 to physically route to cube3 - inflating latency by ~170 ns per DMA at small sizes. STOPGAP: real MMUs don't support sub-page regions; long-term fix is either smaller MMU page size or DPPolicy validation that refuses sub-page shards. - M_CPU's per-PE metrics aggregation (pe_exec_ns, dma_ns, compute_ns) now max-merges against the existing value in result_data rather than overwriting. Multi-cube workloads share one result_data dict via IO_CPU fanout; the previous overwrite caused whichever cube's M_CPU finished last to clobber others' values, so multi-cube pe_exec_ns was racy and frequently 0. Same fix applied in legacy/builtin/m_cpu.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 21:04:31 -07:00