Add deck builder + overview-with-ref diagram scripts

scripts/build_overview_slides.py renders a 5-slide PPTX (kernbench2_overview.pptx) summarizing architecture, model correctness, IPCQ, allreduce, and buffer-kind tier comparison. scripts/emit_overview_with_external_ref.py renders log-y and broken-y variants of the allreduce overview (overview_log.png, overview_broken.png) including a 366 µs ext-sim reference marker at 96 KB / PE. Also includes cube_mesh_view.png rendered from the SVG.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add tl.recv_no_consume diagnostic API for apples-to-apples pe2pe plot
2026-04-28 18:20:54 -07:00 · 2026-04-28 18:20:44 -07:00 · 2026-04-28 18:20:28 -07:00 · 2026-04-28 13:31:02 -07:00 · 2026-04-27 21:42:44 -07:00 · 2026-04-27 21:41:46 -07:00
95 changed files with 7420 additions and 585 deletions
@@ -1,25 +1,39 @@
-# ADR-0001: PhysAddr Layout & Address Decoding Contract
+# ADR-0001: 51-bit Physical Address Layout & Decoding Contract
 ## Status
-Accepted
+Accepted (Revision 2 — 2026-04-27: concrete bit layout, rack_id removal,
 Tray->SIP / SIP->DIE renaming, PE/MCPU/IOCPU sub-unit tables.
 Supersedes ADR-0031.)
 ## Date
-2026-02-27
+2026-04-27 (original: 2026-02-27)
 ## Context
-KernBench Graph Latency Simulator must route requests deterministically and compute end-to-end latency strictly by graph traversal.
+KernBench requires a stable, parsable physical address scheme that:
 To model local vs remote traffic (same/different SIP, same/different CUBE, optional PE-group), requests need a stable, parsable address/location scheme that:
- can be decoded into routing domains (SIP/CUBE/HBM/PE-resource, etc.)
+- can be decoded into routing domains (SIP / die / HBM / PE-resource / IOCPU)
 - remains topology-agnostic (no hardcoded counts)
- supports swappable policy and DI-first components without leaking topology assumptions into node implementations
+- supports swappable policy and DI-first components
 - covers multiple SIPs, AHBM dies, and IO chiplet dies in a unified space
 ### History
 - Original ADR-0001 defined a 51-bit layout with `rack_id(4) + sip_id(4) +
  sip_seg(5) + local_offset(38)`. `rack_id` was never used in practice.
 - ADR-0031 (stub) requested PE-resource range partition but was never
  implemented.
 Revision 2 removes `rack_id`, renames `sip_seg -> die_id`, and provides
 concrete sub-unit tables for PE, MCPU, CUBE_SRAM, and IOCPU resources.
 ADR-0031 is superseded.
 ## Decision
-We define a **PhysAddr value object** and an **address decoding contract** that converts an integer address into routing domains.
+We define a **PhysAddr value object** and an **address decoding contract**
 that converts an integer address into routing domains.
 ### D1. PhysAddr is an immutable value object
@@ -27,82 +41,322 @@ We define a **PhysAddr value object** and an **address decoding contract** that
 - Any allocator returns a **fully specified PhysAddr** (not partial metadata).
 - No global state may be required to interpret a PhysAddr.
-### D2. PhysAddr fields (logical contract)
+### D2. 51-bit Physical Address Layout
-PhysAddr must be able to represent at least:
+A 51-bit physical address is adopted.
- `rack_id` (optional but reserved for scale-out)
+#### 2.1 Top-Level Address Map
 - `sip_id`  (device / SIP domain)
 - `sip_seg` (SIP-level segment/window selection, e.g., cube window)
 - `local_offset` (offset within the chosen segment/window)
-Decoded/derived fields may include (optional):
+```text
 [50:47] sip_id        (4)     -- 16 SIPs
 [46:42] die_id        (5)     -- 32 dies per SIP
 [41: 0] local_offset  (42)    -- 4 TB per die
 ```
- `cube_id`
+```text
- `kind` (e.g., HBM vs PE-resource vs raw)
+50      47 46      42 41                      0
- `unit_type` / `pe_id` (if PE-level addressing is modeled)
+---------+----------+-------------------------+
 | sip_id  | die_id   |      local_offset       |
 +---------+----------+-------------------------+
 ```
-**Important:** The exact bit allocation may evolve, but the *semantic fields above* must remain decodable without hidden assumptions.
+#### 2.2 die_id Allocation
-### D3. Decoding is deterministic and policy-compatible
+| die_id | Meaning |
 |--------|---------|
 | 0..15  | AHBM dies |
 | 16..20 | IOCHIPLET dies |
 | 21..31 | Reserved |
- Decoding must deterministically map an integer address to:
+#### 2.3 AHBM Die Layout
  - destination SIP domain (`sip_id`)
  - destination sub-domain (`cube_id` if applicable)
  - destination target kind (HBM/PE-resource/other)
 - Decoding must not depend on runtime topology sizes; it may depend on **explicit topology parameters** provided through configuration (e.g., segment size, slice size), and those parameters must live in the topology/config layer (not in random components).
-### D4. Topology-derived constants live in the topology layer
+Only the lower 256 GB of the 4 TB die-local window is assigned.
-Constants such as segment sizes (e.g., HBM slice size / window size) are derived from topology configuration (YAML/JSON/dict) and are provided to the decoder via DI/config.
+```text
-They must not be hardcoded in node implementations.
+[41:38] MBZ            (4)
 [37]    addr_space      (1)    -- 0 = local resource, 1 = HBM memory
 [36: 0] sub-address    (37)
 ```
 | addr_space | Meaning |
 |------------|---------|
 | 0 | Local resource |
 | 1 | HBM memory |
 ##### 2.3.1 HBM Window (addr_space = 1)
 ```text
 [36:0] hbm_offset     (37)    -- 128 GB decode window
 ```
 The architectural decode window is fixed at 128 GB. Implemented capacity
 may be smaller depending on SKU/topology (see D4).
 ##### 2.3.2 Resource Window (addr_space = 0)
 ```text
 [36:34] resource_kind  (3)
 [33: 0] kind_local    (34)    -- 16 GB per kind
 ```
 | resource_kind | Meaning |
 |---------------|---------|
 | 000 | PE_LOCAL |
 | 001 | MCPU_LOCAL |
 | 010 | CUBE_SRAM |
 | 011..111 | Reserved |
 Each kind gets a 16 GB decode region.
 ##### 2.3.3 PE_LOCAL (resource_kind = 000)
 ```text
 [33]    MBZ            (1)
 [32:29] pe_id          (4)     -- 0..15
 [28:25] pe_sub_unit    (4)
 [24: 0] sub_offset    (25)    -- 32 MB per slot
 ```
 16 PEs x 16 sub-unit slots x 32 MB = 8 GB active decode.
 | pe_sub_unit | Name | Budget |
 |-------------|------|--------|
 | 0 | PE_CPU_DTCM | 8 KB |
 | 1 | MATH_ENGINE_DTCM | 8 KB |
 | 2 | IPCQ | 256 KB |
 | 3 | PE_CPU_SFR | 16 KB |
 | 4 | MATH_ENGINE_SFR | 16 KB |
 | 5 | DMA_ENGINE_SFR | 192 KB |
 | 6 | PE_TCM | 2 MB |
 | 7..15 | Reserved | -- |
 ##### 2.3.4 MCPU_LOCAL (resource_kind = 001)
 ```text
 [33:30] MBZ            (4)
 [29:25] mcpu_sub_unit  (5)
 [24: 0] sub_offset    (25)    -- 32 MB per slot
 ```
 32 sub-unit slots x 32 MB = 1 GB active decode.
 | mcpu_sub_unit | Name | Budget |
 |---------------|------|--------|
 | 0 | MCPU_ITCM | 512 KB |
 | 1 | MCPU_DTCM | 512 KB |
 | 2 | IPCQ | 256 KB |
 | 3 | MCPU_SFR | 8 KB |
 | 4 | MCPU_DMA_SFR | 16 KB |
 | 5 | MCPU_SRAM | 10 MB |
 | 6..31 | Reserved | -- |
 ##### 2.3.5 CUBE_SRAM (resource_kind = 010)
 ```text
 [33:25] MBZ            (9)
 [24: 0] sram_offset   (25)    -- flat 32 MB
 ```
 #### 2.4 IOCHIPLET Die Layout
 Only the lower 1 TB of the 4 TB die-local window is assigned.
 ```text
 [41:40] MBZ            (2)
 [39: 0] chiplet_offset (40)   -- 1 TB
 ```
 Region split by address range:
 | Range | Meaning | Decode condition |
 |-------|---------|------------------|
 | [0, 2 GB) | IOCPU resource | chiplet_offset < 0x8000_0000 |
 | [2 GB, 1 TB) | UAL | chiplet_offset >= 0x8000_0000 |
 ##### 2.4.1 IOCPU Region
 ```text
 [30:27] iocpu_sub_unit (4)
 [26: 0] sub_offset    (27)    -- 128 MB per slot
 ```
 16 x 128 MB slots. 2 GB active decode.
 | iocpu_sub_unit | Name | Budget |
 |----------------|------|--------|
 | 0 | IOCPU_ITCM | 512 KB |
 | 1 | IOCPU_DTCM | 512 KB |
 | 2 | IPCQ | 2 MB |
 | 3 | IOCPU_SFR | 8 KB |
 | 4 | IO_DMA_SFR | 16 KB |
 | 5 | IO_SRAM | 64 MB |
 | 6..15 | Reserved | -- |
 ##### 2.4.2 UAL Region
 Sub-layout TBD (separate ADR).
 #### 2.5 Addressing Rules
 1. MBZ bits must be zero. An address with non-zero MBZ bits is
   **architecturally invalid**. An implementation may raise a decode fault
   or return an error -- the behavior is not prescribed by this ADR.
 2. Fixed slot sizes are chosen for simple hardware decode; actual
   implemented capacity may be smaller than the slot.
 3. Access beyond a sub-unit's implemented budget within a slot is
   **architecturally invalid** (same policy as MBZ).
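Rule 1 can be illustrated with a minimal sketch that checks only the top-level MBZ bits of the die-local window ([41:38] on AHBM dies, [41:40] on IOCHIPLET dies). The function name `mbz_violation` is hypothetical, and treating a reserved `die_id` as invalid is an assumption of this sketch, not something the ADR prescribes. The nested MBZ fields inside PE_LOCAL, MCPU_LOCAL, and CUBE_SRAM are not covered here.

```python
# Hypothetical helper; bit positions are taken from sections 2.1-2.4 above.
AHBM_DIE_IDS = range(0, 16)
IOCHIPLET_DIE_IDS = range(16, 21)


def mbz_violation(addr: int) -> bool:
    """Return True when a top-level must-be-zero bit of local_offset is set."""
    die_id = (addr >> 42) & 0x1F
    local_offset = addr & ((1 << 42) - 1)
    if die_id in AHBM_DIE_IDS:
        return (local_offset >> 38) != 0   # [41:38] MBZ on AHBM dies
    if die_id in IOCHIPLET_DIE_IDS:
        return (local_offset >> 40) != 0   # [41:40] MBZ on IOCHIPLET dies
    return True  # assumption: reserved die_id is reported as invalid


good = (2 << 47) | (5 << 42) | (1 << 37) | 0x1000   # Appendix A.1 address
bad = good | (1 << 38)                              # sets an MBZ bit
print(mbz_violation(good), mbz_violation(bad))
```

Whether a violation becomes an exception or an error value is left open, matching rule 1.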
 ### D3. Bitfield decoding is deterministic
 Given an integer address, field extraction (`sip_id`, `die_id`, `kind`,
 `sub_unit`, `offset`) is purely positional. No runtime state is required.
 Decoding deterministically maps an integer address to destination domains:
 `sip_id`, `die_id`, target kind (HBM / PE_LOCAL / MCPU_LOCAL / CUBE_SRAM /
 IOCPU / UAL).
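The positional extraction described in D3 might look roughly like the sketch below. `DecodedAddr` and `decode` are hypothetical names used only for illustration; they are not the `phyaddr.py` API. The sketch stops at the target kind and does not enforce MBZ bits or split sub-unit fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodedAddr:
    """Hypothetical result type: routing domains plus the kind-local offset."""
    sip_id: int
    die_id: int
    kind: str
    offset: int


_RESOURCE_KINDS = {0: "PE_LOCAL", 1: "MCPU_LOCAL", 2: "CUBE_SRAM"}


def decode(addr: int) -> DecodedAddr:
    """Purely positional decode: no topology or runtime state is consulted."""
    sip_id = (addr >> 47) & 0xF           # [50:47]
    die_id = (addr >> 42) & 0x1F          # [46:42]
    local = addr & ((1 << 42) - 1)        # [41:0]
    if die_id <= 15:                      # AHBM die
        if (local >> 37) & 1:             # addr_space = 1 -> HBM
            return DecodedAddr(sip_id, die_id, "HBM", local & ((1 << 37) - 1))
        kind = _RESOURCE_KINDS.get((local >> 34) & 0x7, "RESERVED")
        return DecodedAddr(sip_id, die_id, kind, local & ((1 << 34) - 1))
    if die_id <= 20:                      # IOCHIPLET die
        chiplet_offset = local & ((1 << 40) - 1)
        kind = "IOCPU" if chiplet_offset < 0x8000_0000 else "UAL"
        return DecodedAddr(sip_id, die_id, kind, chiplet_offset)
    return DecodedAddr(sip_id, die_id, "RESERVED", local)


print(decode((2 << 47) | (5 << 42) | (1 << 37) | 0x1000))
```

The same integer always yields the same fields, which is the property D3 asks for.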
 ### D4. Capacity validation may depend on topology config
 Whether a decoded address falls within **implemented capacity** (e.g.,
 HBM 96 GB on a specific SKU) is checked against topology parameters
 provided via DI/config. Decode itself (D3) never consults topology --
 only validation does. These parameters must live in the topology/config
 layer, not in node implementations.
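As a sketch of the decode/validate split, capacity can be checked in a separate step that receives the implemented size from configuration. The function name and the `hbm_capacity_bytes` key are hypothetical; the 96 GB figure is the SKU example from the paragraph above, read here as 96 GiB.

```python
GIB = 1 << 30
HBM_DECODE_WINDOW = 128 * GIB  # fixed architectural window (37-bit hbm_offset)


def hbm_offset_in_capacity(hbm_offset: int, implemented_hbm_bytes: int) -> bool:
    """Validation step: compare a decoded offset with the configured capacity."""
    if not 0 <= hbm_offset < HBM_DECODE_WINDOW:
        raise ValueError("hbm_offset is outside the 37-bit decode window")
    return hbm_offset < implemented_hbm_bytes


# Hypothetical topology/config fragment supplied through DI.
topology_cfg = {"hbm_capacity_bytes": 96 * GIB}

print(hbm_offset_in_capacity(0x1000, topology_cfg["hbm_capacity_bytes"]))
print(hbm_offset_in_capacity(100 * GIB, topology_cfg["hbm_capacity_bytes"]))
```

An offset of 100 GiB decodes cleanly but fails validation on this SKU, which is the distinction D4 draws.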
 ### D5. Routing consumes decoded domains, not raw bits
 Routing policy uses decoded domains:
- `src` location (sip/cube/pe or node_id)
+- `src` location (sip / die / pe or node_id)
 - `dst` domains derived from PhysAddr decoding
 - `size_bytes` for size-aware link latency
-Routing must not inspect raw bit-fields directly except inside the decoding module.
+
 Routing must not inspect raw bit-fields directly except inside the
 decoding module.
 ## Alternatives Considered
-1) **Use raw integers everywhere, decode ad-hoc in routing**
+1. **Keep `rack_id` (4 bits)**: Rejected -- never used in practice,
   consumes 4 bits that enable die-local expansion to 42 bits
   (IOCHIPLET 1 TB).
- Rejected: leads to duplicated logic, inconsistent routing, and hidden assumptions embedded in multiple components.
+2. **Uniform 256 GB per die**: Rejected -- IOCHIPLET UAL requires ~1 TB.
   Freed rack_id bits enable 42-bit local_offset.
-1) **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**
+3. **Variable-width die windows (AHBM 256 GB, CHIPLET 1 TB via multi-seg
   spanning)**: Rejected -- complicates D3 (deterministic decoding).
   Uniform 4 TB window with MBZ padding is simpler.
- Rejected: violates SPEC (R3) and breaks swappability and configuration-driven topologies.
+4. **Use raw integers everywhere, decode ad-hoc in routing**: Rejected --
   leads to duplicated logic, inconsistent routing, and hidden
   assumptions.
-1) **Put decoding inside memory controllers or routers**
+5. **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**:
   Rejected -- violates SPEC R3 and breaks swappability.
- Rejected: leaks policy into components and undermines DI-first, swappable implementations (SPEC R4).
+6. **Put decoding inside memory controllers or routers**: Rejected --
   leaks policy into components, violates SPEC R4 / D5.
 ## Consequences
 ### Positive
- Deterministic routing domains enable clear test invariants for local vs remote paths (SPEC R1, R5).
+- Simple hierarchical decoder: SIP -> die -> kind -> sub-unit.
- Keeps topology variability (SPEC R3) while preserving consistent semantics.
+- Clean separation of memory (HBM) vs local resource (PE/MCPU/SRAM/IOCPU).
- DI-first: decoder can be swapped or extended without changing components or tests (SPEC R4).
+- Deterministic routing domains enable clear test invariants (SPEC R1, R5).
 - Expandable: 11 reserved die_id slots, reserved resource_kind / sub-unit
  slots, reserved MBZ bits.
 - DI-first: decoder can be swapped without changing components (SPEC R4).
-### Tradeoffs / Costs
+### Tradeoffs
- Requires explicit configuration for any topology-derived sizes.
+- Sparse address holes due to power-of-2 slot alignment.
- Introduces a single “blessed” decoding module that must remain stable and well-tested.
+- Large reserved/MBZ regions (intentional for future extension).
 - Requires explicit configuration for topology-derived sizes (D4).
 - Introduces a single "blessed" decoding module that must remain stable
  and well-tested.
 ## Supersedes
 - **ADR-0031 (PhysAddr PE-Resource Extension)**: stub status. The
  PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables in D2.3.3-D2.3.5
  fulfill ADR-0031's stated goals.
 ## Implementation Notes (Non-normative)
- Recommended module boundary:
+- Recommended module: `src/kernbench/policy/address/phyaddr.py`
-  - `src/kernbench/policy/address/phyaddr.py`
+- Tests should cover: encode/decode round-trip per kind, MBZ enforcement,
  die_id dispatch (AHBM / IOCHIPLET / reserved), sub-unit boundary
  values, backward compatibility of factory APIs.
 - Factory methods: `hbm_addr`, `pe_hbm_addr`, `pe_tcm_addr`,
  `cube_sram_addr` retain signatures (minus `rack_id`); `cube_id`
  parameter renamed to `die_id`.
 - New factories: `pe_resource_addr`, `mcpu_resource_addr`,
  `iocpu_resource_addr`, `ual_addr`.
- Tests should cover:
+## Appendix A. Address Examples
-  - deterministic decoding
+
-  - local vs remote classification from decoded fields
+### A.1 AHBM HBM access
-  - invariants: “allocator returns full PhysAddr”, “decoding requires no global state”
+
 sip=2, die=5, HBM offset=0x1000
 ```text
 sip_id     = 2       -> [50:47] = 0b0010
 die_id     = 5       -> [46:42] = 0b00101
 addr_space = 1       -> [37]    = 1 (HBM)
 hbm_offset = 0x1000  -> [36:0]
 51-bit addr = (2 << 47) | (5 << 42) | (1 << 37) | 0x1000
 ```
 ### A.2 AHBM PE_LOCAL -- PE3 PE_TCM, offset=0x400
 ```text
 sip_id        = 0  -> [50:47] = 0
 die_id        = 0  -> [46:42] = 0
 addr_space    = 0  -> [37]    = 0
 resource_kind = 0  -> [36:34] = 000 (PE_LOCAL)
 pe_id         = 3  -> [32:29] = 0011
 pe_sub_unit   = 6  -> [28:25] = 0110 (PE_TCM)
 sub_offset    = 0x400 -> [24:0]
 local_offset = (0 << 34) | (3 << 29) | (6 << 25) | 0x400
 ```
 ### A.3 AHBM MCPU_LOCAL -- MCPU_SRAM, offset=0x0
 ```text
 sip_id        = 1  -> [50:47] = 0001
 die_id        = 3  -> [46:42] = 00011
 addr_space    = 0  -> [37]    = 0
 resource_kind = 1  -> [36:34] = 001 (MCPU_LOCAL)
 mcpu_sub_unit = 5  -> [29:25] = 00101 (MCPU_SRAM)
 sub_offset    = 0  -> [24:0]  = 0
 local_offset = (1 << 34) | (5 << 25)
 ```
 ### A.4 IOCHIPLET -- IOCPU IPCQ, offset=0x20000
 ```text
 sip_id         = 1   -> [50:47] = 0001
 die_id         = 17  -> [46:42] = 10001 (IOCHIPLET[1])
 iocpu_sub_unit = 2   -> [30:27] = 0010 (IPCQ)
 sub_offset     = 0x20000 -> [26:0]
 chiplet_offset = (2 << 27) | 0x20000
                 (< 0x8000_0000 -> IOCPU region)
 ```
 ### A.5 IOCHIPLET -- UAL region, offset=4 GB
 ```text
 sip_id         = 0   -> [50:47] = 0
 die_id         = 16  -> [46:42] = 10000 (IOCHIPLET[0])
 chiplet_offset = 0x1_0000_0000 (4 GB >= 2 GB -> UAL region)
 ```
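Examples A.2 and A.4 can be reproduced with a small encoder sketch. `make_pe_local_addr` and `make_iocpu_addr` are hypothetical helpers written for this illustration; they are not the `pe_resource_addr` / `iocpu_resource_addr` factories, whose signatures are not shown in this ADR.

```python
def _check(name: str, value: int, limit: int) -> None:
    """Reject a field value that does not fit its bit width."""
    if not 0 <= value < limit:
        raise ValueError(f"{name}={value} does not fit (limit {limit})")


def make_pe_local_addr(sip_id, die_id, pe_id, pe_sub_unit, sub_offset):
    """AHBM die, addr_space = 0, resource_kind = 000 (PE_LOCAL)."""
    _check("sip_id", sip_id, 16)
    _check("die_id", die_id, 16)          # AHBM dies are 0..15
    _check("pe_id", pe_id, 16)
    _check("pe_sub_unit", pe_sub_unit, 16)
    _check("sub_offset", sub_offset, 1 << 25)
    local_offset = (0 << 34) | (pe_id << 29) | (pe_sub_unit << 25) | sub_offset
    return (sip_id << 47) | (die_id << 42) | local_offset


def make_iocpu_addr(sip_id, die_id, iocpu_sub_unit, sub_offset):
    """IOCHIPLET die, IOCPU region (chiplet_offset < 0x8000_0000)."""
    _check("sip_id", sip_id, 16)
    if not 16 <= die_id <= 20:
        raise ValueError("IOCHIPLET dies are 16..20")
    _check("iocpu_sub_unit", iocpu_sub_unit, 16)
    _check("sub_offset", sub_offset, 1 << 27)
    chiplet_offset = (iocpu_sub_unit << 27) | sub_offset
    return (sip_id << 47) | (die_id << 42) | chiplet_offset


a2 = make_pe_local_addr(0, 0, pe_id=3, pe_sub_unit=6, sub_offset=0x400)
a4 = make_iocpu_addr(1, 17, iocpu_sub_unit=2, sub_offset=0x20000)
print(hex(a2))  # 0x6c000400
print(hex(a4))
```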
 ## Links
- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first), R5 (multi-domain comm)
+- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first),
  R5 (multi-domain comm)
 - ADR-0031: Superseded
@@ -67,6 +67,76 @@ Completion semantics:
 ---
 ### D5. Launch timing is endpoint-synchronized
 All PEs targeted by a single kernel launch MUST begin executing the kernel
 body at the same simulated time, regardless of their dispatch path length
 from the launch entry point.
 Rationale. The dispatch tree Host → IO_CPU → M_CPU → PE_CPU has variable
 latency at every level. PEs near their M_CPU receive the launch earlier
 than PEs farther away; cubes near an IO_CPU receive it earlier than cubes
 farther away. Without synchronization, each PE's kernel begins at a
 different `env.now`, making per-PE metrics such as `pe_exec_ns` a function
 of dispatch-path geometry rather than of the kernel's behavior —
 producing measurement artifacts in benchmarks that time kernel-internal
 waits (for example `tl.recv` on cross-cube or cross-SIP hops).
 Mechanism.
 - `KernelLaunchMsg` carries an optional `target_start_ns: float | None`.
 - **IO_CPU** is the canonical stamper. On fan-out to M_CPUs, it
  computes `target_start_ns = env.now + max_latency` where
  `max_latency` is the maximum, over every target (sip, cube, pe)
  tuple, of the **two-leg dispatch chain**:
  ```
  max_latency = max over every target (sip, cube, pe) of:
      compute_path_latency_ns(find_node_path(io_cpu, m_cpu(sip, cube)))
    + compute_path_latency_ns(find_node_path(m_cpu(sip, cube), pe_cpu(sip, cube, pe)))
    - io_cpu.overhead_ns
    - m_cpu.overhead_ns
  ```
  This models the actual dispatch as **two sequential Transactions**
  (IO_CPU → M_CPU, then M_CPU → PE_CPU). Each leg's
  `compute_path_latency_ns` adds its endpoints' `overhead_ns`;
  `io_cpu.overhead_ns` is subtracted because IO_CPU has already
  paid it before this method runs, and `m_cpu.overhead_ns` is
  subtracted once because it appears as endpoint of leg1 *and*
  start of leg2 but is paid only once at run time. A single
  `find_node_path(io_cpu, pe_cpu)` walk is **not** equivalent —
  it can pick a graph path that bypasses M_CPU and silently
  under-shoots the prediction for far cubes, breaking the D5
  invariant.
  The fanned-out sub-Transactions carry **`nbytes = 0`** for
  `KernelLaunchMsg` (control message only). Without this,
  large kernel-launch payloads would occupy fabric BW on the
  shared first hop and serialize the per-cube dispatch, pushing
  far M_CPUs past `target_start_ns` and re-introducing the
  late-arrival violation.
 - **M_CPU** passes an already-stamped `target_start_ns` through
  unchanged. Only when the value is absent (e.g. a direct
  launch-to-M_CPU unit test) does M_CPU compute a per-cube barrier
  `env.now + max(local command-path latency)`.
 - **PE_CPU** yields `env.timeout(target_start_ns - env.now)` at the top
  of `_execute_kernel`, before recording `pe_exec_start` and invoking
  the kernel body.
 - When `target_start_ns is None`, PE_CPU falls through to the legacy
  unsynchronized behavior — preserving backward compatibility.
 IO_CPU-level stamping guarantees every PE across every targeted cube
 uses the same barrier sim-time, eliminating both the within-cube
 dispatch-offset artifact *and* the cross-cube offset artifact in
 multi-cube launches. This models a real-hardware timed-broadcast launch
 (latency-equalized dispatch tree).
 The synchronization is internal to the engine / IO_CPU / M_CPU / PE_CPU
 control plane — runtime API and application kernels are unchanged.
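The stamping arithmetic can be sketched with plain numbers. The function names and all latency values below are hypothetical; each leg value stands for what `compute_path_latency_ns` would return for that leg, including both endpoint overheads.

```python
def chain_latency_ns(leg1_ns, leg2_ns, io_cpu_overhead_ns, m_cpu_overhead_ns):
    """Two-leg chain: drop the overhead IO_CPU already paid and the
    M_CPU overhead that would otherwise be counted in both legs."""
    return leg1_ns + leg2_ns - io_cpu_overhead_ns - m_cpu_overhead_ns


def stamp_target_start_ns(now_ns, legs_by_target, io_cpu_overhead_ns, m_cpu_overhead_ns):
    """legs_by_target maps (sip, cube, pe) -> (leg1_ns, leg2_ns)."""
    worst = max(
        chain_latency_ns(l1, l2, io_cpu_overhead_ns, m_cpu_overhead_ns)
        for l1, l2 in legs_by_target.values()
    )
    return now_ns + worst


# Hypothetical numbers: one near PE and one far PE.
legs = {
    (0, 0, 0): (20.0, 10.0),    # near: chain = 20 + 10 - 5 - 3 = 22
    (0, 15, 7): (50.0, 14.0),   # far:  chain = 50 + 14 - 5 - 3 = 56
}
target = stamp_target_start_ns(100.0, legs, io_cpu_overhead_ns=5.0, m_cpu_overhead_ns=3.0)
near_arrival = 100.0 + chain_latency_ns(20.0, 10.0, 5.0, 3.0)
print(target, target - near_arrival)  # barrier time, and how long the near PE waits
```

The far PE arrives exactly at the barrier and waits zero time; the near PE absorbs the difference in its `env.timeout`.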
 ---
 ## Links
 - SPEC R1, R2, R7, R8
@@ -372,24 +372,41 @@ When the receiver frees a slot, the sender must learn about it
 travel through general vc_comm fabric — it uses a **separate fast
 path**, an abstraction of the NVLink / UCIe credit-return wire.
-**Latency** is computed from the **bottleneck BW on the path**, not a
+**Latency** is computed from the **full path latency** (per-node
-magic constant:
+overhead + edge propagation + drain), not a magic constant:
 ```
 credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
-path = router.find_path(self_pe, peer_pe)
+path = router.find_path(self_pe, peer_pe.pe_dma)
-latency = compute_drain_ns(path, credit_size_bytes)
+latency = compute_path_latency_ns(path, credit_size_bytes)
-        = credit_size_bytes / bottleneck_bw_on_path
+        = sum(edge.distance_mm * ns_per_mm)
        + sum(node_overhead_ns[n] for n in path)
        + credit_size_bytes / bottleneck_bw_on_path
 ```
 The router auto-appends `.pe_dma` to the source only, so the
 destination MUST be spelled with the explicit `.pe_dma` suffix or
 `find_path` raises and the credit silently teleports at zero cost
 (latent bug fixed alongside this update).
 `tl.recv` blocks on the credit-emit completion (recv yields-from
 `_delayed_credit_send` rather than spawning it as a fork). This puts
 the credit-return cost on the receiver's `pe_exec_ns`, modeling the
 IPCQ control-plane completing the consume-acknowledgement before
 recv returns to the kernel — the protocol equivalent of a non-posted
 `tl.store` waiting for an HBM ack on the raw DMA path.
 That gives us:
 - **Topology-proportional approximation**: an in-cube credit return is
  automatically faster than a cross-SIP credit return.
- **No magic constants**: no arbitrary `ipcq_ctrl_latency_ns`.
+- **No magic constants**: every nanosecond comes from
  `compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
  as data traffic.
 - **No deadlock risk**: unlike piggyback, B can issue credit even when
-  it has no data to send back.
+  it has no data to send back. `peer_credit_store.put` is unbounded.
- **Reuses existing utility**: `ComponentContext.compute_drain_ns`.
+- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
  cost on recv balances the HBM ack-trip cost RAW pays on the sender.
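The three-term latency formula above can be sketched as follows. `credit_latency_ns` is a hypothetical stand-in, not the real `compute_path_latency_ns` signature, and every distance, bandwidth, and overhead value is invented for illustration. Bandwidth is expressed in bytes per nanosecond.

```python
def credit_latency_ns(edges, node_overheads_ns, nbytes, ns_per_mm):
    """edges: list of (distance_mm, bw_bytes_per_ns) along the path.
    node_overheads_ns: overhead of every node on the path."""
    propagation = sum(dist_mm * ns_per_mm for dist_mm, _bw in edges)
    overhead = sum(node_overheads_ns)
    bottleneck_bw = min(bw for _dist, bw in edges)
    return propagation + overhead + nbytes / bottleneck_bw


CREDIT_SIZE_BYTES = 16   # ipcq_credit_size_bytes from ccl.yaml
NS_PER_MM = 0.01         # hypothetical propagation constant

in_cube = credit_latency_ns([(2.0, 64.0)], [1.0, 1.0], CREDIT_SIZE_BYTES, NS_PER_MM)
cross_sip = credit_latency_ns(
    [(2.0, 64.0), (50.0, 16.0), (2.0, 64.0)],
    [1.0, 2.0, 2.0, 1.0],
    CREDIT_SIZE_BYTES,
    NS_PER_MM,
)
print(in_cube, cross_sip)
```

With these made-up inputs the cross-SIP credit costs several times the in-cube one, which is the topology-proportional behavior the first bullet describes.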
 #### Component coupling — SimPy Store channel
@@ -420,11 +437,21 @@ fan-out (see `IpcqInitMsg` in D12).
 #### PE_DMA's added responsibility
 When `vc_comm` receives a token, PE_DMA processes it as the following
-**atomic** sequence. **No SimPy yield is allowed between the two steps**
+sequence: pay the Transaction's terminal BW drain, then atomically
-(invariant I6):
+write data and forward metadata. **No SimPy yield is allowed between
 the data write and the metadata forward** (invariant I6). The drain
 yield must sit before the atomic block, not inside it:
 ```python
-def _on_vc_comm_recv(self, env, token):
+def _on_vc_comm_recv(self, env, txn):
    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
    # sender PE_DMA). MUST happen before the atomic block so recv only
    # wakes after the bytes have "landed".
    drain = getattr(txn, "drain_ns", 0.0)
    if drain > 0:
        yield env.timeout(drain)
    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    data = self._memory_store.read(token.src_space, token.src_addr,
                                   shape=..., dtype=...)
@@ -439,6 +466,33 @@ The final `put` is yieldable but uses an unbounded internal store, so
 it completes in a single step. That `put` is the closing call of the
 atomic block; nothing may be inserted before it.
 #### Drain-at-inbound semantics (D9 timing model)
 The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
 stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
 is paid at each forwarding component via `run()`, and the remaining
 BW drain is paid once at the Transaction's terminal. Every non-IPCQ
 Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
 `ComponentBase._forward_txn` at the terminal node. For IPCQ the
 destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
 (so IPCQ-specific data write + metadata forward can happen), so **the
 drain MUST be paid explicitly at the top of that handler** to keep
 IPCQ's timing model on par with every other fabric Transaction.
 Side-effects of paying drain here:
 - **SRC `tl.send`** is unchanged — fire-and-forget semantics are
  preserved because the sender PE_DMA does not `yield sub_done`. The
  `sub_done.succeed()` call (made after metadata forward below) is an
  event with no listener on the sender side.
 - **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
  when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
  forward now happens after the drain, recv observes the full fabric
  transfer time including bandwidth cost.
 This matches the physical picture: send dispatches and leaves; recv waits
 until the bytes have actually been drained into its inbox.
 ### D9.5. ADR-0020 (2-pass) integration
 `tl.send` / `tl.recv` integrates with ADR-0020's two-pass model. Phase
@@ -365,23 +365,39 @@ data 경로의 piggyback 모델과 달리, credit return은 일반 vc_comm fabri
 does not pass through it; it is handled on a **separate fast path**, an
 abstraction of the NVLink/UCIe credit-return fast path in real hardware.
-**Latency calculation**: derived from the **bottleneck BW of the routed path**,
+**Latency calculation**: derived from the **full path latency of the routed
-not a magic constant.
+path** (per-node overhead + edge propagation + drain), not a magic
 constant.
 ```
 credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
-path = router.find_path(self_pe, peer_pe)
+path = router.find_path(self_pe, peer_pe.pe_dma)
-latency = compute_drain_ns(path, credit_size_bytes)
+latency = compute_path_latency_ns(path, credit_size_bytes)
-        = credit_size_bytes / bottleneck_bw_on_path
+        = sum(edge.distance_mm * ns_per_mm)
        + sum(node_overhead_ns[n] for n in path)
        + credit_size_bytes / bottleneck_bw_on_path
 ```
 The router automatically appends `.pe_dma` to the source only, so the
 destination must carry an explicit `.pe_dma` suffix. Otherwise `find_path`
 raises and the credit silently teleports at zero cost, a latent bug (fixed
 in this update).
 `tl.recv` waits for credit-emit completion with yield-from (previously it
 forked via `env.process`). The credit-return cost is thereby charged to the
 receiver's `pe_exec_ns`, so recv returns to the kernel only after the IPCQ
 control plane has completed the consume-acknowledgement; this is the
 protocol-level equivalent of a non-posted `tl.store` on the RAW DMA path waiting for the HBM ack-trip.
 As a result:
 - **Topology-proportional approximation**: an in-cube credit return and a cross-SIP credit
-  automatically get different latencies (not exact, but more meaningful than a magic constant)
+  automatically get different latencies
- **No magic constants**: no arbitrary value such as a separate `ipcq_ctrl_latency_ns` is needed
+- **No magic constants**: every ns value comes from `compute_path_latency_ns`,
- **No deadlock risk**: unlike piggyback, a credit is issued automatically even when B has no
+  computed on the same edge_map and `node_overhead_ns` as data traffic
-  data to send to A
+- **No deadlock risk**: `peer_credit_store.put` is unbounded, so a credit is issued
- **Reuses existing utility**: uses `ComponentContext.compute_drain_ns` as-is
+  automatically even when B has no data to send to A
 - **`IPCQ ≥ raw DMA`** guaranteed: for a matched physical move, credit-emit
   balances RAW's ack-trip cost
 ```
 PE B: tl.recv(W) → takes the data → my_tail++
@@ -426,11 +442,22 @@ backend init에서 IpcqInitMsg fan-out 시 양방향 fast path channel을 함께
 #### PE_DMA's added responsibility
-On receiving a token, PE_DMA (vc_comm) processes it as the following atomic sequence.
+On receiving a token, PE_DMA (vc_comm) processes it as the following sequence: first pay
-**No SimPy yield is allowed between the two operations** (see the I6 MUST rule):
+the Transaction's terminal BW drain, then atomically perform the data write +
 metadata forward. **No SimPy yield is allowed between the data write and the
 metadata forward** (see the I6 MUST rule). The drain yield must sit before the
 atomic block, not inside it:
 ```python
-def _on_vc_comm_recv(self, env, token):
+def _on_vc_comm_recv(self, env, txn):
    # Pay the drain_ns (= nbytes / bottleneck_bw) stamped by the sender
    # PE_DMA here. This must come before the atomic block, because recv
    # must wake only after the bytes have "landed".
    drain = getattr(txn, "drain_ns", 0.0)
    if drain > 0:
        yield env.timeout(drain)
    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    # 1. write data to dst_addr (the dst memory space is token.dst_endpoint.buffer_kind)
    data = self._memory_store.read(token.src_space, token.src_addr,
@@ -446,6 +473,32 @@ wire로 capacity가 unbounded인 store를 사용하므로 즉시 완료된다 (
 single-step). This final put closes the atomic block, and no other yield may
 be inserted before it.
 #### Drain-at-inbound semantics (D9 timing model)
 The Transaction enters the fabric carrying
 `drain_ns = nbytes / bottleneck_bw_on_path`, stamped by the sender PE_DMA.
 In this simulator, per-hop `overhead_ns` is paid in each forwarding
 component's `run()`, and the remaining BW drain is paid once at the
 Transaction's terminal node. For every non-IPCQ Transaction (raw DMA,
 kernel-launch fanout, etc.), `ComponentBase._forward_txn` pays this drain at
 the terminal. For IPCQ, the destination PE_DMA intercepts the Transaction
 with the `_handle_ipcq_inbound` handler (because it must perform the
 IPCQ-specific data write + metadata forward), so **the drain must be paid
 explicitly at the top of this handler**. Only then is IPCQ's timing model
 on par with every other fabric Transaction.
 Side effects of paying the drain here:
 - **SRC `tl.send`**: behavior unchanged. The sender PE_DMA does not `yield`
   `sub_done`, so fire-and-forget semantics are preserved. The
   `sub_done.succeed()` call made after the metadata forward is an event with
   no listener on the sender side.
 - **DST `tl.recv`**: wakes `drain_ns` later. recv wakes only when
   `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata forward now
   happens after the drain, so recv observes the full fabric transfer time
   including bandwidth cost.
 This matches the physical picture: send dispatches and returns immediately;
 recv waits until the bytes have actually been drained into its inbox.
 #### Backpressure latency accuracy
 Time until backpressure is released:
@@ -2,7 +2,14 @@
 ## Status
-Proposed (Revision 8 — Hierarchical content split out to ADR-0029)
+Accepted. rank = SIP process-group model stands. The allreduce algorithm
 path (mapper / validator / per-PE install machinery originally targeted at
 ADR-0029) has been replaced by ADR-0032: `AhbmCCLBackend` now calls
 `configure_sfr_intercube_multisip` at `init_process_group` time and the
 intercube kernel receives `(sip_rank, sip_topo_kind, sip_topo_w,
 sip_topo_h)` appended after the module's `kernel_args()`. The
 `leader_only` / `all_pes` mapper concepts in this document are no longer
 used by the default allreduce path.
 ## Context
@@ -89,7 +89,14 @@ direction_idx × bytes_per_direction). 따라서:
 `src/kernbench/ccl/install.py`:
 ```python
-_OPPOSITE_DIR = {"E": "W", "W": "E", "N": "S", "S": "N"}
+# Extended in ADR-0032 with global_* pairs for inter-SIP directions,
 # which were introduced by configure_sfr_intercube_multisip to keep
 # intercube (N/S/E/W) and inter-SIP (global_N/S/E/W) namespaces disjoint.
 _OPPOSITE_DIR = {
    "E": "W", "W": "E", "N": "S", "S": "N",
    "global_E": "global_W", "global_W": "global_E",
    "global_N": "global_S", "global_S": "global_N",
 }
 def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
    """Find peer's direction that reciprocates my_dir→peer_rank.
@@ -2,7 +2,9 @@
 ## Status
-Proposed
+Superseded by ADR-0032 (Intercube all-reduce). The 3-level kernel and
 `hierarchical_allreduce.py` module have been removed. The cube-mesh
 intercube + inter-SIP path is now the single all-reduce algorithm.
 ## Context
@@ -2,7 +2,11 @@
 ## Status
-Stub (Blocker for ADR-0030 — specific range allocations TBD)
+Superseded by ADR-0001 (Revision 2, 2026-04-27).
 PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables are now defined in
 ADR-0001 D2.3.3-D2.3.5.
 Previous status: Stub (Blocker for ADR-0030 — specific range allocations TBD)
 ## Context
@@ -0,0 +1,256 @@
 # ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
 ## Status
 Accepted (supersedes ADR-0029).
 ## Context
 ### Goal
 Define a single all-reduce algorithm that exploits the topology hierarchy:
 cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
 one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.
 ### Why replace ADR-0029 (hierarchical 3-level)
 ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
 where every PE in the system participates. In practice this adds the
 intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast)
 without matching the common workload pattern where the tensor is sharded
 **per cube** (not per PE within a cube).
 Moreover, the hierarchical design required:
 - per-PE neighbor graph installation (`_build_pe_installs` multi-level)
 - multi-level topology schema (`hierarchical_3level`)
 - `all_pes` mapper + `multi_pe_sip_local` validator infrastructure
 The intercube algorithm below removes all of that: **pe0-only same-lane
 intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the
 root cube, then broadcast back. Simpler kernel, simpler wiring, same
 bandwidth characteristics for the common per-cube DP workload.
 ### Current state
 - `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
 - `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip`
 - `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this
  automatically at `init_process_group` time.
 - Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`,
  `hierarchical_allreduce` modules and their tests are **removed**.
 ---
 ## Decision
 ### D1. Algorithm structure — 5 phases
 For each SIP (launched concurrently by `mp.spawn`):
 ```
 Phase 1 — Row reduce W → E (cube mesh, pe0 only):
    col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
 Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
    row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
    holds the full SIP sum.
 Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
    Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
    selected by sip_topo_kind (from topology.yaml sips.topology).
 Phase 4 — Col broadcast S → N on rightmost column.
 Phase 5 — Row broadcast E → W across the cube mesh.
 ```
 After all phases every cube's pe0 holds the global sum.
 The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
 (ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
 across topologies; only phase 3 branches. Helper functions
 `_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the
 three exchange patterns.
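
As a rough illustration of the five phases, the following pure-Python model treats each cube's pe0 shard as a single number and replays the reduce and broadcast order on the mesh. It is a sketch, not the kernel: `simulate_intercube_allreduce` is a hypothetical name, cubes are assumed to be numbered row-major with row 0 at the north edge, and phase 3 is collapsed into a plain sum over the root cubes instead of a ring, torus, or mesh exchange.

```python
from typing import List


def simulate_intercube_allreduce(values: List[List[int]],
                                 mesh_w: int = 4,
                                 mesh_h: int = 4) -> List[List[int]]:
    """values[sip][cube] holds one number per cube (pe0). Returns the state after all 5 phases."""
    out = [list(v) for v in values]  # copy so the caller's input is untouched
    last_col = mesh_w - 1
    root = (mesh_h - 1) * mesh_w + last_col  # SE corner, cube 15 on a 4x4 mesh

    for v in out:
        # Phase 1: row reduce W -> E; each cube adds what its west neighbor sent.
        for row in range(mesh_h):
            for col in range(1, mesh_w):
                v[row * mesh_w + col] += v[row * mesh_w + col - 1]
        # Phase 2: col reduce N -> S on the rightmost column.
        for row in range(1, mesh_h):
            v[row * mesh_w + last_col] += v[(row - 1) * mesh_w + last_col]

    # Phase 3 (simplified assumption): root cubes exchange and all end with the global sum.
    total = sum(v[root] for v in out)

    for v in out:
        v[root] = total
        # Phase 4: col broadcast S -> N on the rightmost column.
        for row in range(mesh_h - 2, -1, -1):
            v[row * mesh_w + last_col] = v[(row + 1) * mesh_w + last_col]
        # Phase 5: row broadcast E -> W across every row.
        for row in range(mesh_h):
            for col in range(mesh_w - 2, -1, -1):
                v[row * mesh_w + col] = v[row * mesh_w + col + 1]
    return out
```

Feeding it two SIPs of 16 distinct values should leave every entry equal to the sum of all 32 inputs, which is the post-condition stated above.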
 ### D2. Tensor layout (rank = SIP, per-worker)
 Per ADR-0024 rank = SIP at the process-group level. Each worker allocates
 its own cube-mesh-spanning tensor:
 ```python
 dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
 tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
 ```
 Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
 each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`, where the
 factor 2 is the size of one f16 element in bytes.
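
The shard address arithmetic can be checked with a few lines of Python. This is only a worked example of the formula; `shard_addr` and `F16_BYTES` are illustrative names, not identifiers from the kernel.

```python
F16_BYTES = 2  # size of one f16 element in bytes


def shard_addr(t_ptr: int, cube_id: int, n_elem: int, elem_bytes: int = F16_BYTES) -> int:
    """Byte address of the shard owned by pe0 of cube `cube_id`."""
    return t_ptr + cube_id * n_elem * elem_bytes


# With n_elem = 8, consecutive cubes are 16 bytes apart.
stride = shard_addr(0x1000, 1, 8) - shard_addr(0x1000, 0, 8)
```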
 ### D3. SFR / IPCQ wiring — `configure_sfr_intercube_multisip`
 Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor
 tables for **every cube's pe0 across every SIP**, regardless of which
 cube is the root or which SIP topology is selected. The wiring therefore
 already supports runtime root-cube election (the kernel itself still
 hardcodes the root, see Non-goals) and topology switches without
 re-wiring.
 | Level | Direction labels | Scope |
 |---|---|---|
 | Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
 | Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per `sips.topology` |
 Inter-SIP directions use the `global_*` prefix to keep the namespace
 disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended
 with `global_E ↔ global_W` and `global_N ↔ global_S` so the reverse-
 direction resolver handles 2-SIP bidirectional rings correctly.
 Internally the function calls `install_ipcq` with:
 - `world_size = n_sips × n_cubes`
 - `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
 - A closure-captured `neighbors()` function that builds the map above.
 This `world_size` is internal to IPCQ wiring and does not leak to the
 process-group rank.
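
A minimal sketch of the intercube half of that wiring is shown below. It assumes row-major cube numbering with row 0 at the north edge (consistent with cube 15 being the SE corner); `intercube_neighbors`, `build_rank_to_pe`, and `OPPOSITE_DIR` are hypothetical stand-ins, not the actual `install_ipcq` arguments.

```python
from typing import Dict, List, Tuple

# Mirrors the direction pairs described above, including the global_* extension.
OPPOSITE_DIR = {
    "N": "S", "S": "N", "E": "W", "W": "E",
    "global_N": "global_S", "global_S": "global_N",
    "global_E": "global_W", "global_W": "global_E",
}


def build_rank_to_pe(n_sips: int, n_cubes: int) -> List[Tuple[int, int, int]]:
    """One IPCQ-internal rank per (sip, cube), always on pe0."""
    return [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]


def intercube_neighbors(cube: int, mesh_w: int = 4, mesh_h: int = 4) -> Dict[str, int]:
    """Mesh neighbors of `cube` inside one SIP, with no wrap-around."""
    row, col = divmod(cube, mesh_w)
    out: Dict[str, int] = {}
    if row > 0:
        out["N"] = cube - mesh_w
    if row < mesh_h - 1:
        out["S"] = cube + mesh_w
    if col < mesh_w - 1:
        out["E"] = cube + 1
    if col > 0:
        out["W"] = cube - 1
    return out
```

Every link produced this way has a matching reverse link under `OPPOSITE_DIR`, which is the property the reverse-direction resolver relies on.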
 ### D4. SIP topology — from `topology.yaml`
 ```yaml
 system:
  sips:
    count: 2
    topology: ring_1d       # or torus_2d, mesh_2d_no_wrap
 ```
 - `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
 - `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on
  `global_E/W` then col ring on `global_S/N`.
 - `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce +
  broadcast per dimension.
 2D variants require `n_sips` to be a perfect square.
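
The three topologies can be sketched as a neighbor function over SIP ranks. This is an illustration under assumptions: SIPs are numbered row-major, `global_E` points at rank + 1 and `global_S` at the next row, and the perfect-square rule above is enforced by raising `ValueError`. The function names are hypothetical.

```python
import math
from typing import Dict, Tuple


def sip_grid(n_sips: int, topology: str) -> Tuple[int, int]:
    """Return (width, height) of the SIP arrangement for a topology name."""
    if topology == "ring_1d":
        return (n_sips, 1)
    if topology in ("torus_2d", "mesh_2d_no_wrap"):
        side = math.isqrt(n_sips)
        if side * side != n_sips:
            raise ValueError(f"{topology} requires a perfect-square n_sips, got {n_sips}")
        return (side, side)
    raise ValueError(f"unknown topology: {topology}")


def inter_sip_neighbors(rank: int, n_sips: int, topology: str) -> Dict[str, int]:
    """Peer SIP rank per global_* direction for the given SIP rank."""
    w, h = sip_grid(n_sips, topology)
    if topology == "ring_1d":
        if n_sips < 2:
            return {}
        return {"global_E": (rank + 1) % n_sips, "global_W": (rank - 1) % n_sips}
    row, col = divmod(rank, w)
    wrap = topology == "torus_2d"
    out: Dict[str, int] = {}
    steps = (("global_E", 0, 1), ("global_W", 0, -1), ("global_S", 1, 0), ("global_N", -1, 0))
    for name, d_row, d_col in steps:
        r, c = row + d_row, col + d_col
        if wrap:
            r, c = r % h, c % w
        elif not (0 <= r < h and 0 <= c < w):
            continue  # mesh edge: no link in this direction
        out[name] = r * w + c
    return out
```

On a 2-SIP ring both directions resolve to the same peer, which is the bidirectional case that the extended `_OPPOSITE_DIR` has to handle.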
 ### D5. Process-group integration — `AhbmCCLBackend`
 At `init_process_group` time the backend:
 1. Loads `ccl.yaml` + `topology.yaml`.
 2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
   `system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`.
 3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time
   SFR wiring, mirrors NCCL communicator creation.
 At each `dist.all_reduce(tensor)` call:
 1. Resolves `kernel_fn` from `cfg["module"]`.
 2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from
   `kernel_args(world_size, n_elem)`.
 3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
   `sip_rank` is the current greenlet's bound rank.
 4. Launches with `_defer_wait=True`; the main scheduler drains pending
   handles after all workers submit (per ADR-0024 D7 / ADR-0027 D0.4).
 ### D6. Config schema
 `ccl.yaml`:
 ```yaml
 defaults:
  algorithm: intercube_allreduce
  buffer_kind: tcm
  ...
 algorithms:
  intercube_allreduce:
    module: kernbench.ccl.algorithms.intercube_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
    root_cube: 15
 ```
 `topology.yaml`:
 ```yaml
 system:
  sips:
    count: 2
    topology: ring_1d
 sip:
  cube_mesh: { w: 4, h: 4 }
 ```
 ### D7. Algorithm module contract
 Modules loaded via `cfg["module"]` must export:
 | Name | Purpose |
 |---|---|
 | `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
 | `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) |
 | `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code |
 | `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |
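
A loader could check this contract before launching, roughly as sketched here. `missing_exports`, `build_launch_args`, and `stub_module` are hypothetical; the stub's `kernel_args` assumes `world_size` equals the SIP count and a 4×4 cube mesh, and only the scalar arguments between `t_ptr` and `tl` are assembled.

```python
from types import SimpleNamespace

REQUIRED_EXPORTS = (
    "kernel", "kernel_args", "TOPO_NAME_TO_KIND",
    "SIP_TOPO_RING", "SIP_TOPO_TORUS", "SIP_TOPO_MESH",
)


def missing_exports(module) -> list:
    """Names from the contract table that the module does not provide."""
    return [name for name in REQUIRED_EXPORTS if not hasattr(module, name)]


def build_launch_args(module, world_size, n_elem, sip_rank, topology, topo_w, topo_h) -> tuple:
    """Per-tensor scalars from kernel_args, followed by the per-rank topology scalars."""
    per_tensor = tuple(module.kernel_args(world_size, n_elem))
    kind = module.TOPO_NAME_TO_KIND[topology]
    return per_tensor + (sip_rank, kind, topo_w, topo_h)


# Hypothetical stand-in for an algorithm module, used only to exercise the helpers.
stub_module = SimpleNamespace(
    kernel=lambda *args: None,
    kernel_args=lambda world_size, n_elem: (n_elem, 4, 4, world_size),
    TOPO_NAME_TO_KIND={"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2},
    SIP_TOPO_RING=0,
    SIP_TOPO_TORUS=1,
    SIP_TOPO_MESH=2,
)
```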
 ---
 ## Dependencies
 - **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
 - **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
 - **ADR-0025**: Address-based IPCQ direction matching; extended
  `_OPPOSITE_DIR` with `global_*` pairs.
 - **ADR-0027**: Worker-wait / collective-pending drain in main scheduler.
 ## Non-goals
 - **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the
  workload for this algorithm is per-cube DP.
 - **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and
  `mesh_2d_no_wrap` require `n_sips = k²`.
 - **Pipelined chunks**: single-tile per cube, no pipelining yet.
 - **Root cube runtime election**: the kernel currently uses
  `root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
  corner. SFR wiring covers all cubes, so runtime election is a pure kernel
  change when needed.
 ---
 ## Consequences
 ### Positive
 - **Single kernel, single install path** for all-reduce — replaces four
  removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
 - **Topology-agnostic kernel**: ring / torus / mesh selected via one
  integer param, no kernel duplication.
 - **Automatic via `dist.all_reduce`**: no bench-level or user-level
  algorithm selection needed; config-driven end-to-end.
 - **Full SFR wiring**: every cube on every SIP has inter-SIP links
  available — supports future dynamic root-cube election.
 ### Negative
 - **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
  shard within one cube across 8 PEs are not addressable by this kernel.
  Such workloads would need a separate intra-cube all-reduce path (not
  yet implemented).
 - **`configure_sfr_intercube_multisip` always wires all pe0s**: even if a
  given run only needs a subset (e.g. 1 SIP, ring only). Install cost is
  small but not zero.
 ---
 ## Affected files
 | File | Change |
 |---|---|
 | `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
 | `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
 | `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
 | `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
 | `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
 | `ccl.yaml` | Single `intercube_allreduce` entry |
 | `topology.yaml` | Added `system.sips.topology` |
 | `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
 | `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
 | `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
 | `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
 | Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
@@ -0,0 +1,13 @@
 buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
 hbm,torus_2d,6,128,256,1858.0399999999827
 hbm,torus_2d,6,1024,2048,2389.0399999999827
 hbm,torus_2d,6,8192,16384,6673.039999999986
 hbm,torus_2d,6,32768,65536,21361.03999999992
 sram,torus_2d,6,128,256,1774.0399999999827
 sram,torus_2d,6,1024,2048,2389.0399999999827
 sram,torus_2d,6,8192,16384,7345.039999999986
 sram,torus_2d,6,32768,65536,24337.039999999935
 tcm,torus_2d,6,128,256,1678.0399999999827
 tcm,torus_2d,6,1024,2048,1957.0399999999827
 tcm,torus_2d,6,8192,16384,4225.039999999986
 tcm,torus_2d,6,32768,65536,12001.03999999992
@@ -0,0 +1,37 @@
 algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
 intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,2626.302499999998
 intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,2634.7399999999952
 intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,2645.9899999999925
 intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,2668.489999999987
 intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,2812.489999999987
 intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3010.489999999987
 intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,3406.489999999987
 intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,4198.489999999965
 intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,5782.489999999969
 intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,8950.489999999925
 intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,15286.48999999986
 intercube_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,21622.489999999932
 intercube_allreduce,ring_1d,6,8,16,256,2302.9849999999933
 intercube_allreduce,ring_1d,6,32,64,1024,2310.8599999999906
 intercube_allreduce,ring_1d,6,64,128,2048,2321.359999999988
 intercube_allreduce,ring_1d,6,128,256,4096,2342.3599999999824
 intercube_allreduce,ring_1d,6,512,1024,16384,2479.3599999999824
 intercube_allreduce,ring_1d,6,1024,2048,32768,2669.3599999999824
 intercube_allreduce,ring_1d,6,2048,4096,65536,3049.3599999999824
 intercube_allreduce,ring_1d,6,4096,8192,131072,3809.3599999999715
 intercube_allreduce,ring_1d,6,8192,16384,262144,5329.359999999979
 intercube_allreduce,ring_1d,6,16384,32768,524288,8369.35999999992
 intercube_allreduce,ring_1d,6,32768,65536,1048576,14449.359999999899
 intercube_allreduce,ring_1d,6,49152,98304,1572864,20529.35999999997
 intercube_allreduce,torus_2d,6,8,16,256,1644.2899999999936
 intercube_allreduce,torus_2d,6,32,64,1024,1651.0399999999909
 intercube_allreduce,torus_2d,6,64,128,2048,1660.0399999999881
 intercube_allreduce,torus_2d,6,128,256,4096,1678.0399999999827
 intercube_allreduce,torus_2d,6,512,1024,16384,1795.0399999999827
 intercube_allreduce,torus_2d,6,1024,2048,32768,1957.0399999999827
 intercube_allreduce,torus_2d,6,2048,4096,65536,2281.0399999999827
 intercube_allreduce,torus_2d,6,4096,8192,131072,2929.039999999979
 intercube_allreduce,torus_2d,6,8192,16384,262144,4225.039999999986
 intercube_allreduce,torus_2d,6,16384,32768,524288,6817.039999999943
 intercube_allreduce,torus_2d,6,32768,65536,1048576,12001.03999999992
 intercube_allreduce,torus_2d,6,49152,98304,1572864,17185.039999999994
@@ -0,0 +1,81 @@
 hop,label,size_bytes,path,total_ns
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,ipcq,31.3899999999976
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,raw,12.019999999996799
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,ipcq,33.1399999999976
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,raw,13.019999999996799
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,ipcq,34.8899999999976
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,raw,14.019999999996799
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,ipcq,36.6399999999976
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,raw,15.019999999996799
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,ipcq,40.1399999999976
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,raw,17.0199999999968
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,ipcq,43.6399999999976
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,raw,19.0199999999968
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,ipcq,57.6399999999976
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,raw,27.0199999999968
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,ipcq,85.6399999999976
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,raw,43.0199999999968
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,ipcq,141.64000000000306
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,raw,75.02000000000407
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,ipcq,169.64000000000306
 h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,raw,91.02000000000407
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,ipcq,31.3899999999976
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,raw,12.019999999996799
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,ipcq,33.1399999999976
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,raw,13.019999999996799
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,ipcq,34.8899999999976
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,raw,14.019999999996799
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,ipcq,36.6399999999976
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,raw,15.019999999996799
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,ipcq,40.1399999999976
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,raw,17.0199999999968
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,ipcq,43.6399999999976
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,raw,19.0199999999968
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,ipcq,57.6399999999976
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,raw,27.0199999999968
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,ipcq,85.6399999999976
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,raw,43.0199999999968
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,ipcq,141.64000000000306
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,raw,75.02000000000407
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,ipcq,169.64000000000306
 h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,raw,91.02000000000407
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,ipcq,67.40999999999804
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,raw,68.53999999999724
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,ipcq,69.15999999999804
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,raw,70.03999999999724
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,ipcq,70.90999999999804
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,raw,71.53999999999724
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,ipcq,72.65999999999804
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,raw,73.03999999999724
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,ipcq,76.15999999999804
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,raw,76.03999999999724
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,ipcq,79.65999999999804
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,raw,79.03999999999724
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,ipcq,93.65999999999804
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,raw,91.03999999999724
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,ipcq,121.65999999999804
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,raw,115.03999999999724
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,ipcq,177.65999999999985
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,raw,163.04000000000087
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,ipcq,205.65999999999985
 h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,raw,187.04000000000087
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,ipcq,87.40999999999804
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,raw,88.53999999999724
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,ipcq,89.15999999999804
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,raw,90.03999999999724
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,ipcq,90.90999999999804
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,raw,91.53999999999724
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,ipcq,92.65999999999804
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,raw,93.03999999999724
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,ipcq,96.15999999999804
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,raw,96.03999999999724
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,ipcq,99.65999999999804
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,raw,99.03999999999724
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,ipcq,113.65999999999804
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,raw,111.03999999999724
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,ipcq,141.65999999999804
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,raw,135.03999999999724
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,ipcq,197.65999999999985
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,raw,183.04000000000087
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,ipcq,225.65999999999985
 h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,raw,207.04000000000087
@@ -0,0 +1,157 @@
 direction: right
 pe: PE {
  style.fill: "#f8f9fa"
  style.stroke: "#495057"
  style.border-radius: 8
  cpu: PE_CPU (control) {
    style.fill: "#bbdefb"
    style.stroke: "#1565c0"
    style.border-radius: 4
  }
  sched: PE_SCHED (dispatch) {
    style.fill: "#bbdefb"
    style.stroke: "#1565c0"
    style.border-radius: 4
  }
  ipcq_added: IPCQ (added) {
    style.fill: "#e1f5fe"
    style.stroke: "#0277bd"
    style.stroke-dash: 5
    style.stroke-width: 2
    style.border-radius: 6
    ipcq: PE_IPCQ (control plane) {
      style.fill: "#bbdefb"
      style.stroke: "#1565c0"
      style.border-radius: 4
    }
  }
  dma: PE_DMA (single FIFO inbox) {
    style.fill: "#fff3e0"
    style.stroke: "#e65100"
    style.border-radius: 6
  }
  fs: PE_FETCH_STORE {
    style.fill: "#c8e6c9"
    style.stroke: "#2e7d32"
    style.border-radius: 4
  }
  tcm: TCM (16MB SRAM) {
    style.fill: "#fce4ec"
    style.stroke: "#c62828"
    style.border-radius: 6
    ipcq_slot: IPCQ Slot Region {
      style.stroke-dash: 5
      style.fill: "#ffcdd2"
      style.stroke: "#c62828"
      style.border-radius: 3
    }
  }
  gemm: GEMM engine {
    style.fill: "#c8e6c9"
    style.stroke: "#2e7d32"
    style.border-radius: 4
  }
  math: MATH engine {
    style.fill: "#c8e6c9"
    style.stroke: "#2e7d32"
    style.border-radius: 4
  }
  fport: Fabric Port {
    style.fill: "#ffe0b2"
    style.stroke: "#e65100"
    style.border-radius: 4
  }
  # Control → dispatch
  cpu -> sched: cmd dispatch
  cpu -> ipcq_added.ipcq: IpcqRequest
  # Compute pipeline
  sched -> dma: TileToken\n(compute port)
  dma -> fs: TileToken
  dma <-> tcm: DMA_READ/WRITE\n(HBM ↔ TCM)
  fs <-> tcm: fetch/store\n(TCM ↔ reg)
  fs -> gemm: TileToken
  fs -> math: TileToken
  gemm -> fs: TileToken
  math -> fs: TileToken
  # IPCQ data path — outbound
  ipcq_added.ipcq -> dma: IpcqDmaToken\n(IPCQ port) {style.stroke: "#1565c0"}
  # IPCQ data path — inbound (MetaArrival: DMA → IPCQ)
  dma -> ipcq_added.ipcq: IpcqMetaArrival {style.stroke: "#1565c0"}
  # Credit return (dashed)
  ipcq_added.ipcq -> dma: IpcqCreditMetadata\n(NoC latency charged) {
    style.stroke: "#7b1fa2"
    style.stroke-dash: 5
  }
  # DMA ↔ Fabric
  dma <-> fport
 }
 # ── NoC Router + attached resources ──
 noc: NoC Router {
  style.fill: "#f3e5f5"
  style.stroke: "#6a1b9a"
  style.border-radius: 6
 }
 hbm: Local HBM {
  style.fill: "#e8eaf6"
  style.stroke: "#283593"
  style.border-radius: 6
  ipcq_slot_hbm: IPCQ Slot Region {
    style.stroke-dash: 5
    style.fill: "#c5cae9"
    style.stroke: "#283593"
    style.border-radius: 3
  }
 }
 sram: Cube SRAM {
  style.fill: "#e0f7fa"
  style.stroke: "#00695c"
  style.border-radius: 6
  ipcq_slot_sram: IPCQ Slot Region {
    style.stroke-dash: 5
    style.fill: "#b2dfdb"
    style.stroke: "#00695c"
    style.border-radius: 3
  }
 }
 other_pe: Other PEs {
  style.fill: "#ede7f6"
  style.stroke: "#6a1b9a"
  style.border-radius: 6
 }
 other_cube: Other Cubes / SIPs {
  style.fill: "#ede7f6"
  style.stroke: "#6a1b9a"
  style.border-radius: 6
 }
 pe.fport <-> noc
 noc <-> hbm
 noc <-> sram
 noc <-> other_pe
 noc <-> other_cube
@@ -0,0 +1,166 @@
 direction: right
 pe: PE {
  style.fill: "#f8f9fa"
  style.stroke: "#495057"
  style.border-radius: 8
  cpu: PE_CPU (control) {
    style.fill: "#bbdefb"
    style.stroke: "#1565c0"
    style.border-radius: 4
  }
  sched: PE_SCHED (dispatch) {
    style.fill: "#bbdefb"
    style.stroke: "#1565c0"
    style.border-radius: 4
  }
  ipcq: IPCQ Controller (NEW) {
    style.fill: "#e1f5fe"
    style.stroke: "#0277bd"
    style.border-radius: 6
    style.stroke-width: 2
    ptrmgmt: Pointer Mgmt {
      style.fill: "#b3e5fc"
      style.stroke: "#0277bd"
      style.border-radius: 4
      qprf: QPair Reg File
      bp: Backpressure
      sag: Slot Addr Gen
    }
    sideband: Sideband {
      style.fill: "#b3e5fc"
      style.stroke: "#0277bd"
      style.border-radius: 4
      metax: Meta Extractor
      crinj: Credit Injector
      crrcv: Credit Receiver
    }
  }
  dma: PE_DMA (MOD) {
    style.fill: "#fff3e0"
    style.stroke: "#e65100"
    style.border-radius: 6
    compute_port: compute port {
      style.fill: "#ffe0b2"
      style.stroke: "#e65100"
      style.border-radius: 4
    }
    ipcq_port: IPCQ port {
      style.fill: "#ffe0b2"
      style.stroke: "#e65100"
      style.border-radius: 4
    }
    wrr: WRR Arbiter (NEW) {
      style.fill: "#ffcc80"
      style.stroke: "#e65100"
      style.border-radius: 4
      style.stroke-width: 2
    }
    compute_port -> wrr
    ipcq_port -> wrr
  }
  fs: PE_FETCH_STORE {
    style.fill: "#c8e6c9"
    style.stroke: "#2e7d32"
    style.border-radius: 4
  }
  tcm: TCM (16MB SRAM) {
    style.fill: "#fce4ec"
    style.stroke: "#c62828"
    style.border-radius: 6
    work: Kernel Working Memory {
      style.fill: "#f8bbd0"
      style.stroke: "#c62828"
      style.border-radius: 4
    }
    slot: IPCQ Slot Region (rsv) {
      style.fill: "#f48fb1"
      style.stroke: "#c62828"
      style.border-radius: 4
      style.stroke-width: 2
    }
  }
  gemm: GEMM engine {
    style.fill: "#c8e6c9"
    style.stroke: "#2e7d32"
    style.border-radius: 4
  }
  math: MATH engine {
    style.fill: "#c8e6c9"
    style.stroke: "#2e7d32"
    style.border-radius: 4
  }
  fport: Fabric Port {
    style.fill: "#ffe0b2"
    style.stroke: "#e65100"
    style.border-radius: 4
  }
  # Control
  cpu -> sched: cmd dispatch
  cpu -> ipcq: MMIO
  # Compute pipeline
  sched -> dma.compute_port: TileToken
  dma -> fs: TileToken
  dma <-> tcm.work: DMA_READ/WRITE\n(HBM ↔ TCM)
  fs <-> tcm.work: fetch/store\n(TCM ↔ reg)
  fs -> gemm: TileToken
  fs -> math: TileToken
  gemm -> fs: TileToken
  math -> fs: TileToken
  # IPCQ data path
  ipcq -> dma.ipcq_port: IpcqDmaToken {style.stroke: "#0277bd"}
  dma -> ipcq.sideband.metax: IpcqMetaArrival {style.stroke: "#0277bd"}
  # IPCQ slot R/W
  dma <-> tcm.slot: slot read/write {
    style.stroke: "#0277bd"
    style.stroke-dash: 3
  }
  # Credit via fabric port
  ipcq.sideband.crinj -> fport: credit out (16B) {
    style.stroke: "#7b1fa2"
    style.stroke-dash: 5
  }
  fport -> ipcq.sideband.crrcv: credit in (16B) {
    style.stroke: "#7b1fa2"
    style.stroke-dash: 5
  }
  # DMA ↔ Fabric
  dma.wrr <-> fport
 }
 noc: NoC Router {
  style.fill: "#f3e5f5"
  style.stroke: "#6a1b9a"
  style.border-radius: 6
 }
 ext: Other PEs / Cubes / SIPs {
  style.fill: "#ede7f6"
  style.stroke: "#6a1b9a"
  style.border-radius: 6
 }
 pe.fport <-> noc
 noc <-> ext
@@ -0,0 +1,548 @@
 # IPCQ-DMA Co-design Hardware Design Document
 **Status**: Draft — Review Requested
 **Date**: 2026-04-28
 **Authors**: YW Kang
 **Reviewers**: (HW team TBD)
 **Related**: ADR-0023 (IPCQ PE Collective), ADR-0025 (Direction Addressing)
 ---
 ## 1. Background & Motivation
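 IPCQ (Inter-PE Communication Queue) is a hardware queue mechanism for
 collective communication between PEs. Its core design principle is the
 IPCQ-DMA co-design: **when the DMA transfers data, it automatically updates
 the IPCQ head/tail pointers from piggybacked metadata, with no separate
 control message.**
 This document:
 1. describes how IPCQ operates at the hardware level in the current PE architecture,
 2. verifies how the simulator models that hardware,
 3. proposes a design for the actual hardware implementation, and
 4. reviews the alternatives to settle on the best approach.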
 ---
 ## 2. High-level Behavior of PE_IPCQ
 ![PE Baseline Architecture](diagrams/pe_baseline.png)
 > Source: [`diagrams/pe_baseline.d2`](diagrams/pe_baseline.d2), rendered with `d2 --layout=elk --scale 1.5`.
 ### IPCQ hardware operation
 **HW Configuration**:
 * IPCQ sets up unidirectional, ring-buffer-based queues between PEs to carry data.
 * Each PE maintains an independent queue pair per direction (N/S/E/W, etc.).
 * For each queue pair, IPCQ maintains the sender's head/tail pointers and the receiver's head/tail pointers.
 * **IPCQ Slot Region**: the IPCQ receive buffer. As the dashed boxes in the diagram show, it can be placed in TCM, Cube SRAM, or Local HBM, selected via `buffer_kind`.
 Per-tier performance characteristics (simulation model values, `ipcq_types.py`):
 | Buffer Kind | Intrinsic BW | Effective BW (NoC bottleneck) | Use |
 |-------------|-------------|-------------------------------|------|
 | TCM | 512 GB/s | 512 GB/s (direct-attached, bypasses the NoC) | Lowest latency, PE-internal only |
 | Cube SRAM | 512 GB/s | 128 GB/s (`sram_to_router_bw`) | Shared within the cube, limited by NoC BW |
 | Local HBM | 256 GB/s | 256 GB/s (`hbm_to_router_bw`) | Large capacity, limited by NoC BW |
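
To get a feel for what the effective-bandwidth column means, the drain time of one slot write can be estimated as bytes divided by bandwidth. The sketch below is only that arithmetic: it assumes decimal gigabytes (so 1 GB/s equals 1 byte/ns), ignores hop latency and contention, and uses hypothetical names.

```python
# Effective bandwidth per buffer kind, taken from the table above (GB/s).
EFFECTIVE_BW_GBPS = {"tcm": 512.0, "sram": 128.0, "hbm": 256.0}


def drain_ns(nbytes: int, buffer_kind: str) -> float:
    """Time to drain `nbytes` into the slot region, in ns (1 GB/s == 1 byte/ns)."""
    return nbytes / EFFECTIVE_BW_GBPS[buffer_kind]


# A 64 KiB payload per tier, bandwidth term only.
per_tier = {kind: drain_ns(64 * 1024, kind) for kind in EFFECTIVE_BW_GBPS}
```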
 **Send path (fire-and-forget)**:
 1. PE_CPU issues `tl.send(dir, src_addr)`, which delivers an IpcqRequest to PE_IPCQ.
 2. PE_IPCQ checks backpressure: `(my_head - peer_tail_cache) < peer.n_slots`.
 3. It computes the peer's rx slot address: `peer_rx_base + (my_head % n_slots) × slot_size`.
 4. It hands an IpcqDmaToken (data + piggyback metadata: sender_seq) to PE_DMA.
 5. PE_IPCQ increments `my_head` and returns to PE_CPU immediately (it does not wait for DMA completion).
 6. PE_DMA snapshots the src data and sends it over the NoC to the peer PE_DMA.
 **Receive path (blocking)**:
 1. The peer (receiving-side) PE_DMA writes the data into the slot and extracts the metadata (sender_seq, dst_addr) **in the same cycle**.
 2. PE_IPCQ identifies the direction by dst_addr range matching and updates `peer_head_cache`.
 3. A wakeup signal goes to the PE_CPU waiting in `tl.recv(dir)`.
 4. PE_CPU reads the data from the slot, and PE_IPCQ increments `my_tail`.
 5. **Credit return**: PE_IPCQ sends a 16B credit packet (`consumer_seq`) to the sender over the NoC.
 6. The sender's PE_IPCQ updates `peer_tail_cache`, releasing backpressure.
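
The send and receive steps can be condensed into a small single-direction model. This is a sketch with hypothetical class names (`SenderSide`, `ReceiverSide`): tokens are plain dicts, NoC transport is a direct function call, and a blocked send or recv returns `None` instead of stalling.

```python
class SenderSide:
    def __init__(self, n_slots: int, slot_size: int, peer_rx_base: int):
        self.n_slots, self.slot_size, self.peer_rx_base = n_slots, slot_size, peer_rx_base
        self.my_head = 0          # next sequence number to send
        self.peer_tail_cache = 0  # last consumer_seq learned from credits

    def send(self, data):
        """Return a DMA token, or None when backpressure would block the send."""
        if (self.my_head - self.peer_tail_cache) >= self.n_slots:
            return None
        dst_addr = self.peer_rx_base + (self.my_head % self.n_slots) * self.slot_size
        token = {"dst_addr": dst_addr, "sender_seq": self.my_head, "data": data}
        self.my_head += 1  # fire-and-forget: no wait for DMA completion
        return token

    def on_credit(self, consumer_seq: int) -> None:
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)


class ReceiverSide:
    def __init__(self, n_slots: int, slot_size: int, rx_base: int):
        self.n_slots, self.slot_size, self.rx_base = n_slots, slot_size, rx_base
        self.slots = {}           # slot address -> payload
        self.my_tail = 0
        self.peer_head_cache = 0

    def on_arrival(self, token) -> None:
        # Slot write and metadata update happen together (I6).
        self.slots[token["dst_addr"]] = token["data"]
        self.peer_head_cache = max(self.peer_head_cache, token["sender_seq"] + 1)

    def recv(self):
        """Return (data, consumer_seq for the credit), or None if nothing has arrived."""
        if self.peer_head_cache <= self.my_tail:
            return None
        addr = self.rx_base + (self.my_tail % self.n_slots) * self.slot_size
        data = self.slots[addr]
        self.my_tail += 1
        return data, self.my_tail
```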
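 **Key design principles**:
 - **Data + head pointer piggyback**: sender_seq rides on the DMA data flit, so no separate head-synchronization message is needed.
 - **Atomic write + metadata**: the receiving DMA performs the slot write and the metadata hand-off in the same cycle (I6 invariant).
 - **Address-based direction matching**: even when several directions connect to the same peer, they are distinguished by dst_addr range (ADR-0025).
 - **Credit-based flow control**: after consuming a slot, the receiver notifies the sender with a 16B credit.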
 ---
 ## 3. Simulator Implementation Verification
 This section verifies how the simulator models the hardware behavior above.
 ### 3.1 Mapping design intent to implementation
 | Design intent | Simulator implementation | Location |
 |-----------|----------------|------|
 | DMA piggybacks the head pointer on data transfers | The `IpcqDmaToken.sender_seq` field travels with the data flit | `ipcq_types.py:185` |
 | Receiving DMA handles data write + metadata hand-off atomically | No yield between `store.write` and `IpcqMetaArrival` in `_handle_ipcq_inbound` (I6) | `pe_dma.py:232-275` |
 | Send is fire-and-forget | `_handle_ipcq_outbound` does not wait on `sub_done` | `pe_dma.py:182` |
 | Recv blocks until data arrives | Waits on the condition `peer_head_cache > my_tail` | `pe_ipcq.py:263` |
 | Credit return is a separate fast path | Direct put through a SimPy Store (latency charged from the NoC path) | `pe_ipcq.py:443-469` |
 | In-flight data semantics (snapshot) | Data is snapshotted at send time and is unaffected by later src modification | `pe_dma.py:142-155` |
 | Single PE_DMA inbox | `_fan_in` merges every in_port into a single FIFO; there is no arbiter between the compute port and the IPCQ port | `base.py:51-53` |
 ### 3.2 Credit return path modeling details
 Credit return finds the actual NoC path with `router.find_path()` and charges
 the hop latency + BW drain computed by `compute_path_latency_ns()`.
 ```python
 # pe_ipcq.py:471-492
 def _credit_latency_ns(self, direction: str) -> float:
    path = self.ctx.router.find_path(self._pe_prefix, peer_pe_dma)
    return self.ctx.compute_path_latency_ns(path, self._credit_size_bytes)
 ```
 However, the latency is paid with `env.timeout()` and the credit is then put
 directly into `peer_credit_store` (a SimPy Store). No actual `Transaction` is
 created and sent hop-by-hop through the NoC, so **bandwidth contention with
 other traffic is not modeled.**
 | | Latency | BW Contention |
 |---|---|---|
 | Data path (IpcqDmaToken) | Modeled accurately as a NoC Transaction | Traverses the actual fabric |
 | Credit path (16B) | NoC path latency reflected accurately | No fabric Transaction injected (simplification) |
 A 16B credit is negligible next to a data transfer (tens to hundreds of KB),
 so the practical error from this simplification is close to zero.
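 ### 3.3 Verification conclusion
 The simulator implementation **accurately models** the intent of the IPCQ-DMA
 co-design; the only known gap is the credit-path contention simplification in 3.2.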
 ---
 ## 4. Proposed Hardware Design
 ### 4.1 Block Diagram (after changes)
 Changes are highlighted: **(NEW)** = new, **(MOD)** = modified.
 ![PE Proposed Architecture](diagrams/pe_proposed.png)
 > Source: [`diagrams/pe_proposed.d2`](diagrams/pe_proposed.d2), rendered with `d2 --layout=elk`.
 **Key Baseline → Proposed changes**:
 - Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW)
 - PE_IPCQ (SimPy component) → **IPCQ Controller** (HW registers + combinational logic)
 - An explicit **reserved IPCQ Slot Region** inside TCM
 - Credit Injector / Receiver connect directly to the NoC through the Fabric Port
 ### 4.2 Module Details
 #### 4.2.1 IPCQ Controller (new module)
 A hardware control block that sits between PE_CPU and the DMA Engine.
 It corresponds to the simulator's `PeIpcqComponent`.
 ##### QPair Register File
 Holds per-direction queue pair state in flip-flops.
 ```
 Per-direction registers (each 64-bit):
  my_head          — sender write position (monotonic)
  my_tail          — receiver read position (monotonic)
  peer_head_cache  — last known peer head (updated by Meta Extractor)
  peer_tail_cache  — last known peer tail (updated by Credit Receive)
  rx_base_pa       — this PE's rx buffer base physical address
  peer_rx_base_pa  — peer's rx buffer base physical address
  n_slots          — ring depth (power-of-2 constraint, see below)
  slot_size        — bytes per slot
  peer_credit_tgt  — credit receive address of the peer PE
 Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare)
 Total: 8 dirs × 9 regs × 8B = 576B flip-flops
 ```
 PE_CPU can read and write these registers over MMIO (CSR). Software populates them at init time.
 ##### Slot Address Generator (combinational)
 ```
 Input:  pointer (my_head or my_tail), n_slots, slot_size, base_pa
 Output: slot_addr = base_pa + (pointer % n_slots) * slot_size
 Implementation:
  n_slots power-of-2 constraint → pointer & (n_slots - 1)  (AND mask, 1 gate delay)
  slot_size power-of-2   → barrel shift               (1 cycle)
  64-bit add             → ripple/kogge-stone adder    (1 cycle)
 Latency: 1-2 cycles combinational
 ```
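
The mask-and-shift form can be checked against the modulo form in software. The sketch below assumes both `n_slots` and `slot_size` are powers of two, as the implementation notes require; `slot_addr` is an illustrative name.

```python
def _is_pow2(x: int) -> bool:
    return x > 0 and (x & (x - 1)) == 0


def slot_addr(pointer: int, n_slots: int, slot_size: int, base_pa: int) -> int:
    """base_pa + (pointer % n_slots) * slot_size, using only AND, shift, and add."""
    if not (_is_pow2(n_slots) and _is_pow2(slot_size)):
        raise ValueError("n_slots and slot_size must be powers of two")
    index = pointer & (n_slots - 1)      # replaces the modulo
    shift = slot_size.bit_length() - 1   # log2(slot_size), replaces the multiply
    return base_pa + (index << shift)
```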
 ##### Backpressure Comparator (combinational)
 ```
 full = (my_head - peer_tail_cache) >= n_slots
 Implementation: 64-bit subtract + unsigned compare
 Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold
 Latency: 1 cycle
 ```
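
In software terms the comparator is a modular subtract followed by an unsigned compare. The sketch below models the 64-bit registers with an explicit mask; handling counter wrap-around this way is an assumption added for illustration, since the register file only states that the pointers are monotonic.

```python
MASK64 = (1 << 64) - 1


def is_full(my_head: int, peer_tail_cache: int, n_slots: int) -> bool:
    """True when every slot is occupied, i.e. the next send must stall."""
    in_flight = (my_head - peer_tail_cache) & MASK64  # 64-bit unsigned subtract
    return in_flight >= n_slots
```

Because the subtraction is modular, the result stays correct even if `my_head` has wrapped past zero while `peer_tail_cache` has not.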
 ##### Meta Extractor (inbound datapath sideband)
 Wired into the DMA Engine's inbound vc_comm path. It extracts metadata from the
 header of each arriving IPCQ flit and updates the queue pair state.
 ```
 Trigger: DMA inbound write completion (same cycle)
 Extract: {sender_seq, dst_addr} from flit header
 Direction matching (ADR-0025 D2):
  for each dir:
     match = (rx_base_pa[dir] <= dst_addr) && (dst_addr < rx_base_pa[dir] + n_slots[dir] * slot_size[dir])
  8× parallel range comparators + priority encoder
 Update: peer_head_cache[matched_dir] = max(peer_head_cache[matched_dir], sender_seq + 1)
 Output: recv_wake signal for matched direction → PE_CPU interrupt/flag
 Implementation: 8× (2 comparators + AND) + priority encoder
 Latency: 1 cycle (pipelined with DMA write, so I6 atomicity holds by construction)
 ```
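
A software analogue of the range comparators and priority encoder might look like the following. It is a sketch: queue pair state is a plain dict keyed by direction, dict insertion order plays the role of the priority encoder, and the range is taken over this PE's own rx buffer (`rx_base_pa`), which is an interpretation of the pseudocode above.

```python
from typing import Dict, Optional


def match_direction(dst_addr: int, qpairs: Dict[str, dict]) -> Optional[str]:
    """First direction whose rx buffer range contains dst_addr, else None."""
    for name, qp in qpairs.items():
        lo = qp["rx_base_pa"]
        hi = lo + qp["n_slots"] * qp["slot_size"]
        if lo <= dst_addr < hi:
            return name
    return None


def on_meta_arrival(qpairs: Dict[str, dict], sender_seq: int, dst_addr: int) -> Optional[str]:
    """Update peer_head_cache for the matched direction and return it (the recv_wake target)."""
    direction = match_direction(dst_addr, qpairs)
    if direction is None:
        return None
    qp = qpairs[direction]
    qp["peer_head_cache"] = max(qp["peer_head_cache"], sender_seq + 1)
    return direction
```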
 ##### Credit Injector (outbound)
 ```
 Trigger: recv completion (after my_tail is incremented)
 Action:  pack 16B credit packet → DMA vc_comm (or a dedicated credit VC)
 Packet: {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa}
 Latency: 1 cycle to generate, then NoC traversal
 ```
 ##### Credit Receiver (inbound sideband)
 ```
 Trigger: 16B credit packet arrival (from NoC)
 Extract: {consumer_seq, dst_rx_base_pa}
 Direction matching (ADR-0025 D3):
  for each dir:
    match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa)
 Update: peer_tail_cache[matched_dir] = max(peer_tail_cache[matched_dir], consumer_seq)
 Output: send_wake signal → deassert backpressure stall
 Latency: 1 cycle
 ```
 #### 4.2.2 DMA Engine Changes
 ##### vc_comm IPCQ-aware mode
 An IPCQ flit handling mode is added to the existing vc_comm channel.
 **Outbound**:
 1. Receive a command from the IPCQ Controller: {src_addr, dst_addr, nbytes, sender_seq}
 2. Read src_addr from TCM and snapshot it in the DMA read buffer (existing DMA behavior)
 3. Flit pack: data + piggyback metadata (sender_seq, dst_addr)
 4. Inject into the NoC fabric port
 5. Fire-and-forget (do not wait for completion)
 **Inbound**:
 1. Receive the IPCQ flit from the NoC
 2. Terminal BW drain charge (drain_ns = nbytes / bottleneck_bw)
 3. Slot write latency charge (backing memory tier)
 4. **ATOMIC** (same pipeline stage, no stall insertion):
   - TCM write: data → slot address
   - Meta Extractor trigger: sender_seq + dst_addr → IPCQ Controller
 5. Done
 **Hardware guarantee of I6 atomicity**: TCM write completion and the Meta Extractor
 trigger occur in the same pipeline stage, so no separate synchronization is needed.
 The simulator's "no yield between write and IpcqMetaArrival" rule therefore holds by construction.
 ##### Data Snapshot Semantics
 Data latched in the DMA read buffer is unaffected by later modifications to the source memory.
 This is standard DMA read-then-write behavior, so no additional HW is required.
 ##### Credit Virtual Channel (optional)
 Option A: multiplex credits onto vc_comm (distinguished as 16B header-only flits)
 Option B: add a third, dedicated credit VC (strict priority > data)
 Option B is better for deadlock prevention, but a 16B credit has negligible BW impact,
 so Option A is also sufficient.
 #### 4.2.3 Fabric Flit Format Extension
 ```
 Normal data flit (e.g., 512-bit):
 ┌──────────────────────────────────────────┐
 │ [511:480] routing header (32b)           │
 │ [479:0]   payload (480b = 60B)           │
 └──────────────────────────────────────────┘
 IPCQ data flit (metadata in the first flit only):
 ┌──────────────────────────────────────────┐
 │ [511:480] routing header (32b)           │
 │   [511]    ipcq_flag (1b)                │  ← distinguishes IPCQ from normal DMA
 │   [510:509] vc_id (2b)                   │
 │   [508:480] route + hop count            │
 │ [479:416] ipcq_metadata (64b)            │  ← piggyback
 │   [479:448] sender_seq (32b)             │
 │   [447:416] dst_addr[31:0] (32b)         │  ← for direction matching
 │ [415:0]   payload (416b = 52B)           │
 └──────────────────────────────────────────┘
 Subsequent flits: full 60B payload (no metadata)
 Credit-only flit (128-bit, header-only):
 ┌──────────────────────────────────────────┐
 │ [127:96]  routing header (32b)           │
 │   [127]   credit_flag (1b)               │
 │ [95:64]   consumer_seq (32b)             │
 │ [63:0]    dst_rx_base_pa (64b)           │
 └──────────────────────────────────────────┘
 ```
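 The first flit's payload shrinks from 60B to 52B (13% overhead on that flit).
 In multi-flit transfers the subsequent flits carry the full payload, so the 8B of metadata amounts to less than 1% overhead once a transfer exceeds about 800B.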
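The first-flit layout can be checked with a small bit-packing sketch. It treats a flit as a 512-bit Python integer with bit 511 as the MSB and follows the field positions drawn in the diagram; the function names, the big-endian payload placement, and the zero padding are assumptions made for illustration.

```python
META_BITS = 64
FIRST_PAYLOAD_BYTES = 52   # 416 bits left after header + metadata
FULL_PAYLOAD_BYTES = 60    # 480 bits in every later flit


def pack_first_flit(route: int, vc_id: int, sender_seq: int,
                    dst_addr: int, payload: bytes) -> int:
    """Pack the first IPCQ flit as a 512-bit integer."""
    if len(payload) > FIRST_PAYLOAD_BYTES:
        raise ValueError("first flit carries at most 52 payload bytes")
    # Header [511:480]: ipcq_flag at bit 31, vc_id at [30:29], route at [28:0].
    header = (1 << 31) | ((vc_id & 0x3) << 29) | (route & ((1 << 29) - 1))
    # Metadata [479:416]: sender_seq in the high word, dst_addr[31:0] in the low.
    meta = ((sender_seq & 0xFFFFFFFF) << 32) | (dst_addr & 0xFFFFFFFF)
    body = int.from_bytes(payload.ljust(FIRST_PAYLOAD_BYTES, b"\0"), "big")
    return (header << 480) | (meta << 416) | body


def unpack_metadata(flit: int) -> tuple[int, int]:
    """Return (sender_seq, dst_addr[31:0]) from bits [479:416]."""
    meta = (flit >> 416) & ((1 << META_BITS) - 1)
    return meta >> 32, meta & 0xFFFFFFFF


def flits_needed(nbytes: int, ipcq: bool) -> int:
    """Flit count for one transfer, with or without the 8B metadata."""
    first = FIRST_PAYLOAD_BYTES if ipcq else FULL_PAYLOAD_BYTES
    rest = max(0, nbytes - first)
    return 1 + -(-rest // FULL_PAYLOAD_BYTES)   # ceiling division
```

Under these assumptions a 60B transfer needs two flits in IPCQ mode instead of one, while a 4096B transfer needs 69 flits either way, which is the small-versus-large overhead difference described above.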
 #### 4.2.4 TCM IPCQ Slot Region
 ```
 TCM Memory Map (16MB):
 ┌─────────────────────────────┐ 0x000000
 │  Kernel Working Memory      │
 │  (compute tensors)          │
 │  ~14MB                      │
 ├─────────────────────────────┤ 0xE00000
 │  IPCQ RX Buffers            │
 │  Dir N: slots × slot_size   │
 │  Dir S: slots × slot_size   │
 │  Dir E: slots × slot_size   │
 │  Dir W: slots × slot_size   │
 │  ~1MB                       │
 ├─────────────────────────────┤ 0xF00000
 │  IPCQ Metadata / Scratch    │
 │  ~1MB                       │
 └─────────────────────────────┘ 0xFFFFFF
 ```
 The IPCQ region is placed in the upper banks of the TCM to minimize
 bank conflicts with compute accesses (see Sections 6.4 and 7.1).
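A quick way to sanity-check a ring configuration against this memory map is to lay the rx rings out back to back and verify they fit in the 1MB window. The sketch below assumes consecutive packing in the listed direction order, which the map does not actually specify; `rx_layout` is a hypothetical helper name.

```python
IPCQ_RX_BASE = 0xE00000    # start of the IPCQ RX Buffers region
IPCQ_RX_LIMIT = 0xF00000   # start of the Metadata / Scratch region


def rx_layout(directions: list[str], n_slots: int, slot_size: int) -> dict[str, int]:
    """Assign consecutive rx ring bases; raise if they overflow the region."""
    layout = {}
    cursor = IPCQ_RX_BASE
    for name in directions:
        layout[name] = cursor
        cursor += n_slots * slot_size
    if cursor > IPCQ_RX_LIMIT:
        needed = cursor - IPCQ_RX_BASE
        raise ValueError(f"rx rings need {needed} bytes, region holds "
                         f"{IPCQ_RX_LIMIT - IPCQ_RX_BASE}")
    return layout
```

Four directions of 8 slots × 16KB use 512KB and fit, whereas 8 directions × 8 slots × 64KB need 4MB and are rejected.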
 ---
 ## 5. End-to-End Dataflow
 ### 5.1 Sequence Diagram
 ```mermaid
 sequenceDiagram
    participant CPU_A as PE_A: PE_CPU
    participant IPCQ_A as PE_A: IPCQ Ctrl
    participant DMA_A as PE_A: DMA
    participant NOC as NoC Fabric
    participant DMA_B as PE_B: DMA
    participant IPCQ_B as PE_B: IPCQ Ctrl
    participant TCM_B as PE_B: TCM
    participant CPU_B as PE_B: PE_CPU
    Note over CPU_A: tl.send(dir="E", src=0x1000)
    CPU_A->>IPCQ_A: MMIO: send request
    Note over IPCQ_A: Backpressure check:<br/>(head - peer_tail_cache) < n_slots → PASS<br/>Slot addr gen:<br/>dst = peer_rx_base + (head%n) × slot_size
    IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head}
    Note over IPCQ_A: my_head++
    IPCQ_A-->>CPU_A: send returns (fire-and-forget)
    Note over DMA_A: TCM read → snapshot in read buffer<br/>Flit pack: data + {sender_seq, dst_addr}
    DMA_A->>NOC: IPCQ data flit(s)
    Note over NOC: hop latency + BW drain
    NOC->>DMA_B: IPCQ data flit(s)
    Note over DMA_B: Terminal BW drain<br/>Slot write latency
    rect rgb(255, 240, 220)
        Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall
        DMA_B->>TCM_B: write data → slot address
        DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr}
    end
    Note over IPCQ_B: Range match dst_addr → direction "W"<br/>peer_head_cache["W"] = sender_seq + 1
    IPCQ_B-->>CPU_B: recv_wake signal
    Note over CPU_B: tl.recv(dir="W") wakes up
    CPU_B->>IPCQ_B: recv request
    Note over IPCQ_B: peer_head_cache > my_tail → YES<br/>slot_addr = rx_base + (tail%n) × slot_size
    IPCQ_B-->>CPU_B: return slot_addr
    CPU_B->>TCM_B: read data from slot
    Note over IPCQ_B: my_tail++
    IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa}
    Note over NOC: credit traversal (NoC latency)
    NOC->>IPCQ_A: Credit arrival
    Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled)
 ```
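The sequence above can be exercised with a tiny single-direction behavioral model. It is a sketch under simplifying assumptions: one direction only, zero-latency "wires" modeled as deques, and slot contents held in a Python list. The `Endpoint` class and its method names are hypothetical and do not correspond to simulator classes.

```python
from collections import deque


class Endpoint:
    """One side of a queue pair (single direction, no timing)."""

    def __init__(self, n_slots: int):
        self.n_slots = n_slots
        self.my_head = 0           # next sequence number to send
        self.my_tail = 0           # next sequence number to consume
        self.peer_head_cache = 0   # updated by the Meta Extractor
        self.peer_tail_cache = 0   # updated by the Credit Receiver
        self.slots = [None] * n_slots

    def try_send(self, data_wire: deque, data) -> bool:
        if self.my_head - self.peer_tail_cache >= self.n_slots:
            return False                       # backpressure: ring is full
        data_wire.append((self.my_head, data)) # piggyback sender_seq
        self.my_head += 1                      # fire-and-forget
        return True

    def deliver(self, sender_seq: int, data) -> None:
        # Slot write and metadata update happen together (I6 atomicity).
        self.slots[sender_seq % self.n_slots] = data
        self.peer_head_cache = max(self.peer_head_cache, sender_seq + 1)

    def try_recv(self, credit_wire: deque):
        if self.peer_head_cache <= self.my_tail:
            return None                        # nothing has arrived yet
        data = self.slots[self.my_tail % self.n_slots]
        self.my_tail += 1
        credit_wire.append(self.my_tail)       # credit carries consumer_seq
        return data

    def on_credit(self, consumer_seq: int) -> None:
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)
```

With `n_slots = 2`, a third `try_send` fails until the receiver has consumed one slot and its credit has been applied through `on_credit`, which is the backpressure deassert shown at the end of the diagram.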
 ---
 ## 6. 2nm Implementation Analysis
 ### 6.1 Area Estimate
 | Module | Gate Count | Area (2nm est.) | Notes |
 |--------|-----------|-----------------|-------|
 | QPair Register File | ~4.6K FF | 0.002 mm² | 576B flip-flops |
 | Slot Addr Gen + Backpressure | ~5K gates | 0.001 mm² | Combinational |
 | Meta Extractor + Credit Logic | ~3K gates | 0.001 mm² | 8× parallel comparators |
 | **Total IPCQ Controller** | **~12.6K** | **~0.004 mm²** | **< 0.1% of the whole PE** |
 | DMA vc_comm extension | ~2K gates | 0.002 mm² | Flit pack/unpack |
 | **Total change** | **~14.6K** | **~0.006 mm²** | |
 ### 6.2 Timing
 | Path | Delay (2nm est.) | Target Clock | Margin |
 |------|-------------------|-------------|--------|
 | Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× |
 | Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× |
 | Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× |
 | Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× |
 All critical paths fit within one cycle at 1 GHz, so timing closure is not expected to be a problem.
 ### 6.3 Power
 - Active: ~1 mW (register read/write + comparators, during send/recv activity)
 - Idle: leakage only
 - Negligible relative to total PE power
 ### 6.4 Constraints
 | Item | Constraint | Rationale |
 |------|------|------|
 | `n_slots` | **must be a power of 2** | mod → AND mask (1 gate). Arbitrary values need a divider (~10 cycles) |
 | `slot_size` | **power of 2 recommended** | mul → barrel shift. Arbitrary values need a multiplier |
 | TCM IPCQ region | **placed in dedicated banks** | Avoids bank conflicts with compute accesses |
 ---
 ## 7. Risk Assessment
 ### 7.1 TCM Bank Conflict
 - **Risk**: Stalls when an IPCQ slot write and a compute read hit the same bank
 - **Mitigation**: Place the IPCQ region in dedicated banks at the upper TCM addresses
 - **Cost**: Slightly reduced TCM banking flexibility
 - **Severity**: Medium (performance impact), Low (not a correctness issue)
 ### 7.2 Credit Return Latency under Congestion
 - **Risk**: Under NoC congestion, delayed credit returns cause sender backpressure stalls
 - **Mitigation**:
  - Move credits onto a separate VC with strict priority (at 16B the BW impact is minimal)
  - Or set n_slots generously (8+) so the buffer absorbs the credit delay
 - **Severity**: Low (16B credits contribute almost nothing to congestion)
 ### 7.3 Inter-Direction Ordering
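 - **Risk**: Ordering when one PE sends in several directions at the same time
 - **Mitigation**: A per-direction monotonic seq is sufficient. Inter-direction ordering is
  the kernel's (software's) responsibility, the same as in the current simulator model
 - **Severity**: Low (resolved by the architectural design)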
 ---
 ## 8. Alternatives Considered
 ### 8.1 Doorbell + Polling (traditional approach)
 ```
 Send: DMA write data → DMA write doorbell register at peer → peer polls doorbell
 Recv: Polling loop on doorbell, or interrupt-driven
 ```
 | Pros | Cons |
 |------|------|
 | Simple HW (no IPCQ controller needed) | Two DMA transactions (data + doorbell) |
 | Reuses the existing DMA | Requires an ordering guarantee between data and doorbell (fence) |
 | | Polling wastes power; interrupts add latency overhead |
 **Assessment**: 2-3× higher latency than piggybacking. **Rejected.**
 ### 8.2 Hardware Message Queue (NVIDIA NVLink style)
 ```
 Send: CPU pushes a descriptor to the HMQ → HW forwards it to the peer HMQ automatically
 Recv: pop a descriptor from the HMQ → check the data pointer
 ```
 | Pros | Cons |
 |------|------|
 | The CPU only writes descriptors | Needs a separate HMQ engine (~0.05 mm²) |
 | Descriptor/data separation → flexible | Datapath separate from DMA → duplicated area/power |
 | | Large tensors still need DMA |
 **Assessment**: DMA is mandatory for CCL's large-tensor pattern, so a dual HMQ + DMA
 structure wastes area. **Rejected.**
 ### 8.3 RDMA-style Completion Queue (CQ)
 ```
 Send: DMA write → a CQE is generated at the peer automatically
 Recv: CQ poll/interrupt → locate the data
 ```
 | Pros | Cons |
 |------|------|
 | Mature InfiniBand/RoCE model | CQ management logic + CQE memory overhead |
 | Easy multi-tenancy/isolation | Needs an additional CQE/data ordering guarantee |
 | | Over-engineered for PE-to-PE CCL |
 **Assessment**: An RDMA CQ suits multi-tenant isolation on a host-facing NIC.
 In a single-owner, PE-to-PE environment it adds unnecessary complexity. **Rejected.**
 ### 8.4 Credit-in-Data Piggyback (v2 optimization candidate)
 In the current design, a credit return is a separate 16B packet.
 In bidirectional communication patterns, **the credit can be merged into a data flit traveling in the reverse direction.**
 ```
 PE_A →E→ PE_B: data + sender_seq=3
 PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4  ← credit merged into the data flit
 ```
 | Pros | Cons |
 |------|------|
 | Removes dedicated credit packets → saves NoC BW | Unidirectional patterns need a fallback |
 | Credit latency → 0 in bidirectional allreduce | Adds 8B to the flit header (minimal overhead) |
 | | Slightly more logic complexity |
 **Assessment**: A strong optimization of the current design.
 In bidirectional allreduce the credit packets can be eliminated entirely.
 The standalone credit fallback is retained. **Recommended for adoption in v2.**
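The piggyback-with-fallback decision can be sketched as a small selection function. This is only an illustration of the idea described in this section: the `Flit` fields and the `next_outbound` helper are hypothetical, and a real implementation would also need a policy for how long a pending credit may wait for reverse-direction data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flit:
    sender_seq: Optional[int]   # None for a standalone credit
    credit_ack: Optional[int]   # None when no credit is pending
    payload: bytes = b""


def next_outbound(pending_data: Optional[bytes], next_seq: int,
                  pending_credit: Optional[int]) -> Optional[Flit]:
    """Prefer piggybacking the credit on data; fall back to a standalone credit."""
    if pending_data is not None:
        # Bidirectional case: the credit rides along in the data flit header.
        return Flit(sender_seq=next_seq, credit_ack=pending_credit,
                    payload=pending_data)
    if pending_credit is not None:
        # Unidirectional case: no reverse data, so send a credit-only flit.
        return Flit(sender_seq=None, credit_ack=pending_credit)
    return None
```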
 ---
 ## 9. Recommendations
 1. **Adopt the current IPCQ-DMA co-design as the baseline hardware design**
   (simple, area-efficient, and free of timing/power issues at 2nm)
 2. **Constrain n_slots to a power of 2**
   (replaces the mod operation with an AND mask and shortens the critical path)
 3. **Allocate dedicated banks for the IPCQ region in the TCM banking scheme**
   (avoids bank conflicts with compute)
 4. **Evaluate adding Credit-in-Data Piggyback (Section 8.4) in v2**
   (removes credit overhead in bidirectional patterns)
 ---
 ## 10. Open Questions
 - [ ] What fraction of the TCM may the IPCQ slot region occupy? (current assumption: ~1MB / 16MB = 6.25%)
 - [ ] Should credits get a dedicated VC, or be multiplexed onto vc_comm?
 - [ ] Flit format compatibility on inter-SIP links still needs to be verified
 - [ ] Should n_slots have an upper bound? (8 directions × 8 slots × 64KB = 4MB → 25% of the TCM)
@@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta"
 name = "kernbench"
 version = "0.1.0"
 requires-python = ">=3.10"
-dependencies = ["pytest", "simpy", "pyyaml", "fastapi>=0.110", "uvicorn[standard]>=0.29", "websockets>=12", "numpy>=1.24", "greenlet>=3.0"]
+dependencies = ["pytest", "simpy", "pyyaml", "fastapi>=0.110", "uvicorn[standard]>=0.29", "websockets>=12", "numpy>=1.24", "greenlet>=3.0", "matplotlib>=3.7"]
 [project.scripts]
 kernbench = "kernbench.cli.main:main"
@@ -0,0 +1,171 @@
 """Generate a 5-slide PPTX summarizing the kernbench2 model.
 Slides (in order):
  1. Overall architecture — how PEs are connected (cube_mesh_view)
  2. Model correctness — DMA vs P2P latency (pe2pe overview)
  3. PE-to-PE IPCQ communication (ipcq_two_pe_dma)
  4. 6-device allreduce — model vs theoretical vs ext-sim (overview_broken)
  5. IPCQ buffer-kind sweep — TCM vs SRAM vs HBM (buffer_kind_sweep)
 This is a derived-artifact generator — no production code touched.
 """
 from __future__ import annotations
 from pathlib import Path
 from PIL import Image
 from pptx import Presentation
 from pptx.dml.color import RGBColor
 from pptx.enum.shapes import MSO_SHAPE
 from pptx.util import Emu, Inches, Pt
 ROOT = Path(__file__).resolve().parent.parent
 DIAG = ROOT / "docs" / "diagrams"
 OUT = DIAG / "kernbench2_overview.pptx"
 # 16:9 widescreen — 13.333 × 7.5 in
 SLIDE_W_IN = 13.333
 SLIDE_H_IN = 7.5
 SLIDES = [
    {
        "title": "1. CUBE Architecture: NOC Router Mesh + PE Connectivity",
        "image": DIAG / "cube_mesh_view.png",
        "bullets": [
            "Each CUBE holds an 8-PE NOC mesh wired through routers (R0..R7)",
            "Every PE has IO_CPU, M_CPU, PE_CPU + IPCQ engine + DMA engine",
            "Inter-cube traffic exits via UCIe/UAL ports; SIPs stitch into ring/torus/mesh",
            "Foundation for every latency, IPCQ, and allreduce experiment that follows",
        ],
    },
    {
        "title": "2. Model Correctness: DMA vs P2P Latency Sweep",
        "image": DIAG / "pe2pe_latency_plots" / "overview.png",
        "bullets": [
            "Sweeps payload size across PE-to-PE paths and compares to DMA",
            "Confirms the simulator reproduces the expected DMA/P2P crossover",
            "Acts as the per-hop ground truth that feeds collective-level models",
        ],
    },
    {
        "title": "3. IPCQ: How Two PEs Communicate (DMA + Slot Memory)",
        "image": DIAG / "ipcq_diagram_plots" / "ipcq_two_pe_dma.png",
        "bullets": [
            "Sender pushes payload through PE_DMA → fabric → receiver IPCQ slot",
            "Slot memory (TCM/SRAM/HBM) charges a write on arrival, a read on consume",
            "Credit return rides the fabric path back (16 B packet, no slot-IO)",
            "This is the building block the multi-device allreduce composes",
        ],
    },
    {
        "title": "4. 6-Device Allreduce: Model vs Theoretical vs External Simulator",
        "image": DIAG / "allreduce_latency_plots" / "overview_broken.png",
        "bullets": [
            "Three SIP topologies (ring / torus / mesh) swept 16 B → 96 KB per PE",
            "Dashed red curve: hand-derived theoretical model for torus_2d (6 SIPs)",
            "Top panel (broken y-axis): single-device reduce on ext-sim ≈ 366 µs",
            "Our 6-device collective lands at ~17–22 µs — ~17× faster than ext-sim baseline",
        ],
    },
    {
        "title": "5. IPCQ Slot Memory: TCM vs SRAM vs HBM",
        "image": DIAG / "allreduce_latency_plots" / "buffer_kind_sweep.png",
        "bullets": [
            "Same allreduce with slot memory swapped: TCM (per-PE local) / SRAM / HBM (cube-shared, behind router link)",
            "Cost = NoC drain + slot-IO + PE↔bank hop; only TCM skips the bank hop",
            "Topology link BWs set the order: SRAM bank link 128 GB/s is the narrowest in the system, HBM 256 GB/s",
            "At 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs — SRAM is slowest because of its narrow bank link",
        ],
    },
 ]
 def _add_title(slide, text):
    left = Inches(0.4)
    top = Inches(0.25)
    width = Inches(SLIDE_W_IN - 0.8)
    height = Inches(0.7)
    box = slide.shapes.add_textbox(left, top, width, height)
    tf = box.text_frame
    tf.margin_left = tf.margin_right = Emu(0)
    tf.margin_top = tf.margin_bottom = Emu(0)
    p = tf.paragraphs[0]
    run = p.add_run()
    run.text = text
    run.font.size = Pt(26)
    run.font.bold = True
    run.font.color.rgb = RGBColor(0x10, 0x2A, 0x55)
    return box
 def _add_image_centered(slide, img_path, *, left_in, top_in, max_w_in, max_h_in):
    with Image.open(img_path) as im:
        iw, ih = im.size
    max_w_emu = Inches(max_w_in)
    max_h_emu = Inches(max_h_in)
    scale = min(max_w_emu / iw, max_h_emu / ih)
    w = int(iw * scale)
    h = int(ih * scale)
    left = Inches(left_in) + (max_w_emu - w) // 2
    top = Inches(top_in) + (max_h_emu - h) // 2
    slide.shapes.add_picture(str(img_path), left, top, width=w, height=h)
 def _add_bullets(slide, bullets, *, left_in, top_in, width_in, height_in):
    box = slide.shapes.add_textbox(
        Inches(left_in), Inches(top_in), Inches(width_in), Inches(height_in),
    )
    tf = box.text_frame
    tf.word_wrap = True
    for i, line in enumerate(bullets):
        p = tf.paragraphs[0] if i == 0 else tf.add_paragraph()
        p.level = 0
        run = p.add_run()
        run.text = "• " + line
        run.font.size = Pt(15)
        run.font.color.rgb = RGBColor(0x22, 0x22, 0x22)
        p.space_after = Pt(6)
 def _add_footer(slide, idx, total):
    box = slide.shapes.add_textbox(
        Inches(SLIDE_W_IN - 1.2), Inches(SLIDE_H_IN - 0.45),
        Inches(1.0), Inches(0.3),
    )
    p = box.text_frame.paragraphs[0]
    run = p.add_run()
    run.text = f"{idx} / {total}"
    run.font.size = Pt(10)
    run.font.color.rgb = RGBColor(0x88, 0x88, 0x88)
 def build():
    prs = Presentation()
    prs.slide_width = Inches(SLIDE_W_IN)
    prs.slide_height = Inches(SLIDE_H_IN)
    blank = prs.slide_layouts[6]
    for i, cfg in enumerate(SLIDES, start=1):
        slide = prs.slides.add_slide(blank)
        _add_title(slide, cfg["title"])
         # Layout: image on the left (up to 8.3 in wide), bullets on the right (4.3 in).
        _add_image_centered(
            slide, cfg["image"],
            left_in=0.3, top_in=1.05,
            max_w_in=8.3, max_h_in=5.9,
        )
        _add_bullets(
            slide, cfg["bullets"],
            left_in=8.8, top_in=1.2,
            width_in=4.3, height_in=5.7,
        )
        _add_footer(slide, i, len(SLIDES))
    OUT.parent.mkdir(parents=True, exist_ok=True)
    prs.save(OUT)
    print(f"wrote {OUT}")
 if __name__ == "__main__":
    build()
@@ -0,0 +1,192 @@
 """One-shot: render overview.png with an external 366 µs reference, in two
 variants — log scale and broken y-axis. Reads docs/diagrams/allreduce_latency_plots/summary.csv
 and writes overview_log.png and overview_broken.png alongside it.
 This is a derived-artifact generator (per CLAUDE.md): plotting only, no production
 or test logic touched.
 """
 from __future__ import annotations
 import csv
 from pathlib import Path
 import matplotlib.pyplot as plt
 import matplotlib.ticker as mticker
 ROOT = Path(__file__).resolve().parent.parent
 PLOT_DIR = ROOT / "docs" / "diagrams" / "allreduce_latency_plots"
 CSV_PATH = PLOT_DIR / "summary.csv"
 EXT_LABEL = "ext-sim single-device reduce: 366 µs"
 EXT_LATENCY_NS = 366_000.0
 COLORS = {
    "ring_1d": "tab:blue",
    "torus_2d": "tab:orange",
    "mesh_2d_no_wrap": "tab:green",
 }
 # Hand-derived theoretical model for torus_2d (6 SIPs). Mirrors
 # _aggregate_sweep_plots in tests/test_allreduce_multidevice.py.
 NOC_PACKET_BYTES = 128
 PES_PER_CUBE = 8
 T_STARTUP_NS = 1346.0
 TAU_NS = (8741.0 - 1346.0) / (6144 - 1)
 def _theoretical_torus_2d_ns(bytes_per_pe: int) -> float:
    bytes_per_cube = int(bytes_per_pe) * PES_PER_CUBE
    n_packets = max(1, -(-bytes_per_cube // NOC_PACKET_BYTES))
    return T_STARTUP_NS + (n_packets - 1) * TAU_NS
 def _plot_theoretical(ax, records):
    torus_rs = sorted(
        [r for r in records if r["sip_topology"] == "torus_2d"],
        key=lambda r: r["bytes_per_pe"],
    )
    if not torus_rs:
        return
    ax.plot(
        [r["bytes_per_pe"] for r in torus_rs],
        [_theoretical_torus_2d_ns(r["bytes_per_pe"]) for r in torus_rs],
        color="tab:red", linestyle="--", linewidth=1.6, marker="x",
        label="theoretical torus_2d (6 SIPs)",
    )
 def _bytes_fmt(x, _pos):
    if x >= 1024 * 1024:
        return f"{x / (1024 * 1024):.0f}M"
    if x >= 1024:
        return f"{x / 1024:.0f}K"
    return f"{int(x)}"
 def _load_records():
    rows = []
    with open(CSV_PATH, newline="") as f:
        r = csv.DictReader(f)
        for row in r:
            rows.append({
                "sip_topology": row["sip_topology"],
                "bytes_per_pe": int(row["bytes_per_pe"]),
                "latency_ns": float(row["latency_ns"]),
            })
    return rows
 def _ext_x(records):
    """Anchor the external reference at the largest payload (96 KB / PE)."""
    return max(r["bytes_per_pe"] for r in records)
 def _plot_curves(ax, records, topologies):
    for topo in topologies:
        rs = sorted([r for r in records if r["sip_topology"] == topo],
                    key=lambda r: r["bytes_per_pe"])
        if not rs:
            continue
        ax.plot(
            [r["bytes_per_pe"] for r in rs],
            [r["latency_ns"] for r in rs],
            marker="o",
            label=f"{topo}",
            color=COLORS.get(topo),
        )
 def emit_log(records):
    topologies = sorted({r["sip_topology"] for r in records})
    fig, ax = plt.subplots(figsize=(9, 6))
    _plot_curves(ax, records, topologies)
    _plot_theoretical(ax, records)
    ax.scatter(
        [_ext_x(records)], [EXT_LATENCY_NS],
        marker="*", s=220, color="tab:red", zorder=5,
        label=EXT_LABEL,
    )
    ax.set_xscale("log", base=2)
    ax.set_yscale("log")
    ax.set_xlabel("Bytes per PE (log scale)")
    ax.set_ylabel("Time (ns) — log scale")
    ax.set_title("Multi-device allreduce latency vs external single-device reference")
    ax.grid(True, which="both", alpha=0.3)
    ax.xaxis.set_major_formatter(mticker.FuncFormatter(_bytes_fmt))
    ax.legend(loc="upper left")
    fig.tight_layout()
    out = PLOT_DIR / "overview_log.png"
    fig.savefig(out, dpi=120)
    plt.close(fig)
    print(f"wrote {out}")
 def emit_broken(records):
    topologies = sorted({r["sip_topology"] for r in records})
    max_local = max(r["latency_ns"] for r in records)
    fig, (ax_top, ax_bot) = plt.subplots(
        2, 1, sharex=True,
        gridspec_kw={"height_ratios": [1, 4], "hspace": 0.05},
        figsize=(9, 6.5),
    )
    # Bottom panel: today's three curves + theoretical, linear y.
    _plot_curves(ax_bot, records, topologies)
    _plot_theoretical(ax_bot, records)
    ax_bot.set_ylim(0, max_local * 1.10)
    # Top panel: only the external reference marker, linear y around 366 µs.
    ax_top.scatter(
        [_ext_x(records)], [EXT_LATENCY_NS],
        marker="*", s=240, color="tab:red", zorder=5,
        label=EXT_LABEL,
    )
    ax_top.set_ylim(EXT_LATENCY_NS * 0.93, EXT_LATENCY_NS * 1.05)
    # Hide the spine between the two panels and draw diagonal "break" ticks.
    ax_top.spines["bottom"].set_visible(False)
    ax_bot.spines["top"].set_visible(False)
    ax_top.tick_params(labeltop=False, bottom=False)
    ax_bot.xaxis.tick_bottom()
     d = 0.012  # diagonal-tick half-width, in axis-fraction
     # ax_top is 1/4 the height of ax_bot (height_ratios [1, 4]), so its ticks
     # need 4x the vertical axis-fraction to match the bottom ticks on screen.
     kw = dict(transform=ax_top.transAxes, color="k", clip_on=False, lw=1)
     ax_top.plot((-d, +d), (-d * 4, +d * 4), **kw)
     ax_top.plot((1 - d, 1 + d), (-d * 4, +d * 4), **kw)
     kw.update(transform=ax_bot.transAxes)
     ax_bot.plot((-d, +d), (1 - d, 1 + d), **kw)
     ax_bot.plot((1 - d, 1 + d), (1 - d, 1 + d), **kw)
    ax_bot.set_xscale("log", base=2)
    ax_bot.set_xlabel("Bytes per PE (log scale)")
    ax_bot.set_ylabel("Time (ns)")
    ax_top.set_ylabel("Time (ns)")
    ax_bot.grid(True, alpha=0.3)
    ax_top.grid(True, alpha=0.3)
    ax_bot.xaxis.set_major_formatter(mticker.FuncFormatter(_bytes_fmt))
    # One legend covering both axes.
    handles_bot, labels_bot = ax_bot.get_legend_handles_labels()
    handles_top, labels_top = ax_top.get_legend_handles_labels()
    ax_bot.legend(handles_bot + handles_top, labels_bot + labels_top,
                  loc="upper left")
    fig.suptitle("Multi-device allreduce latency vs external single-device reference (broken y-axis)")
    fig.tight_layout()
    out = PLOT_DIR / "overview_broken.png"
    fig.savefig(out, dpi=120)
    plt.close(fig)
    print(f"wrote {out}")
 def main():
    records = _load_records()
    if not records:
        raise SystemExit(f"no rows in {CSV_PATH}")
    emit_log(records)
    emit_broken(records)
 if __name__ == "__main__":
    main()
@@ -0,0 +1,141 @@
 """Re-render pe2pe latency PNGs from the existing summary.csv with the
 current (no-consume) labels. Used after a label-only test edit to avoid
 re-measuring (~5 min) when the data on disk is already correct.
 Reads docs/diagrams/pe2pe_latency_plots/summary.csv. Plots 2 curves:
 "IPCQ no-consume" (from the ipcq_no_consume rows if present, else from
 the ipcq rows) and "Raw DMA" (raw rows).
 """
 from __future__ import annotations
 import csv
 from pathlib import Path
 import matplotlib.pyplot as plt
 ROOT = Path(__file__).resolve().parent.parent
 PLOT_DIR = ROOT / "docs" / "diagrams" / "pe2pe_latency_plots"
 CSV_PATH = PLOT_DIR / "summary.csv"
 def _load_records():
    rows = []
    with open(CSV_PATH, newline="") as f:
        for r in csv.DictReader(f):
            rows.append({
                "hop": r["hop"],
                "label": r["label"],
                "size_bytes": int(r["size_bytes"]),
                "path": r["path"],
                "total_ns": float(r["total_ns"]),
            })
    return rows
 def _ipcq_rows(records, hop):
    # Prefer ipcq_no_consume if present (older 3-path CSV); fall back to ipcq
    # (current single-path CSV where ipcq IS no-consume).
    nc = [r for r in records
          if r["hop"] == hop and r["path"] == "ipcq_no_consume"]
    if nc:
        return sorted(nc, key=lambda r: r["size_bytes"])
    return sorted(
        [r for r in records if r["hop"] == hop and r["path"] == "ipcq"],
        key=lambda r: r["size_bytes"],
    )
 def _raw_rows(records, hop):
    return sorted(
        [r for r in records if r["hop"] == hop and r["path"] == "raw"],
        key=lambda r: r["size_bytes"],
    )
 def _hops(records):
    seen = []
    for r in records:
        if r["hop"] not in {h["id"] for h in seen}:
            seen.append({"id": r["hop"], "label": r["label"]})
    return seen
 def _plot_per_hop(records, hop, path):
    ipcq = _ipcq_rows(records, hop["id"])
    raw = _raw_rows(records, hop["id"])
    fig, ax = plt.subplots(figsize=(8, 5))
    if ipcq:
        ax.plot(
            [r["size_bytes"] for r in ipcq],
            [r["total_ns"] for r in ipcq],
            marker="o", color="tab:blue",
            label="IPCQ no-consume (send/recv, no slot read)",
        )
    if raw:
        ax.plot(
            [r["size_bytes"] for r in raw],
            [r["total_ns"] for r in raw],
            marker="s", color="tab:orange",
            label="Raw DMA (load+store)",
        )
    ax.set_xlabel("Data size (bytes)")
    ax.set_ylabel("Latency (ns)")
    ax.set_title(hop["label"])
    ax.grid(True, alpha=0.3)
    ax.legend()
    fig.tight_layout()
    fig.savefig(path, dpi=120)
    plt.close(fig)
 def _plot_overview(records, hops, path):
    fig, axes = plt.subplots(2, 2, figsize=(13, 9))
    axes = axes.flatten()
    for i, hop in enumerate(hops):
        ax = axes[i]
        ipcq = _ipcq_rows(records, hop["id"])
        raw = _raw_rows(records, hop["id"])
        if ipcq:
            ax.plot(
                [r["size_bytes"] for r in ipcq],
                [r["total_ns"] for r in ipcq],
                marker="o", color="tab:blue",
                label="IPCQ no-consume",
            )
        if raw:
            ax.plot(
                [r["size_bytes"] for r in raw],
                [r["total_ns"] for r in raw],
                marker="s", color="tab:orange",
                label="Raw DMA",
            )
        ax.set_title(hop["label"], fontsize=10)
        ax.set_xlabel("bytes")
        ax.set_ylabel("ns")
        ax.grid(True, alpha=0.3)
        ax.legend(fontsize=8)
    for j in range(len(hops), len(axes)):
        axes[j].axis("off")
    fig.suptitle(
        "PE-to-PE latency: IPCQ no-consume vs raw DMA",
        fontsize=14,
    )
    fig.tight_layout()
    fig.savefig(path, dpi=120)
    plt.close(fig)
 def main():
    records = _load_records()
    hops = _hops(records)
    for hop in hops:
        out = PLOT_DIR / f"{hop['id']}.png"
        _plot_per_hop(records, hop, out)
        print(f"wrote {out}")
    overview = PLOT_DIR / "overview.png"
    _plot_overview(records, hops, overview)
    print(f"wrote {overview}")
 if __name__ == "__main__":
    main()
@@ -24,9 +24,7 @@ TOPO_NAME_TO_KIND = {
 }
-def kernel_args(world_size: int, n_elem: int) -> tuple:
+def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
-    cube_w = 4
-    cube_h = 4
    return (n_elem, cube_w, cube_h, world_size)
@@ -111,6 +109,11 @@ def allreduce_intercube_multidevice(
 ):
    """Intercube all-reduce (pe0-only) with configurable SIP topology.
     Root cube sits at the geometric center (cube_w//2, cube_h//2) and
     each phase converges bidirectionally, so the intra-SIP critical path
     is shorter than a corner-root walk and approaches half of it on large
     meshes (e.g., 4×4 mesh: 4 hops reduce + 4 hops broadcast vs 6+6 with
     corner root).
    Args:
        t_ptr: VA base of the row-wise-sharded tensor on this SIP.
        n_elem: f16 elements per cube tile.
@@ -127,61 +130,117 @@ def allreduce_intercube_multidevice(
    row = cube_id // cube_w
    col = cube_id % cube_w
    nbytes = n_elem * 2
    single_cube = (cube_w == 1 and cube_h == 1)
    root_col = cube_w // 2
    root_row = cube_h // 2
    root_cube = root_row * cube_w + root_col
    pe_addr = t_ptr + cube_id * nbytes
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
-    # ── Phase 1: row reduce W → E ──
+    if single_cube:
-    if col == 0:
+        # ── Single-cube mode: skip intra-SIP reduce, go directly to
        #    inter-SIP exchange (TP use case: one cube per rank). ──
        if n_sips > 1:
            if sip_topo_kind == SIP_TOPO_RING:
                acc = _inter_sip_ring(acc, n_sips, n_elem, tl)
            elif sip_topo_kind == SIP_TOPO_TORUS:
                acc = _inter_sip_torus_2d(
                    acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
            elif sip_topo_kind == SIP_TOPO_MESH:
                acc = _inter_sip_mesh_2d(
                    acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
    else:
        # ── Multi-cube mode: center-root bidirectional reduce
        #    + inter-SIP exchange + bidirectional broadcast ──
        # Phase 1: row reduce — converge at col == root_col.
        # Left half (col < root_col) walks W→E; right half (col > root_col)
        # walks E→W; the root_col cube merges both sides.
        if col == 0 and root_col > 0:
            tl.send(dir="E", src=acc)
-    elif col < cube_w - 1:
+        elif 0 < col < root_col:
            recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
            acc = acc + recv
            tl.send(dir="E", src=acc)
-    else:
+        elif col == root_col:
            if root_col > 0:
                recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
                acc = acc + recv
            if cube_w - 1 > root_col:
                recv = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
                acc = acc + recv
        elif root_col < col < cube_w - 1:
            recv = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
            acc = acc + recv
            tl.send(dir="W", src=acc)
        elif col == cube_w - 1 and cube_w - 1 > root_col:
            tl.send(dir="W", src=acc)
-    # ── Phase 2: col reduce N → S on rightmost column ──
+        # Phase 2: col reduce on col == root_col — converge at row == root_row.
-    if col == cube_w - 1:
+        if col == root_col:
-        if row == 0:
+            if row == 0 and root_row > 0:
                tl.send(dir="S", src=acc)
-        elif row < cube_h - 1:
+            elif 0 < row < root_row:
                recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
                acc = acc + recv
                tl.send(dir="S", src=acc)
-        else:
+            elif row == root_row:
                if root_row > 0:
                    recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
                    acc = acc + recv
                if cube_h - 1 > root_row:
                    recv = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
                    acc = acc + recv
            elif root_row < row < cube_h - 1:
                recv = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
                acc = acc + recv
                tl.send(dir="N", src=acc)
            elif row == cube_h - 1 and cube_h - 1 > root_row:
                tl.send(dir="N", src=acc)
-    # ── Phase 3: inter-SIP exchange on root cube ──
+        # Phase 3: inter-SIP exchange on root cube.
    root_cube = (cube_h - 1) * cube_w + (cube_w - 1)
        if cube_id == root_cube and n_sips > 1:
            if sip_topo_kind == SIP_TOPO_RING:
                acc = _inter_sip_ring(acc, n_sips, n_elem, tl)
            elif sip_topo_kind == SIP_TOPO_TORUS:
-            acc = _inter_sip_torus_2d(acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
+                acc = _inter_sip_torus_2d(
                    acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
            elif sip_topo_kind == SIP_TOPO_MESH:
-            acc = _inter_sip_mesh_2d(acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
+                acc = _inter_sip_mesh_2d(
                    acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
-    # ── Phase 4: col broadcast S → N on rightmost column ──
+        # Phase 4: col broadcast on col == root_col, outward from root_row.
-    if col == cube_w - 1:
+        if col == root_col:
-        if row == cube_h - 1:
+            if row == root_row:
                if root_row > 0:
                    tl.send(dir="N", src=acc)
-        elif row > 0:
+                if cube_h - 1 > root_row:
                    tl.send(dir="S", src=acc)
            elif row < root_row:
                acc = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
                if row > 0:
                    tl.send(dir="N", src=acc)
-        else:
+            elif row > root_row:
-            acc = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
+                acc = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
                if row < cube_h - 1:
                    tl.send(dir="S", src=acc)
-    # ── Phase 5: row broadcast E → W ──
+        # Phase 5: row broadcast outward from root_col.
-    if col == cube_w - 1:
+        if col == root_col:
            if root_col > 0:
                tl.send(dir="W", src=acc)
-    elif col > 0:
+            if cube_w - 1 > root_col:
                tl.send(dir="E", src=acc)
        elif col < root_col:
            acc = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
            if col > 0:
                tl.send(dir="W", src=acc)
-    else:
+        elif col > root_col:
-        acc = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
+            acc = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
            if col < cube_w - 1:
                tl.send(dir="E", src=acc)
    tl.store(pe_addr, acc)
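
The row-reduce branches above can be read as a per-column role table. The following is a minimal sketch of that role assignment in plain Python. `row_reduce_ops` is a hypothetical helper written for illustration and is not part of the kernel. It assumes the branch not shown above the first `elif` sends east from column 0 when `root_col > 0`, mirroring the Phase 2 row logic.

```python
from __future__ import annotations


def row_reduce_ops(col: int, root_col: int, cube_w: int) -> list[str]:
    """Ordered send/recv steps one column performs in the row reduce.

    Hypothetical illustration helper; direction letters follow the diff
    ("W" = west neighbor, "E" = east neighbor).
    """
    ops: list[str] = []
    if col < root_col:
        # West of the root: accumulate from the west, forward east.
        if col > 0:
            ops.append("recv W")
        ops.append("send E")
    elif col == root_col:
        # The root column receives from each side that exists.
        if root_col > 0:
            ops.append("recv W")
        if cube_w - 1 > root_col:
            ops.append("recv E")
    else:
        # East of the root: accumulate from the east, forward west.
        if col < cube_w - 1:
            ops.append("recv E")
        ops.append("send W")
    return ops


if __name__ == "__main__":
    for c in range(4):
        print(c, row_reduce_ops(c, root_col=1, cube_w=4))
```

Every `send` in a row is matched by exactly one `recv` on the neighboring column, which is what keeps the reduce from deadlocking for any `root_col`.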
@@ -221,6 +221,8 @@ def install_ipcq(
    _OPPOSITE_DIR = {
        "E": "W", "W": "E", "N": "S", "S": "N",
        "intra_E": "intra_W", "intra_W": "intra_E",
        "intra_N": "intra_S", "intra_S": "intra_N",
        "global_E": "global_W", "global_W": "global_E",
        "global_N": "global_S", "global_S": "global_N",
    }
@@ -1,22 +1,24 @@
-"""SFR configuration for intercube + inter-SIP IPCQ wiring.
-Provides ``configure_sfr_intercube_multisip`` which programs PE_IPCQ
-neighbor tables for:
-  1. Intercube within each SIP — pe0 of every cube connects to pe0 of
-    its N/S/E/W mesh neighbors (no wrap-around).
-  2. Inter-SIP on ALL cubes — pe0 of cube_c on sip_A connects to pe0 of
-     cube_c on each peer SIP, using ``global_E``/``global_W`` (ring) or
-     ``global_N``/``global_S``/``global_E``/``global_W`` (mesh/torus)
-     direction labels.  Wiring all cubes allows the kernel to
-     dynamically elect the root cube at runtime.
-SIP-level topology is read from ``topology.yaml`` →
-``system.sips.topology`` (e.g. ``ring_1d``, ``mesh_2d``).
-Intercube mesh dimensions come from ``sip.cube_mesh.w/h``.
-Internally delegates to ``install_ipcq`` with a computed ``rank_to_pe``
-(pe0-only) and a closure-captured ``neighbors()`` function.
+"""SFR configuration for the full IPCQ hardware wiring.
+Installs PE_IPCQ neighbor tables modeling the physical hardware.
+Wiring is independent of DPPolicy / kernel choice — the kernel decides
+at runtime which links to use.
+Direction label namespaces (disjoint):
+  - Intra-cube PE-to-PE:   ``intra_N / intra_S / intra_E / intra_W``
+    Logical 2×4 PE grid within a cube (no wrap):
+         Row 0:  pe0  pe1  pe2  pe3
+         Row 1:  pe4  pe5  pe6  pe7
+  - Intercube same-lane:   ``N / S / E / W``
+    ``pe_i of cube_A ↔ pe_i of cube_B`` across the 4×4 cube mesh
+    (no wrap). Every PE i ∈ [0..7] wired independently.
+  - Inter-SIP same-(cube, pe): ``global_N / global_S / global_E / global_W``
+    ``pe_i of cube_c on sip_A ↔ pe_i of cube_c on sip_B`` per
+    ``topology.yaml → system.sips.topology``.
 """
 from __future__ import annotations
@@ -27,12 +29,46 @@ from kernbench.ccl.install import install_ipcq
 from kernbench.ccl.topologies import _BUILTIN as _TOPO_BUILTINS
 # ── Intra-cube 2×4 PE grid ───────────────────────────────────────────
 _PE_GRID_COLS = 4
 _PE_GRID_ROWS = 2
 _PES_PER_CUBE = _PE_GRID_COLS * _PE_GRID_ROWS  # 8
 def _intra_cube_neighbors(pe: int) -> dict[str, int]:
    """Logical 2×4 PE grid neighbors within a cube (no wrap).
    Returns directions in the ``intra_*`` namespace.
    """
    row, col = divmod(pe, _PE_GRID_COLS)
    nbrs: dict[str, int] = {}
    if col < _PE_GRID_COLS - 1:
        nbrs["intra_E"] = row * _PE_GRID_COLS + (col + 1)
    if col > 0:
        nbrs["intra_W"] = row * _PE_GRID_COLS + (col - 1)
    if row < _PE_GRID_ROWS - 1:
        nbrs["intra_S"] = (row + 1) * _PE_GRID_COLS + col
    if row > 0:
        nbrs["intra_N"] = (row - 1) * _PE_GRID_COLS + col
    return nbrs
 # ── Public entry point ───────────────────────────────────────────────
 def configure_sfr_intercube_multisip(
    engine: Any,
    spec: dict,
    cfg: dict,
 ) -> dict[str, Any]:
-    """Wire IPCQ for intercube (pe0, mesh) + inter-SIP (pe0, all cubes).
+    """Wire the full IPCQ hardware model.
    Every PE on every cube on every SIP gets neighbor table entries for:
      - intra-cube (2×4 grid) in the ``intra_*`` namespace
      - intercube same-lane (4×4 cube mesh, no wrap) in ``N/S/E/W``
      - inter-SIP same-(cube, pe) in ``global_*``
    Args:
        engine: GraphEngine with ``_components``.
@@ -46,48 +82,71 @@ def configure_sfr_intercube_multisip(
    mesh_w = int(cm["w"])
    mesh_h = int(cm["h"])
    n_cubes = mesh_w * mesh_h
-    n_sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
+    sips_cfg = spec.get("system", {}).get("sips", {})
-    sip_topology = str(
+    n_sips = int(sips_cfg.get("count", 1))
-        spec.get("system", {}).get("sips", {}).get("topology", "ring_1d")
+    sip_topology = str(sips_cfg.get("topology", "ring_1d"))
-    )
+    sip_w = sips_cfg.get("w")
    sip_h = sips_cfg.get("h")
    sip_w = int(sip_w) if sip_w is not None else None
    sip_h = int(sip_h) if sip_h is not None else None
    if sip_topology not in _TOPO_BUILTINS:
        raise ValueError(
            f"Unknown sip topology '{sip_topology}'. "
            f"Available: {list(_TOPO_BUILTINS)}"
        )
-    sip_topo_fn = _TOPO_BUILTINS[sip_topology]
+    _sip_topo_fn_raw = _TOPO_BUILTINS[sip_topology]
-    world_size = n_sips * n_cubes
+    def sip_topo_fn(rank: int, ws: int) -> dict:
        if sip_w is not None and sip_h is not None:
            try:
                return _sip_topo_fn_raw(rank, ws, w=sip_w, h=sip_h)
            except TypeError:
                pass
        return _sip_topo_fn_raw(rank, ws)
    pes_per_cube = _PES_PER_CUBE
    world_size = n_sips * n_cubes * pes_per_cube
    pe_idx_to_pe: list[tuple[int, int, int]] = [
-        (sip, cube, 0)
+        (sip, cube, pe)
        for sip in range(n_sips)
        for cube in range(n_cubes)
        for pe in range(pes_per_cube)
    ]
    def _pe_idx(sip: int, cube: int, pe: int) -> int:
        return (sip * n_cubes + cube) * pes_per_cube + pe
    def _neighbors(pe_idx: int, ws: int, _base: dict) -> dict[str, int]:
-        sip = pe_idx // n_cubes
+        tmp = pe_idx
-        cube = pe_idx % n_cubes
+        pe = tmp % pes_per_cube
        tmp //= pes_per_cube
        cube = tmp % n_cubes
        sip = tmp // n_cubes
        row = cube // mesh_w
        col = cube % mesh_w
        nbrs: dict[str, int] = {}
-        # Intercube within SIP (mesh, no wrap-around)
+        # ── Intra-cube (intra_N/S/E/W) ──
-        if col < mesh_w - 1:
+        for d, peer_pe in _intra_cube_neighbors(pe).items():
-            nbrs["E"] = sip * n_cubes + (row * mesh_w + col + 1)
+            nbrs[d] = _pe_idx(sip, cube, peer_pe)
-        if col > 0:
-            nbrs["W"] = sip * n_cubes + (row * mesh_w + col - 1)
-        if row < mesh_h - 1:
-            nbrs["S"] = sip * n_cubes + ((row + 1) * mesh_w + col)
-        if row > 0:
-            nbrs["N"] = sip * n_cubes + ((row - 1) * mesh_w + col)
-        # Inter-SIP on ALL cubes
+        # ── Intercube same-lane (N/S/E/W, 4×4 no wrap) ──
        if col < mesh_w - 1:
            nbrs["E"] = _pe_idx(sip, row * mesh_w + (col + 1), pe)
        if col > 0:
            nbrs["W"] = _pe_idx(sip, row * mesh_w + (col - 1), pe)
        if row < mesh_h - 1:
            nbrs["S"] = _pe_idx(sip, (row + 1) * mesh_w + col, pe)
        if row > 0:
            nbrs["N"] = _pe_idx(sip, (row - 1) * mesh_w + col, pe)
        # ── Inter-SIP same-(cube, pe) (global_*) ──
        if n_sips > 1:
            sip_nbrs = sip_topo_fn(sip, n_sips)
            for d, peer_sip in sip_nbrs.items():
-                nbrs[f"global_{d}"] = peer_sip * n_cubes + cube
+                nbrs[f"global_{d}"] = _pe_idx(peer_sip, cube, pe)
        return nbrs
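
The flat index used by `_pe_idx` and unpacked at the top of `_neighbors` is a mixed-radix encoding of `(sip, cube, pe)`. A minimal round-trip sketch follows. The standalone function names and explicit `n_cubes` / `pes_per_cube` parameters are assumptions for illustration, since the originals are closures over those values.

```python
from __future__ import annotations


def pe_idx(sip: int, cube: int, pe: int, n_cubes: int, pes_per_cube: int) -> int:
    """Flatten (sip, cube, pe) with pe as the fastest-varying digit."""
    return (sip * n_cubes + cube) * pes_per_cube + pe


def decode_pe_idx(idx: int, n_cubes: int, pes_per_cube: int) -> tuple[int, int, int]:
    """Inverse of pe_idx: peel off pe first, then cube, then sip."""
    pe = idx % pes_per_cube
    idx //= pes_per_cube
    cube = idx % n_cubes
    sip = idx // n_cubes
    return sip, cube, pe


if __name__ == "__main__":
    # 16 cubes per SIP (4x4 mesh), 8 PEs per cube (2x4 grid).
    flat = pe_idx(1, 5, 3, n_cubes=16, pes_per_cube=8)
    print(flat, decode_pe_idx(flat, n_cubes=16, pes_per_cube=8))
```

Because `pe` is the lowest digit, the eight PEs of one cube occupy eight consecutive flat indices, and the iteration order of `pe_idx_to_pe` matches this encoding.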
@@ -33,23 +33,41 @@ def ring_1d_unidir(rank: int, world_size: int) -> NeighborMap:
    return {"E": (rank + 1) % world_size}
-def mesh_2d(rank: int, world_size: int) -> NeighborMap:
+def _resolve_2d_dims(
-    """Square 2D mesh (N/S/E/W).
+    world_size: int, w: int | None, h: int | None, name: str,
-
+) -> tuple[int, int]:
-    Layout: rank = row * side + col, with side = sqrt(world_size).
+    if w is not None and h is not None:
-    Wrap-around (torus) on all four edges.
+        if w * h != world_size:
-    """
+            raise ValueError(
                f"{name}: w*h ({w}*{h}) != world_size ({world_size})"
            )
        return w, h
    side = int(round(world_size ** 0.5))
    if side * side != world_size:
        raise ValueError(
-            f"mesh_2d requires square world_size, got {world_size}"
+            f"{name} requires square world_size or explicit w,h, "
            f"got {world_size}"
        )
-    r, c = divmod(rank, side)
+    return side, side
 def mesh_2d(
    rank: int, world_size: int,
    w: int | None = None, h: int | None = None,
 ) -> NeighborMap:
    """2D mesh (N/S/E/W) with wrap-around on all four edges.
    Layout: rank = row * w + col. When w, h are given, supports
    rectangular (e.g. 2x3) layouts. Otherwise falls back to square
    side = sqrt(world_size).
    """
    w, h = _resolve_2d_dims(world_size, w, h, "mesh_2d")
    r, c = divmod(rank, w)
    return {
-        "N": ((r - 1) % side) * side + c,
+        "N": ((r - 1) % h) * w + c,
-        "S": ((r + 1) % side) * side + c,
+        "S": ((r + 1) % h) * w + c,
-        "W": r * side + (c - 1) % side,
+        "W": r * w + (c - 1) % w,
-        "E": r * side + (c + 1) % side,
+        "E": r * w + (c + 1) % w,
    }
@@ -73,36 +91,30 @@ def tree_binary(rank: int, world_size: int) -> NeighborMap:
    return n
-def torus_2d(rank: int, world_size: int) -> NeighborMap:
+def torus_2d(
-    """Square 2D torus (N/S/E/W) with wrap-around on all edges.
+    rank: int, world_size: int,
-
+    w: int | None = None, h: int | None = None,
-    Alias for mesh_2d (which already wraps). Explicit name for clarity
+) -> NeighborMap:
-    when used as a SIP-level topology.
+    """2D torus (N/S/E/W) with wrap-around on all edges. Alias for mesh_2d."""
-    """
+    return mesh_2d(rank, world_size, w=w, h=h)
-    return mesh_2d(rank, world_size)
-def mesh_2d_no_wrap(rank: int, world_size: int) -> NeighborMap:
+def mesh_2d_no_wrap(
-    """Square 2D mesh (N/S/E/W) WITHOUT wrap-around.
+    rank: int, world_size: int,
-
+    w: int | None = None, h: int | None = None,
-    Edge nodes have fewer neighbors (no wrapping). Used for SIP-level
+) -> NeighborMap:
-    topologies where physical links don't wrap.
+    """2D mesh (N/S/E/W) WITHOUT wrap-around. Supports rectangular dims."""
-    """
+    w, h = _resolve_2d_dims(world_size, w, h, "mesh_2d_no_wrap")
-    side = int(round(world_size ** 0.5))
+    r, c = divmod(rank, w)
-    if side * side != world_size:
-        raise ValueError(
-            f"mesh_2d_no_wrap requires square world_size, got {world_size}"
-        )
-    r, c = divmod(rank, side)
    n: NeighborMap = {}
    if r > 0:
-        n["N"] = (r - 1) * side + c
+        n["N"] = (r - 1) * w + c
-    if r < side - 1:
+    if r < h - 1:
-        n["S"] = (r + 1) * side + c
+        n["S"] = (r + 1) * w + c
    if c > 0:
-        n["W"] = r * side + (c - 1)
+        n["W"] = r * w + (c - 1)
-    if c < side - 1:
+    if c < w - 1:
-        n["E"] = r * side + (c + 1)
+        n["E"] = r * w + (c + 1)
    return n
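
To make the rectangular layout concrete, here is a small self-contained sketch of the same neighbor arithmetic for a 3-wide, 2-high grid (`rank = row * w + col`). The function names `mesh_wrap` and `mesh_no_wrap` are hypothetical stand-ins for `mesh_2d` and `mesh_2d_no_wrap`, and take `w` and `h` directly instead of resolving them from `world_size`.

```python
from __future__ import annotations


def mesh_wrap(rank: int, w: int, h: int) -> dict[str, int]:
    """Neighbors with wrap-around on all four edges (torus-style)."""
    r, c = divmod(rank, w)
    return {
        "N": ((r - 1) % h) * w + c,
        "S": ((r + 1) % h) * w + c,
        "W": r * w + (c - 1) % w,
        "E": r * w + (c + 1) % w,
    }


def mesh_no_wrap(rank: int, w: int, h: int) -> dict[str, int]:
    """Neighbors without wrap-around; edge ranks get fewer entries."""
    r, c = divmod(rank, w)
    n: dict[str, int] = {}
    if r > 0:
        n["N"] = (r - 1) * w + c
    if r < h - 1:
        n["S"] = (r + 1) * w + c
    if c > 0:
        n["W"] = r * w + (c - 1)
    if c < w - 1:
        n["E"] = r * w + (c + 1)
    return n


if __name__ == "__main__":
    # Grid layout for w=3, h=2:
    #   0 1 2
    #   3 4 5
    print(mesh_no_wrap(0, w=3, h=2))
    print(mesh_wrap(0, w=3, h=2))
```

Note that when a dimension has size 2, the wrapped variant maps both directions along that axis to the same peer (rank 0 has `N == S == 3` here), which is worth keeping in mind when wiring 2-high SIP layouts.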
@@ -23,7 +23,7 @@ def _hbm_pa(sip: int, cube: int, pe_id: int, spec: dict) -> int:
    mm = spec["cube"]["memory_map"]
    slice_bytes = mm["hbm_total_gb_per_cube"] * (1 << 30) // mm["hbm_slices_per_cube"]
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -31,6 +31,26 @@ class IpcqInvalidDirection(ValueError):
    has no neighbor installed for this PE."""
 # ── ADR-0023 D9.7: IPCQ slot-memory latency model ───────────────────
 #
 # Per-tier (bw_gbs, overhead_ns) used to charge the slot write (inbound)
 # and slot read (recv consume). Mirrors topology.yaml component values.
 _BUFFER_KIND_BW: dict[str, tuple[float, float]] = {
    "tcm":  (512.0, 0.0),
    "sram": (512.0, 2.0),
    "hbm":  (256.0, 6.0),
 }
 def slot_io_latency_ns(buffer_kind: str, nbytes: int) -> float:
    """Per-access latency for one slot read/write of ``nbytes`` against
    the IPCQ backing memory tier (``buffer_kind``)."""
    bw_gbs, overhead_ns = _BUFFER_KIND_BW.get(
        buffer_kind, _BUFFER_KIND_BW["tcm"],
    )
    return float(nbytes) / bw_gbs + overhead_ns
 # ── D2.5: IpcqEndpoint ───────────────────────────────────────────────
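
As a worked example of the slot latency model: since 1 GB/s equals 1 byte/ns, `nbytes / bw_gbs` is already in nanoseconds. The sketch below copies the tier table and formula from the hunk above into a standalone script so the numbers can be checked. The 4096-byte payload is an arbitrary illustrative size.

```python
from __future__ import annotations

# (bandwidth in GB/s, fixed overhead in ns), copied from the diff above.
_BUFFER_KIND_BW: dict[str, tuple[float, float]] = {
    "tcm":  (512.0, 0.0),
    "sram": (512.0, 2.0),
    "hbm":  (256.0, 6.0),
}


def slot_io_latency_ns(buffer_kind: str, nbytes: int) -> float:
    """Latency of one slot access; unknown tiers fall back to TCM."""
    bw_gbs, overhead_ns = _BUFFER_KIND_BW.get(
        buffer_kind, _BUFFER_KIND_BW["tcm"],
    )
    # bytes / (bytes per ns) = ns, plus the per-access overhead.
    return float(nbytes) / bw_gbs + overhead_ns


if __name__ == "__main__":
    for kind in ("tcm", "sram", "hbm"):
        print(kind, slot_io_latency_ns(kind, 4096))
```

For 4096 bytes this gives 8 ns (TCM), 10 ns (SRAM), and 22 ns (HBM) per access. The cost is charged once on the inbound slot write and once more on the recv-side slot read, on top of any PE-to-bank fabric drain for the SRAM and HBM tiers.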
@@ -115,6 +135,13 @@ class IpcqRecvCmd:
        "return_slot" — return slot address as-is (default, zero-copy).
                        Kernel uses the slot memory directly.
        "copy_to_dst" — copy slot data to dst_addr, then return.
    ``consume`` (DIAGNOSTIC ONLY): when False, recv still blocks until the
    payload lands in the slot, but skips the slot-read latency charge
    (slot-IO + PE↔bank fabric drain for SRAM/HBM tiers). This exists
    solely so the pe2pe overview plot can compare apples-to-apples
    against tl.store (a one-sided write that pays no read on DST). Real
    kernels always need the data they receive — leave this True.
    """
    direction: str | None        # None → round-robin (weak fairness, D4)
@@ -126,6 +153,7 @@ class IpcqRecvCmd:
    dst_space: str = ""          # used only when recv_mode == "copy_to_dst"
    blocking: bool = True
    data_op: bool = True
    consume: bool = True         # DIAGNOSTIC: see docstring
 # ── D12: IpcqDmaToken (PE_IPCQ → PE_DMA, vc_comm) ───────────────────
@@ -58,7 +58,18 @@ class IoCpuComponent(ComponentBase):
            self._pending[key] = (expected, received, parent_done)
    def _dispatch_to_m_cpus(self, env: simpy.Environment, txn: Any) -> Generator:
-        """Fan out sub-Transactions to target cube M_CPUs, wait for responses."""
+        """Fan out sub-Transactions to target cube M_CPUs, wait for responses.
        ADR-0009 D5 (extended): for KernelLaunchMsg, stamp a single global
        target_start_ns = env.now + max(IO_CPU → any target PE_CPU path
        latency across all target cubes). M_CPU passes this value through
        unchanged; every PE in every cube yields until the same sim-time
        before beginning kernel execution. Without this, cross-cube
        launches would have each cube's M_CPU compute its own per-cube
        barrier relative to its local env.now, leaving PEs on different
        cubes out of sync (the "h3/h4 dispatch-offset artifact").
        """
        import dataclasses
        from kernbench.runtime_api.kernel import KernelLaunchMsg, MemoryReadMsg, MemoryWriteMsg
        request = txn.request
@@ -72,10 +83,60 @@ class IoCpuComponent(ComponentBase):
            txn.done.succeed()
            return
        # For KernelLaunchMsg, compute the global barrier once here so
        # every downstream PE_CPU uses the same target_start_ns.
        if isinstance(request, KernelLaunchMsg):
            io_overhead = self.ctx.node_overhead_ns.get(self.node.id, 0.0)
            global_max_latency = 0.0
            pe_ids = self._resolve_pe_ids(
                getattr(request, "target_pe", "all")
            )
            for sip, cube in cube_targets:
                try:
                    m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
                    io_to_m_path = self.ctx.router.find_node_path(
                        self.node.id, m_cpu_id,
                    )
                except Exception:
                    continue
                if len(io_to_m_path) < 2:
                    continue
                leg1 = self.ctx.compute_path_latency_ns(
                    io_to_m_path, nbytes=0,
                )
                m_overhead = self.ctx.node_overhead_ns.get(m_cpu_id, 0.0)
                for pe_id in pe_ids:
                    pe_cpu_id = (
                        f"sip{sip}.cube{cube}.pe{pe_id}.pe_cpu"
                    )
                    try:
                        m_to_pe_path = self.ctx.router.find_node_path(
                            m_cpu_id, pe_cpu_id,
                        )
                    except Exception:
                        continue
                    if len(m_to_pe_path) < 2:
                        continue
                    leg2 = self.ctx.compute_path_latency_ns(
                        m_to_pe_path, nbytes=0,
                    )
                    latency = leg1 + leg2 - io_overhead - m_overhead
                    if latency > global_max_latency:
                        global_max_latency = latency
            request = dataclasses.replace(
                request,
                target_start_ns=float(env.now) + global_max_latency,
            )
        # Setup aggregation
        self._pending[request.request_id] = (len(cube_targets), 0, txn.done)
-        # Fan out to each target cube's M_CPU
+        # Fan out to each target cube's M_CPU. Kernel-launch fanout
        # carries control metadata only; nbytes is forced to 0 for
        # KernelLaunchMsg so the launch sub-txns do not occupy data-fabric
        # BW (would otherwise serialize 16 cubes worth of fanout on the
        # shared first hop and break ADR-0009 D5's barrier prediction).
        is_kernel_launch = isinstance(request, KernelLaunchMsg)
        for sip, cube in cube_targets:
            try:
                m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
@@ -86,11 +147,25 @@ class IoCpuComponent(ComponentBase):
                continue
            sub_txn = Transaction(
                request=request, path=path, step=0,
-                nbytes=txn.nbytes, done=env.event(),
+                nbytes=0 if is_kernel_launch else txn.nbytes,
                done=env.event(),
                result_data=txn.result_data,
            )
            yield self.out_ports[path[1]].put(sub_txn.advance())
    def _resolve_pe_ids(self, target_pe: Any) -> list[int]:
        """Resolve target_pe → list of PE indices (mirrors M_CPU logic)."""
        if isinstance(target_pe, int):
            return [target_pe]
        if isinstance(target_pe, tuple):
            return list(target_pe)
        # "all": all PEs in a cube
        n_slices = 8
        if self.ctx and self.ctx.spec:
            mm = self.ctx.spec.get("cube", {}).get("memory_map", {})
            n_slices = mm.get("hbm_slices_per_cube", 8)
        return list(range(n_slices))
    def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
        """Return list of (sip, cube) pairs to fan out to."""
        from kernbench.runtime_api.kernel import (
@@ -145,10 +220,10 @@ class IoCpuComponent(ComponentBase):
        return []
    def _cube_from_pa(self, pa_val: int, fallback: int) -> int:
-        """Extract cube_id from a physical address, with fallback."""
+        """Extract die_id from a physical address, with fallback."""
        from kernbench.policy.address.phyaddr import PhysAddr
        try:
-            return PhysAddr.decode(pa_val).cube_id
+            return PhysAddr.decode(pa_val).die_id
        except Exception:
            return fallback
@@ -162,7 +162,11 @@ class MCpuComponent(ComponentBase):
        Routes through find_node_path (M_CPU → NOC → PE_CPU command edges).
        PE_CPU sends ResponseMsg back via NOC → M_CPU on completion.
        Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
        ADR-0009 D5: stamps target_start_ns so every PE in this fanout
        starts executing at the same env.now regardless of dispatch path.
        """
        import dataclasses
        request = txn.request
        target_pe = getattr(request, "target_pe", "all")
        cube_prefix = self.node.id.rsplit(".", 1)[0]  # e.g. "sip0.cube0"
@@ -172,9 +176,13 @@ class MCpuComponent(ComponentBase):
            txn.done.succeed()
            return
-        # Fan out to each PE_CPU, using response-based aggregation
+        # Resolve per-PE paths. If IO_CPU already stamped a global
-        sub_txns: list[Transaction] = []
+        # target_start_ns (ADR-0009 D5 extended), pass it through
-        n_dispatched = 0
+        # unchanged so every PE across every cube uses the same barrier.
        # Otherwise (e.g. direct-to-M_CPU launch in a unit test) compute
        # a per-cube barrier from env.now.
        per_pe: list[tuple[int, list[str], float]] = []
        max_latency = 0.0
        for pe_id in pe_ids:
            pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
            try:
@@ -183,8 +191,24 @@ class MCpuComponent(ComponentBase):
                continue
            if len(path) < 2:
                continue
            latency = self.ctx.compute_path_latency_ns(path, nbytes=0)
            per_pe.append((pe_id, path, latency))
            if latency > max_latency:
                max_latency = latency
        if getattr(request, "target_start_ns", None) is not None:
            stamped_request = request
        else:
            stamped_request = dataclasses.replace(
                request, target_start_ns=float(env.now) + max_latency,
            )
        # Fan out to each PE_CPU, using response-based aggregation
        sub_txns: list[Transaction] = []
        n_dispatched = 0
        for pe_id, path, _lat in per_pe:
            sub_txn = Transaction(
-                request=request, path=path, step=0,
+                request=stamped_request, path=path, step=0,
                nbytes=0, done=env.event(),
            )
            yield self.out_ports[path[1]].put(sub_txn.advance())
@@ -204,16 +228,21 @@ class MCpuComponent(ComponentBase):
        yield all_done
        del self._parent_txns[request.request_id]
-        # Aggregate PE-internal metrics (max across PEs)
+        # Aggregate PE-internal metrics (max across PEs and across cubes).
        # Multiple M_CPUs share the same result_data dict via IO_CPU fanout;
        # merge against the existing value so cubes don't clobber each other.
        pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
        if pe_exec_values:
-            txn.result_data["pe_exec_ns"] = max(pe_exec_values)
+            cur = txn.result_data.get("pe_exec_ns", 0.0) or 0.0
            txn.result_data["pe_exec_ns"] = max(cur, max(pe_exec_values))
        dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
        if dma_values:
-            txn.result_data["dma_ns"] = max(dma_values)
+            cur = txn.result_data.get("dma_ns", 0.0) or 0.0
            txn.result_data["dma_ns"] = max(cur, max(dma_values))
        compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
        if compute_values:
-            txn.result_data["compute_ns"] = max(compute_values)
+            cur = txn.result_data.get("compute_ns", 0.0) or 0.0
            txn.result_data["compute_ns"] = max(cur, max(compute_values))
        # Send aggregate response on reverse command path back to IO_CPU
        reverse_path = list(reversed(txn.path))
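
The merge-against-existing pattern in this hunk can be isolated as follows. This is a minimal sketch, assuming a plain dict stands in for the shared `result_data`; `merge_max` is a hypothetical helper name, not a function in the codebase.

```python
from __future__ import annotations


def merge_max(result_data: dict, key: str, values: list[float]) -> None:
    """Fold per-PE values into a shared dict without clobbering a larger
    value already written by another cube's M_CPU."""
    if not values:
        return
    # `or 0.0` also covers a key that is present but set to None.
    cur = result_data.get(key, 0.0) or 0.0
    result_data[key] = max(cur, max(values))


if __name__ == "__main__":
    shared: dict = {}
    merge_max(shared, "pe_exec_ns", [120.0, 95.0])  # cube 0 reports
    merge_max(shared, "pe_exec_ns", [80.0, 60.0])   # cube 1 reports later
    print(shared)
```

With a plain assignment, the second call would have overwritten 120.0 with 80.0, so the reported `pe_exec_ns` would depend on which cube finished last.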
@@ -95,6 +95,13 @@ class PeCpuComponent(ComponentBase):
        request = txn.request
        yield from self.run(env, 0)
        # ADR-0009 D5: synchronized launch barrier. If M_CPU stamped a
        # target_start_ns, wait until then so every PE in this launch
        # begins pe_exec measurement at the same simulated time.
        target_start = getattr(request, "target_start_ns", None)
        if target_start is not None and target_start > env.now:
            yield env.timeout(float(target_start) - env.now)
        kernel_fn = get_kernel(request.kernel_ref.name)
        num_programs = self._derive_num_programs(request)
        kernel_args = self._unpack_kernel_args(request)
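
The launch barrier reduces to simple arithmetic: stamp one `target_start_ns = now + max(path latency)` and have every PE wait until that time. The sketch below models this without SimPy. `launch_start_times` and the PE names are hypothetical, and the latency bookkeeping is simplified (the IO_CPU code above also subtracts node overheads when combining the two path legs).

```python
from __future__ import annotations


def launch_start_times(
    now_ns: float, path_latency_ns: dict[str, float],
) -> tuple[float, dict[str, float]]:
    """Return the stamped barrier time and each PE's kernel start time."""
    # One global barrier: the slowest dispatch path sets the start time.
    target_start_ns = now_ns + max(path_latency_ns.values())
    starts: dict[str, float] = {}
    for pe, latency in path_latency_ns.items():
        arrival = now_ns + latency
        # A PE that arrives early idles until the barrier; it never
        # starts before its own launch message has arrived.
        starts[pe] = max(arrival, target_start_ns)
    return target_start_ns, starts


if __name__ == "__main__":
    latencies = {
        "sip0.cube0.pe0": 12.0,
        "sip0.cube0.pe7": 18.0,
        "sip0.cube15.pe7": 40.0,
    }
    print(launch_start_times(100.0, latencies))
```

If each cube instead computed its own barrier from its local maximum, PEs in cube 0 would start at 118 ns while cube 15 starts at 140 ns, which is the cross-cube offset the global stamp is meant to remove.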
@@ -186,15 +186,63 @@ class PeDmaComponent(PeEngineBase):
    # ── IPCQ inbound (fabric → PE_DMA → MemoryStore + PE_IPCQ) ──────
    def _handle_ipcq_inbound(self, env: simpy.Environment, txn: Any) -> Generator:
-        """At destination PE_DMA: atomically write data and forward metadata.
+        """At destination PE_DMA: pay terminal drain, then atomically write
        data and forward metadata.
        ADR-0023 D9 (drain at inbound terminal): the Transaction carries
        ``drain_ns = nbytes / bottleneck_bw_on_path`` stamped by the sender
        PE_DMA. Like every other Transaction terminal in the simulator (see
        ``ComponentBase._forward_txn``), this drain must be paid when the
        Transaction reaches its destination. SRC-side ``tl.send`` is
        fire-and-forget — it never yields on ``sub_done`` — so paying the
        drain here does NOT delay the sender. What it DOES delay is the
        IpcqMetaArrival forwarded below: that delay is the only signal
        ``tl.recv`` on DST blocks on, which is exactly the desired
        semantics — "send dispatches and returns; recv waits until the
        bytes have actually landed in its inbox".
        The drain MUST be paid before the atomic block — inserting a yield
        inside would break invariant I6.
        I6 (MUST): no SimPy yield between MemoryStore.write and the
        IpcqMetaArrival put into PE_IPCQ.
        """
        from kernbench.common.ipcq_types import IpcqMetaArrival
        # Pay terminal BW drain before the atomic write/metadata forward.
        # Without this, IPCQ effectively got fabric bandwidth for free at
        # the terminal (only intermediate-hop overhead_ns was charged),
        # making IPCQ latency look lower than raw DMA at large sizes in
        # benchmarks.
        drain = getattr(txn, "drain_ns", 0.0)
        if drain > 0:
            yield env.timeout(drain)
        token = txn.request
        # ADR-0023 D9.7: charge IPCQ slot-WRITE latency against the
        # backing-memory tier (tcm/sram/hbm) before the atomic block.
        # Must come BEFORE the atomic write→IpcqMetaArrival pair (I6).
        # SRAM/HBM also pay a PE_DMA→bank fabric drain (slot lives on
        # the cube NoC); TCM is per-PE local and skips this hop.
        from kernbench.common.ipcq_types import slot_io_latency_ns
        buffer_kind = token.dst_endpoint.buffer_kind
        if buffer_kind in ("sram", "hbm") and self.ctx is not None:
            cube_prefix = self._pe_prefix.rsplit(".", 1)[0]
            bank_node = (
                f"{cube_prefix}.sram" if buffer_kind == "sram"
                else f"{cube_prefix}.hbm_ctrl"
            )
            try:
                path = self.ctx.router.find_path(self._pe_prefix, bank_node)
                bank_drain_ns = self.ctx.compute_drain_ns(path, token.nbytes)
                if bank_drain_ns > 0:
                    yield env.timeout(bank_drain_ns)
            except Exception:
                pass
        slot_write_ns = slot_io_latency_ns(buffer_kind, token.nbytes)
        if slot_write_ns > 0:
            yield env.timeout(slot_write_ns)
        # ── ATOMIC: do not introduce yield between these two operations ──
        # 1. Move data via MemoryStore (single-hop DMA write).
        # Prefer the in-flight snapshot stashed by the sender PE_DMA;
@@ -278,7 +326,16 @@ class PeDmaComponent(PeEngineBase):
            dma_res = self._dma_write if is_write else self._dma_read
            assert dma_res is not None
-            pa = PhysAddr.decode(addr)
+            # Translate VA → PA via MMU (same logic as non-pipeline path)
            target_pa = addr
            if self._mmu is not None:
                from kernbench.policy.address.pe_mmu import PageFault
                try:
                    target_pa = self._mmu.translate(addr)
                except PageFault:
                    target_pa = addr  # fallback: treat as PA directly
            pa = PhysAddr.decode(target_pa)
            dst_node = self.ctx.resolver.resolve(pa)
            path = self.ctx.router.find_path(self._pe_prefix, dst_node)
            drain_ns = self.ctx.compute_drain_ns(path, nbytes)
@@ -290,7 +347,7 @@ class PeDmaComponent(PeEngineBase):
                    correlation_id="pipeline",
                    request_id=f"tile_{token.tile_id}",
                    src_sip=0, src_cube=0, src_pe=0,
-                    dst_pa=addr, nbytes=nbytes,
+                    dst_pa=target_pa, nbytes=nbytes,
                    is_write=is_write,
                )
                sub_txn = Transaction(
@@ -329,6 +329,41 @@ class PeIpcqComponent(ComponentBase):
        qp["my_tail"] += 1
        # ADR-0023 D9.7: charge IPCQ slot-READ latency against the
        # backing-memory tier (tcm/sram/hbm). Recv blocks for the
        # kernel-side slot consume; pe_exec_ns reflects this cost.
        # SRAM/HBM live on the cube NoC behind a router-attached link,
        # so reading a slot also pays a PE→bank fabric drain. TCM is
        # per-PE local and skips this hop.
        #
        # cmd.consume is a DIAGNOSTIC flag (default True). When False,
        # the read charges below are skipped — used only by the pe2pe
        # overview plot for an apples-to-apples comparison against
        # tl.store (one-sided write, no read on DST). Real kernels
        # always consume; this branch must not be exercised in
        # production code paths.
        from kernbench.common.ipcq_types import slot_io_latency_ns
        nbytes = req.result_data.get("nbytes", 0)
        if cmd.consume:
            if self._buffer_kind in ("sram", "hbm") and self.ctx is not None:
                cube_prefix = self._pe_prefix.rsplit(".", 1)[0]
                bank_node = (
                    f"{cube_prefix}.sram" if self._buffer_kind == "sram"
                    else f"{cube_prefix}.hbm_ctrl"
                )
                try:
                    path = self.ctx.router.find_path(
                        self._pe_prefix, bank_node,
                    )
                    bank_drain_ns = self.ctx.compute_drain_ns(path, nbytes)
                    if bank_drain_ns > 0:
                        yield env.timeout(bank_drain_ns)
                except Exception:
                    pass
            slot_read_ns = slot_io_latency_ns(self._buffer_kind, nbytes)
            if slot_read_ns > 0:
                yield env.timeout(slot_read_ns)
        # Diagnostics trace (D14)
        from kernbench.ccl import diagnostics
        if diagnostics.trace_enabled():
@@ -338,9 +373,13 @@ class PeIpcqComponent(ComponentBase):
                nbytes=req.result_data.get("nbytes", 0),
            )
-        # Fast path credit return — bottleneck BW based latency
-        env.process(
-            self._delayed_credit_send(env, direction, qp["peer_credit_store"], qp["my_tail"])
+        # Credit return: recv blocks on credit-emit so the protocol cost
+        # (full path latency to deliver the credit metadata back to the
+        # sender) is reflected in the recv's pe_exec_ns. Models the IPCQ
+        # control-plane completing the consume-acknowledgement before
+        # recv returns to the kernel.
+        yield from self._delayed_credit_send(
+            env, direction, qp["peer_credit_store"], qp["my_tail"],
         )
        if not req.done.triggered:
@@ -455,7 +494,12 @@ class PeIpcqComponent(ComponentBase):
        yield peer_credit_store.put(meta)
    def _credit_latency_ns(self, direction: str) -> float:
-        """Compute credit fast path latency = credit_size / bottleneck_bw.
+        """Full path latency for the credit-return packet.
+        Pays per-node overhead + edge prop + drain along the same fabric
+        the data took. PathRouter.find_path() auto-appends ".pe_dma" to
+        the source only, so the destination MUST be spelled with the
+        explicit ".pe_dma" suffix.
        Falls back to 0 when ctx/router is unavailable (unit-test mode).
        """
@@ -463,10 +507,12 @@ class PeIpcqComponent(ComponentBase):
            return 0.0
        qp = self._queue_pairs[direction]
        peer = qp["peer"]
-        peer_pe_prefix = f"sip{peer.sip}.cube{peer.cube}.pe{peer.pe}"
+        peer_pe_dma = f"sip{peer.sip}.cube{peer.cube}.pe{peer.pe}.pe_dma"
        try:
-            path = self.ctx.router.find_path(self._pe_prefix, peer_pe_prefix)
+            path = self.ctx.router.find_path(self._pe_prefix, peer_pe_dma)
-            return self.ctx.compute_drain_ns(path, self._credit_size_bytes)
+            return self.ctx.compute_path_latency_ns(
+                path, self._credit_size_bytes,
+            )
        except Exception:
            return 0.0
@@ -26,6 +26,9 @@ class ComponentContext:
    spec: dict = field(default_factory=dict)  # topology spec (cube layout, PE count, etc.)
    memory_store: Any = None  # MemoryStore for Phase 1 data-aware execution (ADR-0020)
    op_logger: Any = None     # OpLogger for Phase 1 op recording (ADR-0020)
    # node_id -> overhead_ns (ADR-0009 D5: used by M_CPU to compute per-PE
    # dispatch latency when stamping target_start_ns on KernelLaunchMsg).
    node_overhead_ns: dict[str, float] = field(default_factory=dict)
    def get_shared_resource(
        self, env: simpy.Environment, key: str, capacity: int = 1,
@@ -52,3 +55,19 @@ class ComponentContext:
        if min_bw == float("inf"):
            return 0.0
        return nbytes / min_bw
    def compute_path_latency_ns(self, path: list[str], nbytes: int = 0) -> float:
        """Formula latency along path: wire + per-node overhead + drain.
        ADR-0009 D5: M_CPU uses this to compute per-PE dispatch latency
        when stamping target_start_ns on KernelLaunchMsg fanout.
        """
        total = 0.0
        for i in range(len(path) - 1):
            edge = self.edge_map.get((path[i], path[i + 1]))
            if edge:
                total += edge.distance_mm * self.ns_per_mm
        for node_id in path:
            total += self.node_overhead_ns.get(node_id, 0.0)
        total += self.compute_drain_ns(path, nbytes)
        return total
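The formula in `compute_path_latency_ns` (wire + per-node overhead + drain) can be checked by hand with a small standalone sketch. This is not the simulator's code: `Edge`, `path_latency_ns`, the `bw_bytes_per_ns` field, and all numbers below are hypothetical and chosen only so the arithmetic is easy to follow. The drain term is assumed to be `nbytes` divided by the bottleneck bandwidth, matching the `nbytes / min_bw` return visible in `compute_drain_ns`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge record: physical length and bandwidth."""
    distance_mm: float
    bw_bytes_per_ns: float


def path_latency_ns(path, edges, node_overhead_ns, ns_per_mm, nbytes=0):
    """Wire propagation + per-node overhead + bottleneck drain."""
    hops = [h for h in zip(path, path[1:]) if h in edges]
    # Wire: each traversed edge contributes distance * propagation cost.
    wire = sum(edges[h].distance_mm * ns_per_mm for h in hops)
    # Overhead: every node on the path, endpoints included.
    overhead = sum(node_overhead_ns.get(n, 0.0) for n in path)
    # Drain: payload size over the slowest edge; zero for empty payloads.
    drain = nbytes / min(edges[h].bw_bytes_per_ns for h in hops) if hops and nbytes else 0.0
    return wire + overhead + drain


edges = {
    ("a", "b"): Edge(distance_mm=2.0, bw_bytes_per_ns=64.0),
    ("b", "c"): Edge(distance_mm=3.0, bw_bytes_per_ns=32.0),
}
overheads = {"a": 1.0, "b": 2.0, "c": 1.0}

# wire = 5 mm * 0.5 ns/mm = 2.5, overhead = 4.0, drain = 64 B / 32 B/ns = 2.0
with_payload = path_latency_ns(["a", "b", "c"], edges, overheads, 0.5, nbytes=64)
# Dispatch latency uses nbytes=0, so only wire and overhead remain.
dispatch_only = path_latency_ns(["a", "b", "c"], edges, overheads, 0.5)
```

With `nbytes=0`, as used for the `KernelLaunchMsg` dispatch legs, the drain term vanishes and the result is pure wire plus overhead.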
@@ -58,7 +58,13 @@ class IoCpuComponent(ComponentBase):
            self._pending[key] = (expected, received, parent_done)
    def _dispatch_to_m_cpus(self, env: simpy.Environment, txn: Any) -> Generator:
-        """Fan out sub-Transactions to target cube M_CPUs, wait for responses."""
+        """Fan out sub-Transactions to target cube M_CPUs, wait for responses.
        ADR-0009 D5 (extended): stamp a global target_start_ns on
        KernelLaunchMsg so every PE across every target cube starts at
        the same env.now. See the non-legacy builtin for full rationale.
        """
        import dataclasses
        from kernbench.runtime_api.kernel import KernelLaunchMsg, MemoryReadMsg, MemoryWriteMsg
        request = txn.request
@@ -72,10 +78,53 @@ class IoCpuComponent(ComponentBase):
            txn.done.succeed()
            return
        if isinstance(request, KernelLaunchMsg):
            io_overhead = self.ctx.node_overhead_ns.get(self.node.id, 0.0)
            global_max_latency = 0.0
            pe_ids = self._resolve_pe_ids(
                getattr(request, "target_pe", "all")
            )
            for sip, cube in cube_targets:
                try:
                    m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
                    io_to_m_path = self.ctx.router.find_node_path(
                        self.node.id, m_cpu_id,
                    )
                except Exception:
                    continue
                if len(io_to_m_path) < 2:
                    continue
                leg1 = self.ctx.compute_path_latency_ns(
                    io_to_m_path, nbytes=0,
                )
                m_overhead = self.ctx.node_overhead_ns.get(m_cpu_id, 0.0)
                for pe_id in pe_ids:
                    pe_cpu_id = (
                        f"sip{sip}.cube{cube}.pe{pe_id}.pe_cpu"
                    )
                    try:
                        m_to_pe_path = self.ctx.router.find_node_path(
                            m_cpu_id, pe_cpu_id,
                        )
                    except Exception:
                        continue
                    if len(m_to_pe_path) < 2:
                        continue
                    leg2 = self.ctx.compute_path_latency_ns(
                        m_to_pe_path, nbytes=0,
                    )
                    latency = leg1 + leg2 - io_overhead - m_overhead
                    if latency > global_max_latency:
                        global_max_latency = latency
            request = dataclasses.replace(
                request,
                target_start_ns=float(env.now) + global_max_latency,
            )
        # Setup aggregation
        self._pending[request.request_id] = (len(cube_targets), 0, txn.done)
-        # Fan out to each target cube's M_CPU
+        is_kernel_launch = isinstance(request, KernelLaunchMsg)
        for sip, cube in cube_targets:
            try:
                m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
@@ -86,11 +135,24 @@ class IoCpuComponent(ComponentBase):
                continue
            sub_txn = Transaction(
                request=request, path=path, step=0,
-                nbytes=txn.nbytes, done=env.event(),
+                nbytes=0 if is_kernel_launch else txn.nbytes,
                done=env.event(),
                result_data=txn.result_data,
            )
            yield self.out_ports[path[1]].put(sub_txn.advance())
    def _resolve_pe_ids(self, target_pe: Any) -> list[int]:
        """Resolve target_pe → list of PE indices (mirrors M_CPU logic)."""
        if isinstance(target_pe, int):
            return [target_pe]
        if isinstance(target_pe, tuple):
            return list(target_pe)
        n_slices = 8
        if self.ctx and self.ctx.spec:
            mm = self.ctx.spec.get("cube", {}).get("memory_map", {})
            n_slices = mm.get("hbm_slices_per_cube", 8)
        return list(range(n_slices))
    def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
        """Return list of (sip, cube) pairs to fan out to."""
        from kernbench.runtime_api.kernel import (
@@ -145,10 +207,10 @@ class IoCpuComponent(ComponentBase):
        return []
    def _cube_from_pa(self, pa_val: int, fallback: int) -> int:
-        """Extract cube_id from a physical address, with fallback."""
+        """Extract die_id from a physical address, with fallback."""
        from kernbench.policy.address.phyaddr import PhysAddr
        try:
-            return PhysAddr.decode(pa_val).cube_id
+            return PhysAddr.decode(pa_val).die_id
        except Exception:
            return fallback
@@ -162,7 +162,11 @@ class MCpuComponent(ComponentBase):
        Routes through find_node_path (M_CPU → NOC → PE_CPU command edges).
        PE_CPU sends ResponseMsg back via NOC → M_CPU on completion.
        Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
        ADR-0009 D5: stamps target_start_ns so every PE in this fanout
        starts executing at the same env.now regardless of dispatch path.
        """
        import dataclasses
        request = txn.request
        target_pe = getattr(request, "target_pe", "all")
        cube_prefix = self.node.id.rsplit(".", 1)[0]  # e.g. "sip0.cube0"
@@ -172,9 +176,10 @@ class MCpuComponent(ComponentBase):
            txn.done.succeed()
            return
-        # Fan out to each PE_CPU, using response-based aggregation
+        # Resolve per-PE paths. If IO_CPU already stamped a global
-        sub_txns: list[Transaction] = []
+        # target_start_ns (ADR-0009 D5 extended), pass it through.
-        n_dispatched = 0
+        per_pe: list[tuple[int, list[str], float]] = []
        max_latency = 0.0
        for pe_id in pe_ids:
            pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
            try:
@@ -183,8 +188,24 @@ class MCpuComponent(ComponentBase):
                continue
            if len(path) < 2:
                continue
            latency = self.ctx.compute_path_latency_ns(path, nbytes=0)
            per_pe.append((pe_id, path, latency))
            if latency > max_latency:
                max_latency = latency
        if getattr(request, "target_start_ns", None) is not None:
            stamped_request = request
        else:
            stamped_request = dataclasses.replace(
                request, target_start_ns=float(env.now) + max_latency,
            )
        # Fan out to each PE_CPU, using response-based aggregation
        sub_txns: list[Transaction] = []
        n_dispatched = 0
        for pe_id, path, _lat in per_pe:
            sub_txn = Transaction(
-                request=request, path=path, step=0,
+                request=stamped_request, path=path, step=0,
                nbytes=0, done=env.event(),
            )
            yield self.out_ports[path[1]].put(sub_txn.advance())
@@ -204,16 +225,21 @@ class MCpuComponent(ComponentBase):
        yield all_done
        del self._parent_txns[request.request_id]
-        # Aggregate PE-internal metrics (max across PEs)
+        # Aggregate PE-internal metrics (max across PEs and across cubes).
        # Multiple M_CPUs share the same result_data dict via IO_CPU fanout;
        # merge against the existing value so cubes don't clobber each other.
        pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
        if pe_exec_values:
-            txn.result_data["pe_exec_ns"] = max(pe_exec_values)
+            cur = txn.result_data.get("pe_exec_ns", 0.0) or 0.0
            txn.result_data["pe_exec_ns"] = max(cur, max(pe_exec_values))
        dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
        if dma_values:
-            txn.result_data["dma_ns"] = max(dma_values)
+            cur = txn.result_data.get("dma_ns", 0.0) or 0.0
            txn.result_data["dma_ns"] = max(cur, max(dma_values))
        compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
        if compute_values:
-            txn.result_data["compute_ns"] = max(compute_values)
+            cur = txn.result_data.get("compute_ns", 0.0) or 0.0
            txn.result_data["compute_ns"] = max(cur, max(compute_values))
        # Send aggregate response on reverse command path back to IO_CPU
        reverse_path = list(reversed(txn.path))
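The reason for merging against the existing value can be seen in a minimal sketch. It assumes, as the comment above states, that several M_CPUs write into one shared `result_data` dict; `merge_max` is a hypothetical helper name, not a function in the repo.

```python
def merge_max(result_data, key, values):
    """Fold per-PE values into a shared dict without clobbering other cubes."""
    if values:
        cur = result_data.get(key, 0.0) or 0.0  # treat missing/None as 0.0
        result_data[key] = max(cur, max(values))


shared = {}
merge_max(shared, "pe_exec_ns", [10.0, 30.0])  # cube 0 reports first
merge_max(shared, "pe_exec_ns", [20.0])        # cube 1 reports later

# A plain assignment (the removed lines) would leave 20.0 here.
overwritten = {"pe_exec_ns": max([10.0, 30.0])}
overwritten["pe_exec_ns"] = max([20.0])
```

After both cubes report, `shared["pe_exec_ns"]` holds the maximum across cubes, whereas the overwrite variant keeps only the last cube's value.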
@@ -71,6 +71,13 @@ class PeCpuComponent(ComponentBase):
        request = txn.request
        yield from self.run(env, 0)
        # ADR-0009 D5: synchronized launch barrier. If M_CPU stamped a
        # target_start_ns, wait until then so every PE in this launch
        # begins pe_exec measurement at the same simulated time.
        target_start = getattr(request, "target_start_ns", None)
        if target_start is not None and target_start > env.now:
            yield env.timeout(float(target_start) - env.now)
        kernel_fn = get_kernel(request.kernel_ref.name)
        num_programs = self._derive_num_programs(request)
        kernel_args = self._unpack_kernel_args(request)
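The launch barrier reduces to two small computations: the dispatcher stamps `now + max(dispatch latency)`, and each PE sleeps for the gap between its own arrival and that stamp. The sketch below models this without SimPy; the function names and the latency numbers are hypothetical.

```python
def stamp_target_start_ns(now_ns, dispatch_latency_ns):
    """Dispatcher side: the slowest dispatch path defines the common start."""
    return now_ns + max(dispatch_latency_ns.values(), default=0.0)


def barrier_wait_ns(target_start_ns, arrival_ns):
    """PE side: wait only if the launch arrived before the target."""
    return max(0.0, target_start_ns - arrival_ns)


now = 100.0
latency = {"pe0": 12.0, "pe1": 20.0, "pe2": 16.0}  # hypothetical per-PE dispatch latency
target = stamp_target_start_ns(now, latency)

# Each PE receives the launch at now + its own latency, then waits out the rest.
start_times = {
    pe: (now + lat) + barrier_wait_ns(target, now + lat)
    for pe, lat in latency.items()
}
```

Every entry of `start_times` equals `target`, which is the property the barrier is meant to provide: `pe_exec` measurement begins at the same simulated time on every PE.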
@@ -89,11 +89,10 @@ class _FreeList:
 class PEMemAllocator:
    def __init__(
-        self, rack_id: int, sip_id: int, cube_id: int, pe_id: int, cfg: AddressConfig,
+        self, sip_id: int, die_id: int, pe_id: int, cfg: AddressConfig,
    ) -> None:
-        self._rack_id = rack_id
        self._sip_id = sip_id
-        self._cube_id = cube_id
+        self._die_id = die_id
        self._pe_id = pe_id
        self._cfg = cfg
        self._hbm = _FreeList(cfg.hbm_slice_bytes)
@@ -108,7 +107,7 @@ class PEMemAllocator:
                f"available {self._cfg.hbm_slice_bytes - self._hbm.used}"
            )
        return PhysAddr.pe_hbm_addr(
-            rack_id=self._rack_id, sip_id=self._sip_id, cube_id=self._cube_id,
+            sip_id=self._sip_id, die_id=self._die_id,
            pe_id=self._pe_id, pe_local_hbm_offset=offset,
            slice_size_bytes=self._cfg.hbm_slice_bytes,
        )
@@ -128,7 +127,7 @@ class PEMemAllocator:
                f"available {self._cfg.tcm_allocatable_bytes - self._tcm.used}"
            )
        return PhysAddr.pe_tcm_addr(
-            rack_id=self._rack_id, sip_id=self._sip_id, cube_id=self._cube_id,
+            sip_id=self._sip_id, die_id=self._die_id,
            pe_id=self._pe_id, tcm_offset=offset,
        )
@@ -19,7 +19,14 @@ class PageFault(Exception):
 class PeMMU:
-    """Per-PE MMU with page-aligned VA→PA translation table.
+    """Per-PE MMU with sub-page-capable VA→PA translation table.
    Each page-table entry is a list of (start_in_page, end_in_page,
    pa_at_offset_zero) regions. This is a SIMULATOR STOPGAP — real MMUs
    store one PA per page-table entry. Sub-page regions exist here so
    DPPolicy layouts that shard below page granularity (e.g. 128 B
    payloads with 4 KB pages) don't silently mis-route through last-
    write-wins overwrites. Memory note: project_mmu_subpage_stopgap.md.
    Args:
        page_size: Page size in bytes (default 2 MB).
@@ -34,7 +41,11 @@ class PeMMU:
        self._page_size = page_size
        self._page_shift = (page_size - 1).bit_length()
        self._page_mask = page_size - 1
-        self._table: dict[int, int] = {}  # va_page_number → pa_page_base
+        # vpn → list of (start_in_page, end_in_page, pa_at_offset_zero).
+        # pa_at_offset_zero is the PA that offset 0 of the page would map
+        # to under this region — i.e. translate(off) = pa_at_offset_zero
+        # + off when start <= off < end.
+        self._table: dict[int, list[tuple[int, int, int]]] = {}
        self._overhead_ns = overhead_ns
    @property
@@ -46,21 +57,67 @@ class PeMMU:
        return len(self._table)
    def map(self, va: int, pa: int, size: int) -> None:
-        """Register VA→PA mapping for a contiguous range."""
-        for off in range(0, size, self._page_size):
-            vpn = (va + off) >> self._page_shift
-            self._table[vpn] = pa + off
+        """Register VA→PA mapping for a contiguous range.
+
+        Sub-page-aware: a single page can hold multiple disjoint regions,
+        each pointing to a different PA. Later map() calls APPEND a new
+        region; on overlap with an existing region, the new region wins
+        for the overlapping offsets (translate iterates in reverse so the
+        last write takes precedence — matches legacy single-PA behavior
+        when a full page is re-mapped).
+        """
+        end_va = va + size
+        cur = va
+        while cur < end_va:
+            vpn = cur >> self._page_shift
+            page_base_va = vpn << self._page_shift
+            page_end_va = page_base_va + self._page_size
+            region_start = cur - page_base_va
+            region_end = min(end_va, page_end_va) - page_base_va
+            # PA seen at offset 0 of page if this region's mapping covered it
+            pa_at_offset_zero = pa + (cur - va) - region_start
+            self._table.setdefault(vpn, []).append(
+                (region_start, region_end, pa_at_offset_zero)
+            )
+            cur = page_base_va + region_end
    def unmap(self, va: int, size: int) -> None:
-        """Remove VA mapping for a contiguous range."""
-        for off in range(0, size, self._page_size):
-            vpn = (va + off) >> self._page_shift
-            self._table.pop(vpn, None)
+        """Remove VA mapping for a contiguous range.
+
+        Drops any region whose extent is contained within the unmapped
+        range. Partial overlaps (region straddles the range boundary)
+        are left in place — caller is expected to unmap on the same
+        boundaries it mapped on.
+        """
+        end_va = va + size
+        cur = va
+        while cur < end_va:
+            vpn = cur >> self._page_shift
+            page_base_va = vpn << self._page_shift
+            page_end_va = page_base_va + self._page_size
+            unmap_start = cur - page_base_va
+            unmap_end = min(end_va, page_end_va) - page_base_va
+            regions = self._table.get(vpn)
+            if regions is not None:
+                kept = [
+                    r for r in regions
+                    if not (r[0] >= unmap_start and r[1] <= unmap_end)
+                ]
+                if kept:
+                    self._table[vpn] = kept
+                else:
+                    del self._table[vpn]
+            cur = page_base_va + unmap_end
    def translate(self, va: int) -> int:
        """Translate VA to PA. Raises PageFault if unmapped."""
        vpn = va >> self._page_shift
-        pa_page_base = self._table.get(vpn)
+        regions = self._table.get(vpn)
-        if pa_page_base is None:
+        if regions is None:
            raise PageFault(va)
-        return pa_page_base + (va & self._page_mask)
+        offset = va & self._page_mask
+        # Iterate latest-first so newer map() calls win on overlap
+        for start, end, pa_at_offset_zero in reversed(regions):
+            if start <= offset < end:
+                return pa_at_offset_zero + offset
+        raise PageFault(va)
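The sub-page region scheme can be exercised in isolation. The sketch below mirrors the `map`/`translate` logic from this hunk in a stripped-down class; `SubPageTable` is a hypothetical name, it raises `LookupError` in place of `PageFault`, and the 4 KB page size and addresses are illustrative, taken from the "128 B payloads with 4 KB pages" case the docstring mentions.

```python
class SubPageTable:
    """Minimal stand-in for the region-list page table (not the real PeMMU)."""

    def __init__(self, page_size=4096):
        self.page_size = page_size
        self.shift = (page_size - 1).bit_length()  # assumes power-of-two pages
        self.table = {}  # vpn -> [(start_in_page, end_in_page, pa_at_offset_zero)]

    def map(self, va, pa, size):
        end_va, cur = va + size, va
        while cur < end_va:
            vpn = cur >> self.shift
            base = vpn << self.shift
            start = cur - base
            end = min(end_va, base + self.page_size) - base
            # PA that offset 0 of this page would have under this region.
            self.table.setdefault(vpn, []).append((start, end, pa + (cur - va) - start))
            cur = base + end

    def translate(self, va):
        offset = va & (self.page_size - 1)
        # Latest-first, so a newer overlapping map() wins.
        for start, end, pa0 in reversed(self.table.get(va >> self.shift, [])):
            if start <= offset < end:
                return pa0 + offset
        raise LookupError(hex(va))


mmu = SubPageTable()
mmu.map(0x000, 0x1000_0000, 128)   # first 128 B shard of page 0
mmu.map(0x080, 0x2000_0000, 128)   # second shard, same page, different PA
mmu.map(0xFC0, 0x3000_0000, 0x80)  # range that straddles a page boundary
```

Two shards in the same page keep distinct physical targets, which a single-PA-per-page table would collapse into last-write-wins; the straddling range is split into one region per page.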
@@ -6,6 +6,47 @@ from typing import Literal
 MAX_51 = (1 << 51) - 1
 # ── Layout constants (ADR-0001 Rev 2) ────────────────────────────────
 # [50:47] sip_id (4)
 # [46:42] die_id (5)
 # [41: 0] local_offset (42)
 _SIP_SHIFT = 47
 _DIE_SHIFT = 42
 _LOCAL_BITS = 42
 _LOCAL_MASK = (1 << _LOCAL_BITS) - 1
 # AHBM die: [41:38] MBZ, [37] addr_space, [36:0] sub-address
 _AHBM_SEL_BIT = 37
 _AHBM_LOCAL_USED = 38  # bits actually meaningful for AHBM
 # Resource window: [36:34] resource_kind, [33:0] kind_local
 _RES_KIND_SHIFT = 34
 _RES_KIND_MASK = 0x7
 # PE_LOCAL: [32:29] pe_id, [28:25] pe_sub_unit, [24:0] sub_offset
 _PE_ID_SHIFT = 29
 _PE_SUB_SHIFT = 25
 _PE_SUB_OFFSET_BITS = 25
 # MCPU_LOCAL: [29:25] mcpu_sub_unit, [24:0] sub_offset
 _MCPU_SUB_SHIFT = 25
 # CUBE_SRAM: [24:0] sram_offset
 _SRAM_OFFSET_BITS = 25
 # IOCHIPLET: [41:40] MBZ, [39:0] chiplet_offset
 _CHIPLET_LOCAL_BITS = 40
 _IOCPU_BOUNDARY = 1 << 31  # 2 GB
 # IOCPU: [30:27] iocpu_sub_unit, [26:0] sub_offset
 _IOCPU_SUB_SHIFT = 27
 _IOCPU_SUB_OFFSET_BITS = 27
 # die_id ranges
 _AHBM_DIE_MAX = 15
 _CHIPLET_DIE_MIN = 16
 _CHIPLET_DIE_MAX = 20
 class PhysAddrError(Exception):
    pass
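As a cross-check of the layout constants above, a PE_LOCAL address can be packed and unpacked with plain shifts and masks. The shift values are copied from this hunk; `pack_pe_local` and `unpack_pe_local` are hypothetical helpers written for illustration and are not the `PhysAddr` API.

```python
# Shifts copied from the layout constants in this diff (ADR-0001 Rev 2).
SIP_SHIFT, DIE_SHIFT = 47, 42
AHBM_SEL_BIT = 37
RES_KIND_SHIFT, PE_ID_SHIFT, PE_SUB_SHIFT = 34, 29, 25
SUB_OFFSET_BITS = 25
PE_LOCAL = 0  # resource_kind value for the PE_LOCAL window


def pack_pe_local(sip_id, die_id, pe_id, pe_sub_unit, sub_offset):
    """Pack a PE_LOCAL address on an AHBM die (die_id 0..15)."""
    if not (0 <= sip_id < 16 and 0 <= die_id <= 15):
        raise ValueError("sip_id/die_id out of range for an AHBM die")
    if not (0 <= pe_id < 16 and 0 <= pe_sub_unit < 16):
        raise ValueError("pe_id/pe_sub_unit must fit in 4 bits")
    if not 0 <= sub_offset < (1 << SUB_OFFSET_BITS):
        raise ValueError("sub_offset must fit in 25 bits")
    local = (
        (PE_LOCAL << RES_KIND_SHIFT)
        | (pe_id << PE_ID_SHIFT)
        | (pe_sub_unit << PE_SUB_SHIFT)
        | sub_offset
    )
    return (sip_id << SIP_SHIFT) | (die_id << DIE_SHIFT) | local


def unpack_pe_local(addr):
    """Extract the fields again; addr_space == 0 means resource window."""
    return {
        "sip_id": (addr >> SIP_SHIFT) & 0xF,
        "die_id": (addr >> DIE_SHIFT) & 0x1F,
        "addr_space": (addr >> AHBM_SEL_BIT) & 0x1,
        "resource_kind": (addr >> RES_KIND_SHIFT) & 0x7,
        "pe_id": (addr >> PE_ID_SHIFT) & 0xF,
        "pe_sub_unit": (addr >> PE_SUB_SHIFT) & 0xF,
        "sub_offset": addr & ((1 << SUB_OFFSET_BITS) - 1),
    }


# pe_sub_unit=6 corresponds to PE_TCM in the PESubUnit table of this diff.
addr = pack_pe_local(sip_id=1, die_id=2, pe_id=3, pe_sub_unit=6, sub_offset=0x100)
fields = unpack_pe_local(addr)
```

The round trip recovers every field, and the packed value stays below `1 << 51`, consistent with `MAX_51`.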
@@ -22,163 +63,278 @@ def _chk_max(name: str, v: int, maxv: int) -> None:
 class UnitType(IntEnum):
-    PE = 0
+    """resource_kind values for AHBM resource window."""
-    MCPU = 1
+    PE = 0       # PE_LOCAL
-    SRAM = 2
+    MCPU = 1     # MCPU_LOCAL
    SRAM = 2     # CUBE_SRAM
 class PESubUnit(IntEnum):
    PE_CPU_DTCM = 0
    MATH_ENGINE_DTCM = 1
    IPCQ = 2
    PE_CPU_SFR = 3
    MATH_ENGINE_SFR = 4
    DMA_ENGINE_SFR = 5
    PE_TCM = 6
 class MCPUSubUnit(IntEnum):
    MCPU_ITCM = 0
    MCPU_DTCM = 1
    IPCQ = 2
    MCPU_SFR = 3
    MCPU_DMA_SFR = 4
    MCPU_SRAM = 5
 class IOCPUSubUnit(IntEnum):
    IOCPU_ITCM = 0
    IOCPU_DTCM = 1
    IPCQ = 2
    IOCPU_SFR = 3
    IO_DMA_SFR = 4
    IO_SRAM = 5
@dataclass(frozen=True)
 class PhysAddr:
-    """
+    """51-bit physical address value object (ADR-0001 Rev 2).
-    51-bit physical address value object.
    Layout:
-      [50:47] rack_id  (4)
+      [50:47] sip_id        (4)   -- 16 SIPs
-      [46:43] sip_id   (4)
+      [46:42] die_id        (5)   -- 0..15 AHBM, 16..20 IOCHIPLET
-      [42:38] sip_seg  (5)   # cube_id
+      [41: 0] local_offset  (42)  -- 4 TB per die
-      [37:0]  local_offset (38) => each segment is 256GB
    local_offset:
      [37] selector: 1 = HBM window (128GB reserved), 0 = PE resource window
    """
-    rack_id: int
    sip_id: int
-    sip_seg: int
+    die_id: int
    local_offset: int
-    kind: Literal["hbm", "pe_resource", "raw"] = "raw"
+    kind: Literal["hbm", "pe_resource", "iocpu", "ual", "raw"] = "raw"
-    cube_id: int = 0
    unit_type: UnitType = UnitType.PE
    pe_id: int = 0
-    ext: int = 0
+    pe_sub_unit: int = 0
    sub_offset: int = 0
    hbm_offset: int = 0
+    iocpu_sub_unit: int = 0
+    chiplet_offset: int = 0
+    mcpu_sub_unit: int = 0
    HBM_WINDOW_BYTES = 1 << 37  # 128 GB
    # ── encode / decode ──────────────────────────────────────────────
    def encode(self) -> int:
-        _chk_range("rack_id", self.rack_id, 4)
        _chk_range("sip_id", self.sip_id, 4)
-        _chk_range("sip_seg", self.sip_seg, 5)
+        _chk_range("die_id", self.die_id, 5)
-        _chk_range("local_offset", self.local_offset, 38)
+        _chk_range("local_offset", self.local_offset, _LOCAL_BITS)
-        addr = (self.rack_id << 47) | (self.sip_id << 43) | (self.sip_seg << 38) | self.local_offset
-        if not (0 <= addr <= MAX_51):
-            raise PhysAddrError("address exceeds 51-bit space")
+        # MBZ enforcement
+        if self.die_id <= _AHBM_DIE_MAX:
+            mbz_top = (self.local_offset >> _AHBM_LOCAL_USED) & 0xF
+            if mbz_top != 0:
+                raise PhysAddrError("AHBM local_offset bits [41:38] must be zero")
+        elif _CHIPLET_DIE_MIN <= self.die_id <= _CHIPLET_DIE_MAX:
+            mbz_top = (self.local_offset >> _CHIPLET_LOCAL_BITS) & 0x3
+            if mbz_top != 0:
+                raise PhysAddrError("IOCHIPLET local_offset bits [41:40] must be zero")
+        addr = (self.sip_id << _SIP_SHIFT) | (self.die_id << _DIE_SHIFT) | self.local_offset
        return addr
    @staticmethod
    def decode(addr: int) -> PhysAddr:
        if not (0 <= addr <= MAX_51):
            raise PhysAddrError("addr must be a 51-bit value")
-        rack = (addr >> 47) & 0xF
-        sip_id = (addr >> 43) & 0xF
-        sip_seg = (addr >> 38) & 0x1F
-        off = addr & ((1 << 38) - 1)
-        cube_id = sip_seg
-        sel = (off >> 37) & 0x1
+        sip_id = (addr >> _SIP_SHIFT) & 0xF
+        die_id = (addr >> _DIE_SHIFT) & 0x1F
+        local_offset = addr & _LOCAL_MASK
+
+        if die_id <= _AHBM_DIE_MAX:
+            return PhysAddr._decode_ahbm(sip_id, die_id, local_offset)
+        elif _CHIPLET_DIE_MIN <= die_id <= _CHIPLET_DIE_MAX:
+            return PhysAddr._decode_chiplet(sip_id, die_id, local_offset)
+        else:
+            raise PhysAddrError(f"die_id {die_id} is reserved (21..31)")
+    @staticmethod
+    def _decode_ahbm(sip_id: int, die_id: int, local_offset: int) -> PhysAddr:
+        sel = (local_offset >> _AHBM_SEL_BIT) & 0x1
        if sel == 1:
-            hbm_offset = int(off & ((1 << 37) - 1))
+            hbm_offset = int(local_offset & ((1 << _AHBM_SEL_BIT) - 1))
            return PhysAddr(
-                rack_id=rack,
-                sip_id=sip_id,
-                sip_seg=sip_seg,
-                local_offset=off,
-                kind="hbm",
-                cube_id=cube_id,
-                hbm_offset=hbm_offset,
+                sip_id=sip_id, die_id=die_id, local_offset=local_offset,
+                kind="hbm", hbm_offset=hbm_offset,
            )
-        # PE resource decode
+        # Resource window
-        raw_ut = int((off >> 34) & 0x7)
+        res_kind = int((local_offset >> _RES_KIND_SHIFT) & _RES_KIND_MASK)
        try:
-            unit_type = UnitType(raw_ut)
+            unit_type = UnitType(res_kind)
        except ValueError:
-            raise PhysAddrError(f"unknown unit_type: {raw_ut}") from None
+            raise PhysAddrError(f"unknown resource_kind: {res_kind}") from None
-        pe_id = int((off >> 30) & 0xF)
+
-        ext = int((off >> 29) & 0x1)
+        if unit_type == UnitType.PE:
-        sub_offset = int(off & ((1 << 29) - 1))
+            pe_id = int((local_offset >> _PE_ID_SHIFT) & 0xF)
            pe_sub = int((local_offset >> _PE_SUB_SHIFT) & 0xF)
            sub_off = int(local_offset & ((1 << _PE_SUB_OFFSET_BITS) - 1))
            return PhysAddr(
-            rack_id=rack,
+                sip_id=sip_id, die_id=die_id, local_offset=local_offset,
-            sip_id=sip_id,
+                kind="pe_resource", unit_type=unit_type,
-            sip_seg=sip_seg,
+                pe_id=pe_id, pe_sub_unit=pe_sub, sub_offset=sub_off,
-            local_offset=off,
+            )
-            kind="pe_resource",
+        elif unit_type == UnitType.MCPU:
-            cube_id=cube_id,
+            mcpu_sub = int((local_offset >> _MCPU_SUB_SHIFT) & 0x1F)
-            unit_type=unit_type,
+            sub_off = int(local_offset & ((1 << _PE_SUB_OFFSET_BITS) - 1))
-            pe_id=pe_id,
+            return PhysAddr(
-            ext=ext,
+                sip_id=sip_id, die_id=die_id, local_offset=local_offset,
-            sub_offset=sub_offset,
+                kind="pe_resource", unit_type=unit_type,
-            hbm_offset=0,
+                mcpu_sub_unit=mcpu_sub, sub_offset=sub_off,
            )
        else:  # SRAM
            sub_off = int(local_offset & ((1 << _SRAM_OFFSET_BITS) - 1))
            return PhysAddr(
                sip_id=sip_id, die_id=die_id, local_offset=local_offset,
                kind="pe_resource", unit_type=unit_type,
                sub_offset=sub_off,
            )
    @staticmethod
-    def hbm_addr(*, rack_id: int, sip_id: int, cube_id: int, hbm_offset: int) -> PhysAddr:
+    def _decode_chiplet(sip_id: int, die_id: int, local_offset: int) -> PhysAddr:
-        _chk_max("cube_id", cube_id, 31)
+        chip_off = local_offset & ((1 << _CHIPLET_LOCAL_BITS) - 1)
-        _chk_range("hbm_offset", hbm_offset, 37)
+        if chip_off < _IOCPU_BOUNDARY:
-        sip_seg = cube_id
+            iocpu_sub = int((chip_off >> _IOCPU_SUB_SHIFT) & 0xF)
-        local_offset = (1 << 37) | int(hbm_offset)
+            sub_off = int(chip_off & ((1 << _IOCPU_SUB_OFFSET_BITS) - 1))
            return PhysAddr(
-            rack_id=rack_id,
+                sip_id=sip_id, die_id=die_id, local_offset=local_offset,
-            sip_id=sip_id,
+                kind="iocpu", chiplet_offset=chip_off,
-            sip_seg=sip_seg,
+                iocpu_sub_unit=iocpu_sub, sub_offset=sub_off,
-            local_offset=local_offset,
+            )
-            kind="hbm",
+        else:
-            cube_id=cube_id,
+            return PhysAddr(
-            hbm_offset=int(hbm_offset),
+                sip_id=sip_id, die_id=die_id, local_offset=local_offset,
                kind="ual", chiplet_offset=chip_off,
            )
    # ── AHBM factory methods ────────────────────────────────────────
    @staticmethod
    def hbm_addr(*, sip_id: int, die_id: int, hbm_offset: int) -> PhysAddr:
        _chk_max("die_id", die_id, _AHBM_DIE_MAX)
        _chk_range("hbm_offset", hbm_offset, _AHBM_SEL_BIT)
        local_offset = (1 << _AHBM_SEL_BIT) | int(hbm_offset)
        return PhysAddr(
            sip_id=sip_id, die_id=die_id, local_offset=local_offset,
            kind="hbm", hbm_offset=int(hbm_offset),
        )
    @staticmethod
    def pe_hbm_addr(
-        *,
-        rack_id: int,
-        sip_id: int,
-        cube_id: int,
-        pe_id: int,
-        pe_local_hbm_offset: int,
-        slice_size_bytes: int,
+        *, sip_id: int, die_id: int,
+        pe_id: int, pe_local_hbm_offset: int, slice_size_bytes: int,
    ) -> PhysAddr:
-        _chk_max("cube_id", cube_id, 31)
+        _chk_max("die_id", die_id, _AHBM_DIE_MAX)
        _chk_range("pe_id", pe_id, 4)
        if not (0 <= pe_local_hbm_offset < slice_size_bytes):
            raise PhysAddrError("pe_local_hbm_offset out of PE local slice range")
        hbm_offset = int(pe_id) * int(slice_size_bytes) + int(pe_local_hbm_offset)
        if not (0 <= hbm_offset < PhysAddr.HBM_WINDOW_BYTES):
            raise PhysAddrError("HBM offset exceeds reserved 128GB window")
-        return PhysAddr.hbm_addr(
-            rack_id=rack_id, sip_id=sip_id, cube_id=cube_id, hbm_offset=hbm_offset
-        )
+        return PhysAddr.hbm_addr(sip_id=sip_id, die_id=die_id, hbm_offset=hbm_offset)
    @staticmethod
    def hbm_pe_id(hbm_offset: int, slice_size_bytes: int) -> int:
        return hbm_offset // slice_size_bytes
    @staticmethod
-    def cube_sram_addr(
-        *, rack_id: int, sip_id: int, cube_id: int, sram_offset: int,
+    def pe_tcm_addr(
+        *, sip_id: int, die_id: int, pe_id: int, tcm_offset: int,
     ) -> PhysAddr:
-        _chk_max("cube_id", cube_id, 31)
-        _chk_range("sram_offset", sram_offset, 29)
-        sip_seg = cube_id
-        local_offset = (UnitType.SRAM << 34) | sram_offset
-        return PhysAddr(
-            rack_id=rack_id, sip_id=sip_id, sip_seg=sip_seg,
-            local_offset=local_offset,
-            kind="pe_resource", cube_id=cube_id,
-            unit_type=UnitType.SRAM, sub_offset=sram_offset,
+        return PhysAddr.pe_resource_addr(
+            sip_id=sip_id, die_id=die_id, pe_id=pe_id,
+            pe_sub_unit=PESubUnit.PE_TCM, sub_offset=tcm_offset,
        )
    @staticmethod
-    def pe_tcm_addr(
+    def pe_resource_addr(
-        *, rack_id: int, sip_id: int, cube_id: int, pe_id: int, tcm_offset: int,
+        *, sip_id: int, die_id: int, pe_id: int,
        pe_sub_unit: int, sub_offset: int,
    ) -> PhysAddr:
-        _chk_max("cube_id", cube_id, 31)
+        _chk_max("die_id", die_id, _AHBM_DIE_MAX)
        _chk_range("pe_id", pe_id, 4)
-        _chk_range("tcm_offset", tcm_offset, 29)
+        _chk_range("pe_sub_unit", pe_sub_unit, 4)
-        sip_seg = cube_id
+        _chk_range("sub_offset", sub_offset, _PE_SUB_OFFSET_BITS)
-        local_offset = (UnitType.PE << 34) | (pe_id << 30) | tcm_offset
+        local_offset = (
-        return PhysAddr(
+            (UnitType.PE << _RES_KIND_SHIFT)
-            rack_id=rack_id, sip_id=sip_id, sip_seg=sip_seg,
+            | (pe_id << _PE_ID_SHIFT)
-            local_offset=local_offset,
+            | (pe_sub_unit << _PE_SUB_SHIFT)
-            kind="pe_resource", cube_id=cube_id,
+            | sub_offset
-            unit_type=UnitType.PE, pe_id=pe_id, sub_offset=tcm_offset,
+        )
        return PhysAddr(
            sip_id=sip_id, die_id=die_id, local_offset=local_offset,
            kind="pe_resource", unit_type=UnitType.PE,
            pe_id=pe_id, pe_sub_unit=pe_sub_unit, sub_offset=sub_offset,
        )
    @staticmethod
    def cube_sram_addr(
        *, sip_id: int, die_id: int, sram_offset: int,
    ) -> PhysAddr:
        _chk_max("die_id", die_id, _AHBM_DIE_MAX)
        _chk_range("sram_offset", sram_offset, _SRAM_OFFSET_BITS)
        local_offset = (UnitType.SRAM << _RES_KIND_SHIFT) | sram_offset
        return PhysAddr(
            sip_id=sip_id, die_id=die_id, local_offset=local_offset,
            kind="pe_resource", unit_type=UnitType.SRAM, sub_offset=sram_offset,
        )
    @staticmethod
    def mcpu_resource_addr(
        *, sip_id: int, die_id: int, mcpu_sub_unit: int, sub_offset: int,
    ) -> PhysAddr:
        _chk_max("die_id", die_id, _AHBM_DIE_MAX)
        _chk_range("mcpu_sub_unit", mcpu_sub_unit, 5)
        _chk_range("sub_offset", sub_offset, _PE_SUB_OFFSET_BITS)
        local_offset = (
            (UnitType.MCPU << _RES_KIND_SHIFT)
            | (mcpu_sub_unit << _MCPU_SUB_SHIFT)
            | sub_offset
        )
        return PhysAddr(
            sip_id=sip_id, die_id=die_id, local_offset=local_offset,
            kind="pe_resource", unit_type=UnitType.MCPU,
            mcpu_sub_unit=mcpu_sub_unit, sub_offset=sub_offset,
        )
    # ── IOCHIPLET factory methods ────────────────────────────────────
    @staticmethod
    def iocpu_resource_addr(
        *, sip_id: int, die_id: int, iocpu_sub_unit: int, sub_offset: int,
    ) -> PhysAddr:
        _chk_max("die_id", die_id, _CHIPLET_DIE_MAX)
        if die_id < _CHIPLET_DIE_MIN:
            raise PhysAddrError(
                f"die_id {die_id} is not an IOCHIPLET "
                f"(must be {_CHIPLET_DIE_MIN}..{_CHIPLET_DIE_MAX})"
            )
        _chk_range("iocpu_sub_unit", iocpu_sub_unit, 4)
        _chk_range("sub_offset", sub_offset, _IOCPU_SUB_OFFSET_BITS)
        chiplet_offset = (iocpu_sub_unit << _IOCPU_SUB_SHIFT) | sub_offset
        if chiplet_offset >= _IOCPU_BOUNDARY:
            raise PhysAddrError("IOCPU region overflow (must be < 2 GB)")
        return PhysAddr(
            sip_id=sip_id, die_id=die_id, local_offset=chiplet_offset,
            kind="iocpu", chiplet_offset=chiplet_offset,
            iocpu_sub_unit=iocpu_sub_unit, sub_offset=sub_offset,
        )
    @staticmethod
    def ual_addr(*, sip_id: int, die_id: int, ual_offset: int) -> PhysAddr:
        _chk_max("die_id", die_id, _CHIPLET_DIE_MAX)
        if die_id < _CHIPLET_DIE_MIN:
            raise PhysAddrError(f"die_id {die_id} is not an IOCHIPLET")
        chiplet_offset = _IOCPU_BOUNDARY + ual_offset
        _chk_range("chiplet_offset", chiplet_offset, _CHIPLET_LOCAL_BITS)
        return PhysAddr(
            sip_id=sip_id, die_id=die_id, local_offset=chiplet_offset,
            kind="ual", chiplet_offset=chiplet_offset,
        )
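Review note: the factory methods in this hunk share one pattern: range-check each field against its bit width, then OR the shifted fields into `local_offset`. The sketch below illustrates that pattern only. The shift and width constants are hypothetical placeholders (the real `_RES_KIND_SHIFT`, `_PE_ID_SHIFT`, `_PE_SUB_SHIFT` and `_PE_SUB_OFFSET_BITS` values are not shown in this diff), `pack_pe_offset` / `unpack_pe_offset` are illustrative names rather than part of `PhysAddr`, and it assumes the third argument of `_chk_range` is a bit width, as the old 4-bit `pe_id` field between shifts 30 and 34 suggests.

```python
# Hypothetical field layout, for illustration only (not the real constants).
SUB_OFFSET_BITS = 20                        # assumed width of sub_offset
PE_SUB_SHIFT = SUB_OFFSET_BITS
PE_SUB_BITS = 4                             # mirrors _chk_range("pe_sub_unit", ..., 4)
PE_ID_SHIFT = PE_SUB_SHIFT + PE_SUB_BITS
PE_ID_BITS = 4                              # mirrors _chk_range("pe_id", ..., 4)
RES_KIND_SHIFT = PE_ID_SHIFT + PE_ID_BITS


def check_range(name: str, value: int, bits: int) -> None:
    """Reject values that do not fit in an unsigned field of `bits` bits."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name}={value} does not fit in {bits} bits")


def pack_pe_offset(kind: int, pe_id: int, pe_sub_unit: int, sub_offset: int) -> int:
    """Validate every field, then OR the shifted fields together."""
    check_range("pe_id", pe_id, PE_ID_BITS)
    check_range("pe_sub_unit", pe_sub_unit, PE_SUB_BITS)
    check_range("sub_offset", sub_offset, SUB_OFFSET_BITS)
    return (
        (kind << RES_KIND_SHIFT)
        | (pe_id << PE_ID_SHIFT)
        | (pe_sub_unit << PE_SUB_SHIFT)
        | sub_offset
    )


def unpack_pe_offset(local_offset: int) -> tuple[int, int, int, int]:
    """Inverse of pack_pe_offset: mask each field back out."""
    sub_offset = local_offset & ((1 << SUB_OFFSET_BITS) - 1)
    pe_sub_unit = (local_offset >> PE_SUB_SHIFT) & ((1 << PE_SUB_BITS) - 1)
    pe_id = (local_offset >> PE_ID_SHIFT) & ((1 << PE_ID_BITS) - 1)
    kind = local_offset >> RES_KIND_SHIFT
    return kind, pe_id, pe_sub_unit, sub_offset
```

Because every field is validated before packing, fields cannot bleed into their neighbours, so a pack followed by an unpack returns the original tuple.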
@@ -27,16 +27,16 @@ class AddressResolver:
    def resolve(self, addr: PhysAddr) -> str:
        s = addr.sip_id
-        c = addr.cube_id
+        d = addr.die_id
        if addr.kind == "hbm":
-            node_id = f"sip{s}.cube{c}.hbm_ctrl"
+            node_id = f"sip{s}.cube{d}.hbm_ctrl"
        elif addr.kind == "pe_resource":
            if addr.unit_type == UnitType.PE:
-                node_id = f"sip{s}.cube{c}.pe{addr.pe_id}.pe_tcm"
+                node_id = f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
            elif addr.unit_type == UnitType.SRAM:
-                node_id = f"sip{s}.cube{c}.sram"
+                node_id = f"sip{s}.cube{d}.sram"
            elif addr.unit_type == UnitType.MCPU:
-                node_id = f"sip{s}.cube{c}.m_cpu"
+                node_id = f"sip{s}.cube{d}.m_cpu"
            else:
                raise RoutingError(f"unsupported unit_type: {addr.unit_type}")
        else:
@@ -385,7 +385,7 @@ class RuntimeContext:
            for cube_id in range(cubes_per_sip):
                for pe_id in range(pes_per_cube):
                    self._allocators[(sip_id, cube_id, pe_id)] = PEMemAllocator(
-                        rack_id=0, sip_id=sip_id, cube_id=cube_id, pe_id=pe_id, cfg=cfg,
+                        sip_id=sip_id, die_id=cube_id, pe_id=pe_id, cfg=cfg,
                    )
        # Initialize VA allocator (MMU mappings are installed via fabric MmuMapMsg)
@@ -113,7 +113,18 @@ class AhbmCCLBackend:
            )
        n_elem = shards[0].nbytes // tensor.itemsize
        kernel_fn = self._algo_module.kernel
-        kernel_args = self._algo_module.kernel_args(self._world_size, n_elem)
+        # Derive effective cube dims from tensor's actual shard placement
        # (may differ from topology mesh when TP uses fewer cubes).
        sip0_cubes = sorted({s.cube for s in shards if s.sip == shards[0].sip})
        eff_n_cubes = len(sip0_cubes) if sip0_cubes else 1
        if eff_n_cubes == 1:
            eff_cube_w, eff_cube_h = 1, 1
        else:
            eff_cube_w, eff_cube_h = self._cube_w, self._cube_h
        kernel_args = self._algo_module.kernel_args(
            self._world_size, n_elem,
            cube_w=eff_cube_w, cube_h=eff_cube_h,
        )
        # Resolve sip_rank from the current greenlet's bound rank
        from greenlet import getcurrent as _gc
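Review note: the effective-dims logic in this hunk can be read in isolation as follows. This is a minimal sketch, not the backend code: `Shard` is a hypothetical stand-in that carries only the `sip` and `cube` attributes the hunk reads, and `effective_cube_dims` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """Hypothetical stand-in for a tensor shard (SIP and cube index only)."""
    sip: int
    cube: int


def effective_cube_dims(
    shards: list[Shard], cube_w: int, cube_h: int,
) -> tuple[int, int]:
    """Collapse to a 1x1 mesh when the first shard's SIP uses a single cube."""
    first_sip = shards[0].sip
    cubes = sorted({s.cube for s in shards if s.sip == first_sip})
    n_cubes = len(cubes) if cubes else 1
    # Only the single-cube case is special-cased; otherwise fall back to
    # the topology mesh dimensions, as the hunk does.
    return (1, 1) if n_cubes == 1 else (cube_w, cube_h)
```

Note that any placement using more than one cube falls back to the full mesh dimensions, even if it uses fewer cubes than the mesh contains.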
@@ -90,6 +90,11 @@ class KernelLaunchMsg:
    args: tuple[KernelArg, ...]
    target_cubes: tuple[int, ...] | Literal["all"] = "all"
    target_pe: int | tuple[int, ...] | Literal["all"] = "all"
    # ADR-0009 D5: synchronized kernel start. When set, each PE_CPU yields
    # until env.now >= target_start_ns before beginning kernel execution,
    # so every PE in a launch starts at the same simulated time regardless
    # of its M_CPU dispatch path length. Stamped by M_CPU fan-out.
    target_start_ns: float | None = None
    msg_type: Literal["kernel_launch"] = "kernel_launch"
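Review note: the synchronized-start rule described in the `target_start_ns` comment reduces to a small delay computation. The sketch below uses plain floats instead of a simulation environment; `start_delay_ns` is a hypothetical helper and the arrival times are made-up inputs, not simulator output.

```python
from typing import Optional


def start_delay_ns(now_ns: float, target_start_ns: Optional[float]) -> float:
    """How long a PE should wait before starting its kernel."""
    if target_start_ns is None:
        return 0.0                      # no synchronized start requested
    return max(0.0, target_start_ns - now_ns)


# Three PEs receive the launch at different times (illustrative values)
# because their dispatch paths differ in length.
arrivals = [120.0, 180.0, 150.0]
start_times = [t + start_delay_ns(t, 200.0) for t in arrivals]
```

With a target at or after the latest arrival, every PE starts at the same simulated time; a PE that arrives after the target starts immediately.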
@@ -67,6 +67,10 @@ class GraphEngine:
            spec=graph.spec,
            memory_store=self._memory_store,
            op_logger=self._op_logger,
            node_overhead_ns={
                nid: float(n.attrs.get("overhead_ns", 0.0))
                for nid, n in graph.nodes.items()
            },
        )
        self._components: dict[str, ComponentBase] = {
            node_id: ComponentRegistry.create(node, overrides, ctx)
@@ -212,7 +212,7 @@ def _generate_probe_h2d(graph, edge_map) -> list[dict]:
    t_offset = 0.0
    for rid, (name, cube, hops) in enumerate(cases):
        pa = PhysAddr.pe_hbm_addr(
-            rack_id=0, sip_id=0, cube_id=cube, pe_id=0,
+            sip_id=0, die_id=cube, pe_id=0,
            pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
        )
        dst_node = resolver.resolve(pa)
@@ -256,7 +256,7 @@ def _generate_probe_d2h(graph, edge_map) -> list[dict]:
    t_offset = 0.0
    for rid, (name, cube, hops) in enumerate(cases):
        pa = PhysAddr.pe_hbm_addr(
-            rack_id=0, sip_id=0, cube_id=cube, pe_id=0,
+            sip_id=0, die_id=cube, pe_id=0,
            pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
        )
        dst_node = resolver.resolve(pa)
@@ -310,7 +310,7 @@ def _generate_probe_pe_dma(graph, edge_map) -> list[dict]:
    t_offset = 0.0
    for rid, (name, sip, src_cube, src_pe, dst_cube, dst_pe) in enumerate(cases):
        pa = PhysAddr.pe_hbm_addr(
-            rack_id=0, sip_id=sip, cube_id=dst_cube, pe_id=dst_pe,
+            sip_id=sip, die_id=dst_cube, pe_id=dst_pe,
            pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
        )
        dst_node = resolver.resolve(pa)
@@ -492,6 +492,48 @@ class TLContext:
            )
        return self._make_handle(addr=0, shape=shape, dtype=dtype)
    def recv_no_consume(
        self,
        dir: str | None = None,
        shape: tuple[int, ...] = (),
        dtype: str = "f16",
    ) -> TensorHandle:
        """DIAGNOSTIC ONLY — recv that blocks for arrival but skips slot read.
        Same blocking semantics as ``tl.recv``: the kernel waits until
        the payload has landed in the IPCQ slot. Differs from ``tl.recv``
        by skipping the slot-read latency charge (slot-IO + PE↔bank
        fabric drain) on DST.
        This entry point exists solely so the pe2pe overview plot can
        draw an apples-to-apples comparison against ``tl.store`` (a
        one-sided fabric write that pays no read on DST). Production
        kernels MUST use ``tl.recv`` — they need to consume the data
        they receive. This API is segregated from ``tl.recv`` so the
        diagnostic flag can never accidentally be set in real workloads.
        """
        self._emit_dispatch_overhead()
        cmd = IpcqRecvCmd(
            direction=dir,
            shape=shape, dtype=dtype,
            handle_id=self._next_handle_id(),
            consume=False,
        )
        result = self._emit(cmd)  # type: ignore[arg-type]
        if isinstance(result, dict):
            slot_addr = int(result.get("src_addr", 0))
            slot_space = str(result.get("src_space", "tcm"))
            return TensorHandle(
                id=self._next_handle_id(),
                addr=slot_addr,
                shape=shape,
                dtype=dtype,
                nbytes=self._nbytes(shape, dtype),
                data=None,
                space=slot_space,
            )
        return self._make_handle(addr=0, shape=shape, dtype=dtype)
    def recv_async(
        self,
        dir: str,
@@ -7,11 +7,49 @@ stateful/SimPy-event-consuming and MUST NOT be shared).
 """
 from __future__ import annotations
 import os
 import pytest
 from kernbench.topology.builder import resolve_topology
 def pytest_sessionfinish(session, exitstatus):
    """Aggregate parametrized sweep rows into combined CSV + PNG plots.
    Runs on the controller node only (xdist worker processes set
    ``PYTEST_XDIST_WORKER``; we skip those). Each aggregator returns
    early when it has no input, e.g. when the sweep was filtered out
    (the latency-sweep aggregator first tries to replot from an
    existing ``summary.csv``).
    """
    if os.environ.get("PYTEST_XDIST_WORKER"):
        return
    import importlib.util
    import sys
    from pathlib import Path
    def _exec(name: str, attr: str) -> None:
        mod_path = Path(__file__).parent / name
        if not mod_path.exists():
            return
        s = importlib.util.spec_from_file_location(
            f"_{name.removesuffix('.py')}_for_aggregate", mod_path,
        )
        if s is None or s.loader is None:
            return
        mod = importlib.util.module_from_spec(s)
        sys.modules[s.name] = mod
        try:
            s.loader.exec_module(mod)
            fn = getattr(mod, attr, None)
            if fn is not None:
                fn()
        except Exception as e:
            print(f"[conftest] aggregator {attr}() in {name} failed: {e}")
    _exec("test_allreduce_multidevice.py", "_aggregate_sweep_plots")
    _exec("test_allreduce_buffer_kind_sweep.py", "aggregate_buffer_kind_plot")
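Review note: `_exec` loads a sibling test file by path and calls an optional function in it. The sketch below shows the same `importlib` technique in a self-contained form; `call_optional` and the temporary `aggregator_demo.py` module are hypothetical names used only for this example, and unlike `_exec` it lets exceptions propagate.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def call_optional(mod_path: Path, attr: str):
    """Load `mod_path` as a module and call `attr()` if it is defined."""
    if not mod_path.exists():
        return None
    spec = importlib.util.spec_from_file_location(
        f"_{mod_path.stem}_demo", mod_path,
    )
    if spec is None or spec.loader is None:
        return None
    mod = importlib.util.module_from_spec(spec)
    sys.modules[spec.name] = mod        # register before executing
    spec.loader.exec_module(mod)
    fn = getattr(mod, attr, None)
    return fn() if fn is not None else None


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "aggregator_demo.py"
    path.write_text("def aggregate():\n    return 'aggregated'\n",
                    encoding="utf-8")
    result = call_optional(path, "aggregate")
    missing = call_optional(path, "no_such_function")
```

Loading by file path avoids importing the test modules under their package names, which matters when the hook runs after collection has already finished.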
@pytest.fixture(scope="session")
 def topology():
    """Session-scoped parsed topology (immutable graph + spec).
@@ -149,7 +149,7 @@ def _make_tuple_allocators(
 ) -> dict[tuple[int, int, int], PEMemAllocator]:
    return {
        (s, c, p): PEMemAllocator(
-            rack_id=0, sip_id=s, cube_id=c, pe_id=p, cfg=_CFG,
+            sip_id=s, die_id=c, pe_id=p, cfg=_CFG,
        )
        for s in range(num_sips)
        for c in range(num_cubes)
@@ -0,0 +1,196 @@
 """Phase 1 buffer-kind allreduce sweep — torus_2d 6 SIPs.
 Parametrized over (buffer_kind, n_elem). Each case runs the standard
 config-driven allreduce app and writes a JSON row to a shared staging
 dir; the conftest sessionfinish hook (added in Phase 1) aggregates
 rows into ``docs/diagrams/allreduce_latency_plots/buffer_kind_sweep.png``.
 Pre-Phase-2: the three buffer-kind lines overlap exactly because slot
 access is latency-free today. Post-Phase-2 they are expected to spread
 out (tcm fastest, hbm slowest).
 """
 from __future__ import annotations
 import json
 from pathlib import Path
 import pytest
 import yaml
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 # Reuse the allreduce app helpers.
 from tests.test_allreduce_multidevice import (
    _write_temp_configs,
    run_allreduce,
 )
 _BUFFER_KINDS = ["tcm", "sram", "hbm"]
 _N_ELEM_GRID = [128, 1024, 8192, 32768]   # 256 B → 64 KB per slot
 _ELEM_BYTES_F16 = 2
 _OUT_DIR = (Path(__file__).parent.parent / "docs" / "diagrams"
            / "allreduce_latency_plots")
 _ROWS_DIR = _OUT_DIR / "_buffer_kind_rows"
 def _bk_params():
    out = []
    for bk in _BUFFER_KINDS:
        for n_elem in _N_ELEM_GRID:
            out.append(pytest.param(bk, n_elem, id=f"{bk}-n_elem{n_elem}"))
    return out
@pytest.mark.parametrize("buffer_kind,n_elem", _bk_params())
 def test_buffer_kind_allreduce_one(tmp_path, buffer_kind, n_elem):
    """One config of the buffer-kind sweep. xdist parallelizes across params."""
    sub = tmp_path / f"{buffer_kind}_{n_elem}"
    sub.mkdir()
    topo_path, ccl_path = _write_temp_configs(
        sub,
        sip_topology="torus_2d",
        n_sips=6,
        algorithm="intercube_allreduce",
        sip_w=3, sip_h=2,
        n_elem_override=n_elem,
    )
    # Override buffer_kind in the temp ccl.yaml.
    with open(ccl_path) as f:
        ccl_cfg = yaml.safe_load(f)
    ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
    ccl_cfg.setdefault("algorithms", {}).setdefault(
        "intercube_allreduce", {},
    )["buffer_kind"] = buffer_kind
    with open(ccl_path, "w") as f:
        yaml.dump(ccl_cfg, f, default_flow_style=False)
    topo = resolve_topology(topo_path)
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    spec = topo.topology_obj.spec
    with RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id=f"bk_sweep_{buffer_kind}_{n_elem}",
        spec=spec,
    ) as ctx:
        result = run_allreduce(
            ctx, engine, spec,
            algorithm="intercube_allreduce", ccl_yaml=ccl_path,
        )
        assert result["ok_cubes"] > 0
    pe_exec_vals = [
        float(tr.get("pe_exec_ns", 0.0) or 0.0)
        for _, (_, tr) in engine._results.items()
        if isinstance(tr, dict)
    ]
    crit_ns = max(pe_exec_vals) if pe_exec_vals else 0.0
    bytes_per_pe = n_elem * _ELEM_BYTES_F16
    record = {
        "buffer_kind": buffer_kind,
        "sip_topology": "torus_2d",
        "n_sips": 6,
        "n_elem": n_elem,
        "bytes_per_pe": bytes_per_pe,
        "latency_ns": crit_ns,
    }
    _ROWS_DIR.mkdir(parents=True, exist_ok=True)
    row_path = _ROWS_DIR / f"{buffer_kind}_{n_elem}.json"
    with open(row_path, "w", encoding="utf-8") as f:
        json.dump(record, f)
 def aggregate_buffer_kind_plot() -> bool:
    """Read per-config rows and emit buffer_kind_sweep.png + CSV.
    Called from conftest.pytest_sessionfinish (controller-only).
    Returns True if rows were aggregated.
    """
    import csv
    if not _ROWS_DIR.exists():
        return False
    row_files = sorted(_ROWS_DIR.glob("*.json"))
    if not row_files:
        return False
    records = []
    for p in row_files:
        with open(p, encoding="utf-8") as f:
            records.append(json.load(f))
    import matplotlib.pyplot as plt
    from matplotlib.ticker import FuncFormatter
    def _fmt_bytes(x, _pos):
        if x <= 0:
            return "0"
        if x >= 1024 * 1024:
            return f"{x / (1024 * 1024):.0f} MB"
        if x >= 1024:
            return f"{x / 1024:.0f} KB"
        return f"{x:.0f} B"
    _bytes_fmt = FuncFormatter(_fmt_bytes)
    _OUT_DIR.mkdir(parents=True, exist_ok=True)
    with open(_OUT_DIR / "buffer_kind_sweep.csv", "w",
              newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=[
            "buffer_kind", "sip_topology", "n_sips", "n_elem",
            "bytes_per_pe", "latency_ns",
        ])
        w.writeheader()
        for r in sorted(records, key=lambda r: (
            r["buffer_kind"], r["bytes_per_pe"],
        )):
            w.writerow(r)
    colors = {"tcm": "tab:blue", "sram": "tab:orange", "hbm": "tab:red"}
    fig, ax = plt.subplots(figsize=(10, 6))
    for bk in _BUFFER_KINDS:
        rs = sorted(
            [r for r in records if r["buffer_kind"] == bk],
            key=lambda r: r["bytes_per_pe"],
        )
        if not rs:
            continue
        ax.plot(
            [r["bytes_per_pe"] for r in rs],
            [r["latency_ns"] for r in rs],
            marker="o", lw=2.0,
            color=colors[bk], label=f"buffer_kind = {bk}",
        )
    ax.set_xscale("log", base=2)
    ax.set_xlabel("Bytes per PE (log scale)")
    ax.set_ylabel("Time (ns)")
    ax.set_title(
        "Allreduce torus_2d (6 SIPs, 3×2) — IPCQ slot memory tier"
    )
    ax.grid(True, alpha=0.3)
    ax.legend()
    ax.xaxis.set_major_formatter(_bytes_fmt)
    fig.tight_layout()
    fig.savefig(_OUT_DIR / "buffer_kind_sweep.png", dpi=130)
    plt.close(fig)
    for p in row_files:
        try:
            p.unlink()
        except OSError:
            pass
    try:
        _ROWS_DIR.rmdir()
    except OSError:
        pass
    print(f"\nWrote {_OUT_DIR / 'buffer_kind_sweep.png'} "
          f"from {len(records)} rows")
    return True
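Review note: the sweep relies on a stage-then-aggregate pattern: each parametrized case writes one JSON row, and a single controller-side pass reads, sorts and deletes them. The sketch below isolates that pattern under stated assumptions: `write_row` and `collect_rows` are hypothetical helpers, plotting is omitted, and the latency values are placeholders rather than simulator output.

```python
import json
import tempfile
from pathlib import Path


def write_row(rows_dir: Path, record: dict) -> Path:
    """Stage one result row; the file name is unique per (kind, n_elem)."""
    rows_dir.mkdir(parents=True, exist_ok=True)
    path = rows_dir / f"{record['buffer_kind']}_{record['n_elem']}.json"
    path.write_text(json.dumps(record), encoding="utf-8")
    return path


def collect_rows(rows_dir: Path) -> list[dict]:
    """Read all staged rows, delete them, and return them sorted."""
    if not rows_dir.exists():
        return []
    files = sorted(rows_dir.glob("*.json"))
    records = [json.loads(p.read_text(encoding="utf-8")) for p in files]
    for p in files:
        p.unlink()                      # avoid stale rows on the next run
    return sorted(records, key=lambda r: (r["buffer_kind"], r["bytes_per_pe"]))


with tempfile.TemporaryDirectory() as tmp:
    rows_dir = Path(tmp) / "_buffer_kind_rows"
    for kind, n_elem in [("tcm", 1024), ("hbm", 128)]:
        write_row(rows_dir, {
            "buffer_kind": kind, "n_elem": n_elem,
            "bytes_per_pe": n_elem * 2,
            "latency_ns": 0.0,          # placeholder, not a measured value
        })
    rows = collect_rows(rows_dir)
    leftover = list(rows_dir.glob("*.json"))
```

One file per case means xdist workers never write to the same path, so no locking is needed.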
@@ -22,13 +22,23 @@ from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
 from kernbench.policy.placement.dp import DPPolicy
-def _sip_topo_dims(sip_topo: str, n_sips: int) -> tuple[int, int]:
+def _sip_topo_dims(
    sip_topo: str, n_sips: int,
    spec_w: int | None = None, spec_h: int | None = None,
 ) -> tuple[int, int]:
    if sip_topo == "ring_1d":
        return (0, 0)
    if spec_w is not None and spec_h is not None:
        if spec_w * spec_h != n_sips:
            raise ValueError(
                f"sip layout {spec_w}x{spec_h} != n_sips ({n_sips})"
            )
        return (spec_w, spec_h)
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError(
-            f"SIP topology '{sip_topo}' requires square n_sips, got {n_sips}"
+            f"SIP topology '{sip_topo}' requires square n_sips or "
            f"explicit w/h in spec, got {n_sips}"
        )
    return (side, side)
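Review note: the behavior of the post-change `_sip_topo_dims` is easiest to see with a few calls. The block below is a self-contained copy of the logic from this hunk, renamed `sip_topo_dims` and with shortened error messages, so that it can run outside the test module.

```python
import math


def sip_topo_dims(sip_topo, n_sips, spec_w=None, spec_h=None):
    """Copy of the post-change logic, for illustration."""
    if sip_topo == "ring_1d":
        return (0, 0)                   # rings carry no 2-D layout
    if spec_w is not None and spec_h is not None:
        if spec_w * spec_h != n_sips:
            raise ValueError(f"sip layout {spec_w}x{spec_h} != n_sips ({n_sips})")
        return (spec_w, spec_h)
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError(f"'{sip_topo}' requires square n_sips or explicit w/h")
    return (side, side)


ring = sip_topo_dims("ring_1d", 6)              # explicit w/h not needed
explicit = sip_topo_dims("torus_2d", 6, 2, 3)   # non-square via explicit w/h
square = sip_topo_dims("torus_2d", 4)           # square fallback still works
```

A non-square count without explicit `w`/`h`, or a layout whose product differs from `n_sips`, raises `ValueError`.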
@@ -54,10 +64,13 @@ def run_allreduce(
    topo_name_to_kind = algo_module.TOPO_NAME_TO_KIND
    n_elem = int(cfg.get("n_elem", 8))
-    n_sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
-    sip_topo = str(
-        spec.get("system", {}).get("sips", {}).get("topology", "ring_1d")
-    )
+    sips_cfg = spec.get("system", {}).get("sips", {})
+    n_sips = int(sips_cfg.get("count", 1))
+    sip_topo = str(sips_cfg.get("topology", "ring_1d"))
+    spec_sip_w = sips_cfg.get("w")
+    spec_sip_h = sips_cfg.get("h")
+    spec_sip_w = int(spec_sip_w) if spec_sip_w is not None else None
+    spec_sip_h = int(spec_sip_h) if spec_sip_h is not None else None
    cm = spec["sip"]["cube_mesh"]
    cube_w = int(cm["w"])
@@ -65,7 +78,9 @@ def run_allreduce(
    n_cubes = cube_w * cube_h
    sip_topo_kind = topo_name_to_kind.get(sip_topo, 0)
-    sip_topo_w, sip_topo_h = _sip_topo_dims(sip_topo, n_sips)
+    sip_topo_w, sip_topo_h = _sip_topo_dims(
        sip_topo, n_sips, spec_w=spec_sip_w, spec_h=spec_sip_h,
    )
    algo_name = cfg.get("algorithm", "allreduce")
    print(f"\n{'=' * 60}")
@@ -173,18 +188,36 @@ from kernbench.topology.builder import resolve_topology
 TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
 CONFIGS = [
-    pytest.param("intercube_allreduce", "ring_1d", 2, id="ring_2sip"),
-    pytest.param("intercube_allreduce", "torus_2d", 4, id="torus_4sip"),
-    pytest.param("intercube_allreduce", "mesh_2d_no_wrap", 4, id="mesh_4sip"),
+    pytest.param(
+        "intercube_allreduce", "ring_1d", 6, None, None,
+        id="ring_6sip",
+    ),
+    pytest.param(
+        "intercube_allreduce", "torus_2d", 6, 2, 3,
+        id="torus_6sip_2x3",
+    ),
+    pytest.param(
+        "intercube_allreduce", "mesh_2d_no_wrap", 6, 2, 3,
+        id="mesh_6sip_2x3",
+    ),
 ]
-def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
+def _write_temp_configs(
    tmp_path, sip_topology, n_sips, algorithm, n_elem_override=None,
    sip_w=None, sip_h=None,
 ):
    """Write temp topology.yaml and ccl.yaml with the given overrides."""
    with open(TOPOLOGY_PATH) as f:
        topo_cfg = yaml.safe_load(f)
    topo_cfg["system"]["sips"]["count"] = n_sips
    topo_cfg["system"]["sips"]["topology"] = sip_topology
    if sip_w is not None and sip_h is not None:
        topo_cfg["system"]["sips"]["w"] = int(sip_w)
        topo_cfg["system"]["sips"]["h"] = int(sip_h)
    else:
        topo_cfg["system"]["sips"].pop("w", None)
        topo_cfg["system"]["sips"].pop("h", None)
    topo_path = tmp_path / "topology.yaml"
    with open(topo_path, "w") as f:
        yaml.dump(topo_cfg, f, default_flow_style=False)
@@ -193,6 +226,15 @@ def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
    with open(ccl_path) as f:
        ccl_cfg = yaml.safe_load(f)
    ccl_cfg["defaults"]["algorithm"] = algorithm
    if n_elem_override is not None:
        ccl_cfg.setdefault("algorithms", {}).setdefault(
            algorithm, {},
        )["n_elem"] = int(n_elem_override)
        # Ensure IPCQ slot is big enough for the per-message payload.
        per_msg_bytes = int(n_elem_override) * 2  # f16
        default_slot = int(ccl_cfg["defaults"].get("slot_size", 4096))
        if per_msg_bytes > default_slot:
            ccl_cfg["defaults"]["slot_size"] = per_msg_bytes
    tmp_ccl = tmp_path / "ccl.yaml"
    with open(tmp_ccl, "w") as f:
        yaml.dump(ccl_cfg, f, default_flow_style=False)
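Review note: the slot-size bump in this hunk only ever grows the slot. A minimal sketch of that rule, assuming f16 elements (2 bytes) and the 4096-byte default used above; `ensure_slot_size` is a hypothetical helper that returns a copy instead of mutating the config in place:

```python
def ensure_slot_size(defaults: dict, n_elem: int, elem_bytes: int = 2) -> dict:
    """Return a copy of `defaults` whose slot_size fits one message."""
    per_msg_bytes = n_elem * elem_bytes
    out = dict(defaults)
    if per_msg_bytes > int(out.get("slot_size", 4096)):
        out["slot_size"] = per_msg_bytes    # grow, never shrink
    return out


small = ensure_slot_size({"slot_size": 4096}, n_elem=1024)    # 2 KB payload
large = ensure_slot_size({"slot_size": 4096}, n_elem=49152)   # 96 KB payload
```

Payloads that already fit leave the configured slot size untouched, so small sweep points keep the default.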
@@ -200,10 +242,15 @@ def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
    return str(topo_path), str(tmp_ccl)
-@pytest.mark.parametrize("algorithm,sip_topology,n_sips", CONFIGS)
-def test_allreduce(tmp_path, algorithm, sip_topology, n_sips):
+@pytest.mark.parametrize(
+    "algorithm,sip_topology,n_sips,sip_w,sip_h", CONFIGS,
+)
+def test_allreduce(
+    tmp_path, algorithm, sip_topology, n_sips, sip_w, sip_h,
+):
    topo_path, ccl_path = _write_temp_configs(
        tmp_path, sip_topology, n_sips, algorithm,
        sip_w=sip_w, sip_h=sip_h,
    )
    topo = resolve_topology(topo_path)
    engine = GraphEngine(topo.topology_obj, enable_data=True)
@@ -220,3 +267,570 @@ def test_allreduce(tmp_path, algorithm, sip_topology, n_sips):
            algorithm=algorithm, ccl_yaml=ccl_path,
        )
        assert result["ok_cubes"] > 0
 # ── Latency sweep (parametrized + xdist-friendly) ─────────────────────
 # n_elem grid. Skips 16 (== n_cubes, dim_map collision) and goes up to
 # 96 KB per PE: bytes_per_pe = n_elem * 2 (f16), so 49152 elem * 2 = 96 KB.
 _SWEEP_N_ELEM = [
    8, 32, 64, 128, 512, 1024, 2048,
    4096, 8192, 16384, 32768, 49152,
 ]
 _ELEM_BYTES_F16 = 2
 _SWEEP_TOPOLOGIES = [
    ("intercube_allreduce", "ring_1d", 6, None, None),
    ("intercube_allreduce", "torus_2d", 6, 2, 3),
    ("intercube_allreduce", "mesh_2d_no_wrap", 6, 2, 3),
 ]
 # Shared on-disk staging dir for parametrized sweep rows. Each
 # parametrized invocation writes one JSON file here; the aggregator
 # (run from conftest.pytest_sessionfinish) reads them and emits the
 # combined CSV + PNG plots.
 _SWEEP_OUT_DIR = (Path(__file__).parent.parent / "docs" / "diagrams"
                  / "allreduce_latency_plots")
 _SWEEP_ROWS_DIR = _SWEEP_OUT_DIR / "_rows"
 def _sweep_params():
    out = []
    for algorithm, sip_topology, n_sips, sip_w, sip_h in _SWEEP_TOPOLOGIES:
        for n_elem in _SWEEP_N_ELEM:
            out.append(pytest.param(
                algorithm, sip_topology, n_sips, sip_w, sip_h, n_elem,
                id=f"{sip_topology}-n_elem{n_elem}",
            ))
    return out
@pytest.mark.parametrize(
    "algorithm,sip_topology,n_sips,sip_w,sip_h,n_elem", _sweep_params(),
 )
 def test_allreduce_latency_one(
    tmp_path, algorithm, sip_topology, n_sips, sip_w, sip_h, n_elem,
 ):
    """One config of the latency sweep. xdist parallelizes across params.
    Writes a single JSON row to ``_SWEEP_ROWS_DIR``. The conftest
    sessionfinish hook aggregates rows into CSV + plots after all
    parametrized cases finish.
    """
    import json
    topo_path, ccl_path = _write_temp_configs(
        tmp_path, sip_topology, n_sips, algorithm,
        sip_w=sip_w, sip_h=sip_h,
        n_elem_override=n_elem,
    )
    topo = resolve_topology(topo_path)
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    spec = topo.topology_obj.spec
    with RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id=f"sweep_{algorithm}_{sip_topology}_{n_elem}",
        spec=spec,
    ) as ctx:
        result = run_allreduce(
            ctx, engine, spec,
            algorithm=algorithm, ccl_yaml=ccl_path,
        )
        assert result["ok_cubes"] > 0
    pe_exec_vals = [
        float(tr.get("pe_exec_ns", 0.0) or 0.0)
        for _, (_, tr) in engine._results.items()
        if isinstance(tr, dict)
    ]
    crit_ns = max(pe_exec_vals) if pe_exec_vals else 0.0
    cm = spec["sip"]["cube_mesh"]
    n_cubes = int(cm["w"]) * int(cm["h"])
    bytes_per_sip = n_cubes * n_elem * _ELEM_BYTES_F16
    bytes_per_pe = n_elem * _ELEM_BYTES_F16
    record = {
        "algorithm": algorithm,
        "sip_topology": sip_topology,
        "n_sips": n_sips,
        "n_elem": n_elem,
        "bytes_per_pe": bytes_per_pe,
        "bytes_per_sip": bytes_per_sip,
        "latency_ns": crit_ns,
    }
    _SWEEP_ROWS_DIR.mkdir(parents=True, exist_ok=True)
    row_path = _SWEEP_ROWS_DIR / f"{sip_topology}_{n_elem}.json"
    with open(row_path, "w", encoding="utf-8") as f:
        json.dump(record, f)
 def _aggregate_sweep_plots() -> bool:
    """Read all per-config rows and emit CSV + PNG plots.
    Called by ``conftest.pytest_sessionfinish`` (controller node only).
    Returns True if any rows were aggregated, False otherwise.
    """
    import csv
    import json
    row_files = sorted(_SWEEP_ROWS_DIR.glob("*.json")) \
        if _SWEEP_ROWS_DIR.exists() else []
    records: list[dict] = []
    if row_files:
        for p in row_files:
            with open(p, encoding="utf-8") as f:
                records.append(json.load(f))
    else:
        # Fallback: replot from existing summary.csv (skip sweep re-run).
        summary_path = _SWEEP_OUT_DIR / "summary.csv"
        if not summary_path.exists():
            return False
        with open(summary_path, encoding="utf-8") as f:
            for row in csv.DictReader(f):
                records.append({
                    "algorithm": row["algorithm"],
                    "sip_topology": row["sip_topology"],
                    "n_sips": int(row["n_sips"]),
                    "n_elem": int(row["n_elem"]),
                    "bytes_per_pe": int(row["bytes_per_pe"]),
                    "bytes_per_sip": int(row["bytes_per_sip"]),
                    "latency_ns": float(row["latency_ns"]),
                })
    if not records:
        return False
    import matplotlib.pyplot as plt
    from matplotlib.ticker import FuncFormatter
    def _fmt_bytes(x, _pos):
        if x <= 0:
            return "0"
        if x >= 1024 * 1024:
            return f"{x / (1024 * 1024):.0f} MB"
        if x >= 1024:
            return f"{x / 1024:.0f} KB"
        return f"{x:.0f} B"
    _bytes_fmt = FuncFormatter(_fmt_bytes)
    _SWEEP_OUT_DIR.mkdir(parents=True, exist_ok=True)
    with open(_SWEEP_OUT_DIR / "summary.csv", "w",
              newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=[
            "algorithm", "sip_topology", "n_sips", "n_elem",
            "bytes_per_pe", "bytes_per_sip", "latency_ns",
        ])
        w.writeheader()
        for r in sorted(records, key=lambda r: (
            r["sip_topology"], r["bytes_per_pe"],
        )):
            w.writerow(r)
    topologies = sorted({r["sip_topology"] for r in records})
    for topo_name in topologies:
        rs = sorted(
            [r for r in records if r["sip_topology"] == topo_name],
            key=lambda r: r["bytes_per_pe"],
        )
        if not rs:
            continue
        xs = [r["bytes_per_pe"] for r in rs]
        ys = [r["latency_ns"] for r in rs]
        title = (
            f"Allreduce latency — {topo_name} "
            f"(n_sips={rs[0]['n_sips']})"
        )
        fig, ax = plt.subplots(figsize=(8, 5))
        ax.plot(xs, ys, marker="o", color="tab:blue")
        ax.set_xscale("log", base=2)
        ax.set_xlabel("Bytes per PE (log scale)")
        ax.set_ylabel("Time (ns)")
        ax.set_title(title)
        ax.grid(True, alpha=0.3)
        ax.xaxis.set_major_formatter(_bytes_fmt)
        fig.tight_layout()
        fig.savefig(_SWEEP_OUT_DIR / f"{topo_name}.png", dpi=120)
        plt.close(fig)
    colors = {"ring_1d": "tab:blue", "torus_2d": "tab:orange",
              "mesh_2d_no_wrap": "tab:green"}
    # ── Hand-derived theoretical model for torus_2d (6 SIPs) ──
    # Critical-path analysis (per packet, packet = 128 B at NoC):
    #   local intra-SIP reduce + broadcast = 8 hops × 57 ns = 456 ns
    #   global X-direction reduce          = 5 UCIe + 1 UAL = 445 ns
    #   global Y-direction reduce          = 5 UCIe + 1 UAL = 445 ns
    #   per-packet startup latency         = 456 + 445 + 445 = 1346 ns
    # Packet count is PER CUBE (8 PEs/cube cooperate on the cube tile).
    # At 6144 packets/cube the pipelined total is 8741 ns, so the
    # bottleneck-stage interval τ = (8741 − 1346) / (6144 − 1) ≈ 1.204 ns.
    # T_theoretical(N) = 1346 + (N − 1) × τ
    #   where N = ceil((bytes_per_pe × 8) / 128) = ceil(bytes_per_pe / 16)
    NOC_PACKET_BYTES = 128
    PES_PER_CUBE = 8
    T_STARTUP_NS = 1346.0
    TAU_NS = (8741.0 - 1346.0) / (6144 - 1)  # ≈ 1.2038 ns/packet
    def _theoretical_torus_2d_ns(bytes_per_pe: int) -> float:
        bytes_per_cube = int(bytes_per_pe) * PES_PER_CUBE
        n_packets = max(1, -(-bytes_per_cube // NOC_PACKET_BYTES))  # ceil
        return T_STARTUP_NS + (n_packets - 1) * TAU_NS
    fig, ax = plt.subplots(figsize=(9, 6))
    for topo_name in topologies:
        rs = sorted(
            [r for r in records if r["sip_topology"] == topo_name],
            key=lambda r: r["bytes_per_pe"],
        )
        if not rs:
            continue
        ax.plot(
            [r["bytes_per_pe"] for r in rs],
            [r["latency_ns"] for r in rs],
            marker="o",
            label=f"{topo_name} (n_sips={rs[0]['n_sips']})",
            color=colors.get(topo_name),
        )
    # Theoretical torus_2d curve across all payload sizes.
    torus_rs = sorted(
        [r for r in records if r["sip_topology"] == "torus_2d"],
        key=lambda r: r["bytes_per_pe"],
    )
    if torus_rs:
        xs_th = [r["bytes_per_pe"] for r in torus_rs]
        ys_th = [_theoretical_torus_2d_ns(r["bytes_per_pe"]) for r in torus_rs]
        ax.plot(
            xs_th, ys_th,
            color="tab:red", linestyle="--", linewidth=1.6, marker="x",
            label="theoretical torus_2d (6 SIPs)",
        )
    ax.set_xscale("log", base=2)
    ax.set_xlabel("Bytes per PE (log scale)")
    ax.set_ylabel("Time (ns)")
    ax.set_title("Multi-device allreduce latency by topology")
    ax.grid(True, alpha=0.3)
    ax.set_xlim(left=min(r["bytes_per_pe"] for r in records) / 2,
                right=max(r["bytes_per_pe"] for r in records) * 1.5)
    ax.legend()
    ax.xaxis.set_major_formatter(_bytes_fmt)
    fig.tight_layout()
    fig.savefig(_SWEEP_OUT_DIR / "overview.png", dpi=120)
    plt.close(fig)
    # Clean up the row staging dir so a partial future run doesn't pick up
    # stale rows.
    for p in row_files:
        try:
            p.unlink()
        except OSError:
            pass
    try:
        _SWEEP_ROWS_DIR.rmdir()
    except OSError:
        pass
    print(f"\nWrote {_SWEEP_OUT_DIR / 'overview.png'} "
          f"from {len(records)} rows")
    return True
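Review note: the hand-derived torus_2d model embedded in `_aggregate_sweep_plots` can be checked on its own. The sketch below restates the constants and formula from the comment block (startup = 456 + 445 + 445 ns, τ calibrated so that 6144 packets per cube total 8741 ns) and evaluates it at three sweep points; `theoretical_torus_2d_ns` is a standalone copy for illustration, and the evaluation loop is not part of the test module.

```python
NOC_PACKET_BYTES = 128
PES_PER_CUBE = 8
T_STARTUP_NS = 456.0 + 445.0 + 445.0                 # 1346 ns
TAU_NS = (8741.0 - T_STARTUP_NS) / (6144 - 1)        # per-packet interval


def theoretical_torus_2d_ns(bytes_per_pe: int) -> float:
    """T(N) = startup + (N - 1) * tau, with N packets per cube."""
    bytes_per_cube = bytes_per_pe * PES_PER_CUBE
    n_packets = max(1, -(-bytes_per_cube // NOC_PACKET_BYTES))  # ceil division
    return T_STARTUP_NS + (n_packets - 1) * TAU_NS


model = {}
for n_elem in (8, 1024, 49152):
    bytes_per_pe = n_elem * 2                        # f16
    model[bytes_per_pe] = theoretical_torus_2d_ns(bytes_per_pe)
```

At 16 B per PE the cube sends a single packet, so the model returns the bare startup latency; at 96 KB per PE it reproduces the 8741 ns calibration point by construction.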
 # ── Topology diagram (device-level + cube-level reduction) ────────────
 # Convention: "rows × cols" everywhere, row-major rank assignment
 # (rank = row * n_cols + col). For the 2×3 inter-SIP grid, this means
 # 2 rows × 3 columns:  SIP 0 1 2 / SIP 3 4 5.
 _PALETTE_BG = "#fafbfd"
 _PALETTE_FRAME = "#3a3f4a"
 _PALETTE_BLUE = "#2c6fb6"
 _PALETTE_GREEN = "#2e8a4e"
 _PALETTE_TEXT = "#1f2530"
 _PALETTE_BOX_FILL = "#eaf2fb"
 _PALETTE_BOX_EDGE = "#2c4a78"
 _PALETTE_ROOT_FILL = "#ffd9b8"
 _PALETTE_ROOT_EDGE = "#bd5a14"
 def _arrow(ax, xy_from, xy_to, color="black", lw=1.4, alpha=1.0,
           style="-|>", curve=0.0):
    from matplotlib.patches import FancyArrowPatch
    arrow = FancyArrowPatch(
        xy_from, xy_to,
        arrowstyle=style, mutation_scale=12,
        color=color, lw=lw, alpha=alpha,
        connectionstyle=f"arc3,rad={curve}",
    )
    ax.add_patch(arrow)
 def _draw_sip_box(ax, cx, cy, w, h, label, *, fill=_PALETTE_BOX_FILL,
                  edge=_PALETTE_BOX_EDGE, text_color=_PALETTE_TEXT,
                  font=10):
    from matplotlib.patches import FancyBboxPatch
    box = FancyBboxPatch(
        (cx - w / 2, cy - h / 2), w, h,
        boxstyle="round,pad=0.02,rounding_size=0.10",
        linewidth=1.4, edgecolor=edge, facecolor=fill,
    )
    ax.add_patch(box)
    ax.text(cx, cy, label, ha="center", va="center",
            color=text_color, fontsize=font, fontweight="bold")
 def _frame_panel(ax, title, lim_x=10.0, lim_y=6.0):
    """Set up a square-ish panel with a visible outer border."""
    from matplotlib.patches import FancyBboxPatch
    ax.set_xlim(0, lim_x)
    ax.set_ylim(0, lim_y)
    ax.set_aspect("equal")
    ax.axis("off")
    ax.set_facecolor(_PALETTE_BG)
    border = FancyBboxPatch(
        (0.05, 0.05), lim_x - 0.10, lim_y - 0.10,
        boxstyle="round,pad=0.01,rounding_size=0.12",
        linewidth=1.4, edgecolor=_PALETTE_FRAME, facecolor=_PALETTE_BG,
        zorder=0,
    )
    ax.add_patch(border)
    ax.set_title(title, fontsize=12, fontweight="bold",
                 color=_PALETTE_TEXT, pad=8)
 def _draw_ring_topology(ax):
    _frame_panel(ax, "ring_1d (6 SIPs)", lim_x=10.0, lim_y=6.0)
    xs = [1.2, 2.7, 4.2, 5.7, 7.2, 8.7]
    y = 3.1
    box_w, box_h = 1.05, 0.9
    for i, x in enumerate(xs):
        _draw_sip_box(ax, x, y, box_w, box_h, f"SIP {i}")
    # Forward ring (global_E) — adjacent neighbours, anchored to box edges.
    for i in range(5):
        _arrow(ax, (xs[i] + box_w / 2, y),
               (xs[i + 1] - box_w / 2, y),
               color=_PALETTE_BLUE, lw=1.6)
    # Wrap (SIP 5 → SIP 0). Anchor at right-CENTER of SIP 5 and
    # left-CENTER of SIP 0; arc OUTSIDE (above) the row so it does not
    # overlap any of the SIP boxes in between.
    _arrow(
        ax,
        (xs[5] + box_w / 2, y),
        (xs[0] - box_w / 2, y),
        color=_PALETTE_BLUE, lw=1.6, curve=-0.40,
    )
    ax.text(5.0, y + 2.0, "global_E  (ring)", ha="center",
            color=_PALETTE_BLUE, fontsize=10, style="italic")
    ax.text(5.0, y - 1.5,
            "(global_W = reverse direction, used by the algorithm)",
            ha="center", color="gray", fontsize=8, style="italic")
 def _draw_grid_topology(ax, kind, *, n_rows=2, n_cols=3):
    """kind ∈ {'torus', 'mesh'}. Lays out as n_rows × n_cols (row-major).
    For the sweep we use 2 rows × 3 cols → SIP layout::
        row 0:  SIP 0   SIP 1   SIP 2
        row 1:  SIP 3   SIP 4   SIP 5
    """
    n_sips = n_rows * n_cols  # derive the count instead of hardcoding 6
    title = f"torus_2d ({n_rows}×{n_cols}, {n_sips} SIPs)" if kind == "torus" \
        else f"mesh_2d_no_wrap ({n_rows}×{n_cols}, {n_sips} SIPs)"
    _frame_panel(ax, title, lim_x=10.0, lim_y=6.0)
    col_xs = [2.0, 5.0, 8.0]  # 3 cols
    row_ys = [4.3, 1.8]       # 2 rows
    box_w, box_h = 1.3, 0.95
    pos: dict[tuple[int, int], tuple[float, float]] = {}
    for r in range(n_rows):
        for c in range(n_cols):
            rank = r * n_cols + c
            x, y = col_xs[c], row_ys[r]
            pos[(r, c)] = (x, y)
            _draw_sip_box(ax, x, y, box_w, box_h, f"SIP {rank}")
    # Row edges (E↔W) — between adjacent columns within each row.
    for r in range(n_rows):
        for c in range(n_cols - 1):
            x0, y0 = pos[(r, c)]
            x1, y1 = pos[(r, c + 1)]
            _arrow(ax, (x0 + box_w / 2, y0 + 0.10),
                   (x1 - box_w / 2, y1 + 0.10),
                   color=_PALETTE_BLUE, lw=1.5)
            _arrow(ax, (x1 - box_w / 2, y1 - 0.10),
                   (x0 + box_w / 2, y0 - 0.10),
                   color=_PALETTE_BLUE, lw=1.5)
    # Col edges (N↔S) — between adjacent rows within each column.
    for c in range(n_cols):
        for r in range(n_rows - 1):
            x0, y0 = pos[(r, c)]
            x1, y1 = pos[(r + 1, c)]
            _arrow(ax, (x0 - 0.12, y0 - box_h / 2),
                   (x1 - 0.12, y1 + box_h / 2),
                   color=_PALETTE_GREEN, lw=1.5)
            _arrow(ax, (x1 + 0.12, y1 + box_h / 2),
                   (x0 + 0.12, y0 - box_h / 2),
                   color=_PALETTE_GREEN, lw=1.5)
    # Wrap arrows for torus only — anchor to the centre of the OUTER
    # edge of the end SIPs and arc OUTSIDE the row/column so they do
    # not overlap the SIPs in between.
    if kind == "torus":
        # Row wrap: last col → first col. Top row arcs UP, bottom row
        # arcs DOWN, so each wrap sits clearly outside its own row.
        for r in range(n_rows):
            x0, y0 = pos[(r, 0)]
            x1, y1 = pos[(r, n_cols - 1)]
            curve = -0.45 if r == 0 else 0.45
            _arrow(
                ax,
                (x1 + box_w / 2, y1),
                (x0 - box_w / 2, y0),
                color=_PALETTE_BLUE, lw=1.5,
                curve=curve, alpha=0.9,
            )
        # Col wrap: last row → first row. Leftmost col arcs LEFT,
        # rightmost col arcs RIGHT. Middle col(s) are skipped and
        # covered by the legend note below (drawing them through the
        # panel would collide with the row arrows).
        for c in range(n_cols):
            x0, y0 = pos[(0, c)]
            x1, y1 = pos[(n_rows - 1, c)]
            if c == 0:
                curve = 0.55
            elif c == n_cols - 1:
                curve = -0.55
            else:
                continue  # skip middle col — see legend note
            _arrow(
                ax,
                (x1, y1 - box_h / 2),
                (x0, y0 + box_h / 2),
                color=_PALETTE_GREEN, lw=1.5,
                curve=curve, alpha=0.9,
            )
    ax.text(0.7, 5.6, "global_E/W (row)", color=_PALETTE_BLUE,
            fontsize=9, style="italic", fontweight="bold")
    ax.text(0.7, 5.25, "global_N/S (col)", color=_PALETTE_GREEN,
            fontsize=9, style="italic", fontweight="bold")
    ax.text(0.7, 4.92,
            "wrap = torus" if kind == "torus" else "no wrap = mesh",
            color="gray", fontsize=8, style="italic")
    if kind == "torus" and n_cols > 2:
        ax.text(0.7, 0.3,
                "(middle-col wrap omitted for clarity — every row "
                "and every column wraps)",
                color="gray", fontsize=7.5, style="italic")
 def _draw_cube_reduction(ax):
    """4×4 cube grid inside SIP 0 — compact layout with phase legend."""
    from matplotlib.patches import Rectangle
    _frame_panel(ax, "Cube-level reduction inside SIP 0 (4×4 cubes)",
                 lim_x=10.0, lim_y=6.0)
    cube_w = 0.65
    cube_gap = 0.18
    # Anchor the 4×4 grid at the lower left of the panel; grid_total is
    # the grid's side length and positions the legend to its right.
    grid_total = 4 * cube_w + 3 * cube_gap
    grid_x0 = 0.7
    grid_y0 = 0.7
    centers: dict[tuple[int, int], tuple[float, float]] = {}
    for r in range(4):
        for c in range(4):
            cx = grid_x0 + c * (cube_w + cube_gap) + cube_w / 2
            cy = grid_y0 + (3 - r) * (cube_w + cube_gap) + cube_w / 2
            centers[(r, c)] = (cx, cy)
            cube_id = r * 4 + c
            is_root = (r == 3 and c == 3)
            face = _PALETTE_ROOT_FILL if is_root else _PALETTE_BOX_FILL
            edge = _PALETTE_ROOT_EDGE if is_root else _PALETTE_BOX_EDGE
            rect = Rectangle(
                (cx - cube_w / 2, cy - cube_w / 2), cube_w, cube_w,
                linewidth=1.2, edgecolor=edge, facecolor=face,
            )
            ax.add_patch(rect)
            label = f"c{cube_id}"
            ax.text(cx, cy, label, ha="center", va="center",
                    fontsize=7.5, fontweight="bold",
                    color=_PALETTE_ROOT_EDGE if is_root
                    else _PALETTE_TEXT)
    # Phase 1: row reduce W→E.
    for r in range(4):
        for c in range(3):
            x0, y0 = centers[(r, c)]
            x1, y1 = centers[(r, c + 1)]
            _arrow(ax, (x0 + cube_w / 2, y0), (x1 - cube_w / 2, y1),
                   color=_PALETTE_BLUE, lw=1.5)
    # Phase 2: col reduce N→S along rightmost column.
    for r in range(3):
        x0, y0 = centers[(r, 3)]
        x1, y1 = centers[(r + 1, 3)]
        _arrow(ax, (x0, y0 - cube_w / 2), (x1, y1 + cube_w / 2),
               color=_PALETTE_GREEN, lw=1.7)
    # Phase legend on the right side.
    legend_x = grid_x0 + grid_total + 0.55
    ax.text(legend_x, 5.0, "Phase 1: row reduce  (W → E)",
            color=_PALETTE_BLUE, fontsize=10, fontweight="bold")
    ax.text(legend_x, 4.55, "Phase 2: col reduce  (N → S, rightmost col)",
            color=_PALETTE_GREEN, fontsize=10, fontweight="bold")
    ax.text(legend_x, 4.10, "Phase 3: inter-SIP exchange at root cube",
            color=_PALETTE_ROOT_EDGE, fontsize=10, fontweight="bold")
    ax.text(legend_x, 3.65, "Phase 4: col broadcast  (S → N)",
            color=_PALETTE_GREEN, fontsize=10, style="italic")
    ax.text(legend_x, 3.20, "Phase 5: row broadcast  (E → W)",
            color=_PALETTE_BLUE, fontsize=10, style="italic")
    ax.text(legend_x, 2.55,
            "(broadcast phases reverse phases 2 & 1)",
            color="gray", fontsize=8.5, style="italic")
    ax.text(legend_x, 1.7,
            "Root cube (c15, bottom-right) is the only\n"
            "cube that performs the inter-SIP exchange.",
            color=_PALETTE_ROOT_EDGE, fontsize=9, style="italic")
 def emit_topology_diagram() -> str:
    """Emit a 2×2-panel topology diagram into docs/diagrams/allreduce_latency_plots/.
    Top row: ring_1d | torus_2d (2×3)
    Bot row: mesh_2d_no_wrap (2×3) | cube-level reduction in SIP 0
    """
    import matplotlib.gridspec as gridspec
    import matplotlib.pyplot as plt
    _SWEEP_OUT_DIR.mkdir(parents=True, exist_ok=True)
    fig = plt.figure(figsize=(16, 10), facecolor="white")
    gs = gridspec.GridSpec(2, 2, figure=fig, hspace=0.30, wspace=0.10)
    ax_ring = fig.add_subplot(gs[0, 0])
    ax_torus = fig.add_subplot(gs[0, 1])
    ax_mesh = fig.add_subplot(gs[1, 0])
    ax_cube = fig.add_subplot(gs[1, 1])
    _draw_ring_topology(ax_ring)
    _draw_grid_topology(ax_torus, "torus", n_rows=2, n_cols=3)
    _draw_grid_topology(ax_mesh, "mesh", n_rows=2, n_cols=3)
    _draw_cube_reduction(ax_cube)
    fig.suptitle(
        "Allreduce topology — device-level (top: ring, torus, mesh) "
        "and cube-level reduction in SIP 0",
        fontsize=14, fontweight="bold", color=_PALETTE_TEXT, y=0.98,
    )
    out_path = _SWEEP_OUT_DIR / "topology.png"
    fig.savefig(out_path, dpi=130, bbox_inches="tight",
                facecolor=fig.get_facecolor())
    plt.close(fig)
    return str(out_path)
 def test_emit_topology_diagram():
    """Emit topology.png alongside the sweep plots. Pure plotting; no sim."""
    out = emit_topology_diagram()
    assert Path(out).exists()
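
Review note: the row-major convention stated above the topology helpers (rank = row * n_cols + col) can be checked with a small neighbour lookup. The sketch below is illustrative only and is not part of kernbench. The function name `grid_neighbors` and the choice that "N" means row - 1 (row 0 drawn at the top, as in the diagram) are assumptions.

```python
from __future__ import annotations


def grid_neighbors(rank: int, n_rows: int, n_cols: int,
                   wrap: bool) -> dict[str, int | None]:
    """Return the E/W/N/S neighbour ranks of `rank` on an n_rows x n_cols grid.

    wrap=True models the torus (edges wrap around); wrap=False models the
    mesh, where a missing neighbour is reported as None.
    """
    row, col = divmod(rank, n_cols)  # inverse of rank = row * n_cols + col

    def at(r: int, c: int) -> int | None:
        if wrap:
            return (r % n_rows) * n_cols + (c % n_cols)
        if 0 <= r < n_rows and 0 <= c < n_cols:
            return r * n_cols + c
        return None

    return {
        "E": at(row, col + 1),
        "W": at(row, col - 1),
        "N": at(row - 1, col),  # assumption: row 0 is the top row
        "S": at(row + 1, col),
    }


if __name__ == "__main__":
    for wrap in (True, False):
        print("torus" if wrap else "mesh", grid_neighbors(2, 2, 3, wrap))
```

On the 2×3 grid used by the sweep, SIP 2 sits at row 0, col 2. With wrap enabled its east neighbour is SIP 0, and because there are only two rows its north and south neighbours are both SIP 5.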
@@ -23,7 +23,7 @@ def _engine():
 def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -1,48 +0,0 @@
 """Test that tl.recv() (no direction) works under the mock runtime
 and the SimPy PE_IPCQ component (ADR-0023 D4 weak fairness)."""
 from __future__ import annotations
 import numpy as np
 from kernbench.ccl.testing import run_kernel_in_mock
 def kernel_round_robin(t_ptr, n_elem, tl):
    """Each PE sends one tile E then receives N-1 tiles via round-robin.
    Uses TensorHandle math (PE_MATH) so Phase 2 produces correct HBM
    contents under SimPy + op_log replay."""
    rank = tl.program_id(axis=0)
    world_size = tl.num_programs(axis=0)
    nbytes = n_elem * 2
    pe_addr = t_ptr + rank * nbytes
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    current = acc
    for _step in range(world_size - 1):
        tl.send(dir="E", src=current)
        # No direction → round-robin
        recv = tl.recv(shape=(n_elem,), dtype="f16")
        acc = acc + recv
        current = recv  # forward W's tile to E next round
    tl.store(pe_addr, acc)
 def test_round_robin_recv_mock_runtime():
    n_elem = 8
    inputs = [
        np.full((n_elem,), float(r + 1), dtype=np.float16)
        for r in range(4)
    ]
    expected = sum(inputs)  # [10,...]
    outputs = run_kernel_in_mock(
        kernel_fn=kernel_round_robin,
        world_size=4,
        topology="ring_1d",
        inputs=inputs,
        kernel_args=(n_elem,),
    )
    for r in range(4):
        assert np.allclose(outputs[r], expected)
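
Review note: the send-E / receive / accumulate / forward loop in `kernel_round_robin` can be mirrored by a pure NumPy reference, which makes the expected `[10, ...]` result easy to verify by hand. This is a sketch under the assumption that, on `ring_1d`, a tile sent east by rank r is the tile received by rank (r + 1) mod N. The helper name `ring_allreduce_reference` is hypothetical and does not exist in kernbench.

```python
import numpy as np


def ring_allreduce_reference(inputs: list[np.ndarray]) -> list[np.ndarray]:
    """Simulate the forward-and-accumulate ring loop for every rank at once."""
    world_size = len(inputs)
    acc = [x.copy() for x in inputs]      # running sum held by each rank
    current = [x.copy() for x in inputs]  # tile each rank sends next
    for _step in range(world_size - 1):
        # Rank r receives what its west neighbour (r - 1) just sent east.
        received = [current[(r - 1) % world_size] for r in range(world_size)]
        acc = [a + rx for a, rx in zip(acc, received)]
        current = received                # forward the received tile next round
    return acc


if __name__ == "__main__":
    tiles = [np.full((8,), float(r + 1), dtype=np.float16) for r in range(4)]
    for rank, out in enumerate(ring_allreduce_reference(tiles)):
        print(rank, out)
```

After N - 1 steps each rank has seen every other rank's original tile exactly once, so all accumulators hold the same sum.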
@@ -30,7 +30,7 @@ def _graph():
 def _hbm_pa(pe_id: int = 0) -> int:
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=0, cube_id=0, pe_id=pe_id,
+        sip_id=0, die_id=0, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -0,0 +1,194 @@
 """ADR-0009 D5 invariant: all PEs targeted by a single kernel launch MUST
 begin executing the kernel body at the same simulated time, regardless of
 their dispatch path length.
 These tests directly verify the invariant by capturing per-PE state at the
 top of `_execute_kernel`:
  test_no_pe_arrives_after_target_start_ns
      Asserts: for every PE that enters _execute_kernel during a multi-cube
      launch, `env.now` at entry must be <= target_start_ns. Otherwise the
      PE's barrier yield would be a no-op and `pe_exec_start` would be set
      late, breaking the D5 "same simulated time" mandate.
  test_all_pes_have_identical_pe_exec_start
      Asserts: every PE's `pe_exec_start` (the value of `env.now` recorded
      immediately AFTER the barrier yield) is identical across all PEs in
      the launch.
 Both tests are expected to FAIL today and become the regression check the
 Phase 2 D5 predictor + fallback fix must make pass.
 """
 from __future__ import annotations
 from pathlib import Path
 import numpy as np
 import pytest
 from kernbench.policy.placement.dp import DPPolicy
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
 def _capture_per_pe_d5_state():
    """Monkey-patch PeCpuComponent._execute_kernel to record, per PE:
      - entry_now: env.now at function entry (before any yield)
      - target_start_ns: the value carried by the request
      - barrier_yielded: True if the barrier yield fired (entry_now < target)
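      - pe_exec_start: approximation of env.now immediately after the
                       barrier check (target_start_ns if the barrier
                       yields, otherwise entry_now)
      - late_ns: max(0, entry_now - target_start_ns), or None when the
                 request carries no target_start_ns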
    Returns (records: list[dict], restore: callable).
    """
    import kernbench.components.builtin.pe_cpu as pe_cpu_mod
    records: list[dict] = []
    original = pe_cpu_mod.PeCpuComponent._execute_kernel
    def patched(self, env, txn):
        request = txn.request
        target_start = getattr(request, "target_start_ns", None)
        entry_now = float(env.now)
        rec = {
            "node_id": self.node.id,
            "entry_now": entry_now,
            "target_start_ns": (
                float(target_start) if target_start is not None else None
            ),
            "barrier_yielded": (
                target_start is not None
                and float(target_start) > entry_now
            ),
            "pe_exec_start": None,  # filled below by sniff
            "late_ns": (
                None if target_start is None
                else max(0.0, entry_now - float(target_start))
            ),
        }
        records.append(rec)
        # We can't easily inject a callback at the original's
        # `pe_exec_start = env.now` line without rewriting it. Approximate:
        # if the original yields the barrier, env.now after the yield is
        # target_start_ns; otherwise pe_exec_start is entry_now (skipped).
        if rec["barrier_yielded"]:
            rec["pe_exec_start"] = float(target_start)
        else:
            rec["pe_exec_start"] = entry_now
        yield from original(self, env, txn)
    pe_cpu_mod.PeCpuComponent._execute_kernel = patched
    def restore():
        pe_cpu_mod.PeCpuComponent._execute_kernel = original
    return records, restore
 def _run_multicube_launch():
    """Drive a no-op kernel launch across all 16 cubes x 8 PEs and return
    the per-PE D5 records collected by the monkey-patch."""
    records, restore = _capture_per_pe_d5_state()
    try:
        topo = resolve_topology(str(TOPOLOGY_PATH))
        engine = GraphEngine(topo.topology_obj, enable_data=True)
        spec = topo.topology_obj.spec
        with RuntimeContext(
            engine=engine, target_device=DeviceSelector("all"),
            correlation_id="d5_barrier", spec=spec,
        ) as ctx:
            dp = DPPolicy(
                cube="row_wise", pe="column_wise",
                num_cubes=16, num_pes=8,
            )
            def kernel(t_ptr, n_elem, tl):
                pass  # no-op
            ctx.ahbm.set_device(0)
            t = ctx.zeros(
                (16, 8 * 64), dtype="f16", dp=dp, name="probe",
            )
            t.copy_(ctx.from_numpy(
                np.zeros((16, 8 * 64), dtype=np.float16),
            ))
            pending = ctx.launch(
                "d5_probe", kernel, t, 64, _defer_wait=True,
            )
            for h, _sip, meta in pending:
                ctx.wait(h, _meta=meta)
    finally:
        restore()
    return records
 def test_no_pe_arrives_after_target_start_ns():
    """ADR-0009 D5: no PE may enter `_execute_kernel` after target_start_ns.
    Today this fails because IO_CPU's predictor under-shoots actual
    dispatch latency for far cubes (cube4, cube9-15). Phase 2 fix:
    chain-aware predictor in IO_CPU + monotonic upward re-stamp in M_CPU.
    """
    records = _run_multicube_launch()
    assert records, "expected per-PE _execute_kernel records"
    late = [
        r for r in records
        if r["target_start_ns"] is not None
        and r["late_ns"] is not None
        and r["late_ns"] > 1e-6
    ]
    if late:
        # Provide actionable diagnostic in the failure.
        worst = sorted(late, key=lambda r: -r["late_ns"])[:5]
        details = "\n".join(
            f"  {r['node_id']}: late by {r['late_ns']:.2f} ns "
            f"(entry_now={r['entry_now']:.2f}, "
            f"target_start_ns={r['target_start_ns']:.2f})"
            for r in worst
        )
        pytest.fail(
            f"ADR-0009 D5 violated: {len(late)}/{len(records)} PEs "
            f"entered _execute_kernel AFTER target_start_ns "
            f"(barrier yield silently skipped). "
            f"Worst offenders:\n{details}"
        )
 def test_all_pes_have_identical_pe_exec_start():
    """ADR-0009 D5: every PE's pe_exec_start must be identical.
    With D5 honored, every PE either yields to target_start_ns (start =
    target_start_ns) or, if late, would still be aligned by the M_CPU
    upward re-stamp (Phase 2). Today: 75/128 PEs in this launch have
    distinct pe_exec_start values because they skipped the barrier.
    """
    records = _run_multicube_launch()
    assert records, "expected per-PE _execute_kernel records"
    starts = sorted({round(r["pe_exec_start"], 6) for r in records})
    if len(starts) > 1:
        spread = max(starts) - min(starts)
        # Distribution of how many PEs at each distinct start time
        from collections import Counter
        bucket = Counter(round(r["pe_exec_start"], 6) for r in records)
        details = "\n".join(
            f"  pe_exec_start={t}: {n} PEs"
            for t, n in sorted(bucket.items())
        )
        pytest.fail(
            f"ADR-0009 D5 violated: PEs have {len(starts)} distinct "
            f"pe_exec_start values (spread = {spread:.2f} ns); "
            f"D5 mandates a single common value. "
            f"Distribution:\n{details}"
        )
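
Review note: the per-PE bookkeeping in the patched `_execute_kernel` reduces to a small pure function, which makes the two D5 assertions easy to reason about without running the simulator. The sketch below restates that logic under the same approximation the test uses (a PE that arrives early starts at `target_start_ns`, a late PE starts at its entry time). The names `barrier_outcome` and `d5_holds` are hypothetical.

```python
from __future__ import annotations


def barrier_outcome(entry_now: float,
                    target_start_ns: float | None) -> dict:
    """Classify one PE's arrival relative to the launch barrier."""
    if target_start_ns is None:
        return {"barrier_yielded": False, "late_ns": None,
                "pe_exec_start": entry_now}
    yielded = target_start_ns > entry_now
    return {
        "barrier_yielded": yielded,
        "late_ns": max(0.0, entry_now - target_start_ns),
        # Early PEs wait until the target; late PEs start immediately.
        "pe_exec_start": target_start_ns if yielded else entry_now,
    }


def d5_holds(entries: list[float], target_start_ns: float) -> bool:
    """True when every PE ends up with the same pe_exec_start."""
    starts = {
        round(barrier_outcome(e, target_start_ns)["pe_exec_start"], 6)
        for e in entries
    }
    return len(starts) == 1


if __name__ == "__main__":
    print(d5_holds([10.0, 50.0, 99.0], 100.0))  # all PEs arrive early
    print(d5_holds([10.0, 120.0], 100.0))       # one PE arrives late
```

A single late arrival is enough to break the invariant, which is why both tests report the late PEs and the distribution of start times.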
@@ -50,7 +50,7 @@ def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
    from kernbench.policy.address.phyaddr import PhysAddr
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -31,7 +31,7 @@ def _hbm_pa(sip=0, cube=0, pe_id=0):
    from kernbench.policy.address.phyaddr import PhysAddr
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -0,0 +1,622 @@
 """High-level IPCQ + SFR connection diagram (presentation only).
 Renders ``docs/diagrams/ipcq_diagram_plots/ipcq_send_recv.png`` showing one
 concrete example: SIP 0 / cube 0 / pe 0 sending to pe 1 in the
 ``intra_E`` direction. Boxes and arrows are grounded in the actual
 code paths:
  - PE_IPCQ SFR fields:  src/kernbench/components/builtin/pe_ipcq.py
  - SFR install:         src/kernbench/ccl/install.py +
                         src/kernbench/ccl/sfr_config.py
  - PE_DMA outbound /
    inbound atomic write: src/kernbench/components/builtin/pe_dma.py
 This is a pure-plotting test (no simulation). It exists so the diagram
 can be regenerated reproducibly alongside the rest of the suite.
 """
 from __future__ import annotations
 from pathlib import Path
 _OUT_DIR = (Path(__file__).parent.parent / "docs" / "diagrams"
            / "ipcq_diagram_plots")
 # Color palette (matches the topology diagram for visual continuity).
 _BG = "#fafbfd"
 _FRAME = "#3a3f4a"
 _TEXT = "#1f2530"
 _BLUE = "#2c6fb6"
 _GREEN = "#2e8a4e"
 _ORANGE = "#d3722a"
 _PURPLE = "#7a4cb6"
 _BOX_FILL = "#eaf2fb"
 _BOX_EDGE = "#2c4a78"
 _HW_FILL = "#f3ecda"
 _HW_EDGE = "#a07a2a"
 _MEM_FILL = "#e8f3e8"
 _MEM_EDGE = "#2e8a4e"
 def _box(ax, x, y, w, h, title, lines, *, fill=_BOX_FILL, edge=_BOX_EDGE,
         title_color=None, font=9):
    from matplotlib.patches import FancyBboxPatch
    box = FancyBboxPatch(
        (x, y), w, h,
        boxstyle="round,pad=0.04,rounding_size=0.18",
        linewidth=1.6, edgecolor=edge, facecolor=fill, zorder=2,
    )
    ax.add_patch(box)
    ax.text(x + w / 2, y + h - 0.45, title,
            ha="center", va="top", fontsize=font + 1.5,
            fontweight="bold",
            color=title_color or edge, zorder=3)
    for i, line in enumerate(lines):
        ax.text(
            x + 0.25, y + h - 1.1 - i * 0.45, line,
            ha="left", va="top", fontsize=font - 0.5, color=_TEXT,
            family="monospace", zorder=3,
        )
 def _arrow(ax, xy_from, xy_to, *, color=_BLUE, lw=1.8, curve=0.0,
           style="-|>", alpha=1.0, zorder=4):
    from matplotlib.patches import FancyArrowPatch
    arrow = FancyArrowPatch(
        xy_from, xy_to,
        arrowstyle=style, mutation_scale=14,
        color=color, lw=lw, alpha=alpha,
        connectionstyle=f"arc3,rad={curve}",
        zorder=zorder,
    )
    ax.add_patch(arrow)
 def _step_label(ax, x, y, n, text, color=_BLUE):
    from matplotlib.patches import Circle
    ax.add_patch(Circle((x, y), 0.28, facecolor=color, edgecolor="white",
                        linewidth=1.4, zorder=5))
    ax.text(x, y, str(n), ha="center", va="center", fontsize=9,
            fontweight="bold", color="white", zorder=6)
    ax.text(x + 0.45, y, text, ha="left", va="center", fontsize=9,
            color=_TEXT, zorder=6)
 def emit_ipcq_diagram() -> str:
    import matplotlib.pyplot as plt
    from matplotlib.patches import FancyBboxPatch, Rectangle
    _OUT_DIR.mkdir(parents=True, exist_ok=True)
    fig, ax = plt.subplots(figsize=(18, 11), facecolor="white")
    ax.set_xlim(0, 22)
    ax.set_ylim(0, 14)
    ax.set_aspect("equal")
    ax.axis("off")
    ax.set_facecolor(_BG)
    # Outer panel border.
    border = FancyBboxPatch(
        (0.15, 0.15), 21.7, 13.7,
        boxstyle="round,pad=0.02,rounding_size=0.20",
        linewidth=1.4, edgecolor=_FRAME, facecolor=_BG, zorder=0,
    )
    ax.add_patch(border)
    ax.set_title(
        "IPCQ — SFR state and send/recv path between pe0 and pe1 "
        "(intra_E direction, SIP 0 / cube 0)",
        fontsize=14, fontweight="bold", color=_TEXT, pad=12,
    )
    # ── pe0 side (left half) ────────────────────────────────────────
    _box(
        ax, x=0.8, y=8.4, w=8.4, h=5.0,
        title="pe0.pe_ipcq   (SFR — direction: intra_E)",
        lines=[
            "neighbor_table[intra_E]:",
            "  peer = sip0.cube0.pe1",
            "  peer.rx_base_pa  → pe1's intra_W slot ring",
            "  my_rx_base_pa    → pe0's intra_E slot ring",
            "  n_slots = 8     slot_size = 512 B",
            "",
            "head/tail counters (per direction):",
            "  my_head           # ++ on tl.send",
            "  my_tail           # ++ on tl.recv",
            "  peer_head_cache   # updated on IpcqMetaArrival",
            "  peer_tail_cache   # updated on IpcqCreditMetadata",
            "",
            "send blocks while (my_head − peer_tail_cache) ≥ n_slots",
        ],
        edge=_BOX_EDGE, fill=_BOX_FILL,
    )
    _box(
        ax, x=0.8, y=4.5, w=8.4, h=2.7,
        title="pe0.pe_dma   (outbound IPCQ driver)",
        lines=[
            "_handle_ipcq_outbound():",
            "  • snapshot src bytes from MemoryStore",
            "  • find fabric path → pe1.pe_dma",
            "  • send Transaction; do NOT wait (fire-and-forget)",
        ],
        edge=_HW_EDGE, fill=_HW_FILL,
    )
    # ── pe1 side (right half) ───────────────────────────────────────
    _box(
        ax, x=12.8, y=8.4, w=8.4, h=5.0,
        title="pe1.pe_ipcq   (SFR — direction: intra_W)",
        lines=[
            "neighbor_table[intra_W]:",
            "  peer = sip0.cube0.pe0",
            "  peer.rx_base_pa  → pe0's intra_E slot ring",
            "  my_rx_base_pa    → pe1's intra_W slot ring",
            "  n_slots = 8     slot_size = 512 B",
            "",
            "head/tail counters (per direction):",
            "  my_head           # ++ on tl.send (other direction)",
            "  my_tail           # ++ on tl.recv (this direction)",
            "  peer_head_cache   # updated on IpcqMetaArrival",
            "  peer_tail_cache   # updated on IpcqCreditMetadata",
            "",
            "recv blocks while peer_head_cache ≤ my_tail",
        ],
        edge=_BOX_EDGE, fill=_BOX_FILL,
    )
    _box(
        ax, x=12.8, y=4.5, w=8.4, h=2.7,
        title="pe1.pe_dma   (inbound IPCQ driver)",
        lines=[
            "_handle_ipcq_inbound():",
            "  • pay terminal drain over fabric BW",
            "  • atomic: write data into pe1's intra_W slot",
            "  • forward IpcqMetaArrival → pe1.pe_ipcq",
        ],
        edge=_HW_EDGE, fill=_HW_FILL,
    )
    # ── Slot ring buffer (under pe1.pe_dma) ─────────────────────────
    ring_x0, ring_y0 = 12.8, 1.1
    ring_w, ring_h = 8.4, 2.6
    box = FancyBboxPatch(
        (ring_x0, ring_y0), ring_w, ring_h,
        boxstyle="round,pad=0.04,rounding_size=0.16",
        linewidth=1.6, edgecolor=_MEM_EDGE, facecolor=_MEM_FILL, zorder=2,
    )
    ax.add_patch(box)
    ax.text(
        ring_x0 + ring_w / 2, ring_y0 + ring_h - 0.42,
        "MemoryStore[buffer_kind]   pe1's intra_W slot ring "
        "(n_slots = 8, slot_size = 512 B)",
        ha="center", va="top", fontsize=10, fontweight="bold",
        color=_MEM_EDGE, zorder=3,
    )
    # 8 slots laid out horizontally inside the ring panel.
    n_slots = 8
    pad = 0.35
    slot_w = (ring_w - 2 * pad) / n_slots
    slot_h = 0.85
    slot_y = ring_y0 + 0.3
    for i in range(n_slots):
        sx = ring_x0 + pad + i * slot_w
        is_active = (i == 3)  # Highlight one example slot
        face = "#ffd9b8" if is_active else "white"
        edge = _ORANGE if is_active else _MEM_EDGE
        rect = Rectangle(
            (sx + 0.05, slot_y), slot_w - 0.10, slot_h,
            linewidth=1.2, facecolor=face, edgecolor=edge, zorder=3,
        )
        ax.add_patch(rect)
        ax.text(
            sx + slot_w / 2, slot_y + slot_h / 2,
            f"s{i}", ha="center", va="center", fontsize=9,
            color=_ORANGE if is_active else _TEXT,
            fontweight="bold" if is_active else "normal", zorder=4,
        )
    ax.text(
        ring_x0 + pad + 3 * slot_w + slot_w / 2, slot_y - 0.30,
        "slot_idx = my_head % n_slots",
        ha="center", va="top", fontsize=8, style="italic",
        color=_ORANGE,
    )
    # ── Fabric label (between pe0.pe_dma and pe1.pe_dma) ────────────
    fab = FancyBboxPatch(
        (9.6, 5.0), 2.6, 1.7,
        boxstyle="round,pad=0.04,rounding_size=0.20",
        linewidth=1.4, edgecolor=_PURPLE, facecolor="white", zorder=2,
    )
    ax.add_patch(fab)
    ax.text(10.9, 6.4, "Fabric", ha="center", va="center",
            fontsize=11, fontweight="bold", color=_PURPLE)
    ax.text(10.9, 5.7, "(NoC routers,\npe_dma → pe_dma)",
            ha="center", va="center", fontsize=8, color=_TEXT)
    # ── Arrows + step labels ────────────────────────────────────────
    # 1. tl.send  ↘  pe0.pe_ipcq
    _arrow(ax, (9.2, 12.9), (9.7, 12.9), color=_BLUE)  # placeholder so number lands
    _step_label(ax, 0.5, 13.6,
                1, "kernel calls tl.send(dir='intra_E', src_addr=X)",
                color=_BLUE)
    # 2. pe0.pe_ipcq → pe0.pe_dma  (IpcqDmaToken)
    _arrow(ax, (5.0, 8.4), (5.0, 7.2), color=_BLUE, lw=2.0)
    ax.text(5.2, 7.85, "IpcqDmaToken\n"
                       "dst = peer.rx_base_pa + slot_idx*512",
            ha="left", va="center", fontsize=8, color=_BLUE,
            family="monospace")
    # 3. pe0.pe_dma → fabric → pe1.pe_dma  (data, fire-and-forget)
    _arrow(ax, (9.2, 5.85), (9.6, 5.85), color=_BLUE, lw=2.0)
    _arrow(ax, (12.2, 5.85), (12.8, 5.85), color=_BLUE, lw=2.0)
    ax.text(10.9, 4.7, "data (fire-and-forget)",
            ha="center", va="center", fontsize=8, style="italic",
            color=_BLUE)
    # 4. pe1.pe_dma → MemoryStore slot (atomic)
    _arrow(ax, (17.0, 4.5), (17.0, 3.7), color=_GREEN, lw=2.0)
    ax.text(17.2, 4.10, "atomic write",
            ha="left", va="center", fontsize=8, color=_GREEN,
            family="monospace")
    # 5. pe1.pe_dma → pe1.pe_ipcq  (IpcqMetaArrival)
    _arrow(ax, (15.0, 7.2), (15.0, 8.4), color=_GREEN, lw=2.0)
    ax.text(13.0, 7.85, "IpcqMetaArrival\n"
                        "→ peer_head_cache update",
            ha="left", va="center", fontsize=8, color=_GREEN,
            family="monospace")
    # 6. tl.recv unblocks (annotation only)
    _step_label(ax, 12.85, 13.6,
                6, "tl.recv(dir='intra_W') unblocks; consume slot; my_tail++",
                color=_GREEN)
    # 7. pe1.pe_ipcq → pe0.pe_ipcq  (IpcqCreditMetadata, fast-path SimPy Store)
    _arrow(ax, (12.8, 11.0), (9.2, 11.0),
           color=_ORANGE, lw=2.0, curve=0.18)
    ax.text(11.0, 11.55,
            "IpcqCreditMetadata  (consumer_seq, dst_rx_base_pa)\n"
            "→ pe0's credit_inbox  (SimPy Store, no fabric)",
            ha="center", va="center", fontsize=8, color=_ORANGE,
            family="monospace")
    # 8. pe0.peer_tail_cache update unblocks tl.send
    ax.text(0.5, 0.55,
            "Steps 1–3 = data path (fabric, fire-and-forget);  "
            "4–6 = receiver wake-up;  7 = credit return (fast path); "
            "8 = sender unblocks when peer_tail_cache catches up.",
            ha="left", va="center", fontsize=9, color=_TEXT,
            style="italic")
    # In-figure step legend (top, between pe0/pe1 panels).
    legend_x = 9.4
    legend_y = 13.5
    _step_label(ax, legend_x, legend_y, 2,
                "PE_IPCQ → PE_DMA (token)", color=_BLUE)
    _step_label(ax, legend_x, legend_y - 0.45, 3,
                "PE_DMA → fabric → PE_DMA (data)", color=_BLUE)
    _step_label(ax, legend_x, legend_y - 0.90, 4,
                "atomic slot write", color=_GREEN)
    _step_label(ax, legend_x, legend_y - 1.35, 5,
                "IpcqMetaArrival", color=_GREEN)
    _step_label(ax, legend_x, legend_y - 1.80, 7,
                "IpcqCreditMetadata", color=_ORANGE)
    out_path = _OUT_DIR / "ipcq_send_recv.png"
    fig.savefig(out_path, dpi=130, bbox_inches="tight",
                facecolor=fig.get_facecolor())
    import matplotlib.pyplot as _plt
    _plt.close(fig)
    return str(out_path)
 def test_emit_ipcq_diagram():
    out = emit_ipcq_diagram()
    assert Path(out).exists()
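
The eight steps drawn by the diagram above describe credit-based flow control over a slot ring. The following is a minimal sketch of that idea, not the simulator's implementation: the class `SlotRing` and its method names are hypothetical, the fabric and SimPy timing are omitted entirely, and only the counter names (`peer_tail_cache`, `peer_head_cache`, `my_tail`, `credit_inbox`) are borrowed from the diagram labels.

```python
from collections import deque


class SlotRing:
    """Toy model of one IPCQ direction between a sender and a receiver.

    Hypothetical illustration only: no fabric, no latency, single thread.
    """

    def __init__(self, n_slots: int = 8) -> None:
        self.n_slots = n_slots
        self.slots = [None] * n_slots
        self.head = 0                # sender: next sequence number to write
        self.peer_tail_cache = 0     # sender: last consumer_seq seen via credit
        self.my_tail = 0             # receiver: next sequence number to consume
        self.peer_head_cache = 0     # receiver: writes announced so far
        self.credit_inbox = deque()  # stands in for the sender's credit store

    def send(self, payload) -> bool:
        # The sender may only write while fewer than n_slots are in flight.
        if self.head - self.peer_tail_cache >= self.n_slots:
            return False  # a real sender would block here
        self.slots[self.head % self.n_slots] = payload
        self.head += 1
        self.peer_head_cache = self.head  # models the arrival notification
        return True

    def recv(self):
        if self.my_tail == self.peer_head_cache:
            return None  # a real receiver would block here
        payload = self.slots[self.my_tail % self.n_slots]
        self.my_tail += 1
        self.credit_inbox.append(self.my_tail)  # credit carries consumer_seq
        return payload

    def drain_credits(self) -> None:
        # Sender side: apply returned credits, which may unblock send().
        while self.credit_inbox:
            self.peer_tail_cache = self.credit_inbox.popleft()


ring = SlotRing(n_slots=2)
ring.send("a")
ring.send("b")
blocked = not ring.send("c")   # ring is full until a credit comes back
ring.recv()
ring.drain_credits()
unblocked = ring.send("c")     # one slot was freed by the credit
```

In this sketch the sender stays blocked after `recv()` until `drain_credits()` runs, which mirrors step 8 of the diagram: the sender resumes only once its cached view of the peer's tail advances.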
 # ── 2nd diagram: two-PE data + DMA + IPCQ-memory layout ──────────────
 def _pe_panel(ax, x0, y0, w, h, label, *, edge=_FRAME, fill="white"):
    """Outer container for one PE: title bar + body."""
    from matplotlib.patches import FancyBboxPatch
    box = FancyBboxPatch(
        (x0, y0), w, h,
        boxstyle="round,pad=0.04,rounding_size=0.20",
        linewidth=1.8, edgecolor=edge, facecolor=fill, zorder=1,
    )
    ax.add_patch(box)
    # Title band
    title_h = 0.55
    band = FancyBboxPatch(
        (x0 + 0.12, y0 + h - title_h - 0.10), w - 0.24, title_h,
        boxstyle="round,pad=0.02,rounding_size=0.10",
        linewidth=0, edgecolor="none", facecolor=edge, zorder=2,
    )
    ax.add_patch(band)
    ax.text(
        x0 + w / 2, y0 + h - title_h / 2 - 0.10, label,
        ha="center", va="center", fontsize=12, fontweight="bold",
        color="white", zorder=3,
    )
 def _sub_block(ax, cx, cy, w, h, title, body_lines, *,
               fill, edge, font=9):
    from matplotlib.patches import FancyBboxPatch
    rect = FancyBboxPatch(
        (cx - w / 2, cy - h / 2), w, h,
        boxstyle="round,pad=0.02,rounding_size=0.10",
        linewidth=1.4, edgecolor=edge, facecolor=fill, zorder=3,
    )
    ax.add_patch(rect)
    ax.text(cx, cy + h / 2 - 0.30, title, ha="center", va="top",
            fontsize=font + 1, fontweight="bold", color=edge, zorder=4)
    for i, line in enumerate(body_lines):
        ax.text(
            cx, cy + h / 2 - 0.75 - i * 0.34, line,
            ha="center", va="top", fontsize=font - 0.5, color=_TEXT,
            family="monospace", zorder=4,
        )
 def _tcm_with_slots(ax, cx, cy, w, h, *, n_slots=8, active_slot=3,
                    title="PE_TCM (local memory)"):
    """Draw a TCM box that contains a source buffer + IPCQ slot ring."""
    from matplotlib.patches import FancyBboxPatch, Rectangle
    rect = FancyBboxPatch(
        (cx - w / 2, cy - h / 2), w, h,
        boxstyle="round,pad=0.02,rounding_size=0.10",
        linewidth=1.4, edgecolor=_MEM_EDGE, facecolor=_MEM_FILL, zorder=3,
    )
    ax.add_patch(rect)
    ax.text(
        cx, cy + h / 2 - 0.28, title, ha="center", va="top",
        fontsize=9.5, fontweight="bold", color=_MEM_EDGE, zorder=4,
    )
    # Source buffer region (left part).
    src_w = (w - 0.6) * 0.30
    src_h = h - 1.20
    sx = cx - w / 2 + 0.20
    sy = cy - h / 2 + 0.20
    src_rect = Rectangle(
        (sx, sy), src_w, src_h,
        linewidth=1.0, facecolor="white", edgecolor=_BLUE, zorder=4,
    )
    ax.add_patch(src_rect)
    ax.text(sx + src_w / 2, sy + src_h / 2 + 0.18, "source",
            ha="center", va="center", fontsize=8.5, color=_BLUE,
            fontweight="bold", zorder=5)
    ax.text(sx + src_w / 2, sy + src_h / 2 - 0.18, "buffer",
            ha="center", va="center", fontsize=8.5, color=_BLUE,
            fontweight="bold", zorder=5)
    # Slot ring region (right part).
    ring_x0 = sx + src_w + 0.30
    ring_w = (cx + w / 2 - 0.20) - ring_x0
    ring_y0 = sy
    ring_h = src_h
    ring_rect = Rectangle(
        (ring_x0, ring_y0), ring_w, ring_h,
        linewidth=1.0, facecolor="white", edgecolor=_ORANGE, zorder=4,
    )
    ax.add_patch(ring_rect)
    ax.text(
        ring_x0 + ring_w / 2, ring_y0 + ring_h - 0.18,
        "IPCQ slot ring  (intra_W)",
        ha="center", va="top", fontsize=8.5, color=_ORANGE,
        fontweight="bold", zorder=5,
    )
    # Draw n_slots slots in a fixed 2×4 grid (layout assumes n_slots <= 8).
    cols = 4
    rows = 2
    slot_inner_pad = 0.12
    sw = (ring_w - (cols + 1) * slot_inner_pad) / cols
    sh = (ring_h - 0.65 - (rows + 1) * slot_inner_pad) / rows
    for i in range(n_slots):
        r = i // cols
        c = i % cols
        sx_i = ring_x0 + slot_inner_pad + c * (sw + slot_inner_pad)
        sy_i = (ring_y0 + slot_inner_pad
                + (rows - 1 - r) * (sh + slot_inner_pad))
        is_active = (i == active_slot)
        face = "#ffd9b8" if is_active else "white"
        edge = _ORANGE if is_active else "#c9c9c9"
        ax.add_patch(Rectangle(
            (sx_i, sy_i), sw, sh,
            linewidth=1.0, facecolor=face, edgecolor=edge, zorder=5,
        ))
        ax.text(
            sx_i + sw / 2, sy_i + sh / 2, f"s{i}",
            ha="center", va="center", fontsize=8,
            fontweight="bold" if is_active else "normal",
            color=_ORANGE if is_active else "#666",
            zorder=6,
        )
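
The slot placement in `_tcm_with_slots` is pure arithmetic and can be checked without matplotlib. The sketch below restates the same formulas in a standalone helper; `slot_rects` is a hypothetical name introduced here for illustration, and the default padding and header height are copied from the function above.

```python
def slot_rects(ring_x0, ring_y0, ring_w, ring_h, *, n_slots=8,
               cols=4, rows=2, pad=0.12, header_h=0.65):
    """Return (x, y, w, h) per slot, row-major, with row 0 drawn on top.

    Hypothetical helper mirroring the grid math in _tcm_with_slots.
    """
    sw = (ring_w - (cols + 1) * pad) / cols
    sh = (ring_h - header_h - (rows + 1) * pad) / rows
    rects = []
    for i in range(n_slots):
        r, c = divmod(i, cols)
        x = ring_x0 + pad + c * (sw + pad)
        # Row 0 sits above row 1, so the row index is flipped for y.
        y = ring_y0 + pad + (rows - 1 - r) * (sh + pad)
        rects.append((x, y, sw, sh))
    return rects


rects = slot_rects(0.0, 0.0, 5.36, 2.4)
for i, (x, y, w, h) in enumerate(rects):
    print(f"s{i}: x={x:.2f} y={y:.2f} w={w:.2f} h={h:.2f}")
```

Because the row index is flipped, any slot index at or beyond `cols * rows` gets a y position below `ring_y0`, which is why the grid only holds eight slots cleanly.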
 def emit_ipcq_dma_diagram() -> str:
    """Two-PE diagram emphasising: outbound DMA writes DIRECTLY into the
    receiver's local memory (slot ring in PE_TCM). pe1.pe_dma is the
    inbound memory port that pays drain + emits the MetaArrival notice;
    the actual DMA payload terminates in the slot, not in another DMA.
    """
    import matplotlib.pyplot as plt
    from matplotlib.patches import FancyBboxPatch
    _OUT_DIR.mkdir(parents=True, exist_ok=True)
    fig, ax = plt.subplots(figsize=(22, 12), facecolor="white")
    XMAX, YMAX = 28.0, 14.0
    ax.set_xlim(0, XMAX)
    ax.set_ylim(0, YMAX)
    ax.set_aspect("equal")
    ax.axis("off")
    ax.set_facecolor(_BG)
    # Outer page border.
    ax.add_patch(FancyBboxPatch(
        (0.20, 0.20), XMAX - 0.40, YMAX - 0.40,
        boxstyle="round,pad=0.02,rounding_size=0.20",
        linewidth=1.4, edgecolor=_FRAME, facecolor=_BG, zorder=0,
    ))
    ax.set_title(
        "Two PEs over IPCQ — outbound DMA lands DIRECTLY in receiver "
        "memory (slot ring in PE_TCM)",
        fontsize=14, fontweight="bold", color=_TEXT, pad=12,
    )
    # ── PE panels ───────────────────────────────────────────────────
    PE0_X, PE0_W = 0.8, 11.6
    PE1_X, PE1_W = 15.6, 11.6
    PE_Y, PE_H = 1.6, 10.4
    _pe_panel(ax, x0=PE0_X, y0=PE_Y, w=PE0_W, h=PE_H,
              label="PE 0   (sender — sip0.cube0.pe0)",
              edge=_BLUE, fill="white")
    _pe_panel(ax, x0=PE1_X, y0=PE_Y, w=PE1_W, h=PE_H,
              label="PE 1   (receiver — sip0.cube0.pe1)",
              edge=_GREEN, fill="white")
    # ── PE 0 sub-blocks ─────────────────────────────────────────────
    # Top row: PE_CPU and PE_IPCQ
    _sub_block(
        ax, cx=PE0_X + 2.5, cy=10.3, w=3.4, h=1.6,
        title="PE_CPU",
        body_lines=["kernel:",
                    "  tl.send(dir='intra_E',",
                    "          src=ptr)"],
        fill=_BOX_FILL, edge=_BOX_EDGE,
    )
    _sub_block(
        ax, cx=PE0_X + 8.4, cy=10.3, w=4.0, h=1.6,
        title="PE_IPCQ   (control / SFR)",
        body_lines=["per-direction state:",
                    "  head/tail, peer.rx_base_pa,",
                    "  peer_tail_cache"],
        fill=_BOX_FILL, edge=_BOX_EDGE,
    )
    # Mid: PE_TCM (left, with src + slot ring) and PE_DMA outbound (right)
    _tcm_with_slots(
        ax, cx=PE0_X + 3.0, cy=5.4, w=5.6, h=3.6,
        n_slots=8, active_slot=-1,
        title="PE_TCM (local memory · buffer_kind = tcm)",
    )
    _sub_block(
        ax, cx=PE0_X + 8.6, cy=5.4, w=3.6, h=3.6,
        title="PE_DMA   (outbound)",
        body_lines=["snapshot src bytes",
                    "  from PE_TCM",
                    "build Transaction",
                    "  (dst = peer's slot PA)",
                    "fire onto fabric;",
                    "  do not wait for ack"],
        fill=_HW_FILL, edge=_HW_EDGE,
    )
    # Arrows on PE 0 side
    _arrow(ax, (PE0_X + 4.20, 10.3), (PE0_X + 6.40, 10.3),
           color=_BLUE, lw=1.7)
    ax.text(PE0_X + 5.30, 10.65, "tl.send",
            ha="center", va="center", fontsize=8.5, color=_BLUE,
            fontweight="bold")
    # PE_IPCQ → PE_DMA control (kept; label removed per request)
    _arrow(ax, (PE0_X + 8.4, 9.50), (PE0_X + 8.6, 7.20),
           color=_ORANGE, lw=1.6)
    # PE_TCM(src) → PE_DMA (read source data)
    _arrow(ax, (PE0_X + 5.80, 5.40), (PE0_X + 6.80, 5.40),
           color=_BLUE, lw=2.0)
    ax.text(PE0_X + 6.30, 6.05, "read source\n(snapshot)",
            ha="center", va="bottom", fontsize=7.5, color=_BLUE,
            family="monospace")
    # ── Fabric in the middle ────────────────────────────────────────
    FAB_X0, FAB_W = 12.6, 2.8
    FAB_Y0, FAB_H = 4.6, 2.2
    ax.add_patch(FancyBboxPatch(
        (FAB_X0, FAB_Y0), FAB_W, FAB_H,
        boxstyle="round,pad=0.04,rounding_size=0.20",
        linewidth=1.6, edgecolor=_PURPLE, facecolor="white", zorder=2,
    ))
    ax.text(FAB_X0 + FAB_W / 2, FAB_Y0 + FAB_H - 0.45,
            "NoC Fabric", ha="center", va="center",
            fontsize=12, fontweight="bold", color=_PURPLE)
    ax.text(FAB_X0 + FAB_W / 2, FAB_Y0 + 0.55,
            "(routers, links;\nfabric BW + drain time)",
            ha="center", va="center", fontsize=8.5, color=_TEXT)
    # ── PE 1 sub-blocks ─────────────────────────────────────────────
    # Top row: PE_IPCQ and PE_CPU
    _sub_block(
        ax, cx=PE1_X + 3.2, cy=10.3, w=4.0, h=1.6,
        title="PE_IPCQ   (control / SFR)",
        body_lines=["per-direction state:",
                    "  head/tail, peer_head_cache,",
                    "  my_rx_base_pa"],
        fill=_BOX_FILL, edge=_BOX_EDGE,
    )
    _sub_block(
        ax, cx=PE1_X + 9.1, cy=10.3, w=3.4, h=1.6,
        title="PE_CPU",
        body_lines=["kernel:",
                    "  ptr = tl.recv(",
                    "    dir='intra_W')"],
        fill=_BOX_FILL, edge=_BOX_EDGE,
    )
    # Wide PE_TCM occupying the centre-bottom of PE 1 — the DMA payload
    # terminates HERE (not in any DMA component).
    _tcm_with_slots(
        ax, cx=PE1_X + 5.0, cy=5.4, w=8.4, h=3.6,
        n_slots=8, active_slot=3,
        title="PE_TCM (local memory · buffer_kind = tcm)",
    )
    # ── DATA arrows: outbound DMA ──► RECEIVER MEMORY (the slot) ───
    # The inbound PE_DMA is NOT on the data path — it's a sim-side
    # bookkeeper that pays terminal drain + emits MetaArrival. The
    # actual DMA payload jumps fabric → slot directly.
    # 1) pe0.PE_DMA → fabric
    _arrow(ax, (PE0_X + 10.40, 5.40), (FAB_X0, 5.40),
           color=_BLUE, lw=2.8)
    # 2) fabric → PE_TCM slot s3 (DMA payload terminates IN MEMORY)
    SLOT_X = PE1_X + 2.95   # arrow tip inside PE_TCM; NOTE: slot s3 itself is drawn near x = PE1_X + 7.7
    _arrow(ax, (FAB_X0 + FAB_W, 5.40), (SLOT_X, 5.40),
           color=_BLUE, lw=2.8)
    # PE_IPCQ → PE_CPU: tl.recv unblocks
    _arrow(ax, (PE1_X + 5.20, 10.30), (PE1_X + 7.40, 10.30),
           color=_GREEN, lw=1.7)
    ax.text(PE1_X + 6.30, 10.65, "unblock tl.recv",
            ha="center", va="center", fontsize=8.5, color=_GREEN,
            fontweight="bold")
    # PE_CPU → PE_TCM: kernel reads consumed slot via returned ptr
    _arrow(ax, (PE1_X + 9.10, 9.50), (PE1_X + 8.10, 7.20),
           color=_GREEN, lw=1.4, curve=0.10)
    ax.text(PE1_X + 9.30, 8.30, "kernel reads\nslot data",
            ha="left", va="center", fontsize=7.5, color=_GREEN)
    # (Credit-return arrow + label removed per request — see code
    # for the actual mechanism: pe1.pe_ipcq → pe0.credit_inbox via
    # SimPy Store after env.timeout(fabric_path_latency_ns).)
    # ── Footer legend ──────────────────────────────────────────────
    ax.text(0.6, 0.85,
            "DATA  (blue) :  pe0 PE_TCM[src]  →  pe0 PE_DMA  →  "
            "NoC fabric  →  pe1 PE_TCM[slot s3]   ← DMA write "
            "terminates IN MEMORY",
            ha="left", va="center", fontsize=9, color=_TEXT,
            style="italic")
    ax.text(0.6, 0.45,
            "CTRL (orange) :  PE_IPCQ issues IpcqDmaToken on send;  "
            "pe1's inbound port emits MetaArrival;  credit return "
            "uses the fabric path (timing) but bypasses the per-hop "
            "component graph (D9 fast path).",
            ha="left", va="center", fontsize=9, color=_TEXT,
            style="italic")
    out_path = _OUT_DIR / "ipcq_two_pe_dma.png"
    fig.savefig(out_path, dpi=130, bbox_inches="tight",
                facecolor=fig.get_facecolor())
    plt.close(fig)
    return str(out_path)
 def test_emit_ipcq_dma_diagram():
    out = emit_ipcq_dma_diagram()
    assert Path(out).exists()
@@ -29,7 +29,7 @@ def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
    # 48 GB / 8 slices = 6 GB per slice
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -37,7 +37,7 @@ def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
 def _sram_pa(sip: int = 0, cube: int = 0) -> int:
    """Create an SRAM physical address."""
-    pa = PhysAddr.cube_sram_addr(rack_id=0, sip_id=sip, cube_id=cube, sram_offset=0x800)
+    pa = PhysAddr.cube_sram_addr(sip_id=sip, die_id=cube, sram_offset=0x800)
    return pa.encode()
@@ -0,0 +1,139 @@
 """Phase 1 test for moving the intercube_allreduce root cube from the
 bottom-right corner (3,3) to the geometric center (2,2).
 Today's algorithm (intercube_allreduce.py) hardcodes
 ``root_cube = (cube_h-1) * cube_w + (cube_w-1)`` (= cube 15 in 4×4).
 The intra-SIP critical path for one allreduce is therefore::
    Phase 1 (row reduce W→E to col 3)         : 3 hops
    Phase 2 (col reduce N→S to row 3 on col 3): 3 hops
    Phase 3 (inter-SIP at root)               : (separate)
    Phase 4 (col broadcast S→N)               : 3 hops
    Phase 5 (row broadcast E→W)               : 3 hops
    Total intra-SIP critical path             : 12 hops
 Moving the root to (2,2) and using BIDIRECTIONAL convergence (cols 0..2
 go W→E, col 3 goes E→W in parallel; rows 0..2 go N→S, row 3 goes S→N
 in parallel) cuts each phase's critical path from 3 hops to 2::
    Phase 1 critical path : max(2, 1) = 2 hops
    Phase 2 critical path : max(2, 1) = 2 hops
    Phase 4 critical path : 2 hops
    Phase 5 critical path : 2 hops
    Total intra-SIP critical path : 8 hops
 Per-hop cost at 96 KB on TCM ≈ 600 ns (slot IO write+read 384 ns +
 fabric drain ~217 ns). 4 fewer hops ⇒ ~2.4 µs reduction.
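 EXPECTED outcome of this test. Here "Phase 1" is the test-first step and
 "Phase 2" is the production change, not the algorithm phases listed above:
  - Today (root = corner) :  ~22.0 µs   ← test FAILS (> 20500 ns)
  - After Phase 2 (root = center) : ~19.6 µs ← test PASSES (< 20500 ns)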
 """
 from __future__ import annotations
 from pathlib import Path
 import pytest
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 from tests.test_allreduce_multidevice import (
    _write_temp_configs,
    run_allreduce,
 )
 def _run_torus_96kb(tmp_path: Path) -> float:
    """Run torus_2d 6-SIP allreduce at 96 KB / slot, return critical-path
    pe_exec_ns. Fixed at TCM (the project default)."""
    sub = tmp_path / "torus_root_center"
    sub.mkdir()
    topo_path, ccl_path = _write_temp_configs(
        sub,
        sip_topology="torus_2d",
        n_sips=6,
        algorithm="intercube_allreduce",
        sip_w=3, sip_h=2,
        n_elem_override=49152,   # 49152 × 2 = 96 KB / slot
    )
    topo = resolve_topology(topo_path)
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    spec = topo.topology_obj.spec
    with RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id="root_center_phase1",
        spec=spec,
    ) as ctx:
        result = run_allreduce(
            ctx, engine, spec,
            algorithm="intercube_allreduce", ccl_yaml=ccl_path,
        )
        assert result["ok_cubes"] > 0
    pe_exec_vals = [
        float(tr.get("pe_exec_ns", 0.0) or 0.0)
        for _, (_, tr) in engine._results.items()
        if isinstance(tr, dict)
    ]
    return max(pe_exec_vals) if pe_exec_vals else 0.0
 def test_intra_sip_critical_path_at_96k_below_threshold(tmp_path):
    """Post-Phase-2 (root=center, bidirectional reduce) the torus_2d
    96 KB allreduce on TCM should drop below 20.5 µs.
    Today's value: ~22.0 µs (12-hop critical path with corner root).
    Expected post-Phase-2: ~19.6 µs (8-hop critical path with
    center root) — model estimate, ~11% reduction end-to-end.
    """
    lat_ns = _run_torus_96kb(tmp_path)
    THRESHOLD_NS = 20_500.0
    assert lat_ns < THRESHOLD_NS, (
        f"torus_2d 6-SIP 96 KB allreduce should land below "
        f"{THRESHOLD_NS:.0f} ns post-Phase-2 (root=center, "
        f"bidirectional reduce). got {lat_ns:.1f} ns "
        f"({lat_ns / 1000:.2f} µs)"
    )
 def test_correctness_preserved(tmp_path):
    """Smoke check: at small n_elem the new algorithm must still produce
    the correct sum across all 96 cubes. ``run_allreduce`` validates
    every cube against the expected reduce result (``ok_cubes`` must be
    96 = 6 SIPs × 16 cubes).
    This guards against the obvious Phase 2 risk: bidirectional reduce
    must sum each contribution exactly once. If it is implemented
    incorrectly (double-counting, or skipping the right edge column /
    bottom row), the asserts inside run_allreduce fail.
    """
    sub = tmp_path / "correctness"
    sub.mkdir()
    topo_path, ccl_path = _write_temp_configs(
        sub,
        sip_topology="torus_2d",
        n_sips=6,
        algorithm="intercube_allreduce",
        sip_w=3, sip_h=2,
        n_elem_override=128,   # tiny payload to keep this fast
    )
    topo = resolve_topology(topo_path)
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    spec = topo.topology_obj.spec
    with RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id="root_center_correctness",
        spec=spec,
    ) as ctx:
        result = run_allreduce(
            ctx, engine, spec,
            algorithm="intercube_allreduce", ccl_yaml=ccl_path,
        )
    n_cubes = 6 * 16  # 6 SIPs × 16 cubes/SIP
    assert result["ok_cubes"] == n_cubes, (
        f"all 96 cubes must validate; got {result['ok_cubes']} OK"
    )
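
The hop-count argument in this file's module docstring can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: `critical_path_hops` is a hypothetical helper, the 217 ns fabric drain is taken from the docstring as-is, and the 512 GB/s TCM slot bandwidth is an assumption chosen because it reproduces the docstring's 384 ns write+read figure.

```python
def critical_path_hops(cube_w, cube_h, root_col, root_row):
    """Intra-SIP critical path in hops for a row-then-column reduce
    followed by a column-then-row broadcast (hypothetical helper).

    Each phase is bounded by the farthest edge from the root, which also
    covers the corner-root case where one side has zero hops.
    """
    row_hops = max(root_col, cube_w - 1 - root_col)
    col_hops = max(root_row, cube_h - 1 - root_row)
    return 2 * (row_hops + col_hops)  # reduce + broadcast


SLOT_BYTES = 96 * 1024      # 96 KB per slot
TCM_SLOT_BW_GBS = 512.0     # assumption; 1 GB/s equals 1 byte per ns
FABRIC_DRAIN_NS = 217.0     # quoted from the docstring

slot_io_ns = 2 * SLOT_BYTES / TCM_SLOT_BW_GBS   # write + read
per_hop_ns = slot_io_ns + FABRIC_DRAIN_NS

corner_hops = critical_path_hops(4, 4, 3, 3)
center_hops = critical_path_hops(4, 4, 2, 2)
saved_ns = (corner_hops - center_hops) * per_hop_ns

print(f"corner={corner_hops} hops, center={center_hops} hops")
print(f"per hop ~{per_hop_ns:.0f} ns, saving ~{saved_ns / 1000:.1f} us")
```

Subtracting the saving from the quoted ~22.0 µs baseline gives roughly 19.6 µs, which is why the threshold of 20 500 ns separates the two cases.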
@@ -1,8 +1,9 @@
 """Tests for configure_sfr_intercube_multisip neighbor table wiring.
-Verifies that IPCQ neighbor tables are correctly installed for
+Verifies full IPCQ hardware wiring (independent of DPPolicy):
-intercube (pe0, 4×4 mesh N/S/E/W) + inter-SIP (pe0, all cubes,
+  - intra-cube (2×4 PE grid)  → intra_N/S/E/W
-global_E/global_W) communication.
+  - intercube same-lane       → N/S/E/W
  - inter-SIP same-(cube, pe) → global_N/S/E/W
 """
 from __future__ import annotations
@@ -16,6 +17,7 @@ from kernbench.topology.builder import resolve_topology
 TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
 N_CUBES = 16
 PES_PER_CUBE = 8
 def _engine_and_spec():
@@ -36,78 +38,102 @@ class TestConfigureSfrNeighborTables:
        plan = configure_sfr_intercube_multisip(engine, spec, cfg)
        n_sips = int(spec["system"]["sips"]["count"])
-        assert plan["world_size"] == n_sips * N_CUBES
+        expected = n_sips * N_CUBES * PES_PER_CUBE
-        assert len(plan["rank_to_pe"]) == n_sips * N_CUBES
+        assert plan["world_size"] == expected
-        for pe_idx, (sip, cube, pe) in enumerate(plan["rank_to_pe"]):
+        assert len(plan["rank_to_pe"]) == expected
            assert pe == 0, f"pe_idx {pe_idx}: pe must be 0, got {pe}"
-    def test_corner_cube0_has_E_and_S_only(self):
+    # ── Intra-cube (intra_N/S/E/W) ────────────────────────────────
-        """Cube 0 (row=0, col=0) is NW corner: only E and S neighbors."""
+
    def test_pe0_intra_cube_has_intra_E_and_intra_S(self):
        """pe0 is NW of the 2×4 PE grid: intra_E=pe1, intra_S=pe4."""
        engine, spec = _engine_and_spec()
        cfg = _merged_cfg()
        configure_sfr_intercube_multisip(engine, spec, cfg)
-        ipcq = engine._components["sip0.cube0.pe0.pe_ipcq"]
+        qp = engine._components["sip0.cube0.pe0.pe_ipcq"].queue_pairs
-        qp = ipcq.queue_pairs
+        assert "intra_E" in qp
-        assert "E" in qp, "cube 0 must have E neighbor"
+        assert qp["intra_E"]["peer"].pe == 1
-        assert "S" in qp, "cube 0 must have S neighbor"
+        assert "intra_S" in qp
-        assert "W" not in qp, "cube 0 (col=0) must NOT have W neighbor"
+        assert qp["intra_S"]["peer"].pe == 4
-        assert "N" not in qp, "cube 0 (row=0) must NOT have N neighbor"
+        assert "intra_W" not in qp
        assert "intra_N" not in qp
    def test_pe5_intra_cube_has_N_E_W_but_no_S(self):
        """pe5 (row=1, col=1 in 2×4 grid) has intra_N/E/W but no intra_S.
        Intra neighbors: intra_N=pe1, intra_E=pe6, intra_W=pe4,
        intra_S not present (row=1 is bottom row).
        """
        engine, spec = _engine_and_spec()
        cfg = _merged_cfg()
        configure_sfr_intercube_multisip(engine, spec, cfg)
        qp = engine._components["sip0.cube0.pe5.pe_ipcq"].queue_pairs
        assert qp["intra_N"]["peer"].pe == 1
        assert qp["intra_E"]["peer"].pe == 6
        assert qp["intra_W"]["peer"].pe == 4
        assert "intra_S" not in qp  # bottom row
    # ── Intercube same-lane (N/S/E/W) ─────────────────────────────
    def test_corner_cube0_pe0_has_intercube_E_and_S(self):
        """Cube 0 (NW mesh corner): intercube E→cube1, S→cube4."""
        engine, spec = _engine_and_spec()
        cfg = _merged_cfg()
        configure_sfr_intercube_multisip(engine, spec, cfg)
        qp = engine._components["sip0.cube0.pe0.pe_ipcq"].queue_pairs
        assert qp["E"]["peer"].cube == 1
        assert qp["E"]["peer"].pe == 0  # same-lane
        assert qp["S"]["peer"].cube == 4
        assert qp["S"]["peer"].pe == 0
        assert "W" not in qp, "cube 0 has no west neighbor"
        assert "N" not in qp, "cube 0 has no north neighbor"
-    def test_interior_cube5_has_all_four(self):
+    def test_interior_cube5_pe3_has_all_four_intercube_same_lane(self):
-        """Cube 5 (row=1, col=1) is interior: N/S/E/W all present."""
+        """Cube 5 interior, pe3: intercube N/S/E/W all present, same-lane."""
        engine, spec = _engine_and_spec()
        cfg = _merged_cfg()
        configure_sfr_intercube_multisip(engine, spec, cfg)
-        ipcq = engine._components["sip0.cube5.pe0.pe_ipcq"]
+        qp = engine._components["sip0.cube5.pe3.pe_ipcq"].queue_pairs
-        qp = ipcq.queue_pairs
+        for d, expected_cube in [("N", 1), ("S", 9), ("E", 6), ("W", 4)]:
-        assert qp["N"]["peer"].cube == 1
+            assert qp[d]["peer"].cube == expected_cube
-        assert qp["S"]["peer"].cube == 9
+            assert qp[d]["peer"].pe == 3  # same-lane
        assert qp["E"]["peer"].cube == 6
        assert qp["W"]["peer"].cube == 4
-    def test_root_cube15_has_inter_sip(self):
+    def test_all_pes_have_intercube_wiring(self):
-        """Cube 15 (root, SE corner) has N, W + global_E/global_W."""
+        """Every PE on interior cube 5 has intercube same-lane wiring."""
        engine, spec = _engine_and_spec()
        cfg = _merged_cfg()
        configure_sfr_intercube_multisip(engine, spec, cfg)
-        ipcq0 = engine._components["sip0.cube15.pe0.pe_ipcq"]
+        # Interior cube 5: every PE should have N/S/E/W same-lane.
-        qp0 = ipcq0.queue_pairs
+        for pe in range(PES_PER_CUBE):
-        assert "N" in qp0
+            qp = engine._components[f"sip0.cube5.pe{pe}.pe_ipcq"].queue_pairs
-        assert "W" in qp0
+            for d in ("N", "S", "E", "W"):
-        assert "E" not in qp0, "cube 15 (col=3) must NOT have E"
+                assert d in qp, f"sip0.cube5.pe{pe} missing intercube {d}"
-        assert "S" not in qp0, "cube 15 (row=3) must NOT have S"
+                assert qp[d]["peer"].pe == pe, (
-        assert "global_E" in qp0, "root cube must have global_E"
+                    f"sip0.cube5.pe{pe} {d} not same-lane"
-        assert "global_W" in qp0, "root cube must have global_W"
+                )
        assert qp0["global_E"]["peer"].sip == 1
        assert qp0["global_E"]["peer"].cube == 15
-        ipcq1 = engine._components["sip1.cube15.pe0.pe_ipcq"]
+    # ── Inter-SIP (global_*) ──────────────────────────────────────
        qp1 = ipcq1.queue_pairs
        assert qp1["global_E"]["peer"].sip == 0
        assert qp1["global_E"]["peer"].cube == 15
-    def test_all_cubes_have_inter_sip(self):
+    def test_every_pe_on_every_cube_has_inter_sip(self):
-        """ALL cubes (not just root) are wired for inter-SIP."""
+        """All PEs on all cubes wired for inter-SIP via global_*."""
        engine, spec = _engine_and_spec()
        cfg = _merged_cfg()
        configure_sfr_intercube_multisip(engine, spec, cfg)
        root_cube = int(cfg.get("root_cube", N_CUBES - 1))
        for cube_id in range(N_CUBES):
-            ipcq = engine._components[f"sip0.cube{cube_id}.pe0.pe_ipcq"]
+            for pe in range(PES_PER_CUBE):
-            qp = ipcq.queue_pairs
+                qp = engine._components[
                    f"sip0.cube{cube_id}.pe{pe}.pe_ipcq"
                ].queue_pairs
                assert "global_E" in qp, (
-                f"sip0.cube{cube_id}.pe0 missing global_E"
+                    f"sip0.cube{cube_id}.pe{pe} missing global_E"
            )
            assert "global_W" in qp, (
                f"sip0.cube{cube_id}.pe0 missing global_W"
            )
            if cube_id == root_cube:
                assert qp["global_E"]["peer"].sip != 0, (
                    f"root cube {root_cube} global_E must point to another SIP"
                )
                assert "global_W" in qp
                # Peer must be same (cube, pe) on another SIP.
                assert qp["global_E"]["peer"].sip == 1
                assert qp["global_E"]["peer"].cube == cube_id
                assert qp["global_E"]["peer"].pe == pe
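
The neighbor expectations asserted in these tests follow from row-major indexing on a non-wrapping grid: 2 rows × 4 columns for PEs inside a cube, 4 × 4 for cubes inside a SIP. The sketch below is an illustration of that rule only; `grid_neighbors` is a hypothetical helper and is not the wiring code used by `configure_sfr_intercube_multisip`.

```python
def grid_neighbors(idx, cols, rows, prefix=""):
    """Neighbors of a row-major index on a non-wrapping cols x rows grid.

    Hypothetical helper; directions missing at an edge are omitted.
    """
    r, c = divmod(idx, cols)
    out = {}
    if r > 0:
        out[prefix + "N"] = idx - cols
    if r < rows - 1:
        out[prefix + "S"] = idx + cols
    if c < cols - 1:
        out[prefix + "E"] = idx + 1
    if c > 0:
        out[prefix + "W"] = idx - 1
    return out


# Intra-cube PE grid (2 rows x 4 columns).
print(grid_neighbors(0, cols=4, rows=2, prefix="intra_"))
print(grid_neighbors(5, cols=4, rows=2, prefix="intra_"))
# Intercube mesh (4 x 4).
print(grid_neighbors(5, cols=4, rows=4))
```

The same function covers both levels because only the grid shape and the direction prefix differ; the same-lane rule (peer PE index equals the local PE index) is a separate constraint that this sketch does not model.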
@@ -36,7 +36,7 @@ def _engine():
 def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -0,0 +1,237 @@
 """Phase 1 micro-tests for IPCQ slot-memory latency model.
 These tests assert the TARGET behavior expected after Phase 2 wires
 ``buffer_kind`` (tcm/sram/hbm) into the IPCQ slot read/write latency
 charges. They are written BEFORE the production change and are
 EXPECTED TO FAIL today.
 Failure semantics today:
  - Slot access is latency-free, so the tcm/sram/hbm runs produce
    identical pe_exec_ns. The ordering assertion therefore fails with
    "tcm == sram == hbm" — proving the test harness is wired and that
    Phase 2 production work is what makes them pass.
 Reference (Phase 2 will edit these):
  - src/kernbench/components/builtin/pe_dma.py  — _handle_ipcq_inbound
  - src/kernbench/components/builtin/pe_ipcq.py — _handle_recv,
                                                  _BUFFER_KIND_BW table
  - src/kernbench/runtime_api/kernel.py         — IpcqDmaToken adds
                                                  buffer_kind field
  - ccl.yaml                                    — algorithm.buffer_kind
 The tests reuse the existing config-driven allreduce app
 (``run_allreduce`` in tests/test_allreduce_multidevice.py) with a 6-SIP
 torus_2d topology and a SMALL n_elem so they finish fast (~3-5 s each).
 """
 from __future__ import annotations
 from pathlib import Path
 from typing import Any
 import pytest
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 # Reuse the test app's helpers so this micro-test file does not
 # duplicate the run-allreduce + write-temp-configs plumbing.
 from tests.test_allreduce_multidevice import (
    _write_temp_configs,
    run_allreduce,
 )
 # Expected per-tier (slot intrinsic BW, fixed overhead, PE↔bank hop BW).
 # Slot intrinsic mirrors _BUFFER_KIND_BW in src/kernbench/common/ipcq_types.py.
 # PE↔bank hop reflects topology.yaml link BWs:
 #   - TCM is per-PE local → no hop, encoded as inf.
 #   - SRAM bank sits on cube NoC behind sram_to_router_bw_gbs = 128 GB/s.
 #   - HBM ctrl sits on cube NoC behind hbm_to_router_bw_gbs  = 256 GB/s.
 _EXPECTED_TIER = {
    "tcm":  {"slot_bw_gbs": 512.0, "overhead_ns": 0.0, "bank_hop_bw_gbs": float("inf")},
    "sram": {"slot_bw_gbs": 512.0, "overhead_ns": 2.0, "bank_hop_bw_gbs": 128.0},
    "hbm":  {"slot_bw_gbs": 256.0, "overhead_ns": 6.0, "bank_hop_bw_gbs": 256.0},
 }
 def _expected_slot_io_ns(buffer_kind: str, nbytes: int) -> float:
    """Per-access latency the model is expected to add (write OR read).
    Includes the PE↔bank fabric hop for non-TCM tiers — SRAM and HBM
    live on the cube NoC behind a router-attached link, so each slot
    access pays a fabric drain in addition to the intrinsic slot-IO.
    """
    tier = _EXPECTED_TIER[buffer_kind]
    bank_hop_ns = nbytes / tier["bank_hop_bw_gbs"]
    slot_io_ns = nbytes / tier["slot_bw_gbs"] + tier["overhead_ns"]
    return bank_hop_ns + slot_io_ns
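
As a worked example of the per-access formula above, the following sketch evaluates it for the 8192-byte payload used by the ordering test. The tier table is a standalone copy of `_EXPECTED_TIER` so the block runs on its own; `slot_io_ns` is a hypothetical stand-in for `_expected_slot_io_ns`. Since 1 GB/s equals 1 byte per ns, dividing bytes by GB/s yields nanoseconds directly.

```python
# Standalone copy of the expected tier parameters (illustration only).
TIERS = {
    "tcm":  {"slot_bw_gbs": 512.0, "overhead_ns": 0.0, "bank_hop_bw_gbs": float("inf")},
    "sram": {"slot_bw_gbs": 512.0, "overhead_ns": 2.0, "bank_hop_bw_gbs": 128.0},
    "hbm":  {"slot_bw_gbs": 256.0, "overhead_ns": 6.0, "bank_hop_bw_gbs": 256.0},
}


def slot_io_ns(kind: str, nbytes: int) -> float:
    """Per-access latency: bank hop drain + intrinsic slot IO + overhead."""
    tier = TIERS[kind]
    bank_hop_ns = nbytes / tier["bank_hop_bw_gbs"]   # 0.0 for TCM (inf BW)
    intrinsic_ns = nbytes / tier["slot_bw_gbs"] + tier["overhead_ns"]
    return bank_hop_ns + intrinsic_ns


per_access = {kind: slot_io_ns(kind, 8192) for kind in TIERS}
for kind, ns in sorted(per_access.items(), key=lambda kv: kv[1]):
    print(f"{kind:4s} {ns:6.1f} ns per access")
```

At this payload size the bank hop dominates for SRAM (64 ns of its total), which is what pushes SRAM behind HBM despite sharing TCM's intrinsic slot bandwidth.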
 def _run_torus_allreduce(
    tmp_path: Path, *, buffer_kind: str, n_elem: int,
 ) -> float:
    """Run one torus_2d 6-SIP allreduce and return critical-path
    pe_exec_ns. The buffer_kind override is wired into ccl.yaml.
    """
    sub = tmp_path / f"{buffer_kind}_{n_elem}"
    sub.mkdir()
    topo_path, ccl_path = _write_temp_configs(
        sub,
        sip_topology="torus_2d",
        n_sips=6,
        algorithm="intercube_allreduce",
        sip_w=3, sip_h=2,
        n_elem_override=n_elem,
    )
    # Patch ccl.yaml in-place so the algorithm picks up buffer_kind.
    import yaml
    with open(ccl_path) as f:
        ccl_cfg = yaml.safe_load(f)
    ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
    ccl_cfg.setdefault("algorithms", {}).setdefault(
        "intercube_allreduce", {},
    )["buffer_kind"] = buffer_kind
    with open(ccl_path, "w") as f:
        yaml.dump(ccl_cfg, f, default_flow_style=False)
    topo = resolve_topology(topo_path)
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    spec = topo.topology_obj.spec
    with RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id=f"bk_{buffer_kind}_{n_elem}",
        spec=spec,
    ) as ctx:
        result = run_allreduce(
            ctx, engine, spec,
            algorithm="intercube_allreduce", ccl_yaml=ccl_path,
        )
        assert result["ok_cubes"] > 0, "allreduce did not validate"
    pe_exec_vals = [
        float(tr.get("pe_exec_ns", 0.0) or 0.0)
        for _, (_, tr) in engine._results.items()
        if isinstance(tr, dict)
    ]
    return max(pe_exec_vals) if pe_exec_vals else 0.0
 # ── Phase 1 assertions ───────────────────────────────────────────────
 def test_slot_write_latency_orders_tcm_hbm_sram(tmp_path):
    """tcm < hbm < sram at 8192 B per send.
    The ordering is set by the topology link BWs, NOT the intrinsic slot
    cell rates: SRAM and HBM both live on the cube NoC behind a router
    link, and SRAM's link (128 GB/s) is the narrowest in the system —
    narrower than HBM's (256 GB/s). So once the PE↔bank hop is charged,
    SRAM ends up the slowest tier even though its slot cell array has
    the same intrinsic BW as TCM.
    Pre-fix model misses the PE↔bank hop entirely → assertion FAILS
    (today's ordering is tcm < sram < hbm). Post-fix model includes the
    hop → assertion PASSES.
    """
    n_elem = 4096  # 8192 B per slot
    lat_tcm = _run_torus_allreduce(tmp_path, buffer_kind="tcm",  n_elem=n_elem)
    lat_sram = _run_torus_allreduce(tmp_path, buffer_kind="sram", n_elem=n_elem)
    lat_hbm = _run_torus_allreduce(tmp_path, buffer_kind="hbm",  n_elem=n_elem)
    # Expected per-access deltas (write+read = 2× the per-access value).
    exp_tcm = 2 * _expected_slot_io_ns("tcm",  n_elem * 2)
    exp_sram = 2 * _expected_slot_io_ns("sram", n_elem * 2)
    exp_hbm = 2 * _expected_slot_io_ns("hbm",  n_elem * 2)
    # Floor margin: 50% of the raw expected per-access delta — lets the
    # implementation choose to charge only one side without breaking the
    # test, but still requires a clearly observable gap.
    margin_hbm_tcm = 0.5 * (exp_hbm - exp_tcm)
    margin_sram_hbm = 0.5 * (exp_sram - exp_hbm)
    assert lat_hbm > lat_tcm + margin_hbm_tcm, (
        f"hbm should be slower than tcm by ≥ {margin_hbm_tcm:.1f} ns "
        f"per allreduce, got hbm={lat_hbm:.1f} tcm={lat_tcm:.1f} "
        f"(delta={lat_hbm - lat_tcm:.1f})"
    )
    assert lat_sram > lat_hbm + margin_sram_hbm, (
        f"sram should be slower than hbm by ≥ {margin_sram_hbm:.1f} ns "
        f"per allreduce (sram bank link 128 GB/s is narrower than hbm "
        f"link 256 GB/s), got sram={lat_sram:.1f} hbm={lat_hbm:.1f} "
        f"(delta={lat_sram - lat_hbm:.1f})"
    )
 def test_slot_io_scales_linearly_with_nbytes(tmp_path):
    """For buffer_kind=hbm, doubling nbytes should add ~nbytes/256 ns
    of latency to each slot access. Sanity-checks the slope.
    Pre-Phase-2: latency does not respond to nbytes via memory BW
    (only via fabric drain), so the observed slope is dominated by
    fabric BW and does NOT match 1/256 ns/B.
    """
    lat_4k = _run_torus_allreduce(tmp_path, buffer_kind="hbm", n_elem=2048)
    lat_8k = _run_torus_allreduce(tmp_path, buffer_kind="hbm", n_elem=4096)
    # Expected delta from doubling: at least one slot-IO event per cube
    # in the critical path (very conservative). Per-access add = 4096/256 = 16
    # ns on HBM going from 4k → 8k. Multiple slot accesses on the critical
    # path should make the observed delta meaningfully larger.
    expected_min_delta = 0.5 * (4096 / 256.0)  # ≈ 8 ns
    assert lat_8k - lat_4k > expected_min_delta, (
        f"doubling nbytes on hbm should add ≥ {expected_min_delta:.1f} ns "
        f"of slot-IO latency, got delta={lat_8k - lat_4k:.1f} ns "
        f"(lat_4k={lat_4k:.1f}, lat_8k={lat_8k:.1f})"
    )
 def test_buffer_kind_sensitivity_grows_with_payload(tmp_path):
    """Credit-return cost is fabric-only by design (16 B packet); only
    the data slot-IO charge depends on ``buffer_kind``. Therefore the
    tcm-vs-hbm gap must scale with payload size and be a small fraction
    of the large-payload gap at small payloads.
    Concrete invariant the model must satisfy:
        gap_small / gap_large < 0.10
    Pre-Phase-2: gap_small == gap_large == 0, so the ratio is undefined
    and the test fails on the gap_large > 1000 ns precondition.
    Post-Phase-2: at small nbytes the slot-IO charge is dominated by the
    constant ``overhead_ns`` term, while at large nbytes it is dominated
    by the ``nbytes / bw_gbs`` term, so gap_large grows linearly while
    gap_small stays small.
    """
    n_elem_small = 8        # 16 B per slot — overhead-bound
    n_elem_large = 16384    # 32 KB per slot — bandwidth-bound
    lat_tcm_small = _run_torus_allreduce(
        tmp_path, buffer_kind="tcm", n_elem=n_elem_small,
    )
    lat_hbm_small = _run_torus_allreduce(
        tmp_path, buffer_kind="hbm", n_elem=n_elem_small,
    )
    lat_tcm_large = _run_torus_allreduce(
        tmp_path, buffer_kind="tcm", n_elem=n_elem_large,
    )
    lat_hbm_large = _run_torus_allreduce(
        tmp_path, buffer_kind="hbm", n_elem=n_elem_large,
    )
    gap_small = abs(lat_hbm_small - lat_tcm_small)
    gap_large = abs(lat_hbm_large - lat_tcm_large)
    assert gap_large > 1000.0, (
        f"large-payload buffer_kind gap must be observably large "
        f"(this is the sweep's whole point). got gap_large={gap_large:.1f} ns "
        f"(lat_tcm_large={lat_tcm_large:.1f}, lat_hbm_large={lat_hbm_large:.1f})"
    )
    assert gap_small / gap_large < 0.10, (
        f"buffer_kind sensitivity should grow with payload — "
        f"small-payload gap should be < 10% of large-payload gap. "
        f"got gap_small={gap_small:.1f} ns, gap_large={gap_large:.1f} ns, "
        f"ratio={gap_small / gap_large:.3f}"
    )
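The 10% invariant above can be checked by hand with a per-access cost model of the form `overhead_ns + nbytes / bw_gbs`. The sketch below is only an illustration: the figures (512 GB/s + 0 ns for TCM, 256 GB/s + 6 ns for HBM, and a 256 GB/s HBM router link charged as one extra hop) are the ones quoted in the IPCQ placement micro-test docstring elsewhere in this change, while `slot_io_ns` and `gap_ns` are hypothetical names, not the simulator's API. It models a single slot access, not the end-to-end allreduce latency the test measures.

```python
# Hypothetical per-access cost model: overhead_ns + nbytes / bw_gbs.
# 1 GB/s is treated as 1 byte/ns, matching the docstrings' arithmetic.
INTRINSIC = {"tcm": (512.0, 0.0), "hbm": (256.0, 6.0)}  # (bw_gbs, overhead_ns)
HBM_LINK_BW_GBS = 256.0  # hbm_to_router_bw_gbs as quoted in the docstring


def slot_io_ns(kind: str, nbytes: int) -> float:
    """Intrinsic slot cost, plus one PE<->bank hop for non-local HBM."""
    bw_gbs, overhead_ns = INTRINSIC[kind]
    cost = overhead_ns + nbytes / bw_gbs
    if kind == "hbm":
        cost += nbytes / HBM_LINK_BW_GBS  # assumed extra hop over the link
    return cost


def gap_ns(nbytes: int) -> float:
    """Per-access hbm-vs-tcm gap at the given payload size."""
    return slot_io_ns("hbm", nbytes) - slot_io_ns("tcm", nbytes)


gap_small = gap_ns(16)      # 16 B slot: overhead-bound
gap_large = gap_ns(32768)   # 32 KB slot: bandwidth-bound
ratio = gap_small / gap_large
print(f"gap_small={gap_small:.2f} ns gap_large={gap_large:.2f} ns ratio={ratio:.3f}")
```

Under these assumed figures the small-payload gap is almost entirely the 6 ns overhead term, while the large-payload gap is almost entirely bandwidth, which is the shape the test's ratio check relies on.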
@@ -0,0 +1,208 @@
 """Phase 1 micro-tests for IPCQ slot-memory PHYSICAL placement.
 The current model in ``_BUFFER_KIND_BW`` (src/kernbench/common/ipcq_types.py)
 charges only an intrinsic-memory term for IPCQ slot read/write::
    TCM:  nbytes/512 + 0
    SRAM: nbytes/512 + 2
    HBM:  nbytes/256 + 6
 This treats SRAM and HBM as if they were per-PE local. The topology
 declares the opposite — both live on the cube NoC, behind their own
 router-attached link::
    topology.yaml:130   sram_to_router_bw_gbs: 128.0
    topology.yaml:129   hbm_to_router_bw_gbs:  256.0
 So a correct model must charge a PE→bank fabric drain for SRAM and HBM
 on both ``tl.send`` (writer landing bytes into the cube SRAM/HBM bank
 via PE_DMA → router → bank) and ``tl.recv`` (reader pulling bytes back
 across the same link). TCM stays free of that hop because it is
 genuinely per-PE local.
 The three tests below run the existing torus_2d 6-SIP allreduce harness
 with ``buffer_kind`` flipped between tcm/sram/hbm and assert invariants
 that the post-fix model must satisfy. They are EXPECTED TO FAIL today
 because the simulator under-charges SRAM and HBM by skipping the PE↔bank hop.
 Phase 2 will edit:
  - src/kernbench/components/builtin/pe_ipcq.py   (_handle_recv: add
        compute_drain_ns(pe→bank, nbytes) for sram/hbm)
  - src/kernbench/components/builtin/pe_dma.py    (_handle_ipcq_inbound:
        add second-leg drain for sram/hbm-destined slots)
 Tests must NEVER be weakened to make Phase 2 pass — invariants below
 follow from physics (link BW × payload), so any model reflecting the
 topology will satisfy them by construction.
 """
 from __future__ import annotations
 from pathlib import Path
 import pytest
 import yaml
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 from tests.test_allreduce_multidevice import (
    _write_temp_configs,
    run_allreduce,
 )
 def _run_allreduce_with_buffer_kind(
    tmp_path: Path, *, buffer_kind: str, n_elem: int,
 ) -> float:
    """Run one torus_2d 6-SIP allreduce with the given buffer_kind and
    return critical-path pe_exec_ns (max across all PEs).
    Mirrors the sweep harness in test_allreduce_buffer_kind_sweep.py
    so the assertions below compare apples-to-apples against its plot.
    """
    sub = tmp_path / f"{buffer_kind}_{n_elem}"
    sub.mkdir()
    topo_path, ccl_path = _write_temp_configs(
        sub,
        sip_topology="torus_2d",
        n_sips=6,
        algorithm="intercube_allreduce",
        sip_w=3, sip_h=2,
        n_elem_override=n_elem,
    )
    with open(ccl_path) as f:
        ccl_cfg = yaml.safe_load(f)
    ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
    ccl_cfg.setdefault("algorithms", {}).setdefault(
        "intercube_allreduce", {},
    )["buffer_kind"] = buffer_kind
    with open(ccl_path, "w") as f:
        yaml.dump(ccl_cfg, f, default_flow_style=False)
    topo = resolve_topology(topo_path)
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    spec = topo.topology_obj.spec
    with RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id=f"loc_{buffer_kind}_{n_elem}",
        spec=spec,
    ) as ctx:
        result = run_allreduce(
            ctx, engine, spec,
            algorithm="intercube_allreduce", ccl_yaml=ccl_path,
        )
        assert result["ok_cubes"] > 0, "allreduce did not validate"
    pe_exec_vals = [
        float(tr.get("pe_exec_ns", 0.0) or 0.0)
        for _, (_, tr) in engine._results.items()
        if isinstance(tr, dict)
    ]
    return max(pe_exec_vals) if pe_exec_vals else 0.0
 # ── Phase 1 assertions ───────────────────────────────────────────────
 def test_sram_meaningfully_slower_than_tcm_at_large_payload(tmp_path):
    """At 32 KB / PE the SRAM-backed allreduce must take meaningfully
    longer than the TCM-backed one because every IPCQ slot access goes
    through the 128 GB/s SRAM↔router link, while TCM stays per-PE local.
    Floor justification (physics, not implementation):
        Per-IPCQ-roundtrip the SRAM tier adds 2 × nbytes/128 ns over TCM
        (one PE→SRAM hop on send-inbound, one SRAM→PE hop on recv).
        At 32 KB:  2 × 32768/128 = 512 ns added per slot exchange.
        With ≥ 10 critical-path exchanges in a 6-SIP torus_2d allreduce
        this is ≥ 5_120 ns. The threshold below is half that to leave
        room for differing critical-path counting.
    Pre-Phase-2: gap is constant 48 ns (just the SRAM overhead × 24
    slot accesses); test FAILS.
    Post-Phase-2: gap scales with payload; test PASSES.
    """
    n_elem = 16384  # 32 KB / PE
    lat_tcm = _run_allreduce_with_buffer_kind(
        tmp_path, buffer_kind="tcm", n_elem=n_elem,
    )
    lat_sram = _run_allreduce_with_buffer_kind(
        tmp_path, buffer_kind="sram", n_elem=n_elem,
    )
    delta = lat_sram - lat_tcm
    THRESHOLD_NS = 2_500.0
    assert delta > THRESHOLD_NS, (
        f"SRAM should be ≥ {THRESHOLD_NS:.0f} ns slower than TCM at 32 KB "
        f"because each IPCQ access pays a 128 GB/s PE↔SRAM hop. "
        f"got tcm={lat_tcm:.1f} sram={lat_sram:.1f} delta={delta:.1f} ns"
    )
 def test_sram_tcm_gap_scales_with_payload(tmp_path):
    """The SRAM-vs-TCM gap must grow roughly linearly with payload size.
    Pre-Phase-2: the only difference between TCM and SRAM is the SRAM
    per-access ``overhead_ns = 2``, which does NOT scale with payload —
    so the gap is the same constant 48 ns at 8 KB and at 32 KB. Ratio = 1.
    Post-Phase-2: the dominant term is 2 × nbytes/128 (PE↔SRAM hop on
    write+read) which IS linear in payload. Going 8 KB → 32 KB (4×)
    should produce a gap roughly 4× larger.
    Threshold below is 3× to keep slack for fixed-overhead effects.
    """
    lat_tcm_small = _run_allreduce_with_buffer_kind(
        tmp_path, buffer_kind="tcm", n_elem=4096,    # 8 KB
    )
    lat_sram_small = _run_allreduce_with_buffer_kind(
        tmp_path, buffer_kind="sram", n_elem=4096,
    )
    lat_tcm_large = _run_allreduce_with_buffer_kind(
        tmp_path, buffer_kind="tcm", n_elem=16384,   # 32 KB
    )
    lat_sram_large = _run_allreduce_with_buffer_kind(
        tmp_path, buffer_kind="sram", n_elem=16384,
    )
    gap_small = lat_sram_small - lat_tcm_small
    gap_large = lat_sram_large - lat_tcm_large
    assert gap_small > 0, (
        f"sanity: SRAM should never be FASTER than TCM, "
        f"got gap_small={gap_small:.1f} ns"
    )
    assert gap_large > 3.0 * gap_small, (
        f"4× payload should produce ≥3× SRAM/TCM gap (linear in nbytes "
        f"because of the 128 GB/s PE↔SRAM hop). "
        f"got gap_small={gap_small:.1f} (8KB), gap_large={gap_large:.1f} "
        f"(32KB), ratio={gap_large / max(gap_small, 1e-9):.2f}"
    )
 def test_hbm_pe_hop_charged_at_large_payload(tmp_path):
    """At 32 KB / PE the HBM-vs-TCM gap must exceed the gap that comes
    purely from HBM's 256 GB/s intrinsic slot-IO disadvantage.
    Pre-Phase-2 the entire HBM/TCM gap is just the slot-IO term
    (24 × (nbytes/512 + 6) ≈ 1_700 ns at 32 KB). Post-fix adds another
    24 × (nbytes/256) × 2 ≈ 6_144 ns from the PE↔HBM hop on send and
    recv, so the total HBM/TCM gap should clearly clear 4 µs.
    """
    n_elem = 16384  # 32 KB / PE
    lat_tcm = _run_allreduce_with_buffer_kind(
        tmp_path, buffer_kind="tcm", n_elem=n_elem,
    )
    lat_hbm = _run_allreduce_with_buffer_kind(
        tmp_path, buffer_kind="hbm", n_elem=n_elem,
    )
    delta = lat_hbm - lat_tcm
    THRESHOLD_NS = 4_000.0
    assert delta > THRESHOLD_NS, (
        f"HBM should be ≥ {THRESHOLD_NS:.0f} ns slower than TCM at 32 KB "
        f"once the 256 GB/s PE↔HBM hop is charged on each IPCQ access. "
        f"got tcm={lat_tcm:.1f} hbm={lat_hbm:.1f} delta={delta:.1f} ns"
    )
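The floor quoted in the SRAM test's docstring can be reproduced with a few lines of arithmetic. This is a sketch of the docstring's reasoning only: `hop_ns` and `sram_floor_ns` are hypothetical helper names, and the count of 10 critical-path exchanges is the docstring's own lower bound rather than a measured value.

```python
SRAM_LINK_BW_GBS = 128.0  # sram_to_router_bw_gbs as quoted in the docstring


def hop_ns(nbytes: int, link_bw_gbs: float) -> float:
    """Drain time of one PE<->bank hop; 1 GB/s is taken as 1 byte/ns."""
    return nbytes / link_bw_gbs


def sram_floor_ns(nbytes: int, exchanges: int = 10) -> float:
    """Added latency over TCM: one hop on send-inbound, one on recv."""
    return exchanges * 2 * hop_ns(nbytes, SRAM_LINK_BW_GBS)


floor_32k = sram_floor_ns(32 * 1024)  # 10 exchanges x 512 ns
floor_8k = sram_floor_ns(8 * 1024)    # 10 exchanges x 128 ns
print(floor_32k, floor_32k / 2, floor_32k / floor_8k)
```

Half of the 32 KB floor lands just above the 2_500 ns threshold used in the first test, and the 8 KB to 32 KB ratio is exactly 4 in this idealized model, which the scaling test relaxes to 3 to leave room for fixed overheads.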
@@ -0,0 +1,62 @@
 """ADR-0009 D5: synchronized launch barrier.
 M_CPU stamps KernelLaunchMsg with target_start_ns = env.now + max path
 latency; PE_CPU yields until that time before recording pe_exec_start.
 Every PE in a single launch MUST begin kernel execution at the same
 env.now regardless of its dispatch path length.
 We verify this indirectly: for a no-op kernel, pe_exec_ns = env.now -
 pe_exec_start. If every PE's pe_exec_start is identical and every PE
 runs the same no-op body, every pe_exec_ns value must be identical.
 Without D5, pe_exec_start varies by dispatch-path length and so does
 pe_exec_ns.
 """
 from __future__ import annotations
 from pathlib import Path
 import numpy as np
 from kernbench.policy.placement.dp import DPPolicy
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
 def test_kernel_launch_sync_all_pes_have_equal_exec_time():
    """No-op kernel: every PE's pe_exec_ns must be identical under D5."""
    topo = resolve_topology(str(TOPOLOGY_PATH))
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    spec = topo.topology_obj.spec
    with RuntimeContext(engine=engine, target_device=DeviceSelector("all"),
                        correlation_id="sync_test", spec=spec) as ctx:
        dp = DPPolicy(cube="row_wise", pe="column_wise",
                      num_cubes=16, num_pes=8)
        def kernel(t_ptr, n_elem, tl):
            pass  # no-op
        ctx.ahbm.set_device(0)
        t = ctx.zeros((16, 8 * 64), dtype="f16", dp=dp, name="probe")
        t.copy_(ctx.from_numpy(np.zeros((16, 8 * 64), dtype=np.float16)))
        pending = ctx.launch("sync_probe", kernel, t, 64, _defer_wait=True)
        for h, _sip, meta in pending:
            ctx.wait(h, _meta=meta)
        pe_exec_vals = []
        for h, _sip, _meta in pending:
            _, trace = engine.get_completion(h)
            if trace and trace.get("pe_exec_ns") is not None:
                pe_exec_vals.append(float(trace["pe_exec_ns"]))
    assert pe_exec_vals, "expected completion traces with pe_exec_ns"
    spread = max(pe_exec_vals) - min(pe_exec_vals)
    assert spread < 1e-6, (
        f"ADR-0009 D5 violated: pe_exec_ns spread across PEs = "
        f"{spread:.6f} ns (expected 0). Values: {pe_exec_vals}"
    )
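The barrier rule from the module docstring can be illustrated without the simulator. The sketch below is a toy under stated assumptions, not the M_CPU / PE_CPU implementation: `start_times` and the latency values are hypothetical, and only the rule "target_start_ns = now + max path latency, and each PE waits until then" is taken from the docstring.

```python
from __future__ import annotations


def start_times(dispatch_ns: dict[str, float], now_ns: float = 0.0,
                synchronized: bool = True) -> dict[str, float]:
    """Return the time at which each PE begins kernel execution."""
    target_start_ns = now_ns + max(dispatch_ns.values())
    starts = {}
    for pe, latency in dispatch_ns.items():
        arrival = now_ns + latency
        # With the barrier, a PE that arrives early waits until the target.
        starts[pe] = max(arrival, target_start_ns) if synchronized else arrival
    return starts


# Hypothetical dispatch-path latencies for three PEs, in ns.
paths = {"pe0": 12.0, "pe1": 30.0, "pe2": 21.5}
synced = start_times(paths)
unsynced = start_times(paths, synchronized=False)
spread_synced = max(synced.values()) - min(synced.values())
spread_unsynced = max(unsynced.values()) - min(unsynced.values())
print(spread_synced, spread_unsynced)
```

With the barrier every PE starts at the same instant, so the spread collapses to zero; without it the spread equals the difference between the longest and shortest dispatch paths.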
@@ -38,7 +38,7 @@ def _engine():
 def _hbm_pa(sip=0, cube=0, pe_id=0):
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -53,7 +53,7 @@ def _engine():
 def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -0,0 +1,741 @@
 """Diagnostic for the inter-cube RAW > IPCQ asymmetry on h3/h4 plots.
 Single-shot run at h3 (sip0.cube0.pe0 -> sip0.cube1.pe0), nbytes=4096.
 Captures per-PE pe_exec_ns and the actual path / drain / per-node overhead
 breakdown for the RAW sub-txn (PE_DMA -> remote HBM_CTRL) vs the IPCQ
 outbound sub-txn (PE_DMA -> peer PE_DMA), so we can localize the gap to
 one of:
    (a) drain at HBM-BW (RAW) vs fabric-BW (IPCQ)
    (b) path-length / per-node overhead asymmetry
    (c) RAW SRC paying tl.load (local HBM read) on top of remote tl.store
        while IPCQ DST only pays inbound traversal+drain.
 Phase 1 / test-only. No production code is modified.
 """
 from __future__ import annotations
 import os
 from pathlib import Path
 import numpy as np
 import pytest
 from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
 from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
 from kernbench.policy.placement.dp import DPPolicy
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
 # Allow the test to be re-run for h4 (inter-cube vertical) at multiple sizes
 # to investigate why IPCQ slope flattens past 8192 B (path may differ).
 NBYTES = int(os.environ.get("DIAG_NBYTES", "4096"))
 ELEM_BYTES = 2
 N_ELEM = NBYTES // ELEM_BYTES
 N_CUBES = 16
 N_PES = 8
 HOP = os.environ.get("DIAG_HOP", "h3")
 if HOP == "h4":
    SRC = (0, 0, 0)
    DST = (0, 4, 0)  # h4 inter-cube vertical
 else:
    SRC = (0, 0, 0)
    DST = (0, 1, 0)  # h3 inter-cube horizontal
 # ── Per-PE pe_exec_ns capture via monkey-patch ───────────────────────
 def _install_barrier_capture():
    """Wrap PeCpuComponent._execute_kernel to log, for every PE that
    enters: env.now at entry, the target_start_ns the request carried,
    whether the barrier yield was skipped (target_start_ns is missing or
    <= env.now), and how late the PE arrived (delta_late_ns).
    Returns (log, restore_callable).
    """
    import kernbench.components.builtin.pe_cpu as pe_cpu_mod
    log: list[dict] = []
    original = pe_cpu_mod.PeCpuComponent._execute_kernel
    def patched(self, env, txn):
        request = txn.request
        target_start = getattr(request, "target_start_ns", None)
        entry_now = float(env.now)
        log_entry = {
            "node_id": self.node.id,
            "entry_now": entry_now,
            "target_start_ns": (
                float(target_start) if target_start is not None else None
            ),
            "barrier_skipped": (
                target_start is None
                or float(target_start) <= entry_now
            ),
            "delta_late_ns": (
                None if target_start is None
                else max(0.0, entry_now - float(target_start))
            ),
        }
        log.append(log_entry)
        yield from original(self, env, txn)
    pe_cpu_mod.PeCpuComponent._execute_kernel = patched
    def restore():
        pe_cpu_mod.PeCpuComponent._execute_kernel = original
    return log, restore
 def _install_per_pe_capture():
    """Wrap PeCpuComponent._execute_kernel so we record (node_id ->
    pe_exec_ns) for every PE that executes a kernel during the run.
    Returns (capture_dict, restore_callable).
    """
    import kernbench.components.builtin.pe_cpu as pe_cpu_mod
    captured: dict[str, float] = {}
    original = pe_cpu_mod.PeCpuComponent._execute_kernel
    def patched(self, env, txn):
        gen = original(self, env, txn)
        try:
            value = yield from gen
        finally:
            v = txn.result_data.get("pe_exec_ns")
            if v is not None:
                captured[self.node.id] = float(v)
        return value
    pe_cpu_mod.PeCpuComponent._execute_kernel = patched
    def restore():
        pe_cpu_mod.PeCpuComponent._execute_kernel = original
    return captured, restore
 def _install_recv_capture(target_node_id: str):
    """Wrap PeIpcqComponent._handle_recv to log entry/exit times and the
    peer_head_cache/my_tail values seen at the start.
    This pins down whether recv ever blocked on a wait_event, or whether
    it consumed without waiting (i.e. peer_head_cache > my_tail at entry).
    """
    import kernbench.components.builtin.pe_ipcq as pe_ipcq_mod
    log: list[dict] = []
    original = pe_ipcq_mod.PeIpcqComponent._handle_recv
    def patched(self, env, req, cmd):
        if self.node.id != target_node_id:
            yield from original(self, env, req, cmd)
            return
        # Snapshot state before dispatch
        d = cmd.direction
        qp = self._queue_pairs.get(d, {})
        log.append({
            "phase": "enter",
            "t": float(env.now),
            "direction": d,
            "peer_head_cache": qp.get("peer_head_cache"),
            "my_tail": qp.get("my_tail"),
        })
        yield from original(self, env, req, cmd)
        qp = self._queue_pairs.get(d, {})
        log.append({
            "phase": "exit",
            "t": float(env.now),
            "direction": d,
            "peer_head_cache": qp.get("peer_head_cache"),
            "my_tail": qp.get("my_tail"),
        })
    pe_ipcq_mod.PeIpcqComponent._handle_recv = patched
    def restore():
        pe_ipcq_mod.PeIpcqComponent._handle_recv = original
    return log, restore
 def _install_meta_arrival_capture(target_node_id: str):
    """Log every IpcqMetaArrival that lands on ``target_node_id`` PE_IPCQ.
    Records (env_now, sender_seq, dst_addr, matched_direction,
    peer_head_cache_before, my_tail_before).
    """
    import kernbench.components.builtin.pe_ipcq as pe_ipcq_mod
    log: list[dict] = []
    original = pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival
    def patched(self, msg):
        if self.node.id == target_node_id:
            token = msg.token
            # The component may not store ``_env``; prefer the SimPy
            # environment referenced by the inbox, then ``self._env``.
            try:
                now = float(self._inbox._env.now)
            except Exception:
                env = getattr(self, "_env", None)
                now = float(env.now) if env is not None else 0.0
            entry = {
                "t": now,
                "sender_seq": getattr(token, "sender_seq", None),
                "dst_addr": getattr(token, "dst_addr", None),
                "src_sip": getattr(token, "src_sip", None),
                "src_cube": getattr(token, "src_cube", None),
                "src_pe": getattr(token, "src_pe", None),
                "src_direction": getattr(token, "src_direction", None),
                "nbytes": getattr(token, "nbytes", None),
                "matched_direction": None,
                "peer_head_cache_before": {},
                "my_tail_before": {},
            }
            for d, qp in self._queue_pairs.items():
                entry["peer_head_cache_before"][d] = qp["peer_head_cache"]
                entry["my_tail_before"][d] = qp["my_tail"]
                base = qp["my_rx_base_pa"]
                size = qp["n_slots"] * qp["slot_size"]
                # dst_addr defaults to None when the token lacks the field.
                dst = entry["dst_addr"]
                if dst is not None and base <= dst < base + size:
                    entry["matched_direction"] = d
            log.append(entry)
        return original(self, msg)
    pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival = patched
    def restore():
        pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival = original
    return log, restore
 def _snapshot_qp_state(engine, target_node_id: str) -> dict:
    """Snapshot every direction's qp state on the target PE_IPCQ now.
    Captures peer_head_cache, my_tail, my_rx_base_pa, n_slots, slot_size
    for each installed direction.
    """
    comp = engine._components.get(target_node_id)
    if comp is None:
        return {}
    return {
        d: {
            "peer_head_cache": qp["peer_head_cache"],
            "my_tail": qp["my_tail"],
            "my_rx_base_pa": qp["my_rx_base_pa"],
            "n_slots": qp["n_slots"],
            "slot_size": qp["slot_size"],
            "rx_range": (
                qp["my_rx_base_pa"],
                qp["my_rx_base_pa"] + qp["n_slots"] * qp["slot_size"],
            ),
        }
        for d, qp in comp.queue_pairs.items()
    }
 # ── Path / drain breakdown using engine ctx ──────────────────────────
 def _path_breakdown(ctx, path: list[str], nbytes: int) -> dict:
    edge_total_ns = 0.0
    edge_details = []
    min_bw = float("inf")
    for i in range(len(path) - 1):
        edge = ctx.edge_map.get((path[i], path[i + 1]))
        if edge is None:
            edge_details.append((path[i], path[i + 1], None, None, None))
            continue
        prop_ns = edge.distance_mm * ctx.ns_per_mm
        edge_total_ns += prop_ns
        bw = getattr(edge, "bw_gbs", None) or 0.0
        if bw > 0 and bw < min_bw:
            min_bw = bw
        edge_details.append(
            (path[i], path[i + 1], edge.distance_mm, prop_ns, bw),
        )
    overhead_total_ns = 0.0
    overhead_details = []
    for nid in path:
        oh = float(ctx.node_overhead_ns.get(nid, 0.0))
        overhead_total_ns += oh
        overhead_details.append((nid, oh))
    drain_ns = ctx.compute_drain_ns(path, nbytes)
    bottleneck_bw = None if min_bw == float("inf") else min_bw
    return {
        "path": path,
        "edges": edge_details,
        "edge_total_ns": edge_total_ns,
        "overheads": overhead_details,
        "overhead_total_ns": overhead_total_ns,
        "drain_ns": drain_ns,
        "bottleneck_bw_gbs": bottleneck_bw,
        "expected_total_ns": edge_total_ns + overhead_total_ns + drain_ns,
    }
 def _print_breakdown(label: str, br: dict) -> None:
    print(f"\n  {label}")
    print(f"    path ({len(br['path'])} nodes):")
    for nid in br["path"]:
        print(f"      - {nid}")
    print("    edges (prop. delay):")
    for src, dst, dist_mm, prop_ns, bw in br["edges"]:
        if dist_mm is None:
            print(f"      ! {src} -> {dst}  EDGE NOT FOUND IN edge_map")
            continue
        print(
            f"      {src} -> {dst}  "
            f"dist={dist_mm:.3f}mm  prop={prop_ns:.2f}ns  "
            f"bw={bw or 0:.2f}GB/s"
        )
    print("    per-node overhead_ns:")
    for nid, oh in br["overheads"]:
        if oh > 0:
            print(f"      {nid:<60s}  overhead_ns={oh:.2f}")
    print(f"    edge_total_ns      = {br['edge_total_ns']:.2f}")
    print(f"    overhead_total_ns  = {br['overhead_total_ns']:.2f}")
    print(f"    bottleneck_bw_gbs  = {br['bottleneck_bw_gbs']}")
    print(f"    drain_ns (nbytes={NBYTES}) = {br['drain_ns']:.2f}")
    print(f"    expected_total_ns  = {br['expected_total_ns']:.2f}")
 # ── RAW path scenario ────────────────────────────────────────────────
 def _dump_src_op_records(engine, src_sip, src_cube, src_pe, label) -> None:
    """Print op_logger records for ops on the SRC PE.
    The op log captures t_start/t_end for memory/math/gemm/copy ops on
    every component, so we can see how long tl.load vs tl.store vs
    tl.send actually took at the engine level.
    """
    op_logger = getattr(engine, "_op_logger", None)
    if op_logger is None:
        print(f"  ({label}) op_logger not available")
        return
    src_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}."
    recs = [r for r in op_logger.records if r.component_id.startswith(src_prefix)]
    print(f"  ({label}) op_logger records on SRC PE ({src_prefix}*):")
    for r in recs[:40]:
        dur = r.t_end - r.t_start
        comp_short = r.component_id.replace(src_prefix, "")
        params_short = ""
        if "nbytes" in r.params:
            params_short = f" nbytes={r.params['nbytes']}"
        if "src_addr" in r.params:
            params_short += f" src_addr={r.params['src_addr']}"
        if "dst_addr" in r.params:
            params_short += f" dst_addr={r.params['dst_addr']}"
        print(
            f"    t=[{r.t_start:7.2f}..{r.t_end:7.2f}] dur={dur:6.2f}ns  "
            f"{comp_short:<25s} {r.op_kind:<8s} {r.op_name:<12s}{params_short}"
        )
 def _run_raw():
    captured, restore = _install_per_pe_capture()
    try:
        topo = resolve_topology(str(TOPOLOGY_PATH))
        engine = GraphEngine(topo.topology_obj, enable_data=True)
        spec = topo.topology_obj.spec
        src_sip, src_cube, src_pe = SRC
        dst_sip, dst_cube, dst_pe = DST
        assert src_sip == dst_sip
        src_off = (src_cube * N_PES + src_pe) * N_ELEM * ELEM_BYTES
        dst_off = (dst_cube * N_PES + dst_pe) * N_ELEM * ELEM_BYTES
        with RuntimeContext(
            engine=engine,
            target_device=DeviceSelector("all"),
            correlation_id="diag_raw",
            spec=spec,
        ) as rt:
            dp = DPPolicy(
                cube="row_wise", pe="column_wise",
                num_cubes=N_CUBES, num_pes=N_PES,
            )
            rt.ahbm.set_device(src_sip)
            t = rt.zeros(
                (N_CUBES, N_PES * N_ELEM), dtype="f16",
                dp=dp, name="raw_tensor",
            )
            t.copy_(rt.from_numpy(
                np.full((N_CUBES, N_PES * N_ELEM), 1.0, dtype=np.float16),
            ))
            def kernel(t_ptr, n_elem, tl):
                pe_id = tl.program_id(axis=0)
                cube_id = tl.program_id(axis=1)
                if cube_id == src_cube and pe_id == src_pe:
                    data = tl.load(
                        t_ptr + src_off, shape=(n_elem,), dtype="f16",
                    )
                    tl.store(t_ptr + dst_off, data)
            pending = rt.launch(
                "diag_raw_kernel", kernel, t, N_ELEM, _defer_wait=True,
            )
            for h, _sip, meta in pending:
                rt.wait(h, _meta=meta)
        # Compute the RAW sub-txn path: src PE_DMA -> dst HBM_CTRL.
        ctx = next(iter(engine._components.values())).ctx
        src_pe_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}"
        # The raw store issues a DmaWriteCmd on the dst VA, which the
        # engine translates via PE_MMU; the real dst PA depends on the
        # AHBM allocator. For this diagnostic we approximate the
        # destination as the dst cube's HBM controller and show the
        # breakdown for src PE_DMA -> first reachable HBM node there.
        dst_hbm_id = f"sip{dst_sip}.cube{dst_cube}.hbm_ctrl"
        if dst_hbm_id not in engine._components:
            # try alternate naming
            for nid in engine._components.keys():
                if (
                    nid.startswith(f"sip{dst_sip}.cube{dst_cube}.")
                    and "hbm" in nid
                ):
                    dst_hbm_id = nid
                    break
        # find_path() prepends ".pe_dma" to src_pe automatically
        try:
            raw_path = ctx.router.find_path(src_pe_prefix, dst_hbm_id)
        except Exception as e:
            raw_path = []
            print(f"  WARN: find_path raw failed: {e}")
        if not raw_path:
            # Try other HBM-related node names in dst cube
            for nid in engine._components.keys():
                if not nid.startswith(f"sip{dst_sip}.cube{dst_cube}."):
                    continue
                if "hbm" not in nid:
                    continue
                try:
                    p = ctx.router.find_path(src_pe_prefix, nid)
                except Exception:
                    p = []
                if p:
                    raw_path = p
                    print(f"  (fallback raw dst node: {nid})")
                    break
        return captured, ctx, raw_path, engine
    finally:
        restore()
 # ── IPCQ path scenario ───────────────────────────────────────────────
 def _run_ipcq():
    captured, restore = _install_per_pe_capture()
    dst_pe_ipcq_id = (
        f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_ipcq"
    )
    arrival_log, restore_arrival = _install_meta_arrival_capture(
        dst_pe_ipcq_id,
    )
    recv_log, restore_recv = _install_recv_capture(dst_pe_ipcq_id)
    barrier_log, restore_barrier = _install_barrier_capture()
    try:
        topo = resolve_topology(str(TOPOLOGY_PATH))
        engine = GraphEngine(topo.topology_obj, enable_data=True)
        spec = topo.topology_obj.spec
        src_sip, src_cube, src_pe = SRC
        dst_sip, dst_cube, dst_pe = DST
        cfg = load_ccl_config()
        merged = resolve_algorithm_config(cfg, name="intercube_allreduce")
        merged["slot_size"] = max(int(merged.get("slot_size", 4096)), NBYTES)
        with RuntimeContext(
            engine=engine,
            target_device=DeviceSelector("all"),
            correlation_id="diag_ipcq",
            spec=spec,
        ) as rt:
            configure_sfr_intercube_multisip(engine, spec, merged)
            dp = DPPolicy(
                cube="row_wise", pe="column_wise",
                num_cubes=N_CUBES, num_pes=N_PES,
            )
            def kernel(t_ptr, n_elem, tl):
                pe_id = tl.program_id(axis=0)
                cube_id = tl.program_id(axis=1)
                if cube_id == src_cube and pe_id == src_pe:
                    data = tl.load(t_ptr, shape=(n_elem,), dtype="f16")
                    tl.send(dir=("E" if HOP == "h3" else "S"), src=data)
                elif cube_id == dst_cube and pe_id == dst_pe:
                    tl.recv(
                        dir=("W" if HOP == "h3" else "N"),
                        shape=(n_elem,), dtype="f16",
                    )
            tensors = []
            for s in sorted({src_sip, dst_sip}):
                rt.ahbm.set_device(s)
                t = rt.zeros(
                    (N_CUBES, N_PES * N_ELEM), dtype="f16",
                    dp=dp, name=f"sip{s}",
                )
                t.copy_(rt.from_numpy(
                    np.full((N_CUBES, N_PES * N_ELEM), 1.0, dtype=np.float16),
                ))
                tensors.append(t)
            all_pending = []
            for tt in tensors:
                pending = rt.launch(
                    "diag_ipcq_kernel", kernel, tt, N_ELEM, _defer_wait=True,
                )
                all_pending.extend(pending)
            for h, _sip, meta in all_pending:
                rt.wait(h, _meta=meta)
        ctx = next(iter(engine._components.values())).ctx
        src_pe_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}"
        dst_pe_dma = f"sip{dst_sip}.cube{dst_cube}.pe{dst_pe}.pe_dma"
        try:
            ipcq_path = ctx.router.find_path(src_pe_prefix, dst_pe_dma)
        except Exception as e:
            ipcq_path = []
            print(f"  WARN: find_path ipcq failed: {e}")
        # Snapshot DST PE_IPCQ qp state at end-of-run so we can see what
        # peer_head_cache/my_tail looked like (and at which directions).
        qp_state = _snapshot_qp_state(engine, dst_pe_ipcq_id)
        return (captured, ctx, ipcq_path, engine,
                arrival_log, qp_state, recv_log, barrier_log)
    finally:
        restore_barrier()
        restore_recv()
        restore_arrival()
        restore()
 # ── Test entry ───────────────────────────────────────────────────────
@pytest.mark.diagnostic
 def test_pe_to_pe_diagnostic_h3():
    print("\n" + "=" * 78)
    print(f"  Diagnostic: h3 inter-cube horizontal, nbytes={NBYTES}")
    print(f"  src={SRC}  dst={DST}")
    print("=" * 78)
    # ── RAW scenario
    print("\n[RAW] tl.load + tl.store (sender pays both legs)")
    raw_per_pe, raw_ctx, raw_path, raw_engine = _run_raw()
    print(f"  per-PE pe_exec_ns ({len(raw_per_pe)} entries):")
    src_id = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}.pe_cpu"
    dst_id = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_cpu"
    for nid in (src_id, dst_id):
        if nid in raw_per_pe:
            print(f"    {nid:<60s}  {raw_per_pe[nid]:.2f} ns  <-- key PE")
    nonzero = {k: v for k, v in raw_per_pe.items() if v > 0.5}
    if nonzero:
        print("  other PEs with pe_exec_ns > 0.5 ns:")
        for nid, v in sorted(nonzero.items(), key=lambda kv: -kv[1])[:6]:
            if nid not in (src_id, dst_id):
                print(f"    {nid:<60s}  {v:.2f} ns")
    print(f"  max(pe_exec_ns) = "
          f"{max(raw_per_pe.values()) if raw_per_pe else 0:.2f} ns")
    if raw_path:
        br = _path_breakdown(raw_ctx, raw_path, NBYTES)
        _print_breakdown("RAW sub-txn path (src.pe_dma -> dst.hbm_ctrl)", br)
    _dump_src_op_records(raw_engine, *SRC, "RAW")
    # ── IPCQ scenario
    print("\n[IPCQ] tl.send + tl.recv (recv pays inbound traversal+drain)")
    (ipcq_per_pe, ipcq_ctx, ipcq_path, ipcq_engine,
     arrival_log, qp_state, recv_log, barrier_log) = _run_ipcq()
    print(f"\n  [BARRIER LOG] {len(barrier_log)} _execute_kernel entries:")
    src_id = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}.pe_cpu"
    dst_id = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_cpu"
    n_skipped = 0
    src_entry = None
    dst_entry = None
    for e in barrier_log:
        if e["barrier_skipped"]:
            n_skipped += 1
        if e["node_id"] == src_id:
            src_entry = e
        if e["node_id"] == dst_id:
            dst_entry = e
    print(f"    PEs entering _execute_kernel: {len(barrier_log)}")
    print(f"    PEs that SKIPPED barrier (env.now > target_start): {n_skipped}")
    if src_entry:
        print(
            f"    SRC pe ({src_id}): entry_now={src_entry['entry_now']:.2f}  "
            f"target_start={src_entry['target_start_ns']:.2f}  "
            f"skipped={src_entry['barrier_skipped']}  "
            f"late_ns={src_entry['delta_late_ns']:.2f}"
        )
    if dst_entry:
        print(
            f"    DST pe ({dst_id}): entry_now={dst_entry['entry_now']:.2f}  "
            f"target_start={dst_entry['target_start_ns']:.2f}  "
            f"skipped={dst_entry['barrier_skipped']}  "
            f"late_ns={dst_entry['delta_late_ns']:.2f}"
        )
    # Top 5 latest arrivals
    sorted_late = sorted(
        [e for e in barrier_log if e["delta_late_ns"] is not None],
        key=lambda e: -e["delta_late_ns"],
    )[:5]
    print("    Top 5 latest PE arrivals (positive = barrier missed):")
    for e in sorted_late:
        if e["delta_late_ns"] > 0:
            print(
                f"      {e['node_id']}: late by {e['delta_late_ns']:.2f} ns  "
                f"(entry={e['entry_now']:.2f}, target={e['target_start_ns']:.2f})"
            )
    print(f"\n  [RECV LOG on dst pe_ipcq] {len(recv_log)} entries:")
    for e in recv_log:
        print(
            f"    {e['phase']:5s} t={e['t']:8.2f} ns  "
            f"dir={e['direction']}  "
            f"peer_head_cache={e['peer_head_cache']}  "
            f"my_tail={e['my_tail']}"
        )
    print(f"\n  [META-ARRIVAL LOG on dst pe_ipcq] {len(arrival_log)} arrivals:")
    for i, e in enumerate(arrival_log):
        print(
            f"    #{i:2d} t={e['t']:8.2f} ns  "
            f"src=(sip{e['src_sip']},cube{e['src_cube']},pe{e['src_pe']}) "
            f"dir={e['src_direction']}  "
            f"sender_seq={e['sender_seq']} "
            f"matched_dir={e['matched_direction']} "
            f"nbytes={e['nbytes']}"
        )
        for d, ph in e["peer_head_cache_before"].items():
            mt = e["my_tail_before"][d]
            if ph != 0 or mt != 0 or d == e["matched_direction"]:
                print(
                    f"        before: dir={d}  peer_head_cache={ph}  my_tail={mt}"
                )
    print("\n  [QP STATE END-OF-RUN on dst pe_ipcq]:")
    for d, st in qp_state.items():
        print(
            f"    dir={d}  peer_head_cache={st['peer_head_cache']}  "
            f"my_tail={st['my_tail']}  rx_range=[{st['rx_range'][0]}..."
            f"{st['rx_range'][1]})  n_slots={st['n_slots']} "
            f"slot_size={st['slot_size']}"
        )
    print(f"  per-PE pe_exec_ns ({len(ipcq_per_pe)} entries):")
    for nid in (src_id, dst_id):
        if nid in ipcq_per_pe:
            print(f"    {nid:<60s}  {ipcq_per_pe[nid]:.2f} ns  <-- key PE")
    nonzero = {k: v for k, v in ipcq_per_pe.items() if v > 0.5}
    if nonzero:
        print("  other PEs with pe_exec_ns > 0.5 ns:")
        for nid, v in sorted(nonzero.items(), key=lambda kv: -kv[1])[:6]:
            if nid not in (src_id, dst_id):
                print(f"    {nid:<60s}  {v:.2f} ns")
    print(f"  max(pe_exec_ns) = "
          f"{max(ipcq_per_pe.values()) if ipcq_per_pe else 0:.2f} ns")
    if ipcq_path:
        br = _path_breakdown(ipcq_ctx, ipcq_path, NBYTES)
        _print_breakdown("IPCQ sub-txn path (src.pe_dma -> peer.pe_dma)", br)
    _dump_src_op_records(ipcq_engine, *SRC, "IPCQ")
    _dump_src_op_records(ipcq_engine, *DST, "IPCQ DST")
    # ── Credit-return path analysis (where the missing IPCQ "ack" lives)
    print("\n" + "-" * 78)
    print("Credit-return path (current modeling)")
    print("-" * 78)
    src_pe_prefix = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}"
    dst_pe_prefix = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}"
    # PE_IPCQ._credit_latency_ns calls
    #     ctx.router.find_path(self._pe_prefix, peer_pe_prefix)
    # where the *destination* lacks the ".pe_dma" suffix. find_path()
    # only auto-appends to the source, so this raises -> the except
    # clause silently returns 0.0. Effectively credit latency = 0.
    try:
        ipcq_ctx.router.find_path(dst_pe_prefix, src_pe_prefix)
    except Exception as e:
        print("  CONFIRMED BUG in _credit_latency_ns: dest lacks '.pe_dma' "
              "-> find_path raises -> caught exception -> returns 0.0")
        print(f"  Error: {e}")
    else:
        print("  find_path accepted a dest without '.pe_dma'; "
              "credit-latency bug not reproduced")
    # The intended credit path is recv -> sender (reverse data direction)
    try:
        credit_path = ipcq_ctx.router.find_path(
            dst_pe_prefix, f"{src_pe_prefix}.pe_dma",
        )
    except Exception as e:
        credit_path = []
        print(f"  WARN: corrected find_path credit failed: {e}")
    if credit_path:
        credit_size = 16  # PE_IPCQ default _credit_size_bytes
        # Today's modeling: drain only, 16 bytes -> ~0.125 ns
        cur = ipcq_ctx.compute_drain_ns(credit_path, credit_size)
        # Proposed modeling: full path latency (edges + node overhead + drain)
        proposed = ipcq_ctx.compute_path_latency_ns(credit_path, credit_size)
        print(f"  credit path nodes = {len(credit_path)} (recv -> sender)")
        for nid in credit_path[:6]:
            print(f"    {nid}")
        if len(credit_path) > 6:
            print(f"    ... {len(credit_path) - 6} more nodes")
        br = _path_breakdown(ipcq_ctx, credit_path, credit_size)
        print(f"  edge_total_ns       = {br['edge_total_ns']:.2f}")
        print(f"  overhead_total_ns   = {br['overhead_total_ns']:.2f}")
        print(f"  drain_ns(16 bytes)  = {br['drain_ns']:.2f}")
        print(f"  CURRENT _credit_latency_ns (drain only) = {cur:.3f} ns")
        print(f"  PROPOSED            (compute_path_latency_ns) = {proposed:.2f} ns")
        print(f"  delta               = {proposed - cur:+.2f} ns")
    # ── Comparison summary
    print("\n" + "-" * 78)
    print("Summary")
    print("-" * 78)
    raw_max = max(raw_per_pe.values()) if raw_per_pe else 0.0
    ipcq_max = max(ipcq_per_pe.values()) if ipcq_per_pe else 0.0
    print(f"  RAW  max(pe_exec_ns)              = {raw_max:.2f} ns")
    print(f"  IPCQ max(pe_exec_ns) (current)    = {ipcq_max:.2f} ns")
    print(f"  delta (RAW - IPCQ current)        = {raw_max - ipcq_max:+.2f} ns")
    if credit_path:
        ipcq_with_credit = ipcq_max + (proposed - cur)
        print(
            f"  IPCQ projected w/ blocking credit + full path overhead "
            f"= {ipcq_with_credit:.2f} ns"
        )
        print(
            f"  delta (RAW - IPCQ projected)      = "
            f"{raw_max - ipcq_with_credit:+.2f} ns  "
            f"(<= 0 means IPCQ >= RAW)"
        )
    # Observational test: only sanity-check that both scenarios recorded data.
    assert raw_per_pe, "no RAW pe_exec_ns recorded"
    assert ipcq_per_pe, "no IPCQ pe_exec_ns recorded"
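
The credit-return analysis above hinges on one failure mode: a latency helper that swallows a routing exception and reports 0.0. The sketch below reproduces that pattern in isolation. `ToyRouter`, `credit_latency_buggy`, `credit_latency_fixed`, and the per-hop cost `HOP_NS` are hypothetical stand-ins written for this illustration; they are not KernBench APIs, and the toy topology is a simple chain.

```python
class ToyRouter:
    """Hypothetical stand-in for a router keyed by full node ids."""

    def __init__(self, nodes):
        self._nodes = list(nodes)  # a linear chain of node ids

    def find_path(self, src_prefix, dst):
        # Mirrors the behaviour described above: only the source gets
        # ".pe_dma" appended; the destination must already be a node id.
        src = f"{src_prefix}.pe_dma"
        for nid in (src, dst):
            if nid not in self._nodes:
                raise KeyError(f"unknown node: {nid!r}")
        i, j = self._nodes.index(src), self._nodes.index(dst)
        path = self._nodes[min(i, j):max(i, j) + 1]
        return path if i <= j else path[::-1]


HOP_NS = 2.0  # illustrative per-hop cost, not a measured value


def credit_latency_buggy(router, my_prefix, peer_prefix):
    try:
        # Destination lacks ".pe_dma", so find_path raises.
        path = router.find_path(my_prefix, peer_prefix)
    except Exception:
        return 0.0  # the failure is silently reported as zero latency
    return HOP_NS * (len(path) - 1)


def credit_latency_fixed(router, my_prefix, peer_prefix):
    # Pass a full node id and let routing errors propagate.
    path = router.find_path(my_prefix, f"{peer_prefix}.pe_dma")
    return HOP_NS * (len(path) - 1)


router = ToyRouter(["pe0.pe_dma", "noc0", "noc1", "pe1.pe_dma"])
buggy = credit_latency_buggy(router, "pe1", "pe0")
fixed = credit_latency_fixed(router, "pe1", "pe0")
print(buggy, fixed)
```

Letting the exception propagate in the fixed variant is a design choice for the sketch: a mis-spelled node id then fails loudly instead of flattening the credit cost to zero.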
@@ -0,0 +1,361 @@
 """PE-to-PE latency sweep across hop types and data sizes.
 Compares IPCQ send/recv vs raw-DMA (tl.load + tl.store) latency for four
 hop types. The IPCQ path uses ``tl.recv_no_consume(...)`` so that DST
 does not pay the slot-read latency — apples-to-apples with the DMA
 path, which is a one-sided write that has no read on DST.
 ``tl.recv_no_consume`` is a DIAGNOSTIC-only entry point that exists
 solely to draw this graph; production kernels use ``tl.recv``.
  H1 Intra-cube horizontal   pe0 → pe1
  H2 Intra-cube vertical     pe0 → pe4
  H3 Inter-cube horizontal   sip0.cube0.pe0 → sip0.cube1.pe0
  H4 Inter-cube vertical     sip0.cube0.pe0 → sip0.cube4.pe0
 Sizes: 128..10240 bytes. Emits PNGs with both lines plus a CSV.
 """
 from __future__ import annotations
 import csv
 from dataclasses import dataclass
 from pathlib import Path
 import numpy as np
 import pytest
 from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
 from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
 from kernbench.policy.placement.dp import DPPolicy
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
 PLOT_DIR = (
    Path(__file__).parent.parent / "docs" / "diagrams" / "pe2pe_latency_plots"
 )
 SIZES = [128, 256, 384, 512, 768, 1024, 2048, 4096, 8192, 10240]
 N_CUBES = 16
 N_PES = 8
 ELEM_BYTES = 2  # f16
@dataclass(frozen=True)
 class Hop:
    id: str
    label: str
    src: tuple[int, int, int]
    dst: tuple[int, int, int]
    send_dir: str
    recv_dir: str
    supports_raw: bool
 HOPS = [
    Hop("h1_intra_horizontal", "Intra-cube horizontal (pe0 to pe1)",
        (0, 0, 0), (0, 0, 1), "intra_E", "intra_W", True),
    Hop("h2_intra_vertical", "Intra-cube vertical (pe0 to pe4)",
        (0, 0, 0), (0, 0, 4), "intra_S", "intra_N", True),
    Hop("h3_inter_cube_horizontal", "Inter-cube horizontal (cube0 to cube1)",
        (0, 0, 0), (0, 1, 0), "E", "W", True),
    Hop("h4_inter_cube_vertical", "Inter-cube vertical (cube0 to cube4)",
        (0, 0, 0), (0, 4, 0), "S", "N", True),
 ]
 def _make_engine():
    topo = resolve_topology(str(TOPOLOGY_PATH))
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    return engine, topo.topology_obj.spec
 # ── IPCQ path ────────────────────────────────────────────────────────
 def _measure_ipcq(hop: Hop, nbytes: int) -> float:
    engine, spec = _make_engine()
    cfg = load_ccl_config()
    merged = resolve_algorithm_config(cfg, name="intercube_allreduce")
    merged["slot_size"] = max(int(merged.get("slot_size", 4096)), nbytes)
    n_elem = nbytes // ELEM_BYTES
    src_sip, src_cube, src_pe = hop.src
    dst_sip, dst_cube, dst_pe = hop.dst
    send_dir, recv_dir = hop.send_dir, hop.recv_dir
    with RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id=f"ipcq_{hop.id}_{nbytes}",
        spec=spec,
    ) as ctx:
        configure_sfr_intercube_multisip(engine, spec, merged)
        dp = DPPolicy(
            cube="row_wise", pe="column_wise",
            num_cubes=N_CUBES, num_pes=N_PES,
        )
        def kernel(t_ptr, n_elem, tl):
            pe_id = tl.program_id(axis=0)
            cube_id = tl.program_id(axis=1)
            if cube_id == src_cube and pe_id == src_pe:
                data = tl.load(t_ptr, shape=(n_elem,), dtype="f16")
                tl.send(dir=send_dir, src=data)
            elif cube_id == dst_cube and pe_id == dst_pe:
                # tl.recv_no_consume: DST blocks until bytes land in
                # slot but skips slot-read latency. Apples-to-apples
                # with the raw-DMA path below, which has no DST read.
                # Diagnostic-only — production kernels use tl.recv.
                tl.recv_no_consume(dir=recv_dir,
                                   shape=(n_elem,), dtype="f16")
        tensors = []
        for s in sorted({src_sip, dst_sip}):
            ctx.ahbm.set_device(s)
            t = ctx.zeros(
                (N_CUBES, N_PES * n_elem), dtype="f16",
                dp=dp, name=f"sip{s}",
            )
            t.copy_(ctx.from_numpy(
                np.full((N_CUBES, N_PES * n_elem), 1.0, dtype=np.float16),
            ))
            tensors.append(t)
        all_pending = []
        for t in tensors:
            pending = ctx.launch(
                f"{hop.id}_ipcq", kernel, t, n_elem, _defer_wait=True,
            )
            all_pending.extend(pending)
        for h, sip_id, meta in all_pending:
            ctx.wait(h, _meta=meta)
        # Per-PE kernel execution time (excludes launch dispatch and
        # response aggregation). IPCQ: DST blocks on tl.recv_no_consume
        # until the send arrives, so max across SIPs = DST's transfer time.
        pe_exec_vals = []
        for h, _sip, _meta in all_pending:
            _, trace = engine.get_completion(h)
            if trace and trace.get("pe_exec_ns") is not None:
                pe_exec_vals.append(float(trace["pe_exec_ns"]))
    return max(pe_exec_vals) if pe_exec_vals else 0.0
 # ── Raw DMA path (intra-SIP only) ────────────────────────────────────
 def _measure_raw(hop: Hop, nbytes: int) -> float:
    """tl.load from source slice + tl.store to destination slice. The VA
    mapping spans the cube mesh within one SIP (MmuMapMsg broadcasts to all
    cubes of the SIP), so the store goes through the fabric to the
    destination PE's HBM.  No IPCQ protocol involved.
    """
    if not hop.supports_raw:
        raise RuntimeError(f"hop {hop.id} does not support raw path")
    engine, spec = _make_engine()
    n_elem = nbytes // ELEM_BYTES
    src_sip, src_cube, src_pe = hop.src
    dst_sip, dst_cube, dst_pe = hop.dst
    assert src_sip == dst_sip
    # Slice offsets in the (N_CUBES, N_PES * n_elem) tensor:
    #   row = cube, slice within row = pe * n_elem .. (pe+1)*n_elem
    # Byte offsets from va_base:
    src_off = (src_cube * N_PES + src_pe) * n_elem * ELEM_BYTES
    dst_off = (dst_cube * N_PES + dst_pe) * n_elem * ELEM_BYTES
    with RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id=f"raw_{hop.id}_{nbytes}",
        spec=spec,
    ) as ctx:
        dp = DPPolicy(
            cube="row_wise", pe="column_wise",
            num_cubes=N_CUBES, num_pes=N_PES,
        )
        ctx.ahbm.set_device(src_sip)
        t = ctx.zeros(
            (N_CUBES, N_PES * n_elem), dtype="f16",
            dp=dp, name="raw_tensor",
        )
        t.copy_(ctx.from_numpy(
            np.full((N_CUBES, N_PES * n_elem), 1.0, dtype=np.float16),
        ))
        def kernel(t_ptr, n_elem, tl):
            pe_id = tl.program_id(axis=0)
            cube_id = tl.program_id(axis=1)
            if cube_id == src_cube and pe_id == src_pe:
                data = tl.load(
                    t_ptr + src_off, shape=(n_elem,), dtype="f16",
                )
                tl.store(t_ptr + dst_off, data)
        pending = ctx.launch(
            f"{hop.id}_raw", kernel, t, n_elem, _defer_wait=True,
        )
        for h, sip_id, meta in pending:
            ctx.wait(h, _meta=meta)
        # Per-PE kernel execution time. Raw: only SRC does real work
        # (tl.load + tl.store, store is blocking), so max across all PEs
        # = SRC's transfer time. Idle PEs contribute only overhead_ns.
        pe_exec_vals = []
        for h, _sip, _meta in pending:
            _, trace = engine.get_completion(h)
            if trace and trace.get("pe_exec_ns") is not None:
                pe_exec_vals.append(float(trace["pe_exec_ns"]))
    return max(pe_exec_vals) if pe_exec_vals else 0.0
 # ── CSV + plotting ───────────────────────────────────────────────────
 def _write_csv(records, path: Path) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(
            f, fieldnames=["hop", "label", "size_bytes", "path", "total_ns"],
        )
        w.writeheader()
        for r in records:
            w.writerow(r)
 def _plot_per_hop(records, hop: Hop, path: Path) -> None:
    import matplotlib.pyplot as plt
    ipcq = sorted(
        [r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
        key=lambda r: r["size_bytes"],
    )
    raw = sorted(
        [r for r in records if r["hop"] == hop.id and r["path"] == "raw"],
        key=lambda r: r["size_bytes"],
    )
    fig, ax = plt.subplots(figsize=(8, 5))
    if ipcq:
        ax.plot(
            [r["size_bytes"] for r in ipcq],
            [r["total_ns"] for r in ipcq],
            marker="o", label="IPCQ no-consume (send/recv, no slot read)",
            color="tab:blue",
        )
    if raw:
        ax.plot(
            [r["size_bytes"] for r in raw],
            [r["total_ns"] for r in raw],
            marker="s", label="Raw DMA (load+store)", color="tab:orange",
        )
    ax.set_xlabel("Data size (bytes)")
    ax.set_ylabel("Latency (ns)")
    ax.set_title(hop.label)
    ax.grid(True, alpha=0.3)
    ax.legend()
    fig.tight_layout()
    fig.savefig(path, dpi=120)
    plt.close(fig)
 def _plot_overview(records, path: Path) -> None:
    import matplotlib.pyplot as plt
    fig, axes = plt.subplots(2, 2, figsize=(13, 9))
    axes = axes.flatten()
    for i, hop in enumerate(HOPS):
        ax = axes[i]
        ipcq = sorted(
            [r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
            key=lambda r: r["size_bytes"],
        )
        raw = sorted(
            [r for r in records if r["hop"] == hop.id and r["path"] == "raw"],
            key=lambda r: r["size_bytes"],
        )
        if ipcq:
            ax.plot(
                [r["size_bytes"] for r in ipcq],
                [r["total_ns"] for r in ipcq],
                marker="o", label="IPCQ no-consume", color="tab:blue",
            )
        if raw:
            ax.plot(
                [r["size_bytes"] for r in raw],
                [r["total_ns"] for r in raw],
                marker="s", label="Raw DMA", color="tab:orange",
            )
        ax.set_title(hop.label, fontsize=10)
        ax.set_xlabel("bytes")
        ax.set_ylabel("ns")
        ax.grid(True, alpha=0.3)
        ax.legend(fontsize=8)
    for j in range(len(HOPS), len(axes)):
        axes[j].axis("off")
    fig.suptitle(
        "PE-to-PE latency: IPCQ no-consume vs raw DMA",
        fontsize=14,
    )
    fig.tight_layout()
    fig.savefig(path, dpi=120)
    plt.close(fig)
 # ── Test entry ───────────────────────────────────────────────────────
 def test_pe_to_pe_latency_sweep():
    records: list[dict] = []
    for hop in HOPS:
        for size in SIZES:
            # IPCQ path uses tl.recv_no_consume, making it apples-to-apples
            # with the raw-DMA path, which has no DST read either.
            ipcq_ns = _measure_ipcq(hop, size)
            records.append({
                "hop": hop.id, "label": hop.label,
                "size_bytes": size, "path": "ipcq",
                "total_ns": ipcq_ns,
            })
            raw_s = "n/a"
            if hop.supports_raw:
                raw_ns = _measure_raw(hop, size)
                records.append({
                    "hop": hop.id, "label": hop.label,
                    "size_bytes": size, "path": "raw",
                    "total_ns": raw_ns,
                })
                raw_s = f"{raw_ns:7.1f}ns"
            print(
                f"[{hop.id}] size={size:5d}  "
                f"ipcq={ipcq_ns:7.1f}ns  raw={raw_s}"
            )
    PLOT_DIR.mkdir(parents=True, exist_ok=True)
    _write_csv(records, PLOT_DIR / "summary.csv")
    for hop in HOPS:
        _plot_per_hop(records, hop, PLOT_DIR / f"{hop.id}.png")
    _plot_overview(records, PLOT_DIR / "overview.png")
    for hop in HOPS:
        rs = sorted(
            [r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
            key=lambda r: r["size_bytes"],
        )
        for r in rs:
            assert r["total_ns"] > 0, f"{hop.id}: total_ns must be > 0"
    print(f"\n  Plots + CSV written to {PLOT_DIR}")
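
The raw-DMA path addresses its source and destination slices by byte offset into a row-major `(N_CUBES, N_PES * n_elem)` tensor. As a worked check of that arithmetic, the sketch below recomputes the destination offsets for the four hops at 1024 bytes per slice. `slice_offset` is a hypothetical helper written for this illustration; the constants are copied from the module above.

```python
N_PES = 8
ELEM_BYTES = 2  # f16


def slice_offset(cube: int, pe: int, nbytes: int) -> int:
    """Byte offset of the (cube, pe) slice when each slice holds nbytes."""
    n_elem = nbytes // ELEM_BYTES
    return (cube * N_PES + pe) * n_elem * ELEM_BYTES


# (cube, pe) destinations of hops H1..H4; the source is always (0, 0).
destinations = {"h1": (0, 1), "h2": (0, 4), "h3": (1, 0), "h4": (4, 0)}
offsets = {
    hop: slice_offset(cube, pe, 1024)
    for hop, (cube, pe) in destinations.items()
}
print(offsets)
```

For even sizes the offset reduces to `(cube * N_PES + pe) * nbytes`; every entry in `SIZES` is even, so the integer division drops no bytes.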
@@ -1,7 +1,10 @@
 import pytest
 from kernbench.policy.address.allocator import AddressConfig, AllocationError, PEMemAllocator
-from kernbench.policy.address.phyaddr import PhysAddr, PhysAddrError, UnitType
+from kernbench.policy.address.phyaddr import (
    PhysAddr, PhysAddrError, UnitType,
    PESubUnit, MCPUSubUnit, IOCPUSubUnit,
 )
 _MB = 1 << 20
 _GB = 1 << 30
@@ -23,13 +26,11 @@ _CFG = AddressConfig(
 def test_physaddr_immutable():
-    pa = PhysAddr.hbm_addr(rack_id=0, sip_id=0, cube_id=0, hbm_offset=0)
+    pa = PhysAddr.hbm_addr(sip_id=0, die_id=0, hbm_offset=0)
    with pytest.raises(AttributeError):
-        pa.rack_id = 1  # type: ignore[misc]
+        pa.sip_id = 1  # type: ignore[misc]
-    # hashable
+    {pa}  # hashable
-    {pa}
+    pa2 = PhysAddr.hbm_addr(sip_id=0, die_id=0, hbm_offset=0)
-    # comparable
-    pa2 = PhysAddr.hbm_addr(rack_id=0, sip_id=0, cube_id=0, hbm_offset=0)
    assert pa == pa2
@@ -37,120 +38,133 @@ def test_physaddr_immutable():
 def test_hbm_encode_decode_roundtrip():
-    pa = PhysAddr.hbm_addr(rack_id=2, sip_id=3, cube_id=5, hbm_offset=0x1000)
+    pa = PhysAddr.hbm_addr(sip_id=3, die_id=5, hbm_offset=0x1000)
    raw = pa.encode()
    dec = PhysAddr.decode(raw)
-    assert dec.rack_id == 2
    assert dec.sip_id == 3
-    assert dec.cube_id == 5
+    assert dec.die_id == 5
    assert dec.kind == "hbm"
    assert dec.hbm_offset == 0x1000
-# ── PE resource encode/decode roundtrip ─────────────────────────────
+# ── PE resource encode/decode roundtrip (new layout) ───────────────
 def test_pe_resource_encode_decode_roundtrip():
-    pa = PhysAddr(
+    pa = PhysAddr.pe_resource_addr(
-        rack_id=1, sip_id=2, sip_seg=7, local_offset=0,
+        sip_id=2, die_id=7, pe_id=3,
-        kind="pe_resource", cube_id=7,
+        pe_sub_unit=PESubUnit.PE_TCM, sub_offset=0xFF,
-        unit_type=UnitType.PE, pe_id=3, ext=1, sub_offset=0xFF,
    )
-    # manually build local_offset matching bit layout
+    raw = pa.encode()
-    local_offset = (UnitType.PE << 34) | (3 << 30) | (1 << 29) | 0xFF
-    pa2 = PhysAddr(
-        rack_id=1, sip_id=2, sip_seg=7, local_offset=local_offset,
-        kind="pe_resource", cube_id=7,
-        unit_type=UnitType.PE, pe_id=3, ext=1, sub_offset=0xFF,
-    )
-    raw = pa2.encode()
    dec = PhysAddr.decode(raw)
    assert dec.kind == "pe_resource"
    assert dec.unit_type == UnitType.PE
    assert dec.pe_id == 3
-    assert dec.ext == 1
+    assert dec.pe_sub_unit == PESubUnit.PE_TCM
    assert dec.sub_offset == 0xFF
    assert dec.die_id == 7
    assert dec.sip_id == 2
 def test_pe_resource_all_sub_units():
    """Each PE sub-unit roundtrips correctly."""
    for su in PESubUnit:
        pa = PhysAddr.pe_resource_addr(
            sip_id=0, die_id=0, pe_id=0,
            pe_sub_unit=su, sub_offset=42,
        )
        dec = PhysAddr.decode(pa.encode())
        assert dec.pe_sub_unit == su
        assert dec.sub_offset == 42
 # ── pe_hbm_addr factory ────────────────────────────────────────────
 def test_pe_hbm_addr_factory():
-    SLICE = 6 * (1 << 30)  # 6 GB per PE slice
+    SLICE = 6 * _GB
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=0, cube_id=0,
+        sip_id=0, die_id=0,
        pe_id=2, pe_local_hbm_offset=1024, slice_size_bytes=SLICE,
    )
    assert pa.kind == "hbm"
-    assert pa.cube_id == 0
+    assert pa.die_id == 0
    assert pa.hbm_offset == 2 * SLICE + 1024
 def test_pe_hbm_addr_overflow():
-    SLICE = 6 * (1 << 30)
+    SLICE = 6 * _GB
    with pytest.raises(PhysAddrError, match="pe_local_hbm_offset"):
        PhysAddr.pe_hbm_addr(
-            rack_id=0, sip_id=0, cube_id=0,
+            sip_id=0, die_id=0,
            pe_id=0, pe_local_hbm_offset=SLICE, slice_size_bytes=SLICE,
        )
-# ── Invalid unit_type decode (fix #1) ──────────────────────────────
+# ── Invalid resource_kind decode ──────────────────────────────────
-def test_invalid_unit_type_raises():
+def test_invalid_resource_kind_raises():
-    # Craft a PE-resource address with unit_type=7 (invalid)
+    # resource_kind=7 (invalid), addr_space=0
-    local_offset = (7 << 34) | (0 << 30) | 0
+    local_offset = (7 << 34) | 0
-    pa_raw = PhysAddr(
+    pa_raw = PhysAddr(sip_id=0, die_id=0, local_offset=local_offset)
-        rack_id=0, sip_id=0, sip_seg=0, local_offset=local_offset,
-    )
    raw = pa_raw.encode()
-    with pytest.raises(PhysAddrError, match="unit_type"):
+    with pytest.raises(PhysAddrError, match="resource_kind"):
        PhysAddr.decode(raw)
-# ── hbm_pe_id utility (fix #3) ─────────────────────────────────────
+# ── hbm_pe_id utility ─────────────────────────────────────────────
 def test_hbm_pe_id_utility():
-    SLICE = 6 * (1 << 30)  # 6 GB
+    SLICE = 6 * _GB
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=0, cube_id=0,
+        sip_id=0, die_id=0,
        pe_id=5, pe_local_hbm_offset=256, slice_size_bytes=SLICE,
    )
    assert PhysAddr.hbm_pe_id(pa.hbm_offset, SLICE) == 5
-# ── UnitType.SRAM exists (fix #5) ──────────────────────────────────
+# ── UnitType / sub-unit enums ──────────────────────────────────────
 def test_sram_unit_type_exists():
    assert UnitType.SRAM == 2
 def test_pe_sub_unit_enum():
    assert PESubUnit.PE_TCM == 6
    assert PESubUnit.IPCQ == 2
 def test_mcpu_sub_unit_enum():
    assert MCPUSubUnit.MCPU_SRAM == 5
 def test_iocpu_sub_unit_enum():
    assert IOCPUSubUnit.IO_SRAM == 5
 # ── cube_sram_addr factory + roundtrip ──────────────────────────────
 def test_cube_sram_addr_roundtrip():
-    pa = PhysAddr.cube_sram_addr(
+    pa = PhysAddr.cube_sram_addr(sip_id=1, die_id=3, sram_offset=0x800)
-        rack_id=0, sip_id=1, cube_id=3, sram_offset=0x800,
-    )
    assert pa.kind == "pe_resource"
    assert pa.unit_type == UnitType.SRAM
-    assert pa.cube_id == 3
+    assert pa.die_id == 3
    assert pa.sub_offset == 0x800
    # encode → decode roundtrip
    dec = PhysAddr.decode(pa.encode())
    assert dec.unit_type == UnitType.SRAM
-    assert dec.cube_id == 3
+    assert dec.die_id == 3
    assert dec.sub_offset == 0x800
 def test_cube_sram_addr_range_check():
    with pytest.raises(PhysAddrError):
        PhysAddr.cube_sram_addr(
-            rack_id=0, sip_id=0, cube_id=0,
+            sip_id=0, die_id=0,
-            sram_offset=(1 << 29),  # exceeds 29-bit sub_offset
+            sram_offset=(1 << 25),  # exceeds 25-bit sub_offset
        )
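
The range-check tests rely on `sub_offset` being a 25-bit field, so `1 << 25` is the smallest value that no longer fits. A minimal sketch of that kind of check follows. `pack_field`, `unpack_field`, and the shift position are hypothetical and do not reproduce the ADR-0001 bit layout; only the 25-bit width is taken from the test comments.

```python
def pack_field(word: int, value: int, shift: int, width: int) -> int:
    """OR `value` into `word` at bit `shift`; reject values wider than `width` bits."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"value {value:#x} does not fit in {width} bits")
    return word | (value << shift)


def unpack_field(word: int, shift: int, width: int) -> int:
    """Extract a `width`-bit field starting at bit `shift`."""
    return (word >> shift) & ((1 << width) - 1)


SUB_OFFSET_BITS = 25  # width stated in the test comments

word = pack_field(0, 0x800, shift=0, width=SUB_OFFSET_BITS)
roundtrip = unpack_field(word, 0, SUB_OFFSET_BITS)

try:
    pack_field(0, 1 << SUB_OFFSET_BITS, shift=0, width=SUB_OFFSET_BITS)
    overflow_rejected = False
except ValueError:
    overflow_rejected = True
print(roundtrip, overflow_rejected)
```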
@@ -158,29 +172,137 @@ def test_cube_sram_addr_range_check():
 def test_pe_tcm_addr_roundtrip():
-    pa = PhysAddr.pe_tcm_addr(
+    pa = PhysAddr.pe_tcm_addr(sip_id=0, die_id=2, pe_id=7, tcm_offset=0x400)
-        rack_id=0, sip_id=0, cube_id=2, pe_id=7, tcm_offset=0x400,
-    )
    assert pa.kind == "pe_resource"
    assert pa.unit_type == UnitType.PE
    assert pa.pe_id == 7
-    assert pa.cube_id == 2
+    assert pa.die_id == 2
    assert pa.pe_sub_unit == PESubUnit.PE_TCM
    assert pa.sub_offset == 0x400
    # encode → decode roundtrip
    dec = PhysAddr.decode(pa.encode())
    assert dec.unit_type == UnitType.PE
    assert dec.pe_id == 7
    assert dec.pe_sub_unit == PESubUnit.PE_TCM
    assert dec.sub_offset == 0x400
 def test_pe_tcm_addr_range_check():
    with pytest.raises(PhysAddrError):
        PhysAddr.pe_tcm_addr(
-            rack_id=0, sip_id=0, cube_id=0, pe_id=0,
+            sip_id=0, die_id=0, pe_id=0,
-            tcm_offset=(1 << 29),  # exceeds 29-bit sub_offset
+            tcm_offset=(1 << 25),  # exceeds 25-bit sub_offset
        )
 # ── MCPU resource factory + roundtrip ──────────────────────────────
 def test_mcpu_resource_roundtrip():
    pa = PhysAddr.mcpu_resource_addr(
        sip_id=0, die_id=1,
        mcpu_sub_unit=MCPUSubUnit.MCPU_SRAM, sub_offset=0x100,
    )
    assert pa.kind == "pe_resource"
    assert pa.unit_type == UnitType.MCPU
    assert pa.mcpu_sub_unit == MCPUSubUnit.MCPU_SRAM
    assert pa.sub_offset == 0x100
    dec = PhysAddr.decode(pa.encode())
    assert dec.unit_type == UnitType.MCPU
    assert dec.mcpu_sub_unit == MCPUSubUnit.MCPU_SRAM
    assert dec.sub_offset == 0x100
 # ── IOCHIPLET: IOCPU factory + roundtrip ────────────────────────────
 def test_iocpu_resource_roundtrip():
    pa = PhysAddr.iocpu_resource_addr(
        sip_id=1, die_id=17,
        iocpu_sub_unit=IOCPUSubUnit.IPCQ, sub_offset=0x20000,
    )
    assert pa.kind == "iocpu"
    assert pa.iocpu_sub_unit == IOCPUSubUnit.IPCQ
    assert pa.sub_offset == 0x20000
    dec = PhysAddr.decode(pa.encode())
    assert dec.kind == "iocpu"
    assert dec.iocpu_sub_unit == IOCPUSubUnit.IPCQ
    assert dec.sub_offset == 0x20000
    assert dec.die_id == 17
 def test_iocpu_die_range_check():
    with pytest.raises(PhysAddrError, match="IOCHIPLET"):
        PhysAddr.iocpu_resource_addr(
            sip_id=0, die_id=5,  # not a chiplet die
            iocpu_sub_unit=0, sub_offset=0,
        )
 # ── IOCHIPLET: UAL factory + roundtrip ──────────────────────────────
 def test_ual_addr_roundtrip():
    pa = PhysAddr.ual_addr(sip_id=0, die_id=16, ual_offset=0x1000)
    assert pa.kind == "ual"
    dec = PhysAddr.decode(pa.encode())
    assert dec.kind == "ual"
    assert dec.die_id == 16
    assert dec.chiplet_offset >= (1 << 31)  # >= 2 GB boundary
 # ── die_id dispatch ────────────────────────────────────────────────
 def test_die_id_ahbm_range():
    for die in [0, 15]:
        pa = PhysAddr.hbm_addr(sip_id=0, die_id=die, hbm_offset=0)
        dec = PhysAddr.decode(pa.encode())
        assert dec.kind == "hbm"
        assert dec.die_id == die
 def test_die_id_chiplet_range():
    for die in [16, 20]:
        pa = PhysAddr.iocpu_resource_addr(
            sip_id=0, die_id=die,
            iocpu_sub_unit=0, sub_offset=0,
        )
        dec = PhysAddr.decode(pa.encode())
        assert dec.kind == "iocpu"
        assert dec.die_id == die
 def test_die_id_reserved_raises():
    raw = (0 << 47) | (21 << 42) | 0  # die_id=21 (reserved)
    with pytest.raises(PhysAddrError, match="reserved"):
        PhysAddr.decode(raw)
 # ── Boundary values ────────────────────────────────────────────────
 def test_sip_boundary():
    pa = PhysAddr.hbm_addr(sip_id=15, die_id=0, hbm_offset=0)
    dec = PhysAddr.decode(pa.encode())
    assert dec.sip_id == 15
 def test_mbz_enforcement_ahbm():
    """AHBM local_offset bits [41:38] must be zero."""
    local_offset = (1 << 38) | (1 << 37)  # MBZ bit set + HBM
    pa = PhysAddr(sip_id=0, die_id=0, local_offset=local_offset)
    with pytest.raises(PhysAddrError, match="bits \\[41:38\\]"):
        pa.encode()
 def test_mbz_enforcement_chiplet():
    """IOCHIPLET local_offset bits [41:40] must be zero."""
    local_offset = (1 << 40) | 0  # MBZ bit set
    pa = PhysAddr(sip_id=0, die_id=16, local_offset=local_offset)
    with pytest.raises(PhysAddrError, match="bits \\[41:40\\]"):
        pa.encode()
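
The die_id dispatch, boundary, and MBZ tests above pin down the top-level field positions of the 51-bit address. A minimal sketch of that packing follows, assuming sip_id in bits [50:47], die_id in bits [46:42], and local_offset in bits [41:0] (consistent with the raw value built in test_die_id_reserved_raises). The die ranges are inferred from the values the tests exercise (0 and 15 as AHBM, 16 and 20 as IOCHIPLET, 21 as reserved). `AddrError`, `die_class`, `encode`, and `decode` are hypothetical names, not the real `PhysAddr` API.

```python
SIP_SHIFT, DIE_SHIFT = 47, 42      # sip_id [50:47], die_id [46:42]
SIP_MAX, DIE_MASK = 0xF, 0x1F      # 4-bit sip_id, 5-bit die_id
LOCAL_MASK = (1 << 42) - 1         # local_offset [41:0]


class AddrError(ValueError):
    """Hypothetical stand-in for PhysAddrError."""


def die_class(die_id: int) -> str:
    """Classify a die_id; ranges are assumed from the test values."""
    if 0 <= die_id <= 15:
        return "ahbm"
    if 16 <= die_id <= 20:
        return "iochiplet"
    raise AddrError(f"die_id {die_id} is reserved")


def encode(sip_id: int, die_id: int, local_offset: int) -> int:
    """Pack the three top-level fields, enforcing must-be-zero bits."""
    if not 0 <= sip_id <= SIP_MAX:
        raise AddrError("sip_id out of range")
    if not 0 <= local_offset <= LOCAL_MASK:
        raise AddrError("local_offset exceeds 42 bits")
    # AHBM dies keep bits [41:38] zero; IOCHIPLET dies keep bits [41:40] zero.
    mbz_low = 38 if die_class(die_id) == "ahbm" else 40
    if local_offset >> mbz_low:
        raise AddrError(f"local_offset bits [41:{mbz_low}] must be zero")
    return (sip_id << SIP_SHIFT) | (die_id << DIE_SHIFT) | local_offset


def decode(raw: int) -> tuple[int, int, int]:
    """Unpack (sip_id, die_id, local_offset); reserved dies raise."""
    die_id = (raw >> DIE_SHIFT) & DIE_MASK
    die_class(die_id)
    return (raw >> SIP_SHIFT) & SIP_MAX, die_id, raw & LOCAL_MASK
```

Rejecting MBZ violations at encode time, as the tests expect, keeps malformed offsets from ever reaching the decoder.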
 # ── AddressConfig ───────────────────────────────────────────────────
@@ -193,7 +315,7 @@ def test_address_config_derived_sizes():
 def _make_alloc(pe_id: int = 0) -> PEMemAllocator:
-    return PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=pe_id, cfg=_CFG)
+    return PEMemAllocator(sip_id=0, die_id=0, pe_id=pe_id, cfg=_CFG)
 def test_allocator_hbm_basic():
@@ -201,8 +323,7 @@ def test_allocator_hbm_basic():
    pa = a.alloc_hbm(4096)
    assert pa.kind == "hbm"
    assert pa.sip_id == 0
-    assert pa.cube_id == 0
+    assert pa.die_id == 0
    # hbm_offset should be pe3's slice start
    assert pa.hbm_offset == 3 * 6 * _GB
@@ -210,8 +331,8 @@ def test_allocator_hbm_sequential():
    a = _make_alloc()
    pa1 = a.alloc_hbm(1024)
    pa2 = a.alloc_hbm(2048)
-    assert pa1.hbm_offset == 0  # pe0 slice start + 0
+    assert pa1.hbm_offset == 0
-    assert pa2.hbm_offset == 1024  # pe0 slice start + 1024
+    assert pa2.hbm_offset == 1024
 def test_allocator_hbm_overflow():
@@ -235,7 +356,6 @@ def test_allocator_tcm_basic():
 def test_allocator_tcm_respects_reserved():
    a = _make_alloc()
    # allocatable = 12 MB, should succeed
    a.alloc_tcm(12 * _MB)
    assert a.tcm_used == 12 * _MB
    assert a.tcm_total == 12 * _MB
@@ -21,7 +21,7 @@ def _engine():
 def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
    slice_bytes = 48 * (1 << 30) // 8
    pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
        pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
    )
    return pa.encode()
@@ -20,7 +20,7 @@ def test_resolve_hbm_addr():
    """HBM address -> sip{S}.cube{C}.hbm_ctrl (single controller per cube)."""
    g = _graph()
    resolver = AddressResolver(g)
-    pa = PhysAddr.hbm_addr(rack_id=0, sip_id=0, cube_id=3, hbm_offset=0x1000)
+    pa = PhysAddr.hbm_addr(sip_id=0, die_id=3, hbm_offset=0x1000)
    assert resolver.resolve(pa) == "sip0.cube3.hbm_ctrl"
@@ -28,33 +28,33 @@ def test_resolve_hbm_addr_high_offset():
    """HBM address with large offset still resolves to same hbm_ctrl."""
    g = _graph()
    resolver = AddressResolver(g)
-    pa = PhysAddr.hbm_addr(rack_id=0, sip_id=0, cube_id=0, hbm_offset=0x600000000)
+    pa = PhysAddr.hbm_addr(sip_id=0, die_id=0, hbm_offset=0x600000000)
    assert resolver.resolve(pa) == "sip0.cube0.hbm_ctrl"
 def test_resolve_pe_tcm_addr():
-    """PE TCM address → sip{S}.cube{C}.pe{P}.pe_tcm"""
+    """PE TCM address -> sip{S}.cube{C}.pe{P}.pe_tcm"""
    g = _graph()
    resolver = AddressResolver(g)
-    pa = PhysAddr.pe_tcm_addr(rack_id=0, sip_id=1, cube_id=5, pe_id=7, tcm_offset=0x400)
+    pa = PhysAddr.pe_tcm_addr(sip_id=1, die_id=5, pe_id=7, tcm_offset=0x400)
    assert resolver.resolve(pa) == "sip1.cube5.pe7.pe_tcm"
 def test_resolve_sram_addr():
-    """SRAM address → sip{S}.cube{C}.sram"""
+    """SRAM address -> sip{S}.cube{C}.sram"""
    g = _graph()
    resolver = AddressResolver(g)
-    pa = PhysAddr.cube_sram_addr(rack_id=0, sip_id=0, cube_id=10, sram_offset=0x800)
+    pa = PhysAddr.cube_sram_addr(sip_id=0, die_id=10, sram_offset=0x800)
    assert resolver.resolve(pa) == "sip0.cube10.sram"
 def test_resolve_mcpu_addr():
-    """MCPU pe_resource address → sip{S}.cube{C}.m_cpu"""
+    """MCPU pe_resource address -> sip{S}.cube{C}.m_cpu"""
    g = _graph()
    resolver = AddressResolver(g)
-    pa = PhysAddr(
+    pa = PhysAddr.mcpu_resource_addr(
-        rack_id=0, sip_id=0, sip_seg=2, local_offset=(UnitType.MCPU << 34),
+        sip_id=0, die_id=2,
-        kind="pe_resource", cube_id=2, unit_type=UnitType.MCPU,
+        mcpu_sub_unit=0, sub_offset=0,
    )
    assert resolver.resolve(pa) == "sip0.cube2.m_cpu"
@@ -64,7 +64,7 @@ def test_resolve_nonexistent_node():
    g = _graph()
    resolver = AddressResolver(g)
    # sip_id=15 doesn't exist in the 2-SIP topology
-    pa = PhysAddr.hbm_addr(rack_id=0, sip_id=15, cube_id=0, hbm_offset=0)
+    pa = PhysAddr.hbm_addr(sip_id=15, die_id=0, hbm_offset=0)
    with pytest.raises(RoutingError):
        resolver.resolve(pa)
@@ -73,7 +73,7 @@ def test_resolve_nonexistent_node():
 def test_path_local_hbm():
-    """PE0 -> hbm_ctrl: pe_dma → router → hbm_ctrl (through router mesh)."""
+    """PE0 -> hbm_ctrl: pe_dma -> router -> hbm_ctrl (through router mesh)."""
    g = _graph()
    router = PathRouter(g)
    path = router.find_path("sip0.cube0.pe0", "sip0.cube0.hbm_ctrl")
@@ -107,7 +107,7 @@ def test_all_pe_hbm_equidistant():
    """All PEs in a cube have equal routing distance to hbm_ctrl.
    With n_to_one mapping and high routing weight on HBM edges,
-    all PE→hbm_ctrl paths have the same accumulated distance.
+    all PE->hbm_ctrl paths have the same accumulated distance.
    """
    g = _graph()
    router = PathRouter(g)
@@ -151,7 +151,7 @@ def test_path_remote_cube_hbm():
 def test_path_sram_via_router_mesh():
-    """PE → SRAM must go through router mesh nodes."""
+    """PE -> SRAM must go through router mesh nodes."""
    g = _graph()
    router = PathRouter(g)
    path = router.find_path("sip0.cube0.pe0", "sip0.cube0.sram")
@@ -168,7 +168,7 @@ def test_path_sram_via_router_mesh():
 def test_path_local_tcm():
-    """PE0 → own TCM is PE-internal, not via router mesh."""
+    """PE0 -> own TCM is PE-internal, not via router mesh."""
    g = _graph()
    router = PathRouter(g)
    path = router.find_path("sip0.cube0.pe0", "sip0.cube0.pe0.pe_tcm")
@@ -0,0 +1,106 @@
 """Rectangular (non-square) SIP-level 2D topology support.
 Phase 1 regression target: today the 2D builtin topology functions in
 ``kernbench.ccl.topologies`` (``mesh_2d``, ``torus_2d``,
 ``mesh_2d_no_wrap``) hardcode ``side = sqrt(world_size)`` and raise
 ``ValueError`` for any non-square ``world_size``. This blocks running
 the allreduce sweep at n_sips=6 on torus/mesh layouts.
 Phase 2 will extend these functions to accept optional ``w, h`` kwargs
 so a 2×3 (or 3×2, etc.) layout works. Until then, every test below is
 expected to FAIL.
 Layout convention used here (same as the existing square case):
    rank = row * w + col   for 0 <= row < h, 0 <= col < w
 For w=2, h=3, world_size=6 the layout is:
         col=0  col=1
  row=0:   0      1
  row=1:   2      3
  row=2:   4      5
 """
 from __future__ import annotations
 import pytest
 from kernbench.ccl.topologies import (
    mesh_2d,
    mesh_2d_no_wrap,
    torus_2d,
 )
 # ── mesh_2d_no_wrap (no wrap-around) ──────────────────────────────────
 def test_mesh_2d_no_wrap_2x3_top_left():
    """rank 0 (top-left, no N, no W): only S and E."""
    nbrs = mesh_2d_no_wrap(rank=0, world_size=6, w=2, h=3)
    assert nbrs == {"S": 2, "E": 1}, nbrs
 def test_mesh_2d_no_wrap_2x3_top_right():
    """rank 1 (top-right, no N, no E): only S and W."""
    nbrs = mesh_2d_no_wrap(rank=1, world_size=6, w=2, h=3)
    assert nbrs == {"S": 3, "W": 0}, nbrs
 def test_mesh_2d_no_wrap_2x3_middle_left():
    """rank 2 (middle-left, no W): N, S, E."""
    nbrs = mesh_2d_no_wrap(rank=2, world_size=6, w=2, h=3)
    assert nbrs == {"N": 0, "S": 4, "E": 3}, nbrs
 def test_mesh_2d_no_wrap_2x3_bottom_right():
    """rank 5 (bottom-right, no S, no E): only N and W."""
    nbrs = mesh_2d_no_wrap(rank=5, world_size=6, w=2, h=3)
    assert nbrs == {"N": 3, "W": 4}, nbrs
 # ── torus_2d (wrap-around on all four edges) ─────────────────────────
 def test_torus_2d_2x3_top_left():
    """rank 0: N wraps to row 2 col 0 (rank 4); W wraps to col 1 (rank 1)."""
    nbrs = torus_2d(rank=0, world_size=6, w=2, h=3)
    assert nbrs == {"N": 4, "S": 2, "W": 1, "E": 1}, nbrs
 def test_torus_2d_2x3_bottom_right():
    """rank 5: S wraps to row 0 (rank 1); E wraps to col 0 (rank 4)."""
    nbrs = torus_2d(rank=5, world_size=6, w=2, h=3)
    assert nbrs == {"N": 3, "S": 1, "W": 4, "E": 4}, nbrs
 # ── mesh_2d alias for torus_2d ───────────────────────────────────────
 def test_mesh_2d_2x3_matches_torus_2d():
    """mesh_2d is currently a torus alias; behaviour must match torus_2d."""
    for rank in range(6):
        assert mesh_2d(rank=rank, world_size=6, w=2, h=3) == \
            torus_2d(rank=rank, world_size=6, w=2, h=3)
 # ── Back-compat: square layouts still work without w/h kwargs ────────
 def test_square_back_compat_mesh_2d_no_wrap():
    """Calling without w, h should still work for square world_size."""
    nbrs = mesh_2d_no_wrap(rank=0, world_size=4)
    assert nbrs == {"S": 2, "E": 1}, nbrs
 def test_square_back_compat_torus_2d():
    nbrs = torus_2d(rank=0, world_size=4)
    assert nbrs == {"N": 2, "S": 2, "W": 1, "E": 1}, nbrs
 # ── Validation: w*h must match world_size ────────────────────────────
 def test_rectangular_dims_must_match_world_size():
    """Phase 2 contract: explicit w, h must satisfy w*h == world_size."""
    with pytest.raises(ValueError):
        mesh_2d_no_wrap(rank=0, world_size=6, w=3, h=3)  # 9 != 6
@@ -44,7 +44,7 @@ _CFG = AddressConfig(
 def _make_allocators(num_pe: int = 8) -> dict[tuple[int, int, int], PEMemAllocator]:
    return {
-        (0, 0, i): PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=i, cfg=_CFG)
+        (0, 0, i): PEMemAllocator(sip_id=0, die_id=0, pe_id=i, cfg=_CFG)
        for i in range(num_pe)
    }
@@ -55,7 +55,7 @@ def _make_ctx():
 def test_allocator_free_hbm_reclaims_space():
    """free_hbm returns HBM space; subsequent alloc can reuse it."""
-    a = PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=0, cfg=_CFG)
+    a = PEMemAllocator(sip_id=0, die_id=0, pe_id=0, cfg=_CFG)
    pa1 = a.alloc_hbm(4096)
    used_after_alloc = a.hbm_used
    a.free_hbm(pa1, 4096)
@@ -66,7 +66,7 @@ def test_allocator_free_hbm_reclaims_space():
 def test_allocator_free_tcm_reclaims_space():
    """free_tcm returns TCM space."""
-    a = PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=0, cfg=_CFG)
+    a = PEMemAllocator(sip_id=0, die_id=0, pe_id=0, cfg=_CFG)
    pa1 = a.alloc_tcm(256)
    used_after_alloc = a.tcm_used
    a.free_tcm(pa1, 256)
@@ -39,7 +39,7 @@ _CFG = AddressConfig(
 def _make_allocators(num_pe: int = 8) -> dict[tuple[int, int, int], PEMemAllocator]:
    return {
-        (0, 0, i): PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=i, cfg=_CFG)
+        (0, 0, i): PEMemAllocator(sip_id=0, die_id=0, pe_id=i, cfg=_CFG)
        for i in range(num_pe)
    }
@@ -70,7 +70,7 @@ def _make_standalone(shape, num_pe=NUM_PE):
        sram_bytes_per_cube=32 * _MB,
    )
    allocators = {
-        (0, 0, i): PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=i, cfg=cfg)
+        (0, 0, i): PEMemAllocator(sip_id=0, die_id=0, pe_id=i, cfg=cfg)
        for i in range(num_pe)
    }
    va_alloc = VirtualAllocator(va_base=0x1_0000_0000, va_size=64 * _GB, page_size=4096)
Author	SHA1	Message	Date
mukesh	5accd98171	Add deck builder + overview-with-ref diagram scripts scripts/build_overview_slides.py renders a 5-slide PPTX (kernbench2_overview.pptx) summarizing architecture, model correctness, IPCQ, allreduce, and buffer-kind tier comparison. scripts/emit_overview_with_external_ref.py renders log-y and broken-y variants of the allreduce overview (overview_log.png, overview_broken.png) including a 366 µs ext-sim reference marker at 96 KB / PE. Also includes cube_mesh_view.png rendered from the SVG. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:20:54 -07:00
mukesh	a563169e89	Add tl.recv_no_consume diagnostic API for apples-to-apples pe2pe plot The pe2pe overview compared IPCQ (tl.send + tl.recv) against raw DMA (tl.load + tl.store), but DMA is one-sided — DST never reads — while tl.recv pays a slot-read on DST. The comparison was unfair: IPCQ looked slower partly because it does more work. Adds tl.recv_no_consume() — a separate, diagnostic-only entry point that blocks for slot arrival but skips the slot-read (and bank-hop) charge on DST. Production tl.recv is unchanged (no `consume` kwarg on the public API), so the diagnostic flag can never accidentally leak into real workloads. Updates test_pe_to_pe_latency to call tl.recv_no_consume so the overview.png shows IPCQ no-consume vs raw DMA on equal footing. Also fixes PLOT_DIR back to docs/diagrams/pe2pe_latency_plots/ (was lost in a merge). Adds scripts/replot_pe2pe.py for label-only re-renders without re-measuring. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:20:44 -07:00
mukesh	9c129d6131	ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots Cube SRAM and HBM live on the cube NoC behind router-attached links (sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256). Previously the slot-IO model treated them as if they were per-PE local, so the buffer_kind sweep showed TCM ≈ SRAM at 64 KB / PE. pe_ipcq._handle_recv and pe_dma._handle_ipcq_inbound now charge a PE→bank compute_drain_ns on top of the intrinsic slot-IO for SRAM/HBM. TCM stays free of this hop. Adds an internal IpcqRecvCmd.consume field that gates the recv-side hop+slot-IO charges (used by a follow-up diagnostic API; default True keeps current behavior). Post-fix at 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs. SRAM is slowest because its 128 GB/s bank link is the narrowest in the system — narrower than HBM's 256 GB/s. The existing ordering test is rewritten from tcm<sram<hbm to tcm<hbm<sram and a new test_ipcq_buffer_kind_locations adds 3 invariants on the gap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:20:28 -07:00
ywkang	533e699299	IPCQ-DMA co-design HW design doc + fix IPCQ slot BW model Add hardware design document (docs/ipcq-dma-codesign-hw.md) covering PE_IPCQ high-level architecture, simulator verification, proposed HW implementation, and alternatives analysis. Include D2 block diagrams for baseline and proposed PE architectures. Fix IPCQ slot-memory bandwidth parameters to match topology.yaml: SRAM 128→512 GB/s (intrinsic BW, NoC-bottlenecked at 128), HBM 32→256 GB/s (was per-channel, now per-PE aggregate). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 13:31:02 -07:00
mukesh	54fcb7e4bc	Add tests/test_emit_ipcq_diagram.py (missed from earlier commit) This is the diagram generator that emits ipcq_send_recv.png and ipcq_two_pe_dma.png (referenced by commit `1e39214` but accidentally left untracked). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:42:44 -07:00
mukesh	ad5f01ab13	Merge origin/master: combine single-cube fast path + center-root reduce Conflict resolution: - intercube_allreduce.py: kept origin's `if single_cube:` early-exit (TP launches kernel on one cube/rank → skip intra-SIP mesh and go direct to inter-SIP exchange) AND replaced the multi-cube body with the local center-root + bidirectional reduce/broadcast (8-hop critical path on 4×4 vs 12 with corner root). - tests/{allreduce,pe2pe}_latency_plots/: kept the local move to docs/diagrams/; dropped origin's stale content edits to the old paths (regenerable derived artifacts). - docs/diagrams/pe2pe_latency_plots/summary.csv: kept local (post-Phase-2 + center-root values). Origin contributions retained as-is: - pyproject.toml: matplotlib >= 3.7 dep. - runtime_api/distributed.py: derive effective cube_w/h from tensor shard placement so single-cube TP paths get cube_w=cube_h=1. - kernel_args() now accepts optional cube_w/cube_h kwargs. Verified post-merge: - test_intercube_root_center.py: 2/2 (center-root multi-cube path). - test_tp_layers.py + test_tp_mlp.py: 10/10 (single-cube TP path). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:41:46 -07:00
mukesh	1c5752a9ec	Intercube allreduce: center root + bidirectional reduce Move the algorithmic root cube from the corner (cube_w-1, cube_h-1) to the geometric center (cube_w//2, cube_h//2) and have each phase converge bidirectionally so the intra-SIP critical path drops from ~12 hops to ~8 hops on a 4×4 mesh (left half W→E + right half E→W in row reduce; top half N→S + bottom half S→N in col reduce; mirrored on broadcast). Result on torus_2d 6 SIPs at 96 KB / PE on TCM: before (corner root) : 22.0 µs after (center root) : 17.2 µs (−22%) Same shape on ring_1d (−7%) and mesh_2d_no_wrap (−12%); also holds across SRAM and HBM (~−20% each). Phase 1 test (test_intercube_root_center.py) asserts the torus_2d 96 KB latency drops below 20.5 µs and that all 96 cubes still validate (correctness preserved). Plot updates: - overview.png: replace constant 10.6 µs theoretical line with user-supplied hand-derived curve (per-cube packet count = bytes_per_pe × 8 PEs ÷ 128 B; 1346 ns startup + 1.20 ns/pkt). - All summary.csv numbers and per-topology PNGs regenerated. - pe2pe_latency_plots and ipcq diagram emitter PNGs refreshed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:28:58 -07:00
mukesh	84a1325e5c	ADR-0023 D9.7: IPCQ slot-memory latency model (TCM/SRAM/HBM) Charge per-tier bandwidth + setup overhead at IPCQ slot WRITE (receiver inbound DMA, in pe_dma._handle_ipcq_inbound) and slot READ (recv consume, in pe_ipcq._handle_recv). Tier table (common/ipcq_types.py): tcm : 512 GB/s, 0 ns sram : 128 GB/s, 2 ns hbm : 32 GB/s, 6 ns Before this change, slot read/write was free regardless of buffer_kind, making memory-tier choice invisible in simulated latency. After the change, swapping buffer_kind in ccl.yaml produces measurable per-tier separation in allreduce latency. Tests: test_ipcq_buffer_kind_latency.py — three micro-tests asserting tcm < sram < hbm ordering, payload-scaling, and that buffer_kind sensitivity grows with payload (credit-only path stays fabric-bound). test_allreduce_buffer_kind_sweep.py — 12-config parametrized sweep emitting buffer_kind_sweep.png (3 lines, torus_2d). conftest sessionfinish hook generalised to dispatch multiple sweep aggregators (allreduce + buffer-kind). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:28:34 -07:00
mukesh	1e39214f89	Move generated diagrams to docs/diagrams/; add IPCQ diagram emitter Plot output dirs now live under docs/diagrams/ (the canonical "derived artifacts" location per CLAUDE.md): tests/allreduce_latency_plots/ → docs/diagrams/allreduce_latency_plots/ tests/pe2pe_latency_plots/ → docs/diagrams/pe2pe_latency_plots/ + new docs/diagrams/ipcq_diagram_plots/ with two presentation diagrams (ipcq_send_recv.png, ipcq_two_pe_dma.png) New test tests/test_emit_ipcq_diagram.py renders the two IPCQ diagrams from a static description (no simulation); it exists so the diagrams can be regenerated reproducibly. Path references updated in tests/test_pe_to_pe_latency.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:28:17 -07:00
ywkang	fca24feac5	Fix all remaining test failures: single-cube allreduce + matplotlib dep - intercube_allreduce: add single-cube fast path that skips intra-SIP mesh reduce and goes directly to inter-SIP exchange. Fixes IPCQ deadlock when TP launches kernel on one cube per SIP. - distributed.py: derive effective cube dims from tensor shard placement instead of hardcoding topology mesh size. - pyproject.toml: add matplotlib>=3.7 to dependencies. - pe_dma.py (prior commit): add MMU translation in pipeline DMA path. 577 passed, 0 failed (was 529 passed, 10 failed). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-27 21:25:31 -07:00
ywkang	d55dc6cb4f	Merge: accept remote pe2pe summary.csv	2026-04-27 17:13:06 -07:00
mukesh	46291bf91b	PE-to-PE latency: drop h5 inter-SIP panel from overview Remove h5_inter_sip from the hop list and switch the overview grid from 2x3 to 2x2. RAW DMA was unavailable for the cross-SIP hop, so the panel only carried IPCQ data and was redundant with h4_inter_cube for the topology comparison. Regenerate pe2pe_latency_plots/overview.png and summary.csv; delete the obsolete h5_inter_sip.png. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:43:28 -07:00
mukesh	04c912f53e	Allreduce sweep: parametrized + xdist parallelism + topology diagram Refactor the latency sweep from one giant test into 36 parametrized cases that run in parallel under xdist (~6-8x faster: 1:49 instead of ~10 min). Each case writes a JSON row to a staging dir; conftest sessionfinish hook aggregates rows on the controller node into summary.csv and the per-topology + overview plots. Aggregator gains a CSV fallback so plot-only tweaks no longer require re-running the sweep. Overview plot updates: - 96 KB explicit x-axis marker with vertical dotted line - horizontal theoretical 2D-torus reference (10600 ns) - annotation showing both theoretical and simulated values at 96 KB - drop overlapping 128 KB tick New topology.png: 2x2 panel diagram showing device-level topology (ring, torus 2x3, mesh 2x3) and the cube-level reduction inside SIP 0. Wrap arrows anchor on box edges and arc outside rows/columns so they do not overlap any SIP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:43:19 -07:00
mukesh	1c33afec55	ADR-0032 + intra_* opposite directions in IPCQ install Add intra_N/S/E/W to install.py _OPPOSITE_DIR table so the intra-cube PE-to-PE namespace is symmetrical with intercube N/S/E/W. ADR-0032 documents the intercube allreduce algorithm (supersedes ADR-0029). Refresh ADR-0024/0025/0029 cross-refs and update test_intercube_sfr_config.py to cover the new intra_* mappings. Drop the obsolete test_ccl_round_robin_recv.py (replaced by intercube tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:43:01 -07:00
ywkang	81cc32c46b	ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables Remove rack_id (4 bits), rename sip_seg→die_id, shift fields to enable 42-bit local_offset (4 TB per die). Define PE_LOCAL/MCPU_LOCAL/CUBE_SRAM sub-unit tables for AHBM dies and IOCPU sub-unit table for IOCHIPLET dies (1 TB window). Supersedes ADR-0031. Also fixes latent VA/PA confusion in pe_dma pipeline DMA path where virtual addresses were decoded as physical addresses without MMU translation — previously masked by coincidental bit-position alignment. 529 passed (+6 recovered), 10 pre-existing failures unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-27 15:52:29 -07:00
mukesh	e9cc40f74d	Rectangular SIP topology + 6-device allreduce sweep mesh_2d, torus_2d, and mesh_2d_no_wrap accept optional w,h kwargs; sqrt fall-back preserved for square layouts (back-compat tests confirm 4-SIP and 9-SIP square configs still work). sfr_config reads system.sips.w/h from spec and threads dims through to the topology fn. test_allreduce_multidevice CONFIGS switched from 4 SIPs (square) to 6 SIPs: ring_1d_6sip, torus_2d_6sip_2x3, mesh_2d_no_wrap_6sip_2x3. _write_temp_configs writes system.sips.w/h when supplied; _sip_topo_dims reads them back. Latency sweep loop also moved to 6-SIP layouts. Linear-scale plot variants dropped -- only log-scale *.png + summary.csv emitted. Plots in tests/allreduce_latency_plots regenerated. New tests/test_sip_topology_rectangular.py asserts neighbor correctness for 2x3 layouts and back-compat for square fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:13:14 -07:00
mukesh	c1a5cf3a2a	ADR-0009 D5: chain-aware target_start_ns + zero-byte launch fanout The single-walk predictor (find_node_path(io_cpu, pe_cpu) + compute_path_latency_ns) under-shot actual dispatch latency for far cubes -- the routing graph could pick a path bypassing M_CPU, and non-zero-nbytes launch sub-txns serialized on shared first hops. Far PEs arrived at _execute_kernel after target_start_ns, silently skipped the barrier yield, and started pe_exec_start late. Their reported pe_exec_ns under-counted by exactly the late_ns amount (63 ns observed at h4 cube4.pe0 in the IPCQ test, up to 113 ns worst case for cubes 9-11), producing the suspicious flat region in the h4 IPCQ curve at 8192/10240 bytes. Fix: - IO_CPU predictor uses the explicit two-leg chain (IO_CPU->M_CPU + M_CPU->PE_CPU - io.overhead - m.overhead), so every PE on every targeted cube has a barrier >= its real dispatch arrival. - Kernel-launch fanout sub-txns carry nbytes=0 (control-plane, not data-plane), removing the per-cube fanout serialization that pushed far M_CPUs past the predictor. - Legacy io_cpu mirror updated. ADR-0009 D5 mechanism updated to specify the two-leg formula and the nbytes=0 requirement. New tests/test_d5_barrier_invariant.py asserts (a) no PE enters _execute_kernel after target_start_ns and (b) every PE in a multi-cube launch has identical pe_exec_start -- both regressions silently pass on the existing tests/test_kernel_launch_sync.py because that test only inspects post-aggregation max(pe_exec_ns). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:12:58 -07:00
mukesh	90874abbfe	ADR-0023 D9: blocking credit-emit with full-path latency PE_IPCQ._handle_recv now yields-from _delayed_credit_send instead of spawning it as a fork, so the receiver's pe_exec_ns includes the credit-return cost. _credit_latency_ns switches from compute_drain_ns(path, 16) to compute_path_latency_ns(path, 16) and fixes a latent find_path bug where the destination lacked the ".pe_dma" suffix (silently returned 0 ns under the bare except). Net effect on h3/h4 inter-cube pe-to-pe latency: IPCQ >= raw DMA at every size, matching real-HW posted-write semantics. tl.send remains fire-and-forget. ADR-0023 D9 amended; new diagnostic test tests/test_pe_to_pe_diagnostic.py captures per-PE pe_exec_ns, paths, drain, and meta-arrival timing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:12:38 -07:00
mukesh	19dfc86dc3	Allreduce latency sweep across topologies and data sizes Adds test_allreduce_latency_sweep that runs the existing intercube allreduce kernel under three SIP topologies (ring_1d, torus_2d, mesh_2d_no_wrap, all at n_sips=4) across 11 data sizes from 256 B/SIP up to 1 MB/SIP. For each point, captures max(pe_exec_ns) — the critical-path kernel time — and emits CSV plus log-x and linear-x plots, both per-topology and combined overview, with KB/MB-formatted tick labels. Reuses run_allreduce + _write_temp_configs and adds a slot_size auto-bump when n_elem*2 exceeds the default IPCQ slot. Sweep skips n_elem=16 because the runtime's dim_map scalar-arg remapping (context.py:761) collides any int-valued kernel scalar that matches a global tensor dim with its local shard size. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 10:16:29 -07:00
mukesh	14d800b0ae	Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023) - KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier (max path latency across every target PE), M_CPU passes it through, PE_CPU yields until it before recording pe_exec_start. Every PE in a launch begins kernel execution at the same env.now regardless of its dispatch path length — eliminates per-PE dispatch-offset artifact in cross-PE and cross-cube latency measurements. - PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top, matching the terminal-drain behavior of ComponentBase._forward_txn for every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget (sender doesn't yield on sub_done); tl.recv now blocks until bytes have actually drained into its inbox. - ComponentContext: new compute_path_latency_ns helper + node_overhead_ns field populated by GraphEngine. - tests/test_kernel_launch_sync.py: asserts all PEs in one launch produce identical pe_exec_ns for a no-op kernel (zero spread). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 15:30:29 -07:00
mukesh	6918e6e906	PE-to-PE latency test + supporting fixes Adds tests/test_pe_to_pe_latency.py: a sweep that measures PE-to-PE transfer latency for five hop types (intra-cube horizontal/vertical, inter-cube horizontal/vertical, inter-SIP) across data sizes 128 B to 10 KB, on both the IPCQ (tl.send/tl.recv) and raw-DMA (tl.load+tl.store) paths. Emits per-hop PNG plots, an overview PNG, and a CSV summary into tests/pe2pe_latency_plots/. Latency is reported as max(pe_exec_ns) across participating PEs, read from engine.get_completion(), so the measurement captures the SRC/DST PE's kernel body time rather than the full launch+ response-aggregation envelope. Two simulator fixes were needed to make this measurement meaningful: - PeMMU now stores a list of (start, end, pa) sub-regions per page rather than a single PA. DPPolicy layouts with shards smaller than page_size (e.g. 128 B payloads with 4 KB pages) used to silently overwrite each other through last-write-wins, causing DMAs intended for cube0 to physically route to cube3 - inflating latency by ~170 ns per DMA at small sizes. STOPGAP: real MMUs don't support sub-page regions; long-term fix is either smaller MMU page size or DPPolicy validation that refuses sub-page shards. - M_CPU's per-PE metrics aggregation (pe_exec_ns, dma_ns, compute_ns) now max-merges against the existing value in result_data rather than overwriting. Multi-cube workloads share one result_data dict via IO_CPU fanout; the previous overwrite caused whichever cube's M_CPU finished last to clobber others' values, so multi-cube pe_exec_ns was racy and frequently 0. Same fix applied in legacy/builtin/m_cpu.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 21:04:31 -07:00
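
Commit 1c5752a9ec describes the hand-derived theoretical curve on overview.png as a per-cube packet count (bytes_per_pe × 8 PEs ÷ 128 B) with 1346 ns startup plus 1.20 ns per packet. A small worked version of that arithmetic is sketched below. The function name is hypothetical, partial packets are not rounded up, and 1 KB is taken as 1024 bytes, none of which the commit message confirms.

```python
PACKET_BYTES = 128        # packet size from the commit message
PES_PER_CUBE = 8          # PEs contributing per cube
STARTUP_NS = 1346.0       # fixed startup cost
NS_PER_PACKET = 1.20      # incremental cost per packet


def theoretical_latency_ns(bytes_per_pe: int) -> float:
    """Startup plus per-packet cost for one cube's worth of data."""
    packets = bytes_per_pe * PES_PER_CUBE / PACKET_BYTES
    return STARTUP_NS + NS_PER_PACKET * packets


if __name__ == "__main__":
    # Under the assumptions above, 96 KB / PE is 6144 packets per cube.
    print(f"{theoretical_latency_ns(96 * 1024):.1f} ns")
```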