ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/

Establish English as the canonical ADR language with Korean translations held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror). Promotion from adr-proposed/ to adr/ now writes English to adr/ and the Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.

- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English, 2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix dropped). ADR-0023 EN regenerated against KO source which had newer HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline section covering bidirectional sync, conflict resolution (EN wins), and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair completeness, filename mirroring, ADR-ID match, Status byte-equality. Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF normalization, em-dash title separator, underscore-slug edge case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:38:44 -07:00
parent 687c98086d
commit a796c1d2f7
42 changed files with 10515 additions and 3422 deletions
@@ -202,8 +202,8 @@ General fallbacks. Apply to anything not explicitly covered above.
 >
 > Contains **foundations** (Authority & Scope → Terminology → Terminology
 > Discipline → Mental Model → Common Failure Modes) followed by **rules**
-> (Non-Trivial, Verification Plan, CLI, Derived Artifacts, runtime API /
-> sim_engine Boundaries).
+> (Non-Trivial, Verification Plan, CLI, Derived Artifacts, ADR Translation
+> Discipline, runtime API / sim_engine Boundaries).

 ## Authority & Scope

@@ -218,14 +218,22 @@ General fallbacks. Apply to anything not explicitly covered above.

 ### ADR Lifecycle

-ADRs live in one of three folders based on lifecycle state:
+ADRs live in one of four folders. Three are keyed to lifecycle state;
+the fourth holds Korean translations of accepted ADRs:

-- `docs/adr/` — **Accepted** (current implementation reflected).
+- `docs/adr/` — **Accepted** (canonical English; current
+  implementation reflected).
 - `docs/adr-proposed/` — **Proposed**, **Stub**, or **Draft** (design
  only / future-work exploration / retroactive documentation pending
-  verification).
+  verification). **Authoring language is free** (any language); the
+  promotion step (below) translates to English.
 - `docs/adr-history/` — **Superseded** or **Merged** (no longer the
-  authoritative source; kept as historical record).
+  authoritative source; kept as historical record). Frozen — language
+  policy not applied retroactively.
+- `docs/adr-ko/` — Korean translations of accepted ADRs (derived
+  artifact, 1:1 mirror of `docs/adr/`). English in `docs/adr/` is the
+  canonical source of truth; when KO and EN disagree, EN wins. See
+  *ADR Translation Discipline* below.

 Status field values:

@@ -240,17 +248,23 @@ Status field values:
 Transitions:

 - **Proposed/Stub → Accepted**: when the ADR's decisions are
-  reflected in production code AND covered by tests. `git mv` from
-  `docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`.
+  reflected in production code AND covered by tests. If the proposed
+  ADR is in Korean, translate to English and place the English in
+  `docs/adr/`; move the Korean original to `docs/adr-ko/`. If the
+  proposed ADR is in English, `git mv` it to `docs/adr/` and create
+  the Korean translation in `docs/adr-ko/`. Change Status to
+  `Accepted` in both files.
 - **Draft → Accepted**: when the ADR's text has been verified to
-  accurately describe the existing implementation. `git mv` from
-  `docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`.
+  accurately describe the existing implementation. Same English /
+  Korean placement rule as above.
 - **Accepted → Superseded**: set Status to `Superseded by ADR-MMMM`
-  and `git mv` to `docs/adr-history/`. The superseding ADR includes
-  a "Supersedes ADR-NNNN" reference (or, for partial supersession of
-  clauses, documents this in its own body).
+  in both the EN and KO files, then `git mv` the English file to
+  `docs/adr-history/`. The KO copy, if one was already mirrored, stays
+  in `docs/adr-ko/` (see *ADR Translation Discipline* for the
+  frozen-history exception).
 - **Accepted → Merged**: set Status to `Merged into ADR-MMMM`
-  (single-line stub) and `git mv` to `docs/adr-history/`.
+  (single-line stub) in both files and apply the same `git mv` rule
+  as the Superseded transition.

 Cross-references between ADRs use the `ADR-NNNN` ID and remain valid
 regardless of folder location. ADR numbers are **immutable**; never
@@ -361,11 +375,48 @@ Concrete forms that Part 1's *Verification Plan* MUST take in this repo:
 ## Derived Artifacts (Clarification)

 - Generated diagrams under `docs/diagrams/` are **derived artifacts**, not production code.
-- Creating or updating files in `docs/diagrams/`:
+- Korean ADR translations under `docs/adr-ko/` are **derived artifacts**
+  (mirror of the canonical English in `docs/adr/`); see *ADR Translation
+  Discipline*.
+- Creating or updating files in `docs/diagrams/` or `docs/adr-ko/`:
  - does NOT count as a production code change,
  - does NOT require Phase 2 approval,
  - MUST be consistent with SPEC.md and ADRs.

+## ADR Translation Discipline
+
+English in `docs/adr/` is the canonical source of truth. Korean in
+`docs/adr-ko/` mirrors it 1:1 as a derived artifact.
+
+**Bidirectional sync rule (MUST)**: any edit to a file in `docs/adr/`
+must be accompanied, in the same change, by a mirroring edit to
+`docs/adr-ko/<same-filename>.md`. The reverse also applies: edits to
+`docs/adr-ko/` must mirror back into `docs/adr/`. The two files must
+always describe the same architectural content.
+
+Mechanics:
+
+- When editing an EN ADR, propagate the change to its KO counterpart
+  by translating just the diff (preserve unaffected KO prose); do not
+  regenerate the whole KO file from scratch.
+- When editing a KO ADR, propagate to EN the same way.
+- Filename mirror: `docs/adr/X.md` ↔ `docs/adr-ko/X.md` (no language
+  suffix in either path).
+- The `## Status` block content must remain byte-identical between
+  the EN and KO files (e.g., both say `Accepted`).
+- Conflict policy: if the two diverge despite the rule, treat EN as
+  authoritative and overwrite KO. Surface the divergence to the user
+  before reconciling.
+- `docs/adr-proposed/` is exempt — single language only, no mirror
+  required until promotion.
+- `docs/adr-history/` is frozen — pre-existing mixed-language state
+  there is not migrated.
+
+Verification: `python tools/verify_adr_lang_pairs.py` checks that
+every EN ADR has a matching KO file, the title's ADR-NNNN matches the
+filename, and Status blocks are byte-equal. Run it on demand or wire
+it into CI. Exit code: 0 = OK, 1 = mismatch.
+
 ## runtime API / sim_engine Boundaries

 - runtime API MUST NOT hardcode topology/routing or internal hop sequences.
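The pair checks listed under *ADR Translation Discipline* above (pair completeness, ADR-ID match, Status byte-equality) could be sketched roughly as follows. This is an illustrative sketch only, not the contents of `tools/verify_adr_lang_pairs.py`; the function names `status_block` and `check_pairs` and the message strings are hypothetical.

```python
import re
from pathlib import Path

ADR_ID = re.compile(r"ADR-\d{4}")


def status_block(text: str) -> str:
    """Return the body of the '## Status' section, with CRLF normalized to LF."""
    body, inside = [], False
    for line in text.replace("\r\n", "\n").split("\n"):
        if line.startswith("## "):
            if inside:
                break  # the next level-2 heading ends the Status block
            inside = line.strip() == "## Status"
            continue
        if inside:
            body.append(line)
    return "\n".join(body).strip()


def check_pairs(en_dir: Path, ko_dir: Path) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    en_names = {p.name for p in en_dir.glob("*.md")}
    ko_names = {p.name for p in ko_dir.glob("*.md")}

    # Pair completeness and filename mirroring (same name on both sides).
    for name in sorted(en_names ^ ko_names):
        missing_in = "adr-ko" if name in en_names else "adr"
        problems.append(f"{name}: missing counterpart in {missing_in}")

    for name in sorted(en_names & ko_names):
        en_text = (en_dir / name).read_text(encoding="utf-8")
        ko_text = (ko_dir / name).read_text(encoding="utf-8")
        file_id = ADR_ID.search(name)
        # The ADR-NNNN in each title line must match the one in the filename.
        for label, text in (("EN", en_text), ("KO", ko_text)):
            title_id = ADR_ID.search(text.split("\n", 1)[0])
            if not (file_id and title_id and file_id.group() == title_id.group()):
                problems.append(f"{name}: {label} title ID does not match filename")
        if status_block(en_text) != status_block(ko_text):
            problems.append(f"{name}: Status blocks differ")
    return problems
```

A wrapper script could map an empty result to exit code 0 and anything else to 1, matching the exit-code convention stated above.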
@@ -0,0 +1,362 @@
+# ADR-0001: 51-bit Physical Address Layout & Decoding Contract
+
+## Status
+
+Accepted (Revision 2 — 2026-04-27: concrete bit layout, rack_id removal,
+Tray->SIP / SIP->DIE renaming, PE/MCPU/IOCPU sub-unit tables.
+Supersedes ADR-0031.)
+
+## Date
+
+2026-04-27 (original: 2026-02-27)
+
+## Context
+
+KernBench requires a stable, parsable physical address scheme that:
+
+- can be decoded into routing domains (SIP / die / HBM / PE-resource / IOCPU)
+- remains topology-agnostic (no hardcoded counts)
+- supports swappable policy and DI-first components
+- covers multiple SIPs, AHBM dies, and IO chiplet dies in a unified space
+
+### History
+
+- Original ADR-0001 defined a 51-bit layout with `rack_id(4) + sip_id(4) +
+  sip_seg(5) + local_offset(38)`. `rack_id` was never used in practice.
+- ADR-0031 (stub) requested PE-resource range partition but was never
+  implemented.
+
+Revision 2 removes `rack_id`, renames `sip_seg -> die_id`, and provides
+concrete sub-unit tables for PE, MCPU, CUBE_SRAM, and IOCPU resources.
+ADR-0031 is superseded.
+
+## Decision
+
+We define a **PhysAddr value object** and an **address decoding contract**
+that converts an integer address into routing domains.
+
+### D1. PhysAddr is an immutable value object
+
+- PhysAddr is immutable and comparable as a pure value.
+- Any allocator returns a **fully specified PhysAddr** (not partial metadata).
+- No global state may be required to interpret a PhysAddr.
+
+### D2. 51-bit Physical Address Layout
+
+A 51-bit physical address is adopted.
+
+#### 2.1 Top-Level Address Map
+
+```text
+[50:47] sip_id        (4)     -- 16 SIPs
+[46:42] die_id        (5)     -- 32 dies per SIP
+[41: 0] local_offset  (42)    -- 4 TB per die
+```
+
+```text
+50      47 46      42 41                      0
+---------+----------+-------------------------+
+| sip_id  | die_id   |      local_offset       |
+---------+----------+-------------------------+
+```
+
+#### 2.2 die_id Allocation
+
+| die_id | Meaning |
+|--------|---------|
+| 0..15  | AHBM dies |
+| 16..20 | IOCHIPLET dies |
+| 21..31 | Reserved |
+
+#### 2.3 AHBM Die Layout
+
+Only the lower 256 GB of the 4 TB die-local window is assigned.
+
+```text
+[41:38] MBZ            (4)
+[37]    addr_space      (1)    -- 0 = local resource, 1 = HBM memory
+[36: 0] sub-address    (37)
+```
+
+| addr_space | Meaning |
+|------------|---------|
+| 0 | Local resource |
+| 1 | HBM memory |
+
+##### 2.3.1 HBM Window (addr_space = 1)
+
+```text
+[36:0] hbm_offset     (37)    -- 128 GB decode window
+```
+
+The architectural decode window is fixed at 128 GB. Implemented capacity
+may be smaller depending on SKU/topology (see D4).
+
+##### 2.3.2 Resource Window (addr_space = 0)
+
+```text
+[36:34] resource_kind  (3)
+[33: 0] kind_local    (34)    -- 16 GB per kind
+```
+
+| resource_kind | Meaning |
+|---------------|---------|
+| 000 | PE_LOCAL |
+| 001 | MCPU_LOCAL |
+| 010 | CUBE_SRAM |
+| 011..111 | Reserved |
+
+Each kind gets a 16 GB decode region.
+
+##### 2.3.3 PE_LOCAL (resource_kind = 000)
+
+```text
+[33]    MBZ            (1)
+[32:29] pe_id          (4)     -- 0..15
+[28:25] pe_sub_unit    (4)
+[24: 0] sub_offset    (25)    -- 32 MB per slot
+```
+
+16 PEs x 16 sub-unit slots x 32 MB = 8 GB active decode.
+
+| pe_sub_unit | Name | Budget |
+|-------------|------|--------|
+| 0 | PE_CPU_DTCM | 8 KB |
+| 1 | MATH_ENGINE_DTCM | 8 KB |
+| 2 | IPCQ | 256 KB |
+| 3 | PE_CPU_SFR | 16 KB |
+| 4 | MATH_ENGINE_SFR | 16 KB |
+| 5 | DMA_ENGINE_SFR | 192 KB |
+| 6 | PE_TCM | 2 MB |
+| 7..15 | Reserved | -- |
+
+##### 2.3.4 MCPU_LOCAL (resource_kind = 001)
+
+```text
+[33:30] MBZ            (4)
+[29:25] mcpu_sub_unit  (5)
+[24: 0] sub_offset    (25)    -- 32 MB per slot
+```
+
+1 GB active decode.
+
+| mcpu_sub_unit | Name | Budget |
+|---------------|------|--------|
+| 0 | MCPU_ITCM | 512 KB |
+| 1 | MCPU_DTCM | 512 KB |
+| 2 | IPCQ | 256 KB |
+| 3 | MCPU_SFR | 8 KB |
+| 4 | MCPU_DMA_SFR | 16 KB |
+| 5 | MCPU_SRAM | 10 MB |
+| 6..31 | Reserved | -- |
+
+##### 2.3.5 CUBE_SRAM (resource_kind = 010)
+
+```text
+[33:25] MBZ            (9)
+[24: 0] sram_offset   (25)    -- flat 32 MB
+```
+
+#### 2.4 IOCHIPLET Die Layout
+
+Only the lower 1 TB of the 4 TB die-local window is assigned.
+
+```text
+[41:40] MBZ            (2)
+[39: 0] chiplet_offset (40)   -- 1 TB
+```
+
+Region split by address range:
+
+| Range | Meaning | Decode condition |
+|-------|---------|------------------|
+| [0, 2 GB) | IOCPU resource | chiplet_offset < 0x8000_0000 |
+| [2 GB, 1 TB) | UAL | chiplet_offset >= 0x8000_0000 |
+
+##### 2.4.1 IOCPU Region
+
+```text
+[30:27] iocpu_sub_unit (4)
+[26: 0] sub_offset    (27)    -- 128 MB per slot
+```
+
+16 x 128 MB slots. 2 GB active decode.
+
+| iocpu_sub_unit | Name | Budget |
+|----------------|------|--------|
+| 0 | IOCPU_ITCM | 512 KB |
+| 1 | IOCPU_DTCM | 512 KB |
+| 2 | IPCQ | 2 MB |
+| 3 | IOCPU_SFR | 8 KB |
+| 4 | IO_DMA_SFR | 16 KB |
+| 5 | IO_SRAM | 64 MB |
+| 6..15 | Reserved | -- |
+
+##### 2.4.2 UAL Region
+
+Sub-layout TBD (separate ADR).
+
+#### 2.5 Addressing Rules
+
+1. MBZ bits must be zero. An address with non-zero MBZ bits is
+   **architecturally invalid**. An implementation may raise a decode
+   fault or return an error; the behavior is not prescribed by this ADR.
+2. Fixed slot sizes are chosen for simple hardware decode; actual
+   implemented capacity may be smaller than the slot.
+3. Access beyond a sub-unit's implemented budget within a slot is
+   **architecturally invalid** (same policy as MBZ).
+
+### D3. Bitfield decoding is deterministic
+
+Given an integer address, field extraction (`sip_id`, `die_id`, `kind`,
+`sub_unit`, `offset`) is purely positional. No runtime state is required.
+Decoding deterministically maps an integer address to destination domains:
+`sip_id`, `die_id`, target kind (HBM / PE_LOCAL / MCPU_LOCAL / CUBE_SRAM /
+IOCPU / UAL).
+
+### D4. Capacity validation may depend on topology config
+
+Whether a decoded address falls within **implemented capacity** (e.g.,
+HBM 96 GB on a specific SKU) is checked against topology parameters
+provided via DI/config. Decode itself (D3) never consults topology --
+only validation does. These parameters must live in the topology/config
+layer, not in node implementations.
+
+### D5. Routing consumes decoded domains, not raw bits
+
+Routing policy uses decoded domains:
+
+- `src` location (sip / die / pe or node_id)
+- `dst` domains derived from PhysAddr decoding
+- `size_bytes` for size-aware link latency
+
+Routing must not inspect raw bit-fields directly except inside the
+decoding module.
+
+## Alternatives Considered
+
+1. **Keep `rack_id` (4 bits)**: Rejected -- never used in practice, and
+   it consumes 4 bits that are needed to expand the die-local offset
+   from 38 to 42 bits (IOCHIPLET 1 TB).
+
+2. **Uniform 256 GB per die**: Rejected -- IOCHIPLET UAL requires ~1 TB.
+   Freed rack_id bits enable 42-bit local_offset.
+
+3. **Variable-width die windows (AHBM 256 GB, IOCHIPLET 1 TB via
+   multi-seg spanning)**: Rejected -- complicates D3 (deterministic
+   decoding). Uniform 4 TB window with MBZ padding is simpler.
+
+4. **Use raw integers everywhere, decode ad-hoc in routing**: Rejected --
+   leads to duplicated logic, inconsistent routing, and hidden
+   assumptions.
+
+5. **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**:
+   Rejected -- violates SPEC R3 and breaks swappability.
+
+6. **Put decoding inside memory controllers or routers**: Rejected --
+   leaks policy into components, violates SPEC R4 / D5.
+
+## Consequences
+
+### Positive
+
+- Simple hierarchical decoder: SIP -> die -> kind -> sub-unit.
+- Clean separation of memory (HBM) vs local resource (PE/MCPU/SRAM/IOCPU).
+- Deterministic routing domains enable clear test invariants (SPEC R1, R5).
+- Expandable: 11 reserved die_id slots, reserved resource_kind / sub-unit
+  slots, reserved MBZ bits.
+- DI-first: decoder can be swapped without changing components (SPEC R4).
+
+### Tradeoffs
+
+- Sparse address holes due to power-of-2 slot alignment.
+- Large reserved/MBZ regions (intentional for future extension).
+- Requires explicit configuration for topology-derived sizes (D4).
+- Introduces a single "blessed" decoding module that must remain stable
+  and well-tested.
+
+## Supersedes
+
+- **ADR-0031 (PhysAddr PE-Resource Extension)**: stub status. The
+  PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables in D2.3.3-D2.3.5
+  fulfill ADR-0031's stated goals.
+
+## Implementation Notes (Non-normative)
+
+- Recommended module: `src/kernbench/policy/address/phyaddr.py`
+- Tests should cover: encode/decode round-trip per kind, MBZ enforcement,
+  die_id dispatch (AHBM / IOCHIPLET / reserved), sub-unit boundary
+  values, backward compatibility of factory APIs.
+- Factory methods: `hbm_addr`, `pe_hbm_addr`, `pe_tcm_addr`,
+  `cube_sram_addr` retain signatures (minus `rack_id`); `cube_id`
+  parameter renamed to `die_id`.
+- New factories: `pe_resource_addr`, `mcpu_resource_addr`,
+  `iocpu_resource_addr`, `ual_addr`.
+
+## Appendix A. Address Examples
+
+### A.1 AHBM HBM access
+
+sip=2, die=5, HBM offset=0x1000
+
+```text
+sip_id     = 2       -> [50:47] = 0b0010
+die_id     = 5       -> [46:42] = 0b00101
+addr_space = 1       -> [37]    = 1 (HBM)
+hbm_offset = 0x1000  -> [36:0]
+
+51-bit addr = (2 << 47) | (5 << 42) | (1 << 37) | 0x1000
+```
+
+### A.2 AHBM PE_LOCAL -- PE3 PE_TCM, offset=0x400
+
+```text
+sip_id        = 0  -> [50:47] = 0
+die_id        = 0  -> [46:42] = 0
+addr_space    = 0  -> [37]    = 0
+resource_kind = 0  -> [36:34] = 000 (PE_LOCAL)
+pe_id         = 3  -> [32:29] = 0011
+pe_sub_unit   = 6  -> [28:25] = 0110 (PE_TCM)
+sub_offset    = 0x400 -> [24:0]
+
+local_offset = (0 << 34) | (3 << 29) | (6 << 25) | 0x400
+```
+
+### A.3 AHBM MCPU_LOCAL -- MCPU_SRAM, offset=0x0
+
+```text
+sip_id        = 1  -> [50:47] = 0001
+die_id        = 3  -> [46:42] = 00011
+addr_space    = 0  -> [37]    = 0
+resource_kind = 1  -> [36:34] = 001 (MCPU_LOCAL)
+mcpu_sub_unit = 5  -> [29:25] = 00101 (MCPU_SRAM)
+sub_offset    = 0  -> [24:0]  = 0
+
+local_offset = (1 << 34) | (5 << 25)
+```
+
+### A.4 IOCHIPLET -- IOCPU IPCQ, offset=0x20000
+
+```text
+sip_id         = 1   -> [50:47] = 0001
+die_id         = 17  -> [46:42] = 10001 (IOCHIPLET[1])
+iocpu_sub_unit = 2   -> [30:27] = 0010 (IPCQ)
+sub_offset     = 0x20000 -> [26:0]
+
+chiplet_offset = (2 << 27) | 0x20000
+                 (< 0x8000_0000 -> IOCPU region)
+```
+
+### A.5 IOCHIPLET -- UAL region, offset=4 GB
+
+```text
+sip_id         = 0   -> [50:47] = 0
+die_id         = 16  -> [46:42] = 10000 (IOCHIPLET[0])
+chiplet_offset = 0x1_0000_0000 (4 GB >= 2 GB -> UAL region)
+```
+
+## Links
+
+- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first),
+  R5 (multi-domain comm)
+- ADR-0031: Superseded
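The positional decode described in ADR-0001 D2/D3 above can be illustrated with a small sketch. It is not the `phyaddr.py` module named in the Implementation Notes: the `Decoded` type, the `bits` helper, and the choice to raise `ValueError` for MBZ or reserved encodings are all assumptions made for illustration (the ADR deliberately leaves fault behavior unprescribed). Capacity validation (D4) is omitted on purpose, since decode never consults topology.

```python
from dataclasses import dataclass
from typing import Optional


def bits(value: int, hi: int, lo: int) -> int:
    """Extract the inclusive bit range [hi:lo] from value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


@dataclass(frozen=True)
class Decoded:
    sip_id: int
    die_id: int
    kind: str                 # HBM / PE_LOCAL / MCPU_LOCAL / CUBE_SRAM / IOCPU / UAL
    pe_id: Optional[int]      # only set for PE_LOCAL
    sub_unit: Optional[int]   # pe / mcpu / iocpu sub-unit slot, if any
    offset: int


def require_zero(field: int, name: str) -> None:
    if field != 0:
        raise ValueError(f"non-zero MBZ bits {name}")


def decode(addr: int) -> Decoded:
    if addr < 0 or addr >> 51:
        raise ValueError("address does not fit in 51 bits")
    sip, die, local = bits(addr, 50, 47), bits(addr, 46, 42), bits(addr, 41, 0)

    if die <= 15:  # AHBM die (section 2.3)
        require_zero(bits(local, 41, 38), "[41:38]")
        if bits(local, 37, 37) == 1:  # addr_space = 1: HBM window
            return Decoded(sip, die, "HBM", None, None, bits(local, 36, 0))
        kind = bits(local, 36, 34)
        if kind == 0b000:
            require_zero(bits(local, 33, 33), "[33]")
            return Decoded(sip, die, "PE_LOCAL", bits(local, 32, 29),
                           bits(local, 28, 25), bits(local, 24, 0))
        if kind == 0b001:
            require_zero(bits(local, 33, 30), "[33:30]")
            return Decoded(sip, die, "MCPU_LOCAL", None,
                           bits(local, 29, 25), bits(local, 24, 0))
        if kind == 0b010:
            require_zero(bits(local, 33, 25), "[33:25]")
            return Decoded(sip, die, "CUBE_SRAM", None, None, bits(local, 24, 0))
        raise ValueError("reserved resource_kind")

    if die <= 20:  # IOCHIPLET die (section 2.4)
        require_zero(bits(local, 41, 40), "[41:40]")
        chiplet_offset = bits(local, 39, 0)
        if chiplet_offset < 0x8000_0000:  # [0, 2 GB): IOCPU region
            return Decoded(sip, die, "IOCPU", None,
                           bits(chiplet_offset, 30, 27), bits(chiplet_offset, 26, 0))
        return Decoded(sip, die, "UAL", None, None, chiplet_offset)

    raise ValueError("reserved die_id")
```

Feeding the Appendix A examples through `decode` is a quick sanity check: for A.1, `decode((2 << 47) | (5 << 42) | (1 << 37) | 0x1000)` yields kind `HBM` with `sip_id=2`, `die_id=5`, and `offset=0x1000`.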
@@ -0,0 +1,102 @@
+# ADR-0002: Routing Distance, Ordering & Bypass Rules
+
+## Status
+Accepted
+
+## Date
+2026-02-27
+
+## Context
+The KernBench Graph Latency Simulator must compare kernel execution time
+across different architectures and topologies by computing end-to-end
+latency from graph traversal.
+
+To support meaningful comparison:
+- routing must be deterministic
+- latency must reflect actual interconnect structure
+- local vs remote traffic must be distinguishable
+- “bypass” optimizations must not undermine debuggability or correctness
+
+The simulator also aims to avoid software-managed metadata and hidden
+shortcuts that obscure control paths.
+
+## Decision
+
+### D1. Distance is accumulated latency, not hop count
+- Routing “distance” is defined as the **sum of per-node and per-link latency**.
+- Hop count alone must not be used for ordering or path selection.
+- Size-aware serialization latency (bytes / BW) contributes to distance.
+
+### D2. Routing order is derived from graph traversal
+- The chosen route is the path with minimum accumulated latency
+  given the constructed graph and routing policy.
+- Deterministic ordering must be guaranteed for identical inputs
+  (topology + policy + request).
+
+### D3. Bypass is explicit and graph-represented
+- All paths must be explicitly represented in the graph and subject to latency accumulation.
+- Example: PE_DMA connects to the NOC router mesh (ADR-0017 D7). All destinations
+  (HBM, shared SRAM, inter-cube UCIe) are reached via explicit mesh hops.
+  Local HBM access has minimal hops (switching overhead only); remote access
+  traverses additional routers.
+- Implicit or “magic” bypass paths are disallowed.
+
+### D4. No zero-latency end-to-end paths
+
+- Every routed request must incur **end-to-end** latency > 0.
+- Individual fabric segments (e.g., NOC hops) MAY have distance_mm = 0
+  when the fabric is distributed and distance is not meaningful at that granularity.
+  This is allowed because other components on the same path (e.g., PE_DMA, SRAM,
+  UCIe endpoints) contribute non-zero latency, ensuring the end-to-end invariant holds.
+- Fully zero-latency end-to-end paths are disallowed, except for explicit
+  test-only stubs clearly marked as such.
+
+### D5. Policy vs topology responsibility split
+- Topology builder:
+  - defines nodes and links and their latency/BW parameters
+- Routing policy:
+  - selects among available graph paths based on decoded domains
+- Routing policy must not assume missing links; missing connectivity
+  is a topology construction error.
+
+### D6. No software-managed routing metadata
+- Routing decisions must not rely on per-request software-managed metadata
+  that tracks distance, hop count, or ordering outside the graph model.
+- All distance/order computation is derived from traversal itself.
+
+## Alternatives Considered
+
+1) **Hop-count based routing**
+- Rejected: ignores heterogeneous latency/BW and misrepresents
+  architectural differences.
+
+2) **Implicit local shortcuts**
+- Rejected: breaks debuggability and violates traversal-based latency.
+
+3) **Software-managed distance metadata**
+- Rejected: increases control overhead and obscures routing semantics.
+
+## Consequences
+
+### Positive
+- Clear, debuggable hop-by-hop traces (SPEC R2, R4).
+- Architecture comparisons reflect real interconnect structure.
+- Routing behavior is reproducible and deterministic.
+
+### Tradeoffs / Costs
+- Graph construction must be correct and complete.
+- Bypass modeling requires explicit graph representation,
+  which slightly increases topology description complexity.
+
+## Implementation Notes (Non-normative)
+- Recommended responsibilities:
+  - Graph builder: ensure all required paths exist.
+  - Router: select next hop based on decoded domains and policy.
+- Tests should assert:
+  - non-zero end-to-end latency
+  - deterministic routing for identical inputs
+  - bypass paths appear explicitly in emitted traces
+
+## Links
+- SPEC.md: R1 (routing), R2 (latency), R3 (topology), R5 (multi-domain comm)
+- ADR-0001: PhysAddr layout & decoding contract
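ADR-0002 D1, D2, and D4 above can be illustrated with a small shortest-path sketch in which distance is accumulated node latency plus link latency plus serialization time. The graph encoding, the function name `min_latency_route`, and the per-link serialization charge are assumptions for illustration; the simulator's actual router and latency model may differ (for example, it may model cut-through rather than charging serialization on every link).

```python
import heapq


def min_latency_route(links, node_latency_ns, src, dst, size_bytes):
    """Return (total_latency_ns, path) for the minimum accumulated latency.

    links: {node: [(neighbor, link_latency_ns, bw_bytes_per_ns), ...]}
    node_latency_ns: {node: latency_ns}
    """
    # Heap entries are (cost, path). Equal costs fall back to comparing the
    # path tuples, so identical inputs always yield the same route (D2).
    heap = [(node_latency_ns[src], (src,))]
    settled = set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            if cost <= 0:
                raise ValueError("zero-latency end-to-end path (D4)")
            return cost, list(path)
        if node in settled:
            continue
        settled.add(node)
        for nxt, link_ns, bw in links.get(node, []):
            if nxt not in settled:
                # D1: link latency + size-aware serialization + node latency.
                hop = link_ns + size_bytes / bw + node_latency_ns[nxt]
                heapq.heappush(heap, (cost + hop, path + (nxt,)))
    # D5: missing connectivity is a topology construction error.
    raise LookupError(f"no path from {src} to {dst}")
```

In this sketch a two-hop path over fast links beats a one-hop path over a slow link, which is the behavior D1 asks for when it rules out hop count as the ordering metric.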
@@ -0,0 +1,68 @@
+# ADR-0003: Target System Hierarchy & Modeling Scope
+
+## Status
+
+Accepted
+
+## Context
+
+We need a system-level simulator to evaluate LLM kernel performance on our AI Accelerator platform.
+The platform is organized as a compute tray containing multiple identical SIPs connected via PCIe or UAL
+through switching fabrics, with a host CPU issuing commands/kernels.
+
+## Decision
+
+We model the system hierarchy explicitly:
+
+### D1. Tray-level
+
+- A compute tray contains:
+  - Host CPU (issues requests / coordinates runtime & data placement)
+  - Multiple identical SIPs (accelerators)
+  - Interconnect fabric between SIPs (PCIe and/or UAL via switches)
+
+### D2. SIP-level
+
+- A SIP is a multi-die package composed of:
+  - Multiple CUBEs (HBM die + compute PEs + UCIe)
+  - One or more IO chiplets (host/SIP interfaces)
+- IO chiplets:
+  - provide interfaces: PCIe-EP, IO_CPU, optionally UAL-EP
+  - can be multiple per SIP
+  - placement constrained to SIP shoreline (top/bottom/left/right); each shoreline may host 1–2 IO chiplets
+
+### D3. CUBE-level
+
+- A CUBE contains:
+  - HBM + memory controller (HBM_CTRL)
+  - NOC (on-die fabric): carries all intra-cube traffic including HBM data,
+    inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access.
+    Must provide: full-BW PE↔local HBM path, PE↔SRAM connectivity,
+    PE↔UCIe connectivity, M_CPU↔PE command path.
+    NOC topology is an implementation choice (e.g., 2D mesh, ring, crossbar);
+    current implementation uses a 2D mesh with XY routing (see ADR-0017).
+    HBM_CTRL is attached to each PE's local NOC port (local HBM = minimal hop).
+  - Shared SRAM: cube-level shared memory accessible by all PEs via NOC
+  - management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
+  - multiple PEs
+  - up to 4 UCIe endpoints (N/E/W/S) for CUBE↔CUBE and CUBE↔IO connectivity
+
+### D4. PE-level
+
+- A PE can execute one kernel instance
+- PE contains internal control + accelerators (modeled at PE view granularity):
+  - PE_CPU, command handler, PE_TCM, DMA/GEMM/MATH engines, internal queues
+
+## Consequences
+
+- The simulator supports abstraction by “views”:
+  - SIP view hides PE internals
+  - CUBE view treats each PE as a single block
+  - PE view expands PE internals
+- Topology remains parameterized; sizes/counts/links come from configuration.
+
+## Links
+
+- SPEC R3/R5
+- ADR-0005 (diagram views)
+- ADR-0017 (cube NOC 2D mesh architecture)
@@ -0,0 +1,76 @@
+# ADR-0004: Memory Semantics & Local-HBM Bandwidth Guarantee
+
+## Status
+
+Accepted
+
+## Context
+
+Accurately modeling PE↔HBM behavior is essential for kernel latency estimation.
+Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth, independent of intervening on-die fabric bandwidth.
+
+## Decision
+
+### D1. Local HBM definition
+
+- Each PE is assigned a logically defined “local HBM” region.
+- Local HBM corresponds to the pseudo-channel subset directly attached to that PE’s
+  router in the NOC mesh (ADR-0017 D4).
+- The path is: PE_DMA → local router → HBM_CTRL (switching overhead only, 0 mesh hops).
+- The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration.
+
+### D2. Local HBM bandwidth guarantee contract
+
+- Accesses from a PE to its local HBM MUST guarantee full effective HBM
+  read/write bandwidth independent of intervening fabric bandwidth limits.
+- Effective HBM bandwidth = spec bandwidth x efficiency factor.
+  The efficiency factor (configured via `hbm_ctrl.attrs.efficiency`, default 0.8)
+  models real-world DRAM inefficiencies (refresh cycles, bank conflicts, page
+  misses). For example: 256 GB/s spec x 0.8 = 204.8 GB/s effective.
+- The topology builder applies the efficiency factor to router-to-hbm edge
+  bandwidth at graph construction time, so all downstream routing and latency
+  computation uses the effective value.
+- This guarantee is modeled by:
+  - a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point,
+  - while still incurring non-zero latency along explicitly modeled components.
+- HBM CTRL internal modeling (PC striping, cut-through, scheduling fidelity)
+  is consolidated in ADR-0033 (Latency Model: Assumptions and Known
+  Simplifications). The aggregate BW guarantee here remains the contract;
+  ADR-0033 documents how the per-PC model realizes it and which scheduler
+  effects are intentionally simplified.
+
+### D3. Remote PE HBM semantics (intra-cube)
+
+- A PE that accesses another PE's local HBM traverses the NOC:
+  - PE_DMA → NOC → (fabric hops) → target PE's NOC port → HBM_CTRL
+- NOC bandwidth and hop count may limit remote HBM access relative to local access.
+
+### D4. Non-local HBM semantics (inter-cube / inter-SIP)
+
+- Accesses from a PE to HBM in a different cube or SIP MAY be limited by:
+  - NOC bandwidth within the cube,
+  - inter-cube UCIe links,
+  - inter-SIP fabric (PCIe/UAL).
+- These paths MUST be explicit and traceable.
+
+### D5. Shared SRAM semantics
+
+- Each CUBE contains a shared SRAM accessible by all PEs in that CUBE.
+- Access path: PE_DMA → NOC → shared SRAM.
+- Shared SRAM bandwidth is limited by the NOC↔SRAM link bandwidth.
+- Shared SRAM is not part of the HBM address space; it is a separate memory domain.
+
+## Verification Notes
+
+Tests should cover:
+
+- local-HBM case: BW matches HBM BW regardless of fabric BW parameter
+- remote PE HBM case: latency includes mesh hop traversal
+- non-local cases (inter-cube/inter-SIP): BW/latency respond to fabric/link parameters
+- shared SRAM case: access via NOC with correct BW
+
+## Links
+
+- SPEC R2/R5
+- ADR-0002 (distance/order & explicit bypass)
+- ADR-0017 D7 (PE DMA data paths through NOC to HBM)
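The effective-bandwidth arithmetic in ADR-0004 D2 above can be worked through in a few lines. The helper names are hypothetical, and the sketch assumes decimal units (1 GB/s = 1 byte/ns) for bandwidth; the topology builder's own unit handling is not shown in this ADR.

```python
def effective_bw_gbps(spec_bw_gbps: float, efficiency: float = 0.8) -> float:
    """Effective HBM bandwidth = spec bandwidth x efficiency factor."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return spec_bw_gbps * efficiency


def serialization_ns(size_bytes: int, bw_gbps: float) -> float:
    """Time to stream size_bytes at bw_gbps, assuming 1 GB/s == 1 byte/ns."""
    return size_bytes / bw_gbps


# The ADR's example: 256 GB/s spec at the default 0.8 efficiency.
bw = effective_bw_gbps(256.0)        # about 204.8 GB/s effective
t = serialization_ns(4096, bw)       # about 20 ns for a 4 KiB transfer
```

Because the factor is applied once at graph construction time, downstream latency code in such a model would only ever see the effective value.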
@@ -0,0 +1,186 @@
+# ADR-0005: Diagram Views & Distance-Aware Layout Rules
+
+## Status
+
+Accepted
+
+## Context
+
+We require verifiable and inspectable system modeling for a large-scale,
+parameterized AI Accelerator system.
+
+Humans must be able to:
+
+- visually inspect the modeled topology,
+- reason about communication structure and relative distance,
+- do so at multiple abstraction levels without being overwhelmed by detail.
+
+The simulator models distance (accumulated latency) as a first-class concept.
+Diagrams must reflect this distance by default.
+
+---
+
+## Decision
+
+### D1. Global Defaults
+
+- All diagrams MUST be **distance-aware by default**.
+- All diagrams MUST render **representative views** of the architecture.
+- Instance indices (e.g., sip0, cube2, pe3) MUST NOT be required for diagram generation.
+- Instance indices MAY be used ONLY:
+  - to define a distance anchor in asymmetric or debugging scenarios, or
+  - when explicitly requested.
+
+---
+
+### D2. Representative Rendering Rule
+
+- All CUBEs share the same internal structure.
+- All PEs share the same internal structure.
+
+Therefore:
+
+- SIP-level diagrams render representative CUBEs and IO chiplets.
+- CUBE-level diagrams render representative PEs as opaque blocks.
+- PE-level diagrams render a representative PE with fully expanded internals.
+
+Diagrams MUST NOT depend on specific SIP, CUBE, or PE indices
+unless explicitly requested.
+
+---
+
+### D3. Diagram Views
+
+#### View A — SIP-Level Diagram
+
+**Purpose**
+Explain system-scale structure and connectivity.
+
+**Visible elements**
+
+- SIP boundaries (optional)
+- CUBEs (opaque blocks)
+- IO chiplets (opaque blocks)
+- Optional UCIe stubs only if needed to clarify connectivity
+
+**Hidden elements**
+
+- PE internals
+- CUBE internal fabric
+- IO chiplet internals
+
+**Visible links**
+
+- Host ↔ IO chiplets (PCIe)
+- SIP ↔ SIP (PCIe / UAL via switches)
+- IO ↔ CUBE (on-package links)
+
+---
+
+#### View B — CUBE-Level Diagram
+
+**Purpose**
+Explain cube-internal structure and data/control flow.
+
+**Visible elements**
+
+- Router mesh: 2D grid of NOC routers (from cube_mesh.yaml), all traffic routes through mesh
+- HBM_CTRL attached to PE routers (local HBM = 0 hop)
+- HBM subsystem (HBM_CTRL)
+- Shared SRAM: cube-level shared memory
+- Management CPU (M_CPU)
+- PEs as opaque blocks (PE[0..N−1])
+- UCIe endpoints (N/E/W/S) as ports
+
+**Hidden elements**
+
+- PE internals
+
+**Visible links**
+
+- PE → router (HBM + non-HBM data path via mesh)
+- Router ↔ HBM_CTRL (local HBM access)
+- Router ↔ Router (mesh hops for remote access)
+- Router ↔ UCIe endpoints
+- Router ↔ shared SRAM
+- M_CPU ↔ router (command path)
+- Router → PE_CPU (command delivery, collapsed into PE block)
+
+---
+
+#### View C — PE-Level Diagram
+
+**Purpose**
+Explain internal PE behavior and execution structure.
+
+**Visible elements**
+
+- PE_CPU
+- Command handler / scheduler
+- PE_TCM (local SRAM)
+- HW accelerators (DMA, GEMM, MATH, etc.)
+- Local HBM interface
+- Optional IPCQ / messaging endpoints
+
+**Visible links**
+
+- Control paths (CPU → scheduler → engines)
+- Data paths (engines ↔ TCM, DMA ↔ local HBM)
+- External fabric ports as abstract ports only
+
+---
+
+### D4. Distance-Aware Layout (Default)
+
+#### Distance definition
+
+- Distance is defined as **accumulated latency**, consistent with ADR-0002.
+- Distance is computed from a single anchor node.
+
+#### Default anchor selection
+
+- SIP view: IO chiplet (or Host CPU if present)
+- CUBE view: a representative PE
+- PE view: PE_CPU or Command Handler
+
+Anchors are **implicit defaults** and MUST NOT be required to be specified.
+
+#### Layout rules
+
+- Diagrams MUST be laid out in layers based on distance buckets.
+- Layout direction MUST be consistent within a view type
+  (preferred: left-to-right).
+- Nodes with equal distance MUST have stable ordering
+  (by role or identifier, deterministically).
+
+Cycles MAY be rendered using dashed or curved edges for readability,
+without affecting distance semantics.
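
The layering rule above can be sketched as follows. This is a minimal illustration, not the generator's actual code: `assign_layers`, the `bucket_ns` width, and the sample node names and latencies are all assumptions made for the example.

```python
from collections import defaultdict


def assign_layers(distances_ns, bucket_ns):
    """Group nodes into layout layers by accumulated-latency bucket.

    distances_ns: node id -> accumulated latency from the anchor (ns).
    bucket_ns: hypothetical bucket width; the ADR does not fix one.
    """
    layers = defaultdict(list)
    for node_id, distance in distances_ns.items():
        layers[int(distance // bucket_ns)].append(node_id)
    # Buckets are ordered by distance; nodes inside a bucket are ordered
    # by identifier so equal-distance nodes keep a stable position.
    return [sorted(layers[bucket]) for bucket in sorted(layers)]


# Illustrative PE-view distances, anchored at pe_cpu.
distances = {"pe_cpu": 0.0, "scheduler": 2.0, "pe_tcm": 5.0, "dma": 5.5, "gemm": 5.0}
layers = assign_layers(distances, bucket_ns=5.0)
print(layers)
```

Sorting by identifier is only one way to satisfy the stable-ordering rule; ordering by role would work equally well as long as it is deterministic.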
+
+---
+
+### D5. Generation Contract (for Tools / Claude Code)
+
+When generating diagrams:
+
+- Assume distance-aware layout by default.
+- Assume representative rendering by default.
+- Do NOT ask for SIP/CUBE/PE indices unless required.
+- Do NOT expand hidden abstraction levels.
+- Prefer architectural clarity over micro-hop fidelity.
+
+---
+
+## Consequences
+
+- Diagrams are stable across topology scaling.
+- Changes in distance or routing policy are reflected visually.
+- Diagrams serve as verifiable artifacts derived from the simulator model,
+  not as hand-maintained documentation.
+
+---
+
+## Links
+
+- SPEC Section 4 (Output, Debuggability, and Diagrams)
+- ADR-0002 (Routing distance semantics)
+- ADR-0006 (Topology compilation & automatic diagram generation)
@@ -0,0 +1,130 @@
+# ADR-0006: Topology Compilation, Distance Extraction, and Automatic Diagram Generation
+
+## Status
+
+Accepted
+
+## Context
+
+The simulator compiles topology configuration (e.g., topology.yaml) into an explicit model graph,
+and computes routing and accumulated latency (distance).
+Diagrams should be generated from these authoritative artifacts to ensure consistency and avoid
+hand-maintained topology drawings.
+
+Additionally, for usability, diagrams should be emitted automatically into a stable location
+so that developers can preview them immediately in the repository.
+
+---
+
+## Decision
+
+### D1. Topology compilation is the single source of truth
+
+- topology.yaml (or equivalent config) is compiled into:
+  - an explicit system graph,
+  - node/link attributes,
+  - routing policies.
+
+This compiled graph is the authoritative representation of the system.
+
+### D2. Distance extraction during compilation
+
+- During or immediately after topology compilation, the simulator MUST compute distance metadata
+  (accumulated latency) consistent with ADR-0002.
+- Distance metadata MUST be sufficient to support distance-aware diagram layout as defined in ADR-0005.
+- Distributed fabric segments (e.g., NOC) MAY have distance_mm = 0 per ADR-0002 D4;
+  layout placement for such nodes uses explicit position metadata rather than distance buckets.
+
+### D3. Diagram generation is a derived artifact
+
+- Diagrams MUST be generated from:
+  - the compiled topology graph,
+  - extracted distance metadata,
+  - view/layout rules defined in ADR-0005.
+- Diagram generation MUST NOT require additional hand-written topology descriptions.
+
+### D4. Automatic diagram emission to the repository
+
+- As part of topology compilation, the implementation MUST produce the following diagrams by default:
+  - SIP-level diagram (representative, distance-aware)
+  - CUBE-level diagram (representative, distance-aware)
+  - PE-level diagram (representative, distance-aware)
+- The default output directory is:
+  - `docs/diagrams/`
+- The generator MUST overwrite/update only when the compiled topology (or diagram rules) changes.
+
+### D5. View-specific projection and layout
+
+For each view (SIP / CUBE / PE):
+
+- The generator MUST project the compiled graph into a reduced view graph:
+  - hide/collapse nodes according to ADR-0005,
+  - preserve connectivity semantics relevant to that view,
+  - compute distance buckets and assign layout layers deterministically.
+- CUBE-level projection MUST include:
+  - Router mesh (from cube_mesh.yaml), HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
+    and PEs as opaque blocks.
+  - All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0017).
+- Default anchors are implicit (ADR-0005) and MUST NOT require instance indices.
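
One way to picture the projection step is as an edge rewrite that folds hidden nodes into their opaque block. The sketch below assumes a plain edge list and a hypothetical `collapse` mapping; neither is taken from the compiler's real data model.

```python
def project_view(edges, collapse):
    """Project a compiled graph into a reduced view graph.

    edges: iterable of (src, dst) node ids from the compiled graph.
    collapse: hidden node id -> opaque block id (hypothetical mapping
    derived from the ADR-0005 hide/collapse rules).
    """
    projected = set()
    for src, dst in edges:
        view_src = collapse.get(src, src)
        view_dst = collapse.get(dst, dst)
        # Edges that end up inside one opaque block are dropped.
        if view_src != view_dst:
            projected.add((view_src, view_dst))
    # Sorted output keeps the projection deterministic.
    return sorted(projected)


edges = [
    ("router0", "pe0.pe_cpu"),
    ("pe0.pe_cpu", "pe0.scheduler"),
    ("pe0.dma", "router0"),
    ("m_cpu", "router0"),
]
collapse = {"pe0.pe_cpu": "PE[0]", "pe0.scheduler": "PE[0]", "pe0.dma": "PE[0]"}
view = project_view(edges, collapse)
print(view)
```

PE internals disappear from the CUBE view while the PE-to-router connectivity survives, which is the property D5 asks the projection to preserve.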
+
+### D6. Output formats and determinism
+
+- The generator MUST output at least one of:
+  - Mermaid (Markdown-native)
+  - Graphviz DOT (rank-based control)
+  - SVG (mm-accurate layout, no external dependencies)
+- SVG is preferred when mm-accurate position metadata is available from the compiled topology.
+- Output MUST be deterministic:
+  - same topology + same rules → identical diagram text
+- File naming MUST be deterministic and stable (see "Output Conventions").
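
Determinism mostly comes down to never depending on insertion order. A minimal sketch for the Mermaid case, assuming the view graph is available as an edge list (`to_mermaid` is a hypothetical helper, not an existing function):

```python
def to_mermaid(edges, direction="LR"):
    """Emit Mermaid flowchart text that depends only on the edge set."""
    lines = [f"flowchart {direction}"]
    # Deduplicate and sort so input ordering never changes the output.
    for src, dst in sorted(set(edges)):
        lines.append(f"    {src} --> {dst}")
    return "\n".join(lines) + "\n"


first = to_mermaid([("m_cpu", "router0"), ("router0", "hbm_ctrl")])
second = to_mermaid([("router0", "hbm_ctrl"), ("m_cpu", "router0"), ("m_cpu", "router0")])
print(first)
```

Both calls produce identical text, so a regenerated diagram only shows a diff when the topology or the rules actually changed.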
+
+### D7. Performance and caching
+
+- Diagram generation MAY be lazy and/or cached, as long as the outputs in `docs/diagrams/`
+  remain consistent with the compiled topology.
+- The implementation SHOULD use a cache key based on:
+  - topology content hash,
+  - routing policy version,
+  - diagram rules version,
+  - view type (SIP/CUBE/PE).
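
A cache key covering the four inputs above could be built like this. The function name, the argument types, and the choice of SHA-256 are assumptions for illustration; the ADR only lists what the key should depend on.

```python
import hashlib


def diagram_cache_key(topology_bytes, routing_policy_version, diagram_rules_version, view):
    """Combine the four cache inputs into one hex digest."""
    digest = hashlib.sha256()
    parts = (
        topology_bytes,
        routing_policy_version.encode(),
        diagram_rules_version.encode(),
        view.encode(),
    )
    for part in parts:
        # Length-prefix each part so ("ab", "c") and ("a", "bc") differ.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


key = diagram_cache_key(b"sips: 2\n", "routing-v1", "rules-v3", "CUBE")
print(key)
```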
+
+---
+
+## Output Conventions
+
+### Directory
+
+- `docs/diagrams/` is the canonical output directory for generated diagrams.
+
+### File names (recommended, deterministic)
+
+- `system_view.svg` / `system_view.mmd` / `system_view.dot`
+- `sip_view.svg` / `sip_view.mmd` / `sip_view.dot`
+- `cube_view.svg` / `cube_view.mmd` / `cube_view.dot`
+- `pe_view.svg` / `pe_view.mmd` / `pe_view.dot`
+
+Optionally, for multi-topology workflows:
+
+- `sip_view__{topology_id}.svg`
+- `cube_view__{topology_id}.svg`
+- `pe_view__{topology_id}.svg`
+
+### Repository policy
+
+- Generated diagram files MAY be committed to the repository to enable diff-based review.
+- If committed, they MUST be reproducible from topology compilation.
+
+---
+
+## Consequences
+
+- Diagrams are always consistent with simulator behavior.
+- Architectural changes automatically propagate to visualizations.
+- Diagram diffs become meaningful indicators of architectural change.
+
+---
+
+## Links
+
+- SPEC Section 4 (Output, Debuggability, and Diagrams)
+- ADR-0002 (Distance semantics)
+- ADR-0005 (Diagram views and layout rules)
@@ -0,0 +1,95 @@
+# ADR-0007: Runtime API and Simulation Engine Boundaries
+
+## Status
+
+Accepted
+
+## Context
+
+The simulator consists of multiple layers with distinct responsibilities:
+
+- a host-facing API layer used by benchmarks and user code,
+- a discrete-event simulation engine that executes requests,
+- device components that model hardware behavior.
+
+Without strict boundaries, orchestration logic can leak into components,
+or simulation internals can become entangled with user-facing APIs.
+
+This ADR defines clear responsibility boundaries between:
+
+- runtime API,
+- simulation engine (sim_engine),
+- hardware components.
+
+---
+
+## Decision
+
+### D1. Runtime API is host-facing orchestration only
+
+The runtime API represents host/driver-level behavior and MUST:
+
+- expose high-level operations (tensor deployment, kernel launch),
+- submit requests only to endpoint components (e.g., IO_CPU),
+- await completion via futures/handles,
+- own and persist host-side metadata (tensor allocation maps, kernel bindings).
+
+The runtime API MUST NOT:
+
+- hardcode hop-by-hop routing or fan-out,
+- directly invoke internal components (M_CPU, PE_CPU, engines),
+- embed topology- or routing-specific assumptions.
+
+---
+
+### D2. Simulation engine wires components and tracks completion
+
+The simulation engine (sim_engine) MUST:
+
+- wire components at initialization (create port stores + start wire
+  processes per the component port/wire framework — ADR-0015),
+- inject requests into the compiled topology graph at entry components
+  (e.g., PCIE_EP for memory operations, IO_CPU for kernel launch),
+- schedule and execute events using a discrete-event model,
+- manage correlation ids and completion tracking.
+
+The simulation engine MUST NOT:
+
+- define tensor semantics,
+- define kernel execution policies,
+- expose internal graph details to the runtime API,
+- walk the topology path during request execution,
+- call component `run()` methods directly,
+- track per-hop latency or decompose fan-out (components own this).
+
+---
+
+### D3. Components own fan-out and aggregation
+
+Device-side components MUST:
+
+- fan-out requests to downstream domains
+  (IO_CPU → M_CPU → PE_CPU → schedulers/engines),
+- aggregate completion and failure signals,
+- propagate results deterministically upstream.
+
+Neither the runtime API nor the simulation engine may orchestrate
+component-level fan-out explicitly.
+
+---
+
+## Consequences
+
+- Runtime APIs remain stable as topology and routing evolve.
+- Simulation internals can change without affecting user-facing code.
+- Component implementations remain swappable via dependency injection (DI).
+
+---
+
+## Links
+
+- SPEC R4, R7, R8
+- ADR-0008 (Tensor deployment)
+- ADR-0009 (Kernel execution)
+- ADR-0015 (Component port/wire model and engine role)
+- ADR-0010 (CLI surface and execution semantics — runtime API consumer)
@@ -0,0 +1,100 @@
+# ADR-0008: Tensor Deployment and Allocation (Host Allocator, PA-first)
+
+## Status
+
+Accepted
+
+## Context
+
+Benchmarks require PyTorch-like tensor semantics:
+
+- tensor creation (empty, fill),
+- deployment to accelerator devices (tensor.to()).
+
+In a realistic system, host software manages allocation/mapping and installs
+mappings for DMA/MMU. For Phase 0 we simplify (ADR-0011):
+
+- device memory operations use PA only,
+- VA/MMU/IOMMU is not modeled.
+
+To keep the host↔device interface minimal, we avoid a separate
+AllocateTensorMeta message. Instead, host allocation produces a PA shard map
+that is used directly by MemoryWrite/Read and KernelLaunch.
+
+---
+
+## Decision
+
+### D1. Tensor is a host-owned handle with PA shard mapping
+
+A Tensor object is a host-owned handle that encapsulates:
+
+- shape and dtype,
+- initialization intent,
+- device placement and allocation metadata as a PA shard map.
+
+After deployment, the Tensor handle MUST contain:
+
+- a list of shards, each with (sip,cube,pe,pa,nbytes,offset_bytes).
+
+This PA shard mapping is the single source of truth for kernel argument binding.
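
As a rough sketch, the shard tuple can be modeled as a small frozen dataclass. The field names follow the tuple above; the `Shard` class and the `covers_tensor` check are illustrative assumptions, and the check only applies to split placement (replicated shards deliberately overlap).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    sip: int
    cube: int
    pe: int
    pa: int            # physical address of the shard on its PE
    nbytes: int        # shard size in bytes
    offset_bytes: int  # shard position within the logical tensor


def covers_tensor(shards, total_nbytes):
    """Check that split shards tile [0, total_nbytes) without gaps or overlap."""
    cursor = 0
    for shard in sorted(shards, key=lambda s: s.offset_bytes):
        if shard.offset_bytes != cursor:
            return False
        cursor += shard.nbytes
    return cursor == total_nbytes


shards = [
    Shard(sip=0, cube=0, pe=1, pa=0x2000, nbytes=512, offset_bytes=512),
    Shard(sip=0, cube=0, pe=0, pa=0x1000, nbytes=512, offset_bytes=0),
]
print(covers_tensor(shards, 1024))
```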
+
+---
+
+### D2. Deployment uses a host allocator (Phase 0)
+
+In Phase 0, tensor deployment produces PA shard mappings via a host allocator:
+
+- placement (split/replicate/hybrid) is decided by a DP policy,
+- allocation assigns PA ranges at the PE level and returns shard mappings,
+- the Tensor handle stores the resulting shard list deterministically.
+
+No separate host-visible device allocation RPC is required in Phase 0.
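
The split case can be illustrated with a toy allocator. Everything here is an assumption for the sketch: `allocate_split`, the even-split policy, and the per-PE `next_free_pa` bump pointer stand in for whatever the real DP policy and host allocator do.

```python
def allocate_split(total_nbytes, pes, next_free_pa):
    """Split a tensor evenly across PEs and assign a PA range on each.

    pes: ordered list of (sip, cube, pe) targets.
    next_free_pa: (sip, cube, pe) -> next unused PA; updated in place.
    Returns shard tuples (sip, cube, pe, pa, nbytes, offset_bytes).
    """
    base, remainder = divmod(total_nbytes, len(pes))
    shards = []
    offset = 0
    for index, target in enumerate(pes):
        # The first `remainder` PEs take one extra byte each.
        nbytes = base + (1 if index < remainder else 0)
        pa = next_free_pa[target]
        next_free_pa[target] = pa + nbytes
        shards.append((*target, pa, nbytes, offset))
        offset += nbytes
    return shards


pes = [(0, 0, 0), (0, 0, 1), (0, 0, 2)]
next_free = {target: 0x1000 for target in pes}
shards = allocate_split(10, pes, next_free)
print(shards)
```

Iterating the PE list in a fixed order is what makes the resulting shard list deterministic for a given request sequence.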
+
+---
+
+### D3. Data initialization and transfer use MemoryWrite/Read only
+
+Any data initialization or transfer implied by a tensor (e.g., fill, copy)
+MUST be represented using Host ↔ IO_CPU messages only:
+
+- MemoryWrite
+- MemoryRead
+
+Rules:
+
+- MemoryWrite/Read MUST reference PA + (sip,cube,pe) tags (ADR-0012).
+- Allocation metadata MUST NOT be embedded as a separate allocation message.
+- Bulk tensor data MUST NOT be embedded in Phase 0 messages.
+
+The simulation engine schedules MemoryWrite/Read through the graph so that
+latency is computed by explicit traversal.
+
+---
+
+### D4. Extension path (non-breaking)
+
+Future ADRs MAY introduce optional VA/MMU/IOMMU modeling by adding:
+
+- virtual addressing in tensor handles,
+- mapping install steps,
+- translation latency/page granularity.
+
+The Phase 0 PA shard map remains a valid fast-path configuration.
+
+---
+
+## Consequences
+
+- Host↔IO_CPU contract remains minimal (MemoryRead/Write + KernelLaunch).
+- KernelLaunch can pass per-PE data placement explicitly via shard tags.
+- Early implementation stays simple and testable.
+
+---
+
+## Links
+
+- ADR-0011 (Memory Addressing — PA / VA / LA)
+- ADR-0012 (Host↔IO_CPU schema)
+- ADR-0007 (runtime_api vs sim_engine boundaries)
+- ADR-0009 (Kernel execution)
@@ -0,0 +1,146 @@
+# ADR-0009: Kernel Execution Messaging and Completion Semantics
+
+## Status
+
+Accepted
+
+## Context
+
+Kernel execution is initiated by the host and proceeds through
+device control components:
+
+Host → IO_CPU → M_CPU → PE_CPU → schedulers → engines
+
+Completion propagates in reverse order.
+
+To keep benchmarks simple and topology-agnostic,
+kernel execution must be endpoint-driven with deterministic aggregation.
+
+---
+
+## Decision
+
+### D1. Kernel launch is an endpoint request
+
+A kernel launch is initiated by submitting a single KernelLaunch request
+to the IO_CPU endpoint.
+
+The runtime API MUST:
+
+- construct the kernel launch request,
+- submit it to IO_CPU,
+- await a single completion result.
+
+The runtime API MUST NOT orchestrate internal fan-out.
+
+---
+
+### D2. Tensor arguments are passed by metadata
+
+KernelLaunch requests MUST reference tensor arguments via:
+
+- host-owned tensor handles, or
+- resolved device address maps derived from those handles.
+
+Bulk tensor data MUST NOT be embedded in kernel launch messages.
+
+---
+
+### D3. Fan-out and aggregation are component responsibilities
+
+- IO_CPU fans out work to M_CPUs.
+- M_CPU fans out work to PE_CPUs.
+- PE_CPU manages kernel execution and engine dispatch.
+
+Completion semantics:
+
+- M_CPU completes when all targeted PEs complete or a failure policy triggers.
+- IO_CPU completes when all targeted CUBEs complete or a failure policy triggers.
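
The two-level aggregation can be sketched without any simulation machinery. The failure policy is not specified in this ADR, so the example assumes the simplest one (any failed child fails the parent); the function names and the `"ok"` / `"failed"` strings are hypothetical.

```python
def aggregate(child_results):
    """Parent result: 'ok' only if every child reported 'ok'."""
    return "ok" if all(result == "ok" for result in child_results) else "failed"


def io_cpu_completion(pe_results_by_cube):
    """Aggregate PE results per cube (M_CPU level), then across cubes (IO_CPU level)."""
    cube_results = [
        aggregate(pe_results)
        for _, pe_results in sorted(pe_results_by_cube.items())
    ]
    return aggregate(cube_results)


print(io_cpu_completion({0: ["ok", "ok"], 1: ["ok", "failed"]}))
```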
+
+---
+
+### D4. Completion and failure propagation
+
+- All messages MUST carry correlation identifiers.
+- Completion and failure MUST propagate deterministically to the host.
+- The simulation engine provides futures/handles to observe completion.
+
+---
+
+### D5. Launch timing is endpoint-synchronized
+
+All PEs targeted by a single kernel launch MUST begin executing the kernel
+body at the same simulated time, regardless of their dispatch path length
+from the launch entry point.
+
+Rationale. The dispatch tree Host → IO_CPU → M_CPU → PE_CPU has variable
+latency at every level. PEs near their M_CPU receive the launch earlier
+than PEs farther away; cubes near an IO_CPU receive it earlier than cubes
+farther away. Without synchronization, each PE's kernel begins at a
+different `env.now`, making per-PE metrics such as `pe_exec_ns` a function
+of dispatch-path geometry rather than of the kernel's behavior —
+producing measurement artifacts in benchmarks that time kernel-internal
+waits (for example `tl.recv` on cross-cube or cross-SIP hops).
+
+Mechanism.
+
+- `KernelLaunchMsg` carries an optional `target_start_ns: float | None`.
+- **IO_CPU** is the canonical stamper. On fan-out to M_CPUs, it
+  computes `target_start_ns = env.now + max_latency` where
+  `max_latency` is the maximum, over every target (sip, cube, pe)
+  tuple, of the **two-leg dispatch chain**:
+
+  ```
+  dispatch_latency(sip, cube, pe) =
+      compute_path_latency_ns(find_node_path(io_cpu, m_cpu(sip, cube)))
+    + compute_path_latency_ns(find_node_path(m_cpu(sip, cube), pe_cpu))
+    - io_cpu.overhead_ns
+    - m_cpu.overhead_ns
+  ```
+
+  This models the actual dispatch as **two sequential Transactions**
+  (IO_CPU → M_CPU, then M_CPU → PE_CPU). Each leg's
+  `compute_path_latency_ns` adds its endpoints' `overhead_ns`;
+  `io_cpu.overhead_ns` is subtracted because IO_CPU has already
+  paid it before this method runs, and `m_cpu.overhead_ns` is
+  subtracted once because it appears as endpoint of leg1 *and*
+  start of leg2 but is paid only once at run time. A single
+  `find_node_path(io_cpu, pe_cpu)` walk is **not** equivalent —
+  it can pick a graph path that bypasses M_CPU and silently
+  under-shoots the prediction for far cubes, breaking the D5
+  invariant.
+
+  The fanned-out sub-Transactions carry **`nbytes = 0`** for
+  `KernelLaunchMsg` (control message only). Without this,
+  large kernel-launch payloads would occupy fabric BW on the
+  shared first hop and serialize the per-cube dispatch, pushing
+  far M_CPUs past `target_start_ns` and re-introducing the
+  late-arrival violation.
+- **M_CPU** passes an already-stamped `target_start_ns` through
+  unchanged. Only when the value is absent (e.g. a direct
+  launch-to-M_CPU unit test) does M_CPU compute a per-cube barrier
+  `env.now + max(local command-path latency)`.
+- **PE_CPU** yields `env.timeout(target_start_ns - env.now)` at the top
+  of `_execute_kernel`, before recording `pe_exec_start` and invoking
+  the kernel body.
+- When `target_start_ns is None`, PE_CPU falls through to the legacy
+  unsynchronized behavior — preserving backward compatibility.
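
The stamping arithmetic can be checked with a small worked example. The sketch assumes the two leg latencies have already been computed per target (in the simulator they come from `compute_path_latency_ns`); the helper names, the dictionaries, and all numbers are made up for illustration, and clamping the wait at zero is an addition of this sketch rather than something the ADR states.

```python
def stamp_target_start_ns(now_ns, targets, leg1_ns, leg2_ns,
                          io_cpu_overhead_ns, m_cpu_overhead_ns):
    """Barrier time = now + worst two-leg dispatch latency over all targets.

    leg1_ns: (sip, cube) -> IO_CPU to M_CPU path latency.
    leg2_ns: (sip, cube, pe) -> M_CPU to PE_CPU path latency.
    """
    worst = max(
        leg1_ns[(sip, cube)] + leg2_ns[(sip, cube, pe)]
        - io_cpu_overhead_ns - m_cpu_overhead_ns
        for sip, cube, pe in targets
    )
    return now_ns + worst


def pe_wait_ns(target_start_ns, now_ns):
    """Delay a PE waits before starting the kernel body (never negative here)."""
    return max(0.0, target_start_ns - now_ns)


targets = [(0, 0, 0), (0, 1, 0)]
leg1 = {(0, 0): 30.0, (0, 1): 80.0}
leg2 = {(0, 0, 0): 20.0, (0, 1, 0): 25.0}
barrier = stamp_target_start_ns(100.0, targets, leg1, leg2, 5.0, 10.0)
print(barrier)                      # near cube: 35 ns, far cube: 90 ns
print(pe_wait_ns(barrier, 135.0))   # near PE arrives early and waits
print(pe_wait_ns(barrier, 190.0))   # far PE arrives exactly on time
```

With these numbers the near PE idles for 55 ns so that both PEs record the same kernel start time.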
+
+IO_CPU-level stamping guarantees every PE across every targeted cube
+uses the same barrier sim-time, eliminating both the within-cube
+dispatch-offset artifact *and* the cross-cube offset artifact in
+multi-cube launches. This models a real-hardware timed-broadcast launch
+(latency-equalized dispatch tree).
+
+The synchronization is internal to the engine / IO_CPU / M_CPU / PE_CPU
+control plane — runtime API and application kernels are unchanged.
+
+---
+
+## Links
+
+- SPEC R1, R2, R7, R8
+- ADR-0007 (Runtime API boundaries)
+- ADR-0008 (Tensor deployment)
+- ADR-0013 (Verification strategy — V2 fan-out tests)
+- ADR-0015 D4 (concrete fabric path for kernel launch)
@@ -0,0 +1,131 @@
+# ADR-0010: Command Line Interface and Execution Semantics
+
+## Status
+
+Accepted
+
+## Context
+
+The `kernbench` CLI is the user-facing entry point of the simulator. It
+exposes three subcommands:
+
+- `run` — execute a benchmark against a topology.
+- `probe` — diagnostic utility for latency / BW measurement.
+- `web` — interactive topology viewer.
+
+Device enumeration is centralized in the CLI; neither the runtime API
+nor the simulation engine enumerates devices. Benchmarks remain
+single-device by design and accept a device identifier as input.
+
+## Decision
+
+### D1. Benchmark contract — single-device by design
+
+- A benchmark MUST define behavior for a single device only.
+- A benchmark MUST accept a device identifier as input.
+- Benchmarks MUST NOT enumerate or loop over multiple devices.
+
+Multi-device execution is the CLI's concern (D3), not the benchmark's.
+
+### D2. `kernbench run` — benchmark execution
+
+Required arguments:
+
+- `--topology <path>`: topology YAML file path. Loaded via
+  `resolve_topology()`.
+- `--bench <name>`: benchmark name. Resolved via
+  `benches.loader.resolve_bench()`.
+
+Optional arguments:
+
+- `--device <selector>` (default: `all`):
+  - `all` — run once per discovered SIP (see D3).
+  - `sip:<N>` — run only on SIP N.
+  - Parsed via `resolve_device()`.
+- `--verify-data` (default: off) — enable Phase 2 data verification
+  (see ADR-0020). When set, `engine_factory` constructs the engine
+  with `enable_data=True`. After the benchmark runs, a diagnostic
+  summary of recorded ops is printed.
+
+Each invocation runs the benchmark once within a single simulation
+instance.
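
For orientation, the `run` argument surface could be declared with `argparse` roughly as below. This is a sketch of the documented flags only: `build_parser` and `parse_device` are hypothetical stand-ins, not the project's `resolve_device()`, and the real CLI may be wired differently.

```python
import argparse


def parse_device(selector):
    """Return None for 'all', or the SIP index for 'sip:<N>'."""
    if selector == "all":
        return None
    kind, sep, index = selector.partition(":")
    if kind == "sip" and sep and index.isdecimal():
        return int(index)
    raise ValueError(f"invalid device selector: {selector!r}")


def build_parser():
    parser = argparse.ArgumentParser(prog="kernbench")
    subcommands = parser.add_subparsers(dest="command", required=True)
    run = subcommands.add_parser("run", help="execute a benchmark")
    run.add_argument("--topology", required=True, help="topology YAML path")
    run.add_argument("--bench", required=True, help="benchmark name")
    run.add_argument("--device", default="all", help="'all' or 'sip:<N>'")
    run.add_argument("--verify-data", action="store_true")
    return parser


args = build_parser().parse_args(["run", "--topology", "topo.yaml", "--bench", "gemm"])
print(args.device, args.verify_data, parse_device("sip:3"))
```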
+
+### D3. Multi-device execution is logically parallel
+
+When `--device all` (or omitted) and the topology has multiple SIPs:
+
+- Benchmark executions are submitted to a single simulation engine
+  instance.
+- Executions are logically parallel in simulation time.
+- Inter-device contention is naturally modeled (shared fabric
+  bandwidth, cross-SIP traffic, etc.).
+
+The CLI does NOT spawn multiple OS processes or independent
+simulation runs — parallelism is internal to one simulation instance.
+
+### D4. `kernbench probe` — latency / BW diagnostic utility
+
+Required argument:
+
+- `--topology <path>`: topology YAML file path.
+
+Optional argument:
+
+- `--case <name>` (default: `all`) — run a predefined traffic
+  pattern, or `all` to run every defined case.
+
+Probe runs each pattern through the simulation engine and reports
+per case:
+
+- End-to-end latency (ns).
+- Effective bandwidth (nbytes / total_ns).
+- Bottleneck bandwidth (min edge BW along the chosen path).
+- Utilization (effective / bottleneck).
+
+Probe additionally validates monotonicity invariants — for example
+that local-HBM access ≤ cross-PE-within-cube ≤ cross-cube ≤
+cross-SIP — and reports violations. Probe is a developer tool for
+verifying the latency / BW model; it is not a benchmark.
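
The per-case figures and the monotonicity check reduce to a few lines of arithmetic. The sketch below uses hypothetical helper names and sample numbers; it relies on the fact that bytes per nanosecond is numerically equal to GB/s (10^9 bytes per second).

```python
def probe_metrics(nbytes, total_ns, edge_bw_gbs):
    """Return (effective GB/s, bottleneck GB/s, utilization) for one case."""
    effective = nbytes / total_ns      # bytes/ns == GB/s
    bottleneck = min(edge_bw_gbs)      # slowest edge on the chosen path
    return effective, bottleneck, effective / bottleneck


def monotonic_violations(latency_ns, order):
    """List adjacent pairs where a nearer case is slower than a farther one."""
    return [
        (near, far)
        for near, far in zip(order, order[1:])
        if latency_ns[near] > latency_ns[far]
    ]


metrics = probe_metrics(nbytes=4096, total_ns=256.0, edge_bw_gbs=[32.0, 64.0])
order = ["local_hbm", "cross_pe", "cross_cube", "cross_sip"]
latency = {"local_hbm": 100.0, "cross_pe": 150.0, "cross_cube": 140.0, "cross_sip": 400.0}
violations = monotonic_violations(latency, order)
print(metrics, violations)
```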
+
+### D5. `kernbench web` — topology viewer
+
+Optional arguments:
+
+- `--port <N>` (default: `8765`) — HTTP port.
+- `--no-open` — do not auto-open the browser.
+
+Launches a local HTTP server that renders the compiled topology in
+the browser. Distinct from the static `docs/diagrams/` artifacts:
+
+- `docs/diagrams/` files are derived at topology-compile time
+  (ADR-0006).
+- `kernbench web` is interactive — pan/zoom, hover for component
+  attributes, switch between SIP / CUBE / PE views.
+
+### D6. Runtime API and simulation engine remain device-scoped
+
+- Runtime API calls operate on one device per invocation.
+- The simulation engine schedules all requests deterministically.
+- Neither layer enumerates devices.
+
+This invariant keeps each layer testable in isolation; device
+enumeration and multi-device fan-out live only in the CLI's `run`
+command (D3).
+
+## Consequences
+
+- Benchmark authors write single-device logic; multi-device behavior
+  emerges from the CLI dispatching across SIPs.
+- Adding a new subcommand (e.g., trace export, replay) does not
+  require benchmark or runtime-API changes — the CLI is the
+  extension point.
+- `probe` and `web` are diagnostic / visualization tools, not
+  benchmarks; they bypass the benchmark loader path.
+
+## Links
+
+- SPEC R7, R8, R9
+- ADR-0007 (Runtime API and Simulation Engine Boundaries)
+- ADR-0020 (Two-pass data execution — `--verify-data`)
+- ADR-0006 (Topology compilation and diagram generation —
+  background for `kernbench web`)
@@ -0,0 +1,521 @@
+# ADR-0011: Memory Addressing — PA / VA / LA Address Models
+
+## Status
+
+Accepted.
+
+- **VA model: currently implemented (default).**
+- PA model: implemented as PageFault fallback in PE_DMA.
+- LA model: proposed, not implemented.
+
+## Context
+
+KernBench's address model evolved through three design points, each
+addressing a limitation of the previous. This ADR documents all three
+in one place because future implementation work selects among them.
+
+### PA-only baseline
+
+Phase 0 of KernBench treated all device memory operations
+(MemoryRead/MemoryWrite) as raw physical-address transfers. No
+host-side virtual addressing, no MMU/IOMMU translation. Allocators
+returned PA mappings; DMA requests carried PA directly.
+
+This was sufficient for early correctness/latency work but
+insufficient for running standard Triton kernels that use
+`base_addr + offset` patterns on sharded tensors: each PE's shard
+has a different PA, but the kernel needs a single contiguous address
+space to compute offsets.
+
+### Why VA/MMU (current default)
+
+A realistic system uses host-side virtual addressing and an
+MMU/IOMMU-style translation path for DMA: the host allocates physical
+memory at PE level, maps it into a virtual address space, installs
+mappings, and DMA requests use virtual addresses that are translated
+to physical addresses.
+
+Adopting this model lets kernels use `base_addr + offset` over a
+contiguous VA range while the device-side MMU translates each access
+to the appropriate PA.
+
+### Why LA/BAAW (proposed)
+
+VA/MMU treats HBM as a single backing space. KernBench needs to
+explore architectures where HBM is composed of multiple pseudo
+channels in parallel:
+
+- CUBE's HBM has 32 or 64 pseudo channels.
+- In a PE-Local-HBM model, each PE is assigned N pseudo channels
+  (N = `hbm_pseudo_channels / pes_per_cube`).
+- Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW
+  (N × per-channel).
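
The channel arithmetic is easy to check in isolation. The helper below is a hypothetical sketch; the example values (64 pseudo channels, 32 GB/s per channel) are the ones this ADR uses elsewhere, while 8 PEs per cube is an assumption chosen so that each PE gets 8 channels.

```python
def pe_channel_bandwidth(hbm_pseudo_channels, pes_per_cube, channel_bw_gbs):
    """Return (channels per PE, aggregate PE bandwidth in GB/s)."""
    if hbm_pseudo_channels % pes_per_cube != 0:
        raise ValueError("pseudo channels must divide evenly across PEs")
    channels_per_pe = hbm_pseudo_channels // pes_per_cube
    return channels_per_pe, channels_per_pe * channel_bw_gbs


result = pe_channel_bandwidth(64, 8, 32.0)
print(result)
```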
+
+Two channel-mapping modes need to be modelable:
+
+- **1:1 mode** — one logical access → N per-channel requests.
+  Precise per-channel BW contention modelling.
+- **n:1 mode (default)** — one logical access → one aggregated
+  request. Channels are assumed to interleave; aggregated BW model.
+
+VA's `tl.load(va_ptr)` produces a single DMA request to a single
+target. Decomposing that into per-channel requests inside PE_DMA
+requires the address layer to be aware of channels. This is the
+role of the LA (Logical Address) abstraction with BAAW
+(Logical-to-Physical Mapping Unit).
+
+Core requirements driving the LA design:
+
+- PE_DMA → HBM_CTRL effective bandwidth semantics must be identical
+  in both modes (only request shape and resource model differ).
+- Kernel programming model is unchanged — physical channel
+  information is never exposed to kernel code.
+- Mode switch is a topology-level configuration.
+
+### Design space summary
+
+| Model | Status | Key idea |
+|-------|--------|----------|
+| PA | fallback (implemented) | Direct physical addressing, no translation |
+| VA | current default (implemented) | Per-tensor contiguous VA range; MMU translates per access |
+| LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes |
+
+---
+
+## Decision
+
+This ADR defines three address models. At any given time the system
+operates in exactly one model. Selection is topology- / configuration-
+driven; coexistence within one simulation run is not required.
+
+---
+
+### Address Model: PA (Physical Address) — fallback
+
+#### D-PA1. PA-only semantics
+
+- All device memory accesses (MemoryRead/MemoryWrite) operate on
+  device physical addresses (PA) plus size.
+- PA-only mode remains functional via the PageFault fallback path in
+  PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats
+  the value as a PA directly.
+
+#### D-PA2. Allocation produces PA mappings
+
+Device allocation selects PE-local memory regions and returns PA
+mappings sufficient to execute kernels and issue DMA requests.
+
+The PA model is retained primarily for backward compatibility with PA-only
+tests and as the underlying physical layer that the VA and LA models resolve
+into.
+
+---
+
+### Address Model: VA (Virtual Address with MMU) — current default
+
+#### D-VA1. Virtual Address Model
+
+- Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
+- `TensorShard` does NOT carry a `va` field — shard VA is derived as
+  `va_base + offset_bytes`.
+- Kernels receive `va_base` as their pointer argument (via
+  `TensorArg.va_base`).
+- `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).
+
+#### D-VA2. PE_MMU Component
+
+- Hybrid design: SimPy component (inbox for `MmuMapMsg`) + utility
+  (synchronous `translate()` called by PE_DMA).
+- Page-aligned dict lookup for O(1) VA → PA translation.
+- `tlb_overhead_ns` configurable per-access latency.
+- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA
+  directly (preserves PA model for backward compatibility).
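
The lookup and its fallback can be sketched as a page-keyed dictionary. `PeMmuSketch` is a hypothetical stand-in written for this example, not the project's `PeMMU` class; it omits the SimPy inbox and the `tlb_overhead_ns` latency.

```python
class PeMmuSketch:
    """Page-aligned VA -> PA table with the PageFault-as-PA fallback."""

    def __init__(self, page_size=4096):
        self.page_size = page_size
        self._pages = {}  # VA page base -> PA page base

    def map(self, va, pa, nbytes):
        if va % self.page_size or pa % self.page_size:
            raise ValueError("va and pa must be page-aligned")
        for offset in range(0, nbytes, self.page_size):
            self._pages[va + offset] = pa + offset

    def translate(self, addr):
        page = addr - (addr % self.page_size)
        pa_page = self._pages.get(page)
        if pa_page is None:
            # No mapping: the caller treats the address as a PA directly.
            return addr
        return pa_page + (addr - page)


mmu = PeMmuSketch()
mmu.map(va=0x1_0000_0000, pa=0x2000, nbytes=8192)
print(hex(mmu.translate(0x1_0000_0000 + 4100)))  # second mapped page
print(hex(mmu.translate(0x5000)))                # unmapped, returned as-is
```

Each translation is a single dictionary lookup, which is where the O(1) cost comes from.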
+
+#### D-VA3. Mapping Installation
+
+- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube
+  fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured
+  end-to-end.
+- `MmuMapMsg.target_sips` controls SIP-level routing to prevent
+  cross-SIP mapping contamination for replicated tensors.
+- Mapping strategy based on `DPPolicy.cube`:
+  - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping
+    only. Each cube's PEs see only their local PA. No cross-cube
+    mapping installed.
+  - **Sharded** (`cube="column_wise"`, etc.): broadcast all shard
+    mappings to all target cubes. Enables cross-PE and cross-cube
+    DMA.
+
+#### D-VA4. Tensor Lifecycle
+
+- `del tensor` triggers automatic cleanup via `Tensor.__del__` +
+  `weakref` to `RuntimeContext`. Cleanup sends `MmuUnmapMsg` through the
+  fabric and returns the VA and PA space.
+- `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
+- `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
+- `PEMemAllocator` uses a free list with coalescing (not a bump allocator).
+- `VirtualAllocator` uses a free list with coalescing for VA space.
+
+#### D-VA5. Allocators
+
+- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free
+  with coalescing.
+- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with
+  coalescing.
+- Page size configurable via `topology.yaml` `pe_mmu` attrs
+  (default 4096).
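
A free-list allocator with coalescing, as both allocators are described, can be sketched in a few dozen lines. `FreeListAllocator` is a hypothetical illustration using first-fit placement and assuming an aligned base address; the real `VirtualAllocator` and `PEMemAllocator` may differ in policy and interface.

```python
class FreeListAllocator:
    """First-fit allocator over one address range, page-aligned, with coalescing."""

    def __init__(self, base, size, align=4096):
        self.align = align
        self.free_blocks = [(base, size)]  # sorted (start, size) pairs

    def _round_up(self, nbytes):
        return -(-nbytes // self.align) * self.align

    def alloc(self, nbytes):
        need = self._round_up(nbytes)
        for index, (start, size) in enumerate(self.free_blocks):
            if size >= need:
                if size == need:
                    del self.free_blocks[index]
                else:
                    self.free_blocks[index] = (start + need, size - need)
                return start
        raise MemoryError("address space exhausted")

    def free(self, start, nbytes):
        self.free_blocks.append((start, self._round_up(nbytes)))
        self.free_blocks.sort()
        merged = [self.free_blocks[0]]
        for block_start, block_size in self.free_blocks[1:]:
            prev_start, prev_size = merged[-1]
            if prev_start + prev_size == block_start:
                # Adjacent free blocks are coalesced into one.
                merged[-1] = (prev_start, prev_size + block_size)
            else:
                merged.append((block_start, block_size))
        self.free_blocks = merged


allocator = FreeListAllocator(base=0, size=16384)
a = allocator.alloc(100)    # rounds up to one page
b = allocator.alloc(5000)   # rounds up to two pages
c = allocator.alloc(1)
allocator.free(a, 100)
allocator.free(c, 1)
allocator.free(b, 5000)     # freeing the middle block merges all three
print(allocator.free_blocks)
```

Unlike a bump allocator, the range becomes fully reusable again once every block has been freed, regardless of the order in which tensors are released.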
+
+#### Consequences (VA model)
+
+- Triton kernels use `base_addr + offset` patterns naturally on
+  sharded tensors.
+- All latency remains explicit via graph traversal, including MMU
+  mapping installation and per-access TLB overhead.
+- PA-only mode retained as fallback (PageFault → treat as PA).
+- IPCQ and other fixed-address resources bypass MMU (use PA directly).
+
+---
+
+### Address Model: LA (Logical Address with BAAW) — proposed
+
+LA replaces VA when channel-level HBM modelling is required.
+Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the
+removed artifacts). Coexistence with VA in the same run is not a goal.
+
+#### D-LA1. LA introduction — replaces VA infrastructure
+
+LA is the sole address space used by kernel code (`tl.load`,
+`tl.store`, `tl.composite`). Properties:
+
+- Can map a Tensor to a contiguous logical space (like VA).
+- Expresses `(logical buffer + offset)`.
+- Does NOT contain physical channel information directly.
+- Stays as an intermediate abstraction until physical resolution.
+
+LA address space:
+
+| Item | Value |
+|------|-------|
+| LA start | `0x1_0000_0000` (4 GB, preserves former VA start) |
+| LA space size | 64 GB per PE |
+| Alignment unit | segment (see D-LA3) |
+
+LA is PE-local: different PEs may use the same LA value; BAAW segment
+tables differ → they resolve to different PAs.
+
+VA infrastructure removed when LA is adopted:
+
+| Removed | Replacement |
+|---------|-------------|
+| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed) |
+| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
+| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
+| `runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg` | `BaawSegmentInstallMsg` |
+| `runtime_api/context.py`: VA alloc + MMU install | LA alloc + BAAW segment install |
+| `runtime_api/tensor.py`: `va_base` | `la_base` |
+| `topology.yaml`: `pe_mmu` component entry | Removed |
+
+#### D-LA2. Mapping mode setting
+
+Topology-level (cube) configuration:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
+    hbm_pseudo_channels: 64       # total pseudo channel count
+    hbm_channels_per_pe: 8        # per-PE local channel count
+    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth
+```
+
+Consumed by the graph compiler (topology builder) and BAAW
+initialisation.
+
+#### D-LA3. Segment and BAAW
+
+A segment partitions the LA space; each segment maps to a specific HBM
+channel or channel group. Segments are created at tensor deploy time by
+the runtime allocator. BAAW resolves LA → physical request(s) using the
+segment table.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class BaawSegment:
+    la_base: int          # segment start LA
+    la_size: int          # segment size (bytes)
+    mode: str             # "one_to_one" | "n_to_one"
+    # 1:1 mode fields
+    channel_count: int    # channels assigned to this segment (e.g. 8)
+    pa_bases: list[int]   # per-channel PA bases (len = channel_count)
+    channel_ids: list[int]   # per-channel logical IDs (e.g. [0..7])
+    channel_size: int     # per-channel size (la_size // channel_count)
+    # n:1 mode fields
+    agg_pa_base: int      # aggregated PA base
+    agg_node_id: str      # aggregated router node_id
+```
+
+Segment lifecycle:
+
+1. **Allocate** (tensor deploy): RuntimeContext allocates LA from LA
+   allocator. PEMemAllocator allocates per-channel PA (1:1) or
+   aggregated PA (n:1). `BaawSegmentInstallMsg` registers the segment
+   with PE_DMA.
+2. **Use** (kernel run): kernel `tl.load(la_ptr)` → `DmaReadCmd
+   (src_addr=LA)`. PE_DMA's BAAW front-end looks up the segment and
+   converts to PA(s).
+3. **Free** (tensor free): segment removed from table; LA and PA
+   returned.
+
+#### D-LA4. BAAW resolution logic
+
+BAAW is a front-end stage inside PE_DMA, not a separate SimPy
+component. It is synchronous address-resolution logic executed at the
+start of PE_DMA's `handle_command()`.
+
+Input: `(LA, nbytes)`. Output:
+
+- **1:1 mode**: `list[PhysicalRequest]` — one per channel.
+- **n:1 mode**: single `PhysicalRequest`.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class PhysicalRequest:
+    pa: int           # 51-bit Physical Address
+    nbytes: int       # transfer size for this request
+    dst_node: str     # target node_id (channel router or aggregated router)
+
+
+def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
+    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
+    offset = la - seg.la_base
+
+    if seg.mode == "n_to_one":
+        pa = seg.agg_pa_base + offset
+        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
+
+    # one_to_one: fan out to every channel in the segment
+    requests = []
+    ch_offset = offset % seg.channel_size      # same offset in every channel
+    ch_nbytes = nbytes // seg.channel_count    # floor division; remainder is dropped
+    for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids):
+        pa = pa_base + ch_offset
+        dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
+        requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
+    return requests
+```
+
+BAAW responsibilities:
+
+- Convert logical access → physical request units.
+- Apply mode-dependent fan-out (1:1) or pass-through (n:1).
+- Compute PA and target node.
+
+BAAW non-responsibilities:
+
+- Performing actual data movement.
+- Executing NOC routing.
+- Simulating bandwidth occupation (downstream components' job).
+
+BAAW output is directly usable by the simulator's routing and resource
+model without additional address decoding.
+
+#### D-LA5. PE_DMA `handle_command()` change
+
+Current (VA-based) flow:
+
+```
+DmaReadCmd.src_addr (VA)
+  → MMU.translate(VA) → PA
+  → PhysAddr.decode(PA) → PhysAddr object
+  → resolver.resolve(PhysAddr) → dst_node_id
+  → router.find_path(pe_prefix, dst_node_id) → path
+  → 1 sub-Transaction → fabric inject
+```
+
+LA-based flow:
+
+```
+DmaReadCmd.src_addr (LA)
+  → BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
+  → for each PhysicalRequest:
+      → router.find_path(pe_prefix, req.dst_node) → path
+      → compute_drain_ns(path, req.nbytes) → drain
+      → sub-Transaction → fabric inject
+  → await all sub-Transactions
+  → pe_txn.done.succeed()
+```
+
+Key changes:
+
+- MMU reference removed → BAAW resolve.
+- `PhysAddr.decode()` + `resolver.resolve()` → BAAW returns `dst_node`
+  directly.
+- 1 request → N parallel requests in 1:1 mode.
+
+#### D-LA6. 1:1 mode detail
+
+- One logical access → N physical requests (N = `channels_per_pe`).
+- N = `hbm_pseudo_channels / pes_per_cube`.
+- Each request: fully-resolved 51-bit PA, targets a specific channel
+  router (`{pe_prefix}.ch_r{channel_id}`).
+- Per-channel link models BW contention.
+- PE_DMA injects N sub-transactions concurrently.
+
+Example: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`.
+PE0 owns ch0-7.
+
+```text
+Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "one_to_one", channel_count: 8,
+    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
+    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
+    channel_size: 512,
+}
+
+BAAW resolve result (8 requests):
+  → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
+  → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
+  → ...
+  → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
+
+PE_DMA: 8 sub-transactions parallel inject
+  per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
+  Total effective BW = 8 × channel_bw_gbs
+```
+
+Other N values:
+
+- `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`,
+  4 requests
+- `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`,
+  16 requests
+
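The arithmetic behind these N values can be checked with a short sketch. It assumes channels divide evenly among PEs and that the access size divides evenly among channels; the ADR does not say how remainders are handled, so the helper names and the error on uneven division are assumptions for illustration only.

```python
def channels_per_pe(hbm_pseudo_channels: int, pes_per_cube: int) -> int:
    # Assumption: channels divide evenly among PEs.
    if hbm_pseudo_channels % pes_per_cube:
        raise ValueError("channels must divide evenly across PEs")
    return hbm_pseudo_channels // pes_per_cube


def fan_out(nbytes: int, pe_prefix: str, channel_ids: list) -> list:
    # One (dst_node, nbytes) pair per channel, using the ADR's node naming.
    per_channel = nbytes // len(channel_ids)
    return [(f"{pe_prefix}.ch_r{ch}", per_channel) for ch in channel_ids]


n = channels_per_pe(hbm_pseudo_channels=64, pes_per_cube=8)
requests = fan_out(4096, "sip0.cube0.pe0", list(range(n)))
print(n, requests[0], requests[-1])
# 8 ('sip0.cube0.pe0.ch_r0', 512) ('sip0.cube0.pe0.ch_r7', 512)
```
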
+#### D-LA7. n:1 mode detail
+
+- One logical access → one aggregated request.
+- Target: aggregated router → hbm_ctrl (see ADR-0017 D8).
+- Aggregated link BW = `channels_per_pe × channel_bw_gbs`
+  (e.g. 8 × 32 = 256 GB/s).
+- Single queue / resource for modelling.
+- No per-channel PA decomposition.
+
+```text
+Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "n_to_one",
+    agg_pa_base: PA_agg,
+    agg_node_id: "sip0.cube0.pe0.agg_router",
+}
+
+BAAW resolve result:
+  → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
+
+PE_DMA: 1 sub-transaction
+  aggregated router → hbm_ctrl link (256 GB/s)
+```
+
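A back-of-the-envelope check shows why the two modes are expected to deliver the same ideal bandwidth for this 4 KB access. The sketch ignores contention, queueing, and hop latency, and assumes 1 GB = 10**9 bytes, so that 1 GB/s equals 1 byte/ns; `transfer_ns` is a hypothetical helper, not a simulator function.

```python
def transfer_ns(nbytes: int, bw_gbs: float) -> float:
    # With 1 GB = 10**9 bytes, GB/s is numerically bytes per nanosecond.
    return nbytes / bw_gbs


channels = 8
channel_bw_gbs = 32.0
nbytes = 4096

# 1:1 mode: 8 parallel 512-byte requests, one per 32 GB/s channel link.
one_to_one_ns = max(
    transfer_ns(nbytes // channels, channel_bw_gbs) for _ in range(channels)
)

# n:1 mode: one 4096-byte request on the 256 GB/s aggregated link.
n_to_one_ns = transfer_ns(nbytes, channels * channel_bw_gbs)

print(one_to_one_ns, n_to_one_ns)  # 16.0 16.0
```

The two modes should diverge only once contention is modelled, which is where the per-channel links of 1:1 mode add visibility.
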
+#### D-LA8. Kernel model preserved
+
+- Kernel still issues single memory ops (`tl.load`, `tl.store`,
+  `tl.composite`).
+- LA is the address scheme exposed to kernel code.
+- Channel decomposition / aggregation happens inside PE_DMA's BAAW.
+- Kernel code never sees physical channel information.
+
+#### Consequences (LA model, proposed)
+
+Positive:
+
+- 1:1 vs n:1 semantics live in one place (BAAW).
+- Kernel abstraction preserved — no kernel code changes.
+- Topology-based policy control (mode switch via yaml).
+- Improved simulation-model consistency and debuggability.
+- Segment-based mapping is simpler than page tables; lower overhead.
+
+Negative:
+
+- Full VA/MMU code refactor required.
+- Request-generation path more complex (N requests in 1:1 mode).
+- Reduced per-channel visibility in n:1 mode.
+- VA-related tests need rewriting.
+
+---
+
+## Migration Path
+
+- **PA → VA** was an extension. PA mode is retained as the PageFault
+  fallback inside PE_DMA. Switching does not require removing PA
+  code.
+- **VA → LA**, if adopted, is a replacement, not coexistence. See
+  D-LA1 for the VA infrastructure removal list. PA fallback inside
+  PE_DMA may be retained orthogonally for tests.
+
+## Alternatives Considered (LA model)
+
+1. **Keep VA + fan-out in MMU**: MMU returns per-channel PAs.
+   Rejected: MMU's role would grow beyond translation to request
+   decomposition; aggregation (n:1) becomes awkward to express.
+2. **Channel-aware kernel API**: kernels call per-channel load/store
+   directly. Rejected: abstraction leakage, portability loss, all
+   benchmarks need rewriting.
+3. **Always PA (no LA)**: runtime passes per-channel PA to kernel
+   directly. Rejected: incompatible with aggregation; conversion
+   timing unclear; channel info leaks to kernel.
+
+## Test Requirements
+
+### VA model (current, regression)
+
+- Cross-PE / cross-cube DMA paths over installed mappings.
+- `MmuMapMsg` / `MmuUnmapMsg` fabric traversal with measured latency.
+- TLB-overhead-per-access timing.
+- PageFault fallback path preserves PA-only behaviour.
+
+### LA model (when implemented)
+
+- 1:1 mode: same logical access → N per-channel requests.
+- n:1 mode: same logical access → 1 aggregated request.
+- Bandwidth equivalence between modes for identical workload.
+- 1:1 mode: per-channel contention modelled correctly.
+- n:1 mode: aggregated bandwidth correctly reflected.
+- Kernel code unchanged across mode switch.
+- BAAW segment install / uninstall correctness.
+- Multiple tensors in distinct segments do not collide.
+
+## Implementation Order (LA, when scheduled)
+
+1. LA type (`policy/address/la_allocator.py`).
+2. BAAW segment table (`policy/address/baaw.py`).
+3. `BaawSegmentInstallMsg` (`runtime_api/kernel.py`).
+4. PE_DMA BAAW integration (`components/builtin/pe_dma.py`
+   `handle_command()`).
+5. RuntimeContext: LA alloc + segment install
+   (`runtime_api/context.py`).
+6. `Tensor.va_base` → `Tensor.la_base` (`runtime_api/tensor.py`).
+7. Remove VA/MMU code.
+8. Remove `pe_mmu` from `topology.yaml`; add mapping mode settings.
+9. Test migration:
+
+| Test file | Action |
+|-----------|--------|
+| `tests/test_mmu_component.py` | Remove → BAAW segment install tests |
+| `tests/test_mmu_fabric.py` | Remove → BAAW + fabric integration tests |
+| `tests/test_pe_mmu.py` | Remove |
+| `tests/test_va_allocator.py` | Replace with LA allocator tests |
+| `tests/test_va_integration.py` | Replace with LA + BAAW integration tests |
+| `tests/test_va_offset.py` | Replace with LA offset tests |
+
+## Links
+
+- ADR-0007 (runtime_api vs sim_engine boundaries)
+- ADR-0008 (tensor deployment)
+- ADR-0009 (kernel execution)
+- ADR-0014 (PE-internal execution model)
+- ADR-0015 (component port/wire model)
+- ADR-0017 (Cube NOC and HBM connectivity — LA model topology consumer)
+- ADR-0013 (Verification strategy — V1 PA tagging)
+- SPEC R2 (latency by traversal), R10 (memory addressing)
@@ -0,0 +1,233 @@
+# ADR-0012: Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
+
+## Status
+
+Accepted
+
+## Context
+
+Phase 0 uses a PA-first memory model (ADR-0011):
+
+- memory operations use device physical addresses (PA) only,
+- VA/MMU/IOMMU is not modeled.
+
+The host-facing runtime API interacts with the device via the IO_CPU endpoint.
+We define stable, minimal message schemas for Host ↔ IO_CPU so that:
+
+- benchmarks remain stable,
+- IO_CPU-internal fan-out/aggregation can evolve independently,
+- completion and failure propagation is deterministic.
+
+We also require PE-tagging (approach A): each shard explicitly carries (sip,cube,pe)
+so IO_CPU can deterministically route/fan-out without relying on PA decoding.
+
+---
+
+## Decision
+
+### D1. Contract scope
+
+This schema is the stable contract ONLY for Host ↔ IO_CPU.
+
+Messages beyond IO_CPU (to M_CPU, PE_CPU, schedulers, engines) are component-internal
+and are NOT part of this host contract in Phase 0.
+
+---
+
+### D2. Required message set
+
+The runtime API MUST use only these message types for Host ↔ IO_CPU:
+
+- MemoryWrite
+- MemoryRead
+- KernelLaunch
+
+All operations required by benchmarks (tensor init/copy, kernel run) MUST be expressible
+with these messages.
+
+---
+
+### D3. Common envelope (mandatory for all requests)
+
+All Host ↔ IO_CPU requests MUST include:
+
+- `msg_type: str`
+- `correlation_id: str`
+  - generated by the host
+  - used to match responses deterministically
+- `request_id: str`
+  - unique within a correlation_id
+- `target_device: str`
+  - device identifier (e.g., "sip:0")
+- `timestamp_tag: str | None` (optional)
+  - debug tag only; MUST NOT affect determinism
+
+All Host ↔ IO_CPU responses MUST include:
+
+- `correlation_id: str`
+- `request_id: str`
+- `completion: Completion`
+
+---
+
+### D4. Completion schema (mandatory)
+
+`Completion` MUST have:
+
+- `ok: bool`
+- `error_code: str | None`
+- `error_message: str | None`
+
+Rules:
+
+- If `ok == true` then `error_code` and `error_message` MUST be null.
+- If `ok == false` then `error_code` MUST be non-null.
+- Completion semantics MUST be deterministic.
+
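The first two rules above can be enforced at construction time. The following is a minimal sketch using a frozen dataclass; the error code `E_ROUTE` and the exception messages are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Completion:
    ok: bool
    error_code: Optional[str] = None
    error_message: Optional[str] = None

    def __post_init__(self) -> None:
        # ok == true: both error fields must be null.
        if self.ok and (self.error_code is not None
                        or self.error_message is not None):
            raise ValueError("ok completion must not carry error fields")
        # ok == false: error_code must be non-null.
        if not self.ok and self.error_code is None:
            raise ValueError("failed completion needs an error_code")


success = Completion(ok=True)
failure = Completion(ok=False, error_code="E_ROUTE", error_message="no path")
```
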
+---
+
+### D5. MemoryWrite schema (PA-first, PE-tagged)
+
+`MemoryWrite` represents a host-initiated write/initialize operation to device memory.
+
+Mandatory fields:
+
+- common envelope fields (D3)
+- destination placement tags (approach A):
+  - `dst_sip: int`
+  - `dst_cube: int`
+  - `dst_pe: int`
+- `dst_pa: int`
+  - destination physical address in the destination PE's address space
+- `nbytes: int`
+- `src_kind: "pattern" | "host_buffer_ref"`
+  - Phase 0 MUST support "pattern"
+- `pattern: Pattern | None`
+  - required if `src_kind == "pattern"`
+
+`Pattern` (Phase 0 mandatory support):
+
+- `pattern_kind: "zero" | "fill_u8" | "fill_u16" | "fill_u32" | "fill_fp16" | "fill_fp32"`
+- `value: number | None`
+  - required for fill_*; ignored for zero
+
+Optional fields:
+
+- `dst_mem_kind: "HBM" | "TCM" | "AUTO"` (default "AUTO")
+- `debug_label: str | None`
+
+Notes:
+
+- This message MUST NOT embed bulk tensor data in Phase 0.
+- All latency MUST come from explicit graph traversal and modeled components.
+
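One way to express the D5 field rules is a pair of dataclasses that validate on construction. This is a sketch, not the project's schema code: the class layout, defaults, and error messages are assumptions, and only the field names come from the ADR.

```python
from dataclasses import dataclass
from typing import Optional

FILL_KINDS = {"fill_u8", "fill_u16", "fill_u32", "fill_fp16", "fill_fp32"}


@dataclass(frozen=True)
class Pattern:
    pattern_kind: str
    value: Optional[float] = None

    def __post_init__(self) -> None:
        if self.pattern_kind not in FILL_KINDS | {"zero"}:
            raise ValueError(f"unknown pattern_kind: {self.pattern_kind}")
        # value is required for fill_* and ignored for zero.
        if self.pattern_kind in FILL_KINDS and self.value is None:
            raise ValueError("fill_* patterns require a value")


@dataclass(frozen=True)
class MemoryWrite:
    correlation_id: str
    request_id: str
    target_device: str
    dst_sip: int
    dst_cube: int
    dst_pe: int
    dst_pa: int
    nbytes: int
    src_kind: str = "pattern"
    pattern: Optional[Pattern] = None
    dst_mem_kind: str = "AUTO"
    msg_type: str = "MemoryWrite"

    def __post_init__(self) -> None:
        if self.src_kind == "pattern" and self.pattern is None:
            raise ValueError("src_kind 'pattern' requires a pattern")


msg = MemoryWrite(
    correlation_id="c0", request_id="r0", target_device="sip:0",
    dst_sip=0, dst_cube=0, dst_pe=1, dst_pa=0x1000, nbytes=4096,
    pattern=Pattern("fill_fp16", 1.0),
)
```

Because the placement tags have no defaults, omitting any of `dst_sip`, `dst_cube`, or `dst_pe` fails with a `TypeError` at construction, which matches the intent that PE tags are mandatory.
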
+---
+
+### D6. MemoryRead schema (PA-first, PE-tagged)
+
+`MemoryRead` represents a host-initiated read from device memory.
+
+Mandatory fields:
+
+- common envelope fields (D3)
+- source placement tags (approach A):
+  - `src_sip: int`
+  - `src_cube: int`
+  - `src_pe: int`
+- `src_pa: int`
+- `nbytes: int`
+
+Optional fields:
+
+- `dst_kind: "host_sink" | "discard"` (default "host_sink")
+- `debug_label: str | None`
+
+Response payload:
+
+- actual bytes are NOT required in Phase 0 (latency/traces focus)
+- implementations MAY return lightweight stats or hashes later via a new ADR
+
+---
+
+### D7. KernelLaunch schema (PA-first, PE-tagged shards)
+
+`KernelLaunch` represents launching a kernel on a target device via IO_CPU.
+
+Mandatory fields:
+
+- common envelope fields (D3)
+- `kernel_ref: KernelRef`
+- `args: list[KernelArg]`
+
+`KernelRef` MUST have:
+
+- `name: str`
+- `kind: "deployed" | "builtin"`
+- `deploy_pa: int | None` — PA where kernel binary was deployed (required for "deployed")
+- `deploy_sip: int` — SIP where binary resides
+- `deploy_cube: int` — cube where binary resides
+- `deploy_pe: int` — PE where binary resides
+- `nbytes_code: int` — kernel binary size (for BW modeling)
+
+Kernel binaries MUST be pre-deployed to device memory via MemoryWrite.
+KernelLaunch MUST NOT embed kernel source code or IR in the launch message.
+
+`KernelArg` supports tensor args by PA mapping and scalars by value.
+
+Tensor arg (mandatory):
+
+- `arg_kind: "tensor"`
+- `tensor_pa_map: TensorPAMap`
+
+`TensorPAMap` MUST have:
+
+- `shards: list[TensorShard]`
+
+`TensorShard` MUST have (approach A, enforced):
+
+- `sip: int`
+- `cube: int`
+- `pe: int`
+- `pa: int`
+- `nbytes: int`
+- `offset_bytes: int`
+
+Scalar arg (mandatory):
+
+- `arg_kind: "scalar"`
+- `dtype: "i32" | "i64" | "fp16" | "fp32" | "bool"`
+- `value: number | bool`
+
+Optional KernelLaunch fields:
+
+- `grid: dict | None`
+- `meta: dict | None`
+- `failure_policy: "fail_fast" | "collect_all"` (default "fail_fast")
+- `debug_label: str | None`
+
+Notes:
+
+- KernelLaunch MUST NOT embed bulk tensor data.
+- KernelLaunch MUST be submitted only to the IO_CPU endpoint.
+- IO_CPU MUST fan-out work internally using the shard (sip,cube,pe) tags.
+
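Because every shard carries its own `(sip, cube, pe)` tags, the fan-out step needs no PA decoding. A minimal sketch of that grouping follows; `group_by_pe` is a hypothetical helper and the PA values are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorShard:
    sip: int
    cube: int
    pe: int
    pa: int
    nbytes: int
    offset_bytes: int


def group_by_pe(shards):
    """Bucket shards by their explicit placement tags, preserving order."""
    groups = defaultdict(list)
    for shard in shards:
        groups[(shard.sip, shard.cube, shard.pe)].append(shard)
    return dict(groups)


shards = [
    TensorShard(sip=0, cube=0, pe=0, pa=0x1000, nbytes=2048, offset_bytes=0),
    TensorShard(sip=0, cube=0, pe=1, pa=0x1000, nbytes=2048, offset_bytes=2048),
]
targets = group_by_pe(shards)
print(sorted(targets))  # [(0, 0, 0), (0, 0, 1)]
```

Note that both shards use the same `pa` value; the tags, not the address, decide which PE receives the work.
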
+---
+
+## Verification Notes
+
+Tests SHOULD validate:
+
+- schema validation rejects missing mandatory fields,
+- deterministic correlation/response matching,
+- MemoryWrite/Read/KernelLaunch produce explicit hop traces,
+- all routed requests incur latency > 0.
+
+---
+
+## Links
+
+- ADR-0011 (Memory Addressing — PA / VA / LA)
+- ADR-0007 (runtime_api vs sim_engine boundaries)
+- ADR-0009 (kernel execution fan-out/aggregation)
+- ADR-0013 (Verification strategy — V1 message schema validation)
+- SPEC R2, R7, R8
@@ -0,0 +1,139 @@
+# ADR-0013: Verification Strategy and Phase 1 Test Plan
+
+## Status
+
+Accepted
+
+## Context
+
+KernBench is a system-level simulator whose correctness is defined by:
+
+- adherence to SPEC-defined invariants,
+- determinism and debuggability,
+- explicit modeling of routing and latency.
+
+Given the evolving implementation, we need a stable verification strategy
+that prevents architectural drift while allowing incremental development.
+
+This ADR defines the Phase 1 verification plan and what constitutes
+"correct behavior" for early implementations.
+
+---
+
+## Decision
+
+### D1. Verification is contract-based
+
+Verification MUST be derived from:
+
+- SPEC requirements,
+- accepted ADRs.
+
+Tests MUST validate architectural contracts, not incidental implementation details.
+
+---
+
+### D2. Phase 1 verification scope
+
+Phase 1 verification focuses on:
+
+- message contract validity (ADR-0012),
+- routing and fan-out semantics at the IO_CPU boundary (ADR-0009),
+- PA-first memory addressing and shard tagging (ADR-0011),
+- core latency and trace invariants (SPEC 0.1, R2).
+
+Microarchitectural accuracy, bandwidth contention, and cycle-level behavior
+are explicitly out of scope in Phase 1.
+
+---
+
+### D3. Required Phase 1 verification cases
+
+The following verification cases MUST be supported by the implementation:
+
+#### V1. Message schema validation
+
+- KernelLaunch requests missing `(sip, cube, pe)` in any tensor shard MUST be rejected.
+- MemoryWrite/MemoryRead requests missing destination/source placement tags MUST be rejected.
+- Completion results MUST follow the `ok / error_code / error_message` contract.
+
+#### V2. IO_CPU fan-out and aggregation
+
+Given:
+
+- a topology with one SIP, one CUBE, and two PEs,
+- a KernelLaunch request containing two tensor shards targeting different PEs,
+
+The system MUST:
+
+- submit a single KernelLaunch to IO_CPU,
+- fan-out work internally to both PEs,
+- aggregate completion and return a single deterministic completion to the host.
+
+#### V3. Latency and trace invariants
+
+For any valid request:
+
+- the hop-by-hop trace MUST be non-empty,
+- total latency MUST be greater than zero,
+- repeated runs with identical inputs MUST produce identical traces.
+
+#### V4. Topology independence and cross-domain coverage
+
+Verification cases MUST pass for multiple topology shapes, including:
+
+- minimal: (1 SIP, 1 CUBE, 1 PE)
+- multi-PE: (1 SIP, 1 CUBE, N PEs)
+- multi-CUBE within a SIP: (1 SIP, M CUBEs, ≥1 PE per CUBE)
+- multi-SIP tray: (K SIPs, ≥1 CUBE per SIP, ≥1 PE per CUBE)
+
+For multi-CUBE and multi-SIP topologies, Phase 1 verification focuses on:
+
+- explicit connectivity (required links exist),
+- deterministic routing and control-path traversal,
+- non-empty traces and latency > 0 for representative cross-domain requests
+  (inter-CUBE and inter-SIP paths).
+
+Tests MUST NOT hardcode topology sizes, node ids, or link counts.
+Instead, tests MUST derive expectations from the compiled topology metadata.
+
+---
+
+### D4. Phase 1 artifacts
+
+Phase 1 MAY include:
+
+- verification-only test code,
+- topology fixtures,
+- trace inspection utilities.
+
+Phase 1 MUST NOT require:
+
+- production code changes solely to satisfy tests,
+- weakening or removing tests to allow progress.
+
+---
+
+### D5. Phase 2 enforcement
+
+Phase 2 (Apply) MUST:
+
+- run the Phase 1 verification cases,
+- rollback all changes if any verification fails,
+- preserve tests as authoritative contracts.
+
+---
+
+## Consequences
+
+- Architectural correctness is enforced early.
+- Tests serve as executable documentation of system behavior.
+- Implementation remains flexible without losing rigor.
+
+---
+
+## Links
+
+- SPEC 0.1, R2, R6
+- ADR-0011 (Memory Addressing — PA / VA / LA)
+- ADR-0012 (Host ↔ IO_CPU message schema)
+- ADR-0009 (Kernel execution semantics)
@@ -0,0 +1,451 @@
+# ADR-0014: PE Pipeline Execution Model
+
+## Status
+
+Accepted
+
+## Context
+
+This ADR defines the PE-internal kernel execution model:
+
+- Role decomposition of PE-internal components
+- Command dispatch paths (simple / composite / multi-op composite with epilogue)
+- TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
+- TCM-centric dataflow with a register-file intermediary
+- Engine resource model
+- Observability and trace contract
+- Topology representation
+
+PE-internal structure (7 components in scope; 2 cross-referenced):
+
+- `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`,
+  `pe_tcm` — defined here
+- `pe_mmu` — VA model, defined in ADR-0011 D-VA
+- `pe_ipcq` — collective communication, defined in ADR-0023
+
+The goal is a deterministic, trace-friendly execution contract that keeps
+each block independently swappable.
+
+## Decision
+
+### D1. PE-internal component roles
+
+**PE_CPU**
+
+- Executes kernel instruction stream / control logic.
+- Generates PE commands and submits them to `PE_SCHEDULER` (via
+  `PeInternalTxn`).
+- Does NOT enqueue work directly into engine queues.
+
+**PE_SCHEDULER**
+
+- Sole dispatcher inside a PE.
+- Receives commands from `PE_CPU`. Dispatch by command type:
+  - Simple command (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`)
+    → forward directly to the target engine.
+  - `CompositeCmd` → generate a `TilePlan`, feed tiles into the pipeline
+    via a single `_feed_loop` (D6).
+- Does not participate in stage-to-stage chaining within a composite;
+  that is handled by token self-routing (D6).
+
+**PE_DMA**
+
+- Handles memory transfers between TCM and external memory domains
+  (HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
+- Two execution channels:
+  - `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1) — see D4.
+- Additional virtual channels:
+  - `vc_compute` — load/store/writeback traffic for GEMM/MATH tiles.
+  - `vc_comm` — IPCQ collective send data (defined in ADR-0023 D8).
+
+**PE_FETCH_STORE**
+
+- TCM ↔ Register File transfer unit.
+- Isolates register-file access semantics from compute engines so that
+  GEMM/MATH stay pure compute components.
+- BW-based latency model; TCM access contention naturally serializes
+  through `PE_TCM`'s BW resource.
+
+**PE_GEMM**
+
+- MAC array. Reads operands from the register file; writes results to
+  the register file. Does not touch `PE_TCM` directly.
+
+**PE_MATH**
+
+- Element-wise / reduction / SIMD unit. Reads / writes the register file.
+
+**PE_TCM**
+
+- Tightly-coupled scratchpad with BW-serialized access. Two logical
+  regions partitioned by ownership (see D5).
+
+**Cross-referenced components** (defined elsewhere):
+
+- `pe_mmu` — VA→PA translation per access (ADR-0011 D-VA).
+- `pe_ipcq` — collective ring buffers and peer endpoint metadata
+  (ADR-0023).
+
+### D2. Command lifecycle and queues
+
+`PE_SCHEDULER` maintains three logical structures:
+
+**SubmissionQueue** — written by `PE_CPU`; consumed by the scheduler.
+
+**InflightTable** — owned and mutated only by `PE_SCHEDULER`; tracks
+expanded sub-commands, dependency state, engine assignment, and
+completion status.
+
+**CompletionQueue** — written by `PE_SCHEDULER`; holds final completion
+records.
+
+**Single-writer rule**: only `PE_SCHEDULER` mutates command completion
+state. Engines report completion via explicit events / messages
+consumed by the scheduler.
+
+**Command completion**: when all sub-commands complete, `PE_SCHEDULER`
+publishes a completion record.
+
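The single-writer rule and the "all sub-commands complete" condition can be illustrated with a small bookkeeping class. This is a sketch under assumptions: the method names and the use of string IDs are invented for the example, and the real InflightTable also tracks dependency state and engine assignment.

```python
class InflightTable:
    """Hypothetical sketch: only the scheduler calls these methods."""

    def __init__(self) -> None:
        self._pending = {}          # cmd_id -> set of outstanding sub-command ids
        self.completion_queue = []  # final completion records, in order

    def track(self, cmd_id: str, sub_ids) -> None:
        self._pending[cmd_id] = set(sub_ids)

    def on_engine_complete(self, cmd_id: str, sub_id: str) -> None:
        pending = self._pending[cmd_id]
        pending.remove(sub_id)  # raises KeyError on a duplicate report
        if not pending:
            # Last sub-command finished: publish one completion record.
            del self._pending[cmd_id]
            self.completion_queue.append(cmd_id)


table = InflightTable()
table.track("cmd0", ["dma_read", "gemm", "dma_write"])
table.on_engine_complete("cmd0", "dma_read")
table.on_engine_complete("cmd0", "gemm")
done_early = list(table.completion_queue)   # still empty here
table.on_engine_complete("cmd0", "dma_write")
```
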
+### D3. Dispatch modes
+
+#### D3.1 Simple command
+
+A simple command expands to exactly one engine sub-command:
+
+- `DmaReadCmd` / `DmaWriteCmd` → `PE_DMA`
+- `GemmCmd` → `PE_GEMM`
+- `MathCmd` → `PE_MATH`
+
+Flow:
+
+```text
+PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
+       → completion → PE_SCHEDULER → CompletionQueue
+```
+
+#### D3.2 Composite command (single-op tiled pipeline)
+
+The default `CompositeCmd` runs a single compute op as a tile-pipelined
+sequence:
+
+```text
+DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE
+```
+
+`PE_SCHEDULER` splits the DMA payload into hardware tiles and emits one
+`TileToken` per tile with a monotonically increasing `tile_id`.
+
+Tile dependency (within one tile `t`):
+
+```text
+DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)
+```
+
+Inter-tile overlap is allowed wherever engine resources permit
+(D4 governs the constraints):
+
+```text
+DMA_READ(t+1) ∥ COMPUTE(t)
+DMA_WRITE(t-1) ∥ COMPUTE(t)
+```
+
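The benefit of inter-tile overlap can be estimated with the classic pipeline recurrence: a stage for tile `t` starts once the previous stage of `t` has finished and the same stage has finished tile `t-1`. The sketch below assumes each of the five stages is an independent capacity-1 resource, which is a simplification (FETCH and STORE both run on `PE_FETCH_STORE`, and D4 adds further constraints), and the 10 ns stage times are made up.

```python
STAGES = ("DMA_READ", "FETCH", "COMPUTE", "STORE", "DMA_WRITE")


def schedule(num_tiles: int, stage_ns: dict) -> list:
    """Return finish[t][s]: finish time of stage s for tile t."""
    finish = [[0.0] * len(STAGES) for _ in range(num_tiles)]
    for t in range(num_tiles):
        for s, name in enumerate(STAGES):
            prev_stage_done = finish[t][s - 1] if s > 0 else 0.0
            resource_free = finish[t - 1][s] if t > 0 else 0.0
            finish[t][s] = max(prev_stage_done, resource_free) + stage_ns[name]
    return finish


stage_ns = {name: 10.0 for name in STAGES}
finish = schedule(num_tiles=3, stage_ns=stage_ns)
pipelined_ns = finish[-1][-1]
serial_ns = 3 * sum(stage_ns.values())
print(pipelined_ns, serial_ns)  # 70.0 150.0
```
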
+#### D3.3 Multi-op composite (head + epilogue with scope)
+
+A `CompositeCmd` MAY carry `ops: tuple[OpSpec, ...]` to express a
+multi-op pipeline:
+
+```python
+@dataclass(frozen=True)
+class OpSpec:
+    kind: str         # "gemm" | "math.exp" | "math.bias_add" | ...
+    scope: Scope      # "per_k_tile" | "per_output_tile" | "once"
+    ...
+```
+
+- `ops[0]` (head) defines tile geometry (e.g., the head GEMM determines
+  M/K/N partition).
+- `ops[1:]` (epilogue) are subsequent stages whose `scope` decides how
+  often they fire:
+  - `per_k_tile` — every K-reduction step.
+  - `per_output_tile` — once per output tile.
+  - `once` — once per kernel.
+
+Cross-engine chains (e.g., GEMM head → MATH epilogue) are natural —
+each stage is dispatched via token self-routing (D6), so GEMM and MATH
+participate serially within the same composite even though they share
+the compute slot (D4).
+
+The empty-`ops` form is the legacy single-op path.
+
+### D4. Engine resource model
+
+**DMA engine**:
+
+- `DMA_READ`: `simpy.Resource(capacity=1)`.
+- `DMA_WRITE`: `simpy.Resource(capacity=1)`.
+- Both channels run concurrently (READ ∥ WRITE allowed).
+- Within a channel, requests serialize (READ ∥ READ disallowed; same
+  for WRITE).
+- `vc_comm` is an orthogonal channel for IPCQ traffic defined in
+  ADR-0023 D8 — out of scope for this ADR.
+
+**Compute engine**:
+
+- `accel_slot`: `simpy.Resource(capacity=1)` shared by `PE_GEMM` and
+  `PE_MATH`.
+- At most one compute op runs at a time within a PE.
+- Multi-op composite chains (D3.3) execute their compute stages serially
+  through this slot; token self-routing (D6) ensures the next stage
+  starts only after the previous compute releases the slot.
+
+**Engine completion**: each engine emits a completion event consumed by
+the scheduler / `PipelineContext` (D6).
+
+### D5. Dataflow
+
+**Input path (HBM source)**:
+
+```text
+HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
+PE_TCM → PE_FETCH_STORE → Register File
+Register File → PE_GEMM | PE_MATH
+```
+
+**Input path (shared SRAM source)**:
+
+```text
+Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
+PE_TCM → PE_FETCH_STORE → Register File
+```
+
+**Output path (HBM destination)**:
+
+```text
+Register File → PE_FETCH_STORE → PE_TCM
+PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM
+```
+
+GEMM/MATH never touch `PE_TCM` directly — `PE_FETCH_STORE` is the
+single TCM↔register-file gateway. This makes TCM BW contention
+explicit and lets fetch unit policies (e.g., prefetch) be replaced
+independently of compute engines.
+
+#### D5.1 PE_TCM partitioning
+
+`PE_TCM` is split into two logical regions:
+
+**SchedulerReservedTCM**
+
+- Owned exclusively by `PE_SCHEDULER`.
+- Holds composite-command tile buffers.
+- `PE_SCHEDULER` partitions this region, assigns buffers per DMA_READ /
+  COMPUTE / DMA_WRITE stage, guarantees input/output separation, and
+  manages tile-buffer lifetimes.
+
+**AllocatableTCM**
+
+- General-purpose region managed by `PEMemAllocator`.
+- Used for host / DP-visible allocations.
+
+**Visibility rule (hard isolation)**: `PEMemAllocator` MUST NOT see or
+allocate inside `SchedulerReservedTCM`. The reserved region is excluded
+from allocator-managed ranges by construction.
+
+**Tile buffer rules**:
+
+- Input and output buffers within `SchedulerReservedTCM` MUST NOT
+  overlap during a tile's active lifetime.
+- A tile buffer remains valid until the corresponding `DMA_WRITE`
+  completes.
+- Buffer reuse is permitted only after the consuming tile's lifetime
+  ends.
+
+### D6. TileToken self-routing pipeline
+
+A composite's stage-to-stage progression happens **without** routing
+through the scheduler. Each component forwards the token directly to
+the next stage's component using the token's `plan`:
+
+```text
+Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
+              ↑ chaining: no scheduler hop                          ↑
+                                                  PipelineContext.complete_tile()
+```
+
+This mirrors real-HW done-wire chains. The scheduler handles only
+**initial dispatch + completion aggregation**.
+
+#### TilePlan / Stage
+
+```python
+class StageType(Enum):
+    DMA_READ = 0
+    FETCH = 1
+    GEMM = 2
+    MATH = 3
+    STORE = 4
+    DMA_WRITE = 5
+
+@dataclass(frozen=True)
+class Stage:
+    stage_type: StageType
+    component: str         # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
+    params: dict           # stage-specific parameters
+
+@dataclass(frozen=True)
+class TilePlan:
+    tile_id: int
+    stages: tuple[Stage, ...]
+```
+
+#### TileToken
+
+```python
+@dataclass
+class TileToken:
+    tile_id: int
+    pipeline_ctx: PipelineContext
+    plan: TilePlan
+    stage_idx: int
+    params: dict             # cached current stage params
+    data_op: bool = True     # op_log opt-in (ADR-0020 D4)
+```
+
+Single-owner invariant: a token is owned by exactly one component at a
+time. Lifecycle: scheduler creates with `stage_idx=0` → component
+`_process()` → increment `stage_idx` → put to next stage's `in_port` →
+last stage calls `pipeline_ctx.complete_tile()`.
+
+#### PipelineContext (exactly-once completion)
+
+```python
+@dataclass
+class PipelineContext:
+    id: str
+    total_tiles: int
+    completed_tiles: int = 0
+    done_event: simpy.Event = None
+
+    def complete_tile(self) -> None:
+        self.completed_tiles += 1
+        if self.completed_tiles == self.total_tiles:
+            self.done_event.succeed()
+```
+
+Each tile's last stage MUST call `complete_tile()` exactly once.
+Duplicate calls are bugs (SimPy `Event` can succeed at most once).
+
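A defensive variant can turn an over-count into an immediate error instead of a silent miscount. The sketch below is not the ADR's class: `OneShotEvent` is a stdlib stand-in for `simpy.Event` so the example runs without SimPy, and the guard in `complete_tile()` is an assumed addition.

```python
from dataclasses import dataclass, field


class OneShotEvent:
    """Stand-in for simpy.Event: may succeed at most once."""

    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        if self.triggered:
            raise RuntimeError("event already triggered")
        self.triggered = True


@dataclass
class GuardedPipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: OneShotEvent = field(default_factory=OneShotEvent)

    def complete_tile(self) -> None:
        # Assumed guard: more completions than tiles means a duplicate call.
        if self.completed_tiles >= self.total_tiles:
            raise RuntimeError("complete_tile() called more than total_tiles times")
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()


ctx = GuardedPipelineContext(id="cmd0", total_tiles=2)
ctx.complete_tile()
fired_after_first = ctx.done_event.triggered
ctx.complete_tile()
```

This guard only catches duplicates that push the count past `total_tiles`; a duplicate that arrives before the last tile would still fire `done_event` early, so per-tile tracking would be needed to catch that case.
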
+#### Feed ordering
+
+`PE_SCHEDULER` has exactly one `_feed_loop` process consuming a
+`_pending_feeds` FIFO. Composite commands are enqueued in submission
+order; tile feed for a command runs to completion before the next
+command's feed begins. **Tile-feed interleaving between commands is
+disallowed.**
+
+Within a single command's tiles, downstream pipeline overlap arises
+naturally — earlier tiles progress through later stages while the feeder
+keeps pushing remaining tiles into the first stage queue (SimPy Store
+backpressure governs flow control). If the first-stage queue is full,
+only the feeder blocks; the scheduler worker's inbox processing
+continues.
+
+#### Token routing pattern (base class)
+
+```python
+def _pipeline_worker(self, env):
+    while True:
+        token = yield self._inbox.get()
+        yield from self._process(env, token)       # stage-specific logic
+        next_idx = token.stage_idx + 1
+        if next_idx < len(token.plan.stages):
+            next_stage = token.plan.stages[next_idx]
+            token.stage_idx = next_idx
+            token.params = next_stage.params
+            yield self.out_ports[next_stage.component].put(token)
+        else:
+            token.pipeline_ctx.complete_tile()
+```
+
+Each component implements only `_process()`; chaining lives in the
+base class.
+
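The routing pattern can be traced without SimPy by replacing the inbox stores with plain deques. This is a simplified, single-threaded sketch: `Stage` and `Token` here carry only the fields needed for routing, and `run` is a hypothetical driver that stands in for the per-component worker processes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    component: str


@dataclass
class Token:
    tile_id: int
    stages: tuple
    stage_idx: int = 0


def run(tokens):
    inboxes = {}
    for tok in tokens:
        inboxes.setdefault(tok.stages[0].component, deque()).append(tok)
    visited, completed = [], []
    while any(inboxes.values()):
        for component in list(inboxes):
            if not inboxes[component]:
                continue
            tok = inboxes[component].popleft()
            visited.append((tok.tile_id, component))  # stage-specific work goes here
            next_idx = tok.stage_idx + 1
            if next_idx < len(tok.stages):
                # Forward directly to the next stage; no scheduler hop.
                tok.stage_idx = next_idx
                nxt = tok.stages[next_idx].component
                inboxes.setdefault(nxt, deque()).append(tok)
            else:
                completed.append(tok.tile_id)  # complete_tile() in the ADR
    return visited, completed


plan = tuple(Stage(c) for c in
             ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_fetch_store", "pe_dma"))
visited, completed = run([Token(tile_id=0, stages=plan)])
```

The same component (`pe_fetch_store`, `pe_dma`) appears twice in the plan, which is why routing is driven by `stage_idx` rather than by the component's identity.
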
+### D7. Observability and trace contract
+
+The simulator emits deterministic trace events:
+
+- `command_submitted`
+- `sub_command_dispatched`
+- `engine_start`
+- `engine_complete`
+- `tile_ready`
+- `command_complete`
+
+For identical inputs, trace ordering MUST be deterministic.
+
+### D8. Topology representation
+
+PE-internal components are declared in `cube.pe_template`:
+
+```yaml
+pe_template:
+  components:
+    pe_cpu:         { kind: pe_cpu,         impl: builtin.pe_cpu,         attrs: { overhead_ns: ... } }
+    pe_scheduler:   { kind: pe_scheduler,   impl: builtin.pe_scheduler,   attrs: { overhead_ns: ... } }
+    pe_dma:         { kind: pe_dma,         impl: builtin.pe_dma,         attrs: { rd_engines: 1, wr_engines: 1 } }
+    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
+    pe_gemm:        { kind: pe_gemm,        impl: builtin.pe_gemm,        attrs: { shared_resource: accel_slot, ... } }
+    pe_math:        { kind: pe_math,        impl: builtin.pe_math,        attrs: { shared_resource: accel_slot, ... } }
+    pe_tcm:         { kind: pe_tcm,         impl: builtin.pe_tcm,         attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
+    pe_mmu:         { kind: pe_mmu,         impl: builtin.pe_mmu,         attrs: { ... } }   # ADR-0011 D-VA
+    pe_ipcq:        { kind: pe_ipcq,        impl: builtin.pe_ipcq,        attrs: { ... } }   # ADR-0023
+  links:
+    # Scheduler dispatch edges (initial)
+    scheduler_to_dma_mm:         0.0
+    scheduler_to_fetch_store_mm: 0.0
+    scheduler_to_gemm_mm:        0.0
+    scheduler_to_math_mm:        0.0
+    # Pipeline chaining edges (token self-routing per D6)
+    dma_to_fetch_store_mm:       0.0
+    fetch_store_to_gemm_mm:      0.0
+    fetch_store_to_math_mm:      0.0
+    gemm_to_fetch_store_mm:      0.0
+    gemm_to_math_mm:             0.0
+    math_to_fetch_store_mm:      0.0
+    fetch_store_to_dma_mm:       0.0
+    fetch_store_to_tcm_bw_gbs:   ...
+```
+
+Template is instantiated once per PE. PE instances are derived from
+`cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔
+cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).
+
+## Consequences
+
+### Positive
+
+- Each block is an independent topology node — individually swappable
+  via DI (ADR-0015).
+- PE-internal structure is visible in the topology graph.
+- Components do not know their downstream — plan-based routing gives
+  flexibility (e.g., epilogue chains require no scheduler change).
+- DMA and compute overlap naturally via SimPy Store backpressure.
+- Multi-op composite expresses fused operations (e.g., GEMM + bias_add)
+  without engine-level coupling.
+- TCM access contention is realistic — `PE_FETCH_STORE` is the single
+  TCM↔RF gateway.
+
+### Negative
+
+- Intra-PE component count is higher than a coarser model (7 base + 2
+  cross-referenced) — more topology nodes/edges.
+- Intra-PE token forwarding is explicit in traces (acceptable trade for
+  HW fidelity).
+
+## Links
+
+- ADR-0011 D-VA (PE_MMU component, VA translation)
+- ADR-0015 D4 (component port/wire model)
+- ADR-0020 (greenlet kernel execution / two-pass)
+- ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
+- SPEC R3, R4
@@ -0,0 +1,202 @@
+# ADR-0015: Component Port/Wire Model and Fabric Routing
+
+## Status
+
+Accepted
+
+## Context
+
+Realistic hardware modeling — queues, contention, fan-out — requires
+that components own fabric traversal while the simulation engine
+handles only initialization and completion observation. Direct method
+calls between components, or path-walking inside the engine, defeat
+queueing and contention semantics.
+
+This ADR defines:
+
+- how components communicate via typed port queues,
+- how propagation delay is modeled (wire processes with BW occupancy),
+- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch
+  (via M_CPU),
+- the engine's reduced role (wire init + completion observation only),
+- M_CPU.DMA as an internal subcomponent of M_CPU.
+
+---
+
+## Decision
+
+### D1. Component port model
+
+Each component has typed input/output ports modeled as SimPy Stores:
+
+```text
+in_ports:  dict[str, simpy.Store]   # keyed by source node_id
+out_ports: dict[str, simpy.Store]   # keyed by destination node_id
+```
+
+Ports are created at engine initialization based on graph edges.
+Each directed edge (src → dst) results in:
+
+- `src.out_ports[dst]`  — the sending end
+- `dst.in_ports[src]`   — the receiving end
+
+---
+
+### D2. Wire process (propagation delay + BW occupancy)
+
+For each directed edge (src, dst) in the topology graph, a SimPy wire process
+models propagation delay and BW occupancy:
+
+```python
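+def wire_process(env, out_port, in_port, delay_ns, bw_gbs):
+    available_at = 0.0  # time (ns) at which this directed link is free again
+    while True:
+        cmd = yield out_port.get()
+        if bw_gbs > 0:
+            nbytes = getattr(cmd, "nbytes", 0)
+            if nbytes > 0:
+                # Back-to-back serialization: wait until the link is free.
+                wait = available_at - env.now
+                if wait > 0:
+                    yield env.timeout(wait)
+                # 1 GB/s == 1 byte/ns, so nbytes / bw_gbs is already in ns.
+                available_at = env.now + (nbytes / bw_gbs)
+        yield env.timeout(delay_ns)  # propagation delay
+        yield in_port.put(cmd)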
+```
+
+Wire processes are started at engine initialization.
+Each directed edge maintains an `available_at` timestamp tracking when the link
+becomes free for the next transaction. When a transaction occupies a link, the
+next transaction on the same directed link must wait until occupancy clears
+(back-to-back serialization). TX and RX directions are independent (separate
+wire processes with separate `available_at` state).
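The `available_at` rule can be replayed without SimPy to see how back-to-back transactions serialize on one directed link. This is a minimal sketch, not part of the engine: `link_timeline` and its arguments are hypothetical names, and it assumes the downstream port never blocks the wire's `put()`.

```python
def link_timeline(arrivals_ns, sizes_bytes, delay_ns, bw_gbs):
    """Replay the available_at rule of wire_process for one directed link.

    Returns the delivery time (ns) of each transaction. 1 GB/s == 1 byte/ns.
    Hypothetical helper for illustration; assumes put() never blocks.
    """
    available_at = 0.0  # when the link is free for the next transaction
    wire_free = 0.0     # when the single wire process can take the next cmd
    delivered = []
    for arrival, nbytes in zip(arrivals_ns, sizes_bytes):
        now = float(max(arrival, wire_free))
        if bw_gbs > 0 and nbytes > 0:
            now = max(now, available_at)          # wait for occupancy to clear
            available_at = now + nbytes / bw_gbs  # occupy the link
        now += delay_ns                           # propagation delay
        delivered.append(now)
        wire_free = now                           # loop returns to get()
    return delivered


# Two 256-byte transfers arrive together on a 128 GB/s link with 1 ns delay.
print(link_timeline([0, 0], [256, 256], delay_ns=1.0, bw_gbs=128.0))  # [1.0, 3.0]
```

In this replay the first transfer is delivered after the propagation delay alone; its 2 ns of occupancy only pushes back the second transfer, which matches the loop as written above.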
+
+---
+
+### D3. Engine role (reduced)
+
+The simulation engine MUST:
+
+- wire components at initialization (create port Stores, start wire processes),
+- identify the entry component for each request type (PCIE_EP),
+- put the request into the entry component's in_port,
+- wait for a completion event.
+
+The simulation engine MUST NOT:
+
+- walk the topology path during request execution,
+- call component `run()` methods directly,
+- track per-hop latency or decompose fan-out.
+
+---
+
+### D4. Fabric paths for Memory R/W and Kernel Launch
+
+Memory R/W and Kernel Launch use **different** fabric paths.
+Memory operations bypass M_CPU and route directly to HBM via the cube router mesh.
+Kernel Launch routes through M_CPU for PE fan-out.
+
+**Memory R/W forward path (pcie_ep → hbm_ctrl, M_CPU bypass):**
+
+```text
+pcie_ep → io_noc → io_ucie
+  → [transit cubes: ucie_in → noc → ucie_out]  (zero or more)
+  → target cube: ucie_in → router mesh → hbm_ctrl
+```
+
+**Memory R/W completion path:**
+
+```text
+hbm_ctrl → router mesh → [transit cubes: ucie → router mesh → ucie]
+  → io_ucie → io_noc → pcie_ep
+```
+
+**Kernel Launch forward path (pcie_ep → io_cpu → M_CPU → PE):**
+
+```text
+pcie_ep → io_noc → io_cpu → io_noc → io_ucie
+  → [transit cubes: ucie_in → noc → ucie_out]  (zero or more)
+  → target cube: ucie_in → noc → M_CPU → PE[0..n] (parallel fan-out)
+```
+
+**Kernel Launch completion path:**
+
+```text
+PE[0..n] all complete → M_CPU (aggregation)
+  → noc → [transit cubes: ucie → noc → ucie]
+  → io_ucie → io_noc → io_cpu → io_noc → pcie_ep
+```
+
+**Rationale for M_CPU bypass on Memory R/W:**
+
+Memory write/read operations do not require command interpretation or PE
+dispatch — they are direct data transfers to/from HBM. Routing through M_CPU
+would add unnecessary overhead (5 ns) without functional benefit. The io_noc
+inside the IO chiplet handles the routing decision: memory operations go
+directly to cube fabric, while kernel launches are forwarded to io_cpu first.
+
+---
+
+### D5. M_CPU.DMA is an internal subcomponent of M_CPU
+
+M_CPU.DMA is NOT a separate topology node.
+It is an internal subcomponent owned by the M_CPU component implementation.
+
+M_CPU.DMA:
+
+- owns the DMA READ and DMA WRITE queues (capacity=1 each, per ADR-0014 D4),
+- issues memory requests over the NOC to hbm_ctrl,
+- receives completion from hbm_ctrl via the NOC,
+- reports completion to M_CPU,
+- is created and managed inside M_CPU's `__init__` and `run()`.
+
+M_CPU.DMA does not appear as a node in the compiled topology graph.
+
+---
+
+### D6. Transit cube forwarding
+
+A cube that is not the target of a memory or kernel request acts as a transit node.
+Transit cubes forward requests without consuming them:
+
+```text
+ucie_in (from upstream) → noc → ucie_out (to downstream)
+```
+
+Transit forwarding is implemented entirely within the ucie_in component.
+The noc and ucie_out components in a transit cube forward the packet without modification.
+
+---
+
+### D7. _formula_latency is preserved as a lower-bound cross-check
+
+The path-based formula latency function (`_formula_latency`) is preserved in the engine
+as a lower bound for correctness verification.
+
+Invariant:
+
+- Phase 0: `_formula_latency == component model total_ns`
+- Phase 1+: `_formula_latency <= component model total_ns` (contention adds queueing)
+
+This function is independent of the port/wire model and requires only the topology graph.
+It is used for shard comparison in `_route_kernel` and as a regression guard.
+
+---
+
+## Consequences
+
+- Components model realistic hardware behavior (queues, contention, fan-out).
+- Propagation delay is modeled accurately per edge.
+- Engine is decoupled from routing policy.
+- Component implementations remain swappable via DI (ADR-0007 D3).
+
+---
+
+## Links
+
+- ADR-0007 D2 (engine role boundary)
+- ADR-0009 D3 (kernel execution fan-out hierarchy)
+- ADR-0014 D4 (DMA engine capacity=1)
+- ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
+- ADR-0016 (IOChiplet NOC and memory data path)
+- ADR-0017 (cube NOC 2D mesh architecture)
+- ADR-0033 (Latency model assumptions built on these mechanisms)
@@ -0,0 +1,98 @@
+# ADR-0016: IOChiplet NOC and Memory Data Path
+
+## Status
+
+Accepted
+
+## Context
+
+ADR-0003 D2 defines IO chiplets as SIP-level components providing PCIe-EP and
+IO_CPU interfaces, but does not specify internal routing within the IO chiplet.
+ADR-0015 D4 was updated to document the M_CPU bypass for Memory R/W, but the
+IO chiplet's internal NOC architecture that enables this routing was not
+formally documented.
+
+The IO chiplet needs an internal routing fabric (io_noc) to:
+
+- connect pcie_ep, io_cpu, and per-cube UCIe PHY ports
+- route memory operations (MemoryWrite/Read) directly to cube fabric without
+  passing through io_cpu
+- route kernel launch commands through io_cpu for command interpretation
+
+## Decision
+
+### D1. IOChiplet internal NOC (io_noc)
+
+Each IO chiplet instance contains an internal NOC node (`io_noc`) that connects:
+
+- `pcie_ep` — host-facing PCIe endpoint
+- `io_cpu` — command processor for kernel launch interpretation
+- `io_ucie-{PHY}.conn{N}` — per-PHY connection nodes to cube UCIe ports
+
+The io_noc is a forwarding-only fabric (`forwarding_v1` implementation) with
+zero overhead. All routing decisions are made by the simulation engine based
+on message type, not by io_noc itself.
+
+### D2. IOChiplet UCIe decomposition
+
+Each IO chiplet PHY port is decomposed into:
+
+- `io_ucie-{PHY}` — the UCIe protocol endpoint (overhead = 8ns)
+- `io_ucie-{PHY}.conn{N}` — N connection nodes between io_noc and io_ucie
+
+This mirrors the cube-side UCIe decomposition (ADR-0017 D6) and allows
+multiple independent NOC-to-UCIe connections per PHY.
+
+### D3. Memory R/W path (M_CPU bypass)
+
+Memory operations (MemoryWrite, MemoryRead) are routed directly from pcie_ep
+through io_noc to the target cube, bypassing io_cpu entirely:
+
+```text
+pcie_ep → io_noc → conn → io_ucie → [cube UCIe] → router mesh → hbm_ctrl
+```
+
+This avoids the 10 ns io_cpu overhead for pure data transfers. The simulation
+engine's `_process_memory_direct()` method uses `find_memory_path()` which
+resolves the shortest path from pcie_ep to the target HBM node.
+
+### D4. Kernel Launch path (via io_cpu)
+
+Kernel launch commands require io_cpu for command interpretation and PE
+fan-out setup:
+
+```text
+pcie_ep → io_noc → io_cpu → io_noc → conn → io_ucie → [cube UCIe]
+  → noc → m_cpu → PE
+```
+
+The engine's `_entry_points()` method routes KernelLaunchMsg through both
+pcie_ep (entry) and io_cpu (command processing).
+
+### D5. IOChiplet-to-cube port mapping
+
+Each IO chiplet instance declares which cube ports it connects to:
+
+```yaml
+cube_ports:
+  - { cube: {xy: [0,0]}, cube_side: N, phy: P0, distance_mm: 2.0 }
+  - { cube: {xy: [1,0]}, cube_side: N, phy: P1, distance_mm: 2.0 }
+```
+
+The topology builder creates edges from io_ucie PHY nodes to the
+corresponding cube UCIe port nodes, with the specified distance and
+the IO chiplet's `per_connection_bw_gbs` as link bandwidth.
+
+## Consequences
+
+- IO chiplet has a well-defined internal routing fabric
+- Memory operations avoid unnecessary io_cpu overhead
+- Kernel launch commands still get proper command interpretation
+- The io_noc pattern is consistent with cube-level NOC design
+- ADR-0003 D2 is extended (not contradicted) by this ADR
+
+## Links
+
+- ADR-0003 D2 (IO chiplet definition)
+- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
+- ADR-0012 D1 (host-to-IO_CPU message schema)
@@ -0,0 +1,291 @@
+# ADR-0017: Cube NOC and HBM Connectivity
+
+## Status
+
+Accepted
+
+## Context
+
+The CUBE-level NOC is a 2D router mesh that carries every intra-cube
+request: PE-to-HBM data, PE-to-PE traffic, command paths
+(M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic.
+
+The CUBE's HBM is exposed through per-PE controller endpoints attached
+to PE routers. This per-PE partitioning makes local-vs-remote HBM
+distinguishable by mesh distance: a PE's own HBM partition sits at its
+own router (switching overhead only); another PE's HBM partition is
+reachable by mesh hops to that PE's router.
+
+Two channel-mapping modes are supported in the design space:
+
+- **n:1 (default, implemented)** — each PE's HBM partition aggregates
+  `channels_per_pe` pseudo-channels into one endpoint. Effective
+  per-PE BW = N × per-channel BW.
+- **1:1 (future)** — each PE router decomposes into per-channel
+  mini-routers; per-channel BW contention is modeled directly.
+
+In both modes the per-PE effective BW is identical; only the connectivity
+granularity differs.
+
+## Decision
+
+### D1. 2D router mesh
+
+Each cube contains a 2D mesh of NOC routers generated by `mesh_gen.py`.
+
+- Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`).
+- Implementation: `forwarding_v1`. NOC `overhead_ns = 0`.
+- Default 6×6 grid (sized from PE corner placement + UCIe attachment
+  count); larger PE counts scale the grid up.
+- HBM exclusion zone: center rows/columns are excluded where HBM die
+  physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6).
+- Latency = Manhattan distance × `ns_per_mm`.
+
+### D2. XY routing algorithm
+
+Deterministic XY routing:
+
+1. Horizontal segment: route from source X to destination X at source Y.
+2. Vertical segment: route from destination X at source Y to destination Y.
+
+Each directed segment carries a unique key:
+
+- Horizontal: `("H", y_band, x_min, x_max, direction)`
+- Vertical:   `("V", x_band, y_min, y_max, direction)`
+
+Grid positions are snapped to the router grid, excluding the HBM zone.
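The two-segment rule can be sketched as a small path generator. This is an illustrative sketch only: `xy_route` and `route_names` are hypothetical helpers, routers are addressed as `(row, col)` to match the `r{row}c{col}` naming, and the HBM exclusion zone is ignored.

```python
def xy_route(src, dst):
    """Return the routers visited from src to dst as (row, col) tuples.

    The horizontal segment (column moves) comes first, then the vertical
    segment (row moves). The HBM exclusion zone is NOT handled here.
    """
    (row, col), (dst_row, dst_col) = src, dst
    path = [(row, col)]
    step = 1 if dst_col >= col else -1
    while col != dst_col:          # horizontal segment at the source row
        col += step
        path.append((row, col))
    step = 1 if dst_row >= row else -1
    while row != dst_row:          # vertical segment at the destination column
        row += step
        path.append((row, col))
    return path


def route_names(path):
    """Format (row, col) tuples using the r{row}c{col} node naming."""
    return [f"r{r}c{c}" for r, c in path]


print(route_names(xy_route((0, 0), (1, 4))))
# ['r0c0', 'r0c1', 'r0c2', 'r0c3', 'r0c4', 'r1c4']
```

The printed path is the same router sequence as the `r0c0` to `r1c4` example in D7.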
+
+### D3. Per-segment contention model
+
+Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
+sharing a segment (same row or column band, same direction) contend for
+the resource — modelling link-level serialization in a wormhole-routed
+mesh.
+
+With no contention, NOC traversal latency equals Manhattan distance ×
+`ns_per_mm`. Under contention, SimPy's resource scheduling adds queueing
+delay.
+
+### D4. NOC attachment points (per-PE HBM partition)
+
+Every PE router carries three attachments: `pe{idx}.dma`, `pe{idx}.cpu`,
+and `pe{idx}.hbm`. The last is the per-PE HBM controller endpoint —
+`sip{S}.cube{C}.hbm_ctrl.pe{idx}` — which owns one slice of the cube's
+HBM (one pseudo-channel group; see D8).
+
+Other attachments:
+
+- M_CPU and shared SRAM each occupy a dedicated edge router.
+- UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed
+  along that edge (see D6).
+
+```text
+                    UCIe-N (conn x4)
+                         |
+           +---------+---+---+---------+
+           |         |       |         |
+PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
+PE0.cpu <--+ +hbm.pe0|       | +hbm.pe2+--< PE2.cpu
+           |         |       |         |
+UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
+(conn x4)  |         | zone  |         |  (conn x4)
+           |  r2c0   |       |         |
+M_CPU <--->+         |       |         |
+           |  r3c0   |       |         |
+SRAM <---->+         |       |         |
+           |         |       |         |
+PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
+PE4.cpu <--+ +hbm.pe4|       | +hbm.pe6+--< PE6.cpu
+           |         |       |         |
+           +---------+---+---+---------+
+                         |
+                    UCIe-S (conn x4)
+```
+
+Per-PE HBM partitioning is the key invariant that makes local vs
+cross-PE HBM distinguishable by mesh distance (see D7).
+
+### D5. NOC edge bandwidths and distances
+
+| Connection                    | BW (GB/s)  | Distance      | Notes                                       |
+| ----------------------------- | ---------- | ------------- | ------------------------------------------- |
+| PE_DMA → NOC                  | 256.0      | Physical (PE) | Matches local-HBM aggregate BW              |
+| NOC → PE_CPU                  | —          | 0.0 mm        | Command path only                           |
+| Router ↔ hbm_ctrl.pe{idx}     | 256.0      | 0.0 mm        | Per PE router; N × per-channel BW (see D8)  |
+| NOC ↔ M_CPU                   | —          | 0.0 mm        | Command path                                |
+| NOC ↔ SRAM                    | 128.0 × 4  | 0.0 mm        | 512 GB/s aggregate                          |
+| NOC ↔ UCIe conn               | 128.0      | 0.0 mm        | Per connection; 4 conn per port             |
+
+`0.0 mm` distances reflect the distributed nature of the NOC; actual
+traversal distance is computed via Manhattan distance within the router
+grid.
+
+### D6. UCIe decomposition and inter-cube traffic
+
+Each of the 4 UCIe ports (N, S, E, W) decomposes into:
+
+- 1 `ucie-{PORT}` node: UCIe protocol endpoint (`overhead = 8.0 ns`).
+- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe.
+
+This decomposition gives 4 independent NOC↔UCIe connections per port,
+each with 128 GB/s bandwidth (512 GB/s aggregate per port).
+
+Inter-cube traffic path:
+
+```text
+Source: PE_DMA → NOC → conn{i} → ucie-{PORT}
+                  [UCIe link: 512 GB/s, 1.0mm seam distance]
+Target: ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx}
+```
+
+UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full
+crossing incurs 16 ns (TX port + RX port).
+
+### D7. Data paths through the NOC
+
+All intra-cube traffic uses the same router mesh — no separate fast
+paths.
+
+**Local HBM** (same PE's own partition; 0 mesh hops):
+
+```text
+PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx}   (switching overhead only)
+```
+
+**Cross-PE HBM within cube** (target PE's partition, reached by mesh):
+
+```text
+PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'}
+```
+
+Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 on `r1c4`):
+
+```text
+PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2
+```
+
+Dijkstra computes the shortest path within the mesh.
+
+**Cross-cube HBM** (UCIe traversal):
+
+```text
+PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn
+       → r{x'}c{y'} → hbm_ctrl.pe{idx'}
+```
+
+**Kernel launch command to PE**:
+
+```text
+[from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU
+```
+
+**Shared SRAM access**:
+
+```text
+PE_DMA → r{x}c{y} → (mesh) → SRAM
+```
+
+### D8. HBM channel mapping mode
+
+Channel mapping is configured at cube scope:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one       # one_to_one | n_to_one
+    hbm_pseudo_channels: 64          # total pseudo-channel count
+    hbm_channels_per_pe: 8           # per-PE local channel count
+    hbm_channel_bw_gbs: 32.0         # per-channel bandwidth (GB/s)
+    hbm_slices_per_cube: 8           # number of per-PE partitions
+    hbm_total_gb_per_cube: 48
+```
+
+**n:1 mode (default, implemented).** Each PE's HBM partition is a single
+endpoint `hbm_ctrl.pe{idx}` that aggregates `channels_per_pe` pseudo-
+channels. The `Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals
+`channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to
+interleave; only aggregate per-PE BW is modeled. No separate aggregated
+router node exists — the per-PE router itself serves that role.
+
+**1:1 mode (future).** Each PE router decomposes into N channel
+mini-routers; per-channel routing carries fully-resolved PA + channel ID.
+A `ChannelSplitter` resolves a logical access to N per-channel physical
+requests. Per-channel link models BW contention. Cross-PE channel
+access semantics are deferred to the implementation ADR.
+
+**BW math (defaults).**
+
+| Parameter                          | Value                      |
+| ---------------------------------- | -------------------------- |
+| pseudo channels per cube           | 64 (parameter)             |
+| PEs per cube                       | 8 (parameter)              |
+| channels per PE (N)                | 64 / 8 = 8                 |
+| per-channel BW                     | 32 GB/s (parameter)        |
+| per-PE local BW                    | N × 32 = 256 GB/s          |
+| cube total HBM BW                  | 64 × 32 = 2048 GB/s        |
+
+Both modes give the same per-PE effective BW; only the request shape and
+contention model differ.
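The arithmetic in the table can be checked with a few lines of Python. This is a sketch under the default parameters above; `hbm_bw_summary` is a hypothetical helper, not a function in the codebase.

```python
def hbm_bw_summary(pseudo_channels=64, pes_per_cube=8, channel_bw_gbs=32.0):
    """Derive per-PE and per-cube HBM bandwidth from the channel parameters."""
    if pseudo_channels % pes_per_cube:
        raise ValueError("pseudo-channels must divide evenly across PEs")
    channels_per_pe = pseudo_channels // pes_per_cube
    return {
        "channels_per_pe": channels_per_pe,
        "per_pe_bw_gbs": channels_per_pe * channel_bw_gbs,
        "cube_bw_gbs": pseudo_channels * channel_bw_gbs,
    }


print(hbm_bw_summary())
# {'channels_per_pe': 8, 'per_pe_bw_gbs': 256.0, 'cube_bw_gbs': 2048.0}
```

The per-PE result equals the `Router ↔ hbm_ctrl.pe{idx}` link bandwidth listed in D5.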
+
+### D9. AddressResolver — per-PE HBM endpoint
+
+The address resolver decodes a PA's HBM offset to the owning PE's
+partition:
+
+```python
+# policy/routing/router.py
+hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube
+
+if addr.kind == "hbm":
+    pe_id = int(addr.hbm_offset) // hbm_slice_bytes
+    return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
+```
+
+The pe_id computation is intrinsic to the routing layer (not a
+topology-time concern). Any HBM PA falls within exactly one partition,
+yielding deterministic routing.
+
+External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the
+same resolver path — there is no separate fast path.
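A self-contained version of the slice computation, using the D8 defaults (48 GB per cube, 8 slices), might look like the following. `HbmAddr` and `resolve_hbm_endpoint` are hypothetical stand-ins for the real address type and resolver, and the range check is an added assumption.

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass(frozen=True)
class HbmAddr:
    """Hypothetical stand-in for the decoded physical address type."""
    kind: str
    hbm_offset: int


def resolve_hbm_endpoint(addr, sip, cube, total_gb=48, slices=8):
    """Map an HBM offset to the per-PE controller endpoint that owns it."""
    if addr.kind != "hbm":
        raise ValueError(f"not an HBM address: {addr.kind}")
    if not 0 <= addr.hbm_offset < total_gb * GIB:
        raise ValueError("offset outside the cube's HBM range")
    slice_bytes = total_gb * GIB // slices   # 6 GiB per PE with the defaults
    pe_id = addr.hbm_offset // slice_bytes
    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"


print(resolve_hbm_endpoint(HbmAddr("hbm", 6 * GIB), sip=0, cube=0))
# sip0.cube0.hbm_ctrl.pe1
```

With these defaults, offsets in `[0, 6 GiB)` resolve to `pe0` and the last byte of the cube's HBM resolves to `pe7`.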
+
+### D10. Mesh generation parameters
+
+`mesh_gen.py` produces `cube_mesh.yaml` from:
+
+- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner.
+- `cube.geometry`: cube physical dimensions and HBM zone.
+- `cube.ucie.n_connections`: determines router count for UCIe attachment.
+
+Output `mesh_data` dictionary contains:
+
+- Router grid with positions and HBM exclusion zones.
+- PE-to-router attachments (`pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm`
+  per PE).
+- UCIe-to-router attachments (N/S/E/W distributed across edge routers).
+- M_CPU and SRAM router attachments.
+
+## Consequences
+
+- Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM
+  (mesh hops) are naturally distinguishable, satisfying SPEC R5
+  (multi-domain communication) and ADR-0002 (no zero-latency end-to-end
+  paths).
+- All cube-internal traffic routes through one mesh — single contention
+  model, single layout, single set of edge BWs.
+- Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each
+  PE's partition is the n:1 aggregate of its assigned pseudo-channels.
+- 1:1 mode extension is structurally natural — split each PE router into
+  N channel routers.
+- Mesh generation is fully parameterised by `topology.yaml`; PE/cube
+  geometry changes propagate without code edits.
+
+## Links
+
+- ADR-0002 (Routing distance, ordering, no zero-latency paths)
+- ADR-0003 D3 (cube-level NOC definition — extended here)
+- ADR-0004 (Memory semantics, local HBM)
+- ADR-0011 (Memory addressing — LA model consumes per-PE partition)
+- ADR-0014 D1 (PE_DMA egress via router mesh)
+- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
+- ADR-0016 (IOChiplet io_noc — analogous pattern at IO chiplet level)
+- ADR-0033 (Latency model: per-PC parallelism, switch penalty)
@@ -0,0 +1,516 @@
+# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
+
+## Status
+
+Accepted
+
+## Context
+
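+The simulation currently models **timing only**.
+Calls such as `tl.load()` and `tl.composite(op="gemm")` generate SimPy latency,
+but they neither read nor compute on actual tensor data.
+
+### Required capabilities
+
+1. Actual data must be storable in and readable from HBM/TCM/SRAM.
+2. PE_GEMM and PE_MATH must perform real matrix operations so that results can be verified.
+3. Simulation performance degradation must be minimized.
+
+### Constraints
+
+- SimPy is a single-threaded event loop: running a numpy matmul inside it blocks the whole simulation.
+- Components must remain swappable (ADR-0015): framework requirements must not leak into implementations.
+- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait): the same code must be reused.
+- Kernel functions must remain plain Python functions (no conversion to generator/async).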
+
+### Design exploration results
+
+| Option | Approach | Verdict |
+|--------|----------|---------|
+| Direct execution inside SimPy | Call numpy for GEMM inside SimPy | Rejected: blocks the single thread |
+| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: result() blocks on back-to-back requests |
+| Symbolic + lazy | Track metadata only, execute later | Rejected: control-flow-dependent reads are hard to handle |
+| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Fully separated, no performance impact |
+
+---
+
+## Decision
+
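+### D1. 2-pass execution model: Phase 0 removed
+
+The previous three stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into two**.
+
+Before:
+```
+Phase 0: kernel → PeCommand list (no data, no branching)
+Phase 1: replay the PeCommand list in SimPy (timing only)
+```
+
+After:
+```
+Phase 1 (timing): kernel + SimPy integrated execution, greenlet-based
+  - Memory reads/writes: SimPy timing + actual data in MemoryStore
+  - Compute (GEMM/Math): SimPy timing + op_log record (actual compute runs in Phase 2)
+  - Dynamic control flow is possible (tl.load returns actual data)
+
+Phase 2 (data): executes the actual compute from the op_log, outside SimPy, parallelizable
+```
+
+This ADR **extends Phase 1 to be data-aware for memory operations only**.
+Phase 1 covers latency/BW bottleneck analysis plus memory data tracking;
+Phase 2 covers correctness verification of GEMM/Math compute.
+Phase 2 is optional: when only timing is needed, run Phase 1 alone.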
+
+### D2. Op log recording: ComponentBase hook
+
+Op-log recording is performed by a **hook on the component base class**.
+Individual component implementations are not modified.
+
+```python
+class ComponentBase:
+    def _on_process_start(self, env, msg):
+        if self._op_logger and getattr(msg, 'data_op', False):
+            self._op_logger.record_start(env.now, self.node.id, msg)
+
+    def _on_process_end(self, env, msg):
+        if self._op_logger and getattr(msg, 'data_op', False):
+            self._op_logger.record_end(env.now, self.node.id, msg)
+```
+
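+`_forward_txn()` calls the hooks immediately before and after `run()`.
+`_op_logger` is optional: when it is absent, there is no logging overhead.
+
+**Hook timing definitions**:
+
+| Point | Meaning |
+|-------|---------|
+| `t_start` | When the component **begins service** for the msg (just before entering `run()`) |
+| `t_end` | When the component's **internal service completes** (just after `run()` returns) |
+
+Link traversal latency is not included in t_start/t_end.
+It is observed as the difference between the sending component's t_end and the receiving component's t_start.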
+
+### D3. Greenlet-based kernel execution: Phase 0 removed
+
+The previous Phase 0 (kernel → PeCommand list) is removed, and
+**greenlet** is used to interleave the kernel and SimPy cooperatively.
+
+#### How it works
+
+greenlet is a C extension that provides cooperative context switching.
+When the kernel (child greenlet) calls `tl.load()` or a similar API, it switches to the
+SimPy loop (parent greenlet), which runs the timing simulation and then returns to the kernel with the actual data.
+
+```
+SimPy loop (parent greenlet)           kernel (child greenlet)
+─────────────────────────              ──────────────────────
+g.switch() ─────────────────────────→ kernel starts
+                                       a = tl.load(ptr, ...)
+                                         internally: parent.switch(DmaReadCmd)
+cmd = DmaReadCmd ←──────────────────  (kernel suspended)
+  yield DmaReadMsg(...)
+  yield env.timeout(dma_latency)
+  data = memory_store.read(...)
+g.switch(data) ─────────────────────→ (kernel resumes)
+                                       a = data  ← actual numpy array
+                                       if a[0][0] > 0.5:  ← branching is possible
+                                         ...
+```
+
+The kernel remains a **plain Python function**.
+greenlet switches exist **only inside the implementation** of `tl.load()`, `tl.store()`, and similar calls.
+
+#### KernelRunner: framework layer
+
+The greenlet loop lives in **KernelRunner**, a framework layer,
+not in the PE_CPU component.
+
+```python
+from greenlet import greenlet
+
+# KernelRunner (framework: bridges greenlet ↔ SimPy)
+class KernelRunner:
+    def run(self, env, kernel_fn, args, store):
+        g = greenlet(self._run_kernel)
+        cmd = g.switch(kernel_fn, args)
+
+        while cmd is not None:
+            if isinstance(cmd, DmaReadCmd):
+                yield from self._dispatch_dma(env, cmd)
+                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
+                cmd = g.switch(data)            # resume with the actual data
+            elif isinstance(cmd, GemmCmd):
+                yield from self._dispatch_gemm(env, cmd)
+                cmd = g.switch()                # resume (no data)
+            elif isinstance(cmd, DmaWriteCmd):
+                store.write(cmd.dst_addr, cmd.data)  # visibility = issue time
+                yield from self._dispatch_dma(env, cmd)  # timing only
+                cmd = g.switch()
+            else:
+                # Only the three command types above are sketched here;
+                # failing loudly avoids spinning forever on an unknown command.
+                raise TypeError(f"unsupported command: {cmd!r}")
+
+# PE_CPU (component: kept simple, unaware of greenlet)
+def _execute_kernel(self, env):
+    runner = KernelRunner(self.ctx)
+    yield from runner.run(env, kernel_fn, args, store)
+```
+
+**Op logging single source of truth**: KernelRunner does not write to the op_log directly.
+All op logging is handled **exclusively by the ComponentBase hooks (_on_process_start/end)**.
+When KernelRunner delivers a message to a component via `_dispatch_gemm()` or a similar call,
+the component base-class hook records it automatically.
+
+**Layer separation**:
+- **Kernel code**: plain function, unaware that greenlet exists
+- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
+- **KernelRunner**: bridges greenlet ↔ SimPy and handles MemoryStore reads/writes. **Does no logging**.
+- **ComponentBase hook**: the only path for op_log recording
+- **PE_CPU**: only invokes KernelRunner; swappable as a component
+
+#### Memory reads/writes vs. compute
+
+| Operation | In Phase 1 | In Phase 2 |
+|-----------|------------|------------|
+| `tl.load()` | SimPy timing + MemoryStore read → **returns actual data** | - |
+| `tl.store()` | SimPy timing + MemoryStore write → **actually written** | - |
+| `tl.composite(gemm)` | SimPy timing + **op_log record only** | actual numpy compute |
+| `tl.dot()` / math ops | SimPy timing + **op_log record only** | actual numpy compute |
+
+Memory reads/writes are handled immediately in Phase 1 (numpy slices, fast).
+GEMM/Math compute is batch-executed in Phase 2 (performance isolation).
+
+#### Store Visibility Rule
+
+`tl.store()` is **reflected in MemoryStore immediately at issue time** (visibility = issue).
+The SimPy DMA timing is simulated separately afterwards.
+
+This intentionally decouples timing from visibility:
+- **visibility**: when the write is reflected in MemoryStore = when `store.write()` is called
+- **timing**: when the DMA latency completes in SimPy
+
+With this separation, a load issued right after a store sees the latest data under dynamic control flow.
+
+#### Result Handle Semantics
+
+`tl.composite()` (sync/async) returns a **handle** that refers to the result tensor.
+
+Core contract in Phase 1:
+
+1. **Every compute handle is always treated as pending in Phase 1.**
+2. `tl.wait(handle)` **expresses timing synchronization only**;
+   it does not make the handle ready.
+3. Access to the handle's actual result data (`handle.data`, element access,
+   numpy conversion, etc.) is **possible only in Phase 2**.
+4. Therefore **control flow based on compute results is not supported** in Phase 1.
+5. In contrast, `tl.load()` returns actual data in Phase 1, so
+   **control flow based on memory reads is supported**.
+
+| Handle state | Phase | Allowed behavior |
+|--------------|-------|------------------|
+| pending | Phase 1 | `tl.wait(handle)`: timing synchronization only |
+| pending | Phase 1 | Pass the handle as the target of `tl.store()` (links the logical destination only; the payload is produced in Phase 2) |
+| pending | Phase 1 | **No data access**: value-based branching is not possible |
+| ready | Phase 2 | Access and verify the actual numpy data |
+
+This restriction is intentional. Running compute in Phase 1 would block the
+single SimPy thread and defeat the purpose of the 2-pass split.
+
+#### Phase 1 Materialization — Future Extension
+
+If eager Phase 1 execution is later needed for small operations (scalars, small reductions),
+selective materialization can be supported by adding a `materialized_in_phase1: bool` flag
+to the op record. It is not implemented in the current scope.
+
+### D4. data_op flag: message self-declaration
+
+Whether a message is logged is decided by the `data_op` attribute of the message instance, not by its type.
+The framework does not hard-code message types.
+
+```python
+class MsgBase:
+    data_op: bool = False       # default: not logged
+
+class DmaReadCmd(MsgBase):
+    data_op = True              # memory movement → logged
+
+class GemmCmd(MsgBase):
+    data_op = True              # compute → logged
+
+class MathCmd(MsgBase):
+    data_op = True              # compute → logged
+```
+
+When a new message type (e.g., IpcqMsg) is added, setting `data_op = True` is enough
+for it to be logged automatically, with no framework code changes.
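The interplay between the D2 hooks and the `data_op` flag can be sketched without SimPy. Everything below is illustrative: `OpLogger`, `Component.handle`, and `CreditMsg` are hypothetical names, and service time is passed in directly instead of being simulated.

```python
class OpLogger:
    """Hypothetical logger that collects (event, time, node, message type) tuples."""

    def __init__(self):
        self.records = []

    def record_start(self, now, node_id, msg):
        self.records.append(("start", now, node_id, type(msg).__name__))

    def record_end(self, now, node_id, msg):
        self.records.append(("end", now, node_id, type(msg).__name__))


class MsgBase:
    data_op = False      # default: not logged


class GemmCmd(MsgBase):
    data_op = True       # compute: logged


class CreditMsg(MsgBase):
    pass                 # hypothetical control message: inherits data_op = False


class Component:
    def __init__(self, node_id, op_logger=None):
        self.node_id = node_id
        self._op_logger = op_logger

    def handle(self, now, msg, service_ns):
        """Log around the service window only when the message opts in."""
        logged = self._op_logger is not None and getattr(msg, "data_op", False)
        if logged:
            self._op_logger.record_start(now, self.node_id, msg)
        done = now + service_ns
        if logged:
            self._op_logger.record_end(done, self.node_id, msg)
        return done


logger = OpLogger()
gemm = Component("sip0.cube0.pe0.pe_gemm", logger)
t = gemm.handle(0.0, GemmCmd(), service_ns=5.0)
gemm.handle(t, CreditMsg(), service_ns=1.0)   # not logged
print(len(logger.records))  # 2
```

Only the `GemmCmd` produces records; the control message passes through the same code path without touching the logger.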
+
+### D5. Op log structure
+
+#### Op classification
+
+Ops are classified at two levels:
+
+| Field | Values | Role |
+|-------|--------|------|
+| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch key |
+| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum`, etc. | identifies the concrete operation |
+
+#### OpRecord definition
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class OpRecord:
+    t_start: float              # SimPy time (ns): service start
+    t_end: float                # SimPy time (ns): service completion
+    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
+    op_kind: str                # "memory" | "gemm" | "math"
+    op_name: str                # concrete operation name
+    params: dict                # per-op parameters (see below)
+    dependency_ids: list[int]   # currently in-memory record indices; may be replaced by stable op_ids later
+```
+
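+#### dependency_ids generation rules
+
+`dependency_ids` is **optional**; by default the executor performs
+address-based dependency inference (see D6).
+
+Set it explicitly only when an exact execution order is required:
+- **Default (address-based inference)**: the executor analyzes read/write sets and
+  infers RAW/WAW/WAR dependencies automatically. This is sufficient in most cases.
+- **Explicit setting**: set by TLContext or at command creation when a logical dependency
+  cannot be expressed through addresses.
+  Example: completion-handle-based synchronization. A handle dependency relies on
+  logical completion order rather than memory addresses, so address inference does not catch it.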
+
+#### op_log ordering
+
+The op_log maintains a **stable ordering** by `t_start`.
+Records with the same `t_start` preserve insertion order.
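The ordering rule maps directly onto Python's stable sort. The sketch below uses plain `(t_start, name)` tuples as a hypothetical stand-in for `OpRecord` and groups them the way a consumer of the op_log could.

```python
from itertools import groupby
from operator import itemgetter

# (t_start, name) pairs in insertion order; names are illustrative only.
records = [
    (10.0, "gemm@pe0"),
    (0.0, "dma_read@pe0"),
    (10.0, "gemm@pe1"),
    (0.0, "dma_read@pe1"),
]

# sorted() is stable: records with equal t_start keep their insertion order.
ordered = sorted(records, key=itemgetter(0))

# groupby() requires input that is already sorted by the grouping key.
batches = [
    (t_start, [name for _, name in group])
    for t_start, group in groupby(ordered, key=itemgetter(0))
]
print(batches)
# [(0.0, ['dma_read@pe0', 'dma_read@pe1']), (10.0, ['gemm@pe0', 'gemm@pe1'])]
```

Grouping on exact float equality works here because the values come from the same simulated clock; a real implementation may prefer integer timestamps.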
+
+#### params 상세
+
+**memory (dma_read / dma_write)**:
+```python
+{
+    "src_addr": int,            # source 주소 (byte)
+    "dst_addr": int,            # destination 주소 (byte)
+    "nbytes": int,              # 전송 크기
+    "src_space": str,           # "hbm" | "tcm" | "sram"
+    "dst_space": str,           # "hbm" | "tcm" | "sram"
+}
+```
+
+**gemm**:
+```python
+{
+    "src_a_addr": int,          # operand A address
+    "src_b_addr": int,          # operand B address
+    "dst_addr": int,            # output address
+    "shape_a": tuple,           # e.g. (128, 256)
+    "shape_b": tuple,           # e.g. (256, 128)
+    "shape_out": tuple,         # e.g. (128, 128)
+    "dtype_in": str,            # e.g. "f16"
+    "dtype_acc": str,           # accumulation dtype, e.g. "f32"
+    "dtype_out": str,           # output dtype, e.g. "f16"
+    "transpose_a": bool,
+    "transpose_b": bool,
+    "layout_a": str,            # "row_major" | "col_major"
+    "layout_b": str,
+    "layout_out": str,
+    "addr_space": str,          # "tcm" (GEMM operands are always in TCM)
+}
+```
+
+**math**:
+```python
+{
+    "op": str,                  # "exp" | "add" | "sum" | "where" | ...
+    "input_addrs": list[int],   # list of operand addresses
+    "input_shapes": list[tuple],
+    "dst_addr": int,
+    "shape_out": tuple,
+    "dtype": str,
+    "axis": int | None,         # reduction axis
+    "addr_space": str,          # "tcm"
+}
+```
+
+### D6. Phase 2 Executor
+
+Phase 2 executes the op_log outside SimPy.
+
+```python
+from itertools import groupby
+
+class DataExecutor:
+    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
+        self.op_log = op_log        # t_start-ordered records from Phase 1
+        self.store = initial_store  # takes the Phase 1 MemoryStore snapshot as input
+
+    def run(self):
+        # groupby relies on op_log being ordered by t_start (see D5)
+        for t, ops in groupby(self.op_log, key=lambda o: o.t_start):
+            batch = list(ops)
+            independent, sequential = self._classify(batch)
+            self._execute_parallel(independent)
+            self._execute_sequential(sequential)
+```
+
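+**Parallel-execution decision**:
+
+Ops with the same `t_start` are treated as **parallel candidates**.
+The executor decides whether they actually run in parallel based on:
+- whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
+- whether the predecessor ops listed in `dependency_ids` have completed
+
+Only ops whose address ranges do not overlap and that have no explicit dependency run in parallel.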
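
The address-based part of this decision can be sketched as follows. This is an illustration under stated assumptions, not the executor's actual `_classify`: each op is a hypothetical dict carrying half-open byte ranges under `reads` and `writes`, and explicit `dependency_ids` are left out of the sketch.

```python
def overlaps(a, b):
    # Half-open byte ranges (start, end)
    return a[0] < b[1] and b[0] < a[1]

def conflicts(op_a, op_b):
    """True when the two ops have a RAW, WAR, or WAW hazard."""
    for w in op_a["writes"]:
        if any(overlaps(w, r) for r in op_b["reads"] + op_b["writes"]):
            return True
    for w in op_b["writes"]:
        if any(overlaps(w, r) for r in op_a["reads"]):
            return True
    return False

def classify(batch):
    independent, sequential = [], []
    for i, op in enumerate(batch):
        clash = any(conflicts(op, other)
                    for j, other in enumerate(batch) if j != i)
        (sequential if clash else independent).append(op)
    return independent, sequential

g0 = {"name": "g0", "reads": [(0, 64)], "writes": [(256, 320)]}
g1 = {"name": "g1", "reads": [(64, 128)], "writes": [(320, 384)]}
m0 = {"name": "m0", "reads": [(256, 320)], "writes": [(512, 576)]}  # reads g0's output
independent, sequential = classify([g0, g1, m0])
```

`m0` reads the range that `g0` writes (a RAW hazard), so both fall into the sequential group while `g1` remains a parallel candidate.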
+
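+**Batch optimization**: only independent ops with the same op_name and
+**identical shape, dtype, layout, and transpose flags** are eligible for batching.
+Example: same-shape GEMMs from multiple PEs are fused into a single `np.matmul(a_batch, b_batch)` call.
+This improves BLAS efficiency on CPU and reduces launch overhead on GPU.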
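
A small sketch of the batching idea, assuming four hypothetical PEs whose GEMMs share shape and dtype (float32 is used here only to keep the comparison simple). `np.matmul` treats the leading axis of stacked operands as a batch dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
# Four hypothetical PEs, each with a GEMM of identical shape and dtype
a_list = [rng.standard_normal((8, 16)).astype(np.float32) for _ in range(4)]
b_list = [rng.standard_normal((16, 4)).astype(np.float32) for _ in range(4)]

a_batch = np.stack(a_list)                 # shape (4, 8, 16)
b_batch = np.stack(b_list)                 # shape (4, 16, 4)
out_batch = np.matmul(a_batch, b_batch)    # shape (4, 8, 4), one call for all PEs

# Reference: one matmul per op
per_op = [a @ b for a, b in zip(a_list, b_list)]
```

Each slice `out_batch[i]` corresponds to the result of the i-th op, so the executor could scatter the slices back to the individual `dst_addr` values.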
+
+**Phase 2 execution-order guarantee**:
+
+Phase 2 does not consider data arrival times. Execution order is
+guaranteed only through dependencies (address-based inference plus
+explicit dependency_ids).
+
+### D7. Memory Store
+
+`MemoryStore` logically follows byte-addressable semantics; the current
+implementation uses **tensor-granular storage** (a mapping from addr to a numpy ndarray).
+
+```python
+class MemoryStore:
+    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
+    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
+```
+
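+**Internal storage format: numpy ndarray**
+
+MemoryStore stores tensors as **numpy ndarrays**.
+
+| Candidate | store/load speed | Phase 2 compute | Verdict |
+|-----------|------------------|-----------------|---------|
+| **numpy ndarray** | Immediate (passed by reference, no copy) | `np.matmul` usable directly | **Adopted** |
+| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
+| torch tensor | Immediate | torch ops available | Used only for GPU optimization |
+
+- write: stores the numpy array **by reference** (no copy), so Phase 1 overhead is one dict lookup
+- read: returns the numpy array **by reference** (no copy)
+- A re-write to the same addr **overwrites the existing array as a whole tensor** (partial overwrite is not supported)
+- dtypes are numpy-native (`np.float16`, `np.float32`, etc.); bfloat16 is not a built-in NumPy dtype and needs an extension type such as the one provided by `ml_dtypes`
+- When byte-level access is needed, convert with `.view(np.uint8)`
+- For GPU batch optimization in Phase 2, the executor handles the numpy to torch tensor conversion
+
+**read/write contract**:
+
+- read/write operate on **contiguous tensors**.
+  If a non-contiguous strided view is needed, express it as a separate copy op.
+- On the normal benchmark path, producer and consumer dtypes are expected to match.
+  Reinterpret cast is a permissive behavior intended for low-level memory validation
+  or special test cases.
+- addr is byte-aligned, with a minimum alignment equal to the dtype size.
+- A dtype mismatch (reading with a dtype different from the one written) is handled as a reinterpret cast.
+  On a shape mismatch, the access is validated by nbytes and raises an error if the sizes differ.
+- Correctness is defined by address-range-based read/write semantics.
+- An implementation may keep a tensor object cache as an optimization,
+  but the canonical state is the byte-addressable storage.
+- At deploy time, the host injects the initial tensor data.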
+
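+### D8. Benchmark Kernel Code
+
+The benchmark's **user-code API does not change**.
+Call interfaces such as `tl.load()`, `tl.composite()`, and `tl.store()` are kept.
+
+However, the internal command/message schema may be extended to carry the
+metadata that Phase 2 execution needs (e.g. additional fields such as dtype_acc and transpose).
+
+### D9. No Component Changes
+
+Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
+Recording the op_log is the responsibility of the ComponentBase hook.
+When a custom component is swapped in, only the timing model is replaced;
+Phase 2 data execution is unaffected.
+
+### D10. Phase 2 is Optional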
+
+```python
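+engine = GraphEngine(graph)
+engine.run(benchmark)                       # Phase 1: timing only
+result = engine.get_timing_result()
+
+if verify_data:
+    # Phase 2: data. The D6 DataExecutor signature also takes the Phase 1
+    # MemoryStore snapshot (initial_store); it is omitted in this short example.
+    executor = DataExecutor(engine.op_log)
+    executor.run()
+    executor.verify(expected_output)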
+```
+
+If only timing analysis is needed, skip Phase 2.
+With op_logger disabled, Phase 1 performance is also the same as before.
+
+### D11. Verification Contract
+
+By default, verification compares the **final output tensor** against a reference backend (numpy).
+
+Tolerance policy per dtype:
+
+| dtype | Comparison | tolerance |
+|-------|------------|-----------|
+| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
+| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
+| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
+| Integer types | `np.array_equal` | exact |
+
+- Default mode: compare only the final output (end-to-end correctness)
+- Debug mode: intermediate tensors can also be compared per op
+  (MemoryStore snapshot at each op boundary)
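
The tolerance table can be turned into a comparison helper along these lines. `outputs_match` and the `TOLERANCE` dict are hypothetical names introduced for illustration; the dtype strings and tolerance values are the ones listed in the table, and any dtype not in the dict is treated as an integer type.

```python
import numpy as np

# (rtol, atol) per dtype string, copied from the table above
TOLERANCE = {"f32": (1e-5, 1e-5), "f16": (1e-3, 1e-3), "bf16": (1e-2, 1e-2)}

def outputs_match(actual, expected, dtype):
    if dtype in TOLERANCE:
        rtol, atol = TOLERANCE[dtype]
        return bool(np.allclose(actual, expected, rtol=rtol, atol=atol))
    # Integer types: exact comparison
    return bool(np.array_equal(actual, expected))

ref = np.array([1.0, 2.0, 3.0], dtype=np.float32)
close_f16 = outputs_match(ref + 5e-4, ref, "f16")   # within the f16 tolerance
close_f32 = outputs_match(ref + 5e-4, ref, "f32")   # outside the f32 tolerance
exact_int = outputs_match(np.array([1, 2]), np.array([1, 2]), "i32")
```

The same deviation passes under the f16 policy and fails under the f32 policy, which is the intended effect of keying the tolerance on the output dtype.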
+
+---
+
+## Non-goals
+
+- **Compute-result-based control flow**: not supported.
+  Every compute handle is pending during Phase 1;
+  `wait()` expresses timing synchronization only and does not imply data readiness.
+  In Phase 1, accessing `handle.data`, element access, and truth-value evaluation are
+  **treated as errors**.
+  Branching on memory data (the result of `tl.load()`) is supported via greenlet.
+  Phase 1 materialization is a future extension (see D3).
+- **Cycle-accurate overlap reconstruction**: Phase 2 does not exactly reproduce the
+  execution-time overlap of Phase 1. Phase 2 verifies data correctness only.
+- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls and
+  do not reproduce the microarchitecture of the real hardware PE.
+
+## Open Questions
+
+- **Aliasing / slice view**: how MemoryStore should represent slices/views that
+  reference the same backing storage (stride-based view vs copy semantics)
+- **Generalizing IPCQ/descriptor reads**: whether to fully generalize PE-to-PE
+  communication as memory ops or to introduce a separate op_kind
+- **Op log streaming**: managing op_log memory usage in large simulations
+  (in-memory list vs disk-backed streaming)
+- **Fused operation**: whether to record tl.composite's tiled pipeline (READ→COMPUTE→WRITE)
+  as a single fused op record or as separate ops
+- **Generalizing the math op schema**: the current math params are simple;
+  broadcasting rules, per-input dtype, keepdims, scalar/immediate operands,
+  and where/mask representation may need to be generalized
+- **Op record identifier**: dependency_ids are currently in-memory list indices and
+  must be replaced by a stable op_id once a streaming/disk-backed mode is introduced
+- **Phase 1 materialization policy**: see Future Extension in D3.
+  If allowed, the Phase 2 handling of such ops (skip / verify / recompute) must be defined
+
+---
+
+## Consequences
+
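+### Positive
+
+- Minimal impact on SimPy simulation performance (only an op_log append is added)
+- Phase 2 is free to use multithreading or GPUs
+- Component-replacement freedom is preserved (keeps the ADR-0015 design philosophy)
+- No change needed to the benchmark user-code API
+- Adding a new message type only requires setting the data_op flag
+- greenlet removes Phase 0 and enables dynamic control flow based on memory data
+- `tl.load()` returns real data, which makes kernel debugging easier
+
+### Negative
+
+- op_log memory usage (in large simulations)
+- Phase 2 execution time scales with tensor size (large GEMMs)
+- No dynamic branching on pending handles (operations not yet complete):
+  operations execute in Phase 2, so their result values are undetermined in Phase 1.
+  Branching on memory data is supported via greenlet.
+- Adds a dependency on the greenlet C extension (pip install greenlet)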
@@ -0,0 +1,90 @@
+# ADR-0022: 2D Grid program_id Semantics
+
+## Status
+
+Accepted
+
+## Context
+
+Triton kernels use `tl.program_id(axis)` to identify their position in a launch grid.
+Our hardware has a 2-level hierarchy: **cubes** contain **PEs**.
+The previous implementation ignored the `axis` parameter and always returned a flat PE index,
+making it impossible for kernels to distinguish their cube-local position from their cube identity.
+
+## Decision
+
+Map `tl.program_id` and `tl.num_programs` to the 2D hardware grid:
+
+| Call | Returns | Description |
+|------|---------|-------------|
+| `tl.program_id(axis=0)` | `local_pe_id` | PE index within cube |
+| `tl.program_id(axis=1)` | `cube_id` | Cube index |
+| `tl.num_programs(axis=0)` | `num_pes_per_cube` | PEs per cube |
+| `tl.num_programs(axis=1)` | `num_cubes` | Total cubes |
+
+Global PID is derived as:
+
+```python
+global_pid = tl.program_id(axis=1) * tl.num_programs(axis=0) + tl.program_id(axis=0)
+```
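
As a quick check of the formula above, the following sketch enumerates a hypothetical grid of 2 cubes with 4 PEs each (the sizes are assumptions for illustration) and shows that the derived global PID is contiguous and can be inverted with `divmod`.

```python
def global_pid(local_pe_id, cube_id, num_pes_per_cube):
    # Same arithmetic as the formula above, with plain integers
    return cube_id * num_pes_per_cube + local_pe_id

num_pes_per_cube, num_cubes = 4, 2      # hypothetical grid
pids = [global_pid(pe, cube, num_pes_per_cube)
        for cube in range(num_cubes)
        for pe in range(num_pes_per_cube)]

# Inverse mapping: recover (cube_id, local_pe_id) from a global PID
cube_of_6, pe_of_6 = divmod(6, num_pes_per_cube)
```

Because axis 0 (PE) is the innermost dimension, consecutive global PIDs stay within one cube until the PE index wraps.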
+
+### Axis mapping rationale
+
+- **axis=0 = PE (innermost)**: PEs within a cube share HBM and communicate via local NOC mesh. This is the fast, tightly-coupled dimension — analogous to threads within a block.
+- **axis=1 = Cube (outer)**: Cross-cube communication goes through UCIe with higher latency. This is the coarser scheduling dimension — analogous to blocks in a grid.
+
+## Implementation
+
+### TLContext (`triton_emu/tl_context.py`)
+
+Added `cube_id` and `num_cubes` constructor parameters. `program_id()` and `num_programs()` dispatch on `axis`:
+
+```python
+def program_id(self, axis: int = 0) -> int:
+    if axis == 1:
+        return self._cube_id
+    return self._pe_id
+
+def num_programs(self, axis: int = 0) -> int:
+    if axis == 1:
+        return self._num_cubes
+    return self._num_programs
+```
+
+### PE_CPU (`components/builtin/pe_cpu.py`)
+
+- Extracts `num_cubes` from `ctx.spec["system"]["sips"]["cubes_per_sip"]`
+- Passes `cube_id` (already available as `self._cube_idx`) and `num_cubes` to TLContext
+
+### KernelRunner (`triton_emu/kernel_runner.py`)
+
+- Receives `num_cubes` from PE_CPU
+- Passes `cube_id` and `num_cubes` to TLContext in greenlet mode
+
+## Backward Compatibility
+
+- Existing code using `tl.program_id(0)` or `tl.program_id()` is unchanged — returns the same PE index as before.
+- `cube_id` and `num_cubes` default to `0` and `1`, so callers that don't provide them (e.g. unit tests) continue to work.
+
+## Usage Example
+
+```python
+def sharded_gemm_kernel(a_ptr, b_ptr, out_ptr, M, K, N, tl):
+    local_pid = tl.program_id(axis=0)      # PE within cube
+    cube_id   = tl.program_id(axis=1)      # which cube
+    global_pid = cube_id * tl.num_programs(axis=0) + local_pid
+
+    # Column-wise sharding across global PID
+    n_per_pid = N // (tl.num_programs(axis=1) * tl.num_programs(axis=0))
+    col_start = global_pid * n_per_pid
+
+    a = tl.load(a_ptr, shape=(M, K), dtype="f16")
+    b = tl.ref(b_ptr + col_start * K * 2, shape=(K, n_per_pid), dtype="f16")
+    h = tl.composite(op="gemm", a=a, b=b, out_ptr=out_ptr + col_start * M * 2)
+    tl.wait(h)
+```
+
+## Consequences
+
+- Benchmarks can now express cube-aware sharding and addressing without hardcoding topology dimensions.
+- Future axis=2 (SIP-level) can be added following the same pattern if needed.
@@ -0,0 +1,206 @@
+# ADR-0024: SIP-level Launcher — rank = SIP
+
+## Status
+
+Accepted
+
+## Context
+
+### Goal
+
+Align the participation unit (rank) of `torch.distributed` collective calls with the
+**SIP** (device) boundary. The goal is bench code that reads
+**indistinguishably at the host level** from a real PyTorch DDP/TP script.
+
+Comparison with real PyTorch:
+
+| Dimension | real PyTorch | KernBench |
+| --- | --- | --- |
+| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
+| `get_rank()` | `RANK` env var | greenlet-local registry |
+| `get_world_size()` | `WORLD_SIZE` env var | number of SIPs in the topology |
+| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
+| `mp.spawn` | launches OS processes | greenlet fan-out |
+
+### Problems to solve
+
+1. **rank = SIP in the public API**, so that bench workers need not know the PE concept.
+2. **Greenlet-local rank/device tracking**: within the 1-process model, each
+   worker greenlet must identify its own rank and its own SIP correctly.
+3. **Tensor placement = structural (sip, cube, pe)**: if rank is a SIP, the
+   default tensor placement must also be expressed in structural coordinates.
+
+### Non-problems (outside this ADR)
+
+- IPCQ direction addressing → ADR-0025
+- Removing `DPPolicy.sip`/`num_sips` → ADR-0026
+- Megatron-style TP → ADR-0027
+- DTensor → ADR-0028 (future)
+- Worker scheduling / `mp.spawn` / collective drain / exception cleanup
+  → ADR-0027 D0/D1
+- Collective algorithm implementation (intercube_allreduce, SFR config) → ADR-0032
+
+## Decision
+
+### D1. rank = SIP (world_size resolution)
+
+```python
+def _resolve_world_size(self) -> int:
+    if "world_size" in self._merged:
+        return int(self._merged["world_size"])
+    defaults = self._cfg_all.get("defaults", {})
+    if "world_size" in defaults:
+        return int(defaults["world_size"])
+    spec = self.ctx.spec or {}
+    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
+```
+
+Priority: algorithm override > defaults override > SIP count. The `ccl.yaml`
+override is kept as the legacy "rank = PE" test path.
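
The priority order can be exercised with a standalone sketch. `resolve_world_size` below is a free-function rewrite of the method above for illustration only; the three arguments stand in for `self._merged`, the `defaults` section, and `self.ctx.spec`, and the example values are assumptions.

```python
def resolve_world_size(merged, defaults, spec):
    # 1) per-algorithm override, 2) defaults override, 3) SIP count from the spec
    if "world_size" in merged:
        return int(merged["world_size"])
    if "world_size" in defaults:
        return int(defaults["world_size"])
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))

spec = {"system": {"sips": {"count": 4}}}   # hypothetical 4-SIP topology

from_spec = resolve_world_size({}, {}, spec)
from_defaults = resolve_world_size({}, {"world_size": 2}, spec)
from_algorithm = resolve_world_size({"world_size": 8}, {"world_size": 2}, spec)
no_spec = resolve_world_size({}, {}, {})
```

With no override present, the world size equals the SIP count, which is the rank = SIP default this ADR establishes.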
+
+### D2. Greenlet-local rank registry (+ debug warning)
+
+```python
+import os
+import warnings
+
+class DistributedContext:
+    def __init__(self):
+        self._backend = None
+        self._rank_by_greenlet: dict = {}
+
+    def _bind_rank(self, g, rank: int) -> None:
+        self._rank_by_greenlet[g] = int(rank)
+
+    def get_rank(self) -> int:
+        self._ensure_initialized()
+        from greenlet import getcurrent
+        g = getcurrent()
+        if g not in self._rank_by_greenlet:
+            if os.environ.get("KERNBENCH_DEBUG"):
+                warnings.warn(
+                    "get_rank() called outside a bound greenlet — returning 0. "
+                    "Likely a bug unless running single-driver."
+                )
+            return 0
+        return int(self._rank_by_greenlet[g])
+```
+
+### D3. `torch.ahbm.set_device(rank)`: SIP binding
+
+The KernBench backend is named `ahbm` (ADR-0023). Real PyTorch uses
+`torch.cuda.set_device(r)`, but since we are not CUDA we use an honestly named
+namespace.
+
+```python
+class _AhbmNamespace:
+    """torch.ahbm — per-greenlet SIP device binding.
+
+    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
+    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
+    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
+    """
+
+    def __init__(self):
+        self._device_by_greenlet: dict = {}
+
+    def set_device(self, device: int) -> None:
+        from greenlet import getcurrent
+        self._device_by_greenlet[getcurrent()] = int(device)
+
+    def current_device(self) -> int | None:
+        from greenlet import getcurrent
+        return self._device_by_greenlet.get(getcurrent())
+
+# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
+# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
+```
+
+**Parallel support for the PyTorch 2.x style**: recent PyTorch moves toward the device-agnostic
+`torch.accelerator` namespace (`torch.accelerator.set_device_index(r)`,
+`torch.accelerator.current_device_index()`). KernBench also supports this surface
+for users who want to write code that is not tied to a device vendor.
+
+```python
+class _AcceleratorNamespace:
+    """torch.accelerator — device-agnostic API (PyTorch 2.x style).
+
+    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
+        torch.accelerator.set_device_index(rank)
+        torch.accelerator.current_device_index()
+    """
+
+    def __init__(self, ahbm: _AhbmNamespace):
+        self._ahbm = ahbm
+
+    def set_device_index(self, device: int) -> None:
+        self._ahbm.set_device(device)
+
+    def current_device_index(self) -> int | None:
+        return self._ahbm.current_device()
+
+# RuntimeContext
+self.ahbm = _AhbmNamespace()
+self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias
+```
+
+Bench authors pick one of the following; both use the same registry internally:
+
+```python
+torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
+torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
+```
+
+### D4. Tensor placement = structural (sip, cube, pe) coordinates
+
+`resolve_dp_policy` takes `target_sip` directly and generates the placement in structural coordinates.
+See ADR-0026 for details.
+
+```python
+# RuntimeContext._create_tensor
+current_sip = self.ahbm.current_device()          # (D3 naming)
+if current_sip is None:
+    current_sip = 0  # single-driver fallback (consistent with D2)
+placement = resolve_dp_policy(
+    dp, shape=shape_2d, itemsize=itemsize,
+    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
+    target_sip=current_sip,
+)
+```
+
+There is no post-hoc `pe_index` shifting: ShardSpec holds the `(sip, cube, pe)` structural
+coordinates directly. See ADR-0026 for ShardSpec details.
+
+---
+
+## Dependencies
+
+- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
+- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature used in D4 and
+  ShardSpec's structural-coordinate representation.
+- **ADR-0027** (Megatron TP + scheduler): the implementation reference for worker scheduling, `mp.spawn`,
+  collective drain, and exception cleanup.
+
+---
+
+## Non-goals
+
+- **IPCQ protocol changes**: ADR-0023 stays as is.
+- **DPPolicy field cleanup**: ADR-0026.
+- **Megatron-style TP**: ADR-0027.
+- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
+- **Collective algorithm implementation**: ADR-0032.
+- **Multi-node (cross-process)**: single process only.
+
+---
+
+## Consequences
+
+### Positive
+
+- **Bench = real PyTorch DDP** (from the public-API perspective).
+- **Greenlet-local rank**: enables cross-rank correctness in the 1-process model.
+- **Structural placement coordinates**: ADR-0026, ADR-0027, and ADR-0032 all
+  operate consistently on the `(sip, cube, pe)` 3-tuple.
+
+### Neutral
+
+- The IPCQ PE-level protocol (ADR-0023) is unchanged.
+- The IO_CPU role is unchanged (existing transit behavior as is).
@@ -0,0 +1,283 @@
+# ADR-0025: IPCQ Direction Addressing — address-based matching
+
+## Status
+
+Accepted (Revision 2 — Address-based matching; peer_direction field dropped)
+
+## Context
+
+### Goal
+
+In the IPCQ protocol of ADR-0023, make **identifying "which direction pair a transfer goes through"**
+consistently **address-based**, without depending on topology or dict order.
+It must work correctly on a 2-rank bidirectional ring (and, more generally, on any
+topology where several directions point to the same peer).
+
+### Observed bug: 2-rank bidirectional ring
+
+`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions point to the same peer.
+
+**Bug 1 (install)**:
+- `reverse_direction(0, 1)` returns "E" because of dict order (wrong; "W" is correct under the
+  opposite-direction convention)
+- rank 0's E entry is set to `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`
+- tl.send(E) → data lands in sip1's E-rx buffer (should be W-rx)
+
+**Bug 2 (runtime)**:
+- Even if install sets the correct address, the receiver's `_handle_meta_arrival`
+  matches the direction by sender coordinates alone → the first direction (E) wins
+- peer_head_cache[E] is incremented while peer_head_cache[W] stays unchanged
+- The kernel's tl.recv(W) waits on peer_head_cache[W] → blocks forever → IpcqDeadlock
+
+### Root cause
+
+The same problem appears on two axes:
+1. **Install-time pairing**: deciding "which of the peer's directions pairs with my direction"
+   depends on dict iteration order → fragile when several directions point to the
+   same peer
+2. **Runtime identification**: deciding "which qp must be updated" uses sender
+   coordinates alone → ambiguous when directions are duplicated
+
+### Resolution: address-based matching
+
+Each PE's rx buffer sits in an **address range that is unique per direction** (rx_base_pa +
+direction_idx × bytes_per_direction). Therefore:
+
+- **Runtime**: match on the **dst_addr range** instead of the sender coord → unambiguous
+- **Install**: a heuristic that prefers the opposite direction (the natural symmetry of
+  ring / mesh)
+- No need for duplicate metadata such as `peer_direction`: **the address is the single source of
+  truth**
+
+This design works **independently of the PhysAddr transition (ADR-0030)**. Whether addresses are the
+current synthetic ones or PhysAddr, it applies equally as long as per-direction ranges remain unique.
+
+---
+
+## Decision
+
+### D1. Install — `reverse_direction` opposite-preference
+
+`src/kernbench/ccl/install.py`:
+
+```python
+# Extended in ADR-0032 with global_* pairs for inter-SIP directions,
+# which were introduced by configure_sfr_intercube_multisip to keep
+# intercube (N/S/E/W) and inter-SIP (global_N/S/E/W) namespaces disjoint.
+_OPPOSITE_DIR = {
+    "E": "W", "W": "E", "N": "S", "S": "N",
+    "global_E": "global_W", "global_W": "global_E",
+    "global_N": "global_S", "global_S": "global_N",
+}
+
+def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
+    """Find peer's direction that reciprocates my_dir→peer_rank.
+
+    Prefer the OPPOSITE direction (E↔W, N↔S) when the peer has it
+    pointing back to us. This matters in 2-rank bidirectional rings
+    where both E and W on one side point to the same peer — without
+    the preference, the first-match-wins iteration would route data
+    into the wrong rx slot. Falls back to any direction pointing back
+    for topologies without an opposite convention (tree_binary's
+    parent/child).
+    """
+    nt = neighbor_table[peer_rank]
+    opp = _OPPOSITE_DIR.get(my_dir)
+    if opp is not None and nt.get(opp) == my_rank:
+        return opp
+    for d, target in nt.items():
+        if target == my_rank:
+            return d
+    return None
+```
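
To see the opposite-preference rule at work on the 2-rank ring, here is a self-contained variant of the function. It differs from the excerpt above in one assumed detail: `neighbor_table` is passed as an explicit argument instead of being captured from the enclosing scope, and only the intercube directions are listed.

```python
OPPOSITE = {"E": "W", "W": "E", "N": "S", "S": "N"}

def reverse_direction(neighbor_table, my_rank, peer_rank, my_dir):
    nt = neighbor_table[peer_rank]
    opp = OPPOSITE.get(my_dir)
    # Prefer the opposite direction when the peer has it pointing back to us
    if opp is not None and nt.get(opp) == my_rank:
        return opp
    # Fallback: any direction on the peer that points back to us
    for d, target in nt.items():
        if target == my_rank:
            return d
    return None

# 2-rank bidirectional ring: both directions of each rank point at the other
ring2 = {0: {"E": 1, "W": 1}, 1: {"E": 0, "W": 0}}
pairs = {d: reverse_direction(ring2, 0, 1, d) for d in ring2[0]}

# Table without an opposite convention (parent/child) uses the fallback
tree = {0: {"child": 1}, 1: {"parent": 0}}
fallback = reverse_direction(tree, 0, 1, "child")
```

Rank 0's E pairs with rank 1's W and vice versa, so data sent on E lands in the peer's W-rx slot rather than the first matching entry.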
+
+Call site:
+
+```python
+for d, peer_rank in nbrs.items():
+    peer_dir = reverse_direction(r, peer_rank, d)  # pass my_dir
+    if peer_dir is None:
+        continue
+    ...
+```
+
+### D2. Runtime: `_handle_meta_arrival` dst_addr matching
+
+`src/kernbench/components/builtin/pe_ipcq.py`:
+
+```python
+def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
+    """Match incoming token to the receiver-side direction by dst_addr range.
+
+    Each direction has a unique rx buffer address range
+    (my_rx_base_pa + n_slots * slot_size). The token's dst_addr (set by
+    the sender's IPCQ when computing peer's slot address) falls within
+    exactly one such range. This address-based matching is unambiguous
+    even when multiple directions have the same peer (2-rank ring).
+    """
+    token = msg.token
+    dst_addr = token.dst_addr
+    for d, qp in self._queue_pairs.items():
+        base = qp["my_rx_base_pa"]
+        size = qp["n_slots"] * qp["slot_size"]
+        if base <= dst_addr < base + size:
+            qp["peer_head_cache"] = max(qp["peer_head_cache"],
+                                         token.sender_seq + 1)
+            self._arrived_tokens.setdefault(d, []).append(token)
+            waiters = self._recv_waiters.get(d, [])
+            self._recv_waiters[d] = []
+            for ev in waiters:
+                if not ev.triggered:
+                    ev.succeed()
+            any_waiters = self._any_recv_waiters
+            self._any_recv_waiters = []
+            for ev in any_waiters:
+                if not ev.triggered:
+                    ev.succeed()
+            return
+    # Unknown dst_addr — diagnostic log (should not happen under correct install)
+```
+
+The sender-coordinate check is **removed**: `dst_addr` already determines the direction.
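
The range lookup at the heart of D2 can be isolated into a few lines. `match_direction` is a hypothetical helper, and the base addresses and slot sizes are made-up values chosen so that the E and W buffers are adjacent; the qp dict keys follow the excerpt above.

```python
def match_direction(queue_pairs, dst_addr):
    # Return the direction whose rx buffer range contains dst_addr
    for d, qp in queue_pairs.items():
        base = qp["my_rx_base_pa"]
        size = qp["n_slots"] * qp["slot_size"]
        if base <= dst_addr < base + size:
            return d
    return None   # unknown address: install and runtime disagree

qps = {
    "E": {"my_rx_base_pa": 0x1000, "n_slots": 4, "slot_size": 256},
    "W": {"my_rx_base_pa": 0x1400, "n_slots": 4, "slot_size": 256},
}
hit_w = match_direction(qps, 0x1400 + 256)   # second slot of the W buffer
hit_e = match_direction(qps, 0x13FF)         # last byte of the E buffer
miss = match_direction(qps, 0x1800)          # one past the W buffer
```

Even when E and W both point at the same peer, the two buffers occupy different ranges, so the lookup never depends on dict order.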
+
+### D3. Credit: add the `dst_rx_base_pa` field
+
+`src/kernbench/common/ipcq_types.py`:
+
+```python
+@dataclass(frozen=True)
+class IpcqCreditMetadata:
+    consumer_seq: int
+    dst_rx_base_pa: int       # NEW: matched against the original sender's peer.rx_base_pa
+    # existing fields (kept for diagnostics / logging)
+    src_sip: int
+    src_cube: int
+    src_pe: int
+    src_direction: str
+```
+
+When a credit is created (`_delayed_credit_send`), the direction's own `my_rx_base_pa` is
+carried as `dst_rx_base_pa` (this is the `peer.rx_base_pa` that the other side used when it sent).
+
+Receiver side (`_credit_worker`):
+
+```python
+def _credit_worker(self, env):
+    while True:
+        credit = yield self._credit_inbox.get()
+        for d, qp in self._queue_pairs.items():
+            # find the qp whose peer rx_base_pa equals the credit's dst_rx_base_pa
+            if qp["peer"].rx_base_pa == credit.dst_rx_base_pa:
+                qp["peer_tail_cache"] = max(qp["peer_tail_cache"],
+                                              credit.consumer_seq)
+                waiters = self._send_waiters.get(d, [])
+                self._send_waiters[d] = []
+                for ev in waiters:
+                    if not ev.triggered:
+                        ev.succeed()
+                break
+```
+
+The sender-coordinate check is removed. Matching on `dst_rx_base_pa` is unambiguous.
+
+### D4. Do **not** add a `peer_direction` field to `IpcqInitEntry`
+
+The `IpcqInitEntry.peer_direction` proposed in ADR-0025 rev 1 is **unnecessary**.
+Reasons:
+- Meta arrival matches on dst_addr (D2)
+- Credit matches on dst_rx_base_pa (D3)
+- The qp does not need to store peer_direction
+- Install uses peer_dir only internally, when computing rx_base_pa (`reverse_direction`)
+
+No change to the IpcqInitEntry schema. This is a **simplification** relative to rev 1.
+
+### D5. Keep `IpcqDmaToken.src_direction` (diagnostic only)
+
+The existing `src_direction` field is not removed. It is kept for:
+- Logging / trace: the `(rank, t, dir, nbytes)` in `KERNBENCH_CCL_TRACE=1` output
+- Diagnostics: showing the direction in pointer_dump and similar tools
+- Room for future extension
+
+Runtime matching uses `dst_addr` only.
+
+### D6. Invariants (strengthening ADR-0023 I3)
+
+**I3 (strict)**: for each direction pair `(my_direction, peer_direction)`, my
+rx_base and the peer rx_base must point to **distinct direction slots**. Install
+must guarantee this (reverse_direction opposite-preference).
+
+**I3.1 (new)**: for every qp, `qp["my_rx_base_pa"]` and `qp["peer"].rx_base_pa`
+occupy mutually disjoint address ranges (buffers of different directions never
+overlap). This is the precondition for the address-based matching in D2/D3.
+
+This can be verified at install time:
+```python
+# ccl/install_plan.py: assertion at the end of build_install_plans
+all_rx_ranges = set()
+for plan in plans:
+    for pe_install in plan.pe_installs:
+        for entry in pe_install.neighbors:
+            r = (entry.my_rx_base_pa,
+                 entry.my_rx_base_pa + plan.n_slots * plan.slot_size)
+            overlap = any(_ranges_overlap(r, e) for e in all_rx_ranges)
+            assert not overlap
+            all_rx_ranges.add(r)
+```
+
+---
+
+## Dependencies
+
+- **ADR-0023** (IPCQ protocol): this ADR modifies ADR-0023's runtime matching logic
+  (D2, D3) and improves the install heuristic (D1). The semantic layer of the IPCQ
+  protocol is unchanged.
+- **ADR-0024** (launcher): the 2-rank bidirectional ring occurs in practice under
+  ADR-0024's ws=SIP_count model. This ADR makes that case work.
+- **ADR-0030** (PhysAddr transition, stub): **independent**. The address-based matching of
+  ADR-0025 works the same with the current synthetic addresses and with PhysAddr.
+
+---
+
+## Non-goals
+
+- **Moving the IPCQ address scheme to PhysAddr**: ADR-0030 scope. This ADR is independent of
+  how addresses are encoded.
+- **Multi-hop routing**: keeps the single-hop DMA write assumption of ADR-0023 D5.
+- **Unidirectional ring specialization**: `ring_1d_unidir` has only one direction, so this bug
+  does not apply.
+
+---
+
+## Open questions
+
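+- **Address-matching performance**: `_handle_meta_arrival` and `_credit_worker` scan the qps
+  linearly (at most 4 directions). The performance impact is negligible. If it becomes a problem,
+  switch to a dict lookup (`_qp_by_rx_base`).
+- **Re-evaluating the need for `IpcqDmaToken.src_direction`**: whether to keep a field that remains
+  only for diagnostics, or to move it out into logging. Kept for now.
+- **Cost of install-time invariant verification**: the I3.1 check in D6 is O((N_PE × N_direction)^2).
+  It may become slow on large topologies → could be improved with a data structure such as an interval tree.
+  Start with the simple implementation.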
+
+---
+
+## Consequences
+
+### Positive
+
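+- **Simplicity**: removes the duplicate `peer_direction` metadata. The address is the single source of truth.
+- **Unambiguous matching**: works on every topology (including duplicated directions).
+- **Minimal schema change**: `IpcqInitEntry` unchanged; one field added to `IpcqCreditMetadata`.
+- **Independent of the PhysAddr transition (ADR-0030)**: address-based matching does not depend on the address encoding.
+- **Diagnostics preserved**: `IpcqDmaToken.src_direction` remains for logging.
+
+### Negative
+
+- Because runtime matching is now an address comparison, debugging questions such as "why was
+  peer_head_cache[W] updated instead of peer_head_cache[E]?" require tracing address ranges (previously
+  the direction name was enough). Mitigation: include the "direction ↔ rx_base_pa" mapping in pointer_dump.
+
+### Neutral
+
+- The semantic layer of the IPCQ protocol (sender computes dst_addr, receiver receives) is
+  unchanged.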
@@ -0,0 +1,288 @@
+# ADR-0026: DPPolicy = Intra-Device Only - remove the sip/num_sips fields
+
+## Status
+
+Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)
+
+## Context
+
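+### Goal
+
+Clarify `DPPolicy` as a pure intra-device abstraction that expresses only the
+**cube × PE distribution inside one device (SIP)**. Distribution across SIPs (TP) is split
+into a separate layer (handled by `torch.ahbm.set_device(rank)` in ADR-0024 or by the Megatron
+parallel layers in ADR-0027).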
+
+## Decision
+
+### D1. Remove the `sip` and `num_sips` fields from `DPPolicy`
+
+```python
+@dataclass(frozen=True)
+class DPPolicy:
+    """Intra-device (cube × PE) data-parallel policy.
+
+    SIP-level placement is controlled by ``torch.ahbm.set_device(rank)``
+    (ADR-0024 D3) and, for model-level TP, by Megatron-style parallel
+    layers (ADR-0027). DPPolicy does not cross SIP boundaries.
+    """
+    cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
+    pe: Literal["replicate", "column_wise", "row_wise"] = "replicate"
+    num_pes: int | None = None
+    num_cubes: int | None = None
+```
+
+Removed fields: `sip`, `num_sips`.
+
+### D2. `ShardSpec`: structural (sip, cube, pe) coordinates, `pe_index` removed entirely
+
+Currently `ShardSpec.pe_index` is a **global flat index** (`sip × cubes × pes + cube ×
+pes + pe`). This is the form that ADR-0024 D4 called out as "abstraction leakage".
+
+This ADR redefines ShardSpec in **structural coordinates** and does **not keep** `pe_index`,
+not even as a property:
+
+```python
+# src/kernbench/policy/placement/dp.py (after)
+@dataclass(frozen=True)
+class ShardSpec:
+    """Structural shard placement — intra-SIP (cube × PE) coord.
+
+    Global-flat `pe_index` was removed in ADR-0026. Callers must use
+    structural coords (sip, cube, pe) directly. If a flat integer key is
+    needed (e.g. dict lookup), compute it explicitly at the call site.
+    """
+    sip: int              # structural — which SIP this shard lives on
+    cube: int             # local within SIP
+    pe: int               # local within cube
+    offset_bytes: int
+    nbytes: int
+```
+
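+**Key principles**:
+- The identity of a ShardSpec is the `(sip, cube, pe)` 3-tuple.
+- **No `pe_index` property either**, which blocks silent semantics drift.
+- Existing callers that expected a global flat index get an **immediate
+  `AttributeError`** on `.pe_index` access → they must migrate to structural coordinates.
+- In local contexts that need a flat integer key (e.g. an internal dict lookup), the caller
+  explicitly computes `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe`.
+
+**Justification for removing the property**: KernBench is an in-house project with a limited set of
+call sites. Compared with the risk of silent drift (the meaning changes while the type stays int),
+explicit breakage (AttributeError) is much safer.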
+
+### D3. `resolve_dp_policy` takes `target_sip` and generates structural coordinates
+
+Implements the contract of ADR-0024 D4. No post-hoc shifting.
+
+```python
+# src/kernbench/policy/placement/dp.py (after)
+
+@dataclass(frozen=True)
+class _LocalPeShard:
+    """Internal: return type of the PE resolvers. Cube-local PE identifier plus payload."""
+    local_pe: int                  # cube-local PE index (0..num_pe-1)
+    offset_bytes: int
+    nbytes: int
+
+
+def resolve_dp_policy(
+    policy: DPPolicy,
+    *,
+    shape: tuple[int, int],
+    itemsize: int,
+    num_pe: int,
+    num_cubes: int = 1,
+    target_sip: int,       # NEW: explicitly names the SIP to place on
+) -> list[ShardSpec]:
+    """2-level resolution (cube × PE) on a specified SIP.
+
+    Returns ShardSpecs with structural coords (sip=target_sip, cube, pe).
+    No SIP-level split — DPPolicy is intra-device only.
+    """
+    resolver = _PE_RESOLVERS[policy.pe]
+    all_shards: list[ShardSpec] = []
+
+    # Level 1: cube within SIP
+    cube_splits = _split_shape(policy.cube, shape, num_cubes, itemsize)
+
+    for cube_id, (cube_shape, cube_offset) in enumerate(cube_splits):
+        # Level 2: PE within cube — resolver returns _LocalPeShard (local_pe)
+        local_shards = resolver(shape=cube_shape, itemsize=itemsize,
+                                 num_pe=num_pe)
+
+        for ls in local_shards:
+            all_shards.append(ShardSpec(
+                sip=target_sip,                   # from caller (current_device)
+                cube=cube_id,                     # local within SIP
+                pe=ls.local_pe,                   # local within cube (explicit name)
+                offset_bytes=cube_offset + ls.offset_bytes,
+                nbytes=ls.nbytes,
+            ))
+
+    return all_shards
+```
+
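+The **internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a list of `_LocalPeShard`;
+the field name `local_pe` makes it **explicit that this is a cube-local PE identifier**.
+This resolves the earlier naming confusion with `ShardSpec.pe_index`.
+
+**Naming conventions** (whole ADR):
+- `ShardSpec.pe`: the final external API, a cube-local PE (structural coord)
+- `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage
+- `pe_index`: **removed**. It is kept nowhere, externally or internally (a side effect of
+  blocking silent drift: the name does not reappear).
+
+### D4. `_create_tensor`: direct placement in structural coordinates
+
+Continues ADR-0024 D4. Post-hoc shifting is removed, and the structural coordinates are specified
+directly when `resolve_dp_policy` is called.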
+
+```python
+# context.py _create_tensor (after)
+current_sip = self.ahbm.current_device()
+if current_sip is None:
+    # Single-driver fallback (consistent with ADR-0024 D2).
+    # If launcher-based code forgets set_device(), tensors silently land on
+    # SIP 0, so warn in debug mode.
+    if os.environ.get("KERNBENCH_DEBUG"):
+        import warnings
+        warnings.warn(
+            "torch.ahbm.current_device() is None; defaulting to SIP 0. "
+            "If this is a multi-rank launcher context, you likely forgot "
+            "torch.ahbm.set_device(rank) inside the worker.",
+            stacklevel=2,
+        )
+    current_sip = 0
+
+placement = resolve_dp_policy(
+    dp,
+    shape=shape_2d,
+    itemsize=itemsize,
+    num_pe=eff_num_pe,
+    num_cubes=eff_num_cubes,
+    target_sip=current_sip,          # ← structural coordinate specified up front
+)
+
+# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
+# The old post-hoc shifting block is removed entirely.
+```
+
+**Every** tensor is placed on the current device SIP. To create a multi-SIP tensor,
+use the TP primitives of ADR-0027.
+
+**Trade-off of the single-driver fallback**: defaulting to SIP 0 for calls without set_device
+is kept for compatibility with existing single-driver tests. Under `KERNBENCH_DEBUG=1`,
+a warning is emitted so that a set_device call accidentally omitted in a launcher context,
+which would silently place tensors on the wrong SIP, can be detected.
+
+### D5. Downstream: allocator lookup uses a structural tuple key
+
+The existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
+
+```python
+for spec in placement:
+    alloc = allocators[spec.pe_index]       # ← AttributeError (property removed)
+```
+
+Because `pe_index` is gone, migration to structural coordinates is **forced**:
+
+```python
+for spec in placement:
+    alloc = allocators[(spec.sip, spec.cube, spec.pe)]
+```
+
+The dict population in `_ensure_allocators` also switches to tuple keys:
+
+```python
+# context.py _ensure_allocators (after)
+for sip_id in sip_range:
+    for cube_id in range(cubes_per_sip):
+        for pe_id in range(pes_per_cube):
+            self._allocators[(sip_id, cube_id, pe_id)] = PEMemAllocator(
+                rack_id=0, sip_id=sip_id, cube_id=cube_id, pe_id=pe_id, cfg=cfg,
+            )
+```
+
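+`_free_tensor` follows the same pattern: the existing
+`flat_idx = sip * ... + cube * ... + pe` computation block is removed and
+`(shard.sip, shard.cube, shard.pe)` is used directly.
+
+**Tuple vs dataclass `PEIdentity`**: the tuple is recommended because it is
+simple and directly usable as a hashable key. A `PEIdentity` value object would
+add explicit typing, but the boilerplate is large and it would only serve as
+the key of the allocator dict, so it is over-engineering. Keep the tuple.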
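
A minimal sketch of the tuple-keyed lookup described above. The `ShardSpec` and `PEMemAllocator` classes here are simplified stand-ins with hypothetical fields, not the KernBench classes; the point is only that a `(sip, cube, pe)` tuple works directly as a dict key without any flat-index arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardSpec:
    # Stand-in for the real ShardSpec: structural coordinates only.
    sip: int
    cube: int
    pe: int
    nbytes: int


@dataclass
class PEMemAllocator:
    # Stand-in allocator that only tracks how many bytes were handed out.
    sip_id: int
    cube_id: int
    pe_id: int
    used: int = 0

    def alloc(self, nbytes: int) -> int:
        offset = self.used
        self.used += nbytes
        return offset


def build_allocators(num_sips, cubes_per_sip, pes_per_cube):
    # One allocator per (sip, cube, pe); tuples are hashable, so no flat index.
    return {
        (s, c, p): PEMemAllocator(s, c, p)
        for s in range(num_sips)
        for c in range(cubes_per_sip)
        for p in range(pes_per_cube)
    }


allocators = build_allocators(num_sips=2, cubes_per_sip=2, pes_per_cube=4)
placement = [ShardSpec(sip=1, cube=0, pe=3, nbytes=64),
             ShardSpec(sip=1, cube=1, pe=3, nbytes=32)]
offsets = [allocators[(s.sip, s.cube, s.pe)].alloc(s.nbytes) for s in placement]
```

Two shards with the same `pe` but different `cube` resolve to different allocators, which is the ambiguity a single SIP-local `pe_index` could not express on its own.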
+
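+### D7. Backward compatibility: none (cleanup ADR)
+
+This ADR is a **breaking change**.
+
+1. Calling `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)` → `TypeError`
+2. Accessing `ShardSpec.pe_index` → `AttributeError`
+
+Both are **immediate, explicit breakage**. There is no deprecation warning or
+fallback path. KernBench is an in-house project with a limited set of call
+sites, so the migration happens in one pass.
+
+**Blocking silent drift** is the main benefit of removing the property
+entirely: it eliminates the possibility that code expecting a global flat
+index receives a SIP-local result and silently indexes incorrectly.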
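
The two failure modes can be illustrated with plain dataclasses. This is a sketch under the assumption that `DPPolicy` and `ShardSpec` behave like ordinary dataclasses once the fields are removed; the field sets below are hypothetical reductions, not the real definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DPPolicy:
    # Hypothetical reduced field set: only intra-device axes remain.
    cube: str = "replicate"
    pe: str = "replicate"


@dataclass(frozen=True)
class ShardSpec:
    # Structural coordinates only; there is no pe_index property.
    sip: int
    cube: int
    pe: int


def try_call(fn):
    """Return the exception type raised by fn(), or None if it succeeds."""
    try:
        fn()
    except Exception as exc:
        return type(exc)
    return None


removed_kwarg = try_call(lambda: DPPolicy(sip=0))             # unknown field
removed_attr = try_call(lambda: ShardSpec(0, 1, 2).pe_index)  # removed property
still_valid = try_call(lambda: DPPolicy(cube="column_wise", pe="row_wise"))
```

Both stale call patterns fail at the call site rather than producing a value with a different meaning, which is the behaviour this section relies on.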
+
+## Dependencies
+
+- **ADR-0024** (launcher): `set_device(rank)` and current-device scoping
+  provide the SIP placement mechanism. This ADR builds on that and narrows
+  DPPolicy to purely intra-device.
+- **ADR-0027** (Megatron TP): the alternative path when a tensor must span
+  multiple SIPs. After this ADR is applied, multi-SIP use cases move to
+  ADR-0027.
+
+---
+
+## Non-goals
+
+- **Redesigning `DPPolicy.cube` / `pe`**: the existing
+  replicate/column_wise/row_wise semantics are kept.
+- **Unifying tiling policies**: `tiled_column_major` / `tiled_row_major` stay
+  as they are.
+- **A new multi-device tensor abstraction**: DTensor-like work is ADR-0028.
+
+---
+
+## Open questions
+
+- **Default for current_sip in `_create_tensor`**: whether a call without
+  set_device should fall back to rank=0 (SIP 0) or raise an error. The
+  recommendation is to fall back (compatibility with existing single-driver
+  tests).
+- **Scope of rewriting `test_sip_parallel.py`**: moving the existing unit tests
+  to a launcher-based form while keeping their intent requires additional
+  fixtures. Scoped as separate work.
+- **Meaning of `num_sips=None` in `DPPolicy`**: once the field is removed, the
+  `num_sips` concept itself disappears. The explicit answer is to use the TP
+  primitives of ADR-0027 to express multi-SIP.
+
+**Resolved (open in the previous rev)**:
+- ~~Whether to keep the `ShardSpec.pe_index` property~~ → **removed entirely** (D2)
+- ~~Key format of the `_ensure_allocators` dict~~ → **tuple `(sip, cube, pe)`** (D5)
+
+---
+
+## Consequences
+
+### Positive
+
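+- **Clear separation of concepts**: DPPolicy = intra-device, TP = inter-device.
+- **Simpler API**: DPPolicy constructor fields reduced by ~33%.
+- **Consistent structural coordinates**: ShardSpec is expressed as a
+  `(sip, cube, pe)` tuple, which resolves the abstraction leakage (satisfies
+  the ADR-0024 D4 contract).
+- **Clear meaning of `pe_index`**: SIP-local is the only interpretation. If a
+  global flat index is needed, state it explicitly.
+- **Consistent launcher model**: the "1 worker per SIP" model of ADR-0024 is
+  the only mechanism that controls SIP boundaries.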
+
+### Negative
+
+- **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`,
+  `spec.pe_index` → `AttributeError`. All callers must be fixed at once.
+- **ShardSpec schema change**: single `pe_index` field → three fields
+  `sip`/`cube`/`pe`. Cascading changes downstream (`deploy_tensor`,
+  `_free_tensor`, `_ensure_allocators`, `allocators` dict keys, etc.).
+- **No silent drift**: with the property removed entirely, failures are
+  immediate at runtime, which blocks migration leakage at the source. (An
+  explicit tradeoff rather than a negative.)
+- Cost of rewriting `test_sip_parallel.py`.
+
+### Neutral
+
+- The meaning of the existing `cube` / `pe` fields is unchanged.
@@ -0,0 +1,888 @@
+# ADR-0027: Megatron-style Tensor Parallelism API
+
+## Status
+
+Accepted
+
+## Context
+
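+### Goal
+
+Support tensor parallelism (TP) across SIPs through an **explicit
+parallel-layer API in the style of Megatron-LM**. Declarative abstractions such
+as DTensor are future work in a separate ADR (0028).
+
+Reasons for choosing the Megatron style:
+- TP occurs at specific layer boundaries of the model. Explicit primitives fit
+  the mental model naturally.
+- It is the industry standard established by NVIDIA Megatron / DeepSpeed.
+- DTensor is declarative, so its design space is larger → handled in stages.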
+
+### TP primitive spec (after Megatron-LM)
+
+- **ColumnParallelLinear**: distributes the **column (out_features)** axis of
+  the weight across TP ranks. Input is fully replicated, output is
+  column-sharded. No forward all-reduce when a RowParallelLinear follows.
+- **RowParallelLinear**: distributes the **row (in_features)** axis of the
+  weight across TP ranks. The input is already column-sharded (the output of
+  ColumnParallel). An **all-reduce** is required at the end of forward.
+- **VocabParallelEmbedding**: distributes the embedding along the vocab axis.
+  All-reduce at the end of forward. (A stub in the initial scope; the real
+  implementation requires an all-gather kernel first.)
+- **`copy_to_tp_region`**, **`reduce_from_tp_region`**, **`scatter_to_tp_region`**,
+  **`gather_from_tp_region`**: the basic primitives.
+
+### Problems to solve
+
+1. **Generalizing worker-wait (D0)**: extend the defer/yield/drain pattern of
+   `dist.all_reduce` to every `ctx.wait` path. **This is the largest
+   architectural decision in this ADR.**
+
+2. **Normalizing the launcher API (D1)**: current benches use a hand-rolled
+   greenlet loop. Absorbing it into `torch.multiprocessing.spawn(fn, args, nprocs)`
+   keeps the real-PyTorch API surface and concentrates the D0 scheduler drain
+   in a single implementation site.
+
+3. **Expressing per-rank weight distribution**: each worker owns its own slice
+   of the weight tensor. This is expressed naturally with ADR-0024's
+   `set_device(rank)` plus ADR-0026's intra-device DPPolicy.
+
+4. **Forward-only scope**: KernBench currently has no backward pass (it is a
+   simulation tool). This ADR supports **forward only** first. Training
+   simulation is a separate ADR.
+
+5. **Where collectives are called**: RowParallelLinear calls `all_reduce` at
+   the end of forward. This works naturally on ADR-0024's multi-greenlet
+   structure together with the D0 generalization.
+
+6. **TP group concept**: Megatron combines DP × TP × PP groups. The initial
+   scope simplifies to **TP group = all SIPs**. Mixed DP+TP is future work.
+
+---
+
+## Decision
+
+### D0. Generalizing worker-wait: `ctx.wait` defers to main in a worker context
+
+**Problem restated**. `kernel_runner.run` captures `greenlet.getcurrent()` at
+spawn time as the `_parent` of the kernel greenlet
+([kernel_runner.py:94](src/kernbench/triton_emu/kernel_runner.py#L94)).
+When `env.run` runs in the main context, parent=main, which is safe. When
+`env.run` runs in a worker context, parent=worker, and the moment the worker
+yields or finishes, the kernel greenlet is orphaned → `GreenletExit` → the
+`ring_default_ws` failure of ADR-0024 Phase B.
+
+**Solution**. When a worker greenlet calls `ctx.wait(h)`, it **yields to the
+main scheduler** instead of driving `env.run` directly. Main drives env.run,
+and once the handle completes, control returns to the worker.
+
+#### D0.1 `RuntimeContext` 확장
+
+```python
+# context.py
+@dataclass
+class RuntimeContext:
+    ...
+    _pending_worker_waits: list[RequestHandle] = field(default_factory=list, init=False)
+```
+
+#### D0.2 Worker fork in `ctx.wait`
+
+```python
+def wait(self, handle, *, _meta=None):
+    # Fast-path: already completed — skip enqueue + switch (consistent with
+    # D0.4-(3) idempotency). Avoids needless worker→main→worker round-trip
+    # and prevents redundant _pending_worker_waits growth.
+    if handle in self._completed:
+        completion, _trace = self.engine.get_completion(handle)
+        return completion
+
+    from greenlet import getcurrent
+    g = getcurrent()
+    if g.parent is not None and not g.parent.dead:
+        # Worker greenlet: defer to main. Push handle, yield to parent.
+        # Parent (scheduler loop) drains env.run, then switches back.
+        self._pending_worker_waits.append(handle)
+        g.parent.switch()
+        # On resume: handle must have completed (main drained the list).
+        # Fall through to the status-quo completion/trace assembly.
+
+    # Main context (or single-driver): drive engine directly.
+    wait_fn = getattr(self.engine, "wait", None)
+    if wait_fn is not None:
+        wait_fn(handle)
+    completion, trace = self.engine.get_completion(handle)
+    self._completed.add(handle)
+    if _meta is not None and trace is not None:
+        entry = dict(trace) if isinstance(trace, dict) else {"raw": trace}
+        entry.update(_meta)
+        self._traces.append(entry)
+    return completion
+```
+
+#### D0.3 Worker-context semantic contract of `ctx.wait` (normative)
+
+This ADR **explicitly changes** the semantics of `ctx.wait` in a worker
+context.
+
+- **Submit vs complete are separated**: when called from a worker,
+  `ctx.wait(h)` does not guarantee immediate completion but completion
+  "**after the next scheduler drain**". The point at which the worker returns
+  from `wait()` is the point at which main has finished `engine.wait` for that
+  handle. Calls from the main context remain immediate and synchronous (status
+  quo).
+- **Resume invariant (normative)**: in a worker-deferred `ctx.wait(h)`, when
+  `g.parent.switch()` returns and the worker resumes, **`h in ctx._completed`
+  must be True**. If this invariant breaks, the worker proceeds in a stale
+  state, so any change to `_drain_pending`, the scheduler loop, or `ctx.wait`
+  must preserve it. T3.b asserts this invariant directly.
+- **Observable change**: inside a worker, the pattern
+  `h = ctx.submit(msg); ctx.wait(h); read(handle_result)` still holds. The
+  semantic spec now includes the fact that a main-drain is automatically
+  inserted between `wait()` and `read`.
+- **Direct reads of host objects, see D0.5**: the contract for calling
+  `tensor.numpy()` without `ctx.wait` is defined separately in D0.5.
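
The resume invariant can be illustrated with a small stand-in scheduler. This sketch uses Python generators in place of greenlets (an assumption made so the example needs only the standard library): `yield h` plays the role of a worker-deferred `ctx.wait(h)`, and `FakeEngine` is a hypothetical engine in which a handle completes when `wait` is called on it.

```python
from collections import deque


class FakeEngine:
    """Stand-in engine: a handle completes when wait() is called on it."""
    def __init__(self):
        self.completed = set()

    def wait(self, handle):
        self.completed.add(handle)


def worker(rank, engine, log):
    # `yield h` plays the role of ctx.wait(h) deferring to the main scheduler.
    for step in range(2):
        h = (rank, step)
        yield h
        # Resume invariant: the scheduler drained h before switching back.
        log.append((rank, step, h in engine.completed))


def run(num_workers):
    engine, log = FakeEngine(), []
    alive = [worker(r, engine, log) for r in range(num_workers)]
    while alive:
        pending, still_alive = deque(), []
        for g in alive:                  # round-based switch into each worker
            try:
                pending.append(next(g))
                still_alive.append(g)
            except StopIteration:
                pass
        while pending:                   # drain in main, FIFO by submission
            engine.wait(pending.popleft())
        alive = still_alive
    return log


log = run(num_workers=2)
```

Every entry in `log` records whether the handle was already completed at the moment the worker resumed, which is the property the normative invariant requires.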
+
+#### D0.4 Main scheduler drain: rules (normative)
+
+(Internal implementation of D1's `multiprocessing.spawn`. What follows is the
+semantic definition.)
+
+```python
+while alive:
+    for g in alive:              # (1) round-based worker switch
+        g.switch()
+    _drain_pending(ctx)           # (2) drain in main context
+```
+
+(For the actual definition of `_drain_pending`, see D0.5: an outer while-loop
+drains until both queues are empty.)
+
+**Rules**:
+
+1. **Round-based cooperative scheduling & the yield obligation (worker contract)**.
+   `g.switch()` does not return until the worker **yields voluntarily**
+   (cooperative greenlet semantics). Therefore:
+   - If a worker runs a pure-compute loop such as `while True: do_compute()`
+     without yielding, `g.switch()` never returns and **the scheduler loop
+     itself hard-blocks** (other workers get no chance to be switched to, and
+     no drain happens). This is not starvation but **scheduler non-progress
+     (equivalent to deadlock)**, and this ADR defines it as **unsupported**.
+   - A worker **must** call one of `ctx.wait(h)`, `dist.all_reduce`, or a
+     host-read barrier (D0.5) within a finite number of steps. A TP layer's
+     `forward` contains a launch→wait pair at the end of every layer, so it
+     satisfies this condition naturally. CCL kernels also yield inside
+     `dist.all_reduce`.
+   - The implementation does not need to **detect** violations (timeouts,
+     steps-since-yield counters, and the like). This is a user contract, and
+     the symptom of a violation is "simulation hang".
+   - **Future extension**: if long non-collective compute paths become common,
+     an explicit `torch.distributed.cooperative_yield()` primitive (no-op
+     yield) can be introduced. Outside the scope of this ADR. It is not a
+     breaking change and can be added when needed.
+   - Within a round, every alive worker receives `switch` once. Even if one
+     worker calls wait several times in a single round, the handles are
+     enqueued in order within that turn and then processed together (FIFO) in
+     one scheduler drain.
+
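+2. **Drain order = submission order (FIFO)**. `_pending_worker_waits` is a
+   strict FIFO via list append/pop(0). Draining follows submission order, not
+   completion order. Because the SimPy scheduler itself guarantees a causally
+   correct completion order, draining in submission order is safe. Do not
+   confuse `completion order` with `drain order`.
+
+   **Two-queue ordering (worker waits → collectives)**: `_drain_pending`
+   drains the worker-wait queue first and the collective queue second. The
+   rationale for this order:
+   - **The two queues have different dependency sources**: a worker wait is a
+     handle the worker created itself through a `submit + wait` pair (tensor
+     deploy, MmuMap, etc.). The collective queue holds kernel launch handles
+     enqueued internally by `dist.all_reduce`, and the worker does not wait on
+     them directly (see the two-queue drain model in D0.5).
+   - **Independent from a correctness standpoint**: from the worker's point of
+     view a collective is "already submitted, then yielded". It only needs to
+     complete before the worker's next action. There is no ordering dependency
+     with the worker-wait queue.
+   - **Both complete within a single drain barrier**: under the
+     loop-until-empty rule of D0.5, one barrier invocation drains worker →
+     collective → (repeat if anything new appeared). When the worker resumes,
+     both queues are drained.
+   - **The alternative (collectives first) is also possible**: this ADR fixes
+     worker-first only for simplicity of the current implementation; the two
+     are semantically equivalent. Revisit if a performance difference is
+     observed.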
+
+3. **Duplicate enqueue: correctness comes from idempotent drain, dedup is not guaranteed**.
+   `ctx.wait(h)` returns immediately if `h in ctx._completed`. `_drain_pending`
+   applies the same guard. Even if the same handle is appended to
+   `_pending_worker_waits` several times, the actual `engine.wait` is called
+   only once (idempotent).
+   - **Correctness**: relies on idempotent drain → safe.
+   - **Memory/performance**: this ADR **does not guarantee dedup** of
+     `_pending_worker_waits`. If the same handle is enqueued N times, the queue
+     holds N elements and the drain performs N pops plus in-set guards. Unless
+     a single worker repeatedly waits on the same handle (an abnormal pattern),
+     N stays between 1 and a handful.
+   - **Implementation freedom**: an implementation may optionally dedup (for
+     example by keeping a `set` as a side index, or by checking
+     `h not in pending_set` before append). This is classified as an
+     optimization that does not change correctness.
+
+4. **Exception propagation + sibling cleanup**.
+   When a worker greenlet raises, `g.switch()` propagates the exception to
+   main. The scheduler loop stops immediately and performs the following
+   cleanup **explicitly**:
+
+   ```python
+   try:
+       while True:
+           alive = [g for g in gs if not g.dead]
+           if not alive:
+               break
+           for g in alive:
+               if not g.dead:
+                   g.switch()
+           _drain_pending(ctx)
+   except Exception as outer:
+       # (a) Force-terminate the surviving sibling worker greenlets.
+       for other in gs:
+           if not other.dead:
+               try:
+                   other.throw(SystemExit)
+               except Exception:
+                   pass          # silent: we are already handling an exception
+       # (b) Reset backend barrier / pending state (in preparation for a future epoch barrier).
+       backend = getattr(ctx.distributed, "_backend", None)
+       if backend is not None and hasattr(backend, "_barrier"):
+           backend._barrier.reset()
+       backend_pending = getattr(backend, "_pending_collective_handles", None)
+       if backend_pending is not None:
+           backend_pending.clear()
+       ctx._pending_worker_waits.clear()
+       # (c) Wrap the root-cause exception in SpawnException.
+       raise SpawnException(errors) from outer
+   ```
+
+   Rules:
+   - **Sibling abort guarantee**: when one worker raises, `SystemExit` is
+     thrown into every sibling greenlet, which terminates immediately. No
+     greenlet leak.
+   - **Pending queues are cleared explicitly**: both the worker-wait and the
+     collective-pending queue are emptied. This prevents contamination on
+     reuse.
+   - **Wrapping in `SpawnException(errors)`**: `errors: dict[int, Exception]`
+     holds the original exception of each rank. Compatible with the failure
+     pattern of real-PyTorch `torch.multiprocessing.spawn`.
+     - **Scope restriction**: `errors` contains **only ranks that raised from
+       their own code (root cause)**. Ranks terminated by `throw(SystemExit)`
+       during sibling cleanup do not appear in `errors` (SystemExit is not
+       caught by the `try/except Exception` of the entry wrapper in D1.2; this
+       is intentional: sibling termination is a cleanup signal, not a failure).
+       Stated explicitly so that readers do not expect "every failed rank to
+       show up".
+   - **`ctx._traces` holds partial state up to the exception**. Trace
+     completeness is not guaranteed (some launch/all_reduce calls may terminate
+     without leaving an entry).
+   - **Allocator / MemoryStore** keep their pre-exception state. Reuse is a
+     non-goal; creating a new `RuntimeContext` is recommended.
+   - **`join=False` / retry / partial recovery** are non-goals of this ADR.
+
+   `SpawnException` is defined in `runtime_api/multiprocessing.py`:
+
+   ```python
+   class SpawnException(RuntimeError):
+       def __init__(self, errors: dict[int, Exception]):
+           self.errors = errors
+           first = next(iter(errors.items()), None)
+           msg = (f"spawn failed on ranks {sorted(errors.keys())}"
+                  + (f": rank {first[0]} raised {first[1]!r}" if first else ""))
+           super().__init__(msg)
+   ```
+
+5. **Single-driver compatibility**. In main-only execution where
+   `g.parent is None` (legacy single-driver tests), the worker-fork condition
+   of D0.2 is false → the existing immediate-synchronous path is kept.
+   `_drain_pending` is not called.
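
A small sketch of rules (2) and (3): FIFO drain with an in-set guard, so duplicate enqueues are harmless. `CountingEngine` and `drain` are hypothetical stand-ins, and the sketch assumes the drain marks a handle as completed right after waiting on it.

```python
class CountingEngine:
    """Stand-in engine that records every wait() call it receives."""
    def __init__(self):
        self.calls = []

    def wait(self, handle):
        self.calls.append(handle)


def drain(pending, completed, engine):
    # FIFO by submission order; the in-set guard makes duplicates harmless.
    while pending:
        h = pending.pop(0)
        if h not in completed:
            engine.wait(h)
            completed.add(h)


engine = CountingEngine()
completed = set()
pending = ["h0", "h1", "h0", "h2", "h1"]   # h0 and h1 were enqueued twice
drain(pending, completed, engine)
```

The queue held five elements but the engine was driven three times, in first-submission order. Dedup at enqueue time would shrink the queue but would not change this result.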
+
+#### D0.5 Host-read barrier: decision (normative)
+
+Inside a worker, **host-observable reads** such as `tensor.numpy()`,
+`tensor.__getitem__`, and `tensor.data` are defined as an **automatic drain
+barrier**. Immediately before the call:
+
+1. If `ctx._pending_worker_waits` or `backend._pending_collective_handles` is
+   non-empty, yield to main via `g.parent.switch()` → main runs
+   `_drain_pending` → the worker resumes after completion.
+2. If both queues are empty, read immediately.
+
+**Barrier repetition rule (normative, re-entrance)**: `_drain_pending` drains
+in a while-loop **until both queues are completely empty**. It is not a single
+pass:
+
+```python
+def _drain_pending(ctx):
+    while ctx._pending_worker_waits or (
+        ctx.distributed._backend
+        and ctx.distributed._backend._pending_collective_handles
+    ):
+        while ctx._pending_worker_waits:
+            h = ctx._pending_worker_waits.pop(0)
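+            if h not in ctx._completed:
+                ctx.engine.wait(h)
+                # Mark completed so the D0.3 resume invariant and the
+                # idempotent-drain guard (D0.4-(3)) both hold.
+                ctx._completed.add(h)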
+        backend = ctx.distributed._backend
+        if backend is not None:
+            while backend._pending_collective_handles:
+                h, _sip_id, meta = backend._pending_collective_handles.pop(0)
+                ctx.wait(h, _meta=meta)  # main context: safe; ctx.wait does
+                                          # not push onto pending again
+```
+
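+**Main-context ctx.wait non-recursion invariant (normative)**: the
+`ctx.wait(h, _meta=meta)` call inside `_drain_pending` runs in the main
+greenlet context. The worker-fork condition of D0.2
+(`g.parent is not None and not g.parent.dead`) is False, so it takes the
+immediate-synchronous path → **it never enqueues onto `_pending_worker_waits`**.
+Thanks to this invariant the drain loop ends without recursion or queue
+regrowth. When implementing, it is important to keep `g.parent is None` as the
+guarantee of a single main greenlet.
+
+**Why a loop**: `ctx.wait(h, _meta=meta)` is called in the main context, so per
+D0.2 it **drives the engine directly** (no additional enqueue, per the
+invariant above). In theory a single pass is therefore sufficient, but the rule
+is fixed as **loop-until-empty**. Reasons:
+
+1. **Safety under future extension**: an implementation in which new pending
+   entries are enqueued during drain (for example a tree-reduce whose
+   collective has sub-handles) may appear later. With the loop rule,
+   correctness still holds.
+2. **Readability**: the meaning is closed in one sentence, "the barrier drains
+   until pending is empty". It does not depend on the non-trivial invariant
+   that a `ctx.wait` call performs no new enqueue.
+3. **The barrier's semantics are "every dependency required by this read has
+   completed"**: in the current model all pending entries are exactly all
+   dependencies, so the two coincide. The user's mental model is the former.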
+
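+**Termination guarantee**: described separately for two regimes.
+
+- **Current implementation**: when called from the main context, `ctx.wait`
+  drives the engine directly (D0.2) → it enqueues no new pending entries. Each
+  iteration strictly shrinks pending through `pop(0)` + `engine.wait`. The
+  number of iterations is **bounded by the initial pending size** → finite
+  termination.
+- **Future extension (the bound that justifies the loop rule)**: if an
+  implementation that enqueues new pending entries during drain (for example
+  tree-reduce sub-handles) is introduced, the initial-size bound no longer
+  holds. However, SimPy causality guarantees that handle dependencies form a
+  finite DAG, so the **nested depth is finite**. The loop rule covers this case
+  automatically.
+
+Both regimes rule out an infinite loop. The single-pass bound of the current
+implementation is only a reference value for aggressive optimization; the rule
+itself is fixed as loop-until-empty.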
+
+**Why implicit drain at read is the right choice**:
+
+- The earlier open question was a choice between (a) implicit drain and
+  (b) an explicit barrier. (b) is clear, but it burdens TP layer users with
+  writing the 3-step `out = fc1.forward(x); ctx.drain(); result = out.numpy()`
+  every time. (a) is a single rule, "a read is guaranteed to see the reflected
+  value", the same as CUDA's `cudaDeviceSynchronize before host copy` pattern.
+  It is not a hidden rule but the **contract of named entry points**.
+- This ADR adopts (a) and **explicitly closes** the list of entry points:
+  `Tensor.numpy()`, `Tensor.data` (numpy alias), `Tensor.__getitem__`,
+  `Tensor.__repr__` (when data is included). Any other official host-read API
+  is fixed by a codebase search at implementation time. Any host-read API added
+  later must follow this contract (regressions prevented by tests).
+- The drain barrier also applies when `numpy` is called directly after
+  `ctx.submit` with no `wait` (because the handle is in the pending queue).
+  Even if the user omits the explicit wait, the invariant is restored at read
+  time.
+
+**`Tensor.copy_(source)`: write barrier rule**:
+
+`copy_` is semantically "write to target", but internally it calls
+`source.numpy()` to fetch the source data on the host and then writes each
+shard via `target._memory_store.write(...)`. Both directions are covered by a
+barrier:
+
+1. **Source side (read barrier)**: `source.numpy()` triggers the D0.5 read
+   barrier (when source itself is a deployed tensor and there are pending
+   entries).
+2. **Target side (write barrier, based on global pending)**: on entry to
+   `copy_`, if `ctx._pending_worker_waits` or
+   `backend._pending_collective_handles` is non-empty, drain via
+   `g.parent.switch()` before writing. **This is based on the global pending
+   queue, not on per-tensor / per-shard dependency tracking**.
+   - Why global: KernBench's handle representation carries no back-reference
+     of the form "this handle writes shard X of target". The safe conservative
+     rule is "drain if anything is globally pending". As a result **pending
+     work of an unrelated tensor can also block copy_**; the drop-in invariant
+     takes priority.
+   - **Explicit tradeoff**: this rule can introduce unnecessary serialization
+     between mutually independent tensors. Under the current single-queue
+     execution model this cost is acceptable; guaranteeing cross-rank
+     correctness and the "fresh at read" invariant with a simple rule takes
+     priority.
+   - Practical impact: within one layer step, a single worker's pending
+     entries are mostly its own work. The extra context switches from
+     over-barrier often coincide with the end-of-round scheduler drain, so
+     this is not a significant problem.
+   - Future refinement: introducing per-tensor pending tracking could narrow
+     this rule, but that is outside the scope of this ADR.
+
+**Non-barrier**:
+
+- **Metadata-only** access such as `tensor.shape`, `tensor.dtype`, and
+  `tensor.name` does not drain. There is no data dependency.
+- Raw address accessors such as `tensor.pa` and `tensor.va` do not drain either
+  (addresses only, not contents).
+
+**Official barrier entry points (closed set)**:
+
+| API | Kind | Rationale |
+|---|---|---|
+| `Tensor.numpy()` | read | host-observable copy |
+| `Tensor.data` | read | `numpy()` alias |
+| `Tensor.__getitem__` | read | shard-aligned read |
+| `Tensor.__repr__` (when data is included) | read | debugging/log |
+| `Tensor.copy_(source)` | read + write | source read + target write |
+
+T5/T6 verify this contract directly.
+
+#### D0.6 Why the worker function API is unchanged (informative)
+
+- Internally, `torch.zeros(...)` is a `self.submit(msg)` + `self.wait(h)` pair.
+  Per D0.2/D0.3, `wait` automatically defers to main → it looks synchronous
+  from the outside but yields once.
+- Per D0.5, `tensor.numpy()` is a host-read barrier → drain then read if
+  anything is pending, otherwise read immediately.
+- `dist.all_reduce` keeps using the existing `_defer_wait=True` +
+  `_pending_collective_handles` path. The drain of D0.4 handles both queues
+  together.
+
+#### D0.7 Invariants
+
+- **The `_parent` of a kernel greenlet is always main**: because env.run never
+  runs in a worker context. (The core assertion of T3.)
+- **Cross-rank synchronization point**: the drain happens only after every
+  worker has yielded → the kernels of all ranks advance together in one round
+  (a necessary condition for cross-rank IPCQ exchange).
+- **Single-driver compatibility**: D0.4-(5).
+
+### D1. `torch.multiprocessing.spawn(fn, args, nprocs)`
+
+Real-PyTorch API parity, plus the single implementation site of the D0
+scheduler loop.
+
+#### D1.0 API parity only, not execution parity (normative)
+
+The name `torch.multiprocessing.spawn` is limited to **API signature parity**.
+The actual execution model is a **cooperative greenlet scheduler** (a single
+Python process, a single OS thread, the round-robin drive of D0.4). The
+following are **properties this ADR does not provide**; among the guarantees
+of real-PyTorch `torch.multiprocessing.spawn` they are explicitly
+**non-goals**:
+
+- Process isolation (independent OS process per rank).
+- Independent address space (each rank has its own Python heap).
+- Failure isolation (a hard crash of one rank does not affect other ranks).
+- OS-level scheduler fairness (preemptive time slicing between ranks).
+- Inter-process primitives such as `mp.Queue` and `mp.Lock`.
+
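+Actual properties of this implementation:
+
+- Every rank is a greenlet inside the same Python process. Shared global state
+  is visible as is (an intended simulation convenience).
+- A single thread under the GIL → no parallel execution. Only "logical
+  concurrency" is reproduced through SimPy event ordering.
+- An unhandled exception in one worker → the whole simulation stops
+  (D0.4-(4)).
+
+**Caller obligation**: when porting a real-PyTorch multi-process sample to
+KernBench, logic that depends on process isolation (for example `os.getpid`,
+independent temporary files, signal handling) must be removed. The namespace
+name is kept for code portability; the semantics differ.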
+
+#### D1.1 Public surface
+
+```python
+# runtime_api/multiprocessing.py (new)
+class _MultiprocessingNamespace:
+    def __init__(self, ctx):
+        self._ctx = ctx
+
+    def spawn(self, fn, args: tuple, nprocs: int, join: bool = True) -> None:
+        """Spawn `nprocs` worker greenlets, each calling fn(rank, *args).
+
+        Mirrors torch.multiprocessing.spawn signature (minus `daemon`).
+        Drives the D0 scheduler loop until all workers finish.
+        """
+        ...
+```
+
+#### D1.2 Implementation
+
+```python
+def spawn(self, fn, args, nprocs, join=True):
+    from greenlet import greenlet
+    ctx = self._ctx
+    dist = ctx.distributed
+    gs: list[greenlet] = []
+    errors: dict[int, Exception] = {}
+    for rank in range(nprocs):
+        def _entry(r=rank):
+            try:
+                fn(r, *args)
+            except Exception as e:
+                errors[r] = e
+                raise
+        g = greenlet(_entry)
+        dist._bind_rank(g, rank)
+        gs.append(g)
+
+    try:
+        while True:
+            alive = [g for g in gs if not g.dead]
+            if not alive:
+                break
+            for g in alive:
+                if not g.dead:
+                    g.switch()
+            _drain_pending(ctx)       # D0.5
+    except Exception as outer:
+        # Sibling cleanup per D0.4-(4)
+        for other in gs:
+            if not other.dead:
+                try:
+                    other.throw(SystemExit)
+                except Exception:
+                    pass
+        backend = getattr(dist, "_backend", None)
+        if backend is not None:
+            if hasattr(backend, "_barrier"):
+                backend._barrier.reset()
+            if getattr(backend, "_pending_collective_handles", None) is not None:
+                backend._pending_collective_handles.clear()
+        ctx._pending_worker_waits.clear()
+        raise SpawnException(errors) from outer
+    # `join=True` semantics: we already wait for all workers.
+```
+
+#### D1.3 `torch` namespace attach
+
+In `__post_init__` of `runtime_api/context.py`:
+```python
+self.multiprocessing = _MultiprocessingNamespace(self)
+```
+
+→ in bench code: `torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws)`.
+
+#### D1.4 Migrating existing benches
+
+The hand-rolled loop in `benches/ccl_allreduce.py` shrinks to a single
+`torch.multiprocessing.spawn` line. The existing matrix regression is kept as
+is. `ring_default_ws`, currently xfail, is expected to turn into PASS thanks to
+D0 (workers no longer orphan kernel greenlets).
+
+### D2. New package `kernbench.tp`
+
+```
+src/kernbench/tp/
+    __init__.py          - public API re-exports
+    parallel_state.py    - TP group management (currently a single global group)
+    layers.py            - ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding
+    primitives.py        - copy/reduce/scatter/gather_to/from_tp_region
+    kernels.py           - gemm kernel launched by TP layers (reusable)
+    mappings.py          - forward identity/all_reduce, backward stub
+```
+
+### D3. `parallel_state` — TP group
+
+```python
+# parallel_state.py
+_TP_WORLD_SIZE = None
+
+def initialize_model_parallel(tensor_model_parallel_size: int) -> None:
+    """Initialize TP group. Must be called after dist.init_process_group."""
+    global _TP_WORLD_SIZE
+    from kernbench.runtime_api.distributed import get_dist  # or torch.distributed
+    dist = get_dist()
+    total = dist.get_world_size()
+    if tensor_model_parallel_size != total:
+        raise NotImplementedError(
+            "Only TP == world_size supported in initial scope"
+        )
+    _TP_WORLD_SIZE = tensor_model_parallel_size
+
+def get_tensor_model_parallel_world_size() -> int:
+    return _TP_WORLD_SIZE
+
+def get_tensor_model_parallel_rank() -> int:
+    from kernbench.runtime_api.distributed import get_dist
+    return get_dist().get_rank()         # ADR-0024 greenlet-local rank
+```
+
+Initial scope: TP size = world_size = topology SIP count. A pure TP model.
+
+### D4-pre. TP shard ownership vs DPPolicy: separation of roles (normative)
+
+Two concepts are kept clearly separate in how a TP layer represents its
+weight/output:
+
+| Concept | Decided by | Scope |
+|---|---|---|
+| **TP shard ownership** (which rank owns which slice of the weight) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D2/D3) | **cross-rank, cross-SIP** |
+| **Intra-rank placement** (how the owned slice is distributed over cube × PE inside the rank) | `DPPolicy(cube=..., pe=...)` (ADR-0026) | **inside one rank (within the SIP boundary)** |
+
+So when `ColumnParallelLinear` creates its weight with shape
+`(in_features, out_features // ws)` and assigns
+`DPPolicy(cube="column_wise", pe="column_wise")`:
+
+- The slice owned by **rank r** = columns [r * k_local, (r+1) * k_local) of
+  the weight. **set_device(r)** decides this (the rank lives on SIP r).
+- **Inside that slice**, the column-wise distribution over cube × PE is
+  decided by **DPPolicy**.
+
+The two axes are **independent**. If two ranks create their slices with the
+same DPPolicy, the slices live on different SIPs but the intra-SIP placement
+pattern is identical. Conversely, changing the DPPolicy to
+`cube="replicate", pe="replicate"` keeps TP shard ownership and changes only
+the intra-rank placement.
+
+**Mistakes that blur this boundary** (forbidden by this ADR):
+
+- A "SIP axis" reappearing in DPPolicy (removed in ADR-0026).
+- A TP layer expressing cross-rank sharding with `DPPolicy` alone, without
+  `set_device` → indistinguishable from slicing vertically inside a single
+  rank.
+
+The TP layers of this ADR always treat weight/output from the viewpoint
+"rank = SIP = owner of one slice + intra-SIP distribution by DPPolicy".
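
The two independent axes can be sketched as two separate index computations. Both helper functions below are hypothetical illustrations, and the even cube × PE split is an assumption; the actual DPPolicy resolvers may partition differently.

```python
def tp_column_slice(rank: int, world_size: int, out_features: int) -> range:
    """Columns of the full weight owned by `rank` (TP shard ownership)."""
    if out_features % world_size != 0:
        raise ValueError("out_features must be divisible by world_size")
    k_local = out_features // world_size
    return range(rank * k_local, (rank + 1) * k_local)


def intra_rank_column_split(local_cols: range, num_cubes: int, pes_per_cube: int):
    """Split the rank-local columns evenly over cube x PE (intra-rank placement).

    Assumes the local width divides evenly across all PEs.
    """
    units = num_cubes * pes_per_cube
    width = len(local_cols) // units
    return {
        (cube, pe): local_cols[(cube * pes_per_cube + pe) * width:
                               (cube * pes_per_cube + pe + 1) * width]
        for cube in range(num_cubes)
        for pe in range(pes_per_cube)
    }


owned = [tp_column_slice(r, world_size=4, out_features=2048) for r in range(4)]
rank1_layout = intra_rank_column_split(owned[1], num_cubes=2, pes_per_cube=4)
```

The first function depends only on the rank, the second only on the local slice, so changing one does not affect the other.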
+
+### D4. `ColumnParallelLinear`
+
+**Important**: no new host-side `torch.matmul` abstraction is introduced. The
+layer's forward calls the existing gemm kernel through
+`torch.launch("gemm", gemm_kernel, ...)`, the pattern KernBench benches
+already use
+([benches/gemm_single_pe.py](benches/gemm_single_pe.py),
+[benches/gpt3_qkv.py](benches/gpt3_qkv.py)).
+
+```python
+# layers.py
+from kernbench.policy.placement.dp import DPPolicy
+from kernbench.tp.kernels import _gemm_kernel
+from kernbench.tp.parallel_state import (
+    get_tensor_model_parallel_rank,
+    get_tensor_model_parallel_world_size,
+)
+
+class ColumnParallelLinear:
+    """Distributes the K (out_features) axis of the weight across TP ranks.
+
+    forward(x):
+        x: (M, N), fully replicated across ranks
+        W_k: (N, K / world_size), rank-local slice (resides on SIP r via set_device)
+        y_k = x @ W_k → (M, K / world_size), rank-local output
+
+    The output is column-sharded, the input form RowParallelLinear expects.
+    """
+
+    def __init__(self, in_features: int, out_features: int, bias: bool = False,
+                 dtype: str = "f16", torch=None):
+        ws = get_tensor_model_parallel_world_size()
+        assert out_features % ws == 0
+        self.in_features = in_features
+        self.k_local = out_features // ws
+        self._torch = torch
+        # Each rank owns its own slice, placed on SIP r by set_device(rank).
+        self.weight = torch.zeros(
+            (in_features, self.k_local), dtype=dtype,
+            dp=DPPolicy(cube="column_wise", pe="column_wise"),
+            name="col_parallel_w",
+        )
+        self.bias = None
+        if bias:
+            self.bias = torch.zeros(
+                (self.k_local,), dtype=dtype,
+                dp=DPPolicy(cube="replicate", pe="replicate"),
+                name="col_parallel_b",
+            )
+
+    def forward(self, x):
+        # x is fully replicated (guaranteed by the caller). Plain local gemm.
+        M = x.shape[0]
+        out = self._torch.empty(
+            (M, self.k_local), dtype=x.dtype,
+            dp=DPPolicy(cube="column_wise", pe="column_wise"),
+            name="col_parallel_out",
+        )
+        self._torch.launch(
+            "col_parallel_gemm", _gemm_kernel,
+            x, self.weight, out, M, self.in_features, self.k_local,
+        )
+        # Bias add is a separate kernel or the fused bias of a composite gemm.
+        # In the initial scope only bias=False is thoroughly verified.
+        return out
+```
+
+**Yield-safety contract (normative)**: `ColumnParallelLinear.forward` contains
+a kernel launch → internal `ctx.wait` pair in its single `torch.launch` call.
+This automatically satisfies the "worker yields within a finite number of
+steps" condition of D0.4-(1), so TP layer users do not need to insert yield
+patterns manually.
+
+### D5. `RowParallelLinear`
+
+```python
+class RowParallelLinear:
+    """Distributes the N (in_features) axis of the weight across TP ranks.
+
+    forward(x):
+        x: (M, N / world_size), rank-local slice (the output of ColumnParallel)
+        W_k: (N / world_size, K), rank-local slice
+        y_k = x @ W_k → (M, K), partial sum on each rank
+        y = all_reduce(y_k, op="sum") → (M, K) on every rank
+    """
+
+    def __init__(self, in_features: int, out_features: int, bias: bool = False,
+                 dtype: str = "f16", torch=None):
+        ws = get_tensor_model_parallel_world_size()
+        assert in_features % ws == 0
+        self.n_local = in_features // ws
+        self.out_features = out_features
+        self._torch = torch
+        self.weight = torch.zeros(
+            (self.n_local, out_features), dtype=dtype,
+            dp=DPPolicy(cube="column_wise", pe="column_wise"),
+            name="row_parallel_w",
+        )
+        # Bias lives on rank 0 only (Megatron convention). Omitted in the initial scope.
+        self.bias = None
+
+    def forward(self, x):
+        M = x.shape[0]
+        y_partial = self._torch.empty(
+            (M, self.out_features), dtype=x.dtype,
+            dp=DPPolicy(cube="column_wise", pe="column_wise"),
+            name="row_parallel_partial",
+        )
+        self._torch.launch(
+            "row_parallel_gemm", _gemm_kernel,
+            x, self.weight, y_partial, M, self.n_local, self.out_features,
+        )
+        # Cross-rank reduce. ADR-0024's dist.all_reduce works correctly under
+        # D0 + mp.spawn (kernel parent stays main).
+        self._torch.distributed.all_reduce(y_partial, op="sum")
+        return y_partial
+```
+
+**Yield-safety contract (normative)**: `RowParallelLinear.forward` performs a
+launch, its internal wait, and then an `all_reduce` (defer + worker yield
+pattern), so each forward call is guaranteed to yield **at least twice**. This
+automatically satisfies the scheduler progress condition of D0.4-(1). Every TP
+layer forward in this ADR maintains the invariant "contains at least one wait
+or collective and is therefore yield-safe"; TP primitives added later
+(VocabParallelEmbedding, etc.) must honor the same contract.
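
The row-parallel arithmetic from the D5 docstring can be checked in isolation. The following is a minimal plain-Python sketch, assuming nested lists in place of device tensors; `matmul`, `all_reduce_sum`, and `row_parallel_forward` are hypothetical helpers written for this illustration and are not kernbench APIs. It only shows that summing the per-rank partial products reproduces the unsharded product.

```python
# Illustrative only: nested lists stand in for tensors, and the helpers
# below are hypothetical (not part of kernbench or torch.distributed).

def matmul(a, b):
    """Naive (M, N) @ (N, K) product on nested lists."""
    return [[sum(a[i][n] * b[n][k] for n in range(len(b)))
             for k in range(len(b[0]))] for i in range(len(a))]

def all_reduce_sum(partials):
    """Element-wise sum of the per-rank (M, K) partial results."""
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][k] for p in partials) for k in range(cols)]
            for i in range(rows)]

def row_parallel_forward(x, w, world_size):
    """Split the N axis across ranks, multiply locally, then sum."""
    n = len(w)
    assert n % world_size == 0
    n_local = n // world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * n_local, (rank + 1) * n_local
        x_k = [row[lo:hi] for row in x]      # (M, N / world_size)
        w_k = w[lo:hi]                       # (N / world_size, K)
        partials.append(matmul(x_k, w_k))    # (M, K) partial sum
    return all_reduce_sum(partials)

x = [[1, 2, 3, 4], [5, 6, 7, 8]]             # M=2, N=4
w = [[1, 0], [0, 1], [2, 2], [3, 1]]         # N=4, K=2
y = row_parallel_forward(x, w, world_size=2)
```

In the real layer each rank computes only its own partial and the sum is performed by the collective; the loop over ranks here merely emulates that on one host.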
+
+### D6. Primitive functions
+
+```python
+# primitives.py
+def copy_to_tp_region(x):
+    """Forward: identity. Backward: all-reduce. (Implemented when training is added.)"""
+    return x
+
+def reduce_from_tp_region(x, torch):
+    """Forward: all-reduce. Backward: identity."""
+    torch.distributed.all_reduce(x, op="sum")
+    return x
+
+def scatter_to_tp_region(x):
+    raise NotImplementedError(
+        "Phase 2: replaced by users creating already-sharded tensors"
+    )
+
+def gather_from_tp_region(x):
+    raise NotImplementedError(
+        "Phase 2: requires an all-gather kernel first (future)"
+    )
+```
+
+### D7. Sample bench: 2-layer MLP with TP
+
+```python
+# benches/tp_mlp.py (new)
+from kernbench.policy.placement.dp import DPPolicy
+import kernbench.tp as tp
+import numpy as np
+
+
+def worker(rank: int, world_size: int, torch):
+    torch.ahbm.set_device(rank)
+    tp.initialize_model_parallel(world_size)
+
+    B, D_in, D_hidden, D_out = 1, 512, 2048, 512
+    fc1 = tp.ColumnParallelLinear(D_in, D_hidden, torch=torch)
+    fc2 = tp.RowParallelLinear(D_hidden, D_out, torch=torch)
+
+    x = torch.zeros(
+        (B, D_in), dtype="f16",
+        dp=DPPolicy(cube="replicate", pe="replicate"),
+        name="x",
+    )
+    # init x with some pattern (e.g., constant)
+    x.copy_(torch.from_numpy(np.full((B, D_in), 0.1, dtype=np.float16)))
+
+    h = fc1.forward(x)      # column-sharded (B, D_hidden / ws)
+    y = fc2.forward(h)      # all-reduced (B, D_out) on every rank
+
+    # Only rank 0 prints / verifies the result
+    if rank == 0:
+        result = y.numpy()
+        # With zero-initialized weights every value is 0; this scope only verifies completion
+        print(f"  tp_mlp: shape={result.shape}, mean={float(result.mean()):.4f}")
+
+
+def run(torch):
+    torch.distributed.init_process_group(backend="ahbm")
+    ws = torch.distributed.get_world_size()
+    torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws)
+```
+
+### D8. Non-functional: training not supported
+
+This ADR covers **inference/forward only**. Backward, gradients, and optimizers
+are future work. This is a natural fit because KernBench does not target
+training today.
+
+### D9. Initial scope constraints
+
+- TP size = world_size (no mixed DP+TP).
+- `scatter_to_tp_region` and `gather_from_tp_region` are unimplemented.
+- **Weights default to zero**. Proper init schemes (Xavier, Kaiming, etc.) are
+  future work. Tests, however, inject a deterministic non-zero pattern via
+  `tensor.copy_` to verify numerical correctness (T2/T6). In other words, the
+  two are kept separate: "production default = zero, verification =
+  deterministic non-zero".
+- Bias is omitted in the initial scope (Megatron's rank-0-only bias policy is
+  future work).
+- Pipeline parallelism is out of scope.
+- VocabParallelEmbedding requires all-gather first, so it is stub only.
+
+### D10. Regression: lifting the `ring_default_ws` xfail (mandatory acceptance)
+
+Thanks to D0 (worker-wait generalization) + D0.5 (host-read barrier), every
+worker-driven `ctx.wait` and host read is routed through the main-drain path,
+which removes the cause of the kernel-greenlet orphan in ADR-0024 Phase B.
+Converting the existing matrix test's `ring_default_ws` strict-xfail case to
+**PASS** once this ADR is implemented is a **mandatory regression criterion**.
+The observable acceptance criteria are specified in **T7** (no deadlock, no
+GreenletExit, numerical tolerance, etc.).
+
+---
+
+## Dependencies
+
+- **ADR-0024** (launcher): rank = SIP, greenlet-local rank,
+  `torch.ahbm.set_device(rank)`.
+- **ADR-0026** (DPPolicy intra-device): representation of per-rank slices of
+  weight tensors.
+- **ADR-0023 / ADR-0025** (IPCQ): foundation of the `dist.all_reduce`
+  implementation.
+
+---
+
+## Non-goals
+
+- **Backward pass / training**: inference only. Training simulation belongs in
+  a separate ADR.
+- **Mixed parallelism (DP + TP + PP)**: pure TP only at first.
+- **Weight init schemes**: plain zero / debug pattern.
+- **Fused ops**: Megatron's fused matmul+bias+gelu is a kernel-level concern.
+- **DTensor integration**: ADR-0028, future.
+- **Host-side `torch.matmul` abstraction**: the TP layer calls the existing
+  gemm kernel via `torch.launch(gemm_kernel, ...)`. No new matmul host op is
+  introduced.
+
+---
+
+## Open questions
+
+- **Location of `initialize_model_parallel`**:
+  `kernbench.tp.initialize_model_parallel` (current decision) vs. real
+  PyTorch's `torch.distributed.init_device_mesh`. Keep it in the TP-specific
+  module.
+- **Weight init**: the ADR uses zero. A debug pattern (e.g., identity) may be
+  needed for meaningful verification; add it in Phase 1 tests if needed.
+- **Bias placement policy**: Megatron places the RowParallelLinear bias on
+  rank 0 only. The initial scope sidesteps this with bias=False.
+- **GEMM kernel location**: `kernbench.tp.kernels._gemm_kernel` vs. importing
+  from the existing `benches/gemm_single_pe.py`. TP must not depend on benches,
+  so the kernel is duplicated inside tp. It can later move to a shared
+  `kernbench.kernels` package.
+
+**Resolved (items that were open in earlier revisions)**:
+- ~~Drain timing when `tensor.numpy()` is called~~ → **decided in D0.5**: the
+  official host-read entry points (`numpy`, `data`, `__getitem__`,
+  data-bearing `__repr__`) act as automatic drain barriers. Metadata-only
+  accessors are not barriers.
+
+---
+
+## Consequences
+
+### Positive
+
+- **Easy porting of Megatron code**: the API matches real training code.
+- **TP benchmarking becomes possible**: studying HW characteristics such as
+  scaling and communication-compute overlap.
+- **`ring_default_ws` xfail lifted**: as a by-product of D0, the ADR-0024
+  Phase B blocker is resolved.
+- **Unified scheduler loop**: introducing D1 (`mp.spawn`) removes the
+  hand-rolled loop. Subsequent collective/TP benches reuse the same pattern.
+- **Clearer DPPolicy semantics** (synergy with ADR-0026): a model example of a
+  TP layer that uses only intra-device DPPolicy.
+
+### Negative
+
+- Maintenance cost of a new module (`kernbench.tp`).
+- The initial scope is limited (pure TP only, forward only).
+- The D0 generalization changes the semantics of `ctx.wait`; compatibility
+  with single-driver tests must be verified explicitly (T7).
+
+### Neutral
+
+- Adds a pure upper layer on top of ADR-0024/0026. No impact on the hardware
+  simulation stack (except D0).
@@ -0,0 +1,256 @@
+# ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
+
+## Status
+
+Accepted (supersedes ADR-0029).
+
+## Context
+
+### Goal
+
+Define a single all-reduce algorithm that exploits the topology hierarchy:
+cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
+one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.
+
+### Why replace ADR-0029 (hierarchical 3-level)
+
+ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
+where every PE in the system participates. In practice this adds the
+intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast)
+without matching the common workload pattern where the tensor is sharded
+**per cube** (not per PE within a cube).
+
+Moreover, the hierarchical design required:
+- per-PE neighbor graph installation (`_build_pe_installs` multi-level)
+- multi-level topology schema (`hierarchical_3level`)
+- `all_pes` mapper + `multi_pe_sip_local` validator infrastructure
+
+The intercube algorithm below removes all of that: **pe0-only same-lane
+intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the
+root cube, then broadcast back. Simpler kernel, simpler wiring, same
+bandwidth characteristics for the common per-cube DP workload.
+
+### Current state
+
+- `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
+- `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip`
+- `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this
+  automatically at `init_process_group` time.
+- Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`,
+  `hierarchical_allreduce` modules and their tests are **removed**.
+
+---
+
+## Decision
+
+### D1. Algorithm structure — 5 phases
+
+For each SIP (launched concurrently by `mp.spawn`):
+
+```
+Phase 1 — Row reduce W → E (cube mesh, pe0 only):
+    col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
+
+Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
+    row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
+    holds the full SIP sum.
+
+Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
+    Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
+    selected by sip_topo_kind (from topology.yaml sips.topology).
+
+Phase 4 — Col broadcast S → N on rightmost column.
+
+Phase 5 — Row broadcast E → W across the cube mesh.
+```
+
+After all phases every cube's pe0 holds the global sum.
+
+The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
+(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
+across topologies; only phase 3 branches. Helper functions
+`_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the
+three exchange patterns.
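
The five phases can be traced with a small host-side model. This is a sketch under stated assumptions, not the kernel: each cube holds one scalar, cubes are numbered row-major, the SIPs are processed sequentially rather than concurrently, and Phase 3 is collapsed into a plain sum over the root cubes instead of an actual ring / torus / mesh exchange. The function name `intercube_allreduce_model` is hypothetical.

```python
def intercube_allreduce_model(sips, mesh_w=4, mesh_h=4):
    """sips: one list per SIP with one scalar per cube (row-major order)."""
    grids = [list(s) for s in sips]
    root = (mesh_h - 1) * mesh_w + (mesh_w - 1)   # SE corner cube
    last_col = mesh_w - 1

    for g in grids:
        # Phase 1: row reduce W -> E; the rightmost column ends with row sums.
        for r in range(mesh_h):
            for c in range(1, mesh_w):
                g[r * mesh_w + c] += g[r * mesh_w + c - 1]
        # Phase 2: column reduce N -> S on the rightmost column.
        for r in range(1, mesh_h):
            g[r * mesh_w + last_col] += g[(r - 1) * mesh_w + last_col]

    # Phase 3: inter-SIP exchange between root cubes (collapsed to a sum here).
    total = sum(g[root] for g in grids)

    for g in grids:
        g[root] = total
        # Phase 4: column broadcast S -> N on the rightmost column.
        for r in range(mesh_h - 2, -1, -1):
            g[r * mesh_w + last_col] = g[(r + 1) * mesh_w + last_col]
        # Phase 5: row broadcast E -> W across the mesh.
        for r in range(mesh_h):
            for c in range(mesh_w - 2, -1, -1):
                g[r * mesh_w + c] = g[r * mesh_w + c + 1]
    return grids

result = intercube_allreduce_model([list(range(16)), [1] * 16])
```

With two SIPs holding `0..15` and all ones, every cube on both SIPs ends with the same global sum, which matches the post-condition stated above.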
+
+### D2. Tensor layout (rank = SIP, per-worker)
+
+Per ADR-0024 rank = SIP at the process-group level. Each worker allocates
+its own cube-mesh-spanning tensor:
+
+```python
+dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
+tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
+```
+
+Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
+each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`.
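
As a quick check of the shard addressing, the sketch below evaluates the formula for a few cubes. It assumes the factor `2` in the formula is the byte size of one `f16` element; `shard_addr` and `F16_BYTES` are names introduced only for this illustration.

```python
F16_BYTES = 2  # assumption: the "* 2" in the formula is sizeof(f16)

def shard_addr(t_ptr, cube_id, n_elem, elem_bytes=F16_BYTES):
    """Byte address of the shard owned by pe0 of `cube_id`."""
    return t_ptr + cube_id * n_elem * elem_bytes

# 16 cubes, 8 elements per shard, tensor base at 0x1000
addrs = [shard_addr(0x1000, cube, 8) for cube in range(16)]
```

Consecutive cubes are `n_elem * 2` bytes apart, so with `n_elem = 8` the shards sit 16 bytes from one another.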
+
+### D3. SFR / IPCQ wiring — `configure_sfr_intercube_multisip`
+
+Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor
+tables for **every cube's pe0 across every SIP**, regardless of which
+cube is the root or which SIP topology is selected. This leaves room for
+the kernel to elect the root cube at runtime (not implemented yet; see
+Non-goals) and supports topology switches without re-wiring.
+
+| Level | Direction labels | Scope |
+|---|---|---|
+| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
+| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per `sips.topology` |
+
+Inter-SIP directions use the `global_*` prefix to keep the namespace
+disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended
+with `global_E ↔ global_W` and `global_N ↔ global_S` so the reverse-
+direction resolver handles 2-SIP bidirectional rings correctly.
+
+Internally the function calls `install_ipcq` with:
+- `world_size = n_sips × n_cubes`
+- `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
+- A closure-captured `neighbors()` function that builds the map above.
+
+This `world_size` is internal to IPCQ wiring and does not leak to the
+process-group rank.
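
The wiring inputs can be sketched on the host as follows. This is an illustration under assumptions rather than the contents of `sfr_config.py`: cube ids are taken to be row-major with row 0 at the north edge and column 0 at the west edge, `global_E` on a ring is taken to mean "next SIP", and the helper names are hypothetical. Only the `rank_to_pe` comprehension and the direction labels come from the text above.

```python
# Direction pairs, including the global_* extension described above.
OPPOSITE_DIR = {
    "N": "S", "S": "N", "E": "W", "W": "E",
    "global_N": "global_S", "global_S": "global_N",
    "global_E": "global_W", "global_W": "global_E",
}

def build_rank_to_pe(n_sips, n_cubes):
    """Internal IPCQ ranks: pe0 of every cube on every SIP."""
    return [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]

def intercube_neighbors(cube, mesh_w=4, mesh_h=4):
    """N/S/E/W neighbor cube ids on a mesh without wrap-around."""
    row, col = divmod(cube, mesh_w)
    out = {}
    if row > 0:
        out["N"] = cube - mesh_w
    if row < mesh_h - 1:
        out["S"] = cube + mesh_w
    if col < mesh_w - 1:
        out["E"] = cube + 1
    if col > 0:
        out["W"] = cube - 1
    return out

def ring_sip_neighbors(sip, n_sips):
    """global_E / global_W peer SIPs for ring_1d (assumed orientation)."""
    return {"global_E": (sip + 1) % n_sips, "global_W": (sip - 1) % n_sips}

rank_to_pe = build_rank_to_pe(n_sips=2, n_cubes=16)
```

On a 2-SIP ring both `global_E` and `global_W` resolve to the same peer, which is the bidirectional case the extended opposite-direction table has to disambiguate.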
+
+### D4. SIP topology — from `topology.yaml`
+
+```yaml
+system:
+  sips:
+    count: 2
+    topology: ring_1d       # or torus_2d, mesh_2d_no_wrap
+```
+
+- `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
+- `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on
+  `global_E/W` then col ring on `global_S/N`.
+- `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce +
+  broadcast per dimension.
+
+2D variants require `n_sips` to be a perfect square.
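
A minimal sketch of how the topology name could be turned into the `(sip_topo_kind, sip_topo_w, sip_topo_h)` triple, including the perfect-square check. The mapping values follow the `{0, 1, 2}` ordering given in D1; the function name `derive_sip_topo` and the choice of `(n_sips, 1)` as the ring's width and height are assumptions made for this example.

```python
import math

TOPO_NAME_TO_KIND = {"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2}

def derive_sip_topo(name, n_sips):
    """Return (kind, width, height) for a SIP topology name."""
    kind = TOPO_NAME_TO_KIND[name]
    if name == "ring_1d":
        # Assumed convention: a ring is a 1-high strip of n_sips SIPs.
        return kind, n_sips, 1
    side = math.isqrt(n_sips)
    if side * side != n_sips:
        raise ValueError(f"{name} requires n_sips to be a perfect square, got {n_sips}")
    return kind, side, side

ring = derive_sip_topo("ring_1d", 2)
torus = derive_sip_topo("torus_2d", 4)
```

Rejecting non-square counts at configuration time keeps the kernel free of partial-row handling.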
+
+### D5. Process-group integration — `AhbmCCLBackend`
+
+At `init_process_group` time the backend:
+
+1. Loads `ccl.yaml` + `topology.yaml`.
+2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
+   `system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`.
+3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time
+   SFR wiring, mirrors NCCL communicator creation.
+
+At each `dist.all_reduce(tensor)` call:
+
+1. Resolves `kernel_fn` from `cfg["module"]`.
+2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from
+   `kernel_args(world_size, n_elem)`.
+3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
+   `sip_rank` is the current greenlet's bound rank.
+4. Launches with `_defer_wait=True`; the main scheduler drains pending
+   handles after all workers submit (per ADR-0027 D0.4).
+
+### D6. Config schema
+
+`ccl.yaml`:
+
+```yaml
+defaults:
+  algorithm: intercube_allreduce
+  buffer_kind: tcm
+  ...
+
+algorithms:
+  intercube_allreduce:
+    module: kernbench.ccl.algorithms.intercube_allreduce
+    topology: none
+    buffer_kind: tcm
+    n_elem: 8
+    root_cube: 15
+```
+
+`topology.yaml`:
+
+```yaml
+system:
+  sips:
+    count: 2
+    topology: ring_1d
+sip:
+  cube_mesh: { w: 4, h: 4 }
+```
+
+### D7. Algorithm module contract
+
+Modules loaded via `cfg["module"]` must export:
+
+| Name | Purpose |
+|---|---|
+| `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
+| `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) |
+| `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code |
+| `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |
+
+---
+
+## Dependencies
+
+- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
+- **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
+- **ADR-0025**: Address-based IPCQ direction matching; extended
+  `_OPPOSITE_DIR` with `global_*` pairs.
+- **ADR-0027**: Worker-wait / collective-pending drain in main scheduler.
+
+## Non-goals
+
+- **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the
+  workload for this algorithm is per-cube DP.
+- **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and
+  `mesh_2d_no_wrap` require `n_sips = k²`.
+- **Pipelined chunks**: single-tile per cube, no pipelining yet.
+- **Root cube runtime election**: the kernel currently uses
+  `root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
+  corner. SFR wiring covers all cubes, so runtime election is a pure kernel
+  change when needed.
+
+---
+
+## Consequences
+
+### Positive
+
+- **Single kernel, single install path** for all-reduce — replaces four
+  removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
+- **Topology-agnostic kernel**: ring / torus / mesh selected via one
+  integer param, no kernel duplication.
+- **Automatic via `dist.all_reduce`**: no bench-level or user-level
+  algorithm selection needed; config-driven end-to-end.
+- **Full SFR wiring**: every cube on every SIP has inter-SIP links
+  available — supports future dynamic root-cube election.
+
+### Negative
+
+- **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
+  shard within one cube across 8 PEs are not addressable by this kernel.
+  Such workloads would need a separate intra-cube all-reduce path (not
+  yet implemented).
+- **`configure_sfr_intercube_multisip` always wires all pe0s**: even if a
+  given run only needs a subset (e.g. 1 SIP, ring only). Install cost is
+  small but not zero.
+
+---
+
+## Affected files
+
+| File | Change |
+|---|---|
+| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
+| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
+| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
+| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
+| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
+| `ccl.yaml` | Single `intercube_allreduce` entry |
+| `topology.yaml` | Added `system.sips.topology` |
+| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
+| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
+| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
+| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
+| Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
@@ -0,0 +1,162 @@
+# ADR-0033 — Latency Model: Assumptions and Known Simplifications
+
+## Status
+
+Accepted
+
+## Context
+
+The simulator is an analytical, event-driven performance model — not a
+cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
+or omitted by design. To keep the model auditable and reviewable as a whole,
+this ADR consolidates the assumptions in one place. Individual component ADRs
+(ADR-0015, ADR-0017, ADR-0004) define the *mechanisms*; this document defines
+the *limits of fidelity*.
+
+## Decisions
+
+### D1. Modeled precisely
+
+- **Per-directed-edge BW occupancy** (FIFO serialization via `available_at`) —
+  ADR-0015 D2.
+- **Per-component switching/overhead latency** (`overhead_ns` attr).
+- **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array
+  with address-based PC selection (ADR-0034 D3). Burst granularity tunable
+  (`burst_bytes`, default 256B). Reads and writes share each PC's
+  `available_at`, because on real HW the command bus is shared per PC.
+- **HBM direction switching penalty mechanism**: per-PC last-direction
+  tracking + configurable `switch_penalty_ns`. Default 0 — see D2.
+- **Wire chunk-streaming (Phase 2c)**: each wire decomposes Transactions
+  with payload into `Flit` objects of `flit_bytes` (default = HBM
+  `burst_bytes` = 256B). The wire emits each flit individually after
+  `prop_ns + flit_nbytes/bw_gbs` so the link's bandwidth throttles
+  flit arrival rate per real-HW wormhole semantics.
+- **Separate Stores per directed edge** (Phase 2c key fix): the wire
+  is the *only* conduit between `src.out_ports[dst]` and
+  `dst.in_ports[src]`. Earlier the two were aliased to the same
+  `simpy.Store`; when the wire put a chunkified flit back, the
+  destination's `fan_in` could pull it before the wire applied
+  bandwidth delay, leaving half the flits bypassing the bottleneck.
+- **Flit-aware pass-through** (`TransitComponent`, `HbmCtrlComponent`):
+  forward each flit serially with per-transaction overhead applied
+  ONCE on the first-flit arrival (header decode model). Subsequent
+  flits pipeline through with no extra delay. Wormhole emerges
+  naturally across multi-hop paths.
+- **HBM CTRL per-flit PC commit**: each flit arriving at HBM CTRL
+  schedules a PC commit at `max(env.now, pc_avail[pc]) + chunk_time`,
+  with the `is_last` flit waiting for the last PC commit before
+  signaling `txn.done`.
+- **Non-flit-aware components (default) reassemble flits at `_fan_in`**
+  before the legacy `_forward_txn` path runs. This
+  preserves backward compatibility for components that have not yet
+  been migrated to flit-aware processing (e.g., `MCpuComponent`,
+  `IoCpuComponent` sub-txn generators). Such components reassemble
+  *once per leg boundary*, NOT per hop; multi-hop wormhole timing
+  through a chain of flit-aware routers is preserved.
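
The wire chunk-streaming arithmetic can be illustrated with a short sketch. It assumes the literal reading of the bullet above, namely that a single serial wire process waits `prop_ns + flit_nbytes / bw_gbs` before emitting each flit, and that 1 GB/s corresponds to 1 byte/ns. The function names are hypothetical and the real wire process may account for propagation delay differently.

```python
def flit_sizes(nbytes, flit_bytes=256):
    """Split a payload into full flits plus an optional tail flit."""
    full, rem = divmod(nbytes, flit_bytes)
    return [flit_bytes] * full + ([rem] if rem else [])

def flit_emit_times(nbytes, bw_gbs, prop_ns=0.0, flit_bytes=256, start_ns=0.0):
    """Emission time of each flit on one serial wire (assumed model)."""
    t = start_ns
    times = []
    for size in flit_sizes(nbytes, flit_bytes):
        t += prop_ns + size / bw_gbs   # bytes / (GB/s) == ns
        times.append(t)
    return times

times = flit_emit_times(1024, bw_gbs=128.0, prop_ns=1.0)
```

A 1024-byte payload becomes four 256-byte flits, and the link bandwidth spaces their arrival instead of delivering the whole payload at once.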
+
+### D2. Approximated (with known directional error)
+
+| Effect | Real HW | Our model | Error direction |
+|--------|---------|-----------|----------------|
+| Router output port arbitration | Round-robin / weighted | Wire edge FIFO + serial worker | Fair when one txn per cycle; multi-stream sharing not modeled at flit level |
+| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Pessimistic for mixed R/W when alternations are dense — default `switch_penalty_ns = 0` assumes ideal scheduler amortizes |
+| Flit ↔ burst granularity | 32B flit < 256B burst | `flit_bytes = burst_bytes = 256B` | Sub-flit fine-grained timing noise; affects very small wire arbitration windows only |
+| Wire-level RR fairness | Per-cycle multi-flow arbitration on shared link | Single serial wire process per edge | Fair only when one transaction is in flight on a given edge at a time. Multi-stream concurrent traffic on the same edge serializes by FIFO order |
+
+### D3. Ignored (out of scope)
+
+- Bank-level row buffer conflict penalty (assume no conflicts — best case;
+  the model has no per-bank state within a PC, so same-bank reuse cannot be
+  detected).
+- HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
+  `burst_time = burst_bytes / pc_bw_gbs`).
+- Refresh, ECC, thermal throttling, power gating.
+- Clock domain crossings, PLL lock time.
+- Upstream backpressure due to downstream buffer occupancy (input ports use
+  unbounded `simpy.Store`).
+- Sub-flit cycle-level arbitration at routers (flit granularity is our
+  smallest unit).
+
+### D4. Workload sensitivity
+
+Workloads where the above simplifications meaningfully affect results:
+
+- **Random scatter/gather**: bank conflict ignored → model optimistic.
+- **Mixed-R/W-intensive workloads** (e.g., GEMM bias accumulation): the HBM
+  scheduler is absent. With default `switch_penalty_ns = 0` we assume ideal
+  amortization; setting it non-zero models pessimistic per-alternation cost.
+- **High concurrency (>10 active flows on one link)**: HoL blocking and VC
+  limits not modeled → model optimistic.
+- **Very small (sub-flit) transactions**: flit quantization noise.
+- **Concurrent multi-flow on a single wire**: wire is serial FIFO at the
+  flit level, so per-flow fairness within a single edge is not modeled.
+  Pre-edge merging (multiple sources arriving at a router and being
+  forwarded to the same downstream wire) is correctly modeled via the
+  flit-aware router's serial worker.
+
+### D5. Verification policy
+
+For workloads in D4, cross-check against real HW or a cycle-accurate
+simulator before drawing absolute-magnitude conclusions. The model remains
+accurate for **relative comparisons** within the modeled regime.
+
+### D6. Future work
+
+Note: multi-stream merging at routers IS modeled correctly — each
+in_port has its own fan_in process, all push to a shared inbox, and
+the router worker forwards in inbox FIFO order. Flits from different
+upstream streams naturally interleave at flit granularity. The items
+below are different concerns, ordered by expected workload impact.
+
+**Higher impact (workload accuracy gap)**:
+
+- [ ] **Bank-level conflict modeling** within a PC (opt-in via
+  `track_banks: true`). Currently we assume no same-bank reuse;
+  random scatter/gather workloads are optimistic here.
+- [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
+  from the design discussion). Default `switch_penalty_ns=0` is the
+  ideal-amortization stand-in; bursty mixed R/W workloads benefit
+  from explicit modeling.
+- [ ] **Backpressure** modeling for finite component buffers. Matters
+  at high concurrency / sustained saturation where buffer occupancy
+  causes upstream stalls.
+- [ ] **Op_log integration with chunk-streaming**: currently op_log
+  fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
+  GemmCmd, MathCmd) which are not chunkified. Integration would
+  require flit-aware components to also emit op_log start/end hooks
+  per transaction (start on first flit, end on is_last).
+
+**Lower impact (academic / specific use cases)**:
+
+- [ ] **Cycle-accurate router arbitration policies** (RR with
+  priorities, age, iSLIP). The FIFO inbox is already approximately
+  fair when flit arrival times differ slightly between streams (the
+  common case for similar-rate workloads). True impact appears only
+  for: (a) priority/QoS modeling, (b) per-stream tail latency
+  analysis under sustained saturation. Not critical for makespan or
+  average-latency studies.
+- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
+  cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
+  per 32B flit. Effect is small for most workloads (sub-flit timing
+  noise on small messages).
+
+## Consequences
+
+- Single review point for all model fidelity questions. Each future PR
+  touching latency must update the relevant section here.
+- Workload-specific magnitude error envelopes are explicit.
+- Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
+  enforces the ADR-0017 D8 invariant in code rather than relying on yaml
+  manual consistency.
+- Wire transfer time is charged once per bottleneck-link transit (Phase 2c
+  per-flit timing) rather than via terminal `drain_ns` injection. Single
+  transactions land at `drain + commit_time + small_overheads`; multi-hop
+  preserves wormhole pipelining; multi-stream merge correctly serializes
+  at the shared wire's FIFO.
+
+## Cross-references
+
+- ADR-0015 — component / port / wire model.
+- ADR-0017 — Cube NOC architecture and HBM connectivity.
+- ADR-0004 — memory semantics, local HBM.
+- ADR-0034 — HBM controller internal design.
@@ -0,0 +1,271 @@
+# ADR-0034: HBM Controller Internal Design
+
+## Status
+
+Accepted
+
+## Context
+
+`HbmCtrlComponent` is the per-PE HBM partition endpoint at the leaf of
+the cube NOC. One instance is created per PE under the topology node
+`sip{S}.cube{C}.hbm_ctrl.pe{idx}` and attaches to that PE's router
+(ADR-0017 D4). The component models per-pseudo-channel (PC) scheduling,
+burst-granular commit timing, address-based PC selection, and response
+routing back to the requester.
+
+This ADR documents the component as currently implemented. ADR-0017 D4/D8
+defines *where* HBM CTRL attaches and *what* aggregate BW it must
+deliver. ADR-0033 D1/D2 defines *what fidelity* of HBM modelling is in
+scope. This ADR fills the gap between those two — the per-instance
+internal scheduling model.
+
+## Decision
+
+### D1. Role
+
+`HbmCtrlComponent` is a per-PE HBM partition endpoint. One instance per
+PE (default 8 per cube, set by `cube.memory_map.hbm_slices_per_cube`)
+attaches to that PE's router via the `peX.hbm` attachment list in
+`cube_mesh.yaml` (ADR-0017 D4). In the default n:1 channel mapping
+(ADR-0017 D8) the instance aggregates `channels_per_pe` pseudo-channels
+into one endpoint.
+
+The component models:
+
+- Per-PC scheduling (D2) with R/W command-bus sharing.
+- Address-based PC selection (D3).
+- Burst-granular commit timing (D4).
+- Flit-aware per-flit PC commit and async finalize (D5, D6).
+- Command-only Transaction handling for read-data drain (D7).
+- Response routing back to the requester (D8).
+
+It does not model:
+
+- Bank-level row-buffer conflicts, refresh, ECC, thermal throttling
+  (ADR-0033 D3).
+- Cross-PE HBM contention beyond its own router edge (handled by the
+  router mesh — ADR-0017 D3).
+- 1:1 channel mode (ADR-0017 D8 future work).
+
+### D2. Per-PC scheduling model
+
+Per-instance state initialised in `start()`:
+
+- `_pc_avail: list[float]` — earliest sim-time each PC is free; length
+  `num_pcs`, initial 0.0.
+- `_pc_last_dir: list["R"|"W"|None]` — direction of the last commit on
+  each PC, used for switch-penalty detection (D4); initial `None`.
+
+`num_pcs` and `burst_bytes` must each be a positive power of two so
+that address-based PC selection (D3) reduces to a shift-and-mask.
+
+Read and write requests share the same `_pc_avail` slot per PC, because the
+real HW per-PC command bus is shared between read and write traffic. Issuing
+a write to PC k therefore delays a subsequent read to PC k by the burst time,
+plus the direction-switch penalty from D4 (0 by default).
+
+Direction `dir` for a request is inferred from the request type:
+
+- `MemoryWriteMsg` → `"W"`.
+- `PeDmaMsg` with `is_write=True` → `"W"`.
+- All others (`MemoryReadMsg`, `PeDmaMsg` read) → `"R"`.
+
+### D3. Address-based PC selection
+
+PC index for an access is derived from the access address by shift and
+mask:
+
+```text
+pc_shift = log2(burst_bytes)         # default 8  (burst=256B)
+pc_mask  = num_pcs - 1               # default 7  (8 PCs)
+pc       = (address >> pc_shift) & pc_mask
+```
+
+Computed once in `start()` from topology config so alternative
+`(burst_bytes, num_pcs)` pairs stay consistent. For the canonical
+default `(256, 8)` this places the PC select field at bits `[10:8]` of
+the HBM byte offset: bits `[7:0]` are within-burst (same PC), bits
+`[10:8]` are the 3-bit PC index, bits `[36:11]` are row/bank/column
+within the PC slice (see `phyaddr.py` comment).
+
+Address-based striping — as opposed to address-blind global
+round-robin — preserves PC parallelism for offset-disjoint concurrent
+transfers: each transfer's bursts land deterministically on the PC set
+implied by its byte addresses, so multi-PE workloads accessing disjoint
+regions do not collide on a single PC.
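
The shift-and-mask selection is small enough to sketch directly. The snippet below is an illustration, not the component's code: `make_pc_selector` is a hypothetical factory that precomputes the shift and mask once, mirroring the "computed once in `start()`" note, and rejects non-power-of-two parameters.

```python
def _is_pow2(v):
    return v > 0 and (v & (v - 1)) == 0

def make_pc_selector(burst_bytes=256, num_pcs=8):
    """Return a function mapping a byte address to a pseudo-channel index."""
    if not _is_pow2(burst_bytes) or not _is_pow2(num_pcs):
        raise ValueError("burst_bytes and num_pcs must be positive powers of two")
    pc_shift = burst_bytes.bit_length() - 1   # log2(burst_bytes)
    pc_mask = num_pcs - 1

    def pc_for_address(address):
        return (address >> pc_shift) & pc_mask

    return pc_for_address

pc_of = make_pc_selector()
# Eight consecutive 256-byte bursts stripe across all eight PCs.
stripe = [pc_of(base) for base in range(0, 8 * 256, 256)]
```

Addresses within the same 256-byte burst map to one PC, and the pattern repeats every `burst_bytes * num_pcs` bytes.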
+
+### D4. Burst granularity and PC commit timing
+
+A single PC commit takes:
+
+```text
+chunk_time = burst_bytes / pc_bw_gbs    # ns
+```
+
+- `burst_bytes` (default 256) is the burst granularity matching the
+  flit size (ADR-0033 D1).
+- `pc_bw_gbs` is **builder-derived** from
+  `hbm_to_router_bw_gbs / num_pcs` (`topology/builder.py`), enforcing
+  the ADR-0017 D8 invariant that aggregate per-PE BW equals the
+  router-to-HBM link BW.
+
+Per-PC commit scheduling for an arriving access on PC `pc` with
+direction `dir`:
+
+```text
+switch_cost = switch_penalty_ns
+              if pc_last_dir[pc] not in (None, dir) else 0
+start  = max(env.now, pc_avail[pc]) + switch_cost
+finish = start + chunk_time
+pc_avail[pc]    = finish
+pc_last_dir[pc] = dir
+```
+
+Default `switch_penalty_ns = 0` — Tier 0 assumption that an ideal HBM
+scheduler amortises R/W switching cost (ADR-0033 D2). Non-zero values
+model pessimistic per-alternation cost.
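
The commit rule above can be exercised without a simulator. The following sketch keeps the same two per-PC arrays and takes the current time as an argument instead of reading `env.now`; the class name `PcSchedule` and its method are hypothetical.

```python
class PcSchedule:
    """Per-PC commit bookkeeping following the D4 rule (illustrative)."""

    def __init__(self, num_pcs=8, burst_bytes=256, pc_bw_gbs=32.0,
                 switch_penalty_ns=0.0):
        self.pc_avail = [0.0] * num_pcs
        self.pc_last_dir = [None] * num_pcs
        self.chunk_time = burst_bytes / pc_bw_gbs   # ns per burst
        self.switch_penalty_ns = switch_penalty_ns

    def commit(self, now, pc, direction):
        """Schedule one burst on `pc`; return its finish time in ns."""
        switched = self.pc_last_dir[pc] not in (None, direction)
        switch_cost = self.switch_penalty_ns if switched else 0.0
        start = max(now, self.pc_avail[pc]) + switch_cost
        finish = start + self.chunk_time
        self.pc_avail[pc] = finish
        self.pc_last_dir[pc] = direction
        return finish

sched = PcSchedule(switch_penalty_ns=2.0)
first_write = sched.commit(0.0, pc=0, direction="W")
read_same_pc = sched.commit(0.0, pc=0, direction="R")
read_other_pc = sched.commit(0.0, pc=1, direction="R")
```

With the default 256-byte burst and 32 GB/s per PC, one commit takes 8 ns; the read that follows a write on the same PC waits for that commit and pays the switch penalty, while the read on another PC proceeds in parallel.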
+
+### D5. Flit-aware per-flit PC commit (primary path)
+
+`_handle_flit` is the primary worker path. For each arriving `Flit`:
+
+1. On the **first** flit of a transaction (`tid = id(txn)` not in
+   `_txn_state`):
+   - Apply `overhead_ns` once via `run(env, nbytes)` — header decode
+     model, first-flit overhead pattern (ADR-0033 D1).
+   - Initialise `_txn_state[tid] = {"last_finish": env.now}`.
+2. Compute `pc = _pc_for_address(flit.address)` (D3).
+3. Apply the per-PC schedule (D4) using the request direction (D2).
+4. Update `state["last_finish"] = max(state["last_finish"], finish)`.
+5. If `flit.is_last`: pop `_txn_state[tid]` and spawn `_finalize_txn`
+   (D6).
+
+Per-flit address-aware commit is the mechanism that lets concurrent
+multi-PE traffic to disjoint HBM offsets pipeline through distinct PCs
+in parallel.
+
+### D6. Async finalize per transaction
+
+When a transaction's last flit has been scheduled, finalisation runs in
+a separately-spawned process:
+
+```python
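+def _finalize_txn(env, txn, last_finish):
+    # Sleep until the transaction's last PC commit has drained.
+    wait = last_finish - env.now
+    if wait > 0:
+        yield env.timeout(wait)
+    # Then route the response back to the requester (D8).
+    yield from _send_response(env, txn)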
+```
+
+`_handle_flit` spawns this via `env.process(...)` and returns
+immediately, so the worker can pick up the next inbox message while the
+last PC commit drains.
+
+Without this split — i.e. if the worker itself did
+`yield env.timeout(wait)` — concurrent single-flit transactions whose
+addresses hit distinct PCs would still serialise at `chunk_time` each
+inside the worker, hiding the PC parallelism that D3 and D5 are
+designed to expose.
+
+### D7. Non-flit fallback for command-only transactions
+
+`_handle_txn` runs when the inbox delivers a `Transaction` rather than a
+`Flit`. This is the path for command-only requests that the wire does
+not chunk into flits — most notably `MemoryReadMsg` whose command txn
+carries `nbytes=0` (data drain is modelled at HBM CTRL post-processing,
+not as inbound flits).
+
+Procedure:
+
+1. `work_bytes = txn.nbytes if txn.nbytes > 0 else int(request.nbytes or 0)`
+   — for read commands, work is sized by the request.
+2. `n_chunks = ceil(work_bytes / burst_bytes)` if `work_bytes > 0` else
+   0.
+3. `chunk_interval = drain_ns / n_chunks` (when both > 0) — chunks are
+   scheduled over time at `drain/n_chunks` ns intervals to model the
+   bottleneck-link's data arrival rate (ADR-0033 D1 chunk-loop drain).
+4. Apply `run(env, txn.nbytes)` once for `overhead_ns`.
+5. For each chunk `i`, advance `chunk_interval` ns then apply the D4
+   schedule with `pc = _pc_for_address(base_address + i * burst_bytes)`.
+6. After scheduling all chunks, wait `last_finish - env.now` then call
+   `_send_response`.
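
Steps 1 to 3 and the per-chunk addresses of step 5 reduce to a few lines of arithmetic. The sketch below computes them for a read command; `plan_chunks` is a hypothetical helper, and the actual time advance and PC commits (steps 4 to 6) are left out.

```python
import math

def plan_chunks(txn_nbytes, request_nbytes, drain_ns, base_address,
                burst_bytes=256):
    """Return (n_chunks, chunk_interval_ns, chunk_addresses)."""
    # Step 1: a command-only txn carries nbytes=0, so size by the request.
    work_bytes = txn_nbytes if txn_nbytes > 0 else int(request_nbytes or 0)
    # Step 2: number of burst-sized chunks.
    n_chunks = math.ceil(work_bytes / burst_bytes) if work_bytes > 0 else 0
    # Step 3: spread chunks over the drain window when both are positive.
    interval = drain_ns / n_chunks if n_chunks > 0 and drain_ns > 0 else 0.0
    # Step 5 (addresses only): one burst-aligned stride per chunk.
    addresses = [base_address + i * burst_bytes for i in range(n_chunks)]
    return n_chunks, interval, addresses

n_chunks, interval, addresses = plan_chunks(0, 1000, 40.0, 0x100)
```

A 1000-byte read rounds up to four bursts, each spaced 10 ns apart over a 40 ns drain, and the four addresses fall on four different PCs under the default D3 mapping.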
+
+`_handle_txn` shares the same `_pc_avail` / `_pc_last_dir` state with
+`_handle_flit` — there is exactly one source of PC scheduling truth
+across both paths.
+
+### D8. Response routing
+
+`_send_response` dispatches on request type and path geometry:
+
+| Case | Trigger | Response |
+| --- | --- | --- |
+| PE_DMA | `isinstance(txn.request, PeDmaMsg)` | New reverse-path Transaction (`is_response=True`, `nbytes=0`), same `done` |
+| Bypass (Memory Read) | `m_cpu` appears in no hop of `txn.path` AND `MemoryReadMsg` | Reverse-path Transaction with `nbytes=request.nbytes` (data return) |
+| Bypass (Memory Write) | `m_cpu` appears in no hop of `txn.path` AND not Memory Read | `txn.done.succeed()` (write completes locally) |
+| Default | otherwise | New `ResponseMsg(correlation_id, request_id, src_cube, src_pe, success=True)` on reverse path |
+
+The "bypass" classification matches the Memory R/W fabric path defined
+in ADR-0015 D4 (PCIE_EP → io_noc → ucie → cube router → hbm_ctrl,
+without M_CPU). The PE_DMA case is its own dedicated reverse-path to
+keep the inner-loop DMA fast (PE_DMA reads/writes do not synthesise a
+ResponseMsg envelope).
+
+In all reverse-path cases, the response Transaction is put onto
+`out_ports[reverse_path[1]]` — the first hop back along the recorded
+forward path. If `reverse_path` has fewer than 2 entries (degenerate
+path), the original `txn.done` is signalled directly.
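The dispatch table above can be read as a small classification function. The sketch below is illustrative only: the message classes are empty stand-ins for the simulator's types, the node ids in the usage note are made up, and the bypass test assumes that "no M_CPU on the path" means no hop id contains the substring `m_cpu`.

```python
from typing import List, Optional


# Stand-in message types (assumption: the real classes carry more fields).
class PeDmaMsg: ...
class MemoryReadMsg: ...
class MemoryWriteMsg: ...
class KernelLaunchMsg: ...


def classify_response(request: object, path: List[str]) -> str:
    """Pick the D8 response case for a completed transaction."""
    if isinstance(request, PeDmaMsg):
        return "pe_dma"
    # Bypass path: the forward path never visited an M_CPU node.
    bypass = not any("m_cpu" in hop for hop in path)
    if bypass and isinstance(request, MemoryReadMsg):
        return "bypass_read"
    if bypass:
        return "bypass_write"
    return "default"


def first_reverse_hop(path: List[str]) -> Optional[str]:
    """First hop back along the forward path, or None if degenerate."""
    reverse_path = list(reversed(path))
    return reverse_path[1] if len(reverse_path) >= 2 else None
```

With a forward path such as `["pcie_ep", "io_noc", "hbm_ctrl.pe0"]`, a `MemoryReadMsg` classifies as `bypass_read` and the response is sent towards `io_noc`.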
+
+### D9. Configurable attributes
+
+| Attribute | Default | Source | Notes |
+| --- | --- | --- | --- |
+| `num_pcs` | 8 | topology cube `hbm_ctrl.attrs` | Must be power of 2 |
+| `pc_bw_gbs` | 32.0 | builder-derived: `hbm_to_router_bw_gbs / num_pcs` | Enforces ADR-0017 D8 invariant |
+| `burst_bytes` | 256 | topology attrs | Must be power of 2; equals `flit_bytes` (ADR-0033 D1) |
+| `switch_penalty_ns` | 0.0 | topology attrs | Tier 0 default; non-zero models pessimistic R/W switching |
+| `efficiency` | 1.0 | topology attrs | Applied at builder time to `hbm_to_router_bw_gbs` (router-edge BW scaling only) |
+| `overhead_ns` | 0.0 | topology attrs | First-flit decode overhead (D5) |
+
+`pc_bw_gbs` is derived by `topology/builder.py` rather than configured
+directly so the aggregate per-PE BW matches the router-to-HBM link BW
+without yaml-side duplication.
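A minimal sketch of the derivation, assuming the builder receives the router-to-HBM link bandwidth as a single number. `derive_pc_bw_gbs` is a hypothetical name, not the function in `topology/builder.py`, and the 256 GB/s input is only inferred from the 32.0 default at 8 PCs.

```python
def derive_pc_bw_gbs(hbm_to_router_bw_gbs: float, num_pcs: int) -> float:
    """Split the router-to-HBM link bandwidth evenly across pseudo-channels."""
    # D9 requires num_pcs to be a power of two.
    if num_pcs <= 0 or (num_pcs & (num_pcs - 1)) != 0:
        raise ValueError(f"num_pcs must be a power of 2, got {num_pcs}")
    return hbm_to_router_bw_gbs / num_pcs


# Illustrative input: 256 GB/s over 8 PCs gives the 32.0 GB/s default.
PC_BW_GBS = derive_pc_bw_gbs(256.0, 8)
```

Deriving the value this way means the sum over all PCs equals the link bandwidth by construction, which is the ADR-0017 D8 invariant the table refers to.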
+
+## Consequences
+
+### Positive
+
+- Address-based PC selection preserves multi-stream HBM parallelism
+  that an address-blind round-robin would collapse — important for
+  multi-PE workloads with disjoint HBM regions.
+- Flit-aware path (D5) + async finalize (D6) preserves wormhole
+  pipelining and exposes PC parallelism for back-to-back single-flit
+  transactions.
+- Single source of PC scheduling truth (D4 mechanism, used by both D5
+  flit path and D7 chunk-loop path).
+- Builder-derived `pc_bw_gbs` enforces ADR-0017 D8 in code, not yaml
+  discipline.
+
+### Negative
+
+- No bank-level conflict modelling within a PC; address-blind to
+  bank/row-buffer reuse (ADR-0033 D3).
+- No HBM scheduler (FR-FCFS / write-buffer / watermark drain); fixed
+  FIFO per PC. Bursty mixed R/W is approximated by `switch_penalty_ns`
+  (ADR-0033 D2).
+- `_txn_state` is a regular dict keyed by `id(txn)`; in-flight state
+  accumulates per concurrent transaction and is removed only on
+  `is_last`. Adequate for current workloads.
+
+## Links
+
+- ADR-0001 (Physical address layout — PC bit field comment)
+- ADR-0015 D4 (Memory R/W fabric path — bypass response case)
+- ADR-0017 D4 (Per-PE HBM partitioning — attachment to PE routers)
+- ADR-0017 D8 (HBM channel mapping mode — n:1 aggregate this ADR
+  implements)
+- ADR-0017 D9 (AddressResolver — `hbm_ctrl.pe{pe_id}` endpoint
+  resolution)
+- ADR-0033 D1 (Modelled precisely — per-PC parallelism, switch penalty,
+  flit-aware PC commit, first-flit overhead, chunk-loop drain)
+- ADR-0033 D2 (Switch-penalty default 0 — ideal scheduler amortisation)
@@ -0,0 +1,286 @@
+# ADR-0035: M_CPU and M_CPU.DMA Component Model
+
+## Status
+
+Accepted
+
+## Context
+
+M_CPU is the cube-level command processor. It receives commands from
+IO_CPU (or from PCIE_EP when the engine routes Memory R/W through
+M_CPU as a fallback), fans them out to the PEs in its cube, and
+aggregates per-PE responses into a single ResponseMsg sent back to
+IO_CPU on the reverse path.
+
+M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W
+fan-out. Per ADR-0015 D5 it is **not** a separate topology node —
+it lives as internal state of `MCpuComponent`.
+
+This ADR documents the M_CPU component implementation that realizes
+those responsibilities, including the three distinct fan-out paths
+(Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource
+model, and the response aggregation contract.
+
+## Decision
+
+### D1. Role
+
+M_CPU has three responsibilities:
+
+1. **Transit forwarding** — when not the terminal hop (e.g., on the
+   reverse response path PE → M_CPU → IO_CPU), forwards Transactions
+   to `next_hop` in their pre-computed path.
+2. **Multi-PE fan-out at terminal hop** — dispatches to one of three
+   fan-out paths based on request type (D2).
+3. **Response aggregation** — collects per-PE responses, sends a
+   single aggregate ResponseMsg back to IO_CPU on the reverse path.
+
+Per invocation (`run()`): applies `overhead_ns` once per incoming
+Transaction.
+
+M_CPU does **not**:
+
+- Decide routing — paths are pre-computed by the router (ADR-0002).
+- Handle PE-internal execution — PE_CPU / PE_SCHEDULER / engines
+  (ADR-0014).
+- Decode addresses — `ctx.resolver.resolve(pa)` returns the per-PE
+  `hbm_ctrl.pe{X}` directly (ADR-0017 D9).
+- Interpret tensor or kernel semantics — fan-out is dispatched by a
+  Python `isinstance` check only.
+
+### D2. Three fan-out paths dispatched by request type
+
+At the terminal hop the worker dispatches by request type:
+
+```python
+# Excerpt from the worker loop; the preceding `if` branches (e.g. D4) are omitted.
+elif self.ctx is not None and txn.request is not None:
+    if isinstance(txn.request, KernelLaunchMsg):
+        env.process(self._kernel_launch_fanout(env, txn))
+    elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
+        env.process(self._mmu_msg_fanout(env, txn))
+    else:
+        env.process(self._dma_fanout(env, txn))
+```
+
+Each path uses a different router method:
+
+- `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()` — the
+  M_CPU-specific DMA path that avoids PE pipeline nodes.
+- `_kernel_launch_fanout` uses `ctx.router.find_node_path()` — the
+  generic NOC command path to PE_CPU.
+- `_mmu_msg_fanout` uses `ctx.router.find_node_path()` — NOC command
+  path to PE_MMU.
+
+### D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)
+
+`MCpuComponent.start()` initializes two SimPy resources:
+
+```python
+self._dma_write = simpy.Resource(env, capacity=1)  # MemoryWriteMsg
+self._dma_read  = simpy.Resource(env, capacity=1)  # MemoryReadMsg
+```
+
+Properties:
+
+- **Not a topology node** — managed entirely inside `MCpuComponent`;
+  does not appear in `topology.yaml` or in the compiled graph.
+- **Independent read and write channels** — concurrent in-flight
+  Memory R/W is allowed.
+- **Capacity=1 per channel** serializes the **dispatch step**
+  (`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory
+  R/W requests at this M_CPU. Actual fabric transfer time is modeled
+  by wire processes between components (ADR-0015 D2) and by
+  `drain_ns` at terminal hops; the DMA resource does not gate
+  transfer duration.
+
+Resource selection is request-type-based:
+
+```python
+dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
+```
+
+### D4. Transit forwarding at non-terminal hops
+
+When `txn.next_hop` is not None — typical for the reverse response
+path (PE → M_CPU → IO_CPU) — the worker forwards normally:
+
+```python
+if next_hop:
+    yield self.out_ports[next_hop].put(txn.advance())
+```
+
+The fan-out branches fire only at the terminal hop. The same component
+therefore serves both forward command dispatch and reverse response
+relay roles.
+
+### D5. DMA fan-out (`_dma_fanout` — Memory R/W)
+
+For each Memory R/W request at terminal hop:
+
+1. `_resolve_dma_destinations(request)` returns a per-PE
+   `hbm_ctrl.pe{X}` derived from the request's PA via
+   `ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9).
+2. For each destination:
+   - Acquire the appropriate DMA resource (`_dma_write` or
+     `_dma_read`) via `with dma_res.request() as req`.
+   - Resolve path via `ctx.router.find_mcpu_dma_path()`.
+   - Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`.
+   - Create sub-Transaction carrying `drain_ns` and dispatch to
+     `path[1]`.
+3. Track `max_drain_ns` across destinations and record it as
+   `txn.result_data["xfer_ns"]` after all responses arrive.
+4. After all per-PE responses are collected (D8), send an aggregate
+   ResponseMsg on the reverse command path back to IO_CPU.
+
+The PA-decode fallback (`f"{cube_prefix}.hbm_ctrl"`) is legacy dead
+code: no such node exists after ADR-0017 D4's per-PE partitioning. It
+is kept defensively but does not route to a real destination.
+
+### D6. Kernel launch fan-out (`_kernel_launch_fanout`)
+
+For `KernelLaunchMsg` at terminal hop:
+
+1. `_resolve_pe_ids(target_pe)` → list of PE ids in this cube.
+2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via
+   `ctx.router.find_node_path()`.
+3. **`target_start_ns` handling** (ADR-0009 D5):
+   - If the request already carries `target_start_ns` (stamped by
+     IO_CPU per ADR-0036 D3): **pass through unchanged**.
+   - If absent (direct-to-M_CPU launch in unit tests): compute a
+     per-cube barrier `env.now + max(per-PE leg latency)` and stamp
+     via `dataclasses.replace`.
+4. Dispatch sub-Transactions with `nbytes=0` (kernel launch is a
+   control message; preserving nbytes=0 keeps fan-out off the shared
+   first-hop fabric BW, mirroring ADR-0036 D4).
+5. After all per-PE responses arrive (D8), aggregate per-PE metrics
+   from each sub-Transaction's `result_data` into the parent
+   transaction:
+
+   ```python
+   txn.result_data["pe_exec_ns"]  = max(existing, max(pe_exec_values))
+   txn.result_data["dma_ns"]      = max(existing, max(dma_values))
+   txn.result_data["compute_ns"]  = max(existing, max(compute_values))
+   ```
+
+   The max-merge with the existing value matters because cross-cube
+   IO_CPU fan-out shares the same parent `result_data`; merging
+   prevents one cube from clobbering another's metric.
+6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
+
+### D7. MMU map/unmap fan-out (`_mmu_msg_fanout`)
+
+For `MmuMapMsg` / `MmuUnmapMsg` at terminal hop:
+
+1. `_resolve_pe_ids(target_pe)` → PE ids.
+2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via
+   `find_node_path()`.
+3. Dispatch sub-Transactions with `nbytes=0`.
+4. PE_MMU is a terminal node — it does **not** send a ResponseMsg
+   back. Instead, the sub-Transaction's own `sub_done` event is the
+   completion signal.
+5. Wait for all `sub_done` events in-line (does **not** use
+   `_pending` counter — D8 is for response-bearing fan-out only).
+6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
+
+### D8. Response aggregation (`_pending` + `_parent_txns`)
+
+For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg
+arriving on the reverse path):
+
+```python
+self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
+self._parent_txns: dict[str, Any] = {}
+```
+
+- On dispatch: register `(expected, received=0, all_done)` and
+  remember the parent transaction.
+- `_worker` recognises responses by `is_response=True` and routes
+  them to `_collect_response`, which increments `received` and
+  signals `all_done` when `received >= expected`.
+- After `yield all_done`, the fan-out path constructs the aggregate
+  ResponseMsg:
+
+  ```python
+  resp_msg = ResponseMsg(
+      correlation_id=request.correlation_id,
+      request_id=request.request_id,
+      src_cube=cube_id,
+      src_pe=-1,             # -1 = M_CPU aggregate, not a single PE
+      success=True,          # no failure semantics implemented
+  )
+  ```
+
+- The response Transaction travels on `list(reversed(txn.path))`
+  back to IO_CPU.
+
+MMU fan-out (D7) uses a simpler in-line list of `sub_done` events
+because PE_MMU is terminal — there is no ResponseMsg path to
+intercept.
+
+### D9. Helpers and configurable attribute
+
+`_resolve_pe_ids(target_pe)`:
+
+- `int` → `[target_pe]`
+- `tuple[int, ...]` → `list(target_pe)`
+- `"all"` → `range(n_slices)` where `n_slices` comes from cube
+  `memory_map.hbm_slices_per_cube` (default 8).
+
+Used by kernel-launch and MMU fan-out paths.
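The three cases can be sketched as a standalone function. This is an illustration under assumptions: in the component, `n_slices` is read from the cube's `memory_map.hbm_slices_per_cube`, whereas here it is a plain parameter, and the rejection of other strings is an added guard.

```python
from typing import List, Tuple, Union


def resolve_pe_ids(target_pe: Union[int, Tuple[int, ...], str],
                   n_slices: int = 8) -> List[int]:
    """Expand a target_pe selector into an explicit list of PE ids."""
    if isinstance(target_pe, str):
        if target_pe != "all":
            raise ValueError(f"unsupported target_pe: {target_pe!r}")
        # "all" means every PE slice in this cube.
        return list(range(n_slices))
    if isinstance(target_pe, int):
        return [target_pe]
    # Tuple of explicit PE ids.
    return list(target_pe)
```

Returning a list in every case lets the fan-out loops iterate the result more than once, for example once for path lookup and once for dispatch.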
+
+Single configurable attribute drives per-instance latency:
+
+| Site | impl name | overhead_ns |
+| --- | --- | --- |
+| Cube `m_cpu` | `builtin.m_cpu` | 5.0 |
+
+Applied once in `run()` per Transaction — models command
+interpretation and dispatch-decision time at M_CPU.
+
+## Consequences
+
+### Positive
+
+- Three fan-out paths are clearly separated by request type — adding
+  a new request kind is an isinstance branch + one fan-out method.
+- M_CPU.DMA channels are independent (read and write run concurrently)
+  and serialize only the dispatch step at capacity=1.
+- Transit-vs-terminal behavior is a single `if next_hop` check, so
+  the same component handles forward dispatch and reverse response
+  relay without role duplication.
+- `target_start_ns` passthrough (D6) preserves the cross-cube barrier
+  established by IO_CPU (ADR-0036 D3), while the fallback computation
+  keeps direct-to-M_CPU unit tests working.
+- Per-PE metric `max`-merge against existing parent `result_data`
+  values is robust to cross-cube IO_CPU fan-out sharing the same
+  parent.
+
+### Negative
+
+- No partial-failure semantics — a missing per-PE response stalls the
+  parent `all_done` indefinitely. Acceptable for simulation; not
+  suitable as a production-style endpoint.
+- `_resolve_dma_destinations`'s cube-wide hbm_ctrl fallback is dead
+  code (no such node exists post-ADR-0017 D4). Kept defensively;
+  invites confusion and merits a follow-up cleanup.
+- DMA resource serialization applies only at dispatch (the `put` call
+  is instantaneous in unbounded stores). The capacity=1 channel
+  models "one dispatch at a time per channel at this M_CPU", not
+  "transfer duration serialization"; readers must consult wire
+  processes (ADR-0015 D2) and `drain_ns` for actual transfer
+  parallelism.
+
+## Links
+
+- ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
+- ADR-0009 D5 (`target_start_ns` — passed through unchanged when
+  present; computed as per-cube barrier when absent)
+- ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out
+  point)
+- ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same
+  contract at cube level)
+- ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a
+  topology node)
+- ADR-0017 D9 (AddressResolver returns per-PE `hbm_ctrl.pe{X}`)
+- ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes
+  through unchanged; nbytes=0 invariant preserved through fan-out)
@@ -0,0 +1,216 @@
+# ADR-0036: IO_CPU Component Model
+
+## Status
+
+Accepted
+
+## Context
+
+IO_CPU is the IO chiplet's host-facing endpoint inside the simulation
+graph. PCIE_EP receives host messages from the runtime API and routes
+them via the io_noc; for command-bearing requests (KernelLaunch,
+MmuMap/Unmap) the io_noc forwards to IO_CPU, which:
+
+- Fans out the request to per-cube M_CPUs.
+- Aggregates per-cube responses into a single host-visible completion.
+- For kernel launches, stamps a global `target_start_ns` barrier so
+  every PE across every targeted cube begins kernel body execution at
+  the same simulated time (ADR-0009 D5).
+
+Memory R/W traffic bypasses IO_CPU per ADR-0015 D4 / ADR-0016 D3;
+this component therefore handles only command-plane traffic in normal
+operation.
+
+This ADR documents the IO_CPU component implementation that realizes
+those responsibilities.
+
+## Decision
+
+### D1. Role
+
+IO_CPU is the host-facing endpoint of the IO chiplet. It has two
+primary responsibilities:
+
+1. **Multi-cube fan-out** — distribute KernelLaunchMsg / MmuMapMsg /
+   MmuUnmapMsg to per-cube M_CPUs.
+2. **Response aggregation** — collect per-cube ResponseMsg, signal
+   parent `txn.done` when all targeted cubes have responded.
+
+A third, narrower responsibility applies only to KernelLaunchMsg:
+**`target_start_ns` global barrier stamping** (D3).
+
+The component does **not**:
+
+- Decide routing — paths are pre-computed by the router (ADR-0002).
+- Decode tensor or kernel internals — those concerns belong to
+  M_CPU / PE_CPU / engines.
+- Handle PE-level fan-out — M_CPU fans out within a cube (ADR-0009 D3).
+- Handle Memory R/W data path — those bypass IO_CPU per ADR-0015 D4
+  and ADR-0016 D3 (Memory R/W resolution code in
+  `_resolve_cube_targets` exists as a defensive fallback only).
+
+Per invocation (`run()`): applies the configured `overhead_ns` once
+per incoming Transaction (D8).
+
+### D2. Forward path — multi-cube fan-out
+
+When a non-response Transaction arrives, the worker:
+
+1. Pays `overhead_ns` via `run()`.
+2. Calls `_resolve_cube_targets` to derive the list of `(sip, cube)`
+   targets from the request (D5).
+3. For each target:
+   - Resolves M_CPU node id via `ctx.resolver.find_m_cpu(sip, cube)`.
+   - Resolves the path via `ctx.router.find_node_path(io_cpu, m_cpu)`.
+   - Creates a per-cube sub-Transaction with `path` populated and
+     forwards it to `path[1]` (the first hop on the io_noc).
+4. Registers aggregation state: `_pending[request_id] = (expected,
+   received=0, parent_done)`.
+
+### D3. KernelLaunch `target_start_ns` global barrier (ADR-0009 D5)
+
+IO_CPU is the canonical stamper for `target_start_ns`. When the
+request is a `KernelLaunchMsg`, IO_CPU computes a single global
+barrier covering every targeted PE across every targeted cube:
+
+```text
+for (sip, cube) in cube_targets:
+    leg1 = compute_path_latency_ns(io_cpu → m_cpu(sip, cube), nbytes=0)
+    for pe_id in target_pe_ids:
+        leg2 = compute_path_latency_ns(m_cpu → pe_cpu(sip, cube, pe_id),
+                                       nbytes=0)
+        latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns
+        global_max = max(global_max, latency)
+
+target_start_ns = env.now + global_max
+```
+
+The request is then replaced (via `dataclasses.replace`) so the
+stamped value propagates through the fan-out.
+
+Two overhead corrections:
+
+- `io_overhead_ns` is subtracted because IO_CPU has already paid it
+  in `run()` before this method runs.
+- `m_overhead_ns` is subtracted once because it appears as the
+  endpoint of leg1 *and* the start of leg2 in path latency, but
+  M_CPU pays it only once at run time.
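A worked example of the barrier arithmetic, as a hedged sketch: the leg latencies below are invented numbers, while the two overheads are the configured defaults from the M_CPU and IO_CPU component tables (5.0 ns and 10.0 ns). `global_barrier_ns` is a hypothetical helper that takes precomputed leg latencies instead of calling `compute_path_latency_ns`.

```python
from typing import Dict, List, Tuple


def global_barrier_ns(now_ns: float,
                      legs: Dict[Tuple[int, int], Tuple[float, List[float]]],
                      io_overhead_ns: float,
                      m_overhead_ns: float) -> float:
    """legs maps (sip, cube) -> (leg1_ns, [leg2_ns for each target PE])."""
    global_max = 0.0
    for leg1, leg2_per_pe in legs.values():
        for leg2 in leg2_per_pe:
            # IO_CPU overhead is already paid; M_CPU overhead is counted in
            # both legs but paid once, so each is subtracted one time.
            latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns
            global_max = max(global_max, latency)
    return now_ns + global_max


# Illustrative latencies for two cubes with two target PEs each.
LEGS = {
    (0, 0): (30.0, [12.0, 14.0]),
    (0, 1): (42.0, [12.0, 20.0]),
}
TARGET_START_NS = global_barrier_ns(100.0, LEGS, io_overhead_ns=10.0,
                                    m_overhead_ns=5.0)
```

The slowest leg here is cube 1, second PE: 42 + 20 - 10 - 5 = 47 ns, so every PE is told to start at 147 ns.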
+
+Every downstream PE_CPU yields until `target_start_ns` before
+beginning kernel body execution; all PEs therefore start at the same
+simulated time regardless of how long their individual dispatch path
+took.
+
+### D4. KernelLaunch sub-Transactions carry `nbytes=0`
+
+Per-cube sub-Transactions for KernelLaunchMsg force `nbytes=0`,
+overriding the parent `txn.nbytes`:
+
+- Kernel launch is a control message; payload size is irrelevant at
+  the data-fabric level.
+- If `nbytes > 0`, every per-cube sub-txn occupies fabric BW on the
+  io_noc's shared first hop. With 16 cubes this serializes fan-out,
+  pushing far M_CPUs past `target_start_ns` and breaking the D3
+  invariant.
+
+Non-KernelLaunch sub-Transactions preserve `txn.nbytes` (only relevant
+for the defensive Memory R/W fallback path, which carries actual
+payload sizes).
+
+### D5. Per-request-type cube target resolution
+
+`_resolve_cube_targets` dispatches by request type:
+
+| Request type | Source of `(sip, cube)` | `target_cubes="all"` semantics |
+| --- | --- | --- |
+| `MemoryWriteMsg` | `dst_sip`, `dst_cube` (or `PhysAddr.decode(dst_pa).die_id` fallback) | single cube derived from PA decode |
+| `MemoryReadMsg` | `src_sip`, `src_cube` (or `PhysAddr.decode(src_pa).die_id` fallback) | single cube derived from PA decode |
+| `KernelLaunchMsg` | tensor shards filtered by `shard.sip == my_sip` | every cube that owns a shard on this SIP |
+| `MmuMapMsg` / `MmuUnmapMsg` | `target_cubes` list, filtered to this SIP | `range(cubes_per_sip)` from spec |
+
+Each IO_CPU instance fans out only within its own SIP — `_my_sip()`
+parses the SIP id from the node id (e.g., `sip0.io0.io_cpu` → 0).
+
+The Memory R/W rows exist for defensive completeness; the engine's
+normal path routes Memory R/W via `_process_memory_direct()` /
+`find_memory_path()`, bypassing IO_CPU entirely (ADR-0015 D4 /
+ADR-0016 D3).
+
+### D6. Response aggregation
+
+`_pending: dict[request_id → (expected, received, parent_done)]`:
+
+- On dispatch: register `(len(cube_targets), 0, txn.done)`.
+- `_worker` recognises responses by `is_response=True` and routes
+  them to `_collect_response`.
+- `_collect_response` increments `received`; when `received >=
+  expected`, `parent_done.succeed()` is invoked and the entry is
+  removed from `_pending`.
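The counter contract can be sketched without SimPy. This is an assumption-laden illustration: `PendingTable` is a hypothetical class, and a plain callback stands in for `parent_done.succeed()`.

```python
from typing import Callable, Dict, List


class PendingTable:
    """Per-request response counter in the style of `_pending`."""

    def __init__(self) -> None:
        # request_id -> [expected, received, on_done]
        self._pending: Dict[str, List] = {}

    def register(self, request_id: str, expected: int,
                 on_done: Callable[[], None]) -> None:
        self._pending[request_id] = [expected, 0, on_done]

    def collect(self, request_id: str) -> bool:
        """Count one response; return True when the request completes."""
        entry = self._pending[request_id]
        entry[1] += 1
        if entry[1] >= entry[0]:
            # All targets have responded: drop the state, then signal.
            del self._pending[request_id]
            entry[2]()
            return True
        return False
```

Because only a count is kept, a duplicate response from one cube is indistinguishable from a response from another cube, which matches the "no per-cube identity tracking" limitation.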
+
+This is a simple per-request counter. There is no per-cube identity
+tracking and no partial-failure handling: a missing response stalls
+the parent `done` event indefinitely. Production-style failure paths
+are out of scope for the current simulator model.
+
+### D7. `target_pe` resolution helper
+
+`_resolve_pe_ids(target_pe)`:
+
+- `int` → `[target_pe]`.
+- `tuple[int, ...]` → `list(target_pe)`.
+- `"all"` → `range(n_slices)`, where `n_slices` comes from cube
+  `memory_map.hbm_slices_per_cube` (default 8).
+
+Used in D3's barrier computation to enumerate every PE target per
+cube.
+
+### D8. Configurable `overhead_ns`
+
+A single attribute drives per-instance latency:
+
+| Site | impl name | overhead_ns |
+| --- | --- | --- |
+| IO chiplet `io_cpu` | `builtin.io_cpu` | 10.0 |
+
+Applied once in `run()` per Transaction. Models command
+interpretation + dispatch-decision time at IO_CPU.
+
+## Consequences
+
+### Positive
+
+- Cross-cube and cross-SIP kernel launches share a single global
+  barrier (D3 + D4) — no per-cube divergence in start time.
+- nbytes=0 invariant keeps fan-out off the shared first-hop fabric
+  BW, preserving the barrier's accuracy at scale (16 cubes).
+- Response aggregation via a single counter → minimal state,
+  deterministic ordering of completion.
+- Per-SIP scoping (`_my_sip()`) keeps IO_CPUs in different SIPs
+  cleanly independent.
+
+### Negative
+
+- No partial-failure semantics — a missing per-cube response
+  indefinitely stalls the parent. Adequate for simulation but not
+  suitable as a production-style endpoint.
+- `_pending` is a regular dict; in-flight requests accumulate state.
+  Acceptable for current benchmark workloads (few concurrent
+  outstanding launches); unbounded in principle.
+- The Memory R/W resolution branches in `_resolve_cube_targets` are
+  dead code in the normal engine path. Kept defensively but invite
+  drift if the bypass path ever changes.
+
+## Links
+
+- ADR-0002 (Routing distance — path computation)
+- ADR-0009 D1 (Kernel launch is an endpoint request to IO_CPU)
+- ADR-0009 D3 (M_CPU fans out within a cube; IO_CPU fans out across
+  cubes)
+- ADR-0009 D5 (target_start_ns canonical stamping at IO_CPU)
+- ADR-0011 D-VA3 (MmuMapMsg routes through IO_CPU for cube fan-out)
+- ADR-0012 (Host ↔ IO_CPU message schema)
+- ADR-0015 D4 (Memory R/W bypasses IO_CPU; Kernel Launch via IO_CPU)
+- ADR-0016 D1 (IO chiplet io_noc — IO_CPU attaches here)
+- ADR-0016 D3 (Memory R/W path bypasses IO_CPU)
+- ADR-0016 D4 (Kernel Launch path through IO_CPU for command
+  interpretation)
@@ -0,0 +1,200 @@
+# ADR-0037: Forwarding Component (forwarding_v1)
+
+## Status
+
+Accepted
+
+## Context
+
+The simulation graph has many node positions that exist purely to model
+fabric traversal — NOC mesh routers, switches, UCIe protocol endpoints,
+IO chiplet io_noc, transit cubes. These share a common pattern: receive
+a message, apply per-component overhead (modeling header decode +
+routing decision time), forward to the next hop along the pre-computed
+path.
+
+This ADR defines the contract for these transit nodes: a single
+component type (`TransitComponent`) that handles flit-aware forwarding
+with wormhole cut-through semantics, used under multiple impl names
+according to the conceptual role each instance plays.
+
+## Decision
+
+### D1. Role
+
+The Forwarding component (`TransitComponent` class) is a **stateless
+transit node** in the simulation graph. It models any fabric position
+where a message physically traverses but no semantic processing
+happens.
+
+Per traversal, the component:
+
+1. Reads an incoming Transaction or Flit from an `in_port`.
+2. Applies the configured per-component overhead (`overhead_ns`),
+   applied **once per Transaction** even across multi-flit payloads
+   (see D2).
+3. Looks up the next hop along the Transaction's pre-computed `path`.
+4. Forwards to the corresponding `out_port`; at the terminal node
+   (no next hop), signals `txn.done` once the `is_last` flit arrives.
+
+The component **does NOT**:
+
+- Decide routing — paths are pre-computed by the router (ADR-0002 /
+  ADR-0017 D2). Forwarding only executes the per-hop step.
+- Model wire propagation or bandwidth occupancy — separate wire
+  processes between components handle that (ADR-0015 D2).
+- Resolve addresses — the AddressResolver does that (ADR-0017 D9).
+- Aggregate completion — terminal endpoints (IO_CPU, M_CPU, HBM_CTRL)
+  handle that.
+
+### D2. First-flit overhead model (header decode)
+
+Per-Transaction `overhead_ns` is applied **exactly once**, at first
+flit arrival:
+
+- `_txn_decoded: set[int]` tracks which Transactions have already
+  paid the overhead at this node.
+- On first-flit arrival for a Transaction: `yield self.run(env,
+  msg.txn.nbytes)` — pays the overhead.
+- Subsequent flits of the same Transaction skip the overhead — they
+  pipeline through with no extra delay.
+- On `is_last` flit: remove the Transaction from `_txn_decoded`.
+
+This models the real-HW behavior where header decode and routing
+decision happen once on first flit; payload flits then stream through
+the same path (wormhole cut-through). Multi-hop pipelining emerges
+naturally: each hop adds its own first-flit overhead, but later flits
+of the same Transaction do not pay it again at any hop the first flit
+has already passed.
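The bookkeeping around `_txn_decoded` can be sketched as a pure function. This is a simplified illustration: `overhead_for_flit` is a hypothetical name, the set is passed in explicitly, and the returned delay stands in for the `yield self.run(...)` call.

```python
from typing import Set


def overhead_for_flit(decoded: Set[int], txn_id: int, is_last: bool,
                      overhead_ns: float) -> float:
    """Return the delay this flit pays at the node and update the set."""
    first = txn_id not in decoded
    if first:
        # Header decode happens once, on the first flit of the Transaction.
        decoded.add(txn_id)
    if is_last:
        # The last flit clears the entry so the set does not grow unbounded.
        decoded.discard(txn_id)
    return overhead_ns if first else 0.0
```

A single-flit Transaction is both first and last, so it pays the overhead and leaves no entry behind.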
+
+### D3. Serial worker forwarding (preserves order)
+
+The component's worker is a single SimPy process that consumes flits
+from `_inbox` and forwards them serially in arrival order. The
+component does NOT spawn `env.process(...)` per flit.
+
+Rationale: if the first flit yields on `overhead_ns` while subsequent
+flits run in parallel processes, the later flits can overtake the
+first. This produces out-of-order delivery and lets the `is_last`
+flit arrive at the destination before the first flit — corrupting
+both the transaction's completion semantics and any flit-index-based
+processing downstream.
+
+### D4. Path-based next-hop routing
+
+Routing is **not** a Forwarding-component concern. The Transaction
+arrives with a pre-computed `path` (built by the router; ADR-0002 /
+ADR-0017 D2). The component just looks up its own position in the
+path and forwards to `path[index + 1]`:
+
+```python
+def _next_hop_in_path(self, txn):
+    my_id = self.node.id
+    path = txn.path
+    for i, n in enumerate(path):
+        if n == my_id and i + 1 < len(path):
+            return path[i + 1]
+    return None
+```
+
+If `next_hop` is found and present in `out_ports`, the flit is
+forwarded. Otherwise (terminal node), `txn.done.succeed()` is
+invoked when the `is_last` flit arrives.
+
+### D5. Flit-aware mode with Non-Flit fallback
+
+`_FLIT_AWARE = True` opts this component out of the base class's
+flit-reassembly logic in `_fan_in`. Flits are placed directly on
+`_inbox` (no reassembly), enabling per-flit handling in the worker
+loop (D2, D3).
+
+Non-Flit messages — zero-byte control Transactions and other
+non-chunkified payloads — fall through to the base class's legacy
+`_forward_txn` path via `env.process`. This preserves backward
+compatibility for control-plane traffic that does not benefit from
+flit-level processing.
+
+### D6. Multi-stream merging at the base class
+
+Multi-stream FIFO merging at routers is the base class's
+responsibility, not Forwarding's. The base class's `_fan_in` spawns
+one process per `in_port`; all push to a single shared `_inbox`.
+Flits from different upstream streams therefore interleave at
+flit granularity in `_inbox`'s FIFO order.
+
+The Forwarding worker simply consumes `_inbox` in arrival order —
+correctly modeling per-router multi-flow arbitration as
+fair-FIFO over the shared inbox.
+
+### D7. Single implementation under multiple impl names
+
+A single `TransitComponent` class is registered under four impl names
+in `components.yaml`:
+
+- `builtin.forwarding` — generic forwarding (e.g., `io_noc`,
+  `noc_router`, UCIe conn bridges)
+- `builtin.switch` — tray-level switch
+- `builtin.noc` — cube-level NOC fabric (legacy singleton; current
+  NOC routers use `builtin.forwarding`)
+- `builtin.ucie` — UCIe protocol endpoint
+
+All four aliases instantiate the same class with the same behavior.
+Per-instance differentiation lives only in `attrs.overhead_ns`.
+Separate impl names exist as intent tags for readability and to
+allow future divergence without backward-incompatible config
+changes.
+
+### D8. Configurable `overhead_ns`
+
+A single attribute drives per-instance latency:
+
+| Usage site | impl name | overhead_ns |
+| --- | --- | --- |
+| Tray-level switch | `builtin.switch` | 5.0 |
+| Cube NOC router | `builtin.forwarding` | 2.0 |
+| IO chiplet io_noc | `builtin.forwarding` | 0.0 |
+| UCIe protocol endpoint (`ucie-{N,S,E,W}`) | `builtin.ucie` | 8.0 |
+| UCIe conn bridge (`ucie-{PORT}.conn{N}`) | `builtin.forwarding` | 0.0 |
+
+Default is 0.0. The attribute is read at each `run()` invocation, so
+dynamic reconfiguration is possible but not currently used.
+
+## Consequences
+
+### Positive
+
+- A single class handles all transit-node roles in the simulation
+  graph — minimal code surface for a high-population component type.
+- Flit-aware processing + serial worker preserves wormhole semantics
+  across multi-hop paths without per-flit process overhead.
+- `overhead_ns` is the only per-instance tunable; routing, BW, and
+  address resolution stay cleanly separated in their own components /
+  modules.
+- Multi-stream merging emerges from the base-class structure; no
+  router-specific logic duplicates fair-FIFO arbitration.
+- Non-Flit fallback path keeps control-plane traffic working without
+  forcing every message into the flit framework.
+
+### Negative
+
+- The single class hides usage-site intent inside `attrs.overhead_ns`
+  configuration; readers must consult `topology.yaml` +
+  `components.yaml` to see which impl name maps to which behavior
+  class.
+- Per-flit serial worker is a bottleneck if `overhead_ns` is large
+  and many concurrent transactions arrive at the same router; current
+  values (0–8 ns) make this negligible.
+
+## Links
+
+- ADR-0002 (Routing distance — path computation)
+- ADR-0015 D1 (Component port model)
+- ADR-0015 D2 (Wire process — BW + propagation, separate from this
+  component)
+- ADR-0015 D6 (Transit cube forwarding pattern)
+- ADR-0016 D1 (IO chiplet io_noc — uses this component)
+- ADR-0017 D1 (Cube NOC routers — use this component)
+- ADR-0017 D6 (UCIe decomposition — `ucie-{PORT}` instances use this
+  component)
+- ADR-0033 D1 (Flit-aware pass-through, first-flit overhead,
+  multi-stream merge semantics)
@@ -18,7 +18,7 @@ We define stable, minimal message schemas for Host ↔ IO_CPU so that:
 - IO_CPU-internal fan-out/aggregation can evolve independently,
 - completion and failure propagation is deterministic.

-We also require PE-tagging (A 방식): each shard explicitly carries (sip,cube,pe)
+We also require PE-tagging (Scheme A): each shard explicitly carries (sip,cube,pe)
 so IO_CPU can deterministically route/fan-out without relying on PA decoding.

 ---
@@ -93,7 +93,7 @@ Rules:
 Mandatory fields:

 - common envelope fields (D3)
- destination placement tags (A 방식):
+- destination placement tags (Scheme A):
  - `dst_sip: int`
  - `dst_cube: int`
  - `dst_pe: int`
@@ -130,7 +130,7 @@ Notes:
 Mandatory fields:

 - common envelope fields (D3)
- source placement tags (A 방식):
+- source placement tags (Scheme A):
  - `src_sip: int`
  - `src_cube: int`
  - `src_pe: int`
@@ -183,7 +183,7 @@ Tensor arg (mandatory):

 - `shards: list[TensorShard]`

-`TensorShard` MUST have (A 방식 강제):
+`TensorShard` MUST have (Scheme A enforced):

 - `sip: int`
 - `cube: int`
@@ -1,519 +0,0 @@
-# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
-
-## Status
-
-Accepted
-
-## Context
-
-The current simulation models **timing only**.
-`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
-but do not actually read tensor data or perform computations.
-
-### Required Capabilities
-
-1. Must be able to store and read actual data in HBM/TCM/SRAM
-2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
-3. Must minimize simulation performance degradation
-
-### Constraints
-
-- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
-- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
-- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
-- Kernel functions must remain plain Python functions (no generator/async transformation)
-
-### Design Exploration Results
-
-| Option | Approach | Verdict |
-|--------|----------|---------|
-| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
-| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
-| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
-| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |
-
----
-
-## Decision
-
-### D1. 2-Pass Execution Model — Phase 0 Elimination
-
-The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.
-
-Before:
-```
-Phase 0: Kernel → PeCommand list (no data, no branching)
-Phase 1: Replay PeCommand list via SimPy (timing only)
-```
-
-After:
-```
-Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
-  - Memory read/write: SimPy timing + MemoryStore actual data
-  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
-  - Dynamic control flow possible (tl.load returns actual data)
-
-Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
-```
-
-This ADR **extends Phase 1 to be data-aware for memory operations only**.
-Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
-Phase 2 handles GEMM/Math computation correctness verification.
-Phase 2 is optional — if only timing is needed, run Phase 1 alone.
-
-### D2. Op Log Recording — ComponentBase Hook
-
-Op log recording is performed as a **hook in the component base class**.
-Individual component implementations are not modified.
-
-```python
-class ComponentBase:
-    def _on_process_start(self, env, msg):
-        if self._op_logger and getattr(msg, 'data_op', False):
-            self._op_logger.record_start(env.now, self.node.id, msg)
-
-    def _on_process_end(self, env, msg):
-        if self._op_logger and getattr(msg, 'data_op', False):
-            self._op_logger.record_end(env.now, self.node.id, msg)
-```
-
-Hooks are called before and after `run()` within `_forward_txn()`.
-`_op_logger` is optional — zero overhead when absent.
-
-**Hook timing definitions**:
-
-| Timing | Meaning |
-|--------|---------|
-| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
-| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |
-
-Link traversal latency is not included in t_start/t_end.
-Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
-
-### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
-
-The existing Phase 0 (kernel → PeCommand list) is eliminated,
-and **greenlet** is used to cooperatively interleave kernel and SimPy execution.
-
-#### Operating Principle
-
-greenlet is a C extension that provides cooperative context switching.
-When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
-to perform timing simulation, and after completion, returns to the kernel with actual data.
-
-```
-SimPy loop (parent greenlet)           Kernel (child greenlet)
-─────────────────────────              ──────────────────────
-g.switch() ─────────────────────────→ Kernel starts
-                                       a = tl.load(ptr, ...)
-                                         internal: parent.switch(DmaReadCmd)
-cmd = DmaReadCmd ←──────────────────  (kernel paused)
-  yield DmaReadMsg(...)
-  yield env.timeout(dma_latency)
-  data = memory_store.read(...)
-g.switch(data) ─────────────────────→ (kernel resumed)
-                                       a = data  ← actual numpy array
-                                       if a[0][0] > 0.5:  ← branching possible
-                                         ...
-```
-
-The kernel is maintained as a **plain Python function**.
-greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
-
-#### KernelRunner — Framework Layer
-
-The greenlet loop resides not in the PE_CPU component but in the framework layer,
-**KernelRunner**.
-
-```python
-# KernelRunner (framework — greenlet ↔ SimPy bridge)
-class KernelRunner:
-    def run(self, env, kernel_fn, args, store):
-        g = greenlet(self._run_kernel)
-        cmd = g.switch(kernel_fn, args)
-
-        while cmd is not None:
-            if isinstance(cmd, DmaReadCmd):
-                yield from self._dispatch_dma(env, cmd)
-                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
-                cmd = g.switch(data)            # resume with actual data
-            elif isinstance(cmd, GemmCmd):
-                yield from self._dispatch_gemm(env, cmd)
-                cmd = g.switch()                # resume (no data)
-            elif isinstance(cmd, DmaWriteCmd):
-                store.write(cmd.dst_addr, cmd.data)  # visibility = issue time
-                yield from self._dispatch_dma(env, cmd)  # timing only
-                cmd = g.switch()
-
-# PE_CPU (component — kept simple, unaware of greenlet)
-def _execute_kernel(self, env):
-    runner = KernelRunner(self.ctx)
-    yield from runner.run(env, kernel_fn, args, store)
-```
-
-**Op logging single source of truth**: KernelRunner does not record directly to op_log.
-All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
-When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
-the component base class hooks automatically record them.
-
-**Layer separation**:
-- **Kernel code**: plain function, unaware of greenlet
-- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
-- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
-- **ComponentBase hook**: the sole path for op_log recording
-- **PE_CPU**: only calls KernelRunner, replaceable as a component
-
-#### Handling Differences Between Memory Read/Write and Compute
-
-| Operation | In Phase 1 | In Phase 2 |
-|-----------|-----------|-----------|
-| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
-| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
-| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
-| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |
-
-Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
-GEMM/Math operations are batch-executed in Phase 2 (performance separation).
-
-#### Store Visibility Rule
-
-`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
-SimPy DMA timing is simulated separately afterward.
-
-This is an intentional separation of timing and visibility:
-- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
-- **timing**: the point at which DMA latency completes in SimPy
-
-This separation allows a load immediately after a store to see the latest data in dynamic control flow.
-
-#### Result Handle Semantics
-
-`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.
-
-The key contract in Phase 1:
-
-1. **All compute handles are always considered pending in Phase 1.**
-2. `tl.wait(handle)` **expresses timing synchronization only**
-   and does not make the handle ready.
-3. Accessing the handle's actual result data (`handle.data`, element access,
-   numpy conversion, etc.) is **only possible in Phase 2**.
-4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
-5. In contrast, `tl.load()` returns actual data in Phase 1, so
-   **memory-read-based control flow is supported**.
-
-| Handle state | Phase | Allowed operations |
-|------------|-------|----------|
-| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
-| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
-| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
-| ready | Phase 2 | Actual numpy data access, verification |
-
-This restriction is intentional. If computations were executed in Phase 1,
-the SimPy single-thread would block, defeating the purpose of 2-pass separation.
-
-#### Phase 1 Materialization — Future Extension
-
-If Phase 1 eager execution becomes necessary for small operations
-(scalar, small reduction) in the future, selective materialization can be supported
-by adding a `materialized_in_phase1: bool` flag to the op record.
-This is not implemented in the current scope.
-
-### D4. data_op Flag — Message Self-Declaration
-
-The logging target is determined by the `data_op` attribute on the message instance,
-not by message type. The framework does not hardcode message types.
-
-```python
-class MsgBase:
-    data_op: bool = False       # default: no logging
-
-class DmaReadCmd(MsgBase):
-    data_op = True              # memory transfer → logging
-
-class GemmCmd(MsgBase):
-    data_op = True              # compute → logging
-
-class MathCmd(MsgBase):
-    data_op = True              # compute → logging
-```
-
-When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
-enables automatic logging without modifying framework code.
-
-### D5. Op Log Structure
-
-#### Op Classification Scheme
-
-A two-level classification is used:
-
-| Level | Field | Role |
-|-------|-------|------|
-| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
-| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |
-
-#### OpRecord Definition
-
-```python
-@dataclass
-class OpRecord:
-    t_start: float              # SimPy time (ns) — service start
-    t_end: float                # SimPy time (ns) — service completion
-    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
-    op_kind: str                # "memory" | "gemm" | "math"
-    op_name: str                # specific operation name
-    params: dict                # per-operation parameters (see below)
-    dependency_ids: list[int]   # currently based on in-memory record index, may be replaced with stable op_id in the future
-```
-
-#### dependency_ids Generation Rules
-
-`dependency_ids` is **optional**, and by default the executor performs
-address-based dependency inference (see D6).
-
-Explicit setting is only needed when precise execution ordering is required:
-- **Default (address-based inference)**: the executor analyzes read/write sets to
-  automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
-- **Explicit setting**: set when logical dependencies cannot be expressed via addresses
-  at the TLContext or command generation stage.
-  Example: completion handle-based synchronization — handle dependencies depend on
-  logical completion order rather than memory addresses, so they cannot be captured
-  by address inference.
-
-#### op_log Ordering
-
-The op_log maintains **stable ordering** based on `t_start`.
-Records with the same `t_start` preserve insertion order.
-
-#### params Details
-
-**memory (dma_read / dma_write)**:
-```python
-{
-    "src_addr": int,            # source address (byte)
-    "dst_addr": int,            # destination address (byte)
-    "nbytes": int,              # transfer size
-    "src_space": str,           # "hbm" | "tcm" | "sram"
-    "dst_space": str,           # "hbm" | "tcm" | "sram"
-}
-```
-
-**gemm**:
-```python
-{
-    "src_a_addr": int,          # operand A address
-    "src_b_addr": int,          # operand B address
-    "dst_addr": int,            # output address
-    "shape_a": tuple,           # e.g. (128, 256)
-    "shape_b": tuple,           # e.g. (256, 128)
-    "shape_out": tuple,         # e.g. (128, 128)
-    "dtype_in": str,            # e.g. "f16"
-    "dtype_acc": str,           # accumulation dtype, e.g. "f32"
-    "dtype_out": str,           # output dtype, e.g. "f16"
-    "transpose_a": bool,
-    "transpose_b": bool,
-    "layout_a": str,            # "row_major" | "col_major"
-    "layout_b": str,
-    "layout_out": str,
-    "addr_space": str,          # "tcm" (GEMM operands are always in TCM)
-}
-```
-
-**math**:
-```python
-{
-    "op": str,                  # "exp" | "add" | "sum" | "where" | ...
-    "input_addrs": list[int],   # list of operand addresses
-    "input_shapes": list[tuple],
-    "dst_addr": int,
-    "shape_out": tuple,
-    "dtype": str,
-    "axis": int | None,         # reduction axis
-    "addr_space": str,          # "tcm"
-}
-```
-
-### D6. Phase 2 Executor
-
-Phase 2 executes the op_log outside of SimPy.
-
-```python
-class DataExecutor:
-    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
-        self.store = initial_store  # Takes the Phase 1 MemoryStore snapshot as input
-
-    def run(self):
-        for t, ops in groupby(op_log, key=lambda o: o.t_start):
-            batch = list(ops)
-            independent, sequential = self._classify(batch)
-            self._execute_parallel(independent)
-            self._execute_sequential(sequential)
-```
-
-**Parallel execution determination**:
-
-Ops with the same `t_start` are considered **parallel candidates**.
-The executor determines actual parallel execution based on the following criteria:
-- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
-- Whether predecessor ops specified in `dependency_ids` have completed
-
-Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
-
-**Batch optimization**: Only independent ops with the same op_name **and identical
-shape, dtype, layout, and transpose flags** are eligible for batching.
-Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
-Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
-
-**Phase 2 execution order guarantee**:
-
-Phase 2 does not consider data arrival timing,
-and guarantees execution order solely through
-dependencies (address-based inference + explicit dependency_ids).
-
-### D7. Memory Store
-
-`MemoryStore` logically follows byte-addressable semantics,
-and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).
-
-```python
-class MemoryStore:
-    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
-    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
-```
-
-**Internal storage format: numpy ndarray**
-
-MemoryStore stores tensors as **numpy ndarrays**.
-
-| Candidate | store/load speed | Phase 2 compute | Verdict |
-|-----------|-----------------|-----------------|---------|
-| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
-| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
-| torch tensor | Immediate | torch operations available | Use only for GPU optimization |
-
-- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
-- read: **returns numpy array by reference** (no copy)
-- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
-- dtype uses numpy native (`np.float16`, `np.float32`, `np.bfloat16`, etc.)
-- For byte-level access, convert via `.view(np.uint8)`
-- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
-
-**read/write contract**:
-
-- read/write operates on a **contiguous tensor** basis.
-  If non-contiguous stride views are needed, express them as separate copy ops.
-- In the normal benchmark path, producer/consumer dtype match is expected.
-  Reinterpret cast is a permissive behavior for low-level memory validation
-  or special test cases.
-- addr is byte-aligned, with minimum alignment = dtype size.
-- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
-  Shape mismatch is verified based on nbytes, and raises an error on mismatch.
-- Correctness criteria follow address-range-based read/write semantics.
-- A tensor object cache may be used as an implementation optimization,
-  but the canonical state is byte-addressable storage.
-- At deploy time, the host injects initial tensor data.
-
-### D8. Benchmark Kernel Code
-
-The benchmark's **user code API is not changed**.
-The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.
-
-However, internal command/message schemas may be extended to include metadata
-required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
-
-### D9. No Component Changes
-
-Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
-Op log recording is the responsibility of the ComponentBase hook.
-When custom components are replaced, only the timing model changes,
-and Phase 2 data execution is unaffected.
-
-### D10. Phase 2 is Optional
-
-```python
-engine = GraphEngine(graph)
-engine.run(benchmark)                       # Phase 1: timing only
-result = engine.get_timing_result()
-
-if verify_data:
-    executor = DataExecutor(engine.op_log)  # Phase 2: data
-    executor.run()
-    executor.verify(expected_output)
-```
-
-If only timing analysis is needed, Phase 2 is skipped.
-If the op_logger is deactivated, Phase 1 performance is identical to the original.
-
-### D11. Verification Contract
-
-Basic verification **compares the final output tensor** against a reference backend (numpy).
-
-Per-dtype tolerance policy:
-
-| dtype | Comparison method | Tolerance |
-|-------|----------|-----------|
-| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
-| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
-| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
-| int types | `np.array_equal` | exact |
-
-- Default mode: compare final output only (end-to-end correctness)
-- Debug mode: can compare intermediate tensors on a per-op basis
-  (MemoryStore snapshot at each op boundary)
-
----
-
-## Non-goals
-
-- **Compute-result-based control flow**: not supported.
-  All compute handles are in pending state during Phase 1,
-  `wait()` expresses timing synchronization only and does not imply data readiness.
-  Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
-  is **treated as an error**.
-  Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
-  Phase 1 materialization is a future extension (see D3).
-- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
-  the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
-- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
-  and do not reproduce the actual hardware PE microarchitecture.
-
-## Open Questions
-
-- **Aliasing / slice view**: How to represent slice/views referencing the same
-  backing storage in MemoryStore (stride-based view vs copy semantics)
-- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
-  communication as memory ops or introduce a separate op_kind
-- **Op log streaming**: Managing op_log memory usage in large-scale simulations
-  (in-memory list vs disk-backed streaming)
-- **Fused operation**: Whether to record tl.composite's tiled pipeline
-  (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
-- **Math op schema generalization**: The current math params have a simple structure,
-  but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
-  scalar/immediate operands, where/mask expressions, etc.
-- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
-  replacement with stable op_id is needed when introducing streaming/disk-backed mode
-- **Phase 1 materialization policy**: See Future Extension in D3.
-  If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
-  needs to be defined
-
----
-
-## Consequences
-
-### Positive
-
-- Minimal impact on SimPy simulation performance (only op_log append added)
-- Free to use multi-threading/GPU in Phase 2
-- Component replaceability preserved (ADR-0015 design philosophy maintained)
-- No changes needed to benchmark user code API
-- When adding new message types, only set the data_op flag
-- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
-- `tl.load()` returns actual data, making kernel debugging easier
-
-### Negative
-
-- op_log memory usage (for large-scale simulations)
-- Phase 2 execution time is proportional to tensor size (large GEMM)
-- Dynamic branching based on pending handles (incomplete computations) not possible
-  (computations execute in Phase 2, result values are undetermined in Phase 1).
-  Memory-data-based branching is supported via greenlet.
-- greenlet C extension dependency added (pip install greenlet)
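The address-overlap rule that D6 of the ADR above uses to pick parallel candidates can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's `DataExecutor`: the `Op` class is a hypothetical, simplified stand-in for `OpRecord`, byte ranges are modelled as `(addr, nbytes)` tuples, and explicit `dependency_ids` are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    # Hypothetical stand-in for OpRecord: byte ranges as (addr, nbytes).
    name: str
    reads: list = field(default_factory=list)
    writes: list = field(default_factory=list)


def overlaps(a, b):
    """True if half-open byte ranges [addr, addr + nbytes) intersect."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def conflicts(first, second):
    """True if the two ops share a RAW, WAR, or WAW hazard."""
    raw = any(overlaps(w, r) for w in first.writes for r in second.reads)
    war = any(overlaps(r, w) for r in first.reads for w in second.writes)
    waw = any(overlaps(w1, w2) for w1 in first.writes for w2 in second.writes)
    return raw or war or waw


def classify(batch):
    """Split a same-t_start batch into (independent, sequential) ops."""
    independent, sequential = [], []
    for i, op in enumerate(batch):
        others = batch[:i] + batch[i + 1:]
        if any(conflicts(op, other) for other in others):
            sequential.append(op)
        else:
            independent.append(op)
    return independent, sequential


gemm0 = Op("gemm0", reads=[(0, 64), (64, 64)], writes=[(256, 64)])
gemm1 = Op("gemm1", reads=[(512, 64)], writes=[(768, 64)])
add0 = Op("add0", reads=[(256, 64)], writes=[(1024, 64)])  # consumes gemm0 output

independent, sequential = classify([gemm0, gemm1, add0])
```

Here `gemm1` touches no range used by the other two ops, so it is the only parallel candidate, while `gemm0` and `add0` keep their log order because of the read-after-write hazard on the range starting at byte 256.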
@@ -1,4 +1,4 @@
-# ADR-0020: 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리)
+# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)

 ## Status

@@ -6,65 +6,65 @@ Accepted

 ## Context

-현재 시뮬레이션은 **타이밍만** 모델링한다.
-`tl.load()`, `tl.composite(op="gemm")` 등은 SimPy latency를 생성하지만,
-실제 텐서 데이터를 읽거나 연산하지 않는다.
+The current simulation models **timing only**.
+`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
+but do not actually read tensor data or perform computations.

-### 필요한 기능
+### Required Capabilities

-1. HBM/TCM/SRAM에 실제 데이터를 저장하고 읽을 수 있어야 한다
-2. PE_GEMM, PE_MATH가 실제 행렬 연산을 수행하고 결과를 검증할 수 있어야 한다
-3. 시뮬레이션 성능 저하를 최소화해야 한다
+1. Must be able to store and read actual data in HBM/TCM/SRAM
+2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
+3. Must minimize simulation performance degradation

-### 제약 조건
+### Constraints

-- SimPy는 single-thread 이벤트 루프 — numpy matmul을 안에서 하면 전체가 block
-- 컴포넌트는 교체 가능해야 한다 (ADR-0015) — 프레임워크 요구사항이 구현에 침투하면 안 됨
-- 벤치마크 커널은 명령형 코드(tl.load → tl.composite → tl.wait) — 같은 코드를 재사용해야 함
-- 커널 함수는 plain Python function으로 유지해야 한다 (generator/async 변환 불가)
+- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
+- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
+- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
+- Kernel functions must remain plain Python functions (no generator/async transformation)

-### 설계 탐색 결과
+### Design Exploration Results

-| Option | 방식 | 판정 |
-|--------|------|------|
-| SimPy 내 직접 실행 | GEMM을 SimPy 안에서 numpy 호출 | 탈락: single-thread block |
-| SimPy + ThreadPool | future.submit → timeout → result() | 탈락: back-to-back 요청 시 result()에서 block |
-| Symbolic + lazy | 메타데이터만 추적, 나중에 실행 | 탈락: control-flow dependent 읽기 처리 곤란 |
-| **2-pass (채택)** | Phase 1: 타이밍, Phase 2: 데이터 | 완전 분리, 성능 영향 없음 |
+| Option | Approach | Verdict |
+|--------|----------|---------|
+| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
+| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
+| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
+| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |

 ---

 ## Decision

-### D1. 2-Pass 실행 모델 — Phase 0 제거
+### D1. 2-Pass Execution Model — Phase 0 Elimination

-기존의 3단계(Phase 0 → Phase 1 → Phase 2)를 **2단계로 통합**한다.
+The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.

-기존:
+Before:
 ```
-Phase 0: 커널 → PeCommand 리스트 (데이터 없음, 분기 불가)
-Phase 1: PeCommand 리스트를 SimPy replay (타이밍만)
+Phase 0: Kernel → PeCommand list (no data, no branching)
+Phase 1: Replay PeCommand list via SimPy (timing only)
 ```

-변경:
+After:
 ```
-Phase 1 (타이밍): 커널 + SimPy 통합 실행 — greenlet 기반
-  - 메모리 읽기/쓰기: SimPy 타이밍 + MemoryStore 실제 데이터
-  - 연산 (GEMM/Math): SimPy 타이밍 + op_log 기록 (실제 연산은 Phase 2)
-  - dynamic control flow 가능 (tl.load가 실제 데이터 반환)
+Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
+  - Memory read/write: SimPy timing + MemoryStore actual data
+  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
+  - Dynamic control flow possible (tl.load returns actual data)

-Phase 2 (데이터): op_log 기반 실제 연산 실행 — SimPy 외부, 병렬 가능
+Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
 ```

-본 ADR은 **메모리 연산에 한해 Phase 1을 data-aware로 확장**한다.
-Phase 1은 latency/BW 병목 분석 + 메모리 데이터 추적,
-Phase 2는 GEMM/Math 연산 정합성 검증.
-Phase 2는 optional — 타이밍만 필요하면 Phase 1만 실행.
+This ADR **extends Phase 1 to be data-aware for memory operations only**.
+Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
+Phase 2 handles GEMM/Math computation correctness verification.
+Phase 2 is optional — if only timing is needed, run Phase 1 alone.

-### D2. Op Log 기록 — ComponentBase hook
+### D2. Op Log Recording — ComponentBase Hook

-op_log 기록은 **컴포넌트 베이스 클래스의 hook**으로 수행한다.
-개별 컴포넌트 구현을 수정하지 않는다.
+Op log recording is performed as a **hook in the component base class**.
+Individual component implementations are not modified.

 ```python
 class ComponentBase:
@@ -77,56 +77,56 @@ class ComponentBase:
            self._op_logger.record_end(env.now, self.node.id, msg)
 ```

-`_forward_txn()` 에서 `run()` 전후로 hook을 호출한다.
-`_op_logger`는 optional — 없으면 오버헤드 제로.
+Hooks are called before and after `run()` within `_forward_txn()`.
+`_op_logger` is optional — zero overhead when absent.

-**hook 시점 정의**:
+**Hook timing definitions**:

-| 시점 | 의미 |
-|------|------|
-| `t_start` | 컴포넌트가 해당 msg의 **service를 시작**한 시점 (`run()` 진입 직전) |
-| `t_end` | 컴포넌트의 **내부 service가 완료**된 시점 (`run()` 반환 직후) |
+| Timing | Meaning |
+|--------|---------|
+| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
+| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |

-link traversal latency는 t_start/t_end에 포함되지 않는다.
-link latency는 발신 컴포넌트의 t_end와 수신 컴포넌트의 t_start 차이로 관측된다.
+Link traversal latency is not included in t_start/t_end.
+Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.

-### D3. Greenlet 기반 커널 실행 — Phase 0 제거
+### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination

-기존 Phase 0 (커널 → PeCommand 리스트)를 제거하고,
-**greenlet**을 사용하여 커널과 SimPy를 협력적으로 interleave 실행한다.
+The existing Phase 0 (kernel → PeCommand list) is eliminated,
+and **greenlet** is used to cooperatively interleave kernel and SimPy execution.

-#### 동작 원리
+#### Operating Principle

-greenlet은 협력적 context switch를 제공하는 C 확장이다.
-커널(child greenlet)이 `tl.load()` 등을 호출하면 SimPy 루프(parent greenlet)로
-switch하여 타이밍 시뮬레이션을 수행하고, 완료 후 실제 데이터와 함께 커널로 돌아온다.
+greenlet is a C extension that provides cooperative context switching.
+When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
+to perform timing simulation, and after completion, returns to the kernel with actual data.

 ```
-SimPy 루프 (parent greenlet)          커널 (child greenlet)
+SimPy loop (parent greenlet)           Kernel (child greenlet)
 ─────────────────────────              ──────────────────────
-g.switch() ─────────────────────────→ 커널 시작
+g.switch() ─────────────────────────→ Kernel starts
                                       a = tl.load(ptr, ...)
-                                         내부: parent.switch(DmaReadCmd)
-cmd = DmaReadCmd ←──────────────────  (커널 일시정지)
+                                         internal: parent.switch(DmaReadCmd)
+cmd = DmaReadCmd ←──────────────────  (kernel paused)
  yield DmaReadMsg(...)
  yield env.timeout(dma_latency)
  data = memory_store.read(...)
-g.switch(data) ─────────────────────→ (커널 재개)
-                                       a = data  ← 실제 numpy array
-                                       if a[0][0] > 0.5:  ← 분기 가능
+g.switch(data) ─────────────────────→ (kernel resumed)
+                                       a = data  ← actual numpy array
+                                       if a[0][0] > 0.5:  ← branching possible
                                         ...
 ```

-커널은 **plain Python function**으로 유지된다.
-greenlet switch는 `tl.load()`, `tl.store()` 등의 **내부 구현에만** 존재한다.
+The kernel is maintained as a **plain Python function**.
+greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.

-#### KernelRunner — 프레임워크 레이어
+#### KernelRunner — Framework Layer

-greenlet 루프는 PE_CPU 컴포넌트가 아니라 프레임워크 레이어인
-**KernelRunner**에 위치한다.
+The greenlet loop resides not in the PE_CPU component but in the framework layer,
+**KernelRunner**.

 ```python
-# KernelRunner (프레임워크 — greenlet ↔ SimPy 연결)
+# KernelRunner (framework — greenlet ↔ SimPy bridge)
 class KernelRunner:
    def run(self, env, kernel_fn, args, store):
        g = greenlet(self._run_kernel)
@@ -136,160 +136,162 @@ class KernelRunner:
            if isinstance(cmd, DmaReadCmd):
                yield from self._dispatch_dma(env, cmd)
                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
-                cmd = g.switch(data)            # 실제 데이터와 함께 재개
+                cmd = g.switch(data)            # resume with actual data
            elif isinstance(cmd, GemmCmd):
                yield from self._dispatch_gemm(env, cmd)
-                cmd = g.switch()                # 재개 (데이터 없음)
+                cmd = g.switch()                # resume (no data)
            elif isinstance(cmd, DmaWriteCmd):
-                store.write(cmd.dst_addr, cmd.data)  # visibility = issue 시점
-                yield from self._dispatch_dma(env, cmd)  # timing만 반영
+                store.write(cmd.dst_addr, cmd.data)  # visibility = issue time
+                yield from self._dispatch_dma(env, cmd)  # timing only
                cmd = g.switch()

-# PE_CPU (컴포넌트 — 간단하게 유지, greenlet을 모름)
+# PE_CPU (component — kept simple, unaware of greenlet)
 def _execute_kernel(self, env):
    runner = KernelRunner(self.ctx)
    yield from runner.run(env, kernel_fn, args, store)
 ```

-**Op logging single source of truth**: KernelRunner는 op_log에 직접 기록하지 않는다.
-모든 op logging은 **ComponentBase hook (_on_process_start/end)만** 담당한다.
-KernelRunner가 `_dispatch_gemm()` 등으로 컴포넌트에 메시지를 전달하면,
-컴포넌트 베이스 클래스의 hook이 자동으로 기록한다.
+**Op logging single source of truth**: KernelRunner does not record directly to op_log.
+All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
+When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
+the component base class hooks automatically record them.

-**레이어 분리**:
-- **커널 코드**: plain function, greenlet 존재를 모름
-- **TLContext**: `tl.load()` 내부에서 `parent.switch(cmd)` 호출
-- **KernelRunner**: greenlet ↔ SimPy 연결, MemoryStore 읽기/쓰기 처리. **logging 안 함**.
-- **ComponentBase hook**: op_log 기록의 유일한 경로
-- **PE_CPU**: KernelRunner를 호출만 함, 컴포넌트로서 교체 가능
+**Layer separation**:
+- **Kernel code**: plain function, unaware of greenlet
+- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
+- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
+- **ComponentBase hook**: the sole path for op_log recording
+- **PE_CPU**: only calls KernelRunner, replaceable as a component

-#### 메모리 읽기/쓰기 vs 연산의 처리 차이
+#### Handling Differences Between Memory Read/Write and Compute

-| 연산 | Phase 1에서 | Phase 2에서 |
-|------|------------|------------|
-| `tl.load()` | SimPy 타이밍 + MemoryStore read → **실제 데이터 반환** | — |
-| `tl.store()` | SimPy 타이밍 + MemoryStore write → **실제 기록** | — |
-| `tl.composite(gemm)` | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 |
-| `tl.dot()` / math ops | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 |
+| Operation | In Phase 1 | In Phase 2 |
+|-----------|-----------|-----------|
+| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
+| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
+| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
+| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |

-메모리 읽기/쓰기는 Phase 1에서 즉시 처리 (numpy slice, 빠름).
-GEMM/Math 연산은 Phase 2에서 batch 실행 (성능 분리).
+Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
+GEMM/Math operations are batch-executed in Phase 2 (performance separation).

 #### Store Visibility Rule

-`tl.store()`는 **issue 시점에 MemoryStore에 즉시 반영**된다 (visibility = issue).
-SimPy DMA 타이밍은 이후 별도로 시뮬레이션된다.
+`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
+SimPy DMA timing is simulated separately afterward.

-이는 timing과 visibility를 의도적으로 분리한 것이다:
- **visibility**: MemoryStore에 반영되는 시점 = `store.write()` 호출 시
- **timing**: SimPy에서 DMA latency가 완료되는 시점
+This is an intentional separation of timing and visibility:
+- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
+- **timing**: the point at which DMA latency completes in SimPy

-이 분리로 dynamic control flow에서 store 직후 load가 최신 데이터를 볼 수 있다.
+This separation allows a load immediately after a store to see the latest data in dynamic control flow.
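
The visibility/timing split can be illustrated with a toy model. This is a minimal sketch rather than the KernelRunner code: `issue_store` is a hypothetical helper, a plain dict stands in for MemoryStore, and the yielded number only stands in for the SimPy DMA delay.

```python
def issue_store(store, addr, value, dma_latency_ns):
    """Toy stand-in for the DmaWriteCmd path (hypothetical helper)."""
    store[addr] = value      # visibility: data lands at issue time
    yield dma_latency_ns     # timing: a scheduler would advance the clock by this much


store = {}                   # plain dict standing in for MemoryStore (assumption)
op = issue_store(store, 0x1000, [1, 2, 3], dma_latency_ns=40)
pending_delay = next(op)     # run up to the point where the DMA delay is requested

# A load issued right now already observes the stored value,
# even though the modeled DMA has not finished.
seen_by_load = store[0x1000]
```

The write happens before the generator yields, so any reader that runs while the delay is still pending sees the new data.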

 #### Result Handle Semantics

-`tl.composite()`(sync/async)는 결과 tensor를 참조하는 **handle**을 반환한다.
+`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.

-Phase 1에서의 핵심 계약:
+The key contract in Phase 1:

-1. **모든 compute handle은 Phase 1에서 항상 pending 상태로 간주한다.**
-2. `tl.wait(handle)`은 **timing synchronization만 표현**하며,
-   handle을 ready로 만들지 않는다.
-3. handle의 실제 결과 데이터 접근(`handle.data`, element access,
-   numpy conversion 등)은 **Phase 2에서만 가능**하다.
-4. 따라서 Phase 1에서 **compute-result 기반 control flow는 지원하지 않는다.**
-5. 반면 `tl.load()`는 Phase 1에서 실제 데이터를 반환하므로,
-   **memory-read 기반 control flow는 지원 가능**하다.
+1. **All compute handles are always considered pending in Phase 1.**
+2. `tl.wait(handle)` **expresses timing synchronization only**
+   and does not make the handle ready.
+3. Accessing the handle's actual result data (`handle.data`, element access,
+   numpy conversion, etc.) is **only possible in Phase 2**.
+4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
+5. In contrast, `tl.load()` returns actual data in Phase 1, so
+   **memory-read-based control flow is supported**.

-| handle 상태 | Phase | 허용 동작 |
+| Handle state | Phase | Allowed operations |
 |------------|-------|----------|
-| pending | Phase 1 | `tl.wait(handle)` — timing 동기화만 |
-| pending | Phase 1 | handle을 `tl.store()`의 대상으로 전달 (logical destination 연결만, payload는 Phase 2) |
-| pending | Phase 1 | **데이터 접근 불가** — 값 기반 분기 불가 |
-| ready | Phase 2 | 실제 numpy 데이터 접근, 검증 |
+| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
+| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
+| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
+| ready | Phase 2 | Actual numpy data access, verification |

-이 제약은 의도적이다. Phase 1에서 연산을 실행하면 SimPy single-thread가
-block되어 2-pass 분리의 존재 이유가 사라진다.
+This restriction is intentional. If computations were executed in Phase 1,
+the SimPy single-thread would block, defeating the purpose of 2-pass separation.
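
One way to enforce this contract is to make a pending handle fail loudly on any value access. The sketch below is an illustration under assumptions: `ComputeHandle`, `PendingHandleError`, and `materialize` are hypothetical names, not the framework's actual classes.

```python
class PendingHandleError(RuntimeError):
    """Raised when Phase 1 code touches a compute result (hypothetical name)."""


class ComputeHandle:
    def __init__(self):
        self._ready = False
        self._data = None

    def _require_ready(self):
        if not self._ready:
            raise PendingHandleError("compute results are only available in Phase 2")

    @property
    def data(self):
        self._require_ready()
        return self._data

    def __bool__(self):
        # Blocks `if handle:` style branching on a pending result.
        self._require_ready()
        return bool(self._data)

    def materialize(self, data):
        """Called by the Phase 2 executor in this sketch."""
        self._data = data
        self._ready = True
```

With this shape, a kernel that branches on a GEMM result fails in Phase 1 instead of silently reading an undefined value.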

 #### Phase 1 Materialization — Future Extension

-향후 소형 연산(scalar, 작은 reduction)에 대해 Phase 1 eager execution이
-필요한 경우, `materialized_in_phase1: bool` 플래그를 op record에 추가하여
-선택적 materialization을 지원할 수 있다. 현재 범위에서는 구현하지 않는다.
+If Phase 1 eager execution becomes necessary for small operations
+(scalar, small reduction) in the future, selective materialization can be supported
+by adding a `materialized_in_phase1: bool` flag to the op record.
+This is not implemented in the current scope.

-### D4. data_op 플래그 — 메시지 자기 선언
+### D4. data_op Flag — Message Self-Declaration

-로깅 대상은 메시지 타입이 아니라 메시지 인스턴스의 `data_op` 속성으로 결정한다.
-프레임워크가 메시지 타입을 하드코딩하지 않는다.
+The logging target is determined by the `data_op` attribute on the message instance,
+not by message type. The framework does not hardcode message types.

 ```python
 class MsgBase:
-    data_op: bool = False       # 기본: 로깅 안 함
+    data_op: bool = False       # default: no logging

 class DmaReadCmd(MsgBase):
-    data_op = True              # 메모리 이동 → 로깅
+    data_op = True              # memory transfer → logging

 class GemmCmd(MsgBase):
-    data_op = True              # 연산 → 로깅
+    data_op = True              # compute → logging

 class MathCmd(MsgBase):
-    data_op = True              # 연산 → 로깅
+    data_op = True              # compute → logging
 ```

-새 메시지 타입(예: IpcqMsg) 추가 시 `data_op = True`만 설정하면
-프레임워크 코드 수정 없이 자동 로깅된다.
+When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
+enables automatic logging without modifying framework code.
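
A minimal sketch of how a hook could consult the flag without naming any message type. `record_if_data_op` and `CreditMsg` are hypothetical and exist only for illustration; the real ComponentBase hook records a full OpRecord rather than a class name.

```python
class MsgBase:
    data_op: bool = False


class IpcqMsg(MsgBase):      # new message type from the example above
    data_op = True


class CreditMsg(MsgBase):    # hypothetical control message, not logged
    pass


def record_if_data_op(op_log, msg):
    """Append to op_log only when the message declares itself a data op."""
    if getattr(msg, "data_op", False):
        op_log.append(type(msg).__name__)


op_log = []
for msg in (IpcqMsg(), CreditMsg(), IpcqMsg()):
    record_if_data_op(op_log, msg)
```

The hook never checks `isinstance`, which is why adding a message type requires no framework change.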

-### D5. Op Log 구조
+### D5. Op Log Structure

-#### op 분류 체계
+#### Op Classification Scheme

-2단계로 분류한다:
+A two-level classification is used:

-| 레벨 | 필드 | 역할 |
-|------|------|------|
-| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch 기준 |
-| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` 등 | 구체 연산 식별 |
+| Level | Field | Role |
+|-------|-------|------|
+| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
+| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |

-#### OpRecord 정의
+#### OpRecord Definition

 ```python
@dataclass
 class OpRecord:
-    t_start: float              # SimPy 시각 (ns) — service 시작
-    t_end: float                # SimPy 시각 (ns) — service 완료
+    t_start: float              # SimPy time (ns) — service start
+    t_end: float                # SimPy time (ns) — service completion
    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
    op_kind: str                # "memory" | "gemm" | "math"
-    op_name: str                # 구체 연산명
-    params: dict                # 연산별 파라미터 (아래 참조)
-    dependency_ids: list[int]   # 현재는 in-memory record index 기반, 향후 stable op_id로 대체 가능
+    op_name: str                # specific operation name
+    params: dict                # per-operation parameters (see below)
+    dependency_ids: list[int]   # currently based on in-memory record index, may be replaced with stable op_id in the future
 ```

-#### dependency_ids 생성 규칙
+#### dependency_ids Generation Rules

-`dependency_ids`는 **optional**이며, 기본적으로 executor는
-주소 기반 dependency 추론을 수행한다 (D6 참조).
+`dependency_ids` is **optional**, and by default the executor performs
+address-based dependency inference (see D6).

-정확한 실행 순서가 필요한 경우에만 명시적으로 설정한다:
- **기본 (address-based inference)**: executor가 read/write set을 분석하여
-  RAW/WAW/WAR 의존성을 자동 추론. 대부분의 경우 이것으로 충분.
- **명시적 설정**: TLContext 또는 command 생성 단계에서 logical dependency가
-  주소로 표현되지 않는 경우에 설정.
-  예: completion handle 기반 동기화 — handle dependency는 메모리 주소가 아니라
-  논리적 완료 순서에 의존하므로 address inference로 잡히지 않는다.
+Explicit setting is only needed when precise execution ordering is required:
+- **Default (address-based inference)**: the executor analyzes read/write sets to
+  automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
+- **Explicit setting**: set at the TLContext or command generation stage when a logical
+  dependency cannot be expressed via addresses.
+  Example: completion handle-based synchronization — handle dependencies depend on
+  logical completion order rather than memory addresses, so they cannot be captured
+  by address inference.

-#### op_log ordering
+#### op_log Ordering

-op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
-동일 `t_start`의 record들은 insertion order를 보존한다.
+The op_log maintains **stable ordering** based on `t_start`.
+Records with the same `t_start` preserve insertion order.
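
The ordering rule can be sketched with `bisect_right`, which places a new record after all existing records that share its `t_start`. `StableOpLog` is a hypothetical container used only to show the invariant; records are reduced to `(t_start, name)` tuples.

```python
import bisect


class StableOpLog:
    """Keeps records sorted by t_start; ties keep insertion order."""

    def __init__(self):
        self._keys = []      # t_start values, kept sorted
        self.records = []    # (t_start, name) tuples in the same order

    def add(self, t_start, name):
        # bisect_right lands after any equal key, so earlier inserts stay first.
        i = bisect.bisect_right(self._keys, t_start)
        self._keys.insert(i, t_start)
        self.records.insert(i, (t_start, name))


log = StableOpLog()
for t_start, name in [(10.0, "a"), (5.0, "b"), (10.0, "c"), (5.0, "d")]:
    log.add(t_start, name)
```

Appending everything and calling `sorted(..., key=...)` once at the end gives the same result, since Python's sort is stable.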

-#### params 상세
+#### params Details

 **memory (dma_read / dma_write)**:
 ```python
 {
-    "src_addr": int,            # source 주소 (byte)
-    "dst_addr": int,            # destination 주소 (byte)
-    "nbytes": int,              # 전송 크기
+    "src_addr": int,            # source address (byte)
+    "dst_addr": int,            # destination address (byte)
+    "nbytes": int,              # transfer size
    "src_space": str,           # "hbm" | "tcm" | "sram"
    "dst_space": str,           # "hbm" | "tcm" | "sram"
 }
@@ -298,9 +300,9 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
 **gemm**:
 ```python
 {
-    "src_a_addr": int,          # operand A 주소
-    "src_b_addr": int,          # operand B 주소
-    "dst_addr": int,            # output 주소
+    "src_a_addr": int,          # operand A address
+    "src_b_addr": int,          # operand B address
+    "dst_addr": int,            # output address
    "shape_a": tuple,           # e.g. (128, 256)
    "shape_b": tuple,           # e.g. (256, 128)
    "shape_out": tuple,         # e.g. (128, 128)
@@ -312,7 +314,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
    "layout_a": str,            # "row_major" | "col_major"
    "layout_b": str,
    "layout_out": str,
-    "addr_space": str,          # "tcm" (GEMM operand는 항상 TCM)
+    "addr_space": str,          # "tcm" (GEMM operands are always in TCM)
 }
 ```

@@ -320,7 +322,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
 ```python
 {
    "op": str,                  # "exp" | "add" | "sum" | "where" | ...
-    "input_addrs": list[int],   # operand 주소 목록
+    "input_addrs": list[int],   # list of operand addresses
    "input_shapes": list[tuple],
    "dst_addr": int,
    "shape_out": tuple,
@@ -332,12 +334,12 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.

 ### D6. Phase 2 Executor

-Phase 2는 SimPy 밖에서 op_log를 실행한다.
+Phase 2 executes the op_log outside of SimPy.

 ```python
 class DataExecutor:
    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
-        self.store = initial_store  # Phase 1의 MemoryStore snapshot을 입력으로 받는다
+        self.store = initial_store  # Takes the Phase 1 MemoryStore snapshot as input

    def run(self):
        for t, ops in groupby(op_log, key=lambda o: o.t_start):
@@ -347,30 +349,30 @@ class DataExecutor:
            self._execute_sequential(sequential)
 ```

-**병렬 실행 판정**:
+**Parallel execution determination**:

-같은 `t_start`의 op들은 **병렬 후보**로 간주한다.
-실제 병렬 실행 여부는 executor가 다음 기준으로 판정한다:
- read/write 주소 범위 겹침 여부 (WAW, RAW, WAR 충돌 검사)
- `dependency_ids`에 명시된 선행 op 완료 여부
+Ops with the same `t_start` are considered **parallel candidates**.
+The executor determines actual parallel execution based on the following criteria:
+- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
+- Whether predecessor ops specified in `dependency_ids` have completed

-주소 범위가 겹치지 않고 명시적 의존성이 없는 op들만 병렬 실행한다.
+Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
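
The address-range check can be sketched as follows. This is an assumption-laden illustration: `ranges_overlap` and `hazard` are hypothetical helpers, each op is reduced to lists of `(start_addr, nbytes)` ranges, and address spaces are ignored (a real check would compare ranges only within the same space).

```python
def ranges_overlap(a, b):
    """Half-open byte ranges given as (start_addr, nbytes)."""
    (a_start, a_len), (b_start, b_len) = a, b
    return a_start < b_start + b_len and b_start < a_start + a_len


def hazard(earlier, later):
    """Return "RAW", "WAW", "WAR", or None for two ops in program order."""
    def touches(xs, ys):
        return any(ranges_overlap(x, y) for x in xs for y in ys)

    if touches(earlier["writes"], later["reads"]):
        return "RAW"
    if touches(earlier["writes"], later["writes"]):
        return "WAW"
    if touches(earlier["reads"], later["writes"]):
        return "WAR"
    return None


gemm = {"reads": [(0x0000, 256), (0x1000, 256)], "writes": [(0x2000, 256)]}
exp_ = {"reads": [(0x2000, 128)], "writes": [(0x3000, 128)]}
other = {"reads": [(0x8000, 64)], "writes": [(0x9000, 64)]}
```

Ops for which `hazard` returns `None` in both directions, and which have no explicit `dependency_ids` between them, are the ones eligible to run in parallel.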

-**배치 최적화**: 동일 op_name이며 **shape, dtype, layout, transpose flag가
-모두 동일한** 독립 op들만 batching 대상이 된다.
-예: 여러 PE의 동일 shape GEMM → `np.matmul(a_batch, b_batch)` 한 번으로 묶음.
-CPU에서도 BLAS 효율 향상, GPU에서는 launch overhead 절감.
+**Batch optimization**: Only independent ops with the same op_name **and identical
+shape, dtype, layout, and transpose flags** are eligible for batching.
+Example: identical-shape GEMMs from multiple PEs are bundled into a single `np.matmul(a_batch, b_batch)` call.
+This improves BLAS efficiency on CPU and reduces launch overhead on GPU.

-**Phase 2 실행 순서 보장**:
+**Phase 2 execution order guarantee**:

-Phase 2는 데이터 도착 시점을 고려하지 않으며,
-dependency (주소 기반 추론 + 명시적 dependency_ids)를 통해서만
-실행 순서를 보장한다.
+Phase 2 does not consider data arrival timing,
+and guarantees execution order solely through
+dependencies (address-based inference + explicit dependency_ids).

 ### D7. Memory Store

-`MemoryStore`는 논리적으로 byte-addressable semantics를 따르며,
-현재 구현은 **tensor-granular storage** (addr → numpy ndarray 매핑)를 사용한다.
+`MemoryStore` logically follows byte-addressable semantics,
+and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).

 ```python
 class MemoryStore:
@@ -378,139 +380,140 @@ class MemoryStore:
    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
 ```

-**내부 저장 포맷: numpy ndarray**
+**Internal storage format: numpy ndarray**

-MemoryStore는 텐서를 **numpy ndarray**로 저장한다.
+MemoryStore stores tensors as **numpy ndarrays**.

-| 후보 | store/load 속도 | Phase 2 연산 | 판정 |
-|------|----------------|-------------|------|
-| **numpy ndarray** | 즉시 (참조 전달, 복사 없음) | `np.matmul` 바로 사용 | **채택** |
-| bytearray | memcpy 필요 | `np.frombuffer` 변환 필요 | 탈락 |
-| torch tensor | 즉시 | torch 연산 가능 | GPU 최적화 시만 사용 |
+| Candidate | store/load speed | Phase 2 compute | Verdict |
+|-----------|-----------------|-----------------|---------|
+| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
+| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
+| torch tensor | Immediate | torch operations available | Use only for GPU optimization |

- write: numpy array를 **참조 저장** (복사 없음) → Phase 1 오버헤드 = dict lookup 1회
- read: numpy array를 **참조 반환** (복사 없음)
- 동일 addr에 재 write 시 기존 array를 **tensor 단위로 덮어쓴다** (partial overwrite 미지원)
- dtype은 numpy native 사용 (`np.float16`, `np.float32`, `np.bfloat16` 등)
- byte-level access가 필요한 경우 `.view(np.uint8)` 로 변환
- Phase 2에서 GPU batch 최적화 시 numpy → torch tensor 변환은 executor가 담당
+- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
+- read: **returns numpy array by reference** (no copy)
+- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
+- dtype uses numpy native types (`np.float16`, `np.float32`, etc.); NumPy has no built-in `np.bfloat16`, so bf16 needs an extension dtype (e.g., from `ml_dtypes`)
+- For byte-level access, convert via `.view(np.uint8)`
+- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility

 **read/write contract**:

- read/write는 **contiguous tensor** 기준이다.
-  non-contiguous stride view가 필요한 경우 별도 copy op으로 표현한다.
- 일반 benchmark path에서는 producer/consumer dtype 일치를 기대한다.
-  reinterpret cast는 low-level memory validation 또는 특수 테스트 케이스를 위한
-  permissive behavior이다.
- addr은 byte-aligned이며, 최소 alignment = dtype 크기.
- dtype mismatch (write와 다른 dtype으로 read)는 reinterpret cast로 처리한다.
-  shape 불일치 시 nbytes 기준으로 검증하고, 불일치하면 error.
- 정합성 기준은 주소 범위 기반 read/write semantics를 따른다.
- 구현 최적화로 tensor object cache를 둘 수 있지만,
-  canonical state는 byte-addressable storage이다.
- deploy 시점에 호스트가 초기 텐서 데이터를 주입한다.
+- read/write operates on a **contiguous tensor** basis.
+  If non-contiguous stride views are needed, express them as separate copy ops.
+- In the normal benchmark path, producer/consumer dtype match is expected.
+  Reinterpret cast is a permissive behavior for low-level memory validation
+  or special test cases.
+- addr is byte-aligned, with minimum alignment = dtype size.
+- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
+  If the shapes differ, the access is validated by nbytes; an nbytes mismatch is an error.
+- Correctness criteria follow address-range-based read/write semantics.
+- A tensor object cache may be used as an implementation optimization,
+  but the canonical state is byte-addressable storage.
+- At deploy time, the host injects initial tensor data.
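
The nbytes rule for reinterpret casts can be sketched in plain Python. `ITEMSIZE`, `tensor_nbytes`, and `check_reinterpret` are hypothetical names, and the dtype strings follow the short forms used in D11 rather than any confirmed MemoryStore API.

```python
import math

ITEMSIZE = {"i8": 1, "f16": 2, "bf16": 2, "i32": 4, "f32": 4}  # bytes per element


def tensor_nbytes(shape, dtype):
    return math.prod(shape) * ITEMSIZE[dtype]


def check_reinterpret(written, requested):
    """written / requested are (shape, dtype) pairs; returns nbytes or raises."""
    w, r = tensor_nbytes(*written), tensor_nbytes(*requested)
    if w != r:
        raise ValueError(f"nbytes mismatch: wrote {w}, read requests {r}")
    return r
```

For example, a `(4, 4)` f32 tensor (64 bytes) may be read back as `(8, 4)` f16, while reading it as `(4, 4)` f16 (32 bytes) is rejected.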

-### D8. 벤치마크 커널 코드
+### D8. Benchmark Kernel Code

-벤치마크의 **사용자 코드 API는 변경하지 않는다**.
-`tl.load()`, `tl.composite()`, `tl.store()` 등의 호출 인터페이스는 유지.
+The benchmark's **user code API is not changed**.
+The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.

-단, 내부 command/message schema는 Phase 2 실행에 필요한 metadata를
-포함하도록 확장될 수 있다 (예: dtype_acc, transpose 등 추가 필드).
+However, internal command/message schemas may be extended to include metadata
+required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).

-### D9. 컴포넌트 변경 없음
+### D9. No Component Changes

-개별 컴포넌트 구현(PE_GEMM, PE_DMA, HBM_CTRL 등)은 수정하지 않는다.
-op_log 기록은 ComponentBase hook의 책임이다.
-커스텀 컴포넌트 교체 시 타이밍 모델만 교체되며,
-Phase 2 데이터 실행은 영향받지 않는다.
+Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
+Op log recording is the responsibility of the ComponentBase hook.
+When custom components are replaced, only the timing model changes,
+and Phase 2 data execution is unaffected.

-### D10. Phase 2는 Optional
+### D10. Phase 2 is Optional

 ```python
 engine = GraphEngine(graph)
-engine.run(benchmark)                       # Phase 1: 타이밍만
+engine.run(benchmark)                       # Phase 1: timing only
 result = engine.get_timing_result()

 if verify_data:
-    executor = DataExecutor(engine.op_log)  # Phase 2: 데이터
+    executor = DataExecutor(engine.op_log)  # Phase 2: data
    executor.run()
    executor.verify(expected_output)
 ```

-타이밍 분석만 필요하면 Phase 2를 건너뛴다.
-op_logger를 비활성화하면 Phase 1 성능도 기존과 동일.
+If only timing analysis is needed, Phase 2 is skipped.
+If the op_logger is disabled, Phase 1 performance is the same as before this change.

 ### D11. Verification Contract

-기본 검증은 **최종 output tensor**를 reference backend(numpy)와 비교한다.
+Basic verification **compares the final output tensor** against a reference backend (numpy).

-dtype별 tolerance 정책:
+Per-dtype tolerance policy:

-| dtype | 비교 방식 | tolerance |
+| dtype | Comparison method | Tolerance |
 |-------|----------|-----------|
 | f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
 | f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
 | bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
-| int 계열 | `np.array_equal` | exact |
+| int types | `np.array_equal` | exact |

- 기본 모드: 최종 output만 비교 (end-to-end correctness)
- 디버그 모드: intermediate tensor도 op 단위로 비교 가능
+- Default mode: compare final output only (end-to-end correctness)
+- Debug mode: can compare intermediate tensors on a per-op basis
  (MemoryStore snapshot at each op boundary)
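
The tolerance table maps directly onto a small comparison helper. This is a sketch assuming NumPy is installed; `TOLERANCE` and `outputs_match` are hypothetical names, and the inputs are ordinary float64 arrays, so only the tolerance selection is demonstrated, not real f16/bf16 arithmetic.

```python
import numpy as np

# (rtol, atol) per dtype, copied from the table above.
TOLERANCE = {
    "f32": (1e-5, 1e-5),
    "f16": (1e-3, 1e-3),
    "bf16": (1e-2, 1e-2),
}


def outputs_match(actual, expected, dtype):
    """Float dtypes use np.allclose; everything else must match exactly."""
    if dtype in TOLERANCE:
        rtol, atol = TOLERANCE[dtype]
        return bool(np.allclose(actual, expected, rtol=rtol, atol=atol))
    return bool(np.array_equal(actual, expected))


expected = np.array([1.0, 2.0])
actual = expected + 5e-4     # small error, within f16 tolerance only
```

`np.allclose` tests `|actual - expected| <= atol + rtol * |expected|`, so the reference tensor belongs in the second argument.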

 ---

 ## Non-goals

- **Compute-result-based control flow**: 지원하지 않는다.
-  모든 compute handle은 Phase 1에서 pending 상태이며,
-  `wait()`는 timing synchronization만 표현하고 data readiness를 의미하지 않는다.
-  Phase 1에서 `handle.data` 접근, element access, truth-value evaluation은
-  **error로 처리**한다.
-  메모리 데이터 기반 분기(`tl.load()` 결과)는 greenlet으로 지원된다.
-  Phase 1 materialization은 future extension (D3 참조).
- **Cycle-accurate overlap reconstruction**: Phase 2에서 Phase 1의 실행 시간
-  overlap을 정확히 재현하지 않는다. Phase 2는 데이터 정합성만 검증한다.
- **GPU kernel compilation**: Phase 2의 GEMM/Math는 numpy/torch 호출이며,
-  실제 하드웨어 PE의 마이크로아키텍처를 재현하지 않는다.
+- **Compute-result-based control flow**: not supported.
+  All compute handles are in pending state during Phase 1, and
+  `wait()` expresses timing synchronization only; it does not imply data readiness.
+  Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
+  is **treated as an error**.
+  Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
+  Phase 1 materialization is a future extension (see D3).
+- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
+  the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
+- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
+  and do not reproduce the actual hardware PE microarchitecture.

 ## Open Questions

- **Aliasing / slice view**: 동일 backing storage를 참조하는 slice/view를
-  MemoryStore에서 어떻게 표현할지 (stride-based view vs copy semantics)
- **IPCQ/descriptor read 일반화**: PE-to-PE 통신을 memory op으로 완전히
-  일반화할지, 별도 op_kind를 둘지
- **Op log streaming**: 대규모 시뮬레이션에서 op_log 메모리 사용량 관리
+- **Aliasing / slice view**: How to represent slice/views referencing the same
+  backing storage in MemoryStore (stride-based view vs copy semantics)
+- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
+  communication as memory ops or introduce a separate op_kind
+- **Op log streaming**: Managing op_log memory usage in large-scale simulations
  (in-memory list vs disk-backed streaming)
- **Fused operation**: tl.composite의 tiled pipeline (READ→COMPUTE→WRITE)을
-  하나의 fused op record로 기록할지, 개별 op으로 분리할지
- **Math op schema 일반화**: 현재 math params는 단순 구조이나,
-  broadcasting rule, input별 dtype, keepdims, scalar/immediate operand,
-  where/mask 표현 등 일반화가 필요할 수 있음
- **Op record 식별자**: 현재 dependency_ids는 in-memory list index 기반이며,
-  streaming/disk-backed mode 도입 시 stable op_id로 대체 필요
- **Phase 1 materialization policy**: D3의 Future Extension 참조.
-  허용 시 해당 op의 Phase 2 처리 방식 (skip / verify / recompute) 정의 필요
+- **Fused operation**: Whether to record tl.composite's tiled pipeline
+  (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
+- **Math op schema generalization**: The current math params have a simple structure,
+  but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
+  scalar/immediate operands, where/mask expressions, etc.
+- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
+  replacement with stable op_id is needed when introducing streaming/disk-backed mode
+- **Phase 1 materialization policy**: See Future Extension in D3.
+  If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
+  needs to be defined

 ---

 ## Consequences

-### 긍정적
+### Positive

- SimPy 시뮬레이션 성능 영향 최소 (op_log append만 추가)
- Phase 2에서 멀티스레드/GPU 자유롭게 사용 가능
- 컴포넌트 교체 자유도 유지 (ADR-0015 설계 철학 보존)
- 벤치마크 사용자 코드 API 변경 불필요
- 새 메시지 타입 추가 시 data_op 플래그만 설정
- greenlet으로 Phase 0 제거 — 메모리 데이터 기반 dynamic control flow 지원
- `tl.load()`가 실제 데이터를 반환하므로 커널 디버깅 용이
+- Minimal impact on SimPy simulation performance (only op_log append added)
+- Free to use multi-threading/GPU in Phase 2
+- Component replaceability preserved (ADR-0015 design philosophy maintained)
+- No changes needed to benchmark user code API
+- New message types only need the data_op flag set
+- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
+- `tl.load()` returns actual data, making kernel debugging easier

-### 부정적
+### Negative

- op_log 메모리 사용량 (대규모 시뮬레이션 시)
- Phase 2 실행 시간은 텐서 크기에 비례 (대형 GEMM)
- pending handle (연산 미완료) 기반 동적 분기 불가
-  (연산은 Phase 2에서 실행, Phase 1에서 결과 값 미확정).
-  메모리 데이터 기반 분기는 greenlet으로 지원된다.
- greenlet C 확장 의존성 추가 (pip install greenlet)
+- op_log memory usage (for large-scale simulations)
+- Phase 2 execution time is proportional to tensor size (large GEMM)
+- Dynamic branching based on pending handles (incomplete computations) is not possible
+  (computations execute in Phase 2, so result values are undetermined in Phase 1).
+  Memory-data-based branching is supported via greenlet.
+- greenlet C extension dependency added (pip install greenlet)
@@ -1,882 +0,0 @@
-# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication
-
-## Status
-
-Accepted
-
-## Context
-
-### Goal
-
-Add the infrastructure that lets CCL (Collective Communication Library)
-kernels run **inside** a PE. The host just launches a kernel on each
-SIP; the actual synchronization and data movement happen **inside the
-PE kernel via an IPCQ (Inter-Process Communication Queue)**.
-
-This mirrors how NCCL performs NVLink communication inside a GPU
-kernel, or how Cerebras / Tenstorrent expose core-local communication
-queues. Host-level collectives (`dist.all_reduce`) are deferred to
-**future work**; this ADR focuses solely on the kernel-side collective
-infrastructure.
-
-### Problems to solve
-
-1. PE-to-PE direct data movement (writing into a peer's memory).
-2. Synchronization — the sender must check that the receiver has space
-   in its buffer (backpressure).
-3. Resource contention between compute traffic and communication
-   traffic (Head-of-Line blocking).
-4. The host must be able to construct logical neighbor topologies
-   (ring / mesh / tree) per algorithm.
-
---
-
-## Decision
-
-### D1. Add a new `PE_IPCQ` component
-
-A new component `PE_IPCQ` is added inside each PE. It follows the same
-pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a
-distinct component.
-
-```
-PE
-├── PE_CPU
-├── PE_SCHEDULER
-├── PE_DMA
-├── PE_IPCQ          ← new
-├── PE_FETCH_STORE
-├── PE_GEMM
-├── PE_MATH
-├── PE_TCM
-├── PE_MMU
-```
-
-**Role separation** (control plane vs. data plane):
-
- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head /
-  tail pointer management, peer pointer caches, backpressure, 4-direction
-  neighbor mapping.
- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe
-  / PCIE into the peer's memory.
-
-PE_IPCQ does **not** move data itself — it delegates to PE_DMA.
-
-### D2. Ring buffer model
-
-Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.
-
-```python
-@dataclass
-class IpcqQueuePair:
-    direction: Direction          # N/S/E/W
-    peer: IpcqEndpoint            # set by host at init time (D2.5)
-    tx_buffer_base: int           # outgoing data base addr (in our memory)
-    rx_buffer_base: int           # incoming data base addr (in our memory)
-    slot_size: int                # 1 tile per slot
-    n_slots: int                  # ring depth
-    my_head: int                  # next slot we will write/send into
-    my_tail: int                  # next slot we will read/recv from
-    peer_head_cache: int          # peer's last-seen head (updated via D9 piggyback)
-    peer_tail_cache: int          # peer's last-seen tail (updated via D9 fast-path credit)
-```
-
-**Canonical field names**: throughout this ADR the four names above
-(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used
-consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`,
-etc.) are not used.
-
-| Field | Owner | Updated when |
-|-------|-------|--------------|
-| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
-| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
-| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
-| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |
-
-**Slot unit**: fixed-size, one slot holds one full tile (no descriptor
-indirection). Full data embedded in the slot. See D5.
-
-### D2.5. `IpcqEndpoint` schema
-
-`IpcqQueuePair.peer` carries everything the sender needs to compute the
-peer's rx slot address:
-
-```python
-@dataclass(frozen=True)
-class IpcqEndpoint:
-    sip: int
-    cube: int
-    pe: int
-    buffer_kind: str             # "tcm" | "hbm" | "sram"
-    rx_base_pa: int              # peer rx_buffer base PA (PhysAddr.encode())
-    rx_base_va: int              # peer rx_buffer base VA (optional, MMU mode)
-    n_slots: int                 # peer ring depth (for wrap-around)
-    slot_size: int               # peer slot size (for offset)
-```
-
-Address computation:
-
-```python
-slot_idx = self.my_head % peer.n_slots
-dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
-```
-
-PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA
-(vc_comm) routes the data to `dst_pa` through the fabric.
-
-**Endpoint construction order**: at backend init (D10), the IPCQ
-buffers for **every PE** are allocated first (so each rank knows the
-others' PA), then the per-rank neighbor tables are built and pushed to
-PE_IPCQ via `IpcqInitMsg`.
-
-### D3. Four-direction mapping ≡ logical ProcessGroup
-
-The PE views four directions (N/S/E/W) as logical ports. Real peer
-addresses are configured by the host CCL init, per the chosen
-algorithm. The PE kernel never knows the topology, only directions.
-
-```python
-# 1D ring
-for rank in range(world_size):
-    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
-    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])
-
-# 2D mesh
-for r in range(R):
-    for c in range(C):
-        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
-        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
-        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
-        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
-```
-
-The PE code does not need to know where `tl.send(dir="E", ...)` actually
-ends up.
-
-### D4. PE kernel API
-
-```python
-# Send (blocking; may stall on backpressure)
-tl.send(dir: str, src=TensorHandle)
-tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
-
-# Recv (blocking)
-recv = tl.recv(dir: str, shape=..., dtype=...)
-recv = tl.recv(shape=..., dtype=...)        # round-robin across 4 directions
-
-# Recv (non-blocking)
-fut  = tl.recv_async(dir: str, shape=..., dtype=...)
-recv = tl.wait(fut)
-```
-
-`tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each
-call rotates through directions, returning the first available slot.
-Empty in all 4 directions → wait.
-
-**Fairness is weak**: the rotating start mitigates simple bias, but if
-one direction always wins the race the others can starve. Algorithms
-that need strict fairness must call `tl.recv(dir=...)` explicitly.
-
-### D5. Single-hop DMA write + full-data slot model
-
-Data moves from sender memory into the receiver's ring slot in **one
-DMA transfer**. Key properties:
-
- **Single-hop**: the sender already knows the peer rx slot address and
-  fires one fabric DMA into it.
- **No CPU memcpy**: the CPU never copies data.
- **No intermediate staging**: neither side keeps a separate staging
-  buffer (sender uses the source addr directly; receiver gets the data
-  in its ring slot directly).
-
-(Strictly speaking the fabric DMA write does happen, so this is not
-literally "no data movement" — it's the same property NCCL labels
-"zero-copy", meaning no CPU memcpy and no staging copy.)
-
-```
-PE A: tl.send(E, src_addr, nbytes)
-  1. IPCQ computes the peer rx slot address:
-       dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
-  2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
-                   (full → sleep / poll)
-  3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
-  4. my_head += 1
-
-PE B: data = tl.recv(W)
-  1. Look at rx_buffer[my_tail % n_slots]
-  2. Wait for the data to arrive (D7 backpressure mode)
-  3. Return the slot address to the kernel (or fetch into register file)
-  4. my_tail += 1
-  5. Issue a credit-return fast path (D9): after the bottleneck-BW
-     latency the peer A's peer_tail_cache is updated.
-```
-
-The slot holds the full tile. The receiver only reads its own
-rx_buffer; it never reads back into A's memory. The sender knows the
-peer rx slot address and DMAs directly into it (single-hop).
-
-The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local
-to the PE).
-
-### D6. Buffer placement — three-way benchmark
-
-The host CCL init picks the IPCQ ring-buffer location:
-
-```python
-ipcq_init(
-    backend="ahbm",
-    buffer_kind="tcm" | "hbm" | "sram",
-    n_slots=8,
-    slot_size=4096,
-)
-```
-
-| Location | Trait | Trade-off |
-|----------|-------|-----------|
-| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
-| **PE-local HBM** | Large; via DMA | Higher latency |
-| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |
-
-All three locations run the same kernel code; only the init differs.
-
-### D7. Backpressure — two-mode benchmark
-
-How the sender waits when the peer's slots are full, and how the
-receiver waits when data has not yet arrived:
-
-| Mode | Behavior | Model |
-|------|----------|-------|
-| **poll** | Periodically re-check the cached peer pointer | Spin loop |
-| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |
-
-```python
-ipcq_init(backpressure="poll" | "sleep", ...)
-```
-
-Both modes are implemented so latency / throughput trade-offs can be
-benchmarked.
-
-### D8. PE_DMA virtual channels
-
-Extend PE_DMA from a single queue into a **two-channel virtual-channel**
-model.
-
-```
-PE_DMA
-├── vc_compute: tile load / store / writeback for GEMM and Math
-└── vc_comm:    IPCQ send data
-```
-
-Each VC has an independent state machine:
-
- One channel stalling does not block the other.
- The same physical link (cube_noc, UCIe, …) is shared, but link BW is
-  split between channels.
-
-**Chunk-level interleave**:
-
- Large GEMM tile DMAs do not lock the link end-to-end.
- Progress happens in chunks (e.g. 256 B); each chunk shares link BW
-  with the other VC's pending chunks.
- Chunk size is an init parameter (smaller = fairer, larger = more
-  efficient).
-
-Net effect:
-
- HoL blocking is eliminated (an IPCQ send can interleave with a long
-  compute DMA).
- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM
-  pattern).
- Matches the NoC-virtual-channel pattern used in real HW.
-
-**First-implementation accuracy limit (intentional)**: this ADR's
-first cut uses **deterministic chunk-level interleave + weighted
-round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`).
-This is a first-order approximation and is simpler than real HW
-dynamic-contention / credit-based arbiters. Functional correctness is
-unaffected, but heavy-contention scenarios may report slightly
-optimistic latency vs. real HW. A separate ADR can add a NoC arbiter
-component later if more precision is needed.
-
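The deterministic chunk-level interleave with weighted round-robin described above can be sketched as follows. This is an illustration under stated assumptions, not the PE_DMA code: `split_chunks` and `interleave` are hypothetical helpers, and the weights are modeled as "chunks granted per round" rather than the `ccl.yaml` representation, which this section does not specify.

```python
from collections import deque


def split_chunks(nbytes, chunk_size):
    """Cut one DMA into chunk_size pieces plus an optional short tail."""
    full, rest = divmod(nbytes, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


def interleave(compute_bytes, comm_bytes, chunk_size=256, weights=(1, 1)):
    """Return the order in which chunks reach the shared link."""
    if min(weights) < 1:
        raise ValueError("each VC needs a weight of at least 1")
    queues = {
        "vc_compute": deque(split_chunks(compute_bytes, chunk_size)),
        "vc_comm": deque(split_chunks(comm_bytes, chunk_size)),
    }
    quota = dict(zip(queues, weights))
    order = []
    while any(queues.values()):
        # One arbitration round: each VC may send up to its quota.
        for vc, queue in queues.items():
            for _ in range(quota[vc]):
                if not queue:
                    break
                order.append((vc, queue.popleft()))
    return order


# A 1 KiB compute tile and a 256 B IPCQ send share the link 50 / 50.
order = interleave(compute_bytes=1024, comm_bytes=256)
print(order)
```

In this run the IPCQ chunk goes out second instead of waiting behind all four compute chunks, which is the HoL-blocking property the section claims.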
-#### Token routing
-
- Compute tokens (`TileToken`) — go through the existing
-  PE_FETCH_STORE → PE_DMA chain.
- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA
-  self-routing.
- PE_DMA picks the channel by token type.
-
-```python
-class PeDmaComponent:
-    def _process(self, env, token):
-        if isinstance(token, IpcqDmaToken):
-            yield from self._vc_comm_process(env, token)
-        else:
-            yield from self._vc_compute_process(env, token)
-```
-
-### D9. Pointer synchronization — DMA payload piggyback
-
-Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so
-pointers update along with the data. This simulation adopts the same
-model: **no separate control channel** — metadata travels with the
-data.
-
-The big benefits:
-
- **Automatic ordering**: data and metadata move on the same token, so
-  data is visible **before** the head_cache update. No race.
- **HW fidelity**: matches NVLink / UCIe piggybacked headers.
- **Component simplification**: no separate `IpcqPtrUpdate` event type.
-
-#### Send flow (head update via piggyback)
-
-```
-PE A: tl.send(E, src_addr, nbytes)
-  1. PE_IPCQ checks backpressure (using peer_tail_cache)
-  2. PE_IPCQ creates an IpcqDmaToken:
-       - data body (src_addr → peer dst_addr)
-       - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
-  3. Hand the token to PE_DMA(vc_comm)
-  4. PE A increments my_head (send tracking)
-
-[fabric DMA: latency elapses]
-
-PE B's PE_DMA receives the token
-  5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
-  6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)
-
-PE B's PE_IPCQ receives the metadata
-  7. Updates peer_head_cache (= A's head)
-  8. Wakes any pending recv on that direction
-```
-
-**Steps 5 and 6 must execute in the same SimPy step** — DMA completion
-makes data and metadata atomically visible.
-
-#### Recv flow (credit return — fast path with bottleneck-BW latency)
-
-When the receiver frees a slot, the sender must learn about it
-(backpressure release). Unlike data, the credit return does **not**
-travel through general vc_comm fabric — it uses a **separate fast
-path**, an abstraction of the NVLink / UCIe credit-return wire.
-
-**Latency** is computed from the **full path latency** (per-node
-overhead + edge propagation + drain), not a magic constant:
-
-```
-credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
-path = router.find_path(self_pe, peer_pe.pe_dma)
-latency = compute_path_latency_ns(path, credit_size_bytes)
-        = sum(edge.distance_mm * ns_per_mm)
-        + sum(node_overhead_ns[n] for n in path)
-        + credit_size_bytes / bottleneck_bw_on_path
-```
-
-The router auto-appends `.pe_dma` to the source only, so the
-destination MUST be spelled with the explicit `.pe_dma` suffix.
-Without it, `find_path` raises and the credit silently teleports at
-zero cost (a latent bug fixed alongside this update).
-
-`tl.recv` blocks on the credit-emit completion (recv yields-from
-`_delayed_credit_send` rather than spawning it as a fork). This puts
-the credit-return cost on the receiver's `pe_exec_ns`, modeling the
-IPCQ control-plane completing the consume-acknowledgement before
-recv returns to the kernel — the protocol equivalent of a non-posted
-`tl.store` waiting for an HBM ack on the raw DMA path.
-
-That gives us:
-
- **Topology-proportional approximation**: an in-cube credit return is
-  automatically faster than a cross-SIP credit return.
- **No magic constants**: every nanosecond comes from
-  `compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
-  as data traffic.
- **No deadlock risk**: unlike piggyback, B can issue credit even when
-  it has no data to send back. `peer_credit_store.put` is unbounded.
- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
-  cost on recv balances the HBM ack-trip cost RAW pays on the sender.
-
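The three-term latency formula above can be worked through with concrete numbers. The sketch below is not the real `compute_path_latency_ns`: `Edge` and `credit_latency_ns` are hypothetical, bandwidth is assumed to be expressed in bytes per nanosecond, and the distances, overheads, and `ns_per_mm` value are made-up inputs chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    """Hypothetical fabric edge on the credit-return path."""
    distance_mm: float
    bw_bytes_per_ns: float


def credit_latency_ns(edges, node_overheads_ns, ns_per_mm, credit_size_bytes=16):
    # Term 1: wire propagation summed over every edge on the path.
    propagation = sum(e.distance_mm * ns_per_mm for e in edges)
    # Term 2: per-node forwarding overhead.
    overhead = sum(node_overheads_ns)
    # Term 3: drain of the credit payload at the slowest edge.
    bottleneck = min(e.bw_bytes_per_ns for e in edges)
    drain = credit_size_bytes / bottleneck
    return propagation + overhead + drain


# Illustrative paths only: a short in-cube hop vs. a longer cross-SIP route.
in_cube = credit_latency_ns([Edge(2.0, 64.0)], [1.0, 1.0], ns_per_mm=0.5)
cross_sip = credit_latency_ns(
    [Edge(2.0, 64.0), Edge(10.0, 32.0)], [1.0, 2.0, 1.0], ns_per_mm=0.5
)
print(in_cube, cross_sip)
```

With these inputs the in-cube credit costs 3.25 ns and the cross-SIP credit 10.5 ns, so the topology-proportional behavior falls out of the path data without a tuned constant.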
-#### Component coupling — SimPy Store channel
-
-PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init
-time, **a SimPy Store is wired between the two** (a per-direction
-fast-path channel) and credit metadata is `put` into that store.
-
-```python
-class PeIpcqComponent:
-    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
-        yield env.timeout(latency_ns)
-        yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...))
-```
-
-Backend init wires both directions of the fast-path channel as part of
-fan-out (see `IpcqInitMsg` in D12).
-
-#### Credit-return fast path limitations
-
- `credit_size_bytes` is an estimate (typically 16–64 bytes).
- The fast path is **excluded from vc_comm BW contention** (separate
-  wire). Real HW credit-return wires are very lightweight, so this is a
-  reasonable first approximation.
- A follow-up ADR can: model the credit fast path as a separate link
-  (BW limit + contention), or switch to piggyback (`credit_return_mode:
-  piggyback`).
-
-#### PE_DMA's added responsibility
-
-When `vc_comm` receives a token, PE_DMA processes it as the following
-sequence: pay the Transaction's terminal BW drain, then atomically
-write data and forward metadata. **No SimPy yield is allowed between
-the data write and the metadata forward** (invariant I6). The drain
-yield must sit before the atomic block, not inside it:
-
-```python
-def _on_vc_comm_recv(self, env, txn):
-    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
-    # sender PE_DMA). MUST happen before the atomic block so recv only
-    # wakes after the bytes have "landed".
-    drain = getattr(txn, "drain_ns", 0.0)
-    if drain > 0:
-        yield env.timeout(drain)
-
-    token = txn.request
-    # ── ATOMIC: no yield between these two operations ──
-    # 1. Write the payload into the local rx slot
-    data = self._memory_store.read(token.src_space, token.src_addr,
-                                   shape=..., dtype=...)
-    self._memory_store.write(token.dst_endpoint.buffer_kind,
-                             token.dst_addr, data)
-    # 2. Forward metadata to the local PE_IPCQ
-    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
-    # ───────────────────────────────────────────────────
-```
-
-The final `put` is yieldable but uses an unbounded internal store, so
-it completes in a single step. That `put` is the closing call of the
-atomic block; nothing may be inserted before it.
-
-#### Drain-at-inbound semantics (D9 timing model)
-
-The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
-stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
-is paid at each forwarding component via `run()`, and the remaining
-BW drain is paid once at the Transaction's terminal. Every non-IPCQ
-Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
-`ComponentBase._forward_txn` at the terminal node. For IPCQ the
-destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
-(so IPCQ-specific data write + metadata forward can happen), so **the
-drain MUST be paid explicitly at the top of that handler** to keep
-IPCQ's timing model on par with every other fabric Transaction.
-
-Side-effects of paying drain here:
-
- **SRC `tl.send`** is unchanged — fire-and-forget semantics are
-  preserved because the sender PE_DMA does not `yield sub_done`. The
-  `sub_done.succeed()` call (made after metadata forward below) is an
-  event with no listener on the sender side.
- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
-  when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
-  forward now happens after the drain, recv observes the full fabric
-  transfer time including bandwidth cost.
-
-Matches the physical picture: send dispatches and leaves; recv waits
-until the bytes have actually been drained into its inbox.
-
-### D9.5. ADR-0020 (2-pass) integration
-
-`tl.send` / `tl.recv` integrates with ADR-0020's two-pass model. Phase
-1 simulates timing **and** moves data via MemoryStore; Phase 2 enables
-op-log-based correctness verification.
-
-#### Phase 1 (timing + data)
-
-D9 models head and tail updates with two different mechanisms:
-
- **Send-side (head update)** — DMA payload piggyback. Data write and
-  metadata forward happen in the same SimPy step → automatic atomic
-  visibility.
- **Recv-side (tail credit return)** — fast-path SimPy Store channel
-  with bottleneck-BW latency, then `peer_tail_cache` update.
-
-Together they preserve ring-buffer pointer consistency.
-
-The op-log records `op_kind="ipcq"` entries for sends (with
-`src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with
-`recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).
-Two recv modes:
-
- **`return_slot`** (default): the slot address is returned to the
-  kernel. Zero-copy.
- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`,
-  PE_IPCQ copies the slot data into the user dst.
-
-#### Phase 2 (op_log replay)
-
-When `DataExecutor` encounters an `op_kind="ipcq"` record:
-
- **send**: idempotent `src → dst` ndarray write.
- **recv (`return_slot`)**: no-op (the slot already holds the data).
- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.
-
-IPCQ ops are pure data movement — Phase 2 has nothing extra to compute.
-The downstream GEMM / Math ops in `DataExecutor` will consume the data
-and naturally validate correctness.
-
-### D10. Host CCL init keeps the PyTorch shape
-
-The host code looks just like real PyTorch DDP. `init_process_group`
-creates the backend object; it does **not** receive IPCQ knobs
-(neighbor topology, buffer_kind, backpressure …).
-
-```python
-# benches/ccl_allreduce.py — same shape as real PyTorch
-def worker(rank, world_size, torch):
-    dist = torch.distributed
-    dist.init_process_group(backend="ahbm")  # reads ccl.yaml + topology
-    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
-    tensor.copy_(torch.from_numpy(init))
-    dist.all_reduce(tensor, op="sum")
-```
-
-The IPCQ configuration is decided by the backend at
-`init_process_group` time: it loads `ccl.yaml`, picks the algorithm,
-and pushes IPCQ neighbor tables to every participating PE_IPCQ. The
-host code never has to know about IPCQ.
-
-A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`.
-Switching algorithms is purely a `ccl.yaml` change — no host edits
-required.
-
-#### Init flow (eager)
-
-1. `init_process_group(backend="ahbm")` is called.
-2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
-3. Pulls topology + buffer_kind + backpressure + slot config from
-   `algorithms[<algo>]`.
-4. **Immediately** installs neighbor tables on every PE_IPCQ
-   (sideband or fabric `IpcqInitMsg`).
-5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally —
-   PE_IPCQ is already prepared whether the kernel is a CCL kernel or
-   not.
-
-### D11. CCL config file (`ccl.yaml`)
-
-IPCQ config and algorithm metadata live in a separate YAML file,
-following the same pattern as `components.yaml` and `topology.yaml`.
-
-A single benchmark execution runs one algorithm
-(`defaults.algorithm`). Switching algorithms means editing
-`defaults.algorithm` only.
-
-```yaml
-defaults:
-  algorithm: ring_allreduce_tcm
-  buffer_kind: tcm                # tcm | hbm | sram
-  backpressure: sleep             # poll | sleep
-  n_slots: 8
-  slot_size: 4096
-  vc_chunk_size: 256
-  ipcq_credit_size_bytes: 16
-
-algorithms:
-  ring_allreduce_tcm:
-    module: kernbench.ccl.algorithms.ring_allreduce
-    topology: ring_1d             # builtin name or "custom"
-    buffer_kind: tcm
-    n_elem: 8                     # optional, per-algorithm tile width
-
-  tree_allreduce_7:
-    module: kernbench.ccl.algorithms.tree_allreduce
-    topology: tree_binary
-    buffer_kind: tcm
-    world_size: 7                 # algorithm-level override
-    n_elem: 16
-
-  custom_mesh:
-    module: kernbench.ccl.algorithms.custom_mesh
-    topology: custom              # the module supplies its own neighbors()
-```
-
-`world_size` is **not set in `defaults`**. The backend resolves it via:
-`algorithm-level override > defaults override > topology spec`. The
-last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP
-where `WORLD_SIZE` comes from env vars rather than config files.
-
-#### Algorithm module structure
-
-Each algorithm module exports two hooks — `kernel` (required) and
-`neighbors` (optional) — plus a `kernel_args` helper that the
-backend uses to populate positional kernel arguments at `all_reduce`
-time:
-
-```python
-# src/kernbench/ccl/algorithms/ring_allreduce.py
-
-def kernel_args(world_size: int, n_elem: int) -> tuple:
-    return (n_elem, world_size)
-
-
-def kernel(t_ptr, n_elem, world_size, tl):
-    """Required — the PE kernel.
-
-    IPCQ is already installed by the backend before this is called.
-    The kernel only uses the four-direction send / recv API.
-    """
-    ...
-
-
-def neighbors(rank, world_size, neighbor_map):
-    """Optional — override the builtin topology's neighbor map.
-
-    Returns a new dict, the modified-in-place dict, or None to keep the
-    builtin map.
-    """
-    return None
-```
-
-#### `neighbors` override patterns
-
- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
- **Pattern B — replace entirely**: ignore `neighbor_map` and return a
-  brand-new dict.
- **Pattern C — keep builtin**: omit `neighbors` or return None.
-
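As an illustration of Pattern A, the sketch below opens a ring into a line by dropping the wrap-around directions on the two end ranks. It assumes `neighbor_map` is a plain dict from direction name to peer rank, which this section does not state; `ring_1d` here is a hypothetical stand-in for the builtin topology, written only so the example runs on its own.

```python
def ring_1d(rank, world_size):
    """Hypothetical stand-in for the builtin ring_1d neighbor map."""
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def neighbors(rank, world_size, neighbor_map):
    """Pattern A: tweak the builtin map instead of replacing it."""
    trimmed = dict(neighbor_map)  # copy so the builtin map is untouched
    if rank == 0:
        trimmed.pop("W", None)    # first rank has no western peer
    if rank == world_size - 1:
        trimmed.pop("E", None)    # last rank has no eastern peer
    return trimmed


for r in range(4):
    print(r, neighbors(r, 4, ring_1d(r, 4)))
```

A kernel written against this map must not call `tl.send` or `tl.recv` on the dropped directions, since an uninstalled direction is a runtime error (F1 in D14).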
-#### Builtin topologies
-
-| topology | direction set |
-|----------|---------------|
-| `ring_1d` | E, W |
-| `ring_1d_unidir` | E only |
-| `mesh_2d` | N, S, E, W |
-| `tree_binary` | parent, child_left, child_right |
-| `none` | (empty) — algorithm must supply `neighbors()` |
-
-#### Adding a new algorithm
-
-1. Write `kernel` and `kernel_args` in
-   `src/kernbench/ccl/algorithms/<algo>.py`.
-2. Add an entry in `ccl.yaml`'s `algorithms` section.
-3. (Optional) provide `neighbors()` for custom topology.
-4. Set `defaults.algorithm` to the new algorithm.
-
-The host bench (`benches/ccl_allreduce.py`) does not change.
-
-### D12. Message / token schema
-
-The new message types added by this ADR. They live in
-`src/kernbench/common/pe_commands.py` and
-`src/kernbench/runtime_api/kernel.py`.
-
-#### `IpcqInitMsg` (sideband, fan-out at init)
-
-The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors
-`MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`).
-Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`,
-`my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store`
-field — a `simpy.Store` instance pre-wired so the sender PE_IPCQ can
-push `IpcqCreditMetadata` directly into the receiver's input queue.
-
-#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)
-
-Carries `direction`, source addr/space, nbytes, shape, dtype, and a
-handle id. `data_op=True` so it lands in the op_log.
-
-#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)
-
-Carries `direction` (or None for round-robin), `recv_mode`
-(`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape,
-dtype, blocking flag.
-
-#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)
-
-Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`)
-plus the head metadata (`sender_seq`, `src_sip/cube/pe`,
-`src_direction`). PE_DMA picks the channel by token type
-(`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`).
-
-The receiver's PE_DMA, on token arrival, performs the I6 atomic
-sequence: write data into MemoryStore, then forward `IpcqMetaArrival`
-to the local PE_IPCQ.
-
-#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)
-
-Carries `consumer_seq` (= my_tail), source PE coords, and source
-direction. Travels through the dedicated SimPy Store channel rather
-than `vc_comm`. Latency is the full path latency defined in D9: edge
-propagation + per-node overhead +
-`credit_size_bytes / bottleneck_bw_on_path`.
-
-There is **no `IpcqPtrUpdate` event** — head updates flow via D9
-piggyback, tail updates via the D9 fast-path channel.
-
-### D13. Test strategy
-
-Test plan:
-
-#### T1. Unit tests (component-level)
-
- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure
-  immediately forwards a token; full peer slot triggers backpressure
-  (poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`;
-  round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`.
- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute`
-  / `vc_comm` independent progress, chunk interleave, BW split.
- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d /
-  mesh_2d / tree_binary correctness, mesh_2d non-square →
-  `ValueError`, custom resolver returns the module's `neighbors`.
-
-#### T2. Integration tests (E2E send/recv)
-
- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional
-  no-deadlock), 4×4 mesh.
- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode
-  records `ipcq` ops in op_log; DataExecutor produces correct
-  `out.data`.
-
-#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)
-
-`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA
-consistency, per-`buffer_kind` allocation.
-
-#### T4. Regression
-
-All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for
-non-CCL benches.
-
-#### T5. Performance / overhead
-
-Single send/recv pair latency = (DMA latency) + (IPCQ overhead).
-Should be close to a regular PE_DMA write of the same nbytes (IPCQ
-overhead < 100 ns).
-
-### D14. Invariants and failure modes
-
-#### Invariants
-
-I1. **Slot lifecycle exactly-once**: one send → exactly one recv.
-I2. **Pointer monotonicity**: `my_head` / `my_tail` are monotonically
-   non-decreasing; `sender_seq` is strictly increasing.
-I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank
-   B, then rank B's reverse-direction peer must be rank A. Verified at
-   init.
-I4. **`buffer_kind` consistency**: all PEs in a process group share
-   the same `buffer_kind` (no mixed mode in the first cut).
-I5. **op_log ordering**: send → DMA complete → recv possible. The
-   t_start order in op_log respects this causality.
-I6. **Atomic data + metadata visibility (MUST)**: at the receiver
-   side, data write (`MemoryStore.write`) and metadata forward
-   (`peer_head_cache` update) **must execute in the same SimPy step**.
-   No yield is allowed between the two operations in PE_DMA's vc_comm
-   handler. Code review must reject any inserted `yield` (or `yield
-   from`) — it would create a race where head_cache becomes visible
-   before or after the data.
-I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6,
-   the step in which `peer_head_cache > my_tail` becomes true is the
-   same step in which the slot data is observable.
-
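Invariant I3 lends itself to a small init-time check. The sketch below is one possible shape for it, not the backend's validator: `REVERSE`, `check_endpoint_consistency`, and the `{rank: {direction: peer_rank}}` table layout are assumptions made for illustration, and only the four compass directions are covered.

```python
# Assumed pairing of each direction with its reverse.
REVERSE = {"E": "W", "W": "E", "N": "S", "S": "N"}


def check_endpoint_consistency(tables):
    """Return (rank, direction, peer) for every link that violates I3."""
    errors = []
    for rank, table in tables.items():
        for direction, peer in table.items():
            # The peer must point back at us on the reverse direction.
            back = tables.get(peer, {}).get(REVERSE[direction])
            if back != rank:
                errors.append((rank, direction, peer))
    return errors


# A consistent 4-rank bidirectional ring.
ring = {r: {"E": (r + 1) % 4, "W": (r - 1) % 4} for r in range(4)}
print(check_endpoint_consistency(ring))

# Break rank 1's western link and the mismatch shows up on both ends.
broken = {r: dict(t) for r, t in ring.items()}
broken[1]["W"] = 2
print(check_endpoint_consistency(broken))
```

Returning the full list of violations rather than raising on the first one keeps the eventual `init_process_group` error message (F4) informative when several links are miswired at once.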
-#### Failure modes (runtime errors)
-
-F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction
-   → `IpcqInvalidDirection`, simulation aborts.
-F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched
-   send and recv. Not validated by default; opt-in strict mode catches
-   it (`strict_validation: true` on a PE_IPCQ node attrs).
-F3. **Deadlock detection (timeout-based)**: the simulator empties its
-   schedule while a send/recv is still pending → engine raises
-   `IpcqDeadlock` and embeds a pointer dump.
-F4. **Backend init failure**: missing `defaults.algorithm`, missing
-   `algorithms[name]`, module import failure, topology validation
-   failure (I3, I4) — all raised at `init_process_group` time.
-F5. **Slot full + infinite backpressure**: the peer never recvs.
-   Surfaces as F3 timeout.
-
-#### Diagnostics
-
- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as
-  `(rank, t, dir, nbytes)`.
- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)`
-  prints every PE_IPCQ ring buffer's `my_head`, `my_tail`,
-  `peer_head_cache`, `peer_tail_cache`.
- **Deadlock dump**: on hang the engine includes the pointer dump in
-  the `IpcqDeadlock` exception message.
-
-### D15. Algorithm-author cheat sheet
-
-Full step-by-step lives in
-[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
-shortest version:
-
-| Things you touch | Things you don't |
-|------------------|-------------------|
-| `src/kernbench/ccl/algorithms/<your_algo>.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
-| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
-| (Optional) `tests/test_<your_algo>.py` mock test | PE_IPCQ component, AhbmCCLBackend |
-
-5-step flow: write the kernel → register in `ccl.yaml` → optional
-`neighbors` override → optional mock unit test → SimPy validation via
-`kernbench run --bench ccl_allreduce --verify-data`.
-
-Common mistakes: using a direction that wasn't installed, sends
-without matching recvs (deadlock), dtype/shape disagreement, assuming
-fairness from `tl.recv()` round-robin, confusing
-`tl.num_programs(axis)` with the CCL group size.
-
---
-
-## Non-goals
-
- **Host collective**: a model where `dist.all_reduce` itself moves
-  data on the host side is out of scope. This ADR only covers
-  communication that happens inside the PE kernel.
- **All-reduce algorithms**: ring / tree / etc. live in algorithm
-  modules and can be added without amending this ADR.
- **Reliability / error handling**: link faults, send/recv failure
-  recovery, etc. are out of scope.
- **NoC arbiter precision**: dynamic VC contention is left for a future
-  ADR (see D8).
-
---
-
-## Open questions
-
- **VC arbitration accuracy** — the first cut uses deterministic
-  chunk interleave + weighted round-robin; heavy contention may report
-  optimistic latency. A NoC arbiter component can be added later.
- **Credit return BW model** — the fast path is currently outside the
-  fabric BW contention model. Can be modeled as a separate link or
-  switched to piggyback (`credit_return_mode: piggyback`).
- **Ring buffer slot allocation metadata** — whether the host pushes
-  IPCQ buffer metadata via sideband or via a fabric message similar to
-  `MmuMapMsg` is open.
- **VC BW split default** — 50/50 vs. weighted (e.g. 80/20). Exposed in
-  `ccl.yaml`; default value TBD.
- **Direction count** — 4 (N/S/E/W) is fixed in the first cut; 6
-  (with Up/Down for 3D) or N (variable) is future work.
- **Multi-tile aggregation primitives** — whether
-  `tl.recv_all` or similar is needed for fan-in.
- **Round-robin recv fairness** — current weak fairness can starve;
-  strict fairness counter is future work.
- **Deadlock detection precision** — currently timeout-based; a
-  realtime wait-for graph would enable deterministic detection.
-
---
-
-## Consequences
-
-### Positive
-
- PE-to-PE direct communication enables CCL kernels to be written.
- Host stays minimal (just `launch`), synchronization happens inside
-  the PE → strong compute / comm overlap.
- VCs eliminate HoL blocking → collective latency is not blocked by
-  compute traffic.
- Buffer placement and backpressure mode are init-time parameters →
-  easy to benchmark.
- Four-direction logical neighbors → host is free to map
-  ring/mesh/tree algorithms.
-
-### Negative
-
- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE.
- VC arbitration is a first-order approximation; heavy contention
-  scenarios may report slightly optimistic latency vs real HW (D8).
- Chunk-level interleave makes PE_DMA implementation more complex.
@@ -6,43 +6,46 @@ Accepted

 ## Context

-### 목표
+### Goal

-`torch.distributed` collective 호출의 참여 단위(rank)를 **SIP**(device)
-경계에 맞춘다. 실제 PyTorch DDP/TP 스크립트와 **호스트 레벨에서 구분 없이**
-읽히는 bench 코드를 목표로 한다.
+Align the participation unit (rank) of `torch.distributed` collective calls
+to the **SIP** (device) boundary. The aim is bench code that, at the host
+level, reads **indistinguishably** from real PyTorch DDP/TP scripts.

-real PyTorch와 비교:
+Comparison with real PyTorch:

-| 차원 | real PyTorch | KernBench |
+| Dimension | real PyTorch | KernBench |
 | --- | --- | --- |
-| 프로세스 모델 | N개 프로세스, 각 1 GPU | 1 프로세스, N greenlet, 각 1 SIP |
-| `get_rank()` | `RANK` env var | greenlet-local 레지스트리 |
-| `get_world_size()` | `WORLD_SIZE` env var | topology의 SIP 수 |
+| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
+| `get_rank()` | `RANK` env var | greenlet-local registry |
+| `get_world_size()` | `WORLD_SIZE` env var | SIP count from topology |
 | `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
-| `mp.spawn` | OS 프로세스 fork | greenlet fan-out |
+| `mp.spawn` | OS process fork | greenlet fan-out |

-### 풀어야 할 문제
+### Problems to solve

-1. **공개 API에서 rank = SIP** — bench worker가 PE 개념을 알지 않도록.
-2. **Greenlet-local rank/device tracking** — 1-프로세스 모델 안에서 각
-   worker greenlet이 자기 rank / 자기 SIP를 정확히 식별.
-3. **Tensor placement = structural (sip, cube, pe)** — rank가 SIP이면
-   기본 텐서 배치도 구조적 좌표로 표현되어야 함.
+1. **Public API where rank = SIP** — so bench workers do not have to know
+   about the PE concept.
+2. **Greenlet-local rank/device tracking** — within the 1-process model,
+   each worker greenlet must correctly identify its own rank / its own SIP.
+3. **Tensor placement = structural (sip, cube, pe)** — if rank is SIP,
+   the default tensor placement should also be expressed in structural
+   coordinates.

-### Non-problem (이 ADR 밖)
+### Non-problem (outside this ADR)

 - IPCQ direction addressing → ADR-0025
- `DPPolicy.sip`/`num_sips` 제거 → ADR-0026
+- Removing `DPPolicy.sip`/`num_sips` → ADR-0026
 - Megatron-style TP → ADR-0027
 - DTensor → ADR-0028 (future)
 - Worker scheduling / `mp.spawn` / collective drain / exception cleanup
  → ADR-0027 D0/D1
- Collective algorithm 구현 (intercube_allreduce, SFR config) → ADR-0032
+- Collective algorithm implementation (intercube_allreduce, SFR config)
+  → ADR-0032

 ## Decision

-### D1. rank = SIP (world_size 해석)
+### D1. rank = SIP (world_size resolution)

 ```python
 def _resolve_world_size(self) -> int:
@@ -55,8 +58,8 @@ def _resolve_world_size(self) -> int:
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
 ```

-우선순위: 알고리즘 override > defaults override > SIP count. `ccl.yaml`
-override는 legacy "rank = PE" 테스트 경로로 유지.
+Priority order: algorithm override > defaults override > SIP count. The
+`ccl.yaml` override is retained as the legacy "rank = PE" test path.

 ### D2. Greenlet-local rank registry (+ debug warning)

@@ -83,11 +86,11 @@ class DistributedContext:
        return int(self._rank_by_greenlet[g])
 ```

-### D3. `torch.ahbm.set_device(rank)` — SIP 바인딩
+### D3. `torch.ahbm.set_device(rank)` — SIP binding

-KernBench 백엔드 이름은 `ahbm` (ADR-0023). Real PyTorch는
-`torch.cuda.set_device(r)`이지만 우리는 CUDA가 아니므로 honestly-named
-namespace를 사용한다.
+The KernBench backend name is `ahbm` (ADR-0023). Real PyTorch uses
+`torch.cuda.set_device(r)`, but since we are not CUDA we use an
+honestly-named namespace.

 ```python
 class _AhbmNamespace:
@@ -113,10 +116,12 @@ class _AhbmNamespace:
 # Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
 ```

-**PyTorch 2.x style 병행 지원**: 최신 PyTorch는 device-agnostic한
-`torch.accelerator` 네임스페이스를 지향 (`torch.accelerator.set_device_index(r)`,
-`torch.accelerator.current_device_index()`). Device vendor에 종속되지 않는
-코드를 쓰려는 사용자를 위해 KernBench도 이 표면을 병행 지원한다.
+**PyTorch 2.x style parallel support**: Recent PyTorch is moving toward a
+device-agnostic `torch.accelerator` namespace
+(`torch.accelerator.set_device_index(r)`,
+`torch.accelerator.current_device_index()`). To support users who want to
+write code that is not tied to a specific device vendor, KernBench also
+exposes this surface in parallel.

 ```python
 class _AcceleratorNamespace:
@@ -141,23 +146,23 @@ self.ahbm = _AhbmNamespace()
 self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias
 ```

-Bench 작성자는 다음 중 하나를 선택 — 둘 다 내부적으로 같은 레지스트리를 보유:
+Bench authors may choose either — both share the same registry internally:

 ```python
 torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
 torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
 ```

-### D4. Tensor placement = structural (sip, cube, pe) 좌표
+### D4. Tensor placement = structural (sip, cube, pe) coordinates

-`resolve_dp_policy`가 `target_sip`을 직접 받아 구조적 좌표로 placement 생성.
-세부는 ADR-0026.
+`resolve_dp_policy` takes `target_sip` directly and produces placement in
+structural coordinates. Details in ADR-0026.

 ```python
 # RuntimeContext._create_tensor
 current_sip = self.ahbm.current_device()          # (D3 naming)
 if current_sip is None:
-    current_sip = 0  # single-driver fallback (D2와 일관)
+    current_sip = 0  # single-driver fallback (consistent with D2)
 placement = resolve_dp_policy(
    dp, shape=shape_2d, itemsize=itemsize,
    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
@@ -165,29 +170,29 @@ placement = resolve_dp_policy(
 )
 ```

-Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적
-좌표를 직접 보유. ShardSpec 상세는 ADR-0026.
+No post-hoc `pe_index` shifting — ShardSpec carries the `(sip, cube, pe)`
+structural coordinates directly. ShardSpec details in ADR-0026.

 ---

 ## Dependencies

- **ADR-0023** (IPCQ): backend `ahbm` namespace의 기원.
- **ADR-0026** (DPPolicy intra-device): D4의 `resolve_dp_policy` 시그니처와
-  ShardSpec의 구조적 좌표 표현.
- **ADR-0027** (Megatron TP + scheduler): worker scheduling, `mp.spawn`,
-  collective drain, exception cleanup의 구현 기준.
+- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
+- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature
+  used by D4 and the structural-coordinate representation of ShardSpec.
+- **ADR-0027** (Megatron TP + scheduler): the implementation baseline for
+  worker scheduling, `mp.spawn`, collective drain, and exception cleanup.

 ---

 ## Non-goals

- **IPCQ protocol 수정**: ADR-0023 유지.
- **DPPolicy 필드 정리**: ADR-0026.
+- **Modifying the IPCQ protocol**: ADR-0023 remains as-is.
+- **Cleaning up DPPolicy fields**: ADR-0026.
 - **Megatron-style TP**: ADR-0027.
 - **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
- **Collective algorithm 구현**: ADR-0032.
- **Multi-node (프로세스 간)**: 단일 프로세스.
+- **Collective algorithm implementation**: ADR-0032.
+- **Multi-node (cross-process)**: single process only.

 ---

@@ -195,12 +200,14 @@ Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적

 ### Positive

- **Bench = real PyTorch DDP** (공개 API 관점).
- **Greenlet-local rank**: 1-프로세스 모델에서 cross-rank correctness 가능.
- **Structural placement 좌표**: ADR-0026 / ADR-0027 / ADR-0032의 다른 ADR이
-  `(sip, cube, pe)` 3튜플 위에서 일관되게 동작.
+- **Bench = real PyTorch DDP** (from the public-API point of view).
+- **Greenlet-local rank**: enables cross-rank correctness within the
+  1-process model.
+- **Structural placement coordinates**: lets the other ADRs (ADR-0026 /
+  ADR-0027 / ADR-0032) operate consistently on top of the `(sip, cube, pe)`
+  3-tuple.

 ### Neutral

- IPCQ PE-level protocol (ADR-0023) 불변.
- IO_CPU 역할 불변 (기존 transit 그대로).
+- IPCQ PE-level protocol (ADR-0023) is unchanged.
+- IO_CPU role is unchanged (existing transit behavior preserved).
@@ -6,51 +6,58 @@ Accepted (Revision 2 — Address-based matching; peer_direction field dropped)

 ## Context

-### 목표
+### Goal

-ADR-0023의 IPCQ protocol에서 **"어느 direction pair를 통한 전송인가"의 식별**을
-topology / dict-order에 의존하지 않고 **주소 기반**으로 일관되게 한다.
-2-rank bidirectional ring (또는 여러 direction이 동일 peer를 가리키는
-topology 일반)에서 정확히 동작하도록 한다.
+In the IPCQ protocol of ADR-0023, make the **identification of "which
+direction pair this transfer belongs to"** consistent and **address-based**,
+without depending on topology / dict-order. It must work correctly in a
+2-rank bidirectional ring (and more generally in any topology where
+multiple directions point to the same peer).

-### 드러난 버그 — 2-rank bidirectional ring
+### The bug surfaced — 2-rank bidirectional ring

-`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). 양쪽 방향이 같은 peer.
+`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions
+point to the same peer.

-**버그 1 (install)**:
- `reverse_direction(0, 1)` → dict order로 "E" 반환 (틀림, "W"가 맞음 — opposite
-  direction convention)
- rank 0의 E entry가 `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`로 설정
- tl.send(E) → data가 sip1의 E-rx buffer로 landing (should be W-rx)
+**Bug 1 (install)**:
+- `reverse_direction(0, 1)` → returns "E" by dict order (wrong; "W" is the
+  correct answer — opposite-direction convention)
+- rank 0's E entry is set with `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`
+- tl.send(E) → data lands in sip1's E-rx buffer (should be W-rx)

-**버그 2 (runtime)**:
- 설령 install이 올바른 주소로 설정해도, receiver의 `_handle_meta_arrival`이
-  sender 좌표만으로 direction 매칭 → 첫 direction (E) 승
- peer_head_cache[E] 증가, peer_head_cache[W]는 불변
- Kernel의 tl.recv(W)는 peer_head_cache[W] 대기 → 영원히 블록 → IpcqDeadlock
+**Bug 2 (runtime)**:
+- Even if install sets up the correct address, the receiver's
+  `_handle_meta_arrival` matches direction by sender coordinates only → the
+  first direction (E) wins
+- peer_head_cache[E] is incremented; peer_head_cache[W] is unchanged
+- The kernel's tl.recv(W) waits on peer_head_cache[W] → blocks forever →
+  IpcqDeadlock

-### 근본 원인
+### Root cause

-두 축에서 동일 문제:
-1. **Install-time pairing**: "내 direction과 peer의 어느 direction이 짝인가"
-   결정이 dict-iteration-order에 의존 → 여러 direction이 같은 peer를 가리킬 때
-   fragile
-2. **Runtime identification**: "어느 qp를 업데이트해야 하는가" 결정이 sender
-   좌표만으로 이루어짐 → direction 중복 시 ambiguous
+The same issue along two axes:
+1. **Install-time pairing**: deciding "which of my directions pairs with
+   which direction of the peer" depends on dict-iteration-order → fragile
+   when multiple directions point to the same peer
+2. **Runtime identification**: deciding "which qp should be updated" is
+   based on sender coordinates alone → ambiguous when directions are
+   duplicated
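
The install-time half of the problem can be reproduced in a few lines. The sketch below is illustrative only: `ring_1d`, `naive_reverse`, and `opposite_first_reverse` are hypothetical stand-ins written for this example (they are not the KernBench implementations), and the E = next rank / W = previous rank convention is an assumption chosen so that the 2-rank case matches the `{"E": 1, "W": 1}` result quoted above.

```python
from typing import Dict, Optional

# Hypothetical stand-ins for illustration; not the KernBench implementation.
OPPOSITE = {"E": "W", "W": "E"}


def ring_1d(rank: int, world_size: int) -> Dict[str, int]:
    """Bidirectional 1-D ring: E is the next rank, W is the previous rank."""
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def naive_reverse(my_rank: int, peer_rank: int, world_size: int) -> Optional[str]:
    """Dict-order lookup: first peer direction that points back at my_rank."""
    for d, r in ring_1d(peer_rank, world_size).items():
        if r == my_rank:
            return d
    return None


def opposite_first_reverse(
    my_rank: int, peer_rank: int, my_dir: str, world_size: int
) -> Optional[str]:
    """Prefer the opposite direction; fall back to dict order otherwise."""
    peer_nbrs = ring_1d(peer_rank, world_size)
    opp = OPPOSITE.get(my_dir)
    if opp is not None and peer_nbrs.get(opp) == my_rank:
        return opp
    return naive_reverse(my_rank, peer_rank, world_size)


print(ring_1d(0, 2))                         # {'E': 1, 'W': 1}
print(naive_reverse(0, 1, 2))                # E (same answer for both of rank 0's directions)
print(opposite_first_reverse(0, 1, "E", 2))  # W
print(opposite_first_reverse(0, 1, "W", 2))  # E
```

With the dict-order lookup, both of rank 0's directions resolve to the peer's "E" slot, which is the install-time symptom described in Bug 1. Passing `my_dir` and trying the opposite direction first gives each direction its own peer slot.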

-### 해결 방향 — address-based matching
+### Solution direction — address-based matching

-각 PE의 rx buffer는 **direction별로 고유한 주소 range**에 위치 (rx_base_pa +
-direction_idx × bytes_per_direction). 따라서:
+Each PE's rx buffer sits at a **unique address range per direction**
+(rx_base_pa + direction_idx × bytes_per_direction). Therefore:

- **Runtime**: sender coord 대신 **dst_addr 범위**로 매칭 → unambiguous
- **Install**: opposite-direction 우선 선택 heuristic (ring / mesh의 자연스러운
-  대칭성)
- `peer_direction` 같은 이중 메타데이터 불필요 — **주소가 single source of
-  truth**
+- **Runtime**: match by **dst_addr range** instead of sender coord →
+  unambiguous
+- **Install**: prefer the opposite direction as a heuristic (the natural
+  symmetry of ring / mesh)
+- No need for redundant metadata like `peer_direction` — **address is the
+  single source of truth**

-이 설계는 **PhysAddr 전환 (ADR-0030)과 독립적**으로 작동. 현재 synthetic
-주소든 PhysAddr든 direction별 range 유일성만 지켜지면 동일하게 적용 가능.
+This design works **independently of the PhysAddr transition (ADR-0030)**.
+Whether the current addresses are synthetic or PhysAddr, the same approach
+applies as long as the per-direction range uniqueness is preserved.
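
As a rough illustration of why the address alone identifies the direction, the sketch below lays out one rx slot per direction using the `rx_base_pa + direction_idx × bytes_per_direction` formula above and looks up a `dst_addr` by range. The direction order, the slot size, the base address, and the helper names `rx_ranges` / `direction_for` are all assumptions made for this example, not values taken from KernBench.

```python
from typing import Dict, Optional

# Assumed values for illustration only.
DIRECTIONS = ("N", "E", "S", "W")   # slot order is arbitrary; uniqueness is what matters
BYTES_PER_DIRECTION = 0x1000        # assumed size of one direction's rx slot


def rx_ranges(rx_base_pa: int) -> Dict[str, range]:
    """Map each direction to its half-open rx address range."""
    return {
        d: range(
            rx_base_pa + i * BYTES_PER_DIRECTION,
            rx_base_pa + (i + 1) * BYTES_PER_DIRECTION,
        )
        for i, d in enumerate(DIRECTIONS)
    }


def direction_for(dst_addr: int, ranges: Dict[str, range]) -> Optional[str]:
    """Return the direction whose rx range contains dst_addr, if any."""
    for d, r in ranges.items():
        if dst_addr in r:           # O(1) membership test on an int range
            return d
    return None                     # unknown address: caller should log it


ranges = rx_ranges(0x8000_0000)
print(direction_for(0x8000_0000 + 1 * BYTES_PER_DIRECTION + 0x40, ranges))  # E
print(direction_for(0x8000_0000 + 3 * BYTES_PER_DIRECTION, ranges))         # W
print(direction_for(0x7FFF_FFFF, ranges))                                   # None
```

Because the ranges are disjoint, two directions that point at the same peer still produce different lookups, which is the property the runtime matching relies on.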

 ---

@@ -91,17 +98,17 @@ def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
    return None
 ```

-호출부:
+Call site:

 ```python
 for d, peer_rank in nbrs.items():
-    peer_dir = reverse_direction(r, peer_rank, d)  # my_dir 전달
+    peer_dir = reverse_direction(r, peer_rank, d)  # pass my_dir
    if peer_dir is None:
        continue
    ...
 ```

-### D2. Runtime — `_handle_meta_arrival` dst_addr 매칭
+### D2. Runtime — `_handle_meta_arrival` dst_addr matching

 `src/kernbench/components/builtin/pe_ipcq.py`:

@@ -138,9 +145,10 @@ def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
    # Unknown dst_addr — diagnostic log (should not happen under correct install)
 ```

-Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정.
+The sender-coordinate check is **removed**. `dst_addr` already determines
+the direction.

-### D3. Credit — `dst_rx_base_pa` 필드 추가
+### D3. Credit — add `dst_rx_base_pa` field

 `src/kernbench/common/ipcq_types.py`:

@@ -148,25 +156,26 @@ Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정.
@dataclass(frozen=True)
 class IpcqCreditMetadata:
    consumer_seq: int
-    dst_rx_base_pa: int       # NEW: 원 sender의 peer.rx_base_pa와 매칭용
-    # 기존 필드 (diagnostic / log 용도로 유지)
+    dst_rx_base_pa: int       # NEW: matches the original sender's peer.rx_base_pa
+    # Existing fields (kept for diagnostic / logging purposes)
    src_sip: int
    src_cube: int
    src_pe: int
    src_direction: str
 ```

-Credit 생성 시 (`_delayed_credit_send`): 자기 direction의 `my_rx_base_pa`를
-`dst_rx_base_pa`로 실어 보냄 (이게 상대방이 sender 당시 썼던 `peer.rx_base_pa`).
+When the credit is generated (`_delayed_credit_send`): it carries this
+direction's `my_rx_base_pa` as `dst_rx_base_pa` (this is the
+`peer.rx_base_pa` the other side used when it was the sender).

-수신 측 (`_credit_worker`):
+Receiver side (`_credit_worker`):

 ```python
 def _credit_worker(self, env):
    while True:
        credit = yield self._credit_inbox.get()
        for d, qp in self._queue_pairs.items():
-            # peer의 rx_base_pa와 credit의 dst_rx_base_pa가 일치하는 qp 찾기
+            # Find the qp whose peer rx_base_pa matches the credit's dst_rx_base_pa
            if qp["peer"].rx_base_pa == credit.dst_rx_base_pa:
                qp["peer_tail_cache"] = max(qp["peer_tail_cache"],
                                              credit.consumer_seq)
@@ -178,41 +187,45 @@ def _credit_worker(self, env):
                break
 ```

-Sender 좌표 검사 제거. `dst_rx_base_pa` 매칭으로 unambiguous.
+Sender-coordinate check removed. Matching by `dst_rx_base_pa` is
+unambiguous.

-### D4. `IpcqInitEntry`에 `peer_direction` 필드를 **추가하지 않음**
+### D4. Do **not** add a `peer_direction` field to `IpcqInitEntry`

-ADR-0025 rev 1에서 제안했던 `IpcqInitEntry.peer_direction`은 **불필요**.
-이유:
- Meta arrival은 dst_addr로 매칭 (D2)
- Credit은 dst_rx_base_pa로 매칭 (D3)
- qp에 peer_direction 저장 필요 없음
- Install은 rx_base_pa 계산 시 내부적으로만 peer_dir 사용 (`reverse_direction`)
+The `IpcqInitEntry.peer_direction` proposed in ADR-0025 rev 1 is
+**unnecessary**. Reasons:
+- Meta arrivals are matched by dst_addr (D2)
+- Credits are matched by dst_rx_base_pa (D3)
+- No need to store peer_direction on qp
+- Install only uses peer_dir internally when computing rx_base_pa
+  (`reverse_direction`)

-IpcqInitEntry schema 변경 없음. Rev 1 대비 **단순화**.
+No change to the IpcqInitEntry schema. **Simpler** than rev 1.

-### D5. `IpcqDmaToken.src_direction` 유지 (diagnostic only)
+### D5. Keep `IpcqDmaToken.src_direction` (diagnostic only)

-기존 `src_direction` 필드는 제거하지 않는다. 다음 용도로 유지:
- Logging / trace: `KERNBENCH_CCL_TRACE=1` 출력의 `(rank, t, dir, nbytes)`
- Diagnostics: pointer_dump 등에서 direction 표시
- 미래 확장 여지
+The existing `src_direction` field is not removed. It is retained for:
+- Logging / trace: the `(rank, t, dir, nbytes)` output of
+  `KERNBENCH_CCL_TRACE=1`
+- Diagnostics: showing direction in pointer_dump, etc.
+- Room for future extension

-Runtime matching은 `dst_addr`만 사용.
+Runtime matching uses only `dst_addr`.

-### D6. Invariants (ADR-0023 I3 강화)
+### D6. Invariants (strengthens ADR-0023 I3)

-**I3 (엄격)**: 각 방향 pair `(my_direction, peer_direction)`에 대해 my
-rx_base와 peer rx_base는 **별개의 direction slot**을 가리켜야 함. Install은
-이를 보장해야 한다 (reverse_direction opposite-preference).
+**I3 (strict)**: For each direction pair `(my_direction, peer_direction)`,
+my rx_base and peer rx_base must point to **distinct direction slots**.
+Install must guarantee this (reverse_direction opposite-preference).

-**I3.1 (신규)**: 모든 qp에 대해 `qp["my_rx_base_pa"]`와 `qp["peer"].rx_base_pa`는
-서로 disjoint한 주소 range를 점유한다 (다른 direction의 buffer는 절대 겹치지
-않음). 이것이 D2/D3의 주소-기반 매칭의 전제.
+**I3.1 (new)**: For every qp, `qp["my_rx_base_pa"]` and
+`qp["peer"].rx_base_pa` occupy mutually disjoint address ranges (buffers
+of different directions never overlap). This is the prerequisite for the
+address-based matching of D2/D3.

-Install time에 검증 가능:
+Verifiable at install time:
 ```python
-# ccl/install_plan.py: build_install_plans 끝에 assertion
+# ccl/install_plan.py: assertion at the end of build_install_plans
 all_rx_ranges = set()
 for plan in plans:
    for pe_install in plan.pe_installs:
@@ -228,36 +241,42 @@ for plan in plans:

 ## Dependencies

- **ADR-0023** (IPCQ protocol): 본 ADR은 ADR-0023의 runtime 매칭 로직 수정
-  (D2, D3) + install heuristic 개선 (D1). IPCQ 프로토콜의 semantic layer
-  변경은 없음.
- **ADR-0024** (launcher): 2-rank bidirectional ring이 실제 쓰이는 경우가
-  ADR-0024의 ws=SIP_count 모델. 본 ADR이 그 케이스를 작동시킴.
- **ADR-0030** (PhysAddr transition, stub): **독립적** — ADR-0025의
-  주소-기반 매칭은 현재 synthetic 주소든 PhysAddr이든 동일하게 작동.
+- **ADR-0023** (IPCQ protocol): this ADR modifies ADR-0023's runtime
+  matching logic (D2, D3) and improves the install heuristic (D1). No
+  change to the IPCQ protocol's semantic layer.
+- **ADR-0024** (launcher): the case where a 2-rank bidirectional ring is
+  actually used is the ws=SIP_count model of ADR-0024. This ADR makes that
+  case work.
+- **ADR-0030** (PhysAddr transition, stub): **independent** — ADR-0025's
+  address-based matching works identically whether the current addresses
+  are synthetic or PhysAddr.

 ---

 ## Non-goals

- **IPCQ 주소 체계를 PhysAddr로 전환**: ADR-0030 scope. 본 ADR은 주소가 어떻게
-  인코딩되는가와 무관.
- **Multi-hop routing**: ADR-0023 D5의 single-hop DMA write 전제 유지.
- **Unidir ring 특수화**: `ring_1d_unidir`는 direction 하나만 있으므로 본 버그
-  무관.
+- **Migrating IPCQ addressing to PhysAddr**: ADR-0030 scope. This ADR is
+  agnostic to how addresses are encoded.
+- **Multi-hop routing**: the single-hop DMA write assumption of ADR-0023
+  D5 still holds.
+- **Unidir ring specialization**: `ring_1d_unidir` only has a single
+  direction, so the bug does not apply.

 ---

 ## Open questions

- **주소 매칭 성능**: `_handle_meta_arrival`과 `_credit_worker`가 qp를 선형
-  순회 (max 4 direction). 성능 영향 무시 가능 수준. 문제 시 dict lookup으로
-  전환 가능 (`_qp_by_rx_base`).
- **`IpcqDmaToken.src_direction` 필요성 재평가**: diagnostic 용도로만 남긴
-  필드를 계속 유지할지, 또는 logging 외부로 분리할지. 현재는 유지.
- **Install-time invariant 검증 cost**: D6의 I3.1 검증은 O(N_PE × N_direction)^2.
-  대형 topology에서 느려질 수 있음 → interval tree 등 자료구조로 개선 가능.
-  단순 구현 먼저.
+- **Address-matching performance**: `_handle_meta_arrival` and
+  `_credit_worker` iterate qp linearly (max 4 directions). The performance
+  impact is negligible. If it becomes an issue, this can be switched to a
+  dict lookup (`_qp_by_rx_base`).
+- **Re-evaluating the need for `IpcqDmaToken.src_direction`**: whether to
+  keep this field, which is only kept for diagnostics, or to split it out
+  of logging. Currently retained.
+- **Cost of install-time invariant verification**: the I3.1 verification
+  of D6 is O((N_PE × N_direction)^2). It could be slow on large topologies
+  → improvable via data structures such as interval trees. Simple
+  implementation first.
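
One possible way to avoid the quadratic pairwise comparison, short of an interval tree, is to sort the ranges by start address and compare only neighbours. The sketch below is a suggestion under stated assumptions, not the check used in `build_install_plans`: intervals are modelled as half-open `(start, end)` tuples, and `first_overlap` is a hypothetical helper name.

```python
from typing import Iterable, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end), assumed representation


def first_overlap(
    intervals: Iterable[Interval],
) -> Optional[Tuple[Interval, Interval]]:
    """Return one overlapping pair, or None if all intervals are disjoint.

    After sorting by start address, any overlap in the set implies an
    overlap between some pair of adjacent intervals, so one linear pass
    suffices. Total cost is O(n log n) for the sort.
    """
    ordered = sorted(intervals)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur[0] < prev[1]:        # next range starts before the previous one ends
            return prev, cur
    return None


print(first_overlap([(0x0000, 0x1000), (0x1000, 0x2000), (0x2000, 0x3000)]))  # None
print(first_overlap([(0x2000, 0x3000), (0x0000, 0x1000), (0x0800, 0x1800)]))
```

The second call reports the pair `((0, 4096), (2048, 6144))`, i.e. the two ranges that share `0x0800..0x1000`. Exact duplicates are also caught, since a repeated range sorts next to its copy.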

 ---

@@ -265,19 +284,26 @@ for plan in plans:

 ### Positive

- **단순함**: `peer_direction` 이중 메타데이터 제거. 주소가 single source of truth.
- **Unambiguous matching**: 모든 topology (direction 중복 포함)에서 동작.
- **Schema 변경 최소**: `IpcqInitEntry` 불변, `IpcqCreditMetadata`에 1 필드 추가.
- **PhysAddr 전환 (ADR-0030) 독립**: 주소-기반 매칭은 주소 인코딩 방식과 무관.
- **Diagnostic 유지**: `IpcqDmaToken.src_direction`은 로깅 용도로 존치.
+- **Simplicity**: redundant `peer_direction` metadata removed. Address is
+  the single source of truth.
+- **Unambiguous matching**: works on every topology (including duplicate
+  directions).
+- **Minimal schema changes**: `IpcqInitEntry` unchanged, one field added
+  to `IpcqCreditMetadata`.
+- **Independent of PhysAddr transition (ADR-0030)**: address-based matching
+  is agnostic to the address encoding.
+- **Diagnostics retained**: `IpcqDmaToken.src_direction` is kept for
+  logging.

 ### Negative

- Runtime 매칭이 주소 비교로 바뀌어서 디버깅 시 "왜 peer_head_cache[E]가 아닌
-  W가 업데이트됐나" 같은 질문에 address range를 추적해야 함 (기존엔 direction
-  이름으로 충분). 해결: pointer_dump에 "direction ↔ rx_base_pa" 매핑 포함.
+- Runtime matching is now by address comparison, so answering a debugging
+  question such as "why did peer_head_cache[W] update rather than [E]"
+  requires tracing the address range (previously the direction name was
+  enough). Mitigation: include a "direction ↔ rx_base_pa" mapping in
+  pointer_dump.

 ### Neutral

- IPCQ protocol의 semantic layer (sender가 dst_addr 계산, receiver가 수신)는
-  불변.
+- The semantic layer of the IPCQ protocol (sender computes dst_addr,
+  receiver receives) is unchanged.
@@ -1,4 +1,4 @@
-# ADR-0026: DPPolicy = Intra-Device Only — sip/num_sips 필드 제거
+# ADR-0026: DPPolicy = Intra-Device Only — remove sip/num_sips fields

 ## Status

@@ -6,16 +6,17 @@ Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)

 ## Context

-### 목표
+### Goal

-`DPPolicy`를 **한 device(SIP) 내부의 cube × PE 분산**만 표현하는 순수한
-intra-device 추상화로 명확화한다. SIP 간 분산(TP)은 별도 레이어로 분리
-(ADR-0024의 `torch.ahbm.set_device(rank)` 또는 ADR-0027의 Megatron parallel
-layers가 담당).
+Clarify `DPPolicy` as a pure intra-device abstraction that only expresses
+**cube × PE distribution within a single device (SIP)**. Inter-SIP
+distribution (TP) is split into a separate layer (handled by ADR-0024's
+`torch.ahbm.set_device(rank)` or by ADR-0027's Megatron-style parallel
+layers).

 ## Decision

-### D1. `DPPolicy`에서 `sip` + `num_sips` 필드 제거
+### D1. Remove `sip` + `num_sips` fields from `DPPolicy`

 ```python
@dataclass(frozen=True)
@@ -32,15 +33,16 @@ class DPPolicy:
    num_cubes: int | None = None
 ```

-제거되는 필드: `sip`, `num_sips`.
+Removed fields: `sip`, `num_sips`.

-### D2. `ShardSpec` — structural (sip, cube, pe) 좌표, `pe_index` 완전 제거
+### D2. `ShardSpec` — structural (sip, cube, pe) coordinates, `pe_index` fully removed

-현재 `ShardSpec.pe_index`는 **global flat index** (`sip × cubes × pes + cube ×
-pes + pe`). 이는 ADR-0024 D4이 "abstraction leakage"로 지적한 형태.
+The current `ShardSpec.pe_index` is a **global flat index**
+(`sip × cubes × pes + cube × pes + pe`). This is the form ADR-0024 D4
+flagged as "abstraction leakage".

-본 ADR에서 ShardSpec을 **structural 좌표로 재정의**하고, `pe_index`는
-property로도 **남기지 않는다**:
+This ADR **redefines ShardSpec in structural coordinates** and **does
+not even leave `pe_index` as a property**:

 ```python
 # src/kernbench/policy/placement/dp.py (after)
@@ -59,28 +61,32 @@ class ShardSpec:
    nbytes: int
 ```

-**핵심 원칙**:
- ShardSpec의 정체성은 `(sip, cube, pe)` 3튜플.
- **`pe_index` property도 없음** — silent semantics drift 차단.
- Global flat을 기대한 기존 호출자는 `.pe_index` 접근 시 **즉시
-  `AttributeError`** → 반드시 구조적 좌표로 migration.
- Flat integer key가 필요한 국소 문맥 (예: 내부 dict lookup)은 호출자가
-  명시적으로 `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe`를 계산.
+**Core principle**:
+- The identity of ShardSpec is the `(sip, cube, pe)` 3-tuple.
+- **No `pe_index` property either** — blocks silent semantics drift.
+- Existing callers expecting global-flat get an **immediate
+  `AttributeError`** on `.pe_index` access → forced migration to
+  structural coordinates.
+- Local contexts that genuinely need a flat integer key (e.g. internal
+  dict lookup) explicitly compute
+  `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe` at the call
+  site.

-**Property 제거 정당화**: KernBench는 사내 프로젝트로 call site가 한정되어
-있음. Silent drift 위험 (의미만 바뀌고 타입은 같은 int) 대비 explicit breakage
-(AttributeError)가 훨씬 안전.
+**Justification for removing the property**: KernBench is an internal
+project with a limited number of call sites. Explicit breakage
+(AttributeError) is much safer than the risk of silent drift (semantics
+change while the type stays int).
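
A small sketch of what this means for callers, under assumptions: the `ShardSpec` below is a simplified stand-in (only `nbytes` and the three coordinates are confirmed by this ADR; `offset_bytes` is assumed), and `N_CUBES`, `N_PE`, and `flat_key` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Assumed topology constants for illustration only.
N_CUBES = 4
N_PE = 8


@dataclass(frozen=True)
class ShardSpec:
    """Simplified stand-in: structural coordinates plus payload extent."""
    sip: int
    cube: int
    pe: int
    offset_bytes: int   # assumed field
    nbytes: int


def flat_key(spec: ShardSpec) -> int:
    """Explicit, call-site-local flat index for contexts that need an int key."""
    return spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe


spec = ShardSpec(sip=1, cube=2, pe=3, offset_bytes=0, nbytes=256)
print(flat_key(spec))   # 51  (1*32 + 2*8 + 3)

try:
    spec.pe_index       # no such attribute: fails loudly instead of drifting
except AttributeError as exc:
    print("AttributeError:", exc)
```

A stale caller fails on the first `.pe_index` access, while code that really wants a flat integer has to spell out the formula, which keeps the global-versus-local meaning visible at the call site.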

-### D3. `resolve_dp_policy`가 `target_sip`을 받아 structural 좌표 생성
+### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates

-ADR-0024 D4의 계약 구현. Post-hoc shifting 없음.
+Implements the contract of ADR-0024 D4. No post-hoc shifting.

 ```python
 # src/kernbench/policy/placement/dp.py (after)

@dataclass(frozen=True)
 class _LocalPeShard:
-    """Internal — PE resolver의 반환. Cube 내 local PE 식별자 + payload."""
+    """Internal — return value of the PE resolver. Cube-local PE id + payload."""
    local_pe: int                  # cube-local PE index (0..num_pe-1)
    offset_bytes: int
    nbytes: int
@@ -93,7 +99,7 @@ def resolve_dp_policy(
    itemsize: int,
    num_pe: int,
    num_cubes: int = 1,
-    target_sip: int,       # NEW — 어느 SIP에 배치할지 명시
+    target_sip: int,       # NEW — explicitly state which SIP to place on
 ) -> list[ShardSpec]:
    """2-level resolution (cube × PE) on a specified SIP.

@@ -123,28 +129,30 @@ def resolve_dp_policy(
    return all_shards
 ```

-**내부 resolver** (`column_wise`, `row_wise`, `replicate`)는 `_LocalPeShard`
-리스트 반환 — `local_pe` 필드명으로 **"cube-local PE identifier"임이 명시적**.
-과거 `ShardSpec.pe_index`와 이름이 혼동되던 문제 해소.
+**Internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a
+list of `_LocalPeShard` — the `local_pe` field name makes it **explicit
+that this is a "cube-local PE identifier"**. This resolves the previous
+confusion with the name `ShardSpec.pe_index`.

-**이름 규약 정리** (전체 ADR):
- `ShardSpec.pe`: 최종 외부 API — cube-local PE (structural coord)
- `_LocalPeShard.local_pe`: 내부 resolver 단계의 동일 의미
- `pe_index`: **제거**. 외부/내부 어디에도 남기지 않는다 (silent drift 차단의
-  부가 효과: 이름 재등장 없음).
+**Naming convention summary** (whole ADR):
+- `ShardSpec.pe`: the final external API — cube-local PE (structural coord)
+- `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage
+- `pe_index`: **removed**. Not retained anywhere, internal or external
+  (additional benefit of preventing silent drift: the name does not
+  reappear).

-### D4. `_create_tensor` — 구조적 좌표로 직접 placement
+### D4. `_create_tensor` — placement directly in structural coordinates

-ADR-0024 D4 연속선. Post-hoc shifting 제거, 구조적 좌표를 `resolve_dp_policy`
-호출 시점에 직접 지정.
+Continuation of ADR-0024 D4. Post-hoc shifting removed; structural
+coordinates are specified directly at the `resolve_dp_policy` call site.

 ```python
 # context.py _create_tensor (after)
 current_sip = self.ahbm.current_device()
 if current_sip is None:
-    # Single-driver fallback (ADR-0024 D2와 일관).
-    # Launcher 기반 코드가 set_device()를 빼먹으면 조용히 SIP 0에 박히는
-    # 문제가 있음 → debug mode에서 경고.
+    # Single-driver fallback (consistent with ADR-0024 D2).
+    # In launcher-based code, forgetting set_device() silently sticks the
+    # tensor on SIP 0 — emit a warning in debug mode.
    if os.environ.get("KERNBENCH_DEBUG"):
        import warnings
        warnings.warn(
@@ -161,38 +169,39 @@ placement = resolve_dp_policy(
    itemsize=itemsize,
    num_pe=eff_num_pe,
    num_cubes=eff_num_cubes,
-    target_sip=current_sip,          # ← 구조적 좌표 일차 지정
+    target_sip=current_sip,          # ← structural coord specified up front
 )

-# placement의 각 ShardSpec은 이미 (sip=current_sip, cube=local, pe=local) 포함.
-# 과거의 post-hoc shifting 블록은 완전히 제거.
+# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
+# The old post-hoc shifting block is removed entirely.
 ```

-**모든** 텐서가 current device SIP에 배치됨. Multi-SIP 텐서를 만들고 싶으면
-ADR-0027의 TP primitive 사용.
+**Every** tensor is placed on the current device's SIP. If you need a
+multi-SIP tensor, use the TP primitive of ADR-0027.

-**Single-driver fallback의 trade-off**: set_device 없는 호출에서 SIP 0으로
-default는 기존 single-driver 테스트 호환을 위해 유지. `KERNBENCH_DEBUG=1`
-환경에서는 launcher 컨텍스트의 실수로 set_device 누락 시 조용히 잘못된 SIP에
-배치되는 것을 감지할 수 있도록 warning.
+**Trade-off of the single-driver fallback**: When set_device is not
+called, defaulting to SIP 0 is kept for compatibility with existing
+single-driver tests. With `KERNBENCH_DEBUG=1`, a warning is emitted so
+that accidentally omitting set_device in a launcher context — which would
+silently place the tensor on the wrong SIP — can be detected.

-### D5. Downstream — allocator lookup은 구조적 tuple key로
+### D5. Downstream — allocator lookup by structural tuple key

-기존 `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
+Existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):

 ```python
 for spec in placement:
-    alloc = allocators[spec.pe_index]       # ← AttributeError (property 제거됨)
+    alloc = allocators[spec.pe_index]       # ← AttributeError (property removed)
 ```

-`pe_index`가 없어졌으므로 구조적 좌표로 **강제** migration:
+With `pe_index` gone, migration to structural coordinates is **forced**:

 ```python
 for spec in placement:
    alloc = allocators[(spec.sip, spec.cube, spec.pe)]
 ```

-`_ensure_allocators`의 dict population도 tuple key로:
+The dict population in `_ensure_allocators` is also tuple-keyed:

 ```python
 # context.py _ensure_allocators (after)
@@ -204,59 +213,71 @@ for sip_id in sip_range:
            )
 ```

-`_free_tensor`도 동일: 기존 `flat_idx = sip * ... + cube * ... + pe` 계산
-블록 제거, `(shard.sip, shard.cube, shard.pe)` 직접 사용.
+`_free_tensor` is the same: the old
+`flat_idx = sip * ... + cube * ... + pe` computation block is removed,
+and `(shard.sip, shard.cube, shard.pe)` is used directly.

-**Tuple vs dataclass `PEIdentity`**: Tuple이 단순하고 hashable로 바로 써서
-권고. `PEIdentity` 값객체는 명시적 타입 장점은 있지만 boilerplate가 크고 현재
-allocator dict의 유일한 key라 오버엔지니어링. Tuple 유지.
+**Tuple vs dataclass `PEIdentity`**: Recommend the tuple — it is simple
+and hashable out of the box. A `PEIdentity` value object has the upside
+of an explicit type, but the boilerplate is large and it is currently
+the only key of the allocator dict, so it would be over-engineering.
+Keep the tuple.

-### D7. 하위 호환 — 불가 (cleanup ADR)
+### D7. Backward compatibility — none (cleanup ADR)

-이 ADR은 **breaking change**.
+This ADR is a **breaking change**.

-1. `DPPolicy(sip=...)` 또는 `DPPolicy(num_sips=...)` 호출 → `TypeError`
-2. `ShardSpec.pe_index` 접근 → `AttributeError`
+1. `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)` → `TypeError`
+2. `ShardSpec.pe_index` access → `AttributeError`

-모두 **즉시 명시적 breakage**. Deprecation warning / fallback 경로 없음.
-KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에 migration.
+Both are **immediate, explicit breakage**. No deprecation warning /
+fallback path. KernBench is an internal project with a bounded set of
+call sites, so migration happens in one pass.

-**Silent drift 차단**이 property 완전 제거의 주된 이점: global flat을 기대한
-코드가 SIP-local 결과를 받아 조용히 잘못된 인덱싱을 할 가능성 제거.
+**Blocking silent drift** is the main upside of fully removing the
+property: code that expected a global flat could otherwise silently
+receive a SIP-local result and index incorrectly — that possibility is
+eliminated.

 ## Dependencies

- **ADR-0024** (launcher): `set_device(rank)` 및 current-device scoping이
-  SIP 배치 메커니즘 제공. 본 ADR은 그 위에 서서 DPPolicy를 순수 intra-device로
-  좁힘.
- **ADR-0027** (Megatron TP): 다중 SIP에 걸친 텐서가 필요한 경우의 대안 경로.
-  이 ADR 적용 후 multi-SIP use case는 ADR-0027로 이관.
+- **ADR-0024** (launcher): `set_device(rank)` and current-device scoping
+  provide the SIP placement mechanism. This ADR sits on top and narrows
+  DPPolicy to pure intra-device.
+- **ADR-0027** (Megatron TP): the alternative path when a tensor spans
+  multiple SIPs. After this ADR is applied, multi-SIP use cases move to
+  ADR-0027.

 ---

 ## Non-goals

- **`DPPolicy.cube` / `pe` 재설계**: 기존 replicate/column_wise/row_wise 의미
-  유지.
- **Tiling 정책 통합**: `tiled_column_major` / `tiled_row_major`는 그대로.
- **Multi-device 텐서 추상화 신규**: DTensor-like는 ADR-0028.
+- **Redesign of `DPPolicy.cube` / `pe`**: existing
+  replicate/column_wise/row_wise semantics are kept.
+- **Tiling policy consolidation**: `tiled_column_major` /
+  `tiled_row_major` stay as they are.
+- **New multi-device tensor abstraction**: a DTensor-like abstraction belongs to ADR-0028.

 ---

 ## Open questions

- **`_create_tensor`의 current_sip 기본값**: set_device 없는 호출에서 rank=0
-  (SIP 0)로 fallback할지, 아니면 error 낼지. 권고는 fallback (기존 single-driver
-  테스트와의 호환).
- **`test_sip_parallel.py` 재작성 범위**: 기존 단위 테스트의 의도를 유지하며
-  launcher 기반으로 옮기려면 추가 fixture 필요. 별도 작업으로 scope.
- **`DPPolicy`의 `num_sips=None` 의미**: 필드가 없어지면 `num_sips` 개념 자체가
-  사라짐. Multi-SIP을 표현하고 싶으면 ADR-0027의 TP primitive를 쓰라는 것이
-  명시적 답.
+- **Default value of current_sip in `_create_tensor`**: for calls without
+  set_device, whether to fall back to rank=0 (SIP 0) or to raise an
+  error. The recommendation is fallback (compatibility with existing
+  single-driver tests).
+- **Scope of `test_sip_parallel.py` rewrite**: porting the existing unit
+  tests to a launcher-based setup while preserving their intent requires
+  additional fixtures. Scoped as separate work.
+- **Meaning of `num_sips=None` on `DPPolicy`**: once the field is gone,
+  the concept of `num_sips` disappears entirely. The explicit answer for
+  expressing multi-SIP is to use the TP primitive of ADR-0027.

-**Resolved (이전 rev에서 open이었던 것들)**:
- ~~`ShardSpec.pe_index` property 존치 여부~~ → **완전 제거** (D2)
- ~~`_ensure_allocators` dict key 형식~~ → **tuple `(sip, cube, pe)`** (D5)
+**Resolved (items that were open in earlier revs)**:
+- ~~Whether to keep the `ShardSpec.pe_index` property~~ → **fully
+  removed** (D2)
+- ~~Form of `_ensure_allocators` dict key~~ → **tuple `(sip, cube, pe)`**
+  (D5)

 ---

@@ -264,25 +285,31 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에

 ### Positive

- **개념 분리 명확**: DPPolicy = intra-device, TP = inter-device.
- **API 단순화**: DPPolicy 생성자 필드 ~33% 축소.
- **Structural 좌표 일관성**: ShardSpec이 `(sip, cube, pe)` 튜플로 표현 →
-  abstraction leakage 해소 (ADR-0024 D4 계약 충족).
- **`pe_index` 의미 명확**: SIP-local이 단일 해석. Global flat이 필요하면 명시.
- **Launcher 모델 일관성**: ADR-0024의 "1 worker per SIP" 모델이 유일한 SIP
-  경계 제어 메커니즘.
+- **Clean conceptual separation**: DPPolicy = intra-device, TP =
+  inter-device.
+- **API simplification**: about a 33% reduction in DPPolicy constructor
+  fields.
+- **Structural-coordinate consistency**: ShardSpec is expressed as a
+  `(sip, cube, pe)` tuple → abstraction leakage resolved (the ADR-0024
+  D4 contract is satisfied).
+- **Clear meaning of `pe_index`**: the single interpretation is
+  SIP-local. If global-flat is needed, it must be made explicit.
+- **Launcher-model consistency**: ADR-0024's "1 worker per SIP" model is
+  the sole SIP-boundary control mechanism.

 ### Negative

 - **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`,
-  `spec.pe_index` → `AttributeError`. 모든 호출자 한 번에 수정 필요.
- **ShardSpec schema 변경**: `pe_index` 단일 필드 → `sip`/`cube`/`pe` 세 필드.
-  Downstream (`deploy_tensor`, `_free_tensor`, `_ensure_allocators`,
-  `allocators` dict key 등) 연쇄 수정.
- **Silent drift 없음**: property 완전 제거로 runtime에서 즉시 실패 →
-  migration leakage 원천 차단. (Negative가 아니라 explicit tradeoff)
- `test_sip_parallel.py` 재작성 비용.
+  `spec.pe_index` → `AttributeError`. All callers need to be fixed at
+  once.
+- **ShardSpec schema change**: a single `pe_index` field becomes three
+  fields `sip`/`cube`/`pe`. Cascading edits downstream (`deploy_tensor`,
+  `_free_tensor`, `_ensure_allocators`, `allocators` dict key, etc.).
+- **No silent drift**: with the property fully removed, runtime failure
+  is immediate → migration leakage is blocked at the source. (Not a
+  negative but an explicit tradeoff.)
+- The cost of rewriting `test_sip_parallel.py`.

 ### Neutral

- 기존 `cube` / `pe` 필드 의미 불변.
+- The meaning of the existing `cube` / `pe` fields is unchanged.
@@ -92,6 +92,18 @@ def test_crlf_normalization(tmp_path: Path) -> None:
    assert v.verify(tmp_path) == []


+def test_em_dash_title_separator_recognized(tmp_path: Path) -> None:
+    """ADR-0033 uses ' — ' instead of ': ' between ADR-NNNN and the title."""
+    en = tmp_path / "docs/adr/ADR-0033-foo-bar.md"
+    ko = tmp_path / "docs/adr-ko/ADR-0033-foo-bar.md"
+    en.parent.mkdir(parents=True, exist_ok=True)
+    ko.parent.mkdir(parents=True, exist_ok=True)
+    body = "## Status\n\nAccepted\n\n## Context\n\nbody\n"
+    en.write_text("# ADR-0033 — Latency Model\n\n" + body, encoding="utf-8")
+    ko.write_text("# ADR-0033 — Latency Model\n\n" + body, encoding="utf-8")
+    assert v.verify(tmp_path) == []
+
+
 def test_underscore_in_slug_recognized(tmp_path: Path) -> None:
    """ADR-0013 uses an underscore in its slug; the regex must accept it."""
    _make_adr(tmp_path / "docs/adr/ADR-0013-ver-verification_strategy.md", "0013")
@@ -24,7 +24,7 @@ import sys
 from pathlib import Path

 ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-[a-z0-9_-]+\.md$")
-TITLE_RE = re.compile(r"^# ADR-(\d{4}):")
+TITLE_RE = re.compile(r"^# ADR-(\d{4})\b")


 def _normalize(text: str) -> str: