ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/

Establish English as the canonical ADR language with Korean translations held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror). Promotion from adr-proposed/ to adr/ now writes English to adr/ and the Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.

- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English, 2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix dropped). ADR-0023 EN regenerated against KO source which had newer HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline section covering bidirectional sync, conflict resolution (EN wins), and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair completeness, filename mirroring, ADR-ID match, Status byte-equality. Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF normalization, em-dash title separator, underscore-slug edge case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:38:44 -07:00
parent 687c98086d
commit a796c1d2f7
42 changed files with 10515 additions and 3422 deletions
@@ -202,8 +202,8 @@ General fallbacks. Apply to anything not explicitly covered above.
 >
 > Contains **foundations** (Authority & Scope → Terminology → Terminology
 > Discipline → Mental Model → Common Failure Modes) followed by **rules**
-> (Non-Trivial, Verification Plan, CLI, Derived Artifacts, runtime API /
+> (Non-Trivial, Verification Plan, CLI, Derived Artifacts, ADR Translation
-> sim_engine Boundaries).
+> Discipline, runtime API / sim_engine Boundaries).
 ## Authority & Scope
@@ -218,14 +218,22 @@ General fallbacks. Apply to anything not explicitly covered above.
 ### ADR Lifecycle
-ADRs live in one of three folders based on lifecycle state:
+ADRs live in one of four folders. Three hold the primary ADR text,
 organized by lifecycle state; the fourth holds Korean translations:
- `docs/adr/` — **Accepted** (current implementation reflected).
+- `docs/adr/` — **Accepted** (canonical English; current
  implementation reflected).
 - `docs/adr-proposed/` — **Proposed**, **Stub**, or **Draft** (design
  only / future-work exploration / retroactive documentation pending
-  verification).
+  verification). **Authoring language is free** (any language); the
  promotion step (below) translates to English.
 - `docs/adr-history/` — **Superseded** or **Merged** (no longer the
-  authoritative source; kept as historical record).
+  authoritative source; kept as historical record). Frozen — language
  policy not applied retroactively.
 - `docs/adr-ko/` — Korean translations of accepted ADRs (derived
  artifact, 1:1 mirror of `docs/adr/`). English in `docs/adr/` is the
  canonical source of truth; when KO and EN disagree, EN wins. See
  *ADR Translation Discipline* below.
 Status field values:
@@ -240,17 +248,23 @@ Status field values:
 Transitions:
 - **Proposed/Stub → Accepted**: when the ADR's decisions are
-  reflected in production code AND covered by tests. `git mv` from
+  reflected in production code AND covered by tests. If the proposed
-  `docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`.
+  ADR is in Korean, translate to English and place the English in
  `docs/adr/`; move the Korean original to `docs/adr-ko/`. If the
  proposed ADR is in English, `git mv` it to `docs/adr/` and create
  the Korean translation in `docs/adr-ko/`. Change Status to
  `Accepted` in both files.
 - **Draft → Accepted**: when the ADR's text has been verified to
-  accurately describe the existing implementation. `git mv` from
+  accurately describe the existing implementation. Same English /
-  `docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`.
+  Korean placement rule as above.
 - **Accepted → Superseded**: set Status to `Superseded by ADR-MMMM`
-  and `git mv` to `docs/adr-history/`. The superseding ADR includes
+  in both the EN and KO files and `git mv` both to their respective
-  a "Supersedes ADR-NNNN" reference (or, for partial supersession of
+  history locations (`docs/adr-history/` for English; the KO copy
-  clauses, documents this in its own body).
+  stays in `docs/adr-ko/` only if it was already mirrored — see *ADR
  Translation Discipline* for the frozen-history exception).
 - **Accepted → Merged**: set Status to `Merged into ADR-MMMM`
-  (single-line stub) and `git mv` to `docs/adr-history/`.
+  (single-line stub) in both files and apply the same `git mv` rule
  as the Superseded transition.
 Cross-references between ADRs use the `ADR-NNNN` ID and remain valid
 regardless of folder location. ADR numbers are **immutable**; never
@@ -361,11 +375,48 @@ Concrete forms that Part 1's *Verification Plan* MUST take in this repo:
 ## Derived Artifacts (Clarification)
 - Generated diagrams under `docs/diagrams/` are **derived artifacts**, not production code.
- Creating or updating files in `docs/diagrams/`:
+- Korean ADR translations under `docs/adr-ko/` are **derived artifacts**
  (mirror of the canonical English in `docs/adr/`); see *ADR Translation
  Discipline*.
 - Creating or updating files in `docs/diagrams/` or `docs/adr-ko/`:
  - does NOT count as a production code change,
  - does NOT require Phase 2 approval,
  - MUST be consistent with SPEC.md and ADRs.
 ## ADR Translation Discipline
 English in `docs/adr/` is the canonical source of truth. Korean in
 `docs/adr-ko/` mirrors it 1:1 as a derived artifact.
 **Bidirectional sync rule (MUST)**: any edit to a file in `docs/adr/`
 must be accompanied, in the same change, by a mirroring edit to
 `docs/adr-ko/<same-filename>.md`. The reverse also applies: edits to
 `docs/adr-ko/` must mirror back into `docs/adr/`. The two files must
 always describe the same architectural content.
 Mechanics:
 - When editing an EN ADR, propagate the change to its KO counterpart
  by translating just the diff (preserve unaffected KO prose); do not
  regenerate the whole KO file from scratch.
 - When editing a KO ADR, propagate to EN the same way.
 - Filename mirror: `docs/adr/X.md` ↔ `docs/adr-ko/X.md` (no language
  suffix in either path).
 - The `## Status` block content must remain byte-identical between
  the EN and KO files (e.g., both say `Accepted`).
 - Conflict policy: if the two diverge despite the rule, treat EN as
  authoritative and overwrite KO. Surface the divergence to the user
  before reconciling.
 - `docs/adr-proposed/` is exempt — single language only, no mirror
  required until promotion.
 - `docs/adr-history/` is frozen — pre-existing mixed-language state
  there is not migrated.
 Verification: `python tools/verify_adr_lang_pairs.py` checks that
 every EN ADR has a matching KO file, the title's ADR-NNNN matches the
 filename, and Status blocks are byte-equal. Run it on demand or wire
 it into CI. Exit code: 0 = OK, 1 = mismatch.
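
The pair checks described above can be sketched roughly as follows. This is an illustrative approximation, not the contents of `tools/verify_adr_lang_pairs.py`: the function names `status_block` and `check_pairs` are hypothetical, the Status extraction is a simple heading-based heuristic, and the ADR-ID check is omitted. Line endings are normalized before comparison, which is an assumption based on the CRLF/LF test case mentioned in the commit message.

```python
import re
from pathlib import Path


def status_block(text: str) -> str:
    """Return the body of the '## Status' section, with CRLF normalized to LF."""
    text = text.replace("\r\n", "\n")
    # Capture everything after the heading up to the next '## ' heading or EOF.
    match = re.search(r"^## Status\n(.*?)(?=^## |\Z)", text, re.S | re.M)
    return match.group(1).strip() if match else ""


def check_pairs(en_dir: Path, ko_dir: Path) -> list:
    """Collect human-readable mismatches between the EN and KO trees."""
    problems = []
    for en_file in sorted(en_dir.glob("*.md")):
        ko_file = ko_dir / en_file.name  # filename mirror: same name, no suffix
        if not ko_file.exists():
            problems.append(f"missing KO mirror: {en_file.name}")
            continue
        en_status = status_block(en_file.read_text(encoding="utf-8"))
        ko_status = status_block(ko_file.read_text(encoding="utf-8"))
        if en_status != ko_status:
            problems.append(f"Status differs: {en_file.name}")
    return problems
```

A caller could map an empty result to exit code 0 and a non-empty one to exit code 1, matching the convention stated above.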
 ## runtime API / sim_engine Boundaries
 - runtime API MUST NOT hardcode topology/routing or internal hop sequences.
@@ -0,0 +1,362 @@
 # ADR-0001: 51-bit Physical Address Layout & Decoding Contract
 ## Status
 Accepted (Revision 2 — 2026-04-27: concrete bit layout, rack_id removal,
 Tray->SIP / SIP->DIE renaming, PE/MCPU/IOCPU sub-unit tables.
 Supersedes ADR-0031.)
 ## Date
 2026-04-27 (original: 2026-02-27)
 ## Context
 KernBench requires a stable, parsable physical address scheme that:
 - can be decoded into routing domains (SIP / die / HBM / PE-resource / IOCPU)
 - remains topology-agnostic (no hardcoded counts)
 - supports swappable policy and DI-first components
 - covers multiple SIPs, AHBM dies, and IO chiplet dies in a unified space
 ### History
 - Original ADR-0001 defined a 51-bit layout with `rack_id(4) + sip_id(4) +
  sip_seg(5) + local_offset(38)`. `rack_id` was never used in practice.
 - ADR-0031 (stub) requested PE-resource range partition but was never
  implemented.
 Revision 2 removes `rack_id`, renames `sip_seg -> die_id`, and provides
 concrete sub-unit tables for PE, MCPU, CUBE_SRAM, and IOCPU resources.
 ADR-0031 is superseded.
 ## Decision
 We define a **PhysAddr value object** and an **address decoding contract**
 that converts an integer address into routing domains.
 ### D1. PhysAddr is an immutable value object
 - PhysAddr is immutable and comparable as a pure value.
 - Any allocator returns a **fully specified PhysAddr** (not partial metadata).
 - No global state may be required to interpret a PhysAddr.
 ### D2. 51-bit Physical Address Layout
 A 51-bit physical address is adopted.
 #### 2.1 Top-Level Address Map
 ```text
 [50:47] sip_id        (4)     -- 16 SIPs
 [46:42] die_id        (5)     -- 32 dies per SIP
 [41: 0] local_offset  (42)    -- 4 TB per die
 ```
 ```text
 50      47 46      42 41                      0
 +---------+----------+-------------------------+
 | sip_id  | die_id   |      local_offset       |
 +---------+----------+-------------------------+
 ```
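
As a minimal sketch of the top-level map above, the three fields can be split and rejoined with plain shifts and masks. The helper names `split_top_level` and `join_top_level` are hypothetical and are not the factory APIs named later in this ADR; raising `ValueError` on out-of-range input is also an assumption.

```python
SIP_SHIFT, SIP_BITS = 47, 4      # [50:47]
DIE_SHIFT, DIE_BITS = 42, 5      # [46:42]
LOCAL_BITS = 42                  # [41:0]


def split_top_level(addr: int) -> tuple:
    """Split a 51-bit address into (sip_id, die_id, local_offset)."""
    if not 0 <= addr < (1 << 51):
        raise ValueError("address does not fit in 51 bits")
    sip_id = (addr >> SIP_SHIFT) & ((1 << SIP_BITS) - 1)
    die_id = (addr >> DIE_SHIFT) & ((1 << DIE_BITS) - 1)
    local_offset = addr & ((1 << LOCAL_BITS) - 1)
    return sip_id, die_id, local_offset


def join_top_level(sip_id: int, die_id: int, local_offset: int) -> int:
    """Inverse of split_top_level; rejects fields that exceed their width."""
    if not (0 <= sip_id < (1 << SIP_BITS)
            and 0 <= die_id < (1 << DIE_BITS)
            and 0 <= local_offset < (1 << LOCAL_BITS)):
        raise ValueError("field out of range")
    return (sip_id << SIP_SHIFT) | (die_id << DIE_SHIFT) | local_offset
```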
 #### 2.2 die_id Allocation
 | die_id | Meaning |
 |--------|---------|
 | 0..15  | AHBM dies |
 | 16..20 | IOCHIPLET dies |
 | 21..31 | Reserved |
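
The allocation table can be expressed as a small dispatch function. This is a sketch only; `DieKind` and `classify_die` are hypothetical names introduced for illustration.

```python
from enum import Enum


class DieKind(Enum):
    AHBM = "ahbm"
    IOCHIPLET = "iochiplet"
    RESERVED = "reserved"


def classify_die(die_id: int) -> DieKind:
    """Map a 5-bit die_id to its kind per the allocation table."""
    if not 0 <= die_id <= 31:
        raise ValueError("die_id must fit in 5 bits")
    if die_id <= 15:
        return DieKind.AHBM
    if die_id <= 20:
        return DieKind.IOCHIPLET
    return DieKind.RESERVED
```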
 #### 2.3 AHBM Die Layout
 Only the lower 256 GB of the 4 TB die-local window is assigned; bits [41:38] must be zero.
 ```text
 [41:38] MBZ            (4)
 [37]    addr_space      (1)    -- 0 = local resource, 1 = HBM memory
 [36: 0] sub-address    (37)
 ```
 | addr_space | Meaning |
 |------------|---------|
 | 0 | Local resource |
 | 1 | HBM memory |
 ##### 2.3.1 HBM Window (addr_space = 1)
 ```text
 [36:0] hbm_offset     (37)    -- 128 GB decode window
 ```
 The architectural decode window is fixed at 128 GB. Implemented capacity
 may be smaller depending on SKU/topology (see D4).
 ##### 2.3.2 Resource Window (addr_space = 0)
 ```text
 [36:34] resource_kind  (3)
 [33: 0] kind_local    (34)    -- 16 GB per kind
 ```
 | resource_kind | Meaning |
 |---------------|---------|
 | 000 | PE_LOCAL |
 | 001 | MCPU_LOCAL |
 | 010 | CUBE_SRAM |
 | 011..111 | Reserved |
 Each kind gets a 16 GB decode region.
 ##### 2.3.3 PE_LOCAL (resource_kind = 000)
 ```text
 [33]    MBZ            (1)
 [32:29] pe_id          (4)     -- 0..15
 [28:25] pe_sub_unit    (4)
 [24: 0] sub_offset    (25)    -- 32 MB per slot
 ```
 16 PEs x 16 sub-unit slots x 32 MB = 8 GB active decode.
 | pe_sub_unit | Name | Budget |
 |-------------|------|--------|
 | 0 | PE_CPU_DTCM | 8 KB |
 | 1 | MATH_ENGINE_DTCM | 8 KB |
 | 2 | IPCQ | 256 KB |
 | 3 | PE_CPU_SFR | 16 KB |
 | 4 | MATH_ENGINE_SFR | 16 KB |
 | 5 | DMA_ENGINE_SFR | 192 KB |
 | 6 | PE_TCM | 2 MB |
 | 7..15 | Reserved | -- |
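
A possible decoder for the PE_LOCAL layout is sketched below. The names `PeLocal` and `decode_pe_local` are hypothetical, and treating MBZ or reserved-slot violations as `ValueError` is an assumption, since this ADR leaves the fault behavior to the implementation.

```python
from typing import NamedTuple

# Sub-unit names copied from the table above; slots 7..15 are reserved.
PE_SUB_UNITS = {
    0: "PE_CPU_DTCM", 1: "MATH_ENGINE_DTCM", 2: "IPCQ", 3: "PE_CPU_SFR",
    4: "MATH_ENGINE_SFR", 5: "DMA_ENGINE_SFR", 6: "PE_TCM",
}


class PeLocal(NamedTuple):
    pe_id: int
    sub_unit: str
    sub_offset: int


def decode_pe_local(kind_local: int) -> PeLocal:
    """Decode the 34-bit kind_local field of a PE_LOCAL address."""
    if not 0 <= kind_local < (1 << 34):
        raise ValueError("kind_local must fit in 34 bits")
    if kind_local >> 33:
        raise ValueError("MBZ bit [33] is set")
    pe_id = (kind_local >> 29) & 0xF
    sub_unit = (kind_local >> 25) & 0xF
    if sub_unit not in PE_SUB_UNITS:
        raise ValueError(f"reserved pe_sub_unit {sub_unit}")
    sub_offset = kind_local & ((1 << 25) - 1)
    return PeLocal(pe_id, PE_SUB_UNITS[sub_unit], sub_offset)
```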
 ##### 2.3.4 MCPU_LOCAL (resource_kind = 001)
 ```text
 [33:30] MBZ            (4)
 [29:25] mcpu_sub_unit  (5)
 [24: 0] sub_offset    (25)    -- 32 MB per slot
 ```
 32 sub-unit slots x 32 MB = 1 GB active decode.
 | mcpu_sub_unit | Name | Budget |
 |---------------|------|--------|
 | 0 | MCPU_ITCM | 512 KB |
 | 1 | MCPU_DTCM | 512 KB |
 | 2 | IPCQ | 256 KB |
 | 3 | MCPU_SFR | 8 KB |
 | 4 | MCPU_DMA_SFR | 16 KB |
 | 5 | MCPU_SRAM | 10 MB |
 | 6..31 | Reserved | -- |
 ##### 2.3.5 CUBE_SRAM (resource_kind = 010)
 ```text
 [33:25] MBZ            (9)
 [24: 0] sram_offset   (25)    -- flat 32 MB
 ```
 #### 2.4 IOCHIPLET Die Layout
 Only the lower 1 TB of the 4 TB die-local window is assigned; bits [41:40] must be zero.
 ```text
 [41:40] MBZ            (2)
 [39: 0] chiplet_offset (40)   -- 1 TB
 ```
 Region split by address range:
 | Range | Meaning | Decode condition |
 |-------|---------|------------------|
 | [0, 2 GB) | IOCPU resource | chiplet_offset < 0x8000_0000 |
 | [2 GB, 1 TB) | UAL | chiplet_offset >= 0x8000_0000 |
 ##### 2.4.1 IOCPU Region
 ```text
 [30:27] iocpu_sub_unit (4)
 [26: 0] sub_offset    (27)    -- 128 MB per slot
 ```
 16 x 128 MB slots. 2 GB active decode.
 | iocpu_sub_unit | Name | Budget |
 |----------------|------|--------|
 | 0 | IOCPU_ITCM | 512 KB |
 | 1 | IOCPU_DTCM | 512 KB |
 | 2 | IPCQ | 2 MB |
 | 3 | IOCPU_SFR | 8 KB |
 | 4 | IO_DMA_SFR | 16 KB |
 | 5 | IO_SRAM | 64 MB |
 | 6..15 | Reserved | -- |
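
The range split and the IOCPU slot layout can be sketched together. `decode_chiplet_offset` is a hypothetical helper; because the UAL sub-layout is still TBD, the sketch simply passes the offset through for that region.

```python
from typing import Optional, Tuple

IOCPU_REGION_LIMIT = 0x8000_0000   # 2 GB: boundary between IOCPU and UAL
CHIPLET_WINDOW = 1 << 40           # 1 TB assigned window


def decode_chiplet_offset(chiplet_offset: int) -> Tuple[str, Optional[int], int]:
    """Return (region, iocpu_sub_unit, offset) for an IOCHIPLET die-local offset."""
    if not 0 <= chiplet_offset < CHIPLET_WINDOW:
        raise ValueError("offset outside the 1 TB window (MBZ bits [41:40])")
    if chiplet_offset >= IOCPU_REGION_LIMIT:
        # UAL sub-layout is TBD in this ADR, so no further fields are extracted.
        return ("UAL", None, chiplet_offset)
    sub_unit = (chiplet_offset >> 27) & 0xF          # [30:27]
    sub_offset = chiplet_offset & ((1 << 27) - 1)    # [26:0]
    return ("IOCPU", sub_unit, sub_offset)
```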
 ##### 2.4.2 UAL Region
 Sub-layout TBD (separate ADR).
 #### 2.5 Addressing Rules
 1. MBZ bits must be zero. An address with non-zero MBZ bits is
   **architecturally invalid**. Implementation may raise a decode fault
   or return an error -- behavior is not prescribed by this ADR.
 2. Fixed slot sizes are chosen for simple hardware decode; actual
   implemented capacity may be smaller than the slot.
 3. Access beyond a sub-unit's implemented budget within a slot is
   **architecturally invalid** (same policy as MBZ).
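
Rule 1 can be checked mechanically. The following sketch lists MBZ violations for an AHBM die-local offset using the bit positions from D2.3; the function name and the choice to return messages rather than raise are assumptions, since the ADR does not prescribe the fault behavior.

```python
def ahbm_mbz_violations(local_offset: int) -> list:
    """List MBZ violations for an AHBM die-local offset (empty list = none)."""
    problems = []
    if (local_offset >> 38) & 0xF:
        problems.append("AHBM bits [41:38] must be zero")
    addr_space = (local_offset >> 37) & 1
    if addr_space == 0:  # local resource window
        kind = (local_offset >> 34) & 0x7
        kind_local = local_offset & ((1 << 34) - 1)
        if kind == 0 and kind_local >> 33:
            problems.append("PE_LOCAL bit [33] must be zero")
        elif kind == 1 and kind_local >> 30:
            problems.append("MCPU_LOCAL bits [33:30] must be zero")
        elif kind == 2 and kind_local >> 25:
            problems.append("CUBE_SRAM bits [33:25] must be zero")
    return problems
```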
 ### D3. Bitfield decoding is deterministic
 Given an integer address, field extraction (`sip_id`, `die_id`, `kind`,
 `sub_unit`, `offset`) is purely positional. No runtime state is required.
 Decoding deterministically maps an integer address to destination domains:
 `sip_id`, `die_id`, target kind (HBM / PE_LOCAL / MCPU_LOCAL / CUBE_SRAM /
 IOCPU / UAL).
 ### D4. Capacity validation may depend on topology config
 Whether a decoded address falls within **implemented capacity** (e.g.,
 HBM 96 GB on a specific SKU) is checked against topology parameters
 provided via DI/config. Decode itself (D3) never consults topology --
 only validation does. These parameters must live in the topology/config
 layer, not in node implementations.
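
To illustrate the split between decode and validation, the sketch below keeps the 128 GB decode window fixed and takes implemented capacity from an injected config object. `TopologyCapacity` and its field name are hypothetical stand-ins for whatever the topology/config layer actually provides.

```python
from dataclasses import dataclass

HBM_DECODE_WINDOW = 1 << 37  # 128 GB architectural window (fixed by D2.3.1)


@dataclass(frozen=True)
class TopologyCapacity:
    """Hypothetical config value injected from the topology layer."""
    hbm_bytes_per_die: int


def hbm_offset_is_implemented(hbm_offset: int, capacity: TopologyCapacity) -> bool:
    """Validation step: True if the offset falls inside implemented HBM capacity."""
    if not 0 <= hbm_offset < HBM_DECODE_WINDOW:
        raise ValueError("offset outside the 128 GB decode window")
    return hbm_offset < capacity.hbm_bytes_per_die
```

With a 96 GB SKU, an offset at 100 GB still decodes cleanly but fails validation, which is the distinction D4 draws.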
 ### D5. Routing consumes decoded domains, not raw bits
 Routing policy uses decoded domains:
 - `src` location (sip / die / pe or node_id)
 - `dst` domains derived from PhysAddr decoding
 - `size_bytes` for size-aware link latency
 Routing must not inspect raw bit-fields directly except inside the
 decoding module.
 ## Alternatives Considered
 1. **Keep `rack_id` (4 bits)**: Rejected -- never used in practice, and
   removing it frees the 4 bits that let `local_offset` grow from 38 to
   42 bits (needed for the 1 TB IOCHIPLET window).
 2. **Uniform 256 GB per die**: Rejected -- IOCHIPLET UAL requires ~1 TB.
   Freed rack_id bits enable 42-bit local_offset.
 3. **Variable-width die windows (AHBM 256 GB, IOCHIPLET 1 TB via multi-seg
   spanning)**: Rejected -- complicates D3 (deterministic decoding).
   A uniform 4 TB window with MBZ padding is simpler.
 4. **Use raw integers everywhere, decode ad-hoc in routing**: Rejected --
   leads to duplicated logic, inconsistent routing, and hidden
   assumptions.
 5. **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**:
   Rejected -- violates SPEC R3 and breaks swappability.
 6. **Put decoding inside memory controllers or routers**: Rejected --
   leaks policy into components, violates SPEC R4 / D5.
 ## Consequences
 ### Positive
 - Simple hierarchical decoder: SIP -> die -> kind -> sub-unit.
 - Clean separation of memory (HBM) vs local resource (PE/MCPU/SRAM/IOCPU).
 - Deterministic routing domains enable clear test invariants (SPEC R1, R5).
 - Expandable: 11 reserved die_id slots, reserved resource_kind / sub-unit
  slots, reserved MBZ bits.
 - DI-first: decoder can be swapped without changing components (SPEC R4).
 ### Tradeoffs
 - Sparse address holes due to power-of-2 slot alignment.
 - Large reserved/MBZ regions (intentional for future extension).
 - Requires explicit configuration for topology-derived sizes (D4).
 - Introduces a single "blessed" decoding module that must remain stable
  and well-tested.
 ## Supersedes
 - **ADR-0031 (PhysAddr PE-Resource Extension)**: stub status. The
  PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables in D2.3.3-D2.3.5
  fulfill ADR-0031's stated goals.
 ## Implementation Notes (Non-normative)
 - Recommended module: `src/kernbench/policy/address/phyaddr.py`
 - Tests should cover: encode/decode round-trip per kind, MBZ enforcement,
  die_id dispatch (AHBM / IOCHIPLET / reserved), sub-unit boundary
  values, backward compatibility of factory APIs.
 - Factory methods: `hbm_addr`, `pe_hbm_addr`, `pe_tcm_addr`,
  `cube_sram_addr` retain signatures (minus `rack_id`); `cube_id`
  parameter renamed to `die_id`.
 - New factories: `pe_resource_addr`, `mcpu_resource_addr`,
  `iocpu_resource_addr`, `ual_addr`.
 ## Appendix A. Address Examples
 ### A.1 AHBM HBM access
 sip=2, die=5, HBM offset=0x1000
 ```text
 sip_id     = 2       -> [50:47] = 0b0010
 die_id     = 5       -> [46:42] = 0b00101
 addr_space = 1       -> [37]    = 1 (HBM)
 hbm_offset = 0x1000  -> [36:0]
 51-bit addr = (2 << 47) | (5 << 42) | (1 << 37) | 0x1000
 ```
 ### A.2 AHBM PE_LOCAL -- PE3 PE_TCM, offset=0x400
 ```text
 sip_id        = 0  -> [50:47] = 0
 die_id        = 0  -> [46:42] = 0
 addr_space    = 0  -> [37]    = 0
 resource_kind = 0  -> [36:34] = 000 (PE_LOCAL)
 pe_id         = 3  -> [32:29] = 0011
 pe_sub_unit   = 6  -> [28:25] = 0110 (PE_TCM)
 sub_offset    = 0x400 -> [24:0]
 local_offset = (0 << 34) | (3 << 29) | (6 << 25) | 0x400
 ```
 ### A.3 AHBM MCPU_LOCAL -- MCPU_SRAM, offset=0x0
 ```text
 sip_id        = 1  -> [50:47] = 0001
 die_id        = 3  -> [46:42] = 00011
 addr_space    = 0  -> [37]    = 0
 resource_kind = 1  -> [36:34] = 001 (MCPU_LOCAL)
 mcpu_sub_unit = 5  -> [29:25] = 00101 (MCPU_SRAM)
 sub_offset    = 0  -> [24:0]  = 0
 local_offset = (1 << 34) | (5 << 25)
 ```
 ### A.4 IOCHIPLET -- IOCPU IPCQ, offset=0x20000
 ```text
 sip_id         = 1   -> [50:47] = 0001
 die_id         = 17  -> [46:42] = 10001 (IOCHIPLET[1])
 iocpu_sub_unit = 2   -> [30:27] = 0010 (IPCQ)
 sub_offset     = 0x20000 -> [26:0]
 chiplet_offset = (2 << 27) | 0x20000
                 (< 0x8000_0000 -> IOCPU region)
 ```
 ### A.5 IOCHIPLET -- UAL region, offset=4 GB
 ```text
 sip_id         = 0   -> [50:47] = 0
 die_id         = 16  -> [46:42] = 10000 (IOCHIPLET[0])
 chiplet_offset = 0x1_0000_0000 (4 GB >= 2 GB -> UAL region)
 ```
 ## Links
 - SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first),
  R5 (multi-domain comm)
 - ADR-0031: Superseded
@@ -0,0 +1,102 @@
 # ADR-0002: Routing Distance, Ordering & Bypass Rules
 ## Status
 Accepted
 ## Date
 2026-02-27
 ## Context
 The KernBench Graph Latency Simulator must compare kernel execution time
 across different architectures and topologies by computing end-to-end
 latency from graph traversal.
 To support meaningful comparison:
 - routing must be deterministic
 - latency must reflect actual interconnect structure
 - local vs remote traffic must be distinguishable
 - “bypass” optimizations must not undermine debuggability or correctness
 The simulator also aims to avoid software-managed metadata and hidden
 shortcuts that obscure control paths.
 ## Decision
 ### D1. Distance is accumulated latency, not hop count
 - Routing “distance” is defined as the **sum of per-node and per-link latency**.
 - Hop count alone must not be used for ordering or path selection.
 - Size-aware serialization latency (bytes / BW) contributes to distance.
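
As a rough illustration of D1, distance along a path can be computed by summing node latency, link propagation latency, and the size-aware serialization term. The function name, the tuple shape for links, and the choice of nanoseconds and GB/s as units are assumptions made for this sketch.

```python
from typing import Iterable, Tuple


def accumulated_latency_ns(
    node_latencies_ns: Iterable[float],
    links: Iterable[Tuple[float, float]],
    size_bytes: int,
) -> float:
    """Sum node latency, link propagation latency, and serialization latency.

    Each link is (propagation_ns, bandwidth_gb_per_s). With GB = 10**9 bytes,
    1 GB/s equals 1 byte/ns, so size_bytes / bandwidth yields nanoseconds.
    """
    total = float(sum(node_latencies_ns))
    for propagation_ns, bandwidth_gb_per_s in links:
        if bandwidth_gb_per_s <= 0:
            raise ValueError("link bandwidth must be positive")
        total += propagation_ns + size_bytes / bandwidth_gb_per_s
    return total
```

A path with fewer hops can still have the larger distance when its links are slow, which is why hop count alone is not used for ordering.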
 ### D2. Routing order is derived from graph traversal
 - The chosen route is the path with minimum accumulated latency
  given the constructed graph and routing policy.
 - Deterministic ordering must be guaranteed for identical inputs
  (topology + policy + request).
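
One way to obtain a minimum-latency route with deterministic ordering is Dijkstra's algorithm with a stable tie-break. The sketch below is not the simulator's router: the graph shape (`{node: [(neighbor, latency), ...]}`) and the function name are assumptions, and ties are broken by comparing the node-name paths lexicographically.

```python
import heapq


def min_latency_path(graph: dict, src: str, dst: str):
    """Return (total_latency, path) for the cheapest route, or None if unreachable.

    Heap entries are (cost, path) tuples, so equal-cost candidates are ordered
    by their node-name lists, which keeps the result deterministic.
    """
    heap = [(0.0, [src])]
    settled = set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, latency in graph.get(node, []):
            if latency < 0:
                raise ValueError("latency must be non-negative")
            if neighbor not in settled:
                heapq.heappush(heap, (cost + latency, path + [neighbor]))
    return None
```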
 ### D3. Bypass is explicit and graph-represented
 - All paths must be explicitly represented in the graph and subject to latency accumulation.
 - Example: PE_DMA connects to the NOC router mesh (ADR-0017 D7). All destinations
  (HBM, shared SRAM, inter-cube UCIe) are reached via explicit mesh hops.
  Local HBM access has minimal hops (switching overhead only); remote access
  traverses additional routers.
 - Implicit or “magic” bypass paths are disallowed.
 ### D4. No zero-latency end-to-end paths
 - Every routed request must incur **end-to-end** latency > 0.
 - Individual fabric segments (e.g., NOC hops) MAY have distance_mm = 0
  when the fabric is distributed and distance is not meaningful at that granularity.
  This is allowed because other components on the same path (e.g., PE_DMA, SRAM,
  UCIe endpoints) contribute non-zero latency, ensuring the end-to-end invariant holds.
 - Fully zero-latency end-to-end paths are disallowed, except for explicit
  test-only stubs clearly marked as such.
 ### D5. Policy vs topology responsibility split
 - Topology builder:
  - defines nodes and links and their latency/BW parameters
 - Routing policy:
  - selects among available graph paths based on decoded domains
 - Routing policy must not assume missing links; missing connectivity
  is a topology construction error.
 ### D6. No software-managed routing metadata
 - Routing decisions must not rely on per-request software-managed metadata
  that tracks distance, hop count, or ordering outside the graph model.
 - All distance/order computation is derived from traversal itself.
 ## Alternatives Considered
 1) **Hop-count based routing**
 - Rejected: ignores heterogeneous latency/BW and misrepresents
  architectural differences.
 2) **Implicit local shortcuts**
 - Rejected: breaks debuggability and violates traversal-based latency.
 3) **Software-managed distance metadata**
 - Rejected: increases control overhead and obscures routing semantics.
 ## Consequences
 ### Positive
 - Clear, debuggable hop-by-hop traces (SPEC R2, R4).
 - Architecture comparisons reflect real interconnect structure.
 - Routing behavior is reproducible and deterministic.
 ### Tradeoffs / Costs
 - Graph construction must be correct and complete.
 - Bypass modeling requires explicit graph representation,
  which slightly increases topology description complexity.
 ## Implementation Notes (Non-normative)
 - Recommended responsibilities:
  - Graph builder: ensure all required paths exist.
  - Router: select next hop based on decoded domains and policy.
 - Tests should assert:
  - non-zero end-to-end latency
  - deterministic routing for identical inputs
  - bypass paths appear explicitly in emitted traces
 ## Links
 - SPEC.md: R1 (routing), R2 (latency), R3 (topology), R5 (multi-domain comm)
 - ADR-0001: PhysAddr layout & decoding contract
@@ -0,0 +1,68 @@
 # ADR-0003: Target System Hierarchy & Modeling Scope
 ## Status
 Accepted
 ## Context
 We need a system-level simulator to evaluate LLM kernel performance on our AI Accelerator platform.
 The platform is organized as a compute tray containing multiple identical SIPs connected via PCIe or UAL
 through switching fabrics, with a host CPU issuing commands/kernels.
 ## Decision
 We model the system hierarchy explicitly:
 ### D1. Tray-level
 - A compute tray contains:
  - Host CPU (issues requests / coordinates runtime & data placement)
  - Multiple identical SIPs (accelerators)
  - Interconnect fabric between SIPs (PCIe and/or UAL via switches)
 ### D2. SIP-level
 - A SIP is a multi-die package composed of:
  - Multiple CUBEs (HBM die + compute PEs + UCIe)
  - One or more IO chiplets (host/SIP interfaces)
 - IO chiplets:
  - provide interfaces: PCIe-EP, IO_CPU, optionally UAL-EP
  - can be multiple per SIP
  - placement constrained to SIP shoreline (top/bottom/left/right); each shoreline may host 1–2 IO chiplets
 ### D3. CUBE-level
 - A CUBE contains:
  - HBM + memory controller (HBM_CTRL)
  - NOC (on-die fabric): carries all intra-cube traffic including HBM data,
    inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access.
    Must provide: full-BW PE↔local HBM path, PE↔SRAM connectivity,
    PE↔UCIe connectivity, M_CPU↔PE command path.
    NOC topology is an implementation choice (e.g., 2D mesh, ring, crossbar);
    current implementation uses a 2D mesh with XY routing (see ADR-0017).
    HBM_CTRL is attached to each PE's local NOC port (local HBM = minimal hop).
  - Shared SRAM: cube-level shared memory accessible by all PEs via NOC
  - management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
  - multiple PEs
  - up to 4 UCIe endpoints (N/E/W/S) for CUBE↔CUBE and CUBE↔IO connectivity
 ### D4. PE-level
 - A PE can execute one kernel instance
 - PE contains internal control + accelerators (modeled at PE view granularity):
  - PE_CPU, command handler, PE_TCM, DMA/GEMM/MATH engines, internal queues
 ## Consequences
 - The simulator supports abstraction by “views”:
  - SIP view hides PE internals
  - CUBE view treats each PE as a single block
  - PE view expands PE internals
 - Topology remains parameterized; sizes/counts/links come from configuration.
 ## Links
 - SPEC R3/R5
 - ADR-0005 (diagram views)
 - ADR-0017 (cube NOC 2D mesh architecture)
@@ -0,0 +1,76 @@
 # ADR-0004: Memory Semantics & Local-HBM Bandwidth Guarantee
 ## Status
 Accepted
 ## Context
 Accurately modeling PE↔HBM behavior is essential for kernel latency estimation.
 Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth, independent of intervening on-die fabric bandwidth.
 ## Decision
 ### D1. Local HBM definition
 - Each PE is assigned a logically defined “local HBM” region.
 - Local HBM corresponds to the pseudo-channel subset directly attached to that PE’s
  router in the NOC mesh (ADR-0017 D4).
 - The path is: PE_DMA → local router → HBM_CTRL (switching overhead only, 0 mesh hops).
 - The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration.
 ### D2. Local HBM bandwidth guarantee contract
 - Accesses from a PE to its local HBM MUST guarantee full effective HBM
  read/write bandwidth independent of intervening fabric bandwidth limits.
 - Effective HBM bandwidth = spec bandwidth x efficiency factor.
  The efficiency factor (configured via `hbm_ctrl.attrs.efficiency`, default 0.8)
  models real-world DRAM inefficiencies (refresh cycles, bank conflicts, page
  misses). For example: 256 GB/s spec x 0.8 = 204.8 GB/s effective.
 - The topology builder applies the efficiency factor to router-to-hbm edge
  bandwidth at graph construction time, so all downstream routing and latency
  computation uses the effective value.
 - This guarantee is modeled by:
  - a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point,
  - while still incurring non-zero latency along explicitly modeled components.
 - HBM CTRL internal modeling (PC striping, cut-through, scheduling fidelity)
  is consolidated in ADR-0033 (Latency Model: Assumptions and Known
  Simplifications). The aggregate BW guarantee here remains the contract;
  ADR-0033 documents how the per-PC model realizes it and which scheduler
  effects are intentionally simplified.
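
The effective-bandwidth arithmetic in D2 can be written out as a short sketch. The function names are hypothetical; the default of 0.8 mirrors the stated default for `hbm_ctrl.attrs.efficiency`, and the GB/s to bytes/ns conversion assumes GB = 10**9 bytes.

```python
def effective_hbm_bw_gbps(spec_bw_gbps: float, efficiency: float = 0.8) -> float:
    """Effective HBM bandwidth = spec bandwidth x efficiency factor."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return spec_bw_gbps * efficiency


def transfer_time_ns(size_bytes: int, bw_gbps: float) -> float:
    """Serialization time in ns; 1 GB/s equals 1 byte/ns for GB = 10**9 bytes."""
    if bw_gbps <= 0:
        raise ValueError("bandwidth must be positive")
    return size_bytes / bw_gbps
```

For the example in the text, a 256 GB/s spec at the default efficiency gives 204.8 GB/s, so a 4096-byte transfer takes about 20 ns on the router-to-HBM edge.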
 ### D3. Remote PE HBM semantics (intra-cube)
 - A PE that accesses another PE's local HBM traverses the NOC:
  - PE_DMA → NOC → (fabric hops) → target PE's NOC port → HBM_CTRL
 - NOC bandwidth and hop count may limit remote HBM access relative to local access.
 ### D4. Non-local HBM semantics (inter-cube / inter-SIP)
 - Accesses from a PE to HBM in a different cube or SIP MAY be limited by:
  - NOC bandwidth within the cube,
  - inter-cube UCIe links,
  - inter-SIP fabric (PCIe/UAL).
 - These paths MUST be explicit and traceable.
 ### D5. Shared SRAM semantics
 - Each CUBE contains a shared SRAM accessible by all PEs in that CUBE.
 - Access path: PE_DMA → NOC → shared SRAM.
 - Shared SRAM bandwidth is limited by the NOC↔SRAM link bandwidth.
 - Shared SRAM is not part of the HBM address space; it is a separate memory domain.
 ## Verification Notes
 Tests should cover:
 - local-HBM case: BW matches HBM BW regardless of fabric BW parameter
 - remote PE HBM case: latency includes mesh hop traversal
 - non-local cases (inter-cube/inter-SIP): BW/latency respond to fabric/link parameters
 - shared SRAM case: access via NOC with correct BW
 ## Links
 - SPEC R2/R5
 - ADR-0002 (distance/order & explicit bypass)
 - ADR-0017 D7 (PE DMA data paths through NOC to HBM)
@@ -0,0 +1,186 @@
 # ADR-0005: Diagram Views & Distance-Aware Layout Rules
 ## Status
 Accepted
 ## Context
 We require verifiable and inspectable system modeling for a large-scale,
 parameterized AI Accelerator system.
 Humans must be able to:
 - visually inspect the modeled topology,
 - reason about communication structure and relative distance,
 - do so at multiple abstraction levels without being overwhelmed by detail.
 The simulator models distance (accumulated latency) as a first-class concept.
 Diagrams must reflect this distance by default.
 ---
 ## Decision
 ### D1. Global Defaults
 - All diagrams MUST be **distance-aware by default**.
 - All diagrams MUST render **representative views** of the architecture.
 - Instance indices (e.g., sip0, cube2, pe3) MUST NOT be required for diagram generation.
 - Instance indices MAY be used ONLY:
  - to define a distance anchor in asymmetric or debugging scenarios, or
  - when explicitly requested.
 ---
 ### D2. Representative Rendering Rule
 - All CUBEs share the same internal structure.
 - All PEs share the same internal structure.
 Therefore:
 - SIP-level diagrams render representative CUBEs and IO chiplets.
 - CUBE-level diagrams render representative PEs as opaque blocks.
 - PE-level diagrams render a representative PE with fully expanded internals.
 Diagrams MUST NOT depend on specific SIP, CUBE, or PE indices
 unless explicitly requested.
 ---
 ### D3. Diagram Views
 #### View A — SIP-Level Diagram
 **Purpose**
 Explain system-scale structure and connectivity.
 **Visible elements**
 - SIP boundaries (optional)
 - CUBEs (opaque blocks)
 - IO chiplets (opaque blocks)
 - Optional UCIe stubs only if needed to clarify connectivity
 **Hidden elements**
 - PE internals
 - CUBE internal fabric
 - IO chiplet internals
 **Visible links**
 - Host ↔ IO chiplets (PCIe)
 - SIP ↔ SIP (PCIe / UAL via switches)
 - IO ↔ CUBE (on-package links)
 ---
 #### View B — CUBE-Level Diagram
 **Purpose**
 Explain cube-internal structure and data/control flow.
 **Visible elements**
 - Router mesh: 2D grid of NOC routers (from cube_mesh.yaml), all traffic routes through mesh
 - HBM_CTRL attached to PE routers (local HBM = 0 hop)
 - HBM subsystem (HBM_CTRL)
 - Shared SRAM: cube-level shared memory
 - Management CPU (M_CPU)
 - PEs as opaque blocks (PE[0..N−1])
 - UCIe endpoints (N/E/W/S) as ports
 **Hidden elements**
 - PE internals
 **Visible links**
 - PE → router (HBM + non-HBM data path via mesh)
 - Router ↔ HBM_CTRL (local HBM access)
 - Router ↔ Router (mesh hops for remote access)
 - Router ↔ UCIe endpoints
 - Router ↔ shared SRAM
 - M_CPU ↔ router (command path)
 - Router → PE_CPU (command delivery, collapsed into PE block)
 ---
 #### View C — PE-Level Diagram
 **Purpose**
 Explain internal PE behavior and execution structure.
 **Visible elements**
 - PE_CPU
 - Command handler / scheduler
 - PE_TCM (local SRAM)
 - HW accelerators (DMA, GEMM, MATH, etc.)
 - Local HBM interface
 - Optional IPCQ / messaging endpoints
 **Visible links**
 - Control paths (CPU → scheduler → engines)
 - Data paths (engines ↔ TCM, DMA ↔ local HBM)
 - External fabric ports as abstract ports only
 ---
 ### D4. Distance-Aware Layout (Default)
 #### Distance definition
 - Distance is defined as **accumulated latency**, consistent with ADR-0002.
 - Distance is computed from a single anchor node.
 #### Default anchor selection
 - SIP view: IO chiplet (or Host CPU if present)
 - CUBE view: a representative PE
 - PE view: PE_CPU or Command Handler
 Anchors are **implicit defaults** and MUST NOT be required to be specified.
 #### Layout rules
 - Diagrams MUST be laid out in layers based on distance buckets.
 - Layout direction MUST be consistent within a view type
  (preferred: left-to-right).
 - Nodes with equal distance MUST have stable ordering
  (by role or identifier, deterministically).
 Cycles MAY be rendered using dashed or curved edges for readability,
 without affecting distance semantics.
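
The layering rule can be sketched as a bucketing step over per-node distances from the anchor. The bucket width is a hypothetical parameter, since this ADR does not define how buckets are sized, and sorting by node identifier is just one way to satisfy the stable-ordering requirement.

```python
def layer_by_distance(distances: dict, bucket_width: float) -> list:
    """Group nodes into layout layers by distance bucket, nearest layer first.

    `distances` maps node id -> accumulated latency from the anchor. Nodes in
    the same bucket are sorted by identifier so the ordering is stable.
    """
    if bucket_width <= 0:
        raise ValueError("bucket_width must be positive")
    buckets = {}
    for node, distance in distances.items():
        buckets.setdefault(int(distance // bucket_width), []).append(node)
    return [sorted(buckets[index]) for index in sorted(buckets)]
```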
 ---
 ### D5. Generation Contract (for Tools / Claude Code)
 When generating diagrams:
 - Assume distance-aware layout by default.
 - Assume representative rendering by default.
 - Do NOT ask for SIP/CUBE/PE indices unless required.
 - Do NOT expand hidden abstraction levels.
 - Prefer architectural clarity over micro-hop fidelity.
 ---
 ## Consequences
 - Diagrams are stable across topology scaling.
 - Changes in distance or routing policy are reflected visually.
 - Diagrams serve as verifiable artifacts derived from the simulator model,
  not as hand-maintained documentation.
 ---
 ## Links
 - SPEC Section 4 (Output, Debuggability, and Diagrams)
 - ADR-0002 (Routing distance semantics)
 - ADR-0006 (Topology compilation & automatic diagram generation)
@@ -0,0 +1,130 @@
 # ADR-0006: Topology Compilation, Distance Extraction, and Automatic Diagram Generation
 ## Status
 Accepted
 ## Context
 The simulator compiles topology configuration (e.g., topology.yaml) into an explicit model graph,
 and computes routing and accumulated latency (distance).
 Diagrams should be generated from these authoritative artifacts to ensure consistency and avoid
 hand-maintained topology drawings.
 Additionally, for usability, diagrams should be emitted automatically into a stable location
 so that developers can preview them immediately in the repository.
 ---
 ## Decision
 ### D1. Topology compilation is the single source of truth
 - topology.yaml (or equivalent config) is compiled into:
  - an explicit system graph,
  - node/link attributes,
  - routing policies.
 This compiled graph is the authoritative representation of the system.
 ### D2. Distance extraction during compilation
 - During or immediately after topology compilation, the simulator MUST compute distance metadata
  (accumulated latency) consistent with ADR-0002.
 - Distance metadata MUST be sufficient to support distance-aware diagram layout as defined in ADR-0005.
 - Distributed fabric segments (e.g., NOC) MAY have distance_mm = 0 per ADR-0002 D4;
  layout placement for such nodes uses explicit position metadata rather than distance buckets.
 ### D3. Diagram generation is a derived artifact
 - Diagrams MUST be generated from:
  - the compiled topology graph,
  - extracted distance metadata,
  - view/layout rules defined in ADR-0005.
 - Diagram generation MUST NOT require additional hand-written topology descriptions.
 ### D4. Automatic diagram emission to the repository
 - As part of topology compilation, the implementation MUST produce the following diagrams by default:
  - SIP-level diagram (representative, distance-aware)
  - CUBE-level diagram (representative, distance-aware)
  - PE-level diagram (representative, distance-aware)
 - The default output directory is:
  - `docs/diagrams/`
 - The generator MUST rewrite an output file only when the compiled topology or the diagram rules change.
 ### D5. View-specific projection and layout
 For each view (SIP / CUBE / PE):
 - The generator MUST project the compiled graph into a reduced view graph:
  - hide/collapse nodes according to ADR-0005,
  - preserve connectivity semantics relevant to that view,
  - compute distance buckets and assign layout layers deterministically.
 - CUBE-level projection MUST include:
  - Router mesh (from cube_mesh.yaml), HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
    and PEs as opaque blocks.
  - All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0017).
 - Default anchors are implicit (ADR-0005) and MUST NOT require instance indices.
 ### D6. Output formats and determinism
 - The generator MUST output at least one of:
  - Mermaid (Markdown-native)
  - Graphviz DOT (rank-based control)
  - SVG (mm-accurate layout, no external dependencies)
 - SVG is preferred when mm-accurate position metadata is available from the compiled topology.
 - Output MUST be deterministic:
  - same topology + same rules → identical diagram text
 - File naming MUST be deterministic and stable (see "Output Conventions").
 ### D7. Performance and caching
 - Diagram generation MAY be lazy and/or cached, as long as the outputs in `docs/diagrams/`
  remain consistent with the compiled topology.
 - The implementation SHOULD use a cache key based on:
  - topology content hash,
  - routing policy version,
  - diagram rules version,
  - view type (SIP/CUBE/PE).
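 One way to combine the four cache-key inputs listed above is a single digest. The sketch below is an assumption-laden illustration: the function name `diagram_cache_key`, the use of SHA-256, and the string-typed version fields are not specified by this ADR.

```python
import hashlib


def diagram_cache_key(
    topology_bytes: bytes,
    routing_policy_version: str,
    diagram_rules_version: str,
    view: str,
) -> str:
    """Return a hex digest that changes whenever any cache-key input changes."""
    if view not in ("SIP", "CUBE", "PE"):
        raise ValueError(f"unknown view type: {view!r}")
    parts = (
        hashlib.sha256(topology_bytes).hexdigest(),  # topology content hash
        routing_policy_version,
        diagram_rules_version,
        view,
    )
    digest = hashlib.sha256()
    for part in parts:
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return digest.hexdigest()
```

 A generator could compare this key with the one stored next to the last emitted file and skip regeneration when they match.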
 ---
 ## Output Conventions
 ### Directory
 - `docs/diagrams/` is the canonical output directory for generated diagrams.
 ### File names (recommended, deterministic)
 - `system_view.svg` / `system_view.mmd` / `system_view.dot`
 - `sip_view.svg` / `sip_view.mmd` / `sip_view.dot`
 - `cube_view.svg` / `cube_view.mmd` / `cube_view.dot`
 - `pe_view.svg` / `pe_view.mmd` / `pe_view.dot`
 Optionally, for multi-topology workflows:
 - `sip_view__{topology_id}.svg`
 - `cube_view__{topology_id}.svg`
 - `pe_view__{topology_id}.svg`
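 The naming convention above is mechanical enough to express as a helper. This is a hypothetical sketch: `diagram_path` and the format keys are illustrative names, and applying the `__{topology_id}` suffix to non-SVG formats is an assumption, since the list above shows it for SVG only.

```python
from pathlib import PurePosixPath
from typing import Optional

OUTPUT_DIR = PurePosixPath("docs/diagrams")
# Format name -> file extension, following the recommended file names.
EXTENSIONS = {"svg": "svg", "mermaid": "mmd", "dot": "dot"}


def diagram_path(view: str, fmt: str, topology_id: Optional[str] = None) -> str:
    """Build the deterministic output path for one generated diagram."""
    if view not in ("system", "sip", "cube", "pe"):
        raise ValueError(f"unknown view: {view!r}")
    if fmt not in EXTENSIONS:
        raise ValueError(f"unknown format: {fmt!r}")
    stem = f"{view}_view"
    if topology_id is not None:
        stem = f"{stem}__{topology_id}"
    return str(OUTPUT_DIR / f"{stem}.{EXTENSIONS[fmt]}")
```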
 ### Repository policy
 - Generated diagram files MAY be committed to the repository to enable diff-based review.
 - If committed, they MUST be reproducible from topology compilation.
 ---
 ## Consequences
 - Diagrams are always consistent with simulator behavior.
 - Architectural changes automatically propagate to visualizations.
 - Diagram diffs become meaningful indicators of architectural change.
 ---
 ## Links
 - SPEC Section 4 (Output, Debuggability, and Diagrams)
 - ADR-0002 (Distance semantics)
 - ADR-0005 (Diagram views and layout rules)
@@ -0,0 +1,95 @@
 # ADR-0007: Runtime API and Simulation Engine Boundaries
 ## Status
 Accepted
 ## Context
 The simulator consists of multiple layers with distinct responsibilities:
 - a host-facing API layer used by benchmarks and user code,
 - a discrete-event simulation engine that executes requests,
 - device components that model hardware behavior.
 Without strict boundaries, orchestration logic can leak into components,
 or simulation internals can become entangled with user-facing APIs.
 This ADR defines clear responsibility boundaries between:
 - runtime API,
 - simulation engine (sim_engine),
 - hardware components.
 ---
 ## Decision
 ### D1. Runtime API is host-facing orchestration only
 The runtime API represents host/driver-level behavior and MUST:
 - expose high-level operations (tensor deployment, kernel launch),
 - submit requests only to endpoint components (e.g., IO_CPU),
 - await completion via futures/handles,
 - own and persist host-side metadata (tensor allocation maps, kernel bindings).
 The runtime API MUST NOT:
 - hardcode hop-by-hop routing or fan-out,
 - directly invoke internal components (M_CPU, PE_CPU, engines),
 - embed topology- or routing-specific assumptions.
 ---
 ### D2. Simulation engine wires components and tracks completion
 The simulation engine (sim_engine) MUST:
 - wire components at initialization (create port stores + start wire
  processes per the component port/wire framework — ADR-0015),
 - inject requests into the compiled topology graph at entry components
  (e.g., PCIE_EP for memory operations, IO_CPU for kernel launch),
 - schedule and execute events using a discrete-event model,
 - manage correlation ids and completion tracking.
 The simulation engine MUST NOT:
 - define tensor semantics,
 - define kernel execution policies,
 - expose internal graph details to the runtime API,
 - walk the topology path during request execution,
 - call component `run()` methods directly,
 - track per-hop latency or decompose fan-out (components own this).
 ---
 ### D3. Components own fan-out and aggregation
 Device-side components MUST:
 - fan-out requests to downstream domains
  (IO_CPU → M_CPU → PE_CPU → schedulers/engines),
 - aggregate completion and failure signals,
 - propagate results deterministically upstream.
 Neither the runtime API nor the simulation engine may orchestrate
 component-level fan-out explicitly.
 ---
 ## Consequences
 - Runtime APIs remain stable as topology and routing evolve.
 - Simulation internals can change without affecting user-facing code.
 - Component implementations remain swappable via DI.
 ---
 ## Links
 - SPEC R4, R7, R8
 - ADR-0008 (Tensor deployment)
 - ADR-0009 (Kernel execution)
 - ADR-0015 (Component port/wire model and engine role)
 - ADR-0010 (CLI surface and execution semantics — runtime API consumer)
@@ -0,0 +1,100 @@
 # ADR-0008: Tensor Deployment and Allocation (Host Allocator, PA-first)
 ## Status
 Accepted
 ## Context
 Benchmarks require PyTorch-like tensor semantics:
 - tensor creation (empty, fill),
 - deployment to accelerator devices (tensor.to()).
 In the realistic system, host software manages allocation/mapping and installs
 mappings for DMA/MMU. For Phase 0 we simplify (ADR-0011):
 - device memory operations use PA only,
 - VA/MMU/IOMMU is not modeled.
 To keep the host↔device interface minimal, we avoid a separate
 AllocateTensorMeta message. Instead, host allocation produces a PA shard map
 that is used directly by MemoryWrite/Read and KernelLaunch.
 ---
 ## Decision
 ### D1. Tensor is a host-owned handle with PA shard mapping
 A Tensor object is a host-owned handle that encapsulates:
 - shape and dtype,
 - initialization intent,
 - device placement and allocation metadata as a PA shard map.
 After deployment, the Tensor handle MUST contain:
 - a list of shards, each with (sip,cube,pe,pa,nbytes,offset_bytes).
 This PA shard mapping is the single source of truth for kernel argument binding.
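 As a rough illustration of D1, the handle and its shard list could be modeled with two dataclasses. The class names `Shard` and `TensorHandleSketch` and the helper methods are hypothetical; only the shard fields come from the tuple above. The lookup helper assumes a split placement in which shards cover disjoint byte ranges.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Shard:
    sip: int
    cube: int
    pe: int
    pa: int            # physical address of the shard on that PE
    nbytes: int        # shard size in bytes
    offset_bytes: int  # byte offset of the shard within the logical tensor


@dataclass
class TensorHandleSketch:
    shape: tuple[int, ...]
    dtype: str
    shards: list[Shard] = field(default_factory=list)

    def total_bytes(self) -> int:
        """Sum of shard sizes (equals the tensor size for a split placement)."""
        return sum(s.nbytes for s in self.shards)

    def shard_for_offset(self, offset: int) -> Shard:
        """Return the first shard whose byte range contains `offset`."""
        for s in self.shards:
            if s.offset_bytes <= offset < s.offset_bytes + s.nbytes:
                return s
        raise KeyError(offset)
```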
 ---
 ### D2. Deployment uses a host allocator (Phase 0)
 In Phase 0, tensor deployment produces PA shard mappings via a host allocator:
 - placement (split/replicate/hybrid) is decided by a DP policy,
 - allocation assigns PA ranges at the PE level and returns shard mappings,
 - the Tensor handle stores the resulting shard list deterministically.
 No separate host-visible device allocation RPC is required in Phase 0.
 ---
 ### D3. Data initialization and transfer use MemoryWrite/Read only
 Any data initialization or transfer implied by a tensor (e.g., fill, copy)
 MUST be represented using Host ↔ IO_CPU messages only:
 - MemoryWrite
 - MemoryRead
 Rules:
 - MemoryWrite/Read MUST reference PA + (sip,cube,pe) tags (ADR-0012).
 - Allocation metadata MUST NOT be embedded as a separate allocation message.
 - Bulk tensor data MUST NOT be embedded in Phase 0 messages.
 The simulation engine schedules MemoryWrite/Read through the graph so that
 latency is computed by explicit traversal.
 ---
 ### D4. Extension path (non-breaking)
 Future ADRs MAY introduce optional VA/MMU/IOMMU modeling by adding:
 - virtual addressing in tensor handles,
 - mapping install steps,
 - translation latency/page granularity.
 The Phase 0 PA shard map remains a valid fast-path configuration.
 ---
 ## Consequences
 - Host↔IO_CPU contract remains minimal (MemoryRead/Write + KernelLaunch).
 - KernelLaunch can pass per-PE data placement explicitly via shard tags.
 - Early implementation stays simple and testable.
 ---
 ## Links
 - ADR-0011 (Memory Addressing — PA / VA / LA)
 - ADR-0012 (Host↔IO_CPU schema)
 - ADR-0007 (runtime_api vs sim_engine boundaries)
 - ADR-0009 (Kernel execution)
@@ -0,0 +1,146 @@
 # ADR-0009: Kernel Execution Messaging and Completion Semantics
 ## Status
 Accepted
 ## Context
 Kernel execution is initiated by the host and proceeds through
 device control components:
 Host → IO_CPU → M_CPU → PE_CPU → schedulers → engines
 Completion propagates in reverse order.
 To keep benchmarks simple and topology-agnostic,
 kernel execution must be endpoint-driven with deterministic aggregation.
 ---
 ## Decision
 ### D1. Kernel launch is an endpoint request
 A kernel launch is initiated by submitting a single KernelLaunch request
 to the IO_CPU endpoint.
 The runtime API MUST:
 - construct the kernel launch request,
 - submit it to IO_CPU,
 - await a single completion result.
 The runtime API MUST NOT orchestrate internal fan-out.
 ---
 ### D2. Tensor arguments are passed by metadata
 KernelLaunch requests MUST reference tensor arguments via:
 - host-owned tensor handles, or
 - resolved device address maps derived from those handles.
 Bulk tensor data MUST NOT be embedded in kernel launch messages.
 ---
 ### D3. Fan-out and aggregation are component responsibilities
 - IO_CPU fans out work to M_CPUs.
 - M_CPU fans out work to PE_CPUs.
 - PE_CPU manages kernel execution and engine dispatch.
 Completion semantics:
 - M_CPU completes when all targeted PEs complete or a failure policy triggers.
 - IO_CPU completes when all targeted CUBEs complete or a failure policy triggers.
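 The completion rule above can be sketched as a pure aggregation function. The ADR does not spell out the failure policy here, so the fail-fast behavior below (the parent completes at the earliest child failure) is an assumption, as are the names `ChildResult` and `aggregate`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildResult:
    child_id: str   # e.g. a PE under an M_CPU, or a CUBE under an IO_CPU
    done_ns: float  # simulated time at which the child reported
    ok: bool        # True on success, False on failure


def aggregate(results: list[ChildResult]) -> tuple[bool, float]:
    """Return (ok, completion_ns) for a parent that fanned out to `results`."""
    if not results:
        raise ValueError("a fan-out needs at least one target")
    failures = [r for r in results if not r.ok]
    if failures:
        # Assumed fail-fast policy: report at the earliest failure.
        return False, min(r.done_ns for r in failures)
    # Success: the parent completes only when the slowest child completes.
    return True, max(r.done_ns for r in results)
```

 The same function applies at both levels of the tree, which mirrors how M_CPU and IO_CPU share one completion rule.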
 ---
 ### D4. Completion and failure propagation
 - All messages MUST carry correlation identifiers.
 - Completion and failure MUST propagate deterministically to the host.
 - The simulation engine provides futures/handles to observe completion.
 ---
 ### D5. Launch timing is endpoint-synchronized
 All PEs targeted by a single kernel launch MUST begin executing the kernel
 body at the same simulated time, regardless of their dispatch path length
 from the launch entry point.
 Rationale. The dispatch tree Host → IO_CPU → M_CPU → PE_CPU has variable
 latency at every level. PEs near their M_CPU receive the launch earlier
 than PEs farther away; cubes near an IO_CPU receive it earlier than cubes
 farther away. Without synchronization, each PE's kernel begins at a
 different `env.now`, making per-PE metrics such as `pe_exec_ns` a function
 of dispatch-path geometry rather than of the kernel's behavior —
 producing measurement artifacts in benchmarks that time kernel-internal
 waits (for example `tl.recv` on cross-cube or cross-SIP hops).
 Mechanism.
 - `KernelLaunchMsg` carries an optional `target_start_ns: float | None`.
 - **IO_CPU** is the canonical stamper. On fan-out to M_CPUs, it
  computes `target_start_ns = env.now + max_latency` where
  `max_latency` is the maximum, over every target (sip, cube, pe)
  tuple, of the **two-leg dispatch chain**:
  ```
  max_latency(sip, cube, pe) =
      compute_path_latency_ns(find_node_path(io_cpu, m_cpu(sip, cube)))
    + compute_path_latency_ns(find_node_path(m_cpu(sip, cube), pe_cpu))
    - io_cpu.overhead_ns
    - m_cpu.overhead_ns
  ```
  This models the actual dispatch as **two sequential Transactions**
  (IO_CPU → M_CPU, then M_CPU → PE_CPU). Each leg's
  `compute_path_latency_ns` adds its endpoints' `overhead_ns`;
  `io_cpu.overhead_ns` is subtracted because IO_CPU has already
  paid it before this method runs, and `m_cpu.overhead_ns` is
  subtracted once because it appears as endpoint of leg1 *and*
  start of leg2 but is paid only once at run time. A single
  `find_node_path(io_cpu, pe_cpu)` walk is **not** equivalent —
  it can pick a graph path that bypasses M_CPU and silently
  under-shoots the prediction for far cubes, breaking the D5
  invariant.
  The fanned-out sub-Transactions carry **`nbytes = 0`** for
  `KernelLaunchMsg` (control message only). Without this,
  large kernel-launch payloads would occupy fabric BW on the
  shared first hop and serialize the per-cube dispatch, pushing
  far M_CPUs past `target_start_ns` and re-introducing the
  late-arrival violation.
 - **M_CPU** passes an already-stamped `target_start_ns` through
  unchanged. Only when the value is absent (e.g. a direct
  launch-to-M_CPU unit test) does M_CPU compute a per-cube barrier
  `env.now + max(local command-path latency)`.
 - **PE_CPU** yields `env.timeout(target_start_ns - env.now)` at the top
  of `_execute_kernel`, before recording `pe_exec_start` and invoking
  the kernel body.
 - When `target_start_ns is None`, PE_CPU falls through to the legacy
  unsynchronized behavior — preserving backward compatibility.
 IO_CPU-level stamping guarantees every PE across every targeted cube
 uses the same barrier sim-time, eliminating both the within-cube
 dispatch-offset artifact *and* the cross-cube offset artifact in
 multi-cube launches. This models a real-hardware timed-broadcast launch
 (latency-equalized dispatch tree).
 The synchronization is internal to the engine / IO_CPU / M_CPU / PE_CPU
 control plane — runtime API and application kernels are unchanged.
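 A worked example of the two-leg computation may help. The sketch below takes the per-leg latencies as plain numbers instead of calling `compute_path_latency_ns`; the function names and the sample values are illustrative assumptions.

```python
def dispatch_latency_ns(
    leg1_ns: float,
    leg2_ns: float,
    io_cpu_overhead_ns: float,
    m_cpu_overhead_ns: float,
) -> float:
    """Two-leg dispatch latency for one (sip, cube, pe) target.

    leg1_ns: path latency IO_CPU -> M_CPU, including both endpoint overheads.
    leg2_ns: path latency M_CPU -> PE_CPU, including both endpoint overheads.
    """
    return leg1_ns + leg2_ns - io_cpu_overhead_ns - m_cpu_overhead_ns


def target_start_ns(
    now_ns: float,
    legs: dict[tuple[int, int, int], tuple[float, float]],
    io_cpu_overhead_ns: float,
    m_cpu_overhead_ns: float,
) -> float:
    """Barrier time: now plus the slowest target's dispatch latency."""
    max_latency = max(
        dispatch_latency_ns(leg1, leg2, io_cpu_overhead_ns, m_cpu_overhead_ns)
        for leg1, leg2 in legs.values()
    )
    return now_ns + max_latency
```

 With `now_ns = 100`, overheads of 5 ns (IO_CPU) and 4 ns (M_CPU), a near target with legs (30, 20) and a far target with legs (50, 25), the per-target latencies are 41 ns and 66 ns, so every PE starts at 166 ns.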
 ---
 ## Links
 - SPEC R1, R2, R7, R8
 - ADR-0007 (Runtime API boundaries)
 - ADR-0008 (Tensor deployment)
 - ADR-0013 (Verification strategy — V2 fan-out tests)
 - ADR-0015 D4 (concrete fabric path for kernel launch)
@@ -0,0 +1,131 @@
 # ADR-0010: Command Line Interface and Execution Semantics
 ## Status
 Accepted
 ## Context
 The `kernbench` CLI is the user-facing entry point of the simulator. It
 exposes three subcommands:
 - `run` — execute a benchmark against a topology.
 - `probe` — diagnostic utility for latency / BW measurement.
 - `web` — interactive topology viewer.
 Device enumeration is centralized in the CLI; neither the runtime API
 nor the simulation engine enumerates devices. Benchmarks remain
 single-device by design and accept a device identifier as input.
 ## Decision
 ### D1. Benchmark contract — single-device by design
 - A benchmark MUST define behavior for a single device only.
 - A benchmark MUST accept a device identifier as input.
 - Benchmarks MUST NOT enumerate or loop over multiple devices.
 Multi-device execution is the CLI's concern (D3), not the benchmark's.
 ### D2. `kernbench run` — benchmark execution
 Required arguments:
 - `--topology <path>`: topology YAML file path. Loaded via
  `resolve_topology()`.
 - `--bench <name>`: benchmark name. Resolved via
  `benches.loader.resolve_bench()`.
 Optional arguments:
 - `--device <selector>` (default: `all`):
  - `all` — run once per discovered SIP (see D3).
  - `sip:<N>` — run only on SIP N.
  - Parsed via `resolve_device()`.
 - `--verify-data` (default: off) — enable Phase 2 data verification
  (see ADR-0020). When set, `engine_factory` constructs the engine
  with `enable_data=True`. After the benchmark runs, a diagnostic
  summary of recorded ops is printed.
 Each invocation runs the benchmark once within a single simulation
 instance.
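 The `run` argument surface described above maps directly onto `argparse`. This is a sketch under stated assumptions: `parse_device` is a stand-in for the real `resolve_device()`, whose signature is not shown in this ADR, and returning `None` for `all` is a choice made for the example.

```python
import argparse
from typing import Optional


def parse_device(selector: str) -> Optional[int]:
    """Return None for 'all', or the SIP index for 'sip:<N>'."""
    if selector == "all":
        return None
    prefix, sep, index = selector.partition(":")
    if prefix == "sip" and sep and index.isdecimal():
        return int(index)
    raise argparse.ArgumentTypeError(f"invalid device selector: {selector!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="kernbench")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run", help="execute a benchmark against a topology")
    run.add_argument("--topology", required=True, help="topology YAML file path")
    run.add_argument("--bench", required=True, help="benchmark name")
    # argparse applies `type` to string defaults, so "all" becomes None.
    run.add_argument("--device", type=parse_device, default="all")
    run.add_argument("--verify-data", action="store_true")
    return parser
```

 For example, `build_parser().parse_args(["run", "--topology", "t.yaml", "--bench", "gemm", "--device", "sip:2"])` yields `device == 2` and `verify_data == False`.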
 ### D3. Multi-device execution is logically parallel
 When `--device all` (or omitted) and the topology has multiple SIPs:
 - Benchmark executions are submitted to a single simulation engine
  instance.
 - Executions are logically parallel in simulation time.
 - Inter-device contention is naturally modeled (shared fabric
  bandwidth, cross-SIP traffic, etc.).
 The CLI does NOT spawn multiple OS processes or independent
 simulation runs — parallelism is internal to one simulation instance.
 ### D4. `kernbench probe` — latency / BW diagnostic utility
 Required argument:
 - `--topology <path>`: topology YAML file path.
 Optional argument:
 - `--case <name>` (default: `all`) — run a predefined traffic
  pattern, or `all` to run every defined case.
 Probe runs each pattern through the simulation engine and reports
 per case:
 - End-to-end latency (ns).
 - Effective bandwidth (nbytes / total_ns).
 - Bottleneck bandwidth (min edge BW along the chosen path).
 - Utilization (effective / bottleneck).
 Probe additionally validates monotonicity invariants — for example
 that local-HBM access ≤ cross-PE-within-cube ≤ cross-cube ≤
 cross-SIP — and reports violations. Probe is a developer tool for
 verifying the latency / BW model; it is not a benchmark.
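 The per-case figures and the monotonicity check can be sketched as follows. The function names, the dictionary keys, and the case labels are assumptions for illustration; bandwidth is expressed in bytes/ns, which is numerically equal to GB/s.

```python
def probe_metrics(
    nbytes: int, total_ns: float, edge_bw_bytes_per_ns: list[float]
) -> dict[str, float]:
    """Compute the bandwidth figures that probe reports for one case."""
    effective = nbytes / total_ns            # effective bandwidth
    bottleneck = min(edge_bw_bytes_per_ns)   # slowest edge on the chosen path
    return {
        "effective_bw": effective,
        "bottleneck_bw": bottleneck,
        "utilization": effective / bottleneck,
    }


def monotonicity_violations(
    latency_ns: dict[str, float], order: list[str]
) -> list[tuple[str, str]]:
    """Return adjacent (nearer, farther) pairs where the nearer case is slower."""
    return [
        (near, far)
        for near, far in zip(order, order[1:])
        if latency_ns[near] > latency_ns[far]
    ]
```

 A 4096-byte transfer that takes 256 ns over edges of 64 and 32 bytes/ns has an effective bandwidth of 16, a bottleneck of 32, and a utilization of 0.5.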
 ### D5. `kernbench web` — topology viewer
 Optional arguments:
 - `--port <N>` (default: `8765`) — HTTP port.
 - `--no-open` — do not auto-open the browser.
 Launches a local HTTP server that renders the compiled topology in
 the browser. Distinct from the static `docs/diagrams/` artifacts:
 - `docs/diagrams/` files are derived at topology-compile time
  (ADR-0006).
 - `kernbench web` is interactive — pan/zoom, hover for component
  attributes, switch between SIP / CUBE / PE views.
 ### D6. Runtime API and simulation engine remain device-scoped
 - Runtime API calls operate on one device per invocation.
 - The simulation engine schedules all requests deterministically.
 - Neither layer enumerates devices.
 This invariant keeps each layer testable in isolation; device
 enumeration and multi-device fan-out live only in the CLI's `run`
 command (D3).
 ## Consequences
 - Benchmark authors write single-device logic; multi-device behavior
  emerges from the CLI dispatching across SIPs.
 - Adding a new subcommand (e.g., trace export, replay) does not
  require benchmark or runtime-API changes — the CLI is the
  extension point.
 - `probe` and `web` are diagnostic / visualization tools, not
  benchmarks; they bypass the benchmark loader path.
 ## Links
 - SPEC R7, R8, R9
 - ADR-0007 (Runtime API and Simulation Engine Boundaries)
 - ADR-0020 (Two-pass data execution — `--verify-data`)
 - ADR-0006 (Topology compilation and diagram generation —
  background for `kernbench web`)
@@ -0,0 +1,521 @@
 # ADR-0011: Memory Addressing — PA / VA / LA Address Models
 ## Status
 Accepted.
 - **VA model: currently implemented (default).**
 - PA model: implemented as PageFault fallback in PE_DMA.
 - LA model: proposed, not implemented.
 ## Context
 KernBench's address model evolved through three design points, each
 addressing a limitation of the previous. This ADR documents all three
 in one place because future implementation work selects among them.
 ### PA-only baseline
 Phase 0 of KernBench treated all device memory operations
 (MemoryRead/MemoryWrite) as raw physical-address transfers. No
 host-side virtual addressing, no MMU/IOMMU translation. Allocators
 returned PA mappings; DMA requests carried PA directly.
 This was sufficient for early correctness/latency work but
 insufficient for running standard Triton kernels that use
 `base_addr + offset` patterns on sharded tensors: each PE's shard
 has a different PA, but the kernel needs a single contiguous address
 space to compute offsets.
 ### Why VA/MMU (current default)
 A realistic system uses host-side virtual addressing and an
 MMU/IOMMU-style translation path for DMA: the host allocates physical
 memory at PE level, maps it into a virtual address space, installs
 mappings, and DMA requests use virtual addresses that are translated
 to physical addresses.
 Adopting this model lets kernels use `base_addr + offset` over a
 contiguous VA range while the device-side MMU translates each access
 to the appropriate PA.
 ### Why LA/BAAW (proposed)
 VA/MMU treats HBM as a single backing space. KernBench needs to
 explore architectures where HBM is composed of multiple pseudo
 channels in parallel:
 - CUBE's HBM has 32 or 64 pseudo channels.
 - In a PE-Local-HBM model, each PE is assigned N pseudo channels
  (N = `hbm_pseudo_channels / pes_per_cube`).
 - Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW
  (N × per-channel).
 Two channel-mapping modes need to be modelable:
 - **1:1 mode** — one logical access → N per-channel requests.
  Precise per-channel BW contention modelling.
 - **n:1 mode (default)** — one logical access → one aggregated
  request. Channels are assumed to interleave; aggregated BW model.
 VA's `tl.load(va_ptr)` produces a single DMA request to a single
 target. Decomposing that into per-channel requests inside PE_DMA
 requires the address layer to be aware of channels. This is the
 role of the LA (Logical Address) abstraction with BAAW
 (Logical-to-Physical Mapping Unit).
 Core requirements driving the LA design:
 - PE_DMA → HBM_CTRL effective bandwidth semantics must be identical
  in both modes (only request shape and resource model differ).
 - Kernel programming model is unchanged — physical channel
  information is never exposed to kernel code.
 - Mode switch is a topology-level configuration.
 ### Design space summary
 | Model | Status | Key idea |
 |-------|--------|----------|
 | PA | fallback (implemented) | Direct physical addressing, no translation |
 | VA | current default (implemented) | Per-tensor contiguous VA range; MMU translates per access |
 | LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes |
 ---
 ## Decision
 This ADR defines three address models. At any given time the system
 operates in exactly one model. Selection is topology- / configuration-
 driven; coexistence within one simulation run is not required.
 ---
 ### Address Model: PA (Physical Address) — fallback
 #### D-PA1. PA-only semantics
 - All device memory accesses (MemoryRead/MemoryWrite) operate on
  device physical addresses (PA) plus size.
 - PA-only mode remains functional via the PageFault fallback path in
  PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats
  the value as a PA directly.
 #### D-PA2. Allocation produces PA mappings
 Device allocation selects PE-local memory regions and returns PA
 mappings sufficient to execute kernels and issue DMA requests.
 PA model is retained primarily for backward compatibility with PA-only
 tests and as the underlying physical layer that VA / LA models resolve
 into.
 ---
 ### Address Model: VA (Virtual Address with MMU) — current default
 #### D-VA1. Virtual Address Model
 - Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
 - `TensorShard` does NOT carry a `va` field — shard VA is derived as
  `va_base + offset_bytes`.
 - Kernels receive `va_base` as their pointer argument (via
  `TensorArg.va_base`).
 - `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).
 #### D-VA2. PE_MMU Component
 - Hybrid design: SimPy component (inbox for `MmuMapMsg`) + utility
  (synchronous `translate()` called by PE_DMA).
 - Page-aligned dict lookup for O(1) VA → PA translation.
 - `tlb_overhead_ns` configurable per-access latency.
 - PageFault fallback: if VA has no mapping, PE_DMA treats it as PA
  directly (preserves PA model for backward compatibility).
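 The page-aligned lookup and the PageFault fallback can be illustrated without SimPy. The sketch below is not the PE_MMU implementation: `PageTableSketch` and `resolve_dma_address` are hypothetical names, and the TLB overhead is left out.

```python
from typing import Optional


class PageTableSketch:
    def __init__(self, page_size: int = 4096) -> None:
        self.page_size = page_size
        self._pages: dict[int, int] = {}  # VA page base -> PA page base

    def map(self, va: int, pa: int, nbytes: int) -> None:
        """Install mappings for every page in [va, va + nbytes)."""
        if va % self.page_size or pa % self.page_size:
            raise ValueError("va and pa must be page-aligned")
        for off in range(0, nbytes, self.page_size):
            self._pages[va + off] = pa + off

    def translate(self, va: int) -> Optional[int]:
        """O(1) dict lookup; returns None on a page fault."""
        page_base = va - (va % self.page_size)
        pa_base = self._pages.get(page_base)
        if pa_base is None:
            return None
        return pa_base + (va - page_base)


def resolve_dma_address(table: PageTableSketch, addr: int) -> int:
    """Translate VA to PA; on a page fault, treat the value as a PA."""
    pa = table.translate(addr)
    return addr if pa is None else pa
```

 Keying the dictionary by page base keeps translation constant-time regardless of how many tensors are mapped.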
 #### D-VA3. Mapping Installation
 - `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube
  fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured
  end-to-end.
 - `MmuMapMsg.target_sips` controls SIP-level routing to prevent
  cross-SIP mapping contamination for replicated tensors.
 - Mapping strategy based on `DPPolicy.cube`:
  - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping
    only. Each cube's PEs see only their local PA. No cross-cube
    mapping installed.
  - **Sharded** (`cube="column_wise"`, etc.): broadcast all shard
    mappings to all target cubes. Enables cross-PE and cross-cube
    DMA.
 #### D-VA4. Tensor Lifecycle
 - `del tensor` triggers automatic cleanup via `Tensor.__del__` +
  `weakref` to `RuntimeContext`. Sends `MmuUnmapMsg` through fabric,
  returns VA and PA space.
 - `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
 - `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
 - `PEMemAllocator` uses free-list with coalescing (not bump allocator).
 - `VirtualAllocator` uses free-list with coalescing for VA space.
 #### D-VA5. Allocators
 - `VirtualAllocator`: device-wide VA space, page-aligned alloc/free
  with coalescing.
 - `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with
  coalescing.
 - Page size configurable via `topology.yaml` `pe_mmu` attrs
  (default 4096).
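 A free-list allocator with coalescing, as described for both allocators, might look like the sketch below. It is a simplified first-fit model, not `VirtualAllocator` or `PEMemAllocator`: the class name, the first-fit choice, and the assumption that `base` is already aligned are all illustrative.

```python
class FreeListAllocator:
    def __init__(self, base: int, size: int, align: int = 4096) -> None:
        self.align = align
        self._free: list[tuple[int, int]] = [(base, size)]  # sorted (start, size)
        self._live: dict[int, int] = {}                     # start -> size

    def alloc(self, nbytes: int) -> int:
        if nbytes <= 0:
            raise ValueError("nbytes must be positive")
        size = -(-nbytes // self.align) * self.align  # round up to alignment
        for i, (start, free_size) in enumerate(self._free):
            if free_size >= size:  # first fit
                if free_size == size:
                    del self._free[i]
                else:
                    self._free[i] = (start + size, free_size - size)
                self._live[start] = size
                return start
        raise MemoryError("out of address space")

    def free(self, start: int) -> None:
        size = self._live.pop(start)
        self._free.append((start, size))
        self._free.sort()
        # Coalesce neighbours whose ranges touch.
        merged: list[tuple[int, int]] = []
        for s, n in self._free:
            if merged and merged[-1][0] + merged[-1][1] == s:
                merged[-1] = (merged[-1][0], merged[-1][1] + n)
            else:
                merged.append((s, n))
        self._free = merged
```

 Unlike a bump allocator, freed ranges become reusable, and coalescing keeps the free list from fragmenting into page-sized pieces over repeated alloc/free cycles.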
 #### Consequences (VA model)
 - Triton kernels use `base_addr + offset` patterns naturally on
  sharded tensors.
 - All latency remains explicit via graph traversal, including MMU
  mapping installation and per-access TLB overhead.
 - PA-only mode retained as fallback (PageFault → treat as PA).
 - IPCQ and other fixed-address resources bypass MMU (use PA directly).
 ---
 ### Address Model: LA (Logical Address with BAAW) — proposed
 LA replaces VA when channel-level HBM modelling is required.
 Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the
 removed artifacts). Coexistence with VA in the same run is not a goal.
 #### D-LA1. LA introduction — replaces VA infrastructure
 LA is the sole address space used by kernel code (`tl.load`,
 `tl.store`, `tl.composite`). Properties:
 - Can map a Tensor to a contiguous logical space (like VA).
 - Expresses `(logical buffer + offset)`.
 - Does NOT contain physical channel information directly.
 - Stays as an intermediate abstraction until physical resolution.
 LA address space:
 | Item | Value |
 |------|-------|
 | LA start | `0x1_0000_0000` (4 GB, preserves former VA start) |
 | LA space size | 64 GB per PE |
 | Alignment unit | segment (see D-LA3) |
 LA is PE-local: different PEs may use the same LA value, and because their
 BAAW segment tables differ, that value resolves to a different PA on each PE.
 VA infrastructure removed when LA is adopted:
 | Removed | Replacement |
 |---------|-------------|
 | `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed) |
 | `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
 | `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
 | `runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg` | `BaawSegmentInstallMsg` |
 | `runtime_api/context.py`: VA alloc + MMU install | LA alloc + BAAW segment install |
 | `runtime_api/tensor.py`: `va_base` | `la_base` |
 | `topology.yaml`: `pe_mmu` component entry | Removed |
 #### D-LA2. Mapping mode setting
 Topology-level (cube) configuration:
 ```yaml
 cube:
  memory_map:
    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
    hbm_pseudo_channels: 64       # total pseudo channel count
    hbm_channels_per_pe: 8        # per-PE local channel count
    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth
 ```
 Consumed by the graph compiler (topology builder) and BAAW
 initialisation.
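 A consumer of this configuration would likely validate it and derive the per-PE figures. The sketch below assumes the `memory_map` block has already been loaded into a dictionary; the function name `derive_hbm_layout` and the returned keys are hypothetical.

```python
def derive_hbm_layout(memory_map: dict) -> dict:
    """Validate the cube memory_map block and derive per-PE channel figures."""
    mode = memory_map["hbm_mapping_mode"]
    if mode not in ("one_to_one", "n_to_one"):
        raise ValueError(f"unknown hbm_mapping_mode: {mode!r}")
    total = memory_map["hbm_pseudo_channels"]
    per_pe = memory_map["hbm_channels_per_pe"]
    if per_pe <= 0 or total % per_pe != 0:
        raise ValueError("hbm_channels_per_pe must evenly divide hbm_pseudo_channels")
    return {
        "mode": mode,
        # N = hbm_pseudo_channels / pes_per_cube, solved for pes_per_cube.
        "pes_per_cube": total // per_pe,
        # Aggregate PE bandwidth = N x per-channel bandwidth.
        "pe_aggregate_bw_gbs": per_pe * memory_map["hbm_channel_bw_gbs"],
    }
```

 With the values shown above (64 channels, 8 per PE, 32.0 GB/s each) this gives 8 PEs per cube and 256.0 GB/s of aggregate bandwidth per PE.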
 #### D-LA3. Segment and BAAW
 Segments partition the LA space; each segment maps to a specific HBM
 channel or channel group. The runtime allocator creates segments at tensor
 deploy time. BAAW resolves LA → physical request(s) using the segment
 table.
 ```python
 from dataclasses import dataclass

 @dataclass
 class BaawSegment:
    la_base: int          # segment start LA
    la_size: int          # segment size (bytes)
    mode: str             # "one_to_one" | "n_to_one"
    # 1:1 mode fields
    channel_count: int    # channels assigned to this segment (e.g. 8)
    pa_bases: list[int]   # per-channel PA bases (len = channel_count)
    channel_ids: list[int]   # per-channel logical IDs (e.g. [0..7])
    channel_size: int     # per-channel size (la_size // channel_count)
    # n:1 mode fields
    agg_pa_base: int      # aggregated PA base
    agg_node_id: str      # aggregated router node_id
 ```
 Segment lifecycle:
 1. **Allocate** (tensor deploy): RuntimeContext allocates LA from LA
   allocator. PEMemAllocator allocates per-channel PA (1:1) or
   aggregated PA (n:1). `BaawSegmentInstallMsg` registers the segment
   with PE_DMA.
 2. **Use** (kernel run): kernel `tl.load(la_ptr)` → `DmaReadCmd
   (src_addr=LA)`. PE_DMA's BAAW front-end looks up the segment and
   converts to PA(s).
 3. **Free** (tensor free): segment removed from table; LA and PA
   returned.
 #### D-LA4. BAAW resolution logic
 BAAW is a front-end stage inside PE_DMA, not a separate SimPy
 component. It is synchronous address-resolution logic that runs at the
 start of PE_DMA's `handle_command()`.
 Input: `(LA, nbytes)`. Output:
 - **1:1 mode**: `list[PhysicalRequest]` — one per channel.
 - **n:1 mode**: single `PhysicalRequest`.
 ```python
 from dataclasses import dataclass

 @dataclass
 class PhysicalRequest:
    pa: int           # 51-bit Physical Address
    nbytes: int       # transfer size for this request
    dst_node: str     # target node_id (channel router or aggregated router)

 # BAAW front-end method (enclosing class omitted).
 def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
    offset = la - seg.la_base
    if seg.mode == "n_to_one":
        # Pass-through: one aggregated request.
        pa = seg.agg_pa_base + offset
        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
    # one_to_one: fan out into one request per channel.
    requests = []
    per_ch_size = seg.channel_size
    for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids):
        ch_offset = offset % per_ch_size
        ch_nbytes = nbytes // seg.channel_count
        pa = pa_base + ch_offset
        dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
        requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
    return requests
 ```
 BAAW responsibilities:
 - Convert logical access → physical request units.
 - Apply mode-dependent fan-out (1:1) or pass-through (n:1).
 - Compute PA and target node.
 BAAW non-responsibilities:
 - Performing actual data movement.
 - Executing NOC routing.
 - Simulating bandwidth occupation (downstream components' job).
 BAAW output is directly usable by the simulator's routing and resource
 model without additional address decoding.
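A self-contained version of the resolution logic can help when reasoning about the two modes. The sketch below follows the `resolve()` logic from D-LA4 but takes the segment and PE prefix as explicit arguments. The `Segment` dataclass layout and the concrete PA values are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalRequest:
    pa: int
    nbytes: int
    dst_node: str


@dataclass
class Segment:
    """Hypothetical segment record; field names mirror the D-LA4 code."""
    la_base: int
    la_size: int
    mode: str                                   # "one_to_one" | "n_to_one"
    pa_bases: list[int] = field(default_factory=list)
    channel_ids: list[int] = field(default_factory=list)
    channel_size: int = 0
    agg_pa_base: int = 0
    agg_node_id: str = ""


def resolve(seg: Segment, la: int, nbytes: int, pe_prefix: str) -> list[PhysicalRequest]:
    if not (seg.la_base <= la < seg.la_base + seg.la_size):
        raise ValueError("LA outside segment")
    offset = la - seg.la_base
    if seg.mode == "n_to_one":
        # Pass-through: one aggregated request.
        return [PhysicalRequest(seg.agg_pa_base + offset, nbytes, seg.agg_node_id)]
    # Fan-out: one request per channel, each carrying an equal share.
    ch_offset = offset % seg.channel_size
    ch_nbytes = nbytes // len(seg.channel_ids)
    return [
        PhysicalRequest(pa_base + ch_offset, ch_nbytes, f"{pe_prefix}.ch_r{ch_id}")
        for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids)
    ]


LA = 0x1_0000_0000
one_to_one = Segment(
    la_base=LA, la_size=4096, mode="one_to_one",
    pa_bases=[0x1000 * i for i in range(8)],    # placeholder PAs, not real addresses
    channel_ids=list(range(8)), channel_size=512,
)
n_to_one = Segment(
    la_base=LA, la_size=4096, mode="n_to_one",
    agg_pa_base=0x8000,                          # placeholder PA
    agg_node_id="sip0.cube0.pe0.agg_router",
)
fanout = resolve(one_to_one, LA, 4096, "sip0.cube0.pe0")
aggregated = resolve(n_to_one, LA, 4096, "sip0.cube0.pe0")
```

The same logical access yields eight 512-byte requests in 1:1 mode and a single 4096-byte request in n:1 mode.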
 #### D-LA5. PE_DMA `handle_command()` change
 Current (VA-based) flow:
 ```
 DmaReadCmd.src_addr (VA)
  → MMU.translate(VA) → PA
  → PhysAddr.decode(PA) → PhysAddr object
  → resolver.resolve(PhysAddr) → dst_node_id
  → router.find_path(pe_prefix, dst_node_id) → path
  → 1 sub-Transaction → fabric inject
 ```
 LA-based flow:
 ```
 DmaReadCmd.src_addr (LA)
  → BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
  → for each PhysicalRequest:
      → router.find_path(pe_prefix, req.dst_node) → path
      → compute_drain_ns(path, req.nbytes) → drain
      → sub-Transaction → fabric inject
  → await all sub-Transactions
  → pe_txn.done.succeed()
 ```
 Key changes:
 - MMU reference removed → BAAW resolve.
 - `PhysAddr.decode()` + `resolver.resolve()` → BAAW returns `dst_node`
  directly.
 - 1 request → N parallel requests in 1:1 mode.
 #### D-LA6. 1:1 mode detail
 - One logical access → N physical requests (N = `channels_per_pe`).
 - N = `hbm_pseudo_channels / pes_per_cube`.
 - Each request: fully-resolved 51-bit PA, targets a specific channel
  router (`{pe_prefix}.ch_r{channel_id}`).
 - Per-channel link models BW contention.
 - PE_DMA injects N sub-transactions concurrently.
 Example: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`.
 PE0 owns ch0-7.
 ```text
 Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
 BAAW segment: {
    la_base: 0x1_0000_0000, la_size: 4096,
    mode: "one_to_one", channel_count: 8,
    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
    channel_size: 512,
 }
 BAAW resolve result (8 requests):
  → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
  → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
  → ...
  → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
 PE_DMA: 8 sub-transactions parallel inject
  per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
  Total effective BW = 8 × channel_bw_gbs
 ```
 Other N values:
 - `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`,
  4 requests
 - `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`,
  16 requests
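The fan-out arithmetic above is small enough to check directly. The helper names below are hypothetical, and the divisibility check is an assumption (the ADR only lists evenly divisible configurations).

```python
def channels_per_pe(hbm_pseudo_channels: int, pes_per_cube: int) -> int:
    """N = hbm_pseudo_channels / pes_per_cube, as defined in D-LA6."""
    if hbm_pseudo_channels % pes_per_cube != 0:
        # Assumption: uneven splits are treated as a configuration error.
        raise ValueError("pseudo channels must divide evenly across PEs")
    return hbm_pseudo_channels // pes_per_cube


def total_effective_bw_gbs(n_channels: int, channel_bw_gbs: float) -> float:
    """Total effective BW = N x channel_bw_gbs."""
    return n_channels * channel_bw_gbs


n_default = channels_per_pe(64, 8)
n_small = channels_per_pe(32, 8)
n_wide = channels_per_pe(64, 4)
bw = total_effective_bw_gbs(n_default, 32)
```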
 #### D-LA7. n:1 mode detail
 - One logical access → one aggregated request.
 - Target: aggregated router → hbm_ctrl (see ADR-0017 D8).
 - Aggregated link BW = `channels_per_pe × channel_bw_gbs`
  (e.g. 8 × 32 = 256 GB/s).
 - Single queue / resource for modelling.
 - No per-channel PA decomposition.
 ```text
 Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
 BAAW segment: {
    la_base: 0x1_0000_0000, la_size: 4096,
    mode: "n_to_one",
    agg_pa_base: PA_agg,
    agg_node_id: "sip0.cube0.pe0.agg_router",
 }
 BAAW resolve result:
  → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
 PE_DMA: 1 sub-transaction
  aggregated router → hbm_ctrl link (256 GB/s)
 ```
 #### D-LA8. Kernel model preserved
 - Kernel still issues single memory ops (`tl.load`, `tl.store`,
  `tl.composite`).
 - LA is the address scheme exposed to kernel code.
 - Channel decomposition / aggregation happens inside PE_DMA's BAAW.
 - Kernel code never sees physical channel information.
 #### Consequences (LA model, proposed)
 Positive:
 - 1:1 vs n:1 semantics live in one place (BAAW).
 - Kernel abstraction preserved — no kernel code changes.
 - Topology-based policy control (mode switch via yaml).
 - Improved simulation-model consistency and debuggability.
 - Segment-based mapping is simpler than page tables; lower overhead.
 Negative:
 - Full VA/MMU code refactor required.
 - Request-generation path more complex (N requests in 1:1 mode).
 - Reduced per-channel visibility in n:1 mode.
 - VA-related tests need rewriting.
 ---
 ## Migration Path
 - **PA → VA** was an extension. PA mode is retained as the PageFault
  fallback inside PE_DMA. Switching does not require removing PA
  code.
 - **VA → LA**, if adopted, is a replacement, not coexistence. See
  D-LA1 for the VA infrastructure removal list. PA fallback inside
  PE_DMA may be retained orthogonally for tests.
 ## Alternatives Considered (LA model)
 1. **Keep VA + fan-out in MMU**: MMU returns per-channel PAs.
   Rejected: MMU's role would grow beyond translation to request
   decomposition; aggregation (n:1) becomes awkward to express.
 2. **Channel-aware kernel API**: kernels call per-channel load/store
   directly. Rejected: abstraction leakage, portability loss, all
   benchmarks need rewriting.
 3. **Always PA (no LA)**: runtime passes per-channel PA to kernel
   directly. Rejected: incompatible with aggregation; conversion
   timing unclear; channel info leaks to kernel.
 ## Test Requirements
 ### VA model (current, regression)
 - Cross-PE / cross-cube DMA paths over installed mappings.
 - `MmuMapMsg` / `MmuUnmapMsg` fabric traversal with measured latency.
 - TLB-overhead-per-access timing.
 - PageFault fallback path preserves PA-only behaviour.
 ### LA model (when implemented)
 - 1:1 mode: same logical access → N per-channel requests.
 - n:1 mode: same logical access → 1 aggregated request.
 - Bandwidth equivalence between modes for identical workload.
 - 1:1 mode: per-channel contention modelled correctly.
 - n:1 mode: aggregated bandwidth correctly reflected.
 - Kernel code unchanged across mode switch.
 - BAAW segment install / uninstall correctness.
 - Multiple tensors in distinct segments do not collide.
 ## Implementation Order (LA, when scheduled)
 1. LA type (`policy/address/la_allocator.py`).
 2. BAAW segment table (`policy/address/baaw.py`).
 3. `BaawSegmentInstallMsg` (`runtime_api/kernel.py`).
 4. PE_DMA BAAW integration (`components/builtin/pe_dma.py`
   `handle_command()`).
 5. RuntimeContext: LA alloc + segment install
   (`runtime_api/context.py`).
 6. `Tensor.va_base` → `Tensor.la_base` (`runtime_api/tensor.py`).
 7. Remove VA/MMU code.
 8. Remove `pe_mmu` from `topology.yaml`; add mapping mode settings.
 9. Test migration:
 | Test file | Action |
 |-----------|--------|
 | `tests/test_mmu_component.py` | Remove → BAAW segment install tests |
 | `tests/test_mmu_fabric.py` | Remove → BAAW + fabric integration tests |
 | `tests/test_pe_mmu.py` | Remove |
 | `tests/test_va_allocator.py` | Replace with LA allocator tests |
 | `tests/test_va_integration.py` | Replace with LA + BAAW integration tests |
 | `tests/test_va_offset.py` | Replace with LA offset tests |
 ## Links
 - ADR-0007 (runtime_api vs sim_engine boundaries)
 - ADR-0008 (tensor deployment)
 - ADR-0009 (kernel execution)
 - ADR-0014 (PE-internal execution model)
 - ADR-0015 (component port/wire model)
 - ADR-0017 (Cube NOC and HBM connectivity — LA model topology consumer)
 - ADR-0013 (Verification strategy — V1 PA tagging)
 - SPEC R2 (latency by traversal), R10 (memory addressing)
@@ -0,0 +1,233 @@
 # ADR-0012: Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
 ## Status
 Accepted
 ## Context
 Phase 0 uses a PA-first memory model (ADR-0011):
 - memory operations use device physical addresses (PA) only,
 - VA/MMU/IOMMU is not modeled.
 The host-facing runtime API interacts with the device via the IO_CPU endpoint.
 We define stable, minimal message schemas for Host ↔ IO_CPU so that:
 - benchmarks remain stable,
 - IO_CPU-internal fan-out/aggregation can evolve independently,
 - completion and failure propagation is deterministic.
 We also require PE-tagging (Approach A): each shard explicitly carries (sip,cube,pe)
 so IO_CPU can deterministically route/fan-out without relying on PA decoding.
 ---
 ## Decision
 ### D1. Contract scope
 This schema is the stable contract ONLY for Host ↔ IO_CPU.
 Messages beyond IO_CPU (to M_CPU, PE_CPU, schedulers, engines) are component-internal
 and are NOT part of this host contract in Phase 0.
 ---
 ### D2. Required message set
 The runtime API MUST use only these message types for Host ↔ IO_CPU:
 - MemoryWrite
 - MemoryRead
 - KernelLaunch
 All operations required by benchmarks (tensor init/copy, kernel run) MUST be expressible
 with these messages.
 ---
 ### D3. Common envelope (mandatory for all requests)
 All Host ↔ IO_CPU requests MUST include:
 - `msg_type: str`
 - `correlation_id: str`
  - generated by the host
  - used to match responses deterministically
 - `request_id: str`
  - unique within a correlation_id
 - `target_device: str`
  - device identifier (e.g., "sip:0")
 - `timestamp_tag: str | None` (optional)
  - debug tag only; MUST NOT affect determinism
 All Host ↔ IO_CPU responses MUST include:
 - `correlation_id: str`
 - `request_id: str`
 - `completion: Completion`
 ---
 ### D4. Completion schema (mandatory)
 `Completion` MUST have:
 - `ok: bool`
 - `error_code: str | None`
 - `error_message: str | None`
 Rules:
 - If `ok == true` then `error_code` and `error_message` MUST be null.
 - If `ok == false` then `error_code` MUST be non-null.
 - Completion semantics MUST be deterministic.
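The two rules on `ok` and `error_code` can be enforced at construction time. The following is a minimal sketch, assuming a frozen dataclass representation; the error code strings and the `is_valid` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Completion:
    ok: bool
    error_code: Optional[str] = None
    error_message: Optional[str] = None

    def __post_init__(self) -> None:
        # ok == true: both error fields must be null.
        if self.ok and (self.error_code is not None or self.error_message is not None):
            raise ValueError("successful completion must not carry error fields")
        # ok == false: error_code must be non-null.
        if not self.ok and self.error_code is None:
            raise ValueError("failed completion requires error_code")


def is_valid(**fields) -> bool:
    """Return True if the fields form a contract-conforming Completion."""
    try:
        Completion(**fields)
        return True
    except ValueError:
        return False
```

Note that `error_message` stays optional on failure; only `error_code` is mandatory by D4.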
 ---
 ### D5. MemoryWrite schema (PA-first, PE-tagged)
 `MemoryWrite` represents a host-initiated write/initialize operation to device memory.
 Mandatory fields:
 - common envelope fields (D3)
 - destination placement tags (Approach A):
  - `dst_sip: int`
  - `dst_cube: int`
  - `dst_pe: int`
 - `dst_pa: int`
  - destination physical address in the destination PE's address space
 - `nbytes: int`
 - `src_kind: "pattern" | "host_buffer_ref"`
  - Phase 0 MUST support "pattern"
 - `pattern: Pattern | None`
  - required if `src_kind == "pattern"`
 `Pattern` (Phase 0 mandatory support):
 - `pattern_kind: "zero" | "fill_u8" | "fill_u16" | "fill_u32" | "fill_fp16" | "fill_fp32"`
 - `value: number | None`
  - required for fill_*; ignored for zero
 Optional fields:
 - `dst_mem_kind: "HBM" | "TCM" | "AUTO"` (default "AUTO")
 - `debug_label: str | None`
 Notes:
 - This message MUST NOT embed bulk tensor data in Phase 0.
 - All latency MUST come from explicit graph traversal and modeled components.
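A schema check for `MemoryWrite` might look like the sketch below. It treats the message as a plain `dict`, which is an assumption; the function name and error strings are hypothetical. Field names come from D3 and D5.

```python
ENVELOPE_FIELDS = ("msg_type", "correlation_id", "request_id", "target_device")
WRITE_FIELDS = ("dst_sip", "dst_cube", "dst_pe", "dst_pa", "nbytes")
PATTERN_KINDS = {"zero", "fill_u8", "fill_u16", "fill_u32", "fill_fp16", "fill_fp32"}


def validate_memory_write(msg: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the message is valid."""
    errors = [f"missing field: {name}"
              for name in ENVELOPE_FIELDS + WRITE_FIELDS if name not in msg]
    src_kind = msg.get("src_kind")
    if src_kind not in ("pattern", "host_buffer_ref"):
        errors.append("src_kind must be 'pattern' or 'host_buffer_ref'")
    elif src_kind == "pattern":
        pattern = msg.get("pattern")
        if pattern is None:
            errors.append("pattern is required when src_kind == 'pattern'")
        else:
            kind = pattern.get("pattern_kind")
            if kind not in PATTERN_KINDS:
                errors.append(f"unknown pattern_kind: {kind}")
            elif kind != "zero" and pattern.get("value") is None:
                errors.append("value is required for fill_* patterns")
    return errors


good = {
    "msg_type": "MemoryWrite", "correlation_id": "c1", "request_id": "r1",
    "target_device": "sip:0",
    "dst_sip": 0, "dst_cube": 0, "dst_pe": 1, "dst_pa": 0x1000, "nbytes": 4096,
    "src_kind": "pattern", "pattern": {"pattern_kind": "zero"},
}
untagged = {k: v for k, v in good.items() if k != "dst_pe"}
no_value = dict(good, pattern={"pattern_kind": "fill_u8"})
```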
 ---
 ### D6. MemoryRead schema (PA-first, PE-tagged)
 `MemoryRead` represents a host-initiated read from device memory.
 Mandatory fields:
 - common envelope fields (D3)
 - source placement tags (Approach A):
  - `src_sip: int`
  - `src_cube: int`
  - `src_pe: int`
 - `src_pa: int`
 - `nbytes: int`
 Optional fields:
 - `dst_kind: "host_sink" | "discard"` (default "host_sink")
 - `debug_label: str | None`
 Response payload:
 - actual bytes are NOT required in Phase 0 (latency/traces focus)
 - implementations MAY return lightweight stats or hashes later via a new ADR
 ---
 ### D7. KernelLaunch schema (PA-first, PE-tagged shards)
 `KernelLaunch` represents launching a kernel on a target device via IO_CPU.
 Mandatory fields:
 - common envelope fields (D3)
 - `kernel_ref: KernelRef`
 - `args: list[KernelArg]`
 `KernelRef` MUST have:
 - `name: str`
 - `kind: "deployed" | "builtin"`
 - `deploy_pa: int | None` — PA where kernel binary was deployed (required for "deployed")
 - `deploy_sip: int` — SIP where binary resides
 - `deploy_cube: int` — cube where binary resides
 - `deploy_pe: int` — PE where binary resides
 - `nbytes_code: int` — kernel binary size (for BW modeling)
 Kernel binaries MUST be pre-deployed to device memory via MemoryWrite.
 KernelLaunch MUST NOT embed kernel source code or IR in the launch message.
 `KernelArg` supports tensor args by PA mapping and scalars by value.
 Tensor arg (mandatory):
 - `arg_kind: "tensor"`
 - `tensor_pa_map: TensorPAMap`
 `TensorPAMap` MUST have:
 - `shards: list[TensorShard]`
 `TensorShard` MUST have (Approach A, enforced):
 - `sip: int`
 - `cube: int`
 - `pe: int`
 - `pa: int`
 - `nbytes: int`
 - `offset_bytes: int`
 Scalar arg (mandatory):
 - `arg_kind: "scalar"`
 - `dtype: "i32" | "i64" | "fp16" | "fp32" | "bool"`
 - `value: number | bool`
 Optional KernelLaunch fields:
 - `grid: dict | None`
 - `meta: dict | None`
 - `failure_policy: "fail_fast" | "collect_all"` (default "fail_fast")
 - `debug_label: str | None`
 Notes:
 - KernelLaunch MUST NOT embed bulk tensor data.
 - KernelLaunch MUST be submitted only to the IO_CPU endpoint.
 - IO_CPU MUST fan-out work internally using the shard (sip,cube,pe) tags.
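Because every shard carries its own `(sip, cube, pe)` tags, fan-out can be derived without decoding any PA. The sketch below groups shards by target PE in a deterministic order. The grouping function is hypothetical and says nothing about how IO_CPU actually dispatches work.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorShard:
    sip: int
    cube: int
    pe: int
    pa: int
    nbytes: int
    offset_bytes: int


def group_shards_by_pe(shards: list[TensorShard]) -> dict[tuple[int, int, int], list[TensorShard]]:
    """Group shards by (sip, cube, pe); keys are sorted for deterministic fan-out."""
    groups: dict[tuple[int, int, int], list[TensorShard]] = defaultdict(list)
    for shard in shards:
        groups[(shard.sip, shard.cube, shard.pe)].append(shard)
    return {key: groups[key] for key in sorted(groups)}


shards = [
    TensorShard(sip=0, cube=0, pe=1, pa=0x2000, nbytes=2048, offset_bytes=2048),
    TensorShard(sip=0, cube=0, pe=0, pa=0x1000, nbytes=2048, offset_bytes=0),
]
fanout = group_shards_by_pe(shards)
```

Sorting the keys keeps the fan-out order independent of the order in which the host listed the shards.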
 ---
 ## Verification Notes
 Tests SHOULD validate:
 - schema validation rejects missing mandatory fields,
 - deterministic correlation/response matching,
 - MemoryWrite/Read/KernelLaunch produce explicit hop traces,
 - all routed requests incur latency > 0.
 ---
 ## Links
 - ADR-0011 (Memory Addressing — PA / VA / LA)
 - ADR-0007 (runtime_api vs sim_engine boundaries)
 - ADR-0009 (kernel execution fan-out/aggregation)
 - ADR-0013 (Verification strategy — V1 message schema validation)
 - SPEC R2, R7, R8
@@ -0,0 +1,139 @@
 # ADR-0013: Verification Strategy and Phase 1 Test Plan
 ## Status
 Accepted
 ## Context
 KernBench is a system-level simulator whose correctness is defined by:
 - adherence to SPEC-defined invariants,
 - determinism and debuggability,
 - explicit modeling of routing and latency.
 Given the evolving implementation, we need a stable verification strategy
 that prevents architectural drift while allowing incremental development.
 This ADR defines the Phase 1 verification plan and what constitutes
 "correct behavior" for early implementations.
 ---
 ## Decision
 ### D1. Verification is contract-based
 Verification MUST be derived from:
 - SPEC requirements,
 - accepted ADRs.
 Tests MUST validate architectural contracts, not incidental implementation details.
 ---
 ### D2. Phase 1 verification scope
 Phase 1 verification focuses on:
 - message contract validity (ADR-0012),
 - routing and fan-out semantics at the IO_CPU boundary (ADR-0009),
 - PA-first memory addressing and shard tagging (ADR-0011),
 - core latency and trace invariants (SPEC 0.1, R2).
 Microarchitectural accuracy, bandwidth contention, and cycle-level behavior
 are explicitly out of scope in Phase 1.
 ---
 ### D3. Required Phase 1 verification cases
 The following verification cases MUST be supported by the implementation:
 #### V1. Message schema validation
 - KernelLaunch requests missing `(sip, cube, pe)` in any tensor shard MUST be rejected.
 - MemoryWrite/MemoryRead requests missing destination/source placement tags MUST be rejected.
 - Completion results MUST follow the `ok / error_code / error_message` contract.
 #### V2. IO_CPU fan-out and aggregation
 Given:
 - a topology with one SIP, one CUBE, and two PEs,
 - a KernelLaunch request containing two tensor shards targeting different PEs,
 The system MUST:
 - submit a single KernelLaunch to IO_CPU,
 - fan-out work internally to both PEs,
 - aggregate completion and return a single deterministic completion to the host.
 #### V3. Latency and trace invariants
 For any valid request:
 - the hop-by-hop trace MUST be non-empty,
 - total latency MUST be greater than zero,
 - repeated runs with identical inputs MUST produce identical traces.
 #### V4. Topology independence and cross-domain coverage
 Verification cases MUST pass for multiple topology shapes, including:
 - minimal: (1 SIP, 1 CUBE, 1 PE)
 - multi-PE: (1 SIP, 1 CUBE, N PEs)
 - multi-CUBE within a SIP: (1 SIP, M CUBEs, ≥1 PE per CUBE)
 - multi-SIP tray: (K SIPs, ≥1 CUBE per SIP, ≥1 PE per CUBE)
 For multi-CUBE and multi-SIP topologies, Phase 1 verification focuses on:
 - explicit connectivity (required links exist),
 - deterministic routing and control-path traversal,
 - non-empty traces and latency > 0 for representative cross-domain requests
  (inter-CUBE and inter-SIP paths).
 Tests MUST NOT hardcode topology sizes, node ids, or link counts.
 Instead, tests MUST derive expectations from the compiled topology metadata.
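One way to avoid hardcoded sizes is to compute expectations from the metadata itself. The sketch below is only an illustration: the metadata layout (`sips` / `cubes` / `num_pes`) and both helper functions are hypothetical stand-ins for whatever the compiled topology actually exposes. The node id format follows the `sip0.cube0.pe0` style used elsewhere in these ADRs.

```python
def make_meta(n_sips: int, n_cubes: int, n_pes: int) -> dict:
    """Hypothetical fixture that builds a uniform topology description."""
    return {
        "sips": [
            {"cubes": [{"num_pes": n_pes} for _ in range(n_cubes)]}
            for _ in range(n_sips)
        ]
    }


def all_pe_ids(meta: dict) -> list[str]:
    """Derive PE node ids from metadata instead of hardcoding them in tests."""
    ids = []
    for s, sip in enumerate(meta["sips"]):
        for c, cube in enumerate(sip["cubes"]):
            for p in range(cube["num_pes"]):
                ids.append(f"sip{s}.cube{c}.pe{p}")
    return ids


# The four shapes listed in V4: minimal, multi-PE, multi-CUBE, multi-SIP.
SHAPES = [(1, 1, 1), (1, 1, 4), (1, 3, 2), (2, 2, 2)]
counts = {shape: len(all_pe_ids(make_meta(*shape))) for shape in SHAPES}
```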
 ---
 ### D4. Phase 1 artifacts
 Phase 1 MAY include:
 - verification-only test code,
 - topology fixtures,
 - trace inspection utilities.
 Phase 1 MUST NOT require:
 - production code changes solely to satisfy tests,
 - weakening or removing tests to allow progress.
 ---
 ### D5. Phase 2 enforcement
 Phase 2 (Apply) MUST:
 - run the Phase 1 verification cases,
 - roll back all changes if any verification fails,
 - preserve tests as authoritative contracts.
 ---
 ## Consequences
 - Architectural correctness is enforced early.
 - Tests serve as executable documentation of system behavior.
 - Implementation remains flexible without losing rigor.
 ---
 ## Links
 - SPEC 0.1, R2, R6
 - ADR-0011 (Memory Addressing — PA / VA / LA)
 - ADR-0012 (Host ↔ IO_CPU message schema)
 - ADR-0009 (Kernel execution semantics)
@@ -0,0 +1,451 @@
 # ADR-0014: PE Pipeline Execution Model
 ## Status
 Accepted
 ## Context
 This ADR defines the PE-internal kernel execution model:
 - Role decomposition of PE-internal components
 - Command dispatch paths (simple / composite / multi-op composite with epilogue)
 - TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
 - TCM-centric dataflow with a register-file intermediary
 - Engine resource model
 - Observability and trace contract
 - Topology representation
 PE-internal structure (7 components in scope; 2 cross-referenced):
 - `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`,
  `pe_tcm` — defined here
 - `pe_mmu` — VA model, defined in ADR-0011 D-VA
 - `pe_ipcq` — collective communication, defined in ADR-0023
 The goal is a deterministic, trace-friendly execution contract that keeps
 each block independently swappable.
 ## Decision
 ### D1. PE-internal component roles
 **PE_CPU**
 - Executes kernel instruction stream / control logic.
 - Generates PE commands and submits them to `PE_SCHEDULER` (via
  `PeInternalTxn`).
 - Does NOT enqueue work directly into engine queues.
 **PE_SCHEDULER**
 - Sole dispatcher inside a PE.
 - Receives commands from `PE_CPU`. Dispatch by command type:
  - Simple command (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`)
    → forward directly to the target engine.
  - `CompositeCmd` → generate a `TilePlan`, feed tiles into the pipeline
    via a single `_feed_loop` (D6).
 - Does not participate in stage-to-stage chaining within a composite;
  that is handled by token self-routing (D6).
 **PE_DMA**
 - Handles memory transfers between TCM and external memory domains
  (HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
 - Two execution channels:
  - `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1) — see D4.
 - Additional virtual channels:
  - `vc_compute` — load/store/writeback traffic for GEMM/MATH tiles.
  - `vc_comm` — IPCQ collective send data (defined in ADR-0023 D8).
 **PE_FETCH_STORE**
 - TCM ↔ Register File transfer unit.
 - Isolates register-file access semantics from compute engines so that
  GEMM/MATH stay pure compute components.
 - BW-based latency model; TCM access contention naturally serializes
  through `PE_TCM`'s BW resource.
 **PE_GEMM**
 - MAC array. Reads operands from the register file; writes results to
  the register file. Does not touch `PE_TCM` directly.
 **PE_MATH**
 - Element-wise / reduction / SIMD unit. Reads / writes the register file.
 **PE_TCM**
 - Tightly-coupled scratchpad with BW-serialized access. Two logical
  regions partitioned by ownership (see D5).
 **Cross-referenced components** (defined elsewhere):
 - `pe_mmu` — VA→PA translation per access (ADR-0011 D-VA).
 - `pe_ipcq` — collective ring buffers and peer endpoint metadata
  (ADR-0023).
 ### D2. Command lifecycle and queues
 `PE_SCHEDULER` maintains three logical structures:
 **SubmissionQueue** — written by `PE_CPU`; consumed by the scheduler.
 **InflightTable** — owned and mutated only by `PE_SCHEDULER`; tracks
 expanded sub-commands, dependency state, engine assignment, and
 completion status.
 **CompletionQueue** — written by `PE_SCHEDULER`; holds final completion
 records.
 **Single-writer rule**: only `PE_SCHEDULER` mutates command completion
 state. Engines report completion via explicit events / messages
 consumed by the scheduler.
 **Command completion**: when all sub-commands complete, `PE_SCHEDULER`
 publishes a completion record.
 ### D3. Dispatch modes
 #### D3.1 Simple command
 A simple command expands to exactly one engine sub-command:
 - `DmaReadCmd` / `DmaWriteCmd` → `PE_DMA`
 - `GemmCmd` → `PE_GEMM`
 - `MathCmd` → `PE_MATH`
 Flow:
 ```text
 PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
       → completion → PE_SCHEDULER → CompletionQueue
 ```
 #### D3.2 Composite command (single-op tiled pipeline)
 The default `CompositeCmd` runs a single compute op as a tile-pipelined
 sequence:
 ```text
 DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE
 ```
 `PE_SCHEDULER` splits the DMA payload into hardware tiles and emits one
 `TileToken` per tile with a monotonically increasing `tile_id`.
 Tile dependency (within one tile `t`):
 ```text
 DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)
 ```
 Inter-tile overlap is allowed wherever engine resources permit
 (D4 governs the constraints):
 ```text
 DMA_READ(t+1) ∥ COMPUTE(t)
 DMA_WRITE(t-1) ∥ COMPUTE(t)
 ```
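The tile split itself can be illustrated with a few lines. This is a sketch under assumptions: the ADR does not state the hardware tile size or how a trailing partial tile is handled, so `tile_bytes` and the short final tile are illustrative choices.

```python
def split_into_tiles(nbytes: int, tile_bytes: int) -> list[tuple[int, int, int]]:
    """Return (tile_id, offset, size) triples with monotonically increasing tile_id."""
    if nbytes < 0 or tile_bytes <= 0:
        raise ValueError("nbytes must be >= 0 and tile_bytes must be > 0")
    tiles = []
    offset = 0
    tile_id = 0
    while offset < nbytes:
        size = min(tile_bytes, nbytes - offset)   # last tile may be short
        tiles.append((tile_id, offset, size))
        offset += size
        tile_id += 1
    return tiles


even = split_into_tiles(4096, 1024)
ragged = split_into_tiles(2500, 1024)
```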
 #### D3.3 Multi-op composite (head + epilogue with scope)
 A `CompositeCmd` MAY carry `ops: tuple[OpSpec, ...]` to express a
 multi-op pipeline:
 ```python
 @dataclass(frozen=True)
 class OpSpec:
    kind: str         # "gemm" | "math.exp" | "math.bias_add" | ...
    scope: Scope      # "per_k_tile" | "per_output_tile" | "once"
    ...
 ```
 - `ops[0]` (head) defines tile geometry (e.g., the head GEMM determines
  M/K/N partition).
 - `ops[1:]` (epilogue) are subsequent stages whose `scope` decides how
  often they fire:
  - `per_k_tile` — every K-reduction step.
  - `per_output_tile` — once per output tile.
  - `once` — once per kernel.
 Cross-engine chains (e.g., GEMM head → MATH epilogue) are natural —
 each stage is dispatched via token self-routing (D6), so GEMM and MATH
 participate serially within the same composite even though they share
 the compute slot (D4).
 The empty-`ops` form is the legacy single-op path.
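To make the three scopes concrete, the sketch below counts how many times an epilogue op would fire for a given tile partition. It assumes that `per_k_tile` fires once per K step of every output tile and that an output tile corresponds to one (M, N) tile pair; the ADR does not spell out these details, and the function name is hypothetical.

```python
def epilogue_fire_count(scope: str, m_tiles: int, n_tiles: int, k_tiles: int) -> int:
    """Number of times an epilogue op fires under the assumed scope semantics."""
    output_tiles = m_tiles * n_tiles
    if scope == "per_k_tile":
        return output_tiles * k_tiles     # every K-reduction step
    if scope == "per_output_tile":
        return output_tiles               # once per output tile
    if scope == "once":
        return 1                          # once per kernel
    raise ValueError(f"unknown scope: {scope}")


per_k = epilogue_fire_count("per_k_tile", 2, 3, 4)
per_out = epilogue_fire_count("per_output_tile", 2, 3, 4)
once = epilogue_fire_count("once", 2, 3, 4)
```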
 ### D4. Engine resource model
 **DMA engine**:
 - `DMA_READ`: `simpy.Resource(capacity=1)`.
 - `DMA_WRITE`: `simpy.Resource(capacity=1)`.
 - Both channels run concurrently (READ ∥ WRITE allowed).
 - Within a channel, requests serialize (READ ∥ READ disallowed; same
  for WRITE).
 - `vc_comm` is an orthogonal channel for IPCQ traffic defined in
  ADR-0023 D8 — out of scope for this ADR.
 **Compute engine**:
 - `accel_slot`: `simpy.Resource(capacity=1)` shared by `PE_GEMM` and
  `PE_MATH`.
 - At most one compute op runs at a time within a PE.
 - Multi-op composite chains (D3.3) execute their compute stages serially
  through this slot; token self-routing (D6) ensures the next stage
  starts only after the previous compute releases the slot.
 **Engine completion**: each engine emits a completion event consumed by
 the scheduler / `PipelineContext` (D6).
 ### D5. Dataflow
 **Input path (HBM source)**:
 ```text
 HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
 PE_TCM → PE_FETCH_STORE → Register File
 Register File → PE_GEMM | PE_MATH
 ```
 **Input path (shared SRAM source)**:
 ```text
 Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
 PE_TCM → PE_FETCH_STORE → Register File
 ```
 **Output path (HBM destination)**:
 ```text
 Register File → PE_FETCH_STORE → PE_TCM
 PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM
 ```
 GEMM/MATH never touch `PE_TCM` directly — `PE_FETCH_STORE` is the
 single TCM↔register-file gateway. This makes TCM BW contention
 explicit and lets fetch unit policies (e.g., prefetch) be replaced
 independently of compute engines.
 #### D5.1 PE_TCM partitioning
 `PE_TCM` is split into two logical regions:
 **SchedulerReservedTCM**
 - Owned exclusively by `PE_SCHEDULER`.
 - Holds composite-command tile buffers.
 - `PE_SCHEDULER` partitions this region, assigns buffers per DMA_READ /
  COMPUTE / DMA_WRITE stage, guarantees input/output separation, and
  manages tile-buffer lifetimes.
 **AllocatableTCM**
 - General-purpose region managed by `PEMemAllocator`.
 - Used for host / DP-visible allocations.
 **Visibility rule (hard isolation)**: `PEMemAllocator` MUST NOT see or
 allocate inside `SchedulerReservedTCM`. The reserved region is excluded
 from allocator-managed ranges by construction.
 **Tile buffer rules**:
 - Input and output buffers within `SchedulerReservedTCM` MUST NOT
  overlap during a tile's active lifetime.
 - A tile buffer remains valid until the corresponding `DMA_WRITE`
  completes.
 - Buffer reuse is permitted only after the consuming tile's lifetime
  ends.
 ### D6. TileToken self-routing pipeline
 A composite's stage-to-stage progression happens **without** routing
 through the scheduler. Each component forwards the token directly to
 the next stage's component using the token's `plan`:
 ```text
 Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
              ↑ chaining: no scheduler hop                          ↑
                                                  PipelineContext.complete_tile()
 ```
 This mirrors real-HW done-wire chains. The scheduler handles only
 **initial dispatch + completion aggregation**.
 #### TilePlan / Stage
 ```python
 class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5
 @dataclass(frozen=True)
 class Stage:
    stage_type: StageType
    component: str         # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
    params: dict           # stage-specific parameters
 @dataclass(frozen=True)
 class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]
 ```
 #### TileToken
 ```python
 @dataclass
 class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext
    plan: TilePlan
    stage_idx: int
    params: dict             # cached current stage params
    data_op: bool = True     # op_log opt-in (ADR-0020 D4)
 ```
 Single-owner invariant: a token is owned by exactly one component at a
 time. Lifecycle: scheduler creates with `stage_idx=0` → component
 `_process()` → increment `stage_idx` → put to next stage's `in_port` →
 last stage calls `pipeline_ctx.complete_tile()`.
 #### PipelineContext (exactly-once completion)
 ```python
 @dataclass
 class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: simpy.Event = None
    def complete_tile(self) -> None:
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()
 ```
 Each tile's last stage MUST call `complete_tile()` exactly once.
 Duplicate calls are bugs (SimPy `Event` can succeed at most once).
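A duplicate call before the final tile would not raise on its own; it would instead fire `done_event` early. One possible guard is to track completed tile ids, as in the sketch below. This is a hypothetical variant: `complete_tile` takes a `tile_id` argument here (unlike the ADR's signature), and `OneShotEvent` is a stand-in used so the example runs without SimPy.

```python
from dataclasses import dataclass, field


class OneShotEvent:
    """Stand-in for an event that may be triggered at most once."""

    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        if self.triggered:
            raise RuntimeError("event already triggered")
        self.triggered = True


@dataclass
class GuardedPipelineContext:
    id: str
    total_tiles: int
    done_event: OneShotEvent = field(default_factory=OneShotEvent)
    completed: set[int] = field(default_factory=set)

    def complete_tile(self, tile_id: int) -> None:
        # Reject duplicates explicitly instead of letting the count drift.
        if tile_id in self.completed:
            raise RuntimeError(f"tile {tile_id} completed twice")
        self.completed.add(tile_id)
        if len(self.completed) == self.total_tiles:
            self.done_event.succeed()


ctx = GuardedPipelineContext(id="cmd0", total_tiles=2)
ctx.complete_tile(0)
done_after_first = ctx.done_event.triggered
ctx.complete_tile(1)
done_after_second = ctx.done_event.triggered
try:
    ctx.complete_tile(1)
    duplicate_rejected = False
except RuntimeError:
    duplicate_rejected = True
```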
 #### Feed ordering
 `PE_SCHEDULER` has exactly one `_feed_loop` process consuming a
 `_pending_feeds` FIFO. Composite commands are enqueued in submission
 order; tile feed for a command runs to completion before the next
 command's feed begins. **Tile-feed interleaving between commands is
 disallowed.**
 Within a single command's tiles, downstream pipeline overlap arises
 naturally — earlier tiles progress through later stages while the feeder
 keeps pushing remaining tiles into the first stage queue (SimPy Store
 backpressure governs flow control). If the first-stage queue is full,
 only the feeder blocks; the scheduler worker's inbox processing
 continues.
 #### Token routing pattern (base class)
 ```python
 def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()
        yield from self._process(env, token)       # stage-specific logic
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            token.pipeline_ctx.complete_tile()
 ```
 Each component implements only `_process()`; chaining lives in the
 base class.
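The routing rule can be traced without a simulator. The sketch below walks a token through its plan synchronously and records which component handles each stage. It omits SimPy queues and timing entirely, and the `Token` and `walk` names are hypothetical simplifications of `TileToken` and `_pipeline_worker`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Stage:
    component: str
    params: dict


@dataclass
class Token:
    tile_id: int
    stages: tuple
    stage_idx: int = 0
    params: dict = field(default_factory=dict)


def walk(token: Token) -> list[str]:
    """Visit each stage in plan order and return the components visited."""
    visited = []
    while True:
        visited.append(token.stages[token.stage_idx].component)  # stand-in for _process()
        next_idx = token.stage_idx + 1
        if next_idx >= len(token.stages):
            return visited            # point where complete_tile() would be called
        token.stage_idx = next_idx
        token.params = token.stages[next_idx].params   # cache next stage params


plan = tuple(
    Stage(component=name, params={"step": i})
    for i, name in enumerate(
        ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_math", "pe_fetch_store", "pe_dma"]
    )
)
token = Token(tile_id=0, stages=plan)
path = walk(token)
```

The scheduler never appears in `path`, which matches the claim that chaining needs no scheduler hop.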
 ### D7. Observability and trace contract
 The simulator emits deterministic trace events:
 - `command_submitted`
 - `sub_command_dispatched`
 - `engine_start`
 - `engine_complete`
 - `tile_ready`
 - `command_complete`
 For identical inputs, trace ordering MUST be deterministic.
 ### D8. Topology representation
 PE-internal components are declared in `cube.pe_template`:
 ```yaml
 pe_template:
  components:
    pe_cpu:         { kind: pe_cpu,         impl: builtin.pe_cpu,         attrs: { overhead_ns: ... } }
    pe_scheduler:   { kind: pe_scheduler,   impl: builtin.pe_scheduler,   attrs: { overhead_ns: ... } }
    pe_dma:         { kind: pe_dma,         impl: builtin.pe_dma,         attrs: { rd_engines: 1, wr_engines: 1 } }
    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
    pe_gemm:        { kind: pe_gemm,        impl: builtin.pe_gemm,        attrs: { shared_resource: accel_slot, ... } }
    pe_math:        { kind: pe_math,        impl: builtin.pe_math,        attrs: { shared_resource: accel_slot, ... } }
    pe_tcm:         { kind: pe_tcm,         impl: builtin.pe_tcm,         attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
    pe_mmu:         { kind: pe_mmu,         impl: builtin.pe_mmu,         attrs: { ... } }   # ADR-0011 D-VA
    pe_ipcq:        { kind: pe_ipcq,        impl: builtin.pe_ipcq,        attrs: { ... } }   # ADR-0023
  links:
    # Scheduler dispatch edges (initial)
    scheduler_to_dma_mm:         0.0
    scheduler_to_fetch_store_mm: 0.0
    scheduler_to_gemm_mm:        0.0
    scheduler_to_math_mm:        0.0
    # Pipeline chaining edges (token self-routing per D6)
    dma_to_fetch_store_mm:       0.0
    fetch_store_to_gemm_mm:      0.0
    fetch_store_to_math_mm:      0.0
    gemm_to_fetch_store_mm:      0.0
    gemm_to_math_mm:             0.0
    math_to_fetch_store_mm:      0.0
    fetch_store_to_dma_mm:       0.0
    fetch_store_to_tcm_bw_gbs:   ...
 ```
 Template is instantiated once per PE. PE instances are derived from
 `cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔
 cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).
 ## Consequences
 ### Positive
 - Each block is an independent topology node — individually swappable
  via DI (ADR-0015).
 - PE-internal structure is visible in the topology graph.
 - Components do not know their downstream — plan-based routing gives
  flexibility (e.g., epilogue chains require no scheduler change).
 - DMA and compute overlap naturally via SimPy Store backpressure.
 - Multi-op composite expresses fused operations (e.g., GEMM + bias_add)
  without engine-level coupling.
 - TCM access contention is realistic — `PE_FETCH_STORE` is the single
  TCM↔RF gateway.
 ### Negative
 - Intra-PE component count is higher than a coarser model (7 base + 2
  cross-referenced) — more topology nodes/edges.
 - Intra-PE token forwarding is explicit in traces (acceptable trade for
  HW fidelity).
 ## Links
 - ADR-0011 D-VA (PE_MMU component, VA translation)
 - ADR-0015 D4 (component port/wire model)
 - ADR-0020 (greenlet kernel execution / two-pass)
 - ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
 - SPEC R3, R4
@@ -0,0 +1,202 @@
 # ADR-0015: Component Port/Wire Model and Fabric Routing
 ## Status
 Accepted
 ## Context
 Realistic hardware modeling — queues, contention, fan-out — requires
 that components own fabric traversal while the simulation engine
 handles only initialization and completion observation. Direct method
 calls between components, or path-walking inside the engine, defeat
 queueing and contention semantics.
 This ADR defines:
 - how components communicate via typed port queues,
 - how propagation delay is modeled (wire processes with BW occupancy),
 - the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch
  (via M_CPU),
 - the engine's reduced role (wire init + completion observation only),
 - M_CPU.DMA as an internal subcomponent of M_CPU.
 ---
 ## Decision
 ### D1. Component port model
 Each component has typed input/output ports modeled as SimPy Stores:
 ```text
 in_ports:  dict[str, simpy.Store]   # keyed by source node_id
 out_ports: dict[str, simpy.Store]   # keyed by destination node_id
 ```
 Ports are created at engine initialization based on graph edges.
 Each directed edge (src → dst) results in:
 - `src.out_ports[dst]`  — the sending end
 - `dst.in_ports[src]`   — the receiving end
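 The edge-to-port mapping above can be sketched as follows. This is a minimal illustration rather than the engine's implementation: `Component`, `wire_ports`, and the `store_factory` parameter are hypothetical names, and `collections.deque` stands in for `simpy.Store` so the snippet runs without SimPy.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical stand-in for a topology component."""
    node_id: str
    in_ports: dict = field(default_factory=dict)   # keyed by source node_id
    out_ports: dict = field(default_factory=dict)  # keyed by destination node_id


def wire_ports(components, edges, store_factory=deque):
    """Create one sending and one receiving queue per directed edge.

    A real engine would pass a factory that builds a simpy.Store bound to
    the environment; deque only keeps this sketch self-contained.
    """
    for src, dst in edges:
        components[src].out_ports[dst] = store_factory()
        components[dst].in_ports[src] = store_factory()
    return components


nodes = {n: Component(n) for n in ("pcie_ep", "io_noc", "io_cpu")}
edges = [("pcie_ep", "io_noc"), ("io_noc", "pcie_ep"), ("io_noc", "io_cpu")]
wire_ports(nodes, edges)
print(sorted(nodes["io_noc"].out_ports))  # destinations reachable from io_noc
print(sorted(nodes["io_noc"].in_ports))   # sources feeding io_noc
```

 The two ends of an edge are deliberately separate queues, which leaves room for a wire process to sit between them.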
 ---
 ### D2. Wire process (propagation delay + BW occupancy)
 For each directed edge (src, dst) in the topology graph, a SimPy wire process
 models propagation delay and BW occupancy:
 ```python
 def wire_process(env, out_port, in_port, delay_ns, bw_gbs):
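    available_at = 0.0  # sim time (ns) at which this directed link is free again
    while True:
        cmd = yield out_port.get()
        if bw_gbs > 0:
            nbytes = getattr(cmd, "nbytes", 0)
            if nbytes > 0:
                # Serialize behind the previous transaction on this link.
                wait = available_at - env.now
                if wait > 0:
                    yield env.timeout(wait)
                # bytes / (GB/s) yields ns, because 1 GB/s == 1 byte/ns.
                available_at = env.now + (nbytes / bw_gbs)
        yield env.timeout(delay_ns)  # propagation delay
        yield in_port.put(cmd)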
 ```
 Wire processes are started at engine initialization.
 Each directed edge maintains an `available_at` timestamp tracking when the link
 becomes free for the next transaction. When a transaction occupies a link, the
 next transaction on the same directed link must wait until occupancy clears
 (back-to-back serialization). TX and RX directions are independent (separate
 wire processes with separate `available_at` state).
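 To make the back-to-back serialization concrete, the wire loop can be replayed as plain arithmetic. This is a sketch under stated assumptions: `wire_delivery_times` is a hypothetical helper, the destination `in_port` is assumed never to block, and the numbers (2 ns delay, 128 GB/s, 1024-byte payloads) are illustrative only.

```python
def wire_delivery_times(arrivals, delay_ns, bw_gbs):
    """Replay the wire loop for one directed edge without SimPy.

    arrivals: list of (arrival_ns, nbytes) in out_port order.
    Returns the time at which each command reaches the in_port.
    """
    available_at = 0.0   # when the link is free for the next transaction
    now = 0.0            # when the single wire process can take the next command
    delivered = []
    for arrival_ns, nbytes in arrivals:
        now = max(now, arrival_ns)                # out_port.get() returns
        if bw_gbs > 0 and nbytes > 0:
            now = max(now, available_at)          # wait for occupancy to clear
            available_at = now + nbytes / bw_gbs  # 1 GB/s == 1 byte/ns
        now += delay_ns                           # propagation delay
        delivered.append(now)                     # in_port.put(cmd)
    return delivered


# Two 1024-byte transactions arrive together; a third arrives much later.
times = wire_delivery_times([(0.0, 1024), (0.0, 1024), (100.0, 1024)],
                            delay_ns=2.0, bw_gbs=128.0)
print(times)
```

 The second transaction is held until the first one's 8 ns occupancy window closes, while the third finds the link idle and pays only the propagation delay. Because a single process handles one command at a time, the sketch carries `now` forward from one delivery to the next.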
 ---
 ### D3. Engine role (reduced)
 The simulation engine MUST:
 - wire components at initialization (create port Stores, start wire processes),
 - identify the entry component for each request type (PCIE_EP),
 - put the request into the entry component's in_port,
 - wait for a completion event.
 The simulation engine MUST NOT:
 - walk the topology path during request execution,
 - call component `run()` methods directly,
 - track per-hop latency or decompose fan-out.
 ---
 ### D4. Fabric paths for Memory R/W and Kernel Launch
 Memory R/W and Kernel Launch use **different** fabric paths.
 Memory operations bypass M_CPU and route directly to HBM via the crossbar.
 Kernel Launch routes through M_CPU for PE fan-out.
 **Memory R/W forward path (pcie_ep → hbm_ctrl, M_CPU bypass):**
 ```text
 pcie_ep → io_noc → io_ucie
  → [transit cubes: ucie_in → noc → ucie_out]  (zero or more)
  → target cube: ucie_in → router mesh → hbm_ctrl
 ```
 **Memory R/W completion path:**
 ```text
 hbm_ctrl → router mesh → [transit cubes: ucie → router mesh → ucie]
  → io_ucie → io_noc → pcie_ep
 ```
 **Kernel Launch forward path (pcie_ep → io_cpu → M_CPU → PE):**
 ```text
 pcie_ep → io_noc → io_cpu → io_noc → io_ucie
  → [transit cubes: ucie_in → noc → ucie_out]  (zero or more)
  → target cube: ucie_in → noc → M_CPU → PE[0..n] (parallel fan-out)
 ```
 **Kernel Launch completion path:**
 ```text
 PE[0..n] all complete → M_CPU (aggregation)
  → noc → [transit cubes: ucie → noc → ucie]
  → io_ucie → io_noc → io_cpu → io_noc → pcie_ep
 ```
 **Rationale for M_CPU bypass on Memory R/W:**
 Memory write/read operations do not require command interpretation or PE
 dispatch — they are direct data transfers to/from HBM. Routing through M_CPU
 would add unnecessary overhead (5ns) without functional benefit. The io_noc
 inside the IO chiplet handles the routing decision: memory operations go
 directly to cube fabric, while kernel launches are forwarded to io_cpu first.
 ---
 ### D5. M_CPU.DMA is an internal subcomponent of M_CPU
 M_CPU.DMA is NOT a separate topology node.
 It is an internal subcomponent owned by the M_CPU component implementation.
 M_CPU.DMA:
 - owns the DMA READ and DMA WRITE queues (capacity=1 each, per ADR-0014 D4),
 - issues memory requests over the NOC to hbm_ctrl,
 - receives completion from hbm_ctrl via the NOC,
 - reports completion to M_CPU,
 - is created and managed inside M_CPU's `__init__` and `run()`.
 M_CPU.DMA does not appear as a node in the compiled topology graph.
 ---
 ### D6. Transit cube forwarding
 A cube that is not the target of a memory or kernel request acts as a transit node.
 Transit cubes forward requests without consuming them:
 ```text
 ucie_in (from upstream) → noc → ucie_out (to downstream)
 ```
 Transit forwarding is implemented entirely within the ucie_in component.
 The noc and ucie_out components in a transit cube forward the packet without modification.
 ---
 ### D7. _formula_latency is preserved as a lower-bound cross-check
 The path-based formula latency function (`_formula_latency`) is preserved in the engine
 as a lower bound for correctness verification.
 Invariant:
 - Phase 0: `_formula_latency == component model total_ns`
 - Phase 1+: `_formula_latency <= component model total_ns` (contention adds queueing)
 This function is independent of the port/wire model and requires only the topology graph.
 It is used for shard comparison in `_route_kernel` and as a regression guard.
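 The invariant lends itself to a small regression check. The helper below is a hedged sketch: `check_formula_lower_bound` and its `rel_tol` parameter are hypothetical, and the tolerance exists only to absorb floating-point noise.

```python
import math


def check_formula_lower_bound(phase, formula_ns, model_ns, rel_tol=1e-9):
    """Return True when the D7 invariant holds for one request."""
    if phase == 0:
        # No contention is modeled, so both latencies must agree.
        return math.isclose(formula_ns, model_ns, rel_tol=rel_tol)
    # Contention can only add queueing delay on top of the formula value.
    return formula_ns <= model_ns or math.isclose(
        formula_ns, model_ns, rel_tol=rel_tol)


print(check_formula_lower_bound(0, 42.0, 42.0))
print(check_formula_lower_bound(1, 42.0, 57.5))
print(check_formula_lower_bound(1, 42.0, 40.0))
```

 A failing check in Phase 1+ would indicate that the component model finished faster than the contention-free path allows, which points at a modeling bug rather than at queueing.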
 ---
 ## Consequences
 - Components model realistic hardware behavior (queues, contention, fan-out).
 - Propagation delay is modeled accurately per edge.
 - Engine is decoupled from routing policy.
 - Component implementations remain swappable via DI (ADR-0007 D3).
 ---
 ## Links
 - ADR-0007 D2 (engine role boundary)
 - ADR-0009 D3 (kernel execution fan-out hierarchy)
 - ADR-0014 D4 (DMA engine capacity=1)
 - ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
 - ADR-0016 (IOChiplet NOC and memory data path)
 - ADR-0017 (cube NOC 2D mesh architecture)
 - ADR-0033 (Latency model assumptions built on these mechanisms)
@@ -0,0 +1,98 @@
 # ADR-0016: IOChiplet NOC and Memory Data Path
 ## Status
 Accepted
 ## Context
 ADR-0003 D2 defines IO chiplets as SIP-level components providing PCIe-EP and
 IO_CPU interfaces, but does not specify internal routing within the IO chiplet.
 ADR-0015 D4 was updated to document the M_CPU bypass for Memory R/W, but the
 IO chiplet's internal NOC architecture that enables this routing was not
 formally documented.
 The IO chiplet needs an internal routing fabric (io_noc) to:
 - connect pcie_ep, io_cpu, and per-cube UCIe PHY ports
 - route memory operations (MemoryWrite/Read) directly to cube fabric without
  passing through io_cpu
 - route kernel launch commands through io_cpu for command interpretation
 ## Decision
 ### D1. IOChiplet internal NOC (io_noc)
 Each IO chiplet instance contains an internal NOC node (`io_noc`) that connects:
 - `pcie_ep` — host-facing PCIe endpoint
 - `io_cpu` — command processor for kernel launch interpretation
 - `io_ucie-{PHY}.conn{N}` — per-PHY connection nodes to cube UCIe ports
 The io_noc is a forwarding-only fabric (`forwarding_v1` implementation) with
 zero overhead. All routing decisions are made by the simulation engine based
 on message type, not by io_noc itself.
 ### D2. IOChiplet UCIe decomposition
 Each IO chiplet PHY port is decomposed into:
 - `io_ucie-{PHY}` — the UCIe protocol endpoint (overhead = 8ns)
 - `io_ucie-{PHY}.conn{N}` — N connection nodes between io_noc and io_ucie
 This mirrors the cube-side UCIe decomposition (ADR-0017 D6) and allows
 multiple independent NOC-to-UCIe connections per PHY.
 ### D3. Memory R/W path (M_CPU bypass)
 Memory operations (MemoryWrite, MemoryRead) are routed directly from pcie_ep
 through io_noc to the target cube, bypassing io_cpu entirely:
 ```text
 pcie_ep → io_noc → conn → io_ucie → [cube UCIe] → router mesh → hbm_ctrl
 ```
 This avoids the 10ns io_cpu overhead for pure data transfers. The simulation
 engine's `_process_memory_direct()` method uses `find_memory_path()` which
 resolves the shortest path from pcie_ep to the target HBM node.
 ### D4. Kernel Launch path (via io_cpu)
 Kernel launch commands require io_cpu for command interpretation and PE
 fan-out setup:
 ```text
 pcie_ep → io_noc → io_cpu → io_noc → conn → io_ucie → [cube UCIe]
  → noc → m_cpu → PE
 ```
 The engine's `_entry_points()` method routes KernelLaunchMsg through both
 pcie_ep (entry) and io_cpu (command processing).
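 The difference between the D3 and D4 paths inside the IO chiplet can be summarized in a few lines. This is an illustrative sketch only: `io_chiplet_hops` is a hypothetical helper and does not represent `_entry_points()` or `find_memory_path()`; the hop names are the abbreviated ones used in the path diagrams above.

```python
MEMORY_KINDS = {"MemoryWrite", "MemoryRead"}


def io_chiplet_hops(msg_kind):
    """Return the IO-chiplet hops taken by a message kind (D3 and D4)."""
    if msg_kind in MEMORY_KINDS:
        # Pure data transfer: io_cpu is skipped.
        return ["pcie_ep", "io_noc", "conn", "io_ucie"]
    if msg_kind == "KernelLaunchMsg":
        # Command interpretation is required: detour through io_cpu.
        return ["pcie_ep", "io_noc", "io_cpu", "io_noc", "conn", "io_ucie"]
    raise ValueError(f"unknown message kind: {msg_kind}")


for kind in ("MemoryRead", "KernelLaunchMsg"):
    print(kind, "->", " → ".join(io_chiplet_hops(kind)))
```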
 ### D5. IOChiplet-to-cube port mapping
 Each IO chiplet instance declares which cube ports it connects to:
 ```yaml
 cube_ports:
  - { cube: {xy: [0,0]}, cube_side: N, phy: P0, distance_mm: 2.0 }
  - { cube: {xy: [1,0]}, cube_side: N, phy: P1, distance_mm: 2.0 }
 ```
 The topology builder creates edges from io_ucie PHY nodes to the
 corresponding cube UCIe port nodes, with the specified distance and
 the IO chiplet's `per_connection_bw_gbs` as link bandwidth.
 ## Consequences
 - IO chiplet has a well-defined internal routing fabric
 - Memory operations avoid unnecessary io_cpu overhead
 - Kernel launch commands still get proper command interpretation
 - The io_noc pattern is consistent with cube-level NOC design
 - ADR-0003 D2 is extended (not contradicted) by this ADR
 ## Links
 - ADR-0003 D2 (IO chiplet definition)
 - ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
 - ADR-0012 D1 (host-to-IO_CPU message schema)
@@ -0,0 +1,291 @@
 # ADR-0017: Cube NOC and HBM Connectivity
 ## Status
 Accepted
 ## Context
 The CUBE-level NOC is a 2D router mesh that carries every intra-cube
 request: PE-to-HBM data, PE-to-PE traffic, command paths
 (M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic.
 The CUBE's HBM is exposed through per-PE controller endpoints attached
 to PE routers. This per-PE partitioning makes local-vs-remote HBM
 distinguishable by mesh distance: a PE's own HBM partition sits at its
 own router (switching overhead only); another PE's HBM partition is
 reachable by mesh hops to that PE's router.
 Two channel-mapping modes are supported in the design space:
 - **n:1 (default, implemented)** — each PE's HBM partition aggregates
  `channels_per_pe` pseudo-channels into one endpoint. Effective
  per-PE BW = N × per-channel BW.
 - **1:1 (future)** — each PE router decomposes into per-channel
  mini-routers; per-channel BW contention is modeled directly.
 In both modes the per-PE effective BW is identical; only the connectivity
 granularity differs.
 ## Decision
 ### D1. 2D router mesh
 Each cube contains a 2D mesh of NOC routers generated by `mesh_gen.py`.
 - Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`).
 - Implementation: `forwarding_v1`. NOC `overhead_ns = 0`.
 - Default 6×6 grid (sized from PE corner placement + UCIe attachment
  count); larger PE counts scale the grid up.
 - HBM exclusion zone: center rows/columns are excluded where HBM die
  physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6).
 - Latency = Manhattan distance × `ns_per_mm`.
 ### D2. XY routing algorithm
 Deterministic XY routing:
 1. Horizontal segment: route from source X to destination X along the source row (source Y).
 2. Vertical segment: route from source Y to destination Y along the destination column (destination X).
 Each directed segment carries a unique key:
 - Horizontal: `("H", y_band, x_min, x_max, direction)`
 - Vertical:   `("V", x_band, y_min, y_max, direction)`
 Grid positions are snapped to the router grid, excluding the HBM zone.
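 The routing rule and the segment keys can be sketched as below. Several details are assumptions made for illustration: positions are `(row, col)` pairs, `y_band` is taken to be the row index and `x_band` the column index, the direction labels `"E"`, `"W"`, `"S"`, `"N"` are hypothetical, and the HBM exclusion zone is ignored.

```python
def xy_route(src, dst):
    """Return the router names visited by deterministic XY routing."""
    (row, col), (dst_row, dst_col) = src, dst
    path = [f"r{row}c{col}"]
    step = 1 if dst_col >= col else -1
    while col != dst_col:          # horizontal segment along the source row
        col += step
        path.append(f"r{row}c{col}")
    step = 1 if dst_row >= row else -1
    while row != dst_row:          # vertical segment along the destination column
        row += step
        path.append(f"r{row}c{col}")
    return path


def segment_keys(src, dst):
    """Build the horizontal and vertical segment keys described in D2."""
    (row, col), (dst_row, dst_col) = src, dst
    keys = []
    if col != dst_col:
        keys.append(("H", row, min(col, dst_col), max(col, dst_col),
                     "E" if dst_col > col else "W"))
    if row != dst_row:
        keys.append(("V", dst_col, min(row, dst_row), max(row, dst_row),
                     "S" if dst_row > row else "N"))
    return keys


print(xy_route((0, 0), (1, 4)))
print(segment_keys((0, 0), (1, 4)))
```

 Two transactions contend only when they produce an identical key, so opposite directions on the same row or column stay independent.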
 ### D3. Per-segment contention model
 Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
 sharing a segment (same row or column band, same direction) contend for
 the resource — modelling link-level serialization in a wormhole-routed
 mesh.
 With no contention, NOC traversal latency equals Manhattan distance ×
 `ns_per_mm`. Under contention, SimPy's resource scheduling adds queueing
 delay.
 ### D4. NOC attachment points (per-PE HBM partition)
 Every PE router carries three attachments: `pe{idx}.dma`, `pe{idx}.cpu`,
 and `pe{idx}.hbm`. The last is the per-PE HBM controller endpoint —
 `sip{S}.cube{C}.hbm_ctrl.pe{idx}` — which owns one slice of the cube's
 HBM (one pseudo-channel group; see D8).
 Other attachments:
 - M_CPU and shared SRAM each occupy a dedicated edge router.
 - UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed
  along that edge (see D6).
 ```text
                    UCIe-N (conn x4)
                         |
           +---------+---+---+---------+
           |         |       |         |
 PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
 PE0.cpu <--+ +hbm.pe0|       | +hbm.pe2+--< PE2.cpu
           |         |       |         |
 UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
 (conn x4)  |         | zone  |         |  (conn x4)
           |  r2c0   |       |         |
 M_CPU <--->+         |       |         |
           |  r3c0   |       |         |
 SRAM <---->+         |       |         |
           |         |       |         |
 PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
 PE4.cpu <--+ +hbm.pe4|       | +hbm.pe6+--< PE6.cpu
           |         |       |         |
           +---------+---+---+---------+
                         |
                    UCIe-S (conn x4)
 ```
 Per-PE HBM partitioning is the key invariant that makes local vs
 cross-PE HBM distinguishable by mesh distance (see D7).
 ### D5. NOC edge bandwidths and distances
 | Connection                    | BW (GB/s)  | Distance      | Notes                                       |
 | ----------------------------- | ---------- | ------------- | ------------------------------------------- |
 | PE_DMA → NOC                  | 256.0      | Physical (PE) | Matches local-HBM aggregate BW              |
 | NOC → PE_CPU                  | —          | 0.0 mm        | Command path only                           |
 | Router ↔ hbm_ctrl.pe{idx}     | 256.0      | 0.0 mm        | Per PE router; N × per-channel BW (see D8)  |
 | NOC ↔ M_CPU                   | —          | 0.0 mm        | Command path                                |
 | NOC ↔ SRAM                    | 128.0 × 4  | 0.0 mm        | 512 GB/s aggregate                          |
 | NOC ↔ UCIe conn               | 128.0      | 0.0 mm        | Per connection; 4 conn per port             |
 `0.0 mm` distances reflect the distributed nature of the NOC; actual
 traversal distance is computed via Manhattan distance within the router
 grid.
 ### D6. UCIe decomposition and inter-cube traffic
 Each of the 4 UCIe ports (N, S, E, W) decomposes into:
 - 1 `ucie-{PORT}` node: UCIe protocol endpoint (`overhead = 8.0 ns`).
 - 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe.
 This decomposition gives 4 independent NOC↔UCIe connections per port,
 each with 128 GB/s bandwidth (512 GB/s aggregate per port).
 Inter-cube traffic path:
 ```text
 Source: PE_DMA → NOC → conn{i} → ucie-{PORT}
                  [UCIe link: 512 GB/s, 1.0mm seam distance]
 Target: ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx}
 ```
 UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full
 crossing incurs 16 ns (TX port + RX port).
 ### D7. Data paths through the NOC
 All intra-cube traffic uses the same router mesh — no separate fast
 paths.
 **Local HBM** (same PE's own partition; 0 mesh hops):
 ```text
 PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx}   (switching overhead only)
 ```
 **Cross-PE HBM within cube** (target PE's partition, reached by mesh):
 ```text
 PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'}
 ```
 Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 on `r1c4`):
 ```text
 PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2
 ```
 Dijkstra computes the shortest path within the mesh.
 **Cross-cube HBM** (UCIe traversal):
 ```text
 PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn
       → r{x'}c{y'} → hbm_ctrl.pe{idx'}
 ```
 **Kernel launch command to PE**:
 ```text
 [from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU
 ```
 **Shared SRAM access**:
 ```text
 PE_DMA → r{x}c{y} → (mesh) → SRAM
 ```
 ### D8. HBM channel mapping mode
 Channel mapping is configured at cube scope:
 ```yaml
 cube:
  memory_map:
    hbm_mapping_mode: n_to_one       # one_to_one | n_to_one
    hbm_pseudo_channels: 64          # total pseudo-channel count
    hbm_channels_per_pe: 8           # per-PE local channel count
    hbm_channel_bw_gbs: 32.0         # per-channel bandwidth (GB/s)
    hbm_slices_per_cube: 8           # number of per-PE partitions
    hbm_total_gb_per_cube: 48
 ```
 **n:1 mode (default, implemented).** Each PE's HBM partition is a single
 endpoint `hbm_ctrl.pe{idx}` that aggregates `channels_per_pe` pseudo-
 channels. The `Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals
 `channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to
 interleave; only aggregate per-PE BW is modeled. No separate aggregated
 router node exists — the per-PE router itself serves that role.
 **1:1 mode (future).** Each PE router decomposes into N channel
 mini-routers; per-channel routing carries fully-resolved PA + channel ID.
 A `ChannelSplitter` resolves a logical access to N per-channel physical
 requests. Per-channel link models BW contention. Cross-PE channel
 access semantics are deferred to the implementation ADR.
 **BW math (defaults).**
 | Parameter                          | Value                      |
 | ---------------------------------- | -------------------------- |
 | pseudo channels per cube           | 64 (parameter)             |
 | PEs per cube                       | 8 (parameter)              |
 | channels per PE (N)                | 64 / 8 = 8                 |
 | per-channel BW                     | 32 GB/s (parameter)        |
 | per-PE local BW                    | N × 32 = 256 GB/s          |
 | cube total HBM BW                  | 64 × 32 = 2048 GB/s        |
 Both modes give the same per-PE effective BW; only the request shape and
 contention model differ.
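 The table's arithmetic can be reproduced directly from the three parameters. The function name and its return keys are hypothetical; the default values are the ones listed above.

```python
def hbm_bandwidth(pseudo_channels=64, pes_per_cube=8, channel_bw_gbs=32.0):
    """Derive per-PE and per-cube HBM bandwidth for n:1 mapping."""
    if pseudo_channels % pes_per_cube:
        raise ValueError("pseudo channels must divide evenly across PEs")
    channels_per_pe = pseudo_channels // pes_per_cube
    return {
        "channels_per_pe": channels_per_pe,
        "per_pe_bw_gbs": channels_per_pe * channel_bw_gbs,
        "cube_bw_gbs": pseudo_channels * channel_bw_gbs,
    }


print(hbm_bandwidth())
```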
 ### D9. AddressResolver — per-PE HBM endpoint
 The address resolver decodes a PA's HBM offset to the owning PE's
 partition:
 ```python
 # policy/routing/router.py
 hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube
 if addr.kind == "hbm":
    pe_id = int(addr.hbm_offset) // hbm_slice_bytes
    return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
 ```
 The pe_id computation is intrinsic to the routing layer (not a
 topology-time concern). Any HBM PA falls within exactly one partition,
 yielding deterministic routing.
 External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the
 same resolver path — there is no separate fast path.
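 A self-contained version of the same computation is shown below for experimentation. It is a sketch, not the code in `policy/routing/router.py`: `resolve_hbm_endpoint` and its explicit range check are assumptions, and the defaults come from the D8 configuration (48 GB per cube, 8 slices).

```python
GIB = 1 << 30


def resolve_hbm_endpoint(sip, cube, hbm_offset,
                         total_gb_per_cube=48, slices_per_cube=8):
    """Map an HBM byte offset to the owning per-PE controller endpoint."""
    slice_bytes = total_gb_per_cube * GIB // slices_per_cube
    if not 0 <= hbm_offset < slice_bytes * slices_per_cube:
        raise ValueError(f"HBM offset out of range: {hbm_offset:#x}")
    pe_id = hbm_offset // slice_bytes
    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"


print(resolve_hbm_endpoint(0, 0, 0))
print(resolve_hbm_endpoint(0, 0, 6 * GIB))
print(resolve_hbm_endpoint(0, 1, 48 * GIB - 1))
```

 With these defaults each partition spans 6 GiB, so the slice boundary falls between offsets `6 * GIB - 1` and `6 * GIB`.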
 ### D10. Mesh generation parameters
 `mesh_gen.py` produces `cube_mesh.yaml` from:
 - `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner.
 - `cube.geometry`: cube physical dimensions and HBM zone.
 - `cube.ucie.n_connections`: determines router count for UCIe attachment.
 Output `mesh_data` dictionary contains:
 - Router grid with positions and HBM exclusion zones.
 - PE-to-router attachments (`pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm`
  per PE).
 - UCIe-to-router attachments (N/S/E/W distributed across edge routers).
 - M_CPU and SRAM router attachments.
 ## Consequences
 - Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM
  (mesh hops) are naturally distinguishable, satisfying SPEC R5
  (multi-domain communication) and ADR-0002 (no zero-latency end-to-end
  paths).
 - All cube-internal traffic routes through one mesh — single contention
  model, single layout, single set of edge BWs.
 - Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each
  PE's partition is the n:1 aggregate of its assigned pseudo-channels.
 - 1:1 mode extension is structurally natural — split each PE router into
  N channel routers.
 - Mesh generation is fully parameterised by `topology.yaml`; PE/cube
  geometry changes propagate without code edits.
 ## Links
 - ADR-0002 (Routing distance, ordering, no zero-latency paths)
 - ADR-0003 D3 (cube-level NOC definition — extended here)
 - ADR-0004 (Memory semantics, local HBM)
 - ADR-0011 (Memory addressing — LA model consumes per-PE partition)
 - ADR-0014 D1 (PE_DMA egress via router mesh)
 - ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
 - ADR-0016 (IOChiplet io_noc — analogous pattern at IO chiplet level)
 - ADR-0033 (Latency model: per-PC parallelism, switch penalty)
@@ -0,0 +1,516 @@
 # ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
 ## Status
 Accepted
 ## Context
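 The simulation currently models **timing only**.
 Calls such as `tl.load()` and `tl.composite(op="gemm")` generate SimPy latency,
 but they neither read nor compute on actual tensor data.
 ### Required capabilities
 1. Actual data must be storable in and readable from HBM/TCM/SRAM.
 2. PE_GEMM and PE_MATH must perform real matrix operations whose results can be verified.
 3. Simulation performance degradation must be minimized.
 ### Constraints
 - SimPy is a single-threaded event loop: running numpy matmul inside it blocks the whole simulation.
 - Components must remain swappable (ADR-0015): framework requirements must not leak into implementations.
 - Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait), and the same code must be reused.
 - Kernel functions must remain plain Python functions (no conversion to generator/async).
 ### Design exploration results
 | Option | Approach | Verdict |
 |--------|------|------|
 | In-SimPy direct execution | Call numpy for GEMM inside SimPy | Rejected: blocks the single thread |
 | SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks in result() on back-to-back requests |
 | Symbolic + lazy | Track metadata only, execute later | Rejected: control-flow-dependent reads are hard to handle |
 | **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |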
 ---
 ## Decision
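 ### D1. 2-pass execution model: Phase 0 removed
 The previous three stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into two**.
 Before:
 ```
 Phase 0: kernel → PeCommand list (no data, no branching)
 Phase 1: SimPy replay of the PeCommand list (timing only)
 ```
 After:
 ```
 Phase 1 (timing): integrated kernel + SimPy execution, greenlet-based
  - Memory read/write: SimPy timing + real data in MemoryStore
  - Compute (GEMM/Math): SimPy timing + op_log record (actual compute happens in Phase 2)
  - Dynamic control flow is possible (tl.load returns real data)
 Phase 2 (data): actual compute driven by op_log, outside SimPy, parallelizable
 ```
 This ADR **extends Phase 1 to be data-aware for memory operations only**.
 Phase 1 covers latency/BW bottleneck analysis plus memory data tracking;
 Phase 2 covers correctness verification of GEMM/Math computation.
 Phase 2 is optional: when only timing is needed, run Phase 1 alone.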
 ### D2. Op log recording: ComponentBase hook
 The op_log is recorded by a **hook in the component base class**.
 Individual component implementations are not modified.
 ```python
 class ComponentBase:
    def _on_process_start(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_start(env.now, self.node.id, msg)
    def _on_process_end(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_end(env.now, self.node.id, msg)
 ```
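 `_forward_txn()` invokes the hooks immediately before and after `run()`.
 `_op_logger` is optional; when it is absent, no logging overhead is incurred.
 **Hook timing definitions**:
 | Point | Meaning |
 |------|------|
 | `t_start` | The moment the component **begins service** for the msg (just before entering `run()`) |
 | `t_end` | The moment the component's **internal service completes** (just after `run()` returns) |
 Link traversal latency is not included in t_start/t_end.
 It is observed as the difference between the sending component's t_end and the receiving component's t_start.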
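 ### D3. Greenlet-based kernel execution: Phase 0 removed
 The former Phase 0 (kernel → PeCommand list) is removed, and
 **greenlet** is used to interleave the kernel and SimPy cooperatively.
 #### How it works
 greenlet is a C extension that provides cooperative context switching.
 When the kernel (child greenlet) calls `tl.load()` or a similar primitive, control switches to the
 SimPy loop (parent greenlet), which runs the timing simulation and then returns to the kernel with the real data.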
 ```
 SimPy loop (parent greenlet)          Kernel (child greenlet)
 ─────────────────────────              ──────────────────────
 g.switch() ─────────────────────────→ kernel starts
                                       a = tl.load(ptr, ...)
                                         internally: parent.switch(DmaReadCmd)
 cmd = DmaReadCmd ←──────────────────  (kernel suspended)
  yield DmaReadMsg(...)
  yield env.timeout(dma_latency)
  data = memory_store.read(...)
 g.switch(data) ─────────────────────→ (kernel resumes)
                                       a = data  ← real numpy array
                                       if a[0][0] > 0.5:  ← branching possible
                                         ...
 ```
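 The kernel remains a **plain Python function**.
 greenlet switches exist **only inside the implementation** of `tl.load()`, `tl.store()`, and similar primitives.
 #### KernelRunner: framework layer
 The greenlet loop lives not in the PE_CPU component but in **KernelRunner**,
 a framework-layer class.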
 ```python
 # KernelRunner (framework: bridges greenlet ↔ SimPy)
 class KernelRunner:
    def run(self, env, kernel_fn, args, store):
        g = greenlet(self._run_kernel)
        cmd = g.switch(kernel_fn, args)
        while cmd is not None:
            if isinstance(cmd, DmaReadCmd):
                yield from self._dispatch_dma(env, cmd)
                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
                cmd = g.switch(data)            # resume with real data
            elif isinstance(cmd, GemmCmd):
                yield from self._dispatch_gemm(env, cmd)
                cmd = g.switch()                # resume (no data)
            elif isinstance(cmd, DmaWriteCmd):
                store.write(cmd.dst_addr, cmd.data)  # visibility = issue time
                yield from self._dispatch_dma(env, cmd)  # timing only
                cmd = g.switch()
 # PE_CPU (component: kept simple, unaware of greenlet)
 def _execute_kernel(self, env):
    runner = KernelRunner(self.ctx)
    yield from runner.run(env, kernel_fn, args, store)
 ```
 **Single source of truth for op logging**: KernelRunner does not write to op_log directly.
 All op logging is handled **solely by the ComponentBase hooks (_on_process_start/end)**.
 When KernelRunner delivers a message to a component via `_dispatch_gemm()` or similar,
 the component base class hook records it automatically.
 **Layer separation**:
 - **Kernel code**: plain function, unaware of greenlet
 - **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
 - **KernelRunner**: bridges greenlet ↔ SimPy and handles MemoryStore reads/writes. **Does no logging.**
 - **ComponentBase hook**: the only path for op_log recording
 - **PE_CPU**: only invokes KernelRunner; remains swappable as a component
 #### How memory reads/writes and compute are handled differently
 | Operation | In Phase 1 | In Phase 2 |
 |------|------------|------------|
 | `tl.load()` | SimPy timing + MemoryStore read → **returns real data** | n/a |
 | `tl.store()` | SimPy timing + MemoryStore write → **actually written** | n/a |
 | `tl.composite(gemm)` | SimPy timing + **op_log record only** | actual numpy compute |
 | `tl.dot()` / math ops | SimPy timing + **op_log record only** | actual numpy compute |
 Memory reads/writes are handled immediately in Phase 1 (numpy slices, fast).
 GEMM/Math operations are executed in batches in Phase 2 (performance isolation).
 #### Store Visibility Rule
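 `tl.store()` is **reflected in MemoryStore immediately at issue time** (visibility = issue).
 The SimPy DMA timing is simulated separately afterwards.
 This deliberately decouples timing from visibility:
 - **visibility**: the moment the data lands in MemoryStore = when `store.write()` is called
 - **timing**: the moment the DMA latency completes in SimPy
 Thanks to this separation, a load issued right after a store sees the latest data under dynamic control flow.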
 #### Result Handle Semantics
 `tl.composite()` (sync/async) returns a **handle** that refers to the result tensor.
 Core contract in Phase 1:
 1. **Every compute handle is always treated as pending in Phase 1.**
 2. `tl.wait(handle)` **expresses timing synchronization only**;
   it does not make the handle ready.
 3. Access to a handle's actual result data (`handle.data`, element access,
   numpy conversion, etc.) is **possible only in Phase 2**.
 4. Consequently, **control flow based on compute results is not supported** in Phase 1.
 5. By contrast, `tl.load()` returns real data in Phase 1, so
   **control flow based on memory reads is supported**.
 | Handle state | Phase | Allowed actions |
 |------------|-------|----------|
 | pending | Phase 1 | `tl.wait(handle)`: timing synchronization only |
 | pending | Phase 1 | Pass the handle as the target of `tl.store()` (links the logical destination only; the payload arrives in Phase 2) |
 | pending | Phase 1 | **No data access**: value-based branching is not possible |
 | ready | Phase 2 | Access to real numpy data, verification |
 This restriction is intentional. Executing compute in Phase 1 would block the
 SimPy single thread and remove the reason for the 2-pass split.
 #### Phase 1 Materialization — Future Extension
 If eager Phase 1 execution becomes necessary for small operations (scalars, small reductions),
 a `materialized_in_phase1: bool` flag can be added to the op record to support
 selective materialization. It is not implemented in the current scope.
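 ### D4. data_op flag: message self-declaration
 Whether a message is logged is decided by the `data_op` attribute of the message instance, not by the message type.
 The framework does not hard-code message types.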
 ```python
 class MsgBase:
    data_op: bool = False       # default: not logged
 class DmaReadCmd(MsgBase):
    data_op = True              # memory movement → logged
 class GemmCmd(MsgBase):
    data_op = True              # compute → logged
 class MathCmd(MsgBase):
    data_op = True              # compute → logged
 ```
 When a new message type (e.g., IpcqMsg) is added, setting `data_op = True` is enough
 for it to be logged automatically, with no framework code changes.
 ### D5. Op log structure
 #### Op classification scheme
 Ops are classified at two levels:
 | Level | Field | Role |
 |------|------|------|
 | `op_kind` | `memory` \| `gemm` \| `math` | basis for executor dispatch |
 | `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum`, etc. | identifies the concrete operation |
 #### OpRecord definition
 ```python
@dataclass
 class OpRecord:
    t_start: float              # SimPy time (ns): service start
    t_end: float                # SimPy time (ns): service completion
    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
    op_kind: str                # "memory" | "gemm" | "math"
    op_name: str                # concrete operation name
    params: dict                # per-operation parameters (see below)
    dependency_ids: list[int]   # currently in-memory record indices; may later be replaced by stable op_ids
 ```
 #### dependency_ids generation rules
 `dependency_ids` is **optional**; by default the executor performs
 address-based dependency inference (see D6).
 Set it explicitly only when an exact execution order is required:
 - **Default (address-based inference)**: the executor analyzes read/write sets and
  infers RAW/WAW/WAR dependencies automatically. This is sufficient in most cases.
 - **Explicit**: set at the TLContext or command-creation stage when a logical dependency
  cannot be expressed through addresses.
  Example: completion-handle-based synchronization. A handle dependency rests on
  logical completion order rather than memory addresses, so address inference does not catch it.
 #### op_log ordering
 op_log maintains a **stable ordering** by `t_start`.
 Records with the same `t_start` preserve insertion order.
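 The ordering rule maps directly onto Python's stable sort. The snippet below is a minimal sketch in which `Rec` is a hypothetical, reduced stand-in for `OpRecord` carrying only the fields needed to show the tie-breaking behavior.

```python
from dataclasses import dataclass


@dataclass
class Rec:
    """Hypothetical reduced record: only t_start and a label."""
    t_start: float
    name: str


def ordered(records):
    # sorted() is stable, so records with equal t_start keep insertion order.
    return sorted(records, key=lambda r: r.t_start)


log = [Rec(5.0, "gemm_pe1"), Rec(0.0, "dma_read"), Rec(5.0, "gemm_pe0")]
print([r.name for r in ordered(log)])
```

 Sorting on `t_start` alone, without a secondary key, is what preserves insertion order among ties.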
 #### params details
 **memory (dma_read / dma_write)**:
 ```python
 {
    "src_addr": int,            # source address (byte)
    "dst_addr": int,            # destination address (byte)
    "nbytes": int,              # transfer size
    "src_space": str,           # "hbm" | "tcm" | "sram"
    "dst_space": str,           # "hbm" | "tcm" | "sram"
 }
 ```
 **gemm**:
 ```python
 {
    "src_a_addr": int,          # operand A address
    "src_b_addr": int,          # operand B address
    "dst_addr": int,            # output address
    "shape_a": tuple,           # e.g. (128, 256)
    "shape_b": tuple,           # e.g. (256, 128)
    "shape_out": tuple,         # e.g. (128, 128)
    "dtype_in": str,            # e.g. "f16"
    "dtype_acc": str,           # accumulation dtype, e.g. "f32"
    "dtype_out": str,           # output dtype, e.g. "f16"
    "transpose_a": bool,
    "transpose_b": bool,
    "layout_a": str,            # "row_major" | "col_major"
    "layout_b": str,
    "layout_out": str,
    "addr_space": str,          # "tcm" (GEMM operands are always in TCM)
 }
 ```
 **math**:
 ```python
 {
    "op": str,                  # "exp" | "add" | "sum" | "where" | ...
    "input_addrs": list[int],   # list of operand addresses
    "input_shapes": list[tuple],
    "dst_addr": int,
    "shape_out": tuple,
    "dtype": str,
    "axis": int | None,         # reduction axis
    "addr_space": str,          # "tcm"
 }
 ```
 ### D6. Phase 2 Executor
 Phase 2 executes the op_log outside SimPy.
 ```python
 from itertools import groupby
 class DataExecutor:
    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
        self.op_log = op_log        # assumed already ordered by t_start (see D5)
        self.store = initial_store  # takes the Phase 1 MemoryStore snapshot as input
    def run(self):
        # groupby only merges adjacent records, hence the ordering requirement.
        for t, ops in groupby(self.op_log, key=lambda o: o.t_start):
            batch = list(ops)
            independent, sequential = self._classify(batch)
            self._execute_parallel(independent)
            self._execute_sequential(sequential)
 ```
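 **Parallel-execution decision**:
 Ops with the same `t_start` are treated as **parallel candidates**.
 Whether they actually run in parallel is decided by the executor using the following criteria:
 - whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
 - whether the predecessor ops listed in `dependency_ids` have completed
 Only ops with non-overlapping address ranges and no explicit dependencies run in parallel.
 **Batch optimization**: only independent ops with the same op_name and **identical shape, dtype,
 layout, and transpose flags** are eligible for batching.
 Example: GEMMs of identical shape across several PEs → fused into a single `np.matmul(a_batch, b_batch)`.
 This improves BLAS efficiency on CPU and reduces launch overhead on GPU.
 **Phase 2 execution-order guarantee**:
 Phase 2 does not consider data arrival times;
 execution order is guaranteed only through dependencies
 (address-based inference + explicit dependency_ids).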
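 The address-overlap criterion can be sketched as follows. Everything here is an assumption made for illustration: ops are plain dicts with `reads` and `writes` lists of `(space, start_addr, nbytes)` tuples, and `split_batch` is a hypothetical stand-in for the executor's `_classify()`. In a real system the `space` key would also need to distinguish per-PE memories.

```python
def ranges_overlap(a, b):
    """a and b are (space, start_addr, nbytes) tuples."""
    return a[0] == b[0] and a[1] < b[1] + b[2] and b[1] < a[1] + a[2]


def has_hazard(first, second):
    """True if `second` must be ordered after `first`."""
    def any_overlap(xs, ys):
        return any(ranges_overlap(x, y) for x in xs for y in ys)
    return (any_overlap(first["writes"], second["reads"])      # RAW
            or any_overlap(first["writes"], second["writes"])  # WAW
            or any_overlap(first["reads"], second["writes"]))  # WAR


def split_batch(batch):
    """Partition one same-t_start batch into independent and sequential ops."""
    independent, sequential = [], []
    for op in batch:
        earlier = independent + sequential
        if op.get("dependency_ids") or any(has_hazard(e, op) for e in earlier):
            sequential.append(op)
        else:
            independent.append(op)
    return independent, sequential


gemm0 = {"name": "gemm0", "reads": [("tcm", 0, 256)], "writes": [("tcm", 1024, 256)]}
gemm1 = {"name": "gemm1", "reads": [("tcm", 4096, 256)], "writes": [("tcm", 8192, 256)]}
exp0 = {"name": "exp0", "reads": [("tcm", 1024, 256)], "writes": [("tcm", 2048, 256)]}
par, seq = split_batch([gemm0, gemm1, exp0])
print([o["name"] for o in par], [o["name"] for o in seq])
```

 Here `exp0` reads the range that `gemm0` writes (a RAW hazard), so it is deferred to the sequential group while the two GEMMs remain parallel candidates.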
 ### D7. Memory Store
 `MemoryStore` logically follows byte-addressable semantics;
 the current implementation uses **tensor-granular storage** (an addr → numpy ndarray mapping).
 ```python
 class MemoryStore:
    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
 ```
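 **Internal storage format: numpy ndarray**
 MemoryStore stores tensors as **numpy ndarrays**.
 | Candidate | store/load speed | Phase 2 compute | Verdict |
 |------|----------------|-------------|------|
 | **numpy ndarray** | immediate (passed by reference, no copy) | `np.matmul` usable directly | **Adopted** |
 | bytearray | memcpy required | `np.frombuffer` conversion required | Rejected |
 | torch tensor | immediate | torch ops available | Used only for GPU optimization |
 - write: stores the numpy array **by reference** (no copy) → Phase 1 overhead = one dict lookup
 - read: returns the numpy array **by reference** (no copy)
 - Re-writing the same addr **overwrites the existing array at tensor granularity** (partial overwrite is not supported)
 - dtypes are numpy-native (`np.float16`, `np.float32`, etc.); NumPy itself has no built-in `bfloat16`,
  so bf16 requires an extension dtype (e.g., from the `ml_dtypes` package)
 - When byte-level access is needed, convert with `.view(np.uint8)`
 - For GPU batch optimization in Phase 2, the executor is responsible for numpy → torch tensor conversion
 **read/write contract**:
 - read/write operate on **contiguous tensors**.
  When a non-contiguous strided view is needed, express it as a separate copy op.
 - On the normal benchmark path, producer and consumer dtypes are expected to match.
  Reinterpret casts are permissive behavior intended for low-level memory validation or
  special test cases.
 - addr is byte-aligned, with a minimum alignment equal to the dtype size.
 - A dtype mismatch (reading with a dtype different from the one written) is handled as a reinterpret cast.
  On a shape mismatch, validation is by nbytes; if nbytes differ, it is an error.
 - Correctness follows address-range-based read/write semantics.
 - A tensor object cache may be used as an implementation optimization, but
  the canonical state is the byte-addressable storage.
 - At deploy time, the host injects the initial tensor data.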
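 ### D8. Benchmark Kernel Code
 The benchmark **user-code API does not change**.
 The call interfaces of `tl.load()`, `tl.composite()`, `tl.store()`, and so on are preserved.
 However, the internal command/message schema may be extended to carry the metadata
 that Phase 2 execution needs (for example, extra fields such as dtype_acc and transpose).
 ### D9. No Component Changes
 Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
 Recording the op_log is the responsibility of the ComponentBase hook.
 Swapping in a custom component replaces only the timing model;
 Phase 2 data execution is unaffected.
 ### D10. Phase 2 is Optional
 ```python
 engine = GraphEngine(graph)
 engine.run(benchmark)                       # Phase 1: timing only
 result = engine.get_timing_result()
 if verify_data:
    # Phase 2: data. initial_store is the Phase 1 MemoryStore snapshot (see D6).
    executor = DataExecutor(engine.op_log, initial_store)
    executor.run()
    executor.verify(expected_output)
 ```
 Skip Phase 2 when only timing analysis is needed.
 With op_logger disabled, Phase 1 performance is also the same as before.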
 ### D11. Verification Contract
 By default, verification compares the **final output tensor** against the reference backend (numpy).
 Tolerance policy per dtype:
 | dtype | Comparison | tolerance |
 |-------|----------|-----------|
 | f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
 | f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
 | bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
 | integer types | `np.array_equal` | exact |
 - Default mode: compare only the final output (end-to-end correctness)
 - Debug mode: intermediate tensors can also be compared per op
  (MemoryStore snapshot at each op boundary)
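To make the tolerance policy concrete, here is a dependency-free sketch of the same comparison. It is an illustration only: the `TOLERANCE` table and `outputs_match` helper are hypothetical names, and the real verifier is described above as using `np.allclose` / `np.array_equal` rather than this loop.

```python
# Hypothetical tolerance table mirroring the dtype policy above: dtype -> (rtol, atol).
TOLERANCE = {
    "f32": (1e-5, 1e-5),
    "f16": (1e-3, 1e-3),
    "bf16": (1e-2, 1e-2),
}

def outputs_match(actual, expected, dtype: str) -> bool:
    """Compare two flat sequences under the per-dtype policy."""
    if len(actual) != len(expected):
        return False
    if dtype not in TOLERANCE:          # integer types: exact match
        return list(actual) == list(expected)
    rtol, atol = TOLERANCE[dtype]
    # Elementwise criterion documented for np.allclose:
    # |actual - expected| <= atol + rtol * |expected|
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))
```

The criterion is asymmetric in `expected`, so the reference output should be passed as the second argument.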
 ---
 ## Non-goals
 - **Compute-result-based control flow**: not supported.
  Every compute handle is pending during Phase 1;
  `wait()` expresses timing synchronization only and does not imply data readiness.
  In Phase 1, accessing `handle.data`, element access, and truth-value evaluation are
  **treated as errors**.
  Branching on memory data (the result of `tl.load()`) is supported through greenlets.
  Phase 1 materialization is a future extension (see D3).
 - **Cycle-accurate overlap reconstruction**: Phase 2 does not reproduce the exact
  execution-time overlap of Phase 1. Phase 2 verifies data correctness only.
 - **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls and do not
  reproduce the microarchitecture of the real hardware PE.
 ## Open Questions
 - **Aliasing / slice view**: how to represent slices/views that reference the same backing
  storage in MemoryStore (stride-based view vs copy semantics)
 - **Generalizing IPCQ/descriptor reads**: whether to fully generalize PE-to-PE communication
  as memory ops or introduce a separate op_kind
 - **Op log streaming**: managing op_log memory usage in large simulations
  (in-memory list vs disk-backed streaming)
 - **Fused operation**: whether to record the tiled pipeline of tl.composite (READ→COMPUTE→WRITE)
  as a single fused op record or split it into individual ops
 - **Generalizing the math op schema**: math params currently have a simple structure, but
  broadcasting rules, per-input dtype, keepdims, scalar/immediate operands, and
  where/mask expressions may need to be generalized
 - **Op record identifier**: dependency_ids are currently based on in-memory list indices;
  introducing a streaming/disk-backed mode requires replacing them with a stable op_id
 - **Phase 1 materialization policy**: see the Future Extension in D3.
  If allowed, the Phase 2 handling of the affected op (skip / verify / recompute) must be defined
 ---
 ## Consequences
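 ### Positive
 - Minimal impact on SimPy simulation performance (only an op_log append is added)
 - Phase 2 is free to use multithreading or GPUs
 - Component-swap freedom is retained (preserves the ADR-0015 design philosophy)
 - No change required to the benchmark user-code API
 - Adding a new message type only requires setting the data_op flag
 - greenlets remove Phase 0 and enable dynamic control flow based on memory data
 - `tl.load()` returns real data, which makes kernels easier to debug
 ### Negative
 - op_log memory usage (in large simulations)
 - Phase 2 execution time scales with tensor size (large GEMMs)
 - No dynamic branching on pending handles (compute not yet complete)
  (compute runs in Phase 2, so result values are undetermined in Phase 1).
  Branching on memory data is supported through greenlets.
 - Adds a dependency on the greenlet C extension (pip install greenlet)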
@@ -0,0 +1,90 @@
 # ADR-0022: 2D Grid program_id Semantics
 ## Status
 Accepted
 ## Context
 Triton kernels use `tl.program_id(axis)` to identify their position in a launch grid.
 Our hardware has a 2-level hierarchy: **cubes** contain **PEs**.
 The previous implementation ignored the `axis` parameter and always returned a flat PE index,
 making it impossible for kernels to distinguish their cube-local position from their cube identity.
 ## Decision
 Map `tl.program_id` and `tl.num_programs` to the 2D hardware grid:
 | Call | Returns | Description |
 |------|---------|-------------|
 | `tl.program_id(axis=0)` | `local_pe_id` | PE index within cube |
 | `tl.program_id(axis=1)` | `cube_id` | Cube index |
 | `tl.num_programs(axis=0)` | `num_pes_per_cube` | PEs per cube |
 | `tl.num_programs(axis=1)` | `num_cubes` | Total cubes |
 Global PID is derived as:
 ```python
 global_pid = tl.program_id(axis=1) * tl.num_programs(axis=0) + tl.program_id(axis=0)
 ```
 ### Axis mapping rationale
 - **axis=0 = PE (innermost)**: PEs within a cube share HBM and communicate via local NOC mesh. This is the fast, tightly-coupled dimension — analogous to threads within a block.
 - **axis=1 = Cube (outer)**: Cross-cube communication goes through UCIe with higher latency. This is the coarser scheduling dimension — analogous to blocks in a grid.
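As a quick sanity check of the mapping, the sketch below flattens and unflattens the 2D coordinates in plain Python. The helper names and the 2 × 4 topology are illustrative assumptions, not part of the `tl` API.

```python
def global_pid(local_pe_id: int, cube_id: int, num_pes_per_cube: int) -> int:
    """Flatten (cube_id, local_pe_id) with the PE axis innermost."""
    return cube_id * num_pes_per_cube + local_pe_id

def split_pid(pid: int, num_pes_per_cube: int) -> tuple[int, int]:
    """Inverse mapping: global PID -> (cube_id, local_pe_id)."""
    return divmod(pid, num_pes_per_cube)

num_cubes, num_pes_per_cube = 2, 4   # example topology, chosen for illustration
pids = [global_pid(pe, cube, num_pes_per_cube)
        for cube in range(num_cubes) for pe in range(num_pes_per_cube)]
```

With the PE axis innermost, iterating cubes in the outer loop yields consecutive global PIDs, so PEs of the same cube occupy a contiguous PID range.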
 ## Implementation
 ### TLContext (`triton_emu/tl_context.py`)
 Added `cube_id` and `num_cubes` constructor parameters. `program_id()` and `num_programs()` dispatch on `axis`:
 ```python
 def program_id(self, axis: int = 0) -> int:
    if axis == 1:
        return self._cube_id
    return self._pe_id
 def num_programs(self, axis: int = 0) -> int:
    if axis == 1:
        return self._num_cubes
    return self._num_programs
 ```
 ### PE_CPU (`components/builtin/pe_cpu.py`)
 - Extracts `num_cubes` from `ctx.spec["system"]["sips"]["cubes_per_sip"]`
 - Passes `cube_id` (already available as `self._cube_idx`) and `num_cubes` to TLContext
 ### KernelRunner (`triton_emu/kernel_runner.py`)
 - Receives `num_cubes` from PE_CPU
 - Passes `cube_id` and `num_cubes` to TLContext in greenlet mode
 ## Backward Compatibility
 - Existing code using `tl.program_id(0)` or `tl.program_id()` is unchanged — returns the same PE index as before.
 - `cube_id` and `num_cubes` default to `0` and `1`, so callers that don't provide them (e.g. unit tests) continue to work.
 ## Usage Example
 ```python
 def sharded_gemm_kernel(a_ptr, b_ptr, out_ptr, M, K, N, tl):
    local_pid = tl.program_id(axis=0)      # PE within cube
    cube_id   = tl.program_id(axis=1)      # which cube
    global_pid = cube_id * tl.num_programs(axis=0) + local_pid
    # Column-wise sharding across global PID
    n_per_pid = N // (tl.num_programs(axis=1) * tl.num_programs(axis=0))
    col_start = global_pid * n_per_pid
    a = tl.load(a_ptr, shape=(M, K), dtype="f16")
    b = tl.ref(b_ptr + col_start * K * 2, shape=(K, n_per_pid), dtype="f16")
    h = tl.composite(op="gemm", a=a, b=b, out_ptr=out_ptr + col_start * M * 2)
    tl.wait(h)
 ```
 ## Consequences
 - Benchmarks can now express cube-aware sharding and addressing without hardcoding topology dimensions.
 - Future axis=2 (SIP-level) can be added following the same pattern if needed.
@@ -0,0 +1,206 @@
 # ADR-0024: SIP-level Launcher — rank = SIP
 ## Status
 Accepted
 ## Context
 ### Goal
 Align the participation unit (rank) of `torch.distributed` collective calls with the
 **SIP** (device) boundary. The goal is bench code that reads **indistinguishably at the
 host level** from a real PyTorch DDP/TP script.
 Comparison with real PyTorch:
 | Dimension | real PyTorch | KernBench |
 | --- | --- | --- |
 | Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
 | `get_rank()` | `RANK` env var | greenlet-local registry |
 | `get_world_size()` | `WORLD_SIZE` env var | number of SIPs in the topology |
 | `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
 | `mp.spawn` | launches OS processes | greenlet fan-out |
 ### Problems to solve
 1. **rank = SIP in the public API**, so that bench workers need not know about the PE concept.
 2. **Greenlet-local rank/device tracking**: within the single-process model, each
   worker greenlet must identify its own rank and its own SIP correctly.
 3. **Tensor placement = structural (sip, cube, pe)**: if rank is a SIP, the default
   tensor placement must also be expressed in structural coordinates.
 ### Non-problems (outside this ADR)
 - IPCQ direction addressing → ADR-0025
 - Removal of `DPPolicy.sip`/`num_sips` → ADR-0026
 - Megatron-style TP → ADR-0027
 - DTensor → ADR-0028 (future)
 - Worker scheduling / `mp.spawn` / collective drain / exception cleanup
  → ADR-0027 D0/D1
 - Collective algorithm implementation (intercube_allreduce, SFR config) → ADR-0032
 ## Decision
 ### D1. rank = SIP (world_size resolution)
 ```python
 def _resolve_world_size(self) -> int:
    if "world_size" in self._merged:
        return int(self._merged["world_size"])
    defaults = self._cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    spec = self.ctx.spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
 ```
 Priority: algorithm override > defaults override > SIP count. The `ccl.yaml`
 override is kept as the legacy "rank = PE" test path.
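The priority order can be exercised in isolation with a standalone function. This is a sketch for illustration: it takes the three config sources as plain arguments instead of reading `self._merged`, `self._cfg_all`, and `self.ctx.spec`, and the sample `spec` is made up.

```python
from typing import Optional

def resolve_world_size(merged: dict, cfg_all: dict, spec: Optional[dict]) -> int:
    """Standalone version of the three-level lookup in D1."""
    if "world_size" in merged:                       # 1. algorithm override
        return int(merged["world_size"])
    defaults = cfg_all.get("defaults", {})
    if "world_size" in defaults:                     # 2. defaults override
        return int(defaults["world_size"])
    spec = spec or {}                                # 3. SIP count from the topology
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))

spec = {"system": {"sips": {"count": 4}}}            # example topology spec
by_topology = resolve_world_size({}, {}, spec)
by_defaults = resolve_world_size({}, {"defaults": {"world_size": 8}}, spec)
by_algorithm = resolve_world_size({"world_size": 2}, {"defaults": {"world_size": 8}}, spec)
```

When no source provides a value, the lookup falls back to a world size of 1, which matches the single-driver case.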
 ### D2. Greenlet-local rank registry (+ debug warning)
 ```python
 import os
 import warnings
 class DistributedContext:
    def __init__(self):
        self._backend = None
        self._rank_by_greenlet: dict = {}
    def _bind_rank(self, g, rank: int) -> None:
        self._rank_by_greenlet[g] = int(rank)
    def get_rank(self) -> int:
        self._ensure_initialized()
        from greenlet import getcurrent
        g = getcurrent()
        if g not in self._rank_by_greenlet:
            if os.environ.get("KERNBENCH_DEBUG"):
                warnings.warn(
                    "get_rank() called outside a bound greenlet — returning 0. "
                    "Likely a bug unless running single-driver."
                )
            return 0
        return int(self._rank_by_greenlet[g])
 ```
 ### D3. `torch.ahbm.set_device(rank)`: SIP binding
 The KernBench backend is named `ahbm` (ADR-0023). Real PyTorch uses
 `torch.cuda.set_device(r)`, but since our backend is not CUDA we use an honestly named
 namespace.
 ```python
 class _AhbmNamespace:
    """torch.ahbm — per-greenlet SIP device binding.
    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
    """
    def __init__(self):
        self._device_by_greenlet: dict = {}
    def set_device(self, device: int) -> None:
        from greenlet import getcurrent
        self._device_by_greenlet[getcurrent()] = int(device)
    def current_device(self) -> int | None:
        from greenlet import getcurrent
        return self._device_by_greenlet.get(getcurrent())
 # Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
 # Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
 ```
 **Parallel support for the PyTorch 2.x style**: recent PyTorch moves toward the device-agnostic
 `torch.accelerator` namespace (`torch.accelerator.set_device_index(r)`,
 `torch.accelerator.current_device_index()`). For users who want to write code that is not
 tied to a device vendor, KernBench supports this surface as well.
 ```python
 class _AcceleratorNamespace:
    """torch.accelerator — device-agnostic API (PyTorch 2.x style).
    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
        torch.accelerator.set_device_index(rank)
        torch.accelerator.current_device_index()
    """
    def __init__(self, ahbm: _AhbmNamespace):
        self._ahbm = ahbm
    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)
    def current_device_index(self) -> int | None:
        return self._ahbm.current_device()
 # RuntimeContext
 self.ahbm = _AhbmNamespace()
 self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias
 ```
 Bench authors choose one of the following; both are backed by the same registry internally:
 ```python
 torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
 torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
 ```
 ### D4. Tensor placement = structural (sip, cube, pe) coordinates
 `resolve_dp_policy` takes `target_sip` directly and generates the placement in structural coordinates.
 See ADR-0026 for details.
 ```python
 # RuntimeContext._create_tensor
 current_sip = self.ahbm.current_device()          # (D3 naming)
 if current_sip is None:
    current_sip = 0  # single-driver fallback (consistent with D2)
 placement = resolve_dp_policy(
    dp, shape=shape_2d, itemsize=itemsize,
    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
    target_sip=current_sip,
 )
 ```
 There is no post-hoc `pe_index` shifting: ShardSpec holds the structural `(sip, cube, pe)`
 coordinates directly. See ADR-0026 for ShardSpec details.
 ---
 ## Dependencies
 - **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
 - **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature used in D4 and
  the structural-coordinate representation of ShardSpec.
 - **ADR-0027** (Megatron TP + scheduler): the implementation reference for worker scheduling,
  `mp.spawn`, collective drain, and exception cleanup.
 ---
 ## Non-goals
 - **IPCQ protocol changes**: ADR-0023 stays as is.
 - **DPPolicy field cleanup**: ADR-0026.
 - **Megatron-style TP**: ADR-0027.
 - **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
 - **Collective algorithm implementation**: ADR-0032.
 - **Multi-node (cross-process)**: out of scope; a single process only.
 ---
 ## Consequences
 ### Positive
 - **Bench = real PyTorch DDP** (from the public-API perspective).
 - **Greenlet-local rank**: enables cross-rank correctness in the single-process model.
 - **Structural placement coordinates**: the other ADRs (ADR-0026, ADR-0027, ADR-0032)
  operate consistently on the `(sip, cube, pe)` 3-tuple.
 ### Neutral
 - The IPCQ PE-level protocol (ADR-0023) is unchanged.
 - The IO_CPU role is unchanged (the existing transit path stays as is).
@@ -0,0 +1,283 @@
 # ADR-0025: IPCQ Direction Addressing — address-based matching
 ## Status
 Accepted (Revision 2 — Address-based matching; peer_direction field dropped)
 ## Context
 ### Goal
 Make **the identification of "which direction pair a transfer goes through"** in the ADR-0023
 IPCQ protocol consistently **address-based**, without depending on topology or dict order.
 It must work correctly on a 2-rank bidirectional ring (and, in general, on any topology
 where several directions point to the same peer).
 ### Exposed bug: 2-rank bidirectional ring
 `ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions point to the same peer.
 **Bug 1 (install)**:
 - `reverse_direction(0, 1)` → returns "E" by dict order (wrong; "W" is correct under the
  opposite-direction convention)
 - rank 0's E entry is set to `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`
 - tl.send(E) → data lands in sip1's E-rx buffer (should be W-rx)
 **Bug 2 (runtime)**:
 - Even if install sets the correct address, the receiver's `_handle_meta_arrival`
  matches the direction using sender coordinates alone → the first direction (E) wins
 - peer_head_cache[E] increases while peer_head_cache[W] stays unchanged
 - The kernel's tl.recv(W) waits on peer_head_cache[W] → blocks forever → IpcqDeadlock
 ### Root cause
 The same problem appears on two axes:
 1. **Install-time pairing**: the decision "which of the peer's directions pairs with my direction"
   depends on dict iteration order → fragile when several directions point to the
   same peer
 2. **Runtime identification**: the decision "which qp must be updated" is made from sender
   coordinates alone → ambiguous when directions are duplicated
 ### Approach: address-based matching
 Each PE's rx buffers sit in an **address range unique to each direction** (rx_base_pa +
 direction_idx × bytes_per_direction). Therefore:
 - **Runtime**: match by **dst_addr range** instead of sender coord → unambiguous
 - **Install**: a heuristic that prefers the opposite direction (the natural symmetry of
  ring / mesh)
 - Duplicate metadata such as `peer_direction` is unnecessary: **the address is the single
  source of truth**
 This design works **independently of the PhysAddr transition (ADR-0030)**. Whether addresses are
 the current synthetic ones or PhysAddr, it applies unchanged as long as per-direction range uniqueness holds.
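The address layout that makes this matching unambiguous can be sketched as below. All constants and helper names are assumptions for illustration (direction slot order, queue geometry, `rx_range`, `direction_of`); the real implementation keeps per-direction bases in its queue pairs, as shown in D2.

```python
DIRECTIONS = ["N", "S", "E", "W"]      # assumed slot order, for illustration
N_SLOTS, SLOT_SIZE = 4, 256            # assumed queue geometry
BYTES_PER_DIRECTION = N_SLOTS * SLOT_SIZE

def rx_range(rx_base_pa: int, direction: str) -> tuple[int, int]:
    """Half-open rx buffer range [start, end) for one direction of a PE."""
    start = rx_base_pa + DIRECTIONS.index(direction) * BYTES_PER_DIRECTION
    return start, start + BYTES_PER_DIRECTION

def direction_of(rx_base_pa: int, dst_addr: int):
    """Return the direction whose rx range contains dst_addr, else None."""
    for d in DIRECTIONS:
        start, end = rx_range(rx_base_pa, d)
        if start <= dst_addr < end:
            return d
    return None
```

Because the ranges are disjoint, a destination address selects exactly one direction even when two directions share the same peer, which is the property the sender-coordinate match lacked.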
 ---
 ## Decision
 ### D1. Install — `reverse_direction` opposite-preference
 `src/kernbench/ccl/install.py`:
 ```python
 # Extended in ADR-0032 with global_* pairs for inter-SIP directions,
 # which were introduced by configure_sfr_intercube_multisip to keep
 # intercube (N/S/E/W) and inter-SIP (global_N/S/E/W) namespaces disjoint.
 _OPPOSITE_DIR = {
    "E": "W", "W": "E", "N": "S", "S": "N",
    "global_E": "global_W", "global_W": "global_E",
    "global_N": "global_S", "global_S": "global_N",
 }
 def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
    """Find peer's direction that reciprocates my_dir→peer_rank.
    Prefer the OPPOSITE direction (E↔W, N↔S) when the peer has it
    pointing back to us. This matters in 2-rank bidirectional rings
    where both E and W on one side point to the same peer — without
    the preference, the first-match-wins iteration would route data
    into the wrong rx slot. Falls back to any direction pointing back
    for topologies without an opposite convention (tree_binary's
    parent/child).
    """
    nt = neighbor_table[peer_rank]
    opp = _OPPOSITE_DIR.get(my_dir)
    if opp is not None and nt.get(opp) == my_rank:
        return opp
    for d, target in nt.items():
        if target == my_rank:
            return d
    return None
 ```
 Call site:
 ```python
 for d, peer_rank in nbrs.items():
    peer_dir = reverse_direction(r, peer_rank, d)  # pass my_dir
    if peer_dir is None:
        continue
    ...
 ```
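The effect of the opposite-preference rule on the 2-rank ring can be checked with a self-contained variant. This sketch differs from the code above in one respect, which is an assumption made for testability: `neighbor_table` is passed as an explicit argument instead of being captured from the enclosing scope. The `tree` table is a made-up example of a topology without an opposite convention.

```python
OPPOSITE = {"E": "W", "W": "E", "N": "S", "S": "N"}

def reverse_direction(neighbor_table, my_rank, peer_rank, my_dir):
    """Opposite-preference lookup; neighbor_table is passed in explicitly here."""
    nt = neighbor_table[peer_rank]
    opp = OPPOSITE.get(my_dir)
    if opp is not None and nt.get(opp) == my_rank:
        return opp
    for d, target in nt.items():      # fallback: first direction pointing back
        if target == my_rank:
            return d
    return None

# 2-rank bidirectional ring: both directions of each rank point to the other.
ring2 = {0: {"E": 1, "W": 1}, 1: {"E": 0, "W": 0}}
pairs = {d: reverse_direction(ring2, 0, 1, d) for d in ring2[0]}

# A table without an opposite convention falls back to the first match.
tree = {0: {"child": 1}, 1: {"parent": 0}}
tree_pair = reverse_direction(tree, 0, 1, "child")
```

On the ring, E pairs with W and W pairs with E regardless of dict order, which is exactly what Bug 1 violated.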
 ### D2. Runtime: `_handle_meta_arrival` dst_addr matching
 `src/kernbench/components/builtin/pe_ipcq.py`:
 ```python
 def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
    """Match incoming token to the receiver-side direction by dst_addr range.
    Each direction has a unique rx buffer address range
    [my_rx_base_pa, my_rx_base_pa + n_slots * slot_size). The token's dst_addr (set by
    the sender's IPCQ when computing peer's slot address) falls within
    exactly one such range. This address-based matching is unambiguous
    even when multiple directions have the same peer (2-rank ring).
    """
    token = msg.token
    dst_addr = token.dst_addr
    for d, qp in self._queue_pairs.items():
        base = qp["my_rx_base_pa"]
        size = qp["n_slots"] * qp["slot_size"]
        if base <= dst_addr < base + size:
            qp["peer_head_cache"] = max(qp["peer_head_cache"],
                                         token.sender_seq + 1)
            self._arrived_tokens.setdefault(d, []).append(token)
            waiters = self._recv_waiters.get(d, [])
            self._recv_waiters[d] = []
            for ev in waiters:
                if not ev.triggered:
                    ev.succeed()
            any_waiters = self._any_recv_waiters
            self._any_recv_waiters = []
            for ev in any_waiters:
                if not ev.triggered:
                    ev.succeed()
            return
    # Unknown dst_addr — diagnostic log (should not happen under correct install)
 ```
 The sender-coordinate check is **removed**. `dst_addr` already determines the direction.
 ### D3. Credit: add the `dst_rx_base_pa` field
 `src/kernbench/common/ipcq_types.py`:
 ```python
 @dataclass(frozen=True)
 class IpcqCreditMetadata:
    consumer_seq: int
    dst_rx_base_pa: int       # NEW: matched against the original sender's peer.rx_base_pa
    # existing fields (kept for diagnostics / logging)
    src_sip: int
    src_cube: int
    src_pe: int
    src_direction: str
 ```
 When a credit is created (`_delayed_credit_send`), it carries the `my_rx_base_pa` of its own
 direction as `dst_rx_base_pa` (this is the `peer.rx_base_pa` the other side used when it sent).
 Receiver side (`_credit_worker`):
 ```python
 def _credit_worker(self, env):
    while True:
        credit = yield self._credit_inbox.get()
        for d, qp in self._queue_pairs.items():
            # find the qp whose peer rx_base_pa equals the credit's dst_rx_base_pa
            if qp["peer"].rx_base_pa == credit.dst_rx_base_pa:
                qp["peer_tail_cache"] = max(qp["peer_tail_cache"],
                                              credit.consumer_seq)
                waiters = self._send_waiters.get(d, [])
                self._send_waiters[d] = []
                for ev in waiters:
                    if not ev.triggered:
                        ev.succeed()
                break
 ```
 The sender-coordinate check is removed. Matching on `dst_rx_base_pa` is unambiguous.
 ### D4. Do **not** add a `peer_direction` field to `IpcqInitEntry`
 The `IpcqInitEntry.peer_direction` field proposed in ADR-0025 rev 1 is **unnecessary**.
 Reasons:
 - Meta arrival matches on dst_addr (D2)
 - Credit matches on dst_rx_base_pa (D3)
 - The qp does not need to store peer_direction
 - Install uses peer_dir only internally when computing rx_base_pa (`reverse_direction`)
 The IpcqInitEntry schema is unchanged, which is a **simplification** relative to rev 1.
 ### D5. Keep `IpcqDmaToken.src_direction` (diagnostic only)
 The existing `src_direction` field is not removed. It is kept for these uses:
 - Logging / trace: the `(rank, t, dir, nbytes)` in `KERNBENCH_CCL_TRACE=1` output
 - Diagnostics: showing the direction in pointer_dump and similar tools
 - Room for future extension
 Runtime matching uses `dst_addr` only.
 ### D6. Invariants (strengthening ADR-0023 I3)
 **I3 (strict)**: for each direction pair `(my_direction, peer_direction)`, my
 rx_base and the peer rx_base must point to **distinct direction slots**. Install must
 guarantee this (reverse_direction opposite-preference).
 **I3.1 (new)**: for every qp, `qp["my_rx_base_pa"]` and `qp["peer"].rx_base_pa`
 occupy mutually disjoint address ranges (buffers of different directions never
 overlap). This is the premise of the address-based matching in D2/D3.
 This can be verified at install time:
 ```python
 # ccl/install_plan.py: assertion at the end of build_install_plans
 all_rx_ranges = set()
 for plan in plans:
    for pe_install in plan.pe_installs:
        for entry in pe_install.neighbors:
            r = (entry.my_rx_base_pa,
                 entry.my_rx_base_pa + plan.n_slots * plan.slot_size)
            overlap = any(_ranges_overlap(r, e) for e in all_rx_ranges)
            assert not overlap
            all_rx_ranges.add(r)
 ```
 ---
 ## Dependencies
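 - **ADR-0023** (IPCQ protocol): this ADR modifies ADR-0023's runtime matching logic
  (D2, D3) and improves the install heuristic (D1). The semantic layer of the IPCQ
  protocol is unchanged.
 - **ADR-0024** (launcher): the ws=SIP_count model of ADR-0024 is where a 2-rank
  bidirectional ring is actually used. This ADR makes that case work.
 - **ADR-0030** (PhysAddr transition, stub): **independent**. The address-based matching of
  ADR-0025 works the same with the current synthetic addresses or with PhysAddr.
 ---
 ## Non-goals
 - **Moving the IPCQ address scheme to PhysAddr**: ADR-0030 scope. This ADR is independent of
  how addresses are encoded.
 - **Multi-hop routing**: the single-hop DMA write premise of ADR-0023 D5 is kept.
 - **Unidirectional ring specialization**: `ring_1d_unidir` has only one direction, so this bug
  does not apply.
 ---
 ## Open questions
 - **Address-matching performance**: `_handle_meta_arrival` and `_credit_worker` scan the qps
  linearly (at most 4 directions). The performance impact is negligible. If it becomes a
  problem, this can switch to a dict lookup (`_qp_by_rx_base`).
 - **Re-evaluating the need for `IpcqDmaToken.src_direction`**: whether to keep a field retained
  only for diagnostics, or move it out of the token into logging. Kept for now.
 - **Install-time invariant verification cost**: the I3.1 check in D6 is O((N_PE × N_direction)^2).
  It may become slow on large topologies → can be improved with a data structure such as an
  interval tree. Simple implementation first.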
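If the quadratic check ever becomes a bottleneck, one lighter alternative to an interval tree is a sort-and-scan pass. The sketch below is an illustration of that idea, not a planned implementation; `find_overlap` is a hypothetical name and ranges are assumed to be non-empty half-open `(start, end)` tuples.

```python
def find_overlap(ranges):
    """Return the first overlapping pair of half-open ranges, or None.

    After sorting by start, any overlap in the set also shows up between
    two neighbours, so one linear scan suffices: O(n log n) overall.
    """
    ordered = sorted(ranges)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur[0] < prev[1]:
            return prev, cur
    return None

disjoint = [(0x2000, 0x2400), (0x1000, 0x1400), (0x1400, 0x1800)]
clashing = [(20, 30), (0, 10), (5, 8)]
```

This only answers whether the whole set is pairwise disjoint, which is all the install-time assertion needs.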
 ---
 ## Consequences
 ### Positive
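 - **Simplicity**: the duplicate `peer_direction` metadata is removed. The address is the single source of truth.
 - **Unambiguous matching**: works on every topology (including duplicated directions).
 - **Minimal schema change**: `IpcqInitEntry` is unchanged; one field is added to `IpcqCreditMetadata`.
 - **Independent of the PhysAddr transition (ADR-0030)**: address-based matching does not depend on the address encoding.
 - **Diagnostics retained**: `IpcqDmaToken.src_direction` stays for logging.
 ### Negative
 - Because runtime matching is now an address comparison, debugging questions such as "why was
  W updated instead of peer_head_cache[E]" require tracing address ranges (previously the
  direction name was enough). Mitigation: include the "direction ↔ rx_base_pa" mapping in pointer_dump.
 ### Neutral
 - The semantic layer of the IPCQ protocol (sender computes dst_addr, receiver receives) is
  unchanged.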
@@ -0,0 +1,288 @@
 # ADR-0026: DPPolicy = Intra-Device Only (remove the sip/num_sips fields)
 ## Status
 Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)
 ## Context
 ### Goal
 Clarify `DPPolicy` as a pure intra-device abstraction that expresses only the
 **cube × PE distribution inside one device (SIP)**. Distribution across SIPs (TP) is split
 into a separate layer (handled by `torch.ahbm.set_device(rank)` from ADR-0024 or by the
 Megatron parallel layers of ADR-0027).
 ## Decision
 ### D1. Remove the `sip` and `num_sips` fields from `DPPolicy`
 ```python
 @dataclass(frozen=True)
 class DPPolicy:
    """Intra-device (cube × PE) data-parallel policy.
    SIP-level placement is controlled by ``torch.ahbm.set_device(rank)``
    (ADR-0024 D3) and, for model-level TP, by Megatron-style parallel
    layers (ADR-0027). DPPolicy does not cross SIP boundaries.
    """
    cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
    pe: Literal["replicate", "column_wise", "row_wise"] = "replicate"
    num_pes: int | None = None
    num_cubes: int | None = None
 ```
 Removed fields: `sip`, `num_sips`.
 ### D2. `ShardSpec`: structural (sip, cube, pe) coordinates, `pe_index` removed entirely
 Today `ShardSpec.pe_index` is a **global flat index** (`sip × cubes × pes + cube ×
 pes + pe`). This is the form that ADR-0024 D4 flagged as "abstraction leakage".
 This ADR **redefines ShardSpec in structural coordinates** and does **not keep**
 `pe_index` even as a property:
 ```python
 # src/kernbench/policy/placement/dp.py (after)
 @dataclass(frozen=True)
 class ShardSpec:
    """Structural shard placement — intra-SIP (cube × PE) coord.
    Global-flat `pe_index` was removed in ADR-0026. Callers must use
    structural coords (sip, cube, pe) directly. If a flat integer key is
    needed (e.g. dict lookup), compute it explicitly at the call site.
    """
    sip: int              # structural — which SIP this shard lives on
    cube: int             # local within SIP
    pe: int               # local within cube
    offset_bytes: int
    nbytes: int
 ```
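 **Core principles**:
 - The identity of a ShardSpec is the `(sip, cube, pe)` 3-tuple.
 - **No `pe_index` property either**, which blocks silent semantics drift.
 - Existing callers that expected a global flat index get an **immediate
  `AttributeError`** on `.pe_index` access → they must migrate to structural coordinates.
 - In local contexts that need a flat integer key (e.g. an internal dict lookup), the caller
  computes `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe` explicitly.
 **Justification for removing the property**: KernBench is an in-house project with a limited
 set of call sites. Explicit breakage (AttributeError) is much safer than the risk of silent
 drift (the meaning changes while the type stays int).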
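For call sites that do need a flat integer, the explicit computation could look like the sketch below. The `ShardSpec` fields are copied from the definition above; `flat_key` and the topology sizes are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ShardSpec:                      # same fields as the definition above
    sip: int
    cube: int
    pe: int
    offset_bytes: int
    nbytes: int

def flat_key(spec: ShardSpec, n_cubes: int, n_pe: int) -> int:
    """Hypothetical call-site helper: global flat index from structural coords."""
    return spec.sip * n_cubes * n_pe + spec.cube * n_pe + spec.pe

spec = ShardSpec(sip=1, cube=2, pe=3, offset_bytes=0, nbytes=4096)
key = flat_key(spec, n_cubes=4, n_pe=8)      # 1*32 + 2*8 + 3
```

Keeping the topology sizes as explicit arguments makes the flat-index convention visible at the call site instead of hiding it behind an attribute.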
 ### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates
 Implements the contract of ADR-0024 D4. No post-hoc shifting.
 ```python
 # src/kernbench/policy/placement/dp.py (after)
 @dataclass(frozen=True)
 class _LocalPeShard:
    """Internal: return value of the PE resolvers. Cube-local PE identifier plus payload."""
    local_pe: int                  # cube-local PE index (0..num_pe-1)
    offset_bytes: int
    nbytes: int
 def resolve_dp_policy(
    policy: DPPolicy,
    *,
    shape: tuple[int, int],
    itemsize: int,
    num_pe: int,
    num_cubes: int = 1,
    target_sip: int,       # NEW: states explicitly which SIP to place on
 ) -> list[ShardSpec]:
    """2-level resolution (cube × PE) on a specified SIP.
    Returns ShardSpecs with structural coords (sip=target_sip, cube, pe).
    No SIP-level split — DPPolicy is intra-device only.
    """
    resolver = _PE_RESOLVERS[policy.pe]
    all_shards: list[ShardSpec] = []
    # Level 1: cube within SIP
    cube_splits = _split_shape(policy.cube, shape, num_cubes, itemsize)
    for cube_id, (cube_shape, cube_offset) in enumerate(cube_splits):
        # Level 2: PE within cube — resolver returns _LocalPeShard (local_pe)
        local_shards = resolver(shape=cube_shape, itemsize=itemsize,
                                 num_pe=num_pe)
        for ls in local_shards:
            all_shards.append(ShardSpec(
                sip=target_sip,                   # from caller (current_device)
                cube=cube_id,                     # local within SIP
                pe=ls.local_pe,                   # local within cube (explicit name)
                offset_bytes=cube_offset + ls.offset_bytes,
                nbytes=ls.nbytes,
            ))
    return all_shards
 ```
 **Internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a list of
 `_LocalPeShard`; the `local_pe` field name makes it **explicit that this is a cube-local PE
 identifier**. This resolves the past naming confusion with `ShardSpec.pe_index`.
 **Naming conventions** (across the ADR):
 - `ShardSpec.pe`: the final external API, a cube-local PE (structural coord)
 - `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage
 - `pe_index`: **removed**. It remains nowhere, externally or internally (a side effect of
  blocking silent drift: the name does not reappear).
 ### D4. `_create_tensor`: direct placement in structural coordinates
 A continuation of ADR-0024 D4. Post-hoc shifting is removed; the structural coordinate is
 specified directly at the `resolve_dp_policy` call.
 ```python
 # context.py _create_tensor (after)
 current_sip = self.ahbm.current_device()
 if current_sip is None:
    # Single-driver fallback (consistent with ADR-0024 D2).
    # If launcher-based code forgets set_device(), tensors silently land on
    # SIP 0, so warn in debug mode.
    if os.environ.get("KERNBENCH_DEBUG"):
        import warnings
        warnings.warn(
            "torch.ahbm.current_device() is None; defaulting to SIP 0. "
            "If this is a multi-rank launcher context, you likely forgot "
            "torch.ahbm.set_device(rank) inside the worker.",
            stacklevel=2,
        )
    current_sip = 0
 placement = resolve_dp_policy(
    dp,
    shape=shape_2d,
    itemsize=itemsize,
    num_pe=eff_num_pe,
    num_cubes=eff_num_cubes,
    target_sip=current_sip,          # ← structural coordinate specified up front
 )
 # Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
 # The former post-hoc shifting block is removed entirely.
 ```
 **Every** tensor is placed on the current-device SIP. To create a multi-SIP tensor,
 use the TP primitives of ADR-0027.
 **Trade-off of the single-driver fallback**: defaulting to SIP 0 for calls without set_device
 is kept for compatibility with existing single-driver tests. Under `KERNBENCH_DEBUG=1`,
 a warning makes it possible to detect a tensor silently placed on the wrong SIP because
 set_device was accidentally omitted in a launcher context.
 ### D5. Downstream: allocator lookup by structural tuple key
 Existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
 ```python
 for spec in placement:
    alloc = allocators[spec.pe_index]       # ← AttributeError (property removed)
 ```
 Since `pe_index` is gone, migration to structural coordinates is **forced**:
 ```python
 for spec in placement:
    alloc = allocators[(spec.sip, spec.cube, spec.pe)]
 ```
 The dict population in `_ensure_allocators` also uses tuple keys:
 ```python
 # context.py _ensure_allocators (after)
 for sip_id in sip_range:
    for cube_id in range(cubes_per_sip):
        for pe_id in range(pes_per_cube):
            self._allocators[(sip_id, cube_id, pe_id)] = PEMemAllocator(
                rack_id=0, sip_id=sip_id, cube_id=cube_id, pe_id=pe_id, cfg=cfg,
            )
 ```
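 `_free_tensor` follows suit: the existing `flat_idx = sip * ... + cube * ... + pe` computation
 block is removed and `(shard.sip, shard.cube, shard.pe)` is used directly.
 **Tuple vs dataclass `PEIdentity`**: a tuple is recommended because it is simple and directly
 hashable. A `PEIdentity` value object has the advantage of an explicit type, but it adds
 boilerplate and would serve only as the key of the current allocator dict, so it would be
 over-engineering. Keep the tuple.
 ### D7. Backward compatibility: none (cleanup ADR)
 This ADR is a **breaking change**.
 1. Calling `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)` → `TypeError`
 2. Accessing `ShardSpec.pe_index` → `AttributeError`
 Both are **immediate, explicit breakage**. There is no deprecation warning or fallback path.
 KernBench is an in-house project with a limited set of call sites, so migration happens in one pass.
 **Blocking silent drift** is the main benefit of removing the property entirely: it eliminates
 the possibility that code expecting a global flat index receives a SIP-local result and silently indexes incorrectly.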
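The explicit-breakage behaviour follows directly from how dataclasses reject unknown keyword arguments, and a migration test could pin it down along these lines. The `DPPolicy` fields are copied from D1; `rejects_removed_field` is a hypothetical test helper, not part of KernBench.

```python
from dataclasses import dataclass
from typing import Literal, Optional

Mode = Literal["replicate", "column_wise", "row_wise"]

@dataclass(frozen=True)
class DPPolicy:                       # fields copied from D1
    cube: Mode = "replicate"
    pe: Mode = "replicate"
    num_pes: Optional[int] = None
    num_cubes: Optional[int] = None

def rejects_removed_field(**kwargs) -> bool:
    """True if constructing DPPolicy with these kwargs raises TypeError."""
    try:
        DPPolicy(**kwargs)
    except TypeError:
        return True
    return False
```

A check like `rejects_removed_field(sip="column_wise")` keeps the removed fields from quietly returning in a later refactor.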
 ## Dependencies
 - **ADR-0024** (launcher): `set_device(rank)` and current-device scoping provide the
  SIP placement mechanism. This ADR builds on that and narrows DPPolicy to pure
  intra-device.
 - **ADR-0027** (Megatron TP): the alternative path when a tensor must span multiple SIPs.
  After this ADR is applied, multi-SIP use cases move to ADR-0027.
 ---
 ## Non-goals
 - **Redesign of `DPPolicy.cube` / `pe`**: the existing replicate/column_wise/row_wise
  meanings are kept.
 - **Tiling policy integration**: `tiled_column_major` / `tiled_row_major` stay as they are.
 - **New multi-device tensor abstraction**: a DTensor-like abstraction is ADR-0028.
 ---
 ## Open questions
 - **Default for current_sip in `_create_tensor`**: whether calls without set_device should fall
  back to rank=0 (SIP 0) or raise an error. The recommendation is the fallback (compatibility
  with existing single-driver tests).
 - **Scope of the `test_sip_parallel.py` rewrite**: moving the existing unit tests to the
  launcher while preserving their intent needs additional fixtures. Scoped as separate work.
 - **Meaning of `num_sips=None` on `DPPolicy`**: once the field is gone, the `num_sips` concept
  itself disappears. The explicit answer is to use the TP primitives of ADR-0027 to express
  multi-SIP placement.
 **Resolved (items that were open in earlier revisions)**:
 - ~~Whether to keep the `ShardSpec.pe_index` property~~ → **removed entirely** (D2)
 - ~~Dict key format of `_ensure_allocators`~~ → **tuple `(sip, cube, pe)`** (D5)
 ---
 ## Consequences
 ### Positive
 - **Clear separation of concepts**: DPPolicy = intra-device, TP = inter-device.
 - **Simpler API**: DPPolicy constructor fields reduced by ~33% (6 → 4).
 - **Consistent structural coordinates**: ShardSpec is expressed as the `(sip, cube, pe)` tuple →
  abstraction leakage resolved (ADR-0024 D4 contract satisfied).
 - **No ambiguous `pe_index`**: the field is removed (D2); a global flat index, when needed,
  is computed explicitly at the call site.
 - **Launcher model consistency**: the "1 worker per SIP" model of ADR-0024 is the only
  mechanism that controls SIP boundaries.
 ### Negative
 - **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`,
  `spec.pe_index` → `AttributeError`. All callers must be fixed at once.
 - **ShardSpec schema change**: the single `pe_index` field → three fields `sip`/`cube`/`pe`.
  Cascading downstream changes (`deploy_tensor`, `_free_tensor`, `_ensure_allocators`,
  the `allocators` dict key, etc.).
 - **No silent drift**: removing the property entirely makes failures immediate at runtime →
  migration leakage is cut off at the source. (Not a negative, but an explicit tradeoff.)
 - Cost of rewriting `test_sip_parallel.py`.
 ### Neutral
 - The meaning of the existing `cube` / `pe` fields is unchanged.
@@ -0,0 +1,888 @@
 # ADR-0027: Megatron-style Tensor Parallelism API
 ## Status
 Accepted
 ## Context
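 ### Goal
 Support tensor parallelism (TP) across SIPs through an explicit
 **Megatron-LM-style parallel layer** API. Declarative abstractions such as
 DTensor are future work in a separate ADR (0028).
 Reasons for choosing the Megatron style:
 - TP occurs at specific layer boundaries of the model, so explicit primitives
  fit the mental model naturally.
 - It is the industry standard established by NVIDIA Megatron and DeepSpeed.
 - DTensor is declarative and has a larger design space → handled in stages.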
 ### TP primitive spec (following Megatron-LM)
 - **ColumnParallelLinear**: shards the weight along the **column
  (out_features)** axis across TP ranks. The input is fully replicated and the
  output is column-sharded. No forward all-reduce is needed when a
  RowParallelLinear follows.
 - **RowParallelLinear**: shards the weight along the **row (in_features)**
  axis across TP ranks. The input is already column-sharded (the output of
  ColumnParallel). An **all-reduce** is required at the end of forward.
 - **VocabParallelEmbedding**: shards the embedding along the vocab axis, with
  an all-reduce at the end of forward. (Stub in the initial scope; the real
  implementation needs an all-gather kernel first.)
 - **`copy_to_tp_region`**, **`reduce_from_tp_region`**, **`scatter_to_tp_region`**,
  **`gather_from_tp_region`**: the basic primitives.
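A small numerical sketch may help show why a ColumnParallel layer followed by a RowParallel layer needs only one all-reduce. It uses plain Python lists and hypothetical helper names (`matmul`, `split_cols`, `split_rows`, `all_reduce_sum`) rather than KernBench kernels, and it assumes no nonlinearity between the two layers.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_cols(m, ws):
    """Split a matrix into `ws` equal column blocks (ColumnParallel shards)."""
    k = len(m[0]) // ws
    return [[row[r * k:(r + 1) * k] for row in m] for r in range(ws)]


def split_rows(m, ws):
    """Split a matrix into `ws` equal row blocks (RowParallel shards)."""
    n = len(m) // ws
    return [m[r * n:(r + 1) * n] for r in range(ws)]


def all_reduce_sum(parts):
    """Element-wise sum of the per-rank partial results."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*parts)]


ws = 2
x = [[1, 2], [3, 4]]                    # (M=2, N=2), replicated on every rank
w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]       # (N=2, H=4), column-sharded
w2 = [[1, 2], [0, 1], [2, 0], [1, 1]]   # (H=4, K=2), row-sharded

# Each rank runs a local column-parallel GEMM, then a local row-parallel GEMM.
partials = [
    matmul(matmul(x, w1_r), w2_r)
    for w1_r, w2_r in zip(split_cols(w1, ws), split_rows(w2, ws))
]
tp_result = all_reduce_sum(partials)    # the single all-reduce at the end
reference = matmul(matmul(x, w1), w2)   # unsharded computation
```

The column-sharded intermediate never has to be gathered: each rank multiplies its own column block by the matching row block, and the sum over ranks reproduces the full product.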
 ### Problems to solve
 1. **Generalizing worker-wait (D0)**: extend the defer/yield/drain pattern of
   `dist.all_reduce` to every `ctx.wait` path. **This is the largest
   architectural decision in this ADR.**
 2. **Normalizing the launcher API (D1)**: the current benches use hand-rolled
   greenlet loops. Absorbing them into
   `torch.multiprocessing.spawn(fn, args, nprocs)` keeps the real-PyTorch API
   surface and concentrates the D0 scheduler drain in a single implementation.
 3. **Representing per-rank weight distribution**: each worker owns its own
   slice of the weight tensor. ADR-0024's `set_device(rank)` plus ADR-0026's
   intra-device DPPolicy express this naturally.
 4. **Forward-only scope**: KernBench currently has no backward pass (its
   purpose is simulation). This ADR supports **forward only** first. Training
   simulation is a separate ADR.
 5. **Where collectives are called**: RowParallelLinear calls `all_reduce` at
   the end of forward. This works naturally on top of ADR-0024's
   multi-greenlet structure and the D0 generalization.
 6. **TP group concept**: Megatron combines DP × TP × PP groups. The initial
   scope simplifies to **TP group = all SIPs**. Mixed DP+TP is future work.
 ---
 ## Decision
 ### D0. Generalizing worker-wait: `ctx.wait` defers to main in a worker context
 **Problem recap**. `kernel_runner.run` captures `greenlet.getcurrent()` at
 spawn time as the kernel greenlet's `_parent`
 ([kernel_runner.py:94](src/kernbench/triton_emu/kernel_runner.py#L94)).
 When `env.run` runs in the main context, parent=main, which is safe. When
 `env.run` runs in a worker context, parent=worker, and the moment the worker
 yields or finishes the kernel greenlet is orphaned → `GreenletExit` → the
 `ring_default_ws` failure of ADR-0024 Phase B.
 **Solution**. When a worker greenlet calls `ctx.wait(h)`, it **yields to the
 main scheduler** instead of driving `env.run` itself. Main drives `env.run`,
 and once the handle completes, control returns to the worker.
 #### D0.1 Extending `RuntimeContext`
 ```python
 # context.py
 @dataclass
 class RuntimeContext:
    ...
    _pending_worker_waits: list[RequestHandle] = field(default_factory=list, init=False)
 ```
 #### D0.2 Worker fork in `ctx.wait`
 ```python
 def wait(self, handle, *, _meta=None):
    # Fast-path: already completed — skip enqueue + switch (consistent with
    # D0.4-(3) idempotency). Avoids needless worker→main→worker round-trip
    # and prevents redundant _pending_worker_waits growth.
    if handle in self._completed:
        completion, _trace = self.engine.get_completion(handle)
        return completion
    from greenlet import getcurrent
    g = getcurrent()
    if g.parent is not None and not g.parent.dead:
        # Worker greenlet: defer to main. Push handle, yield to parent.
        # Parent (scheduler loop) drains env.run, then switches back.
        self._pending_worker_waits.append(handle)
        g.parent.switch()
        # On resume: handle must have completed (main drained the list).
        # Fall through to the status-quo completion/trace assembly.
    # Main context (or single-driver): drive engine directly.
    wait_fn = getattr(self.engine, "wait", None)
    if wait_fn is not None:
        wait_fn(handle)
    completion, trace = self.engine.get_completion(handle)
    self._completed.add(handle)
    if _meta is not None and trace is not None:
        entry = dict(trace) if isinstance(trace, dict) else {"raw": trace}
        entry.update(_meta)
        self._traces.append(entry)
    return completion
 ```
 #### D0.3 Semantic contract of `ctx.wait` in a worker context (normative)
 This ADR **explicitly changes** the semantics of `ctx.wait` in a worker context.
 - **Submit vs. complete**: when called from a worker, `ctx.wait(h)` does not
  guarantee "completed immediately" but "completed **after the next scheduler
  drain**". The point at which the worker returns from `wait()` is the point
  at which main has finished `engine.wait` for that handle. Calls from the
  main context remain immediate and synchronous (status quo).
 - **Resume invariant (normative)**: when `g.parent.switch()` returns in a
  worker-deferred `ctx.wait(h)` and the worker resumes, **`h in ctx._completed`
  must be True**. If this invariant breaks, the worker proceeds from a stale
  state, so any change to `_drain_pending`, the scheduler loop, or `ctx.wait`
  must preserve it. T3.b asserts this invariant directly.
 - **Observable change**: the pattern `h = ctx.submit(msg); ctx.wait(h);
  read(handle_result)` inside a worker still holds. The semantic spec
  additionally states that a main-drain is inserted automatically between
  `wait()` and `read`.
 - **Direct reads of host objects, see D0.5**: the contract for calling
  `tensor.numpy()` without `ctx.wait` is specified separately in D0.5.
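The resume invariant can be illustrated with a minimal sketch that swaps greenlets for Python generators (a stand-in chosen so the example runs on the standard library alone; the ADR itself uses greenlets). The names `completed`, `pending`, `worker`, `drain`, and `run` are hypothetical and only mirror the roles of `ctx._completed`, `ctx._pending_worker_waits`, and the scheduler loop.

```python
completed = set()   # plays the role of ctx._completed
pending = []        # plays the role of ctx._pending_worker_waits
log = []


def worker(rank):
    """A worker 'waits' by enqueueing a handle and yielding to the scheduler."""
    handle = f"h{rank}"
    pending.append(handle)
    yield                                  # stands in for g.parent.switch()
    # Resume invariant: the scheduler must have completed the handle by now.
    assert handle in completed
    log.append((rank, handle))


def drain():
    """Main-context drain: complete handles in submission (FIFO) order."""
    while pending:
        completed.add(pending.pop(0))


def run(workers):
    alive = list(workers)
    while alive:
        still_alive = []
        for w in alive:                    # one round: resume every worker once
            try:
                next(w)
                still_alive.append(w)
            except StopIteration:
                pass
        drain()                            # drain only after the whole round
        alive = still_alive


run([worker(0), worker(1)])
```

The assertion inside `worker` is the sketch's analogue of what T3.b is described as checking: a worker never observes its own handle as incomplete after resuming.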
 #### D0.4 Main scheduler drain: rules (normative)
 (Internal implementation of D1's `multiprocessing.spawn`. What follows is the
 semantic definition.)
 ```python
 while alive:
    for g in alive:              # (1) round-based worker switch
        g.switch()
    _drain_pending(ctx)           # (2) drain in main context
 ```
 (See D0.5 for the actual definition of `_drain_pending`: an outer while-loop
 drains until both queues are empty.)
 **Rules**:
 1. **Round-based cooperative scheduling and the obligation to yield (worker
   contract)**. `g.switch()` does not return until the worker **yields
   voluntarily** (cooperative greenlet semantics). Therefore:
   - If a worker runs a pure-compute loop such as `while True: do_compute()`
     without yielding, `g.switch()` never returns and **the scheduler loop
     itself hard-blocks** (no other worker gets switched in, and no drain
     happens). This is not starvation but **scheduler non-progress (equivalent
     to deadlock)**, and this ADR declares it **unsupported**.
   - A worker **must** call one of `ctx.wait(h)`, `dist.all_reduce`, or a
     host-read barrier (D0.5) within a finite number of steps. The `forward` of
     a TP layer contains a launch→wait pair at the end of every layer, so it
     satisfies this condition naturally. CCL kernels also yield inside
     `dist.all_reduce`.
   - The implementation does not need to **detect** violations (timeouts,
     steps-since-yield counters, and so on). This is a user contract, and the
     symptom of a violation is "simulation hang".
   - **Future extension**: if long non-collective compute paths become common,
     an explicit `torch.distributed.cooperative_yield()` primitive (a no-op
     yield) can be introduced. Out of scope for this ADR. It is not a breaking
     change and can be added when needed.
   - Within a round, every alive worker receives `switch` once. Even if a
     worker calls wait several times within a single round, the handles are
     enqueued sequentially within that turn and processed together in one
     scheduler drain (FIFO).
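 2. **Drain order = submission order (FIFO)**. `_pending_worker_waits` is a
   strict FIFO built on list append/pop(0). Handles are drained in submission
   order, not completion order. Because the SimPy scheduler itself guarantees
   a causally correct completion order, draining in submission order is safe.
   Do not confuse `completion order` with `drain order`.
   **Two-queue ordering (worker waits → collectives)**: `_drain_pending` drains
   the worker-wait queue first and the collective queue second. Rationale:
   - **The two queues have different dependency sources**: worker waits are
     handles the worker created itself with a `submit + wait` pair (tensor
     deploy, MmuMap, and so on). The collective queue holds kernel-launch
     handles enqueued internally by `dist.all_reduce`; the worker does not wait
     on them directly (see the two-queue drain model in D0.5).
   - **Independent with respect to correctness**: from the worker's point of
     view a collective is "already submitted, then yielded". Its completion
     only has to precede the worker's next action. There is no ordering
     dependency on the worker-wait queue.
   - **Both complete within a single drain barrier**: under the
     loop-until-empty rule of D0.5, one barrier invocation drains worker →
     collective → (repeat if new entries appeared). Both queues are drained by
     the time the worker resumes.
   - **The alternative (collectives first) is also possible**: this ADR fixes
     worker-first only for implementation simplicity; the two orders are
     semantically equivalent. Revisit if a difference in performance profile
     is observed.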
 3. **Duplicate enqueue: correctness relies on idempotent drain, dedup is not
   guaranteed**. `ctx.wait(h)` returns immediately if `h in ctx._completed`.
   `_drain_pending` applies the same guard. Even if the same handle is
   appended to `_pending_worker_waits` several times, the actual `engine.wait`
   is called only once (idempotent).
   - **Correctness**: relies on idempotent drain → safe.
   - **Memory/performance**: this ADR **does not guarantee dedup** of
     `_pending_worker_waits`. If the same handle is enqueued N times, the queue
     holds N elements and the drain performs N pops plus the in-set guard.
     Unless a single worker repeatedly waits on the same handle (an abnormal
     pattern), N stays between 1 and a handful.
   - **Implementation freedom**: an implementation may optionally dedup (for
     example with a `set` as a side index, or an `h not in pending_set` check
     before append). This is classified as an optimization that does not
     change correctness.
 4. **Exception propagation + sibling cleanup**.
   When a worker greenlet raises, `g.switch()` propagates the exception to
   main. The scheduler loop stops immediately and performs the following
   cleanup **explicitly**:
   ```python
   try:
       while True:
           alive = [g for g in gs if not g.dead]
           if not alive:
               break
           for g in alive:
               if not g.dead:
                   g.switch()
           _drain_pending(ctx)
   except Exception as outer:
       # (a) Force-terminate the sibling worker greenlets that are still alive.
       for other in gs:
           if not other.dead:
               try:
                   other.throw(SystemExit)
               except BaseException:
                   # SystemExit is not an Exception subclass, so catch broadly:
                   # the cleanup signal must not escape this handler.
                   pass
       # (b) Reset backend barrier / pending state (prepares for a future
       #     epoch barrier).
       backend = getattr(ctx.distributed, "_backend", None)
       if backend is not None and hasattr(backend, "_barrier"):
           backend._barrier.reset()
       backend_pending = getattr(backend, "_pending_collective_handles", None)
       if backend_pending is not None:
           backend_pending.clear()
       ctx._pending_worker_waits.clear()
       # (c) Wrap the root-cause exception in SpawnException.
       raise SpawnException(errors) from outer
   ```
   Rules:
   - **Sibling abort guarantee**: when one worker raises, `SystemExit` is
     thrown into every sibling greenlet, which terminates immediately. No
     greenlet leak.
   - **Explicit clearing of pending queues**: both the worker-wait and the
     collective-pending queue are emptied, which prevents contamination on
     reuse.
   - **`SpawnException(errors)` wrapping**: `errors: dict[int, Exception]`
     holds each rank's original exception. Compatible with the failure pattern
     of real-PyTorch `torch.multiprocessing.spawn`.
     - **Scope limit**: `errors` contains **only the ranks that raised from
       their own code (root cause)**. Ranks terminated by `throw(SystemExit)`
       during sibling cleanup do not appear in `errors` (SystemExit is not
       caught by the `try/except Exception` of the D1.2 entry wrapper; this is
       intentional, since sibling termination is a cleanup signal and not a
       failure). Stated explicitly so that readers do not expect "every failed
       rank" to be listed.
   - **`ctx._traces` holds the partial state up to the exception**. Trace
     completeness is not guaranteed (some launch/all_reduce calls may
     terminate without leaving an entry).
   - **Allocator / MemoryStore** keep their pre-exception state. Reuse is a
     non-goal; creating a new `RuntimeContext` is recommended.
   - **`join=False` / retry / partial recovery** are non-goals of this ADR.
   `SpawnException` is defined in `runtime_api/multiprocessing.py`:
   ```python
   class SpawnException(RuntimeError):
       def __init__(self, errors: dict[int, Exception]):
           self.errors = errors
           first = next(iter(errors.items()), None)
           msg = (f"spawn failed on ranks {sorted(errors.keys())}"
                  + (f": rank {first[0]} raised {first[1]!r}" if first else ""))
           super().__init__(msg)
   ```
 5. **Single-driver compatibility**. In a main-only run where
   `g.parent is None` (legacy single-driver tests), the worker-fork condition
   of D0.2 is false → the existing immediate synchronous path is kept.
   `_drain_pending` is not called.
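Rules 2 and 3 (FIFO drain, idempotent under duplicates) can be sketched in isolation. `FakeEngine` and `drain_worker_waits` are hypothetical names for illustration; the sketch assumes the drain records completion itself so that the in-set guard can skip later duplicates of the same handle.

```python
class FakeEngine:
    """Hypothetical engine that records every wait() call it receives."""

    def __init__(self):
        self.wait_calls = []

    def wait(self, handle):
        self.wait_calls.append(handle)


def drain_worker_waits(pending, completed, engine):
    """Pop in FIFO order; skip handles that are already completed."""
    while pending:
        h = pending.pop(0)
        if h not in completed:
            engine.wait(h)
            completed.add(h)   # assumption: completion is recorded in the drain


engine = FakeEngine()
completed = set()
pending = ["h1", "h2", "h1", "h1"]   # h1 enqueued three times, no dedup
drain_worker_waits(pending, completed, engine)
```

The queue still holds and pops all four entries, which is the memory and performance cost rule 3 accepts, but the engine is driven only once per distinct handle and in submission order.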
 #### D0.5 Host-read barrier: decision (normative)
 Inside a worker, **host-observable reads** such as `tensor.numpy()`,
 `tensor.__getitem__`, and `tensor.data` are defined as an **automatic drain
 barrier**. Immediately before the call:
 1. If `ctx._pending_worker_waits` or `backend._pending_collective_handles` is
   non-empty, yield to main via `g.parent.switch()` → main runs
   `_drain_pending` → the worker resumes after completion.
 2. If both queues are empty, read immediately.
 **Barrier repetition rule (normative, re-entrance)**: `_drain_pending` drains
 in a while-loop **until both queues are completely empty**. It is not a
 single pass:
 ```python
 def _drain_pending(ctx):
    while ctx._pending_worker_waits or (
        ctx.distributed._backend
        and ctx.distributed._backend._pending_collective_handles
    ):
        while ctx._pending_worker_waits:
            h = ctx._pending_worker_waits.pop(0)
            if h not in ctx._completed:
                ctx.engine.wait(h)
                # Record completion here so the D0.3 resume invariant holds
                # and a duplicate entry for h is skipped by the guard above.
                ctx._completed.add(h)
        backend = ctx.distributed._backend
        if backend is not None:
            while backend._pending_collective_handles:
                h, _sip_id, meta = backend._pending_collective_handles.pop(0)
                ctx.wait(h, _meta=meta)  # main context: safe; ctx.wait does
                                          # not push to pending again
 ```
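 **Main-context `ctx.wait` non-recursion invariant (normative)**: the
 `ctx.wait(h, _meta=meta)` call inside `_drain_pending` runs in the main
 greenlet context. The worker-fork condition of D0.2
 (`g.parent is not None and not g.parent.dead`) is False, so the call takes the
 immediate synchronous path → **it never enqueues into
 `_pending_worker_waits`**. Thanks to this invariant the drain loop finishes
 without recursion or queue regrowth. Implementations must keep
 `g.parent is None` as the guarantee that identifies the single main greenlet.
 **Why a loop**: `ctx.wait(h, _meta=meta)` is called in the main context, so
 per D0.2 it **drives the engine directly** (no additional enqueue, per the
 invariant above). In theory a single pass is therefore enough, but the rule
 is fixed as **loop-until-empty**. Reasons:
 1. **Safety under future extension**: a later implementation may enqueue new
   pending handles during the drain (for example a tree-reduce whose
   collective has sub-handles). The loop rule keeps this correct.
 2. **Readability**: the meaning closes in one sentence, "the barrier drains
   until nothing is pending". It does not depend on the non-trivial invariant
   that a `ctx.wait` call performs no new enqueue.
 3. **The barrier means "every dependency needed by this read has
   completed"**: in the current model all pending handles are exactly all
   dependencies, so the two coincide. The user's mental model is the former.
 **Termination guarantee**: described separately for two regimes.
 - **Current implementation**: when called from the main context, `ctx.wait`
  drives the engine directly (D0.2) → no new pending handle is enqueued. Each
  iteration strictly shrinks the pending set through `pop(0)` + `engine.wait`.
  The number of iterations is **bounded by the initial pending size** →
  terminates in finite time.
 - **Future extension (the bound that justifies the loop rule)**: if an
  implementation that enqueues new pending handles during the drain is
  introduced (for example tree-reduce sub-handles), the initial-size bound no
  longer holds. SimPy causality still guarantees that handle dependencies form
  a finite DAG, so the **nested depth is finite**. The loop rule covers this
  case automatically.
 Both regimes guarantee that an infinite loop is impossible. The single-pass
 bound of the current implementation is only a reference value for aggressive
 optimization; the rule stays loop-until-empty.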
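The future-extension regime can be sketched as follows. `drain_pending` and `tree_reduce_hook` are hypothetical: the hook models a collective that enqueues one follow-up wait while it is being drained, which the current implementation does not do. The sketch only shows that the loop-until-empty rule picks up such late entries and still terminates.

```python
def drain_pending(worker_waits, collectives, on_collective):
    """Drain both queues until both are empty; return the drain order."""
    order = []
    while worker_waits or collectives:
        while worker_waits:
            order.append(("wait", worker_waits.pop(0)))
        while collectives:
            h = collectives.pop(0)
            order.append(("collective", h))
            on_collective(h, worker_waits)   # may enqueue follow-up waits
    return order


def tree_reduce_hook(handle, worker_waits):
    """Hypothetical future behaviour: a collective spawns one sub-handle."""
    if not handle.endswith(".sub"):
        worker_waits.append(handle + ".sub")


order = drain_pending(["w0"], ["c0", "c1"], tree_reduce_hook)
```

A single pass would have returned after `c1` and left the two sub-handles pending; the outer loop runs a second iteration because the worker-wait queue became non-empty again.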
 **Why implicit drain at read is the right choice**:
 - The earlier open question was a choice between (a) implicit drain and (b)
  an explicit barrier. (b) is clear, but TP layer users would have to write
  the 3-step `out = fc1.forward(x); ctx.drain(); result = out.numpy()` every
  time. (a) is a single rule that "guarantees up-to-date values when reading",
  the same as CUDA's `cudaDeviceSynchronize before host copy` pattern. It is
  not a hidden rule but the **contract of named entry-points**.
 - This ADR adopts (a) and **explicitly closes** the list of entry-points:
  `Tensor.numpy()`, `Tensor.data` (numpy alias), `Tensor.__getitem__`,
  `Tensor.__repr__` (when data is included). Any other official host-read API
  is finalized by searching the codebase when this ADR is implemented.
  Host-read APIs added later must follow this contract (tests guard against
  regression).
 - The drain barrier also applies when `numpy` is called directly after
  `ctx.submit` without `wait` (because the handle is in the pending queue).
  Even if the user omits the explicit wait, the invariant is restored at read
  time.
 **`Tensor.copy_(source)`: write-barrier rule**:
 `copy_` is semantically a "write to target", but internally it calls
 `source.numpy()` to fetch the source data on the host and then writes each
 shard with `target._memory_store.write(...)`. Both directions are treated as
 barriers:
 1. **Source side (read barrier)**: `source.numpy()` triggers the D0.5 read
   barrier (when the source itself is a deployed tensor and handles are
   pending).
 2. **Target side (write barrier, based on global pending)**: on entering
   `copy_`, if `ctx._pending_worker_waits` or
   `backend._pending_collective_handles` is non-empty, drain via
   `g.parent.switch()` before writing. **The criterion is the global pending
   queue, not per-tensor / per-shard dependency tracking**.
   - Why global: KernBench's handle representation carries no back-reference
     of the form "this handle writes shard X of target". The safe,
     conservative rule is "drain if anything is pending globally". As a result
     **pending work on an unrelated tensor can also block `copy_`**; the
     drop-in invariant takes priority.
   - **Explicit tradeoff**: this rule can introduce unnecessary serialization
     between independent tensors. Under the current single-queue execution
     model the cost is acceptable: guaranteeing cross-rank correctness and the
     "fresh on read" invariant with a simple rule takes priority.
   - Practical impact: within one layer step, a single worker's pending
     handles are mostly its own work, and the extra context switch caused by
     over-barriering often coincides with the end-of-round scheduler drain, so
     it is rarely a problem.
   - Future refinement: per-tensor pending tracking could narrow this rule,
     but it is outside the scope of this ADR.
 **Non-barrier**:
 - **Metadata-only** access such as `tensor.shape`, `tensor.dtype`, and
  `tensor.name` does not drain. There is no data dependency.
 - Raw address accessors such as `tensor.pa` and `tensor.va` do not drain
  either (address only, not contents).
 **Official barrier entry-points (closed set)**:
 | API | Kind | Rationale |
 |---|---|---|
 | `Tensor.numpy()` | read | host-observable copy |
 | `Tensor.data` | read | `numpy()` alias |
 | `Tensor.__getitem__` | read | shard-aligned read |
 | `Tensor.__repr__` (when data is included) | read | debugging/log |
 | `Tensor.copy_(source)` | read + write | source read + target write |
 T5/T6 verify this contract directly.
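The split between barrier and non-barrier accessors can be sketched with a toy tensor. `Ctx` and `Tensor` here are hypothetical stand-ins with a single global pending list; the real entry-points yield to the main scheduler instead of draining in place.

```python
class Ctx:
    """Hypothetical context holding one global pending queue."""

    def __init__(self):
        self.pending = []
        self.drains = 0

    def drain(self):
        self.drains += 1
        self.pending.clear()


class Tensor:
    def __init__(self, ctx, shape, values):
        self._ctx = ctx
        self._shape = shape
        self._values = values

    def _barrier(self):
        # Global-pending rule: drain if anything at all is pending.
        if self._ctx.pending:
            self._ctx.drain()

    @property
    def shape(self):            # metadata-only: no barrier
        return self._shape

    def numpy(self):            # host-observable read: barrier first
        self._barrier()
        return list(self._values)

    def __getitem__(self, i):   # host-observable read: barrier first
        self._barrier()
        return self._values[i]


ctx = Ctx()
t = Tensor(ctx, (3,), [1, 2, 3])
ctx.pending.append("h0")
_ = t.shape                     # does not drain
drains_after_shape = ctx.drains
_ = t.numpy()                   # drains because h0 is pending
drains_after_numpy = ctx.drains
_ = t[0]                        # nothing pending, reads immediately
drains_after_getitem = ctx.drains
```

Keeping the barrier in one private helper makes it straightforward to test that every listed entry-point goes through it and that metadata accessors do not.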
 #### D0.6 Why the worker function API is unchanged (informative)
 - Internally `torch.zeros(...)` is a `self.submit(msg)` + `self.wait(h)` pair.
  Per D0.2/D0.3, `wait` defers to main automatically → it looks synchronous
  from the outside but yields once.
 - `tensor.numpy()` is a host-read barrier per D0.5 → drain then read if
  anything is pending, otherwise read immediately.
 - `dist.all_reduce` keeps using the existing `_defer_wait=True` +
  `_pending_collective_handles` path. The D0.4 drain handles both queues
  together.
 #### D0.7 Invariants
 - **The kernel greenlet's `_parent` is always main**: because `env.run` never
  runs in a worker context. (The core assertion of T3.)
 - **Cross-rank synchronization point**: the drain happens only after every
  worker has yielded → the kernels of all ranks advance together in one round
  (a prerequisite for cross-rank IPCQ exchange).
 - **Single-driver compatibility**: D0.4-(5).
 ### D1. `torch.multiprocessing.spawn(fn, args, nprocs)`
 Real-PyTorch API parity, plus the single implementation site of the D0
 scheduler loop.
 #### D1.0 API parity only, not execution parity (normative)
 The name `torch.multiprocessing.spawn` is limited to **API signature
 parity**. The actual execution model is a **cooperative greenlet scheduler**
 (one Python process, one OS thread, the round-robin drive of D0.4). The
 following are **properties this ADR does not provide**. Real-PyTorch
 `torch.multiprocessing.spawn` guarantees them; here they are explicit
 **non-goals**:
 - Process isolation (independent OS process per rank).
 - Independent address space (each rank owns its own Python heap).
 - Failure isolation (a hard crash of one rank does not affect the others).
 - OS-level scheduler fairness (preemptive time slicing between ranks).
 - Inter-process primitives such as `mp.Queue` and `mp.Lock`.
 Actual properties of this implementation:
 - Every rank is a greenlet inside the same Python process. Shared global
  state is visible as is (an intended simulation convenience).
 - Single thread under the GIL → no parallel execution. Only "logical
  concurrency" is reproduced through SimPy event order.
 - An unhandled exception in one worker → the whole simulation stops
  (D0.4-(4)).
 **Caller obligations**: when porting a real-PyTorch multi-process sample to
 KernBench, logic that depends on process isolation (for example `os.getpid`,
 separate temporary files, signal handling) must be removed. The namespace
 name is kept for code portability; the semantics differ.
 #### D1.1 Public surface
 ```python
 # runtime_api/multiprocessing.py (new)
 class _MultiprocessingNamespace:
    def __init__(self, ctx):
        self._ctx = ctx
    def spawn(self, fn, args: tuple, nprocs: int, join: bool = True) -> None:
        """Spawn `nprocs` worker greenlets, each calling fn(rank, *args).
        Mirrors torch.multiprocessing.spawn signature (minus `daemon`).
        Drives the D0 scheduler loop until all workers finish.
        """
        ...
 ```
 #### D1.2 Implementation
 ```python
 def spawn(self, fn, args, nprocs, join=True):
    from greenlet import greenlet
    ctx = self._ctx
    dist = ctx.distributed
    gs: list[greenlet] = []
    errors: dict[int, Exception] = {}
    for rank in range(nprocs):
        def _entry(r=rank):
            try:
                fn(r, *args)
            except Exception as e:
                errors[r] = e
                raise
        g = greenlet(_entry)
        dist._bind_rank(g, rank)
        gs.append(g)
    try:
        while True:
            alive = [g for g in gs if not g.dead]
            if not alive:
                break
            for g in alive:
                if not g.dead:
                    g.switch()
            _drain_pending(ctx)       # D0.5
    except Exception as outer:
        # Sibling cleanup per D0.4-(4)
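        for other in gs:
            if not other.dead:
                try:
                    other.throw(SystemExit)
                except BaseException:
                    # SystemExit is not an Exception subclass; catch broadly
                    # so the cleanup signal does not escape this handler.
                    pass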
        backend = getattr(dist, "_backend", None)
        if backend is not None:
            if hasattr(backend, "_barrier"):
                backend._barrier.reset()
            if getattr(backend, "_pending_collective_handles", None) is not None:
                backend._pending_collective_handles.clear()
        ctx._pending_worker_waits.clear()
        raise SpawnException(errors) from outer
    # `join=True` semantics: we already wait for all workers.
 ```
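The error-aggregation contract (only root-cause ranks appear in `errors`, siblings are stopped with a cleanup signal) can be sketched without greenlets. The version below substitutes Python generators for worker greenlets and `generator.close()` for the sibling `throw`; the simplified `SpawnException`, `spawn`, and `worker` are illustrative stand-ins, not the `runtime_api` implementation.

```python
class SpawnException(RuntimeError):
    def __init__(self, errors):
        self.errors = errors
        super().__init__(f"spawn failed on ranks {sorted(errors)}")


def spawn(fn, args, nprocs):
    """Round-robin over generator workers; abort siblings on first failure."""
    workers = {rank: fn(rank, *args) for rank in range(nprocs)}
    errors = {}
    try:
        while workers:
            for rank in list(workers):
                try:
                    next(workers[rank])
                except StopIteration:
                    del workers[rank]        # worker finished normally
                except Exception as exc:     # root-cause rank only
                    errors[rank] = exc
                    raise
    except Exception as outer:
        for gen in workers.values():
            gen.close()                      # cleanup signal, not a failure
        raise SpawnException(errors) from outer


def worker(rank, fail_rank):
    yield                                    # first cooperative yield
    if rank == fail_rank:
        raise ValueError(f"rank {rank} failed")
    yield


try:
    spawn(worker, args=(1,), nprocs=3)
    caught = None
except SpawnException as exc:
    caught = exc
```

Ranks 0 and 2 are closed during cleanup and never show up in `caught.errors`, matching the scope limit described in D0.4-(4).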
 #### D1.3 `torch` namespace attach
 In `__post_init__` of `runtime_api/context.py`:
 ```python
 self.multiprocessing = _MultiprocessingNamespace(self)
 ```
 → bench code then calls
 `torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws)`.
 #### D1.4 Migrating existing benches
 The hand-rolled loop in `benches/ccl_allreduce.py` shrinks to a single
 `torch.multiprocessing.spawn` call. The existing matrix regression stays as
 is. `ring_default_ws`, currently xfail, is expected to turn to PASS thanks to
 D0 (workers no longer orphan kernel greenlets).
 ### D2. New package `kernbench.tp`
 ```
 src/kernbench/tp/
    __init__.py          - public API re-exports
    parallel_state.py    - TP group management (currently a single global group)
    layers.py            - ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding
    primitives.py        - copy/reduce/scatter/gather_to/from_tp_region
    kernels.py           - gemm kernel launched by TP layers (reusable)
    mappings.py          - forward identity/all_reduce, backward stub
 ```
 ### D3. `parallel_state` — TP group
 ```python
 # parallel_state.py
 _TP_WORLD_SIZE = None
 def initialize_model_parallel(tensor_model_parallel_size: int) -> None:
    """Initialize TP group. Must be called after dist.init_process_group."""
    global _TP_WORLD_SIZE
    from kernbench.runtime_api.distributed import get_dist  # or torch.distributed
    dist = get_dist()
    total = dist.get_world_size()
    if tensor_model_parallel_size != total:
        raise NotImplementedError(
            "Only TP == world_size supported in initial scope"
        )
    _TP_WORLD_SIZE = tensor_model_parallel_size
 def get_tensor_model_parallel_world_size() -> int:
    return _TP_WORLD_SIZE
 def get_tensor_model_parallel_rank() -> int:
    from kernbench.runtime_api.distributed import get_dist
    return get_dist().get_rank()         # ADR-0024 greenlet-local rank
 ```
 Initial scope: TP size = world_size = topology SIP count. A pure TP model.
 ### D4-pre. TP shard ownership vs DPPolicy: separation of roles (normative)
 Two concepts are kept strictly apart in how a TP layer represents its
 weight/output:
 | Concept | Decided by | Scope |
 |---|---|---|
 | **TP shard ownership** (which rank owns which slice of the weight) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D2/D3) | **cross-rank, cross-SIP** |
 | **Intra-rank placement** (how the owned slice is distributed over cube × PE inside the rank) | `DPPolicy(cube=..., pe=...)` (ADR-0026) | **inside one rank (within the SIP boundary)** |
 So when `ColumnParallelLinear` creates a weight of shape
 `(in_features, out_features // ws)` and assigns
 `DPPolicy(cube="column_wise", pe="column_wise")`:
 - The slice owned by **rank r** is the column range [r * k_local, (r+1) *
  k_local) of the weight. **set_device(r)** decides this (the rank lives on
  SIP r).
 - The cube × PE column-wise distribution **inside that slice** is decided by
  **DPPolicy**.
 The two axes are **independent**. If two ranks build their slices with the
 same DPPolicy, the slices live on different SIPs but share the same intra-SIP
 placement pattern. Conversely, switching DPPolicy to
 `cube="replicate", pe="replicate"` keeps TP shard ownership and changes only
 the intra-rank placement.
 **Mistakes that blur this boundary** (forbidden by this ADR):
 - A "SIP axis" reappearing in DPPolicy (removed in ADR-0026).
 - A TP layer expressing cross-rank sharding with `DPPolicy` alone, without
  `set_device` → indistinguishable from a vertical split inside a single rank.
 TP layers in this ADR always treat weight/output from the viewpoint "rank =
 SIP = owner of one slice + DPPolicy for intra-SIP distribution".
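The ownership rule reduces to simple index arithmetic. The helper below is a hypothetical illustration (the name `tp_column_slice` does not come from the ADR); it computes the half-open column range a rank owns and says nothing about intra-rank cube × PE placement, which stays with DPPolicy.

```python
def tp_column_slice(rank, world_size, out_features):
    """Half-open column range [start, stop) owned by `rank`."""
    if out_features % world_size != 0:
        raise ValueError("out_features must be divisible by world_size")
    k_local = out_features // world_size
    return rank * k_local, (rank + 1) * k_local


# Four ranks sharing a 2048-column weight.
slices = [tp_column_slice(r, 4, 2048) for r in range(4)]
```

The ranges are contiguous and disjoint, so together they cover the column axis exactly once.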
 ### D4. `ColumnParallelLinear`
 **Important**: no new host-side `torch.matmul` abstraction is introduced. The
 layer's forward calls the existing gemm kernel through
 `torch.launch("gemm", gemm_kernel, ...)`, the pattern KernBench benches
 already use
 ([benches/gemm_single_pe.py](benches/gemm_single_pe.py),
 [benches/gpt3_qkv.py](benches/gpt3_qkv.py)).
 ```python
 # layers.py
 from kernbench.policy.placement.dp import DPPolicy
 from kernbench.tp.kernels import _gemm_kernel
 from kernbench.tp.parallel_state import (
    get_tensor_model_parallel_rank,
    get_tensor_model_parallel_world_size,
 )
 class ColumnParallelLinear:
    """Shards the K (out_features) axis of the weight across TP ranks.
    forward(x):
        x: (M, N), fully replicated across ranks
        W_k: (N, K / world_size), rank-local slice (lives on SIP r via set_device)
        y_k = x @ W_k → (M, K / world_size), rank-local output
    The output is column-sharded, the input layout RowParallelLinear expects.
    """
    def __init__(self, in_features: int, out_features: int, bias: bool = False,
                 dtype: str = "f16", torch=None):
        ws = get_tensor_model_parallel_world_size()
        assert out_features % ws == 0
        self.in_features = in_features
        self.k_local = out_features // ws
        self._torch = torch
        # Each rank owns its own slice, placed on SIP r by set_device(rank).
        self.weight = torch.zeros(
            (in_features, self.k_local), dtype=dtype,
            dp=DPPolicy(cube="column_wise", pe="column_wise"),
            name="col_parallel_w",
        )
        self.bias = None
        if bias:
            self.bias = torch.zeros(
                (self.k_local,), dtype=dtype,
                dp=DPPolicy(cube="replicate", pe="replicate"),
                name="col_parallel_b",
            )
    def forward(self, x):
        # x is fully replicated (guaranteed by the caller). Plain local gemm.
        M = x.shape[0]
        out = self._torch.empty(
            (M, self.k_local), dtype=x.dtype,
            dp=DPPolicy(cube="column_wise", pe="column_wise"),
            name="col_parallel_out",
        )
        self._torch.launch(
            "col_parallel_gemm", _gemm_kernel,
            x, self.weight, out, M, self.in_features, self.k_local,
        )
        # Bias add is a separate kernel or the fused bias of a composite gemm.
        # Only bias=False is fully verified in the initial scope.
        return out
 ```
 **Yield-safety contract (normative)**: `ColumnParallelLinear.forward` contains
 a kernel launch → internal `ctx.wait` pair through its single `torch.launch`
 call. This automatically satisfies the D0.4-(1) condition that "a worker
 yields within a finite number of steps", so TP layer users do not need to
 insert yield patterns by hand.
 ### D5. `RowParallelLinear`
 ```python
 class RowParallelLinear:
    """Shards the N (in_features) axis of the weight across TP ranks.
    forward(x):
        x: (M, N / world_size), rank-local slice (the output of ColumnParallel)
        W_k: (N / world_size, K), rank-local slice
        y_k = x @ W_k → (M, K), partial sum on each rank
        y = all_reduce(y_k, op="sum") → (M, K) on every rank
    """
    def __init__(self, in_features: int, out_features: int, bias: bool = False,
                 dtype: str = "f16", torch=None):
        ws = get_tensor_model_parallel_world_size()
        assert in_features % ws == 0
        self.n_local = in_features // ws
        self.out_features = out_features
        self._torch = torch
        self.weight = torch.zeros(
            (self.n_local, out_features), dtype=dtype,
            dp=DPPolicy(cube="column_wise", pe="column_wise"),
            name="row_parallel_w",
        )
        # Bias lives on rank 0 only (Megatron convention). Omitted in the
        # initial scope.
        self.bias = None
    def forward(self, x):
        M = x.shape[0]
        y_partial = self._torch.empty(
            (M, self.out_features), dtype=x.dtype,
            dp=DPPolicy(cube="column_wise", pe="column_wise"),
            name="row_parallel_partial",
        )
        self._torch.launch(
            "row_parallel_gemm", _gemm_kernel,
            x, self.weight, y_partial, M, self.n_local, self.out_features,
        )
        # Cross-rank reduce. ADR-0024's dist.all_reduce works correctly under
        # D0 + mp.spawn (the kernel parent stays main).
        self._torch.distributed.all_reduce(y_partial, op="sum")
        return y_partial
 ```
 **Yield-safety contract (normative)**: `RowParallelLinear.forward` contains a
 launch → internal wait followed by `all_reduce` (the defer + worker-yield
 pattern), so **at least 2 yields** per forward are guaranteed. The scheduler
 progress condition of D0.4-(1) is satisfied automatically. Every TP layer
 forward in this ADR keeps the invariant "contains at least one wait or
 collective and is therefore yield-safe"; TP primitives added later
 (VocabParallelEmbedding and so on) must honor the same contract.
 ### D6. Primitive functions
 ```python
 # primitives.py
 def copy_to_tp_region(x):
    """Forward: identity. Backward: all-reduce. (Implemented when training is added.)"""
    return x
 def reduce_from_tp_region(x, torch):
    """Forward: all-reduce. Backward: identity."""
    torch.distributed.all_reduce(x, op="sum")
    return x
 def scatter_to_tp_region(x):
    raise NotImplementedError(
        "Phase 2: replaced by the user creating an already-sharded tensor"
    )
 def gather_from_tp_region(x):
    raise NotImplementedError(
        "Phase 2: requires an all-gather kernel first (future)"
    )
 ```
 ### D7. Sample bench: 2-layer MLP with TP
 ```python
 # benches/tp_mlp.py (new)
 from kernbench.policy.placement.dp import DPPolicy
 import kernbench.tp as tp
 import numpy as np
 def worker(rank: int, world_size: int, torch):
    torch.ahbm.set_device(rank)
    tp.initialize_model_parallel(world_size)
    B, D_in, D_hidden, D_out = 1, 512, 2048, 512
    fc1 = tp.ColumnParallelLinear(D_in, D_hidden, torch=torch)
    fc2 = tp.RowParallelLinear(D_hidden, D_out, torch=torch)
    x = torch.zeros(
        (B, D_in), dtype="f16",
        dp=DPPolicy(cube="replicate", pe="replicate"),
        name="x",
    )
    # init x with some pattern (e.g., constant)
    x.copy_(torch.from_numpy(np.full((B, D_in), 0.1, dtype=np.float16)))
    h = fc1.forward(x)      # column-sharded (B, D_hidden / ws)
    y = fc2.forward(h)      # all-reduced (B, D_out) on every rank
    # Only rank 0 prints / verifies the result.
    if rank == 0:
        result = y.numpy()
        # With zero-init weights every value is 0; this scope verifies
        # "completion itself" only.
        print(f"  tp_mlp: shape={result.shape}, mean={float(result.mean()):.4f}")
 def run(torch):
    torch.distributed.init_process_group(backend="ahbm")
    ws = torch.distributed.get_world_size()
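    # worker(rank, world_size, torch) takes torch as its last argument, so it
    # must be passed through args (spawn calls fn(rank, *args)).
    torch.multiprocessing.spawn(worker, args=(ws, torch), nprocs=ws)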
 ```
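The per-rank tensor shapes in this bench follow directly from the sharding rules of D4 and D5. A small hypothetical helper (not part of the bench) makes the bookkeeping explicit for a given world size:

```python
def tp_mlp_shapes(batch, d_in, d_hidden, d_out, world_size):
    """Per-rank tensor shapes for ColumnParallel followed by RowParallel."""
    if d_hidden % world_size != 0:
        raise ValueError("d_hidden must be divisible by world_size")
    h_local = d_hidden // world_size
    return {
        "x": (batch, d_in),              # replicated input
        "fc1.weight": (d_in, h_local),   # column shard
        "h": (batch, h_local),           # column-sharded activation
        "fc2.weight": (h_local, d_out),  # row shard
        "y": (batch, d_out),             # after all-reduce, same on every rank
    }


# Sizes from the sample bench, assuming a world size of 4 for illustration.
shapes = tp_mlp_shapes(1, 512, 2048, 512, world_size=4)
```

Only the hidden dimension is divided by the world size; the input and the all-reduced output keep their full shape on every rank.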
 ### D8. Non-functional: training is not supported
 This ADR is **inference/forward only**. Backward / gradient / optimizer are
 future work. This is natural because KernBench is not a training system.
 ### D9. Initial scope constraints
 - TP size = world_size (no mixed DP+TP).
 - `scatter_to_tp_region` and `gather_from_tp_region` are unimplemented.
 - **Weights default to zero**. A proper init scheme (Xavier, Kaiming, and so
  on) is future work. Tests, however, inject a deterministic non-zero pattern
  with `tensor.copy_` to verify numerical correctness (T2/T6). In short:
  "production default = zero, verification = deterministic non-zero".
 - Bias is omitted in the initial scope (Megatron's rank-0-only bias policy is
  future work).
 - Pipeline parallelism is out of scope.
 - VocabParallelEmbedding needs all-gather first → stub only.
 ### D10. Regression: lifting the `ring_default_ws` xfail (required acceptance)
 Thanks to D0 (generalized worker-wait) and D0.5 (host-read barrier), every
 worker-driven `ctx.wait` and host-read is routed through the main-drain path
 → the cause of the kernel-greenlet orphan in ADR-0024 Phase B disappears.
 Turning the `ring_default_ws` strict-xfail case of the existing matrix test
 to **PASS** after this ADR is implemented is included as a **required
 regression criterion**. The observable acceptance criteria are listed in
 **T7** (no deadlock, no GreenletExit, numerical tolerance, and so on).
 ---
 ## Dependencies
 - **ADR-0024** (launcher): rank = SIP, greenlet-local rank,
  `torch.ahbm.set_device(rank)`.
 - **ADR-0026** (DPPolicy intra-device): represents the per-rank slice of the
  weight tensor.
 - **ADR-0023 / ADR-0025** (IPCQ): the foundation of the `dist.all_reduce`
  implementation.
 ---
 ## Non-goals
 - **Backward pass / training**: inference only. Training simulation is a
  separate ADR.
 - **Mixed parallelism (DP + TP + PP)**: pure TP only at first.
 - **Weight init schemes**: plain zero / debug pattern.
 - **Fused ops**: Megatron's fused matmul+bias+gelu is a kernel-level concern.
 - **DTensor integration**: ADR-0028, future.
 - **Host-side `torch.matmul` abstraction**: TP layers call the existing gemm
  kernel via `torch.launch(gemm_kernel, ...)`. No new matmul host-op is
  introduced.
 ---
 ## Open questions
 - **Location of `initialize_model_parallel`**:
  `kernbench.tp.initialize_model_parallel` (current decision) vs real-PyTorch's
  `torch.distributed.init_device_mesh`. Kept in the TP-specific module.
 - **Weight init**: the ADR uses zero. A debug pattern (e.g., identity) may be
  needed for meaningful verification; add it in the Phase 1 tests if required.
 - **Bias placement policy**: Megatron keeps the RowParallelLinear bias on
  rank 0 only. The initial scope sidesteps this with bias=False.
 - **Location of the GEMM kernel**: `kernbench.tp.kernels._gemm_kernel` vs an
  import from the existing `benches/gemm_single_pe.py`. TP must not depend on
  a bench, so the kernel is duplicated inside tp. It can move to a shared
  `kernbench.kernels` package later.
 **Resolved (open in the previous revision)**:
 - ~~Drain timing when `tensor.numpy()` is called~~ → **decided in D0.5**: the
  official host-read entry-points (`numpy`, `data`, `__getitem__`,
  data-including `__repr__`) are automatic drain barriers. Metadata-only
  accessors are not barriers.
 ---
 ## Consequences
 ### Positive
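 - **Easy porting of Megatron code**: the API matches real training code.
 - **TP benchmarking becomes possible**: enables study of HW characteristics such as
  scaling and communication-compute overlap.
 - **`ring_default_ws` xfail lifted**: as a by-product of D0, the ADR-0024 Phase B blocker is resolved.
 - **Unified scheduler loop**: introducing D1 (`mp.spawn`) removes the hand-rolled loop.
  Subsequent collective/TP benches reuse the same pattern.
 - **Clearer DPPolicy semantics** (synergy with ADR-0026): the TP layer is a model example
  of using only the intra-device DPPolicy.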
 ### Negative
 - Maintenance cost of the new module (`kernbench.tp`).
 - The initial scope is limited (pure TP only, forward only).
 - The D0 generalization changes the semantics of `ctx.wait`; compatibility with
  single-driver tests must be verified explicitly (T7).
 ### Neutral
 - Adds a pure upper layer on top of ADR-0024/0026. No impact on the hardware simulation
  stack (except D0).
@@ -0,0 +1,256 @@
 # ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
 ## Status
 Accepted (supersedes ADR-0029).
 ## Context
 ### Goal
 Define a single all-reduce algorithm that exploits the topology hierarchy:
 cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
 one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.
 ### Why replace ADR-0029 (hierarchical 3-level)
 ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
 where every PE in the system participates. In practice this adds the
 intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast)
 without matching the common workload pattern where the tensor is sharded
 **per cube** (not per PE within a cube).
 Moreover, the hierarchical design required:
 - per-PE neighbor graph installation (`_build_pe_installs` multi-level)
 - multi-level topology schema (`hierarchical_3level`)
 - `all_pes` mapper + `multi_pe_sip_local` validator infrastructure
 The intercube algorithm below removes all of that: **pe0-only same-lane
 intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the
 root cube, then broadcast back. Simpler kernel, simpler wiring, same
 bandwidth characteristics for the common per-cube DP workload.
 ### Current state
 - `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
 - `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip`
 - `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this
  automatically at `init_process_group` time.
 - Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`,
  `hierarchical_allreduce` modules and their tests are **removed**.
 ---
 ## Decision
 ### D1. Algorithm structure — 5 phases
 For each SIP (launched concurrently by `mp.spawn`):
 ```
 Phase 1 — Row reduce W → E (cube mesh, pe0 only):
    col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
 Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
    row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
    holds the full SIP sum.
 Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
    Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
    selected by sip_topo_kind (from topology.yaml sips.topology).
 Phase 4 — Col broadcast S → N on rightmost column.
 Phase 5 — Row broadcast E → W across the cube mesh.
 ```
 After all phases every cube's pe0 holds the global sum.
 The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
 (ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
 across topologies; only phase 3 branches. Helper functions
 `_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the
 three exchange patterns.
 ### D2. Tensor layout (rank = SIP, per-worker)
 Per ADR-0024 rank = SIP at the process-group level. Each worker allocates
 its own cube-mesh-spanning tensor:
 ```python
 dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
 tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
 ```
 Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
 each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`, where the
 factor 2 is the f16 element size in bytes.
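As a worked instance of the address formula, the sketch below computes shard addresses and the SE-corner root cube id. The helper names `shard_addr` and `root_cube_id` are illustrative and are not taken from the kernel source; the root-cube expression is the one quoted in the Non-goals section.

```python
F16_BYTES = 2  # element size of the "f16" dtype used in the layout above


def shard_addr(t_ptr, cube_id, n_elem, elem_bytes=F16_BYTES):
    """Byte address of cube `cube_id`'s shard inside the row-wise tensor."""
    return t_ptr + cube_id * n_elem * elem_bytes


def root_cube_id(mesh_w, mesh_h):
    """SE-corner cube id, per the expression given under Non-goals."""
    return (mesh_h - 1) * mesh_w + (mesh_w - 1)
```

For `n_elem = 8` (the `ccl.yaml` default), consecutive shards are 16 bytes apart, and the 4×4 mesh gives root cube 15.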
 ### D3. SFR / IPCQ wiring — `configure_sfr_intercube_multisip`
 Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor
 tables for **every cube's pe0 across every SIP** — regardless of which
 cube is the root or which SIP topology is selected. This leaves room for the
 kernel to elect the root cube at runtime (currently hardcoded, see Non-goals)
 and supports topology switches without re-wiring.
 | Level | Direction labels | Scope |
 |---|---|---|
 | Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
 | Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on SIP A → pe0 of cube c on peer SIP per `sips.topology` |
 Inter-SIP directions use the `global_*` prefix to keep the namespace
 disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended
 with `global_E ↔ global_W` and `global_N ↔ global_S` so the reverse-
 direction resolver handles 2-SIP bidirectional rings correctly.
 Internally the function calls `install_ipcq` with:
 - `world_size = n_sips × n_cubes`
 - `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
 - A closure-captured `neighbors()` function that builds the map above.
 This `world_size` is internal to IPCQ wiring and does not leak to the
 process-group rank.
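A rough sketch of the wiring inputs described above. Assumptions: cubes are numbered row-major with row 0 at the north edge and column 0 at the west edge, and the inter-SIP case shown is `ring_1d` only. The helper names are hypothetical and do not reflect the real `install_ipcq` closure.

```python
def build_rank_to_pe(n_sips, n_cubes):
    """Internal IPCQ rank -> (sip, cube, pe); always pe0, as in D3."""
    return [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]


def intercube_neighbors(cube, mesh_w, mesh_h):
    """N/S/E/W neighbors on the cube mesh, without wrap-around."""
    row, col = divmod(cube, mesh_w)
    out = {}
    if row > 0:
        out["N"] = cube - mesh_w
    if row < mesh_h - 1:
        out["S"] = cube + mesh_w
    if col < mesh_w - 1:
        out["E"] = cube + 1
    if col > 0:
        out["W"] = cube - 1
    return out


def ring_sip_neighbors(sip, n_sips):
    """global_E / global_W peers for a 1D ring of SIPs."""
    if n_sips < 2:
        return {}
    return {"global_E": (sip + 1) % n_sips, "global_W": (sip - 1) % n_sips}
```

In a 2-SIP ring both `global_E` and `global_W` resolve to the same peer, which is the case the extended `_OPPOSITE_DIR` table has to handle.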
 ### D4. SIP topology — from `topology.yaml`
 ```yaml
 system:
  sips:
    count: 2
    topology: ring_1d       # or torus_2d, mesh_2d_no_wrap
 ```
 - `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
 - `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on
  `global_E/W` then col ring on `global_S/N`.
 - `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce +
  broadcast per dimension.
 2D variants require `n_sips` to be a perfect square.
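The perfect-square constraint can be checked when the topology is loaded. The sketch below is one possible derivation of `(sip_topo_kind, sip_topo_w, sip_topo_h)`. The kind codes follow D1/D7, but the function name and the choice of `(n_sips, 1)` as the ring's width and height are assumptions of this example.

```python
import math

# Name -> kernel branch code, as listed in D1 and D7.
TOPO_NAME_TO_KIND = {"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2}


def derive_sip_topology(name, n_sips):
    """Return (kind, width, height) for a `system.sips.topology` name."""
    if name not in TOPO_NAME_TO_KIND:
        raise ValueError(f"unknown SIP topology: {name!r}")
    kind = TOPO_NAME_TO_KIND[name]
    if name == "ring_1d":
        # Assumption: a ring is described as an n_sips x 1 arrangement.
        return kind, n_sips, 1
    side = math.isqrt(n_sips)
    if side * side != n_sips:
        raise ValueError(f"{name} requires n_sips to be a perfect square, got {n_sips}")
    return kind, side, side
```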
 ### D5. Process-group integration — `AhbmCCLBackend`
 At `init_process_group` time the backend:
 1. Loads `ccl.yaml` + `topology.yaml`.
 2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
   `system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`.
 3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time
   SFR wiring, mirrors NCCL communicator creation.
 At each `dist.all_reduce(tensor)` call:
 1. Resolves `kernel_fn` from `cfg["module"]`.
 2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from
   `kernel_args(world_size, n_elem)`.
 3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
   `sip_rank` is the current greenlet's bound rank.
 4. Launches with `_defer_wait=True`; the main scheduler drains pending
   handles after all workers submit (per ADR-0027 D0.4).
 ### D6. Config schema
 `ccl.yaml`:
 ```yaml
 defaults:
  algorithm: intercube_allreduce
  buffer_kind: tcm
  ...
 algorithms:
  intercube_allreduce:
    module: kernbench.ccl.algorithms.intercube_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
    root_cube: 15
 ```
 `topology.yaml`:
 ```yaml
 system:
  sips:
    count: 2
    topology: ring_1d
 sip:
  cube_mesh: { w: 4, h: 4 }
 ```
 ### D7. Algorithm module contract
 Modules loaded via `cfg["module"]` must export:
 | Name | Purpose |
 |---|---|
 | `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
 | `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) |
 | `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code |
 | `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |
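A loader could verify the contract above before wiring anything. The following is a minimal sketch under that assumption; `missing_exports` is a hypothetical helper, and the stand-in module is built with `SimpleNamespace` purely for illustration.

```python
from types import SimpleNamespace

# Export names taken from the contract table above.
REQUIRED_EXPORTS = (
    "kernel",
    "kernel_args",
    "TOPO_NAME_TO_KIND",
    "SIP_TOPO_RING",
    "SIP_TOPO_TORUS",
    "SIP_TOPO_MESH",
)


def missing_exports(module):
    """Return the contract names that `module` does not define."""
    return [name for name in REQUIRED_EXPORTS if not hasattr(module, name)]


# Stand-in for an algorithm module; a real one would be imported by name.
fake_module = SimpleNamespace(
    kernel=lambda *args: None,
    kernel_args=lambda world_size, n_elem: (n_elem, 4, 4, world_size),
    TOPO_NAME_TO_KIND={"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2},
    SIP_TOPO_RING=0,
    SIP_TOPO_TORUS=1,
    SIP_TOPO_MESH=2,
)
```

An empty result from `missing_exports` means the module satisfies the naming part of the contract; it says nothing about the kernel's signature or behavior.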
 ---
 ## Dependencies
 - **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
 - **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
 - **ADR-0025**: Address-based IPCQ direction matching; extended
  `_OPPOSITE_DIR` with `global_*` pairs.
 - **ADR-0027**: Worker-wait / collective-pending drain in main scheduler.
 ## Non-goals
 - **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the
  workload for this algorithm is per-cube DP.
 - **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and
  `mesh_2d_no_wrap` require `n_sips = k²`.
 - **Pipelined chunks**: single-tile per cube, no pipelining yet.
 - **Root cube runtime election**: the kernel currently uses
  `root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
  corner. SFR wiring covers all cubes, so runtime election is a pure kernel
  change when needed.
 ---
 ## Consequences
 ### Positive
 - **Single kernel, single install path** for all-reduce — replaces four
  removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
 - **Topology-agnostic kernel**: ring / torus / mesh selected via one
  integer param, no kernel duplication.
 - **Automatic via `dist.all_reduce`**: no bench-level or user-level
  algorithm selection needed; config-driven end-to-end.
 - **Full SFR wiring**: every cube on every SIP has inter-SIP links
  available — supports future dynamic root-cube election.
 ### Negative
 - **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
  shard within one cube across 8 PEs are not addressable by this kernel.
  Such workloads would need a separate intra-cube all-reduce path (not
  yet implemented).
 - **`configure_sfr_intercube_multisip` always wires all pe0s**: even if a
  given run only needs a subset (e.g. 1 SIP, ring only). Install cost is
  small but not zero.
 ---
 ## Affected files
 | File | Change |
 |---|---|
 | `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
 | `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
 | `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
 | `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
 | `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
 | `ccl.yaml` | Single `intercube_allreduce` entry |
 | `topology.yaml` | Added `system.sips.topology` |
 | `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
 | `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
 | `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
 | `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
 | Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
@@ -0,0 +1,162 @@
 # ADR-0033 — Latency Model: Assumptions and Known Simplifications
 ## Status
 Accepted
 ## Context
 The simulator is an analytical, event-driven performance model — not a
 cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
 or omitted by design. To keep the model auditable and reviewable as a whole,
 this ADR consolidates the assumptions in one place. Individual component ADRs
 (ADR-0015, ADR-0017, ADR-0004) define the *mechanisms*; this document defines
 the *limits of fidelity*.
 ## Decisions
 ### D1. Modeled precisely
 - **Per-directed-edge BW occupancy** (FIFO serialization via `available_at`) —
  ADR-0015 D2.
 - **Per-component switching/overhead latency** (`overhead_ns` attr).
 - **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array
  with address-based PC selection (ADR-0034 D3). Burst granularity tunable
  (`burst_bytes`, default 256B). Read and write share each PC's
  `available_at` (real HW command bus is per-PC shared).
 - **HBM direction switching penalty mechanism**: per-PC last-direction
  tracking + configurable `switch_penalty_ns`. Default 0 — see D2.
 - **Wire chunk-streaming (Phase 2c)**: each wire decomposes Transactions
  with payload into `Flit` objects of `flit_bytes` (default = HBM
  `burst_bytes` = 256B). The wire emits each flit individually after
  `prop_ns + flit_nbytes/bw_gbs` so the link's bandwidth throttles
  flit arrival rate per real-HW wormhole semantics.
 - **Separate Stores per directed edge** (Phase 2c key fix): the wire
  is the *only* conduit between `src.out_ports[dst]` and
  `dst.in_ports[src]`. Earlier the two were aliased to the same
  `simpy.Store`; when the wire put a chunkified flit back, the
  destination's `fan_in` could pull it before the wire applied
  bandwidth delay, leaving half the flits bypassing the bottleneck.
 - **Flit-aware pass-through** (`TransitComponent`, `HbmCtrlComponent`):
  forward each flit serially with per-transaction overhead applied
  ONCE on the first-flit arrival (header decode model). Subsequent
  flits pipeline through with no extra delay. Wormhole emerges
  naturally across multi-hop paths.
 - **HBM CTRL per-flit PC commit**: each flit arriving at HBM CTRL
  schedules a PC commit at `max(env.now, pc_avail[pc]) + chunk_time`,
  with the `is_last` flit waiting for the last PC commit before
  signaling `txn.done`.
 - **Non-flit-aware components (default) reassemble flits at
  `_fan_in`** before the legacy `_forward_txn` path runs. This
  preserves backward compatibility for components that have not yet
  been migrated to flit-aware processing (e.g., `MCpuComponent`,
  `IoCpuComponent` sub-txn generators). Such components reassemble
  *once per leg boundary*, NOT per hop — multi-hop wormhole timing
  through a chain of flit-aware routers is preserved.
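The per-flit wire delay above can be illustrated with a little arithmetic. The sketch assumes that a payload is cut into `flit_bytes`-sized flits with a possibly shorter final flit, and relies on the fact that 1 GB/s equals 1 byte/ns, so `bytes / bw_gbs` is already in nanoseconds. The function names are illustrative.

```python
def split_into_flits(nbytes, flit_bytes=256):
    """Sizes of the flits for a payload; the last flit may be partial
    (an assumption of this sketch)."""
    full, rest = divmod(nbytes, flit_bytes)
    return [flit_bytes] * full + ([rest] if rest else [])


def flit_delay_ns(flit_nbytes, bw_gbs, prop_ns):
    """Wire delay for one flit: prop_ns + flit_nbytes / bw_gbs."""
    return prop_ns + flit_nbytes / bw_gbs
```

For example, a 600-byte payload becomes flits of 256, 256 and 88 bytes, and a full flit on a 128 GB/s link with 1 ns propagation takes 3 ns.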
 ### D2. Approximated (with known directional error)
 | Effect | Real HW | Our model | Error direction |
 |--------|---------|-----------|----------------|
 | Router output port arbitration | Round-robin / weighted | Wire edge FIFO + serial worker | Fair when one txn per cycle; multi-stream sharing not modeled at flit level |
 | HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Pessimistic for mixed R/W when alternations are dense — default `switch_penalty_ns = 0` assumes ideal scheduler amortizes |
 | Flit ↔ burst granularity | 32B flit < 256B burst | `flit_bytes = burst_bytes = 256B` | Sub-flit fine-grained timing noise; affects very small wire arbitration windows only |
 | Wire-level RR fairness | Per-cycle multi-flow arbitration on shared link | Single serial wire process per edge | Fair only when one transaction is in flight on a given edge at a time. Multi-stream concurrent traffic on the same edge serializes by FIFO order |
 ### D3. Ignored (out of scope)
 - Bank-level row buffer conflict penalty (assume no conflicts — best case;
  the model has no per-bank state within a PC, so same-bank reuse cannot be
  detected).
 - HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
  `burst_time = burst_bytes / pc_bw_gbs`).
 - Refresh, ECC, thermal throttling, power gating.
 - Clock domain crossings, PLL lock time.
 - Upstream backpressure due to downstream buffer occupancy (input ports use
  unbounded `simpy.Store`).
 - Sub-flit cycle-level arbitration at routers (flit granularity is our
  smallest unit).
 ### D4. Workload sensitivity
 Workloads where the above simplifications meaningfully affect results:
 - **Random scatter/gather**: bank conflict ignored → model optimistic.
 - **Mixed R/W-intensive workloads** (e.g., GEMM bias accumulation): HBM scheduler
  absent. With default `switch_penalty_ns = 0` we assume ideal amortization;
  setting it non-zero models pessimistic per-alternation cost.
 - **High concurrency (>10 active flows on one link)**: HoL blocking and VC
  limits not modeled → model optimistic.
 - **Very small (sub-flit) transactions**: flit quantization noise.
 - **Concurrent multi-flow on a single wire**: wire is serial FIFO at the
  flit level, so per-flow fairness within a single edge is not modeled.
  Pre-edge merging (multiple sources arriving at a router and being
  forwarded to the same downstream wire) is correctly modeled via the
  flit-aware router's serial worker.
 ### D5. Verification policy
 For workloads in D4, cross-check against real HW or a cycle-accurate
 simulator before drawing absolute-magnitude conclusions. The model remains
 accurate for **relative comparisons** within the modeled regime.
 ### D6. Future work
 Note: multi-stream merging at routers IS modeled correctly — each
 in_port has its own fan_in process, all push to a shared inbox, and
 the router worker forwards in inbox FIFO order. Flits from different
 upstream streams naturally interleave at flit granularity. The items
 below are different concerns, ordered by expected workload impact.
 **Higher impact (workload accuracy gap)**:
 - [ ] **Bank-level conflict modeling** within a PC (opt-in via
  `track_banks: true`). Currently we assume no same-bank reuse;
  random scatter/gather workloads are optimistic here.
 - [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
  from the design discussion). Default `switch_penalty_ns=0` is the
  ideal-amortization stand-in; bursty mixed R/W workloads benefit
  from explicit modeling.
 - [ ] **Backpressure** modeling for finite component buffers. Matters
  at high concurrency / sustained saturation where buffer occupancy
  causes upstream stalls.
 - [ ] **Op_log integration with chunk-streaming**: currently op_log
  fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
  GemmCmd, MathCmd) which are not chunkified. Integration would
  require flit-aware components to also emit op_log start/end hooks
  per transaction (start on first flit, end on is_last).
 **Lower impact (academic / specific use cases)**:
 - [ ] **Cycle-accurate router arbitration policies** (RR with
  priorities, age, iSLIP). The FIFO inbox is already approximately
  fair when flit arrival times differ slightly between streams (the
  common case for similar-rate workloads). True impact appears only
  for: (a) priority/QoS modeling, (b) per-stream tail latency
  analysis under sustained saturation. Not critical for makespan or
  average-latency studies.
 - [ ] **Sub-flit (32B) granularity** for finer wire arbitration
  cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
  per 32B flit. Effect is small for most workloads (sub-flit timing
  noise on small messages).
 ## Consequences
 - Single review point for all model fidelity questions. Each future PR
  touching latency must update the relevant section here.
 - Workload-specific magnitude error envelopes are explicit.
 - Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
  enforces the ADR-0017 D8 invariant in code rather than relying on yaml
  manual consistency.
 - Wire transfer time is charged once per bottleneck-link transit (Phase 2c
  per-flit timing) rather than via terminal `drain_ns` injection. Single
  transactions land at `drain + commit_time + small_overheads`; multi-hop
  preserves wormhole pipelining; multi-stream merge correctly serializes
  at the shared wire's FIFO.
 ## Cross-references
 - ADR-0015 — component / port / wire model.
 - ADR-0017 — Cube NOC architecture and HBM connectivity.
 - ADR-0004 — memory semantics, local HBM.
 - ADR-0034 — HBM controller internal design.
@@ -0,0 +1,271 @@
 # ADR-0034: HBM Controller Internal Design
 ## Status
 Accepted
 ## Context
 `HbmCtrlComponent` is the per-PE HBM partition endpoint at the leaf of
 the cube NOC. One instance is created per PE under the topology node
 `sip{S}.cube{C}.hbm_ctrl.pe{idx}` and attaches to that PE's router
 (ADR-0017 D4). The component models per-pseudo-channel (PC) scheduling,
 burst-granular commit timing, address-based PC selection, and response
 routing back to the requester.
 This ADR documents the component as currently implemented. ADR-0017 D4/D8
 defines *where* HBM CTRL attaches and *what* aggregate BW it must
 deliver. ADR-0033 D1/D2 defines *what fidelity* of HBM modelling is in
 scope. This ADR fills the gap between those two — the per-instance
 internal scheduling model.
 ## Decision
 ### D1. Role
 `HbmCtrlComponent` is a per-PE HBM partition endpoint. One instance per
 PE (default 8 per cube, set by `cube.memory_map.hbm_slices_per_cube`)
 attaches to that PE's router via the `peX.hbm` attachment list in
 `cube_mesh.yaml` (ADR-0017 D4). In the default n:1 channel mapping
 (ADR-0017 D8) the instance aggregates `channels_per_pe` pseudo-channels
 into one endpoint.
 The component models:
 - Per-PC scheduling (D2) with R/W command-bus sharing.
 - Address-based PC selection (D3).
 - Burst-granular commit timing (D4).
 - Flit-aware per-flit PC commit and async finalize (D5, D6).
 - Command-only Transaction handling for read-data drain (D7).
 - Response routing back to the requester (D8).
 It does not model:
 - Bank-level row-buffer conflicts, refresh, ECC, thermal throttling
  (ADR-0033 D3).
 - Cross-PE HBM contention beyond its own router edge (handled by the
  router mesh — ADR-0017 D3).
 - 1:1 channel mode (ADR-0017 D8 future work).
 ### D2. Per-PC scheduling model
 Per-instance state initialised in `start()`:
 - `_pc_avail: list[float]` — earliest sim-time each PC is free; length
  `num_pcs`, initial 0.0.
 - `_pc_last_dir: list["R"|"W"|None]` — direction of the last commit on
  each PC, used for switch-penalty detection (D4); initial `None`.
 `num_pcs` and `burst_bytes` must each be a positive power of two so
 that address-based PC selection (D3) reduces to a shift-and-mask.
 Read and write requests share the same `_pc_avail` slot per PC: the
 real HW per-PC command bus is shared between read and write traffic, so
 issuing a write to PC k delays a subsequent read to PC k by one burst
 time (`chunk_time`, D4), plus `switch_penalty_ns` when that is non-zero.
 Direction `dir` for a request is inferred from the request type:
 - `MemoryWriteMsg` → `"W"`.
 - `PeDmaMsg` with `is_write=True` → `"W"`.
 - All others (`MemoryReadMsg`, `PeDmaMsg` read) → `"R"`.
 ### D3. Address-based PC selection
 PC index for an access is derived from the access address by shift and
 mask:
 ```text
 pc_shift = log2(burst_bytes)         # default 8  (burst=256B)
 pc_mask  = num_pcs - 1               # default 7  (8 PCs)
 pc       = (address >> pc_shift) & pc_mask
 ```
 Computed once in `start()` from topology config so alternative
 `(burst_bytes, num_pcs)` pairs stay consistent. For the canonical
 default `(256, 8)` this places the PC select field at bits `[10:8]` of
 the HBM byte offset: bits `[7:0]` are within-burst (same PC), bits
 `[10:8]` are the 3-bit PC index, bits `[36:11]` are row/bank/column
 within the PC slice (see `phyaddr.py` comment).
 Address-based striping — as opposed to address-blind global
 round-robin — preserves PC parallelism for offset-disjoint concurrent
 transfers: each transfer's bursts land deterministically on the PC set
 implied by its byte addresses, so multi-PE workloads accessing disjoint
 regions do not collide on a single PC.
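The shift-and-mask selection can be written out directly. This is a sketch that mirrors the formula above; the factory name `make_pc_selector` is hypothetical, while the power-of-two checks reflect the constraint stated in D2.

```python
def make_pc_selector(burst_bytes=256, num_pcs=8):
    """Build an address -> PC index function from (burst_bytes, num_pcs)."""
    for name, value in (("burst_bytes", burst_bytes), ("num_pcs", num_pcs)):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a positive power of two, got {value}")
    pc_shift = burst_bytes.bit_length() - 1  # log2(burst_bytes)
    pc_mask = num_pcs - 1

    def pc_for_address(address):
        return (address >> pc_shift) & pc_mask

    return pc_for_address
```

With the default `(256, 8)`, addresses `0x000..0x0FF` map to PC 0, `0x100..0x1FF` to PC 1, and the pattern wraps back to PC 0 at `0x800`.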
 ### D4. Burst granularity and PC commit timing
 A single PC commit takes:
 ```text
 chunk_time = burst_bytes / pc_bw_gbs    # ns
 ```
 - `burst_bytes` (default 256) is the burst granularity matching the
  flit size (ADR-0033 D1).
 - `pc_bw_gbs` is **builder-derived** from
  `hbm_to_router_bw_gbs / num_pcs` (`topology/builder.py`), enforcing
  the ADR-0017 D8 invariant that aggregate per-PE BW equals the
  router-to-HBM link BW.
 Per-PC commit scheduling for an arriving access on PC `pc` with
 direction `dir`:
 ```text
 switch_cost = switch_penalty_ns
              if pc_last_dir[pc] not in (None, dir) else 0
 start  = max(env.now, pc_avail[pc]) + switch_cost
 finish = start + chunk_time
 pc_avail[pc]    = finish
 pc_last_dir[pc] = dir
 ```
 Default `switch_penalty_ns = 0` — Tier 0 assumption that an ideal HBM
 scheduler amortises R/W switching cost (ADR-0033 D2). Non-zero values
 model pessimistic per-alternation cost.
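The commit schedule above translates almost line for line into Python. The class below is a standalone sketch of that rule only (the name `PcScheduler` is an assumption, and there is no SimPy integration); times are in nanoseconds.

```python
class PcScheduler:
    """Standalone model of the per-PC commit rule from D4."""

    def __init__(self, num_pcs=8, pc_bw_gbs=32.0, burst_bytes=256,
                 switch_penalty_ns=0.0):
        self.chunk_time = burst_bytes / pc_bw_gbs  # ns per burst
        self.switch_penalty_ns = switch_penalty_ns
        self.pc_avail = [0.0] * num_pcs
        self.pc_last_dir = [None] * num_pcs

    def commit(self, now, pc, direction):
        """Schedule one burst on `pc`; returns (start, finish)."""
        switching = self.pc_last_dir[pc] not in (None, direction)
        switch_cost = self.switch_penalty_ns if switching else 0.0
        start = max(now, self.pc_avail[pc]) + switch_cost
        finish = start + self.chunk_time
        self.pc_avail[pc] = finish
        self.pc_last_dir[pc] = direction
        return start, finish
```

With the defaults, `chunk_time` is 8 ns: a write followed by a read on the same PC finishes at 16 ns, while an access to a different PC issued at the same time still finishes at 8 ns.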
 ### D5. Flit-aware per-flit PC commit (primary path)
 `_handle_flit` is the primary worker path. For each arriving `Flit`:
 1. On the **first** flit of a transaction (`tid = id(txn)` not in
   `_txn_state`):
   - Apply `overhead_ns` once via `run(env, nbytes)` — header decode
     model, first-flit overhead pattern (ADR-0033 D1).
   - Initialise `_txn_state[tid] = {"last_finish": env.now}`.
 2. Compute `pc = _pc_for_address(flit.address)` (D3).
 3. Apply the per-PC schedule (D4) using the request direction (D2).
 4. Update `state["last_finish"] = max(state["last_finish"], finish)`.
 5. If `flit.is_last`: pop `_txn_state[tid]` and spawn `_finalize_txn`
   (D6).
 Per-flit address-aware commit is the mechanism that lets concurrent
 multi-PE traffic to disjoint HBM offsets pipeline through distinct PCs
 in parallel.
 ### D6. Async finalize per transaction
 When a transaction's last flit has been scheduled, finalisation runs in
 a separately-spawned process:
 ```python
 def _finalize_txn(env, txn, last_finish):
    wait = last_finish - env.now
    if wait > 0:
        yield env.timeout(wait)
    yield from _send_response(env, txn)
 ```
 `_handle_flit` spawns this via `env.process(...)` and returns
 immediately, so the worker can pick up the next inbox message while the
 last PC commit drains.
 Without this split — i.e. if the worker itself did
 `yield env.timeout(wait)` — concurrent single-flit transactions whose
 addresses hit distinct PCs would still serialise at `chunk_time` each
 inside the worker, hiding the PC parallelism that D3 and D5 are
 designed to expose.
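The effect of the split can be seen in a small timing model that does not need SimPy. It assumes that all transactions are single-flit, arrive at t = 0, and incur no overhead or switch penalty; `completion_times` is a hypothetical helper, not part of the component.

```python
def completion_times(pcs, chunk_time, blocking):
    """Finish time of each single-flit transaction, in arrival order.

    pcs: PC index hit by each transaction.
    blocking=True models a worker that itself waits for the PC commit;
    blocking=False models the async finalize described above.
    """
    pc_avail = {}
    now = 0.0
    finishes = []
    for pc in pcs:
        start = max(now, pc_avail.get(pc, 0.0))
        finish = start + chunk_time
        pc_avail[pc] = finish
        finishes.append(finish)
        if blocking:
            now = finish  # the worker is stuck until this commit drains
    return finishes
```

Four transactions on four distinct PCs finish at 8, 16, 24 and 32 ns when the worker blocks, but all at 8 ns with async finalize. Transactions that hit the same PC serialize in both modes, as expected.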
 ### D7. Non-flit fallback for command-only transactions
 `_handle_txn` runs when the inbox delivers a `Transaction` rather than a
 `Flit`. This is the path for command-only requests that the wire does
 not chunk into flits — most notably `MemoryReadMsg` whose command txn
 carries `nbytes=0` (data drain is modelled at HBM CTRL post-processing,
 not as inbound flits).
 Procedure:
 1. `work_bytes = txn.nbytes if txn.nbytes > 0 else int(request.nbytes or 0)`
   — for read commands, work is sized by the request.
 2. `n_chunks = ceil(work_bytes / burst_bytes)` if `work_bytes > 0` else
   0.
 3. `chunk_interval = drain_ns / n_chunks` (when both > 0) — chunks are
   scheduled over time at `drain/n_chunks` ns intervals to model the
   bottleneck-link's data arrival rate (ADR-0033 D1 chunk-loop drain).
 4. Apply `run(env, txn.nbytes)` once for `overhead_ns`.
 5. For each chunk `i`, advance `chunk_interval` ns then apply the D4
   schedule with `pc = _pc_for_address(base_address + i * burst_bytes)`.
 6. After scheduling all chunks, wait `last_finish - env.now` then call
   `_send_response`.
 `_handle_txn` shares the same `_pc_avail` / `_pc_last_dir` state with
 `_handle_flit` — there is exactly one source of PC scheduling truth
 across both paths.
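Steps 1 to 3 and the address stepping from step 5 can be sketched as a pure function. The name `plan_chunks` and its return shape are assumptions of this example. The real `_handle_txn` interleaves these values with simulation waits and the D4 schedule.

```python
import math


def plan_chunks(txn_nbytes, request_nbytes, base_address, drain_ns,
                burst_bytes=256):
    """Return (n_chunks, chunk_interval_ns, chunk_addresses)."""
    # Step 1: read commands carry nbytes=0, so size the work by the request.
    work_bytes = txn_nbytes if txn_nbytes > 0 else int(request_nbytes or 0)
    # Step 2: number of burst-sized chunks.
    n_chunks = math.ceil(work_bytes / burst_bytes) if work_bytes > 0 else 0
    # Step 3: spread chunks over the drain window (only when both are > 0).
    if n_chunks > 0 and drain_ns > 0:
        chunk_interval = drain_ns / n_chunks
    else:
        chunk_interval = 0.0
    # Step 5: each chunk's address advances by one burst.
    addresses = [base_address + i * burst_bytes for i in range(n_chunks)]
    return n_chunks, chunk_interval, addresses
```

A 1000-byte read with a 32 ns drain yields 4 chunks spaced 8 ns apart, each landing on the PC selected from its own address.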
 ### D8. Response routing
 `_send_response` dispatches on request type and path geometry:
 | Case | Trigger | Response |
 | --- | --- | --- |
 | PE_DMA | `isinstance(txn.request, PeDmaMsg)` | New reverse-path Transaction (`is_response=True`, `nbytes=0`), same `done` |
 | Bypass — Memory Read | `not any("m_cpu" in hop for hop in txn.path)` AND `MemoryReadMsg` | Reverse-path Transaction with `nbytes=request.nbytes` (data return) |
 | Bypass — Memory Write | `not any("m_cpu" in hop for hop in txn.path)` AND not Memory Read | `txn.done.succeed()` (write completes locally) |
 | Default | otherwise | New `ResponseMsg(correlation_id, request_id, src_cube, src_pe, success=True)` on reverse path |
 The "bypass" classification matches the Memory R/W fabric path defined
 in ADR-0015 D4 (PCIE_EP → io_noc → ucie → cube router → hbm_ctrl,
 without M_CPU). The PE_DMA case is its own dedicated reverse-path to
 keep the inner-loop DMA fast (PE_DMA reads/writes do not synthesise a
 ResponseMsg envelope).
 In all reverse-path cases, the response Transaction is put onto
 `out_ports[reverse_path[1]]` — the first hop back along the recorded
 forward path. If `reverse_path` has fewer than 2 entries (degenerate
 path), the original `txn.done` is signalled directly.
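The dispatch order in the table can be expressed as a small classifier. This sketch uses plain strings in place of the real message classes, and the node names in the example paths are made up for illustration; only the decision order follows the table.

```python
def classify_response(request_kind, path):
    """Pick the response case for a finished transaction.

    request_kind: "pe_dma", "memory_read", or anything else
                  (stand-ins for PeDmaMsg / MemoryReadMsg / other requests).
    path: forward path as a list of node-name strings.
    """
    if request_kind == "pe_dma":
        return "pe_dma"
    via_m_cpu = any("m_cpu" in hop for hop in path)
    if not via_m_cpu:
        if request_kind == "memory_read":
            return "bypass_read"
        return "bypass_write"
    return "default"


# Hypothetical node names, for illustration only.
BYPASS_PATH = ["pcie_ep", "io_noc", "ucie", "sip0.cube0.router",
               "sip0.cube0.hbm_ctrl.pe0"]
M_CPU_PATH = ["io_cpu", "sip0.cube0.m_cpu", "sip0.cube0.router",
              "sip0.cube0.hbm_ctrl.pe0"]
```

Checking PE_DMA first matches the table order, so a PE_DMA request is never classified as a bypass write even when its path avoids M_CPU.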
 ### D9. Configurable attributes
 | Attribute | Default | Source | Notes |
 | --- | --- | --- | --- |
 | `num_pcs` | 8 | topology cube `hbm_ctrl.attrs` | Must be power of 2 |
 | `pc_bw_gbs` | 32.0 | builder-derived: `hbm_to_router_bw_gbs / num_pcs` | Enforces ADR-0017 D8 invariant |
 | `burst_bytes` | 256 | topology attrs | Must be power of 2; equals `flit_bytes` (ADR-0033 D1) |
 | `switch_penalty_ns` | 0.0 | topology attrs | Tier 0 default; non-zero models pessimistic R/W switching |
 | `efficiency` | 1.0 | topology attrs | Applied at builder time to `hbm_to_router_bw_gbs` (router-edge BW scaling only) |
 | `overhead_ns` | 0.0 | topology attrs | First-flit decode overhead (D5) |
 `pc_bw_gbs` is derived by `topology/builder.py` rather than configured
 directly so the aggregate per-PE BW matches the router-to-HBM link BW
 without yaml-side duplication.
 ## Consequences
 ### Positive
 - Address-based PC selection preserves multi-stream HBM parallelism
  that an address-blind round-robin would collapse — important for
  multi-PE workloads with disjoint HBM regions.
 - Flit-aware path (D5) + async finalize (D6) preserves wormhole
  pipelining and exposes PC parallelism for back-to-back single-flit
  transactions.
 - Single source of PC scheduling truth (D4 mechanism, used by both D5
  flit path and D7 chunk-loop path).
 - Builder-derived `pc_bw_gbs` enforces ADR-0017 D8 in code, not yaml
  discipline.
 ### Negative
 - No bank-level conflict modelling within a PC; address-blind to
  bank/row-buffer reuse (ADR-0033 D3).
 - No HBM scheduler (FR-FCFS / write-buffer / watermark drain); fixed
  FIFO per PC. Bursty mixed R/W is approximated by `switch_penalty_ns`
  (ADR-0033 D2).
 - `_txn_state` is a regular dict keyed by `id(txn)`; in-flight state
  accumulates per concurrent transaction and is removed only on
  `is_last`. Adequate for current workloads.
 ## Links
 - ADR-0001 (Physical address layout — PC bit field comment)
 - ADR-0015 D4 (Memory R/W fabric path — bypass response case)
 - ADR-0017 D4 (Per-PE HBM partitioning — attachment to PE routers)
 - ADR-0017 D8 (HBM channel mapping mode — n:1 aggregate this ADR
  implements)
 - ADR-0017 D9 (AddressResolver — `hbm_ctrl.pe{pe_id}` endpoint
  resolution)
 - ADR-0033 D1 (Modelled precisely — per-PC parallelism, switch penalty,
  flit-aware PC commit, first-flit overhead, chunk-loop drain)
 - ADR-0033 D2 (Switch-penalty default 0 — ideal scheduler amortisation)
@@ -0,0 +1,286 @@
 # ADR-0035: M_CPU and M_CPU.DMA Component Model
 ## Status
 Accepted
 ## Context
 M_CPU is the cube-level command processor. It receives commands from
 IO_CPU (or from PCIE_EP when the engine routes Memory R/W through
 M_CPU as a fallback), fans them out to the PEs in its cube, and
 aggregates per-PE responses into a single ResponseMsg sent back to
 IO_CPU on the reverse path.
 M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W
 fan-out. Per ADR-0015 D5 it is **not** a separate topology node —
 it lives as internal state of `MCpuComponent`.
 This ADR documents the M_CPU component implementation that realizes
 those responsibilities, including the three distinct fan-out paths
 (Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource
 model, and the response aggregation contract.
 ## Decision
 ### D1. Role
 M_CPU has three responsibilities:
 1. **Transit forwarding** — when not the terminal hop (e.g., on the
   reverse response path PE → M_CPU → IO_CPU), forwards Transactions
   to `next_hop` in their pre-computed path.
 2. **Multi-PE fan-out at terminal hop** — dispatches to one of three
   fan-out paths based on request type (D2).
 3. **Response aggregation** — collects per-PE responses, sends a
   single aggregate ResponseMsg back to IO_CPU on the reverse path.
 Per invocation (`run()`): applies `overhead_ns` once per incoming
 Transaction.
 M_CPU does **not**:
 - Decide routing — paths are pre-computed by the router (ADR-0002).
 - Handle PE-internal execution — PE_CPU / PE_SCHEDULER / engines
  (ADR-0014).
 - Decode addresses — `ctx.resolver.resolve(pa)` returns the per-PE
  `hbm_ctrl.pe{X}` directly (ADR-0017 D9).
 - Interpret tensor or kernel semantics: fan-out dispatch relies solely
   on a Python `isinstance` check of the request type.
 ### D2. Three fan-out paths dispatched by request type
 At the terminal hop the worker dispatches by request type:
 ```python
 elif self.ctx is not None and txn.request is not None:
    if isinstance(txn.request, KernelLaunchMsg):
        env.process(self._kernel_launch_fanout(env, txn))
    elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
        env.process(self._mmu_msg_fanout(env, txn))
    else:
        env.process(self._dma_fanout(env, txn))
 ```
 Each path uses a different router method:
 - `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()` — the
  M_CPU-specific DMA path that avoids PE pipeline nodes.
 - `_kernel_launch_fanout` uses `ctx.router.find_node_path()` — the
  generic NOC command path to PE_CPU.
 - `_mmu_msg_fanout` uses `ctx.router.find_node_path()` — NOC command
  path to PE_MMU.
 ### D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)
 `MCpuComponent.start()` initializes two SimPy resources:
 ```python
 self._dma_write = simpy.Resource(env, capacity=1)  # MemoryWriteMsg
 self._dma_read  = simpy.Resource(env, capacity=1)  # MemoryReadMsg
 ```
 Properties:
 - **Not a topology node** — managed entirely inside `MCpuComponent`;
  does not appear in `topology.yaml` or in the compiled graph.
 - **Independent read and write channels** — concurrent in-flight
  Memory R/W is allowed.
 - **Capacity=1 per channel** serializes the **dispatch step**
  (`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory
  R/W requests at this M_CPU. Actual fabric transfer time is modeled
  by wire processes between components (ADR-0015 D2) and by
  `drain_ns` at terminal hops; the DMA resource does not gate
  transfer duration.
 Resource selection is request-type-based:
 ```python
 dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
 ```
 ### D4. Transit forwarding at non-terminal hops
 When `txn.next_hop` is not None — typical for the reverse response
 path (PE → M_CPU → IO_CPU) — the worker forwards normally:
 ```python
 if next_hop:
    yield self.out_ports[next_hop].put(txn.advance())
 ```
 The fan-out branches fire only at the terminal hop. The same component
 therefore serves both forward command dispatch and reverse response
 relay roles.
 ### D5. DMA fan-out (`_dma_fanout` — Memory R/W)
 For each Memory R/W request at terminal hop:
 1. `_resolve_dma_destinations(request)` returns a per-PE
   `hbm_ctrl.pe{X}` derived from the request's PA via
   `ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9).
 2. For each destination:
   - Acquire the appropriate DMA resource (`_dma_write` or
     `_dma_read`) via `with dma_res.request() as req`.
   - Resolve path via `ctx.router.find_mcpu_dma_path()`.
   - Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`.
   - Create sub-Transaction carrying `drain_ns` and dispatch to
     `path[1]`.
 3. Track `max_drain_ns` across destinations and record it as
   `txn.result_data["xfer_ns"]` after all responses arrive.
 4. After all per-PE responses are collected (D8), send an aggregate
   ResponseMsg on the reverse command path back to IO_CPU.
 The PA decode fallback (`f"{cube_prefix}.hbm_ctrl"`) is legacy dead code:
 no such node exists after ADR-0017 D4's per-PE partitioning. It is kept
 defensively but does not route to a real destination.
 ### D6. Kernel launch fan-out (`_kernel_launch_fanout`)
 For `KernelLaunchMsg` at terminal hop:
 1. `_resolve_pe_ids(target_pe)` → list of PE ids in this cube.
 2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via
   `ctx.router.find_node_path()`.
 3. **`target_start_ns` handling** (ADR-0009 D5):
   - If the request already carries `target_start_ns` (stamped by
     IO_CPU per ADR-0036 D3): **pass through unchanged**.
   - If absent (direct-to-M_CPU launch in unit tests): compute a
     per-cube barrier `env.now + max(per-PE leg latency)` and stamp
     via `dataclasses.replace`.
 4. Dispatch sub-Transactions with `nbytes=0` (kernel launch is a
   control message; preserving nbytes=0 keeps fan-out off the shared
   first-hop fabric BW, mirroring ADR-0036 D4).
 5. After all per-PE responses arrive (D8), aggregate per-PE metrics
   from each sub-Transaction's `result_data` into the parent
   transaction:
   ```python
   txn.result_data["pe_exec_ns"]  = max(existing, max(pe_exec_values))
   txn.result_data["dma_ns"]      = max(existing, max(dma_values))
   txn.result_data["compute_ns"]  = max(existing, max(compute_values))
   ```
   The max-merge with the existing value matters because cross-cube
   IO_CPU fan-out shares the same parent `result_data`; merging
   prevents one cube from clobbering another's metric.
 6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
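The max-merge in step 5 can be illustrated with a minimal, standalone sketch. The helper name `merge_max` and the sample values are hypothetical and not taken from the implementation; the sketch only assumes that `result_data` behaves like a plain dict and that a missing key counts as 0.0.

```python
def merge_max(result_data: dict, key: str, per_pe_values: list) -> None:
    """Max-merge per-PE values into the parent's result_data, in place."""
    if not per_pe_values:
        return  # nothing reported by this cube; keep the existing value
    existing = result_data.get(key, 0.0)  # assumption: missing key means 0.0
    result_data[key] = max(existing, max(per_pe_values))


# Parent result_data already holds a value written by a slower cube.
parent_result = {"pe_exec_ns": 120.0}

# This cube's PEs finished sooner; the earlier, larger value must survive.
merge_max(parent_result, "pe_exec_ns", [80.0, 95.0])

# First write for this key: the per-PE maximum is stored.
merge_max(parent_result, "dma_ns", [40.0, 55.0])
```

A plain assignment in place of the `max` against the existing value would have overwritten 120.0 with 95.0, which is the clobbering that step 5 guards against.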
 ### D7. MMU map/unmap fan-out (`_mmu_msg_fanout`)
 For `MmuMapMsg` / `MmuUnmapMsg` at terminal hop:
 1. `_resolve_pe_ids(target_pe)` → PE ids.
 2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via
   `find_node_path()`.
 3. Dispatch sub-Transactions with `nbytes=0`.
 4. PE_MMU is a terminal node — it does **not** send a ResponseMsg
   back. Instead, the sub-Transaction's own `sub_done` event is the
   completion signal.
 5. Wait for all `sub_done` events in-line (does **not** use
   `_pending` counter — D8 is for response-bearing fan-out only).
 6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
 ### D8. Response aggregation (`_pending` + `_parent_txns`)
 For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg
 arriving on the reverse path):
 ```python
 self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
 self._parent_txns: dict[str, Any] = {}
 ```
 - On dispatch: register `(expected, received=0, all_done)` and
  remember the parent transaction.
 - `_worker` recognises responses by `is_response=True` and routes
  them to `_collect_response`, which increments `received` and
  signals `all_done` when `received >= expected`.
 - After `yield all_done`, the fan-out path constructs the aggregate
  ResponseMsg:
  ```python
  resp_msg = ResponseMsg(
      correlation_id=request.correlation_id,
      request_id=request.request_id,
      src_cube=cube_id,
      src_pe=-1,             # -1 = M_CPU aggregate, not a single PE
      success=True,          # no failure semantics implemented
  )
  ```
 - The response Transaction travels on `list(reversed(txn.path))`
  back to IO_CPU.
 MMU fan-out (D7) uses a simpler in-line list of `sub_done` events
 because PE_MMU is terminal — there is no ResponseMsg path to
 intercept.
 ### D9. Helpers and configurable attribute
 `_resolve_pe_ids(target_pe)`:
 - `int` → `[target_pe]`
 - `tuple[int, ...]` → `list(target_pe)`
 - `"all"` → `range(n_slices)` where `n_slices` comes from cube
  `memory_map.hbm_slices_per_cube` (default 8).
 Used by kernel-launch and MMU fan-out paths.
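The three `target_pe` forms listed above might be handled roughly as follows. This is a standalone sketch rather than the actual method: the free-function form, the `n_slices` keyword, and the error handling for unsupported inputs are assumptions made for illustration.

```python
def resolve_pe_ids(target_pe, n_slices: int = 8) -> list:
    """Expand a target_pe selector into an explicit list of PE ids."""
    if isinstance(target_pe, str):
        if target_pe == "all":
            # Every PE slice in the cube (default of 8 per the ADR text).
            return list(range(n_slices))
        raise ValueError(f"unsupported target_pe string: {target_pe!r}")
    if isinstance(target_pe, int):
        return [target_pe]
    if isinstance(target_pe, tuple):
        return list(target_pe)
    raise TypeError(f"unsupported target_pe type: {type(target_pe).__name__}")


single = resolve_pe_ids(3)
subset = resolve_pe_ids((0, 2, 5))
everyone = resolve_pe_ids("all", n_slices=4)
```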
 A single configurable attribute drives per-instance latency:
 | Site | impl name | overhead_ns |
 | --- | --- | --- |
 | Cube `m_cpu` | `builtin.m_cpu` | 5.0 |
 Applied once in `run()` per Transaction — models command
 interpretation and dispatch-decision time at M_CPU.
 ## Consequences
 ### Positive
 - Three fan-out paths are clearly separated by request type — adding
  a new request kind is an isinstance branch + one fan-out method.
 - M_CPU.DMA channels are independent (read and write run concurrently)
  and serialize only the dispatch step at capacity=1.
 - Transit-vs-terminal behavior is a single `if next_hop` check, so
  the same component handles forward dispatch and reverse response
  relay without role duplication.
 - `target_start_ns` passthrough (D6) preserves the cross-cube barrier
  established by IO_CPU (ADR-0036 D3), while the fallback computation
  keeps direct-to-M_CPU unit tests working.
 - Per-PE metric `max`-merge against existing parent `result_data`
  values is robust to cross-cube IO_CPU fan-out sharing the same
  parent.
 ### Negative
 - No partial-failure semantics — a missing per-PE response stalls the
  parent `all_done` indefinitely. Acceptable for simulation; not
  suitable as a production-style endpoint.
 - `_resolve_dma_destinations`'s cube-wide hbm_ctrl fallback is dead
  code (no such node exists post-ADR-0017 D4). Kept defensively;
  invites confusion and merits a follow-up cleanup.
 - DMA resource serialization applies only at dispatch (the `put` call
   is instantaneous in unbounded stores). The capacity=1 channel
   models "one dispatch at a time per channel at this M_CPU", not
   "transfer duration serialization". Readers must consult wire
   processes (ADR-0015 D2) and `drain_ns` for actual transfer
   parallelism.
 ## Links
 - ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
 - ADR-0009 D5 (`target_start_ns` — passed through unchanged when
  present; computed as per-cube barrier when absent)
 - ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out
  point)
 - ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same
  contract at cube level)
 - ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a
  topology node)
 - ADR-0017 D9 (AddressResolver returns per-PE `hbm_ctrl.pe{X}`)
 - ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes
  through unchanged; nbytes=0 invariant preserved through fan-out)
@@ -0,0 +1,216 @@
 # ADR-0036: IO_CPU Component Model
 ## Status
 Accepted
 ## Context
 IO_CPU is the IO chiplet's host-facing endpoint inside the simulation
 graph. PCIE_EP receives host messages from the runtime API and routes
 them via the io_noc; for command-bearing requests (KernelLaunch,
 MmuMap/Unmap) the io_noc forwards to IO_CPU, which:
 - Fans out the request to per-cube M_CPUs.
 - Aggregates per-cube responses into a single host-visible completion.
 - For kernel launches, stamps a global `target_start_ns` barrier so
  every PE across every targeted cube begins kernel body execution at
  the same simulated time (ADR-0009 D5).
 Memory R/W traffic bypasses IO_CPU per ADR-0015 D4 / ADR-0016 D3;
 this component therefore handles only command-plane traffic in normal
 operation.
 This ADR documents the IO_CPU component implementation that realizes
 those responsibilities.
 ## Decision
 ### D1. Role
 IO_CPU is the host-facing endpoint of the IO chiplet. It has two
 primary responsibilities:
 1. **Multi-cube fan-out** — distribute KernelLaunchMsg / MmuMapMsg /
   MmuUnmapMsg to per-cube M_CPUs.
 2. **Response aggregation** — collect per-cube ResponseMsg, signal
   parent `txn.done` when all targeted cubes have responded.
 A third, narrower responsibility applies only to KernelLaunchMsg:
 **`target_start_ns` global barrier stamping** (D3).
 The component does **not**:
 - Decide routing — paths are pre-computed by the router (ADR-0002).
 - Decode tensor or kernel internals — those concerns belong to
  M_CPU / PE_CPU / engines.
 - Handle PE-level fan-out — M_CPU fans out within a cube (ADR-0009 D3).
 - Handle Memory R/W data path — those bypass IO_CPU per ADR-0015 D4
  and ADR-0016 D3 (Memory R/W resolution code in
  `_resolve_cube_targets` exists as a defensive fallback only).
 Per invocation (`run()`): applies the configured `overhead_ns` once
 per incoming Transaction (D8).
 ### D2. Forward path — multi-cube fan-out
 When a non-response Transaction arrives, the worker:
 1. Pays `overhead_ns` via `run()`.
 2. Calls `_resolve_cube_targets` to derive the list of `(sip, cube)`
   targets from the request (D5).
 3. For each target:
   - Resolves M_CPU node id via `ctx.resolver.find_m_cpu(sip, cube)`.
   - Resolves the path via `ctx.router.find_node_path(io_cpu, m_cpu)`.
   - Creates a per-cube sub-Transaction with `path` populated and
     forwards it to `path[1]` (the first hop on the io_noc).
 4. Registers aggregation state: `_pending[request_id] = (expected,
   received=0, parent_done)`.
 ### D3. KernelLaunch `target_start_ns` global barrier (ADR-0009 D5)
 IO_CPU is the canonical stamper for `target_start_ns`. When the
 request is a `KernelLaunchMsg`, IO_CPU computes a single global
 barrier covering every targeted PE across every targeted cube:
 ```text
 for (sip, cube) in cube_targets:
    leg1 = compute_path_latency_ns(io_cpu → m_cpu(sip, cube), nbytes=0)
    for pe_id in target_pe_ids:
        leg2 = compute_path_latency_ns(m_cpu → pe_cpu(sip, cube, pe_id),
                                       nbytes=0)
        latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns
        global_max = max(global_max, latency)
 target_start_ns = env.now + global_max
 ```
 The request is then replaced (via `dataclasses.replace`) so the
 stamped value propagates through the fan-out.
 Two overhead corrections:
 - `io_overhead_ns` is subtracted because IO_CPU has already paid it
  in `run()` before this method runs.
 - `m_overhead_ns` is subtracted once because it appears as the
  endpoint of leg1 *and* the start of leg2 in path latency, but
  M_CPU pays it only once at run time.
 Every downstream PE_CPU yields until `target_start_ns` before
 beginning kernel body execution; all PEs therefore start at the same
 simulated time regardless of how long their individual dispatch path
 took.
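As a worked example of the barrier arithmetic, the sketch below replaces `compute_path_latency_ns` with pre-computed lookup tables. All latency values are hypothetical; only the two overhead values (10.0 ns for IO_CPU, 5.0 ns for M_CPU) come from the configuration tables in ADR-0036 D8 and ADR-0035 D9. The function name is illustrative.

```python
def global_target_start_ns(now_ns, cube_targets, pe_ids, leg1_ns, leg2_ns,
                           io_overhead_ns, m_overhead_ns):
    """Return now + the worst-case dispatch latency over all (cube, PE) pairs."""
    global_max = 0.0
    for cube in cube_targets:
        for pe_id in pe_ids:
            # Both overhead corrections from D3 are applied per pair.
            latency = (leg1_ns[cube] + leg2_ns[(cube, pe_id)]
                       - io_overhead_ns - m_overhead_ns)
            global_max = max(global_max, latency)
    return now_ns + global_max


# Hypothetical path latencies (ns) for two cubes with two PEs each.
cubes = [(0, 0), (0, 1)]
pes = [0, 1]
leg1 = {(0, 0): 30.0, (0, 1): 42.0}           # io_cpu to m_cpu
leg2 = {((0, 0), 0): 12.0, ((0, 0), 1): 14.0,  # m_cpu to pe_cpu
        ((0, 1), 0): 12.0, ((0, 1), 1): 14.0}

barrier = global_target_start_ns(100.0, cubes, pes, leg1, leg2,
                                 io_overhead_ns=10.0, m_overhead_ns=5.0)
# Worst pair is cube (0, 1), PE 1: 42 + 14 - 10 - 5 = 41 ns after now.
```

With these numbers every PE waits until 141.0 ns, including the PEs in cube (0, 0) whose own dispatch path would have completed at 129.0 ns.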
 ### D4. KernelLaunch sub-Transactions carry `nbytes=0`
 Per-cube sub-Transactions for KernelLaunchMsg force `nbytes=0`,
 overriding the parent `txn.nbytes`:
 - Kernel launch is a control message; payload size is irrelevant at
  the data-fabric level.
 - If `nbytes > 0`, every per-cube sub-txn occupies fabric BW on the
  io_noc's shared first hop. With 16 cubes this serializes fan-out,
  pushing far M_CPUs past `target_start_ns` and breaking the D3
  invariant.
 Non-KernelLaunch sub-Transactions preserve `txn.nbytes` (only relevant
 for the defensive Memory R/W fallback path, which carries actual
 payload sizes).
 ### D5. Per-request-type cube target resolution
 `_resolve_cube_targets` dispatches by request type:
 | Request type | Source of `(sip, cube)` | `target_cubes="all"` semantics |
 | --- | --- | --- |
 | `MemoryWriteMsg` | `dst_sip`, `dst_cube` (or `PhysAddr.decode(dst_pa).die_id` fallback) | single cube derived from PA decode |
 | `MemoryReadMsg` | `src_sip`, `src_cube` (or `PhysAddr.decode(src_pa).die_id` fallback) | single cube derived from PA decode |
 | `KernelLaunchMsg` | tensor shards filtered by `shard.sip == my_sip` | every cube that owns a shard on this SIP |
 | `MmuMapMsg` / `MmuUnmapMsg` | `target_cubes` list, filtered to this SIP | `range(cubes_per_sip)` from spec |
 Each IO_CPU instance fans out only within its own SIP — `_my_sip()`
 parses the SIP id from the node id (e.g., `sip0.io0.io_cpu` → 0).
 The Memory R/W rows exist for defensive completeness; the engine's
 normal path routes Memory R/W via `_process_memory_direct()` /
 `find_memory_path()`, bypassing IO_CPU entirely (ADR-0015 D4 /
 ADR-0016 D3).
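The SIP-id parsing that `_my_sip()` performs could look like the sketch below. Only the `sip0.io0.io_cpu` → 0 example is taken from the text; the regular expression, the function name, and the error for malformed ids are assumptions about how such a parser might be written.

```python
import re

_SIP_PREFIX = re.compile(r"sip(\d+)\.")


def parse_sip_id(node_id: str) -> int:
    """Extract the SIP index from a node id such as 'sip0.io0.io_cpu'."""
    match = _SIP_PREFIX.match(node_id)
    if match is None:
        raise ValueError(f"node id has no 'sip<N>.' prefix: {node_id!r}")
    return int(match.group(1))


my_sip = parse_sip_id("sip0.io0.io_cpu")
```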
 ### D6. Response aggregation
 `_pending: dict[request_id → (expected, received, parent_done)]`:
 - On dispatch: register `(len(cube_targets), 0, txn.done)`.
 - `_worker` recognises responses by `is_response=True` and routes
  them to `_collect_response`.
 - `_collect_response` increments `received`; when `received >=
  expected`, `parent_done.succeed()` is invoked and the entry is
  removed from `_pending`.
 This is a simple per-request counter. There is no per-cube identity
 tracking and no partial-failure handling: a missing response stalls
 the parent `done` event indefinitely. Production-style failure paths
 are out of scope for the current simulator model.
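The counter behavior can be sketched without SimPy by substituting a plain callback for the `parent_done` event. The class name `PendingTable`, the `on_done` callback, and the `outstanding()` helper are illustrative stand-ins, not names from the implementation.

```python
class PendingTable:
    """Counter-based aggregation keyed by request_id (stand-in for `_pending`)."""

    def __init__(self):
        self._pending = {}  # request_id -> (expected, received, on_done)

    def register(self, request_id, expected, on_done):
        self._pending[request_id] = (expected, 0, on_done)

    def collect_response(self, request_id):
        expected, received, on_done = self._pending[request_id]
        received += 1
        if received >= expected:
            del self._pending[request_id]  # entry removed on completion
            on_done()                      # stands in for parent_done.succeed()
        else:
            self._pending[request_id] = (expected, received, on_done)

    def outstanding(self):
        return len(self._pending)


completed = []
table = PendingTable()
table.register("req-1", expected=3, on_done=lambda: completed.append("req-1"))

table.collect_response("req-1")
table.collect_response("req-1")
still_waiting = table.outstanding()  # two of three responses received
table.collect_response("req-1")      # third response completes the request
```

If the third response never arrived, the entry would stay in the table and the callback would never fire, which is the indefinite stall described above.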
 ### D7. `target_pe` resolution helper
 `_resolve_pe_ids(target_pe)`:
 - `int` → `[target_pe]`.
 - `tuple[int, ...]` → `list(target_pe)`.
 - `"all"` → `range(n_slices)`, where `n_slices` comes from cube
  `memory_map.hbm_slices_per_cube` (default 8).
 Used in D3's barrier computation to enumerate every PE target per
 cube.
 ### D8. Configurable `overhead_ns`
 A single attribute drives per-instance latency:
 | Site | impl name | overhead_ns |
 | --- | --- | --- |
 | IO chiplet `io_cpu` | `builtin.io_cpu` | 10.0 |
 Applied once in `run()` per Transaction. Models command
 interpretation + dispatch-decision time at IO_CPU.
 ## Consequences
 ### Positive
 - Cross-cube and cross-SIP kernel launches share a single global
  barrier (D3 + D4) — no per-cube divergence in start time.
 - nbytes=0 invariant keeps fan-out off the shared first-hop fabric
  BW, preserving the barrier's accuracy at scale (16 cubes).
 - Response aggregation via a single counter keeps state minimal and
   makes completion ordering deterministic.
 - Per-SIP scoping (`_my_sip()`) keeps IO_CPUs in different SIPs
  cleanly independent.
 ### Negative
 - No partial-failure semantics — a missing per-cube response
  indefinitely stalls the parent. Adequate for simulation but not
  suitable as a production-style endpoint.
 - `_pending` is a regular dict; in-flight requests accumulate state.
  Acceptable for current benchmark workloads (few concurrent
  outstanding launches); unbounded in principle.
 - The Memory R/W resolution branches in `_resolve_cube_targets` are
  dead code in the normal engine path. Kept defensively but invite
  drift if the bypass path ever changes.
 ## Links
 - ADR-0002 (Routing distance — path computation)
 - ADR-0009 D1 (Kernel launch is an endpoint request to IO_CPU)
 - ADR-0009 D3 (M_CPU fans out within a cube; IO_CPU fans out across
  cubes)
 - ADR-0009 D5 (target_start_ns canonical stamping at IO_CPU)
 - ADR-0011 D-VA3 (MmuMapMsg routes through IO_CPU for cube fan-out)
 - ADR-0012 (Host ↔ IO_CPU message schema)
 - ADR-0015 D4 (Memory R/W bypasses IO_CPU; Kernel Launch via IO_CPU)
 - ADR-0016 D1 (IO chiplet io_noc — IO_CPU attaches here)
 - ADR-0016 D3 (Memory R/W path bypasses IO_CPU)
 - ADR-0016 D4 (Kernel Launch path through IO_CPU for command
  interpretation)
@@ -0,0 +1,200 @@
 # ADR-0037: Forwarding Component (forwarding_v1)
 ## Status
 Accepted
 ## Context
 The simulation graph has many node positions that exist purely to model
 fabric traversal — NOC mesh routers, switches, UCIe protocol endpoints,
 IO chiplet io_noc, transit cubes. These share a common pattern: receive
 a message, apply per-component overhead (modeling header decode +
 routing decision time), forward to the next hop along the pre-computed
 path.
 This ADR defines the contract for these transit nodes: a single
 component type (`TransitComponent`) that handles flit-aware forwarding
 with wormhole cut-through semantics, used under multiple impl names
 according to the conceptual role each instance plays.
 ## Decision
 ### D1. Role
 The Forwarding component (`TransitComponent` class) is a **stateless
 transit node** in the simulation graph. It models any fabric position
 where a message physically traverses but no semantic processing
 happens.
 Per traversal, the component:
 1. Reads an incoming Transaction or Flit from an `in_port`.
 2. Applies the configured per-component overhead (`overhead_ns`),
   applied **once per Transaction** even across multi-flit payloads
   (see D2).
 3. Looks up the next hop along the Transaction's pre-computed `path`.
 4. Forwards to the corresponding `out_port`; at the terminal node
   (no next hop), signals `txn.done` once the `is_last` flit arrives.
 The component **does NOT**:
 - Decide routing — paths are pre-computed by the router (ADR-0002 /
  ADR-0017 D2). Forwarding only executes the per-hop step.
 - Model wire propagation or bandwidth occupancy — separate wire
  processes between components handle that (ADR-0015 D2).
 - Resolve addresses — the AddressResolver does that (ADR-0017 D9).
 - Aggregate completion — terminal endpoints (IO_CPU, M_CPU, HBM_CTRL)
  handle that.
 ### D2. First-flit overhead model (header decode)
 Per-Transaction `overhead_ns` is applied **exactly once**, at first
 flit arrival:
 - `_txn_decoded: set[int]` tracks which Transactions have already
  paid the overhead at this node.
 - On first-flit arrival for a Transaction: `yield self.run(env,
  msg.txn.nbytes)` — pays the overhead.
 - Subsequent flits of the same Transaction skip the overhead — they
  pipeline through with no extra delay.
 - On `is_last` flit: remove the Transaction from `_txn_decoded`.
 This models the real-HW behavior where header decode and routing
 decision happen once on first flit; payload flits then stream through
 the same path (wormhole cut-through). Multi-hop pipelining emerges
 naturally — each hop adds its own first-flit overhead, but flits
 after the first do not re-pay overhead at any hop they have already
 passed first.
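The bookkeeping around `_txn_decoded` can be sketched in isolation. The class below is a simplified model, not the `TransitComponent` code: it returns the delay a flit would pay instead of yielding on a SimPy timeout, and the `delay_for` method and integer transaction ids are assumptions.

```python
class FirstFlitOverhead:
    """Charge overhead_ns once per Transaction, on its first flit only."""

    def __init__(self, overhead_ns: float):
        self.overhead_ns = overhead_ns
        self._txn_decoded = set()  # transactions that already paid here

    def delay_for(self, txn_id: int, is_last: bool) -> float:
        delay = 0.0
        if txn_id not in self._txn_decoded:
            # First flit of this Transaction at this node: header decode.
            self._txn_decoded.add(txn_id)
            delay = self.overhead_ns
        if is_last:
            # Last flit: forget the Transaction so the set does not grow.
            self._txn_decoded.discard(txn_id)
        return delay


node = FirstFlitOverhead(overhead_ns=2.0)

# Three flits of one Transaction: only the first pays the overhead.
delays = [node.delay_for(7, is_last=(i == 2)) for i in range(3)]

# A single-flit Transaction is both first and last: it pays once and is removed.
single = node.delay_for(8, is_last=True)
```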
 ### D3. Serial worker forwarding (preserves order)
 The component's worker is a single SimPy process that consumes flits
 from `_inbox` and forwards them serially in arrival order. The
 component does NOT spawn `env.process(...)` per flit.
 Rationale: if the first flit yields on `overhead_ns` while subsequent
 flits run in parallel processes, the later flits can overtake the
 first. This produces out-of-order delivery and lets the `is_last`
 flit arrive at the destination before the first flit — corrupting
 both the transaction's completion semantics and any flit-index-based
 processing downstream.
 ### D4. Path-based next-hop routing
 Routing is **not** a Forwarding-component concern. The Transaction
 arrives with a pre-computed `path` (built by the router; ADR-0002 /
 ADR-0017 D2). The component just looks up its own position in the
 path and forwards to `path[index + 1]`:
 ```python
 def _next_hop_in_path(self, txn):
    my_id = self.node.id
    path = txn.path
    for i, n in enumerate(path):
        if n == my_id and i + 1 < len(path):
            return path[i + 1]
    return None
 ```
 If `next_hop` is found and present in `out_ports`, the flit is
 forwarded. Otherwise (terminal node), `txn.done.succeed()` is
 invoked when the `is_last` flit arrives.
 ### D5. Flit-aware mode with Non-Flit fallback
 `_FLIT_AWARE = True` opts this component out of the base class's
 flit-reassembly logic in `_fan_in`. Flits are placed directly on
 `_inbox` (no reassembly), enabling per-flit handling in the worker
 loop (D2, D3).
 Non-Flit messages — zero-byte control Transactions and other
 non-chunkified payloads — fall through to the base class's legacy
 `_forward_txn` path via `env.process`. This preserves backward
 compatibility for control-plane traffic that does not benefit from
 flit-level processing.
 ### D6. Multi-stream merging at the base class
 Multi-stream FIFO merging at routers is the base class's
 responsibility, not Forwarding's. The base class's `_fan_in` spawns
 one process per `in_port`; all push to a single shared `_inbox`.
 Flits from different upstream streams therefore interleave at
 flit granularity in `_inbox`'s FIFO order.
 The Forwarding worker simply consumes `_inbox` in arrival order —
 correctly modeling per-router multi-flow arbitration as
 fair-FIFO over the shared inbox.
 ### D7. Single implementation under multiple impl names
 A single `TransitComponent` class is registered under four impl names
 in `components.yaml`:
 - `builtin.forwarding` — generic forwarding (e.g., `io_noc`,
  `noc_router`, UCIe conn bridges)
 - `builtin.switch` — tray-level switch
 - `builtin.noc` — cube-level NOC fabric (legacy singleton; current
  NOC routers use `builtin.forwarding`)
 - `builtin.ucie` — UCIe protocol endpoint
 All four aliases instantiate the same class with the same behavior.
 Per-instance differentiation lives only in `attrs.overhead_ns`.
 Separate impl names exist as intent tags for readability and to
 allow future divergence without backward-incompatible config
 changes.
 ### D8. Configurable `overhead_ns`
 A single attribute drives per-instance latency:
 | Usage site | impl name | overhead_ns |
 | --- | --- | --- |
 | Tray-level switch | `builtin.switch` | 5.0 |
 | Cube NOC router | `builtin.forwarding` | 2.0 |
 | IO chiplet io_noc | `builtin.forwarding` | 0.0 |
 | UCIe protocol endpoint (`ucie-{N,S,E,W}`) | `builtin.ucie` | 8.0 |
 | UCIe conn bridge (`ucie-{PORT}.conn{N}`) | `builtin.forwarding` | 0.0 |
 Default is 0.0. The attribute is read at each `run()` invocation, so
 dynamic reconfiguration is possible but not currently used.
 ## Consequences
 ### Positive
 - A single class handles all transit-node roles in the simulation
  graph — minimal code surface for a high-population component type.
 - Flit-aware processing + serial worker preserves wormhole semantics
  across multi-hop paths without per-flit process overhead.
 - `overhead_ns` is the only per-instance tunable; routing, BW, and
  address resolution stay cleanly separated in their own components /
  modules.
 - Multi-stream merging emerges from the base-class structure; no
  router-specific logic duplicates fair-FIFO arbitration.
 - Non-Flit fallback path keeps control-plane traffic working without
  forcing every message into the flit framework.
 ### Negative
 - The single class hides usage-site intent inside `attrs.overhead_ns`
  configuration; readers must consult `topology.yaml` +
  `components.yaml` to see which impl name maps to which behavior
  class.
 - Per-flit serial worker is a bottleneck if `overhead_ns` is large
  and many concurrent transactions arrive at the same router; current
  values (0–8 ns) make this negligible.
 ## Links
 - ADR-0002 (Routing distance — path computation)
 - ADR-0015 D1 (Component port model)
 - ADR-0015 D2 (Wire process — BW + propagation, separate from this
  component)
 - ADR-0015 D6 (Transit cube forwarding pattern)
 - ADR-0016 D1 (IO chiplet io_noc — uses this component)
 - ADR-0017 D1 (Cube NOC routers — use this component)
 - ADR-0017 D6 (UCIe decomposition — `ucie-{PORT}` instances use this
  component)
 - ADR-0033 D1 (Flit-aware pass-through, first-flit overhead,
  multi-stream merge semantics)
@@ -18,7 +18,7 @@ We define stable, minimal message schemas for Host ↔ IO_CPU so that:
 - IO_CPU-internal fan-out/aggregation can evolve independently,
 - completion and failure propagation is deterministic.
-We also require PE-tagging (A 방식): each shard explicitly carries (sip,cube,pe)
+We also require PE-tagging (Scheme A): each shard explicitly carries (sip,cube,pe)
 so IO_CPU can deterministically route/fan-out without relying on PA decoding.
 ---
@@ -93,7 +93,7 @@ Rules:
 Mandatory fields:
 - common envelope fields (D3)
- destination placement tags (A 방식):
+- destination placement tags (Scheme A):
  - `dst_sip: int`
  - `dst_cube: int`
  - `dst_pe: int`
@@ -130,7 +130,7 @@ Notes:
 Mandatory fields:
 - common envelope fields (D3)
- source placement tags (A 방식):
+- source placement tags (Scheme A):
  - `src_sip: int`
  - `src_cube: int`
  - `src_pe: int`
@@ -183,7 +183,7 @@ Tensor arg (mandatory):
 - `shards: list[TensorShard]`
-`TensorShard` MUST have (A 방식 강제):
+`TensorShard` MUST have (Scheme A enforced):
 - `sip: int`
 - `cube: int`
@@ -1,519 +0,0 @@
 # ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
 ## Status
 Accepted
 ## Context
 The current simulation models **timing only**.
 `tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
 but do not actually read tensor data or perform computations.
 ### Required Capabilities
 1. Must be able to store and read actual data in HBM/TCM/SRAM
 2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
 3. Must minimize simulation performance degradation
 ### Constraints
 - SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
 - Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
 - Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
 - Kernel functions must remain plain Python functions (no generator/async transformation)
 ### Design Exploration Results
 | Option | Approach | Verdict |
 |--------|----------|---------|
 | Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
 | SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
 | Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
 | **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |
 ---
 ## Decision
 ### D1. 2-Pass Execution Model — Phase 0 Elimination
 The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.
 Before:
 ```
 Phase 0: Kernel → PeCommand list (no data, no branching)
 Phase 1: Replay PeCommand list via SimPy (timing only)
 ```
 After:
 ```
 Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
  - Memory read/write: SimPy timing + MemoryStore actual data
  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
  - Dynamic control flow possible (tl.load returns actual data)
 Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
 ```
 This ADR **extends Phase 1 to be data-aware for memory operations only**.
 Phase 1 handles latency/BW bottleneck analysis plus memory data tracking;
 Phase 2 handles GEMM/Math computation correctness verification.
 Phase 2 is optional — if only timing is needed, run Phase 1 alone.
 ### D2. Op Log Recording — ComponentBase Hook
 Op log recording is performed as a **hook in the component base class**.
 Individual component implementations are not modified.
 ```python
 class ComponentBase:
    def _on_process_start(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_start(env.now, self.node.id, msg)
    def _on_process_end(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_end(env.now, self.node.id, msg)
 ```
 Hooks are called before and after `run()` within `_forward_txn()`.
 `_op_logger` is optional; when it is absent, each hook reduces to a single falsy check on `_op_logger`.
 **Hook timing definitions**:
 | Timing | Meaning |
 |--------|---------|
 | `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
 | `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |
 Link traversal latency is not included in t_start/t_end.
 Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
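Under these definitions, link latency falls out of two adjacent op-log entries. The sketch below assumes a hypothetical record shape (`OpRecord` with `node_id`, `t_start`, `t_end`); the actual op-log format is not specified in this section, and the node names are made up.

```python
from dataclasses import dataclass


@dataclass
class OpRecord:
    node_id: str
    t_start: float  # component begins servicing the message (ns)
    t_end: float    # component's internal service completes (ns)


def link_latency_ns(sender: OpRecord, receiver: OpRecord) -> float:
    """Gap between the sender finishing and the receiver starting service."""
    return receiver.t_start - sender.t_end


# Hypothetical records for one message crossing a single link.
sender = OpRecord("node_a", t_start=100.0, t_end=104.0)
receiver = OpRecord("node_b", t_start=107.0, t_end=115.0)

gap = link_latency_ns(sender, receiver)
service_a = sender.t_end - sender.t_start  # per-component service time
```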
 ### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
 The existing Phase 0 (kernel → PeCommand list) is eliminated,
 and **greenlet** is used to cooperatively interleave kernel and SimPy execution.
 #### Operating Principle
 greenlet is a C extension that provides cooperative context switching.
 When the kernel (child greenlet) calls `tl.load()` or a similar primitive, control switches to the SimPy loop (parent greenlet),
 which performs the timing simulation and then resumes the kernel with the actual data.
 ```
 SimPy loop (parent greenlet)           Kernel (child greenlet)
 ─────────────────────────              ──────────────────────
 g.switch() ─────────────────────────→ Kernel starts
                                       a = tl.load(ptr, ...)
                                         internal: parent.switch(DmaReadCmd)
 cmd = DmaReadCmd ←──────────────────  (kernel paused)
  yield DmaReadMsg(...)
  yield env.timeout(dma_latency)
  data = memory_store.read(...)
 g.switch(data) ─────────────────────→ (kernel resumed)
                                       a = data  ← actual numpy array
                                       if a[0][0] > 0.5:  ← branching possible
                                         ...
 ```
 The kernel is maintained as a **plain Python function**.
 greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
 #### KernelRunner — Framework Layer
 The greenlet loop resides not in the PE_CPU component but in the framework layer,
 **KernelRunner**.
 ```python
 from greenlet import greenlet
 # KernelRunner (framework — greenlet ↔ SimPy bridge)
 class KernelRunner:
    def run(self, env, kernel_fn, args, store):
        g = greenlet(self._run_kernel)
        cmd = g.switch(kernel_fn, args)
        while cmd is not None:
            if isinstance(cmd, DmaReadCmd):
                yield from self._dispatch_dma(env, cmd)
                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
                cmd = g.switch(data)            # resume with actual data
            elif isinstance(cmd, GemmCmd):
                yield from self._dispatch_gemm(env, cmd)
                cmd = g.switch()                # resume (no data)
            elif isinstance(cmd, DmaWriteCmd):
                store.write(cmd.dst_addr, cmd.data)  # visibility = issue time
                yield from self._dispatch_dma(env, cmd)  # timing only
                cmd = g.switch()
            else:
                raise TypeError(f"unsupported command: {cmd!r}")  # otherwise the loop would spin forever
 # PE_CPU (component — kept simple, unaware of greenlet)
 def _execute_kernel(self, env):
    runner = KernelRunner(self.ctx)
    yield from runner.run(env, kernel_fn, args, store)
 ```
 **Op logging single source of truth**: KernelRunner does not record directly to op_log.
 All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
 When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
 the component base class hooks automatically record them.
 **Layer separation**:
 - **Kernel code**: plain function, unaware of greenlet
 - **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
 - **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
 - **ComponentBase hook**: the sole path for op_log recording
 - **PE_CPU**: only calls KernelRunner, replaceable as a component
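The command/resume protocol between the kernel and the runner can be illustrated with the standard library alone. The sketch below deliberately swaps greenlet for a Python generator so that it runs without extra dependencies. This is only a stand-in: the ADR requires kernels to stay plain functions, which is exactly what greenlet provides and a generator does not. `ReadCmd`, `WriteCmd`, and the dict-backed store are hypothetical simplifications, and no SimPy timing is modeled.

```python
from dataclasses import dataclass


@dataclass
class ReadCmd:
    addr: int


@dataclass
class WriteCmd:
    addr: int
    data: list


def kernel():
    # Each `yield` plays the role of parent.switch(cmd) inside tl.load()/tl.store().
    a = yield ReadCmd(addr=0)
    if a[0] > 0.5:  # branch on data that was actually read
        yield WriteCmd(addr=64, data=[a[0] * 2])


def run(kernel_fn, store):
    """Drive the kernel: receive a command, service it, resume the kernel."""
    trace = []
    k = kernel_fn()
    try:
        cmd = next(k)  # corresponds to the first g.switch(...)
        while True:
            trace.append(type(cmd).__name__)
            if isinstance(cmd, ReadCmd):
                cmd = k.send(store[cmd.addr])  # resume with actual data
            elif isinstance(cmd, WriteCmd):
                store[cmd.addr] = cmd.data
                cmd = k.send(None)  # resume (no data)
            else:
                raise TypeError(f"unsupported command: {cmd!r}")
    except StopIteration:
        pass  # kernel returned
    return trace


store = {0: [0.75]}
trace = run(kernel, store)
print(trace, store[64])
```

With greenlet, the `yield` disappears from the kernel and moves inside `tl.load()` as `parent.switch(cmd)`. The runner loop keeps the same shape.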
 #### Handling Differences Between Memory Read/Write and Compute
 | Operation | In Phase 1 | In Phase 2 |
 |-----------|-----------|-----------|
 | `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
 | `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
 | `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
 | `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |
 Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
 GEMM/Math operations are batch-executed in Phase 2 (performance separation).
 #### Store Visibility Rule
 `tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
 SimPy DMA timing is simulated separately afterward.
 This is an intentional separation of timing and visibility:
 - **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
 - **timing**: the point at which DMA latency completes in SimPy
 This separation allows a load immediately after a store to see the latest data in dynamic control flow.
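The split between visibility and timing can be sketched as follows. `TinyStore`, `issue_store`, and the `pending` list are hypothetical stand-ins for MemoryStore and the SimPy DMA process, and the latency value is made up for illustration.

```python
class TinyStore:
    """Hypothetical dict-backed stand-in for MemoryStore."""

    def __init__(self):
        self._data = {}

    def write(self, addr, value):
        self._data[addr] = value

    def read(self, addr):
        return self._data[addr]


def issue_store(store, pending, now, addr, value, dma_latency):
    store.write(addr, value)                   # visibility: reflected at issue time
    pending.append((now + dma_latency, addr))  # timing: the DMA completes later


store = TinyStore()
pending = []
issue_store(store, pending, now=10.0, addr=0x100, value=[1, 2, 3], dma_latency=40.0)

seen = store.read(0x100)       # a load issued right after the store sees the new data
completes_at = pending[0][0]   # while the modeled DMA is still in flight
print(seen, completes_at)
```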
 #### Result Handle Semantics
 `tl.composite()` (sync/async) returns a **handle** referencing the result tensor.
 The key contract in Phase 1:
 1. **All compute handles are always considered pending in Phase 1.**
 2. `tl.wait(handle)` **expresses timing synchronization only**
   and does not make the handle ready.
 3. Accessing the handle's actual result data (`handle.data`, element access,
   numpy conversion, etc.) is **only possible in Phase 2**.
 4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
 5. In contrast, `tl.load()` returns actual data in Phase 1, so
   **memory-read-based control flow is supported**.
 | Handle state | Phase | Allowed operations |
 |------------|-------|----------|
 | pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
 | pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
 | pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
 | ready | Phase 2 | Actual numpy data access, verification |
 This restriction is intentional. If computations were executed in Phase 1,
 the SimPy single-thread would block, defeating the purpose of 2-pass separation.
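One way the pending/ready contract could be enforced is a handle that raises on any data access until Phase 2 fills it in. This is a sketch under assumptions: `ComputeHandle`, `HandleNotReadyError`, and `materialize()` are illustrative names, not the framework's API.

```python
class HandleNotReadyError(RuntimeError):
    """Raised when Phase 1 code touches a compute result."""


class ComputeHandle:
    def __init__(self, op_index: int):
        self.op_index = op_index  # which op record produces the result
        self._ready = False       # stays False for all of Phase 1
        self._data = None

    def materialize(self, data) -> None:
        # Called by the Phase 2 executor once the result exists.
        self._data = data
        self._ready = True

    @property
    def data(self):
        if not self._ready:
            raise HandleNotReadyError(
                f"op {self.op_index}: result data is only available in Phase 2"
            )
        return self._data

    def __getitem__(self, index):
        return self.data[index]   # element access goes through the same check

    def __bool__(self):
        return bool(self.data)    # so `if handle:` fails loudly while pending


handle = ComputeHandle(op_index=3)
try:
    handle[0]
except HandleNotReadyError as exc:
    print("Phase 1:", exc)

handle.materialize([1.0, 2.0])
print("Phase 2:", handle[0])
```

Failing loudly on truth-value evaluation keeps a kernel from silently taking a branch based on a value that does not exist yet.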
 #### Phase 1 Materialization — Future Extension
 If Phase 1 eager execution becomes necessary for small operations
 (scalar, small reduction) in the future, selective materialization can be supported
 by adding a `materialized_in_phase1: bool` flag to the op record.
 This is not implemented in the current scope.
 ### D4. data_op Flag — Message Self-Declaration
 The logging target is determined by the `data_op` attribute on the message instance,
 not by message type. The framework does not hardcode message types.
 ```python
 class MsgBase:
    data_op: bool = False       # default: no logging
 class DmaReadCmd(MsgBase):
    data_op = True              # memory transfer → logging
 class GemmCmd(MsgBase):
    data_op = True              # compute → logging
 class MathCmd(MsgBase):
    data_op = True              # compute → logging
 ```
 When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
 enables automatic logging without modifying framework code.
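A minimal sketch of how a hook might consult the flag without knowing any message types. `should_log` and `CreditMsg` are hypothetical names introduced for illustration.

```python
class MsgBase:
    data_op: bool = False       # default: no logging


class IpcqMsg(MsgBase):
    data_op = True              # new message type opts in by declaration


class CreditMsg(MsgBase):       # hypothetical control message, keeps the default
    pass


def should_log(msg) -> bool:
    # The check reads the attribute only; it never names a concrete message class.
    return bool(getattr(msg, "data_op", False))


logged = [type(m).__name__ for m in (IpcqMsg(), CreditMsg()) if should_log(m)]
print(logged)
```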
 ### D5. Op Log Structure
 #### Op Classification Scheme
 A two-level classification is used:
 | Level | Field | Role |
 |-------|-------|------|
 | `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
 | `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |
 #### OpRecord Definition
 ```python
 @dataclass
 class OpRecord:
    t_start: float              # SimPy time (ns) — service start
    t_end: float                # SimPy time (ns) — service completion
    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
    op_kind: str                # "memory" | "gemm" | "math"
    op_name: str                # specific operation name
    params: dict                # per-operation parameters (see below)
    dependency_ids: list[int]   # currently based on in-memory record index, may be replaced with stable op_id in the future
 ```
 #### dependency_ids Generation Rules
 `dependency_ids` is **optional**, and by default the executor performs
 address-based dependency inference (see D6).
 Explicit setting is needed only when ordering cannot be inferred from addresses:
 - **Default (address-based inference)**: the executor analyzes read/write sets to
  automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
 - **Explicit setting**: set at the TLContext or command generation stage when
  logical dependencies cannot be expressed via addresses.
  Example: completion handle-based synchronization — handle dependencies depend on
  logical completion order rather than memory addresses, so they cannot be captured
  by address inference.
 #### op_log Ordering
 The op_log maintains **stable ordering** based on `t_start`.
 Records with the same `t_start` preserve insertion order.
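The ordering rule matches what a stable sort gives. In the sketch below the records are reduced to hypothetical `(name, t_start)` tuples. Python's `sorted` is guaranteed stable, so ties on `t_start` keep their insertion order.

```python
# (name, t_start) in insertion order; two pairs share a t_start.
records = [("gemm", 20.0), ("dma_a", 10.0), ("dma_b", 10.0), ("math", 20.0)]

# Stable sort: equal t_start values keep their relative insertion order.
ordered = sorted(records, key=lambda r: r[1])
names = [name for name, _ in ordered]
print(names)
```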
 #### params Details
 **memory (dma_read / dma_write)**:
 ```python
 {
    "src_addr": int,            # source address (byte)
    "dst_addr": int,            # destination address (byte)
    "nbytes": int,              # transfer size
    "src_space": str,           # "hbm" | "tcm" | "sram"
    "dst_space": str,           # "hbm" | "tcm" | "sram"
 }
 ```
 **gemm**:
 ```python
 {
    "src_a_addr": int,          # operand A address
    "src_b_addr": int,          # operand B address
    "dst_addr": int,            # output address
    "shape_a": tuple,           # e.g. (128, 256)
    "shape_b": tuple,           # e.g. (256, 128)
    "shape_out": tuple,         # e.g. (128, 128)
    "dtype_in": str,            # e.g. "f16"
    "dtype_acc": str,           # accumulation dtype, e.g. "f32"
    "dtype_out": str,           # output dtype, e.g. "f16"
    "transpose_a": bool,
    "transpose_b": bool,
    "layout_a": str,            # "row_major" | "col_major"
    "layout_b": str,
    "layout_out": str,
    "addr_space": str,          # "tcm" (GEMM operands are always in TCM)
 }
 ```
 **math**:
 ```python
 {
    "op": str,                  # "exp" | "add" | "sum" | "where" | ...
    "input_addrs": list[int],   # list of operand addresses
    "input_shapes": list[tuple],
    "dst_addr": int,
    "shape_out": tuple,
    "dtype": str,
    "axis": int | None,         # reduction axis
    "addr_space": str,          # "tcm"
 }
 ```
 ### D6. Phase 2 Executor
 Phase 2 executes the op_log outside of SimPy.
 ```python
 from itertools import groupby
 class DataExecutor:
    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
        self.op_log = op_log        # ordered by t_start (see D5), as groupby requires
        self.store = initial_store  # Takes the Phase 1 MemoryStore snapshot as input
    def run(self):
        for t, ops in groupby(self.op_log, key=lambda o: o.t_start):
            batch = list(ops)
            independent, sequential = self._classify(batch)
            self._execute_parallel(independent)
            self._execute_sequential(sequential)
 ```
 **Parallel execution determination**:
 Ops with the same `t_start` are considered **parallel candidates**.
 The executor determines actual parallel execution based on the following criteria:
 - Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
 - Whether predecessor ops specified in `dependency_ids` have completed
 Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
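The address-range check can be sketched as below. Ranges are assumed to be half-open `(start, end)` byte intervals within a single address space, and `Access` and `hazards` are illustrative names rather than the executor's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class Access:
    """Read and write sets of one op, as lists of (start, end) byte ranges."""
    reads: list = field(default_factory=list)
    writes: list = field(default_factory=list)


def overlaps(a, b) -> bool:
    # Half-open intervals overlap when each starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]


def any_overlap(xs, ys) -> bool:
    return any(overlaps(x, y) for x in xs for y in ys)


def hazards(first: Access, second: Access) -> set:
    """Hazards that force `second` to run after `first`."""
    found = set()
    if any_overlap(first.writes, second.reads):
        found.add("RAW")
    if any_overlap(first.writes, second.writes):
        found.add("WAW")
    if any_overlap(first.reads, second.writes):
        found.add("WAR")
    return found


producer = Access(reads=[(0x0000, 0x0800)], writes=[(0x1000, 0x1800)])
consumer = Access(reads=[(0x1400, 0x1600)], writes=[(0x2000, 0x2400)])
unrelated = Access(reads=[(0x3000, 0x3400)], writes=[(0x4000, 0x4400)])

print(hazards(producer, consumer))   # consumer reads what producer wrote
print(hazards(producer, unrelated))  # no shared ranges
```

An empty result means the pair is eligible for parallel execution, provided `dependency_ids` adds no explicit ordering between them.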
 **Batch optimization**: Only independent ops with the same op_name **and identical
 shape, dtype, layout, and transpose flags** are eligible for batching.
 Example: identical-shape GEMMs from multiple PEs are bundled into a single `np.matmul(a_batch, b_batch)` call.
 Batching improves BLAS efficiency on CPU and reduces launch overhead on GPU.
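Finding batchable ops amounts to grouping by a signature tuple. The sketch below assumes ops are plain dicts carrying the gemm `params` fields from D5. `batch_key` and `group_for_batching` are hypothetical helpers, and the ops passed in are assumed to be already classified as independent.

```python
from collections import defaultdict


def batch_key(op):
    """Signature that must match exactly for two GEMMs to share one batched call."""
    p = op["params"]
    return (
        op["op_name"],
        p["shape_a"], p["shape_b"],
        p["dtype_in"],
        p["layout_a"], p["layout_b"],
        p["transpose_a"], p["transpose_b"],
    )


def group_for_batching(ops):
    """Return lists of op indices; each list can be executed as one batch."""
    groups = defaultdict(list)
    for index, op in enumerate(ops):
        groups[batch_key(op)].append(index)
    return list(groups.values())


def gemm(shape_a, shape_b):
    return {
        "op_name": "gemm_f16",
        "params": {
            "shape_a": shape_a, "shape_b": shape_b, "dtype_in": "f16",
            "layout_a": "row_major", "layout_b": "row_major",
            "transpose_a": False, "transpose_b": False,
        },
    }


ops = [gemm((128, 256), (256, 128)), gemm((64, 64), (64, 64)), gemm((128, 256), (256, 128))]
batches = group_for_batching(ops)
print(batches)
```

Operands of a group with more than one member could then be stacked along a leading axis, since `np.matmul` treats leading dimensions as batch dimensions.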
 **Phase 2 execution order guarantee**:
 Phase 2 does not consider data arrival timing,
 and guarantees execution order solely through
 dependencies (address-based inference + explicit dependency_ids).
 ### D7. Memory Store
 `MemoryStore` logically follows byte-addressable semantics,
 and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).
 ```python
 class MemoryStore:
    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
 ```
 **Internal storage format: numpy ndarray**
 MemoryStore stores tensors as **numpy ndarrays**.
 | Candidate | store/load speed | Phase 2 compute | Verdict |
 |-----------|-----------------|-----------------|---------|
 | **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
 | bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
 | torch tensor | Immediate | torch operations available | Use only for GPU optimization |
 - write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
 - read: **returns numpy array by reference** (no copy)
 - Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
 - dtype uses numpy native types (`np.float16`, `np.float32`, etc.); bf16 is not a built-in NumPy dtype and needs an extension dtype such as `ml_dtypes.bfloat16`
 - For byte-level access, convert via `.view(np.uint8)`
 - For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
 **read/write contract**:
 - read/write operates on a **contiguous tensor** basis.
  If non-contiguous stride views are needed, express them as separate copy ops.
 - In the normal benchmark path, producer/consumer dtype match is expected.
  Reinterpret cast is a permissive behavior for low-level memory validation
  or special test cases.
 - addr is a byte address, aligned to at least the dtype size.
 - dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
  Shape mismatch is verified based on nbytes, and raises an error on mismatch.
 - Correctness criteria follow address-range-based read/write semantics.
 - A tensor object cache may be used as an implementation optimization,
  but the canonical state is byte-addressable storage.
 - At deploy time, the host injects initial tensor data.
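The nbytes and alignment rules in the contract can be sketched without numpy. The `ITEMSIZE` table and `check_read` are illustrative assumptions. An actual store would presumably take item sizes from the numpy dtype.

```python
import math

# Bytes per element for the dtype strings used in the op params (assumed table).
ITEMSIZE = {"f16": 2, "bf16": 2, "f32": 4, "i8": 1, "i32": 4}


def tensor_nbytes(shape, dtype) -> int:
    return math.prod(shape) * ITEMSIZE[dtype]


def check_read(addr, stored_shape, stored_dtype, shape, dtype) -> bool:
    """Validate a read; return True when it is a reinterpret cast."""
    if addr % ITEMSIZE[dtype] != 0:
        raise ValueError(f"addr {addr:#x} is not aligned to {dtype}")
    have = tensor_nbytes(stored_shape, stored_dtype)
    want = tensor_nbytes(shape, dtype)
    if have != want:
        raise ValueError(f"size mismatch: stored {have} bytes, requested {want} bytes")
    return dtype != stored_dtype


same = check_read(0x40, (4, 4), "f32", (4, 4), "f32")
cast = check_read(0x40, (128, 256), "f16", (128, 128), "f32")
print(same, cast)
```

The second call reads a 128x256 f16 tensor as 128x128 f32: the byte counts agree (65536 each), so it passes as a reinterpret cast.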
 ### D8. Benchmark Kernel Code
 The benchmark's **user code API is not changed**.
 The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.
 However, internal command/message schemas may be extended to include metadata
 required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
 ### D9. No Component Changes
 Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
 Op log recording is the responsibility of the ComponentBase hook.
 When custom components are replaced, only the timing model changes,
 and Phase 2 data execution is unaffected.
 ### D10. Phase 2 is Optional
 ```python
 engine = GraphEngine(graph)
 engine.run(benchmark)                       # Phase 1: timing only
 result = engine.get_timing_result()
 if verify_data:
    executor = DataExecutor(engine.op_log, engine.memory_store)  # Phase 2: data (`engine.memory_store` is an assumed accessor for the Phase 1 MemoryStore)
    executor.run()
    executor.verify(expected_output)
 ```
 If only timing analysis is needed, Phase 2 is skipped.
 If the op_logger is deactivated, Phase 1 performance is identical to the original.
 ### D11. Verification Contract
 Basic verification **compares the final output tensor** against a reference backend (numpy).
 Per-dtype tolerance policy:
 | dtype | Comparison method | Tolerance |
 |-------|----------|-----------|
 | f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
 | f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
 | bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
 | int types | `np.array_equal` | exact |
 - Default mode: compare final output only (end-to-end correctness)
 - Debug mode: can compare intermediate tensors on a per-op basis
  (MemoryStore snapshot at each op boundary)
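The tolerance table can be applied as sketched below, using plain Python lists so the example needs no numpy. The comparison mirrors the documented `np.allclose` criterion, `|actual - expected| <= atol + rtol * |expected|`. `TOLERANCE` and `outputs_match` are hypothetical names.

```python
# (rtol, atol) per floating dtype, taken from the table above.
TOLERANCE = {
    "f32": (1e-5, 1e-5),
    "f16": (1e-3, 1e-3),
    "bf16": (1e-2, 1e-2),
}


def outputs_match(actual, expected, dtype) -> bool:
    """Compare flat sequences with the per-dtype policy; ints must match exactly."""
    if len(actual) != len(expected):
        return False
    if dtype not in TOLERANCE:
        return list(actual) == list(expected)  # integer types: exact comparison
    rtol, atol = TOLERANCE[dtype]
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


ok_f16 = outputs_match([1.0005], [1.0], "f16")   # within the f16 tolerance
ok_f32 = outputs_match([1.0005], [1.0], "f32")   # same error is too large for f32
ok_int = outputs_match([1, 2, 3], [1, 2, 3], "i32")
print(ok_f16, ok_f32, ok_int)
```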
 ---
 ## Non-goals
 - **Compute-result-based control flow**: not supported.
  All compute handles are in pending state during Phase 1,
  `wait()` expresses timing synchronization only and does not imply data readiness.
  Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
  is **treated as an error**.
  Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
  Phase 1 materialization is a future extension (see D3).
 - **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
  the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
 - **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
  and do not reproduce the actual hardware PE microarchitecture.
 ## Open Questions
 - **Aliasing / slice view**: How to represent slice/views referencing the same
  backing storage in MemoryStore (stride-based view vs copy semantics)
 - **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
  communication as memory ops or introduce a separate op_kind
 - **Op log streaming**: Managing op_log memory usage in large-scale simulations
  (in-memory list vs disk-backed streaming)
 - **Fused operation**: Whether to record tl.composite's tiled pipeline
  (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
 - **Math op schema generalization**: The current math params have a simple structure,
  but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
  scalar/immediate operands, where/mask expressions, etc.
 - **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
  replacement with stable op_id is needed when introducing streaming/disk-backed mode
 - **Phase 1 materialization policy**: See Future Extension in D3.
  If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
  needs to be defined
 ---
 ## Consequences
 ### Positive
 - Minimal impact on SimPy simulation performance (only op_log append added)
 - Free to use multi-threading/GPU in Phase 2
 - Component replaceability preserved (ADR-0015 design philosophy maintained)
 - No changes needed to benchmark user code API
 - When adding new message types, only set the data_op flag
 - Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
 - `tl.load()` returns actual data, making kernel debugging easier
 ### Negative
 - op_log memory usage (for large-scale simulations)
 - Phase 2 execution time is proportional to tensor size (large GEMM)
 - Dynamic branching based on pending handles (incomplete computations) not possible
  (computations execute in Phase 2, result values are undetermined in Phase 1).
  Memory-data-based branching is supported via greenlet.
 - greenlet C extension dependency added (pip install greenlet)
@@ -1,4 +1,4 @@
-# ADR-0020: 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리)
+# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
 ## Status
@@ -6,65 +6,65 @@ Accepted
 ## Context
-현재 시뮬레이션은 **타이밍만** 모델링한다.
+The current simulation models **timing only**.
-`tl.load()`, `tl.composite(op="gemm")` 등은 SimPy latency를 생성하지만,
+`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
-실제 텐서 데이터를 읽거나 연산하지 않는다.
+but do not actually read tensor data or perform computations.
-### 필요한 기능
+### Required Capabilities
-1. HBM/TCM/SRAM에 실제 데이터를 저장하고 읽을 수 있어야 한다
+1. Must be able to store and read actual data in HBM/TCM/SRAM
-2. PE_GEMM, PE_MATH가 실제 행렬 연산을 수행하고 결과를 검증할 수 있어야 한다
+2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
-3. 시뮬레이션 성능 저하를 최소화해야 한다
+3. Must minimize simulation performance degradation
-### 제약 조건
+### Constraints
- SimPy는 single-thread 이벤트 루프 — numpy matmul을 안에서 하면 전체가 block
+- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
- 컴포넌트는 교체 가능해야 한다 (ADR-0015) — 프레임워크 요구사항이 구현에 침투하면 안 됨
+- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
- 벤치마크 커널은 명령형 코드(tl.load → tl.composite → tl.wait) — 같은 코드를 재사용해야 함
+- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
- 커널 함수는 plain Python function으로 유지해야 한다 (generator/async 변환 불가)
+- Kernel functions must remain plain Python functions (no generator/async transformation)
-### 설계 탐색 결과
+### Design Exploration Results
-| Option | 방식 | 판정 |
+| Option | Approach | Verdict |
-|--------|------|------|
+|--------|----------|---------|
-| SimPy 내 직접 실행 | GEMM을 SimPy 안에서 numpy 호출 | 탈락: single-thread block |
+| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
-| SimPy + ThreadPool | future.submit → timeout → result() | 탈락: back-to-back 요청 시 result()에서 block |
+| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
-| Symbolic + lazy | 메타데이터만 추적, 나중에 실행 | 탈락: control-flow dependent 읽기 처리 곤란 |
+| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
-| **2-pass (채택)** | Phase 1: 타이밍, Phase 2: 데이터 | 완전 분리, 성능 영향 없음 |
+| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |
 ---
 ## Decision
-### D1. 2-Pass 실행 모델 — Phase 0 제거
+### D1. 2-Pass Execution Model — Phase 0 Elimination
-기존의 3단계(Phase 0 → Phase 1 → Phase 2)를 **2단계로 통합**한다.
+The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.
-기존:
+Before:
 ```
-Phase 0: 커널 → PeCommand 리스트 (데이터 없음, 분기 불가)
+Phase 0: Kernel → PeCommand list (no data, no branching)
-Phase 1: PeCommand 리스트를 SimPy replay (타이밍만)
+Phase 1: Replay PeCommand list via SimPy (timing only)
 ```
-변경:
+After:
 ```
-Phase 1 (타이밍): 커널 + SimPy 통합 실행 — greenlet 기반
+Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
-  - 메모리 읽기/쓰기: SimPy 타이밍 + MemoryStore 실제 데이터
+  - Memory read/write: SimPy timing + MemoryStore actual data
-  - 연산 (GEMM/Math): SimPy 타이밍 + op_log 기록 (실제 연산은 Phase 2)
+  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
-  - dynamic control flow 가능 (tl.load가 실제 데이터 반환)
+  - Dynamic control flow possible (tl.load returns actual data)
-Phase 2 (데이터): op_log 기반 실제 연산 실행 — SimPy 외부, 병렬 가능
+Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
 ```
-본 ADR은 **메모리 연산에 한해 Phase 1을 data-aware로 확장**한다.
+This ADR **extends Phase 1 to be data-aware for memory operations only**.
-Phase 1은 latency/BW 병목 분석 + 메모리 데이터 추적,
+Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
-Phase 2는 GEMM/Math 연산 정합성 검증.
+Phase 2 handles GEMM/Math computation correctness verification.
-Phase 2는 optional — 타이밍만 필요하면 Phase 1만 실행.
+Phase 2 is optional — if only timing is needed, run Phase 1 alone.
-### D2. Op Log 기록 — ComponentBase hook
+### D2. Op Log Recording — ComponentBase Hook
-op_log 기록은 **컴포넌트 베이스 클래스의 hook**으로 수행한다.
+Op log recording is performed as a **hook in the component base class**.
-개별 컴포넌트 구현을 수정하지 않는다.
+Individual component implementations are not modified.
 ```python
 class ComponentBase:
@@ -77,56 +77,56 @@ class ComponentBase:
            self._op_logger.record_end(env.now, self.node.id, msg)
 ```
-`_forward_txn()` 에서 `run()` 전후로 hook을 호출한다.
+Hooks are called before and after `run()` within `_forward_txn()`.
-`_op_logger`는 optional — 없으면 오버헤드 제로.
+`_op_logger` is optional — zero overhead when absent.
-**hook 시점 정의**:
+**Hook timing definitions**:
-| 시점 | 의미 |
+| Timing | Meaning |
-|------|------|
+|--------|---------|
-| `t_start` | 컴포넌트가 해당 msg의 **service를 시작**한 시점 (`run()` 진입 직전) |
+| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
-| `t_end` | 컴포넌트의 **내부 service가 완료**된 시점 (`run()` 반환 직후) |
+| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |
-link traversal latency는 t_start/t_end에 포함되지 않는다.
+Link traversal latency is not included in t_start/t_end.
-link latency는 발신 컴포넌트의 t_end와 수신 컴포넌트의 t_start 차이로 관측된다.
+Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
-### D3. Greenlet 기반 커널 실행 — Phase 0 제거
+### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
-기존 Phase 0 (커널 → PeCommand 리스트)를 제거하고,
+The existing Phase 0 (kernel → PeCommand list) is eliminated,
-**greenlet**을 사용하여 커널과 SimPy를 협력적으로 interleave 실행한다.
+and **greenlet** is used to cooperatively interleave kernel and SimPy execution.
-#### 동작 원리
+#### Operating Principle
-greenlet은 협력적 context switch를 제공하는 C 확장이다.
+greenlet is a C extension that provides cooperative context switching.
-커널(child greenlet)이 `tl.load()` 등을 호출하면 SimPy 루프(parent greenlet)로
+When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
-switch하여 타이밍 시뮬레이션을 수행하고, 완료 후 실제 데이터와 함께 커널로 돌아온다.
+to perform timing simulation, and after completion, returns to the kernel with actual data.
 ```
-SimPy 루프 (parent greenlet)          커널 (child greenlet)
+SimPy loop (parent greenlet)           Kernel (child greenlet)
 ─────────────────────────              ──────────────────────
-g.switch() ─────────────────────────→ 커널 시작
+g.switch() ─────────────────────────→ Kernel starts
                                       a = tl.load(ptr, ...)
-                                         내부: parent.switch(DmaReadCmd)
+                                         internal: parent.switch(DmaReadCmd)
-cmd = DmaReadCmd ←──────────────────  (커널 일시정지)
+cmd = DmaReadCmd ←──────────────────  (kernel paused)
  yield DmaReadMsg(...)
  yield env.timeout(dma_latency)
  data = memory_store.read(...)
-g.switch(data) ─────────────────────→ (커널 재개)
+g.switch(data) ─────────────────────→ (kernel resumed)
-                                       a = data  ← 실제 numpy array
+                                       a = data  ← actual numpy array
-                                       if a[0][0] > 0.5:  ← 분기 가능
+                                       if a[0][0] > 0.5:  ← branching possible
                                         ...
 ```
-커널은 **plain Python function**으로 유지된다.
+The kernel is maintained as a **plain Python function**.
-greenlet switch는 `tl.load()`, `tl.store()` 등의 **내부 구현에만** 존재한다.
+greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
-#### KernelRunner — 프레임워크 레이어
+#### KernelRunner — Framework Layer
-greenlet 루프는 PE_CPU 컴포넌트가 아니라 프레임워크 레이어인
+The greenlet loop resides not in the PE_CPU component but in the framework layer,
-**KernelRunner**에 위치한다.
+**KernelRunner**.
 ```python
-# KernelRunner (프레임워크 — greenlet ↔ SimPy 연결)
+# KernelRunner (framework — greenlet ↔ SimPy bridge)
 class KernelRunner:
    def run(self, env, kernel_fn, args, store):
        g = greenlet(self._run_kernel)
@@ -136,160 +136,162 @@ class KernelRunner:
            if isinstance(cmd, DmaReadCmd):
                yield from self._dispatch_dma(env, cmd)
                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
-                cmd = g.switch(data)            # 실제 데이터와 함께 재개
+                cmd = g.switch(data)            # resume with actual data
            elif isinstance(cmd, GemmCmd):
                yield from self._dispatch_gemm(env, cmd)
-                cmd = g.switch()                # 재개 (데이터 없음)
+                cmd = g.switch()                # resume (no data)
            elif isinstance(cmd, DmaWriteCmd):
-                store.write(cmd.dst_addr, cmd.data)  # visibility = issue 시점
+                store.write(cmd.dst_addr, cmd.data)  # visibility = issue time
-                yield from self._dispatch_dma(env, cmd)  # timing만 반영
+                yield from self._dispatch_dma(env, cmd)  # timing only
                cmd = g.switch()
-# PE_CPU (컴포넌트 — 간단하게 유지, greenlet을 모름)
+# PE_CPU (component — kept simple, unaware of greenlet)
 def _execute_kernel(self, env):
    runner = KernelRunner(self.ctx)
    yield from runner.run(env, kernel_fn, args, store)
 ```
-**Op logging single source of truth**: KernelRunner는 op_log에 직접 기록하지 않는다.
+**Op logging single source of truth**: KernelRunner does not record directly to op_log.
-모든 op logging은 **ComponentBase hook (_on_process_start/end)만** 담당한다.
+All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
-KernelRunner가 `_dispatch_gemm()` 등으로 컴포넌트에 메시지를 전달하면,
+When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
-컴포넌트 베이스 클래스의 hook이 자동으로 기록한다.
+the component base class hooks automatically record them.
-**레이어 분리**:
+**Layer separation**:
- **커널 코드**: plain function, greenlet 존재를 모름
+- **Kernel code**: plain function, unaware of greenlet
- **TLContext**: `tl.load()` 내부에서 `parent.switch(cmd)` 호출
+- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
- **KernelRunner**: greenlet ↔ SimPy 연결, MemoryStore 읽기/쓰기 처리. **logging 안 함**.
+- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
- **ComponentBase hook**: op_log 기록의 유일한 경로
+- **ComponentBase hook**: the sole path for op_log recording
- **PE_CPU**: KernelRunner를 호출만 함, 컴포넌트로서 교체 가능
+- **PE_CPU**: only calls KernelRunner, replaceable as a component
-#### 메모리 읽기/쓰기 vs 연산의 처리 차이
+#### Handling Differences Between Memory Read/Write and Compute
-| 연산 | Phase 1에서 | Phase 2에서 |
+| Operation | In Phase 1 | In Phase 2 |
-|------|------------|------------|
+|-----------|-----------|-----------|
-| `tl.load()` | SimPy 타이밍 + MemoryStore read → **실제 데이터 반환** | — |
+| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
-| `tl.store()` | SimPy 타이밍 + MemoryStore write → **실제 기록** | — |
+| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
-| `tl.composite(gemm)` | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 |
+| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
-| `tl.dot()` / math ops | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 |
+| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |
-메모리 읽기/쓰기는 Phase 1에서 즉시 처리 (numpy slice, 빠름).
+Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
-GEMM/Math 연산은 Phase 2에서 batch 실행 (성능 분리).
+GEMM/Math operations are batch-executed in Phase 2 (performance separation).
 #### Store Visibility Rule
-`tl.store()`는 **issue 시점에 MemoryStore에 즉시 반영**된다 (visibility = issue).
+`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
-SimPy DMA 타이밍은 이후 별도로 시뮬레이션된다.
+SimPy DMA timing is simulated separately afterward.
-이는 timing과 visibility를 의도적으로 분리한 것이다:
+This is an intentional separation of timing and visibility:
- **visibility**: MemoryStore에 반영되는 시점 = `store.write()` 호출 시
+- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
- **timing**: SimPy에서 DMA latency가 완료되는 시점
+- **timing**: the point at which DMA latency completes in SimPy
-이 분리로 dynamic control flow에서 store 직후 load가 최신 데이터를 볼 수 있다.
+This separation allows a load immediately after a store to see the latest data in dynamic control flow.
 #### Result Handle Semantics
-`tl.composite()`(sync/async)는 결과 tensor를 참조하는 **handle**을 반환한다.
+`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.
-Phase 1에서의 핵심 계약:
+The key contract in Phase 1:
-1. **모든 compute handle은 Phase 1에서 항상 pending 상태로 간주한다.**
+1. **All compute handles are always considered pending in Phase 1.**
-2. `tl.wait(handle)`은 **timing synchronization만 표현**하며,
+2. `tl.wait(handle)` **expresses timing synchronization only**
-   handle을 ready로 만들지 않는다.
+   and does not make the handle ready.
-3. handle의 실제 결과 데이터 접근(`handle.data`, element access,
+3. Accessing the handle's actual result data (`handle.data`, element access,
-   numpy conversion 등)은 **Phase 2에서만 가능**하다.
+   numpy conversion, etc.) is **only possible in Phase 2**.
-4. 따라서 Phase 1에서 **compute-result 기반 control flow는 지원하지 않는다.**
+4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
-5. 반면 `tl.load()`는 Phase 1에서 실제 데이터를 반환하므로,
+5. In contrast, `tl.load()` returns actual data in Phase 1, so
-   **memory-read 기반 control flow는 지원 가능**하다.
+   **memory-read-based control flow is supported**.
-| handle 상태 | Phase | 허용 동작 |
+| Handle state | Phase | Allowed operations |
 |------------|-------|----------|
-| pending | Phase 1 | `tl.wait(handle)` — timing 동기화만 |
+| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
-| pending | Phase 1 | handle을 `tl.store()`의 대상으로 전달 (logical destination 연결만, payload는 Phase 2) |
+| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
-| pending | Phase 1 | **데이터 접근 불가** — 값 기반 분기 불가 |
+| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
-| ready | Phase 2 | 실제 numpy 데이터 접근, 검증 |
+| ready | Phase 2 | Actual numpy data access, verification |
-이 제약은 의도적이다. Phase 1에서 연산을 실행하면 SimPy single-thread가
+This restriction is intentional. If computations were executed in Phase 1,
-block되어 2-pass 분리의 존재 이유가 사라진다.
+the SimPy single-thread would block, defeating the purpose of 2-pass separation.
 #### Phase 1 Materialization — Future Extension
-향후 소형 연산(scalar, 작은 reduction)에 대해 Phase 1 eager execution이
+If Phase 1 eager execution becomes necessary for small operations
-필요한 경우, `materialized_in_phase1: bool` 플래그를 op record에 추가하여
+(scalar, small reduction) in the future, selective materialization can be supported
-선택적 materialization을 지원할 수 있다. 현재 범위에서는 구현하지 않는다.
+by adding a `materialized_in_phase1: bool` flag to the op record.
 This is not implemented in the current scope.
-### D4. data_op 플래그 — 메시지 자기 선언
+### D4. data_op Flag — Message Self-Declaration
-로깅 대상은 메시지 타입이 아니라 메시지 인스턴스의 `data_op` 속성으로 결정한다.
+The logging target is determined by the `data_op` attribute on the message instance,
-프레임워크가 메시지 타입을 하드코딩하지 않는다.
+not by message type. The framework does not hardcode message types.
 ```python
 class MsgBase:
-    data_op: bool = False       # 기본: 로깅 안 함
+    data_op: bool = False       # default: no logging
 class DmaReadCmd(MsgBase):
-    data_op = True              # 메모리 이동 → 로깅
+    data_op = True              # memory transfer → logging
 class GemmCmd(MsgBase):
-    data_op = True              # 연산 → 로깅
+    data_op = True              # compute → logging
 class MathCmd(MsgBase):
-    data_op = True              # 연산 → 로깅
+    data_op = True              # compute → logging
 ```
-새 메시지 타입(예: IpcqMsg) 추가 시 `data_op = True`만 설정하면
+When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
-프레임워크 코드 수정 없이 자동 로깅된다.
+enables automatic logging without modifying framework code.
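The self-declaration rule can be sketched as below. This is a minimal illustration, not the framework's hook: `maybe_log`, `IpcqMsg`, and `CreditMsg` are hypothetical names, and the only behaviour taken from the ADR is that logging depends on the instance's `data_op` attribute rather than on its type.

```python
class MsgBase:
    data_op: bool = False          # default: not logged


class IpcqMsg(MsgBase):            # hypothetical new message type
    data_op = True                 # opts in to logging by declaration only


class CreditMsg(MsgBase):          # hypothetical control-only message
    pass                           # inherits data_op = False


def maybe_log(op_log: list, msg: MsgBase) -> None:
    """Append msg to op_log only if it declares itself a data op."""
    if getattr(msg, "data_op", False):
        op_log.append(msg)


op_log: list = []
for m in (IpcqMsg(), CreditMsg()):
    maybe_log(op_log, m)

logged_types = [type(m).__name__ for m in op_log]
```

The hook never names a concrete message class, which is what lets a new type be logged without touching framework code.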
-### D5. Op Log 구조
+### D5. Op Log Structure
-#### op 분류 체계
+#### Op Classification Scheme
-2단계로 분류한다:
+A two-level classification is used:
-| 레벨 | 필드 | 역할 |
+| Level | Field | Role |
-|------|------|------|
+|-------|-------|------|
-| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch 기준 |
+| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
-| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` 등 | 구체 연산 식별 |
+| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |
-#### OpRecord 정의
+#### OpRecord Definition
 ```python
@dataclass
 class OpRecord:
-    t_start: float              # SimPy 시각 (ns) — service 시작
+    t_start: float              # SimPy time (ns) — service start
-    t_end: float                # SimPy 시각 (ns) — service 완료
+    t_end: float                # SimPy time (ns) — service completion
    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
    op_kind: str                # "memory" | "gemm" | "math"
-    op_name: str                # 구체 연산명
+    op_name: str                # specific operation name
-    params: dict                # 연산별 파라미터 (아래 참조)
+    params: dict                # per-operation parameters (see below)
-    dependency_ids: list[int]   # 현재는 in-memory record index 기반, 향후 stable op_id로 대체 가능
+    dependency_ids: list[int]   # currently based on in-memory record index, may be replaced with stable op_id in the future
 ```
-#### dependency_ids 생성 규칙
+#### dependency_ids Generation Rules
-`dependency_ids`는 **optional**이며, 기본적으로 executor는
+`dependency_ids` is **optional**, and by default the executor performs
-주소 기반 dependency 추론을 수행한다 (D6 참조).
+address-based dependency inference (see D6).
-정확한 실행 순서가 필요한 경우에만 명시적으로 설정한다:
+Explicit setting is only needed when precise execution ordering is required:
- **기본 (address-based inference)**: executor가 read/write set을 분석하여
+- **Default (address-based inference)**: the executor analyzes read/write sets to
-  RAW/WAW/WAR 의존성을 자동 추론. 대부분의 경우 이것으로 충분.
+  automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
- **명시적 설정**: TLContext 또는 command 생성 단계에서 logical dependency가
+- **Explicit setting**: set at the TLContext or command-generation stage when a
-  주소로 표현되지 않는 경우에 설정.
+  logical dependency cannot be expressed via addresses.
-  예: completion handle 기반 동기화 — handle dependency는 메모리 주소가 아니라
+  Example: completion handle-based synchronization — handle dependencies depend on
-  논리적 완료 순서에 의존하므로 address inference로 잡히지 않는다.
+  logical completion order rather than memory addresses, so they cannot be captured
  by address inference.
-#### op_log ordering
+#### op_log Ordering
-op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
+The op_log maintains **stable ordering** based on `t_start`.
-동일 `t_start`의 record들은 insertion order를 보존한다.
+Records with the same `t_start` preserve insertion order.
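One way to obtain this ordering, assuming records are collected in insertion order, is a stable sort keyed on `t_start`. The `Rec` class below is a simplified stand-in for `OpRecord`, used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rec:                      # simplified stand-in for OpRecord
    t_start: float
    op_name: str


# Insertion order: gemm first, then two ops that share t_start = 10.0.
records = [Rec(20.0, "gemm_f16"), Rec(10.0, "dma_read"), Rec(10.0, "exp")]

# Python's sorted() is stable, so ties on t_start keep insertion order.
ordered = sorted(records, key=lambda r: r.t_start)
names = [r.op_name for r in ordered]
```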
-#### params 상세
+#### params Details
 **memory (dma_read / dma_write)**:
 ```python
 {
-    "src_addr": int,            # source 주소 (byte)
+    "src_addr": int,            # source address (byte)
-    "dst_addr": int,            # destination 주소 (byte)
+    "dst_addr": int,            # destination address (byte)
-    "nbytes": int,              # 전송 크기
+    "nbytes": int,              # transfer size
    "src_space": str,           # "hbm" | "tcm" | "sram"
    "dst_space": str,           # "hbm" | "tcm" | "sram"
 }
@@ -298,9 +300,9 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
 **gemm**:
 ```python
 {
-    "src_a_addr": int,          # operand A 주소
+    "src_a_addr": int,          # operand A address
-    "src_b_addr": int,          # operand B 주소
+    "src_b_addr": int,          # operand B address
-    "dst_addr": int,            # output 주소
+    "dst_addr": int,            # output address
    "shape_a": tuple,           # e.g. (128, 256)
    "shape_b": tuple,           # e.g. (256, 128)
    "shape_out": tuple,         # e.g. (128, 128)
@@ -312,7 +314,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
    "layout_a": str,            # "row_major" | "col_major"
    "layout_b": str,
    "layout_out": str,
-    "addr_space": str,          # "tcm" (GEMM operand는 항상 TCM)
+    "addr_space": str,          # "tcm" (GEMM operands are always in TCM)
 }
 ```
@@ -320,7 +322,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
 ```python
 {
    "op": str,                  # "exp" | "add" | "sum" | "where" | ...
-    "input_addrs": list[int],   # operand 주소 목록
+    "input_addrs": list[int],   # list of operand addresses
    "input_shapes": list[tuple],
    "dst_addr": int,
    "shape_out": tuple,
@@ -332,12 +334,12 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
 ### D6. Phase 2 Executor
-Phase 2는 SimPy 밖에서 op_log를 실행한다.
+Phase 2 executes the op_log outside of SimPy.
 ```python
 class DataExecutor:
    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
-        self.store = initial_store  # Phase 1의 MemoryStore snapshot을 입력으로 받는다
+        self.store = initial_store  # Takes the Phase 1 MemoryStore snapshot as input
    def run(self):
        for t, ops in groupby(op_log, key=lambda o: o.t_start):
@@ -347,30 +349,30 @@ class DataExecutor:
            self._execute_sequential(sequential)
 ```
-**병렬 실행 판정**:
+**Parallel execution determination**:
-같은 `t_start`의 op들은 **병렬 후보**로 간주한다.
+Ops with the same `t_start` are considered **parallel candidates**.
-실제 병렬 실행 여부는 executor가 다음 기준으로 판정한다:
+The executor determines actual parallel execution based on the following criteria:
- read/write 주소 범위 겹침 여부 (WAW, RAW, WAR 충돌 검사)
+- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
- `dependency_ids`에 명시된 선행 op 완료 여부
+- Whether predecessor ops specified in `dependency_ids` have completed
-주소 범위가 겹치지 않고 명시적 의존성이 없는 op들만 병렬 실행한다.
+Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
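The address-range check can be sketched as follows. The `reads` / `writes` dictionaries are a hypothetical simplification of the per-op params, and the sketch assumes both ops address the same memory space.

```python
def ranges_overlap(a_start, a_nbytes, b_start, b_nbytes):
    """True if the half-open byte ranges [start, start + nbytes) intersect."""
    return a_start < b_start + b_nbytes and b_start < a_start + a_nbytes


def conflicts(op_a, op_b):
    """True if op_a followed by op_b has a RAW, WAR, or WAW hazard."""
    pairs = (
        [(w, r) for w in op_a["writes"] for r in op_b["reads"]]       # RAW
        + [(r, w) for r in op_a["reads"] for w in op_b["writes"]]     # WAR
        + [(w1, w2) for w1 in op_a["writes"] for w2 in op_b["writes"]]  # WAW
    )
    return any(ranges_overlap(x[0], x[1], y[0], y[1]) for x, y in pairs)


# Each range is (addr, nbytes).
producer = {"reads": [(0, 64)], "writes": [(256, 64)]}
consumer = {"reads": [(256, 32)], "writes": [(512, 32)]}   # reads producer output
unrelated = {"reads": [(1024, 64)], "writes": [(2048, 64)]}

raw_hazard = conflicts(producer, consumer)
independent = not conflicts(producer, unrelated)
```

Ops for which `conflicts` is false, and which carry no explicit `dependency_ids`, are the ones eligible for parallel execution.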
-**배치 최적화**: 동일 op_name이며 **shape, dtype, layout, transpose flag가
+**Batch optimization**: Only independent ops with the same op_name **and identical
-모두 동일한** 독립 op들만 batching 대상이 된다.
+shape, dtype, layout, and transpose flags** are eligible for batching.
-예: 여러 PE의 동일 shape GEMM → `np.matmul(a_batch, b_batch)` 한 번으로 묶음.
+Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
-CPU에서도 BLAS 효율 향상, GPU에서는 launch overhead 절감.
+This improves BLAS efficiency on CPU and reduces launch overhead on GPU.
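The batching eligibility rule amounts to grouping ops by a key. The sketch below assumes the ops have already been found independent; the flat field names (`shape`, `dtype`, `layout`, `transpose`) are illustrative and do not reproduce the real params schema.

```python
from collections import defaultdict


def batch_key(op):
    # Ops may share a batch only if every element of this key matches.
    return (op["op_name"], op["shape"], op["dtype"], op["layout"], op["transpose"])


def group_for_batching(ops):
    """Group independent ops by batch key; returns {key: [pe, ...]}."""
    groups = defaultdict(list)
    for op in ops:
        groups[batch_key(op)].append(op["pe"])
    return dict(groups)


base = {"op_name": "gemm_f16", "dtype": "f16", "layout": "row_major", "transpose": False}
ops = [
    {**base, "pe": "pe0", "shape": (128, 256)},
    {**base, "pe": "pe1", "shape": (128, 256)},   # same key as pe0
    {**base, "pe": "pe2", "shape": (64, 256)},    # different shape: own batch
]

batches = group_for_batching(ops)
batch_sizes = sorted(len(pes) for pes in batches.values())
```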
-**Phase 2 실행 순서 보장**:
+**Phase 2 execution order guarantee**:
-Phase 2는 데이터 도착 시점을 고려하지 않으며,
+Phase 2 does not consider data arrival timing,
-dependency (주소 기반 추론 + 명시적 dependency_ids)를 통해서만
+and guarantees execution order solely through
-실행 순서를 보장한다.
+dependencies (address-based inference + explicit dependency_ids).
 ### D7. Memory Store
-`MemoryStore`는 논리적으로 byte-addressable semantics를 따르며,
+`MemoryStore` logically follows byte-addressable semantics,
-현재 구현은 **tensor-granular storage** (addr → numpy ndarray 매핑)를 사용한다.
+and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).
 ```python
 class MemoryStore:
@@ -378,139 +380,140 @@ class MemoryStore:
    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
 ```
-**내부 저장 포맷: numpy ndarray**
+**Internal storage format: numpy ndarray**
-MemoryStore는 텐서를 **numpy ndarray**로 저장한다.
+MemoryStore stores tensors as **numpy ndarrays**.
-| 후보 | store/load 속도 | Phase 2 연산 | 판정 |
+| Candidate | store/load speed | Phase 2 compute | Verdict |
-|------|----------------|-------------|------|
+|-----------|-----------------|-----------------|---------|
-| **numpy ndarray** | 즉시 (참조 전달, 복사 없음) | `np.matmul` 바로 사용 | **채택** |
+| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
-| bytearray | memcpy 필요 | `np.frombuffer` 변환 필요 | 탈락 |
+| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
-| torch tensor | 즉시 | torch 연산 가능 | GPU 최적화 시만 사용 |
+| torch tensor | Immediate | torch operations available | Use only for GPU optimization |
- write: numpy array를 **참조 저장** (복사 없음) → Phase 1 오버헤드 = dict lookup 1회
+- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
- read: numpy array를 **참조 반환** (복사 없음)
+- read: **returns numpy array by reference** (no copy)
- 동일 addr에 재 write 시 기존 array를 **tensor 단위로 덮어쓴다** (partial overwrite 미지원)
+- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
- dtype은 numpy native 사용 (`np.float16`, `np.float32`, `np.bfloat16` 등)
+- dtype uses numpy dtypes (`np.float16`, `np.float32`, etc.); bf16 is not a NumPy built-in and needs an extension dtype (e.g. `ml_dtypes.bfloat16`)
- byte-level access가 필요한 경우 `.view(np.uint8)` 로 변환
+- For byte-level access, convert via `.view(np.uint8)`
- Phase 2에서 GPU batch 최적화 시 numpy → torch tensor 변환은 executor가 담당
+- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
 **read/write contract**:
- read/write는 **contiguous tensor** 기준이다.
+- read/write operates on a **contiguous tensor** basis.
-  non-contiguous stride view가 필요한 경우 별도 copy op으로 표현한다.
+  If non-contiguous stride views are needed, express them as separate copy ops.
- 일반 benchmark path에서는 producer/consumer dtype 일치를 기대한다.
+- In the normal benchmark path, producer/consumer dtype match is expected.
-  reinterpret cast는 low-level memory validation 또는 특수 테스트 케이스를 위한
+  Reinterpret cast is a permissive behavior for low-level memory validation
-  permissive behavior이다.
+  or special test cases.
- addr은 byte-aligned이며, 최소 alignment = dtype 크기.
+- addr is byte-aligned, with minimum alignment = dtype size.
- dtype mismatch (write와 다른 dtype으로 read)는 reinterpret cast로 처리한다.
+- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
-  shape 불일치 시 nbytes 기준으로 검증하고, 불일치하면 error.
+  On a shape mismatch, the total nbytes is compared; if the byte counts differ, an error is raised.
- 정합성 기준은 주소 범위 기반 read/write semantics를 따른다.
+- Correctness criteria follow address-range-based read/write semantics.
- 구현 최적화로 tensor object cache를 둘 수 있지만,
+- A tensor object cache may be used as an implementation optimization,
-  canonical state는 byte-addressable storage이다.
+  but the canonical state is byte-addressable storage.
- deploy 시점에 호스트가 초기 텐서 데이터를 주입한다.
+- At deploy time, the host injects initial tensor data.
-### D8. 벤치마크 커널 코드
+### D8. Benchmark Kernel Code
-벤치마크의 **사용자 코드 API는 변경하지 않는다**.
+The benchmark's **user code API is not changed**.
-`tl.load()`, `tl.composite()`, `tl.store()` 등의 호출 인터페이스는 유지.
+The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.
-단, 내부 command/message schema는 Phase 2 실행에 필요한 metadata를
+However, internal command/message schemas may be extended to include metadata
-포함하도록 확장될 수 있다 (예: dtype_acc, transpose 등 추가 필드).
+required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
-### D9. 컴포넌트 변경 없음
+### D9. No Component Changes
-개별 컴포넌트 구현(PE_GEMM, PE_DMA, HBM_CTRL 등)은 수정하지 않는다.
+Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
-op_log 기록은 ComponentBase hook의 책임이다.
+Op log recording is the responsibility of the ComponentBase hook.
-커스텀 컴포넌트 교체 시 타이밍 모델만 교체되며,
+When custom components are replaced, only the timing model changes,
-Phase 2 데이터 실행은 영향받지 않는다.
+and Phase 2 data execution is unaffected.
-### D10. Phase 2는 Optional
+### D10. Phase 2 is Optional
 ```python
 engine = GraphEngine(graph)
-engine.run(benchmark)                       # Phase 1: 타이밍만
+engine.run(benchmark)                       # Phase 1: timing only
 result = engine.get_timing_result()
 if verify_data:
-    executor = DataExecutor(engine.op_log)  # Phase 2: 데이터
+    executor = DataExecutor(engine.op_log)  # Phase 2: data
    executor.run()
    executor.verify(expected_output)
 ```
-타이밍 분석만 필요하면 Phase 2를 건너뛴다.
+If only timing analysis is needed, Phase 2 is skipped.
-op_logger를 비활성화하면 Phase 1 성능도 기존과 동일.
+If the op_logger is disabled, Phase 1 performance matches the pre-change baseline.
 ### D11. Verification Contract
-기본 검증은 **최종 output tensor**를 reference backend(numpy)와 비교한다.
+Basic verification **compares the final output tensor** against a reference backend (numpy).
-dtype별 tolerance 정책:
+Per-dtype tolerance policy:
-| dtype | 비교 방식 | tolerance |
+| dtype | Comparison method | Tolerance |
 |-------|----------|-----------|
 | f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
 | f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
 | bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
-| int 계열 | `np.array_equal` | exact |
+| int types | `np.array_equal` | exact |
- 기본 모드: 최종 output만 비교 (end-to-end correctness)
+- Default mode: compare final output only (end-to-end correctness)
- 디버그 모드: intermediate tensor도 op 단위로 비교 가능
+- Debug mode: can compare intermediate tensors on a per-op basis
  (MemoryStore snapshot at each op boundary)
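The tolerance table can be read as the following check. This is a pure-Python sketch that applies the same criterion `np.allclose` uses, `|actual - expected| <= atol + rtol * |expected|`; the function name and the treatment of any dtype not in the table as an exact-match integer type are assumptions made for illustration.

```python
# (rtol, atol) per dtype, taken from the table above.
TOLERANCE = {"f32": (1e-5, 1e-5), "f16": (1e-3, 1e-3), "bf16": (1e-2, 1e-2)}


def within_tolerance(actual, expected, dtype):
    """Compare two flat sequences under the per-dtype policy."""
    if len(actual) != len(expected):
        return False
    if dtype not in TOLERANCE:          # assumed: integer types, exact match
        return list(actual) == list(expected)
    rtol, atol = TOLERANCE[dtype]
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


f16_ok = within_tolerance([1.0005], [1.0], "f16")    # 5e-4 <= 2e-3
f32_ok = within_tolerance([1.0005], [1.0], "f32")    # 5e-4 >  2e-5
int_ok = within_tolerance([1, 2, 3], [1, 2, 3], "i32")
```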
 ---
 ## Non-goals
- **Compute-result-based control flow**: 지원하지 않는다.
+- **Compute-result-based control flow**: not supported.
-  모든 compute handle은 Phase 1에서 pending 상태이며,
+  All compute handles are in pending state during Phase 1,
-  `wait()`는 timing synchronization만 표현하고 data readiness를 의미하지 않는다.
+  and `wait()` expresses only timing synchronization; it does not imply data readiness.
-  Phase 1에서 `handle.data` 접근, element access, truth-value evaluation은
+  Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
-  **error로 처리**한다.
+  is **treated as an error**.
-  메모리 데이터 기반 분기(`tl.load()` 결과)는 greenlet으로 지원된다.
+  Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
-  Phase 1 materialization은 future extension (D3 참조).
+  Phase 1 materialization is a future extension (see D3).
- **Cycle-accurate overlap reconstruction**: Phase 2에서 Phase 1의 실행 시간
+- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
-  overlap을 정확히 재현하지 않는다. Phase 2는 데이터 정합성만 검증한다.
+  the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
- **GPU kernel compilation**: Phase 2의 GEMM/Math는 numpy/torch 호출이며,
+- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
-  실제 하드웨어 PE의 마이크로아키텍처를 재현하지 않는다.
+  and do not reproduce the actual hardware PE microarchitecture.
 ## Open Questions
- **Aliasing / slice view**: 동일 backing storage를 참조하는 slice/view를
+- **Aliasing / slice view**: How to represent slice/views referencing the same
-  MemoryStore에서 어떻게 표현할지 (stride-based view vs copy semantics)
+  backing storage in MemoryStore (stride-based view vs copy semantics)
- **IPCQ/descriptor read 일반화**: PE-to-PE 통신을 memory op으로 완전히
+- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
-  일반화할지, 별도 op_kind를 둘지
+  communication as memory ops or introduce a separate op_kind
- **Op log streaming**: 대규모 시뮬레이션에서 op_log 메모리 사용량 관리
+- **Op log streaming**: Managing op_log memory usage in large-scale simulations
  (in-memory list vs disk-backed streaming)
- **Fused operation**: tl.composite의 tiled pipeline (READ→COMPUTE→WRITE)을
+- **Fused operation**: Whether to record tl.composite's tiled pipeline
-  하나의 fused op record로 기록할지, 개별 op으로 분리할지
+  (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
- **Math op schema 일반화**: 현재 math params는 단순 구조이나,
+- **Math op schema generalization**: The current math params have a simple structure,
-  broadcasting rule, input별 dtype, keepdims, scalar/immediate operand,
+  but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
-  where/mask 표현 등 일반화가 필요할 수 있음
+  scalar/immediate operands, where/mask expressions, etc.
- **Op record 식별자**: 현재 dependency_ids는 in-memory list index 기반이며,
+- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
-  streaming/disk-backed mode 도입 시 stable op_id로 대체 필요
+  replacement with stable op_id is needed when introducing streaming/disk-backed mode
- **Phase 1 materialization policy**: D3의 Future Extension 참조.
+- **Phase 1 materialization policy**: See Future Extension in D3.
-  허용 시 해당 op의 Phase 2 처리 방식 (skip / verify / recompute) 정의 필요
+  If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
  needs to be defined
 ---
 ## Consequences
-### 긍정적
+### Positive
- SimPy 시뮬레이션 성능 영향 최소 (op_log append만 추가)
+- Minimal impact on SimPy simulation performance (only op_log append added)
- Phase 2에서 멀티스레드/GPU 자유롭게 사용 가능
+- Phase 2 is free to use multi-threading or GPU
- 컴포넌트 교체 자유도 유지 (ADR-0015 설계 철학 보존)
+- Component replaceability preserved (ADR-0015 design philosophy maintained)
- 벤치마크 사용자 코드 API 변경 불필요
+- No changes needed to benchmark user code API
- 새 메시지 타입 추가 시 data_op 플래그만 설정
+- When adding new message types, only set the data_op flag
- greenlet으로 Phase 0 제거 — 메모리 데이터 기반 dynamic control flow 지원
+- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
- `tl.load()`가 실제 데이터를 반환하므로 커널 디버깅 용이
+- `tl.load()` returns actual data, making kernel debugging easier
-### 부정적
+### Negative
- op_log 메모리 사용량 (대규모 시뮬레이션 시)
+- op_log memory usage (for large-scale simulations)
- Phase 2 실행 시간은 텐서 크기에 비례 (대형 GEMM)
+- Phase 2 execution time is proportional to tensor size (large GEMM)
- pending handle (연산 미완료) 기반 동적 분기 불가
+- Dynamic branching based on pending handles (incomplete computations) not possible
-  (연산은 Phase 2에서 실행, Phase 1에서 결과 값 미확정).
+  (computations execute in Phase 2, result values are undetermined in Phase 1).
-  메모리 데이터 기반 분기는 greenlet으로 지원된다.
+  Memory-data-based branching is supported via greenlet.
- greenlet C 확장 의존성 추가 (pip install greenlet)
+- greenlet C extension dependency added (pip install greenlet)
@@ -1,882 +0,0 @@
 # ADR-0023: PE-level IPCQ — Inter-PE Collective Communication
 ## Status
 Accepted
 ## Context
 ### Goal
 Add the infrastructure that lets CCL (Collective Communication Library)
 kernels run **inside** a PE. The host just launches a kernel on each
 SIP; the actual synchronization and data movement happen **inside the
 PE kernel via an IPCQ (Inter-Process Communication Queue)**.
 This mirrors how NCCL performs NVLink communication inside a GPU
 kernel, or how Cerebras / Tenstorrent expose core-local communication
 queues. Host-level collectives (`dist.all_reduce`) are deferred to
 **future work**; this ADR focuses solely on the kernel-side collective
 infrastructure.
 ### Problems to solve
 1. PE-to-PE direct data movement (writing into a peer's memory).
 2. Synchronization — the sender must check that the receiver has space
   in its buffer (backpressure).
 3. Resource contention between compute traffic and communication
   traffic (Head-of-Line blocking).
 4. The host must be able to construct logical neighbor topologies
   (ring / mesh / tree) per algorithm.
 ---
 ## Decision
 ### D1. Add a new `PE_IPCQ` component
 A new component `PE_IPCQ` is added inside each PE. It follows the same
 pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a
 distinct component.
 ```
 PE
 ├── PE_CPU
 ├── PE_SCHEDULER
 ├── PE_DMA
 ├── PE_IPCQ          ← new
 ├── PE_FETCH_STORE
 ├── PE_GEMM
 ├── PE_MATH
 ├── PE_TCM
 ├── PE_MMU
 ```
 **Role separation** (control plane vs. data plane):
 - **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head /
  tail pointer management, peer pointer caches, backpressure, 4-direction
  neighbor mapping.
 - **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe
  / PCIe into the peer's memory.
 PE_IPCQ does **not** move data itself — it delegates to PE_DMA.
 ### D2. Ring buffer model
 Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.
 ```python
@dataclass
 class IpcqQueuePair:
    direction: Direction          # N/S/E/W
    peer: IpcqEndpoint            # set by host at init time (D2.5)
    tx_buffer_base: int           # outgoing data base addr (in our memory)
    rx_buffer_base: int           # incoming data base addr (in our memory)
    slot_size: int                # 1 tile per slot
    n_slots: int                  # ring depth
    my_head: int                  # next slot we will write/send into
    my_tail: int                  # next slot we will read/recv from
    peer_head_cache: int          # peer's last-seen head (updated via D9 piggyback)
    peer_tail_cache: int          # peer's last-seen tail (updated via D9 fast-path credit)
 ```
 **Canonical field names**: throughout this ADR the four names above
 (`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used
 consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`,
 etc.) are not used.
 | Field | Owner | Updated when |
 |-------|-------|--------------|
 | `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
 | `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
 | `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
 | `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |
 **Slot unit**: fixed-size, one slot holds one full tile (no descriptor
 indirection). Full data embedded in the slot. See D5.
 ### D2.5. `IpcqEndpoint` schema
 `IpcqQueuePair.peer` carries everything the sender needs to compute the
 peer's rx slot address:
 ```python
@dataclass(frozen=True)
 class IpcqEndpoint:
    sip: int
    cube: int
    pe: int
    buffer_kind: str             # "tcm" | "hbm" | "sram"
    rx_base_pa: int              # peer rx_buffer base PA (PhysAddr.encode())
    rx_base_va: int              # peer rx_buffer base VA (optional, MMU mode)
    n_slots: int                 # peer ring depth (for wrap-around)
    slot_size: int               # peer slot size (for offset)
 ```
 Address computation:
 ```python
 slot_idx = self.my_head % peer.n_slots
 dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
 ```
 PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA
 (vc_comm) routes the data to `dst_pa` through the fabric.
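A small worked example of the wrap-around, using made-up values (base PA `0x1000`, 4 slots of `0x100` bytes); `rx_slot_pa` is a hypothetical helper that restates the two-line computation above.

```python
def rx_slot_pa(my_head: int, rx_base_pa: int, n_slots: int, slot_size: int) -> int:
    """Physical address of the peer rx slot that send number my_head targets."""
    slot_idx = my_head % n_slots            # ring wrap-around
    return rx_base_pa + slot_idx * slot_size


# Six consecutive sends into a 4-slot ring: the fifth send reuses slot 0.
addrs = [hex(rx_slot_pa(h, 0x1000, 4, 0x100)) for h in range(6)]
```

Because `my_head` only ever increases, the modulo is what maps it back onto the fixed set of slots; backpressure (D5, D7) is what guarantees the reused slot has been consumed.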
 **Endpoint construction order**: at backend init (D10), the IPCQ
 buffers for **every PE** are allocated first (so each rank knows the
 others' PA), then the per-rank neighbor tables are built and pushed to
 PE_IPCQ via `IpcqInitMsg`.
 ### D3. Four-direction mapping ≡ logical ProcessGroup
 The PE views four directions (N/S/E/W) as logical ports. Real peer
 addresses are configured by the host CCL init, per the chosen
 algorithm. The PE kernel never knows the topology, only directions.
 ```python
 # 1D ring
 for rank in range(world_size):
    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])
 # 2D mesh
 for r in range(R):
    for c in range(C):
        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
 ```
 The PE code does not need to know where `tl.send(dir="E", ...)` actually
 ends up.
 ### D4. PE kernel API
 ```python
 # Send (blocking; may stall on backpressure)
 tl.send(dir: str, src=TensorHandle)
 tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
 # Recv (blocking)
 recv = tl.recv(dir: str, shape=..., dtype=...)
 recv = tl.recv(shape=..., dtype=...)        # round-robin across 4 directions
 # Recv (non-blocking)
 fut  = tl.recv_async(dir: str, shape=..., dtype=...)
 recv = tl.wait(fut)
 ```
 `tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each
 call rotates through directions, returning the first available slot.
 Empty in all 4 directions → wait.
 **Fairness is weak**: the rotating start mitigates simple bias, but if
 one direction always wins the race the others can starve. Algorithms
 that need strict fairness must call `tl.recv(dir=...)` explicitly.
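The rotating cursor could look like the sketch below. The exact scan order is an assumption (the ADR only states that a `last_polled_dir` cursor rotates through directions); `RecvCursor` and `pick` are hypothetical names.

```python
DIRECTIONS = ("N", "S", "E", "W")


class RecvCursor:
    def __init__(self):
        # Start just before "N" so the first scan begins at "N".
        self.last_polled_dir = len(DIRECTIONS) - 1

    def pick(self, available):
        """Return the first direction with data, scanning from after the cursor."""
        for step in range(1, len(DIRECTIONS) + 1):
            idx = (self.last_polled_dir + step) % len(DIRECTIONS)
            if DIRECTIONS[idx] in available:
                self.last_polled_dir = idx
                return DIRECTIONS[idx]
        return None   # empty in all 4 directions: the caller would wait


cursor = RecvCursor()
picks = [cursor.pick({"N", "E"}) for _ in range(4)]
nothing = cursor.pick(set())
```

With data always present on "N" and "E" the two alternate, but nothing here bounds how long a direction waits when arrivals race, which is the weak-fairness caveat above.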
 ### D5. Single-hop DMA write + full-data slot model
 Data moves from sender memory into the receiver's ring slot in **one
 DMA transfer**. Key properties:
 - **Single-hop**: the sender already knows the peer rx slot address and
  fires one fabric DMA into it.
 - **No CPU memcpy**: the CPU never copies data.
 - **No intermediate staging**: neither side keeps a separate staging
  buffer (sender uses the source addr directly; receiver gets the data
  in its ring slot directly).
 (Strictly speaking the fabric DMA write does happen, so this is not
 literally "no data movement" — it's the same property NCCL labels
 "zero-copy", meaning no CPU memcpy and no staging copy.)
 ```
 PE A: tl.send(E, src_addr, nbytes)
  1. IPCQ computes the peer rx slot address:
       dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
  2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
                   (full → sleep / poll)
  3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
  4. my_head += 1
 PE B: data = tl.recv(W)
  1. Look at rx_buffer[my_tail % n_slots]
  2. Wait for the data to arrive (D7 backpressure mode)
  3. Return the slot address to the kernel (or fetch into register file)
  4. my_tail += 1
  5. Issue a credit-return fast path (D9): after the bottleneck-BW
     latency, peer A's peer_tail_cache is updated.
 ```
 The slot holds the full tile. The receiver only reads its own
 rx_buffer; it never reads back into A's memory. The sender knows the
 peer rx slot address and DMAs directly into it (single-hop).
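The backpressure test in step 2 of the send flow can be traced with plain counters. This is a sketch under the assumption that `my_head` and `peer_tail_cache` are monotonically increasing sequence numbers; `can_send` is a hypothetical helper.

```python
def can_send(my_head: int, peer_tail_cache: int, n_slots: int) -> bool:
    """True if the peer ring has at least one free slot, as far as we know."""
    return my_head - peer_tail_cache < n_slots


n_slots = 2
my_head = 0
peer_tail_cache = 0
trace = []

for _ in range(3):                      # try three sends with no credit returned
    ok = can_send(my_head, peer_tail_cache, n_slots)
    trace.append(ok)
    if ok:
        my_head += 1                    # step 4 of the send flow

peer_tail_cache = 1                     # a credit arrives: peer consumed one slot
after_credit = can_send(my_head, peer_tail_cache, n_slots)
```

The sender decides from its cached view only, so a stale `peer_tail_cache` can delay a send but never lets it overwrite an unconsumed slot.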
 The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local
 to the PE).
 ### D6. Buffer placement — three-way benchmark
 The host CCL init picks the IPCQ ring-buffer location:
 ```python
 ipcq_init(
    backend="ahbm",
    buffer_kind="tcm" | "hbm" | "sram",
    n_slots=8,
    slot_size=4096,
 )
 ```
 | Location | Trait | Trade-off |
 |----------|-------|-----------|
 | **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
 | **PE-local HBM** | Large; via DMA | Higher latency |
 | **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |
 All three locations run the same kernel code; only the init differs.
 ### D7. Backpressure — two-mode benchmark
 How the sender or receiver waits when peer slots are full / data not
 yet arrived:
 | Mode | Behavior | Model |
 |------|----------|-------|
 | **poll** | Periodically re-check the cached peer pointer | Spin loop |
 | **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |
 ```python
 ipcq_init(backpressure="poll" | "sleep", ...)
 ```
 Both modes are implemented so latency / throughput trade-offs can be
 benchmarked.
 ### D8. PE_DMA virtual channels
 Extend PE_DMA from a single queue into a **two-channel virtual-channel**
 model.
 ```
 PE_DMA
 ├── vc_compute: tile load / store / writeback for GEMM and Math
 └── vc_comm:    IPCQ send data
 ```
 Each VC has an independent state machine:
 - One channel stalling does not block the other.
 - The same physical link (cube_noc, UCIe, …) is shared, but link BW is
  split between channels.
 **Chunk-level interleave**:
 - Large GEMM tile DMAs do not lock the link end-to-end.
 - Progress happens in chunks (e.g. 256 B); each chunk shares link BW
  with the other VC's pending chunks.
 - Chunk size is an init parameter (smaller = fairer, larger = more
  efficient).
 Net effect:
 - HoL blocking is eliminated (an IPCQ send can interleave with a long
  compute DMA).
 - Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM
  pattern).
 - Matches the NoC-virtual-channel pattern used in real HW.
 **First-implementation accuracy limit (intentional)**: this ADR's
 first cut uses **deterministic chunk-level interleave + weighted
 round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`).
 This is a first-order approximation and is simpler than real HW
 dynamic-contention / credit-based arbiters. Functional correctness is
 unaffected, but heavy-contention scenarios may report slightly
 optimistic latency vs. real HW. A separate ADR can add a NoC arbiter
 component later if more precision is needed.
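A toy model of deterministic chunk-level interleave with weighted round-robin is sketched below. It counts chunks only and ignores timing; the function name and the representation of weights as positive integer chunk quotas per round are assumptions for illustration.

```python
def interleave(n_compute_chunks, n_comm_chunks, weights=(1, 1)):
    """Return the order in which chunks from the two VCs occupy the link.

    weights are positive per-round chunk quotas for (vc_compute, vc_comm).
    """
    remaining = {"vc_compute": n_compute_chunks, "vc_comm": n_comm_chunks}
    quota = {"vc_compute": weights[0], "vc_comm": weights[1]}
    order = []
    while any(remaining.values()):
        for vc in ("vc_compute", "vc_comm"):
            grant = min(quota[vc], remaining[vc])   # an idle VC donates its turn
            order.extend([vc] * grant)
            remaining[vc] -= grant
    return order


# A long compute DMA (4 chunks) and a short IPCQ send (1 chunk), 50/50 weights.
order = interleave(4, 1)
comm_position = order.index("vc_comm")
```

The IPCQ chunk goes out second instead of waiting behind all four compute chunks, which is the head-of-line relief described above.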
 #### Token routing
 - Compute tokens (`TileToken`) — go through the existing
  PE_FETCH_STORE → PE_DMA chain.
 - Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA
  self-routing.
 - PE_DMA picks the channel by token type.
 ```python
 class PeDmaComponent:
    def _process(self, env, token):
        if isinstance(token, IpcqDmaToken):
            yield from self._vc_comm_process(env, token)
        else:
            yield from self._vc_compute_process(env, token)
 ```
 ### D9. Pointer synchronization — DMA payload piggyback
 Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so
 pointers update along with the data. This simulation adopts the same
 model: **no separate control channel** — metadata travels with the
 data.
 The big benefits:
 - **Automatic ordering**: data and metadata move on the same token, so
  data is visible **before** the `peer_head_cache` update. No race.
 - **HW fidelity**: matches NVLink / UCIe piggybacked headers.
 - **Component simplification**: no separate `IpcqPtrUpdate` event type.
 #### Send flow (head update via piggyback)
 ```
 PE A: tl.send(E, src_addr, nbytes)
  1. PE_IPCQ checks backpressure (using peer_tail_cache)
  2. PE_IPCQ creates an IpcqDmaToken:
       - data body (src_addr → peer dst_addr)
       - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
  3. Hand the token to PE_DMA(vc_comm)
  4. PE A increments my_head (send tracking)
 [fabric DMA: latency elapses]
 PE B's PE_DMA receives the token
  5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
  6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)
 PE B's PE_IPCQ receives the metadata
  7. Updates peer_head_cache (= A's head)
  8. Wakes any pending recv on that direction
 ```
 **Steps 5 and 6 must execute in the same SimPy step** — DMA completion
 makes data and metadata atomically visible.
 #### Recv flow (credit return — fast path with bottleneck-BW latency)
 When the receiver frees a slot, the sender must learn about it
 (backpressure release). Unlike data, the credit return does **not**
 travel through general vc_comm fabric — it uses a **separate fast
 path**, an abstraction of the NVLink / UCIe credit-return wire.
 **Latency** is computed from the **full path latency** (per-node
 overhead + edge propagation + drain), not a magic constant:
 ```
 credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
 path = router.find_path(self_pe, peer_pe.pe_dma)
 latency = compute_path_latency_ns(path, credit_size_bytes)
        = sum(edge.distance_mm * ns_per_mm)
        + sum(node_overhead_ns[n] for n in path)
        + credit_size_bytes / bottleneck_bw_on_path
 ```
 The router auto-appends `.pe_dma` to the source only, so the
 destination MUST be spelled with the explicit `.pe_dma` suffix;
 otherwise `find_path` raises and the credit silently teleports at zero
 cost (a latent bug fixed alongside this update).
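To make the three terms of the latency formula concrete, here is a worked example with made-up numbers. `path_latency_ns_sketch` is a hypothetical stand-in, not the project's `compute_path_latency_ns`; it takes the path already decomposed into edge lengths and per-node overheads.

```python
def path_latency_ns_sketch(edges_mm, node_overheads_ns, size_bytes,
                           ns_per_mm, bottleneck_bw_bytes_per_ns):
    """Edge propagation + per-node overhead + drain at the bottleneck link."""
    propagation = sum(d * ns_per_mm for d in edges_mm)
    overhead = sum(node_overheads_ns)
    drain = size_bytes / bottleneck_bw_bytes_per_ns
    return propagation + overhead + drain


# Illustrative path: two edges (2 mm, 3 mm), three nodes, 16-byte credit.
latency = path_latency_ns_sketch(
    edges_mm=[2.0, 3.0],
    node_overheads_ns=[1.0, 1.0, 2.0],
    size_bytes=16,
    ns_per_mm=0.5,                      # propagation: 2.5 ns
    bottleneck_bw_bytes_per_ns=32.0,    # drain: 0.5 ns
)
```

For a 16-byte credit the drain term is small, so the result is dominated by distance and node overheads, which is why an in-cube credit return comes out faster than a cross-SIP one.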
 `tl.recv` blocks on the credit-emit completion (recv yields-from
 `_delayed_credit_send` rather than spawning it as a fork). This puts
 the credit-return cost on the receiver's `pe_exec_ns`, modeling the
 IPCQ control-plane completing the consume-acknowledgement before
 recv returns to the kernel — the protocol equivalent of a non-posted
 `tl.store` waiting for an HBM ack on the raw DMA path.
 That gives us:
 - **Topology-proportional approximation**: an in-cube credit return is
  automatically faster than a cross-SIP credit return.
 - **No magic constants**: every nanosecond comes from
  `compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
  as data traffic.
 - **No deadlock risk**: unlike piggyback, B can issue credit even when
  it has no data to send back. `peer_credit_store.put` is unbounded.
 - **`IPCQ ≥ raw DMA`** (in latency) for matched physical moves: the
  credit-emit cost on recv balances the HBM ack round-trip cost that raw
  DMA pays on the sender.
 #### Component coupling — SimPy Store channel
 PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init
 time, **a SimPy Store is wired between the two** (a per-direction
 fast-path channel) and credit metadata is `put` into that store.
 ```python
 class PeIpcqComponent:
    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
        yield env.timeout(latency_ns)
        yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...))
 ```
 Backend init wires both directions of the fast-path channel as part of
 fan-out (see `IpcqInitMsg` in D12).
 #### Credit-return fast path limitations
 - `credit_size_bytes` is an estimate (typically 16–64 bytes).
 - The fast path is **excluded from vc_comm BW contention** (separate
  wire). Real HW credit-return wires are very lightweight, so this is a
  reasonable first approximation.
 - A follow-up ADR can: model the credit fast path as a separate link
  (BW limit + contention), or switch to piggyback (`credit_return_mode:
  piggyback`).
 #### PE_DMA's added responsibility
 When `vc_comm` receives a token, PE_DMA processes it as the following
 sequence: pay the Transaction's terminal BW drain, then atomically
 write data and forward metadata. **No SimPy yield is allowed between
 the data write and the metadata forward** (invariant I6). The drain
 yield must sit before the atomic block, not inside it:
 ```python
 def _on_vc_comm_recv(self, env, txn):
    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
    # sender PE_DMA). MUST happen before the atomic block so recv only
    # wakes after the bytes have "landed".
    drain = getattr(txn, "drain_ns", 0.0)
    if drain > 0:
        yield env.timeout(drain)
    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    # 1. Copy the payload from the source space into the destination slot
    data = self._memory_store.read(token.src_space, token.src_addr,
                                   shape=..., dtype=...)
    self._memory_store.write(token.dst_endpoint.buffer_kind,
                             token.dst_addr, data)
    # 2. Forward metadata to the local PE_IPCQ
    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
    # ───────────────────────────────────────────────────
 ```
 The final `put` is yieldable but targets an unbounded internal store,
 so it completes in the same step. That `put` closes the atomic block;
 no other yield may be inserted between the data write and it.
 #### Drain-at-inbound semantics (D9 timing model)
 The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
 stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
 is paid at each forwarding component via `run()`, and the remaining
 BW drain is paid once at the Transaction's terminal. Every non-IPCQ
 Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
 `ComponentBase._forward_txn` at the terminal node. For IPCQ the
 destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
 (so IPCQ-specific data write + metadata forward can happen), so **the
 drain MUST be paid explicitly at the top of that handler** to keep
 IPCQ's timing model on par with every other fabric Transaction.
 Side-effects of paying drain here:
 - **SRC `tl.send`** is unchanged — fire-and-forget semantics are
  preserved because the sender PE_DMA does not `yield sub_done`. The
  `sub_done.succeed()` call (made after metadata forward below) is an
  event with no listener on the sender side.
 - **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
  when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
  forward now happens after the drain, recv observes the full fabric
  transfer time including bandwidth cost.
 Matches the physical picture: send dispatches and leaves; recv waits
 until the bytes have actually been drained into its inbox.
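The timing consequence can be sketched with plain arithmetic. The helper names and the numbers below are illustrative assumptions, and the model is simplified: the sender is treated as resuming at the send time itself, and the per-hop overheads are given as a flat list.

```python
def drain_ns(nbytes, bottleneck_bw_bytes_per_ns):
    """Terminal BW drain stamped on the Transaction by the send-side PE_DMA."""
    return nbytes / bottleneck_bw_bytes_per_ns


def ipcq_timeline(t_send_ns, hop_overheads_ns, nbytes, bottleneck_bw_bytes_per_ns):
    """Return (sender_resume_ns, recv_wake_ns) for one send/recv pair."""
    # Fire-and-forget: the sender does not wait for the transfer to finish.
    sender_resume = t_send_ns
    # Per-hop overhead is paid at each forwarding component along the path.
    t_arrive = t_send_ns + sum(hop_overheads_ns)
    # The drain is paid once at the destination, before the metadata forward,
    # so recv only wakes after the bytes have landed.
    recv_wake = t_arrive + drain_ns(nbytes, bottleneck_bw_bytes_per_ns)
    return sender_resume, recv_wake


sender_resume, recv_wake = ipcq_timeline(100.0, [2.0, 5.0, 2.0], 4096, 32.0)
```

Here the receiver wakes 137 ns after the send is issued (9 ns of hop overhead plus 128 ns of drain), while the sender's own timeline is unaffected.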
 ### D9.5. ADR-0020 (2-pass) integration
 `tl.send` / `tl.recv` integrate with ADR-0020's two-pass model. Phase
 1 simulates timing **and** moves data via MemoryStore; Phase 2 enables
 op-log-based correctness verification.
 #### Phase 1 (timing + data)
 D9 models head and tail updates with two different mechanisms:
 - **Send-side (head update)** — DMA payload piggyback. Data write and
  metadata forward happen in the same SimPy step → automatic atomic
  visibility.
 - **Recv-side (tail credit return)** — fast-path SimPy Store channel
  with bottleneck-BW latency, then `peer_tail_cache` update.
 Together they preserve ring-buffer pointer consistency.
 The op-log records `op_kind="ipcq"` entries for sends (with
 `src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with
 `recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).
 Two recv modes:
 - **`return_slot`** (default): the slot address is returned to the
  kernel. Zero-copy.
 - **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`,
  PE_IPCQ copies the slot data into the user dst.
 #### Phase 2 (op_log replay)
 When `DataExecutor` encounters an `op_kind="ipcq"` record:
 - **send**: idempotent `src → dst` ndarray write.
 - **recv (`return_slot`)**: no-op (the slot already holds the data).
 - **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.
 IPCQ ops are pure data movement — Phase 2 has nothing extra to compute.
 The downstream GEMM / Math ops in `DataExecutor` will consume the data
 and naturally validate correctness.
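A minimal sketch of the three replay cases, assuming a toy memory keyed by `(space, addr)` and a hypothetical record layout (`op`, `src`, `dst` as keys). The real `DataExecutor` and op-log record format may differ.

```python
def replay_ipcq(record, memory):
    """Apply one op_kind="ipcq" record to a toy memory.

    memory maps (space, addr) -> bytes. The record layout is hypothetical.
    """
    if record["op_kind"] != "ipcq":
        raise ValueError("not an ipcq record")
    if record["op"] == "send":
        # Idempotent src -> dst write into the receiver's rx slot.
        memory[record["dst"]] = memory[record["src"]]
    elif record["recv_mode"] == "return_slot":
        pass  # zero-copy: the slot already holds the data
    elif record["recv_mode"] == "copy_to_dst":
        # Idempotent slot -> user dst copy.
        memory[record["dst"]] = memory[record["src"]]
    else:
        raise ValueError(f"unknown recv_mode: {record['recv_mode']}")


memory = {("tcm", 0x100): b"\x01\x02"}
op_log = [
    {"op_kind": "ipcq", "op": "send",
     "src": ("tcm", 0x100), "dst": ("tcm", 0x800)},
    {"op_kind": "ipcq", "op": "recv", "recv_mode": "return_slot"},
    {"op_kind": "ipcq", "op": "recv", "recv_mode": "copy_to_dst",
     "src": ("tcm", 0x800), "dst": ("hbm", 0x4000)},
]
for _ in range(2):  # replaying twice leaves memory unchanged (idempotent)
    for rec in op_log:
        replay_ipcq(rec, memory)
```

Because every case is either a plain copy or a no-op, replaying the log a second time produces the same memory contents.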
 ### D10. Host CCL init keeps the PyTorch shape
 The host code looks just like real PyTorch DDP. `init_process_group`
 creates the backend object; it does **not** receive IPCQ knobs
 (neighbor topology, buffer_kind, backpressure …).
 ```python
 # benches/ccl_allreduce.py — same shape as real PyTorch
 def worker(rank, world_size, torch):
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")  # reads ccl.yaml + topology
    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
    tensor.copy_(torch.from_numpy(init))
    dist.all_reduce(tensor, op="sum")
 ```
 The IPCQ configuration is decided by the backend at
 `init_process_group` time: it loads `ccl.yaml`, picks the algorithm,
 and pushes IPCQ neighbor tables to every participating PE_IPCQ. The
 host code never has to know about IPCQ.
 A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`.
 Switching algorithms is purely a `ccl.yaml` change — no host edits
 required.
 #### Init flow (eager)
 1. `init_process_group(backend="ahbm")` is called.
 2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
 3. Pulls topology + buffer_kind + backpressure + slot config from
   `algorithms[<algo>]`.
 4. **Immediately** installs neighbor tables on every PE_IPCQ
   (sideband or fabric `IpcqInitMsg`).
 5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally —
   PE_IPCQ is already prepared whether the kernel is a CCL kernel or
   not.
 ### D11. CCL config file (`ccl.yaml`)
 IPCQ config and algorithm metadata live in a separate YAML file,
 following the same pattern as `components.yaml` and `topology.yaml`.
 A single benchmark execution runs one algorithm
 (`defaults.algorithm`). Switching algorithms means editing
 `defaults.algorithm` only.
 ```yaml
 defaults:
  algorithm: ring_allreduce_tcm
  buffer_kind: tcm                # tcm | hbm | sram
  backpressure: sleep             # poll | sleep
  n_slots: 8
  slot_size: 4096
  vc_chunk_size: 256
  ipcq_credit_size_bytes: 16
 algorithms:
  ring_allreduce_tcm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d             # builtin name or "custom"
    buffer_kind: tcm
    n_elem: 8                     # optional, per-algorithm tile width
  tree_allreduce_7:
    module: kernbench.ccl.algorithms.tree_allreduce
    topology: tree_binary
    buffer_kind: tcm
    world_size: 7                 # algorithm-level override
    n_elem: 16
  custom_mesh:
    module: kernbench.ccl.algorithms.custom_mesh
    topology: custom              # the module supplies its own neighbors()
 ```
 `world_size` is **not set in `defaults`** above (a `defaults`-level value is optional). The backend resolves it via:
 `algorithm-level override > defaults override > topology spec`. The
 last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP
 where `WORLD_SIZE` comes from env vars rather than config files.
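The resolution order can be sketched as a small function. The `resolve_world_size` name and the `topology_spec` key names are assumptions made for this example; only the priority order comes from the text above.

```python
def resolve_world_size(ccl_cfg, topology_spec):
    """algorithm-level override > defaults override > topology spec."""
    defaults = ccl_cfg["defaults"]
    algo = ccl_cfg["algorithms"][defaults["algorithm"]]
    if "world_size" in algo:
        return int(algo["world_size"])
    if "world_size" in defaults:
        return int(defaults["world_size"])
    # Last fallback: total PE count derived from the topology.
    return (topology_spec["sips"]
            * topology_spec["cubes_per_sip"]
            * topology_spec["pes_per_cube"])


topology_spec = {"sips": 2, "cubes_per_sip": 4, "pes_per_cube": 8}
ccl_cfg = {
    "defaults": {"algorithm": "tree_allreduce_7"},
    "algorithms": {
        "ring_allreduce_tcm": {"topology": "ring_1d"},
        "tree_allreduce_7": {"topology": "tree_binary", "world_size": 7},
    },
}

tree_ws = resolve_world_size(ccl_cfg, topology_spec)  # algorithm-level override

ccl_cfg["defaults"]["algorithm"] = "ring_allreduce_tcm"
ring_ws = resolve_world_size(ccl_cfg, topology_spec)  # topology fallback
```

In this toy setup the tree algorithm resolves to its own override of 7, while the ring algorithm falls through to the topology product of 64.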
 #### Algorithm module structure
 Each algorithm module exports two hooks — `kernel` (required) and
 `neighbors` (optional) — plus a `kernel_args` helper that the
 backend uses to populate positional kernel arguments at `all_reduce`
 time:
 ```python
 # src/kernbench/ccl/algorithms/ring_allreduce.py
 def kernel_args(world_size: int, n_elem: int) -> tuple:
    return (n_elem, world_size)
 def kernel(t_ptr, n_elem, world_size, tl):
    """Required — the PE kernel.
    IPCQ is already installed by the backend before this is called.
    The kernel only uses the four-direction send / recv API.
    """
    ...
 def neighbors(rank, world_size, neighbor_map):
    """Optional — override the builtin topology's neighbor map.
    Returns a new dict, the modified-in-place dict, or None to keep the
    builtin map.
    """
    return None
 ```
 #### `neighbors` override patterns
 - **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
 - **Pattern B — replace entirely**: ignore `neighbor_map` and return a
  brand-new dict.
 - **Pattern C — keep builtin**: omit `neighbors` or return None.
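Pattern A might look like the sketch below, which cuts the wrap-around links of a ring to form an open chain. The `ring_1d` and `resolve_neighbors` helpers are hypothetical stand-ins for the builtin topology and the backend's resolver; only the `neighbors(rank, world_size, neighbor_map)` hook signature is taken from the module structure above.

```python
def ring_1d(rank, world_size):
    """Stand-in for the builtin ring: E is the next rank, W the previous."""
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def neighbors(rank, world_size, neighbor_map):
    """Pattern A: drop the wrap-around links so the ring becomes a chain."""
    trimmed = dict(neighbor_map)
    if rank == world_size - 1:
        trimmed.pop("E", None)  # last rank has no eastern neighbor
    if rank == 0:
        trimmed.pop("W", None)  # first rank has no western neighbor
    return trimmed


def resolve_neighbors(rank, world_size, override=None):
    """Apply an optional override; None (Pattern C) keeps the builtin map."""
    builtin = ring_1d(rank, world_size)
    if override is None:
        return builtin
    result = override(rank, world_size, builtin)
    return builtin if result is None else result


chain = [resolve_neighbors(r, 4, neighbors) for r in range(4)]
```

Both ends of the cut link are removed together (rank 3's `E` and rank 0's `W`), so the trimmed maps stay symmetric.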
 #### Builtin topologies
 | topology | direction set |
 |----------|---------------|
 | `ring_1d` | E, W |
 | `ring_1d_unidir` | E only |
 | `mesh_2d` | N, S, E, W |
 | `tree_binary` | parent, child_left, child_right |
 | `none` | (empty) — algorithm must supply `neighbors()` |
 #### Adding a new algorithm
 1. Write `kernel` and `kernel_args` in
   `src/kernbench/ccl/algorithms/<algo>.py`.
 2. Add an entry in `ccl.yaml`'s `algorithms` section.
 3. (Optional) provide `neighbors()` for custom topology.
 4. Set `defaults.algorithm` to the new algorithm.
 The host bench (`benches/ccl_allreduce.py`) does not change.
 ### D12. Message / token schema
 The new message types added by this ADR. They live in
 `src/kernbench/common/pe_commands.py` and
 `src/kernbench/runtime_api/kernel.py`.
 #### `IpcqInitMsg` (sideband, fan-out at init)
 The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors
 `MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`).
 Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`,
 `my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store`
 field: a `simpy.Store` instance pre-wired so the PE_IPCQ sending a
 credit can push `IpcqCreditMetadata` directly into the credit
 receiver's input queue.
 #### `IpcqSendCmd` (PE_CPU → PE_IPCQ)
 Carries `direction`, source addr/space, nbytes, shape, dtype, and a
 handle id. `data_op=True` so it lands in the op_log.
 #### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)
 Carries `direction` (or None for round-robin), `recv_mode`
 (`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape,
 dtype, blocking flag.
 #### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)
 Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`)
 plus the head metadata (`sender_seq`, `src_sip/cube/pe`,
 `src_direction`). PE_DMA picks the channel by token type
 (`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`).
 The receiver's PE_DMA, on token arrival, performs the I6 atomic
 sequence: write data into MemoryStore, then forward `IpcqMetaArrival`
 to the local PE_IPCQ.
 #### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)
 Carries `consumer_seq` (= my_tail), source PE coords, and source
 direction. Travels through the dedicated SimPy Store channel rather
 than `vc_comm`. Latency = `compute_path_latency_ns(path,
 credit_size_bytes)`, the full path latency described in D9.
 There is **no `IpcqPtrUpdate` event** — head updates flow via D9
 piggyback, tail updates via the D9 fast-path channel.
 ### D13. Test strategy
 Test plan:
 #### T1. Unit tests (component-level)
 - **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure
  immediately forwards a token; full peer slot triggers backpressure
  (poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`;
  round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`.
 - **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute`
  / `vc_comm` independent progress, chunk interleave, BW split.
 - **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d /
  mesh_2d / tree_binary correctness, mesh_2d non-square →
  `ValueError`, custom resolver returns the module's `neighbors`.
 #### T2. Integration tests (E2E send/recv)
 - **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional
  no-deadlock), 4×4 mesh.
 - **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode
  records `ipcq` ops in op_log; DataExecutor produces correct
  `out.data`.
 #### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)
 `ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA
 consistency, per-`buffer_kind` allocation.
 #### T4. Regression
 All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for
 non-CCL benches.
 #### T5. Performance / overhead
 Single send/recv pair latency = (DMA latency) + (IPCQ overhead). It
 should be close to a regular PE_DMA write of the same nbytes (target:
 IPCQ overhead < 100 ns).
 ### D14. Invariants and failure modes
 #### Invariants
 I1. **Slot lifecycle exactly-once**: one send → exactly one recv.
 I2. **Pointer monotonicity**: `my_head` / `my_tail` are monotonically
   non-decreasing; `sender_seq` is strictly increasing.
 I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank
   B, then rank B's reverse-direction peer must be rank A. Verified at
   init.
 I4. **`buffer_kind` consistency**: all PEs in a process group share
   the same `buffer_kind` (no mixed mode in the first cut).
 I5. **op_log ordering**: send → DMA complete → recv possible. The
   t_start order in op_log respects this causality.
 I6. **Atomic data + metadata visibility (MUST)**: at the receiver
   side, data write (`MemoryStore.write`) and metadata forward
   (`peer_head_cache` update) **must execute in the same SimPy step**.
   No yield is allowed between the two operations in PE_DMA's vc_comm
   handler. Code review must reject any inserted `yield` (or `yield
   from`) — it would create a race where head_cache becomes visible
   before or after the data.
 I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6,
   the step in which `peer_head_cache > my_tail` first becomes true is the
   same step in which the slot data is observable.
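An init-time check for I3 could look like the sketch below. The `OPPOSITE` table and the `check_endpoint_consistency` helper are assumptions for illustration; the real backend may pair directions differently.

```python
OPPOSITE = {"E": "W", "W": "E", "N": "S", "S": "N"}


def check_endpoint_consistency(neighbor_maps):
    """Return (rank, direction, peer) triples that violate I3.

    neighbor_maps: {rank: {direction: peer_rank}} for the whole group.
    """
    violations = []
    for rank, nbrs in neighbor_maps.items():
        for direction, peer in nbrs.items():
            # The peer's reverse direction must point back at this rank.
            back = neighbor_maps.get(peer, {}).get(OPPOSITE[direction])
            if back != rank:
                violations.append((rank, direction, peer))
    return violations


ring4 = {r: {"E": (r + 1) % 4, "W": (r - 1) % 4} for r in range(4)}
ok = check_endpoint_consistency(ring4)

broken = {0: {"E": 1}, 1: {"W": 2}, 2: {}}
bad = check_endpoint_consistency(broken)
```

A consistent ring produces no violations, while the deliberately broken map reports both mismatched entries.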
 #### Failure modes (runtime errors)
 F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction
   → `IpcqInvalidDirection`, simulation aborts.
 F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched
   send and recv. Not validated by default; opt-in strict mode catches
   it (`strict_validation: true` on a PE_IPCQ node attrs).
 F3. **Deadlock detection (timeout-based)**: the simulator empties its
   schedule while a send/recv is still pending → engine raises
   `IpcqDeadlock` and embeds a pointer dump.
 F4. **Backend init failure**: missing `defaults.algorithm`, missing
   `algorithms[name]`, module import failure, topology validation
   failure (I3, I4) — all raised at `init_process_group` time.
 F5. **Slot full + infinite backpressure**: the peer never recvs.
   Surfaces as F3 timeout.
 #### Diagnostics
 - **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as
  `(rank, t, dir, nbytes)`.
 - **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)`
  prints every PE_IPCQ ring buffer's `my_head`, `my_tail`,
  `peer_head_cache`, `peer_tail_cache`.
 - **Deadlock dump**: on hang the engine includes the pointer dump in
  the `IpcqDeadlock` exception message.
 ### D15. Algorithm-author cheat sheet
 Full step-by-step lives in
 [`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
 shortest version:
 | Things you touch | Things you don't |
 |------------------|-------------------|
 | `src/kernbench/ccl/algorithms/<your_algo>.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
 | One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
 | (Optional) `tests/test_<your_algo>.py` mock test | PE_IPCQ component, AhbmCCLBackend |
 5-step flow: write the kernel → register in `ccl.yaml` → optional
 `neighbors` override → optional mock unit test → SimPy validation via
 `kernbench run --bench ccl_allreduce --verify-data`.
 Common mistakes: using a direction that wasn't installed, sends
 without matching recvs (deadlock), dtype/shape disagreement, assuming
 fairness from `tl.recv()` round-robin, confusing
 `tl.num_programs(axis)` with the CCL group size.
 ---
 ## Non-goals
 - **Host collective**: a model where `dist.all_reduce` itself moves
  data on the host side is out of scope. This ADR only covers
  communication that happens inside the PE kernel.
 - **All-reduce algorithms**: ring / tree / etc. live in algorithm
  modules and can be added without amending this ADR.
 - **Reliability / error handling**: link faults, send/recv failure
  recovery, etc. are out of scope.
 - **NoC arbiter precision**: dynamic VC contention is left for a future
  ADR (see D8).
 ---
 ## Open questions
 - **VC arbitration accuracy** — the first cut uses deterministic
  chunk interleave + weighted round-robin; heavy contention may report
  optimistic latency. A NoC arbiter component can be added later.
 - **Credit return BW model** — the fast path is currently outside the
  fabric BW contention model. Can be modeled as a separate link or
  switched to piggyback (`credit_return_mode: piggyback`).
 - **Ring buffer slot allocation metadata** — whether the host pushes
  IPCQ buffer metadata via sideband or via a fabric message similar to
  `MmuMapMsg` is open.
 - **VC BW split default** — 50/50 vs. weighted (e.g. 80/20). Exposed in
  `ccl.yaml`; default value TBD.
 - **Direction count** — 4 (N/S/E/W) is fixed in the first cut; 6
  (with Up/Down for 3D) or N (variable) is future work.
 - **Multi-tile aggregation primitives** — whether
  `tl.recv_all` or similar is needed for fan-in.
 - **Round-robin recv fairness** — current weak fairness can starve;
  strict fairness counter is future work.
 - **Deadlock detection precision** — currently timeout-based; a
  realtime wait-for graph would enable deterministic detection.
 ---
 ## Consequences
 ### Positive
 - Direct PE-to-PE communication makes it possible to write CCL kernels.
 - Host stays minimal (just `launch`), synchronization happens inside
  the PE → strong compute / comm overlap.
 - VCs eliminate HoL blocking → collective latency is not blocked by
  compute traffic.
 - Buffer placement and backpressure mode are init-time parameters →
  easy to benchmark.
 - Four-direction logical neighbors → host is free to map
  ring/mesh/tree algorithms.
 ### Negative
 - One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
 - IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE.
 - VC arbitration is a first-order approximation; heavy contention
  scenarios may report slightly optimistic latency vs real HW (D8).
 - Chunk-level interleave makes PE_DMA implementation more complex.
@@ -6,43 +6,46 @@ Accepted
 ## Context
-### 목표
+### Goal
-`torch.distributed` collective 호출의 참여 단위(rank)를 **SIP**(device)
+Align the participation unit (rank) of `torch.distributed` collective calls
-경계에 맞춘다. 실제 PyTorch DDP/TP 스크립트와 **호스트 레벨에서 구분 없이**
+to the **SIP** (device) boundary. The aim is bench code that, at the host
-읽히는 bench 코드를 목표로 한다.
+level, reads **indistinguishably** from real PyTorch DDP/TP scripts.
-real PyTorch와 비교:
+Comparison with real PyTorch:
-| 차원 | real PyTorch | KernBench |
+| Dimension | real PyTorch | KernBench |
 | --- | --- | --- |
-| 프로세스 모델 | N개 프로세스, 각 1 GPU | 1 프로세스, N greenlet, 각 1 SIP |
+| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
-| `get_rank()` | `RANK` env var | greenlet-local 레지스트리 |
+| `get_rank()` | `RANK` env var | greenlet-local registry |
-| `get_world_size()` | `WORLD_SIZE` env var | topology의 SIP 수 |
+| `get_world_size()` | `WORLD_SIZE` env var | SIP count from topology |
 | `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
-| `mp.spawn` | OS 프로세스 fork | greenlet fan-out |
+| `mp.spawn` | OS process fork | greenlet fan-out |
-### 풀어야 할 문제
+### Problems to solve
-1. **공개 API에서 rank = SIP** — bench worker가 PE 개념을 알지 않도록.
+1. **Public API where rank = SIP** — so bench workers do not have to know
-2. **Greenlet-local rank/device tracking** — 1-프로세스 모델 안에서 각
+   about the PE concept.
-   worker greenlet이 자기 rank / 자기 SIP를 정확히 식별.
+2. **Greenlet-local rank/device tracking** — within the 1-process model,
-3. **Tensor placement = structural (sip, cube, pe)** — rank가 SIP이면
+   each worker greenlet must correctly identify its own rank / its own SIP.
-   기본 텐서 배치도 구조적 좌표로 표현되어야 함.
+3. **Tensor placement = structural (sip, cube, pe)** — if rank is SIP,
   the default tensor placement should also be expressed in structural
   coordinates.
-### Non-problem (이 ADR 밖)
+### Non-problem (outside this ADR)
 - IPCQ direction addressing → ADR-0025
- `DPPolicy.sip`/`num_sips` 제거 → ADR-0026
+- Removing `DPPolicy.sip`/`num_sips` → ADR-0026
 - Megatron-style TP → ADR-0027
 - DTensor → ADR-0028 (future)
 - Worker scheduling / `mp.spawn` / collective drain / exception cleanup
  → ADR-0027 D0/D1
- Collective algorithm 구현 (intercube_allreduce, SFR config) → ADR-0032
+- Collective algorithm implementation (intercube_allreduce, SFR config)
  → ADR-0032
 ## Decision
-### D1. rank = SIP (world_size 해석)
+### D1. rank = SIP (world_size resolution)
 ```python
 def _resolve_world_size(self) -> int:
@@ -55,8 +58,8 @@ def _resolve_world_size(self) -> int:
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
 ```
-우선순위: 알고리즘 override > defaults override > SIP count. `ccl.yaml`
+Priority order: algorithm override > defaults override > SIP count. The
-override는 legacy "rank = PE" 테스트 경로로 유지.
+`ccl.yaml` override is retained as the legacy "rank = PE" test path.
 ### D2. Greenlet-local rank registry (+ debug warning)
@@ -83,11 +86,11 @@ class DistributedContext:
        return int(self._rank_by_greenlet[g])
 ```
-### D3. `torch.ahbm.set_device(rank)` — SIP 바인딩
+### D3. `torch.ahbm.set_device(rank)` — SIP binding
-KernBench 백엔드 이름은 `ahbm` (ADR-0023). Real PyTorch는
+The KernBench backend name is `ahbm` (ADR-0023). Real PyTorch uses
-`torch.cuda.set_device(r)`이지만 우리는 CUDA가 아니므로 honestly-named
+`torch.cuda.set_device(r)`, but since we are not CUDA we use an
-namespace를 사용한다.
+honestly-named namespace.
 ```python
 class _AhbmNamespace:
@@ -113,10 +116,12 @@ class _AhbmNamespace:
 # Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
 ```
-**PyTorch 2.x style 병행 지원**: 최신 PyTorch는 device-agnostic한
+**PyTorch 2.x style parallel support**: Recent PyTorch is moving toward a
-`torch.accelerator` 네임스페이스를 지향 (`torch.accelerator.set_device_index(r)`,
+device-agnostic `torch.accelerator` namespace
-`torch.accelerator.current_device_index()`). Device vendor에 종속되지 않는
+(`torch.accelerator.set_device_index(r)`,
-코드를 쓰려는 사용자를 위해 KernBench도 이 표면을 병행 지원한다.
+`torch.accelerator.current_device_index()`). To support users who want to
 write code that is not tied to a specific device vendor, KernBench also
 exposes this surface in parallel.
 ```python
 class _AcceleratorNamespace:
@@ -141,23 +146,23 @@ self.ahbm = _AhbmNamespace()
 self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias
 ```
-Bench 작성자는 다음 중 하나를 선택 — 둘 다 내부적으로 같은 레지스트리를 보유:
+Bench authors may choose either — both share the same registry internally:
 ```python
 torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
 torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
 ```
-### D4. Tensor placement = structural (sip, cube, pe) 좌표
+### D4. Tensor placement = structural (sip, cube, pe) coordinates
-`resolve_dp_policy`가 `target_sip`을 직접 받아 구조적 좌표로 placement 생성.
+`resolve_dp_policy` takes `target_sip` directly and produces placement in
-세부는 ADR-0026.
+structural coordinates. Details in ADR-0026.
 ```python
 # RuntimeContext._create_tensor
 current_sip = self.ahbm.current_device()          # (D3 naming)
 if current_sip is None:
-    current_sip = 0  # single-driver fallback (D2와 일관)
+    current_sip = 0  # single-driver fallback (consistent with D2)
 placement = resolve_dp_policy(
    dp, shape=shape_2d, itemsize=itemsize,
    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
@@ -165,29 +170,29 @@ placement = resolve_dp_policy(
 )
 ```
-Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적
+No post-hoc `pe_index` shifting — ShardSpec carries the `(sip, cube, pe)`
-좌표를 직접 보유. ShardSpec 상세는 ADR-0026.
+structural coordinates directly. ShardSpec details in ADR-0026.
 ---
 ## Dependencies
- **ADR-0023** (IPCQ): backend `ahbm` namespace의 기원.
+- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
- **ADR-0026** (DPPolicy intra-device): D4의 `resolve_dp_policy` 시그니처와
+- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature
-  ShardSpec의 구조적 좌표 표현.
+  used by D4 and the structural-coordinate representation of ShardSpec.
- **ADR-0027** (Megatron TP + scheduler): worker scheduling, `mp.spawn`,
+- **ADR-0027** (Megatron TP + scheduler): the implementation baseline for
-  collective drain, exception cleanup의 구현 기준.
+  worker scheduling, `mp.spawn`, collective drain, and exception cleanup.
 ---
 ## Non-goals
- **IPCQ protocol 수정**: ADR-0023 유지.
+- **Modifying the IPCQ protocol**: ADR-0023 remains as-is.
- **DPPolicy 필드 정리**: ADR-0026.
+- **Cleaning up DPPolicy fields**: ADR-0026.
 - **Megatron-style TP**: ADR-0027.
 - **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
- **Collective algorithm 구현**: ADR-0032.
+- **Collective algorithm implementation**: ADR-0032.
- **Multi-node (프로세스 간)**: 단일 프로세스.
+- **Multi-node (cross-process)**: single process only.
 ---
@@ -195,12 +200,14 @@ Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적
 ### Positive
- **Bench = real PyTorch DDP** (공개 API 관점).
+- **Bench = real PyTorch DDP** (from the public-API point of view).
- **Greenlet-local rank**: 1-프로세스 모델에서 cross-rank correctness 가능.
+- **Greenlet-local rank**: enables cross-rank correctness within the
- **Structural placement 좌표**: ADR-0026 / ADR-0027 / ADR-0032의 다른 ADR이
+  1-process model.
-  `(sip, cube, pe)` 3튜플 위에서 일관되게 동작.
+- **Structural placement coordinates**: lets the other ADRs (ADR-0026 /
  ADR-0027 / ADR-0032) operate consistently on top of the `(sip, cube, pe)`
  3-tuple.
 ### Neutral
- IPCQ PE-level protocol (ADR-0023) 불변.
+- IPCQ PE-level protocol (ADR-0023) is unchanged.
- IO_CPU 역할 불변 (기존 transit 그대로).
+- IO_CPU role is unchanged (existing transit behavior preserved).
@@ -6,51 +6,58 @@ Accepted (Revision 2 — Address-based matching; peer_direction field dropped)
 ## Context
-### 목표
+### Goal
-ADR-0023의 IPCQ protocol에서 **"어느 direction pair를 통한 전송인가"의 식별**을
+In the IPCQ protocol of ADR-0023, make the **identification of "which
-topology / dict-order에 의존하지 않고 **주소 기반**으로 일관되게 한다.
+direction pair this transfer belongs to"** consistent and **address-based**,
-2-rank bidirectional ring (또는 여러 direction이 동일 peer를 가리키는
+without depending on topology / dict-order. It must work correctly in a
-topology 일반)에서 정확히 동작하도록 한다.
+2-rank bidirectional ring (and more generally in any topology where
 multiple directions point to the same peer).
-### 드러난 버그 — 2-rank bidirectional ring
+### The bug surfaced — 2-rank bidirectional ring
-`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). 양쪽 방향이 같은 peer.
+`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions
 point to the same peer.
-**버그 1 (install)**:
+**Bug 1 (install)**:
- `reverse_direction(0, 1)` → dict order로 "E" 반환 (틀림, "W"가 맞음 — opposite
+- `reverse_direction(0, 1)` → returns "E" by dict order (wrong; "W" is the
-  direction convention)
+  correct answer — opposite-direction convention)
- rank 0의 E entry가 `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`로 설정
+- rank 0's E entry is set with `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`
- tl.send(E) → data가 sip1의 E-rx buffer로 landing (should be W-rx)
+- tl.send(E) → data lands in sip1's E-rx buffer (should be W-rx)
-**버그 2 (runtime)**:
+**Bug 2 (runtime)**:
- 설령 install이 올바른 주소로 설정해도, receiver의 `_handle_meta_arrival`이
+- Even if install set up the correct address, the receiver's
-  sender 좌표만으로 direction 매칭 → 첫 direction (E) 승
+  `_handle_meta_arrival` matches direction by sender coordinates only → the
- peer_head_cache[E] 증가, peer_head_cache[W]는 불변
+  first direction (E) wins
- Kernel의 tl.recv(W)는 peer_head_cache[W] 대기 → 영원히 블록 → IpcqDeadlock
+- peer_head_cache[E] is incremented; peer_head_cache[W] is unchanged
 - The kernel's tl.recv(W) waits on peer_head_cache[W] → blocks forever →
  IpcqDeadlock
-### 근본 원인
+### Root cause
-두 축에서 동일 문제:
+The same issue along two axes:
-1. **Install-time pairing**: "내 direction과 peer의 어느 direction이 짝인가"
+1. **Install-time pairing**: deciding "which of my directions pairs with
-   결정이 dict-iteration-order에 의존 → 여러 direction이 같은 peer를 가리킬 때
+   which direction of the peer" depends on dict-iteration-order → fragile
-   fragile
+   when multiple directions point to the same peer
-2. **Runtime identification**: "어느 qp를 업데이트해야 하는가" 결정이 sender
+2. **Runtime identification**: deciding "which qp should be updated" is
-   좌표만으로 이루어짐 → direction 중복 시 ambiguous
+   based on sender coordinates alone → ambiguous when directions are
   duplicated
-### 해결 방향 — address-based matching
+### Solution direction — address-based matching
-각 PE의 rx buffer는 **direction별로 고유한 주소 range**에 위치 (rx_base_pa +
+Each PE's rx buffer sits at a **unique address range per direction**
-direction_idx × bytes_per_direction). 따라서:
+(rx_base_pa + direction_idx × bytes_per_direction). Therefore:
- **Runtime**: sender coord 대신 **dst_addr 범위**로 매칭 → unambiguous
+- **Runtime**: match by **dst_addr range** instead of sender coord →
- **Install**: opposite-direction 우선 선택 heuristic (ring / mesh의 자연스러운
+  unambiguous
-  대칭성)
+- **Install**: prefer the opposite direction as a heuristic (the natural
- `peer_direction` 같은 이중 메타데이터 불필요 — **주소가 single source of
+  symmetry of ring / mesh)
-  truth**
+- No need for redundant metadata like `peer_direction` — **address is the
  single source of truth**
-이 설계는 **PhysAddr 전환 (ADR-0030)과 독립적**으로 작동. 현재 synthetic
+This design works **independently of the PhysAddr transition (ADR-0030)**.
-주소든 PhysAddr든 direction별 range 유일성만 지켜지면 동일하게 적용 가능.
+Whether the current addresses are synthetic or PhysAddr, the same approach
 applies as long as the per-direction range uniqueness is preserved.
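The address arithmetic behind this can be sketched as follows. The direction ordering in `DIRECTIONS`, the helper names, and the base address are assumptions for the example; `bytes_per_direction` is taken here as `n_slots * slot_size`.

```python
DIRECTIONS = ("N", "S", "E", "W")  # assumed direction_idx order


def rx_range(rx_base_pa, direction, bytes_per_direction):
    """Half-open [start, end) address range of one direction's rx buffer."""
    start = rx_base_pa + DIRECTIONS.index(direction) * bytes_per_direction
    return start, start + bytes_per_direction


def direction_for_addr(dst_addr, rx_base_pa, bytes_per_direction):
    """Map a destination address to the direction whose range contains it."""
    offset = dst_addr - rx_base_pa
    if offset < 0:
        return None
    idx = offset // bytes_per_direction
    return DIRECTIONS[idx] if idx < len(DIRECTIONS) else None


BYTES_PER_DIRECTION = 8 * 4096  # n_slots * slot_size from the ccl.yaml defaults
RX_BASE = 0x10000               # hypothetical base address

e_start, e_end = rx_range(RX_BASE, "E", BYTES_PER_DIRECTION)
w_slot1 = rx_range(RX_BASE, "W", BYTES_PER_DIRECTION)[0] + 4096
```

Two directions that point to the same peer still land in disjoint ranges, so the lookup stays unambiguous in the 2-rank ring case.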
 ---
@@ -91,17 +98,17 @@ def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
    return None
 ```
-호출부:
+Call site:
 ```python
 for d, peer_rank in nbrs.items():
-    peer_dir = reverse_direction(r, peer_rank, d)  # my_dir 전달
+    peer_dir = reverse_direction(r, peer_rank, d)  # pass my_dir
    if peer_dir is None:
        continue
    ...
 ```
-### D2. Runtime — `_handle_meta_arrival` dst_addr 매칭
+### D2. Runtime — `_handle_meta_arrival` dst_addr matching
 `src/kernbench/components/builtin/pe_ipcq.py`:
@@ -138,9 +145,10 @@ def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
    # Unknown dst_addr — diagnostic log (should not happen under correct install)
 ```
-Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정.
+The sender-coordinate check is **removed**. `dst_addr` already determines
 the direction.
-### D3. Credit — `dst_rx_base_pa` 필드 추가
+### D3. Credit — add `dst_rx_base_pa` field
 `src/kernbench/common/ipcq_types.py`:
@@ -148,25 +156,26 @@ Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정.
@dataclass(frozen=True)
 class IpcqCreditMetadata:
    consumer_seq: int
-    dst_rx_base_pa: int       # NEW: 원 sender의 peer.rx_base_pa와 매칭용
+    dst_rx_base_pa: int       # NEW: matches the original sender's peer.rx_base_pa
-    # 기존 필드 (diagnostic / log 용도로 유지)
+    # Existing fields (kept for diagnostic / logging purposes)
    src_sip: int
    src_cube: int
    src_pe: int
    src_direction: str
 ```
-Credit 생성 시 (`_delayed_credit_send`): 자기 direction의 `my_rx_base_pa`를
+When the credit is generated (`_delayed_credit_send`), it carries this
-`dst_rx_base_pa`로 실어 보냄 (이게 상대방이 sender 당시 썼던 `peer.rx_base_pa`).
+direction's `my_rx_base_pa` as `dst_rx_base_pa` (this is the
 `peer.rx_base_pa` the other side used when it was the sender).
-수신 측 (`_credit_worker`):
+Receiver side (`_credit_worker`):
 ```python
 def _credit_worker(self, env):
    while True:
        credit = yield self._credit_inbox.get()
        for d, qp in self._queue_pairs.items():
-            # peer의 rx_base_pa와 credit의 dst_rx_base_pa가 일치하는 qp 찾기
+            # Find the qp whose peer rx_base_pa matches the credit's dst_rx_base_pa
            if qp["peer"].rx_base_pa == credit.dst_rx_base_pa:
                qp["peer_tail_cache"] = max(qp["peer_tail_cache"],
                                              credit.consumer_seq)
@@ -178,41 +187,45 @@ def _credit_worker(self, env):
                break
 ```
-Sender 좌표 검사 제거. `dst_rx_base_pa` 매칭으로 unambiguous.
+Sender-coordinate check removed. Matching by `dst_rx_base_pa` is
 unambiguous.
-### D4. `IpcqInitEntry`에 `peer_direction` 필드를 **추가하지 않음**
+### D4. Do **not** add a `peer_direction` field to `IpcqInitEntry`
-ADR-0025 rev 1에서 제안했던 `IpcqInitEntry.peer_direction`은 **불필요**.
+The `IpcqInitEntry.peer_direction` proposed in ADR-0025 rev 1 is
-이유:
+**unnecessary**. Reasons:
- Meta arrival은 dst_addr로 매칭 (D2)
+- Meta arrivals are matched by dst_addr (D2)
- Credit은 dst_rx_base_pa로 매칭 (D3)
+- Credits are matched by dst_rx_base_pa (D3)
- qp에 peer_direction 저장 필요 없음
+- No need to store peer_direction on qp
- Install은 rx_base_pa 계산 시 내부적으로만 peer_dir 사용 (`reverse_direction`)
+- Install only uses peer_dir internally when computing rx_base_pa
  (`reverse_direction`)
-IpcqInitEntry schema 변경 없음. Rev 1 대비 **단순화**.
+No change to the IpcqInitEntry schema. **Simpler** than rev 1.
-### D5. `IpcqDmaToken.src_direction` 유지 (diagnostic only)
+### D5. Keep `IpcqDmaToken.src_direction` (diagnostic only)
-기존 `src_direction` 필드는 제거하지 않는다. 다음 용도로 유지:
+The existing `src_direction` field is not removed. It is retained for:
- Logging / trace: `KERNBENCH_CCL_TRACE=1` 출력의 `(rank, t, dir, nbytes)`
+- Logging / trace: the `(rank, t, dir, nbytes)` output of
- Diagnostics: pointer_dump 등에서 direction 표시
+  `KERNBENCH_CCL_TRACE=1`
- 미래 확장 여지
+- Diagnostics: showing direction in pointer_dump, etc.
 - Room for future extension
-Runtime matching은 `dst_addr`만 사용.
+Runtime matching uses only `dst_addr`.
-### D6. Invariants (ADR-0023 I3 강화)
+### D6. Invariants (strengthens ADR-0023 I3)
-**I3 (엄격)**: 각 방향 pair `(my_direction, peer_direction)`에 대해 my
+**I3 (strict)**: For each direction pair `(my_direction, peer_direction)`,
-rx_base와 peer rx_base는 **별개의 direction slot**을 가리켜야 함. Install은
+my rx_base and peer rx_base must point to **distinct direction slots**.
-이를 보장해야 한다 (reverse_direction opposite-preference).
+Install must guarantee this (reverse_direction opposite-preference).
-**I3.1 (신규)**: 모든 qp에 대해 `qp["my_rx_base_pa"]`와 `qp["peer"].rx_base_pa`는
+**I3.1 (new)**: For every qp, `qp["my_rx_base_pa"]` and
-서로 disjoint한 주소 range를 점유한다 (다른 direction의 buffer는 절대 겹치지
+`qp["peer"].rx_base_pa` occupy mutually disjoint address ranges (buffers
-않음). 이것이 D2/D3의 주소-기반 매칭의 전제.
+of different directions never overlap). This is the prerequisite for the
 address-based matching of D2/D3.
-Install time에 검증 가능:
+Verifiable at install time:
 ```python
-# ccl/install_plan.py: build_install_plans 끝에 assertion
+# ccl/install_plan.py: assertion at the end of build_install_plans
 all_rx_ranges = set()
 for plan in plans:
    for pe_install in plan.pe_installs:
@@ -228,36 +241,42 @@ for plan in plans:
 ## Dependencies
- **ADR-0023** (IPCQ protocol): 본 ADR은 ADR-0023의 runtime 매칭 로직 수정
+- **ADR-0023** (IPCQ protocol): this ADR modifies ADR-0023's runtime
-  (D2, D3) + install heuristic 개선 (D1). IPCQ 프로토콜의 semantic layer
+  matching logic (D2, D3) and improves the install heuristic (D1). No
-  변경은 없음.
+  change to the IPCQ protocol's semantic layer.
- **ADR-0024** (launcher): 2-rank bidirectional ring이 실제 쓰이는 경우가
+- **ADR-0024** (launcher): a 2-rank bidirectional ring actually occurs in
-  ADR-0024의 ws=SIP_count 모델. 본 ADR이 그 케이스를 작동시킴.
+  ADR-0024's ws=SIP_count model. This ADR makes that
- **ADR-0030** (PhysAddr transition, stub): **독립적** — ADR-0025의
+  case work.
-  주소-기반 매칭은 현재 synthetic 주소든 PhysAddr이든 동일하게 작동.
+- **ADR-0030** (PhysAddr transition, stub): **independent** — ADR-0025's
  address-based matching works identically whether the current addresses
  are synthetic or PhysAddr.
 ---
 ## Non-goals
- **IPCQ 주소 체계를 PhysAddr로 전환**: ADR-0030 scope. 본 ADR은 주소가 어떻게
+- **Migrating IPCQ addressing to PhysAddr**: ADR-0030 scope. This ADR is
-  인코딩되는가와 무관.
+  agnostic to how addresses are encoded.
- **Multi-hop routing**: ADR-0023 D5의 single-hop DMA write 전제 유지.
+- **Multi-hop routing**: the single-hop DMA write assumption of ADR-0023
- **Unidir ring 특수화**: `ring_1d_unidir`는 direction 하나만 있으므로 본 버그
+  D5 still holds.
-  무관.
+- **Unidir ring specialization**: `ring_1d_unidir` only has a single
  direction, so the bug does not apply.
 ---
 ## Open questions
- **주소 매칭 성능**: `_handle_meta_arrival`과 `_credit_worker`가 qp를 선형
+- **Address-matching performance**: `_handle_meta_arrival` and
-  순회 (max 4 direction). 성능 영향 무시 가능 수준. 문제 시 dict lookup으로
+  `_credit_worker` iterate qp linearly (max 4 directions). The performance
-  전환 가능 (`_qp_by_rx_base`).
+  impact is negligible. If it becomes an issue, this can be switched to a
- **`IpcqDmaToken.src_direction` 필요성 재평가**: diagnostic 용도로만 남긴
+  dict lookup (`_qp_by_rx_base`).
-  필드를 계속 유지할지, 또는 logging 외부로 분리할지. 현재는 유지.
+- **Re-evaluating the need for `IpcqDmaToken.src_direction`**: whether to
- **Install-time invariant 검증 cost**: D6의 I3.1 검증은 O(N_PE × N_direction)^2.
+  keep this diagnostics-only field on the token or move it out to a
-  대형 topology에서 느려질 수 있음 → interval tree 등 자료구조로 개선 가능.
+  separate logging path. Currently retained.
-  단순 구현 먼저.
+- **Cost of install-time invariant verification**: the I3.1 verification
  of D6 is O((N_PE × N_direction)^2). It could be slow on large topologies
  → improvable via data structures such as interval trees. Simple
  implementation first.
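
As one possible middle ground for the I3.1 check, the ranges can be sorted by start address and only neighbours compared, which is O(n log n) rather than pairwise. This is a sort-and-scan sketch, not an interval tree and not the code in `install_plan.py`; the `(start, size)` tuple shape and the name `find_overlap` are assumptions for the example.

```python
def find_overlap(ranges):
    """Return the first overlapping pair of (start, size) ranges, or None.

    After sorting by start address, any overlap in the set must show up
    between two neighbouring entries, so one linear scan is enough.
    Sizes are assumed to be positive.
    """
    ordered = sorted(ranges)
    for (a_start, a_size), (b_start, b_size) in zip(ordered, ordered[1:]):
        if a_start + a_size > b_start:
            return (a_start, a_size), (b_start, b_size)
    return None
```

An install-time assertion could then read `assert find_overlap(all_rx_ranges) is None`, assuming the collected ranges are stored as `(start, size)` pairs.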
 ---
@@ -265,19 +284,26 @@ for plan in plans:
 ### Positive
- **단순함**: `peer_direction` 이중 메타데이터 제거. 주소가 single source of truth.
+- **Simplicity**: redundant `peer_direction` metadata removed. Address is
- **Unambiguous matching**: 모든 topology (direction 중복 포함)에서 동작.
+  the single source of truth.
- **Schema 변경 최소**: `IpcqInitEntry` 불변, `IpcqCreditMetadata`에 1 필드 추가.
+- **Unambiguous matching**: works on every topology (including duplicate
- **PhysAddr 전환 (ADR-0030) 독립**: 주소-기반 매칭은 주소 인코딩 방식과 무관.
+  directions).
- **Diagnostic 유지**: `IpcqDmaToken.src_direction`은 로깅 용도로 존치.
+- **Minimal schema changes**: `IpcqInitEntry` unchanged, one field added
  to `IpcqCreditMetadata`.
 - **Independent of PhysAddr transition (ADR-0030)**: address-based matching
  is agnostic to the address encoding.
 - **Diagnostics retained**: `IpcqDmaToken.src_direction` is kept for
  logging.
 ### Negative
- Runtime 매칭이 주소 비교로 바뀌어서 디버깅 시 "왜 peer_head_cache[E]가 아닌
+- Runtime matching is now by address comparison, so when debugging
-  W가 업데이트됐나" 같은 질문에 address range를 추적해야 함 (기존엔 direction
+  questions like "why did peer_head_cache[W] update rather than [E]" one
-  이름으로 충분). 해결: pointer_dump에 "direction ↔ rx_base_pa" 매핑 포함.
+  has to follow the address range (previously the direction name was
  enough). Mitigation: include a "direction ↔ rx_base_pa" mapping in
  pointer_dump.
 ### Neutral
- IPCQ protocol의 semantic layer (sender가 dst_addr 계산, receiver가 수신)는
+- The semantic layer of the IPCQ protocol (sender computes dst_addr,
-  불변.
+  receiver receives) is unchanged.
@@ -1,4 +1,4 @@
-# ADR-0026: DPPolicy = Intra-Device Only — sip/num_sips 필드 제거
+# ADR-0026: DPPolicy = Intra-Device Only — remove sip/num_sips fields
 ## Status
@@ -6,16 +6,17 @@ Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)
 ## Context
-### 목표
+### Goal
-`DPPolicy`를 **한 device(SIP) 내부의 cube × PE 분산**만 표현하는 순수한
+Clarify `DPPolicy` as a pure intra-device abstraction that only expresses
-intra-device 추상화로 명확화한다. SIP 간 분산(TP)은 별도 레이어로 분리
+**cube × PE distribution within a single device (SIP)**. Inter-SIP
-(ADR-0024의 `torch.ahbm.set_device(rank)` 또는 ADR-0027의 Megatron parallel
+distribution (TP) is split into a separate layer (handled by ADR-0024's
-layers가 담당).
+`torch.ahbm.set_device(rank)` or by ADR-0027's Megatron-style parallel
 layers).
 ## Decision
-### D1. `DPPolicy`에서 `sip` + `num_sips` 필드 제거
+### D1. Remove `sip` + `num_sips` fields from `DPPolicy`
 ```python
@dataclass(frozen=True)
@@ -32,15 +33,16 @@ class DPPolicy:
    num_cubes: int | None = None
 ```
-제거되는 필드: `sip`, `num_sips`.
+Removed fields: `sip`, `num_sips`.
-### D2. `ShardSpec` — structural (sip, cube, pe) 좌표, `pe_index` 완전 제거
+### D2. `ShardSpec` — structural (sip, cube, pe) coordinates, `pe_index` fully removed
-현재 `ShardSpec.pe_index`는 **global flat index** (`sip × cubes × pes + cube ×
+The current `ShardSpec.pe_index` is a **global flat index**
-pes + pe`). 이는 ADR-0024 D4이 "abstraction leakage"로 지적한 형태.
+(`sip × cubes × pes + cube × pes + pe`). This is the form ADR-0024 D4
 flagged as "abstraction leakage".
-본 ADR에서 ShardSpec을 **structural 좌표로 재정의**하고, `pe_index`는
+This ADR **redefines ShardSpec in structural coordinates** and **does
-property로도 **남기지 않는다**:
+not even leave `pe_index` as a property**:
 ```python
 # src/kernbench/policy/placement/dp.py (after)
@@ -59,28 +61,32 @@ class ShardSpec:
    nbytes: int
 ```
-**핵심 원칙**:
+**Core principle**:
- ShardSpec의 정체성은 `(sip, cube, pe)` 3튜플.
+- The identity of ShardSpec is the `(sip, cube, pe)` 3-tuple.
- **`pe_index` property도 없음** — silent semantics drift 차단.
+- **No `pe_index` property either** — blocks silent semantics drift.
- Global flat을 기대한 기존 호출자는 `.pe_index` 접근 시 **즉시
+- Existing callers expecting a global flat index get an **immediate
-  `AttributeError`** → 반드시 구조적 좌표로 migration.
+  `AttributeError`** on `.pe_index` access → forced migration to
- Flat integer key가 필요한 국소 문맥 (예: 내부 dict lookup)은 호출자가
+  structural coordinates.
-  명시적으로 `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe`를 계산.
+- Local contexts that genuinely need a flat integer key (e.g. internal
  dict lookup) explicitly compute
  `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe` at the call
  site.
-**Property 제거 정당화**: KernBench는 사내 프로젝트로 call site가 한정되어
+**Justification for removing the property**: KernBench is an internal
-있음. Silent drift 위험 (의미만 바뀌고 타입은 같은 int) 대비 explicit breakage
+project with a limited number of call sites. Explicit breakage
-(AttributeError)가 훨씬 안전.
+(AttributeError) is much safer than the risk of silent drift (semantics
 change while the type stays int).
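
The explicit-breakage behaviour can be illustrated with a reduced stand-in for `ShardSpec`. The class below is hypothetical and keeps only the fields needed for the example (the real dataclass has more), and `flat_key` is an illustrative helper name, not an existing function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardSpec:
    """Reduced stand-in: structural coordinates only, no pe_index."""
    sip: int
    cube: int
    pe: int
    nbytes: int


def flat_key(spec: ShardSpec, n_cubes: int, n_pe: int) -> int:
    """Explicit flat key for the rare call site that genuinely needs one."""
    return spec.sip * n_cubes * n_pe + spec.cube * n_pe + spec.pe


spec = ShardSpec(sip=1, cube=2, pe=3, nbytes=64)

try:
    spec.pe_index  # no such attribute: fails loudly instead of drifting silently
    accessed = True
except AttributeError:
    accessed = False
```

The failure happens at the access itself, so a stale caller cannot keep running with an integer that now means something different.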
-### D3. `resolve_dp_policy`가 `target_sip`을 받아 structural 좌표 생성
+### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates
-ADR-0024 D4의 계약 구현. Post-hoc shifting 없음.
+Implements the contract of ADR-0024 D4. No post-hoc shifting.
 ```python
 # src/kernbench/policy/placement/dp.py (after)
@dataclass(frozen=True)
 class _LocalPeShard:
-    """Internal — PE resolver의 반환. Cube 내 local PE 식별자 + payload."""
+    """Internal — return value of the PE resolver. Cube-local PE id + payload."""
    local_pe: int                  # cube-local PE index (0..num_pe-1)
    offset_bytes: int
    nbytes: int
@@ -93,7 +99,7 @@ def resolve_dp_policy(
    itemsize: int,
    num_pe: int,
    num_cubes: int = 1,
-    target_sip: int,       # NEW — 어느 SIP에 배치할지 명시
+    target_sip: int,       # NEW — explicitly state which SIP to place on
 ) -> list[ShardSpec]:
    """2-level resolution (cube × PE) on a specified SIP.
@@ -123,28 +129,30 @@ def resolve_dp_policy(
    return all_shards
 ```
-**내부 resolver** (`column_wise`, `row_wise`, `replicate`)는 `_LocalPeShard`
+**Internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a
-리스트 반환 — `local_pe` 필드명으로 **"cube-local PE identifier"임이 명시적**.
+list of `_LocalPeShard` — the `local_pe` field name makes it **explicit
-과거 `ShardSpec.pe_index`와 이름이 혼동되던 문제 해소.
+that this is a "cube-local PE identifier"**. This resolves the previous
 confusion with the name `ShardSpec.pe_index`.
-**이름 규약 정리** (전체 ADR):
+**Naming convention summary** (whole ADR):
- `ShardSpec.pe`: 최종 외부 API — cube-local PE (structural coord)
+- `ShardSpec.pe`: the final external API — cube-local PE (structural coord)
- `_LocalPeShard.local_pe`: 내부 resolver 단계의 동일 의미
+- `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage
- `pe_index`: **제거**. 외부/내부 어디에도 남기지 않는다 (silent drift 차단의
+- `pe_index`: **removed**. Not retained anywhere, internal or external
-  부가 효과: 이름 재등장 없음).
+  (additional benefit of preventing silent drift: the name does not
  reappear).
-### D4. `_create_tensor` — 구조적 좌표로 직접 placement
+### D4. `_create_tensor` — placement directly in structural coordinates
-ADR-0024 D4 연속선. Post-hoc shifting 제거, 구조적 좌표를 `resolve_dp_policy`
+Continuation of ADR-0024 D4. Post-hoc shifting removed; structural
-호출 시점에 직접 지정.
+coordinates are specified directly at the `resolve_dp_policy` call site.
 ```python
 # context.py _create_tensor (after)
 current_sip = self.ahbm.current_device()
 if current_sip is None:
-    # Single-driver fallback (ADR-0024 D2와 일관).
+    # Single-driver fallback (consistent with ADR-0024 D2).
-    # Launcher 기반 코드가 set_device()를 빼먹으면 조용히 SIP 0에 박히는
+    # In launcher-based code, forgetting set_device() silently sticks the
-    # 문제가 있음 → debug mode에서 경고.
+    # tensor on SIP 0 — emit a warning in debug mode.
    if os.environ.get("KERNBENCH_DEBUG"):
        import warnings
        warnings.warn(
@@ -161,38 +169,39 @@ placement = resolve_dp_policy(
    itemsize=itemsize,
    num_pe=eff_num_pe,
    num_cubes=eff_num_cubes,
-    target_sip=current_sip,          # ← 구조적 좌표 일차 지정
+    target_sip=current_sip,          # ← structural coord specified up front
 )
-# placement의 각 ShardSpec은 이미 (sip=current_sip, cube=local, pe=local) 포함.
+# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
-# 과거의 post-hoc shifting 블록은 완전히 제거.
+# The old post-hoc shifting block is removed entirely.
 ```
-**모든** 텐서가 current device SIP에 배치됨. Multi-SIP 텐서를 만들고 싶으면
+**Every** tensor is placed on the current device's SIP. If you need a
-ADR-0027의 TP primitive 사용.
+multi-SIP tensor, use the TP primitive of ADR-0027.
-**Single-driver fallback의 trade-off**: set_device 없는 호출에서 SIP 0으로
+**Trade-off of the single-driver fallback**: When set_device is not
-default는 기존 single-driver 테스트 호환을 위해 유지. `KERNBENCH_DEBUG=1`
+called, defaulting to SIP 0 is kept for compatibility with existing
-환경에서는 launcher 컨텍스트의 실수로 set_device 누락 시 조용히 잘못된 SIP에
+single-driver tests. With `KERNBENCH_DEBUG=1`, a warning is emitted so
-배치되는 것을 감지할 수 있도록 warning.
+that accidentally omitting set_device in a launcher context — which would
 silently place the tensor on the wrong SIP — can be detected.
-### D5. Downstream — allocator lookup은 구조적 tuple key로
+### D5. Downstream — allocator lookup by structural tuple key
-기존 `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
+Existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
 ```python
 for spec in placement:
-    alloc = allocators[spec.pe_index]       # ← AttributeError (property 제거됨)
+    alloc = allocators[spec.pe_index]       # ← AttributeError (property removed)
 ```
-`pe_index`가 없어졌으므로 구조적 좌표로 **강제** migration:
+With `pe_index` gone, migration to structural coordinates is **forced**:
 ```python
 for spec in placement:
    alloc = allocators[(spec.sip, spec.cube, spec.pe)]
 ```
-`_ensure_allocators`의 dict population도 tuple key로:
+The dict population in `_ensure_allocators` is also tuple-keyed:
 ```python
 # context.py _ensure_allocators (after)
@@ -204,59 +213,71 @@ for sip_id in sip_range:
            )
 ```
-`_free_tensor`도 동일: 기존 `flat_idx = sip * ... + cube * ... + pe` 계산
+`_free_tensor` is the same: the old
-블록 제거, `(shard.sip, shard.cube, shard.pe)` 직접 사용.
+`flat_idx = sip * ... + cube * ... + pe` computation block is removed,
 and `(shard.sip, shard.cube, shard.pe)` is used directly.
-**Tuple vs dataclass `PEIdentity`**: Tuple이 단순하고 hashable로 바로 써서
+**Tuple vs dataclass `PEIdentity`**: The tuple is recommended because it is simple
-권고. `PEIdentity` 값객체는 명시적 타입 장점은 있지만 boilerplate가 크고 현재
+and hashable out of the box. A `PEIdentity` value object would add
-allocator dict의 유일한 key라 오버엔지니어링. Tuple 유지.
+an explicit type, but it needs more boilerplate and would currently be used
 only as the allocator dict key, so it would be over-engineering.
 Keep the tuple.
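
A tuple-keyed allocator table might look like the sketch below. It is an assumption-laden illustration rather than the body of `_ensure_allocators`: `build_allocators` and the `make_allocator` callback are hypothetical names, and a string stands in for a real allocator object.

```python
from itertools import product


def build_allocators(num_sips, num_cubes, num_pe, make_allocator):
    """Map every structural (sip, cube, pe) coordinate to its own allocator."""
    return {
        (sip, cube, pe): make_allocator(sip, cube, pe)
        for sip, cube, pe in product(
            range(num_sips), range(num_cubes), range(num_pe)
        )
    }


# Stub allocator factory, used only to make the example observable.
allocators = build_allocators(2, 2, 4, lambda s, c, p: f"alloc[{s}.{c}.{p}]")
```

A lookup then uses the shard's coordinates directly, for example `allocators[(spec.sip, spec.cube, spec.pe)]`, with no flat-index arithmetic at the call site.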
-### D7. 하위 호환 — 불가 (cleanup ADR)
+### D7. Backward compatibility — none (cleanup ADR)
-이 ADR은 **breaking change**.
+This ADR is a **breaking change**.
-1. `DPPolicy(sip=...)` 또는 `DPPolicy(num_sips=...)` 호출 → `TypeError`
+1. `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)` → `TypeError`
-2. `ShardSpec.pe_index` 접근 → `AttributeError`
+2. `ShardSpec.pe_index` access → `AttributeError`
-모두 **즉시 명시적 breakage**. Deprecation warning / fallback 경로 없음.
+Both are **immediate, explicit breakage**. No deprecation warning /
-KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에 migration.
+fallback path. KernBench is an internal project with a bounded set of
 call sites, so migration happens in one pass.
-**Silent drift 차단**이 property 완전 제거의 주된 이점: global flat을 기대한
+**Blocking silent drift** is the main upside of fully removing the
-코드가 SIP-local 결과를 받아 조용히 잘못된 인덱싱을 할 가능성 제거.
+property: code that expected a global flat could otherwise silently
 receive a SIP-local result and index incorrectly — that possibility is
 eliminated.
 ## Dependencies
- **ADR-0024** (launcher): `set_device(rank)` 및 current-device scoping이
+- **ADR-0024** (launcher): `set_device(rank)` and current-device scoping
-  SIP 배치 메커니즘 제공. 본 ADR은 그 위에 서서 DPPolicy를 순수 intra-device로
+  provide the SIP placement mechanism. This ADR sits on top and narrows
-  좁힘.
+  DPPolicy to pure intra-device.
- **ADR-0027** (Megatron TP): 다중 SIP에 걸친 텐서가 필요한 경우의 대안 경로.
+- **ADR-0027** (Megatron TP): the alternative path when a tensor spans
-  이 ADR 적용 후 multi-SIP use case는 ADR-0027로 이관.
+  multiple SIPs. After this ADR is applied, multi-SIP use cases move to
  ADR-0027.
 ---
 ## Non-goals
- **`DPPolicy.cube` / `pe` 재설계**: 기존 replicate/column_wise/row_wise 의미
+- **Redesign of `DPPolicy.cube` / `pe`**: existing
-  유지.
+  replicate/column_wise/row_wise semantics are kept.
- **Tiling 정책 통합**: `tiled_column_major` / `tiled_row_major`는 그대로.
+- **Tiling policy consolidation**: `tiled_column_major` /
- **Multi-device 텐서 추상화 신규**: DTensor-like는 ADR-0028.
+  `tiled_row_major` stay as they are.
 - **New multi-device tensor abstraction**: a DTensor-like is ADR-0028.
 ---
 ## Open questions
- **`_create_tensor`의 current_sip 기본값**: set_device 없는 호출에서 rank=0
+- **Default value of current_sip in `_create_tensor`**: for calls without
-  (SIP 0)로 fallback할지, 아니면 error 낼지. 권고는 fallback (기존 single-driver
+  set_device, whether to fall back to rank=0 (SIP 0) or to raise an
-  테스트와의 호환).
+  error. The recommendation is fallback (compatibility with existing
- **`test_sip_parallel.py` 재작성 범위**: 기존 단위 테스트의 의도를 유지하며
+  single-driver tests).
-  launcher 기반으로 옮기려면 추가 fixture 필요. 별도 작업으로 scope.
+- **Scope of `test_sip_parallel.py` rewrite**: porting the existing unit
- **`DPPolicy`의 `num_sips=None` 의미**: 필드가 없어지면 `num_sips` 개념 자체가
+  tests to a launcher-based setup while preserving their intent requires
-  사라짐. Multi-SIP을 표현하고 싶으면 ADR-0027의 TP primitive를 쓰라는 것이
+  additional fixtures. Scoped as separate work.
-  명시적 답.
+- **Meaning of `num_sips=None` on `DPPolicy`**: once the field is gone,
  the concept of `num_sips` disappears entirely. The explicit answer for
  expressing multi-SIP is to use the TP primitive of ADR-0027.
-**Resolved (이전 rev에서 open이었던 것들)**:
+**Resolved (items that were open in earlier revs)**:
- ~~`ShardSpec.pe_index` property 존치 여부~~ → **완전 제거** (D2)
+- ~~Whether to keep the `ShardSpec.pe_index` property~~ → **fully
- ~~`_ensure_allocators` dict key 형식~~ → **tuple `(sip, cube, pe)`** (D5)
+  removed** (D2)
 - ~~Form of `_ensure_allocators` dict key~~ → **tuple `(sip, cube, pe)`**
  (D5)
 ---
@@ -264,25 +285,31 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에
 ### Positive
- **개념 분리 명확**: DPPolicy = intra-device, TP = inter-device.
+- **Clean conceptual separation**: DPPolicy = intra-device, TP =
- **API 단순화**: DPPolicy 생성자 필드 ~33% 축소.
+  inter-device.
- **Structural 좌표 일관성**: ShardSpec이 `(sip, cube, pe)` 튜플로 표현 →
+- **API simplification**: about a 33% reduction in DPPolicy constructor
-  abstraction leakage 해소 (ADR-0024 D4 계약 충족).
+  fields.
- **`pe_index` 의미 명확**: SIP-local이 단일 해석. Global flat이 필요하면 명시.
+- **Structural-coordinate consistency**: ShardSpec is expressed as a
- **Launcher 모델 일관성**: ADR-0024의 "1 worker per SIP" 모델이 유일한 SIP
+  `(sip, cube, pe)` tuple → abstraction leakage resolved (the ADR-0024
-  경계 제어 메커니즘.
+  D4 contract is satisfied).
 - **Clear meaning of `pe_index`**: the single interpretation is
  SIP-local. If global-flat is needed, it must be made explicit.
 - **Launcher-model consistency**: ADR-0024's "1 worker per SIP" model is
  the sole SIP-boundary control mechanism.
 ### Negative
 - **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`,
-  `spec.pe_index` → `AttributeError`. 모든 호출자 한 번에 수정 필요.
+  `spec.pe_index` → `AttributeError`. All callers need to be fixed at
- **ShardSpec schema 변경**: `pe_index` 단일 필드 → `sip`/`cube`/`pe` 세 필드.
+  once.
-  Downstream (`deploy_tensor`, `_free_tensor`, `_ensure_allocators`,
+- **ShardSpec schema change**: a single `pe_index` field becomes three
-  `allocators` dict key 등) 연쇄 수정.
+  fields `sip`/`cube`/`pe`. Cascading edits downstream (`deploy_tensor`,
- **Silent drift 없음**: property 완전 제거로 runtime에서 즉시 실패 →
+  `_free_tensor`, `_ensure_allocators`, `allocators` dict key, etc.).
-  migration leakage 원천 차단. (Negative가 아니라 explicit tradeoff)
+- **No silent drift**: with the property fully removed, runtime failure
- `test_sip_parallel.py` 재작성 비용.
+  is immediate → migration leakage is blocked at the source. (Not a
  negative but an explicit tradeoff.)
 - The cost of rewriting `test_sip_parallel.py`.
 ### Neutral
- 기존 `cube` / `pe` 필드 의미 불변.
+- The meaning of the existing `cube` / `pe` fields is unchanged.
@@ -92,6 +92,18 @@ def test_crlf_normalization(tmp_path: Path) -> None:
    assert v.verify(tmp_path) == []
 def test_em_dash_title_separator_recognized(tmp_path: Path) -> None:
    """ADR-0033 uses ' — ' instead of ': ' between ADR-NNNN and the title."""
    en = tmp_path / "docs/adr/ADR-0033-foo-bar.md"
    ko = tmp_path / "docs/adr-ko/ADR-0033-foo-bar.md"
    en.parent.mkdir(parents=True, exist_ok=True)
    ko.parent.mkdir(parents=True, exist_ok=True)
    body = "## Status\n\nAccepted\n\n## Context\n\nbody\n"
    en.write_text("# ADR-0033 — Latency Model\n\n" + body, encoding="utf-8")
    ko.write_text("# ADR-0033 — Latency Model\n\n" + body, encoding="utf-8")
    assert v.verify(tmp_path) == []
 def test_underscore_in_slug_recognized(tmp_path: Path) -> None:
    """ADR-0013 uses an underscore in its slug; the regex must accept it."""
    _make_adr(tmp_path / "docs/adr/ADR-0013-ver-verification_strategy.md", "0013")
@@ -24,7 +24,7 @@ import sys
 from pathlib import Path
 ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-[a-z0-9_-]+\.md$")
-TITLE_RE = re.compile(r"^# ADR-(\d{4}):")
+TITLE_RE = re.compile(r"^# ADR-(\d{4})\b")
 def _normalize(text: str) -> str: