diff --git a/CLAUDE.md b/CLAUDE.md
index afc2ab4..f48675c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -202,8 +202,8 @@ General fallbacks. Apply to anything not explicitly covered above.
 >
 > Contains **foundations** (Authority & Scope → Terminology → Terminology
 > Discipline → Mental Model → Common Failure Modes) followed by **rules**
-> (Non-Trivial, Verification Plan, CLI, Derived Artifacts, runtime API /
-> sim_engine Boundaries).
+> (Non-Trivial, Verification Plan, CLI, Derived Artifacts, ADR Translation
+> Discipline, runtime API / sim_engine Boundaries).
 
 ## Authority & Scope
 
@@ -218,14 +218,22 @@ General fallbacks. Apply to anything not explicitly covered above.
 
 ### ADR Lifecycle
 
-ADRs live in one of three folders based on lifecycle state:
+ADRs live in one of four folders. Three are selected by lifecycle state
+(`docs/adr/` is canonical English); the fourth holds Korean translations:
 
-- `docs/adr/` — **Accepted** (current implementation reflected).
+- `docs/adr/` — **Accepted** (canonical English; current
+  implementation reflected).
 - `docs/adr-proposed/` — **Proposed**, **Stub**, or **Draft** (design
   only / future-work exploration / retroactive documentation pending
-  verification).
+  verification). **Authoring language is free** (any language); the
+  promotion step (below) translates to English.
 - `docs/adr-history/` — **Superseded** or **Merged** (no longer the
-  authoritative source; kept as historical record).
+  authoritative source; kept as historical record). Frozen — language
+  policy not applied retroactively.
+- `docs/adr-ko/` — Korean translations of accepted ADRs (derived
+  artifact, 1:1 mirror of `docs/adr/`). English in `docs/adr/` is the
+  canonical source of truth; when KO and EN disagree, EN wins. See
+  *ADR Translation Discipline* below.
 
 Status field values:
 
@@ -240,17 +248,23 @@ Status field values:
 Transitions:
 
 - **Proposed/Stub → Accepted**: when the ADR's decisions are
-  reflected in production code AND covered by tests. `git mv` from
-  `docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`.
+  reflected in production code AND covered by tests. If the proposed
+  ADR is in Korean, translate to English and place the English in
+  `docs/adr/`; move the Korean original to `docs/adr-ko/`. If the
+  proposed ADR is in English, `git mv` it to `docs/adr/` and create
+  the Korean translation in `docs/adr-ko/`. Change Status to
+  `Accepted` in both files.
 - **Draft → Accepted**: when the ADR's text has been verified to
-  accurately describe the existing implementation. `git mv` from
-  `docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`.
+  accurately describe the existing implementation. Same English /
+  Korean placement rule as above.
 - **Accepted → Superseded**: set Status to `Superseded by ADR-MMMM`
-  and `git mv` to `docs/adr-history/`. The superseding ADR includes
-  a "Supersedes ADR-NNNN" reference (or, for partial supersession of
-  clauses, documents this in its own body).
+  in both the EN and KO files and `git mv` both to their respective
+  history locations (`docs/adr-history/` for English; the KO copy
+  stays in `docs/adr-ko/` only if it was already mirrored — see *ADR
+  Translation Discipline* for the frozen-history exception).
 - **Accepted → Merged**: set Status to `Merged into ADR-MMMM`
-  (single-line stub) and `git mv` to `docs/adr-history/`.
+  (single-line stub) in both files and apply the same `git mv` rule
+  as the Superseded transition.
 
 Cross-references between ADRs use the `ADR-NNNN` ID and remain valid
 regardless of folder location. ADR numbers are **immutable**; never
@@ -361,11 +375,48 @@ Concrete forms that Part 1's *Verification Plan* MUST take in this repo:
 ## Derived Artifacts (Clarification)
 
 - Generated diagrams under `docs/diagrams/` are **derived artifacts**, not production code.
-- Creating or updating files in `docs/diagrams/`:
+- Korean ADR translations under `docs/adr-ko/` are **derived artifacts**
+  (mirror of the canonical English in `docs/adr/`); see *ADR Translation
+  Discipline*.
+- Creating or updating files in `docs/diagrams/` or `docs/adr-ko/`:
   - does NOT count as a production code change,
   - does NOT require Phase 2 approval,
   - MUST be consistent with SPEC.md and ADRs.
 
+## ADR Translation Discipline
+
+English in `docs/adr/` is the canonical source of truth. Korean in
+`docs/adr-ko/` mirrors it 1:1 as a derived artifact.
+
+**Bidirectional sync rule (MUST)**: any edit to a file in `docs/adr/`
+must be accompanied, in the same change, by a mirroring edit to
+`docs/adr-ko/<same-filename>.md`. The reverse also applies: edits to
+`docs/adr-ko/` must mirror back into `docs/adr/`. The two files must
+always describe the same architectural content.
+
+Mechanics:
+
+- When editing an EN ADR, propagate the change to its KO counterpart
+  by translating just the diff (preserve unaffected KO prose); do not
+  regenerate the whole KO file from scratch.
+- When editing a KO ADR, propagate to EN the same way.
+- Filename mirror: `docs/adr/X.md` ↔ `docs/adr-ko/X.md` (no language
+  suffix in either path).
+- The `## Status` block content must remain byte-identical between
+  the EN and KO files (e.g., both say `Accepted`).
+- Conflict policy: if the two diverge despite the rule, treat EN as
+  authoritative and overwrite KO. Surface the divergence to the user
+  before reconciling.
+- `docs/adr-proposed/` is exempt — single language only, no mirror
+  required until promotion.
+- `docs/adr-history/` is frozen — pre-existing mixed-language state
+  there is not migrated.
+
+Verification: `python tools/verify_adr_lang_pairs.py` checks that
+every EN ADR has a matching KO file, the title's ADR-NNNN matches the
+filename, and Status blocks are byte-equal. Run it on demand or wire
+it into CI. Exit code: 0 = OK, 1 = mismatch.
+
 ## runtime API / sim_engine Boundaries
 
 - runtime API MUST NOT hardcode topology/routing or internal hop sequences.
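The pair check described under *ADR Translation Discipline* (matching KO file, title ID matching the filename, equal Status blocks) could look roughly like the sketch below. This is not the contents of `tools/verify_adr_lang_pairs.py`, which this diff does not show; the function names `status_block` and `check_pairs` and the problem-message wording are assumptions made for illustration.

```python
import re
from pathlib import Path


def status_block(text: str) -> str:
    """Return the lines between '## Status' and the next '## ' heading."""
    lines = text.splitlines()
    if "## Status" not in lines:
        return ""
    start = lines.index("## Status") + 1
    block = []
    for line in lines[start:]:
        if line.startswith("## "):
            break
        block.append(line)
    return "\n".join(block)


def check_pairs(en_dir: Path, ko_dir: Path) -> list:
    """Collect readable problems; an empty list means the mirror is consistent."""
    problems = []
    for en_file in sorted(en_dir.glob("ADR-*.md")):
        name = en_file.name
        ko_file = ko_dir / name  # same filename, no language suffix
        if not ko_file.exists():
            problems.append(f"{name}: missing KO mirror")
            continue
        match = re.match(r"ADR-\d{4}", name)
        texts = {
            "EN": en_file.read_text(encoding="utf-8"),
            "KO": ko_file.read_text(encoding="utf-8"),
        }
        for label, text in texts.items():
            title = text.splitlines()[0] if text else ""
            if match is None or not title.startswith(f"# {match.group(0)}:"):
                problems.append(f"{name}: {label} title does not match filename")
        if status_block(texts["EN"]) != status_block(texts["KO"]):
            problems.append(f"{name}: Status blocks differ")
    return problems
```

A caller could map the result onto the documented exit codes with `raise SystemExit(1 if check_pairs(Path("docs/adr"), Path("docs/adr-ko")) else 0)`.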
diff --git a/docs/adr-ko/ADR-0001-mem-physaddr-layout.md b/docs/adr-ko/ADR-0001-mem-physaddr-layout.md
new file mode 100644
index 0000000..5f33108
--- /dev/null
+++ b/docs/adr-ko/ADR-0001-mem-physaddr-layout.md
@@ -0,0 +1,362 @@
+# ADR-0001: 51-bit Physical Address Layout & Decoding Contract
+
+## Status
+
+Accepted (Revision 2 — 2026-04-27: concrete bit layout, rack_id removal,
+Tray->SIP / SIP->DIE renaming, PE/MCPU/IOCPU sub-unit tables.
+Supersedes ADR-0031.)
+
+## Date
+
+2026-04-27 (original: 2026-02-27)
+
+## Context
+
+KernBench requires a stable, parsable physical address scheme that:
+
+- can be decoded into routing domains (SIP / die / HBM / PE-resource / IOCPU)
+- remains topology-agnostic (no hardcoded counts)
+- supports swappable policy and DI-first components
+- covers multiple SIPs, AHBM dies, and IO chiplet dies in a unified space
+
+### History
+
+- Original ADR-0001 defined a 51-bit layout with `rack_id(4) + sip_id(4) +
+  sip_seg(5) + local_offset(38)`. `rack_id` was never used in practice.
+- ADR-0031 (stub) requested PE-resource range partition but was never
+  implemented.
+
+Revision 2 removes `rack_id`, renames `sip_seg -> die_id`, and provides
+concrete sub-unit tables for PE, MCPU, CUBE_SRAM, and IOCPU resources.
+ADR-0031 is superseded.
+
+## Decision
+
+We define a **PhysAddr value object** and an **address decoding contract**
+that converts an integer address into routing domains.
+
+### D1. PhysAddr is an immutable value object
+
+- PhysAddr is immutable and comparable as a pure value.
+- Any allocator returns a **fully specified PhysAddr** (not partial metadata).
+- No global state may be required to interpret a PhysAddr.
+
+### D2. 51-bit Physical Address Layout
+
+A 51-bit physical address is adopted.
+
+#### 2.1 Top-Level Address Map
+
+```text
+[50:47] sip_id        (4)     -- 16 SIPs
+[46:42] die_id        (5)     -- 32 dies per SIP
+[41: 0] local_offset  (42)    -- 4 TB per die
+```
+
+```text
+50      47 46      42 41                      0
++---------+----------+-------------------------+
+| sip_id  | die_id   |      local_offset       |
++---------+----------+-------------------------+
+```
+
+#### 2.2 die_id Allocation
+
+| die_id | Meaning |
+|--------|---------|
+| 0..15  | AHBM dies |
+| 16..20 | IOCHIPLET dies |
+| 21..31 | Reserved |
+
+#### 2.3 AHBM Die Layout
+
+Only the lower 256 GB (bits [37:0]) of the 4 TB die-local window is assigned.
+
+```text
+[41:38] MBZ            (4)
+[37]    addr_space      (1)    -- 0 = local resource, 1 = HBM memory
+[36: 0] sub-address    (37)
+```
+
+| addr_space | Meaning |
+|------------|---------|
+| 0 | Local resource |
+| 1 | HBM memory |
+
+##### 2.3.1 HBM Window (addr_space = 1)
+
+```text
+[36:0] hbm_offset     (37)    -- 128 GB decode window
+```
+
+The architectural decode window is fixed at 128 GB. Implemented capacity
+may be smaller depending on SKU/topology (see D4).
+
+##### 2.3.2 Resource Window (addr_space = 0)
+
+```text
+[36:34] resource_kind  (3)
+[33: 0] kind_local    (34)    -- 16 GB per kind
+```
+
+| resource_kind | Meaning |
+|---------------|---------|
+| 000 | PE_LOCAL |
+| 001 | MCPU_LOCAL |
+| 010 | CUBE_SRAM |
+| 011..111 | Reserved |
+
+Each kind gets a 16 GB decode region.
+
+##### 2.3.3 PE_LOCAL (resource_kind = 000)
+
+```text
+[33]    MBZ            (1)
+[32:29] pe_id          (4)     -- 0..15
+[28:25] pe_sub_unit    (4)
+[24: 0] sub_offset    (25)    -- 32 MB per slot
+```
+
+16 PEs x 16 sub-unit slots x 32 MB = 8 GB active decode.
+
+| pe_sub_unit | Name | Budget |
+|-------------|------|--------|
+| 0 | PE_CPU_DTCM | 8 KB |
+| 1 | MATH_ENGINE_DTCM | 8 KB |
+| 2 | IPCQ | 256 KB |
+| 3 | PE_CPU_SFR | 16 KB |
+| 4 | MATH_ENGINE_SFR | 16 KB |
+| 5 | DMA_ENGINE_SFR | 192 KB |
+| 6 | PE_TCM | 2 MB |
+| 7..15 | Reserved | -- |
+
+##### 2.3.4 MCPU_LOCAL (resource_kind = 001)
+
+```text
+[33:30] MBZ            (4)
+[29:25] mcpu_sub_unit  (5)
+[24: 0] sub_offset    (25)    -- 32 MB per slot
+```
+
+32 sub-unit slots x 32 MB = 1 GB active decode.
+
+| mcpu_sub_unit | Name | Budget |
+|---------------|------|--------|
+| 0 | MCPU_ITCM | 512 KB |
+| 1 | MCPU_DTCM | 512 KB |
+| 2 | IPCQ | 256 KB |
+| 3 | MCPU_SFR | 8 KB |
+| 4 | MCPU_DMA_SFR | 16 KB |
+| 5 | MCPU_SRAM | 10 MB |
+| 6..31 | Reserved | -- |
+
+##### 2.3.5 CUBE_SRAM (resource_kind = 010)
+
+```text
+[33:25] MBZ            (9)
+[24: 0] sram_offset   (25)    -- flat 32 MB
+```
+
+#### 2.4 IOCHIPLET Die Layout
+
+Only the lower 1 TB (bits [39:0]) of the 4 TB die-local window is assigned.
+
+```text
+[41:40] MBZ            (2)
+[39: 0] chiplet_offset (40)   -- 1 TB
+```
+
+Region split by address range:
+
+| Range | Meaning | Decode condition |
+|-------|---------|------------------|
+| [0, 2 GB) | IOCPU resource | chiplet_offset < 0x8000_0000 |
+| [2 GB, 1 TB) | UAL | chiplet_offset >= 0x8000_0000 |
+
+##### 2.4.1 IOCPU Region
+
+```text
+[30:27] iocpu_sub_unit (4)
+[26: 0] sub_offset    (27)    -- 128 MB per slot
+```
+
+16 x 128 MB slots. 2 GB active decode.
+
+| iocpu_sub_unit | Name | Budget |
+|----------------|------|--------|
+| 0 | IOCPU_ITCM | 512 KB |
+| 1 | IOCPU_DTCM | 512 KB |
+| 2 | IPCQ | 2 MB |
+| 3 | IOCPU_SFR | 8 KB |
+| 4 | IO_DMA_SFR | 16 KB |
+| 5 | IO_SRAM | 64 MB |
+| 6..15 | Reserved | -- |
+
+##### 2.4.2 UAL Region
+
+Sub-layout TBD (separate ADR).
+
+#### 2.5 Addressing Rules
+
+1. MBZ bits must be zero. An address with non-zero MBZ bits is
+   **architecturally invalid**. An implementation may raise a decode fault
+   or return an error -- the exact behavior is not prescribed by this ADR.
+2. Fixed slot sizes are chosen for simple hardware decode; actual
+   implemented capacity may be smaller than the slot.
+3. Access beyond a sub-unit's implemented budget within a slot is
+   **architecturally invalid** (same policy as MBZ).
+
+### D3. Bitfield decoding is deterministic
+
+Given an integer address, field extraction (`sip_id`, `die_id`, `kind`,
+`sub_unit`, `offset`) is purely positional. No runtime state is required.
+Decoding deterministically maps an integer address to destination domains:
+`sip_id`, `die_id`, target kind (HBM / PE_LOCAL / MCPU_LOCAL / CUBE_SRAM /
+IOCPU / UAL).
+
+### D4. Capacity validation may depend on topology config
+
+Whether a decoded address falls within **implemented capacity** (e.g.,
+HBM 96 GB on a specific SKU) is checked against topology parameters
+provided via DI/config. Decode itself (D3) never consults topology --
+only validation does. These parameters must live in the topology/config
+layer, not in node implementations.
+
+### D5. Routing consumes decoded domains, not raw bits
+
+Routing policy uses decoded domains:
+
+- `src` location (sip / die / pe or node_id)
+- `dst` domains derived from PhysAddr decoding
+- `size_bytes` for size-aware link latency
+
+Routing must not inspect raw bit-fields directly except inside the
+decoding module.
+
+## Alternatives Considered
+
+1. **Keep `rack_id` (4 bits)**: Rejected -- never used in practice; freeing
+   its 4 bits lets `local_offset` grow from 38 to 42 bits
+   (IOCHIPLET 1 TB).
+
+2. **Uniform 256 GB per die**: Rejected -- IOCHIPLET UAL requires ~1 TB.
+   Freed rack_id bits enable 42-bit local_offset.
+
+3. **Variable-width die windows (AHBM 256 GB, IOCHIPLET 1 TB via multi-seg
+   spanning)**: Rejected -- complicates D3 (deterministic decoding).
+   Uniform 4 TB window with MBZ padding is simpler.
+
+4. **Use raw integers everywhere, decode ad-hoc in routing**: Rejected --
+   leads to duplicated logic, inconsistent routing, and hidden
+   assumptions.
+
+5. **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**:
+   Rejected -- violates SPEC R3 and breaks swappability.
+
+6. **Put decoding inside memory controllers or routers**: Rejected --
+   leaks policy into components, violates SPEC R4 / D5.
+
+## Consequences
+
+### Positive
+
+- Simple hierarchical decoder: SIP -> die -> kind -> sub-unit.
+- Clean separation of memory (HBM) vs local resource (PE/MCPU/SRAM/IOCPU).
+- Deterministic routing domains enable clear test invariants (SPEC R1, R5).
+- Expandable: 11 reserved die_id slots, reserved resource_kind / sub-unit
+  slots, reserved MBZ bits.
+- DI-first: decoder can be swapped without changing components (SPEC R4).
+
+### Tradeoffs
+
+- Sparse address holes due to power-of-2 slot alignment.
+- Large reserved/MBZ regions (intentional for future extension).
+- Requires explicit configuration for topology-derived sizes (D4).
+- Introduces a single "blessed" decoding module that must remain stable
+  and well-tested.
+
+## Supersedes
+
+- **ADR-0031 (PhysAddr PE-Resource Extension)**: stub status. The
+  PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables in D2.3.3-D2.3.5
+  fulfill ADR-0031's stated goals.
+
+## Implementation Notes (Non-normative)
+
+- Recommended module: `src/kernbench/policy/address/phyaddr.py`
+- Tests should cover: encode/decode round-trip per kind, MBZ enforcement,
+  die_id dispatch (AHBM / IOCHIPLET / reserved), sub-unit boundary
+  values, backward compatibility of factory APIs.
+- Factory methods: `hbm_addr`, `pe_hbm_addr`, `pe_tcm_addr`,
+  `cube_sram_addr` retain signatures (minus `rack_id`); `cube_id`
+  parameter renamed to `die_id`.
+- New factories: `pe_resource_addr`, `mcpu_resource_addr`,
+  `iocpu_resource_addr`, `ual_addr`.
+
+## Appendix A. Address Examples
+
+### A.1 AHBM HBM access
+
+sip=2, die=5, HBM offset=0x1000
+
+```text
+sip_id     = 2       -> [50:47] = 0b0010
+die_id     = 5       -> [46:42] = 0b00101
+addr_space = 1       -> [37]    = 1 (HBM)
+hbm_offset = 0x1000  -> [36:0]
+
+51-bit addr = (2 << 47) | (5 << 42) | (1 << 37) | 0x1000
+```
+
+### A.2 AHBM PE_LOCAL -- PE3 PE_TCM, offset=0x400
+
+```text
+sip_id        = 0  -> [50:47] = 0
+die_id        = 0  -> [46:42] = 0
+addr_space    = 0  -> [37]    = 0
+resource_kind = 0  -> [36:34] = 000 (PE_LOCAL)
+pe_id         = 3  -> [32:29] = 0011
+pe_sub_unit   = 6  -> [28:25] = 0110 (PE_TCM)
+sub_offset    = 0x400 -> [24:0]
+
+local_offset = (0 << 34) | (3 << 29) | (6 << 25) | 0x400
+```
+
+### A.3 AHBM MCPU_LOCAL -- MCPU_SRAM, offset=0x0
+
+```text
+sip_id        = 1  -> [50:47] = 0001
+die_id        = 3  -> [46:42] = 00011
+addr_space    = 0  -> [37]    = 0
+resource_kind = 1  -> [36:34] = 001 (MCPU_LOCAL)
+mcpu_sub_unit = 5  -> [29:25] = 00101 (MCPU_SRAM)
+sub_offset    = 0  -> [24:0]  = 0
+
+local_offset = (1 << 34) | (5 << 25)
+```
+
+### A.4 IOCHIPLET -- IOCPU IPCQ, offset=0x20000
+
+```text
+sip_id         = 1   -> [50:47] = 0001
+die_id         = 17  -> [46:42] = 10001 (IOCHIPLET[1])
+iocpu_sub_unit = 2   -> [30:27] = 0010 (IPCQ)
+sub_offset     = 0x20000 -> [26:0]
+
+chiplet_offset = (2 << 27) | 0x20000
+                 (< 0x8000_0000 -> IOCPU region)
+```
+
+### A.5 IOCHIPLET -- UAL region, offset=4 GB
+
+```text
+sip_id         = 0   -> [50:47] = 0
+die_id         = 16  -> [46:42] = 10000 (IOCHIPLET[0])
+chiplet_offset = 0x1_0000_0000 (4 GB >= 2 GB -> UAL region)
+```
+
+## Links
+
+- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first),
+  R5 (multi-domain comm)
+- ADR-0031: Superseded
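The positional decode in ADR-0001 D2/D3 can be sketched as below and checked against the Appendix A examples. This is an illustrative sketch, not the module at `src/kernbench/policy/address/phyaddr.py`: the names `Decoded`, `bits`, `require_mbz`, and `decode` are assumptions, and raising `ValueError` on non-zero MBZ bits or reserved codes is only one of the behaviors the ADR permits. Capacity and budget validation (D4) is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


def bits(value: int, hi: int, lo: int) -> int:
    """Extract the inclusive bit range [hi:lo] from value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def require_mbz(value: int, hi: int, lo: int) -> None:
    """Reject non-zero MBZ bits (raising is this sketch's choice, not the ADR's)."""
    if bits(value, hi, lo) != 0:
        raise ValueError(f"MBZ bits [{hi}:{lo}] are non-zero")


@dataclass(frozen=True)
class Decoded:
    sip_id: int
    die_id: int
    kind: str  # HBM / PE_LOCAL / MCPU_LOCAL / CUBE_SRAM / IOCPU / UAL
    offset: int
    sub_unit: Optional[int] = None
    pe_id: Optional[int] = None


def decode(addr: int) -> Decoded:
    """Purely positional decode (D3); no topology or capacity checks (D4)."""
    if not 0 <= addr < (1 << 51):
        raise ValueError("address does not fit in 51 bits")
    sip, die, local = bits(addr, 50, 47), bits(addr, 46, 42), bits(addr, 41, 0)
    if die <= 15:
        return _decode_ahbm(sip, die, local)
    if die <= 20:
        return _decode_iochiplet(sip, die, local)
    raise ValueError(f"die_id {die} is reserved")


def _decode_ahbm(sip: int, die: int, local: int) -> Decoded:
    require_mbz(local, 41, 38)
    if bits(local, 37, 37) == 1:  # addr_space = 1 selects the HBM window
        return Decoded(sip, die, "HBM", bits(local, 36, 0))
    kind = bits(local, 36, 34)
    if kind == 0:
        require_mbz(local, 33, 33)
        return Decoded(sip, die, "PE_LOCAL", bits(local, 24, 0),
                       sub_unit=bits(local, 28, 25), pe_id=bits(local, 32, 29))
    if kind == 1:
        require_mbz(local, 33, 30)
        return Decoded(sip, die, "MCPU_LOCAL", bits(local, 24, 0),
                       sub_unit=bits(local, 29, 25))
    if kind == 2:
        require_mbz(local, 33, 25)
        return Decoded(sip, die, "CUBE_SRAM", bits(local, 24, 0))
    raise ValueError(f"resource_kind {kind} is reserved")


def _decode_iochiplet(sip: int, die: int, local: int) -> Decoded:
    require_mbz(local, 41, 40)
    chiplet_offset = bits(local, 39, 0)
    if chiplet_offset < 0x8000_0000:  # below 2 GB: IOCPU resource region
        return Decoded(sip, die, "IOCPU", bits(chiplet_offset, 26, 0),
                       sub_unit=bits(chiplet_offset, 30, 27))
    return Decoded(sip, die, "UAL", chiplet_offset)  # UAL sub-layout is TBD
```

Because every branch depends only on the integer itself, the same input always yields the same `Decoded` value, which is the property D3 asks for.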
diff --git a/docs/adr-ko/ADR-0002-lat-routing-distance.md b/docs/adr-ko/ADR-0002-lat-routing-distance.md
new file mode 100644
index 0000000..19849f9
--- /dev/null
+++ b/docs/adr-ko/ADR-0002-lat-routing-distance.md
@@ -0,0 +1,102 @@
+# ADR-0002: Routing Distance, Ordering & Bypass Rules
+
+## Status
+Accepted
+
+## Date
+2026-02-27
+
+## Context
+The KernBench Graph Latency Simulator must compare kernel execution time
+across different architectures and topologies by computing end-to-end
+latency from graph traversal.
+
+To support meaningful comparison:
+- routing must be deterministic
+- latency must reflect actual interconnect structure
+- local vs remote traffic must be distinguishable
+- “bypass” optimizations must not undermine debuggability or correctness
+
+The simulator also aims to avoid software-managed metadata and hidden
+shortcuts that obscure control paths.
+
+## Decision
+
+### D1. Distance is accumulated latency, not hop count
+- Routing “distance” is defined as the **sum of per-node and per-link latency**.
+- Hop count alone must not be used for ordering or path selection.
+- Size-aware serialization latency (bytes / BW) contributes to distance.
+
+### D2. Routing order is derived from graph traversal
+- The chosen route is the path with minimum accumulated latency
+  given the constructed graph and routing policy.
+- Deterministic ordering must be guaranteed for identical inputs
+  (topology + policy + request).
+
+### D3. Bypass is explicit and graph-represented
+- All paths must be explicitly represented in the graph and subject to latency accumulation.
+- Example: PE_DMA connects to the NOC router mesh (ADR-0017 D7). All destinations
+  (HBM, shared SRAM, inter-cube UCIe) are reached via explicit mesh hops.
+  Local HBM access has minimal hops (switching overhead only); remote access
+  traverses additional routers.
+- Implicit or “magic” bypass paths are disallowed.
+
+### D4. No zero-latency end-to-end paths
+
+- Every routed request must incur **end-to-end** latency > 0.
+- Individual fabric segments (e.g., NOC hops) MAY have distance_mm = 0
+  when the fabric is distributed and distance is not meaningful at that granularity.
+  This is allowed because other components on the same path (e.g., PE_DMA, SRAM,
+  UCIe endpoints) contribute non-zero latency, ensuring the end-to-end invariant holds.
+- Fully zero-latency end-to-end paths are disallowed, except for explicit
+  test-only stubs clearly marked as such.
+
+### D5. Policy vs topology responsibility split
+- Topology builder:
+  - defines nodes and links and their latency/BW parameters
+- Routing policy:
+  - selects among available graph paths based on decoded domains
+- Routing policy must not assume missing links; missing connectivity
+  is a topology construction error.
+
+### D6. No software-managed routing metadata
+- Routing decisions must not rely on per-request software-managed metadata
+  that tracks distance, hop count, or ordering outside the graph model.
+- All distance/order computation is derived from traversal itself.
+
+## Alternatives Considered
+
+1) **Hop-count based routing**
+- Rejected: ignores heterogeneous latency/BW and misrepresents
+  architectural differences.
+
+2) **Implicit local shortcuts**
+- Rejected: breaks debuggability and violates traversal-based latency.
+
+3) **Software-managed distance metadata**
+- Rejected: increases control overhead and obscures routing semantics.
+
+## Consequences
+
+### Positive
+- Clear, debuggable hop-by-hop traces (SPEC R2, R4).
+- Architecture comparisons reflect real interconnect structure.
+- Routing behavior is reproducible and deterministic.
+
+### Tradeoffs / Costs
+- Graph construction must be correct and complete.
+- Bypass modeling requires explicit graph representation,
+  which slightly increases topology description complexity.
+
+## Implementation Notes (Non-normative)
+- Recommended responsibilities:
+  - Graph builder: ensure all required paths exist.
+  - Router: select next hop based on decoded domains and policy.
+- Tests should assert:
+  - non-zero end-to-end latency
+  - deterministic routing for identical inputs
+  - bypass paths appear explicitly in emitted traces
+
+## Links
+- SPEC.md: R1 (routing), R2 (latency), R3 (topology), R5 (multi-domain comm)
+- ADR-0001: PhysAddr layout & decoding contract
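ADR-0002 D1, D2, and D4 together describe a minimum accumulated-latency search with deterministic ordering. A minimal sketch follows, assuming a graph given as plain dictionaries with latencies in ns and bandwidth in bytes/ns; the function name `min_latency_route`, the data shapes, and the units are hypothetical, and Dijkstra's algorithm with lexicographic tie-breaking is one possible way to meet the ADR, not necessarily what the simulator does.

```python
import heapq


def min_latency_route(node_lat, links, src, dst, size_bytes):
    """Return (distance, path) with minimum accumulated latency.

    node_lat: {node: latency_ns}
    links:    {(a, b): (latency_ns, bw_bytes_per_ns)}  directed edges
    Distance = node latencies + link latencies + size_bytes / bw per link (D1).
    """
    adjacency = {}
    for (a, b), (lat, bw) in links.items():
        adjacency.setdefault(a, []).append((b, lat + size_bytes / bw))

    # Heap entries are (distance, path); equal distances fall back to
    # lexicographic path order, so identical inputs give identical routes (D2).
    heap = [(node_lat[src], (src,))]
    settled = set()
    while heap:
        dist, path = heapq.heappop(heap)
        node = path[-1]
        if node in settled:
            continue
        settled.add(node)
        if node == dst:
            if dist <= 0:
                raise ValueError("zero-latency end-to-end path (violates D4)")
            return dist, list(path)
        for nxt, cost in adjacency.get(node, []):
            if nxt not in settled:
                heapq.heappush(heap, (dist + cost + node_lat[nxt], path + (nxt,)))
    # Missing connectivity is a topology construction error (D5).
    raise LookupError(f"no path from {src} to {dst}")
```

In a graph where a two-hop path passes through one slow node and a three-hop path uses fast routers, this picks the three-hop path, which is the behavior D1 requires of a distance that is not a hop count.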
diff --git a/docs/adr-ko/ADR-0003-dev-target-system-hierarchy.md b/docs/adr-ko/ADR-0003-dev-target-system-hierarchy.md
new file mode 100644
index 0000000..e5acc7d
--- /dev/null
+++ b/docs/adr-ko/ADR-0003-dev-target-system-hierarchy.md
@@ -0,0 +1,68 @@
+# ADR-0003: Target System Hierarchy & Modeling Scope
+
+## Status
+
+Accepted
+
+## Context
+
+We need a system-level simulator to evaluate LLM kernel performance on our AI Accelerator platform.
+The platform is organized as a compute tray containing multiple identical SIPs connected via PCIe or UAL
+through switching fabrics, with a host CPU issuing commands/kernels.
+
+## Decision
+
+We model the system hierarchy explicitly:
+
+### D1. Tray-level
+
+- A compute tray contains:
+  - Host CPU (issues requests / coordinates runtime & data placement)
+  - Multiple identical SIPs (accelerators)
+  - Interconnect fabric between SIPs (PCIe and/or UAL via switches)
+
+### D2. SIP-level
+
+- A SIP is a multi-die package composed of:
+  - Multiple CUBEs (HBM die + compute PEs + UCIe)
+  - One or more IO chiplets (host/SIP interfaces)
+- IO chiplets:
+  - provide interfaces: PCIe-EP, IO_CPU, optionally UAL-EP
+  - can be multiple per SIP
+  - placement constrained to SIP shoreline (top/bottom/left/right); each shoreline may host 1–2 IO chiplets
+
+### D3. CUBE-level
+
+- A CUBE contains:
+  - HBM + memory controller (HBM_CTRL)
+  - NOC (on-die fabric): carries all intra-cube traffic including HBM data,
+    inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access.
+    Must provide: full-BW PE↔local HBM path, PE↔SRAM connectivity,
+    PE↔UCIe connectivity, M_CPU↔PE command path.
+    NOC topology is an implementation choice (e.g., 2D mesh, ring, crossbar);
+    current implementation uses a 2D mesh with XY routing (see ADR-0017).
+    HBM_CTRL is attached to each PE's local NOC port (local HBM = minimal hop).
+  - Shared SRAM: cube-level shared memory accessible by all PEs via NOC
+  - management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
+  - multiple PEs
+  - up to 4 UCIe endpoints (N/E/W/S) for CUBE↔CUBE and CUBE↔IO connectivity
+
+### D4. PE-level
+
+- A PE can execute one kernel instance
+- PE contains internal control + accelerators (modeled at PE view granularity):
+  - PE_CPU, command handler, PE_TCM, DMA/GEMM/MATH engines, internal queues
+
+## Consequences
+
+- The simulator supports abstraction by “views”:
+  - SIP view hides PE internals
+  - CUBE view treats each PE as a single block
+  - PE view expands PE internals
+- Topology remains parameterized; sizes/counts/links come from configuration.
+
+## Links
+
+- SPEC R3/R5
+- ADR-0005 (diagram views)
+- ADR-0017 (cube NOC 2D mesh architecture)
diff --git a/docs/adr-ko/ADR-0004-mem-memory-semantics-local-hbm.md b/docs/adr-ko/ADR-0004-mem-memory-semantics-local-hbm.md
new file mode 100644
index 0000000..d9144b0
--- /dev/null
+++ b/docs/adr-ko/ADR-0004-mem-memory-semantics-local-hbm.md
@@ -0,0 +1,76 @@
+# ADR-0004: Memory Semantics & Local-HBM Bandwidth Guarantee
+
+## Status
+
+Accepted
+
+## Context
+
+Accurately modeling PE↔HBM behavior is essential for kernel latency estimation.
+Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth, independent of intervening on-die fabric bandwidth.
+
+## Decision
+
+### D1. Local HBM definition
+
+- Each PE is assigned a logically defined “local HBM” region.
+- Local HBM corresponds to the pseudo-channel subset directly attached to that PE’s
+  router in the NOC mesh (ADR-0017 D4).
+- The path is: PE_DMA → local router → HBM_CTRL (switching overhead only, 0 mesh hops).
+- The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration.
+
+### D2. Local HBM bandwidth guarantee contract
+
+- Accesses from a PE to its local HBM MUST guarantee full effective HBM
+  read/write bandwidth independent of intervening fabric bandwidth limits.
+- Effective HBM bandwidth = spec bandwidth x efficiency factor.
+  The efficiency factor (configured via `hbm_ctrl.attrs.efficiency`, default 0.8)
+  models real-world DRAM inefficiencies (refresh cycles, bank conflicts, page
+  misses). For example: 256 GB/s spec x 0.8 = 204.8 GB/s effective.
+- The topology builder applies the efficiency factor to router-to-hbm edge
+  bandwidth at graph construction time, so all downstream routing and latency
+  computation uses the effective value.
+- This guarantee is modeled by:
+  - a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point,
+  - while still incurring non-zero latency along explicitly modeled components.
+- HBM_CTRL internal modeling (PC striping, cut-through, scheduling fidelity)
+  is consolidated in ADR-0033 (Latency Model: Assumptions and Known
+  Simplifications). The aggregate BW guarantee here remains the contract;
+  ADR-0033 documents how the per-PC model realizes it and which scheduler
+  effects are intentionally simplified.
+
+### D3. Remote PE HBM semantics (intra-cube)
+
+- A PE that accesses another PE's local HBM traverses the NOC:
+  - PE_DMA → NOC → (fabric hops) → target PE's NOC port → HBM_CTRL
+- NOC bandwidth and hop count may limit remote HBM access relative to local access.
+
+### D4. Non-local HBM semantics (inter-cube / inter-SIP)
+
+- Accesses from a PE to HBM in a different cube or SIP MAY be limited by:
+  - NOC bandwidth within the cube,
+  - inter-cube UCIe links,
+  - inter-SIP fabric (PCIe/UAL).
+- These paths MUST be explicit and traceable.
+
+### D5. Shared SRAM semantics
+
+- Each CUBE contains a shared SRAM accessible by all PEs in that CUBE.
+- Access path: PE_DMA → NOC → shared SRAM.
+- Shared SRAM bandwidth is limited by the NOC↔SRAM link bandwidth.
+- Shared SRAM is not part of the HBM address space; it is a separate memory domain.
+
+## Verification Notes
+
+Tests should cover:
+
+- local-HBM case: BW matches effective HBM BW (D2) regardless of the fabric BW parameter
+- remote PE HBM case: latency includes mesh hop traversal
+- non-local cases (inter-cube/inter-SIP): BW/latency respond to fabric/link parameters
+- shared SRAM case: access via NOC with correct BW
+
+## Links
+
+- SPEC R2/R5
+- ADR-0002 (distance/order & explicit bypass)
+- ADR-0017 D7 (PE DMA data paths through NOC to HBM)
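The effective-bandwidth arithmetic in ADR-0004 D2 (spec bandwidth x efficiency factor) can be illustrated with a short sketch. The function names are hypothetical, and the conversion assumes decimal units so that 1 GB/s equals 1 byte/ns; the ADR itself does not state which convention the simulator uses.

```python
def effective_bw_gbps(spec_bw_gbps: float, efficiency: float = 0.8) -> float:
    """Effective HBM bandwidth = spec bandwidth x efficiency factor."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return spec_bw_gbps * efficiency


def transfer_time_ns(size_bytes: int, bw_gbps: float) -> float:
    """Serialization time, assuming 1 GB/s == 1 byte/ns (decimal GB)."""
    return size_bytes / bw_gbps


# The ADR's example: 256 GB/s spec with the default 0.8 efficiency.
local_bw = effective_bw_gbps(256.0)
```

With these assumptions a 2048-byte local-HBM transfer serializes in about 10 ns, and that figure does not change when a fabric bandwidth parameter is varied, which is the local-HBM case the Verification Notes ask tests to cover.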
diff --git a/docs/adr-ko/ADR-0005-dev-diagram-views-distance-layout.md b/docs/adr-ko/ADR-0005-dev-diagram-views-distance-layout.md
new file mode 100644
index 0000000..6391f19
--- /dev/null
+++ b/docs/adr-ko/ADR-0005-dev-diagram-views-distance-layout.md
@@ -0,0 +1,186 @@
+# ADR-0005: Diagram Views & Distance-Aware Layout Rules
+
+## Status
+
+Accepted
+
+## Context
+
+We require verifiable and inspectable system modeling for a large-scale,
+parameterized AI Accelerator system.
+
+Humans must be able to:
+
+- visually inspect the modeled topology,
+- reason about communication structure and relative distance,
+- do so at multiple abstraction levels without being overwhelmed by detail.
+
+The simulator models distance (accumulated latency) as a first-class concept.
+Diagrams must reflect this distance by default.
+
+---
+
+## Decision
+
+### D1. Global Defaults
+
+- All diagrams MUST be **distance-aware by default**.
+- All diagrams MUST render **representative views** of the architecture.
+- Instance indices (e.g., sip0, cube2, pe3) MUST NOT be required for diagram generation.
+- Instance indices MAY be used ONLY:
+  - to define a distance anchor in asymmetric or debugging scenarios, or
+  - when explicitly requested.
+
+---
+
+### D2. Representative Rendering Rule
+
+- All CUBEs share the same internal structure.
+- All PEs share the same internal structure.
+
+Therefore:
+
+- SIP-level diagrams render representative CUBEs and IO chiplets.
+- CUBE-level diagrams render representative PEs as opaque blocks.
+- PE-level diagrams render a representative PE with fully expanded internals.
+
+Diagrams MUST NOT depend on specific SIP, CUBE, or PE indices
+unless explicitly requested.
+
+---
+
+### D3. Diagram Views
+
+#### View A — SIP-Level Diagram
+
+**Purpose**
+Explain system-scale structure and connectivity.
+
+**Visible elements**
+
+- SIP boundaries (optional)
+- CUBEs (opaque blocks)
+- IO chiplets (opaque blocks)
+- Optional UCIe stubs only if needed to clarify connectivity
+
+**Hidden elements**
+
+- PE internals
+- CUBE internal fabric
+- IO chiplet internals
+
+**Visible links**
+
+- Host ↔ IO chiplets (PCIe)
+- SIP ↔ SIP (PCIe / UAL via switches)
+- IO ↔ CUBE (on-package links)
+
+---
+
+#### View B — CUBE-Level Diagram
+
+**Purpose**
+Explain cube-internal structure and data/control flow.
+
+**Visible elements**
+
+- Router mesh: 2D grid of NOC routers (from cube_mesh.yaml); all traffic routes through the mesh
+- HBM_CTRL attached to PE routers (local HBM = 0 hop)
+- HBM subsystem (HBM_CTRL)
+- Shared SRAM: cube-level shared memory
+- Management CPU (M_CPU)
+- PEs as opaque blocks (PE[0..N−1])
+- UCIe endpoints (N/E/W/S) as ports
+
+**Hidden elements**
+
+- PE internals
+
+**Visible links**
+
+- PE → router (HBM + non-HBM data path via mesh)
+- Router ↔ HBM_CTRL (local HBM access)
+- Router ↔ Router (mesh hops for remote access)
+- Router ↔ UCIe endpoints
+- Router ↔ shared SRAM
+- M_CPU ↔ router (command path)
+- Router → PE_CPU (command delivery, collapsed into PE block)
+
+---
+
+#### View C — PE-Level Diagram
+
+**Purpose**
+Explain internal PE behavior and execution structure.
+
+**Visible elements**
+
+- PE_CPU
+- Command handler / scheduler
+- PE_TCM (local SRAM)
+- HW accelerators (DMA, GEMM, MATH, etc.)
+- Local HBM interface
+- Optional IPCQ / messaging endpoints
+
+**Visible links**
+
+- Control paths (CPU → scheduler → engines)
+- Data paths (engines ↔ TCM, DMA ↔ local HBM)
+- External fabric ports as abstract ports only
+
+---
+
+### D4. Distance-Aware Layout (Default)
+
+#### Distance definition
+
+- Distance is defined as **accumulated latency**, consistent with ADR-0002.
+- Distance is computed from a single anchor node.
+
+#### Default anchor selection
+
+- SIP view: IO chiplet (or Host CPU if present)
+- CUBE view: a representative PE
+- PE view: PE_CPU or Command Handler
+
+Anchors are **implicit defaults** and MUST NOT be required to be specified.
+
+#### Layout rules
+
+- Diagrams MUST be laid out in layers based on distance buckets.
+- Layout direction MUST be consistent within a view type
+  (preferred: left-to-right).
+- Nodes with equal distance MUST have stable ordering
+  (by role or identifier, deterministically).
+
+Cycles MAY be rendered using dashed or curved edges for readability,
+without affecting distance semantics.
+
+---
+
+### D5. Generation Contract (for Tools / Claude Code)
+
+When generating diagrams:
+
+- Assume distance-aware layout by default.
+- Assume representative rendering by default.
+- Do NOT ask for SIP/CUBE/PE indices unless required.
+- Do NOT expand hidden abstraction levels.
+- Prefer architectural clarity over micro-hop fidelity.
+
+---
+
+## Consequences
+
+- Diagrams are stable across topology scaling.
+- Changes in distance or routing policy are reflected visually.
+- Diagrams serve as verifiable artifacts derived from the simulator model,
+  not as hand-maintained documentation.
+
+---
+
+## Links
+
+- SPEC Section 4 (Output, Debuggability, and Diagrams)
+- ADR-0002 (Routing distance semantics)
+- ADR-0006 (Topology compilation & automatic diagram generation)
diff --git a/docs/adr-ko/ADR-0006-dev-topology-compilation-distance-diagram.md b/docs/adr-ko/ADR-0006-dev-topology-compilation-distance-diagram.md
new file mode 100644
index 0000000..4b3767c
--- /dev/null
+++ b/docs/adr-ko/ADR-0006-dev-topology-compilation-distance-diagram.md
@@ -0,0 +1,130 @@
+# ADR-0006: Topology Compilation, Distance Extraction, and Automatic Diagram Generation
+
+## Status
+
+Accepted
+
+## Context
+
+The simulator compiles topology configuration (e.g., topology.yaml) into an explicit model graph,
+and computes routing and accumulated latency (distance).
+Diagrams should be generated from these authoritative artifacts to ensure consistency and avoid
+hand-maintained topology drawings.
+
+Additionally, for usability, diagrams should be emitted automatically into a stable location
+so that developers can preview them immediately in the repository.
+
+---
+
+## Decision
+
+### D1. Topology compilation is the single source of truth
+
+- topology.yaml (or equivalent config) is compiled into:
+  - an explicit system graph,
+  - node/link attributes,
+  - routing policies.
+
+This compiled graph is the authoritative representation of the system.
+
+### D2. Distance extraction during compilation
+
+- During or immediately after topology compilation, the simulator MUST compute distance metadata
+  (accumulated latency) consistent with ADR-0002.
+- Distance metadata MUST be sufficient to support distance-aware diagram layout as defined in ADR-0005.
+- Distributed fabric segments (e.g., NOC) MAY have distance_mm = 0 per ADR-0002 D4;
+  layout placement for such nodes uses explicit position metadata rather than distance buckets.
+
+### D3. Diagram generation is a derived artifact
+
+- Diagrams MUST be generated from:
+  - the compiled topology graph,
+  - extracted distance metadata,
+  - view/layout rules defined in ADR-0005.
+- Diagram generation MUST NOT require additional hand-written topology descriptions.
+
+### D4. Automatic diagram emission to the repository
+
+- As part of topology compilation, the implementation MUST produce the following diagrams by default:
+  - SIP-level diagram (representative, distance-aware)
+  - CUBE-level diagram (representative, distance-aware)
+  - PE-level diagram (representative, distance-aware)
+- The default output directory is:
+  - `docs/diagrams/`
+- The generator MUST overwrite or update existing files only when the compiled topology or the diagram rules change.
+
+### D5. View-specific projection and layout
+
+For each view (SIP / CUBE / PE):
+
+- The generator MUST project the compiled graph into a reduced view graph:
+  - hide/collapse nodes according to ADR-0005,
+  - preserve connectivity semantics relevant to that view,
+  - compute distance buckets and assign layout layers deterministically.
+- CUBE-level projection MUST include:
+  - Router mesh (from cube_mesh.yaml), HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
+    and PEs as opaque blocks.
+  - All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0017).
+- Default anchors are implicit (ADR-0005) and MUST NOT require instance indices.
+
+### D6. Output formats and determinism
+
+- The generator MUST output at least one of:
+  - Mermaid (Markdown-native)
+  - Graphviz DOT (rank-based control)
+  - SVG (mm-accurate layout, no external dependencies)
+- SVG is preferred when mm-accurate position metadata is available from the compiled topology.
+- Output MUST be deterministic:
+  - same topology + same rules → identical diagram text
+- File naming MUST be deterministic and stable (see "Output Conventions").
+
+### D7. Performance and caching
+
+- Diagram generation MAY be lazy and/or cached, as long as the outputs in `docs/diagrams/`
+  remain consistent with the compiled topology.
+- The implementation SHOULD use a cache key based on:
+  - topology content hash,
+  - routing policy version,
+  - diagram rules version,
+  - view type (SIP/CUBE/PE).
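+
+One possible way to combine the four inputs into a single key is sketched
+below. The function name `diagram_cache_key`, the use of SHA-256, and the
+length-prefixed encoding are assumptions for illustration; the ADR only lists
+which inputs the key should depend on.
+
+```python
+import hashlib
+
+
+def diagram_cache_key(topology_bytes, routing_policy_version,
+                      diagram_rules_version, view_type):
+    """Derive a deterministic cache key from the four inputs listed in D7."""
+    digest = hashlib.sha256()
+    # Topology content hash.
+    digest.update(hashlib.sha256(topology_bytes).digest())
+    for field in (routing_policy_version, diagram_rules_version, view_type):
+        encoded = field.encode("utf-8")
+        # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
+        digest.update(len(encoded).to_bytes(4, "big"))
+        digest.update(encoded)
+    return digest.hexdigest()
+
+
+key_cube = diagram_cache_key(b"sips: 2\n", "v1", "v3", "CUBE")
+key_pe = diagram_cache_key(b"sips: 2\n", "v1", "v3", "PE")
+```
+
+The same inputs always give the same key, and changing any one input (here the
+view type) gives a different key.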
+
+---
+
+## Output Conventions
+
+### Directory
+
+- `docs/diagrams/` is the canonical output directory for generated diagrams.
+
+### File names (recommended, deterministic)
+
+- `system_view.svg` / `system_view.mmd` / `system_view.dot`
+- `sip_view.svg` / `sip_view.mmd` / `sip_view.dot`
+- `cube_view.svg` / `cube_view.mmd` / `cube_view.dot`
+- `pe_view.svg` / `pe_view.mmd` / `pe_view.dot`
+
+Optionally, for multi-topology workflows:
+
+- `sip_view__{topology_id}.svg`
+- `cube_view__{topology_id}.svg`
+- `pe_view__{topology_id}.svg`
+
+### Repository policy
+
+- Generated diagram files MAY be committed to the repository to enable diff-based review.
+- If committed, they MUST be reproducible from topology compilation.
+
+---
+
+## Consequences
+
+- Diagrams are always consistent with simulator behavior.
+- Architectural changes automatically propagate to visualizations.
+- Diagram diffs become meaningful indicators of architectural change.
+
+---
+
+## Links
+
+- SPEC Section 4 (Output, Debuggability, and Diagrams)
+- ADR-0002 (Distance semantics)
+- ADR-0005 (Diagram views and layout rules)
diff --git a/docs/adr-ko/ADR-0007-api-runtime-api-boundaries.md b/docs/adr-ko/ADR-0007-api-runtime-api-boundaries.md
new file mode 100644
index 0000000..9522b9d
--- /dev/null
+++ b/docs/adr-ko/ADR-0007-api-runtime-api-boundaries.md
@@ -0,0 +1,95 @@
+# ADR-0007: Runtime API and Simulation Engine Boundaries
+
+## Status
+
+Accepted
+
+## Context
+
+The simulator consists of multiple layers with distinct responsibilities:
+
+- a host-facing API layer used by benchmarks and user code,
+- a discrete-event simulation engine that executes requests,
+- device components that model hardware behavior.
+
+Without strict boundaries, orchestration logic can leak into components,
+or simulation internals can become entangled with user-facing APIs.
+
+This ADR defines clear responsibility boundaries between:
+
+- runtime API,
+- simulation engine (sim_engine),
+- hardware components.
+
+---
+
+## Decision
+
+### D1. Runtime API is host-facing orchestration only
+
+The runtime API represents host/driver-level behavior and MUST:
+
+- expose high-level operations (tensor deployment, kernel launch),
+- submit requests only to endpoint components (e.g., IO_CPU),
+- await completion via futures/handles,
+- own and persist host-side metadata (tensor allocation maps, kernel bindings).
+
+The runtime API MUST NOT:
+
+- hardcode hop-by-hop routing or fan-out,
+- directly invoke internal components (M_CPU, PE_CPU, engines),
+- embed topology- or routing-specific assumptions.
+
+---
+
+### D2. Simulation engine wires components and tracks completion
+
+The simulation engine (sim_engine) MUST:
+
+- wire components at initialization (create port stores + start wire
+  processes per the component port/wire framework — ADR-0015),
+- inject requests into the compiled topology graph at entry components
+  (e.g., PCIE_EP for memory operations, IO_CPU for kernel launch),
+- schedule and execute events using a discrete-event model,
+- manage correlation ids and completion tracking.
+
+The simulation engine MUST NOT:
+
+- define tensor semantics,
+- define kernel execution policies,
+- expose internal graph details to the runtime API,
+- walk the topology path during request execution,
+- call component `run()` methods directly,
+- track per-hop latency or decompose fan-out (components own this).
+
+---
+
+### D3. Components own fan-out and aggregation
+
+Device-side components MUST:
+
+- fan-out requests to downstream domains
+  (IO_CPU → M_CPU → PE_CPU → schedulers/engines),
+- aggregate completion and failure signals,
+- propagate results deterministically upstream.
+
+Neither the runtime API nor the simulation engine may orchestrate
+component-level fan-out explicitly.
+
+---
+
+## Consequences
+
+- Runtime APIs remain stable as topology and routing evolve.
+- Simulation internals can change without affecting user-facing code.
+- Component implementations remain swappable via DI.
+
+---
+
+## Links
+
+- SPEC R4, R7, R8
+- ADR-0008 (Tensor deployment)
+- ADR-0009 (Kernel execution)
+- ADR-0015 (Component port/wire model and engine role)
+- ADR-0010 (CLI surface and execution semantics — runtime API consumer)
diff --git a/docs/adr-ko/ADR-0008-api-tensor-deploy-and-allocation.md b/docs/adr-ko/ADR-0008-api-tensor-deploy-and-allocation.md
new file mode 100644
index 0000000..36ca4da
--- /dev/null
+++ b/docs/adr-ko/ADR-0008-api-tensor-deploy-and-allocation.md
@@ -0,0 +1,100 @@
+# ADR-0008: Tensor Deployment and Allocation (Host Allocator, PA-first)
+
+## Status
+
+Accepted
+
+## Context
+
+Benchmarks require PyTorch-like tensor semantics:
+
+- tensor creation (empty, fill),
+- deployment to accelerator devices (tensor.to()).
+
+In a realistic system, host software manages allocation/mapping and installs
+mappings for DMA/MMU. For Phase 0 we simplify (ADR-0011):
+
+- device memory operations use PA only,
+- VA/MMU/IOMMU is not modeled.
+
+To keep the host↔device interface minimal, we avoid a separate
+AllocateTensorMeta message. Instead, host allocation produces a PA shard map
+that is used directly by MemoryWrite/Read and KernelLaunch.
+
+---
+
+## Decision
+
+### D1. Tensor is a host-owned handle with PA shard mapping
+
+A Tensor object is a host-owned handle that encapsulates:
+
+- shape and dtype,
+- initialization intent,
+- device placement and allocation metadata as a PA shard map.
+
+After deployment, the Tensor handle MUST contain:
+
+- a list of shards, each with (sip,cube,pe,pa,nbytes,offset_bytes).
+
+This PA shard mapping is the single source of truth for kernel argument binding.
+
+---
+
+### D2. Deployment uses a host allocator (Phase 0)
+
+In Phase 0, tensor deployment produces PA shard mappings via a host allocator:
+
+- placement (split/replicate/hybrid) is decided by a DP policy,
+- allocation assigns PA ranges at the PE level and returns shard mappings,
+- the Tensor handle stores the resulting shard list deterministically.
+
+No separate host-visible device allocation RPC is required in Phase 0.
+
+---
+
+### D3. Data initialization and transfer use MemoryWrite/Read only
+
+Any data initialization or transfer implied by a tensor (e.g., fill, copy)
+MUST be represented using Host ↔ IO_CPU messages only:
+
+- MemoryWrite
+- MemoryRead
+
+Rules:
+
+- MemoryWrite/Read MUST reference PA + (sip,cube,pe) tags (ADR-0012).
+- Allocation metadata MUST NOT be embedded as a separate allocation message.
+- Bulk tensor data MUST NOT be embedded in Phase 0 messages.
+
+The simulation engine schedules MemoryWrite/Read through the graph so that
+latency is computed by explicit traversal.
+
+---
+
+### D4. Extension path (non-breaking)
+
+Future ADRs MAY introduce optional VA/MMU/IOMMU modeling by adding:
+
+- virtual addressing in tensor handles,
+- mapping install steps,
+- translation latency/page granularity.
+
+The Phase 0 PA shard map remains a valid fast-path configuration.
+
+---
+
+## Consequences
+
+- Host↔IO_CPU contract remains minimal (MemoryRead/Write + KernelLaunch).
+- KernelLaunch can pass per-PE data placement explicitly via shard tags.
+- Early implementation stays simple and testable.
+
+---
+
+## Links
+
+- ADR-0011 (Memory Addressing — PA / VA / LA)
+- ADR-0012 (Host↔IO_CPU schema)
+- ADR-0007 (runtime_api vs sim_engine boundaries)
+- ADR-0009 (Kernel execution)
diff --git a/docs/adr-ko/ADR-0009-api-kernel-execution-messaging.md b/docs/adr-ko/ADR-0009-api-kernel-execution-messaging.md
new file mode 100644
index 0000000..a94be07
--- /dev/null
+++ b/docs/adr-ko/ADR-0009-api-kernel-execution-messaging.md
@@ -0,0 +1,146 @@
+# ADR-0009: Kernel Execution Messaging and Completion Semantics
+
+## Status
+
+Accepted
+
+## Context
+
+Kernel execution is initiated by the host and proceeds through
+device control components:
+
+Host → IO_CPU → M_CPU → PE_CPU → schedulers → engines
+
+Completion propagates in reverse order.
+
+To keep benchmarks simple and topology-agnostic,
+kernel execution must be endpoint-driven with deterministic aggregation.
+
+---
+
+## Decision
+
+### D1. Kernel launch is an endpoint request
+
+A kernel launch is initiated by submitting a single KernelLaunch request
+to the IO_CPU endpoint.
+
+The runtime API MUST:
+
+- construct the kernel launch request,
+- submit it to IO_CPU,
+- await a single completion result.
+
+The runtime API MUST NOT orchestrate internal fan-out.
+
+---
+
+### D2. Tensor arguments are passed by metadata
+
+KernelLaunch requests MUST reference tensor arguments via:
+
+- host-owned tensor handles, or
+- resolved device address maps derived from those handles.
+
+Bulk tensor data MUST NOT be embedded in kernel launch messages.
+
+---
+
+### D3. Fan-out and aggregation are component responsibilities
+
+- IO_CPU fans out work to M_CPUs.
+- M_CPU fans out work to PE_CPUs.
+- PE_CPU manages kernel execution and engine dispatch.
+
+Completion semantics:
+
+- M_CPU completes when all targeted PEs complete or a failure policy triggers.
+- IO_CPU completes when all targeted CUBEs complete or a failure policy triggers.
+
+---
+
+### D4. Completion and failure propagation
+
+- All messages MUST carry correlation identifiers.
+- Completion and failure MUST propagate deterministically to the host.
+- The simulation engine provides futures/handles to observe completion.
+
+---
+
+### D5. Launch timing is endpoint-synchronized
+
+All PEs targeted by a single kernel launch MUST begin executing the kernel
+body at the same simulated time, regardless of their dispatch path length
+from the launch entry point.
+
+Rationale. The dispatch tree Host → IO_CPU → M_CPU → PE_CPU has variable
+latency at every level. PEs near their M_CPU receive the launch earlier
+than PEs farther away; cubes near an IO_CPU receive it earlier than cubes
+farther away. Without synchronization, each PE's kernel begins at a
+different `env.now`, making per-PE metrics such as `pe_exec_ns` a function
+of dispatch-path geometry rather than of the kernel's behavior —
+producing measurement artifacts in benchmarks that time kernel-internal
+waits (for example `tl.recv` on cross-cube or cross-SIP hops).
+
+Mechanism.
+
+- `KernelLaunchMsg` carries an optional `target_start_ns: float | None`.
+- **IO_CPU** is the canonical stamper. On fan-out to M_CPUs, it
+  computes `target_start_ns = env.now + max_latency` where
+  `max_latency` is the maximum, over every target (sip, cube, pe)
+  tuple, of the **two-leg dispatch chain**:
+
+  ```
+  max_latency(sip, cube, pe) =
+      compute_path_latency_ns(find_node_path(io_cpu, m_cpu(sip, cube)))
+    + compute_path_latency_ns(find_node_path(m_cpu(sip, cube), pe_cpu))
+    - io_cpu.overhead_ns
+    - m_cpu.overhead_ns
+  ```
+
+  This models the actual dispatch as **two sequential Transactions**
+  (IO_CPU → M_CPU, then M_CPU → PE_CPU). Each leg's
+  `compute_path_latency_ns` adds its endpoints' `overhead_ns`.
+  `io_cpu.overhead_ns` is subtracted because IO_CPU has already
+  paid it before this method runs. `m_cpu.overhead_ns` is
+  subtracted once because it is counted twice (as the endpoint of
+  leg 1 *and* the start of leg 2) but paid only once at run time.
+  A single `find_node_path(io_cpu, pe_cpu)` walk is **not**
+  equivalent: it can pick a graph path that bypasses M_CPU and
+  silently undershoot the prediction for far cubes, breaking the D5
+  invariant.
+
+  The fanned-out sub-Transactions carry **`nbytes = 0`** for
+  `KernelLaunchMsg` (control message only). Without this,
+  large kernel-launch payloads would occupy fabric BW on the
+  shared first hop and serialize the per-cube dispatch, pushing
+  far M_CPUs past `target_start_ns` and re-introducing the
+  late-arrival violation.
+- **M_CPU** passes an already-stamped `target_start_ns` through
+  unchanged. Only when the value is absent (e.g. a direct
+  launch-to-M_CPU unit test) does M_CPU compute a per-cube barrier
+  `env.now + max(local command-path latency)`.
+- **PE_CPU** yields `env.timeout(target_start_ns - env.now)` at the top
+  of `_execute_kernel`, before recording `pe_exec_start` and invoking
+  the kernel body.
+- When `target_start_ns is None`, PE_CPU falls through to the legacy
+  unsynchronized behavior — preserving backward compatibility.
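+
+The stamping arithmetic above can be checked with a small numeric sketch. It
+is not the IO_CPU implementation: the helper names, the per-leg latencies, and
+the assumption that all M_CPUs share one `overhead_ns` are invented for the
+example. Each leg latency stands in for
+`compute_path_latency_ns(find_node_path(...))`.
+
+```python
+def dispatch_latency_ns(leg1_ns, leg2_ns, io_cpu_overhead_ns, m_cpu_overhead_ns):
+    """Two-leg dispatch latency for one (sip, cube, pe) target.
+
+    leg1_ns: IO_CPU -> M_CPU path latency, including both endpoint overheads.
+    leg2_ns: M_CPU -> PE_CPU path latency, including both endpoint overheads.
+    """
+    # IO_CPU overhead is already paid; M_CPU overhead was counted in both legs.
+    return leg1_ns + leg2_ns - io_cpu_overhead_ns - m_cpu_overhead_ns
+
+
+def stamp_target_start_ns(now_ns, targets, io_cpu_overhead_ns, m_cpu_overhead_ns):
+    """Return now + the maximum dispatch latency over all targets."""
+    max_latency = max(
+        dispatch_latency_ns(leg1, leg2, io_cpu_overhead_ns, m_cpu_overhead_ns)
+        for leg1, leg2 in targets.values()
+    )
+    return now_ns + max_latency
+
+
+# Hypothetical targets: (sip, cube, pe) -> (leg1_ns, leg2_ns).
+targets = {(0, 0, 0): (40.0, 30.0), (0, 1, 3): (90.0, 35.0)}
+now_ns = 100.0
+target_start_ns = stamp_target_start_ns(now_ns, targets, 10.0, 5.0)
+
+# Each PE_CPU waits out the difference between the barrier and its arrival.
+waits = {
+    target: target_start_ns - (now_ns + dispatch_latency_ns(l1, l2, 10.0, 5.0))
+    for target, (l1, l2) in targets.items()
+}
+```
+
+With these numbers the far PE sets the barrier and waits 0 ns, while the near
+PE waits 55 ns, so both start the kernel body at the same simulated time.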
+
+IO_CPU-level stamping guarantees that every PE across every targeted cube
+uses the same barrier sim-time, eliminating both the within-cube
+dispatch-offset artifact *and* the cross-cube offset artifact in
+multi-cube launches. This models a real-hardware timed-broadcast launch
+(a latency-equalized dispatch tree).
+
+The synchronization is internal to the engine / IO_CPU / M_CPU / PE_CPU
+control plane — runtime API and application kernels are unchanged.
+
+---
+
+## Links
+
+- SPEC R1, R2, R7, R8
+- ADR-0007 (Runtime API boundaries)
+- ADR-0008 (Tensor deployment)
+- ADR-0013 (Verification strategy — V2 fan-out tests)
+- ADR-0015 D4 (concrete fabric path for kernel launch)
diff --git a/docs/adr-ko/ADR-0010-api-cli-surface-and-semantics.md b/docs/adr-ko/ADR-0010-api-cli-surface-and-semantics.md
new file mode 100644
index 0000000..4925a64
--- /dev/null
+++ b/docs/adr-ko/ADR-0010-api-cli-surface-and-semantics.md
@@ -0,0 +1,131 @@
+# ADR-0010: Command Line Interface and Execution Semantics
+
+## Status
+
+Accepted
+
+## Context
+
+The `kernbench` CLI is the user-facing entry point of the simulator. It
+exposes three subcommands:
+
+- `run` — execute a benchmark against a topology.
+- `probe` — diagnostic utility for latency / BW measurement.
+- `web` — interactive topology viewer.
+
+Device enumeration is centralized in the CLI; neither the runtime API
+nor the simulation engine enumerates devices. Benchmarks remain
+single-device by design and accept a device identifier as input.
+
+## Decision
+
+### D1. Benchmark contract — single-device by design
+
+- A benchmark MUST define behavior for a single device only.
+- A benchmark MUST accept a device identifier as input.
+- Benchmarks MUST NOT enumerate or loop over multiple devices.
+
+Multi-device execution is the CLI's concern (D3), not the benchmark's.
+
+### D2. `kernbench run` — benchmark execution
+
+Required arguments:
+
+- `--topology <path>`: topology YAML file path. Loaded via
+  `resolve_topology()`.
+- `--bench <name>`: benchmark name. Resolved via
+  `benches.loader.resolve_bench()`.
+
+Optional arguments:
+
+- `--device <selector>` (default: `all`):
+  - `all` — run once per discovered SIP (see D3).
+  - `sip:<N>` — run only on SIP N.
+  - Parsed via `resolve_device()`.
+- `--verify-data` (default: off) — enable Phase 2 data verification
+  (see ADR-0020). When set, `engine_factory` constructs the engine
+  with `enable_data=True`. After the benchmark runs, a diagnostic
+  summary of recorded ops is printed.
+
+Each invocation runs the benchmark once within a single simulation
+instance.
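+
+The `--device` selector grammar described above (`all` or `sip:<N>`) could be
+parsed roughly as follows. This is a hypothetical stand-in, not the actual
+`resolve_device()`: the name `parse_device_selector`, the `sip_count`
+parameter, and the error messages are assumptions.
+
+```python
+def parse_device_selector(selector, sip_count):
+    """Return the list of SIP indices selected by a --device value."""
+    if selector == "all":
+        return list(range(sip_count))
+    prefix, sep, index = selector.partition(":")
+    if prefix != "sip" or not sep or not index.isdecimal():
+        raise ValueError(f"unsupported device selector: {selector!r}")
+    sip = int(index)
+    if sip >= sip_count:
+        raise ValueError(f"SIP {sip} not present (topology has {sip_count})")
+    return [sip]
+
+
+all_sips = parse_device_selector("all", sip_count=4)
+one_sip = parse_device_selector("sip:2", sip_count=4)
+```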
+
+### D3. Multi-device execution is logically parallel
+
+When `--device all` is given (or `--device` is omitted) and the topology has multiple SIPs:
+
+- Benchmark executions are submitted to a single simulation engine
+  instance.
+- Executions are logically parallel in simulation time.
+- Inter-device contention is naturally modeled (shared fabric
+  bandwidth, cross-SIP traffic, etc.).
+
+The CLI does NOT spawn multiple OS processes or independent
+simulation runs — parallelism is internal to one simulation instance.
+
+### D4. `kernbench probe` — latency / BW diagnostic utility
+
+Required argument:
+
+- `--topology <path>`: topology YAML file path.
+
+Optional argument:
+
+- `--case <name>` (default: `all`) — run a predefined traffic
+  pattern, or `all` to run every defined case.
+
+Probe runs each pattern through the simulation engine and reports
+per case:
+
+- End-to-end latency (ns).
+- Effective bandwidth (nbytes / total_ns).
+- Bottleneck bandwidth (min edge BW along the chosen path).
+- Utilization (effective / bottleneck).
+
+Probe additionally validates monotonicity invariants — for example
+that local-HBM access ≤ cross-PE-within-cube ≤ cross-cube ≤
+cross-SIP — and reports violations. Probe is a developer tool for
+verifying the latency / BW model; it is not a benchmark.
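+
+The per-case metrics and the monotonicity check can be sketched as below. The
+function names, the case labels, and the sample numbers are assumptions for
+illustration. Bandwidth is taken in GB/s with 1 GB = 1e9 bytes, so bytes per
+nanosecond is numerically equal to GB/s.
+
+```python
+def probe_metrics(nbytes, total_ns, edge_bw_gbs):
+    """Compute effective BW, bottleneck BW, and utilization for one case."""
+    effective = nbytes / total_ns      # bytes/ns == GB/s (1 GB = 1e9 bytes)
+    bottleneck = min(edge_bw_gbs)      # slowest edge along the chosen path
+    return {
+        "effective_gbs": effective,
+        "bottleneck_gbs": bottleneck,
+        "utilization": effective / bottleneck,
+    }
+
+
+def monotonicity_violations(latency_ns, order):
+    """Return adjacent case pairs whose latency decreases along `order`."""
+    return [(a, b) for a, b in zip(order, order[1:])
+            if latency_ns[a] > latency_ns[b]]
+
+
+metrics = probe_metrics(nbytes=4096, total_ns=256.0, edge_bw_gbs=[32.0, 64.0])
+
+order = ["local_hbm", "cross_pe", "cross_cube", "cross_sip"]
+latency = {"local_hbm": 50.0, "cross_pe": 80.0, "cross_cube": 70.0, "cross_sip": 300.0}
+violations = monotonicity_violations(latency, order)
+```
+
+In this sample the cross-cube case is faster than the cross-PE case, so that
+pair is reported as a violation.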
+
+### D5. `kernbench web` — topology viewer
+
+Optional arguments:
+
+- `--port <N>` (default: `8765`) — HTTP port.
+- `--no-open` — do not auto-open the browser.
+
+Launches a local HTTP server that renders the compiled topology in
+the browser. Distinct from the static `docs/diagrams/` artifacts:
+
+- `docs/diagrams/` files are derived at topology-compile time
+  (ADR-0006).
+- `kernbench web` is interactive — pan/zoom, hover for component
+  attributes, switch between SIP / CUBE / PE views.
+
+### D6. Runtime API and simulation engine remain device-scoped
+
+- Runtime API calls operate on one device per invocation.
+- The simulation engine schedules all requests deterministically.
+- Neither layer enumerates devices.
+
+This invariant keeps each layer testable in isolation; device
+enumeration and multi-device fan-out live only in the CLI's `run`
+command (D3).
+
+## Consequences
+
+- Benchmark authors write single-device logic; multi-device behavior
+  emerges from the CLI dispatching across SIPs.
+- Adding a new subcommand (e.g., trace export, replay) does not
+  require benchmark or runtime-API changes — the CLI is the
+  extension point.
+- `probe` and `web` are diagnostic / visualization tools, not
+  benchmarks; they bypass the benchmark loader path.
+
+## Links
+
+- SPEC R7, R8, R9
+- ADR-0007 (Runtime API and Simulation Engine Boundaries)
+- ADR-0020 (Two-pass data execution — `--verify-data`)
+- ADR-0006 (Topology compilation and diagram generation —
+  background for `kernbench web`)
diff --git a/docs/adr-ko/ADR-0011-mem-memory-addressing-simplification.md b/docs/adr-ko/ADR-0011-mem-memory-addressing-simplification.md
new file mode 100644
index 0000000..064c365
--- /dev/null
+++ b/docs/adr-ko/ADR-0011-mem-memory-addressing-simplification.md
@@ -0,0 +1,521 @@
+# ADR-0011: Memory Addressing — PA / VA / LA Address Models
+
+## Status
+
+Accepted.
+
+- **VA model: currently implemented (default).**
+- PA model: implemented as PageFault fallback in PE_DMA.
+- LA model: proposed, not implemented.
+
+## Context
+
+KernBench's address model evolved through three design points, each
+addressing a limitation of the previous. This ADR documents all three
+in one place because future implementation work selects among them.
+
+### PA-only baseline
+
+Phase 0 of KernBench treated all device memory operations
+(MemoryRead/MemoryWrite) as raw physical-address transfers. No
+host-side virtual addressing, no MMU/IOMMU translation. Allocators
+returned PA mappings; DMA requests carried PA directly.
+
+This was sufficient for early correctness/latency work but
+insufficient for running standard Triton kernels that use
+`base_addr + offset` patterns on sharded tensors: each PE's shard
+has a different PA, but the kernel needs a single contiguous address
+space to compute offsets.
+
+### Why VA/MMU (current default)
+
+A realistic system uses host-side virtual addressing and an
+MMU/IOMMU-style translation path for DMA: the host allocates physical
+memory at PE level, maps it into a virtual address space, installs
+mappings, and DMA requests use virtual addresses that are translated
+to physical addresses.
+
+Adopting this model lets kernels use `base_addr + offset` over a
+contiguous VA range while the device-side MMU translates each access
+to the appropriate PA.
+
+### Why LA/BAAW (proposed)
+
+VA/MMU treats HBM as a single backing space. KernBench needs to
+explore architectures where HBM is composed of multiple pseudo
+channels in parallel:
+
+- CUBE's HBM has 32 or 64 pseudo channels.
+- In a PE-Local-HBM model, each PE is assigned N pseudo channels
+  (N = `hbm_pseudo_channels / pes_per_cube`).
+- Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW
+  (N × per-channel).
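+
+As a worked instance of the arithmetic above, the sketch below derives the
+per-PE channel count and aggregate bandwidth. The function name and the choice
+of 8 PEs per cube are assumptions; the 64-channel and 32 GB/s figures are the
+example values used in this ADR.
+
+```python
+def pe_hbm_bandwidth_gbs(hbm_pseudo_channels, pes_per_cube, channel_bw_gbs):
+    """Return (channels per PE, aggregate PE bandwidth in GB/s)."""
+    if hbm_pseudo_channels % pes_per_cube != 0:
+        raise ValueError("pseudo channels must divide evenly across PEs")
+    channels_per_pe = hbm_pseudo_channels // pes_per_cube
+    return channels_per_pe, channels_per_pe * channel_bw_gbs
+
+
+channels_per_pe, pe_bw_gbs = pe_hbm_bandwidth_gbs(64, 8, 32.0)
+```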
+
+Two channel-mapping modes need to be modelable:
+
+- **1:1 mode** — one logical access → N per-channel requests.
+  Precise per-channel BW contention modelling.
+- **n:1 mode (default)** — one logical access → one aggregated
+  request. Channels are assumed to interleave; aggregated BW model.
+
+VA's `tl.load(va_ptr)` produces a single DMA request to a single
+target. Decomposing that into per-channel requests inside PE_DMA
+requires the address layer to be aware of channels. This is the
+role of the LA (Logical Address) abstraction with BAAW
+(Logical-to-Physical Mapping Unit).
+
+Core requirements driving the LA design:
+
+- PE_DMA → HBM_CTRL effective bandwidth semantics must be identical
+  in both modes (only request shape and resource model differ).
+- Kernel programming model is unchanged — physical channel
+  information is never exposed to kernel code.
+- Mode switch is a topology-level configuration.
+
+### Design space summary
+
+| Model | Status | Key idea |
+|-------|--------|----------|
+| PA | fallback (implemented) | Direct physical addressing, no translation |
+| VA | current default (implemented) | Per-tensor contiguous VA range; MMU translates per access |
+| LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes |
+
+---
+
+## Decision
+
+This ADR defines three address models. At any given time the system
+operates in exactly one model. Selection is topology- / configuration-
+driven; coexistence within one simulation run is not required.
+
+---
+
+### Address Model: PA (Physical Address) — fallback
+
+#### D-PA1. PA-only semantics
+
+- All device memory accesses (MemoryRead/MemoryWrite) operate on
+  device physical addresses (PA) plus size.
+- PA-only mode remains functional via the PageFault fallback path in
+  PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats
+  the value as a PA directly.
+
+#### D-PA2. Allocation produces PA mappings
+
+Device allocation selects PE-local memory regions and returns PA
+mappings sufficient to execute kernels and issue DMA requests.
+
+The PA model is retained primarily for backward compatibility with PA-only
+tests and as the underlying physical layer that VA / LA models resolve
+into.
+
+---
+
+### Address Model: VA (Virtual Address with MMU) — current default
+
+#### D-VA1. Virtual Address Model
+
+- Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
+- `TensorShard` does NOT carry a `va` field — shard VA is derived as
+  `va_base + offset_bytes`.
+- Kernels receive `va_base` as their pointer argument (via
+  `TensorArg.va_base`).
+- `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).
+
+#### D-VA2. PE_MMU Component
+
+- Hybrid design: SimPy component (inbox for `MmuMapMsg`) + utility
+  (synchronous `translate()` called by PE_DMA).
+- Page-aligned dict lookup for O(1) VA → PA translation.
+- `tlb_overhead_ns`: configurable per-access latency.
+- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA
+  directly (preserves PA model for backward compatibility).
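+
+The page-aligned lookup and the PageFault fallback can be sketched as below.
+This is an illustrative model, not the actual PE_MMU code: the class name
+`PageTable`, its method signatures, and the sample addresses are assumptions,
+and the per-access `tlb_overhead_ns` is left out.
+
+```python
+class PageTable:
+    """Page-granular VA -> PA map with PA passthrough for unmapped addresses."""
+
+    def __init__(self, page_size=4096):
+        self.page_size = page_size
+        self._pages = {}  # page-aligned VA -> page-aligned PA
+
+    def map(self, va, pa, nbytes):
+        """Install mappings for every page in [va, va + nbytes)."""
+        if va % self.page_size or pa % self.page_size:
+            raise ValueError("va and pa must be page-aligned")
+        for offset in range(0, nbytes, self.page_size):
+            self._pages[va + offset] = pa + offset
+
+    def translate(self, va):
+        """O(1) lookup; an unmapped VA is returned unchanged (treated as PA)."""
+        page_va = va - (va % self.page_size)
+        page_pa = self._pages.get(page_va)
+        if page_pa is None:
+            return va  # PageFault fallback
+        return page_pa + (va - page_va)
+
+
+table = PageTable()
+table.map(va=0x1_0000_0000, pa=0x2000, nbytes=8192)
+mapped = table.translate(0x1_0000_0000 + 4100)  # second page, offset 4
+unmapped = table.translate(0x500)               # no mapping: used as PA
+```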
+
+#### D-VA3. Mapping Installation
+
+- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube
+  fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured
+  end-to-end.
+- `MmuMapMsg.target_sips` controls SIP-level routing to prevent
+  cross-SIP mapping contamination for replicated tensors.
+- Mapping strategy based on `DPPolicy.cube`:
+  - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping
+    only. Each cube's PEs see only their local PA. No cross-cube
+    mapping installed.
+  - **Sharded** (`cube="column_wise"`, etc.): broadcast all shard
+    mappings to all target cubes. Enables cross-PE and cross-cube
+    DMA.
+
+#### D-VA4. Tensor Lifecycle
+
+- `del tensor` triggers automatic cleanup via `Tensor.__del__` +
+  `weakref` to `RuntimeContext`. Sends `MmuUnmapMsg` through fabric,
+  returns VA and PA space.
+- `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
+- `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
+- `PEMemAllocator` uses free-list with coalescing (not bump allocator).
+- `VirtualAllocator` uses free-list with coalescing for VA space.
+
+#### D-VA5. Allocators
+
+- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free
+  with coalescing.
+- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with
+  coalescing.
+- Page size configurable via `topology.yaml` `pe_mmu` attrs
+  (default 4096).
+
+#### Consequences (VA model)
+
+- Triton kernels use `base_addr + offset` patterns naturally on
+  sharded tensors.
+- All latency remains explicit via graph traversal, including MMU
+  mapping installation and per-access TLB overhead.
+- PA-only mode retained as fallback (PageFault → treat as PA).
+- IPCQ and other fixed-address resources bypass MMU (use PA directly).
+
+---
+
+### Address Model: LA (Logical Address with BAAW) — proposed
+
+LA replaces VA when channel-level HBM modelling is required.
+Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the
+removed artifacts). Coexistence with VA in the same run is not a goal.
+
+#### D-LA1. LA introduction — replaces VA infrastructure
+
+LA is the sole address space used by kernel code (`tl.load`,
+`tl.store`, `tl.composite`). Properties:
+
+- Can map a Tensor to a contiguous logical space (like VA).
+- Expresses `(logical buffer + offset)`.
+- Does NOT contain physical channel information directly.
+- Stays as an intermediate abstraction until physical resolution.
+
+LA address space:
+
+| Item | Value |
+|------|-------|
+| LA start | `0x1_0000_0000` (4 GB, preserves former VA start) |
+| LA space size | 64 GB per PE |
+| Alignment unit | segment (see D-LA3) |
+
+LA is PE-local: different PEs may use the same LA value; BAAW segment
+tables differ → they resolve to different PAs.
+
+VA infrastructure removed when LA is adopted:
+
+| Removed | Replacement |
+|---------|-------------|
+| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed) |
+| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
+| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
+| `runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg` | `BaawSegmentInstallMsg` |
+| `runtime_api/context.py`: VA alloc + MMU install | LA alloc + BAAW segment install |
+| `runtime_api/tensor.py`: `va_base` | `la_base` |
+| `topology.yaml`: `pe_mmu` component entry | Removed |
+
+#### D-LA2. Mapping mode setting
+
+Topology-level (cube) configuration:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
+    hbm_pseudo_channels: 64       # total pseudo channel count
+    hbm_channels_per_pe: 8        # per-PE local channel count
+    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth
+```
+
+Consumed by the graph compiler (topology builder) and BAAW
+initialisation.
+
+#### D-LA3. Segment and BAAW
+
+Segments partition the LA space; each segment maps to a specific HBM
+channel or channel group. Segments are created at tensor deploy time by
+the runtime allocator. BAAW resolves LA → physical request(s) using the
+segment table.
+
+```python
+@dataclass
+class BaawSegment:
+    la_base: int          # segment start LA
+    la_size: int          # segment size (bytes)
+    mode: str             # "one_to_one" | "n_to_one"
+    # 1:1 mode fields
+    channel_count: int    # channels assigned to this segment (e.g. 8)
+    pa_bases: list[int]   # per-channel PA bases (len = channel_count)
+    channel_ids: list[int]   # per-channel logical IDs (e.g. [0..7])
+    channel_size: int     # per-channel size (la_size // channel_count)
+    # n:1 mode fields
+    agg_pa_base: int      # aggregated PA base
+    agg_node_id: str      # aggregated router node_id
+```
+
+Segment lifecycle:
+
+1. **Allocate** (tensor deploy): RuntimeContext allocates LA from LA
+   allocator. PEMemAllocator allocates per-channel PA (1:1) or
+   aggregated PA (n:1). `BaawSegmentInstallMsg` registers the segment
+   with PE_DMA.
+2. **Use** (kernel run): kernel `tl.load(la_ptr)` → `DmaReadCmd
+   (src_addr=LA)`. PE_DMA's BAAW front-end looks up the segment and
+   converts to PA(s).
+3. **Free** (tensor free): segment removed from table; LA and PA
+   returned.
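
The install / lookup / free steps above could be backed by a small sorted table. The following is a minimal sketch under stated assumptions: `Segment` is a trimmed-down stand-in for `BaawSegment` that keeps only the LA range, and `SegmentTable` with its `install`, `find`, and `remove` methods is a hypothetical name, not an existing class in the code base.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Trimmed-down stand-in for BaawSegment: only the LA range."""
    la_base: int
    la_size: int


@dataclass
class SegmentTable:
    """Hypothetical per-PE table, kept sorted by la_base for lookup."""
    _bases: list[int] = field(default_factory=list)
    _segments: list[Segment] = field(default_factory=list)

    def install(self, seg: Segment) -> None:
        i = bisect.bisect_left(self._bases, seg.la_base)
        # Reject overlap with the neighbours on either side.
        if i > 0:
            prev = self._segments[i - 1]
            if prev.la_base + prev.la_size > seg.la_base:
                raise ValueError("segment overlaps previous segment")
        if i < len(self._segments):
            if seg.la_base + seg.la_size > self._segments[i].la_base:
                raise ValueError("segment overlaps next segment")
        self._bases.insert(i, seg.la_base)
        self._segments.insert(i, seg)

    def find(self, la: int) -> Segment:
        # Rightmost segment whose base is <= la, then a range check.
        i = bisect.bisect_right(self._bases, la) - 1
        if i >= 0:
            seg = self._segments[i]
            if seg.la_base <= la < seg.la_base + seg.la_size:
                return seg
        raise LookupError(f"no segment covers LA {la:#x}")

    def remove(self, la_base: int) -> Segment:
        i = bisect.bisect_left(self._bases, la_base)
        if i == len(self._bases) or self._bases[i] != la_base:
            raise LookupError(f"no segment installed at {la_base:#x}")
        self._bases.pop(i)
        return self._segments.pop(i)


table = SegmentTable()
table.install(Segment(0x1_0000_0000, 4096))
table.install(Segment(0x1_0000_1000, 4096))
```

Rejecting overlaps at install time is one way to guarantee that tensors in distinct segments never collide; the real table may enforce this elsewhere (for example in the LA allocator).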
+
+#### D-LA4. BAAW resolution logic
+
+BAAW is a front-end stage inside PE_DMA, not a separate SimPy
+component. It is synchronous address-resolution logic that runs at the
+start of PE_DMA's `handle_command()`.
+
+Input: `(LA, nbytes)`. Output:
+
+- **1:1 mode**: `list[PhysicalRequest]` — one per channel.
+- **n:1 mode**: single `PhysicalRequest`.
+
+```python
+@dataclass
+class PhysicalRequest:
+    pa: int           # 51-bit Physical Address
+    nbytes: int       # transfer size for this request
+    dst_node: str     # target node_id (channel router or aggregated router)
+
+
+def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
+    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
+    offset = la - seg.la_base
+
+    if seg.mode == "n_to_one":
+        pa = seg.agg_pa_base + offset
+        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
+
+    # one_to_one: fan out one request per channel
+    per_ch_size = seg.channel_size
+    ch_offset = offset % per_ch_size          # same offset within every channel
+    ch_nbytes = nbytes // seg.channel_count   # even split across channels
+    requests = []
+    for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids):
+        dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
+        requests.append(
+            PhysicalRequest(pa=pa_base + ch_offset, nbytes=ch_nbytes, dst_node=dst_node)
+        )
+    return requests
+```
+
+BAAW responsibilities:
+
+- Convert logical access → physical request units.
+- Apply mode-dependent fan-out (1:1) or pass-through (n:1).
+- Compute PA and target node.
+
+BAAW non-responsibilities:
+
+- Performing actual data movement.
+- Executing NOC routing.
+- Simulating bandwidth occupation (downstream components' job).
+
+BAAW output is directly usable by the simulator's routing and resource
+model without additional address decoding.
+
+#### D-LA5. PE_DMA `handle_command()` change
+
+Current (VA-based) flow:
+
+```text
+DmaReadCmd.src_addr (VA)
+  → MMU.translate(VA) → PA
+  → PhysAddr.decode(PA) → PhysAddr object
+  → resolver.resolve(PhysAddr) → dst_node_id
+  → router.find_path(pe_prefix, dst_node_id) → path
+  → 1 sub-Transaction → fabric inject
+```
+
+LA-based flow:
+
+```text
+DmaReadCmd.src_addr (LA)
+  → BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
+  → for each PhysicalRequest:
+      → router.find_path(pe_prefix, req.dst_node) → path
+      → compute_drain_ns(path, req.nbytes) → drain
+      → sub-Transaction → fabric inject
+  → await all sub-Transactions
+  → pe_txn.done.succeed()
+```
+
+Key changes:
+
+- MMU reference removed → BAAW resolve.
+- `PhysAddr.decode()` + `resolver.resolve()` → BAAW returns `dst_node`
+  directly.
+- 1 request → N parallel requests in 1:1 mode.
+
+#### D-LA6. 1:1 mode detail
+
+- One logical access → N physical requests (N = `channels_per_pe`).
+- N = `hbm_pseudo_channels / pes_per_cube`.
+- Each request: fully-resolved 51-bit PA, targets a specific channel
+  router (`{pe_prefix}.ch_r{channel_id}`).
+- Per-channel link models BW contention.
+- PE_DMA injects N sub-transactions concurrently.
+
+Example: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`.
+PE0 owns ch0-7.
+
+```text
+Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "one_to_one", channel_count: 8,
+    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
+    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
+    channel_size: 512,
+}
+
+BAAW resolve result (8 requests):
+  → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
+  → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
+  → ...
+  → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
+
+PE_DMA: 8 sub-transactions parallel inject
+  per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
+  Total effective BW = 8 × channel_bw_gbs
+```
+
+Other N values:
+
+- `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`,
+  4 requests
+- `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`,
+  16 requests
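
The request counts for the three configurations above can be reproduced with a short sketch. It assumes the access starts at the segment base and that its size divides evenly across the channels, as in the 4 KB example; `fan_out`, `channels_per_pe`, and the placeholder PA bases are illustrative names and values, not part of the design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalRequest:
    pa: int
    nbytes: int
    dst_node: str


def channels_per_pe(hbm_pseudo_channels: int, pes_per_cube: int) -> int:
    return hbm_pseudo_channels // pes_per_cube


def fan_out(nbytes: int, pa_bases: list[int], channel_ids: list[int],
            pe_prefix: str) -> list[PhysicalRequest]:
    """Split one logical access evenly across the PE's channels (1:1 mode)."""
    n = len(pa_bases)
    if nbytes % n != 0:
        raise ValueError("access size must divide evenly across channels")
    return [
        PhysicalRequest(pa=pa, nbytes=nbytes // n,
                        dst_node=f"{pe_prefix}.ch_r{ch}")
        for pa, ch in zip(pa_bases, channel_ids)
    ]


results = []
for total, pes in [(64, 8), (32, 8), (64, 4)]:
    n = channels_per_pe(total, pes)
    # Placeholder PA bases, one per channel (illustrative values only).
    reqs = fan_out(4096, [ch << 20 for ch in range(n)], list(range(n)),
                   "sip0.cube0.pe0")
    results.append((n, len(reqs), reqs[0].nbytes))
```

For a 4 KB access this yields 8 requests of 512 bytes, 4 requests of 1024 bytes, and 16 requests of 256 bytes respectively.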
+
+#### D-LA7. n:1 mode detail
+
+- One logical access → one aggregated request.
+- Target: aggregated router → hbm_ctrl (see ADR-0017 D8).
+- Aggregated link BW = `channels_per_pe × channel_bw_gbs`
+  (e.g. 8 × 32 = 256 GB/s).
+- Single queue / resource for modelling.
+- No per-channel PA decomposition.
+
+```text
+Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "n_to_one",
+    agg_pa_base: PA_agg,
+    agg_node_id: "sip0.cube0.pe0.agg_router",
+}
+
+BAAW resolve result:
+  → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
+
+PE_DMA: 1 sub-transaction
+  aggregated router → hbm_ctrl link (256 GB/s)
+```
+
+#### D-LA8. Kernel model preserved
+
+- Kernel still issues single memory ops (`tl.load`, `tl.store`,
+  `tl.composite`).
+- LA is the address scheme exposed to kernel code.
+- Channel decomposition / aggregation happens inside PE_DMA's BAAW.
+- Kernel code never sees physical channel information.
+
+#### Consequences (LA model, proposed)
+
+Positive:
+
+- 1:1 vs n:1 semantics live in one place (BAAW).
+- Kernel abstraction preserved — no kernel code changes.
+- Topology-based policy control (mode switch via yaml).
+- Improved simulation-model consistency and debuggability.
+- Segment-based mapping is simpler than page tables; lower overhead.
+
+Negative:
+
+- Full VA/MMU code refactor required.
+- Request-generation path more complex (N requests in 1:1 mode).
+- Reduced per-channel visibility in n:1 mode.
+- VA-related tests need rewriting.
+
+---
+
+## Migration Path
+
+- **PA → VA** was an extension. PA mode is retained as the PageFault
+  fallback inside PE_DMA. Switching does not require removing PA
+  code.
+- **VA → LA**, if adopted, is a replacement, not coexistence. See
+  D-LA1 for the VA infrastructure removal list. PA fallback inside
+  PE_DMA may be retained orthogonally for tests.
+
+## Alternatives Considered (LA model)
+
+1. **Keep VA + fan-out in MMU**: MMU returns per-channel PAs.
+   Rejected: MMU's role would grow beyond translation to request
+   decomposition; aggregation (n:1) becomes awkward to express.
+2. **Channel-aware kernel API**: kernels call per-channel load/store
+   directly. Rejected: abstraction leakage, portability loss, all
+   benchmarks need rewriting.
+3. **Always PA (no LA)**: runtime passes per-channel PA to kernel
+   directly. Rejected: incompatible with aggregation; conversion
+   timing unclear; channel info leaks to kernel.
+
+## Test Requirements
+
+### VA model (current, regression)
+
+- Cross-PE / cross-cube DMA paths over installed mappings.
+- `MmuMapMsg` / `MmuUnmapMsg` fabric traversal with measured latency.
+- TLB-overhead-per-access timing.
+- PageFault fallback path preserves PA-only behaviour.
+
+### LA model (when implemented)
+
+- 1:1 mode: same logical access → N per-channel requests.
+- n:1 mode: same logical access → 1 aggregated request.
+- Bandwidth equivalence between modes for identical workload.
+- 1:1 mode: per-channel contention modelled correctly.
+- n:1 mode: aggregated bandwidth correctly reflected.
+- Kernel code unchanged across mode switch.
+- BAAW segment install / uninstall correctness.
+- Multiple tensors in distinct segments do not collide.
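
The bandwidth-equivalence requirement can be stated as a small arithmetic check. This is a sketch under simplifying assumptions: no contention, no per-hop latency, decimal gigabytes (so 1 GB/s equals 1 byte/ns), and an access size that divides evenly across channels. The helper names are hypothetical; a real test would measure simulated time instead.

```python
def drain_ns(nbytes: int, bw_gbs: float) -> float:
    """Transfer time in ns; 1 GB/s equals 1 byte/ns (decimal GB)."""
    return nbytes / bw_gbs


def one_to_one_ns(nbytes: int, channels: int, channel_bw_gbs: float) -> float:
    # N sub-transactions run in parallel, so the slowest one sets the time.
    per_channel = nbytes // channels
    return max(drain_ns(per_channel, channel_bw_gbs) for _ in range(channels))


def n_to_one_ns(nbytes: int, channels: int, channel_bw_gbs: float) -> float:
    # One request over a single link with the summed bandwidth.
    return drain_ns(nbytes, channels * channel_bw_gbs)


def test_bandwidth_equivalence() -> None:
    for nbytes in (4096, 65536):
        assert one_to_one_ns(nbytes, 8, 32.0) == n_to_one_ns(nbytes, 8, 32.0)


test_bandwidth_equivalence()
```

With 8 channels at 32 GB/s, a 4 KB access drains in 16 ns in either mode: 512 bytes per channel at 32 GB/s, or 4096 bytes at 256 GB/s.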
+
+## Implementation Order (LA, when scheduled)
+
+1. LA type and allocator (`policy/address/la_allocator.py`).
+2. BAAW segment table (`policy/address/baaw.py`).
+3. `BaawSegmentInstallMsg` (`runtime_api/kernel.py`).
+4. PE_DMA BAAW integration (`components/builtin/pe_dma.py`
+   `handle_command()`).
+5. RuntimeContext: LA alloc + segment install
+   (`runtime_api/context.py`).
+6. `Tensor.va_base` → `Tensor.la_base` (`runtime_api/tensor.py`).
+7. Remove VA/MMU code.
+8. Remove `pe_mmu` from `topology.yaml`; add mapping mode settings.
+9. Test migration:
+
+| Test file | Action |
+|-----------|--------|
+| `tests/test_mmu_component.py` | Remove → BAAW segment install tests |
+| `tests/test_mmu_fabric.py` | Remove → BAAW + fabric integration tests |
+| `tests/test_pe_mmu.py` | Remove |
+| `tests/test_va_allocator.py` | Replace with LA allocator tests |
+| `tests/test_va_integration.py` | Replace with LA + BAAW integration tests |
+| `tests/test_va_offset.py` | Replace with LA offset tests |
+
+## Links
+
+- ADR-0007 (runtime_api vs sim_engine boundaries)
+- ADR-0008 (tensor deployment)
+- ADR-0009 (kernel execution)
+- ADR-0014 (PE-internal execution model)
+- ADR-0015 (component port/wire model)
+- ADR-0017 (Cube NOC and HBM connectivity — LA model topology consumer)
+- ADR-0013 (Verification strategy — V1 PA tagging)
+- SPEC R2 (latency by traversal), R10 (memory addressing)
diff --git a/docs/adr-ko/ADR-0012-api-host-io-message-schema.md b/docs/adr-ko/ADR-0012-api-host-io-message-schema.md
new file mode 100644
index 0000000..07d95c5
--- /dev/null
+++ b/docs/adr-ko/ADR-0012-api-host-io-message-schema.md
@@ -0,0 +1,233 @@
+# ADR-0012: Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
+
+## Status
+
+Accepted
+
+## Context
+
+Phase 0 uses a PA-first memory model (ADR-0011):
+
+- memory operations use device physical addresses (PA) only,
+- VA/MMU/IOMMU is not modeled.
+
+The host-facing runtime API interacts with the device via the IO_CPU endpoint.
+We define stable, minimal message schemas for Host ↔ IO_CPU so that:
+
+- benchmarks remain stable,
+- IO_CPU-internal fan-out/aggregation can evolve independently,
+- completion and failure propagation is deterministic.
+
+We also require PE-tagging (approach A): each shard explicitly carries (sip,cube,pe)
+so IO_CPU can deterministically route/fan-out without relying on PA decoding.
+
+---
+
+## Decision
+
+### D1. Contract scope
+
+This schema is the stable contract ONLY for Host ↔ IO_CPU.
+
+Messages beyond IO_CPU (to M_CPU, PE_CPU, schedulers, engines) are component-internal
+and are NOT part of this host contract in Phase 0.
+
+---
+
+### D2. Required message set
+
+The runtime API MUST use only these message types for Host ↔ IO_CPU:
+
+- MemoryWrite
+- MemoryRead
+- KernelLaunch
+
+All operations required by benchmarks (tensor init/copy, kernel run) MUST be expressible
+with these messages.
+
+---
+
+### D3. Common envelope (mandatory for all requests)
+
+All Host ↔ IO_CPU requests MUST include:
+
+- `msg_type: str`
+- `correlation_id: str`
+  - generated by the host
+  - used to match responses deterministically
+- `request_id: str`
+  - unique within a correlation_id
+- `target_device: str`
+  - device identifier (e.g., "sip:0")
+- `timestamp_tag: str | None` (optional)
+  - debug tag only; MUST NOT affect determinism
+
+All Host ↔ IO_CPU responses MUST include:
+
+- `correlation_id: str`
+- `request_id: str`
+- `completion: Completion`
+
+---
+
+### D4. Completion schema (mandatory)
+
+`Completion` MUST have:
+
+- `ok: bool`
+- `error_code: str | None`
+- `error_message: str | None`
+
+Rules:
+
+- If `ok == true` then `error_code` and `error_message` MUST be null.
+- If `ok == false` then `error_code` MUST be non-null.
+- Completion semantics MUST be deterministic.
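
One way to make the first two rules hard to violate is to check them at construction time. The sketch below is an assumption about how an implementation might do that; the schema itself only fixes the three fields and the rules, not the use of a dataclass or of `ValueError`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Completion:
    ok: bool
    error_code: Optional[str] = None
    error_message: Optional[str] = None

    def __post_init__(self) -> None:
        # Rule 1: a successful completion carries no error fields.
        if self.ok and (self.error_code is not None
                        or self.error_message is not None):
            raise ValueError("ok completion must not carry error fields")
        # Rule 2: a failed completion must name an error code.
        if not self.ok and self.error_code is None:
            raise ValueError("failed completion requires error_code")


success = Completion(ok=True)
failure = Completion(ok=False, error_code="E_ROUTE", error_message="no path")
```

The error code `E_ROUTE` is a made-up value for illustration; this ADR does not define an error-code vocabulary.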
+
+---
+
+### D5. MemoryWrite schema (PA-first, PE-tagged)
+
+`MemoryWrite` represents a host-initiated write/initialize operation to device memory.
+
+Mandatory fields:
+
+- common envelope fields (D3)
+- destination placement tags (approach A):
+  - `dst_sip: int`
+  - `dst_cube: int`
+  - `dst_pe: int`
+- `dst_pa: int`
+  - destination physical address in the destination PE's address space
+- `nbytes: int`
+- `src_kind: "pattern" | "host_buffer_ref"`
+  - Phase 0 MUST support "pattern"
+- `pattern: Pattern | None`
+  - required if `src_kind == "pattern"`
+
+`Pattern` (Phase 0 mandatory support):
+
+- `pattern_kind: "zero" | "fill_u8" | "fill_u16" | "fill_u32" | "fill_fp16" | "fill_fp32"`
+- `value: number | None`
+  - required for fill_*; ignored for zero
+
+Optional fields:
+
+- `dst_mem_kind: "HBM" | "TCM" | "AUTO"` (default "AUTO")
+- `debug_label: str | None`
+
+Notes:
+
+- This message MUST NOT embed bulk tensor data in Phase 0.
+- All latency MUST come from explicit graph traversal and modeled components.
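
The conditional requirements in D5 (`pattern` depends on `src_kind`, `value` depends on `pattern_kind`) are easy to get wrong, so a validator sketch may help. It assumes the message arrives as a plain `dict` and covers only the D5-specific fields; `memory_write_errors` is a hypothetical helper and the envelope checks from D3 are omitted.

```python
FILL_KINDS = {"fill_u8", "fill_u16", "fill_u32", "fill_fp16", "fill_fp32"}
REQUIRED = ("dst_sip", "dst_cube", "dst_pe", "dst_pa", "nbytes", "src_kind")


def memory_write_errors(msg: dict) -> list[str]:
    """Return schema violations for a MemoryWrite payload (empty if valid)."""
    errors = [f"missing field: {name}" for name in REQUIRED if name not in msg]
    if msg.get("src_kind") == "pattern":
        pattern = msg.get("pattern")
        if pattern is None:
            errors.append("pattern is required when src_kind == 'pattern'")
        else:
            kind = pattern.get("pattern_kind")
            if kind != "zero" and kind not in FILL_KINDS:
                errors.append(f"unknown pattern_kind: {kind!r}")
            elif kind in FILL_KINDS and pattern.get("value") is None:
                errors.append(f"value is required for {kind}")
    return errors


valid = {
    "dst_sip": 0, "dst_cube": 0, "dst_pe": 1, "dst_pa": 0x1000,
    "nbytes": 4096, "src_kind": "pattern",
    "pattern": {"pattern_kind": "fill_u8", "value": 7},
}
```

Returning a list of violations rather than raising on the first one is a design choice for this sketch; it lets a test assert on every missing field at once.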
+
+---
+
+### D6. MemoryRead schema (PA-first, PE-tagged)
+
+`MemoryRead` represents a host-initiated read from device memory.
+
+Mandatory fields:
+
+- common envelope fields (D3)
+- source placement tags (approach A):
+  - `src_sip: int`
+  - `src_cube: int`
+  - `src_pe: int`
+- `src_pa: int`
+- `nbytes: int`
+
+Optional fields:
+
+- `dst_kind: "host_sink" | "discard"` (default "host_sink")
+- `debug_label: str | None`
+
+Response payload:
+
+- actual bytes are NOT required in Phase 0 (latency/traces focus)
+- implementations MAY return lightweight stats or hashes later via a new ADR
+
+---
+
+### D7. KernelLaunch schema (PA-first, PE-tagged shards)
+
+`KernelLaunch` represents launching a kernel on a target device via IO_CPU.
+
+Mandatory fields:
+
+- common envelope fields (D3)
+- `kernel_ref: KernelRef`
+- `args: list[KernelArg]`
+
+`KernelRef` MUST have:
+
+- `name: str`
+- `kind: "deployed" | "builtin"`
+- `deploy_pa: int | None` — PA where kernel binary was deployed (required for "deployed")
+- `deploy_sip: int` — SIP where binary resides
+- `deploy_cube: int` — cube where binary resides
+- `deploy_pe: int` — PE where binary resides
+- `nbytes_code: int` — kernel binary size (for BW modeling)
+
+Kernel binaries MUST be pre-deployed to device memory via MemoryWrite.
+KernelLaunch MUST NOT embed kernel source code or IR in the launch message.
+
+`KernelArg` supports tensor args by PA mapping and scalars by value.
+
+Tensor arg (mandatory):
+
+- `arg_kind: "tensor"`
+- `tensor_pa_map: TensorPAMap`
+
+`TensorPAMap` MUST have:
+
+- `shards: list[TensorShard]`
+
+`TensorShard` MUST have (approach A, enforced):
+
+- `sip: int`
+- `cube: int`
+- `pe: int`
+- `pa: int`
+- `nbytes: int`
+- `offset_bytes: int`
+
+Scalar arg (mandatory):
+
+- `arg_kind: "scalar"`
+- `dtype: "i32" | "i64" | "fp16" | "fp32" | "bool"`
+- `value: number | bool`
+
+Optional KernelLaunch fields:
+
+- `grid: dict | None`
+- `meta: dict | None`
+- `failure_policy: "fail_fast" | "collect_all"` (default "fail_fast")
+- `debug_label: str | None`
+
+Notes:
+
+- KernelLaunch MUST NOT embed bulk tensor data.
+- KernelLaunch MUST be submitted only to the IO_CPU endpoint.
+- IO_CPU MUST fan-out work internally using the shard (sip,cube,pe) tags.
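
Because every shard carries explicit `(sip, cube, pe)` tags, the fan-out step needs no PA decoding. A minimal sketch of the grouping, assuming shards are already parsed into objects; `group_shards_by_pe` is a hypothetical IO_CPU-side helper, and sorting the keys is one possible way to keep fan-out order independent of input order.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorShard:
    sip: int
    cube: int
    pe: int
    pa: int
    nbytes: int
    offset_bytes: int


def group_shards_by_pe(
    shards: list[TensorShard],
) -> dict[tuple[int, int, int], list[TensorShard]]:
    """Bucket shards by their (sip, cube, pe) placement tags."""
    groups = defaultdict(list)
    for shard in shards:
        groups[(shard.sip, shard.cube, shard.pe)].append(shard)
    # Sorted keys give a deterministic fan-out order.
    return {key: groups[key] for key in sorted(groups)}


shards = [
    TensorShard(sip=0, cube=0, pe=1, pa=0x2000, nbytes=2048, offset_bytes=2048),
    TensorShard(sip=0, cube=0, pe=0, pa=0x1000, nbytes=2048, offset_bytes=0),
]
by_pe = group_shards_by_pe(shards)
```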
+
+---
+
+## Verification Notes
+
+Tests SHOULD validate:
+
+- schema validation rejects missing mandatory fields,
+- deterministic correlation/response matching,
+- MemoryWrite/Read/KernelLaunch produce explicit hop traces,
+- all routed requests incur latency > 0.
+
+---
+
+## Links
+
+- ADR-0011 (Memory Addressing — PA / VA / LA)
+- ADR-0007 (runtime_api vs sim_engine boundaries)
+- ADR-0009 (kernel execution fan-out/aggregation)
+- ADR-0013 (Verification strategy — V1 message schema validation)
+- SPEC R2, R7, R8
diff --git a/docs/adr-ko/ADR-0013-ver-verification_strategy.md b/docs/adr-ko/ADR-0013-ver-verification_strategy.md
new file mode 100644
index 0000000..dd7e958
--- /dev/null
+++ b/docs/adr-ko/ADR-0013-ver-verification_strategy.md
@@ -0,0 +1,139 @@
+# ADR-0013: Verification Strategy and Phase 1 Test Plan
+
+## Status
+
+Accepted
+
+## Context
+
+KernBench is a system-level simulator whose correctness is defined by:
+
+- adherence to SPEC-defined invariants,
+- determinism and debuggability,
+- explicit modeling of routing and latency.
+
+Given the evolving implementation, we need a stable verification strategy
+that prevents architectural drift while allowing incremental development.
+
+This ADR defines the Phase 1 verification plan and what constitutes
+"correct behavior" for early implementations.
+
+---
+
+## Decision
+
+### D1. Verification is contract-based
+
+Verification MUST be derived from:
+
+- SPEC requirements,
+- accepted ADRs.
+
+Tests MUST validate architectural contracts, not incidental implementation details.
+
+---
+
+### D2. Phase 1 verification scope
+
+Phase 1 verification focuses on:
+
+- message contract validity (ADR-0012),
+- routing and fan-out semantics at the IO_CPU boundary (ADR-0009),
+- PA-first memory addressing and shard tagging (ADR-0011),
+- core latency and trace invariants (SPEC 0.1, R2).
+
+Microarchitectural accuracy, bandwidth contention, and cycle-level behavior
+are explicitly out of scope in Phase 1.
+
+---
+
+### D3. Required Phase 1 verification cases
+
+The following verification cases MUST be supported by the implementation:
+
+#### V1. Message schema validation
+
+- KernelLaunch requests missing `(sip, cube, pe)` in any tensor shard MUST be rejected.
+- MemoryWrite/MemoryRead requests missing destination/source placement tags MUST be rejected.
+- Completion results MUST follow the `ok / error_code / error_message` contract.
+
+#### V2. IO_CPU fan-out and aggregation
+
+Given:
+
+- a topology with one SIP, one CUBE, and two PEs,
+- a KernelLaunch request containing two tensor shards targeting different PEs,
+
+The system MUST:
+
+- submit a single KernelLaunch to IO_CPU,
+- fan-out work internally to both PEs,
+- aggregate completion and return a single deterministic completion to the host.
+
+#### V3. Latency and trace invariants
+
+For any valid request:
+
+- the hop-by-hop trace MUST be non-empty,
+- total latency MUST be greater than zero,
+- repeated runs with identical inputs MUST produce identical traces.
+
+#### V4. Topology independence and cross-domain coverage
+
+Verification cases MUST pass for multiple topology shapes, including:
+
+- minimal: (1 SIP, 1 CUBE, 1 PE)
+- multi-PE: (1 SIP, 1 CUBE, N PEs)
+- multi-CUBE within a SIP: (1 SIP, M CUBEs, ≥1 PE per CUBE)
+- multi-SIP tray: (K SIPs, ≥1 CUBE per SIP, ≥1 PE per CUBE)
+
+For multi-CUBE and multi-SIP topologies, Phase 1 verification focuses on:
+
+- explicit connectivity (required links exist),
+- deterministic routing and control-path traversal,
+- non-empty traces and latency > 0 for representative cross-domain requests
+  (inter-CUBE and inter-SIP paths).
+
+Tests MUST NOT hardcode topology sizes, node ids, or link counts.
+Instead, tests MUST derive expectations from the compiled topology metadata.
+
+---
+
+### D4. Phase 1 artifacts
+
+Phase 1 MAY include:
+
+- verification-only test code,
+- topology fixtures,
+- trace inspection utilities.
+
+Phase 1 MUST NOT require:
+
+- production code changes solely to satisfy tests,
+- weakening or removing tests to allow progress.
+
+---
+
+### D5. Phase 2 enforcement
+
+Phase 2 (Apply) MUST:
+
+- run the Phase 1 verification cases,
+- rollback all changes if any verification fails,
+- preserve tests as authoritative contracts.
+
+---
+
+## Consequences
+
+- Architectural correctness is enforced early.
+- Tests serve as executable documentation of system behavior.
+- Implementation remains flexible without losing rigor.
+
+---
+
+## Links
+
+- SPEC 0.1, R2, R6
+- ADR-0011 (Memory Addressing — PA / VA / LA)
+- ADR-0012 (Host ↔ IO_CPU message schema)
+- ADR-0009 (Kernel execution semantics)
diff --git a/docs/adr-ko/ADR-0014-dev-pe-pipeline-execution-model.md b/docs/adr-ko/ADR-0014-dev-pe-pipeline-execution-model.md
new file mode 100644
index 0000000..ccb63f3
--- /dev/null
+++ b/docs/adr-ko/ADR-0014-dev-pe-pipeline-execution-model.md
@@ -0,0 +1,451 @@
+# ADR-0014: PE Pipeline Execution Model
+
+## Status
+
+Accepted
+
+## Context
+
+This ADR defines the PE-internal kernel execution model:
+
+- Role decomposition of PE-internal components
+- Command dispatch paths (simple / composite / multi-op composite with epilogue)
+- TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
+- TCM-centric dataflow with a register-file intermediary
+- Engine resource model
+- Observability and trace contract
+- Topology representation
+
+PE-internal structure (7 components in scope; 2 cross-referenced):
+
+- `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`,
+  `pe_tcm` — defined here
+- `pe_mmu` — VA model, defined in ADR-0011 D-VA
+- `pe_ipcq` — collective communication, defined in ADR-0023
+
+The goal is a deterministic, trace-friendly execution contract that keeps
+each block independently swappable.
+
+## Decision
+
+### D1. PE-internal component roles
+
+**PE_CPU**
+
+- Executes kernel instruction stream / control logic.
+- Generates PE commands and submits them to `PE_SCHEDULER` (via
+  `PeInternalTxn`).
+- Does NOT enqueue work directly into engine queues.
+
+**PE_SCHEDULER**
+
+- Sole dispatcher inside a PE.
+- Receives commands from `PE_CPU`. Dispatch by command type:
+  - Simple command (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`)
+    → forward directly to the target engine.
+  - `CompositeCmd` → generate a `TilePlan`, feed tiles into the pipeline
+    via a single `_feed_loop` (D6).
+- Does not participate in stage-to-stage chaining within a composite;
+  that is handled by token self-routing (D6).
+
+**PE_DMA**
+
+- Handles memory transfers between TCM and external memory domains
+  (HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
+- Two execution channels:
+  - `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1) — see D4.
+- Additional virtual channels:
+  - `vc_compute` — load/store/writeback traffic for GEMM/MATH tiles.
+  - `vc_comm` — IPCQ collective send data (defined in ADR-0023 D8).
+
+**PE_FETCH_STORE**
+
+- TCM ↔ Register File transfer unit.
+- Isolates register-file access semantics from compute engines so that
+  GEMM/MATH stay pure compute components.
+- BW-based latency model; TCM access contention naturally serializes
+  through `PE_TCM`'s BW resource.
+
+**PE_GEMM**
+
+- MAC array. Reads operands from the register file; writes results to
+  the register file. Does not touch `PE_TCM` directly.
+
+**PE_MATH**
+
+- Element-wise / reduction / SIMD unit. Reads / writes the register file.
+
+**PE_TCM**
+
+- Tightly-coupled scratchpad with BW-serialized access. Two logical
+  regions partitioned by ownership (see D5).
+
+**Cross-referenced components** (defined elsewhere):
+
+- `pe_mmu` — VA→PA translation per access (ADR-0011 D-VA).
+- `pe_ipcq` — collective ring buffers and peer endpoint metadata
+  (ADR-0023).
+
+### D2. Command lifecycle and queues
+
+`PE_SCHEDULER` maintains three logical structures:
+
+**SubmissionQueue** — written by `PE_CPU`; consumed by the scheduler.
+
+**InflightTable** — owned and mutated only by `PE_SCHEDULER`; tracks
+expanded sub-commands, dependency state, engine assignment, and
+completion status.
+
+**CompletionQueue** — written by `PE_SCHEDULER`; holds final completion
+records.
+
+**Single-writer rule**: only `PE_SCHEDULER` mutates command completion
+state. Engines report completion via explicit events / messages
+consumed by the scheduler.
+
+**Command completion**: when all sub-commands complete, `PE_SCHEDULER`
+publishes a completion record.
+
+### D3. Dispatch modes
+
+#### D3.1 Simple command
+
+A simple command expands to exactly one engine sub-command:
+
+- `DmaReadCmd` / `DmaWriteCmd` → `PE_DMA`
+- `GemmCmd` → `PE_GEMM`
+- `MathCmd` → `PE_MATH`
+
+Flow:
+
+```text
+PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
+       → completion → PE_SCHEDULER → CompletionQueue
+```
+
+#### D3.2 Composite command (single-op tiled pipeline)
+
+The default `CompositeCmd` runs a single compute op as a tile-pipelined
+sequence:
+
+```text
+DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE
+```
+
+`PE_SCHEDULER` splits the DMA payload into hardware tiles and emits one
+`TileToken` per tile with a monotonically increasing `tile_id`.
+
+Tile dependency (within one tile `t`):
+
+```text
+DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)
+```
+
+Inter-tile overlap is allowed wherever engine resources permit
+(D4 governs the constraints):
+
+```text
+DMA_READ(t+1) ∥ COMPUTE(t)
+DMA_WRITE(t-1) ∥ COMPUTE(t)
+```
+
+#### D3.3 Multi-op composite (head + epilogue with scope)
+
+A `CompositeCmd` MAY carry `ops: tuple[OpSpec, ...]` to express a
+multi-op pipeline:
+
+```python
+@dataclass(frozen=True)
+class OpSpec:
+    kind: str         # "gemm" | "math.exp" | "math.bias_add" | ...
+    scope: Scope      # "per_k_tile" | "per_output_tile" | "once"
+    ...
+```
+
+- `ops[0]` (head) defines tile geometry (e.g., the head GEMM determines
+  M/K/N partition).
+- `ops[1:]` (epilogue) are subsequent stages whose `scope` decides how
+  often they fire:
+  - `per_k_tile` — every K-reduction step.
+  - `per_output_tile` — once per output tile.
+  - `once` — once per kernel.
+
+Cross-engine chains (e.g., GEMM head → MATH epilogue) are natural —
+each stage is dispatched via token self-routing (D6), so GEMM and MATH
+participate serially within the same composite even though they share
+the compute slot (D4).
+
+The empty-`ops` form is the legacy single-op path.
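
To make the three scopes concrete, the sketch below counts how often an epilogue op would fire. It assumes a kernel that issues a single composite command and that every output tile is reduced over the same number of K tiles; `epilogue_fire_count` and its parameters are hypothetical names.

```python
def epilogue_fire_count(scope: str, output_tiles: int, k_tiles: int) -> int:
    """How often an epilogue op fires, given the head op's tile geometry."""
    if scope == "per_k_tile":
        # One firing per K-reduction step of every output tile.
        return output_tiles * k_tiles
    if scope == "per_output_tile":
        return output_tiles
    if scope == "once":
        return 1
    raise ValueError(f"unknown scope: {scope!r}")


counts = {
    scope: epilogue_fire_count(scope, output_tiles=4, k_tiles=8)
    for scope in ("per_k_tile", "per_output_tile", "once")
}
```

With 4 output tiles and 8 K tiles each, the counts are 32, 4, and 1.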
+
+### D4. Engine resource model
+
+**DMA engine**:
+
+- `DMA_READ`: `simpy.Resource(capacity=1)`.
+- `DMA_WRITE`: `simpy.Resource(capacity=1)`.
+- Both channels run concurrently (READ ∥ WRITE allowed).
+- Within a channel, requests serialize (READ ∥ READ disallowed; same
+  for WRITE).
+- `vc_comm` is an orthogonal channel for IPCQ traffic defined in
+  ADR-0023 D8 — out of scope for this ADR.
+
+**Compute engine**:
+
+- `accel_slot`: `simpy.Resource(capacity=1)` shared by `PE_GEMM` and
+  `PE_MATH`.
+- At most one compute op runs at a time within a PE.
+- Multi-op composite chains (D3.3) execute their compute stages serially
+  through this slot; token self-routing (D6) ensures the next stage
+  starts only after the previous compute releases the slot.
+
+**Engine completion**: each engine emits a completion event consumed by
+the scheduler / `PipelineContext` (D6).
+
+### D5. Dataflow
+
+**Input path (HBM source)**:
+
+```text
+HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
+PE_TCM → PE_FETCH_STORE → Register File
+Register File → PE_GEMM | PE_MATH
+```
+
+**Input path (shared SRAM source)**:
+
+```text
+Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
+PE_TCM → PE_FETCH_STORE → Register File
+```
+
+**Output path (HBM destination)**:
+
+```text
+Register File → PE_FETCH_STORE → PE_TCM
+PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM
+```
+
+GEMM/MATH never touch `PE_TCM` directly — `PE_FETCH_STORE` is the
+single TCM↔register-file gateway. This makes TCM BW contention
+explicit and lets fetch unit policies (e.g., prefetch) be replaced
+independently of compute engines.
+
+#### D5.1 PE_TCM partitioning
+
+`PE_TCM` is split into two logical regions:
+
+**SchedulerReservedTCM**
+
+- Owned exclusively by `PE_SCHEDULER`.
+- Holds composite-command tile buffers.
+- `PE_SCHEDULER` partitions this region, assigns buffers per DMA_READ /
+  COMPUTE / DMA_WRITE stage, guarantees input/output separation, and
+  manages tile-buffer lifetimes.
+
+**AllocatableTCM**
+
+- General-purpose region managed by `PEMemAllocator`.
+- Used for host / DP-visible allocations.
+
+**Visibility rule (hard isolation)**: `PEMemAllocator` MUST NOT see or
+allocate inside `SchedulerReservedTCM`. The reserved region is excluded
+from allocator-managed ranges by construction.
+
+**Tile buffer rules**:
+
+- Input and output buffers within `SchedulerReservedTCM` MUST NOT
+  overlap during a tile's active lifetime.
+- A tile buffer remains valid until the corresponding `DMA_WRITE`
+  completes.
+- Buffer reuse is permitted only after the consuming tile's lifetime
+  ends.
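
The isolation and non-overlap rules can be expressed as range arithmetic. The sketch below is illustrative only: `Region`, `partition_tcm`, the 2 MiB / 512 KiB sizes, and placing the reserved region at the bottom of TCM are all assumptions, not values taken from the topology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    base: int
    size: int

    @property
    def end(self) -> int:
        return self.base + self.size

    def overlaps(self, other: "Region") -> bool:
        return self.base < other.end and other.base < self.end


def partition_tcm(tcm_size: int, reserved_size: int) -> tuple[Region, Region]:
    """Split TCM into (SchedulerReservedTCM, AllocatableTCM)."""
    if not 0 < reserved_size < tcm_size:
        raise ValueError("reserved size must leave room for both regions")
    return Region(0, reserved_size), Region(reserved_size, tcm_size - reserved_size)


reserved, allocatable = partition_tcm(tcm_size=2 << 20, reserved_size=512 << 10)
# An allocator built over `allocatable` only cannot hand out addresses
# inside `reserved`: the range is excluded by construction.
tile_in = Region(reserved.base, 64 << 10)
tile_out = Region(tile_in.end, 64 << 10)
```

Here the input and output tile buffers are adjacent, so they satisfy the non-overlap rule while both stay inside the reserved region.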
+
+### D6. TileToken self-routing pipeline
+
+A composite's stage-to-stage progression happens **without** routing
+through the scheduler. Each component forwards the token directly to
+the next stage's component using the token's `plan`:
+
+```text
+Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
+              ↑ chaining: no scheduler hop                          ↑
+                                                  PipelineContext.complete_tile()
+```
+
+This mirrors real-HW done-wire chains. The scheduler handles only
+**initial dispatch + completion aggregation**.
+
+#### TilePlan / Stage
+
+```python
+class StageType(Enum):
+    DMA_READ = 0
+    FETCH = 1
+    GEMM = 2
+    MATH = 3
+    STORE = 4
+    DMA_WRITE = 5
+
+@dataclass(frozen=True)
+class Stage:
+    stage_type: StageType
+    component: str         # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
+    params: dict           # stage-specific parameters
+
+@dataclass(frozen=True)
+class TilePlan:
+    tile_id: int
+    stages: tuple[Stage, ...]
+```
+
+#### TileToken
+
+```python
+@dataclass
+class TileToken:
+    tile_id: int
+    pipeline_ctx: PipelineContext
+    plan: TilePlan
+    stage_idx: int
+    params: dict             # cached current stage params
+    data_op: bool = True     # op_log opt-in (ADR-0020 D4)
+```
+
+Single-owner invariant: a token is owned by exactly one component at a
+time. Lifecycle: scheduler creates with `stage_idx=0` → component
+`_process()` → increment `stage_idx` → put to next stage's `in_port` →
+last stage calls `pipeline_ctx.complete_tile()`.
+
+#### PipelineContext (exactly-once completion)
+
+```python
+@dataclass
+class PipelineContext:
+    id: str
+    total_tiles: int
+    completed_tiles: int = 0
+    done_event: simpy.Event = None
+
+    def complete_tile(self) -> None:
+        self.completed_tiles += 1
+        if self.completed_tiles == self.total_tiles:
+            self.done_event.succeed()
+```
+
+Each tile's last stage MUST call `complete_tile()` exactly once.
+Duplicate calls are bugs (SimPy `Event` can succeed at most once).
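
A counter alone reports a duplicate only when it pushes past the final tile; a duplicate that arrives earlier would fire `done_event` while a tile is still in flight. One possible hardening, shown here as a sketch, is to track tile IDs. `GuardedPipelineContext`, the `tile_id` argument to `complete_tile()`, and `OneShotEvent` (a stand-in so the example runs without SimPy) are assumptions, not the ADR's API.

```python
from dataclasses import dataclass, field


class OneShotEvent:
    """Stand-in for simpy.Event: may be triggered at most once."""

    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        if self.triggered:
            raise RuntimeError("event already triggered")
        self.triggered = True


@dataclass
class GuardedPipelineContext:
    """Variant of PipelineContext that rejects duplicate tile completions."""
    total_tiles: int
    done_event: OneShotEvent = field(default_factory=OneShotEvent)
    _completed: set = field(default_factory=set)

    def complete_tile(self, tile_id: int) -> None:
        if tile_id in self._completed:
            raise RuntimeError(f"tile {tile_id} completed twice")
        self._completed.add(tile_id)
        if len(self._completed) == self.total_tiles:
            self.done_event.succeed()


ctx = GuardedPipelineContext(total_tiles=2)
ctx.complete_tile(0)
```

The cost is one set entry per tile, which seems acceptable for a simulator where catching the bug at the offending call site matters more than memory.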
+
+#### Feed ordering
+
+`PE_SCHEDULER` has exactly one `_feed_loop` process consuming a
+`_pending_feeds` FIFO. Composite commands are enqueued in submission
+order; tile feed for a command runs to completion before the next
+command's feed begins. **Tile-feed interleaving between commands is
+disallowed.**
+
+Within a single command's tiles, downstream pipeline overlap arises
+naturally — earlier tiles progress through later stages while the feeder
+keeps pushing remaining tiles into the first stage queue (SimPy Store
+backpressure governs flow control). If the first-stage queue is full,
+only the feeder blocks; the scheduler worker's inbox processing
+continues.
+
+#### Token routing pattern (base class)
+
+```python
+def _pipeline_worker(self, env):
+    while True:
+        token = yield self._inbox.get()
+        yield from self._process(env, token)       # stage-specific logic
+        next_idx = token.stage_idx + 1
+        if next_idx < len(token.plan.stages):
+            next_stage = token.plan.stages[next_idx]
+            token.stage_idx = next_idx
+            token.params = next_stage.params
+            yield self.out_ports[next_stage.component].put(token)
+        else:
+            token.pipeline_ctx.complete_tile()
+```
+
+Each component implements only `_process()`; chaining lives in the
+base class.
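
The routing rule can be exercised without SimPy. In the sketch below, plain FIFO inboxes replace SimPy stores, so timing and backpressure are not modelled; only the "advance `stage_idx`, forward to the next component" rule is. `Token` and `run_pipeline` are simplified, hypothetical stand-ins for `TileToken` and the per-component workers.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Token:
    tile_id: int
    stages: tuple[str, ...]   # component name per stage, in plan order
    stage_idx: int = 0


def run_pipeline(tokens: list[Token]) -> list[tuple[int, str]]:
    """Walk tokens through their plans; returns (tile_id, component) visits."""
    inboxes: dict[str, deque] = {}
    for tok in tokens:                       # scheduler: initial dispatch only
        inboxes.setdefault(tok.stages[0], deque()).append(tok)
    visits = []
    progressed = True
    while progressed:
        progressed = False
        for name in list(inboxes):           # each component takes one turn
            if inboxes[name]:
                tok = inboxes[name].popleft()
                visits.append((tok.tile_id, name))
                tok.stage_idx += 1
                if tok.stage_idx < len(tok.stages):
                    # Forward directly to the next stage: no scheduler hop.
                    nxt = tok.stages[tok.stage_idx]
                    inboxes.setdefault(nxt, deque()).append(tok)
                progressed = True
    return visits


plan = ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_fetch_store", "pe_dma")
visits = run_pipeline([Token(0, plan), Token(1, plan)])
```

Each tile visits the components in plan order even though the same component (`pe_dma`, `pe_fetch_store`) appears at more than one stage, because the token carries its own position in the plan.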
+
+### D7. Observability and trace contract
+
+The simulator emits deterministic trace events:
+
+- `command_submitted`
+- `sub_command_dispatched`
+- `engine_start`
+- `engine_complete`
+- `tile_ready`
+- `command_complete`
+
+For identical inputs, trace ordering MUST be deterministic.
+
+### D8. Topology representation
+
+PE-internal components are declared in `cube.pe_template`:
+
+```yaml
+pe_template:
+  components:
+    pe_cpu:         { kind: pe_cpu,         impl: builtin.pe_cpu,         attrs: { overhead_ns: ... } }
+    pe_scheduler:   { kind: pe_scheduler,   impl: builtin.pe_scheduler,   attrs: { overhead_ns: ... } }
+    pe_dma:         { kind: pe_dma,         impl: builtin.pe_dma,         attrs: { rd_engines: 1, wr_engines: 1 } }
+    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
+    pe_gemm:        { kind: pe_gemm,        impl: builtin.pe_gemm,        attrs: { shared_resource: accel_slot, ... } }
+    pe_math:        { kind: pe_math,        impl: builtin.pe_math,        attrs: { shared_resource: accel_slot, ... } }
+    pe_tcm:         { kind: pe_tcm,         impl: builtin.pe_tcm,         attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
+    pe_mmu:         { kind: pe_mmu,         impl: builtin.pe_mmu,         attrs: { ... } }   # ADR-0011 D-VA
+    pe_ipcq:        { kind: pe_ipcq,        impl: builtin.pe_ipcq,        attrs: { ... } }   # ADR-0023
+  links:
+    # Scheduler dispatch edges (initial)
+    scheduler_to_dma_mm:         0.0
+    scheduler_to_fetch_store_mm: 0.0
+    scheduler_to_gemm_mm:        0.0
+    scheduler_to_math_mm:        0.0
+    # Pipeline chaining edges (token self-routing per D6)
+    dma_to_fetch_store_mm:       0.0
+    fetch_store_to_gemm_mm:      0.0
+    fetch_store_to_math_mm:      0.0
+    gemm_to_fetch_store_mm:      0.0
+    gemm_to_math_mm:             0.0
+    math_to_fetch_store_mm:      0.0
+    fetch_store_to_dma_mm:       0.0
+    fetch_store_to_tcm_bw_gbs:   ...
+```
+
+Template is instantiated once per PE. PE instances are derived from
+`cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔
+cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).
+
+## Consequences
+
+### Positive
+
+- Each block is an independent topology node — individually swappable
+  via DI (ADR-0015).
+- PE-internal structure is visible in the topology graph.
+- Components do not know their downstream — plan-based routing gives
+  flexibility (e.g., epilogue chains require no scheduler change).
+- DMA and compute overlap naturally via SimPy Store backpressure.
+- Multi-op composite expresses fused operations (e.g., GEMM + bias_add)
+  without engine-level coupling.
+- TCM access contention is realistic — `PE_FETCH_STORE` is the single
+  TCM↔RF gateway.
+
+### Negative
+
+- Intra-PE component count is higher than a coarser model (7 base + 2
+  cross-referenced) — more topology nodes/edges.
+- Intra-PE token forwarding is explicit in traces (acceptable trade for
+  HW fidelity).
+
+## Links
+
+- ADR-0011 D-VA (PE_MMU component, VA translation)
+- ADR-0015 D4 (component port/wire model)
+- ADR-0020 (greenlet kernel execution / two-pass)
+- ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
+- SPEC R3, R4
diff --git a/docs/adr-ko/ADR-0015-dev-component-port-wire-model.md b/docs/adr-ko/ADR-0015-dev-component-port-wire-model.md
new file mode 100644
index 0000000..5f999af
--- /dev/null
+++ b/docs/adr-ko/ADR-0015-dev-component-port-wire-model.md
@@ -0,0 +1,202 @@
+# ADR-0015: Component Port/Wire Model and Fabric Routing
+
+## Status
+
+Accepted
+
+## Context
+
+Realistic hardware modeling — queues, contention, fan-out — requires
+that components own fabric traversal while the simulation engine
+handles only initialization and completion observation. Direct method
+calls between components, or path-walking inside the engine, defeat
+queueing and contention semantics.
+
+This ADR defines:
+
+- how components communicate via typed port queues,
+- how propagation delay is modeled (wire processes with BW occupancy),
+- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch
+  (via M_CPU),
+- the engine's reduced role (wire init + completion observation only),
+- M_CPU.DMA as an internal subcomponent of M_CPU.
+
+---
+
+## Decision
+
+### D1. Component port model
+
+Each component has typed input/output ports modeled as SimPy Stores:
+
+```text
+in_ports:  dict[str, simpy.Store]   # keyed by source node_id
+out_ports: dict[str, simpy.Store]   # keyed by destination node_id
+```
+
+Ports are created at engine initialization based on graph edges.
+Each directed edge (src → dst) results in:
+
+- `src.out_ports[dst]`  — the sending end
+- `dst.in_ports[src]`   — the receiving end
+
+---
+
+### D2. Wire process (propagation delay + BW occupancy)
+
+For each directed edge (src, dst) in the topology graph, a SimPy wire process
+models propagation delay and BW occupancy:
+
+```python
+def wire_process(env, out_port, in_port, delay_ns, bw_gbs):
+    available_at = 0.0
+    while True:
+        cmd = yield out_port.get()
+        if bw_gbs > 0:
+            nbytes = getattr(cmd, "nbytes", 0)
+            if nbytes > 0:
+                wait = available_at - env.now
+                if wait > 0:
+                    yield env.timeout(wait)
+                available_at = env.now + (nbytes / bw_gbs)
+        yield env.timeout(delay_ns)
+        yield in_port.put(cmd)
+```
+
+Wire processes are started at engine initialization.
+Each directed edge maintains an `available_at` timestamp tracking when the link
+becomes free for the next transaction. When a transaction occupies a link, the
+next transaction on the same directed link must wait until occupancy clears
+(back-to-back serialization). TX and RX directions are independent (separate
+wire processes with separate `available_at` state).
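+
+As a worked illustration of the serialization rule, the sketch below replays
+the same arithmetic in plain Python, without SimPy. It assumes the destination
+port never blocks, that transactions are listed in arrival order, and that
+`nbytes / bw_gbs` is read as nanoseconds (1 GB/s = 1 byte/ns).
+`wire_delivery_times` is a hypothetical helper, not part of the simulator.
+
+```python
+def wire_delivery_times(txns, delay_ns, bw_gbs):
+    """Return delivery times (ns) for (arrival_ns, nbytes) pairs on one directed link."""
+    available_at = 0.0   # time at which the link is free for the next transaction
+    now = 0.0            # local clock of the single wire process
+    delivered = []
+    for arrival_ns, nbytes in txns:
+        # out_port.get() completes once the command exists and the process is idle
+        now = max(now, arrival_ns)
+        if bw_gbs > 0 and nbytes > 0:
+            now = max(now, available_at)           # wait for occupancy to clear
+            available_at = now + nbytes / bw_gbs   # occupy the link
+        now += delay_ns                            # propagation delay
+        delivered.append(now)
+    return delivered
+
+
+print(wire_delivery_times([(0.0, 1024), (0.0, 1024)], delay_ns=2.0, bw_gbs=128.0))
+```
+
+With these inputs, each 1024-byte transaction occupies the link for 8 ns. The
+first is delivered at 2.0 ns; the second waits for occupancy to clear at 8 ns
+and is delivered at 10.0 ns.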
+
+---
+
+### D3. Engine role (reduced)
+
+The simulation engine MUST:
+
+- wire components at initialization (create port Stores, start wire processes),
+- identify the entry component for each request type (PCIE_EP),
+- put the request into the entry component's in_port,
+- wait for a completion event.
+
+The simulation engine MUST NOT:
+
+- walk the topology path during request execution,
+- call component `run()` methods directly,
+- track per-hop latency or decompose fan-out.
+
+---
+
+### D4. Fabric paths for Memory R/W and Kernel Launch
+
+Memory R/W and Kernel Launch use **different** fabric paths.
+Memory operations bypass M_CPU and route directly to HBM via the crossbar.
+Kernel Launch routes through M_CPU for PE fan-out.
+
+**Memory R/W forward path (pcie_ep → hbm_ctrl, M_CPU bypass):**
+
+```text
+pcie_ep → io_noc → io_ucie
+  → [transit cubes: ucie_in → noc → ucie_out]  (zero or more)
+  → target cube: ucie_in → router mesh → hbm_ctrl
+```
+
+**Memory R/W completion path:**
+
+```text
+hbm_ctrl → router mesh → [transit cubes: ucie → router mesh → ucie]
+  → io_ucie → io_noc → pcie_ep
+```
+
+**Kernel Launch forward path (pcie_ep → io_cpu → M_CPU → PE):**
+
+```text
+pcie_ep → io_noc → io_cpu → io_noc → io_ucie
+  → [transit cubes: ucie_in → noc → ucie_out]  (zero or more)
+  → target cube: ucie_in → noc → M_CPU → PE[0..n] (parallel fan-out)
+```
+
+**Kernel Launch completion path:**
+
+```text
+PE[0..n] all complete → M_CPU (aggregation)
+  → noc → [transit cubes: ucie → noc → ucie]
+  → io_ucie → io_noc → io_cpu → io_noc → pcie_ep
+```
+
+**Rationale for M_CPU bypass on Memory R/W:**
+
+Memory write/read operations do not require command interpretation or PE
+dispatch — they are direct data transfers to/from HBM. Routing through M_CPU
+would add unnecessary overhead (5ns) without functional benefit. The io_noc
+inside the IO chiplet handles the routing decision: memory operations go
+directly to cube fabric, while kernel launches are forwarded to io_cpu first.
+
+---
+
+### D5. M_CPU.DMA is an internal subcomponent of M_CPU
+
+M_CPU.DMA is NOT a separate topology node.
+It is an internal subcomponent owned by the M_CPU component implementation.
+
+M_CPU.DMA:
+
+- owns the DMA READ and DMA WRITE queues (capacity=1 each, per ADR-0014 D4),
+- issues memory requests over the NOC to hbm_ctrl,
+- receives completion from hbm_ctrl via the NOC,
+- reports completion to M_CPU,
+- is created and managed inside M_CPU's `__init__` and `run()`.
+
+M_CPU.DMA does not appear as a node in the compiled topology graph.
+
+---
+
+### D6. Transit cube forwarding
+
+A cube that is not the target of a memory or kernel request acts as a transit node.
+Transit cubes forward requests without consuming them:
+
+```text
+ucie_in (from upstream) → noc → ucie_out (to downstream)
+```
+
+Transit forwarding is implemented entirely within the ucie_in component.
+The noc and ucie_out components in a transit cube forward the packet without modification.
+
+---
+
+### D7. _formula_latency is preserved as a lower-bound cross-check
+
+The path-based formula latency function (`_formula_latency`) is preserved in the engine
+as a lower bound for correctness verification.
+
+Invariant:
+
+- Phase 0: `_formula_latency == component model total_ns`
+- Phase 1+: `_formula_latency <= component model total_ns` (contention adds queueing)
+
+This function is independent of the port/wire model and requires only the topology graph.
+It is used for shard comparison in `_route_kernel` and as a regression guard.
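+
+A minimal sketch of how the invariant could be asserted in a regression test.
+`check_formula_bound` and its floating-point tolerance are assumptions made
+for illustration; the engine's actual check may differ.
+
+```python
+import math
+
+
+def check_formula_bound(formula_ns, model_ns, phase, rel_tol=1e-9):
+    """Return True when the D7 invariant holds for one request."""
+    equal = math.isclose(formula_ns, model_ns, rel_tol=rel_tol)
+    if phase == 0:
+        # Phase 0: no contention, so both values must agree.
+        return equal
+    # Phase 1+: contention may only add latency on top of the formula.
+    return equal or formula_ns < model_ns
+```
+
+For example, `check_formula_bound(100.0, 120.0, phase=1)` holds, while the
+same pair under `phase=0` does not.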
+
+---
+
+## Consequences
+
+- Components model realistic hardware behavior (queues, contention, fan-out).
+- Propagation delay is modeled accurately per edge.
+- Engine is decoupled from routing policy.
+- Component implementations remain swappable via DI (ADR-0007 D3).
+
+---
+
+## Links
+
+- ADR-0007 D2 (engine role boundary)
+- ADR-0009 D3 (kernel execution fan-out hierarchy)
+- ADR-0014 D4 (DMA engine capacity=1)
+- ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
+- ADR-0016 (IOChiplet NOC and memory data path)
+- ADR-0017 (cube NOC 2D mesh architecture)
+- ADR-0033 (Latency model assumptions built on these mechanisms)
diff --git a/docs/adr-ko/ADR-0016-dev-iochiplet-noc-and-memory-path.md b/docs/adr-ko/ADR-0016-dev-iochiplet-noc-and-memory-path.md
new file mode 100644
index 0000000..cb1e281
--- /dev/null
+++ b/docs/adr-ko/ADR-0016-dev-iochiplet-noc-and-memory-path.md
@@ -0,0 +1,98 @@
+# ADR-0016: IOChiplet NOC and Memory Data Path
+
+## Status
+
+Accepted
+
+## Context
+
+ADR-0003 D2 defines IO chiplets as SIP-level components providing PCIe-EP and
+IO_CPU interfaces, but does not specify internal routing within the IO chiplet.
+ADR-0015 D4 was updated to document the M_CPU bypass for Memory R/W, but the
+IO chiplet's internal NOC architecture that enables this routing was not
+formally documented.
+
+The IO chiplet needs an internal routing fabric (io_noc) to:
+
+- connect pcie_ep, io_cpu, and per-cube UCIe PHY ports
+- route memory operations (MemoryWrite/Read) directly to cube fabric without
+  passing through io_cpu
+- route kernel launch commands through io_cpu for command interpretation
+
+## Decision
+
+### D1. IOChiplet internal NOC (io_noc)
+
+Each IO chiplet instance contains an internal NOC node (`io_noc`) that connects:
+
+- `pcie_ep` — host-facing PCIe endpoint
+- `io_cpu` — command processor for kernel launch interpretation
+- `io_ucie-{PHY}.conn{N}` — per-PHY connection nodes to cube UCIe ports
+
+The io_noc is a forwarding-only fabric (`forwarding_v1` implementation) with
+zero overhead. All routing decisions are made by the simulation engine based
+on message type, not by io_noc itself.
+
+### D2. IOChiplet UCIe decomposition
+
+Each IO chiplet PHY port is decomposed into:
+
+- `io_ucie-{PHY}` — the UCIe protocol endpoint (overhead = 8ns)
+- `io_ucie-{PHY}.conn{N}` — N connection nodes between io_noc and io_ucie
+
+This mirrors the cube-side UCIe decomposition (ADR-0017 D6) and allows
+multiple independent NOC-to-UCIe connections per PHY.
+
+### D3. Memory R/W path (M_CPU bypass)
+
+Memory operations (MemoryWrite, MemoryRead) are routed directly from pcie_ep
+through io_noc to the target cube, bypassing io_cpu entirely:
+
+```text
+pcie_ep → io_noc → conn → io_ucie → [cube UCIe] → router mesh → hbm_ctrl
+```
+
+This avoids the 10ns io_cpu overhead for pure data transfers. The simulation
+engine's `_process_memory_direct()` method uses `find_memory_path()` which
+resolves the shortest path from pcie_ep to the target HBM node.
+
+### D4. Kernel Launch path (via io_cpu)
+
+Kernel launch commands require io_cpu for command interpretation and PE
+fan-out setup:
+
+```text
+pcie_ep → io_noc → io_cpu → io_noc → conn → io_ucie → [cube UCIe]
+  → noc → m_cpu → PE
+```
+
+The engine's `_entry_points()` method routes KernelLaunchMsg through both
+pcie_ep (entry) and io_cpu (command processing).
+
+### D5. IOChiplet-to-cube port mapping
+
+Each IO chiplet instance declares which cube ports it connects to:
+
+```yaml
+cube_ports:
+  - { cube: {xy: [0,0]}, cube_side: N, phy: P0, distance_mm: 2.0 }
+  - { cube: {xy: [1,0]}, cube_side: N, phy: P1, distance_mm: 2.0 }
+```
+
+The topology builder creates edges from io_ucie PHY nodes to the
+corresponding cube UCIe port nodes, with the specified distance and
+the IO chiplet's `per_connection_bw_gbs` as link bandwidth.
+
+## Consequences
+
+- IO chiplet has a well-defined internal routing fabric
+- Memory operations avoid unnecessary io_cpu overhead
+- Kernel launch commands still get proper command interpretation
+- The io_noc pattern is consistent with cube-level NOC design
+- ADR-0003 D2 is extended (not contradicted) by this ADR
+
+## Links
+
+- ADR-0003 D2 (IO chiplet definition)
+- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
+- ADR-0012 D1 (host-to-IO_CPU message schema)
diff --git a/docs/adr-ko/ADR-0017-dev-cube-noc-and-hbm-connectivity.md b/docs/adr-ko/ADR-0017-dev-cube-noc-and-hbm-connectivity.md
new file mode 100644
index 0000000..c442dde
--- /dev/null
+++ b/docs/adr-ko/ADR-0017-dev-cube-noc-and-hbm-connectivity.md
@@ -0,0 +1,291 @@
+# ADR-0017: Cube NOC and HBM Connectivity
+
+## Status
+
+Accepted
+
+## Context
+
+The CUBE-level NOC is a 2D router mesh that carries every intra-cube
+request: PE-to-HBM data, PE-to-PE traffic, command paths
+(M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic.
+
+The CUBE's HBM is exposed through per-PE controller endpoints attached
+to PE routers. This per-PE partitioning makes local-vs-remote HBM
+distinguishable by mesh distance: a PE's own HBM partition sits at its
+own router (switching overhead only); another PE's HBM partition is
+reachable by mesh hops to that PE's router.
+
+Two channel-mapping modes are supported in the design space:
+
+- **n:1 (default, implemented)** — each PE's HBM partition aggregates
+  `channels_per_pe` pseudo-channels into one endpoint. Effective
+  per-PE BW = N × per-channel BW.
+- **1:1 (future)** — each PE router decomposes into per-channel
+  mini-routers; per-channel BW contention is modeled directly.
+
+In both modes the per-PE effective BW is identical; only the connectivity
+granularity differs.
+
+## Decision
+
+### D1. 2D router mesh
+
+Each cube contains a 2D mesh of NOC routers generated by `mesh_gen.py`.
+
+- Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`).
+- Implementation: `forwarding_v1`. NOC `overhead_ns = 0`.
+- Default 6×6 grid (sized from PE corner placement + UCIe attachment
+  count); larger PE counts scale the grid up.
+- HBM exclusion zone: center rows/columns are excluded where HBM die
+  physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6).
+- Latency = Manhattan distance × `ns_per_mm`.
+
+### D2. XY routing algorithm
+
+Deterministic XY routing:
+
+1. Horizontal segment: travel along the source row (source Y) from source X
+   to destination X.
+2. Vertical segment: travel along the destination column (destination X) from
+   source Y to destination Y.
+
+Each directed segment carries a unique key:
+
+- Horizontal: `("H", y_band, x_min, x_max, direction)`
+- Vertical:   `("V", x_band, y_min, y_max, direction)`
+
+Grid positions are snapped to the router grid, excluding the HBM zone.
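+
+The two steps and the segment keys can be sketched as follows. This is an
+illustration under stated assumptions: coordinates are integer `(x, y)` router
+indices, the band is the row or column index, direction is encoded as `"+"` or
+`"-"`, and snapping around the HBM zone is omitted. `xy_segments` is a
+hypothetical name.
+
+```python
+def xy_segments(src, dst):
+    """Return the directed segment keys for an XY route from src to dst."""
+    (sx, sy), (dx, dy) = src, dst
+    segments = []
+    if sx != dx:
+        # Horizontal leg along the source row.
+        direction = "+" if dx > sx else "-"
+        segments.append(("H", sy, min(sx, dx), max(sx, dx), direction))
+    if sy != dy:
+        # Vertical leg along the destination column.
+        direction = "+" if dy > sy else "-"
+        segments.append(("V", dx, min(sy, dy), max(sy, dy), direction))
+    return segments
+```
+
+Routing from `(0, 0)` to `(4, 1)` yields `("H", 0, 0, 4, "+")` followed by
+`("V", 4, 0, 1, "+")`. Two transactions contend only when they produce the
+same key.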
+
+### D3. Per-segment contention model
+
+Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
+sharing a segment (same row or column band, same direction) contend for
+the resource — modelling link-level serialization in a wormhole-routed
+mesh.
+
+With no contention, NOC traversal latency equals Manhattan distance ×
+`ns_per_mm`. Under contention, SimPy's resource scheduling adds queueing
+delay.
+
+### D4. NOC attachment points (per-PE HBM partition)
+
+Every PE router carries three attachments: `pe{idx}.dma`, `pe{idx}.cpu`,
+and `pe{idx}.hbm`. The last is the per-PE HBM controller endpoint —
+`sip{S}.cube{C}.hbm_ctrl.pe{idx}` — which owns one slice of the cube's
+HBM (one pseudo-channel group; see D8).
+
+Other attachments:
+
+- M_CPU and shared SRAM each occupy a dedicated edge router.
+- UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed
+  along that edge (see D6).
+
+```text
+                    UCIe-N (conn x4)
+                         |
+           +---------+---+---+---------+
+           |         |       |         |
+PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
+PE0.cpu <--+ +hbm.pe0|       | +hbm.pe2+--< PE2.cpu
+           |         |       |         |
+UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
+(conn x4)  |         | zone  |         |  (conn x4)
+           |  r2c0   |       |         |
+M_CPU <--->+         |       |         |
+           |  r3c0   |       |         |
+SRAM <---->+         |       |         |
+           |         |       |         |
+PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
+PE4.cpu <--+ +hbm.pe4|       | +hbm.pe6+--< PE6.cpu
+           |         |       |         |
+           +---------+---+---+---------+
+                         |
+                    UCIe-S (conn x4)
+```
+
+Per-PE HBM partitioning is the key invariant that makes local vs
+cross-PE HBM distinguishable by mesh distance (see D7).
+
+### D5. NOC edge bandwidths and distances
+
+| Connection                    | BW (GB/s)  | Distance      | Notes                                       |
+| ----------------------------- | ---------- | ------------- | ------------------------------------------- |
+| PE_DMA → NOC                  | 256.0      | Physical (PE) | Matches local-HBM aggregate BW              |
+| NOC → PE_CPU                  | —          | 0.0 mm        | Command path only                           |
+| Router ↔ hbm_ctrl.pe{idx}     | 256.0      | 0.0 mm        | Per PE router; N × per-channel BW (see D8)  |
+| NOC ↔ M_CPU                   | —          | 0.0 mm        | Command path                                |
+| NOC ↔ SRAM                    | 128.0 × 4  | 0.0 mm        | 512 GB/s aggregate                          |
+| NOC ↔ UCIe conn               | 128.0      | 0.0 mm        | Per connection; 4 conn per port             |
+
+`0.0 mm` distances reflect the distributed nature of the NOC; actual
+traversal distance is computed via Manhattan distance within the router
+grid.
+
+### D6. UCIe decomposition and inter-cube traffic
+
+Each of the 4 UCIe ports (N, S, E, W) decomposes into:
+
+- 1 `ucie-{PORT}` node: UCIe protocol endpoint (`overhead = 8.0 ns`).
+- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe.
+
+This decomposition gives 4 independent NOC↔UCIe connections per port,
+each with 128 GB/s bandwidth (512 GB/s aggregate per port).
+
+Inter-cube traffic path:
+
+```text
+Source: PE_DMA → NOC → conn{i} → ucie-{PORT}
+                  [UCIe link: 512 GB/s, 1.0mm seam distance]
+Target: ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx}
+```
+
+UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full
+crossing incurs 16 ns (TX port + RX port).
+
+### D7. Data paths through the NOC
+
+All intra-cube traffic uses the same router mesh — no separate fast
+paths.
+
+**Local HBM** (same PE's own partition; 0 mesh hops):
+
+```text
+PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx}   (switching overhead only)
+```
+
+**Cross-PE HBM within cube** (target PE's partition, reached by mesh):
+
+```text
+PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'}
+```
+
+Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 on `r1c4`):
+
+```text
+PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2
+```
+
+Dijkstra computes the shortest path within the mesh.
+
+**Cross-cube HBM** (UCIe traversal):
+
+```text
+PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn
+       → r{x'}c{y'} → hbm_ctrl.pe{idx'}
+```
+
+**Kernel launch command to PE**:
+
+```text
+[from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU
+```
+
+**Shared SRAM access**:
+
+```text
+PE_DMA → r{x}c{y} → (mesh) → SRAM
+```
+
+### D8. HBM channel mapping mode
+
+Channel mapping is configured at cube scope:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one       # one_to_one | n_to_one
+    hbm_pseudo_channels: 64          # total pseudo-channel count
+    hbm_channels_per_pe: 8           # per-PE local channel count
+    hbm_channel_bw_gbs: 32.0         # per-channel bandwidth (GB/s)
+    hbm_slices_per_cube: 8           # number of per-PE partitions
+    hbm_total_gb_per_cube: 48
+```
+
+**n:1 mode (default, implemented).** Each PE's HBM partition is a single
+endpoint `hbm_ctrl.pe{idx}` that aggregates `channels_per_pe` pseudo-
+channels. The `Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals
+`channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to
+interleave; only aggregate per-PE BW is modeled. No separate aggregated
+router node exists — the per-PE router itself serves that role.
+
+**1:1 mode (future).** Each PE router decomposes into N channel
+mini-routers; per-channel routing carries fully-resolved PA + channel ID.
+A `ChannelSplitter` resolves a logical access to N per-channel physical
+requests. Per-channel link models BW contention. Cross-PE channel
+access semantics are deferred to the implementation ADR.
+
+**BW math (defaults).**
+
+| Parameter                          | Value                      |
+| ---------------------------------- | -------------------------- |
+| pseudo channels per cube           | 64 (parameter)             |
+| PEs per cube                       | 8 (parameter)              |
+| channels per PE (N)                | 64 / 8 = 8                 |
+| per-channel BW                     | 32 GB/s (parameter)        |
+| per-PE local BW                    | N × 32 = 256 GB/s          |
+| cube total HBM BW                  | 64 × 32 = 2048 GB/s        |
+
+Both modes give the same per-PE effective BW; only the request shape and
+contention model differ.
+
+### D9. AddressResolver — per-PE HBM endpoint
+
+The address resolver decodes a PA's HBM offset to the owning PE's
+partition:
+
+```python
+# policy/routing/router.py
+hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube
+
+if addr.kind == "hbm":
+    pe_id = int(addr.hbm_offset) // hbm_slice_bytes
+    return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
+```
+
+The pe_id computation is intrinsic to the routing layer (not a
+topology-time concern). Any HBM PA falls within exactly one partition,
+yielding deterministic routing.
+
+External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the
+same resolver path — there is no separate fast path.
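+
+A self-contained sketch of the partition arithmetic with the D8 defaults
+(48 GB per cube, 8 slices, so each slice is 6 GiB). `hbm_endpoint` and its
+bounds check are assumptions for illustration, not the code in
+`policy/routing/router.py`.
+
+```python
+GIB = 1 << 30
+
+
+def hbm_endpoint(sip, cube, hbm_offset, total_gb=48, slices=8):
+    """Map an HBM byte offset to the owning per-PE controller endpoint."""
+    slice_bytes = total_gb * GIB // slices
+    if not 0 <= hbm_offset < slice_bytes * slices:
+        raise ValueError(f"HBM offset out of range: {hbm_offset}")
+    pe_id = hbm_offset // slice_bytes
+    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"
+```
+
+With the defaults, an offset of 7 GiB falls in the second slice and resolves
+to `sip0.cube0.hbm_ctrl.pe1`.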
+
+### D10. Mesh generation parameters
+
+`mesh_gen.py` produces `cube_mesh.yaml` from:
+
+- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner.
+- `cube.geometry`: cube physical dimensions and HBM zone.
+- `cube.ucie.n_connections`: determines router count for UCIe attachment.
+
+Output `mesh_data` dictionary contains:
+
+- Router grid with positions and HBM exclusion zones.
+- PE-to-router attachments (`pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm`
+  per PE).
+- UCIe-to-router attachments (N/S/E/W distributed across edge routers).
+- M_CPU and SRAM router attachments.
+
+## Consequences
+
+- Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM
+  (mesh hops) are naturally distinguishable, satisfying SPEC R5
+  (multi-domain communication) and ADR-0002 (no zero-latency end-to-end
+  paths).
+- All cube-internal traffic routes through one mesh — single contention
+  model, single layout, single set of edge BWs.
+- Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each
+  PE's partition is the n:1 aggregate of its assigned pseudo-channels.
+- 1:1 mode extension is structurally natural — split each PE router into
+  N channel routers.
+- Mesh generation is fully parameterised by `topology.yaml`; PE/cube
+  geometry changes propagate without code edits.
+
+## Links
+
+- ADR-0002 (Routing distance, ordering, no zero-latency paths)
+- ADR-0003 D3 (cube-level NOC definition — extended here)
+- ADR-0004 (Memory semantics, local HBM)
+- ADR-0011 (Memory addressing — LA model consumes per-PE partition)
+- ADR-0014 D1 (PE_DMA egress via router mesh)
+- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
+- ADR-0016 (IOChiplet io_noc — analogous pattern at IO chiplet level)
+- ADR-0033 (Latency model: per-PC parallelism, switch penalty)
diff --git a/docs/adr-ko/ADR-0020-prog-data-execution-two-pass.md b/docs/adr-ko/ADR-0020-prog-data-execution-two-pass.md
new file mode 100644
index 0000000..a8d277d
--- /dev/null
+++ b/docs/adr-ko/ADR-0020-prog-data-execution-two-pass.md
@@ -0,0 +1,516 @@
+# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
+
+## Status
+
+Accepted
+
+## Context
+
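+The current simulation models **timing only**.
+Calls such as `tl.load()` and `tl.composite(op="gemm")` generate SimPy latency,
+but they do not read or compute actual tensor data.
+
+### Required capabilities
+
+1. Real data must be storable in, and readable from, HBM/TCM/SRAM.
+2. PE_GEMM and PE_MATH must perform real matrix operations whose results can be verified.
+3. Simulation performance degradation must be minimized.
+
+### Constraints
+
+- SimPy is a single-threaded event loop: running a numpy matmul inside it blocks the whole simulation.
+- Components must remain swappable (ADR-0015): framework requirements must not leak into implementations.
+- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait), and the same code must be reused.
+- Kernel functions must remain plain Python functions (no conversion to generator/async).
+
+### Design exploration results
+
+| Option | Approach | Verdict |
+|--------|----------|---------|
+| Direct execution inside SimPy | Call numpy for GEMM inside SimPy | Rejected: blocks the single thread |
+| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks in result() on back-to-back requests |
+| Symbolic + lazy | Track metadata only, execute later | Rejected: control-flow-dependent reads are hard to handle |
+| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Fully separated, no performance impact |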
+
+---
+
+## Decision
+
+### D1. 2-pass execution model: Phase 0 removed
+
+The previous three stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into two**.
+
+Before:
+```
+Phase 0: kernel → PeCommand list (no data, no branching)
+Phase 1: SimPy replay of the PeCommand list (timing only)
+```
+
+After:
+```
+Phase 1 (timing): kernel + SimPy integrated execution, greenlet-based
+  - Memory reads/writes: SimPy timing + real data in MemoryStore
+  - Compute (GEMM/Math): SimPy timing + op_log record (actual compute runs in Phase 2)
+  - Dynamic control flow is possible (tl.load returns real data)
+
+Phase 2 (data): actual compute driven by op_log, outside SimPy, parallelizable
+```
+
+This ADR **extends Phase 1 to be data-aware for memory operations only**.
+Phase 1 covers latency/BW bottleneck analysis plus memory data tracking;
+Phase 2 verifies the correctness of GEMM/Math results.
+Phase 2 is optional: when only timing is needed, run Phase 1 alone.
+
+### D2. Op log recording: ComponentBase hook
+
+Op log recording is performed by **hooks in the component base class**.
+Individual component implementations are not modified.
+
+```python
+class ComponentBase:
+    def _on_process_start(self, env, msg):
+        if self._op_logger and getattr(msg, 'data_op', False):
+            self._op_logger.record_start(env.now, self.node.id, msg)
+
+    def _on_process_end(self, env, msg):
+        if self._op_logger and getattr(msg, 'data_op', False):
+            self._op_logger.record_end(env.now, self.node.id, msg)
+```
+
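+`_forward_txn()` calls the hooks immediately before and after `run()`.
+`_op_logger` is optional; when it is absent, there is zero overhead.
+
+**Hook timing definitions**:
+
+| Timestamp | Meaning |
+|-----------|---------|
+| `t_start` | When the component **starts service** for the msg (just before entering `run()`) |
+| `t_end` | When the component's **internal service completes** (just after `run()` returns) |
+
+Link traversal latency is not included in t_start/t_end.
+It is observed as the difference between the sending component's t_end and the receiving component's t_start.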
+
+### D3. Greenlet-based kernel execution: Phase 0 removed
+
+The previous Phase 0 (kernel → PeCommand list) is removed, and
+**greenlet** is used to interleave the kernel and SimPy cooperatively.
+
+#### How it works
+
+greenlet is a C extension that provides cooperative context switching.
+When the kernel (child greenlet) calls `tl.load()` or a similar function, control switches
+to the SimPy loop (parent greenlet), which runs the timing simulation and then returns to the kernel with the real data.
+
+```
+SimPy loop (parent greenlet)           Kernel (child greenlet)
+─────────────────────────              ──────────────────────
+g.switch() ─────────────────────────→ kernel starts
+                                       a = tl.load(ptr, ...)
+                                         internally: parent.switch(DmaReadCmd)
+cmd = DmaReadCmd ←──────────────────  (kernel suspended)
+  yield DmaReadMsg(...)
+  yield env.timeout(dma_latency)
+  data = memory_store.read(...)
+g.switch(data) ─────────────────────→ (kernel resumed)
+                                       a = data  ← real numpy array
+                                       if a[0][0] > 0.5:  ← branching is possible
+                                         ...
+```
+
+The kernel remains a **plain Python function**.
+greenlet switches exist **only inside the implementation** of `tl.load()`, `tl.store()`, and similar calls.
+
+#### KernelRunner: framework layer
+
+The greenlet loop lives in **KernelRunner**, a framework layer,
+not in the PE_CPU component.
+
+```python
+# KernelRunner (framework: bridges greenlet ↔ SimPy)
+class KernelRunner:
+    def run(self, env, kernel_fn, args, store):
+        g = greenlet(self._run_kernel)
+        cmd = g.switch(kernel_fn, args)
+
+        while cmd is not None:
+            if isinstance(cmd, DmaReadCmd):
+                yield from self._dispatch_dma(env, cmd)
+                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
+                cmd = g.switch(data)            # resume with the real data
+            elif isinstance(cmd, GemmCmd):
+                yield from self._dispatch_gemm(env, cmd)
+                cmd = g.switch()                # resume (no data)
+            elif isinstance(cmd, DmaWriteCmd):
+                store.write(cmd.dst_addr, cmd.data)  # visibility = issue time
+                yield from self._dispatch_dma(env, cmd)  # timing only
+                cmd = g.switch()
+
+# PE_CPU (component: kept simple, unaware of greenlet)
+def _execute_kernel(self, env):
+    runner = KernelRunner(self.ctx)
+    yield from runner.run(env, kernel_fn, args, store)
+```
+
+**Op logging single source of truth**: KernelRunner does not write to op_log directly.
+All op logging is handled **only by the ComponentBase hooks (`_on_process_start`/`_on_process_end`)**.
+When KernelRunner delivers a message to a component via `_dispatch_gemm()` or a similar call,
+the base-class hooks record it automatically.
+
+**Layer separation**:
+- **Kernel code**: plain function, unaware that greenlet exists
+- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
+- **KernelRunner**: bridges greenlet ↔ SimPy and handles MemoryStore reads/writes. **No logging.**
+- **ComponentBase hook**: the only path for op_log recording
+- **PE_CPU**: only invokes KernelRunner; remains swappable as a component
+
+#### How memory reads/writes and compute are handled differently
+
+| Operation | In Phase 1 | In Phase 2 |
+|-----------|------------|------------|
+| `tl.load()` | SimPy timing + MemoryStore read → **returns real data** | n/a |
+| `tl.store()` | SimPy timing + MemoryStore write → **real write** | n/a |
+| `tl.composite(gemm)` | SimPy timing + **op_log record only** | real numpy compute |
+| `tl.dot()` / math ops | SimPy timing + **op_log record only** | real numpy compute |
+
+Memory reads/writes are handled immediately in Phase 1 (numpy slices, fast).
+GEMM/Math operations are executed as a batch in Phase 2 (performance isolation).
+
+#### Store Visibility Rule
+
+`tl.store()` is **applied to MemoryStore immediately at issue time** (visibility = issue).
+SimPy DMA timing is simulated separately afterwards.
+
+This deliberately separates timing from visibility:
+- **visibility**: when the write lands in MemoryStore = when `store.write()` is called
+- **timing**: when the DMA latency completes in SimPy
+
+With this separation, a load issued right after a store sees the latest data under dynamic control flow.
+
+#### Result Handle Semantics
+
+`tl.composite()` (sync/async) returns a **handle** that refers to the result tensor.
+
+Core contract in Phase 1:
+
+1. **Every compute handle is always treated as pending in Phase 1.**
+2. `tl.wait(handle)` **expresses timing synchronization only**;
+   it does not make the handle ready.
+3. Access to the handle's actual result data (`handle.data`, element access,
+   numpy conversion, etc.) is **possible only in Phase 2**.
+4. Consequently, **control flow based on compute results is not supported** in Phase 1.
+5. By contrast, `tl.load()` returns real data in Phase 1, so
+   **control flow based on memory reads is supported**.
+
+| Handle state | Phase | Allowed behavior |
+|------------|-------|----------|
+| pending | Phase 1 | `tl.wait(handle)`: timing synchronization only |
+| pending | Phase 1 | Pass the handle as the target of `tl.store()` (links the logical destination only; the payload is filled in Phase 2) |
+| pending | Phase 1 | **No data access**: value-based branching is not possible |
+| ready | Phase 2 | Access and verify the actual numpy data |
+
+This restriction is intentional. Executing compute in Phase 1 would block
+the single SimPy thread and defeat the purpose of the 2-pass split.
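+
+One way to enforce this contract is a handle type that carries metadata only and rejects every data access. The sketch below is an assumption for illustration: `PendingHandle` and `PendingHandleError` are hypothetical names, not the actual classes.
+
+```python
+class PendingHandleError(RuntimeError):
+    """Raised when Phase 1 code tries to read a compute result."""
+
+
+class PendingHandle:
+    """Hypothetical Phase 1 handle: carries metadata, never data."""
+    def __init__(self, op_index, shape, dtype):
+        self.op_index = op_index
+        self.shape = shape
+        self.dtype = dtype
+
+    @property
+    def data(self):
+        raise PendingHandleError("handle.data is only available in Phase 2")
+
+    def __getitem__(self, key):
+        raise PendingHandleError("element access is only available in Phase 2")
+
+    def __bool__(self):
+        raise PendingHandleError("truth-value evaluation is only available in Phase 2")
+
+    def __array__(self, *args, **kwargs):
+        raise PendingHandleError("numpy conversion is only available in Phase 2")
+
+
+h = PendingHandle(op_index=0, shape=(128, 128), dtype="f16")
+try:
+    if h:                 # value-based branching on a compute result
+        pass
+    branched = True
+except PendingHandleError:
+    branched = False
+```
+
+Metadata such as `h.shape` stays readable, so a handle can still be passed to `tl.wait()` or used as a `tl.store()` target.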
+
+#### Phase 1 Materialization — Future Extension
+
+If Phase 1 eager execution becomes necessary for small operations (scalars, small
+reductions), selective materialization could be supported by adding a
+`materialized_in_phase1: bool` flag to the op record. It is out of scope for now.
+
+### D4. data_op flag: messages declare themselves
+
+Whether a message is logged is decided by the `data_op` attribute of the message instance, not by its type.
+The framework does not hardcode message types.
+
+```python
+class MsgBase:
+    data_op: bool = False       # default: not logged
+
+class DmaReadCmd(MsgBase):
+    data_op = True              # memory movement → logged
+
+class GemmCmd(MsgBase):
+    data_op = True              # compute → logged
+
+class MathCmd(MsgBase):
+    data_op = True              # compute → logged
+```
+
+When a new message type (e.g. IpcqMsg) is added, setting `data_op = True` is enough
+for it to be logged automatically, with no framework code changes.
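+
+A minimal sketch of how a hook could act on the flag, assuming a hypothetical `maybe_log` helper and a hypothetical `BarrierMsg` control message (neither is part of the actual framework):
+
+```python
+class MsgBase:
+    data_op: bool = False
+
+
+class GemmCmd(MsgBase):
+    data_op = True
+
+
+class BarrierMsg(MsgBase):    # hypothetical control message, inherits data_op = False
+    pass
+
+
+def maybe_log(op_log, msg):
+    # The hook inspects only the instance attribute, never the concrete type.
+    if getattr(msg, "data_op", False):
+        op_log.append(type(msg).__name__)
+
+
+op_log = []
+for m in (GemmCmd(), BarrierMsg()):
+    maybe_log(op_log, m)
+```
+
+Only the `GemmCmd` instance ends up in `op_log`; the hook never names a message class.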
+
+### D5. Op Log structure
+
+#### Op classification
+
+Ops are classified at two levels:
+
+| Field | Values | Role |
+|------|------|------|
+| `op_kind` | `memory` \| `gemm` \| `math` | basis for executor dispatch |
+| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum`, etc. | identifies the concrete operation |
+
+#### OpRecord definition
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class OpRecord:
+    t_start: float              # SimPy time (ns): service start
+    t_end: float                # SimPy time (ns): service completion
+    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
+    op_kind: str                # "memory" | "gemm" | "math"
+    op_name: str                # concrete operation name
+    params: dict                # per-operation parameters (see below)
+    dependency_ids: list[int]   # currently in-memory record indices; may be replaced by stable op_ids later
+```
+
+#### Rules for generating dependency_ids
+
+`dependency_ids` is **optional**; by default the executor performs
+address-based dependency inference (see D6).
+
+Set it explicitly only when an exact execution order is required:
+- **Default (address-based inference)**: the executor analyzes read/write sets and
+  infers RAW/WAW/WAR dependencies automatically. This is sufficient in most cases.
+- **Explicit**: set it in TLContext or at command creation when a logical dependency
+  cannot be expressed through addresses.
+  Example: completion-handle-based synchronization. A handle dependency relies on
+  logical completion order rather than a memory address, so address inference does not catch it.
+
+#### op_log ordering
+
+op_log maintains a **stable ordering** by `t_start`.
+Records with the same `t_start` preserve insertion order.
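+
+The ordering rule matches what a stable sort provides. As a small illustration (the tuples below are made-up `(name, t_start)` pairs, not real records), Python's built-in `sorted` is stable, so ties on `t_start` keep their insertion order:
+
+```python
+# (name, t_start) pairs listed in insertion order; values are illustrative only.
+records = [("b", 10.0), ("a", 5.0), ("c", 10.0), ("d", 5.0)]
+
+# sorted() is a stable sort: records with equal t_start keep their relative order.
+ordered = sorted(records, key=lambda r: r[1])
+```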
+
+#### params details
+
+**memory (dma_read / dma_write)**:
+```python
+{
+    "src_addr": int,            # source address (byte)
+    "dst_addr": int,            # destination address (byte)
+    "nbytes": int,              # transfer size
+    "src_space": str,           # "hbm" | "tcm" | "sram"
+    "dst_space": str,           # "hbm" | "tcm" | "sram"
+}
+```
+
+**gemm**:
+```python
+{
+    "src_a_addr": int,          # operand A address
+    "src_b_addr": int,          # operand B address
+    "dst_addr": int,            # output address
+    "shape_a": tuple,           # e.g. (128, 256)
+    "shape_b": tuple,           # e.g. (256, 128)
+    "shape_out": tuple,         # e.g. (128, 128)
+    "dtype_in": str,            # e.g. "f16"
+    "dtype_acc": str,           # accumulation dtype, e.g. "f32"
+    "dtype_out": str,           # output dtype, e.g. "f16"
+    "transpose_a": bool,
+    "transpose_b": bool,
+    "layout_a": str,            # "row_major" | "col_major"
+    "layout_b": str,
+    "layout_out": str,
+    "addr_space": str,          # "tcm" (GEMM operands are always in TCM)
+}
+```
+
+**math**:
+```python
+{
+    "op": str,                  # "exp" | "add" | "sum" | "where" | ...
+    "input_addrs": list[int],   # list of operand addresses
+    "input_shapes": list[tuple],
+    "dst_addr": int,
+    "shape_out": tuple,
+    "dtype": str,
+    "axis": int | None,         # reduction axis
+    "addr_space": str,          # "tcm"
+}
+```
+
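+### D6. Phase 2 Executor
+
+Phase 2 executes the op_log outside SimPy.
+
+```python
+from itertools import groupby
+
+class DataExecutor:
+    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
+        self.op_log = op_log        # must already be ordered by t_start (D5) for groupby
+        self.store = initial_store  # takes the Phase 1 MemoryStore snapshot as input
+
+    def run(self):
+        for t, ops in groupby(self.op_log, key=lambda o: o.t_start):
+            batch = list(ops)
+            independent, sequential = self._classify(batch)
+            self._execute_parallel(independent)
+            self._execute_sequential(sequential)
+```
+
+**Parallel execution decision**:
+
+Ops with the same `t_start` are treated as **parallel candidates**.
+Whether they actually run in parallel is decided by the executor using these criteria:
+- whether read/write address ranges overlap (WAW, RAW, WAR conflict check)
+- whether the predecessor ops listed in `dependency_ids` have completed
+
+Only ops whose address ranges do not overlap and that have no explicit dependencies run in parallel.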
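+
+The address-range check can be sketched as follows. This is an illustrative assumption, not the executor's actual `_classify`: ops are represented as plain dicts with hypothetical `reads`/`writes` lists of half-open byte ranges.
+
+```python
+def overlaps(a, b):
+    """True if two half-open byte ranges (start, end) overlap."""
+    return a[0] < b[1] and b[0] < a[1]
+
+
+def conflicts(op_a, op_b):
+    """True if the two ops have a RAW, WAR, or WAW hazard."""
+    # Writes of op_a against everything op_b touches (covers WAW and one RAW/WAR direction).
+    for w in op_a["writes"]:
+        if any(overlaps(w, r) for r in op_b["reads"] + op_b["writes"]):
+            return True
+    # Writes of op_b against reads of op_a (the other RAW/WAR direction).
+    for w in op_b["writes"]:
+        if any(overlaps(w, r) for r in op_a["reads"]):
+            return True
+    return False
+
+
+gemm0 = {"reads": [(0, 100), (100, 200)], "writes": [(200, 300)]}
+gemm1 = {"reads": [(0, 100), (300, 400)], "writes": [(400, 500)]}   # shares only a read with gemm0
+gemm2 = {"reads": [(200, 300)], "writes": [(500, 600)]}             # reads gemm0's output
+
+independent = not conflicts(gemm0, gemm1)
+dependent = conflicts(gemm0, gemm2)
+```
+
+Two ops that only share a read range remain parallel candidates, while a read of another op's output forces sequential execution.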
+
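+**Batching optimization**: only independent ops with the same op_name and **identical
+shape, dtype, layout, and transpose flags** are eligible for batching.
+Example: same-shape GEMMs from multiple PEs → grouped into a single `np.matmul(a_batch, b_batch)`.
+This improves BLAS efficiency on CPU and reduces launch overhead on GPU.
+
+**Phase 2 execution order guarantee**:
+
+Phase 2 does not consider data arrival time;
+execution order is guaranteed only through dependencies
+(address-based inference + explicit dependency_ids).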
+
+### D7. Memory Store
+
+`MemoryStore` logically follows byte-addressable semantics;
+the current implementation uses **tensor-granular storage** (an addr → numpy ndarray mapping).
+
+```python
+class MemoryStore:
+    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
+    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
+```
+
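+**Internal storage format: numpy ndarray**
+
+MemoryStore stores tensors as **numpy ndarrays**.
+
+| Candidate | store/load speed | Phase 2 compute | Verdict |
+|------|----------------|-------------|------|
+| **numpy ndarray** | immediate (passed by reference, no copy) | `np.matmul` usable directly | **adopted** |
+| bytearray | requires memcpy | requires `np.frombuffer` conversion | rejected |
+| torch tensor | immediate | torch ops available | used only for GPU optimization |
+
+- write: **stores a reference** to the numpy array (no copy) → Phase 1 overhead = one dict lookup
+- read: **returns a reference** to the numpy array (no copy)
+- A re-write to the same addr **overwrites the existing array as a whole tensor** (partial overwrite is not supported)
+- dtypes use numpy types (`np.float16`, `np.float32`, etc.); bfloat16 is not a built-in NumPy dtype, so bf16 needs an extension type such as `ml_dtypes.bfloat16`
+- When byte-level access is needed, convert with `.view(np.uint8)`
+- For GPU batch optimization in Phase 2, the executor is responsible for the numpy → torch tensor conversion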
+
+**read/write contract**:
+
+- read/write operate on **contiguous tensors**.
+  A non-contiguous strided view, when needed, is expressed as a separate copy op.
+- On the normal benchmark path, producer and consumer dtypes are expected to match.
+  Reinterpret cast is a permissive behavior intended for low-level memory validation
+  or special test cases.
+- addr is byte-aligned, with a minimum alignment equal to the dtype size.
+- A dtype mismatch (reading with a dtype different from the one written) is handled as a reinterpret cast.
+  On a shape mismatch, nbytes is validated; if it does not match, an error is raised.
+- Correctness is defined by address-range-based read/write semantics.
+- A tensor object cache may be used as an implementation optimization,
+  but the canonical state is byte-addressable storage.
+- At deploy time, the host injects the initial tensor data.
+
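+### D8. Benchmark kernel code
+
+The benchmark **user-code API is unchanged**.
+The call interfaces of `tl.load()`, `tl.composite()`, `tl.store()`, etc. are kept.
+
+However, the internal command/message schema may be extended to carry the metadata
+that Phase 2 execution needs (e.g. extra fields such as dtype_acc and transpose).
+
+### D9. No component changes
+
+Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
+Writing to op_log is the responsibility of the ComponentBase hook.
+When a custom component is swapped in, only the timing model changes;
+Phase 2 data execution is unaffected.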
+
+### D10. Phase 2 is optional
+
+```python
+engine = GraphEngine(graph)
+engine.run(benchmark)                       # Phase 1: timing only
+result = engine.get_timing_result()
+
+if verify_data:
+    # Phase 2: data. initial_store is the Phase 1 MemoryStore snapshot (see D6).
+    executor = DataExecutor(engine.op_log, initial_store)
+    executor.run()
+    executor.verify(expected_output)
+```
+
+If only timing analysis is needed, skip Phase 2.
+With op_logger disabled, Phase 1 performance is also the same as before.
+
+### D11. Verification Contract
+
+Default verification compares the **final output tensor** against a reference backend (numpy).
+
+Tolerance policy per dtype:
+
+| dtype | Comparison | tolerance |
+|-------|----------|-----------|
+| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
+| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
+| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
+| int family | `np.array_equal` | exact |
+
+- Default mode: compares the final output only (end-to-end correctness)
+- Debug mode: intermediate tensors can also be compared per op
+  (MemoryStore snapshot at each op boundary)
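+
+The tolerance policy can be sketched without any dependency. The helper below is a pure-Python stand-in that applies the same element-wise criterion as `np.allclose` (`|a - e| <= atol + rtol * |e|`); `TOLERANCE` and `outputs_match` are hypothetical names, and the real verifier is expected to call numpy directly.
+
+```python
+# dtype -> (rtol, atol), taken from the table above.
+TOLERANCE = {"f32": (1e-5, 1e-5), "f16": (1e-3, 1e-3), "bf16": (1e-2, 1e-2)}
+
+
+def outputs_match(actual, expected, dtype):
+    """Compare two flat sequences under the per-dtype policy."""
+    if dtype not in TOLERANCE:                    # integer dtypes: exact comparison
+        return list(actual) == list(expected)
+    rtol, atol = TOLERANCE[dtype]
+    return len(actual) == len(expected) and all(
+        abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected)
+    )
+
+
+ok_f16 = outputs_match([1.0005], [1.0], "f16")    # within the f16 tolerance
+ok_f32 = outputs_match([1.0005], [1.0], "f32")    # outside the f32 tolerance
+ok_int = outputs_match([1, 2, 3], [1, 2, 3], "i32")
+```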
+
+---
+
+## Non-goals
+
+- **Compute-result-based control flow**: not supported.
+  Every compute handle is pending in Phase 1, and
+  `wait()` expresses timing synchronization only; it does not imply data readiness.
+  In Phase 1, `handle.data` access, element access, and truth-value evaluation are
+  **treated as errors**.
+  Branching on memory data (the result of `tl.load()`) is supported via greenlet.
+  Phase 1 materialization is a future extension (see D3).
+- **Cycle-accurate overlap reconstruction**: Phase 2 does not exactly reproduce the
+  execution-time overlap of Phase 1. Phase 2 verifies data correctness only.
+- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls and
+  do not reproduce the microarchitecture of the actual hardware PE.
+
+## Open Questions
+
+- **Aliasing / slice view**: how to represent slices/views that refer to the same
+  backing storage in MemoryStore (stride-based view vs copy semantics)
+- **Generalizing IPCQ/descriptor reads**: whether to fully generalize PE-to-PE
+  communication as memory ops or to introduce a separate op_kind
+- **Op log streaming**: managing op_log memory usage in large simulations
+  (in-memory list vs disk-backed streaming)
+- **Fused operation**: whether to record the tiled pipeline of tl.composite (READ→COMPUTE→WRITE)
+  as one fused op record or as separate ops
+- **Generalizing the math op schema**: the current math params are simple, but
+  broadcasting rules, per-input dtype, keepdims, scalar/immediate operands,
+  and where/mask representation may need to be generalized
+- **Op record identifier**: dependency_ids are currently based on in-memory list indices and
+  must be replaced by stable op_ids once a streaming/disk-backed mode is introduced
+- **Phase 1 materialization policy**: see the Future Extension in D3.
+  If allowed, the Phase 2 handling of such ops (skip / verify / recompute) must be defined
+
+---
+
+## Consequences
+
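+### Positive
+
+- Minimal impact on SimPy simulation performance (only an op_log append is added)
+- Phase 2 is free to use multithreading/GPU
+- Component replaceability is preserved (keeps the ADR-0015 design philosophy)
+- No change to the benchmark user-code API
+- Adding a new message type only requires setting the data_op flag
+- greenlet removes Phase 0 and enables dynamic control flow based on memory data
+- `tl.load()` returns real data, which makes kernel debugging easier
+
+### Negative
+
+- op_log memory usage (in large simulations)
+- Phase 2 execution time scales with tensor size (large GEMMs)
+- No dynamic branching on pending handles (compute not yet executed):
+  compute runs in Phase 2, so result values are undetermined in Phase 1.
+  Branching on memory data is supported via greenlet.
+- Adds a dependency on the greenlet C extension (pip install greenlet)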
diff --git a/docs/adr-ko/ADR-0022-prog-program-id-2d-grid.md b/docs/adr-ko/ADR-0022-prog-program-id-2d-grid.md
new file mode 100644
index 0000000..371bb49
--- /dev/null
+++ b/docs/adr-ko/ADR-0022-prog-program-id-2d-grid.md
@@ -0,0 +1,90 @@
+# ADR-0022: 2D Grid program_id Semantics
+
+## Status
+
+Accepted
+
+## Context
+
+Triton kernels use `tl.program_id(axis)` to identify their position in a launch grid.
+Our hardware has a 2-level hierarchy: **cubes** contain **PEs**.
+The previous implementation ignored the `axis` parameter and always returned a flat PE index,
+making it impossible for kernels to distinguish their cube-local position from their cube identity.
+
+## Decision
+
+Map `tl.program_id` and `tl.num_programs` to the 2D hardware grid:
+
+| Call | Returns | Description |
+|------|---------|-------------|
+| `tl.program_id(axis=0)` | `local_pe_id` | PE index within cube |
+| `tl.program_id(axis=1)` | `cube_id` | Cube index |
+| `tl.num_programs(axis=0)` | `num_pes_per_cube` | PEs per cube |
+| `tl.num_programs(axis=1)` | `num_cubes` | Total cubes |
+
+Global PID is derived as:
+
+```python
+global_pid = tl.program_id(axis=1) * tl.num_programs(axis=0) + tl.program_id(axis=0)
+```
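+
+As a quick check of this formula, the sketch below enumerates the grid for an assumed topology of 2 cubes with 4 PEs each (the sizes are illustrative, not taken from a real configuration):
+
+```python
+num_cubes, num_pes_per_cube = 2, 4   # assumed sizes, for illustration only
+
+# Same derivation as above: cube index is the outer axis, PE index the inner one.
+global_pids = [
+    cube_id * num_pes_per_cube + local_pe_id
+    for cube_id in range(num_cubes)
+    for local_pe_id in range(num_pes_per_cube)
+]
+```
+
+Each (cube, PE) pair maps to a distinct global PID, and the PIDs are contiguous from 0.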
+
+### Axis mapping rationale
+
+- **axis=0 = PE (innermost)**: PEs within a cube share HBM and communicate via local NOC mesh. This is the fast, tightly-coupled dimension — analogous to threads within a block.
+- **axis=1 = Cube (outer)**: Cross-cube communication goes through UCIe with higher latency. This is the coarser scheduling dimension — analogous to blocks in a grid.
+
+## Implementation
+
+### TLContext (`triton_emu/tl_context.py`)
+
+Added `cube_id` and `num_cubes` constructor parameters. `program_id()` and `num_programs()` dispatch on `axis`:
+
+```python
+def program_id(self, axis: int = 0) -> int:
+    if axis == 1:
+        return self._cube_id
+    return self._pe_id
+
+def num_programs(self, axis: int = 0) -> int:
+    if axis == 1:
+        return self._num_cubes
+    return self._num_programs
+```
+
+### PE_CPU (`components/builtin/pe_cpu.py`)
+
+- Extracts `num_cubes` from `ctx.spec["system"]["sips"]["cubes_per_sip"]`
+- Passes `cube_id` (already available as `self._cube_idx`) and `num_cubes` to TLContext
+
+### KernelRunner (`triton_emu/kernel_runner.py`)
+
+- Receives `num_cubes` from PE_CPU
+- Passes `cube_id` and `num_cubes` to TLContext in greenlet mode
+
+## Backward Compatibility
+
+- Existing code using `tl.program_id(0)` or `tl.program_id()` is unchanged — returns the same PE index as before.
+- `cube_id` and `num_cubes` default to `0` and `1`, so callers that don't provide them (e.g. unit tests) continue to work.
+
+## Usage Example
+
+```python
+def sharded_gemm_kernel(a_ptr, b_ptr, out_ptr, M, K, N, tl):
+    local_pid = tl.program_id(axis=0)      # PE within cube
+    cube_id   = tl.program_id(axis=1)      # which cube
+    global_pid = cube_id * tl.num_programs(axis=0) + local_pid
+
+    # Column-wise sharding across global PID
+    n_per_pid = N // (tl.num_programs(axis=1) * tl.num_programs(axis=0))
+    col_start = global_pid * n_per_pid
+
+    a = tl.load(a_ptr, shape=(M, K), dtype="f16")
+    b = tl.ref(b_ptr + col_start * K * 2, shape=(K, n_per_pid), dtype="f16")
+    h = tl.composite(op="gemm", a=a, b=b, out_ptr=out_ptr + col_start * M * 2)
+    tl.wait(h)
+```
+
+## Consequences
+
+- Benchmarks can now express cube-aware sharding and addressing without hardcoding topology dimensions.
+- Future axis=2 (SIP-level) can be added following the same pattern if needed.
diff --git a/docs/adr-ko/ADR-0023-dev-ipcq-pe-collective.md b/docs/adr-ko/ADR-0023-dev-ipcq-pe-collective.md
new file mode 100644
index 0000000..5fd174d
--- /dev/null
+++ b/docs/adr-ko/ADR-0023-dev-ipcq-pe-collective.md
@@ -0,0 +1,1648 @@
+# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication
+
+## Status
+
+Accepted
+
+## Context
+
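+### Goal
+
+Add inter-PE data exchange infrastructure so that CCL (Collective Communication Library)
+kernels can run inside PEs. The host only launches a kernel on each SIP;
+the actual synchronization and data movement happen **inside the PE kernel through
+IPCQ (Inter-Process Communication Queue)**.
+
+This resembles the model in which NCCL performs NVLink communication inside GPU kernels,
+or the core-local communication queues of Cerebras/Tenstorrent. Host-level collectives
+(`dist.all_reduce`) are deferred as **future work**; this ADR focuses solely on the
+kernel collective infrastructure.
+
+### Problems to solve
+
+1. Direct inter-PE data movement (writing into the peer's memory)
+2. Synchronization: the sender must check for free space in the receiver's buffer (backpressure)
+3. Resource contention between compute traffic and communication traffic (Head-of-Line blocking)
+4. The host must be able to configure a logical neighbor topology according to the algorithm (ring/mesh/tree)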
+
+---
+
+## Decision
+
+### D1. New PE_IPCQ component
+
+A new component, `PE_IPCQ`, is added inside the PE. This follows the same pattern as
+PE_GEMM/PE_MATH, which model sub-blocks of PE_CPU as separate components.
+
+```
+PE
+├── PE_CPU
+├── PE_SCHEDULER
+├── PE_DMA
+├── PE_IPCQ          ← new
+├── PE_FETCH_STORE
+├── PE_GEMM
+├── PE_MATH
+├── PE_TCM
+├── PE_MMU
+```
+
+**Separation of roles** (control plane vs data plane):
+- **PE_IPCQ (control plane)**: ring buffer address calculation, head/tail pointer management,
+  peer pointer cache, backpressure decisions, 4-direction neighbor mapping
+- **PE_DMA (data plane)**: transfers the actual data to peer memory over cube_noc/UCIe/PCIE
+
+PE_IPCQ does not move data itself; it delegates to PE_DMA.
+
+### D2. Ring Buffer model
+
+Each PE has 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers in total.
+
+```python
+@dataclass
+class IpcqQueuePair:
+    direction: Direction          # N/S/E/W
+    peer: IpcqEndpoint            # set by the host at init (D2.5)
+    tx_buffer_base: int           # base addr of the data I send (own memory)
+    rx_buffer_base: int           # base addr of the data I receive (own memory)
+    slot_size: int                # one tile per slot
+    n_slots: int                  # ring depth
+    my_head: int                  # my send position (next tx/peer slot to write)
+    my_tail: int                  # my recv position (next rx slot to read)
+    peer_head_cache: int          # cache: head position last sent by the peer (updated via D9 piggyback)
+    peer_tail_cache: int          # cache: tail position last consumed by the peer (updated via D9 fast-path credit)
+```
+
+**Field naming convention (canonical)**: the following four names are used consistently throughout this ADR.
+
+| Field | Owner | Updated when |
+|------|--------|----------|
+| `my_head` | own PE_IPCQ | immediately after a tl.send call (tracks sends) |
+| `my_tail` | own PE_IPCQ | immediately after a tl.recv call (tracks receives) |
+| `peer_head_cache` | own PE_IPCQ | when IpcqMetaArrival arrives (D9 piggyback) |
+| `peer_tail_cache` | own PE_IPCQ | when IpcqCreditMetadata arrives (D9 fast path) |
+
+Other spellings (`peer_head_local`, `peer_head`, `peer_tail`, etc.) are not used.
+
+**Slot unit**: fixed-size; one slot holds one tile of data in full.
+This is a **full data embedding** model, not a descriptor model (details in D5).
+
+### D2.5. PeAddress / IpcqEndpoint schema
+
+This section specifies the information `IpcqQueuePair.peer` must carry. For the sending PE_IPCQ
+to DMA-write directly into the peer rx slot, it needs to know all of the following.
+
+```python
+@dataclass(frozen=True)
+class IpcqEndpoint:
+    """Everything the sender needs to compute the peer's rx_buffer address."""
+    sip: int                     # destination SIP
+    cube: int                    # destination cube
+    pe: int                      # destination PE (local index within the cube)
+    buffer_kind: str             # "tcm" | "hbm" | "sram": which memory space
+    rx_base_pa: int              # PA of the peer rx_buffer base (PhysAddr.encode())
+    rx_base_va: int              # VA of the peer rx_buffer base (optional, when the MMU is used)
+    n_slots: int                 # peer ring depth (for boundary wrap-around)
+    slot_size: int               # peer slot size (for offset calculation)
+```
+
+The `peer` field of `IpcqQueuePair` holds this `IpcqEndpoint` object.
+The address is computed as:
+
+```python
+slot_idx = self.my_head % peer.n_slots
+dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
+```
+
+PE_IPCQ passes this dst_pa to PE_DMA as the dst_addr of an `IpcqDmaToken`.
+PE_DMA (vc_comm) transfers the data to dst_pa through fabric routing (cube_noc/UCIe/PCIE).
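+
+A runnable version of the address calculation, using a trimmed, hypothetical `Endpoint` that keeps only the three fields the formula needs (the base address and sizes are illustrative values):
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class Endpoint:                  # hypothetical subset of IpcqEndpoint
+    rx_base_pa: int
+    n_slots: int
+    slot_size: int
+
+
+def peer_slot_pa(my_head: int, peer: Endpoint) -> int:
+    """Physical address of the peer rx slot that the next send targets."""
+    slot_idx = my_head % peer.n_slots          # wrap around the ring
+    return peer.rx_base_pa + slot_idx * peer.slot_size
+
+
+peer = Endpoint(rx_base_pa=0x1000, n_slots=8, slot_size=4096)
+addrs = [peer_slot_pa(head, peer) for head in (0, 7, 8)]
+```
+
+With 8 slots, `my_head = 8` wraps back to the same slot address as `my_head = 0`.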
+
+**When endpoints are created**: during backend init (D10), the IPCQ buffers of all PEs are
+obtained from the allocator, and the peer rank's endpoint information is installed when
+each rank's neighbor table is built. The install order is therefore:
+
+1. **Allocate the IPCQ buffers of all ranks** (from each PE's buffer_kind memory space)
+2. **Build the per-rank endpoint table** (which sip/cube/pe/pa each of its 4-direction peers has)
+3. **Install into PE_IPCQ** (`IpcqInitMsg` via fabric or sideband)
+
+Because every rank must know the others' PAs, step 1 is completed for all ranks
+before steps 2 and 3 proceed.
+
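+### D3. 4-direction mapping = logical ProcessGroup
+
+A PE sees the four directions (N/S/E/W) as logical ports. The actual peer addresses are set by
+the host CCL init according to the algorithm. The PE kernel does not know the topology and uses only directions.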
+
+```python
+# Host init example: 1D ring
+for rank in range(world_size):
+    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
+    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])
+
+# Host init example: 2D mesh
+for r in range(R):
+    for c in range(C):
+        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
+        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
+        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
+        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
+```
+
+From the PE code's point of view, there is no need to know where `tl.send(dir="E", ...)` goes.
+
+### D4. PE kernel API
+
+```python
+# Send (blocking, may experience backpressure)
+tl.send(dir: str, src_addr: int, nbytes: int) -> None
+
+# Recv (blocking)
+data = tl.recv(dir: str)            # receive from a specific direction
+data = tl.recv()                    # round-robin over 4 directions, returns the first tile that has arrived
+
+# Recv (non-blocking)
+handle = tl.recv_async(dir: str)
+data = tl.wait(handle)
+```
+
+For `tl.recv()` (no direction given), IPCQ keeps a last_polled_dir index and, on the
+next call, scans starting from the following direction and returns the first slot that has data.
+If all four directions are empty, it waits.
+
+**Fairness is weak fairness**: rotating the polling start direction mitigates simple bias,
+but if data always arrives first on one direction, the other directions can starve.
+Algorithms that need strict fairness must specify the direction with
+`tl.recv(dir=...)`. (See Open Questions.)
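+
+The rotating scan can be sketched as below. `RecvPoller` is a hypothetical model, and it assumes that `last_polled` is updated to the direction that was just served; the ADR text does not pin that detail down.
+
+```python
+DIRECTIONS = ("N", "S", "E", "W")
+
+
+class RecvPoller:
+    """Hypothetical model of the scan order of a direction-less tl.recv()."""
+    def __init__(self):
+        self.last_polled = len(DIRECTIONS) - 1   # so the first scan starts at "N"
+
+    def poll(self, has_data):
+        start = (self.last_polled + 1) % len(DIRECTIONS)
+        for i in range(len(DIRECTIONS)):
+            idx = (start + i) % len(DIRECTIONS)
+            if has_data[DIRECTIONS[idx]]:
+                self.last_polled = idx           # assumption: remember the served direction
+                return DIRECTIONS[idx]
+        return None                              # all four empty: the caller would wait
+
+
+poller = RecvPoller()
+always_ready = {"N": True, "S": False, "E": True, "W": False}
+picks = [poller.poll(always_ready) for _ in range(3)]
+```
+
+With data always present on N and E, the rotation alternates between them instead of serving N every time.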
+
+### D5. Single-hop DMA write + full-data slot model
+
+Data moves from the sender's memory to the receiver's ring slot in a **single DMA transfer**.
+Key properties:
+
+- **Single-hop**: the sending IPCQ knows the peer rx slot address directly, so the data
+  arrives with one fabric DMA.
+- **No CPU memcpy**: the CPU does not copy the data.
+- **No intermediate staging**: neither side has a separate staging buffer
+  (the sender reads directly from its source address; the receiver gets it directly in its ring slot).
+
+(Strictly speaking, the fabric DMA write itself still occurs, so this does not mean
+"no data movement at all". It is the same property that NCCL's "zero-copy" refers to:
+the absence of CPU memcpy and staging copies.)
+
+Data movement model:
+
+```
+PE A: tl.send(E, src_addr, nbytes)
+  1. IPCQ computes the peer rx slot address
+       dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
+  2. backpressure: my_head - peer_tail_cache < peer.n_slots ?
+                   (if full, sleep/poll)
+  3. DMA request to PE_DMA(vc_comm) → transfer nbytes from src_addr to the peer's dst_addr
+  4. my_head += 1
+
+PE B: data = tl.recv(W)
+  1. Check my rx_buffer[my_tail % n_slots] location
+  2. Wait for data arrival (D7 backpressure mode)
+  3. Return that address to the PE kernel (or load into the register file via the fetch unit)
+  4. my_tail += 1
+  5. Issue the credit return fast path (D9): after the bottleneck-BW latency,
+     peer A's peer_tail_cache is updated
+```
+
+**Key point**: the slot contains the data in full. PE B's recv only reads its own rx_buffer
+and never reads A's memory. Because the sending IPCQ knows the peer rx slot address,
+it DMA-writes directly to that address (single-hop).
+
+Reads/writes to a PE's own PE_TCM do not go through DMA (it is attached directly to the PE).
+If the slot is in the PE's own TCM it is accessed directly; otherwise it goes through PE_DMA.
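+
+The sender-side backpressure test from step 2 can be sketched with plain counters. `TxRing` is a hypothetical helper that models only the pointer arithmetic, not the DMA or the SimPy wait:
+
+```python
+class TxRing:
+    """Hypothetical sender-side view of one direction's ring."""
+    def __init__(self, n_slots):
+        self.n_slots = n_slots
+        self.my_head = 0            # slots sent so far
+        self.peer_tail_cache = 0    # slots the peer is known to have consumed
+
+    def can_send(self):
+        # In-flight slots must stay below the ring depth.
+        return self.my_head - self.peer_tail_cache < self.n_slots
+
+    def send(self):
+        if not self.can_send():
+            return False            # the real component would sleep or poll here
+        self.my_head += 1
+        return True
+
+    def on_credit(self, peer_tail):
+        self.peer_tail_cache = peer_tail
+
+
+ring = TxRing(n_slots=2)
+results = [ring.send(), ring.send(), ring.send()]   # third send hits backpressure
+ring.on_credit(peer_tail=1)                         # peer consumed one slot
+results.append(ring.send())
+```
+
+The counters grow monotonically; only the slot index derived from them wraps around the ring.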
+
+### D6. Buffer location: 3-way benchmark
+
+The host CCL init decides where the IPCQ ring buffers live in memory:
+
+```python
+ipcq_init(
+    backend="ahbm",
+    buffer_kind="tcm" | "hbm" | "sram",
+    n_slots=8,
+    slot_size=4096,
+)
+```
+
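+| Location | Characteristics | trade-off |
+|------|------|-----------|
+| **PE_TCM** | attached directly to the PE, fast | small, competes with PE-internal resources |
+| **PE-local HBM** | large, goes through DMA | high latency |
+| **Cube SRAM** | medium size, cube-shared | contention among PEs in the cube |
+
+All three locations run the same code and differ only in init, so they can be compared by benchmark.
+
+**Rule**: when a peer reads or writes, it goes through DMA. A PE reading/writing its own PE_TCM
+uses no DMA.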
+
+### D7. Backpressure: 2-mode benchmark
+
+How to wait when the sender detects that the peer slot is full, or when the receiver
+detects that data has not arrived:
+
+| Mode | Behavior | Model |
+|------|------|------|
+| **poll** | periodically re-checks the cached peer pointer, polling for the cache update event | spin loop |
+| **sleep** | yields a SimPy event and sleeps; wakes when the peer triggers the update event | interrupt-like |
+
+```python
+ipcq_init(backpressure="poll" | "sleep", ...)
+```
+
+Both modes are implemented so that the latency/throughput trade-off can be benchmarked.
+
+### D8. PE_DMA Virtual Channel
+
+PE_DMA is extended from a single queue to a **2-channel virtual channel** model.
+
+```
+PE_DMA
+├── vc_compute: tile load/store/writeback for GEMM/MATH
+└── vc_comm:    IPCQ send data
+```
+
+Each VC has an independent state machine:
+- If one channel stalls, the other keeps making progress
+- They share the same link (cube_noc, UCIe, etc.), and the link BW is split between the two channels
+
+**Chunk-level interleaving**:
+- A large GEMM tile DMA does not occupy the link all at once
+- Transfers proceed in chunk_size units (e.g. 256B), sharing link BW with the other VC at every chunk
+- chunk_size is an init parameter (smaller is fairer, larger is more efficient)
+
+As a result:
+- HoL blocking is resolved (an IPCQ send can cut in while a compute DMA is in progress)
+- compute/comm overlap comes naturally (the NVIDIA copy engine + compute SM pattern)
+- It matches the HW model (NoC virtual channels are a real HW technique)
+
+**Accuracy limits of the first implementation (intentional)**:
+
+The first implementation of this ADR adopts **deterministic chunk-level interleave + weighted
+round-robin arbitration** (default 50/50, exposed in ccl.yaml).
+This is a first-order approximation, simpler than the dynamic contention/credit-based
+arbitration of real HW.
+
+| Modeled item | First implementation | Possible future extension |
+|------------|---------|---------------|
+| BW split between VCs | static weight | based on dynamic contention |
+| Chunk-level interleave | deterministic round-robin | priority/QoS based |
+| Cross-VC dependency | none (independent) | add a NoC arbiter component |
+
+This simplification does not affect functional correctness, but in latency measurements
+heavy-contention scenarios may yield slightly more optimistic results than reality.
+If more precision is needed, a NoC arbiter will be introduced in a separate ADR.
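+
+A minimal sketch of deterministic chunk-level weighted round-robin, to show why an IPCQ send no longer waits behind a whole compute tile. The `interleave` function and its parameters are hypothetical and model only the link occupancy order; weights are assumed to be integers of at least 1.
+
+```python
+def interleave(compute_bytes, comm_bytes, chunk_size=256, weights=(1, 1)):
+    """Return the order in which chunks from each VC occupy the link."""
+    remaining = {"vc_compute": compute_bytes, "vc_comm": comm_bytes}
+    quota = {"vc_compute": weights[0], "vc_comm": weights[1]}   # chunks per round
+    order = []
+    while any(v > 0 for v in remaining.values()):
+        for vc in ("vc_compute", "vc_comm"):
+            for _ in range(quota[vc]):
+                if remaining[vc] <= 0:
+                    break            # this VC is drained; its share goes unused
+                remaining[vc] -= chunk_size
+                order.append(vc)
+    return order
+
+
+# A 1 KiB compute tile and a 256 B IPCQ send arriving together.
+order = interleave(compute_bytes=1024, comm_bytes=256)
+```
+
+The comm chunk goes out second, right after the first compute chunk, rather than after all four.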
+
+#### Token routing
+
+- compute token (TileToken): the existing PE_FETCH_STORE → PE_DMA chaining, unchanged
+- comm token (IpcqDmaToken, new): self-routed from PE_IPCQ → PE_DMA
+- PE_DMA selects the channel by token type
+
+```python
+class PeDmaComponent:
+    def _process(self, env, token):
+        if isinstance(token, IpcqDmaToken):
+            yield from self._vc_comm_process(env, token)
+        else:
+            yield from self._vc_compute_process(env, token)
+```
+
+### D9. Pointer synchronization: DMA payload piggyback
+
+Real hardware (NVLink, UCIe, etc.) piggybacks metadata on the payload of DMA messages
+and updates pointers together with the transfer. This simulation adopts the same model so
+that metadata arrives together with the data, **without a separate control channel**.
+
+Key benefits of this model:
+
+- **Automatic ordering**: metadata travels in the same token as the data, so head_cache
+  is updated only after the data has become visible. Race conditions are ruled out
+  at the source without a separate ordering invariant.
+- **HW fidelity**: matches the piggybacked header model of real NVLink/UCIe
+- **Simpler components**: no separate IpcqPtrUpdate event type is needed
+
+#### Send flow (head-side piggyback)
+
+```
+PE A: tl.send(E, src_addr, nbytes)
+  1. PE_IPCQ checks backpressure (against peer_tail_cache)
+  2. PE_IPCQ creates an IpcqDmaToken:
+     - data body (src_addr → peer dst_addr)
+     - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
+  3. token put to PE_DMA(vc_comm)
+  4. PE A increments its own my_head (tracks sends)
+
+[fabric DMA: progresses for the latency]
+
+PE B's PE_DMA receives the token
+  5. MemoryStore.write of the data to dst_addr (B's rx slot)
+  6. forward the token's metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)
+
+PE B's PE_IPCQ receives the metadata
+  7. update peer_head_cache (= A's head position)
+  8. wake a pending recv (for that direction), if any
+```
+
+The key point here is that **steps 5 and 6 happen in the same SimPy step**: on DMA completion,
+data and metadata become visible atomically.
+
+#### Recv flow (credit return: fast path with bottleneck-BW latency)
+
+When the receiver frees a slot, the sender needs to learn about it (to release backpressure).
+Unlike the piggyback model of the data path, credit return does not go through the regular
+vc_comm fabric and is handled on a **separate fast path**. This abstracts the
+credit return fast path of real NVLink/UCIe HW.
+
+**Latency calculation**: not a magic constant; it is derived from the **full path
+latency of the routed path** (per-node overhead + edge propagation + drain).
+
+```
+credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
+path = router.find_path(self_pe, peer_pe.pe_dma)
+latency = compute_path_latency_ns(path, credit_size_bytes)
+        = sum(edge.distance_mm * ns_per_mm)
+        + sum(node_overhead_ns[n] for n in path)
+        + credit_size_bytes / bottleneck_bw_on_path
+```
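+
+The three terms of this formula can be written out as a small function. This is a sketch under stated assumptions: `path_latency_ns` and its flat-list inputs are hypothetical stand-ins for `compute_path_latency_ns` and the router's path object, and the numbers are illustrative only.
+
+```python
+def path_latency_ns(edge_distances_mm, node_overheads_ns, link_bw_bytes_per_ns,
+                    nbytes, ns_per_mm):
+    """Propagation + per-node overhead + drain at the bottleneck link."""
+    propagation = sum(d * ns_per_mm for d in edge_distances_mm)
+    overhead = sum(node_overheads_ns)
+    drain = nbytes / min(link_bw_bytes_per_ns)   # the slowest link on the path dominates
+    return propagation + overhead + drain
+
+
+latency = path_latency_ns(
+    edge_distances_mm=[2.0, 3.0],        # two edges on the path
+    node_overheads_ns=[1.0, 1.0, 1.0],   # three nodes on the path
+    link_bw_bytes_per_ns=[64.0, 16.0],   # bottleneck is 16 B/ns
+    nbytes=16,                           # credit_size_bytes
+    ns_per_mm=0.5,
+)
+```
+
+With these inputs the terms are 2.5 ns of propagation, 3.0 ns of node overhead, and 1.0 ns of drain.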
+
+The router automatically appends `.pe_dma` only to the source, so the destination must
+carry the `.pe_dma` suffix explicitly. Otherwise `find_path` raises and a latent bug occurs in
+which the credit silently teleports at zero cost (fixed in this update).
+
+`tl.recv` waits for credit-emit completion with yield-from (previously it forked with
+`env.process`). As a result the credit-return cost is reflected in the receiver's
+`pe_exec_ns`, and recv returns to the kernel only after the IPCQ control plane has completed
+the consume-acknowledgement. This is the protocol-level equivalent of a non-posted
+`tl.store` in RAW DMA waiting for the HBM ack-trip.
+
+As a result:
+- **Topology-proportional approximation**: intra-cube credit return and cross-SIP credit
+  automatically get different latencies
+- **No magic constants**: every ns value comes from `compute_path_latency_ns`, which is
+  derived from the same edge_map and `node_overhead_ns` as data traffic
+- **No deadlock risk**: `peer_credit_store.put` is unbounded, and credit is issued
+  automatically even when B has no data to send to A
+- **`IPCQ ≥ raw DMA`** guarantee: for a matched physical move, credit-emit balances
+  the ack-trip cost of RAW
+
+```
+PE B: tl.recv(W) → takes the data → my_tail++
+
+PE B's PE_IPCQ:
+  1. compute the path to PE A with the router
+  2. latency_ns = compute_path_latency_ns(path, credit_size_bytes)
+  3. yield from self._delayed_credit_send(env, peer_credit_store, my_tail, latency_ns)
+
+[fast path: timeout for latency_ns, fabric vc not used]
+
+PE A's PE_IPCQ receives IpcqCreditMetadata from its own credit_store:
+  4. update peer_tail_cache
+  5. wake a pending send (for that direction), if any
+```
+
+#### Component coupling: SimPy Store channel
+
+PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, **a SimPy Store is wired once
+between the two PE_IPCQs at init time** (a bidirectional fast path channel),
+and credit metadata is put into that store.
+
+```python
+class PeIpcqComponent:
+    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
+        yield env.timeout(latency_ns)
+        yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...))
+```
+
+During backend init, the bidirectional fast path channels are installed together with the
+IpcqInitMsg fan-out (specified in D12 IpcqInitMsg).
+
+#### Limits of the credit return fast path
+
+- `credit_size_bytes` is an estimate. 16-64 bytes is usually enough; the value imitates
+  the size of the credit return wire in real HW.
+- The fast path is **excluded** from the regular vc_comm BW contention model (separate channel).
+  The credit return wire in real HW is very lightweight, so this is reasonable as a first-order approximation.
+- If more precision is needed, a follow-up ADR can:
+  - model the credit fast path as a separate link (BW limit + contention)
+  - or switch to piggyback mode (`credit_return_mode: piggyback`)
+
+#### Additional responsibility of PE_DMA
+
+On receiving a token, PE_DMA (vc_comm) processes it in this sequence: it first pays the
+BW drain at the Transaction terminal, then performs data write + metadata forward
+atomically. **There must be no SimPy yield between the data write and the metadata
+forward** (see the I6 MUST rule). The drain yield must sit before the atomic
+section, not inside it:
+
+```python
+def _on_vc_comm_recv(self, env, txn):
+    # Pay here the drain_ns (= nbytes / bottleneck_bw) stamped by the sender
+    # PE_DMA. It must come before the atomic section, because recv should
+    # wake only after the bytes have "arrived".
+    drain = getattr(txn, "drain_ns", 0.0)
+    if drain > 0:
+        yield env.timeout(drain)
+
+    token = txn.request
+    # ── ATOMIC: no yield between the two actions ──
+    # 1. write the data to dst_addr (dst memory space is token.dst_endpoint.buffer_kind)
+    data = self._memory_store.read(token.src_space, token.src_addr,
+                                    shape=..., dtype=...)
+    self._memory_store.write(token.dst_endpoint.buffer_kind, token.dst_addr, data)
+    # 2. forward the token's metadata to this PE's IPCQ
+    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
+    # ─────────────────────────────────────
+```
+
+`out_ports[ipcq_id].put` is a yield-able call on a SimPy Store, but because it uses a store
+with unbounded capacity as the PE-internal wire, it completes immediately (effectively
+single-step). This final put is the end of the atomic section, and no other yield may be
+inserted before it.
+
+#### Drain-at-inbound semantics (D9 timing model)
+
+A Transaction enters the fabric with `drain_ns = nbytes / bottleneck_bw_on_path`
+stamped by the sender PE_DMA. In this simulator, the per-hop `overhead_ns`
+is paid in the `run()` of each forwarding component, and the remaining BW drain is
+paid once at the Transaction's terminal node. For every non-IPCQ
+Transaction (raw DMA, kernel-launch fanout, etc.),
+`ComponentBase._forward_txn` pays this drain at the terminal. For IPCQ,
+the destination PE_DMA intercepts the Transaction with the `_handle_ipcq_inbound` handler
+(because it must do the IPCQ-specific data write + metadata forward), so
+**the drain must be paid explicitly at the very top of this handler**. Only then is the
+IPCQ timing model on the same footing as every other fabric Transaction.
+
+Side effects of paying the drain here:
+
+- **SRC `tl.send`**: behavior unchanged. The sender PE_DMA does not `yield` on `sub_done`,
+  so the fire-and-forget semantics are preserved. The `sub_done.succeed()` called after
+  the metadata forward is an event with no listener from the sender's point of view.
+- **DST `tl.recv`**: wakes up later by `drain_ns`. recv is woken only when the local PE_IPCQ
+  receives `IpcqMetaArrival`, and since the metadata forward has moved to after the
+  drain, recv observes the full fabric transfer time including bandwidth.
+
+This matches the physical picture: send dispatches and returns immediately; recv waits
+until the bytes have actually drained into its inbox.
+
+#### Backpressure latency accuracy
+
+Time until backpressure is released:
+
+- **Data send-side latency** = full fabric DMA (data + piggyback metadata together)
+- **Credit return-side latency** = fast path with bottleneck-BW
+  (`credit_size_bytes / bottleneck_bw_on_path`)
+
+| Scenario | Modeled latency | Relation to real HW |
+|---------|----------------|----------------|
+| Intra-cube (fast link) | small (bottleneck = cube_noc BW) | topology-aware approximation |
+| Cross-cube (UCIe) | medium (bottleneck = UCIe BW) | topology-aware approximation |
+| Cross-SIP (PCIE) | large (bottleneck = PCIE BW) | topology-aware approximation |
+
+Without a separate magic latency parameter, a first-order approximation proportional to
+the topology is reflected automatically. It does not match real HW exactly
+(the credit fast path is excluded from the contention model, and credit_size_bytes is an estimate),
+but it allows far more meaningful comparisons than a magic-constant model. Further refinement
+is left to a follow-up ADR.
+
+### D9.5. Integration with ADR-0020 (2-Pass)
+
+`tl.send/recv` must integrate with the 2-pass model of ADR-0020. Phase 1 models
+both timing and actual data movement (MemoryStore), and Phase 2 enables op_log-based
+correctness verification.
+
+#### Phase 1 (timing + data movement)
+
+D9 models head updates and tail updates with different mechanisms:
+
+- **Send-side (head update)**: DMA payload piggyback. The data write and metadata
+  forward happen in the same SimPy step, so atomic visibility is guaranteed automatically.
+- **Recv-side (tail credit return)**: fast path SimPy Store channel.
+  peer_tail_cache is updated after a bottleneck-BW-based latency.
+
+The two mechanisms together keep the ring buffer pointers consistent overall.
+
+**On send**:
+
+1. PE_IPCQ checks backpressure (against peer_tail_cache)
+2. PE_IPCQ creates an IpcqDmaToken (data + piggyback metadata) → put to PE_DMA(vc_comm)
+3. PE_DMA simulates the fabric DMA (latency elapses)
+4. **Atomic sequence in the same SimPy step as DMA completion**:
+   - **MemoryStore.write(buffer_kind, dst_pa, data)**: single-hop DMA write
+   - forward metadata to the receiving PE_IPCQ → update peer_head_cache → wake pending recv
+5. **op_log record**: `OpRecord(op_kind="ipcq", op_name="send", params={src_space, src_addr, dst_space, dst_addr, nbytes, dir, dtype, shape, sender_seq})`
+   - `dst_space` is derived from `token.dst_endpoint.buffer_kind`
+     (it is not a separate token field). dst_addr is `token.dst_addr`.
+
+**recv 시**:
+
+1. PE_IPCQ가 (peer_head_cache > my_tail) AND (MemoryStore.has(slot_addr)) 조건 대기
+   (D9 piggyback 모델에서는 두 조건이 같은 step에 truthy가 되지만, defensive check)
+2. 조건 만족 시: `slot_addr = my_rx_base + slot_idx * slot_size`
+3. **두 가지 모드** (`recv_mode`로 op_log에 기록):
+   - **`return_slot`** (default): slot_addr을 그대로 PE 커널에 반환.
+     데이터 복사 없음. 커널이 slot 메모리를 직접 사용한다.
+   - **`copy_to_dst`**: 호출 시 dst_addr이 지정된 경우. slot 데이터를 읽어서
+     dst_addr에 write. `data = memory_store.read(...)`; `memory_store.write(dst_space, dst_addr, data)`
+4. PE_IPCQ가 my_tail++, fast path credit return을 발행 (D9 — vc_comm
+   fabric을 거치지 않고 별도 SimPy Store 채널로 bottleneck-BW latency 후
+   peer 측 peer_tail_cache 갱신)
+5. **op_log 기록**: `OpRecord(op_kind="ipcq", op_name="recv", params={recv_mode, src_space, src_addr, dst_space, dst_addr, nbytes, dir, dtype, shape, consumer_seq})`
+   - `recv_mode="return_slot"`: src_space/src_addr가 slot 위치, dst_addr=None
+   - `recv_mode="copy_to_dst"`: src_space/src_addr가 slot 위치, dst_space/dst_addr가 사용자 지정 위치
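
The pointer bookkeeping in the send and recv steps can be sketched without any timing. The class below is a toy model under stated assumptions: `QueuePairModel` is a hypothetical name, both ends of one direction are collapsed into a single object, DMA and credit latency are omitted, and `peer_head_cache` is set to `sender_seq + 1` as in the D16 sequence diagram.

```python
class QueuePairModel:
    """Toy model of one direction of an IPCQ queue pair (no timing, no SimPy)."""

    def __init__(self, n_slots, slot_size, rx_base):
        self.n_slots, self.slot_size, self.rx_base = n_slots, slot_size, rx_base
        self.my_head = 0          # sender: next sequence number to write
        self.peer_tail_cache = 0  # sender: last tail learned from a credit return
        self.my_tail = 0          # receiver: next sequence number to read
        self.peer_head_cache = 0  # receiver: updated by piggyback metadata

    def _slot_addr(self, pointer):
        return self.rx_base + (pointer % self.n_slots) * self.slot_size

    def can_send(self):
        # Backpressure check: the ring is full when n_slots sends are unacknowledged.
        return (self.my_head - self.peer_tail_cache) < self.n_slots

    def send(self):
        if not self.can_send():
            raise BlockingIOError("backpressure: peer ring is full")
        sender_seq = self.my_head
        dst_addr = self._slot_addr(sender_seq)
        self.my_head += 1
        # Arrival: data write and head update are modeled as a single step.
        self.peer_head_cache = max(self.peer_head_cache, sender_seq + 1)
        return dst_addr

    def recv(self):
        if not self.peer_head_cache > self.my_tail:
            raise BlockingIOError("no data has arrived")
        slot_addr = self._slot_addr(self.my_tail)
        self.my_tail += 1
        # Credit return: the sender learns the new tail (latency omitted here).
        self.peer_tail_cache = self.my_tail
        return slot_addr


q = QueuePairModel(n_slots=2, slot_size=4096, rx_base=0x1000)
sent = [q.send(), q.send()]   # fills both slots
full_after_two = not q.can_send()
received = q.recv()           # frees the first slot and returns a credit
```

In the simulator the two `BlockingIOError` cases correspond to waiting (poll or sleep), not to errors; raising keeps the sketch short.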
+
+#### Phase 2 (op_log replay)
+
+When DataExecutor encounters a record with `op_kind="ipcq"`:
+
+- **send**: idempotently writes the ndarray from src to dst (the peer rx slot)
+- **recv (`recv_mode="return_slot"`)**: no-op. The slot data was already put in
+  place during Phase 1, and the kernel uses that slot memory directly.
+- **recv (`recv_mode="copy_to_dst"`)**: idempotently copies the ndarray from the
+  slot to dst_addr
+
+In essence IPCQ only performs **data movement**, so Phase 2 has nothing extra to compute.
+Correctness is verified automatically when DataExecutor's GEMM/Math ops consume that data.
+
+```python
+class DataExecutor:
+    def _execute_op(self, op):
+        if op.op_kind == "ipcq":
+            self._execute_ipcq(op)
+        elif op.op_kind == "memory":
+            ...
+        elif op.op_kind == "gemm":
+            ...
+
+    def _execute_ipcq(self, op):
+        """IPCQ ops are data movement; Phase 1 already wrote to MemoryStore."""
+        p = op.params
+        if op.op_name == "send":
+            data = self.store.read(p["src_space"], p["src_addr"],
+                                   shape=p["shape"], dtype=p["dtype"])
+            self.store.write(p["dst_space"], p["dst_addr"], data)
+        elif op.op_name == "recv":
+            if p.get("recv_mode") == "copy_to_dst":
+                data = self.store.read(p["src_space"], p["src_addr"],
+                                       shape=p["shape"], dtype=p["dtype"])
+                self.store.write(p["dst_space"], p["dst_addr"], data)
+            # recv_mode == "return_slot": no-op (data already in slot)
+```
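
The idempotency claim can be checked with a dictionary-backed stand-in for the store. This is only a sketch: `DictStore` and `replay_ipcq` are hypothetical names, payloads are plain lists instead of ndarrays, and the `shape`/`dtype` arguments of the real `read` call are dropped.

```python
class DictStore:
    """Hypothetical stand-in for MemoryStore: maps (space, addr) to a payload."""

    def __init__(self):
        self._cells = {}

    def write(self, space, addr, data):
        self._cells[(space, addr)] = list(data)  # copy so callers cannot alias

    def read(self, space, addr):
        return list(self._cells[(space, addr)])


def replay_ipcq(store, op_name, params):
    """Replay one ipcq record; send and copy_to_dst copy, return_slot does nothing."""
    copies = op_name == "send" or (
        op_name == "recv" and params.get("recv_mode") == "copy_to_dst"
    )
    if copies:
        data = store.read(params["src_space"], params["src_addr"])
        store.write(params["dst_space"], params["dst_addr"], data)


store = DictStore()
store.write("tcm", 0x100, [1, 2, 3])
op_log = [
    ("send", {"src_space": "tcm", "src_addr": 0x100,
              "dst_space": "tcm", "dst_addr": 0x9000}),
    ("recv", {"recv_mode": "return_slot", "src_space": "tcm", "src_addr": 0x9000}),
    ("recv", {"recv_mode": "copy_to_dst", "src_space": "tcm", "src_addr": 0x9000,
              "dst_space": "hbm", "dst_addr": 0x40}),
]
for _ in range(2):  # replaying the log twice must leave the same contents
    for op_name, params in op_log:
        replay_ipcq(store, op_name, params)
```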
+
+#### `--verify-data` flow (CCL kernels)
+
+```
+1. kernbench run --bench ccl_allreduce --verify-data
+2. backend init → allocate IPCQ buffers, install neighbor table
+3. all rank greenlets run concurrently
+4. each PE kernel calls tl.send/recv → data accumulates in MemoryStore
+5. after the simulation completes, DataExecutor.run() → idempotent replay of ipcq ops (no-op)
+6. the benchmark does print(out) or compares out.data → correctness confirmed
+```
+
+Bench authors read the result through `out.data` and compare it against the
+expected value (the ADR-0020 D7 Tensor.data pattern).
+
+### D10. Host CCL init follows the PyTorch pattern as-is
+
+Host code stays identical to real PyTorch distributed code.
+`init_process_group` only creates the backend object and does not take IPCQ
+settings (neighbor topology, buffer_kind, backpressure, etc.).
+
+```python
+# benches/ccl_allreduce.py: host code identical to real PyTorch
+def run_rank(rank, world_size, torch):
+    dist = torch.distributed
+    dist.init_process_group(backend="ahbm", world_size=world_size, rank=rank)
+
+    tensor = torch.zeros((M, K), dtype="f16", dp=...)
+
+    from kernbench.ccl.algorithms import ring_allreduce
+    torch.launch("ring_allreduce", ring_allreduce.kernel, tensor, rank, world_size)
+```
+
+For IPCQ settings, the backend reads `ccl.yaml` **at init_process_group time** and
+immediately installs the PE_IPCQ neighbor tables. Host code does not need to be aware of IPCQ.
+
+One benchmark is assumed to use one algorithm, which is selected by
+`defaults.algorithm` in `ccl.yaml` (D11). Switching to a different algorithm
+requires editing only ccl.yaml, with no host code changes.
+
+#### Init flow (eager)
+
+1. `init_process_group(backend="ahbm")` is called
+2. the backend loads `ccl.yaml` → determines `defaults.algorithm`
+3. topology + buffer_kind + backpressure + slot count/size are determined from `algorithms[<algo>]`, falling back to `defaults`
+4. neighbor tables are installed **immediately** on every PE's PE_IPCQ (sideband or fabric `IpcqInitMsg`)
+5. subsequent `torch.launch(kernel_name, ...)` calls are handled like any ordinary launch
+   (PE_IPCQ is already prepared whether or not the kernel is a CCL kernel)
+
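+### D11. CCL configuration file (`ccl.yaml`)
+
+IPCQ settings and algorithm metadata live in a separate YAML file.
+This follows the same pattern as `components.yaml`/`topology.yaml`, and the
+change history can be tracked like code.
+
+A single benchmark run uses exactly one algorithm (`defaults.algorithm`).
+To switch to a different algorithm, change only `defaults.algorithm` in `ccl.yaml`.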
+
+```yaml
+# ccl.yaml: CCL backend (ahbm) configuration
+#
+# This file is loaded at init_process_group(backend="ahbm") time, and the
+# PE_IPCQ neighbor tables are installed according to the algorithm named by
+# defaults.algorithm. Host code is not aware of IPCQ settings.
+
+defaults:
+  # Algorithm used for this bench run. Must be one defined under algorithms.
+  algorithm: ring_allreduce
+
+  # Memory in which the IPCQ ring buffers live.
+  #   tcm  : PE-local TCM (small but fast; competes with PE-internal resources)
+  #   hbm  : PE-local HBM (large; high DMA latency)
+  #   sram : Cube-shared SRAM (medium size; contention among PEs in the cube)
+  buffer_kind: tcm
+
+  # How send/recv wait when they hit a full peer slot / data not yet arrived.
+  #   poll  : re-check the peer pointer cache in a spin loop
+  #   sleep : yield a SimPy event and wait for wakeup (interrupt-like)
+  backpressure: sleep
+
+  # Ring buffer depth (slots per direction). Larger allows more in-flight data but uses more memory.
+  n_slots: 8
+
+  # Size of one slot (bytes). Must be large enough to hold one whole tile.
+  slot_size: 4096
+
+  # PE_DMA virtual channel chunk size (bytes). Smaller is fairer, larger is more efficient.
+  # Interleave granularity between IPCQ traffic and compute traffic (see D8).
+  vc_chunk_size: 256
+
+  # Credit return fast-path message size (bytes). Mimics the credit return wire
+  # size of real HW. The backend computes latency from the bottleneck BW of the
+  # routing path (see D9). 16-64 is usually sufficient.
+  ipcq_credit_size_bytes: 16
+
+algorithms:
+  # ── Algorithm definitions ─────────────────────────────────────────
+  # Each entry names the algorithm module and the topology the algorithm requires.
+  # Defaults can be overridden per algorithm (buffer_kind, backpressure, etc.).
+
+  ring_allreduce:
+    # Module that defines the PE kernel. Exports a `kernel(t_ptr, rank, world_size, tl)` function.
+    module: kernbench.ccl.algorithms.ring_allreduce
+
+    # Neighbor topology this algorithm requires. A builtin name or "custom".
+    #   ring_1d        : 1D bidirectional ring (E/W)
+    #   ring_1d_unidir : 1D unidirectional ring (E only)
+    #   mesh_2d        : 2D mesh (N/S/E/W)
+    #   tree_binary    : binary tree (parent/children direction)
+    #   custom         : use the module's neighbors(rank, world_size, neighbor_map) function
+    topology: ring_1d
+
+  tree_allreduce:
+    module: kernbench.ccl.algorithms.tree_allreduce
+    topology: tree_binary
+    # Per-algorithm override (only this algorithm uses hbm)
+    buffer_kind: hbm
+
+  custom_mesh:
+    module: kernbench.ccl.algorithms.custom_mesh
+    topology: custom            # the module provides its own neighbors() function
+```
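
How the per-algorithm entries override `defaults` can be sketched with a plain dict that mirrors the YAML (so no YAML parser is needed). `resolve_algorithm` is a hypothetical helper, not the backend's actual loader, and `defaults.algorithm` is set to `tree_allreduce` here only to exercise the override.

```python
ASSUMED_CCL_CONFIG = {
    "defaults": {
        "algorithm": "tree_allreduce",  # changed from the YAML above for the demo
        "buffer_kind": "tcm",
        "backpressure": "sleep",
        "n_slots": 8,
        "slot_size": 4096,
    },
    "algorithms": {
        "ring_allreduce": {
            "module": "kernbench.ccl.algorithms.ring_allreduce",
            "topology": "ring_1d",
        },
        "tree_allreduce": {
            "module": "kernbench.ccl.algorithms.tree_allreduce",
            "topology": "tree_binary",
            "buffer_kind": "hbm",  # per-algorithm override
        },
    },
}


def resolve_algorithm(config):
    """Merge defaults with the selected algorithm entry; the entry wins on conflict."""
    defaults = dict(config.get("defaults", {}))
    name = defaults.pop("algorithm", None)
    if name is None:
        raise KeyError("ccl.yaml: defaults.algorithm is missing")
    algorithms = config.get("algorithms", {})
    if name not in algorithms:
        raise KeyError(f"ccl.yaml: algorithms[{name!r}] is not defined")
    resolved = {**defaults, **algorithms[name]}
    resolved["name"] = name
    return resolved


resolved = resolve_algorithm(ASSUMED_CCL_CONFIG)
```

The two `KeyError` cases correspond to the first two F4 init failures listed in D14.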
+
+#### Algorithm module structure
+
+An algorithm module exports two hooks: `kernel` is required, `neighbors` is optional.
+
+```python
+# src/kernbench/ccl/algorithms/ring_allreduce.py
+
+def kernel(t_ptr, rank, world_size, tl):
+    """Required. The PE kernel.
+
+    By this point the backend has installed the IPCQ settings from ccl.yaml plus
+    the neighbors() result. The kernel only needs to use the 4-direction send/recv API.
+    """
+    for step in range(world_size - 1):
+        ...
+        tl.send(dir="E", ...)
+        data = tl.recv(dir="W")
+
+
+def neighbors(rank, world_size, neighbor_map):
+    """Optional. Neighbor table override hook.
+
+    The backend builds the builtin neighbor_map from the topology field in
+    ccl.yaml, then calls this function, if defined, and overrides with its result.
+
+    Args:
+        rank: index of this rank
+        world_size: total number of ranks
+        neighbor_map: builtin mapping produced by the topology field in ccl.yaml
+                      e.g. ring_1d → {"E": (rank+1)%ws, "W": (rank-1)%ws}
+                      mutable dict; may be modified in place
+
+    Returns:
+        dict | None:
+          dict: the result that overrides neighbor_map
+          None: no override; neighbor_map is used as-is
+    """
+    return None  # or return the modified map
+```
+
+#### `neighbors` override patterns
+
+Most algorithms are fully served by a builtin topology and need no `neighbors` definition.
+Patterns for the cases that do need one:
+
+**Pattern A: start from the builtin and modify part of it**:
+```python
+def neighbors(rank, world_size, neighbor_map):
+    # only even ranks use W
+    if rank % 2 == 1:
+        neighbor_map.pop("W", None)
+    return neighbor_map
+```
+
+**Pattern B: build an entirely new map (skip-connection ring, etc.)**:
+```python
+def neighbors(rank, world_size, neighbor_map):
+    # ignore neighbor_map and build from scratch
+    return {"E": (rank + 2) % world_size}
+```
+
+#### Builtin topology generators
+
+If the `topology` field in `ccl.yaml` is one of the following builtin names, the backend handles it automatically:
+
+| topology | Description | direction set |
+|----------|-------------|---------------|
+| `ring_1d` | 1D bidirectional ring | E, W |
+| `ring_1d_unidir` | 1D unidirectional ring | E only |
+| `mesh_2d` | 2D mesh | N, S, E, W |
+| `tree_binary` | binary tree (root = rank 0) | parent, child_left, child_right |
+| `none` | empty mapping; the algorithm builds it from scratch in `neighbors()` | (none) |
+
+With `topology: none` the builtin returns an empty dict, so the algorithm's
+`neighbors()` must build the mapping from scratch.
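
A minimal sketch of the builtin-then-override resolution, covering only the ring and `none` topologies. `builtin_neighbors` and `resolve_neighbors` are hypothetical names chosen for illustration; the real backend entry point may differ.

```python
def builtin_neighbors(topology, rank, world_size):
    """Builtin neighbor map for a subset of the topologies in the table above."""
    if topology == "ring_1d":
        return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}
    if topology == "ring_1d_unidir":
        return {"E": (rank + 1) % world_size}
    if topology == "none":
        return {}
    raise ValueError(f"topology {topology!r} is not covered by this sketch")


def resolve_neighbors(topology, rank, world_size, override=None):
    """Build the builtin map, then let an optional neighbors() hook override it."""
    neighbor_map = builtin_neighbors(topology, rank, world_size)
    if override is not None:
        result = override(rank, world_size, neighbor_map)
        if result is not None:  # None means: keep the builtin map
            neighbor_map = result
    return neighbor_map


def skip_ring(rank, world_size, neighbor_map):
    # Pattern B: ignore the builtin map and connect to rank + 2.
    return {"E": (rank + 2) % world_size}


plain = resolve_neighbors("ring_1d", rank=0, world_size=4)
skipped = resolve_neighbors("none", rank=3, world_size=4, override=skip_ring)
```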
+
+#### Procedure for adding an algorithm
+
+1. Write the `kernel` function in `src/kernbench/ccl/algorithms/<algo>.py`
+2. Add an entry (`module`, `topology`) to the `algorithms` section of `ccl.yaml`
+3. (Optional) Add a `neighbors()` function to the same module to override the builtin
+4. Set `defaults.algorithm` to the new algorithm to activate it
+
+Host code is left untouched.
+
+### D12. Message / token schema
+
+This section specifies the fields of every message/token this ADR adds. At
+implementation time these definitions are added as-is to
+`src/kernbench/common/pe_commands.py` and `src/kernbench/runtime_api/kernel.py`.
+
+#### `IpcqInitMsg` (sideband, fanned out at init)
+
+The backend uses it to install the neighbor table on every PE's PE_IPCQ.
+Its structure resembles `MmuMapMsg` (target_sips, target_cubes, target_pe + entries).
+
+```python
+@dataclass(frozen=True)
+class IpcqInitEntry:
+    direction: str               # "N" | "S" | "E" | "W"
+    peer: IpcqEndpoint           # see D2.5
+    my_rx_base_pa: int           # base of this PE's own rx_buffer
+    my_rx_base_va: int           # optional
+    n_slots: int
+    slot_size: int
+    # Credit fast path channel (D9).
+    # Contract: this field must be a simpy.Store instance and is a receive
+    # endpoint that accepts only IpcqCreditMetadata objects (the peer's PE_IPCQ
+    # uses it as its own input queue). The sending PE_IPCQ puts
+    # IpcqCreditMetadata directly into this store in _delayed_credit_send.
+    # No other object type may be put. The bidirectional SimPy Stores are
+    # wired once at backend init and cannot change afterwards.
+    peer_credit_store: "simpy.Store[IpcqCreditMetadata]"
+
+@dataclass(frozen=True)
+class IpcqInitMsg:
+    correlation_id: str
+    request_id: str
+    target_sips: tuple[int, ...]
+    target_cubes: tuple[int, ...]
+    target_pe: int | tuple[int, ...] | str
+    entries: tuple[IpcqInitEntry, ...]   # this PE's 4-direction entries
+    backpressure_mode: str       # "poll" | "sleep"
+    buffer_kind: str             # "tcm" | "hbm" | "sram"
+    credit_size_bytes: int       # for the D9 fast path latency calculation (default 16)
+```
+
+**Credit fast path channel wiring**: backend init installs the bidirectional
+fast path channels once on every PE's PE_IPCQ. PE A's
+IpcqInitEntry(direction=E) receives a reference to PE B's credit-receive Store,
+so that the sending side can put into it directly (no separate fabric routing).
+
+#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)
+
+```python
+@dataclass(frozen=True)
+class IpcqSendCmd:
+    direction: str               # direction to send in
+    src_addr: int                # source address of the data to send (TCM/HBM)
+    src_space: str               # "tcm" | "hbm" | "sram"
+    nbytes: int
+    shape: tuple[int, ...]       # data shape (for op_log/MemoryStore)
+    dtype: str
+    handle_id: str               # for completion tracking
+    data_op: bool = True         # recorded in the ADR-0020 op_log
+```
+
+#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)
+
+```python
+@dataclass(frozen=True)
+class IpcqRecvCmd:
+    direction: str | None        # None means round-robin (weak fairness, D4)
+    # recv_mode: two modes of operation
+    #   "return_slot": return the slot address to the PE kernel as-is (default, zero-copy)
+    #   "copy_to_dst": copy the slot data to dst_addr, then return
+    recv_mode: str = "return_slot"
+    # dst_addr / dst_space are used only when recv_mode="copy_to_dst"
+    dst_addr: int = 0
+    dst_space: str = ""
+    shape: tuple[int, ...] = ()  # data shape (for op_log/MemoryStore)
+    dtype: str = ""
+    handle_id: str = ""
+    blocking: bool = True        # blocking vs non-blocking
+    data_op: bool = True
+```
+
+#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)
+
+Following the D9 piggyback model, the token carries data + head metadata
+together across the fabric. On arrival, the receiving PE_DMA writes the data to
+dst_addr and forwards the metadata to PE_IPCQ (atomically).
+
+```python
+@dataclass
+class IpcqDmaToken:
+    # ── Data movement (single-hop DMA write) ──
+    src_addr: int                # address in the sender's own memory
+    src_space: str
+    dst_addr: int                # peer rx slot address (already computed)
+    dst_endpoint: IpcqEndpoint   # for routing (sip/cube/pe)
+    nbytes: int                  # data size
+    handle_id: str               # notifies the sending PE_IPCQ on completion
+
+    # ── Piggyback metadata (applied automatically by the receiving PE_IPCQ) ──
+    sender_seq: int              # monotonically increasing sequence number
+                                 # the peer uses it as its head_cache
+    src_sip: int                 # sender (lets the receiver identify which peer)
+    src_cube: int
+    src_pe: int
+    src_direction: str           # direction from the sender's view (receiver reverse-maps to its own direction)
+
+    data_op: bool = True         # recorded in the ADR-0020 op_log
+```
+
+PE_DMA selects the channel by token type (D8): TileToken → vc_compute, IpcqDmaToken → vc_comm.
+
+**Handling on the receiving PE_DMA** (on vc_comm arrival):
+
+```python
+def _vc_comm_arrival(self, env, token: IpcqDmaToken):
+    # 1. write data to dst_addr (atomic visibility of data and metadata)
+    if self._memory_store is not None:
+        data = self._memory_store.read(token.src_space, token.src_addr,
+                                        shape=..., dtype=...)
+        self._memory_store.write(token.dst_endpoint.buffer_kind, token.dst_addr, data)
+    # 2. forward metadata to this PE's IPCQ (PE-internal wire, same step)
+    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
+```
+
+PE_IPCQ receives the `IpcqMetaArrival` and updates peer_head_cache based on sender_seq.
+
+#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path channel)
+
+Following the D9 fast path model, credit return bypasses the vc_comm fabric and
+travels over a **separate SimPy Store channel**. The bidirectional channels are
+pre-wired at backend init, and latency is computed from the bottleneck BW.
+
+```python
+@dataclass(frozen=True)
+class IpcqCreditMetadata:
+    """Credit return: recv side → send side fast path."""
+    consumer_seq: int            # my_tail (the recv side's new tail)
+    src_sip: int                 # who sent it (lets the receiver identify which peer's credit)
+    src_cube: int
+    src_pe: int
+    src_direction: str           # direction from the sender's view (receiver reverse-maps)
+```
+
+**Transmission flow**:
+
+```python
+class PeIpcqComponent:
+    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
+        yield env.timeout(latency_ns)
+        yield peer_credit_store.put(IpcqCreditMetadata(
+            consumer_seq=my_tail, src_sip=..., src_cube=..., src_pe=...,
+            src_direction=...,
+        ))
+```
+
+`latency_ns` is as defined in D9:
+
+```python
+path = self.ctx.router.find_path(self_pe_prefix, peer_pe_prefix)
+latency_ns = self.ctx.compute_drain_ns(path, credit_size_bytes)
+```
+
+**There is no separate IpcqPtrUpdate event**: head updates are handled by the D9
+piggyback model and tail updates by the D9 fast path SimPy Store channel.
+
+### D13. Test strategy
+
+This section specifies the unit / integration / regression tests.
+
+#### T1. Unit tests (component-level)
+
+- **PE_IPCQ unit** (`tests/test_pe_ipcq.py`):
+  - send: with no backpressure, the token is forwarded to PE_DMA immediately
+  - send: peer slot full → backpressure (per poll/sleep mode)
+  - send: backpressure released after the peer credit return (IpcqCreditMetadata) arrives
+  - recv: returns immediately when data has arrived
+  - recv: data not yet arrived → wait → wake on receiving IpcqMetaArrival (D9 piggyback)
+  - recv (round-robin): returns the first data to arrive among the 4 directions (weak fairness)
+  - invalid direction → IpcqInvalidDirection exception
+
+- **PE_DMA virtual channel** (`tests/test_pe_dma_vc.py`):
+  - vc_compute / vc_comm progress independently (when one channel stalls, the other proceeds)
+  - chunk-level interleave verification
+  - link BW split (50/50 or weighted)
+
+- **builtin topology** (`tests/test_ccl_topologies.py`):
+  - ring_1d/mesh_2d/tree_binary: (rank, world_size) → neighbor dict is correct for each
+  - mesh_2d non-square → ValueError
+  - resolve_topology(custom, module) → returns module.neighbors
+
+#### T2. Integration tests (E2E send/recv)
+
+- **`tests/test_ipcq_e2e.py`**:
+  - 2-rank ring: rank 0 send(E) → rank 1 recv(W) → data is correct
+  - 4-rank ring: bidirectional send/recv in flight concurrently, no deadlock
+  - mesh_2d 4×4: concurrent send/recv in all 4 directions (N/S/E/W)
+
+- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`):
+  - greenlet mode + IPCQ → verify that ipcq records are created in op_log
+  - results are correct after DataExecutor processes the ipcq ops (check `out.data`)
+
+#### T3. Backend init tests (`tests/test_ccl_backend_ipcq.py`)
+
+- load ccl.yaml → extract `defaults.algorithm`
+- builtin topology → IpcqInitMsg fan-out
+- IpcqEndpoint PAs are consistent across all PEs (the rx_base_pa of rank A's peer E = rank A+1's own rx_base_pa)
+- memory allocation per buffer_kind (tcm/hbm/sram)
+
+#### T4. Regression
+
+- all 401 existing tests PASS
+- no impact on op_log/DataExecutor from the ADR-0020 integration (benches that do not use CCL)
+
+#### T5. Performance / overhead
+
+- latency of a single send/recv pair = (DMA latency) + (IPCQ overhead)
+- comparison: should be nearly identical to an ordinary PE_DMA write of the same nbytes (IPCQ overhead < 100 ns)
+
+### D14. Invariants & Failure Modes
+
+This section lists the hang/error situations that commonly occur in CCL
+infrastructure and defines how each is handled.
+
+#### Invariants (what the simulation must guarantee)
+
+I1. **Slot lifecycle exactly-once**: one send → exactly one recv. A duplicate
+   send or duplicate recv is treated as a sequence error.
+
+I2. **Pointer monotonicity**: my_head and my_tail increase monotonically (never decrease).
+   sender_seq increases monotonically on the sending side, and cache updates on
+   the receiving side are monotonic as well.
+
+I3. **Endpoint consistency**: if the peer of rank A's IpcqEndpoint(direction=E)
+   is rank B, then the peer of rank B's IpcqEndpoint(reverse(E)) must be rank A.
+   Verified at backend init.
+
+I4. **buffer_kind consistency**: every PE in a ProcessGroup has the same
+   buffer_kind (mixed kinds are not supported in the first implementation).
+   A verification failure is an init error.
+
+I5. **op_log ordering**: send → DMA completion → recv possible. The t_start
+   order in op_log does not violate this causality.
+
+I6. **Atomic data + metadata visibility (MUST)**: this is the core correctness
+   condition of this ADR. On the receiving side, the data write
+   (MemoryStore.write) and the metadata forward (peer_head_cache update) must
+   happen in the same SimPy step. Control must never overtake data.
+
+   **Implementation rules (MUST)**:
+   - The vc_comm token arrival handler in PE_DMA (`_vc_comm_arrival`) **must not
+     place any SimPy yield** between the following two actions:
+     1. `MemoryStore.write(token.dst_endpoint.buffer_kind, token.dst_addr, data)`
+     2. forwarding `IpcqMetaArrival` to PE_IPCQ
+   - The two actions must run back-to-back within the same SimPy event callback.
+   - In code review, adding a `yield` (or `yield from`) between them is rejected
+     as a correctness violation.
+
+   If this rule is violated, another SimPy process can interleave, producing a
+   race condition in which head_cache becomes visible earlier or later than the
+   data.
+
+I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6, the step in
+   which `peer_head_cache > my_tail` becomes truthy is the same step in which
+   `MemoryStore.has(slot_addr)` becomes truthy. recv checks both conditions
+   (defensively), but checking either one alone is also correct.
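
The I3 check lends itself to a short sketch. It assumes neighbor tables are given as `{rank: {direction: peer_rank}}` and models only the four compass directions; `REVERSE` and `check_endpoint_consistency` are hypothetical names, and the reverse mapping for tree directions is not specified here.

```python
REVERSE = {"E": "W", "W": "E", "N": "S", "S": "N"}


def check_endpoint_consistency(tables):
    """Return (rank, direction, peer) triples that violate I3; empty list means OK."""
    violations = []
    for rank, neighbor_map in tables.items():
        for direction, peer in neighbor_map.items():
            # The peer's entry in the reverse direction must point back at rank.
            back = tables.get(peer, {}).get(REVERSE[direction])
            if back != rank:
                violations.append((rank, direction, peer))
    return violations


ring = {r: {"E": (r + 1) % 4, "W": (r - 1) % 4} for r in range(4)}
ok = check_endpoint_consistency(ring)

broken = {r: dict(m) for r, m in ring.items()}
broken[1]["W"] = 2  # rank 1 now claims its W peer is rank 2 instead of rank 0
bad = check_endpoint_consistency(broken)
```

A backend could run such a check once at init and turn a non-empty result into the F4 init error.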
+
+#### Failure Modes (runtime errors)
+
+F1. **Invalid direction**:
+   - the PE kernel calls `tl.send(dir="X")` → a direction that was not installed
+   - PE_IPCQ raises an `IpcqInvalidDirection` exception immediately
+   - the SimPy simulation aborts immediately with a clear error for the user
+
+F2. **Type mismatch**:
+   - dtype/shape/nbytes of the send and the recv do not match
+   - not validated in the first implementation (dtype/shape are hints); to be added later as a strict mode
+
+F3. **Deadlock detection (timeout-based)**:
+   - send: peer_tail_cache is never updated and the wait lasts forever
+   - recv: peer_head_cache is never updated and the wait lasts forever
+   - abort when the simulation timeout (default 10ms simulated time) is exceeded
+   - for debugging, dump each PE's last send/recv position and blocking state
+
+F4. **Backend init failure**:
+   - `defaults.algorithm` missing from ccl.yaml
+   - `algorithms[name]` definition missing
+   - algorithm module import fails
+   - topology verification fails (I3, I4)
+   → all raise an error immediately at `init_process_group` time
+
+F5. **Slot full + unbounded backpressure**:
+   - the peer never receives
+   - handled by timeout, as in F3
+   - debug: print my_head and peer_tail_cache of the stuck PE
+
+#### Diagnostic tools (added during implementation)
+
+- **CCL trace**: logs each send/recv as (rank, t, dir, nbytes)
+- **Pointer dump**: prints the IPCQ pointer state of every PE at simulation end or on hang
+- **Deadlock graph**: prints the wait-for graph on hang (which PE is waiting on which PE)
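
As an illustration of what a wait-for graph dump could report, the sketch below finds a cycle in a graph where each blocked PE waits on exactly one other PE. `find_wait_cycle` and the dict representation are assumptions for this example, not an existing kernbench utility.

```python
def find_wait_cycle(wait_for):
    """wait_for maps a blocked PE to the PE it waits on. Return one cycle or None."""
    for start in wait_for:
        seen = []
        node = start
        # Follow the chain until it leaves the graph or revisits a node.
        while node in wait_for and node not in seen:
            seen.append(node)
            node = wait_for[node]
        if node in seen:
            return seen[seen.index(node):]
    return None


deadlocked = find_wait_cycle({0: 1, 1: 2, 2: 0})  # ring where every rank waits in recv
progressing = find_wait_cycle({0: 1, 1: 2})       # rank 2 is not blocked
```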
+
+### D15. Algorithm author guide (summary)
+
+This section gives algorithm authors a one-screen starting point.
+For the detailed step-by-step guide see [docs/onboarding/ccl-author-guide.md](../onboarding/ccl-author-guide.md).
+
+#### What you touch / what you do not touch
+
+| Touch | Do not touch |
+|-------|--------------|
+| `src/kernbench/ccl/algorithms/<your_algo>.py` (kernel + optional neighbors) | `benches/ccl_allreduce.py` host code |
+| one new entry in `ccl.yaml` + `defaults.algorithm` | `src/kernbench/ccl/` framework |
+| (optional) `tests/test_<your_algo>.py` unit tests | `src/kernbench/components/builtin/pe_ipcq.py` component |
+| | `src/kernbench/runtime_api/distributed.py` backend |
+
+#### Algorithm module interface contract
+
+```python
+# src/kernbench/ccl/algorithms/<your_algo>.py
+
+def kernel(*args, tl) -> None:
+    """Required. The PE kernel.
+
+    Args (positional): tensor pointers, rank, world_size, algorithm parameters
+    Args (keyword): tl: TLContext (injected automatically)
+
+    Available IPCQ API:
+        tl.send(dir, src_addr, nbytes)         # blocking; waits under backpressure
+        tl.recv(dir)                            # blocking recv from a specific direction
+        tl.recv()                               # round-robin over the 4 directions
+        tl.recv_async(dir) → handle             # non-blocking
+        tl.wait(handle)                         # wait for a non-blocking op to complete
+
+    Existing APIs remain available as well:
+        tl.load / tl.store / tl.composite / tl.program_id, etc.
+    """
+    ...
+
+def neighbors(rank, world_size, neighbor_map) -> dict | None:
+    """Optional. Overrides the neighbor_map produced by the builtin topology in ccl.yaml.
+
+    Return None → the builtin is used as-is
+    Return a dict → that dict overrides (it may be the builtin modified in place)
+    """
+    return None
+```
+
+#### 5-step flow
+
+1. **Write the kernel function**: new file `src/kernbench/ccl/algorithms/<algo>.py`
+2. **Register in ccl.yaml**: add an `algorithms.<name>` entry + change `defaults.algorithm`
+3. **(Optional) neighbors override**: when the builtin topology needs modification
+4. **Unit test**: `kernbench.ccl.testing.run_kernel_in_mock` (fast, no SimPy)
+5. **Simulation check**: `kernbench run --bench ccl_allreduce --verify-data`
+
+Host code (`benches/ccl_allreduce.py`) is left untouched.
+
+#### Available helpers (`kernbench.ccl.helpers`)
+
+| Helper | Description |
+|--------|-------------|
+| `chunked(addr, n_chunks, ...)` | slices a tensor into n chunk views |
+| `ring_step(rank, step, ws)` | per-step (send_idx, recv_idx) for ring algorithms |
+| `tree_step(rank, level)` | per-level parent/child indices for a binary tree |
+
+#### Debugging tools
+
+- `KERNBENCH_CCL_TRACE=1`: prints the send/recv trace
+- automatic IPCQ pointer dump at simulation end
+- wait-for graph dump on deadlock (10ms of simulated time exceeded)
+
+#### Common mistakes
+
+1. **Using a direction that is not installed**: if the topology in ccl.yaml is ring_1d, N/S are unavailable
+2. **Unpaired send/recv**: with no recv on the peer side, the sender hangs (slot-full backpressure)
+3. **dtype/shape mismatch**: not validated in the first implementation; the author's responsibility
+
+The detailed step-by-step and a hello-world example are in `docs/onboarding/ccl-author-guide.md`.
+
+---
+
+## HW Realization Notes (Informative)
+
+**Status of this section**: Forward-looking. Describes how the simulator
+contract (D1–D15) would map to silicon. Not currently implemented;
+subject to revision before tapeout. The simulator implements the
+contract via Python/SimPy equivalents in
+[pe_ipcq.py](../../src/kernbench/components/builtin/pe_ipcq.py) and
+[pe_dma.py](../../src/kernbench/components/builtin/pe_dma.py).
+
+### D16. Proposed HW Block Diagram and End-to-End Dataflow
+
+![PE Baseline Architecture](../diagrams/pe_baseline.png)
+
+> Source: [`../diagrams/pe_baseline.d2`](../diagrams/pe_baseline.d2) — `d2 --layout=elk --scale 1.5`.
+
+![PE Proposed Architecture](../diagrams/pe_proposed.png)
+
+> Source: [`../diagrams/pe_proposed.d2`](../diagrams/pe_proposed.d2) — `d2 --layout=elk`.
+
+**Key changes, Baseline → Proposed**:
+
+- Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW)
+- PE_IPCQ (SimPy component) → **IPCQ Controller** (HW register + combinational logic)
+- An explicit **reserved IPCQ Slot Region** inside TCM
+- Credit Injector / Receiver connect directly to the NoC through the Fabric Port
+
+#### End-to-End Sequence (HW view)
+
+```mermaid
+sequenceDiagram
+    participant CPU_A as PE_A: PE_CPU
+    participant IPCQ_A as PE_A: IPCQ Ctrl
+    participant DMA_A as PE_A: DMA
+    participant NOC as NoC Fabric
+    participant DMA_B as PE_B: DMA
+    participant IPCQ_B as PE_B: IPCQ Ctrl
+    participant TCM_B as PE_B: TCM
+    participant CPU_B as PE_B: PE_CPU
+
+    Note over CPU_A: tl.send(dir="E", src=0x1000)
+
+    CPU_A->>IPCQ_A: MMIO: send request
+    Note over IPCQ_A: Backpressure check:<br/>(head - peer_tail_cache) < n_slots → PASS<br/>Slot addr gen:<br/>dst = peer_rx_base + (head%n) × slot_size
+    IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head}
+    Note over IPCQ_A: my_head++
+    IPCQ_A-->>CPU_A: send returns (fire-and-forget)
+
+    Note over DMA_A: TCM read → snapshot in read buffer<br/>Flit pack: data + {sender_seq, dst_addr}
+    DMA_A->>NOC: IPCQ data flit(s)
+
+    Note over NOC: hop latency + BW drain
+
+    NOC->>DMA_B: IPCQ data flit(s)
+    Note over DMA_B: Terminal BW drain<br/>Slot write latency
+
+    rect rgb(255, 240, 220)
+        Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall
+        DMA_B->>TCM_B: write data → slot address
+        DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr}
+    end
+
+    Note over IPCQ_B: Range match dst_addr → direction "W"<br/>peer_head_cache["W"] = sender_seq + 1
+    IPCQ_B-->>CPU_B: recv_wake signal
+
+    Note over CPU_B: tl.recv(dir="W") wakes up
+    CPU_B->>IPCQ_B: recv request
+    Note over IPCQ_B: peer_head_cache > my_tail → YES<br/>slot_addr = rx_base + (tail%n) × slot_size
+    IPCQ_B-->>CPU_B: return slot_addr
+    CPU_B->>TCM_B: read data from slot
+    Note over IPCQ_B: my_tail++
+
+    IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa}
+    Note over NOC: credit traversal (NoC latency)
+    NOC->>IPCQ_A: Credit arrival
+
+    Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled)
+```
+
+### D17. IPCQ Controller HW Module (new)
+
+A hardware control block located between PE_CPU and the DMA Engine. It
+corresponds to the simulator's `PeIpcqComponent`.
+
+#### QPair Register File
+
+Per-direction queue pair state is held in flip-flops. PE_CPU can read/write it
+via MMIO (CSR), and software populates it at init time.
+
+```
+Per-direction registers (each 64-bit):
+  my_head          : sender write position (monotonic)
+  my_tail          : receiver read position (monotonic)
+  peer_head_cache  : last known peer head (updated by Meta Extractor)
+  peer_tail_cache  : last known peer tail (updated by Credit Receiver)
+  rx_base_pa       : this PE's rx buffer base physical address
+  peer_rx_base_pa  : peer's rx buffer base physical address
+  n_slots          : ring depth (power-of-2 constraint, see D21)
+  slot_size        : bytes per slot
+  peer_credit_tgt  : credit receive address of the peer PE
+
+Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare)
+Total: 8 dirs × 9 regs × 8B = 576B flip-flops
+```
+
+#### Slot Address Generator (combinational)
+
+```
+Input:  pointer (my_head or my_tail), n_slots, slot_size, base_pa
+Output: slot_addr = base_pa + (pointer % n_slots) * slot_size
+
+Implementation:
+  n_slots power-of-2 → pointer & (n_slots - 1)   (AND mask, 1 gate)
+  slot_size power-of-2 → barrel shift             (1 cycle)
+  64-bit add → ripple/kogge-stone adder           (1 cycle)
+
+Latency: 1-2 cycles combinational
+```
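
The mask-and-shift form is equivalent to the modulo-and-multiply formula whenever both parameters are powers of two. A small Python sketch of that equivalence follows; the function names are chosen for illustration only.

```python
def slot_addr_reference(pointer, n_slots, slot_size, base_pa):
    """Direct form of the formula: base_pa + (pointer % n_slots) * slot_size."""
    return base_pa + (pointer % n_slots) * slot_size


def slot_addr_mask_shift(pointer, n_slots, slot_size, base_pa):
    """Same result using an AND mask and a left shift (power-of-2 parameters only)."""
    for value in (n_slots, slot_size):
        if value <= 0 or value & (value - 1):
            raise ValueError("n_slots and slot_size must be powers of two")
    index = pointer & (n_slots - 1)         # replaces the modulo
    shift = slot_size.bit_length() - 1      # log2(slot_size)
    return base_pa + (index << shift)       # replaces the multiply


mismatches = [
    p for p in range(64)
    if slot_addr_mask_shift(p, 8, 4096, 0xE00000)
    != slot_addr_reference(p, 8, 4096, 0xE00000)
]
```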
+
+#### Backpressure Comparator (combinational)
+
+```
+full = (my_head - peer_tail_cache) >= n_slots
+
+Implementation: 64-bit subtract + unsigned compare
+Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold
+Latency: 1 cycle
+```
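
One property worth spelling out is that an unsigned subtract stays correct when the 64-bit pointers wrap. The sketch below emulates that in Python with an explicit mask; treating the registers as wrapping modulo 2^64 is an assumption of this example.

```python
MASK64 = (1 << 64) - 1


def is_full(my_head, peer_tail_cache, n_slots):
    """full = (my_head - peer_tail_cache) >= n_slots, using 64-bit unsigned arithmetic."""
    outstanding = (my_head - peer_tail_cache) & MASK64
    return outstanding >= n_slots


# Normal case: 8 sends in flight against an 8-deep ring.
full_now = is_full(my_head=8, peer_tail_cache=0, n_slots=8)
# Wrapped case: head has wrapped past 2**64 while tail has not yet.
full_wrapped = is_full(my_head=3, peer_tail_cache=(1 << 64) - 5, n_slots=8)
```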
+
+#### Meta Extractor (inbound datapath sideband)
+
+Wired into the inbound vc_comm path of the DMA Engine. It extracts metadata from
+the header of each arriving IPCQ flit and updates the queue pair state.
+
+```
+Trigger: DMA inbound write completion (same cycle)
+Extract: {sender_seq, dst_addr} from flit header
+
+Direction matching (ADR-0025 D2):
+  for each dir:
+    match = (base_pa[dir] <= dst_addr) && (dst_addr < base_pa[dir] + n_slots[dir] * slot_size[dir])
+  8× parallel range comparators + priority encoder
+
+Update: peer_head_cache[matched_dir] = max(peer_head_cache[matched_dir], sender_seq + 1)
+Output: recv_wake signal → PE_CPU interrupt/flag
+Latency: 1 cycle (pipelined with DMA write, so I6 atomicity holds by construction)
+```
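
A software sketch of the range match and the monotonic update. The region table layout and the names `match_direction` and `on_meta_arrival` are assumptions for illustration; unlike a priority encoder, this sketch raises when zero or several regions match, which is a choice made here to surface misconfiguration.

```python
def match_direction(dst_addr, rx_regions):
    """rx_regions maps direction -> (base_pa, n_slots, slot_size)."""
    hits = [
        direction
        for direction, (base_pa, n_slots, slot_size) in rx_regions.items()
        if base_pa <= dst_addr < base_pa + n_slots * slot_size
    ]
    if len(hits) != 1:
        raise LookupError(f"dst_addr {dst_addr:#x} matched {len(hits)} regions")
    return hits[0]


def on_meta_arrival(peer_head_cache, rx_regions, sender_seq, dst_addr):
    """Update the matched direction's head cache; max() keeps it monotonic."""
    direction = match_direction(dst_addr, rx_regions)
    current = peer_head_cache.get(direction, 0)
    peer_head_cache[direction] = max(current, sender_seq + 1)
    return direction


regions = {"E": (0xE00000, 8, 4096), "W": (0xE08000, 8, 4096)}
cache = {}
first = on_meta_arrival(cache, regions, sender_seq=2, dst_addr=0xE0A000)
stale = on_meta_arrival(cache, regions, sender_seq=0, dst_addr=0xE08000)
```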
+
+#### Credit Injector (outbound)
+
+```
+Trigger: recv completion (after my_tail is incremented)
+Action:  pack 16B credit packet → DMA vc_comm (or a dedicated credit VC)
+
+Packet: {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa}
+Latency: 1 cycle to generate, then NoC traversal
+```
+
+#### Credit Receiver (inbound sideband)
+
+```
+Trigger: 16B credit packet arrival (from NoC)
+Extract: {consumer_seq, dst_rx_base_pa}
+
+Direction matching (ADR-0025 D3):
+  for each dir:
+    match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa)
+
+Update: peer_tail_cache[matched_dir] = max(peer_tail_cache[matched_dir], consumer_seq)
+Output: send_wake signal → deassert backpressure stall
+Latency: 1 cycle
+```
+
+### D18. DMA Engine vc_comm IPCQ-aware Mode
+
+An IPCQ flit handling mode is added to the existing vc_comm channel (D8).
+
+**Outbound**:
+
+1. Receive a command from the IPCQ Controller: `{src_addr, dst_addr, nbytes, sender_seq}`
+2. Read src_addr from TCM → snapshot into the DMA read buffer (standard DMA behavior)
+3. Flit pack: data + piggyback metadata (sender_seq, dst_addr)
+4. Inject into the NoC fabric port
+5. Fire-and-forget (does not wait for completion)
+
+**Inbound**:
+
+1. Receive the IPCQ flit from the NoC
+2. Terminal BW drain charge (`drain_ns = nbytes / bottleneck_bw`)
+3. Slot write latency charge (backing memory tier)
+4. **ATOMIC** (same pipeline stage, no stall insertion):
+   - TCM write: data → slot address
+   - Meta Extractor trigger: sender_seq + dst_addr → IPCQ Controller
+5. Done
+
+**Hardware guarantee of I6 atomicity**: TCM write completion and the Meta Extractor
+trigger occur in the same pipeline stage, so no separate synchronization is
+needed. The simulator's "no SimPy yield between MemoryStore.write and
+IpcqMetaArrival put" rule (D9, I6) is satisfied naturally.
+
+#### Data Snapshot Semantics
+
+Data latched in the DMA read buffer is unaffected by later modifications to the
+src memory. This is standard DMA read-then-write behavior, so no extra HW is needed.
+
+#### Credit Virtual Channel (optional)
+
+- **Option A**: multiplex credits onto vc_comm (distinguished as 16B header-only flits).
+- **Option B**: add a third, dedicated credit VC (strict priority > data).
+
+Option B is better for deadlock prevention, but the BW impact of a 16B credit is
+negligible, so Option A is also sufficient.
+
+### D19. Fabric Flit Format Extension
+
+```
+Ordinary data flit (e.g. 512-bit):
+┌──────────────────────────────────────────┐
+│ [511:480] routing header (32b)           │
+│ [479:0]   payload (480b = 60B)           │
+└──────────────────────────────────────────┘
+
+IPCQ data flit (metadata in the first flit only):
+┌──────────────────────────────────────────┐
+│ [511:480] routing header (32b)           │
+│   [511]    ipcq_flag (1b)                │  ← distinguishes IPCQ from normal DMA
+│   [510:509] vc_id (2b)                   │
+│   [508:480] route + hop count            │
+│ [479:416] ipcq_metadata (64b)            │  ← piggyback
+│   [479:448] sender_seq (32b)             │
+│   [447:416] dst_addr[31:0] (32b)         │  ← for direction matching
+│ [415:0]   payload (416b = 52B)           │
+└──────────────────────────────────────────┘
+Subsequent flits: full 60B payload (no metadata)
+
+Credit-only flit (128-bit, header-only):
+┌──────────────────────────────────────────┐
+│ [127:96]  routing header (32b)           │
+│   [127]   credit_flag (1b)               │
+│ [95:64]   consumer_seq (32b)             │
+│ [63:0]    dst_rx_base_pa (64b)           │
+└──────────────────────────────────────────┘
+```
+
+The first flit's payload shrinks from 60B to 52B (13% overhead). In a multi-flit
+transfer the subsequent flits carry a full payload, so the overhead is < 1% for
+large transfers.
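
The bit layout of the first IPCQ flit and the resulting flit counts can be checked with a few lines of Python. The helper names are hypothetical, and only the top 96 bits ([511:416]: routing header plus metadata) are packed; the payload is left out.

```python
def pack_first_flit_header(vc_id, route, sender_seq, dst_addr):
    """Pack bits [511:416] of the first IPCQ flit into a 96-bit integer."""
    if not (0 <= vc_id < 4 and 0 <= route < (1 << 29)):
        raise ValueError("vc_id is 2 bits and route is 29 bits")
    routing = (1 << 31) | (vc_id << 29) | route  # ipcq_flag = 1 in the top bit
    return (routing << 64) | ((sender_seq & 0xFFFFFFFF) << 32) | (dst_addr & 0xFFFFFFFF)


def unpack_first_flit_header(header):
    return {
        "ipcq_flag": (header >> 95) & 0x1,
        "vc_id": (header >> 93) & 0x3,
        "route": (header >> 64) & ((1 << 29) - 1),
        "sender_seq": (header >> 32) & 0xFFFFFFFF,
        "dst_addr_lo": header & 0xFFFFFFFF,
    }


def ceil_div(a, b):
    return -(-a // b)


def flit_counts(nbytes, first_payload=52, full_payload=60):
    """Return (flits with IPCQ metadata, flits for an ordinary transfer)."""
    ipcq = 1 + ceil_div(max(nbytes - first_payload, 0), full_payload)
    plain = max(1, ceil_div(nbytes, full_payload))
    return ipcq, plain


header = pack_first_flit_header(vc_id=1, route=5, sender_seq=7, dst_addr=0xE08000)
fields = unpack_first_flit_header(header)
counts_4k = flit_counts(4096)
```

For a 4096-byte slot the 8 bytes of metadata happen to fit in the slack of the last flit, so both forms need the same number of flits.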
+
+### D20. TCM IPCQ Slot Region Layout
+
+```
+TCM Memory Map (16MB):
+┌─────────────────────────────┐ 0x000000
+│  Kernel Working Memory      │
+│  (compute tensors)          │
+│  ~14MB                      │
+├─────────────────────────────┤ 0xE00000
+│  IPCQ RX Buffers            │
+│  Dir N: slots × slot_size   │
+│  Dir S: slots × slot_size   │
+│  Dir E: slots × slot_size   │
+│  Dir W: slots × slot_size   │
+│  ~1MB                       │
+├─────────────────────────────┤ 0xF00000
+│  IPCQ Metadata / Scratch    │
+│  ~1MB                       │
+└─────────────────────────────┘ 0xFFFFFF
+```
+
+The IPCQ region is placed in the upper banks of TCM to minimize bank conflicts
+with compute accesses (see Risk D22).
+
+### D21. 2nm Implementation Analysis
+
+#### Area Estimate
+
+| Module | Gate Count | Area (2nm est.) | Notes |
+|---|---|---|---|
+| QPair Register File | ~4.6K FF | 0.002 mm² | 576B flip-flops |
+| Slot Addr Gen + Backpressure | ~5K gates | 0.001 mm² | Combinational |
+| Meta Extractor + Credit Logic | ~3K gates | 0.001 mm² | 8× parallel comparators |
+| **IPCQ Controller subtotal** | **~12.6K** | **~0.004 mm²** | **< 0.1% of the whole PE** |
+| DMA vc_comm extension | ~2K gates | 0.002 mm² | Flit pack/unpack |
+| **Total delta** | **~14.6K** | **~0.006 mm²** | |
+
+#### Timing
+
+| Path | Delay (2nm est.) | Target Clock | Margin |
+|---|---|---|---|
+| Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× |
+| Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× |
+| Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× |
+| Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× |
+
+All critical paths fit within one cycle, so no timing-closure problem is expected.
+
+#### Power
+
+- Active: ~1 mW (register R/W + comparators, while send/recv is in flight)
+- Idle: leakage only
+- Negligible relative to total PE power
+
+#### Constraints
+
+| Item | Constraint | Rationale |
+|---|---|---|
+| `n_slots` | **must be a power of 2** | mod → AND mask (1 gate). An arbitrary value needs a divider (~10 cycles) |
+| `slot_size` | **power of 2 recommended** | mul → barrel shift. An arbitrary value needs a multiplier |
+| TCM IPCQ region | **placed in dedicated banks** | Avoids bank conflicts with compute accesses |
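+
+Why the power-of-2 constraints matter shows up in a small software model of the slot address
+generator. This is a sketch, not the RTL: `slot_address` and its parameters are hypothetical
+names, and it assumes both `n_slots` and `slot_size` are powers of two (the table only recommends
+this for `slot_size`).
+
+```python
+def slot_address(rx_base_pa: int, seq: int, n_slots: int, slot_size: int) -> int:
+    """Address of the slot used by sequence number `seq` in a ring of n_slots slots."""
+    for name, value in (("n_slots", n_slots), ("slot_size", slot_size)):
+        if value <= 0 or value & (value - 1):
+            raise ValueError(f"{name} must be a power of two, got {value}")
+    slot_idx = seq & (n_slots - 1)            # seq mod n_slots as a single AND mask
+    shift = slot_size.bit_length() - 1        # log2(slot_size)
+    return rx_base_pa + (slot_idx << shift)   # slot_idx * slot_size as a shift
+```
+
+With `n_slots=8` and `slot_size=0x1000`, sequence number 9 wraps to slot 1, so the result is
+`rx_base_pa + 0x1000` without any divider or multiplier.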
+
+### D22. Risk Assessment
+
+#### TCM Bank Conflict
+
+- **Risk**: a stall when an IPCQ slot write and a compute read hit the same bank
+- **Mitigation**: place the IPCQ region in dedicated banks at the upper TCM addresses (D20)
+- **Cost**: slightly reduced TCM banking flexibility
+- **Severity**: Medium for performance, Low for correctness (not a correctness issue)
+
+#### Credit Return Latency under Congestion
+
+- **Risk**: under NoC congestion, delayed credit returns stall the sender on backpressure
+- **Mitigation**:
+  - Put credits on a separate VC with strict priority (at 16B the BW impact is negligible)
+  - Or size n_slots generously (8+) so the buffer absorbs credit latency
+- **Severity**: Low (a 16B credit contributes almost nothing to congestion)
+
+#### Inter-Direction Ordering
+
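+- **Risk**: ordering when one PE sends in several directions at once
+- **Mitigation**: a per-direction monotonic seq is sufficient. Inter-direction ordering is the
+  kernel's (software's) responsibility, the same as in the current simulator model (D2 + D4)
+- **Severity**: Low (resolved by the architecture itself)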
+
+### D23. HW Alternatives Considered
+
+#### Doorbell + Polling (the traditional approach)
+
+```
+Send: DMA write data → DMA write doorbell register at peer → peer polls doorbell
+Recv: Polling loop on doorbell, or interrupt-driven
+```
+
+| Pros | Cons |
+|---|---|
+| Simple HW (no IPCQ controller needed) | Two DMA transactions (data + doorbell) |
+| Reuses the existing DMA | Ordering between data and doorbell must be enforced (fence) |
+| | Polling wastes power; interrupts add latency overhead |
+
+**Assessment**: latency is 2-3× higher than with piggyback. **Rejected.**
+
+#### Hardware Message Queue (NVIDIA NVLink style)
+
+```
+Send: CPU → push descriptor to HMQ → HW forwards it to the peer HMQ automatically
+Recv: pop descriptor from HMQ → look up the data pointer
+```
+
+| Pros | Cons |
+|---|---|
+| The CPU only writes a descriptor | Needs a separate HMQ engine (~0.05 mm²) |
+| Descriptor/data split → flexible | Datapath separate from DMA → duplicated area/power |
+| | Large tensors still need DMA in the end |
+
+**Assessment**: CCL's large-tensor pattern requires DMA anyway, so a dual HMQ + DMA structure
+wastes area. **Rejected.**
+
+#### RDMA-style Completion Queue (CQ)
+
+```
+Send: DMA write → CQE generated automatically at the peer
+Recv: CQ poll/interrupt → locate the data
+```
+
+| Pros | Cons |
+|---|---|
+| Mature InfiniBand/RoCE model | CQ management logic + CQE memory overhead |
+| Easy multi-tenancy/isolation | CQE/data ordering needs extra guarantees |
+| | Over-engineered for PE-to-PE CCL |
+
+**Assessment**: an RDMA CQ suits multi-tenant isolation on a host-facing NIC.
+In a single-owner PE-to-PE environment it is unnecessary complexity. **Rejected.**
+
+#### Credit-in-Data Piggyback (v2 optimization candidate)
+
+In the current design a credit return is a separate 16B packet. In a bidirectional communication
+pattern, **the credit can be merged into a data flit travelling in the reverse direction.**
+
+```
+PE_A →E→ PE_B: data + sender_seq=3
+PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4  ← credit merged into the data flit
+```
+
+| Pros | Cons |
+|---|---|
+| Removes dedicated credit packets → saves NoC BW | Unidirectional patterns need a fallback |
+| Credit latency → 0 in bidirectional allreduce | Adds 8B to the flit header (negligible overhead) |
+| | Slightly more complex logic |
+
+**Assessment**: a strong optimization of the current design. Bidirectional allreduce can drop
+credit packets entirely, and the standalone credit fallback is kept. **Recommended for v2.**
+
+### Open HW Questions
+
+- What share of the TCM may the IPCQ slot region occupy? (current assumption: ~1MB / 16MB = 6.25%)
+- Should credits get a dedicated VC, or be multiplexed onto vc_comm? (see D18)
+- Flit format compatibility across inter-SIP links still needs verification
+- Should n_slots have an upper bound? (8 directions × 8 slots × 64KB = 4MB → 25% of the TCM)
+
+---
+
+## Non-goals
+
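+- **Host collectives**: a model in which `dist.all_reduce` moves the data itself is out of scope
+  for this ADR. This ADR covers only communication that happens inside PE kernels.
+- **All-reduce algorithms**: the algorithms themselves (ring, tree, ...) belong in a separate ADR
+  (or in kernel code). This ADR defines only the infrastructure (IPCQ + VC).
+- **Reliability/error handling**: send/recv failures, link faults, and the like are not covered.
+- **Precise NoC arbiter model**: dynamic contention between VCs is outside the first implementation (D8).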
+
+---
+
+## Open Questions
+
+- **VC arbitration accuracy**: the first implementation uses deterministic chunk interleaving +
+  weighted round-robin. Under heavy contention the results may be more optimistic than real
+  hardware. If more precision is needed, consider a dedicated NoC arbiter component.
+- **Credit return fast path BW model**: the first implementation excludes the fast path from the
+  fabric BW contention model (it assumes a separate lightweight wire). If more precision is
+  needed, model the credit fast path as its own link, or add a
+  `credit_return_mode: piggyback` option.
+- **Memory allocation for ring buffer slots**: wherever the slots live (TCM/HBM/SRAM), IPCQ needs
+  the metadata (base addr, slot_size, n_slots). Still to decide: whether the host injects it over
+  a sideband at init, or via a message similar to the fabric MmuMapMsg.
+- **Default BW split between VCs**: an even split (50/50) or weighted (e.g. 80% compute,
+  20% comm). It will be exposed in ccl.yaml, but the default value is undecided.
+- **Number of directions**: fixed at 4 (N/S/E/W), 6 (+ Up/Down for 3D), or extended to a
+  variable N. The first implementation fixes it at 4.
+- **Multi-channel data structure (multi-tile aggregation)**: when one collective receives a
+  fan-out of several tiles, is the existing round-robin recv enough, or is a separate
+  primitive (`tl.recv_all`) needed?
+- **Round-robin recv fairness**: the first implementation provides weak fairness based on the
+  last_polled_dir index. If data always arrives first on one direction, starvation is possible.
+  Strict fairness would need a separate fairness counter.
+- **Sharper deadlock detection**: the first implementation is timeout-based. Tracking the
+  wait-for graph in real time would later enable deterministic deadlock detection.
+
+---
+
+## Consequences
+
+### Positive
+
+- PEs can communicate directly → CCL kernels become writable
+- The host only launches; synchronization happens inside the PEs → simple host code and strong
+  compute/comm overlap
+- VCs remove HoL blocking → collective latency is not blocked behind compute
+  traffic
+- Buffer location and backpressure mode are selectable as init parameters → benchmarkable
+- 4-direction logical neighbors → the host is free to map ring/mesh/tree and other
+  algorithms
+
+### Negative
+
+- One new component (PE_IPCQ), plus a PE_DMA redesign (VC support)
+- PE-local memory is consumed by IPCQ buffers (8 rings × slot_size × n_slots)
+- The VC arbitration model is a first-order approximation, so latency under heavy contention
+  may come out somewhat more optimistic than real HW (the D8 limitation)
+- VC chunk-level interleaving makes the PE_DMA implementation more complex
diff --git a/docs/adr-ko/ADR-0024-par-sip-tp-launcher.md b/docs/adr-ko/ADR-0024-par-sip-tp-launcher.md
new file mode 100644
index 0000000..b321e84
--- /dev/null
+++ b/docs/adr-ko/ADR-0024-par-sip-tp-launcher.md
@@ -0,0 +1,206 @@
+# ADR-0024: SIP-level Launcher — rank = SIP
+
+## Status
+
+Accepted
+
+## Context
+
+### Goal
+
+Align the participation unit (rank) of `torch.distributed` collective calls with the **SIP**
+(device) boundary. The target is bench code that reads **indistinguishably, at the host level,**
+from a real PyTorch DDP/TP script.
+
+Comparison with real PyTorch:
+
+| Dimension | real PyTorch | KernBench |
+| --- | --- | --- |
+| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
+| `get_rank()` | `RANK` env var | greenlet-local registry |
+| `get_world_size()` | `WORLD_SIZE` env var | number of SIPs in the topology |
+| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
+| `mp.spawn` | spawns OS processes | greenlet fan-out |
+
+### Problems to solve
+
+1. **rank = SIP in the public API**, so that bench workers never need to know about PEs.
+2. **Greenlet-local rank/device tracking**: within the 1-process model, each worker
+   greenlet must identify its own rank and its own SIP correctly.
+3. **Tensor placement = structural (sip, cube, pe)**: if a rank is a SIP, default tensor
+   placement must also be expressed in structural coordinates.
+
+### Non-problems (outside this ADR)
+
+- IPCQ direction addressing → ADR-0025
+- Removal of `DPPolicy.sip`/`num_sips` → ADR-0026
+- Megatron-style TP → ADR-0027
+- DTensor → ADR-0028 (future)
+- Worker scheduling / `mp.spawn` / collective drain / exception cleanup
+  → ADR-0027 D0/D1
+- Collective algorithm implementation (intercube_allreduce, SFR config) → ADR-0032
+
+## Decision
+
+### D1. rank = SIP (world_size resolution)
+
+```python
+def _resolve_world_size(self) -> int:
+    if "world_size" in self._merged:
+        return int(self._merged["world_size"])
+    defaults = self._cfg_all.get("defaults", {})
+    if "world_size" in defaults:
+        return int(defaults["world_size"])
+    spec = self.ctx.spec or {}
+    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
+```
+
+Precedence: algorithm override > defaults override > SIP count. The `ccl.yaml`
+override is kept as the legacy "rank = PE" test path.
+
+### D2. Greenlet-local rank registry (+ debug warning)
+
+```python
+import os
+import warnings
+
+
+class DistributedContext:
+    def __init__(self):
+        self._backend = None
+        self._rank_by_greenlet: dict = {}
+
+    def _bind_rank(self, g, rank: int) -> None:
+        self._rank_by_greenlet[g] = int(rank)
+
+    def get_rank(self) -> int:
+        self._ensure_initialized()
+        from greenlet import getcurrent
+        g = getcurrent()
+        if g not in self._rank_by_greenlet:
+            if os.environ.get("KERNBENCH_DEBUG"):
+                warnings.warn(
+                    "get_rank() called outside a bound greenlet — returning 0. "
+                    "Likely a bug unless running single-driver."
+                )
+            return 0
+        return int(self._rank_by_greenlet[g])
+```
+
+### D3. `torch.ahbm.set_device(rank)` — SIP binding
+
+The KernBench backend is named `ahbm` (ADR-0023). Real PyTorch uses
+`torch.cuda.set_device(r)`, but since we are not CUDA we use an honestly named
+namespace.
+
+```python
+class _AhbmNamespace:
+    """torch.ahbm — per-greenlet SIP device binding.
+
+    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
+    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
+    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
+    """
+
+    def __init__(self):
+        self._device_by_greenlet: dict = {}
+
+    def set_device(self, device: int) -> None:
+        from greenlet import getcurrent
+        self._device_by_greenlet[getcurrent()] = int(device)
+
+    def current_device(self) -> int | None:
+        from greenlet import getcurrent
+        return self._device_by_greenlet.get(getcurrent())
+
+# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
+# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
+```
+
+**Parallel support for the PyTorch 2.x style**: recent PyTorch is moving toward the
+device-agnostic `torch.accelerator` namespace (`torch.accelerator.set_device_index(r)`,
+`torch.accelerator.current_device_index()`). KernBench supports this surface too, for users
+who want to write code that is not tied to a device vendor.
+
+```python
+class _AcceleratorNamespace:
+    """torch.accelerator — device-agnostic API (PyTorch 2.x style).
+
+    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
+        torch.accelerator.set_device_index(rank)
+        torch.accelerator.current_device_index()
+    """
+
+    def __init__(self, ahbm: _AhbmNamespace):
+        self._ahbm = ahbm
+
+    def set_device_index(self, device: int) -> None:
+        self._ahbm.set_device(device)
+
+    def current_device_index(self) -> int | None:
+        return self._ahbm.current_device()
+
+# RuntimeContext
+self.ahbm = _AhbmNamespace()
+self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias
+```
+
+Bench authors pick either form; both are backed by the same registry internally:
+
+```python
+torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
+torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
+```
+
+### D4. Tensor placement = structural (sip, cube, pe) coordinates
+
+`resolve_dp_policy` takes `target_sip` directly and builds the placement in structural
+coordinates. Details are in ADR-0026.
+
+```python
+# RuntimeContext._create_tensor
+current_sip = self.ahbm.current_device()          # (D3 naming)
+if current_sip is None:
+    current_sip = 0  # single-driver fallback (consistent with D2)
+placement = resolve_dp_policy(
+    dp, shape=shape_2d, itemsize=itemsize,
+    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
+    target_sip=current_sip,
+)
+```
+
+No post-hoc `pe_index` shifting: ShardSpec holds the structural `(sip, cube, pe)`
+coordinates directly. See ADR-0026 for ShardSpec details.
+
+---
+
+## Dependencies
+
+- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
+- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature used in D4 and
+  ShardSpec's structural coordinate representation.
+- **ADR-0027** (Megatron TP + scheduler): the reference implementation for worker scheduling,
+  `mp.spawn`, collective drain, and exception cleanup.
+
+---
+
+## Non-goals
+
+- **IPCQ protocol changes**: ADR-0023 stays as is.
+- **DPPolicy field cleanup**: ADR-0026.
+- **Megatron-style TP**: ADR-0027.
+- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
+- **Collective algorithm implementation**: ADR-0032.
+- **Multi-node (cross-process)**: single process only.
+
+---
+
+## Consequences
+
+### Positive
+
+- **Bench = real PyTorch DDP** (from the public API's point of view).
+- **Greenlet-local rank**: cross-rank correctness becomes possible in the 1-process model.
+- **Structural placement coordinates**: the other ADRs (ADR-0026 / ADR-0027 / ADR-0032)
+  operate consistently on the `(sip, cube, pe)` 3-tuple.
+
+### Neutral
+
+- The IPCQ PE-level protocol (ADR-0023) is unchanged.
+- The IO_CPU role is unchanged (existing transit behaviour kept).
diff --git a/docs/adr-ko/ADR-0025-algo-ipcq-direction-addressing.md b/docs/adr-ko/ADR-0025-algo-ipcq-direction-addressing.md
new file mode 100644
index 0000000..8a6afa0
--- /dev/null
+++ b/docs/adr-ko/ADR-0025-algo-ipcq-direction-addressing.md
@@ -0,0 +1,283 @@
+# ADR-0025: IPCQ Direction Addressing — address-based matching
+
+## Status
+
+Accepted (Revision 2 — Address-based matching; peer_direction field dropped)
+
+## Context
+
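+### Goal
+
+In the ADR-0023 IPCQ protocol, make the **identification of "which direction pair a transfer
+used"** consistent and **address-based**, independent of topology and dict order.
+It must work correctly in a 2-rank bidirectional ring (or, more generally, any topology where
+several directions point to the same peer).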
+
+### The bug that surfaced — 2-rank bidirectional ring
+
+`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions point to the same peer.
+
+**Bug 1 (install)**:
+- `reverse_direction(0, 1)` → returns "E" by dict order (wrong; "W" is correct under the
+  opposite-direction convention)
+- rank 0's E entry gets `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`
+- tl.send(E) → data lands in sip1's E-rx buffer (it should be W-rx)
+
+**Bug 2 (runtime)**:
+- Even if install sets the right address, the receiver's `_handle_meta_arrival` matches the
+  direction from the sender coordinates alone → the first direction (E) wins
+- peer_head_cache[E] is incremented while peer_head_cache[W] stays unchanged
+- The kernel's tl.recv(W) waits on peer_head_cache[W] → blocks forever → IpcqDeadlock
+
+### Root cause
+
+The same problem on two axes:
+1. **Install-time pairing**: deciding "which of the peer's directions pairs with my direction"
+   depends on dict iteration order → fragile when several directions point to the same
+   peer
+2. **Runtime identification**: deciding "which qp to update" uses the sender coordinates
+   alone → ambiguous when directions are duplicated
+
+### Approach — address-based matching
+
+Each PE's rx buffers sit in an **address range unique to each direction** (rx_base_pa +
+direction_idx × bytes_per_direction). Therefore:
+
+- **Runtime**: match on the **dst_addr range** instead of the sender coord → unambiguous
+- **Install**: a heuristic that prefers the opposite direction (the natural symmetry of
+  ring / mesh)
+- No duplicate metadata such as `peer_direction` is needed: **the address is the single source
+  of truth**
+
+This design works **independently of the PhysAddr transition (ADR-0030)**. Whether addresses are
+the current synthetic ones or PhysAddr, it applies unchanged as long as per-direction ranges stay unique.
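+
+The address-based matching idea can be sketched in a few lines. This is a standalone
+illustration, not the `pe_ipcq.py` code: `rx_ranges` and `match_direction` are hypothetical
+helpers, the direction order N/S/E/W is an assumption, and every direction is assumed to own
+`bytes_per_direction` contiguous bytes starting at `rx_base_pa`.
+
+```python
+from __future__ import annotations
+
+DIRECTIONS = ("N", "S", "E", "W")  # assumed direction_idx order
+
+
+def rx_ranges(rx_base_pa: int, bytes_per_direction: int) -> dict[str, range]:
+    """Per-direction rx ranges: rx_base_pa + direction_idx * bytes_per_direction."""
+    return {
+        d: range(rx_base_pa + i * bytes_per_direction,
+                 rx_base_pa + (i + 1) * bytes_per_direction)
+        for i, d in enumerate(DIRECTIONS)
+    }
+
+
+def match_direction(ranges: dict[str, range], dst_addr: int) -> str | None:
+    """Return the single direction whose rx range contains dst_addr, else None."""
+    for d, r in ranges.items():
+        if dst_addr in r:
+            return d
+    return None
+```
+
+In the 2-rank ring both E and W name rank 1 as their peer, yet a `dst_addr` inside the W range
+still resolves to W alone, because the lookup never consults the sender's coordinates.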
+
+---
+
+## Decision
+
+### D1. Install — `reverse_direction` opposite-preference
+
+`src/kernbench/ccl/install.py`:
+
+```python
+# Extended in ADR-0032 with global_* pairs for inter-SIP directions,
+# which were introduced by configure_sfr_intercube_multisip to keep
+# intercube (N/S/E/W) and inter-SIP (global_N/S/E/W) namespaces disjoint.
+_OPPOSITE_DIR = {
+    "E": "W", "W": "E", "N": "S", "S": "N",
+    "global_E": "global_W", "global_W": "global_E",
+    "global_N": "global_S", "global_S": "global_N",
+}
+
+def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
+    """Find peer's direction that reciprocates my_dir→peer_rank.
+
+    Prefer the OPPOSITE direction (E↔W, N↔S) when the peer has it
+    pointing back to us. This matters in 2-rank bidirectional rings
+    where both E and W on one side point to the same peer — without
+    the preference, the first-match-wins iteration would route data
+    into the wrong rx slot. Falls back to any direction pointing back
+    for topologies without an opposite convention (tree_binary's
+    parent/child).
+    """
+    nt = neighbor_table[peer_rank]
+    opp = _OPPOSITE_DIR.get(my_dir)
+    if opp is not None and nt.get(opp) == my_rank:
+        return opp
+    for d, target in nt.items():
+        if target == my_rank:
+            return d
+    return None
+```
+
+Call site:
+
+```python
+for d, peer_rank in nbrs.items():
+    peer_dir = reverse_direction(r, peer_rank, d)  # pass my_dir
+    if peer_dir is None:
+        continue
+    ...
+```
+
+### D2. Runtime — `_handle_meta_arrival` dst_addr matching
+
+`src/kernbench/components/builtin/pe_ipcq.py`:
+
+```python
+def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
+    """Match incoming token to the receiver-side direction by dst_addr range.
+
+    Each direction has a unique rx buffer address range
+    [my_rx_base_pa, my_rx_base_pa + n_slots * slot_size). The token's dst_addr (set by
+    the sender's IPCQ when computing peer's slot address) falls within
+    exactly one such range. This address-based matching is unambiguous
+    even when multiple directions have the same peer (2-rank ring).
+    """
+    token = msg.token
+    dst_addr = token.dst_addr
+    for d, qp in self._queue_pairs.items():
+        base = qp["my_rx_base_pa"]
+        size = qp["n_slots"] * qp["slot_size"]
+        if base <= dst_addr < base + size:
+            qp["peer_head_cache"] = max(qp["peer_head_cache"],
+                                         token.sender_seq + 1)
+            self._arrived_tokens.setdefault(d, []).append(token)
+            waiters = self._recv_waiters.get(d, [])
+            self._recv_waiters[d] = []
+            for ev in waiters:
+                if not ev.triggered:
+                    ev.succeed()
+            any_waiters = self._any_recv_waiters
+            self._any_recv_waiters = []
+            for ev in any_waiters:
+                if not ev.triggered:
+                    ev.succeed()
+            return
+    # Unknown dst_addr — diagnostic log (should not happen under correct install)
+```
+
+The sender-coordinate check is **removed**. `dst_addr` already determines the direction.
+
+### D3. Credit — add the `dst_rx_base_pa` field
+
+`src/kernbench/common/ipcq_types.py`:
+
+```python
+@dataclass(frozen=True)
+class IpcqCreditMetadata:
+    consumer_seq: int
+    dst_rx_base_pa: int       # NEW: matched against the original sender's peer.rx_base_pa
+    # existing fields (kept for diagnostics / logging)
+    src_sip: int
+    src_cube: int
+    src_pe: int
+    src_direction: str
+```
+
+When a credit is created (`_delayed_credit_send`), it carries this side's `my_rx_base_pa` for the
+direction as `dst_rx_base_pa` (the same value the peer used as `peer.rx_base_pa` when it sent).
+
+Receiving side (`_credit_worker`):
+
+```python
+def _credit_worker(self, env):
+    while True:
+        credit = yield self._credit_inbox.get()
+        for d, qp in self._queue_pairs.items():
+            # find the qp whose peer rx_base_pa equals the credit's dst_rx_base_pa
+            if qp["peer"].rx_base_pa == credit.dst_rx_base_pa:
+                qp["peer_tail_cache"] = max(qp["peer_tail_cache"],
+                                              credit.consumer_seq)
+                waiters = self._send_waiters.get(d, [])
+                self._send_waiters[d] = []
+                for ev in waiters:
+                    if not ev.triggered:
+                        ev.succeed()
+                break
+```
+
+The sender-coordinate check is removed. Matching on `dst_rx_base_pa` is unambiguous.
+
+### D4. Do **not** add a `peer_direction` field to `IpcqInitEntry`
+
+The `IpcqInitEntry.peer_direction` field proposed in ADR-0025 rev 1 is **unnecessary**.
+Reasons:
+- Meta arrival is matched by dst_addr (D2)
+- Credit is matched by dst_rx_base_pa (D3)
+- The qp does not need to store peer_direction
+- Install uses peer_dir only internally, when computing rx_base_pa (`reverse_direction`)
+
+The IpcqInitEntry schema is unchanged: a **simplification** relative to rev 1.
+
+### D5. Keep `IpcqDmaToken.src_direction` (diagnostic only)
+
+The existing `src_direction` field is not removed. It is kept for:
+- Logging / trace: the `(rank, t, dir, nbytes)` tuple in `KERNBENCH_CCL_TRACE=1` output
+- Diagnostics: showing the direction in pointer_dump and similar tools
+- Room for future extension
+
+Runtime matching uses `dst_addr` only.
+
+### D6. Invariants (strengthening ADR-0023 I3)
+
+**I3 (strict)**: for every direction pair `(my_direction, peer_direction)`, my
+rx_base and the peer's rx_base must point to **distinct direction slots**. Install must
+guarantee this (reverse_direction opposite-preference).
+
+**I3.1 (new)**: for every qp, `qp["my_rx_base_pa"]` and `qp["peer"].rx_base_pa` occupy
+mutually disjoint address ranges (buffers of different directions never
+overlap). This is the precondition for the address-based matching in D2/D3.
+
+This can be verified at install time:
+```python
+# ccl/install_plan.py: assertion at the end of build_install_plans
+all_rx_ranges = set()
+for plan in plans:
+    for pe_install in plan.pe_installs:
+        for entry in pe_install.neighbors:
+            r = (entry.my_rx_base_pa,
+                 entry.my_rx_base_pa + plan.n_slots * plan.slot_size)
+            overlap = any(_ranges_overlap(r, e) for e in all_rx_ranges)
+            assert not overlap
+            all_rx_ranges.add(r)
+```
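+
+The snippet above calls `_ranges_overlap` without showing it. One plausible definition for
+half-open `(start, end)` ranges is sketched below; the body is an assumption rather than the
+actual KernBench helper, and `check_disjoint` is a hypothetical wrapper that mirrors the
+pairwise assertion.
+
+```python
+from __future__ import annotations
+
+
+def _ranges_overlap(a: tuple[int, int], b: tuple[int, int]) -> bool:
+    """True when half-open ranges [a0, a1) and [b0, b1) share an address."""
+    return a[0] < b[1] and b[0] < a[1]
+
+
+def check_disjoint(ranges: list[tuple[int, int]]) -> None:
+    """Pairwise O(n^2) check: raise if any two rx ranges overlap."""
+    seen: list[tuple[int, int]] = []
+    for r in ranges:
+        for e in seen:
+            if _ranges_overlap(r, e):
+                raise AssertionError(f"rx ranges overlap: {r} vs {e}")
+        seen.append(r)
+```
+
+Ranges that merely touch, such as `(0, 4)` and `(4, 8)`, count as disjoint under the half-open
+convention, which matches back-to-back direction buffers.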
+
+---
+
+## Dependencies
+
+- **ADR-0023** (IPCQ protocol): this ADR modifies ADR-0023's runtime matching logic
+  (D2, D3) and improves the install heuristic (D1). The semantic layer of the IPCQ protocol
+  is unchanged.
+- **ADR-0024** (launcher): the 2-rank bidirectional ring occurs in practice under
+  ADR-0024's ws=SIP_count model. This ADR makes that case work.
+- **ADR-0030** (PhysAddr transition, stub): **independent**. ADR-0025's address-based
+  matching works the same with the current synthetic addresses or with PhysAddr.
+
+---
+
+## Non-goals
+
+- **Switching the IPCQ address scheme to PhysAddr**: ADR-0030 scope. This ADR does not depend
+  on how addresses are encoded.
+- **Multi-hop routing**: the single-hop DMA write premise of ADR-0023 D5 is kept.
+- **Unidir ring specialization**: `ring_1d_unidir` has only one direction, so this bug does not
+  apply.
+
+---
+
+## Open questions
+
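+- **Address matching performance**: `_handle_meta_arrival` and `_credit_worker` scan the qps
+  linearly (max 4 directions). The performance impact is negligible. If it becomes a problem,
+  switch to a dict lookup (`_qp_by_rx_base`).
+- **Re-evaluating the need for `IpcqDmaToken.src_direction`**: whether to keep a field that
+  remains only for diagnostics, or split it out into logging-only data. Kept for now.
+- **Cost of install-time invariant verification**: the I3.1 check in D6 is
+  O((N_PE × N_direction)^2). It may get slow on large topologies → a data structure such as an
+  interval tree can improve it. Simple implementation first.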
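+
+If the quadratic check ever becomes a bottleneck, sorting is a simpler alternative to an interval
+tree. The sketch below is an assumption about how that could look, not existing KernBench code:
+`find_overlap` is a hypothetical helper over non-empty half-open `(start, end)` ranges. After
+sorting by start address, any overlap must show up between two neighbours, so a single pass
+suffices (O(n log n) overall).
+
+```python
+def find_overlap(ranges):
+    """Return the first overlapping pair of half-open (start, end) ranges, or None."""
+    ordered = sorted(ranges)
+    for prev, cur in zip(ordered, ordered[1:]):
+        if cur[0] < prev[1]:      # next range starts before the previous one ends
+            return prev, cur
+    return None
+```
+
+An install-time assertion could then read `assert find_overlap(all_rx_ranges) is None`.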
+
+---
+
+## Consequences
+
+### Positive
+
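+- **Simplicity**: removes the duplicate `peer_direction` metadata. The address is the single source of truth.
+- **Unambiguous matching**: works on every topology, including duplicated directions.
+- **Minimal schema change**: `IpcqInitEntry` unchanged; one field added to `IpcqCreditMetadata`.
+- **Independent of the PhysAddr transition (ADR-0030)**: address-based matching does not depend on the address encoding.
+- **Diagnostics preserved**: `IpcqDmaToken.src_direction` stays for logging.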
+
+### Negative
+
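+- Runtime matching is now an address comparison, so debugging a question like "why was
+  peer_head_cache[W] updated instead of [E]" means tracing address ranges (previously the
+  direction name was enough). Remedy: include the "direction ↔ rx_base_pa" mapping in pointer_dump.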
+
+### Neutral
+
+- The semantic layer of the IPCQ protocol (the sender computes dst_addr, the receiver receives)
+  is unchanged.
diff --git a/docs/adr-ko/ADR-0026-par-dppolicy-intra-device.md b/docs/adr-ko/ADR-0026-par-dppolicy-intra-device.md
new file mode 100644
index 0000000..d043f59
--- /dev/null
+++ b/docs/adr-ko/ADR-0026-par-dppolicy-intra-device.md
@@ -0,0 +1,288 @@
+# ADR-0026: DPPolicy = Intra-Device Only — remove the sip/num_sips fields
+
+## Status
+
+Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)
+
+## Context
+
+### Goal
+
+Clarify `DPPolicy` as a pure intra-device abstraction that expresses only the **cube × PE
+distribution inside one device (SIP)**. Distribution across SIPs (TP) moves to a separate layer
+(handled by ADR-0024's `torch.ahbm.set_device(rank)` or by ADR-0027's Megatron parallel
+layers).
+
+## Decision
+
+### D1. Remove the `sip` + `num_sips` fields from `DPPolicy`
+
+```python
+@dataclass(frozen=True)
+class DPPolicy:
+    """Intra-device (cube × PE) data-parallel policy.
+
+    SIP-level placement is controlled by ``torch.ahbm.set_device(rank)``
+    (ADR-0024 D3) and, for model-level TP, by Megatron-style parallel
+    layers (ADR-0027). DPPolicy does not cross SIP boundaries.
+    """
+    cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
+    pe: Literal["replicate", "column_wise", "row_wise"] = "replicate"
+    num_pes: int | None = None
+    num_cubes: int | None = None
+```
+
+Removed fields: `sip`, `num_sips`.
+
+### D2. `ShardSpec` — structural (sip, cube, pe) coordinates, `pe_index` removed entirely
+
+Today `ShardSpec.pe_index` is a **global flat index** (`sip × cubes × pes + cube ×
+pes + pe`). This is the form that ADR-0024 D4 calls out as "abstraction leakage".
+
+This ADR **redefines ShardSpec in structural coordinates** and does **not keep** `pe_index`,
+not even as a property:
+
+```python
+# src/kernbench/policy/placement/dp.py (after)
+@dataclass(frozen=True)
+class ShardSpec:
+    """Structural shard placement — intra-SIP (cube × PE) coord.
+
+    Global-flat `pe_index` was removed in ADR-0026. Callers must use
+    structural coords (sip, cube, pe) directly. If a flat integer key is
+    needed (e.g. dict lookup), compute it explicitly at the call site.
+    """
+    sip: int              # structural — which SIP this shard lives on
+    cube: int             # local within SIP
+    pe: int               # local within cube
+    offset_bytes: int
+    nbytes: int
+```
+
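+**Core principles**:
+- The identity of a ShardSpec is the `(sip, cube, pe)` 3-tuple.
+- **No `pe_index` property either**, which blocks silent semantics drift.
+- Existing callers that expected a global flat index get an **immediate
+  `AttributeError`** on `.pe_index` access → they must migrate to structural coordinates.
+- In a local context that needs a flat integer key (e.g. an internal dict lookup), the caller
+  computes `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe` explicitly.
+
+**Justification for removing the property**: KernBench is an in-house project with a limited set
+of call sites. Explicit breakage (AttributeError) is far safer than the risk of silent drift
+(the meaning changes while the type stays int).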
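+
+For the rare call site that still needs a flat integer key, the explicit computation could look
+like the sketch below. `flat_pe_key` and `unflatten_pe_key` are hypothetical names used only for
+illustration (ADR-0026 deliberately ships no such helper); the formula is the one quoted above.
+
+```python
+from __future__ import annotations
+
+
+def flat_pe_key(sip: int, cube: int, pe: int, *, n_cubes: int, n_pe: int) -> int:
+    """Explicit flat key: sip * n_cubes * n_pe + cube * n_pe + pe."""
+    if sip < 0 or not 0 <= cube < n_cubes or not 0 <= pe < n_pe:
+        raise ValueError(f"coordinate out of range: {(sip, cube, pe)}")
+    return sip * n_cubes * n_pe + cube * n_pe + pe
+
+
+def unflatten_pe_key(key: int, *, n_cubes: int, n_pe: int) -> tuple[int, int, int]:
+    """Inverse of flat_pe_key: recover the structural (sip, cube, pe) coordinate."""
+    sip, rest = divmod(key, n_cubes * n_pe)
+    cube, pe = divmod(rest, n_pe)
+    return sip, cube, pe
+```
+
+Keeping `n_cubes` and `n_pe` as required keyword arguments makes the topology assumption visible
+at the call site, which is the point of removing the implicit `pe_index`.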
+
+### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates
+
+Implements the contract of ADR-0024 D4. No post-hoc shifting.
+
+```python
+# src/kernbench/policy/placement/dp.py (after)
+
+@dataclass(frozen=True)
+class _LocalPeShard:
+    """Internal: return type of the PE resolvers. Cube-local PE identifier + payload."""
+    local_pe: int                  # cube-local PE index (0..num_pe-1)
+    offset_bytes: int
+    nbytes: int
+
+
+def resolve_dp_policy(
+    policy: DPPolicy,
+    *,
+    shape: tuple[int, int],
+    itemsize: int,
+    num_pe: int,
+    num_cubes: int = 1,
+    target_sip: int,       # NEW: the SIP on which to place the shards
+) -> list[ShardSpec]:
+    """2-level resolution (cube × PE) on a specified SIP.
+
+    Returns ShardSpecs with structural coords (sip=target_sip, cube, pe).
+    No SIP-level split — DPPolicy is intra-device only.
+    """
+    resolver = _PE_RESOLVERS[policy.pe]
+    all_shards: list[ShardSpec] = []
+
+    # Level 1: cube within SIP
+    cube_splits = _split_shape(policy.cube, shape, num_cubes, itemsize)
+
+    for cube_id, (cube_shape, cube_offset) in enumerate(cube_splits):
+        # Level 2: PE within cube — resolver returns _LocalPeShard (local_pe)
+        local_shards = resolver(shape=cube_shape, itemsize=itemsize,
+                                 num_pe=num_pe)
+
+        for ls in local_shards:
+            all_shards.append(ShardSpec(
+                sip=target_sip,                   # from caller (current_device)
+                cube=cube_id,                     # local within SIP
+                pe=ls.local_pe,                   # local within cube (explicit name)
+                offset_bytes=cube_offset + ls.offset_bytes,
+                nbytes=ls.nbytes,
+            ))
+
+    return all_shards
+```
+
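+The **internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a list of
+`_LocalPeShard`; the field name `local_pe` makes it **explicit that this is a cube-local PE
+identifier**. This resolves the earlier naming confusion with `ShardSpec.pe_index`.
+
+**Naming conventions** (across the whole ADR):
+- `ShardSpec.pe`: the final external API, a cube-local PE (structural coord)
+- `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage
+- `pe_index`: **removed**. It survives nowhere, external or internal (a side effect of blocking
+  silent drift: the name never reappears).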
+
+### D4. `_create_tensor` — direct placement in structural coordinates
+
+A continuation of ADR-0024 D4. Post-hoc shifting is removed, and the structural coordinate is
+given directly at the `resolve_dp_policy` call.
+
+```python
+# context.py _create_tensor (after)
+current_sip = self.ahbm.current_device()
+if current_sip is None:
+    # Single-driver fallback (consistent with ADR-0024 D2).
+    # If launcher-based code forgets set_device(), tensors silently land on
+    # SIP 0, so warn in debug mode.
+    if os.environ.get("KERNBENCH_DEBUG"):
+        import warnings
+        warnings.warn(
+            "torch.ahbm.current_device() is None; defaulting to SIP 0. "
+            "If this is a multi-rank launcher context, you likely forgot "
+            "torch.ahbm.set_device(rank) inside the worker.",
+            stacklevel=2,
+        )
+    current_sip = 0
+
+placement = resolve_dp_policy(
+    dp,
+    shape=shape_2d,
+    itemsize=itemsize,
+    num_pe=eff_num_pe,
+    num_cubes=eff_num_cubes,
+    target_sip=current_sip,          # ← structural coordinate set up front
+)
+
+# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
+# The old post-hoc shifting block is removed entirely.
+```
+
+**Every** tensor is placed on the current device's SIP. To build a multi-SIP tensor, use
+the TP primitives of ADR-0027.
+
+**Trade-off of the single-driver fallback**: defaulting to SIP 0 when set_device was never
+called is kept for compatibility with existing single-driver tests. With `KERNBENCH_DEBUG=1`,
+a warning makes it possible to detect tensors silently landing on the wrong SIP because a
+launcher context forgot set_device.
+
+### D5. Downstream — allocator lookup by structural tuple key
+
+The existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
+
+```python
+for spec in placement:
+    alloc = allocators[spec.pe_index]       # ← AttributeError (property removed)
+```
+
+With `pe_index` gone, migration to structural coordinates is **forced**:
+
+```python
+for spec in placement:
+    alloc = allocators[(spec.sip, spec.cube, spec.pe)]
+```
+
+The dict population in `_ensure_allocators` also uses tuple keys:
+
+```python
+# context.py _ensure_allocators (after)
+for sip_id in sip_range:
+    for cube_id in range(cubes_per_sip):
+        for pe_id in range(pes_per_cube):
+            self._allocators[(sip_id, cube_id, pe_id)] = PEMemAllocator(
+                rack_id=0, sip_id=sip_id, cube_id=cube_id, pe_id=pe_id, cfg=cfg,
+            )
+```
+
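+`_free_tensor` follows suit: the old `flat_idx = sip * ... + cube * ... + pe` computation
+block is removed and `(shard.sip, shard.cube, shard.pe)` is used directly.
+
+**Tuple vs dataclass `PEIdentity`**: the tuple is recommended because it is simple and usable
+as a hashable key as is. A `PEIdentity` value object would add explicit typing, but it brings
+boilerplate and would be over-engineering while it is only the allocator dict's key. Keep the tuple.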
+
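+### D7. Backward compatibility — none (cleanup ADR)
+
+This ADR is a **breaking change**.
+
+1. Calling `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)` → `TypeError`
+2. Accessing `ShardSpec.pe_index` → `AttributeError`
+
+Both are **immediate, explicit breakage**. There is no deprecation warning or fallback path.
+KernBench is an in-house project with a limited set of call sites, so the migration happens in one pass.
+
+**Blocking silent drift** is the main benefit of removing the property entirely: it eliminates the
+chance that code expecting a global flat index receives a SIP-local result and quietly indexes wrongly.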
+
+## Dependencies
+
+- **ADR-0024** (launcher): `set_device(rank)` and current-device scoping provide the
+  SIP placement mechanism. This ADR builds on that and narrows DPPolicy to pure
+  intra-device scope.
+- **ADR-0027** (Megatron TP): the alternative path when a tensor must span multiple SIPs.
+  Once this ADR is applied, multi-SIP use cases move to ADR-0027.
+
+---
+
+## Non-goals
+
+- **Redesigning `DPPolicy.cube` / `pe`**: the existing replicate/column_wise/row_wise semantics
+  are kept.
+- **Unifying tiling policies**: `tiled_column_major` / `tiled_row_major` stay as they are.
+- **A new multi-device tensor abstraction**: DTensor-like work is ADR-0028.
+
+---
+
+## Open questions
+
+- **Default current_sip in `_create_tensor`**: when set_device was never called, fall back to
+  rank=0 (SIP 0) or raise an error? The recommendation is the fallback (compatibility with
+  existing single-driver tests).
+- **Scope of the `test_sip_parallel.py` rewrite**: moving the existing unit tests onto the
+  launcher while keeping their intent needs additional fixtures. Scoped as separate work.
+- **Meaning of `num_sips=None` on `DPPolicy`**: once the field is gone, the `num_sips` concept
+  itself disappears. The explicit answer is to use ADR-0027's TP primitives to express
+  multi-SIP placement.
+
+**Resolved (open in earlier revisions)**:
+- ~~Whether to keep the `ShardSpec.pe_index` property~~ → **removed entirely** (D2)
+- ~~Key format of the `_ensure_allocators` dict~~ → **tuple `(sip, cube, pe)`** (D5)
+
+---
+
+## Consequences
+
+### Positive
+
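+- **Clear separation of concepts**: DPPolicy = intra-device, TP = inter-device.
+- **Simpler API**: DPPolicy constructor fields shrink by ~33%.
+- **Consistent structural coordinates**: ShardSpec is expressed as the `(sip, cube, pe)` tuple →
+  the abstraction leakage is resolved (fulfils the ADR-0024 D4 contract).
+- **No ambiguous `pe_index`**: the field is removed (D2), so no SIP-local vs global-flat
+  reading remains. A flat key, where needed, is computed explicitly at the call site.
+- **Consistent launcher model**: ADR-0024's "1 worker per SIP" model is the only mechanism
+  that controls SIP boundaries.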
+
+### Negative
+
+- **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`,
+  `spec.pe_index` → `AttributeError`. Every caller must be fixed in one pass.
+- **ShardSpec schema change**: the single `pe_index` field becomes three fields, `sip`/`cube`/`pe`.
+  Downstream code (`deploy_tensor`, `_free_tensor`, `_ensure_allocators`,
+  the `allocators` dict keys, etc.) needs cascading changes.
+- **No silent drift**: with the property fully removed, failures are immediate at runtime →
+  migration leakage is cut off at the source. (Not a negative but an explicit tradeoff.)
+- Cost of rewriting `test_sip_parallel.py`.
+
+### Neutral
+
+- 기존 `cube` / `pe` 필드 의미 불변.
diff --git a/docs/adr-ko/ADR-0027-par-megatron-tp.md b/docs/adr-ko/ADR-0027-par-megatron-tp.md
new file mode 100644
index 0000000..7b04254
--- /dev/null
+++ b/docs/adr-ko/ADR-0027-par-megatron-tp.md
@@ -0,0 +1,888 @@
+# ADR-0027: Megatron-style Tensor Parallelism API
+
+## Status
+
+Accepted
+
+## Context
+
+### Goal
+
+Support tensor parallelism (TP) across SIPs through an API of **explicit
+parallel layers in the Megatron-LM style**. A declarative abstraction such as
+DTensor is future work in a separate ADR (0028).
+
+Why Megatron-style:
+- TP happens at specific layer boundaries of the model, so explicit
+  primitives fit the mental model naturally.
+- It is the industry standard established by NVIDIA Megatron and DeepSpeed.
+- DTensor is declarative and has a larger design space, so it is staged later.
+
+### TP primitive spec (following Megatron-LM)
+
+- **ColumnParallelLinear**: shards the **column (out_features)** axis of the
+  weight across TP ranks. Input is fully replicated, output is
+  column-sharded. No forward all-reduce when a RowParallelLinear follows.
+- **RowParallelLinear**: shards the **row (in_features)** axis of the weight
+  across TP ranks. The input is already column-sharded (the output of
+  ColumnParallel). An **all-reduce** is required at the end of forward.
+- **VocabParallelEmbedding**: shards the embedding along the vocab axis, with
+  an all-reduce at the end of forward. (A stub in the initial scope; the real
+  implementation needs an all-gather kernel first.)
+- **`copy_to_tp_region`**, **`reduce_from_tp_region`**, **`scatter_to_tp_region`**,
+  **`gather_from_tp_region`**: the basic primitives.
+
+### Problems to solve
+
+1. **Generalizing worker-wait (D0)**: extend the defer/yield/drain pattern of
+   `dist.all_reduce` to every `ctx.wait` path. **This is the largest
+   architectural decision in this ADR.**
+
+2. **Normalizing the launcher API (D1)**: current benches use a hand-rolled
+   greenlet loop. Absorbing it into
+   `torch.multiprocessing.spawn(fn, args, nprocs)` keeps the real-PyTorch API
+   surface and concentrates the D0 scheduler drain in a single place.
+
+3. **Expressing per-rank weight sharding**: each worker owns its own slice of
+   the weight tensor. This is expressed naturally with `set_device(rank)`
+   from ADR-0024 plus the intra-device DPPolicy from ADR-0026.
+
+4. **Forward-only scope**: KernBench currently has no backward pass (its
+   purpose is simulation). This ADR supports **forward only** first.
+   Training simulation is a separate ADR.
+
+5. **Where collectives are called**: RowParallelLinear calls `all_reduce` at
+   the end of forward. This works naturally on top of the multi-greenlet
+   structure of ADR-0024 and the D0 generalization.
+
+6. **The TP group concept**: Megatron combines DP × TP × PP groups. The
+   initial scope simplifies this to **TP group = all SIPs**. Mixed DP+TP is
+   future work.
+
+---
+
+## Decision
+
+### D0. Generalizing worker-wait: `ctx.wait` defers to main in worker context
+
+**Restating the problem.** `kernel_runner.run` captures
+`greenlet.getcurrent()` at spawn time as the `_parent` of the kernel greenlet
+([kernel_runner.py:94](src/kernbench/triton_emu/kernel_runner.py#L94)).
+When `env.run` runs in the main context, parent=main and this is safe. When
+`env.run` runs in a worker context, parent=worker, and the moment the worker
+yields or finishes the kernel greenlet is orphaned → `GreenletExit` → the
+`ring_default_ws` failure of ADR-0024 Phase B.
+
+**Solution.** When a worker greenlet calls `ctx.wait(h)`, it **yields to the
+main scheduler** instead of driving `env.run` itself. Main drives `env.run`,
+and once the handle completes, control returns to the worker.
+
+#### D0.1 `RuntimeContext` extension
+
+```python
+# context.py
+@dataclass
+class RuntimeContext:
+    ...
+    _pending_worker_waits: list[RequestHandle] = field(default_factory=list, init=False)
+```
+
+#### D0.2 Worker fork in `ctx.wait`
+
+```python
+def wait(self, handle, *, _meta=None):
+    # Fast-path: already completed — skip enqueue + switch (consistent with
+    # D0.4-(3) idempotency). Avoids needless worker→main→worker round-trip
+    # and prevents redundant _pending_worker_waits growth.
+    if handle in self._completed:
+        completion, _trace = self.engine.get_completion(handle)
+        return completion
+
+    from greenlet import getcurrent
+    g = getcurrent()
+    if g.parent is not None and not g.parent.dead:
+        # Worker greenlet: defer to main. Push handle, yield to parent.
+        # Parent (scheduler loop) drains env.run, then switches back.
+        self._pending_worker_waits.append(handle)
+        g.parent.switch()
+        # On resume: handle must have completed (main drained the list).
+        # Fall through to the status-quo completion/trace assembly.
+
+    # Main context (or single-driver): drive engine directly.
+    wait_fn = getattr(self.engine, "wait", None)
+    if wait_fn is not None:
+        wait_fn(handle)
+    completion, trace = self.engine.get_completion(handle)
+    self._completed.add(handle)
+    if _meta is not None and trace is not None:
+        entry = dict(trace) if isinstance(trace, dict) else {"raw": trace}
+        entry.update(_meta)
+        self._traces.append(entry)
+    return completion
+```
+
+#### D0.3 Semantic contract of `ctx.wait` in worker context (normative)
+
+This ADR **explicitly changes** the semantics of `ctx.wait` in worker context.
+
+- **Submit vs. complete are separated**: when called from a worker,
+  `ctx.wait(h)` guarantees completion "**after the next scheduler drain**",
+  not "immediately". The point at which the worker returns from `wait()` is
+  the point at which main has finished `engine.wait` for that handle. Calls
+  from the main context remain immediate and synchronous (status quo).
+- **Resume invariant (normative)**: in a worker-deferred `ctx.wait(h)`, at
+  the point where `g.parent.switch()` returns and the worker resumes,
+  **`h in ctx._completed` must be True**. If this invariant breaks, the
+  worker proceeds from stale state, so any change to `_drain_pending`, the
+  scheduler loop, or `ctx.wait` must preserve it. T3.b asserts this
+  invariant directly.
+- **Observable change**: the pattern `h = ctx.submit(msg); ctx.wait(h);
+  read(handle_result)` inside a worker still holds. The semantic spec now
+  includes the fact that a main-drain is inserted automatically between
+  `wait()` and `read`.
+- **Direct reads of host objects are covered by D0.5**: the contract for
+  calling `tensor.numpy()` without `ctx.wait` is specified separately there.
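
The defer-then-resume contract of D0.3 can be sketched with plain Python generators standing in for greenlets (an assumption made so the example runs on the standard library alone; `yield` plays the role of `g.parent.switch()`). `ToyContext`, `drain`, `worker`, and `run` are hypothetical names for this illustration, not KernBench APIs.

```python
from collections import deque


class ToyContext:
    """Hypothetical stand-in for RuntimeContext."""

    def __init__(self):
        self.pending = deque()   # plays the role of _pending_worker_waits
        self.completed = set()   # plays the role of _completed
        self.log = []

    def wait(self, handle):
        """Worker-side wait: enqueue, yield to the scheduler, check on resume."""
        if handle not in self.completed:   # fast path skips the round trip
            self.pending.append(handle)
            yield                          # stands in for g.parent.switch()
        # Resume invariant: main must have completed the handle by now.
        assert handle in self.completed, "resume invariant violated"
        self.log.append(("resumed", handle))


def drain(ctx):
    """Main-side drain: FIFO, idempotent, records completion."""
    while ctx.pending:
        h = ctx.pending.popleft()
        if h not in ctx.completed:
            ctx.log.append(("engine.wait", h))   # main would drive env.run here
            ctx.completed.add(h)


def worker(ctx, rank):
    yield from ctx.wait(f"h{rank}-a")
    yield from ctx.wait(f"h{rank}-b")


def run(nprocs=2):
    ctx = ToyContext()
    alive = [worker(ctx, r) for r in range(nprocs)]
    while alive:
        still_alive = []
        for g in alive:                  # one switch per worker per round
            try:
                next(g)
                still_alive.append(g)
            except StopIteration:
                pass
        alive = still_alive
        drain(ctx)                       # drain only after every worker yielded
    return ctx.log
```

In this toy model every `("resumed", h)` entry appears after the matching `("engine.wait", h)` entry, which is the ordering the resume invariant requires.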
+
+#### D0.4 Main scheduler drain: rules (normative)
+
+(This is the internal implementation of `multiprocessing.spawn` in D1. What
+follows defines the semantics.)
+
+```python
+while alive:
+    for g in alive:              # (1) round-based worker switch
+        g.switch()
+    _drain_pending(ctx)           # (2) drain in main context
+```
+
+(See D0.5 for the actual definition of `_drain_pending`: an outer while-loop
+drains until both queues are empty.)
+
+**Rules**:
+
+1. **Round-based cooperative scheduling and the duty to yield (worker
+   contract)**. `g.switch()` does not return until the worker **yields
+   voluntarily** (cooperative greenlet semantics). Therefore:
+   - If a worker runs a pure-compute loop such as `while True: do_compute()`
+     without yielding, `g.switch()` never returns and **the scheduler loop
+     itself is hard-blocked** (other workers get no chance to switch, and no
+     drain happens). This is not starvation but **scheduler non-progress
+     (equivalent to deadlock)**, and this ADR defines it as **unsupported**.
+   - A worker **must** call one of `ctx.wait(h)`, `dist.all_reduce`, or a
+     host-read barrier (D0.5) within a finite number of steps. The `forward`
+     of a TP layer contains a launch→wait pair at the end of every layer, so
+     it satisfies this condition naturally. CCL kernels also yield inside
+     `dist.all_reduce`.
+   - The implementation does **not** need to detect violations (timeouts,
+     steps-since-yield counters, and so on). This is a user contract, and
+     the symptom of violating it is a "simulation hang".
+   - **Future extension**: if long non-collective compute paths become
+     common, an explicit `torch.distributed.cooperative_yield()` primitive
+     (a no-op yield) could be introduced. Out of scope for this ADR. It is
+     not a breaking change and can be added when needed.
+   - Within a round, every alive worker receives one `switch`. If a worker
+     calls wait several times within a single round, the handles are
+     enqueued in order during that turn and then processed together in one
+     scheduler drain (FIFO).
+
+2. **Drain order = submission order (FIFO)**. `_pending_worker_waits` is a
+   strict FIFO using list append/pop(0). Handles are drained in submission
+   order, not completion order. Because the SimPy scheduler itself
+   guarantees a causally correct completion order, draining in submission
+   order is safe. Do not confuse `completion order` with `drain order`.
+
+   **Two-queue ordering (worker waits → collectives)**: `_drain_pending`
+   drains the worker-wait queue first and the collective queue second.
+   Rationale:
+   - **The two queues have different dependency sources**: worker waits are
+     handles the worker created itself through a `submit + wait` pair
+     (tensor deploy, MmuMap, and so on). The collective queue holds kernel
+     launch handles that `dist.all_reduce` enqueued internally, and the
+     worker does not wait on them directly (see the two-queue drain model
+     in D0.5).
+   - **Independent with respect to correctness**: from the worker's point
+     of view, a collective is "already submitted, then yielded". It only
+     has to complete before the worker's next action. There is no ordering
+     dependency with the worker-wait queue.
+   - **Both complete within a single drain barrier**: under the
+     loop-until-empty rule of D0.5, one barrier invocation drains
+     worker → collective → (repeat if anything new appeared). When the
+     worker resumes, both queues are drained.
+   - **The alternative (collectives first) also works**: this ADR fixes
+     worker-first only for implementation simplicity; the two orders are
+     semantically equivalent. Revisit if a performance difference is
+     observed.
+
+3. **Duplicate enqueue: correctness relies on idempotent drain, dedup is
+   not guaranteed**. `ctx.wait(h)` returns immediately if
+   `h in ctx._completed`, and `_drain_pending` applies the same guard. Even
+   if the same handle is appended to `_pending_worker_waits` several times,
+   `engine.wait` is actually called only once (idempotent).
+   - **Correctness**: relies on idempotent drain → safe.
+   - **Memory/performance**: this ADR **does not guarantee dedup** of
+     `_pending_worker_waits`. If the same handle is enqueued N times, the
+     queue holds N elements and the drain runs N pops plus the in-set
+     guard. Unless a single worker waits on the same handle repeatedly (an
+     abnormal pattern), N stays between 1 and a handful.
+   - **Implementation freedom**: an implementation may optionally dedup
+     (for example, keep a `set` as a side index, or check
+     `h not in pending_set` before append). This is classified as an
+     optimization that does not change correctness.
+
+4. **Exception propagation + sibling cleanup**.
+   When a worker greenlet raises, `g.switch()` propagates the exception to
+   main. The scheduler loop stops immediately and performs the following
+   cleanup **explicitly**:
+
+   ```python
+   try:
+       while True:
+           alive = [g for g in gs if not g.dead]
+           if not alive:
+               break
+           for g in alive:
+               if not g.dead:
+                   g.switch()
+           _drain_pending(ctx)
+   except Exception as outer:
+       # (a) Force-terminate the surviving sibling worker greenlets.
+       for other in gs:
+           if not other.dead:
+               try:
+                   other.throw(SystemExit)
+               except BaseException:
+                   # SystemExit is not an Exception subclass; a greenlet that
+                   # dies with it re-raises it here in main, so catch
+                   # BaseException. Silent: a failure is already in flight.
+                   pass
+       # (b) Reset backend barrier / pending state (prepares for a future epoch barrier).
+       backend = getattr(ctx.distributed, "_backend", None)
+       if backend is not None and hasattr(backend, "_barrier"):
+           backend._barrier.reset()
+       backend_pending = getattr(backend, "_pending_collective_handles", None)
+       if backend_pending is not None:
+           backend_pending.clear()
+       ctx._pending_worker_waits.clear()
+       # (c) Wrap the root-cause exception in SpawnException.
+       #     `errors` is the per-rank dict filled by the entry wrapper (D1.2).
+       raise SpawnException(errors) from outer
+   ```
+
+   Rules:
+   - **Sibling abort is guaranteed**: if one worker raises, `SystemExit` is
+     thrown into every sibling greenlet, which terminates it immediately.
+     No greenlet leaks.
+   - **Pending queues are cleared explicitly**: both the worker-wait queue
+     and the collective-pending queue are emptied, to avoid contamination
+     on reuse.
+   - **Wrapped in `SpawnException(errors)`**: `errors: dict[int, Exception]`
+     holds the original exception of each rank. Compatible with the
+     failure pattern of real-PyTorch `torch.multiprocessing.spawn`.
+     - **Scope limit**: `errors` contains **only ranks that raised from
+       their own code (the root cause)**. Ranks terminated by
+       `throw(SystemExit)` during sibling cleanup do not appear in `errors`
+       (SystemExit is not caught by the `try/except Exception` in the D1.2
+       entry wrapper, by design: sibling termination is a cleanup signal,
+       not a failure). Stated explicitly so readers do not expect "every
+       failed rank" to be present.
+   - **`ctx._traces` is a partial state up to the point of the exception**.
+     Trace completeness is not guaranteed (some launch/all_reduce calls may
+     terminate without leaving an entry).
+   - **Allocator / MemoryStore** keep their pre-exception state. Reuse is a
+     non-goal; creating a new `RuntimeContext` is recommended.
+   - **`join=False` / retry / partial recovery** are non-goals of this ADR.
+
+   `SpawnException` is defined in `runtime_api/multiprocessing.py`:
+
+   ```python
+   class SpawnException(RuntimeError):
+       def __init__(self, errors: dict[int, Exception]):
+           self.errors = errors
+           first = next(iter(errors.items()), None)
+           msg = (f"spawn failed on ranks {sorted(errors.keys())}"
+                  + (f": rank {first[0]} raised {first[1]!r}" if first else ""))
+           super().__init__(msg)
+   ```
+
+5. **Single-driver compatibility**. In main-only execution where
+   `g.parent is None` (legacy single-driver tests), the worker-fork
+   condition of D0.2 is false → the existing immediate-synchronous path is
+   kept. `_drain_pending` is not called.
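
Rule 4 above (root-cause-only `errors`, siblings terminated by a cleanup signal) can be sketched as follows. This is a toy model under stated assumptions: generators replace greenlets, `g.close()` (which raises `GeneratorExit`, a `BaseException`) replaces `throw(SystemExit)`, and `spawn` / `flaky` are hypothetical names written for this example. The local `SpawnException` is a simplified copy, not the class from `runtime_api/multiprocessing.py`.

```python
class SpawnException(RuntimeError):
    """Simplified local copy for this sketch."""

    def __init__(self, errors):
        self.errors = errors
        super().__init__(f"spawn failed on ranks {sorted(errors)}")


def spawn(fn, nprocs):
    errors = {}

    def entry(rank):
        try:
            yield from fn(rank)
        except Exception as exc:   # GeneratorExit is not an Exception subclass,
            errors[rank] = exc     # so cleanup-terminated ranks are not recorded
            raise

    gens = [entry(r) for r in range(nprocs)]
    alive = list(gens)
    try:
        while alive:
            still_alive = []
            for g in alive:
                try:
                    next(g)
                    still_alive.append(g)
                except StopIteration:
                    pass
            alive = still_alive
    except Exception as outer:
        for g in gens:
            g.close()              # cleanup signal for surviving siblings
        raise SpawnException(errors) from outer


def flaky(rank):
    yield                          # first cooperative yield
    if rank == 1:
        raise ValueError("boom")   # root cause on rank 1 only
    yield
```

Calling `spawn(flaky, 3)` raises `SpawnException` whose `errors` holds only rank 1; ranks 0 and 2 are closed during cleanup and do not appear.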
+
+#### D0.5 Host-read barrier: decision (normative)
+
+Inside a worker, **host-observable reads** such as `tensor.numpy()`,
+`tensor.__getitem__`, and `tensor.data` are defined as an **automatic drain
+barrier**. Immediately before the call:
+
+1. If `ctx._pending_worker_waits` or `backend._pending_collective_handles`
+   is non-empty, yield to main via `g.parent.switch()` → main runs
+   `_drain_pending` → the worker resumes after completion.
+2. If both queues are empty, read immediately.
+
+**Barrier repetition rule (normative, re-entrance)**: `_drain_pending` uses
+a while-loop to drain **until both queues are completely empty**. It is not
+a single pass:
+
+```python
+def _drain_pending(ctx):
+    while ctx._pending_worker_waits or (
+        ctx.distributed._backend
+        and ctx.distributed._backend._pending_collective_handles
+    ):
+        while ctx._pending_worker_waits:
+            h = ctx._pending_worker_waits.pop(0)
+            if h not in ctx._completed:
+                ctx.engine.wait(h)
+                # Record completion so the D0.3 resume invariant
+                # (`h in ctx._completed`) holds and duplicates are skipped.
+                ctx._completed.add(h)
+        backend = ctx.distributed._backend
+        if backend is not None:
+            while backend._pending_collective_handles:
+                h, _sip_id, meta = backend._pending_collective_handles.pop(0)
+                ctx.wait(h, _meta=meta)  # main context: safe; ctx.wait does
+                                          # not push to pending again
+```
+
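+**Main-context `ctx.wait` non-recursion invariant (normative)**: the
+`ctx.wait(h, _meta=meta)` call inside `_drain_pending` runs in the main
+greenlet context. The worker-fork condition of D0.2
+(`g.parent is not None and not g.parent.dead`) is False there, so it takes
+the immediate-synchronous path → **it never enqueues into
+`_pending_worker_waits`**. Thanks to this invariant the drain loop finishes
+without recursion or queue regrowth. Implementations must keep
+`g.parent is None` as the guarantee that identifies the single main greenlet.
+
+**Why a loop**: `ctx.wait(h, _meta=meta)` is called in the main context, so
+per D0.2 it **drives the engine directly** (no additional enqueue, per the
+invariant above). In theory a single pass is therefore enough, but the rule
+is fixed as **loop-until-empty**. Reasons:
+
+1. **Safety for future extensions**: a later implementation may enqueue new
+   pending handles during drain (for example, a tree-reduce whose
+   collective has sub-handles). The loop rule keeps correctness in that case.
+2. **Readability**: the meaning closes in a single sentence, "the barrier
+   drains until pending is empty". It does not rely on the non-trivial
+   invariant that `ctx.wait` performs no new enqueue.
+3. **The barrier's semantics are "every dependency needed by this read has
+   completed"**: in the current model all pending handles are exactly all
+   dependencies, so the two coincide. The user's mental model is the former.
+
+**Termination guarantee**: described separately for two regimes.
+
+- **Current implementation**: when called in the main context, `ctx.wait`
+  drives the engine directly (D0.2) → it enqueues no new pending handles.
+  Each iteration strictly shrinks the pending set through `pop(0)` +
+  `engine.wait`. The number of iterations is **bounded by the initial
+  pending size** → finite termination.
+- **Future extension (the bound that justifies the loop rule)**: if an
+  implementation that enqueues new pending handles during drain (for
+  example, tree-reduce sub-handles) is introduced, the initial-size bound
+  no longer holds. However, SimPy causality guarantees that handle
+  dependencies form a finite DAG, so the **nested depth is finite**. The
+  loop rule covers this case automatically.
+
+Both regimes rule out an infinite loop. The single-pass bound of the current
+implementation is only a reference value for aggressive optimization; the
+rule stays fixed as loop-until-empty.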
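
The loop-until-empty rule and its termination argument can be sketched with two toy queues. The scenario below deliberately models the hypothetical future extension (draining collective `c0` enqueues a sub-handle), which the current implementation does not do; `drain_pending` and `demo` are illustrative names, not the KernBench `_drain_pending`.

```python
from collections import deque


def drain_pending(worker_waits, collectives, completed, engine_wait):
    """Drain both queues until both are empty; return the number of outer passes."""
    passes = 0
    while worker_waits or collectives:
        passes += 1
        while worker_waits:                 # worker waits first (FIFO)
            h = worker_waits.popleft()
            if h not in completed:          # idempotent guard for duplicates
                engine_wait(h)
                completed.add(h)
        while collectives:                  # collectives second (FIFO)
            h = collectives.popleft()
            if h not in completed:
                engine_wait(h)
                completed.add(h)
    return passes


def demo():
    worker_waits = deque(["w0", "w1", "w0"])   # "w0" enqueued twice on purpose
    collectives = deque(["c0"])
    completed = set()
    calls = []

    def engine_wait(h):
        calls.append(h)
        if h == "c0":
            # Hypothetical tree-reduce: completing c0 exposes a sub-handle.
            worker_waits.append("c0.sub")

    passes = drain_pending(worker_waits, collectives, completed, engine_wait)
    return passes, calls
```

A single-pass drain would leave `c0.sub` pending here; the outer loop picks it up in a second pass, and the duplicate `w0` costs one extra pop but no second `engine_wait` call.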
+
+**Why implicit drain at read is the right choice**:
+
+- The earlier open question was a choice between (a) implicit drain and
+  (b) an explicit barrier. (b) is clear, but it burdens TP layer users with
+  writing the 3-step `out = fc1.forward(x); ctx.drain(); result =
+  out.numpy()` every time. (a) is a single rule, "a read is guaranteed to
+  see the reflected value", the same pattern as calling
+  `cudaDeviceSynchronize` before a host copy in CUDA. It is not a hidden
+  rule but the **contract of named entry points**.
+- This ADR adopts (a) and **explicitly closes** the list of entry points:
+  `Tensor.numpy()`, `Tensor.data` (numpy alias), `Tensor.__getitem__`,
+  `Tensor.__repr__` (when it includes data). Any other official host-read
+  API is determined by searching the codebase when this ADR is
+  implemented. Any host-read API added later must follow this contract
+  (tests guard against regression).
+- The drain barrier also runs when `numpy` is called directly after
+  `ctx.submit` without a `wait` (because the handle is in the pending
+  queue). Even if the user omits the explicit wait, the invariant is
+  restored at read time.
+
+**`Tensor.copy_(source)`: write barrier rule**:
+
+`copy_` is semantically a "write to target", but internally it calls
+`source.numpy()` to fetch the source data on the host and then writes each
+shard with `target._memory_store.write(...)`. Both directions are treated
+as barriers:
+
+1. **Source side (read barrier)**: `source.numpy()` triggers the D0.5 read
+   barrier (when the source itself is a deployed tensor and there is
+   pending work).
+2. **Target side (write barrier, based on global pending)**: on entry to
+   `copy_`, if `ctx._pending_worker_waits` or
+   `backend._pending_collective_handles` is non-empty, drain via
+   `g.parent.switch()` before writing. **This is based on the global
+   pending queues, not on per-tensor / per-shard dependency tracking**.
+   - Why global: KernBench's handle representation carries no
+     back-reference saying "this handle writes this shard of the target".
+     The safe, conservative rule is "drain if anything is pending
+     globally". As a result, **pending work on an unrelated tensor can also
+     block `copy_`**; the drop-in invariant takes priority.
+   - **Explicit tradeoff**: this rule can introduce unnecessary
+     serialization between independent tensors. Under the current
+     single-queue execution model the cost is acceptable: guaranteeing
+     cross-rank correctness and the "fresh at read time" invariant with a
+     simple rule takes priority.
+   - Practical impact: for a single worker, pending work within one layer
+     step is mostly its own, and the extra context switch from
+     over-barriering often coincides with the end-of-round scheduler
+     drain, so it is rarely a problem.
+   - Future refinement: per-tensor pending tracking could narrow this
+     rule, but that is out of scope for this ADR.
+
+**Non-barrier**:
+
+- **Metadata-only** accesses such as `tensor.shape`, `tensor.dtype`, and
+  `tensor.name` do not drain. They have no data dependency.
+- Raw address accessors such as `tensor.pa` and `tensor.va` do not drain
+  either (address only, not contents).
+
+**Official barrier entry points (closed set)**:
+
+| API | Kind | Rationale |
+|---|---|---|
+| `Tensor.numpy()` | read | host-observable copy |
+| `Tensor.data` | read | `numpy()` alias |
+| `Tensor.__getitem__` | read | shard-aligned read |
+| `Tensor.__repr__` (when it includes data) | read | debugging/log |
+| `Tensor.copy_(source)` | read + write | source read + target write |
+
+T5/T6 verify this contract directly.
+
+#### D0.6 Why the worker function API is unchanged (informative)
+
+- `torch.zeros(...)` is internally a `self.submit(msg)` + `self.wait(h)`
+  pair. Per D0.2/D0.3, `wait` defers to main automatically → it looks
+  synchronous from the outside but yields once.
+- `tensor.numpy()` is a host-read barrier per D0.5 → drain then read if
+  anything is pending, otherwise read immediately.
+- `dist.all_reduce` keeps using the existing `_defer_wait=True` +
+  `_pending_collective_handles` path. The D0.4 drain processes both queues
+  together.
+
+#### D0.7 Invariants
+
+- **The `_parent` of a kernel greenlet is always main**: because `env.run`
+  never runs in a worker context. (The core assertion of T3.)
+- **Cross-rank synchronization point**: drain happens only after every
+  worker has yielded → the kernels of all ranks advance together in one
+  round (a prerequisite for cross-rank IPCQ exchange).
+- **Single-driver compatibility**: D0.4-(5).
+
+### D1. `torch.multiprocessing.spawn(fn, args, nprocs)`
+
+Real-PyTorch API parity, plus the single implementation site for the D0
+scheduler loop.
+
+#### D1.0 API parity only, not execution parity (normative)
+
+The name `torch.multiprocessing.spawn` is limited to **API signature
+parity**. The actual execution model is a **cooperative greenlet
+scheduler** (a single Python process, a single OS thread, the round-robin
+drive of D0.4). The following are **properties this ADR does not provide**;
+real-PyTorch `torch.multiprocessing.spawn` guarantees them, and they are
+explicitly **non-goals** here:
+
+- Process isolation (an independent OS process per rank).
+- Independent address spaces (each rank has its own Python heap).
+- Failure isolation (a hard crash in one rank does not affect others).
+- OS-level scheduler fairness (preemptive time slicing between ranks).
+- Inter-process primitives such as `mp.Queue` and `mp.Lock`.
+
+Actual properties of this implementation:
+
+- Every rank is a greenlet inside the same Python process. Shared global
+  state is visible as-is (an intended simulation convenience).
+- A single thread under the GIL → no parallel execution. Only "logical
+  concurrency" is reproduced, through SimPy event ordering.
+- An unhandled exception in one worker → the whole simulation stops
+  (D0.4-(4)).
+
+**Caller obligations**: when porting a real-PyTorch multi-process sample to
+KernBench, logic that depends on process isolation (for example
+`os.getpid`, independent temporary files, signal handling) must be removed.
+The namespace name is kept for code portability; the semantics differ.
+
+#### D1.1 Public surface
+
+```python
+# runtime_api/multiprocessing.py (new)
+class _MultiprocessingNamespace:
+    def __init__(self, ctx):
+        self._ctx = ctx
+
+    def spawn(self, fn, args: tuple, nprocs: int, join: bool = True) -> None:
+        """Spawn `nprocs` worker greenlets, each calling fn(rank, *args).
+
+        Mirrors torch.multiprocessing.spawn signature (minus `daemon` and
+        `start_method`).
+        Drives the D0 scheduler loop until all workers finish.
+        """
+        ...
+```
+
+#### D1.2 Implementation
+
+```python
+def spawn(self, fn, args, nprocs, join=True):
+    from greenlet import greenlet
+    ctx = self._ctx
+    dist = ctx.distributed
+    gs: list[greenlet] = []
+    errors: dict[int, Exception] = {}
+    for rank in range(nprocs):
+        def _entry(r=rank):
+            try:
+                fn(r, *args)
+            except Exception as e:
+                errors[r] = e
+                raise
+        g = greenlet(_entry)
+        dist._bind_rank(g, rank)
+        gs.append(g)
+
+    try:
+        while True:
+            alive = [g for g in gs if not g.dead]
+            if not alive:
+                break
+            for g in alive:
+                if not g.dead:
+                    g.switch()
+            _drain_pending(ctx)       # D0.5
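+    except Exception as outer:
+        # Sibling cleanup per D0.4-(4)
+        for other in gs:
+            if not other.dead:
+                try:
+                    other.throw(SystemExit)
+                except BaseException:
+                    # SystemExit is not an Exception subclass; a greenlet
+                    # that dies with it re-raises it here in main.
+                    pass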
+        backend = getattr(dist, "_backend", None)
+        if backend is not None:
+            if hasattr(backend, "_barrier"):
+                backend._barrier.reset()
+            if getattr(backend, "_pending_collective_handles", None) is not None:
+                backend._pending_collective_handles.clear()
+        ctx._pending_worker_waits.clear()
+        raise SpawnException(errors) from outer
+    # `join=True` semantics: we already wait for all workers.
+```
+
+#### D1.3 `torch` namespace attach
+
+In `__post_init__` of `runtime_api/context.py`:
+```python
+self.multiprocessing = _MultiprocessingNamespace(self)
+```
+
+→ Bench code then calls `torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws)`.
+
+#### D1.4 Migrating existing benches
+
+The hand-rolled loop in `benches/ccl_allreduce.py` shrinks to a single
+`torch.multiprocessing.spawn` call. The existing matrix regression stays as
+is. `ring_default_ws`, currently xfail, is expected to turn into PASS thanks
+to D0 (workers no longer orphan kernel greenlets).
+
+### D2. New package `kernbench.tp`
+
+```
+src/kernbench/tp/
+    __init__.py          - public API re-exports
+    parallel_state.py    - TP group management (currently a single global group)
+    layers.py            - ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding
+    primitives.py        - copy/reduce/scatter/gather_to/from_tp_region
+    kernels.py           - gemm kernel launched by TP layers (reusable)
+    mappings.py          - forward identity/all_reduce, backward stub
+```
+
+### D3. `parallel_state` — TP group
+
+```python
+# parallel_state.py
+_TP_WORLD_SIZE = None
+
+def initialize_model_parallel(tensor_model_parallel_size: int) -> None:
+    """Initialize TP group. Must be called after dist.init_process_group."""
+    global _TP_WORLD_SIZE
+    from kernbench.runtime_api.distributed import get_dist  # or torch.distributed
+    dist = get_dist()
+    total = dist.get_world_size()
+    if tensor_model_parallel_size != total:
+        raise NotImplementedError(
+            "Only TP == world_size supported in initial scope"
+        )
+    _TP_WORLD_SIZE = tensor_model_parallel_size
+
+def get_tensor_model_parallel_world_size() -> int:
+    return _TP_WORLD_SIZE
+
+def get_tensor_model_parallel_rank() -> int:
+    from kernbench.runtime_api.distributed import get_dist
+    return get_dist().get_rank()         # ADR-0024 greenlet-local rank
+```
+
+Initial scope: TP size = world_size = topology SIP count. A pure TP model.
+
+### D4-pre. TP shard ownership vs DPPolicy: separation of roles (normative)
+
+The weight/output representation of TP layers separates two concepts clearly:
+
+| Concept | Decided by | Scope |
+|---|---|---|
+| **TP shard ownership** (which rank owns which slice of the weight) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D2/D3) | **cross-rank, cross-SIP** |
+| **Intra-rank placement** (how the owned slice is distributed over cube × PE inside the rank) | `DPPolicy(cube=..., pe=...)` (ADR-0026) | **inside one rank (within the SIP boundary)** |
+
+So when `ColumnParallelLinear` creates its weight with shape
+`(in_features, out_features // ws)` and assigns
+`DPPolicy(cube="column_wise", pe="column_wise")`:
+
+- The slice owned by **rank r** is columns [r * k_local, (r+1) * k_local)
+  of the weight. **`set_device(r)`** decides this (the rank lives on SIP r).
+- **Inside that slice**, the column-wise distribution over cube × PE is
+  decided by **DPPolicy**.
+
+The two axes are **independent**. If two ranks build their slices with the
+same DPPolicy, the slices live on different SIPs but the intra-SIP placement
+pattern is identical. Conversely, switching DPPolicy to
+`cube="replicate", pe="replicate"` keeps TP shard ownership and changes only
+the intra-rank placement.
+
+**Mistakes that blur this boundary** (forbidden by this ADR):
+
+- A "SIP axis" reappearing in DPPolicy (removed in ADR-0026).
+- A TP layer expressing cross-rank sharding with `DPPolicy` alone, without
+  `set_device` → indistinguishable from a vertical split inside a single rank.
+
+TP layers in this ADR always treat weight/output from the viewpoint "rank =
+SIP = owner of one slice, with DPPolicy handling intra-SIP distribution".
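
As a worked example of the ownership rule above, the column range owned by each rank follows directly from `k_local = out_features // ws`. The helper names below are hypothetical and exist only for this illustration; intra-rank placement (the DPPolicy axis) is intentionally not modelled.

```python
def tp_column_slice(rank, world_size, out_features):
    """Half-open column range [start, stop) owned by `rank` under column-parallel TP."""
    if out_features % world_size != 0:
        raise ValueError("out_features must be divisible by world_size")
    k_local = out_features // world_size
    return rank * k_local, (rank + 1) * k_local


def ownership_table(world_size, out_features):
    """Map every rank to its owned column range."""
    return {r: tp_column_slice(r, world_size, out_features)
            for r in range(world_size)}
```

For `world_size=4` and `out_features=2048` (the `D_hidden` used by the sample bench), each rank owns 512 contiguous columns, and the ranges tile the full axis without overlap.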
+
+### D4. `ColumnParallelLinear`
+
+**Important**: no new host-side `torch.matmul` abstraction is introduced.
+The layer's forward calls the existing gemm kernel through
+`torch.launch("gemm", gemm_kernel, ...)`, the pattern KernBench benches
+already use
+([benches/gemm_single_pe.py](benches/gemm_single_pe.py),
+[benches/gpt3_qkv.py](benches/gpt3_qkv.py)).
+
+```python
+# layers.py
+from kernbench.policy.placement.dp import DPPolicy
+from kernbench.tp.kernels import _gemm_kernel
+from kernbench.tp.parallel_state import (
+    get_tensor_model_parallel_rank,
+    get_tensor_model_parallel_world_size,
+)
+
+class ColumnParallelLinear:
+    """Shard the K (out_features) axis of the weight across TP ranks.
+
+    forward(x):
+        x: (M, N), full-replicated across ranks
+        W_k: (N, K / world_size), rank-local slice (lives on SIP r via set_device)
+        y_k = x @ W_k → (M, K / world_size), rank-local output
+
+    The output is column-sharded, the input form RowParallelLinear expects.
+    """
+
+    def __init__(self, in_features: int, out_features: int, bias: bool = False,
+                 dtype: str = "f16", torch=None):
+        ws = get_tensor_model_parallel_world_size()
+        assert out_features % ws == 0
+        self.in_features = in_features
+        self.k_local = out_features // ws
+        self._torch = torch
+        # Each rank owns its own slice, placed on SIP r by set_device(rank).
+        self.weight = torch.zeros(
+            (in_features, self.k_local), dtype=dtype,
+            dp=DPPolicy(cube="column_wise", pe="column_wise"),
+            name="col_parallel_w",
+        )
+        self.bias = None
+        if bias:
+            self.bias = torch.zeros(
+                (self.k_local,), dtype=dtype,
+                dp=DPPolicy(cube="replicate", pe="replicate"),
+                name="col_parallel_b",
+            )
+
+    def forward(self, x):
+        # x is full-replicated (guaranteed by the caller). Plain local gemm.
+        M = x.shape[0]
+        out = self._torch.empty(
+            (M, self.k_local), dtype=x.dtype,
+            dp=DPPolicy(cube="column_wise", pe="column_wise"),
+            name="col_parallel_out",
+        )
+        self._torch.launch(
+            "col_parallel_gemm", _gemm_kernel,
+            x, self.weight, out, M, self.in_features, self.k_local,
+        )
+        # Bias add is a separate kernel or a fused bias in a composite gemm.
+        # The initial scope only validates bias=False thoroughly.
+        return out
+```
+
+**Yield-safety contract (normative)**: `ColumnParallelLinear.forward`
+contains a kernel launch → internal `ctx.wait` pair in its single
+`torch.launch` call. This automatically satisfies the "worker yields within
+a finite number of steps" condition of D0.4-(1), so TP layer users do not
+need to insert yield patterns by hand.
+
+### D5. `RowParallelLinear`
+
+```python
+class RowParallelLinear:
+    """Shard the N (in_features) axis of the weight across TP ranks.
+
+    forward(x):
+        x: (M, N / world_size), rank-local slice (output of ColumnParallel)
+        W_k: (N / world_size, K) — rank-local slice
+        y_k = x @ W_k → (M, K) — partial sum on each rank
+        y = all_reduce(y_k, op="sum") → (M, K) on every rank
+    """
+
+    def __init__(self, in_features: int, out_features: int, bias: bool = False,
+                 dtype: str = "f16", torch=None):
+        ws = get_tensor_model_parallel_world_size()
+        assert in_features % ws == 0
+        self.n_local = in_features // ws
+        self.out_features = out_features
+        self._torch = torch
+        self.weight = torch.zeros(
+            (self.n_local, out_features), dtype=dtype,
+            dp=DPPolicy(cube="column_wise", pe="column_wise"),
+            name="row_parallel_w",
+        )
+        # Bias lives on rank 0 only (Megatron convention). Omitted in the initial scope.
+        self.bias = None
+
+    def forward(self, x):
+        M = x.shape[0]
+        y_partial = self._torch.empty(
+            (M, self.out_features), dtype=x.dtype,
+            dp=DPPolicy(cube="column_wise", pe="column_wise"),
+            name="row_parallel_partial",
+        )
+        self._torch.launch(
+            "row_parallel_gemm", _gemm_kernel,
+            x, self.weight, y_partial, M, self.n_local, self.out_features,
+        )
+        # Cross-rank reduce. dist.all_reduce from ADR-0024 works correctly
+        # under D0 + mp.spawn (the kernel parent stays main).
+        self._torch.distributed.all_reduce(y_partial, op="sum")
+        return y_partial
+```
+
+**Yield-safety contract (normative)**: `RowParallelLinear.forward` contains
+launch → internal wait followed by `all_reduce` (the defer + worker-yield
+pattern), so **at least 2 yields** per forward are guaranteed. The scheduler
+progress condition of D0.4-(1) is satisfied automatically. Every TP layer
+forward in this ADR keeps the invariant "contains at least one wait or
+collective and is therefore yield-safe"; TP primitives added later
+(VocabParallelEmbedding and so on) must honor the same contract.
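
The numerical claim behind D4 and D5 (a column-parallel layer followed by a row-parallel layer plus a sum all-reduce reproduces the unsharded result) can be checked with a small pure-Python sketch. It assumes no nonlinearity between the two layers, as in the sample bench, and uses nested lists in place of KernBench tensors; every function name here is hypothetical.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def col_slice(w, rank, ws):
    """Columns owned by `rank` (ColumnParallelLinear weight slice)."""
    k = len(w[0]) // ws
    return [row[rank * k:(rank + 1) * k] for row in w]


def row_slice(w, rank, ws):
    """Rows owned by `rank` (RowParallelLinear weight slice)."""
    n = len(w) // ws
    return w[rank * n:(rank + 1) * n]


def tp_mlp(x, w1, w2, ws):
    """Simulate every rank locally, then sum the partials (all_reduce, op=sum)."""
    partials = []
    for r in range(ws):
        h_r = matmul(x, col_slice(w1, r, ws))               # column-sharded hidden
        partials.append(matmul(h_r, row_slice(w2, r, ws)))  # per-rank partial sum
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)]
            for i in range(rows)]


def check_equivalence(ws):
    x = [[1, 2], [3, 4]]                      # (M=2, N=2)
    w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]         # (N=2, hidden=4)
    w2 = [[1, 2], [0, 1], [3, 0], [1, 1]]     # (hidden=4, K=2)
    return tp_mlp(x, w1, w2, ws) == matmul(matmul(x, w1), w2)
```

Integer inputs keep the comparison exact; with f16 data the same identity holds only up to rounding, which is why the acceptance tests speak of a numerical tolerance.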
+
+### D6. Primitive functions
+
+```python
+# primitives.py
+def copy_to_tp_region(x):
+    """Forward: identity. Backward: all-reduce. (Implemented when training is added.)"""
+    return x
+
+def reduce_from_tp_region(x, torch):
+    """Forward: all-reduce. Backward: identity."""
+    torch.distributed.all_reduce(x, op="sum")
+    return x
+
+def scatter_to_tp_region(x):
+    raise NotImplementedError(
+        "Phase 2: replaced by the user creating an already-sharded tensor"
+    )
+
+def gather_from_tp_region(x):
+    raise NotImplementedError(
+        "Phase 2: requires an all-gather kernel first (future)"
+    )
+```
+
+### D7. Sample bench: 2-layer MLP with TP
+
+```python
+# benches/tp_mlp.py (new)
+from kernbench.policy.placement.dp import DPPolicy
+import kernbench.tp as tp
+import numpy as np
+
+
+def worker(rank: int, world_size: int, torch):
+    torch.ahbm.set_device(rank)
+    tp.initialize_model_parallel(world_size)
+
+    B, D_in, D_hidden, D_out = 1, 512, 2048, 512
+    fc1 = tp.ColumnParallelLinear(D_in, D_hidden, torch=torch)
+    fc2 = tp.RowParallelLinear(D_hidden, D_out, torch=torch)
+
+    x = torch.zeros(
+        (B, D_in), dtype="f16",
+        dp=DPPolicy(cube="replicate", pe="replicate"),
+        name="x",
+    )
+    # init x with some pattern (e.g., constant)
+    x.copy_(torch.from_numpy(np.full((B, D_in), 0.1, dtype=np.float16)))
+
+    h = fc1.forward(x)      # column-sharded (B, D_hidden / ws)
+    y = fc2.forward(h)      # all-reduced (B, D_out) on every rank
+
+    # Only rank 0 prints / verifies the result.
+    if rank == 0:
+        result = y.numpy()
+        # With zero-init weights every value is 0; this scope only verifies
+        # that the run completes.
+        print(f"  tp_mlp: shape={result.shape}, mean={float(result.mean()):.4f}")
+
+
+def run(torch):
+    torch.distributed.init_process_group(backend="ahbm")
+    ws = torch.distributed.get_world_size()
+    # worker(rank, world_size, torch): pass torch through args as well.
+    torch.multiprocessing.spawn(worker, args=(ws, torch), nprocs=ws)
+```
+
+### D8. Non-functional: training is not supported
+
+This ADR is **inference/forward only**. Backward / gradient / optimizer are
+future work. This is natural because KernBench does not do training today.
+
+### D9. Initial scope constraints
+
+- TP size = world_size (no mixed DP+TP).
+- `scatter_to_tp_region` and `gather_from_tp_region` are unimplemented.
+- **Weights default to zero**. Proper init schemes (Xavier, Kaiming, and so
+  on) are future work. Tests, however, inject a deterministic non-zero
+  pattern with `tensor.copy_` to verify numerical correctness (T2/T6). In
+  short, operation is split into "production default = zero, verification
+  = deterministic non-zero".
+- Bias is omitted in the initial scope (Megatron's rank-0-only bias policy
+  is future work).
+- Pipeline parallelism is out of scope.
+- VocabParallelEmbedding needs all-gather first → stub only.
+
+### D10. Regression: lifting the `ring_default_ws` xfail (required acceptance)
+
+Thanks to D0 (worker-wait generalization) and D0.5 (host-read barrier),
+every worker-driven `ctx.wait` and host read is routed through the
+main-drain path → the cause of the kernel-greenlet orphan in ADR-0024
+Phase B disappears. Turning the strict-xfail `ring_default_ws` case of the
+existing matrix test into **PASS** after this ADR is implemented is
+included as a **required regression criterion**. The observable acceptance
+criteria are specified in **T7** (no deadlock, no GreenletExit, numerical
+tolerance, and so on).
+
+---
+
+## Dependencies
+
+- **ADR-0024** (launcher): rank = SIP, greenlet-local rank,
+  `torch.ahbm.set_device(rank)`.
+- **ADR-0026** (DPPolicy intra-device): expresses the per-rank slice of a
+  weight tensor.
+- **ADR-0023 / ADR-0025** (IPCQ): the basis for the `dist.all_reduce`
+  implementation.
+
+---
+
+## Non-goals
+
+- **Backward pass / training**: inference only. Training simulation belongs
+  in a separate ADR.
+- **Mixed parallelism (DP + TP + PP)**: pure TP only at first.
+- **Weight init schemes**: simple zero / debug pattern only.
+- **Fused ops**: Megatron's fused matmul+bias+gelu is a kernel-level concern.
+- **DTensor integration**: ADR-0028 (future).
+- **Host-side `torch.matmul` abstraction**: TP layers call the existing gemm
+  kernel via `torch.launch(gemm_kernel, ...)`. No new matmul host-op is
+  introduced.
+
+---
+
+## Open questions
+
+- **Location of `initialize_model_parallel`**: `kernbench.tp.initialize_model_parallel`
+  (current decision) vs. real PyTorch's `torch.distributed.init_device_mesh`.
+  It stays in the TP-specific module.
+- **Weight init**: this ADR uses zero. A debug pattern (e.g., identity) may
+  be needed for meaningful verification; add it in the Phase 1 tests if
+  required.
+- **Bias placement policy**: Megatron keeps the RowParallelLinear bias on
+  rank 0 only. The initial scope sidesteps this with bias=False.
+- **GEMM kernel location**: `kernbench.tp.kernels._gemm_kernel` vs. importing
+  from the existing `benches/gemm_single_pe.py`. TP must not depend on a
+  bench, so the kernel is duplicated inside tp. It can later move to a
+  shared `kernbench.kernels` package.
+
+**Resolved (open in earlier revisions)**:
+- ~~Drain timing when `tensor.numpy()` is called~~ → **decided in D0.5**: the
+  official host-read entry points (`numpy`, `data`, `__getitem__`, and
+  data-bearing `__repr__`) act as automatic drain barriers. Metadata-only
+  accessors are not barriers.
+
+---
+
+## Consequences
+
+### Positive
+
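+- **Easy porting of Megatron code**: the API matches real training code.
+- **TP benchmarks become possible**: studies of HW characteristics such as
+  scaling and communication-compute overlap.
+- **`ring_default_ws` xfail lifted**: as a by-product of D0, the ADR-0024
+  Phase B blocker is resolved.
+- **Unified scheduler loop**: introducing D1 (`mp.spawn`) removes the
+  hand-rolled loop. Subsequent collective/TP benches reuse the same pattern.
+- **Clearer DPPolicy semantics** (synergy with ADR-0026): TP layers serve as
+  the reference example of using only intra-device DPPolicy.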
+
+### Negative
+
+- Maintenance cost of a new module (`kernbench.tp`).
+- The initial scope is limited (pure TP only, forward only).
+- The D0 generalization changes the semantics of `ctx.wait`; compatibility
+  with single-driver tests must be verified explicitly (T7).
+
+### Neutral
+
+- Adds a pure upper layer on top of ADR-0024/0026. No impact on the
+  hardware simulation stack (except D0).
diff --git a/docs/adr-ko/ADR-0032-algo-intercube-allreduce.md b/docs/adr-ko/ADR-0032-algo-intercube-allreduce.md
new file mode 100644
index 0000000..bb6ba3c
--- /dev/null
+++ b/docs/adr-ko/ADR-0032-algo-intercube-allreduce.md
@@ -0,0 +1,256 @@
+# ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
+
+## Status
+
+Accepted (supersedes ADR-0029).
+
+## Context
+
+### Goal
+
+Define a single all-reduce algorithm that exploits the topology hierarchy:
+cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
+one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.
+
+### Why replace ADR-0029 (hierarchical 3-level)
+
+ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
+where every PE in the system participates. In practice this adds the
+intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast)
+without matching the common workload pattern where the tensor is sharded
+**per cube** (not per PE within a cube).
+
+Moreover, the hierarchical design required:
+- per-PE neighbor graph installation (`_build_pe_installs` multi-level)
+- multi-level topology schema (`hierarchical_3level`)
+- `all_pes` mapper + `multi_pe_sip_local` validator infrastructure
+
+The intercube algorithm below removes all of that: **pe0-only same-lane
+intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the
+root cube, then broadcast back. Simpler kernel, simpler wiring, same
+bandwidth characteristics for the common per-cube DP workload.
+
+### Current state
+
+- `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
+- `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip`
+- `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this
+  automatically at `init_process_group` time.
+- Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`,
+  `hierarchical_allreduce` modules and their tests are **removed**.
+
+---
+
+## Decision
+
+### D1. Algorithm structure — 5 phases
+
+For each SIP (launched concurrently by `mp.spawn`):
+
+```
+Phase 1 — Row reduce W → E (cube mesh, pe0 only):
+    col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
+
+Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
+    row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
+    holds the full SIP sum.
+
+Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
+    Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
+    selected by sip_topo_kind (from topology.yaml sips.topology).
+
+Phase 4 — Col broadcast S → N on rightmost column.
+
+Phase 5 — Row broadcast E → W across the cube mesh.
+```
+
+After all phases every cube's pe0 holds the global sum.
+
+The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
+(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
+across topologies; only phase 3 branches. Helper functions
+`_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the
+three exchange patterns.
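
The five phases can be checked with a small pure-Python model. This is only a sketch under simplifying assumptions, not the kernel: each cube holds a single scalar, cubes are numbered row-major (consistent with the root cube being 15 on a 4×4 mesh), and Phase 3 is collapsed into a plain sum over the root cubes instead of a ring/torus/mesh exchange. The name `intercube_allreduce_model` is hypothetical.

```python
def intercube_allreduce_model(values_per_sip, mesh_w=4, mesh_h=4):
    """Model the five phases; values_per_sip[s][cube] is one scalar per cube."""
    grids = [list(v) for v in values_per_sip]  # working copies, one per SIP
    root = (mesh_h - 1) * mesh_w + (mesh_w - 1)  # SE corner cube

    for g in grids:
        # Phase 1: row reduce W -> E; the rightmost column ends with each row sum.
        for row in range(mesh_h):
            for col in range(1, mesh_w):
                g[row * mesh_w + col] += g[row * mesh_w + col - 1]
        # Phase 2: column reduce N -> S on the rightmost column.
        for row in range(1, mesh_h):
            g[row * mesh_w + mesh_w - 1] += g[(row - 1) * mesh_w + mesh_w - 1]

    # Phase 3 (simplified assumption): inter-SIP exchange as a plain sum.
    total = sum(g[root] for g in grids)
    for g in grids:
        g[root] = total

    for g in grids:
        # Phase 4: column broadcast S -> N on the rightmost column.
        for row in range(mesh_h - 2, -1, -1):
            g[row * mesh_w + mesh_w - 1] = g[(row + 1) * mesh_w + mesh_w - 1]
        # Phase 5: row broadcast E -> W across the mesh.
        for row in range(mesh_h):
            for col in range(mesh_w - 2, -1, -1):
                g[row * mesh_w + col] = g[row * mesh_w + col + 1]
    return grids
```

For two SIPs holding `list(range(16))` and sixteen ones, every cube on both SIPs ends with 120 + 16 = 136.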
+
+### D2. Tensor layout (rank = SIP, per-worker)
+
+Per ADR-0024 rank = SIP at the process-group level. Each worker allocates
+its own cube-mesh-spanning tensor:
+
+```python
+dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
+tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
+```
+
+Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
+each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`.
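
As a worked instance of the addressing rule above, the following sketch lists the per-cube shard base addresses. The factor 2 is taken to be the f16 element size in bytes; `shard_addr` and the base pointer `0x1000` are illustrative assumptions, not names from the codebase.

```python
F16_BYTES = 2  # assumed: the "* 2" in the formula is the f16 element size


def shard_addr(t_ptr, cube_id, n_elem, elem_bytes=F16_BYTES):
    """Base address of the shard owned by pe0 of `cube_id`."""
    return t_ptr + cube_id * n_elem * elem_bytes


# 16 cubes, 8 elements per shard -> shards are 16 bytes apart.
addrs = [shard_addr(0x1000, cube, n_elem=8) for cube in range(16)]
```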
+
+### D3. SFR / IPCQ wiring — `configure_sfr_intercube_multisip`
+
+Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor
+tables for **every cube's pe0 across every SIP** — regardless of which
+cube is the root or which SIP topology is selected. Because the wiring is
+complete, a later kernel revision can elect the root cube at runtime (the
+current kernel hardcodes the SE corner; see Non-goals), and topology
+switches need no re-wiring.
+
+| Level | Direction labels | Scope |
+|---|---|---|
+| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
+| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per `sips.topology` |
+
+Inter-SIP directions use the `global_*` prefix to keep the namespace
+disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended
+with `global_E ↔ global_W` and `global_N ↔ global_S` so the reverse-
+direction resolver handles 2-SIP bidirectional rings correctly.
+
+Internally the function calls `install_ipcq` with:
+- `world_size = n_sips × n_cubes`
+- `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
+- A closure-captured `neighbors()` function that builds the map above.
+
+This `world_size` is internal to IPCQ wiring and does not leak to the
+process-group rank.
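
A minimal sketch of the intercube part of this wiring, assuming row-major cube numbering and that N means the smaller row index (which matches the N → S column reduce ending at cube 15). The helper `intercube_neighbors` is hypothetical; the real map is built by the closure inside `configure_sfr_intercube_multisip`, and the `global_*` inter-SIP links are omitted here.

```python
def rank_to_pe(n_sips, n_cubes):
    """IPCQ-internal rank list: pe0 of every cube on every SIP."""
    return [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]


def intercube_neighbors(cube, mesh_w=4, mesh_h=4):
    """N/S/E/W mesh neighbors of a cube, without wrap-around."""
    row, col = divmod(cube, mesh_w)
    out = {}
    if row > 0:
        out["N"] = cube - mesh_w
    if row < mesh_h - 1:
        out["S"] = cube + mesh_w
    if col < mesh_w - 1:
        out["E"] = cube + 1
    if col > 0:
        out["W"] = cube - 1
    return out
```

Corner cubes get two links and interior cubes four, so a 4×4 mesh installs 48 directed intercube links per SIP under these assumptions.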
+
+### D4. SIP topology — from `topology.yaml`
+
+```yaml
+system:
+  sips:
+    count: 2
+    topology: ring_1d       # or torus_2d, mesh_2d_no_wrap
+```
+
+- `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
+- `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on
+  `global_E/W` then col ring on `global_S/N`.
+- `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce +
+  broadcast per dimension.
+
+2D variants require `n_sips` to be a perfect square.
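
The perfect-square requirement can be sketched as a small derivation helper. The kind codes 0/1/2 follow D1; the function name `derive_sip_topo` and the choice of `(n_sips, 1)` as the ring dimensions are assumptions made for illustration only.

```python
import math

# Kind codes as listed in D1 (ring_1d, torus_2d, mesh_2d_no_wrap).
TOPO_NAME_TO_KIND = {"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2}


def derive_sip_topo(name, n_sips):
    """Return (sip_topo_kind, sip_topo_w, sip_topo_h) for a topology name."""
    kind = TOPO_NAME_TO_KIND[name]
    if kind == 0:
        # Assumed convention: a 1D ring is n_sips wide and 1 high.
        return kind, n_sips, 1
    side = math.isqrt(n_sips)
    if side * side != n_sips:
        raise ValueError(f"{name} requires a perfect-square n_sips, got {n_sips}")
    return kind, side, side
```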
+
+### D5. Process-group integration — `AhbmCCLBackend`
+
+At `init_process_group` time the backend:
+
+1. Loads `ccl.yaml` + `topology.yaml`.
+2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
+   `system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`.
+3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time
+   SFR wiring, mirrors NCCL communicator creation.
+
+At each `dist.all_reduce(tensor)` call:
+
+1. Resolves `kernel_fn` from `cfg["module"]`.
+2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from
+   `kernel_args(world_size, n_elem)`.
+3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
+   `sip_rank` is the current greenlet's bound rank.
+4. Launches with `_defer_wait=True`; the main scheduler drains pending
+   handles after all workers submit (per ADR-0027 D0.4).
+
+### D6. Config schema
+
+`ccl.yaml`:
+
+```yaml
+defaults:
+  algorithm: intercube_allreduce
+  buffer_kind: tcm
+  ...
+
+algorithms:
+  intercube_allreduce:
+    module: kernbench.ccl.algorithms.intercube_allreduce
+    topology: none
+    buffer_kind: tcm
+    n_elem: 8
+    root_cube: 15
+```
+
+`topology.yaml`:
+
+```yaml
+system:
+  sips:
+    count: 2
+    topology: ring_1d
+sip:
+  cube_mesh: { w: 4, h: 4 }
+```
+
+### D7. Algorithm module contract
+
+Modules loaded via `cfg["module"]` must export:
+
+| Name | Purpose |
+|---|---|
+| `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
+| `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) |
+| `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code |
+| `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |
+
+---
+
+## Dependencies
+
+- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
+- **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
+- **ADR-0025**: Address-based IPCQ direction matching; extended
+  `_OPPOSITE_DIR` with `global_*` pairs.
+- **ADR-0027**: Worker-wait / collective-pending drain in main scheduler.
+
+## Non-goals
+
+- **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the
+  workload for this algorithm is per-cube DP.
+- **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and
+  `mesh_2d_no_wrap` require `n_sips = k²`.
+- **Pipelined chunks**: single-tile per cube, no pipelining yet.
+- **Root cube runtime election**: the kernel currently uses
+  `root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
+  corner. SFR wiring covers all cubes, so runtime election is a pure kernel
+  change when needed.
+
+---
+
+## Consequences
+
+### Positive
+
+- **Single kernel, single install path** for all-reduce — replaces four
+  removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
+- **Topology-agnostic kernel**: ring / torus / mesh selected via one
+  integer param, no kernel duplication.
+- **Automatic via `dist.all_reduce`**: no bench-level or user-level
+  algorithm selection needed; config-driven end-to-end.
+- **Full SFR wiring**: every cube on every SIP has inter-SIP links
+  available — supports future dynamic root-cube election.
+
+### Negative
+
+- **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
+  shard within one cube across 8 PEs are not addressable by this kernel.
+  Such workloads would need a separate intra-cube all-reduce path (not
+  yet implemented).
+- **`configure_sfr_intercube_multisip` always wires all pe0s**: even if a
+  given run only needs a subset (e.g. 1 SIP, ring only). Install cost is
+  small but not zero.
+
+---
+
+## Affected files
+
+| File | Change |
+|---|---|
+| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
+| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
+| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
+| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
+| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
+| `ccl.yaml` | Single `intercube_allreduce` entry |
+| `topology.yaml` | Added `system.sips.topology` |
+| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
+| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
+| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
+| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
+| Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
diff --git a/docs/adr-ko/ADR-0033-lat-latency-model-assumptions.md b/docs/adr-ko/ADR-0033-lat-latency-model-assumptions.md
new file mode 100644
index 0000000..13ca1f9
--- /dev/null
+++ b/docs/adr-ko/ADR-0033-lat-latency-model-assumptions.md
@@ -0,0 +1,162 @@
+# ADR-0033 — Latency Model: Assumptions and Known Simplifications
+
+## Status
+
+Accepted
+
+## Context
+
+The simulator is an analytical, event-driven performance model — not a
+cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
+or omitted by design. To keep the model auditable and reviewable as a whole,
+this ADR consolidates the assumptions in one place. Individual component ADRs
+(ADR-0015, ADR-0017, ADR-0004) define the *mechanisms*; this document defines
+the *limits of fidelity*.
+
+## Decisions
+
+### D1. Modeled precisely
+
+- **Per-directed-edge BW occupancy** (FIFO serialization via `available_at`) —
+  ADR-0015 D2.
+- **Per-component switching/overhead latency** (`overhead_ns` attr).
+- **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array
+  with address-based PC selection (ADR-0034 D3). Burst granularity tunable
+  (`burst_bytes`, default 256B). Read and write share each PC's
+  `available_at` (real HW command bus is per-PC shared).
+- **HBM direction switching penalty mechanism**: per-PC last-direction
+  tracking + configurable `switch_penalty_ns`. Default 0 — see D2.
+- **Wire chunk-streaming (Phase 2c)**: each wire decomposes Transactions
+  with payload into `Flit` objects of `flit_bytes` (default = HBM
+  `burst_bytes` = 256B). The wire emits each flit individually after
+  `prop_ns + flit_nbytes/bw_gbs` so the link's bandwidth throttles
+  flit arrival rate per real-HW wormhole semantics.
+- **Separate Stores per directed edge** (Phase 2c key fix): the wire
+  is the *only* conduit between `src.out_ports[dst]` and
+  `dst.in_ports[src]`. Earlier the two were aliased to the same
+  `simpy.Store`; when the wire put a chunkified flit back, the
+  destination's `fan_in` could pull it before the wire applied
+  bandwidth delay, leaving half the flits bypassing the bottleneck.
+- **Flit-aware pass-through** (`TransitComponent`, `HbmCtrlComponent`):
+  forward each flit serially with per-transaction overhead applied
+  ONCE on the first-flit arrival (header decode model). Subsequent
+  flits pipeline through with no extra delay. Wormhole emerges
+  naturally across multi-hop paths.
+- **HBM CTRL per-flit PC commit**: each flit arriving at HBM CTRL
+  schedules a PC commit at `max(env.now, pc_avail[pc]) + chunk_time`,
+  with the `is_last` flit waiting for the last PC commit before
+  signaling `txn.done`.
+- **Non-flit-aware components (default) reassemble flits at
+  `_fan_in`** before the legacy `_forward_txn` path runs. This
+  preserves backward compatibility for components that have not yet
+  been migrated to flit-aware processing (e.g., `MCpuComponent`,
+  `IoCpuComponent` sub-txn generators). Such components reassemble
+  *once per leg boundary*, NOT per hop — multi-hop wormhole timing
+  through a chain of flit-aware routers is preserved.
+
+### D2. Approximated (with known directional error)
+
+| Effect | Real HW | Our model | Error direction |
+|--------|---------|-----------|----------------|
+| Router output port arbitration | Round-robin / weighted | Wire edge FIFO + serial worker | Fair when one txn per cycle; multi-stream sharing not modeled at flit level |
+| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Pessimistic for mixed R/W when alternations are dense — default `switch_penalty_ns = 0` assumes ideal scheduler amortizes |
+| Flit ↔ burst granularity | 32B flit < 256B burst | `flit_bytes = burst_bytes = 256B` | Sub-flit fine-grained timing noise; affects very small wire arbitration windows only |
+| Wire-level RR fairness | Per-cycle multi-flow arbitration on shared link | Single serial wire process per edge | Fair only when one transaction is in flight on a given edge at a time. Multi-stream concurrent traffic on the same edge serializes by FIFO order |
+
+### D3. Ignored (out of scope)
+
+- Bank-level row buffer conflict penalty (assume no conflicts — best case;
+  the model has no per-bank state within a PC, so same-bank reuse cannot be
+  detected).
+- HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
+  `burst_time = burst_bytes / pc_bw_gbs`).
+- Refresh, ECC, thermal throttling, power gating.
+- Clock domain crossings, PLL lock time.
+- Upstream backpressure due to downstream buffer occupancy (input ports use
+  unbounded `simpy.Store`).
+- Sub-flit cycle-level arbitration at routers (flit granularity is our
+  smallest unit).
+
+### D4. Workload sensitivity
+
+Workloads where the above simplifications meaningfully affect results:
+
+- **Random scatter/gather**: bank conflict ignored → model optimistic.
+- **Heavy mixed R/W traffic** (e.g., GEMM bias accumulation): HBM scheduler
+  absent. With default `switch_penalty_ns = 0` we assume ideal amortization;
+  setting it non-zero models pessimistic per-alternation cost.
+- **High concurrency (>10 active flows on one link)**: HoL blocking and VC
+  limits not modeled → model optimistic.
+- **Very small (sub-flit) transactions**: flit quantization noise.
+- **Concurrent multi-flow on a single wire**: wire is serial FIFO at the
+  flit level, so per-flow fairness within a single edge is not modeled.
+  Pre-edge merging (multiple sources arriving at a router and being
+  forwarded to the same downstream wire) is correctly modeled via the
+  flit-aware router's serial worker.
+
+### D5. Verification policy
+
+For workloads in D4, cross-check against real HW or a cycle-accurate
+simulator before drawing absolute-magnitude conclusions. The model remains
+accurate for **relative comparisons** within the modeled regime.
+
+### D6. Future work
+
+Note: multi-stream merging at routers IS modeled correctly — each
+in_port has its own fan_in process, all push to a shared inbox, and
+the router worker forwards in inbox FIFO order. Flits from different
+upstream streams naturally interleave at flit granularity. The items
+below are different concerns, ordered by expected workload impact.
+
+**Higher impact (workload accuracy gap)**:
+
+- [ ] **Bank-level conflict modeling** within a PC (opt-in via
+  `track_banks: true`). Currently we assume no same-bank reuse;
+  random scatter/gather workloads are optimistic here.
+- [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
+  from the design discussion). Default `switch_penalty_ns=0` is the
+  ideal-amortization stand-in; bursty mixed R/W workloads benefit
+  from explicit modeling.
+- [ ] **Backpressure** modeling for finite component buffers. Matters
+  at high concurrency / sustained saturation where buffer occupancy
+  causes upstream stalls.
+- [ ] **Op_log integration with chunk-streaming**: currently op_log
+  fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
+  GemmCmd, MathCmd) which are not chunkified. Integration would
+  require flit-aware components to also emit op_log start/end hooks
+  per transaction (start on first flit, end on is_last).
+
+**Lower impact (academic / specific use cases)**:
+
+- [ ] **Cycle-accurate router arbitration policies** (RR with
+  priorities, age, iSLIP). The FIFO inbox is already approximately
+  fair when flit arrival times differ slightly between streams (the
+  common case for similar-rate workloads). True impact appears only
+  for: (a) priority/QoS modeling, (b) per-stream tail latency
+  analysis under sustained saturation. Not critical for makespan or
+  average-latency studies.
+- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
+  cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
+  per 32B flit. Effect is small for most workloads (sub-flit timing
+  noise on small messages).
+
+## Consequences
+
+- Single review point for all model fidelity questions. Each future PR
+  touching latency must update the relevant section here.
+- Workload-specific magnitude error envelopes are explicit.
+- Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
+  enforces the ADR-0017 D8 invariant in code rather than relying on yaml
+  manual consistency.
+- Wire transfer time is charged once per bottleneck-link transit (Phase 2c
+  per-flit timing) rather than via terminal `drain_ns` injection. Single
+  transactions land at `drain + commit_time + small_overheads`; multi-hop
+  preserves wormhole pipelining; multi-stream merge correctly serializes
+  at the shared wire's FIFO.
+
+## Cross-references
+
+- ADR-0015 — component / port / wire model.
+- ADR-0017 — Cube NOC architecture and HBM connectivity.
+- ADR-0004 — memory semantics, local HBM.
+- ADR-0034 — HBM controller internal design.
diff --git a/docs/adr-ko/ADR-0034-dev-hbm-controller-internal-design.md b/docs/adr-ko/ADR-0034-dev-hbm-controller-internal-design.md
new file mode 100644
index 0000000..b7d3e8f
--- /dev/null
+++ b/docs/adr-ko/ADR-0034-dev-hbm-controller-internal-design.md
@@ -0,0 +1,271 @@
+# ADR-0034: HBM Controller Internal Design
+
+## Status
+
+Accepted
+
+## Context
+
+`HbmCtrlComponent` is the per-PE HBM partition endpoint at the leaf of
+the cube NOC. One instance is created per PE under the topology node
+`sip{S}.cube{C}.hbm_ctrl.pe{idx}` and attaches to that PE's router
+(ADR-0017 D4). The component models per-pseudo-channel (PC) scheduling,
+burst-granular commit timing, address-based PC selection, and response
+routing back to the requester.
+
+This ADR documents the component as currently implemented. ADR-0017 D4/D8
+defines *where* HBM CTRL attaches and *what* aggregate BW it must
+deliver. ADR-0033 D1/D2 defines *what fidelity* of HBM modelling is in
+scope. This ADR fills the gap between those two — the per-instance
+internal scheduling model.
+
+## Decision
+
+### D1. Role
+
+`HbmCtrlComponent` is a per-PE HBM partition endpoint. One instance per
+PE (default 8 per cube, set by `cube.memory_map.hbm_slices_per_cube`)
+attaches to that PE's router via the `peX.hbm` attachment list in
+`cube_mesh.yaml` (ADR-0017 D4). In the default n:1 channel mapping
+(ADR-0017 D8) the instance aggregates `channels_per_pe` pseudo-channels
+into one endpoint.
+
+The component models:
+
+- Per-PC scheduling (D2) with R/W command-bus sharing.
+- Address-based PC selection (D3).
+- Burst-granular commit timing (D4).
+- Flit-aware per-flit PC commit and async finalize (D5, D6).
+- Command-only Transaction handling for read-data drain (D7).
+- Response routing back to the requester (D8).
+
+It does not model:
+
+- Bank-level row-buffer conflicts, refresh, ECC, thermal throttling
+  (ADR-0033 D3).
+- Cross-PE HBM contention beyond its own router edge (handled by the
+  router mesh — ADR-0017 D3).
+- 1:1 channel mode (ADR-0017 D8 future work).
+
+### D2. Per-PC scheduling model
+
+Per-instance state initialised in `start()`:
+
+- `_pc_avail: list[float]` — earliest sim-time each PC is free; length
+  `num_pcs`, initial 0.0.
+- `_pc_last_dir: list["R"|"W"|None]` — direction of the last commit on
+  each PC, used for switch-penalty detection (D4); initial `None`.
+
+`num_pcs` and `burst_bytes` must each be a positive power of two so
+that address-based PC selection (D3) reduces to a shift-and-mask.
+
+Read and write requests share the same `_pc_avail` slot per PC — the
+real HW per-PC command bus is shared between read and write traffic, so
+issuing a write to PC k blocks a subsequent read to PC k by exactly the
+burst time.
+
+Direction `dir` for a request is inferred from the request type:
+
+- `MemoryWriteMsg` → `"W"`.
+- `PeDmaMsg` with `is_write=True` → `"W"`.
+- All others (`MemoryReadMsg`, `PeDmaMsg` read) → `"R"`.
+
+### D3. Address-based PC selection
+
+PC index for an access is derived from the access address by shift and
+mask:
+
+```text
+pc_shift = log2(burst_bytes)         # default 8  (burst=256B)
+pc_mask  = num_pcs - 1               # default 7  (8 PCs)
+pc       = (address >> pc_shift) & pc_mask
+```
+
+Computed once in `start()` from topology config so alternative
+`(burst_bytes, num_pcs)` pairs stay consistent. For the canonical
+default `(256, 8)` this places the PC select field at bits `[10:8]` of
+the HBM byte offset: bits `[7:0]` are within-burst (same PC), bits
+`[10:8]` are the 3-bit PC index, bits `[36:11]` are row/bank/column
+within the PC slice (see `phyaddr.py` comment).
+
+Address-based striping — as opposed to address-blind global
+round-robin — preserves PC parallelism for offset-disjoint concurrent
+transfers: each transfer's bursts land deterministically on the PC set
+implied by its byte addresses, so multi-PE workloads accessing disjoint
+regions do not collide on a single PC.
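
The shift-and-mask rule can be sketched in a few lines. `make_pc_selector` is a hypothetical stand-in for the component's `_pc_for_address` setup, written only to show the arithmetic and the power-of-two precondition from D2.

```python
def make_pc_selector(burst_bytes=256, num_pcs=8):
    """Build an address -> PC index function from (burst_bytes, num_pcs)."""
    for name, value in (("burst_bytes", burst_bytes), ("num_pcs", num_pcs)):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a positive power of two")
    pc_shift = burst_bytes.bit_length() - 1  # log2(burst_bytes)
    pc_mask = num_pcs - 1
    return lambda address: (address >> pc_shift) & pc_mask


pc_for_address = make_pc_selector()
# Consecutive 256B bursts walk PCs 0..7 and then wrap.
stripe = [pc_for_address(i * 256) for i in range(10)]
```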
+
+### D4. Burst granularity and PC commit timing
+
+A single PC commit takes:
+
+```text
+chunk_time = burst_bytes / pc_bw_gbs    # ns
+```
+
+- `burst_bytes` (default 256) is the burst granularity matching the
+  flit size (ADR-0033 D1).
+- `pc_bw_gbs` is **builder-derived** from
+  `hbm_to_router_bw_gbs / num_pcs` (`topology/builder.py`), enforcing
+  the ADR-0017 D8 invariant that aggregate per-PE BW equals the
+  router-to-HBM link BW.
+
+Per-PC commit scheduling for an arriving access on PC `pc` with
+direction `dir`:
+
+```text
+switch_cost = switch_penalty_ns
+              if pc_last_dir[pc] not in (None, dir) else 0
+start  = max(env.now, pc_avail[pc]) + switch_cost
+finish = start + chunk_time
+pc_avail[pc]    = finish
+pc_last_dir[pc] = dir
+```
+
+Default `switch_penalty_ns = 0` — Tier 0 assumption that an ideal HBM
+scheduler amortises R/W switching cost (ADR-0033 D2). Non-zero values
+model pessimistic per-alternation cost.
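
The commit rule above can be exercised outside the simulator with a small stand-alone model. `PcScheduler` is a hypothetical class that takes `now` as an explicit argument instead of reading `env.now`; it mirrors only the pseudocode in this section.

```python
class PcScheduler:
    """Stand-alone model of the per-PC commit schedule."""

    def __init__(self, num_pcs=8, burst_bytes=256, pc_bw_gbs=32.0,
                 switch_penalty_ns=0.0):
        self.chunk_time = burst_bytes / pc_bw_gbs  # ns per burst
        self.switch_penalty_ns = switch_penalty_ns
        self.pc_avail = [0.0] * num_pcs
        self.pc_last_dir = [None] * num_pcs

    def commit(self, now, pc, direction):
        """Schedule one burst on `pc`; return its finish time in ns."""
        switched = self.pc_last_dir[pc] not in (None, direction)
        switch_cost = self.switch_penalty_ns if switched else 0.0
        start = max(now, self.pc_avail[pc]) + switch_cost
        finish = start + self.chunk_time
        self.pc_avail[pc] = finish
        self.pc_last_dir[pc] = direction
        return finish
```

With a 2 ns switch penalty, a write followed by a read on the same PC finishes at 8 ns and 18 ns, while a read issued at the same time on a different PC still finishes at 8 ns.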
+
+### D5. Flit-aware per-flit PC commit (primary path)
+
+`_handle_flit` is the primary worker path. For each arriving `Flit`:
+
+1. On the **first** flit of a transaction (`tid = id(txn)` not in
+   `_txn_state`):
+   - Apply `overhead_ns` once via `run(env, nbytes)` — header decode
+     model, first-flit overhead pattern (ADR-0033 D1).
+   - Initialise `_txn_state[tid] = {"last_finish": env.now}`.
+2. Compute `pc = _pc_for_address(flit.address)` (D3).
+3. Apply the per-PC schedule (D4) using the request direction (D2).
+4. Update `state["last_finish"] = max(state["last_finish"], finish)`.
+5. If `flit.is_last`: pop `_txn_state[tid]` and spawn `_finalize_txn`
+   (D6).
+
+Per-flit address-aware commit is the mechanism that lets concurrent
+multi-PE traffic to disjoint HBM offsets pipeline through distinct PCs
+in parallel.
+
+### D6. Async finalize per transaction
+
+When a transaction's last flit has been scheduled, finalisation runs in
+a separately-spawned process:
+
+```python
+def _finalize_txn(env, txn, last_finish):
+    wait = last_finish - env.now
+    if wait > 0:
+        yield env.timeout(wait)
+    yield from _send_response(env, txn)
+```
+
+`_handle_flit` spawns this via `env.process(...)` and returns
+immediately, so the worker can pick up the next inbox message while the
+last PC commit drains.
+
+Without this split — i.e. if the worker itself did
+`yield env.timeout(wait)` — concurrent single-flit transactions whose
+addresses hit distinct PCs would still serialise at `chunk_time` each
+inside the worker, hiding the PC parallelism that D3 and D5 are
+designed to expose.
+
+### D7. Non-flit fallback for command-only transactions
+
+`_handle_txn` runs when the inbox delivers a `Transaction` rather than a
+`Flit`. This is the path for command-only requests that the wire does
+not chunk into flits — most notably `MemoryReadMsg` whose command txn
+carries `nbytes=0` (data drain is modelled at HBM CTRL post-processing,
+not as inbound flits).
+
+Procedure:
+
+1. `work_bytes = txn.nbytes if txn.nbytes > 0 else int(request.nbytes or 0)`
+   — for read commands, work is sized by the request.
+2. `n_chunks = ceil(work_bytes / burst_bytes)` if `work_bytes > 0` else
+   0.
+3. `chunk_interval = drain_ns / n_chunks` (when both > 0) — chunks are
+   scheduled over time at `drain/n_chunks` ns intervals to model the
+   bottleneck-link's data arrival rate (ADR-0033 D1 chunk-loop drain).
+4. Apply `run(env, txn.nbytes)` once for `overhead_ns`.
+5. For each chunk `i`, advance `chunk_interval` ns then apply the D4
+   schedule with `pc = _pc_for_address(base_address + i * burst_bytes)`.
+6. After scheduling all chunks, wait `last_finish - env.now` then call
+   `_send_response`.
+
+`_handle_txn` shares the same `_pc_avail` / `_pc_last_dir` state with
+`_handle_flit` — there is exactly one source of PC scheduling truth
+across both paths.
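
The chunk-loop procedure can be approximated without SimPy as follows. This is a simplified sketch: it uses fresh local PC state rather than the shared `_pc_avail`, and it leaves out `overhead_ns`, the switch penalty, and the response step. The function name `chunk_loop_finish` is hypothetical.

```python
import math


def chunk_loop_finish(work_bytes, drain_ns, base_address, now=0.0,
                      burst_bytes=256, num_pcs=8, pc_bw_gbs=32.0):
    """Return the time at which the last chunk's PC commit finishes."""
    n_chunks = math.ceil(work_bytes / burst_bytes) if work_bytes > 0 else 0
    chunk_interval = drain_ns / n_chunks if n_chunks > 0 and drain_ns > 0 else 0.0
    chunk_time = burst_bytes / pc_bw_gbs
    pc_shift = burst_bytes.bit_length() - 1
    pc_avail = [0.0] * num_pcs
    last_finish = now
    for i in range(n_chunks):
        now += chunk_interval  # data for chunk i arrives
        pc = ((base_address + i * burst_bytes) >> pc_shift) & (num_pcs - 1)
        finish = max(now, pc_avail[pc]) + chunk_time
        pc_avail[pc] = finish
        last_finish = max(last_finish, finish)
    return last_finish
```

For a 1024-byte read drained over 4 ns, the four chunks land on four distinct PCs and the last commit finishes at 12 ns; with a single PC the same request serialises and finishes at 33 ns.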
+
+### D8. Response routing
+
+`_send_response` dispatches on request type and path geometry:
+
+| Case | Trigger | Response |
+| --- | --- | --- |
+| PE_DMA | `isinstance(txn.request, PeDmaMsg)` | New reverse-path Transaction (`is_response=True`, `nbytes=0`), same `done` |
+| Bypass (Memory Read) | no hop in `txn.path` contains `"m_cpu"` AND request is `MemoryReadMsg` | Reverse-path Transaction with `nbytes=request.nbytes` (data return) |
+| Bypass (Memory Write) | no hop in `txn.path` contains `"m_cpu"` AND request is not `MemoryReadMsg` | `txn.done.succeed()` (write completes locally) |
+| Default | otherwise | New `ResponseMsg(correlation_id, request_id, src_cube, src_pe, success=True)` on reverse path |
+
+The "bypass" classification matches the Memory R/W fabric path defined
+in ADR-0015 D4 (PCIE_EP → io_noc → ucie → cube router → hbm_ctrl,
+without M_CPU). The PE_DMA case is its own dedicated reverse-path to
+keep the inner-loop DMA fast (PE_DMA reads/writes do not synthesise a
+ResponseMsg envelope).
+
+In all reverse-path cases, the response Transaction is put onto
+`out_ports[reverse_path[1]]` — the first hop back along the recorded
+forward path. If `reverse_path` has fewer than 2 entries (degenerate
+path), the original `txn.done` is signalled directly.
+
+### D9. Configurable attributes
+
+| Attribute | Default | Source | Notes |
+| --- | --- | --- | --- |
+| `num_pcs` | 8 | topology cube `hbm_ctrl.attrs` | Must be power of 2 |
+| `pc_bw_gbs` | 32.0 | builder-derived: `hbm_to_router_bw_gbs / num_pcs` | Enforces ADR-0017 D8 invariant |
+| `burst_bytes` | 256 | topology attrs | Must be power of 2; equals `flit_bytes` (ADR-0033 D1) |
+| `switch_penalty_ns` | 0.0 | topology attrs | Tier 0 default; non-zero models pessimistic R/W switching |
+| `efficiency` | 1.0 | topology attrs | Applied at builder time to `hbm_to_router_bw_gbs` (router-edge BW scaling only) |
+| `overhead_ns` | 0.0 | topology attrs | First-flit decode overhead (D5) |
+
+`pc_bw_gbs` is derived by `topology/builder.py` rather than configured
+directly so the aggregate per-PE BW matches the router-to-HBM link BW
+without yaml-side duplication.
+
+## Consequences
+
+### Positive
+
+- Address-based PC selection preserves multi-stream HBM parallelism
+  that an address-blind round-robin would collapse — important for
+  multi-PE workloads with disjoint HBM regions.
+- Flit-aware path (D5) + async finalize (D6) preserves wormhole
+  pipelining and exposes PC parallelism for back-to-back single-flit
+  transactions.
+- Single source of PC scheduling truth (D4 mechanism, used by both D5
+  flit path and D7 chunk-loop path).
+- Builder-derived `pc_bw_gbs` enforces ADR-0017 D8 in code, not yaml
+  discipline.
+
+### Negative
+
+- No bank-level conflict modelling within a PC; address-blind to
+  bank/row-buffer reuse (ADR-0033 D3).
+- No HBM scheduler (FR-FCFS / write-buffer / watermark drain); fixed
+  FIFO per PC. Bursty mixed R/W is approximated by `switch_penalty_ns`
+  (ADR-0033 D2).
+- `_txn_state` is a regular dict keyed by `id(txn)`; in-flight state
+  accumulates per concurrent transaction and is removed only on
+  `is_last`. Adequate for current workloads.
+
+## Links
+
+- ADR-0001 (Physical address layout — PC bit field comment)
+- ADR-0015 D4 (Memory R/W fabric path — bypass response case)
+- ADR-0017 D4 (Per-PE HBM partitioning — attachment to PE routers)
+- ADR-0017 D8 (HBM channel mapping mode — n:1 aggregate this ADR
+  implements)
+- ADR-0017 D9 (AddressResolver — `hbm_ctrl.pe{pe_id}` endpoint
+  resolution)
+- ADR-0033 D1 (Modelled precisely — per-PC parallelism, switch penalty,
+  flit-aware PC commit, first-flit overhead, chunk-loop drain)
+- ADR-0033 D2 (Switch-penalty default 0 — ideal scheduler amortisation)
diff --git a/docs/adr-ko/ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md b/docs/adr-ko/ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md
new file mode 100644
index 0000000..8d22a7b
--- /dev/null
+++ b/docs/adr-ko/ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md
@@ -0,0 +1,286 @@
+# ADR-0035: M_CPU and M_CPU.DMA Component Model
+
+## Status
+
+Accepted
+
+## Context
+
+M_CPU is the cube-level command processor. It receives commands from
+IO_CPU (or from PCIE_EP when the engine routes Memory R/W through
+M_CPU as a fallback), fans them out to the PEs in its cube, and
+aggregates per-PE responses into a single ResponseMsg sent back to
+IO_CPU on the reverse path.
+
+M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W
+fan-out. Per ADR-0015 D5 it is **not** a separate topology node —
+it lives as internal state of `MCpuComponent`.
+
+This ADR documents the M_CPU component implementation that realizes
+those responsibilities, including the three distinct fan-out paths
+(Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource
+model, and the response aggregation contract.
+
+## Decision
+
+### D1. Role
+
+M_CPU has three responsibilities:
+
+1. **Transit forwarding** — when not the terminal hop (e.g., on the
+   reverse response path PE → M_CPU → IO_CPU), forwards Transactions
+   to `next_hop` in their pre-computed path.
+2. **Multi-PE fan-out at terminal hop** — dispatches to one of three
+   fan-out paths based on request type (D2).
+3. **Response aggregation** — collects per-PE responses, sends a
+   single aggregate ResponseMsg back to IO_CPU on the reverse path.
+
+Each `run()` invocation applies `overhead_ns` once per incoming
+Transaction.
+
+M_CPU does **not**:
+
+- Decide routing — paths are pre-computed by the router (ADR-0002).
+- Handle PE-internal execution — PE_CPU / PE_SCHEDULER / engines
+  (ADR-0014).
+- Decode addresses — `ctx.resolver.resolve(pa)` returns the per-PE
+  `hbm_ctrl.pe{X}` directly (ADR-0017 D9).
+- Interpret tensor or kernel semantics — fan-out dispatch by Python
+  isinstance check only.
+
+### D2. Three fan-out paths dispatched by request type
+
+At the terminal hop the worker dispatches by request type:
+
+```python
+elif self.ctx is not None and txn.request is not None:
+    if isinstance(txn.request, KernelLaunchMsg):
+        env.process(self._kernel_launch_fanout(env, txn))
+    elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
+        env.process(self._mmu_msg_fanout(env, txn))
+    else:
+        env.process(self._dma_fanout(env, txn))
+```
+
+Each path uses a different router method:
+
+- `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()` — the
+  M_CPU-specific DMA path that avoids PE pipeline nodes.
+- `_kernel_launch_fanout` uses `ctx.router.find_node_path()` — the
+  generic NOC command path to PE_CPU.
+- `_mmu_msg_fanout` uses `ctx.router.find_node_path()` — NOC command
+  path to PE_MMU.
+
+### D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)
+
+`MCpuComponent.start()` initializes two SimPy resources:
+
+```python
+self._dma_write = simpy.Resource(env, capacity=1)  # MemoryWriteMsg
+self._dma_read  = simpy.Resource(env, capacity=1)  # MemoryReadMsg
+```
+
+Properties:
+
+- **Not a topology node** — managed entirely inside `MCpuComponent`;
+  does not appear in `topology.yaml` or in the compiled graph.
+- **Independent read and write channels** — concurrent in-flight
+  Memory R/W is allowed.
+- **Capacity=1 per channel** serializes the **dispatch step**
+  (`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory
+  R/W requests at this M_CPU. Actual fabric transfer time is modeled
+  by wire processes between components (ADR-0015 D2) and by
+  `drain_ns` at terminal hops; the DMA resource does not gate
+  transfer duration.
+
+Resource selection is request-type-based:
+
+```python
+dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
+```
+
+### D4. Transit forwarding at non-terminal hops
+
+When `txn.next_hop` is not None — typical for the reverse response
+path (PE → M_CPU → IO_CPU) — the worker forwards normally:
+
+```python
+if next_hop:
+    yield self.out_ports[next_hop].put(txn.advance())
+```
+
+The fan-out branches fire only at the terminal hop. The same component
+therefore serves both forward command dispatch and reverse response
+relay roles.
+
+### D5. DMA fan-out (`_dma_fanout` — Memory R/W)
+
+For each Memory R/W request at terminal hop:
+
+1. `_resolve_dma_destinations(request)` returns a per-PE
+   `hbm_ctrl.pe{X}` derived from the request's PA via
+   `ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9).
+2. For each destination:
+   - Acquire the appropriate DMA resource (`_dma_write` or
+     `_dma_read`) via `with dma_res.request() as req`.
+   - Resolve path via `ctx.router.find_mcpu_dma_path()`.
+   - Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`.
+   - Create sub-Transaction carrying `drain_ns` and dispatch to
+     `path[1]`.
+3. Track `max_drain_ns` across destinations and record it as
+   `txn.result_data["xfer_ns"]` after all responses arrive.
+4. After all per-PE responses are collected (D8), send an aggregate
+   ResponseMsg on the reverse command path back to IO_CPU.
+
+The PA decode fallback (`f"{cube_prefix}.hbm_ctrl"`) is legacy dead
+code: no such node exists after ADR-0017 D4's per-PE partitioning. It
+is kept defensively but does not route to a real destination.
+
+### D6. Kernel launch fan-out (`_kernel_launch_fanout`)
+
+For `KernelLaunchMsg` at terminal hop:
+
+1. `_resolve_pe_ids(target_pe)` → list of PE ids in this cube.
+2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via
+   `ctx.router.find_node_path()`.
+3. **`target_start_ns` handling** (ADR-0009 D5):
+   - If the request already carries `target_start_ns` (stamped by
+     IO_CPU per ADR-0036 D3): **pass through unchanged**.
+   - If absent (direct-to-M_CPU launch in unit tests): compute a
+     per-cube barrier `env.now + max(per-PE leg latency)` and stamp
+     via `dataclasses.replace`.
+4. Dispatch sub-Transactions with `nbytes=0` (kernel launch is a
+   control message; preserving nbytes=0 keeps fan-out off the shared
+   first-hop fabric BW, mirroring ADR-0036 D4).
+5. After all per-PE responses arrive (D8), aggregate per-PE metrics
+   from each sub-Transaction's `result_data` into the parent
+   transaction:
+
+   ```python
+   txn.result_data["pe_exec_ns"]  = max(existing, max(pe_exec_values))
+   txn.result_data["dma_ns"]      = max(existing, max(dma_values))
+   txn.result_data["compute_ns"]  = max(existing, max(compute_values))
+   ```
+
+   The max-merge with the existing value matters because cross-cube
+   IO_CPU fan-out shares the same parent `result_data`; merging
+   prevents one cube from clobbering another's metric.
+6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
+
+### D7. MMU map/unmap fan-out (`_mmu_msg_fanout`)
+
+For `MmuMapMsg` / `MmuUnmapMsg` at terminal hop:
+
+1. `_resolve_pe_ids(target_pe)` → PE ids.
+2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via
+   `find_node_path()`.
+3. Dispatch sub-Transactions with `nbytes=0`.
+4. PE_MMU is a terminal node — it does **not** send a ResponseMsg
+   back. Instead, the sub-Transaction's own `sub_done` event is the
+   completion signal.
+5. Wait for all `sub_done` events in-line (does **not** use
+   `_pending` counter — D8 is for response-bearing fan-out only).
+6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
+
+### D8. Response aggregation (`_pending` + `_parent_txns`)
+
+For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg
+arriving on the reverse path):
+
+```python
+self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
+self._parent_txns: dict[str, Any] = {}
+```
+
+- On dispatch: register `(expected, received=0, all_done)` and
+  remember the parent transaction.
+- `_worker` recognises responses by `is_response=True` and routes
+  them to `_collect_response`, which increments `received` and
+  signals `all_done` when `received >= expected`.
+- After `yield all_done`, the fan-out path constructs the aggregate
+  ResponseMsg:
+
+  ```python
+  resp_msg = ResponseMsg(
+      correlation_id=request.correlation_id,
+      request_id=request.request_id,
+      src_cube=cube_id,
+      src_pe=-1,             # -1 = M_CPU aggregate, not a single PE
+      success=True,          # no failure semantics implemented
+  )
+  ```
+
+- The response Transaction travels on `list(reversed(txn.path))`
+  back to IO_CPU.
+
+MMU fan-out (D7) uses a simpler in-line list of `sub_done` events
+because PE_MMU is terminal — there is no ResponseMsg path to
+intercept.
+
+### D9. Helpers and configurable attribute
+
+`_resolve_pe_ids(target_pe)`:
+
+- `int` → `[target_pe]`
+- `tuple[int, ...]` → `list(target_pe)`
+- `"all"` → `range(n_slices)` where `n_slices` comes from cube
+  `memory_map.hbm_slices_per_cube` (default 8).
+
+Used by kernel-launch and MMU fan-out paths.
+
+Single configurable attribute drives per-instance latency:
+
+| Site | impl name | overhead_ns |
+| --- | --- | --- |
+| Cube `m_cpu` | `builtin.m_cpu` | 5.0 |
+
+Applied once in `run()` per Transaction — models command
+interpretation and dispatch-decision time at M_CPU.
+
+## Consequences
+
+### Positive
+
+- Three fan-out paths are clearly separated by request type; a new
+  request kind needs one `isinstance` branch plus one fan-out method.
+- M_CPU.DMA channels are independent (read and write run concurrently)
+  and serialize only the dispatch step at capacity=1.
+- Transit-vs-terminal behavior is a single `if next_hop` check, so
+  the same component handles forward dispatch and reverse response
+  relay without role duplication.
+- `target_start_ns` passthrough (D6) preserves the cross-cube barrier
+  established by IO_CPU (ADR-0036 D3), while the fallback computation
+  keeps direct-to-M_CPU unit tests working.
+- Per-PE metric `max`-merge against existing parent `result_data`
+  values is robust to cross-cube IO_CPU fan-out sharing the same
+  parent.
+
+### Negative
+
+- No partial-failure semantics — a missing per-PE response stalls the
+  parent `all_done` indefinitely. Acceptable for simulation; not
+  suitable as a production-style endpoint.
+- `_resolve_dma_destinations`'s cube-wide hbm_ctrl fallback is dead
+  code (no such node exists post-ADR-0017 D4). Kept defensively;
+  invites confusion and merits a follow-up cleanup.
+- DMA resource serialization applies only at dispatch (the `put` call
+  is instantaneous in unbounded stores). The capacity=1 channel
+  models "one dispatch at a time per channel at this M_CPU", not
+  "transfer duration serialization"; readers must consult wire
+  processes (ADR-0015 D2) and `drain_ns` for actual transfer
+  parallelism.
+
+## Links
+
+- ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
+- ADR-0009 D5 (`target_start_ns` — passed through unchanged when
+  present; computed as per-cube barrier when absent)
+- ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out
+  point)
+- ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same
+  contract at cube level)
+- ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a
+  topology node)
+- ADR-0017 D9 (AddressResolver returns per-PE `hbm_ctrl.pe{X}`)
+- ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes
+  through unchanged; nbytes=0 invariant preserved through fan-out)
diff --git a/docs/adr-ko/ADR-0036-dev-io-cpu-component-model.md b/docs/adr-ko/ADR-0036-dev-io-cpu-component-model.md
new file mode 100644
index 0000000..b79e9ad
--- /dev/null
+++ b/docs/adr-ko/ADR-0036-dev-io-cpu-component-model.md
@@ -0,0 +1,216 @@
+# ADR-0036: IO_CPU Component Model
+
+## Status
+
+Accepted
+
+## Context
+
+IO_CPU is the IO chiplet's host-facing endpoint inside the simulation
+graph. PCIE_EP receives host messages from the runtime API and routes
+them via the io_noc; for command-bearing requests (KernelLaunch,
+MmuMap/Unmap) the io_noc forwards to IO_CPU, which:
+
+- Fans out the request to per-cube M_CPUs.
+- Aggregates per-cube responses into a single host-visible completion.
+- For kernel launches, stamps a global `target_start_ns` barrier so
+  every PE across every targeted cube begins kernel body execution at
+  the same simulated time (ADR-0009 D5).
+
+Memory R/W traffic bypasses IO_CPU per ADR-0015 D4 / ADR-0016 D3;
+this component therefore handles only command-plane traffic in normal
+operation.
+
+This ADR documents the IO_CPU component implementation that realizes
+those responsibilities.
+
+## Decision
+
+### D1. Role
+
+IO_CPU is the host-facing endpoint of the IO chiplet. It has two
+primary responsibilities:
+
+1. **Multi-cube fan-out** — distribute KernelLaunchMsg / MmuMapMsg /
+   MmuUnmapMsg to per-cube M_CPUs.
+2. **Response aggregation** — collect per-cube ResponseMsg, signal
+   parent `txn.done` when all targeted cubes have responded.
+
+A third, narrower responsibility applies only to KernelLaunchMsg:
+**`target_start_ns` global barrier stamping** (D3).
+
+The component does **not**:
+
+- Decide routing — paths are pre-computed by the router (ADR-0002).
+- Decode tensor or kernel internals — those concerns belong to
+  M_CPU / PE_CPU / engines.
+- Handle PE-level fan-out — M_CPU fans out within a cube (ADR-0009 D3).
+- Handle Memory R/W data path — those bypass IO_CPU per ADR-0015 D4
+  and ADR-0016 D3 (Memory R/W resolution code in
+  `_resolve_cube_targets` exists as a defensive fallback only).
+
+Each `run()` invocation applies the configured `overhead_ns` once per
+incoming Transaction (D8).
+
+### D2. Forward path — multi-cube fan-out
+
+When a non-response Transaction arrives, the worker:
+
+1. Pays `overhead_ns` via `run()`.
+2. Calls `_resolve_cube_targets` to derive the list of `(sip, cube)`
+   targets from the request (D5).
+3. For each target:
+   - Resolves M_CPU node id via `ctx.resolver.find_m_cpu(sip, cube)`.
+   - Resolves the path via `ctx.router.find_node_path(io_cpu, m_cpu)`.
+   - Creates a per-cube sub-Transaction with `path` populated and
+     forwards it to `path[1]` (the first hop on the io_noc).
+4. Registers aggregation state: `_pending[request_id] = (expected,
+   received=0, parent_done)`.
+
+### D3. KernelLaunch `target_start_ns` global barrier (ADR-0009 D5)
+
+IO_CPU is the canonical stamper for `target_start_ns`. When the
+request is a `KernelLaunchMsg`, IO_CPU computes a single global
+barrier covering every targeted PE across every targeted cube:
+
+```text
+for (sip, cube) in cube_targets:
+    leg1 = compute_path_latency_ns(io_cpu → m_cpu(sip, cube), nbytes=0)
+    for pe_id in target_pe_ids:
+        leg2 = compute_path_latency_ns(m_cpu → pe_cpu(sip, cube, pe_id),
+                                       nbytes=0)
+        latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns
+        global_max = max(global_max, latency)
+
+target_start_ns = env.now + global_max
+```
+
+The request is then replaced (via `dataclasses.replace`) so the
+stamped value propagates through the fan-out.
+
+Two overhead corrections:
+
+- `io_overhead_ns` is subtracted because IO_CPU has already paid it
+  in `run()` before this method runs.
+- `m_overhead_ns` is subtracted once because it appears as the
+  endpoint of leg1 *and* the start of leg2 in path latency, but
+  M_CPU pays it only once at run time.
+
+Every downstream PE_CPU yields until `target_start_ns` before
+beginning kernel body execution; all PEs therefore start at the same
+simulated time regardless of how long their individual dispatch path
+took.
+
+### D4. KernelLaunch sub-Transactions carry `nbytes=0`
+
+Per-cube sub-Transactions for KernelLaunchMsg force `nbytes=0`,
+overriding the parent `txn.nbytes`:
+
+- Kernel launch is a control message; payload size is irrelevant at
+  the data-fabric level.
+- If `nbytes > 0`, every per-cube sub-txn occupies fabric BW on the
+  io_noc's shared first hop. With 16 cubes this serializes fan-out,
+  pushing far M_CPUs past `target_start_ns` and breaking the D3
+  invariant.
+
+Non-KernelLaunch sub-Transactions preserve `txn.nbytes` (only relevant
+for the defensive Memory R/W fallback path, which carries actual
+payload sizes).
+
+### D5. Per-request-type cube target resolution
+
+`_resolve_cube_targets` dispatches by request type:
+
+| Request type | Source of `(sip, cube)` | `target_cubes="all"` semantics |
+| --- | --- | --- |
+| `MemoryWriteMsg` | `dst_sip`, `dst_cube` (or `PhysAddr.decode(dst_pa).die_id` fallback) | single cube derived from PA decode |
+| `MemoryReadMsg` | `src_sip`, `src_cube` (or `PhysAddr.decode(src_pa).die_id` fallback) | single cube derived from PA decode |
+| `KernelLaunchMsg` | tensor shards filtered by `shard.sip == my_sip` | every cube that owns a shard on this SIP |
+| `MmuMapMsg` / `MmuUnmapMsg` | `target_cubes` list, filtered to this SIP | `range(cubes_per_sip)` from spec |
+
+Each IO_CPU instance fans out only within its own SIP — `_my_sip()`
+parses the SIP id from the node id (e.g., `sip0.io0.io_cpu` → 0).
+
+The Memory R/W rows exist for defensive completeness; the engine's
+normal path routes Memory R/W via `_process_memory_direct()` /
+`find_memory_path()`, bypassing IO_CPU entirely (ADR-0015 D4 /
+ADR-0016 D3).
+
+### D6. Response aggregation
+
+`_pending: dict[request_id → (expected, received, parent_done)]`:
+
+- On dispatch: register `(len(cube_targets), 0, txn.done)`.
+- `_worker` recognises responses by `is_response=True` and routes
+  them to `_collect_response`.
+- `_collect_response` increments `received`; when `received >=
+  expected`, `parent_done.succeed()` is invoked and the entry is
+  removed from `_pending`.
+
+This is a simple per-request counter. There is no per-cube identity
+tracking and no partial-failure handling: a missing response stalls
+the parent `done` event indefinitely. Production-style failure paths
+are out of scope for the current simulator model.
+
+### D7. `target_pe` resolution helper
+
+`_resolve_pe_ids(target_pe)`:
+
+- `int` → `[target_pe]`.
+- `tuple[int, ...]` → `list(target_pe)`.
+- `"all"` → `range(n_slices)`, where `n_slices` comes from cube
+  `memory_map.hbm_slices_per_cube` (default 8).
+
+Used in D3's barrier computation to enumerate every PE target per
+cube.
+
+### D8. Configurable `overhead_ns`
+
+A single attribute drives per-instance latency:
+
+| Site | impl name | overhead_ns |
+| --- | --- | --- |
+| IO chiplet `io_cpu` | `builtin.io_cpu` | 10.0 |
+
+Applied once in `run()` per Transaction. Models command
+interpretation + dispatch-decision time at IO_CPU.
+
+## Consequences
+
+### Positive
+
+- Cross-cube and cross-SIP kernel launches share a single global
+  barrier (D3 + D4) — no per-cube divergence in start time.
+- nbytes=0 invariant keeps fan-out off the shared first-hop fabric
+  BW, preserving the barrier's accuracy at scale (16 cubes).
+- Response aggregation via a single counter keeps state minimal and
+  completion ordering deterministic.
+- Per-SIP scoping (`_my_sip()`) keeps IO_CPUs in different SIPs
+  cleanly independent.
+
+### Negative
+
+- No partial-failure semantics — a missing per-cube response
+  indefinitely stalls the parent. Adequate for simulation but not
+  suitable as a production-style endpoint.
+- `_pending` is a regular dict; in-flight requests accumulate state.
+  Acceptable for current benchmark workloads (few concurrent
+  outstanding launches); unbounded in principle.
+- The Memory R/W resolution branches in `_resolve_cube_targets` are
+  dead code in the normal engine path. Kept defensively but invite
+  drift if the bypass path ever changes.
+
+## Links
+
+- ADR-0002 (Routing distance — path computation)
+- ADR-0009 D1 (Kernel launch is an endpoint request to IO_CPU)
+- ADR-0009 D3 (M_CPU fans out within a cube; IO_CPU fans out across
+  cubes)
+- ADR-0009 D5 (target_start_ns canonical stamping at IO_CPU)
+- ADR-0011 D-VA3 (MmuMapMsg routes through IO_CPU for cube fan-out)
+- ADR-0012 (Host ↔ IO_CPU message schema)
+- ADR-0015 D4 (Memory R/W bypasses IO_CPU; Kernel Launch via IO_CPU)
+- ADR-0016 D1 (IO chiplet io_noc — IO_CPU attaches here)
+- ADR-0016 D3 (Memory R/W path bypasses IO_CPU)
+- ADR-0016 D4 (Kernel Launch path through IO_CPU for command
+  interpretation)
diff --git a/docs/adr-ko/ADR-0037-dev-forwarding-component.md b/docs/adr-ko/ADR-0037-dev-forwarding-component.md
new file mode 100644
index 0000000..193dbe0
--- /dev/null
+++ b/docs/adr-ko/ADR-0037-dev-forwarding-component.md
@@ -0,0 +1,200 @@
+# ADR-0037: Forwarding Component (forwarding_v1)
+
+## Status
+
+Accepted
+
+## Context
+
+The simulation graph has many node positions that exist purely to model
+fabric traversal — NOC mesh routers, switches, UCIe protocol endpoints,
+IO chiplet io_noc, transit cubes. These share a common pattern: receive
+a message, apply per-component overhead (modeling header decode +
+routing decision time), forward to the next hop along the pre-computed
+path.
+
+This ADR defines the contract for these transit nodes: a single
+component type (`TransitComponent`) that handles flit-aware forwarding
+with wormhole cut-through semantics, used under multiple impl names
+according to the conceptual role each instance plays.
+
+## Decision
+
+### D1. Role
+
+The Forwarding component (`TransitComponent` class) is a **stateless
+transit node** in the simulation graph. It models any fabric position
+where a message physically traverses but no semantic processing
+happens.
+
+Per traversal, the component:
+
+1. Reads an incoming Transaction or Flit from an `in_port`.
+2. Applies the configured per-component overhead (`overhead_ns`),
+   applied **once per Transaction** even across multi-flit payloads
+   (see D2).
+3. Looks up the next hop along the Transaction's pre-computed `path`.
+4. Forwards to the corresponding `out_port`; at the terminal node
+   (no next hop), signals `txn.done` once the `is_last` flit arrives.
+
+The component **does NOT**:
+
+- Decide routing — paths are pre-computed by the router (ADR-0002 /
+  ADR-0017 D2). Forwarding only executes the per-hop step.
+- Model wire propagation or bandwidth occupancy — separate wire
+  processes between components handle that (ADR-0015 D2).
+- Resolve addresses — the AddressResolver does that (ADR-0017 D9).
+- Aggregate completion — terminal endpoints (IO_CPU, M_CPU, HBM_CTRL)
+  handle that.
+
+### D2. First-flit overhead model (header decode)
+
+Per-Transaction `overhead_ns` is applied **exactly once**, at first
+flit arrival:
+
+- `_txn_decoded: set[int]` tracks which Transactions have already
+  paid the overhead at this node.
+- On first-flit arrival for a Transaction: `yield self.run(env,
+  msg.txn.nbytes)` — pays the overhead.
+- Subsequent flits of the same Transaction skip the overhead — they
+  pipeline through with no extra delay.
+- On `is_last` flit: remove the Transaction from `_txn_decoded`.
+
+This models the real-HW behavior where header decode and routing
+decision happen once on first flit; payload flits then stream through
+the same path (wormhole cut-through). Multi-hop pipelining emerges
+naturally: each hop adds its own first-flit overhead, but later flits
+of a Transaction pay no overhead at a hop its first flit has already
+passed.
+
+### D3. Serial worker forwarding (preserves order)
+
+The component's worker is a single SimPy process that consumes flits
+from `_inbox` and forwards them serially in arrival order. The
+component does NOT spawn `env.process(...)` per flit.
+
+Rationale: if the first flit yields on `overhead_ns` while subsequent
+flits run in parallel processes, the later flits can overtake the
+first. This produces out-of-order delivery and lets the `is_last`
+flit arrive at the destination before the first flit — corrupting
+both the transaction's completion semantics and any flit-index-based
+processing downstream.
+
+### D4. Path-based next-hop routing
+
+Routing is **not** a Forwarding-component concern. The Transaction
+arrives with a pre-computed `path` (built by the router; ADR-0002 /
+ADR-0017 D2). The component just looks up its own position in the
+path and forwards to `path[index + 1]`:
+
+```python
+def _next_hop_in_path(self, txn):
+    my_id = self.node.id
+    path = txn.path
+    for i, n in enumerate(path):
+        if n == my_id and i + 1 < len(path):
+            return path[i + 1]
+    return None
+```
+
+If `next_hop` is found and present in `out_ports`, the flit is
+forwarded. Otherwise (terminal node), `txn.done.succeed()` is
+invoked when the `is_last` flit arrives.
+
+### D5. Flit-aware mode with Non-Flit fallback
+
+`_FLIT_AWARE = True` opts this component out of the base class's
+flit-reassembly logic in `_fan_in`. Flits are placed directly on
+`_inbox` (no reassembly), enabling per-flit handling in the worker
+loop (D2, D3).
+
+Non-Flit messages — zero-byte control Transactions and other
+non-chunkified payloads — fall through to the base class's legacy
+`_forward_txn` path via `env.process`. This preserves backward
+compatibility for control-plane traffic that does not benefit from
+flit-level processing.
+
+### D6. Multi-stream merging at the base class
+
+Multi-stream FIFO merging at routers is the base class's
+responsibility, not Forwarding's. The base class's `_fan_in` spawns
+one process per `in_port`; all push to a single shared `_inbox`.
+Flits from different upstream streams therefore interleave at
+flit granularity in `_inbox`'s FIFO order.
+
+The Forwarding worker simply consumes `_inbox` in arrival order,
+which models per-router multi-flow arbitration as a fair FIFO over
+the shared inbox.
+
+### D7. Single implementation under multiple impl names
+
+A single `TransitComponent` class is registered under four impl names
+in `components.yaml`:
+
+- `builtin.forwarding` — generic forwarding (e.g., `io_noc`,
+  `noc_router`, UCIe conn bridges)
+- `builtin.switch` — tray-level switch
+- `builtin.noc` — cube-level NOC fabric (legacy singleton; current
+  NOC routers use `builtin.forwarding`)
+- `builtin.ucie` — UCIe protocol endpoint
+
+All four aliases instantiate the same class with the same behavior.
+Per-instance differentiation lives only in `attrs.overhead_ns`.
+Separate impl names exist as intent tags for readability and to
+allow future divergence without backward-incompatible config
+changes.
+
+### D8. Configurable `overhead_ns`
+
+A single attribute drives per-instance latency:
+
+| Usage site | impl name | overhead_ns |
+| --- | --- | --- |
+| Tray-level switch | `builtin.switch` | 5.0 |
+| Cube NOC router | `builtin.forwarding` | 2.0 |
+| IO chiplet io_noc | `builtin.forwarding` | 0.0 |
+| UCIe protocol endpoint (`ucie-{N,S,E,W}`) | `builtin.ucie` | 8.0 |
+| UCIe conn bridge (`ucie-{PORT}.conn{N}`) | `builtin.forwarding` | 0.0 |
+
+Default is 0.0. The attribute is read at each `run()` invocation, so
+dynamic reconfiguration is possible but not currently used.
+
+## Consequences
+
+### Positive
+
+- A single class handles all transit-node roles in the simulation
+  graph — minimal code surface for a high-population component type.
+- Flit-aware processing + serial worker preserves wormhole semantics
+  across multi-hop paths without per-flit process overhead.
+- `overhead_ns` is the only per-instance tunable; routing, BW, and
+  address resolution stay cleanly separated in their own components /
+  modules.
+- Multi-stream merging emerges from the base-class structure; no
+  router-specific logic duplicates fair-FIFO arbitration.
+- Non-Flit fallback path keeps control-plane traffic working without
+  forcing every message into the flit framework.
+
+### Negative
+
+- The single class hides usage-site intent inside `attrs.overhead_ns`
+  configuration; readers must consult `topology.yaml` +
+  `components.yaml` to see which impl name maps to which behavior
+  class.
+- Per-flit serial worker is a bottleneck if `overhead_ns` is large
+  and many concurrent transactions arrive at the same router; current
+  values (0–8 ns) make this negligible.
+
+## Links
+
+- ADR-0002 (Routing distance — path computation)
+- ADR-0015 D1 (Component port model)
+- ADR-0015 D2 (Wire process — BW + propagation, separate from this
+  component)
+- ADR-0015 D6 (Transit cube forwarding pattern)
+- ADR-0016 D1 (IO chiplet io_noc — uses this component)
+- ADR-0017 D1 (Cube NOC routers — use this component)
+- ADR-0017 D6 (UCIe decomposition — `ucie-{PORT}` instances use this
+  component)
+- ADR-0033 D1 (Flit-aware pass-through, first-flit overhead,
+  multi-stream merge semantics)
diff --git a/docs/adr/ADR-0012-api-host-io-message-schema.md b/docs/adr/ADR-0012-api-host-io-message-schema.md
index 07d95c5..0979788 100644
--- a/docs/adr/ADR-0012-api-host-io-message-schema.md
+++ b/docs/adr/ADR-0012-api-host-io-message-schema.md
@@ -18,7 +18,7 @@ We define stable, minimal message schemas for Host ↔ IO_CPU so that:
 - IO_CPU-internal fan-out/aggregation can evolve independently,
 - completion and failure propagation is deterministic.
 
-We also require PE-tagging (A 방식): each shard explicitly carries (sip,cube,pe)
+We also require PE-tagging (Scheme A): each shard explicitly carries (sip,cube,pe)
 so IO_CPU can deterministically route/fan-out without relying on PA decoding.
 
 ---
@@ -93,7 +93,7 @@ Rules:
 Mandatory fields:
 
 - common envelope fields (D3)
-- destination placement tags (A 방식):
+- destination placement tags (Scheme A):
   - `dst_sip: int`
   - `dst_cube: int`
   - `dst_pe: int`
@@ -130,7 +130,7 @@ Notes:
 Mandatory fields:
 
 - common envelope fields (D3)
-- source placement tags (A 방식):
+- source placement tags (Scheme A):
   - `src_sip: int`
   - `src_cube: int`
   - `src_pe: int`
@@ -183,7 +183,7 @@ Tensor arg (mandatory):
 
 - `shards: list[TensorShard]`
 
-`TensorShard` MUST have (A 방식 강제):
+`TensorShard` MUST have (Scheme A enforced):
 
 - `sip: int`
 - `cube: int`
diff --git a/docs/adr/ADR-0020-prog-data-execution-two-pass.en.md b/docs/adr/ADR-0020-prog-data-execution-two-pass.en.md
deleted file mode 100644
index 04ac253..0000000
--- a/docs/adr/ADR-0020-prog-data-execution-two-pass.en.md
+++ /dev/null
@@ -1,519 +0,0 @@
-# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
-
-## Status
-
-Accepted
-
-## Context
-
-The current simulation models **timing only**.
-`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
-but do not actually read tensor data or perform computations.
-
-### Required Capabilities
-
-1. Must be able to store and read actual data in HBM/TCM/SRAM
-2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
-3. Must minimize simulation performance degradation
-
-### Constraints
-
-- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
-- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
-- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
-- Kernel functions must remain plain Python functions (no generator/async transformation)
-
-### Design Exploration Results
-
-| Option | Approach | Verdict |
-|--------|----------|---------|
-| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
-| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
-| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
-| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |
-
----
-
-## Decision
-
-### D1. 2-Pass Execution Model — Phase 0 Elimination
-
-The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.
-
-Before:
-```
-Phase 0: Kernel → PeCommand list (no data, no branching)
-Phase 1: Replay PeCommand list via SimPy (timing only)
-```
-
-After:
-```
-Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
-  - Memory read/write: SimPy timing + MemoryStore actual data
-  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
-  - Dynamic control flow possible (tl.load returns actual data)
-
-Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
-```
-
-This ADR **extends Phase 1 to be data-aware for memory operations only**.
-Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
-Phase 2 handles GEMM/Math computation correctness verification.
-Phase 2 is optional — if only timing is needed, run Phase 1 alone.
-
-### D2. Op Log Recording — ComponentBase Hook
-
-Op log recording is performed as a **hook in the component base class**.
-Individual component implementations are not modified.
-
-```python
-class ComponentBase:
-    def _on_process_start(self, env, msg):
-        if self._op_logger and getattr(msg, 'data_op', False):
-            self._op_logger.record_start(env.now, self.node.id, msg)
-
-    def _on_process_end(self, env, msg):
-        if self._op_logger and getattr(msg, 'data_op', False):
-            self._op_logger.record_end(env.now, self.node.id, msg)
-```
-
-Hooks are called before and after `run()` within `_forward_txn()`.
-`_op_logger` is optional — zero overhead when absent.
-
-**Hook timing definitions**:
-
-| Timing | Meaning |
-|--------|---------|
-| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
-| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |
-
-Link traversal latency is not included in t_start/t_end.
-Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
-
-### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
-
-The existing Phase 0 (kernel → PeCommand list) is eliminated,
-and **greenlet** is used to cooperatively interleave kernel and SimPy execution.
-
-#### Operating Principle
-
-greenlet is a C extension that provides cooperative context switching.
-When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
-to perform timing simulation, and after completion, returns to the kernel with actual data.
-
-```
-SimPy loop (parent greenlet)           Kernel (child greenlet)
-─────────────────────────              ──────────────────────
-g.switch() ─────────────────────────→ Kernel starts
-                                       a = tl.load(ptr, ...)
-                                         internal: parent.switch(DmaReadCmd)
-cmd = DmaReadCmd ←──────────────────  (kernel paused)
-  yield DmaReadMsg(...)
-  yield env.timeout(dma_latency)
-  data = memory_store.read(...)
-g.switch(data) ─────────────────────→ (kernel resumed)
-                                       a = data  ← actual numpy array
-                                       if a[0][0] > 0.5:  ← branching possible
-                                         ...
-```
-
-The kernel is maintained as a **plain Python function**.
-greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
-
-#### KernelRunner — Framework Layer
-
-The greenlet loop resides not in the PE_CPU component but in the framework layer,
-**KernelRunner**.
-
-```python
-# KernelRunner (framework — greenlet ↔ SimPy bridge)
-class KernelRunner:
-    def run(self, env, kernel_fn, args, store):
-        g = greenlet(self._run_kernel)
-        cmd = g.switch(kernel_fn, args)
-
-        while cmd is not None:
-            if isinstance(cmd, DmaReadCmd):
-                yield from self._dispatch_dma(env, cmd)
-                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
-                cmd = g.switch(data)            # resume with actual data
-            elif isinstance(cmd, GemmCmd):
-                yield from self._dispatch_gemm(env, cmd)
-                cmd = g.switch()                # resume (no data)
-            elif isinstance(cmd, DmaWriteCmd):
-                store.write(cmd.dst_addr, cmd.data)  # visibility = issue time
-                yield from self._dispatch_dma(env, cmd)  # timing only
-                cmd = g.switch()
-
-# PE_CPU (component — kept simple, unaware of greenlet)
-def _execute_kernel(self, env):
-    runner = KernelRunner(self.ctx)
-    yield from runner.run(env, kernel_fn, args, store)
-```
-
-**Op logging single source of truth**: KernelRunner does not record directly to op_log.
-All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
-When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
-the component base class hooks automatically record them.
-
-**Layer separation**:
-- **Kernel code**: plain function, unaware of greenlet
-- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
-- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
-- **ComponentBase hook**: the sole path for op_log recording
-- **PE_CPU**: only calls KernelRunner, replaceable as a component
-
-#### Handling Differences Between Memory Read/Write and Compute
-
-| Operation | In Phase 1 | In Phase 2 |
-|-----------|-----------|-----------|
-| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
-| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
-| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
-| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |
-
-Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
-GEMM/Math operations are batch-executed in Phase 2 (performance separation).
-
-#### Store Visibility Rule
-
-`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
-SimPy DMA timing is simulated separately afterward.
-
-This is an intentional separation of timing and visibility:
-- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
-- **timing**: the point at which DMA latency completes in SimPy
-
-This separation allows a load immediately after a store to see the latest data in dynamic control flow.
-
-#### Result Handle Semantics
-
-`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.
-
-The key contract in Phase 1:
-
-1. **All compute handles are always considered pending in Phase 1.**
-2. `tl.wait(handle)` **expresses timing synchronization only**
-   and does not make the handle ready.
-3. Accessing the handle's actual result data (`handle.data`, element access,
-   numpy conversion, etc.) is **only possible in Phase 2**.
-4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
-5. In contrast, `tl.load()` returns actual data in Phase 1, so
-   **memory-read-based control flow is supported**.
-
-| Handle state | Phase | Allowed operations |
-|------------|-------|----------|
-| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
-| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
-| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
-| ready | Phase 2 | Actual numpy data access, verification |
-
-This restriction is intentional. If computations were executed in Phase 1,
-the SimPy single-thread would block, defeating the purpose of 2-pass separation.
-
-#### Phase 1 Materialization — Future Extension
-
-If Phase 1 eager execution becomes necessary for small operations
-(scalar, small reduction) in the future, selective materialization can be supported
-by adding a `materialized_in_phase1: bool` flag to the op record.
-This is not implemented in the current scope.
-
-### D4. data_op Flag — Message Self-Declaration
-
-The logging target is determined by the `data_op` attribute on the message instance,
-not by message type. The framework does not hardcode message types.
-
-```python
-class MsgBase:
-    data_op: bool = False       # default: no logging
-
-class DmaReadCmd(MsgBase):
-    data_op = True              # memory transfer → logging
-
-class GemmCmd(MsgBase):
-    data_op = True              # compute → logging
-
-class MathCmd(MsgBase):
-    data_op = True              # compute → logging
-```
-
-When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
-enables automatic logging without modifying framework code.
-
-### D5. Op Log Structure
-
-#### Op Classification Scheme
-
-A two-level classification is used:
-
-| Level | Field | Role |
-|-------|-------|------|
-| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
-| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |
-
-#### OpRecord Definition
-
-```python
-@dataclass
-class OpRecord:
-    t_start: float              # SimPy time (ns) — service start
-    t_end: float                # SimPy time (ns) — service completion
-    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
-    op_kind: str                # "memory" | "gemm" | "math"
-    op_name: str                # specific operation name
-    params: dict                # per-operation parameters (see below)
-    dependency_ids: list[int]   # currently based on in-memory record index, may be replaced with stable op_id in the future
-```
-
-#### dependency_ids Generation Rules
-
-`dependency_ids` is **optional**, and by default the executor performs
-address-based dependency inference (see D6).
-
-Explicit setting is only needed when precise execution ordering is required:
-- **Default (address-based inference)**: the executor analyzes read/write sets to
-  automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
-- **Explicit setting**: set when logical dependencies cannot be expressed via addresses
-  at the TLContext or command generation stage.
-  Example: completion handle-based synchronization — handle dependencies depend on
-  logical completion order rather than memory addresses, so they cannot be captured
-  by address inference.
-
-#### op_log Ordering
-
-The op_log maintains **stable ordering** based on `t_start`.
-Records with the same `t_start` preserve insertion order.
-
-#### params Details
-
-**memory (dma_read / dma_write)**:
-```python
-{
-    "src_addr": int,            # source address (byte)
-    "dst_addr": int,            # destination address (byte)
-    "nbytes": int,              # transfer size
-    "src_space": str,           # "hbm" | "tcm" | "sram"
-    "dst_space": str,           # "hbm" | "tcm" | "sram"
-}
-```
-
-**gemm**:
-```python
-{
-    "src_a_addr": int,          # operand A address
-    "src_b_addr": int,          # operand B address
-    "dst_addr": int,            # output address
-    "shape_a": tuple,           # e.g. (128, 256)
-    "shape_b": tuple,           # e.g. (256, 128)
-    "shape_out": tuple,         # e.g. (128, 128)
-    "dtype_in": str,            # e.g. "f16"
-    "dtype_acc": str,           # accumulation dtype, e.g. "f32"
-    "dtype_out": str,           # output dtype, e.g. "f16"
-    "transpose_a": bool,
-    "transpose_b": bool,
-    "layout_a": str,            # "row_major" | "col_major"
-    "layout_b": str,
-    "layout_out": str,
-    "addr_space": str,          # "tcm" (GEMM operands are always in TCM)
-}
-```
-
-**math**:
-```python
-{
-    "op": str,                  # "exp" | "add" | "sum" | "where" | ...
-    "input_addrs": list[int],   # list of operand addresses
-    "input_shapes": list[tuple],
-    "dst_addr": int,
-    "shape_out": tuple,
-    "dtype": str,
-    "axis": int | None,         # reduction axis
-    "addr_space": str,          # "tcm"
-}
-```
-
-### D6. Phase 2 Executor
-
-Phase 2 executes the op_log outside of SimPy.
-
-```python
-class DataExecutor:
-    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
-        self.store = initial_store  # Takes the Phase 1 MemoryStore snapshot as input
-
-    def run(self):
-        for t, ops in groupby(op_log, key=lambda o: o.t_start):
-            batch = list(ops)
-            independent, sequential = self._classify(batch)
-            self._execute_parallel(independent)
-            self._execute_sequential(sequential)
-```
-
-**Parallel execution determination**:
-
-Ops with the same `t_start` are considered **parallel candidates**.
-The executor determines actual parallel execution based on the following criteria:
-- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
-- Whether predecessor ops specified in `dependency_ids` have completed
-
-Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
-
-**Batch optimization**: Only independent ops with the same op_name **and identical
-shape, dtype, layout, and transpose flags** are eligible for batching.
-Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
-Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
-
-**Phase 2 execution order guarantee**:
-
-Phase 2 does not consider data arrival timing,
-and guarantees execution order solely through
-dependencies (address-based inference + explicit dependency_ids).
-
-### D7. Memory Store
-
-`MemoryStore` logically follows byte-addressable semantics,
-and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).
-
-```python
-class MemoryStore:
-    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
-    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
-```
-
-**Internal storage format: numpy ndarray**
-
-MemoryStore stores tensors as **numpy ndarrays**.
-
-| Candidate | store/load speed | Phase 2 compute | Verdict |
-|-----------|-----------------|-----------------|---------|
-| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
-| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
-| torch tensor | Immediate | torch operations available | Use only for GPU optimization |
-
-- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
-- read: **returns numpy array by reference** (no copy)
-- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
-- dtype uses numpy native (`np.float16`, `np.float32`, `np.bfloat16`, etc.)
-- For byte-level access, convert via `.view(np.uint8)`
-- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
-
-**read/write contract**:
-
-- read/write operates on a **contiguous tensor** basis.
-  If non-contiguous stride views are needed, express them as separate copy ops.
-- In the normal benchmark path, producer/consumer dtype match is expected.
-  Reinterpret cast is a permissive behavior for low-level memory validation
-  or special test cases.
-- addr is byte-aligned, with minimum alignment = dtype size.
-- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
-  Shape mismatch is verified based on nbytes, and raises an error on mismatch.
-- Correctness criteria follow address-range-based read/write semantics.
-- A tensor object cache may be used as an implementation optimization,
-  but the canonical state is byte-addressable storage.
-- At deploy time, the host injects initial tensor data.
-
-### D8. Benchmark Kernel Code
-
-The benchmark's **user code API is not changed**.
-The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.
-
-However, internal command/message schemas may be extended to include metadata
-required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
-
-### D9. No Component Changes
-
-Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
-Op log recording is the responsibility of the ComponentBase hook.
-When custom components are replaced, only the timing model changes,
-and Phase 2 data execution is unaffected.
-
-### D10. Phase 2 is Optional
-
-```python
-engine = GraphEngine(graph)
-engine.run(benchmark)                       # Phase 1: timing only
-result = engine.get_timing_result()
-
-if verify_data:
-    executor = DataExecutor(engine.op_log)  # Phase 2: data
-    executor.run()
-    executor.verify(expected_output)
-```
-
-If only timing analysis is needed, Phase 2 is skipped.
-If the op_logger is deactivated, Phase 1 performance is identical to the original.
-
-### D11. Verification Contract
-
-Basic verification **compares the final output tensor** against a reference backend (numpy).
-
-Per-dtype tolerance policy:
-
-| dtype | Comparison method | Tolerance |
-|-------|----------|-----------|
-| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
-| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
-| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
-| int types | `np.array_equal` | exact |
-
-- Default mode: compare final output only (end-to-end correctness)
-- Debug mode: can compare intermediate tensors on a per-op basis
-  (MemoryStore snapshot at each op boundary)
-
----
-
-## Non-goals
-
-- **Compute-result-based control flow**: not supported.
-  All compute handles are in pending state during Phase 1,
-  `wait()` expresses timing synchronization only and does not imply data readiness.
-  Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
-  is **treated as an error**.
-  Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
-  Phase 1 materialization is a future extension (see D3).
-- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
-  the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
-- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
-  and do not reproduce the actual hardware PE microarchitecture.
-
-## Open Questions
-
-- **Aliasing / slice view**: How to represent slice/views referencing the same
-  backing storage in MemoryStore (stride-based view vs copy semantics)
-- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
-  communication as memory ops or introduce a separate op_kind
-- **Op log streaming**: Managing op_log memory usage in large-scale simulations
-  (in-memory list vs disk-backed streaming)
-- **Fused operation**: Whether to record tl.composite's tiled pipeline
-  (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
-- **Math op schema generalization**: The current math params have a simple structure,
-  but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
-  scalar/immediate operands, where/mask expressions, etc.
-- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
-  replacement with stable op_id is needed when introducing streaming/disk-backed mode
-- **Phase 1 materialization policy**: See Future Extension in D3.
-  If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
-  needs to be defined
-
----
-
-## Consequences
-
-### Positive
-
-- Minimal impact on SimPy simulation performance (only op_log append added)
-- Free to use multi-threading/GPU in Phase 2
-- Component replaceability preserved (ADR-0015 design philosophy maintained)
-- No changes needed to benchmark user code API
-- When adding new message types, only set the data_op flag
-- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
-- `tl.load()` returns actual data, making kernel debugging easier
-
-### Negative
-
-- op_log memory usage (for large-scale simulations)
-- Phase 2 execution time is proportional to tensor size (large GEMM)
-- Dynamic branching based on pending handles (incomplete computations) not possible
-  (computations execute in Phase 2, result values are undetermined in Phase 1).
-  Memory-data-based branching is supported via greenlet.
-- greenlet C extension dependency added (pip install greenlet)
diff --git a/docs/adr/ADR-0020-prog-data-execution-two-pass.md b/docs/adr/ADR-0020-prog-data-execution-two-pass.md
index a8d277d..04ac253 100644
--- a/docs/adr/ADR-0020-prog-data-execution-two-pass.md
+++ b/docs/adr/ADR-0020-prog-data-execution-two-pass.md
@@ -1,4 +1,4 @@
-# ADR-0020: 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리)
+# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
 
 ## Status
 
@@ -6,65 +6,65 @@ Accepted
 
 ## Context
 
-현재 시뮬레이션은 **타이밍만** 모델링한다.
-`tl.load()`, `tl.composite(op="gemm")` 등은 SimPy latency를 생성하지만,
-실제 텐서 데이터를 읽거나 연산하지 않는다.
+The current simulation models **timing only**.
+`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
+but do not actually read tensor data or perform computations.
 
-### 필요한 기능
+### Required Capabilities
 
-1. HBM/TCM/SRAM에 실제 데이터를 저장하고 읽을 수 있어야 한다
-2. PE_GEMM, PE_MATH가 실제 행렬 연산을 수행하고 결과를 검증할 수 있어야 한다
-3. 시뮬레이션 성능 저하를 최소화해야 한다
+1. Must be able to store and read actual data in HBM/TCM/SRAM
+2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
+3. Must minimize simulation performance degradation
 
-### 제약 조건
+### Constraints
 
-- SimPy는 single-thread 이벤트 루프 — numpy matmul을 안에서 하면 전체가 block
-- 컴포넌트는 교체 가능해야 한다 (ADR-0015) — 프레임워크 요구사항이 구현에 침투하면 안 됨
-- 벤치마크 커널은 명령형 코드(tl.load → tl.composite → tl.wait) — 같은 코드를 재사용해야 함
-- 커널 함수는 plain Python function으로 유지해야 한다 (generator/async 변환 불가)
+- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
+- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
+- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
+- Kernel functions must remain plain Python functions (no generator/async transformation)
 
-### 설계 탐색 결과
+### Design Exploration Results
 
-| Option | 방식 | 판정 |
-|--------|------|------|
-| SimPy 내 직접 실행 | GEMM을 SimPy 안에서 numpy 호출 | 탈락: single-thread block |
-| SimPy + ThreadPool | future.submit → timeout → result() | 탈락: back-to-back 요청 시 result()에서 block |
-| Symbolic + lazy | 메타데이터만 추적, 나중에 실행 | 탈락: control-flow dependent 읽기 처리 곤란 |
-| **2-pass (채택)** | Phase 1: 타이밍, Phase 2: 데이터 | 완전 분리, 성능 영향 없음 |
+| Option | Approach | Verdict |
+|--------|----------|---------|
+| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
+| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
+| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
+| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |
 
 ---
 
 ## Decision
 
-### D1. 2-Pass 실행 모델 — Phase 0 제거
+### D1. 2-Pass Execution Model — Phase 0 Elimination
 
-기존의 3단계(Phase 0 → Phase 1 → Phase 2)를 **2단계로 통합**한다.
+The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.
 
-기존:
+Before:
 ```
-Phase 0: 커널 → PeCommand 리스트 (데이터 없음, 분기 불가)
-Phase 1: PeCommand 리스트를 SimPy replay (타이밍만)
+Phase 0: Kernel → PeCommand list (no data, no branching)
+Phase 1: Replay PeCommand list via SimPy (timing only)
 ```
 
-변경:
+After:
 ```
-Phase 1 (타이밍): 커널 + SimPy 통합 실행 — greenlet 기반
-  - 메모리 읽기/쓰기: SimPy 타이밍 + MemoryStore 실제 데이터
-  - 연산 (GEMM/Math): SimPy 타이밍 + op_log 기록 (실제 연산은 Phase 2)
-  - dynamic control flow 가능 (tl.load가 실제 데이터 반환)
+Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
+  - Memory read/write: SimPy timing + MemoryStore actual data
+  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
+  - Dynamic control flow possible (tl.load returns actual data)
 
-Phase 2 (데이터): op_log 기반 실제 연산 실행 — SimPy 외부, 병렬 가능
+Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
 ```
 
-본 ADR은 **메모리 연산에 한해 Phase 1을 data-aware로 확장**한다.
-Phase 1은 latency/BW 병목 분석 + 메모리 데이터 추적,
-Phase 2는 GEMM/Math 연산 정합성 검증.
-Phase 2는 optional — 타이밍만 필요하면 Phase 1만 실행.
+This ADR **extends Phase 1 to be data-aware for memory operations only**.
+Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
+Phase 2 handles GEMM/Math computation correctness verification.
+Phase 2 is optional — if only timing is needed, run Phase 1 alone.
 
-### D2. Op Log 기록 — ComponentBase hook
+### D2. Op Log Recording — ComponentBase Hook
 
-op_log 기록은 **컴포넌트 베이스 클래스의 hook**으로 수행한다.
-개별 컴포넌트 구현을 수정하지 않는다.
+Op log recording is performed as a **hook in the component base class**.
+Individual component implementations are not modified.
 
 ```python
 class ComponentBase:
@@ -77,56 +77,56 @@ class ComponentBase:
             self._op_logger.record_end(env.now, self.node.id, msg)
 ```
 
-`_forward_txn()` 에서 `run()` 전후로 hook을 호출한다.
-`_op_logger`는 optional — 없으면 오버헤드 제로.
+Hooks are called before and after `run()` within `_forward_txn()`.
+`_op_logger` is optional — zero overhead when absent.
 
-**hook 시점 정의**:
+**Hook timing definitions**:
 
-| 시점 | 의미 |
-|------|------|
-| `t_start` | 컴포넌트가 해당 msg의 **service를 시작**한 시점 (`run()` 진입 직전) |
-| `t_end` | 컴포넌트의 **내부 service가 완료**된 시점 (`run()` 반환 직후) |
+| Timing | Meaning |
+|--------|---------|
+| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
+| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |
 
-link traversal latency는 t_start/t_end에 포함되지 않는다.
-link latency는 발신 컴포넌트의 t_end와 수신 컴포넌트의 t_start 차이로 관측된다.
+Link traversal latency is not included in t_start/t_end.
+Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
 
-### D3. Greenlet 기반 커널 실행 — Phase 0 제거
+### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
 
-기존 Phase 0 (커널 → PeCommand 리스트)를 제거하고,
-**greenlet**을 사용하여 커널과 SimPy를 협력적으로 interleave 실행한다.
+The existing Phase 0 (kernel → PeCommand list) is eliminated,
+and **greenlet** is used to cooperatively interleave kernel and SimPy execution.
 
-#### 동작 원리
+#### Operating Principle
 
-greenlet은 협력적 context switch를 제공하는 C 확장이다.
-커널(child greenlet)이 `tl.load()` 등을 호출하면 SimPy 루프(parent greenlet)로
-switch하여 타이밍 시뮬레이션을 수행하고, 완료 후 실제 데이터와 함께 커널로 돌아온다.
+greenlet is a C extension that provides cooperative context switching.
+When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
+to perform timing simulation, and after completion, returns to the kernel with actual data.
 
 ```
-SimPy 루프 (parent greenlet)          커널 (child greenlet)
+SimPy loop (parent greenlet)           Kernel (child greenlet)
 ─────────────────────────              ──────────────────────
-g.switch() ─────────────────────────→ 커널 시작
+g.switch() ─────────────────────────→ Kernel starts
                                        a = tl.load(ptr, ...)
-                                         내부: parent.switch(DmaReadCmd)
-cmd = DmaReadCmd ←──────────────────  (커널 일시정지)
+                                         internal: parent.switch(DmaReadCmd)
+cmd = DmaReadCmd ←──────────────────  (kernel paused)
   yield DmaReadMsg(...)
   yield env.timeout(dma_latency)
   data = memory_store.read(...)
-g.switch(data) ─────────────────────→ (커널 재개)
-                                       a = data  ← 실제 numpy array
-                                       if a[0][0] > 0.5:  ← 분기 가능
+g.switch(data) ─────────────────────→ (kernel resumed)
+                                       a = data  ← actual numpy array
+                                       if a[0][0] > 0.5:  ← branching possible
                                          ...
 ```
 
-커널은 **plain Python function**으로 유지된다.
-greenlet switch는 `tl.load()`, `tl.store()` 등의 **내부 구현에만** 존재한다.
+The kernel is maintained as a **plain Python function**.
+greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
 
-#### KernelRunner — 프레임워크 레이어
+#### KernelRunner — Framework Layer
 
-greenlet 루프는 PE_CPU 컴포넌트가 아니라 프레임워크 레이어인
-**KernelRunner**에 위치한다.
+The greenlet loop resides not in the PE_CPU component but in the framework layer,
+**KernelRunner**.
 
 ```python
-# KernelRunner (프레임워크 — greenlet ↔ SimPy 연결)
+# KernelRunner (framework — greenlet ↔ SimPy bridge)
 class KernelRunner:
     def run(self, env, kernel_fn, args, store):
         g = greenlet(self._run_kernel)
@@ -136,160 +136,162 @@ class KernelRunner:
             if isinstance(cmd, DmaReadCmd):
                 yield from self._dispatch_dma(env, cmd)
                 data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
-                cmd = g.switch(data)            # 실제 데이터와 함께 재개
+                cmd = g.switch(data)            # resume with actual data
             elif isinstance(cmd, GemmCmd):
                 yield from self._dispatch_gemm(env, cmd)
-                cmd = g.switch()                # 재개 (데이터 없음)
+                cmd = g.switch()                # resume (no data)
             elif isinstance(cmd, DmaWriteCmd):
-                store.write(cmd.dst_addr, cmd.data)  # visibility = issue 시점
-                yield from self._dispatch_dma(env, cmd)  # timing만 반영
+                store.write(cmd.dst_addr, cmd.data)  # visibility = issue time
+                yield from self._dispatch_dma(env, cmd)  # timing only
                 cmd = g.switch()
 
-# PE_CPU (컴포넌트 — 간단하게 유지, greenlet을 모름)
+# PE_CPU (component — kept simple, unaware of greenlet)
 def _execute_kernel(self, env):
     runner = KernelRunner(self.ctx)
     yield from runner.run(env, kernel_fn, args, store)
 ```
 
-**Op logging single source of truth**: KernelRunner는 op_log에 직접 기록하지 않는다.
-모든 op logging은 **ComponentBase hook (_on_process_start/end)만** 담당한다.
-KernelRunner가 `_dispatch_gemm()` 등으로 컴포넌트에 메시지를 전달하면,
-컴포넌트 베이스 클래스의 hook이 자동으로 기록한다.
+**Op logging single source of truth**: KernelRunner does not record directly to op_log.
+All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
+When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
+the component base class hooks automatically record them.
 
-**레이어 분리**:
-- **커널 코드**: plain function, greenlet 존재를 모름
-- **TLContext**: `tl.load()` 내부에서 `parent.switch(cmd)` 호출
-- **KernelRunner**: greenlet ↔ SimPy 연결, MemoryStore 읽기/쓰기 처리. **logging 안 함**.
-- **ComponentBase hook**: op_log 기록의 유일한 경로
-- **PE_CPU**: KernelRunner를 호출만 함, 컴포넌트로서 교체 가능
+**Layer separation**:
+- **Kernel code**: plain function, unaware of greenlet
+- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
+- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
+- **ComponentBase hook**: the sole path for op_log recording
+- **PE_CPU**: only calls KernelRunner, replaceable as a component
 
-#### 메모리 읽기/쓰기 vs 연산의 처리 차이
+#### Handling Differences Between Memory Read/Write and Compute
 
-| 연산 | Phase 1에서 | Phase 2에서 |
-|------|------------|------------|
-| `tl.load()` | SimPy 타이밍 + MemoryStore read → **실제 데이터 반환** | — |
-| `tl.store()` | SimPy 타이밍 + MemoryStore write → **실제 기록** | — |
-| `tl.composite(gemm)` | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 |
-| `tl.dot()` / math ops | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 |
+| Operation | In Phase 1 | In Phase 2 |
+|-----------|-----------|-----------|
+| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
+| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
+| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
+| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |
 
-메모리 읽기/쓰기는 Phase 1에서 즉시 처리 (numpy slice, 빠름).
-GEMM/Math 연산은 Phase 2에서 batch 실행 (성능 분리).
+Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
+GEMM/Math operations are batch-executed in Phase 2 (performance separation).
 
 #### Store Visibility Rule
 
-`tl.store()`는 **issue 시점에 MemoryStore에 즉시 반영**된다 (visibility = issue).
-SimPy DMA 타이밍은 이후 별도로 시뮬레이션된다.
+`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
+SimPy DMA timing is simulated separately afterward.
 
-이는 timing과 visibility를 의도적으로 분리한 것이다:
-- **visibility**: MemoryStore에 반영되는 시점 = `store.write()` 호출 시
-- **timing**: SimPy에서 DMA latency가 완료되는 시점
+This is an intentional separation of timing and visibility:
+- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
+- **timing**: the point at which DMA latency completes in SimPy
 
-이 분리로 dynamic control flow에서 store 직후 load가 최신 데이터를 볼 수 있다.
+This separation allows a load immediately after a store to see the latest data in dynamic control flow.
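
The visibility/timing split above can be sketched as follows. This is a minimal illustration, not the project's implementation: `ToyStore`, `issue_store`, and the latency value are hypothetical names and numbers, and a plain float stands in for the SimPy clock.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Hypothetical stand-in for MemoryStore: addr -> payload."""
    data: dict = field(default_factory=dict)

    def write(self, addr, payload):
        self.data[addr] = payload

    def read(self, addr):
        return self.data[addr]


def issue_store(store, now_ns, addr, payload, dma_latency_ns):
    """Apply the write at issue time; return when the DMA timing completes."""
    store.write(addr, payload)          # visibility = issue time
    return now_ns + dma_latency_ns      # timing = modeled separately


store = ToyStore()
done_at = issue_store(store, now_ns=100.0, addr=0x1000,
                      payload=[1, 2, 3], dma_latency_ns=40.0)
visible_now = store.read(0x1000)        # readable before done_at is reached
```

The load sees the stored payload at simulated time 100 ns even though the modeled DMA only completes at `done_at`.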
 
 #### Result Handle Semantics
 
-`tl.composite()`(sync/async)는 결과 tensor를 참조하는 **handle**을 반환한다.
+`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.
 
-Phase 1에서의 핵심 계약:
+The key contract in Phase 1:
 
-1. **모든 compute handle은 Phase 1에서 항상 pending 상태로 간주한다.**
-2. `tl.wait(handle)`은 **timing synchronization만 표현**하며,
-   handle을 ready로 만들지 않는다.
-3. handle의 실제 결과 데이터 접근(`handle.data`, element access,
-   numpy conversion 등)은 **Phase 2에서만 가능**하다.
-4. 따라서 Phase 1에서 **compute-result 기반 control flow는 지원하지 않는다.**
-5. 반면 `tl.load()`는 Phase 1에서 실제 데이터를 반환하므로,
-   **memory-read 기반 control flow는 지원 가능**하다.
+1. **All compute handles are always considered pending in Phase 1.**
+2. `tl.wait(handle)` **expresses timing synchronization only**
+   and does not make the handle ready.
+3. Accessing the handle's actual result data (`handle.data`, element access,
+   numpy conversion, etc.) is **only possible in Phase 2**.
+4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
+5. In contrast, `tl.load()` returns actual data in Phase 1, so
+   **memory-read-based control flow is supported**.
 
-| handle 상태 | Phase | 허용 동작 |
+| Handle state | Phase | Allowed operations |
 |------------|-------|----------|
-| pending | Phase 1 | `tl.wait(handle)` — timing 동기화만 |
-| pending | Phase 1 | handle을 `tl.store()`의 대상으로 전달 (logical destination 연결만, payload는 Phase 2) |
-| pending | Phase 1 | **데이터 접근 불가** — 값 기반 분기 불가 |
-| ready | Phase 2 | 실제 numpy 데이터 접근, 검증 |
+| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
+| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
+| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
+| ready | Phase 2 | Actual numpy data access, verification |
 
-이 제약은 의도적이다. Phase 1에서 연산을 실행하면 SimPy single-thread가
-block되어 2-pass 분리의 존재 이유가 사라진다.
+This restriction is intentional. If computations were executed in Phase 1,
+the SimPy single-thread would block, defeating the purpose of 2-pass separation.
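
One way to enforce the Phase 1 contract is a handle type that refuses every data access. The sketch below is an assumption about how this could look; `PendingHandle`, `Phase1AccessError`, and the free function `wait` are hypothetical names, not the project's API.

```python
class Phase1AccessError(RuntimeError):
    """Raised when a pending compute handle is read during Phase 1."""


class PendingHandle:
    """Hypothetical Phase 1 handle: carries identity, never data."""

    def __init__(self, op_index):
        self.op_index = op_index

    def _deny(self, what):
        raise Phase1AccessError(
            f"{what} is only available in Phase 2 (op {self.op_index})")

    @property
    def data(self):
        self._deny("handle.data")

    def __getitem__(self, key):
        self._deny("element access")

    def __bool__(self):
        self._deny("truth-value evaluation")

    def __array__(self, *args, **kwargs):
        self._deny("numpy conversion")


def wait(handle):
    """Timing synchronization only: the handle stays pending."""
    return handle
```

With this shape, a kernel that writes `if handle:` or `handle[0]` fails immediately in Phase 1 instead of silently branching on undefined data.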
 
 #### Phase 1 Materialization — Future Extension
 
-향후 소형 연산(scalar, 작은 reduction)에 대해 Phase 1 eager execution이
-필요한 경우, `materialized_in_phase1: bool` 플래그를 op record에 추가하여
-선택적 materialization을 지원할 수 있다. 현재 범위에서는 구현하지 않는다.
+If Phase 1 eager execution becomes necessary for small operations
+(scalar, small reduction) in the future, selective materialization can be supported
+by adding a `materialized_in_phase1: bool` flag to the op record.
+This is not implemented in the current scope.
 
-### D4. data_op 플래그 — 메시지 자기 선언
+### D4. data_op Flag — Message Self-Declaration
 
-로깅 대상은 메시지 타입이 아니라 메시지 인스턴스의 `data_op` 속성으로 결정한다.
-프레임워크가 메시지 타입을 하드코딩하지 않는다.
+The logging target is determined by the `data_op` attribute on the message instance,
+not by message type. The framework does not hardcode message types.
 
 ```python
 class MsgBase:
-    data_op: bool = False       # 기본: 로깅 안 함
+    data_op: bool = False       # default: no logging
 
 class DmaReadCmd(MsgBase):
-    data_op = True              # 메모리 이동 → 로깅
+    data_op = True              # memory transfer → logging
 
 class GemmCmd(MsgBase):
-    data_op = True              # 연산 → 로깅
+    data_op = True              # compute → logging
 
 class MathCmd(MsgBase):
-    data_op = True              # 연산 → 로깅
+    data_op = True              # compute → logging
 ```
 
-새 메시지 타입(예: IpcqMsg) 추가 시 `data_op = True`만 설정하면
-프레임워크 코드 수정 없이 자동 로깅된다.
+When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
+enables automatic logging without modifying framework code.
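
The hook-side check implied above might look like this. It is a sketch under assumptions: `maybe_log` and `CreditMsg` are hypothetical, and the real hook records an `OpRecord` rather than a class name.

```python
class MsgBase:
    data_op: bool = False


class GemmCmd(MsgBase):
    data_op = True


class IpcqMsg(MsgBase):       # new message type: opt in with one attribute
    data_op = True


class CreditMsg(MsgBase):     # hypothetical control message: inherits False
    pass


def maybe_log(op_log, msg):
    """Hook-side filter: consult the instance flag, never the concrete type."""
    if getattr(msg, "data_op", False):
        op_log.append(type(msg).__name__)


op_log = []
for msg in (GemmCmd(), CreditMsg(), IpcqMsg()):
    maybe_log(op_log, msg)
```

Because the filter never names a message class, adding `IpcqMsg` required no change to `maybe_log`.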
 
-### D5. Op Log 구조
+### D5. Op Log Structure
 
-#### op 분류 체계
+#### Op Classification Scheme
 
-2단계로 분류한다:
+A two-level classification is used:
 
-| 레벨 | 필드 | 역할 |
-|------|------|------|
-| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch 기준 |
-| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` 등 | 구체 연산 식별 |
+| Level | Field | Role |
+|-------|-------|------|
+| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
+| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |
 
-#### OpRecord 정의
+#### OpRecord Definition
 
 ```python
 @dataclass
 class OpRecord:
-    t_start: float              # SimPy 시각 (ns) — service 시작
-    t_end: float                # SimPy 시각 (ns) — service 완료
+    t_start: float              # SimPy time (ns) — service start
+    t_end: float                # SimPy time (ns) — service completion
     component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
     op_kind: str                # "memory" | "gemm" | "math"
-    op_name: str                # 구체 연산명
-    params: dict                # 연산별 파라미터 (아래 참조)
-    dependency_ids: list[int]   # 현재는 in-memory record index 기반, 향후 stable op_id로 대체 가능
+    op_name: str                # specific operation name
+    params: dict                # per-operation parameters (see below)
+    dependency_ids: list[int]   # currently based on in-memory record index, may be replaced with stable op_id in the future
 ```
 
-#### dependency_ids 생성 규칙
+#### dependency_ids Generation Rules
 
-`dependency_ids`는 **optional**이며, 기본적으로 executor는
-주소 기반 dependency 추론을 수행한다 (D6 참조).
+`dependency_ids` is **optional**, and by default the executor performs
+address-based dependency inference (see D6).
 
-정확한 실행 순서가 필요한 경우에만 명시적으로 설정한다:
-- **기본 (address-based inference)**: executor가 read/write set을 분석하여
-  RAW/WAW/WAR 의존성을 자동 추론. 대부분의 경우 이것으로 충분.
-- **명시적 설정**: TLContext 또는 command 생성 단계에서 logical dependency가
-  주소로 표현되지 않는 경우에 설정.
-  예: completion handle 기반 동기화 — handle dependency는 메모리 주소가 아니라
-  논리적 완료 순서에 의존하므로 address inference로 잡히지 않는다.
+Set it explicitly only when exact execution ordering is required:
+- **Default (address-based inference)**: the executor analyzes read/write sets to
+  automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
+- **Explicit setting**: set at the TLContext or command-generation stage when a
+  logical dependency cannot be expressed through addresses.
+  Example: completion-handle-based synchronization. Handle dependencies depend on
+  logical completion order rather than memory addresses, so address inference
+  cannot capture them.
 
-#### op_log ordering
+#### op_log Ordering
 
-op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
-동일 `t_start`의 record들은 insertion order를 보존한다.
+The op_log maintains **stable ordering** based on `t_start`.
+Records with the same `t_start` preserve insertion order.
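
A stable ordering of this kind can be kept with a right-biased binary search. The sketch below is illustrative only: `OrderedOpLog` is a hypothetical container, and plain strings stand in for `OpRecord` instances.

```python
import bisect


class OrderedOpLog:
    """Keeps records sorted by t_start; equal t_start keeps insertion order."""

    def __init__(self):
        self._keys = []      # t_start values, parallel to _records
        self._records = []

    def add(self, t_start, record):
        # bisect_right places the new record after existing equal keys.
        i = bisect.bisect_right(self._keys, t_start)
        self._keys.insert(i, t_start)
        self._records.insert(i, record)

    def records(self):
        return list(self._records)


log = OrderedOpLog()
for t, name in [(20.0, "gemm_a"), (10.0, "dma_read"),
                (20.0, "gemm_b"), (10.0, "dma_write")]:
    log.add(t, name)
ordered = log.records()
```

Note that inserting in the middle shifts list positions, so index-based `dependency_ids` would need to be assigned after the log is final or replaced by stable identifiers.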
 
-#### params 상세
+#### params Details
 
 **memory (dma_read / dma_write)**:
 ```python
 {
-    "src_addr": int,            # source 주소 (byte)
-    "dst_addr": int,            # destination 주소 (byte)
-    "nbytes": int,              # 전송 크기
+    "src_addr": int,            # source address (byte)
+    "dst_addr": int,            # destination address (byte)
+    "nbytes": int,              # transfer size
     "src_space": str,           # "hbm" | "tcm" | "sram"
     "dst_space": str,           # "hbm" | "tcm" | "sram"
 }
@@ -298,9 +300,9 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
 **gemm**:
 ```python
 {
-    "src_a_addr": int,          # operand A 주소
-    "src_b_addr": int,          # operand B 주소
-    "dst_addr": int,            # output 주소
+    "src_a_addr": int,          # operand A address
+    "src_b_addr": int,          # operand B address
+    "dst_addr": int,            # output address
     "shape_a": tuple,           # e.g. (128, 256)
     "shape_b": tuple,           # e.g. (256, 128)
     "shape_out": tuple,         # e.g. (128, 128)
@@ -312,7 +314,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
     "layout_a": str,            # "row_major" | "col_major"
     "layout_b": str,
     "layout_out": str,
-    "addr_space": str,          # "tcm" (GEMM operand는 항상 TCM)
+    "addr_space": str,          # "tcm" (GEMM operands are always in TCM)
 }
 ```
 
@@ -320,7 +322,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
 ```python
 {
     "op": str,                  # "exp" | "add" | "sum" | "where" | ...
-    "input_addrs": list[int],   # operand 주소 목록
+    "input_addrs": list[int],   # list of operand addresses
     "input_shapes": list[tuple],
     "dst_addr": int,
     "shape_out": tuple,
@@ -332,12 +334,12 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
 
 ### D6. Phase 2 Executor
 
-Phase 2는 SimPy 밖에서 op_log를 실행한다.
+Phase 2 executes the op_log outside of SimPy.
 
 ```python
 class DataExecutor:
     def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
-        self.store = initial_store  # Phase 1의 MemoryStore snapshot을 입력으로 받는다
+        self.store = initial_store  # Takes the Phase 1 MemoryStore snapshot as input
 
     def run(self):
         for t, ops in groupby(op_log, key=lambda o: o.t_start):
@@ -347,30 +349,30 @@ class DataExecutor:
             self._execute_sequential(sequential)
 ```
 
-**병렬 실행 판정**:
+**Parallel execution determination**:
 
-같은 `t_start`의 op들은 **병렬 후보**로 간주한다.
-실제 병렬 실행 여부는 executor가 다음 기준으로 판정한다:
-- read/write 주소 범위 겹침 여부 (WAW, RAW, WAR 충돌 검사)
-- `dependency_ids`에 명시된 선행 op 완료 여부
+Ops with the same `t_start` are considered **parallel candidates**.
+The executor determines actual parallel execution based on the following criteria:
+- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
+- Whether predecessor ops specified in `dependency_ids` have completed
 
-주소 범위가 겹치지 않고 명시적 의존성이 없는 op들만 병렬 실행한다.
+Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
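
The conflict check described above reduces to interval overlap on read and write sets. A minimal sketch, assuming each op is summarized as half-open byte ranges; the dict layout and the function names are hypothetical.

```python
def overlaps(a, b):
    """Half-open byte ranges (start, end) overlap when each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def hazards(first, second):
    """Return hazard kinds between two ops given as {'reads': [...], 'writes': [...]}."""
    found = set()
    if any(overlaps(w, r) for w in first["writes"] for r in second["reads"]):
        found.add("RAW")
    if any(overlaps(w1, w2) for w1 in first["writes"] for w2 in second["writes"]):
        found.add("WAW")
    if any(overlaps(r, w) for r in first["reads"] for w in second["writes"]):
        found.add("WAR")
    return found


def can_run_in_parallel(first, second, explicit_dep=False):
    """Parallel only with no address hazard and no explicit dependency."""
    return not explicit_dep and not hazards(first, second)


gemm0 = {"reads": [(0, 256), (256, 512)], "writes": [(1024, 1280)]}
gemm1 = {"reads": [(512, 768), (768, 1024)], "writes": [(1280, 1536)]}
exp0 = {"reads": [(1024, 1280)], "writes": [(2048, 2304)]}
```

Here `gemm0` and `gemm1` touch disjoint ranges and may run together, while `exp0` reads what `gemm0` writes and must follow it.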
 
-**배치 최적화**: 동일 op_name이며 **shape, dtype, layout, transpose flag가
-모두 동일한** 독립 op들만 batching 대상이 된다.
-예: 여러 PE의 동일 shape GEMM → `np.matmul(a_batch, b_batch)` 한 번으로 묶음.
-CPU에서도 BLAS 효율 향상, GPU에서는 launch overhead 절감.
+**Batch optimization**: Only independent ops with the same op_name **and identical
+shape, dtype, layout, and transpose flags** are eligible for batching.
+Example: identical-shape GEMMs from multiple PEs are bundled into a single `np.matmul(a_batch, b_batch)` call.
+This improves BLAS efficiency even on CPU and reduces launch overhead on GPU.
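
The batching idea can be sketched with NumPy's stacked-matrix `matmul`. This assumes the ops have already been proven independent and share shape and dtype; `batched_gemm` is a hypothetical helper, not the executor's actual code.

```python
import numpy as np


def batched_gemm(pairs):
    """Run same-shape, same-dtype GEMMs as one np.matmul over a leading batch axis."""
    a_batch = np.stack([a for a, _ in pairs])   # (n, M, K)
    b_batch = np.stack([b for _, b in pairs])   # (n, K, N)
    return np.matmul(a_batch, b_batch)          # (n, M, N)


rng = np.random.default_rng(0)
pairs = [(rng.standard_normal((4, 8), dtype=np.float32),
          rng.standard_normal((8, 3), dtype=np.float32)) for _ in range(5)]
out = batched_gemm(pairs)
```

Each slice `out[i]` corresponds to the GEMM of `pairs[i]`, so results can be written back to their individual `dst_addr` entries afterward.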
 
-**Phase 2 실행 순서 보장**:
+**Phase 2 execution order guarantee**:
 
-Phase 2는 데이터 도착 시점을 고려하지 않으며,
-dependency (주소 기반 추론 + 명시적 dependency_ids)를 통해서만
-실행 순서를 보장한다.
+Phase 2 does not consider data arrival timing,
+and guarantees execution order solely through
+dependencies (address-based inference + explicit dependency_ids).
 
 ### D7. Memory Store
 
-`MemoryStore`는 논리적으로 byte-addressable semantics를 따르며,
-현재 구현은 **tensor-granular storage** (addr → numpy ndarray 매핑)를 사용한다.
+`MemoryStore` logically follows byte-addressable semantics,
+and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).
 
 ```python
 class MemoryStore:
@@ -378,139 +380,140 @@ class MemoryStore:
     def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
 ```
 
-**내부 저장 포맷: numpy ndarray**
+**Internal storage format: numpy ndarray**
 
-MemoryStore는 텐서를 **numpy ndarray**로 저장한다.
+MemoryStore stores tensors as **numpy ndarrays**.
 
-| 후보 | store/load 속도 | Phase 2 연산 | 판정 |
-|------|----------------|-------------|------|
-| **numpy ndarray** | 즉시 (참조 전달, 복사 없음) | `np.matmul` 바로 사용 | **채택** |
-| bytearray | memcpy 필요 | `np.frombuffer` 변환 필요 | 탈락 |
-| torch tensor | 즉시 | torch 연산 가능 | GPU 최적화 시만 사용 |
+| Candidate | store/load speed | Phase 2 compute | Verdict |
+|-----------|-----------------|-----------------|---------|
+| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
+| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
+| torch tensor | Immediate | torch operations available | Use only for GPU optimization |
 
-- write: numpy array를 **참조 저장** (복사 없음) → Phase 1 오버헤드 = dict lookup 1회
-- read: numpy array를 **참조 반환** (복사 없음)
-- 동일 addr에 재 write 시 기존 array를 **tensor 단위로 덮어쓴다** (partial overwrite 미지원)
-- dtype은 numpy native 사용 (`np.float16`, `np.float32`, `np.bfloat16` 등)
-- byte-level access가 필요한 경우 `.view(np.uint8)` 로 변환
-- Phase 2에서 GPU batch 최적화 시 numpy → torch tensor 변환은 executor가 담당
+- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
+- read: **returns numpy array by reference** (no copy)
+- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
+- dtype uses numpy native types (`np.float16`, `np.float32`, etc.); bfloat16 is not a built-in NumPy dtype and needs an extension dtype such as `ml_dtypes.bfloat16`
+- For byte-level access, convert via `.view(np.uint8)`
+- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
 
 **read/write contract**:
 
-- read/write는 **contiguous tensor** 기준이다.
-  non-contiguous stride view가 필요한 경우 별도 copy op으로 표현한다.
-- 일반 benchmark path에서는 producer/consumer dtype 일치를 기대한다.
-  reinterpret cast는 low-level memory validation 또는 특수 테스트 케이스를 위한
-  permissive behavior이다.
-- addr은 byte-aligned이며, 최소 alignment = dtype 크기.
-- dtype mismatch (write와 다른 dtype으로 read)는 reinterpret cast로 처리한다.
-  shape 불일치 시 nbytes 기준으로 검증하고, 불일치하면 error.
-- 정합성 기준은 주소 범위 기반 read/write semantics를 따른다.
-- 구현 최적화로 tensor object cache를 둘 수 있지만,
-  canonical state는 byte-addressable storage이다.
-- deploy 시점에 호스트가 초기 텐서 데이터를 주입한다.
+- read/write assume **contiguous tensors**.
+  If non-contiguous stride views are needed, express them as separate copy ops.
+- In the normal benchmark path, producer/consumer dtype match is expected.
+  Reinterpret cast is a permissive behavior for low-level memory validation
+  or special test cases.
+- addr is byte-aligned, with minimum alignment = dtype size.
+- A dtype mismatch (reading with a different dtype than was written) is handled as a reinterpret cast.
+  If the shapes differ, the access is validated by nbytes, and an nbytes mismatch raises an error.
+- Correctness criteria follow address-range-based read/write semantics.
+- A tensor object cache may be used as an implementation optimization,
+  but the canonical state is byte-addressable storage.
+- At deploy time, the host injects initial tensor data.
 
-### D8. 벤치마크 커널 코드
+### D8. Benchmark Kernel Code
 
-벤치마크의 **사용자 코드 API는 변경하지 않는다**.
-`tl.load()`, `tl.composite()`, `tl.store()` 등의 호출 인터페이스는 유지.
+The benchmark's **user code API is not changed**.
+The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.
 
-단, 내부 command/message schema는 Phase 2 실행에 필요한 metadata를
-포함하도록 확장될 수 있다 (예: dtype_acc, transpose 등 추가 필드).
+However, internal command/message schemas may be extended to include metadata
+required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
 
-### D9. 컴포넌트 변경 없음
+### D9. No Component Changes
 
-개별 컴포넌트 구현(PE_GEMM, PE_DMA, HBM_CTRL 등)은 수정하지 않는다.
-op_log 기록은 ComponentBase hook의 책임이다.
-커스텀 컴포넌트 교체 시 타이밍 모델만 교체되며,
-Phase 2 데이터 실행은 영향받지 않는다.
+Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
+Op log recording is the responsibility of the ComponentBase hook.
+When custom components are replaced, only the timing model changes,
+and Phase 2 data execution is unaffected.
 
-### D10. Phase 2는 Optional
+### D10. Phase 2 is Optional
 
 ```python
 engine = GraphEngine(graph)
-engine.run(benchmark)                       # Phase 1: 타이밍만
+engine.run(benchmark)                       # Phase 1: timing only
 result = engine.get_timing_result()
 
 if verify_data:
-    executor = DataExecutor(engine.op_log)  # Phase 2: 데이터
+    executor = DataExecutor(engine.op_log)  # Phase 2: data
     executor.run()
     executor.verify(expected_output)
 ```
 
-타이밍 분석만 필요하면 Phase 2를 건너뛴다.
-op_logger를 비활성화하면 Phase 1 성능도 기존과 동일.
+If only timing analysis is needed, Phase 2 is skipped.
+If the op_logger is disabled, Phase 1 performance is unchanged from before this ADR.
 
 ### D11. Verification Contract
 
-기본 검증은 **최종 output tensor**를 reference backend(numpy)와 비교한다.
+Basic verification **compares the final output tensor** against a reference backend (numpy).
 
-dtype별 tolerance 정책:
+Per-dtype tolerance policy:
 
-| dtype | 비교 방식 | tolerance |
+| dtype | Comparison method | Tolerance |
 |-------|----------|-----------|
 | f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
 | f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
 | bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
-| int 계열 | `np.array_equal` | exact |
+| int types | `np.array_equal` | exact |
 
-- 기본 모드: 최종 output만 비교 (end-to-end correctness)
-- 디버그 모드: intermediate tensor도 op 단위로 비교 가능
+- Default mode: compare final output only (end-to-end correctness)
+- Debug mode: can compare intermediate tensors on a per-op basis
   (MemoryStore snapshot at each op boundary)
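
The tolerance table could be applied with a small lookup, as sketched below. The dtype-name keys and the `outputs_match` helper are assumptions made for illustration; only the tolerance values come from the table.

```python
import numpy as np

# Tolerances copied from the table above, keyed by dtype name: (rtol, atol).
TOLERANCE = {
    "f32": (1e-5, 1e-5),
    "f16": (1e-3, 1e-3),
    "bf16": (1e-2, 1e-2),
}


def outputs_match(actual, expected, dtype_name):
    """Compare tensors: np.allclose for listed float types, exact otherwise."""
    if dtype_name in TOLERANCE:
        rtol, atol = TOLERANCE[dtype_name]
        return bool(np.allclose(actual, expected, rtol=rtol, atol=atol))
    # Integer types (and anything without a listed tolerance) must match exactly.
    return bool(np.array_equal(actual, expected))
```

Treating unlisted dtypes as exact is a conservative choice for this sketch; a real verifier might prefer to reject unknown dtype names outright.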
 
 ---
 
 ## Non-goals
 
-- **Compute-result-based control flow**: 지원하지 않는다.
-  모든 compute handle은 Phase 1에서 pending 상태이며,
-  `wait()`는 timing synchronization만 표현하고 data readiness를 의미하지 않는다.
-  Phase 1에서 `handle.data` 접근, element access, truth-value evaluation은
-  **error로 처리**한다.
-  메모리 데이터 기반 분기(`tl.load()` 결과)는 greenlet으로 지원된다.
-  Phase 1 materialization은 future extension (D3 참조).
-- **Cycle-accurate overlap reconstruction**: Phase 2에서 Phase 1의 실행 시간
-  overlap을 정확히 재현하지 않는다. Phase 2는 데이터 정합성만 검증한다.
-- **GPU kernel compilation**: Phase 2의 GEMM/Math는 numpy/torch 호출이며,
-  실제 하드웨어 PE의 마이크로아키텍처를 재현하지 않는다.
+- **Compute-result-based control flow**: not supported.
+  All compute handles are in the pending state during Phase 1, and
+  `wait()` expresses timing synchronization only; it does not imply data readiness.
+  Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
+  is **treated as an error**.
+  Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
+  Phase 1 materialization is a future extension (see D3).
+- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
+  the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
+- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
+  and do not reproduce the actual hardware PE microarchitecture.
 
 ## Open Questions
 
-- **Aliasing / slice view**: 동일 backing storage를 참조하는 slice/view를
-  MemoryStore에서 어떻게 표현할지 (stride-based view vs copy semantics)
-- **IPCQ/descriptor read 일반화**: PE-to-PE 통신을 memory op으로 완전히
-  일반화할지, 별도 op_kind를 둘지
-- **Op log streaming**: 대규모 시뮬레이션에서 op_log 메모리 사용량 관리
+- **Aliasing / slice view**: How to represent slice/views referencing the same
+  backing storage in MemoryStore (stride-based view vs copy semantics)
+- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
+  communication as memory ops or introduce a separate op_kind
+- **Op log streaming**: Managing op_log memory usage in large-scale simulations
   (in-memory list vs disk-backed streaming)
-- **Fused operation**: tl.composite의 tiled pipeline (READ→COMPUTE→WRITE)을
-  하나의 fused op record로 기록할지, 개별 op으로 분리할지
-- **Math op schema 일반화**: 현재 math params는 단순 구조이나,
-  broadcasting rule, input별 dtype, keepdims, scalar/immediate operand,
-  where/mask 표현 등 일반화가 필요할 수 있음
-- **Op record 식별자**: 현재 dependency_ids는 in-memory list index 기반이며,
-  streaming/disk-backed mode 도입 시 stable op_id로 대체 필요
-- **Phase 1 materialization policy**: D3의 Future Extension 참조.
-  허용 시 해당 op의 Phase 2 처리 방식 (skip / verify / recompute) 정의 필요
+- **Fused operation**: Whether to record tl.composite's tiled pipeline
+  (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
+- **Math op schema generalization**: The current math params have a simple structure,
+  but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
+  scalar/immediate operands, where/mask expressions, etc.
+- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
+  replacement with stable op_id is needed when introducing streaming/disk-backed mode
+- **Phase 1 materialization policy**: See Future Extension in D3.
+  If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
+  needs to be defined
 
 ---
 
 ## Consequences
 
-### 긍정적
+### Positive
 
-- SimPy 시뮬레이션 성능 영향 최소 (op_log append만 추가)
-- Phase 2에서 멀티스레드/GPU 자유롭게 사용 가능
-- 컴포넌트 교체 자유도 유지 (ADR-0015 설계 철학 보존)
-- 벤치마크 사용자 코드 API 변경 불필요
-- 새 메시지 타입 추가 시 data_op 플래그만 설정
-- greenlet으로 Phase 0 제거 — 메모리 데이터 기반 dynamic control flow 지원
-- `tl.load()`가 실제 데이터를 반환하므로 커널 디버깅 용이
+- Minimal impact on SimPy simulation performance (only op_log append added)
+- Free to use multi-threading/GPU in Phase 2
+- Component replaceability preserved (ADR-0015 design philosophy maintained)
+- No changes needed to benchmark user code API
+- Adding a new message type only requires setting the data_op flag
+- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
+- `tl.load()` returns actual data, making kernel debugging easier
 
-### 부정적
+### Negative
 
-- op_log 메모리 사용량 (대규모 시뮬레이션 시)
-- Phase 2 실행 시간은 텐서 크기에 비례 (대형 GEMM)
-- pending handle (연산 미완료) 기반 동적 분기 불가
-  (연산은 Phase 2에서 실행, Phase 1에서 결과 값 미확정).
-  메모리 데이터 기반 분기는 greenlet으로 지원된다.
-- greenlet C 확장 의존성 추가 (pip install greenlet)
+- op_log memory usage (for large-scale simulations)
+- Phase 2 execution time is proportional to tensor size (large GEMM)
+- Dynamic branching based on pending handles (incomplete computations) not possible
+  (computations execute in Phase 2, result values are undetermined in Phase 1).
+  Memory-data-based branching is supported via greenlet.
+- greenlet C extension dependency added (pip install greenlet)
diff --git a/docs/adr/ADR-0023-dev-ipcq-pe-collective.en.md b/docs/adr/ADR-0023-dev-ipcq-pe-collective.en.md
deleted file mode 100644
index e6b6334..0000000
--- a/docs/adr/ADR-0023-dev-ipcq-pe-collective.en.md
+++ /dev/null
@@ -1,882 +0,0 @@
-# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication
-
-## Status
-
-Accepted
-
-## Context
-
-### Goal
-
-Add the infrastructure that lets CCL (Collective Communication Library)
-kernels run **inside** a PE. The host just launches a kernel on each
-SIP; the actual synchronization and data movement happen **inside the
-PE kernel via an IPCQ (Inter-Process Communication Queue)**.
-
-This mirrors how NCCL performs NVLink communication inside a GPU
-kernel, or how Cerebras / Tenstorrent expose core-local communication
-queues. Host-level collectives (`dist.all_reduce`) are deferred to
-**future work**; this ADR focuses solely on the kernel-side collective
-infrastructure.
-
-### Problems to solve
-
-1. PE-to-PE direct data movement (writing into a peer's memory).
-2. Synchronization — the sender must check that the receiver has space
-   in its buffer (backpressure).
-3. Resource contention between compute traffic and communication
-   traffic (Head-of-Line blocking).
-4. The host must be able to construct logical neighbor topologies
-   (ring / mesh / tree) per algorithm.
-
----
-
-## Decision
-
-### D1. Add a new `PE_IPCQ` component
-
-A new component `PE_IPCQ` is added inside each PE. It follows the same
-pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a
-distinct component.
-
-```
-PE
-├── PE_CPU
-├── PE_SCHEDULER
-├── PE_DMA
-├── PE_IPCQ          ← new
-├── PE_FETCH_STORE
-├── PE_GEMM
-├── PE_MATH
-├── PE_TCM
-├── PE_MMU
-```
-
-**Role separation** (control plane vs. data plane):
-
-- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head /
-  tail pointer management, peer pointer caches, backpressure, 4-direction
-  neighbor mapping.
-- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe
-  / PCIE into the peer's memory.
-
-PE_IPCQ does **not** move data itself — it delegates to PE_DMA.
-
-### D2. Ring buffer model
-
-Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.
-
-```python
-@dataclass
-class IpcqQueuePair:
-    direction: Direction          # N/S/E/W
-    peer: IpcqEndpoint            # set by host at init time (D2.5)
-    tx_buffer_base: int           # outgoing data base addr (in our memory)
-    rx_buffer_base: int           # incoming data base addr (in our memory)
-    slot_size: int                # 1 tile per slot
-    n_slots: int                  # ring depth
-    my_head: int                  # next slot we will write/send into
-    my_tail: int                  # next slot we will read/recv from
-    peer_head_cache: int          # peer's last-seen head (updated via D9 piggyback)
-    peer_tail_cache: int          # peer's last-seen tail (updated via D9 fast-path credit)
-```
-
-**Canonical field names**: throughout this ADR the four names above
-(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used
-consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`,
-etc.) are not used.
-
-| Field | Owner | Updated when |
-|-------|-------|--------------|
-| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
-| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
-| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
-| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |
-
-**Slot unit**: fixed-size, one slot holds one full tile (no descriptor
-indirection). Full data embedded in the slot. See D5.
-
-### D2.5. `IpcqEndpoint` schema
-
-`IpcqQueuePair.peer` carries everything the sender needs to compute the
-peer's rx slot address:
-
-```python
-@dataclass(frozen=True)
-class IpcqEndpoint:
-    sip: int
-    cube: int
-    pe: int
-    buffer_kind: str             # "tcm" | "hbm" | "sram"
-    rx_base_pa: int              # peer rx_buffer base PA (PhysAddr.encode())
-    rx_base_va: int              # peer rx_buffer base VA (optional, MMU mode)
-    n_slots: int                 # peer ring depth (for wrap-around)
-    slot_size: int               # peer slot size (for offset)
-```
-
-Address computation:
-
-```python
-slot_idx = self.my_head % peer.n_slots
-dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
-```
-
-PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA
-(vc_comm) routes the data to `dst_pa` through the fabric.
-
-**Endpoint construction order**: at backend init (D10), the IPCQ
-buffers for **every PE** are allocated first (so each rank knows the
-others' PA), then the per-rank neighbor tables are built and pushed to
-PE_IPCQ via `IpcqInitMsg`.
-
-### D3. Four-direction mapping ≡ logical ProcessGroup
-
-The PE views four directions (N/S/E/W) as logical ports. Real peer
-addresses are configured by the host CCL init, per the chosen
-algorithm. The PE kernel never knows the topology, only directions.
-
-```python
-# 1D ring
-for rank in range(world_size):
-    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
-    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])
-
-# 2D mesh
-for r in range(R):
-    for c in range(C):
-        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
-        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
-        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
-        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
-```
-
-The PE code does not need to know where `tl.send(dir="E", ...)` actually
-ends up.
-
-### D4. PE kernel API
-
-```python
-# Send (blocking; may stall on backpressure)
-tl.send(dir: str, src=TensorHandle)
-tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
-
-# Recv (blocking)
-recv = tl.recv(dir: str, shape=..., dtype=...)
-recv = tl.recv(shape=..., dtype=...)        # round-robin across 4 directions
-
-# Recv (non-blocking)
-fut  = tl.recv_async(dir: str, shape=..., dtype=...)
-recv = tl.wait(fut)
-```
-
-`tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each
-call rotates through directions, returning the first available slot.
-Empty in all 4 directions → wait.
-
-**Fairness is weak**: the rotating start mitigates simple bias, but if
-one direction always wins the race the others can starve. Algorithms
-that need strict fairness must call `tl.recv(dir=...)` explicitly.
-
-### D5. Single-hop DMA write + full-data slot model
-
-Data moves from sender memory into the receiver's ring slot in **one
-DMA transfer**. Key properties:
-
-- **Single-hop**: the sender already knows the peer rx slot address and
-  fires one fabric DMA into it.
-- **No CPU memcpy**: the CPU never copies data.
-- **No intermediate staging**: neither side keeps a separate staging
-  buffer (sender uses the source addr directly; receiver gets the data
-  in its ring slot directly).
-
-(Strictly speaking the fabric DMA write does happen, so this is not
-literally "no data movement" — it's the same property NCCL labels
-"zero-copy", meaning no CPU memcpy and no staging copy.)
-
-```
-PE A: tl.send(E, src_addr, nbytes)
-  1. IPCQ computes the peer rx slot address:
-       dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
-  2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
-                   (full → sleep / poll)
-  3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
-  4. my_head += 1
-
-PE B: data = tl.recv(W)
-  1. Look at rx_buffer[my_tail % n_slots]
-  2. Wait for the data to arrive (D7 backpressure mode)
-  3. Return the slot address to the kernel (or fetch into register file)
-  4. my_tail += 1
-  5. Issue a credit-return fast path (D9): after the bottleneck-BW
-     latency the peer A's peer_tail_cache is updated.
-```
-
-The slot holds the full tile. The receiver only reads its own
-rx_buffer; it never reads back into A's memory. The sender knows the
-peer rx slot address and DMAs directly into it (single-hop).
-
-The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local
-to the PE).
-
-### D6. Buffer placement — three-way benchmark
-
-The host CCL init picks the IPCQ ring-buffer location:
-
-```python
-ipcq_init(
-    backend="ahbm",
-    buffer_kind="tcm" | "hbm" | "sram",
-    n_slots=8,
-    slot_size=4096,
-)
-```
-
-| Location | Trait | Trade-off |
-|----------|-------|-----------|
-| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
-| **PE-local HBM** | Large; via DMA | Higher latency |
-| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |
-
-All three locations run the same kernel code; only the init differs.
-
-### D7. Backpressure — two-mode benchmark
-
-How the sender waits when the peer's slots are full, and how the
-receiver waits when data has not yet arrived:
-
-| Mode | Behavior | Model |
-|------|----------|-------|
-| **poll** | Periodically re-check the cached peer pointer | Spin loop |
-| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |
-
-```python
-ipcq_init(backpressure="poll" | "sleep", ...)
-```
-
-Both modes are implemented so latency / throughput trade-offs can be
-benchmarked.
-
-### D8. PE_DMA virtual channels
-
-Extend PE_DMA from a single queue into a **two-channel virtual-channel**
-model.
-
-```
-PE_DMA
-├── vc_compute: tile load / store / writeback for GEMM and Math
-└── vc_comm:    IPCQ send data
-```
-
-Each VC has an independent state machine:
-
-- One channel stalling does not block the other.
-- The same physical link (cube_noc, UCIe, …) is shared, but link BW is
-  split between channels.
-
-**Chunk-level interleave**:
-
-- Large GEMM tile DMAs do not lock the link end-to-end.
-- Progress happens in chunks (e.g. 256 B); each chunk shares link BW
-  with the other VC's pending chunks.
-- Chunk size is an init parameter (smaller = fairer, larger = more
-  efficient).
-
-Net effect:
-
-- HoL blocking is eliminated (an IPCQ send can interleave with a long
-  compute DMA).
-- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM
-  pattern).
-- Matches the NoC-virtual-channel pattern used in real HW.
-
-**First-implementation accuracy limit (intentional)**: this ADR's
-first cut uses **deterministic chunk-level interleave + weighted
-round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`).
-This is a first-order approximation and is simpler than real HW
-dynamic-contention / credit-based arbiters. Functional correctness is
-unaffected, but heavy-contention scenarios may report slightly
-optimistic latency vs. real HW. A separate ADR can add a NoC arbiter
-component later if more precision is needed.
-
-#### Token routing
-
-- Compute tokens (`TileToken`) — go through the existing
-  PE_FETCH_STORE → PE_DMA chain.
-- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA
-  self-routing.
-- PE_DMA picks the channel by token type.
-
-```python
-class PeDmaComponent:
-    def _process(self, env, token):
-        if isinstance(token, IpcqDmaToken):
-            yield from self._vc_comm_process(env, token)
-        else:
-            yield from self._vc_compute_process(env, token)
-```
-
-### D9. Pointer synchronization — DMA payload piggyback
-
-Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so
-pointers update along with the data. This simulation adopts the same
-model: **no separate control channel** — metadata travels with the
-data.
-
-The big benefits:
-
-- **Automatic ordering**: data and metadata move on the same token, so
-  data is visible **before** the `peer_head_cache` update. No race.
-- **HW fidelity**: matches NVLink / UCIe piggybacked headers.
-- **Component simplification**: no separate `IpcqPtrUpdate` event type.
-
-#### Send flow (head update via piggyback)
-
-```
-PE A: tl.send(E, src_addr, nbytes)
-  1. PE_IPCQ checks backpressure (using peer_tail_cache)
-  2. PE_IPCQ creates an IpcqDmaToken:
-       - data body (src_addr → peer dst_addr)
-       - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
-  3. Hand the token to PE_DMA(vc_comm)
-  4. PE A increments my_head (send tracking)
-
-[fabric DMA: latency elapses]
-
-PE B's PE_DMA receives the token
-  5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
-  6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)
-
-PE B's PE_IPCQ receives the metadata
-  7. Updates peer_head_cache (= A's head)
-  8. Wakes any pending recv on that direction
-```
-
-**Steps 5 and 6 must execute in the same SimPy step** — DMA completion
-makes data and metadata atomically visible.
-
-#### Recv flow (credit return — fast path with bottleneck-BW latency)
-
-When the receiver frees a slot, the sender must learn about it
-(backpressure release). Unlike data, the credit return does **not**
-travel through general vc_comm fabric — it uses a **separate fast
-path**, an abstraction of the NVLink / UCIe credit-return wire.
-
-**Latency** is computed from the **full path latency** (per-node
-overhead + edge propagation + drain), not a magic constant:
-
-```
-credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
-path = router.find_path(self_pe, peer_pe.pe_dma)
-latency = compute_path_latency_ns(path, credit_size_bytes)
-        = sum(edge.distance_mm * ns_per_mm)
-        + sum(node_overhead_ns[n] for n in path)
-        + credit_size_bytes / bottleneck_bw_on_path
-```
-
-The router auto-appends `.pe_dma` to the source only, so the
-destination MUST be spelled with the explicit `.pe_dma` suffix or
-`find_path` raises and the credit silently teleports at zero cost
-(latent bug fixed alongside this update).
-
-`tl.recv` blocks on the credit-emit completion (recv yields-from
-`_delayed_credit_send` rather than spawning it as a fork). This puts
-the credit-return cost on the receiver's `pe_exec_ns`, modeling the
-IPCQ control-plane completing the consume-acknowledgement before
-recv returns to the kernel — the protocol equivalent of a non-posted
-`tl.store` waiting for an HBM ack on the raw DMA path.
-
-That gives us:
-
-- **Topology-proportional approximation**: an in-cube credit return is
-  automatically faster than a cross-SIP credit return.
-- **No magic constants**: every nanosecond comes from
-  `compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
-  as data traffic.
-- **No deadlock risk**: unlike piggyback, B can issue credit even when
-  it has no data to send back. `peer_credit_store.put` is unbounded.
-- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
-  cost on recv balances the HBM ack-trip cost raw DMA pays on the sender.
-
-#### Component coupling — SimPy Store channel
-
-PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init
-time, **a SimPy Store is wired between the two** (a per-direction
-fast-path channel) and credit metadata is `put` into that store.
-
-```python
-class PeIpcqComponent:
-    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
-        yield env.timeout(latency_ns)
-        yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...))
-```
-
-Backend init wires both directions of the fast-path channel as part of
-fan-out (see `IpcqInitMsg` in D12).
-
-#### Credit-return fast path limitations
-
-- `credit_size_bytes` is an estimate (typically 16–64 bytes).
-- The fast path is **excluded from vc_comm BW contention** (separate
-  wire). Real HW credit-return wires are very lightweight, so this is a
-  reasonable first approximation.
-- A follow-up ADR can: model the credit fast path as a separate link
-  (BW limit + contention), or switch to piggyback (`credit_return_mode:
-  piggyback`).
-
-#### PE_DMA's added responsibility
-
-When `vc_comm` receives a token, PE_DMA processes it as the following
-sequence: pay the Transaction's terminal BW drain, then atomically
-write data and forward metadata. **No SimPy yield is allowed between
-the data write and the metadata forward** (invariant I6). The drain
-yield must sit before the atomic block, not inside it:
-
-```python
-def _on_vc_comm_recv(self, env, txn):
-    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
-    # sender PE_DMA). MUST happen before the atomic block so recv only
-    # wakes after the bytes have "landed".
-    drain = getattr(txn, "drain_ns", 0.0)
-    if drain > 0:
-        yield env.timeout(drain)
-
-    token = txn.request
-    # ── ATOMIC: no yield between these two operations ──
-    data = self._memory_store.read(token.src_space, token.src_addr,
-                                   shape=..., dtype=...)
-    self._memory_store.write(token.dst_endpoint.buffer_kind,
-                             token.dst_addr, data)
-    # Then forward metadata to the local PE_IPCQ (closes the atomic block)
-    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
-    # ───────────────────────────────────────────────────
-```
-
-The final `put` is yieldable but uses an unbounded internal store, so
-it completes in a single step. That `put` is the closing call of the
-atomic block; nothing may be inserted before it.
-
-#### Drain-at-inbound semantics (D9 timing model)
-
-The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
-stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
-is paid at each forwarding component via `run()`, and the remaining
-BW drain is paid once at the Transaction's terminal. Every non-IPCQ
-Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
-`ComponentBase._forward_txn` at the terminal node. For IPCQ the
-destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
-(so IPCQ-specific data write + metadata forward can happen), so **the
-drain MUST be paid explicitly at the top of that handler** to keep
-IPCQ's timing model on par with every other fabric Transaction.
-
-Side-effects of paying drain here:
-
-- **SRC `tl.send`** is unchanged — fire-and-forget semantics are
-  preserved because the sender PE_DMA does not `yield sub_done`. The
-  `sub_done.succeed()` call (made after metadata forward below) is an
-  event with no listener on the sender side.
-- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
-  when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
-  forward now happens after the drain, recv observes the full fabric
-  transfer time including bandwidth cost.
-
-Matches the physical picture: send dispatches and leaves; recv waits
-until the bytes have actually been drained into its inbox.
-
-### D9.5. ADR-0020 (2-pass) integration
-
-`tl.send` / `tl.recv` integrates with ADR-0020's two-pass model. Phase
-1 simulates timing **and** moves data via MemoryStore; Phase 2 enables
-op-log-based correctness verification.
-
-#### Phase 1 (timing + data)
-
-D9 models head and tail updates with two different mechanisms:
-
-- **Send-side (head update)** — DMA payload piggyback. Data write and
-  metadata forward happen in the same SimPy step → automatic atomic
-  visibility.
-- **Recv-side (tail credit return)** — fast-path SimPy Store channel
-  with bottleneck-BW latency, then `peer_tail_cache` update.
-
-Together they preserve ring-buffer pointer consistency.
-
-The op-log records `op_kind="ipcq"` entries for sends (with
-`src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with
-`recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).
-Two recv modes:
-
-- **`return_slot`** (default): the slot address is returned to the
-  kernel. Zero-copy.
-- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`,
-  PE_IPCQ copies the slot data into the user dst.
-
-#### Phase 2 (op_log replay)
-
-When `DataExecutor` encounters an `op_kind="ipcq"` record:
-
-- **send**: idempotent `src → dst` ndarray write.
-- **recv (`return_slot`)**: no-op (the slot already holds the data).
-- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.
-
-IPCQ ops are pure data movement — Phase 2 has nothing extra to compute.
-The downstream GEMM / Math ops in `DataExecutor` will consume the data
-and naturally validate correctness.
-
-### D10. Host CCL init keeps the PyTorch shape
-
-The host code looks just like real PyTorch DDP. `init_process_group`
-creates the backend object; it does **not** receive IPCQ knobs
-(neighbor topology, buffer_kind, backpressure …).
-
-```python
-# benches/ccl_allreduce.py — same shape as real PyTorch
-def worker(rank, world_size, torch):
-    dist = torch.distributed
-    dist.init_process_group(backend="ahbm")  # reads ccl.yaml + topology
-    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
-    tensor.copy_(torch.from_numpy(init))
-    dist.all_reduce(tensor, op="sum")
-```
-
-The IPCQ configuration is decided by the backend at
-`init_process_group` time: it loads `ccl.yaml`, picks the algorithm,
-and pushes IPCQ neighbor tables to every participating PE_IPCQ. The
-host code never has to know about IPCQ.
-
-A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`.
-Switching algorithms is purely a `ccl.yaml` change — no host edits
-required.
-
-#### Init flow (eager)
-
-1. `init_process_group(backend="ahbm")` is called.
-2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
-3. Pulls topology + buffer_kind + backpressure + slot config from
-   `algorithms[<algo>]`.
-4. **Immediately** installs neighbor tables on every PE_IPCQ
-   (sideband or fabric `IpcqInitMsg`).
-5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally —
-   PE_IPCQ is already prepared whether the kernel is a CCL kernel or
-   not.
-
-### D11. CCL config file (`ccl.yaml`)
-
-IPCQ config and algorithm metadata live in a separate YAML file,
-following the same pattern as `components.yaml` and `topology.yaml`.
-
-A single benchmark execution runs one algorithm
-(`defaults.algorithm`). Switching algorithms means editing
-`defaults.algorithm` only.
-
-```yaml
-defaults:
-  algorithm: ring_allreduce_tcm
-  buffer_kind: tcm                # tcm | hbm | sram
-  backpressure: sleep             # poll | sleep
-  n_slots: 8
-  slot_size: 4096
-  vc_chunk_size: 256
-  ipcq_credit_size_bytes: 16
-
-algorithms:
-  ring_allreduce_tcm:
-    module: kernbench.ccl.algorithms.ring_allreduce
-    topology: ring_1d             # builtin name or "custom"
-    buffer_kind: tcm
-    n_elem: 8                     # optional, per-algorithm tile width
-
-  tree_allreduce_7:
-    module: kernbench.ccl.algorithms.tree_allreduce
-    topology: tree_binary
-    buffer_kind: tcm
-    world_size: 7                 # algorithm-level override
-    n_elem: 16
-
-  custom_mesh:
-    module: kernbench.ccl.algorithms.custom_mesh
-    topology: custom              # the module supplies its own neighbors()
-```
-
-`world_size` is **not set in `defaults`**. The backend resolves it via:
-`algorithm-level override > defaults override > topology spec`. The
-last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP
-where `WORLD_SIZE` comes from env vars rather than config files.
-
-#### Algorithm module structure
-
-Each algorithm module exports two hooks — `kernel` (required) and
-`neighbors` (optional) — plus a `kernel_args` helper that the
-backend uses to populate positional kernel arguments at `all_reduce`
-time:
-
-```python
-# src/kernbench/ccl/algorithms/ring_allreduce.py
-
-def kernel_args(world_size: int, n_elem: int) -> tuple:
-    return (n_elem, world_size)
-
-
-def kernel(t_ptr, n_elem, world_size, tl):
-    """Required — the PE kernel.
-
-    IPCQ is already installed by the backend before this is called.
-    The kernel only uses the four-direction send / recv API.
-    """
-    ...
-
-
-def neighbors(rank, world_size, neighbor_map):
-    """Optional — override the builtin topology's neighbor map.
-
-    Returns a new dict, the modified-in-place dict, or None to keep the
-    builtin map.
-    """
-    return None
-```
-
-#### `neighbors` override patterns
-
-- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
-- **Pattern B — replace entirely**: ignore `neighbor_map` and return a
-  brand-new dict.
-- **Pattern C — keep builtin**: omit `neighbors` or return None.
-
-#### Builtin topologies
-
-| topology | direction set |
-|----------|---------------|
-| `ring_1d` | E, W |
-| `ring_1d_unidir` | E only |
-| `mesh_2d` | N, S, E, W |
-| `tree_binary` | parent, child_left, child_right |
-| `none` | (empty) — algorithm must supply `neighbors()` |
-
-#### Adding a new algorithm
-
-1. Write `kernel` and `kernel_args` in
-   `src/kernbench/ccl/algorithms/<algo>.py`.
-2. Add an entry in `ccl.yaml`'s `algorithms` section.
-3. (Optional) provide `neighbors()` for custom topology.
-4. Set `defaults.algorithm` to the new algorithm.
-
-The host bench (`benches/ccl_allreduce.py`) does not change.
-
-### D12. Message / token schema
-
-The new message types added by this ADR. They live in
-`src/kernbench/common/pe_commands.py` and
-`src/kernbench/runtime_api/kernel.py`.
-
-#### `IpcqInitMsg` (sideband, fan-out at init)
-
-The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors
-`MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`).
-Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`,
-`my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store`
-field: a `simpy.Store` instance pre-wired so the PE_IPCQ that frees a slot
-can push `IpcqCreditMetadata` directly into the data sender's input queue.
-
-#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)
-
-Carries `direction`, source addr/space, nbytes, shape, dtype, and a
-handle id. `data_op=True` so it lands in the op_log.
-
-#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)
-
-Carries `direction` (or None for round-robin), `recv_mode`
-(`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape,
-dtype, blocking flag.
-
-#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)
-
-Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`)
-plus the head metadata (`sender_seq`, `src_sip/cube/pe`,
-`src_direction`). PE_DMA picks the channel by token type
-(`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`).
-
-The receiver's PE_DMA, on token arrival, performs the I6 atomic
-sequence: write data into MemoryStore, then forward `IpcqMetaArrival`
-to the local PE_IPCQ.
-
-#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)
-
-Carries `consumer_seq` (= my_tail), source PE coords, and source
-direction. Travels through the dedicated SimPy Store channel rather than
-`vc_comm`. Latency = `compute_path_latency_ns(path, credit_size_bytes)` (D9).
-
-There is **no `IpcqPtrUpdate` event** — head updates flow via D9
-piggyback, tail updates via the D9 fast-path channel.
-
-### D13. Test strategy
-
-Test plan:
-
-#### T1. Unit tests (component-level)
-
-- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure
-  immediately forwards a token; full peer slot triggers backpressure
-  (poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`;
-  round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`.
-- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute`
-  / `vc_comm` independent progress, chunk interleave, BW split.
-- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d /
-  mesh_2d / tree_binary correctness, mesh_2d non-square →
-  `ValueError`, custom resolver returns the module's `neighbors`.
-
-#### T2. Integration tests (E2E send/recv)
-
-- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional
-  no-deadlock), 4×4 mesh.
-- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode
-  records `ipcq` ops in op_log; DataExecutor produces correct
-  `out.data`.
-
-#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)
-
-`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA
-consistency, per-`buffer_kind` allocation.
-
-#### T4. Regression
-
-All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for
-non-CCL benches.
-
-#### T5. Performance / overhead
-
-Single send/recv pair latency = (DMA latency) + (IPCQ overhead). It
-should be close to a regular PE_DMA write of the same nbytes (target:
-IPCQ overhead < 100 ns).
-
-### D14. Invariants and failure modes
-
-#### Invariants
-
-I1. **Slot lifecycle exactly-once**: one send → exactly one recv.
-I2. **Pointer monotonicity**: `my_head` / `my_tail` are monotonically
-   non-decreasing; `sender_seq` is strictly increasing.
-I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank
-   B, then rank B's reverse-direction peer must be rank A. Verified at
-   init.
-I4. **`buffer_kind` consistency**: all PEs in a process group share
-   the same `buffer_kind` (no mixed mode in the first cut).
-I5. **op_log ordering**: send → DMA complete → recv possible. The
-   t_start order in op_log respects this causality.
-I6. **Atomic data + metadata visibility (MUST)**: at the receiver
-   side, data write (`MemoryStore.write`) and metadata forward
-   (`peer_head_cache` update) **must execute in the same SimPy step**.
-   No yield is allowed between the two operations in PE_DMA's vc_comm
-   handler. Code review must reject any inserted `yield` (or `yield
-   from`), since it would create a race where `peer_head_cache` becomes
-   visible before or after the data.
-I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6,
-   the step in which `peer_head_cache > my_tail` becomes truthy is the
-   same step in which the slot data is observable.
-
-#### Failure modes (runtime errors)
-
-F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction
-   → `IpcqInvalidDirection`, simulation aborts.
-F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched
-   send and recv. Not validated by default; opt-in strict mode catches
-   it (`strict_validation: true` on a PE_IPCQ node attrs).
-F3. **Deadlock detection (timeout-based)**: the simulator empties its
-   schedule while a send/recv is still pending → engine raises
-   `IpcqDeadlock` and embeds a pointer dump.
-F4. **Backend init failure**: missing `defaults.algorithm`, missing
-   `algorithms[name]`, module import failure, topology validation
-   failure (I3, I4) — all raised at `init_process_group` time.
-F5. **Slot full + infinite backpressure**: the peer never recvs.
-   Surfaces as F3 timeout.
-
-#### Diagnostics
-
-- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as
-  `(rank, t, dir, nbytes)`.
-- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)`
-  prints every PE_IPCQ ring buffer's `my_head`, `my_tail`,
-  `peer_head_cache`, `peer_tail_cache`.
-- **Deadlock dump**: on hang the engine includes the pointer dump in
-  the `IpcqDeadlock` exception message.
-
-### D15. Algorithm-author cheat sheet
-
-Full step-by-step lives in
-[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
-shortest version:
-
-| Things you touch | Things you don't |
-|------------------|-------------------|
-| `src/kernbench/ccl/algorithms/<your_algo>.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
-| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
-| (Optional) `tests/test_<your_algo>.py` mock test | PE_IPCQ component, AhbmCCLBackend |
-
-5-step flow: write the kernel → register in `ccl.yaml` → optional
-`neighbors` override → optional mock unit test → SimPy validation via
-`kernbench run --bench ccl_allreduce --verify-data`.
-
-Common mistakes: using a direction that wasn't installed, sends
-without matching recvs (deadlock), dtype/shape disagreement, assuming
-fairness from `tl.recv()` round-robin, confusing
-`tl.num_programs(axis)` with the CCL group size.
-
----
-
-## Non-goals
-
-- **Host collective**: a model where `dist.all_reduce` itself moves
-  data on the host side is out of scope. This ADR only covers
-  communication that happens inside the PE kernel.
-- **All-reduce algorithms**: ring / tree / etc. live in algorithm
-  modules and can be added without amending this ADR.
-- **Reliability / error handling**: link faults, send/recv failure
-  recovery, etc. are out of scope.
-- **NoC arbiter precision**: dynamic VC contention is left for a future
-  ADR (see D8).
-
----
-
-## Open questions
-
-- **VC arbitration accuracy** — the first cut uses deterministic
-  chunk interleave + weighted round-robin; heavy contention may report
-  optimistic latency. A NoC arbiter component can be added later.
-- **Credit return BW model** — the fast path is currently outside the
-  fabric BW contention model. Can be modeled as a separate link or
-  switched to piggyback (`credit_return_mode: piggyback`).
-- **Ring buffer slot allocation metadata** — whether the host pushes
-  IPCQ buffer metadata via sideband or via a fabric message similar to
-  `MmuMapMsg` is open.
-- **VC BW split default** — 50/50 vs. weighted (e.g. 80/20). Exposed in
-  `ccl.yaml`; default value TBD.
-- **Direction count** — 4 (N/S/E/W) is fixed in the first cut; 6
-  (with Up/Down for 3D) or N (variable) is future work.
-- **Multi-tile aggregation primitives** — whether
-  `tl.recv_all` or similar is needed for fan-in.
-- **Round-robin recv fairness** — current weak fairness can starve;
-  strict fairness counter is future work.
-- **Deadlock detection precision** — currently timeout-based; a
-  realtime wait-for graph would enable deterministic detection.
-
----
-
-## Consequences
-
-### Positive
-
-- PE-to-PE direct communication enables CCL kernels to be written.
-- Host stays minimal (just `launch`); synchronization happens inside
-  the PE → strong compute / comm overlap.
-- VCs eliminate HoL blocking → collective latency is not blocked by
-  compute traffic.
-- Buffer placement and backpressure mode are init-time parameters →
-  easy to benchmark.
-- Four-direction logical neighbors → host is free to map
-  ring/mesh/tree algorithms.
-
-### Negative
-
-- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
-- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE.
-- VC arbitration is a first-order approximation; heavy contention
-  scenarios may report slightly optimistic latency vs real HW (D8).
-- Chunk-level interleave makes PE_DMA implementation more complex.
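
The sender-side ring arithmetic from D5 (slot address and backpressure check) and the per-PE memory cost listed under the negative consequences in the removed file above can be checked with a small sketch. It is illustrative only: `RingState`, `can_send`, `next_dst_addr`, and `ipcq_bytes_per_pe` are hypothetical names, not kernbench APIs, and the numbers are the `ccl.yaml` defaults shown in D11 (`n_slots: 8`, `slot_size: 4096`). The base address `0x1000` is an arbitrary assumption.

```python
from dataclasses import dataclass

N_RINGS_PER_PE = 8  # 4 directions x {tx, rx}, as stated in D2


@dataclass
class RingState:
    """Hypothetical sender-side view of one IPCQ ring (not a kernbench class)."""
    rx_base_pa: int           # peer rx_buffer base address
    n_slots: int              # peer ring depth
    slot_size: int            # bytes per slot
    my_head: int = 0          # number of sends issued so far
    peer_tail_cache: int = 0  # last peer tail learned via credit return

    def can_send(self) -> bool:
        # D5 step 2: the ring is full once head runs n_slots ahead of the cached tail.
        return self.my_head - self.peer_tail_cache < self.n_slots

    def next_dst_addr(self) -> int:
        # D5 step 1: the slot index wraps around the peer's ring depth.
        return self.rx_base_pa + (self.my_head % self.n_slots) * self.slot_size


def ipcq_bytes_per_pe(n_slots: int, slot_size: int) -> int:
    # Memory cost from the Consequences section: 8 rings x slot_size x n_slots.
    return N_RINGS_PER_PE * slot_size * n_slots


if __name__ == "__main__":
    ring = RingState(rx_base_pa=0x1000, n_slots=8, slot_size=4096)
    while ring.can_send():          # fill the ring until backpressure kicks in
        print(hex(ring.next_dst_addr()))
        ring.my_head += 1
    ring.peer_tail_cache += 1       # one credit arrives, one slot is free again
    print(ring.can_send(), hex(ring.next_dst_addr()))
    print(ipcq_bytes_per_pe(8, 4096))  # 262144 bytes, i.e. 256 KiB per PE
```

With the defaults, the sender can issue exactly eight sends before it blocks, and the first credit makes slot 0 reusable because `my_head % n_slots` has wrapped back to 0.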
diff --git a/docs/adr/ADR-0023-dev-ipcq-pe-collective.md b/docs/adr/ADR-0023-dev-ipcq-pe-collective.md
index 5fd174d..2db86c5 100644
--- a/docs/adr/ADR-0023-dev-ipcq-pe-collective.md
+++ b/docs/adr/ADR-0023-dev-ipcq-pe-collective.md
@@ -6,39 +6,45 @@ Accepted
 
 ## Context
 
-### 목표
+### Goal
 
-CCL (Collective Communication Library) 커널을 PE 안에서 실행할 수 있도록
-PE 간 데이터 교환 인프라를 추가한다. 호스트는 그저 각 SIP에 커널을 launch만 하고,
-실제 동기화와 데이터 이동은 **PE 커널 안에서 IPCQ(Inter-Process Communication
-Queue)를 통해** 일어난다.
+Add the infrastructure that lets CCL (Collective Communication Library)
+kernels run **inside** a PE. The host just launches a kernel on each
+SIP; the actual synchronization and data movement happen **inside the
+PE kernel via an IPCQ (Inter-Process Communication Queue)**.
 
-이는 NCCL이 GPU 커널 안에서 NVLink 통신을 수행하는 모델, 또는 Cerebras/Tenstorrent의
-core-local 통신 큐와 유사하다. 호스트 레벨 collective(`dist.all_reduce`)는
-**미래 작업**으로 미루고, 본 ADR은 커널 collective 인프라에만 집중한다.
+This mirrors how NCCL performs NVLink communication inside a GPU
+kernel, or how Cerebras / Tenstorrent expose core-local communication
+queues. Host-level collectives (`dist.all_reduce`) are deferred to
+**future work**; this ADR focuses solely on the kernel-side collective
+infrastructure.
 
-### 풀어야 할 문제
+### Problems to solve
 
-1. PE 간 직접 데이터 이동 (peer's memory에 write)
-2. 동기화 — 송신 측이 수신 측 buffer 공간을 확인해야 함 (backpressure)
-3. compute traffic과 communication traffic의 자원 경쟁 (Head-of-Line blocking)
-4. 호스트가 알고리즘에 따라 (ring/mesh/tree) 논리적 neighbor 토폴로지를 구성할 수 있어야 함
+1. PE-to-PE direct data movement (writing into a peer's memory).
+2. Synchronization — the sender must check that the receiver has space
+   in its buffer (backpressure).
+3. Resource contention between compute traffic and communication
+   traffic (Head-of-Line blocking).
+4. The host must be able to construct logical neighbor topologies
+   (ring / mesh / tree) per algorithm.
 
 ---
 
 ## Decision
 
-### D1. PE_IPCQ 컴포넌트 신규 추가
+### D1. Add a new `PE_IPCQ` component
 
-PE 안에 새 컴포넌트 `PE_IPCQ`를 추가한다. PE_GEMM/PE_MATH가 PE_CPU의
-sub-block을 별도 컴포넌트로 모델링하는 것과 동일한 패턴이다.
+A new component `PE_IPCQ` is added inside each PE. It follows the same
+pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a
+distinct component.
 
 ```
 PE
 ├── PE_CPU
 ├── PE_SCHEDULER
 ├── PE_DMA
-├── PE_IPCQ          ← 신규
+├── PE_IPCQ          ← new
 ├── PE_FETCH_STORE
 ├── PE_GEMM
 ├── PE_MATH
@@ -46,99 +52,96 @@ PE
 ├── PE_MMU
 ```
 
-**역할 분리** (control plane vs data plane):
-- **PE_IPCQ (control plane)**: ring buffer 주소 계산, head/tail pointer 관리,
-  peer pointer 캐시, backpressure 결정, 4-방향 neighbor 매핑
-- **PE_DMA (data plane)**: 실제 데이터를 cube_noc/UCIe/PCIE 경유로 peer 메모리에 전송
+**Role separation** (control plane vs. data plane):
 
-PE_IPCQ는 데이터 이동을 직접 수행하지 않고 PE_DMA에 위임한다.
+- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head /
+  tail pointer management, peer pointer caches, backpressure, 4-direction
+  neighbor mapping.
+- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe
+  / PCIE into the peer's memory.
 
-### D2. Ring Buffer 모델
+PE_IPCQ does **not** move data itself — it delegates to PE_DMA.
 
-각 PE는 4-방향(N/S/E/W) × {tx, rx} = 총 8개의 ring buffer를 가진다.
+### D2. Ring buffer model
+
+Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.
 
 ```python
 @dataclass
 class IpcqQueuePair:
     direction: Direction          # N/S/E/W
-    peer: IpcqEndpoint            # init 시 호스트가 설정 (D2.5)
-    tx_buffer_base: int           # 내가 보낼 데이터의 base addr (자기 메모리)
-    rx_buffer_base: int           # 내가 받을 데이터의 base addr (자기 메모리)
-    slot_size: int                # tile 단위
+    peer: IpcqEndpoint            # set by host at init time (D2.5)
+    tx_buffer_base: int           # outgoing data base addr (in our memory)
+    rx_buffer_base: int           # incoming data base addr (in our memory)
+    slot_size: int                # 1 tile per slot
     n_slots: int                  # ring depth
-    my_head: int                  # 내 send 위치 (다음에 쓸 tx/peer slot)
-    my_tail: int                  # 내 recv 위치 (다음에 읽을 rx slot)
-    peer_head_cache: int          # 캐시: peer가 마지막으로 보낸 head 위치 (D9 piggyback으로 갱신)
-    peer_tail_cache: int          # 캐시: peer가 마지막으로 소비한 tail 위치 (D9 fast path credit으로 갱신)
+    my_head: int                  # next slot we will write/send into
+    my_tail: int                  # next slot we will read/recv from
+    peer_head_cache: int          # peer's last-seen head (updated via D9 piggyback)
+    peer_tail_cache: int          # peer's last-seen tail (updated via D9 fast-path credit)
 ```
 
-**필드명 규약 (canonical)**: 본 ADR 전체에서 다음 4개 이름을 일관되게 사용한다.
+**Canonical field names**: throughout this ADR the four names above
+(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used
+consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`,
+etc.) are not used.
 
-| 필드 | 소유자 | 갱신 시점 |
-|------|--------|----------|
-| `my_head` | 자기 PE_IPCQ | tl.send 호출 후 즉시 (송신 추적용) |
-| `my_tail` | 자기 PE_IPCQ | tl.recv 호출 후 즉시 (수신 추적용) |
-| `peer_head_cache` | 자기 PE_IPCQ | IpcqMetaArrival 도착 시 (D9 piggyback) |
-| `peer_tail_cache` | 자기 PE_IPCQ | IpcqCreditMetadata 도착 시 (D9 fast path) |
+| Field | Owner | Updated when |
+|-------|-------|--------------|
+| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
+| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
+| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
+| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |
 
-다른 표현(`peer_head_local`, `peer_head`, `peer_tail` 등)은 사용하지 않는다.
+**Slot unit**: fixed-size, one slot holds one full tile (no descriptor
+indirection). Full data embedded in the slot. See D5.
 
-**Slot 단위**: fixed-size, 한 slot이 한 tile 데이터를 통째로 담는다.
-descriptor 모델이 아니라 **full data embedding** 모델 (D5에서 상세).
+### D2.5. `IpcqEndpoint` schema
 
-### D2.5. PeAddress / IpcqEndpoint 스키마
-
-`IpcqQueuePair.peer`가 가져야 할 정보를 명시한다. 송신 측 PE_IPCQ가
-peer rx slot에 직접 DMA write하려면 다음을 모두 알아야 한다.
+`IpcqQueuePair.peer` carries everything the sender needs to compute the
+peer's rx slot address:
 
 ```python
 @dataclass(frozen=True)
 class IpcqEndpoint:
-    """송신 측이 peer's rx_buffer 주소를 계산하기 위해 필요한 모든 정보."""
-    sip: int                     # 목적지 SIP
-    cube: int                    # 목적지 cube
-    pe: int                      # 목적지 PE (cube 내 local index)
-    buffer_kind: str             # "tcm" | "hbm" | "sram" — 어느 메모리 공간
-    rx_base_pa: int              # peer rx_buffer base의 PA (PhysAddr.encode())
-    rx_base_va: int              # peer rx_buffer base의 VA (선택, MMU 사용 시)
-    n_slots: int                 # peer ring depth (경계 wrap-around 계산용)
-    slot_size: int               # peer slot 크기 (offset 계산용)
+    sip: int
+    cube: int
+    pe: int
+    buffer_kind: str             # "tcm" | "hbm" | "sram"
+    rx_base_pa: int              # peer rx_buffer base PA (PhysAddr.encode())
+    rx_base_va: int              # peer rx_buffer base VA (optional, MMU mode)
+    n_slots: int                 # peer ring depth (for wrap-around)
+    slot_size: int               # peer slot size (for offset)
 ```
 
-`IpcqQueuePair`의 `peer` 필드는 이 `IpcqEndpoint` 객체를 들고 있다.
-주소 계산은:
+Address computation:
 
 ```python
 slot_idx = self.my_head % peer.n_slots
 dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
 ```
 
-PE_IPCQ는 이 dst_pa를 `IpcqDmaToken`의 dst_addr로 PE_DMA에 전달한다.
-PE_DMA(vc_comm)는 fabric 라우팅(cube_noc/UCIe/PCIE)을 통해 dst_pa로 데이터를 전송한다.
+PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA
+(vc_comm) routes the data to `dst_pa` through the fabric.
 
-**Endpoint 생성 시점**: backend init (D10)에서 모든 PE의 IPCQ buffer를
-allocator로 할당받고, 각 rank의 neighbor table을 만들 때 peer rank의
-endpoint 정보를 install한다. 즉 install 순서는:
+**Endpoint construction order**: at backend init (D10), the IPCQ
+buffers for **every PE** are allocated first (so each rank knows the
+others' PA), then the per-rank neighbor tables are built and pushed to
+PE_IPCQ via `IpcqInitMsg`.
 
-1. **모든 rank의 IPCQ buffer 할당** (각 PE의 buffer_kind 메모리 공간에서)
-2. **rank별 endpoint table 구성** (자신의 4-방향 peer가 어느 sip/cube/pe/pa를 갖는지)
-3. **PE_IPCQ에 install** (`IpcqInitMsg` via fabric or sideband)
+### D3. Four-direction mapping ≡ logical ProcessGroup
 
-이 순서는 모든 rank가 서로의 PA를 알아야 하므로, 단계 1을 모든 rank에 대해
-먼저 끝낸 후 단계 2-3을 진행한다.
-
-### D3. 4-방향 매핑 = 논리적 ProcessGroup
-
-PE는 4방향(N/S/E/W)을 logical port로 본다. 실제 peer 주소는 호스트 CCL init이
-알고리즘에 따라 설정한다. PE 커널은 토폴로지를 알지 못하고 방향만 사용한다.
+The PE views four directions (N/S/E/W) as logical ports. Real peer
+addresses are configured by the host CCL init, per the chosen
+algorithm. The PE kernel never knows the topology, only directions.
 
 ```python
-# 호스트 init 예시 — 1D ring
+# 1D ring
 for rank in range(world_size):
     ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
     ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])
 
-# 호스트 init 예시 — 2D mesh
+# 2D mesh
 for r in range(R):
     for c in range(C):
         ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
@@ -147,77 +150,77 @@ for r in range(R):
         ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
 ```
 
-PE 코드 입장에서 `tl.send(dir="E", ...)`가 어디로 가는지는 알 필요가 없다.
+The PE code does not need to know where `tl.send(dir="E", ...)` actually
+ends up.
 
-### D4. PE 커널 API
+### D4. PE kernel API
 
 ```python
-# Send (blocking, backpressure 발생 가능)
-tl.send(dir: str, src_addr: int, nbytes: int) -> None
+# Send (blocking; may stall on backpressure)
+tl.send(dir: str, src=TensorHandle)
+tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
 
 # Recv (blocking)
-data = tl.recv(dir: str)            # 특정 방향에서 수신
-data = tl.recv()                    # 4방향 round-robin, 도착한 첫 tile 반환
+recv = tl.recv(dir: str, shape=..., dtype=...)
+recv = tl.recv(shape=..., dtype=...)        # round-robin across 4 directions
 
 # Recv (non-blocking)
-handle = tl.recv_async(dir: str)
-data = tl.wait(handle)
+fut  = tl.recv_async(dir: str, shape=..., dtype=...)
+recv = tl.wait(fut)
 ```
 
-`tl.recv()` (방향 미지정)는 IPCQ가 last_polled_dir 인덱스를 들고 있다가
-다음 호출 시 그 다음 방향부터 검사하면서 데이터 있는 첫 슬롯을 반환한다.
-4방향 모두 비어있으면 wait.
+`tl.recv()` (no direction) keeps a `last_polled_dir` cursor; each call
+starts polling at the next direction after it and returns the first slot
+that holds data. If all four directions are empty, the call waits.
 
-**Fairness는 weak fairness**: polling 시작 방향을 회전시켜 단순 편향을
-완화하지만, 한 방향에 데이터가 항상 먼저 도착하면 다른 방향이 starvation될
-수 있다. strict fairness가 필요한 알고리즘은 `tl.recv(dir=...)`로 방향을
-명시해야 한다. (Open Questions 참조)
+**Fairness is weak**: the rotating start mitigates simple bias, but if
+one direction always wins the race the others can starve. Algorithms
+that need strict fairness must call `tl.recv(dir=...)` explicitly.
 
-### D5. Single-hop DMA Write + Full-data Slot 모델
+### D5. Single-hop DMA write + full-data slot model
 
-데이터는 송신 측 메모리에서 수신 측 ring slot으로 **단일 DMA 전송**으로
-이동한다. 핵심 속성:
+Data moves from sender memory into the receiver's ring slot in **one
+DMA transfer**. Key properties:
 
-- **Single-hop**: 송신 측 IPCQ가 peer rx slot 주소를 직접 알고 있어 한 번의
-  fabric DMA로 데이터가 도착한다.
-- **No CPU memcpy**: CPU가 데이터를 복사하지 않는다.
-- **No intermediate staging**: 송신/수신 어느 쪽에도 별도 staging buffer가
-  없다 (송신은 자기 source 주소에서 직접, 수신은 자기 ring slot으로 직접).
+- **Single-hop**: the sender already knows the peer rx slot address and
+  fires one fabric DMA into it.
+- **No CPU memcpy**: the CPU never copies data.
+- **No intermediate staging**: neither side keeps a separate staging
+  buffer (sender uses the source addr directly; receiver gets the data
+  in its ring slot directly).
 
-(엄밀히 말하면 fabric DMA write 자체는 발생하므로 "data movement가 전혀 없다"는
-의미는 아니다. NCCL의 "zero-copy"가 가리키는 것 — CPU memcpy / staging copy
-부재 — 과 동일한 속성이다.)
-
-데이터 이동 모델:
+(Strictly speaking the fabric DMA write does happen, so this is not
+literally "no data movement" — it's the same property NCCL labels
+"zero-copy", meaning no CPU memcpy and no staging copy.)
 
 ```
 PE A: tl.send(E, src_addr, nbytes)
-  1. IPCQ가 peer rx slot 주소 계산
+  1. IPCQ computes the peer rx slot address:
        dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
-  2. backpressure: my_head - peer_tail_cache < peer.n_slots ?
-                   (꽉 찼으면 sleep/poll)
-  3. PE_DMA(vc_comm)에 DMA 요청 → src_addr에서 peer의 dst_addr로 nbytes 전송
+  2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
+                   (full → sleep / poll)
+  3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
   4. my_head += 1
 
 PE B: data = tl.recv(W)
-  1. 내 rx_buffer[my_tail % n_slots] 위치 확인
-  2. 데이터 도착 대기 (D7 backpressure 모드)
-  3. 그 주소를 PE 커널에 반환 (또는 fetch unit으로 register file에 로드)
+  1. Look at rx_buffer[my_tail % n_slots]
+  2. Wait for the data to arrive (D7 backpressure mode)
+  3. Return the slot address to the kernel (or fetch into register file)
   4. my_tail += 1
-  5. credit return fast path 발행 (D9) — bottleneck-BW latency 후
-     peer A의 peer_tail_cache 갱신
+  5. Issue a credit return on the fast path (D9): after the bottleneck-BW
+     latency, PE A's peer_tail_cache is updated.
 ```
 
-**핵심**: Slot에 데이터가 통째로 들어간다. PE B의 recv는 자기 rx_buffer만
-읽으면 되고, A의 메모리를 read하지 않는다. 송신 측 IPCQ가 peer rx slot
-주소를 알고 있으므로 직접 그 주소로 DMA write한다 (single-hop).
+The slot holds the full tile. The receiver only reads its own
+rx_buffer; it never reads from A's memory. The sender knows the
+peer rx slot address and DMAs directly into it (single-hop).
 
-본인의 PE_TCM read/write는 DMA를 거치지 않는다 (PE에 직접 붙어있음).
-slot이 본인 TCM에 있으면 직접 접근, 아니면 PE_DMA 경유.
+The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local
+to the PE).
 
-### D6. Buffer 위치 — 3-way benchmark
+### D6. Buffer placement — three-way benchmark
 
-호스트 CCL init이 IPCQ ring buffer의 메모리 위치를 결정한다:
+The host CCL init picks the IPCQ ring-buffer location:
 
 ```python
 ipcq_init(
@@ -228,79 +231,80 @@ ipcq_init(
 )
 ```
 
-| 위치 | 특징 | trade-off |
-|------|------|-----------|
-| **PE_TCM** | PE에 직접 붙음, 빠름 | 작음, PE 내부 자원과 경쟁 |
-| **PE-local HBM** | 큼, DMA 경유 | latency 큼 |
-| **Cube SRAM** | 중간 크기, cube-shared | cube 내 PE 간 contention |
+| Location | Trait | Trade-off |
+|----------|-------|-----------|
+| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
+| **PE-local HBM** | Large; via DMA | Higher latency |
+| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |
 
-세 위치 모두 동일 코드로 동작하며 init만 다르다. 벤치마크로 비교 가능.
+All three locations run the same kernel code; only the init differs.
 
-**규칙**: peer가 read/write할 때는 DMA 경유. 본인이 자기 PE_TCM 읽기/쓰기는
-DMA 없음.
+### D7. Backpressure — two-mode benchmark
 
-### D7. Backpressure — 2-mode benchmark
+How the sender waits when the peer's slots are full, and how the
+receiver waits when data has not yet arrived:
 
-송신 측이 peer slot full을 감지했을 때, 또는 수신 측이 데이터 미도착을
-감지했을 때 어떻게 대기하는가:
-
-| 모드 | 동작 | 모델 |
-|------|------|------|
-| **poll** | 캐시된 peer pointer를 주기적으로 재확인. cache update event를 폴링 | spin loop |
-| **sleep** | SimPy event를 yield하고 sleep, peer가 update event를 trigger하면 wake | interrupt-like |
+| Mode | Behavior | Model |
+|------|----------|-------|
+| **poll** | Periodically re-check the cached peer pointer | Spin loop |
+| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |
 
 ```python
 ipcq_init(backpressure="poll" | "sleep", ...)
 ```
 
-두 모드 모두 구현하여 latency/throughput trade-off를 벤치마크할 수 있다.
+Both modes are implemented so latency / throughput trade-offs can be
+benchmarked.
 
-### D8. PE_DMA Virtual Channel
+### D8. PE_DMA virtual channels
 
-PE_DMA를 단일 큐에서 **2-channel virtual channel** 모델로 확장한다.
+Extend PE_DMA from a single queue into a **two-channel virtual-channel**
+model.
 
 ```
 PE_DMA
-├── vc_compute: GEMM/MATH의 tile load/store/writeback
-└── vc_comm:    IPCQ의 send 데이터
+├── vc_compute: tile load / store / writeback for GEMM and Math
+└── vc_comm:    IPCQ send data
 ```
 
-각 VC는 독립적인 state machine을 가진다:
-- 한 채널이 stall되어도 다른 채널은 진행
-- 동일 link(cube_noc, UCIe 등)는 공유하지만, link BW는 두 채널이 분할 사용
+Each VC has an independent state machine:
 
-**Chunk 단위 인터리브**:
-- 큰 GEMM tile DMA가 한 번에 link를 점유하지 않음
-- chunk_size 단위로 진행 (예: 256B), 매 chunk마다 다른 VC와 link BW 공유
-- chunk_size는 init 파라미터 (작을수록 fair, 클수록 효율)
+- One channel stalling does not block the other.
+- The same physical link (cube_noc, UCIe, …) is shared, but link BW is
+  split between channels.
 
-이로써:
-- HoL blocking 해소 (compute DMA 진행 중에도 IPCQ send 끼어들 수 있음)
-- compute/comm overlap 자연스러움 (NVIDIA copy engine + compute SM 패턴)
-- HW 모델 정합 (NoC virtual channel은 실제 HW 기법)
+**Chunk-level interleave**:
 
-**첫 구현의 정확도 한계 (intentional)**:
+- Large GEMM tile DMAs do not lock the link end-to-end.
+- Progress happens in chunks (e.g. 256 B); each chunk shares link BW
+  with the other VC's pending chunks.
+- Chunk size is an init parameter (smaller = fairer, larger = more
+  efficient).
 
-본 ADR의 첫 구현은 **deterministic chunk-level interleave + weighted
-round-robin arbitration** (default 50/50, ccl.yaml에 노출)을 채택한다.
-이는 first-order approximation이며, 실제 HW의 dynamic contention/credit-based
-arbitration보다는 단순화된 모델이다.
+Net effect:
 
-| 모델링 항목 | 첫 구현 | 향후 확장 가능 |
-|------------|---------|---------------|
-| VC 간 BW 분할 | 정적 weight | dynamic contention 기반 |
-| Chunk 단위 인터리브 | deterministic round-robin | priority/QoS 기반 |
-| Cross-VC dependency | 없음 (독립) | NoC arbiter component 추가 |
+- HoL blocking is eliminated (an IPCQ send can interleave with a long
+  compute DMA).
+- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM
+  pattern).
+- Matches the NoC-virtual-channel pattern used in real HW.
 
-이 단순화는 functional correctness에는 영향이 없으며, latency 측정에서
-heavy contention 시나리오는 실제보다 약간 optimistic한 결과를 낼 수 있다.
-정밀화가 필요하면 별도 ADR로 NoC arbiter를 도입한다.
+**First-implementation accuracy limit (intentional)**: this ADR's
+first cut uses **deterministic chunk-level interleave + weighted
+round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`).
+This is a first-order approximation and is simpler than real HW
+dynamic-contention / credit-based arbiters. Functional correctness is
+unaffected, but heavy-contention scenarios may report slightly
+optimistic latency vs. real HW. A separate ADR can add a NoC arbiter
+component later if more precision is needed.
 
 #### Token routing
 
-- compute용 token (TileToken): 기존 PE_FETCH_STORE → PE_DMA 체이닝 그대로
-- comm용 token (IpcqDmaToken, 신규): PE_IPCQ → PE_DMA로 self-routing
-- PE_DMA가 token 종류로 채널 결정
+- Compute tokens (`TileToken`) — go through the existing
+  PE_FETCH_STORE → PE_DMA chain.
+- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA
+  self-routing.
+- PE_DMA picks the channel by token type.
 
 ```python
 class PeDmaComponent:
@@ -311,55 +315,54 @@ class PeDmaComponent:
             yield from self._vc_compute_process(env, token)
 ```
 
-### D9. Pointer 동기화 — DMA payload piggyback
+### D9. Pointer synchronization — DMA payload piggyback
 
-실제 하드웨어(NVLink, UCIe 등)는 DMA 메시지의 payload에 메타데이터를
-piggyback하여 송수신과 함께 pointer를 갱신한다. 본 시뮬레이션도 같은 모델을
-채택하여 **별도의 control 채널 없이** 메타데이터가 data와 함께 도착하도록 한다.
+Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so
+pointers update along with the data. This simulation adopts the same
+model: **no separate control channel** — metadata travels with the
+data.
 
-이 모델의 핵심 이점:
+Key benefits:
 
-- **자동 ordering**: 메타데이터가 data와 동일 token으로 이동하므로 data가
-  먼저 visible해진 다음에야 head_cache가 갱신된다. 별도 ordering invariant
-  없이 race condition이 원천 차단된다.
-- **HW 정합**: 실제 NVLink/UCIe의 piggybacked header 모델과 일치
-- **컴포넌트 단순화**: 별도 IpcqPtrUpdate event 종류가 필요 없음
+- **Automatic ordering**: data and metadata move on the same token, so
+  data is visible **before** the head_cache update. No race.
+- **HW fidelity**: matches NVLink / UCIe piggybacked headers.
+- **Component simplification**: no separate `IpcqPtrUpdate` event type.
 
-#### Send 흐름 (head 측 piggyback)
+#### Send flow (head update via piggyback)
 
 ```
 PE A: tl.send(E, src_addr, nbytes)
-  1. PE_IPCQ가 backpressure 체크 (peer_tail_cache 기준)
-  2. PE_IPCQ가 IpcqDmaToken 생성:
-     - data 본체 (src_addr → peer dst_addr)
-     - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
-  3. PE_DMA(vc_comm)에 token put
-  4. PE A는 자기 my_head++ (송신 추적용)
+  1. PE_IPCQ checks backpressure (using peer_tail_cache)
+  2. PE_IPCQ creates an IpcqDmaToken:
+       - data body (src_addr → peer dst_addr)
+       - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
+  3. Hand the token to PE_DMA(vc_comm)
+  4. PE A increments my_head (send tracking)
 
-[fabric DMA: latency 만큼 진행]
+[fabric DMA: latency elapses]
 
-PE B의 PE_DMA가 token 수신
-  5. data를 dst_addr (B의 rx slot)에 MemoryStore.write
-  6. token의 metadata를 PE B의 PE_IPCQ로 forward (PE 내부 wire, ~1 cycle)
+PE B's PE_DMA receives the token
+  5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
+  6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)
 
-PE B의 PE_IPCQ가 metadata 수신
-  7. peer_head_cache 갱신 (= A의 head 위치)
-  8. 대기 중인 recv (해당 direction)가 있으면 wake
+PE B's PE_IPCQ receives the metadata
+  7. Updates peer_head_cache (= A's head)
+  8. Wakes any pending recv on that direction
 ```
 
-여기서 핵심은 **5와 6은 같은 SimPy step**이라는 것이다 — DMA 완료와 동시에
-data와 metadata가 atomic하게 visible해진다.
+**Steps 5 and 6 must execute in the same SimPy step** — DMA completion
+makes data and metadata atomically visible.
 
-#### Recv 흐름 (credit return — fast path with bottleneck-BW latency)
+#### Recv flow (credit return — fast path with bottleneck-BW latency)
 
-수신측이 slot을 비우면 송신측은 그 사실을 알아야 한다 (backpressure 해제).
-data 경로의 piggyback 모델과 달리, credit return은 일반 vc_comm fabric을
-거치지 않고 **별도 fast path**로 처리한다. 이는 실제 HW의 NVLink/UCIe
-credit return fast path를 추상화한 것이다.
+When the receiver frees a slot, the sender must learn about it
+(backpressure release). Unlike data, the credit return does **not**
+travel through general vc_comm fabric — it uses a **separate fast
+path**, an abstraction of the NVLink / UCIe credit-return wire.
 
-**Latency 계산**: magic constant가 아니라 **라우팅 경로의 full path
-latency** (per-node overhead + edge propagation + drain) 기준으로
-산출한다.
+**Latency** is computed from the **full path latency** (per-node
+overhead + edge propagation + drain), not a magic constant:
 
 ```
 credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
@@ -370,47 +373,35 @@ latency = compute_path_latency_ns(path, credit_size_bytes)
         + credit_size_bytes / bottleneck_bw_on_path
 ```
 
-router는 source에만 `.pe_dma`를 자동 부여하므로 destination에는 반드시
-`.pe_dma` suffix를 명시해야 한다. 그렇지 않으면 `find_path`가 raise하고
-credit이 0 cost로 silently teleport되는 latent bug가 발생한다 (이번
-업데이트에서 수정됨).
+The router auto-appends `.pe_dma` to the source only, so the
+destination MUST carry an explicit `.pe_dma` suffix. Without it
+`find_path` raises and the credit silently teleports at zero cost,
+a latent bug that was fixed alongside this update.
 
-`tl.recv`는 credit-emit 완료를 yield-from으로 기다린다 (이전에는
-`env.process`로 fork). 이로써 credit-return cost가 receiver의
-`pe_exec_ns`에 반영되어, IPCQ control-plane이 consume-acknowledgement를
-완료한 뒤에야 recv가 kernel에 반환된다 — RAW DMA의 non-posted `tl.store`가
-HBM ack-trip을 기다리는 것의 protocol-level 등가물이다.
+`tl.recv` blocks on the credit-emit completion (recv yields-from
+`_delayed_credit_send` rather than spawning it as a fork). This puts
+the credit-return cost on the receiver's `pe_exec_ns`, modeling the
+IPCQ control-plane completing the consume-acknowledgement before
+recv returns to the kernel — the protocol equivalent of a non-posted
+`tl.store` waiting for an HBM ack on the raw DMA path.
 
-이로써:
-- **토폴로지 비례 approximation**: cube 내 credit return과 cross-SIP credit이
-  자동으로 다른 latency를 가짐
-- **Magic constant 없음**: 모든 ns 값이 데이터 트래픽과 동일한 edge_map
-  및 `node_overhead_ns`에서 산출되는 `compute_path_latency_ns`로부터 옴
-- **Deadlock 위험 없음**: `peer_credit_store.put`은 unbounded, B가 A에게
-  보낼 데이터가 없어도 credit이 자동 발행됨
-- **`IPCQ ≥ raw DMA`** 보장: matched physical move에 대해 credit-emit이
-  RAW의 ack-trip cost와 균형을 이룸
+That gives us:
 
-```
-PE B: tl.recv(W) → 데이터 가져감 → my_tail++
+- **Topology-proportional approximation**: an in-cube credit return is
+  automatically faster than a cross-SIP credit return.
+- **No magic constants**: every nanosecond comes from
+  `compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
+  as data traffic.
+- **No deadlock risk**: unlike piggyback, B can issue credit even when
+  it has no data to send back. `peer_credit_store.put` is unbounded.
+- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
+  cost on recv balances the HBM ack-trip cost RAW pays on the sender.
 
-PE B의 PE_IPCQ:
-  1. router로 PE A까지 path 계산
-  2. compute_drain_ns(path, credit_size_bytes) = latency_ns
-  3. env.process(self._delayed_credit_send(latency_ns, peer_credit_store, my_tail))
+#### Component coupling — SimPy Store channel
 
-[fast path: latency_ns 만큼 timeout, fabric vc 미사용]
-
-PE A의 PE_IPCQ가 자기 credit_store에서 IpcqCreditMetadata 수신:
-  4. peer_tail_cache 갱신
-  5. 대기 중인 send (해당 direction)가 있으면 wake
-```
-
-#### Component 결합도 — SimPy Store 채널
-
-PE B의 PE_IPCQ가 PE A의 PE_IPCQ를 직접 호출하지 않는다. 대신 **init 시점에
-양쪽 PE_IPCQ 사이에 SimPy Store를 한 번 wire**해두고 (양방향 fast path 채널),
-credit metadata는 그 store로 put한다.
+PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init
+time, **a SimPy Store is wired between the two** (a per-direction
+fast-path channel) and credit metadata is `put` into that store.
 
 ```python
 class PeIpcqComponent:
@@ -419,763 +410,414 @@ class PeIpcqComponent:
         yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...))
 ```
 
-backend init에서 IpcqInitMsg fan-out 시 양방향 fast path channel을 함께
-설치한다 (D12 IpcqInitMsg에 명시).
+Backend init wires both directions of the fast-path channel as part of
+fan-out (see `IpcqInitMsg` in D12).
 
-#### Credit return fast path의 한계
+#### Credit-return fast path limitations
 
-- `credit_size_bytes`는 estimate. 보통 16-64 bytes로 충분하며, 실제 HW의
-  credit return wire 크기를 모방한 값.
-- fast path는 일반 vc_comm BW contention 모델에서 **제외**된다 (별도 채널).
-  실제 HW의 credit return wire는 매우 lightweight이므로 1차 근사로 합리적.
-- 정밀화가 필요하면 후속 ADR에서:
-  - credit fast path를 별도 link로 모델링 (BW limit + contention)
-  - 또는 piggyback 모드로 변경 가능 (`credit_return_mode: piggyback`)
+- `credit_size_bytes` is an estimate (typically 16–64 bytes).
+- The fast path is **excluded from vc_comm BW contention** (separate
+  wire). Real HW credit-return wires are very lightweight, so this is a
+  reasonable first approximation.
+- A follow-up ADR can: model the credit fast path as a separate link
+  (BW limit + contention), or switch to piggyback (`credit_return_mode:
+  piggyback`).
 
-#### PE_DMA의 책임 추가
+#### PE_DMA's added responsibility
 
-PE_DMA(vc_comm)는 token 수신 시 다음 시퀀스로 처리한다: Transaction
-terminal의 BW drain을 먼저 지불하고, 이어서 atomic하게 data write +
-metadata forward 수행. **data write와 metadata forward 사이에는 SimPy
-yield를 두어서는 안 된다** (I6 MUST 규칙 참조). drain yield는 atomic
-구간 안이 아니라 그 앞에 위치해야 한다:
+When `vc_comm` receives a token, PE_DMA processes it as the following
+sequence: pay the Transaction's terminal BW drain, then atomically
+write data and forward metadata. **No SimPy yield is allowed between
+the data write and the metadata forward** (invariant I6). The drain
+yield must sit before the atomic block, not inside it:
 
 ```python
 def _on_vc_comm_recv(self, env, txn):
-    # Sender PE_DMA가 찍어 둔 drain_ns (= nbytes / bottleneck_bw) 를
-    # 여기서 지불. atomic 구간보다 앞이어야 한다 — recv는 bytes가
-    # "도착"한 이후에만 깨어나야 하므로.
+    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
+    # sender PE_DMA). MUST happen before the atomic block so recv only
+    # wakes after the bytes have "landed".
     drain = getattr(txn, "drain_ns", 0.0)
     if drain > 0:
         yield env.timeout(drain)
 
     token = txn.request
-    # ── ATOMIC: 두 동작 사이에 yield 금지 ──
-    # 1. data를 dst_addr에 write (dst의 메모리 공간은 token.dst_endpoint.buffer_kind)
+    # ── ATOMIC: no yield between these two operations ──
     data = self._memory_store.read(token.src_space, token.src_addr,
-                                    shape=..., dtype=...)
-    self._memory_store.write(token.dst_endpoint.buffer_kind, token.dst_addr, data)
-    # 2. token의 metadata를 자기 PE의 IPCQ로 forward
+                                   shape=..., dtype=...)
+    self._memory_store.write(token.dst_endpoint.buffer_kind,
+                             token.dst_addr, data)
+    # Forward metadata to the local PE_IPCQ (last call in the atomic block)
     yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
-    # ─────────────────────────────────────
+    # ───────────────────────────────────────────────────
 ```
 
-`out_ports[ipcq_id].put`은 SimPy Store의 yield-able 호출이지만, PE 내부
-wire로 capacity가 unbounded인 store를 사용하므로 즉시 완료된다 (실질적으로
-single-step). 이 최종 put이 atomic 구간의 끝이며, 그 이전에 다른 yield가
-삽입되면 안 된다.
+The final `put` is a yieldable call, but it targets an unbounded
+PE-internal store, so it completes in a single step. That `put` closes
+the atomic block; no other yield may be inserted before it.
 
 #### Drain-at-inbound semantics (D9 timing model)
 
-Transaction은 sender PE_DMA가 `drain_ns = nbytes / bottleneck_bw_on_path`
-를 찍어 둔 상태로 fabric에 들어간다. 이 simulator에서 per-hop `overhead_ns`
-는 각 forwarding component의 `run()` 에서 지불되고, 남은 BW drain은
-Transaction의 terminal node에서 한 번 지불된다. IPCQ가 아닌 모든
-Transaction (raw DMA, kernel-launch fanout 등) 은
-`ComponentBase._forward_txn` 이 terminal에서 이 drain을 지불한다. IPCQ의
-경우 목적지 PE_DMA가 `_handle_ipcq_inbound` 핸들러로 Transaction을
-가로채서 (IPCQ 전용 data write + metadata forward를 해야 하므로)
-**이 핸들러 최상단에서 drain을 명시적으로 지불해야 한다** — 그래야 IPCQ의
-timing model이 다른 모든 fabric Transaction과 동일선상에 놓인다.
+The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
+stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
+is paid at each forwarding component via `run()`, and the remaining
+BW drain is paid once at the Transaction's terminal. Every non-IPCQ
+Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
+`ComponentBase._forward_txn` at the terminal node. For IPCQ, the
+destination PE_DMA intercepts the Transaction in `_handle_ipcq_inbound`
+to perform the IPCQ-specific data write + metadata forward. **The drain
+MUST therefore be paid explicitly at the top of that handler** to keep
+IPCQ's timing model on par with every other fabric Transaction.
 
-여기서 drain을 지불할 때의 side-effect:
+Side-effects of paying drain here:
 
-- **SRC `tl.send`**: 동작 불변. sender PE_DMA가 `sub_done` 을 `yield`
-  하지 않으므로 fire-and-forget 의미가 보존된다. metadata forward 이후
-  호출되는 `sub_done.succeed()` 는 sender 입장에서 listener가 없는 이벤트.
-- **DST `tl.recv`**: `drain_ns` 만큼 늦게 깨어난다. recv는 local PE_IPCQ
-  의 `IpcqMetaArrival` 수신 시에만 wake되며, metadata forward가 drain
-  이후로 이동했으므로 recv는 bandwidth까지 포함한 전체 fabric transfer
-  시간을 관측하게 된다.
+- **SRC `tl.send`** is unchanged — fire-and-forget semantics are
+  preserved because the sender PE_DMA does not `yield sub_done`. The
+  `sub_done.succeed()` call (made after metadata forward below) is an
+  event with no listener on the sender side.
+- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
+  when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
+  forward now happens after the drain, recv observes the full fabric
+  transfer time including bandwidth cost.
 
-물리적 그림과 일치: send는 dispatch하고 바로 반환; recv는 bytes가 실제로
-자신의 inbox로 drain될 때까지 대기.
+Matches the physical picture: send dispatches and leaves; recv waits
+until the bytes have actually been drained into its inbox.
 
-#### Backpressure latency 정확도
+### D9.5. ADR-0020 (2-pass) integration
 
-backpressure 해제까지 걸리는 시간:
+`tl.send` / `tl.recv` integrate with ADR-0020's two-pass model. Phase
+1 simulates timing **and** moves data via MemoryStore; Phase 2 enables
+op-log-based correctness verification.
 
-- **데이터 send 측 latency** = full fabric DMA (data + piggyback metadata 함께)
-- **Credit return 측 latency** = fast path with bottleneck-BW
-  (`credit_size_bytes / bottleneck_bw_on_path`)
+#### Phase 1 (timing + data)
 
-| 시나리오 | 모델링된 latency | 실제 HW와의 관계 |
-|---------|----------------|----------------|
-| Cube 내 (fast link) | 작음 (bottleneck = cube_noc BW) | topology-aware approximation |
-| Cross-cube (UCIe) | 중간 (bottleneck = UCIe BW) | topology-aware approximation |
-| Cross-SIP (PCIE) | 큼 (bottleneck = PCIE BW) | topology-aware approximation |
+D9 models head and tail updates with two different mechanisms:
 
-별도 magic latency 파라미터 없이 토폴로지에 비례한 first-order
-approximation이 자동으로 반영된다. 실제 HW와 정확히 일치하지는 않지만
-(credit fast path는 contention 모델에서 제외, credit_size_bytes는 estimate),
-magic constant 모델보다 훨씬 의미 있는 비교 가능. 정밀화는 후속 ADR로
-넘긴다.
+- **Send-side (head update)** — DMA payload piggyback. Data write and
+  metadata forward happen in the same SimPy step → automatic atomic
+  visibility.
+- **Recv-side (tail credit return)** — fast-path SimPy Store channel
+  with bottleneck-BW latency, then `peer_tail_cache` update.
 
-### D9.5. ADR-0020 (2-Pass) 통합
+Together they preserve ring-buffer pointer consistency.
 
-`tl.send/recv`는 ADR-0020의 2-pass 모델과 통합되어야 한다. Phase 1은
-타이밍과 실제 데이터 이동(MemoryStore) 모두 모델링하고, Phase 2는 op_log
-기반 정합성 검증을 가능케 한다.
+The op-log records `op_kind="ipcq"` entries for sends (with
+`src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with
+`recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).
+Two recv modes:
 
-#### Phase 1 (타이밍 + 데이터 이동)
-
-D9는 head 갱신과 tail 갱신을 다른 메커니즘으로 모델링한다:
-
-- **Send-side (head update)** — DMA payload piggyback. data write와 metadata
-  forward가 동일 SimPy step에 일어나므로 자동으로 atomic visibility 보장.
-- **Recv-side (tail credit return)** — fast path SimPy Store 채널.
-  bottleneck-BW 기반 latency 후 peer_tail_cache 갱신.
-
-두 메커니즘을 합쳐서 전체 ring buffer pointer 일관성을 유지한다.
-
-**send 시**:
-
-1. PE_IPCQ가 backpressure 체크 (peer_tail_cache 기준)
-2. PE_IPCQ가 IpcqDmaToken 생성 (data + piggyback metadata) → PE_DMA(vc_comm)에 put
-3. PE_DMA가 fabric DMA 시뮬레이션 (latency 진행)
-4. **DMA 완료와 동일한 SimPy step에 atomic 시퀀스**:
-   - **MemoryStore.write(buffer_kind, dst_pa, data)** — single-hop DMA write
-   - 수신측 PE_IPCQ에 metadata forward → peer_head_cache 갱신 → 대기 recv wake
-5. **op_log 기록**: `OpRecord(op_kind="ipcq", op_name="send", params={src_space, src_addr, dst_space, dst_addr, nbytes, dir, dtype, shape, sender_seq})`
-   - `dst_space`는 `token.dst_endpoint.buffer_kind`에서 derive된 값이다
-     (별도 token 필드가 아니다). dst_addr은 `token.dst_addr`.
-
-**recv 시**:
-
-1. PE_IPCQ가 (peer_head_cache > my_tail) AND (MemoryStore.has(slot_addr)) 조건 대기
-   (D9 piggyback 모델에서는 두 조건이 같은 step에 truthy가 되지만, defensive check)
-2. 조건 만족 시: `slot_addr = my_rx_base + slot_idx * slot_size`
-3. **두 가지 모드** (`recv_mode`로 op_log에 기록):
-   - **`return_slot`** (default): slot_addr을 그대로 PE 커널에 반환.
-     데이터 복사 없음. 커널이 slot 메모리를 직접 사용한다.
-   - **`copy_to_dst`**: 호출 시 dst_addr이 지정된 경우. slot 데이터를 읽어서
-     dst_addr에 write. `data = memory_store.read(...)`; `memory_store.write(dst_space, dst_addr, data)`
-4. PE_IPCQ가 my_tail++, fast path credit return을 발행 (D9 — vc_comm
-   fabric을 거치지 않고 별도 SimPy Store 채널로 bottleneck-BW latency 후
-   peer 측 peer_tail_cache 갱신)
-5. **op_log 기록**: `OpRecord(op_kind="ipcq", op_name="recv", params={recv_mode, src_space, src_addr, dst_space, dst_addr, nbytes, dir, dtype, shape, consumer_seq})`
-   - `recv_mode="return_slot"`: src_space/src_addr가 slot 위치, dst_addr=None
-   - `recv_mode="copy_to_dst"`: src_space/src_addr가 slot 위치, dst_space/dst_addr가 사용자 지정 위치
+- **`return_slot`** (default): the slot address is returned to the
+  kernel. Zero-copy.
+- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`,
+  PE_IPCQ copies the slot data into the user dst.
 
 #### Phase 2 (op_log replay)
 
-DataExecutor가 `op_kind="ipcq"` 레코드를 만나면:
+When `DataExecutor` encounters an `op_kind="ipcq"` record:
 
-- **send**: src → dst (peer rx slot)로 ndarray를 idempotent하게 write
-- **recv (`recv_mode="return_slot"`)**: no-op. slot 데이터는 Phase 1에서
-  이미 적절한 위치에 있으며, 커널이 해당 slot 메모리를 직접 사용함.
-- **recv (`recv_mode="copy_to_dst"`)**: slot → dst_addr로 ndarray를 idempotent
-  하게 copy
+- **send**: idempotent `src → dst` ndarray write.
+- **recv (`return_slot`)**: no-op (the slot already holds the data).
+- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.
 
-본질적으로 IPCQ는 **데이터 이동**만 하므로 Phase 2가 추가로 계산할 것은 없다.
-DataExecutor의 GEMM/Math가 그 데이터를 사용하면 자동으로 정합성이 검증된다.
+IPCQ ops are pure data movement — Phase 2 has nothing extra to compute.
+The downstream GEMM / Math ops in `DataExecutor` will consume the data
+and naturally validate correctness.
+
+### D10. Host CCL init keeps the PyTorch shape
+
+The host code looks just like real PyTorch DDP. `init_process_group`
+creates the backend object; it does **not** receive IPCQ knobs
+(neighbor topology, buffer_kind, backpressure …).
 
 ```python
-class DataExecutor:
-    def _execute_op(self, op):
-        if op.op_kind == "ipcq":
-            self._execute_ipcq(op)
-        elif op.op_kind == "memory":
-            ...
-        elif op.op_kind == "gemm":
-            ...
-
-    def _execute_ipcq(self, op):
-        """IPCQ ops are data movement; Phase 1 already wrote to MemoryStore."""
-        p = op.params
-        if op.op_name == "send":
-            data = self.store.read(p["src_space"], p["src_addr"],
-                                   shape=p["shape"], dtype=p["dtype"])
-            self.store.write(p["dst_space"], p["dst_addr"], data)
-        elif op.op_name == "recv":
-            if p.get("recv_mode") == "copy_to_dst":
-                data = self.store.read(p["src_space"], p["src_addr"],
-                                       shape=p["shape"], dtype=p["dtype"])
-                self.store.write(p["dst_space"], p["dst_addr"], data)
-            # recv_mode == "return_slot": no-op (data already in slot)
-```
-
-#### `--verify-data` 흐름 (CCL 커널)
-
-```
-1. kernbench run --bench ccl_allreduce --verify-data
-2. backend init → IPCQ buffers 할당, neighbor table install
-3. 모든 rank greenlet 동시 실행
-4. 각 PE 커널이 tl.send/recv → MemoryStore에 데이터 누적
-5. 시뮬레이션 완료 후 DataExecutor.run() → ipcq op 멱등 replay (no-op)
-6. 벤치마크가 print(out) 또는 out.data 비교 → 정합성 확인
-```
-
-벤치 작성자는 `out.data`로 결과를 읽고 expected와 비교하면 된다 (ADR-0020 D7
-Tensor.data 패턴).
-
-### D10. 호스트 CCL Init은 PyTorch 패턴 그대로
-
-호스트 코드는 실제 PyTorch distributed 코드와 동일하게 유지한다.
-`init_process_group`은 backend 객체만 만들고, IPCQ 설정 (neighbor topology,
-buffer_kind, backpressure 등)은 받지 않는다.
-
-```python
-# benches/ccl_allreduce.py — 실제 PyTorch와 동일한 호스트 코드
-def run_rank(rank, world_size, torch):
+# benches/ccl_allreduce.py — same shape as real PyTorch
+def worker(rank, world_size, torch):
     dist = torch.distributed
-    dist.init_process_group(backend="ahbm", world_size=world_size, rank=rank)
-
-    tensor = torch.zeros((M, K), dtype="f16", dp=...)
-
-    from kernbench.ccl.algorithms import ring_allreduce
-    torch.launch("ring_allreduce", ring_allreduce.kernel, tensor, rank, world_size)
+    dist.init_process_group(backend="ahbm")  # reads ccl.yaml + topology
+    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
+    tensor.copy_(torch.from_numpy(init))
+    dist.all_reduce(tensor, op="sum")
 ```
 
-IPCQ 설정은 backend가 **init_process_group 시점에** `ccl.yaml`을 읽고 즉시
-PE_IPCQ neighbor table을 install한다. 호스트 코드는 IPCQ를 인지할 필요가 없다.
+The IPCQ configuration is decided by the backend at
+`init_process_group` time: it loads `ccl.yaml`, picks the algorithm,
+and pushes IPCQ neighbor tables to every participating PE_IPCQ. The
+host code never has to know about IPCQ.
 
-벤치마크 하나는 하나의 알고리즘을 사용하는 것을 가정하며, 사용할 알고리즘은
-`ccl.yaml`의 `defaults.algorithm` 으로 지정한다 (D11). 호스트 코드 변경 없이
-ccl.yaml만 수정하여 다른 알고리즘으로 교체할 수 있다.
+A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`.
+Switching algorithms is purely a `ccl.yaml` change — no host edits
+required.
 
-#### Init 흐름 (eager)
+#### Init flow (eager)
 
-1. `init_process_group(backend="ahbm")` 호출
-2. backend가 `ccl.yaml` 로드 → `defaults.algorithm` 결정
-3. `algorithms[<algo>]`에서 topology + buffer_kind + backpressure + slot/size 결정
-4. **즉시** 모든 PE의 PE_IPCQ에 neighbor table을 install (sideband 또는 fabric `IpcqInitMsg`)
-5. 이후 `torch.launch(kernel_name, ...)`는 일반 launch와 동일하게 처리
-   (CCL kernel이든 아니든 PE_IPCQ는 이미 준비됨)
+1. `init_process_group(backend="ahbm")` is called.
+2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
+3. Pulls topology + buffer_kind + backpressure + slot config from
+   `algorithms[<algo>]`.
+4. **Immediately** installs neighbor tables on every PE_IPCQ
+   (sideband or fabric `IpcqInitMsg`).
+5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally —
+   PE_IPCQ is already prepared whether the kernel is a CCL kernel or
+   not.
 
-### D11. CCL 설정 파일 (`ccl.yaml`)
+### D11. CCL config file (`ccl.yaml`)
 
-IPCQ 설정과 알고리즘 metadata는 별도 YAML 파일에 둔다.
-`components.yaml`/`topology.yaml`과 같은 패턴을 유지하며, 변경 이력이 코드처럼
-추적 가능하다.
+IPCQ config and algorithm metadata live in a separate YAML file,
+following the same pattern as `components.yaml` and `topology.yaml`.
 
-벤치마크 한 번 실행은 한 알고리즘만 사용한다 (`defaults.algorithm`).
-다른 알고리즘으로 교체하려면 `ccl.yaml`의 `defaults.algorithm` 만 바꾸면 된다.
+A single benchmark execution runs one algorithm
+(`defaults.algorithm`). Switching algorithms means editing
+`defaults.algorithm` only.
 
 ```yaml
-# ccl.yaml — CCL backend (ahbm) configuration
-#
-# 이 파일은 init_process_group(backend="ahbm") 시점에 로드되며,
-# defaults.algorithm 으로 지정된 알고리즘에 따라 PE_IPCQ neighbor table을
-# install한다. 호스트 코드는 IPCQ 설정을 인지하지 않는다.
-
 defaults:
-  # 이번 벤치 실행에서 사용할 알고리즘. algorithms 섹션에 정의된 것 중 하나.
-  algorithm: ring_allreduce
-
-  # IPCQ ring buffer가 위치할 메모리.
-  #   tcm  — PE-local TCM (작지만 빠름, PE 내부 자원과 경쟁)
-  #   hbm  — PE-local HBM (큼, DMA latency 큼)
-  #   sram — Cube-shared SRAM (중간 크기, cube 내 PE 간 contention)
-  buffer_kind: tcm
-
-  # send/recv가 peer slot full / data 미도착을 만났을 때의 대기 방식.
-  #   poll  — peer pointer 캐시를 spin loop로 재확인
-  #   sleep — SimPy event yield 후 wakeup 대기 (interrupt-like)
-  backpressure: sleep
-
-  # Ring buffer depth (한 방향당 slot 개수). 클수록 in-flight 가능, 메모리 ↑
+  algorithm: ring_allreduce_tcm
+  buffer_kind: tcm                # tcm | hbm | sram
+  backpressure: sleep             # poll | sleep
   n_slots: 8
-
-  # Slot 하나의 크기 (bytes). 한 tile을 통째로 담을 수 있는 크기여야 함.
   slot_size: 4096
-
-  # PE_DMA virtual channel chunk 크기 (bytes). 작을수록 fair, 클수록 효율.
-  # IPCQ traffic과 compute traffic 사이의 인터리브 granularity (D8 참조).
   vc_chunk_size: 256
-
-  # Credit return fast path 메시지 크기 (bytes). 실제 HW의 credit return wire
-  # 크기를 모방. backend가 라우팅 경로의 bottleneck BW를 보고 latency를
-  # 계산한다 (D9 참조). 보통 16-64로 충분.
   ipcq_credit_size_bytes: 16
 
 algorithms:
-  # ── 알고리즘 정의 ─────────────────────────────────────────────────
-  # 각 entry는 알고리즘 모듈과 그 알고리즘이 요구하는 topology를 명시한다.
-  # 알고리즘별 default override 가능 (buffer_kind, backpressure 등).
-
-  ring_allreduce:
-    # PE 커널이 정의된 모듈. `kernel(t_ptr, rank, world_size, tl)` 함수를 export.
+  ring_allreduce_tcm:
     module: kernbench.ccl.algorithms.ring_allreduce
+    topology: ring_1d             # builtin name or "custom"
+    buffer_kind: tcm
+    n_elem: 8                     # optional, per-algorithm tile width
 
-    # 이 알고리즘이 요구하는 neighbor topology. builtin 이름 또는 "custom".
-    #   ring_1d        — 1D 양방향 ring (E/W)
-    #   ring_1d_unidir — 1D 단방향 ring (E only)
-    #   mesh_2d        — 2D mesh (N/S/E/W)
-    #   tree_binary    — binary tree (parent/children direction)
-    #   custom         — 모듈의 neighbors(rank, world_size) 함수 사용
-    topology: ring_1d
-
-  tree_allreduce:
+  tree_allreduce_7:
     module: kernbench.ccl.algorithms.tree_allreduce
     topology: tree_binary
-    # 알고리즘별 override (이 알고리즘만 hbm 사용)
-    buffer_kind: hbm
+    buffer_kind: tcm
+    world_size: 7                 # algorithm-level override
+    n_elem: 16
 
   custom_mesh:
     module: kernbench.ccl.algorithms.custom_mesh
-    topology: custom            # 모듈이 직접 neighbors() 함수 제공
+    topology: custom              # the module supplies its own neighbors()
 ```
 
-#### 알고리즘 모듈 구조
+`world_size` is **not set in `defaults`** in the example above. The
+backend resolves it in priority order:
+`algorithm-level override > defaults override > topology spec`. The
+last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP,
+where `WORLD_SIZE` comes from env vars rather than config files.
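+
+A hedged sketch of where each `world_size` source could sit in
+`ccl.yaml`. The `defaults.world_size` key is an assumption inferred
+from the precedence list above, not a confirmed field, and the values
+are illustrative only:
+
+```yaml
+defaults:
+  algorithm: tree_allreduce_7
+  world_size: 4                 # ASSUMED key: optional defaults-level override
+
+algorithms:
+  tree_allreduce_7:
+    module: kernbench.ccl.algorithms.tree_allreduce
+    topology: tree_binary
+    world_size: 7               # algorithm-level override, highest priority
+
+  ring_allreduce_tcm:
+    module: kernbench.ccl.algorithms.ring_allreduce
+    topology: ring_1d           # no world_size here: falls through to defaults
+```
+
+Under these assumptions `tree_allreduce_7` would run with 7 ranks and
+`ring_allreduce_tcm` with 4. With the `defaults.world_size` line
+removed, `ring_allreduce_tcm` would fall back to the topology spec
+(`sips × cubes_per_sip × pes_per_cube`).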
 
-알고리즘 모듈은 두 개의 hook을 export한다 — `kernel`은 필수, `neighbors`는 선택.
+#### Algorithm module structure
+
+Each algorithm module exports two hooks — `kernel` (required) and
+`neighbors` (optional) — plus a `kernel_args` helper that the
+backend uses to populate positional kernel arguments at `all_reduce`
+time:
 
 ```python
 # src/kernbench/ccl/algorithms/ring_allreduce.py
 
-def kernel(t_ptr, rank, world_size, tl):
-    """필수 — PE 커널.
+def kernel_args(world_size: int, n_elem: int) -> tuple:
+    return (n_elem, world_size)
 
-    IPCQ 설정은 backend가 ccl.yaml + neighbors() 결과로 install한 상태이다.
-    커널은 그저 4-방향 send/recv API만 사용하면 된다.
-    """
-    for step in range(world_size - 1):
-        ...
-        tl.send(dir="E", ...)
-        data = tl.recv(dir="W")
 
+def kernel(t_ptr, n_elem, world_size, tl):
+    """Required — the PE kernel.
 
-def neighbors(rank, world_size, neighbor_map):
-    """선택 — neighbor table override hook.
-
-    backend는 ccl.yaml의 topology 필드에 따라 builtin neighbor_map을 생성한 뒤,
-    이 함수가 정의되어 있으면 호출하여 결과를 override 한다.
-
-    Args:
-        rank: 이 rank의 인덱스
-        world_size: 전체 rank 수
-        neighbor_map: ccl.yaml의 topology 필드가 만든 builtin 매핑
-                      예: ring_1d → {"E": (rank+1)%ws, "W": (rank-1)%ws}
-                      mutable dict — 직접 수정 가능
-
-    Returns:
-        dict | None:
-          dict — neighbor_map을 override한 결과
-          None — override 안 함, neighbor_map 그대로 사용
-    """
-    return None  # 또는 수정 후 반환
-```
-
-#### `neighbors` override 패턴
-
-대부분의 알고리즘은 builtin topology만으로 충분하므로 `neighbors` 정의가 필요 없다.
-정의가 필요한 경우의 패턴:
-
-**Pattern A — builtin을 base로 일부만 수정**:
-```python
-def neighbors(rank, world_size, neighbor_map):
-    # 짝수 rank만 W 사용
-    if rank % 2 == 1:
-        neighbor_map.pop("W", None)
-    return neighbor_map
-```
-
-**Pattern B — 완전히 새로 만들기 (skip-connection ring 등)**:
-```python
-def neighbors(rank, world_size, neighbor_map):
-    # neighbor_map은 무시하고 새로 작성
-    return {"E": (rank + 2) % world_size}
-```
-
-#### Builtin topology generators
-
-`ccl.yaml`의 `topology` 필드가 다음 builtin 이름이면 backend가 알아서 처리:
-
-| topology | 설명 | direction set |
-|----------|------|---------------|
-| `ring_1d` | 1D 양방향 ring | E, W |
-| `ring_1d_unidir` | 1D 단방향 ring | E only |
-| `mesh_2d` | 2D mesh | N, S, E, W |
-| `tree_binary` | binary tree (root = rank 0) | parent, child_left, child_right |
-| `none` | 빈 매핑 — 알고리즘이 `neighbors()`로 처음부터 작성 | (없음) |
-
-`topology: none`은 builtin이 빈 dict를 반환하므로 알고리즘의 `neighbors()`가
-처음부터 매핑을 만들어야 한다.
-
-#### 알고리즘 추가 절차
-
-1. `src/kernbench/ccl/algorithms/<algo>.py`에 `kernel` 함수 작성
-2. `ccl.yaml`의 `algorithms` 섹션에 entry 추가 (`module`, `topology`)
-3. (선택) 같은 모듈에 `neighbors()` 함수 추가하여 builtin override
-4. `defaults.algorithm`을 새 알고리즘으로 설정하면 적용
-
-호스트 코드는 손대지 않는다.
-
-### D12. 메시지 / 토큰 스키마
-
-본 ADR이 추가하는 모든 메시지/토큰의 필드를 명시한다. 구현 시 이 정의를
-`src/kernbench/common/pe_commands.py`와 `src/kernbench/runtime_api/kernel.py`에
-그대로 추가한다.
-
-#### `IpcqInitMsg` (sideband, init 시 fan-out)
-
-backend가 모든 PE의 PE_IPCQ에 neighbor table을 install하기 위해 사용한다.
-구조는 `MmuMapMsg`와 유사 (target_sips, target_cubes, target_pe + entries).
-
-```python
-@dataclass(frozen=True)
-class IpcqInitEntry:
-    direction: str               # "N" | "S" | "E" | "W"
-    peer: IpcqEndpoint           # D2.5 참조
-    my_rx_base_pa: int           # 자신의 rx_buffer base
-    my_rx_base_va: int           # 선택
-    n_slots: int
-    slot_size: int
-    # Credit fast path 채널 (D9).
-    # 계약: 이 필드는 반드시 simpy.Store 인스턴스이며, IpcqCreditMetadata
-    # 객체만을 받는 receive endpoint이다 (peer's PE_IPCQ가 자기 입력 큐로
-    # 사용). 송신 측 PE_IPCQ는 _delayed_credit_send에서 이 store에 직접
-    # IpcqCreditMetadata를 put한다. 다른 객체 type을 put해서는 안 된다.
-    # backend init 시 양방향 SimPy Store가 한 번 wire되며 이후 변경 불가.
-    peer_credit_store: "simpy.Store[IpcqCreditMetadata]"
-
-@dataclass(frozen=True)
-class IpcqInitMsg:
-    correlation_id: str
-    request_id: str
-    target_sips: tuple[int, ...]
-    target_cubes: tuple[int, ...]
-    target_pe: int | tuple[int, ...] | str
-    entries: tuple[IpcqInitEntry, ...]   # 이 PE의 4-방향 entry
-    backpressure_mode: str       # "poll" | "sleep"
-    buffer_kind: str             # "tcm" | "hbm" | "sram"
-    credit_size_bytes: int       # D9 fast path latency 계산용 (default 16)
-```
-
-**Credit fast path channel wiring**: backend init이 모든 PE의 PE_IPCQ에
-양방향 fast path 채널을 한 번 설치한다. PE A의 IpcqInitEntry(direction=E)에
-PE B의 credit-receive Store reference를 넣어 송신 측이 직접 put할 수 있게
-한다 (별도 fabric routing 없음).
-
-#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)
-
-```python
-@dataclass(frozen=True)
-class IpcqSendCmd:
-    direction: str               # 어느 방향으로 보낼지
-    src_addr: int                # 보낼 데이터의 원본 주소 (TCM/HBM)
-    src_space: str               # "tcm" | "hbm" | "sram"
-    nbytes: int
-    shape: tuple[int, ...]       # data shape (op_log/MemoryStore용)
-    dtype: str
-    handle_id: str               # completion 추적용
-    data_op: bool = True         # ADR-0020 op_log 기록 대상
-```
-
-#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)
-
-```python
-@dataclass(frozen=True)
-class IpcqRecvCmd:
-    direction: str | None        # None이면 round-robin (weak fairness, D4)
-    # recv_mode: 두 가지 동작 모드
-    #   "return_slot" — slot 주소를 그대로 PE 커널에 반환 (default, zero-copy)
-    #   "copy_to_dst" — slot 데이터를 dst_addr에 copy 후 반환
-    recv_mode: str = "return_slot"
-    # dst_addr / dst_space는 recv_mode="copy_to_dst"일 때만 사용됨
-    dst_addr: int = 0
-    dst_space: str = ""
-    shape: tuple[int, ...] = ()  # data shape (op_log/MemoryStore용)
-    dtype: str = ""
-    handle_id: str = ""
-    blocking: bool = True        # blocking vs non-blocking
-    data_op: bool = True
-```
-
-#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm 채널)
-
-D9의 piggyback 모델에 따라 token이 data + head metadata를 함께 담아
-fabric을 따라 이동한다. 수신 측 PE_DMA가 도착 시점에 data를 dst_addr에
-write하고 metadata를 PE_IPCQ로 forward한다 (atomic).
-
-```python
-@dataclass
-class IpcqDmaToken:
-    # ── Data movement (single-hop DMA write) ──
-    src_addr: int                # 자기 메모리 주소
-    src_space: str
-    dst_addr: int                # peer rx slot 주소 (이미 계산됨)
-    dst_endpoint: IpcqEndpoint   # 라우팅용 (sip/cube/pe)
-    nbytes: int                  # data 크기
-    handle_id: str               # 완료 시 송신 측 PE_IPCQ로 알림
-
-    # ── Piggyback metadata (수신측 PE_IPCQ가 자동 갱신할 정보) ──
-    sender_seq: int              # 단조 증가 sequence number
-                                 # peer가 자기 head_cache로 사용
-    src_sip: int                 # 송신 측 (수신측이 어느 peer인지 식별)
-    src_cube: int
-    src_pe: int
-    src_direction: str           # 송신측 기준 방향 (수신측은 reverse 매핑으로 자기 direction 결정)
-
-    data_op: bool = True         # ADR-0020 op_log 기록 대상
-```
-
-PE_DMA는 token type으로 채널 결정 (D8): TileToken → vc_compute, IpcqDmaToken → vc_comm.
-
-**수신 측 PE_DMA의 처리** (vc_comm 도착 시):
-
-```python
-def _vc_comm_arrival(self, env, token: IpcqDmaToken):
-    # 1. data를 dst_addr에 write (data와 metadata atomic visibility)
-    if self._memory_store is not None:
-        data = self._memory_store.read(token.src_space, token.src_addr,
-                                        shape=..., dtype=...)
-        self._memory_store.write(token.dst_endpoint.buffer_kind, token.dst_addr, data)
-    # 2. metadata를 자기 PE의 IPCQ로 forward (PE 내부 wire, 같은 step)
-    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
-```
-
-PE_IPCQ는 `IpcqMetaArrival`을 받아 sender_seq를 보고 peer_head_cache를 갱신한다.
-
-#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path 채널)
-
-Credit return은 D9의 fast path 모델에 따라 vc_comm fabric을 거치지 않고
-**별도의 SimPy Store 채널**로 전달된다. backend init 시 양방향 channel이
-미리 wire되며, latency는 bottleneck-BW 기반으로 계산된다.
-
-```python
-@dataclass(frozen=True)
-class IpcqCreditMetadata:
-    """Credit return — recv 측 → send 측 fast path."""
-    consumer_seq: int            # my_tail (recv 측의 새 tail)
-    src_sip: int                 # 누가 보냈는지 (수신 측이 어느 peer credit인지 식별)
-    src_cube: int
-    src_pe: int
-    src_direction: str           # 송신 측 기준 방향 (수신 측은 reverse 매핑)
-```
-
-**전송 흐름**:
-
-```python
-class PeIpcqComponent:
-    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
-        yield env.timeout(latency_ns)
-        yield peer_credit_store.put(IpcqCreditMetadata(
-            consumer_seq=my_tail, src_sip=..., src_cube=..., src_pe=...,
-            src_direction=...,
-        ))
-```
-
-`latency_ns`는 D9에 정의된 대로:
-
-```python
-path = self.ctx.router.find_path(self_pe_prefix, peer_pe_prefix)
-latency_ns = self.ctx.compute_drain_ns(path, credit_size_bytes)
-```
-
-**별도의 IpcqPtrUpdate 이벤트는 없다** — head 갱신은 D9 piggyback 모델로,
-tail 갱신은 D9 fast path SimPy Store 채널로 처리된다.
-
-### D13. 테스트 전략
-
-단위/통합/regression 테스트를 명시한다.
-
-#### T1. 단위 테스트 (component-level)
-
-- **PE_IPCQ 단위** (`tests/test_pe_ipcq.py`):
-  - send: backpressure 미발생 시 즉시 PE_DMA로 token forward
-  - send: peer slot full → backpressure (poll/sleep 모드별)
-  - send: peer credit return (IpcqCreditMetadata) 도착 후 backpressure 해제
-  - recv: 데이터 도착 시 즉시 반환
-  - recv: 데이터 미도착 → wait → IpcqMetaArrival (D9 piggyback) 수신 시 wake
-  - recv (round-robin): 4-방향 중 도착한 첫 데이터 반환 (weak fairness)
-  - 잘못된 방향 → IpcqInvalidDirection 예외
-
-- **PE_DMA virtual channel** (`tests/test_pe_dma_vc.py`):
-  - vc_compute / vc_comm 독립 진행 (한 채널 stall 시 다른 채널 진행)
-  - chunk-level 인터리브 verification
-  - link BW 분할 (50/50 또는 weighted)
-
-- **builtin topology** (`tests/test_ccl_topologies.py`):
-  - ring_1d/mesh_2d/tree_binary 각각 (rank, world_size) → neighbor dict 정합성
-  - mesh_2d non-square → ValueError
-  - resolve_topology(custom, module) → module.neighbors 반환
-
-#### T2. 통합 테스트 (E2E send/recv)
-
-- **`tests/test_ipcq_e2e.py`**:
-  - 2-rank ring: rank 0 send(E) → rank 1 recv(W) → 데이터 정합성
-  - 4-rank ring: 양방향 send/recv 동시 진행, deadlock 없음
-  - mesh_2d 4×4: N/S/E/W 4방향 동시 send/recv
-
-- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`):
-  - greenlet 모드 + IPCQ → op_log에 ipcq 레코드 생성 검증
-  - DataExecutor가 ipcq op 처리 후 결과 정합성 (`out.data` 확인)
-
-#### T3. Backend init 테스트 (`tests/test_ccl_backend_ipcq.py`)
-
-- ccl.yaml 로드 → `defaults.algorithm` 추출
-- builtin topology → IpcqInitMsg fan-out
-- IpcqEndpoint의 PA가 모든 PE에서 일관 (rank A의 peer E의 rx_base_pa = rank A+1의 자기 rx_base_pa)
-- buffer_kind 별 메모리 할당 (tcm/hbm/sram)
-
-#### T4. Regression
-
-- 기존 401 tests 전부 PASS
-- ADR-0020 통합으로 인한 op_log/DataExecutor 영향 없음 (CCL 미사용 벤치)
-
-#### T5. 성능 / overhead
-
-- 단일 send/recv pair latency = (DMA latency) + (IPCQ overhead)
-- 비교: 같은 nbytes의 일반 PE_DMA write와 거의 동일해야 함 (IPCQ overhead < 100 ns)
-
-### D14. Invariants & Failure Modes
-
-CCL 인프라에서 흔히 발생하는 hang/오류 상황을 명시하고, 대응 방식을 정의한다.
-
-#### Invariants (시뮬레이션이 보장해야 하는 것)
-
-I1. **Slot lifecycle exactly-once**: 한 send → 정확히 한 recv. 중복 send나
-   중복 recv는 sequence 오류로 간주.
-
-I2. **Pointer monotonicity**: my_head, my_tail은 단조 증가 (감소 없음).
-   sender_seq는 송신 측에서 단조 증가, 수신 측 cache 갱신도 단조 증가.
-
-I3. **Endpoint consistency**: rank A의 IpcqEndpoint(direction=E)의 peer가
-   rank B라면, rank B의 IpcqEndpoint(reverse(E))의 peer는 rank A여야 함.
-   backend init 시 검증.
-
-I4. **buffer_kind consistency**: 한 ProcessGroup 내 모든 PE의 buffer_kind는
-   동일 (mixed kind는 supported 안 함, 첫 구현). 검증 실패 시 init 에러.
-
-I5. **op_log ordering**: send → DMA 완료 → recv 가능. op_log의 t_start
-   순서가 이 인과관계를 위배하지 않음.
-
-I6. **Atomic data + metadata visibility (MUST)**: 본 ADR의 correctness 핵심
-   조건이다. 수신 측에서 data write (MemoryStore.write)와 metadata forward
-   (peer_head_cache 갱신)는 동일한 SimPy step에 일어나야 한다. control이
-   data를 앞지를 수 없다.
-
-   **구현 규칙 (MUST)**:
-   - PE_DMA의 vc_comm token 도착 처리(`_vc_comm_arrival`)는 다음 두 동작
-     사이에 **어떤 SimPy yield도 두어서는 안 된다**:
-     1. `MemoryStore.write(token.dst_endpoint.buffer_kind, token.dst_addr, data)`
-     2. PE_IPCQ에 `IpcqMetaArrival` forward
-   - 두 동작은 동일 SimPy event callback 내에서 연속 실행되어야 한다.
-   - 코드 리뷰에서 이 사이에 `yield` (또는 `yield from`)을 추가하는 것은
-     correctness 위반으로 reject한다.
-
-   이 규칙을 위반하면 다른 SimPy process가 끼어들어 head_cache가 data
-   visibility보다 먼저 또는 늦게 보이는 race condition이 발생한다.
-
-I7. **MemoryStore slot existence ↔ pointer**: I6의 결과로,
-   `peer_head_cache > my_tail`이 truthy가 되는 step과 `MemoryStore.has(slot_addr)`
-   이 truthy가 되는 step이 동일하다. recv는 두 조건을 모두 체크하지만 (defensive),
-   단일 조건만 체크해도 정확하다.
-
-#### Failure Modes (런타임 에러)
-
-F1. **잘못된 direction**:
-   - PE 커널이 `tl.send(dir="X")` 호출 → install 안 된 direction
-   - PE_IPCQ가 즉시 `IpcqInvalidDirection` 예외 raise
-   - SimPy 시뮬레이션 즉시 abort, 사용자에게 명확한 에러
-
-F2. **타입 mismatch**:
-   - send와 recv의 dtype/shape/nbytes가 일치하지 않음
-   - 첫 구현은 검증 안 함 (dtype/shape는 hint), 향후 strict mode로 추가
-
-F3. **Deadlock detection (timeout 기반)**:
-   - send: peer_tail_cache가 갱신 안 되고 영원히 wait
-   - recv: peer_head_cache 갱신 안 되고 영원히 wait
-   - 시뮬레이션 timeout (default 10ms simulated time) 초과 시 abort
-   - 디버그를 위해 각 PE의 last send/recv 위치, blocking 상태 dump
-
-F4. **Backend init 실패**:
-   - ccl.yaml에 `defaults.algorithm` 누락
-   - `algorithms[name]` 정의 누락
-   - 알고리즘 모듈 import 실패
-   - topology 검증 실패 (I3, I4)
-   → 모두 `init_process_group` 시점에 즉시 에러
-
-F5. **Slot full + 무한 backpressure**:
-   - peer가 영원히 안 받음
-   - F3과 같이 timeout으로 처리
-   - 디버그: 막힌 PE의 my_head, peer_tail_cache 출력
-
-#### 진단 도구 (구현 단계에서 추가)
-
-- **CCL trace**: 각 send/recv를 (rank, t, dir, nbytes) 형태로 로깅
-- **Pointer dump**: 시뮬레이션 종료 시 또는 hang 시 모든 PE의 IPCQ pointer 상태 출력
-- **Deadlock graph**: hang 발생 시 wait-for 그래프 출력 (어느 PE가 어떤 PE를 기다리는지)
-
-### D15. 알고리즘 작성자 가이드 (요약)
-
-본 섹션은 알고리즘 작성자가 한 화면으로 시작점을 잡을 수 있도록 한다.
-자세한 step-by-step 가이드는 [docs/onboarding/ccl-author-guide.md](../onboarding/ccl-author-guide.md) 참조.
-
-#### 만지는 것 / 만지지 않는 것
-
-| 만지는 것 | 만지지 않는 것 |
-|----------|---------------|
-| `src/kernbench/ccl/algorithms/<your_algo>.py` (kernel + 선택적 neighbors) | `benches/ccl_allreduce.py` 호스트 코드 |
-| `ccl.yaml` 의 한 entry 추가 + `defaults.algorithm` | `src/kernbench/ccl/` 프레임워크 |
-| (선택) `tests/test_<your_algo>.py` 단위 테스트 | `src/kernbench/components/builtin/pe_ipcq.py` 컴포넌트 |
-| | `src/kernbench/runtime_api/distributed.py` backend |
-
-#### 알고리즘 모듈 인터페이스 contract
-
-```python
-# src/kernbench/ccl/algorithms/<your_algo>.py
-
-def kernel(*args, tl) -> None:
-    """필수. PE 커널.
-
-    Args (positional): tensor pointers, rank, world_size, 알고리즘 파라미터
-    Args (keyword): tl — TLContext (자동 주입)
-
-    사용 가능한 IPCQ API:
-        tl.send(dir, src_addr, nbytes)         # blocking, backpressure 시 wait
-        tl.recv(dir)                            # 특정 방향에서 blocking recv
-        tl.recv()                               # 4방향 round-robin
-        tl.recv_async(dir) → handle             # non-blocking
-        tl.wait(handle)                         # non-blocking 완료 대기
-
-    기존 API도 그대로 사용:
-        tl.load / tl.store / tl.composite / tl.program_id 등
+    IPCQ is already installed by the backend before this is called.
+    The kernel only uses the four-direction send / recv API.
     """
     ...
 
-def neighbors(rank, world_size, neighbor_map) -> dict | None:
-    """선택. ccl.yaml의 builtin topology가 만든 neighbor_map을 override.
 
-    None 반환 → builtin 그대로 사용
-    dict 반환 → 그 dict로 override (builtin을 base로 수정 가능)
+def neighbors(rank, world_size, neighbor_map):
+    """Optional — override the builtin topology's neighbor map.
+
+    Returns a new dict, the modified-in-place dict, or None to keep the
+    builtin map.
     """
     return None
 ```
 
-#### 5-step 흐름
+#### `neighbors` override patterns
 
-1. **kernel 함수 작성** — `src/kernbench/ccl/algorithms/<algo>.py` 신규 파일
-2. **ccl.yaml 등록** — `algorithms.<name>` entry + `defaults.algorithm` 변경
-3. **(선택) neighbors override** — builtin topology를 base로 수정이 필요할 때
-4. **단위 테스트** — `kernbench.ccl.testing.run_kernel_in_mock` (SimPy 없이 빠름)
-5. **시뮬 검증** — `kernbench run --bench ccl_allreduce --verify-data`
+- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
+- **Pattern B — replace entirely**: ignore `neighbor_map` and return a
+  brand-new dict.
+- **Pattern C — keep builtin**: omit `neighbors` or return None.
 
-호스트 코드 (`benches/ccl_allreduce.py`)는 손대지 않는다.
+#### Builtin topologies
 
-#### 사용 가능한 헬퍼 (`kernbench.ccl.helpers`)
+| topology | direction set |
+|----------|---------------|
+| `ring_1d` | E, W |
+| `ring_1d_unidir` | E only |
+| `mesh_2d` | N, S, E, W |
+| `tree_binary` | parent, child_left, child_right |
+| `none` | (empty) — algorithm must supply `neighbors()` |
 
-| Helper | 설명 |
-|--------|------|
-| `chunked(addr, n_chunks, ...)` | 텐서를 n개 chunk view로 슬라이싱 |
-| `ring_step(rank, step, ws)` | ring algorithm의 step별 (send_idx, recv_idx) |
-| `tree_step(rank, level)` | binary tree의 level별 parent/child 인덱스 |
+#### Adding a new algorithm
 
-#### 디버깅 도구
+1. Write `kernel` and `kernel_args` in
+   `src/kernbench/ccl/algorithms/<algo>.py`.
+2. Add an entry in `ccl.yaml`'s `algorithms` section.
+3. (Optional) provide `neighbors()` for custom topology.
+4. Set `defaults.algorithm` to the new algorithm.
 
-- `KERNBENCH_CCL_TRACE=1` — send/recv trace 출력
-- 시뮬 종료 시 자동 IPCQ pointer dump
-- Deadlock 시 (10ms 시뮬 시간 초과) wait-for graph dump
+The host bench (`benches/ccl_allreduce.py`) does not change.
 
-#### 흔한 실수
+### D12. Message / token schema
 
-1. **install 안 된 direction 사용** — ccl.yaml의 topology가 ring_1d면 N/S 사용 불가
-2. **send/recv 짝 맞지 않음** — peer 측 recv 없으면 hang (slot full backpressure)
-3. **dtype/shape 불일치** — 첫 구현은 검증 안 함, 작성자 책임
+This ADR adds the message and token types below. They live in
+`src/kernbench/common/pe_commands.py` and
+`src/kernbench/runtime_api/kernel.py`.
 
-자세한 step-by-step과 hello-world 예제는 `docs/onboarding/ccl-author-guide.md` 참조.
+#### `IpcqInitMsg` (sideband, fan-out at init)
+
+The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors
+`MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`).
+Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`,
+`my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store`
+field: a `simpy.Store` instance pre-wired at init so the PE_IPCQ that
+returns a credit (the data receiver) can put `IpcqCreditMetadata`
+directly into its peer's credit input queue.
+
+#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)
+
+Carries `direction`, source addr/space, nbytes, shape, dtype, and a
+handle id. `data_op=True` so it lands in the op_log.
+
+#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)
+
+Carries `direction` (or None for round-robin), `recv_mode`
+(`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape,
+dtype, blocking flag.
+
+#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)
+
+Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`)
+plus the head metadata (`sender_seq`, `src_sip/cube/pe`,
+`src_direction`). PE_DMA picks the channel by token type
+(`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`).
+
+The receiver's PE_DMA, on token arrival, performs the I6 atomic
+sequence: write data into MemoryStore, then forward `IpcqMetaArrival`
+to the local PE_IPCQ.
+
+#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)
+
+Carries `consumer_seq` (= my_tail), source PE coords, and source
+direction. Travels through the dedicated SimPy Store channel rather
+than `vc_comm`. Latency = `credit_size_bytes / bottleneck_bw_on_path`.
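+
+As a worked example, take the default `ipcq_credit_size_bytes: 16` and
+assume, purely for illustration, a path whose slowest link sustains
+32 GB/s (32 bytes per ns). That bandwidth figure is hypothetical, not
+taken from the topology spec:
+
+$$
+t_{\text{credit}}
+  = \frac{\text{credit\_size\_bytes}}{\text{bottleneck\_bw\_on\_path}}
+  = \frac{16\ \text{B}}{32\ \text{B/ns}}
+  = 0.5\ \text{ns}
+$$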
+
+There is **no `IpcqPtrUpdate` event** — head updates flow via D9
+piggyback, tail updates via the D9 fast-path channel.
+
+### D13. Test strategy
+
+The plan covers unit, integration, backend-init, regression, and
+performance tests.
+
+#### T1. Unit tests (component-level)
+
+- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure
+  immediately forwards a token; a full peer slot triggers backpressure
+  (poll / sleep modes); recv waits and wakes on `IpcqMetaArrival`;
+  round-robin recv gives weak fairness; bad direction →
+  `IpcqInvalidDirection`.
+- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute`
+  / `vc_comm` independent progress, chunk interleave, BW split.
+- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d /
+  mesh_2d / tree_binary correctness, mesh_2d non-square →
+  `ValueError`, custom resolver returns the module's `neighbors`.
+
+#### T2. Integration tests (E2E send/recv)
+
+- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional
+  no-deadlock), 4×4 mesh.
+- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode
+  records `ipcq` ops in op_log; DataExecutor produces correct
+  `out.data`.
+
+#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)
+
+`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA
+consistency, per-`buffer_kind` allocation.
+
+#### T4. Regression
+
+All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for
+non-CCL benches.
+
+#### T5. Performance / overhead
+
+Single send/recv pair latency = (DMA latency) + (IPCQ overhead).
+It should be close to a regular PE_DMA write of the same nbytes (IPCQ
+overhead < 100 ns).
+
+### D14. Invariants and failure modes
+
+#### Invariants
+
+I1. **Slot lifecycle exactly-once**: one send → exactly one recv.
+
+I2. **Pointer monotonicity**: `my_head` / `my_tail` are monotonically
+   non-decreasing; `sender_seq` is strictly increasing.
+
+I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank
+   B, then rank B's reverse-direction peer must be rank A. Verified at
+   init.
+
+I4. **`buffer_kind` consistency**: all PEs in a process group share
+   the same `buffer_kind` (no mixed mode in the first cut).
+
+I5. **op_log ordering**: send → DMA complete → recv possible. The
+   `t_start` order in op_log respects this causality.
+
+I6. **Atomic data + metadata visibility (MUST)**: at the receiver
+   side, data write (`MemoryStore.write`) and metadata forward
+   (`peer_head_cache` update) **must execute in the same SimPy step**.
+   No yield is allowed between the two operations in PE_DMA's vc_comm
+   handler. Code review must reject any inserted `yield` (or
+   `yield from`); it would create a race where head_cache becomes
+   visible before or after the data.
+
+I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6,
+   the step in which `peer_head_cache > my_tail` becomes true is the
+   same step in which the slot data is observable.
+
+#### Failure modes (runtime errors)
+
+F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction
+   → `IpcqInvalidDirection`, simulation aborts.
+
+F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched
+   send and recv. Not validated by default; opt-in strict mode catches
+   it (`strict_validation: true` in a PE_IPCQ node's attrs).
+
+F3. **Deadlock detection (timeout-based)**: the simulator empties its
+   schedule while a send/recv is still pending → engine raises
+   `IpcqDeadlock` and embeds a pointer dump.
+
+F4. **Backend init failure**: missing `defaults.algorithm`, missing
+   `algorithms[name]`, module import failure, topology validation
+   failure (I3, I4). All are raised at `init_process_group` time.
+
+F5. **Slot full + infinite backpressure**: the peer never recvs.
+   Surfaces as F3 timeout.
+
+#### Diagnostics
+
+- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as
+  `(rank, t, dir, nbytes)`.
+- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)`
+  prints every PE_IPCQ ring buffer's `my_head`, `my_tail`,
+  `peer_head_cache`, `peer_tail_cache`.
+- **Deadlock dump**: on hang the engine includes the pointer dump in
+  the `IpcqDeadlock` exception message.
+
+### D15. Algorithm-author cheat sheet
+
+The full step-by-step guide lives in
+[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
+shortest version:
+
+| Things you touch | Things you don't |
+|------------------|-------------------|
+| `src/kernbench/ccl/algorithms/<your_algo>.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
+| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
+| (Optional) `tests/test_<your_algo>.py` mock test | PE_IPCQ component, AhbmCCLBackend |
+
+5-step flow: write the kernel → register in `ccl.yaml` → optional
+`neighbors` override → optional mock unit test → SimPy validation via
+`kernbench run --bench ccl_allreduce --verify-data`.
+
+Common mistakes: using a direction that wasn't installed, sends
+without matching recvs (deadlock), dtype/shape disagreement, assuming
+fairness from `tl.recv()` round-robin, confusing
+`tl.num_programs(axis)` with the CCL group size.
 
 ---
 
@@ -1188,7 +830,7 @@ contract via Python/SimPy equivalents in
 [pe_ipcq.py](../../src/kernbench/components/builtin/pe_ipcq.py) and
 [pe_dma.py](../../src/kernbench/components/builtin/pe_dma.py).
 
-### D16. Proposed HW Block Diagram and End-to-End Dataflow
+### D16. Proposed HW block diagram and end-to-end dataflow
 
 ![PE Baseline Architecture](../diagrams/pe_baseline.png)
 
@@ -1198,14 +840,14 @@ contract via Python/SimPy equivalents in
 
 > Source: [`../diagrams/pe_proposed.d2`](../diagrams/pe_proposed.d2) — `d2 --layout=elk`.
 
-**Baseline → Proposed 핵심 변경**:
+**Baseline → Proposed key changes**:
 
-- 단일 FIFO inbox → **compute port / IPCQ port 분리 + WRR Arbiter** (NEW)
+- Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW)
 - PE_IPCQ (SimPy component) → **IPCQ Controller** (HW register + combinational logic)
-- TCM 내 **IPCQ Slot Region 예약 영역** 명시
-- Credit Injector / Receiver가 Fabric Port를 통해 NoC에 직접 연결
+- **IPCQ Slot Region reserved area** within TCM
+- Credit Injector / Receiver connect directly to the NoC via the Fabric Port
 
-#### End-to-End Sequence (HW view)
+#### End-to-end sequence (HW view)
 
 ```mermaid
 sequenceDiagram
@@ -1257,15 +899,15 @@ sequenceDiagram
     Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled)
 ```
 
-### D17. IPCQ Controller HW Module (신규)
+### D17. IPCQ Controller HW Module (NEW)
 
-PE_CPU와 DMA Engine 사이에 위치하는 하드웨어 제어 블록. 시뮬레이터의
-`PeIpcqComponent`에 대응한다.
+The hardware control block sitting between PE_CPU and the DMA Engine.
+Corresponds to the simulator's `PeIpcqComponent`.
 
 #### QPair Register File
 
-방향별 queue pair 상태를 flip-flop으로 유지. PE_CPU가 MMIO(CSR)로 읽기/쓰기
-가능하며, init 시점에 소프트웨어가 채워넣는다.
+Per-direction queue-pair state held in flip-flops. The PE_CPU reads /
+writes them via MMIO (CSRs); software populates them at init time.
 
 ```
 Per-direction registers (each 64-bit):
@@ -1275,12 +917,12 @@ Per-direction registers (each 64-bit):
   peer_tail_cache  — last known peer tail (updated by Credit Receiver)
   rx_base_pa       — this PE's rx buffer base physical address
   peer_rx_base_pa  — peer's rx buffer base physical address
-  n_slots          — ring depth (power-of-2 제약, D21 참조)
+  n_slots          — ring depth (power-of-2 constraint, see D21)
   slot_size        — bytes per slot
-  peer_credit_tgt  — peer PE의 credit receive 주소
+  peer_credit_tgt  — peer PE's credit-receive address
 
-Directions: 최대 8 (N/S/E/W/parent/child_left/child_right + spare)
-Total: 8 dirs × 9 regs × 8B = 576B flip-flops
+Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare)
+Total: 8 dirs × 9 regs × 8 B = 576 B of flip-flops
 ```
 
 #### Slot Address Generator (combinational)
@@ -1292,9 +934,9 @@ Output: slot_addr = base_pa + (pointer % n_slots) * slot_size
 Implementation:
   n_slots power-of-2 → pointer & (n_slots - 1)   (AND mask, 1 gate)
   slot_size power-of-2 → barrel shift             (1 cycle)
-  64-bit add → ripple/kogge-stone adder           (1 cycle)
+  64-bit add → ripple / Kogge-Stone adder         (1 cycle)
 
-Latency: 1-2 cycles combinational
+Latency: 1–2 combinational cycles
 ```
 
 #### Backpressure Comparator (combinational)
@@ -1309,8 +951,8 @@ Latency: 1 cycle
 
 #### Meta Extractor (inbound datapath sideband)
 
-DMA Engine의 inbound vc_comm path에 wired. 도착하는 IPCQ flit의 header에서
-metadata를 추출하여 queue pair 상태를 갱신한다.
+Wired into the DMA Engine's inbound vc_comm path. Extracts metadata
+from arriving IPCQ flit headers and updates queue-pair state.
 
 ```
 Trigger: DMA inbound write completion (same cycle)
@@ -1322,24 +964,24 @@ Direction matching (ADR-0025 D2):
   8× parallel range comparators + priority encoder
 
 Update: peer_head_cache[matched_dir] = max(peer_head_cache, sender_seq + 1)
-Output: recv_wake signal → PE_CPU interrupt/flag
-Latency: 1 cycle (pipelined with DMA write — I6 atomicity 자연 보장)
+Output: recv_wake signal → PE_CPU interrupt / flag
+Latency: 1 cycle (pipelined with the DMA write — I6 atomicity is intrinsic)
 ```
 
 #### Credit Injector (outbound)
 
 ```
-Trigger: recv completion (my_tail 증가 후)
-Action:  pack 16B credit packet → DMA vc_comm (또는 dedicated credit VC)
+Trigger: recv completion (after my_tail increments)
+Action:  pack a 16 B credit packet → DMA vc_comm (or a dedicated credit VC)
 
 Packet: {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa}
-Latency: 1 cycle to generate, then NoC traversal
+Latency: 1 cycle to generate; then NoC traversal
 ```
 
 #### Credit Receiver (inbound sideband)
 
 ```
-Trigger: 16B credit packet arrival (from NoC)
+Trigger: 16 B credit packet arrival (from NoC)
 Extract: {consumer_seq, dst_rx_base_pa}
 
 Direction matching (ADR-0025 D3):
@@ -1351,67 +993,70 @@ Output: send_wake signal → deassert backpressure stall
 Latency: 1 cycle
 ```
 
-### D18. DMA Engine vc_comm IPCQ-aware Mode
+### D18. DMA Engine vc_comm IPCQ-aware mode
 
-기존 vc_comm 채널(D8)에 IPCQ flit 처리 모드를 추가한다.
+Add IPCQ-flit handling to the existing vc_comm channel (D8).
 
 **Outbound**:
 
-1. IPCQ Controller로부터 command 수신: `{src_addr, dst_addr, nbytes, sender_seq}`
-2. TCM에서 src_addr read → DMA read buffer에 snapshot (standard DMA behavior)
-3. Flit pack: data + piggyback metadata (sender_seq, dst_addr)
-4. NoC fabric port에 inject
-5. Fire-and-forget (completion 미대기)
+1. Receive a command from the IPCQ Controller: `{src_addr, dst_addr, nbytes, sender_seq}`.
+2. Read `src_addr` from TCM → snapshot into the DMA read buffer (standard DMA behavior).
+3. Pack flit: data + piggyback metadata (`sender_seq`, `dst_addr`).
+4. Inject into the NoC fabric port.
+5. Fire-and-forget (no completion wait).
 
 **Inbound**:
 
-1. NoC로부터 IPCQ flit 수신
-2. Terminal BW drain charge (`drain_ns = nbytes / bottleneck_bw`)
-3. Slot write latency charge (backing memory tier)
+1. Receive an IPCQ flit from the NoC.
+2. Charge terminal BW drain (`drain_ns = nbytes / bottleneck_bw`).
+3. Charge slot write latency (per backing memory tier).
 4. **ATOMIC** (same pipeline stage, no stall insertion):
-   - TCM write: data → slot address
-   - Meta Extractor trigger: sender_seq + dst_addr → IPCQ Controller
-5. Done
+   - TCM write: data → slot address.
+   - Meta Extractor trigger: `sender_seq` + `dst_addr` → IPCQ Controller.
+5. Done.
 
-**I6 atomicity 하드웨어 보장**: TCM write completion과 Meta Extractor trigger가
-동일 pipeline stage에서 발생하므로 별도 synchronization이 불필요. 시뮬레이터의
-"no SimPy yield between MemoryStore.write and IpcqMetaArrival put" (D9, I6)이
-자연스럽게 보장된다.
+**I6 atomicity guaranteed in hardware**: TCM write completion and Meta
+Extractor trigger occur in the same pipeline stage, so no separate
+synchronization is needed. The simulator's "no SimPy yield between
+`MemoryStore.write` and `IpcqMetaArrival` put" (D9, I6) is preserved
+naturally.
 
-#### Data Snapshot Semantics
+#### Data snapshot semantics
 
-DMA read buffer에 latch된 데이터는 src memory의 이후 수정에 영향받지 않는다.
-이는 DMA standard read-then-write behavior이므로 추가 HW 불필요.
+Data latched into the DMA read buffer is unaffected by subsequent
+writes to `src` memory. This is standard DMA read-then-write
+behavior; no extra HW is required.
 
-#### Credit Virtual Channel (선택적)
+#### Credit virtual channel (optional)
 
-- **옵션 A**: vc_comm에 credit을 multiplexing (16B header-only flit으로 구분).
-- **옵션 B**: 3rd dedicated credit VC 추가 (strict priority > data).
+- **Option A**: multiplex credits onto vc_comm (distinguish via 16 B
+  header-only flits).
+- **Option B**: add a third dedicated credit VC (strict priority > data).
 
-옵션 B가 deadlock prevention에 유리하나, 16B credit의 BW 영향이 무시 가능하므로
-옵션 A로도 충분.
+Option B is friendlier to deadlock prevention, but a 16 B credit's BW
+impact is negligible, so Option A suffices.
 
-### D19. Fabric Flit Format Extension
+### D19. Fabric flit format extension
 
 ```
-일반 data flit (예: 512-bit):
+Generic data flit (e.g. 512-bit):
 ┌──────────────────────────────────────────┐
 │ [511:480] routing header (32b)           │
-│ [479:0]   payload (480b = 60B)           │
+│ [479:0]   payload (480b = 60 B)          │
 └──────────────────────────────────────────┘
 
-IPCQ data flit (첫 flit에만 metadata 포함):
+IPCQ data flit (only the first flit carries metadata):
 ┌──────────────────────────────────────────┐
 │ [511:480] routing header (32b)           │
-│   [511]    ipcq_flag (1b)                │  ← IPCQ vs normal DMA 식별
+│   [511]    ipcq_flag (1b)                │  ← IPCQ vs. normal DMA
 │   [510:509] vc_id (2b)                   │
 │   [508:480] route + hop count            │
 │ [479:416] ipcq_metadata (64b)            │  ← piggyback
 │   [479:448] sender_seq (32b)             │
-│   [447:416] dst_addr[31:0] (32b)         │  ← direction matching용
-│ [415:0]   payload (416b = 52B)           │
+│   [447:416] dst_addr[31:0] (32b)         │  ← used for direction match
+│ [415:0]   payload (416b = 52 B)          │
 └──────────────────────────────────────────┘
-후속 flits: full 60B payload (metadata 없음)
+Subsequent flits: full 60 B payload (no metadata).
 
 Credit-only flit (128-bit, header-only):
 ┌──────────────────────────────────────────┐
@@ -1422,227 +1067,242 @@ Credit-only flit (128-bit, header-only):
 └──────────────────────────────────────────┘
 ```
 
-첫 flit의 payload가 60B → 52B로 감소 (13% overhead). Multi-flit transfer에서는
-후속 flit이 full payload이므로 대형 전송에서 overhead < 1%.
+First-flit payload shrinks from 60 B to 52 B (13 % overhead). For
+multi-flit transfers the subsequent flits carry full payloads, so
+overhead < 1 % on large transfers.
 
-### D20. TCM IPCQ Slot Region Layout
+### D20. TCM IPCQ slot region layout
 
 ```
-TCM Memory Map (16MB):
+TCM Memory Map (16 MB):
 ┌─────────────────────────────┐ 0x000000
 │  Kernel Working Memory      │
 │  (compute tensors)          │
-│  ~14MB                      │
+│  ~14 MB                     │
 ├─────────────────────────────┤ 0xE00000
 │  IPCQ RX Buffers            │
 │  Dir N: slots × slot_size   │
 │  Dir S: slots × slot_size   │
 │  Dir E: slots × slot_size   │
 │  Dir W: slots × slot_size   │
-│  ~1MB                       │
+│  ~1 MB                      │
 ├─────────────────────────────┤ 0xF00000
 │  IPCQ Metadata / Scratch    │
-│  ~1MB                       │
+│  ~1 MB                      │
 └─────────────────────────────┘ 0xFFFFFF
 ```
 
-IPCQ region을 TCM의 상위 bank에 배치하여 compute access와의 bank conflict를
-최소화한다 (Risk D22 참조).
+Place the IPCQ region in the upper TCM bank to minimize bank conflict
+with compute accesses (see Risk D22).
 
-### D21. 2nm Implementation Analysis
+### D21. 2 nm implementation analysis
 
-#### Area Estimate
+#### Area estimate
 
-| Module | Gate Count | Area (2nm est.) | Notes |
+| Module | Gate count | Area (2 nm est.) | Notes |
 |---|---|---|---|
-| QPair Register File | ~4.6K FF | 0.002 mm² | 576B flip-flops |
-| Slot Addr Gen + Backpressure | ~5K gates | 0.001 mm² | Combinational |
-| Meta Extractor + Credit Logic | ~3K gates | 0.001 mm² | 8× parallel comparators |
-| **IPCQ Controller subtotal** | **~12.6K** | **~0.004 mm²** | **PE 전체 대비 < 0.1%** |
-| DMA vc_comm 확장 | ~2K gates | 0.002 mm² | Flit pack/unpack |
-| **Total 변경분** | **~14.6K** | **~0.006 mm²** | |
+| QPair Register File | ~4.6 K FF | 0.002 mm² | 576 B of flip-flops |
+| Slot Addr Gen + Backpressure | ~5 K gates | 0.001 mm² | Combinational |
+| Meta Extractor + Credit Logic | ~3 K gates | 0.001 mm² | 8× parallel comparators |
+| **IPCQ Controller subtotal** | **~12.6 K** | **~0.004 mm²** | **< 0.1 % of the PE area** |
+| DMA vc_comm extension | ~2 K gates | 0.002 mm² | Flit pack / unpack |
+| **Total delta** | **~14.6 K** | **~0.006 mm²** | |
 
 #### Timing
 
-| Path | Delay (2nm est.) | Target Clock | Margin |
+| Path | Delay (2 nm est.) | Target clock | Margin |
 |---|---|---|---|
 | Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× |
 | Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× |
 | Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× |
 | Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× |
 
-모든 critical path가 1 cycle 이내. Timing closure 문제 없음.
+All critical paths fit within one cycle. Timing closure is not a
+concern.
 
 #### Power
 
-- Active: ~1 mW (register R/W + comparators, send/recv 동작 시)
-- Idle: leakage only
-- PE 전체 전력 대비 무시 가능
+- Active: ~1 mW (register R/W + comparators while sending / receiving).
+- Idle: leakage only.
+- Negligible vs. total PE power.
 
 #### Constraints
 
-| 항목 | 제약 | 근거 |
+| Item | Constraint | Rationale |
 |---|---|---|
-| `n_slots` | **반드시 power-of-2** | mod → AND mask (1 gate). 임의 값은 divider 필요 (~10 cycles) |
-| `slot_size` | **power-of-2 권장** | mul → barrel shift. 임의 값은 multiplier 필요 |
-| TCM IPCQ region | **전용 bank 배치** | Compute access와 bank conflict 방지 |
+| `n_slots` | **must be power-of-2** | mod → AND mask (1 gate). Arbitrary values need a divider (~10 cycles). |
+| `slot_size` | **power-of-2 recommended** | mul → barrel shift. Arbitrary values need a multiplier. |
+| TCM IPCQ region | **dedicated bank** | Prevents bank conflict with compute accesses. |
 
-### D22. Risk Assessment
+### D22. Risk assessment
 
-#### TCM Bank Conflict
+#### TCM bank conflict
 
-- **Risk**: IPCQ slot write와 compute read가 동일 bank 접근 시 stall
-- **Mitigation**: IPCQ region을 TCM 상위 address의 전용 bank에 배치 (D20)
-- **Cost**: TCM banking flexibility 소폭 감소
-- **Severity**: Medium (성능 영향), Low (correctness 문제 아님)
+- **Risk**: IPCQ slot write and compute read both target the same TCM
+  bank → stall.
+- **Mitigation**: place the IPCQ region in a dedicated upper-address
+  bank (D20).
+- **Cost**: a small loss of TCM banking flexibility.
+- **Severity**: Medium (performance), Low (no correctness issue).
 
-#### Credit Return Latency under Congestion
+#### Credit return latency under congestion
 
-- **Risk**: NoC 혼잡 시 credit return 지연 → sender backpressure stall
+- **Risk**: NoC congestion → credit-return delay → sender backpressure
+  stall.
 - **Mitigation**:
-  - Credit을 별도 VC로 분리 + strict priority (16B로 BW impact 미미)
-  - 또는 n_slots를 넉넉히(8+) 설정하여 credit 지연을 buffer로 흡수
-- **Severity**: Low (credit 16B는 congestion에 거의 기여하지 않음)
+  - Put credits on a separate VC with strict priority (16 B →
+    negligible BW impact).
+  - Or pick `n_slots` generously (8+) so credit delay is absorbed by
+    buffer depth.
+- **Severity**: Low (16 B credits contribute almost nothing to
+  congestion).
 
-#### Inter-Direction Ordering
+#### Inter-direction ordering
 
-- **Risk**: 같은 PE에서 여러 방향으로 동시 send 시 순서
-- **Mitigation**: Per-direction monotonic seq으로 충분. Inter-direction ordering은
-  kernel(소프트웨어) 책임 — 현재 시뮬레이터 모델과 동일 (D2 + D4)
-- **Severity**: Low (아키텍처 설계에 의해 해소)
+- **Risk**: ordering of simultaneous sends from one PE across multiple directions.
+- **Mitigation**: per-direction monotonic `sender_seq` suffices.
+  Inter-direction ordering is the kernel's (software's)
+  responsibility — same as the simulator model (D2 + D4).
+- **Severity**: Low (resolved by design).
 
-### D23. HW Alternatives Considered
+### D23. HW alternatives considered
 
-#### Doorbell + Polling (전통적 방식)
+#### Doorbell + polling (traditional)
 
 ```
-Send: DMA write data → DMA write doorbell register at peer → peer polls doorbell
-Recv: Polling loop on doorbell, or interrupt-driven
+Send: DMA write data → DMA write a doorbell register at the peer → peer polls doorbell
+Recv: polling loop on the doorbell, or interrupt-driven
 ```
 
-| 장점 | 단점 |
+| Pros | Cons |
 |---|---|
-| 단순한 HW (IPCQ controller 불필요) | 2번의 DMA transaction (data + doorbell) |
-| 기존 DMA 재사용 | Data/doorbell 사이 ordering 보장 필요 (fence) |
-| | Polling은 전력 낭비, interrupt는 latency overhead |
+| Simple HW (no IPCQ controller) | Two DMA transactions (data + doorbell) |
+| Reuses existing DMA | Needs explicit fence between data and doorbell |
+| | Polling burns power; interrupt adds latency |
 
-**평가**: Piggyback 대비 latency 2-3× 증가. **불채택.**
+**Verdict**: 2–3× latency vs. piggyback. **Rejected.**
 
-#### Hardware Message Queue (NVIDIA NVLink 스타일)
+#### Hardware message queue (NVIDIA NVLink style)
 
 ```
-Send: CPU → HMQ에 descriptor push → HW가 peer HMQ로 자동 전달
-Recv: HMQ에서 descriptor pop → data pointer 확인
+Send: CPU → push a descriptor onto HMQ → HW relays it to the peer HMQ
+Recv: pop a descriptor from HMQ → use the data pointer
 ```
 
-| 장점 | 단점 |
+| Pros | Cons |
 |---|---|
-| CPU는 descriptor만 작성 | 별도 HMQ engine 필요 (~0.05 mm²) |
-| Descriptor/data 분리 → 유연 | DMA와 별개 datapath → area/power 중복 |
-| | Large tensor에는 결국 DMA 필요 |
+| CPU only writes descriptors | Needs a separate HMQ engine (~0.05 mm²) |
+| Descriptor / data separation is flexible | Separate datapath from DMA → duplicated area / power |
+| | Large tensors still need DMA |
 
-**평가**: CCL의 large tensor 패턴에서 DMA 필수이므로 HMQ + DMA 이중 구조는
-면적 낭비. **불채택.**
+**Verdict**: With CCL's large-tensor pattern, DMA is still required,
+so HMQ + DMA is a duplicated datapath. **Rejected.**
 
-#### RDMA-style Completion Queue (CQ)
+#### RDMA-style completion queue (CQ)
 
 ```
-Send: DMA write → peer에 CQE 자동 생성
-Recv: CQ poll/interrupt → data 위치 확인
+Send: DMA write → CQE auto-posted at the peer
+Recv: CQ poll / interrupt → read data location
 ```
 
-| 장점 | 단점 |
+| Pros | Cons |
 |---|---|
-| InfiniBand/RoCE 성숙 모델 | CQ 관리 logic + CQE memory overhead |
-| Multi-tenant/isolation 용이 | CQE/data ordering 보장 추가 필요 |
-| | PE-to-PE CCL에는 over-engineered |
+| Mature InfiniBand / RoCE model | CQ management logic + CQE memory overhead |
+| Good multi-tenant isolation | CQE / data ordering needs extra plumbing |
+| | Over-engineered for PE-to-PE CCL |
 
-**평가**: RDMA CQ는 host-facing NIC의 multi-tenant 격리에 적합.
-PE 간 단일 owner 환경에서는 불필요한 복잡성. **불채택.**
+**Verdict**: RDMA CQ is suited to host-facing NICs with multi-tenant
+isolation. For single-owner PE-to-PE this is needless complexity.
+**Rejected.**
 
-#### Credit-in-Data Piggyback (v2 최적화 후보)
+#### Credit-in-data piggyback (v2 optimization candidate)
 
-현재 설계에서 credit return은 별도 16B packet이다. Bidirectional 통신
-패턴에서는 **reverse 방향 data flit에 credit을 합칠 수 있다.**
+In the current design the credit return is a separate 16 B packet.
+For bidirectional traffic patterns, **the credit can be folded into a
+reverse-direction data flit**.
 
 ```
 PE_A →E→ PE_B: data + sender_seq=3
-PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4  ← credit이 data에 합쳐짐
+PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4  ← credit folded into data
 ```
 
-| 장점 | 단점 |
+| Pros | Cons |
 |---|---|
-| Credit 전용 packet 제거 → NoC BW 절약 | Unidirectional 패턴에서는 fallback 필요 |
-| Bidirectional allreduce에서 credit latency → 0 | Flit header에 8B 추가 (overhead 미미) |
-| | Logic 복잡도 소폭 증가 |
+| Removes the dedicated credit packet → NoC BW savings | Needs fallback for unidirectional patterns |
+| Bidirectional allreduce: credit latency → 0 | +8 B in the flit header (negligible) |
+| | Slightly more logic complexity |
 
-**평가**: 현재 설계의 우수한 최적화. Bidirectional allreduce에서 credit packet을
-완전 제거 가능. Standalone credit fallback도 유지. **v2로 채택 권고.**
+**Verdict**: A strong optimization. Eliminates the credit packet for
+bidirectional allreduce; the standalone credit fallback is retained.
+**Recommended for v2.**
 
-### Open HW Questions
+### Open HW questions
 
-- IPCQ slot region size를 TCM의 몇 %까지 허용할 것인가? (현재 가정: ~1MB / 16MB = 6.25%)
-- Credit VC를 별도로 둘 것인가, vc_comm에 multiplexing할 것인가? (D18 참조)
-- Inter-SIP link에서의 flit format 호환성 검증 필요
-- n_slots 최대값 제한? (8 directions × 8 slots × 64KB = 4MB → TCM의 25%)
+- What fraction of TCM may the IPCQ slot region occupy? (Current
+  assumption: ~1 MB / 16 MB = 6.25 %.)
+- Dedicated credit VC vs. vc_comm multiplexing? (See D18.)
+- Inter-SIP link flit-format compatibility verification.
+- Maximum `n_slots`? (8 directions × 8 slots × 64 KB = 4 MB → 25 % of
+  TCM.)
 
 ---
 
 ## Non-goals
 
-- **호스트 collective**: `dist.all_reduce`가 데이터 이동을 직접 수행하는 모델은
-  본 ADR 범위 외. 본 ADR은 PE 커널 안에서 일어나는 통신만 다룬다.
-- **All-reduce 알고리즘**: ring/tree 등 알고리즘 자체는 별도 ADR (또는 커널
-  코드)에서 다룬다. 본 ADR은 인프라(IPCQ + VC)만 정의.
-- **Reliability/error handling**: send/recv 실패, link 장애 등은 다루지 않음.
-- **NoC arbiter 정밀 모델**: VC 간 dynamic contention은 첫 구현 범위 외 (D8).
+- **Host collective**: a model where `dist.all_reduce` itself moves
+  data on the host side is out of scope. This ADR only covers
+  communication that happens inside the PE kernel.
+- **All-reduce algorithms**: ring / tree / etc. live in algorithm
+  modules and can be added without amending this ADR.
+- **Reliability / error handling**: link faults, send/recv failure
+  recovery, etc. are out of scope.
+- **NoC arbiter precision**: dynamic VC contention is left for a future
+  ADR (see D8).
 
 ---
 
-## Open Questions
+## Open questions
 
-- **VC arbitration 정확도**: 첫 구현은 deterministic chunk interleave +
-  weighted round-robin. heavy contention 시나리오에서 실제보다 optimistic한
-  결과가 나올 수 있음. 정밀화 필요 시 별도 NoC arbiter component 도입을 검토.
-- **Credit return fast path BW 모델**: 첫 구현은 fast path가 fabric BW
-  contention 모델에서 제외 (별도 lightweight wire 가정). 정밀화 필요 시
-  credit fast path를 별도 link로 모델링하거나, `credit_return_mode: piggyback`
-  옵션 추가.
-- **Ring buffer slot의 메모리 할당**: TCM/HBM/SRAM 어디에 두든 IPCQ가 알아야
-  할 metadata (base addr, slot_size, n_slots). init 시 호스트가 사이드밴드로
-  넣을지, fabric MmuMapMsg와 유사한 메시지로 넣을지 결정 필요.
-- **VC 간 BW 분할 default**: 균등 분할(50/50)인지, weighted(예: 80% compute,
-  20% comm)인지. ccl.yaml에 노출하되 default 값 결정 필요.
-- **Direction 개수**: 4방향(N/S/E/W) 고정인지, 6방향(+ Up/Down for 3D),
-  또는 가변 N개로 확장할지. 첫 구현은 4방향 고정.
-- **다중 channel 데이터 구조 (multi-tile aggregation)**: 한 collective에서
-  여러 tile을 fan-out 받는 경우 기존 round-robin recv로 충분한지, 별도
-  primitive(`tl.recv_all`)가 필요한지.
-- **Round-robin recv fairness**: 첫 구현은 last_polled_dir 인덱스 기반 weak
-  fairness. 한 방향에 데이터가 항상 먼저 도착하면 starvation 가능. strict
-  fairness가 필요하면 별도 fairness counter 추가.
-- **Deadlock detection 정밀화**: 첫 구현은 timeout 기반. 향후 wait-for graph
-  실시간 추적으로 deterministic deadlock detection 가능.
+- **VC arbitration accuracy** — the first cut uses deterministic
+  chunk interleave + weighted round-robin; heavy contention may report
+  optimistic latency. A NoC arbiter component can be added later.
+- **Credit return BW model** — the fast path is currently outside the
+  fabric BW contention model. Can be modeled as a separate link or
+  switched to piggyback (`credit_return_mode: piggyback`).
+- **Ring buffer slot allocation metadata** — whether the host pushes
+  IPCQ buffer metadata via sideband or via a fabric message similar to
+  `MmuMapMsg` is open.
+- **VC BW split default** — 50/50 vs. weighted (e.g. 80/20). Exposed in
+  `ccl.yaml`; default value TBD.
+- **Direction count** — 4 (N/S/E/W) is fixed in the first cut; 6
+  (with Up/Down for 3D) or N (variable) is future work.
+- **Multi-tile aggregation primitives** — whether
+  `tl.recv_all` or similar is needed for fan-in.
+- **Round-robin recv fairness** — current weak fairness can starve;
+  strict fairness counter is future work.
+- **Deadlock detection precision** — currently timeout-based; a
+  realtime wait-for graph would enable deterministic detection.
 
 ---
 
 ## Consequences
 
-### 긍정적
+### Positive
 
-- PE 간 직접 통신 가능 → CCL 커널 작성 가능
-- 호스트는 launch만, 동기화는 PE 안에서 → 단순한 호스트 코드, 강한
-  compute/comm overlap
-- VC를 통해 HoL blocking 제거 → collective latency가 compute traffic에
-  block되지 않음
-- Buffer 위치/backpressure 모드를 init 파라미터로 선택 가능 → 벤치마크 가능
-- 4-방향 logical neighbor → 호스트가 ring/mesh/tree 등 알고리즘 자유롭게
-  매핑
+- PE-to-PE direct communication enables CCL kernels to be written.
+- Host stays minimal (just `launch`); synchronization happens inside
+  the PE → simple host code and strong compute / comm overlap.
+- VCs eliminate HoL blocking → collective latency is not blocked by
+  compute traffic.
+- Buffer placement and backpressure mode are init-time parameters →
+  easy to benchmark.
+- Four-direction logical neighbors → host is free to map
+  ring/mesh/tree algorithms.
 
-### 부정적
+### Negative
 
-- 컴포넌트 1개 신규 추가 (PE_IPCQ), PE_DMA 재설계 (VC 추가)
-- IPCQ 메모리 (8 ring × slot_size × n_slots) 만큼 PE-local 메모리 사용
-- VC arbitration 모델이 first-order approximation이므로 heavy contention
-  시나리오에서 실제 HW보다 약간 optimistic한 latency 결과 가능 (D8 한계)
-- VC chunk-level 인터리브로 PE_DMA 구현이 더 복잡해짐
+- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
+- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE.
+- VC arbitration is a first-order approximation; heavy contention
+  scenarios may report slightly optimistic latency vs. real HW (D8).
+- Chunk-level interleave makes PE_DMA implementation more complex.
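Review note on D17 (Slot Address Generator): the formula `slot_addr = base_pa + (pointer % n_slots) * slot_size`, together with the power-of-2 shortcut listed under "Implementation" and the D21 constraints, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the simulator's code: the function name `slot_addr` and the example values are hypothetical, and the sketch treats a power-of-2 `slot_size` as required even though D21 only recommends it.

```python
def slot_addr(base_pa: int, pointer: int, n_slots: int, slot_size: int) -> int:
    """Return the physical address of the ring slot selected by `pointer`.

    Hypothetical helper mirroring the D17 formula
        slot_addr = base_pa + (pointer % n_slots) * slot_size
    using the mask / shift forms that power-of-2 parameters allow.
    """
    # A positive power of two has exactly one bit set, so x & (x - 1) == 0.
    if n_slots <= 0 or n_slots & (n_slots - 1):
        raise ValueError("n_slots must be a power of 2")
    if slot_size <= 0 or slot_size & (slot_size - 1):
        raise ValueError("slot_size must be a power of 2 in this sketch")

    index = pointer & (n_slots - 1)       # modulo becomes an AND mask
    shift = slot_size.bit_length() - 1    # log2(slot_size)
    return base_pa + (index << shift)     # multiply becomes a left shift
```

With the example values `base_pa = 0xE00000`, `n_slots = 8` and `slot_size = 4096`, a monotonically increasing pointer of 9 wraps to slot 1, so the result is `0xE01000`. The pointer itself never needs to be reset; only the masked index wraps.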
diff --git a/docs/adr/ADR-0024-par-sip-tp-launcher.md b/docs/adr/ADR-0024-par-sip-tp-launcher.md
index b321e84..52bd9b4 100644
--- a/docs/adr/ADR-0024-par-sip-tp-launcher.md
+++ b/docs/adr/ADR-0024-par-sip-tp-launcher.md
@@ -6,43 +6,46 @@ Accepted
 
 ## Context
 
-### 목표
+### Goal
 
-`torch.distributed` collective 호출의 참여 단위(rank)를 **SIP**(device)
-경계에 맞춘다. 실제 PyTorch DDP/TP 스크립트와 **호스트 레벨에서 구분 없이**
-읽히는 bench 코드를 목표로 한다.
+Align the participation unit (rank) of `torch.distributed` collective calls
+to the **SIP** (device) boundary. The aim is bench code that, at the host
+level, reads **indistinguishably** from real PyTorch DDP/TP scripts.
 
-real PyTorch와 비교:
+Comparison with real PyTorch:
 
-| 차원 | real PyTorch | KernBench |
+| Dimension | real PyTorch | KernBench |
 | --- | --- | --- |
-| 프로세스 모델 | N개 프로세스, 각 1 GPU | 1 프로세스, N greenlet, 각 1 SIP |
-| `get_rank()` | `RANK` env var | greenlet-local 레지스트리 |
-| `get_world_size()` | `WORLD_SIZE` env var | topology의 SIP 수 |
+| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
+| `get_rank()` | `RANK` env var | greenlet-local registry |
+| `get_world_size()` | `WORLD_SIZE` env var | SIP count from topology |
 | `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
-| `mp.spawn` | OS 프로세스 fork | greenlet fan-out |
+| `mp.spawn` | OS process fork | greenlet fan-out |
 
-### 풀어야 할 문제
+### Problems to solve
 
-1. **공개 API에서 rank = SIP** — bench worker가 PE 개념을 알지 않도록.
-2. **Greenlet-local rank/device tracking** — 1-프로세스 모델 안에서 각
-   worker greenlet이 자기 rank / 자기 SIP를 정확히 식별.
-3. **Tensor placement = structural (sip, cube, pe)** — rank가 SIP이면
-   기본 텐서 배치도 구조적 좌표로 표현되어야 함.
+1. **Public API where rank = SIP** — so bench workers do not have to know
+   about the PE concept.
+2. **Greenlet-local rank/device tracking** — within the 1-process model,
+   each worker greenlet must correctly identify its own rank / its own SIP.
+3. **Tensor placement = structural (sip, cube, pe)** — if rank is SIP,
+   the default tensor placement should also be expressed in structural
+   coordinates.
 
-### Non-problem (이 ADR 밖)
+### Non-problem (outside this ADR)
 
 - IPCQ direction addressing → ADR-0025
-- `DPPolicy.sip`/`num_sips` 제거 → ADR-0026
+- Removing `DPPolicy.sip`/`num_sips` → ADR-0026
 - Megatron-style TP → ADR-0027
 - DTensor → ADR-0028 (future)
 - Worker scheduling / `mp.spawn` / collective drain / exception cleanup
   → ADR-0027 D0/D1
-- Collective algorithm 구현 (intercube_allreduce, SFR config) → ADR-0032
+- Collective algorithm implementation (intercube_allreduce, SFR config)
+  → ADR-0032
 
 ## Decision
 
-### D1. rank = SIP (world_size 해석)
+### D1. rank = SIP (world_size resolution)
 
 ```python
 def _resolve_world_size(self) -> int:
@@ -55,8 +58,8 @@ def _resolve_world_size(self) -> int:
     return int(spec.get("system", {}).get("sips", {}).get("count", 1))
 ```
 
-우선순위: 알고리즘 override > defaults override > SIP count. `ccl.yaml`
-override는 legacy "rank = PE" 테스트 경로로 유지.
+Priority order: algorithm override > defaults override > SIP count. The
+`ccl.yaml` override is retained as the legacy "rank = PE" test path.
 
 ### D2. Greenlet-local rank registry (+ debug warning)
 
@@ -83,11 +86,11 @@ class DistributedContext:
         return int(self._rank_by_greenlet[g])
 ```
 
-### D3. `torch.ahbm.set_device(rank)` — SIP 바인딩
+### D3. `torch.ahbm.set_device(rank)` — SIP binding
 
-KernBench 백엔드 이름은 `ahbm` (ADR-0023). Real PyTorch는
-`torch.cuda.set_device(r)`이지만 우리는 CUDA가 아니므로 honestly-named
-namespace를 사용한다.
+The KernBench backend name is `ahbm` (ADR-0023). Real PyTorch uses
+`torch.cuda.set_device(r)`, but since we are not CUDA we use an
+honestly-named namespace.
 
 ```python
 class _AhbmNamespace:
@@ -113,10 +116,12 @@ class _AhbmNamespace:
 # Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
 ```
 
-**PyTorch 2.x style 병행 지원**: 최신 PyTorch는 device-agnostic한
-`torch.accelerator` 네임스페이스를 지향 (`torch.accelerator.set_device_index(r)`,
-`torch.accelerator.current_device_index()`). Device vendor에 종속되지 않는
-코드를 쓰려는 사용자를 위해 KernBench도 이 표면을 병행 지원한다.
+**Side-by-side support for the PyTorch 2.x style**: Recent PyTorch is moving toward a
+device-agnostic `torch.accelerator` namespace
+(`torch.accelerator.set_device_index(r)`,
+`torch.accelerator.current_device_index()`). To support users who want to
+write code that is not tied to a specific device vendor, KernBench also
+exposes this surface in parallel.
 
 ```python
 class _AcceleratorNamespace:
@@ -141,23 +146,23 @@ self.ahbm = _AhbmNamespace()
 self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias
 ```
 
-Bench 작성자는 다음 중 하나를 선택 — 둘 다 내부적으로 같은 레지스트리를 보유:
+Bench authors may choose either — both share the same registry internally:
 
 ```python
 torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
 torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
 ```
 
-### D4. Tensor placement = structural (sip, cube, pe) 좌표
+### D4. Tensor placement = structural (sip, cube, pe) coordinates
 
-`resolve_dp_policy`가 `target_sip`을 직접 받아 구조적 좌표로 placement 생성.
-세부는 ADR-0026.
+`resolve_dp_policy` takes `target_sip` directly and produces placement in
+structural coordinates. Details in ADR-0026.
 
 ```python
 # RuntimeContext._create_tensor
 current_sip = self.ahbm.current_device()          # (D3 naming)
 if current_sip is None:
-    current_sip = 0  # single-driver fallback (D2와 일관)
+    current_sip = 0  # single-driver fallback (consistent with D2)
 placement = resolve_dp_policy(
     dp, shape=shape_2d, itemsize=itemsize,
     num_pe=eff_num_pe, num_cubes=eff_num_cubes,
@@ -165,29 +170,29 @@ placement = resolve_dp_policy(
 )
 ```
 
-Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적
-좌표를 직접 보유. ShardSpec 상세는 ADR-0026.
+No post-hoc `pe_index` shifting — ShardSpec carries the `(sip, cube, pe)`
+structural coordinates directly. ShardSpec details in ADR-0026.
 
 ---
 
 ## Dependencies
 
-- **ADR-0023** (IPCQ): backend `ahbm` namespace의 기원.
-- **ADR-0026** (DPPolicy intra-device): D4의 `resolve_dp_policy` 시그니처와
-  ShardSpec의 구조적 좌표 표현.
-- **ADR-0027** (Megatron TP + scheduler): worker scheduling, `mp.spawn`,
-  collective drain, exception cleanup의 구현 기준.
+- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
+- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature
+  used by D4 and the structural-coordinate representation of ShardSpec.
+- **ADR-0027** (Megatron TP + scheduler): the implementation baseline for
+  worker scheduling, `mp.spawn`, collective drain, and exception cleanup.
 
 ---
 
 ## Non-goals
 
-- **IPCQ protocol 수정**: ADR-0023 유지.
-- **DPPolicy 필드 정리**: ADR-0026.
+- **Modifying the IPCQ protocol**: ADR-0023 remains as-is.
+- **Cleaning up DPPolicy fields**: ADR-0026.
 - **Megatron-style TP**: ADR-0027.
 - **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
-- **Collective algorithm 구현**: ADR-0032.
-- **Multi-node (프로세스 간)**: 단일 프로세스.
+- **Collective algorithm implementation**: ADR-0032.
+- **Multi-node (cross-process)**: single process only.
 
 ---
 
@@ -195,12 +200,14 @@ Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적
 
 ### Positive
 
-- **Bench = real PyTorch DDP** (공개 API 관점).
-- **Greenlet-local rank**: 1-프로세스 모델에서 cross-rank correctness 가능.
-- **Structural placement 좌표**: ADR-0026 / ADR-0027 / ADR-0032의 다른 ADR이
-  `(sip, cube, pe)` 3튜플 위에서 일관되게 동작.
+- **Bench = real PyTorch DDP** (from the public-API point of view).
+- **Greenlet-local rank**: enables cross-rank correctness within the
+  1-process model.
+- **Structural placement coordinates**: lets the other ADRs (ADR-0026 /
+  ADR-0027 / ADR-0032) operate consistently on top of the `(sip, cube, pe)`
+  3-tuple.
 
 ### Neutral
 
-- IPCQ PE-level protocol (ADR-0023) 불변.
-- IO_CPU 역할 불변 (기존 transit 그대로).
+- IPCQ PE-level protocol (ADR-0023) is unchanged.
+- IO_CPU role is unchanged (existing transit behavior preserved).
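Review note on D1 (world_size resolution): the stated priority order, algorithm override > defaults override > SIP count, can be sketched as a standalone function. Only the final fallback expression is taken from the diff; the function name `resolve_world_size` and the two override parameters are hypothetical stand-ins for however `ccl.yaml` exposes those values, so treat this as an illustration of the ordering rather than the actual method body.

```python
from typing import Any, Mapping, Optional


def resolve_world_size(
    spec: Mapping[str, Any],
    algo_override: Optional[int] = None,      # hypothetical: per-algorithm value
    defaults_override: Optional[int] = None,  # hypothetical: defaults-level value
) -> int:
    """Pick world_size by priority: algorithm > defaults > SIP count."""
    if algo_override is not None:
        return int(algo_override)
    if defaults_override is not None:
        return int(defaults_override)
    # Fallback shown in the ADR: number of SIPs in the topology, default 1.
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
```

For a topology spec with four SIPs, the function returns 4 when no override is given, while an override at either level takes precedence. That override path is what keeps the legacy "rank = PE" tests runnable.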
diff --git a/docs/adr/ADR-0025-algo-ipcq-direction-addressing.md b/docs/adr/ADR-0025-algo-ipcq-direction-addressing.md
index 8a6afa0..ca2974d 100644
--- a/docs/adr/ADR-0025-algo-ipcq-direction-addressing.md
+++ b/docs/adr/ADR-0025-algo-ipcq-direction-addressing.md
@@ -6,51 +6,58 @@ Accepted (Revision 2 — Address-based matching; peer_direction field dropped)
 
 ## Context
 
-### 목표
+### Goal
 
-ADR-0023의 IPCQ protocol에서 **"어느 direction pair를 통한 전송인가"의 식별**을
-topology / dict-order에 의존하지 않고 **주소 기반**으로 일관되게 한다.
-2-rank bidirectional ring (또는 여러 direction이 동일 peer를 가리키는
-topology 일반)에서 정확히 동작하도록 한다.
+In the IPCQ protocol of ADR-0023, make the **identification of "which
+direction pair this transfer belongs to"** consistent and **address-based**,
+without depending on topology / dict-order. It must work correctly in a
+2-rank bidirectional ring (and more generally in any topology where
+multiple directions point to the same peer).
 
-### 드러난 버그 — 2-rank bidirectional ring
+### The bug surfaced — 2-rank bidirectional ring
 
-`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). 양쪽 방향이 같은 peer.
+`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions
+point to the same peer.
 
-**버그 1 (install)**:
-- `reverse_direction(0, 1)` → dict order로 "E" 반환 (틀림, "W"가 맞음 — opposite
-  direction convention)
-- rank 0의 E entry가 `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`로 설정
-- tl.send(E) → data가 sip1의 E-rx buffer로 landing (should be W-rx)
+**Bug 1 (install)**:
+- `reverse_direction(0, 1)` → returns "E" by dict order (wrong; "W" is the
+  correct answer — opposite-direction convention)
+- rank 0's E entry is set with `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`
+- tl.send(E) → data lands in sip1's E-rx buffer (should be W-rx)
 
-**버그 2 (runtime)**:
-- 설령 install이 올바른 주소로 설정해도, receiver의 `_handle_meta_arrival`이
-  sender 좌표만으로 direction 매칭 → 첫 direction (E) 승
-- peer_head_cache[E] 증가, peer_head_cache[W]는 불변
-- Kernel의 tl.recv(W)는 peer_head_cache[W] 대기 → 영원히 블록 → IpcqDeadlock
+**Bug 2 (runtime)**:
+- Even if install set up the correct address, the receiver's
+  `_handle_meta_arrival` matches direction by sender coordinates only → the
+  first direction (E) wins
+- peer_head_cache[E] is incremented; peer_head_cache[W] is unchanged
+- The kernel's tl.recv(W) waits on peer_head_cache[W] → blocks forever →
+  IpcqDeadlock
 
-### 근본 원인
+### Root cause
 
-두 축에서 동일 문제:
-1. **Install-time pairing**: "내 direction과 peer의 어느 direction이 짝인가"
-   결정이 dict-iteration-order에 의존 → 여러 direction이 같은 peer를 가리킬 때
-   fragile
-2. **Runtime identification**: "어느 qp를 업데이트해야 하는가" 결정이 sender
-   좌표만으로 이루어짐 → direction 중복 시 ambiguous
+The same issue along two axes:
+1. **Install-time pairing**: deciding "which of my directions pairs with
+   which direction of the peer" depends on dict-iteration-order → fragile
+   when multiple directions point to the same peer
+2. **Runtime identification**: deciding "which qp should be updated" is
+   based on sender coordinates alone → ambiguous when directions are
+   duplicated
 
-### 해결 방향 — address-based matching
+### Solution direction — address-based matching
 
-각 PE의 rx buffer는 **direction별로 고유한 주소 range**에 위치 (rx_base_pa +
-direction_idx × bytes_per_direction). 따라서:
+Each PE's rx buffer sits at a **unique address range per direction**
+(rx_base_pa + direction_idx × bytes_per_direction). Therefore:
 
-- **Runtime**: sender coord 대신 **dst_addr 범위**로 매칭 → unambiguous
-- **Install**: opposite-direction 우선 선택 heuristic (ring / mesh의 자연스러운
-  대칭성)
-- `peer_direction` 같은 이중 메타데이터 불필요 — **주소가 single source of
-  truth**
+- **Runtime**: match by **dst_addr range** instead of sender coord →
+  unambiguous
+- **Install**: prefer the opposite direction as a heuristic (the natural
+  symmetry of ring / mesh)
+- No need for redundant metadata like `peer_direction` — **address is the
+  single source of truth**
 
-이 설계는 **PhysAddr 전환 (ADR-0030)과 독립적**으로 작동. 현재 synthetic
-주소든 PhysAddr든 direction별 range 유일성만 지켜지면 동일하게 적용 가능.
+This design works **independently of the PhysAddr transition (ADR-0030)**.
+Whether the current addresses are synthetic or PhysAddr, the same approach
+applies as long as the per-direction range uniqueness is preserved.
 
 ---
 
@@ -91,17 +98,17 @@ def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
     return None
 ```
 
-호출부:
+Call site:
 
 ```python
 for d, peer_rank in nbrs.items():
-    peer_dir = reverse_direction(r, peer_rank, d)  # my_dir 전달
+    peer_dir = reverse_direction(r, peer_rank, d)  # pass my_dir
     if peer_dir is None:
         continue
     ...
 ```
 
-### D2. Runtime — `_handle_meta_arrival` dst_addr 매칭
+### D2. Runtime — `_handle_meta_arrival` dst_addr matching
 
 `src/kernbench/components/builtin/pe_ipcq.py`:
 
@@ -138,9 +145,10 @@ def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
     # Unknown dst_addr — diagnostic log (should not happen under correct install)
 ```
 
-Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정.
+The sender-coordinate check is **removed**. `dst_addr` already determines
+the direction.
 
-### D3. Credit — `dst_rx_base_pa` 필드 추가
+### D3. Credit — add `dst_rx_base_pa` field
 
 `src/kernbench/common/ipcq_types.py`:
 
@@ -148,25 +156,26 @@ Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정.
 @dataclass(frozen=True)
 class IpcqCreditMetadata:
     consumer_seq: int
-    dst_rx_base_pa: int       # NEW: 원 sender의 peer.rx_base_pa와 매칭용
-    # 기존 필드 (diagnostic / log 용도로 유지)
+    dst_rx_base_pa: int       # NEW: matches the original sender's peer.rx_base_pa
+    # Existing fields (kept for diagnostic / logging purposes)
     src_sip: int
     src_cube: int
     src_pe: int
     src_direction: str
 ```
 
-Credit 생성 시 (`_delayed_credit_send`): 자기 direction의 `my_rx_base_pa`를
-`dst_rx_base_pa`로 실어 보냄 (이게 상대방이 sender 당시 썼던 `peer.rx_base_pa`).
+When the credit is generated (`_delayed_credit_send`): it carries this
+direction's `my_rx_base_pa` as `dst_rx_base_pa` (this is the
+`peer.rx_base_pa` the other side used when it was the sender).
 
-수신 측 (`_credit_worker`):
+Receiver side (`_credit_worker`):
 
 ```python
 def _credit_worker(self, env):
     while True:
         credit = yield self._credit_inbox.get()
         for d, qp in self._queue_pairs.items():
-            # peer의 rx_base_pa와 credit의 dst_rx_base_pa가 일치하는 qp 찾기
+            # Find the qp whose peer rx_base_pa matches the credit's dst_rx_base_pa
             if qp["peer"].rx_base_pa == credit.dst_rx_base_pa:
                 qp["peer_tail_cache"] = max(qp["peer_tail_cache"],
                                               credit.consumer_seq)
@@ -178,41 +187,45 @@ def _credit_worker(self, env):
                 break
 ```
 
-Sender 좌표 검사 제거. `dst_rx_base_pa` 매칭으로 unambiguous.
+Sender-coordinate check removed. Matching by `dst_rx_base_pa` is
+unambiguous.
 
-### D4. `IpcqInitEntry`에 `peer_direction` 필드를 **추가하지 않음**
+### D4. Do **not** add a `peer_direction` field to `IpcqInitEntry`
 
-ADR-0025 rev 1에서 제안했던 `IpcqInitEntry.peer_direction`은 **불필요**.
-이유:
-- Meta arrival은 dst_addr로 매칭 (D2)
-- Credit은 dst_rx_base_pa로 매칭 (D3)
-- qp에 peer_direction 저장 필요 없음
-- Install은 rx_base_pa 계산 시 내부적으로만 peer_dir 사용 (`reverse_direction`)
+The `IpcqInitEntry.peer_direction` proposed in ADR-0025 rev 1 is
+**unnecessary**. Reasons:
+- Meta arrivals are matched by dst_addr (D2)
+- Credits are matched by dst_rx_base_pa (D3)
+- No need to store peer_direction on qp
+- Install only uses peer_dir internally when computing rx_base_pa
+  (`reverse_direction`)
 
-IpcqInitEntry schema 변경 없음. Rev 1 대비 **단순화**.
+No change to the IpcqInitEntry schema. **Simpler** than rev 1.
 
-### D5. `IpcqDmaToken.src_direction` 유지 (diagnostic only)
+### D5. Keep `IpcqDmaToken.src_direction` (diagnostic only)
 
-기존 `src_direction` 필드는 제거하지 않는다. 다음 용도로 유지:
-- Logging / trace: `KERNBENCH_CCL_TRACE=1` 출력의 `(rank, t, dir, nbytes)`
-- Diagnostics: pointer_dump 등에서 direction 표시
-- 미래 확장 여지
+The existing `src_direction` field is not removed. It is retained for:
+- Logging / trace: the `(rank, t, dir, nbytes)` output of
+  `KERNBENCH_CCL_TRACE=1`
+- Diagnostics: showing direction in pointer_dump, etc.
+- Room for future extension
 
-Runtime matching은 `dst_addr`만 사용.
+Runtime matching uses only `dst_addr`.
 
-### D6. Invariants (ADR-0023 I3 강화)
+### D6. Invariants (strengthens ADR-0023 I3)
 
-**I3 (엄격)**: 각 방향 pair `(my_direction, peer_direction)`에 대해 my
-rx_base와 peer rx_base는 **별개의 direction slot**을 가리켜야 함. Install은
-이를 보장해야 한다 (reverse_direction opposite-preference).
+**I3 (strict)**: For each direction pair `(my_direction, peer_direction)`,
+my rx_base and peer rx_base must point to **distinct direction slots**.
+Install must guarantee this (reverse_direction opposite-preference).
 
-**I3.1 (신규)**: 모든 qp에 대해 `qp["my_rx_base_pa"]`와 `qp["peer"].rx_base_pa`는
-서로 disjoint한 주소 range를 점유한다 (다른 direction의 buffer는 절대 겹치지
-않음). 이것이 D2/D3의 주소-기반 매칭의 전제.
+**I3.1 (new)**: For every qp, `qp["my_rx_base_pa"]` and
+`qp["peer"].rx_base_pa` occupy mutually disjoint address ranges (buffers
+of different directions never overlap). This is the prerequisite for the
+address-based matching of D2/D3.
 
-Install time에 검증 가능:
+Verifiable at install time:
 ```python
-# ccl/install_plan.py: build_install_plans 끝에 assertion
+# ccl/install_plan.py: assertion at the end of build_install_plans
 all_rx_ranges = set()
 for plan in plans:
     for pe_install in plan.pe_installs:
@@ -228,36 +241,42 @@ for plan in plans:
 
 ## Dependencies
 
-- **ADR-0023** (IPCQ protocol): 본 ADR은 ADR-0023의 runtime 매칭 로직 수정
-  (D2, D3) + install heuristic 개선 (D1). IPCQ 프로토콜의 semantic layer
-  변경은 없음.
-- **ADR-0024** (launcher): 2-rank bidirectional ring이 실제 쓰이는 경우가
-  ADR-0024의 ws=SIP_count 모델. 본 ADR이 그 케이스를 작동시킴.
-- **ADR-0030** (PhysAddr transition, stub): **독립적** — ADR-0025의
-  주소-기반 매칭은 현재 synthetic 주소든 PhysAddr이든 동일하게 작동.
+- **ADR-0023** (IPCQ protocol): this ADR modifies ADR-0023's runtime
+  matching logic (D2, D3) and improves the install heuristic (D1). No
+  change to the IPCQ protocol's semantic layer.
+- **ADR-0024** (launcher): the case where a 2-rank bidirectional ring is
+  actually used is the ws=SIP_count model of ADR-0024. This ADR makes that
+  case work.
+- **ADR-0030** (PhysAddr transition, stub): **independent** — ADR-0025's
+  address-based matching works identically whether the current addresses
+  are synthetic or PhysAddr.
 
 ---
 
 ## Non-goals
 
-- **IPCQ 주소 체계를 PhysAddr로 전환**: ADR-0030 scope. 본 ADR은 주소가 어떻게
-  인코딩되는가와 무관.
-- **Multi-hop routing**: ADR-0023 D5의 single-hop DMA write 전제 유지.
-- **Unidir ring 특수화**: `ring_1d_unidir`는 direction 하나만 있으므로 본 버그
-  무관.
+- **Migrating IPCQ addressing to PhysAddr**: ADR-0030 scope. This ADR is
+  agnostic to how addresses are encoded.
+- **Multi-hop routing**: the single-hop DMA write assumption of ADR-0023
+  D5 still holds.
+- **Unidir ring specialization**: `ring_1d_unidir` only has a single
+  direction, so the bug does not apply.
 
 ---
 
 ## Open questions
 
-- **주소 매칭 성능**: `_handle_meta_arrival`과 `_credit_worker`가 qp를 선형
-  순회 (max 4 direction). 성능 영향 무시 가능 수준. 문제 시 dict lookup으로
-  전환 가능 (`_qp_by_rx_base`).
-- **`IpcqDmaToken.src_direction` 필요성 재평가**: diagnostic 용도로만 남긴
-  필드를 계속 유지할지, 또는 logging 외부로 분리할지. 현재는 유지.
-- **Install-time invariant 검증 cost**: D6의 I3.1 검증은 O(N_PE × N_direction)^2.
-  대형 topology에서 느려질 수 있음 → interval tree 등 자료구조로 개선 가능.
-  단순 구현 먼저.
+- **Address-matching performance**: `_handle_meta_arrival` and
+  `_credit_worker` iterate qp linearly (max 4 directions). The performance
+  impact is negligible. If it becomes an issue, this can be switched to a
+  dict lookup (`_qp_by_rx_base`).
+- **Re-evaluating the need for `IpcqDmaToken.src_direction`**: whether to
+  keep this field, which remains only for diagnostics, or to split it out
+  of logging. Currently retained.
+- **Cost of install-time invariant verification**: the I3.1 verification
+  of D6 is O((N_PE × N_direction)^2). It could be slow on large topologies
+  → improvable via data structures such as interval trees. Simple
+  implementation first.
 
 ---
 
@@ -265,19 +284,26 @@ for plan in plans:
 
 ### Positive
 
-- **단순함**: `peer_direction` 이중 메타데이터 제거. 주소가 single source of truth.
-- **Unambiguous matching**: 모든 topology (direction 중복 포함)에서 동작.
-- **Schema 변경 최소**: `IpcqInitEntry` 불변, `IpcqCreditMetadata`에 1 필드 추가.
-- **PhysAddr 전환 (ADR-0030) 독립**: 주소-기반 매칭은 주소 인코딩 방식과 무관.
-- **Diagnostic 유지**: `IpcqDmaToken.src_direction`은 로깅 용도로 존치.
+- **Simplicity**: redundant `peer_direction` metadata removed. Address is
+  the single source of truth.
+- **Unambiguous matching**: works on every topology (including duplicate
+  directions).
+- **Minimal schema changes**: `IpcqInitEntry` unchanged, one field added
+  to `IpcqCreditMetadata`.
+- **Independent of PhysAddr transition (ADR-0030)**: address-based matching
+  is agnostic to the address encoding.
+- **Diagnostics retained**: `IpcqDmaToken.src_direction` is kept for
+  logging.
 
 ### Negative
 
-- Runtime 매칭이 주소 비교로 바뀌어서 디버깅 시 "왜 peer_head_cache[E]가 아닌
-  W가 업데이트됐나" 같은 질문에 address range를 추적해야 함 (기존엔 direction
-  이름으로 충분). 해결: pointer_dump에 "direction ↔ rx_base_pa" 매핑 포함.
+- Runtime matching is now by address comparison, so when debugging
+  questions like "why did peer_head_cache[W] update rather than [E]" one
+  has to follow the address range (previously the direction name was
+  enough). Mitigation: include a "direction ↔ rx_base_pa" mapping in
+  pointer_dump.
 
 ### Neutral
 
-- IPCQ protocol의 semantic layer (sender가 dst_addr 계산, receiver가 수신)는
-  불변.
+- The semantic layer of the IPCQ protocol (sender computes dst_addr,
+  receiver receives) is unchanged.
diff --git a/docs/adr/ADR-0026-par-dppolicy-intra-device.md b/docs/adr/ADR-0026-par-dppolicy-intra-device.md
index d043f59..16c8170 100644
--- a/docs/adr/ADR-0026-par-dppolicy-intra-device.md
+++ b/docs/adr/ADR-0026-par-dppolicy-intra-device.md
@@ -1,4 +1,4 @@
-# ADR-0026: DPPolicy = Intra-Device Only — sip/num_sips 필드 제거
+# ADR-0026: DPPolicy = Intra-Device Only — remove sip/num_sips fields
 
 ## Status
 
@@ -6,16 +6,17 @@ Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)
 
 ## Context
 
-### 목표
+### Goal
 
-`DPPolicy`를 **한 device(SIP) 내부의 cube × PE 분산**만 표현하는 순수한
-intra-device 추상화로 명확화한다. SIP 간 분산(TP)은 별도 레이어로 분리
-(ADR-0024의 `torch.ahbm.set_device(rank)` 또는 ADR-0027의 Megatron parallel
-layers가 담당).
+Clarify `DPPolicy` as a pure intra-device abstraction that only expresses
+**cube × PE distribution within a single device (SIP)**. Inter-SIP
+distribution (TP) is split into a separate layer (handled by ADR-0024's
+`torch.ahbm.set_device(rank)` or by ADR-0027's Megatron-style parallel
+layers).
 
 ## Decision
 
-### D1. `DPPolicy`에서 `sip` + `num_sips` 필드 제거
+### D1. Remove `sip` + `num_sips` fields from `DPPolicy`
 
 ```python
 @dataclass(frozen=True)
@@ -32,15 +33,16 @@ class DPPolicy:
     num_cubes: int | None = None
 ```
 
-제거되는 필드: `sip`, `num_sips`.
+Removed fields: `sip`, `num_sips`.
 
-### D2. `ShardSpec` — structural (sip, cube, pe) 좌표, `pe_index` 완전 제거
+### D2. `ShardSpec` — structural (sip, cube, pe) coordinates, `pe_index` fully removed
 
-현재 `ShardSpec.pe_index`는 **global flat index** (`sip × cubes × pes + cube ×
-pes + pe`). 이는 ADR-0024 D4이 "abstraction leakage"로 지적한 형태.
+The current `ShardSpec.pe_index` is a **global flat index**
+(`sip × cubes × pes + cube × pes + pe`). This is the form ADR-0024 D4
+flagged as "abstraction leakage".
 
-본 ADR에서 ShardSpec을 **structural 좌표로 재정의**하고, `pe_index`는
-property로도 **남기지 않는다**:
+This ADR **redefines ShardSpec in structural coordinates** and **does
+not even leave `pe_index` as a property**:
 
 ```python
 # src/kernbench/policy/placement/dp.py (after)
@@ -59,28 +61,32 @@ class ShardSpec:
     nbytes: int
 ```
 
-**핵심 원칙**:
-- ShardSpec의 정체성은 `(sip, cube, pe)` 3튜플.
-- **`pe_index` property도 없음** — silent semantics drift 차단.
-- Global flat을 기대한 기존 호출자는 `.pe_index` 접근 시 **즉시
-  `AttributeError`** → 반드시 구조적 좌표로 migration.
-- Flat integer key가 필요한 국소 문맥 (예: 내부 dict lookup)은 호출자가
-  명시적으로 `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe`를 계산.
+**Core principle**:
+- The identity of ShardSpec is the `(sip, cube, pe)` 3-tuple.
+- **No `pe_index` property either** — blocks silent semantics drift.
+- Existing callers expecting a global flat index get an **immediate
+  `AttributeError`** on `.pe_index` access → forced migration to
+  structural coordinates.
+- Local contexts that genuinely need a flat integer key (e.g. internal
+  dict lookup) explicitly compute
+  `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe` at the call
+  site.
 
-**Property 제거 정당화**: KernBench는 사내 프로젝트로 call site가 한정되어
-있음. Silent drift 위험 (의미만 바뀌고 타입은 같은 int) 대비 explicit breakage
-(AttributeError)가 훨씬 안전.
+**Justification for removing the property**: KernBench is an internal
+project with a limited number of call sites. Explicit breakage
+(AttributeError) is much safer than the risk of silent drift (semantics
+change while the type stays int).
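
The behavior described in D2 can be sketched as follows. `ShardSpecSketch` is a hypothetical stand-in whose field list is assumed from the text, and the `N_CUBES` / `N_PE` values are made up for the example.

```python
from dataclasses import dataclass

# Hypothetical topology sizes, chosen only for this example.
N_CUBES = 4
N_PE = 8


@dataclass(frozen=True)
class ShardSpecSketch:
    """Stand-in for ShardSpec: structural coordinates only, no pe_index."""
    sip: int
    cube: int
    pe: int
    offset_bytes: int
    nbytes: int


def flat_key(spec: ShardSpecSketch, n_cubes: int = N_CUBES, n_pe: int = N_PE) -> int:
    """Explicit flat index, computed at the call site that needs it."""
    return spec.sip * n_cubes * n_pe + spec.cube * n_pe + spec.pe
```

With this shape, `spec.pe_index` raises `AttributeError`, and a caller that really wants a flat integer has to spell out the topology sizes it is assuming.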
 
-### D3. `resolve_dp_policy`가 `target_sip`을 받아 structural 좌표 생성
+### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates
 
-ADR-0024 D4의 계약 구현. Post-hoc shifting 없음.
+Implements the contract of ADR-0024 D4. No post-hoc shifting.
 
 ```python
 # src/kernbench/policy/placement/dp.py (after)
 
 @dataclass(frozen=True)
 class _LocalPeShard:
-    """Internal — PE resolver의 반환. Cube 내 local PE 식별자 + payload."""
+    """Internal — return value of the PE resolver. Cube-local PE id + payload."""
     local_pe: int                  # cube-local PE index (0..num_pe-1)
     offset_bytes: int
     nbytes: int
@@ -93,7 +99,7 @@ def resolve_dp_policy(
     itemsize: int,
     num_pe: int,
     num_cubes: int = 1,
-    target_sip: int,       # NEW — 어느 SIP에 배치할지 명시
+    target_sip: int,       # NEW — explicitly state which SIP to place on
 ) -> list[ShardSpec]:
     """2-level resolution (cube × PE) on a specified SIP.
 
@@ -123,28 +129,30 @@ def resolve_dp_policy(
     return all_shards
 ```
 
-**내부 resolver** (`column_wise`, `row_wise`, `replicate`)는 `_LocalPeShard`
-리스트 반환 — `local_pe` 필드명으로 **"cube-local PE identifier"임이 명시적**.
-과거 `ShardSpec.pe_index`와 이름이 혼동되던 문제 해소.
+**Internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a
+list of `_LocalPeShard` — the `local_pe` field name makes it **explicit
+that this is a "cube-local PE identifier"**. This resolves the previous
+confusion with the name `ShardSpec.pe_index`.
 
-**이름 규약 정리** (전체 ADR):
-- `ShardSpec.pe`: 최종 외부 API — cube-local PE (structural coord)
-- `_LocalPeShard.local_pe`: 내부 resolver 단계의 동일 의미
-- `pe_index`: **제거**. 외부/내부 어디에도 남기지 않는다 (silent drift 차단의
-  부가 효과: 이름 재등장 없음).
+**Naming convention summary** (whole ADR):
+- `ShardSpec.pe`: the final external API — cube-local PE (structural coord)
+- `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage
+- `pe_index`: **removed**. Not retained anywhere, internal or external
+  (additional benefit of preventing silent drift: the name does not
+  reappear).
 
-### D4. `_create_tensor` — 구조적 좌표로 직접 placement
+### D4. `_create_tensor` — placement directly in structural coordinates
 
-ADR-0024 D4 연속선. Post-hoc shifting 제거, 구조적 좌표를 `resolve_dp_policy`
-호출 시점에 직접 지정.
+Continuation of ADR-0024 D4. Post-hoc shifting removed; structural
+coordinates are specified directly at the `resolve_dp_policy` call site.
 
 ```python
 # context.py _create_tensor (after)
 current_sip = self.ahbm.current_device()
 if current_sip is None:
-    # Single-driver fallback (ADR-0024 D2와 일관).
-    # Launcher 기반 코드가 set_device()를 빼먹으면 조용히 SIP 0에 박히는
-    # 문제가 있음 → debug mode에서 경고.
+    # Single-driver fallback (consistent with ADR-0024 D2).
+    # In launcher-based code, forgetting set_device() silently sticks the
+    # tensor on SIP 0 — emit a warning in debug mode.
     if os.environ.get("KERNBENCH_DEBUG"):
         import warnings
         warnings.warn(
@@ -161,38 +169,39 @@ placement = resolve_dp_policy(
     itemsize=itemsize,
     num_pe=eff_num_pe,
     num_cubes=eff_num_cubes,
-    target_sip=current_sip,          # ← 구조적 좌표 일차 지정
+    target_sip=current_sip,          # ← structural coord specified up front
 )
 
-# placement의 각 ShardSpec은 이미 (sip=current_sip, cube=local, pe=local) 포함.
-# 과거의 post-hoc shifting 블록은 완전히 제거.
+# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
+# The old post-hoc shifting block is removed entirely.
 ```
 
-**모든** 텐서가 current device SIP에 배치됨. Multi-SIP 텐서를 만들고 싶으면
-ADR-0027의 TP primitive 사용.
+**Every** tensor is placed on the current device's SIP. If you need a
+multi-SIP tensor, use the TP primitive of ADR-0027.
 
-**Single-driver fallback의 trade-off**: set_device 없는 호출에서 SIP 0으로
-default는 기존 single-driver 테스트 호환을 위해 유지. `KERNBENCH_DEBUG=1`
-환경에서는 launcher 컨텍스트의 실수로 set_device 누락 시 조용히 잘못된 SIP에
-배치되는 것을 감지할 수 있도록 warning.
+**Trade-off of the single-driver fallback**: When set_device is not
+called, defaulting to SIP 0 is kept for compatibility with existing
+single-driver tests. With `KERNBENCH_DEBUG=1`, a warning is emitted so
+that accidentally omitting set_device in a launcher context — which would
+silently place the tensor on the wrong SIP — can be detected.
 
-### D5. Downstream — allocator lookup은 구조적 tuple key로
+### D5. Downstream — allocator lookup by structural tuple key
 
-기존 `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
+Existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
 
 ```python
 for spec in placement:
-    alloc = allocators[spec.pe_index]       # ← AttributeError (property 제거됨)
+    alloc = allocators[spec.pe_index]       # ← AttributeError (property removed)
 ```
 
-`pe_index`가 없어졌으므로 구조적 좌표로 **강제** migration:
+With `pe_index` gone, migration to structural coordinates is **forced**:
 
 ```python
 for spec in placement:
     alloc = allocators[(spec.sip, spec.cube, spec.pe)]
 ```
 
-`_ensure_allocators`의 dict population도 tuple key로:
+The dict population in `_ensure_allocators` is also tuple-keyed:
 
 ```python
 # context.py _ensure_allocators (after)
@@ -204,59 +213,71 @@ for sip_id in sip_range:
             )
 ```
 
-`_free_tensor`도 동일: 기존 `flat_idx = sip * ... + cube * ... + pe` 계산
-블록 제거, `(shard.sip, shard.cube, shard.pe)` 직접 사용.
+`_free_tensor` is the same: the old
+`flat_idx = sip * ... + cube * ... + pe` computation block is removed,
+and `(shard.sip, shard.cube, shard.pe)` is used directly.
 
-**Tuple vs dataclass `PEIdentity`**: Tuple이 단순하고 hashable로 바로 써서
-권고. `PEIdentity` 값객체는 명시적 타입 장점은 있지만 boilerplate가 크고 현재
-allocator dict의 유일한 key라 오버엔지니어링. Tuple 유지.
+**Tuple vs dataclass `PEIdentity`**: The tuple is recommended: it is
+simple and hashable out of the box. A `PEIdentity` value object has the
+upside of an explicit type, but the boilerplate is large and it is
+currently the only key of the allocator dict, so it would be
+over-engineering. Keep the tuple.
 
-### D7. 하위 호환 — 불가 (cleanup ADR)
+### D7. Backward compatibility — none (cleanup ADR)
 
-이 ADR은 **breaking change**.
+This ADR is a **breaking change**.
 
-1. `DPPolicy(sip=...)` 또는 `DPPolicy(num_sips=...)` 호출 → `TypeError`
-2. `ShardSpec.pe_index` 접근 → `AttributeError`
+1. `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)` → `TypeError`
+2. `ShardSpec.pe_index` access → `AttributeError`
 
-모두 **즉시 명시적 breakage**. Deprecation warning / fallback 경로 없음.
-KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에 migration.
+Both are **immediate, explicit breakage**. No deprecation warning /
+fallback path. KernBench is an internal project with a bounded set of
+call sites, so migration happens in one pass.
 
-**Silent drift 차단**이 property 완전 제거의 주된 이점: global flat을 기대한
-코드가 SIP-local 결과를 받아 조용히 잘못된 인덱싱을 할 가능성 제거.
+**Blocking silent drift** is the main upside of fully removing the
+property: code that expected a global flat index could otherwise
+silently receive a SIP-local result and index incorrectly. That
+possibility is eliminated.
 
 ## Dependencies
 
-- **ADR-0024** (launcher): `set_device(rank)` 및 current-device scoping이
-  SIP 배치 메커니즘 제공. 본 ADR은 그 위에 서서 DPPolicy를 순수 intra-device로
-  좁힘.
-- **ADR-0027** (Megatron TP): 다중 SIP에 걸친 텐서가 필요한 경우의 대안 경로.
-  이 ADR 적용 후 multi-SIP use case는 ADR-0027로 이관.
+- **ADR-0024** (launcher): `set_device(rank)` and current-device scoping
+  provide the SIP placement mechanism. This ADR sits on top and narrows
+  DPPolicy to pure intra-device.
+- **ADR-0027** (Megatron TP): the alternative path when a tensor spans
+  multiple SIPs. After this ADR is applied, multi-SIP use cases move to
+  ADR-0027.
 
 ---
 
 ## Non-goals
 
-- **`DPPolicy.cube` / `pe` 재설계**: 기존 replicate/column_wise/row_wise 의미
-  유지.
-- **Tiling 정책 통합**: `tiled_column_major` / `tiled_row_major`는 그대로.
-- **Multi-device 텐서 추상화 신규**: DTensor-like는 ADR-0028.
+- **Redesign of `DPPolicy.cube` / `pe`**: existing
+  replicate/column_wise/row_wise semantics are kept.
+- **Tiling policy consolidation**: `tiled_column_major` /
+  `tiled_row_major` stay as they are.
+- **New multi-device tensor abstraction**: a DTensor-like abstraction is
+  the subject of ADR-0028.
 
 ---
 
 ## Open questions
 
-- **`_create_tensor`의 current_sip 기본값**: set_device 없는 호출에서 rank=0
-  (SIP 0)로 fallback할지, 아니면 error 낼지. 권고는 fallback (기존 single-driver
-  테스트와의 호환).
-- **`test_sip_parallel.py` 재작성 범위**: 기존 단위 테스트의 의도를 유지하며
-  launcher 기반으로 옮기려면 추가 fixture 필요. 별도 작업으로 scope.
-- **`DPPolicy`의 `num_sips=None` 의미**: 필드가 없어지면 `num_sips` 개념 자체가
-  사라짐. Multi-SIP을 표현하고 싶으면 ADR-0027의 TP primitive를 쓰라는 것이
-  명시적 답.
+- **Default value of current_sip in `_create_tensor`**: for calls without
+  set_device, whether to fall back to rank=0 (SIP 0) or to raise an
+  error. The recommendation is fallback (compatibility with existing
+  single-driver tests).
+- **Scope of `test_sip_parallel.py` rewrite**: porting the existing unit
+  tests to a launcher-based setup while preserving their intent requires
+  additional fixtures. Scoped as separate work.
+- **Meaning of `num_sips=None` on `DPPolicy`**: once the field is gone,
+  the concept of `num_sips` disappears entirely. The explicit answer for
+  expressing multi-SIP is to use the TP primitive of ADR-0027.
 
-**Resolved (이전 rev에서 open이었던 것들)**:
-- ~~`ShardSpec.pe_index` property 존치 여부~~ → **완전 제거** (D2)
-- ~~`_ensure_allocators` dict key 형식~~ → **tuple `(sip, cube, pe)`** (D5)
+**Resolved (items that were open in earlier revs)**:
+- ~~Whether to keep the `ShardSpec.pe_index` property~~ → **fully
+  removed** (D2)
+- ~~Form of `_ensure_allocators` dict key~~ → **tuple `(sip, cube, pe)`**
+  (D5)
 
 ---
 
@@ -264,25 +285,31 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에
 
 ### Positive
 
-- **개념 분리 명확**: DPPolicy = intra-device, TP = inter-device.
-- **API 단순화**: DPPolicy 생성자 필드 ~33% 축소.
-- **Structural 좌표 일관성**: ShardSpec이 `(sip, cube, pe)` 튜플로 표현 →
-  abstraction leakage 해소 (ADR-0024 D4 계약 충족).
-- **`pe_index` 의미 명확**: SIP-local이 단일 해석. Global flat이 필요하면 명시.
-- **Launcher 모델 일관성**: ADR-0024의 "1 worker per SIP" 모델이 유일한 SIP
-  경계 제어 메커니즘.
+- **Clean conceptual separation**: DPPolicy = intra-device, TP =
+  inter-device.
+- **API simplification**: about a 33% reduction in DPPolicy constructor
+  fields.
+- **Structural-coordinate consistency**: ShardSpec is expressed as a
+  `(sip, cube, pe)` tuple → abstraction leakage resolved (the ADR-0024
+  D4 contract is satisfied).
+- **Clear meaning of `pe_index`**: the single interpretation is
+  SIP-local. If global-flat is needed, it must be made explicit.
+- **Launcher-model consistency**: ADR-0024's "1 worker per SIP" model is
+  the sole SIP-boundary control mechanism.
 
 ### Negative
 
 - **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`,
-  `spec.pe_index` → `AttributeError`. 모든 호출자 한 번에 수정 필요.
-- **ShardSpec schema 변경**: `pe_index` 단일 필드 → `sip`/`cube`/`pe` 세 필드.
-  Downstream (`deploy_tensor`, `_free_tensor`, `_ensure_allocators`,
-  `allocators` dict key 등) 연쇄 수정.
-- **Silent drift 없음**: property 완전 제거로 runtime에서 즉시 실패 →
-  migration leakage 원천 차단. (Negative가 아니라 explicit tradeoff)
-- `test_sip_parallel.py` 재작성 비용.
+  `spec.pe_index` → `AttributeError`. All callers need to be fixed at
+  once.
+- **ShardSpec schema change**: a single `pe_index` field becomes three
+  fields `sip`/`cube`/`pe`. Cascading edits downstream (`deploy_tensor`,
+  `_free_tensor`, `_ensure_allocators`, `allocators` dict key, etc.).
+- **No silent drift**: with the property fully removed, runtime failure
+  is immediate → migration leakage is blocked at the source. (Not a
+  negative but an explicit tradeoff.)
+- The cost of rewriting `test_sip_parallel.py`.
 
 ### Neutral
 
-- 기존 `cube` / `pe` 필드 의미 불변.
+- The meaning of the existing `cube` / `pe` fields is unchanged.
diff --git a/docs/adr/ADR-0027-par-megatron-tp.md b/docs/adr/ADR-0027-par-megatron-tp.md
index 7b04254..e8fb267 100644
--- a/docs/adr/ADR-0027-par-megatron-tp.md
+++ b/docs/adr/ADR-0027-par-megatron-tp.md
@@ -6,70 +6,77 @@ Accepted
 
 ## Context
 
-### 목표
+### Goal
 
-SIP 간 tensor parallelism(TP)을 **Megatron-LM 스타일의 명시적 parallel layer**
-API로 지원한다. DTensor 같은 선언적 추상화는 별도 ADR(0028) future work.
+Support inter-SIP tensor parallelism (TP) via a **Megatron-LM style explicit
+parallel layer** API. Declarative abstractions like DTensor are future work
+in a separate ADR (0028).
 
-Megatron-style을 선택한 이유:
-- TP는 model의 특정 layer 경계에서 발생. 명시적 primitive가 mental model에
-  자연스러움.
-- NVIDIA Megatron / DeepSpeed가 확립한 인더스트리 표준.
-- DTensor는 선언적이라 디자인 공간이 더 크다 → 단계적.
+Why Megatron-style was chosen:
+- TP arises at specific layer boundaries of a model. Explicit primitives are
+  natural to the mental model.
+- The de-facto industry standard established by NVIDIA Megatron / DeepSpeed.
+- DTensor is declarative, so its design space is larger → phased approach.
 
-### TP primitive 스펙 (Megatron-LM 참조)
+### TP primitive spec (Megatron-LM reference)
 
-- **ColumnParallelLinear**: weight의 **column(out_features)** 축을 TP ranks에
-  분산. 입력 full-replicated, 출력 column-sharded. 후속 RowParallelLinear가
-  올 때 forward all-reduce 없음.
-- **RowParallelLinear**: weight의 **row(in_features)** 축을 TP ranks에 분산.
-  입력이 이미 column-sharded (ColumnParallel의 출력). forward 끝에
-  **all-reduce** 필요.
-- **VocabParallelEmbedding**: embedding을 vocab 축에 분산. forward 끝에
-  all-reduce. (초기 scope에서는 stub, 실제 구현은 all-gather kernel 선행 필요.)
+- **ColumnParallelLinear**: shards the weight's **column (out_features)** axis
+  across TP ranks. Input is full-replicated, output is column-sharded. When a
+  RowParallelLinear follows, no forward all-reduce is required.
+- **RowParallelLinear**: shards the weight's **row (in_features)** axis across
+  TP ranks. Input is already column-sharded (the output of ColumnParallel).
+  Requires an **all-reduce** at the end of forward.
+- **VocabParallelEmbedding**: shards the embedding along the vocab axis.
+  Requires an all-reduce at the end of forward. (A stub in the initial
+  scope; a full implementation first requires an all-gather kernel.)
 - **`copy_to_tp_region`**, **`reduce_from_tp_region`**, **`scatter_to_tp_region`**,
-  **`gather_from_tp_region`** — 기본 primitive.
+  **`gather_from_tp_region`** — basic primitives.
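
To make the column-then-row pairing concrete, here is a pure-Python sketch that simulates the TP ranks in a single loop. It assumes weights stored as `(in_features, out_features)` so that `x @ W` applies a layer, uses nested lists instead of tensors, and replaces the all-reduce with a plain sum; every name in it is hypothetical.

```python
Matrix = list[list[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix multiply: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_cols(w: Matrix, tp: int) -> list[Matrix]:
    """Column-parallel sharding: each rank gets a slice of out_features."""
    step = len(w[0]) // tp
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp)]


def split_rows(w: Matrix, tp: int) -> list[Matrix]:
    """Row-parallel sharding: each rank gets a slice of in_features."""
    step = len(w) // tp
    return [w[r * step:(r + 1) * step] for r in range(tp)]


def tp_forward(x: Matrix, w1: Matrix, w2: Matrix, tp: int) -> Matrix:
    """Column-parallel layer followed by row-parallel layer, ranks simulated in a loop."""
    partials = []
    for w1_r, w2_r in zip(split_cols(w1, tp), split_rows(w2, tp)):
        h_r = matmul(x, w1_r)               # column-sharded output, no communication
        partials.append(matmul(h_r, w2_r))  # per-rank partial sum
    # Stand-in for the all-reduce(sum) at the end of the row-parallel forward.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
```

The summed partials equal the unsharded product `(x @ w1) @ w2`, which is why only one all-reduce is needed for the pair of layers.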
 
-### 풀어야 할 문제
+### Problems to solve
 
-1. **Worker-wait 일반화 (D0)**: `dist.all_reduce`의 defer/yield/drain 패턴을
-   모든 `ctx.wait` 경로로 확장. **이 ADR의 가장 큰 아키텍처 결정**.
+1. **Worker-wait generalization (D0)**: extend the defer/yield/drain pattern of
+   `dist.all_reduce` to every `ctx.wait` path. **The biggest architectural
+   decision of this ADR.**
 
-2. **런처 API 정규화 (D1)**: 현 bench들이 hand-rolled greenlet loop을 사용.
-   `torch.multiprocessing.spawn(fn, args, nprocs)`로 흡수해 real-PyTorch API 면
-   유지 + D0의 scheduler drain을 단일 구현 위치에 집중.
+2. **Launcher API normalization (D1)**: current benches use a hand-rolled
+   greenlet loop. Absorb it into `torch.multiprocessing.spawn(fn, args, nprocs)`
+   to preserve the real-PyTorch API surface and to concentrate D0's
+   scheduler drain in a single implementation site.
 
-3. **Per-rank weight 분산 표현**: 각 worker가 weight tensor의 자기 slice를
-   소유. ADR-0024의 `set_device(rank)` + ADR-0026의 intra-device DPPolicy로
-   자연스럽게 표현.
+3. **Per-rank weight shard representation**: each worker owns its own slice of
+   the weight tensor. Naturally expressed via ADR-0024's `set_device(rank)` +
+   ADR-0026's intra-device DPPolicy.
 
-4. **Forward-only scope**: 현재 KernBench는 backward가 없음 (simulation 목적).
-   본 ADR은 **forward만** 우선 지원. Training simulation은 별도 ADR.
+4. **Forward-only scope**: KernBench currently has no backward pass (it
+   exists for simulation). This ADR supports **forward only** for now.
+   Training simulation is a separate ADR.
 
-5. **Collective 호출 지점**: RowParallelLinear가 forward 끝에 `all_reduce` 호출.
-   ADR-0024의 multi-greenlet 구조 + D0 generalization에서 자연스럽게 동작.
+5. **Collective call site**: RowParallelLinear calls `all_reduce` at the end of
+   forward. Naturally works with ADR-0024's multi-greenlet structure + D0
+   generalization.
 
-6. **TP group 개념**: Megatron은 DP × TP × PP group을 교차 사용. 초기 scope는
-   **TP group = 전체 SIP** 단순화. Mixed DP+TP는 future.
+6. **TP group concept**: Megatron combines DP × TP × PP groups. The initial
+   scope simplifies to **TP group = all SIPs**. Mixed DP+TP is future work.
 
 ---
 
 ## Decision
 
-### D0. Worker-wait 일반화 — `ctx.wait`가 worker 컨텍스트면 main으로 defer
+### D0. Worker-wait generalization — `ctx.wait` defers to main when in worker context
 
-**문제 재확인**. `kernel_runner.run`은 spawn 시점의 `greenlet.getcurrent()`를
-kernel greenlet의 `_parent`로 캡처한다
+**Restating the problem.** `kernel_runner.run` captures the value of
+`greenlet.getcurrent()` at spawn time as the kernel greenlet's `_parent`
 ([kernel_runner.py:94](src/kernbench/triton_emu/kernel_runner.py#L94)).
-main 컨텍스트에서 `env.run`이 돌면 parent=main이라 safe. worker 컨텍스트에서
-`env.run`이 돌면 parent=worker가 되고, worker가 yield/finish하는 순간 kernel
-greenlet은 orphan → `GreenletExit` → ADR-0024 Phase B의 `ring_default_ws` 실패.
+If `env.run` runs in the main context, parent=main, which is safe. If
+`env.run` runs in a worker context, parent=worker, and the moment the worker
+yields/finishes the kernel greenlet becomes an orphan → `GreenletExit` →
+failure of ADR-0024 Phase B's `ring_default_ws`.
 
-**해결**. worker greenlet이 `ctx.wait(h)`를 호출하면 직접 `env.run`을 driving
-하는 대신 **main scheduler로 yield**. main이 env.run을 drive해 handle이 완료
-되면 worker로 control return.
+**Resolution.** When a worker greenlet calls `ctx.wait(h)`, instead of driving
+`env.run` directly, **yield to the main scheduler**. main drives env.run and,
+once the handle completes, control returns to the worker.
 
-#### D0.1 `RuntimeContext` 확장
+#### D0.1 `RuntimeContext` extension
 
 ```python
 # context.py
@@ -79,7 +86,7 @@ class RuntimeContext:
     _pending_worker_waits: list[RequestHandle] = field(default_factory=list, init=False)
 ```
 
-#### D0.2 `ctx.wait`의 worker fork
+#### D0.2 `ctx.wait` worker fork
 
 ```python
 def wait(self, handle, *, _meta=None):
@@ -113,29 +120,33 @@ def wait(self, handle, *, _meta=None):
     return completion
 ```
 
-#### D0.3 `ctx.wait`의 worker-context 세만틱 contract (normative)
+#### D0.3 `ctx.wait` worker-context semantic contract (normative)
 
-본 ADR은 `ctx.wait`의 세만틱을 worker 컨텍스트에서 **명시적으로 변경**한다.
+This ADR **explicitly changes** the semantics of `ctx.wait` in worker context.
 
-- **Submit-vs-complete 분리**: `ctx.wait(h)`는 worker에서 호출될 때 "즉시 완료
-  보장"이 아니라 "**다음 scheduler drain 이후** 완료 보장"이다. worker가
-  `wait()`에서 return하는 시점 = main이 해당 handle에 대해 `engine.wait`을
-  마친 시점. Main context 호출은 기존대로 즉시-동기 (status quo).
-- **Resume invariant (normative)**: worker-deferred `ctx.wait(h)`에서
-  `g.parent.switch()`가 return해 worker가 resume되는 시점에는 **반드시
-  `h in ctx._completed`가 True여야 한다**. 이 invariant가 깨지면 worker가
-  stale 상태에서 이후 단계를 진행하므로 `_drain_pending` / scheduler loop /
-  `ctx.wait` 어느 부분을 수정하든 이 불변식을 지켜야 한다. T3.b가 이
-  invariant를 직접 assert한다.
-- **관찰 가능 변화**: worker 안에서 `h = ctx.submit(msg); ctx.wait(h);
-  read(handle_result)` 패턴은 여전히 성립 — 단 `wait()`와 `read` 사이에는
-  자동으로 main-drain이 삽입되었다는 사실을 세만틱 명세로 포함한다.
-- **Host 객체 직접 read는 D0.5 참조**: `ctx.wait` 없이 `tensor.numpy()`를
-  부르는 경우의 계약은 D0.5에서 별도로 규정.
+- **Submit-vs-complete separation**: when called from a worker, `ctx.wait(h)`
+  no longer guarantees "immediate completion" but instead guarantees
+  "completion **after the next scheduler drain**". The point at which the
+  worker returns from `wait()` = the point at which main has finished
+  `engine.wait` for that handle. Main-context calls remain immediate-synchronous
+  as before (status quo).
+- **Resume invariant (normative)**: at the point a worker resumes from a
+  worker-deferred `ctx.wait(h)` (when `g.parent.switch()` returns), **`h in
+  ctx._completed` must be True**. If this invariant breaks, the worker
+  proceeds in a stale state, so whichever of `_drain_pending` / the scheduler
+  loop / `ctx.wait` is modified, this invariant must be preserved. T3.b
+  directly asserts this invariant.
+- **Observable change**: the pattern `h = ctx.submit(msg); ctx.wait(h);
+  read(handle_result)` inside a worker still holds — but the semantic spec
+  now includes the fact that a main-drain is automatically inserted between
+  `wait()` and `read`.
+- **Direct host-object reads: see D0.5**. The contract for calling
+  `tensor.numpy()` without `ctx.wait` is specified separately in D0.5.
 
-#### D0.4 Main scheduler drain — 규약 (normative)
+#### D0.4 Main scheduler drain — protocol (normative)
 
-(D1의 `multiprocessing.spawn` 내부 구현. 아래는 세만틱 정의.)
+(The internal implementation of D1's `multiprocessing.spawn`. Below is the
+semantic definition.)
 
 ```python
 while alive:
@@ -144,69 +155,82 @@ while alive:
     _drain_pending(ctx)           # (2) drain in main context
 ```
 
-(`_drain_pending`의 실제 정의는 D0.5 참조 — outer while-loop으로 두 큐가
-모두 빌 때까지 drain.)
+(The actual definition of `_drain_pending` is in D0.5 — an outer while-loop
+that drains until both queues are empty.)
 
-**규약**:
+**Rules**:
 
-1. **Round-based cooperative scheduling & yield 의무 (worker contract)**.
-   `g.switch()`는 해당 worker가 **자발적으로 yield**할 때까지 return하지 않는다
-   (cooperative greenlet 세만틱). 따라서:
-   - Worker가 yield 없이 `while True: do_compute()` 같은 pure-compute loop를
-     돌면 `g.switch()`는 영원히 return하지 않고 **scheduler loop 자체가 hard
-     block**된다 (다른 worker는 switch 기회를 못 얻음, drain도 안 일어남). 이는
-     starvation이 아니라 **scheduler non-progress (deadlock 등가)**이며 본
-     ADR이 **unsupported**로 규정한다.
-   - Worker는 **반드시** `ctx.wait(h)`, `dist.all_reduce`, host-read barrier
-     (D0.5) 중 하나를 유한 step 내에 호출해야 한다. TP layer의 `forward`는
-     매 layer 끝에서 launch→wait 쌍을 포함하므로 자연스럽게 이 조건을 만족.
-     CCL kernel도 `dist.all_reduce` 내부에서 yield한다.
-   - 구현이 이를 **감지**할 필요는 없다 (타임아웃/steps-since-yield 카운터
-     등). 이는 user contract이며 위반 시 증상은 "simulation hang"이다.
-   - **Future extension**: non-collective 긴 계산 경로가 자주 나오면
-     명시적 `torch.distributed.cooperative_yield()` primitive (no-op yield)를
-     도입할 수 있다. 현 ADR 범위 밖. Breaking change 아님 — 필요 시 추가하면
-     됨.
-   - Round 내에서는 alive worker 전체가 한 번씩 `switch`를 받는다. 단일 round
-     안에서 한 worker가 여러 번 wait를 호출해도 그 turn 안에서 순차적으로
-     enqueue된 뒤 scheduler drain 한 번에 일괄 처리 (FIFO).
+1. **Round-based cooperative scheduling & yield obligation (worker contract)**.
+   `g.switch()` does not return until the worker **voluntarily yields**
+   (cooperative greenlet semantics). Therefore:
+   - If a worker runs a pure-compute loop like `while True: do_compute()`
+     without yielding, `g.switch()` never returns and **the scheduler loop
+     itself hard-blocks** (other workers cannot get a switch turn, no drain
+     occurs). This is not starvation but **scheduler non-progress (deadlock
+     equivalent)**, and this ADR classifies it as **unsupported**.
+   - Workers **must** call one of `ctx.wait(h)`, `dist.all_reduce`, or a
+     host-read barrier (D0.5) within a finite number of steps. The `forward`
+     of a TP layer includes a launch→wait pair at the end of every layer, so
+     this condition is naturally met. CCL kernels also yield inside
+     `dist.all_reduce`.
+   - Implementations need not **detect** violations (e.g., via timeouts or
+     steps-since-yield counters). This is a user contract; the symptom of a
+     violation is "simulation hang".
+   - **Future extension**: if non-collective long compute paths become
+     common, an explicit `torch.distributed.cooperative_yield()` primitive
+     (no-op yield) could be introduced. Out of scope for this ADR. Not a
+     breaking change — can be added if needed.
+   - Within a round, every alive worker receives one `switch` turn. Even if
+     a single worker calls wait multiple times within one round, the calls
+     are enqueued sequentially within that turn and processed in a single
+     scheduler drain batch (FIFO).
 
-2. **Drain 순서 = submission 순서 (FIFO)**. `_pending_worker_waits`는 list
-   append/pop(0)로 엄격한 FIFO. 완료 순서가 아니라 submission 순서로 drain되며,
-   SimPy scheduler 자체가 인과적으로 올바른 완료 순서를 보장하므로 submission
-   순서 drain이 안전하다. `completion order`와 `drain order`는 혼동하지 말 것.
+2. **Drain order = submission order (FIFO)**. `_pending_worker_waits` is
+   strict FIFO via list append/pop(0). Drain occurs in submission order, not
+   completion order, and SimPy's scheduler itself guarantees a causally
+   correct completion order, so submission-order drain is safe. Do not
+   confuse `completion order` with `drain order`.
 
-   **Two-queue ordering (worker waits → collectives)**: `_drain_pending`은
-   worker wait 큐를 먼저, collective 큐를 나중에 drain한다. 이 순서의 근거:
-   - **두 큐는 서로 다른 dependency source**: worker wait은 worker가 직접
-     `submit + wait` 쌍으로 만들어낸 handle (tensor deploy, MmuMap 등). collective
-     큐는 `dist.all_reduce`가 내부적으로 enqueue한 kernel launch handle이며
-     worker는 이걸 직접 wait하지 않는다 (D0.5의 두 큐 drain 모델 참조).
-   - **Correctness 관점 독립**: collective는 worker 관점에선 "이미 submit된
-     후 yield한" 상태. 그 완료 타이밍은 worker의 다음 action 시점 이전이기만
-     하면 됨. worker wait 큐와의 순서 dependency 없음.
-   - **단일 drain barrier 안에서 둘 다 완료**: D0.5의 loop-until-empty 규약에
-     따라 한 barrier invocation에서 worker → collective → (새로 생긴 것이
-     있으면 반복) 순으로 모두 빠짐. worker가 resume될 땐 양쪽 모두 drained.
-   - **대안 (collective 먼저)도 가능**: 본 ADR은 현 구현 단순성을 위해 worker
-     먼저를 고정했을 뿐 의미상 동치. 성능 프로파일 차이가 관찰되면 재조정.
+   **Two-queue ordering (worker waits → collectives)**: `_drain_pending`
+   drains the worker wait queue first, then the collective queue. Rationale
+   for this ordering:
+   - **The two queues are different dependency sources**: worker waits are
+     handles produced by a worker's own `submit + wait` pair (tensor deploy,
+     MmuMap, etc.). The collective queue holds kernel-launch handles that
+     `dist.all_reduce` enqueues internally, which the worker never directly
+     waits on (see the two-queue drain model in D0.5).
+   - **Independent in correctness terms**: from the worker's perspective, a
+     collective is "already submitted, then yielded". Its completion timing
+     only needs to precede the worker's next action. There is no ordering
+     dependency with the worker wait queue.
+   - **Both finish within a single drain barrier**: per D0.5's
+     loop-until-empty rule, a single barrier invocation drains worker →
+     collective → (repeat if new ones appeared) in that order. By the time
+     the worker resumes, both sides are drained.
+   - **The alternative (collective first) is also valid**: this ADR fixes
+     worker-first only for current implementation simplicity; semantically
+     they are equivalent. Revisit if a performance-profile difference is
+     observed.
 
-3. **중복 enqueue — correctness는 idempotent drain, dedup은 non-guaranteed**.
-   `ctx.wait(h)`는 `h in ctx._completed`면 즉시 return. `_drain_pending`도
-   동일 guard. 같은 handle이 `_pending_worker_waits`에 여러 번 appended
-   되더라도 실제 `engine.wait`는 한 번만 호출된다 (idempotent).
-   - **Correctness**: idempotent drain에 의존 → safe.
-   - **Memory/성능**: 본 ADR은 `_pending_worker_waits`의 **dedup을 보장하지
-     않는다**. 같은 handle이 N번 enqueue되면 큐에 N개 element가 보관되고
-     drain 시 N번 pop + in-set guard가 돈다. 단일 worker가 같은 handle을
-     반복 wait하는 비정상 패턴이 아니면 N은 1~수 수준.
-   - **Implementation freedom**: 구현은 선택적으로 dedup (예: `set`을 side
-     index로 두거나 append 전 `h not in pending_set` 검사) 가능. correctness
-     를 바꾸지 않는 최적화로 분류.
+3. **Duplicate enqueue — correctness via idempotent drain; dedup not
+   guaranteed**. `ctx.wait(h)` returns immediately if `h in ctx._completed`.
+   `_drain_pending` uses the same guard. Even if the same handle is appended
+   to `_pending_worker_waits` multiple times, `engine.wait` is invoked only
+   once (idempotent).
+   - **Correctness**: relies on idempotent drain → safe.
+   - **Memory/performance**: this ADR **does not guarantee dedup** of
+     `_pending_worker_waits`. If the same handle is enqueued N times, the
+     queue retains N elements and drain performs N pops + in-set guards.
+     Unless a single worker abnormally waits on the same handle repeatedly,
+     N stays on the order of one to a few.
+   - **Implementation freedom**: implementations may optionally dedup (e.g.,
+     hold a `set` as a side index, or check `h not in pending_set` before
+     append). Classified as an optimization that does not change correctness.
 
 4. **Exception propagation + sibling cleanup**.
-   worker greenlet이 raise하면 `g.switch()`가 main으로 예외를 전달한다.
-   scheduler loop은 즉시 중단되고 다음 cleanup을 **명시적으로** 수행:
+   When a worker greenlet raises, `g.switch()` propagates the exception to
+   main. The scheduler loop stops immediately and performs the following
+   cleanup **explicitly**:
 
    ```python
    try:
@@ -219,14 +243,14 @@ while alive:
                    g.switch()
            _drain_pending(ctx)
    except Exception as outer:
-       # (a) 살아남은 sibling worker greenlet 강제 종료.
+       # (a) Force-terminate surviving sibling worker greenlets.
        for other in gs:
            if not other.dead:
                try:
                    other.throw(SystemExit)
                except Exception:
-                   pass          # 사일런트 — 이미 예외 상황
-       # (b) Backend barrier / pending 상태 초기화 (장래 epoch barrier 도입 대비).
+                   pass          # silent — already in exceptional state
+       # (b) Reset backend barrier / pending state (in preparation for future epoch barrier).
        backend = getattr(ctx.distributed, "_backend", None)
        if backend is not None and hasattr(backend, "_barrier"):
            backend._barrier.reset()
@@ -234,32 +258,34 @@ while alive:
        if backend_pending is not None:
            backend_pending.clear()
        ctx._pending_worker_waits.clear()
-       # (c) 원인 예외는 SpawnException으로 래핑.
+       # (c) Wrap the originating exception in SpawnException.
        raise SpawnException(errors) from outer
    ```
 
-   규약:
-   - **Sibling abort 보장**: worker 하나가 raise하면 모든 sibling greenlet에
-     `SystemExit`을 throw — greenlet은 즉시 terminate된다. greenlet leak 없음.
-   - **Pending queue 명시적 clear**: worker-wait + collective-pending 두 큐를
-     비움. 재사용 시 오염 방지.
-   - **`SpawnException(errors)` 래핑**: `errors: dict[int, Exception]`에 각
-     rank의 원래 예외를 담는다. real-PyTorch `torch.multiprocessing.spawn`의
-     failure 패턴과 호환.
-     - **Scope 제한**: `errors`에는 **자기 코드로 raise한 rank (root cause)만**
-       포함된다. Sibling cleanup 과정에서 `throw(SystemExit)`으로 종료된 rank는
-       `errors`에 나타나지 않는다 (SystemExit은 D1.2의 entry 래퍼 `try/except
-       Exception`에 걸리지 않음 — 의도된 설계: sibling 종료는 실패가 아니라
-       cleanup signal). 독자가 "모든 failed rank가 다 들어올 것"으로 기대하지
-       않도록 명시.
-   - **`ctx._traces`는 예외 이전 시점까지의 partial 상태**. trace completeness
-     는 보장되지 않음 (일부 launch/all_reduce가 entry를 남기지 못한 채 종료
-     가능).
-   - **Allocator / MemoryStore**는 예외 이전 상태 유지 — 재사용은 non-goal,
-     새 `RuntimeContext` 생성 권장.
-   - **`join=False` / retry / partial recovery**는 본 ADR의 non-goal.
+   Protocol:
+   - **Sibling abort guarantee**: when one worker raises, `SystemExit` is
+     thrown into all sibling greenlets — greenlets terminate immediately. No
+     greenlet leaks.
+   - **Explicit pending-queue clear**: both queues (worker-wait +
+     collective-pending) are cleared. Prevents contamination on reuse.
+   - **`SpawnException(errors)` wrapping**: `errors: dict[int, Exception]`
+     contains the original exception per rank. Compatible with the failure
+     pattern of real-PyTorch `torch.multiprocessing.spawn`.
+     - **Scope restriction**: `errors` includes **only ranks that raised
+       from their own code (root cause)**. Ranks terminated via
+       `throw(SystemExit)` during sibling cleanup do not appear in `errors`
+       (SystemExit is not caught by D1.2's entry wrapper `try/except
+       Exception` — intentional design: sibling termination is a cleanup
+       signal, not a failure). Made explicit so readers do not expect "all
+       failed ranks" to appear.
+   - **`ctx._traces` is the partial state up to the moment of exception**.
+     Trace completeness is not guaranteed (some launches/all_reduces may
+     terminate without leaving an entry).
+   - **Allocator / MemoryStore** remain in their pre-exception state; reuse
+     is a non-goal, and creating a fresh `RuntimeContext` is recommended.
+   - **`join=False` / retry / partial recovery** are non-goals for this ADR.
 
-   `SpawnException`은 `runtime_api/multiprocessing.py`에 정의:
+   `SpawnException` is defined in `runtime_api/multiprocessing.py`:
 
    ```python
    class SpawnException(RuntimeError):
@@ -271,22 +297,25 @@ while alive:
            super().__init__(msg)
    ```
 
-5. **Single-driver 호환**. `g.parent is None`인 main-only 실행 (legacy 단일
-   드라이버 테스트)에서는 D0.2의 worker-fork 조건이 거짓 → 기존 즉시-동기
-   경로 유지. `_drain_pending`은 호출되지 않는다.
+5. **Single-driver compatibility**. In main-only execution where `g.parent is
+   None` (legacy single-driver tests), D0.2's worker-fork condition is false
+   → the existing immediate-synchronous path is preserved. `_drain_pending`
+   is not invoked.
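+
+Rules 2 and 3 above can be illustrated with a toy model. This is a minimal
+sketch, not the KernBench implementation: `ToyEngine`, `ToyContext`, and
+`drain_worker_waits` are hypothetical stand-ins for the engine,
+`RuntimeContext`, and the worker-wait half of `_drain_pending`. It shows only
+that a duplicated handle is popped twice but waited on once, in submission
+order.
+
+```python
+from dataclasses import dataclass, field
+
+
+@dataclass
+class ToyEngine:
+    """Hypothetical engine stand-in that records every wait it performs."""
+    waited: list = field(default_factory=list)
+
+    def wait(self, handle):
+        self.waited.append(handle)
+
+
+@dataclass
+class ToyContext:
+    """Hypothetical stand-in for the RuntimeContext fields used here."""
+    engine: ToyEngine = field(default_factory=ToyEngine)
+    _completed: set = field(default_factory=set)
+    _pending_worker_waits: list = field(default_factory=list)
+
+
+def drain_worker_waits(ctx):
+    """Drain in submission (FIFO) order, skipping completed handles."""
+    while ctx._pending_worker_waits:
+        h = ctx._pending_worker_waits.pop(0)   # strict FIFO
+        if h in ctx._completed:                # idempotent guard
+            continue
+        ctx.engine.wait(h)
+        ctx._completed.add(h)
+
+
+ctx = ToyContext()
+ctx._pending_worker_waits.extend(["h1", "h2", "h1"])  # "h1" enqueued twice
+drain_worker_waits(ctx)
+print(ctx.engine.waited)
+```
+
+The queue held three elements, yet the engine is asked to wait only for `h1`
+and `h2`. An implementation that dedups before append would change the queue
+length but not this result.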
 
-#### D0.5 Host-read barrier — 결정 (normative)
+#### D0.5 Host-read barrier — decision (normative)
 
-Worker 안에서 `tensor.numpy()`, `tensor.__getitem__`, `tensor.data` 등
-**host-observable read**는 **자동 drain barrier**로 정의한다. 호출 직전:
+Inside a worker, **host-observable reads** such as `tensor.numpy()`,
+`tensor.__getitem__`, and `tensor.data` are defined as **automatic drain
+barriers**. Immediately before the call:
 
-1. `ctx._pending_worker_waits`와 `backend._pending_collective_handles`가 비어
-   있지 않으면 `g.parent.switch()`로 main에 yield → main은 `_drain_pending`
-   실행 → 완료 후 worker resume.
-2. 두 큐가 모두 비어 있으면 즉시 read.
+1. If either `ctx._pending_worker_waits` or
+   `backend._pending_collective_handles` is non-empty, yield to main via
+   `g.parent.switch()` → main runs `_drain_pending` → worker resumes after
+   completion.
+2. If both queues are empty, read immediately.
 
-**Barrier 반복 규약 (normative — re-entrance)**: `_drain_pending`은 while-loop
-로 **두 큐가 모두 완전히 비어질 때까지** drain한다. 단일 pass가 아님:
+**Barrier iteration protocol (normative — re-entrance)**: `_drain_pending`
+drains via a while-loop **until both queues are completely empty**, not in a
+single pass:
 
 ```python
 def _drain_pending(ctx):
@@ -302,156 +331,175 @@ def _drain_pending(ctx):
         if backend is not None:
             while backend._pending_collective_handles:
                 h, _sip_id, meta = backend._pending_collective_handles.pop(0)
-                ctx.wait(h, _meta=meta)  # main context: safe; ctx.wait가
-                                          # 다시 pending에 push하지 않음
+                ctx.wait(h, _meta=meta)  # main context: safe; ctx.wait will
+                                          # not push back to pending
 ```
 
-**Main-context ctx.wait 비재귀 invariant (normative)**: `_drain_pending` 내부의
-`ctx.wait(h, _meta=meta)` 호출은 main greenlet 컨텍스트에서 실행된다. D0.2의
-worker-fork 조건(`g.parent is not None and not g.parent.dead`)이 False이므로
-즉시-동기 경로로 진입 → **`_pending_worker_waits`에 절대 enqueue하지 않는다**.
-이 invariant 덕분에 drain loop은 재귀/큐 재증가 없이 끝난다. 구현 시
-`g.parent is None`을 단일 main greenlet 보장으로 유지하는 것이 중요.
+**Main-context ctx.wait non-recursion invariant (normative)**: the
+`ctx.wait(h, _meta=meta)` call inside `_drain_pending` runs in the main
+greenlet context. Because D0.2's worker-fork condition (`g.parent is not
+None and not g.parent.dead`) is False, it enters the immediate-synchronous
+path → **never enqueues to `_pending_worker_waits`**. Thanks to this
+invariant, the drain loop terminates without recursion / queue re-growth.
+When implementing, it is important to maintain `g.parent is None` as the
+single-main-greenlet guarantee.
 
-**왜 loop인가**: `ctx.wait(h, _meta=meta)`는 main 컨텍스트에서 호출되므로 D0.2
-경로에 따라 engine을 **직접 drive**한다 (추가 enqueue 없음 — 위 invariant).
-따라서 이론적으로는 single pass로 충분하지만 — 규약은 **loop-until-empty**로
-고정한다. 이유:
+**Why a loop**: `ctx.wait(h, _meta=meta)` is called in main context, so per
+the D0.2 path it **drives the engine directly** (no additional enqueue — the
+invariant above). In theory a single pass would suffice — but the protocol
+is fixed at **loop-until-empty**. Reasons:
 
-1. **미래 확장 안전성**: 향후 drain 중 새 pending이 enqueue되는 구현 (예:
-   collective가 sub-handle을 가진 tree-reduce)이 생길 수 있다. loop 규약이면
-   이때도 correctness 유지.
-2. **가독성**: "barrier는 pending이 빌 때까지 drain"이라는 단일 문장으로
-   의미가 닫힘. `ctx.wait` 호출이 새 enqueue를 안 한다는 non-trivial invariant
-   에 의존하지 않음.
-3. **Barrier의 세만틱은 "해당 read에 필요한 모든 dependency 완료"**: 현 모델
-   에선 모든 pending이 곧 모든 dependency이므로 둘은 동일. 사용자 mental model
-   은 전자.
+1. **Future-extension safety**: a future implementation might enqueue new
+   pending items mid-drain (e.g., tree-reduce collectives with sub-handles).
+   The loop protocol preserves correctness in that case.
+2. **Readability**: the single sentence "the barrier drains until pending
+   is empty" closes the semantics. No dependence on the non-trivial
+   invariant that `ctx.wait` calls do not produce new enqueues.
+3. **Barrier semantics are "all dependencies needed for this read are
+   complete"**: in the current model all pending = all dependencies, so the
+   two are identical. The user mental model is the former.
 
-**Termination 보증**: 두 체제로 분리해 서술한다.
+**Termination guarantee**: described under two regimes.
 
-- **현재 구현**: `ctx.wait`는 main context에서 호출 시 engine을 직접 drive
-  (D0.2) → 새 pending을 enqueue하지 않는다. 한 iteration마다 pending의 크기가
-  `pop(0)` + `engine.wait`로 엄격히 감소. iteration 수는 **초기 pending 크기
-  자체가 상한** → 유한 종료.
-- **Future extension (loop 규약을 정당화하는 상한)**: 향후 drain 중 새 pending이
-  enqueue되는 구현 (예: tree-reduce sub-handle)이 도입되면 초기 크기 상한은
-  깨진다. 그러나 SimPy causality는 handle의 dependency가 유한 DAG임을 보장하므로
-  **nested depth가 finite**. loop 규약이 이 경우까지 자동 수용한다.
+- **Current implementation**: when called in main context, `ctx.wait`
+  drives the engine directly (D0.2) → does not enqueue new pending. Each
+  iteration strictly shrinks pending size by `pop(0)` + `engine.wait`. The
+  iteration count is bounded by **the initial pending size itself** →
+  finite termination.
+- **Future extension (the bound that justifies the loop protocol)**: if an
+  implementation that enqueues new pending mid-drain (e.g., tree-reduce
+  sub-handles) is introduced, the initial-size bound breaks. However,
+  SimPy causality guarantees that the dependency DAG of handles is finite,
+  so **nested depth is finite**. The loop protocol automatically
+  accommodates this case.
 
-두 체제 모두 무한 루프가 불가능함을 보장. 현 구현의 단일-pass 상한은 공격적
-최적화 시 참고 값일 뿐 규약은 loop-until-empty로 고정.
+Both regimes guarantee that infinite loops are impossible. The
+single-pass bound of the current implementation is only a reference point
+for aggressive optimization; the protocol is fixed at loop-until-empty.
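+
+The future-extension regime can be sketched as follows. This is an
+illustrative toy under stated assumptions, not the real `_drain_pending`:
+`drain_until_empty`, the string handles, and the `wait` callback that
+enqueues sub-handles are all hypothetical. It shows that a single pass would
+miss handles enqueued mid-drain, while loop-until-empty picks them up and
+still terminates because the set of sub-handles is finite.
+
+```python
+def drain_until_empty(worker_q, collective_q, wait):
+    """Drain both queues until neither holds a handle (not a single pass)."""
+    passes = 0
+    while worker_q or collective_q:
+        passes += 1
+        while worker_q:
+            wait(worker_q.pop(0))
+        while collective_q:
+            wait(collective_q.pop(0))
+    return passes
+
+
+worker_q = ["w0"]
+collective_q = ["tree_root"]
+order = []
+
+
+def wait(handle):
+    order.append(handle)
+    # Hypothetical future behavior: completing the root of a tree-reduce
+    # enqueues a finite number of sub-handles on the worker queue.
+    if handle == "tree_root":
+        worker_q.extend(["sub0", "sub1"])
+
+
+passes = drain_until_empty(worker_q, collective_q, wait)
+print(order, passes)
+```
+
+In this toy the outer loop runs twice: the second pass exists only because
+`tree_root` produced new pending work during the first.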
 
-**왜 implicit drain at read가 맞는가**:
+**Why implicit drain at read is correct**:
 
-- 기존 open question에서 (a) implicit drain, (b) explicit barrier 둘 중 선택
-  문제였다. (b)는 명확하지만 TP layer 사용자가 `out = fc1.forward(x);
-  ctx.drain(); result = out.numpy()` 3-step을 매번 써야 하는 부담. (a)는
-  "읽을 때 반영된 값을 보장"하는 단일 규약으로 CUDA의 `cudaDeviceSynchronize
-  before host copy` 패턴과 동일 — 숨은 규칙이 아닌 **명명된 entry-point의
-  contract**이다.
-- 본 ADR은 (a)를 채택하되 그 entry-point 목록을 **명시적으로 닫는다**:
+- In the original open question, the choice was between (a) implicit drain
+  and (b) explicit barrier. (b) is clear but burdens TP layer users with
+  the 3-step pattern `out = fc1.forward(x); ctx.drain(); result =
+  out.numpy()` on every read. (a) is a single rule that "guarantees the
+  read sees the reflected value" — identical to CUDA's `cudaDeviceSynchronize
+  before host copy` pattern, which is not a hidden rule but the **contract
+  of a named entry point**.
+- This ADR adopts (a) but **closes the entry-point list explicitly**:
   `Tensor.numpy()`, `Tensor.data` (numpy alias), `Tensor.__getitem__`,
-  `Tensor.__repr__` (data가 포함되는 경우), 그 외 공식 host-read API는 본
-  ADR 구현 시점에 코드베이스 검색으로 확정. 추가되는 host-read API는 반드시
-  이 contract를 따라야 한다 (테스트로 회귀 방지).
-- `ctx.submit`만 하고 `wait` 없이 `numpy`를 직접 호출하는 경우도 drain
-  barrier가 동작 (pending queue에 handle이 있기 때문). 사용자가 explicit
-  wait을 생략해도 read 시점에 invariant가 복원된다.
+  and `Tensor.__repr__` (when data is included). Any other official
+  host-read APIs will be finalized via codebase search when this ADR is
+  implemented. Any newly added host-read API must follow this contract
+  (regression-guarded by tests).
+- Even when calling `numpy` directly after only `ctx.submit` without
+  `wait`, the drain barrier still operates (because the handle is in the
+  pending queue). The invariant is restored at read time even if the user
+  omits an explicit wait.
 
-**`Tensor.copy_(source)` — write barrier 규정**:
+**`Tensor.copy_(source)` — write barrier specification**:
 
-`copy_`는 semantically "target에 write"이지만 내부적으로 `source.numpy()`를
-호출하여 host에서 source 데이터를 가져온 뒤 `target._memory_store.write(...)`
-로 각 shard에 쓴다. 두 방향 모두 barrier 처리:
+`copy_` is semantically "write to target", but internally it calls
+`source.numpy()` to fetch source data on the host then writes to each
+shard via `target._memory_store.write(...)`. Both directions are
+barrier-handled:
 
-1. **Source-side (read barrier)**: `source.numpy()`가 D0.5 read barrier를
-   트리거 (source 자체가 deployed tensor이고 pending이 있을 때).
-2. **Target-side (write barrier — global pending 기준)**: `copy_` 진입 시
-   `ctx._pending_worker_waits` 또는 `backend._pending_collective_handles`가
-   비어 있지 않으면 write 전에 `g.parent.switch()`로 drain. **Per-tensor /
-   per-shard dependency tracking이 아니라 global pending queue 기준**.
-   - 왜 global인가: KernBench의 handle 표현에는 "이 handle이 target의 어느
-     shard를 write한다"는 역추적 정보가 없다. 안전한 보수적 규약으로 "전역
-     pending이 있으면 drain". 이 결과로 **unrelated tensor의 pending도 copy_를
-     막을 수 있다** — drop-in invariant 우선.
-   - **명시적 tradeoff**: 이 규약은 서로 독립적인 tensor 사이에도 불필요한
-     serialization을 도입할 수 있다. 그러나 현 single-queue execution model
-     하에서는 이 비용이 허용 가능 — cross-rank correctness와 "읽을 때 최신"
-     invariant를 단순한 규칙으로 보장하는 편이 우선.
-   - 실질적 영향: 단일 worker는 대부분 한 layer step 안에서 pending이 주로
-     자기 작업 — over-barrier로 인한 추가 context switch는 round 끝 scheduler
-     drain 시점과 일치하는 경우가 많아 큰 문제 안 됨.
-   - Future refinement: per-tensor pending tracking을 도입하면 이 규약을
-     좁힐 수 있으나 본 ADR scope 밖.
+1. **Source-side (read barrier)**: `source.numpy()` triggers the D0.5 read
+   barrier (when source itself is a deployed tensor with pending).
+2. **Target-side (write barrier — based on global pending)**: on `copy_`
+   entry, if either `ctx._pending_worker_waits` or
+   `backend._pending_collective_handles` is non-empty, drain via
+   `g.parent.switch()` before writing. **The check uses the global pending
+   queue, not per-tensor / per-shard dependency tracking**.
+   - Why global: KernBench's handle representation does not retain the
+     reverse-mapping information "this handle writes to which shard of which
+     target". A safe conservative rule: "drain if any global pending
+     exists". As a result, **pending of an unrelated tensor can also block
+     copy_** — drop-in invariant takes priority.
+   - **Explicit tradeoff**: this rule can introduce unnecessary
+     serialization between independent tensors. However, under the current
+     single-queue execution model this cost is acceptable — guaranteeing
+     cross-rank correctness and the "read sees latest" invariant via a
+     simple rule takes precedence.
+   - Practical impact: within a layer step, most of a single worker's
+     pending handles are its own work, and the extra context switches from
+     over-barriering often coincide with the end-of-round scheduler drain,
+     so the cost is minor.
+   - Future refinement: per-tensor pending tracking could narrow this
+     rule, but it is out of scope for this ADR.
 
 **Non-barrier**:
 
-- `tensor.shape`, `tensor.dtype`, `tensor.name` 등 **metadata-only** 접근은
-  drain하지 않음. 데이터 의존성이 없음.
-- `tensor.pa`, `tensor.va` 등 raw address accessor도 drain하지 않음 (주소만,
-  내용 아님).
+- `tensor.shape`, `tensor.dtype`, `tensor.name`, and other
+  **metadata-only** access does not drain. No data dependency.
+- `tensor.pa`, `tensor.va`, and other raw address accessors also do not
+  drain (address only, not content).
 
-**공식 barrier entry-point (closed set)**:
+**Official barrier entry-points (closed set)**:
 
 | API | Kind | Rationale |
 |---|---|---|
 | `Tensor.numpy()` | read | host-observable copy |
 | `Tensor.data` | read | `numpy()` alias |
 | `Tensor.__getitem__` | read | shard-aligned read |
-| `Tensor.__repr__` (data 포함 시) | read | debugging/log |
+| `Tensor.__repr__` (when data is included) | read | debugging/log |
 | `Tensor.copy_(source)` | read + write | source read + target write |
 
-이 contract를 T5/T6에서 직접 검증.
+This contract is verified directly in T5/T6.
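+
+The split between barrier and non-barrier accessors can be sketched with a
+toy class. Everything here is hypothetical (`ToyTensor`, the shared `pending`
+list, and the `drain` callback that stands in for the yield to main); it is
+not the KernBench `Tensor`. The sketch shows only that data reads check the
+global pending list before reading, while metadata access never does.
+
+```python
+class ToyTensor:
+    """Hypothetical tensor whose data reads act as drain barriers."""
+
+    def __init__(self, shape, pending, drain):
+        self.shape = shape          # metadata-only: access never drains
+        self._pending = pending     # shared global pending list
+        self._drain = drain         # stands in for the yield to main
+        self._values = [0] * shape[0]
+
+    def _barrier(self):
+        if self._pending:           # global check, not per-tensor
+            self._drain()
+
+    def numpy(self):
+        self._barrier()
+        return list(self._values)
+
+    def __getitem__(self, i):
+        self._barrier()
+        return self._values[i]
+
+
+pending = ["h0"]
+drains = []
+
+
+def drain():
+    drains.append(list(pending))   # record what was pending at drain time
+    pending.clear()
+
+
+t = ToyTensor((4,), pending, drain)
+_ = t.shape                        # no drain
+drains_after_metadata = len(drains)
+_ = t.numpy()                      # pending is non-empty: drain, then read
+_ = t[0]                           # pending is empty: immediate read
+print(drains_after_metadata, len(drains))
+```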
 
-#### D0.6 왜 worker 함수 API는 불변인가 (informative)
+#### D0.6 Why the worker function API is unchanged (informative)
 
-- `torch.zeros(...)` 내부는 `self.submit(msg)` + `self.wait(h)` 쌍. `wait`가
-  D0.2/D0.3에 따라 자동으로 main-defer → 겉보기 동기적으로 보이지만 한 번
-  yield.
-- `tensor.numpy()`는 D0.5에 따라 host-read barrier → pending이 있으면
-  drain→read, 없으면 즉시 read.
-- `dist.all_reduce`는 기존 `_defer_wait=True` + `_pending_collective_handles`
-  경로를 그대로 사용. D0.4의 drain이 두 큐를 함께 처리.
+- Internally, `torch.zeros(...)` is a `self.submit(msg)` + `self.wait(h)`
+  pair. `wait` auto-defers to main per D0.2/D0.3, so the call appears
+  synchronous from the outside but yields once.
+- `tensor.numpy()` follows D0.5's host-read barrier → drain→read when
+  pending exists, immediate read otherwise.
+- `dist.all_reduce` continues to use the existing `_defer_wait=True` +
+  `_pending_collective_handles` path. D0.4's drain processes both queues
+  together.
 
-#### D0.7 불변 조건 (invariants)
+#### D0.7 Invariants
 
-- **kernel greenlet의 `_parent`는 항상 main**: env.run이 worker 컨텍스트에서
-  절대 돌지 않기 때문. (T3의 핵심 assertion.)
-- **cross-rank 동기 지점**: 모든 worker가 yield한 뒤에만 drain → 모든 rank의
-  kernel이 한 라운드에 함께 진행 (cross-rank IPCQ 교환의 필수 조건).
-- **Single-driver 호환**: D0.4-(5).
+- **The kernel greenlet's `_parent` is always main**: because env.run never
+  runs in worker context. (Core assertion of T3.)
+- **Cross-rank synchronization point**: drain occurs only after every
+  worker has yielded → kernels of all ranks advance together within one
+  round (a prerequisite for cross-rank IPCQ exchange).
+- **Single-driver compatibility**: D0.4-(5).
 
 ### D1. `torch.multiprocessing.spawn(fn, args, nprocs)`
 
-Real-PyTorch API 파리티 + D0의 scheduler loop의 단일 구현 위치.
+Real-PyTorch API parity + a single implementation site for D0's scheduler
+loop.
 
-#### D1.0 API parity only — execution parity 아님 (normative)
+#### D1.0 API parity only — not execution parity (normative)
 
-`torch.multiprocessing.spawn` 이름은 **API signature parity**에 한정된다.
-실제 실행 모델은 **cooperative greenlet scheduler** (단일 Python 프로세스,
-단일 OS 스레드, D0.4의 round-robin drive)이다. 다음은 **본 ADR이 제공하지
-않는 속성** — real-PyTorch `torch.multiprocessing.spawn`이 보장하는 것 중
-명시적으로 **non-goal**:
+The name `torch.multiprocessing.spawn` is restricted to **API signature
+parity**. The actual execution model is a **cooperative greenlet scheduler**
+(single Python process, single OS thread, round-robin drive per D0.4). The
+following are **properties this ADR does NOT provide** — among the
+guarantees of real-PyTorch `torch.multiprocessing.spawn`, explicitly
+**non-goals**:
 
-- 프로세스 격리 (independent OS process per rank).
-- 독립 address space (각 rank가 자기 Python heap 보유).
-- Failure isolation (한 rank의 hard crash가 다른 rank 영향 없음).
-- OS-level scheduler fairness (rank 간 preemptive time slicing).
-- `mp.Queue`, `mp.Lock` 등 inter-process primitive.
+- Process isolation (independent OS process per rank).
+- Independent address space (each rank with its own Python heap).
+- Failure isolation (a hard crash in one rank not affecting others).
+- OS-level scheduler fairness (preemptive time slicing between ranks).
+- Inter-process primitives such as `mp.Queue`, `mp.Lock`.
 
-이 구현의 실제 성질:
+Actual properties of this implementation:
 
-- 모든 rank는 같은 Python 프로세스 안의 greenlet. shared global state가
-  그대로 보임 (의도된 simulation convenience).
-- GIL 하의 단일 스레드 → parallel execution 아님. SimPy 이벤트 순서로
-  "논리적 동시성"만 재현.
-- 한 worker에서 unhandled exception → 전체 simulation 중단 (D0.4-(4)).
+- All ranks are greenlets inside the same Python process. Shared global
+  state is visible as-is (intentional simulation convenience).
+- Single-threaded under the GIL → not parallel execution. Only "logical
+  concurrency" via SimPy event ordering is reproduced.
+- Unhandled exception in any one worker → entire simulation aborts
+  (D0.4-(4)).
 
-**호출자 의무**: real-PyTorch multi-process 샘플을 KernBench로 이식할 때
-프로세스 격리에 의존하는 로직 (예: `os.getpid`, 독립 임시 파일, 신호 처리
-등)은 지워야 한다. Namespace 이름은 코드 이식성을 위해 유지 — 세만틱은
-다르다.
+**Caller's obligation**: when porting real-PyTorch multi-process samples to
+KernBench, logic that relies on process isolation (e.g., `os.getpid`,
+independent temp files, signal handling) must be removed. The namespace name
+is preserved for code portability only; the semantics differ.
 
 #### D1.1 Public surface
 
@@ -470,7 +518,7 @@ class _MultiprocessingNamespace:
         ...
 ```
 
-#### D1.2 구현
+#### D1.2 Implementation
 
 ```python
 def spawn(self, fn, args, nprocs, join=True):
@@ -520,28 +568,29 @@ def spawn(self, fn, args, nprocs, join=True):
 
 #### D1.3 `torch` namespace attach
 
-`runtime_api/context.py` `__post_init__`에서:
+In `runtime_api/context.py` `__post_init__`:
 ```python
 self.multiprocessing = _MultiprocessingNamespace(self)
 ```
 
-→ bench 코드에서 `torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws)`.
+→ in bench code: `torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws)`.
 
-#### D1.4 기존 bench 마이그레이션
+#### D1.4 Migration of existing benches
 
-`benches/ccl_allreduce.py`의 hand-rolled loop은 `torch.multiprocessing.spawn`
-한 줄로 축소. 기존 matrix 회귀는 그대로 유지. 현재 xfail인 `ring_default_ws`는
-D0 덕분에 PASS로 전환 예상 (worker가 kernel greenlet orphan을 발생시키지 않음).
+The hand-rolled loop in `benches/ccl_allreduce.py` collapses into a single
+`torch.multiprocessing.spawn` line. Existing matrix regressions are
+preserved. The currently xfail `ring_default_ws` is expected to flip to
+PASS thanks to D0 (workers no longer orphan the kernel greenlet).
 
-### D2. 새 패키지 `kernbench.tp`
+### D2. New package `kernbench.tp`
 
 ```
 src/kernbench/tp/
     __init__.py          — public API re-exports
-    parallel_state.py    — TP group 관리 (현재 single global group)
+    parallel_state.py    — TP group management (currently a single global group)
     layers.py            — ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding
     primitives.py        — copy/reduce/scatter/gather_to/from_tp_region
-    kernels.py           — TP layer가 launch하는 gemm kernel (재사용 가능)
+    kernels.py           — gemm kernel launched by TP layers (reusable)
     mappings.py          — forward identity/all_reduce, backward stub
 ```
 
@@ -571,45 +620,49 @@ def get_tensor_model_parallel_rank() -> int:
     return get_dist().get_rank()         # ADR-0024 greenlet-local rank
 ```
 
-초기 scope: TP size = world_size = topology SIP count. Pure TP 모델.
+Initial scope: TP size = world_size = topology SIP count. Pure TP model.
 
-### D4-pre. TP shard ownership vs DPPolicy — 역할 분리 (normative)
+### D4-pre. TP shard ownership vs DPPolicy — role separation (normative)
 
-TP layer의 weight/output 표현에서 두 개념을 명확히 분리한다:
+In the weight/output representation of TP layers, two concepts are clearly
+separated:
 
-| 개념 | 결정 주체 | 범위 |
+| Concept | Decided by | Scope |
 |---|---|---|
-| **TP shard ownership** (어느 rank가 weight의 어떤 slice를 소유하는가) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D2/D3) | **cross-rank, cross-SIP** |
-| **Intra-rank placement** (소유된 slice를 rank 내부에서 cube × PE로 어떻게 분산하는가) | `DPPolicy(cube=..., pe=...)` (ADR-0026) | **한 rank 내부 (SIP 경계 안)** |
+| **TP shard ownership** (which rank owns which slice of the weight) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D2/D3) | **cross-rank, cross-SIP** |
+| **Intra-rank placement** (how the owned slice is distributed across cube × PE inside the rank) | `DPPolicy(cube=..., pe=...)` (ADR-0026) | **inside one rank (within SIP boundary)** |
 
-따라서 `ColumnParallelLinear`가 `(in_features, out_features // ws)` shape로
-weight를 생성하고 `DPPolicy(cube="column_wise", pe="column_wise")`를 부여
-하면:
+Thus when `ColumnParallelLinear` creates a weight of shape `(in_features,
+out_features // ws)` and assigns `DPPolicy(cube="column_wise",
+pe="column_wise")`:
 
-- **Rank r**이 소유하는 slice = weight의 column 축 [r * k_local, (r+1) *
-  k_local) — **set_device(r)**가 이걸 결정 (해당 rank가 SIP r에 존재).
-- **그 slice 내부**에서 cube × PE column-wise 분산 — **DPPolicy**가 이걸
-  결정.
+- The slice owned by **rank r** = column-axis [r * k_local, (r+1) *
+  k_local) of the weight — **set_device(r)** determines this (that rank
+  resides on SIP r).
+- **Inside that slice**, the cube × PE column-wise distribution — **DPPolicy**
+  determines this.
 
-두 축은 **독립적**이다. 같은 DPPolicy로 두 rank가 자기 slice를 만들면
-slice 자체는 다른 SIP에 있지만 intra-SIP placement 패턴은 동일. 반대로
-DPPolicy를 `cube="replicate", pe="replicate"`로 바꿔도 TP shard ownership은
-유지되고 intra-rank placement만 달라짐.
+The two axes are **independent**. If two ranks build their own slice with
+the same DPPolicy, the slices themselves live on different SIPs but the
+intra-SIP placement pattern is the same. Conversely, changing DPPolicy to
+`cube="replicate", pe="replicate"` preserves TP shard ownership and only
+changes intra-rank placement.
 
-**이 경계가 흐려지는 실수** (본 ADR이 금지):
+**Mistakes that blur this boundary** (forbidden by this ADR):
 
-- DPPolicy에 "SIP 축"이 다시 등장 (ADR-0026에서 제거됨).
-- TP layer가 `set_device` 없이 `DPPolicy`만으로 cross-rank sharding을
-  표현 → 단일 rank 안에서 세로로 자른 것과 구분 안 됨.
+- The "SIP axis" reappearing in DPPolicy (removed in ADR-0026).
+- TP layers expressing cross-rank sharding via `DPPolicy` alone without
+  `set_device` → indistinguishable from a vertical split within a single
+  rank.
 
-본 ADR의 TP layer는 항상 "rank = SIP = one slice 소유 + DPPolicy intra-SIP
-분산" 관점에서만 weight/output을 다룬다.
+The TP layers of this ADR treat weight/output solely from the perspective
+of "rank = SIP = owns one slice + DPPolicy intra-SIP distribution".
 
 ### D4. `ColumnParallelLinear`
 
-**중요**: host-side `torch.matmul` 추상화를 신규 도입하지 않는다. layer의
-forward는 `torch.launch("gemm", gemm_kernel, ...)`로 기존 gemm kernel을
-호출 — KernBench bench들이 이미 쓰는 패턴
+**Important**: no new host-side `torch.matmul` abstraction is introduced.
+The layer's forward calls the existing gemm kernel via `torch.launch("gemm",
+gemm_kernel, ...)` — the pattern already used by KernBench benches
 ([benches/gemm_single_pe.py](benches/gemm_single_pe.py),
 [benches/gpt3_qkv.py](benches/gpt3_qkv.py)).
 
@@ -623,14 +676,14 @@ from kernbench.tp.parallel_state import (
 )
 
 class ColumnParallelLinear:
-    """Weight의 K(out_features) 축을 TP rank에 분산.
+    """Shards the K(out_features) axis of the weight across TP ranks.
 
     forward(x):
         x: (M, N) — full-replicated across ranks
-        W_k: (N, K / world_size) — rank-local slice (set_device로 SIP r에 거주)
+        W_k: (N, K / world_size) — rank-local slice (placed on SIP r via set_device)
         y_k = x @ W_k → (M, K / world_size) — rank-local output
 
-    출력은 column-sharded. RowParallelLinear가 기대하는 입력 형태.
+    Output is column-sharded. The input form expected by RowParallelLinear.
     """
 
     def __init__(self, in_features: int, out_features: int, bias: bool = False,
@@ -640,7 +693,7 @@ class ColumnParallelLinear:
         self.in_features = in_features
         self.k_local = out_features // ws
         self._torch = torch
-        # 각 rank가 자기 slice 소유 — set_device(rank)에 의해 SIP r에 배치.
+        # Each rank owns its own slice — placed on SIP r by set_device(rank).
         self.weight = torch.zeros(
             (in_features, self.k_local), dtype=dtype,
             dp=DPPolicy(cube="column_wise", pe="column_wise"),
@@ -655,7 +708,7 @@ class ColumnParallelLinear:
             )
 
     def forward(self, x):
-        # x는 full-replicated (caller 보장). 단순 local gemm.
+        # x is full-replicated (caller-guaranteed). Plain local gemm.
         M = x.shape[0]
         out = self._torch.empty(
             (M, self.k_local), dtype=x.dtype,
@@ -666,24 +719,25 @@ class ColumnParallelLinear:
             "col_parallel_gemm", _gemm_kernel,
             x, self.weight, out, M, self.in_features, self.k_local,
         )
-        # bias add는 별도 kernel 혹은 composite gemm의 fused bias.
-        # 초기 scope에서는 bias=False만 충분히 검증.
+        # Bias add would be a separate kernel or the fused bias of a composite gemm.
+        # In the initial scope, only bias=False is thoroughly verified.
         return out
 ```
 
-**Yield-safety contract (normative)**: `ColumnParallelLinear.forward`는 한 번의
-`torch.launch` 호출로 kernel launch → 내부 `ctx.wait` 쌍을 포함한다. 이는
-D0.4-(1)의 "worker는 유한 step 내 yield" 조건을 자동으로 만족 — TP layer
-사용자가 yield 패턴을 수동으로 삽입할 필요 없음.
+**Yield-safety contract (normative)**: `ColumnParallelLinear.forward`
+makes a single `torch.launch` call, which pairs the kernel launch with an
+internal `ctx.wait`. This automatically satisfies the "worker yields within
+a finite number of steps" condition of D0.4-(1), so TP layer users do not
+need to insert yield patterns manually.
 
 ### D5. `RowParallelLinear`
 
 ```python
 class RowParallelLinear:
-    """Weight의 N(in_features) 축을 TP rank에 분산.
+    """Shards the N(in_features) axis of the weight across TP ranks.
 
     forward(x):
-        x: (M, N / world_size) — rank-local slice (ColumnParallel의 출력)
+        x: (M, N / world_size) — rank-local slice (output of ColumnParallel)
         W_k: (N / world_size, K) — rank-local slice
         y_k = x @ W_k → (M, K) — partial sum on each rank
         y = all_reduce(y_k, op="sum") → (M, K) on every rank
@@ -701,7 +755,7 @@ class RowParallelLinear:
             dp=DPPolicy(cube="column_wise", pe="column_wise"),
             name="row_parallel_w",
         )
-        # bias는 rank 0에만 (Megatron convention). 초기 scope에서는 생략.
+        # bias only on rank 0 (Megatron convention). Omitted in initial scope.
         self.bias = None
 
     def forward(self, x):
@@ -715,25 +769,26 @@ class RowParallelLinear:
             "row_parallel_gemm", _gemm_kernel,
             x, self.weight, y_partial, M, self.n_local, self.out_features,
         )
-        # Cross-rank reduce. ADR-0024의 dist.all_reduce는 D0 + mp.spawn 하에서
-        # 정상 동작 (kernel parent = main 유지).
+        # Cross-rank reduce. ADR-0024's dist.all_reduce works correctly
+        # under D0 + mp.spawn (kernel parent = main is preserved).
         self._torch.distributed.all_reduce(y_partial, op="sum")
         return y_partial
 ```
 
-**Yield-safety contract (normative)**: `RowParallelLinear.forward`는 launch →
-내부 wait에 이어 `all_reduce` (defer + worker yield 패턴)까지 포함하므로 forward
-한 번당 **최소 2회 yield**가 보장됨. D0.4-(1)의 scheduler progress 조건 자동
-만족. 모든 본 ADR의 TP layer forward는 "최소 하나의 wait 또는 collective를
-포함해 yield-safe하다"를 invariant로 유지한다 — 이후 추가되는 TP primitive
-(VocabParallelEmbedding 등)도 동일 계약 필수.
+**Yield-safety contract (normative)**: `RowParallelLinear.forward` includes
+launch → internal wait followed by `all_reduce` (defer + worker-yield
+pattern), so **at least 2 yields per forward** are guaranteed. The
+scheduler-progress condition of D0.4-(1) is automatically satisfied. All TP
+layer forwards in this ADR maintain the invariant "yield-safe by containing
+at least one wait or collective" — any future TP primitives (e.g.,
+VocabParallelEmbedding) must keep the same contract.
 
-### D6. Primitive 함수
+### D6. Primitive functions
 
 ```python
 # primitives.py
 def copy_to_tp_region(x):
-    """Forward: identity. Backward: all-reduce. (Training 추가 시 구현)."""
+    """Forward: identity. Backward: all-reduce. (Implemented when training is added)."""
     return x
 
 def reduce_from_tp_region(x, torch):
@@ -743,19 +798,19 @@ def reduce_from_tp_region(x, torch):
 
 def scatter_to_tp_region(x):
     raise NotImplementedError(
-        "Phase 2: 사용자가 이미 sharded tensor를 생성하는 것으로 대체"
+        "Phase 2: replaced by users creating already-sharded tensors"
     )
 
 def gather_from_tp_region(x):
     raise NotImplementedError(
-        "Phase 2: all-gather kernel 선행 필요 (future)"
+        "Phase 2: requires all-gather kernel as a prerequisite (future)"
     )
 ```
 
-### D7. 샘플 bench — 2-layer MLP with TP
+### D7. Sample bench — 2-layer MLP with TP
 
 ```python
-# benches/tp_mlp.py (신규)
+# benches/tp_mlp.py (new)
 from kernbench.policy.placement.dp import DPPolicy
 import kernbench.tp as tp
 import numpy as np
@@ -780,10 +835,10 @@ def worker(rank: int, world_size: int, torch):
     h = fc1.forward(x)      # column-sharded (B, D_hidden / ws)
     y = fc2.forward(h)      # all-reduced (B, D_out) on every rank
 
-    # rank 0만 결과 출력 / 검증
+    # Only rank 0 prints / verifies the result
     if rank == 0:
         result = y.numpy()
-        # 실제 검증 값은 zero-init weight이면 전부 0 — scope에서는 "완료 자체" 검증
+        # Zero-init weights make every value 0, so within this scope the check is that the run completes
         print(f"  tp_mlp: shape={result.shape}, mean={float(result.mean()):.4f}")
 
 
@@ -793,31 +848,33 @@ def run(torch):
     torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws)
 ```
 
-### D8. Non-functional — training 미지원
+### D8. Non-functional — training not supported
 
-본 ADR은 **inference/forward only**. Backward / gradient / optimizer는 future.
-기존 KernBench가 training이 아니므로 자연스러움.
+This ADR is **inference/forward only**. Backward pass, gradients, and
+optimizers are future work; this fits KernBench, which does not do training.
 
-### D9. 초기 scope 제약
+### D9. Initial-scope constraints
 
-- TP size = world_size (mixed DP+TP 없음).
-- `scatter_to_tp_region`, `gather_from_tp_region`은 unimplemented.
-- **Weight 기본값은 zero**. 적절한 init scheme (Xavier, Kaiming 등)은 future.
-  단 테스트는 `tensor.copy_`로 결정론적 non-zero pattern을 주입해 numerical
-  correctness를 검증 (T2/T6). 즉 "production default = zero, 검증 = 결정론적
-  non-zero"로 운영 분리.
-- Bias 초기 scope에서 생략 (Megatron의 rank 0-only bias 정책은 future).
-- Pipeline parallelism은 scope 밖.
-- VocabParallelEmbedding은 all-gather 선행 필요 → stub only.
+- TP size = world_size (no mixed DP+TP).
+- `scatter_to_tp_region`, `gather_from_tp_region` are unimplemented.
+- **Default weight value is zero**. Proper init schemes (Xavier, Kaiming,
+  etc.) are future work. Tests inject deterministic non-zero patterns via
+  `tensor.copy_` to verify numerical correctness (T2/T6), so the two are
+  kept separate: production default = zero, verification = non-zero.
+- Bias is omitted in the initial scope (Megatron's rank-0-only bias policy
+  is future work).
+- Pipeline parallelism is out of scope.
+- VocabParallelEmbedding requires a prerequisite all-gather → stub only.
 
-### D10. 회귀: `ring_default_ws` xfail 해제 — 필수 acceptance
+### D10. Regression: `ring_default_ws` xfail removal — mandatory acceptance
 
-D0 (worker-wait 일반화) + D0.5 (host-read barrier) 덕분에 모든 worker-driven
-`ctx.wait` 및 host-read가 main-drain 경로로 routing됨 → ADR-0024 Phase B의
-kernel-greenlet orphan 원인이 소멸. 기존 matrix test의 `ring_default_ws`
-strict-xfail 케이스를 본 ADR 구현 이후 **PASS**로 전환하는 것을 **필수 회귀
-기준**으로 포함. Observable acceptance criteria는 **T7**에 명시 (deadlock
-부재, GreenletExit 부재, numerical tolerance 등).
+Thanks to D0 (worker-wait generalization) + D0.5 (host-read barrier), every
+worker-driven `ctx.wait` and host-read is routed through the main-drain path
+→ the cause of the kernel-greenlet orphan in ADR-0024 Phase B disappears.
+Flipping the existing matrix test's `ring_default_ws` strict-xfail case to
+**PASS** after this ADR's implementation is included as a **mandatory
+regression criterion**. Observable acceptance criteria are specified in
+**T7** (no deadlock, no GreenletExit, numerical tolerance, etc.).
 
 ---
 
@@ -825,40 +882,48 @@ strict-xfail 케이스를 본 ADR 구현 이후 **PASS**로 전환하는 것을
 
 - **ADR-0024** (launcher): rank = SIP, greenlet-local rank,
   `torch.ahbm.set_device(rank)`.
-- **ADR-0026** (DPPolicy intra-device): weight tensor의 per-rank slice 표현.
-- **ADR-0023 / ADR-0025** (IPCQ): `dist.all_reduce` 구현의 기반.
+- **ADR-0026** (DPPolicy intra-device): per-rank slice representation of
+  weight tensors.
+- **ADR-0023 / ADR-0025** (IPCQ): foundation of `dist.all_reduce`
+  implementation.
 
 ---
 
 ## Non-goals
 
-- **Backward pass / training**: inference only. Training simulation은 별도 ADR.
-- **Mixed parallelism (DP + TP + PP)**: 초기엔 pure TP only.
-- **Weight init schemes**: 단순 zero / debug pattern.
-- **Fused ops**: Megatron의 fused matmul+bias+gelu는 kernel 레벨 문제.
-- **DTensor 통합**: ADR-0028 future.
-- **Host-side `torch.matmul` 추상화**: TP layer는 `torch.launch(gemm_kernel, ...)`
-  로 기존 gemm kernel을 호출. 신규 matmul host-op 도입 안 함.
+- **Backward pass / training**: inference only. Training simulation is a
+  separate ADR.
+- **Mixed parallelism (DP + TP + PP)**: pure TP only at the start.
+- **Weight init schemes**: simple zero / debug pattern.
+- **Fused ops**: Megatron's fused matmul+bias+gelu is a kernel-level
+  concern.
+- **DTensor integration**: deferred to ADR-0028 (future).
+- **Host-side `torch.matmul` abstraction**: TP layers call the existing
+  gemm kernel via `torch.launch(gemm_kernel, ...)`. No new matmul host-op
+  is introduced.
 
 ---
 
 ## Open questions
 
-- **`initialize_model_parallel` 위치**: `kernbench.tp.initialize_model_parallel`
-  (현 결정) vs real-PyTorch의 `torch.distributed.init_device_mesh`. TP 전용
-  모듈에 유지.
-- **Weight init**: ADR은 zero. Debug pattern (e.g., identity)이 유효 검증에
-  필요할 수 있음 — Phase 1 test에서 필요 시 추가.
-- **bias 배치 정책**: Megatron은 RowParallelLinear bias를 rank 0에만. 초기
-  scope에서는 bias=False로 회피.
-- **GEMM kernel 위치**: `kernbench.tp.kernels._gemm_kernel` vs 기존
-  `benches/gemm_single_pe.py`에서 import. TP가 bench 의존을 가지면 안 되므로
-  tp 내부에 복제. 향후 `kernbench.kernels` 공용 패키지로 이관 가능.
+- **Location of `initialize_model_parallel`**:
+  `kernbench.tp.initialize_model_parallel` (current decision) vs
+  real-PyTorch's `torch.distributed.init_device_mesh`. Kept in the TP-only
+  module.
+- **Weight init**: the ADR uses zero. A debug pattern (e.g., identity) may
+  be needed for valid verification — add at Phase 1 test time if needed.
+- **Bias placement policy**: Megatron places RowParallelLinear bias only on
+  rank 0. The initial scope sidesteps this with bias=False.
+- **GEMM kernel location**: `kernbench.tp.kernels._gemm_kernel` vs
+  importing from existing `benches/gemm_single_pe.py`. TP must not depend
+  on benches, so the kernel is duplicated inside `kernbench.tp`. It can
+  move to a shared `kernbench.kernels` package later.
 
-**Resolved (이전 rev에서 open이었던 것들)**:
-- ~~`tensor.numpy()` 호출 시 drain 타이밍~~ → **D0.5에서 결정**: 공식 host-read
-  entry-point(`numpy`, `data`, `__getitem__`, data-포함 `__repr__`)는 자동
-  drain barrier. metadata-only accessor는 barrier 아님.
+**Resolved (previously open in earlier revisions)**:
+- ~~Drain timing on `tensor.numpy()` call~~ → **decided in D0.5**: the
+  official host-read entry points (`numpy`, `data`, `__getitem__`,
+  data-containing `__repr__`) are automatic drain barriers. Metadata-only
+  accessors are not barriers.
 
 ---
 
@@ -866,23 +931,25 @@ strict-xfail 케이스를 본 ADR 구현 이후 **PASS**로 전환하는 것을
 
 ### Positive
 
-- **Megatron 코드 이식 용이**: real training code와 API 일치.
-- **TP 벤치마크 가능**: scaling, communication-compute overlap 등 HW 특성
-  연구.
-- **`ring_default_ws` xfail 해제**: D0의 부산물로 ADR-0024 Phase B 블로커 해소.
-- **Scheduler loop 단일화**: D1 (`mp.spawn`) 도입으로 hand-rolled loop 제거.
-  후속 collective/TP 벤치가 동일 패턴 재사용.
-- **DPPolicy 의미 명확화** (ADR-0026 시너지): TP layer가 intra-device DPPolicy
-  만 사용하는 모범 사례.
+- **Easy porting of Megatron code**: API matches real training code.
+- **TP benchmarking enabled**: research on scaling,
+  communication-compute overlap, and other HW characteristics.
+- **`ring_default_ws` xfail removal**: as a byproduct of D0, the ADR-0024
+  Phase B blocker is resolved.
+- **Scheduler-loop unification**: introducing D1 (`mp.spawn`) removes the
+  hand-rolled loop. Subsequent collective/TP benches reuse the same
+  pattern.
+- **DPPolicy semantics clarified** (synergy with ADR-0026): TP layers serve
+  as a best-practice example of using only intra-device DPPolicy.
 
 ### Negative
 
-- 새 모듈 (`kernbench.tp`) 유지보수 비용.
-- 초기 scope가 제한적 (pure TP only, forward only).
-- D0 generalization이 `ctx.wait`의 세만틱을 바꿈 — 단일 드라이버 테스트와의
-  호환성을 명시적으로 검증 필요 (T7).
+- Maintenance cost of a new module (`kernbench.tp`).
+- Initial scope is limited (pure TP only, forward only).
+- D0 generalization changes the semantics of `ctx.wait` — compatibility
+  with single-driver tests must be explicitly verified (T7).
 
 ### Neutral
 
-- ADR-0024/0026 기반 위에 순수한 상위 레이어 추가. Hardware simulation
-  stack에 영향 없음 (D0 제외).
+- A pure upper layer added on top of ADR-0024/0026. No impact on the
+  hardware-simulation stack (apart from D0).
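
Review note on D4/D5/D7: the ADR has `ColumnParallelLinear` shard the hidden axis by column and `RowParallelLinear` consume those shards and sum the per-rank partial results with `all_reduce`. The sketch below is a host-only check of that arithmetic, not KernBench code. It uses plain Python lists, no greenlets, no `set_device`, and no `DPPolicy`; `matmul` and `tp_mlp_reference` are hypothetical helper names introduced only for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product for matrices stored as lists of rows."""
    return [
        [sum(a_ik * b_kj for a_ik, b_kj in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def tp_mlp_reference(x, w1, w2, world_size):
    """Emulate column-parallel fc1 followed by row-parallel fc2 on one host.

    Each simulated rank owns columns [rank * k_local, (rank + 1) * k_local)
    of w1 and the matching rows of w2. Summing the per-rank partial outputs
    stands in for all_reduce(op="sum").
    """
    d_hidden = len(w1[0])
    if d_hidden % world_size:
        raise ValueError("hidden size must be divisible by world_size")
    k_local = d_hidden // world_size

    total = None
    for rank in range(world_size):
        lo, hi = rank * k_local, (rank + 1) * k_local
        w1_shard = [row[lo:hi] for row in w1]  # column slice of fc1 weight
        w2_shard = w2[lo:hi]                   # row slice of fc2 weight
        y_partial = matmul(matmul(x, w1_shard), w2_shard)
        if total is None:
            total = y_partial
        else:
            total = [
                [t + p for t, p in zip(t_row, p_row)]
                for t_row, p_row in zip(total, y_partial)
            ]
    return total


# Integer inputs keep the comparison exact.
x = [[1, 2], [3, 4]]
w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]
w2 = [[1, 2], [0, 1], [2, 0], [1, 1]]
assert tp_mlp_reference(x, w1, w2, world_size=2) == matmul(matmul(x, w1), w2)
```

Because the sharded result equals the unsharded product for any divisible `world_size`, a test that injects deterministic non-zero weights (as D9 describes) can compare the simulated output against an unsharded reference of this kind.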
diff --git a/tests/test_verify_adr_lang_pairs.py b/tests/test_verify_adr_lang_pairs.py
index 190197a..0ad2881 100644
--- a/tests/test_verify_adr_lang_pairs.py
+++ b/tests/test_verify_adr_lang_pairs.py
@@ -92,6 +92,18 @@ def test_crlf_normalization(tmp_path: Path) -> None:
     assert v.verify(tmp_path) == []
 
 
+def test_em_dash_title_separator_recognized(tmp_path: Path) -> None:
+    """ADR-0033 uses ' — ' instead of ': ' between ADR-NNNN and the title."""
+    en = tmp_path / "docs/adr/ADR-0033-foo-bar.md"
+    ko = tmp_path / "docs/adr-ko/ADR-0033-foo-bar.md"
+    en.parent.mkdir(parents=True, exist_ok=True)
+    ko.parent.mkdir(parents=True, exist_ok=True)
+    body = "## Status\n\nAccepted\n\n## Context\n\nbody\n"
+    en.write_text("# ADR-0033 — Latency Model\n\n" + body, encoding="utf-8")
+    ko.write_text("# ADR-0033 — Latency Model\n\n" + body, encoding="utf-8")
+    assert v.verify(tmp_path) == []
+
+
 def test_underscore_in_slug_recognized(tmp_path: Path) -> None:
     """ADR-0013 uses an underscore in its slug; the regex must accept it."""
     _make_adr(tmp_path / "docs/adr/ADR-0013-ver-verification_strategy.md", "0013")
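
Review note on the em-dash test: it passes only because the `TITLE_RE` change in `tools/verify_adr_lang_pairs.py` replaces the literal `:` with a word boundary. The sketch below copies both patterns from that change and compares them; `adr_number` is a hypothetical helper used only for this illustration and is not part of the tool.

```python
import re

# Patterns copied from the TITLE_RE change in tools/verify_adr_lang_pairs.py.
OLD_TITLE_RE = re.compile(r"^# ADR-(\d{4}):")
NEW_TITLE_RE = re.compile(r"^# ADR-(\d{4})\b")


def adr_number(title_line, pattern=NEW_TITLE_RE):
    """Return the four-digit ADR number from a title line, or None."""
    match = pattern.match(title_line)
    return match.group(1) if match else None


# The new pattern accepts both separators.
assert adr_number("# ADR-0033 — Latency Model") == "0033"
assert adr_number("# ADR-0033: Latency Model") == "0033"
# The old pattern rejected the em-dash form.
assert adr_number("# ADR-0033 — Latency Model", OLD_TITLE_RE) is None
# The word boundary still rejects a fifth digit after the number.
assert adr_number("# ADR-00331: Too many digits") is None
```

One side effect worth noting: `\b` also rejects a number followed directly by an underscore or letter (for example `# ADR-0033_x`), since no word boundary exists there.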
diff --git a/tools/verify_adr_lang_pairs.py b/tools/verify_adr_lang_pairs.py
index d0147d5..5661cf3 100644
--- a/tools/verify_adr_lang_pairs.py
+++ b/tools/verify_adr_lang_pairs.py
@@ -24,7 +24,7 @@ import sys
 from pathlib import Path
 
 ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-[a-z0-9_-]+\.md$")
-TITLE_RE = re.compile(r"^# ADR-(\d{4}):")
+TITLE_RE = re.compile(r"^# ADR-(\d{4})\b")
 
 
 def _normalize(text: str) -> str: