diff --git a/.claude/commands/report.md b/.claude/commands/report.md
new file mode 100644
index 0000000..a6ea2bc
--- /dev/null
+++ b/.claude/commands/report.md
@@ -0,0 +1,318 @@
+---
+description: Generate a public-facing architecture design document from approved ADRs and SPEC.md, with gap analysis reported to chat only.
+---
+
+# `/report` — Architecture Design Document Generator
+
+Generates a **public-facing** architecture design document at
+`docs/report/architecture-{YYYY}-{1H|2H}.md` derived from the current ADR
+corpus, SPEC.md, CLAUDE.md, and the canonical component list.
+
+This command is **strictly read-only** on `docs/adr/`, `SPEC.md`,
+`CLAUDE.md`, and `src/`. The only write is the report file itself
+(a derived artifact under `docs/report/`).
+
+---
+
+## Invocation
+
+Two modes:
+
+- `/report` — **dry-run** (default). No file is written. The command
+  reads sources, performs classification, and reports the planned TOC
+  + gap analysis to chat only. Use this to validate ADR-to-section
+  mapping before committing.
+- `/report write` — **write mode**. Performs the same procedure and
+  writes `docs/report/architecture-{period}.md`. Use after a dry-run
+  whose classification looks correct.
+
+Period determination (both modes), from system date:
+
+- month 1–6 → `{YYYY}-1H`
+- month 7–12 → `{YYYY}-2H`
+
+In write mode, if `docs/report/architecture-{period}.md` already exists,
+overwrite it without asking (regeneration is the expected operation).
+
+---
+
+## Output Contract
+
+### Document body (`docs/report/architecture-{period}.md`)
+
+Public release form. Reader is an external developer/architect. They do
+**not** have access to SPEC.md or ADR files. Therefore:
+
+- **No `ADR-NNNN` identifiers** in visible prose.
+- **No `SPEC R/§` identifiers** in visible prose.
+- **No internal jargon** assumed without definition.
+- **No diagram embeds** — only `<!-- DIAGRAM: ... -->` placeholders.
+- **Attribution via HTML comments** — every prose paragraph that derives
+  from a source carries an inline comment immediately above it:
+  `<!-- src: ADR-NNNN <section-name> -->` (multiple sources allowed).
+
+### Chat-only report (not written to any file)
+
+After the run (in either mode), report to the user in the chat response:
+
+- File path written (write mode only).
+- Section counts (e.g., "Detailed Architecture: 8 components covered,
+  2 in `builtin/` have no ADR backing").
+- **G1 gaps** — SPEC requirements (R-numbers / §) with no ADR citing them.
+- **G2 gaps** — ADRs missing **Context** or **Decision**. Alternatives
+  and Consequences are optional; their absence is NOT a gap.
+- **G3 gaps** — ADR cross-references without a back-reference.
+- **G4 suggestions** — areas where an ADR seems missing based on the
+  ADR corpus + SPEC reading. Phrase as suggestions, not findings. Each
+  G4 item must say *why* it's suggested and remain falsifiable.
+- **G5 consistency issues** — ADR-to-ADR inconsistencies:
+  - **G5a (supersession not reflected)** — ADR-A states it supersedes
+    ADR-B, but ADR-B's Status is not marked as Superseded.
+  - **G5b (merge candidates)** — two or more ADRs cover near-identical
+    scope (detected naturally during section assignment, not via
+    exhaustive pair-wise scan).
+  - **G5c (explicit contradictions)** — two ADRs whose Decisions
+    directly oppose each other. Must cite both quotations; do not
+    speculate contradictions from topical similarity alone.
+- **TOC rationale** — for each section, list contributing ADR IDs
+  (this is for the user's verification only, never written to the
+  document itself).
+
+G4 must never appear in the document body. G1–G3 and G5 are also chat-only.
+
+---
+
+## Procedure
+
+### Step 1 — Determine period
+
+Use current system date. Compute `{YYYY}-1H` or `{YYYY}-2H`.
+
+### Step 2 — Ingest ADRs
+
+For each `docs/adr/ADR-NNNN-*.md`:
+
+- If both `ADR-NNNN-*.md` (Korean) and `ADR-NNNN-*.en.md` (English)
+  exist for the same number, **prefer the Korean `.md`** version.
+- Parse for the four canonical sections: Context, Decision, Alternatives
+  (also accept "Alternatives Considered"), Consequences.
+- Record presence/absence of **Context** and **Decision** for G2.
+  Alternatives and Consequences presence is recorded for use during
+  authoring, but their absence is not a gap.
+- Record ADR-NNNN cross-references for G3.
+- Record Status (e.g., Accepted, Superseded, Draft) and any "supersedes
+  ADR-NNNN" text in the body for G5a.
+
+Process ADRs in **numerical order** for determinism.
+
+### Step 3 — Read canonical component list
+
+List `src/kernbench/components/builtin/*.py`, excluding `__init__.py`,
+`pe_types.py`, and `__pycache__/`. Sort alphabetically. This is the
+canonical order for Detailed Architecture subsections.
+
+### Step 4 — Read SPEC.md and CLAUDE.md
+
+For G1 detection: extract every `R<N>` and `§<X.Y>` identifier mentioned
+in SPEC.md. For each ADR, check which of these it cites. SPEC IDs with
+zero citing ADRs → G1.
+
+### Step 5 — Section assignment
+
+Assign each ADR to exactly one of:
+
+- **Design Principles** — project-wide rationale, philosophy, mission
+  (e.g., "why source-level kernel execution", "why fast multi-device
+  scaling"). Includes ADRs that describe foundational invariants
+  (e.g., latency model assumptions, verification strategy).
+- **High-level Architecture** — Tray / SIP / CUBE / PE hierarchy and
+  cross-layer boundaries (e.g., runtime API ↔ sim_engine ↔ components).
+- **Detailed Architecture** — single-component internal designs. One
+  subsection per file in the canonical component list. ADRs whose
+  primary topic is the internal structure of one component go here.
+- **Implementation Decisions** — **cross-cutting** algorithms / policies
+  / schemes / models that don't belong to a single component:
+  collective algorithms, parallelization policies, address schemes,
+  routing algorithms, model assumptions.
+
+Boundary rule between Detailed Architecture and Implementation Decisions:
+
+> Detailed Architecture = component-internal.
+> Implementation Decisions = spans multiple components OR is an
+> algorithm/policy/scheme/assumption rather than a structural choice.
+
+If an ADR fits two sections plausibly, prefer the one that minimizes
+duplication and pick the more specific bucket (Detailed if it primarily
+concerns one component, else Implementation Decisions).
+
+During classification, opportunistically detect ADR consistency issues:
+
+- **G5b (merge candidate)** — if two or more ADRs land in the same
+  Detailed Architecture subsection or the same Implementation Decisions
+  topic AND their primary scope is near-identical, record as a merge
+  candidate. Topical adjacency is not enough; the scopes must be
+  effectively the same question.
+- **G5c (explicit contradiction)** — if while reading you encounter two
+  ADRs whose Decisions directly oppose each other on the same question,
+  record both quotations verbatim with their ADR IDs. Do NOT speculate
+  contradictions from similarity, vocabulary, or domain overlap — only
+  explicit, citable opposition.
+
+Do NOT perform an exhaustive pair-wise scan of all ADRs. G5b/G5c are
+byproducts of normal reading; if not encountered, the chat report
+shows "(none)".
+
+### Step 6 — Write the document (write mode only)
+
+In **dry-run mode**, skip this step entirely. Proceed directly to Step 7.
+
+```markdown
+# KernBench — Architecture Design Document
+*{YYYY} {1H|2H}*
+
+## Design Principles
+<prose>
+
+## High-level Architecture
+<intro prose>
+
+### Tray
+### SIP
+### CUBE
+### PE
+
+## Detailed Architecture
+### <component-1>
+### <component-2>
+...
+
+## Implementation Decisions
+### <topic-1>
+### <topic-2>
+...
+```
+
+#### Authoring rules (apply to every section)
+
+- **Stay grounded.** Every claim must trace to an ADR's stated content
+  (Context / Decision / Alternatives / Consequences). No invented
+  motivation, no invented alternatives, no invented trade-offs.
+- **4-part discipline, naturally.** Each subsection should naturally
+  cover: the problem the design addresses, the decision made, the
+  alternatives considered, the consequences. Do **not** label these
+  with rigid headers like "**Problem.**" — weave them into prose. But
+  ensure all four are present *if the source ADR documents them*.
+- **Missing → omit, not fabricate.** If a source ADR has no
+  "Alternatives" section, do **not** invent alternatives; write the
+  remaining parts. Record G2 only if Context or Decision is missing.
+- **Attribution.** Every paragraph derived from one or more ADRs
+  carries an HTML comment immediately above:
+  `<!-- src: ADR-NNNN <section> [, ADR-MMMM <section>] -->`.
+- **Diagram placeholders.** Where a diagram would help, insert
+  `<!-- DIAGRAM: <short description of what the diagram should show> -->`
+  on its own line. **Never** embed an image (`![...](...)`).
+- **Public tone.** Self-contained. Define internal terms (SIP, CUBE,
+  PE, Tray, NOC, IPCQ, TCM, etc.) on first use within the document.
+  Do not assume reader has read SPEC or ADRs.
+- **No internal references.** No `ADR-NNNN` in body text. No
+  `SPEC §X.Y` or `R<N>` in body text. These appear only inside HTML
+  attribution comments.
+- **Detailed Architecture component subsections.** Use the canonical
+  list from Step 3 in order. For each component file, write a
+  subsection drawing from any ADR that primarily concerns that
+  component. If no ADR covers a component, write a one-line stub
+  noting the component exists and flag it in chat report. If an ADR
+  covers a topic not in the canonical list, place it under
+  "Detailed Architecture → Other" (sub-subsection) and flag for
+  canonical-list extension in chat.
+- **Implementation Decisions topic naming.** Derive topic names from
+  ADR titles, made reader-friendly (no ADR number). Group related
+  ADRs under one topic when natural (e.g., multiple address-related
+  ADRs under "Address Scheme").
+
+### Step 7 — Generate chat report
+
+After Step 6 (write mode) or directly from Step 5 (dry-run mode),
+emit the following to chat. Do **not** write any of this to a file.
+
+In **dry-run mode**, replace the `Wrote:` line with:
+`**DRY-RUN — no file written.** Review TOC and gaps below. Run \`/report write\` to commit.`
+
+```
+## /report — Generation Summary
+
+**Wrote:** docs/report/architecture-{period}.md
+
+**Section coverage**
+- Design Principles: <N> ADRs
+- High-level Architecture: <N> ADRs
+- Detailed Architecture: <covered>/<total> components; components without ADR: [...]
+- Implementation Decisions: <N> topics, <N> ADRs
+
+**TOC rationale (ADR → section mapping)**
+- Design Principles: ADR-NNNN, ADR-MMMM
+- High-level Architecture: ...
+- Detailed Architecture → <component>: ADR-NNNN
+- Implementation Decisions → <topic>: ADR-NNNN, ADR-MMMM
+
+**G1 — SPEC requirements without ADR support**
+- R<N> / §<X.Y>: not cited by any ADR
+- (or "none")
+
+**G2 — ADRs missing required sections (Context or Decision)**
+- ADR-NNNN: missing <Context|Decision>
+- (or "none")
+
+**G3 — Broken cross-references**
+- ADR-NNNN cites ADR-MMMM; ADR-MMMM does not back-reference
+- (or "none")
+
+**G4 — Suggested topics that may warrant a new ADR (verify before acting)**
+- <topic>: <why agent thinks it may be missing — must be falsifiable>
+- (or "none")
+
+**G5 — ADR consistency issues**
+- **G5a (supersession not reflected)**
+  - ADR-NNNN claims to supersede ADR-MMMM, but ADR-MMMM Status is "<status>"
+  - (or "none")
+- **G5b (merge candidates)**
+  - ADR-NNNN + ADR-MMMM: near-identical scope on <topic> — evaluate merge
+  - (or "none")
+- **G5c (explicit contradictions)**
+  - ADR-NNNN says "<quote>"; ADR-MMMM says "<quote>" — direct opposition on <question>
+  - (or "none")
+```
+
+---
+
+## Constraints (do not violate)
+
+1. **Read-only on source.** No writes to `docs/adr/`, `SPEC.md`,
+   `CLAUDE.md`, or `src/`. Only write is
+   `docs/report/architecture-{period}.md`.
+2. **No fabrication.** Every body paragraph traces to ADR content via
+   HTML attribution comment.
+3. **No diagram embeds.** Placeholders only.
+4. **No internal IDs in body.** ADR-NNNN and SPEC R/§ stay inside
+   HTML comments only.
+5. **Determinism.** ADRs processed in numerical order; components in
+   canonical (alphabetical) order. Same inputs → same output.
+6. **G4 stays in chat.** Never written to the document.
+7. **Korean bilingual preference.** When both `.md` and `.en.md`
+   exist for the same ADR number, use `.md`.
+8. **All ADRs included.** No exclusion list. ADRs about internal
+   tooling (CLI, diagram views, verification strategy) are still
+   included — usually under Design Principles or Implementation
+   Decisions, written in publishable form.
+
+---
+
+## Failure modes to avoid
+
+- **Padding** with general background not present in the source ADRs.
+- **Inferring alternatives** the ADR doesn't mention.
+- **Quietly skipping** an ADR because it seems internal. Include it,
+  rephrase for public audience.
+- **Inventing components** not in `src/kernbench/components/builtin/`.
+- **Auto-selecting diagrams** from `docs/diagrams/`. Only placeholders.
+- **Promoting G4 suggestions to the document.** They stay in chat.
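Note on the period rule in the `/report` command above (months 1–6 map to `{YYYY}-1H`, months 7–12 to `{YYYY}-2H`): a minimal sketch of that mapping, assuming the date is passed in explicitly. The function names `report_period` and `report_path` are hypothetical and do not exist in the repository; the command itself is a prompt, not code.

```python
from datetime import date


def report_period(today: date) -> str:
    """Return '{YYYY}-1H' for January-June and '{YYYY}-2H' for July-December."""
    half = "1H" if today.month <= 6 else "2H"
    return f"{today.year}-{half}"


def report_path(today: date) -> str:
    """Build the report file path for the half-year containing `today`.

    Hypothetical helper; the path pattern is the one stated in report.md.
    """
    return f"docs/report/architecture-{report_period(today)}.md"
```

Passing the date as an argument rather than reading the system clock inside the function keeps the half-year boundary (30 June vs. 1 July) easy to check.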
diff --git a/.claude/settings.local.json b/.claude/settings.local.json
index fbd7d37..3c0351e 100644
--- a/.claude/settings.local.json
+++ b/.claude/settings.local.json
@@ -30,7 +30,10 @@
       "Bash(python -m pytest tests/test_pe_components.py -v)",
       "Bash(python -m pytest tests/test_triton_emu.py -v)",
       "Bash(python -m pytest tests/test_pe_components.py tests/test_triton_emu.py -v)",
-      "Bash(python -m pytest tests/test_pe_components.py::test_mcpu_multi_pe_kernel_launch tests/test_pe_components.py::test_qkv_gemm_bench_multi_pe_completes -v)"
+      "Bash(python -m pytest tests/test_pe_components.py::test_mcpu_multi_pe_kernel_launch tests/test_pe_components.py::test_qkv_gemm_bench_multi_pe_completes -v)",
+      "Bash(git add:*)",
+      "Bash(git commit:*)",
+      "Bash(git push:*)"
     ]
   }
 }
diff --git a/.gitignore b/.gitignore
index ff7356b..93a3b88 100644
--- a/.gitignore
+++ b/.gitignore
@@ -29,4 +29,6 @@ build/
 
 # Logs
 *.log
-.claude/
+.claude/*
+!.claude/commands/
+!.claude/commands/*.md
diff --git a/CLAUDE.md b/CLAUDE.md
index 2057f0c..afc2ab4 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -218,17 +218,43 @@ General fallbacks. Apply to anything not explicitly covered above.
 
 ### ADR Lifecycle
 
-- `docs/adr/` contains ADRs reflecting current implementation or
-  work-in-progress designs.
-- `docs/history/` contains superseded ADRs as historical record.
-- When an ADR is superseded:
-  1. The superseding ADR includes a "Supersedes ADR-NNNN" line.
-  2. The superseded ADR's Status is set to "Superseded by ADR-MMMM".
-  3. The superseded ADR file is **moved** (git mv) to `docs/history/`.
-- Cross-references between ADRs use the ADR-NNNN ID and remain
-  valid regardless of file location.
-- ADR numbers are **immutable**; never renumber. Numbering holes
-  from moved ADRs are expected.
+ADRs live in one of three folders based on lifecycle state:
+
+- `docs/adr/` — **Accepted** (current implementation reflected).
+- `docs/adr-proposed/` — **Proposed**, **Stub**, or **Draft** (design
+  only / future-work exploration / retroactive documentation pending
+  verification).
+- `docs/adr-history/` — **Superseded** or **Merged** (no longer the
+  authoritative source; kept as historical record).
+
+Status field values:
+
+- `Accepted` — design is in current implementation.
+- `Proposed` — design is concrete but not yet implemented.
+- `Stub (Future Work)` — design space exploration; no commitment yet.
+- `Draft` — retroactive documentation drafted but not yet verified
+  against the implementation it describes.
+- `Superseded by ADR-NNNN` — replaced by another ADR.
+- `Merged into ADR-NNNN` — content absorbed by another ADR.
+
+Transitions:
+
+- **Proposed/Stub → Accepted**: when the ADR's decisions are
+  reflected in production code AND covered by tests. `git mv` from
+  `docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`.
+- **Draft → Accepted**: when the ADR's text has been verified to
+  accurately describe the existing implementation. `git mv` from
+  `docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`.
+- **Accepted → Superseded**: set Status to `Superseded by ADR-MMMM`
+  and `git mv` to `docs/adr-history/`. The superseding ADR includes
+  a "Supersedes ADR-NNNN" reference (or, for partial supersession of
+  clauses, documents this in its own body).
+- **Accepted → Merged**: set Status to `Merged into ADR-MMMM`
+  (single-line stub) and `git mv` to `docs/adr-history/`.
+
+Cross-references between ADRs use the `ADR-NNNN` ID and remain valid
+regardless of folder location. ADR numbers are **immutable**; never
+renumber. Numbering holes from moved ADRs are expected.
 
 ## Terminology
 
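Note on the ADR lifecycle rules in the CLAUDE.md hunk above: the Status-to-folder relationship can be sketched as a small lookup, which may help when checking that a moved ADR sits in the folder its Status implies. This is an illustration only; `ADR_FOLDERS` and `expected_folder` are hypothetical names, not part of the repository.

```python
# Hypothetical helper: maps an ADR Status value to the folder that the
# CLAUDE.md lifecycle rules say the file should live in.
ADR_FOLDERS = {
    "Accepted": "docs/adr",
    "Proposed": "docs/adr-proposed",
    "Stub (Future Work)": "docs/adr-proposed",
    "Draft": "docs/adr-proposed",
}


def expected_folder(status: str) -> str:
    """Return the folder implied by a Status line, or raise ValueError."""
    status = status.strip()
    # "Superseded by ADR-NNNN" and "Merged into ADR-NNNN" carry a trailing
    # ADR reference, so they are matched by prefix rather than exactly.
    if status.startswith(("Superseded by ", "Merged into ")):
        return "docs/adr-history"
    try:
        return ADR_FOLDERS[status]
    except KeyError:
        raise ValueError(f"unknown ADR status: {status!r}") from None
```

For example, the stub files added in this change carry `Merged into ADR-0017 (Cube NOC and HBM Connectivity).`, which this sketch resolves to `docs/adr-history`.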
diff --git a/README.md b/README.md
index 36d969e..77a9f5e 100644
--- a/README.md
+++ b/README.md
@@ -155,5 +155,6 @@ kernbench/
 ## Documentation
 
 - [CHANGES.md](CHANGES.md) — changelog with detailed descriptions of each release
-- [docs/latency-model.md](docs/latency-model.md) — latency model explanation with worked examples
+- [docs/onboarding/latency-model.md](docs/onboarding/latency-model.md) — latency model explanation with worked examples
+- [docs/onboarding/](docs/onboarding/) — onboarding guides (architecture overview, latency model, CCL author guide, intro presentation)
 - [docs/adr/](docs/adr/) — Architecture Decision Records
diff --git a/SPEC.md b/SPEC.md
index 0850a1b..bdcfaaf 100644
--- a/SPEC.md
+++ b/SPEC.md
@@ -51,7 +51,7 @@ Major architectural decisions are documented in ADRs and referenced by number.
 - ADR-0007: runtime_api vs sim_engine responsibility boundaries
 - ADR-0008: Tensor deployment and allocation (Host allocator, PA-first)
 - ADR-0009: Kernel execution fan-out and completion semantics
-- ADR-0010: CLI device selection and multi-device execution semantics
+- ADR-0010: Command line interface and execution semantics
 - ADR-0011: Memory Addressing — PA / VA / LA Address Models
 - ADR-0012: Host ↔ IO_CPU message schema (PA-first, PE-tagged shards)
 - ADR-0013: Verification strategy and Phase 1 test plan
diff --git a/docs/history/ADR-0018-Logical Address.en.md b/docs/adr-history/ADR-0018-mem-logical-address.en.md
similarity index 100%
rename from docs/history/ADR-0018-Logical Address.en.md
rename to docs/adr-history/ADR-0018-mem-logical-address.en.md
diff --git a/docs/history/ADR-0018-Logical Address.md b/docs/adr-history/ADR-0018-mem-logical-address.md
similarity index 100%
rename from docs/history/ADR-0018-Logical Address.md
rename to docs/adr-history/ADR-0018-mem-logical-address.md
diff --git a/docs/adr-history/ADR-0019-dev-noc-local-hbm.en.md b/docs/adr-history/ADR-0019-dev-noc-local-hbm.en.md
new file mode 100644
index 0000000..1d746ae
--- /dev/null
+++ b/docs/adr-history/ADR-0019-dev-noc-local-hbm.en.md
@@ -0,0 +1,5 @@
+# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC
+
+## Status
+
+Merged into ADR-0017 (Cube NOC and HBM Connectivity).
diff --git a/docs/adr-history/ADR-0019-dev-noc-local-hbm.md b/docs/adr-history/ADR-0019-dev-noc-local-hbm.md
new file mode 100644
index 0000000..63efb1d
--- /dev/null
+++ b/docs/adr-history/ADR-0019-dev-noc-local-hbm.md
@@ -0,0 +1,5 @@
+# ADR-0019: CUBE NOC 내 Per-Channel 및 Aggregated HBM 연결 모델
+
+## Status
+
+Merged into ADR-0017 (Cube NOC and HBM Connectivity).
diff --git a/docs/adr-history/ADR-0021-dev-pe-pipeline-refactor.en.md b/docs/adr-history/ADR-0021-dev-pe-pipeline-refactor.en.md
new file mode 100644
index 0000000..710dadd
--- /dev/null
+++ b/docs/adr-history/ADR-0021-dev-pe-pipeline-refactor.en.md
@@ -0,0 +1,5 @@
+# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing
+
+## Status
+
+Merged into ADR-0014 (PE Pipeline Execution Model).
diff --git a/docs/adr-history/ADR-0021-dev-pe-pipeline-refactor.md b/docs/adr-history/ADR-0021-dev-pe-pipeline-refactor.md
new file mode 100644
index 0000000..82b48e2
--- /dev/null
+++ b/docs/adr-history/ADR-0021-dev-pe-pipeline-refactor.md
@@ -0,0 +1,5 @@
+# ADR-0021: PE 파이프라인 리팩토링 — 컴포넌트 분리 + Scheduler 기반 라우팅
+
+## Status
+
+Merged into ADR-0014 (PE Pipeline Execution Model).
diff --git a/docs/history/ADR-0029-hierarchical-allreduce.md b/docs/adr-history/ADR-0029-algo-hierarchical-allreduce.md
similarity index 100%
rename from docs/history/ADR-0029-hierarchical-allreduce.md
rename to docs/adr-history/ADR-0029-algo-hierarchical-allreduce.md
diff --git a/docs/history/ADR-0031-physaddr-pe-resource-extension.md b/docs/adr-history/ADR-0031-mem-physaddr-pe-resource-extension.md
similarity index 99%
rename from docs/history/ADR-0031-physaddr-pe-resource-extension.md
rename to docs/adr-history/ADR-0031-mem-physaddr-pe-resource-extension.md
index e8bb39e..6aa23c5 100644
--- a/docs/history/ADR-0031-physaddr-pe-resource-extension.md
+++ b/docs/adr-history/ADR-0031-mem-physaddr-pe-resource-extension.md
@@ -257,5 +257,5 @@ PhysAddr encoding. 호출자는 어느 range인지 몰라도 됨.
 |------|--------|
 | `src/kernbench/policy/address/phyaddr.py` | Range table (`PE_RESOURCE_MAP`), range-based decode, 신규 component-specific factory들 (`pe_ipcq_addr` 등), 기존 `pe_tcm_addr` 내부 인코딩 갱신 |
 | `src/kernbench/policy/address/allocator.py` | Range-aware pool 분리 (TCM pool / IPCQ pool / scratchpad pool 등 per-PE) |
-| `docs/adr/ADR-0001-physaddr-layout.md` | Amendment note: range-based PE resource partition |
+| `docs/adr/ADR-0001-mem-physaddr-layout.md` | Amendment note: range-based PE resource partition |
 | `tests/test_phyaddr.py` | Range table 검증, 각 factory의 encode/decode round-trip, 기존 `pe_tcm_addr` 회귀 |
diff --git a/docs/adr/ADR-0028-dtensor-support.md b/docs/adr-proposed/ADR-0028-par-dtensor-support.md
similarity index 100%
rename from docs/adr/ADR-0028-dtensor-support.md
rename to docs/adr-proposed/ADR-0028-par-dtensor-support.md
diff --git a/docs/adr/ADR-0030-ipcq-physaddr.md b/docs/adr-proposed/ADR-0030-mem-ipcq-physaddr.md
similarity index 99%
rename from docs/adr/ADR-0030-ipcq-physaddr.md
rename to docs/adr-proposed/ADR-0030-mem-ipcq-physaddr.md
index ab735be..cfc9add 100644
--- a/docs/adr/ADR-0030-ipcq-physaddr.md
+++ b/docs/adr-proposed/ADR-0030-mem-ipcq-physaddr.md
@@ -340,7 +340,7 @@ encoding can be plugged in later" 약속이 이행된 것.
 | `src/kernbench/sim_engine/memory_store.py` | D3: IPCQ buffer가 기존 space와 공유되는지 검증 |
 | `src/kernbench/sim_engine/engine.py` | D4: IPCQ token routing이 PhysAddr-based fabric 경로 사용 |
 | `src/kernbench/ccl/diagnostics.py` | D5: pointer_dump를 PhysAddr 포매팅으로 개선 |
-| `docs/adr/ADR-0023-ipcq-pe-collective.md` | D6: D2.5 amendment note |
+| `docs/adr/ADR-0023-dev-ipcq-pe-collective.md` | D6: D2.5 amendment note |
 | `tests/test_ipcq_physaddr.py` (new) | T1 |
 | `tests/test_ipcq_alloc.py` (new) | T2 |
 | `tests/test_ccl_install_plan.py` | T3 확장 |
diff --git a/docs/adr/ADR-0001-physaddr-layout.md b/docs/adr/ADR-0001-mem-physaddr-layout.md
similarity index 100%
rename from docs/adr/ADR-0001-physaddr-layout.md
rename to docs/adr/ADR-0001-mem-physaddr-layout.md
diff --git a/docs/adr/ADR-0002-routing-distance.md b/docs/adr/ADR-0002-lat-routing-distance.md
similarity index 97%
rename from docs/adr/ADR-0002-routing-distance.md
rename to docs/adr/ADR-0002-lat-routing-distance.md
index 34bd7e4..19849f9 100644
--- a/docs/adr/ADR-0002-routing-distance.md
+++ b/docs/adr/ADR-0002-lat-routing-distance.md
@@ -35,7 +35,7 @@ shortcuts that obscure control paths.
 
 ### D3. Bypass is explicit and graph-represented
 - All paths must be explicitly represented in the graph and subject to latency accumulation.
-- Example: PE_DMA connects to the NOC router mesh (ADR-0019). All destinations
+- Example: PE_DMA connects to the NOC router mesh (ADR-0017 D7). All destinations
   (HBM, shared SRAM, inter-cube UCIe) are reached via explicit mesh hops.
   Local HBM access has minimal hops (switching overhead only); remote access
   traverses additional routers.
diff --git a/docs/adr/ADR-0003-target-system-hierarchy.md b/docs/adr/ADR-0003-dev-target-system-hierarchy.md
similarity index 100%
rename from docs/adr/ADR-0003-target-system-hierarchy.md
rename to docs/adr/ADR-0003-dev-target-system-hierarchy.md
diff --git a/docs/adr/ADR-0004-memory-semantics-local-hbm.md b/docs/adr/ADR-0004-mem-memory-semantics-local-hbm.md
similarity index 98%
rename from docs/adr/ADR-0004-memory-semantics-local-hbm.md
rename to docs/adr/ADR-0004-mem-memory-semantics-local-hbm.md
index cb7d3ff..d9144b0 100644
--- a/docs/adr/ADR-0004-memory-semantics-local-hbm.md
+++ b/docs/adr/ADR-0004-mem-memory-semantics-local-hbm.md
@@ -15,7 +15,7 @@ Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth,
 
 - Each PE is assigned a logically defined “local HBM” region.
 - Local HBM corresponds to the pseudo-channel subset directly attached to that PE’s
-  router in the NOC mesh (ADR-0019).
+  router in the NOC mesh (ADR-0017 D4).
 - The path is: PE_DMA → local router → HBM_CTRL (switching overhead only, 0 mesh hops).
 - The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration.
 
diff --git a/docs/adr/ADR-0005-diagram-views-distance-layout.md b/docs/adr/ADR-0005-dev-diagram-views-distance-layout.md
similarity index 92%
rename from docs/adr/ADR-0005-diagram-views-distance-layout.md
rename to docs/adr/ADR-0005-dev-diagram-views-distance-layout.md
index 6908409..6391f19 100644
--- a/docs/adr/ADR-0005-diagram-views-distance-layout.md
+++ b/docs/adr/ADR-0005-dev-diagram-views-distance-layout.md
@@ -20,7 +20,9 @@ Diagrams must reflect this distance by default.
 
 ---
 
-## Global Defaults
+## Decision
+
+### D1. Global Defaults
 
 - All diagrams MUST be **distance-aware by default**.
 - All diagrams MUST render **representative views** of the architecture.
@@ -31,7 +33,7 @@ Diagrams must reflect this distance by default.
 
 ---
 
-## Representative Rendering Rule
+### D2. Representative Rendering Rule
 
 - All CUBEs share the same internal structure.
 - All PEs share the same internal structure.
@@ -47,9 +49,9 @@ unless explicitly requested.
 
 ---
 
-## Diagram Views
+### D3. Diagram Views
 
-### View A — SIP-Level Diagram
+#### View A — SIP-Level Diagram
 
 **Purpose**
 Explain system-scale structure and connectivity.
@@ -75,7 +77,7 @@ Explain system-scale structure and connectivity.
 
 ---
 
-### View B — CUBE-Level Diagram
+#### View B — CUBE-Level Diagram
 
 **Purpose**
 Explain cube-internal structure and data/control flow.
@@ -106,7 +108,7 @@ Explain cube-internal structure and data/control flow.
 
 ---
 
-### View C — PE-Level Diagram
+#### View C — PE-Level Diagram
 
 **Purpose**
 Explain internal PE behavior and execution structure.
@@ -128,14 +130,14 @@ Explain internal PE behavior and execution structure.
 
 ---
 
-## Distance-Aware Layout (Default)
+### D4. Distance-Aware Layout (Default)
 
-### Distance definition
+#### Distance definition
 
 - Distance is defined as **accumulated latency**, consistent with ADR-0002.
 - Distance is computed from a single anchor node.
 
-### Default anchor selection
+#### Default anchor selection
 
 - SIP view: IO chiplet (or Host CPU if present)
 - CUBE view: a representative PE
@@ -143,7 +145,7 @@ Explain internal PE behavior and execution structure.
 
 Anchors are **implicit defaults** and MUST NOT be required to be specified.
 
-### Layout rules
+#### Layout rules
 
 - Diagrams MUST be laid out in layers based on distance buckets.
 - Layout direction MUST be consistent within a view type
@@ -156,7 +158,7 @@ without affecting distance semantics.
 
 ---
 
-## Generation Contract (for Tools / Claude Code)
+### D5. Generation Contract (for Tools / Claude Code)
 
 When generating diagrams:
 
diff --git a/docs/adr/ADR-0006-topology-compilation-distance-diagram.md b/docs/adr/ADR-0006-dev-topology-compilation-distance-diagram.md
similarity index 99%
rename from docs/adr/ADR-0006-topology-compilation-distance-diagram.md
rename to docs/adr/ADR-0006-dev-topology-compilation-distance-diagram.md
index 60b0d8b..4b3767c 100644
--- a/docs/adr/ADR-0006-topology-compilation-distance-diagram.md
+++ b/docs/adr/ADR-0006-dev-topology-compilation-distance-diagram.md
@@ -63,7 +63,7 @@ For each view (SIP / CUBE / PE):
 - CUBE-level projection MUST include:
   - Router mesh (from cube_mesh.yaml), HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
     and PEs as opaque blocks.
-  - All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0019).
+  - All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0017).
 - Default anchors are implicit (ADR-0005) and MUST NOT require instance indices.
 
 ### D6. Output formats and determinism
diff --git a/docs/adr/ADR-0007-runtime-api-boundaries.md b/docs/adr/ADR-0007-api-runtime-api-boundaries.md
similarity index 73%
rename from docs/adr/ADR-0007-runtime-api-boundaries.md
rename to docs/adr/ADR-0007-api-runtime-api-boundaries.md
index 51975be..9522b9d 100644
--- a/docs/adr/ADR-0007-runtime-api-boundaries.md
+++ b/docs/adr/ADR-0007-api-runtime-api-boundaries.md
@@ -42,21 +42,25 @@ The runtime API MUST NOT:
 
 ---
 
-### D2. Simulation engine executes and schedules requests
+### D2. Simulation engine wires components and tracks completion
 
 The simulation engine (sim_engine) MUST:
 
-- inject requests into the compiled topology graph,
+- wire components at initialization (create port stores + start wire
+  processes per the component port/wire framework — ADR-0015),
+- inject requests into the compiled topology graph at entry components
+  (e.g., PCIE_EP for memory operations, IO_CPU for kernel launch),
 - schedule and execute events using a discrete-event model,
-- manage correlation ids and completion tracking,
-- decompose operations into low-level requests when required
-  (e.g., MemoryWrite events).
+- manage correlation ids and completion tracking.
 
 The simulation engine MUST NOT:
 
 - define tensor semantics,
 - define kernel execution policies,
-- expose internal graph details to the runtime API.
+- expose internal graph details to the runtime API,
+- walk the topology path during request execution,
+- call component `run()` methods directly,
+- track per-hop latency or decompose fan-out (components own this).
 
 ---
 
@@ -87,3 +91,5 @@ component-level fan-out explicitly.
 - SPEC R4, R7, R8
 - ADR-0008 (Tensor deployment)
 - ADR-0009 (Kernel execution)
+- ADR-0015 (Component port/wire model and engine role)
+- ADR-0010 (CLI surface and execution semantics — runtime API consumer)
diff --git a/docs/adr/ADR-0008-tensor-deploy-and-allocation.md b/docs/adr/ADR-0008-api-tensor-deploy-and-allocation.md
similarity index 100%
rename from docs/adr/ADR-0008-tensor-deploy-and-allocation.md
rename to docs/adr/ADR-0008-api-tensor-deploy-and-allocation.md
diff --git a/docs/adr/ADR-0009-kernel-execution-messaging.md b/docs/adr/ADR-0009-api-kernel-execution-messaging.md
similarity index 97%
rename from docs/adr/ADR-0009-kernel-execution-messaging.md
rename to docs/adr/ADR-0009-api-kernel-execution-messaging.md
index fae01b5..a94be07 100644
--- a/docs/adr/ADR-0009-kernel-execution-messaging.md
+++ b/docs/adr/ADR-0009-api-kernel-execution-messaging.md
@@ -142,3 +142,5 @@ control plane — runtime API and application kernels are unchanged.
 - SPEC R1, R2, R7, R8
 - ADR-0007 (Runtime API boundaries)
 - ADR-0008 (Tensor deployment)
+- ADR-0013 (Verification strategy — V2 fan-out tests)
+- ADR-0015 D4 (concrete fabric path for kernel launch)
diff --git a/docs/adr/ADR-0010-api-cli-surface-and-semantics.md b/docs/adr/ADR-0010-api-cli-surface-and-semantics.md
new file mode 100644
index 0000000..4925a64
--- /dev/null
+++ b/docs/adr/ADR-0010-api-cli-surface-and-semantics.md
@@ -0,0 +1,131 @@
+# ADR-0010: Command Line Interface and Execution Semantics
+
+## Status
+
+Accepted
+
+## Context
+
+The `kernbench` CLI is the user-facing entry point of the simulator. It
+exposes three subcommands:
+
+- `run` — execute a benchmark against a topology.
+- `probe` — diagnostic utility for latency / BW measurement.
+- `web` — interactive topology viewer.
+
+Device enumeration is centralized in the CLI; neither the runtime API
+nor the simulation engine enumerates devices. Benchmarks remain
+single-device by design and accept a device identifier as input.
+
+## Decision
+
+### D1. Benchmark contract — single-device by design
+
+- A benchmark MUST define behavior for a single device only.
+- A benchmark MUST accept a device identifier as input.
+- Benchmarks MUST NOT enumerate or loop over multiple devices.
+
+Multi-device execution is the CLI's concern (D3), not the benchmark's.
+
+### D2. `kernbench run` — benchmark execution
+
+Required arguments:
+
+- `--topology <path>`: topology YAML file path. Loaded via
+  `resolve_topology()`.
+- `--bench <name>`: benchmark name. Resolved via
+  `benches.loader.resolve_bench()`.
+
+Optional arguments:
+
+- `--device <selector>` (default: `all`):
+  - `all` — run once per discovered SIP (see D3).
+  - `sip:<N>` — run only on SIP N.
+  - Parsed via `resolve_device()`.
+- `--verify-data` (default: off) — enable Phase 2 data verification
+  (see ADR-0020). When set, `engine_factory` constructs the engine
+  with `enable_data=True`. After the benchmark runs, a diagnostic
+  summary of recorded ops is printed.
+
+Each invocation uses a single simulation instance: one benchmark run
+for `sip:<N>`, or one run per discovered SIP for `all` (D3).
+
+### D3. Multi-device execution is logically parallel
+
+When `--device` is `all` (or omitted) and the topology has multiple SIPs:
+
+- Benchmark executions are submitted to a single simulation engine
+  instance.
+- Executions are logically parallel in simulation time.
+- Inter-device contention is naturally modeled (shared fabric
+  bandwidth, cross-SIP traffic, etc.).
+
+The CLI does NOT spawn multiple OS processes or independent
+simulation runs — parallelism is internal to one simulation instance.
+
+### D4. `kernbench probe` — latency / BW diagnostic utility
+
+Required argument:
+
+- `--topology <path>`: topology YAML file path.
+
+Optional argument:
+
+- `--case <name>` (default: `all`) — run a predefined traffic
+  pattern, or `all` to run every defined case.
+
+Probe runs each pattern through the simulation engine and reports
+per case:
+
+- End-to-end latency (ns).
+- Effective bandwidth (nbytes / total_ns).
+- Bottleneck bandwidth (min edge BW along the chosen path).
+- Utilization (effective / bottleneck).
+
+Probe additionally validates monotonicity invariants, for example
+that latency satisfies local-HBM ≤ cross-PE-within-cube ≤ cross-cube
+≤ cross-SIP, and reports violations. Probe is a developer tool for
+verifying the latency / BW model; it is not a benchmark.
+
+### D5. `kernbench web` — topology viewer
+
+Optional arguments:
+
+- `--port <N>` (default: `8765`) — HTTP port.
+- `--no-open` — do not auto-open the browser.
+
+Launches a local HTTP server that renders the compiled topology in
+the browser. Distinct from the static `docs/diagrams/` artifacts:
+
+- `docs/diagrams/` files are derived at topology-compile time
+  (ADR-0006).
+- `kernbench web` is interactive — pan/zoom, hover for component
+  attributes, switch between SIP / CUBE / PE views.
+
+### D6. Runtime API and simulation engine remain device-scoped
+
+- Runtime API calls operate on one device per invocation.
+- The simulation engine schedules all requests deterministically.
+- Neither layer enumerates devices.
+
+This invariant keeps each layer testable in isolation; device
+enumeration and multi-device fan-out live only in the CLI's `run`
+command (D3).
+
+## Consequences
+
+- Benchmark authors write single-device logic; multi-device behavior
+  emerges from the CLI dispatching across SIPs.
+- Adding a new subcommand (e.g., trace export, replay) does not
+  require benchmark or runtime-API changes — the CLI is the
+  extension point.
+- `probe` and `web` are diagnostic / visualization tools, not
+  benchmarks; they bypass the benchmark loader path.
+
+## Links
+
+- SPEC R7, R8, R9
+- ADR-0007 (Runtime API and Simulation Engine Boundaries)
+- ADR-0020 (Two-pass data execution — `--verify-data`)
+- ADR-0006 (Topology compilation and diagram generation —
+  background for `kernbench web`)
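
Review note on ADR-0010 D4: the four per-case probe metrics and the monotonicity check can be illustrated with a minimal sketch. The names `ProbeResult` and `monotonicity_violations` are hypothetical and do not come from the `kernbench` code base. The sketch also assumes that the invariant compares end-to-end latency across cases listed from nearest to farthest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical record for one probe case (not the kernbench type)."""
    case: str
    nbytes: int
    total_ns: float
    edge_bw_gbs: tuple[float, ...]  # bandwidth of each edge on the chosen path

    @property
    def effective_bw_gbs(self) -> float:
        # bytes per nanosecond is numerically equal to GB/s
        return self.nbytes / self.total_ns

    @property
    def bottleneck_bw_gbs(self) -> float:
        # the slowest edge bounds the achievable bandwidth
        return min(self.edge_bw_gbs)

    @property
    def utilization(self) -> float:
        return self.effective_bw_gbs / self.bottleneck_bw_gbs


def monotonicity_violations(results):
    """Return (nearer, farther) case pairs where latency decreases.

    `results` is assumed to be ordered from nearest to farthest target.
    """
    return [
        (near.case, far.case)
        for near, far in zip(results, results[1:])
        if near.total_ns > far.total_ns
    ]


results = [
    ProbeResult("local_hbm", 4096, 64.0, (256.0, 128.0)),
    ProbeResult("cross_pe", 4096, 128.0, (256.0, 128.0)),
    ProbeResult("cross_cube", 4096, 100.0, (128.0, 64.0)),
]
violations = monotonicity_violations(results)
```

With these made-up numbers, the `cross_cube` case is faster than `cross_pe`, so the check flags that pair as a violation.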
diff --git a/docs/adr/ADR-0010-cli-device-selection.md b/docs/adr/ADR-0010-cli-device-selection.md
deleted file mode 100644
index bed601b..0000000
--- a/docs/adr/ADR-0010-cli-device-selection.md
+++ /dev/null
@@ -1,62 +0,0 @@
-# ADR-0010: CLI Device Selection and Multi-Device Execution Semantics
-
-## Status
-
-Accepted
-
-## Context
-
-Benchmarks represent device-agnostic workloads that operate on a single device.
-Users may want to run a benchmark:
-
-- on a specific device, or
-- across all devices in the system.
-
-Device enumeration must not leak into benchmarks or runtime APIs.
-
----
-
-## Decision
-
-### D1. Benchmarks are single-device by design
-
-- A benchmark MUST define behavior for a single device only.
-- A benchmark MUST accept a device identifier as input.
-- Benchmarks MUST NOT enumerate or loop over multiple devices.
-
----
-
-### D2. CLI controls device selection
-
-The `kernbench run` command supports an optional `--device` argument:
-
-- If `--device <id>` is specified:
-  - the benchmark executes once for the specified device.
-
-- If `--device` is omitted:
-  - the benchmark executes once using all the SIPs discovered in the topology.
-
----
-
-### D3. Multi-device execution is logically parallel
-
-When running on multiple devices:
-
-- benchmark executions are submitted to a single simulation engine instance,
-- executions are logically parallel in simulation time,
-- inter-device contention is naturally modeled.
-
----
-
-### D4. Runtime API and simulation engine remain device-scoped
-
-- Runtime API calls operate on one device per invocation.
-- The simulation engine schedules all requests deterministically.
-- Neither layer enumerates devices.
-
----
-
-## Links
-
-- SPEC R7, R8
-- ADR-0007 (Runtime API boundaries)
diff --git a/docs/adr/ADR-0011-memory-addressing-simplification.md b/docs/adr/ADR-0011-mem-memory-addressing-simplification.md
similarity index 98%
rename from docs/adr/ADR-0011-memory-addressing-simplification.md
rename to docs/adr/ADR-0011-mem-memory-addressing-simplification.md
index 8961b24..064c365 100644
--- a/docs/adr/ADR-0011-memory-addressing-simplification.md
+++ b/docs/adr/ADR-0011-mem-memory-addressing-simplification.md
@@ -396,7 +396,7 @@ Other N values:
 #### D-LA7. n:1 mode detail
 
 - One logical access → one aggregated request.
-- Target: aggregated router → hbm_ctrl (see ADR-0019).
+- Target: aggregated router → hbm_ctrl (see ADR-0017 D8).
 - Aggregated link BW = `channels_per_pe × channel_bw_gbs`
   (e.g. 8 × 32 = 256 GB/s).
 - Single queue / resource for modelling.
@@ -516,6 +516,6 @@ Negative:
 - ADR-0009 (kernel execution)
 - ADR-0014 (PE-internal execution model)
 - ADR-0015 (component port/wire model)
-- ADR-0019 (NOC + per-channel HBM connectivity — LA model topology
-  consumer)
+- ADR-0017 (Cube NOC and HBM connectivity — LA model topology consumer)
+- ADR-0013 (Verification strategy — V1 PA tagging)
 - SPEC R2 (latency by traversal), R10 (memory addressing)
diff --git a/docs/adr/ADR-0012-host-io-message-schema.md b/docs/adr/ADR-0012-api-host-io-message-schema.md
similarity index 98%
rename from docs/adr/ADR-0012-host-io-message-schema.md
rename to docs/adr/ADR-0012-api-host-io-message-schema.md
index c9332f5..07d95c5 100644
--- a/docs/adr/ADR-0012-host-io-message-schema.md
+++ b/docs/adr/ADR-0012-api-host-io-message-schema.md
@@ -229,4 +229,5 @@ Tests SHOULD validate:
 - ADR-0011 (Memory Addressing — PA / VA / LA)
 - ADR-0007 (runtime_api vs sim_engine boundaries)
 - ADR-0009 (kernel execution fan-out/aggregation)
+- ADR-0013 (Verification strategy — V1 message schema validation)
 - SPEC R2, R7, R8
diff --git a/docs/adr/ADR-0013-verification_strategy.md b/docs/adr/ADR-0013-ver-verification_strategy.md
similarity index 100%
rename from docs/adr/ADR-0013-verification_strategy.md
rename to docs/adr/ADR-0013-ver-verification_strategy.md
diff --git a/docs/adr/ADR-0014-dev-pe-pipeline-execution-model.md b/docs/adr/ADR-0014-dev-pe-pipeline-execution-model.md
new file mode 100644
index 0000000..ccb63f3
--- /dev/null
+++ b/docs/adr/ADR-0014-dev-pe-pipeline-execution-model.md
@@ -0,0 +1,451 @@
+# ADR-0014: PE Pipeline Execution Model
+
+## Status
+
+Accepted
+
+## Context
+
+This ADR defines the PE-internal kernel execution model:
+
+- Role decomposition of PE-internal components
+- Command dispatch paths (simple / composite / multi-op composite with epilogue)
+- TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
+- TCM-centric dataflow with a register-file intermediary
+- Engine resource model
+- Observability and trace contract
+- Topology representation
+
+PE-internal structure (7 components in scope; 2 cross-referenced):
+
+- `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`,
+  `pe_tcm` — defined here
+- `pe_mmu` — VA model, defined in ADR-0011 D-VA
+- `pe_ipcq` — collective communication, defined in ADR-0023
+
+The goal is a deterministic, trace-friendly execution contract that keeps
+each block independently swappable.
+
+## Decision
+
+### D1. PE-internal component roles
+
+**PE_CPU**
+
+- Executes kernel instruction stream / control logic.
+- Generates PE commands and submits them to `PE_SCHEDULER` (via
+  `PeInternalTxn`).
+- Does NOT enqueue work directly into engine queues.
+
+**PE_SCHEDULER**
+
+- Sole dispatcher inside a PE.
+- Receives commands from `PE_CPU`. Dispatch by command type:
+  - Simple command (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`)
+    → forward directly to the target engine.
+  - `CompositeCmd` → generate a `TilePlan`, feed tiles into the pipeline
+    via a single `_feed_loop` (D6).
+- Does not participate in stage-to-stage chaining within a composite;
+  that is handled by token self-routing (D6).
+
+**PE_DMA**
+
+- Handles memory transfers between TCM and external memory domains
+  (HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
+- Two execution channels:
+  - `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1) — see D4.
+- Additional virtual channels:
+  - `vc_compute` — load/store/writeback traffic for GEMM/MATH tiles.
+  - `vc_comm` — IPCQ collective send data (defined in ADR-0023 D8).
+
+**PE_FETCH_STORE**
+
+- TCM ↔ Register File transfer unit.
+- Isolates register-file access semantics from compute engines so that
+  GEMM/MATH stay pure compute components.
+- BW-based latency model; TCM access contention naturally serializes
+  through `PE_TCM`'s BW resource.
+
+**PE_GEMM**
+
+- MAC array. Reads operands from the register file; writes results to
+  the register file. Does not touch `PE_TCM` directly.
+
+**PE_MATH**
+
+- Element-wise / reduction / SIMD unit. Reads / writes the register file.
+
+**PE_TCM**
+
+- Tightly-coupled scratchpad with BW-serialized access. Two logical
+  regions partitioned by ownership (see D5).
+
+**Cross-referenced components** (defined elsewhere):
+
+- `pe_mmu` — VA→PA translation per access (ADR-0011 D-VA).
+- `pe_ipcq` — collective ring buffers and peer endpoint metadata
+  (ADR-0023).
+
+### D2. Command lifecycle and queues
+
+`PE_SCHEDULER` maintains three logical structures:
+
+**SubmissionQueue** — written by `PE_CPU`; consumed by the scheduler.
+
+**InflightTable** — owned and mutated only by `PE_SCHEDULER`; tracks
+expanded sub-commands, dependency state, engine assignment, and
+completion status.
+
+**CompletionQueue** — written by `PE_SCHEDULER`; holds final completion
+records.
+
+**Single-writer rule**: only `PE_SCHEDULER` mutates command completion
+state. Engines report completion via explicit events / messages
+consumed by the scheduler.
+
+**Command completion**: when all sub-commands complete, `PE_SCHEDULER`
+publishes a completion record.
+
+### D3. Dispatch modes
+
+#### D3.1 Simple command
+
+A simple command expands to exactly one engine sub-command:
+
+- `DmaReadCmd` / `DmaWriteCmd` → `PE_DMA`
+- `GemmCmd` → `PE_GEMM`
+- `MathCmd` → `PE_MATH`
+
+Flow:
+
+```text
+PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
+       → completion → PE_SCHEDULER → CompletionQueue
+```
+
+#### D3.2 Composite command (single-op tiled pipeline)
+
+The default `CompositeCmd` runs a single compute op as a tile-pipelined
+sequence:
+
+```text
+DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE
+```
+
+`PE_SCHEDULER` splits the DMA payload into hardware tiles and emits one
+`TileToken` per tile with a monotonically increasing `tile_id`.
+
+Tile dependency (within one tile `t`):
+
+```text
+DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)
+```
+
+Inter-tile overlap is allowed wherever engine resources permit
+(D4 governs the constraints):
+
+```text
+DMA_READ(t+1) ∥ COMPUTE(t)
+DMA_WRITE(t-1) ∥ COMPUTE(t)
+```
+
+#### D3.3 Multi-op composite (head + epilogue with scope)
+
+A `CompositeCmd` MAY carry `ops: tuple[OpSpec, ...]` to express a
+multi-op pipeline:
+
+```python
+@dataclass(frozen=True)
+class OpSpec:
+    kind: str         # "gemm" | "math.exp" | "math.bias_add" | ...
+    scope: Scope      # "per_k_tile" | "per_output_tile" | "once"
+    ...
+```
+
+- `ops[0]` (head) defines tile geometry (e.g., the head GEMM determines
+  M/K/N partition).
+- `ops[1:]` (epilogue) are subsequent stages whose `scope` decides how
+  often they fire:
+  - `per_k_tile` — every K-reduction step.
+  - `per_output_tile` — once per output tile.
+  - `once` — once per kernel.
+
+Cross-engine chains (e.g., GEMM head → MATH epilogue) need no special
+handling: each stage is dispatched via token self-routing (D6), and
+because GEMM and MATH share the compute slot (D4), their stages run
+serially within the same composite.
+
+The empty-`ops` form is the legacy single-op path.
+
+### D4. Engine resource model
+
+**DMA engine**:
+
+- `DMA_READ`: `simpy.Resource(capacity=1)`.
+- `DMA_WRITE`: `simpy.Resource(capacity=1)`.
+- Both channels run concurrently (READ ∥ WRITE allowed).
+- Within a channel, requests serialize (READ ∥ READ disallowed; same
+  for WRITE).
+- `vc_comm` is an orthogonal channel for IPCQ traffic defined in
+  ADR-0023 D8 — out of scope for this ADR.
+
+**Compute engine**:
+
+- `accel_slot`: `simpy.Resource(capacity=1)` shared by `PE_GEMM` and
+  `PE_MATH`.
+- At most one compute op runs at a time within a PE.
+- Multi-op composite chains (D3.3) execute their compute stages serially
+  through this slot; token self-routing (D6) ensures the next stage
+  starts only after the previous compute releases the slot.
+
+**Engine completion**: each engine emits a completion event consumed by
+the scheduler / `PipelineContext` (D6).
+
+### D5. Dataflow
+
+**Input path (HBM source)**:
+
+```text
+HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
+PE_TCM → PE_FETCH_STORE → Register File
+Register File → PE_GEMM | PE_MATH
+```
+
+**Input path (shared SRAM source)**:
+
+```text
+Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
+PE_TCM → PE_FETCH_STORE → Register File
+```
+
+**Output path (HBM destination)**:
+
+```text
+Register File → PE_FETCH_STORE → PE_TCM
+PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM
+```
+
+GEMM/MATH never touch `PE_TCM` directly — `PE_FETCH_STORE` is the
+single TCM↔register-file gateway. This makes TCM BW contention
+explicit and lets fetch unit policies (e.g., prefetch) be replaced
+independently of compute engines.
+
+#### D5.1 PE_TCM partitioning
+
+`PE_TCM` is split into two logical regions:
+
+**SchedulerReservedTCM**
+
+- Owned exclusively by `PE_SCHEDULER`.
+- Holds composite-command tile buffers.
+- `PE_SCHEDULER` partitions this region, assigns buffers per DMA_READ /
+  COMPUTE / DMA_WRITE stage, guarantees input/output separation, and
+  manages tile-buffer lifetimes.
+
+**AllocatableTCM**
+
+- General-purpose region managed by `PEMemAllocator`.
+- Used for host / DP-visible allocations.
+
+**Visibility rule (hard isolation)**: `PEMemAllocator` MUST NOT see or
+allocate inside `SchedulerReservedTCM`. The reserved region is excluded
+from allocator-managed ranges by construction.
+
+**Tile buffer rules**:
+
+- Input and output buffers within `SchedulerReservedTCM` MUST NOT
+  overlap during a tile's active lifetime.
+- A tile buffer remains valid until the corresponding `DMA_WRITE`
+  completes.
+- Buffer reuse is permitted only after the consuming tile's lifetime
+  ends.
+
+### D6. TileToken self-routing pipeline
+
+A composite's stage-to-stage progression happens **without** routing
+through the scheduler. Each component forwards the token directly to
+the next stage's component using the token's `plan`:
+
+```text
+Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
+              ↑ chaining: no scheduler hop                          ↑
+                                                  PipelineContext.complete_tile()
+```
+
+This mirrors real-HW done-wire chains. The scheduler handles only
+**initial dispatch + completion aggregation**.
+
+#### TilePlan / Stage
+
+```python
+class StageType(Enum):
+    DMA_READ = 0
+    FETCH = 1
+    GEMM = 2
+    MATH = 3
+    STORE = 4
+    DMA_WRITE = 5
+
+@dataclass(frozen=True)
+class Stage:
+    stage_type: StageType
+    component: str         # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
+    params: dict           # stage-specific parameters
+
+@dataclass(frozen=True)
+class TilePlan:
+    tile_id: int
+    stages: tuple[Stage, ...]
+```
+
+#### TileToken
+
+```python
+@dataclass
+class TileToken:
+    tile_id: int
+    pipeline_ctx: PipelineContext
+    plan: TilePlan
+    stage_idx: int
+    params: dict             # cached current stage params
+    data_op: bool = True     # op_log opt-in (ADR-0020 D4)
+```
+
+Single-owner invariant: a token is owned by exactly one component at a
+time. Lifecycle: scheduler creates with `stage_idx=0` → component
+`_process()` → increment `stage_idx` → put to next stage's `in_port` →
+last stage calls `pipeline_ctx.complete_tile()`.
+
+#### PipelineContext (exactly-once completion)
+
+```python
+@dataclass
+class PipelineContext:
+    id: str
+    total_tiles: int
+    completed_tiles: int = 0
+    done_event: simpy.Event | None = None  # must be assigned before the last complete_tile()
+
+    def complete_tile(self) -> None:
+        self.completed_tiles += 1
+        if self.completed_tiles == self.total_tiles:
+            self.done_event.succeed()
+```
+
+Each tile's last stage MUST call `complete_tile()` exactly once.
+Duplicate calls are bugs (SimPy `Event` can succeed at most once).
+
+#### Feed ordering
+
+`PE_SCHEDULER` has exactly one `_feed_loop` process consuming a
+`_pending_feeds` FIFO. Composite commands are enqueued in submission
+order; tile feed for a command runs to completion before the next
+command's feed begins. **Tile-feed interleaving between commands is
+disallowed.**
+
+Within a single command's tiles, downstream pipeline overlap arises
+naturally — earlier tiles progress through later stages while the feeder
+keeps pushing remaining tiles into the first stage queue (SimPy Store
+backpressure governs flow control). If the first-stage queue is full,
+only the feeder blocks; the scheduler worker's inbox processing
+continues.
+
+#### Token routing pattern (base class)
+
+```python
+def _pipeline_worker(self, env):
+    while True:
+        token = yield self._inbox.get()
+        yield from self._process(env, token)       # stage-specific logic
+        next_idx = token.stage_idx + 1
+        if next_idx < len(token.plan.stages):
+            next_stage = token.plan.stages[next_idx]
+            token.stage_idx = next_idx
+            token.params = next_stage.params
+            yield self.out_ports[next_stage.component].put(token)
+        else:
+            token.pipeline_ctx.complete_tile()
+```
+
+Each component implements only `_process()`; chaining lives in the
+base class.
+
+### D7. Observability and trace contract
+
+The simulator emits deterministic trace events:
+
+- `command_submitted`
+- `sub_command_dispatched`
+- `engine_start`
+- `engine_complete`
+- `tile_ready`
+- `command_complete`
+
+For identical inputs, trace ordering MUST be deterministic.
+
+### D8. Topology representation
+
+PE-internal components are declared in `cube.pe_template`:
+
+```yaml
+pe_template:
+  components:
+    pe_cpu:         { kind: pe_cpu,         impl: builtin.pe_cpu,         attrs: { overhead_ns: ... } }
+    pe_scheduler:   { kind: pe_scheduler,   impl: builtin.pe_scheduler,   attrs: { overhead_ns: ... } }
+    pe_dma:         { kind: pe_dma,         impl: builtin.pe_dma,         attrs: { rd_engines: 1, wr_engines: 1 } }
+    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
+    pe_gemm:        { kind: pe_gemm,        impl: builtin.pe_gemm,        attrs: { shared_resource: accel_slot, ... } }
+    pe_math:        { kind: pe_math,        impl: builtin.pe_math,        attrs: { shared_resource: accel_slot, ... } }
+    pe_tcm:         { kind: pe_tcm,         impl: builtin.pe_tcm,         attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
+    pe_mmu:         { kind: pe_mmu,         impl: builtin.pe_mmu,         attrs: { ... } }   # ADR-0011 D-VA
+    pe_ipcq:        { kind: pe_ipcq,        impl: builtin.pe_ipcq,        attrs: { ... } }   # ADR-0023
+  links:
+    # Scheduler dispatch edges (initial)
+    scheduler_to_dma_mm:         0.0
+    scheduler_to_fetch_store_mm: 0.0
+    scheduler_to_gemm_mm:        0.0
+    scheduler_to_math_mm:        0.0
+    # Pipeline chaining edges (token self-routing per D6)
+    dma_to_fetch_store_mm:       0.0
+    fetch_store_to_gemm_mm:      0.0
+    fetch_store_to_math_mm:      0.0
+    gemm_to_fetch_store_mm:      0.0
+    gemm_to_math_mm:             0.0
+    math_to_fetch_store_mm:      0.0
+    fetch_store_to_dma_mm:       0.0
+    fetch_store_to_tcm_bw_gbs:   ...
+```
+
+Template is instantiated once per PE. PE instances are derived from
+`cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔
+cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).
+
+## Consequences
+
+### Positive
+
+- Each block is an independent topology node — individually swappable
+  via DI (ADR-0015).
+- PE-internal structure is visible in the topology graph.
+- Components do not know their downstream — plan-based routing gives
+  flexibility (e.g., epilogue chains require no scheduler change).
+- DMA and compute overlap naturally via SimPy Store backpressure.
+- Multi-op composite expresses fused operations (e.g., GEMM + bias_add)
+  without engine-level coupling.
+- TCM access contention is realistic — `PE_FETCH_STORE` is the single
+  TCM↔RF gateway.
+
+### Negative
+
+- Intra-PE component count is higher than a coarser model (7 base + 2
+  cross-referenced) — more topology nodes/edges.
+- Intra-PE token forwarding is explicit in traces (acceptable trade for
+  HW fidelity).
+
+## Links
+
+- ADR-0011 D-VA (PE_MMU component, VA translation)
+- ADR-0015 D4 (component port/wire model)
+- ADR-0020 (greenlet kernel execution / two-pass)
+- ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
+- SPEC R3, R4
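
Review note on ADR-0014 D3.2 and D4: the inter-tile overlap that follows from the capacity-1 resources can be sketched without SimPy. This is a simplified illustration, not the simulator's implementation. `schedule_tiles` is a hypothetical helper, FETCH and STORE are folded into the neighbouring stages, per-stage durations are fixed, and queues between stages are assumed unbounded (no backpressure).

```python
def schedule_tiles(n_tiles, read_ns, compute_ns, write_ns):
    """Return, per tile, a dict of stage -> (start_ns, end_ns).

    Each stage is backed by one capacity-1 resource: the DMA read channel,
    the shared compute slot, and the DMA write channel.
    """
    durations = {
        "DMA_READ": read_ns,
        "COMPUTE": compute_ns,
        "DMA_WRITE": write_ns,
    }
    # Time at which each capacity-1 resource becomes free again.
    free_at = {stage: 0.0 for stage in durations}
    timeline = []
    for _tile_id in range(n_tiles):
        ready = 0.0  # completion time of this tile's previous stage
        spans = {}
        for stage, duration in durations.items():
            # A stage starts once its input is ready and its resource is free.
            start = max(ready, free_at[stage])
            end = start + duration
            free_at[stage] = end
            spans[stage] = (start, end)
            ready = end
        timeline.append(spans)
    return timeline


timeline = schedule_tiles(n_tiles=3, read_ns=10, compute_ns=20, write_ns=10)
makespan_ns = timeline[-1]["DMA_WRITE"][1]
```

In this example the compute slot is the bottleneck: `DMA_READ(1)` runs while `COMPUTE(0)` is active, and `DMA_WRITE(0)` runs while `COMPUTE(1)` is active, so the three tiles finish sooner than three fully serial tiles would.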
diff --git a/docs/adr/ADR-0014-pe-internal-execution-model.md b/docs/adr/ADR-0014-pe-internal-execution-model.md
deleted file mode 100644
index 7153b2a..0000000
--- a/docs/adr/ADR-0014-pe-internal-execution-model.md
+++ /dev/null
@@ -1,365 +0,0 @@
-# ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)
-
-## Status
-
-Accepted
-
-## Context
-
-ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:
-
-- the dispatch model inside a PE,
-- the responsibilities of PE_SCHEDULER,
-- the PE_TCM-centric dataflow contract used by accelerator engines.
-
-We need a deterministic and debuggable PE-internal execution contract that supports:
-
-- simple single-engine commands
-- composite commands that build a tiled pipeline across DMA and accelerator engines
-
-The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.
-
-## Decision
-
-### D1. PE internal component roles
-
-Each PE contains the following logical components.
-
-**PE_CPU**
-
-- Executes kernel instruction stream or kernel control logic.
-- Generates PE commands.
-- Submits commands to PE_SCHEDULER.
-- PE_CPU does NOT enqueue work directly into engine queues.
-
-**PE_SCHEDULER**
-
-- The sole dispatcher inside a PE.
-- Receives commands from PE_CPU.
-- Expands composite commands into sub-commands.
-- Tracks dependencies and command state.
-- Dispatches work to engine queues.
-- Manages tile scheduling for composite commands.
-
-**PE_DMA**
-
-- Handles memory transfers between PE_TCM and external memory domains.
-- PE_DMA connects to the cube-level NOC (on-die fabric):
-  - All destinations (HBM, shared SRAM, inter-cube UCIe) are reached via the NOC
-  - Local HBM access: PE_DMA → NOC → hbm_ctrl (minimal hop)
-  - Remote/shared: PE_DMA → NOC → (fabric hops) → destination
-- Supported directions include:
-  - HBM → PE_TCM (via NOC)
-  - PE_TCM → HBM (via NOC)
-  - PE_TCM → shared SRAM (via NOC)
-  - PE_TCM → other memory domains (via NOC, if supported by topology)
-
-**PE_GEMM**
-
-- Matrix multiplication engine.
-- Reads activations from PE_TCM.
-- May stream weights directly from HBM.
-
-**PE_MATH**
-
-- Element-wise computation engine.
-- Reads and writes PE_TCM.
-
-**PE_TCM**
-
-- Local SRAM used as the staging memory for accelerator operations.
-
----
-
-### D2. Command lifecycle and queues
-
-PE_SCHEDULER maintains three logical structures.
-
-**SubmissionQueue**
-
-- Written by PE_CPU.
-- Contains incoming PE commands waiting to be processed.
-
-**InflightTable**
-
-- Owned and mutated only by PE_SCHEDULER.
-- Tracks:
-  - expanded sub-commands
-  - dependency state
-  - engine assignment
-  - completion status
-
-**CompletionQueue**
-
-- Written by PE_SCHEDULER.
-- Contains final completion records for commands.
-
-**Single-writer rule**
-
-- Only PE_SCHEDULER is allowed to mutate command completion state.
-- Engine components must report completion via explicit completion events/messages.
-
-**Command completion**
-
-A command becomes DONE when:
-
-- all sub-commands complete
-- PE_SCHEDULER publishes a completion record to CompletionQueue.
-
----
-
-### D3. Dispatch modes
-
-PE commands are divided into two categories.
-
-#### D3.1 Simple command
-
-A simple command expands to exactly one engine sub-command.
-
-Examples include:
-
-- DMA transfer
-- GEMM compute
-- MATH compute
-
-Execution flow:
-
-```text
-PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
-```
-
-#### D3.2 Composite command (tiled pipeline)
-
-Composite commands implement tiled pipelined execution across engines.
-
-Each tile executes the following pipeline:
-
-```text
-Input DMA (READ)
-→ Compute (GEMM or MATH)
-→ Output DMA (WRITE)
-```
-
-**Tiling rule**
-
-If the DMA payload exceeds hardware tile size, PE_SCHEDULER splits the transfer into tiles.
-Each tile is assigned a monotonically increasing `tile_id`.
-
-**Tile dependency rules**
-
-For tile `t`:
-
-- Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)`
-- Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)`
-- All dependencies are enforced by PE_SCHEDULER.
-
-**Overlap policy (Phase 0 default)**
-
-Operations for different tiles may overlap when engine resources permit.
-
-Allowed overlaps:
-
-```text
-DMA_READ(t+1) ∥ COMPUTE(t)
-DMA_WRITE(t−1) ∥ COMPUTE(t)
-DMA_READ(t) ∥ DMA_WRITE(t)
-```
-
-Disallowed overlaps:
-
-```text
-GEMM(t) ∥ GEMM(t′)
-MATH(t) ∥ MATH(t′)
-GEMM(t) ∥ MATH(t′)
-```
-
----
-
-### D4. Engine execution model (Phase 0 default)
-
-Each engine behaves as a deterministic service resource.
-
-**DMA engine**
-
-PE_DMA contains two independent channels.
-
-```text
-DMA_READ capacity  = 1
-DMA_WRITE capacity = 1
-```
-
-Rules:
-
-- DMA_READ and DMA_WRITE may execute concurrently.
-- Multiple READs cannot overlap.
-- Multiple WRITEs cannot overlap.
-
-Example allowed:
-
-```text
-DMA_READ(t+1) ∥ DMA_WRITE(t)
-```
-
-Example not allowed:
-
-```text
-DMA_READ(t) ∥ DMA_READ(t+1)
-DMA_WRITE(t) ∥ DMA_WRITE(t+1)
-```
-
-**Compute engine**
-
-Compute operations share a single compute resource.
-
-```text
-PE_ACCEL capacity = 1
-```
-
-Both GEMM and MATH require this shared compute slot.
-
-Consequences:
-
-- GEMM ∥ GEMM not allowed
-- MATH ∥ MATH not allowed
-- GEMM ∥ MATH not allowed
-
-Only one compute operation can run in a PE at a time.
-
-**Compute opcode restriction**
-
-Composite commands contain one compute opcode only.
-
-Examples:
-
-```text
-COMPOSITE_GEMM
-COMPOSITE_MATH
-```
-
-Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0.
-
-**Engine completion signaling**
-
-Every engine emits a completion event when a sub-command finishes.
-Completion events are delivered to PE_SCHEDULER.
-
----
-
-### D5. Dataflow model
-
-Compute operations use a TCM-centric dataflow model.
-
-**Input path (HBM)**
-
-```text
-HBM → NOC → PE_DMA (DMA_READ) → PE_TCM
-```
-
-**Input path (shared SRAM)**
-
-```text
-Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
-```
-
-**Compute stage**
-
-Compute engines read input tensors from PE_TCM.
-
-```text
-PE_TCM → GEMM / MATH
-```
-
-Weights for GEMM may optionally stream directly from HBM (via NOC).
-
-**Output path (HBM)**
-
-Compute results are written to PE_TCM, then DMA writes to HBM.
-
-```text
-PE_TCM → PE_DMA (DMA_WRITE) → NOC → HBM
-```
-
-**Output path (shared SRAM)**
-
-```text
-PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
-```
-
-#### D5.1 PE_TCM partitioning and ownership boundary
-
-The PE_TCM address space is partitioned into two logical regions.
-
-**SchedulerReservedTCM**
-
-- A staging region owned exclusively by PE_SCHEDULER.
-- This region is used for composite command tile buffers.
-- PE_SCHEDULER:
-  - partitions this region into tile buffers
-  - assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
-  - guarantees input/output buffer separation
-  - manages tile buffer lifetime
-
-**AllocatableTCM**
-
-- General-purpose region managed by PEMemAllocator.
-- Used by host or DP-visible allocations.
-
-**Visibility rule (hard isolation)**
-
-- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
-- SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
-- This prevents DP or host allocations from interfering with scheduler staging buffers.
-
-**Tile buffer rules**
-
-Within SchedulerReservedTCM:
-
-- input buffers and output buffers must not overlap
-- PE_SCHEDULER assigns tile buffers for DMA and compute stages
-- tile buffers remain valid until the corresponding DMA_WRITE completes
-- Buffer reuse is allowed only after the tile lifetime finishes.
-
----
-
-### D6. Observability and trace contract
-
-The simulator must emit deterministic trace events.
-
-Required events include:
-
-- `command_submitted`
-- `sub_command_dispatched`
-- `engine_start`
-- `engine_complete`
-- `tile_ready`
-- `command_complete`
-
-Trace ordering must be deterministic for identical inputs.
-
----
-
-### D7. Topology representation
-
-PE internal components are declared in `cube.pe_template`.
-
-The template is instantiated once per PE.
-
-PE instances are derived from `cube.pe_layout`.
-
-External connectivity such as:
-
-- PE_DMA → NOC → HBM (data path)
-- PE_DMA → NOC → shared SRAM, inter-cube UCIe (non-HBM data path)
-- NOC → PE_CPU (command path from M_CPU)
-
-is modeled at the CUBE level (see ADR-0003 D3).
-
----
-
-## Links
-
-- SPEC R3, R4
-- ADR-0003 D4 (PE-level system hierarchy)
-- ADR-0005 View C (PE-level diagram)
-- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
-- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)
diff --git a/docs/adr/ADR-0015-component-port-wire-model.md b/docs/adr/ADR-0015-dev-component-port-wire-model.md
similarity index 86%
rename from docs/adr/ADR-0015-component-port-wire-model.md
rename to docs/adr/ADR-0015-dev-component-port-wire-model.md
index acfbb9c..5f999af 100644
--- a/docs/adr/ADR-0015-component-port-wire-model.md
+++ b/docs/adr/ADR-0015-dev-component-port-wire-model.md
@@ -6,20 +6,19 @@ Accepted
 
 ## Context
 
-ADR-0007 D2 assigns path-walking and low-level request decomposition to the simulation engine.
-In practice, the engine iterates the topology path and calls `run()` on each component
-sequentially — conflating routing policy with component behavior and preventing realistic
-hardware modeling (queues, contention, fan-out).
-
-ADR-0007 D3 already states that components own fan-out and aggregation, but the current
-implementation does not enforce this for fabric traversal.
+Realistic hardware modeling — queues, contention, fan-out — requires
+that components own fabric traversal while the simulation engine
+handles only initialization and completion observation. Direct method
+calls between components, or path-walking inside the engine, defeat
+queueing and contention semantics.
 
 This ADR defines:
 
 - how components communicate via typed port queues,
 - how propagation delay is modeled (wire processes with BW occupancy),
-- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch (via M_CPU),
-- the reduced role of the simulation engine,
+- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch
+  (via M_CPU),
+- the engine's reduced role (wire init + completion observation only),
 - M_CPU.DMA as an internal subcomponent of M_CPU.
 
 ---
@@ -88,9 +87,6 @@ The simulation engine MUST NOT:
 - call component `run()` methods directly,
 - track per-hop latency or decompose fan-out.
 
-This supersedes ADR-0007 D2's "decompose operations into low-level requests" clause.
-ADR-0007 D2 must be amended accordingly.
-
 ---
 
 ### D4. Fabric paths for Memory R/W and Kernel Launch
@@ -192,16 +188,15 @@ It is used for shard comparison in `_route_kernel` and as a regression guard.
 - Propagation delay is modeled accurately per edge.
 - Engine is decoupled from routing policy.
 - Component implementations remain swappable via DI (ADR-0007 D3).
-- ADR-0007 D2 must be amended to remove path-walking from engine responsibilities.
-- ADR-0009 D3 should be updated to reference the unified fabric path (D4 above).
 
 ---
 
 ## Links
 
-- ADR-0007 D2 (to be amended: engine path-walking clause)
-- ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced)
+- ADR-0007 D2 (engine role boundary)
+- ADR-0009 D3 (kernel execution fan-out hierarchy)
 - ADR-0014 D4 (DMA engine capacity=1)
 - ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
 - ADR-0016 (IOChiplet NOC and memory data path)
 - ADR-0017 (cube NOC 2D mesh architecture)
+- ADR-0033 (Latency model assumptions built on these mechanisms)
diff --git a/docs/adr/ADR-0016-iochiplet-noc-and-memory-path.md b/docs/adr/ADR-0016-dev-iochiplet-noc-and-memory-path.md
similarity index 100%
rename from docs/adr/ADR-0016-iochiplet-noc-and-memory-path.md
rename to docs/adr/ADR-0016-dev-iochiplet-noc-and-memory-path.md
diff --git a/docs/adr/ADR-0017-cube-noc-2d-mesh.md b/docs/adr/ADR-0017-cube-noc-2d-mesh.md
deleted file mode 100644
index c43c841..0000000
--- a/docs/adr/ADR-0017-cube-noc-2d-mesh.md
+++ /dev/null
@@ -1,189 +0,0 @@
-# ADR-0017: Cube NOC 2D Mesh Architecture
-
-## Status
-
-Accepted
-
-## Context
-
-ADR-0003 D3 defines the cube-level NOC as a "distributed on-die fabric" but
-does not specify the internal routing model, contention semantics, or
-attachment topology. The implementation uses a 2D mesh router grid with
-XY routing and per-segment contention modeling. This ADR formalizes that
-architecture.
-
-## Decision
-
-### D1. NOC node and router grid
-
-Each cube contains a 2D router mesh generated by `mesh_gen.py`.
-Each router is a separate topology node (`sip{S}.cube{C}.r{row}c{col}`)
-implemented as `forwarding_v1`. (Supersedes the original single-node
-`noc_2d_mesh_v1` design — see ADR-0019.)
-
-Grid properties:
-
-- Default dimensions: 6x6 routers (derived from PE layout + UCIe connections)
-- Router naming: `r{row}c{col}` (e.g., `r0c0`, `r5c5`)
-- HBM exclusion zone: center rows/columns are excluded where HBM physically
-  occupies space (e.g., r2c2, r2c3, r3c2, r3c3)
-- Router positions are derived from physical PE corner placement and cube
-  geometry
-
-The NOC overhead_ns is 0.0. Latency is modeled by Manhattan distance
-traversal within the mesh (distance_mm x ns_per_mm).
-
-### D2. XY routing algorithm
-
-The NOC uses deterministic XY routing:
-
-1. Horizontal segment: route from source X to destination X at source Y
-2. Vertical segment: route from destination X at source Y to destination Y
-
-Each directed segment is identified by a unique link key:
-
-- Horizontal: `("H", y_band, x_min, x_max, direction)`
-- Vertical: `("V", x_band, y_min, y_max, direction)`
-
-Grid positions are snapped to the router grid, excluding the HBM zone.
-
-### D3. Contention model
-
-Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
-sharing a segment (same row or column band, same direction) contend for the
-resource. This models link-level serialization in a wormhole-routed mesh.
-
-With no contention, NOC traversal latency equals the Manhattan distance
-multiplied by `ns_per_mm`. Under contention, additional queueing delay
-is added by SimPy's resource scheduling.
-
-### D4. NOC attachment points
-
-The NOC connects to all major cube-level components:
-
-```text
-                    UCIe-N (conn x4)
-                         |
-           +---------+---+---+---------+
-           |         |       |         |
-PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
-PE0.cpu <--+         |       |         +--< PE2.cpu
-           |         |       |         |
-UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
-(conn x4)  |         | zone  |         |  (conn x4)
-           |  r2c0   |       |         |
-M_CPU <--->+         |       |         |
-           |  r3c0   |       |         |
-SRAM <---->+         |       |         |
-           |         |       |         |
-PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
-PE4.cpu <--+         |       |         +--< PE6.cpu
-           |         |       |         |
-           +---------+---+---+---------+
-                         |
-                    UCIe-S (conn x4)
-
-HBM attach: hbm_ctrl is also attached to each router that hosts a PE (ADR-0019 D1)
-(xbar_top/xbar_bot were removed by ADR-0019)
-```
-
-### D5. NOC edge bandwidths and distances
-
-| Connection | BW (GB/s) | Distance | Notes |
-| --- | --- | --- | --- |
-| PE_DMA -> NOC | 256.0 | Physical (PE pos) | Matches HBM slice BW |
-| NOC -> PE_CPU | - | 0.0 mm | Command path only |
-| Router <-> HBM_CTRL | 256.0 | 0.0 mm | Per PE router (ADR-0019) |
-| NOC <-> M_CPU | - | 0.0 mm | Command path |
-| NOC <-> SRAM | 128.0 x4 | 0.0 mm | 512 GB/s aggregate |
-| NOC <-> UCIe conn | 128.0 | 0.0 mm | Per connection, 4 per port |
-
-Distance 0.0 mm for most connections reflects the distributed nature of
-the NOC; the actual traversal distance is computed internally via Manhattan
-distance within the router grid.
-
-### D6. UCIe decomposition and inter-cube traffic
-
-Each cube has 4 UCIe ports (N, S, E, W). Each port is decomposed into:
-
-- 1 `ucie-{PORT}` node: UCIe protocol endpoint (overhead = 8.0 ns)
-- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe
-
-This decomposition enables N=4 independent NOC-to-UCIe connections per port,
-each with 128 GB/s bandwidth. Total aggregate per port: 512 GB/s.
-
-Inter-cube traffic path:
-
-```text
-Source: PE_DMA -> NOC -> conn{i} -> ucie-{PORT}
-                    [UCIe link: 512 GB/s, 1.0mm seam distance]
-Target: ucie-{PORT} -> conn{i} -> r{x}c{y} -> (mesh hops) -> hbm_ctrl
-```
-
-UCIe overhead (8.0 ns) is applied at each ucie-{PORT} node, so a
-full crossing incurs 16 ns (TX port + RX port).
-
-### D7. Data paths through the NOC
-
-**PE DMA to local HBM (same half):**
-
-```text
-PE_DMA -> r{x}c{y} -> hbm_ctrl  (local: 0 mesh hops, switching overhead only)
-```
-
-**PE DMA to remote PE's HBM:**
-
-```text
-PE_DMA -> r{x}c{y} -> (mesh hops) -> r{x'}c{y'} -> hbm_ctrl
-```
-
-**PE DMA to remote cube HBM:**
-
-```text
-PE_DMA -> r{x}c{y} -> conn -> ucie-E -> [seam] -> ucie-W -> conn -> r{x'}c{y'} -> hbm_ctrl
-```
-
-**Kernel Launch command to PE:**
-
-```text
-[from io_noc] -> ucie -> conn -> r{x}c{y} -> (mesh hops) -> M_CPU -> (mesh hops) -> PE_CPU
-```
-
-**Shared SRAM access:**
-
-```text
-PE_DMA -> r{x}c{y} -> (mesh hops) -> SRAM
-```
-
-### D8. Mesh generation
-
-The router grid is generated by `mesh_gen.py` based on:
-
-- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner
-- `cube.geometry`: cube physical dimensions and HBM zone
-- `cube.ucie.n_connections`: determines router count for UCIe attachment
-
-The generator produces a `mesh_data` dictionary containing:
-
-- Router grid with positions and HBM exclusion zones
-- PE-to-router attachments (pe_dma, pe_cpu per PE)
-- UCIe-to-router attachments (N/S/E/W, distributed across edge routers)
-- M_CPU and SRAM router attachments
-- HBM attachment per PE router (ADR-0019)
-
-## Consequences
-
-- NOC provides position-aware routing with deterministic latency
-- Contention is captured per directed segment (not per-node)
-- All cube-internal traffic is explicitly routed through the NOC
-- HBM exclusion zone reflects physical die layout constraints
-- The mesh generation is fully parameterized by `topology.yaml`
-
-## Links
-
-- ADR-0003 D3 (cube-level NOC definition — extended by this ADR)
-- ADR-0004 D1 (PE DMA to local HBM path via router mesh)
-- ADR-0014 D1 (PE_DMA egress via router mesh)
-- ADR-0019 (NOC-Local HBM: xbar/bridge removal, explicit router mesh)
-- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
-- ADR-0016 D1 (IOChiplet io_noc — analogous pattern at IO chiplet level)
diff --git a/docs/adr/ADR-0017-dev-cube-noc-and-hbm-connectivity.md b/docs/adr/ADR-0017-dev-cube-noc-and-hbm-connectivity.md
new file mode 100644
index 0000000..c442dde
--- /dev/null
+++ b/docs/adr/ADR-0017-dev-cube-noc-and-hbm-connectivity.md
@@ -0,0 +1,291 @@
+# ADR-0017: Cube NOC and HBM Connectivity
+
+## Status
+
+Accepted
+
+## Context
+
+The CUBE-level NOC is a 2D router mesh that carries every intra-cube
+request: PE-to-HBM data, PE-to-PE traffic, command paths
+(M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic.
+
+The CUBE's HBM is exposed through per-PE controller endpoints attached
+to PE routers. This per-PE partitioning makes local-vs-remote HBM
+distinguishable by mesh distance: a PE's own HBM partition sits at its
+own router (switching overhead only); another PE's HBM partition is
+reachable by mesh hops to that PE's router.
+
+Two channel-mapping modes are supported in the design space:
+
+- **n:1 (default, implemented)** — each PE's HBM partition aggregates
+  N = `hbm_channels_per_pe` pseudo-channels into one endpoint. Effective
+  per-PE BW = N × per-channel BW.
+- **1:1 (future)** — each PE router decomposes into per-channel
+  mini-routers; per-channel BW contention is modeled directly.
+
+In both modes the per-PE effective BW is identical; only the connectivity
+granularity differs.
+
+## Decision
+
+### D1. 2D router mesh
+
+Each cube contains a 2D mesh of NOC routers generated by `mesh_gen.py`.
+
+- Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`).
+- Implementation: `forwarding_v1`. NOC `overhead_ns = 0`.
+- Default 6×6 grid (sized from PE corner placement + UCIe attachment
+  count); larger PE counts scale the grid up.
+- HBM exclusion zone: center rows/columns are excluded where HBM die
+  physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6).
+- Latency = Manhattan distance × `ns_per_mm`.
+
+### D2. XY routing algorithm
+
+Deterministic XY routing:
+
+1. Horizontal segment: route from source X to destination X at source Y.
+2. Vertical segment: route from destination X at source Y to destination Y.
+
+Each directed segment carries a unique key:
+
+- Horizontal: `("H", y_band, x_min, x_max, direction)`
+- Vertical:   `("V", x_band, y_min, y_max, direction)`
+
+Grid positions are snapped to the router grid, excluding the HBM zone.
+
+### D3. Per-segment contention model
+
+Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
+sharing a segment (same row or column band, same direction) contend for
+the resource, which models link-level serialization in a wormhole-routed
+mesh.
+
+With no contention, NOC traversal latency equals Manhattan distance ×
+`ns_per_mm`. Under contention, SimPy's resource scheduling adds queueing
+delay.
+
+### D4. NOC attachment points (per-PE HBM partition)
+
+Every PE router carries three attachments: `pe{idx}.dma`, `pe{idx}.cpu`,
+and `pe{idx}.hbm`. The last is the per-PE HBM controller endpoint —
+`sip{S}.cube{C}.hbm_ctrl.pe{idx}` — which owns one slice of the cube's
+HBM (one pseudo-channel group; see D8).
+
+Other attachments:
+
+- M_CPU and shared SRAM each occupy a dedicated edge router.
+- UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed
+  along that edge (see D6).
+
+```text
+                    UCIe-N (conn x4)
+                         |
+           +---------+---+---+---------+
+           |         |       |         |
+PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
+PE0.cpu <--+ +hbm.pe0|       | +hbm.pe2+--< PE2.cpu
+           |         |       |         |
+UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
+(conn x4)  |         | zone  |         |  (conn x4)
+           |  r2c0   |       |         |
+M_CPU <--->+         |       |         |
+           |  r3c0   |       |         |
+SRAM <---->+         |       |         |
+           |         |       |         |
+PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
+PE4.cpu <--+ +hbm.pe4|       | +hbm.pe6+--< PE6.cpu
+           |         |       |         |
+           +---------+---+---+---------+
+                         |
+                    UCIe-S (conn x4)
+```
+
+Per-PE HBM partitioning is the key invariant that makes local vs
+cross-PE HBM distinguishable by mesh distance (see D7).
+
+### D5. NOC edge bandwidths and distances
+
+| Connection                    | BW (GB/s)  | Distance      | Notes                                       |
+| ----------------------------- | ---------- | ------------- | ------------------------------------------- |
+| PE_DMA → NOC                  | 256.0      | Physical (PE) | Matches local-HBM aggregate BW              |
+| NOC → PE_CPU                  | —          | 0.0 mm        | Command path only                           |
+| Router ↔ hbm_ctrl.pe{idx}     | 256.0      | 0.0 mm        | Per PE router; N × per-channel BW (see D8)  |
+| NOC ↔ M_CPU                   | —          | 0.0 mm        | Command path                                |
+| NOC ↔ SRAM                    | 128.0 × 4  | 0.0 mm        | 512 GB/s aggregate                          |
+| NOC ↔ UCIe conn               | 128.0      | 0.0 mm        | Per connection; 4 conn per port             |
+
+`0.0 mm` distances reflect the distributed nature of the NOC; actual
+traversal distance is computed via Manhattan distance within the router
+grid.
+
+### D6. UCIe decomposition and inter-cube traffic
+
+Each of the 4 UCIe ports (N, S, E, W) decomposes into:
+
+- 1 `ucie-{PORT}` node: UCIe protocol endpoint (`overhead = 8.0 ns`).
+- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe.
+
+This decomposition gives 4 independent NOC↔UCIe connections per port,
+each with 128 GB/s bandwidth (512 GB/s aggregate per port).
+
+Inter-cube traffic path:
+
+```text
+Source: PE_DMA → NOC → conn{i} → ucie-{PORT}
+                  [UCIe link: 512 GB/s, 1.0mm seam distance]
+Target: ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx}
+```
+
+UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full
+crossing incurs 16 ns (TX port + RX port).
+
+### D7. Data paths through the NOC
+
+All intra-cube traffic uses the same router mesh — no separate fast
+paths.
+
+**Local HBM** (same PE's own partition; 0 mesh hops):
+
+```text
+PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx}   (switching overhead only)
+```
+
+**Cross-PE HBM within cube** (target PE's partition, reached by mesh):
+
+```text
+PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'}
+```
+
+Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 on `r1c4`):
+
+```text
+PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2
+```
+
+Dijkstra computes the shortest path within the mesh.
+
+**Cross-cube HBM** (UCIe traversal):
+
+```text
+PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn
+       → r{x'}c{y'} → hbm_ctrl.pe{idx'}
+```
+
+**Kernel launch command to PE**:
+
+```text
+[from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU
+```
+
+**Shared SRAM access**:
+
+```text
+PE_DMA → r{x}c{y} → (mesh) → SRAM
+```
+
+### D8. HBM channel mapping mode
+
+Channel mapping is configured at cube scope:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one       # one_to_one | n_to_one
+    hbm_pseudo_channels: 64          # total pseudo-channel count
+    hbm_channels_per_pe: 8           # per-PE local channel count
+    hbm_channel_bw_gbs: 32.0         # per-channel bandwidth (GB/s)
+    hbm_slices_per_cube: 8           # number of per-PE partitions
+    hbm_total_gb_per_cube: 48
+```
+
+**n:1 mode (default, implemented).** Each PE's HBM partition is a single
+endpoint `hbm_ctrl.pe{idx}` that aggregates `hbm_channels_per_pe`
+pseudo-channels. The `Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals
+`hbm_channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to
+interleave; only aggregate per-PE BW is modeled. No separate aggregated
+router node exists; the per-PE router itself serves that role.
+
+**1:1 mode (future).** Each PE router decomposes into N channel
+mini-routers; per-channel routing carries fully-resolved PA + channel ID.
+A `ChannelSplitter` resolves a logical access to N per-channel physical
+requests. Each per-channel link models BW contention. Cross-PE channel
+access semantics are deferred to the implementation ADR.
+
+**BW math (defaults).**
+
+| Parameter                          | Value                      |
+| ---------------------------------- | -------------------------- |
+| pseudo channels per cube           | 64 (parameter)             |
+| PEs per cube                       | 8 (parameter)              |
+| channels per PE (N)                | 64 / 8 = 8                 |
+| per-channel BW                     | 32 GB/s (parameter)        |
+| per-PE local BW                    | N × 32 = 256 GB/s          |
+| cube total HBM BW                  | 64 × 32 = 2048 GB/s        |
+
+Both modes give the same per-PE effective BW; only the request shape and
+contention model differ.
+
+### D9. AddressResolver — per-PE HBM endpoint
+
+The address resolver decodes a PA's HBM offset to the owning PE's
+partition:
+
+```python
+# policy/routing/router.py
+hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube
+
+if addr.kind == "hbm":
+    pe_id = int(addr.hbm_offset) // hbm_slice_bytes
+    return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
+```
+
+The pe_id computation is intrinsic to the routing layer (not a
+topology-time concern). Any HBM PA falls within exactly one partition,
+yielding deterministic routing.
+
+External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the
+same resolver path — there is no separate fast path.
+
+### D10. Mesh generation parameters
+
+`mesh_gen.py` produces `cube_mesh.yaml` from:
+
+- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner.
+- `cube.geometry`: cube physical dimensions and HBM zone.
+- `cube.ucie.n_connections`: determines router count for UCIe attachment.
+
+The output `mesh_data` dictionary contains:
+
+- Router grid with positions and HBM exclusion zones.
+- PE-to-router attachments (`pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm`
+  per PE).
+- UCIe-to-router attachments (N/S/E/W distributed across edge routers).
+- M_CPU and SRAM router attachments.
+
+## Consequences
+
+- Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM
+  (mesh hops) are naturally distinguishable, satisfying SPEC R5
+  (multi-domain communication) and ADR-0002 (no zero-latency end-to-end
+  paths).
+- All cube-internal traffic routes through one mesh — single contention
+  model, single layout, single set of edge BWs.
+- Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each
+  PE's partition is the n:1 aggregate of its assigned pseudo-channels.
+- 1:1 mode extension is structurally natural — split each PE router into
+  N channel routers.
+- Mesh generation is fully parameterized by `topology.yaml`; PE/cube
+  geometry changes propagate without code edits.
+
+## Links
+
+- ADR-0002 (Routing distance, ordering, no zero-latency paths)
+- ADR-0003 D3 (cube-level NOC definition — extended here)
+- ADR-0004 (Memory semantics, local HBM)
+- ADR-0011 (Memory addressing — LA model consumes per-PE partition)
+- ADR-0014 D1 (PE_DMA egress via router mesh)
+- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
+- ADR-0016 (IOChiplet io_noc — analogous pattern at IO chiplet level)
+- ADR-0033 (Latency model: per-PC parallelism, switch penalty)
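Reviewer sketch, not part of the patch: the D8 bandwidth arithmetic, the D9 partition lookup, and the D7 cross-PE example in the ADR-0017 text added above can be checked together in a few lines of Python. The `HbmConfig` class and the `resolve_hbm_endpoint` and `xy_route` helpers are hypothetical names introduced here for illustration only. They are not the contents of `policy/routing/router.py` or `mesh_gen.py`, and `xy_route` deliberately ignores the HBM exclusion zone and grid snapping.

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass(frozen=True)
class HbmConfig:
    """Hypothetical holder for the D8 `cube.memory_map` defaults."""

    hbm_pseudo_channels: int = 64
    hbm_channels_per_pe: int = 8
    hbm_channel_bw_gbs: float = 32.0
    hbm_slices_per_cube: int = 8
    hbm_total_gb_per_cube: int = 48

    @property
    def slice_bytes(self) -> int:
        # Same arithmetic as the D9 snippet: equal-sized per-PE partitions.
        return self.hbm_total_gb_per_cube * GIB // self.hbm_slices_per_cube

    @property
    def per_pe_bw_gbs(self) -> float:
        # n:1 mode: one router-to-hbm_ctrl link carrying N x per-channel BW.
        return self.hbm_channels_per_pe * self.hbm_channel_bw_gbs

    @property
    def cube_bw_gbs(self) -> float:
        # Cube total: all pseudo-channels x per-channel BW.
        return self.hbm_pseudo_channels * self.hbm_channel_bw_gbs


def resolve_hbm_endpoint(cfg: HbmConfig, sip: int, cube: int, hbm_offset: int) -> str:
    """Map an HBM offset to the owning per-PE endpoint name (D9 naming)."""
    total_bytes = cfg.hbm_total_gb_per_cube * GIB
    if not 0 <= hbm_offset < total_bytes:
        raise ValueError(f"hbm_offset {hbm_offset} is outside the cube HBM range")
    pe_id = hbm_offset // cfg.slice_bytes
    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"


def xy_route(src: tuple, dst: tuple) -> list:
    """Routers visited by XY routing between (row, col) positions.

    Assumption: a plain rectangular grid; the HBM exclusion zone is ignored.
    """
    (row0, col0), (row1, col1) = src, dst
    path = [f"r{row0}c{col0}"]
    # Horizontal segment first: move along the source row to the target column.
    col_step = 1 if col1 >= col0 else -1
    for col in range(col0 + col_step, col1 + col_step, col_step):
        path.append(f"r{row0}c{col}")
    # Vertical segment second: move along the target column to the target row.
    row_step = 1 if row1 >= row0 else -1
    for row in range(row0 + row_step, row1 + row_step, row_step):
        path.append(f"r{row}c{col1}")
    return path
```

With the D8 defaults, `HbmConfig().per_pe_bw_gbs` evaluates to 256.0, and `xy_route((0, 0), (1, 4))` visits the same routers as the PE0-to-PE2 example in D7. The sketch assumes `hbm_total_gb_per_cube` divides evenly by `hbm_slices_per_cube`, as it does for 48 GB and 8 slices.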
diff --git a/docs/adr/ADR-0019-NOC-Local HBM.en.md b/docs/adr/ADR-0019-NOC-Local HBM.en.md
deleted file mode 100644
index 815a70b..0000000
--- a/docs/adr/ADR-0019-NOC-Local HBM.en.md	
+++ /dev/null
@@ -1,305 +0,0 @@
-# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC
-
-## Status
-
-Accepted
-
-## Context
-
-The CUBE-internal NOC must connect each PE to HBM. KernBench needs
-to evaluate two connectivity models:
-
-- **1:1 mode** — PE_DMA connects to N separate per-channel routers,
-  each with its own link to hbm_ctrl. Models per-channel BW
-  contention precisely.
-  N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`).
-- **n:1 mode** — PE_DMA connects to a single aggregated router with
-  one link to hbm_ctrl. Channels are treated as interleaved; only
-  aggregate BW is modeled.
-
-Effective PE-local BW is identical under both modes
-(= N × per-channel BW); only the connectivity granularity differs.
-
----
-
-## Decision
-
-### D1. HBM Attaches to PE Routers
-
-Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**,
-and attach the HBM access point to the same router where the PE is attached.
-
-- n:1 mode: PE's local HBM access goes directly from its own router (switching overhead only, 0 hops)
-- Remote PE's HBM access: reaches the target PE's router via mesh hops
-- The read/write resource model within the HBM controller is preserved
-
-Node naming changes:
-
-| Current | After Change |
-| ---- | ------- |
-| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |
-
-In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that
-the builder generates an edge between that router and hbm_ctrl.
-
----
-
-### D2. Complete Removal of xbar, bridge, and Single NOC Node
-
-Remove all of the following nodes and related edges:
-
-- `{cube}.xbar_top`, `{cube}.xbar_bot`
-- `{cube}.bridge.left`, `{cube}.bridge.right`
-- `{cube}.noc` (single TwoDMeshNocComponent node)
-- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
-- Edges of type `xbar_to_bridge`, `bridge_to_xbar`
-- Edges of type `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`, etc. referencing the single noc node
-
-Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**.
-Each router (r0c0, r0c1, ...) from the 6x6 router grid generated by `mesh_gen.py`
-is created as a separate SimPy node in the topology graph,
-and adjacent routers are connected via XY mesh edges.
-
----
-
-### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)
-
-#### Router Nodes Based on cube_mesh.yaml
-
-Each non-null router from cube_mesh.yaml generated by `mesh_gen.py`
-is created as a **separate SimPy node** in the topology graph.
-
-- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
-- kind: `noc_router`, impl: `forwarding_v1`
-- pos_mm: taken from cube_mesh.yaml
-
-Based on the attach information in cube_mesh.yaml, components are connected to each router:
-- `pe{p}.dma` → PE_DMA ↔ router edge
-- `pe{p}.cpu` → PE_CPU ↔ router edge
-- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
-- `m_cpu` → M_CPU ↔ router edge
-- `sram` → SRAM ↔ router edge
-- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge
-
-Router-to-router XY mesh edges: bidirectional edges between adjacent routers.
-Null routers (HBM exclusion zones) are skipped.
-
-#### 1:1 Mode Extension (To Be Implemented Later)
-
-In 1:1 mode, each router differentiates into N channel mini-routers.
-Per-channel routing and ChannelSplitter (LA → per-channel PA) introduction are required.
-N GEMM engines per PE are also added at this point.
-
----
-
-### D4. Cross-PE HBM Access (n:1 Mode)
-
-In n:1 mode, when a PE accesses another PE's local HBM,
-it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.
-
-Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:
-
-```text
-PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
-```
-
-The Dijkstra router finds the shortest path in the mesh.
-
-Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3.
-
----
-
-### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh
-
-In n:1 mode, no separate "aggregated router" is created.
-The existing router grid from cube_mesh.yaml serves that role.
-
-#### Connection Structure
-
-PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:
-
-```text
-sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
-sip0.cube0.hbm_ctrl   ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
-```
-
-Routers are connected via XY mesh edges. PE's local HBM access goes
-directly from its own router (switching overhead only).
-
-#### n:1 Mode Full Data Paths
-
-**Local HBM (0 hops):**
-```text
-PE0.pe_dma → r0c0 → hbm_ctrl  (switching overhead only)
-```
-
-**Remote HBM (mesh hops):**
-```text
-PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
-```
-
-**M_CPU DMA:**
-```text
-M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
-```
-
----
-
-### D6. All Traffic Is Unified onto the Same Router Mesh
-
-- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
-- Local access does not use a separate fast path (xbar)
-- Cross-cube (remote) access path:
-
-```text
-PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
-  → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
-```
-
-UCIe connections maintain the existing structure,
-but both endpoints become mesh routers instead of xbars.
-
-The number of UCIe lines is determined by BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.
-
----
-
-### D7. AddressResolver Changes
-
-Current `AddressResolver.resolve()`:
-
-```python
-# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
-pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
-return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
-```
-
-After change:
-
-```python
-# Changed: HBM → single endpoint
-return f"sip{s}.cube{c}.hbm_ctrl"
-```
-
-The pe_slice calculation is removed.
-In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.
-
-resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility.
-
----
-
-### D8. topology.yaml Configuration Changes
-
-#### Added Settings
-
-```yaml
-cube:
-  memory_map:
-    hbm_mapping_mode: n_to_one          # one_to_one | n_to_one
-    hbm_pseudo_channels: 64             # total pseudo channel count
-    hbm_channels_per_pe: 8              # local channels per PE (= pseudo_channels / pes_per_cube)
-    hbm_channel_bw_gbs: 32.0            # per-channel bandwidth (GB/s)
-    hbm_total_gb_per_cube: 48           # retained
-```
-
-#### Removed Settings
-
-```yaml
-# To be removed
-links:
-  xbar_to_hbm_bw_gbs: 256.0            # → replaced by channel_bw_gbs × channels_per_pe
-  xbar_to_hbm_mm: 2.5                  # → replaced by ch_router_to_hbm_mm
-  xbar_to_bridge_bw_gbs: 128.0         # → removed (no bridge)
-  xbar_to_bridge_mm: 3.0               # → removed
-  noc_to_xbar_bw_gbs: ...              # → removed
-  noc_to_xbar_mm: ...                  # → removed
-```
-
-#### Added Link Settings
-
-```yaml
-links:
-  router_link_bw_gbs: 256.0            # XY mesh link BW between routers
-  router_overhead_ns: 2.0              # router switching overhead
-  pe_to_router_bw_gbs: 256.0           # PE_DMA ↔ router
-  hbm_to_router_bw_gbs: 256.0          # HBM ↔ router (= N × channel_bw)
-```
-
----
-
-### D9. Bandwidth Numerical Consistency
-
-| Configuration | Value |
-| ---- | --- |
-| pseudo channels per cube | 64 (parameter) |
-| PEs per cube | 8 (parameter) |
-| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
-| per-channel BW | 32 GB/s (parameter) |
-| per-PE local BW | N × 32 = 256 GB/s |
-| cube total HBM BW | 64 × 32 = 2048 GB/s |
-
-The effective BW per PE is identical in both modes:
-
-- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
-- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s
-
----
-
-## Consequences
-
-### Positive
-
-- The router mesh based on cube_mesh.yaml accurately reflects physical placement
-- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
-- Local / remote / command traffic is unified onto the same mesh, resulting in simplicity
-- Aligns well with graph compiler-based topology generation
-- Channel count and PE count are both parameterized, enabling testing of various configurations
-- 1:1 mode extension naturally follows through router differentiation
-
-### Negative
-
-- The number of SimPy nodes increases due to explicit router nodes (6x6 = up to 32 routers/cube)
-- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model
-
----
-
-## Alternatives
-
-### A1. Retain Existing xbar + HBM Slices
-
-- Local/remote paths remain bifurcated
-- Cannot model at pseudo-channel granularity
-- Cannot switch between 1:1/n:1 modes
-
-### A2. Always Generate Per-Channel Links and Aggregate Only in n:1
-
-- Topology structure always has 1:1 size
-- Expressing n:1 semantics via link aggregation is complex
-- No reduction in router node count
-
-### A3. Gradual Transition (Retain xbar + Add NOC Path)
-
-- Higher compatibility, but dual-path coexistence increases complexity
-- Since xbar removal is ultimately necessary, the intermediate step provides little value
-
----
-
-## Test Requirements
-
-- Verify that requests are delivered via per-channel links in 1:1 mode
-- Verify that requests are delivered via the aggregated link in n:1 mode
-- Verify that topology is correctly generated in both modes:
-  - 1:1: `total_ch` channel routers + per-PE links + horizontal links
-  - n:1: `pes_per_cube` aggregated routers + per-PE links
-- Verify that effective BW is consistent across both modes for the same workload
-- Verify that horizontal line routing works for cross-PE access
-- Verify that routing through UCIe works for cross-cube access
-- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)
-
----
-
-## Links
-
-- ADR-0011 (LA model) → addressing-side integration
-- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
-- ADR-0004 (Memory Semantics) → BW model redefinition
-- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
diff --git a/docs/adr/ADR-0019-NOC-Local HBM.md b/docs/adr/ADR-0019-NOC-Local HBM.md
deleted file mode 100644
index d1aed2b..0000000
--- a/docs/adr/ADR-0019-NOC-Local HBM.md	
+++ /dev/null
@@ -1,305 +0,0 @@
-# ADR-0019: Per-Channel and Aggregated HBM Connectivity Model within the CUBE NOC
-
-## Status
-
-Accepted
-
-## Context
-
-The NOC inside a CUBE must connect each PE to HBM. KernBench must be able to
-compare and evaluate two connectivity models.
-
-- **1:1 mode**: PE_DMA connects to each of N per-channel routers over a separate
-  link, and each router has its own channel link to hbm_ctrl.
-  Models per-channel BW contention accurately.
-  N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`).
-- **n:1 mode**: PE_DMA connects to hbm_ctrl over a single link through one
-  aggregated router. Channels are assumed to be interleaved, and only the
-  aggregate BW is modeled.
-
-Effective BW per PE is the same in both modes (= N × per-channel BW);
-only the connectivity granularity differs.
-
----
-
-## Decision
-
-### D1. HBM Attaches to the PE Router
-
-The current `hbm_ctrl.slice{0-7}` (8 nodes) are merged into a **single `hbm_ctrl` node**, and
-an HBM access point is attached to each router that has a PE attached.
-
-- n:1 mode: a PE reaches its local HBM directly from its own router (switching overhead only, 0 hops)
-- Remote PE HBM access: traverses mesh hops to reach the target PE's router
-- The read/write resource model inside the HBM controller is retained
-
-Node naming change:
-
-| Current | After |
-| ---- | ------- |
-| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |
-
-`mesh_gen.py` adds `pe{idx}.hbm` to the PE attachments so that
-the builder creates an edge between that router and hbm_ctrl.
-
----
-
-### D2. Complete Removal of xbar, bridge, and the Single NOC Node
-
-The following existing nodes and their associated edges are all removed:
-
-- `{cube}.xbar_top`, `{cube}.xbar_bot`
-- `{cube}.bridge.left`, `{cube}.bridge.right`
-- `{cube}.noc` (the single TwoDMeshNocComponent node)
-- Edges of the `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, and `hbm_to_xbar` kinds
-- Edges of the `xbar_to_bridge` and `bridge_to_xbar` kinds
-- Edges that reference the single noc node, such as `pe_to_noc`, `noc_to_pe`, and `noc_to_pe_cpu`
-
-Their role is taken over by an **explicit router mesh based on cube_mesh.yaml**.
-Each router (r0c0, r0c1, ...) of the 6×6 router grid that `mesh_gen.py` already generates
-is created as a separate SimPy node in the topology graph, and
-adjacent routers are connected by XY mesh edges.
-
----
-
-### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)
-
-#### Router Nodes Based on cube_mesh.yaml
-
-Each non-null router in the cube_mesh.yaml generated by `mesh_gen.py` is
-created as a **separate SimPy node** in the topology graph.
-
-- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
-- kind: `noc_router`, impl: `forwarding_v1`
-- pos_mm: taken from cube_mesh.yaml
-
-Components are connected to each router according to the attach information in the existing cube_mesh.yaml:
-- `pe{p}.dma` → PE_DMA ↔ router edge
-- `pe{p}.cpu` → PE_CPU ↔ router edge
-- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added for n:1)
-- `m_cpu` → M_CPU ↔ router edge
-- `sram` → SRAM ↔ router edge
-- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge
-
-XY mesh edges between routers: a bidirectional edge between each pair of adjacent routers.
-Null routers (HBM exclusion zone) are skipped.
-
-#### 1:1 Mode Extension (to Be Implemented Later)
-
-In 1:1 mode, each router is differentiated into N channel mini-routers.
-This requires per-channel routing and a ChannelSplitter (LA → per-channel PA).
-N GEMM engines per PE are also added at that point.
-
----
-
-### D4. Cross-PE HBM Access (n:1 Mode)
-
-In n:1 mode, when a PE accesses another PE's local HBM,
-it hops through the XY mesh of cube_mesh.yaml to the target PE's router.
-
-Example: PE0 (r0c0) accesses the HBM of PE2 (r1c4):
-
-```text
-PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
-```
-
-The Dijkstra router searches for the shortest path in the mesh.
-
-Cross-PE channel access in 1:1 mode will be defined together with the 1:1 extension of D3.
-
----
-
-### D5. n:1 Mode: Use the cube_mesh.yaml Router Mesh
-
-In n:1 mode, no separate "aggregated router" is created.
-The router grid of the existing cube_mesh.yaml fills that role.
-
-#### Connection Structure
-
-PE_DMA, PE_CPU, and HBM are all connected to the router that each PE is attached to:
-
-```text
-sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
-sip0.cube0.hbm_ctrl   ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
-```
-
-Routers are connected by XY mesh edges. A PE reaches its local HBM
-directly from its own router (switching overhead only).
-
-#### Full n:1 Mode Data Paths
-
-**local HBM (0 hop):**
-```text
-PE0.pe_dma → r0c0 → hbm_ctrl  (switching overhead only)
-```
-
-**remote HBM (mesh hops):**
-```text
-PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
-```
-
-**M_CPU DMA:**
-```text
-M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
-```
-
----
-
-### D6. All Traffic Is Unified onto the Same Router Mesh
-
-- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
-- Local accesses do not use a separate fast path (xbar) either
-- Cross-cube (remote) access path:
-
-```text
-PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
-  → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
-```
-
-The UCIe connection keeps its existing structure, but
-both endpoints become mesh routers instead of xbars.
-
-The number of UCIe lines is determined by the BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.
-
----
-
-### D7. AddressResolver Changes
-
-Current `AddressResolver.resolve()`:
-
-```python
-# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
-pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
-return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
-```
-
-After the change:
-
-```python
-# Changed: HBM → single endpoint
-return f"sip{s}.cube{c}.hbm_ctrl"
-```
-
-The pe_slice computation is removed.
-In n:1 mode, PE_DMA accesses the hbm_ctrl attached to its own router directly.
-
-resolver.resolve() is retained for external access (M_CPU DMA, etc.) and for backward compatibility.
-
----
-
-### D8. topology.yaml Configuration Changes
-
-#### Added Settings
-
-```yaml
-cube:
-  memory_map:
-    hbm_mapping_mode: n_to_one          # one_to_one | n_to_one
-    hbm_pseudo_channels: 64             # total number of pseudo channels
-    hbm_channels_per_pe: 8              # local channels per PE (= pseudo_channels / pes_per_cube)
-    hbm_channel_bw_gbs: 32.0            # per-channel bandwidth (GB/s)
-    hbm_total_gb_per_cube: 48           # retained
-```
-
-#### Removed Settings
-
-```yaml
-# To be removed
-links:
-  xbar_to_hbm_bw_gbs: 256.0            # → replaced by channel_bw_gbs × channels_per_pe
-  xbar_to_hbm_mm: 2.5                  # → replaced by ch_router_to_hbm_mm
-  xbar_to_bridge_bw_gbs: 128.0         # → removed (no bridge)
-  xbar_to_bridge_mm: 3.0               # → removed
-  noc_to_xbar_bw_gbs: ...              # → removed
-  noc_to_xbar_mm: ...                  # → removed
-```
-
-#### Added Link Settings
-
-```yaml
-links:
-  router_link_bw_gbs: 256.0            # XY mesh link BW between routers
-  router_overhead_ns: 2.0              # router switching overhead
-  pe_to_router_bw_gbs: 256.0           # PE_DMA ↔ router
-  hbm_to_router_bw_gbs: 256.0          # HBM ↔ router (= N × channel_bw)
-```
-
----
-
-### D9. Bandwidth Figure Consistency
-
-| Item | Value |
-| ---- | --- |
-| pseudo channels per cube | 64 (parameter) |
-| PEs per cube | 8 (parameter) |
-| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
-| per-channel BW | 32 GB/s (parameter) |
-| per-PE local BW | N × 32 = 256 GB/s |
-| cube total HBM BW | 64 × 32 = 2048 GB/s |
-
-Effective BW per PE is the same in both modes:
-
-- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
-- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s
-
----
-
-## Consequences
-
-### Positive
-
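-- The cube_mesh.yaml-based router mesh accurately reflects the physical layout
-- n:1 mode keeps the existing VA scheme, so the transition cost is low
-- Local / remote / command traffic is unified onto the same mesh, which keeps the model simple
-- Aligns well with graph compiler-based topology generation
-- Channel count and PE count are both parameterized, enabling testing of various configurations
-- 1:1 mode extension naturally follows through router differentiation
-
-### Negative
-
-- The number of SimPy nodes increases due to explicit router nodes (6×6 grid minus null routers = up to 32 routers/cube)
-- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model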
-
----
-
-## Alternatives
-
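-### A1. Retain Existing xbar + HBM Slices
-
-- Local/remote paths remain bifurcated
-- Cannot model at pseudo-channel granularity
-- Cannot switch between 1:1/n:1 modes
-
-### A2. Always Generate Per-Channel Links and Aggregate Only in n:1
-
-- Topology structure always has 1:1 size
-- Expressing n:1 semantics via link aggregation is complex
-- No reduction in router node count
-
-### A3. Gradual Transition (Retain xbar + Add NOC Path)
-
-- Higher compatibility, but dual-path coexistence increases complexity
-- Since xbar removal is ultimately necessary, the intermediate step provides little value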
-
----
-
-## Test Requirements
-
-- Verify that requests are delivered via per-channel links in 1:1 mode
-- Verify that requests are delivered via the aggregated link in n:1 mode
-- Verify that topology is correctly generated in both modes:
-  - 1:1: `total_ch` channel routers + per-PE links + horizontal links
-  - n:1: `pes_per_cube` aggregated routers + per-PE links
-- Verify that effective BW is consistent across both modes for the same workload
-- Verify that horizontal line routing works for cross-PE access
-- Verify that routing through UCIe works for cross-cube access
-- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)
-
----
-
-## Links
-
-- ADR-0011 (LA model) → addressing-side integration
-- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
-- ADR-0004 (Memory Semantics) → BW model redefinition
-- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
diff --git a/docs/adr/ADR-0020-data-execution-two-pass.en.md b/docs/adr/ADR-0020-prog-data-execution-two-pass.en.md
similarity index 100%
rename from docs/adr/ADR-0020-data-execution-two-pass.en.md
rename to docs/adr/ADR-0020-prog-data-execution-two-pass.en.md
diff --git a/docs/adr/ADR-0020-data-execution-two-pass.md b/docs/adr/ADR-0020-prog-data-execution-two-pass.md
similarity index 100%
rename from docs/adr/ADR-0020-data-execution-two-pass.md
rename to docs/adr/ADR-0020-prog-data-execution-two-pass.md
diff --git a/docs/adr/ADR-0021-pe-pipeline-refactor.en.md b/docs/adr/ADR-0021-pe-pipeline-refactor.en.md
deleted file mode 100644
index 0d5dd6e..0000000
--- a/docs/adr/ADR-0021-pe-pipeline-refactor.en.md
+++ /dev/null
@@ -1,432 +0,0 @@
-# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing
-
-## Status
-
-Accepted
-
-## Context
-
-### Actual Hardware Structure
-
-```
-HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
-```
-
-- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
-- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
-- GEMM/MATH Engine: computation between Register Files (cycle-accurate)
-- Completion signal: PE-internal 1-cycle wire signal (done pin assert)
-
----
-
-## Decision
-
-### D1. Separate Each Block into an Independent Component
-
-The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
-Existing 5 blocks + 1 Fetch/Store Unit = 6 components.
-
-| Component | Role | HW Correspondence |
-|-----------|------|-------------------|
-| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
-| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
-| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
-| PE_GEMM | MAC compute (register only) | MAC Array |
-| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
-| PE_TCM | BW-serialized scratchpad | SRAM Bank |
-
-Each component exists as a topology node and is connected via ports/wires.
-Replacing the `impl` allows changing the timing model of an individual block.
-
-### D2. Token Self-Routing — Scheduler Handles Only Dispatch + Completion
-
-**Components do not pass through the scheduler at every stage.**
-The token carries a plan so that components chain directly to the next stage.
-
-```
-Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
-              ↑ chaining: does not go through scheduler          completion only
-```
-
-This matches the actual HW structure where each block's done signal is directly
-connected to the next block via wire. The scheduler is responsible **only for
-initial dispatch + completion aggregation**.
-
-#### Stage Definition
-
-```python
-class StageType(Enum):
-    DMA_READ = 0
-    FETCH = 1
-    GEMM = 2
-    MATH = 3
-    STORE = 4
-    DMA_WRITE = 5
-```
-
-#### Plan Structure
-
-When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
-The plan defines the **stage sequence** for each tile:
-
-```python
-@dataclass
-class Stage:
-    stage_type: StageType
-    component: str       # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
-    params: dict         # per-stage parameters (dynamic)
-
-@dataclass(frozen=True)
-class TilePlan:
-    tile_id: int
-    stages: tuple[Stage, ...]  # list of stages to execute in order (immutable)
-```
-
-The stage sequence varies depending on the plan:
-
-```python
-# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
-stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)
-
-# GEMM directly from TCM data (skip DMA read):
-stages = (FETCH, GEMM, STORE, DMA_WRITE)
-
-# MATH element-wise:
-stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)
-
-# GEMM + accumulation (intermediate K-tile, skip writeback):
-stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only
-```
-
-**Components do not hardcode the next component.**
-They read the next stage from the token's plan and forward it directly via out_port.
-This is the same pattern as a network packet carrying a routing header.
-
-#### Pipeline Context
-
-```python
-@dataclass
-class PipelineContext:
-    id: str
-    total_tiles: int
-    completed_tiles: int = 0
-    done_event: simpy.Event = None  # succeeds when all tiles are complete
-
-    def complete_tile(self) -> None:
-        self.completed_tiles += 1
-        if self.completed_tiles == self.total_tiles:
-            self.done_event.succeed()
-```
-
-**Completion follows an exactly-once contract**: the last stage of each tile must call
-`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
-succeed only once (SimPy Event constraint).
-
-#### Scheduler Role (Reduced)
-
-When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext,
-enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately.
-
-Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
-This feeder consumes `_pending_feeds` in FIFO order and
-**does not allow tile feed interleaving across composite commands.**
-That is, the feed for the next command begins only after all tiles of the current
-command have been injected into the first stage queue.
-
-There is **exactly one `_feed_loop`** per scheduler, and
-tile feed for composite commands is performed exclusively through this single process.
-Command issue order refers to **the order in which PE_SCHEDULER receives PeInternalTxn**.
-
-This structure maintains command issue order while ensuring that when the first stage
-queue is full, only the feeder process blocks — the scheduler worker's inbox processing
-itself does not stall.
-
-```python
-class PeSchedulerV2(PeEngineBase):
-    _pipelines: dict[str, PipelineContext]
-    _pending_feeds: simpy.Store   # FIFO of (plan, ctx)
-
-    def start(self, env):
-        super().start(env)
-        self._pending_feeds = simpy.Store(env)
-        env.process(self._feed_loop(env))
-
-    def _dispatch_composite(self, env, pe_txn, cmd):
-        plan = generate_plan(cmd)
-        ctx = PipelineContext(
-            id=next_id(),
-            total_tiles=len(plan.tiles),
-            done_event=pe_txn.done,
-        )
-        self._pipelines[ctx.id] = ctx
-
-        # only enqueue to feeder queue and return immediately
-        yield self._pending_feeds.put((plan, ctx))
-
-    def _feed_loop(self, env):
-        """Single feeder process: feeds composite commands in FIFO order.
-
-        Tile feed interleaving across composite commands is not allowed.
-        The feed for the next command begins only after all tiles of the
-        current command have been injected into the first stage queue.
-
-        When the first stage queue is full, only this feeder blocks;
-        the scheduler worker's inbox processing does not stall.
-        """
-        while True:
-            plan, ctx = yield self._pending_feeds.get()
-            for tile in plan.tiles:
-                token = TileToken(
-                    tile_id=tile.tile_id,
-                    pipeline_ctx=ctx,
-                    plan=tile,
-                    stage_idx=0,
-                    params=tile.stages[0].params,
-                )
-                yield self.out_ports[tile.stages[0].component].put(token)
-                # queue capacity = HW queue depth → feeder blocks only when full
-```
-
-In this ADR, the scheduler can accept multiple composite commands,
-but tile submission order follows per-command FIFO.
-Within a command, tile-level pipeline overlap is allowed,
-but tile feed interleaving across commands is not.
-
-### D3. Data Transfer vs. Completion Signal — HW Modeling Criteria
-
-| Communication Type | Method | HW Correspondence |
-|-------------------|--------|-------------------|
-| Tile token (work directive) | message via out_port | enqueue to command queue |
-| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
-| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |
-
-**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.
-
-**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
-is applied to intra-PE stage triggers. Chaining between components corresponds to
-PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost
-is incurred.
-
-**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
-When all tiles are complete, PipelineContext calls done_event.succeed().
-
-### D4. Asynchronous Pipeline — Natural Overlap
-
-The scheduler processes CompositeCmds **asynchronously**.
-However, tile feed does not spawn an independent process per command; instead,
-the scheduler's internal **single feeder process** performs the feed in FIFO order.
-Therefore, the scheduler can continue to receive the next command,
-but the first-stage tile injection order is guaranteed per command.
-
-Since **SimPy Store capacity = HW queue depth**:
-- When the queue is full, put() naturally blocks (backpressure)
-- While DMA is processing a later tile, downstream stages can already work on a tile whose DMA has completed
-- When a second CompositeCmd arrives, its tiles are fed into the DMA queue right after those of the first
-
-```
-First-stage feed order (feeder → DMA queue):
-  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
-                                            ↑ cmd2 starts after cmd1 feed completes
-
-Runtime pipeline (downstream overlap):
-  PE_DMA:    [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
-  PE_FETCH:          [cmd1:t0][cmd1:t1]...
-  PE_GEMM:                   [cmd1:t0][cmd1:t1]...
-                              ↑ pipeline overlap within the same command
-```
-
-Here, the overlap does not come from tile feed interleaving across different commands,
-but occurs naturally as tiles from earlier commands progress to downstream stages
-while the feeder continues injecting subsequent tiles.
-
-For example, tile feed for cmd2 does not start until all tiles of cmd1 have been
-injected into the first stage queue. However, while cmd1.tile0 has already progressed
-to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so
-**pipeline overlap within the same command occurs naturally**.
-
-#### Component Chaining Pattern
-
-All components follow the same pattern:
-
-```python
-def _pipeline_worker(self, env):
-    while True:
-        token = yield self._inbox.get()
-
-        # process own stage
-        yield from self._process(env, token)
-
-        # chain to next stage (read from plan)
-        next_idx = token.stage_idx + 1
-        if next_idx < len(token.plan.stages):
-            next_stage = token.plan.stages[next_idx]
-            token.stage_idx = next_idx
-            token.params = next_stage.params
-            yield self.out_ports[next_stage.component].put(token)
-        else:
-            # last stage — pipeline completion
-            token.pipeline_ctx.complete_tile()
-```
-
-### D5. PE_FETCH_STORE — Dedicated TCM ↔ Register File Transfer
-
-Previously, GemmBlock and MathBlock each implemented their own TCM read/write.
-This is separated into a **PE_FETCH_STORE component**.
-
-```python
-# PE_FETCH_STORE._process()
-def _process(self, env, token):
-    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
-    yield tcm_done
-    # chaining is handled by the base class (D4 pattern)
-```
-
-Advantages:
-- GEMM/MATH perform **pure compute only** — no TCM access logic
-- Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
-- Prefetch strategies can be experimented with by replacing the fetch unit alone
-
-### D6. Simplification of Each Compute Component
-
-GEMM/MATH perform compute only with register data already prepared.
-**Chaining follows the common pattern (D4), so only _process() needs to be implemented:**
-
-```python
-# PE_GEMM._process()
-def _process(self, env, token):
-    yield env.timeout(self._mac_latency(token.params))
-
-# PE_MATH._process()
-def _process(self, env, token):
-    yield env.timeout(self._simd_latency(token.params))
-
-# PE_FETCH_STORE._process()
-def _process(self, env, token):
-    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
-    yield tcm_done
-
-# PE_DMA._process()
-def _process(self, env, token):
-    yield from self._do_fabric_dma(token.params)
-```
-
-By replacing only the timing model, one can freely switch between cycle-accurate
-and analytical models. Since the chaining logic resides in the base class,
-each component only implements its pure stage logic.
-
-### D7. Topology Changes
-
-Add PE_FETCH_STORE to the PE template:
-
-```yaml
-pe_template:
-  components:
-    pe_cpu:         { kind: pe_cpu,         impl: pe_cpu_v1, ... }
-    pe_scheduler:   { kind: pe_scheduler,   impl: pe_scheduler_v2, ... }
-    pe_dma:         { kind: pe_dma,         impl: pe_dma_v1, ... }
-    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
-    pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1, ... }
-    pe_math:        { kind: pe_math,        impl: pe_math_v1, ... }
-    pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1, ... }
-    pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1, ... }
-  links:
-    # existing links...
-    fetch_store_to_tcm_bw_gbs: 512.0
-    fetch_store_to_tcm_mm: 0.0
-```
-
-PE internal edge connections:
-```
-PE_SCHEDULER → PE_DMA (initial dispatch)
-PE_SCHEDULER → PE_FETCH_STORE (initial dispatch)
-PE_SCHEDULER → PE_GEMM (initial dispatch)
-PE_SCHEDULER → PE_MATH (initial dispatch)
-PE_DMA → PE_FETCH_STORE (chaining)
-PE_FETCH_STORE → PE_GEMM (chaining)
-PE_FETCH_STORE → PE_MATH (chaining)
-PE_GEMM → PE_FETCH_STORE (store chaining)
-PE_MATH → PE_FETCH_STORE (store chaining)
-PE_FETCH_STORE → PE_DMA (writeback chaining)
-PE_FETCH_STORE → PE_TCM (BW request)
-```
-
-Topology edges cover both **control/dispatch visibility and runtime chaining**.
-Scheduler → sub-component edges are initial dispatch paths, while
-inter-component edges are runtime chaining paths driven by token self-routing.
-
-### D9. TileToken Message Definition
-
-A message used for passing tile work between components.
-The token carries the plan and stage index, enabling self-routing.
-
-```python
-@dataclass
-class TileToken:
-    tile_id: int
-    pipeline_ctx: PipelineContext    # completion tracking
-    plan: TilePlan                   # full stage sequence for this tile (immutable)
-    stage_idx: int                   # current stage index in plan.stages
-    params: dict                     # current stage parameter cache (canonical: plan.stages[stage_idx].params)
-    data_op: bool = True             # op_log recording target (ADR-0020)
-```
-
-A TileToken is **owned by exactly one component at a time** and
-is never referenced by multiple components simultaneously (single-owner).
-
-Token lifecycle:
-1. Scheduler creates it with stage_idx=0 and puts it to the first stage component
-2. The component executes _process(), increments stage_idx, and puts it to the next component
-3. The last stage component calls pipeline_ctx.complete_tile()
-4. When all tiles are complete, PipelineContext calls done_event.succeed()
-
-Relationship with existing PeInternalTxn:
-- PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged)
-- TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)
-
----
-
-## Non-goals
-
-- **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is not modified
-  (PeInternalTxn-based, ADR-0014 maintained)
-- **Resource contention model across multiple pipelines**: the current scope focuses on
-  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines
-  are future work.
-
-## Open Questions
-
-- **Register File capacity model**: whether to model capacity limits when the fetch unit
-  loads into registers. Capacity is expressed in bytes (register_file_bytes), and
-  the number of tiles that can be held simultaneously is determined by tile size.
-  When capacity is exceeded, fetch stalls, creating natural backpressure.
-- **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite
-  commands. Therefore, overlap arises not from pre-injection across commands, but
-  naturally from pipeline progression of tiles within the same command.
-  If additional prefetch is needed, it should be considered at the level of tile ordering
-  within the same command or fetch/store unit policy, not cross-command injection.
-- **PE_DMA coalescing**: per-tile DMA may cause fragmentation.
-  Direction is to merge/coalesce within DMA without scheduler involvement.
-- **Synchronous execution mode**: this ADR adopts asynchronous pipeline as the
-  default/sole execution model. If a sync mode is needed for debug or validation
-  purposes, it will be considered in a future ADR.
-- **TCM bank conflict across multiple pipelines**: currently based on a single pipeline.
-  Bank conflict modeling when multiple pipelines simultaneously access TCM is future work.
-
----
-
-## Consequences
-
-### Positive
-
-- Each block is an independent component — individually replaceable (ADR-0015 compliant)
-- PE internal structure is visible in the topology
-- Components do not know the next component — plan-based routing provides flexibility
-- Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
-- Improved HW modeling accuracy (done signal = Event, data transfer = message)
-- Fetch/store separation enables accurate TCM BW contention modeling
-
-### Negative
-
-- Increased number of PE internal components (5 → 6) — more topology nodes/edges
-- Component separation makes intra-PE token forwarding more explicit than before
-
diff --git a/docs/adr/ADR-0021-pe-pipeline-refactor.md b/docs/adr/ADR-0021-pe-pipeline-refactor.md
deleted file mode 100644
index 628937b..0000000
--- a/docs/adr/ADR-0021-pe-pipeline-refactor.md
+++ /dev/null
@@ -1,426 +0,0 @@
-# ADR-0021: PE Pipeline Refactoring (Component Separation + Scheduler-Based Routing)
-
-## Status
-
-Accepted
-
-## Context
-
-### Actual Hardware Structure
-
-```
-HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
-```
-
-- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
-- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
-- GEMM/MATH Engine: computation between Register Files (cycle-accurate)
-- Completion signal: PE-internal 1-cycle wire signal (done pin assert)
-
----
-
-## Decision
-
-### D1. Separate Each Block into an Independent Component
-
-The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
-Existing 5 blocks + 1 Fetch/Store Unit = 6 components.
-
-| Component | Role | HW Correspondence |
-|----------|------|---------|
-| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
-| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
-| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
-| PE_GEMM | MAC compute (register only) | MAC Array |
-| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
-| PE_TCM | BW-serialized scratchpad | SRAM Bank |
-
-Each component exists as a topology node and is connected via ports/wires.
-Replacing the `impl` allows changing the timing model of an individual block.
-
-### D2. Token Self-Routing: Scheduler Handles Only Dispatch + Completion
-
-**Components do not pass through the scheduler at every stage.**
-The token carries a plan so that components chain directly to the next stage.
-
-```
-Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
-              ↑ chaining: does not go through scheduler          completion only
-```
-
-This matches the actual HW structure where each block's done signal is directly
-connected to the next block via wire. The scheduler is responsible **only for initial dispatch + completion aggregation**.
-
-#### Stage Definition
-
-```python
-class StageType(Enum):
-    DMA_READ = 0
-    FETCH = 1
-    GEMM = 2
-    MATH = 3
-    STORE = 4
-    DMA_WRITE = 5
-```
-
-#### Plan Structure
-
-When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
-The plan defines the **stage sequence** for each tile:
-
-```python
-@dataclass
-class Stage:
-    stage_type: StageType
-    component: str       # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
-    params: dict         # per-stage parameters (dynamic)
-
-@dataclass(frozen=True)
-class TilePlan:
-    tile_id: int
-    stages: tuple[Stage, ...]  # stages to execute in order (immutable)
-```
-
-The stage sequence varies depending on the plan:
-
-```python
-# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
-stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)
-
-# GEMM directly from TCM data (skip DMA read):
-stages = (FETCH, GEMM, STORE, DMA_WRITE)
-
-# MATH element-wise:
-stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)
-
-# GEMM + accumulation (intermediate K-tile, skip writeback):
-stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only
-```
-
-**Components do not hardcode the next component.**
-They read the next stage from the token's plan and forward it directly via out_port.
-This is the same pattern as a network packet carrying a routing header.
-
-#### Pipeline Context
-
-```python
-@dataclass
-class PipelineContext:
-    id: str
-    total_tiles: int
-    completed_tiles: int = 0
-    done_event: simpy.Event = None  # succeeds when all tiles are complete
-
-    def complete_tile(self) -> None:
-        self.completed_tiles += 1
-        if self.completed_tiles == self.total_tiles:
-            self.done_event.succeed()
-```
-
-**Completion follows an exactly-once contract**: the last stage of each tile must call
-`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
-succeed only once (SimPy Event constraint).
-
-#### Scheduler Role (Reduced)
-
-When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext,
-enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately.
-
-Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
-This feeder consumes `_pending_feeds` in FIFO order and
-**does not allow tile feed interleaving across composite commands.**
-That is, the feed for the next command begins only after all tiles of the current
-command have been injected into the first stage queue.
-
-There is **exactly one `_feed_loop`** per scheduler, and
-tile feed for composite commands is performed exclusively through this single process.
-Command issue order refers to **the order in which PE_SCHEDULER receives PeInternalTxn**.
-
-This structure maintains command issue order while ensuring that when the first stage queue is full,
-only the feeder process blocks and the scheduler worker's inbox processing itself does not stall.
-
-```python
-class PeSchedulerV2(PeEngineBase):
-    _pipelines: dict[str, PipelineContext]
-    _pending_feeds: simpy.Store   # FIFO of (plan, ctx)
-
-    def start(self, env):
-        super().start(env)
-        self._pending_feeds = simpy.Store(env)
-        env.process(self._feed_loop(env))
-
-    def _dispatch_composite(self, env, pe_txn, cmd):
-        plan = generate_plan(cmd)
-        ctx = PipelineContext(
-            id=next_id(),
-            total_tiles=len(plan.tiles),
-            done_event=pe_txn.done,
-        )
-        self._pipelines[ctx.id] = ctx
-
-        # only enqueue to feeder queue and return immediately
-        yield self._pending_feeds.put((plan, ctx))
-
-    def _feed_loop(self, env):
-        """Single feeder process: feeds composite commands in FIFO order.
-
-        Tile feed interleaving across composite commands is not allowed.
-        The feed for the next command begins only after all tiles of the
-        current command have been injected into the first stage queue.
-
-        When the first stage queue is full, only this feeder blocks;
-        the scheduler worker's inbox processing does not stall.
-        """
-        while True:
-            plan, ctx = yield self._pending_feeds.get()
-            for tile in plan.tiles:
-                token = TileToken(
-                    tile_id=tile.tile_id,
-                    pipeline_ctx=ctx,
-                    plan=tile,
-                    stage_idx=0,
-                    params=tile.stages[0].params,
-                )
-                yield self.out_ports[tile.stages[0].component].put(token)
-                # queue capacity = HW queue depth → feeder blocks only when full
-```
-
-In this ADR, the scheduler can accept multiple composite commands,
-but tile submission order follows per-command FIFO.
-Within a command, tile-level pipeline overlap is allowed,
-but tile feed interleaving across commands is not.
-
-### D3. Data Transfer vs. Completion Signal: HW Modeling Criteria
-
-| Communication Type | Method | HW Correspondence |
-|----------|------|---------|
-| Tile token (work directive) | message via out_port | enqueue to command queue |
-| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
-| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |
-
-**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.
-
-**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
-is applied to intra-PE stage triggers. Chaining between components corresponds to PE-internal wires,
-and since there is no scheduler round-trip, no artificial hop cost is incurred.
-
-**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
-When all tiles are complete, PipelineContext calls done_event.succeed().
-
-### D4. 비동기 파이프라인 — 자연스러운 overlap
-
-Scheduler는 CompositeCmd를 **비동기로** 처리한다.
-다만 tile feed는 command마다 독립 process를 만들지 않고,
-scheduler 내부의 **단일 feeder process**가 FIFO 순서로 수행한다.
-따라서 scheduler는 다음 command를 계속 받을 수 있지만,
-첫-stage tile 투입 순서는 command 단위로 보장된다.
-
-**SimPy Store capacity = HW queue depth**이므로:
-- queue가 차면 put()이 자연스럽게 block (backpressure)
-- DMA가 tile 0을 처리하는 동안 GEMM은 이미 완료된 tile의 fetch를 시작
-- 두 번째 CompositeCmd가 들어오면 DMA queue에 바로 이어서 투입
-
-```
-First-stage feed order (feeder → DMA queue):
-  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
-                                            ↑ cmd1 feed 완료 후 cmd2 시작
-
-Runtime pipeline (downstream overlap):
-  PE_DMA:    [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
-  PE_FETCH:          [cmd1:t0][cmd1:t1]...
-  PE_GEMM:                   [cmd1:t0][cmd1:t1]...
-                              ↑ 같은 cmd 내부에서 pipeline overlap
-```
-
-이때 overlap은 서로 다른 command의 tile feed interleaving에서 오는 것이 아니라,
-먼저 투입된 command의 tile들이 downstream stage로 진행되는 동안 feeder가
-다음 tile들을 계속 투입하면서 자연스럽게 발생한다.
-
-예를 들어 cmd1의 모든 tile이 첫 stage queue에 투입되기 전에는
-cmd2의 tile feed는 시작되지 않는다. 그러나 cmd1.tile0이 이미 GEMM으로
-진행한 상태에서 cmd1.tile1, cmd1.tile2가 DMA/FETCH에 남아 있을 수 있으므로,
-**같은 command 내부에서는 pipeline overlap이 자연스럽게 발생**한다.
-
-#### 컴포넌트 체이닝 패턴
-
-모든 컴포넌트가 동일한 패턴을 따른다:
-
-```python
-def _pipeline_worker(self, env):
-    while True:
-        token = yield self._inbox.get()
-
-        # 자기 stage 처리
-        yield from self._process(env, token)
-
-        # 다음 stage로 체이닝 (plan에서 읽음)
-        next_idx = token.stage_idx + 1
-        if next_idx < len(token.plan.stages):
-            next_stage = token.plan.stages[next_idx]
-            token.stage_idx = next_idx
-            token.params = next_stage.params
-            yield self.out_ports[next_stage.component].put(token)
-        else:
-            # 마지막 stage — pipeline completion
-            token.pipeline_ctx.complete_tile()
-```
-
-### D5. PE_FETCH_STORE — TCM ↔ Register File 전담
-
-기존에 GemmBlock과 MathBlock이 각각 TCM read/write를 구현했으나,
-이를 **PE_FETCH_STORE 컴포넌트**로 분리한다.
-
-```python
-# PE_FETCH_STORE._process()
-def _process(self, env, token):
-    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
-    yield tcm_done
-    # 체이닝은 base class가 처리 (D4 패턴)
-```
-
-장점:
-- GEMM/MATH는 **순수 compute만** — TCM 접근 로직 없음
-- fetch/store BW 경합이 자연스럽게 모델링됨 (PE_TCM의 resource로 serialization)
-- prefetch 전략 등 fetch unit 단독 교체로 실험 가능
-
-### D6. 각 Compute 컴포넌트의 단순화
-
-GEMM/MATH는 register 데이터가 이미 준비된 상태에서 compute만 수행.
-**체이닝은 공통 패턴(D4)을 따르므로, _process()만 구현하면 된다:**
-
-```python
-# PE_GEMM._process()
-def _process(self, env, token):
-    yield env.timeout(self._mac_latency(token.params))
-
-# PE_MATH._process()
-def _process(self, env, token):
-    yield env.timeout(self._simd_latency(token.params))
-
-# PE_FETCH_STORE._process()
-def _process(self, env, token):
-    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
-    yield tcm_done
-
-# PE_DMA._process()
-def _process(self, env, token):
-    yield from self._do_fabric_dma(token.params)
-```
-
-타이밍 모델만 교체하면 cycle-accurate든 analytical든 자유롭게 변경 가능.
-체이닝 로직은 base class에 있으므로 각 컴포넌트는 순수 stage 로직만 구현.
-
-### D7. Topology 변경
-
-PE template에 PE_FETCH_STORE 추가:
-
-```yaml
-pe_template:
-  components:
-    pe_cpu:         { kind: pe_cpu,         impl: pe_cpu_v1, ... }
-    pe_scheduler:   { kind: pe_scheduler,   impl: pe_scheduler_v2, ... }
-    pe_dma:         { kind: pe_dma,         impl: pe_dma_v1, ... }
-    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
-    pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1, ... }
-    pe_math:        { kind: pe_math,        impl: pe_math_v1, ... }
-    pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1, ... }
-    pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1, ... }
-  links:
-    # 기존 links...
-    fetch_store_to_tcm_bw_gbs: 512.0
-    fetch_store_to_tcm_mm: 0.0
-```
-
-PE 내부 edge 연결:
-```
-PE_SCHEDULER → PE_DMA (초기 dispatch)
-PE_SCHEDULER → PE_FETCH_STORE (초기 dispatch)
-PE_SCHEDULER → PE_GEMM (초기 dispatch)
-PE_SCHEDULER → PE_MATH (초기 dispatch)
-PE_DMA → PE_FETCH_STORE (체이닝)
-PE_FETCH_STORE → PE_GEMM (체이닝)
-PE_FETCH_STORE → PE_MATH (체이닝)
-PE_GEMM → PE_FETCH_STORE (store 체이닝)
-PE_MATH → PE_FETCH_STORE (store 체이닝)
-PE_FETCH_STORE → PE_DMA (writeback 체이닝)
-PE_FETCH_STORE → PE_TCM (BW 요청)
-```
-
-Topology edge는 **control/dispatch visibility + runtime chaining** 양쪽을 포함한다.
-Scheduler → 하위 컴포넌트 edge는 초기 dispatch 경로이며,
-컴포넌트 간 edge는 token self-routing에 의한 runtime chaining 경로이다.
-
-### D9. TileToken 메시지 정의
-
-컴포넌트 간 tile 작업 전달에 사용하는 메시지.
-Token이 plan과 stage index를 가지고 있어 self-routing이 가능하다.
-
-```python
-@dataclass
-class TileToken:
-    tile_id: int
-    pipeline_ctx: PipelineContext    # completion 추적
-    plan: TilePlan                   # 이 tile의 전체 stage sequence (immutable)
-    stage_idx: int                   # 현재 stage index in plan.stages
-    params: dict                     # current stage 파라미터 캐시 (canonical: plan.stages[stage_idx].params)
-    data_op: bool = True             # op_log 기록 대상 (ADR-0020)
-```
-
-TileToken은 한 시점에 **하나의 컴포넌트에 의해서만 소유**되며,
-동시에 여러 컴포넌트에 의해 참조되지 않는다 (single-owner).
-
-Token lifecycle:
-1. Scheduler가 stage_idx=0으로 생성, 첫 stage 컴포넌트에 put
-2. 컴포넌트가 _process() 실행 후 stage_idx 증가, 다음 컴포넌트에 put
-3. 마지막 stage 컴포넌트가 pipeline_ctx.complete_tile() 호출
-4. 모든 tile 완료 시 PipelineContext가 done_event.succeed()
-
-기존 PeInternalTxn과의 관계:
-- PeInternalTxn: PE_CPU → PE_SCHEDULER 간 command 전달 (기존 유지)
-- TileToken: PE_SCHEDULER → 하위 컴포넌트 간 tile 단위 작업 전달 (신규, self-routing)
-
----
-
-## Non-goals
-
-- **PE_CPU 변경**: PE_CPU → PE_SCHEDULER 인터페이스는 변경하지 않음
-  (PeInternalTxn 기반, ADR-0014 유지)
-- **다중 pipeline 간 자원 경합 모델**: 현재 범위에서는 단일 pipeline의
-  정확한 모델링에 집중. 다중 pipeline 간 TCM bank conflict 등은 future work.
-
-## Open Questions
-
-- **Register File 용량 모델**: fetch unit이 register에 로드할 때 용량 제한을
-  모델링할지. 용량은 바이트 단위(register_file_bytes)로 표현하며,
-  동시에 보유 가능한 tile 수는 tile 크기에 따라 결정된다.
-  용량 초과 시 fetch가 stall되어 자연스러운 backpressure가 발생한다.
-- **Prefetch 전략**: 본 ADR에서는 composite command 간 tile feed interleaving을
-  허용하지 않는다. 따라서 overlap은 command 간 선행 투입이 아니라,
-  같은 command 내부 tile들의 pipeline progression에서 자연스럽게 발생한다.
-  추가적인 prefetch가 필요하면 command 간 투입이 아니라, 같은 command 내부에서의
-  tile ordering 또는 fetch/store unit policy 차원에서 검토한다.
-- **PE_DMA coalescing**: tile 단위 DMA는 fragmentation 발생 가능.
-  DMA 내부에서 merge/coalesce하되 scheduler는 관여하지 않는 방향.
-- **동기 실행 모드**: 본 ADR에서는 비동기 pipeline을 기본/유일 execution model로
-  채택한다. 디버그 또는 validation 목적의 sync mode가 필요하면 future ADR에서 검토.
-- **다중 pipeline 간 TCM bank conflict**: 현재 단일 pipeline 기준.
-  다중 pipeline이 동시에 TCM에 접근할 때의 bank conflict 모델은 future work.
-
----
-
-## Consequences
-
-### 긍정적
-
-- 각 블록이 독립 컴포넌트 — 개별 교체 가능 (ADR-0015 준수)
-- topology에서 PE 내부 구조 가시화
-- 컴포넌트가 다음 컴포넌트를 모름 — plan 기반 라우팅으로 유연성 확보
-- DMA와 compute의 자연스러운 파이프라인 overlap (SimPy Store backpressure)
-- HW 모델링 정확도 향상 (done signal = Event, data transfer = message)
-- fetch/store 분리로 TCM BW 경합 정확히 모델링
-
-### 부정적
-
-- PE 내부 컴포넌트 수 증가 (5 → 6) — topology 노드/edge 증가
-- 컴포넌트 분리로 인해 intra-PE token forwarding이 이전 대비 더 명시적으로 드러남
-
diff --git a/docs/adr/ADR-0022-program-id-2d-grid.md b/docs/adr/ADR-0022-prog-program-id-2d-grid.md
similarity index 96%
rename from docs/adr/ADR-0022-program-id-2d-grid.md
rename to docs/adr/ADR-0022-prog-program-id-2d-grid.md
index 9bf7966..371bb49 100644
--- a/docs/adr/ADR-0022-program-id-2d-grid.md
+++ b/docs/adr/ADR-0022-prog-program-id-2d-grid.md
@@ -1,10 +1,10 @@
 # ADR-0022: 2D Grid program_id Semantics
 
-- **Status**: Accepted
-- **Date**: 2026-04-09
-- **Context**: Triton-style kernel addressing for multi-cube PE topology
+## Status
 
-## Problem
+Accepted
+
+## Context
 
 Triton kernels use `tl.program_id(axis)` to identify their position in a launch grid.
 Our hardware has a 2-level hierarchy: **cubes** contain **PEs**.
diff --git a/docs/adr/ADR-0023-ipcq-pe-collective.en.md b/docs/adr/ADR-0023-dev-ipcq-pe-collective.en.md
similarity index 99%
rename from docs/adr/ADR-0023-ipcq-pe-collective.en.md
rename to docs/adr/ADR-0023-dev-ipcq-pe-collective.en.md
index 5322753..e6b6334 100644
--- a/docs/adr/ADR-0023-ipcq-pe-collective.en.md
+++ b/docs/adr/ADR-0023-dev-ipcq-pe-collective.en.md
@@ -709,7 +709,7 @@ piggyback, tail updates via the D9 fast-path channel.
 
 ### D13. Test strategy
 
-Following the ADR-0021 D8 pattern.
+Test plan:
 
 #### T1. Unit tests (component-level)
 
@@ -801,7 +801,7 @@ F5. **Slot full + infinite backpressure**: the peer never recvs.
 ### D15. Algorithm-author cheat sheet
 
 Full step-by-step lives in
-[`docs/ccl-author-guide.en.md`](../ccl-author-guide.en.md). The
+[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
 shortest version:
 
 | Things you touch | Things you don't |
diff --git a/docs/adr/ADR-0023-ipcq-pe-collective.md b/docs/adr/ADR-0023-dev-ipcq-pe-collective.md
similarity index 77%
rename from docs/adr/ADR-0023-ipcq-pe-collective.md
rename to docs/adr/ADR-0023-dev-ipcq-pe-collective.md
index 026b6f3..5fd174d 100644
--- a/docs/adr/ADR-0023-ipcq-pe-collective.md
+++ b/docs/adr/ADR-0023-dev-ipcq-pe-collective.md
@@ -969,7 +969,7 @@ tail 갱신은 D9 fast path SimPy Store 채널로 처리된다.
 
 ### D13. 테스트 전략
 
-ADR-0021의 D8 패턴을 따라 단위/통합/regression 테스트를 명시한다.
+단위/통합/regression 테스트를 명시한다.
 
 #### T1. 단위 테스트 (component-level)
 
@@ -1102,7 +1102,7 @@ F5. **Slot full + 무한 backpressure**:
 ### D15. 알고리즘 작성자 가이드 (요약)
 
 본 섹션은 알고리즘 작성자가 한 화면으로 시작점을 잡을 수 있도록 한다.
-자세한 step-by-step 가이드는 [docs/ccl-author-guide.md](../ccl-author-guide.md) 참조.
+자세한 step-by-step 가이드는 [docs/onboarding/ccl-author-guide.md](../onboarding/ccl-author-guide.md) 참조.
 
 #### 만지는 것 / 만지지 않는 것
 
@@ -1175,7 +1175,416 @@ def neighbors(rank, world_size, neighbor_map) -> dict | None:
 2. **send/recv 짝 맞지 않음** — peer 측 recv 없으면 hang (slot full backpressure)
 3. **dtype/shape 불일치** — 첫 구현은 검증 안 함, 작성자 책임
 
-자세한 step-by-step과 hello-world 예제는 `docs/ccl-author-guide.md` 참조.
+자세한 step-by-step과 hello-world 예제는 `docs/onboarding/ccl-author-guide.md` 참조.
+
+---
+
+## HW Realization Notes (Informative)
+
+**Status of this section**: Forward-looking. Describes how the simulator
+contract (D1–D15) would map to silicon. Not currently implemented;
+subject to revision before tapeout. The simulator implements the
+contract via Python/SimPy equivalents in
+[pe_ipcq.py](../../src/kernbench/components/builtin/pe_ipcq.py) and
+[pe_dma.py](../../src/kernbench/components/builtin/pe_dma.py).
+
+### D16. Proposed HW Block Diagram and End-to-End Dataflow
+
+![PE Baseline Architecture](../diagrams/pe_baseline.png)
+
+> Source: [`../diagrams/pe_baseline.d2`](../diagrams/pe_baseline.d2) — `d2 --layout=elk --scale 1.5`.
+
+![PE Proposed Architecture](../diagrams/pe_proposed.png)
+
+> Source: [`../diagrams/pe_proposed.d2`](../diagrams/pe_proposed.d2) — `d2 --layout=elk`.
+
+**Key changes, Baseline → Proposed**:
+
+- Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW)
+- PE_IPCQ (SimPy component) → **IPCQ Controller** (HW registers + combinational logic)
+- An **IPCQ Slot Region** is explicitly reserved inside the TCM
+- The Credit Injector / Receiver connect directly to the NoC through the Fabric Port
+
+#### End-to-End Sequence (HW view)
+
+```mermaid
+sequenceDiagram
+    participant CPU_A as PE_A: PE_CPU
+    participant IPCQ_A as PE_A: IPCQ Ctrl
+    participant DMA_A as PE_A: DMA
+    participant NOC as NoC Fabric
+    participant DMA_B as PE_B: DMA
+    participant IPCQ_B as PE_B: IPCQ Ctrl
+    participant TCM_B as PE_B: TCM
+    participant CPU_B as PE_B: PE_CPU
+
+    Note over CPU_A: tl.send(dir="E", src=0x1000)
+
+    CPU_A->>IPCQ_A: MMIO: send request
+    Note over IPCQ_A: Backpressure check:<br/>(head - peer_tail_cache) < n_slots → PASS<br/>Slot addr gen:<br/>dst = peer_rx_base + (head%n) × slot_size
+    IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head}
+    Note over IPCQ_A: my_head++
+    IPCQ_A-->>CPU_A: send returns (fire-and-forget)
+
+    Note over DMA_A: TCM read → snapshot in read buffer<br/>Flit pack: data + {sender_seq, dst_addr}
+    DMA_A->>NOC: IPCQ data flit(s)
+
+    Note over NOC: hop latency + BW drain
+
+    NOC->>DMA_B: IPCQ data flit(s)
+    Note over DMA_B: Terminal BW drain<br/>Slot write latency
+
+    rect rgb(255, 240, 220)
+        Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall
+        DMA_B->>TCM_B: write data → slot address
+        DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr}
+    end
+
+    Note over IPCQ_B: Range match dst_addr → direction "W"<br/>peer_head_cache["W"] = sender_seq + 1
+    IPCQ_B-->>CPU_B: recv_wake signal
+
+    Note over CPU_B: tl.recv(dir="W") wakes up
+    CPU_B->>IPCQ_B: recv request
+    Note over IPCQ_B: peer_head_cache > my_tail → YES<br/>slot_addr = rx_base + (tail%n) × slot_size
+    IPCQ_B-->>CPU_B: return slot_addr
+    CPU_B->>TCM_B: read data from slot
+    Note over IPCQ_B: my_tail++
+
+    IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa}
+    Note over NOC: credit traversal (NoC latency)
+    NOC->>IPCQ_A: Credit arrival
+
+    Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled)
+```
+
+### D17. IPCQ Controller HW Module (new)
+
+A hardware control block that sits between PE_CPU and the DMA Engine. It corresponds
+to `PeIpcqComponent` in the simulator.
+
+#### QPair Register File
+
+Holds per-direction queue-pair state in flip-flops. PE_CPU can read and write it via
+MMIO (CSR), and software populates it at init time.
+
+```
+Per-direction registers (each 64-bit):
+  my_head          — sender write position (monotonic)
+  my_tail          — receiver read position (monotonic)
+  peer_head_cache  — last known peer head (updated by Meta Extractor)
+  peer_tail_cache  — last known peer tail (updated by Credit Receiver)
+  rx_base_pa       — this PE's rx buffer base physical address
+  peer_rx_base_pa  — peer's rx buffer base physical address
+  n_slots          — ring depth (must be a power of 2; see D21)
+  slot_size        — bytes per slot
+  peer_credit_tgt  — credit receive address on the peer PE
+
+Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare)
+Total: 8 dirs × 9 regs × 8B = 576B flip-flops
+```
+
+#### Slot Address Generator (combinational)
+
+```
+Input:  pointer (my_head or my_tail), n_slots, slot_size, base_pa
+Output: slot_addr = base_pa + (pointer % n_slots) * slot_size
+
+Implementation:
+  n_slots power-of-2 → pointer & (n_slots - 1)   (AND mask, 1 gate)
+  slot_size power-of-2 → barrel shift             (1 cycle)
+  64-bit add → ripple/kogge-stone adder           (1 cycle)
+
+Latency: 1-2 cycles combinational
+```
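+
+The address computation above can be cross-checked in software. The following is a minimal
+Python sketch, not RTL: the function name `slot_addr` and the example values are assumptions
+for illustration, and both `n_slots` and `slot_size` are assumed to be powers of 2.
+
+```python
+def slot_addr(pointer: int, n_slots: int, slot_size: int, base_pa: int) -> int:
+    """Return base_pa + (pointer % n_slots) * slot_size using a mask and a shift."""
+    # Hypothetical sanity checks: both parameters must be powers of 2.
+    assert n_slots > 0 and n_slots & (n_slots - 1) == 0
+    assert slot_size > 0 and slot_size & (slot_size - 1) == 0
+    index = pointer & (n_slots - 1)        # modulo via AND mask
+    shift = slot_size.bit_length() - 1     # log2(slot_size)
+    return base_pa + (index << shift)      # multiply via shift, then add
+
+
+# Cross-check the mask/shift form against the reference formula.
+for ptr in range(20):
+    assert slot_addr(ptr, 8, 4096, 0xE00000) == 0xE00000 + (ptr % 8) * 4096
+```
+
+The loop only confirms that the mask-and-shift form agrees with the modulo/multiply form
+while the monotonic pointer wraps around the ring; it says nothing about cycle counts.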
+
+#### Backpressure Comparator (combinational)
+
+```
+full = (my_head - peer_tail_cache) >= n_slots
+
+Implementation: 64-bit subtract + unsigned compare
+Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold
+Latency: 1 cycle
+```
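+
+As a rough illustration of the comparator, the sketch below counts how many sends can be
+issued before the stall signal would assert. The helper names `is_full` and
+`sends_until_stall` are hypothetical, and Python's unbounded integers stand in for the
+64-bit monotonic counters (counter wraparound is not modeled).
+
+```python
+def is_full(my_head: int, peer_tail_cache: int, n_slots: int) -> bool:
+    """True when no free slot is known to exist at the peer."""
+    return (my_head - peer_tail_cache) >= n_slots
+
+
+def sends_until_stall(my_head: int, peer_tail_cache: int, n_slots: int) -> int:
+    """Count how many sends can be issued before the stall signal asserts."""
+    issued = 0
+    while not is_full(my_head + issued, peer_tail_cache, n_slots):
+        issued += 1
+    return issued
+
+
+# An empty 4-slot ring accepts four sends; one returned credit frees one slot.
+assert sends_until_stall(0, 0, 4) == 4
+assert sends_until_stall(4, 1, 4) == 1
+```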
+
+#### Meta Extractor (inbound datapath sideband)
+
+Wired into the DMA Engine's inbound vc_comm path. It extracts metadata from the header of
+each arriving IPCQ flit and updates the queue-pair state.
+
+```
+Trigger: DMA inbound write completion (same cycle)
+Extract: {sender_seq, dst_addr} from flit header
+
+Direction matching (ADR-0025 D2):
+  for each dir:
+    match = (rx_base_pa[dir] <= dst_addr) && (dst_addr < rx_base_pa[dir] + n_slots[dir] * slot_size[dir])
+  8× parallel range comparators + priority encoder
+
+Update: peer_head_cache[matched_dir] = max(peer_head_cache[matched_dir], sender_seq + 1)
+Output: recv_wake signal → PE_CPU interrupt/flag
+Latency: 1 cycle (pipelined with the DMA write, so I6 atomicity holds by construction)
+```
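+
+The range match and the monotonic update can be sketched in Python as follows. This is an
+illustrative model only: the `QPair` dataclass holds just the fields needed here, the
+function name `on_ipcq_flit` is hypothetical, and the dictionary iteration order stands in
+for the priority encoder.
+
+```python
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass
+class QPair:
+    """Hypothetical subset of the per-direction registers."""
+    rx_base_pa: int
+    n_slots: int
+    slot_size: int
+    peer_head_cache: int = 0
+
+
+def on_ipcq_flit(qpairs: dict, dst_addr: int, sender_seq: int) -> Optional[str]:
+    """Match dst_addr to a direction and advance that direction's peer_head_cache."""
+    for direction, qp in qpairs.items():
+        end = qp.rx_base_pa + qp.n_slots * qp.slot_size
+        if qp.rx_base_pa <= dst_addr < end:
+            # max() keeps the cache monotonic even if a stale seq shows up.
+            qp.peer_head_cache = max(qp.peer_head_cache, sender_seq + 1)
+            return direction
+    return None  # address is outside every IPCQ rx buffer
+
+
+qpairs = {"N": QPair(0xE00000, 4, 0x1000), "W": QPair(0xE04000, 4, 0x1000)}
+assert on_ipcq_flit(qpairs, 0xE05000, 3) == "W"
+assert qpairs["W"].peer_head_cache == 4
+```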
+
+#### Credit Injector (outbound)
+
+```
+Trigger: recv completion (after my_tail is incremented)
+Action:  pack 16B credit packet → DMA vc_comm (or a dedicated credit VC)
+
+Packet: {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa}
+Latency: 1 cycle to generate, then NoC traversal
+```
+
+#### Credit Receiver (inbound sideband)
+
+```
+Trigger: 16B credit packet arrival (from NoC)
+Extract: {consumer_seq, dst_rx_base_pa}
+
+Direction matching (ADR-0025 D3):
+  for each dir:
+    match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa)
+
+Update: peer_tail_cache[matched_dir] = max(peer_tail_cache[matched_dir], consumer_seq)
+Output: send_wake signal → deassert backpressure stall
+Latency: 1 cycle
+```
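+
+A small Python model of the credit path may help when checking the stall/deassert behavior.
+It is a sketch under assumptions: `SendSide` and `on_credit` are hypothetical names, only
+the sender-side registers are modeled, and NoC latency is ignored.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class SendSide:
+    """Hypothetical per-direction sender state (subset of the QPair registers)."""
+    peer_rx_base_pa: int
+    n_slots: int
+    my_head: int = 0
+    peer_tail_cache: int = 0
+
+    def stalled(self) -> bool:
+        return (self.my_head - self.peer_tail_cache) >= self.n_slots
+
+
+def on_credit(dirs: dict, consumer_seq: int, dst_rx_base_pa: int):
+    """Apply a credit; return the matched direction name, or None if nothing matched."""
+    for name, side in dirs.items():
+        if side.peer_rx_base_pa == dst_rx_base_pa:
+            # max() ignores credits that arrive out of order.
+            side.peer_tail_cache = max(side.peer_tail_cache, consumer_seq)
+            return name
+    return None
+
+
+dirs = {"E": SendSide(peer_rx_base_pa=0xE00000, n_slots=4, my_head=4)}
+assert dirs["E"].stalled()                      # ring full, sender blocked
+assert on_credit(dirs, 2, 0xE00000) == "E"      # peer consumed two slots
+assert not dirs["E"].stalled()                  # backpressure deasserted
+```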
+
+### D18. DMA Engine vc_comm IPCQ-aware Mode
+
+Adds an IPCQ flit-handling mode to the existing vc_comm channel (D8).
+
+**Outbound**:
+
+1. Receive a command from the IPCQ Controller: `{src_addr, dst_addr, nbytes, sender_seq}`
+2. Read src_addr from the TCM and snapshot it into the DMA read buffer (standard DMA behavior)
+3. Flit pack: data + piggyback metadata (sender_seq, dst_addr)
+4. Inject into the NoC fabric port
+5. Fire-and-forget (do not wait for completion)
+
+**Inbound**:
+
+1. Receive the IPCQ flit from the NoC
+2. Terminal BW drain charge (`drain_ns = nbytes / bottleneck_bw`)
+3. Slot write latency charge (backing memory tier)
+4. **ATOMIC** (same pipeline stage, no stall insertion):
+   - TCM write: data → slot address
+   - Meta Extractor trigger: sender_seq + dst_addr → IPCQ Controller
+5. Done
+
+**Hardware guarantee of I6 atomicity**: TCM write completion and the Meta Extractor trigger
+occur in the same pipeline stage, so no separate synchronization is needed. The simulator's
+rule "no SimPy yield between MemoryStore.write and IpcqMetaArrival put" (D9, I6) therefore
+holds by construction.
+
+#### Data Snapshot Semantics
+
+Data latched into the DMA read buffer is unaffected by later modifications to the src memory.
+This is standard DMA read-then-write behavior, so no additional HW is required.
+
+#### Credit Virtual Channel (optional)
+
+- **Option A**: multiplex credits onto vc_comm (distinguished as 16B header-only flits).
+- **Option B**: add a third, dedicated credit VC (strict priority over data).
+
+Option B is better for deadlock prevention, but because the BW impact of a 16B credit is
+negligible, Option A is also sufficient.
+
+### D19. Fabric Flit Format Extension
+
+```
+Normal data flit (e.g. 512-bit):
+┌──────────────────────────────────────────┐
+│ [511:480] routing header (32b)           │
+│ [479:0]   payload (480b = 60B)           │
+└──────────────────────────────────────────┘
+
+IPCQ data flit (metadata in the first flit only):
+┌──────────────────────────────────────────┐
+│ [511:480] routing header (32b)           │
+│   [511]    ipcq_flag (1b)                │  ← distinguishes IPCQ from normal DMA
+│   [510:509] vc_id (2b)                   │
+│   [508:480] route + hop count            │
+│ [479:416] ipcq_metadata (64b)            │  ← piggyback
+│   [479:448] sender_seq (32b)             │
+│   [447:416] dst_addr[31:0] (32b)         │  ← used for direction matching
+│ [415:0]   payload (416b = 52B)           │
+└──────────────────────────────────────────┘
+Subsequent flits: full 60B payload (no metadata)
+
+Credit-only flit (128-bit, header-only):
+┌──────────────────────────────────────────┐
+│ [127:96]  routing header (32b)           │
+│   [127]   credit_flag (1b)               │
+│ [95:64]   consumer_seq (32b)             │
+│ [63:0]    dst_rx_base_pa (64b)           │
+└──────────────────────────────────────────┘
+```
+
+The first flit's payload shrinks from 60B to 52B (13% overhead). In multi-flit transfers the
+subsequent flits carry the full payload, so the overhead for large transfers is < 1%.
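+
+The overhead figures can be reproduced with a few lines of arithmetic. The sketch below is
+illustrative: the helper names are hypothetical, and "overhead" is assumed here to mean
+metadata bytes divided by the total payload capacity of the flits used.
+
+```python
+import math
+
+FLIT_PAYLOAD_B = 60    # payload bytes of a normal 512-bit flit (D19)
+IPCQ_METADATA_B = 8    # sender_seq (32b) + dst_addr[31:0] (32b) in the first flit
+
+
+def ipcq_flit_count(nbytes: int) -> int:
+    """Flits needed for nbytes when only the first flit carries metadata."""
+    first_payload = FLIT_PAYLOAD_B - IPCQ_METADATA_B   # 52B
+    if nbytes <= first_payload:
+        return 1
+    return 1 + math.ceil((nbytes - first_payload) / FLIT_PAYLOAD_B)
+
+
+def metadata_overhead(nbytes: int) -> float:
+    """Metadata bytes as a fraction of payload capacity used (assumed definition)."""
+    return IPCQ_METADATA_B / (ipcq_flit_count(nbytes) * FLIT_PAYLOAD_B)
+
+
+for size in (52, 1024, 4096, 65536):
+    print(size, ipcq_flit_count(size), f"{metadata_overhead(size):.2%}")
+```
+
+Under this definition a single-flit transfer pays 8/60 of its capacity, and the fraction
+falls below 1% once a transfer spans more than about 14 flits.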
+
+### D20. TCM IPCQ Slot Region Layout
+
+```
+TCM Memory Map (16MB):
+┌─────────────────────────────┐ 0x000000
+│  Kernel Working Memory      │
+│  (compute tensors)          │
+│  ~14MB                      │
+├─────────────────────────────┤ 0xE00000
+│  IPCQ RX Buffers            │
+│  Dir N: slots × slot_size   │
+│  Dir S: slots × slot_size   │
+│  Dir E: slots × slot_size   │
+│  Dir W: slots × slot_size   │
+│  ~1MB                       │
+├─────────────────────────────┤ 0xF00000
+│  IPCQ Metadata / Scratch    │
+│  ~1MB                       │
+└─────────────────────────────┘ 0xFFFFFF
+```
+
+The IPCQ region is placed in the upper banks of the TCM to minimize bank conflicts with
+compute accesses (see the TCM Bank Conflict risk in D22).
+
+### D21. 2nm Implementation Analysis
+
+#### Area Estimate
+
+| Module | Gate Count | Area (2nm est.) | Notes |
+|---|---|---|---|
+| QPair Register File | ~4.6K FF | 0.002 mm² | 576B flip-flops |
+| Slot Addr Gen + Backpressure | ~5K gates | 0.001 mm² | Combinational |
+| Meta Extractor + Credit Logic | ~3K gates | 0.001 mm² | 8× parallel comparators |
+| **IPCQ Controller subtotal** | **~12.6K** | **~0.004 mm²** | **< 0.1% of the whole PE** |
+| DMA vc_comm extension | ~2K gates | 0.002 mm² | Flit pack/unpack |
+| **Total delta** | **~14.6K** | **~0.006 mm²** | |
+
+#### Timing
+
+| Path | Delay (2nm est.) | Target Clock | Margin |
+|---|---|---|---|
+| Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× |
+| Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× |
+| Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× |
+| Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× |
+
+All critical paths fit within one cycle, so timing closure is not expected to be a problem.
+
+#### Power
+
+- Active: ~1 mW (register R/W + comparators, while send/recv is in progress)
+- Idle: leakage only
+- Negligible relative to total PE power
+
+#### Constraints
+
+| Item | Constraint | Rationale |
+|---|---|---|
+| `n_slots` | **must be a power of 2** | mod → AND mask (1 gate). Arbitrary values need a divider (~10 cycles) |
+| `slot_size` | **power of 2 recommended** | mul → barrel shift. Arbitrary values need a multiplier |
+| TCM IPCQ region | **placed in dedicated banks** | Avoids bank conflicts with compute accesses |
+
+### D22. Risk Assessment
+
+#### TCM Bank Conflict
+
+- **Risk**: stalls when an IPCQ slot write and a compute read hit the same bank
+- **Mitigation**: place the IPCQ region in dedicated banks at the upper TCM addresses (D20)
+- **Cost**: slightly reduced TCM banking flexibility
+- **Severity**: Medium for performance; Low for correctness (not a correctness issue)
+
+#### Credit Return Latency under Congestion
+
+- **Risk**: under NoC congestion, delayed credit return → sender backpressure stall
+- **Mitigation**:
+  - Move credits to a separate VC with strict priority (at 16B the BW impact is minimal)
+  - Or size n_slots generously (8+) so the buffer absorbs the credit delay
+- **Severity**: Low (a 16B credit contributes almost nothing to congestion)
+
+#### Inter-Direction Ordering
+
+- **Risk**: ordering when one PE sends in several directions at the same time
+- **Mitigation**: a per-direction monotonic seq is sufficient. Inter-direction ordering is the
+  kernel's (software's) responsibility, the same as in the current simulator model (D2 + D4)
+- **Severity**: Low (resolved by the architecture design)
+
+### D23. HW Alternatives Considered
+
+#### Doorbell + Polling (traditional approach)
+
+```
+Send: DMA write data → DMA write doorbell register at peer → peer polls doorbell
+Recv: Polling loop on doorbell, or interrupt-driven
+```
+
+| Pros | Cons |
+|---|---|
+| Simple HW (no IPCQ controller needed) | Two DMA transactions (data + doorbell) |
+| Reuses the existing DMA | Ordering between data and doorbell must be guaranteed (fence) |
+| | Polling wastes power; interrupts add latency overhead |
+
+**Assessment**: 2-3× higher latency than piggyback. **Rejected.**
+
+#### Hardware Message Queue (NVIDIA NVLink style)
+
+```
+Send: CPU pushes a descriptor to the HMQ → HW forwards it to the peer HMQ automatically
+Recv: pop a descriptor from the HMQ → check the data pointer
+```
+
+| Pros | Cons |
+|---|---|
+| CPU only writes a descriptor | Needs a separate HMQ engine (~0.05 mm²) |
+| Descriptor/data separation → flexible | Datapath separate from DMA → duplicated area/power |
+| | Large tensors still need DMA in the end |
+
+**Assessment**: CCL's large-tensor pattern requires DMA anyway, so a dual HMQ + DMA structure
+wastes area. **Rejected.**
+
+#### RDMA-style Completion Queue (CQ)
+
+```
+Send: DMA write → a CQE is generated automatically at the peer
+Recv: CQ poll/interrupt → locate the data
+```
+
+| Pros | Cons |
+|---|---|
+| Mature InfiniBand/RoCE model | CQ management logic + CQE memory overhead |
+| Easy multi-tenancy/isolation | Extra CQE/data ordering guarantees needed |
+| | Over-engineered for PE-to-PE CCL |
+
+**Assessment**: an RDMA CQ suits multi-tenant isolation on a host-facing NIC.
+In a single-owner PE-to-PE environment it is unnecessary complexity. **Rejected.**
+
+#### Credit-in-Data Piggyback (v2 optimization candidate)
+
+In the current design, credit return is a separate 16B packet. In bidirectional communication
+patterns, **the credit can be merged into a data flit traveling in the reverse direction.**
+
+```
+PE_A →E→ PE_B: data + sender_seq=3
+PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4  ← credit merged into the data flit
+```
+
+| Pros | Cons |
+|---|---|
+| Removes the credit-only packet → saves NoC BW | Unidirectional patterns need a fallback |
+| Credit latency → 0 in bidirectional allreduce | Adds 8B to the flit header (minimal overhead) |
+| | Slightly higher logic complexity |
+
+**Assessment**: a strong optimization of the current design. In bidirectional allreduce the credit
+packets can be eliminated entirely, while the standalone credit fallback is kept. **Recommended for v2.**
+
+### Open HW Questions
+
+- What fraction of the TCM may the IPCQ slot region occupy? (current assumption: ~1MB / 16MB = 6.25%)
+- Should credits get a separate VC, or be multiplexed onto vc_comm? (see D18)
+- Flit format compatibility on inter-SIP links needs to be verified
+- Should n_slots have a maximum? (8 directions × 8 slots × 64KB = 4MB → 25% of the TCM)
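+
+The sizing questions above come down to simple arithmetic, sketched here in Python. The
+helper names are hypothetical, and the 16MB TCM and 1MB RX region are the assumptions
+already stated in D20.
+
+```python
+KIB = 1024
+MIB = 1024 * KIB
+TCM_BYTES = 16 * MIB          # TCM size assumed in D20
+RX_REGION_BYTES = 1 * MIB     # IPCQ RX buffer region assumed in D20
+
+
+def ipcq_region_bytes(directions: int, n_slots: int, slot_size: int) -> int:
+    """Total RX buffer bytes for a given ring configuration."""
+    return directions * n_slots * slot_size
+
+
+def max_slot_size(region_bytes: int, directions: int, n_slots: int) -> int:
+    """Largest slot size (bytes) that still fits in the region."""
+    return region_bytes // (directions * n_slots)
+
+
+worst_case = ipcq_region_bytes(directions=8, n_slots=8, slot_size=64 * KIB)
+assert worst_case == 4 * MIB
+assert worst_case / TCM_BYTES == 0.25
+
+# With 4 directions and 8 slots each, a 1MB region allows 32KB slots.
+assert max_slot_size(RX_REGION_BYTES, directions=4, n_slots=8) == 32 * KIB
+```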
 
 ---
 
diff --git a/docs/adr/ADR-0024-par-sip-tp-launcher.md b/docs/adr/ADR-0024-par-sip-tp-launcher.md
new file mode 100644
index 0000000..b321e84
--- /dev/null
+++ b/docs/adr/ADR-0024-par-sip-tp-launcher.md
@@ -0,0 +1,206 @@
+# ADR-0024: SIP-level Launcher — rank = SIP
+
+## Status
+
+Accepted
+
+## Context
+
+### Goal
+
+Align the participation unit (rank) of `torch.distributed` collective calls with the **SIP**
+(device) boundary. The aim is bench code that reads, **at the host level, no differently** from
+a real PyTorch DDP/TP script.
+
+Comparison with real PyTorch:
+
+| Dimension | real PyTorch | KernBench |
+| --- | --- | --- |
+| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
+| `get_rank()` | `RANK` env var | greenlet-local registry |
+| `get_world_size()` | `WORLD_SIZE` env var | number of SIPs in the topology |
+| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
+| `mp.spawn` | OS process fork | greenlet fan-out |
+
+### Problems to solve
+
+1. **rank = SIP in the public API**, so that bench workers need not know about PEs.
+2. **Greenlet-local rank/device tracking**: within the 1-process model, each worker
+   greenlet must correctly identify its own rank and its own SIP.
+3. **Tensor placement = structural (sip, cube, pe)**: if rank is a SIP, default tensor
+   placement must also be expressed in structural coordinates.
+
+### Non-problem (outside this ADR)
+
+- IPCQ direction addressing → ADR-0025
+- Removing `DPPolicy.sip`/`num_sips` → ADR-0026
+- Megatron-style TP → ADR-0027
+- DTensor → ADR-0028 (future)
+- Worker scheduling / `mp.spawn` / collective drain / exception cleanup
+  → ADR-0027 D0/D1
+- Collective algorithm implementation (intercube_allreduce, SFR config) → ADR-0032
+
+## Decision
+
+### D1. rank = SIP (world_size resolution)
+
+```python
+def _resolve_world_size(self) -> int:
+    if "world_size" in self._merged:
+        return int(self._merged["world_size"])
+    defaults = self._cfg_all.get("defaults", {})
+    if "world_size" in defaults:
+        return int(defaults["world_size"])
+    spec = self.ctx.spec or {}
+    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
+```
+
+Priority: algorithm override > defaults override > SIP count. The `ccl.yaml`
+override is kept as the legacy "rank = PE" test path.
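+
+The three-level priority can be exercised in isolation with a standalone version of the
+lookup. This is a sketch, not the project's code: `resolve_world_size` is a hypothetical
+free function, and the plain dicts stand in for `self._merged`, `self._cfg_all`, and
+`self.ctx.spec`.
+
+```python
+from typing import Optional
+
+
+def resolve_world_size(merged: dict, cfg_all: dict, spec: Optional[dict]) -> int:
+    """Resolve world_size: algorithm override > defaults override > SIP count."""
+    if "world_size" in merged:
+        return int(merged["world_size"])
+    defaults = cfg_all.get("defaults", {})
+    if "world_size" in defaults:
+        return int(defaults["world_size"])
+    # Fall back to the SIP count from the topology spec (1 if absent).
+    return int((spec or {}).get("system", {}).get("sips", {}).get("count", 1))
+
+
+spec = {"system": {"sips": {"count": 4}}}
+assert resolve_world_size({}, {}, spec) == 4                                   # rank = SIP
+assert resolve_world_size({}, {"defaults": {"world_size": 8}}, spec) == 8      # defaults win
+assert resolve_world_size({"world_size": 16}, {"defaults": {"world_size": 8}}, spec) == 16
+```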
+
+### D2. Greenlet-local rank registry (+ debug warning)
+
+```python
+import os
+import warnings
+
+
+class DistributedContext:
+    def __init__(self):
+        self._backend = None
+        self._rank_by_greenlet: dict = {}
+
+    def _bind_rank(self, g, rank: int) -> None:
+        self._rank_by_greenlet[g] = int(rank)
+
+    def get_rank(self) -> int:
+        self._ensure_initialized()
+        from greenlet import getcurrent
+        g = getcurrent()
+        if g not in self._rank_by_greenlet:
+            if os.environ.get("KERNBENCH_DEBUG"):
+                warnings.warn(
+                    "get_rank() called outside a bound greenlet — returning 0. "
+                    "Likely a bug unless running single-driver."
+                )
+            return 0
+        return int(self._rank_by_greenlet[g])
+```
+
+### D3. `torch.ahbm.set_device(rank)` — SIP binding
+
+The KernBench backend is named `ahbm` (ADR-0023). Real PyTorch uses
+`torch.cuda.set_device(r)`, but since we are not CUDA we use an honestly named
+namespace.
+
+```python
+class _AhbmNamespace:
+    """torch.ahbm — per-greenlet SIP device binding.
+
+    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
+    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
+    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
+    """
+
+    def __init__(self):
+        self._device_by_greenlet: dict = {}
+
+    def set_device(self, device: int) -> None:
+        from greenlet import getcurrent
+        self._device_by_greenlet[getcurrent()] = int(device)
+
+    def current_device(self) -> int | None:
+        from greenlet import getcurrent
+        return self._device_by_greenlet.get(getcurrent())
+
+# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
+# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
+```
+
+**Parallel support for the PyTorch 2.x style**: recent PyTorch is moving toward the device-agnostic
+`torch.accelerator` namespace (`torch.accelerator.set_device_index(r)`,
+`torch.accelerator.current_device_index()`). For users who want to write code that is not tied to
+a device vendor, KernBench supports this surface as well.
+
+```python
+class _AcceleratorNamespace:
+    """torch.accelerator — device-agnostic API (PyTorch 2.x style).
+
+    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
+        torch.accelerator.set_device_index(rank)
+        torch.accelerator.current_device_index()
+    """
+
+    def __init__(self, ahbm: _AhbmNamespace):
+        self._ahbm = ahbm
+
+    def set_device_index(self, device: int) -> None:
+        self._ahbm.set_device(device)
+
+    def current_device_index(self) -> int | None:
+        return self._ahbm.current_device()
+
+# RuntimeContext
+self.ahbm = _AhbmNamespace()
+self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias
+```
+
+Bench authors pick one of the following; both are backed by the same registry internally:
+
+```python
+torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
+torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
+```
+
+### D4. Tensor placement = structural (sip, cube, pe) coordinates
+
+`resolve_dp_policy` takes `target_sip` directly and generates the placement in structural coordinates.
+See ADR-0026 for details.
+
+```python
+# RuntimeContext._create_tensor
+current_sip = self.ahbm.current_device()          # (D3 naming)
+if current_sip is None:
+    current_sip = 0  # single-driver fallback (consistent with D2)
+placement = resolve_dp_policy(
+    dp, shape=shape_2d, itemsize=itemsize,
+    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
+    target_sip=current_sip,
+)
+```
+
+There is no post-hoc `pe_index` shifting: ShardSpec holds the structural `(sip, cube, pe)`
+coordinates directly. See ADR-0026 for ShardSpec details.
+
+---
+
+## Dependencies
+
+- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
+- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature used in D4 and
+  ShardSpec's structural-coordinate representation.
+- **ADR-0027** (Megatron TP + scheduler): the implementation reference for worker scheduling,
+  `mp.spawn`, collective drain, and exception cleanup.
+
+---
+
+## Non-goals
+
+- **IPCQ protocol changes**: ADR-0023 stands.
+- **DPPolicy field cleanup**: ADR-0026.
+- **Megatron-style TP**: ADR-0027.
+- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
+- **Collective algorithm implementation**: ADR-0032.
+- **Multi-node (cross-process)**: single process only.
+
+---
+
+## Consequences
+
+### Positive
+
+- **Bench = real PyTorch DDP** (from the public-API point of view).
+- **Greenlet-local rank**: cross-rank correctness is achievable in the 1-process model.
+- **Structural placement coordinates**: other ADRs (ADR-0026 / ADR-0027 / ADR-0032)
+  operate consistently on the `(sip, cube, pe)` 3-tuple.
+
+### Neutral
+
+- The IPCQ PE-level protocol (ADR-0023) is unchanged.
+- The IO_CPU role is unchanged (existing transit as-is).
diff --git a/docs/adr/ADR-0024-sip-tp-launcher.md b/docs/adr/ADR-0024-sip-tp-launcher.md
deleted file mode 100644
index fcc97bb..0000000
--- a/docs/adr/ADR-0024-sip-tp-launcher.md
+++ /dev/null
@@ -1,868 +0,0 @@
-# ADR-0024: SIP-level TP Launcher — rank = SIP (host-driven dispatch)
-
-## Status
-
-Accepted. rank = SIP process-group model stands. The allreduce algorithm
-path (mapper / validator / per-PE install machinery originally targeted at
-ADR-0029) has been replaced by ADR-0032: `AhbmCCLBackend` now calls
-`configure_sfr_intercube_multisip` at `init_process_group` time and the
-intercube kernel receives `(sip_rank, sip_topo_kind, sip_topo_w,
-sip_topo_h)` appended after the module's `kernel_args()`. The
-`leader_only` / `all_pes` mapper concepts in this document are no longer
-used by the default allreduce path.
-
-## Context
-
-### Goal
-
-Align the participation unit (rank) of `torch.distributed` collective calls with
-the **SIP** (device) boundary. The target is bench code that reads **the same at
-the host level** as a real PyTorch DDP/TP script.
-
-Comparison with real PyTorch:
-
-| Aspect | real PyTorch | KernBench (after this ADR) |
-|---|---|---|
-| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
-| `get_rank()` | `RANK` env var | greenlet-local registry |
-| `get_world_size()` | `WORLD_SIZE` env var | number of SIPs in the topology |
-| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
-| `mp.spawn` | OS process fork | greenlet fan-out |
-
-### Design principle: abstraction at the public API, existing paths inside
-
-**Abstraction at the public API (bench worker) level**:
-```
-rank = SIP
-DPPolicy = intra-device (cube × PE) distribution only
-PyTorch-style surface: dist.all_reduce, torch.ahbm.set_device, mp.spawn, etc.
-```
-
-**Framework internals**:
-```
-build_install_plans (host): topology + mapper + algorithm → SipInstallPlan
-  ↓
-backend (host): dispatches each per-PE spec in the plan as an IpcqInitMsg via engine.submit
-  ↓
-engine: existing PE-scoped routing (same path as MmuMapMsg and similar messages)
-  ↓
-PE_IPCQ: handles IpcqInitMsg in its own message loop (existing capability)
-```
-
-**Key point**: no new message types and no IO_CPU extension. The existing engine
-routing and the existing `IpcqInitMsg` type are used as is. Only the old
-"sideband direct call" bypass is removed, which unifies the convention.
-
-### Problems to solve
-
-1. **rank = SIP at the public API**: bench workers should not need to know about PEs.
-2. **Multi-worker execution**: N ranks run independent worker code. Under the
-   one-process constraint, this uses greenlets plus barrier synchronization.
-3. **Cross-rank collective submit synchronization**: if the first rank waits alone,
-   SimPy deadlocks because its peers are absent. Drain only after every rank has submitted.
-4. **Remove the existing sideband install**: route IpcqInitMsg through engine.submit,
-   the same pattern as MmuMapMsg and other control-plane messages.
-5. **Separate algorithm / mapper / validator**: an algorithm module holds only kernel
-   code; topology / mapping / validation come from registries plus declarations.
-
-### Non-problems (outside this ADR)
-
-- IPCQ direction addressing fix → **ADR-0025**
-- Removal of `DPPolicy.sip`/`num_sips` → **ADR-0026**
-- Megatron-style TP → **ADR-0027**
-- DTensor → **ADR-0028 (future)**
-- **Promoting IO_CPU to the single SIP-level control-plane endpoint**: not adopted
-  as an invariant in this ADR. KernBench has no such principle today, and the
-  justification for introducing it on its own is weak. If control-plane latency
-  modeling later needs more precision, that goes in a separate ADR.
-
-## Decision
-
-### D1. rank = SIP (world_size resolution)
-
-```python
-def _resolve_world_size(self) -> int:
-    if "world_size" in self._merged:
-        return int(self._merged["world_size"])
-    defaults = self._cfg_all.get("defaults", {})
-    if "world_size" in defaults:
-        return int(defaults["world_size"])
-    spec = self.ctx.spec or {}
-    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
-```
-
-Priority: algorithm override > defaults override > SIP count. The `ccl.yaml`
-override is kept as the legacy "rank = PE" test path.
-
-### D2. Install path: unified on engine.submit
-
-Remove the sideband direct call in `ccl/install.py` and send `IpcqInitMsg` through
-`engine.submit`. MmuMapMsg, MemoryWriteMsg, and others already follow this pattern.
-
-```python
-# Backend (AhbmCCLBackend.__init__ or at init_process_group time)
-from kernbench.ccl.install_plan import build_install_plans
-
-plans = build_install_plans(
-    world_size=self._world_size,
-    algorithm=self._merged["algorithm"],
-    algorithm_config=self._merged,
-    spec=self.ctx.spec,
-)
-self._plans = plans
-
-# Submit through the engine so each PE_IPCQ receives its own neighbor table
-handles = []
-for plan in plans:
-    for pe_install in plan.pe_installs:
-        h = self.ctx.submit(IpcqInitMsg(
-            correlation_id=self.ctx.correlation_id,
-            request_id=f"ipcq_init_s{plan.sip}c{pe_install.cube}p{pe_install.pe}",
-            target_sips=(plan.sip,),
-            target_cubes=(pe_install.cube,),
-            target_pe=pe_install.pe,
-            entries=pe_install.neighbors,
-            buffer_kind=plan.buffer_kind,
-            n_slots=plan.n_slots,
-            slot_size=plan.slot_size,
-            # ... (existing IpcqInitMsg fields)
-        ))
-        handles.append(h)
-
-# Eager install: guaranteed to complete before init_process_group returns
-for h in handles:
-    self.ctx.wait(h)
-```
-
-The **PE_IPCQ component** already handles `IpcqInitMsg` in its main loop
-(`pe_ipcq.py`, lines 145-147), so it needs no change. The only difference is that
-the message now arrives through the engine queue rather than a sideband Python call.
-
-**Correctness invariant (equivalence)**: `init_process_group()` returns only after
-it has `wait()`ed on every install handle, so a launch-before-install problem
-cannot occur structurally. Only one correctness question remains:
-
-> Does engine-routed `IpcqInitMsg` handling produce **the same final
-> PE_IPCQ state** as the existing sideband
-> `pe_ipcq._install_neighbors(msg)` call?
-
-Verification points (see T3):
-
-1. **State equivalence**: the internal state transitions of `_install_neighbors()`
-   happen identically on the engine dispatch path, so the final PE_IPCQ state
-   (`_queue_pairs`, `_installed`, `_credit_inbox`, etc.) matches.
-
-2. **No sideband-only side effects**: no side effect exists only on the sideband
-   path (for example, the request_id / correlation tracking that engine.submit
-   sets does not distort install semantics).
-
-3. **Ordering independence**: the final state is the same even if install messages
-   for different PEs are processed in arbitrary order in the engine queue. Install
-   must be a **per-PE independent operation** with no cross-PE ordering dependency.
-
-4. **Idempotency**: what if `IpcqInitMsg` arrives twice for the same PE? The current
-   design assumes exactly one install per PE, and behavior on a duplicate install
-   is undefined. A conservative policy:
-   - The first install transitions `_installed = True`.
-   - A later duplicate install message must be specified as either an **error**
-     (raise) or a **silent idempotent** no-op.
-   - Recommendation: **raise** (explicit error, early bug detection). Add a
-     duplicate-install case to T3.
-
-5. **Partial install visibility**: is the intermediate state, where only some PEs
-   have finished installing, observable from outside? In the current structure the
-   eager wait-all in `init_process_group()` acts as a barrier, so partial state is
-   not exposed to bench code. Debugging / introspection APIs can still see it,
-   which is not a problem and only needs documenting.
-
-**Timing impact**: engine-routed install makes `init_process_group()` consume SimPy
-time, whereas the existing sideband install is effectively zero-cost. ADR contract:
-
-> Benchmarks must not rely on zero-cost initialization.
-> `init_process_group()` consumes simulated time proportional to the number
-> of participating PEs × per-PE install latency. First collective call
-> starts at a well-defined but non-zero sim time.
-
-### D3. Launch path: same primitive as non-CCL kernels
-
-**CCL kernels use the same `KernelLaunchMsg` submission path as non-CCL kernels.**
-Things like the IO_CPU/M_CPU transit inside the engine are **existing implementation
-details, not CCL-specific machinery**. The backend just iterates the plan's
-`participating_pes` and submits a `KernelLaunchMsg` for each. No new message type, no new route.
-
-```python
-# AhbmCCLBackend.all_reduce
-def all_reduce(self, tensor, op="sum"):
-    if op != "sum":
-        raise NotImplementedError(...)
-    if tensor._handle is None or not tensor._handle.shards:
-        raise RuntimeError(...)
-
-    # Validator: runs against the global handle (D8)
-    validator_name = self._merged.get("validator")
-    if validator_name:
-        resolve_validator(validator_name)(tensor._handle, self._world_size, self.ctx.spec)
-
-    rank = self.ctx.distributed.get_rank()
-    plan = self._plans[rank]
-    tensor_view = _tensor_slice_for_sip(tensor._handle, plan.sip)
-
-    # Compute kernel args from the plan (host-side)
-    import importlib
-    mod = importlib.import_module(plan.kernel_module)
-    n_elem = tensor_view.shards[0].nbytes // tensor.itemsize
-    kargs = mod.kernel_args(n_elem=n_elem, world_size=plan.world_size,
-                             **plan.kernel_config)
-
-    def _submit():
-        out = []
-        for (cube, pe) in plan.participating_pes:
-            h = self.ctx.submit(KernelLaunchMsg(
-                correlation_id=self.ctx.correlation_id,
-                request_id=f"allreduce_r{rank}_c{cube}p{pe}",
-                kernel_ref=KernelRef(name=plan.algorithm_name, kind="builtin"),
-                args=(_tensor_arg_for_pe(tensor_view, cube, pe), *kargs),
-                target_sips=(plan.sip,),
-                target_cubes=(cube,),
-                target_pe=pe,
-            ))
-            out.append(h)
-        return out
-
-    self._barrier.submit_and_drain(self.ctx, rank, _submit)
-```
-
-### D4. Algorithm ABI: thin, with an explicit arg contract
-
-Each algorithm module **requires only kernel + kernel_args**.
-
-```python
-# src/kernbench/ccl/algorithms/ring_allreduce.py
-def kernel(t_ptr, n_elem, world_size, tl):
-    """PE-side kernel code.
-
-    Signature convention: first positional arg is the tensor pointer
-    (per-PE slice), subsequent positional args are whatever
-    kernel_args() returns. `tl` is injected by the TLContext runtime.
-    """
-
-def kernel_args(*, n_elem: int, world_size: int, **kw) -> tuple:
-    """Return the tuple of non-tensor positional args.
-
-    Signature contract:
-    - Called keyword-only with n_elem and world_size plus kernel_config.
-    - Returns a tuple (possibly empty) of scalar / metadata args.
-    - The backend constructs the final KernelLaunchMsg.args as:
-          (per_pe_tensor_arg, *kernel_args(...))
-      where per_pe_tensor_arg is a TensorArg containing only the shards
-      local to the receiving PE (derived from tensor_view).
-    """
-    return (n_elem, world_size)
-```
-
-**Arg assembly in backend (reference)**:
-
-```python
-# AhbmCCLBackend.all_reduce (excerpt from D3)
-kargs = mod.kernel_args(n_elem=n_elem, world_size=plan.world_size,
-                         **plan.kernel_config)
-for (cube, pe) in plan.participating_pes:
-    pe_tensor_arg = _tensor_arg_for_pe(tensor_view, cube, pe)
-    self.ctx.submit(KernelLaunchMsg(
-        args=(pe_tensor_arg, *kargs),       # tensor first, then kernel_args return
-        target_sips=(plan.sip,),
-        target_cubes=(cube,),
-        target_pe=pe,
-        ...
-    ))
-```
-
-Declarative metadata in **ccl.yaml**:
-
-```yaml
-algorithms:
-  ring_allreduce_tcm:
-    module: kernbench.ccl.algorithms.ring_allreduce
-    topology: ring_1d             # kernbench/ccl/topologies.py
-    mapper: leader_only           # kernbench/ccl/mappers.py (new)
-    validator: single_shard_per_rank   # kernbench/ccl/validators.py (new)
-    buffer_kind: tcm
-    n_elem: 8
-```
-
-- `topology` (required)
-- `mapper` (optional, default `"leader_only"`)
-- `validator` (optional)
-
-The algorithm module itself contains **no** mapper, validator, participating_pes,
-or neighbor generator.
-
-### D5. Mapper + validator: registry key **or** import path
-
-The host-side framework provides built-in registries. Custom extensions use a dotted import path.
-
-```python
-# src/kernbench/ccl/mappers.py (new)
-Mapper = Callable[[dict, int], list[tuple[int, int]]]
-
-def leader_only(spec, rank):
-    """Single leader PE per SIP. For ring/tree/mesh."""
-    return [(0, 0)]
-
-def all_pes(spec, rank):
-    """Every PE in the SIP. Used when an algorithm involves all intra-SIP PEs
-    (e.g. intra-SIP reduction, intra-SIP broadcast, the lower level of a
-    hierarchical collective)."""
-    cm = spec["sip"]["cube_mesh"]
-    pl = spec["cube"]["pe_layout"]
-    n_cubes = cm["w"] * cm["h"]
-    n_pes = pl["pe_per_corner"] * len(pl["corners"])
-    return [(c, p) for c in range(n_cubes) for p in range(n_pes)]
-
-MAPPER_REGISTRY = {"leader_only": leader_only, "all_pes": all_pes}
-
-def resolve_mapper(key_or_path: str) -> Mapper:
-    if key_or_path in MAPPER_REGISTRY:
-        return MAPPER_REGISTRY[key_or_path]
-    if "." in key_or_path:
-        import importlib
-        mod_path, fn_name = key_or_path.rsplit(".", 1)
-        return getattr(importlib.import_module(mod_path), fn_name)
-    raise ValueError(f"unknown mapper: {key_or_path!r}")
-```
-
-Validators follow the same pattern (`src/kernbench/ccl/validators.py`). Their input
-is the **global TensorHandle** (see D8).
-
-### D6. Host-side install plan builder
-
-```python
-# src/kernbench/ccl/install_plan.py (new; restructured from the existing install.py)
-from dataclasses import dataclass
-from typing import Any, Mapping
-
-@dataclass(frozen=True)
-class NeighborTableEntry:
-    direction: str
-    peer_direction: str       # ADR-0025
-    peer_sip: int
-    peer_cube: int
-    peer_pe: int
-    rx_base_pa: int
-    # ... other IPCQ settings ...
-
-@dataclass(frozen=True)
-class PeInstallSpec:
-    cube: int
-    pe: int
-    neighbors: tuple[NeighborTableEntry, ...]
-
-@dataclass(frozen=True)
-class SipInstallPlan:
-    algorithm_name: str                  # human-readable ("ring_allreduce_tcm")
-    sip: int
-    rank: int
-    world_size: int
-    pe_installs: tuple[PeInstallSpec, ...]     # per-PE neighbor tables
-    buffer_kind: str
-    n_slots: int
-    slot_size: int
-    kernel_module: str
-    participating_pes: tuple[tuple[int, int], ...]
-    kernel_config: Mapping[str, Any]
-
-
-def build_install_plans(
-    world_size: int,
-    algorithm: str,
-    algorithm_config: dict,
-    spec: dict,
-) -> list[SipInstallPlan]:
-    """Compose topology + mapper + algorithm into per-SIP plan list."""
-    topo_fn = _resolve_topology(algorithm_config["topology"])
-    mapper = resolve_mapper(algorithm_config.get("mapper", "leader_only"))
-
-    # kernel_config: algorithm-specific params passed to kernel_args at launch
-    kernel_config = {
-        k: v for k, v in algorithm_config.items()
-        if k in {"n_elem", "reduce_op", "chunk_size"} or k.startswith("kernel_")
-    }
-
-    plans = []
-    for rank in range(world_size):
-        sip = rank  # identity mapping (non-identity is an open question)
-        pes = mapper(spec, rank)
-        pe_installs = _build_pe_installs(
-            rank=rank, world_size=world_size, sip=sip,
-            pes=pes, topo_fn=topo_fn, algorithm_config=algorithm_config, spec=spec,
-        )
-        plans.append(SipInstallPlan(
-            algorithm_name=algorithm,
-            sip=sip, rank=rank, world_size=world_size,
-            pe_installs=pe_installs,
-            buffer_kind=algorithm_config["buffer_kind"],
-            n_slots=algorithm_config["n_slots"],
-            slot_size=algorithm_config["slot_size"],
-            kernel_module=algorithm_config["module"],
-            participating_pes=tuple(pes),
-            kernel_config=kernel_config,
-        ))
-    return plans
-```
-
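-`_build_pe_installs` reuses the neighbor computation logic from the existing
-`ccl/install.py` (with the `reverse_direction` improvement from ADR-0025 applied).
-
-**Multi-PE mappers and neighbor-generation responsibility**: when a mapper returns
-several PEs within a SIP (`all_pes` and similar), the PE-level neighbor graph is
-formed inside `_build_pe_installs`. The topology module provides only rank-level
-relations, and the builder resolves PE-level connectivity. This split can become a
-maintenance burden for complex multi-level algorithms; see ADR-0029 for discussion.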
-
-### D7. Epoch-based collective barrier
-
-Synchronizes cross-rank submits. Each collective call is its own epoch. A duplicate
-join by the same rank raises immediately.
-
-```python
-# src/kernbench/runtime_api/distributed.py
-@dataclass
-class _EpochState:
-    participants: set[int] = field(default_factory=set)
-    pending: list = field(default_factory=list)
-    drained: bool = False
-    returned: int = 0
-
-
-class _CollectiveBarrier:
-    """Epoch-based barrier.
-
-    Contract:
-    - Each call joins the earliest non-drained epoch.
-    - Each rank may join a given epoch at most once. Duplicate join raises.
-    - Last arriver (participants == world_size) performs drain and advances
-      _next_epoch. Earlier arrivers yield and re-check drained on resume.
-    - Epoch state is GC'd when returned == world_size (success path).
-    - On failure paths, residual state is acceptable; reset() clears it.
-    """
-
-    def __init__(self, world_size: int):
-        self._world_size = world_size
-        self._next_epoch = 0
-        self._state: dict[int, _EpochState] = {}
-
-    def submit_and_drain(self, ctx, rank: int, submit_fn) -> None:
-        epoch = self._next_epoch
-        state = self._state.setdefault(epoch, _EpochState())
-
-        if rank in state.participants:
-            raise RuntimeError(
-                f"rank {rank} attempted duplicate join to epoch {epoch}"
-            )
-        state.participants.add(rank)
-
-        handles = submit_fn()
-        state.pending.extend(handles)
-
-        is_last = len(state.participants) >= self._world_size
-
-        if is_last:
-            for h in state.pending:
-                ctx.wait(h)
-            state.drained = True
-            self._next_epoch = epoch + 1
-        else:
-            from greenlet import getcurrent
-            g = getcurrent()
-            if g.parent is None:
-                raise RuntimeError("barrier requires a bound worker greenlet")
-            while not state.drained:
-                g.parent.switch()
-
-        state.returned += 1
-        if state.returned >= self._world_size:
-            self._state.pop(epoch, None)
-
-    def reset(self) -> None:
-        """Explicit cleanup on spawn exception unwinding."""
-        self._state.clear()
-        self._next_epoch = 0
-```
-
-### D8. Per-rank tensor view + validator contract
-
-**Validator** (host-side, pre-slice, against the global handle):
-
-```python
-# src/kernbench/ccl/validators.py
-Validator = Callable[[TensorHandle, int, dict], None]
-
-def single_shard_per_rank(handle, world_size, spec):
-    """Ring family: exactly world_size shards, one per SIP."""
-    if len(handle.shards) != world_size:
-        raise ValueError(...)
-    per_sip = {}
-    for s in handle.shards:
-        per_sip[s.sip] = per_sip.get(s.sip, 0) + 1
-    if any(c != 1 for c in per_sip.values()):
-        raise ValueError(...)
-
-def multi_pe_sip_local(handle, world_size, spec):
-    """Multi-PE per SIP layout: each SIP holds as many shards as it has intra-SIP PEs.
-    Used by algorithms that involve every intra-SIP PE."""
-    cm = spec["sip"]["cube_mesh"]
-    pl = spec["cube"]["pe_layout"]
-    per_sip = cm["w"] * cm["h"] * pl["pe_per_corner"] * len(pl["corners"])
-    if len(handle.shards) != world_size * per_sip:
-        raise ValueError(...)
-
-VALIDATOR_REGISTRY = {...}
-def resolve_validator(key_or_path): ...
-```
-
-A validator checks shard-layout invariants across the whole world. The backend
-builds the per-rank view with `_tensor_slice_for_sip` **after** calling the validator.
-
-**Per-rank tensor view** — SIP-local slice:
-
-```python
-def _tensor_slice_for_sip(handle, sip) -> TensorArg:
-    sip_shards = [s for s in handle.shards if s.sip == sip]
-    if not sip_shards:
-        raise RuntimeError(f"tensor has no shards on SIP {sip}")
-    # Deterministic ordering contract: (cube, pe, offset_bytes) ascending.
-    # Multi-PE mappers (hierarchical, etc.) rely on this ordering to align
-    # per-PE tensor arg construction with participating_pes enumeration.
-    sip_shards.sort(key=lambda s: (s.cube, s.pe, s.offset_bytes))
-    min_offset = min(s.offset_bytes for s in sip_shards)
-    local_va_base = handle.va_base + min_offset if handle.va_base else 0
-    return TensorArg(
-        shards=tuple(TensorArgShard(...) for s in sip_shards),
-        va_base=local_va_base,
-    )
-```
-
-**Ordering invariant**: shards in a slice are in ascending `(cube, pe, offset_bytes)`
-order. When the backend iterates `participating_pes` and builds
-`_tensor_arg_for_pe(view, cube, pe)`, it can assume deterministic ordering. This
-matters most when the `all_pes` mapper is combined with a hierarchical algorithm
-that interprets the per-PE slice composition in an order-dependent way.
-
-### D9. Greenlet-local rank registry (+ debug warning)
-
-```python
-class DistributedContext:
-    def __init__(self):
-        self._backend = None
-        self._rank_by_greenlet: dict = {}
-
-    def _bind_rank(self, g, rank: int) -> None:
-        self._rank_by_greenlet[g] = int(rank)
-
-    def get_rank(self) -> int:
-        self._ensure_initialized()
-        from greenlet import getcurrent
-        g = getcurrent()
-        if g not in self._rank_by_greenlet:
-            if os.environ.get("KERNBENCH_DEBUG"):
-                warnings.warn(
-                    "get_rank() called outside a bound greenlet — returning 0. "
-                    "Likely a bug unless running single-driver."
-                )
-            return 0
-        return int(self._rank_by_greenlet[g])
-```
-
-### D10. `torch.ahbm.set_device(rank)`: SIP binding
-
-The KernBench backend name is `ahbm` (ADR-0023 D10). Real PyTorch uses
-`torch.cuda.set_device(r)`, but since this is not CUDA we use an
-honestly named namespace.
-
-```python
-class _AhbmNamespace:
-    """torch.ahbm — per-greenlet SIP device binding.
-
-    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
-    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
-    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
-    """
-
-    def __init__(self):
-        self._device_by_greenlet: dict = {}
-
-    def set_device(self, device: int) -> None:
-        from greenlet import getcurrent
-        self._device_by_greenlet[getcurrent()] = int(device)
-
-    def current_device(self) -> int | None:
-        from greenlet import getcurrent
-        return self._device_by_greenlet.get(getcurrent())
-
-# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
-# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
-```
-
-**PyTorch 2.x style, supported in parallel**: recent PyTorch is moving toward the
-device-agnostic `torch.accelerator` namespace (`torch.accelerator.set_device_index(r)`,
-`torch.accelerator.current_device_index()`). KernBench supports this surface too, for
-users who want code that is not tied to a device vendor.
-
-```python
-class _AcceleratorNamespace:
-    """torch.accelerator — device-agnostic API (PyTorch 2.x style).
-
-    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
-        torch.accelerator.set_device_index(rank)
-        torch.accelerator.current_device_index()
-    """
-
-    def __init__(self, ahbm: _AhbmNamespace):
-        self._ahbm = ahbm
-
-    def set_device_index(self, device: int) -> None:
-        self._ahbm.set_device(device)
-
-    def current_device_index(self) -> int | None:
-        return self._ahbm.current_device()
-
-# RuntimeContext
-self.ahbm = _AhbmNamespace()
-self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias
-```
-
-Bench authors pick one of the following; both share the same registry internally:
-
-```python
-torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
-torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
-```
-
-### D11. Tensor placement = structural (sip, cube, pe) coordinates
-
-`resolve_dp_policy` takes `target_sip` directly and builds the placement in structural
-coordinates. Details are in ADR-0026.
-
-```python
-# RuntimeContext._create_tensor
-current_sip = self.ahbm.current_device()          # (D10 naming)
-if current_sip is None:
-    current_sip = 0  # single-driver fallback (consistent with D9)
-placement = resolve_dp_policy(
-    dp, shape=shape_2d, itemsize=itemsize,
-    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
-    target_sip=current_sip,
-)
-```
-
-Post-hoc `pe_index` shifting is removed: ShardSpec holds the structural
-`(sip, cube, pe)` coordinates.
-
-### D12. `torch.multiprocessing.spawn`-compat surface
-
-The bench-author surface matches real PyTorch `mp.spawn`:
-
-```python
-# src/kernbench/runtime_api/multiprocessing.py (new)
-def spawn(fn, args=(), nprocs=1, join=True, daemon=False, start_method="spawn"):
-    """Drop-in for torch.multiprocessing.spawn.
-    Internal: greenlet fan-out + epoch-barrier sync + exception propagation.
-    """
-    ...
-
-# Attach to the torch namespace
-torch.multiprocessing = SimpleNamespace(spawn=spawn)
-```
-
-Bench:
-
-```python
-import torch.multiprocessing as mp
-mp.spawn(worker, nprocs=world_size, args=(world_size, torch))
-```
-
-### D13. Scheduler + exception handling
-
-```python
-def spawn(fn, args, nprocs, ...):
-    dist = torch.distributed
-    gs: list[greenlet] = []
-    errors: dict[int, Exception] = {}
-
-    for rank in range(nprocs):
-        def _entry(r=rank):
-            try:
-                fn(r, *args)
-            except Exception as e:
-                errors[r] = e
-                raise
-        g = greenlet(_entry)
-        dist._bind_rank(g, rank)
-        gs.append(g)
-
-    try:
-        while True:
-            alive = [g for g in gs if not g.dead]
-            if not alive:
-                break
-            for g in alive:
-                if not g.dead:
-                    g.switch()
-    except Exception as outer:
-        for other in gs:
-            if not other.dead:
-                try:
-                    other.throw(SystemExit)
-                except Exception:
-                    pass
-        # Explicit cleanup of epoch barrier state
-        backend = getattr(dist, "_backend", None)
-        if backend is not None and hasattr(backend, "_barrier"):
-            backend._barrier.reset()
-        raise SpawnException(errors) from outer
-```
-
-**Scheduler contract**:
-- Deterministic round-robin over insertion order (rank 0, 1, ..., N-1).
-- The epoch barrier (D7) is the only synchronization point. No correctness depends on scheduler order.
-- On an exception, the other greenlets are force-terminated and `SpawnException` propagates.
-
-**Starvation guideline**:
-- Collective barriers normally keep workers in step, so skew stays small.
-- For extreme non-collective loops, a cooperative yield is provided:
-  `torch.distributed.cooperative_yield()`.
-
-### D14. Backward compatibility
-
-1. **Single-driver calls**: `get_rank()` returns 0 (D9).
-2. **`ccl.yaml` world_size override**: bypasses the D1 fallback, so it remains usable
-   as the legacy "rank = PE" test path.
-3. **Explicit `DPPolicy.sip="column_wise"`**: in ADR-0026's scope.
-4. **`install_ipcq()` compatibility wrapper**:
-
-The existing `install_ipcq()` API in `ccl/install.py` is not removed right away.
-It stays as a thin compatibility wrapper so that existing direct callers can
-migrate gradually.
-
-```python
-# src/kernbench/ccl/install.py (after this ADR)
-def install_ipcq(engine, spec, merged, *, algo_module=None, rank_to_pe=None):
-    """DEPRECATED: legacy host-side PE installer.
-
-    Internally delegates to build_install_plans + engine-routed IpcqInitMsg.
-    Use dist.init_process_group() instead.
-    """
-    from kernbench.ccl.install_plan import build_install_plans
-    import warnings
-    warnings.warn(
-        "install_ipcq() is deprecated; use dist.init_process_group()",
-        DeprecationWarning, stacklevel=2,
-    )
-    plans = build_install_plans(
-        world_size=merged.get("world_size", 1),
-        algorithm=merged["algorithm"],
-        algorithm_config=merged,
-        spec=spec,
-    )
-    handles = []
-    for plan in plans:
-        for pe_install in plan.pe_installs:
-            h = engine.submit(IpcqInitMsg(
-                target_sips=(plan.sip,),
-                target_cubes=(pe_install.cube,),
-                target_pe=pe_install.pe,
-                entries=pe_install.neighbors,
-                buffer_kind=plan.buffer_kind,
-                n_slots=plan.n_slots,
-                slot_size=plan.slot_size,
-            ))
-            handles.append(h)
-    for h in handles:
-        engine.wait(h)
-    return {"world_size": merged.get("world_size", 1), "plans": plans}
-```
-
-Migration schedule:
-- Phase 1: keep the wrapper and emit a DeprecationWarning
-- Phase 2: grep-audit direct callers and move each to `dist.init_process_group()` or
-  direct use of `build_install_plans()`
-- Phase 3: remove the wrapper (separate cleanup ADR or PR)
-
----
-
-## Dependencies
-
-- **ADR-0023** (IPCQ): reuses the `IpcqInitMsg` message type and PE_IPCQ handling as
-  is. Switching to engine-routed submit is the only change.
-- **ADR-0025** (IPCQ direction fix): required for the neighbor computation in
-  `_build_pe_installs` to work correctly for cases such as a 2-rank ring.
-- **ADR-0003 / 0016** (IO_CPU): IO_CPU keeps its existing transit role. This ADR
-  does not change the IO_CPU role.
-
----
-
-## Non-goals
-
-- **IPCQ protocol changes**: ADR-0023 stays as is.
-- **DPPolicy field cleanup**: ADR-0026.
-- **Megatron-style TP**: ADR-0027.
-- **Multi-node (cross-process)**: single process only.
-- **Adopting the IO_CPU single SIP control-plane endpoint principle**: out of scope
-  for this ADR. KernBench has no such principle today; introducing it needs a separate ADR.
-- **Hierarchical all-reduce algorithm design**: ADR-0029. This ADR only provides the
-  framework infrastructure that algorithm would use (`all_pes` mapper,
-  `multi_pe_sip_local` validator, registry extension points).
-
----
-
-## Open questions
-
-### 🟡 Nice-to-have: scope-boundary items
-
-- **Install timing tolerance**: install consumes a few ns to a few us of SimPy time,
-  whereas the existing sideband took 0 ns. Check whether existing tests assume a
-  t=0 start (tests may need correcting depending on the audit result).
-
-- **Batching `IpcqInitMsg`**: a `target_pe="all"` broadcast like MmuMapMsg does not
-  suit IPCQ, because each PE has different neighbors. Today each PE gets its own
-  submit. A batched IpcqInitMsg type carrying per-PE payloads is a future optimization.
-
-- **`_rank_to_sip` mapping**: currently identity. A non-trivial mapping would be handled separately.
-
-- **Location of the cooperative yield API**: planned to be exposed as
-  `torch.distributed.cooperative_yield()`. Actual need is judged when benches are added after Phase 2.
-
-(For the mid-to-long-term direction on unifying PE-level topology, see **ADR-0029**:
-a framework evolution driven by complex multi-level algorithms.)
-
----
-
-## Consequences
-
-### Positive
-
-- **Zero new message types**: implemented with the existing `IpcqInitMsg` + `KernelLaunchMsg` only.
-- **No IO_CPU / engine changes**: existing routing is kept.
-- **Sideband install convention removed**: unified on the same pattern as MmuMapMsg and others.
-- **Stale plan state eliminated**: the plan is owned solely by the host.
-- **Bench = real PyTorch DDP** (from the public API's point of view).
-- **Lightweight algorithm ABI**: only `kernel` + `kernel_args` are required.
-- **Epoch-based barrier**: safe for interleaved collectives.
-- **Control/data plane separation**: the data plane (PE_IPCQ) stays per ADR-0023, and
-  the control plane is host-driven.
-- Long-term extensibility: a basis for Megatron TP and DTensor.
-
-### Negative
-
-- New modules: `install_plan.py`, `mappers.py`, `validators.py`,
-  `multiprocessing.py`.
-- Whether the engine can route `IpcqInitMsg` along the engine path must be confirmed
-  during implementation (a minor hook may be needed).
-- Install consumes SimPy time (arguably a positive, but any test that assumes the
-  old sideband 0 ns timing needs correcting).
-
-### Neutral
-
-- IPCQ PE-level protocol (ADR-0023) is unchanged.
-- `DPPolicy` field changes belong to ADR-0026.
-- IO_CPU role is unchanged (existing transit behavior is kept).
diff --git a/docs/adr/ADR-0025-ipcq-direction-addressing.md b/docs/adr/ADR-0025-algo-ipcq-direction-addressing.md
similarity index 100%
rename from docs/adr/ADR-0025-ipcq-direction-addressing.md
rename to docs/adr/ADR-0025-algo-ipcq-direction-addressing.md
diff --git a/docs/adr/ADR-0026-dppolicy-intra-device.md b/docs/adr/ADR-0026-par-dppolicy-intra-device.md
similarity index 96%
rename from docs/adr/ADR-0026-dppolicy-intra-device.md
rename to docs/adr/ADR-0026-par-dppolicy-intra-device.md
index f11beef..d043f59 100644
--- a/docs/adr/ADR-0026-dppolicy-intra-device.md
+++ b/docs/adr/ADR-0026-par-dppolicy-intra-device.md
@@ -23,7 +23,7 @@ class DPPolicy:
     """Intra-device (cube × PE) data-parallel policy.
 
     SIP-level placement is controlled by ``torch.ahbm.set_device(rank)``
-    (ADR-0024 D10) and, for model-level TP, by Megatron-style parallel
+    (ADR-0024 D3) and, for model-level TP, by Megatron-style parallel
     layers (ADR-0027). DPPolicy does not cross SIP boundaries.
     """
     cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
@@ -37,7 +37,7 @@ class DPPolicy:
 ### D2. `ShardSpec`: structural (sip, cube, pe) coordinates, `pe_index` removed entirely
 
 Today `ShardSpec.pe_index` is a **global flat index** (`sip × cubes × pes + cube ×
-pes + pe`). This is the form ADR-0024 D11 called out as "abstraction leakage".
+pes + pe`). This is the form ADR-0024 D4 called out as "abstraction leakage".
 
 This ADR **redefines ShardSpec in structural coordinates** and does **not** keep
 `pe_index`, not even as a property:
@@ -73,7 +73,7 @@ class ShardSpec:
 
 ### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates
 
-Implements the ADR-0024 D11 contract. No post-hoc shifting.
+Implements the ADR-0024 D4 contract. No post-hoc shifting.
 
 ```python
 # src/kernbench/policy/placement/dp.py (after)
@@ -135,14 +135,14 @@ def resolve_dp_policy(
 
 ### D4. `_create_tensor`: direct placement in structural coordinates
 
-Continues ADR-0024 D11. Post-hoc shifting is removed; structural coordinates are
+Continues ADR-0024 D4. Post-hoc shifting is removed; structural coordinates are
 specified directly at the `resolve_dp_policy` call.
 
 ```python
 # context.py _create_tensor (after)
 current_sip = self.ahbm.current_device()
 if current_sip is None:
-    # Single-driver fallback (consistent with ADR-0024 D9).
+    # Single-driver fallback (consistent with ADR-0024 D2).
     # If launcher-based code forgets set_device(), it silently lands on SIP 0,
     # so warn in debug mode.
     if os.environ.get("KERNBENCH_DEBUG"):
@@ -267,7 +267,7 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에
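 - **Clear separation of concepts**: DPPolicy = intra-device, TP = inter-device.
 - **Simpler API**: DPPolicy constructor fields reduced by ~33%.
 - **Consistent structural coordinates**: ShardSpec is expressed as a `(sip, cube, pe)` tuple,
-  which resolves the abstraction leakage (satisfies the ADR-0024 D11 contract).
+  which resolves the abstraction leakage (satisfies the ADR-0024 D4 contract).
 - **Unambiguous `pe_index` meaning**: SIP-local is the only interpretation. State it explicitly if a global flat index is needed.
 - **Consistent launcher model**: ADR-0024's "1 worker per SIP" model is the only
   mechanism that controls the SIP boundary.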
diff --git a/docs/adr/ADR-0027-megatron-tp.md b/docs/adr/ADR-0027-par-megatron-tp.md
similarity index 94%
rename from docs/adr/ADR-0027-megatron-tp.md
rename to docs/adr/ADR-0027-par-megatron-tp.md
index bdacbcf..7b04254 100644
--- a/docs/adr/ADR-0027-megatron-tp.md
+++ b/docs/adr/ADR-0027-par-megatron-tp.md
@@ -2,9 +2,7 @@
 
 ## Status
 
-Accepted (Revision 7: documents the resume invariant, the main-context wait
-non-recursion invariant, the global barrier over-serialization tradeoff, and
-TP forward yield-safety, 2026-04-14)
+Accepted
 
 ## Context
 
@@ -166,9 +164,9 @@ while alive:
    - The implementation does not need to **detect** this (timeouts, steps-since-yield
      counters, etc.). It is a user contract; the symptom of a violation is a "simulation hang".
    - **Future extension**: if long non-collective compute paths become common, the
-     `torch.distributed.cooperative_yield()` primitive from ADR-0024 D13 (an explicit
-     no-op yield) can be introduced. Out of scope for this ADR. Not a breaking
-     change; add it when needed.
+     explicit `torch.distributed.cooperative_yield()` primitive (a no-op yield) can
+     be introduced. Out of scope for this ADR. Not a breaking change; add it when
+     needed.
    - Within a round, every alive worker receives one `switch`. Even if a worker calls
      wait several times in a single round, the handles are enqueued in order within
      that turn and then processed together in one scheduler drain (FIFO).
@@ -183,7 +181,7 @@ while alive:
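    - **The two queues have different dependency sources**: the worker wait queue holds
      handles the worker itself created through `submit + wait` pairs (tensor deploy, MmuMap, etc.).
      The collective queue holds kernel launch handles enqueued internally by `dist.all_reduce`,
-     and the worker never waits on them directly (ADR-0024 D7).
+     and the worker never waits on them directly (see the two-queue drain model in D0.5).
    - **Independent from a correctness standpoint**: from the worker's point of view a
      collective is "already submitted, then yielded". It only has to complete before the
      worker's next action. There is no ordering dependency on the worker wait queue.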
@@ -206,7 +204,7 @@ while alive:
      keeping it as an index, or checking `h not in pending_set` before append) is possible.
      It is classified as an optimization that does not change correctness.
 
-4. **Exception propagation + sibling cleanup (adopts the ADR-0024 D13 approach)**.
+4. **Exception propagation + sibling cleanup**.
    When a worker greenlet raises, `g.switch()` propagates the exception to main.
    The scheduler loop stops immediately and **explicitly** performs the following cleanup:
 
@@ -581,7 +579,7 @@ The TP layer's weight/output representation clearly separates two concepts:
 
 | Concept | Decided by | Scope |
 |---|---|---|
-| **TP shard ownership** (which rank owns which slice of the weight) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D9/D10) | **cross-rank, cross-SIP** |
+| **TP shard ownership** (which rank owns which slice of the weight) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D2/D3) | **cross-rank, cross-SIP** |
 | **Intra-rank placement** (how an owned slice is distributed across cube × PE inside the rank) | `DPPolicy(cube=..., pe=...)` (ADR-0026) | **within one rank (inside the SIP boundary)** |
 
 Therefore, when `ColumnParallelLinear` uses the shape `(in_features, out_features // ws)`
@@ -825,40 +823,11 @@ switching the strict-xfail cases to **PASS** once this ADR is implemented
 
 ## Dependencies
 
-- **ADR-0024** (launcher): rank = SIP, greenlet-local rank, `dist.all_reduce`,
-  `torch.ahbm.set_device(rank)`. D0/D1 of this ADR extend this infrastructure.
+- **ADR-0024** (launcher): rank = SIP, greenlet-local rank,
+  `torch.ahbm.set_device(rank)`.
 - **ADR-0026** (DPPolicy intra-device): representation of per-rank slices of weight tensors.
 - **ADR-0023 / ADR-0025** (IPCQ): foundation of the `dist.all_reduce` implementation.
 
-### Supersedes (partial)
-
-The following sections of ADR-0024 are **designs that were never implemented**; this ADR
-replaces them with a simpler model:
-
-- **ADR-0024 D7 (`_CollectiveBarrier.submit_and_drain`)**: an epoch-based
-  last-arriver-drains pattern. Problem: the last arriver calls `ctx.wait` **from worker
-  context** and drives env.run, which reproduces the orphan cause that D0.2 is meant to
-  prevent. This ADR's **D0.4 two-queue drain** (main drains after every worker has yielded)
-  provides the same invariant, "no rank's collective progresses until every rank has
-  finished submitting", in a **worker-safe** way. The `_CollectiveBarrier` class is
-  not implemented.
-- **ADR-0024 D12/D13 (`spawn_workers` skeleton)**: the signature / scheduler
-  loop / exception handling design. **D1** of this ADR redefines it with a signature that
-  matches the real PyTorch API (`spawn(fn, args, nprocs)`) and performs the D0 scheduler
-  drain in a single place. The exception cleanup of ADR-0024 D13 (siblings
-  `throw(SystemExit)` + `SpawnException` wrapping) is absorbed into this ADR as-is
-  (see D0.4-(4)).
-
-The current implementation never landed any of ADR-0024 D7/D12/D13, so superseding them
-carries no migration cost. Adding a "superseded by
-ADR-0027 D0/D1" note to `docs/adr/ADR-0024` later is enough to keep the two consistent.
-
-**Source of truth (normative, for implementers)**: the implementation reference for worker
-scheduling / collective drain / spawn / exception cleanup is **ADR-0027 D0/D1**. Do not
-consult the pseudocode / contract / signature of ADR-0024 D7/D12/D13 when implementing.
-Whenever the two ADRs reach different conclusions, ADR-0027 takes precedence. Reviewers
-also apply this rule when reviewing PRs.
-
 ---
 
 ## Non-goals
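Note on the ADR-0027 hunks above: the round-based scheduling they describe (each alive worker gets one turn per round, and only the main context drains the worker wait queue and the collective queue) can be illustrated with a minimal sketch. This is not the KernBench implementation. Plain Python generators stand in for greenlets, and `run_rounds`, `worker`, and the handle strings are hypothetical names invented for illustration. The order in which the two queues are drained here is also an assumption; the ADR states they have no ordering dependency.

```python
from collections import deque


def run_rounds(worker_fns):
    """Round-robin scheduler sketch; generators stand in for greenlets."""
    wait_q = deque()        # handles created by explicit submit + wait pairs
    collective_q = deque()  # handles enqueued internally by collectives
    drained = []            # order in which the main context drained handles

    alive = [fn(rank, wait_q, collective_q) for rank, fn in enumerate(worker_fns)]
    while alive:
        still_alive = []
        for g in alive:              # every alive worker gets one turn per round
            try:
                next(g)              # run the worker until its next yield
                still_alive.append(g)
            except StopIteration:
                pass                 # worker finished; drop it from the round
        alive = still_alive
        # Only the main context drains, and only after all workers yielded.
        while wait_q:
            drained.append(("wait", wait_q.popleft()))
        while collective_q:
            drained.append(("collective", collective_q.popleft()))
    return drained


def worker(rank, wait_q, collective_q):
    wait_q.append(f"deploy-r{rank}")            # submit, then yield to main
    yield
    collective_q.append(f"all_reduce-r{rank}")  # collective: enqueue and yield
    yield


log = run_rounds([worker, worker])
```

In this sketch no collective handle is drained until both ranks have submitted theirs, which is the invariant the two-queue drain is meant to provide.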
diff --git a/docs/adr/ADR-0032-intercube-allreduce.md b/docs/adr/ADR-0032-algo-intercube-allreduce.md
similarity index 99%
rename from docs/adr/ADR-0032-intercube-allreduce.md
rename to docs/adr/ADR-0032-algo-intercube-allreduce.md
index c3df130..bb6ba3c 100644
--- a/docs/adr/ADR-0032-intercube-allreduce.md
+++ b/docs/adr/ADR-0032-algo-intercube-allreduce.md
@@ -146,7 +146,7 @@ At each `dist.all_reduce(tensor)` call:
 3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
    `sip_rank` is the current greenlet's bound rank.
 4. Launches with `_defer_wait=True`; the main scheduler drains pending
-   handles after all workers submit (per ADR-0024 D7 / ADR-0027 D0.4).
+   handles after all workers submit (per ADR-0027 D0.4).
 
 ### D6. Config schema
 
diff --git a/docs/adr/ADR-0033-latency-model-assumptions.md b/docs/adr/ADR-0033-lat-latency-model-assumptions.md
similarity index 83%
rename from docs/adr/ADR-0033-latency-model-assumptions.md
rename to docs/adr/ADR-0033-lat-latency-model-assumptions.md
index 5d4024f..13ca1f9 100644
--- a/docs/adr/ADR-0033-latency-model-assumptions.md
+++ b/docs/adr/ADR-0033-lat-latency-model-assumptions.md
@@ -10,7 +10,7 @@ The simulator is an analytical, event-driven performance model — not a
 cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
 or omitted by design. To keep the model auditable and reviewable as a whole,
 this ADR consolidates the assumptions in one place. Individual component ADRs
-(ADR-0015, ADR-0019, ADR-0004) define the *mechanisms*; this document defines
+(ADR-0015, ADR-0017, ADR-0004) define the *mechanisms*; this document defines
 the *limits of fidelity*.
 
 ## Decisions
@@ -21,7 +21,7 @@ the *limits of fidelity*.
   ADR-0015 D2.
 - **Per-component switching/overhead latency** (`overhead_ns` attr).
 - **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array
-  with global round-robin chunking. Burst granularity tunable
+  with address-based PC selection (ADR-0034 D3). Burst granularity tunable
   (`burst_bytes`, default 256B). Read and write share each PC's
   `available_at` (real HW command bus is per-PC shared).
 - **HBM direction switching penalty mechanism**: per-PC last-direction
@@ -66,8 +66,8 @@ the *limits of fidelity*.
 ### D3. Ignored (out of scope)
 
 - Bank-level row buffer conflict penalty (assume no conflicts — best case;
-  round-robin chunk assignment is address-blind so we cannot detect same-bank
-  reuse).
+  the model has no per-bank state within a PC, so same-bank reuse cannot be
+  detected).
 - HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
   `burst_time = burst_bytes / pc_bw_gbs`).
 - Refresh, ECC, thermal throttling, power gating.
@@ -110,29 +110,6 @@ below are different concerns, ordered by expected workload impact.
 
 **Higher impact (workload accuracy gap)**:
 
-- [ ] **Address-based PC selection at HBM CTRL** (replace the
-  address-blind global round-robin). Compute the PC index from
-  the HBM byte offset using parameters already in topology config:
-
-      pc_shift = log2(burst_bytes)        # default 8 (burst=256B)
-      pc_mask  = num_pcs - 1              # default 7 (8 PCs)
-      pc       = (hbm_offset >> pc_shift) & pc_mask
-
-  For the default `burst_bytes=256, num_pcs=8` this places the PC
-  select field at HBM byte-offset bits **[10:8]**: bits [7:0] are
-  the within-burst offset (same PC), bits [10:8] are the 3-bit PC
-  index, and bits [36:11] are row/bank/column within the PC slice.
-  Shift/mask are derived from topology config rather than hardcoded
-  so alternative `(burst_bytes, num_pcs)` pairs stay consistent.
-  See `src/kernbench/policy/address/phyaddr.py` for the canonical
-  comment.
-
-  Real-HW workloads where this matters most: (a) strided multi-
-  transaction streams that under global-RR collide on the same PCs
-  but under address-striping land on disjoint sets; (b) offset-
-  disjoint parallel transfers where address-striping preserves
-  parallelism while global-RR re-serializes them. Directly affects
-  multi-PE concurrent HBM workload latencies.
 - [ ] **Bank-level conflict modeling** within a PC (opt-in via
   `track_banks: true`). Currently we assume no same-bank reuse;
   random scatter/gather workloads are optimistic here.
@@ -169,7 +146,7 @@ below are different concerns, ordered by expected workload impact.
   touching latency must update the relevant section here.
 - Workload-specific magnitude error envelopes are explicit.
 - Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
-  enforces the ADR-0019 D9 invariant in code rather than relying on yaml
+  enforces the ADR-0017 D8 invariant in code rather than relying on yaml
   manual consistency.
 - Wire transfer time is charged once per bottleneck-link transit (Phase 2c
   per-flit timing) rather than via terminal `drain_ns` injection. Single
@@ -180,5 +157,6 @@ below are different concerns, ordered by expected workload impact.
 ## Cross-references
 
 - ADR-0015 — component / port / wire model.
-- ADR-0019 — NoC and local HBM topology.
+- ADR-0017 — Cube NOC architecture and HBM connectivity.
 - ADR-0004 — memory semantics, local HBM.
+- ADR-0034 — HBM controller internal design.
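Note on the ADR-0033 hunks above: the builder-side derivation `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs` and the steady-state `burst_time = burst_bytes / pc_bw_gbs` can be checked with a short arithmetic sketch. The function names are hypothetical, and the 256 GB/s link bandwidth is an assumed value chosen only so that the derived per-PC bandwidth matches the 32.0 default listed in ADR-0034 D9. The unit reasoning assumes 1 GB/s = 1 byte/ns.

```python
def derive_pc_bw_gbs(hbm_to_router_bw_gbs: float, num_pcs: int) -> float:
    """Split the router-to-HBM link bandwidth evenly across pseudo-channels."""
    return hbm_to_router_bw_gbs / num_pcs


def burst_time_ns(burst_bytes: int, pc_bw_gbs: float) -> float:
    """Time to move one burst on one PC; bytes / (GB/s) yields nanoseconds."""
    return burst_bytes / pc_bw_gbs


link_bw_gbs = 256.0   # assumed link bandwidth, not taken from the source
num_pcs = 8           # default from the ADR
burst_bytes = 256     # default from the ADR

pc_bw_gbs = derive_pc_bw_gbs(link_bw_gbs, num_pcs)
burst_ns = burst_time_ns(burst_bytes, pc_bw_gbs)

# The invariant: aggregate per-PE bandwidth equals the link bandwidth.
aggregate_bw_gbs = pc_bw_gbs * num_pcs
```

Deriving the per-PC figure in code means a yaml edit to the link bandwidth cannot silently break the aggregate.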
diff --git a/docs/adr/ADR-0034-dev-hbm-controller-internal-design.md b/docs/adr/ADR-0034-dev-hbm-controller-internal-design.md
new file mode 100644
index 0000000..b7d3e8f
--- /dev/null
+++ b/docs/adr/ADR-0034-dev-hbm-controller-internal-design.md
@@ -0,0 +1,271 @@
+# ADR-0034: HBM Controller Internal Design
+
+## Status
+
+Accepted
+
+## Context
+
+`HbmCtrlComponent` is the per-PE HBM partition endpoint at the leaf of
+the cube NOC. One instance is created per PE under the topology node
+`sip{S}.cube{C}.hbm_ctrl.pe{idx}` and attaches to that PE's router
+(ADR-0017 D4). The component models per-pseudo-channel (PC) scheduling,
+burst-granular commit timing, address-based PC selection, and response
+routing back to the requester.
+
+This ADR documents the component as currently implemented. ADR-0017 D4/D8
+defines *where* HBM CTRL attaches and *what* aggregate BW it must
+deliver. ADR-0033 D1/D2 defines *what fidelity* of HBM modelling is in
+scope. This ADR fills the gap between those two — the per-instance
+internal scheduling model.
+
+## Decision
+
+### D1. Role
+
+`HbmCtrlComponent` is a per-PE HBM partition endpoint. One instance per
+PE (default 8 per cube, set by `cube.memory_map.hbm_slices_per_cube`)
+attaches to that PE's router via the `peX.hbm` attachment list in
+`cube_mesh.yaml` (ADR-0017 D4). In the default n:1 channel mapping
+(ADR-0017 D8) the instance aggregates `channels_per_pe` pseudo-channels
+into one endpoint.
+
+The component models:
+
+- Per-PC scheduling (D2) with R/W command-bus sharing.
+- Address-based PC selection (D3).
+- Burst-granular commit timing (D4).
+- Flit-aware per-flit PC commit and async finalize (D5, D6).
+- Command-only Transaction handling for read-data drain (D7).
+- Response routing back to the requester (D8).
+
+It does not model:
+
+- Bank-level row-buffer conflicts, refresh, ECC, thermal throttling
+  (ADR-0033 D3).
+- Cross-PE HBM contention beyond its own router edge (handled by the
+  router mesh — ADR-0017 D3).
+- 1:1 channel mode (ADR-0017 D8 future work).
+
+### D2. Per-PC scheduling model
+
+Per-instance state initialised in `start()`:
+
+- `_pc_avail: list[float]` — earliest sim-time each PC is free; length
+  `num_pcs`, initial 0.0.
+- `_pc_last_dir: list["R"|"W"|None]` — direction of the last commit on
+  each PC, used for switch-penalty detection (D4); initial `None`.
+
+`num_pcs` and `burst_bytes` must each be a positive power of two so
+that address-based PC selection (D3) reduces to a shift-and-mask.
+
+Read and write requests share the same `_pc_avail` slot per PC. The
+real HW per-PC command bus is shared between read and write traffic, so
+a write issued to PC k delays a subsequent read to PC k by one burst
+time, plus `switch_penalty_ns` (D4) when that is non-zero.
+
+Direction `dir` for a request is inferred from the request type:
+
+- `MemoryWriteMsg` → `"W"`.
+- `PeDmaMsg` with `is_write=True` → `"W"`.
+- All others (`MemoryReadMsg`, `PeDmaMsg` read) → `"R"`.
+
+### D3. Address-based PC selection
+
+PC index for an access is derived from the access address by shift and
+mask:
+
+```text
+pc_shift = log2(burst_bytes)         # default 8  (burst=256B)
+pc_mask  = num_pcs - 1               # default 7  (8 PCs)
+pc       = (address >> pc_shift) & pc_mask
+```
+
+Computed once in `start()` from topology config so alternative
+`(burst_bytes, num_pcs)` pairs stay consistent. For the canonical
+default `(256, 8)` this places the PC select field at bits `[10:8]` of
+the HBM byte offset: bits `[7:0]` are within-burst (same PC), bits
+`[10:8]` are the 3-bit PC index, bits `[36:11]` are row/bank/column
+within the PC slice (see `phyaddr.py` comment).
+
+Address-based striping — as opposed to address-blind global
+round-robin — preserves PC parallelism for offset-disjoint concurrent
+transfers: each transfer's bursts land deterministically on the PC set
+implied by its byte addresses, so multi-PE workloads accessing disjoint
+regions do not collide on a single PC.
+
+### D4. Burst granularity and PC commit timing
+
+A single PC commit takes:
+
+```text
+chunk_time = burst_bytes / pc_bw_gbs    # ns
+```
+
+- `burst_bytes` (default 256) is the burst granularity matching the
+  flit size (ADR-0033 D1).
+- `pc_bw_gbs` is **builder-derived** from
+  `hbm_to_router_bw_gbs / num_pcs` (`topology/builder.py`), enforcing
+  the ADR-0017 D8 invariant that aggregate per-PE BW equals the
+  router-to-HBM link BW.
+
+Per-PC commit scheduling for an arriving access on PC `pc` with
+direction `dir`:
+
+```text
+switch_cost = switch_penalty_ns
+              if pc_last_dir[pc] not in (None, dir) else 0
+start  = max(env.now, pc_avail[pc]) + switch_cost
+finish = start + chunk_time
+pc_avail[pc]    = finish
+pc_last_dir[pc] = dir
+```
+
+Default `switch_penalty_ns = 0` — Tier 0 assumption that an ideal HBM
+scheduler amortises R/W switching cost (ADR-0033 D2). Non-zero values
+model pessimistic per-alternation cost.
+
+### D5. Flit-aware per-flit PC commit (primary path)
+
+`_handle_flit` is the primary worker path. For each arriving `Flit`:
+
+1. On the **first** flit of a transaction (`tid = id(txn)` not in
+   `_txn_state`):
+   - Apply `overhead_ns` once via `run(env, nbytes)` — header decode
+     model, first-flit overhead pattern (ADR-0033 D1).
+   - Initialise `_txn_state[tid] = {"last_finish": env.now}`.
+2. Compute `pc = _pc_for_address(flit.address)` (D3).
+3. Apply the per-PC schedule (D4) using the request direction (D2).
+4. Update `state["last_finish"] = max(state["last_finish"], finish)`.
+5. If `flit.is_last`: pop `_txn_state[tid]` and spawn `_finalize_txn`
+   (D6).
+
+Per-flit address-aware commit is the mechanism that lets concurrent
+multi-PE traffic to disjoint HBM offsets pipeline through distinct PCs
+in parallel.
+
+### D6. Async finalize per transaction
+
+When a transaction's last flit has been scheduled, finalisation runs in
+a separately-spawned process:
+
+```python
+def _finalize_txn(env, txn, last_finish):
+    wait = last_finish - env.now
+    if wait > 0:
+        yield env.timeout(wait)
+    yield from _send_response(env, txn)
+```
+
+`_handle_flit` spawns this via `env.process(...)` and returns
+immediately, so the worker can pick up the next inbox message while the
+last PC commit drains.
+
+Without this split — i.e. if the worker itself did
+`yield env.timeout(wait)` — concurrent single-flit transactions whose
+addresses hit distinct PCs would still serialise at `chunk_time` each
+inside the worker, hiding the PC parallelism that D3 and D5 are
+designed to expose.
+
+### D7. Non-flit fallback for command-only transactions
+
+`_handle_txn` runs when the inbox delivers a `Transaction` rather than a
+`Flit`. This is the path for command-only requests that the wire does
+not chunk into flits — most notably `MemoryReadMsg` whose command txn
+carries `nbytes=0` (data drain is modelled at HBM CTRL post-processing,
+not as inbound flits).
+
+Procedure:
+
+1. `work_bytes = txn.nbytes if txn.nbytes > 0 else int(request.nbytes or 0)`
+   — for read commands, work is sized by the request.
+2. `n_chunks = ceil(work_bytes / burst_bytes)` if `work_bytes > 0` else
+   0.
+3. `chunk_interval = drain_ns / n_chunks` (when both > 0) — chunks are
+   scheduled over time at `drain/n_chunks` ns intervals to model the
+   bottleneck-link's data arrival rate (ADR-0033 D1 chunk-loop drain).
+4. Apply `run(env, txn.nbytes)` once for `overhead_ns`.
+5. For each chunk `i`, advance `chunk_interval` ns then apply the D4
+   schedule with `pc = _pc_for_address(base_address + i * burst_bytes)`.
+6. After scheduling all chunks, wait `last_finish - env.now` then call
+   `_send_response`.
+
+`_handle_txn` shares the same `_pc_avail` / `_pc_last_dir` state with
+`_handle_flit` — there is exactly one source of PC scheduling truth
+across both paths.
+
+### D8. Response routing
+
+`_send_response` dispatches on request type and path geometry:
+
+| Case | Trigger | Response |
+| --- | --- | --- |
+| PE_DMA | `isinstance(txn.request, PeDmaMsg)` | New reverse-path Transaction (`is_response=True`, `nbytes=0`), same `done` |
+| Bypass: Memory Read | no hop in `txn.path` contains `"m_cpu"` AND `MemoryReadMsg` | Reverse-path Transaction with `nbytes=request.nbytes` (data return) |
+| Bypass: Memory Write | no hop in `txn.path` contains `"m_cpu"` AND not Memory Read | `txn.done.succeed()` (write completes locally) |
+| Default | otherwise | New `ResponseMsg(correlation_id, request_id, src_cube, src_pe, success=True)` on reverse path |
+
+The "bypass" classification matches the Memory R/W fabric path defined
+in ADR-0015 D4 (PCIE_EP → io_noc → ucie → cube router → hbm_ctrl,
+without M_CPU). The PE_DMA case is its own dedicated reverse-path to
+keep the inner-loop DMA fast (PE_DMA reads/writes do not synthesise a
+ResponseMsg envelope).
+
+In all reverse-path cases, the response Transaction is put onto
+`out_ports[reverse_path[1]]` — the first hop back along the recorded
+forward path. If `reverse_path` has fewer than 2 entries (degenerate
+path), the original `txn.done` is signalled directly.
+
+### D9. Configurable attributes
+
+| Attribute | Default | Source | Notes |
+| --- | --- | --- | --- |
+| `num_pcs` | 8 | topology cube `hbm_ctrl.attrs` | Must be power of 2 |
+| `pc_bw_gbs` | 32.0 | builder-derived: `hbm_to_router_bw_gbs / num_pcs` | Enforces ADR-0017 D8 invariant |
+| `burst_bytes` | 256 | topology attrs | Must be power of 2; equals `flit_bytes` (ADR-0033 D1) |
+| `switch_penalty_ns` | 0.0 | topology attrs | Tier 0 default; non-zero models pessimistic R/W switching |
+| `efficiency` | 1.0 | topology attrs | Applied at builder time to `hbm_to_router_bw_gbs` (router-edge BW scaling only) |
+| `overhead_ns` | 0.0 | topology attrs | First-flit decode overhead (D5) |
+
+`pc_bw_gbs` is derived by `topology/builder.py` rather than configured
+directly so the aggregate per-PE BW matches the router-to-HBM link BW
+without yaml-side duplication.
+
+## Consequences
+
+### Positive
+
+- Address-based PC selection preserves multi-stream HBM parallelism
+  that an address-blind round-robin would collapse — important for
+  multi-PE workloads with disjoint HBM regions.
+- Flit-aware path (D5) + async finalize (D6) preserves wormhole
+  pipelining and exposes PC parallelism for back-to-back single-flit
+  transactions.
+- Single source of PC scheduling truth (D4 mechanism, used by both D5
+  flit path and D7 chunk-loop path).
+- Builder-derived `pc_bw_gbs` enforces ADR-0017 D8 in code, not yaml
+  discipline.
+
+### Negative
+
+- No bank-level conflict modelling within a PC; the model keeps no
+  per-bank state, so bank/row-buffer reuse goes undetected (ADR-0033 D3).
+- No HBM scheduler (FR-FCFS / write-buffer / watermark drain); fixed
+  FIFO per PC. Bursty mixed R/W is approximated by `switch_penalty_ns`
+  (ADR-0033 D2).
+- `_txn_state` is a regular dict keyed by `id(txn)`; in-flight state
+  accumulates per concurrent transaction and is removed only on
+  `is_last`. Adequate for current workloads.
+
+## Links
+
+- ADR-0001 (Physical address layout — PC bit field comment)
+- ADR-0015 D4 (Memory R/W fabric path — bypass response case)
+- ADR-0017 D4 (Per-PE HBM partitioning — attachment to PE routers)
+- ADR-0017 D8 (HBM channel mapping mode — n:1 aggregate this ADR
+  implements)
+- ADR-0017 D9 (AddressResolver — `hbm_ctrl.pe{pe_id}` endpoint
+  resolution)
+- ADR-0033 D1 (Modelled precisely — per-PC parallelism, switch penalty,
+  flit-aware PC commit, first-flit overhead, chunk-loop drain)
+- ADR-0033 D2 (Switch-penalty default 0 — ideal scheduler amortisation)
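Note on ADR-0034 D2 to D4 above: the shift-and-mask PC selection and the per-PC commit schedule can be combined into one small model. This is a sketch under stated assumptions, not the `HbmCtrlComponent` code. The class name `PcScheduler` and its method names are hypothetical, and the current simulation time is passed in as `now` rather than read from a SimPy environment.

```python
class PcScheduler:
    """Toy model of address-based PC selection (D3) and commit timing (D4)."""

    def __init__(self, num_pcs=8, burst_bytes=256, pc_bw_gbs=32.0,
                 switch_penalty_ns=0.0):
        for name, value in (("num_pcs", num_pcs), ("burst_bytes", burst_bytes)):
            if value <= 0 or value & (value - 1):
                raise ValueError(f"{name} must be a positive power of two")
        self.pc_shift = burst_bytes.bit_length() - 1  # log2(burst_bytes)
        self.pc_mask = num_pcs - 1
        self.chunk_time = burst_bytes / pc_bw_gbs     # ns per burst
        self.switch_penalty_ns = switch_penalty_ns
        self.pc_avail = [0.0] * num_pcs               # earliest free time per PC
        self.pc_last_dir = [None] * num_pcs           # "R", "W", or None

    def pc_for_address(self, address):
        return (address >> self.pc_shift) & self.pc_mask

    def commit(self, now, address, direction):
        """Schedule one burst; return (pc, finish_time_ns)."""
        pc = self.pc_for_address(address)
        last = self.pc_last_dir[pc]
        switch_cost = self.switch_penalty_ns if last not in (None, direction) else 0.0
        start = max(now, self.pc_avail[pc]) + switch_cost
        finish = start + self.chunk_time
        self.pc_avail[pc] = finish
        self.pc_last_dir[pc] = direction
        return pc, finish


sched = PcScheduler()
first = sched.commit(0.0, 0x000, "W")     # PC 0
disjoint = sched.commit(0.0, 0x100, "W")  # PC 1, runs in parallel with PC 0
same_pc = sched.commit(0.0, 0x800, "R")   # wraps back to PC 0, queues behind it

penalised = PcScheduler(switch_penalty_ns=2.0)
penalised.commit(0.0, 0x000, "W")
after_switch = penalised.commit(0.0, 0x800, "R")  # same PC, direction flips
```

With the defaults, accesses 256 bytes apart land on neighbouring PCs and finish together, while accesses 2048 bytes apart share a PC and serialise.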
diff --git a/docs/adr/ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md b/docs/adr/ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md
new file mode 100644
index 0000000..8d22a7b
--- /dev/null
+++ b/docs/adr/ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md
@@ -0,0 +1,286 @@
+# ADR-0035: M_CPU and M_CPU.DMA Component Model
+
+## Status
+
+Accepted
+
+## Context
+
+M_CPU is the cube-level command processor. It receives commands from
+IO_CPU (or from PCIE_EP when the engine routes Memory R/W through
+M_CPU as a fallback), fans them out to the PEs in its cube, and
+aggregates per-PE responses into a single ResponseMsg sent back to
+IO_CPU on the reverse path.
+
+M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W
+fan-out. Per ADR-0015 D5 it is **not** a separate topology node —
+it lives as internal state of `MCpuComponent`.
+
+This ADR documents the M_CPU component implementation that realizes
+those responsibilities, including the three distinct fan-out paths
+(Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource
+model, and the response aggregation contract.
+
+## Decision
+
+### D1. Role
+
+M_CPU has three responsibilities:
+
+1. **Transit forwarding** — when not the terminal hop (e.g., on the
+   reverse response path PE → M_CPU → IO_CPU), forwards Transactions
+   to `next_hop` in their pre-computed path.
+2. **Multi-PE fan-out at terminal hop** — dispatches to one of three
+   fan-out paths based on request type (D2).
+3. **Response aggregation** — collects per-PE responses, sends a
+   single aggregate ResponseMsg back to IO_CPU on the reverse path.
+
+Per invocation (`run()`): applies `overhead_ns` once per incoming
+Transaction.
+
+M_CPU does **not**:
+
+- Decide routing — paths are pre-computed by the router (ADR-0002).
+- Handle PE-internal execution — PE_CPU / PE_SCHEDULER / engines
+  (ADR-0014).
+- Decode addresses — `ctx.resolver.resolve(pa)` returns the per-PE
+  `hbm_ctrl.pe{X}` directly (ADR-0017 D9).
+- Interpret tensor or kernel semantics — fan-out dispatch by Python
+  isinstance check only.
+
+### D2. Three fan-out paths dispatched by request type
+
+At the terminal hop the worker dispatches by request type:
+
+```python
+elif self.ctx is not None and txn.request is not None:
+    if isinstance(txn.request, KernelLaunchMsg):
+        env.process(self._kernel_launch_fanout(env, txn))
+    elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
+        env.process(self._mmu_msg_fanout(env, txn))
+    else:
+        env.process(self._dma_fanout(env, txn))
+```
+
+Each path uses a different router method:
+
+- `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()` — the
+  M_CPU-specific DMA path that avoids PE pipeline nodes.
+- `_kernel_launch_fanout` uses `ctx.router.find_node_path()` — the
+  generic NOC command path to PE_CPU.
+- `_mmu_msg_fanout` uses `ctx.router.find_node_path()` — NOC command
+  path to PE_MMU.
+
+### D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)
+
+`MCpuComponent.start()` initializes two SimPy resources:
+
+```python
+self._dma_write = simpy.Resource(env, capacity=1)  # MemoryWriteMsg
+self._dma_read  = simpy.Resource(env, capacity=1)  # MemoryReadMsg
+```
+
+Properties:
+
+- **Not a topology node** — managed entirely inside `MCpuComponent`;
+  does not appear in `topology.yaml` or in the compiled graph.
+- **Independent read and write channels** — concurrent in-flight
+  Memory R/W is allowed.
+- **Capacity=1 per channel** serializes the **dispatch step**
+  (`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory
+  R/W requests at this M_CPU. Actual fabric transfer time is modeled
+  by wire processes between components (ADR-0015 D2) and by
+  `drain_ns` at terminal hops; the DMA resource does not gate
+  transfer duration.
+
+Resource selection is request-type-based:
+
+```python
+dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
+```
+
+### D4. Transit forwarding at non-terminal hops
+
+When `txn.next_hop` is not None — typical for the reverse response
+path (PE → M_CPU → IO_CPU) — the worker forwards normally:
+
+```python
+if next_hop:
+    yield self.out_ports[next_hop].put(txn.advance())
+```
+
+The fan-out branches fire only at the terminal hop. The same component
+therefore serves both forward command dispatch and reverse response
+relay roles.
+
+### D5. DMA fan-out (`_dma_fanout` — Memory R/W)
+
+For each Memory R/W request at terminal hop:
+
+1. `_resolve_dma_destinations(request)` returns a per-PE
+   `hbm_ctrl.pe{X}` derived from the request's PA via
+   `ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9).
+2. For each destination:
+   - Acquire the appropriate DMA resource (`_dma_write` or
+     `_dma_read`) via `with dma_res.request() as req`.
+   - Resolve path via `ctx.router.find_mcpu_dma_path()`.
+   - Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`.
+   - Create sub-Transaction carrying `drain_ns` and dispatch to
+     `path[1]`.
+3. Track `max_drain_ns` across destinations and record it as
+   `txn.result_data["xfer_ns"]` after all responses arrive.
+4. After all per-PE responses are collected (D8), send an aggregate
+   ResponseMsg on the reverse command path back to IO_CPU.
+
+PA decode fallback (`f"{cube_prefix}.hbm_ctrl"`) is legacy dead code —
+no such node exists after ADR-0017 D4's per-PE partitioning. Kept
+defensively but does not route to a real destination.
+
+### D6. Kernel launch fan-out (`_kernel_launch_fanout`)
+
+For `KernelLaunchMsg` at terminal hop:
+
+1. `_resolve_pe_ids(target_pe)` → list of PE ids in this cube.
+2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via
+   `ctx.router.find_node_path()`.
+3. **`target_start_ns` handling** (ADR-0009 D5):
+   - If the request already carries `target_start_ns` (stamped by
+     IO_CPU per ADR-0036 D3): **pass through unchanged**.
+   - If absent (direct-to-M_CPU launch in unit tests): compute a
+     per-cube barrier `env.now + max(per-PE leg latency)` and stamp
+     via `dataclasses.replace`.
+4. Dispatch sub-Transactions with `nbytes=0` (kernel launch is a
+   control message; preserving nbytes=0 keeps fan-out off the shared
+   first-hop fabric BW, mirroring ADR-0036 D4).
+5. After all per-PE responses arrive (D8), aggregate per-PE metrics
+   from each sub-Transaction's `result_data` into the parent
+   transaction:
+
+   ```python
+   txn.result_data["pe_exec_ns"]  = max(existing, max(pe_exec_values))
+   txn.result_data["dma_ns"]      = max(existing, max(dma_values))
+   txn.result_data["compute_ns"]  = max(existing, max(compute_values))
+   ```
+
+   The max-merge with the existing value matters because cross-cube
+   IO_CPU fan-out shares the same parent `result_data`; merging
+   prevents one cube from clobbering another's metric.
+6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
+
+### D7. MMU map/unmap fan-out (`_mmu_msg_fanout`)
+
+For `MmuMapMsg` / `MmuUnmapMsg` at terminal hop:
+
+1. `_resolve_pe_ids(target_pe)` → PE ids.
+2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via
+   `find_node_path()`.
+3. Dispatch sub-Transactions with `nbytes=0`.
+4. PE_MMU is a terminal node — it does **not** send a ResponseMsg
+   back. Instead, the sub-Transaction's own `sub_done` event is the
+   completion signal.
+5. Wait for all `sub_done` events in-line (does **not** use
+   `_pending` counter — D8 is for response-bearing fan-out only).
+6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
+
+### D8. Response aggregation (`_pending` + `_parent_txns`)
+
+For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg
+arriving on the reverse path):
+
+```python
+self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
+self._parent_txns: dict[str, Any] = {}
+```
+
+- On dispatch: register `(expected, received=0, all_done)` and
+  remember the parent transaction.
+- `_worker` recognises responses by `is_response=True` and routes
+  them to `_collect_response`, which increments `received` and
+  signals `all_done` when `received >= expected`.
+- After `yield all_done`, the fan-out path constructs the aggregate
+  ResponseMsg:
+
+  ```python
+  resp_msg = ResponseMsg(
+      correlation_id=request.correlation_id,
+      request_id=request.request_id,
+      src_cube=cube_id,
+      src_pe=-1,             # -1 = M_CPU aggregate, not a single PE
+      success=True,          # no failure semantics implemented
+  )
+  ```
+
+- The response Transaction travels on `list(reversed(txn.path))`
+  back to IO_CPU.
+
+MMU fan-out (D7) uses a simpler in-line list of `sub_done` events
+because PE_MMU is terminal — there is no ResponseMsg path to
+intercept.
+
+### D9. Helpers and configurable attribute
+
+`_resolve_pe_ids(target_pe)`:
+
+- `int` → `[target_pe]`
+- `tuple[int, ...]` → `list(target_pe)`
+- `"all"` → `range(n_slices)` where `n_slices` comes from cube
+  `memory_map.hbm_slices_per_cube` (default 8).
+
+Used by kernel-launch and MMU fan-out paths.
+
+Single configurable attribute drives per-instance latency:
+
+| Site | impl name | overhead_ns |
+| --- | --- | --- |
+| Cube `m_cpu` | `builtin.m_cpu` | 5.0 |
+
+Applied once in `run()` per Transaction — models command
+interpretation and dispatch-decision time at M_CPU.
+
+## Consequences
+
+### Positive
+
+- Three fan-out paths are clearly separated by request type — adding
+  a new request kind is an isinstance branch + one fan-out method.
+- M_CPU.DMA channels are independent (read and write run concurrently)
+  and serialize only the dispatch step at capacity=1.
+- Transit-vs-terminal behavior is a single `if next_hop` check, so
+  the same component handles forward dispatch and reverse response
+  relay without role duplication.
+- `target_start_ns` passthrough (D6) preserves the cross-cube barrier
+  established by IO_CPU (ADR-0036 D3), while the fallback computation
+  keeps direct-to-M_CPU unit tests working.
+- Per-PE metric `max`-merge against existing parent `result_data`
+  values is robust to cross-cube IO_CPU fan-out sharing the same
+  parent.
+
+### Negative
+
+- No partial-failure semantics — a missing per-PE response stalls the
+  parent `all_done` indefinitely. Acceptable for simulation; not
+  suitable as a production-style endpoint.
+- `_resolve_dma_destinations`'s cube-wide hbm_ctrl fallback is dead
+  code (no such node exists post-ADR-0017 D4). Kept defensively;
+  invites confusion and merits a follow-up cleanup.
+- DMA resource serialization applies only at dispatch (the `put` call
+  is instantaneous in unbounded stores). The capacity=1 channel
+  models "one dispatch at a time per channel at this M_CPU", not
+  "transfer duration serialization". Readers must consult wire
+  processes (ADR-0015 D2) and `drain_ns` for actual transfer
+  parallelism.
+
+## Links
+
+- ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
+- ADR-0009 D5 (`target_start_ns` — passed through unchanged when
+  present; computed as per-cube barrier when absent)
+- ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out
+  point)
+- ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same
+  contract at cube level)
+- ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a
+  topology node)
+- ADR-0017 D9 (AddressResolver returns per-PE `hbm_ctrl.pe{X}`)
+- ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes
+  through unchanged; nbytes=0 invariant preserved through fan-out)
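Note on ADR-0035 D6 and D9 above: the `target_pe` expansion, the `target_start_ns` passthrough-or-compute rule, and the max-merge of per-PE metrics are small enough to sketch as free functions. These are illustrative stand-ins, not the `MCpuComponent` methods. The function names, the `ValueError` on unknown strings, and the sample numbers are assumptions.

```python
def resolve_pe_ids(target_pe, n_slices=8):
    """Expand a target_pe selector (int, tuple, or "all") into PE ids."""
    if isinstance(target_pe, str):
        if target_pe != "all":
            raise ValueError(f"unknown target_pe selector: {target_pe!r}")
        return list(range(n_slices))
    if isinstance(target_pe, int):
        return [target_pe]
    return list(target_pe)


def stamp_target_start_ns(now_ns, leg_latencies_ns, existing=None):
    """Pass through an upstream stamp; otherwise compute a per-cube barrier."""
    if existing is not None:
        return existing
    return now_ns + max(leg_latencies_ns)


def merge_max(result_data, key, values):
    """Max-merge per-PE values into the parent without clobbering earlier cubes."""
    if values:
        result_data[key] = max(result_data.get(key, 0.0), max(values))


parent = {}
merge_max(parent, "pe_exec_ns", [120.0, 95.0])  # responses from one cube
merge_max(parent, "pe_exec_ns", [80.0, 60.0])   # a later cube must not lower it
```

Merging against the existing value is what keeps a slower cube's metric from being overwritten when several cubes share one parent `result_data`.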
diff --git a/docs/adr/ADR-0036-dev-io-cpu-component-model.md b/docs/adr/ADR-0036-dev-io-cpu-component-model.md
new file mode 100644
index 0000000..b79e9ad
--- /dev/null
+++ b/docs/adr/ADR-0036-dev-io-cpu-component-model.md
@@ -0,0 +1,216 @@
+# ADR-0036: IO_CPU Component Model
+
+## Status
+
+Accepted
+
+## Context
+
+IO_CPU is the IO chiplet's host-facing endpoint inside the simulation
+graph. PCIE_EP receives host messages from the runtime API and routes
+them via the io_noc; for command-bearing requests (KernelLaunch,
+MmuMap/Unmap) the io_noc forwards to IO_CPU, which:
+
+- Fans out the request to per-cube M_CPUs.
+- Aggregates per-cube responses into a single host-visible completion.
+- For kernel launches, stamps a global `target_start_ns` barrier so
+  every PE across every targeted cube begins kernel body execution at
+  the same simulated time (ADR-0009 D5).
+
+Memory R/W traffic bypasses IO_CPU per ADR-0015 D4 / ADR-0016 D3;
+this component therefore handles only command-plane traffic in normal
+operation.
+
+This ADR documents the IO_CPU component implementation that realizes
+those responsibilities.
+
+## Decision
+
+### D1. Role
+
+IO_CPU is the host-facing endpoint of the IO chiplet. It has two
+primary responsibilities:
+
+1. **Multi-cube fan-out** — distribute KernelLaunchMsg / MmuMapMsg /
+   MmuUnmapMsg to per-cube M_CPUs.
+2. **Response aggregation** — collect per-cube ResponseMsg, signal
+   parent `txn.done` when all targeted cubes have responded.
+
+A third, narrower responsibility applies only to KernelLaunchMsg:
+**`target_start_ns` global barrier stamping** (D3).
+
+The component does **not**:
+
+- Decide routing — paths are pre-computed by the router (ADR-0002).
+- Decode tensor or kernel internals — those concerns belong to
+  M_CPU / PE_CPU / engines.
+- Handle PE-level fan-out — M_CPU fans out within a cube (ADR-0009 D3).
+- Handle Memory R/W data path — those bypass IO_CPU per ADR-0015 D4
+  and ADR-0016 D3 (Memory R/W resolution code in
+  `_resolve_cube_targets` exists as a defensive fallback only).
+
+Per invocation (`run()`): applies the configured `overhead_ns` once
+per incoming Transaction (D8).
+
+### D2. Forward path — multi-cube fan-out
+
+When a non-response Transaction arrives, the worker:
+
+1. Pays `overhead_ns` via `run()`.
+2. Calls `_resolve_cube_targets` to derive the list of `(sip, cube)`
+   targets from the request (D5).
+3. For each target:
+   - Resolves M_CPU node id via `ctx.resolver.find_m_cpu(sip, cube)`.
+   - Resolves the path via `ctx.router.find_node_path(io_cpu, m_cpu)`.
+   - Creates a per-cube sub-Transaction with `path` populated and
+     forwards it to `path[1]` (the first hop on the io_noc).
+4. Registers aggregation state: `_pending[request_id] = (expected,
+   received=0, parent_done)`.
+
+### D3. KernelLaunch `target_start_ns` global barrier (ADR-0009 D5)
+
+IO_CPU is the canonical stamper for `target_start_ns`. When the
+request is a `KernelLaunchMsg`, IO_CPU computes a single global
+barrier covering every targeted PE across every targeted cube:
+
+```text
+global_max = 0
+for (sip, cube) in cube_targets:
+    leg1 = compute_path_latency_ns(io_cpu → m_cpu(sip, cube), nbytes=0)
+    for pe_id in target_pe_ids:
+        leg2 = compute_path_latency_ns(m_cpu → pe_cpu(sip, cube, pe_id),
+                                       nbytes=0)
+        latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns
+        global_max = max(global_max, latency)
+
+target_start_ns = env.now + global_max
+```
+
+The request is then replaced (via `dataclasses.replace`) so the
+stamped value propagates through the fan-out.
+
+Two overhead corrections:
+
+- `io_overhead_ns` is subtracted because IO_CPU has already paid it
+  in `run()` before this method runs.
+- `m_overhead_ns` is subtracted once because it appears as the
+  endpoint of leg1 *and* the start of leg2 in path latency, but
+  M_CPU pays it only once at run time.
+
+Every downstream PE_CPU yields until `target_start_ns` before
+beginning kernel body execution; all PEs therefore start at the same
+simulated time regardless of how long their individual dispatch path
+took.
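+
+The barrier arithmetic can be checked with a small worked sketch. It
+is illustrative only: `global_barrier_ns`, the latency tables, and
+every number below are hypothetical stand-ins for what
+`compute_path_latency_ns` would return, not the simulator's code.
+
+```python
+def global_barrier_ns(now_ns, leg1_ns, leg2_ns, io_overhead_ns, m_overhead_ns):
+    """Return target_start_ns for hypothetical per-leg latencies.
+
+    leg1_ns maps (sip, cube) -> IO_CPU-to-M_CPU latency; leg2_ns maps
+    (sip, cube, pe_id) -> M_CPU-to-PE_CPU latency. Both are assumed to
+    include endpoint overheads, as in the pseudocode above.
+    """
+    global_max = 0.0
+    for (sip, cube, pe_id), leg2 in leg2_ns.items():
+        # Remove the overhead IO_CPU already paid and the M_CPU overhead
+        # that would otherwise be counted in both legs.
+        latency = leg1_ns[(sip, cube)] + leg2 - io_overhead_ns - m_overhead_ns
+        global_max = max(global_max, latency)
+    return now_ns + global_max
+
+
+# Hypothetical numbers: one SIP, two cubes, two PEs per cube.
+leg1 = {(0, 0): 30.0, (0, 1): 50.0}
+leg2 = {(0, 0, 0): 20.0, (0, 0, 1): 24.0, (0, 1, 0): 20.0, (0, 1, 1): 26.0}
+barrier = global_barrier_ns(100.0, leg1, leg2,
+                            io_overhead_ns=10.0, m_overhead_ns=5.0)
+```
+
+With these numbers the slowest target is PE 1 on cube 1
+(50 + 26 - 10 - 5 = 61 ns), so `barrier` is 161.0 and every PE waits
+until that simulated time.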
+
+### D4. KernelLaunch sub-Transactions carry `nbytes=0`
+
+Per-cube sub-Transactions for KernelLaunchMsg force `nbytes=0`,
+overriding the parent `txn.nbytes`:
+
+- Kernel launch is a control message; payload size is irrelevant at
+  the data-fabric level.
+- If `nbytes > 0`, every per-cube sub-txn occupies fabric BW on the
+  io_noc's shared first hop. With 16 cubes this serializes fan-out,
+  pushing far M_CPUs past `target_start_ns` and breaking the D3
+  invariant.
+
+Non-KernelLaunch sub-Transactions preserve `txn.nbytes` (only relevant
+for the defensive Memory R/W fallback path, which carries actual
+payload sizes).
+
+### D5. Per-request-type cube target resolution
+
+`_resolve_cube_targets` dispatches by request type:
+
+| Request type | Source of `(sip, cube)` | `target_cubes="all"` semantics |
+| --- | --- | --- |
+| `MemoryWriteMsg` | `dst_sip`, `dst_cube` (or `PhysAddr.decode(dst_pa).die_id` fallback) | single cube derived from PA decode |
+| `MemoryReadMsg` | `src_sip`, `src_cube` (or `PhysAddr.decode(src_pa).die_id` fallback) | single cube derived from PA decode |
+| `KernelLaunchMsg` | tensor shards filtered by `shard.sip == my_sip` | every cube that owns a shard on this SIP |
+| `MmuMapMsg` / `MmuUnmapMsg` | `target_cubes` list, filtered to this SIP | `range(cubes_per_sip)` from spec |
+
+Each IO_CPU instance fans out only within its own SIP — `_my_sip()`
+parses the SIP id from the node id (e.g., `sip0.io0.io_cpu` → 0).
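+
+A minimal sketch of that parsing, assuming node ids always begin with
+`sip<N>.`; the helper name `parse_sip_id` and the regular expression
+are hypothetical and are not taken from the simulator's `_my_sip()`:
+
+```python
+import re
+
+
+def parse_sip_id(node_id):
+    """Extract the integer SIP id from a node id such as 'sip0.io0.io_cpu'."""
+    match = re.match(r"sip(\d+)\.", node_id)
+    if match is None:
+        raise ValueError(f"node id does not start with 'sip<N>.': {node_id!r}")
+    return int(match.group(1))
+
+
+sip_of_io_cpu = parse_sip_id("sip0.io0.io_cpu")
+```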
+
+The Memory R/W rows exist for defensive completeness; the engine's
+normal path routes Memory R/W via `_process_memory_direct()` /
+`find_memory_path()`, bypassing IO_CPU entirely (ADR-0015 D4 /
+ADR-0016 D3).
+
+### D6. Response aggregation
+
+`_pending: dict[request_id → (expected, received, parent_done)]`:
+
+- On dispatch: register `(len(cube_targets), 0, txn.done)`.
+- `_worker` recognises responses by `is_response=True` and routes
+  them to `_collect_response`.
+- `_collect_response` increments `received`; when `received >=
+  expected`, `parent_done.succeed()` is invoked and the entry is
+  removed from `_pending`.
+
+This is a simple per-request counter. There is no per-cube identity
+tracking and no partial-failure handling: a missing response stalls
+the parent `done` event indefinitely. Production-style failure paths
+are out of scope for the current simulator model.
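+
+The counter scheme can be sketched in a few lines. This is an
+illustration under assumptions, not the component's code: the
+`ResponseCounter` class is hypothetical, and a plain callback stands
+in for `parent_done.succeed()`.
+
+```python
+class ResponseCounter:
+    """Per-request response counter in the spirit of `_pending`."""
+
+    def __init__(self):
+        # request_id -> [expected, received, on_done]
+        self._pending = {}
+
+    def register(self, request_id, expected, on_done):
+        self._pending[request_id] = [expected, 0, on_done]
+
+    def collect(self, request_id):
+        entry = self._pending[request_id]
+        entry[1] += 1
+        if entry[1] >= entry[0]:
+            # All targeted cubes responded: drop the state, then signal.
+            del self._pending[request_id]
+            entry[2]()
+
+
+completed = []
+counter = ResponseCounter()
+counter.register("req-1", expected=3, on_done=lambda: completed.append("req-1"))
+for _ in range(3):
+    counter.collect("req-1")
+```
+
+If only two of the three responses arrived, `completed` would stay
+empty and the entry would remain in `_pending`, which is the stall
+described above.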
+
+### D7. `target_pe` resolution helper
+
+`_resolve_pe_ids(target_pe)`:
+
+- `int` → `[target_pe]`.
+- `tuple[int, ...]` → `list(target_pe)`.
+- `"all"` → `range(n_slices)`, where `n_slices` comes from cube
+  `memory_map.hbm_slices_per_cube` (default 8).
+
+Used in D3's barrier computation to enumerate every PE target per
+cube.
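+
+The three cases above map onto a short helper. The sketch below is a
+standalone illustration: the free function `resolve_pe_ids` and its
+`n_slices` parameter are assumptions standing in for the method and
+the `memory_map.hbm_slices_per_cube` lookup.
+
+```python
+def resolve_pe_ids(target_pe, n_slices=8):
+    """Expand a target_pe value into an explicit list of PE ids."""
+    if target_pe == "all":
+        return list(range(n_slices))
+    if isinstance(target_pe, tuple):
+        return list(target_pe)
+    return [target_pe]
+
+
+single = resolve_pe_ids(3)
+subset = resolve_pe_ids((0, 2, 5))
+everyone = resolve_pe_ids("all", n_slices=4)
+```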
+
+### D8. Configurable `overhead_ns`
+
+A single attribute drives per-instance latency:
+
+| Site | impl name | overhead_ns |
+| --- | --- | --- |
+| IO chiplet `io_cpu` | `builtin.io_cpu` | 10.0 |
+
+Applied once in `run()` per Transaction. Models command
+interpretation + dispatch-decision time at IO_CPU.
+
+## Consequences
+
+### Positive
+
+- Cross-cube and cross-SIP kernel launches share a single global
+  barrier (D3 + D4) — no per-cube divergence in start time.
+- nbytes=0 invariant keeps fan-out off the shared first-hop fabric
+  BW, preserving the barrier's accuracy at scale (16 cubes).
+- Response aggregation via a single counter → minimal state,
+  deterministic ordering of completion.
+- Per-SIP scoping (`_my_sip()`) keeps IO_CPUs in different SIPs
+  cleanly independent.
+
+### Negative
+
+- No partial-failure semantics — a missing per-cube response
+  indefinitely stalls the parent. Adequate for simulation but not
+  suitable as a production-style endpoint.
+- `_pending` is a regular dict; in-flight requests accumulate state.
+  Acceptable for current benchmark workloads (few concurrent
+  outstanding launches); unbounded in principle.
+- The Memory R/W resolution branches in `_resolve_cube_targets` are
+  dead code in the normal engine path. Kept defensively but invite
+  drift if the bypass path ever changes.
+
+## Links
+
+- ADR-0002 (Routing distance — path computation)
+- ADR-0009 D1 (Kernel launch is an endpoint request to IO_CPU)
+- ADR-0009 D3 (M_CPU fans out within a cube; IO_CPU fans out across
+  cubes)
+- ADR-0009 D5 (target_start_ns canonical stamping at IO_CPU)
+- ADR-0011 D-VA3 (MmuMapMsg routes through IO_CPU for cube fan-out)
+- ADR-0012 (Host ↔ IO_CPU message schema)
+- ADR-0015 D4 (Memory R/W bypasses IO_CPU; Kernel Launch via IO_CPU)
+- ADR-0016 D1 (IO chiplet io_noc — IO_CPU attaches here)
+- ADR-0016 D3 (Memory R/W path bypasses IO_CPU)
+- ADR-0016 D4 (Kernel Launch path through IO_CPU for command
+  interpretation)
diff --git a/docs/adr/ADR-0037-dev-forwarding-component.md b/docs/adr/ADR-0037-dev-forwarding-component.md
new file mode 100644
index 0000000..193dbe0
--- /dev/null
+++ b/docs/adr/ADR-0037-dev-forwarding-component.md
@@ -0,0 +1,200 @@
+# ADR-0037: Forwarding Component (forwarding_v1)
+
+## Status
+
+Accepted
+
+## Context
+
+The simulation graph has many node positions that exist purely to model
+fabric traversal — NOC mesh routers, switches, UCIe protocol endpoints,
+IO chiplet io_noc, transit cubes. These share a common pattern: receive
+a message, apply per-component overhead (modeling header decode +
+routing decision time), forward to the next hop along the pre-computed
+path.
+
+This ADR defines the contract for these transit nodes: a single
+component type (`TransitComponent`) that handles flit-aware forwarding
+with wormhole cut-through semantics, used under multiple impl names
+according to the conceptual role each instance plays.
+
+## Decision
+
+### D1. Role
+
+The Forwarding component (`TransitComponent` class) is a **stateless
+transit node** in the simulation graph, apart from the transient
+first-flit bookkeeping described in D2. It models any fabric position
+that a message physically traverses without semantic processing.
+
+Per traversal, the component:
+
+1. Reads an incoming Transaction or Flit from an `in_port`.
+2. Applies the configured per-component overhead (`overhead_ns`),
+   applied **once per Transaction** even across multi-flit payloads
+   (see D2).
+3. Looks up the next hop along the Transaction's pre-computed `path`.
+4. Forwards to the corresponding `out_port`; at the terminal node
+   (no next hop), signals `txn.done` once the `is_last` flit arrives.
+
+The component **does NOT**:
+
+- Decide routing — paths are pre-computed by the router (ADR-0002 /
+  ADR-0017 D2). Forwarding only executes the per-hop step.
+- Model wire propagation or bandwidth occupancy — separate wire
+  processes between components handle that (ADR-0015 D2).
+- Resolve addresses — the AddressResolver does that (ADR-0017 D9).
+- Aggregate completion — terminal endpoints (IO_CPU, M_CPU, HBM_CTRL)
+  handle that.
+
+### D2. First-flit overhead model (header decode)
+
+Per-Transaction `overhead_ns` is applied **exactly once**, at first
+flit arrival:
+
+- `_txn_decoded: set[int]` tracks which Transactions have already
+  paid the overhead at this node.
+- On first-flit arrival for a Transaction: `yield self.run(env,
+  msg.txn.nbytes)` — pays the overhead.
+- Subsequent flits of the same Transaction skip the overhead — they
+  pipeline through with no extra delay.
+- On `is_last` flit: remove the Transaction from `_txn_decoded`.
+
+This models the real-HW behavior where header decode and routing
+decision happen once on first flit; payload flits then stream through
+the same path (wormhole cut-through). Multi-hop pipelining emerges
+naturally: each hop charges its overhead to the first flit only, and
+the flits that follow pass through that hop without paying it again.
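+
+The once-per-Transaction rule can be sketched without SimPy. The
+`Flit` dataclass and `FirstFlitOverhead` class below are hypothetical
+simplifications; only the `_txn_decoded` set mirrors the description
+above.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class Flit:
+    txn_id: int
+    is_last: bool
+
+
+class FirstFlitOverhead:
+    """Charge overhead_ns only to the first flit of each Transaction."""
+
+    def __init__(self, overhead_ns):
+        self.overhead_ns = overhead_ns
+        self._txn_decoded = set()
+
+    def delay_for(self, flit):
+        if flit.txn_id in self._txn_decoded:
+            delay = 0.0  # header already decoded at this node
+        else:
+            self._txn_decoded.add(flit.txn_id)
+            delay = self.overhead_ns
+        if flit.is_last:
+            # Transaction finished at this node: forget it.
+            self._txn_decoded.discard(flit.txn_id)
+        return delay
+
+
+node = FirstFlitOverhead(overhead_ns=2.0)
+delays = [node.delay_for(Flit(txn_id=7, is_last=(i == 3))) for i in range(4)]
+```
+
+Here `delays` is `[2.0, 0.0, 0.0, 0.0]`, and the set is empty again
+once the last flit has passed.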
+
+### D3. Serial worker forwarding (preserves order)
+
+The component's worker is a single SimPy process that consumes flits
+from `_inbox` and forwards them serially in arrival order. The
+component does NOT spawn `env.process(...)` per flit.
+
+Rationale: if the first flit yields on `overhead_ns` while subsequent
+flits run in parallel processes, the later flits can overtake the
+first. This produces out-of-order delivery and lets the `is_last`
+flit arrive at the destination before the first flit — corrupting
+both the transaction's completion semantics and any flit-index-based
+processing downstream.
+
+### D4. Path-based next-hop routing
+
+Routing is **not** a Forwarding-component concern. The Transaction
+arrives with a pre-computed `path` (built by the router; ADR-0002 /
+ADR-0017 D2). The component just looks up its own position in the
+path and forwards to `path[index + 1]`:
+
+```python
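+def _next_hop_in_path(self, txn):
+    """Return the node after this one in txn.path, or None if there is none."""
+    my_id = self.node.id
+    path = txn.path
+    for i, n in enumerate(path):
+        # Match this node's position only if a successor exists.
+        if n == my_id and i + 1 < len(path):
+            return path[i + 1]
+    return None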
+```
+
+If `next_hop` is found and present in `out_ports`, the flit is
+forwarded. Otherwise (terminal node), `txn.done.succeed()` is
+invoked when the `is_last` flit arrives.
+
+### D5. Flit-aware mode with Non-Flit fallback
+
+`_FLIT_AWARE = True` opts this component out of the base class's
+flit-reassembly logic in `_fan_in`. Flits are placed directly on
+`_inbox` (no reassembly), enabling per-flit handling in the worker
+loop (D2, D3).
+
+Non-Flit messages — zero-byte control Transactions and other
+non-chunkified payloads — fall through to the base class's legacy
+`_forward_txn` path via `env.process`. This preserves backward
+compatibility for control-plane traffic that does not benefit from
+flit-level processing.
+
+### D6. Multi-stream merging at the base class
+
+Multi-stream FIFO merging at routers is the base class's
+responsibility, not Forwarding's. The base class's `_fan_in` spawns
+one process per `in_port`; all push to a single shared `_inbox`.
+Flits from different upstream streams therefore interleave at
+flit granularity in `_inbox`'s FIFO order.
+
+The Forwarding worker simply consumes `_inbox` in arrival order —
+correctly modeling per-router multi-flow arbitration as
+fair-FIFO over the shared inbox.
+
+### D7. Single implementation under multiple impl names
+
+A single `TransitComponent` class is registered under four impl names
+in `components.yaml`:
+
+- `builtin.forwarding` — generic forwarding (e.g., `io_noc`,
+  `noc_router`, UCIe conn bridges)
+- `builtin.switch` — tray-level switch
+- `builtin.noc` — cube-level NOC fabric (legacy singleton; current
+  NOC routers use `builtin.forwarding`)
+- `builtin.ucie` — UCIe protocol endpoint
+
+All four aliases instantiate the same class with the same behavior.
+Per-instance differentiation lives only in `attrs.overhead_ns`.
+Separate impl names exist as intent tags for readability and to
+allow future divergence without backward-incompatible config
+changes.
+
+### D8. Configurable `overhead_ns`
+
+A single attribute drives per-instance latency:
+
+| Usage site | impl name | overhead_ns |
+| --- | --- | --- |
+| Tray-level switch | `builtin.switch` | 5.0 |
+| Cube NOC router | `builtin.forwarding` | 2.0 |
+| IO chiplet io_noc | `builtin.forwarding` | 0.0 |
+| UCIe protocol endpoint (`ucie-{N,S,E,W}`) | `builtin.ucie` | 8.0 |
+| UCIe conn bridge (`ucie-{PORT}.conn{N}`) | `builtin.forwarding` | 0.0 |
+
+Default is 0.0. The attribute is read at each `run()` invocation, so
+dynamic reconfiguration is possible but not currently used.
+
+## Consequences
+
+### Positive
+
+- A single class handles all transit-node roles in the simulation
+  graph — minimal code surface for a high-population component type.
+- Flit-aware processing + serial worker preserves wormhole semantics
+  across multi-hop paths without per-flit process overhead.
+- `overhead_ns` is the only per-instance tunable; routing, BW, and
+  address resolution stay cleanly separated in their own components /
+  modules.
+- Multi-stream merging emerges from the base-class structure; no
+  router-specific logic duplicates fair-FIFO arbitration.
+- Non-Flit fallback path keeps control-plane traffic working without
+  forcing every message into the flit framework.
+
+### Negative
+
+- The single class hides usage-site intent inside `attrs.overhead_ns`
+  configuration; readers must consult `topology.yaml` +
+  `components.yaml` to see which impl name maps to which behavior
+  class.
+- Per-flit serial worker is a bottleneck if `overhead_ns` is large
+  and many concurrent transactions arrive at the same router; current
+  values (0–8 ns) make this negligible.
+
+## Links
+
+- ADR-0002 (Routing distance — path computation)
+- ADR-0015 D1 (Component port model)
+- ADR-0015 D2 (Wire process — BW + propagation, separate from this
+  component)
+- ADR-0015 D6 (Transit cube forwarding pattern)
+- ADR-0016 D1 (IO chiplet io_noc — uses this component)
+- ADR-0017 D1 (Cube NOC routers — use this component)
+- ADR-0017 D6 (UCIe decomposition — `ucie-{PORT}` instances use this
+  component)
+- ADR-0033 D1 (Flit-aware pass-through, first-flit overhead,
+  multi-stream merge semantics)
diff --git a/docs/ipcq-dma-codesign-hw.md b/docs/ipcq-dma-codesign-hw.md
deleted file mode 100644
index 379ad77..0000000
--- a/docs/ipcq-dma-codesign-hw.md
+++ /dev/null
@@ -1,548 +0,0 @@
-# IPCQ-DMA Co-design Hardware Design Document
-
-**Status**: Draft — Review Requested
-**Date**: 2026-04-28
-**Authors**: YW Kang
-**Reviewers**: (HW team TBD)
-**Related**: ADR-0023 (IPCQ PE Collective), ADR-0025 (Direction Addressing)
-
----
-
-## 1. Background & Motivation
-
-IPCQ (Inter-PE Communication Queue) is a hardware queue mechanism for
-collective communication between PEs. Its core design principle is IPCQ-DMA
-co-design: **when the DMA transfers data, it automatically updates the IPCQ
-head/tail pointers from piggybacked metadata, with no separate control message**.
-
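-This document:
-
-1. describes how IPCQ operates at the hardware level in the current PE architecture,
-2. verifies how the simulator models this hardware,
-3. proposes a design for the actual hardware implementation, and
-4. reviews alternatives to settle on the best approach.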
-
----
-
-## 2. High-level Behavior of PE_IPCQ
-
-![PE Baseline Architecture](diagrams/pe_baseline.png)
-
-> source: [`diagrams/pe_baseline.d2`](diagrams/pe_baseline.d2), rendered with `d2 --layout=elk --scale 1.5`.
-
-### IPCQ Hardware Operation
-
-**HW Configuration**:
-* IPCQ sets up unidirectional, ring-buffer-based queues between PEs to deliver data.
-* Each PE maintains an independent queue pair per direction (N/S/E/W, etc.).
-* For each queue pair, IPCQ maintains the sender's head/tail pointers and the receiver's head/tail pointers.
-
-* **IPCQ Slot Region**: the IPCQ receive buffer. As the dashed box in the diagram shows, one of TCM, Cube SRAM, or Local HBM can be selected as `buffer_kind`.
-Per-tier performance characteristics (simulation model values, `ipcq_types.py`):
-
-| Buffer Kind | Intrinsic BW | Effective BW (NoC bottleneck) | Use |
-|-------------|-------------|-------------------------------|------|
-| TCM | 512 GB/s | 512 GB/s (direct, bypasses the NoC) | Lowest latency, PE-internal only |
-| Cube SRAM | 512 GB/s | 128 GB/s (`sram_to_router_bw`) | Shared within the cube, limited by NoC BW |
-| Local HBM | 256 GB/s | 256 GB/s (`hbm_to_router_bw`) | Large capacity, limited by NoC BW |
-
-**Send path (fire-and-forget)**:
-1. PE_CPU issues `tl.send(dir, src_addr)` → an IpcqRequest is delivered to PE_IPCQ
-2. PE_IPCQ checks backpressure: `(my_head - peer_tail_cache) < peer.n_slots`
-3. The peer's rx slot address is computed: `peer_rx_base + (my_head % n_slots) × slot_size`
-4. An IpcqDmaToken (data + piggyback metadata: sender_seq) is passed to PE_DMA
-5. PE_IPCQ performs `my_head++` and returns to PE_CPU immediately (without waiting for DMA completion)
-6. PE_DMA snapshots the src data, then sends it over the NoC to the peer PE_DMA
-
-**Receive path (blocking)**:
-1. The peer PE_DMA writes the data to the slot and extracts the metadata (sender_seq, dst_addr) **in the same cycle**
-2. PE_IPCQ identifies the direction by dst_addr range matching and updates `peer_head_cache`
-3. A wakeup signal is delivered to the PE_CPU waiting in `tl.recv(dir)`
-4. PE_CPU reads the data from the slot, and PE_IPCQ performs `my_tail++`
-5. **Credit return**: PE_IPCQ sends a 16B credit packet (`consumer_seq`) to the sender over the NoC
-6. The sender's PE_IPCQ updates `peer_tail_cache`, releasing backpressure
-
-**Core design principles**:
-- **Data + head pointer piggyback**: no separate head synchronization message; sender_seq rides on the DMA data flit
-- **Atomic write + metadata**: the receiving DMA performs the slot write and the metadata delivery in the same cycle (I6 invariant)
-- **Address-based direction matching**: even when several directions connect to the same peer, they are told apart by dst_addr range (ADR-0025)
-- **Credit-based flow control**: after consuming a slot, the receiver notifies the sender with a 16B credit
-
----
-
-## 3. Simulator Implementation Verification
-
-This section verifies how the simulator models the hardware behavior described above.
-
-### 3.1 Mapping of Intent to Implementation
-
-| Design intent | Simulator implementation | Location |
-|-----------|----------------|------|
-| DMA piggybacks the head pointer on data transfers | The `IpcqDmaToken.sender_seq` field travels with the data flit | `ipcq_types.py:185` |
-| Receiving DMA handles data write + metadata delivery atomically | No yield between `store.write` and `IpcqMetaArrival` in `_handle_ipcq_inbound` (I6) | `pe_dma.py:232-275` |
-| Send is fire-and-forget | `_handle_ipcq_outbound` does not wait on `sub_done` | `pe_dma.py:182` |
-| Recv blocks until data arrives | Waits on the condition `peer_head_cache > my_tail` | `pe_ipcq.py:263` |
-| Credit return is a separate fast path | Direct put through a SimPy Store (latency charged from the NoC path) | `pe_ipcq.py:443-469` |
-| In-flight data semantics (snapshot) | Data snapshot preserved at send time, independent of later src modifications | `pe_dma.py:142-155` |
-| PE_DMA single inbox | All in_ports merged into a single FIFO by `_fan_in` (`base.py:51-53`) | No arbiter between the compute port and the IPCQ port |
-
-### 3.2 Credit Return Path Modeling Details
-
-Credit return finds the actual NoC path with `router.find_path()` and charges
-the hop latency + BW drain computed by `compute_path_latency_ns()`.
-
-```python
-# pe_ipcq.py:471-492
-def _credit_latency_ns(self, direction: str) -> float:
-    path = self.ctx.router.find_path(self._pe_prefix, peer_pe_dma)
-    return self.ctx.compute_path_latency_ns(path, self._credit_size_bytes)
-```
-
-However, the latency is paid with `env.timeout()` and the credit is then put
-directly into `peer_credit_store` (a SimPy Store). No actual `Transaction` is
-created and passed through the NoC hop by hop, so **bandwidth contention with other traffic is not modeled.**
-
-| | Latency | BW Contention |
-|---|---|---|
-| Data path (IpcqDmaToken) | Modeled accurately as a NoC Transaction | Traverses the actual fabric |
-| Credit path (16B) | NoC path latency reflected accurately | No fabric Transaction injected (simplification) |
-
-A 16B credit is negligible next to a data transfer (tens to hundreds of KB),
-so this simplification introduces almost no practical error.
-
-### 3.3 Verification Conclusion
-
-The simulator implementation **accurately models** the intent of the IPCQ-DMA co-design.
-
----
-
-## 4. Proposed Hardware Design
-
-### 4.1 Block Diagram (after the change)
-
-Changes are highlighted: **(NEW)** = new, **(MOD)** = modified.
-
-![PE Proposed Architecture](diagrams/pe_proposed.png)
-
-> Source: [`diagrams/pe_proposed.d2`](diagrams/pe_proposed.d2), rendered with `d2 --layout=elk`.
-
-**Key changes, Baseline → Proposed**:
-- Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW)
-- PE_IPCQ (SimPy component) → **IPCQ Controller** (HW register + combinational logic)
-- The **reserved IPCQ Slot Region** within TCM is made explicit
-- The Credit Injector / Receiver connect directly to the NoC through the Fabric Port
-
-### 4.2 Module Details
-
-#### 4.2.1 IPCQ Controller (new module)
-
-A hardware control block placed between PE_CPU and the DMA Engine.
-It corresponds to the simulator's `PeIpcqComponent`.
-
-##### QPair Register File
-
-Per-direction queue pair state is held in flip-flops.
-
-```
-Per-direction registers (each 64-bit):
-  my_head          — sender write position (monotonic)
-  my_tail          — receiver read position (monotonic)
-  peer_head_cache  — last known peer head (updated by Meta Extractor)
-  peer_tail_cache  — last known peer tail (updated by Credit Receive)
-  rx_base_pa       — this PE's rx buffer base physical address
-  peer_rx_base_pa  — peer's rx buffer base physical address
-  n_slots          — ring depth (power-of-2 constraint, see below)
-  slot_size        — bytes per slot
-  peer_credit_tgt  — credit receive address of the peer PE
-
-Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare)
-Total: 8 dirs × 9 regs × 8B = 576B flip-flops
-```
-
-PE_CPU can read and write these registers via MMIO (CSR). Software populates them at init time.
-
-##### Slot Address Generator (combinational)
-
-```
-Input:  pointer (my_head or my_tail), n_slots, slot_size, base_pa
-Output: slot_addr = base_pa + (pointer % n_slots) * slot_size
-
-Implementation:
-  n_slots power-of-2 constraint → pointer & (n_slots - 1)  (AND mask, 1 gate delay)
-  slot_size power-of-2   → barrel shift               (1 cycle)
-  64-bit add             → ripple/kogge-stone adder    (1 cycle)
-
-Latency: 1-2 cycles combinational
-```
-
-##### Backpressure Comparator (combinational)
-
-```
-full = (my_head - peer_tail_cache) >= n_slots
-
-Implementation: 64-bit subtract + unsigned compare
-Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold
-Latency: 1 cycle
-```
-
-##### Meta Extractor (inbound datapath sideband)
-
-Wired to the DMA Engine's inbound vc_comm path. It extracts metadata from the
-header of an arriving IPCQ flit and updates the queue pair state.
-
-```
-Trigger: DMA inbound write completion (same cycle)
-Extract: {sender_seq, dst_addr} from flit header
-
-Direction matching (ADR-0025 D2):
-  for each dir:
-    match = (base_pa[dir] <= dst_addr) && (dst_addr < base_pa[dir] + n_slots[dir] * slot_size[dir])
-  8× parallel range comparators + priority encoder
-
-Update: peer_head_cache[matched_dir] = max(peer_head_cache, sender_seq + 1)
-Output: recv_wake signal for matched direction → PE_CPU interrupt/flag
-
-Implementation: 8× (2 comparators + AND) + priority encoder
-Latency: 1 cycle (pipelined with DMA write, so I6 atomicity holds naturally)
-```
-
-##### Credit Injector (outbound)
-
-```
-Trigger: recv completion (after my_tail is incremented)
-Action:  pack 16B credit packet → DMA vc_comm (or a dedicated credit VC)
-
-Packet: {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa}
-Latency: 1 cycle to generate, then NoC traversal
-```
-
-##### Credit Receiver (inbound sideband)
-
-```
-Trigger: 16B credit packet arrival (from NoC)
-Extract: {consumer_seq, dst_rx_base_pa}
-
-Direction matching (ADR-0025 D3):
-  for each dir:
-    match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa)
-
-Update: peer_tail_cache[matched_dir] = max(peer_tail_cache, consumer_seq)
-Output: send_wake signal → deassert backpressure stall
-
-Latency: 1 cycle
-```
-
-#### 4.2.2 DMA Engine Modifications
-
-##### vc_comm IPCQ-aware mode
-
-An IPCQ flit handling mode is added to the existing vc_comm channel.
-
-**Outbound**:
-1. Receive a command from the IPCQ Controller: {src_addr, dst_addr, nbytes, sender_seq}
-2. Read src_addr from TCM → snapshot into the DMA read buffer (existing DMA behavior)
-3. Flit pack: data + piggyback metadata (sender_seq, dst_addr)
-4. Inject into the NoC fabric port
-5. Fire-and-forget (does not wait for completion)
-
-**Inbound**:
-1. Receive an IPCQ flit from the NoC
-2. Terminal BW drain charge (drain_ns = nbytes / bottleneck_bw)
-3. Slot write latency charge (backing memory tier)
-4. **ATOMIC** (same pipeline stage, no stall insertion):
-   - TCM write: data → slot address
-   - Meta Extractor trigger: sender_seq + dst_addr → IPCQ Controller
-5. Done
-
-**Hardware guarantee of I6 atomicity**: TCM write completion and the Meta Extractor trigger
-occur in the same pipeline stage, so no separate synchronization is needed.
-The simulator's "no yield between write and IpcqMetaArrival" property is guaranteed naturally.
-
-##### Data Snapshot Semantics
-
-Data latched in the DMA read buffer is unaffected by later modifications to src memory.
-This is standard DMA read-then-write behavior, so no additional HW is needed.
-
-##### Credit Virtual Channel (optional)
-
-Option A: multiplex credits onto vc_comm (distinguished as 16B header-only flits)
-Option B: add a third, dedicated credit VC (strict priority > data)
-
-Option B is better for deadlock prevention, but because the BW impact of 16B credits is
-negligible, Option A is also sufficient.
-
-#### 4.2.3 Fabric Flit Format Extension
-
-```
-Normal data flit (e.g., 512-bit):
-┌──────────────────────────────────────────┐
-│ [511:480] routing header (32b)           │
-│ [479:0]   payload (480b = 60B)           │
-└──────────────────────────────────────────┘
-
-IPCQ data flit (metadata only in the first flit):
-┌──────────────────────────────────────────┐
-│ [511:480] routing header (32b)           │
-│   [511]    ipcq_flag (1b)                │  ← distinguishes IPCQ from normal DMA
-│   [510:509] vc_id (2b)                   │
-│   [508:480] route + hop count            │
-│ [479:416] ipcq_metadata (64b)            │  ← piggyback
-│   [479:448] sender_seq (32b)             │
-│   [447:416] dst_addr[31:0] (32b)         │  ← for direction matching
-│ [415:0]   payload (416b = 52B)           │
-└──────────────────────────────────────────┘
-Subsequent flits: full 60B payload (no metadata)
-
-Credit-only flit (128-bit, header-only):
-┌──────────────────────────────────────────┐
-│ [127:96]  routing header (32b)           │
-│   [127]   credit_flag (1b)               │
-│ [95:64]   consumer_seq (32b)             │
-│ [63:0]    dst_rx_base_pa (64b)           │
-└──────────────────────────────────────────┘
-```
-
-The first flit's payload shrinks from 60B to 52B (13% overhead).
-In a multi-flit transfer the subsequent flits carry full payload, so the overhead is < 1% for large transfers.
-
-#### 4.2.4 TCM IPCQ Slot Region
-
-```
-TCM Memory Map (16MB):
-┌─────────────────────────────┐ 0x000000
-│  Kernel Working Memory      │
-│  (compute tensors)          │
-│  ~14MB                      │
-├─────────────────────────────┤ 0xE00000
-│  IPCQ RX Buffers            │
-│  Dir N: slots × slot_size   │
-│  Dir S: slots × slot_size   │
-│  Dir E: slots × slot_size   │
-│  Dir W: slots × slot_size   │
-│  ~1MB                       │
-├─────────────────────────────┤ 0xF00000
-│  IPCQ Metadata / Scratch    │
-│  ~1MB                       │
-└─────────────────────────────┘ 0xFFFFFF
-```
-
-The IPCQ region is placed in the upper banks of TCM to minimize bank
-conflicts with compute accesses (see Section 7.1).
-
----
-
-## 5. End-to-End Dataflow
-
-### 5.1 Sequence Diagram
-
-```mermaid
-sequenceDiagram
-    participant CPU_A as PE_A: PE_CPU
-    participant IPCQ_A as PE_A: IPCQ Ctrl
-    participant DMA_A as PE_A: DMA
-    participant NOC as NoC Fabric
-    participant DMA_B as PE_B: DMA
-    participant IPCQ_B as PE_B: IPCQ Ctrl
-    participant TCM_B as PE_B: TCM
-    participant CPU_B as PE_B: PE_CPU
-
-    Note over CPU_A: tl.send(dir="E", src=0x1000)
-
-    CPU_A->>IPCQ_A: MMIO: send request
-    Note over IPCQ_A: Backpressure check:<br/>(head - peer_tail_cache) < n_slots → PASS<br/>Slot addr gen:<br/>dst = peer_rx_base + (head%n) × slot_size
-    IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head}
-    Note over IPCQ_A: my_head++
-    IPCQ_A-->>CPU_A: send returns (fire-and-forget)
-
-    Note over DMA_A: TCM read → snapshot in read buffer<br/>Flit pack: data + {sender_seq, dst_addr}
-    DMA_A->>NOC: IPCQ data flit(s)
-
-    Note over NOC: hop latency + BW drain
-
-    NOC->>DMA_B: IPCQ data flit(s)
-    Note over DMA_B: Terminal BW drain<br/>Slot write latency
-
-    rect rgb(255, 240, 220)
-        Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall
-        DMA_B->>TCM_B: write data → slot address
-        DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr}
-    end
-
-    Note over IPCQ_B: Range match dst_addr → direction "W"<br/>peer_head_cache["W"] = sender_seq + 1
-    IPCQ_B-->>CPU_B: recv_wake signal
-
-    Note over CPU_B: tl.recv(dir="W") wakes up
-    CPU_B->>IPCQ_B: recv request
-    Note over IPCQ_B: peer_head_cache > my_tail → YES<br/>slot_addr = rx_base + (tail%n) × slot_size
-    IPCQ_B-->>CPU_B: return slot_addr
-    CPU_B->>TCM_B: read data from slot
-    Note over IPCQ_B: my_tail++
-
-    IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa}
-    Note over NOC: credit traversal (NoC latency)
-    NOC->>IPCQ_A: Credit arrival
-
-    Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled)
-```
-
----
-
-## 6. 2nm Implementation Analysis
-
-### 6.1 Area Estimate
-
-| Module | Gate Count | Area (2nm est.) | Notes |
-|--------|-----------|-----------------|-------|
-| QPair Register File | ~4.6K FF | 0.002 mm² | 576B flip-flops |
-| Slot Addr Gen + Backpressure | ~5K gates | 0.001 mm² | Combinational |
-| Meta Extractor + Credit Logic | ~3K gates | 0.001 mm² | 8× parallel comparators |
-| **Total IPCQ Controller** | **~12.6K** | **~0.004 mm²** | **< 0.1% of the whole PE** |
-| DMA vc_comm extension | ~2K gates | 0.002 mm² | Flit pack/unpack |
-| **Total delta** | **~14.6K** | **~0.006 mm²** | |
-
-### 6.2 Timing
-
-| Path | Delay (2nm est.) | Target Clock | Margin |
-|------|-------------------|-------------|--------|
-| Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× |
-| Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× |
-| Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× |
-| Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× |
-
-Every critical path fits within 1 cycle. No timing closure issues.
-
-### 6.3 Power
-
-- Active: ~1 mW (register read/write + comparators, during send/recv operation)
-- Idle: leakage only
-- Negligible relative to total PE power
-
-### 6.4 Constraints
-
-| Item | Constraint | Rationale |
-|------|------|------|
-| `n_slots` | **must be a power of 2** | mod → AND mask (1 gate). An arbitrary value needs a divider (~10 cycles) |
-| `slot_size` | **power of 2 recommended** | mul → barrel shift. An arbitrary value needs a multiplier |
-| TCM IPCQ region | **placed in a dedicated bank** | Prevents bank conflicts with compute access |
-
----
-
-## 7. Risk Assessment
-
-### 7.1 TCM Bank Conflict
-
-- **Risk**: Stall when an IPCQ slot write and a compute read access the same bank
-- **Mitigation**: Place the IPCQ region in a dedicated bank at the upper TCM addresses
-- **Cost**: Slight reduction in TCM banking flexibility
-- **Severity**: Medium (performance impact), Low (not a correctness issue)
-
-### 7.2 Credit Return Latency under Congestion
-
-- **Risk**: Under NoC congestion, credit return is delayed → sender backpressure stall
-- **Mitigation**:
-  - Move credits onto a separate VC with strict priority (at 16B, the BW impact is negligible)
-  - Or set n_slots generously (8+) so the buffer absorbs the credit delay
-- **Severity**: Low (16B credits contribute almost nothing to congestion)
-
-### 7.3 Inter-Direction Ordering
-
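-- **Risk**: Ordering when one PE sends in multiple directions simultaneously
-- **Mitigation**: A per-direction monotonic seq is sufficient. Inter-direction ordering is
-  the kernel's (software) responsibility, the same as in the current simulator model
-- **Severity**: Low (resolved by the architectural design)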
-
----
-
-## 8. Alternatives Considered
-
-### 8.1 Doorbell + Polling (traditional approach)
-
-```
-Send: DMA write data → DMA write doorbell register at peer → peer polls doorbell
-Recv: Polling loop on doorbell, or interrupt-driven
-```
-
-| Pros | Cons |
-|------|------|
-| Simple HW (no IPCQ controller needed) | Two DMA transactions (data + doorbell) |
-| Reuses the existing DMA | Ordering between data and doorbell must be guaranteed (fence) |
-| | Polling wastes power; interrupts add latency overhead |
-
-**Assessment**: Latency increases 2-3× relative to piggyback. **Rejected.**
-
-### 8.2 Hardware Message Queue (NVIDIA NVLink style)
-
-```
-Send: CPU → push descriptor to HMQ → HW forwards it to the peer HMQ automatically
-Recv: pop descriptor from HMQ → check data pointer
-```
-
-| Pros | Cons |
-|------|------|
-| CPU only writes a descriptor | Requires a separate HMQ engine (~0.05 mm²) |
-| Descriptor/data separation → flexible | Datapath separate from DMA → duplicated area/power |
-| | Large tensors still need DMA in the end |
-
-**Assessment**: DMA is mandatory for CCL's large-tensor patterns, so a dual HMQ + DMA structure
-wastes area. **Rejected.**
-
-### 8.3 RDMA-style Completion Queue (CQ)
-
-```
-Send: DMA write → CQE generated automatically at the peer
-Recv: CQ poll/interrupt → check data location
-```
-
-| Pros | Cons |
-|------|------|
-| Mature InfiniBand/RoCE model | CQ management logic + CQE memory overhead |
-| Easy multi-tenancy/isolation | Additional CQE/data ordering guarantee required |
-| | Over-engineered for PE-to-PE CCL |
-
-**Assessment**: An RDMA CQ suits multi-tenant isolation on a host-facing NIC.
-In a single-owner PE-to-PE environment it is unnecessary complexity. **Rejected.**
-
-### 8.4 Credit-in-Data Piggyback (v2 optimization candidate)
-
-In the current design, credit return is a separate 16B packet.
-In bidirectional communication patterns, **the credit can be merged into a reverse-direction data flit.**
-
-```
-PE_A →E→ PE_B: data + sender_seq=3
-PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4  ← credit merged into data
-```
-
-| Pros | Cons |
-|------|------|
-| Eliminates dedicated credit packets → saves NoC BW | Unidirectional patterns need a fallback |
-| Credit latency → 0 in bidirectional allreduce | Adds 8B to the flit header (negligible overhead) |
-| | Slight increase in logic complexity |
-
-**Assessment**: A strong optimization of the current design.
-Credit packets can be eliminated entirely in bidirectional allreduce.
-The standalone credit fallback is retained. **Recommended for adoption in v2.**
-
----
-
-## 9. Recommendations
-
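-1. **Adopt the current IPCQ-DMA co-design as the baseline hardware design**
-   Rationale: simple, area-efficient, and no timing/power issues at 2nm
-
-2. **Constrain n_slots to a power of 2 (mandatory)**
-   Rationale: replaces the mod operation with an AND mask and shortens the critical path
-
-3. **Allocate a dedicated bank for the IPCQ region in TCM banking**
-   Rationale: prevents bank conflicts with compute
-
-4. **Further evaluate Credit-in-Data Piggyback (Section 8.4) for v2**
-   Rationale: removes credit overhead in bidirectional patterns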
-
----
-
-## 10. Open Questions
-
-- [ ] What percentage of TCM may the IPCQ slot region occupy? (current assumption: ~1MB / 16MB = 6.25%)
-- [ ] Should credits get a dedicated VC, or be multiplexed onto vc_comm?
-- [ ] Flit format compatibility over inter-SIP links needs verification
-- [ ] Cap on the maximum n_slots? (8 directions × 8 slots × 64KB = 4MB → 25% of TCM)
diff --git a/docs/ccl-author-guide.en.md b/docs/onboarding/ccl-author-guide.en.md
similarity index 99%
rename from docs/ccl-author-guide.en.md
rename to docs/onboarding/ccl-author-guide.en.md
index 7fd38e1..3c46885 100644
--- a/docs/ccl-author-guide.en.md
+++ b/docs/onboarding/ccl-author-guide.en.md
@@ -582,7 +582,7 @@ If you add a new algorithm or pattern, please send a PR.
 - [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective design.
 - [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1).
 - [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution.
-- [ADR-0021](adr/ADR-0021-pe-pipeline-refactor.md): PE pipeline refactor.
+- [ADR-0014](adr/ADR-0014-dev-pe-pipeline-execution-model.md): PE pipeline execution model.
 
 Existing algorithm examples:
 
diff --git a/docs/ccl-author-guide.md b/docs/onboarding/ccl-author-guide.md
similarity index 99%
rename from docs/ccl-author-guide.md
rename to docs/onboarding/ccl-author-guide.md
index d785f24..0603826 100644
--- a/docs/ccl-author-guide.md
+++ b/docs/onboarding/ccl-author-guide.md
@@ -527,7 +527,7 @@ direct send 후 다른 step에서 같은 주소를 store해도 안전하다 (tok
 - [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective 설계
 - [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1)
 - [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution
-- [ADR-0021](adr/ADR-0021-pe-pipeline-refactor.md): PE pipeline refactor
+- [ADR-0014](adr/ADR-0014-dev-pe-pipeline-execution-model.md): PE pipeline execution model
 
 기존 알고리즘 예제:
 
diff --git a/docs/di-presentation.md b/docs/onboarding/di-presentation.md
similarity index 100%
rename from docs/di-presentation.md
rename to docs/onboarding/di-presentation.md
diff --git a/docs/hw-architecture-overview.md b/docs/onboarding/hw-architecture-overview.md
similarity index 100%
rename from docs/hw-architecture-overview.md
rename to docs/onboarding/hw-architecture-overview.md
diff --git a/docs/latency-model.md b/docs/onboarding/latency-model.md
similarity index 100%
rename from docs/latency-model.md
rename to docs/onboarding/latency-model.md
diff --git a/src/kernbench/ccl/__init__.py b/src/kernbench/ccl/__init__.py
index aa60e46..d4494a8 100644
--- a/src/kernbench/ccl/__init__.py
+++ b/src/kernbench/ccl/__init__.py
@@ -5,5 +5,5 @@ This package provides:
     - helpers:    utilities for algorithm authors (chunked, ring_step, ...)
     - testing:    mock CCL runtime for fast unit tests of algorithm kernels
 
-See docs/adr/ADR-0023-ipcq-pe-collective.md and docs/ccl-author-guide.md.
+See docs/adr/ADR-0023-dev-ipcq-pe-collective.md and docs/onboarding/ccl-author-guide.md.
 """
diff --git a/src/kernbench/common/pe_commands.py b/src/kernbench/common/pe_commands.py
index 1c47c4f..02a32d4 100644
--- a/src/kernbench/common/pe_commands.py
+++ b/src/kernbench/common/pe_commands.py
@@ -24,7 +24,7 @@ class Scope(Enum):
 
 @dataclass(frozen=True)
 class OpSpec:
-    """One operation in a multi-op composite (head + epilogue, ADR-0021).
+    """One operation in a multi-op composite (head + epilogue, ADR-0014 D3.3).
 
     The head op (first in CompositeCmd.ops) defines tile geometry; subsequent
     ops are epilogue stages whose ``scope`` controls how often they fire
@@ -156,7 +156,7 @@ class CompositeCmd:
     out_nbytes: int
     math_op: str | None = None       # for op="math": which math operation
     data_op: bool = True
-    # Multi-op composite (ADR-0021 extension): when non-empty, ops[0] is the
+    # Multi-op composite (ADR-0014 D3.3): when non-empty, ops[0] is the
     # head and ops[1:] are epilogue stages with explicit scope. When empty,
     # the legacy single-op semantics (op/a/b/math_op) apply.
     ops: tuple[OpSpec, ...] = ()
diff --git a/src/kernbench/components/builtin/hbm_ctrl.py b/src/kernbench/components/builtin/hbm_ctrl.py
index 09bcaa6..f45d7c7 100644
--- a/src/kernbench/components/builtin/hbm_ctrl.py
+++ b/src/kernbench/components/builtin/hbm_ctrl.py
@@ -15,7 +15,7 @@ if TYPE_CHECKING:
 
 
 class HbmCtrlComponent(ComponentBase):
-    """HBM controller with per-pseudo-channel (PC) striping (ADR-0019 D1, ADR-0033).
+    """HBM controller with per-pseudo-channel (PC) striping (ADR-0017 D4, ADR-0033).
 
     Stateless per-PC ``available_at`` array; each incoming transaction is
     split into ``ceil(nbytes / burst_bytes)`` chunks distributed round-robin
diff --git a/src/kernbench/components/builtin/m_cpu.py b/src/kernbench/components/builtin/m_cpu.py
index 740a272..d8fffaf 100644
--- a/src/kernbench/components/builtin/m_cpu.py
+++ b/src/kernbench/components/builtin/m_cpu.py
@@ -267,8 +267,9 @@ class MCpuComponent(ComponentBase):
     def _resolve_dma_destinations(self, request: Any, target_pe: int | str) -> list[str]:
         """Return list of HBM destination node_ids for DMA fan-out.
 
-        With single hbm_ctrl per cube (ADR-0019), always returns one node.
-        PA-based resolution still used for cross-cube routing.
+        The PA-based resolver maps each address to one per-PE
+        ``hbm_ctrl.pe{X}`` (ADR-0017 D9), so this method returns exactly
+        one node. Cross-cube routing uses the same resolution.
         """
         cube_prefix = self.node.id.rsplit(".", 1)[0]  # e.g. "sip0.cube0"
 
diff --git a/src/kernbench/components/builtin/pe_dma.py b/src/kernbench/components/builtin/pe_dma.py
index ed3de9e..a416fa5 100644
--- a/src/kernbench/components/builtin/pe_dma.py
+++ b/src/kernbench/components/builtin/pe_dma.py
@@ -17,9 +17,11 @@ if TYPE_CHECKING:
 class PeDmaComponent(PeEngineBase):
     """PE_DMA: dual-channel DMA engine with READ and WRITE resources.
 
-    Each channel has capacity=1 (ADR-0014 D4):
+    Compute channels (vc_compute) have capacity=1 each (ADR-0014 D4):
       - DMA_READ and DMA_WRITE may execute concurrently.
       - Multiple READs cannot overlap; multiple WRITEs cannot overlap.
+    The orthogonal vc_comm channel for IPCQ traffic is defined in
+    ADR-0023 D8.
 
     Handles two message types:
       - Transaction: external fabric messages (PeDmaMsg probes, M_CPU DMA)
diff --git a/src/kernbench/components/builtin/pe_fetch_store.py b/src/kernbench/components/builtin/pe_fetch_store.py
index 3d65e2c..5e3f34c 100644
--- a/src/kernbench/components/builtin/pe_fetch_store.py
+++ b/src/kernbench/components/builtin/pe_fetch_store.py
@@ -1,4 +1,4 @@
-"""PE_FETCH_STORE: TCM ↔ Register File transfer unit (ADR-0021 D5).
+"""PE_FETCH_STORE: TCM ↔ Register File transfer unit (ADR-0014 D1).
 
 Handles both fetch (TCM → register) and store (register → TCM).
 BW serialization is delegated to PE_TCM via port communication.
@@ -18,7 +18,7 @@ if TYPE_CHECKING:
 
 
 class PeFetchStoreComponent(PeEngineBase):
-    """PE_FETCH_STORE: TCM ↔ Register File (ADR-0021 D5).
+    """PE_FETCH_STORE: TCM ↔ Register File (ADR-0014 D1).
 
     Receives TileTokens via pipeline self-routing.
     Sends TcmRequest to PE_TCM for BW-based latency.
diff --git a/src/kernbench/components/builtin/pe_gemm.py b/src/kernbench/components/builtin/pe_gemm.py
index 718d130..ea5a916 100644
--- a/src/kernbench/components/builtin/pe_gemm.py
+++ b/src/kernbench/components/builtin/pe_gemm.py
@@ -1,4 +1,4 @@
-"""PE_GEMM: matrix multiplication engine (ADR-0021 D6).
+"""PE_GEMM: matrix multiplication engine (ADR-0014 D1).
 
 Handles both legacy PeInternalTxn (GemmCmd) and pipeline TileToken.
 In pipeline mode, receives token after fetch stage, computes MAC, chains to next.
@@ -32,7 +32,7 @@ _DTYPE_BITS: dict[str, int] = {
 
 
 class PeGemmComponent(PeEngineBase):
-    """PE_GEMM: MAC array (ADR-0021 D6).
+    """PE_GEMM: MAC array (ADR-0014 D1).
 
     In pipeline mode: pure compute — register data already fetched.
     In legacy mode: handles PeInternalTxn(GemmCmd) with shared accel_slot.
diff --git a/src/kernbench/components/builtin/pe_math.py b/src/kernbench/components/builtin/pe_math.py
index cf5bcf9..eeb819e 100644
--- a/src/kernbench/components/builtin/pe_math.py
+++ b/src/kernbench/components/builtin/pe_math.py
@@ -1,4 +1,4 @@
-"""PE_MATH: element-wise / reduction computation engine (ADR-0021 D6).
+"""PE_MATH: element-wise / reduction computation engine (ADR-0014 D1).
 
 Handles both legacy PeInternalTxn (MathCmd) and pipeline TileToken.
 In pipeline mode, receives token after fetch stage, computes SIMD, chains to next.
@@ -24,7 +24,7 @@ if TYPE_CHECKING:
 
 
 class PeMathComponent(PeEngineBase):
-    """PE_MATH: SIMD/Vector unit (ADR-0021 D6).
+    """PE_MATH: SIMD/Vector unit (ADR-0014 D1).
 
     In pipeline mode: pure compute — register data already fetched.
     In legacy mode: handles PeInternalTxn(MathCmd) with shared accel_slot.
diff --git a/src/kernbench/components/builtin/pe_scheduler.py b/src/kernbench/components/builtin/pe_scheduler.py
index 994acfb..87c73b9 100644
--- a/src/kernbench/components/builtin/pe_scheduler.py
+++ b/src/kernbench/components/builtin/pe_scheduler.py
@@ -1,10 +1,10 @@
-"""PE_SCHEDULER: plan generation + tile dispatch (ADR-0021 D2).
+"""PE_SCHEDULER: plan generation + tile dispatch (ADR-0014 D6).
 
 Receives PeInternalTxn from PE_CPU, routes to engines:
   - Simple commands (DmaReadCmd, GemmCmd, etc.) → direct dispatch to engine
   - CompositeCmd → generate TilePlan, feed tiles via _feed_loop
 
-Composite pipeline uses token self-routing (ADR-0021 D4):
+Composite pipeline uses token self-routing (ADR-0014 D6):
   Scheduler only does initial dispatch + completion tracking.
   Tiles chain through components based on their plan's stage sequence.
 """
@@ -24,7 +24,7 @@ if TYPE_CHECKING:
 
 
 class PeSchedulerComponent(ComponentBase):
-    """PE_SCHEDULER: sole dispatcher inside a PE (ADR-0014 D1, ADR-0021 D2).
+    """PE_SCHEDULER: sole dispatcher inside a PE (ADR-0014 D1, D6).
 
     Simple commands are forwarded to the appropriate engine.
     CompositeCmd creates a TilePlan and feeds tiles into the pipeline.
@@ -104,7 +104,7 @@ class PeSchedulerComponent(ComponentBase):
     def _dispatch_composite(
         self, env: simpy.Environment, pe_txn: Any, cmd: Any,
     ) -> Generator:
-        """Generate plan and enqueue to feeder. Non-blocking (ADR-0021 D4)."""
+        """Generate plan and enqueue to feeder. Non-blocking (ADR-0014 D6)."""
         from kernbench.components.builtin.pe_types import PipelineContext
 
         plan = self._generate_plan(cmd)
@@ -121,7 +121,7 @@ class PeSchedulerComponent(ComponentBase):
         yield self._pending_feeds.put((plan, ctx))
 
     def _feed_loop(self, env: simpy.Environment) -> Generator:
-        """Single feeder process: FIFO command ordering (ADR-0021 D2).
+        """Single feeder process: FIFO command ordering (ADR-0014 D6).
 
         No tile feed interleaving between commands.
         Queue full → only this process blocks.
diff --git a/src/kernbench/components/builtin/pe_tcm.py b/src/kernbench/components/builtin/pe_tcm.py
index dfe940e..dfb2b36 100644
--- a/src/kernbench/components/builtin/pe_tcm.py
+++ b/src/kernbench/components/builtin/pe_tcm.py
@@ -1,4 +1,4 @@
-"""PE_TCM: tightly-coupled memory with BW-based access serialization (ADR-0021).
+"""PE_TCM: tightly-coupled memory with BW-based access serialization (ADR-0014 D1).
 
 Models scratchpad memory inside the PE. Handles both legacy Transaction forwarding
 and TcmRequest from PE_FETCH_STORE for BW-serialized read/write access.
@@ -32,7 +32,7 @@ class TcmRequest:
 
 
 class PeTcmComponent(ComponentBase):
-    """PE_TCM: BW-serialized scratchpad memory (ADR-0021 D1).
+    """PE_TCM: BW-serialized scratchpad memory (ADR-0014 D1).
 
     Dual-channel: read and write can proceed in parallel,
     but concurrent reads serialize, concurrent writes serialize.
diff --git a/src/kernbench/components/builtin/pe_types.py b/src/kernbench/components/builtin/pe_types.py
index 77b92bb..64cfb26 100644
--- a/src/kernbench/components/builtin/pe_types.py
+++ b/src/kernbench/components/builtin/pe_types.py
@@ -1,4 +1,4 @@
-"""PE pipeline types for ADR-0021: TileToken, TilePlan, Stage, PipelineContext.
+"""PE pipeline types for ADR-0014 D6: TileToken, TilePlan, Stage, PipelineContext.
 
 These types are used by the PE_SCHEDULER and all PE engine components
 for tile-based pipeline execution with self-routing.
@@ -84,7 +84,7 @@ class PipelineContext:
 
 @dataclass
 class TileToken:
-    """Self-routing tile token passed between PE components (ADR-0021 D9).
+    """Self-routing tile token passed between PE components (ADR-0014 D6).
 
     Single-owner: only one component holds this token at any time.
     params is a cache of plan.stages[stage_idx].params (canonical source).
diff --git a/src/kernbench/components/builtin/tiling.py b/src/kernbench/components/builtin/tiling.py
index 88884a6..0704ea3 100644
--- a/src/kernbench/components/builtin/tiling.py
+++ b/src/kernbench/components/builtin/tiling.py
@@ -1,4 +1,4 @@
-"""Tile plan generators for PE pipeline (ADR-0021).
+"""Tile plan generators for PE pipeline (ADR-0014 D6).
 
 Generates TilePlan with stage sequences for GEMM and Math operations.
 Ported from pe_accel tiling.py with stage-based plan structure.
diff --git a/src/kernbench/components/legacy/__init__.py b/src/kernbench/components/legacy/__init__.py
index 03d0854..948d652 100644
--- a/src/kernbench/components/legacy/__init__.py
+++ b/src/kernbench/components/legacy/__init__.py
@@ -1,2 +1,2 @@
 # Legacy component backups — not actively used.
-# Kept for reference during ADR-0021 migration.
+# Kept for reference during the PE pipeline refactor (ADR-0014).
diff --git a/src/kernbench/components/legacy/builtin/m_cpu.py b/src/kernbench/components/legacy/builtin/m_cpu.py
index b8e928c..b69b248 100644
--- a/src/kernbench/components/legacy/builtin/m_cpu.py
+++ b/src/kernbench/components/legacy/builtin/m_cpu.py
@@ -264,8 +264,9 @@ class MCpuComponent(ComponentBase):
     def _resolve_dma_destinations(self, request: Any, target_pe: int | str) -> list[str]:
         """Return list of HBM destination node_ids for DMA fan-out.
 
-        With single hbm_ctrl per cube (ADR-0019), always returns one node.
-        PA-based resolution still used for cross-cube routing.
+        The PA-based resolver maps each address to one per-PE
+        ``hbm_ctrl.pe{X}`` (ADR-0017 D9), so this method returns exactly
+        one node. Cross-cube routing uses the same resolution.
         """
         cube_prefix = self.node.id.rsplit(".", 1)[0]  # e.g. "sip0.cube0"
 
diff --git a/src/kernbench/policy/address/phyaddr.py b/src/kernbench/policy/address/phyaddr.py
index 4394ba6..a1e566e 100644
--- a/src/kernbench/policy/address/phyaddr.py
+++ b/src/kernbench/policy/address/phyaddr.py
@@ -20,7 +20,7 @@ _AHBM_SEL_BIT = 37
 _AHBM_LOCAL_USED = 38  # bits actually meaningful for AHBM
 
 # HBM-offset bit layout for PC (pseudo-channel) striping
-# (ADR-0033 D6, ADR-0019). Given burst_bytes = 2^B and num_pcs = 2^P
+# (ADR-0033 D6, ADR-0017 D8). Given burst_bytes = 2^B and num_pcs = 2^P
 # configured at hbm_ctrl, the PC index is derived from hbm_offset as
 #   pc_shift = B; pc_mask = (1 << P) - 1
 #   pc = (hbm_offset >> pc_shift) & pc_mask
diff --git a/src/kernbench/policy/routing/router.py b/src/kernbench/policy/routing/router.py
index ca1e105..0869079 100644
--- a/src/kernbench/policy/routing/router.py
+++ b/src/kernbench/policy/routing/router.py
@@ -35,7 +35,7 @@ class AddressResolver:
     def __init__(self, graph: TopologyGraph) -> None:
         self._node_ids = set(graph.nodes)
         # HBM slice size (bytes) — used to decode pe_id from hbm_offset
-        # so HBM PA → hbm_ctrl.pe{X} (ADR-0019 D1/D4).
+        # so HBM PA → hbm_ctrl.pe{X} (ADR-0017 D4/D9).
         mm = graph.spec.get("cube", {}).get("memory_map", {})
         hbm_total_gb = int(mm.get("hbm_total_gb_per_cube", 48))
         slices_per_cube = int(mm.get("hbm_slices_per_cube", 8))
@@ -129,7 +129,7 @@ class PathRouter:
         Otherwise the cube's own UCIe port appears as a zero-distance
         bus that Dijkstra prefers over the mesh — that is intended only
         for cross-cube routing. Local PE_DMA must traverse the mesh so
-        cross-PE-slice access pays the mesh-distance cost (ADR-0019 D4).
+        cross-PE-slice access pays the mesh-distance cost (ADR-0017 D7).
         """
         start = f"{src_pe}.pe_dma"
         adj = self._adj_local if _same_cube(start, dst_node) else self._adj
@@ -137,13 +137,13 @@ class PathRouter:
 
     def find_path_with_distance(self, src_pe: str, dst_node: str) -> tuple[list[str], float]:
         """Match find_path's cube-local routing so reported distance reflects
-        the actual chosen path (ADR-0019 D4)."""
+        the actual chosen path (ADR-0017 D7)."""
         start = f"{src_pe}.pe_dma"
         adj = self._adj_local if _same_cube(start, dst_node) else self._adj
         return self._run_dijkstra_with_dist(adj, start, dst_node)
 
     def find_mcpu_dma_path(self, m_cpu_id: str, dst_hbm_id: str) -> list[str]:
-        """M_CPU DMA path: routes through router mesh (ADR-0019).
+        """M_CPU DMA path: routes through router mesh (ADR-0017).
 
         Same-cube: uses _adj_local (no UCIe) to stay within mesh.
         Cross-cube: uses _adj_all to route via UCIe.
diff --git a/src/kernbench/runtime_api/context.py b/src/kernbench/runtime_api/context.py
index d339875..6e94388 100644
--- a/src/kernbench/runtime_api/context.py
+++ b/src/kernbench/runtime_api/context.py
@@ -58,7 +58,7 @@ def _get_active_context():
 
 
 class _AhbmNamespace:
-    """torch.ahbm — per-greenlet SIP device binding (ADR-0024 D10).
+    """torch.ahbm — per-greenlet SIP device binding (ADR-0024 D3).
 
     Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. KernBench's
     backend is 'ahbm' (not CUDA), so this namespace avoids pretending to be
@@ -124,7 +124,7 @@ class RuntimeContext:
         dc = DistributedContext()
         dc._ctx_ref = self  # back-reference for AhbmCCLBackend to reach ctx.launch etc.
         self.distributed = dc
-        # ADR-0024 D10: torch.ahbm (KernBench-native) + torch.accelerator
+        # ADR-0024 D3: torch.ahbm (KernBench-native) + torch.accelerator
         # (PyTorch 2.x portable) namespaces for per-greenlet device binding.
         self.ahbm = _AhbmNamespace()
         self.accelerator = _AcceleratorNamespace(self.ahbm)
@@ -472,7 +472,7 @@ class RuntimeContext:
         eff_num_pe = dp.num_pes if dp.num_pes is not None else self._pes_per_cube
         eff_num_cubes = dp.num_cubes if dp.num_cubes is not None else self._num_cubes
         # ADR-0026 D4: resolve structural coords directly at resolve time.
-        # ``torch.ahbm.set_device(rank)`` (ADR-0024 D10) selects the target
+        # ``torch.ahbm.set_device(rank)`` (ADR-0024 D3) selects the target
         # SIP; if unset, fall back to SIP 0 for single-driver compatibility.
         current_sip = (
             self.ahbm.current_device() if hasattr(self, "ahbm") else None
@@ -619,7 +619,7 @@ class RuntimeContext:
         Creates per-SIP KernelLaunchMsg with local va_base per tensor
         (like host driver sending per-rank launch commands).
 
-        When ``_defer_wait=True`` (ADR-0024 D7), returns the list of
+        When ``_defer_wait=True`` (ADR-0027 D0.4), returns the list of
         ``(handle, sip_id, meta)`` tuples instead of waiting. Caller is
         responsible for waiting — used by collective ops to yield between
         submit and wait so all sibling ranks can submit first.
@@ -786,7 +786,7 @@ class RuntimeContext:
             last_handle = h
 
         if _defer_wait:
-            # ADR-0024 D7: return the pending-list so the caller can yield
+            # ADR-0027 D0.4: return the pending-list so the caller can yield
             # between submit and drain. Used by collective ops that need
             # all sibling ranks to submit before any rank waits.
             return [
diff --git a/src/kernbench/runtime_api/distributed.py b/src/kernbench/runtime_api/distributed.py
index a56086f..3d5f4b6 100644
--- a/src/kernbench/runtime_api/distributed.py
+++ b/src/kernbench/runtime_api/distributed.py
@@ -178,7 +178,7 @@ class DistributedContext:
 
     def __init__(self) -> None:
         self._backend: AhbmCCLBackend | None = None
-        # ADR-0024 D9: greenlet-local rank registry. Bench launcher calls
+        # ADR-0024 D2: greenlet-local rank registry. Bench launcher calls
         # _bind_rank(g, rank) when spawning workers; get_rank() resolves the
         # current greenlet to its rank. Unbound greenlets fall back to 0 for
         # single-driver test compat.
@@ -220,7 +220,7 @@ class DistributedContext:
     def get_rank(self) -> int:
         """Return the rank bound to the current greenlet (default 0).
 
-        ADR-0024 D9: workers spawned by the bench launcher each get a rank
+        ADR-0024 D2: workers spawned by the bench launcher each get a rank
         registered via ``_bind_rank``. Callers outside any bound greenlet
         fall back to rank 0 for single-driver test compat.
         """
@@ -230,7 +230,7 @@ class DistributedContext:
         return int(self._rank_by_greenlet.get(g, 0))
 
     def _bind_rank(self, g: Any, rank: int) -> None:
-        """Bind a greenlet to a rank so ``get_rank()`` returns it (ADR-0024 D9)."""
+        """Bind a greenlet to a rank so ``get_rank()`` returns it (ADR-0024 D2)."""
         self._rank_by_greenlet[g] = int(rank)
 
     def get_backend(self) -> str:
diff --git a/src/kernbench/runtime_api/multiprocessing.py b/src/kernbench/runtime_api/multiprocessing.py
index 53994f2..0b50fae 100644
--- a/src/kernbench/runtime_api/multiprocessing.py
+++ b/src/kernbench/runtime_api/multiprocessing.py
@@ -65,7 +65,7 @@ def _drain_pending(ctx: Any) -> None:
                 # Populate _completed so fast-path in ctx.wait short-circuits
                 # on the return leg.
                 ctx._completed.add(h)
-        # (b) Collective backend queue (ADR-0024 D7 + D0.4-(2)).
+        # (b) Collective backend queue (ADR-0027 D0.4-(2)).
         if backend is not None:
             pending_list = getattr(backend, "_pending_collective_handles", None)
             if pending_list is not None:
diff --git a/src/kernbench/sim_engine/op_log.py b/src/kernbench/sim_engine/op_log.py
index acc0d5d..51d2d30 100644
--- a/src/kernbench/sim_engine/op_log.py
+++ b/src/kernbench/sim_engine/op_log.py
@@ -51,7 +51,7 @@ class OpLogger:
         record_end fires.
         """
         snap: dict[str, Any] = {}
-        # TileToken (ADR-0021 pipeline) — capture which stage this is and its
+        # TileToken (ADR-0014 D6 pipeline) — capture which stage this is and its
         # per-stage params (e.g. op_kind/scope for epilogue MATH stages) so
         # we can recover them at record_end even after the token advances.
         try:
diff --git a/src/kernbench/topology/builder.py b/src/kernbench/topology/builder.py
index 8516b3b..f33f1ed 100644
--- a/src/kernbench/topology/builder.py
+++ b/src/kernbench/topology/builder.py
@@ -356,7 +356,7 @@ def _instantiate_cube(
 ) -> None:
     """Add all cube-internal nodes and edges, including PE instances.
 
-    Topology: explicit router mesh from cube_mesh.yaml (ADR-0019).
+    Topology: explicit router mesh from cube_mesh.yaml (ADR-0017 D1).
     Each router is a separate SimPy node. Components attach to routers
     based on cube_mesh.yaml attachment lists.
     """
@@ -367,10 +367,10 @@ def _instantiate_cube(
     clinks = cube["links"]
     mm = cube["memory_map"]
 
-    # ── Mode branch (ADR-0019) ──
+    # ── Mode branch (ADR-0017 D8) ──
     mode = mm.get("hbm_mapping_mode", "n_to_one")
     if mode == "one_to_one":
-        raise NotImplementedError("1:1 mode: ADR-0019 D3")
+        raise NotImplementedError("1:1 mode: ADR-0017 D8")
 
     # ── UCIe ports + connection nodes ──
     ucie_cfg = cube["ucie"]
@@ -404,11 +404,10 @@ def _instantiate_cube(
             label=name.upper().replace("_", " "),
         )
 
-    # ── Per-PE HBM controller (ADR-0019 D1/D4) ──
+    # ── Per-PE HBM controller (ADR-0017 D4) ──
     # Each PE owns one slice of the cube's HBM. The slice has its own
     # set of pseudo-channels and is reachable ONLY through that PE's
     # attaching router (see cube_mesh.yaml ``peX.hbm`` attach lists).
-    # Restored after the ADR-0019 over-consolidation in commit 5917b34.
     hbm_spec = cube["components"]["hbm_ctrl"]
     hbm_lx, hbm_ly = local_pos["hbm_ctrl"]
     _hbm_total_bw = float(cube["links"].get("hbm_to_router_bw_gbs", 256.0))
@@ -425,7 +424,7 @@ def _instantiate_cube(
             label=f"HBM CTRL pe{pe_idx}",
         )
 
-    # ── Router mesh from cube_mesh.yaml (ADR-0019 D3) ──
+    # ── Router mesh from cube_mesh.yaml (ADR-0017 D1) ──
     routers = mesh_data["routers"]
     router_spec = cube["components"]["noc_router"]
     router_bw = clinks.get("router_link_bw_gbs", 256.0)
@@ -573,7 +572,7 @@ def _instantiate_cube(
                     ))
             elif item.endswith(".hbm"):
                 # peX.hbm: router rXcY owns the entry to hbm_ctrl.peX.
-                # (ADR-0019 D1/D4 — per-PE HBM partitioning.)
+                # (ADR-0017 D4 — per-PE HBM partitioning.)
                 pe_prefix = item.rsplit(".", 1)[0]
                 pe_idx = int(pe_prefix.replace("pe", ""))
                 pe_hbm_id = f"{cp}.hbm_ctrl.pe{pe_idx}"
@@ -645,13 +644,12 @@ def _instantiate_cube(
                     ))
 
     # NOTE: HBM↔router edges are created in the per-router attach loop
-    # above (peX.hbm items map router → hbm_ctrl.peX). Removed the
-    # legacy "all routers → single hbm_ctrl" loop that bypassed the
-    # ADR-0019 D4 per-PE partition.
+    # above (peX.hbm items map router → hbm_ctrl.peX). See ADR-0017 D4
+    # for the per-PE partition contract.
 
 
 def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
-    """Add PE-internal edges for a single PE instance (ADR-0021)."""
+    """Add PE-internal edges for a single PE instance (ADR-0014 D8)."""
     edges.append(Edge(
         src=f"{pp}.pe_cpu", dst=f"{pp}.pe_scheduler",
         distance_mm=pe_links["pe_cpu_to_scheduler_mm"],
@@ -685,7 +683,7 @@ def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
             kind="pe_internal",
         ))
 
-    # Fetch/Store → TCM (ADR-0021 D5)
+    # Fetch/Store → TCM (ADR-0014 D5)
     if "fetch_store_to_tcm_mm" in pe_links:
         edges.append(Edge(
             src=f"{pp}.pe_fetch_store", dst=f"{pp}.pe_tcm",
@@ -694,7 +692,7 @@ def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
             kind="pe_internal",
         ))
 
-    # Chaining edges (ADR-0021 D4 — token self-routing)
+    # Chaining edges (ADR-0014 D6 — token self-routing)
     chaining = [
         ("pe_dma", "pe_fetch_store", "dma_to_fetch_store_mm"),
         ("pe_fetch_store", "pe_gemm", "fetch_store_to_gemm_mm"),
diff --git a/src/kernbench/tp/layers.py b/src/kernbench/tp/layers.py
index 27e0bd8..5d66429 100644
--- a/src/kernbench/tp/layers.py
+++ b/src/kernbench/tp/layers.py
@@ -6,7 +6,7 @@
   forward(x) ends with ``dist.all_reduce`` to sum partial products.
 
 Both layers use the intra-device ``DPPolicy`` (ADR-0026). TP shard
-ownership is determined by ``torch.ahbm.set_device(rank)`` (ADR-0024 D10).
+ownership is determined by ``torch.ahbm.set_device(rank)`` (ADR-0024 D3).
 
 Yield-safety contract (ADR-0027 D4/D5): every forward path contains at
 least one ``ctx.wait`` (via ``torch.launch``) or one collective; this
@@ -53,7 +53,7 @@ class ColumnParallelLinear:
         self.k_local = out_features // ws
         self.dtype = dtype
         self._torch = torch
-        # Per-rank weight slice. ``set_device(rank)`` (ADR-0024 D10) places
+        # Per-rank weight slice. ``set_device(rank)`` (ADR-0024 D3) places
         # it on SIP ``rank``. Intra-SIP layout comes from DPPolicy (ADR-0026).
         self.weight = torch.zeros(
             (in_features, self.k_local),
diff --git a/src/kernbench/tp/parallel_state.py b/src/kernbench/tp/parallel_state.py
index 2952f83..3aa3dbb 100644
--- a/src/kernbench/tp/parallel_state.py
+++ b/src/kernbench/tp/parallel_state.py
@@ -43,7 +43,7 @@ def get_tensor_model_parallel_rank() -> int:
     """Return this worker's rank within the TP group.
 
     Delegates to the greenlet-local rank registered by the spawn launcher
-    (ADR-0024 D9 via ``torch.distributed.get_rank``).
+    (ADR-0024 D2 via ``torch.distributed.get_rank``).
     """
     # Resolve via the global torch.distributed facade on the active ctx.
     return _current_rank()
diff --git a/tests/test_e2e_pipeline.py b/tests/test_e2e_pipeline.py
index aa35685..39aa01e 100644
--- a/tests/test_e2e_pipeline.py
+++ b/tests/test_e2e_pipeline.py
@@ -1,4 +1,4 @@
-"""End-to-end pipeline tests (ADR-0020 + ADR-0021).
+"""End-to-end pipeline tests (ADR-0020 + ADR-0014).
 
 Verifies:
   1. Actual benchmark kernel → greenlet mode → op_log → DataExecutor → accuracy
diff --git a/tests/test_hbm_pc_striping.py b/tests/test_hbm_pc_striping.py
index 535c45f..5a0a7ee 100644
--- a/tests/test_hbm_pc_striping.py
+++ b/tests/test_hbm_pc_striping.py
@@ -68,7 +68,7 @@ def _path_drain_for_write(eng: GraphEngine, msg: MemoryWriteMsg) -> float:
 
 def test_builder_derives_pc_bw_gbs():
     """Topology builder must inject `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
-    as an attr on every hbm_ctrl node. Enforces ADR-0019 D9 invariant
+    as an attr on every hbm_ctrl node. Enforces ADR-0017 D8 invariant
     (channels_per_PE × per-PC BW = aggregated link BW) at build time.
     """
     handle = resolve_topology(str(TOPOLOGY_PATH))
diff --git a/tests/test_ipcq_buffer_kind_locations.py b/tests/test_ipcq_buffer_kind_locations.py
index 912b3a5..b7b4270 100644
--- a/tests/test_ipcq_buffer_kind_locations.py
+++ b/tests/test_ipcq_buffer_kind_locations.py
@@ -192,13 +192,10 @@ def test_hbm_pe_hop_charged_at_large_payload(tmp_path):
     chunk of latency from the PE↔HBM hop on send and recv, so the
     total HBM/TCM gap should clearly clear the threshold below.
 
-    Threshold history: the gap was 4 µs under the over-consolidated
-    single-hbm_ctrl model (commit 5917b34), inflated by serialization
-    on the shared HBM controller. With ADR-0019 D1 per-PE HBM CTRL
-    restored, each PE's slice runs on its own controller with no
-    cross-PE contention, so the IPCQ pattern (each PE writes its own
-    slice) drops the gap to ≈ 1.7 µs — still well above the bare
-    slot-IO term, confirming the PE↔HBM hop is being charged.
+    Under ADR-0017 D4 per-PE HBM CTRL, each PE's slice runs on its own
+    controller with no cross-PE contention, so the IPCQ pattern (each
+    PE writes its own slice) yields a gap of ≈ 1.7 µs — well above the
+    bare slot-IO term, confirming the PE↔HBM hop is being charged.
     """
     n_elem = 16384  # 32 KB / PE
     lat_tcm = _run_allreduce_with_buffer_kind(
diff --git a/tests/test_noc_mesh.py b/tests/test_noc_mesh.py
index bb9821f..2bb5666 100644
--- a/tests/test_noc_mesh.py
+++ b/tests/test_noc_mesh.py
@@ -1,4 +1,4 @@
-"""Tests for CUBE NOC Explicit Router Mesh (ADR-0019).
+"""Tests for CUBE NOC Explicit Router Mesh (ADR-0017).
 
 Key changes verified:
   - Explicit router nodes per cube from cube_mesh.yaml (6×6 grid)
@@ -125,14 +125,14 @@ def test_mesh_file_pe_corner_positions():
 
 
 def test_mesh_file_no_xbar_section():
-    """mesh output must not contain xbar section (ADR-0019 D2)."""
+    """mesh output must not contain xbar section (ADR-0017 D1)."""
     _graph()
     mesh = yaml.safe_load(MESH_PATH.read_text())
     assert "xbar" not in mesh, "xbar section should be removed from cube_mesh.yaml"
 
 
 def test_mesh_file_pe_hbm_attached():
-    """PE routers must have pe{idx}.hbm in attach list (ADR-0019 D1)."""
+    """PE routers must have pe{idx}.hbm in attach list (ADR-0017 D4)."""
     _graph()
     mesh = yaml.safe_load(MESH_PATH.read_text())
     for rid, rdata in mesh["routers"].items():
@@ -235,7 +235,7 @@ def test_mesh_ucie_all_four_directions():
 
 
 # ══════════════════════════════════════════════════════════════════
-# 2. Topology Graph: Explicit Router Mesh (ADR-0019)
+# 2. Topology Graph: Explicit Router Mesh (ADR-0017)
 # ══════════════════════════════════════════════════════════════════
 
 
@@ -247,7 +247,7 @@ def test_router_nodes_exist():
 
 
 def test_no_xbar_or_bridge_nodes():
-    """xbar/bridge nodes must not exist (ADR-0019 D2)."""
+    """xbar/bridge nodes must not exist (ADR-0017 D1)."""
     graph = _graph()
     bad = [n for n in graph.nodes if "xbar" in n or "bridge" in n]
     assert len(bad) == 0, f"Old xbar/bridge nodes found: {bad[:5]}"
@@ -260,11 +260,10 @@ def test_no_single_noc_node():
 
 
 def test_per_pe_hbm_ctrl_nodes():
-    """Each cube has 8 per-PE HBM CTRL instances (ADR-0019 D1).
+    """Each cube has 8 per-PE HBM CTRL instances (ADR-0017 D4).
 
-    Restored from over-consolidation in commit 5917b34. The legacy
-    single ``sip0.cube0.hbm_ctrl`` is gone; each PE owns its own
-    ``hbm_ctrl.pe{X}`` reachable through that PE's attaching router.
+    Each PE owns its own ``hbm_ctrl.pe{X}`` reachable through that PE's
+    attaching router. No cube-wide single ``hbm_ctrl`` node exists.
     """
     graph = _graph()
     for pe in range(8):
@@ -272,7 +271,7 @@ def test_per_pe_hbm_ctrl_nodes():
     # Legacy single hbm_ctrl must not exist
     legacy_id = "sip0.cube0.hbm_ctrl"
     assert legacy_id not in graph.nodes, (
-        f"legacy {legacy_id} must be removed (per-PE partitioning, ADR-0019 D1)"
+        f"legacy {legacy_id} must not exist (per-PE partitioning, ADR-0017 D4)"
     )
 
 
@@ -297,9 +296,7 @@ def test_pe_dma_connects_to_router():
 def test_each_hbm_ctrl_connects_only_to_owning_router():
     """Each ``hbm_ctrl.pe{X}`` must have exactly one router edge
     (router_to_hbm + hbm_to_router) to its owning PE's attaching
-    router (ADR-0019 D4). Replaces a prior test that asserted the
-    single hbm_ctrl was connected to all routers — that asserted the
-    spec-violating consolidation introduced in commit 5917b34.
+    router (ADR-0017 D7).
     """
     graph = _graph()
     pe_router = {0: "r0c0", 1: "r0c1", 2: "r1c4", 3: "r1c5",
@@ -513,7 +510,7 @@ def test_null_routers_excluded():
 
 
 # ══════════════════════════════════════════════════════════════════
-# 7. Router Mesh Latency (ADR-0019)
+# 7. Router Mesh Latency (ADR-0017)
 # ══════════════════════════════════════════════════════════════════
 
 
diff --git a/tests/test_pe_pipeline.py b/tests/test_pe_pipeline.py
index 2f404e0..2b3dc87 100644
--- a/tests/test_pe_pipeline.py
+++ b/tests/test_pe_pipeline.py
@@ -1,4 +1,4 @@
-"""Tests for ADR-0021 PE pipeline: TileToken self-routing, pipeline overlap, e2e accuracy.
+"""Tests for ADR-0014 D6 PE pipeline: TileToken self-routing, pipeline overlap, e2e accuracy.
 
 Test plan items:
   3. Phase 1 → Phase 2 end-to-end (op_log → DataExecutor → verify)
diff --git a/tests/test_per_pe_hbm_partition.py b/tests/test_per_pe_hbm_partition.py
index 930d042..15314a1 100644
--- a/tests/test_per_pe_hbm_partition.py
+++ b/tests/test_per_pe_hbm_partition.py
@@ -1,18 +1,13 @@
-"""Tests for ADR-0019 D1/D4 per-PE HBM partitioning.
+"""Tests for ADR-0017 D4/D7 per-PE HBM partitioning.
 
-Restores the architectural property that was lost in commit 5917b34
-(2026-04-04 "Replace xbar/bridge/single-NOC with explicit router mesh"),
-which over-consolidated 8 per-slice HBM CTRL nodes into one cube-wide
-HBM CTRL connected to every router. ADR-0019 D1/D4 specifies:
+ADR-0017 D4/D7 specifies:
 
 - Each PE owns 8 of the cube's 64 pseudo-channels (PE_X → PCs 8X..8X+7).
 - HBM CTRL is split per-PE: ``hbm_ctrl.pe{X}`` is reachable ONLY through
   PE_X's attaching router. Accessing PE_Y's slice from PE_X requires
   mesh routing to r_Y_attach before entering hbm_ctrl.pe{Y}.
 
-These tests are written BEFORE the production change and are expected
-to FAIL on current code (HBM CTRL is a single ``hbm_ctrl`` node attached
-to all routers). Phase 2 must make them PASS without weakening
+These tests enforce that property; changes must not weaken
 assertions.
 """
 from __future__ import annotations
@@ -66,16 +61,16 @@ def test_topology_has_8_hbm_ctrl_per_cube():
     for pe in range(8):
         nid = f"sip0.cube0.hbm_ctrl.pe{pe}"
         assert nid in graph.nodes, (
-            f"Expected per-PE HBM CTRL node {nid!r} (ADR-0019 D1)"
+            f"Expected per-PE HBM CTRL node {nid!r} (ADR-0017 D4)"
         )
         node = graph.nodes[nid]
         assert int(node.attrs.get("num_pcs", 0)) == 8, (
             f"{nid} must have num_pcs=8; got {node.attrs.get('num_pcs')}"
         )
-    # Legacy single hbm_ctrl must not exist
+    # Cube-wide single hbm_ctrl must not exist
     assert "sip0.cube0.hbm_ctrl" not in graph.nodes, (
-        "Legacy single sip0.cube0.hbm_ctrl must be removed in favor of "
-        "per-PE hbm_ctrl.pe{X} (ADR-0019 D1)"
+        "Cube-wide single sip0.cube0.hbm_ctrl must not exist; only "
+        "per-PE hbm_ctrl.pe{X} nodes are allowed (ADR-0017 D4)"
     )
 
 
@@ -199,10 +194,8 @@ def test_probe_cli_intra_cube_cases_are_monotonic():
     """Probe CLI cases must show monotonic latency:
         pe-local-hbm < pe-same-half-hbm < pe-cross-half-hbm.
 
-    Prior to per-PE partitioning these three return identical latency
-    because all roads lead to the same hbm_ctrl. With ADR-0019 D4
-    restored, same-half (pe0→pe1) is 1 mesh hop further than local,
-    and cross-half (pe0→pe4) is several hops further.
+    Per ADR-0017 D7, same-half (pe0→pe1) is 1 mesh hop further than
+    local, and cross-half (pe0→pe4) is several hops further.
     """
     graph = _graph()
     spec = graph.spec
diff --git a/tests/test_routing.py b/tests/test_routing.py
index cda2943..4633a25 100644
--- a/tests/test_routing.py
+++ b/tests/test_routing.py
@@ -17,7 +17,7 @@ def _graph():
 
 
 def test_resolve_hbm_addr():
-    """HBM address -> sip{S}.cube{C}.hbm_ctrl.pe{X} (per-PE controller, ADR-0019 D1)."""
+    """HBM address -> sip{S}.cube{C}.hbm_ctrl.pe{X} (per-PE controller, ADR-0017 D9)."""
     g = _graph()
     resolver = AddressResolver(g)
     # offset 0x1000 falls inside PE0's slice (slice_size = 6 GB)
@@ -102,16 +102,13 @@ def test_path_remote_pe_hbm():
     assert not any("xbar" in n or "bridge" in n for n in path)
 
 
-# ── PathRouter: cross-PE HBM distance reflects mesh hops (ADR-0019 D4) ─
+# ── PathRouter: cross-PE HBM distance reflects mesh hops (ADR-0017 D7) ─
 
 
 def test_cross_pe_hbm_distance_increases_with_mesh_hops():
-    """Restored ADR-0019 D4 behavior: accessing another PE's HBM slice
-    must take more routing distance than accessing one's own slice,
-    because each per-PE hbm_ctrl is reachable only via its PE's router.
-
-    Replaces a previous ``test_all_pe_hbm_equidistant`` that asserted the
-    over-consolidated (spec-violating) behavior introduced in 5917b34.
+    """ADR-0017 D7: accessing another PE's HBM slice must take more
+    routing distance than accessing one's own slice, because each
+    per-PE hbm_ctrl is reachable only via its PE's router.
     """
     g = _graph()
     router = PathRouter(g)
diff --git a/tests/test_topology_compile.py b/tests/test_topology_compile.py
index 77fe943..a66174d 100644
--- a/tests/test_topology_compile.py
+++ b/tests/test_topology_compile.py
@@ -21,7 +21,7 @@ def test_full_graph_node_count():
     #                          + 20 ucie (4 ports x (1 port + 4 conn))
     #                          + 8 PEs x 9 pe_comps))  (ADR-0023: +pe_ipcq)
     #   IO: pcie_ep + io_cpu + noc + 4 io_ucie_ports + 4*4 io_ucie_conn = 23
-    #   cube: 32 + 10 + 20 + 72 = 134  (was 127; ADR-0019 D1 per-PE HBM CTRL)
+    #   cube: 32 + 10 + 20 + 72 = 134  (per-PE HBM CTRL, ADR-0017 D4)
     # = 1 + 2*(23 + 16*134) = 1 + 2*(23+2144) = 1 + 4334 = 4335
     assert len(g.nodes) == 4335
 
@@ -29,9 +29,9 @@ def test_full_graph_node_count():
 def test_full_graph_edge_count():
     g = _graph()
     # ADR-0023: +3 IPCQ edges per PE
-    # ADR-0019 D1 (restored): HBM↔router edges drop from 32 routers × 2
-    # to 8 PE-routers × 2 per cube. 32 cubes × (16-64) = -1536 edges.
-    # Multi-op composite (ADR-0021): +1 gemm→math edge per PE for
+    # ADR-0017 D4: HBM↔router edges = 8 PE-routers × 2 per cube
+    # (per-PE partition; not all 32 routers).
+    # Multi-op composite (ADR-0014 D3.3): +1 gemm→math edge per PE for
     # epilogue chaining = 2 SIPs × 16 cubes × 8 PEs = +256 edges.
     assert len(g.edges) == 12412
 
@@ -73,7 +73,7 @@ def test_cube_component_nodes_exist():
     # Null holes must not exist
     for null_rc in ("r2c2", "r2c3", "r3c2", "r3c3"):
         assert f"{cp}.{null_rc}" not in g.nodes
-    # Per-PE HBM CTRL (ADR-0019 D1) — 8 instances, no legacy single node
+    # Per-PE HBM CTRL (ADR-0017 D4) — 8 instances; no cube-wide single node
     for pe in range(8):
         nid = f"{cp}.hbm_ctrl.pe{pe}"
         assert g.nodes[nid].kind == "hbm_ctrl"
@@ -94,7 +94,7 @@ def test_pe_component_nodes_exist():
 
 def test_hbm_ctrl_at_cube_center():
     g = _graph()
-    # Per-PE hbm_ctrl nodes share the cube's HBM placement (ADR-0019 D1)
+    # Per-PE hbm_ctrl nodes share the cube's HBM placement (ADR-0017 D4)
     # cube0 origin = (0, 0), hbm at (6.5, 7.0)
     for pe in range(8):
         node = g.nodes[f"sip0.cube0.hbm_ctrl.pe{pe}"]
@@ -190,8 +190,7 @@ def test_pe_internal_edges():
 
 def test_per_pe_hbm_ctrl_connects_only_to_owning_router():
     """Each hbm_ctrl.pe{X} connects ONLY to PE_X's attaching router
-    (ADR-0019 D4). Replaces a prior test that asserted the
-    spec-violating all-routers consolidation (commit 5917b34)."""
+    (ADR-0017 D7)."""
     g = _graph()
     es = _edge_set(g)
     cp = "sip0.cube0"
diff --git a/tests/test_tp_parallel_state.py b/tests/test_tp_parallel_state.py
index de2aae7..c2c6ccd 100644
--- a/tests/test_tp_parallel_state.py
+++ b/tests/test_tp_parallel_state.py
@@ -56,7 +56,7 @@ def test_initialize_mismatched_ws_raises(topology):
 
 def test_get_tp_rank_is_greenlet_local(topology):
     """D3: get_tensor_model_parallel_rank returns greenlet-local rank
-    (delegates to torch.distributed.get_rank, ADR-0024 D9)."""
+    (delegates to torch.distributed.get_rank, ADR-0024 D2)."""
     import kernbench.tp as tp
 
     with _make_ctx(topology) as ctx:
diff --git a/tests/test_verify_adr_lang_pairs.py b/tests/test_verify_adr_lang_pairs.py
new file mode 100644
index 0000000..190197a
--- /dev/null
+++ b/tests/test_verify_adr_lang_pairs.py
@@ -0,0 +1,107 @@
+"""Tests for tools/verify_adr_lang_pairs.py."""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+_REPO_ROOT = Path(__file__).resolve().parents[1]
+sys.path.insert(0, str(_REPO_ROOT / "tools"))
+
+import verify_adr_lang_pairs as v  # noqa: E402
+
+
+def _make_adr(
+    path: Path,
+    title_id: str,
+    title_text: str = "Some Title",
+    status: str = "Accepted",
+) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text(
+        f"# ADR-{title_id}: {title_text}\n\n"
+        f"## Status\n\n{status}\n\n"
+        f"## Context\n\nbody\n",
+        encoding="utf-8",
+    )
+
+
+def test_complete_pairs_pass(tmp_path: Path) -> None:
+    _make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001", "Foo EN")
+    _make_adr(tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001", "Foo KO")
+    assert v.verify(tmp_path) == []
+
+
+def test_empty_dirs_pass(tmp_path: Path) -> None:
+    assert v.verify(tmp_path) == []
+
+
+def test_missing_ko_fails(tmp_path: Path) -> None:
+    _make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001")
+    errs = v.verify(tmp_path)
+    assert any("missing KO" in e and "ADR-0001-foo-bar.md" in e for e in errs)
+
+
+def test_orphan_ko_fails(tmp_path: Path) -> None:
+    _make_adr(tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001")
+    errs = v.verify(tmp_path)
+    assert any("orphan KO" in e and "ADR-0001-foo-bar.md" in e for e in errs)
+
+
+def test_status_mismatch_fails(tmp_path: Path) -> None:
+    _make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001", status="Accepted")
+    _make_adr(tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001", status="Proposed")
+    errs = v.verify(tmp_path)
+    assert any("Status block mismatch" in e for e in errs)
+
+
+def test_title_id_mismatch_fails(tmp_path: Path) -> None:
+    _make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0002")
+    _make_adr(tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001")
+    errs = v.verify(tmp_path)
+    assert any("EN title ADR-ID" in e for e in errs)
+
+
+def test_multiline_status_with_parenthetical_passes(tmp_path: Path) -> None:
+    """Real ADRs like ADR-0001 have multi-line Status with revision notes."""
+    multiline_status = (
+        "Accepted (Revision 2 - 2026-04-27: concrete bit layout,\n"
+        "Supersedes ADR-0031.)"
+    )
+    _make_adr(
+        tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001", status=multiline_status
+    )
+    _make_adr(
+        tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001", status=multiline_status
+    )
+    assert v.verify(tmp_path) == []
+
+
+def test_crlf_normalization(tmp_path: Path) -> None:
+    """KO has CRLF, EN has LF; Status content is otherwise identical -> pass."""
+    en = tmp_path / "docs/adr/ADR-0001-foo-bar.md"
+    ko = tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md"
+    en.parent.mkdir(parents=True, exist_ok=True)
+    ko.parent.mkdir(parents=True, exist_ok=True)
+    en.write_bytes(
+        b"# ADR-0001: Foo\n\n## Status\n\nAccepted\n\n## Context\n\nbody\n"
+    )
+    ko.write_bytes(
+        b"# ADR-0001: Foo\r\n\r\n## Status\r\n\r\nAccepted\r\n\r\n## Context\r\n\r\nbody\r\n"
+    )
+    assert v.verify(tmp_path) == []
+
+
+def test_underscore_in_slug_recognized(tmp_path: Path) -> None:
+    """ADR-0013 uses an underscore in its slug; the regex must accept it."""
+    _make_adr(tmp_path / "docs/adr/ADR-0013-ver-verification_strategy.md", "0013")
+    _make_adr(tmp_path / "docs/adr-ko/ADR-0013-ver-verification_strategy.md", "0013")
+    assert v.verify(tmp_path) == []
+
+
+def test_main_exit_codes(tmp_path: Path, capsys) -> None:
+    assert v.main(["--root", str(tmp_path)]) == 0
+    _make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001")
+    assert v.main(["--root", str(tmp_path)]) == 1
+    out = capsys.readouterr().out
+    assert "FAILED" in out
diff --git a/tools/verify_adr_lang_pairs.py b/tools/verify_adr_lang_pairs.py
new file mode 100644
index 0000000..d0147d5
--- /dev/null
+++ b/tools/verify_adr_lang_pairs.py
@@ -0,0 +1,144 @@
+"""Verify ADR language pair invariants.
+
+Policy (see CLAUDE.md Part 2 -> ADR Translation Discipline):
+  docs/adr/        : English canonical
+  docs/adr-ko/     : Korean translation (1:1 mirror)
+  docs/adr-history/: frozen, not checked (transitional)
+  docs/adr-proposed/: language-free, not checked
+
+Checks:
+  - every docs/adr/<X>.md has a matching docs/adr-ko/<X>.md
+  - every docs/adr-ko/<X>.md has a matching docs/adr/<X>.md (no orphans)
+  - title line `# ADR-NNNN:` of each pair matches the filename's NNNN
+  - `## Status` block content is byte-equal (after CRLF/LF normalization)
+    between EN and KO
+
+Exit code: 0 if all OK, 1 if any mismatch.
+"""
+
+from __future__ import annotations
+
+import argparse
+import re
+import sys
+from pathlib import Path
+
+ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-[a-z0-9_-]+\.md$")
+TITLE_RE = re.compile(r"^# ADR-(\d{4}):")
+
+
+def _normalize(text: str) -> str:
+    return text.replace("\r\n", "\n").replace("\r", "\n")
+
+
+def find_adr_files(adr_dir: Path) -> dict[str, Path]:
+    if not adr_dir.is_dir():
+        return {}
+    return {
+        p.name: p
+        for p in sorted(adr_dir.iterdir())
+        if p.is_file() and ADR_FILENAME_RE.match(p.name)
+    }
+
+
+def extract_title_id(text: str) -> str | None:
+    lines = _normalize(text).splitlines()
+    if not lines:
+        return None
+    m = TITLE_RE.match(lines[0])
+    return m.group(1) if m else None
+
+
+def extract_status_block(text: str) -> str | None:
+    """Return content between `## Status` and the next `## ` heading, stripped.
+
+    Returns None if no `## Status` heading exists.
+    """
+    lines = _normalize(text).splitlines()
+    in_status = False
+    collected: list[str] = []
+    for line in lines:
+        if line.strip() == "## Status":
+            in_status = True
+            continue
+        if in_status and line.startswith("## "):
+            break
+        if in_status:
+            collected.append(line)
+    if not in_status:
+        return None
+    return "\n".join(collected).strip()
+
+
+def verify(root: Path) -> list[str]:
+    errors: list[str] = []
+    en_dir = root / "docs" / "adr"
+    ko_dir = root / "docs" / "adr-ko"
+
+    en_files = find_adr_files(en_dir)
+    ko_files = find_adr_files(ko_dir)
+
+    for name in en_files:
+        if name not in ko_files:
+            errors.append(f"missing KO translation: docs/adr-ko/{name}")
+    for name in ko_files:
+        if name not in en_files:
+            errors.append(f"orphan KO (no canonical EN): docs/adr-ko/{name}")
+
+    for name in sorted(en_files.keys() & ko_files.keys()):
+        m = ADR_FILENAME_RE.match(name)
+        assert m is not None
+        expected_id = m.group(1)
+
+        en_text = en_files[name].read_text(encoding="utf-8")
+        ko_text = ko_files[name].read_text(encoding="utf-8")
+
+        en_id = extract_title_id(en_text)
+        ko_id = extract_title_id(ko_text)
+        if en_id != expected_id:
+            errors.append(
+                f"{name}: EN title ADR-ID {en_id!r} != filename {expected_id!r}"
+            )
+        if ko_id != expected_id:
+            errors.append(
+                f"{name}: KO title ADR-ID {ko_id!r} != filename {expected_id!r}"
+            )
+
+        en_status = extract_status_block(en_text)
+        ko_status = extract_status_block(ko_text)
+        if en_status is None:
+            errors.append(f"{name}: EN missing `## Status` section")
+        if ko_status is None:
+            errors.append(f"{name}: KO missing `## Status` section")
+        if en_status is not None and ko_status is not None and en_status != ko_status:
+            errors.append(
+                f"{name}: Status block mismatch\n"
+                f"      EN: {en_status!r}\n"
+                f"      KO: {ko_status!r}"
+            )
+
+    return errors
+
+
+def main(argv: list[str] | None = None) -> int:
+    p = argparse.ArgumentParser(description=__doc__)
+    p.add_argument(
+        "--root",
+        type=Path,
+        default=Path.cwd(),
+        help="Repository root (default: cwd)",
+    )
+    args = p.parse_args(argv)
+
+    errors = verify(args.root)
+    if errors:
+        print("ADR language pair verification FAILED:")
+        for e in errors:
+            print(f"  - {e}")
+        return 1
+    print("ADR language pair verification OK")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/topology.yaml b/topology.yaml
index 6ccb9c3..3f83a46 100644
--- a/topology.yaml
+++ b/topology.yaml
@@ -78,15 +78,15 @@ cube:
       scheduler_to_fetch_store_mm: 0.5
       dma_to_tcm_bw_gbs:       512.0
       dma_to_tcm_mm:           0.5
-      dma_to_fetch_store_mm:   0.0     # DMA → fetch_store chaining (ADR-0021)
+      dma_to_fetch_store_mm:   0.0     # DMA → fetch_store chaining (ADR-0014 D6)
       fetch_store_to_tcm_bw_gbs: 512.0
       fetch_store_to_tcm_mm:   0.0
-      fetch_store_to_gemm_mm:  0.0     # fetch → GEMM chaining (ADR-0021)
-      fetch_store_to_math_mm:  0.0     # fetch → MATH chaining (ADR-0021)
-      gemm_to_fetch_store_mm:  0.0     # GEMM → store chaining (ADR-0021)
-      gemm_to_math_mm:         0.0     # GEMM → MATH epilogue chaining (ADR-0021)
-      math_to_fetch_store_mm:  0.0     # MATH → store chaining (ADR-0021)
-      fetch_store_to_dma_mm:   0.0     # store → DMA writeback chaining (ADR-0021)
+      fetch_store_to_gemm_mm:  0.0     # fetch → GEMM chaining (ADR-0014 D6)
+      fetch_store_to_math_mm:  0.0     # fetch → MATH chaining (ADR-0014 D6)
+      gemm_to_fetch_store_mm:  0.0     # GEMM → store chaining (ADR-0014 D6)
+      gemm_to_math_mm:         0.0     # GEMM → MATH epilogue chaining (ADR-0014 D6)
+      math_to_fetch_store_mm:  0.0     # MATH → store chaining (ADR-0014 D6)
+      fetch_store_to_dma_mm:   0.0     # store → DMA writeback chaining (ADR-0014 D6)
       gemm_to_tcm_bw_gbs:      512.0
       gemm_to_tcm_mm:          0.5
       math_to_tcm_bw_gbs:      512.0
@@ -99,7 +99,7 @@ cube:
     hbm_total_gb_per_cube: 48
     hbm_slices_per_cube: 8
     hbm_total_bw_gbs: 1024.0
-    hbm_mapping_mode: n_to_one        # one_to_one | n_to_one (ADR-0019)
+    hbm_mapping_mode: n_to_one        # one_to_one | n_to_one (ADR-0017 D8)
     hbm_pseudo_channels: 64           # total pseudo channels per cube
     hbm_channels_per_pe: 8            # = pseudo_channels / pes_per_cube
     hbm_channel_bw_gbs: 32.0          # per-channel bandwidth (GB/s)
@@ -123,7 +123,7 @@ cube:
     per_connection_bw_gbs: 128.0   # BW per connection; 4 × 128 = 512 GB/s = UCIe PHY BW
 
   links:
-    # Router mesh links (ADR-0019)
+    # Router mesh links (ADR-0017 D5)
     router_link_bw_gbs: 256.0        # inter-router XY mesh link BW
     router_overhead_ns: 2.0          # per-router switching overhead
     pe_to_router_bw_gbs: 256.0      # PE_DMA ↔ router (= N × channel_bw)