ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions).

codesign-hw.md deleted; ADR-0019/0021 moved to adr-history with one-line
stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata block
  replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4); ADR-0027
  cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml:
ADR-NNNN cross-references in docstrings and YAML comments updated after the
merges (ADR-0021->0014 D6, ADR-0019->0017 D8). No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00
parent 22fd0d2b9d
commit 687c98086d
97 changed files with 3286 additions and 3766 deletions
@@ -0,0 +1,318 @@
 ---
 description: Generate a public-facing architecture design document from approved ADRs and SPEC.md, with gap analysis reported to chat only.
 ---
 # `/report` — Architecture Design Document Generator
 Generates a **public-facing** architecture design document at
 `docs/report/architecture-{YYYY}-{1H|2H}.md` derived from the current ADR
 corpus, SPEC.md, CLAUDE.md, and the canonical component list.
 This command is **strictly read-only** on `docs/adr/`, `SPEC.md`,
 `CLAUDE.md`, and `src/`. The only write is the report file itself
 (a derived artifact under `docs/report/`).
 ---
 ## Invocation
 Two modes:
 - `/report` — **dry-run** (default). No file is written. The command
  reads sources, performs classification, and reports the planned TOC
  + gap analysis to chat only. Use this to validate ADR-to-section
  mapping before committing.
 - `/report write` — **write mode**. Performs the same procedure and
  writes `docs/report/architecture-{period}.md`. Use after a dry-run
  whose classification looks correct.
 Period determination (both modes), from system date:
 - month 1–6 → `{YYYY}-1H`
 - month 7–12 → `{YYYY}-2H`
 In write mode, if `docs/report/architecture-{period}.md` already exists,
 overwrite it without asking (regeneration is the expected operation).
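The period rule above can be sketched as a small helper. This is an illustrative sketch only; the function names `report_period` and `report_path` are hypothetical and not part of the command definition.

```python
from datetime import date


def report_period(today: date) -> str:
    """Map a date to '{YYYY}-1H' (months 1-6) or '{YYYY}-2H' (months 7-12)."""
    half = "1H" if today.month <= 6 else "2H"
    return f"{today.year}-{half}"


def report_path(today: date) -> str:
    """Build the report path for the period that contains `today`."""
    return f"docs/report/architecture-{report_period(today)}.md"
```

Because the path depends only on the half-year, repeated runs within the same period target the same file, which is why overwriting is the expected behavior.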
 ---
 ## Output Contract
 ### Document body (`docs/report/architecture-{period}.md`)
 Public release form. Reader is an external developer/architect. They do
 **not** have access to SPEC.md or ADR files. Therefore:
 - **No `ADR-NNNN` identifiers** in visible prose.
 - **No `SPEC R/§` identifiers** in visible prose.
 - **No internal jargon** assumed without definition.
 - **No diagram embeds** — only `<!-- DIAGRAM: ... -->` placeholders.
 - **Attribution via HTML comments** — every prose paragraph that derives
  from a source carries an inline comment immediately above it:
  `<!-- src: ADR-NNNN <section-name> -->` (multiple sources allowed).
 ### Chat-only report (not written to any file)
 After the run (in write mode, after writing the document), report to the user in the chat response:
 - File path written.
 - Section counts (e.g., "Detailed Architecture: 8 components covered,
  2 in `builtin/` have no ADR backing").
 - **G1 gaps** — SPEC requirements (R-numbers / §) with no ADR citing them.
 - **G2 gaps** — ADRs missing **Context** or **Decision**. Alternatives
  and Consequences are optional; their absence is NOT a gap.
 - **G3 gaps** — ADR cross-references without a back-reference.
 - **G4 suggestions** — areas where an ADR seems missing based on the
  ADR corpus + SPEC reading. Phrase as suggestions, not findings. Each
  G4 item must say *why* it's suggested and remain falsifiable.
 - **G5 consistency issues** — ADR-to-ADR inconsistencies:
  - **G5a (supersession not reflected)** — ADR-A states it supersedes
    ADR-B, but ADR-B's Status is not marked as Superseded.
  - **G5b (merge candidates)** — two or more ADRs cover near-identical
    scope (detected naturally during section assignment, not via
    exhaustive pair-wise scan).
  - **G5c (explicit contradictions)** — two ADRs whose Decisions
    directly oppose each other. Must cite both quotations; do not
    speculate contradictions from topical similarity alone.
 - **TOC rationale** — for each section, list contributing ADR IDs
  (this is for the user's verification only, never written to the
  document itself).
 G4 must never appear in the document body. G1–G3 and G5 are also chat-only.
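G3 and G5a are the two gap classes that reduce to mechanical checks over per-ADR data (cross-references, Status, and "supersedes" text). A minimal sketch follows, assuming each ADR has already been reduced to a `(status, body)` pair; the record layout, the regular expressions, and all function names are assumptions made for illustration, not the command's actual implementation.

```python
import re
from dataclasses import dataclass, field

ADR_ID = re.compile(r"ADR-\d{4}")
SUPERSEDES = re.compile(r"[Ss]upersedes\s+(ADR-\d{4})")


@dataclass
class AdrRecord:
    adr_id: str
    status: str
    body: str
    refs: set = field(default_factory=set)


def index_adrs(docs: dict) -> dict:
    """docs maps an ADR id to (status, body); self-references are ignored."""
    records = {}
    for adr_id in sorted(docs):
        status, body = docs[adr_id]
        refs = set(ADR_ID.findall(body)) - {adr_id}
        records[adr_id] = AdrRecord(adr_id, status, body, refs)
    return records


def g3_missing_backrefs(records: dict) -> list:
    """Pairs (A, B) where A cites B but B does not cite A."""
    gaps = []
    for a in sorted(records):
        for b in sorted(records[a].refs):
            if b in records and a not in records[b].refs:
                gaps.append((a, b))
    return gaps


def g5a_unreflected_supersessions(records: dict) -> list:
    """Triples (A, B, status) where A says it supersedes B but B's Status disagrees."""
    issues = []
    for a in sorted(records):
        for b in sorted(set(SUPERSEDES.findall(records[a].body))):
            if b in records and not records[b].status.startswith("Superseded"):
                issues.append((a, b, records[b].status))
    return issues
```

Iterating in sorted order keeps the chat report stable across runs, in line with the determinism requirement.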
 ---
 ## Procedure
 ### Step 1 — Determine period
 Use current system date. Compute `{YYYY}-1H` or `{YYYY}-2H`.
 ### Step 2 — Ingest ADRs
 For each `docs/adr/ADR-NNNN-*.md`:
 - If both `ADR-NNNN-*.md` (Korean) and `ADR-NNNN-*.en.md` (English)
  exist for the same number, **prefer the Korean `.md`** version.
 - Parse for the four canonical sections: Context, Decision, Alternatives
  (also accept "Alternatives Considered"), Consequences.
 - Record presence/absence of **Context** and **Decision** for G2.
  Alternatives and Consequences presence is recorded for use during
  authoring, but their absence is not a gap.
 - Record ADR-NNNN cross-references for G3.
 - Record Status (e.g., Accepted, Superseded, Draft) and any "supersedes
  ADR-NNNN" text in the body for G5a.
 Process ADRs in **numerical order** for determinism.
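The file-selection part of this step (prefer the Korean `.md` over `.en.md`, then process in numerical order) could look roughly like the sketch below. The filename pattern and the helper name `select_adr_files` are assumptions for illustration.

```python
import re
from pathlib import PurePosixPath

# Matches ADR-NNNN-<anything>.md and ADR-NNNN-<anything>.en.md
ADR_FILE = re.compile(r"^ADR-(\d{4})-.*?(\.en)?\.md$")


def select_adr_files(names):
    """Pick one file per ADR number, preferring `.md` over `.en.md`."""
    chosen = {}  # number -> (is_english, name)
    for name in names:
        match = ADR_FILE.match(PurePosixPath(name).name)
        if match is None:
            continue  # not an ADR file
        number = int(match.group(1))
        is_english = match.group(2) is not None
        current = chosen.get(number)
        # Take the first file seen; replace an English pick if a Korean one shows up.
        if current is None or (current[0] and not is_english):
            chosen[number] = (is_english, name)
    return [chosen[number][1] for number in sorted(chosen)]
```

An ADR that exists only as `.en.md` is still picked up, so a missing Korean file does not silently drop the ADR from the report.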
 ### Step 3 — Read canonical component list
 List `src/kernbench/components/builtin/*.py`, excluding `__init__.py`,
 `pe_types.py`, and `__pycache__/`. Sort alphabetically. This is the
 canonical order for Detailed Architecture subsections.
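A minimal sketch of this listing step, assuming component names are taken from the file stems; the helper name `canonical_components` is hypothetical.

```python
from pathlib import Path

EXCLUDED = {"__init__.py", "pe_types.py"}


def canonical_components(builtin_dir):
    """Alphabetical component names from the top-level *.py files of builtin_dir."""
    # A non-recursive glob never descends into __pycache__/, so that
    # directory needs no explicit exclusion here.
    return sorted(
        path.stem
        for path in Path(builtin_dir).glob("*.py")
        if path.name not in EXCLUDED
    )
```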
 ### Step 4 — Read SPEC.md and CLAUDE.md
 For G1 detection: extract every `R<N>` and `§<X.Y>` identifier mentioned
 in SPEC.md. For each ADR, check which of these it cites. SPEC IDs with
 zero citing ADRs → G1.
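G1 detection is a set difference between the identifiers found in SPEC.md and those cited by any ADR. The sketch below assumes identifiers look like `R7` or `§3.2`; the exact regular expression and the function names are assumptions, and a real SPEC may need a stricter pattern.

```python
import re

# R<N> as a standalone token, or § followed by dotted numbers.
SPEC_ID = re.compile(r"\bR\d+\b|§\d+(?:\.\d+)*")


def spec_ids(text):
    return set(SPEC_ID.findall(text))


def _sort_key(spec_id):
    # R-numbers first, then sections; numeric order within each group.
    kind = 0 if spec_id.startswith("R") else 1
    return (kind, tuple(int(part) for part in spec_id[1:].split(".")))


def g1_gaps(spec_text, adr_texts):
    """SPEC identifiers that no ADR body cites."""
    cited = set()
    for body in adr_texts.values():
        cited |= spec_ids(body)
    return sorted(spec_ids(spec_text) - cited, key=_sort_key)
```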
 ### Step 5 — Section assignment
 Assign each ADR to exactly one of:
 - **Design Principles** — project-wide rationale, philosophy, mission
  (e.g., "why source-level kernel execution", "why fast multi-device
  scaling"). Includes ADRs that describe foundational invariants
  (e.g., latency model assumptions, verification strategy).
 - **High-level Architecture** — Tray / SIP / CUBE / PE hierarchy and
  cross-layer boundaries (e.g., runtime API ↔ sim_engine ↔ components).
 - **Detailed Architecture** — single-component internal designs. One
  subsection per file in the canonical component list. ADRs whose
  primary topic is the internal structure of one component go here.
 - **Implementation Decisions** — **cross-cutting** algorithms / policies
  / schemes / models that don't belong to a single component:
  collective algorithms, parallelization policies, address schemes,
  routing algorithms, model assumptions.
 Boundary rule between Detailed Architecture and Implementation Decisions:
 > Detailed Architecture = component-internal.
 > Implementation Decisions = spans multiple components OR is an
 > algorithm/policy/scheme/assumption rather than a structural choice.
 If an ADR fits two sections plausibly, prefer the one that minimizes
 duplication and pick the more specific bucket (Detailed if it primarily
 concerns one component, else Implementation Decisions).
 During classification, opportunistically detect ADR consistency issues:
 - **G5b (merge candidate)** — if two or more ADRs land in the same
  Detailed Architecture subsection or the same Implementation Decisions
  topic AND their primary scope is near-identical, record as a merge
  candidate. Topical adjacency is not enough; the scopes must be
  effectively the same question.
 - **G5c (explicit contradiction)** — if while reading you encounter two
  ADRs whose Decisions directly oppose each other on the same question,
  record both quotations verbatim with their ADR IDs. Do NOT speculate
  contradictions from similarity, vocabulary, or domain overlap — only
  explicit, citable opposition.
 Do NOT perform an exhaustive pair-wise scan of all ADRs. G5b/G5c are
 byproducts of normal reading; if not encountered, the chat report
 shows "(none)".
 ### Step 6 — Write the document (write mode only)
 In **dry-run mode**, skip this step entirely. Proceed directly to Step 7.
 ```markdown
 # KernBench — Architecture Design Document
 *{YYYY} {1H|2H}*
 ## Design Principles
 <prose>
 ## High-level Architecture
 <intro prose>
 ### Tray
 ### SIP
 ### CUBE
 ### PE
 ## Detailed Architecture
 ### <component-1>
 ### <component-2>
 ...
 ## Implementation Decisions
 ### <topic-1>
 ### <topic-2>
 ...
 ```
 #### Authoring rules (apply to every section)
 - **Stay grounded.** Every claim must trace to an ADR's stated content
  (Context / Decision / Alternatives / Consequences). No invented
  motivation, no invented alternatives, no invented trade-offs.
 - **4-part discipline, naturally.** Each subsection should naturally
  cover: the problem the design addresses, the decision made, the
  alternatives considered, the consequences. Do **not** label these
  with rigid headers like "**Problem.**" — weave them into prose. But
  ensure all four are present *if the source ADR documents them*.
 - **Missing → omit, not fabricate.** If a source ADR has no
  "Alternatives" section, do **not** invent alternatives for the
  report. Simply write the remaining parts; record G2 in chat only
  when the missing section is Context or Decision.
 - **Attribution.** Every paragraph derived from one or more ADRs
  carries an HTML comment immediately above:
  `<!-- src: ADR-NNNN <section> [, ADR-MMMM <section>] -->`.
 - **Diagram placeholders.** Where a diagram would help, insert
  `<!-- DIAGRAM: <short description of what the diagram should show> -->`
  on its own line. **Never** embed an image (`![...](...)`).
 - **Public tone.** Self-contained. Define internal terms (SIP, CUBE,
  PE, Tray, NOC, IPCQ, TCM, etc.) on first use within the document.
  Do not assume reader has read SPEC or ADRs.
 - **No internal references.** No `ADR-NNNN` in body text. No
  `SPEC §X.Y` or `R<N>` in body text. These appear only inside HTML
  attribution comments.
 - **Detailed Architecture component subsections.** Use the canonical
  list from Step 3 in order. For each component file, write a
  subsection drawing from any ADR that primarily concerns that
  component. If no ADR covers a component, write a one-line stub
  noting the component exists and flag it in chat report. If an ADR
  covers a topic not in the canonical list, place it under
  "Detailed Architecture → Other" (sub-subsection) and flag for
  canonical-list extension in chat.
 - **Implementation Decisions topic naming.** Derive topic names from
  ADR titles, made reader-friendly (no ADR number). Group related
  ADRs under one topic when natural (e.g., multiple address-related
  ADRs under "Address Scheme").
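Several of these rules (no internal IDs in visible prose, no image embeds) can be checked mechanically on a generated report. The following is a rough lint sketch, not part of the command itself; the check labels and patterns are assumptions, and bare `R<N>` tokens are deliberately not flagged because they are too ambiguous to match reliably.

```python
import re

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

CHECKS = {
    "ADR identifier in body": re.compile(r"ADR-\d{4}"),
    "SPEC identifier in body": re.compile(r"SPEC\s+(?:R\d+|§)|§\d"),
    "embedded image": re.compile(r"!\[[^\]]*\]\([^)]*\)"),
}


def lint_report_body(markdown):
    """Return the labels of rules violated by the visible (non-comment) text."""
    visible = HTML_COMMENT.sub("", markdown)
    return [label for label, pattern in CHECKS.items() if pattern.search(visible)]
```

Stripping HTML comments first is what lets attribution comments and diagram placeholders keep their internal identifiers without tripping the lint.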
 ### Step 7 — Generate chat report
 After Step 6 (write mode) or directly from Step 5 (dry-run mode),
 emit the following to chat. Do **not** write any of this to a file.
 In **dry-run mode**, replace the `Wrote:` line with:
 `**DRY-RUN — no file written.** Review TOC and gaps below. Run \`/report write\` to commit.`
 ```
 ## /report — Generation Summary
 **Wrote:** docs/report/architecture-{period}.md
 **Section coverage**
 - Design Principles: <N> ADRs
 - High-level Architecture: <N> ADRs
 - Detailed Architecture: <covered>/<total> components ; components without ADR: [...]
 - Implementation Decisions: <N> topics, <N> ADRs
 **TOC rationale (ADR → section mapping)**
 - Design Principles: ADR-NNNN, ADR-MMMM
 - High-level Architecture: ...
 - Detailed Architecture → <component>: ADR-NNNN
 - Implementation Decisions → <topic>: ADR-NNNN, ADR-MMMM
 **G1 — SPEC requirements without ADR support**
 - R<N> / §<X.Y>: not cited by any ADR
 - (or "none")
 **G2 — ADRs missing required sections (Context or Decision)**
 - ADR-NNNN: missing <Context|Decision>
 - (or "none")
 **G3 — Broken cross-references**
 - ADR-NNNN cites ADR-MMMM; ADR-MMMM does not back-reference
 - (or "none")
 **G4 — Suggested topics that may warrant a new ADR (verify before acting)**
 - <topic>: <why agent thinks it may be missing — must be falsifiable>
 - (or "none")
 **G5 — ADR consistency issues**
 - **G5a (supersession not reflected)**
  - ADR-NNNN claims to supersede ADR-MMMM, but ADR-MMMM Status is "<status>"
  - (or "none")
 - **G5b (merge candidates)**
  - ADR-NNNN + ADR-MMMM: near-identical scope on <topic> — evaluate merge
  - (or "none")
 - **G5c (explicit contradictions)**
  - ADR-NNNN says "<quote>"; ADR-MMMM says "<quote>" — direct opposition on <question>
  - (or "none")
 ```
 ---
 ## Constraints (do not violate)
 1. **Read-only on source.** No writes to `docs/adr/`, `SPEC.md`,
   `CLAUDE.md`, or `src/`. Only write is
   `docs/report/architecture-{period}.md`.
 2. **No fabrication.** Every body paragraph traces to ADR content via
   HTML attribution comment.
 3. **No diagram embeds.** Placeholders only.
 4. **No internal IDs in body.** ADR-NNNN and SPEC R/§ stay inside
   HTML comments only.
 5. **Determinism.** ADRs processed in numerical order; components in
   canonical (alphabetical) order. Same inputs → same output.
 6. **G4 stays in chat.** Never written to the document.
 7. **Korean bilingual preference.** When both `.md` and `.en.md`
   exist for the same ADR number, use `.md`.
 8. **All ADRs included.** No exclusion list. ADRs about internal
   tooling (CLI, diagram views, verification strategy) are still
   included — usually under Design Principles or Implementation
   Decisions, written in publishable form.
 ---
 ## Failure modes to avoid
 - **Padding** with general background not present in the source ADRs.
 - **Inferring alternatives** the ADR doesn't mention.
 - **Quietly skipping** an ADR because it seems internal. Include it,
  rephrase for public audience.
 - **Inventing components** not in `src/kernbench/components/builtin/`.
 - **Auto-selecting diagrams** from `docs/diagrams/`. Only placeholders.
 - **Promoting G4 suggestions to the document.** They stay in chat.
@@ -30,7 +30,10 @@
      "Bash(python -m pytest tests/test_pe_components.py -v)",
      "Bash(python -m pytest tests/test_triton_emu.py -v)",
      "Bash(python -m pytest tests/test_pe_components.py tests/test_triton_emu.py -v)",
-      "Bash(python -m pytest tests/test_pe_components.py::test_mcpu_multi_pe_kernel_launch tests/test_pe_components.py::test_qkv_gemm_bench_multi_pe_completes -v)"
+      "Bash(python -m pytest tests/test_pe_components.py::test_mcpu_multi_pe_kernel_launch tests/test_pe_components.py::test_qkv_gemm_bench_multi_pe_completes -v)",
      "Bash(git add:*)",
      "Bash(git commit:*)",
      "Bash(git push:*)"
    ]
  }
 }
@@ -29,4 +29,6 @@ build/
 # Logs
 *.log
-.claude/
+.claude/*
 !.claude/commands/
 !.claude/commands/*.md
@@ -218,17 +218,43 @@ General fallbacks. Apply to anything not explicitly covered above.
 ### ADR Lifecycle
- `docs/adr/` contains ADRs reflecting current implementation or
-  work-in-progress designs.
- `docs/history/` contains superseded ADRs as historical record.
- When an ADR is superseded:
-  1. The superseding ADR includes a "Supersedes ADR-NNNN" line.
-  2. The superseded ADR's Status is set to "Superseded by ADR-MMMM".
-  3. The superseded ADR file is **moved** (git mv) to `docs/history/`.
- Cross-references between ADRs use the ADR-NNNN ID and remain
-  valid regardless of file location.
- ADR numbers are **immutable**; never renumber. Numbering holes
-  from moved ADRs are expected.
+ADRs live in one of three folders based on lifecycle state:
+
+- `docs/adr/` — **Accepted** (current implementation reflected).
+- `docs/adr-proposed/` — **Proposed**, **Stub**, or **Draft** (design
+  only / future-work exploration / retroactive documentation pending
+  verification).
+- `docs/adr-history/` — **Superseded** or **Merged** (no longer the
+  authoritative source; kept as historical record).
+
+Status field values:
+
 - `Accepted` — design is in current implementation.
 - `Proposed` — design is concrete but not yet implemented.
 - `Stub (Future Work)` — design space exploration; no commitment yet.
 - `Draft` — retroactive documentation drafted but not yet verified
  against the implementation it describes.
 - `Superseded by ADR-NNNN` — replaced by another ADR.
 - `Merged into ADR-NNNN` — content absorbed by another ADR.
 Transitions:
 - **Proposed/Stub → Accepted**: when the ADR's decisions are
  reflected in production code AND covered by tests. `git mv` from
  `docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`.
 - **Draft → Accepted**: when the ADR's text has been verified to
  accurately describe the existing implementation. `git mv` from
  `docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`.
 - **Accepted → Superseded**: set Status to `Superseded by ADR-MMMM`
  and `git mv` to `docs/adr-history/`. The superseding ADR includes
  a "Supersedes ADR-NNNN" reference (or, for partial supersession of
  clauses, documents this in its own body).
 - **Accepted → Merged**: set Status to `Merged into ADR-MMMM`
  (single-line stub) and `git mv` to `docs/adr-history/`.
 Cross-references between ADRs use the `ADR-NNNN` ID and remain valid
 regardless of folder location. ADR numbers are **immutable**; never
 renumber. Numbering holes from moved ADRs are expected.
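The status values and transitions above amount to a small state machine with a folder attached to each state. A sketch under two assumptions: that the four listed transitions are the only permitted ones, and that the helper names (`status_kind`, `plan_transition`) are hypothetical rather than existing tooling.

```python
FOLDER_BY_STATUS = {
    "Accepted": "docs/adr",
    "Proposed": "docs/adr-proposed",
    "Stub (Future Work)": "docs/adr-proposed",
    "Draft": "docs/adr-proposed",
    "Superseded": "docs/adr-history",
    "Merged": "docs/adr-history",
}

# Assumption: only the transitions listed in the lifecycle text are allowed.
ALLOWED = {
    ("Proposed", "Accepted"),
    ("Stub (Future Work)", "Accepted"),
    ("Draft", "Accepted"),
    ("Accepted", "Superseded"),
    ("Accepted", "Merged"),
}


def status_kind(status):
    """Reduce 'Superseded by ADR-NNNN' / 'Merged into ADR-NNNN' to its kind."""
    for prefix in ("Superseded", "Merged"):
        if status.startswith(prefix):
            return prefix
    return status


def plan_transition(old_status, new_status):
    """Return (source_folder, target_folder) for the `git mv`, or raise."""
    old, new = status_kind(old_status), status_kind(new_status)
    if (old, new) not in ALLOWED:
        raise ValueError(f"transition not allowed: {old_status!r} -> {new_status!r}")
    return FOLDER_BY_STATUS[old], FOLDER_BY_STATUS[new]
```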
 ## Terminology
@@ -155,5 +155,6 @@ kernbench/
 ## Documentation
 - [CHANGES.md](CHANGES.md) — changelog with detailed descriptions of each release
- [docs/latency-model.md](docs/latency-model.md) — latency model explanation with worked examples
+- [docs/onboarding/latency-model.md](docs/onboarding/latency-model.md) — latency model explanation with worked examples
 - [docs/onboarding/](docs/onboarding/) — onboarding guides (architecture overview, latency model, CCL author guide, intro presentation)
 - [docs/adr/](docs/adr/) — Architecture Decision Records
@@ -51,7 +51,7 @@ Major architectural decisions are documented in ADRs and referenced by number.
 - ADR-0007: runtime_api vs sim_engine responsibility boundaries
 - ADR-0008: Tensor deployment and allocation (Host allocator, PA-first)
 - ADR-0009: Kernel execution fan-out and completion semantics
- ADR-0010: CLI device selection and multi-device execution semantics
+- ADR-0010: Command line interface and execution semantics
 - ADR-0011: Memory Addressing — PA / VA / LA Address Models
 - ADR-0012: Host ↔ IO_CPU message schema (PA-first, PE-tagged shards)
 - ADR-0013: Verification strategy and Phase 1 test plan
@@ -0,0 +1,5 @@
 # ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC
 ## Status
 Merged into ADR-0017 (Cube NOC and HBM Connectivity).
@@ -0,0 +1,5 @@
 # ADR-0019: CUBE NOC 내 Per-Channel 및 Aggregated HBM 연결 모델
 ## Status
 Merged into ADR-0017 (Cube NOC and HBM Connectivity).
@@ -0,0 +1,5 @@
 # ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing
 ## Status
 Merged into ADR-0014 (PE Pipeline Execution Model).
@@ -0,0 +1,5 @@
 # ADR-0021: PE 파이프라인 리팩토링 — 컴포넌트 분리 + Scheduler 기반 라우팅
 ## Status
 Merged into ADR-0014 (PE Pipeline Execution Model).
@@ -257,5 +257,5 @@ PhysAddr encoding. 호출자는 어느 range인지 몰라도 됨.
 |------|--------|
 | `src/kernbench/policy/address/phyaddr.py` | Range table (`PE_RESOURCE_MAP`), range-based decode, 신규 component-specific factory들 (`pe_ipcq_addr` 등), 기존 `pe_tcm_addr` 내부 인코딩 갱신 |
 | `src/kernbench/policy/address/allocator.py` | Range-aware pool 분리 (TCM pool / IPCQ pool / scratchpad pool 등 per-PE) |
-| `docs/adr/ADR-0001-physaddr-layout.md` | Amendment note: range-based PE resource partition |
+| `docs/adr/ADR-0001-mem-physaddr-layout.md` | Amendment note: range-based PE resource partition |
 | `tests/test_phyaddr.py` | Range table 검증, 각 factory의 encode/decode round-trip, 기존 `pe_tcm_addr` 회귀 |
@@ -340,7 +340,7 @@ encoding can be plugged in later" 약속이 이행된 것.
 | `src/kernbench/sim_engine/memory_store.py` | D3: IPCQ buffer가 기존 space와 공유되는지 검증 |
 | `src/kernbench/sim_engine/engine.py` | D4: IPCQ token routing이 PhysAddr-based fabric 경로 사용 |
 | `src/kernbench/ccl/diagnostics.py` | D5: pointer_dump를 PhysAddr 포매팅으로 개선 |
-| `docs/adr/ADR-0023-ipcq-pe-collective.md` | D6: D2.5 amendment note |
+| `docs/adr/ADR-0023-dev-ipcq-pe-collective.md` | D6: D2.5 amendment note |
 | `tests/test_ipcq_physaddr.py` (new) | T1 |
 | `tests/test_ipcq_alloc.py` (new) | T2 |
 | `tests/test_ccl_install_plan.py` | T3 확장 |
@@ -35,7 +35,7 @@ shortcuts that obscure control paths.
 ### D3. Bypass is explicit and graph-represented
 - All paths must be explicitly represented in the graph and subject to latency accumulation.
- Example: PE_DMA connects to the NOC router mesh (ADR-0019). All destinations
+- Example: PE_DMA connects to the NOC router mesh (ADR-0017 D7). All destinations
  (HBM, shared SRAM, inter-cube UCIe) are reached via explicit mesh hops.
  Local HBM access has minimal hops (switching overhead only); remote access
  traverses additional routers.
@@ -15,7 +15,7 @@ Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth,
 - Each PE is assigned a logically defined “local HBM” region.
 - Local HBM corresponds to the pseudo-channel subset directly attached to that PE’s
-  router in the NOC mesh (ADR-0019).
+  router in the NOC mesh (ADR-0017 D4).
 - The path is: PE_DMA → local router → HBM_CTRL (switching overhead only, 0 mesh hops).
 - The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration.
@@ -20,7 +20,9 @@ Diagrams must reflect this distance by default.
 ---
-## Global Defaults
+## Decision
 ### D1. Global Defaults
 - All diagrams MUST be **distance-aware by default**.
 - All diagrams MUST render **representative views** of the architecture.
@@ -31,7 +33,7 @@ Diagrams must reflect this distance by default.
 ---
-## Representative Rendering Rule
+### D2. Representative Rendering Rule
 - All CUBEs share the same internal structure.
 - All PEs share the same internal structure.
@@ -47,9 +49,9 @@ unless explicitly requested.
 ---
-## Diagram Views
+### D3. Diagram Views
-### View A — SIP-Level Diagram
+#### View A — SIP-Level Diagram
 **Purpose**
 Explain system-scale structure and connectivity.
@@ -75,7 +77,7 @@ Explain system-scale structure and connectivity.
 ---
-### View B — CUBE-Level Diagram
+#### View B — CUBE-Level Diagram
 **Purpose**
 Explain cube-internal structure and data/control flow.
@@ -106,7 +108,7 @@ Explain cube-internal structure and data/control flow.
 ---
-### View C — PE-Level Diagram
+#### View C — PE-Level Diagram
 **Purpose**
 Explain internal PE behavior and execution structure.
@@ -128,14 +130,14 @@ Explain internal PE behavior and execution structure.
 ---
-## Distance-Aware Layout (Default)
+### D4. Distance-Aware Layout (Default)
-### Distance definition
+#### Distance definition
 - Distance is defined as **accumulated latency**, consistent with ADR-0002.
 - Distance is computed from a single anchor node.
-### Default anchor selection
+#### Default anchor selection
 - SIP view: IO chiplet (or Host CPU if present)
 - CUBE view: a representative PE
@@ -143,7 +145,7 @@ Explain internal PE behavior and execution structure.
 Anchors are **implicit defaults** and MUST NOT be required to be specified.
-### Layout rules
+#### Layout rules
 - Diagrams MUST be laid out in layers based on distance buckets.
 - Layout direction MUST be consistent within a view type
@@ -156,7 +158,7 @@ without affecting distance semantics.
 ---
-## Generation Contract (for Tools / Claude Code)
+### D5. Generation Contract (for Tools / Claude Code)
 When generating diagrams:
@@ -63,7 +63,7 @@ For each view (SIP / CUBE / PE):
 - CUBE-level projection MUST include:
  - Router mesh (from cube_mesh.yaml), HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
    and PEs as opaque blocks.
-  - All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0019).
+  - All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0017).
 - Default anchors are implicit (ADR-0005) and MUST NOT require instance indices.
 ### D6. Output formats and determinism
@@ -42,21 +42,25 @@ The runtime API MUST NOT:
 ---
-### D2. Simulation engine executes and schedules requests
+### D2. Simulation engine wires components and tracks completion
 The simulation engine (sim_engine) MUST:
- inject requests into the compiled topology graph,
+- wire components at initialization (create port stores + start wire
  processes per the component port/wire framework — ADR-0015),
 - inject requests into the compiled topology graph at entry components
  (e.g., PCIE_EP for memory operations, IO_CPU for kernel launch),
 - schedule and execute events using a discrete-event model,
- manage correlation ids and completion tracking,
+- manage correlation ids and completion tracking.
 - decompose operations into low-level requests when required
  (e.g., MemoryWrite events).
 The simulation engine MUST NOT:
 - define tensor semantics,
 - define kernel execution policies,
- expose internal graph details to the runtime API.
+- expose internal graph details to the runtime API,
 - walk the topology path during request execution,
 - call component `run()` methods directly,
 - track per-hop latency or decompose fan-out (components own this).
 ---
@@ -87,3 +91,5 @@ component-level fan-out explicitly.
 - SPEC R4, R7, R8
 - ADR-0008 (Tensor deployment)
 - ADR-0009 (Kernel execution)
 - ADR-0015 (Component port/wire model and engine role)
 - ADR-0010 (CLI surface and execution semantics — runtime API consumer)
@@ -142,3 +142,5 @@ control plane — runtime API and application kernels are unchanged.
 - SPEC R1, R2, R7, R8
 - ADR-0007 (Runtime API boundaries)
 - ADR-0008 (Tensor deployment)
 - ADR-0013 (Verification strategy — V2 fan-out tests)
 - ADR-0015 D4 (concrete fabric path for kernel launch)
@@ -0,0 +1,131 @@
 # ADR-0010: Command Line Interface and Execution Semantics
 ## Status
 Accepted
 ## Context
 The `kernbench` CLI is the user-facing entry point of the simulator. It
 exposes three subcommands:
 - `run` — execute a benchmark against a topology.
 - `probe` — diagnostic utility for latency / BW measurement.
 - `web` — interactive topology viewer.
 Device enumeration is centralized in the CLI; neither the runtime API
 nor the simulation engine enumerates devices. Benchmarks remain
 single-device by design and accept a device identifier as input.
 ## Decision
 ### D1. Benchmark contract — single-device by design
 - A benchmark MUST define behavior for a single device only.
 - A benchmark MUST accept a device identifier as input.
 - Benchmarks MUST NOT enumerate or loop over multiple devices.
 Multi-device execution is the CLI's concern (D3), not the benchmark's.
 ### D2. `kernbench run` — benchmark execution
 Required arguments:
 - `--topology <path>`: topology YAML file path. Loaded via
  `resolve_topology()`.
 - `--bench <name>`: benchmark name. Resolved via
  `benches.loader.resolve_bench()`.
 Optional arguments:
 - `--device <selector>` (default: `all`):
  - `all` — run once per discovered SIP (see D3).
  - `sip:<N>` — run only on SIP N.
  - Parsed via `resolve_device()`.
 - `--verify-data` (default: off) — enable Phase 2 data verification
  (see ADR-0020). When set, `engine_factory` constructs the engine
  with `enable_data=True`. After the benchmark runs, a diagnostic
  summary of recorded ops is printed.
 Each invocation uses a single simulation instance: the benchmark runs
 once for `sip:<N>`, or once per discovered SIP for `all` (see D3).
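The `--device` selector grammar described in D2 can be illustrated with a small parser. This is a sketch only and is not the repository's `resolve_device()`; the function name `parse_device_selector`, the `sip_count` parameter, and zero-based SIP numbering are assumptions.

```python
def parse_device_selector(selector, sip_count):
    """Expand 'all' or 'sip:<N>' into the list of SIP indices to run on."""
    if selector == "all":
        return list(range(sip_count))
    kind, sep, index = selector.partition(":")
    if kind != "sip" or not sep or not index.isdigit():
        raise ValueError(f"unknown device selector: {selector!r}")
    n = int(index)
    if n >= sip_count:
        raise ValueError(f"sip:{n} not present (topology has {sip_count} SIPs)")
    return [n]
```

Returning a list in both cases lets the caller submit one benchmark execution per entry to the same simulation instance, whether the user asked for one SIP or all of them.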
 ### D3. Multi-device execution is logically parallel
 When `--device all` (or omitted) and the topology has multiple SIPs:
 - Benchmark executions are submitted to a single simulation engine
  instance.
 - Executions are logically parallel in simulation time.
 - Inter-device contention is naturally modeled (shared fabric
  bandwidth, cross-SIP traffic, etc.).
 The CLI does NOT spawn multiple OS processes or independent
 simulation runs — parallelism is internal to one simulation instance.
 ### D4. `kernbench probe` — latency / BW diagnostic utility
 Required argument:
 - `--topology <path>`: topology YAML file path.
 Optional argument:
 - `--case <name>` (default: `all`) — run a predefined traffic
  pattern, or `all` to run every defined case.
 Probe runs each pattern through the simulation engine and reports
 per case:
 - End-to-end latency (ns).
 - Effective bandwidth (nbytes / total_ns).
 - Bottleneck bandwidth (min edge BW along the chosen path).
 - Utilization (effective / bottleneck).
 Probe additionally validates monotonicity invariants — for example
 that local-HBM access ≤ cross-PE-within-cube ≤ cross-cube ≤
 cross-SIP — and reports violations. Probe is a developer tool for
 verifying the latency / BW model; it is not a benchmark.
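The per-case metrics and the monotonicity check in D4 can be sketched as follows. The `ProbeResult` type, its field names, and the assumption that the bottleneck bandwidth is expressed in GB/s are illustrative only; the sample numbers in the usage note are made up.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    case: str
    nbytes: int
    total_ns: float
    bottleneck_gbs: float  # assumed unit: GB/s (min edge BW on the path)

    @property
    def effective_gbs(self):
        # bytes per nanosecond equals GB/s (with 1 GB = 1e9 bytes)
        return self.nbytes / self.total_ns

    @property
    def utilization(self):
        return self.effective_gbs / self.bottleneck_gbs


def monotonicity_violations(ordered_results):
    """Adjacent (nearer, farther) pairs where the nearer case is slower."""
    violations = []
    for nearer, farther in zip(ordered_results, ordered_results[1:]):
        if nearer.total_ns > farther.total_ns:
            violations.append((nearer.case, farther.case))
    return violations
```

For example, a 4096-byte transfer that completes in 64 ns has an effective bandwidth of 64 GB/s; against a 256 GB/s bottleneck that is a utilization of 0.25.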
 ### D5. `kernbench web` — topology viewer
 Optional arguments:
 - `--port <N>` (default: `8765`) — HTTP port.
 - `--no-open` — do not auto-open the browser.
 Launches a local HTTP server that renders the compiled topology in
 the browser. Distinct from the static `docs/diagrams/` artifacts:
 - `docs/diagrams/` files are derived at topology-compile time
  (ADR-0006).
 - `kernbench web` is interactive — pan/zoom, hover for component
  attributes, switch between SIP / CUBE / PE views.
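The command-line surface described in D2, D4, and D5 can be summarized with a standard-library `argparse` sketch. Only the flags and defaults come from the text above; the parser structure and the `build_parser` name are assumptions and may differ from the real CLI.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="kernbench")
    sub = parser.add_subparsers(dest="command", required=True)

    # D2: benchmark execution
    run = sub.add_parser("run")
    run.add_argument("--topology", required=True)
    run.add_argument("--bench", required=True)
    run.add_argument("--device", default="all")  # 'all' or 'sip:<N>'
    run.add_argument("--verify-data", action="store_true")

    # D4: latency / BW diagnostic utility
    probe = sub.add_parser("probe")
    probe.add_argument("--topology", required=True)
    probe.add_argument("--case", default="all")

    # D5: topology viewer
    web = sub.add_parser("web")
    web.add_argument("--port", type=int, default=8765)
    web.add_argument("--no-open", action="store_true")
    return parser
```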
 ### D6. Runtime API and simulation engine remain device-scoped
 - Runtime API calls operate on one device per invocation.
 - The simulation engine schedules all requests deterministically.
 - Neither layer enumerates devices.
 This invariant keeps each layer testable in isolation; device
 enumeration and multi-device fan-out live only in the CLI's `run`
 command (D3).
 ## Consequences
 - Benchmark authors write single-device logic; multi-device behavior
  emerges from the CLI dispatching across SIPs.
 - Adding a new subcommand (e.g., trace export, replay) does not
  require benchmark or runtime-API changes — the CLI is the
  extension point.
 - `probe` and `web` are diagnostic / visualization tools, not
  benchmarks; they bypass the benchmark loader path.
 ## Links
 - SPEC R7, R8, R9
 - ADR-0007 (Runtime API and Simulation Engine Boundaries)
 - ADR-0020 (Two-pass data execution — `--verify-data`)
 - ADR-0006 (Topology compilation and diagram generation —
  background for `kernbench web`)
@@ -1,62 +0,0 @@
 # ADR-0010: CLI Device Selection and Multi-Device Execution Semantics
 ## Status
 Accepted
 ## Context
 Benchmarks represent device-agnostic workloads that operate on a single device.
 Users may want to run a benchmark:
 - on a specific device, or
 - across all devices in the system.
 Device enumeration must not leak into benchmarks or runtime APIs.
 ---
 ## Decision
 ### D1. Benchmarks are single-device by design
 - A benchmark MUST define behavior for a single device only.
 - A benchmark MUST accept a device identifier as input.
 - Benchmarks MUST NOT enumerate or loop over multiple devices.
 ---
 ### D2. CLI controls device selection
 The `kernbench run` command supports an optional `--device` argument:
 - If `--device <id>` is specified:
  - the benchmark executes once for the specified device.
 - If `--device` is omitted:
  - the benchmark executes once using all the SIPs discovered in the topology.
 ---
 ### D3. Multi-device execution is logically parallel
 When running on multiple devices:
 - benchmark executions are submitted to a single simulation engine instance,
 - executions are logically parallel in simulation time,
 - inter-device contention is naturally modeled.
 ---
 ### D4. Runtime API and simulation engine remain device-scoped
 - Runtime API calls operate on one device per invocation.
 - The simulation engine schedules all requests deterministically.
 - Neither layer enumerates devices.
 ---
 ## Links
 - SPEC R7, R8
 - ADR-0007 (Runtime API boundaries)
@@ -396,7 +396,7 @@ Other N values:
 #### D-LA7. n:1 mode detail
 - One logical access → one aggregated request.
- Target: aggregated router → hbm_ctrl (see ADR-0019).
+- Target: aggregated router → hbm_ctrl (see ADR-0017 D8).
 - Aggregated link BW = `channels_per_pe × channel_bw_gbs`
  (e.g. 8 × 32 = 256 GB/s).
 - Single queue / resource for modelling.
@@ -516,6 +516,6 @@ Negative:
 - ADR-0009 (kernel execution)
 - ADR-0014 (PE-internal execution model)
 - ADR-0015 (component port/wire model)
- ADR-0019 (NOC + per-channel HBM connectivity — LA model topology
+- ADR-0017 (Cube NOC and HBM connectivity — LA model topology consumer)
-  consumer)
+- ADR-0013 (Verification strategy — V1 PA tagging)
 - SPEC R2 (latency by traversal), R10 (memory addressing)
@@ -229,4 +229,5 @@ Tests SHOULD validate:
 - ADR-0011 (Memory Addressing — PA / VA / LA)
 - ADR-0007 (runtime_api vs sim_engine boundaries)
 - ADR-0009 (kernel execution fan-out/aggregation)
 - ADR-0013 (Verification strategy — V1 message schema validation)
 - SPEC R2, R7, R8
@@ -0,0 +1,451 @@
 # ADR-0014: PE Pipeline Execution Model
 ## Status
 Accepted
 ## Context
 This ADR defines the PE-internal kernel execution model:
 - Role decomposition of PE-internal components
 - Command dispatch paths (simple / composite / multi-op composite with epilogue)
 - TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
 - TCM-centric dataflow with a register-file intermediary
 - Engine resource model
 - Observability and trace contract
 - Topology representation
 PE-internal structure (7 components in scope; 2 cross-referenced):
 - `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`,
  `pe_tcm` — defined here
 - `pe_mmu` — VA model, defined in ADR-0011 D-VA
 - `pe_ipcq` — collective communication, defined in ADR-0023
 The goal is a deterministic, trace-friendly execution contract that keeps
 each block independently swappable.
 ## Decision
 ### D1. PE-internal component roles
 **PE_CPU**
 - Executes kernel instruction stream / control logic.
 - Generates PE commands and submits them to `PE_SCHEDULER` (via
  `PeInternalTxn`).
 - Does NOT enqueue work directly into engine queues.
 **PE_SCHEDULER**
 - Sole dispatcher inside a PE.
 - Receives commands from `PE_CPU`. Dispatch by command type:
  - Simple command (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`)
    → forward directly to the target engine.
  - `CompositeCmd` → generate a `TilePlan`, feed tiles into the pipeline
    via a single `_feed_loop` (D6).
 - Does not participate in stage-to-stage chaining within a composite;
  that is handled by token self-routing (D6).
 **PE_DMA**
 - Handles memory transfers between TCM and external memory domains
  (HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
 - Two execution channels:
  - `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1) — see D4.
 - Additional virtual channels:
  - `vc_compute` — load/store/writeback traffic for GEMM/MATH tiles.
  - `vc_comm` — IPCQ collective send data (defined in ADR-0023 D8).
 **PE_FETCH_STORE**
 - TCM ↔ Register File transfer unit.
 - Isolates register-file access semantics from compute engines so that
  GEMM/MATH stay pure compute components.
 - BW-based latency model; TCM access contention naturally serializes
  through `PE_TCM`'s BW resource.
 **PE_GEMM**
 - MAC array. Reads operands from the register file; writes results to
  the register file. Does not touch `PE_TCM` directly.
 **PE_MATH**
 - Element-wise / reduction / SIMD unit. Reads / writes the register file.
 **PE_TCM**
 - Tightly-coupled scratchpad with BW-serialized access. Two logical
  regions partitioned by ownership (see D5).
 **Cross-referenced components** (defined elsewhere):
 - `pe_mmu` — VA→PA translation per access (ADR-0011 D-VA).
 - `pe_ipcq` — collective ring buffers and peer endpoint metadata
  (ADR-0023).
 ### D2. Command lifecycle and queues
 `PE_SCHEDULER` maintains three logical structures:
 **SubmissionQueue** — written by `PE_CPU`; consumed by the scheduler.
 **InflightTable** — owned and mutated only by `PE_SCHEDULER`; tracks
 expanded sub-commands, dependency state, engine assignment, and
 completion status.
 **CompletionQueue** — written by `PE_SCHEDULER`; holds final completion
 records.
 **Single-writer rule**: only `PE_SCHEDULER` mutates command completion
 state. Engines report completion via explicit events / messages
 consumed by the scheduler.
 **Command completion**: when all sub-commands complete, `PE_SCHEDULER`
 publishes a completion record.
 ### D3. Dispatch modes
 #### D3.1 Simple command
 A simple command expands to exactly one engine sub-command:
 - `DmaReadCmd` / `DmaWriteCmd` → `PE_DMA`
 - `GemmCmd` → `PE_GEMM`
 - `MathCmd` → `PE_MATH`
 Flow:
 ```text
 PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
       → completion → PE_SCHEDULER → CompletionQueue
 ```
 #### D3.2 Composite command (single-op tiled pipeline)
 The default `CompositeCmd` runs a single compute op as a tile-pipelined
 sequence:
 ```text
 DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE
 ```
 `PE_SCHEDULER` splits the DMA payload into hardware tiles and emits one
 `TileToken` per tile with a monotonically increasing `tile_id`.
 Tile dependency (within one tile `t`):
 ```text
 DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)
 ```
 Inter-tile overlap is allowed wherever engine resources permit
 (D4 governs the constraints):
 ```text
 DMA_READ(t+1) ∥ COMPUTE(t)
 DMA_WRITE(t-1) ∥ COMPUTE(t)
 ```
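The overlap rules can be illustrated with a small timing calculation. This is a sketch under simplifying assumptions: every stage is treated as its own capacity-1 resource served in `tile_id` order, FETCH and STORE are modeled as independent units, and the durations are placeholders rather than values from the simulator.

```python
STAGES = ("DMA_READ", "FETCH", "COMPUTE", "STORE", "DMA_WRITE")


def schedule(n_tiles: int, dur_ns: dict[str, float]) -> list[dict[str, tuple[float, float]]]:
    """Return (start, end) per stage per tile for an in-order tile pipeline."""
    stage_free = {s: 0.0 for s in STAGES}  # time at which each stage is next idle
    timeline = []
    for _ in range(n_tiles):
        ready = 0.0  # time at which this tile's previous stage finished
        slots = {}
        for s in STAGES:
            start = max(ready, stage_free[s])
            end = start + dur_ns[s]
            slots[s] = (start, end)
            stage_free[s] = end
            ready = end
        timeline.append(slots)
    return timeline


# Placeholder durations in ns (hypothetical).
durations = {"DMA_READ": 10.0, "FETCH": 2.0, "COMPUTE": 10.0, "STORE": 2.0, "DMA_WRITE": 10.0}
tl = schedule(3, durations)
```

With these durations, `DMA_READ(1)` occupies 10 to 20 ns while `COMPUTE(0)` occupies 12 to 22 ns, and three tiles finish at 54 ns instead of the 102 ns a fully serial schedule would need.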
 #### D3.3 Multi-op composite (head + epilogue with scope)
 A `CompositeCmd` MAY carry `ops: tuple[OpSpec, ...]` to express a
 multi-op pipeline:
 ```python
@dataclass(frozen=True)

 class OpSpec:
    kind: str         # "gemm" | "math.exp" | "math.bias_add" | ...
    scope: Scope      # "per_k_tile" | "per_output_tile" | "once"
    ...
 ```
 - `ops[0]` (head) defines tile geometry (e.g., the head GEMM determines
  M/K/N partition).
 - `ops[1:]` (epilogue) are subsequent stages whose `scope` decides how
  often they fire:
  - `per_k_tile` — every K-reduction step.
  - `per_output_tile` — once per output tile.
  - `once` — once per kernel.
 Cross-engine chains (e.g., GEMM head → MATH epilogue) are natural:
 each stage is dispatched via token self-routing (D6), so GEMM and MATH
 run serially within the same composite, taking turns on the shared
 compute slot (D4).
 The empty-`ops` form is the legacy single-op path.
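How often an epilogue op fires under each `scope` can be sketched as below. The helper `epilogue_fire_count` is hypothetical, and it assumes a kernel made of one composite whose output tiles each accumulate over the same number of K-reduction steps.

```python
def epilogue_fire_count(scope: str, k_tiles: int, output_tiles: int) -> int:
    """Number of times an epilogue op fires (illustrative, single composite).

    k_tiles:      K-reduction steps per output tile (assumed uniform)
    output_tiles: number of output tiles produced by the head op
    """
    if scope == "per_k_tile":
        return k_tiles * output_tiles
    if scope == "per_output_tile":
        return output_tiles
    if scope == "once":
        return 1
    raise ValueError(f"unknown scope: {scope!r}")


counts = {
    scope: epilogue_fire_count(scope, k_tiles=4, output_tiles=6)
    for scope in ("per_k_tile", "per_output_tile", "once")
}
```

For 6 output tiles with 4 K-steps each, the three scopes fire 24, 6, and 1 times respectively.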
 ### D4. Engine resource model
 **DMA engine**:
 - `DMA_READ`: `simpy.Resource(capacity=1)`.
 - `DMA_WRITE`: `simpy.Resource(capacity=1)`.
 - Both channels run concurrently (READ ∥ WRITE allowed).
 - Within a channel, requests serialize (READ ∥ READ disallowed; same
  for WRITE).
 - `vc_comm` is an orthogonal channel for IPCQ traffic defined in
  ADR-0023 D8 — out of scope for this ADR.
 **Compute engine**:
 - `accel_slot`: `simpy.Resource(capacity=1)` shared by `PE_GEMM` and
  `PE_MATH`.
 - At most one compute op runs at a time within a PE.
 - Multi-op composite chains (D3.3) execute their compute stages serially
  through this slot; token self-routing (D6) ensures the next stage
  starts only after the previous compute releases the slot.
 **Engine completion**: each engine emits a completion event consumed by
 the scheduler / `PipelineContext` (D6).
 ### D5. Dataflow
 **Input path (HBM source)**:
 ```text
 HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
 PE_TCM → PE_FETCH_STORE → Register File
 Register File → PE_GEMM | PE_MATH
 ```
 **Input path (shared SRAM source)**:
 ```text
 Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
 PE_TCM → PE_FETCH_STORE → Register File
 ```
 **Output path (HBM destination)**:
 ```text
 Register File → PE_FETCH_STORE → PE_TCM
 PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM
 ```
 GEMM/MATH never touch `PE_TCM` directly — `PE_FETCH_STORE` is the
 single TCM↔register-file gateway. This makes TCM BW contention
 explicit and lets fetch unit policies (e.g., prefetch) be replaced
 independently of compute engines.
 #### D5.1 PE_TCM partitioning
 `PE_TCM` is split into two logical regions:
 **SchedulerReservedTCM**
 - Owned exclusively by `PE_SCHEDULER`.
 - Holds composite-command tile buffers.
 - `PE_SCHEDULER` partitions this region, assigns buffers per DMA_READ /
  COMPUTE / DMA_WRITE stage, guarantees input/output separation, and
  manages tile-buffer lifetimes.
 **AllocatableTCM**
 - General-purpose region managed by `PEMemAllocator`.
 - Used for host / DP-visible allocations.
 **Visibility rule (hard isolation)**: `PEMemAllocator` MUST NOT see or
 allocate inside `SchedulerReservedTCM`. The reserved region is excluded
 from allocator-managed ranges by construction.
 **Tile buffer rules**:
 - Input and output buffers within `SchedulerReservedTCM` MUST NOT
  overlap during a tile's active lifetime.
 - A tile buffer remains valid until the corresponding `DMA_WRITE`
  completes.
 - Buffer reuse is permitted only after the consuming tile's lifetime
  ends.
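The isolation and tile-buffer rules can be expressed as range checks. This is a sketch only: `Region`, `split_tcm`, and `check_tile_buffers` are hypothetical names, and placing the reserved region at the bottom of the address space is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Half-open byte range [start, end) inside PE_TCM (hypothetical helper)."""
    start: int
    end: int

    def overlaps(self, other: "Region") -> bool:
        return self.start < other.end and other.start < self.end


def split_tcm(tcm_bytes: int, reserved_bytes: int) -> tuple[Region, Region]:
    """Return (SchedulerReservedTCM, AllocatableTCM) as disjoint ranges."""
    if not 0 <= reserved_bytes <= tcm_bytes:
        raise ValueError("reserved region must fit inside PE_TCM")
    return Region(0, reserved_bytes), Region(reserved_bytes, tcm_bytes)


def check_tile_buffers(reserved: Region, inputs: list[Region], outputs: list[Region]) -> None:
    """Raise if a live buffer leaves the reserved region or input/output overlap."""
    for buf in inputs + outputs:
        if buf.start < reserved.start or buf.end > reserved.end:
            raise ValueError(f"{buf} is outside SchedulerReservedTCM")
    for i in inputs:
        for o in outputs:
            if i.overlaps(o):
                raise ValueError(f"input {i} overlaps output {o}")


# Sizes are placeholders: 1 MiB of TCM with 256 KiB reserved for the scheduler.
reserved, allocatable = split_tcm(tcm_bytes=1 << 20, reserved_bytes=256 << 10)
check_tile_buffers(reserved, [Region(0, 64 << 10)], [Region(64 << 10, 128 << 10)])
```

Because the allocator would only ever be handed `allocatable`, it cannot place an allocation inside the reserved range, which is the "by construction" property the visibility rule asks for.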
 ### D6. TileToken self-routing pipeline
 A composite's stage-to-stage progression happens **without** routing
 through the scheduler. Each component forwards the token directly to
 the next stage's component using the token's `plan`:
 ```text
 Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
              ↑ chaining: no scheduler hop                          ↑
                                                  PipelineContext.complete_tile()
 ```
 This mirrors real-HW done-wire chains. The scheduler handles only
 **initial dispatch + completion aggregation**.
 #### TilePlan / Stage
 ```python
 from dataclasses import dataclass
 from enum import Enum

 class StageType(Enum):
     DMA_READ = 0
     FETCH = 1
     GEMM = 2
     MATH = 3
     STORE = 4
     DMA_WRITE = 5

 @dataclass(frozen=True)
 class Stage:
     stage_type: StageType
     component: str         # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
     params: dict           # stage-specific parameters

 @dataclass(frozen=True)
 class TilePlan:
     tile_id: int
     stages: tuple[Stage, ...]
 ```
 #### TileToken
 ```python
 @dataclass
 class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext
    plan: TilePlan
    stage_idx: int
    params: dict             # cached current stage params
    data_op: bool = True     # op_log opt-in (ADR-0020 D4)
 ```
 Single-owner invariant: a token is owned by exactly one component at a
 time. Lifecycle: scheduler creates with `stage_idx=0` → component
 `_process()` → increment `stage_idx` → put to next stage's `in_port` →
 last stage calls `pipeline_ctx.complete_tile()`.
 #### PipelineContext (exactly-once completion)
 ```python
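 from dataclasses import dataclass
 from typing import Optional

 import simpy

 @dataclass
 class PipelineContext:
     id: str
     total_tiles: int
     completed_tiles: int = 0
     done_event: Optional[simpy.Event] = None   # must be set before any tile completes

     def complete_tile(self) -> None:
         self.completed_tiles += 1
         if self.completed_tiles == self.total_tiles:
             self.done_event.succeed()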
 ```
 Each tile's last stage MUST call `complete_tile()` exactly once.
 Duplicate calls are bugs (SimPy `Event` can succeed at most once).
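One way to surface over-completion early is an explicit guard. The sketch below is an assumption, not the simulator's code: `GuardedPipelineContext` is a hypothetical variant, and `OneShotEvent` is a stdlib stand-in for a SimPy event so the example runs without SimPy installed.

```python
from dataclasses import dataclass, field


class OneShotEvent:
    """Stand-in for a SimPy event: it may succeed at most once."""

    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        if self.triggered:
            raise RuntimeError("event already triggered")
        self.triggered = True


@dataclass
class GuardedPipelineContext:
    """Hypothetical PipelineContext variant that rejects extra completions."""
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: OneShotEvent = field(default_factory=OneShotEvent)

    def complete_tile(self) -> None:
        # Fail loudly instead of silently counting past total_tiles.
        if self.completed_tiles >= self.total_tiles:
            raise RuntimeError(f"{self.id}: complete_tile() called too many times")
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()


ctx = GuardedPipelineContext(id="cmd0", total_tiles=2)
ctx.complete_tile()
ctx.complete_tile()
```

This guard only catches calls beyond `total_tiles`. A duplicate call for one tile while other tiles are still in flight would fire `done_event` early; detecting that would need per-tile tracking, for example a set of completed `tile_id` values.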
 #### Feed ordering
 `PE_SCHEDULER` has exactly one `_feed_loop` process consuming a
 `_pending_feeds` FIFO. Composite commands are enqueued in submission
 order; tile feed for a command runs to completion before the next
 command's feed begins. **Tile-feed interleaving between commands is
 disallowed.**
 Within a single command's tiles, downstream pipeline overlap arises
 naturally — earlier tiles progress through later stages while the feeder
 keeps pushing remaining tiles into the first stage queue (SimPy Store
 backpressure governs flow control). If the first-stage queue is full,
 only the feeder blocks; the scheduler worker's inbox processing
 continues.
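The no-interleaving rule amounts to draining one FIFO with a single consumer. A minimal sketch, with queues and backpressure omitted and `feed_order` as a hypothetical helper:

```python
from collections import deque


def feed_order(pending: deque) -> list[tuple[str, int]]:
    """Drain a FIFO of (command_id, n_tiles) the way a single feeder would.

    Each command's tiles are emitted to completion before the next command
    is popped, so tile feeds from different commands never interleave.
    """
    emitted = []
    while pending:
        cmd_id, n_tiles = pending.popleft()
        for tile_id in range(n_tiles):
            emitted.append((cmd_id, tile_id))
    return emitted


order = feed_order(deque([("cmd_a", 3), ("cmd_b", 2)]))
```

All three tiles of `cmd_a` appear in `order` before the first tile of `cmd_b`, regardless of how quickly downstream stages drain.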
 #### Token routing pattern (base class)
 ```python
 def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()
        yield from self._process(env, token)       # stage-specific logic
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            token.pipeline_ctx.complete_tile()
 ```
 Each component implements only `_process()`; chaining lives in the
 base class.
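The routing rule can be exercised without SimPy by walking a token through its plan synchronously. This is an illustrative sketch: `Stage` and `Token` here are simplified stand-ins for the types above, `run_token` is hypothetical, and ports, queues, and timing are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    component: str  # short component name used as the routing key
    params: dict


@dataclass
class Token:
    tile_id: int
    stages: tuple
    stage_idx: int = 0


def run_token(token: Token, handlers: dict) -> list[str]:
    """Walk one token through its plan and return the components visited.

    `handlers` maps a component name to a callable standing in for
    `_process()`; each handler sees only its own stage parameters.
    """
    visited = []
    while True:
        stage = token.stages[token.stage_idx]
        handlers[stage.component](token, stage.params)
        visited.append(stage.component)
        if token.stage_idx + 1 < len(token.stages):
            token.stage_idx += 1  # forward to the next stage named by the plan
        else:
            return visited        # last stage: completion would be reported here


def noop(token: Token, params: dict) -> None:
    """Placeholder for stage-specific work."""


handlers = {name: noop for name in ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_math")}
plan = tuple(
    Stage(c, {})
    for c in ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_math", "pe_fetch_store", "pe_dma")
)
visited = run_token(Token(tile_id=0, stages=plan), handlers)
```

Adding or removing the `pe_math` epilogue stage changes only `plan`; none of the handlers need to know what comes after them.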
 ### D7. Observability and trace contract
 The simulator emits deterministic trace events:
 - `command_submitted`
 - `sub_command_dispatched`
 - `engine_start`
 - `engine_complete`
 - `tile_ready`
 - `command_complete`
 For identical inputs, trace ordering MUST be deterministic.
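A determinism check can compare digests of two traces produced from the same inputs. The sketch below assumes events are JSON-serializable dictionaries; the field names `t_ns`, `event`, and `tile_id` are hypothetical.

```python
import hashlib
import json


def trace_digest(events: list[dict]) -> str:
    """Hash an ordered trace; equal digests mean the same events in the same order."""
    payload = json.dumps(events, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


run_a = [
    {"t_ns": 0.0, "event": "command_submitted"},
    {"t_ns": 1.0, "event": "engine_start", "tile_id": 0},
    {"t_ns": 9.0, "event": "engine_complete", "tile_id": 0},
]
run_b = [dict(e) for e in run_a]      # a second run with identical output
reordered = [run_a[1], run_a[0], run_a[2]]
```

`trace_digest(run_a)` equals `trace_digest(run_b)`, while the reordered trace yields a different digest, so a regression test can assert on the digest alone.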
 ### D8. Topology representation
 PE-internal components are declared in `cube.pe_template`:
 ```yaml
 pe_template:
  components:
    pe_cpu:         { kind: pe_cpu,         impl: builtin.pe_cpu,         attrs: { overhead_ns: ... } }
    pe_scheduler:   { kind: pe_scheduler,   impl: builtin.pe_scheduler,   attrs: { overhead_ns: ... } }
    pe_dma:         { kind: pe_dma,         impl: builtin.pe_dma,         attrs: { rd_engines: 1, wr_engines: 1 } }
    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
    pe_gemm:        { kind: pe_gemm,        impl: builtin.pe_gemm,        attrs: { shared_resource: accel_slot, ... } }
    pe_math:        { kind: pe_math,        impl: builtin.pe_math,        attrs: { shared_resource: accel_slot, ... } }
    pe_tcm:         { kind: pe_tcm,         impl: builtin.pe_tcm,         attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
    pe_mmu:         { kind: pe_mmu,         impl: builtin.pe_mmu,         attrs: { ... } }   # ADR-0011 D-VA
    pe_ipcq:        { kind: pe_ipcq,        impl: builtin.pe_ipcq,        attrs: { ... } }   # ADR-0023
  links:
    # Scheduler dispatch edges (initial)
    scheduler_to_dma_mm:         0.0
    scheduler_to_fetch_store_mm: 0.0
    scheduler_to_gemm_mm:        0.0
    scheduler_to_math_mm:        0.0
    # Pipeline chaining edges (token self-routing per D6)
    dma_to_fetch_store_mm:       0.0
    fetch_store_to_gemm_mm:      0.0
    fetch_store_to_math_mm:      0.0
    gemm_to_fetch_store_mm:      0.0
    gemm_to_math_mm:             0.0
    math_to_fetch_store_mm:      0.0
    fetch_store_to_dma_mm:       0.0
    fetch_store_to_tcm_bw_gbs:   ...
 ```
 Template is instantiated once per PE. PE instances are derived from
 `cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔
 cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).
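Per-PE instantiation of the template can be sketched as a simple expansion into node ids of the form shown in D6 (`sip0.cube0.pe0.pe_dma`). The function `instantiate_pe_nodes` and its `n_pes` parameter are assumptions; the real builder derives PE instances from `cube.pe_layout`.

```python
def instantiate_pe_nodes(sip: int, cube: int, n_pes: int, template: dict) -> dict:
    """Expand a pe_template into per-PE topology nodes keyed by full node id."""
    nodes = {}
    for pe in range(n_pes):
        prefix = f"sip{sip}.cube{cube}.pe{pe}"
        for name, spec in template["components"].items():
            # Shallow copy so each PE gets its own top-level spec dict.
            nodes[f"{prefix}.{name}"] = dict(spec)
    return nodes


# Reduced template with two components; attrs left empty for brevity.
template = {
    "components": {
        "pe_dma": {"kind": "pe_dma", "impl": "builtin.pe_dma", "attrs": {}},
        "pe_gemm": {"kind": "pe_gemm", "impl": "builtin.pe_gemm", "attrs": {}},
    }
}
nodes = instantiate_pe_nodes(sip=0, cube=0, n_pes=2, template=template)
```

Two PEs and two template components produce four nodes, including `sip0.cube0.pe1.pe_gemm`.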
 ## Consequences
 ### Positive
 - Each block is an independent topology node — individually swappable
  via DI (ADR-0015).
 - PE-internal structure is visible in the topology graph.
 - Components do not know their downstream — plan-based routing gives
  flexibility (e.g., epilogue chains require no scheduler change).
 - DMA and compute overlap naturally via SimPy Store backpressure.
 - Multi-op composite expresses fused operations (e.g., GEMM + bias_add)
  without engine-level coupling.
 - TCM access contention is realistic — `PE_FETCH_STORE` is the single
  TCM↔RF gateway.
 ### Negative
 - Intra-PE component count is higher than a coarser model (7 base + 2
  cross-referenced) — more topology nodes/edges.
 - Intra-PE token forwarding is explicit in traces (an acceptable
  trade-off for HW fidelity).
 ## Links
 - ADR-0011 D-VA (PE_MMU component, VA translation)
 - ADR-0015 D4 (component port/wire model)
 - ADR-0020 (greenlet kernel execution / two-pass)
 - ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
 - SPEC R3, R4
@@ -1,365 +0,0 @@
 # ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)
 ## Status
 Accepted
 ## Context
 ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:
 - the dispatch model inside a PE,
 - the responsibilities of PE_SCHEDULER,
 - the PE_TCM-centric dataflow contract used by accelerator engines.
 We need a deterministic and debuggable PE-internal execution contract that supports:
 - simple single-engine commands
 - composite commands that build a tiled pipeline across DMA and accelerator engines
 The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.
 ## Decision
 ### D1. PE internal component roles
 Each PE contains the following logical components.
 **PE_CPU**
 - Executes kernel instruction stream or kernel control logic.
 - Generates PE commands.
 - Submits commands to PE_SCHEDULER.
 - PE_CPU does NOT enqueue work directly into engine queues.
 **PE_SCHEDULER**
 - The sole dispatcher inside a PE.
 - Receives commands from PE_CPU.
 - Expands composite commands into sub-commands.
 - Tracks dependencies and command state.
 - Dispatches work to engine queues.
 - Manages tile scheduling for composite commands.
 **PE_DMA**
 - Handles memory transfers between PE_TCM and external memory domains.
 - PE_DMA connects to the cube-level NOC (on-die fabric):
  - All destinations (HBM, shared SRAM, inter-cube UCIe) are reached via the NOC
  - Local HBM access: PE_DMA → NOC → hbm_ctrl (minimal hop)
  - Remote/shared: PE_DMA → NOC → (fabric hops) → destination
 - Supported directions include:
  - HBM → PE_TCM (via NOC)
  - PE_TCM → HBM (via NOC)
  - PE_TCM → shared SRAM (via NOC)
  - PE_TCM → other memory domains (via NOC, if supported by topology)
 **PE_GEMM**
 - Matrix multiplication engine.
 - Reads activations from PE_TCM.
 - May stream weights directly from HBM.
 **PE_MATH**
 - Element-wise computation engine.
 - Reads and writes PE_TCM.
 **PE_TCM**
 - Local SRAM used as the staging memory for accelerator operations.
 ---
 ### D2. Command lifecycle and queues
 PE_SCHEDULER maintains three logical structures.
 **SubmissionQueue**
 - Written by PE_CPU.
 - Contains incoming PE commands waiting to be processed.
 **InflightTable**
 - Owned and mutated only by PE_SCHEDULER.
 - Tracks:
  - expanded sub-commands
  - dependency state
  - engine assignment
  - completion status
 **CompletionQueue**
 - Written by PE_SCHEDULER.
 - Contains final completion records for commands.
 **Single-writer rule**
 - Only PE_SCHEDULER is allowed to mutate command completion state.
 - Engine components must report completion via explicit completion events/messages.
 **Command completion**
 A command becomes DONE when:
 - all sub-commands complete
 - PE_SCHEDULER publishes a completion record to CompletionQueue.
 ---
 ### D3. Dispatch modes
 PE commands are divided into two categories.
 #### D3.1 Simple command
 A simple command expands to exactly one engine sub-command.
 Examples include:
 - DMA transfer
 - GEMM compute
 - MATH compute
 Execution flow:
 ```text
 PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
 ```
 #### D3.2 Composite command (tiled pipeline)
 Composite commands implement tiled pipelined execution across engines.
 Each tile executes the following pipeline:
 ```text
 Input DMA (READ)
 → Compute (GEMM or MATH)
 → Output DMA (WRITE)
 ```
 **Tiling rule**
 If the DMA payload exceeds hardware tile size, PE_SCHEDULER splits the transfer into tiles.
 Each tile is assigned a monotonically increasing `tile_id`.
 **Tile dependency rules**
 For tile `t`:
 - Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)`
 - Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)`
 - All dependencies are enforced by PE_SCHEDULER.
 **Overlap policy (Phase 0 default)**
 Operations for different tiles may overlap when engine resources permit.
 Allowed overlaps:
 ```text
 DMA_READ(t+1) ∥ COMPUTE(t)
 DMA_WRITE(t−1) ∥ COMPUTE(t)
 DMA_READ(t) ∥ DMA_WRITE(t)
 ```
 Disallowed overlaps:
 ```text
 GEMM(t) ∥ GEMM(t′)
 MATH(t) ∥ MATH(t′)
 GEMM(t) ∥ MATH(t′)
 ```
 ---
 ### D4. Engine execution model (Phase 0 default)
 Each engine behaves as a deterministic service resource.
 **DMA engine**
 PE_DMA contains two independent channels.
 ```text
 DMA_READ capacity  = 1
 DMA_WRITE capacity = 1
 ```
 Rules:
 - DMA_READ and DMA_WRITE may execute concurrently.
 - Multiple READs cannot overlap.
 - Multiple WRITEs cannot overlap.
 Example allowed:
 ```text
 DMA_READ(t+1) ∥ DMA_WRITE(t)
 ```
 Example not allowed:
 ```text
 DMA_READ(t) ∥ DMA_READ(t+1)
 DMA_WRITE(t) ∥ DMA_WRITE(t+1)
 ```
 **Compute engine**
 Compute operations share a single compute resource.
 ```text
 PE_ACCEL capacity = 1
 ```
 Both GEMM and MATH require this shared compute slot.
 Consequences:
 - GEMM ∥ GEMM not allowed
 - MATH ∥ MATH not allowed
 - GEMM ∥ MATH not allowed
 Only one compute operation can run in a PE at a time.
 **Compute opcode restriction**
 Composite commands contain one compute opcode only.
 Examples:
 ```text
 COMPOSITE_GEMM
 COMPOSITE_MATH
 ```
 Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0.
 **Engine completion signaling**
 Every engine emits a completion event when a sub-command finishes.
 Completion events are delivered to PE_SCHEDULER.
 ---
 ### D5. Dataflow model
 Compute operations use a TCM-centric dataflow model.
 **Input path (HBM)**
 ```text
 HBM → NOC → PE_DMA (DMA_READ) → PE_TCM
 ```
 **Input path (shared SRAM)**
 ```text
 Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
 ```
 **Compute stage**
 Compute engines read input tensors from PE_TCM.
 ```text
 PE_TCM → GEMM / MATH
 ```
 Weights for GEMM may optionally stream directly from HBM (via NOC).
 **Output path (HBM)**
 Compute results are written to PE_TCM, then DMA writes to HBM.
 ```text
 PE_TCM → PE_DMA (DMA_WRITE) → NOC → HBM
 ```
 **Output path (shared SRAM)**
 ```text
 PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
 ```
 #### D5.1 PE_TCM partitioning and ownership boundary
 The PE_TCM address space is partitioned into two logical regions.
 **SchedulerReservedTCM**
 - A staging region owned exclusively by PE_SCHEDULER.
 - This region is used for composite command tile buffers.
 - PE_SCHEDULER:
  - partitions this region into tile buffers
  - assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
  - guarantees input/output buffer separation
  - manages tile buffer lifetime
 **AllocatableTCM**
 - General-purpose region managed by PEMemAllocator.
 - Used by host or DP-visible allocations.
 **Visibility rule (hard isolation)**
 - PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
 - SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
 - This prevents DP or host allocations from interfering with scheduler staging buffers.
 **Tile buffer rules**
 Within SchedulerReservedTCM:
 - input buffers and output buffers must not overlap
 - PE_SCHEDULER assigns tile buffers for DMA and compute stages
 - tile buffers remain valid until the corresponding DMA_WRITE completes
 - Buffer reuse is allowed only after the tile lifetime finishes.
 ---
 ### D6. Observability and trace contract
 The simulator must emit deterministic trace events.
 Required events include:
 - `command_submitted`
 - `sub_command_dispatched`
 - `engine_start`
 - `engine_complete`
 - `tile_ready`
 - `command_complete`
 Trace ordering must be deterministic for identical inputs.
 ---
 ### D7. Topology representation
 PE internal components are declared in `cube.pe_template`.
 The template is instantiated once per PE.
 PE instances are derived from `cube.pe_layout`.
 External connectivity such as:
 - PE_DMA → NOC → HBM (data path)
 - PE_DMA → NOC → shared SRAM, inter-cube UCIe (non-HBM data path)
 - NOC → PE_CPU (command path from M_CPU)
 is modeled at the CUBE level (see ADR-0003 D3).
 ---
 ## Links
 - SPEC R3, R4
 - ADR-0003 D4 (PE-level system hierarchy)
 - ADR-0005 View C (PE-level diagram)
 - ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
 - ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)
@@ -6,20 +6,19 @@ Accepted
 ## Context
-ADR-0007 D2 assigns path-walking and low-level request decomposition to the simulation engine.
+Realistic hardware modeling — queues, contention, fan-out — requires
-In practice, the engine iterates the topology path and calls `run()` on each component
+that components own fabric traversal while the simulation engine
-sequentially — conflating routing policy with component behavior and preventing realistic
+handles only initialization and completion observation. Direct method
-hardware modeling (queues, contention, fan-out).
+calls between components, or path-walking inside the engine, defeat
-
+queueing and contention semantics.
 ADR-0007 D3 already states that components own fan-out and aggregation, but the current
 implementation does not enforce this for fabric traversal.
 This ADR defines:
 - how components communicate via typed port queues,
 - how propagation delay is modeled (wire processes with BW occupancy),
- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch (via M_CPU),
+- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch
- the reduced role of the simulation engine,
+  (via M_CPU),
 - the engine's reduced role (wire init + completion observation only),
 - M_CPU.DMA as an internal subcomponent of M_CPU.
 ---
@@ -88,9 +87,6 @@ The simulation engine MUST NOT:
 - call component `run()` methods directly,
 - track per-hop latency or decompose fan-out.
 This supersedes ADR-0007 D2's "decompose operations into low-level requests" clause.
 ADR-0007 D2 must be amended accordingly.
 ---
 ### D4. Fabric paths for Memory R/W and Kernel Launch
@@ -192,16 +188,15 @@ It is used for shard comparison in `_route_kernel` and as a regression guard.
 - Propagation delay is modeled accurately per edge.
 - Engine is decoupled from routing policy.
 - Component implementations remain swappable via DI (ADR-0007 D3).
 - ADR-0007 D2 must be amended to remove path-walking from engine responsibilities.
 - ADR-0009 D3 should be updated to reference the unified fabric path (D4 above).
 ---
 ## Links
- ADR-0007 D2 (to be amended: engine path-walking clause)
+- ADR-0007 D2 (engine role boundary)
- ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced)
+- ADR-0009 D3 (kernel execution fan-out hierarchy)
 - ADR-0014 D4 (DMA engine capacity=1)
 - ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
 - ADR-0016 (IOChiplet NOC and memory data path)
 - ADR-0017 (cube NOC 2D mesh architecture)
 - ADR-0033 (Latency model assumptions built on these mechanisms)
@@ -1,189 +0,0 @@
 # ADR-0017: Cube NOC 2D Mesh Architecture
 ## Status
 Accepted
 ## Context
 ADR-0003 D3 defines the cube-level NOC as a "distributed on-die fabric" but
 does not specify the internal routing model, contention semantics, or
 attachment topology. The implementation uses a 2D mesh router grid with
 XY routing and per-segment contention modeling. This ADR formalizes that
 architecture.
 ## Decision
 ### D1. NOC node and router grid
 Each cube contains a 2D router mesh generated by `mesh_gen.py`.
 Each router is a separate topology node (`sip{S}.cube{C}.r{row}c{col}`)
 implemented as `forwarding_v1`. (Supersedes the original single-node
 `noc_2d_mesh_v1` design — see ADR-0019.)
 Grid properties:
 - Default dimensions: 6x6 routers (derived from PE layout + UCIe connections)
 - Router naming: `r{row}c{col}` (e.g., `r0c0`, `r5c5`)
 - HBM exclusion zone: center rows/columns are excluded where HBM physically
  occupies space (e.g., r2c2, r2c3, r3c2, r3c3)
 - Router positions are derived from physical PE corner placement and cube
  geometry
 The NOC overhead_ns is 0.0. Latency is modeled by Manhattan distance
 traversal within the mesh (distance_mm x ns_per_mm).
 ### D2. XY routing algorithm
 The NOC uses deterministic XY routing:
 1. Horizontal segment: route from source X to destination X at source Y
 2. Vertical segment: route from destination X at source Y to destination Y
 Each directed segment is identified by a unique link key:
 - Horizontal: `("H", y_band, x_min, x_max, direction)`
 - Vertical: `("V", x_band, y_min, y_max, direction)`
 Grid positions are snapped to the router grid, excluding the HBM zone.
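The link-key construction for an X-then-Y route can be sketched as follows. This is an illustration only: `xy_segments` is a hypothetical helper, the compass labels used for `direction` and the convention that y grows southward are assumptions, and HBM-zone snapping is omitted.

```python
def xy_segments(src: tuple[int, int], dst: tuple[int, int]) -> list[tuple]:
    """Return link keys for an XY route between grid points given as (x, y)."""
    (sx, sy), (dx, dy) = src, dst
    keys = []
    if sx != dx:
        # Horizontal segment at the source row.
        keys.append(("H", sy, min(sx, dx), max(sx, dx), "E" if dx > sx else "W"))
    if sy != dy:
        # Vertical segment at the destination column.
        keys.append(("V", dx, min(sy, dy), max(sy, dy), "S" if dy > sy else "N"))
    return keys


route = xy_segments((0, 0), (5, 4))
```

Two transactions whose routes share a key would contend for the same segment; opposite directions on the same band produce distinct keys.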
 ### D3. Contention model
 Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
 sharing a segment (same row or column band, same direction) contend for the
 resource. This models link-level serialization in a wormhole-routed mesh.
 With no contention, NOC traversal latency equals the Manhattan distance
 multiplied by `ns_per_mm`. Under contention, additional queueing delay
 is added by SimPy's resource scheduling.
 ### D4. NOC attachment points
 The NOC connects to all major cube-level components:
 ```text
                    UCIe-N (conn x4)
                         |
           +---------+---+---+---------+
           |         |       |         |
 PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
 PE0.cpu <--+         |       |         +--< PE2.cpu
           |         |       |         |
 UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
 (conn x4)  |         | zone  |         |  (conn x4)
           |  r2c0   |       |         |
 M_CPU <--->+         |       |         |
           |  r3c0   |       |         |
 SRAM <---->+         |       |         |
           |         |       |         |
 PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
 PE4.cpu <--+         |       |         +--< PE6.cpu
           |         |       |         |
           +---------+---+---+---------+
                         |
                    UCIe-S (conn x4)
 HBM attach: hbm_ctrl is also attached to each router that hosts a PE (ADR-0019 D1)
 (xbar_top/xbar_bot were removed by ADR-0019)
 ```
 ### D5. NOC edge bandwidths and distances
 | Connection | BW (GB/s) | Distance | Notes |
 | --- | --- | --- | --- |
 | PE_DMA -> NOC | 256.0 | Physical (PE pos) | Matches HBM slice BW |
 | NOC -> PE_CPU | - | 0.0 mm | Command path only |
 | Router <-> HBM_CTRL | 256.0 | 0.0 mm | Per PE router (ADR-0019) |
 | NOC <-> M_CPU | - | 0.0 mm | Command path |
 | NOC <-> SRAM | 128.0 x4 | 0.0 mm | 512 GB/s aggregate |
 | NOC <-> UCIe conn | 128.0 | 0.0 mm | Per connection, 4 per port |
 Distance 0.0 mm for most connections reflects the distributed nature of
 the NOC; the actual traversal distance is computed internally via Manhattan
 distance within the router grid.
 ### D6. UCIe decomposition and inter-cube traffic
 Each cube has 4 UCIe ports (N, S, E, W). Each port is decomposed into:
 - 1 `ucie-{PORT}` node: UCIe protocol endpoint (overhead = 8.0 ns)
 - 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe
 This decomposition enables N=4 independent NOC-to-UCIe connections per port,
 each with 128 GB/s bandwidth. Total aggregate per port: 512 GB/s.
 Inter-cube traffic path:
 ```text
 Source: PE_DMA -> NOC -> conn{i} -> ucie-{PORT}
                    [UCIe link: 512 GB/s, 1.0mm seam distance]
 Target: ucie-{PORT} -> conn{i} -> r{x}c{y} -> (mesh hops) -> hbm_ctrl
 ```
 UCIe overhead (8.0 ns) is applied at each ucie-{PORT} node, so a
 full crossing incurs 16 ns (TX port + RX port).
 ### D7. Data paths through the NOC
 **PE DMA to local HBM (same half):**
 ```text
 PE_DMA -> r{x}c{y} -> hbm_ctrl  (local: 0 mesh hops, switching overhead only)
 ```
 **PE DMA to remote PE's HBM:**
 ```text
 PE_DMA -> r{x}c{y} -> (mesh hops) -> r{x'}c{y'} -> hbm_ctrl
 ```
 **PE DMA to remote cube HBM:**
 ```text
 PE_DMA -> r{x}c{y} -> conn -> ucie-E -> [seam] -> ucie-W -> conn -> r{x'}c{y'} -> hbm_ctrl
 ```
 **Kernel Launch command to PE:**
 ```text
 [from io_noc] -> ucie -> conn -> r{x}c{y} -> (mesh hops) -> M_CPU -> (mesh hops) -> PE_CPU
 ```
 **Shared SRAM access:**
 ```text
 PE_DMA -> r{x}c{y} -> (mesh hops) -> SRAM
 ```
 ### D8. Mesh generation
 The router grid is generated by `mesh_gen.py` based on:
 - `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner
 - `cube.geometry`: cube physical dimensions and HBM zone
 - `cube.ucie.n_connections`: determines router count for UCIe attachment
 The generator produces a `mesh_data` dictionary containing:
 - Router grid with positions and HBM exclusion zones
 - PE-to-router attachments (pe_dma, pe_cpu per PE)
 - UCIe-to-router attachments (N/S/E/W, distributed across edge routers)
 - M_CPU and SRAM router attachments
 - HBM attachment per PE router (ADR-0019)
 ## Consequences
 - NOC provides position-aware routing with deterministic latency
 - Contention is captured per directed segment (not per-node)
 - All cube-internal traffic is explicitly routed through the NOC
 - HBM exclusion zone reflects physical die layout constraints
 - The mesh generation is fully parameterized by `topology.yaml`
 ## Links
 - ADR-0003 D3 (cube-level NOC definition — extended by this ADR)
 - ADR-0004 D1 (PE DMA to local HBM path via router mesh)
 - ADR-0014 D1 (PE_DMA egress via router mesh)
 - ADR-0019 (NOC-Local HBM: xbar/bridge removed, explicit router mesh)
 - ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
 - ADR-0016 D1 (IOChiplet io_noc — analogous pattern at IO chiplet level)
@@ -0,0 +1,291 @@
 # ADR-0017: Cube NOC and HBM Connectivity
 ## Status
 Accepted
 ## Context
 The CUBE-level NOC is a 2D router mesh that carries every intra-cube
 request: PE-to-HBM data, PE-to-PE traffic, command paths
 (M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic.
 The CUBE's HBM is exposed through per-PE controller endpoints attached
 to PE routers. This per-PE partitioning makes local-vs-remote HBM
 distinguishable by mesh distance: a PE's own HBM partition sits at its
 own router (switching overhead only); another PE's HBM partition is
 reachable by mesh hops to that PE's router.
 Two channel-mapping modes are supported in the design space:
 - **n:1 (default, implemented)** — each PE's HBM partition aggregates
  `channels_per_pe` pseudo-channels into one endpoint. Effective
  per-PE BW = N × per-channel BW.
 - **1:1 (future)** — each PE router decomposes into per-channel
  mini-routers; per-channel BW contention is modeled directly.
 In both modes the per-PE effective BW is identical; only the connectivity
 granularity differs.
 ## Decision
 ### D1. 2D router mesh
 Each cube contains a 2D mesh of NOC routers generated by `mesh_gen.py`.
 - Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`).
 - Implementation: `forwarding_v1`. NOC `overhead_ns = 0`.
 - Default 6×6 grid (sized from PE corner placement + UCIe attachment
  count); larger PE counts scale the grid up.
 - HBM exclusion zone: center rows/columns are excluded where HBM die
  physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6).
 - Latency = Manhattan distance × `ns_per_mm`.
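The default grid described above can be sketched as a small adjacency builder. This is an illustration only, not the `mesh_gen.py` implementation: the function name `build_mesh` and the adjacency-list output are hypothetical, and only the 6×6 size, the `r{row}c{col}` naming, and the four excluded HBM-zone cells come from this ADR.

```python
from typing import Dict, List, Set, Tuple

def build_mesh(rows: int, cols: int, excluded: Set[Tuple[int, int]]) -> Dict[str, List[str]]:
    """Adjacency list of routers named r{row}c{col}, skipping excluded cells."""
    def name(r: int, c: int) -> str:
        return f"r{r}c{c}"

    adj: Dict[str, List[str]] = {}
    for r in range(rows):
        for c in range(cols):
            if (r, c) in excluded:
                continue  # HBM zone: no router exists at this position
            nbrs = []
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in excluded:
                    nbrs.append(name(nr, nc))
            adj[name(r, c)] = nbrs
    return adj

hbm_zone = {(2, 2), (2, 3), (3, 2), (3, 3)}
mesh = build_mesh(6, 6, hbm_zone)
print(len(mesh))                                # 32 routers
print(sum(len(v) for v in mesh.values()) // 2)  # 48 bidirectional links
```

With the 2×2 centre zone removed, the 36-cell grid keeps 32 routers, which matches the per-cube router count quoted for the 6×6 case.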
 ### D2. XY routing algorithm
 Deterministic XY routing:
 1. Horizontal segment: route from source X to destination X at source Y.
 2. Vertical segment: at destination X, route from source Y to destination Y.
 Each directed segment carries a unique key:
 - Horizontal: `("H", y_band, x_min, x_max, direction)`
 - Vertical:   `("V", x_band, y_min, y_max, direction)`
 Grid positions are snapped to the router grid, excluding the HBM zone.
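A minimal sketch of how the two segment keys could be derived for one route, assuming integer grid coordinates `(x, y)` = `(col, row)` and a `direction` encoded as `+1` or `-1`. The function name `xy_segment_keys` and the direction encoding are hypothetical, and snapping around the HBM zone is left out.

```python
from typing import List, Tuple

SegmentKey = Tuple[str, int, int, int, int]

def xy_segment_keys(src: Tuple[int, int], dst: Tuple[int, int]) -> List[SegmentKey]:
    """Return the directed segment keys for an XY route from src to dst.

    Coordinates are (x, y) grid indices; direction is +1 or -1 (assumed encoding).
    """
    (sx, sy), (dx, dy) = src, dst
    keys: List[SegmentKey] = []
    if sx != dx:
        # Horizontal leg runs along the source row (y band = source Y).
        keys.append(("H", sy, min(sx, dx), max(sx, dx), 1 if dx > sx else -1))
    if sy != dy:
        # Vertical leg runs along the destination column (x band = destination X).
        keys.append(("V", dx, min(sy, dy), max(sy, dy), 1 if dy > sy else -1))
    return keys

# r0c0 -> r1c4: x goes 0 -> 4 on row 0, then y goes 0 -> 1 on column 4.
print(xy_segment_keys((0, 0), (4, 1)))
```

Two routes that produce the same key share a directed segment, which is what makes the key usable as a contention identifier.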
 ### D3. Per-segment contention model
 Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
 sharing a segment (same row or column band, same direction) contend for
 the resource — modelling link-level serialization in a wormhole-routed
 mesh.
 With no contention, NOC traversal latency equals Manhattan distance ×
 `ns_per_mm`. Under contention, SimPy's resource scheduling adds queueing
 delay.
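The two regimes above can be illustrated with plain Python. This is a stand-in for the behaviour of a capacity-1 resource, not the SimPy code itself: the function names, the FIFO service order, the fixed hold time, and the `ns_per_mm = 0.5` value are all assumptions chosen for the example.

```python
from typing import List, Tuple

def manhattan_latency_ns(src_mm: Tuple[float, float], dst_mm: Tuple[float, float],
                         ns_per_mm: float) -> float:
    """Uncontended traversal latency: Manhattan distance x ns_per_mm."""
    return (abs(src_mm[0] - dst_mm[0]) + abs(src_mm[1] - dst_mm[1])) * ns_per_mm

def serialize_on_segment(arrivals_ns: List[float], hold_ns: float) -> List[Tuple[float, float]]:
    """FIFO service on a capacity-1 segment.

    Returns (start_ns, queueing_delay_ns) for each arrival, in arrival order.
    """
    free_at = 0.0
    out = []
    for t in sorted(arrivals_ns):
        start = max(t, free_at)    # wait until the segment is released
        out.append((start, start - t))
        free_at = start + hold_ns  # segment stays held for the traversal time
    return out

print(manhattan_latency_ns((0.0, 0.0), (8.0, 2.0), ns_per_mm=0.5))  # 5.0
# Three transactions hit the same directed segment; each holds it for 4 ns.
print(serialize_on_segment([0.0, 1.0, 10.0], hold_ns=4.0))
```

Only the second transaction queues (3 ns) because it arrives while the first still holds the segment; the third arrives after the segment is free.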
 ### D4. NOC attachment points (per-PE HBM partition)
 Every PE router carries three attachments: `pe{idx}.dma`, `pe{idx}.cpu`,
 and `pe{idx}.hbm`. The last is the per-PE HBM controller endpoint —
 `sip{S}.cube{C}.hbm_ctrl.pe{idx}` — which owns one slice of the cube's
 HBM (one pseudo-channel group; see D8).
 Other attachments:
 - M_CPU and shared SRAM each occupy a dedicated edge router.
 - UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed
  along that edge (see D6).
 ```text
                    UCIe-N (conn x4)
                         |
           +---------+---+---+---------+
           |         |       |         |
 PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
 PE0.cpu <--+ +hbm.pe0|       | +hbm.pe2+--< PE2.cpu
           |         |       |         |
 UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
 (conn x4)  |         | zone  |         |  (conn x4)
           |  r2c0   |       |         |
 M_CPU <--->+         |       |         |
           |  r3c0   |       |         |
 SRAM <---->+         |       |         |
           |         |       |         |
 PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
 PE4.cpu <--+ +hbm.pe4|       | +hbm.pe6+--< PE6.cpu
           |         |       |         |
           +---------+---+---+---------+
                         |
                    UCIe-S (conn x4)
 ```
 Per-PE HBM partitioning is the key invariant that makes local vs
 cross-PE HBM distinguishable by mesh distance (see D7).
 ### D5. NOC edge bandwidths and distances
 | Connection                    | BW (GB/s)  | Distance      | Notes                                       |
 | ----------------------------- | ---------- | ------------- | ------------------------------------------- |
 | PE_DMA → NOC                  | 256.0      | Physical (PE) | Matches local-HBM aggregate BW              |
 | NOC → PE_CPU                  | —          | 0.0 mm        | Command path only                           |
 | Router ↔ hbm_ctrl.pe{idx}     | 256.0      | 0.0 mm        | Per PE router; N × per-channel BW (see D8)  |
 | NOC ↔ M_CPU                   | —          | 0.0 mm        | Command path                                |
 | NOC ↔ SRAM                    | 128.0 × 4  | 0.0 mm        | 512 GB/s aggregate                          |
 | NOC ↔ UCIe conn               | 128.0      | 0.0 mm        | Per connection; 4 conn per port             |
 `0.0 mm` distances reflect the distributed nature of the NOC; actual
 traversal distance is computed via Manhattan distance within the router
 grid.
 ### D6. UCIe decomposition and inter-cube traffic
 Each of the 4 UCIe ports (N, S, E, W) decomposes into:
 - 1 `ucie-{PORT}` node: UCIe protocol endpoint (`overhead = 8.0 ns`).
 - 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe.
 This decomposition gives 4 independent NOC↔UCIe connections per port,
 each with 128 GB/s bandwidth (512 GB/s aggregate per port).
 Inter-cube traffic path:
 ```text
 Source: PE_DMA → NOC → conn{i} → ucie-{PORT}
                  [UCIe link: 512 GB/s, 1.0mm seam distance]
 Target: ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx}
 ```
 UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full
 crossing incurs 16 ns (TX port + RX port).
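The D6 numbers can be checked with a few lines of arithmetic. The constants are the values stated in this section; the helper names are hypothetical.

```python
UCIE_OVERHEAD_NS = 8.0  # applied at each ucie-{PORT} node (D6)
CONN_BW_GBS = 128.0     # per connection (D6)
CONNS_PER_PORT = 4      # conn0..conn3 (D6)

def port_aggregate_bw_gbs(conns: int = CONNS_PER_PORT, conn_bw: float = CONN_BW_GBS) -> float:
    """Aggregate NOC-to-UCIe bandwidth of one port."""
    return conns * conn_bw

def crossing_overhead_ns(ports_traversed: int = 2) -> float:
    """UCIe protocol overhead for one crossing (TX port + RX port)."""
    return ports_traversed * UCIE_OVERHEAD_NS

print(port_aggregate_bw_gbs())  # 512.0
print(crossing_overhead_ns())   # 16.0
```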
 ### D7. Data paths through the NOC
 All intra-cube traffic uses the same router mesh — no separate fast
 paths.
 **Local HBM** (same PE's own partition; 0 mesh hops):
 ```text
 PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx}   (switching overhead only)
 ```
 **Cross-PE HBM within cube** (target PE's partition, reached by mesh):
 ```text
 PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'}
 ```
 Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 on `r1c4`):
 ```text
 PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2
 ```
 Dijkstra computes the shortest path within the mesh.
 **Cross-cube HBM** (UCIe traversal):
 ```text
 PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn
       → r{x'}c{y'} → hbm_ctrl.pe{idx'}
 ```
 **Kernel launch command to PE**:
 ```text
 [from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU
 ```
 **Shared SRAM access**:
 ```text
 PE_DMA → r{x}c{y} → (mesh) → SRAM
 ```
 ### D8. HBM channel mapping mode
 Channel mapping is configured at cube scope:
 ```yaml
 cube:
  memory_map:
    hbm_mapping_mode: n_to_one       # one_to_one | n_to_one
    hbm_pseudo_channels: 64          # total pseudo-channel count
    hbm_channels_per_pe: 8           # per-PE local channel count
    hbm_channel_bw_gbs: 32.0         # per-channel bandwidth (GB/s)
    hbm_slices_per_cube: 8           # number of per-PE partitions
    hbm_total_gb_per_cube: 48
 ```
 **n:1 mode (default, implemented).** Each PE's HBM partition is a single
 endpoint `hbm_ctrl.pe{idx}` that aggregates `channels_per_pe` pseudo-
 channels. The `Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals
 `channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to
 interleave; only aggregate per-PE BW is modeled. No separate aggregated
 router node exists — the per-PE router itself serves that role.
 **1:1 mode (future).** Each PE router decomposes into N channel
 mini-routers; per-channel routing carries fully-resolved PA + channel ID.
 A `ChannelSplitter` resolves a logical access to N per-channel physical
 requests. Per-channel link models BW contention. Cross-PE channel
 access semantics are deferred to the implementation ADR.
 **BW math (defaults).**
 | Parameter                          | Value                      |
 | ---------------------------------- | -------------------------- |
 | pseudo channels per cube           | 64 (parameter)             |
 | PEs per cube                       | 8 (parameter)              |
 | channels per PE (N)                | 64 / 8 = 8                 |
 | per-channel BW                     | 32 GB/s (parameter)        |
 | per-PE local BW                    | N × 32 = 256 GB/s          |
 | cube total HBM BW                  | 64 × 32 = 2048 GB/s        |
 Both modes give the same per-PE effective BW; only the request shape and
 contention model differ.
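The BW table can be reproduced from the configuration keys. The sketch below is not the project's config loader: the `HbmMap` class is hypothetical, and the consistency check (channels per PE × partitions = total pseudo-channels) is an assumption implied by the "64 / 8 = 8" row.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HbmMap:
    hbm_pseudo_channels: int = 64
    hbm_channels_per_pe: int = 8
    hbm_channel_bw_gbs: float = 32.0
    hbm_slices_per_cube: int = 8

    def per_pe_bw_gbs(self) -> float:
        """Router <-> hbm_ctrl.pe{idx} link bandwidth in n:1 mode."""
        return self.hbm_channels_per_pe * self.hbm_channel_bw_gbs

    def cube_bw_gbs(self) -> float:
        """Total HBM bandwidth of the cube."""
        return self.hbm_pseudo_channels * self.hbm_channel_bw_gbs

    def check(self) -> None:
        # Assumed rule: channels must divide evenly across the per-PE partitions.
        if self.hbm_channels_per_pe * self.hbm_slices_per_cube != self.hbm_pseudo_channels:
            raise ValueError("channels_per_pe x slices_per_cube must equal pseudo_channels")

m = HbmMap()
m.check()
print(m.per_pe_bw_gbs(), m.cube_bw_gbs())  # 256.0 2048.0
```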
 ### D9. AddressResolver — per-PE HBM endpoint
 The address resolver decodes a PA's HBM offset to the owning PE's
 partition:
 ```python
 # policy/routing/router.py
 hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube
 if addr.kind == "hbm":
    pe_id = int(addr.hbm_offset) // hbm_slice_bytes
    return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
 ```
 The pe_id computation is intrinsic to the routing layer (not a
 topology-time concern). Any HBM PA falls within exactly one partition,
 yielding deterministic routing.
 External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the
 same resolver path — there is no separate fast path.
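A runnable sketch of the same decode, using the D8 defaults (48 GB per cube, 8 partitions, so 6 GiB per partition). The function name `resolve_hbm_endpoint` and the explicit range check are assumptions made for the example; the real resolver lives in `policy/routing/router.py`.

```python
GIB = 1 << 30

def resolve_hbm_endpoint(sip: int, cube: int, hbm_offset: int,
                         total_gb_per_cube: int = 48,
                         slices_per_cube: int = 8) -> str:
    """Map an HBM offset to the node ID of the owning per-PE controller."""
    slice_bytes = total_gb_per_cube * GIB // slices_per_cube
    # Assumed guard: reject offsets outside the cube's HBM range.
    if not 0 <= hbm_offset < slice_bytes * slices_per_cube:
        raise ValueError("hbm_offset outside the cube's HBM range")
    pe_id = hbm_offset // slice_bytes
    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"

print(resolve_hbm_endpoint(0, 0, 0))             # sip0.cube0.hbm_ctrl.pe0
print(resolve_hbm_endpoint(0, 0, 6 * GIB))       # sip0.cube0.hbm_ctrl.pe1
print(resolve_hbm_endpoint(0, 0, 48 * GIB - 1))  # sip0.cube0.hbm_ctrl.pe7
```

Because the partitions are contiguous and equal-sized, every valid offset maps to exactly one endpoint.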
 ### D10. Mesh generation parameters
 `mesh_gen.py` produces `cube_mesh.yaml` from:
 - `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner.
 - `cube.geometry`: cube physical dimensions and HBM zone.
 - `cube.ucie.n_connections`: determines router count for UCIe attachment.
 Output `mesh_data` dictionary contains:
 - Router grid with positions and HBM exclusion zones.
 - PE-to-router attachments (`pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm`
  per PE).
 - UCIe-to-router attachments (N/S/E/W distributed across edge routers).
 - M_CPU and SRAM router attachments.
 ## Consequences
 - Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM
  (mesh hops) are naturally distinguishable, satisfying SPEC R5
  (multi-domain communication) and ADR-0002 (no zero-latency end-to-end
  paths).
 - All cube-internal traffic routes through one mesh — single contention
  model, single layout, single set of edge BWs.
 - Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each
  PE's partition is the n:1 aggregate of its assigned pseudo-channels.
 - 1:1 mode extension is structurally natural — split each PE router into
  N channel routers.
 - Mesh generation is fully parameterised by `topology.yaml`; PE/cube
  geometry changes propagate without code edits.
 ## Links
 - ADR-0002 (Routing distance, ordering, no zero-latency paths)
 - ADR-0003 D3 (cube-level NOC definition — extended here)
 - ADR-0004 (Memory semantics, local HBM)
 - ADR-0011 (Memory addressing — LA model consumes per-PE partition)
 - ADR-0014 D1 (PE_DMA egress via router mesh)
 - ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
 - ADR-0016 (IOChiplet io_noc — analogous pattern at IO chiplet level)
 - ADR-0033 (Latency model: per-PC parallelism, switch penalty)
@@ -1,305 +0,0 @@
 # ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC
 ## Status
 Accepted
 ## Context
 The CUBE-internal NOC must connect each PE to HBM. KernBench needs
 to evaluate two connectivity models:
 - **1:1 mode** — PE_DMA connects to N separate per-channel routers,
  each with its own link to hbm_ctrl. Models per-channel BW
  contention precisely.
  N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`).
 - **n:1 mode** — PE_DMA connects to a single aggregated router with
  one link to hbm_ctrl. Channels are treated as interleaved; only
  aggregate BW is modeled.
 Effective PE-local BW is identical under both modes
 (= N × per-channel BW); only the connectivity granularity differs.
 ---
 ## Decision
 ### D1. HBM Attaches to PE Routers
 Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**,
 and attach the HBM access point to the same router where the PE is attached.
 - n:1 mode: PE's local HBM access goes directly from its own router (switching overhead only, 0 hops)
 - Remote PE's HBM access: reaches the target PE's router via mesh hops
 - The read/write resource model within the HBM controller is preserved
 Node naming changes:
 | Current | After Change |
 | ---- | ------- |
 | `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |
 In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that
 the builder generates an edge between that router and hbm_ctrl.
 ---
 ### D2. Complete Removal of xbar, bridge, and Single NOC Node
 Remove all of the following nodes and related edges:
 - `{cube}.xbar_top`, `{cube}.xbar_bot`
 - `{cube}.bridge.left`, `{cube}.bridge.right`
 - `{cube}.noc` (single TwoDMeshNocComponent node)
 - Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
 - Edges of type `xbar_to_bridge`, `bridge_to_xbar`
 - Edges of type `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`, etc. referencing the single noc node
 Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**.
 Each router (r0c0, r0c1, ...) from the 6x6 router grid generated by `mesh_gen.py`
 is created as a separate SimPy node in the topology graph,
 and adjacent routers are connected via XY mesh edges.
 ---
 ### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)
 #### Router Nodes Based on cube_mesh.yaml
 Each non-null router from cube_mesh.yaml generated by `mesh_gen.py`
 is created as a **separate SimPy node** in the topology graph.
 - Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
 - kind: `noc_router`, impl: `forwarding_v1`
 - pos_mm: taken from cube_mesh.yaml
 Based on the attach information in cube_mesh.yaml, components are connected to each router:
 - `pe{p}.dma` → PE_DMA ↔ router edge
 - `pe{p}.cpu` → PE_CPU ↔ router edge
 - `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
 - `m_cpu` → M_CPU ↔ router edge
 - `sram` → SRAM ↔ router edge
 - `ucie_{dir}.c{i}` → UCIe conn ↔ router edge
 Router-to-router XY mesh edges: bidirectional edges between adjacent routers.
 Null routers (HBM exclusion zones) are skipped.
 #### 1:1 Mode Extension (To Be Implemented Later)
 In 1:1 mode, each router differentiates into N channel mini-routers.
 Per-channel routing and ChannelSplitter (LA → per-channel PA) introduction are required.
 N GEMM engines per PE are also added at this point.
 ---
 ### D4. Cross-PE HBM Access (n:1 Mode)
 In n:1 mode, when a PE accesses another PE's local HBM,
 it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.
 Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:
 ```text
 PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
 ```
 The Dijkstra router finds the shortest path in the mesh.
 Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3.
 ---
 ### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh
 In n:1 mode, no separate "aggregated router" is created.
 The existing router grid from cube_mesh.yaml serves that role.
 #### Connection Structure
 PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:
 ```text
 sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
 sip0.cube0.hbm_ctrl   ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
 ```
 Routers are connected via XY mesh edges. PE's local HBM access goes
 directly from its own router (switching overhead only).
 #### n:1 Mode Full Data Paths
 **Local HBM (0 hops):**
 ```text
 PE0.pe_dma → r0c0 → hbm_ctrl  (switching overhead only)
 ```
 **Remote HBM (mesh hops):**
 ```text
 PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
 ```
 **M_CPU DMA:**
 ```text
 M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
 ```
 ---
 ### D6. All Traffic Is Unified onto the Same Router Mesh
 - All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
 - Local access does not use a separate fast path (xbar)
 - Cross-cube (remote) access path:
 ```text
 PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
  → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
 ```
 UCIe connections maintain the existing structure,
 but both endpoints become mesh routers instead of xbars.
 The number of UCIe lines is determined by BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.
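The line-count rule can be sketched directly. The helper below is illustrative: the function signature is hypothetical, and the 512 GB/s and 128 GB/s inputs are the per-port link and per-connection figures quoted in ADR-0017 D6.

```python
import math

def ucie_lines_per_side(ucie_bw_gbs: float, noc_line_bw_gbs: float) -> int:
    """Number of NOC lines needed to carry the UCIe bandwidth of one side."""
    if noc_line_bw_gbs <= 0:
        raise ValueError("noc_line_bw_gbs must be positive")
    return math.ceil(ucie_bw_gbs / noc_line_bw_gbs)

print(ucie_lines_per_side(512.0, 128.0))  # 4
print(ucie_lines_per_side(512.0, 200.0))  # 3 (rounds up so the UCIe link is never starved)
```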
 ---
 ### D7. AddressResolver Changes
 Current `AddressResolver.resolve()`:
 ```python
 # Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
 pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
 return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
 ```
 After change:
 ```python
 # Changed: HBM → single endpoint
 return f"sip{s}.cube{c}.hbm_ctrl"
 ```
 The pe_slice calculation is removed.
 In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.
 resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility.
 ---
 ### D8. topology.yaml Configuration Changes
 #### Added Settings
 ```yaml
 cube:
  memory_map:
    hbm_mapping_mode: n_to_one          # one_to_one | n_to_one
    hbm_pseudo_channels: 64             # total pseudo channel count
    hbm_channels_per_pe: 8              # local channels per PE (= pseudo_channels / pes_per_cube)
    hbm_channel_bw_gbs: 32.0            # per-channel bandwidth (GB/s)
    hbm_total_gb_per_cube: 48           # retained
 ```
 #### Removed Settings
 ```yaml
 # To be removed
 links:
  xbar_to_hbm_bw_gbs: 256.0            # → replaced by channel_bw_gbs × channels_per_pe
  xbar_to_hbm_mm: 2.5                  # → replaced by ch_router_to_hbm_mm
  xbar_to_bridge_bw_gbs: 128.0         # → removed (no bridge)
  xbar_to_bridge_mm: 3.0               # → removed
  noc_to_xbar_bw_gbs: ...              # → removed
  noc_to_xbar_mm: ...                  # → removed
 ```
 #### Added Link Settings
 ```yaml
 links:
  router_link_bw_gbs: 256.0            # XY mesh link BW between routers
  router_overhead_ns: 2.0              # router switching overhead
  pe_to_router_bw_gbs: 256.0           # PE_DMA ↔ router
  hbm_to_router_bw_gbs: 256.0          # HBM ↔ router (= N × channel_bw)
 ```
 ---
 ### D9. Bandwidth Numerical Consistency
 | Configuration | Value |
 | ---- | --- |
 | pseudo channels per cube | 64 (parameter) |
 | PEs per cube | 8 (parameter) |
 | channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
 | per-channel BW | 32 GB/s (parameter) |
 | per-PE local BW | N × 32 = 256 GB/s |
 | cube total HBM BW | 64 × 32 = 2048 GB/s |
 The effective BW per PE is identical in both modes:
 - 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
 - n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s
 ---
 ## Consequences
 ### Positive
 - The router mesh based on cube_mesh.yaml accurately reflects physical placement
 - In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
 - Local / remote / command traffic is unified onto the same mesh, resulting in simplicity
 - Aligns well with graph compiler-based topology generation
 - Channel count and PE count are both parameterized, enabling testing of various configurations
 - 1:1 mode extension naturally follows through router differentiation
 ### Negative
 - The number of SimPy nodes increases due to explicit router nodes (a 6x6 grid minus the 4 HBM-zone positions gives up to 32 routers/cube)
 - The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model
 ---
 ## Alternatives
 ### A1. Retain Existing xbar + HBM Slices
 - Local/remote paths remain bifurcated
 - Cannot model at pseudo-channel granularity
 - Cannot switch between 1:1/n:1 modes
 ### A2. Always Generate Per-Channel Links and Aggregate Only in n:1
 - Topology structure always has 1:1 size
 - Expressing n:1 semantics via link aggregation is complex
 - No reduction in router node count
 ### A3. Gradual Transition (Retain xbar + Add NOC Path)
 - Higher compatibility, but dual-path coexistence increases complexity
 - Since xbar removal is ultimately necessary, the intermediate step provides little value
 ---
 ## Test Requirements
 - Verify that requests are delivered via per-channel links in 1:1 mode
 - Verify that requests are delivered via the aggregated link in n:1 mode
 - Verify that topology is correctly generated in both modes:
  - 1:1: `total_ch` channel routers + per-PE links + horizontal links
  - n:1: `pes_per_cube` aggregated routers + per-PE links
 - Verify that effective BW is consistent across both modes for the same workload
 - Verify that horizontal line routing works for cross-PE access
 - Verify that routing through UCIe works for cross-cube access
 - Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)
 ---
 ## Links
 - ADR-0011 (LA model) → addressing-side integration
 - ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
 - ADR-0004 (Memory Semantics) → BW model redefinition
 - ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
@@ -1,305 +0,0 @@
 # ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC
 ## Status
 Accepted
 ## Context
 The CUBE-internal NOC must connect each PE to HBM. KernBench needs
 to be able to compare and evaluate two connectivity models:
 - **1:1 mode**: PE_DMA connects to each of N per-channel routers over a
  separate link, and each router has its own channel link to hbm_ctrl.
  Models per-channel BW contention precisely.
  N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`).
 - **n:1 mode**: PE_DMA connects to hbm_ctrl through a single aggregated
  router with one link. Channels are assumed to be interleaved and
  only aggregate BW is modeled.
 Effective per-PE BW is identical under both modes (= N × per-channel BW);
 only the connectivity granularity differs.
 ---
 ## Decision
 ### D1. HBM Attaches to PE Routers
 Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**,
 and attach the HBM access point to the same router where the PE is attached.
 - n:1 mode: a PE's local HBM access goes directly from its own router (switching overhead only, 0 hops)
 - Remote PE's HBM access: reaches the target PE's router via mesh hops
 - The read/write resource model inside the HBM controller is preserved
 Node naming changes:
 | Current | After Change |
 | ---- | ------- |
 | `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |
 In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that
 the builder generates an edge between that router and hbm_ctrl.
 ---
 ### D2. Complete Removal of xbar, bridge, and Single NOC Node
 Remove all of the following existing nodes and their related edges:
 - `{cube}.xbar_top`, `{cube}.xbar_bot`
 - `{cube}.bridge.left`, `{cube}.bridge.right`
 - `{cube}.noc` (single TwoDMeshNocComponent node)
 - Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
 - Edges of type `xbar_to_bridge`, `bridge_to_xbar`
 - Edges such as `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu` that reference the single noc node
 Their role is taken over by an **explicit router mesh based on cube_mesh.yaml**.
 Each router (r0c0, r0c1, ...) of the 6×6 router grid generated by the existing `mesh_gen.py`
 is created as a separate SimPy node in the topology graph,
 and adjacent routers are connected via XY mesh edges.
 ---
 ### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)
 #### Router Nodes Based on cube_mesh.yaml
 Each non-null router in the cube_mesh.yaml generated by `mesh_gen.py`
 is created as a **separate SimPy node** in the topology graph.
 - Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
 - kind: `noc_router`, impl: `forwarding_v1`
 - pos_mm: taken from cube_mesh.yaml
 Components are connected to each router according to the attach information in the existing cube_mesh.yaml:
 - `pe{p}.dma` → PE_DMA ↔ router edge
 - `pe{p}.cpu` → PE_CPU ↔ router edge
 - `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
 - `m_cpu` → M_CPU ↔ router edge
 - `sram` → SRAM ↔ router edge
 - `ucie_{dir}.c{i}` → UCIe conn ↔ router edge
 Router-to-router XY mesh edges: bidirectional edges between adjacent routers.
 Null routers (HBM exclusion zone) are skipped.
 #### 1:1 Mode Extension (To Be Implemented Later)
 In 1:1 mode, each router differentiates into N channel mini-routers.
 Per-channel routing and a ChannelSplitter (LA → per-channel PA) need to be introduced.
 N GEMM engines per PE are also added at that point.
 ---
 ### D4. Cross-PE HBM Access (n:1 Mode)
 In n:1 mode, when a PE accesses another PE's local HBM,
 it hops through the XY mesh of cube_mesh.yaml to the target PE's router.
 Example: PE0 (r0c0) accessing the HBM of PE2 (r1c4):
 ```text
 PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
 ```
 The Dijkstra router searches for the shortest path in the mesh.
 Cross-PE channel access in 1:1 mode will be defined with the 1:1 extension in D3.
 ---
 ### D5. n:1 Mode: Uses the cube_mesh.yaml Router Mesh
 In n:1 mode, no separate "aggregated router" is created.
 The existing router grid from cube_mesh.yaml serves that role.
 #### Connection Structure
 PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:
 ```text
 sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
 sip0.cube0.hbm_ctrl   ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
 ```
 Routers are connected via XY mesh edges. A PE's local HBM access goes
 directly from its own router (switching overhead only).
 #### n:1 Mode Full Data Paths
 **Local HBM (0 hops):**
 ```text
 PE0.pe_dma → r0c0 → hbm_ctrl  (switching overhead only)
 ```
 **remote HBM (mesh hops):**
 ```text
 PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
 ```
 **M_CPU DMA:**
 ```text
 M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
 ```
 ---
 ### D6. All Traffic Is Unified onto the Same Router Mesh
 - All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
 - Local access does not use a separate fast path (xbar) either
 - Cross-cube (remote) access path:
 ```text
 PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
  → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
 ```
 UCIe connections keep the existing structure,
 but both endpoints become mesh routers instead of xbars.
 The number of UCIe lines is determined by the BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.
 ---
 ### D7. AddressResolver Changes
 Current `AddressResolver.resolve()`:
 ```python
 # Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
 pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
 return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
 ```
 After change:
 ```python
 # Changed: HBM → single endpoint
 return f"sip{s}.cube{c}.hbm_ctrl"
 ```
 The pe_slice calculation is removed.
 In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.
 `resolver.resolve()` is retained for external access (M_CPU DMA, etc.) and backward compatibility.
 ---
 ### D8. topology.yaml Configuration Changes
 #### Added Settings
 ```yaml
 cube:
  memory_map:
    hbm_mapping_mode: n_to_one          # one_to_one | n_to_one
     hbm_pseudo_channels: 64             # total pseudo channel count
     hbm_channels_per_pe: 8              # local channels per PE (= pseudo_channels / pes_per_cube)
     hbm_channel_bw_gbs: 32.0            # per-channel bandwidth (GB/s)
     hbm_total_gb_per_cube: 48           # retained
 ```
 #### Removed Settings
 ```yaml
 # To be removed
 links:
  xbar_to_hbm_bw_gbs: 256.0            # → replaced by channel_bw_gbs × channels_per_pe
  xbar_to_hbm_mm: 2.5                  # → replaced by ch_router_to_hbm_mm
  xbar_to_bridge_bw_gbs: 128.0         # → removed (no bridge)
  xbar_to_bridge_mm: 3.0               # → removed
  noc_to_xbar_bw_gbs: ...              # → removed
  noc_to_xbar_mm: ...                  # → removed
 ```
 #### Added Link Settings
 ```yaml
 links:
  router_link_bw_gbs: 256.0            # XY mesh link BW between routers
  router_overhead_ns: 2.0              # router switching overhead
  pe_to_router_bw_gbs: 256.0           # PE_DMA ↔ router
  hbm_to_router_bw_gbs: 256.0          # HBM ↔ router (= N × channel_bw)
 ```
 ---
 ### D9. Bandwidth Numerical Consistency
 | Configuration | Value |
 | ---- | --- |
 | pseudo channels per cube | 64 (parameter) |
 | PEs per cube | 8 (parameter) |
 | channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
 | per-channel BW | 32 GB/s (parameter) |
 | per-PE local BW | N × 32 = 256 GB/s |
 | cube total HBM BW | 64 × 32 = 2048 GB/s |
 The effective BW per PE is identical in both modes:
 - 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
 - n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s
 ---
 ## Consequences
 ### Positive
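 - The router mesh based on cube_mesh.yaml accurately reflects physical placement
 - In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
 - Local / remote / command traffic is unified onto the same mesh, which keeps the model simple
 - Aligns well with graph compiler-based topology generation
 - Channel count and PE count are both parameters, so various configurations can be tested
 - The 1:1 mode extension follows naturally through router differentiation
 ### Negative
 - The number of SimPy nodes increases due to explicit router nodes (a 6×6 grid minus the 4 HBM-zone positions gives up to 32 routers/cube)
 - The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model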
 ---
 ## Alternatives
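 ### A1. Retain Existing xbar + HBM Slices
 - Local/remote paths remain bifurcated
 - Cannot model at pseudo-channel granularity
 - Cannot switch between 1:1/n:1 modes
 ### A2. Always Generate Per-Channel Links and Aggregate Only in n:1
 - Topology structure always has 1:1 size
 - Expressing n:1 semantics via link aggregation is complex
 - No reduction in router node count
 ### A3. Gradual Transition (Retain xbar + Add NOC Path)
 - Higher compatibility, but dual-path coexistence increases complexity
 - Since xbar removal is ultimately necessary, the intermediate step provides little value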
 ---
 ## Test Requirements
 - Verify that requests are delivered via per-channel links in 1:1 mode
 - Verify that requests are delivered via the aggregated link in n:1 mode
 - Verify that topology is correctly generated in both modes:
  - 1:1: `total_ch` channel routers + per-PE links + horizontal links
  - n:1: `pes_per_cube` aggregated routers + per-PE links
 - Verify that effective BW is consistent across both modes for the same workload
 - Verify that horizontal line routing works for cross-PE access
 - Verify that routing through UCIe works for cross-cube access
 - Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)
 ---
 ## Links
 - ADR-0011 (LA model) → addressing-side integration
 - ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
 - ADR-0004 (Memory Semantics) → BW model redefinition
 - ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
@@ -1,432 +0,0 @@
 # ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing
 ## Status
 Accepted
 ## Context
 ### Actual Hardware Structure
 ```
 HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
 ```
 - DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
 - Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
 - GEMM/MATH Engine: computation between Register Files (cycle-accurate)
 - Completion signal: PE-internal 1-cycle wire signal (done pin assert)
 ---
 ## Decision
 ### D1. Separate Each Block into an Independent Component
 The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
 Existing 5 blocks + 1 Fetch/Store Unit = 6 components.
 | Component | Role | HW Correspondence |
 |-----------|------|-------------------|
 | PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
 | PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
 | PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
 | PE_GEMM | MAC compute (register only) | MAC Array |
 | PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
 | PE_TCM | BW-serialized scratchpad | SRAM Bank |
 Each component exists as a topology node and is connected via ports/wires.
 Replacing the `impl` allows changing the timing model of an individual block.
 ### D2. Token Self-Routing — Scheduler Handles Only Dispatch + Completion
 **Components do not pass through the scheduler at every stage.**
 The token carries a plan so that components chain directly to the next stage.
 ```
 Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
              ↑ chaining: does not go through scheduler          completion only
 ```
 This matches the actual HW structure where each block's done signal is directly
 connected to the next block via wire. The scheduler is responsible **only for
 initial dispatch + completion aggregation**.
 #### Stage Definition
 ```python
 class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5
 ```
 #### Plan Structure
 When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
 The plan defines the **stage sequence** for each tile:
 ```python
 @dataclass
 class Stage:
    stage_type: StageType
    component: str       # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
    params: dict         # per-stage parameters (dynamic)
 @dataclass(frozen=True)
 class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]  # list of stages to execute in order (immutable)
 ```
 The stage sequence varies depending on the plan:
 ```python
 # Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
 stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)
 # GEMM directly from TCM data (skip DMA read):
 stages = (FETCH, GEMM, STORE, DMA_WRITE)
 # MATH element-wise:
 stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)
 # GEMM + accumulation (intermediate K-tile, skip writeback):
 stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only
 ```
 **Components do not hardcode the next component.**
 They read the next stage from the token's plan and forward it directly via out_port.
 This is the same pattern as a network packet carrying a routing header.
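
The routing-header analogy can be illustrated with a minimal, SimPy-free sketch. The `Stage`, `Token`, `advance`, and `trace` names below are simplified stand-ins invented for this illustration (they are not the ADR's `TilePlan`/`TileToken` classes), and the component IDs are shortened placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str        # stage label, e.g. "FETCH"
    component: str   # hypothetical node ID that executes this stage


@dataclass
class Token:
    tile_id: int
    stages: tuple[Stage, ...]   # the "routing header" carried by the token
    stage_idx: int = 0


def advance(token):
    """Move the token to its next stage; return the next component or None."""
    nxt = token.stage_idx + 1
    if nxt >= len(token.stages):
        return None              # plan exhausted: the tile is complete
    token.stage_idx = nxt
    return token.stages[nxt].component


def trace(token):
    """List every component the token visits, using only the token's own plan."""
    hops = [token.stages[token.stage_idx].component]
    while (nxt := advance(token)) is not None:
        hops.append(nxt)
    return hops


# Plan that skips the DMA read (data already in TCM).
plan = (
    Stage("FETCH", "pe_fetch_store"),
    Stage("GEMM", "pe_gemm"),
    Stage("STORE", "pe_fetch_store"),
    Stage("DMA_WRITE", "pe_dma"),
)
hops = trace(Token(tile_id=0, stages=plan))
```

No component in this sketch knows its successor; the hop sequence falls out of the plan alone, which is the property the paragraph above relies on.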
 #### Pipeline Context
 ```python
 @dataclass
 class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: simpy.Event = None  # succeeds when all tiles are complete
    def complete_tile(self) -> None:
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()
 ```
 **Completion follows an exactly-once contract**: the last stage of each tile must call
 `complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
 succeed only once (SimPy Event constraint).
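
One way the exactly-once contract could be enforced rather than only documented is sketched below. This is an assumption-laden illustration, not the ADR's implementation: `OneShotEvent` is a stdlib stand-in for `simpy.Event`, and `GuardedPipelineContext.complete_tile(tile_id)` takes a `tile_id` argument that the `PipelineContext` above does not have.

```python
from dataclasses import dataclass, field


class OneShotEvent:
    """Stand-in for simpy.Event in this sketch: it may be triggered only once."""

    def __init__(self):
        self.triggered = False

    def succeed(self):
        if self.triggered:
            raise RuntimeError("event has already been triggered")
        self.triggered = True


@dataclass
class GuardedPipelineContext:
    id: str
    total_tiles: int
    done_event: OneShotEvent
    _completed: set = field(default_factory=set)   # tile IDs seen so far

    def complete_tile(self, tile_id: int) -> None:
        # A duplicate completion is reported at the call site that caused it.
        if tile_id in self._completed:
            raise RuntimeError(f"tile {tile_id} completed twice")
        self._completed.add(tile_id)
        if len(self._completed) == self.total_tiles:
            self.done_event.succeed()


ctx = GuardedPipelineContext(id="p0", total_tiles=2, done_event=OneShotEvent())
ctx.complete_tile(0)
ctx.complete_tile(1)   # last tile: done_event fires here
```

Tracking tile IDs instead of a bare counter means a duplicate call fails immediately instead of silently firing `done_event` early.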
 #### Scheduler Role (Reduced)
 When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext,
 enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately.
 Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
 This feeder consumes `_pending_feeds` in FIFO order and
 **does not allow tile feed interleaving across composite commands.**
 That is, the feed for the next command begins only after all tiles of the current
 command have been injected into the first stage queue.
 There is **exactly one `_feed_loop`** per scheduler, and
 tile feed for composite commands is performed exclusively through this single process.
 Command issue order refers to **the order in which PE_SCHEDULER receives PeInternalTxn**.
 This structure maintains command issue order while ensuring that when the first stage
 queue is full, only the feeder process blocks — the scheduler worker's inbox processing
 itself does not stall.
 ```python
 class PeSchedulerV2(PeEngineBase):
    _pipelines: dict[str, PipelineContext]
    _pending_feeds: simpy.Store   # FIFO of (plan, ctx)
    def start(self, env):
        super().start(env)
        self._pending_feeds = simpy.Store(env)
        env.process(self._feed_loop(env))
    def _dispatch_composite(self, env, pe_txn, cmd):
        plan = generate_plan(cmd)
        ctx = PipelineContext(
            id=next_id(),
            total_tiles=len(plan.tiles),
            done_event=pe_txn.done,
        )
        self._pipelines[ctx.id] = ctx
        # only enqueue to feeder queue and return immediately
        yield self._pending_feeds.put((plan, ctx))
    def _feed_loop(self, env):
        """Single feeder process: feeds composite commands in FIFO order.
        Tile feed interleaving across composite commands is not allowed.
        The feed for the next command begins only after all tiles of the
        current command have been injected into the first stage queue.
        When the first stage queue is full, only this feeder blocks;
        the scheduler worker's inbox processing does not stall.
        """
        while True:
            plan, ctx = yield self._pending_feeds.get()
            for tile in plan.tiles:
                token = TileToken(
                    tile_id=tile.tile_id,
                    pipeline_ctx=ctx,
                    plan=tile,
                    stage_idx=0,
                    params=tile.stages[0].params,
                )
                yield self.out_ports[tile.stages[0].component].put(token)
                # queue capacity = HW queue depth → feeder blocks only when full
 ```
 In this ADR, the scheduler can accept multiple composite commands,
 but tile submission order follows per-command FIFO.
 Within a command, tile-level pipeline overlap is allowed,
 but tile feed interleaving across commands is not.
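
The ordering guarantee can be reproduced without SimPy. The sketch below uses `asyncio.Queue` as a stand-in for `simpy.Store`; the queue depth of 2, the `None` sentinel, and the `(name, n_tiles)` command tuples are assumptions made only for this illustration.

```python
import asyncio


async def feeder(pending, first_stage, fed):
    """Single feeder: drains commands in FIFO order, one command at a time."""
    while True:
        cmd = await pending.get()
        if cmd is None:                  # sentinel used only by this sketch
            await first_stage.put(None)
            return
        name, n_tiles = cmd
        for t in range(n_tiles):
            # Only the feeder waits here when the first-stage queue is full.
            await first_stage.put((name, t))
            fed.append((name, t))


async def first_stage_worker(first_stage, done):
    """Consumes tokens in arrival order; stands in for the first pipeline stage."""
    while (tok := await first_stage.get()) is not None:
        await asyncio.sleep(0)           # placeholder for stage latency
        done.append(tok)


async def run():
    pending = asyncio.Queue()                # unbounded: dispatch never blocks
    first_stage = asyncio.Queue(maxsize=2)   # assumed HW queue depth
    fed, done = [], []
    tasks = [
        asyncio.create_task(feeder(pending, first_stage, fed)),
        asyncio.create_task(first_stage_worker(first_stage, done)),
    ]
    # Dispatch side: both commands are accepted immediately.
    pending.put_nowait(("cmd1", 3))
    pending.put_nowait(("cmd2", 2))
    pending.put_nowait(None)
    await asyncio.gather(*tasks)
    return fed, done


fed, done = asyncio.run(run())
```

Both commands are accepted before any tile is fed, yet every `cmd1` tile reaches the first stage before the first `cmd2` tile, because a single task owns the feed loop.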
 ### D3. Data Transfer vs. Completion Signal — HW Modeling Criteria
 | Communication Type | Method | HW Correspondence |
 |-------------------|--------|-------------------|
 | Tile token (work directive) | message via out_port | enqueue to command queue |
 | Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
 | Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |
 **Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.
 **Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
 is applied to intra-PE stage triggers. Chaining between components corresponds to
 PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost
 is incurred.
 **Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
 When all tiles are complete, PipelineContext calls done_event.succeed().
 ### D4. Asynchronous Pipeline — Natural Overlap
 The scheduler processes CompositeCmds **asynchronously**.
 However, tile feed does not spawn an independent process per command; instead,
 the scheduler's internal **single feeder process** performs the feed in FIFO order.
 Therefore, the scheduler can continue to receive the next command,
 but the first-stage tile injection order is guaranteed per command.
 Since **SimPy Store capacity = HW queue depth**:
 - When the queue is full, put() naturally blocks (backpressure)
 - While DMA is still transferring a later tile, FETCH and GEMM can already work on a tile whose DMA read has completed
 - When a second CompositeCmd arrives, its tiles are fed into the DMA queue directly after the last tile of the first command
 ```
 First-stage feed order (feeder → DMA queue):
  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
                                            ↑ cmd2 starts after cmd1 feed completes
 Runtime pipeline (downstream overlap):
  PE_DMA:    [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
  PE_FETCH:          [cmd1:t0][cmd1:t1]...
  PE_GEMM:                   [cmd1:t0][cmd1:t1]...
                              ↑ pipeline overlap within the same command
 ```
 Here, the overlap does not come from tile feed interleaving across different commands,
 but occurs naturally as tiles from earlier commands progress to downstream stages
 while the feeder continues injecting subsequent tiles.
 For example, tile feed for cmd2 does not start until all tiles of cmd1 have been
 injected into the first stage queue. However, while cmd1.tile0 has already progressed
 to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so
 **pipeline overlap within the same command occurs naturally**.
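
The benefit of this overlap can be estimated with the standard pipeline recurrence. The sketch below is a simplified analytical model, not the simulator's behavior: it assumes each stage handles one tile at a time, tiles enter in order, and inter-stage queues never fill. The latency values are hypothetical.

```python
def pipeline_finish_times(stage_latency_ns, n_tiles):
    """finish[s][t] is the time at which stage s finishes tile t."""
    n_stages = len(stage_latency_ns)
    finish = [[0.0] * n_tiles for _ in range(n_stages)]
    for s in range(n_stages):
        for t in range(n_tiles):
            ready = finish[s - 1][t] if s > 0 else 0.0   # upstream stage done
            free = finish[s][t - 1] if t > 0 else 0.0    # this stage idle again
            finish[s][t] = max(ready, free) + stage_latency_ns[s]
    return finish


lat = [100.0, 5.0, 20.0]               # hypothetical DMA / FETCH / GEMM latencies
finish = pipeline_finish_times(lat, n_tiles=3)
pipelined_ns = finish[-1][-1]          # last stage, last tile
serial_ns = sum(lat) * 3               # no overlap: each tile runs end to end
```

With these numbers the slowest stage (DMA) sets the pace, so the pipelined total is three DMA transfers plus one FETCH and one GEMM, rather than three full end-to-end passes.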
 #### Component Chaining Pattern
 All components follow the same pattern:
 ```python
 def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()
        # process own stage
        yield from self._process(env, token)
        # chain to next stage (read from plan)
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            # last stage — pipeline completion
            token.pipeline_ctx.complete_tile()
 ```
 ### D5. PE_FETCH_STORE — Dedicated TCM ↔ Register File Transfer
 Previously, GemmBlock and MathBlock each implemented their own TCM read/write.
 This is separated into a **PE_FETCH_STORE component**.
 ```python
 # PE_FETCH_STORE._process()
 def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done
    # chaining is handled by the base class (D4 pattern)
 ```
 Advantages:
 - GEMM/MATH perform **pure compute only** — no TCM access logic
 - Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
 - Prefetch strategies can be experimented with by replacing the fetch unit alone
 ### D6. Simplification of Each Compute Component
 GEMM/MATH perform compute only with register data already prepared.
 **Chaining follows the common pattern (D4), so only _process() needs to be implemented:**
 ```python
 # PE_GEMM._process()
 def _process(self, env, token):
    yield env.timeout(self._mac_latency(token.params))
 # PE_MATH._process()
 def _process(self, env, token):
    yield env.timeout(self._simd_latency(token.params))
 # PE_FETCH_STORE._process()
 def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done
 # PE_DMA._process()
 def _process(self, env, token):
    yield from self._do_fabric_dma(token.params)
 ```
 By replacing only the timing model, one can freely switch between cycle-accurate
 and analytical models. Since the chaining logic resides in the base class,
 each component only implements its pure stage logic.
 ### D7. Topology Changes
 Add PE_FETCH_STORE to the PE template:
 ```yaml
 pe_template:
  components:
    pe_cpu:         { kind: pe_cpu,         impl: pe_cpu_v1, ... }
    pe_scheduler:   { kind: pe_scheduler,   impl: pe_scheduler_v2, ... }
    pe_dma:         { kind: pe_dma,         impl: pe_dma_v1, ... }
    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
    pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1, ... }
    pe_math:        { kind: pe_math,        impl: pe_math_v1, ... }
    pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1, ... }
    pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1, ... }
  links:
    # existing links...
    fetch_store_to_tcm_bw_gbs: 512.0
    fetch_store_to_tcm_mm: 0.0
 ```
 PE internal edge connections:
 ```
 PE_SCHEDULER → PE_DMA (initial dispatch)
 PE_SCHEDULER → PE_FETCH_STORE (initial dispatch)
 PE_SCHEDULER → PE_GEMM (initial dispatch)
 PE_SCHEDULER → PE_MATH (initial dispatch)
 PE_DMA → PE_FETCH_STORE (chaining)
 PE_FETCH_STORE → PE_GEMM (chaining)
 PE_FETCH_STORE → PE_MATH (chaining)
 PE_GEMM → PE_FETCH_STORE (store chaining)
 PE_MATH → PE_FETCH_STORE (store chaining)
 PE_FETCH_STORE → PE_DMA (writeback chaining)
 PE_FETCH_STORE → PE_TCM (BW request)
 ```
 Topology edges cover both **control/dispatch visibility and runtime chaining**.
 Scheduler → sub-component edges are initial dispatch paths, while
 inter-component edges are runtime chaining paths driven by token self-routing.
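
Because routing is plan-driven, a plan is only executable if every hop it implies exists as a topology edge. The sketch below checks that, under stated assumptions: `EDGES` is transcribed from the edge list above, while `STAGE_COMPONENT` is a stage-to-component mapping inferred from the D1 table for illustration and is not defined by this ADR.

```python
EDGES = {
    ("PE_SCHEDULER", "PE_DMA"),
    ("PE_SCHEDULER", "PE_FETCH_STORE"),
    ("PE_SCHEDULER", "PE_GEMM"),
    ("PE_SCHEDULER", "PE_MATH"),
    ("PE_DMA", "PE_FETCH_STORE"),
    ("PE_FETCH_STORE", "PE_GEMM"),
    ("PE_FETCH_STORE", "PE_MATH"),
    ("PE_GEMM", "PE_FETCH_STORE"),
    ("PE_MATH", "PE_FETCH_STORE"),
    ("PE_FETCH_STORE", "PE_DMA"),
    ("PE_FETCH_STORE", "PE_TCM"),
}

# Assumed mapping (not stated in the ADR): which component runs each stage.
STAGE_COMPONENT = {
    "DMA_READ": "PE_DMA",
    "FETCH": "PE_FETCH_STORE",
    "GEMM": "PE_GEMM",
    "MATH": "PE_MATH",
    "STORE": "PE_FETCH_STORE",
    "DMA_WRITE": "PE_DMA",
}


def missing_edges(stages):
    """Return the hops of a stage sequence that have no topology edge."""
    hops = ["PE_SCHEDULER"] + [STAGE_COMPONENT[s] for s in stages]
    return [(a, b) for a, b in zip(hops, hops[1:]) if (a, b) not in EDGES]


normal_gemm = missing_edges(("DMA_READ", "FETCH", "GEMM", "STORE", "DMA_WRITE"))
fused = missing_edges(("DMA_READ", "FETCH", "GEMM", "MATH", "STORE", "DMA_WRITE"))
```

Under this mapping the four example plans from D2 are fully covered, while a plan that chains GEMM directly into MATH (as drawn in the D2 overview diagram) reports `("PE_GEMM", "PE_MATH")` as uncovered. Whether that hop is intended to exist is not stated in this section.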
 ### D9. TileToken Message Definition
 A message used for passing tile work between components.
 The token carries the plan and stage index, enabling self-routing.
 ```python
 @dataclass
 class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext    # completion tracking
    plan: TilePlan                   # full stage sequence for this tile (immutable)
    stage_idx: int                   # current stage index in plan.stages
    params: dict                     # current stage parameter cache (canonical: plan.stages[stage_idx].params)
    data_op: bool = True             # op_log recording target (ADR-0020)
 ```
 A TileToken is **owned by exactly one component at a time** and
 is never referenced by multiple components simultaneously (single-owner).
 Token lifecycle:
 1. Scheduler creates it with stage_idx=0 and puts it to the first stage component
 2. The component executes _process(), increments stage_idx, and puts it to the next component
 3. The last stage component calls pipeline_ctx.complete_tile()
 4. When all tiles are complete, PipelineContext calls done_event.succeed()
 Relationship with existing PeInternalTxn:
 - PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged)
 - TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)
 ---
 ## Non-goals
 - **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is not modified
  (PeInternalTxn-based, ADR-0014 maintained)
 - **Resource contention model across multiple pipelines**: the current scope focuses on
  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines
  are future work.
 ## Open Questions
 - **Register File capacity model**: whether to model capacity limits when the fetch unit
  loads into registers. Capacity is expressed in bytes (register_file_bytes), and
  the number of tiles that can be held simultaneously is determined by tile size.
  When capacity is exceeded, fetch stalls, creating natural backpressure.
 - **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite
  commands. Therefore, overlap arises not from pre-injection across commands, but
  naturally from pipeline progression of tiles within the same command.
  If additional prefetch is needed, it should be considered at the level of tile ordering
  within the same command or fetch/store unit policy, not cross-command injection.
 - **PE_DMA coalescing**: per-tile DMA may cause fragmentation.
  Direction is to merge/coalesce within DMA without scheduler involvement.
 - **Synchronous execution mode**: this ADR adopts asynchronous pipeline as the
  default/sole execution model. If a sync mode is needed for debug or validation
  purposes, it will be considered in a future ADR.
 - **TCM bank conflict across multiple pipelines**: currently based on a single pipeline.
  Bank conflict modeling when multiple pipelines simultaneously access TCM is future work.
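
For the Register File capacity question above, the arithmetic is small enough to sketch. This is only an illustration of the byte-based rule described in that bullet; the function names and the example sizes are hypothetical, and the question itself remains open.

```python
def resident_tile_capacity(register_file_bytes: int, tile_bytes: int) -> int:
    """Number of whole tiles the register file can hold at once."""
    if tile_bytes <= 0:
        raise ValueError("tile_bytes must be positive")
    return register_file_bytes // tile_bytes


def fetch_may_proceed(resident_tiles: int, register_file_bytes: int, tile_bytes: int) -> bool:
    """False means the fetch unit would stall, which creates backpressure."""
    return resident_tiles < resident_tile_capacity(register_file_bytes, tile_bytes)


# Hypothetical sizes: 64 KiB register file, 16 KiB tiles.
capacity = resident_tile_capacity(64 * 1024, 16 * 1024)
```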
 ---
 ## Consequences
 ### Positive
 - Each block is an independent component — individually replaceable (ADR-0015 compliant)
 - PE internal structure is visible in the topology
 - Components do not know the next component — plan-based routing provides flexibility
 - Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
 - Improved HW modeling accuracy (done signal = Event, data transfer = message)
 - Fetch/store separation enables accurate TCM BW contention modeling
 ### Negative
 - Increased number of PE internal components (5 → 6) — more topology nodes/edges
 - Component separation makes intra-PE token forwarding more explicit than before
@@ -1,426 +0,0 @@
 # ADR-0021: PE Pipeline Refactoring - Component Separation + Scheduler-Based Routing
 ## Status
 Accepted
 ## Context
 ### Actual Hardware Structure
 ```
 HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
 ```
 - DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
 - Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
 - GEMM/MATH Engine: computation between Register Files (cycle-accurate)
 - Completion signal: PE-internal 1-cycle wire signal (done pin assert)
 ---
 ## Decision
 ### D1. Separate Each Block into an Independent Component
 The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
 Existing 5 blocks + 1 Fetch/Store Unit = 6 components.
 | Component | Role | HW Correspondence |
 |-----------|------|-------------------|
 | PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
 | PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
 | PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
 | PE_GEMM | MAC compute (register only) | MAC Array |
 | PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
 | PE_TCM | BW-serialized scratchpad | SRAM Bank |
 Each component exists as a topology node and is connected via ports/wires.
 Replacing the `impl` allows changing the timing model of an individual block.
 ### D2. Token Self-Routing: Scheduler Handles Only Dispatch + Completion
 **Components do not pass through the scheduler at every stage.**
 The token carries a plan so that components chain directly to the next stage.
 ```
 Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
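              ↑ chaining: does not go through scheduler          completion only
 ```
 This matches the actual HW structure where each block's done signal is directly
 connected to the next block via wire. The scheduler is responsible **only for
 initial dispatch + completion aggregation**.
 #### Stage Definition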
 ```python
 class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5
 ```
 #### Plan Structure
 When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
 The plan defines the **stage sequence** for each tile:
 ```python
 @dataclass
 class Stage:
    stage_type: StageType
    component: str       # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
    params: dict         # per-stage parameters (dynamic)
 @dataclass(frozen=True)
 class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]  # list of stages to execute in order (immutable)
 ```
 The stage sequence varies depending on the plan:
 ```python
 # Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
 stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)
 # GEMM directly from TCM data (skip DMA read):
 stages = (FETCH, GEMM, STORE, DMA_WRITE)
 # MATH element-wise:
 stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)
 # GEMM + accumulation (intermediate K-tile, skip writeback):
 stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only
 ```
 **Components do not hardcode the next component.**
 They read the next stage from the token's plan and forward it directly via out_port.
 This is the same pattern as a network packet carrying a routing header.
 #### Pipeline Context
 ```python
 @dataclass
 class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: simpy.Event = None  # succeeds when all tiles are complete
    def complete_tile(self) -> None:
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()
 ```
 **Completion follows an exactly-once contract**: the last stage of each tile must call
 `complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
 succeed only once (SimPy Event constraint).
 #### Scheduler Role (Reduced)
 When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext,
 enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately.
 Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
 This feeder consumes `_pending_feeds` in FIFO order and
 **does not allow tile feed interleaving across composite commands.**
 That is, the feed for the next command begins only after all tiles of the current
 command have been injected into the first stage queue.
 There is **exactly one `_feed_loop`** per scheduler, and
 tile feed for composite commands is performed exclusively through this single process.
 Command issue order refers to **the order in which PE_SCHEDULER receives PeInternalTxn**.
 This structure maintains command issue order while ensuring that when the first stage
 queue is full, only the feeder process blocks; the scheduler worker's inbox processing
 itself does not stall.
 ```python
 class PeSchedulerV2(PeEngineBase):
    _pipelines: dict[str, PipelineContext]
    _pending_feeds: simpy.Store   # FIFO of (plan, ctx)
    def start(self, env):
        super().start(env)
        self._pending_feeds = simpy.Store(env)
        env.process(self._feed_loop(env))
    def _dispatch_composite(self, env, pe_txn, cmd):
        plan = generate_plan(cmd)
        ctx = PipelineContext(
            id=next_id(),
            total_tiles=len(plan.tiles),
            done_event=pe_txn.done,
        )
        self._pipelines[ctx.id] = ctx
         # only enqueue to feeder queue and return immediately
         yield self._pending_feeds.put((plan, ctx))
     def _feed_loop(self, env):
         """Single feeder process: feeds composite commands in FIFO order.
         Tile feed interleaving across composite commands is not allowed.
         The feed for the next command begins only after all tiles of the
         current command have been injected into the first stage queue.
         When the first stage queue is full, only this feeder blocks;
         the scheduler worker's inbox processing does not stall.
         """
        while True:
            plan, ctx = yield self._pending_feeds.get()
            for tile in plan.tiles:
                token = TileToken(
                    tile_id=tile.tile_id,
                    pipeline_ctx=ctx,
                    plan=tile,
                    stage_idx=0,
                    params=tile.stages[0].params,
                )
                yield self.out_ports[tile.stages[0].component].put(token)
                 # queue capacity = HW queue depth → feeder blocks only when full
 ```
 In this ADR, the scheduler can accept multiple composite commands,
 but tile submission order follows per-command FIFO.
 Within a command, tile-level pipeline overlap is allowed,
 but tile feed interleaving across commands is not.
 ### D3. Data Transfer vs. Completion Signal: HW Modeling Criteria
 | Communication Type | Method | HW Correspondence |
 |-------------------|--------|-------------------|
 | Tile token (work directive) | message via out_port | enqueue to command queue |
 | Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
 | Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |
 **Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.
 **Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
 is applied to intra-PE stage triggers. Chaining between components corresponds to
 PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost
 is incurred.
 **Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
 When all tiles are complete, PipelineContext calls done_event.succeed().
 ### D4. Asynchronous Pipeline: Natural Overlap
 The scheduler processes CompositeCmds **asynchronously**.
 However, tile feed does not spawn an independent process per command; instead,
 the scheduler's internal **single feeder process** performs the feed in FIFO order.
 Therefore, the scheduler can continue to receive the next command,
 but the first-stage tile injection order is guaranteed per command.
 Since **SimPy Store capacity = HW queue depth**:
 - When the queue is full, put() naturally blocks (backpressure)
 - While DMA is still transferring a later tile, FETCH and GEMM can already work on a tile whose DMA read has completed
 - When a second CompositeCmd arrives, its tiles are fed into the DMA queue directly after the last tile of the first command
 ```
 First-stage feed order (feeder → DMA queue):
  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
                                             ↑ cmd2 starts after cmd1 feed completes
 Runtime pipeline (downstream overlap):
  PE_DMA:    [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
  PE_FETCH:          [cmd1:t0][cmd1:t1]...
  PE_GEMM:                   [cmd1:t0][cmd1:t1]...
                               ↑ pipeline overlap within the same command
 ```
 Here, the overlap does not come from tile feed interleaving across different commands,
 but occurs naturally as tiles from earlier commands progress to downstream stages
 while the feeder continues injecting subsequent tiles.
 For example, tile feed for cmd2 does not start until all tiles of cmd1 have been
 injected into the first stage queue. However, while cmd1.tile0 has already progressed
 to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so
 **pipeline overlap within the same command occurs naturally**.
 #### Component Chaining Pattern
 All components follow the same pattern:
 ```python
 def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()
         # process own stage
         yield from self._process(env, token)
         # chain to next stage (read from plan)
         next_idx = token.stage_idx + 1
         if next_idx < len(token.plan.stages):
             next_stage = token.plan.stages[next_idx]
             token.stage_idx = next_idx
             token.params = next_stage.params
             yield self.out_ports[next_stage.component].put(token)
         else:
             # last stage: pipeline completion
             token.pipeline_ctx.complete_tile()
 ```
 ### D5. PE_FETCH_STORE: Dedicated TCM ↔ Register File Transfer
 Previously, GemmBlock and MathBlock each implemented their own TCM read/write.
 This is separated into a **PE_FETCH_STORE component**.
 ```python
 # PE_FETCH_STORE._process()
 def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done
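     # chaining is handled by the base class (D4 pattern)
 ```
 Advantages:
 - GEMM/MATH perform **pure compute only**, with no TCM access logic
 - Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
 - Prefetch strategies can be experimented with by replacing the fetch unit alone
 ### D6. Simplification of Each Compute Component
 GEMM/MATH perform compute only with register data already prepared.
 **Chaining follows the common pattern (D4), so only _process() needs to be implemented:**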
 ```python
 # PE_GEMM._process()
 def _process(self, env, token):
    yield env.timeout(self._mac_latency(token.params))
 # PE_MATH._process()
 def _process(self, env, token):
    yield env.timeout(self._simd_latency(token.params))
 # PE_FETCH_STORE._process()
 def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done
 # PE_DMA._process()
 def _process(self, env, token):
    yield from self._do_fabric_dma(token.params)
 ```
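 By replacing only the timing model, one can freely switch between cycle-accurate
 and analytical models. Since the chaining logic resides in the base class,
 each component only implements its pure stage logic.
 ### D7. Topology Changes
 Add PE_FETCH_STORE to the PE template: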
 ```yaml
 pe_template:
  components:
    pe_cpu:         { kind: pe_cpu,         impl: pe_cpu_v1, ... }
    pe_scheduler:   { kind: pe_scheduler,   impl: pe_scheduler_v2, ... }
    pe_dma:         { kind: pe_dma,         impl: pe_dma_v1, ... }
    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
    pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1, ... }
    pe_math:        { kind: pe_math,        impl: pe_math_v1, ... }
    pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1, ... }
    pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1, ... }
  links:
     # existing links...
     fetch_store_to_tcm_bw_gbs: 512.0
     fetch_store_to_tcm_mm: 0.0
 ```
 PE internal edge connections:
 ```
 PE_SCHEDULER → PE_DMA (initial dispatch)
 PE_SCHEDULER → PE_FETCH_STORE (initial dispatch)
 PE_SCHEDULER → PE_GEMM (initial dispatch)
 PE_SCHEDULER → PE_MATH (initial dispatch)
 PE_DMA → PE_FETCH_STORE (chaining)
 PE_FETCH_STORE → PE_GEMM (chaining)
 PE_FETCH_STORE → PE_MATH (chaining)
 PE_GEMM → PE_FETCH_STORE (store chaining)
 PE_MATH → PE_FETCH_STORE (store chaining)
 PE_FETCH_STORE → PE_DMA (writeback chaining)
 PE_FETCH_STORE → PE_TCM (BW request)
 ```
 Topology edges cover both **control/dispatch visibility and runtime chaining**.
 Scheduler → sub-component edges are initial dispatch paths, while
 inter-component edges are runtime chaining paths driven by token self-routing.
 ### D9. TileToken Message Definition
 A message used for passing tile work between components.
 The token carries the plan and stage index, enabling self-routing.
 ```python
 @dataclass
 class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext    # completion tracking
    plan: TilePlan                   # full stage sequence for this tile (immutable)
    stage_idx: int                   # current stage index in plan.stages
    params: dict                     # current stage parameter cache (canonical: plan.stages[stage_idx].params)
    data_op: bool = True             # op_log recording target (ADR-0020)
 ```
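 A TileToken is **owned by exactly one component at a time** and
 is never referenced by multiple components simultaneously (single-owner).
 Token lifecycle:
 1. Scheduler creates it with stage_idx=0 and puts it to the first stage component
 2. The component executes _process(), increments stage_idx, and puts it to the next component
 3. The last stage component calls pipeline_ctx.complete_tile()
 4. When all tiles are complete, PipelineContext calls done_event.succeed()
 Relationship with existing PeInternalTxn:
 - PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged)
 - TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)
 ---
 ## Non-goals
 - **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is not modified
  (PeInternalTxn-based, ADR-0014 maintained)
 - **Resource contention model across multiple pipelines**: the current scope focuses on
  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines and similar effects are future work.
 ## Open Questions
 - **Register File capacity model**: whether to model capacity limits when the fetch unit
  loads into registers. Capacity is expressed in bytes (register_file_bytes), and
  the number of tiles that can be held simultaneously is determined by tile size.
  When capacity is exceeded, fetch stalls, creating natural backpressure.
 - **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite
  commands. Therefore, overlap arises not from pre-injection across commands, but
  naturally from pipeline progression of tiles within the same command.
  If additional prefetch is needed, it should be considered at the level of tile ordering
  within the same command or fetch/store unit policy, not cross-command injection.
 - **PE_DMA coalescing**: per-tile DMA may cause fragmentation.
  Direction is to merge/coalesce within DMA without scheduler involvement.
 - **Synchronous execution mode**: this ADR adopts asynchronous pipeline as the
  default/sole execution model. If a sync mode is needed for debug or validation purposes, it will be considered in a future ADR.
 - **TCM bank conflict across multiple pipelines**: currently based on a single pipeline.
  Bank conflict modeling when multiple pipelines simultaneously access TCM is future work.
 ---
 ## Consequences
 ### Positive
 - Each block is an independent component and is individually replaceable (ADR-0015 compliant)
 - PE internal structure is visible in the topology
 - Components do not know the next component; plan-based routing provides flexibility
 - Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
 - Improved HW modeling accuracy (done signal = Event, data transfer = message)
 - Fetch/store separation enables accurate TCM BW contention modeling
 ### Negative
 - Increased number of PE internal components (5 → 6), which means more topology nodes/edges
 - Component separation makes intra-PE token forwarding more explicit than before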
@@ -1,10 +1,10 @@
 # ADR-0022: 2D Grid program_id Semantics
- **Status**: Accepted
+## Status
 - **Date**: 2026-04-09
 - **Context**: Triton-style kernel addressing for multi-cube PE topology
-## Problem
+Accepted
 ## Context
 Triton kernels use `tl.program_id(axis)` to identify their position in a launch grid.
 Our hardware has a 2-level hierarchy: **cubes** contain **PEs**.
@@ -709,7 +709,7 @@ piggyback, tail updates via the D9 fast-path channel.
 ### D13. Test strategy
-Following the ADR-0021 D8 pattern.
+Test plan:
 #### T1. Unit tests (component-level)
@@ -801,7 +801,7 @@ F5. **Slot full + infinite backpressure**: the peer never recvs.
 ### D15. Algorithm-author cheat sheet
 Full step-by-step lives in
-[`docs/ccl-author-guide.en.md`](../ccl-author-guide.en.md). The
+[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
 shortest version:
 | Things you touch | Things you don't |
@@ -969,7 +969,7 @@ tail updates are handled via the D9 fast-path SimPy Store channel.
 ### D13. Test strategy
-Following the ADR-0021 D8 pattern, specify unit/integration/regression tests.
+Specify unit/integration/regression tests.
 #### T1. Unit tests (component-level)
@@ -1102,7 +1102,7 @@ F5. **Slot full + infinite backpressure**:
 ### D15. Algorithm-author guide (summary)
 This section gives algorithm authors a one-screen starting point.
-For the detailed step-by-step guide, see [docs/ccl-author-guide.md](../ccl-author-guide.md).
+For the detailed step-by-step guide, see [docs/onboarding/ccl-author-guide.md](../onboarding/ccl-author-guide.md).
 #### Things you touch / things you don't
@@ -1175,7 +1175,416 @@ def neighbors(rank, world_size, neighbor_map) -> dict | None:
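 2. **Unmatched send/recv**: hangs if the peer never posts a recv (slot full backpressure)
 3. **dtype/shape mismatch**: not validated in the first implementation; this is the author's responsibility
-For the detailed step-by-step guide and a hello-world example, see `docs/ccl-author-guide.md`.
+For the detailed step-by-step guide and a hello-world example, see `docs/onboarding/ccl-author-guide.md`.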
 ---
 ## HW Realization Notes (Informative)
 **Status of this section**: Forward-looking. Describes how the simulator
 contract (D1–D15) would map to silicon. Not currently implemented;
 subject to revision before tapeout. The simulator implements the
 contract via Python/SimPy equivalents in
 [pe_ipcq.py](../../src/kernbench/components/builtin/pe_ipcq.py) and
 [pe_dma.py](../../src/kernbench/components/builtin/pe_dma.py).
 ### D16. Proposed HW Block Diagram and End-to-End Dataflow
 ![PE Baseline Architecture](../diagrams/pe_baseline.png)
 > Source: [`../diagrams/pe_baseline.d2`](../diagrams/pe_baseline.d2) — `d2 --layout=elk --scale 1.5`.
 ![PE Proposed Architecture](../diagrams/pe_proposed.png)
 > Source: [`../diagrams/pe_proposed.d2`](../diagrams/pe_proposed.d2) — `d2 --layout=elk`.
 **Key changes, Baseline → Proposed**:
 - Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW)
 - PE_IPCQ (SimPy component) → **IPCQ Controller** (HW register + combinational logic)
 - The **reserved IPCQ Slot Region** within TCM is made explicit
 - Credit Injector / Receiver connect directly to the NoC through the Fabric Port
 #### End-to-End Sequence (HW view)
 ```mermaid
 sequenceDiagram
    participant CPU_A as PE_A: PE_CPU
    participant IPCQ_A as PE_A: IPCQ Ctrl
    participant DMA_A as PE_A: DMA
    participant NOC as NoC Fabric
    participant DMA_B as PE_B: DMA
    participant IPCQ_B as PE_B: IPCQ Ctrl
    participant TCM_B as PE_B: TCM
    participant CPU_B as PE_B: PE_CPU
    Note over CPU_A: tl.send(dir="E", src=0x1000)
    CPU_A->>IPCQ_A: MMIO: send request
    Note over IPCQ_A: Backpressure check:<br/>(head - peer_tail_cache) < n_slots → PASS<br/>Slot addr gen:<br/>dst = peer_rx_base + (head%n) × slot_size
    IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head}
    Note over IPCQ_A: my_head++
    IPCQ_A-->>CPU_A: send returns (fire-and-forget)
    Note over DMA_A: TCM read → snapshot in read buffer<br/>Flit pack: data + {sender_seq, dst_addr}
    DMA_A->>NOC: IPCQ data flit(s)
    Note over NOC: hop latency + BW drain
    NOC->>DMA_B: IPCQ data flit(s)
    Note over DMA_B: Terminal BW drain<br/>Slot write latency
    rect rgb(255, 240, 220)
        Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall
        DMA_B->>TCM_B: write data → slot address
        DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr}
    end
    Note over IPCQ_B: Range match dst_addr → direction "W"<br/>peer_head_cache["W"] = sender_seq + 1
    IPCQ_B-->>CPU_B: recv_wake signal
    Note over CPU_B: tl.recv(dir="W") wakes up
    CPU_B->>IPCQ_B: recv request
    Note over IPCQ_B: peer_head_cache > my_tail → YES<br/>slot_addr = rx_base + (tail%n) × slot_size
    IPCQ_B-->>CPU_B: return slot_addr
    CPU_B->>TCM_B: read data from slot
    Note over IPCQ_B: my_tail++
    IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa}
    Note over NOC: credit traversal (NoC latency)
    NOC->>IPCQ_A: Credit arrival
    Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled)
 ```
 ### D17. IPCQ Controller HW Module (new)
 A hardware control block that sits between PE_CPU and the DMA Engine. It corresponds to the
 simulator's `PeIpcqComponent`.
 #### QPair Register File
 Holds per-direction queue-pair state in flip-flops. PE_CPU can read and write it over MMIO (CSR),
 and software populates it at init time.
 ```
 Per-direction registers (each 64-bit):
  my_head          - sender write position (monotonic)
  my_tail          - receiver read position (monotonic)
  peer_head_cache  - last known peer head (updated by Meta Extractor)
  peer_tail_cache  - last known peer tail (updated by Credit Receiver)
  rx_base_pa       - this PE's rx buffer base physical address
  peer_rx_base_pa  - peer's rx buffer base physical address
  n_slots          - ring depth (must be a power of 2; see D21)
  slot_size        - bytes per slot
  peer_credit_tgt  - credit receive address of the peer PE
 Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare)
 Total: 8 dirs × 9 regs × 8B = 576B flip-flops
 ```
 #### Slot Address Generator (combinational)
 ```
 Input:  pointer (my_head or my_tail), n_slots, slot_size, base_pa
 Output: slot_addr = base_pa + (pointer % n_slots) * slot_size
 Implementation:
  n_slots power-of-2 → pointer & (n_slots - 1)   (AND mask, 1 gate)
  slot_size power-of-2 → barrel shift             (1 cycle)
  64-bit add → ripple/kogge-stone adder           (1 cycle)
 Latency: 1-2 cycles combinational
 ```
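The mask-and-shift shortcut above can be checked against the plain modulo/multiply formula with a small Python sketch. The function name `slot_addr` and the power-of-two validation are illustrative assumptions, not part of the simulator or of any RTL.

```python
def slot_addr(pointer: int, n_slots: int, slot_size: int, base_pa: int) -> int:
    """Ring-slot address using the mask/shift form described for hardware."""
    # The shortcut is only valid when both parameters are powers of two.
    if n_slots <= 0 or n_slots & (n_slots - 1):
        raise ValueError("n_slots must be a power of two")
    if slot_size <= 0 or slot_size & (slot_size - 1):
        raise ValueError("slot_size must be a power of two")
    index = pointer & (n_slots - 1)       # same as pointer % n_slots
    shift = slot_size.bit_length() - 1    # log2(slot_size)
    return base_pa + (index << shift)     # same as base_pa + index * slot_size


# The shortcut agrees with the reference formula for every pointer value tried.
for ptr in range(40):
    assert slot_addr(ptr, 8, 4096, 0xE00000) == 0xE00000 + (ptr % 8) * 4096
```

Because the pointers are monotonic counters, only the low `log2(n_slots)` bits ever reach the address path, which is why a non-power-of-two ring depth would need a real divider.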
 #### Backpressure Comparator (combinational)
 ```
 full = (my_head - peer_tail_cache) >= n_slots
 Implementation: 64-bit subtract + unsigned compare
 Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold
 Latency: 1 cycle
 ```
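A minimal sketch of the comparator, assuming the 64-bit subtract wraps like unsigned hardware arithmetic (the wraparound case is an assumption added here; the text above only states "64-bit subtract + unsigned compare"). The helper name `is_full` is hypothetical.

```python
MASK64 = (1 << 64) - 1


def is_full(my_head: int, peer_tail_cache: int, n_slots: int) -> bool:
    """True when the sender must stall: all n_slots slots are in flight."""
    # Masking emulates an unsigned 64-bit subtract, so the result stays
    # correct even after my_head has wrapped past 2**64.
    in_flight = (my_head - peer_tail_cache) & MASK64
    return in_flight >= n_slots


full_at_depth = is_full(my_head=8, peer_tail_cache=0, n_slots=8)
one_slot_free = is_full(my_head=7, peer_tail_cache=0, n_slots=8)
# Head has wrapped to 2 while the cached tail is still 3 below the wrap point:
# five slots are in flight, so an 8-slot ring is not full.
after_wrap = is_full(my_head=2, peer_tail_cache=MASK64 - 2, n_slots=8)
```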
 #### Meta Extractor (inbound datapath sideband)
 Wired into the inbound vc_comm path of the DMA Engine. It extracts metadata from the header of
 each arriving IPCQ flit and updates the queue-pair state.
 ```
 Trigger: DMA inbound write completion (same cycle)
 Extract: {sender_seq, dst_addr} from flit header
 Direction matching (ADR-0025 D2):
  for each dir:
     match = (rx_base_pa[dir] <= dst_addr) && (dst_addr < rx_base_pa[dir] + n_slots[dir] * slot_size[dir])
  8× parallel range comparators + priority encoder
 Update: peer_head_cache[matched_dir] = max(peer_head_cache[matched_dir], sender_seq + 1)
 Output: recv_wake signal → PE_CPU interrupt/flag
 Latency: 1 cycle (pipelined with the DMA write, so I6 atomicity holds by construction)
 ```
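The direction match and the monotonic `peer_head_cache` update can be sketched in Python as follows. The `QPair` dataclass and `on_ipcq_flit` are hypothetical names for illustration; the sequential loop stands in for the parallel comparators and priority encoder.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class QPair:
    rx_base_pa: int
    n_slots: int
    slot_size: int
    peer_head_cache: int = 0


def on_ipcq_flit(qpairs: dict[str, QPair], sender_seq: int, dst_addr: int) -> str | None:
    """Range-match dst_addr to a direction and advance its peer_head_cache."""
    for direction, qp in qpairs.items():
        limit = qp.rx_base_pa + qp.n_slots * qp.slot_size
        if qp.rx_base_pa <= dst_addr < limit:
            # max() keeps the cache monotonic if flits are observed out of order.
            qp.peer_head_cache = max(qp.peer_head_cache, sender_seq + 1)
            return direction
    return None  # address is outside every rx ring


qpairs = {"W": QPair(0xE00000, 8, 4096), "E": QPair(0xE08000, 8, 4096)}
first = on_ipcq_flit(qpairs, sender_seq=3, dst_addr=0xE08000 + 4096)
stale = on_ipcq_flit(qpairs, sender_seq=1, dst_addr=0xE08000)
miss = on_ipcq_flit(qpairs, sender_seq=0, dst_addr=0x1000)
```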
 #### Credit Injector (outbound)
 ```
 Trigger: recv completion (after my_tail is incremented)
 Action:  pack 16B credit packet → DMA vc_comm (or a dedicated credit VC)
 Packet: {consumer_seq = my_tail, dst_rx_base_pa = rx_base_pa}
 Latency: 1 cycle to generate, then NoC traversal
 ```
 #### Credit Receiver (inbound sideband)
 ```
 Trigger: 16B credit packet arrival (from NoC)
 Extract: {consumer_seq, dst_rx_base_pa}
 Direction matching (ADR-0025 D3):
  for each dir:
    match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa)
 Update: peer_tail_cache[matched_dir] = max(peer_tail_cache[matched_dir], consumer_seq)
 Output: send_wake signal → deassert backpressure stall
 Latency: 1 cycle
 ```
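The sender-side effect of a credit can be illustrated with a short sketch: a full ring stalls the sender, and a single credit for the matching direction releases it. The function name `on_credit`, the dictionaries, and the 4-slot ring are assumptions chosen for the example.

```python
from __future__ import annotations

MASK64 = (1 << 64) - 1


def on_credit(peer_rx_base_pa: dict[str, int],
              peer_tail_cache: dict[str, int],
              consumer_seq: int,
              dst_rx_base_pa: int) -> str | None:
    """Apply one credit packet; return the matched direction (None if no match)."""
    for direction, base in peer_rx_base_pa.items():
        if base == dst_rx_base_pa:  # one equality comparator per direction
            # max() keeps the cache monotonic if a stale credit arrives late.
            peer_tail_cache[direction] = max(peer_tail_cache[direction], consumer_seq)
            return direction
    return None


# Sender with a 4-slot ring toward "E": four sends fill it, one credit frees a slot.
peer_rx_base_pa = {"E": 0xE00000, "W": 0xE08000}
peer_tail_cache = {"E": 0, "W": 0}
my_head = 4
stalled_before = ((my_head - peer_tail_cache["E"]) & MASK64) >= 4
matched = on_credit(peer_rx_base_pa, peer_tail_cache, consumer_seq=1, dst_rx_base_pa=0xE00000)
stalled_after = ((my_head - peer_tail_cache["E"]) & MASK64) >= 4
```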
 ### D18. DMA Engine vc_comm IPCQ-aware Mode
 Adds an IPCQ flit handling mode to the existing vc_comm channel (D8).
 **Outbound**:
 1. Receive a command from the IPCQ Controller: `{src_addr, dst_addr, nbytes, sender_seq}`
 2. Read src_addr from TCM → snapshot into the DMA read buffer (standard DMA behavior)
 3. Flit pack: data + piggyback metadata (sender_seq, dst_addr)
 4. Inject into the NoC fabric port
 5. Fire-and-forget (does not wait for completion)
 **Inbound**:
 1. Receive the IPCQ flit from the NoC
 2. Terminal BW drain charge (`drain_ns = nbytes / bottleneck_bw`)
 3. Slot write latency charge (backing memory tier)
 4. **ATOMIC** (same pipeline stage, no stall insertion):
   - TCM write: data → slot address
   - Meta Extractor trigger: sender_seq + dst_addr → IPCQ Controller
 5. Done
 **Hardware guarantee of I6 atomicity**: TCM write completion and the Meta Extractor trigger occur in
 the same pipeline stage, so no separate synchronization is needed. The simulator's
 "no SimPy yield between MemoryStore.write and IpcqMetaArrival put" rule (D9, I6) is
 therefore satisfied by construction.
 #### Data Snapshot Semantics
 Data latched in the DMA read buffer is unaffected by later modifications to the source memory.
 This is standard DMA read-then-write behavior, so no additional HW is required.
 #### Credit Virtual Channel (optional)
 - **Option A**: multiplex credits onto vc_comm (distinguished as 16B header-only flits).
 - **Option B**: add a third, dedicated credit VC (strict priority > data).
 Option B is better for deadlock prevention, but because the BW impact of 16B credits is negligible,
 Option A is also sufficient.
 ### D19. Fabric Flit Format Extension
 ```
 General data flit (e.g. 512-bit):
 ┌──────────────────────────────────────────┐
 │ [511:480] routing header (32b)           │
 │ [479:0]   payload (480b = 60B)           │
 └──────────────────────────────────────────┘
 IPCQ data flit (metadata only in the first flit):
 ┌──────────────────────────────────────────┐
 │ [511:480] routing header (32b)           │
 │   [511]    ipcq_flag (1b)                │  ← distinguishes IPCQ from normal DMA
 │   [510:509] vc_id (2b)                   │
 │   [508:480] route + hop count            │
 │ [479:416] ipcq_metadata (64b)            │  ← piggyback
 │   [479:448] sender_seq (32b)             │
 │   [447:416] dst_addr[31:0] (32b)         │  ← for direction matching
 │ [415:0]   payload (416b = 52B)           │
 └──────────────────────────────────────────┘
 Subsequent flits: full 60B payload (no metadata)
 Credit-only flit (128-bit, header-only):
 ┌──────────────────────────────────────────┐
 │ [127:96]  routing header (32b)           │
 │   [127]   credit_flag (1b)               │
 │ [95:64]   consumer_seq (32b)             │
 │ [63:0]    dst_rx_base_pa (64b)           │
 └──────────────────────────────────────────┘
 ```
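The bit layout of the first IPCQ data flit can be exercised with a pack/unpack sketch that treats the 512-bit flit as a Python integer. The function names, the big-endian payload byte order, and the zero padding of short payloads are assumptions made for the example; only the field positions come from the layout above.

```python
def pack_ipcq_first_flit(route: int, vc_id: int, sender_seq: int,
                         dst_addr: int, payload: bytes) -> int:
    """Build the 512-bit first IPCQ flit as an int (bit 511 is the MSB)."""
    if len(payload) > 52:
        raise ValueError("first IPCQ flit carries at most 52 payload bytes")
    # Routing header [511:480]: ipcq_flag | vc_id | route + hop count.
    header = (1 << 31) | ((vc_id & 0x3) << 29) | (route & ((1 << 29) - 1))
    # Metadata [479:416]: sender_seq in the upper 32 bits, dst_addr[31:0] below.
    meta = ((sender_seq & 0xFFFFFFFF) << 32) | (dst_addr & 0xFFFFFFFF)
    body = int.from_bytes(payload.ljust(52, b"\0"), "big")  # payload [415:0]
    return (header << 480) | (meta << 416) | body


def unpack_ipcq_first_flit(flit: int) -> dict:
    """Inverse of pack_ipcq_first_flit; payload comes back zero-padded to 52B."""
    header = flit >> 480
    return {
        "ipcq_flag": (header >> 31) & 1,
        "vc_id": (header >> 29) & 0x3,
        "route": header & ((1 << 29) - 1),
        "sender_seq": (flit >> 448) & 0xFFFFFFFF,
        "dst_addr_lo": (flit >> 416) & 0xFFFFFFFF,
        "payload": (flit & ((1 << 416) - 1)).to_bytes(52, "big"),
    }


flit = pack_ipcq_first_flit(route=0x1234, vc_id=2, sender_seq=7,
                            dst_addr=0xE01000, payload=b"hello")
fields = unpack_ipcq_first_flit(flit)
```

Only the low 32 bits of `dst_addr` travel in the flit, so direction matching on the receiver relies on those bits being unique across the rx rings of one PE.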
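 The first flit's payload shrinks from 60B to 52B (8B of metadata, about 13% of that flit). In a multi-flit
 transfer the subsequent flits carry a full payload, so the overhead drops below 1% once the transfer exceeds roughly 800B.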
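The overhead figures can be reproduced with a few lines of arithmetic. This is a sketch under the flit sizes stated in D19; the helper names are illustrative.

```python
import math

FLIT_PAYLOAD = 60          # bytes per ordinary 512-bit flit (480b payload)
FIRST_FLIT_PAYLOAD = 52    # the first IPCQ flit gives 8 bytes to metadata
METADATA_BYTES = FLIT_PAYLOAD - FIRST_FLIT_PAYLOAD


def ipcq_flit_count(nbytes: int) -> int:
    """Number of flits needed to carry nbytes of IPCQ payload."""
    if nbytes <= FIRST_FLIT_PAYLOAD:
        return 1
    return 1 + math.ceil((nbytes - FIRST_FLIT_PAYLOAD) / FLIT_PAYLOAD)


def metadata_overhead(nbytes: int) -> float:
    """Metadata bytes as a fraction of the payload bytes transferred."""
    if nbytes <= 0:
        raise ValueError("nbytes must be positive")
    return METADATA_BYTES / nbytes


one_flit = metadata_overhead(60)     # 8 / 60, the ~13% single-flit case
large = metadata_overhead(4096)      # 8 / 4096 for a 4 KiB transfer
```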
 ### D20. TCM IPCQ Slot Region Layout
 ```
 TCM Memory Map (16MB):
 ┌─────────────────────────────┐ 0x000000
 │  Kernel Working Memory      │
 │  (compute tensors)          │
 │  ~14MB                      │
 ├─────────────────────────────┤ 0xE00000
 │  IPCQ RX Buffers            │
 │  Dir N: slots × slot_size   │
 │  Dir S: slots × slot_size   │
 │  Dir E: slots × slot_size   │
 │  Dir W: slots × slot_size   │
 │  ~1MB                       │
 ├─────────────────────────────┤ 0xF00000
 │  IPCQ Metadata / Scratch    │
 │  ~1MB                       │
 └─────────────────────────────┘ 0xFFFFFF
 ```
 The IPCQ region is placed in the upper banks of TCM to minimize bank conflicts with compute
 accesses (see the TCM bank conflict risk in D22).
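One way to derive per-direction `rx_base_pa` values from this map is to pack the rings back to back inside the RX buffer region. The contiguous packing and the helper `rx_bases` are assumptions for illustration; the map above fixes only the region boundaries.

```python
IPCQ_RX_BASE = 0xE00000    # start of "IPCQ RX Buffers" in the map above
IPCQ_RX_LIMIT = 0xF00000   # start of "IPCQ Metadata / Scratch"


def rx_bases(directions, n_slots: int, slot_size: int) -> dict:
    """Assign each direction a contiguous ring inside the RX buffer region."""
    bases = {}
    cursor = IPCQ_RX_BASE
    for d in directions:
        bases[d] = cursor
        cursor += n_slots * slot_size
    if cursor > IPCQ_RX_LIMIT:
        raise ValueError(
            f"rings need {cursor - IPCQ_RX_BASE} bytes, "
            f"region holds {IPCQ_RX_LIMIT - IPCQ_RX_BASE}"
        )
    return bases


# Four directions, 8 slots of 4 KiB each: 128 KiB out of the 1 MiB region.
bases = rx_bases("NSEW", n_slots=8, slot_size=4096)
```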
 ### D21. 2nm Implementation Analysis
 #### Area Estimate
 | Module | Gate Count | Area (2nm est.) | Notes |
 |---|---|---|---|
 | QPair Register File | ~4.6K FF | 0.002 mm² | 576B flip-flops |
 | Slot Addr Gen + Backpressure | ~5K gates | 0.001 mm² | Combinational |
 | Meta Extractor + Credit Logic | ~3K gates | 0.001 mm² | 8× parallel comparators |
 | **IPCQ Controller subtotal** | **~12.6K** | **~0.004 mm²** | **< 0.1% of the whole PE** |
 | DMA vc_comm extension | ~2K gates | 0.002 mm² | Flit pack/unpack |
 | **Total delta** | **~14.6K** | **~0.006 mm²** | |
 #### Timing
 | Path | Delay (2nm est.) | Target Clock | Margin |
 |---|---|---|---|
 | Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× |
 | Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× |
 | Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× |
 | Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× |
 Every critical path fits within one cycle, so no timing-closure problem is expected.
 #### Power
 - Active: ~1 mW (register R/W + comparators, while send/recv is active)
 - Idle: leakage only
 - Negligible relative to total PE power
 #### Constraints
 | Item | Constraint | Rationale |
 |---|---|---|
 | `n_slots` | **must be a power of 2** | mod → AND mask (one gate level). Arbitrary values need a divider (~10 cycles) |
 | `slot_size` | **power of 2 recommended** | mul → barrel shift. Arbitrary values need a multiplier |
 | TCM IPCQ region | **dedicated bank placement** | Avoids bank conflicts with compute accesses |
 ### D22. Risk Assessment
 #### TCM Bank Conflict
 - **Risk**: stalls when an IPCQ slot write and a compute read hit the same bank
 - **Mitigation**: place the IPCQ region in dedicated banks at the upper TCM addresses (D20)
 - **Cost**: slightly reduced TCM banking flexibility
 - **Severity**: Medium for performance, Low for correctness (this is not a correctness issue)
 #### Credit Return Latency under Congestion
 - **Risk**: under NoC congestion, credit return is delayed → sender backpressure stall
 - **Mitigation**:
  - Put credits on a separate VC with strict priority (BW impact is minimal at 16B)
  - Or set n_slots generously (8+) so the buffer absorbs the credit delay
 - **Severity**: Low (16B credits contribute almost nothing to congestion)
 #### Inter-Direction Ordering
 - **Risk**: ordering when one PE sends in several directions at the same time
 - **Mitigation**: per-direction monotonic sequence numbers are sufficient. Inter-direction ordering is
  the kernel's (software's) responsibility, the same as in the current simulator model (D2 + D4)
 - **Severity**: Low (resolved by the architecture design)
 ### D23. HW Alternatives Considered
 #### Doorbell + Polling (traditional approach)
 ```
 Send: DMA write data → DMA write doorbell register at peer → peer polls doorbell
 Recv: Polling loop on doorbell, or interrupt-driven
 ```
 | Pros | Cons |
 |---|---|
 | Simple HW (no IPCQ controller needed) | Two DMA transactions (data + doorbell) |
 | Reuses the existing DMA | Ordering between data and doorbell must be enforced (fence) |
 | | Polling wastes power; interrupts add latency overhead |
 **Assessment**: 2-3× higher latency than piggyback. **Rejected.**
 #### Hardware Message Queue (NVIDIA NVLink style)
 ```
 Send: CPU pushes a descriptor into the HMQ → HW forwards it to the peer HMQ automatically
 Recv: pop a descriptor from the HMQ → read the data pointer
 ```
 | Pros | Cons |
 |---|---|
 | CPU only writes descriptors | Requires a separate HMQ engine (~0.05 mm²) |
 | Descriptor/data separation → flexible | Datapath separate from DMA → duplicated area/power |
 | | Large tensors still need DMA in the end |
 **Assessment**: DMA is mandatory for CCL's large-tensor pattern, so a dual HMQ + DMA structure
 wastes area. **Rejected.**
 #### RDMA-style Completion Queue (CQ)
 ```
 Send: DMA write → a CQE is generated automatically at the peer
 Recv: CQ poll/interrupt → locate the data
 ```
 | Pros | Cons |
 |---|---|
 | Mature InfiniBand/RoCE model | CQ management logic + CQE memory overhead |
 | Easy multi-tenancy/isolation | CQE/data ordering must additionally be guaranteed |
 | | Over-engineered for PE-to-PE CCL |
 **Assessment**: RDMA CQs suit multi-tenant isolation on host-facing NICs.
 In a single-owner PE-to-PE environment they add unnecessary complexity. **Rejected.**
 #### Credit-in-Data Piggyback (v2 optimization candidate)
 In the current design, credit return is a separate 16B packet. In bidirectional communication
 patterns, **the credit can be merged into a data flit travelling in the reverse direction.**
 ```
 PE_A →E→ PE_B: data + sender_seq=3
 PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4  ← credit merged into data
 ```
 | Pros | Cons |
 |---|---|
 | Removes dedicated credit packets → saves NoC BW | Unidirectional patterns need a fallback |
 | Credit latency → 0 in bidirectional allreduce | Adds 8B to the flit header (minor overhead) |
 | | Slightly higher logic complexity |
 **Assessment**: a strong optimization of the current design. Credit packets can be eliminated
 entirely in bidirectional allreduce, while the standalone credit fallback is retained. **Recommended for adoption in v2.**
 ### Open HW Questions
 - What percentage of TCM may the IPCQ slot region occupy? (current assumption: ~1MB / 16MB = 6.25%)
 - Should credits get a separate VC, or be multiplexed onto vc_comm? (see D18)
 - Flit format compatibility on inter-SIP links needs to be verified
 - Should n_slots have a maximum? (8 directions × 8 slots × 64KB = 4MB → 25% of TCM)
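The sizing questions above reduce to simple arithmetic, sketched here. The helper names and the idea of deriving a maximum power-of-two slot size from a budget are illustrative assumptions, not a decided policy.

```python
TCM_BYTES = 16 * 1024 * 1024


def ipcq_region_fraction(n_dirs: int, n_slots: int, slot_size: int) -> float:
    """Fraction of TCM consumed by all rx rings."""
    return n_dirs * n_slots * slot_size / TCM_BYTES


def max_slot_size(n_dirs: int, n_slots: int, budget_fraction: float) -> int:
    """Largest power-of-two slot size that keeps the rings within the budget."""
    limit = int(TCM_BYTES * budget_fraction) // (n_dirs * n_slots)
    return 1 << (limit.bit_length() - 1) if limit > 0 else 0


current_assumption = (1 * 1024 * 1024) / TCM_BYTES     # ~1MB region
worst_case = ipcq_region_fraction(8, 8, 64 * 1024)     # 8 dirs x 8 slots x 64KB
fits_in_budget = max_slot_size(8, 8, current_assumption)
```

Under the 6.25% assumption, 8 directions with 8 slots each leave room for 16 KiB slots, which is one concrete way to frame the trade-off between ring depth and slot size.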
 ---
@@ -0,0 +1,206 @@
 # ADR-0024: SIP-level Launcher — rank = SIP
 ## Status
 Accepted
 ## Context
 ### Goal
 Align the participation unit (rank) of `torch.distributed` collective calls with the **SIP** (device)
 boundary. The aim is bench code that reads **indistinguishably at the host level** from a real
 PyTorch DDP/TP script.
 Comparison with real PyTorch:
 | Dimension | real PyTorch | KernBench |
 | --- | --- | --- |
 | Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
 | `get_rank()` | `RANK` env var | greenlet-local registry |
 | `get_world_size()` | `WORLD_SIZE` env var | number of SIPs in the topology |
 | `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
 | `mp.spawn` | spawns OS processes | greenlet fan-out |
 ### Problems to solve
 1. **rank = SIP in the public API**: bench workers should not need to know about PEs.
 2. **Greenlet-local rank/device tracking**: within the single-process model, each
   worker greenlet must identify its own rank and its own SIP correctly.
 3. **Tensor placement = structural (sip, cube, pe)**: if rank is a SIP, default
   tensor placement must also be expressed in structural coordinates.
 ### Non-problems (outside this ADR)
 - IPCQ direction addressing → ADR-0025
 - Removal of `DPPolicy.sip`/`num_sips` → ADR-0026
 - Megatron-style TP → ADR-0027
 - DTensor → ADR-0028 (future)
 - Worker scheduling / `mp.spawn` / collective drain / exception cleanup
  → ADR-0027 D0/D1
 - Collective algorithm implementation (intercube_allreduce, SFR config) → ADR-0032
 ## Decision
 ### D1. rank = SIP (world_size 해석)
 ```python
 def _resolve_world_size(self) -> int:
    if "world_size" in self._merged:
        return int(self._merged["world_size"])
    defaults = self._cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    spec = self.ctx.spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
 ```
 Precedence: algorithm override > defaults override > SIP count. The `ccl.yaml`
 override is kept as the legacy "rank = PE" test path.
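The precedence can be seen in isolation with a free-function version of the same lookup. `resolve_world_size` and its three dictionary arguments are stand-ins for `self._merged`, `self._cfg_all`, and `self.ctx.spec`; the sample values are made up.

```python
from __future__ import annotations


def resolve_world_size(merged: dict, cfg_all: dict, spec: dict | None) -> int:
    """Same precedence as _resolve_world_size, written as a free function."""
    if "world_size" in merged:                      # 1. algorithm override
        return int(merged["world_size"])
    defaults = cfg_all.get("defaults", {})
    if "world_size" in defaults:                    # 2. defaults override
        return int(defaults["world_size"])
    spec = spec or {}                               # 3. SIP count, default 1
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))


spec = {"system": {"sips": {"count": 4}}}
by_algorithm = resolve_world_size({"world_size": 16}, {"defaults": {"world_size": 8}}, spec)
by_defaults = resolve_world_size({}, {"defaults": {"world_size": 8}}, spec)
by_sip_count = resolve_world_size({}, {}, spec)
no_spec = resolve_world_size({}, {}, None)
```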
 ### D2. Greenlet-local rank registry (+ debug warning)
 ```python
 import os
 import warnings
 class DistributedContext:
    def __init__(self):
        self._backend = None
        self._rank_by_greenlet: dict = {}
    def _bind_rank(self, g, rank: int) -> None:
        self._rank_by_greenlet[g] = int(rank)
    def get_rank(self) -> int:
        self._ensure_initialized()
        from greenlet import getcurrent
        g = getcurrent()
        if g not in self._rank_by_greenlet:
            if os.environ.get("KERNBENCH_DEBUG"):
                warnings.warn(
                    "get_rank() called outside a bound greenlet — returning 0. "
                    "Likely a bug unless running single-driver."
                )
            return 0
        return int(self._rank_by_greenlet[g])
 ```
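The lookup-with-fallback behavior can be tried without the greenlet dependency by swapping in threads as the task identity. This is a stand-in sketch only: `RankRegistry` is a hypothetical class, and `threading.current_thread` replaces `greenlet.getcurrent` purely so the example runs on the standard library.

```python
import threading


class RankRegistry:
    """Task-local rank lookup; `current` stands in for greenlet.getcurrent."""

    def __init__(self, current=threading.current_thread):
        self._current = current
        self._rank_by_task = {}

    def bind_rank(self, task, rank: int) -> None:
        self._rank_by_task[task] = int(rank)

    def get_rank(self) -> int:
        # Unbound callers fall back to rank 0, as in the single-driver case.
        return self._rank_by_task.get(self._current(), 0)


registry = RankRegistry()
seen = {}


def worker(expected_rank: int) -> None:
    seen[expected_rank] = registry.get_rank()


threads = [threading.Thread(target=worker, args=(r,)) for r in range(4)]
for rank, t in enumerate(threads):
    registry.bind_rank(t, rank)      # bind before the worker starts
for t in threads:
    t.start()
for t in threads:
    t.join()

main_rank = registry.get_rank()      # the main thread was never bound
```

Each worker observes its own rank even though all of them share one registry object, which is the property the greenlet-keyed dictionary provides inside a single process.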
 ### D3. `torch.ahbm.set_device(rank)`: SIP binding
 The KernBench backend is named `ahbm` (ADR-0023). Real PyTorch uses
 `torch.cuda.set_device(r)`, but since KernBench is not CUDA it uses an honestly named
 namespace.
 ```python
 class _AhbmNamespace:
    """torch.ahbm — per-greenlet SIP device binding.
    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
    """
    def __init__(self):
        self._device_by_greenlet: dict = {}
    def set_device(self, device: int) -> None:
        from greenlet import getcurrent
        self._device_by_greenlet[getcurrent()] = int(device)
    def current_device(self) -> int | None:
        from greenlet import getcurrent
        return self._device_by_greenlet.get(getcurrent())
 # Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
 # Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
 ```
 **PyTorch 2.x style supported in parallel**: recent PyTorch is moving toward the device-agnostic
 `torch.accelerator` namespace (`torch.accelerator.set_device_index(r)`,
 `torch.accelerator.current_device_index()`). KernBench supports this surface as well, for users
 who want to write code that is not tied to a device vendor.
 ```python
 class _AcceleratorNamespace:
    """torch.accelerator — device-agnostic API (PyTorch 2.x style).
    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
        torch.accelerator.set_device_index(rank)
        torch.accelerator.current_device_index()
    """
    def __init__(self, ahbm: _AhbmNamespace):
        self._ahbm = ahbm
    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)
    def current_device_index(self) -> int | None:
        return self._ahbm.current_device()
 # RuntimeContext
 self.ahbm = _AhbmNamespace()
 self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias
 ```
 Bench authors pick one of the following; both share the same underlying registry:
 ```python
 torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
 torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
 ```
 ### D4. Tensor placement = structural (sip, cube, pe) coordinates
 `resolve_dp_policy` takes `target_sip` directly and produces the placement in structural coordinates.
 See ADR-0026 for details.
 ```python
 # RuntimeContext._create_tensor
 current_sip = self.ahbm.current_device()          # (D3 naming)
 if current_sip is None:
    current_sip = 0  # single-driver fallback (consistent with D2)
 placement = resolve_dp_policy(
    dp, shape=shape_2d, itemsize=itemsize,
    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
    target_sip=current_sip,
 )
 ```
 No post-hoc `pe_index` shifting: ShardSpec holds the `(sip, cube, pe)` structural
 coordinates directly. See ADR-0026 for ShardSpec details.
 ---
 ## Dependencies
 - **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
 - **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature used in D4 and
  ShardSpec's structural coordinate representation.
 - **ADR-0027** (Megatron TP + scheduler): implementation reference for worker scheduling, `mp.spawn`,
  collective drain, and exception cleanup.
 ---
 ## Non-goals
 - **Changing the IPCQ protocol**: ADR-0023 stands.
 - **Cleaning up DPPolicy fields**: ADR-0026.
 - **Megatron-style TP**: ADR-0027.
 - **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
 - **Collective algorithm implementation**: ADR-0032.
 - **Multi-node (cross-process)**: single process only.
 ---
 ## Consequences
 ### Positive
 - **Bench = real PyTorch DDP** (from the public API point of view).
 - **Greenlet-local rank**: makes cross-rank correctness possible in the single-process model.
 - **Structural placement coordinates**: other ADRs such as ADR-0026, ADR-0027, and ADR-0032
  operate consistently on the `(sip, cube, pe)` 3-tuple.
 ### Neutral
 - IPCQ PE-level protocol (ADR-0023) unchanged.
 - IO_CPU role unchanged (existing transit behavior as is).
@@ -1,868 +0,0 @@
 # ADR-0024: SIP-level TP Launcher — rank = SIP (host-driven dispatch)
 ## Status
 Accepted. rank = SIP process-group model stands. The allreduce algorithm
 path (mapper / validator / per-PE install machinery originally targeted at
 ADR-0029) has been replaced by ADR-0032: `AhbmCCLBackend` now calls
 `configure_sfr_intercube_multisip` at `init_process_group` time and the
 intercube kernel receives `(sip_rank, sip_topo_kind, sip_topo_w,
 sip_topo_h)` appended after the module's `kernel_args()`. The
 `leader_only` / `all_pes` mapper concepts in this document are no longer
 used by the default allreduce path.
 ## Context
 ### Goal
 Align the participation unit (rank) of `torch.distributed` collective calls with the **SIP** (device)
 boundary. The aim is bench code that reads **indistinguishably at the host level** from a real
 PyTorch DDP/TP script.
 Comparison with real PyTorch:
 | Dimension | real PyTorch | KernBench (after this ADR) |
 |---|---|---|
 | Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
 | `get_rank()` | `RANK` env var | greenlet-local registry |
 | `get_world_size()` | `WORLD_SIZE` env var | number of SIPs in the topology |
 | `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
 | `mp.spawn` | spawns OS processes | greenlet fan-out |
 ### Design principle: abstraction in the public API, existing paths inside
 **Abstraction at the public API (bench worker) level**:
 ```
 rank = SIP
 DPPolicy = intra-device (cube × PE) distribution only
 PyTorch-style surface: dist.all_reduce, torch.ahbm.set_device, mp.spawn, etc.
 ```
 **Framework-internal implementation**:
 ```
 build_install_plans (host): topology + mapper + algorithm → SipInstallPlan
  ↓
 backend (host): dispatches each per-PE spec in the plan as an IpcqInitMsg via engine.submit
  ↓
 engine: existing PE-scoped routing (same path as MmuMapMsg etc.)
  ↓
 PE_IPCQ: handles IpcqInitMsg in its own message loop (existing capability)
 ```
 **Key point**: no new message types and no IO_CPU extension. The existing engine routing and the existing
 `IpcqInitMsg` type are used as is. Only the old "sideband direct call" bypass is
 removed, which unifies the convention.
 ### Problems to solve
 1. **rank = SIP in the public API**: bench workers should not need to know about PEs.
 2. **Multi-worker execution**: N ranks run independent worker code. Under the single-process
   constraint this uses greenlets + barrier synchronization.
 3. **Cross-rank collective submit synchronization**: if the first rank waits alone, the missing peers cause a
   SimPy deadlock. Drain must happen only after every rank has submitted.
 4. **Remove the existing sideband install**: unify IpcqInitMsg onto engine.submit,
   the same pattern as other control-plane messages such as MmuMapMsg.
 5. **Separate algorithm / mapper / validator**: the algorithm module contains only kernel
   code; topology / mapping / validation come from registries + declarations.
 ### Non-problems (outside this ADR)
 - IPCQ direction addressing fix → **ADR-0025**
 - Removal of `DPPolicy.sip`/`num_sips` → **ADR-0026**
 - Megatron-style TP → **ADR-0027**
 - DTensor → **ADR-0028 (future)**
 - **Promoting IO_CPU to the single SIP-level control-plane endpoint**: this ADR does not
  adopt it as an invariant. KernBench currently has no such principle, and the justification for
  introducing it on its own is weak. If a need for more precise control-plane latency modeling
  arises later, it gets a separate ADR.
 ## Decision
 ### D1. rank = SIP (world_size 해석)
 ```python
 def _resolve_world_size(self) -> int:
    if "world_size" in self._merged:
        return int(self._merged["world_size"])
    defaults = self._cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    spec = self.ctx.spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
 ```
 Precedence: algorithm override > defaults override > SIP count. The `ccl.yaml`
 override is kept as the legacy "rank = PE" test path.
 ### D2. Install path: unified on engine.submit
 The sideband direct call in `ccl/install.py` is removed and `IpcqInitMsg` is sent through
 `engine.submit`. MmuMapMsg / MemoryWriteMsg and others already follow the same pattern.
 ```python
 # Backend (in AhbmCCLBackend.__init__ or at init_process_group time)
 from kernbench.ccl.install_plan import build_install_plans
 plans = build_install_plans(
    world_size=self._world_size,
    algorithm=self._merged["algorithm"],
    algorithm_config=self._merged,
    spec=self.ctx.spec,
 )
 self._plans = plans
 # Submit through the engine so that each PE_IPCQ receives its own neighbor table
 handles = []
 for plan in plans:
    for pe_install in plan.pe_installs:
        h = self.ctx.submit(IpcqInitMsg(
            correlation_id=self.ctx.correlation_id,
            request_id=f"ipcq_init_s{plan.sip}c{pe_install.cube}p{pe_install.pe}",
            target_sips=(plan.sip,),
            target_cubes=(pe_install.cube,),
            target_pe=pe_install.pe,
            entries=pe_install.neighbors,
            buffer_kind=plan.buffer_kind,
            n_slots=plan.n_slots,
            slot_size=plan.slot_size,
            # ... (existing IpcqInitMsg fields)
        ))
        handles.append(h)
 # Eager install: guaranteed complete before init_process_group returns
 for h in handles:
    self.ctx.wait(h)
 ```
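 The **PE_IPCQ component** already handles `IpcqInitMsg` in its main loop (`pe_ipcq.py`
 lines 145-147), so no change is needed there. The only difference is that the message arrives
 through the engine queue rather than through a sideband Python call.
 **Correctness invariant (equivalence)**: `init_process_group()` returns only after it has
 called `wait()` on every install handle, so a launch-before-install problem is structurally
 impossible. Only one correctness question remains:
 > Does engine-routed `IpcqInitMsg` handling produce **the same final PE_IPCQ state** as the
 > existing sideband `pe_ipcq._install_neighbors(msg)` call?
 Verification points (see T3):
 1. **State equivalence**: the state transitions inside `_install_neighbors()` happen identically
   on the engine dispatch path, so the final PE_IPCQ state
   (`_queue_pairs`, `_installed`, `_credit_inbox`, etc.) matches.
 2. **No sideband-only side effects**: there are no side effects that existed only on the sideband
   path (for example, the request_id / correlation tracking that engine.submit sets up must not
   distort install semantics).
 3. **Ordering independence**: the final state is the same even if install messages for different PEs
   are processed in arbitrary order in the engine queue. Install must therefore be an **independent
   per-PE operation** with no cross-PE ordering dependency.
 4. **Idempotency**: what happens if `IpcqInitMsg` arrives twice for the same PE? The current
   design assumes "exactly one install per PE", and behavior on a duplicate install is
   undefined. Conservative policy:
   - On first install, transition to `_installed = True`
   - A later duplicate install message must be specified as either an **error** (raise) or
     **silently idempotent** (no-op)
   - Recommendation: **raise** (explicit error → early bug detection). Add a duplicate-install
     case to T3.
 5. **Partial install visibility**: is the intermediate state, where only some PEs have finished
   installing, observable from outside? In the current structure the eager wait-all in
   `init_process_group()` acts as a barrier, so partial state is never exposed to bench
   code. Debugging / introspection APIs can still see the intermediate state (not a
   problem; it only needs documenting).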
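Verification points 3 and 4 can be prototyped as a small test harness before touching the real component. `PeIpcqState` and `install_all` are hypothetical stand-ins, not the actual `pe_ipcq.py` state; the sketch assumes the recommended raise-on-duplicate policy.

```python
from itertools import permutations


class PeIpcqState:
    """Hypothetical stand-in for the install-related state of one PE_IPCQ."""

    def __init__(self):
        self.installed = False
        self.neighbors = ()

    def install(self, neighbors):
        if self.installed:
            raise RuntimeError("duplicate IpcqInitMsg for an installed PE")
        self.neighbors = tuple(neighbors)
        self.installed = True


def install_all(messages):
    """Apply (pe, neighbors) messages in the given order; return final state."""
    states = {}
    for pe, neighbors in messages:
        states.setdefault(pe, PeIpcqState()).install(neighbors)
    return {pe: s.neighbors for pe, s in states.items()}


msgs = [((0, 0), ("E", "W")), ((0, 1), ("E",)), ((1, 0), ("W",))]
# Ordering independence: every delivery order yields the same final state.
final_states = {tuple(sorted(install_all(order).items())) for order in permutations(msgs)}

# Idempotency policy: a second install for the same PE is rejected.
try:
    install_all(msgs + [msgs[0]])
    duplicate_rejected = False
except RuntimeError:
    duplicate_rejected = True
```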
 **Timing impact**: engine-routed install makes `init_process_group()` consume SimPy time.
 The existing sideband install was effectively zero-cost. ADR contract:
 > Benchmarks must not rely on zero-cost initialization.
 > `init_process_group()` consumes simulated time proportional to the number
 > of participating PEs × per-PE install latency. First collective call
 > starts at a well-defined but non-zero sim time.
 ### D3. Launch path: same primitive as non-CCL kernels
 **CCL kernels use the same `KernelLaunchMsg` submission path as non-CCL kernels.**
 Things like IO_CPU/M_CPU transit inside the engine are **existing implementation details, not
 CCL-specific mechanisms**. The backend simply iterates over the plan's `participating_pes` and submits a
 `KernelLaunchMsg` for each. No new message type, no new routing path.
 ```python
 # AhbmCCLBackend.all_reduce
 def all_reduce(self, tensor, op="sum"):
    if op != "sum":
        raise NotImplementedError(...)
    if tensor._handle is None or not tensor._handle.shards:
        raise RuntimeError(...)
    # Validator: operates on the global handle (D8)
    validator_name = self._merged.get("validator")
    if validator_name:
        resolve_validator(validator_name)(tensor._handle, self._world_size, self.ctx.spec)
    rank = self.ctx.distributed.get_rank()
    plan = self._plans[rank]
    tensor_view = _tensor_slice_for_sip(tensor._handle, plan.sip)
    # Compute kernel args from the plan (host-side)
    import importlib
    mod = importlib.import_module(plan.kernel_module)
    n_elem = tensor_view.shards[0].nbytes // tensor.itemsize
    kargs = mod.kernel_args(n_elem=n_elem, world_size=plan.world_size,
                             **plan.kernel_config)
    def _submit():
        out = []
        for (cube, pe) in plan.participating_pes:
            h = self.ctx.submit(KernelLaunchMsg(
                correlation_id=self.ctx.correlation_id,
                request_id=f"allreduce_r{rank}_c{cube}p{pe}",
                kernel_ref=KernelRef(name=plan.algorithm_name, kind="builtin"),
                args=(_tensor_arg_for_pe(tensor_view, cube, pe), *kargs),
                target_sips=(plan.sip,),
                target_cubes=(cube,),
                target_pe=pe,
            ))
            out.append(h)
        return out
    self._barrier.submit_and_drain(self.ctx, rank, _submit)
 ```
 ### D4. Algorithm ABI: thin, with an explicit arg contract
 Each algorithm module requires **only kernel + kernel_args**.
 ```python
 # src/kernbench/ccl/algorithms/ring_allreduce.py
 def kernel(t_ptr, n_elem, world_size, tl):
    """PE-side kernel code.
    Signature convention: first positional arg is the tensor pointer
    (per-PE slice), subsequent positional args are whatever
    kernel_args() returns. `tl` is injected by the TLContext runtime.
    """
 def kernel_args(*, n_elem: int, world_size: int, **kw) -> tuple:
    """Return the tuple of non-tensor positional args.
    Signature contract:
    - Called keyword-only with n_elem and world_size plus kernel_config.
    - Returns a tuple (possibly empty) of scalar / metadata args.
    - The backend constructs the final KernelLaunchMsg.args as:
          (per_pe_tensor_arg, *kernel_args(...))
      where per_pe_tensor_arg is a TensorArg containing only the shards
      local to the receiving PE (derived from tensor_view).
    """
    return (n_elem, world_size)
 ```
 **Arg assembly in backend (reference)**:
 ```python
 # AhbmCCLBackend.all_reduce (excerpt from D3)
 kargs = mod.kernel_args(n_elem=n_elem, world_size=plan.world_size,
                         **plan.kernel_config)
 for (cube, pe) in plan.participating_pes:
    pe_tensor_arg = _tensor_arg_for_pe(tensor_view, cube, pe)
    self.ctx.submit(KernelLaunchMsg(
        args=(pe_tensor_arg, *kargs),       # tensor first, then kernel_args return
        target_sips=(plan.sip,),
        target_cubes=(cube,),
        target_pe=pe,
        # ... (remaining KernelLaunchMsg fields elided)
    ))
 ```
 Declarative metadata in **ccl.yaml**:
 ```yaml
 algorithms:
  ring_allreduce_tcm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d             # kernbench/ccl/topologies.py
    mapper: leader_only           # kernbench/ccl/mappers.py (new)
    validator: single_shard_per_rank   # kernbench/ccl/validators.py (new)
    buffer_kind: tcm
    n_elem: 8
 ```
 - `topology` (required)
 - `mapper` (optional, default `"leader_only"`)
 - `validator` (optional)
 The algorithm module itself does **not** contain a mapper, validator, participating_pes, or neighbor
 generator.
 ### D5. Mapper + validator: registry key **or** import path
 The host-side framework provides a built-in registry. Custom extensions use a dotted import path.
 ```python
 # src/kernbench/ccl/mappers.py (new)
 from typing import Callable
 Mapper = Callable[[dict, int], list[tuple[int, int]]]
 def leader_only(spec, rank):
    """Single leader PE per SIP. For ring/tree/mesh."""
    return [(0, 0)]
 def all_pes(spec, rank):
    """Every PE in the SIP. Used when an algorithm involves all PEs inside a SIP
    (e.g. intra-SIP reduction, intra-SIP broadcast, the lower level of a
    hierarchical collective)."""
    cm = spec["sip"]["cube_mesh"]
    pl = spec["cube"]["pe_layout"]
    n_cubes = cm["w"] * cm["h"]
    n_pes = pl["pe_per_corner"] * len(pl["corners"])
    return [(c, p) for c in range(n_cubes) for p in range(n_pes)]
 MAPPER_REGISTRY = {"leader_only": leader_only, "all_pes": all_pes}
 def resolve_mapper(key_or_path: str) -> Mapper:
    if key_or_path in MAPPER_REGISTRY:
        return MAPPER_REGISTRY[key_or_path]
    if "." in key_or_path:
        import importlib
        mod_path, fn_name = key_or_path.rsplit(".", 1)
        return getattr(importlib.import_module(mod_path), fn_name)
    raise ValueError(f"unknown mapper: {key_or_path!r}")
 ```
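To see what a mapper returns in practice, the `all_pes` function can be run against a small spec. The spec values below (2x2 cube mesh, four corners with two PEs each) are made up for the example, and the function body is repeated only so the snippet runs on its own.

```python
def all_pes(spec, rank):
    """Copy of the all_pes mapper above, repeated so the example runs alone."""
    cm = spec["sip"]["cube_mesh"]
    pl = spec["cube"]["pe_layout"]
    n_cubes = cm["w"] * cm["h"]
    n_pes = pl["pe_per_corner"] * len(pl["corners"])
    return [(c, p) for c in range(n_cubes) for p in range(n_pes)]


# Hypothetical spec: 2x2 cube mesh, 4 corners with 2 PEs each.
spec = {
    "sip": {"cube_mesh": {"w": 2, "h": 2}},
    "cube": {"pe_layout": {"pe_per_corner": 2, "corners": ["NW", "NE", "SW", "SE"]}},
}
pes = all_pes(spec, rank=0)   # list of (cube, pe) pairs for this SIP
```

The result is independent of `rank` here because every SIP is assumed to have the same shape; `leader_only` would return just `[(0, 0)]` for the same spec.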
 Validators follow the same pattern (`src/kernbench/ccl/validators.py`). Their input is the **global
 TensorHandle** (see D8).
 ### D6. Host-side install plan builder
 ```python
 # src/kernbench/ccl/install_plan.py (new; restructured from the existing install.py)
 from dataclasses import dataclass
 from typing import Any, Mapping
@dataclass(frozen=True)
 class NeighborTableEntry:
    direction: str
    peer_direction: str       # ADR-0025
    peer_sip: int
    peer_cube: int
    peer_pe: int
    rx_base_pa: int
    # ... other IPCQ settings ...
@dataclass(frozen=True)
 class PeInstallSpec:
    cube: int
    pe: int
    neighbors: tuple[NeighborTableEntry, ...]
@dataclass(frozen=True)
 class SipInstallPlan:
    algorithm_name: str                  # human-readable ("ring_allreduce_tcm")
    sip: int
    rank: int
    world_size: int
    pe_installs: tuple[PeInstallSpec, ...]     # per-PE neighbor tables
    buffer_kind: str
    n_slots: int
    slot_size: int
    kernel_module: str
    participating_pes: tuple[tuple[int, int], ...]
    kernel_config: Mapping[str, Any]
 def build_install_plans(
    world_size: int,
    algorithm: str,
    algorithm_config: dict,
    spec: dict,
 ) -> list[SipInstallPlan]:
    """Compose topology + mapper + algorithm into per-SIP plan list."""
    topo_fn = _resolve_topology(algorithm_config["topology"])
    mapper = resolve_mapper(algorithm_config.get("mapper", "leader_only"))
    # kernel_config: algorithm-specific params passed to kernel_args at launch
    kernel_config = {
        k: v for k, v in algorithm_config.items()
        if k in {"n_elem", "reduce_op", "chunk_size"} or k.startswith("kernel_")
    }
    plans = []
    for rank in range(world_size):
        sip = rank  # identity mapping (non-identity mapping is an open question)
        pes = mapper(spec, rank)
        pe_installs = _build_pe_installs(
            rank=rank, world_size=world_size, sip=sip,
            pes=pes, topo_fn=topo_fn, algorithm_config=algorithm_config, spec=spec,
        )
        plans.append(SipInstallPlan(
            algorithm_name=algorithm,
            sip=sip, rank=rank, world_size=world_size,
            pe_installs=pe_installs,
            buffer_kind=algorithm_config["buffer_kind"],
            n_slots=algorithm_config["n_slots"],
            slot_size=algorithm_config["slot_size"],
            kernel_module=algorithm_config["module"],
            participating_pes=tuple(pes),
            kernel_config=kernel_config,
        ))
    return plans
 ```
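 `_build_pe_installs` reuses the neighbor-computation logic of the existing `ccl/install.py`
 (including the `reverse_direction` fix from ADR-0025).
 **Multi-PE mappers and responsibility for neighbor generation**: when a mapper returns several PEs
 within a SIP (`all_pes`, etc.), the PE-level neighbor graph is formed inside `_build_pe_installs`.
 That is, the topology module provides only rank-level relations, and the builder resolves the
 PE-level connectivity. For algorithms that use complex multi-level patterns, this split of
 responsibility can become a maintenance burden; see ADR-0029 for the related discussion.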
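As an illustration of the mapper contract used by `build_install_plans` (`mapper(spec, rank)` returning the `(cube, pe)` pairs that participate), here is a minimal sketch. The function bodies are assumptions for illustration, not the contents of `mappers.py`: only the names `leader_only` and `all_pes` and the `spec` keys (taken from the `multi_pe_sip_local` validator in D8) come from this ADR, and `EXAMPLE_SPEC` with its corner names is hypothetical.

```python
from typing import Any, Mapping

# Hypothetical spec fragment; only the key names follow the ADR.
EXAMPLE_SPEC = {
    "sip": {"cube_mesh": {"w": 2, "h": 1}},
    "cube": {"pe_layout": {"pe_per_corner": 2, "corners": ["nw", "ne"]}},
}


def _cubes_per_sip(spec: Mapping[str, Any]) -> int:
    cm = spec["sip"]["cube_mesh"]
    return cm["w"] * cm["h"]


def _pes_per_cube(spec: Mapping[str, Any]) -> int:
    pl = spec["cube"]["pe_layout"]
    return pl["pe_per_corner"] * len(pl["corners"])


def leader_only(spec: Mapping[str, Any], rank: int) -> list[tuple[int, int]]:
    """Assumed behavior: a single PE (cube 0, PE 0) represents the rank."""
    return [(0, 0)]


def all_pes(spec: Mapping[str, Any], rank: int) -> list[tuple[int, int]]:
    """Assumed behavior: every (cube, pe) pair of the SIP participates."""
    return [
        (cube, pe)
        for cube in range(_cubes_per_sip(spec))
        for pe in range(_pes_per_cube(spec))
    ]
```

With this shape the builder receives one `(cube, pe)` pair per participating PE and remains the only place that turns those pairs into PE-level neighbor tables.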
 ### D7. Epoch-based collective barrier
 Cross-rank submit synchronization. Each collective call is an independent epoch. A duplicate
 join by the same rank raises an error immediately.
 ```python
 # src/kernbench/runtime_api/distributed.py
 from dataclasses import dataclass, field
@dataclass
 class _EpochState:
    participants: set[int] = field(default_factory=set)
    pending: list = field(default_factory=list)
    drained: bool = False
    returned: int = 0
 class _CollectiveBarrier:
    """Epoch-based barrier.
    Contract:
    - Each call joins the earliest non-drained epoch.
    - Each rank may join a given epoch at most once. Duplicate join raises.
    - Last arriver (participants == world_size) performs drain and advances
      _next_epoch. Earlier arrivers yield and re-check drained on resume.
    - Epoch state is GC'd when returned == world_size (success path).
    - On failure paths, residual state is acceptable; reset() clears it.
    """
    def __init__(self, world_size: int):
        self._world_size = world_size
        self._next_epoch = 0
        self._state: dict[int, _EpochState] = {}
    def submit_and_drain(self, ctx, rank: int, submit_fn) -> None:
        epoch = self._next_epoch
        state = self._state.setdefault(epoch, _EpochState())
        if rank in state.participants:
            raise RuntimeError(
                f"rank {rank} attempted duplicate join to epoch {epoch}"
            )
        state.participants.add(rank)
        handles = submit_fn()
        state.pending.extend(handles)
        is_last = len(state.participants) >= self._world_size
        if is_last:
            for h in state.pending:
                ctx.wait(h)
            state.drained = True
            self._next_epoch = epoch + 1
        else:
            from greenlet import getcurrent
            g = getcurrent()
            if g.parent is None:
                raise RuntimeError("barrier requires a bound worker greenlet")
            while not state.drained:
                g.parent.switch()
        state.returned += 1
        if state.returned >= self._world_size:
            self._state.pop(epoch, None)
    def reset(self) -> None:
        """Explicit cleanup on spawn exception unwinding."""
        self._state.clear()
        self._next_epoch = 0
 ```
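The contract can be exercised without greenlets. The sketch below is a simplified stand-in, not the `_CollectiveBarrier` above: it swaps greenlets for plain Python generators, omits handle draining and the `returned` GC counter, and the names `ToyEpochBarrier`, `worker`, and `run` are hypothetical. It only shows the epoch bookkeeping: the last arriver marks the epoch drained and advances `next_epoch`, earlier arrivers yield until they observe `drained`, and a duplicate join raises.

```python
from dataclasses import dataclass, field


@dataclass
class EpochState:
    participants: set = field(default_factory=set)
    drained: bool = False


class ToyEpochBarrier:
    """Generator-based toy model of the epoch bookkeeping (no real handles)."""

    def __init__(self, world_size):
        self.world_size = world_size
        self.next_epoch = 0
        self.state = {}
        self.drain_log = []  # (epoch, rank that performed the drain)

    def join(self, rank):
        epoch = self.next_epoch
        st = self.state.setdefault(epoch, EpochState())
        if rank in st.participants:
            raise RuntimeError(
                f"rank {rank} attempted duplicate join to epoch {epoch}"
            )
        st.participants.add(rank)
        if len(st.participants) >= self.world_size:
            # Last arriver: mark the epoch drained and open the next one.
            st.drained = True
            self.next_epoch = epoch + 1
            self.drain_log.append((epoch, rank))
        else:
            # Earlier arriver: hand control back until the epoch is drained.
            while not st.drained:
                yield


def worker(barrier, rank, n_collectives, finished):
    for _ in range(n_collectives):
        yield from barrier.join(rank)
    finished.append(rank)


def run(world_size=2, n_collectives=2):
    barrier = ToyEpochBarrier(world_size)
    finished = []
    alive = [worker(barrier, r, n_collectives, finished) for r in range(world_size)]
    while alive:  # round-robin in rank order
        for w in list(alive):
            try:
                next(w)
            except StopIteration:
                alive.remove(w)
    return barrier, finished
```

In this toy run, whichever rank arrives last in a given epoch performs the drain, so the draining rank can differ from epoch to epoch while the epoch counter still advances exactly once per collective.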
 ### D8. Per-rank tensor view + validator contract
 **Validator** (host-side, pre-slice, operates on the global handle):
 ```python
 # src/kernbench/ccl/validators.py
 from typing import Callable
 Validator = Callable[[TensorHandle, int, dict], None]
 def single_shard_per_rank(handle, world_size, spec):
    """Ring family: exactly world_size shards, one per SIP."""
    if len(handle.shards) != world_size:
        raise ValueError(...)
    per_sip = {}
    for s in handle.shards:
        per_sip[s.sip] = per_sip.get(s.sip, 0) + 1
    if any(c != 1 for c in per_sip.values()):
        raise ValueError(...)
 def multi_pe_sip_local(handle, world_size, spec):
    """Multi-PE per SIP layout: each SIP holds as many shards as it has intra-SIP PEs.
    Used by algorithms in which every intra-SIP PE participates."""
    cm = spec["sip"]["cube_mesh"]
    pl = spec["cube"]["pe_layout"]
    per_sip = cm["w"] * cm["h"] * pl["pe_per_corner"] * len(pl["corners"])
    if len(handle.shards) != world_size * per_sip:
        raise ValueError(...)
 VALIDATOR_REGISTRY = {...}
 def resolve_validator(key_or_path): ...
 ```
 A validator checks the shard-layout invariants of the whole world. The per-rank view is
 created by the backend via `_tensor_slice_for_sip` **after** the validator call.
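Since validators are resolved the same way as mappers, the lookup can be shared. A minimal sketch, assuming a helper named `resolve_callable` (hypothetical, not part of this ADR) that takes the registry as a parameter; the order of checks (registry key first, then dotted import path) mirrors `resolve_mapper`.

```python
import importlib
from typing import Callable, Mapping


def resolve_callable(
    key_or_path: str, registry: Mapping[str, Callable], kind: str
) -> Callable:
    """Resolve a registry key or a dotted 'package.module.function' path."""
    if key_or_path in registry:
        return registry[key_or_path]
    if "." in key_or_path:
        mod_path, fn_name = key_or_path.rsplit(".", 1)
        return getattr(importlib.import_module(mod_path), fn_name)
    raise ValueError(f"unknown {kind}: {key_or_path!r}")
```

A `resolve_validator(key)` wrapper would then be a one-liner over `VALIDATOR_REGISTRY`; an unknown key without a dot fails fast with `ValueError`, while a bad dotted path surfaces the underlying `ImportError` or `AttributeError`.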
 **Per-rank tensor view** (SIP-local slice):
 ```python
 def _tensor_slice_for_sip(handle, sip) -> TensorArg:
    sip_shards = [s for s in handle.shards if s.sip == sip]
    if not sip_shards:
        raise RuntimeError(f"tensor has no shards on SIP {sip}")
    # Deterministic ordering contract: (cube, pe, offset_bytes) ascending.
    # Multi-PE mappers (hierarchical, etc.) rely on this ordering to align
    # per-PE tensor arg construction with participating_pes enumeration.
    sip_shards.sort(key=lambda s: (s.cube, s.pe, s.offset_bytes))
    min_offset = min(s.offset_bytes for s in sip_shards)
    local_va_base = (handle.va_base + min_offset) if handle.va_base else 0
    return TensorArg(
        shards=tuple(TensorArgShard(...) for s in sip_shards),
        va_base=local_va_base,
    )
 ```
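 **Ordering invariant**: the shards of a slice are in ascending `(cube, pe, offset_bytes)` order.
 When the backend iterates over `participating_pes` and builds `_tensor_arg_for_pe(view, cube, pe)`,
 it can rely on this deterministic ordering. This matters in particular when the `all_pes` mapper
 combined with a hierarchical algorithm interprets the per-PE slice combination in an
 order-dependent way.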
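The ordering invariant can be checked in isolation. A minimal sketch, assuming a hypothetical `Shard` record with only the four fields the sort key and the SIP filter need; `slice_for_sip` and `shards_for_pe` are illustrative stand-ins, not `_tensor_slice_for_sip` or `_tensor_arg_for_pe` themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:  # hypothetical stand-in for a shard record
    sip: int
    cube: int
    pe: int
    offset_bytes: int


def slice_for_sip(shards, sip):
    """Return the SIP-local shards in (cube, pe, offset_bytes) ascending order."""
    local = [s for s in shards if s.sip == sip]
    if not local:
        raise RuntimeError(f"tensor has no shards on SIP {sip}")
    return sorted(local, key=lambda s: (s.cube, s.pe, s.offset_bytes))


def shards_for_pe(view, cube, pe):
    """Pick one PE's shards; the relative order inside the view is preserved."""
    return [s for s in view if (s.cube, s.pe) == (cube, pe)]
```

Because the sort key is a total order on the fields that identify a shard within a SIP, two hosts that enumerate the global shard list in different orders still build identical per-SIP views.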
 ### D9. Greenlet-local rank registry (+ debug warning)
 ```python
 import os
 import warnings
 class DistributedContext:
    def __init__(self):
        self._backend = None
        self._rank_by_greenlet: dict = {}
    def _bind_rank(self, g, rank: int) -> None:
        self._rank_by_greenlet[g] = int(rank)
    def get_rank(self) -> int:
        self._ensure_initialized()
        from greenlet import getcurrent
        g = getcurrent()
        if g not in self._rank_by_greenlet:
            if os.environ.get("KERNBENCH_DEBUG"):
                warnings.warn(
                    "get_rank() called outside a bound greenlet — returning 0. "
                    "Likely a bug unless running single-driver."
                )
            return 0
        return int(self._rank_by_greenlet[g])
 ```
 ### D10. `torch.ahbm.set_device(rank)`: SIP binding
 The KernBench backend is named `ahbm` (ADR-0023 D10). Real PyTorch uses
 `torch.cuda.set_device(r)`, but since we are not CUDA we use an honestly named
 namespace.
 ```python
 class _AhbmNamespace:
    """torch.ahbm — per-greenlet SIP device binding.
    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
    """
    def __init__(self):
        self._device_by_greenlet: dict = {}
    def set_device(self, device: int) -> None:
        from greenlet import getcurrent
        self._device_by_greenlet[getcurrent()] = int(device)
    def current_device(self) -> int | None:
        from greenlet import getcurrent
        return self._device_by_greenlet.get(getcurrent())
 # Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
 # Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
 ```
 **PyTorch 2.x style supported in parallel**: recent PyTorch moves toward the device-agnostic
 `torch.accelerator` namespace (`torch.accelerator.set_device_index(r)`,
 `torch.accelerator.current_device_index()`). KernBench supports this surface as well, for users
 who want to write code that is not tied to a device vendor.
 ```python
 class _AcceleratorNamespace:
    """torch.accelerator — device-agnostic API (PyTorch 2.x style).
    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
        torch.accelerator.set_device_index(rank)
        torch.accelerator.current_device_index()
    """
    def __init__(self, ahbm: _AhbmNamespace):
        self._ahbm = ahbm
    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)
    def current_device_index(self) -> int | None:
        return self._ahbm.current_device()
 # RuntimeContext
 self.ahbm = _AhbmNamespace()
 self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias
 ```
 Bench authors pick either of the following; both are backed by the same internal registry:
 ```python
 torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
 torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
 ```
 ### D11. Tensor placement = structural (sip, cube, pe) coordinates
 `resolve_dp_policy` takes `target_sip` directly and generates the placement in structural coordinates.
 See ADR-0026 for details.
 ```python
 # RuntimeContext._create_tensor
 current_sip = self.ahbm.current_device()          # (D10 naming)
 if current_sip is None:
    current_sip = 0  # single-driver fallback (consistent with D9)
 placement = resolve_dp_policy(
    dp, shape=shape_2d, itemsize=itemsize,
    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
    target_sip=current_sip,
 )
 ```
 Post-hoc `pe_index` shifting is removed: ShardSpec holds structural `(sip, cube, pe)`
 coordinates.
 ### D12. `torch.multiprocessing.spawn`-compat surface
 The bench-author surface is identical to real PyTorch `mp.spawn`:
 ```python
 # src/kernbench/runtime_api/multiprocessing.py (new)
 from types import SimpleNamespace
 def spawn(fn, args=(), nprocs=1, join=True, daemon=False, start_method="spawn"):
    """Drop-in for torch.multiprocessing.spawn.
    Internal: greenlet fan-out + epoch-barrier sync + exception propagation.
    """
    ...
 # attach to the torch namespace
 torch.multiprocessing = SimpleNamespace(spawn=spawn)
 ```
 Bench:
 ```python
 import torch.multiprocessing as mp
 mp.spawn(worker, nprocs=world_size, args=(world_size, torch))
 ```
 ### D13. Scheduler + exception handling
 ```python
 from greenlet import greenlet
 def spawn(fn, args=(), nprocs=1, join=True, daemon=False, start_method="spawn"):
    dist = torch.distributed
    gs: list[greenlet] = []
    errors: dict[int, Exception] = {}
    for rank in range(nprocs):
        def _entry(r=rank):
            try:
                fn(r, *args)
            except Exception as e:
                errors[r] = e
                raise
        g = greenlet(_entry)
        dist._bind_rank(g, rank)
        gs.append(g)
    try:
        while True:
            alive = [g for g in gs if not g.dead]
            if not alive:
                break
            for g in alive:
                if not g.dead:
                    g.switch()
    except Exception as outer:
        for other in gs:
            if not other.dead:
                try:
                    other.throw(SystemExit)
                except Exception:
                    pass
        # Explicit cleanup of the epoch barrier state
        backend = getattr(dist, "_backend", None)
        if backend is not None and hasattr(backend, "_barrier"):
            backend._barrier.reset()
        raise SpawnException(errors) from outer
 ```
 **Scheduler contract**:
 - Deterministic round-robin over insertion order (rank 0, 1, ..., N-1).
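 - The only synchronization point is the epoch barrier (D7). No correctness property depends on scheduler order.
 - On an exception, the other greenlets are forcibly terminated and a `SpawnException` is propagated.
 **Starvation guideline**:
 - Normally the collective barrier keeps the workers in sync, so no large skew arises.
 - For extreme non-collective loops, a cooperative yield is provided:
  `torch.distributed.cooperative_yield()`.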
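The scheduler contract above can be sketched with generators standing in for greenlets. This is an analogy under stated assumptions, not the `spawn` implementation: `run_round_robin` and `steps` are hypothetical names, a bare `yield` plays the role of a worker handing control back to the scheduler, and `generator.close()` is used as the generator analogue of throwing an exit into sibling greenlets.

```python
def run_round_robin(workers):
    """Switch into each alive worker once per round, in rank order.

    Returns the sequence of ranks that were switched into.
    """
    alive = dict(enumerate(workers))  # rank -> generator, insertion-ordered
    trace = []
    try:
        while alive:
            for rank in list(alive):
                trace.append(rank)
                try:
                    next(alive[rank])
                except StopIteration:
                    del alive[rank]
    except Exception:
        # A worker raised: terminate the remaining siblings, then re-raise.
        for sibling in alive.values():
            sibling.close()
        raise
    return trace


def steps(n):
    """Worker that hands control back n times before finishing."""
    for _ in range(n):
        yield
```

The returned trace depends only on the insertion order and on how often each worker yields, which is the sense in which the round-robin is deterministic.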
 ### D14. Backward compatibility
 1. **Single-driver calls**: `get_rank()` returns 0 (D9).
 2. **`ccl.yaml` world_size override**: bypasses the D1 fallback, so it remains usable as the legacy
   "rank = PE" test path.
 3. **Explicit `DPPolicy.sip="column_wise"`**: in the scope of ADR-0026.
 4. **`install_ipcq()` compatibility wrapper**:
 The `install_ipcq()` API of the existing `ccl/install.py` is not removed right away.
 It stays as a thin compatibility wrapper so that existing direct callers can migrate
 incrementally.
 ```python
 # src/kernbench/ccl/install.py (after this ADR)
 def install_ipcq(engine, spec, merged, *, algo_module=None, rank_to_pe=None):
    """DEPRECATED: legacy host-side PE installer.
    Internally delegates to build_install_plans + engine-routed IpcqInitMsg.
    Use dist.init_process_group() instead.
    """
    from kernbench.ccl.install_plan import build_install_plans
    import warnings
    warnings.warn(
        "install_ipcq() is deprecated; use dist.init_process_group()",
        DeprecationWarning, stacklevel=2,
    )
    plans = build_install_plans(
        world_size=merged.get("world_size", 1),
        algorithm=merged["algorithm"],
        algorithm_config=merged,
        spec=spec,
    )
    handles = []
    for plan in plans:
        for pe_install in plan.pe_installs:
            h = engine.submit(IpcqInitMsg(
                target_sips=(plan.sip,),
                target_cubes=(pe_install.cube,),
                target_pe=pe_install.pe,
                entries=pe_install.neighbors,
                buffer_kind=plan.buffer_kind,
                n_slots=plan.n_slots,
                slot_size=plan.slot_size,
            ))
            handles.append(h)
    for h in handles:
        engine.wait(h)
    return {"world_size": merged.get("world_size", 1), "plans": plans}
 ```
 Migration schedule:
 - Phase 1: keep as a wrapper + DeprecationWarning
 - Phase 2: grep-audit the direct callers, then move each one to `dist.init_process_group()` or
  to direct use of `build_install_plans()`
 - Phase 3: remove the wrapper (separate cleanup ADR or PR)
 ---
 ## Dependencies
 - **ADR-0023** (IPCQ): the `IpcqInitMsg` message type and the PE_IPCQ handling are used
  as-is. The only change is the switch to engine-routed submit.
 - **ADR-0025** (IPCQ direction fix): required for the neighbor computation in `_build_pe_installs`
  to behave correctly on a 2-rank ring and similar cases.
 - **ADR-0003 / 0016** (IO_CPU): IO_CPU keeps its existing transit role. This ADR does not
  change the IO_CPU role.
 ---
 ## Non-goals
 - **IPCQ protocol changes**: ADR-0023 stays as is.
 - **DPPolicy field cleanup**: ADR-0026.
 - **Megatron-style TP**: ADR-0027.
 - **Multi-node (cross-process)**: single process only.
 - **Adopting a single-endpoint principle for the IO_CPU SIP control plane**: outside the scope of this ADR.
  KernBench has no such principle today; introducing it needs a separate ADR.
 - **Hierarchical all-reduce algorithm design**: ADR-0029. This ADR only provides the framework
  infrastructure that algorithm will use (`all_pes` mapper, `multi_pe_sip_local` validator,
  registry extension points).
 ---
 ## Open questions
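 ### 🟡 Nice-to-have (scope-boundary related)
 - **Install timing tolerance**: install consumes a few ns to µs of SimPy time, whereas the existing
  sideband took 0 ns. Check whether existing tests assume a t=0 start (tests may need correcting,
  depending on the audit result).
 - **Batching `IpcqInitMsg`**: a `target_pe="all"` broadcast as in MmuMapMsg
  does not fit IPCQ (each PE has different neighbors). Currently each PE gets its own submit.
  A batched IpcqInitMsg type that carries per-PE payloads is a future optimization.
 - **`_rank_to_sip` mapping**: currently identity. Handled separately if a non-trivial mapping is required.
 - **Location of the cooperative yield API**: planned to be exposed as
  `torch.distributed.cooperative_yield()`. Whether it is actually needed will be judged when benches are added after Phase 2.
 (For the mid- to long-term direction on unifying PE-level topology, see **ADR-0029**: a framework
 evolution driven by complex multi-level algorithms.)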
 ---
 ## Consequences
 ### Positive
 - **Zero new message types**: implemented with the existing `IpcqInitMsg` + `KernelLaunchMsg` only.
 - **No IO_CPU / engine changes**: the existing routing is kept.
 - **Sideband install convention removed**: unified on the same pattern as MmuMapMsg and others.
 - **Stale plan-state problem eliminated**: the plan is owned solely by the host.
 - **Bench = real PyTorch DDP** (from the public API perspective).
 - **Lightweight algorithm ABI**: only `kernel` + `kernel_args` are required.
 - **Epoch-based barrier**: safe for interleaved collectives.
 - **Control/data plane separation**: the data plane (PE_IPCQ) stays as in ADR-0023; the control
  plane is host-driven.
 - Long-term extensibility: a foundation for Megatron TP and DTensor.
 ### Negative
 - New modules: `install_plan.py`, `mappers.py`, `validators.py`,
  `multiprocessing.py`.
 - Whether the engine can route `IpcqInitMsg` along the engine path must be checked during implementation
  (a minor hook may be needed).
 - Install consumes SimPy time (arguably a positive, but any test that assumes the existing sideband
  timing of 0 ns needs correcting).
 ### Neutral
 - The IPCQ PE-level protocol (ADR-0023) is unchanged.
 - `DPPolicy` field changes are covered by ADR-0026.
 - The IO_CPU role is unchanged (existing transit role).
@@ -23,7 +23,7 @@ class DPPolicy:
    """Intra-device (cube × PE) data-parallel policy.
    SIP-level placement is controlled by ``torch.ahbm.set_device(rank)``
-    (ADR-0024 D10) and, for model-level TP, by Megatron-style parallel
+    (ADR-0024 D3) and, for model-level TP, by Megatron-style parallel
    layers (ADR-0027). DPPolicy does not cross SIP boundaries.
    """
    cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
@@ -37,7 +37,7 @@ class DPPolicy:
 ### D2. `ShardSpec`: structural (sip, cube, pe) coordinates, `pe_index` removed entirely
 Currently `ShardSpec.pe_index` is a **global flat index** (`sip × cubes × pes + cube ×
-pes + pe`). This is the form that ADR-0024 D11 flags as "abstraction leakage".
+pes + pe`). This is the form that ADR-0024 D4 flags as "abstraction leakage".
 This ADR **redefines ShardSpec in structural coordinates**, and `pe_index` is **not kept**,
 not even as a property:
@@ -73,7 +73,7 @@ class ShardSpec:
 ### D3. `resolve_dp_policy` takes `target_sip` and generates structural coordinates
-Implements the ADR-0024 D11 contract. No post-hoc shifting.
+Implements the ADR-0024 D4 contract. No post-hoc shifting.
 ```python
 # src/kernbench/policy/placement/dp.py (after)
@@ -135,14 +135,14 @@ def resolve_dp_policy(
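 ### D4. `_create_tensor`: direct placement in structural coordinates
-Continuation of ADR-0024 D11. Post-hoc shifting is removed; structural coordinates are specified directly
+Continuation of ADR-0024 D4. Post-hoc shifting is removed; structural coordinates are specified directly
 at the `resolve_dp_policy` call site.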
 ```python
 # context.py _create_tensor (after)
 current_sip = self.ahbm.current_device()
 if current_sip is None:
-    # Single-driver fallback (consistent with ADR-0024 D9).
+    # Single-driver fallback (consistent with ADR-0024 D2).
    # If launcher-based code forgets set_device(), tensors silently land on SIP 0,
    # so warn in debug mode.
    if os.environ.get("KERNBENCH_DEBUG"):
@@ -267,7 +267,7 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에
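 - **Clear separation of concepts**: DPPolicy = intra-device, TP = inter-device.
 - **Simpler API**: DPPolicy constructor fields reduced by ~33%.
 - **Consistent structural coordinates**: ShardSpec is expressed as a `(sip, cube, pe)` tuple,
-  which resolves the abstraction leakage (satisfies the ADR-0024 D11 contract).
+  which resolves the abstraction leakage (satisfies the ADR-0024 D4 contract).
 - **Unambiguous `pe_index` meaning**: SIP-local is the single interpretation. If a global flat index is needed, say so explicitly.
 - **Consistent launcher model**: the "1 worker per SIP" model of ADR-0024 is the only mechanism
  for controlling SIP boundaries.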
@@ -2,9 +2,7 @@
 ## Status
-Accepted (Revision 7: resume invariant / main-context wait non-recursion invariant /
-global barrier over-serialization tradeoff / TP forward yield-safety made explicit,
-2026-04-14)
+Accepted
 ## Context
@@ -166,9 +164,9 @@ while alive:
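   - The implementation does not need to **detect** this (timeouts, steps-since-yield counters,
     etc.). This is a user contract; the symptom of a violation is a "simulation hang".
   - **Future extension**: if long non-collective compute paths become common,
-     the `torch.distributed.cooperative_yield()` primitive of ADR-0024 D13 (an explicit
+     an explicit `torch.distributed.cooperative_yield()` primitive (no-op yield)
-     no-op yield) can be introduced. Outside the scope of this ADR. Not a breaking change:
+     can be introduced. Outside the scope of this ADR. Not a breaking change: it can be added
-     it can be added when needed.
+     when needed.
   - Within a round, every alive worker receives exactly one `switch`. Even if one worker calls wait
     several times within a single round, the calls are enqueued sequentially within that turn and
     then processed together in one scheduler drain (FIFO).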
@@ -183,7 +181,7 @@ while alive:
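   - **The two queues have different dependency sources**: worker waits are on handles that the worker
     itself created through `submit + wait` pairs (tensor deploy, MmuMap, etc.). The collective
     queue holds kernel launch handles enqueued internally by `dist.all_reduce`, and
-     the worker does not wait on them directly (ADR-0024 D7).
+     the worker does not wait on them directly (see the two-queue drain model in D0.5).
   - **Independent from a correctness standpoint**: from the worker's view, a collective is in the
     "already submitted, then yielded" state. Its completion only needs to precede the worker's next
     action. There is no ordering dependency on the worker wait queue.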
@@ -206,7 +204,7 @@ while alive:
     keeping an index, or checking `h not in pending_set` before append) is possible. This is classified
     as an optimization that does not change correctness.
-4. **Exception propagation + sibling cleanup (adopting the ADR-0024 D13 approach)**.
+4. **Exception propagation + sibling cleanup**.
   When a worker greenlet raises, `g.switch()` delivers the exception to main.
   The scheduler loop stops immediately and performs the following cleanup **explicitly**:
@@ -581,7 +579,7 @@ TP layer의 weight/output 표현에서 두 개념을 명확히 분리한다:
 | Concept | Decided by | Scope |
 |---|---|---|
-| **TP shard ownership** (which rank owns which slice of the weight) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D9/D10) | **cross-rank, cross-SIP** |
+| **TP shard ownership** (which rank owns which slice of the weight) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D2/D3) | **cross-rank, cross-SIP** |
 | **Intra-rank placement** (how the owned slice is distributed across cube × PE inside the rank) | `DPPolicy(cube=..., pe=...)` (ADR-0026) | **inside one rank (within the SIP boundary)** |
 Therefore, when `ColumnParallelLinear`, with the shape `(in_features, out_features // ws)`,
@@ -825,40 +823,11 @@ strict-xfail 케이스를 본 ADR 구현 이후 **PASS**로 전환하는 것을
 ## Dependencies
-- **ADR-0024** (launcher): rank = SIP, greenlet-local rank, `dist.all_reduce`,
+- **ADR-0024** (launcher): rank = SIP, greenlet-local rank,
-  `torch.ahbm.set_device(rank)`. D0/D1 of this ADR extend this infrastructure.
+  `torch.ahbm.set_device(rank)`.
 - **ADR-0026** (DPPolicy intra-device): per-rank slice representation of weight tensors.
 - **ADR-0023 / ADR-0025** (IPCQ): the basis of the `dist.all_reduce` implementation.
 ### Supersedes (partial)
 The following sections of ADR-0024 are **designs that were never implemented**, and this ADR
 replaces them with a simpler model:
 - **ADR-0024 D7 (`_CollectiveBarrier.submit_and_drain`)**: the epoch-based
  last-arriver-drains pattern. Problem: the last arriver calls `ctx.wait` **from worker context**
  and drives env.run, which reproduces the orphan cause that D0.2 is meant to prevent. The
  **D0.4 two-queue drain** of this ADR (main drains after all workers have yielded) provides the
  same invariant, "no rank's collective progresses until every rank has finished submitting",
  in a **worker-safe** way. The `_CollectiveBarrier` class is
  not implemented.
 - **ADR-0024 D12/D13 (`spawn_workers` skeleton)**: the signature / scheduler
  loop / exception handling design. **D1** of this ADR redefines it with a signature that matches
  the real-PyTorch API (`spawn(fn, args, nprocs)`) and performs the D0 scheduler drain in a single
  place. The exception cleanup of ADR-0024 D13 (siblings
  `throw(SystemExit)` + `SpawnException` wrapping) is absorbed into this ADR unchanged
  (see D0.4-(4)).
 The current implementation has landed none of ADR-0024 D7/D12/D13, so the supersession carries
 no migration cost. Adding a "superseded by
 ADR-0027 D0/D1" note to `docs/adr/ADR-0024` later is enough for consistency.
 **Source of truth (normative, for implementers)**: the implementation reference for worker
 scheduling / collective drain / spawn / exception cleanup is **ADR-0027 D0/D1**. When
 implementing, do not consult the pseudocode / contract / signature of ADR-0024 D7/D12/D13;
 whenever the two ADRs reach different conclusions, ADR-0027 always takes precedence. Reviewers
 review PRs by the same principle.
 ---
 ## Non-goals
@@ -146,7 +146,7 @@ At each `dist.all_reduce(tensor)` call:
 3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
   `sip_rank` is the current greenlet's bound rank.
 4. Launches with `_defer_wait=True`; the main scheduler drains pending
-   handles after all workers submit (per ADR-0024 D7 / ADR-0027 D0.4).
+   handles after all workers submit (per ADR-0027 D0.4).
 ### D6. Config schema
@@ -10,7 +10,7 @@ The simulator is an analytical, event-driven performance model — not a
 cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
 or omitted by design. To keep the model auditable and reviewable as a whole,
 this ADR consolidates the assumptions in one place. Individual component ADRs
-(ADR-0015, ADR-0019, ADR-0004) define the *mechanisms*; this document defines
+(ADR-0015, ADR-0017, ADR-0004) define the *mechanisms*; this document defines
 the *limits of fidelity*.
 ## Decisions
@@ -21,7 +21,7 @@ the *limits of fidelity*.
  ADR-0015 D2.
 - **Per-component switching/overhead latency** (`overhead_ns` attr).
 - **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array
-  with global round-robin chunking. Burst granularity tunable
+  with address-based PC selection (ADR-0034 D3). Burst granularity tunable
  (`burst_bytes`, default 256B). Read and write share each PC's
  `available_at` (real HW command bus is per-PC shared).
 - **HBM direction switching penalty mechanism**: per-PC last-direction
@@ -66,8 +66,8 @@ the *limits of fidelity*.
 ### D3. Ignored (out of scope)
 - Bank-level row buffer conflict penalty (assume no conflicts — best case;
-  round-robin chunk assignment is address-blind so we cannot detect same-bank
+  the model has no per-bank state within a PC, so same-bank reuse cannot be
-  reuse).
+  detected).
 - HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
  `burst_time = burst_bytes / pc_bw_gbs`).
 - Refresh, ECC, thermal throttling, power gating.
@@ -110,29 +110,6 @@ below are different concerns, ordered by expected workload impact.
 **Higher impact (workload accuracy gap)**:
 - [ ] **Address-based PC selection at HBM CTRL** (replace the
  address-blind global round-robin). Compute the PC index from
  the HBM byte offset using parameters already in topology config:
      pc_shift = log2(burst_bytes)        # default 8 (burst=256B)
      pc_mask  = num_pcs - 1              # default 7 (8 PCs)
      pc       = (hbm_offset >> pc_shift) & pc_mask
  For the default `burst_bytes=256, num_pcs=8` this places the PC
  select field at HBM byte-offset bits **[10:8]**: bits [7:0] are
  the within-burst offset (same PC), bits [10:8] are the 3-bit PC
  index, and bits [36:11] are row/bank/column within the PC slice.
  Shift/mask are derived from topology config rather than hardcoded
  so alternative `(burst_bytes, num_pcs)` pairs stay consistent.
  See `src/kernbench/policy/address/phyaddr.py` for the canonical
  comment.
  Real-HW workloads where this matters most: (a) strided multi-
  transaction streams that under global-RR collide on the same PCs
  but under address-striping land on disjoint sets; (b) offset-
  disjoint parallel transfers where address-striping preserves
  parallelism while global-RR re-serializes them. Directly affects
  multi-PE concurrent HBM workload latencies.
 - [ ] **Bank-level conflict modeling** within a PC (opt-in via
  `track_banks: true`). Currently we assume no same-bank reuse;
  random scatter/gather workloads are optimistic here.
@@ -169,7 +146,7 @@ below are different concerns, ordered by expected workload impact.
  touching latency must update the relevant section here.
 - Workload-specific magnitude error envelopes are explicit.
 - Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
-  enforces the ADR-0019 D9 invariant in code rather than relying on yaml
+  enforces the ADR-0017 D8 invariant in code rather than relying on yaml
  manual consistency.
 - Wire transfer time is charged once per bottleneck-link transit (Phase 2c
  per-flit timing) rather than via terminal `drain_ns` injection. Single
@@ -180,5 +157,6 @@ below are different concerns, ordered by expected workload impact.
 ## Cross-references
 - ADR-0015 — component / port / wire model.
- ADR-0019 — NoC and local HBM topology.
+- ADR-0017 — Cube NOC architecture and HBM connectivity.
 - ADR-0004 — memory semantics, local HBM.
 - ADR-0034 — HBM controller internal design.
@@ -0,0 +1,271 @@
 # ADR-0034: HBM Controller Internal Design
 ## Status
 Accepted
 ## Context
 `HbmCtrlComponent` is the per-PE HBM partition endpoint at the leaf of
 the cube NOC. One instance is created per PE under the topology node
 `sip{S}.cube{C}.hbm_ctrl.pe{idx}` and attaches to that PE's router
 (ADR-0017 D4). The component models per-pseudo-channel (PC) scheduling,
 burst-granular commit timing, address-based PC selection, and response
 routing back to the requester.
 This ADR documents the component as currently implemented. ADR-0017 D4/D8
 defines *where* HBM CTRL attaches and *what* aggregate BW it must
 deliver. ADR-0033 D1/D2 defines *what fidelity* of HBM modelling is in
 scope. This ADR fills the gap between those two — the per-instance
 internal scheduling model.
 ## Decision
 ### D1. Role
 `HbmCtrlComponent` is a per-PE HBM partition endpoint. One instance per
 PE (default 8 per cube, set by `cube.memory_map.hbm_slices_per_cube`)
 attaches to that PE's router via the `peX.hbm` attachment list in
 `cube_mesh.yaml` (ADR-0017 D4). In the default n:1 channel mapping
 (ADR-0017 D8) the instance aggregates `channels_per_pe` pseudo-channels
 into one endpoint.
 The component models:
 - Per-PC scheduling (D2) with R/W command-bus sharing.
 - Address-based PC selection (D3).
 - Burst-granular commit timing (D4).
 - Flit-aware per-flit PC commit and async finalize (D5, D6).
 - Command-only Transaction handling for read-data drain (D7).
 - Response routing back to the requester (D8).
 It does not model:
 - Bank-level row-buffer conflicts, refresh, ECC, thermal throttling
  (ADR-0033 D3).
 - Cross-PE HBM contention beyond its own router edge (handled by the
  router mesh — ADR-0017 D3).
 - 1:1 channel mode (ADR-0017 D8 future work).
 ### D2. Per-PC scheduling model
 Per-instance state initialised in `start()`:
 - `_pc_avail: list[float]` — earliest sim-time each PC is free; length
  `num_pcs`, initial 0.0.
 - `_pc_last_dir: list["R"|"W"|None]` — direction of the last commit on
  each PC, used for switch-penalty detection (D4); initial `None`.
 `num_pcs` and `burst_bytes` must each be a positive power of two so
 that address-based PC selection (D3) reduces to a shift-and-mask.
 Read and write requests share the same `_pc_avail` slot per PC — the
 real HW per-PC command bus is shared between read and write traffic, so
 issuing a write to PC k blocks a subsequent read to PC k by exactly the
 burst time.
 Direction `dir` for a request is inferred from the request type:
 - `MemoryWriteMsg` → `"W"`.
 - `PeDmaMsg` with `is_write=True` → `"W"`.
 - All others (`MemoryReadMsg`, `PeDmaMsg` read) → `"R"`.
 ### D3. Address-based PC selection
 PC index for an access is derived from the access address by shift and
 mask:
 ```text
 pc_shift = log2(burst_bytes)         # default 8  (burst=256B)
 pc_mask  = num_pcs - 1               # default 7  (8 PCs)
 pc       = (address >> pc_shift) & pc_mask
 ```
 Computed once in `start()` from topology config so alternative
 `(burst_bytes, num_pcs)` pairs stay consistent. For the canonical
 default `(256, 8)` this places the PC select field at bits `[10:8]` of
 the HBM byte offset: bits `[7:0]` are within-burst (same PC), bits
 `[10:8]` are the 3-bit PC index, bits `[36:11]` are row/bank/column
 within the PC slice (see `phyaddr.py` comment).
 Address-based striping — as opposed to address-blind global
 round-robin — preserves PC parallelism for offset-disjoint concurrent
 transfers: each transfer's bursts land deterministically on the PC set
 implied by its byte addresses, so multi-PE workloads accessing disjoint
 regions do not collide on a single PC.
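The shift-and-mask selection can be sketched in a few lines. This is an illustration under stated assumptions rather than the `HbmCtrlComponent` code: `make_pc_selector` is a hypothetical helper name, and the power-of-two check reflects the D2 requirement on `num_pcs` and `burst_bytes`.

```python
def make_pc_selector(burst_bytes: int = 256, num_pcs: int = 8):
    """Precompute shift/mask once and return an address -> PC index function."""
    for name, value in (("burst_bytes", burst_bytes), ("num_pcs", num_pcs)):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a positive power of two, got {value}")
    pc_shift = burst_bytes.bit_length() - 1  # log2(burst_bytes)
    pc_mask = num_pcs - 1

    def pc_for_address(address: int) -> int:
        return (address >> pc_shift) & pc_mask

    return pc_for_address


# Default (256, 8): consecutive 256-byte bursts walk PCs 0..7, then wrap.
pc_for_address = make_pc_selector()
burst_walk = [pc_for_address(i * 256) for i in range(9)]
```

With the defaults, every address inside one 256-byte burst maps to the same PC, and two transfers whose base offsets differ by a multiple of `burst_bytes * num_pcs` (2 KiB here) start on the same PC.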
 ### D4. Burst granularity and PC commit timing
 A single PC commit takes:
 ```text
 chunk_time = burst_bytes / pc_bw_gbs    # ns
 ```
 - `burst_bytes` (default 256) is the burst granularity matching the
  flit size (ADR-0033 D1).
 - `pc_bw_gbs` is **builder-derived** from
  `hbm_to_router_bw_gbs / num_pcs` (`topology/builder.py`), enforcing
  the ADR-0017 D8 invariant that aggregate per-PE BW equals the
  router-to-HBM link BW.
 Per-PC commit scheduling for an arriving access on PC `pc` with
 direction `dir`:
 ```text
 switch_cost = switch_penalty_ns
              if pc_last_dir[pc] not in (None, dir) else 0
 start  = max(env.now, pc_avail[pc]) + switch_cost
 finish = start + chunk_time
 pc_avail[pc]    = finish
 pc_last_dir[pc] = dir
 ```
 Default `switch_penalty_ns = 0` — Tier 0 assumption that an ideal HBM
 scheduler amortises R/W switching cost (ADR-0033 D2). Non-zero values
 model pessimistic per-alternation cost.
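The commit rule above is small enough to model directly. A minimal sketch, assuming a hypothetical `PcScheduler` class and an example `pc_bw_gbs` of 32 GB/s (the ADR does not state a default bandwidth); since 1 GB/s equals 1 byte/ns, `burst_bytes / pc_bw_gbs` is already in nanoseconds.

```python
class PcScheduler:
    """Toy model of the per-PC commit schedule (state: pc_avail, pc_last_dir)."""

    def __init__(self, num_pcs=8, burst_bytes=256, pc_bw_gbs=32.0,
                 switch_penalty_ns=0.0):
        self.chunk_time = burst_bytes / pc_bw_gbs  # ns per burst
        self.switch_penalty_ns = switch_penalty_ns
        self.pc_avail = [0.0] * num_pcs
        self.pc_last_dir = [None] * num_pcs

    def commit(self, now: float, pc: int, direction: str) -> float:
        """Schedule one burst on `pc` and return its finish time in ns."""
        switched = self.pc_last_dir[pc] not in (None, direction)
        switch_cost = self.switch_penalty_ns if switched else 0.0
        start = max(now, self.pc_avail[pc]) + switch_cost
        finish = start + self.chunk_time
        self.pc_avail[pc] = finish
        self.pc_last_dir[pc] = direction
        return finish
```

For example, with `switch_penalty_ns=2.0` a write followed by a read on the same PC at `now=0` finishes at 8 ns and 18 ns respectively, while a read on a different PC at `now=0` finishes at 8 ns.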
 ### D5. Flit-aware per-flit PC commit (primary path)
 `_handle_flit` is the primary worker path. For each arriving `Flit`:
 1. On the **first** flit of a transaction (`tid = id(txn)` not in
   `_txn_state`):
   - Apply `overhead_ns` once via `run(env, nbytes)` — header decode
     model, first-flit overhead pattern (ADR-0033 D1).
   - Initialise `_txn_state[tid] = {"last_finish": env.now}`.
 2. Compute `pc = _pc_for_address(flit.address)` (D3).
 3. Apply the per-PC schedule (D4) using the request direction (D2).
 4. Update `state["last_finish"] = max(state["last_finish"], finish)`.
 5. If `flit.is_last`: pop `_txn_state[tid]` and spawn `_finalize_txn`
   (D6).
 Per-flit address-aware commit is the mechanism that lets concurrent
 multi-PE traffic to disjoint HBM offsets pipeline through distinct PCs
 in parallel.
 ### D6. Async finalize per transaction
 When a transaction's last flit has been scheduled, finalisation runs in
 a separately-spawned process:
 ```python
 def _finalize_txn(env, txn, last_finish):
    wait = last_finish - env.now
    if wait > 0:
        yield env.timeout(wait)
    yield from _send_response(env, txn)
 ```
 `_handle_flit` spawns this via `env.process(...)` and returns
 immediately, so the worker can pick up the next inbox message while the
 last PC commit drains.
 Without this split — i.e. if the worker itself did
 `yield env.timeout(wait)` — concurrent single-flit transactions whose
 addresses hit distinct PCs would still serialise at `chunk_time` each
 inside the worker, hiding the PC parallelism that D3 and D5 are
 designed to expose.
 ### D7. Non-flit fallback for command-only transactions
 `_handle_txn` runs when the inbox delivers a `Transaction` rather than a
 `Flit`. This is the path for command-only requests that the wire does
 not chunk into flits — most notably `MemoryReadMsg` whose command txn
 carries `nbytes=0` (data drain is modelled at HBM CTRL post-processing,
 not as inbound flits).
 Procedure:
 1. `work_bytes = txn.nbytes if txn.nbytes > 0 else int(request.nbytes or 0)`
   — for read commands, work is sized by the request.
 2. `n_chunks = ceil(work_bytes / burst_bytes)` if `work_bytes > 0` else
   0.
 3. `chunk_interval = drain_ns / n_chunks` (when both > 0): chunks are
    scheduled at `drain_ns / n_chunks` ns intervals to model the
    bottleneck link's data arrival rate (ADR-0033 D1 chunk-loop drain).
 4. Apply `run(env, txn.nbytes)` once for `overhead_ns`.
 5. For each chunk `i`, advance `chunk_interval` ns then apply the D4
   schedule with `pc = _pc_for_address(base_address + i * burst_bytes)`.
 6. After scheduling all chunks, wait `last_finish - env.now` then call
   `_send_response`.
 `_handle_txn` shares the same `_pc_avail` / `_pc_last_dir` state with
 `_handle_flit` — there is exactly one source of PC scheduling truth
 across both paths.
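Steps 1 to 3 and the address stepping of step 5 reduce to a small amount of arithmetic. The sketch below is illustrative only: `plan_chunks` is a hypothetical helper, and it assumes chunk `i` arrives `(i + 1) * chunk_interval` ns after the overhead has been paid, which is one reading of "advance `chunk_interval` ns then apply the D4 schedule".

```python
import math


def plan_chunks(txn_nbytes, request_nbytes, burst_bytes, drain_ns, base_address):
    """Return (n_chunks, chunk_interval, [(arrival_offset_ns, address), ...])."""
    # Step 1: read commands carry nbytes=0, so work is sized by the request.
    work_bytes = txn_nbytes if txn_nbytes > 0 else int(request_nbytes or 0)
    # Step 2: number of burst-sized chunks.
    n_chunks = math.ceil(work_bytes / burst_bytes) if work_bytes > 0 else 0
    # Step 3: spread the chunks evenly over the drain window.
    chunk_interval = drain_ns / n_chunks if (drain_ns > 0 and n_chunks > 0) else 0.0
    # Step 5 (address part): each chunk targets base_address + i * burst_bytes.
    chunks = [((i + 1) * chunk_interval, base_address + i * burst_bytes)
              for i in range(n_chunks)]
    return n_chunks, chunk_interval, chunks


n_chunks, interval, chunks = plan_chunks(
    txn_nbytes=0, request_nbytes=1000, burst_bytes=256,
    drain_ns=32.0, base_address=0x1000)
empty_plan = plan_chunks(0, None, 256, 32.0, 0)
```

Each `(offset, address)` pair would then be fed to the PC selection and the D4 schedule.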
 ### D8. Response routing
 `_send_response` dispatches on request type and path geometry:
 | Case | Trigger | Response |
 | --- | --- | --- |
 | PE_DMA | `isinstance(txn.request, PeDmaMsg)` | New reverse-path Transaction (`is_response=True`, `nbytes=0`), same `done` |
 | Bypass (Memory Read) | no hop in `txn.path` contains `"m_cpu"` AND request is `MemoryReadMsg` | Reverse-path Transaction with `nbytes=request.nbytes` (data return) |
 | Bypass (Memory Write) | no hop in `txn.path` contains `"m_cpu"` AND request is not `MemoryReadMsg` | `txn.done.succeed()` (write completes locally) |
 | Default | otherwise | New `ResponseMsg(correlation_id, request_id, src_cube, src_pe, success=True)` on reverse path |
 The "bypass" classification matches the Memory R/W fabric path defined
 in ADR-0015 D4 (PCIE_EP → io_noc → ucie → cube router → hbm_ctrl,
 without M_CPU). The PE_DMA case is its own dedicated reverse-path to
 keep the inner-loop DMA fast (PE_DMA reads/writes do not synthesise a
 ResponseMsg envelope).
 In all reverse-path cases, the response Transaction is put onto
 `out_ports[reverse_path[1]]` — the first hop back along the recorded
 forward path. If `reverse_path` has fewer than 2 entries (degenerate
 path), the original `txn.done` is signalled directly.
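The dispatch table can be read as a short decision function. This is a sketch under stated assumptions: the message classes are empty stand-ins for the real ones, the node ids are made up for illustration, and the bypass test is assumed to be a substring match of `"m_cpu"` against each hop of the path.

```python
class PeDmaMsg: ...          # stand-in for the real message class
class MemoryReadMsg: ...     # stand-in
class MemoryWriteMsg: ...    # stand-in


def classify_response(request, path):
    """Return which row of the D8 table applies."""
    if isinstance(request, PeDmaMsg):
        return "pe_dma"
    # Assumption: "bypass" means no hop on the forward path is an M_CPU.
    bypass = not any("m_cpu" in node for node in path)
    if bypass and isinstance(request, MemoryReadMsg):
        return "bypass_read"
    if bypass:
        return "bypass_write"
    return "default"


def first_reverse_hop(path):
    """First hop back along the forward path, or None for a degenerate path."""
    reverse_path = list(reversed(path))
    return reverse_path[1] if len(reverse_path) >= 2 else None


# Hypothetical node ids, chosen only to illustrate the two path shapes.
bypass_path = ["sip0.pcie_ep", "sip0.io0.io_noc", "sip0.cube0.router0",
               "sip0.cube0.hbm_ctrl.pe0"]
mcpu_path = ["sip0.cube0.m_cpu", "sip0.cube0.router0", "sip0.cube0.hbm_ctrl.pe0"]
```

When `first_reverse_hop` returns `None`, the degenerate-path rule applies and the original `txn.done` is signalled directly.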
 ### D9. Configurable attributes
 | Attribute | Default | Source | Notes |
 | --- | --- | --- | --- |
 | `num_pcs` | 8 | topology cube `hbm_ctrl.attrs` | Must be power of 2 |
 | `pc_bw_gbs` | 32.0 | builder-derived: `hbm_to_router_bw_gbs / num_pcs` | Enforces ADR-0017 D8 invariant |
 | `burst_bytes` | 256 | topology attrs | Must be power of 2; equals `flit_bytes` (ADR-0033 D1) |
 | `switch_penalty_ns` | 0.0 | topology attrs | Tier 0 default; non-zero models pessimistic R/W switching |
 | `efficiency` | 1.0 | topology attrs | Applied at builder time to `hbm_to_router_bw_gbs` (router-edge BW scaling only) |
 | `overhead_ns` | 0.0 | topology attrs | First-flit decode overhead (D5) |
 `pc_bw_gbs` is derived by `topology/builder.py` rather than configured
 directly so the aggregate per-PE BW matches the router-to-HBM link BW
 without yaml-side duplication.
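As a worked example of the derivation, the sketch below reproduces the 32.0 GB/s default from a router-to-HBM link of 256 GB/s. Both helper names are hypothetical, the 256 GB/s figure is inferred from the defaults in the table, and the `chunk_time` formula (`burst_bytes / pc_bw_gbs`) is an assumption about how D4's `chunk_time` is obtained.

```python
def derive_pc_bw_gbs(hbm_to_router_bw_gbs: float, num_pcs: int) -> float:
    """Split the router-to-HBM link BW evenly across pseudo-channels."""
    if num_pcs <= 0 or num_pcs & (num_pcs - 1):
        raise ValueError("num_pcs must be a power of 2")
    return hbm_to_router_bw_gbs / num_pcs


def chunk_time_ns(burst_bytes: int, pc_bw_gbs: float) -> float:
    """Time to move one burst on one PC.

    1 GB/s equals 1 byte/ns (decimal GB), so bytes / (GB/s) yields ns.
    """
    return burst_bytes / pc_bw_gbs


pc_bw = derive_pc_bw_gbs(hbm_to_router_bw_gbs=256.0, num_pcs=8)
burst_ns = chunk_time_ns(burst_bytes=256, pc_bw_gbs=pc_bw)
```

Because `num_pcs * pc_bw` equals the link BW by construction, the aggregate invariant holds for any valid `num_pcs`.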
 ## Consequences
 ### Positive
 - Address-based PC selection preserves multi-stream HBM parallelism
  that an address-blind round-robin would collapse — important for
  multi-PE workloads with disjoint HBM regions.
 - Flit-aware path (D5) + async finalize (D6) preserves wormhole
  pipelining and exposes PC parallelism for back-to-back single-flit
  transactions.
 - Single source of PC scheduling truth (D4 mechanism, used by both D5
  flit path and D7 chunk-loop path).
 - Builder-derived `pc_bw_gbs` enforces ADR-0017 D8 in code, not yaml
  discipline.
 ### Negative
 - No bank-level conflict modelling within a PC; address-blind to
  bank/row-buffer reuse (ADR-0033 D3).
 - No HBM scheduler (FR-FCFS / write-buffer / watermark drain); fixed
  FIFO per PC. Bursty mixed R/W is approximated by `switch_penalty_ns`
  (ADR-0033 D2).
 - `_txn_state` is a regular dict keyed by `id(txn)`; in-flight state
  accumulates per concurrent transaction and is removed only on
  `is_last`. Adequate for current workloads.
 ## Links
 - ADR-0001 (Physical address layout — PC bit field comment)
 - ADR-0015 D4 (Memory R/W fabric path — bypass response case)
 - ADR-0017 D4 (Per-PE HBM partitioning — attachment to PE routers)
 - ADR-0017 D8 (HBM channel mapping mode — n:1 aggregate this ADR
  implements)
 - ADR-0017 D9 (AddressResolver — `hbm_ctrl.pe{pe_id}` endpoint
  resolution)
 - ADR-0033 D1 (Modelled precisely — per-PC parallelism, switch penalty,
  flit-aware PC commit, first-flit overhead, chunk-loop drain)
 - ADR-0033 D2 (Switch-penalty default 0 — ideal scheduler amortisation)
@@ -0,0 +1,286 @@
 # ADR-0035: M_CPU and M_CPU.DMA Component Model
 ## Status
 Accepted
 ## Context
 M_CPU is the cube-level command processor. It receives commands from
 IO_CPU (or from PCIE_EP when the engine routes Memory R/W through
 M_CPU as a fallback), fans them out to the PEs in its cube, and
 aggregates per-PE responses into a single ResponseMsg sent back to
 IO_CPU on the reverse path.
 M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W
 fan-out. Per ADR-0015 D5 it is **not** a separate topology node —
 it lives as internal state of `MCpuComponent`.
 This ADR documents the M_CPU component implementation that realizes
 those responsibilities, including the three distinct fan-out paths
 (Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource
 model, and the response aggregation contract.
 ## Decision
 ### D1. Role
 M_CPU has three responsibilities:
 1. **Transit forwarding** — when not the terminal hop (e.g., on the
   reverse response path PE → M_CPU → IO_CPU), forwards Transactions
   to `next_hop` in their pre-computed path.
 2. **Multi-PE fan-out at terminal hop** — dispatches to one of three
   fan-out paths based on request type (D2).
 3. **Response aggregation** — collects per-PE responses, sends a
   single aggregate ResponseMsg back to IO_CPU on the reverse path.
 Per invocation (`run()`): applies `overhead_ns` once per incoming
 Transaction.
 M_CPU does **not**:
 - Decide routing — paths are pre-computed by the router (ADR-0002).
 - Handle PE-internal execution — PE_CPU / PE_SCHEDULER / engines
  (ADR-0014).
 - Decode addresses — `ctx.resolver.resolve(pa)` returns the per-PE
  `hbm_ctrl.pe{X}` directly (ADR-0017 D9).
 - Interpret tensor or kernel semantics — fan-out dispatch by Python
  isinstance check only.
 ### D2. Three fan-out paths dispatched by request type
 At the terminal hop the worker dispatches by request type:
 ```python
 elif self.ctx is not None and txn.request is not None:
    if isinstance(txn.request, KernelLaunchMsg):
        env.process(self._kernel_launch_fanout(env, txn))
    elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
        env.process(self._mmu_msg_fanout(env, txn))
    else:
        env.process(self._dma_fanout(env, txn))
 ```
 Each path uses a different router method:
 - `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()` — the
  M_CPU-specific DMA path that avoids PE pipeline nodes.
 - `_kernel_launch_fanout` uses `ctx.router.find_node_path()` — the
  generic NOC command path to PE_CPU.
 - `_mmu_msg_fanout` uses `ctx.router.find_node_path()` — NOC command
  path to PE_MMU.
 ### D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)
 `MCpuComponent.start()` initializes two SimPy resources:
 ```python
 self._dma_write = simpy.Resource(env, capacity=1)  # MemoryWriteMsg
 self._dma_read  = simpy.Resource(env, capacity=1)  # MemoryReadMsg
 ```
 Properties:
 - **Not a topology node** — managed entirely inside `MCpuComponent`;
  does not appear in `topology.yaml` or in the compiled graph.
 - **Independent read and write channels** — concurrent in-flight
  Memory R/W is allowed.
 - **Capacity=1 per channel** serializes the **dispatch step**
  (`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory
  R/W requests at this M_CPU. Actual fabric transfer time is modeled
  by wire processes between components (ADR-0015 D2) and by
  `drain_ns` at terminal hops; the DMA resource does not gate
  transfer duration.
 Resource selection is request-type-based:
 ```python
 dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
 ```
 ### D4. Transit forwarding at non-terminal hops
 When `txn.next_hop` is not None — typical for the reverse response
 path (PE → M_CPU → IO_CPU) — the worker forwards normally:
 ```python
 if next_hop:
    yield self.out_ports[next_hop].put(txn.advance())
 ```
 The fan-out branches fire only at the terminal hop. The same component
 therefore serves both forward command dispatch and reverse response
 relay roles.
 ### D5. DMA fan-out (`_dma_fanout` — Memory R/W)
 For each Memory R/W request at terminal hop:
 1. `_resolve_dma_destinations(request)` returns a per-PE
   `hbm_ctrl.pe{X}` derived from the request's PA via
   `ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9).
 2. For each destination:
   - Acquire the appropriate DMA resource (`_dma_write` or
     `_dma_read`) via `with dma_res.request() as req`.
   - Resolve path via `ctx.router.find_mcpu_dma_path()`.
   - Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`.
   - Create sub-Transaction carrying `drain_ns` and dispatch to
     `path[1]`.
 3. Track `max_drain_ns` across destinations and record it as
   `txn.result_data["xfer_ns"]` after all responses arrive.
 4. After all per-PE responses are collected (D8), send an aggregate
   ResponseMsg on the reverse command path back to IO_CPU.
 PA decode fallback (`f"{cube_prefix}.hbm_ctrl"`) is legacy dead code —
 no such node exists after ADR-0017 D4's per-PE partitioning. Kept
 defensively but does not route to a real destination.
 ### D6. Kernel launch fan-out (`_kernel_launch_fanout`)
 For `KernelLaunchMsg` at terminal hop:
 1. `_resolve_pe_ids(target_pe)` → list of PE ids in this cube.
 2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via
   `ctx.router.find_node_path()`.
 3. **`target_start_ns` handling** (ADR-0009 D5):
   - If the request already carries `target_start_ns` (stamped by
     IO_CPU per ADR-0036 D3): **pass through unchanged**.
   - If absent (direct-to-M_CPU launch in unit tests): compute a
     per-cube barrier `env.now + max(per-PE leg latency)` and stamp
     via `dataclasses.replace`.
 4. Dispatch sub-Transactions with `nbytes=0` (kernel launch is a
   control message; preserving nbytes=0 keeps fan-out off the shared
   first-hop fabric BW, mirroring ADR-0036 D4).
 5. After all per-PE responses arrive (D8), aggregate per-PE metrics
   from each sub-Transaction's `result_data` into the parent
   transaction:
   ```python
   txn.result_data["pe_exec_ns"]  = max(existing, max(pe_exec_values))
   txn.result_data["dma_ns"]      = max(existing, max(dma_values))
   txn.result_data["compute_ns"]  = max(existing, max(compute_values))
   ```
   The max-merge with the existing value matters because cross-cube
   IO_CPU fan-out shares the same parent `result_data`; merging
   prevents one cube from clobbering another's metric.
 6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
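The max-merge in step 5 can be illustrated in isolation. This is a minimal sketch: `merge_pe_metrics` is a hypothetical helper, the metric values are invented, and a missing key in the parent is assumed to behave like 0.

```python
METRIC_KEYS = ("pe_exec_ns", "dma_ns", "compute_ns")


def merge_pe_metrics(parent_result: dict, sub_results: list) -> dict:
    """Max-merge per-PE metrics into the shared parent result_data."""
    for key in METRIC_KEYS:
        values = [r[key] for r in sub_results if key in r]
        if values:
            # Merge against the existing value so another cube's metric survives.
            parent_result[key] = max(parent_result.get(key, 0.0), max(values))
    return parent_result


parent = {}
# Cube 0 reports two PEs.
merge_pe_metrics(parent, [
    {"pe_exec_ns": 120.0, "dma_ns": 40.0, "compute_ns": 80.0},
    {"pe_exec_ns": 100.0, "dma_ns": 55.0, "compute_ns": 45.0},
])
# Cube 1 merges into the same parent without clobbering cube 0's maxima.
merge_pe_metrics(parent, [
    {"pe_exec_ns": 90.0, "dma_ns": 70.0, "compute_ns": 20.0},
])
```

A plain assignment in the second call would have lowered `pe_exec_ns` from 120.0 to 90.0, which is the clobbering the merge avoids.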
 ### D7. MMU map/unmap fan-out (`_mmu_msg_fanout`)
 For `MmuMapMsg` / `MmuUnmapMsg` at terminal hop:
 1. `_resolve_pe_ids(target_pe)` → PE ids.
 2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via
   `find_node_path()`.
 3. Dispatch sub-Transactions with `nbytes=0`.
 4. PE_MMU is a terminal node — it does **not** send a ResponseMsg
   back. Instead, the sub-Transaction's own `sub_done` event is the
   completion signal.
 5. Wait for all `sub_done` events in-line (does **not** use
   `_pending` counter — D8 is for response-bearing fan-out only).
 6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
 ### D8. Response aggregation (`_pending` + `_parent_txns`)
 For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg
 arriving on the reverse path):
 ```python
 self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
 self._parent_txns: dict[str, Any] = {}
 ```
 - On dispatch: register `(expected, received=0, all_done)` and
  remember the parent transaction.
 - `_worker` recognises responses by `is_response=True` and routes
  them to `_collect_response`, which increments `received` and
  signals `all_done` when `received >= expected`.
 - After `yield all_done`, the fan-out path constructs the aggregate
  ResponseMsg:
  ```python
  resp_msg = ResponseMsg(
      correlation_id=request.correlation_id,
      request_id=request.request_id,
      src_cube=cube_id,
      src_pe=-1,             # -1 = M_CPU aggregate, not a single PE
      success=True,          # no failure semantics implemented
  )
  ```
 - The response Transaction travels on `list(reversed(txn.path))`
  back to IO_CPU.
 MMU fan-out (D7) uses a simpler in-line list of `sub_done` events
 because PE_MMU is terminal — there is no ResponseMsg path to
 intercept.
 ### D9. Helpers and configurable attribute
 `_resolve_pe_ids(target_pe)`:
 - `int` → `[target_pe]`
 - `tuple[int, ...]` → `list(target_pe)`
 - `"all"` → `range(n_slices)` where `n_slices` comes from cube
  `memory_map.hbm_slices_per_cube` (default 8).
 Used by kernel-launch and MMU fan-out paths.
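The three cases map directly onto a small function. The sketch below takes `n_slices` as a parameter instead of reading it from the cube's `memory_map`, returns a list for `"all"` where the text gives a `range`, and raises on unrecognised input, which is an assumption since the text does not say what happens then.

```python
def resolve_pe_ids(target_pe, n_slices: int = 8) -> list:
    """Expand a target_pe specifier into a list of PE ids."""
    if isinstance(target_pe, int):
        return [target_pe]
    if isinstance(target_pe, tuple):
        return list(target_pe)
    if target_pe == "all":
        return list(range(n_slices))
    # Assumed behaviour: reject anything else explicitly.
    raise ValueError(f"unsupported target_pe: {target_pe!r}")


single = resolve_pe_ids(3)
subset = resolve_pe_ids((0, 2, 5))
everyone = resolve_pe_ids("all", n_slices=4)
```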
 A single configurable attribute, `overhead_ns`, drives per-instance latency:
 | Site | impl name | overhead_ns |
 | --- | --- | --- |
 | Cube `m_cpu` | `builtin.m_cpu` | 5.0 |
 Applied once in `run()` per Transaction — models command
 interpretation and dispatch-decision time at M_CPU.
 ## Consequences
 ### Positive
 - Three fan-out paths are clearly separated by request type — adding
  a new request kind is an isinstance branch + one fan-out method.
 - M_CPU.DMA channels are independent (read and write run concurrently)
  and serialize only the dispatch step at capacity=1.
 - Transit-vs-terminal behavior is a single `if next_hop` check, so
  the same component handles forward dispatch and reverse response
  relay without role duplication.
 - `target_start_ns` passthrough (D6) preserves the cross-cube barrier
  established by IO_CPU (ADR-0036 D3), while the fallback computation
  keeps direct-to-M_CPU unit tests working.
 - Per-PE metric `max`-merge against existing parent `result_data`
  values is robust to cross-cube IO_CPU fan-out sharing the same
  parent.
 ### Negative
 - No partial-failure semantics — a missing per-PE response stalls the
  parent `all_done` indefinitely. Acceptable for simulation; not
  suitable as a production-style endpoint.
 - `_resolve_dma_destinations`'s cube-wide hbm_ctrl fallback is dead
  code (no such node exists post-ADR-0017 D4). Kept defensively;
  invites confusion and merits a follow-up cleanup.
 - DMA resource serialization applies only at dispatch (the `put` call
   is instantaneous in unbounded stores). The capacity=1 channel
   therefore models "one dispatch at a time per channel at this M_CPU",
   not "transfer duration serialization"; readers must consult wire
   processes (ADR-0015 D2) and `drain_ns` for actual transfer
   parallelism.
 ## Links
 - ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
 - ADR-0009 D5 (`target_start_ns` — passed through unchanged when
  present; computed as per-cube barrier when absent)
 - ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out
  point)
 - ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same
  contract at cube level)
 - ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a
  topology node)
 - ADR-0017 D9 (AddressResolver returns per-PE `hbm_ctrl.pe{X}`)
 - ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes
  through unchanged; nbytes=0 invariant preserved through fan-out)
@@ -0,0 +1,216 @@
 # ADR-0036: IO_CPU Component Model
 ## Status
 Accepted
 ## Context
 IO_CPU is the IO chiplet's host-facing endpoint inside the simulation
 graph. PCIE_EP receives host messages from the runtime API and routes
 them via the io_noc; for command-bearing requests (KernelLaunch,
 MmuMap/Unmap) the io_noc forwards to IO_CPU, which:
 - Fans out the request to per-cube M_CPUs.
 - Aggregates per-cube responses into a single host-visible completion.
 - For kernel launches, stamps a global `target_start_ns` barrier so
  every PE across every targeted cube begins kernel body execution at
  the same simulated time (ADR-0009 D5).
 Memory R/W traffic bypasses IO_CPU per ADR-0015 D4 / ADR-0016 D3;
 this component therefore handles only command-plane traffic in normal
 operation.
 This ADR documents the IO_CPU component implementation that realizes
 those responsibilities.
 ## Decision
 ### D1. Role
 IO_CPU is the host-facing endpoint of the IO chiplet. It has two
 primary responsibilities:
 1. **Multi-cube fan-out** — distribute KernelLaunchMsg / MmuMapMsg /
   MmuUnmapMsg to per-cube M_CPUs.
 2. **Response aggregation** — collect per-cube ResponseMsg, signal
   parent `txn.done` when all targeted cubes have responded.
 A third, narrower responsibility applies only to KernelLaunchMsg:
 **`target_start_ns` global barrier stamping** (D3).
 The component does **not**:
 - Decide routing — paths are pre-computed by the router (ADR-0002).
 - Decode tensor or kernel internals — those concerns belong to
  M_CPU / PE_CPU / engines.
 - Handle PE-level fan-out — M_CPU fans out within a cube (ADR-0009 D3).
 - Handle Memory R/W data path — those bypass IO_CPU per ADR-0015 D4
  and ADR-0016 D3 (Memory R/W resolution code in
  `_resolve_cube_targets` exists as a defensive fallback only).
 Per invocation (`run()`): applies the configured `overhead_ns` once
 per incoming Transaction (D8).
 ### D2. Forward path — multi-cube fan-out
 When a non-response Transaction arrives, the worker:
 1. Pays `overhead_ns` via `run()`.
 2. Calls `_resolve_cube_targets` to derive the list of `(sip, cube)`
   targets from the request (D5).
 3. For each target:
   - Resolves M_CPU node id via `ctx.resolver.find_m_cpu(sip, cube)`.
   - Resolves the path via `ctx.router.find_node_path(io_cpu, m_cpu)`.
   - Creates a per-cube sub-Transaction with `path` populated and
     forwards it to `path[1]` (the first hop on the io_noc).
 4. Registers aggregation state: `_pending[request_id] = (expected,
   received=0, parent_done)`.
 ### D3. KernelLaunch `target_start_ns` global barrier (ADR-0009 D5)
 IO_CPU is the canonical stamper for `target_start_ns`. When the
 request is a `KernelLaunchMsg`, IO_CPU computes a single global
 barrier covering every targeted PE across every targeted cube:
 ```text
 for (sip, cube) in cube_targets:
    leg1 = compute_path_latency_ns(io_cpu → m_cpu(sip, cube), nbytes=0)
    for pe_id in target_pe_ids:
        leg2 = compute_path_latency_ns(m_cpu → pe_cpu(sip, cube, pe_id),
                                       nbytes=0)
        latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns
        global_max = max(global_max, latency)
 target_start_ns = env.now + global_max
 ```
 The request is then replaced (via `dataclasses.replace`) so the
 stamped value propagates through the fan-out.
 Two overhead corrections:
 - `io_overhead_ns` is subtracted because IO_CPU has already paid it
  in `run()` before this method runs.
 - `m_overhead_ns` is subtracted once because it appears as the
  endpoint of leg1 *and* the start of leg2 in path latency, but
  M_CPU pays it only once at run time.
 Every downstream PE_CPU yields until `target_start_ns` before
 beginning kernel body execution; all PEs therefore start at the same
 simulated time regardless of how long their individual dispatch path
 took.
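The barrier arithmetic can be checked with plain numbers. In this sketch the path latencies are supplied as pre-computed dictionaries in place of calls to `compute_path_latency_ns`; the function name, the latency values, and the initial `global_max` of 0 are all assumptions made for illustration.

```python
def compute_target_start_ns(now, leg1_ns, leg2_ns, io_overhead_ns, m_overhead_ns):
    """Global barrier over every targeted PE in every targeted cube.

    leg1_ns: {(sip, cube): io_cpu -> m_cpu latency in ns}
    leg2_ns: {(sip, cube, pe_id): m_cpu -> pe_cpu latency in ns}
    """
    global_max = 0.0
    for (sip, cube), leg1 in leg1_ns.items():
        for (s, c, _pe_id), leg2 in leg2_ns.items():
            if (s, c) != (sip, cube):
                continue
            # IO_CPU overhead is already paid; M_CPU overhead is counted in both legs.
            latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns
            global_max = max(global_max, latency)
    return now + global_max


leg1 = {(0, 0): 30.0, (0, 1): 42.0}
leg2 = {(0, 0, 0): 12.0, (0, 0, 1): 14.0, (0, 1, 0): 12.0, (0, 1, 1): 16.0}
target_start_ns = compute_target_start_ns(
    now=100.0, leg1_ns=leg1, leg2_ns=leg2, io_overhead_ns=10.0, m_overhead_ns=5.0)
```

The slowest leg pair here is cube 1, PE 1 (42 + 16 - 10 - 5 = 43 ns), so every PE waits until 143 ns.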
 ### D4. KernelLaunch sub-Transactions carry `nbytes=0`
 Per-cube sub-Transactions for KernelLaunchMsg force `nbytes=0`,
 overriding the parent `txn.nbytes`:
 - Kernel launch is a control message; payload size is irrelevant at
  the data-fabric level.
 - If `nbytes > 0`, every per-cube sub-txn occupies fabric BW on the
   io_noc's shared first hop. With 16 cubes this serializes the fan-out,
   so the launch reaches far M_CPUs after `target_start_ns` has already
   passed, breaking the D3 invariant.
 Non-KernelLaunch sub-Transactions preserve `txn.nbytes` (only relevant
 for the defensive Memory R/W fallback path, which carries actual
 payload sizes).
 ### D5. Per-request-type cube target resolution
 `_resolve_cube_targets` dispatches by request type:
 | Request type | Source of `(sip, cube)` | `target_cubes="all"` semantics |
 | --- | --- | --- |
 | `MemoryWriteMsg` | `dst_sip`, `dst_cube` (or `PhysAddr.decode(dst_pa).die_id` fallback) | single cube derived from PA decode |
 | `MemoryReadMsg` | `src_sip`, `src_cube` (or `PhysAddr.decode(src_pa).die_id` fallback) | single cube derived from PA decode |
 | `KernelLaunchMsg` | tensor shards filtered by `shard.sip == my_sip` | every cube that owns a shard on this SIP |
 | `MmuMapMsg` / `MmuUnmapMsg` | `target_cubes` list, filtered to this SIP | `range(cubes_per_sip)` from spec |
 Each IO_CPU instance fans out only within its own SIP — `_my_sip()`
 parses the SIP id from the node id (e.g., `sip0.io0.io_cpu` → 0).
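A minimal sketch of the node-id parsing, assuming ids always begin with `sip<N>.` as in the example; the standalone function name `my_sip` and the error handling are illustrative and not taken from the implementation.

```python
import re


def my_sip(node_id: str) -> int:
    """Extract the SIP index from a node id such as 'sip0.io0.io_cpu'."""
    match = re.match(r"sip(\d+)\.", node_id)
    if match is None:
        # Assumed behaviour for ids that do not follow the sip<N>. prefix.
        raise ValueError(f"node id has no SIP prefix: {node_id!r}")
    return int(match.group(1))


sip_a = my_sip("sip0.io0.io_cpu")
sip_b = my_sip("sip12.io0.io_cpu")
```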
 The Memory R/W rows exist for defensive completeness; the engine's
 normal path routes Memory R/W via `_process_memory_direct()` /
 `find_memory_path()`, bypassing IO_CPU entirely (ADR-0015 D4 /
 ADR-0016 D3).
 ### D6. Response aggregation
 `_pending: dict[request_id → (expected, received, parent_done)]`:
 - On dispatch: register `(len(cube_targets), 0, txn.done)`.
 - `_worker` recognises responses by `is_response=True` and routes
  them to `_collect_response`.
 - `_collect_response` increments `received`; when `received >=
  expected`, `parent_done.succeed()` is invoked and the entry is
  removed from `_pending`.
 This is a simple per-request counter. There is no per-cube identity
 tracking and no partial-failure handling: a missing response stalls
 the parent `done` event indefinitely. Production-style failure paths
 are out of scope for the current simulator model.
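The counter protocol is small enough to sketch without SimPy. Here `Event` is a tiny stand-in exposing only `succeed()` and a `triggered` flag, and `ResponseAggregator` is a hypothetical wrapper around the `_pending` dict; neither name comes from the implementation.

```python
class Event:
    """Stand-in for a SimPy event: only succeed() and triggered are modelled."""

    def __init__(self):
        self.triggered = False

    def succeed(self):
        self.triggered = True


class ResponseAggregator:
    def __init__(self):
        self._pending = {}  # request_id -> (expected, received, parent_done)

    def register(self, request_id, expected, parent_done):
        self._pending[request_id] = (expected, 0, parent_done)

    def collect(self, request_id):
        expected, received, parent_done = self._pending[request_id]
        received += 1
        if received >= expected:
            parent_done.succeed()
            del self._pending[request_id]
        else:
            # Tuples are immutable, so the entry is rewritten with the new count.
            self._pending[request_id] = (expected, received, parent_done)


agg = ResponseAggregator()
done = Event()
agg.register("req-1", expected=3, parent_done=done)
agg.collect("req-1")
agg.collect("req-1")
triggered_after_two = done.triggered
agg.collect("req-1")
```

If the third response never arrives, `done` is never triggered and the entry stays in `_pending`, which is the stall described above.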
 ### D7. `target_pe` resolution helper
 `_resolve_pe_ids(target_pe)`:
 - `int` → `[target_pe]`.
 - `tuple[int, ...]` → `list(target_pe)`.
 - `"all"` → `range(n_slices)`, where `n_slices` comes from cube
  `memory_map.hbm_slices_per_cube` (default 8).
 Used in D3's barrier computation to enumerate every PE target per
 cube.
 ### D8. Configurable `overhead_ns`
 A single attribute drives per-instance latency:
 | Site | impl name | overhead_ns |
 | --- | --- | --- |
 | IO chiplet `io_cpu` | `builtin.io_cpu` | 10.0 |
 Applied once in `run()` per Transaction. Models command
 interpretation + dispatch-decision time at IO_CPU.
 ## Consequences
 ### Positive
 - Cross-cube and cross-SIP kernel launches share a single global
  barrier (D3 + D4) — no per-cube divergence in start time.
 - nbytes=0 invariant keeps fan-out off the shared first-hop fabric
  BW, preserving the barrier's accuracy at scale (16 cubes).
 - Response aggregation via a single counter → minimal state,
  deterministic ordering of completion.
 - Per-SIP scoping (`_my_sip()`) keeps IO_CPUs in different SIPs
  cleanly independent.
 ### Negative
 - No partial-failure semantics — a missing per-cube response
  indefinitely stalls the parent. Adequate for simulation but not
  suitable as a production-style endpoint.
 - `_pending` is a regular dict; in-flight requests accumulate state.
  Acceptable for current benchmark workloads (few concurrent
  outstanding launches); unbounded in principle.
 - The Memory R/W resolution branches in `_resolve_cube_targets` are
  dead code in the normal engine path. Kept defensively but invite
  drift if the bypass path ever changes.
 ## Links
 - ADR-0002 (Routing distance — path computation)
 - ADR-0009 D1 (Kernel launch is an endpoint request to IO_CPU)
 - ADR-0009 D3 (M_CPU fans out within a cube; IO_CPU fans out across
  cubes)
 - ADR-0009 D5 (target_start_ns canonical stamping at IO_CPU)
 - ADR-0011 D-VA3 (MmuMapMsg routes through IO_CPU for cube fan-out)
 - ADR-0012 (Host ↔ IO_CPU message schema)
 - ADR-0015 D4 (Memory R/W bypasses IO_CPU; Kernel Launch via IO_CPU)
 - ADR-0016 D1 (IO chiplet io_noc — IO_CPU attaches here)
 - ADR-0016 D3 (Memory R/W path bypasses IO_CPU)
 - ADR-0016 D4 (Kernel Launch path through IO_CPU for command
  interpretation)
@@ -0,0 +1,200 @@
 # ADR-0037: Forwarding Component (forwarding_v1)
 ## Status
 Accepted
 ## Context
 The simulation graph has many node positions that exist purely to model
 fabric traversal — NOC mesh routers, switches, UCIe protocol endpoints,
 IO chiplet io_noc, transit cubes. These share a common pattern: receive
 a message, apply per-component overhead (modeling header decode +
 routing decision time), forward to the next hop along the pre-computed
 path.
 This ADR defines the contract for these transit nodes: a single
 component type (`TransitComponent`) that handles flit-aware forwarding
 with wormhole cut-through semantics, used under multiple impl names
 according to the conceptual role each instance plays.
 ## Decision
 ### D1. Role
 The Forwarding component (`TransitComponent` class) is a **stateless
 transit node** in the simulation graph. It models any fabric position
 where a message physically traverses but no semantic processing
 happens.
 Per traversal, the component:
 1. Reads an incoming Transaction or Flit from an `in_port`.
 2. Applies the configured per-component overhead (`overhead_ns`),
   applied **once per Transaction** even across multi-flit payloads
   (see D2).
 3. Looks up the next hop along the Transaction's pre-computed `path`.
 4. Forwards to the corresponding `out_port`; at the terminal node
   (no next hop), signals `txn.done` once the `is_last` flit arrives.
 The component **does NOT**:
 - Decide routing — paths are pre-computed by the router (ADR-0002 /
  ADR-0017 D2). Forwarding only executes the per-hop step.
 - Model wire propagation or bandwidth occupancy — separate wire
  processes between components handle that (ADR-0015 D2).
 - Resolve addresses — the AddressResolver does that (ADR-0017 D9).
 - Aggregate completion — terminal endpoints (IO_CPU, M_CPU, HBM_CTRL)
  handle that.
 ### D2. First-flit overhead model (header decode)
 Per-Transaction `overhead_ns` is applied **exactly once**, at first
 flit arrival:
 - `_txn_decoded: set[int]` tracks which Transactions have already
  paid the overhead at this node.
 - On first-flit arrival for a Transaction: `yield self.run(env,
  msg.txn.nbytes)` — pays the overhead.
 - Subsequent flits of the same Transaction skip the overhead — they
  pipeline through with no extra delay.
 - On `is_last` flit: remove the Transaction from `_txn_decoded`.
 This models the real-HW behavior where header decode and routing
 decision happen once on first flit; payload flits then stream through
 the same path (wormhole cut-through). Multi-hop pipelining emerges
 naturally: each hop charges its own overhead to the first flit only,
 and later flits of the same Transaction pass through that hop without
 paying it again.
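The bookkeeping behind the once-per-Transaction rule can be sketched as follows. `FirstFlitOverhead` and `delay_for` are illustrative names; the sketch returns the delay a flit would incur, where the component itself yields on `run()`.

```python
class FirstFlitOverhead:
    """Charge overhead_ns to the first flit of each Transaction only."""

    def __init__(self, overhead_ns: float):
        self.overhead_ns = overhead_ns
        self._txn_decoded = set()  # Transactions that already paid here

    def delay_for(self, txn_id: int, is_last: bool) -> float:
        if txn_id in self._txn_decoded:
            delay = 0.0                      # payload flit: streams through
        else:
            self._txn_decoded.add(txn_id)    # first flit: header decode
            delay = self.overhead_ns
        if is_last:
            self._txn_decoded.discard(txn_id)  # release per-Transaction state
        return delay


node = FirstFlitOverhead(overhead_ns=2.0)
# A four-flit Transaction followed by a single-flit Transaction.
multi_flit_delays = [node.delay_for(txn_id=1, is_last=(i == 3)) for i in range(4)]
single_flit_delay = node.delay_for(txn_id=2, is_last=True)
```

A single-flit Transaction is both first and last, so it pays the overhead and is removed from the set in the same call.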
 ### D3. Serial worker forwarding (preserves order)
 The component's worker is a single SimPy process that consumes flits
 from `_inbox` and forwards them serially in arrival order. The
 component does NOT spawn `env.process(...)` per flit.
 Rationale: if the first flit yields on `overhead_ns` while subsequent
 flits run in parallel processes, the later flits can overtake the
 first. This produces out-of-order delivery and lets the `is_last`
 flit arrive at the destination before the first flit — corrupting
 both the transaction's completion semantics and any flit-index-based
 processing downstream.
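The overtaking hazard can be shown with forwarding times alone. The sketch below is a simplified timing model, not the SimPy worker: it assumes flits arrive 1 ns apart, that only flit 0 pays the overhead, and that forwarding itself takes no time.

```python
def parallel_finish_times(arrivals, overhead_ns):
    """One process per flit: only flit 0 waits for the header overhead."""
    return [t + (overhead_ns if i == 0 else 0.0) for i, t in enumerate(arrivals)]


def serial_finish_times(arrivals, overhead_ns):
    """Single worker: a flit is never forwarded before its predecessor."""
    finish, now = [], 0.0
    for i, t in enumerate(arrivals):
        now = max(now, t) + (overhead_ns if i == 0 else 0.0)
        finish.append(now)
    return finish


arrivals = [0.0, 1.0, 2.0, 3.0]  # flit 0 is the first flit, flit 3 is is_last
parallel = parallel_finish_times(arrivals, overhead_ns=5.0)
serial = serial_finish_times(arrivals, overhead_ns=5.0)
```

In the parallel model the `is_last` flit leaves at 3 ns, ahead of the first flit at 5 ns; in the serial model the forwarding times never decrease, so arrival order is preserved.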
 ### D4. Path-based next-hop routing
 Routing is **not** a Forwarding-component concern. The Transaction
 arrives with a pre-computed `path` (built by the router; ADR-0002 /
 ADR-0017 D2). The component just looks up its own position in the
 path and forwards to `path[index + 1]`:
 ```python
 def _next_hop_in_path(self, txn):
    my_id = self.node.id
    path = txn.path
    for i, n in enumerate(path):
        if n == my_id and i + 1 < len(path):
            return path[i + 1]
    return None
 ```
 If `next_hop` is found and present in `out_ports`, the flit is
 forwarded. Otherwise (terminal node), `txn.done.succeed()` is
 invoked when the `is_last` flit arrives.
 ### D5. Flit-aware mode with Non-Flit fallback
 `_FLIT_AWARE = True` opts this component out of the base class's
 flit-reassembly logic in `_fan_in`. Flits are placed directly on
 `_inbox` (no reassembly), enabling per-flit handling in the worker
 loop (D2, D3).
 Non-Flit messages — zero-byte control Transactions and other
 non-chunkified payloads — fall through to the base class's legacy
 `_forward_txn` path via `env.process`. This preserves backward
 compatibility for control-plane traffic that does not benefit from
 flit-level processing.
 ### D6. Multi-stream merging at the base class
 Multi-stream FIFO merging at routers is the base class's
 responsibility, not Forwarding's. The base class's `_fan_in` spawns
 one process per `in_port`; all push to a single shared `_inbox`.
 Flits from different upstream streams therefore interleave at
 flit granularity in `_inbox`'s FIFO order.
 The Forwarding worker simply consumes `_inbox` in arrival order —
 correctly modeling per-router multi-flow arbitration as
 fair-FIFO over the shared inbox.
 ### D7. Single implementation under multiple impl names
 A single `TransitComponent` class is registered under four impl names
 in `components.yaml`:
 - `builtin.forwarding` — generic forwarding (e.g., `io_noc`,
  `noc_router`, UCIe conn bridges)
 - `builtin.switch` — tray-level switch
 - `builtin.noc` — cube-level NOC fabric (legacy singleton; current
  NOC routers use `builtin.forwarding`)
 - `builtin.ucie` — UCIe protocol endpoint
 All four aliases instantiate the same class with the same behavior.
 Per-instance differentiation lives only in `attrs.overhead_ns`.
 Separate impl names exist as intent tags for readability and to
 allow future divergence without backward-incompatible config
 changes.
 ### D8. Configurable `overhead_ns`
 A single attribute drives per-instance latency:
 | Usage site | impl name | overhead_ns |
 | --- | --- | --- |
 | Tray-level switch | `builtin.switch` | 5.0 |
 | Cube NOC router | `builtin.forwarding` | 2.0 |
 | IO chiplet io_noc | `builtin.forwarding` | 0.0 |
 | UCIe protocol endpoint (`ucie-{N,S,E,W}`) | `builtin.ucie` | 8.0 |
 | UCIe conn bridge (`ucie-{PORT}.conn{N}`) | `builtin.forwarding` | 0.0 |
 Default is 0.0. The attribute is read at each `run()` invocation, so
 dynamic reconfiguration is possible but not currently used.
 ## Consequences
 ### Positive
 - A single class handles all transit-node roles in the simulation
  graph — minimal code surface for a high-population component type.
 - Flit-aware processing + serial worker preserves wormhole semantics
  across multi-hop paths without per-flit process overhead.
 - `overhead_ns` is the only per-instance tunable; routing, BW, and
  address resolution stay cleanly separated in their own components /
  modules.
 - Multi-stream merging emerges from the base-class structure; no
  router-specific logic duplicates fair-FIFO arbitration.
 - Non-Flit fallback path keeps control-plane traffic working without
  forcing every message into the flit framework.
 ### Negative
 - The single class hides usage-site intent inside `attrs.overhead_ns`
  configuration; readers must consult `topology.yaml` +
  `components.yaml` to see which impl name maps to which behavior
  class.
 - Per-flit serial worker is a bottleneck if `overhead_ns` is large
  and many concurrent transactions arrive at the same router; current
  values (0–8 ns) make this negligible.
 ## Links
 - ADR-0002 (Routing distance — path computation)
 - ADR-0015 D1 (Component port model)
 - ADR-0015 D2 (Wire process — BW + propagation, separate from this
  component)
 - ADR-0015 D6 (Transit cube forwarding pattern)
 - ADR-0016 D1 (IO chiplet io_noc — uses this component)
 - ADR-0017 D1 (Cube NOC routers — use this component)
 - ADR-0017 D6 (UCIe decomposition — `ucie-{PORT}` instances use this
  component)
 - ADR-0033 D1 (Flit-aware pass-through, first-flit overhead,
  multi-stream merge semantics)
@@ -1,548 +0,0 @@
 # IPCQ-DMA Co-design Hardware Design Document
 **Status**: Draft — Review Requested
 **Date**: 2026-04-28
 **Authors**: YW Kang
 **Reviewers**: (HW team TBD)
 **Related**: ADR-0023 (IPCQ PE Collective), ADR-0025 (Direction Addressing)
 ---
 ## 1. Background & Motivation
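 IPCQ (Inter-PE Communication Queue) is a hardware queue mechanism for
 collective communication between PEs. Its core design principle is the
 IPCQ-DMA co-design: **when the DMA transfers data, it updates the IPCQ
 head/tail pointers automatically from piggybacked metadata, without a separate control message.**
 This document:
 1. describes how IPCQ operates at the hardware level in the current PE architecture,
 2. verifies how the simulator models that hardware,
 3. proposes a design for the actual hardware implementation, and
 4. reviews alternatives to settle on the best approach.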
 ---
 ## 2. High-level Behavior of PE_IPCQ
 ![PE Baseline Architecture](diagrams/pe_baseline.png)
 > Source: [`diagrams/pe_baseline.d2`](diagrams/pe_baseline.d2), rendered with `d2 --layout=elk --scale 1.5`.
 ### IPCQ Hardware Operation
 **HW Configuration**:
 * IPCQ sets up ring-buffer-based unidirectional queues between PEs to deliver data.
 * Each PE maintains an independent queue pair per direction (N/S/E/W, etc.).
 * For each queue pair, IPCQ maintains the sender's head/tail pointers and the receiver's head/tail pointers.
 * **IPCQ Slot Region**: the IPCQ receive buffer. As the dashed box in the diagram shows, `buffer_kind` selects one of TCM, Cube SRAM, or Local HBM as the backing store.
 Per-tier performance characteristics (simulation model values, `ipcq_types.py`):
 | Buffer Kind | Intrinsic BW | Effective BW (NoC bottleneck) | Use |
 |-------------|-------------|-------------------------------|------|
 | TCM | 512 GB/s | 512 GB/s (directly attached, bypasses the NoC) | Lowest latency, PE-internal only |
 | Cube SRAM | 512 GB/s | 128 GB/s (`sram_to_router_bw`) | Shared within the cube, limited by NoC BW |
 | Local HBM | 256 GB/s | 256 GB/s (`hbm_to_router_bw`) | Large capacity, limited by NoC BW |
 **Send path (fire-and-forget)**:
 1. PE_CPU issues `tl.send(dir, src_addr)` → an IpcqRequest is delivered to PE_IPCQ
 2. PE_IPCQ checks backpressure: `(my_head - peer_tail_cache) < peer.n_slots`
 3. Compute the peer's rx slot address: `peer_rx_base + (my_head % n_slots) × slot_size`
 4. Pass an IpcqDmaToken (data + piggyback metadata: sender_seq) to PE_DMA
 5. PE_IPCQ does `my_head++` and returns to PE_CPU immediately (without waiting for DMA completion)
 6. PE_DMA snapshots the src data and sends it over the NoC to the peer PE_DMA
 **Receive path (blocking)**:
 1. The peer PE_DMA writes the data to the slot and extracts the metadata (sender_seq, dst_addr) **in the same cycle**
 2. PE_IPCQ identifies the direction by dst_addr range matching and updates `peer_head_cache`
 3. A wakeup signal is delivered to the PE_CPU waiting in `tl.recv(dir)`
 4. PE_CPU reads the data from the slot; PE_IPCQ does `my_tail++`
 5. **Credit return**: PE_IPCQ sends a 16B credit packet (`consumer_seq`) to the sender over the NoC
 6. The sender's PE_IPCQ updates `peer_tail_cache`, releasing backpressure
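The send and receive steps above can be sketched as a small pointer model. This is a simplified illustration under stated assumptions: `QPairSide` and the four helper functions are hypothetical names, the DMA and NoC are collapsed into direct function calls, and no timing is modeled. Only the pointer names and the formulas come from the steps above.

```python
from dataclasses import dataclass


@dataclass
class QPairSide:
    """One direction's pointers on one PE (field names follow the text)."""
    n_slots: int
    slot_size: int
    rx_base: int
    my_head: int = 0          # sender write position
    my_tail: int = 0          # receiver read position
    peer_head_cache: int = 0  # advanced when data arrives
    peer_tail_cache: int = 0  # advanced when a credit arrives


def try_send(tx: QPairSide, peer: QPairSide):
    """Return (dst_addr, sender_seq), or None when the peer ring is full."""
    if tx.my_head - tx.peer_tail_cache >= peer.n_slots:
        return None  # backpressure: no free slot at the peer
    dst = peer.rx_base + (tx.my_head % peer.n_slots) * peer.slot_size
    seq = tx.my_head
    tx.my_head += 1  # fire-and-forget: do not wait for the DMA
    return dst, seq


def deliver(rx: QPairSide, sender_seq: int) -> None:
    """Inbound data: the piggybacked sender_seq advances peer_head_cache."""
    rx.peer_head_cache = max(rx.peer_head_cache, sender_seq + 1)


def try_recv(rx: QPairSide):
    """Return (slot_addr, consumer_seq for the credit), or None if empty."""
    if rx.peer_head_cache <= rx.my_tail:
        return None
    slot = rx.rx_base + (rx.my_tail % rx.n_slots) * rx.slot_size
    rx.my_tail += 1
    return slot, rx.my_tail


def on_credit(tx: QPairSide, consumer_seq: int) -> None:
    """Credit arrival releases backpressure on the sender."""
    tx.peer_tail_cache = max(tx.peer_tail_cache, consumer_seq)
```

With a two-slot ring, a third `try_send` returns `None` until `on_credit` has been called with the receiver's updated tail, which is the backpressure behavior described in step 2 of the send path.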
 **Key design principles**:
 - **Data + head pointer piggyback**: sender_seq rides on the DMA data flit, so no separate head synchronization message is needed
 - **Atomic write + metadata**: the receiving DMA performs the slot write and the metadata handoff in the same cycle (I6 invariant)
 - **Address-based direction matching**: even when several directions connect to the same peer, the dst_addr range tells them apart (ADR-0025)
 - **Credit-based flow control**: after consuming a slot, the receiver notifies the sender with a 16B credit
 ---
 ## 3. Simulator Implementation Verification
 This section verifies how the simulator models the hardware behavior above.
 ### 3.1 Mapping Intent to Implementation
 | Design intent | Simulator implementation | Location |
 |-----------|----------------|------|
 | DMA piggybacks the head pointer on the data transfer | The `IpcqDmaToken.sender_seq` field travels with the data flit | `ipcq_types.py:185` |
 | Receiving DMA handles data write + metadata handoff atomically | In `_handle_ipcq_inbound`, no yield between `store.write` and `IpcqMetaArrival` (I6) | `pe_dma.py:232-275` |
 | Send is fire-and-forget | `_handle_ipcq_outbound` does not wait on `sub_done` | `pe_dma.py:182` |
 | Recv blocks until data arrives | Waits on the condition `peer_head_cache > my_tail` | `pe_ipcq.py:263` |
 | Credit return is a separate fast path | Direct put through a SimPy Store (latency charged from the NoC path) | `pe_ipcq.py:443-469` |
 | In-flight data semantics (snapshot) | Data snapshot preserved at send time, unaffected by later src modification | `pe_dma.py:142-155` |
 | Single PE_DMA inbox | `_fan_in` merges every in_port into a single FIFO; no arbiter between the compute port and the IPCQ port | `base.py:51-53` |
 ### 3.2 Credit Return Path Modeling Details
 Credit return finds the actual NoC path with `router.find_path()` and charges
 the hop latency + BW drain computed by `compute_path_latency_ns()`.
 ```python
 # pe_ipcq.py:471-492
 def _credit_latency_ns(self, direction: str) -> float:
     # Excerpt: the lines that resolve peer_pe_dma are elided here.
     path = self.ctx.router.find_path(self._pe_prefix, peer_pe_dma)
     return self.ctx.compute_path_latency_ns(path, self._credit_size_bytes)
 ```
 However, the latency is paid with `env.timeout()`, after which the credit is put
 directly into `peer_credit_store` (a SimPy Store). No actual `Transaction` is created
 and sent hop-by-hop through the NoC, so **bandwidth contention with other traffic is not modeled.**
 | | Latency | BW Contention |
 |---|---|---|
 | Data path (IpcqDmaToken) | Modeled accurately as a NoC Transaction | Traverses the actual fabric |
 | Credit path (16B) | NoC path latency reflected accurately | No fabric Transaction injected (simplification) |
 At 16B, a credit is negligible next to a data transfer (tens to hundreds of KB),
 so the error introduced by this simplification is minimal.
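As a rough illustration of the "hop latency + BW drain" charge, the sketch below sums per-hop latencies and adds the drain time at the slowest link. The function name, its arguments, and the example numbers are hypothetical. This is not the actual `compute_path_latency_ns()` implementation, whose internals are not shown in this document.

```python
from typing import Sequence


def path_latency_ns(hop_latencies_ns: Sequence[float],
                    link_bw_gbps: Sequence[float],
                    nbytes: int) -> float:
    """Per-hop latency sum plus drain time at the bottleneck link.

    Assumed model: drain_ns = nbytes / bottleneck_bw, using the identity
    1 GB/s == 1 byte/ns so no unit conversion is needed.
    """
    bottleneck = min(link_bw_gbps)
    return sum(hop_latencies_ns) + nbytes / bottleneck


# Illustrative values only: three hops, a 128 GB/s bottleneck, 16B credit.
credit_ns = path_latency_ns([2.0, 2.0, 5.0], [128.0, 256.0], 16)
```

For a 16B credit the drain term is a fraction of a nanosecond, so the hop latencies dominate, which is in line with the argument that the simplification costs little accuracy.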
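 ### 3.3 Verification Conclusion
 The simulator implementation **accurately models** the IPCQ-DMA co-design intent, apart from the credit-path bandwidth contention simplification noted in Section 3.2.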
 ---
 ## 4. Proposed Hardware Design
 ### 4.1 Block Diagram (After Changes)
 Changes are highlighted: **(NEW)** = new, **(MOD)** = modified.
 ![PE Proposed Architecture](diagrams/pe_proposed.png)
 > Source: [`diagrams/pe_proposed.d2`](diagrams/pe_proposed.d2), rendered with `d2 --layout=elk`.
 **Key changes, Baseline → Proposed**:
 - Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW)
 - PE_IPCQ (SimPy component) → **IPCQ Controller** (HW register + combinational logic)
 - **Reserved IPCQ Slot Region** inside TCM made explicit
 - Credit Injector / Receiver connect directly to the NoC through the Fabric Port
 ### 4.2 Module Details
 #### 4.2.1 IPCQ Controller (New Module)
 A hardware control block located between PE_CPU and the DMA Engine.
 It corresponds to the simulator's `PeIpcqComponent`.
 ##### QPair Register File
 Per-direction queue pair state is held in flip-flops.
 ```
 Per-direction registers (each 64-bit):
  my_head          — sender write position (monotonic)
  my_tail          — receiver read position (monotonic)
  peer_head_cache  — last known peer head (updated by Meta Extractor)
  peer_tail_cache  — last known peer tail (updated by Credit Receive)
  rx_base_pa       — this PE's rx buffer base physical address
  peer_rx_base_pa  — peer's rx buffer base physical address
  n_slots          — ring depth (power-of-2 constraint, see below)
  slot_size        — bytes per slot
  peer_credit_tgt  — credit receive address of the peer PE
 Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare)
 Total: 8 dirs × 9 regs × 8B = 576B flip-flops
 ```
 PE_CPU can read and write these registers via MMIO (CSR). Software fills them in at init time.
 ##### Slot Address Generator (combinational)
 ```
 Input:  pointer (my_head or my_tail), n_slots, slot_size, base_pa
 Output: slot_addr = base_pa + (pointer % n_slots) * slot_size
 Implementation:
  n_slots power-of-2 constraint → pointer & (n_slots - 1)  (AND mask, 1 gate delay)
  slot_size power-of-2   → barrel shift               (1 cycle)
  64-bit add             → ripple/kogge-stone adder    (1 cycle)
 Latency: 1-2 cycles combinational
 ```
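The equivalence the Slot Address Generator relies on can be checked in software. The sketch below is an illustration, not RTL: `is_pow2` and `slot_addr` are hypothetical helper names, and the power-of-two check stands in for the configuration constraint stated above.

```python
def is_pow2(x: int) -> bool:
    """True for 1, 2, 4, 8, ... (a single set bit)."""
    return x > 0 and (x & (x - 1)) == 0


def slot_addr(pointer: int, n_slots: int, slot_size: int, base_pa: int) -> int:
    """Mask-and-shift form of base_pa + (pointer % n_slots) * slot_size."""
    if not (is_pow2(n_slots) and is_pow2(slot_size)):
        raise ValueError("n_slots and slot_size must be powers of two")
    index = pointer & (n_slots - 1)       # replaces the modulo
    shift = slot_size.bit_length() - 1    # log2(slot_size)
    return base_pa + (index << shift)     # replaces the multiply
```

For any non-negative pointer the result equals the modulo/multiply formula, which is why the hardware can avoid a divider and a multiplier when both parameters are powers of two.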
 ##### Backpressure Comparator (combinational)
 ```
 full = (my_head - peer_tail_cache) >= n_slots
 Implementation: 64-bit subtract + unsigned compare
 Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold
 Latency: 1 cycle
 ```
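A small sketch of the comparator, assuming the 64-bit registers listed in the QPair Register File and a modular (wrapping) subtract. The function name and the explicit mask are illustrative assumptions; the text above only specifies "64-bit subtract + unsigned compare".

```python
MASK64 = (1 << 64) - 1  # models 64-bit register width


def is_full(my_head: int, peer_tail_cache: int, n_slots: int) -> bool:
    """Unsigned 64-bit compare of in-flight slots against ring depth."""
    in_flight = (my_head - peer_tail_cache) & MASK64  # wraps like hardware
    return in_flight >= n_slots
```

Masking the difference keeps the comparison correct even if the head counter has wrapped past zero while the cached tail has not.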
 ##### Meta Extractor (inbound datapath sideband)
 Wired into the DMA Engine's inbound vc_comm path. It extracts metadata from the
 header of each arriving IPCQ flit and updates the queue pair state.
 ```
 Trigger: DMA inbound write completion (same cycle)
 Extract: {sender_seq, dst_addr} from flit header
 Direction matching (ADR-0025 D2):
  for each dir:
    match = (base_pa[dir] <= dst_addr) && (dst_addr < base_pa[dir] + n_slots[dir] * slot_size[dir])
  8× parallel range comparators + priority encoder
 Update: peer_head_cache[matched_dir] = max(peer_head_cache[matched_dir], sender_seq + 1)
 Output: recv_wake signal for matched direction → PE_CPU interrupt/flag
 Implementation: 8× (2 comparators + AND) + priority encoder
 Latency: 1 cycle (pipelined with DMA write, so I6 atomicity holds naturally)
 ```
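The range match and pointer update can be sketched in software as follows. The names `match_direction` and `on_meta_arrival`, the `Regions` layout, and the example addresses are assumptions for illustration. Dict iteration order stands in for the priority encoder, and the parallel comparators become a loop.

```python
from typing import Dict, Optional, Tuple

# direction -> (base_pa, n_slots, slot_size)
Regions = Dict[str, Tuple[int, int, int]]


def match_direction(dst_addr: int, regions: Regions) -> Optional[str]:
    """Return the first direction whose rx region contains dst_addr."""
    for direction, (base_pa, n_slots, slot_size) in regions.items():
        if base_pa <= dst_addr < base_pa + n_slots * slot_size:
            return direction  # dict order plays the priority-encoder role
    return None


def on_meta_arrival(peer_head_cache: Dict[str, int], regions: Regions,
                    sender_seq: int, dst_addr: int) -> Optional[str]:
    """Update the matched direction's cache; return it for the wakeup."""
    direction = match_direction(dst_addr, regions)
    if direction is not None:
        peer_head_cache[direction] = max(peer_head_cache[direction],
                                         sender_seq + 1)
    return direction
```

The `max` keeps the cached head monotonic, so a late or repeated arrival with a smaller sequence number never moves the pointer backwards.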
 ##### Credit Injector (outbound)
 ```
 Trigger: recv completion (after my_tail increments)
 Action:  pack 16B credit packet → DMA vc_comm (or a dedicated credit VC)
 Packet: {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa}
 Latency: 1 cycle to generate, then NoC traversal
 ```
 ##### Credit Receiver (inbound sideband)
 ```
 Trigger: 16B credit packet arrival (from NoC)
 Extract: {consumer_seq, dst_rx_base_pa}
 Direction matching (ADR-0025 D3):
  for each dir:
    match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa)
 Update: peer_tail_cache[matched_dir] = max(peer_tail_cache[matched_dir], consumer_seq)
 Output: send_wake signal → deassert backpressure stall
 Latency: 1 cycle
 ```
 #### 4.2.2 DMA Engine Modifications
 ##### vc_comm IPCQ-aware mode
 An IPCQ flit handling mode is added to the existing vc_comm channel.
 **Outbound**:
 1. Receive a command from the IPCQ Controller: {src_addr, dst_addr, nbytes, sender_seq}
 2. Read src_addr from TCM → snapshot into the DMA read buffer (existing DMA behavior)
 3. Flit pack: data + piggyback metadata (sender_seq, dst_addr)
 4. Inject into the NoC fabric port
 5. Fire-and-forget (does not wait for completion)
 **Inbound**:
 1. Receive an IPCQ flit from the NoC
 2. Terminal BW drain charge (drain_ns = nbytes / bottleneck_bw)
 3. Slot write latency charge (backing memory tier)
 4. **ATOMIC** (same pipeline stage, no stall insertion):
   - TCM write: data → slot address
   - Meta Extractor trigger: sender_seq + dst_addr → IPCQ Controller
 5. Done
 **Hardware guarantee of I6 atomicity**: TCM write completion and the Meta Extractor trigger
 occur in the same pipeline stage, so no separate synchronization is needed.
 The simulator's "no yield between write and IpcqMetaArrival" rule is thereby guaranteed naturally.
 ##### Data Snapshot Semantics
 Data latched in the DMA read buffer is unaffected by later modification of the src memory.
 This is standard DMA read-then-write behavior, so no additional HW is needed.
 ##### Credit Virtual Channel (Optional)
 Option A: multiplex credits onto vc_comm (distinguished as a 16B header-only flit)
 Option B: add a third, dedicated credit VC (strict priority > data)
 Option B is better for deadlock prevention, but because the BW impact of a 16B credit
 is negligible, Option A is also sufficient.
 #### 4.2.3 Fabric Flit Format Extension
 ```
 Normal data flit (e.g., 512-bit):
 ┌──────────────────────────────────────────┐
 │ [511:480] routing header (32b)           │
 │ [479:0]   payload (480b = 60B)           │
 └──────────────────────────────────────────┘
 IPCQ data flit (metadata in the first flit only):
 ┌──────────────────────────────────────────┐
 │ [511:480] routing header (32b)           │
 │   [511]    ipcq_flag (1b)                │  ← distinguishes IPCQ from normal DMA
 │   [510:509] vc_id (2b)                   │
 │   [508:480] route + hop count            │
 │ [479:416] ipcq_metadata (64b)            │  ← piggyback
 │   [479:448] sender_seq (32b)             │
 │   [447:416] dst_addr[31:0] (32b)         │  ← used for direction matching
 │ [415:0]   payload (416b = 52B)           │
 └──────────────────────────────────────────┘
 Subsequent flits: full 60B payload (no metadata)
 Credit-only flit (128-bit, header-only):
 ┌──────────────────────────────────────────┐
 │ [127:96]  routing header (32b)           │
 │   [127]   credit_flag (1b)               │
 │ [95:64]   consumer_seq (32b)             │
 │ [63:0]    dst_rx_base_pa (64b)           │
 └──────────────────────────────────────────┘
 ```
 The first flit's payload shrinks from 60B to 52B (13% overhead).
 In a multi-flit transfer the subsequent flits carry a full payload, so the overhead is < 1% for large transfers.
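To make the first-flit layout concrete, the sketch below packs and unpacks the header fields as a Python integer, using the bit positions from the diagram above. The function names, the big-endian payload packing, and the zero padding are assumptions for illustration. The layout itself is an example format ("e.g., 512-bit"), not a fixed specification.

```python
FIRST_PAYLOAD_BYTES = 52  # 416b payload in the first IPCQ flit
FULL_PAYLOAD_BYTES = 60   # 480b payload in subsequent flits


def pack_first_flit(vc_id: int, route: int, sender_seq: int,
                    dst_addr: int, payload: bytes) -> int:
    """Build the 512-bit first IPCQ flit as an int (bit 511 is the MSB)."""
    if len(payload) > FIRST_PAYLOAD_BYTES:
        raise ValueError("first flit carries at most 52 payload bytes")
    word = 1 << 511                               # ipcq_flag
    word |= (vc_id & 0x3) << 509                  # [510:509]
    word |= (route & ((1 << 29) - 1)) << 480      # [508:480]
    word |= (sender_seq & 0xFFFFFFFF) << 448      # [479:448]
    word |= (dst_addr & 0xFFFFFFFF) << 416        # [447:416], low 32 bits
    padded = payload.ljust(FIRST_PAYLOAD_BYTES, b"\0")
    word |= int.from_bytes(padded, "big")         # [415:0]
    return word


def unpack_meta(word: int) -> dict:
    """Extract the fields the Meta Extractor needs from the first flit."""
    return {
        "ipcq_flag": (word >> 511) & 1,
        "vc_id": (word >> 509) & 0x3,
        "sender_seq": (word >> 448) & 0xFFFFFFFF,
        "dst_addr_lo": (word >> 416) & 0xFFFFFFFF,
    }


def flits_needed(nbytes: int) -> int:
    """Flit count when the first flit holds 52B and the rest hold 60B."""
    rest = max(0, nbytes - FIRST_PAYLOAD_BYTES)
    return 1 + -(-rest // FULL_PAYLOAD_BYTES)     # ceiling division
```

Only the low 32 bits of `dst_addr` travel in the flit, so direction matching on the receiver has to work on that truncated value in this example format.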
 #### 4.2.4 TCM IPCQ Slot Region
 ```
 TCM Memory Map (16MB):
 ┌─────────────────────────────┐ 0x000000
 │  Kernel Working Memory      │
 │  (compute tensors)          │
 │  ~14MB                      │
 ├─────────────────────────────┤ 0xE00000
 │  IPCQ RX Buffers            │
 │  Dir N: slots × slot_size   │
 │  Dir S: slots × slot_size   │
 │  Dir E: slots × slot_size   │
 │  Dir W: slots × slot_size   │
 │  ~1MB                       │
 ├─────────────────────────────┤ 0xF00000
 │  IPCQ Metadata / Scratch    │
 │  ~1MB                       │
 └─────────────────────────────┘ 0xFFFFFF
 ```
 Placing the IPCQ region in the upper banks of TCM minimizes bank conflicts
 with compute accesses (see Section 7.1).
 ---
 ## 5. End-to-End Dataflow
 ### 5.1 Sequence Diagram
 ```mermaid
 sequenceDiagram
    participant CPU_A as PE_A: PE_CPU
    participant IPCQ_A as PE_A: IPCQ Ctrl
    participant DMA_A as PE_A: DMA
    participant NOC as NoC Fabric
    participant DMA_B as PE_B: DMA
    participant IPCQ_B as PE_B: IPCQ Ctrl
    participant TCM_B as PE_B: TCM
    participant CPU_B as PE_B: PE_CPU
    Note over CPU_A: tl.send(dir="E", src=0x1000)
    CPU_A->>IPCQ_A: MMIO: send request
    Note over IPCQ_A: Backpressure check:<br/>(head - peer_tail_cache) < n_slots → PASS<br/>Slot addr gen:<br/>dst = peer_rx_base + (head%n) × slot_size
    IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head}
    Note over IPCQ_A: my_head++
    IPCQ_A-->>CPU_A: send returns (fire-and-forget)
    Note over DMA_A: TCM read → snapshot in read buffer<br/>Flit pack: data + {sender_seq, dst_addr}
    DMA_A->>NOC: IPCQ data flit(s)
    Note over NOC: hop latency + BW drain
    NOC->>DMA_B: IPCQ data flit(s)
    Note over DMA_B: Terminal BW drain<br/>Slot write latency
    rect rgb(255, 240, 220)
        Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall
        DMA_B->>TCM_B: write data → slot address
        DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr}
    end
    Note over IPCQ_B: Range match dst_addr → direction "W"<br/>peer_head_cache["W"] = sender_seq + 1
    IPCQ_B-->>CPU_B: recv_wake signal
    Note over CPU_B: tl.recv(dir="W") wakes up
    CPU_B->>IPCQ_B: recv request
    Note over IPCQ_B: peer_head_cache > my_tail → YES<br/>slot_addr = rx_base + (tail%n) × slot_size
    IPCQ_B-->>CPU_B: return slot_addr
    CPU_B->>TCM_B: read data from slot
    Note over IPCQ_B: my_tail++
    IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa}
    Note over NOC: credit traversal (NoC latency)
    NOC->>IPCQ_A: Credit arrival
    Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled)
 ```
 ---
 ## 6. 2nm Implementation Analysis
 ### 6.1 Area Estimate
 | Module | Gate Count | Area (2nm est.) | Notes |
 |--------|-----------|-----------------|-------|
 | QPair Register File | ~4.6K FF | 0.002 mm² | 576B flip-flops |
 | Slot Addr Gen + Backpressure | ~5K gates | 0.001 mm² | Combinational |
 | Meta Extractor + Credit Logic | ~3K gates | 0.001 mm² | 8× parallel comparators |
 | **Total IPCQ Controller** | **~12.6K** | **~0.004 mm²** | **< 0.1% of the whole PE** |
 | DMA vc_comm extension | ~2K gates | 0.002 mm² | Flit pack/unpack |
 | **Total delta** | **~14.6K** | **~0.006 mm²** | |
 ### 6.2 Timing
 | Path | Delay (2nm est.) | Target Clock | Margin |
 |------|-------------------|-------------|--------|
 | Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× |
 | Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× |
 | Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× |
 | Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× |
 Every critical path fits within 1 cycle. No timing closure issues are expected.
 ### 6.3 Power
 - Active: ~1 mW (register read/write + comparators, during send/recv operation)
 - Idle: leakage only
 - Negligible relative to total PE power
 ### 6.4 Constraints
 | Item | Constraint | Rationale |
 |------|------|------|
 | `n_slots` | **must be power-of-2** | mod → AND mask (1 gate). Arbitrary values need a divider (~10 cycles) |
 | `slot_size` | **power-of-2 recommended** | mul → barrel shift. Arbitrary values need a multiplier |
 | TCM IPCQ region | **placed in dedicated banks** | Prevents bank conflicts with compute accesses |
 ---
 ## 7. Risk Assessment
 ### 7.1 TCM Bank Conflict
 - **Risk**: stall when an IPCQ slot write and a compute read hit the same bank
 - **Mitigation**: place the IPCQ region in dedicated banks at the upper TCM addresses
 - **Cost**: slightly reduced TCM banking flexibility
 - **Severity**: Medium for performance; Low for correctness (not a correctness issue)
 ### 7.2 Credit Return Latency under Congestion
 - **Risk**: under NoC congestion, delayed credit return → sender backpressure stall
 - **Mitigation**:
  - Split credits onto a separate VC + strict priority (BW impact is minimal at 16B)
  - Or set n_slots generously (8+) so the buffer absorbs credit delay
 - **Severity**: Low (16B credits contribute almost nothing to congestion)
 ### 7.3 Inter-Direction Ordering
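 - **Risk**: ordering when the same PE sends in several directions concurrently
 - **Mitigation**: a per-direction monotonic seq is sufficient. Inter-direction ordering is
  the kernel's (software's) responsibility, the same as in the current simulator model
 - **Severity**: Low (resolved by architectural design)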
 ---
 ## 8. Alternatives Considered
 ### 8.1 Doorbell + Polling (Traditional Approach)
 ```
 Send: DMA write data → DMA write doorbell register at peer → peer polls doorbell
 Recv: Polling loop on doorbell, or interrupt-driven
 ```
 | Pros | Cons |
 |------|------|
 | Simple HW (no IPCQ controller needed) | Two DMA transactions (data + doorbell) |
 | Reuses the existing DMA | Ordering between data and doorbell must be guaranteed (fence) |
 | | Polling wastes power; interrupts add latency overhead |
 **Assessment**: 2-3× higher latency than piggyback. **Rejected.**
 ### 8.2 Hardware Message Queue (NVIDIA NVLink Style)
 ```
 Send: CPU pushes a descriptor to the HMQ → HW forwards it to the peer HMQ automatically
 Recv: pop a descriptor from the HMQ → check the data pointer
 ```
 | Pros | Cons |
 |------|------|
 | CPU only writes descriptors | Needs a separate HMQ engine (~0.05 mm²) |
 | Descriptor/data separation → flexible | Datapath separate from DMA → duplicated area/power |
 | | Large tensors still need DMA |
 **Assessment**: DMA is mandatory for CCL's large-tensor pattern, so a dual HMQ + DMA
 structure wastes area. **Rejected.**
 ### 8.3 RDMA-style Completion Queue (CQ)
 ```
 Send: DMA write → CQE generated automatically at the peer
 Recv: CQ poll/interrupt → locate the data
 ```
 | Pros | Cons |
 |------|------|
 | Mature InfiniBand/RoCE model | CQ management logic + CQE memory overhead |
 | Easy multi-tenancy/isolation | Additional CQE/data ordering guarantees needed |
 | | Over-engineered for PE-to-PE CCL |
 **Assessment**: An RDMA CQ suits multi-tenant isolation on a host-facing NIC.
 In a single-owner PE-to-PE environment it is unnecessary complexity. **Rejected.**
 ### 8.4 Credit-in-Data Piggyback (v2 Optimization Candidate)
 In the current design, credit return is a separate 16B packet.
 In bidirectional communication patterns, **the credit can be folded into a data flit traveling in the reverse direction.**
 ```
 PE_A →E→ PE_B: data + sender_seq=3
 PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4  ← credit merged into the data
 ```
 | Pros | Cons |
 |------|------|
 | Removes dedicated credit packets → saves NoC BW | Unidirectional patterns need a fallback |
 | Credit latency → 0 in bidirectional allreduce | 8B added to the flit header (minimal overhead) |
 | | Slightly higher logic complexity |
 **Assessment**: A strong optimization of the current design.
 Credit packets can be eliminated entirely in bidirectional allreduce.
 The standalone credit fallback is retained. **Recommended for adoption in v2.**
 ---
 ## 9. Recommendations
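 1. **Adopt the current IPCQ-DMA co-design as the baseline hardware design.**
   It is simple and area-efficient, with no timing or power issues at 2nm.
 2. **Constrain n_slots to a power of 2.**
   This replaces the mod operation with an AND mask and shortens the critical path.
 3. **Allocate dedicated banks to the IPCQ region in TCM banking.**
   This prevents bank conflicts with compute.
 4. **Evaluate adding Credit-in-Data Piggyback (Section 8.4) in v2.**
   It removes credit overhead in bidirectional patterns.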
 ---
 ## 10. Open Questions
 - [ ] What fraction of TCM may the IPCQ slot region occupy? (current assumption: ~1MB / 16MB = 6.25%)
 - [ ] Should credits get a separate VC, or be multiplexed onto vc_comm?
 - [ ] Flit format compatibility on the inter-SIP link needs verification
 - [ ] Should n_slots have a maximum? (8 directions × 8 slots × 64KB = 4MB → 25% of TCM)
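The sizing arithmetic behind the first and last questions can be reproduced with a short calculation. The helper name is hypothetical; the 16MB TCM size and the 8 × 8 × 64KB case come from the text above.

```python
TCM_BYTES = 16 * 1024 * 1024  # 16MB TCM, as in the memory map


def slot_region_bytes(n_dirs: int, n_slots: int, slot_size: int) -> int:
    """Total rx buffer footprint across all directions."""
    return n_dirs * n_slots * slot_size


worst_case = slot_region_bytes(8, 8, 64 * 1024)  # 8 dirs x 8 slots x 64KB
worst_fraction = worst_case / TCM_BYTES          # share of TCM consumed
assumed_fraction = (1024 * 1024) / TCM_BYTES     # the ~1MB assumption
```

The gap between the two fractions is the reason a cap on `n_slots` or on `slot_size` is listed as an open question.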
@@ -582,7 +582,7 @@ If you add a new algorithm or pattern, please send a PR.
 - [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective design.
 - [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1).
 - [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution.
-- [ADR-0021](adr/ADR-0021-pe-pipeline-refactor.md): PE pipeline refactor.
+- [ADR-0014](adr/ADR-0014-dev-pe-pipeline-execution-model.md): PE pipeline execution model.
 Existing algorithm examples:
@@ -527,7 +527,7 @@ Storing to the same address in another step after a direct send is safe (tok
 - [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective design
 - [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1)
 - [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution
-- [ADR-0021](adr/ADR-0021-pe-pipeline-refactor.md): PE pipeline refactor
+- [ADR-0014](adr/ADR-0014-dev-pe-pipeline-execution-model.md): PE pipeline execution model
 Existing algorithm examples:
@@ -5,5 +5,5 @@ This package provides:
    - helpers:    utilities for algorithm authors (chunked, ring_step, ...)
    - testing:    mock CCL runtime for fast unit tests of algorithm kernels
-See docs/adr/ADR-0023-ipcq-pe-collective.md and docs/ccl-author-guide.md.
+See docs/adr/ADR-0023-dev-ipcq-pe-collective.md and docs/onboarding/ccl-author-guide.md.
 """
@@ -24,7 +24,7 @@ class Scope(Enum):
 @dataclass(frozen=True)
 class OpSpec:
-    """One operation in a multi-op composite (head + epilogue, ADR-0021).
+    """One operation in a multi-op composite (head + epilogue, ADR-0014 D3.3).
    The head op (first in CompositeCmd.ops) defines tile geometry; subsequent
    ops are epilogue stages whose ``scope`` controls how often they fire
@@ -156,7 +156,7 @@ class CompositeCmd:
    out_nbytes: int
    math_op: str | None = None       # for op="math": which math operation
    data_op: bool = True
-    # Multi-op composite (ADR-0021 extension): when non-empty, ops[0] is the
+    # Multi-op composite (ADR-0014 D3.3): when non-empty, ops[0] is the
    # head and ops[1:] are epilogue stages with explicit scope. When empty,
    # the legacy single-op semantics (op/a/b/math_op) apply.
    ops: tuple[OpSpec, ...] = ()
@@ -15,7 +15,7 @@ if TYPE_CHECKING:
 class HbmCtrlComponent(ComponentBase):
-    """HBM controller with per-pseudo-channel (PC) striping (ADR-0019 D1, ADR-0033).
+    """HBM controller with per-pseudo-channel (PC) striping (ADR-0017 D4, ADR-0033).
    Stateless per-PC ``available_at`` array; each incoming transaction is
    split into ``ceil(nbytes / burst_bytes)`` chunks distributed round-robin
@@ -267,8 +267,9 @@ class MCpuComponent(ComponentBase):
    def _resolve_dma_destinations(self, request: Any, target_pe: int | str) -> list[str]:
        """Return list of HBM destination node_ids for DMA fan-out.
-        With single hbm_ctrl per cube (ADR-0019), always returns one node.
-        PA-based resolution still used for cross-cube routing.
+        The PA-based resolver maps each address to one per-PE
+        ``hbm_ctrl.pe{X}`` (ADR-0017 D9), so this method returns exactly
+        one node. Cross-cube routing uses the same resolution.
        """
        cube_prefix = self.node.id.rsplit(".", 1)[0]  # e.g. "sip0.cube0"
@@ -17,9 +17,11 @@ if TYPE_CHECKING:
 class PeDmaComponent(PeEngineBase):
    """PE_DMA: dual-channel DMA engine with READ and WRITE resources.
-    Each channel has capacity=1 (ADR-0014 D4):
+    Compute channels (vc_compute) have capacity=1 each (ADR-0014 D4):
      - DMA_READ and DMA_WRITE may execute concurrently.
      - Multiple READs cannot overlap; multiple WRITEs cannot overlap.
    The orthogonal vc_comm channel for IPCQ traffic is defined in
    ADR-0023 D8.
    Handles two message types:
      - Transaction: external fabric messages (PeDmaMsg probes, M_CPU DMA)
@@ -1,4 +1,4 @@
-"""PE_FETCH_STORE: TCM ↔ Register File transfer unit (ADR-0021 D5).
+"""PE_FETCH_STORE: TCM ↔ Register File transfer unit (ADR-0014 D1).
 Handles both fetch (TCM → register) and store (register → TCM).
 BW serialization is delegated to PE_TCM via port communication.
@@ -18,7 +18,7 @@ if TYPE_CHECKING:
 class PeFetchStoreComponent(PeEngineBase):
-    """PE_FETCH_STORE: TCM ↔ Register File (ADR-0021 D5).
+    """PE_FETCH_STORE: TCM ↔ Register File (ADR-0014 D1).
    Receives TileTokens via pipeline self-routing.
    Sends TcmRequest to PE_TCM for BW-based latency.
@@ -1,4 +1,4 @@
-"""PE_GEMM: matrix multiplication engine (ADR-0021 D6).
+"""PE_GEMM: matrix multiplication engine (ADR-0014 D1).
 Handles both legacy PeInternalTxn (GemmCmd) and pipeline TileToken.
 In pipeline mode, receives token after fetch stage, computes MAC, chains to next.
@@ -32,7 +32,7 @@ _DTYPE_BITS: dict[str, int] = {
 class PeGemmComponent(PeEngineBase):
-    """PE_GEMM: MAC array (ADR-0021 D6).
+    """PE_GEMM: MAC array (ADR-0014 D1).
    In pipeline mode: pure compute — register data already fetched.
    In legacy mode: handles PeInternalTxn(GemmCmd) with shared accel_slot.
@@ -1,4 +1,4 @@
-"""PE_MATH: element-wise / reduction computation engine (ADR-0021 D6).
+"""PE_MATH: element-wise / reduction computation engine (ADR-0014 D1).
 Handles both legacy PeInternalTxn (MathCmd) and pipeline TileToken.
 In pipeline mode, receives token after fetch stage, computes SIMD, chains to next.
@@ -24,7 +24,7 @@ if TYPE_CHECKING:
 class PeMathComponent(PeEngineBase):
-    """PE_MATH: SIMD/Vector unit (ADR-0021 D6).
+    """PE_MATH: SIMD/Vector unit (ADR-0014 D1).
    In pipeline mode: pure compute — register data already fetched.
    In legacy mode: handles PeInternalTxn(MathCmd) with shared accel_slot.
@@ -1,10 +1,10 @@
-"""PE_SCHEDULER: plan generation + tile dispatch (ADR-0021 D2).
+"""PE_SCHEDULER: plan generation + tile dispatch (ADR-0014 D6).
 Receives PeInternalTxn from PE_CPU, routes to engines:
  - Simple commands (DmaReadCmd, GemmCmd, etc.) → direct dispatch to engine
  - CompositeCmd → generate TilePlan, feed tiles via _feed_loop
-Composite pipeline uses token self-routing (ADR-0021 D4):
+Composite pipeline uses token self-routing (ADR-0014 D6):
  Scheduler only does initial dispatch + completion tracking.
  Tiles chain through components based on their plan's stage sequence.
 """
@@ -24,7 +24,7 @@ if TYPE_CHECKING:
 class PeSchedulerComponent(ComponentBase):
-    """PE_SCHEDULER: sole dispatcher inside a PE (ADR-0014 D1, ADR-0021 D2).
+    """PE_SCHEDULER: sole dispatcher inside a PE (ADR-0014 D1, D6).
    Simple commands are forwarded to the appropriate engine.
    CompositeCmd creates a TilePlan and feeds tiles into the pipeline.
@@ -104,7 +104,7 @@ class PeSchedulerComponent(ComponentBase):
    def _dispatch_composite(
        self, env: simpy.Environment, pe_txn: Any, cmd: Any,
    ) -> Generator:
-        """Generate plan and enqueue to feeder. Non-blocking (ADR-0021 D4)."""
+        """Generate plan and enqueue to feeder. Non-blocking (ADR-0014 D6)."""
        from kernbench.components.builtin.pe_types import PipelineContext
        plan = self._generate_plan(cmd)
@@ -121,7 +121,7 @@ class PeSchedulerComponent(ComponentBase):
        yield self._pending_feeds.put((plan, ctx))
    def _feed_loop(self, env: simpy.Environment) -> Generator:
-        """Single feeder process: FIFO command ordering (ADR-0021 D2).
+        """Single feeder process: FIFO command ordering (ADR-0014 D6).
        No tile feed interleaving between commands.
        Queue full → only this process blocks.
@@ -1,4 +1,4 @@
-"""PE_TCM: tightly-coupled memory with BW-based access serialization (ADR-0021).
+"""PE_TCM: tightly-coupled memory with BW-based access serialization (ADR-0014 D1).
 Models scratchpad memory inside the PE. Handles both legacy Transaction forwarding
 and TcmRequest from PE_FETCH_STORE for BW-serialized read/write access.
@@ -32,7 +32,7 @@ class TcmRequest:
 class PeTcmComponent(ComponentBase):
-    """PE_TCM: BW-serialized scratchpad memory (ADR-0021 D1).
+    """PE_TCM: BW-serialized scratchpad memory (ADR-0014 D1).
    Dual-channel: read and write can proceed in parallel,
    but concurrent reads serialize, concurrent writes serialize.
@@ -1,4 +1,4 @@
-"""PE pipeline types for ADR-0021: TileToken, TilePlan, Stage, PipelineContext.
+"""PE pipeline types for ADR-0014 D6: TileToken, TilePlan, Stage, PipelineContext.
 These types are used by the PE_SCHEDULER and all PE engine components
 for tile-based pipeline execution with self-routing.
@@ -84,7 +84,7 @@ class PipelineContext:
 @dataclass
 class TileToken:
-    """Self-routing tile token passed between PE components (ADR-0021 D9).
+    """Self-routing tile token passed between PE components (ADR-0014 D6).
    Single-owner: only one component holds this token at any time.
    params is a cache of plan.stages[stage_idx].params (canonical source).
@@ -1,4 +1,4 @@
-"""Tile plan generators for PE pipeline (ADR-0021).
+"""Tile plan generators for PE pipeline (ADR-0014 D6).
 Generates TilePlan with stage sequences for GEMM and Math operations.
 Ported from pe_accel tiling.py with stage-based plan structure.
@@ -1,2 +1,2 @@
 # Legacy component backups — not actively used.
-# Kept for reference during ADR-0021 migration.
+# Kept for reference during the PE pipeline refactor (ADR-0014).
@@ -264,8 +264,9 @@ class MCpuComponent(ComponentBase):
    def _resolve_dma_destinations(self, request: Any, target_pe: int | str) -> list[str]:
        """Return list of HBM destination node_ids for DMA fan-out.
-        With single hbm_ctrl per cube (ADR-0019), always returns one node.
-        PA-based resolution still used for cross-cube routing.
+        The PA-based resolver maps each address to one per-PE
+        ``hbm_ctrl.pe{X}`` (ADR-0017 D9), so this method returns exactly
+        one node. Cross-cube routing uses the same resolution.
        """
        cube_prefix = self.node.id.rsplit(".", 1)[0]  # e.g. "sip0.cube0"
@@ -20,7 +20,7 @@ _AHBM_SEL_BIT = 37
 _AHBM_LOCAL_USED = 38  # bits actually meaningful for AHBM
 # HBM-offset bit layout for PC (pseudo-channel) striping
-# (ADR-0033 D6, ADR-0019). Given burst_bytes = 2^B and num_pcs = 2^P
+# (ADR-0033 D6, ADR-0017 D8). Given burst_bytes = 2^B and num_pcs = 2^P
 # configured at hbm_ctrl, the PC index is derived from hbm_offset as
 #   pc_shift = B; pc_mask = (1 << P) - 1
 #   pc = (hbm_offset >> pc_shift) & pc_mask
@@ -35,7 +35,7 @@ class AddressResolver:
    def __init__(self, graph: TopologyGraph) -> None:
        self._node_ids = set(graph.nodes)
        # HBM slice size (bytes) — used to decode pe_id from hbm_offset
-        # so HBM PA → hbm_ctrl.pe{X} (ADR-0019 D1/D4).
+        # so HBM PA → hbm_ctrl.pe{X} (ADR-0017 D4/D9).
        mm = graph.spec.get("cube", {}).get("memory_map", {})
        hbm_total_gb = int(mm.get("hbm_total_gb_per_cube", 48))
        slices_per_cube = int(mm.get("hbm_slices_per_cube", 8))
@@ -129,7 +129,7 @@ class PathRouter:
        Otherwise the cube's own UCIe port appears as a zero-distance
        bus that Dijkstra prefers over the mesh — that is intended only
        for cross-cube routing. Local PE_DMA must traverse the mesh so
-        cross-PE-slice access pays the mesh-distance cost (ADR-0019 D4).
+        cross-PE-slice access pays the mesh-distance cost (ADR-0017 D7).
        """
        start = f"{src_pe}.pe_dma"
        adj = self._adj_local if _same_cube(start, dst_node) else self._adj
@@ -137,13 +137,13 @@ class PathRouter:
    def find_path_with_distance(self, src_pe: str, dst_node: str) -> tuple[list[str], float]:
        """Match find_path's cube-local routing so reported distance reflects
-        the actual chosen path (ADR-0019 D4)."""
+        the actual chosen path (ADR-0017 D7)."""
        start = f"{src_pe}.pe_dma"
        adj = self._adj_local if _same_cube(start, dst_node) else self._adj
        return self._run_dijkstra_with_dist(adj, start, dst_node)
    def find_mcpu_dma_path(self, m_cpu_id: str, dst_hbm_id: str) -> list[str]:
-        """M_CPU DMA path: routes through router mesh (ADR-0019).
+        """M_CPU DMA path: routes through router mesh (ADR-0017).
        Same-cube: uses _adj_local (no UCIe) to stay within mesh.
        Cross-cube: uses _adj_all to route via UCIe.
@@ -58,7 +58,7 @@ def _get_active_context():
 class _AhbmNamespace:
-    """torch.ahbm — per-greenlet SIP device binding (ADR-0024 D10).
+    """torch.ahbm — per-greenlet SIP device binding (ADR-0024 D3).
    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. KernBench's
    backend is 'ahbm' (not CUDA), so this namespace avoids pretending to be
@@ -124,7 +124,7 @@ class RuntimeContext:
        dc = DistributedContext()
        dc._ctx_ref = self  # back-reference for AhbmCCLBackend to reach ctx.launch etc.
        self.distributed = dc
-        # ADR-0024 D10: torch.ahbm (KernBench-native) + torch.accelerator
+        # ADR-0024 D3: torch.ahbm (KernBench-native) + torch.accelerator
        # (PyTorch 2.x portable) namespaces for per-greenlet device binding.
        self.ahbm = _AhbmNamespace()
        self.accelerator = _AcceleratorNamespace(self.ahbm)
@@ -472,7 +472,7 @@ class RuntimeContext:
        eff_num_pe = dp.num_pes if dp.num_pes is not None else self._pes_per_cube
        eff_num_cubes = dp.num_cubes if dp.num_cubes is not None else self._num_cubes
        # ADR-0026 D4: resolve structural coords directly at resolve time.
-        # ``torch.ahbm.set_device(rank)`` (ADR-0024 D10) selects the target
+        # ``torch.ahbm.set_device(rank)`` (ADR-0024 D3) selects the target
        # SIP; if unset, fall back to SIP 0 for single-driver compatibility.
        current_sip = (
            self.ahbm.current_device() if hasattr(self, "ahbm") else None
@@ -619,7 +619,7 @@ class RuntimeContext:
        Creates per-SIP KernelLaunchMsg with local va_base per tensor
        (like host driver sending per-rank launch commands).
-        When ``_defer_wait=True`` (ADR-0024 D7), returns the list of
+        When ``_defer_wait=True`` (ADR-0027 D0.4), returns the list of
        ``(handle, sip_id, meta)`` tuples instead of waiting. Caller is
        responsible for waiting — used by collective ops to yield between
        submit and wait so all sibling ranks can submit first.
@@ -786,7 +786,7 @@ class RuntimeContext:
            last_handle = h
        if _defer_wait:
-            # ADR-0024 D7: return the pending-list so the caller can yield
+            # ADR-0027 D0.4: return the pending-list so the caller can yield
            # between submit and drain. Used by collective ops that need
            # all sibling ranks to submit before any rank waits.
            return [
@@ -178,7 +178,7 @@ class DistributedContext:
    def __init__(self) -> None:
        self._backend: AhbmCCLBackend | None = None
-        # ADR-0024 D9: greenlet-local rank registry. Bench launcher calls
+        # ADR-0024 D2: greenlet-local rank registry. Bench launcher calls
        # _bind_rank(g, rank) when spawning workers; get_rank() resolves the
        # current greenlet to its rank. Unbound greenlets fall back to 0 for
        # single-driver test compat.
@@ -220,7 +220,7 @@ class DistributedContext:
    def get_rank(self) -> int:
        """Return the rank bound to the current greenlet (default 0).
-        ADR-0024 D9: workers spawned by the bench launcher each get a rank
+        ADR-0024 D2: workers spawned by the bench launcher each get a rank
        registered via ``_bind_rank``. Callers outside any bound greenlet
        fall back to rank 0 for single-driver test compat.
        """
@@ -230,7 +230,7 @@ class DistributedContext:
        return int(self._rank_by_greenlet.get(g, 0))
    def _bind_rank(self, g: Any, rank: int) -> None:
-        """Bind a greenlet to a rank so ``get_rank()`` returns it (ADR-0024 D9)."""
+        """Bind a greenlet to a rank so ``get_rank()`` returns it (ADR-0024 D2)."""
        self._rank_by_greenlet[g] = int(rank)
    def get_backend(self) -> str:
@@ -65,7 +65,7 @@ def _drain_pending(ctx: Any) -> None:
                # Populate _completed so fast-path in ctx.wait short-circuits
                # on the return leg.
                ctx._completed.add(h)
-        # (b) Collective backend queue (ADR-0024 D7 + D0.4-(2)).
+        # (b) Collective backend queue (ADR-0027 D0.4-(2)).
        if backend is not None:
            pending_list = getattr(backend, "_pending_collective_handles", None)
            if pending_list is not None:
@@ -51,7 +51,7 @@ class OpLogger:
        record_end fires.
        """
        snap: dict[str, Any] = {}
-        # TileToken (ADR-0021 pipeline) — capture which stage this is and its
+        # TileToken (ADR-0014 D6 pipeline) — capture which stage this is and its
        # per-stage params (e.g. op_kind/scope for epilogue MATH stages) so
        # we can recover them at record_end even after the token advances.
        try:
@@ -356,7 +356,7 @@ def _instantiate_cube(
 ) -> None:
    """Add all cube-internal nodes and edges, including PE instances.
-    Topology: explicit router mesh from cube_mesh.yaml (ADR-0019).
+    Topology: explicit router mesh from cube_mesh.yaml (ADR-0017 D1).
    Each router is a separate SimPy node. Components attach to routers
    based on cube_mesh.yaml attachment lists.
    """
@@ -367,10 +367,10 @@ def _instantiate_cube(
    clinks = cube["links"]
    mm = cube["memory_map"]
-    # ── Mode branch (ADR-0019) ──
+    # ── Mode branch (ADR-0017 D8) ──
    mode = mm.get("hbm_mapping_mode", "n_to_one")
    if mode == "one_to_one":
-        raise NotImplementedError("1:1 mode: ADR-0019 D3")
+        raise NotImplementedError("1:1 mode: ADR-0017 D8")
    # ── UCIe ports + connection nodes ──
    ucie_cfg = cube["ucie"]
@@ -404,11 +404,10 @@ def _instantiate_cube(
            label=name.upper().replace("_", " "),
        )
-    # ── Per-PE HBM controller (ADR-0019 D1/D4) ──
+    # ── Per-PE HBM controller (ADR-0017 D4) ──
    # Each PE owns one slice of the cube's HBM. The slice has its own
    # set of pseudo-channels and is reachable ONLY through that PE's
    # attaching router (see cube_mesh.yaml ``peX.hbm`` attach lists).
-    # Restored after the ADR-0019 over-consolidation in commit 5917b34.
    hbm_spec = cube["components"]["hbm_ctrl"]
    hbm_lx, hbm_ly = local_pos["hbm_ctrl"]
    _hbm_total_bw = float(cube["links"].get("hbm_to_router_bw_gbs", 256.0))
@@ -425,7 +424,7 @@ def _instantiate_cube(
            label=f"HBM CTRL pe{pe_idx}",
        )
-    # ── Router mesh from cube_mesh.yaml (ADR-0019 D3) ──
+    # ── Router mesh from cube_mesh.yaml (ADR-0017 D1) ──
    routers = mesh_data["routers"]
    router_spec = cube["components"]["noc_router"]
    router_bw = clinks.get("router_link_bw_gbs", 256.0)
@@ -573,7 +572,7 @@ def _instantiate_cube(
                    ))
            elif item.endswith(".hbm"):
                # peX.hbm: router rXcY owns the entry to hbm_ctrl.peX.
-                # (ADR-0019 D1/D4 — per-PE HBM partitioning.)
+                # (ADR-0017 D4 — per-PE HBM partitioning.)
                pe_prefix = item.rsplit(".", 1)[0]
                pe_idx = int(pe_prefix.replace("pe", ""))
                pe_hbm_id = f"{cp}.hbm_ctrl.pe{pe_idx}"
@@ -645,13 +644,12 @@ def _instantiate_cube(
                    ))
    # NOTE: HBM↔router edges are created in the per-router attach loop
-    # above (peX.hbm items map router → hbm_ctrl.peX). Removed the
+    # above (peX.hbm items map router → hbm_ctrl.peX). See ADR-0017 D4
-    # legacy "all routers → single hbm_ctrl" loop that bypassed the
+    # for the per-PE partition contract.
-    # ADR-0019 D4 per-PE partition.
 def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
-    """Add PE-internal edges for a single PE instance (ADR-0021)."""
+    """Add PE-internal edges for a single PE instance (ADR-0014 D8)."""
    edges.append(Edge(
        src=f"{pp}.pe_cpu", dst=f"{pp}.pe_scheduler",
        distance_mm=pe_links["pe_cpu_to_scheduler_mm"],
@@ -685,7 +683,7 @@ def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
            kind="pe_internal",
        ))
-    # Fetch/Store → TCM (ADR-0021 D5)
+    # Fetch/Store → TCM (ADR-0014 D5)
    if "fetch_store_to_tcm_mm" in pe_links:
        edges.append(Edge(
            src=f"{pp}.pe_fetch_store", dst=f"{pp}.pe_tcm",
@@ -694,7 +692,7 @@ def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
            kind="pe_internal",
        ))
-    # Chaining edges (ADR-0021 D4 — token self-routing)
+    # Chaining edges (ADR-0014 D6 — token self-routing)
    chaining = [
        ("pe_dma", "pe_fetch_store", "dma_to_fetch_store_mm"),
        ("pe_fetch_store", "pe_gemm", "fetch_store_to_gemm_mm"),
@@ -6,7 +6,7 @@
  forward(x) ends with ``dist.all_reduce`` to sum partial products.
 Both layers use the intra-device ``DPPolicy`` (ADR-0026). TP shard
-ownership is determined by ``torch.ahbm.set_device(rank)`` (ADR-0024 D10).
+ownership is determined by ``torch.ahbm.set_device(rank)`` (ADR-0024 D3).
 Yield-safety contract (ADR-0027 D4/D5): every forward path contains at
 least one ``ctx.wait`` (via ``torch.launch``) or one collective; this
@@ -53,7 +53,7 @@ class ColumnParallelLinear:
        self.k_local = out_features // ws
        self.dtype = dtype
        self._torch = torch
-        # Per-rank weight slice. ``set_device(rank)`` (ADR-0024 D10) places
+        # Per-rank weight slice. ``set_device(rank)`` (ADR-0024 D3) places
        # it on SIP ``rank``. Intra-SIP layout comes from DPPolicy (ADR-0026).
        self.weight = torch.zeros(
            (in_features, self.k_local),
@@ -43,7 +43,7 @@ def get_tensor_model_parallel_rank() -> int:
    """Return this worker's rank within the TP group.
    Delegates to the greenlet-local rank registered by the spawn launcher
-    (ADR-0024 D9 via ``torch.distributed.get_rank``).
+    (ADR-0024 D2 via ``torch.distributed.get_rank``).
    """
    # Resolve via the global torch.distributed facade on the active ctx.
    return _current_rank()
@@ -1,4 +1,4 @@
-"""End-to-end pipeline tests (ADR-0020 + ADR-0021).
+"""End-to-end pipeline tests (ADR-0020 + ADR-0014).
 Verifies:
  1. Actual benchmark kernel → greenlet mode → op_log → DataExecutor → accuracy
@@ -68,7 +68,7 @@ def _path_drain_for_write(eng: GraphEngine, msg: MemoryWriteMsg) -> float:
 def test_builder_derives_pc_bw_gbs():
    """Topology builder must inject `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
-    as an attr on every hbm_ctrl node. Enforces ADR-0019 D9 invariant
+    as an attr on every hbm_ctrl node. Enforces ADR-0017 D8 invariant
    (channels_per_PE × per-PC BW = aggregated link BW) at build time.
    """
    handle = resolve_topology(str(TOPOLOGY_PATH))
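A rough sketch of the invariant this test enforces, assuming the values visible elsewhere in this diff (`hbm_to_router_bw_gbs` defaulting to 256.0 and `num_pcs` of 8 per PE). `derive_pc_bw_gbs` is a hypothetical helper for illustration, not the topology builder's actual function:

```python
def derive_pc_bw_gbs(hbm_to_router_bw_gbs: float, num_pcs: int) -> float:
    """Per-pseudo-channel bandwidth such that num_pcs * pc_bw equals the link BW."""
    if num_pcs <= 0:
        raise ValueError("num_pcs must be positive")
    return hbm_to_router_bw_gbs / num_pcs


# Assumed inputs, taken from the defaults shown in this diff.
link_bw_gbs = 256.0
num_pcs = 8

pc_bw_gbs = derive_pc_bw_gbs(link_bw_gbs, num_pcs)

# channels_per_PE x per-PC BW must reproduce the aggregated link BW.
assert pc_bw_gbs * num_pcs == link_bw_gbs
print(pc_bw_gbs)
```

With these inputs the derived per-PC bandwidth agrees with the `hbm_channel_bw_gbs: 32.0` entry in the topology YAML.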
@@ -192,13 +192,10 @@ def test_hbm_pe_hop_charged_at_large_payload(tmp_path):
    chunk of latency from the PE↔HBM hop on send and recv, so the
    total HBM/TCM gap should clearly clear the threshold below.
-    Threshold history: the gap was 4 µs under the over-consolidated
+    Under ADR-0017 D4 per-PE HBM CTRL, each PE's slice runs on its own
-    single-hbm_ctrl model (commit 5917b34), inflated by serialization
+    controller with no cross-PE contention, so the IPCQ pattern (each
-    on the shared HBM controller. With ADR-0019 D1 per-PE HBM CTRL
+    PE writes its own slice) yields a gap of ≈ 1.7 µs — well above the
-    restored, each PE's slice runs on its own controller with no
+    bare slot-IO term, confirming the PE↔HBM hop is being charged.
-    cross-PE contention, so the IPCQ pattern (each PE writes its own
-    slice) drops the gap to ≈ 1.7 µs — still well above the bare
-    slot-IO term, confirming the PE↔HBM hop is being charged.
    """
    n_elem = 16384  # 32 KB / PE
    lat_tcm = _run_allreduce_with_buffer_kind(
@@ -1,4 +1,4 @@
-"""Tests for CUBE NOC Explicit Router Mesh (ADR-0019).
+"""Tests for CUBE NOC Explicit Router Mesh (ADR-0017).
 Key changes verified:
  - Explicit router nodes per cube from cube_mesh.yaml (6×6 grid)
@@ -125,14 +125,14 @@ def test_mesh_file_pe_corner_positions():
 def test_mesh_file_no_xbar_section():
-    """mesh output must not contain xbar section (ADR-0019 D2)."""
+    """mesh output must not contain xbar section (ADR-0017 D1)."""
    _graph()
    mesh = yaml.safe_load(MESH_PATH.read_text())
    assert "xbar" not in mesh, "xbar section should be removed from cube_mesh.yaml"
 def test_mesh_file_pe_hbm_attached():
-    """PE routers must have pe{idx}.hbm in attach list (ADR-0019 D1)."""
+    """PE routers must have pe{idx}.hbm in attach list (ADR-0017 D4)."""
    _graph()
    mesh = yaml.safe_load(MESH_PATH.read_text())
    for rid, rdata in mesh["routers"].items():
@@ -235,7 +235,7 @@ def test_mesh_ucie_all_four_directions():
 # ══════════════════════════════════════════════════════════════════
-# 2. Topology Graph: Explicit Router Mesh (ADR-0019)
+# 2. Topology Graph: Explicit Router Mesh (ADR-0017)
 # ══════════════════════════════════════════════════════════════════
@@ -247,7 +247,7 @@ def test_router_nodes_exist():
 def test_no_xbar_or_bridge_nodes():
-    """xbar/bridge nodes must not exist (ADR-0019 D2)."""
+    """xbar/bridge nodes must not exist (ADR-0017 D1)."""
    graph = _graph()
    bad = [n for n in graph.nodes if "xbar" in n or "bridge" in n]
    assert len(bad) == 0, f"Old xbar/bridge nodes found: {bad[:5]}"
@@ -260,11 +260,10 @@ def test_no_single_noc_node():
 def test_per_pe_hbm_ctrl_nodes():
-    """Each cube has 8 per-PE HBM CTRL instances (ADR-0019 D1).
+    """Each cube has 8 per-PE HBM CTRL instances (ADR-0017 D4).
-    Restored from over-consolidation in commit 5917b34. The legacy
+    Each PE owns its own ``hbm_ctrl.pe{X}`` reachable through that PE's
-    single ``sip0.cube0.hbm_ctrl`` is gone; each PE owns its own
+    attaching router. No cube-wide single ``hbm_ctrl`` node exists.
-    ``hbm_ctrl.pe{X}`` reachable through that PE's attaching router.
    """
    graph = _graph()
    for pe in range(8):
@@ -272,7 +271,7 @@ def test_per_pe_hbm_ctrl_nodes():
    # Legacy single hbm_ctrl must not exist
    legacy_id = "sip0.cube0.hbm_ctrl"
    assert legacy_id not in graph.nodes, (
-        f"legacy {legacy_id} must be removed (per-PE partitioning, ADR-0019 D1)"
+        f"legacy {legacy_id} must not exist (per-PE partitioning, ADR-0017 D4)"
    )
@@ -297,9 +296,7 @@ def test_pe_dma_connects_to_router():
 def test_each_hbm_ctrl_connects_only_to_owning_router():
    """Each ``hbm_ctrl.pe{X}`` must have exactly one router edge
    (router_to_hbm + hbm_to_router) to its owning PE's attaching
-    router (ADR-0019 D4). Replaces a prior test that asserted the
+    router (ADR-0017 D7).
-    single hbm_ctrl was connected to all routers — that asserted the
-    spec-violating consolidation introduced in commit 5917b34.
    """
    graph = _graph()
    pe_router = {0: "r0c0", 1: "r0c1", 2: "r1c4", 3: "r1c5",
@@ -513,7 +510,7 @@ def test_null_routers_excluded():
 # ══════════════════════════════════════════════════════════════════
-# 7. Router Mesh Latency (ADR-0019)
+# 7. Router Mesh Latency (ADR-0017)
 # ══════════════════════════════════════════════════════════════════
@@ -1,4 +1,4 @@
-"""Tests for ADR-0021 PE pipeline: TileToken self-routing, pipeline overlap, e2e accuracy.
+"""Tests for ADR-0014 D6 PE pipeline: TileToken self-routing, pipeline overlap, e2e accuracy.
 Test plan items:
  3. Phase 1 → Phase 2 end-to-end (op_log → DataExecutor → verify)
@@ -1,18 +1,13 @@
-"""Tests for ADR-0019 D1/D4 per-PE HBM partitioning.
+"""Tests for ADR-0017 D4/D7 per-PE HBM partitioning.
-Restores the architectural property that was lost in commit 5917b34
+ADR-0017 D4/D7 specifies:
- (2026-04-04 "Replace xbar/bridge/single-NOC with explicit router mesh"),
- which over-consolidated 8 per-slice HBM CTRL nodes into one cube-wide
- HBM CTRL connected to every router. ADR-0019 D1/D4 specifies:
 - Each PE owns 8 of the cube's 64 pseudo-channels (PE_X → PCs 8X..8X+7).
 - HBM CTRL is split per-PE: ``hbm_ctrl.pe{X}`` is reachable ONLY through
  PE_X's attaching router. Accessing PE_Y's slice from PE_X requires
  mesh routing to r_Y_attach before entering hbm_ctrl.pe{Y}.
-These tests are written BEFORE the production change and are expected
+These tests enforce that property without weakening
- to FAIL on current code (HBM CTRL is a single ``hbm_ctrl`` node attached
- to all routers). Phase 2 must make them PASS without weakening
 assertions.
 """
 from __future__ import annotations
@@ -66,16 +61,16 @@ def test_topology_has_8_hbm_ctrl_per_cube():
    for pe in range(8):
        nid = f"sip0.cube0.hbm_ctrl.pe{pe}"
        assert nid in graph.nodes, (
-            f"Expected per-PE HBM CTRL node {nid!r} (ADR-0019 D1)"
+            f"Expected per-PE HBM CTRL node {nid!r} (ADR-0017 D4)"
        )
        node = graph.nodes[nid]
        assert int(node.attrs.get("num_pcs", 0)) == 8, (
            f"{nid} must have num_pcs=8; got {node.attrs.get('num_pcs')}"
        )
-    # Legacy single hbm_ctrl must not exist
+    # Cube-wide single hbm_ctrl must not exist
    assert "sip0.cube0.hbm_ctrl" not in graph.nodes, (
-        "Legacy single sip0.cube0.hbm_ctrl must be removed in favor of "
+        "Cube-wide single sip0.cube0.hbm_ctrl must not exist; only "
-        "per-PE hbm_ctrl.pe{X} (ADR-0019 D1)"
+        "per-PE hbm_ctrl.pe{X} (ADR-0017 D4)"
    )
@@ -199,10 +194,8 @@ def test_probe_cli_intra_cube_cases_are_monotonic():
    """Probe CLI cases must show monotonic latency:
        pe-local-hbm < pe-same-half-hbm < pe-cross-half-hbm.
-    Prior to per-PE partitioning these three return identical latency
+    Per ADR-0017 D7, same-half (pe0→pe1) is 1 mesh hop further than
-    because all roads lead to the same hbm_ctrl. With ADR-0019 D4
+    local, and cross-half (pe0→pe4) is several hops further.
-    restored, same-half (pe0→pe1) is 1 mesh hop further than local,
-    and cross-half (pe0→pe4) is several hops further.
    """
    graph = _graph()
    spec = graph.spec
@@ -17,7 +17,7 @@ def _graph():
 def test_resolve_hbm_addr():
-    """HBM address -> sip{S}.cube{C}.hbm_ctrl.pe{X} (per-PE controller, ADR-0019 D1)."""
+    """HBM address -> sip{S}.cube{C}.hbm_ctrl.pe{X} (per-PE controller, ADR-0017 D9)."""
    g = _graph()
    resolver = AddressResolver(g)
    # offset 0x1000 falls inside PE0's slice (slice_size = 6 GB)
@@ -102,16 +102,13 @@ def test_path_remote_pe_hbm():
    assert not any("xbar" in n or "bridge" in n for n in path)
-# ── PathRouter: cross-PE HBM distance reflects mesh hops (ADR-0019 D4) ─
+# ── PathRouter: cross-PE HBM distance reflects mesh hops (ADR-0017 D7) ─
 def test_cross_pe_hbm_distance_increases_with_mesh_hops():
-    """Restored ADR-0019 D4 behavior: accessing another PE's HBM slice
+    """ADR-0017 D7: accessing another PE's HBM slice must take more
-    must take more routing distance than accessing one's own slice,
+    routing distance than accessing one's own slice, because each
-    because each per-PE hbm_ctrl is reachable only via its PE's router.
+    per-PE hbm_ctrl is reachable only via its PE's router.
-    Replaces a previous ``test_all_pe_hbm_equidistant`` that asserted the
-    over-consolidated (spec-violating) behavior introduced in 5917b34.
    """
    g = _graph()
    router = PathRouter(g)
@@ -21,7 +21,7 @@ def test_full_graph_node_count():
    #                          + 20 ucie (4 ports x (1 port + 4 conn))
    #                          + 8 PEs x 9 pe_comps))  (ADR-0023: +pe_ipcq)
    #   IO: pcie_ep + io_cpu + noc + 4 io_ucie_ports + 4*4 io_ucie_conn = 23
-    #   cube: 32 + 10 + 20 + 72 = 134  (was 127; ADR-0019 D1 per-PE HBM CTRL)
+    #   cube: 32 + 10 + 20 + 72 = 134  (per-PE HBM CTRL, ADR-0017 D4)
    # = 1 + 2*(23 + 16*134) = 1 + 2*(23+2144) = 1 + 4334 = 4335
    assert len(g.nodes) == 4335
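The node-count arithmetic in the comment above can be checked with a short sketch. The constant names are illustrative only. The comment labels just some of the per-cube terms (20 UCIe nodes, 8 PEs x 9 PE components), so the first two terms are carried over as bare numbers, and the counts of 2 SIPs and 16 cubes per SIP are taken from the comments in this test file:

```python
# Per-cube terms exactly as listed in the test comment: 32 + 10 + 20 + 72.
# Only the last two are labelled there (UCIe nodes, 8 PEs x 9 PE components).
CUBE_TERMS = (32, 10, 20, 8 * 9)

# IO nodes per SIP: pcie_ep + io_cpu + noc + 4 io_ucie_ports + 4*4 io_ucie_conn.
IO_NODES_PER_SIP = 1 + 1 + 1 + 4 + 4 * 4

CUBES_PER_SIP = 16
NUM_SIPS = 2
TOP_LEVEL_NODES = 1  # the leading "1 +" term in the comment

nodes_per_cube = sum(CUBE_TERMS)
nodes_per_sip = IO_NODES_PER_SIP + CUBES_PER_SIP * nodes_per_cube
total_nodes = TOP_LEVEL_NODES + NUM_SIPS * nodes_per_sip

print(nodes_per_cube, nodes_per_sip, total_nodes)
```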
@@ -29,9 +29,9 @@ def test_full_graph_node_count():
 def test_full_graph_edge_count():
    g = _graph()
    # ADR-0023: +3 IPCQ edges per PE
-    # ADR-0019 D1 (restored): HBM↔router edges drop from 32 routers × 2
+    # ADR-0017 D4: HBM↔router edges = 8 PE-routers × 2 per cube
-    # to 8 PE-routers × 2 per cube. 32 cubes × (16-64) = -1536 edges.
+    # (per-PE partition; not all 32 routers).
-    # Multi-op composite (ADR-0021): +1 gemm→math edge per PE for
+    # Multi-op composite (ADR-0014 D3.3): +1 gemm→math edge per PE for
    # epilogue chaining = 2 SIPs × 16 cubes × 8 PEs = +256 edges.
    assert len(g.edges) == 12412
@@ -73,7 +73,7 @@ def test_cube_component_nodes_exist():
    # Null holes must not exist
    for null_rc in ("r2c2", "r2c3", "r3c2", "r3c3"):
        assert f"{cp}.{null_rc}" not in g.nodes
-    # Per-PE HBM CTRL (ADR-0019 D1) — 8 instances, no legacy single node
+    # Per-PE HBM CTRL (ADR-0017 D4) — 8 instances; no cube-wide single node
    for pe in range(8):
        nid = f"{cp}.hbm_ctrl.pe{pe}"
        assert g.nodes[nid].kind == "hbm_ctrl"
@@ -94,7 +94,7 @@ def test_pe_component_nodes_exist():
 def test_hbm_ctrl_at_cube_center():
    g = _graph()
-    # Per-PE hbm_ctrl nodes share the cube's HBM placement (ADR-0019 D1)
+    # Per-PE hbm_ctrl nodes share the cube's HBM placement (ADR-0017 D4)
    # cube0 origin = (0, 0), hbm at (6.5, 7.0)
    for pe in range(8):
        node = g.nodes[f"sip0.cube0.hbm_ctrl.pe{pe}"]
@@ -190,8 +190,7 @@ def test_pe_internal_edges():
 def test_per_pe_hbm_ctrl_connects_only_to_owning_router():
    """Each hbm_ctrl.pe{X} connects ONLY to PE_X's attaching router
-    (ADR-0019 D4). Replaces a prior test that asserted the
+    (ADR-0017 D7)."""
-    spec-violating all-routers consolidation (commit 5917b34)."""
    g = _graph()
    es = _edge_set(g)
    cp = "sip0.cube0"
@@ -56,7 +56,7 @@ def test_initialize_mismatched_ws_raises(topology):
 def test_get_tp_rank_is_greenlet_local(topology):
    """D3: get_tensor_model_parallel_rank returns greenlet-local rank
-    (delegates to torch.distributed.get_rank, ADR-0024 D9)."""
+    (delegates to torch.distributed.get_rank, ADR-0024 D2)."""
    import kernbench.tp as tp
    with _make_ctx(topology) as ctx:
@@ -0,0 +1,107 @@
 """Tests for tools/verify_adr_lang_pairs.py."""
 from __future__ import annotations
 import sys
 from pathlib import Path
 _REPO_ROOT = Path(__file__).resolve().parents[1]
 sys.path.insert(0, str(_REPO_ROOT / "tools"))
 import verify_adr_lang_pairs as v  # noqa: E402
 def _make_adr(
    path: Path,
    title_id: str,
    title_text: str = "Some Title",
    status: str = "Accepted",
 ) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        f"# ADR-{title_id}: {title_text}\n\n"
        f"## Status\n\n{status}\n\n"
        f"## Context\n\nbody\n",
        encoding="utf-8",
    )
 def test_complete_pairs_pass(tmp_path: Path) -> None:
    _make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001", "Foo EN")
    _make_adr(tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001", "Foo KO")
    assert v.verify(tmp_path) == []
 def test_empty_dirs_pass(tmp_path: Path) -> None:
    assert v.verify(tmp_path) == []
 def test_missing_ko_fails(tmp_path: Path) -> None:
    _make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001")
    errs = v.verify(tmp_path)
    assert any("missing KO" in e and "ADR-0001-foo-bar.md" in e for e in errs)
 def test_orphan_ko_fails(tmp_path: Path) -> None:
    _make_adr(tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001")
    errs = v.verify(tmp_path)
    assert any("orphan KO" in e and "ADR-0001-foo-bar.md" in e for e in errs)
 def test_status_mismatch_fails(tmp_path: Path) -> None:
    _make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001", status="Accepted")
    _make_adr(tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001", status="Proposed")
    errs = v.verify(tmp_path)
    assert any("Status block mismatch" in e for e in errs)
 def test_title_id_mismatch_fails(tmp_path: Path) -> None:
    _make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0002")
    _make_adr(tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001")
    errs = v.verify(tmp_path)
    assert any("EN title ADR-ID" in e for e in errs)
 def test_multiline_status_with_parenthetical_passes(tmp_path: Path) -> None:
    """Real ADRs like ADR-0001 have multi-line Status with revision notes."""
    multiline_status = (
        "Accepted (Revision 2 - 2026-04-27: concrete bit layout,\n"
        "Supersedes ADR-0031.)"
    )
    _make_adr(
        tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001", status=multiline_status
    )
    _make_adr(
        tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001", status=multiline_status
    )
    assert v.verify(tmp_path) == []
 def test_crlf_normalization(tmp_path: Path) -> None:
    """KO has CRLF, EN has LF; Status content is otherwise identical -> pass."""
    en = tmp_path / "docs/adr/ADR-0001-foo-bar.md"
    ko = tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md"
    en.parent.mkdir(parents=True, exist_ok=True)
    ko.parent.mkdir(parents=True, exist_ok=True)
    en.write_bytes(
        b"# ADR-0001: Foo\n\n## Status\n\nAccepted\n\n## Context\n\nbody\n"
    )
    ko.write_bytes(
        b"# ADR-0001: Foo\r\n\r\n## Status\r\n\r\nAccepted\r\n\r\n## Context\r\n\r\nbody\r\n"
    )
    assert v.verify(tmp_path) == []
 def test_underscore_in_slug_recognized(tmp_path: Path) -> None:
    """ADR-0013 uses an underscore in its slug; the regex must accept it."""
    _make_adr(tmp_path / "docs/adr/ADR-0013-ver-verification_strategy.md", "0013")
    _make_adr(tmp_path / "docs/adr-ko/ADR-0013-ver-verification_strategy.md", "0013")
    assert v.verify(tmp_path) == []
 def test_main_exit_codes(tmp_path: Path, capsys) -> None:
    assert v.main(["--root", str(tmp_path)]) == 0
    _make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001")
    assert v.main(["--root", str(tmp_path)]) == 1
    out = capsys.readouterr().out
    assert "FAILED" in out
@@ -0,0 +1,144 @@
 """Verify ADR language pair invariants.
 Policy (see CLAUDE.md Part 2 -> ADR Translation Discipline):
  docs/adr/        : English canonical
  docs/adr-ko/     : Korean translation (1:1 mirror)
  docs/adr-history/: frozen, not checked (transitional)
  docs/adr-proposed/: language-free, not checked
 Checks:
  - every docs/adr/<X>.md has a matching docs/adr-ko/<X>.md
  - every docs/adr-ko/<X>.md has a matching docs/adr/<X>.md (no orphans)
  - title line `# ADR-NNNN:` of each pair matches the filename's NNNN
  - `## Status` block content is byte-equal (after CRLF/LF normalization)
    between EN and KO
 Exit code: 0 if all OK, 1 if any mismatch.
 """
 from __future__ import annotations
 import argparse
 import re
 import sys
 from pathlib import Path
 ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-[a-z0-9_-]+\.md$")
 TITLE_RE = re.compile(r"^# ADR-(\d{4}):")
 def _normalize(text: str) -> str:
    return text.replace("\r\n", "\n").replace("\r", "\n")
 def find_adr_files(adr_dir: Path) -> dict[str, Path]:
    if not adr_dir.is_dir():
        return {}
    return {
        p.name: p
        for p in sorted(adr_dir.iterdir())
        if p.is_file() and ADR_FILENAME_RE.match(p.name)
    }
 def extract_title_id(text: str) -> str | None:
    lines = _normalize(text).splitlines()
    if not lines:
        return None
    m = TITLE_RE.match(lines[0])
    return m.group(1) if m else None
 def extract_status_block(text: str) -> str | None:
    """Return content between `## Status` and the next `## ` heading, stripped.
    Returns None if no `## Status` heading exists.
    """
    lines = _normalize(text).splitlines()
    in_status = False
    collected: list[str] = []
    for line in lines:
        if line.strip() == "## Status":
            in_status = True
            continue
        if in_status and line.startswith("## "):
            break
        if in_status:
            collected.append(line)
    if not in_status:
        return None
    return "\n".join(collected).strip()
 def verify(root: Path) -> list[str]:
    errors: list[str] = []
    en_dir = root / "docs" / "adr"
    ko_dir = root / "docs" / "adr-ko"
    en_files = find_adr_files(en_dir)
    ko_files = find_adr_files(ko_dir)
    for name in en_files:
        if name not in ko_files:
            errors.append(f"missing KO translation: docs/adr-ko/{name}")
    for name in ko_files:
        if name not in en_files:
            errors.append(f"orphan KO (no canonical EN): docs/adr-ko/{name}")
    for name in sorted(en_files.keys() & ko_files.keys()):
        m = ADR_FILENAME_RE.match(name)
        assert m is not None
        expected_id = m.group(1)
        en_text = en_files[name].read_text(encoding="utf-8")
        ko_text = ko_files[name].read_text(encoding="utf-8")
        en_id = extract_title_id(en_text)
        ko_id = extract_title_id(ko_text)
        if en_id != expected_id:
            errors.append(
                f"{name}: EN title ADR-ID {en_id!r} != filename {expected_id!r}"
            )
        if ko_id != expected_id:
            errors.append(
                f"{name}: KO title ADR-ID {ko_id!r} != filename {expected_id!r}"
            )
        en_status = extract_status_block(en_text)
        ko_status = extract_status_block(ko_text)
        if en_status is None:
            errors.append(f"{name}: EN missing `## Status` section")
        if ko_status is None:
            errors.append(f"{name}: KO missing `## Status` section")
        if en_status is not None and ko_status is not None and en_status != ko_status:
            errors.append(
                f"{name}: Status block mismatch\n"
                f"      EN: {en_status!r}\n"
                f"      KO: {ko_status!r}"
            )
    return errors
 def main(argv: list[str] | None = None) -> int:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument(
        "--root",
        type=Path,
        default=Path.cwd(),
        help="Repository root (default: cwd)",
    )
    args = p.parse_args(argv)
    errors = verify(args.root)
    if errors:
        print("ADR language pair verification FAILED:")
        for e in errors:
            print(f"  - {e}")
        return 1
    print("ADR language pair verification OK")
    return 0
 if __name__ == "__main__":
    sys.exit(main())
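A small sketch of which filenames the checker's pattern accepts. The regex is copied from `ADR_FILENAME_RE` above. `adr_number` is a hypothetical helper added here for illustration and is not part of the tool:

```python
import re
from typing import Optional

# Same pattern as ADR_FILENAME_RE in tools/verify_adr_lang_pairs.py.
ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-[a-z0-9_-]+\.md$")


def adr_number(filename: str) -> Optional[str]:
    """Return the four-digit ADR number, or None if the name does not match."""
    m = ADR_FILENAME_RE.match(filename)
    return m.group(1) if m else None


for name in (
    "ADR-0013-ver-verification_strategy.md",  # underscore in slug is accepted
    "ADR-0001-foo-bar.md",
    "ADR-13-ver-short-number.md",             # number must be exactly 4 digits
    "ADR-0001-Foo.md",                        # uppercase slug is rejected
    "README.md",
):
    print(f"{name}: {adr_number(name)}")
```

Files that do not match the pattern are skipped by `find_adr_files`, so a misnamed ADR is ignored rather than reported; that is worth keeping in mind when renaming files.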
@@ -78,15 +78,15 @@ cube:
      scheduler_to_fetch_store_mm: 0.5
      dma_to_tcm_bw_gbs:       512.0
      dma_to_tcm_mm:           0.5
-      dma_to_fetch_store_mm:   0.0     # DMA → fetch_store chaining (ADR-0021)
+      dma_to_fetch_store_mm:   0.0     # DMA → fetch_store chaining (ADR-0014 D6)
      fetch_store_to_tcm_bw_gbs: 512.0
      fetch_store_to_tcm_mm:   0.0
-      fetch_store_to_gemm_mm:  0.0     # fetch → GEMM chaining (ADR-0021)
+      fetch_store_to_gemm_mm:  0.0     # fetch → GEMM chaining (ADR-0014 D6)
-      fetch_store_to_math_mm:  0.0     # fetch → MATH chaining (ADR-0021)
+      fetch_store_to_math_mm:  0.0     # fetch → MATH chaining (ADR-0014 D6)
-      gemm_to_fetch_store_mm:  0.0     # GEMM → store chaining (ADR-0021)
+      gemm_to_fetch_store_mm:  0.0     # GEMM → store chaining (ADR-0014 D6)
-      gemm_to_math_mm:         0.0     # GEMM → MATH epilogue chaining (ADR-0021)
+      gemm_to_math_mm:         0.0     # GEMM → MATH epilogue chaining (ADR-0014 D6)
-      math_to_fetch_store_mm:  0.0     # MATH → store chaining (ADR-0021)
+      math_to_fetch_store_mm:  0.0     # MATH → store chaining (ADR-0014 D6)
-      fetch_store_to_dma_mm:   0.0     # store → DMA writeback chaining (ADR-0021)
+      fetch_store_to_dma_mm:   0.0     # store → DMA writeback chaining (ADR-0014 D6)
      gemm_to_tcm_bw_gbs:      512.0
      gemm_to_tcm_mm:          0.5
      math_to_tcm_bw_gbs:      512.0
@@ -99,7 +99,7 @@ cube:
    hbm_total_gb_per_cube: 48
    hbm_slices_per_cube: 8
    hbm_total_bw_gbs: 1024.0
-    hbm_mapping_mode: n_to_one        # one_to_one | n_to_one (ADR-0019)
+    hbm_mapping_mode: n_to_one        # one_to_one | n_to_one (ADR-0017 D8)
    hbm_pseudo_channels: 64           # total pseudo channels per cube
    hbm_channels_per_pe: 8            # = pseudo_channels / pes_per_cube
    hbm_channel_bw_gbs: 32.0          # per-channel bandwidth (GB/s)
@@ -123,7 +123,7 @@ cube:
    per_connection_bw_gbs: 128.0   # BW per connection; 4 × 128 = 512 GB/s = UCIe PHY BW
  links:
-    # Router mesh links (ADR-0019)
+    # Router mesh links (ADR-0017 D5)
    router_link_bw_gbs: 256.0        # inter-router XY mesh link BW
    router_overhead_ns: 2.0          # per-router switching overhead
    pe_to_router_bw_gbs: 256.0      # PE_DMA ↔ router (= N × channel_bw)
@@ -1,2 +1,2 @@
 # Legacy component backups — not actively used.
-# Kept for reference during ADR-0021 migration.
+# Kept for reference during the PE pipeline refactor (ADR-0014).