attention: milestone-gqa-llama70b figures + MILESTONE_FAST (sub-cycle 4c, 5/6)

Add 5 of the 6 figure renderers ADR-0057 D3 sub-cycle 4c specifies:

- gqa_op_log_{panel}.png × 4: per-panel bar chart of the 5 op_log counts
  (gemm, ipcq_send, ipcq_recv, dma_read, dma_write).
- gqa_comparison.png: cross-panel grouped bars over the same 5 series.

Sixth figure (gqa_scaling.png) depends on sub-cycle 4b's Q/cube ∈ {1, 2, 4}
sweep on multi_user_* panels and is deferred until that data exists;
emit_all_gqa_plots returns just the 5 in-scope paths.

Add MILESTONE_FAST=1 mode to run(): skip the panel sweep, reuse the
committed sweep.json, render figures only. Validation mode unchanged.
The runtime errors clearly when neither env var is set, listing the two
supported modes.

Renderers live in the bench module (the milestone-1h-gemm pattern);
tests/gqa/_gqa_plot_helpers.py re-exports them for figure tests.

Tests: tests/gqa/test_plot_gqa_figures.py (7 tests, all green):

- 4 parametrized per-panel emit assertions
- 1 comparison emit assertion
- 1 emit_all returns exactly 5 PNG paths
- 1 default out_dir matches the bench _OUTPUT_DIR

Commits the 5 PNG baselines under the bench output dir alongside
sweep.json, mirroring milestone-1h-gemm's committed-figures pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
attention: land milestone-gqa-llama70b 4-panel sweep bench (ADR-0057 v1)
2026-06-01 22:23:28 -07:00 · 2026-06-01 21:57:12 -07:00 · 2026-06-01 19:53:18 -07:00 · 2026-06-01 19:33:40 -07:00 · 2026-06-01 19:14:09 -07:00 · 2026-05-22 15:37:27 -07:00
349 changed files with 43962 additions and 7833 deletions
@@ -0,0 +1,327 @@
+---
+description: Generate a public-facing architecture design document from approved ADRs and SPEC.md, with gap analysis reported to chat only.
+---
+
+# `/report` — Architecture Design Document Generator
+
+Generates a **public-facing** architecture design document at
+`docs/report/architecture-{YYYY}-{1H|2H}.md` derived from the current ADR
+corpus, SPEC.md, CLAUDE.md, and the canonical component list.
+
+This command is **strictly read-only** on `docs/adr/`, `SPEC.md`,
+`CLAUDE.md`, and `src/`. The only write is the report file itself
+(a derived artifact under `docs/report/`).
+
+---
+
+## Invocation
+
+Two modes:
+
+- `/report` — **dry-run** (default). No file is written. The command
+  reads sources, performs classification, and reports the planned TOC
+  and gap analysis to chat only. Use this to validate ADR-to-section
+  mapping before committing.
+- `/report write` — **write mode**. Performs the same procedure and
+  writes `docs/report/architecture-{period}.md`. Use after a dry-run
+  whose classification looks correct.
+
+Period determination (both modes), from system date:
+
+- month 1–6 → `{YYYY}-1H`
+- month 7–12 → `{YYYY}-2H`
+
+In write mode, if `docs/report/architecture-{period}.md` already exists,
+overwrite it without asking (regeneration is the expected operation).
+
+---
+
+## Output Contract
+
+### Document body (`docs/report/architecture-{period}.md`)
+
+Public release form. Reader is an external developer/architect. They do
+**not** have access to SPEC.md or ADR files. Therefore:
+
+- **No `ADR-NNNN` identifiers** in visible prose.
+- **No `SPEC R/§` identifiers** in visible prose.
+- **No internal jargon** assumed without definition.
+- **No diagram embeds** — only `<!-- DIAGRAM: ... -->` placeholders.
+- **Attribution via HTML comments** — every prose paragraph that derives
+  from a source carries an inline comment immediately above it:
+  `<!-- src: ADR-NNNN <section-name> -->` (multiple sources allowed).
+
+### Chat-only report (not written to any file)
+
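+After writing the document (write mode) or finishing classification
+(dry-run mode), report to the user in the chat response:
+
+- File path written (write mode), or a note that no file was written
+  (dry-run mode).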
+- Section counts (e.g., "Detailed Architecture: 8 components covered,
+  2 in `builtin/` have no ADR backing").
+- **G1 gaps** — SPEC requirements (R-numbers / §) with no ADR citing them.
+- **G2 gaps** — ADRs missing **Context** or **Decision**. Alternatives
+  and Consequences are optional; their absence is NOT a gap.
+- **G3 gaps** — ADR cross-references without a back-reference.
+  Only flag when the referencer's ADR number is **less than** the
+  referenced ADR's number (older → newer). Newer ADRs citing older
+  infrastructure ADRs (higher number → lower number) are expected to
+  be one-way and are NOT flagged.
+- **G4 suggestions** — areas where an ADR seems missing based on the
+  ADR corpus + SPEC reading. Phrase as suggestions, not findings. Each
+  G4 item must say *why* it's suggested and remain falsifiable.
+- **G5 consistency issues** — ADR-to-ADR inconsistencies:
+  - **G5a (supersession not reflected)** — ADR-A states it supersedes
+    ADR-B, but ADR-B's Status is not marked as Superseded.
+  - **G5b (merge candidates)** — two or more ADRs cover near-identical
+    scope (detected naturally during section assignment, not via
+    exhaustive pair-wise scan).
+  - **G5c (explicit contradictions)** — two ADRs whose Decisions
+    directly oppose each other. Must cite both quotations; do not
+    speculate contradictions from topical similarity alone.
+- **TOC rationale** — for each section, list contributing ADR IDs
+  (this is for the user's verification only, never written to the
+  document itself).
+
+G4 must never appear in the document body. G1–G3 and G5 are also chat-only.
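
The G3 direction rule above can be sketched as follows. This is a minimal illustration, assuming cross-references have already been collected into a mapping from referencer number to referenced numbers; `g3_gaps` and the mapping shape are hypothetical, not part of the command.

```python
from __future__ import annotations


def g3_gaps(refs: dict[int, set[int]]) -> list[tuple[int, int]]:
    """Return (referencer, referenced) pairs that count as G3 gaps.

    `refs` maps an ADR number to the ADR numbers it cites (assumed shape).
    Only older -> newer citations without a back-reference are flagged.
    """
    gaps = []
    for referencer in sorted(refs):
        for referenced in sorted(refs[referencer]):
            older_to_newer = referencer < referenced
            has_back_link = referencer in refs.get(referenced, set())
            if older_to_newer and not has_back_link:
                gaps.append((referencer, referenced))
    return gaps
```

Iterating in sorted order keeps the output stable across runs, which matches the determinism constraint stated later in the command.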
+
+---
+
+## Procedure
+
+### Step 1 — Determine period
+
+Use current system date. Compute `{YYYY}-1H` or `{YYYY}-2H`.
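
A minimal sketch of the period rule, assuming the period depends only on the calendar month of the system date. `report_period` and `report_path` are hypothetical helper names used for illustration.

```python
from datetime import date


def report_period(today: date) -> str:
    """Return '{YYYY}-1H' for months 1-6 and '{YYYY}-2H' for months 7-12."""
    half = "1H" if today.month <= 6 else "2H"
    return f"{today.year}-{half}"


def report_path(today: date) -> str:
    """Path of the derived report for the period containing `today`."""
    return f"docs/report/architecture-{report_period(today)}.md"
```

Calling `report_path(date.today())` would give the file that write mode overwrites on regeneration.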
+
+### Step 2 — Ingest ADRs
+
+For each `docs/adr/ADR-NNNN-*.md`:
+
+- If both `ADR-NNNN-*.md` (Korean) and `ADR-NNNN-*.en.md` (English)
+  exist for the same number, **prefer the Korean `.md`** version.
+- Parse for the four canonical sections: Context, Decision, Alternatives
+  (also accept "Alternatives Considered"), Consequences.
+- Record presence/absence of **Context** and **Decision** for G2.
+  Alternatives and Consequences presence is recorded for use during
+  authoring, but their absence is not a gap.
+- Record ADR-NNNN cross-references for G3, preserving the direction
+  (referencer → referenced). G3 evaluation uses ADR numbers to
+  distinguish older→newer (flagged when missing back-link) from
+  newer→older (not flagged; see *Output Contract* G3).
+- Record Status (e.g., Accepted, Superseded, Draft) and any "supersedes
+  ADR-NNNN" text in the body for G5a.
+
+Process ADRs in **numerical order** for determinism.
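
The file-selection part of this step (prefer `.md` over `.en.md`, then process in numerical order) might look like the sketch below. It assumes four-digit ADR numbers in the file name; `select_adr_files` is a hypothetical name.

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumed naming: ADR-NNNN-<slug>.md or ADR-NNNN-<slug>.en.md
_ADR_NAME = re.compile(r"^ADR-(\d{4})-.*?(\.en)?\.md$")


def select_adr_files(adr_dir: Path) -> list[Path]:
    """Pick one file per ADR number, preferring the plain `.md` variant."""
    chosen: dict[int, Path] = {}
    for path in sorted(adr_dir.glob("ADR-*.md")):
        match = _ADR_NAME.match(path.name)
        if match is None:
            continue
        number = int(match.group(1))
        is_english = match.group(2) is not None
        # Keep the `.en.md` file only when no plain `.md` exists for the number.
        if number not in chosen or not is_english:
            chosen[number] = path
    # Numerical order for deterministic processing.
    return [chosen[n] for n in sorted(chosen)]
```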
+
+### Step 3 — Read canonical component list
+
+List `src/kernbench/components/builtin/*.py`, excluding `__init__.py`,
+`pe_types.py`, and `__pycache__/`. Sort alphabetically. This is the
+canonical order for Detailed Architecture subsections.
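
As a rough sketch, the canonical list could be computed like this. Returning file stems (names without `.py`) is an assumption made for illustration, and `canonical_components` is a hypothetical name.

```python
from __future__ import annotations

from pathlib import Path

# Files the step says to exclude; `__pycache__/` is a directory and is
# never matched by the `*.py` glob.
_EXCLUDED = {"__init__.py", "pe_types.py"}


def canonical_components(builtin_dir: Path) -> list[str]:
    """Alphabetically sorted component names found under `builtin_dir`."""
    names = [p.stem for p in builtin_dir.glob("*.py") if p.name not in _EXCLUDED]
    return sorted(names)
```

In the repo this would be called with `Path("src/kernbench/components/builtin")`.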
+
+### Step 4 — Read SPEC.md and CLAUDE.md
+
+For G1 detection: extract every `R<N>` and `§<X.Y>` identifier mentioned
+in SPEC.md. For each ADR, check which of these it cites. SPEC IDs with
+zero citing ADRs → G1.
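
G1 detection reduces to a set difference. The sketch below assumes SPEC identifiers look like `R<digits>` or `§<digits>[.<digits>...]`; the exact ID grammar in SPEC.md may differ, and `spec_ids` / `g1_gaps` are hypothetical names.

```python
from __future__ import annotations

import re

# Assumed ID shapes: R8, R12, §0.1, §2.3.1
_SPEC_ID = re.compile(r"\bR\d+\b|§\d+(?:\.\d+)*")


def spec_ids(text: str) -> set[str]:
    """All SPEC identifiers mentioned in `text`."""
    return set(_SPEC_ID.findall(text))


def g1_gaps(spec_text: str, adr_texts: list[str]) -> list[str]:
    """SPEC identifiers that no ADR cites, in sorted order."""
    cited: set[str] = set()
    for text in adr_texts:
        cited |= spec_ids(text)
    return sorted(spec_ids(spec_text) - cited)
```

The sort here is plain string order (so `R10` precedes `R2`); a numeric sort key would be needed if the chat report should list IDs in SPEC order.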
+
+### Step 5 — Section assignment
+
+Assign each ADR to exactly one of:
+
+- **Design Principles** — project-wide rationale, philosophy, mission
+  (e.g., "why source-level kernel execution", "why fast multi-device
+  scaling"). Includes ADRs that describe foundational invariants
+  (e.g., latency model assumptions, verification strategy).
+- **High-level Architecture** — Tray / SIP / CUBE / PE hierarchy and
+  cross-layer boundaries (e.g., runtime API ↔ sim_engine ↔ components).
+- **Detailed Architecture** — single-component internal designs. One
+  subsection per file in the canonical component list. ADRs whose
+  primary topic is the internal structure of one component go here.
+- **Implementation Decisions** — **cross-cutting** algorithms / policies
+  / schemes / models that don't belong to a single component:
+  collective algorithms, parallelization policies, address schemes,
+  routing algorithms, model assumptions.
+
+Boundary rule between Detailed Architecture and Implementation Decisions:
+
+> Detailed Architecture = component-internal.
+> Implementation Decisions = spans multiple components OR is an
+> algorithm/policy/scheme/assumption rather than a structural choice.
+
+If an ADR fits two sections plausibly, prefer the one that minimizes
+duplication and pick the more specific bucket (Detailed if it primarily
+concerns one component, else Implementation Decisions).
+
+During classification, opportunistically detect ADR consistency issues:
+
+- **G5b (merge candidate)** — if two or more ADRs land in the same
+  Detailed Architecture subsection or the same Implementation Decisions
+  topic AND their primary scope is near-identical, record as a merge
+  candidate. Topical adjacency is not enough; the scopes must be
+  effectively the same question.
+- **G5c (explicit contradiction)** — if while reading you encounter two
+  ADRs whose Decisions directly oppose each other on the same question,
+  record both quotations verbatim with their ADR IDs. Do NOT speculate
+  contradictions from similarity, vocabulary, or domain overlap — only
+  explicit, citable opposition.
+
+Do NOT perform an exhaustive pair-wise scan of all ADRs. G5b/G5c are
+byproducts of normal reading; if not encountered, the chat report
+shows "(none)".
+
+### Step 6 — Write the document (write mode only)
+
+In **dry-run mode**, skip this step entirely and proceed directly to Step 7.
+In **write mode**, produce the document with the following skeleton:
+
+```markdown
+# KernBench — Architecture Design Document
+*{YYYY} {1H|2H}*
+
+## Design Principles
+<prose>
+
+## High-level Architecture
+<intro prose>
+
+### Tray
+### SIP
+### CUBE
+### PE
+
+## Detailed Architecture
+### <component-1>
+### <component-2>
+...
+
+## Implementation Decisions
+### <topic-1>
+### <topic-2>
+...
+```
+
+#### Authoring rules (apply to every section)
+
+- **Stay grounded.** Every claim must trace to an ADR's stated content
+  (Context / Decision / Alternatives / Consequences). No invented
+  motivation, no invented alternatives, no invented trade-offs.
+- **4-part discipline, naturally.** Each subsection should naturally
+  cover: the problem the design addresses, the decision made, the
+  alternatives considered, the consequences. Do **not** label these
+  with rigid headers like "**Problem.**" — weave them into prose. But
+  ensure all four are present *if the source ADR documents them*.
+- **Missing → omit, not fabricate.** If a source ADR has no
+  "Alternatives" section, do **not** invent alternatives for the
+  report. Simply write the remaining parts. Record G2 in chat only
+  when the missing section is Context or Decision.
+- **Attribution.** Every paragraph derived from one or more ADRs
+  carries an HTML comment immediately above:
+  `<!-- src: ADR-NNNN <section> [, ADR-MMMM <section>] -->`.
+- **Diagram placeholders.** Where a diagram would help, insert
+  `<!-- DIAGRAM: <short description of what the diagram should show> -->`
+  on its own line. **Never** embed an image (`![...](...)`).
+- **Public tone.** Self-contained. Define internal terms (SIP, CUBE,
+  PE, Tray, NOC, IPCQ, TCM, etc.) on first use within the document.
+  Do not assume reader has read SPEC or ADRs.
+- **No internal references.** No `ADR-NNNN` in body text. No
+  `SPEC §X.Y` or `R<N>` in body text. These appear only inside HTML
+  attribution comments.
+- **Detailed Architecture component subsections.** Use the canonical
+  list from Step 3 in order. For each component file, write a
+  subsection drawing from any ADR that primarily concerns that
+  component. If no ADR covers a component, write a one-line stub
+  noting the component exists and flag it in chat report. If an ADR
+  covers a topic not in the canonical list, place it under
+  "Detailed Architecture → Other" (sub-subsection) and flag for
+  canonical-list extension in chat.
+- **Implementation Decisions topic naming.** Derive topic names from
+  ADR titles, made reader-friendly (no ADR number). Group related
+  ADRs under one topic when natural (e.g., multiple address-related
+  ADRs under "Address Scheme").
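
Two of the authoring rules above (no internal IDs in visible prose, no image embeds) lend themselves to a mechanical check. The following is a heuristic sketch, not part of the command: the patterns are assumptions about how IDs appear, and `body_violations` is a hypothetical name.

```python
from __future__ import annotations

import re

_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Heuristic patterns; `R<N>` may produce false positives on unrelated tokens.
_INTERNAL_ID = re.compile(r"\bADR-\d{4}\b|\bSPEC\s+§|\bR\d+\b")
_IMAGE_EMBED = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def body_violations(markdown: str) -> list[str]:
    """Check visible text (HTML comments stripped) against two body rules."""
    visible = _COMMENT.sub("", markdown)
    problems = []
    if _INTERNAL_ID.search(visible):
        problems.append("internal ID in visible prose")
    if _IMAGE_EMBED.search(visible):
        problems.append("image embed")
    return problems
```

Attribution comments and `<!-- DIAGRAM: ... -->` placeholders are removed before matching, so they do not trigger either check.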
+
+### Step 7 — Generate chat report
+
+After Step 6 (write mode) or directly from Step 5 (dry-run mode),
+emit the following to chat. Do **not** write any of this to a file.
+
+In **dry-run mode**, replace the `Wrote:` line with:
+`**DRY-RUN — no file written.** Review TOC and gaps below. Run \`/report write\` to commit.`
+
+```
+## /report — Generation Summary
+
+**Wrote:** docs/report/architecture-{period}.md
+
+**Section coverage**
+- Design Principles: <N> ADRs
+- High-level Architecture: <N> ADRs
+- Detailed Architecture: <covered>/<total> components; components without ADR: [...]
+- Implementation Decisions: <N> topics, <N> ADRs
+
+**TOC rationale (ADR → section mapping)**
+- Design Principles: ADR-NNNN, ADR-MMMM
+- High-level Architecture: ...
+- Detailed Architecture → <component>: ADR-NNNN
+- Implementation Decisions → <topic>: ADR-NNNN, ADR-MMMM
+
+**G1 — SPEC requirements without ADR support**
+- R<N> / §<X.Y>: not cited by any ADR
+- (or "none")
+
+**G2 — ADRs missing required sections (Context or Decision)**
+- ADR-NNNN: missing <Context|Decision>
+- (or "none")
+
+**G3 — Broken cross-references** (older → newer only)
+- ADR-NNNN cites ADR-MMMM (NNNN < MMMM); ADR-MMMM does not back-reference
+- (or "none")
+- Note: newer ADRs citing older infrastructure ADRs (NNNN > MMMM) are
+  not flagged here — one-way references are the expected pattern.
+
+**G4 — Suggested topics that may warrant a new ADR (verify before acting)**
+- <topic>: <why agent thinks it may be missing — must be falsifiable>
+- (or "none")
+
+**G5 — ADR consistency issues**
+- **G5a (supersession not reflected)**
+  - ADR-NNNN claims to supersede ADR-MMMM, but ADR-MMMM Status is "<status>"
+  - (or "none")
+- **G5b (merge candidates)**
+  - ADR-NNNN + ADR-MMMM: near-identical scope on <topic> — evaluate merge
+  - (or "none")
+- **G5c (explicit contradictions)**
+  - ADR-NNNN says "<quote>"; ADR-MMMM says "<quote>" — direct opposition on <question>
+  - (or "none")
+```
+
+---
+
+## Constraints (do not violate)
+
+1. **Read-only on source.** No writes to `docs/adr/`, `SPEC.md`,
+   `CLAUDE.md`, or `src/`. Only write is
+   `docs/report/architecture-{period}.md`.
+2. **No fabrication.** Every body paragraph traces to ADR content via
+   HTML attribution comment.
+3. **No diagram embeds.** Placeholders only.
+4. **No internal IDs in body.** ADR-NNNN and SPEC R/§ stay inside
+   HTML comments only.
+5. **Determinism.** ADRs processed in numerical order; components in
+   canonical (alphabetical) order. Same inputs → same output.
+6. **G4 stays in chat.** Never written to the document.
+7. **Korean bilingual preference.** When both `.md` and `.en.md`
+   exist for the same ADR number, use `.md`.
+8. **All ADRs included.** No exclusion list. ADRs about internal
+   tooling (CLI, diagram views, verification strategy) are still
+   included — usually under Design Principles or Implementation
+   Decisions, written in publishable form.
+
+---
+
+## Failure modes to avoid
+
+- **Padding** with general background not present in the source ADRs.
+- **Inferring alternatives** the ADR doesn't mention.
+- **Quietly skipping** an ADR because it seems internal. Include it,
+  rephrase for public audience.
+- **Inventing components** not in `src/kernbench/components/builtin/`.
+- **Auto-selecting diagrams** from `docs/diagrams/`. Only placeholders.
+- **Promoting G4 suggestions to the document.** They stay in chat.
@@ -9,7 +9,59 @@
      "Bash(python -m kernbench.cli.main probe --topology topology.yaml)",
      "Bash(xargs grep -l \"class.*ComponentBase\\\\|class.*DefaultComponent\")",
      "Bash(python -m pytest tests/test_probe.py -v)",
-      "Bash(python -m pytest tests/test_probe.py tests/test_component_registry.py -v)"
+      "Bash(python -m pytest tests/test_probe.py tests/test_component_registry.py -v)",
+      "Bash(python -m pytest -o \"addopts=\" --no-header tests/test_intercube_root_center.py)",
+      "Bash(python -m pytest -o \"addopts=\" --no-header tests/test_tp_layers.py tests/test_tp_mlp.py)",
+      "Bash(git commit -m ' *)",
+      "Bash(git stash *)",
+      "Bash(python scripts/emit_overview_with_external_ref.py)",
+      "Bash(where inkscape *)",
+      "Bash(\"/c/Program Files \\(x86\\)/Microsoft/Edge/Application/msedge.exe\" --headless --disable-gpu --screenshot=\"$\\(pwd\\)/docs/diagrams/cube_mesh_view.png\" --window-size=1400,1300 \"file:///$\\(pwd)",
+      "Bash(python scripts/build_overview_slides.py)",
+      "Bash(git fetch *)",
+      "Bash(git pull *)",
+      "Bash(python -m pytest --no-header tests/test_allreduce_buffer_kind_sweep.py)",
+      "Bash(python -m pytest --no-header tests/test_pe_to_pe_latency.py)",
+      "Bash(python -m pytest --no-header tests/test_ipcq_buffer_kind_locations.py -v)",
+      "Bash(python -m pytest --no-header tests/test_ipcq_buffer_kind_locations.py tests/test_ipcq_buffer_kind_latency.py tests/test_allreduce_buffer_kind_sweep.py)",
+      "Bash(git checkout *)",
+      "Bash(python -m pytest --no-header tests/test_ipcq_buffer_kind_latency.py::test_slot_write_latency_orders_tcm_hbm_sram)",
+      "Bash(python scripts/emit_ipcq_send_recv_model_plots.py)",
+      "Bash(python -m pytest --no-header tests/test_pe_to_pe_latency.py -x)",
+      "Bash(python -m pytest --no-header tests/test_pe_to_pe_latency.py tests/test_ipcq_buffer_kind_locations.py tests/test_ipcq_buffer_kind_latency.py tests/test_allreduce_buffer_kind_sweep.py)",
+      "Bash(kill %1)",
+      "Bash(awk '{print $2}')",
+      "Bash(xargs -r kill)",
+      "Bash(python scripts/_debug_op_log.py)",
+      "Bash(SWEEP_SHAPES=\"16,32,64,128,256\" python scripts/gemm_sweep.py)",
+      "Bash(python scripts/plot_gemm_sweep.py)",
+      "Bash(python scripts/gemm_sweep.py)",
+      "Bash(python scripts/gen_pe_pipeline_diagram.py)",
+      "Bash(python scripts/gen_matmul_32x128x32_diagram.py)",
+      "Bash(python -m pytest tests/test_pe_pipeline.py -x --tb=short)",
+      "Bash(python -m pytest tests/test_pe_pipeline.py tests/test_e2e_pipeline.py tests/test_op_log.py -x --tb=short -q)",
+      "Bash(ls -la C:/Users/mukes/.claude/projects/c--Users-mukes-Mukesh-ywkang-git-kernbench2/ 2>&1 | head -20)",
+      "Read(//c/Users/mukes/.claude/projects/c--Users-mukes-Mukesh-ywkang-git-kernbench2/**)",
+      "Bash(awk 'NR==1812 || NR==1815' C:/Users/mukes/.claude/projects/c--Users-mukes-Mukesh-ywkang-git-kernbench2/e55237ed-5c1f-4a89-a3b9-9b74fec45366.jsonl)",
+      "Bash(awk 'NR==1058' C:/Users/mukes/.claude/projects/c--Users-mukes-Mukesh-ywkang-git-kernbench2/e55237ed-5c1f-4a89-a3b9-9b74fec45366.jsonl)",
+      "Bash(awk -F: '$1 > 1700 && $1 < 1815 {print $1}')",
+      "Bash(awk 'NR==1812' C:/Users/mukes/.claude/projects/c--Users-mukes-Mukesh-ywkang-git-kernbench2/e55237ed-5c1f-4a89-a3b9-9b74fec45366.jsonl)",
+      "Bash(awk 'NR>=1815 && NR<=1825' C:/Users/mukes/.claude/projects/c--Users-mukes-Mukesh-ywkang-git-kernbench2/e55237ed-5c1f-4a89-a3b9-9b74fec45366.jsonl)",
+      "Bash(awk 'NR>1815' C:/Users/mukes/.claude/projects/c--Users-mukes-Mukesh-ywkang-git-kernbench2/e55237ed-5c1f-4a89-a3b9-9b74fec45366.jsonl)",
+      "Bash(awk 'NR==1839' C:/Users/mukes/.claude/projects/c--Users-mukes-Mukesh-ywkang-git-kernbench2/e55237ed-5c1f-4a89-a3b9-9b74fec45366.jsonl)",
+      "Bash(git log *)",
+      "Bash(python -m pytest tests/test_op_log.py tests/test_pe_components.py tests/test_pe_pipeline.py -x --tb=short)",
+      "Bash(python -m pytest tests/test_pe_to_pe_latency.py tests/test_e2e_pipeline.py tests/test_e2e_data.py tests/test_data_executor.py tests/test_pe_dma_ipcq.py -x --tb=short)",
+      "Bash(python -m pytest tests/test_pe_pipeline.py::test_pe_dma_record_start_after_channel_acquire -x --tb=long)",
+      "Bash(python -m pytest tests/test_pe_pipeline.py::test_pe_dma_record_start_after_channel_acquire -x --tb=short)",
+      "Bash(python -m pytest tests/test_op_log.py tests/test_pe_components.py tests/test_pe_pipeline.py tests/test_pe_to_pe_latency.py tests/test_e2e_pipeline.py tests/test_e2e_data.py tests/test_data_executor.py tests/test_pe_dma_ipcq.py --tb=short)",
+      "Bash(python -m pytest tests/test_pe_pipeline.py -q)",
+      "Bash(python -m pytest tests/test_pe_pipeline.py tests/test_triton_emu.py -q)",
+      "Bash(python -m pytest tests/test_composite_epilogue.py -v)"
+    ],
+    "additionalDirectories": [
+      "c:\\Users\\mukes\\Mukesh\\ywkang_git\\kernbench2\\tests",
+      "C:\\Users\\mukes\\Mukesh\\ywkang_git\\kernbench2\\tests\\pe2pe_latency_plots"
    ]
  }
 }
@@ -30,7 +30,10 @@
      "Bash(python -m pytest tests/test_pe_components.py -v)",
      "Bash(python -m pytest tests/test_triton_emu.py -v)",
      "Bash(python -m pytest tests/test_pe_components.py tests/test_triton_emu.py -v)",
-      "Bash(python -m pytest tests/test_pe_components.py::test_mcpu_multi_pe_kernel_launch tests/test_pe_components.py::test_qkv_gemm_bench_multi_pe_completes -v)"
+      "Bash(python -m pytest tests/test_pe_components.py::test_mcpu_multi_pe_kernel_launch tests/test_pe_components.py::test_qkv_gemm_bench_multi_pe_completes -v)",
+      "Bash(git add:*)",
+      "Bash(git commit:*)",
+      "Bash(git push:*)"
    ]
  }
 }
@@ -29,3 +29,6 @@ build/

 # Logs
 *.log
+.claude/*
+!.claude/commands/
+!.claude/commands/*.md
@@ -5,27 +5,10 @@ SPEC.md and ADRs are the source of truth.

 ---

-## Terminology
+# Part 1 — General Behavior

- runtime API:
-  Host-facing public API used by benchmarks and user code (e.g., tensor deployment, kernel launch).
- simulation engine (sim_engine):
-  Discrete-event engine responsible for request injection, scheduling, and completion tracking.
- components:
-  Device-side nodes modeling hardware behavior (IO_CPU, M_CPU, PE_CPU, routers, engines, etc.).
-
-## Authority & Scope
-
- SPEC.md defines the architectural contract.
- ADRs (docs/adr/ADR-*.md) define non-trivial architectural decisions.
- If a change conflicts with SPEC.md or an ADR:
-  - STOP.
-  - Explain the conflict.
-  - Propose options (keep spec, update ADR, or narrow scope).
- Do NOT silently change architecture.
- The repository structure reflects architectural intent; Claude Code MUST respect existing module boundaries and file locations.
-
---
+> Reusable across repos. Describes *how* Claude Code interacts with the user
+> and constructs changes, independent of this project's domain.

 ## Design Questions

@@ -37,14 +20,21 @@ SPEC.md and ADRs are the source of truth.
  - ADRs
 - If a design question implies a change, default to Phase 1.

---
+## Surfacing Choices
+
+Applies to both design discussions and Phase 1 proposals.
+
+- If multiple valid interpretations of the request exist, present them.
+  Do NOT pick one silently.
+- If a simpler approach exists, say so. Push back when warranted —
+  do NOT just implement the more complex path the user proposed.
+- State required assumptions explicitly. If uncertain, ask before assuming.

 ## Change & Test Protocol (Mandatory)

 All non-trivial changes MUST follow a two-phase process.
-Design discussion is always allowed; code changes are not.
-
---
+Design discussion is always allowed.
+Production code changes require Phase 1 approval before Phase 2 applies them.

 ### Phase 1 — Proposal + Verification

@@ -63,20 +53,18 @@ Design discussion is always allowed; code changes are not.
 - Explain why the change is needed.
 - Explain consistency with SPEC.md and relevant ADRs.

-1) **Verification Plan**
+2) **Verification Plan**

- SPEC requirement(s) / ADR(s) affected (e.g., R1/R2/R5, ADR-0002).
+- SPEC requirement(s) / ADR(s) affected.
 - Tests that validate the change:
  - existing tests to run, and/or
  - new tests to add.
- Concrete input cases used by the tests:
-  - topology (SIP / CUBE / PE layout)
-  - request parameters (src, dst, size_bytes).
- Expected observable assertions, such as:
-  - hop trace contains key waypoints,
-  - latency invariants (e.g., > 0, monotonic increase),
-  - deterministic route selection.
-  - **expected changes (or no changes) in generated diagrams**, if applicable.
+- Concrete input cases used by the tests.
+- Expected observable assertions.
+- Expected changes (or no changes) in generated artifacts, if applicable.
+
+(Project-specific expectations for what these inputs/assertions look like:
+see Part 2 → *Verification Plan — Project Expectations*.)

 If the Verification Plan is missing or vague, STOP.

@@ -89,7 +77,13 @@ If the Verification Plan is missing or vague, STOP.

 - Any production code changes
 - Any SPEC.md or ADR modifications
- Any production diff output
+- Final, ready-to-apply unified diffs (Phase 2 only)
+
+#### Permitted for design discussion
+
+- Pseudocode, interface sketches, type signatures
+- Small illustrative snippets to clarify a design point
+- "Before / after" excerpts (not full diffs)

 #### Phase 1 Output

@@ -100,8 +94,6 @@ If the Verification Plan is missing or vague, STOP.
  - "No Phase 2 needed" OR
  - "Await approval for Phase 2"

---
-
 ### Phase 2 — Apply + Verify + Rollback

 #### Trigger
@@ -112,10 +104,10 @@ Phase 2 is triggered ONLY by the exact user approval phrase:

 #### Phase 2 Rules

- Output **minimal unified diffs only**
- Modify ONLY production files declared in Phase 1
- Do NOT include explanations, comments, or unchanged code
- Automatically apply the diff to the working tree
+- Keep changes minimal and scoped to the approved Phase 1 proposal.
+- Modify only production files declared in Phase 1.
+- Avoid unrelated edits, cleanup, or formatting churn.
+- Automatically apply approved changes to the working tree.

 #### Mandatory Verification

@@ -126,7 +118,7 @@ Phase 2 is triggered ONLY by the exact user approval phrase:
 If ALL tests PASS:

 - Keep the applied changes
- Ensure generated diagrams (if affected) are consistent
+- Ensure generated artifacts (if affected) are consistent
 - Report success concisely

 #### Failure Path (Mandatory)
@@ -143,8 +135,210 @@ If ANY test FAILS:

 Tests must NEVER be weakened, removed, or altered to force Phase 2 to pass.

+Failing tests may indicate:
+- invalid assumptions,
+- architectural violations,
+- or incomplete modeling.
+
+Do not assume the test is wrong without explicit evidence.
+
+## Allowed Exceptions
+
+(Protocol Still Required)
+
+- comments or docstrings
+- formatting-only changes
+- type annotation changes with no runtime behavior change
+
+In exceptions, Phase 1 MUST explicitly state:
+**"No behavior change; tests unchanged."**
+
+## Coding Style
+
+Applies to all production code changes (Phase 2) and test code (Phase 1).
+The Phase 1/2 protocol decides *whether* and *what* to change;
+this section decides *how* the resulting diff should look.
+
+### Simplicity First
+
+**Minimum code that solves the problem. Nothing speculative.**
+
+- Write the minimum code that satisfies the Phase 1 proposal.
+- No abstractions for single-use code.
+- No "flexibility"/"configurability" not declared in Phase 1.
+- No error handling for impossible scenarios.
+
+Ask yourself: "Would a senior engineer say this is overcomplicated?" If yes, simplify.
+
+### Surgical Changes
+
+**Touch only what you must. Clean up only your own mess.**
+
+- Touch only files declared in the Phase 1 proposal.
+- Don't "improve" adjacent code, comments, or formatting.
+- Match existing style in the file, even if you'd do it differently.
+- If your changes orphan imports/variables/functions, remove them.
+- If you notice pre-existing dead code, do NOT delete it silently.
+  Mention it, and present options:
+    (a) delete (with approval),
+    (b) keep as-is,
+    (c) refactor to make it reachable / repurposed.
+  Let the user choose before acting.
+- Every changed line must trace to the Phase 1 proposal.
+
+## Enforcement Defaults
+
+General fallbacks. Apply to anything not explicitly covered above.
+
+- If unsure whether a change is non-trivial → treat it as non-trivial.
+- If unsure whether Phase 2 is allowed → STOP and ask.
+
 ---

+# Part 2 — Project-Specific (kernbench)
+
+> Specific to this repo's domain (SIP/CUBE/PE topology, runtime API, sim_engine).
+> Replace this entire Part when adapting the framework to another repo.
+>
+> Contains **foundations** (Authority & Scope → Terminology → Terminology
+> Discipline → Mental Model → Common Failure Modes) followed by **rules**
+> (Non-Trivial, Verification Plan, CLI, Derived Artifacts, ADR Translation
+> Discipline, runtime API / sim_engine Boundaries).
+
+## Authority & Scope
+
+- SPEC.md defines the architectural contract.
+- ADRs (docs/adr/ADR-*.md) define non-trivial architectural decisions.
+- If a change conflicts with SPEC.md or an ADR:
+  - STOP.
+  - Explain the conflict.
+  - Propose options (keep spec, update ADR, or narrow scope).
+- Do NOT silently change architecture.
+- The repository structure reflects architectural intent; Claude Code MUST respect existing module boundaries and file locations.
+
+### ADR Lifecycle
+
+ADRs live in one of four folders. Three hold the primary ADR text, one
+per lifecycle state (canonical English once Accepted); the fourth holds
+Korean translations:
+
+- `docs/adr/` — **Accepted** (canonical English; current
+  implementation reflected).
+- `docs/adr-proposed/` — **Proposed**, **Stub**, or **Draft** (design
+  only / future-work exploration / retroactive documentation pending
+  verification). **Authoring language is free** (any language); the
+  promotion step (below) translates to English.
+- `docs/adr-history/` — **Superseded** or **Merged** (no longer the
+  authoritative source; kept as historical record). Frozen — language
+  policy not applied retroactively.
+- `docs/adr-ko/` — Korean translations of accepted ADRs (derived
+  artifact, 1:1 mirror of `docs/adr/`). English in `docs/adr/` is the
+  canonical source of truth; when KO and EN disagree, EN wins. See
+  *ADR Translation Discipline* below.
+
+Status field values:
+
+- `Accepted` — design is in current implementation.
+- `Proposed` — design is concrete but not yet implemented.
+- `Stub (Future Work)` — design space exploration; no commitment yet.
+- `Draft` — retroactive documentation drafted but not yet verified
+  against the implementation it describes.
+- `Superseded by ADR-NNNN` — replaced by another ADR.
+- `Merged into ADR-NNNN` — content absorbed by another ADR.
+
+Transitions:
+
+- **Proposed/Stub → Accepted**: when the ADR's decisions are
+  reflected in production code AND covered by tests. If the proposed
+  ADR is in Korean, translate to English and place the English in
+  `docs/adr/`; move the Korean original to `docs/adr-ko/`. If the
+  proposed ADR is in English, `git mv` it to `docs/adr/` and create
+  the Korean translation in `docs/adr-ko/`. Change Status to
+  `Accepted` in both files.
+- **Draft → Accepted**: when the ADR's text has been verified to
+  accurately describe the existing implementation. Same English /
+  Korean placement rule as above.
+- **Accepted → Superseded**: set Status to `Superseded by ADR-MMMM`
+  in both the EN and KO files and `git mv` both to their respective
+  history locations (`docs/adr-history/` for English; the KO copy
+  stays in `docs/adr-ko/` only if it was already mirrored — see *ADR
+  Translation Discipline* for the frozen-history exception).
+- **Accepted → Merged**: set Status to `Merged into ADR-MMMM`
+  (single-line stub) in both files and apply the same `git mv` rule
+  as the Superseded transition.
+
+Cross-references between ADRs use the `ADR-NNNN` ID and remain valid
+regardless of folder location. ADR numbers are **immutable**; never
+renumber. Numbering holes from moved ADRs are expected.
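
The folder rules and Status values above imply a simple status-to-folder mapping for the primary ADR text. This is an illustrative sketch only; `adr_folder` is a hypothetical helper, and Korean translations under `docs/adr-ko/` are outside its scope.

```python
def adr_folder(status: str) -> str:
    """Folder that should hold an ADR's primary text for a given Status value."""
    if status == "Accepted":
        return "docs/adr/"
    if status in ("Proposed", "Stub (Future Work)", "Draft"):
        return "docs/adr-proposed/"
    if status.startswith(("Superseded by ADR-", "Merged into ADR-")):
        return "docs/adr-history/"
    raise ValueError(f"unknown ADR status: {status!r}")
```

Raising on unknown values keeps a mistyped Status from silently landing in the wrong folder.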
+
+## Terminology
+
+- runtime API:
+  Host-facing public API used by benchmarks and user code (e.g., tensor deployment, kernel launch).
+- simulation engine (sim_engine):
+  Discrete-event engine responsible for request injection, scheduling, and completion tracking.
+- components:
+  Device-side nodes modeling hardware behavior (IO_CPU, M_CPU, PE_CPU, routers, engines, etc.).
+
+## Terminology Discipline
+
+Use only terms established in SPEC.md, ADRs, existing notes, or code.
+Do not coin new terms (status labels, tiers, classifications, role names)
+without explicit user approval. When a needed term is missing or ambiguous,
+ask before introducing one. When proposing a rename, show the existing
+term and the proposed change side-by-side and wait for approval.
+
+## Mental Model
+
+The simulator is layered along **request flow**:
+
+  runtime API           (host-facing: tensor ops, kernel launch;
+                         topology-agnostic, no routing — ADR-0007)
+       ↓
+  sim_engine            (schedules events, routes requests,
+                         tracks completion via correlation IDs)
+       ↓
+  components            (device-side nodes: IO_CPU, M_CPU, PE_CPU,
+                         routers, engines — model HW behavior
+                         including interconnect)
+
+Configuration & decisions (orthogonal to request flow):
+- **topology**  — compiled at config time (ADR-0006); defines which
+  components exist and how they connect. Authoritative graph for sim_engine.
+- **policy** (routing / address / placement) — consulted by sim_engine
+  during request handling.
+
+Invariant: all latency arises from **explicit scheduled events on modeled
+components and links** (SPEC §0.1, R8). No implicit waits, no magic delays.
+
+Stay within layer boundaries; do not collapse or bypass for convenience.
+
+## Common Failure Modes
+
+Anti-patterns that violate the Mental Model or Golden Invariants (SPEC §0.1).
+If your change does any of these, STOP and reconsider.
+
+- **runtime topology mutation** — topology is compiled at config time; do not
+  add/remove nodes or edges during simulation (ADR-0006).
+- **nondeterministic iteration order** — never iterate sets, unordered dicts,
+  or anything else with implementation-defined order on the critical path.
+  Determinism is required (SPEC §0.1).
+- **routing policy inside runtime API** — runtime API is topology-agnostic;
+  routing/fan-out belongs in policy + sim_engine (ADR-0007).
+- **latency modeled outside sim_engine scheduling** — every delay must come
+  from an explicit scheduled event on a modeled component or link
+  (SPEC §0.1, R8). No magic sleeps, no hardcoded constants smuggled in.
+- **hidden cross-layer coupling** — do not skip layer interfaces.
+  e.g., runtime API must not call into components directly, bypassing sim_engine.
+- **silent ADR/SPEC reinterpretation** — surface conflicts; do not paper over them.
+  See *Authority & Scope* above.
+- **weakening tests to make Phase 2 pass** — fix the code, not the test.
+  See *Part 1 → Phase 2 → Failure Path*.
+- **asserting from memory without source check** — quantitative
+  architectural facts (topology counts, sizes, latencies, address widths,
+  port arities) must be sourced from SPEC.md or a specific ADR before
+  assertion. Memory is unreliable. If the source is silent, surface the
+  gap rather than guessing.
+
 ## What Counts as "Non-Trivial"

 (Protocol Required)
@@ -158,39 +352,82 @@ Any of the following:
 - changes affecting determinism or connectivity
 - changes touching two or more production files

---
+## Verification Plan — Project Expectations

-## Allowed Exceptions
+Concrete forms that Part 1's *Verification Plan* MUST take in this repo:

-(Protocol Still Required)
-
- comments or docstrings
- formatting-only changes
- type annotation changes with no runtime behavior change
-
-In exceptions, Phase 1 MUST explicitly state:
-**"No behavior change; tests unchanged."**
-
---
+- SPEC requirement(s) / ADR(s) affected (e.g., R1/R2/R5, ADR-0002).
+- Concrete input cases:
+  - topology (SIP / CUBE / PE layout)
+  - request parameters (src, dst, size_bytes).
+- Expected observable assertions, such as:
+  - hop trace contains key waypoints,
+  - latency invariants (e.g., > 0, monotonic increase),
+  - deterministic route selection,
+  - **expected changes (or no changes) in generated diagrams**, if applicable.

 ## CLI Semantics

 - `kernbench run --device <id>` runs the benchmark on a single device.
 - Omitting `--device` runs the benchmark on all devices discovered in the topology (logically parallel).
 - Device enumeration is handled by the CLI only; benchmarks MUST remain single-device.
+- **Eval-bench exception (ADR-0054)**: a *milestone / eval bench*
+  (`milestone-1h-*`) may drive many configurations and build its own
+  per-config engines to regenerate a domain's full result + figure set; it
+  ignores `--device` and submits a sentinel tensor to satisfy the
+  "must submit ≥1 request" contract (ADR-0045 D4). This is the eval-harness
+  carve-out to the single-device rule, alongside the ADR-0024 multi-SIP CCL
+  exception.

 ## Derived Artifacts (Clarification)

 - Generated diagrams under `docs/diagrams/` are **derived artifacts**, not production code.
- Creating or updating files in `docs/diagrams/`:
+- Korean ADR translations under `docs/adr-ko/` are **derived artifacts**
+  (mirror of the canonical English in `docs/adr/`); see *ADR Translation
+  Discipline*.
+- Creating or updating files in `docs/diagrams/` or `docs/adr-ko/`:
  - does NOT count as a production code change,
  - does NOT require Phase 2 approval,
  - MUST be consistent with SPEC.md and ADRs.

-## Enforcement Defaults
+## ADR Translation Discipline
+
+English in `docs/adr/` is the canonical source of truth. Korean in
+`docs/adr-ko/` mirrors it 1:1 as a derived artifact.
+
+**Bidirectional sync rule (MUST)**: any edit to a file in `docs/adr/`
+must be accompanied, in the same change, by a mirroring edit to
+`docs/adr-ko/<same-filename>.md`. The reverse also applies: edits to
+`docs/adr-ko/` must mirror back into `docs/adr/`. The two files must
+always describe the same architectural content.
+
+Mechanics:
+
+- When editing an EN ADR, propagate the change to its KO counterpart
+  by translating just the diff (preserve unaffected KO prose); do not
+  regenerate the whole KO file from scratch.
+- When editing a KO ADR, propagate to EN the same way.
+- Filename mirror: `docs/adr/X.md` ↔ `docs/adr-ko/X.md` (no language
+  suffix in either path).
+- The `## Status` *lifecycle keyword* (`Accepted`, `Proposed`,
+  `Stub (Future Work)`, `Draft`, `Superseded by ADR-NNNN`,
+  `Merged into ADR-NNNN`) must match between EN and KO. Parenthetical
+  commentary and any list items that follow the keyword may be
+  translated naturally (the verify tool ignores them when comparing).
+- Conflict policy: if the two diverge despite the rule, treat EN as
+  authoritative and overwrite KO. Surface the divergence to the user
+  before reconciling.
+- `docs/adr-proposed/` is exempt — single language only, no mirror
+  required until promotion.
+- `docs/adr-history/` is frozen — pre-existing mixed-language state
+  there is not migrated.
+
+Verification: `python tools/verify_adr_lang_pairs.py` checks that
+every EN ADR has a matching KO file, the title's ADR-NNNN matches the
+filename, and Status blocks are byte-equal. Run it on demand or wire
+it into CI. Exit code: 0 = OK, 1 = mismatch.
+
+## runtime API / sim_engine Boundaries

- If unsure whether a change is non-trivial → treat it as non-trivial.
- If unsure whether Phase 2 is allowed → STOP and ask.
- SPEC.md and ADRs are the final authority.
 - runtime API MUST NOT hardcode topology/routing or internal hop sequences.
 - sim_engine MUST remain independent of runtime API semantics (no tensor/kernel policy logic).
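The *ADR Translation Discipline* rule in the hunk above requires the `## Status` lifecycle keyword to match between the EN and KO files, while parenthetical commentary may differ. A minimal sketch of that comparison follows. It is an assumption for illustration, not the contents of `tools/verify_adr_lang_pairs.py`; the helper names `lifecycle_keyword` and `status_matches` are hypothetical.

```python
import re
from typing import Optional

# Lifecycle keywords listed in the ADR Translation Discipline section.
_KEYWORDS = (
    r"Accepted",
    r"Proposed",
    r"Stub \(Future Work\)",
    r"Draft",
    r"Superseded by ADR-\d{4}",
    r"Merged into ADR-\d{4}",
)
_PATTERN = re.compile(r"^(?:" + "|".join(_KEYWORDS) + r")")


def lifecycle_keyword(markdown: str) -> Optional[str]:
    """Return the keyword that opens the '## Status' section, or None."""
    lines = markdown.splitlines()
    for i, line in enumerate(lines):
        if line.strip() != "## Status":
            continue
        for body in lines[i + 1:]:
            text = body.strip()
            if not text:
                continue  # skip blank lines between heading and status text
            match = _PATTERN.match(text)
            return match.group(0) if match else None
    return None


def status_matches(en_text: str, ko_text: str) -> bool:
    """True when both files carry the same lifecycle keyword."""
    en = lifecycle_keyword(en_text)
    return en is not None and en == lifecycle_keyword(ko_text)
```

Comparing only the extracted keyword lets trailing commentary be translated freely, which is the behavior the Mechanics list describes.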
@@ -155,5 +155,6 @@ kernbench/
 ## Documentation

 - [CHANGES.md](CHANGES.md) — changelog with detailed descriptions of each release
- [docs/latency-model.md](docs/latency-model.md) — latency model explanation with worked examples
+- [docs/onboarding/latency-model.md](docs/onboarding/latency-model.md) — latency model explanation with worked examples
+- [docs/onboarding/](docs/onboarding/) — onboarding guides (architecture overview, latency model, CCL author guide, intro presentation)
 - [docs/adr/](docs/adr/) — Architecture Decision Records
@@ -51,8 +51,8 @@ Major architectural decisions are documented in ADRs and referenced by number.
 - ADR-0007: runtime_api vs sim_engine responsibility boundaries
 - ADR-0008: Tensor deployment and allocation (Host allocator, PA-first)
 - ADR-0009: Kernel execution fan-out and completion semantics
- ADR-0010: CLI device selection and multi-device execution semantics
- ADR-0011: Memory addressing simplification (PA-first)
+- ADR-0010: Command line interface and execution semantics
+- ADR-0011: Memory Addressing — PA / VA / LA Address Models
 - ADR-0012: Host ↔ IO_CPU message schema (PA-first, PE-tagged shards)
 - ADR-0013: Verification strategy and Phase 1 test plan
 - ADR-0014: PE internal execution model (PE_CPU, PE_SCHEDULER, composite commands)
@@ -204,15 +204,23 @@ benchmark instances by default.

 ---

-## R10. Memory Addressing (Phase 0)
+## R10. Memory Addressing

-The simulator uses a **VA/PA memory model** (ADR-0011):
+The simulator defines three address models in ADR-0011; one is selected
+per simulation configuration:
+
+- **PA (Physical Address)** — direct PA, retained as PageFault fallback.
+- **VA (Virtual Address with MMU)** — currently implemented default.
+- **LA (Logical Address with BAAW)** — proposed, supports per-channel
+  HBM modelling (1:1 / n:1 mapping modes).
+
+VA model details (current default):

 - Tensors are assigned a contiguous virtual address (VA) range at deployment.
 - PE_MMU translates VA→PA per access; TLB overhead is configurable.
 - Mapping installation (MmuMapMsg) traverses the fabric with measured latency.
 - Replicate tensors use per-cube local PA mapping; sharded tensors broadcast.
- PA-only fallback is retained for backward compatibility.
+- PA fallback is retained for backward compatibility.
 - Tensor placement is represented as a list of PA shards, each explicitly tagged
  with `(sip, cube, pe)`, plus a tensor-wide `va_base`.

@@ -1,129 +0,0 @@
-"""CCL all-reduce bench — single unified entry point.
-
-Driven entirely by ``ccl.yaml`` + ``topology.yaml``:
-
- ``defaults.algorithm`` in ``ccl.yaml`` picks which kernel to run
-  (``ring_allreduce_{tcm,hbm,sram}`` / ``mesh_allreduce_4`` /
-  ``tree_allreduce_7``).
- ``world_size`` is derived from the algorithm entry's override or from
-  the topology spec (``sips × cubes_per_sip × pes_per_cube``).
- The host code uses only real PyTorch ``torch.distributed`` names:
-  ``init_process_group``, ``get_world_size``, ``get_rank``, ``all_reduce``.
-
-The bench is split into ``worker(rank, world_size, torch)`` — the
-per-rank business logic, designed to look like a real PyTorch DDP
-training worker so future model benches can reuse the same skeleton —
-and ``run(torch)`` — the kernbench-specific launcher that initializes
-the process group and invokes the worker.
-"""
-from __future__ import annotations
-
-import numpy as np
-
-from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
-from kernbench.policy.placement.dp import DPPolicy
-
-# Default per-rank tile size if ccl.yaml doesn't override it. Real
-# pytorch benches hardcode batch/feature dims similarly.
-DEFAULT_N_ELEM = 32
-
-
-def _derive_dp(spec: dict, world_size: int) -> DPPolicy:
-    """Pick a DPPolicy that fans the tensor across exactly ``world_size`` PEs.
-
-    Mirrors what a real PyTorch DDP user does manually with
-    ``tensor.to(f"cuda:{rank}")``: the host code chooses the placement so
-    that the collective sees the right number of participating ranks.
-    """
-    sips = int(spec["system"]["sips"]["count"])
-    cm = spec["sip"]["cube_mesh"]
-    pl = spec["cube"]["pe_layout"]
-    pes_per_cube = int(pl["pe_per_corner"]) * len(pl["corners"])
-    cubes_per_sip = int(cm["w"]) * int(cm["h"])
-    total = sips * cubes_per_sip * pes_per_cube
-    if world_size == total:
-        return DPPolicy(sip="column_wise", cube="column_wise", pe="column_wise")
-    if world_size <= pes_per_cube:
-        return DPPolicy(
-            sip="replicate", cube="replicate", pe="column_wise",
-            num_sips=1, num_cubes=1, num_pes=world_size,
-        )
-    if world_size <= cubes_per_sip * pes_per_cube:
-        return DPPolicy(
-            sip="replicate", cube="column_wise", pe="column_wise",
-            num_sips=1, num_cubes=world_size // pes_per_cube,
-        )
-    return DPPolicy(sip="column_wise", cube="column_wise", pe="column_wise")
-
-
-def worker(rank: int, world_size: int, torch) -> None:
-    """Per-rank business logic. Mirrors a real PyTorch DDP worker.
-
-    In real PyTorch DDP, this function runs in N separate processes,
-    each with its own ``rank``. In kernbench (single-process multi-device)
-    it is invoked once with ``rank=0`` on the single host driver; the
-    actual per-PE parallelism is handled by ``torch.launch`` fanning out
-    the kernel across all participating PEs via the tensor's DPPolicy.
-    The ``rank`` parameter is therefore always 0 today, and is kept as
-    an explicit argument for parity with real DDP workers (``if rank ==
-    0`` logging guards, future multi-host extensions).
-    """
-    cfg = resolve_algorithm_config(load_ccl_config())
-    algo_name = cfg["algorithm"]
-    n_elem = int(cfg.get("n_elem", DEFAULT_N_ELEM))
-
-    # Pick a DP that produces exactly ``world_size`` shards on this topology.
-    dp = _derive_dp(torch.spec, world_size)
-    tensor = torch.zeros(
-        (1, world_size * n_elem), dtype="f16", dp=dp, name="ccl_in",
-    )
-
-    # Initialize: CCL rank r's slice gets value (r + 1). Real PyTorch idiom:
-    #     target.copy_(torch.from_numpy(source))
-    init = np.zeros((1, world_size * n_elem), dtype=np.float16)
-    for r in range(world_size):
-        init[0, r * n_elem : (r + 1) * n_elem] = float(r + 1)
-    tensor.copy_(torch.from_numpy(init))
-
-    # The main act: one all_reduce call — the backend installs IPCQ at
-    # init_process_group time and here only dispatches the kernel.
-    torch.distributed.all_reduce(tensor, op="sum")
-
-    # Verify: each shard should hold sum(1..world_size) after all-reduce.
-    result = tensor.numpy()
-    expected = float(sum(range(1, world_size + 1)))
-    all_ok = bool(np.allclose(result, expected, rtol=1e-1, atol=1e-1))
-
-    # Print only on rank 0 — real PyTorch DDP idiom for single-source logs.
-    if rank == 0:
-        if all_ok:
-            print(f"  {algo_name} (ws={world_size}): {world_size} OK")
-        else:
-            flat = result.reshape(-1)
-            n_fail = 0
-            for r in range(world_size):
-                slice_r = flat[r * n_elem : (r + 1) * n_elem]
-                if not np.allclose(slice_r, expected, rtol=1e-1, atol=1e-1):
-                    n_fail += 1
-                    if n_fail <= 5:
-                        print(
-                            f"  [FAIL] rank {r} "
-                            f"(ws={world_size}, algo={algo_name}): "
-                            f"got mean={float(slice_r.mean()):.3f}, "
-                            f"expected={expected:.3f}"
-                        )
-            print(
-                f"  {algo_name} (ws={world_size}): "
-                f"{world_size - n_fail} OK / {n_fail} FAIL"
-            )
-
-
-def run(torch) -> None:
-    """CLI entry point: initialize the process group, invoke worker."""
-    dist = torch.distributed
-    dist.init_process_group(backend="ahbm")
-    worker(
-        rank=dist.get_rank(),
-        world_size=dist.get_world_size(),
-        torch=torch,
-    )
@@ -1,2 +0,0 @@
-def run(torch):
-    print("IPCQ all reduce kernel bench")
@@ -1,40 +0,0 @@
-from __future__ import annotations
-
-import importlib
-from collections.abc import Callable
-from typing import Any
-
-from kernbench.runtime_api.context import RuntimeContext
-
-BenchFn = Callable[[RuntimeContext], Any]
-
-
-def _load_module(bench_id: str):
-    bench_id = bench_id.strip()
-    if not bench_id:
-        raise ValueError("Bench id is empty.")
-    module_path = f"benches.{bench_id}"
-    try:
-        return importlib.import_module(module_path)
-    except ModuleNotFoundError as e:
-        raise ValueError(
-            f"Unknown bench '{bench_id}'. Expected module {module_path}.py"
-        ) from e
-
-
-def resolve_bench(bench_id: str) -> BenchFn:
-    """Resolve a bench id into its ``run(torch)`` callable.
-
-    Expected layout (repo root):
-        benches/<bench_id>.py
-            def run(torch: RuntimeContext) -> Any
-    """
-    mod = _load_module(bench_id)
-    run_fn = getattr(mod, "run", None)
-    if run_fn is None:
-        raise ValueError(
-            f"Bench module benches.{bench_id} must define 'run(torch)'."
-        )
-    if not callable(run_fn):
-        raise ValueError(f"'run' in benches.{bench_id} is not callable.")
-    return run_fn
@@ -6,12 +6,7 @@

 defaults:
  # Algorithm to run for this benchmark execution.
-  algorithm: ring_allreduce_tcm
-
-  # NOTE: world_size is not set here by default. AhbmCCLBackend derives it
-  # from the chosen algorithm's entry (if it sets ``world_size``) or from
-  # topology.yaml (``sips × cubes_per_sip × pes_per_cube``). This mirrors
-  # real PyTorch DDP where ranks/world_size come from env vars, not code.
+  algorithm: lrab_hierarchical_allreduce

  # IPCQ ring buffer location.
  #   tcm  — PE-local TCM (fast, small, conflicts with compute TCM access)
@@ -30,59 +25,26 @@ defaults:
  # Slot size in bytes (must hold one tile worth of data).
  slot_size: 4096

-  # PE_DMA virtual channel chunk size (D8). First implementation does not
-  # use chunk-level interleave; this is reserved for future precision.
+  # PE_DMA virtual channel chunk size (D8).
  vc_chunk_size: 256

-  # Credit return fast path message size (D9). Used by bottleneck-BW
-  # latency calculation. 16-64 bytes typical.
+  # Credit return fast path message size (D9).
  ipcq_credit_size_bytes: 16

 algorithms:
-  # ── ring all-reduce, buffer in PE_TCM ──
-  # Defaults to topology-derived world_size (full system, 256 ranks).
-  # Use a smaller tile size at high rank counts so f16 sums stay within
-  # the verification tolerance and op_log replay scales.
-  ring_allreduce_tcm:
-    module: kernbench.ccl.algorithms.ring_allreduce
-    topology: ring_1d
-    buffer_kind: tcm
-    n_elem: 8
-
-  # ── ring all-reduce, buffer in PE-local HBM ──
-  ring_allreduce_hbm:
-    module: kernbench.ccl.algorithms.ring_allreduce
-    topology: ring_1d
-    buffer_kind: hbm
-    n_elem: 8
-
-  # ── ring all-reduce, buffer in cube SRAM ──
-  ring_allreduce_sram:
-    module: kernbench.ccl.algorithms.ring_allreduce
-    topology: ring_1d
-    buffer_kind: sram
-    n_elem: 8
-
-  # ── 2D mesh all-reduce: perfect square only (2×2 = 4 PEs) ──
-  mesh_allreduce_4:
-    module: kernbench.ccl.algorithms.mesh_allreduce
-    topology: mesh_2d
-    buffer_kind: tcm
-    world_size: 4
-    n_elem: 16
-
-  # ── tree all-reduce (binary, 7 PEs) ──
-  tree_allreduce_7:
-    module: kernbench.ccl.algorithms.tree_allreduce
-    topology: tree_binary
-    buffer_kind: tcm
-    world_size: 7
-    n_elem: 16
-
-  # ── hierarchical all-reduce (3-level: intra-cube → inter-cube → inter-SIP) ──
-  # Uses bidirectional ring reduce + chain broadcast. ~25 rounds vs 255 flat.
-  hierarchical_allreduce:
-    module: kernbench.ccl.algorithms.hierarchical_allreduce
+  # ── intercube all-reduce (pe0-only, cube mesh + inter-SIP) ──
+  # Reduces across the 4×4 cube mesh within each SIP, then inter-SIP
+  # exchange on root cube, then broadcast back. SIP topology is read
+  # from topology.yaml → system.sips.topology. Kernel auto-selects
+  # ring / torus / mesh inter-SIP exchange pattern.
+  lrab_hierarchical_allreduce:
+    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    topology: none
    buffer_kind: tcm
-    n_elem: 16
+    n_elem: 8
+    # root_cube: the kernel currently elects the root dynamically as the
+    # geometric center of the cube mesh (root = (h//2)*w + (w//2)) to
+    # minimize the intra-SIP critical path, so this value is NOT read today.
+    # Kept as a placeholder for a future explicit-root override / runtime
+    # election hook (see ADR-0032 D1 + Non-goals).
+    root_cube: 15
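The `root_cube` comment above gives the election formula `root = (h//2)*w + (w//2)`. The sketch below evaluates it for a 4×4 cube mesh and compares the result with the placeholder value 15. Row-major cube indexing and Manhattan hop distance as a proxy for the intra-SIP critical path are assumptions made for illustration; the function names are hypothetical and are not taken from the kernel.

```python
def center_root_cube(w: int, h: int) -> int:
    """Index of the geometric-center cube in a w x h mesh (row-major assumed)."""
    if w <= 0 or h <= 0:
        raise ValueError("mesh dimensions must be positive")
    return (h // 2) * w + (w // 2)


def max_hops_to_root(root: int, w: int, h: int) -> int:
    """Largest Manhattan distance from any cube to the root cube."""
    root_x, root_y = root % w, root // w
    return max(
        abs(x - root_x) + abs(y - root_y)
        for y in range(h)
        for x in range(w)
    )


if __name__ == "__main__":
    root = center_root_cube(4, 4)
    print(root, max_hops_to_root(root, 4, 4), max_hops_to_root(15, 4, 4))
```

Under these assumptions the center cube (index 10) is at most 4 hops from any cube, whereas the corner cube 15 is up to 6 hops away, which is consistent with the comment's stated motivation for electing the center.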
@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Merged into ADR-0011 (Address Model: LA section).

 ## Context

@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Merged into ADR-0011 (Address Model: LA section).

 ## Context

@@ -0,0 +1,5 @@
+# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC
+
+## Status
+
+Merged into ADR-0017 (Cube NOC and HBM Connectivity).
@@ -0,0 +1,5 @@
+# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC
+
+## Status
+
+Merged into ADR-0017 (Cube NOC and HBM Connectivity).
@@ -0,0 +1,5 @@
+# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing
+
+## Status
+
+Merged into ADR-0014 (PE Pipeline Execution Model).
@@ -0,0 +1,5 @@
+# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing
+
+## Status
+
+Merged into ADR-0014 (PE Pipeline Execution Model).
@@ -2,7 +2,9 @@

 ## Status

-Proposed
+Superseded by ADR-0032 (Intercube all-reduce). The 3-level kernel and
+`hierarchical_allreduce.py` module have been removed. The cube-mesh
+intercube + inter-SIP path is now the single all-reduce algorithm.

 ## Context

@@ -2,7 +2,11 @@

 ## Status

-Stub (Blocker for ADR-0030 — specific range allocations TBD)
+Superseded by ADR-0001 (Revision 2, 2026-04-27).
+PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables are now defined in
+ADR-0001 D2.3.3-D2.3.5.
+
+Previous status: Stub (Blocker for ADR-0030 — specific range allocations TBD)

 ## Context

@@ -253,5 +257,5 @@ PhysAddr encoding. Callers do not need to know which range applies.
 |------|--------|
 | `src/kernbench/policy/address/phyaddr.py` | Range table (`PE_RESOURCE_MAP`), range-based decode, new component-specific factories (`pe_ipcq_addr`, etc.), updated internal encoding of the existing `pe_tcm_addr` |
 | `src/kernbench/policy/address/allocator.py` | Range-aware pool separation (per-PE TCM pool / IPCQ pool / scratchpad pool, etc.) |
-| `docs/adr/ADR-0001-physaddr-layout.md` | Amendment note: range-based PE resource partition |
+| `docs/adr/ADR-0001-mem-physaddr-layout.md` | Amendment note: range-based PE resource partition |
 | `tests/test_phyaddr.py` | Range table validation, encode/decode round-trip for each factory, regression for the existing `pe_tcm_addr` |
@@ -0,0 +1,358 @@
+# ADR-0001: 51-bit Physical Address Layout and Decoding Contract
+
+## Status
+
+Accepted (Revision 2, 2026-04-27: concrete bit layout, rack_id removed,
+Tray->SIP / SIP->DIE renaming, PE/MCPU/IOCPU sub-unit tables.
+Supersedes ADR-0031.)
+
+## Date
+
+2026-04-27 (original: 2026-02-27)
+
+## Context
+
+KernBench needs a stable, parseable physical address scheme that meets the following requirements.
+
+- Decodable into routing domains (SIP / die / HBM / PE-resource / IOCPU)
+- Topology-agnostic (no hardcoded counts)
+- Supports swappable policies and DI-first components
+- Handles multiple SIPs, AHBM dies, and IO chiplet dies in a unified space
+
+### History
+
+- The original ADR-0001 defined a 51-bit layout of `rack_id(4) + sip_id(4) + sip_seg(5) + local_offset(38)`.
+  `rack_id` was never actually used.
+- ADR-0031 (stub) requested a PE-resource range partition but was never implemented.
+
+Revision 2 removes `rack_id`, renames `sip_seg` to `die_id`, and provides
+concrete sub-unit tables for PE, MCPU, CUBE_SRAM, and IOCPU resources.
+ADR-0031 is superseded by this ADR.
+
+## Decision
+
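+Define a **PhysAddr value object** and an **address decoding contract** that
+converts an integer address into a routing domain.
+
+### D1. PhysAddr is an immutable value object
+
+- PhysAddr is immutable and comparable purely by value.
+- Every allocator returns a **fully specified PhysAddr** (not partial metadata).
+- Interpreting a PhysAddr must not require global state.
+
+### D2. 51-bit physical address layout
+
+A 51-bit physical address is adopted.
+
+#### 2.1 Top-level address map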
+
+```text
+[50:47] sip_id        (4)     -- 16 SIPs
+[46:42] die_id        (5)     -- 32 dies per SIP
+[41: 0] local_offset  (42)    -- 4 TB per die
+```
+
+```text
+50      47 46      42 41                      0
+---------+----------+-------------------------+
+| sip_id  | die_id   |      local_offset       |
+---------+----------+-------------------------+
+```
+
+#### 2.2 die_id assignment
+
+| die_id | Meaning |
+|--------|---------|
+| 0..15  | AHBM dies |
+| 16..20 | IOCHIPLET dies |
+| 21..31 | Reserved |
+
+#### 2.3 AHBM die layout
+
+Only the lower 256 GB of the 4 TB die-local window is allocated.
+
+```text
+[41:38] MBZ            (4)
+[37]    addr_space      (1)    -- 0 = local resource, 1 = HBM memory
+[36: 0] sub-address    (37)
+```
+
+| addr_space | Meaning |
+|------------|---------|
+| 0 | Local resource |
+| 1 | HBM memory |
+
+##### 2.3.1 HBM window (addr_space = 1)
+
+```text
+[36:0] hbm_offset     (37)    -- 128 GB decode window
+```
+
+The architectural decode window is fixed at 128 GB. The implemented capacity
+may be smaller depending on SKU/topology (see D4).
+
+##### 2.3.2 Resource window (addr_space = 0)
+
+```text
+[36:34] resource_kind  (3)
+[33: 0] kind_local    (34)    -- 16 GB per kind
+```
+
+| resource_kind | Meaning |
+|---------------|---------|
+| 000 | PE_LOCAL |
+| 001 | MCPU_LOCAL |
+| 010 | CUBE_SRAM |
+| 011..111 | Reserved |
+
+Each kind has a 16 GB decode region.
+
+##### 2.3.3 PE_LOCAL (resource_kind = 000)
+
+```text
+[33]    MBZ            (1)
+[32:29] pe_id          (4)     -- 0..15
+[28:25] pe_sub_unit    (4)
+[24: 0] sub_offset    (25)    -- 32 MB per slot
+```
+
+16 PEs x 16 sub-unit slots x 32 MB = 8 GB active decode.
+
+| pe_sub_unit | Name | Budget |
+|-------------|------|--------|
+| 0 | PE_CPU_DTCM | 8 KB |
+| 1 | MATH_ENGINE_DTCM | 8 KB |
+| 2 | IPCQ | 256 KB |
+| 3 | PE_CPU_SFR | 16 KB |
+| 4 | MATH_ENGINE_SFR | 16 KB |
+| 5 | DMA_ENGINE_SFR | 192 KB |
+| 6 | PE_TCM | 2 MB |
+| 7..15 | Reserved | -- |
+
+##### 2.3.4 MCPU_LOCAL (resource_kind = 001)
+
+```text
+[33:30] MBZ            (4)
+[29:25] mcpu_sub_unit  (5)
+[24: 0] sub_offset    (25)    -- 32 MB per slot
+```
+
+1 GB active decode.
+
+| mcpu_sub_unit | Name | Budget |
+|---------------|------|--------|
+| 0 | MCPU_ITCM | 512 KB |
+| 1 | MCPU_DTCM | 512 KB |
+| 2 | IPCQ | 256 KB |
+| 3 | MCPU_SFR | 8 KB |
+| 4 | MCPU_DMA_SFR | 16 KB |
+| 5 | MCPU_SRAM | 10 MB |
+| 6..31 | Reserved | -- |
+
+##### 2.3.5 CUBE_SRAM (resource_kind = 010)
+
+```text
+[33:25] MBZ            (9)
+[24: 0] sram_offset   (25)    -- flat 32 MB
+```
+
+#### 2.4 IOCHIPLET die layout
+
+Only the lower 1 TB of the 4 TB die-local window is allocated.
+
+```text
+[41:40] MBZ            (2)
+[39: 0] chiplet_offset (40)   -- 1 TB
+```
+
+Regions by address range:
+
+| Range | Meaning | Decode condition |
+|-------|---------|------------------|
+| [0, 2 GB) | IOCPU resource | chiplet_offset < 0x8000_0000 |
+| [2 GB, 1 TB) | UAL | chiplet_offset >= 0x8000_0000 |
+
+##### 2.4.1 IOCPU region
+
+```text
+[30:27] iocpu_sub_unit (4)
+[26: 0] sub_offset    (27)    -- 128 MB per slot
+```
+
+16 x 128 MB slots. 2 GB active decode.
+
+| iocpu_sub_unit | Name | Budget |
+|----------------|------|--------|
+| 0 | IOCPU_ITCM | 512 KB |
+| 1 | IOCPU_DTCM | 512 KB |
+| 2 | IPCQ | 2 MB |
+| 3 | IOCPU_SFR | 8 KB |
+| 4 | IO_DMA_SFR | 16 KB |
+| 5 | IO_SRAM | 64 MB |
+| 6..15 | Reserved | -- |
+
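+##### 2.4.2 UAL region
+
+The sub-layout is defined in a separate ADR (TBD).
+
+#### 2.5 Addressing rules
+
+1. MBZ bits must be 0. An address with any nonzero MBZ bit is
+   **architecturally invalid**. An implementation may raise a decode fault or
+   return an error; this ADR does not prescribe the behavior.
+2. Fixed slot sizes are adopted for simple hardware decode. The implemented
+   capacity may be smaller than the slot.
+3. An access within a slot that exceeds the sub-unit's implementation budget is
+   **architecturally invalid** (same policy as MBZ).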
+
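+### D3. Bitfield decoding is deterministic
+
+Given an integer address, field extraction (`sip_id`, `die_id`, `kind`, `sub_unit`,
+`offset`) is purely positional. No runtime state is required.
+Decoding deterministically maps an integer address to a destination domain (`sip_id`, `die_id`,
+target kind HBM / PE_LOCAL / MCPU_LOCAL / CUBE_SRAM / IOCPU / UAL).
+
+### D4. Capacity validation may depend on topology configuration
+
+Whether a decoded address falls within the **implemented capacity** (e.g., 96 GB of HBM on a given SKU)
+is validated against topology parameters supplied via DI/configuration. Decoding itself (D3)
+does not consult the topology; only the validation step does. These parameters
+must live in the topology/configuration layer, not in component implementations.
+
+### D5. Routing consumes decoded domains, not raw bits
+
+Routing policy uses decoded domains.
+
+- `src` location (sip / die / pe, or node_id)
+- `dst` domain derived from PhysAddr decoding
+- `size_bytes` for size-aware link latency
+
+Routing must not inspect raw bitfields directly, except inside the decoding
+module.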
+
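+## Alternatives Considered
+
+1. **Keep `rack_id` (4 bits)**: Rejected. It was never actually used, and
+   spending 4 bits on it blocks the opportunity to extend the die-local offset
+   to 42 bits (1 TB for IOCHIPLET).
+
+2. **Uniform 256 GB per die**: Rejected. IOCHIPLET UAL needs about 1 TB.
+   The freed rack_id bits are used to enable a 42-bit local_offset.
+
+3. **Variable-width die windows (AHBM 256 GB, CHIPLET 1 TB implemented by spanning multiple segs)**:
+   Rejected. It complicates D3 (deterministic decoding). A uniform 4 TB
+   window with MBZ padding is simpler.
+
+4. **Use raw integers everywhere and decode ad hoc in routing**: Rejected.
+   Logic is duplicated, routing loses consistency, and assumptions are hidden.
+
+5. **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**: Rejected.
+   It violates SPEC R3 and breaks swappability.
+
+6. **Place decoding inside a memory controller or router**: Rejected. Policy leaks
+   into components, violating SPEC R4 / D5.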
+
+## Consequences
+
+### Positive
+
+- Simple hierarchical decoder: SIP -> die -> kind -> sub-unit.
+- Clean separation of memory (HBM) and local resources (PE/MCPU/SRAM/IOCPU).
+- Deterministic routing domains enable clear test invariants (SPEC R1, R5).
+- Extensible: 11 reserved die_id slots, reserved resource_kind / sub-unit
+  slots, and reserved MBZ bits.
+- DI-first: the decoder can be swapped without changing components (SPEC R4).
+
+### Trade-offs
+
+- Sparse address holes due to power-of-2 slot alignment.
+- Large reserved/MBZ regions (intentional, for future expansion).
+- Explicit configuration is required for topology-derived sizes (D4).
+- Introduces a single "canonical" decoding module that must remain stable and
+  well tested.
+
+## Supersedes
+
+- **ADR-0031 (PhysAddr PE-Resource Extension)**: was a stub. The PE_LOCAL /
+  MCPU_LOCAL / CUBE_SRAM sub-unit tables in D2.3.3-D2.3.5 meet the goals
+  set out in ADR-0031.
+
+## Implementation Notes (Non-normative)
+
+- Recommended module: `src/kernbench/policy/address/phyaddr.py`
+- Tests should cover: per-kind encode/decode round-trip, MBZ enforcement,
+  die_id dispatch (AHBM / IOCHIPLET / reserved), sub-unit boundary values, and
+  backward compatibility of the factory API.
+- Factory methods `hbm_addr`, `pe_hbm_addr`, `pe_tcm_addr`, and `cube_sram_addr`
+  keep their signatures (minus `rack_id`). The `cube_id` parameter is renamed
+  to `die_id`.
+- New factories: `pe_resource_addr`, `mcpu_resource_addr`, `iocpu_resource_addr`,
+  `ual_addr`.
+
+## Appendix A. Address Examples
+
+### A.1 AHBM HBM access
+
+sip=2, die=5, HBM offset=0x1000
+
+```text
+sip_id     = 2       -> [50:47] = 0b0010
+die_id     = 5       -> [46:42] = 0b00101
+addr_space = 1       -> [37]    = 1 (HBM)
+hbm_offset = 0x1000  -> [36:0]
+
+51-bit addr = (2 << 47) | (5 << 42) | (1 << 37) | 0x1000
+```
+
+### A.2 AHBM PE_LOCAL -- PE3 PE_TCM, offset=0x400
+
+```text
+sip_id        = 0  -> [50:47] = 0
+die_id        = 0  -> [46:42] = 0
+addr_space    = 0  -> [37]    = 0
+resource_kind = 0  -> [36:34] = 000 (PE_LOCAL)
+pe_id         = 3  -> [32:29] = 0011
+pe_sub_unit   = 6  -> [28:25] = 0110 (PE_TCM)
+sub_offset    = 0x400 -> [24:0]
+
+local_offset = (0 << 34) | (3 << 29) | (6 << 25) | 0x400
+```
+
+### A.3 AHBM MCPU_LOCAL -- MCPU_SRAM, offset=0x0
+
+```text
+sip_id        = 1  -> [50:47] = 0001
+die_id        = 3  -> [46:42] = 00011
+addr_space    = 0  -> [37]    = 0
+resource_kind = 1  -> [36:34] = 001 (MCPU_LOCAL)
+mcpu_sub_unit = 5  -> [29:25] = 00101 (MCPU_SRAM)
+sub_offset    = 0  -> [24:0]  = 0
+
+local_offset = (1 << 34) | (5 << 25)
+```
+
+### A.4 IOCHIPLET -- IOCPU IPCQ, offset=0x20000
+
+```text
+sip_id         = 1   -> [50:47] = 0001
+die_id         = 17  -> [46:42] = 10001 (IOCHIPLET[1])
+iocpu_sub_unit = 2   -> [30:27] = 0010 (IPCQ)
+sub_offset     = 0x20000 -> [26:0]
+
+chiplet_offset = (2 << 27) | 0x20000
+                 (< 0x8000_0000 -> IOCPU region)
+```
+
+### A.5 IOCHIPLET -- UAL region, offset=4 GB
+
+```text
+sip_id         = 0   -> [50:47] = 0
+die_id         = 16  -> [46:42] = 10000 (IOCHIPLET[0])
+chiplet_offset = 0x1_0000_0000 (4 GB >= 2 GB -> UAL region)
+```
+
+## Links
+
+- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first),
+  R5 (multi-domain comm)
+- ADR-0031: Superseded
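The Revision 2 layout in D2 and the worked addresses in Appendix A can be cross-checked with a small positional decoder. The following is a minimal sketch under stated assumptions, not the project's `phyaddr.py`: the names `Decoded`, `encode_hbm`, and `decode` are hypothetical, raising `ValueError` on MBZ violations is one of the behaviors rule 2.5.1 permits rather than a prescribed one, and sub-unit fields and budgets are not checked.

```python
from dataclasses import dataclass

SIP_SHIFT = 47                    # [50:47] sip_id
DIE_SHIFT = 42                    # [46:42] die_id
LOCAL_MASK = (1 << 42) - 1        # [41:0]  local_offset
ADDR_SPACE_BIT = 1 << 37          # AHBM: 0 = local resource, 1 = HBM memory
AHBM_MBZ_MASK = 0xF << 38         # AHBM: [41:38] must be zero
RESOURCE_KINDS = {0: "PE_LOCAL", 1: "MCPU_LOCAL", 2: "CUBE_SRAM"}


@dataclass(frozen=True)
class Decoded:
    sip_id: int
    die_id: int
    target: str   # HBM / PE_LOCAL / MCPU_LOCAL / CUBE_SRAM / IOCPU / UAL
    offset: int   # offset inside the target window


def encode_hbm(sip_id: int, die_id: int, hbm_offset: int) -> int:
    """Build an AHBM HBM address (addr_space = 1)."""
    if not (0 <= sip_id < 16 and 0 <= die_id < 16 and 0 <= hbm_offset < (1 << 37)):
        raise ValueError("field out of range for an AHBM HBM address")
    return (sip_id << SIP_SHIFT) | (die_id << DIE_SHIFT) | ADDR_SPACE_BIT | hbm_offset


def decode(addr: int) -> Decoded:
    """Purely positional decode; no topology or runtime state is consulted."""
    if not 0 <= addr < (1 << 51):
        raise ValueError("address does not fit in 51 bits")
    sip_id = addr >> SIP_SHIFT
    die_id = (addr >> DIE_SHIFT) & 0x1F
    local = addr & LOCAL_MASK
    if die_id <= 15:                                  # AHBM dies
        if local & AHBM_MBZ_MASK:
            raise ValueError("nonzero MBZ bits in AHBM address")
        if local & ADDR_SPACE_BIT:
            return Decoded(sip_id, die_id, "HBM", local & ((1 << 37) - 1))
        kind = (local >> 34) & 0x7
        if kind not in RESOURCE_KINDS:
            raise ValueError("reserved resource_kind")
        return Decoded(sip_id, die_id, RESOURCE_KINDS[kind], local & ((1 << 34) - 1))
    if die_id <= 20:                                  # IOCHIPLET dies
        if local >> 40:
            raise ValueError("nonzero MBZ bits in IOCHIPLET address")
        target = "IOCPU" if local < 0x8000_0000 else "UAL"
        return Decoded(sip_id, die_id, target, local)
    raise ValueError("reserved die_id")
```

For example, `decode(encode_hbm(2, 5, 0x1000))` reproduces the fields of example A.1, and the A.4 and A.5 addresses land in the IOCPU and UAL regions respectively.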
@@ -0,0 +1,100 @@
+# ADR-0002: Routing Distance, Ordering, and Bypass Rules
+
+## Status
+Accepted
+
+## Date
+2026-02-27
+
+## Context
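+The KernBench Graph Latency Simulator must compare kernel execution time
+across different architectures and topologies, and does so by computing
+end-to-end latency from graph traversal.
+
+To support meaningful comparison:
+- routing must be deterministic
+- latency must reflect the real interconnect structure
+- local and remote traffic must be distinguishable
+- "bypass" optimizations must not compromise debuggability or correctness
+
+The simulator also aims to avoid hidden shortcuts that obscure
+software-managed metadata and control paths.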
+
+## Decision
+
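+### D1. Distance is cumulative latency, not hop count
+- Routing "distance" is defined as the **sum of per-node and per-link latencies**.
+- Hop count alone must not be used for ordering decisions or path selection.
+- Size-aware serialization latency (bytes / BW) contributes to distance.
+
+### D2. Routing order is derived from graph traversal
+- The selected path is the one with minimum cumulative latency under the
+  configured graph and routing policy.
+- Deterministic ordering must be guaranteed for identical inputs (topology + policy + request).
+
+### D3. Bypass is explicit and represented in the graph
+- Every path must be explicitly represented in the graph and be subject to latency accumulation.
+- Example: PE_DMA connects to the NOC router mesh (ADR-0017 D7). Every destination
+  (HBM, shared SRAM, inter-cube UCIe) is reached through explicit mesh hops.
+  Local HBM access has the minimum hop count (switching overhead only), while remote access
+  traverses additional routers.
+- Implicit or "magic" bypass paths are prohibited.
+
+### D4. No path has zero end-to-end latency
+
+- Every routed request must have **end-to-end** latency > 0.
+- An individual fabric segment (e.g., a NOC hop) may have distance_mm = 0 when the fabric is
+  distributed and distance is not meaningful at that granularity.
+  This is allowed because other components on the same path (e.g., PE_DMA, SRAM, UCIe endpoints)
+  contribute nonzero latency and preserve the end-to-end invariant.
+- Paths whose end-to-end latency is exactly zero are prohibited. The only
+  exception is an explicitly marked test-only stub.
+
+### D5. Separation of policy and topology responsibilities
+- Topology builder:
+  - defines nodes and links and their latency/BW parameters
+- Routing policy:
+  - selects among available graph paths based on decoded domains
+- Routing policy must not assume missing links. Missing connectivity is a
+  topology configuration error.
+
+### D6. No software-managed routing metadata
+- Routing decisions must not depend on per-request software-managed metadata
+  that tracks distance, hop count, or ordering outside the graph model.
+- All distance and ordering computation is derived from the traversal itself.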
+
+## Alternatives Considered
+
+1) **Hop 수 기반 라우팅**
+- 기각: 이질적인 레이턴시·BW를 무시하고 아키텍처 차이를 잘못 표현한다.
+
+2) **암묵적 로컬 지름길**
+- 기각: 디버깅 가능성을 해치고 순회 기반 레이턴시 원칙을 위반한다.
+
+3) **소프트웨어 관리 거리 메타데이터**
+- 기각: 제어 오버헤드를 증가시키고 라우팅 시맨틱을 모호하게 만든다.
+
+## Consequences
+
+### 긍정적
+- 명확하고 디버깅 가능한 hop-by-hop 트레이스 (SPEC R2, R4).
+- 아키텍처 비교가 실제 인터커넥트 구조를 반영한다.
+- 라우팅 동작이 재현 가능하고 결정론적이다.
+
+### 트레이드오프 / 비용
+- 그래프 구성이 정확하고 완전해야 한다.
+- 우회 모델링이 명시적 그래프 표현을 요구하므로 토폴로지 기술이
+  약간 더 복잡해진다.
+
+## Implementation Notes (Non-normative)
+- 권장 책임 분담:
+  - 그래프 빌더: 필요한 모든 경로가 존재함을 보장.
+  - 라우터: 디코딩된 도메인과 정책을 바탕으로 다음 hop 선택.
+- 테스트가 검증해야 할 항목:
+  - end-to-end 레이턴시 > 0
+  - 동일 입력에 대한 결정론적 라우팅
+  - 우회 경로가 출력 트레이스에 명시적으로 나타남
+
+## Links
+- SPEC.md: R1 (라우팅), R2 (레이턴시), R3 (토폴로지), R5 (다중 도메인 통신)
+- ADR-0001: PhysAddr 레이아웃 및 디코딩 계약
@@ -0,0 +1,68 @@
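+# ADR-0003: Target System Hierarchy and Modeling Scope
+
+## Status
+
+Accepted
+
+## Context
+
+A system-level simulator is needed to evaluate LLM kernel performance on our AI Accelerator platform.
+The platform consists of compute trays, each containing multiple identical SIPs connected through a switching fabric over PCIe or UAL,
+with a host CPU issuing commands and kernels.
+
+## Decision
+
+The system hierarchy is modeled explicitly as follows.
+
+### D1. Tray-level
+
+- One compute tray contains:
+  - a host CPU (issues requests; coordinates the runtime and data placement)
+  - multiple identical SIPs (accelerators)
+  - an inter-SIP interconnect fabric (PCIe and/or UAL through switches)
+
+### D2. SIP-level
+
+- A SIP is a multi-die package consisting of:
+  - multiple CUBEs (HBM die + compute PEs + UCIe)
+  - one or more IO chiplets (host/SIP interface)
+- IO chiplets:
+  - provide the following interfaces: PCIe-EP, IO_CPU, and optionally UAL-EP
+  - may exist in multiples per SIP
+  - are constrained in placement to the SIP shorelines (top/bottom/left/right); each shoreline may host 1 to 2 IO chiplets
+
+### D3. CUBE-level
+
+- One CUBE contains:
+  - HBM + memory controller (HBM_CTRL)
+  - NoC (on-die fabric): carries all intra-cube traffic, including HBM data, inter-cube (UCIe) traffic,
+    commands (M_CPU↔PE_CPU), and shared SRAM access.
+    It must provide: a full-bandwidth PE↔local HBM path, PE↔SRAM connectivity,
+    PE↔UCIe connectivity, and an M_CPU↔PE command path.
+    The NoC topology is an implementation choice (e.g. 2D mesh, ring, crossbar);
+    the current implementation uses a 2D mesh with XY routing (see ADR-0017).
+    HBM_CTRL attaches to each PE's local NoC port (local HBM = minimum hops).
+  - Shared SRAM: cube-level shared memory accessible to every PE through the NoC
+  - A management/control CPU (M_CPU) that coordinates PE command distribution and completion aggregation
+  - Multiple PEs
+  - Up to four UCIe endpoints (N/E/W/S) for CUBE↔CUBE and CUBE↔IO connectivity
+
+### D4. PE-level
+
+- One PE can run one kernel instance
+- A PE contains internal control plus accelerators (modeled at PE-view granularity):
+  - PE_CPU, command handler, PE_TCM, DMA/GEMM/MATH engines, internal queues
+
+## Consequences
+
+- The simulator supports "view"-level abstraction:
+  - the SIP view hides PE internals
+  - the CUBE view treats each PE as a single block
+  - the PE view expands PE internals
+- The topology stays parameterized; sizes, counts, and links come from configuration.
+
+## Links
+
+- SPEC R3/R5
+- ADR-0005 (diagram views)
+- ADR-0017 (cube NoC 2D mesh architecture)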
@@ -0,0 +1,78 @@
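+# ADR-0004: Memory Semantics and Local HBM Bandwidth Guarantee
+
+## Status
+
+Accepted
+
+## Context
+
+Accurate modeling of PE↔HBM behavior is essential for kernel latency estimation.
+Each PE has a notion of "local HBM", which must be guaranteed the full HBM
+bandwidth regardless of the bandwidth of the intermediate on-chip fabric.
+
+## Decision
+
+### D1. Definition of local HBM
+
+- Each PE is assigned a logically defined "local HBM" region.
+- Local HBM corresponds to the subset of pseudo-channels attached directly to
+  that PE's router within the NOC mesh (ADR-0017 D4).
+- The path is: PE_DMA → local router → HBM_CTRL (switching overhead only, zero mesh hops).
+- The mapping (HBM pseudo-channel → PE-local region) is derived from the topology configuration.
+
+### D2. Local HBM bandwidth guarantee contract
+
+- Access from a PE to its own local HBM must be guaranteed the full effective
+  read/write bandwidth of the HBM, regardless of intermediate fabric bandwidth limits.
+- Effective HBM bandwidth = spec bandwidth × efficiency factor.
+  The efficiency factor (set via `hbm_ctrl.attrs.efficiency`, default 0.8) models
+  real-world DRAM inefficiencies (refresh cycles, bank conflicts, page misses, and so on).
+  Example: 256 GB/s spec × 0.8 = 204.8 GB/s effective bandwidth.
+- The topology builder applies the efficiency factor to the bandwidth of the
+  router-to-hbm edge at graph construction time, so all later routing and latency
+  computations use the effective value.
+- The guarantee is modeled as:
+  - a dedicated logical path and/or service model that enforces HBM bandwidth
+    at the PE-to-local-HBM interaction point,
+  - which still incurs non-zero latency along the explicitly modeled components.
+- HBM CTRL internal modeling (PC striping, cut-through, scheduling fidelity) is
+  consolidated in ADR-0033 (Latency model: assumptions and known simplifications).
+  The aggregate bandwidth guarantee stated here remains the contract; ADR-0033
+  records how the per-PC model realizes it and which scheduler effects are
+  intentionally simplified.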
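
The effective-bandwidth rule in D2 is a single multiplication applied once at graph construction. A small sketch of the arithmetic follows; the function names are hypothetical and only the 0.8 default and the 256 GB/s example come from the text above.

```python
def effective_bw_gbs(spec_bw_gbs: float, efficiency: float = 0.8) -> float:
    """Effective HBM bandwidth = spec bandwidth x efficiency factor."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return spec_bw_gbs * efficiency


def serialization_ns(nbytes: int, spec_bw_gbs: float, efficiency: float = 0.8) -> float:
    # 1 GB/s equals 1 byte/ns, so dividing bytes by GB/s yields nanoseconds.
    return nbytes / effective_bw_gbs(spec_bw_gbs, efficiency)
```

Under these assumptions a 4 KB local-HBM read at 256 GB/s spec serializes in about 20 ns, on top of whatever node and link latencies the path contributes.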
+
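+### D3. Remote-PE HBM semantics (intra-cube)
+
+- When a PE accesses another PE's local HBM, the access goes through the NOC:
+  - PE_DMA → NOC → (fabric hops) → target PE's NOC port → HBM_CTRL
+- NOC bandwidth and hop count may limit remote HBM access relative to local access.
+
+### D4. Non-local HBM semantics (inter-cube / inter-SIP)
+
+- Access from a PE to HBM in another cube or SIP may be limited by:
+  - intra-cube NOC bandwidth,
+  - inter-cube UCIe links,
+  - the inter-SIP fabric (PCIe/UAL).
+- These paths must be explicit and traceable.
+
+### D5. Shared SRAM semantics
+
+- Each CUBE contains a shared SRAM accessible to every PE in that CUBE.
+- Access path: PE_DMA → NOC → shared SRAM.
+- Shared SRAM bandwidth is limited by the NOC↔SRAM link bandwidth.
+- Shared SRAM is a separate memory domain, not part of the HBM address space.
+
+## Verification Notes
+
+Cases that tests should cover:
+
+- Local HBM: bandwidth matches HBM bandwidth regardless of fabric BW parameters
+- Remote-PE HBM: latency includes mesh hop traversal
+- Non-local (inter-cube / inter-SIP): bandwidth and latency respond to fabric/link parameters
+- Shared SRAM: access through the NOC runs at the correct bandwidth
+
+## Links
+
+- SPEC R2/R5
+- ADR-0002 (distance/ordering and explicit bypass)
+- ADR-0017 D7 (PE DMA → HBM data path through the NOC)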
@@ -0,0 +1,186 @@
+# ADR-0005: Diagram Views and Distance-Based Layout Rules
+
+## Status
+
+Accepted
+
+## Context
+
+Large, parameterized AI Accelerator systems need system modeling that can be
+verified and inspected.
+
+A person must be able to:
+
+- visually inspect the modeled topology,
+- reason about communication structure and relative distance,
+- do so at several abstraction levels without being overwhelmed by detail.
+
+The simulator models distance (accumulated latency) as a first-class concept.
+Diagrams must reflect this distance by default.
+
+---
+
+## Decision
+
+### D1. Global Defaults
+
+- All diagrams must be **distance-aware** by default.
+- All diagrams must render a **representative view** of the architecture.
+- Instance indices (e.g. sip0, cube2, pe3) must not be required for diagram generation.
+- Instance indices may be used only:
+  - to define a distance anchor in asymmetric or debugging scenarios, or
+  - when explicitly requested.
+
+---
+
+### D2. Representative Rendering Rule
+
+- All CUBEs share the same internal structure.
+- All PEs share the same internal structure.
+
+Therefore:
+
+- SIP-level diagrams render a representative CUBE and IO chiplet.
+- CUBE-level diagrams render a representative PE as an opaque block.
+- PE-level diagrams render a representative PE with its internals fully expanded.
+
+Unless explicitly requested, diagrams must not depend on a specific
+SIP, CUBE, or PE index.
+
+---
+
+### D3. Diagram Views
+
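+#### View A: SIP-level diagram
+
+**Purpose**
+Describes system-scale structure and connectivity.
+
+**Visible elements**
+
+- SIP boundary (optional)
+- CUBEs (opaque blocks)
+- IO chiplets (opaque blocks)
+- Optional UCIe stubs, only where needed to clarify connectivity
+
+**Hidden elements**
+
+- PE internals
+- CUBE internal fabric
+- IO chiplet internals
+
+**Visible links**
+
+- Host ↔ IO chiplet (PCIe)
+- SIP ↔ SIP (PCIe / UAL through switches)
+- IO ↔ CUBE (on-package links)
+
+---
+
+#### View B: CUBE-level diagram
+
+**Purpose**
+Describes intra-cube structure and data/control flow.
+
+**Visible elements**
+
+- Router mesh: a 2D grid of NoC routers (from cube_mesh.yaml); all traffic is routed through the mesh
+- HBM_CTRL attached to the PE routers (local HBM = 0 hops)
+- HBM subsystem (HBM_CTRL)
+- Shared SRAM: cube-level shared memory
+- Management CPU (M_CPU)
+- PEs shown as opaque blocks (PE[0..N−1])
+- UCIe endpoints shown as ports (N/E/W/S)
+
+**Hidden elements**
+
+- PE internals
+
+**Visible links**
+
+- PE → router (HBM and non-HBM data paths through the mesh)
+- Router ↔ HBM_CTRL (local HBM access)
+- Router ↔ router (mesh hops for remote access)
+- Router ↔ UCIe endpoint
+- Router ↔ shared SRAM
+- M_CPU ↔ router (command path)
+- Router → PE_CPU (command delivery, collapsed inside the PE block)
+
+---
+
+#### View C: PE-level diagram
+
+**Purpose**
+Describes PE internal behavior and execution structure.
+
+**Visible elements**
+
+- PE_CPU
+- Command handler / scheduler
+- PE_TCM (local SRAM)
+- HW accelerators (DMA, GEMM, MATH, etc.)
+- Local HBM interface
+- Optional IPCQ / messaging endpoint
+
+**Visible links**
+
+- Control path (CPU → scheduler → engines)
+- Data path (engines ↔ TCM, DMA ↔ local HBM)
+- External fabric ports shown only as abstract ports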
+
+---
+
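+### D4. Distance-based layout (default)
+
+#### Distance definition
+
+- Distance is defined as **accumulated latency**, consistent with ADR-0002.
+- Distance is computed from a single anchor node.
+
+#### Default anchor selection
+
+- SIP view: the IO chiplet (or the host CPU, if present)
+- CUBE view: the representative PE
+- PE view: PE_CPU or the command handler
+
+The anchor is an **implicit default**; specifying one must not be mandatory.
+
+#### Layout rules
+
+- Diagrams must be laid out in layers based on distance buckets.
+- Layout direction must be consistent within a view type
+  (preferred: left to right).
+- Nodes at the same distance must have a deterministic, stable order
+  (by role or identifier).
+
+For readability, cycles may be rendered as dashed or curved edges;
+this does not affect distance semantics.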
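
The layout rules above amount to bucketing nodes by distance from the anchor and ordering each bucket stably. The sketch below shows one way to do that; the fixed-width bucket and the sort-by-identifier tie-break are assumptions for illustration, since the ADR leaves bucket boundaries to the implementation.

```python
from collections import defaultdict


def layout_layers(distance_ns: dict[str, float], bucket_ns: float) -> list[list[str]]:
    """Group nodes into layers by distance bucket, nearest layer first.

    Nodes inside a layer are sorted by identifier so that identical inputs
    always produce the same layout.
    """
    if bucket_ns <= 0:
        raise ValueError("bucket_ns must be positive")
    buckets: dict[int, list[str]] = defaultdict(list)
    for node, dist in distance_ns.items():
        buckets[int(dist // bucket_ns)].append(node)
    return [sorted(buckets[index]) for index in sorted(buckets)]
```

Rendering the returned layers left to right gives the preferred direction; the insertion order of the input dict does not affect the result.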
+
+---
+
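+### D5. Generation contract (for tools / Claude Code)
+
+When generating diagrams:
+
+- Assume distance-based layout by default.
+- Assume representative rendering by default.
+- Do not ask for SIP/CUBE/PE indices unless necessary.
+- Do not expand hidden abstraction levels.
+- Prefer architectural clarity over micro-hop precision.
+
+---
+
+## Consequences
+
+- Diagrams remain stable across topology scaling.
+- Changes to distance or routing policy are reflected visually.
+- Diagrams serve as verifiable artifacts derived from the simulator model,
+  not as hand-maintained documentation.
+
+---
+
+## Links
+
+- SPEC Section 4 (Output, Debuggability, and Diagrams)
+- ADR-0002 (routing distance semantics)
+- ADR-0006 (topology compilation and automatic diagram generation)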
@@ -0,0 +1,130 @@
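+# ADR-0006: Topology Compilation, Distance Extraction, and Automatic Diagram Generation
+
+## Status
+
+Accepted
+
+## Context
+
+The simulator compiles a topology configuration (e.g. topology.yaml) into an explicit model graph
+and computes routing and accumulated latency (distance).
+To guarantee consistency and avoid hand-maintained topology drawings,
+diagrams must be generated from this authoritative artifact.
+
+For usability, diagrams must also be emitted automatically to a stable location
+so that developers can preview them immediately inside the repository.
+
+---
+
+## Decision
+
+### D1. Topology compilation is the single source of truth
+
+- topology.yaml (or an equivalent configuration) is compiled into:
+  - an explicit system graph,
+  - node/link attributes,
+  - a routing policy.
+
+This compiled graph is the authoritative representation of the system.
+
+### D2. Distance extraction during compilation
+
+- During or immediately after topology compilation, the simulator must compute
+  distance metadata (accumulated latency) consistent with ADR-0002.
+- The distance metadata must be sufficient to support the distance-based diagram layout defined in ADR-0005.
+- Distributed fabric segments (e.g. the NoC) may have distance_mm = 0 per ADR-0002 D4.
+  Layout placement of such nodes uses explicit position metadata rather than distance buckets.
+
+### D3. Diagram generation is a derived artifact
+
+- Diagrams must be generated from:
+  - the compiled topology graph,
+  - the extracted distance metadata,
+  - the view/layout rules defined in ADR-0005.
+- Diagram generation must not require any additional hand-written topology description.
+
+### D4. Automatic diagram emission into the repository
+
+- As part of topology compilation, the implementation must generate the following diagrams by default:
+  - SIP-level diagram (representative, distance-aware)
+  - CUBE-level diagram (representative, distance-aware)
+  - PE-level diagram (representative, distance-aware)
+- The default output directory is:
+  - `docs/diagrams/`
+- The generator must overwrite/update files only when the compiled topology (or the diagram rules) has changed.
+
+### D5. Per-view projection and layout
+
+For each view (SIP / CUBE / PE):
+
+- The generator must project the compiled graph into a reduced view graph by:
+  - hiding or collapsing nodes per ADR-0005,
+  - preserving the connectivity semantics relevant to that view,
+  - computing distance buckets to assign layout layers deterministically.
+- The CUBE-level projection must include:
+  - the router mesh (from cube_mesh.yaml), HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
+    and PEs shown as opaque blocks.
+  - All paths (HBM, non-HBM, command) are routed through the same router mesh (ADR-0017).
+- The default anchor is implicit (ADR-0005) and must not require instance indices.
+
+### D6. Output formats and determinism
+
+- The generator must emit at least one of:
+  - Mermaid (Markdown-native)
+  - Graphviz DOT (rank-based control)
+  - SVG (mm-accurate layout, no external dependencies)
+- SVG is preferred when mm-accurate position metadata is available from the compiled topology.
+- Output must be deterministic:
+  - same topology + same rules → same diagram text
+- File names must be deterministic and stable (see the output conventions section below).
+
+### D7. Performance and caching
+
+- Diagram generation may be lazy and/or cached, as long as the output in `docs/diagrams/`
+  stays consistent with the compiled topology.
+- The implementation SHOULD use a cache key based on:
+  - the topology content hash,
+  - the routing policy version,
+  - the diagram rules version,
+  - the view type (SIP/CUBE/PE).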
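
The four cache-key inputs listed in D7 can be folded into one digest. This is a sketch under assumptions: the function name, the use of SHA-256, and the length-prefixed encoding are illustrative choices, not something the ADR prescribes.

```python
import hashlib


def diagram_cache_key(topology_text: str, routing_policy_version: str,
                      diagram_rules_version: str, view: str) -> str:
    """Derive a deterministic cache key from the four D7 inputs."""
    if view not in ("SIP", "CUBE", "PE"):
        raise ValueError(f"unknown view type: {view}")
    digest = hashlib.sha256()
    for part in (topology_text, routing_policy_version, diagram_rules_version, view):
        data = part.encode("utf-8")
        # Length-prefix each part so that ("ab", "c") and ("a", "bc") differ.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

A generator could compare this key against the one stored with the last emitted file and skip the write when they match, which is one way to satisfy the "overwrite only on change" rule in D4.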
+
+---
+
+## Output Conventions
+
+### Directory
+
+- `docs/diagrams/` is the canonical output directory for generated diagrams.
+
+### File names (recommended, deterministic)
+
+- `system_view.svg` / `system_view.mmd` / `system_view.dot`
+- `sip_view.svg` / `sip_view.mmd` / `sip_view.dot`
+- `cube_view.svg` / `cube_view.mmd` / `cube_view.dot`
+- `pe_view.svg` / `pe_view.mmd` / `pe_view.dot`
+
+Optionally, for multi-topology workflows:
+
+- `sip_view__{topology_id}.svg`
+- `cube_view__{topology_id}.svg`
+- `pe_view__{topology_id}.svg`
+
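+### Repository policy
+
+- Generated diagram files may be committed to the repository to enable diff-based review.
+- If committed, they must be reproducible from topology compilation.
+
+---
+
+## Consequences
+
+- Diagrams are always consistent with simulator behavior.
+- Architecture changes propagate to the visualization automatically.
+- Diagram diffs become a meaningful indicator of architectural change.
+
+---
+
+## Links
+
+- SPEC Section 4 (Output, Debuggability, and Diagrams)
+- ADR-0002 (distance semantics)
+- ADR-0005 (diagram views and layout rules)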
@@ -0,0 +1,95 @@
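+# ADR-0007: Runtime API and Simulation Engine Boundary
+
+## Status
+
+Accepted
+
+## Context
+
+The simulator is composed of several layers with clearly distinct responsibilities:
+
+- a host-facing API layer used by benchmarks and user code,
+- a discrete-event simulation engine that executes requests,
+- device components that model hardware behavior.
+
+Without strict boundaries, orchestration logic can leak into components, or
+simulation internals can become entangled with the user-facing API.
+
+This ADR defines clear responsibility boundaries between:
+
+- the runtime API,
+- the simulation engine (sim_engine),
+- the hardware components.
+
+---
+
+## Decision
+
+### D1. The runtime API handles host-facing orchestration only
+
+The runtime API represents host/driver-level behavior and must:
+
+- expose high-level operations (tensor deployment, kernel launch),
+- submit requests only to endpoint components (e.g. IO_CPU),
+- wait for completion through futures/handles,
+- own and persist host-side metadata (tensor allocation maps, kernel bindings).
+
+The runtime API must not:
+
+- hard-code hop-by-hop routing or fan-out,
+- call internal components (M_CPU, PE_CPU, engines) directly,
+- embed assumptions about topology or routing.
+
+---
+
+### D2. The simulation engine wires components and tracks completion
+
+The simulation engine (sim_engine) must:
+
+- wire components at initialization time (create port stores and start wire
+  processes according to the component port/wire framework; see ADR-0015),
+- inject requests at the entry components of the compiled topology graph
+  (e.g. PCIE_EP for memory operations, IO_CPU for kernel launches),
+- schedule and execute events under a discrete-event model,
+- manage correlation IDs and completion tracking.
+
+The simulation engine must not:
+
+- define tensor semantics,
+- define kernel execution policy,
+- expose internal graph details to the runtime API,
+- walk topology paths while executing a request,
+- call a component's `run()` method directly,
+- track per-hop latency or decompose fan-out (that is the components' responsibility).
+
+---
+
+### D3. Components own fan-out and aggregation
+
+Device-side components must:
+
+- fan requests out to lower domains
+  (IO_CPU → M_CPU → PE_CPU → scheduler/engines),
+- aggregate completion and failure signals,
+- propagate results upward deterministically.
+
+Neither the runtime API nor the simulation engine may explicitly orchestrate
+component-level fan-out.
+
+---
+
+## Consequences
+
+- The runtime API stays stable when topology and routing change.
+- Simulation internals can change without affecting user-facing code.
+- Component implementations remain swappable through DI.
+
+---
+
+## Links
+
+- SPEC R4, R7, R8
+- ADR-0008 (tensor deployment)
+- ADR-0009 (kernel execution)
+- ADR-0015 (component port/wire model and engine role)
+- ADR-0010 (CLI surface and execution semantics; a consumer of the runtime API)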
@@ -0,0 +1,100 @@
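+# ADR-0008: Tensor Deployment and Allocation (Host Allocator, PA-First)
+
+## Status
+
+Accepted
+
+## Context
+
+Benchmarks require PyTorch-like tensor semantics:
+
+- tensor creation (empty, fill),
+- deployment to an accelerator device (tensor.to()).
+
+In a realistic system, host software manages allocation and mapping and installs
+DMA/MMU mappings. Phase 0 (ADR-0011) simplifies this:
+
+- device memory operations use PA only,
+- VA/MMU/IOMMU is not modeled.
+
+To keep the host↔device interface minimal, a separate AllocateTensorMeta message
+is avoided. Instead, host allocation produces a PA shard map that
+MemoryWrite/Read and KernelLaunch use directly.
+
+---
+
+## Decision
+
+### D1. A Tensor is a host-owned handle with a PA shard mapping
+
+A Tensor object is a host-owned handle that encapsulates:
+
+- shape and dtype,
+- initialization intent,
+- device placement and allocation metadata in the form of a PA shard map.
+
+After deployment, the Tensor handle must contain:
+
+- a list of shards, each carrying (sip, cube, pe, pa, nbytes, offset_bytes).
+
+This PA shard mapping is the single source of truth for kernel argument binding.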
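
To make the shard tuple in D1 concrete, the sketch below builds a shard list for an even split. `ShardSketch` and `split_evenly` are hypothetical names for this example, and giving every PE the same `pa_base` is a simplifying assumption; the real host allocator decides PA ranges per PE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardSketch:
    sip: int
    cube: int
    pe: int
    pa: int            # physical address of this shard on its PE
    nbytes: int        # shard size in bytes
    offset_bytes: int  # offset of this shard within the logical tensor


def split_evenly(total_nbytes: int, targets: list[tuple[int, int, int]],
                 pa_base: int) -> list[ShardSketch]:
    """Split a tensor evenly across (sip, cube, pe) targets, in list order."""
    per_shard, remainder = divmod(total_nbytes, len(targets))
    if remainder:
        raise ValueError("tensor size is not divisible by the number of targets")
    return [
        ShardSketch(sip, cube, pe, pa_base, per_shard, index * per_shard)
        for index, (sip, cube, pe) in enumerate(targets)
    ]
```

Because the result depends only on the inputs and their order, the shard list is deterministic, which is what kernel argument binding relies on.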
+
+---
+
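+### D2. Deployment uses the host allocator (Phase 0)
+
+In Phase 0, tensor deployment produces the PA shard mapping through the host allocator:
+
+- placement (split/replicate/hybrid) is decided by the DP policy,
+- allocation assigns PA ranges at PE level and returns the shard mapping,
+- the Tensor handle stores the resulting shard list deterministically.
+
+Phase 0 needs no separate host-visible device allocation RPC.
+
+---
+
+### D3. Data initialization and transfer use MemoryWrite/Read only
+
+Any data initialization or transfer implied by a tensor (e.g. fill, copy) must be
+expressed using only Host ↔ IO_CPU messages:
+
+- MemoryWrite
+- MemoryRead
+
+Rules:
+
+- MemoryWrite/Read must reference a PA plus (sip, cube, pe) tags (ADR-0012).
+- Allocation metadata must not be embedded as a separate allocation message.
+- Bulk tensor data must not be embedded in Phase 0 messages.
+
+The simulation engine schedules MemoryWrite/Read through the graph, so latency is
+computed by explicit traversal.
+
+---
+
+### D4. Extension path (compatibility preserved)
+
+A future ADR may introduce optional VA/MMU/IOMMU modeling by adding:
+
+- virtual addresses on tensor handles,
+- a mapping installation step,
+- translation latency and page granularity.
+
+The Phase 0 PA shard map remains a valid fast-path configuration.
+
+---
+
+## Consequences
+
+- The Host↔IO_CPU contract stays minimal (MemoryRead/Write + KernelLaunch).
+- KernelLaunch can explicitly convey per-PE data placement through shard tags.
+- The initial implementation stays simple and testable.
+
+---
+
+## Links
+
+- ADR-0011 (memory addressing: PA / VA / LA)
+- ADR-0012 (Host↔IO_CPU schema)
+- ADR-0007 (runtime_api vs sim_engine boundary)
+- ADR-0009 (kernel execution)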
@@ -0,0 +1,138 @@
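+# ADR-0009: Kernel Execution Messaging and Completion Semantics
+
+## Status
+
+Accepted
+
+## Context
+
+Kernel execution starts at the host and proceeds through the device-side control components:
+
+Host → IO_CPU → M_CPU → PE_CPU → scheduler → engines
+
+Completion propagates in the reverse direction.
+
+To keep benchmarks simple and topology-agnostic, kernel execution must be
+endpoint-driven and completion aggregation must be deterministic.
+
+---
+
+## Decision
+
+### D1. A kernel launch is an endpoint request
+
+A kernel launch starts by submitting a single KernelLaunch request to the
+IO_CPU endpoint.
+
+The runtime API must:
+
+- construct the kernel launch request,
+- submit it to IO_CPU,
+- wait for a single completion result.
+
+The runtime API must not orchestrate internal fan-out directly.
+
+---
+
+### D2. Tensor arguments are passed as metadata
+
+A KernelLaunch request must reference tensor arguments through:
+
+- host-owned tensor handles, or
+- device address maps resolved from such handles.
+
+Bulk tensor data must not be embedded in the kernel launch message.
+
+---
+
+### D3. Fan-out and aggregation are the components' responsibility
+
+- IO_CPU fans work out to the M_CPUs.
+- M_CPU fans work out to the PE_CPUs.
+- PE_CPU manages kernel execution and engine dispatch.
+
+Completion semantics:
+
+- M_CPU completes when all target PEs have completed or a failure policy is triggered.
+- IO_CPU completes when all target cubes have completed or a failure policy is triggered.
+
+---
+
+### D4. Completion and failure propagation
+
+- Every message must carry a correlation ID.
+- Completion and failure must propagate to the host deterministically.
+- The simulation engine provides a future/handle through which completion can be observed.
+
+---
+
+### D5. Launch timing is endpoint-synchronized
+
+Every PE targeted by a single kernel launch must begin executing the kernel body
+at the same simulation time, regardless of the length of its dispatch path from
+the launch entry point.
+
+Rationale. The dispatch tree Host → IO_CPU → M_CPU → PE_CPU has variable latency
+at every level. A PE close to its M_CPU receives the launch earlier than a distant
+PE, and a cube close to IO_CPU receives it earlier than a distant cube. Without
+synchronization, each PE's kernel starts at a different `env.now`, so per-PE
+metrics such as `pe_exec_ns` become a function of dispatch-path geometry rather
+than of the kernel's own behavior. The result is a measurement artifact in
+benchmarks that time in-kernel waits (e.g. `tl.recv` over an inter-cube or
+inter-SIP hop).
+
+Mechanism.
+
+- `KernelLaunchMsg` carries an optional `target_start_ns: float | None`.
+- **IO_CPU** is the canonical stamper. When fanning out to the M_CPUs, it computes
+  `target_start_ns = env.now + max_latency`, where `max_latency` is the maximum,
+  over every target (sip, cube, pe) tuple, of the latency of the **two-leg
+  dispatch chain**:
+
+  ```
+  chain_latency(sip, cube, pe) =
+      compute_path_latency_ns(find_node_path(io_cpu, m_cpu(sip, cube)))
+    + compute_path_latency_ns(find_node_path(m_cpu(sip, cube), pe_cpu(sip, cube, pe)))
+    - io_cpu.overhead_ns
+    - m_cpu.overhead_ns
+
+  max_latency = max(chain_latency(sip, cube, pe) over all target tuples)
+  ```
+
+  This models the real dispatch as **two sequential Transactions** (IO_CPU → M_CPU,
+  then M_CPU → PE_CPU). `compute_path_latency_ns` for each leg adds the
+  `overhead_ns` of both endpoints. `io_cpu.overhead_ns` is subtracted because
+  IO_CPU has already paid it before this method runs. `m_cpu.overhead_ns` is
+  subtracted once because it appears both as the end of leg 1 and as the start of
+  leg 2 but is paid only once at runtime. A single
+  `find_node_path(io_cpu, pe_cpu)` traversal is **not equivalent**: it may select
+  a graph path that bypasses M_CPU, silently underestimating the prediction for
+  distant cubes and violating the D5 invariant.
+
+  The fanned-out child Transactions carry **`nbytes = 0`** for `KernelLaunchMsg`
+  (control messages only). Otherwise a large kernel launch payload would occupy
+  the fabric bandwidth of the shared first hop, serializing per-cube dispatch and
+  pushing distant M_CPUs past `target_start_ns`, which reintroduces the
+  late-arrival violation.
+- **M_CPU** forwards an already-stamped `target_start_ns` unchanged. Only when the
+  value is absent (e.g. a unit test that launches directly at M_CPU) does M_CPU
+  compute a per-cube barrier `env.now + max(local command path latency)`.
+- **PE_CPU** records `pe_exec_start` at the very top of `_execute_kernel` and
+  yields `env.timeout(target_start_ns - env.now)` before invoking the kernel body.
+- When `target_start_ns is None`, PE_CPU falls back to the legacy asynchronous
+  behavior, which preserves backward compatibility.
+
+IO_CPU-level stamping guarantees that every PE in every target cube uses the same
+barrier simulation time, eliminating both intra-cube dispatch offset artifacts and
+inter-cube offset artifacts in multi-cube launches. It models the timed
+broadcast launch (latency-equalized dispatch tree) of real hardware.
+
+The synchronization happens inside the engine / IO_CPU / M_CPU / PE_CPU control
+plane; the runtime API and application kernels are unchanged.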
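
The D5 stamping arithmetic can be checked in isolation with plain numbers. The sketch below is an illustration, not the IO_CPU implementation: the function names are hypothetical, per-leg latencies are passed in directly instead of being computed from the graph, and clamping a late arrival to a zero wait is an added assumption (SimPy's `Timeout` rejects negative delays).

```python
def chain_latency_ns(leg1_ns: float, leg2_ns: float,
                     io_cpu_overhead_ns: float, m_cpu_overhead_ns: float) -> float:
    """Two-leg dispatch latency with the double-counted overheads removed."""
    return leg1_ns + leg2_ns - io_cpu_overhead_ns - m_cpu_overhead_ns


def stamp_target_start_ns(now_ns: float,
                          legs: dict[tuple[int, int, int], tuple[float, float]],
                          io_cpu_overhead_ns: float,
                          m_cpu_overhead_ns: float) -> float:
    """legs maps each (sip, cube, pe) target to its (leg1_ns, leg2_ns) latencies."""
    max_latency = max(
        chain_latency_ns(leg1, leg2, io_cpu_overhead_ns, m_cpu_overhead_ns)
        for leg1, leg2 in legs.values()
    )
    return now_ns + max_latency


def barrier_wait_ns(target_start_ns, now_ns: float) -> float:
    """Delay a PE should wait before running the kernel body."""
    if target_start_ns is None:
        return 0.0  # legacy asynchronous behavior: start immediately
    # Assumption: a PE that arrives late starts at once instead of failing.
    return max(0.0, target_start_ns - now_ns)
```

Every PE that receives the same stamp and waits `barrier_wait_ns` reaches the kernel body at the same simulation time, provided it arrives before the stamp.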
+
+---
+
+## Links
+
+- SPEC R1, R2, R7, R8
+- ADR-0007 (runtime API boundary)
+- ADR-0008 (tensor deployment)
+- ADR-0013 (verification strategy: V2 fan-out tests)
+- ADR-0015 D4 (concrete fabric path of a kernel launch)
@@ -0,0 +1,145 @@
+# ADR-0010: Command-Line Interface and Execution Semantics
+
+## Status
+
+Accepted
+
+## Context
+
+The `kernbench` CLI is the user-facing entry point of the simulator. It exposes
+four subcommands:
+
+- `run`: runs a benchmark against a topology.
+- `list`: prints the registered benchmarks.
+- `probe`: a diagnostic utility for latency / bandwidth measurement.
+- `web`: an interactive topology viewer.
+
+Device enumeration is centralized in the CLI. Neither the runtime API nor the
+simulation engine enumerates devices. Benchmarks are single-device by design
+and take a device identifier as input.
+
+## Decision
+
+### D1. Benchmark contract: single-device by design
+
+- A benchmark must define behavior for a single device only.
+- A benchmark must take a device identifier as input.
+- A benchmark must not enumerate or loop over multiple devices.
+
+Multi-device execution is a concern of the CLI, not of the benchmark (D3).
+
+### D2. `kernbench run`: benchmark execution
+
+Required arguments:
+
+- `--topology <path>`: path to the topology YAML file. Loaded through
+  `resolve_topology()`.
+- `--bench <identifier>`: benchmark identifier. Resolved through
+  `kernbench.benches.registry.resolve()`; accepts either a registered kebab-case
+  name (e.g. `gemm-single-pe`) or a numeric index from `kernbench list`.
+
+Optional arguments:
+
+- `--device <selector>` (default: `all`):
+  - `all`: runs once per discovered SIP (see D3).
+  - `sip:<N>`: runs on SIP N only.
+  - Parsed through `resolve_device()`.
+- `--verify-data` (default: off): enables Phase 2 data verification
+  (see ADR-0020). When set, `engine_factory` configures the engine with
+  `enable_data=True`. After the benchmark runs, a diagnostic summary of the
+  recorded ops is printed.
+
+Each invocation runs the benchmark once within a single simulation instance.
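
The `run` argument surface in D2 maps naturally onto `argparse`. The sketch below mirrors only the flags listed above; `parse_device_selector` and `build_run_parser` are hypothetical stand-ins and do not reproduce the project's `resolve_device()` or its actual parser.

```python
import argparse


def parse_device_selector(selector: str):
    """Return None for 'all', or the SIP index for 'sip:<N>'."""
    if selector == "all":
        return None
    kind, sep, index = selector.partition(":")
    if kind == "sip" and sep and index.isdigit():
        return int(index)
    raise argparse.ArgumentTypeError(f"invalid device selector: {selector!r}")


def build_run_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="kernbench run")
    parser.add_argument("--topology", required=True, help="topology YAML path")
    parser.add_argument("--bench", required=True, help="bench name or list index")
    # argparse also applies `type` to string defaults, so "all" becomes None.
    parser.add_argument("--device", default="all", type=parse_device_selector)
    parser.add_argument("--verify-data", action="store_true")
    return parser
```

Keeping selector parsing in the CLI layer matches D1: the benchmark itself only ever sees a single resolved device.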
+
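+### D3. Multi-device execution is logically parallel
+
+With `--device all` (or the flag omitted) and a topology that has multiple SIPs:
+
+- The benchmark runs are submitted to a single simulation engine instance.
+- The runs are logically parallel in simulation time.
+- Cross-device contention (shared fabric bandwidth, inter-SIP traffic, and so on)
+  is modeled naturally.
+
+The CLI does **not** spawn multiple OS processes or independent simulation runs;
+the parallelism happens inside a single simulation instance.
+
+### D4. `kernbench list`: listing registered benchmarks
+
+No arguments. Prints each registered bench's auto-assigned index, registered name,
+and one-line description.
+
+Benches register themselves through the `@bench(name=..., description=...)`
+decorator (`kernbench.benches.registry`).
+Every module in the `kernbench.benches` package whose name does not start with an
+underscore must register at least one bench; a missing decorator raises
+`RuntimeError` at package import time.
+
+Indices are assigned at import time in alphabetical order of name. An index is a
+CLI convenience for abbreviating `--bench`, not a stable API: inserting a new
+bench alphabetically shifts the indices after it.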
+
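+### D5. `kernbench probe`: latency / bandwidth diagnostic utility
+
+Required arguments:
+
+- `--topology <path>`: path to the topology YAML file.
+
+Optional arguments:
+
+- `--case <name>` (default: `all`): runs a predefined traffic pattern, or every
+  defined case with `all`.
+
+Probe runs each pattern through the simulation engine and reports, per case:
+
+- End-to-end latency (ns).
+- Effective bandwidth (nbytes / total_ns).
+- Bottleneck bandwidth (the minimum edge BW on the selected path).
+- Utilization (effective / bottleneck).
+
+Probe additionally checks monotonicity invariants (for example, local-HBM access ≤
+intra-cube PE-to-PE ≤ inter-cube ≤ inter-SIP) and reports violations. Probe is a
+developer tool for validating the latency / bandwidth model; it is not a benchmark.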
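
The four reported metrics and the monotonicity check reduce to simple arithmetic. The sketch below assumes bandwidths in GB/s and latencies in ns; the function names and the dictionary keys are hypothetical and do not reflect the output format of `kernbench probe`.

```python
def probe_report(nbytes: int, total_ns: float, edge_bws_gbs: list[float]) -> dict:
    """Compute effective BW, bottleneck BW, and utilization for one case."""
    effective = nbytes / total_ns   # bytes per ns equals GB/s
    bottleneck = min(edge_bws_gbs)  # slowest edge on the selected path
    return {
        "latency_ns": total_ns,
        "effective_bw_gbs": effective,
        "bottleneck_bw_gbs": bottleneck,
        "utilization": effective / bottleneck,
    }


def monotonicity_violations(latency_ns: dict[str, float],
                            order: list[str]) -> list[tuple[str, str]]:
    """Return adjacent (nearer, farther) pairs where the nearer case is slower."""
    return [(a, b) for a, b in zip(order, order[1:])
            if latency_ns[a] > latency_ns[b]]
```

An empty list from `monotonicity_violations` means the expected ordering holds for the measured cases.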
+
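+### D6. `kernbench web`: topology viewer
+
+Optional arguments:
+
+- `--port <N>` (default: `8765`): HTTP port.
+- `--no-open`: do not open the browser automatically.
+
+Starts a local HTTP server that renders the compiled topology in the browser.
+It is distinct from the static `docs/diagrams/` artifacts:
+
+- `docs/diagrams/` files are derived at topology compilation time (ADR-0006).
+- `kernbench web` is interactive: pan/zoom, hover for component attributes,
+  and switching between the SIP / CUBE / PE views.
+
+### D7. The runtime API and simulation engine stay device-scoped
+
+- A runtime API call operates on one device per call.
+- The simulation engine schedules all requests deterministically.
+- Neither layer enumerates devices.
+
+This invariant keeps each layer independently testable; device enumeration and
+multi-device fan-out live only in the CLI's `run` command (D3).
+
+The `probe` implementation lives under `kernbench.probes` (separate from
+`kernbench.benches`). This reflects that probe is a diagnostic utility, not a
+registered bench.
+
+## Consequences
+
+- Benchmark authors write single-device logic; multi-device behavior follows
+  naturally from the CLI dispatching across SIPs.
+- Adding new subcommands (e.g. trace export, replay) requires no change to
+  benchmarks or the runtime API; the CLI is the extension point.
+- `probe` and `web` are diagnostic/visualization tools, not benchmarks; they
+  bypass the benchmark loader path.
+
+## Links
+
+- SPEC R7, R8, R9
+- ADR-0007 (runtime API and simulation engine boundary)
+- ADR-0020 (two-pass data execution: `--verify-data`)
+- ADR-0006 (topology compilation and diagram generation: background for `kernbench web`)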
@@ -0,0 +1,503 @@
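+# ADR-0011: Memory Addressing (PA / VA / LA Address Models)
+
+## Status
+
+Accepted.
+
+- **VA model: currently implemented (default).**
+- PA model: implemented as the PageFault fallback in PE_DMA.
+- LA model: proposed, not implemented.
+
+## Context
+
+KernBench's address model has evolved through three design points, each
+addressing the limitations of the previous one. This ADR records all three in
+one place because future implementation work has to choose among them.
+
+### PA-only baseline
+
+KernBench Phase 0 treated every device memory operation (MemoryRead/MemoryWrite)
+as a pure physical-address transfer: no host-side virtual addressing and no
+MMU/IOMMU translation. The allocator returned PA mappings, and DMA requests
+carried PAs directly.
+
+This was sufficient for early correctness and latency work, but not for running
+standard Triton kernels that use the `base_addr + offset` pattern on sharded
+tensors. Each PE's shard has a different PA, yet the kernel needs a single
+contiguous address space in which to compute offsets.
+
+### Why VA/MMU was adopted (current default)
+
+Realistic systems use host-side virtual addressing and an MMU/IOMMU-style
+translation path for DMA. The host allocates physical memory at PE level,
+maps it into a virtual address space, and installs the mapping; DMA requests
+then use virtual addresses, which are translated to physical addresses.
+
+With this model, kernels can use `base_addr + offset` over a contiguous VA
+range, and the device-side MMU translates each access to the appropriate PA.
+
+### Why LA/BAAW is proposed
+
+VA/MMU treats HBM as a single backing space. KernBench needs to explore
+architectures in which HBM is organized as multiple parallel pseudo channels:
+
+- A CUBE's HBM has 32 or 64 pseudo channels.
+- In the PE-Local-HBM model, each PE is assigned N pseudo channels
+  (N = `hbm_pseudo_channels / pes_per_cube`).
+- Per-channel bandwidth (e.g. 32 GB/s) determines the PE's total bandwidth
+  (N × per-channel).
+
+Two channel mapping modes must be modelable:
+
+- **1:1 mode**: one logical access → N per-channel requests.
+  Models per-channel bandwidth contention precisely.
+- **n:1 mode (default)**: one logical access → one aggregate request.
+  Assumes the channels are interleaved; aggregate bandwidth model.
+
+A VA `tl.load(va_ptr)` produces one DMA request to one target.
+Decomposing it into per-channel requests inside PE_DMA requires the address
+layer to be channel-aware. That is the role of the LA (Logical Address)
+abstraction with its BAAW (Logical-to-Physical Mapping Unit).
+
+Key requirements driving the LA design:
+
+- The effective-bandwidth semantics of PE_DMA → HBM_CTRL must be identical in
+  both modes (only the request shape and resource model differ).
+- The kernel programming model does not change: physical channel information
+  is never exposed to kernel code.
+- Mode switching is a topology-level setting.
+
+### Design space summary
+
+| Model | Status | Core idea |
+|-------|--------|-----------|
+| PA | fallback (implemented) | Direct physical addressing, no translation |
+| VA | current default (implemented) | Contiguous VA range per tensor; the MMU translates each access |
+| LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes |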
+
+---
+
+## Decision
+
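+This ADR defines three address models. At any given time the system operates
+under exactly one model. The choice is driven by topology and configuration;
+coexistence within a single simulation run is not required.
+
+---
+
+### Address model: PA (physical address), fallback
+
+#### D-PA1. PA-only semantics
+
+- Every device memory access (MemoryRead/MemoryWrite) operates on a device
+  physical address (PA) and a size.
+- PA-only mode still works through PE_DMA's PageFault fallback path.
+  If a DMA src/dst address has no MMU mapping, PE_DMA treats the value directly as a PA.
+
+#### D-PA2. Allocation produces PA mappings
+
+Device allocation selects a PE-local memory region and returns a PA mapping
+sufficient for kernel execution and for issuing DMA requests.
+
+The PA model is retained mainly for backward compatibility with PA-only tests
+and as the underlying physical layer that the VA / LA models resolve into.
+
+---
+
+### Address model: VA (virtual address with MMU), current default
+
+#### D-VA1. Virtual address model
+
+- Each tensor has one contiguous VA range (`TensorHandle.va_base`).
+- `TensorShard` does **not** have a `va` field; a shard's VA is derived as
+  `va_base + offset_bytes`.
+- Kernels receive `va_base` as the pointer argument (via `TensorArg.va_base`).
+- `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VAs (not PAs).
+
+#### D-VA2. PE_MMU component
+
+- Hybrid design: a SimPy component (an inbox for `MmuMapMsg`) plus a utility
+  (a synchronous `translate()` called by PE_DMA).
+- O(1) VA → PA translation through a page-aligned dict lookup.
+- Per-access latency is configurable via `tlb_overhead_ns`.
+- PageFault fallback: if a VA has no mapping, PE_DMA treats it directly as a PA
+  (preserving backward compatibility with the PA model).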
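
The page-aligned dict lookup and the PageFault fallback in D-VA2 can be illustrated without SimPy. The sketch below is a standalone approximation: `MmuSketch`, `PageFault`, and `dma_resolve` are hypothetical names, and the real PE_MMU additionally models `tlb_overhead_ns` and message-based mapping installation.

```python
class PageFault(Exception):
    """Raised when a VA has no installed mapping."""


class MmuSketch:
    def __init__(self, page_size: int = 4096):
        self.page_size = page_size
        self._pages: dict[int, int] = {}  # VA page base -> PA page base

    def map(self, va: int, pa: int, nbytes: int) -> None:
        if va % self.page_size or pa % self.page_size:
            raise ValueError("va and pa must be page-aligned")
        for off in range(0, nbytes, self.page_size):
            self._pages[va + off] = pa + off

    def translate(self, va: int) -> int:
        page = va - va % self.page_size  # round down to the page base
        try:
            return self._pages[page] + (va - page)  # O(1) dict lookup
        except KeyError:
            raise PageFault(hex(va)) from None


def dma_resolve(mmu: MmuSketch, addr: int) -> int:
    """PE_DMA-side resolution: translate, or fall back to treating addr as a PA."""
    try:
        return mmu.translate(addr)
    except PageFault:
        return addr
```

The fallback branch is what keeps PA-only tests working under the VA model: an unmapped address simply passes through unchanged.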
+
+#### D-VA3. Mapping installation
+
+- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube fan-out)
+  → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured end-to-end.
+- `MmuMapMsg.target_sips` controls SIP-level routing, preventing cross-SIP
+  mapping pollution for replicated tensors.
+- Mapping strategy, based on `DPPolicy.cube`:
+  - **Replicate** (`cube="replicate"`): local mappings only, per (sip, cube).
+    The PEs of each cube see only their own local PAs. No cross-cube mappings
+    are installed.
+  - **Sharded** (`cube="column_wise"`, etc.): every shard mapping is broadcast
+    to every target cube. This enables cross-PE and cross-cube DMA.
+
+#### D-VA4. Tensor lifecycle
+
+- `del tensor` triggers automatic cleanup through `Tensor.__del__` plus a
+  `weakref` to the `RuntimeContext`. It sends `MmuUnmapMsg` through the fabric
+  and returns the VA and PA space.
+- `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
+- `RuntimeContext._tensors` uses `weakref.ref` so that it does not keep tensors
+  alive and block garbage collection.
+- `PEMemAllocator` uses a free-list with coalescing (not a bump allocator).
+- `VirtualAllocator` uses a free-list with coalescing for the VA space.
+
+#### D-VA5. Allocators
+
+- `VirtualAllocator`: device-wide VA space; page-aligned alloc/free with
+  coalescing.
+- `PEMemAllocator`: per-PE HBM/TCM; free-list-based alloc/free with coalescing.
+- Page size is configurable via the `pe_mmu` attrs in `topology.yaml`
+  (default 4096).
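
Both allocators are described as free-lists with coalescing. The sketch below shows that technique with a first-fit policy; the class name, the first-fit choice, and the requirement that the caller pass the size back to `free` are assumptions for illustration and may differ from `VirtualAllocator` and `PEMemAllocator`.

```python
class FreeListAllocator:
    """First-fit allocator over one address range, merging adjacent free blocks."""

    def __init__(self, base: int, size: int, align: int = 4096):
        self.align = align
        self._free = [(base, size)]  # sorted list of (start, size) free blocks

    def _round_up(self, nbytes: int) -> int:
        return -(-nbytes // self.align) * self.align

    def alloc(self, nbytes: int) -> int:
        need = self._round_up(nbytes)
        for i, (start, size) in enumerate(self._free):
            if size >= need:
                if size == need:
                    del self._free[i]
                else:
                    self._free[i] = (start + need, size - need)
                return start
        raise MemoryError("out of address space")

    def free(self, start: int, nbytes: int) -> None:
        self._free.append((start, self._round_up(nbytes)))
        self._free.sort()
        merged = [self._free[0]]
        for blk_start, blk_size in self._free[1:]:
            prev_start, prev_size = merged[-1]
            if prev_start + prev_size == blk_start:  # adjacent: coalesce
                merged[-1] = (prev_start, prev_size + blk_size)
            else:
                merged.append((blk_start, blk_size))
        self._free = merged

    def free_blocks(self) -> list[tuple[int, int]]:
        return list(self._free)
```

Unlike a bump allocator, freeing everything returns the range to a single block, so long-running tensor churn does not fragment the space permanently.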
+
+#### Consequences (VA model)
+
+- Triton kernels use the `base_addr + offset` pattern naturally on sharded tensors.
+- All latency is explicit through graph traversal, including MMU mapping
+  installation and per-access TLB overhead.
+- PA-only mode is retained as a fallback (PageFault → treated as PA).
+- IPCQ and other fixed-address resources bypass the MMU (they use PAs directly).
+
+---
+
+### Address model: LA (logical address with BAAW), proposed
+
+LA replaces VA when channel-level HBM modeling is required.
+Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the artifacts
+that are removed). Coexistence with VA within the same run is not a goal.
+
+#### D-LA1. Introducing LA: replacing the VA infrastructure
+
+LA is the only address space used by kernel code (`tl.load`, `tl.store`,
+`tl.composite`). Properties:
+
+- A Tensor can be mapped into a contiguous logical space (as with VA).
+- It expresses `(logical buffer + offset)`.
+- It does **not** directly contain physical channel information.
+- It remains an intermediate abstraction until physical resolution takes place.
+
+LA address space:
+
+| Item | Value |
+|------|-------|
+| LA start | `0x1_0000_0000` (4 GB, same as the previous VA start) |
+| LA space size | 64 GB per PE |
+| Alignment unit | segment (see D-LA3) |
+
+LA is PE-local: different PEs may use the same LA value, but because their
+BAAW segment tables differ, the value resolves to different PAs.
+
+VA infrastructure removed when LA is adopted:
+
+| Removed | Replacement |
+|---------|-------------|
+| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed) |
+| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
+| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed; BAAW is logic inside PE_DMA, not a separate component |
+| `runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg` | `BaawSegmentInstallMsg` |
+| `runtime_api/context.py`: VA alloc + MMU install | LA alloc + BAAW segment install |
+| `runtime_api/tensor.py`: `va_base` | `la_base` |
+| `topology.yaml`: `pe_mmu` component entry | Removed |
+
+#### D-LA2. Mapping mode configuration
+
+Topology-level (cube) configuration:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
+    hbm_pseudo_channels: 64       # total number of pseudo channels
+    hbm_channels_per_pe: 8        # number of local channels per PE
+    hbm_channel_bw_gbs: 32.0      # bandwidth per channel
+```
+
+The graph compiler (topology builder) and BAAW initialization consume these values.
+
+#### D-LA3. Segments and BAAW
+
+Segments partition the LA space. Each segment maps to a specific HBM channel or
+channel group. Segments are created by the runtime allocator at tensor deploy
+time. BAAW uses the segment table to resolve an LA into one or more physical requests.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class BaawSegment:
+    la_base: int          # LA at which the segment starts
+    la_size: int          # segment size (bytes)
+    mode: str             # "one_to_one" | "n_to_one"
+    # 1:1 mode fields
+    channel_count: int    # number of channels assigned to this segment (e.g. 8)
+    pa_bases: list[int]   # per-channel PA base (len = channel_count)
+    channel_ids: list[int]   # per-channel logical ID (e.g. [0..7])
+    channel_size: int     # size per channel (la_size // channel_count)
+    # n:1 mode fields
+    agg_pa_base: int      # aggregate PA base
+    agg_node_id: str      # node_id of the aggregation router
+```
+
+Segment lifecycle:
+
+1. **Allocation** (tensor deploy): RuntimeContext allocates an LA from the LA
+   allocator. PEMemAllocator allocates per-channel PAs (1:1) or an aggregate
+   PA (n:1). `BaawSegmentInstallMsg` registers the segment with PE_DMA.
+2. **Use** (kernel execution): kernel `tl.load(la_ptr)` →
+   `DmaReadCmd(src_addr=LA)`. The BAAW front end in PE_DMA looks up the segment
+   and translates to PA(s).
+3. **Release** (tensor free): the segment is removed from the table and the LA
+   and PA are returned.
+
+#### D-LA4. BAAW resolution logic
+
+BAAW is a front-end stage inside PE_DMA, not a separate SimPy component.
+It is synchronous address-resolution logic that runs at the start of PE_DMA's
+`handle_command()`.
+
+Input: `(LA, nbytes)`. Output:
+
+- **1:1 mode**: `list[PhysicalRequest]`, one per channel.
+- **n:1 mode**: a single `PhysicalRequest`.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class PhysicalRequest:
+    pa: int           # 51-bit physical address
+    nbytes: int       # transfer size of this request
+    dst_node: str     # target node_id (channel router or aggregation router)
+
+
+# Method of the BAAW front end (shown without its enclosing class).
+def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
+    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
+    offset = la - seg.la_base
+
+    if seg.mode == "n_to_one":
+        pa = seg.agg_pa_base + offset
+        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
+
+    # one_to_one: fan out into one request per channel
+    requests = []
+    per_ch_size = seg.channel_size
+    for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids):
+        ch_offset = offset % per_ch_size
+        ch_nbytes = nbytes // seg.channel_count
+        pa = pa_base + ch_offset
+        dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
+        requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
+    return requests
+```
+
+BAAW is responsible for:
+
+- translating a logical access into physical request units,
+- applying fan-out (1:1) or pass-through (n:1) according to the mode,
+- computing the PA and target node.
+
+BAAW does not:
+
+- perform the actual data movement,
+- execute NOC routing,
+- simulate bandwidth occupancy (that is the role of downstream components).
+
+BAAW's output is directly usable by the simulator's routing and resource models
+without further address decoding.
+
+#### D-LA5. Changes to PE_DMA `handle_command()`
+
+Current (VA-based) flow:
+
+```
+DmaReadCmd.src_addr (VA)
+  → MMU.translate(VA) → PA
+  → PhysAddr.decode(PA) → PhysAddr object
+  → resolver.resolve(PhysAddr) → dst_node_id
+  → router.find_path(pe_prefix, dst_node_id) → path
+  → 1 sub-Transaction → fabric injection
+```
+
+LA-based flow:
+
+```
+DmaReadCmd.src_addr (LA)
+  → BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
+  → for each PhysicalRequest:
+      → router.find_path(pe_prefix, req.dst_node) → path
+      → compute_drain_ns(path, req.nbytes) → drain
+      → sub-Transaction → fabric injection
+  → wait for all sub-Transactions
+  → pe_txn.done.succeed()
+```
+
+Key changes:
+
+- MMU lookup removed → BAAW resolve.
+- `PhysAddr.decode()` + `resolver.resolve()` → BAAW returns `dst_node` directly.
+- 1 request → N parallel requests in 1:1 mode.
+
+#### D-LA6. 1:1 mode details
+
+- One logical access → N physical requests (N = `channels_per_pe`).
+- N = `hbm_pseudo_channels / pes_per_cube`.
+- Each request carries a fully resolved 51-bit PA and targets a specific channel
+  router (`{pe_prefix}.ch_r{channel_id}`).
+- Per-channel links model bandwidth contention.
+- PE_DMA injects the N sub-transactions concurrently.
+
+Example: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`.
+PE0 owns ch0-7.
+
+```text
+Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "one_to_one", channel_count: 8,
+    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
+    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
+    channel_size: 512,
+}
+
+BAAW resolve result (8 requests):
+  → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
+  → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
+  → ...
+  → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
+
+PE_DMA: injects 8 sub-transactions in parallel
+  per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
+  total effective BW = 8 × channel_bw_gbs
+```
+
+Other values of N:
+
+- `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`,
+  4 requests
+- `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`,
+  16 requests
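
The fan-out arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the 32 GB/s per-channel figure used in the usage note is taken from the n:1 example in D-LA7 rather than from any configuration file.

```python
def channels_per_pe(hbm_pseudo_channels: int, pes_per_cube: int) -> int:
    """N = hbm_pseudo_channels / pes_per_cube; must divide evenly."""
    if pes_per_cube <= 0 or hbm_pseudo_channels % pes_per_cube:
        raise ValueError("pseudo-channels must split evenly across PEs")
    return hbm_pseudo_channels // pes_per_cube


def effective_bw_gbs(n_channels: int, channel_bw_gbs: float) -> float:
    """Total effective BW when all N channel links are driven in parallel."""
    return n_channels * channel_bw_gbs
```

For example, `channels_per_pe(64, 8)` gives 8, and `effective_bw_gbs(8, 32.0)` gives 256.0 GB/s, which matches the aggregated-link figure quoted for n:1 mode.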
+
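+#### D-LA7. n:1 mode details
+
+- One logical access → one aggregated request.
+- Target: aggregation router → hbm_ctrl (see ADR-0017 D8).
+- Aggregated link BW = `channels_per_pe × channel_bw_gbs`
+  (e.g., 8 × 32 = 256 GB/s).
+- A single queue / resource for modeling purposes.
+- No per-channel PA decomposition.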
+
+```text
+Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "n_to_one",
+    agg_pa_base: PA_agg,
+    agg_node_id: "sip0.cube0.pe0.agg_router",
+}
+
+BAAW resolve result:
+  → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
+
+PE_DMA: 1 sub-transaction
+  aggregation router → hbm_ctrl link (256 GB/s)
+```
+
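+#### D-LA8. Kernel model preserved
+
+- Kernels still issue a single memory op (`tl.load`, `tl.store`,
+  `tl.composite`).
+- LA is the address space exposed to kernel code.
+- Channel decomposition and aggregation happen inside the BAAW in PE_DMA.
+- Kernel code never sees physical channel information.
+
+#### Consequences (LA model, proposed)
+
+Positive:
+
+- 1:1 vs n:1 semantics live in one place (BAAW).
+- The kernel abstraction is preserved: no kernel code changes.
+- Topology-driven policy control (switch modes via yaml).
+- Better consistency and debuggability of the simulation model.
+- Segment-based mapping is simpler than page tables and has less overhead.
+
+Negative:
+
+- The entire VA/MMU code base must be refactored.
+- The request generation path is more complex (N requests in 1:1 mode).
+- Reduced per-channel visibility in n:1 mode.
+- VA-related tests must be rewritten.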
+
+---
+
+## Migration Path
+
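+- **PA → VA** was an extension. PA mode is retained as the PageFault
+  fallback inside PE_DMA. The transition does not require removing PA code.
+- **VA → LA**, if adopted, is a replacement rather than coexistence. See
+  D-LA1 for the list of VA infrastructure to remove. The PA fallback can be
+  kept orthogonally inside PE_DMA for testing.
+
+## Alternatives Considered (LA model)
+
+1. **Keep VA + fan-out in the MMU**: the MMU returns per-channel PAs.
+   Rejected: the MMU's role would grow beyond translation into request
+   decomposition, and aggregation (n:1) becomes awkward to express.
+2. **Channel-aware kernel API**: kernels call per-channel load/store directly.
+   Rejected: leaks the abstraction, loses portability, and requires
+   rewriting every benchmark.
+3. **Always PA (no LA)**: the runtime passes per-channel PAs directly to kernels.
+   Rejected: incompatible with aggregation; the translation point is unclear;
+   channel information leaks into kernels.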
+
+## Test Requirements
+
+### VA model (current, regression)
+
+- Cross-PE / cross-cube DMA paths that follow the installed mappings.
+- Fabric traversal of `MmuMapMsg` / `MmuUnmapMsg` with measured latency.
+- Per-access TLB overhead timing.
+- The PageFault fallback path preserves PA-only behavior.
+
+### LA model (when implemented)
+
+- 1:1 mode: the same logical access → N per-channel requests.
+- n:1 mode: the same logical access → 1 aggregated request.
+- Bandwidth equivalence between the two modes for the same workload.
+- 1:1 mode: per-channel contention is modeled correctly.
+- n:1 mode: aggregated bandwidth is reflected correctly.
+- Kernel code is unchanged across mode switches.
+- BAAW segment install / uninstall correctness.
+- Multiple tensors in distinct segments do not collide.
+
+## Implementation Order (LA, when scheduled)
+
+1. LA type (`policy/address/la_allocator.py`).
+2. BAAW segment table (`policy/address/baaw.py`).
+3. `BaawSegmentInstallMsg` (`runtime_api/kernel.py`).
+4. PE_DMA BAAW integration (`components/builtin/pe_dma.py`
+   `handle_command()`).
+5. RuntimeContext: LA alloc + segment install
+   (`runtime_api/context.py`).
+6. `Tensor.va_base` → `Tensor.la_base` (`runtime_api/tensor.py`).
+7. Remove VA/MMU code.
+8. Remove `pe_mmu` from `topology.yaml`; add the mapping-mode setting.
+9. Test migration:
+
+| Test file | Action |
+|-----------|--------|
+| `tests/test_mmu_component.py` | Remove → BAAW segment install tests |
+| `tests/test_mmu_fabric.py` | Remove → BAAW + fabric integration tests |
+| `tests/test_pe_mmu.py` | Remove |
+| `tests/test_va_allocator.py` | Replace with LA allocator tests |
+| `tests/test_va_integration.py` | Replace with LA + BAAW integration tests |
+| `tests/test_va_offset.py` | Replace with LA offset tests |
+
+## Links
+
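+- ADR-0007 (runtime_api vs sim_engine boundary)
+- ADR-0008 (tensor deployment)
+- ADR-0009 (kernel execution)
+- ADR-0014 (PE internal execution model)
+- ADR-0015 (component port/wire model)
+- ADR-0017 (cube NOC and HBM connectivity; topology consumer of the LA model)
+- ADR-0013 (verification strategy; V1 PA tagging)
+- SPEC R2 (traversal-based latency), R10 (memory addressing)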
@@ -0,0 +1,239 @@
+# ADR-0012: Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
+
+## Status
+
+Accepted
+
+## Context
+
+Phase 0 uses a PA-first memory model (ADR-0011):
+
+- memory operations use device physical addresses (PA) only,
+- VA/MMU/IOMMU are not modeled.
+
+The host-facing runtime API interacts with the device through the IO_CPU
+endpoint. We define a stable, minimal Host ↔ IO_CPU message schema to
+guarantee that:
+
+- benchmarks remain stable,
+- fan-out/aggregation inside IO_CPU can evolve independently,
+- completion and failure propagation are deterministic.
+
+We also require PE tagging (Scheme A): each shard explicitly carries
+(sip,cube,pe) so that IO_CPU can route and fan out deterministically
+without relying on PA decoding.
+
+---
+
+## Decision
+
+### D1. Contract scope
+
+This schema is a stable contract for Host ↔ IO_CPU only.
+
+Messages beyond IO_CPU (to M_CPU, PE_CPU, the scheduler, or the engines) are
+component-internal and are not part of this host contract in Phase 0.
+
+---
+
+### D2. Required message set
+
+The runtime API must use only the following message types for Host ↔ IO_CPU:
+
+- MemoryWrite
+- MemoryRead
+- KernelLaunch
+
+Every operation a benchmark needs (tensor init/copy, kernel execution) must
+be expressible with these messages.
+
+---
+
+### D3. Common envelope (required on every request)
+
+Every Host ↔ IO_CPU request must include:
+
+- `msg_type: str`
+- `correlation_id: str`
+  - generated by the host
+  - used to match responses deterministically
+- `request_id: str`
+  - unique within a correlation_id
+- `target_device: str`
+  - device identifier (e.g., "sip:0")
+- `timestamp_tag: str | None` (optional)
+  - debug tag only; must not affect determinism
+
+Every Host ↔ IO_CPU response must include:
+
+- `correlation_id: str`
+- `request_id: str`
+- `completion: Completion`
+
+---
+
+### D4. Completion schema (required)
+
+`Completion` must have:
+
+- `ok: bool`
+- `error_code: str | None`
+- `error_message: str | None`
+
+Rules:
+
+- If `ok == true`, `error_code` and `error_message` must both be null.
+- If `ok == false`, `error_code` must be non-null.
+- Completion semantics must be deterministic.
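
The two field rules above are easy to enforce at construction time. The following is a sketch of one way to do that with a frozen dataclass; the class layout is an assumption, and the ADR does not prescribe how validation is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Completion:
    ok: bool
    error_code: Optional[str] = None
    error_message: Optional[str] = None

    def __post_init__(self) -> None:
        # Rule 1: a successful completion carries no error fields.
        if self.ok and (self.error_code is not None or self.error_message is not None):
            raise ValueError("ok == true requires error_code and error_message to be null")
        # Rule 2: a failed completion must name an error code.
        if not self.ok and self.error_code is None:
            raise ValueError("ok == false requires a non-null error_code")
```

Under this sketch, `Completion(ok=True)` and `Completion(ok=False, error_code="E_TIMEOUT")` construct normally, while `Completion(ok=False)` raises `ValueError`. The error code string is a made-up example.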
+
+---
+
+### D5. MemoryWrite schema (PA-first, PE-tagged)
+
+`MemoryWrite` represents a host-initiated device memory write/initialization
+operation.
+
+Required fields:
+
+- common envelope fields (D3)
+- destination placement tags (Scheme A):
+  - `dst_sip: int`
+  - `dst_cube: int`
+  - `dst_pe: int`
+- `dst_pa: int`
+  - destination physical address within the destination PE's address space
+- `nbytes: int`
+- `src_kind: "pattern" | "host_buffer_ref"`
+  - Phase 0 must support "pattern"
+- `pattern: Pattern | None`
+  - required when `src_kind == "pattern"`
+
+`Pattern` (support required in Phase 0):
+
+- `pattern_kind: "zero" | "fill_u8" | "fill_u16" | "fill_u32" | "fill_fp16" | "fill_fp32"`
+- `value: number | None`
+  - required for fill_*; ignored for zero
+
+Optional fields:
+
+- `dst_mem_kind: "HBM" | "TCM" | "AUTO"` (default "AUTO")
+- `debug_label: str | None`
+
+Notes:
+
+- In Phase 0 this message must not embed bulk tensor data.
+- All latency must come from explicit graph traversal and modeled
+  components.
+
+---
+
+### D6. MemoryRead schema (PA-first, PE-tagged)
+
+`MemoryRead` represents a host-initiated device memory read.
+
+Required fields:
+
+- common envelope fields (D3)
+- source placement tags (Scheme A):
+  - `src_sip: int`
+  - `src_cube: int`
+  - `src_pe: int`
+- `src_pa: int`
+- `nbytes: int`
+
+Optional fields:
+
+- `dst_kind: "host_sink" | "discard"` (default "host_sink")
+- `debug_label: str | None`
+
+Response payload:
+
+- Phase 0 does not require the actual bytes (the focus is latency/trace)
+- An implementation may later return lightweight statistics or a hash via a new ADR
+
+---
+
+### D7. KernelLaunch schema (PA-first, PE-tagged shards)
+
+`KernelLaunch` represents launching a kernel on the target device through
+IO_CPU.
+
+Required fields:
+
+- common envelope fields (D3)
+- `kernel_ref: KernelRef`
+- `args: list[KernelArg]`
+
+`KernelRef` must have:
+
+- `name: str`
+- `kind: "deployed" | "builtin"`
+- `deploy_pa: int | None`: PA where the kernel binary is placed (required for "deployed")
+- `deploy_sip: int`: SIP where the binary resides
+- `deploy_cube: int`: cube where the binary resides
+- `deploy_pe: int`: PE where the binary resides
+- `nbytes_code: int`: kernel binary size (for BW modeling)
+
+The kernel binary must be pre-placed in device memory via MemoryWrite.
+KernelLaunch must not embed kernel source code or IR in the launch message.
+
+`KernelArg` supports tensor arguments via a PA mapping and scalar arguments by value.
+
+Tensor argument (required):
+
+- `arg_kind: "tensor"`
+- `tensor_pa_map: TensorPAMap`
+
+`TensorPAMap` must have:
+
+- `shards: list[TensorShard]`
+
+`TensorShard` must have (Scheme A enforced):
+
+- `sip: int`
+- `cube: int`
+- `pe: int`
+- `pa: int`
+- `nbytes: int`
+- `offset_bytes: int`
+
+Scalar argument (required):
+
+- `arg_kind: "scalar"`
+- `dtype: "i32" | "i64" | "fp16" | "fp32" | "bool"`
+- `value: number | bool`
+
+KernelLaunch optional fields:
+
+- `grid: dict | None`
+- `meta: dict | None`
+- `failure_policy: "fail_fast" | "collect_all"` (default "fail_fast")
+- `debug_label: str | None`
+
+Notes:
+
+- KernelLaunch must not embed bulk tensor data.
+- KernelLaunch must be submitted only to the IO_CPU endpoint.
+- IO_CPU must fan out work internally using the shards' (sip,cube,pe)
+  tags.
+
+---
+
+## Verification Notes
+
+Tests must verify that:
+
+- schema validation rejects requests with missing required fields,
+- correlation/response matching is deterministic,
+- MemoryWrite/Read/KernelLaunch produce explicit hop traces,
+- every routed request has latency > 0.
+
+---
+
+## Links
+
+- ADR-0011 (memory addressing: PA / VA / LA)
+- ADR-0007 (runtime_api and sim_engine boundary)
+- ADR-0009 (kernel execution fan-out/aggregation)
+- ADR-0013 (verification strategy: V1 message schema validation)
+- SPEC R2, R7, R8
@@ -0,0 +1,145 @@
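+# ADR-0013: Verification Strategy and Phase 1 Test Plan
+
+## Status
+
+Accepted
+
+## Context
+
+KernBench is a system-level simulator whose correctness is defined by:
+
+- adherence to the invariants defined in SPEC,
+- determinism and debuggability,
+- explicit modeling of routing and latency.
+
+Given an evolving implementation, we need a stable verification strategy
+that allows incremental development while preventing architectural drift.
+
+This ADR defines the Phase 1 verification plan and what "correct behavior"
+means for the initial implementation.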
+
+---
+
+## Decision
+
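+### D1. Verification is contract-based
+
+Verification must be derived from:
+
+- SPEC requirements,
+- accepted ADRs.
+
+Tests must verify architectural contracts, not incidental implementation
+details.
+
+---
+
+### D2. Phase 1 verification scope
+
+Phase 1 verification focuses on:
+
+- message contract validity (ADR-0012),
+- routing and fan-out semantics at the IO_CPU boundary (ADR-0009),
+- PA-first memory addressing and shard tagging (ADR-0011),
+- core latency and trace invariants (SPEC 0.1, R2).
+
+Microarchitectural accuracy, bandwidth contention, and cycle-level behavior
+are explicitly out of scope for Phase 1.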
+
+---
+
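+### D3. Required Phase 1 verification cases
+
+The implementation must support the following verification cases:
+
+#### V1. Message schema validation
+
+- A KernelLaunch request in which any tensor shard is missing
+  `(sip, cube, pe)` must be rejected.
+- A MemoryWrite/MemoryRead request missing its destination/source placement
+  tags must be rejected.
+- Completion results must follow the `ok / error_code / error_message`
+  contract.
+
+#### V2. IO_CPU fan-out and aggregation
+
+Given:
+
+- a topology with 1 SIP, 1 CUBE, and 2 PEs,
+- a KernelLaunch request containing two tensor shards that target
+  different PEs,
+
+the system must:
+
+- submit a single KernelLaunch to IO_CPU,
+- fan out the work internally to both PEs,
+- aggregate the completions and return a single deterministic completion
+  to the host.
+
+#### V3. Latency and trace invariants
+
+For every valid request:
+
+- the per-hop trace must be non-empty,
+- the total latency must be greater than 0,
+- repeated runs with identical inputs must produce identical traces.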
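
The three V3 invariants can be bundled into one helper. The sketch below assumes a trace is a sequence of `(node_id, latency_ns)` pairs and that `run` is a zero-argument callable that replays the same request; both are illustrative assumptions, not the simulator's actual trace format.

```python
from typing import Callable, Sequence, Tuple

Hop = Tuple[str, float]  # (node_id, latency_ns); hypothetical trace record


def check_v3(run: Callable[[], Sequence[Hop]], repeats: int = 3) -> float:
    """Assert the V3 invariants and return the total latency in ns."""
    first = list(run())
    assert first, "per-hop trace must be non-empty"
    total_ns = sum(ns for _, ns in first)
    assert total_ns > 0, "total latency must be greater than 0"
    # Determinism: identical inputs must reproduce the identical trace.
    for _ in range(repeats - 1):
        assert list(run()) == first, "repeated run produced a different trace"
    return total_ns
```

A test would pass a closure that submits one request to a freshly built engine and returns its hop trace.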
+
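+#### V4. Topology independence and cross-domain coverage
+
+The verification cases must pass on a variety of topology shapes, including:
+
+- minimal: (1 SIP, 1 CUBE, 1 PE)
+- multi-PE: (1 SIP, 1 CUBE, N PEs)
+- multi-CUBE within a SIP: (1 SIP, M CUBEs, ≥1 PE per CUBE)
+- multi-SIP tray: (K SIPs, ≥1 CUBE per SIP, ≥1 PE per CUBE)
+
+For multi-CUBE and multi-SIP topologies, Phase 1 verification focuses
+on:
+
+- explicit connectivity (the required links exist),
+- deterministic routing and control-path traversal,
+- a non-empty trace and latency > 0 for representative cross-domain
+  requests (inter-CUBE and inter-SIP paths).
+
+Tests must not hardcode topology size, node IDs, or link counts.
+Instead, they must derive expected values from the compiled topology metadata.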
+
+---
+
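+### D4. Phase 1 deliverables
+
+Phase 1 may include:
+
+- verification-only test code,
+- topology fixtures,
+- trace inspection utilities.
+
+Phase 1 must not require:
+
+- production code changes made solely to satisfy tests,
+- weakening or removing tests in order to make progress.
+
+---
+
+### D5. Phase 2 enforcement
+
+Phase 2 (Apply) must:
+
+- run the Phase 1 verification cases,
+- roll back all changes if verification fails,
+- preserve the tests as the authoritative contract.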
+
+---
+
+## Consequences
+
+- Architectural correctness is enforced early.
+- Tests serve as executable documentation of system behavior.
+- The implementation stays flexible without losing rigor.
+
+---
+
+## Links
+
+- SPEC 0.1, R2, R6
+- ADR-0011 (memory addressing: PA / VA / LA)
+- ADR-0012 (Host ↔ IO_CPU message schema)
+- ADR-0009 (kernel execution semantics)
@@ -0,0 +1,441 @@
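+# ADR-0014: PE Pipeline Execution Model
+
+## Status
+
+Accepted
+
+## Context
+
+This ADR defines the PE-internal kernel execution model:
+
+- role split among PE-internal components
+- command dispatch paths (simple / composite / multi-op composite with epilogue)
+- a TileToken-based self-routing pipeline (the scheduler handles only dispatch and completion)
+- TCM-centric dataflow mediated by the register file
+- the engine resource model
+- the observability and trace contract
+- topology representation
+
+PE internal structure (7 components in scope of this ADR + 2 external references):
+
+- `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`,
+  `pe_tcm`: defined in this ADR
+- `pe_mmu`: VA model, defined in ADR-0011 D-VA
+- `pe_ipcq`: collective communication, defined in ADR-0023
+
+The goal is to keep each block independently replaceable through a
+deterministic, trace-friendly execution contract.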
+
+## Decision
+
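+### D1. Roles of PE-internal components
+
+**PE_CPU**
+
+- Executes the kernel instruction stream / control logic.
+- Generates PE commands and submits them to `PE_SCHEDULER` (via `PeInternalTxn`).
+- Does not enqueue work on engine queues directly.
+
+**PE_SCHEDULER**
+
+- The sole dispatcher inside the PE.
+- Receives commands from `PE_CPU` and dispatches by command type:
+  - Simple commands (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`)
+    → forwarded directly to the target engine.
+  - `CompositeCmd` → builds a `TilePlan` and feeds tiles into the pipeline
+    through a single `_feed_loop` (D6).
+- Is not involved in stage-to-stage chaining inside a composite;
+  that is handled by token self-routing (D6).
+
+**PE_DMA**
+
+- Handles memory transfers over the cube NoC between TCM and external
+  memory domains (HBM, shared SRAM, inter-cube UCIe).
+- Two execution channels:
+  - `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1); see D4.
+- Additional virtual channels:
+  - `vc_compute`: load/store/writeback traffic for GEMM/MATH tiles.
+  - `vc_comm`: IPCQ collective-communication send data (defined in ADR-0023 D8).
+
+**PE_FETCH_STORE**
+
+- TCM ↔ register file transfer unit.
+- Isolates register-file access semantics from the compute engines so that
+  GEMM/MATH remain pure compute components.
+- BW-based latency model; TCM access contention is serialized naturally
+  through the BW resource of `PE_TCM`.
+
+**PE_GEMM**
+
+- MAC array. Reads operands from the register file and writes results to
+  the register file. Does not access `PE_TCM` directly.
+
+**PE_MATH**
+
+- Elementwise / reduction / SIMD unit. Reads and writes the register file.
+
+**PE_TCM**
+
+- Tightly-coupled scratchpad with BW-serialized access. Split into two
+  logical regions by ownership (see D5).
+
+**Externally referenced components** (defined elsewhere):
+
+- `pe_mmu`: per-access VA→PA translation (ADR-0011 D-VA).
+- `pe_ipcq`: collective-communication ring buffers and peer endpoint
+  metadata (ADR-0023).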
+
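+### D2. Command lifecycle and queues
+
+`PE_SCHEDULER` maintains three logical structures:
+
+**SubmissionQueue**: written by `PE_CPU`, consumed by the scheduler.
+
+**InflightTable**: owned and mutated only by `PE_SCHEDULER`; tracks expanded
+sub-commands, dependency state, engine assignment, and completion state.
+
+**CompletionQueue**: written by `PE_SCHEDULER`; holds final completion records.
+
+**Single-writer rule**: only `PE_SCHEDULER` mutates command completion state.
+Engines report completion through explicit events / messages, which the
+scheduler consumes.
+
+**Command completion**: once all sub-commands have completed, `PE_SCHEDULER`
+emits the completion record.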
+
+### D3. Dispatch modes
+
+#### D3.1 Simple commands
+
+A simple command expands into exactly one engine sub-command:
+
+- `DmaReadCmd` / `DmaWriteCmd` → `PE_DMA`
+- `GemmCmd` → `PE_GEMM`
+- `MathCmd` → `PE_MATH`
+
+Flow:
+
+```text
+PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
+       → completion → PE_SCHEDULER → CompletionQueue
+```
+
+#### D3.2 Composite commands (single-op tile pipeline)
+
+A basic `CompositeCmd` executes a single compute op as a tile pipeline sequence:
+
+```text
+DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE
+```
+
+`PE_SCHEDULER` splits the DMA payload into hardware tiles and issues one
+`TileToken` per tile, each with a monotonically increasing `tile_id`.
+
+Tile dependencies (within a single tile `t`):
+
+```text
+DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)
+```
+
+Overlap across tiles is allowed as far as engine resources permit
+(D4 defines the constraints):
+
+```text
+DMA_READ(t+1) ∥ COMPUTE(t)
+DMA_WRITE(t-1) ∥ COMPUTE(t)
+```
+
+#### D3.3 Multi-op composite (scoped head + epilogue)
+
+A `CompositeCmd` can carry `ops: tuple[OpSpec, ...]` to express a multi-op
+pipeline:
+
+```python
+@dataclass(frozen=True)
+class OpSpec:
+    kind: str         # "gemm" | "math.exp" | "math.bias_add" | ...
+    scope: Scope      # "per_k_tile" | "per_output_tile" | "once"
+    ...
+```
+
+- `ops[0]` (head) defines the tile geometry (e.g., a head GEMM determines
+  the M/K/N split).
+- `ops[1:]` (epilogue) are follow-on stages whose execution frequency is
+  determined by `scope`:
+  - `per_k_tile`: at every K-reduction step.
+  - `per_output_tile`: once per output tile.
+  - `once`: once per kernel.
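
To make the three scopes concrete, the sketch below counts how often an epilogue op would run for a GEMM head split into `m_tiles × n_tiles` output tiles with `k_tiles` reduction steps each. That reading of "output tile" and "K-reduction step" is an assumption made for illustration, and the function name is hypothetical.

```python
def epilogue_runs(scope: str, m_tiles: int, n_tiles: int, k_tiles: int) -> int:
    """Number of times an epilogue op executes under the given scope."""
    out_tiles = m_tiles * n_tiles  # assumed: one output tile per (M, N) tile
    if scope == "per_k_tile":
        return out_tiles * k_tiles  # every K-reduction step of every output tile
    if scope == "per_output_tile":
        return out_tiles
    if scope == "once":
        return 1
    raise ValueError(f"unknown scope: {scope}")
```

For a 2 × 3 output grid with 4 K-steps, this gives 24 runs for `per_k_tile`, 6 for `per_output_tile`, and 1 for `once`.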
+
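+Cross-engine chains (e.g., GEMM head → MATH epilogue) come naturally:
+each stage is dispatched through token self-routing (D6), so GEMM and MATH
+participate serially within the same composite even though they share the
+same compute slot (D4).
+
+An empty `ops` form is the legacy single-op path.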
+
+### D4. Engine resource model
+
+**DMA engines**:
+
+- `DMA_READ`: `simpy.Resource(capacity=1)`.
+- `DMA_WRITE`: `simpy.Resource(capacity=1)`.
+- The two channels run concurrently (READ ∥ WRITE allowed).
+- Within a channel, requests are serialized (no READ ∥ READ; same for WRITE).
+- `vc_comm` is an orthogonal channel for IPCQ traffic, defined in
+  ADR-0023 D8 and outside the scope of this ADR.
+
+**Compute engines**:
+
+- `accel_slot`: a `simpy.Resource(capacity=1)` shared by `PE_GEMM` and `PE_MATH`.
+- At most one compute op executes at a time within a PE.
+- Multi-op composite chains (D3.3) run their compute stages serially through
+  this slot; token self-routing (D6) guarantees that the next stage starts
+  only after the previous compute has released the slot.
+
+**Engine completion**: each engine emits a completion event, which the
+scheduler / `PipelineContext` (D6) consumes.
+
+### D5. Dataflow
+
+**Input path (HBM source)**:
+
+```text
+HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
+PE_TCM → PE_FETCH_STORE → Register File
+Register File → PE_GEMM | PE_MATH
+```
+
+**Input path (shared SRAM source)**:
+
+```text
+Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
+PE_TCM → PE_FETCH_STORE → Register File
+```
+
+**Output path (HBM destination)**:
+
+```text
+Register File → PE_FETCH_STORE → PE_TCM
+PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM
+```
+
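+GEMM/MATH do not access `PE_TCM` directly: `PE_FETCH_STORE` is the only
+gateway between TCM and the register file. This makes TCM BW contention
+explicit and lets the fetch-unit policy (e.g., prefetch) be swapped
+independently of the compute engines.
+
+#### D5.1 PE_TCM partitioning
+
+`PE_TCM` is split into two logical regions:
+
+**SchedulerReservedTCM**
+
+- Owned exclusively by `PE_SCHEDULER`.
+- Holds the tile buffers of composite commands.
+- `PE_SCHEDULER` partitions this region, allocates a buffer for each
+  DMA_READ / COMPUTE / DMA_WRITE stage, guarantees input/output separation,
+  and manages tile-buffer lifetimes.
+
+**AllocatableTCM**
+
+- General-purpose region managed by `PEMemAllocator`.
+- Used for host / DP-visible allocations.
+
+**Visibility rule (strong isolation)**: `PEMemAllocator` must not see or
+allocate inside `SchedulerReservedTCM`. The reserved region is excluded from
+the allocator-managed range at configuration time.
+
+**Tile buffer rules**:
+
+- While a tile is live, its input buffer and output buffer inside
+  `SchedulerReservedTCM` must not overlap.
+- A tile buffer remains valid until the corresponding `DMA_WRITE` completes.
+- A buffer may be reused only after the lifetime of the consuming tile ends.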
+
+### D6. TileToken self-routing pipeline
+
+Stage-to-stage progress within a composite happens **without** going through
+the scheduler. Each component uses the token's `plan` to hand the token
+directly to the component of the next stage:
+
+```text
+Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
+              ↑ chaining: no scheduler hop                          ↑
+                                                  PipelineContext.complete_tile()
+```
+
+This mirrors the done-wire chain of real HW. The scheduler is responsible
+only for **initial dispatch + completion aggregation**.
+
+#### TilePlan / Stage
+
+```python
+class StageType(Enum):
+    DMA_READ = 0
+    FETCH = 1
+    GEMM = 2
+    MATH = 3
+    STORE = 4
+    DMA_WRITE = 5
+
+@dataclass(frozen=True)
+class Stage:
+    stage_type: StageType
+    component: str         # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
+    params: dict           # stage-specific parameters
+
+@dataclass(frozen=True)
+class TilePlan:
+    tile_id: int
+    stages: tuple[Stage, ...]
+```
+
+#### TileToken
+
+```python
+@dataclass
+class TileToken:
+    tile_id: int
+    pipeline_ctx: PipelineContext
+    plan: TilePlan
+    stage_idx: int
+    params: dict             # cached current stage params
+    data_op: bool = True     # op_log opt-in (ADR-0020 D4)
+```
+
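+Single-owner invariant: at any point in time a token is owned by exactly
+one component. Lifecycle: the scheduler creates it with `stage_idx=0` →
+component `_process()` → `stage_idx` incremented → put on the next stage's
+`in_port` → the last stage calls `pipeline_ctx.complete_tile()`.
+
+#### PipelineContext (exactly-once completion)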
+
+```python
+@dataclass
+class PipelineContext:
+    id: str
+    total_tiles: int
+    completed_tiles: int = 0
+    done_event: simpy.Event = None
+
+    def complete_tile(self) -> None:
+        self.completed_tiles += 1
+        if self.completed_tiles == self.total_tiles:
+            self.done_event.succeed()
+```
+
+The last stage of each tile must call `complete_tile()` exactly once.
+A duplicate call is a bug (a SimPy `Event` can succeed at most
+once).
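
One way to surface a duplicate completion early is to track which tiles have already reported. The class below is a hypothetical variant, not the ADR's `PipelineContext`: it takes a `tile_id` argument and uses a plain callback in place of a SimPy event so that it runs with the standard library only.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class GuardedPipelineContext:
    total_tiles: int
    on_done: Callable[[], None]  # stand-in for done_event.succeed()
    _seen: Set[int] = field(default_factory=set)

    def complete_tile(self, tile_id: int) -> None:
        # Reject a second completion for the same tile instead of
        # silently over-counting.
        if tile_id in self._seen:
            raise RuntimeError(f"tile {tile_id} completed twice")
        self._seen.add(tile_id)
        if len(self._seen) == self.total_tiles:
            self.on_done()
```

With two tiles, `on_done` fires after the second distinct `complete_tile()` call, and a repeated call for either tile raises `RuntimeError`.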
+
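+#### Feed order
+
+`PE_SCHEDULER` has exactly one `_feed_loop` process, which consumes the
+`_pending_feeds` FIFO. Composite commands are enqueued in submission order,
+and one command's tile feed runs to completion before the next command's
+feed starts. **Tile-feed interleaving across commands is not allowed.**
+
+Within the tiles of a single command, downstream pipeline overlap arises
+naturally: while earlier tiles progress through later stages, the feeder
+keeps pushing the remaining tiles into the first-stage queue (SimPy Store
+backpressure governs flow control). When the first-stage queue is full, only
+the feeder blocks; the scheduler worker keeps processing its inbox.
+
+#### Token routing pattern (base class)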
+
+```python
+def _pipeline_worker(self, env):
+    while True:
+        token = yield self._inbox.get()
+        yield from self._process(env, token)       # stage-specific logic
+        next_idx = token.stage_idx + 1
+        if next_idx < len(token.plan.stages):
+            next_stage = token.plan.stages[next_idx]
+            token.stage_idx = next_idx
+            token.params = next_stage.params
+            yield self.out_ports[next_stage.component].put(token)
+        else:
+            token.pipeline_ctx.complete_tile()
+```
+
+Each component implements only `_process()`; chaining lives in the base class.
+
+### D7. Observability and trace contract
+
+The simulator emits deterministic trace events:
+
+- `command_submitted`
+- `sub_command_dispatched`
+- `engine_start`
+- `engine_complete`
+- `tile_ready`
+- `command_complete`
+
+Trace ordering must be deterministic for identical inputs.
+
+### D8. Topology representation
+
+PE-internal components are declared in `cube.pe_template`:
+
+```yaml
+pe_template:
+  components:
+    pe_cpu:         { kind: pe_cpu,         impl: builtin.pe_cpu,         attrs: { overhead_ns: ... } }
+    pe_scheduler:   { kind: pe_scheduler,   impl: builtin.pe_scheduler,   attrs: { overhead_ns: ... } }
+    pe_dma:         { kind: pe_dma,         impl: builtin.pe_dma,         attrs: { rd_engines: 1, wr_engines: 1 } }
+    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
+    pe_gemm:        { kind: pe_gemm,        impl: builtin.pe_gemm,        attrs: { shared_resource: accel_slot, ... } }
+    pe_math:        { kind: pe_math,        impl: builtin.pe_math,        attrs: { shared_resource: accel_slot, ... } }
+    pe_tcm:         { kind: pe_tcm,         impl: builtin.pe_tcm,         attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
+    pe_mmu:         { kind: pe_mmu,         impl: builtin.pe_mmu,         attrs: { ... } }   # ADR-0011 D-VA
+    pe_ipcq:        { kind: pe_ipcq,        impl: builtin.pe_ipcq,        attrs: { ... } }   # ADR-0023
+  links:
+    # Scheduler dispatch edges (initial)
+    scheduler_to_dma_mm:         0.0
+    scheduler_to_fetch_store_mm: 0.0
+    scheduler_to_gemm_mm:        0.0
+    scheduler_to_math_mm:        0.0
+    # Pipeline chaining edges (token self-routing per D6)
+    dma_to_fetch_store_mm:       0.0
+    fetch_store_to_gemm_mm:      0.0
+    fetch_store_to_math_mm:      0.0
+    gemm_to_fetch_store_mm:      0.0
+    gemm_to_math_mm:             0.0
+    math_to_fetch_store_mm:      0.0
+    fetch_store_to_dma_mm:       0.0
+    fetch_store_to_tcm_bw_gbs:   ...
+```
+
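+The template is instantiated once per PE. PE instances are derived from
+`cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔ cube
+NoC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).
+
+## Consequences
+
+### Positive
+
+- Each block is an independent topology node, individually replaceable
+  through DI (ADR-0015).
+- The PE-internal structure is visible in the topology graph.
+- Components do not know their downstream: plan-based routing provides
+  flexibility (e.g., an epilogue chain needs no scheduler change).
+- DMA and compute overlap naturally through SimPy Store backpressure.
+- Multi-op composites express fused operations (e.g., GEMM + bias_add)
+  without engine-level coupling.
+- TCM access contention is realistic: `PE_FETCH_STORE` is the only
+  TCM↔RF gateway.
+
+### Negative
+
+- More PE-internal components than a coarser model (7 base + 2 external
+  references), hence more topology nodes/edges.
+- PE-internal token passing shows up explicitly in traces (an acceptable
+  trade-off for HW fidelity).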
+## Links
+
+- ADR-0011 D-VA (PE_MMU component, VA translation)
+- ADR-0015 D4 (component port/wire model)
+- ADR-0020 (greenlet kernel execution / two-pass)
+- ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
+- SPEC R3, R4
@@ -0,0 +1,202 @@
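+# ADR-0015: Component Port/Wire Model and Fabric Routing
+
+## Status
+
+Accepted
+
+## Context
+
+Realistic hardware modeling (queues, contention, fan-out) requires that
+components own fabric traversal while the simulation engine handles only
+initialization and completion observation. Direct method calls between
+components, or path finding inside the engine, defeat queueing and
+contention semantics.
+
+This ADR defines:
+
+- how components communicate through typed port queues,
+- how propagation delay is modeled (wire processes, including BW occupancy),
+- the fabric paths for Memory R/W (bypassing M_CPU) and Kernel Launch (through M_CPU),
+- the reduced role of the engine (wire initialization + completion observation only),
+- M_CPU.DMA as an internal subcomponent of M_CPU.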
+
+---
+
+## Decision
+
+### D1. Component port model
+
+Each component has typed input/output ports modeled as SimPy Stores:
+
+```text
+in_ports:  dict[str, simpy.Store]   # keyed by source node_id
+out_ports: dict[str, simpy.Store]   # keyed by destination node_id
+```
+
+Ports are created at engine initialization from the graph edges.
+Each directed edge (src → dst) creates:
+
+- `src.out_ports[dst]`: the sending side
+- `dst.in_ports[src]`: the receiving side
+
+---
+
+### D2. Wire process (propagation delay + BW occupancy)
+
+For each directed edge (src, dst) of the topology graph, a SimPy wire process
+models propagation delay and BW occupancy:
+
+```python
+def wire_process(env, out_port, in_port, delay_ns, bw_gbs):
+    available_at = 0.0
+    while True:
+        cmd = yield out_port.get()
+        if bw_gbs > 0:
+            nbytes = getattr(cmd, "nbytes", 0)
+            if nbytes > 0:
+                wait = available_at - env.now
+                if wait > 0:
+                    yield env.timeout(wait)
+                available_at = env.now + (nbytes / bw_gbs)
+        yield env.timeout(delay_ns)
+        yield in_port.put(cmd)
+```
+
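+Wire processes are started at engine initialization.
+Each directed edge keeps an `available_at` timestamp that tracks when the
+link becomes free for the next transaction. While one transaction occupies
+the link, the next transaction on the same directed link must wait until the
+occupancy is released (back-to-back serialization). TX and RX directions are
+independent (separate wire processes, each with its own `available_at`
+state).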
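
The `available_at` bookkeeping can be followed without SimPy by replaying the same steps over a list of transactions. The function below is a sketch that mirrors the `wire_process` listing step by step, under two assumptions: `in_port.put()` never blocks, and transactions are taken in FIFO order. It relies on 1 GB/s being 1 byte/ns, so `nbytes / bw_gbs` is already in nanoseconds.

```python
from typing import Iterable, List, Tuple


def wire_delivery_times(
    cmds: Iterable[Tuple[float, int]],  # (arrival_ns at out_port, nbytes)
    delay_ns: float,
    bw_gbs: float,
) -> List[float]:
    """Delivery time of each transaction on one directed link."""
    available_at = 0.0  # when the link is free for the next transaction
    ready_at = 0.0      # when the single wire process can take the next cmd
    delivered: List[float] = []
    for arrival_ns, nbytes in cmds:
        now = max(ready_at, arrival_ns)           # out_port.get() returns
        if bw_gbs > 0 and nbytes > 0:
            now = max(now, available_at)          # wait for the link to free up
            available_at = now + nbytes / bw_gbs  # occupy the link
        now += delay_ns                           # propagation delay
        delivered.append(now)
        ready_at = now                            # put() assumed non-blocking
    return delivered
```

Two 10-byte transactions arriving at t=0 on a 1 GB/s link with a 2 ns delay are delivered at 2 ns and 12 ns: the second waits for the first transaction's 10 ns occupancy before its own propagation delay starts.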
+
+---
+
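+### D3. Engine role (reduced)
+
+The simulation engine must:
+
+- wire up components at initialization (create port Stores, start wire processes),
+- identify the entry component for each request type (PCIE_EP),
+- put the request on the entry component's in_port,
+- wait for the completion event.
+
+The simulation engine must not:
+
+- search topology paths while a request is executing,
+- call component `run()` methods directly,
+- track per-hop latency or decompose fan-out.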
+
+---
+
+### D4. Fabric paths for Memory R/W and Kernel Launch
+
+Memory R/W and Kernel Launch use **different** fabric paths.
+Memory operations bypass M_CPU and are routed directly to HBM through the crossbar.
+Kernel Launch goes through M_CPU for PE fan-out.
+
+**Memory R/W forward path (pcie_ep → hbm_ctrl, bypassing M_CPU):**
+
+```text
+pcie_ep → io_noc → io_ucie
+  → [transit cubes: ucie_in → noc → ucie_out]  (zero or more)
+  → target cube: ucie_in → router mesh → hbm_ctrl
+```
+
+**Memory R/W completion path:**
+
+```text
+hbm_ctrl → router mesh → [transit cubes: ucie → router mesh → ucie]
+  → io_ucie → io_noc → pcie_ep
+```
+
+**Kernel Launch forward path (pcie_ep → io_cpu → M_CPU → PE):**
+
+```text
+pcie_ep → io_noc → io_cpu → io_noc → io_ucie
+  → [transit cubes: ucie_in → noc → ucie_out]  (zero or more)
+  → target cube: ucie_in → noc → M_CPU → PE[0..n] (parallel fan-out)
+```
+
+**Kernel Launch completion path:**
+
+```text
+PE[0..n] all complete → M_CPU (aggregation)
+  → noc → [transit cubes: ucie → noc → ucie]
+  → io_ucie → io_noc → io_cpu → io_noc → pcie_ep
+```
+
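+**Rationale for Memory R/W bypassing M_CPU:**
+
+Memory write/read operations need no command interpretation or PE dispatch;
+they are direct data transfers to/from HBM. Going through M_CPU would add
+unnecessary overhead (5ns) with no functional benefit. The io_noc inside the
+IO chiplet handles the routing decision: memory operations go directly to
+the cube fabric, while kernel launches are delivered to io_cpu first.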
+
+---
+
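+### D5. M_CPU.DMA is an internal subcomponent of M_CPU
+
+M_CPU.DMA is not a separate topology node.
+It is an internal subcomponent owned by the M_CPU component implementation.
+
+M_CPU.DMA:
+
+- owns the DMA READ and DMA WRITE queues (capacity=1 each, ADR-0014 D4),
+- issues memory requests to hbm_ctrl over the NoC,
+- receives completions from hbm_ctrl over the NoC,
+- reports completions to M_CPU,
+- is created and managed inside M_CPU's `__init__` and `run()`.
+
+M_CPU.DMA does not appear as a node in the compiled topology graph.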
+
+---
+
+### D6. Transit cube forwarding
+
+A cube that is not the target of a memory or kernel request acts as a transit node.
+A transit cube forwards the request without consuming it:
+
+```text
+ucie_in (from upstream) → noc → ucie_out (to downstream)
+```
+
+Transit forwarding is implemented entirely inside the ucie_in component.
+The noc and ucie_out components of a transit cube forward packets unmodified.
+
+---
+
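+### D7. _formula_latency is retained for lower-bound cross-checking
+
+The path-based formula latency function (`_formula_latency`) is retained in
+the engine as a lower bound for correctness verification.
+
+Invariants:
+
+- Phase 0: `_formula_latency == component model total_ns`
+- Phase 1+: `_formula_latency <= component model total_ns` (contention
+  adds queueing)
+
+The function is independent of the port/wire model and requires only the
+topology graph. It is used for shard comparison in `_route_kernel` and as a
+regression guard.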
+
+---
+
+## Consequences
+
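+- Components model realistic hardware behavior (queues, contention, fan-out).
+- Propagation delay is modeled accurately per edge.
+- The engine is decoupled from routing policy.
+- Component implementations remain replaceable through DI (ADR-0007 D3).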
+
+---
+
+## Links
+
+- ADR-0007 D2 (engine role boundary)
+- ADR-0009 D3 (kernel execution fan-out hierarchy)
+- ADR-0014 D4 (DMA engine capacity=1)
+- ADR-0012 D1 (Host ↔ IO_CPU message schema; M_CPU.DMA is
+  component-internal)
+- ADR-0016 (IOChiplet NoC and memory data path)
+- ADR-0017 (cube NoC 2D mesh architecture)
+- ADR-0033 (latency model assumptions built on these mechanisms)
@@ -0,0 +1,99 @@
+# ADR-0016: IOChiplet NoC and Memory Data Path
+
+## Status
+
+Accepted
+
+## Context
+
+ADR-0003 D2 defines the IO chiplet as a SIP-level component that provides the
+PCIe-EP and IO_CPU interfaces, but does not specify routing inside the IO
+chiplet. ADR-0015 D4 was updated to document the M_CPU bypass for Memory
+R/W, but the IO chiplet's internal NoC architecture that enables this
+routing has not been formally documented.
+
+The IO chiplet needs an internal routing fabric (io_noc) to:
+
+- connect pcie_ep, io_cpu, and the per-cube UCIe PHY ports
+- route memory operations (MemoryWrite/Read) directly to the cube fabric
+  without passing through io_cpu
+- route kernel launch commands through io_cpu for command interpretation
+
+## Decision
+
+### D1. IOChiplet internal NoC (io_noc)
+
+Each IO chiplet instance contains an internal NoC node (`io_noc`) that
+connects:
+
+- `pcie_ep`: host-facing PCIe endpoint
+- `io_cpu`: command processor for kernel launch interpretation
+- `io_ucie-{PHY}.conn{N}`: per-PHY connection nodes toward the cube UCIe ports
+
+io_noc is a zero-overhead, forwarding-only fabric (`forwarding_v1`
+implementation). All routing decisions are made by the simulation engine
+based on message type, not by io_noc itself.
+
+### D2. IOChiplet UCIe decomposition
+
+Each IO chiplet PHY port is decomposed into:
+
+- `io_ucie-{PHY}`: UCIe protocol endpoint (overhead = 8ns)
+- `io_ucie-{PHY}.conn{N}`: N connection nodes between io_noc and io_ucie
+
+This mirrors the cube-side UCIe decomposition (ADR-0015 D1) and allows
+multiple independent NoC-UCIe connections per PHY.
+
+### D3. Memory R/W path (M_CPU bypass)
+
+Memory operations (MemoryWrite, MemoryRead) are routed from pcie_ep through
+io_noc directly to the target cube, bypassing io_cpu entirely:
+
+```text
+pcie_ep → io_noc → conn → io_ucie → [cube UCIe] → router mesh → hbm_ctrl
+```
+
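+This avoids the 10ns io_cpu overhead for pure data transfers.
+The simulation engine's `_process_memory_direct()` method uses
+`find_memory_path()`, which resolves the shortest path from pcie_ep to the
+target HBM node.
+
+### D4. Kernel launch path (through io_cpu)
+
+Kernel launch commands need io_cpu for command interpretation and PE fan-out
+setup: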
+
+```text
+pcie_ep → io_noc → io_cpu → io_noc → conn → io_ucie → [cube UCIe]
+  → noc → m_cpu → PE
+```
+
+The engine's `_entry_points()` method routes KernelLaunchMsg through both
+pcie_ep (entry) and io_cpu (command processing).
+
+### D5. IOChiplet-to-cube port mapping
+
+Each IO chiplet instance declares the cube ports it connects to:
+
+```yaml
+cube_ports:
+  - { cube: {xy: [0,0]}, cube_side: N, phy: P0, distance_mm: 2.0 }
+  - { cube: {xy: [1,0]}, cube_side: N, phy: P1, distance_mm: 2.0 }
+```
+
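+The topology builder creates edges from the io_ucie PHY nodes to the
+corresponding cube UCIe port nodes, using the specified distance and the IO
+chiplet's `per_connection_bw_gbs` as the link bandwidth.
+
+## Consequences
+
+- The IO chiplet has a well-defined internal routing fabric
+- Memory operations avoid unnecessary io_cpu overhead
+- Kernel launch commands still receive proper command interpretation
+- The io_noc pattern is consistent with the cube-level NoC design
+- ADR-0003 D2 is extended (not contradicted) by this ADR
+
+## Links
+
+- ADR-0003 D2 (IO chiplet definition)
+- ADR-0015 D4 (fabric paths for Memory R/W and kernel launch)
+- ADR-0012 D1 (Host-IO_CPU message schema)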
@@ -0,0 +1,282 @@
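+# ADR-0017: Cube NoC and HBM Connectivity
+
+## Status
+
+Accepted
+
+## Context
+
+The CUBE-level NoC is a 2D router mesh that carries all intra-cube requests:
+PE-HBM data, PE-PE traffic, the command path (M_CPU↔PE_CPU), shared SRAM
+access, and inter-cube UCIe traffic.
+
+The CUBE's HBM is exposed through per-PE controller endpoints attached to
+the PE routers. This per-PE partitioning makes local vs. remote HBM
+distinguishable by mesh distance: a PE's own HBM partition sits on its own
+router (switching overhead only), while another PE's HBM partition is
+reachable through mesh hops to that PE's router.
+
+The design space supports two channel mapping modes:
+
+- **n:1 (default, implemented)**: each PE's HBM partition aggregates
+  `channels_per_pe` pseudo-channels into one endpoint. Effective per-PE BW =
+  N × per-channel BW.
+- **1:1 (future)**: each PE router is decomposed into per-channel mini
+  routers; per-channel BW contention is modeled directly.
+
+In both modes the effective per-PE BW is the same; only the connection
+granularity differs.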
+
+## Decision
+
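+### D1. 2D router mesh
+
+Each cube contains a 2D router mesh generated by `mesh_gen.py`.
+
+- Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`).
+- Implementation: `forwarding_v1`. NoC `overhead_ns = 0`.
+- Default 6×6 grid (sized from the PE corner placement plus the number of
+  UCIe attachments); larger PE counts expand the grid.
+- HBM exclusion zone: the center rows/columns physically occupied by the HBM
+  die are excluded (e.g., r2c2, r2c3, r3c2, r3c3 for 6×6).
+- Latency = Manhattan distance × `ns_per_mm`.
+
+### D2. XY routing algorithm
+
+Deterministic XY routing:
+
+1. Horizontal segment: route from the source X to the destination X along
+   the source Y.
+2. Vertical segment: route from the source Y to the destination Y along
+   the destination X.
+
+Each directed segment carries a unique key:
+
+- Horizontal: `("H", y_band, x_min, x_max, direction)`
+- Vertical:   `("V", x_band, y_min, y_max, direction)`
+
+Grid positions are snapped to the router grid, excluding the HBM zone.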
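
A minimal sketch of how the two XY segments and their keys could be derived from grid coordinates. The direction labels ("E"/"W"/"N"/"S", with y increasing southward) and the function names are assumptions for illustration; snapping around the HBM exclusion zone is deliberately left out.

```python
from typing import List, Tuple

SegmentKey = Tuple[str, int, int, int, str]


def xy_segments(src: Tuple[int, int], dst: Tuple[int, int]) -> List[SegmentKey]:
    """Return the directed XY segment keys for a route from src to dst."""
    (sx, sy), (dx, dy) = src, dst
    segments: List[SegmentKey] = []
    if sx != dx:
        # Horizontal segment first, travelled along the source row (y band).
        segments.append(("H", sy, min(sx, dx), max(sx, dx), "E" if dx > sx else "W"))
    if sy != dy:
        # Then the vertical segment, travelled along the destination column.
        segments.append(("V", dx, min(sy, dy), max(sy, dy), "S" if dy > sy else "N"))
    return segments


def manhattan(src: Tuple[int, int], dst: Tuple[int, int]) -> int:
    """Hop distance in grid units; multiply by pitch and ns_per_mm for latency."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])
```

Two routes whose key lists share an entry would contend for the same directed segment; routes with disjoint keys do not interact in this model.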
+
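+### D3. Per-segment contention model
+
+Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
+that share a segment (same row or column band, same direction) contend for
+the resource, which models link-level serialization in a wormhole-routed mesh.
+
+Without contention, NoC traversal latency is Manhattan distance × `ns_per_mm`.
+With contention, SimPy's resource scheduling adds queueing delay.
+
+### D4. NoC attachment points (per-PE HBM partition)
+
+Every PE router has three attachments: `pe{idx}.dma`, `pe{idx}.cpu`, and
+`pe{idx}.hbm`. The last is the per-PE HBM controller endpoint,
+`sip{S}.cube{C}.hbm_ctrl.pe{idx}`, which owns one slice of the cube HBM
+(one pseudo-channel group; see D8).
+
+Other attachments:
+
+- M_CPU and the shared SRAM each occupy a dedicated edge router.
+- Each UCIe endpoint (N/S/E/W) exposes 4 connection routers distributed
+  along its side (see D6).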
+
+```text
+                    UCIe-N (conn x4)
+                         |
+           +---------+---+---+---------+
+           |         |       |         |
+PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
+PE0.cpu <--+ +hbm.pe0|       | +hbm.pe2+--< PE2.cpu
+           |         |       |         |
+UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
+(conn x4)  |         | zone  |         |  (conn x4)
+           |  r2c0   |       |         |
+M_CPU <--->+         |       |         |
+           |  r3c0   |       |         |
+SRAM <---->+         |       |         |
+           |         |       |         |
+PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
+PE4.cpu <--+ +hbm.pe4|       | +hbm.pe6+--< PE6.cpu
+           |         |       |         |
+           +---------+---+---+---------+
+                         |
+                    UCIe-S (conn x4)
+```
+
+Per-PE HBM partitioning is the key invariant that makes local vs. cross-PE
+HBM distinguishable by mesh distance (see D7).
+
+### D5. NoC edge bandwidth and distance
+
+| Connection                    | BW (GB/s)  | Distance      | Notes                                       |
+| ----------------------------- | ---------- | ------------- | ------------------------------------------- |
+| PE_DMA → NOC                  | 256.0      | Physical (PE) | Matches the local-HBM aggregate BW          |
+| NOC → PE_CPU                  | —          | 0.0 mm        | Command path only                           |
+| Router ↔ hbm_ctrl.pe{idx}     | 256.0      | 0.0 mm        | Per PE router; N × per-channel BW (see D8)  |
+| NOC ↔ M_CPU                   | —          | 0.0 mm        | Command path                                |
+| NOC ↔ SRAM                    | 128.0 × 4  | 0.0 mm        | 512 GB/s aggregate                          |
+| NOC ↔ UCIe conn               | 128.0      | 0.0 mm        | Per connection; 4 conns per port            |
+
+The `0.0 mm` distances reflect the distributed nature of the NoC; the actual
+traversal distance is computed as the Manhattan distance within the router
+grid.
+
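+### D6. UCIe decomposition and inter-cube traffic
+
+Each of the 4 UCIe ports (N, S, E, W) is decomposed into:
+
+- 1 `ucie-{PORT}` node: the UCIe protocol endpoint (`overhead = 8.0 ns`).
+- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between the NoC and UCIe.
+
+This decomposition yields 4 independent NoC↔UCIe connections per port, each
+with 128 GB/s of bandwidth (512 GB/s aggregate per port).
+
+Inter-cube traffic path: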
+
+```text
+Source: PE_DMA → NOC → conn{i} → ucie-{PORT}
+                  [UCIe link: 512 GB/s, 1.0mm seam distance]
+Target: ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx}
+```
+
+The UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full
+crossing costs 16 ns (TX port + RX port).
+
+### D7. Data paths through the NoC
+
+All intra-cube traffic uses the same router mesh; there is no separate fast
+path.
+
+**Local HBM** (the PE's own partition; 0 mesh hops):
+
+```text
+PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx}   (switching overhead only)
+```
+
+**Intra-cube cross-PE HBM** (the target PE's partition, reached over the mesh):
+
+```text
+PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'}
+```
+
+Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 is on `r1c4`):
+
+```text
+PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2
+```
+
+Dijkstra computes the shortest path within the mesh.
+
+**Inter-cube HBM** (UCIe crossing):
+
+```text
+PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn
+       → r{x'}c{y'} → hbm_ctrl.pe{idx'}
+```
+
+**Kernel launch command to a PE**:
+
+```text
+[from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU
+```
+
+**Shared SRAM access**:
+
+```text
+PE_DMA → r{x}c{y} → (mesh) → SRAM
+```
+
+### D8. HBM channel mapping modes
+
+Channel mapping is configured at cube scope:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one       # one_to_one | n_to_one
+    hbm_pseudo_channels: 64          # total pseudo-channel count
+    hbm_channels_per_pe: 8           # per-PE local channel count
+    hbm_channel_bw_gbs: 32.0         # per-channel bandwidth (GB/s)
+    hbm_slices_per_cube: 8           # number of per-PE partitions
+    hbm_total_gb_per_cube: 48
+```
+
+**n:1 mode (default, implemented).** Each PE's HBM partition is a single
+endpoint `hbm_ctrl.pe{idx}` that aggregates `channels_per_pe`
+pseudo-channels. The `Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals
+`channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to be
+interleaved, and only the per-PE aggregate BW is modeled. There is no
+separate aggregation router node; the per-PE router itself plays that role.
+
+**1:1 mode (future).** Each PE router is decomposed into N channel mini
+routers; per-channel routing carries a fully resolved PA + channel ID.
+A `ChannelSplitter` resolves a logical access into N per-channel physical
+requests. Per-channel links model BW contention. Cross-PE channel access
+semantics are deferred to the implementation ADR.
+
+**BW calculation (default values).**
+
+| Parameter                          | Value                      |
+| ---------------------------------- | -------------------------- |
+| Pseudo-channels per cube           | 64 (parameter)             |
+| PEs per cube                       | 8 (parameter)              |
+| Channels per PE (N)                | 64 / 8 = 8                 |
+| BW per channel                     | 32 GB/s (parameter)        |
+| Local BW per PE                    | N × 32 = 256 GB/s          |
+| Total cube HBM BW                  | 64 × 32 = 2048 GB/s        |
+
+Both modes have the same effective per-PE BW; only the request shape and the
+contention model differ.
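
The table above can be reproduced with a few lines of arithmetic. This is a sketch only: `hbm_bandwidth` is a hypothetical helper, and the divisibility check is an added assumption about how an invalid configuration might be rejected.

```python
from typing import Tuple


def hbm_bandwidth(pseudo_channels: int, pes_per_cube: int,
                  channel_bw_gbs: float) -> Tuple[int, float, float]:
    """Return (channels_per_pe, per_pe_bw_gbs, cube_bw_gbs) for n:1 mapping."""
    if pseudo_channels % pes_per_cube != 0:
        # Assumption: every PE owns the same number of pseudo-channels.
        raise ValueError("pseudo_channels must divide evenly across PEs")
    channels_per_pe = pseudo_channels // pes_per_cube
    per_pe_bw = channels_per_pe * channel_bw_gbs
    cube_bw = pseudo_channels * channel_bw_gbs
    return channels_per_pe, per_pe_bw, cube_bw


# Default values from the configuration above.
print(hbm_bandwidth(64, 8, 32.0))
```

With the defaults this yields 8 channels per PE, 256 GB/s per PE, and 2048 GB/s per cube, matching the `Router ↔ hbm_ctrl.pe{idx}` entry in D5.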
+
+### D9. AddressResolver: per-PE HBM endpoints
+
+The address resolver decodes the HBM offset of a PA into the owning PE's partition:
+
+```python
+# policy/routing/router.py
+hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube
+
+if addr.kind == "hbm":
+    pe_id = int(addr.hbm_offset) // hbm_slice_bytes
+    return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
+```
+
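+The pe_id computation is an intrinsic part of the routing layer (not a
+topology-time concern). Every HBM PA belongs to exactly one partition, so
+routing is guaranteed to be deterministic.
+
+External callers (e.g. M_CPU DMA, Memory R/W from PCIE_EP) follow the same
+resolver path; there is no separate fast path.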
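
A self-contained sketch of the decode step, useful for checking partition boundaries by hand. The function name `resolve_hbm_endpoint`, its signature, and the range check are assumptions for illustration; the real resolver in `policy/routing/router.py` works on an address object rather than a bare offset.

```python
GIB = 1 << 30


def resolve_hbm_endpoint(hbm_offset: int, sip: int, cube: int,
                         hbm_total_gb_per_cube: int = 48,
                         hbm_slices_per_cube: int = 8) -> str:
    """Map an HBM offset to the endpoint of the PE partition that owns it."""
    slice_bytes = hbm_total_gb_per_cube * GIB // hbm_slices_per_cube
    total_bytes = slice_bytes * hbm_slices_per_cube
    if not 0 <= hbm_offset < total_bytes:
        # Assumption: out-of-range offsets are rejected instead of wrapping.
        raise ValueError(f"HBM offset {hbm_offset:#x} outside cube HBM")
    pe_id = hbm_offset // slice_bytes
    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"


# With the defaults each partition is 6 GiB wide.
print(resolve_hbm_endpoint(6 * GIB, sip=0, cube=0))
```

Because the mapping is a pure integer division, every offset lands in exactly one partition, which is the determinism property stated above.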
+
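+### D10. Mesh generation parameters
+
+`mesh_gen.py` generates `cube_mesh.yaml` from:
+
+- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and the PE count per corner.
+- `cube.geometry`: the cube's physical dimensions and the HBM zone.
+- `cube.ucie.n_connections`: determines the number of routers for UCIe attachment.
+
+The output `mesh_data` dictionary contains:
+
+- The router grid with positions and the HBM exclusion zone.
+- PE-router attachments (per-PE `pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm`).
+- UCIe-router attachments (N/S/E/W distributed across edge routers).
+- M_CPU and SRAM router attachments.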
+
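+## Consequences
+
+- Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM (mesh
+  hops) are naturally distinguished, satisfying SPEC R5 (multi-domain
+  communication) and ADR-0002 (no end-to-end zero-latency paths).
+- All intra-cube traffic is routed through one mesh: a single contention
+  model, a single layout, a single set of edge BWs.
+- Per-PE HBM partitioning maps cleanly onto the LA model (ADR-0011): each
+  PE's partition is an n:1 aggregation of its assigned pseudo-channels.
+- The 1:1 mode extension is structurally natural: decompose each PE router
+  into N channel routers.
+- Mesh generation is fully parameterized by `topology.yaml`; PE/cube geometry
+  changes propagate without code changes.
+
+## Links
+
+- ADR-0002 (routing distance, ordering, no zero-latency paths)
+- ADR-0003 D3 (cube-level NoC definition, extended by this ADR)
+- ADR-0004 (memory semantics, local HBM)
+- ADR-0011 (memory addressing: the LA model consumes per-PE partitions)
+- ADR-0014 D1 (PE_DMA egress through the router mesh)
+- ADR-0015 D4 (fabric path for Memory R/W and Kernel Launch)
+- ADR-0016 (IOChiplet io_noc: a similar pattern at the IO chiplet level)
+- ADR-0033 (latency model: per-PC parallelism, switch penalty)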
@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Accepted

 ## Context

@@ -16,21 +16,6 @@ Proposed
 2. PE_GEMM and PE_MATH must perform real matrix operations, and the results must be verifiable
 3. Simulation performance degradation must be minimized

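-### Limitations of the existing kernel execution structure
-
-Kernel execution is currently split into 3 phases:
-
-```
-Phase 0: run the kernel function in TLContext → build the PeCommand list (outside SimPy, no data)
-Phase 1: PE_CPU replays the PeCommand list in SimPy (timing only)
-```
-
-SimPy starts only after the kernel has **run to completion** in Phase 0.
-`tl.load()` returns a TensorHandle (placeholder), so the actual data is not accessible.
-Branching on data values (dynamic control flow) is therefore impossible.
-
-This ADR removes this limitation **for memory operations only** (see D1, D3).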
-
 ### Constraints
 
 - SimPy is a single-threaded event loop: running a numpy matmul inside it blocks everything
@@ -529,22 +514,3 @@ Per-dtype tolerance policy:
  (operations execute in Phase 2; result values are not yet determined in Phase 1).
  Branching on memory data is supported through greenlets.
 - Adds a dependency on the greenlet C extension (pip install greenlet)
-
---
-
-## Affected files
-
-| File | Change |
-|------|------|
-| `src/kernbench/components/base.py` | Add `_on_process_start/end` hooks |
-| `src/kernbench/common/pe_commands.py` | Add `data_op = True`, extend metadata fields |
-| `src/kernbench/sim_engine/op_log.py` | New: OpRecord, OpLogger |
-| `src/kernbench/sim_engine/data_executor.py` | New: DataExecutor, MemoryStore |
-| `src/kernbench/sim_engine/engine.py` | Inject op_logger (optional) |
-| `src/kernbench/triton_emu/tl_context.py` | `tl.load()` etc. call greenlet switch internally |
-| `src/kernbench/triton_emu/kernel_runner.py` | New: KernelRunner (greenlet ↔ SimPy bridge) |
-| `src/kernbench/components/builtin/pe_cpu.py` | Remove Phase 0, call KernelRunner instead |
-| `pyproject.toml` | Add greenlet dependency |
-
-Component implementation files (pe_gemm.py, pe_dma.py, hbm_ctrl.py, etc.): **no change**
-Benchmark kernels (benches/*.py): **no user API change**
@@ -0,0 +1,100 @@
+# ADR-0022: 2D grid program_id semantics
+
+## Status
+
+Accepted
+
+## Context
+
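+Triton kernels use `tl.program_id(axis)` to identify their position within
+the launch grid. This hardware has a two-level hierarchy: a **cube**
+contains **PEs**. The previous implementation ignored the `axis` parameter
+and always returned the flattened PE index, so a kernel could not tell its
+position within a cube apart from the cube identifier.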
+
+## Decision
+
+Map `tl.program_id` and `tl.num_programs` onto the 2D hardware grid:
+
+| Call | Returns | Description |
+|------|---------|-------------|
+| `tl.program_id(axis=0)` | `local_pe_id` | PE index within the cube |
+| `tl.program_id(axis=1)` | `cube_id` | Cube index |
+| `tl.num_programs(axis=0)` | `num_pes_per_cube` | Number of PEs per cube |
+| `tl.num_programs(axis=1)` | `num_cubes` | Total number of cubes |
+
+The global PID is derived as follows:
+
+```python
+global_pid = tl.program_id(axis=1) * tl.num_programs(axis=0) + tl.program_id(axis=0)
+```
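
The flattening and its inverse can be checked in plain Python without a `tl` context. The helper names below are hypothetical and exist only to illustrate the index arithmetic.

```python
from typing import Tuple


def to_global_pid(local_pe_id: int, cube_id: int, num_pes_per_cube: int) -> int:
    """Flatten (axis=0, axis=1) coordinates into a single global PID."""
    return cube_id * num_pes_per_cube + local_pe_id


def from_global_pid(global_pid: int, num_pes_per_cube: int) -> Tuple[int, int]:
    """Recover (local_pe_id, cube_id) from a global PID."""
    cube_id, local_pe_id = divmod(global_pid, num_pes_per_cube)
    return local_pe_id, cube_id


# PE 3 of cube 2, with 8 PEs per cube.
pid = to_global_pid(local_pe_id=3, cube_id=2, num_pes_per_cube=8)
print(pid, from_global_pid(pid, num_pes_per_cube=8))
```

Keeping the PE index as the fastest-varying term means consecutive global PIDs stay inside one cube until its PEs are exhausted.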
+
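+### Axis mapping rationale
+
+- **axis=0 = PE (innermost)**: PEs inside a cube share HBM and communicate
+  over the local NoC mesh. This is the fast, tightly coupled dimension,
+  analogous to threads within a block.
+- **axis=1 = cube (outer)**: inter-cube communication goes over the
+  higher-latency UCIe. This is the coarser scheduling dimension, analogous
+  to blocks within a grid.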
+## Implementation
+
+### TLContext (`triton_emu/tl_context.py`)
+
+Added `cube_id` and `num_cubes` constructor parameters. `program_id()` and
+`num_programs()` dispatch on `axis`:
+
+```python
+def program_id(self, axis: int = 0) -> int:
+    if axis == 1:
+        return self._cube_id
+    return self._pe_id
+
+def num_programs(self, axis: int = 0) -> int:
+    if axis == 1:
+        return self._num_cubes
+    return self._num_programs
+```
+
+### PE_CPU (`components/builtin/pe_cpu.py`)
+
+- Extracts `num_cubes` from `ctx.spec["system"]["sips"]["cubes_per_sip"]`.
+- Passes `cube_id` (already available as `self._cube_idx`) and `num_cubes`
+  to TLContext.
+
+### KernelRunner (`triton_emu/kernel_runner.py`)
+
+- Receives `num_cubes` from PE_CPU.
+- Passes `cube_id` and `num_cubes` to TLContext in greenlet mode.
+
+## Backward Compatibility
+
+- Existing code that uses `tl.program_id(0)` or `tl.program_id()` is
+  unaffected: it returns the same PE index as before.
+- `cube_id` and `num_cubes` default to `0` and `1`, so callers that do not
+  provide them (e.g. unit tests) keep working.
+
+## Usage Example
+
+```python
+def sharded_gemm_kernel(a_ptr, b_ptr, out_ptr, M, K, N, tl):
+    local_pid = tl.program_id(axis=0)      # PE within cube
+    cube_id   = tl.program_id(axis=1)      # which cube
+    global_pid = cube_id * tl.num_programs(axis=0) + local_pid
+
+    # Column-wise sharding across global PIDs
+    n_per_pid = N // (tl.num_programs(axis=1) * tl.num_programs(axis=0))
+    col_start = global_pid * n_per_pid
+
+    a = tl.load(a_ptr, shape=(M, K), dtype="f16")
+    b = tl.ref(b_ptr + col_start * K * 2, shape=(K, n_per_pid), dtype="f16")
+    h = tl.composite(op="gemm", a=a, b=b, out_ptr=out_ptr + col_start * M * 2)
+    tl.wait(h)
+```
+
+## Consequences
+
+- Benchmarks can express cube-aware sharding and addressing without
+  hardcoding topology dimensions.
+- If needed, axis=2 (SIP level) can be added later following the same pattern.
@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Accepted

 ## Context

@@ -17,14 +17,6 @@ Queue)를 통해** 일어난다.
 This is similar to a core-local communication queue. Host-level collectives (`dist.all_reduce`) are
 deferred as **future work**; this ADR focuses only on the kernel collective infrastructure.

-### Current state
-
-- ADR-0021 PE pipeline refactoring: the PE internals are split into components
-  (PE_CPU, PE_SCHEDULER, PE_DMA, PE_FETCH_STORE, PE_GEMM, PE_MATH, PE_TCM, PE_MMU)
-- No direct PE-to-PE communication channel. All data movement follows the PE_DMA → cube_noc/UCIe/PCIE → HBM path
-- Host CCL skeleton (no ADR, ad-hoc implementation): `dist.init_process_group(backend="ahbm")`;
-  `_run_ccl_bench` runs one greenlet per rank concurrently. Collectives are stubs.
-
 ### Problems to solve
 
 1. Direct data movement between PEs (write to the peer's memory)
@@ -365,23 +357,39 @@ data 경로의 piggyback 모델과 달리, credit return은 일반 vc_comm fabri
 Credit return bypasses that fabric and is handled on a **separate fast path**. This
 abstracts the credit-return fast path of NVLink/UCIe in real HW.
 
-**Latency calculation**: derived from the **bottleneck BW of the routed path**,
-not from a magic constant.
+**Latency calculation**: derived from the **full path latency of the routed
+path** (per-node overhead + edge propagation + drain), not from a magic
+constant.

 ```
 credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
-path = router.find_path(self_pe, peer_pe)
-latency = compute_drain_ns(path, credit_size_bytes)
-        = credit_size_bytes / bottleneck_bw_on_path
+path = router.find_path(self_pe, peer_pe.pe_dma)
+latency = compute_path_latency_ns(path, credit_size_bytes)
+        = sum(edge.distance_mm * ns_per_mm)
+        + sum(node_overhead_ns[n] for n in path)
+        + credit_size_bytes / bottleneck_bw_on_path
 ```
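
The three terms of the new formula can be evaluated with a small standalone model. This is a sketch under stated assumptions: the `Edge` dataclass, the `path_latency_ns` name, and the example numbers are hypothetical, and the simulator's own `compute_path_latency_ns` may be organized differently. It relies on 1 GB/s being 1 byte/ns.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Edge:
    distance_mm: float
    bw_gbs: float  # 1 GB/s == 1 byte/ns


def path_latency_ns(nodes: List[str], edges: List[Edge],
                    node_overhead_ns: Dict[str, float],
                    nbytes: int, ns_per_mm: float) -> float:
    """Propagation + per-node overhead + drain at the bottleneck bandwidth."""
    propagation = sum(e.distance_mm * ns_per_mm for e in edges)
    overhead = sum(node_overhead_ns.get(n, 0.0) for n in nodes)
    bottleneck_bw = min(e.bw_gbs for e in edges)
    drain = nbytes / bottleneck_bw
    return propagation + overhead + drain


# A 16-byte credit crossing two edges with one 8 ns node on the way.
latency = path_latency_ns(
    nodes=["src", "ucie", "dst"],
    edges=[Edge(1.0, 128.0), Edge(2.0, 256.0)],
    node_overhead_ns={"ucie": 8.0},
    nbytes=16,
    ns_per_mm=0.5,
)
print(latency)
```

For a 16-byte credit the drain term is tiny, so the result is dominated by propagation and node overhead, which is what makes the latency scale with topology distance.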

+The router automatically appends `.pe_dma` only to the source, so the
+destination must carry an explicit `.pe_dma` suffix. Otherwise `find_path`
+raises and the credit silently teleports at zero cost, a latent bug (fixed in
+this update).
+
+`tl.recv` waits for credit-emit completion with yield-from (previously it
+forked via `env.process`). The credit-return cost is thereby charged to the
+receiver's `pe_exec_ns`, and recv returns to the kernel only after the IPCQ
+control plane has completed the consume-acknowledgement. This is the
+protocol-level equivalent of RAW DMA's non-posted `tl.store` waiting for the
+HBM ack-trip.
+
 This gives:
 - **Topology-proportional approximation**: intra-cube credit return and cross-SIP credit
-  automatically get different latencies (not exact, but more meaningful than a magic constant)
-- **No magic constant**: no arbitrary value such as a separate `ipcq_ctrl_latency_ns` is needed
-- **No deadlock risk**: unlike piggyback, a credit is issued automatically even when B has
-  no data to send to A
-- **Reuse of an existing utility**: `ComponentContext.compute_drain_ns` is used as-is
+  automatically get different latencies
+- **No magic constant**: every ns value comes from `compute_path_latency_ns`, which is
+  derived from the same edge_map and `node_overhead_ns` as data traffic
+- **No deadlock risk**: `peer_credit_store.put` is unbounded, so a credit is issued
+  automatically even when B has no data to send to A
+- **`IPCQ ≥ raw DMA`** guarantee: for a matched physical move, credit-emit
+  balances RAW's ack-trip cost

 ```
 PE B: tl.recv(W) → takes the data → my_tail++
@@ -426,11 +434,22 @@ backend init에서 IpcqInitMsg fan-out 시 양방향 fast path channel을 함께

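 #### Additional PE_DMA responsibility
 
-On receiving a token, PE_DMA (vc_comm) processes it with the following atomic sequence.
-**There must be no SimPy yield between the two actions** (see the I6 MUST rule):
+On receiving a token, PE_DMA (vc_comm) processes it in the following sequence: it first
+pays the BW drain of the Transaction terminal, then atomically performs the data write +
+metadata forward. **There must be no SimPy yield between the data write and the metadata
+forward** (see the I6 MUST rule). The drain yield must sit before the atomic section,
+not inside it: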

 ```python
-def _on_vc_comm_recv(self, env, token):
+def _on_vc_comm_recv(self, env, txn):
+    # Pay the drain_ns (= nbytes / bottleneck_bw) stamped by the sender
+    # PE_DMA here. This must come before the atomic section, because
+    # recv may wake only after the bytes have "arrived".
+    drain = getattr(txn, "drain_ns", 0.0)
+    if drain > 0:
+        yield env.timeout(drain)
+
+    token = txn.request
    # ── ATOMIC: no yield between the two actions ──
    # 1. write data to dst_addr (the dst memory space is token.dst_endpoint.buffer_kind)
    data = self._memory_store.read(token.src_space, token.src_addr,
@@ -446,6 +465,32 @@ wire로 capacity가 unbounded인 store를 사용하므로 즉시 완료된다 (
 single-step). This final put ends the atomic section; no other yield may be inserted
 before it.

+#### Drain-at-inbound semantics (D9 timing model)
+
+A Transaction enters the fabric with `drain_ns = nbytes / bottleneck_bw_on_path`
+already stamped by the sender PE_DMA. In this simulator the per-hop
+`overhead_ns` is paid in the `run()` of each forwarding component, and the
+remaining BW drain is paid once at the Transaction's terminal node. For every
+non-IPCQ Transaction (raw DMA, kernel-launch fanout, etc.),
+`ComponentBase._forward_txn` pays this drain at the terminal. For IPCQ, the
+destination PE_DMA intercepts the Transaction in its `_handle_ipcq_inbound`
+handler (because it has to perform the IPCQ-specific data write + metadata
+forward), so **the drain must be paid explicitly at the top of this
+handler**. Only then does the IPCQ timing model line up with every other
+fabric Transaction.
+
+Side effects of paying the drain here:
+
+- **SRC `tl.send`**: behavior unchanged. The sender PE_DMA does not `yield`
+  on `sub_done`, so the fire-and-forget semantics are preserved. The
+  `sub_done.succeed()` call made after the metadata forward is an event with
+  no listener from the sender's point of view.
+- **DST `tl.recv`**: wakes up `drain_ns` later. recv is woken only when the
+  local PE_IPCQ receives `IpcqMetaArrival`, and since the metadata forward now
+  happens after the drain, recv observes the full fabric transfer time
+  including bandwidth.
+
+This matches the physical picture: send dispatches and returns immediately;
+recv waits until the bytes have actually drained into its inbox.
+
 #### Backpressure latency accuracy
 
 Time until backpressure is released:
@@ -924,7 +969,7 @@ tail updates are handled over the D9 fast path SimPy Store channel.
 
 ### D13. Test strategy
 
-Specifies unit/integration/regression tests following the D8 pattern of ADR-0021.
+Specifies unit/integration/regression tests.
 
 #### T1. Unit tests (component-level)

@@ -1057,7 +1102,7 @@ F5. **Slot full + infinite backpressure**:
 ### D15. Algorithm author guide (summary)
 
 This section lets an algorithm author find a starting point on a single screen.
-For the detailed step-by-step guide, see [docs/ccl-author-guide.md](../ccl-author-guide.md).
+For the detailed step-by-step guide, see [docs/onboarding/ccl-author-guide.md](../onboarding/ccl-author-guide.md).
 
 #### What you touch / what you do not touch

@@ -1130,7 +1175,416 @@ def neighbors(rank, world_size, neighbor_map) -> dict | None:
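 2. **Unmatched send/recv**: without a recv on the peer side, the sender hangs (slot-full backpressure)
 3. **dtype/shape mismatch**: the first implementation does not validate; this is the author's responsibility
 
-For the detailed step-by-step guide and a hello-world example, see `docs/ccl-author-guide.md`.
+For the detailed step-by-step guide and a hello-world example, see `docs/onboarding/ccl-author-guide.md`.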
+
+---
+
+## HW Realization Notes (Informative)
+
+**Status of this section**: Forward-looking. Describes how the simulator
+contract (D1–D15) would map to silicon. Not currently implemented;
+subject to revision before tapeout. The simulator implements the
+contract via Python/SimPy equivalents in
+[pe_ipcq.py](../../src/kernbench/components/builtin/pe_ipcq.py) and
+[pe_dma.py](../../src/kernbench/components/builtin/pe_dma.py).
+
+### D16. Proposed HW Block Diagram and End-to-End Dataflow
+
+![PE Baseline Architecture](../diagrams/pe_baseline.png)
+
+> Source: [`../diagrams/pe_baseline.d2`](../diagrams/pe_baseline.d2) — `d2 --layout=elk --scale 1.5`.
+
+![PE Proposed Architecture](../diagrams/pe_proposed.png)
+
+> Source: [`../diagrams/pe_proposed.d2`](../diagrams/pe_proposed.d2) — `d2 --layout=elk`.
+
+**Key changes, Baseline → Proposed**:
+
+- Single FIFO inbox → **split compute port / IPCQ port + WRR Arbiter** (NEW)
+- PE_IPCQ (SimPy component) → **IPCQ Controller** (HW registers + combinational logic)
+- An explicit **reserved IPCQ Slot Region** inside TCM
+- Credit Injector / Receiver connect directly to the NoC through the Fabric Port
+
+#### End-to-End Sequence (HW view)
+
+```mermaid
+sequenceDiagram
+    participant CPU_A as PE_A: PE_CPU
+    participant IPCQ_A as PE_A: IPCQ Ctrl
+    participant DMA_A as PE_A: DMA
+    participant NOC as NoC Fabric
+    participant DMA_B as PE_B: DMA
+    participant IPCQ_B as PE_B: IPCQ Ctrl
+    participant TCM_B as PE_B: TCM
+    participant CPU_B as PE_B: PE_CPU
+
+    Note over CPU_A: tl.send(dir="E", src=0x1000)
+
+    CPU_A->>IPCQ_A: MMIO: send request
+    Note over IPCQ_A: Backpressure check:<br/>(head - peer_tail_cache) < n_slots → PASS<br/>Slot addr gen:<br/>dst = peer_rx_base + (head%n) × slot_size
+    IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head}
+    Note over IPCQ_A: my_head++
+    IPCQ_A-->>CPU_A: send returns (fire-and-forget)
+
+    Note over DMA_A: TCM read → snapshot in read buffer<br/>Flit pack: data + {sender_seq, dst_addr}
+    DMA_A->>NOC: IPCQ data flit(s)
+
+    Note over NOC: hop latency + BW drain
+
+    NOC->>DMA_B: IPCQ data flit(s)
+    Note over DMA_B: Terminal BW drain<br/>Slot write latency
+
+    rect rgb(255, 240, 220)
+        Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall
+        DMA_B->>TCM_B: write data → slot address
+        DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr}
+    end
+
+    Note over IPCQ_B: Range match dst_addr → direction "W"<br/>peer_head_cache["W"] = sender_seq + 1
+    IPCQ_B-->>CPU_B: recv_wake signal
+
+    Note over CPU_B: tl.recv(dir="W") wakes up
+    CPU_B->>IPCQ_B: recv request
+    Note over IPCQ_B: peer_head_cache > my_tail → YES<br/>slot_addr = rx_base + (tail%n) × slot_size
+    IPCQ_B-->>CPU_B: return slot_addr
+    CPU_B->>TCM_B: read data from slot
+    Note over IPCQ_B: my_tail++
+
+    IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa}
+    Note over NOC: credit traversal (NoC latency)
+    NOC->>IPCQ_A: Credit arrival
+
+    Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled)
+```
+
+### D17. IPCQ Controller HW Module (new)
+
+A hardware control block that sits between PE_CPU and the DMA Engine. It
+corresponds to the simulator's `PeIpcqComponent`.
+
+#### QPair Register File
+
+Holds the per-direction queue pair state in flip-flops. PE_CPU can read and
+write it via MMIO (CSR), and software populates it at init time.
+
+```
+Per-direction registers (each 64-bit):
+  my_head          — sender write position (monotonic)
+  my_tail          — receiver read position (monotonic)
+  peer_head_cache  — last known peer head (updated by Meta Extractor)
+  peer_tail_cache  — last known peer tail (updated by Credit Receiver)
+  rx_base_pa       — this PE's rx buffer base physical address
+  peer_rx_base_pa  — peer's rx buffer base physical address
+  n_slots          — ring depth (power-of-2 constraint, see D21)
+  slot_size        — bytes per slot
+  peer_credit_tgt  — credit receive address of the peer PE
+
+Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare)
+Total: 8 dirs × 9 regs × 8B = 576B flip-flops
+```
+
+#### Slot Address Generator (combinational)
+
+```
+Input:  pointer (my_head or my_tail), n_slots, slot_size, base_pa
+Output: slot_addr = base_pa + (pointer % n_slots) * slot_size
+
+Implementation:
+  n_slots power-of-2 → pointer & (n_slots - 1)   (AND mask, 1 gate)
+  slot_size power-of-2 → barrel shift             (1 cycle)
+  64-bit add → ripple/kogge-stone adder           (1 cycle)
+
+Latency: 1-2 cycles combinational
+```
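
A software model of the address generator, showing that the mask-and-shift form matches the generic modulo-and-multiply formula when both parameters are powers of two. The function name and the example addresses are assumptions for illustration.

```python
def slot_addr(pointer: int, n_slots: int, slot_size: int, base_pa: int) -> int:
    """Compute base_pa + (pointer % n_slots) * slot_size with mask and shift."""
    for name, value in (("n_slots", n_slots), ("slot_size", slot_size)):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a power of two")
    index = pointer & (n_slots - 1)        # modulo as an AND mask
    shift = slot_size.bit_length() - 1     # multiply as a left shift
    return base_pa + (index << shift)


# Pointer 10 in an 8-slot ring of 4 KiB slots wraps to slot 2.
print(hex(slot_addr(10, n_slots=8, slot_size=4096, base_pa=0xE00000)))
```

The monotonic pointer never has to be reset: only its low bits select the slot, so head and tail can keep counting upward.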
+
+#### Backpressure Comparator (combinational)
+
+```
+full = (my_head - peer_tail_cache) >= n_slots
+
+Implementation: 64-bit subtract + unsigned compare
+Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold
+Latency: 1 cycle
+```
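
The comparator can be modeled as follows. Treating the subtraction as modulo 2^64 is an assumption about how 64-bit counter wraparound would be handled; the text above only specifies a 64-bit subtract followed by an unsigned compare.

```python
MASK64 = (1 << 64) - 1


def is_full(my_head: int, peer_tail_cache: int, n_slots: int) -> bool:
    """Ring is full when the in-flight slot count reaches n_slots."""
    in_flight = (my_head - peer_tail_cache) & MASK64  # unsigned 64-bit subtract
    return in_flight >= n_slots


print(is_full(my_head=8, peer_tail_cache=0, n_slots=8))   # all 8 slots in flight
print(is_full(my_head=8, peer_tail_cache=1, n_slots=8))   # one credit returned
```

Because `peer_tail_cache` is only a cached view, the comparator can report full slightly longer than strictly necessary, but it cannot overwrite an unconsumed slot.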
+
+#### Meta Extractor (inbound datapath sideband)
+
+Wired into the DMA Engine's inbound vc_comm path. It extracts metadata from
+the header of each arriving IPCQ flit and updates the queue pair state.
+
+```
+Trigger: DMA inbound write completion (same cycle)
+Extract: {sender_seq, dst_addr} from flit header
+
+Direction matching (ADR-0025 D2):
+  for each dir:
+    match = (base_pa[dir] <= dst_addr) && (dst_addr < base_pa[dir] + n_slots[dir] * slot_size[dir])
+  8× parallel range comparators + priority encoder
+
+Update: peer_head_cache[matched_dir] = max(peer_head_cache[matched_dir], sender_seq + 1)
+Output: recv_wake signal → PE_CPU interrupt/flag
+Latency: 1 cycle (pipelined with DMA write, so I6 atomicity holds naturally)
+```
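
A sequential Python model of the range match and head-cache update. The `QPair` dataclass and the function names are hypothetical; real hardware would evaluate all directions in parallel and resolve them with a priority encoder rather than iterating.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class QPair:
    rx_base_pa: int
    n_slots: int
    slot_size: int
    peer_head_cache: int = 0


def match_direction(dst_addr: int, qpairs: Dict[str, QPair]) -> Optional[str]:
    """Return the direction whose rx buffer range contains dst_addr."""
    for direction, qp in qpairs.items():
        limit = qp.rx_base_pa + qp.n_slots * qp.slot_size
        if qp.rx_base_pa <= dst_addr < limit:
            return direction
    return None


def on_meta_arrival(dst_addr: int, sender_seq: int,
                    qpairs: Dict[str, QPair]) -> Optional[str]:
    """Update peer_head_cache for the matched direction, monotonically."""
    direction = match_direction(dst_addr, qpairs)
    if direction is not None:
        qp = qpairs[direction]
        qp.peer_head_cache = max(qp.peer_head_cache, sender_seq + 1)
    return direction


qpairs = {"N": QPair(0xE00000, 8, 4096), "W": QPair(0xE08000, 8, 4096)}
print(on_meta_arrival(0xE09000, sender_seq=3, qpairs=qpairs))
```

The `max` keeps the cache monotonic, so a late or duplicated flit with an older `sender_seq` cannot move the head backward.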
+
+#### Credit Injector (outbound)
+
+```
+Trigger: recv completion (after my_tail is incremented)
+Action:  pack 16B credit packet → DMA vc_comm (or dedicated credit VC)
+
+Packet: {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa}
+Latency: 1 cycle to generate, then NoC traversal
+```
+
+#### Credit Receiver (inbound sideband)
+
+```
+Trigger: 16B credit packet arrival (from NoC)
+Extract: {consumer_seq, dst_rx_base_pa}
+
+Direction matching (ADR-0025 D3):
+  for each dir:
+    match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa)
+
+Update: peer_tail_cache[matched_dir] = max(peer_tail_cache[matched_dir], consumer_seq)
+Output: send_wake signal → deassert backpressure stall
+Latency: 1 cycle
+```
+
+### D18. DMA Engine vc_comm IPCQ-aware Mode
+
+Adds an IPCQ flit handling mode to the existing vc_comm channel (D8).
+
+**Outbound**:
+
+1. Receive a command from the IPCQ Controller: `{src_addr, dst_addr, nbytes, sender_seq}`
+2. Read src_addr from TCM → snapshot into the DMA read buffer (standard DMA behavior)
+3. Flit pack: data + piggyback metadata (sender_seq, dst_addr)
+4. Inject into the NoC fabric port
+5. Fire-and-forget (does not wait for completion)
+
+**Inbound**:
+
+1. Receive IPCQ flit from the NoC
+2. Terminal BW drain charge (`drain_ns = nbytes / bottleneck_bw`)
+3. Slot write latency charge (backing memory tier)
+4. **ATOMIC** (same pipeline stage, no stall insertion):
+   - TCM write: data → slot address
+   - Meta Extractor trigger: sender_seq + dst_addr → IPCQ Controller
+5. Done
+
+**Hardware guarantee of I6 atomicity**: TCM write completion and the Meta
+Extractor trigger occur in the same pipeline stage, so no separate
+synchronization is needed. The simulator rule "no SimPy yield between
+MemoryStore.write and IpcqMetaArrival put" (D9, I6) is satisfied naturally.
+
+#### Data Snapshot Semantics
+
+Data latched in the DMA read buffer is unaffected by later modifications of
+the src memory. This is standard DMA read-then-write behavior, so no extra HW
+is required.
+
+#### Credit Virtual Channel (optional)
+
+- **Option A**: multiplex credits onto vc_comm (distinguished as 16B header-only flits).
+- **Option B**: add a 3rd dedicated credit VC (strict priority > data).
+
+Option B is better for deadlock prevention, but the BW impact of a 16B credit
+is negligible, so Option A is also sufficient.
+
+### D19. Fabric Flit Format Extension
+
+```
+Normal data flit (e.g. 512-bit):
+┌──────────────────────────────────────────┐
+│ [511:480] routing header (32b)           │
+│ [479:0]   payload (480b = 60B)           │
+└──────────────────────────────────────────┘
+
+IPCQ data flit (metadata in the first flit only):
+┌──────────────────────────────────────────┐
+│ [511:480] routing header (32b)           │
+│   [511]    ipcq_flag (1b)                │  ← distinguishes IPCQ from normal DMA
+│   [510:509] vc_id (2b)                   │
+│   [508:480] route + hop count            │
+│ [479:416] ipcq_metadata (64b)            │  ← piggyback
+│   [479:448] sender_seq (32b)             │
+│   [447:416] dst_addr[31:0] (32b)         │  ← used for direction matching
+│ [415:0]   payload (416b = 52B)           │
+└──────────────────────────────────────────┘
+Subsequent flits: full 60B payload (no metadata)
+
+Credit-only flit (128-bit, header-only):
+┌──────────────────────────────────────────┐
+│ [127:96]  routing header (32b)           │
+│   [127]   credit_flag (1b)               │
+│ [95:64]   consumer_seq (32b)             │
+│ [63:0]    dst_rx_base_pa (64b)           │
+└──────────────────────────────────────────┘
+```
+
+The first flit's payload shrinks from 60B to 52B (13% overhead). In a
+multi-flit transfer the subsequent flits carry the full payload, so the
+overhead for large transfers is < 1%.
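
The overhead claim can be checked numerically. This sketch assumes the example 512-bit flit with a 32-bit routing header and defines overhead as metadata bytes divided by payload bytes; the function names are hypothetical.

```python
import math

FLIT_PAYLOAD_BYTES = 60        # 480-bit payload of a normal data flit
IPCQ_METADATA_BYTES = 8        # sender_seq (32b) + dst_addr[31:0] (32b)
FIRST_FLIT_PAYLOAD_BYTES = FLIT_PAYLOAD_BYTES - IPCQ_METADATA_BYTES  # 52


def plain_flit_count(nbytes: int) -> int:
    """Flits needed for a normal DMA transfer of nbytes."""
    return max(1, math.ceil(nbytes / FLIT_PAYLOAD_BYTES))


def ipcq_flit_count(nbytes: int) -> int:
    """Flits needed when the first flit also carries IPCQ metadata."""
    if nbytes <= FIRST_FLIT_PAYLOAD_BYTES:
        return 1
    rest = nbytes - FIRST_FLIT_PAYLOAD_BYTES
    return 1 + math.ceil(rest / FLIT_PAYLOAD_BYTES)


def metadata_overhead(nbytes: int) -> float:
    """Metadata bytes as a fraction of the payload bytes transferred."""
    return IPCQ_METADATA_BYTES / nbytes


for size in (52, 113, 4096):
    print(size, plain_flit_count(size), ipcq_flit_count(size),
          f"{metadata_overhead(size):.2%}")
```

Under these assumptions the byte overhead drops below 1% once a transfer exceeds 800 bytes, although a transfer can still need one extra flit when its size falls just past a flit boundary.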
+
+### D20. TCM IPCQ Slot Region Layout
+
+```
+TCM Memory Map (16MB):
+┌─────────────────────────────┐ 0x000000
+│  Kernel Working Memory      │
+│  (compute tensors)          │
+│  ~14MB                      │
+├─────────────────────────────┤ 0xE00000
+│  IPCQ RX Buffers            │
+│  Dir N: slots × slot_size   │
+│  Dir S: slots × slot_size   │
+│  Dir E: slots × slot_size   │
+│  Dir W: slots × slot_size   │
+│  ~1MB                       │
+├─────────────────────────────┤ 0xF00000
+│  IPCQ Metadata / Scratch    │
+│  ~1MB                       │
+└─────────────────────────────┘ 0xFFFFFF
+```
+
+The IPCQ region is placed in the upper banks of the TCM to minimize bank
+conflicts with compute accesses (see the risk in D22).
+
+### D21. 2nm Implementation Analysis
+
+#### Area Estimate
+
+| Module | Gate Count | Area (2nm est.) | Notes |
+|---|---|---|---|
+| QPair Register File | ~4.6K FF | 0.002 mm² | 576B flip-flops |
+| Slot Addr Gen + Backpressure | ~5K gates | 0.001 mm² | Combinational |
+| Meta Extractor + Credit Logic | ~3K gates | 0.001 mm² | 8× parallel comparators |
+| **IPCQ Controller subtotal** | **~12.6K** | **~0.004 mm²** | **< 0.1% of the whole PE** |
+| DMA vc_comm extension | ~2K gates | 0.002 mm² | Flit pack/unpack |
+| **Total delta** | **~14.6K** | **~0.006 mm²** | |
+
+#### Timing
+
+| Path | Delay (2nm est.) | Target Clock | Margin |
+|---|---|---|---|
+| Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× |
+| Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× |
+| Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× |
+| Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× |
+
+Every critical path fits within 1 cycle. No timing closure issues.
+
+#### Power
+
+- Active: ~1 mW (register R/W + comparators, while send/recv is active)
+- Idle: leakage only
+- Negligible relative to total PE power
+
+#### Constraints
+
+| Item | Constraint | Rationale |
+|---|---|---|
+| `n_slots` | **must be a power of 2** | mod → AND mask (1 gate). Arbitrary values need a divider (~10 cycles) |
+| `slot_size` | **power of 2 recommended** | mul → barrel shift. Arbitrary values need a multiplier |
+| TCM IPCQ region | **dedicated bank placement** | Avoids bank conflicts with compute accesses |
+
+### D22. Risk Assessment
+
+#### TCM Bank Conflict
+
+- **Risk**: stalls when an IPCQ slot write and a compute read hit the same bank
+- **Mitigation**: place the IPCQ region in dedicated banks at the upper TCM addresses (D20)
+- **Cost**: slightly reduced TCM banking flexibility
+- **Severity**: Medium (performance impact), Low (not a correctness issue)
+
+#### Credit Return Latency under Congestion
+
+- **Risk**: under NoC congestion, delayed credit return → sender backpressure stall
+- **Mitigation**:
+  - Put credits on a separate VC with strict priority (at 16B the BW impact is minimal)
+  - Or size n_slots generously (8+) so the buffer absorbs credit delay
+- **Severity**: Low (a 16B credit contributes almost nothing to congestion)
+
+#### Inter-Direction Ordering
+
+- **Risk**: ordering when the same PE sends in several directions concurrently
+- **Mitigation**: a per-direction monotonic seq is sufficient. Inter-direction ordering is
+  the kernel's (software's) responsibility, same as the current simulator model (D2 + D4)
+- **Severity**: Low (resolved by architectural design)
+
+### D23. HW Alternatives Considered
+
+#### Doorbell + Polling (traditional approach)
+
+```
+Send: DMA write data → DMA write doorbell register at peer → peer polls doorbell
+Recv: Polling loop on doorbell, or interrupt-driven
+```
+
+| Pros | Cons |
+|---|---|
+| Simple HW (no IPCQ controller needed) | 2 DMA transactions (data + doorbell) |
+| Reuses the existing DMA | Ordering between data and doorbell must be guaranteed (fence) |
+| | Polling wastes power; interrupts add latency overhead |
+
+**Assessment**: 2-3× higher latency than piggyback. **Rejected.**
+
+#### Hardware Message Queue (NVIDIA NVLink style)
+
+```
+Send: CPU → push descriptor into HMQ → HW forwards it to the peer HMQ automatically
+Recv: pop descriptor from HMQ → inspect data pointer
+```
+
+| Pros | Cons |
+|---|---|
+| CPU only writes a descriptor | Needs a separate HMQ engine (~0.05 mm²) |
+| Descriptor/data split → flexible | Datapath separate from DMA → duplicated area/power |
+| | Large tensors need DMA anyway |
+
+**Assessment**: DMA is mandatory for CCL's large-tensor pattern, so a dual HMQ + DMA
+structure wastes area. **Rejected.**
+
+#### RDMA-style Completion Queue (CQ)
+
+```
+Send: DMA write → CQE generated automatically at the peer
+Recv: CQ poll/interrupt → locate the data
+```
+
+| Pros | Cons |
+|---|---|
+| Mature InfiniBand/RoCE model | CQ management logic + CQE memory overhead |
+| Easy multi-tenancy/isolation | Extra CQE/data ordering guarantees required |
+| | Over-engineered for PE-to-PE CCL |
+
+**Assessment**: an RDMA CQ suits multi-tenant isolation on a host-facing NIC.
+In a single-owner PE-to-PE environment it is unnecessary complexity. **Rejected.**
+
+#### Credit-in-Data Piggyback (v2 optimization candidate)
+
+In the current design, credit return is a separate 16B packet. In
+bidirectional communication patterns, **the credit can be merged into a data
+flit travelling in the reverse direction.**
+
+```
+PE_A →E→ PE_B: data + sender_seq=3
+PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4  ← credit merged into data
+```
+
+| Pros | Cons |
+|---|---|
+| Removes dedicated credit packets → saves NoC BW | Unidirectional patterns need a fallback |
+| Credit latency → 0 in bidirectional allreduce | 8B added to the flit header (minimal overhead) |
+| | Slightly higher logic complexity |
+
+**Assessment**: a worthwhile optimization of the current design. Credit packets can be
+eliminated entirely in bidirectional allreduce. The standalone credit fallback is kept.
+**Recommended for adoption in v2.**
+
+### Open HW Questions
+
+- What share of the TCM may the IPCQ slot region occupy? (current assumption: ~1MB / 16MB = 6.25%)
+- Dedicated credit VC, or multiplexing onto vc_comm? (see D18)
+- Flit format compatibility on inter-SIP links needs verification
+- Cap on the maximum n_slots? (8 directions × 8 slots × 64KB = 4MB → 25% of TCM)
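
The sizing questions above reduce to simple arithmetic. The sketch below is illustrative only: both helper names are hypothetical, and the budget policy (largest power-of-two `n_slots` that fits) is an assumption based on the D21 power-of-2 constraint.

```python
def ipcq_region_fraction(n_dirs: int, n_slots: int,
                         slot_size_bytes: int, tcm_bytes: int) -> float:
    """Fraction of the TCM consumed by all per-direction rx rings."""
    return n_dirs * n_slots * slot_size_bytes / tcm_bytes


def max_pow2_slots(budget_bytes: int, n_dirs: int, slot_size_bytes: int) -> int:
    """Largest power-of-two n_slots per direction that fits the budget (0 if none)."""
    max_slots = budget_bytes // (n_dirs * slot_size_bytes)
    if max_slots == 0:
        return 0
    return 1 << (max_slots.bit_length() - 1)


KIB, MIB = 1 << 10, 1 << 20
print(ipcq_region_fraction(8, 8, 64 * KIB, 16 * MIB))   # the 4MB case above
print(max_pow2_slots(1 * MIB, 8, 64 * KIB))             # slots within a 1MB region
```

With 64KB slots and all 8 directions active, a 1MB region leaves room for only 2 slots per direction, so the slot size and the number of active directions dominate the budget.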

 ---

@@ -1192,29 +1646,3 @@ def neighbors(rank, world_size, neighbor_map) -> dict | None:
 - The VC arbitration model is a first-order approximation, so heavy-contention
  scenarios may yield slightly more optimistic latency than real HW (D8 limitation)
 - Chunk-level VC interleaving makes the PE_DMA implementation more complex
-
---
-
-## Affected files
-
-| File | Change |
-|------|------|
-| `topology.yaml` | Add pe_ipcq to pe_template, add ipcq↔dma/cpu/tcm edges |
-| `components.yaml` | Register pe_ipcq_v1 |
-| `src/kernbench/topology/builder.py` | Add the ipcq chain to PE-internal edges |
-| `src/kernbench/components/builtin/pe_ipcq.py` | New |
-| `src/kernbench/components/builtin/pe_dma.py` | Add VC, handle IpcqDmaToken |
-| `src/kernbench/common/pe_commands.py` | Define IpcqSendCmd, IpcqRecvCmd, IpcqDmaToken |
-| `src/kernbench/triton_emu/tl_context.py` | tl.send / tl.recv API |
-| `src/kernbench/runtime_api/distributed.py` | Load ccl.yaml, install IPCQ at init (eager) |
-| `src/kernbench/runtime_api/kernel.py` | Define IpcqInitMsg (sideband) |
-| `src/kernbench/ccl/__init__.py` | New: CCL package |
-| `src/kernbench/ccl/topologies.py` | New: builtin topology generators (ring_1d, mesh_2d, tree_binary, etc.), `resolve_topology()` |
-| `src/kernbench/ccl/helpers.py` | New: algorithm authoring helpers (chunked, ring_step, etc.) |
-| `src/kernbench/ccl/testing.py` | New: mock CCL runtime (`run_kernel_in_mock`) |
-| `ccl.yaml` | New: algorithm metadata + IPCQ default settings |
-| `src/kernbench/ccl/algorithms/ring_allreduce.py` | New: first algorithm example |
-| `tests/test_pe_ipcq.py` | New: PE_IPCQ unit tests |
-| `tests/test_pe_dma_vc.py` | New: PE_DMA virtual channel tests |
-| `tests/test_ipcq_e2e.py` | New: send/recv end-to-end tests |
-| `tests/test_ccl_topologies.py` | New: builtin topology generator unit tests |
@@ -0,0 +1,236 @@
+# ADR-0024: SIP-level Launcher — rank = SIP
+
+## Status
+
+Accepted
+
+## Context
+
+### Goal
+
+Align the participation unit (rank) of `torch.distributed` collective calls with the **SIP** (device)
+boundary. The goal is bench code that reads, **at the host level, indistinguishably**
+from a real PyTorch DDP/TP script.
+
+Comparison with real PyTorch:
+
+| Dimension | real PyTorch | KernBench |
+| --- | --- | --- |
+| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
+| `get_rank()` | `RANK` env var | greenlet-local registry |
+| `get_world_size()` | `WORLD_SIZE` env var | number of SIPs in the topology |
+| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
+| `mp.spawn` | spawns OS processes | greenlet fan-out |
+
+### Problems to solve
+
+1. **rank = SIP in the public API**, so that bench workers need not know the PE concept.
+2. **Greenlet-local rank/device tracking**: within the 1-process model, each
+   worker greenlet must correctly identify its own rank and its own SIP.
+3. **Tensor placement = structural (sip, cube, pe)**: if a rank is a SIP,
+   default tensor placement must also be expressed in structural coordinates.
+
+### Non-problems (outside this ADR)
+
+- IPCQ direction addressing → ADR-0025
+- Removal of `DPPolicy.sip`/`num_sips` → ADR-0026
+- Megatron-style TP → ADR-0027
+- DTensor → ADR-0028 (future)
+- Worker scheduling / `mp.spawn` / collective drain / exception cleanup
+  → ADR-0027 D0/D1
+- Collective algorithm implementation (intercube_allreduce, SFR config) → ADR-0032
+
+## Decision
+
+### D1. rank = SIP (world_size resolution)
+
+```python
+def _resolve_world_size(self) -> int:
+    if "world_size" in self._merged:
+        return int(self._merged["world_size"])
+    defaults = self._cfg_all.get("defaults", {})
+    if "world_size" in defaults:
+        return int(defaults["world_size"])
+    spec = self.ctx.spec or {}
+    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
+```
+
+Priority: algorithm override > defaults override > SIP count. The `ccl.yaml`
+override is kept as the legacy "rank = PE" test path.
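The priority order can be exercised with a standalone version of the same logic. This is a sketch under assumptions: `resolve_world_size` and its three dict parameters are hypothetical stand-ins for `self._merged`, the `defaults` section, and `self.ctx.spec` in the method above.

```python
def resolve_world_size(algo_cfg: dict, defaults: dict, spec: dict) -> int:
    """Algorithm override > defaults override > SIP count (falls back to 1)."""
    if "world_size" in algo_cfg:
        return int(algo_cfg["world_size"])
    if "world_size" in defaults:
        return int(defaults["world_size"])
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))


spec = {"system": {"sips": {"count": 4}}}

ws_default = resolve_world_size({}, {}, spec)                     # rank = SIP
ws_defaults_override = resolve_world_size({}, {"world_size": 8}, spec)
ws_algo_override = resolve_world_size({"world_size": 16}, {"world_size": 8}, spec)
ws_no_spec = resolve_world_size({}, {}, {})                       # nothing configured
```

With no overrides the world size equals the SIP count, which is the "rank = SIP" default; either override replaces it, with the algorithm-level value winning.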
+
+### D2. Greenlet-local rank registry (+ debug warning)
+
+```python
+class DistributedContext:
+    def __init__(self):
+        self._backend = None
+        self._rank_by_greenlet: dict = {}
+
+    def _bind_rank(self, g, rank: int) -> None:
+        self._rank_by_greenlet[g] = int(rank)
+
+    def get_rank(self) -> int:
+        self._ensure_initialized()
+        from greenlet import getcurrent
+        g = getcurrent()
+        if g not in self._rank_by_greenlet:
+            if os.environ.get("KERNBENCH_DEBUG"):
+                warnings.warn(
+                    "get_rank() called outside a bound greenlet — returning 0. "
+                    "Likely a bug unless running single-driver."
+                )
+            return 0
+        return int(self._rank_by_greenlet[g])
+```
+
+### D3. `torch.ahbm.set_device(rank)` (SIP binding)
+
+The KernBench backend is named `ahbm` (ADR-0023). Real PyTorch uses
+`torch.cuda.set_device(r)`, but since we are not CUDA we use an honestly named
+namespace.
+
+```python
+class _AhbmNamespace:
+    """torch.ahbm — per-greenlet SIP device binding.
+
+    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
+    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
+    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
+    """
+
+    def __init__(self):
+        self._device_by_greenlet: dict = {}
+
+    def set_device(self, device: int) -> None:
+        from greenlet import getcurrent
+        self._device_by_greenlet[getcurrent()] = int(device)
+
+    def current_device(self) -> int | None:
+        from greenlet import getcurrent
+        return self._device_by_greenlet.get(getcurrent())
+
+# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
+# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
+```
+
+**Parallel support for the PyTorch 2.x style**: recent PyTorch moves toward the device-agnostic
+`torch.accelerator` namespace (`torch.accelerator.set_device_index(r)`,
+`torch.accelerator.current_device_index()`). For users who want to write code that is not
+tied to a device vendor, KernBench supports this surface as well.
+
+```python
+class _AcceleratorNamespace:
+    """torch.accelerator — device-agnostic API (PyTorch 2.x style).
+
+    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
+        torch.accelerator.set_device_index(rank)
+        torch.accelerator.current_device_index()
+    """
+
+    def __init__(self, ahbm: _AhbmNamespace):
+        self._ahbm = ahbm
+
+    def set_device_index(self, device: int) -> None:
+        self._ahbm.set_device(device)
+
+    def current_device_index(self) -> int | None:
+        return self._ahbm.current_device()
+
+# RuntimeContext
+self.ahbm = _AhbmNamespace()
+self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias
+```
+
+Bench authors pick one of the following; both are backed by the same registry internally:
+
+```python
+torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
+torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
+```
+
+### D4. Tensor placement = structural (sip, cube, pe) coordinates
+
+`resolve_dp_policy` takes `target_sip` directly and generates the placement in structural coordinates.
+See ADR-0026 for details.
+
+```python
+# RuntimeContext._create_tensor
+current_sip = self.ahbm.current_device()          # (D3 naming)
+if current_sip is None:
+    current_sip = 0  # single-driver fallback (consistent with D2)
+placement = resolve_dp_policy(
+    dp, shape=shape_2d, itemsize=itemsize,
+    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
+    target_sip=current_sip,
+)
+```
+
+No post-hoc `pe_index` shifting: ShardSpec holds the structural `(sip, cube, pe)`
+coordinates directly. See ADR-0026 for ShardSpec details.
+
+### D5. SIP grid size: explicit `sips.w/h` resolution
+
+For 2D inter-SIP topologies (`torus_2d`, `mesh_2d_no_wrap`), the SIP grid shape
+(width × height) is resolved from `system.sips.w` / `system.sips.h`, in the same way that D1
+resolves `world_size` from `sips.count`. Priority:
+explicit `w/h` (validated with `w*h == count`) > square fallback
+(`round(sqrt(count))²`, only when `w/h` is unspecified) > error.
+
+```python
+sips = spec.get("system", {}).get("sips", {})
+if sip_topo == "ring_1d":
+    w, h = 0, 0                          # 1D sentinel (no grid)
+elif sips.get("w") is not None and sips.get("h") is not None:
+    w, h = int(sips["w"]), int(sips["h"])
+    if w * h != n_sips:
+        raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
+else:
+    side = int(round(math.sqrt(n_sips)))
+    if side * side != n_sips:
+        raise ValueError("non-square sips.count requires explicit sips.w/h")
+    w, h = side, side
+```
+
+This removes the previous assumption that a 2D SIP grid must be a perfect square:
+a 6-SIP `torus_2d` / `mesh_2d_no_wrap` can now be expressed as `w: 3, h: 2` (or `2x3`).
+The derived `(w, h)` is passed to the algorithm's inter-SIP exchange
+(consumed in ADR-0032 D5). The previous code path silently took
+`round(sqrt(count))²` for every non-ring topology and produced a wrong grid (e.g. 2×2 for 6 SIPs).
+The explicit `w/h` path with a fail-loud fallback replaces it.
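The resolution rules in D5 can be wrapped in a function to check the 6-SIP case directly. This is a sketch, not the project's code: `resolve_sip_grid` is a hypothetical name that repackages the snippet above with its inputs made explicit.

```python
import math


def resolve_sip_grid(spec: dict, sip_topo: str, n_sips: int) -> tuple[int, int]:
    """Explicit w/h > square fallback > error; ring_1d uses a (0, 0) sentinel."""
    sips = spec.get("system", {}).get("sips", {})
    if sip_topo == "ring_1d":
        return 0, 0                                   # 1D sentinel (no grid)
    if sips.get("w") is not None and sips.get("h") is not None:
        w, h = int(sips["w"]), int(sips["h"])
        if w * h != n_sips:
            raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
        return w, h
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError("non-square sips.count requires explicit sips.w/h")
    return side, side


explicit = resolve_sip_grid({"system": {"sips": {"count": 6, "w": 3, "h": 2}}}, "torus_2d", 6)
square = resolve_sip_grid({"system": {"sips": {"count": 4}}}, "mesh_2d_no_wrap", 4)
ring = resolve_sip_grid({"system": {"sips": {"count": 6}}}, "ring_1d", 6)

try:
    resolve_sip_grid({"system": {"sips": {"count": 6}}}, "torus_2d", 6)
    failed_loudly = False
except ValueError:
    failed_loudly = True    # 6 SIPs without w/h no longer becomes a silent 2x2
```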
+
+---
+
+## Dependencies
+
+- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
+- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature in D4 and
+  the structural-coordinate representation of ShardSpec.
+- **ADR-0027** (Megatron TP + scheduler): implementation reference for worker scheduling, `mp.spawn`,
+  collective drain, and exception cleanup.
+
+---
+
+## Non-goals
+
+- **IPCQ protocol changes**: ADR-0023 is kept as-is.
+- **DPPolicy field cleanup**: ADR-0026.
+- **Megatron-style TP**: ADR-0027.
+- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
+- **Collective algorithm implementation**: ADR-0032.
+- **Multi-node (cross-process)**: single process only.
+
+---
+
+## Consequences
+
+### Positive
+
+- **Bench = real PyTorch DDP** (from the public API's point of view).
+- **Greenlet-local rank**: cross-rank correctness becomes possible in the 1-process model.
+- **Structural placement coordinates**: the other ADRs (ADR-0026 / ADR-0027 / ADR-0032)
+  operate consistently on the `(sip, cube, pe)` 3-tuple.
+
+### Neutral
+
+- The IPCQ PE-level protocol (ADR-0023) is unchanged.
+- The IO_CPU role is unchanged (existing transit kept as-is).
@@ -2,7 +2,7 @@

 ## Status

-Proposed (Revision 2 — Address-based matching; peer_direction field dropped)
+Accepted (Revision 2 — Address-based matching; peer_direction field dropped)

 ## Context

@@ -13,34 +13,6 @@ topology / dict-order에 의존하지 않고 **주소 기반**으로 일관되
 a 2-rank bidirectional ring (or, more generally, any topology in which several directions point at the
 same peer) is handled correctly.

-### Current state (ADR-0023 D9 implementation)
-
-`src/kernbench/components/builtin/pe_ipcq.py` — `_handle_meta_arrival`:
-
-```python
-def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
-    token = msg.token
-    sender_key = (token.src_sip, token.src_cube, token.src_pe)
-    for d, qp in self._queue_pairs.items():
-        p = qp["peer"]
-        if (p.sip, p.cube, p.pe) == sender_key:
-            qp["peer_head_cache"] = max(qp["peer_head_cache"], token.sender_seq + 1)
-            # ... wake recv waiters ...
-            return
-```
-
-`_credit_worker` follows the same "sender-coord-first-match" pattern.
-
-`src/kernbench/ccl/install.py` — `reverse_direction`:
-
-```python
-def reverse_direction(my_rank: int, peer_rank: int) -> str | None:
-    for d, target in neighbor_table[peer_rank].items():
-        if target == my_rank:
-            return d
-    return None
-```
-
 ### Exposed bug: 2-rank bidirectional ring

 `ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions point at the same peer.
@@ -89,7 +61,14 @@ direction_idx × bytes_per_direction). 따라서:
 `src/kernbench/ccl/install.py`:

 ```python
-_OPPOSITE_DIR = {"E": "W", "W": "E", "N": "S", "S": "N"}
+# Extended in ADR-0032 with global_* pairs for inter-SIP directions,
+# which were introduced by configure_sfr_intercube_multisip to keep
+# intercube (N/S/E/W) and inter-SIP (global_N/S/E/W) namespaces disjoint.
+_OPPOSITE_DIR = {
+    "E": "W", "W": "E", "N": "S", "S": "N",
+    "global_E": "global_W", "global_W": "global_E",
+    "global_N": "global_S", "global_S": "global_N",
+}

 def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
    """Find peer's direction that reciprocates my_dir→peer_rank.
@@ -282,51 +261,6 @@ for plan in plans:

 ---

-## Test strategy
-
-### T1. Unit: `reverse_direction` opposite-preference
-
-`tests/test_ccl_install.py` (extended):
- Ring ws=2: `reverse_direction(0, 1, "E")` → "W", `reverse_direction(0, 1, "W")` → "E"
- Ring ws=4: `reverse_direction(0, 1, "E")` → "W" (the natural opposite)
- Mesh 2×2: `reverse_direction(r, peer, "N")` → "S", "E" ↔ "W"
- Tree binary: a direction with no opposite (parent) → fallback path
- Non-symmetric topology: the peer has no opposite direction, only some other direction
-
-### T2. Runtime: `_handle_meta_arrival` dst_addr matching
-
-`tests/test_pe_ipcq.py` (extended):
- After a 2-rank pair install, a meta arrival with the E direction's dst_addr → E's `peer_head_cache`
-  increases (W is unchanged)
- A meta arrival with the W direction's dst_addr → W's `peer_head_cache` increases
- A wrong dst_addr (not inside any rx range) → error or silent drop
-  (to be stated once decided)
-
-### T3. Credit: `dst_rx_base_pa` matching
-
-`tests/test_pe_ipcq.py` (extended):
- After a send on the E direction the peer consumes → it sends a credit carrying its own W
-  `my_rx_base_pa` → the sender's E direction `peer_tail_cache` increases
- Same for the W direction
-
-### T4. E2E: 2-rank bidirectional ring
-
-`tests/test_ipcq_e2e.py`:
- With a 2-rank ring_1d, the tl.send(E) + tl.recv(W) pattern works in both directions
- ring at ws=2 passes in ADR-0024's `test_ccl_allreduce_matrix.py`
-
-### T5. Install invariant: rx_base range disjointness
-
-`tests/test_ccl_install_plan.py` (extended):
- I3.1 check: in the result of `build_install_plans`, the rx_base ranges of all qps are disjoint
-
-### T6. Regression
-
- Existing ws≥3 ring / mesh / tree tests pass unchanged
- Existing `test_pe_ipcq`, `test_ipcq_e2e` cases pass as regression
-
---
-
 ## Consequences

 ### Positive
@@ -347,19 +281,3 @@ for plan in plans:

 - The semantic layer of the IPCQ protocol (the sender computes dst_addr, the receiver receives) is
  unchanged.
-
---
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/ccl/install.py` | D1: add the `my_dir` argument to `reverse_direction`, opposite-preference |
-| `src/kernbench/components/builtin/pe_ipcq.py` | D2: `_handle_meta_arrival` dst_addr matching / D3: `_credit_worker` dst_rx_base_pa matching / `_delayed_credit_send` fills the `dst_rx_base_pa` field |
-| `src/kernbench/common/ipcq_types.py` | D3: add the `dst_rx_base_pa` field to `IpcqCreditMetadata` |
-| `src/kernbench/ccl/install_plan.py` (new in ADR-0024) | D6: I3.1 invariant check (optional) |
-| `docs/adr/ADR-0023-ipcq-pe-collective.md` | Reference note: the runtime matching scheme changes in ADR-0025 |
-| `tests/test_ccl_install.py` | T1 |
-| `tests/test_pe_ipcq.py` | T2, T3 |
-| `tests/test_ipcq_e2e.py` | T4 |
-| `tests/test_ccl_install_plan.py` | T5 |
@@ -2,7 +2,7 @@

 ## Status

-Proposed (Revision 4: document consistency + more concrete grep audit)
+Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)

 ## Context

@@ -13,53 +13,6 @@ intra-device 추상화로 명확화한다. SIP 간 분산(TP)은 별도 레이
 (handled by ADR-0024's `torch.ahbm.set_device(rank)` or by ADR-0027's Megatron parallel
 layers).

-### Current state
-
-`src/kernbench/policy/placement/dp.py`:
-
-```python
-@dataclass(frozen=True)
-class DPPolicy:
-    sip: Literal["replicate", "column_wise", "row_wise"] = "replicate"
-    cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
-    pe: Literal["replicate", "column_wise", "row_wise"] = "replicate"
-    num_pes: int | None = None
-    num_cubes: int | None = None
-    num_sips: int | None = None    # ← to be removed
-```
-
-The `sip` / `num_sips` fields provide a path for distributing a tensor **across** SIP boundaries.
-This:
-
- **Conflicts with ADR-0024's launcher model**: ADR-0024 is the "rank = SIP = 1 worker per SIP"
-  model. Each worker creates tensors on its own SIP. A tensor spanning several SIPs must be
-  handled by Megatron-style TP as a separate primitive.
- **Does not match user intent**: "DPPolicy is how to distribute across PEs within one device"
-  (user statement).
- **Confuses concepts**: `DPPolicy.sip="column_wise"` is actually **TP**. The name says DP but
-  the behavior is TP → confusing for new users.
-
-### Affected call sites (grep results at rollback time)
-
-**Construction sites** (`DPPolicy(sip=...` or `num_sips=...`):
- `tests/test_runtime_api_tensor.py`
- `benches/ccl_allreduce.py` (already reworked within the ADR-0024 scope)
- `tests/test_va_offset.py`
- `benches/va_offset_verify.py`
- `tests/test_sip_parallel.py`
-
-**Reference sites** (`dp.sip`, `policy.sip`, `num_sips`, etc.):
- `src/kernbench/runtime_api/context.py` (`_create_tensor`, `launch`)
- `src/kernbench/components/builtin/pe_cpu.py`
- `src/kernbench/components/legacy/builtin/pe_cpu.py`
- `src/kernbench/policy/placement/dp.py` (the implementation itself)
- `tests/test_tensor.py`, `test_ipcq_types.py`
-
-**Key test**: `test_sip_parallel.py` is, as the name says, a test that "expresses SIP parallelism
-with DPPolicy". After this ADR it must be **rewritten for the new launcher model**.
-
---
-
 ## Decision

 ### D1. Remove the `sip` + `num_sips` fields from `DPPolicy`
@@ -69,9 +22,9 @@ class DPPolicy:
 class DPPolicy:
    """Intra-device (cube × PE) data-parallel policy.

-    SIP-level placement is controlled by ``torch.cuda.set_device(rank)``
-    (ADR-0024) and, for model-level TP, by Megatron-style parallel layers
-    (ADR-0027). DPPolicy does not cross SIP boundaries.
+    SIP-level placement is controlled by ``torch.ahbm.set_device(rank)``
+    (ADR-0024 D3) and, for model-level TP, by Megatron-style parallel
+    layers (ADR-0027). DPPolicy does not cross SIP boundaries.
    """
    cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
    pe: Literal["replicate", "column_wise", "row_wise"] = "replicate"
@@ -84,7 +37,7 @@ class DPPolicy:
 ### D2. `ShardSpec`: structural (sip, cube, pe) coordinates, `pe_index` removed entirely

 Currently `ShardSpec.pe_index` is a **global flat index** (`sip × cubes × pes + cube ×
-pes + pe`). This is the form that ADR-0024 D11 flagged as "abstraction leakage".
+pes + pe`). This is the form that ADR-0024 D4 flagged as "abstraction leakage".

 This ADR **redefines ShardSpec in structural coordinates**, and `pe_index` is
 **not kept, even as a property**:
@@ -120,7 +73,7 @@ class ShardSpec:

 ### D3. `resolve_dp_policy` takes `target_sip` and generates structural coordinates

-Implements the ADR-0024 D11 contract. No post-hoc shifting.
+Implements the ADR-0024 D4 contract. No post-hoc shifting.

 ```python
 # src/kernbench/policy/placement/dp.py (after)
@@ -182,14 +135,14 @@ def resolve_dp_policy(

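 ### D4. `_create_tensor`: direct placement in structural coordinates

-Continuation of ADR-0024 D11. Post-hoc shifting is removed; structural coordinates are specified
+Continuation of ADR-0024 D4. Post-hoc shifting is removed; structural coordinates are specified
 directly at the `resolve_dp_policy` call site.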

 ```python
 # context.py _create_tensor (after)
 current_sip = self.ahbm.current_device()
 if current_sip is None:
-    # Single-driver fallback (consistent with ADR-0024 D9).
+    # Single-driver fallback (consistent with ADR-0024 D2).
    # If launcher-based code forgets set_device(), the tensor silently lands on
    # SIP 0, so warn in debug mode.
    if os.environ.get("KERNBENCH_DEBUG"):
@@ -258,66 +211,6 @@ for sip_id in sip_range:
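 is recommended. A `PEIdentity` value object has the benefit of explicit typing, but it carries heavy
 boilerplate and is currently only the key of the allocator dict, so it would be over-engineering. Keep the tuple.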

-### D6. Migration: existing call sites
-
-**(A) Code that used `DPPolicy(sip=..., num_sips=..., ...)`**:
-
- `DPPolicy(sip="column_wise", cube=..., pe=...)` pattern → **rewrite that bench with the ADR-0024
-  launcher**. The worker selects the SIP with `set_device(rank)`; DPPolicy covers
-  cube/PE only.
- `DPPolicy(sip="replicate", num_sips=1, ...)` pattern → reduce to `DPPolicy(cube=..., pe=...)`
-  (natural, since the fields disappear).
-
-**(B) Code that read `dp.sip`, `dp.num_sips`**:
-
- Remove. Delete the `dp.sip` branch in `_compute_local_shape` of `launch()`.
- Also clean up the places where `pe_cpu.py` referenced `dp.sip`.
-
-**(C) Code that used `ShardSpec.pe_index`: all of it must be modified**:
-
- `.pe_index` access now raises `AttributeError` → every call site must be modified.
- Allocator lookup: `allocators[spec.pe_index]` →
-  `allocators[(spec.sip, spec.cube, spec.pe)]`
- Local contexts that really need a flat integer: compute `spec.sip * N_CUBES * N_PE + spec.cube *
-  N_PE + spec.pe` explicitly. **Use it only as a local variable and do not expose it in the
-  public API**.
-
-**Grep audit checklist before starting implementation**:
-
-1. **Property references**:
-   - `\.pe_index\b`: both field and property access (regex)
-   - `pe_index=`: keyword argument at construction
-   - `pe_index:`: dataclass field declaration
-2. **Allocator / dict indexing**:
-   - `allocators\[`: dict lookup pattern. Check whether things like `allocators[spec.pe_index]`
-     are caught
-   - `_allocators\[`: same pattern (prefix _)
-3. **Manual flat-index computation blocks**:
-   - `flat_idx =`
-   - `pe_index =` (left-hand side)
-   - `* pes_per_cube +` (typical flat computation pattern)
-   - `* self._num_cubes \* self._pes_per_cube` (global flat computation)
-4. **Serialization / logging**:
-   - `asdict(.*shard`: whether `pe_index` is automatically included in dataclass serialization
-   - `repr(.*ShardSpec`: whether log formats depend on it
-   - Whether JSON/YAML storage formats use a `pe_index` key
-5. **Tests asserting integer PE identity**:
-   - `assert .*pe_index`: integer-equality assertions
-   - `spec.pe_index ==`: comparisons (tests may break if the meaning becomes SIP-local)
-
-For each match, decide whether the caller expected global flat, SIP-local, or an internal lookup,
-and then replace it with structural coordinates.
-
-**(D) `test_sip_parallel.py`**:
-
- Keep the name; rewrite the content on ADR-0024's multi-greenlet launcher.
- Verify "SIP parallelism = one worker per rank × each with its own DPPolicy".
-
-**(E) `test_va_offset.py`, `benches/va_offset_verify.py`**:
-
- Most cases only use `num_sips=1`. Simply remove the field.
- If the SIP offset test is the point, port it to `set_device(rank)` + observing structural coordinates.
-
 ### D7. Backward compatibility: none (cleanup ADR)

 This ADR is a **breaking change**.
@@ -331,17 +224,6 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에
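 **Blocking silent drift** is the main benefit of removing the property entirely: it eliminates the possibility
 that code expecting a global flat index receives a SIP-local result and silently indexes wrongly.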

-### D8. Documentation updates
-
- `ADR-0008` (tensor deploy): add a note updating the meaning of DPPolicy and state the ShardSpec
-  move to structural coordinates
- State "intra-device only" in the DPPolicy docstring (docstring of the D1 code snippet)
- State in the ShardSpec docstring that **structural coordinates `(sip, cube, pe)` are used
-  directly and `pe_index` is no longer provided** (D2)
- Remove `sip=...` examples from tutorials such as `docs/ccl-author-guide`
-
---
-
 ## Dependencies

 - **ADR-0024** (launcher): `set_device(rank)` and current-device scoping
@@ -378,56 +260,6 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에

 ---

-## Test strategy
-
-### T1. Unit test updates
-
- `tests/test_tensor.py`, `tests/test_ipcq_types.py`, `tests/test_runtime_api_tensor.py`:
-  clean up DPPolicy constructor arguments, verify ShardSpec structural coordinates
- `tests/test_va_offset.py`: behavior preserved after removing `num_sips=1`
-
-### T2. `resolve_dp_policy` returns structural coordinates
-
-`tests/test_dp_policy.py` (new or extended):
- Every ShardSpec in the result of `resolve_dp_policy(dp, ..., target_sip=1)` has `sip=1`
- Each spec's `(cube, pe)` is local (0..num_cubes-1, 0..num_pe-1)
- On the same topology, results for `target_sip=0` and `target_sip=1` differ only in the sip field
-
-### T3. Rewrite `test_sip_parallel.py`
-
-SIP parallelism is verified through the launcher:
-
-```python
-def test_sip_parallel_via_launcher(topology):
-    ...
-    def worker(rank, ws, torch):
-        torch.ahbm.set_device(rank)
-        t = torch.zeros((1, 128), dtype="f16",
-                         dp=DPPolicy(cube="column_wise", pe="column_wise"))
-        # verify shard.sip == rank (structural coord)
-
-    spawn(worker, nprocs=n_sips, ...)
-```
-
-### T4. Allocator key migration
-
-`tests/test_allocator_structural_key.py` (new or extension of an existing file):
- The `PEMemAllocator` dict works with `(sip, cube, pe)` tuple keys
- `deploy_tensor` looks up the allocator by structural coordinates
- Same for `_free_tensor`
-
-### T5. E2E regression
-
-ADR-0024's `test_ccl_allreduce_matrix.py` passes unchanged.
-
-### T6. Error checks
-
- Calling `DPPolicy(sip="column_wise")` → `TypeError`. Pinned by a test.
- Calling `DPPolicy(num_sips=2)` → `TypeError`.
- Accessing `spec.pe_index` → `AttributeError` (verifies complete removal of the property).
-
---
-
 ## Consequences

 ### Positive
@@ -435,7 +267,7 @@ ADR-0024의 `test_ccl_allreduce_matrix.py` 그대로 통과.
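 - **Clear separation of concepts**: DPPolicy = intra-device, TP = inter-device.
 - **Simpler API**: DPPolicy constructor fields reduced by ~33%.
 - **Consistent structural coordinates**: ShardSpec is expressed as a `(sip, cube, pe)` tuple →
-  abstraction leakage resolved (satisfies the ADR-0024 D11 contract).
+  abstraction leakage resolved (satisfies the ADR-0024 D4 contract).
 - **Clear meaning of `pe_index`**: SIP-local is the single interpretation. If global flat is needed, state it explicitly.
 - **Consistent launcher model**: ADR-0024's "1 worker per SIP" model is the only mechanism
   controlling the SIP boundary.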
@@ -454,23 +286,3 @@ ADR-0024의 `test_ccl_allreduce_matrix.py` 그대로 통과.
 ### Neutral

 - The meaning of the existing `cube` / `pe` fields is unchanged.
-
---
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/policy/placement/dp.py` | D1: remove `sip`/`num_sips` / D2: add `sip`/`cube`/`pe` structural fields to `ShardSpec`, **remove the `pe_index` property** / D3: add `target_sip` to `resolve_dp_policy`, remove the SIP-level loop / also rename the shard type returned by the internal resolver to `local_pe` for clarity (avoids a name clash) |
-| `src/kernbench/runtime_api/context.py` | D4: `_create_tensor` passes `target_sip` / D5: `_ensure_allocators` dict key → `(sip, cube, pe)` tuple / remove the `dp.sip` branch in `launch` |
-| `src/kernbench/runtime_api/tensor.py` | D5: `deploy_tensor` looks up the allocator by structural coordinates |
-| `src/kernbench/components/builtin/pe_cpu.py` | D6: remove `dp.sip` references |
-| `src/kernbench/components/legacy/builtin/pe_cpu.py` | D6: same |
-| `benches/ccl_allreduce.py` | already handled in the ADR-0024 scope |
-| `benches/va_offset_verify.py` | D6: remove `num_sips=1` |
-| `tests/test_runtime_api_tensor.py` | D6 |
-| `tests/test_va_offset.py` | D6 |
-| `tests/test_tensor.py`, `test_ipcq_types.py` | D6 |
-| `tests/test_sip_parallel.py` | T3: rewrite on the launcher |
-| `tests/test_dp_policy.py` (new or extended) | T2 |
-| `tests/test_allocator_structural_key.py` (new) | T4 |
@@ -0,0 +1,888 @@
+# ADR-0027: Megatron-style Tensor Parallelism API
+
+## Status
+
+Accepted
+
+## Context
+
+### Goal
+
+Support tensor parallelism (TP) across SIPs through an API of **Megatron-LM-style explicit parallel layers**.
+Declarative abstractions such as DTensor are future work in a separate ADR (0028).
+
+Reasons for choosing the Megatron style:
+- TP occurs at specific layer boundaries of the model. Explicit primitives fit the mental model
+  naturally.
+- It is the industry standard established by NVIDIA Megatron / DeepSpeed.
+- DTensor is declarative, so its design space is larger → approach it in stages.
+
+### TP primitive spec (following Megatron-LM)
+
+- **ColumnParallelLinear**: shards the weight's **column (out_features)** axis across TP ranks.
+  The input is fully replicated and the output is column-sharded. When a RowParallelLinear
+  follows, no forward all-reduce is needed.
+- **RowParallelLinear**: shards the weight's **row (in_features)** axis across TP ranks.
+  The input is already column-sharded (the output of ColumnParallel). An
+  **all-reduce** is required at the end of forward.
+- **VocabParallelEmbedding**: shards the embedding along the vocab axis. All-reduce at the end
+  of forward. (A stub in the initial scope; the real implementation first needs an all-gather kernel.)
+- **`copy_to_tp_region`**, **`reduce_from_tp_region`**, **`scatter_to_tp_region`**,
+  **`gather_from_tp_region`**: the basic primitives.
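The Column → Row pairing described above can be checked numerically with plain lists. This is a toy sketch, not the KernBench layers: `matmul`, `split_cols`, `split_rows`, and `all_reduce_sum` are hypothetical helpers, the matrices are made-up values, and the nonlinearity that normally sits between the two layers is omitted.

```python
def matmul(a, b):
    """Plain list-of-rows matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_cols(m, parts):
    """Column-parallel sharding: split the out_features axis."""
    step = len(m[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Row-parallel sharding: split the in_features axis."""
    step = len(m) // parts
    return [m[i * step:(i + 1) * step] for i in range(parts)]


def all_reduce_sum(partials):
    """Element-wise sum of per-rank partial outputs (stand-in for all-reduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


tp = 2
x = [[1, 2]]                               # replicated input, shape (1, 2)
w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]          # first layer weight, shape (2, 4)
w2 = [[1, 0], [0, 1], [2, 1], [1, 1]]      # second layer weight, shape (4, 2)

w1_shards = split_cols(w1, tp)             # ColumnParallelLinear shards
w2_shards = split_rows(w2, tp)             # RowParallelLinear shards

# Each rank uses only its own shards; no communication between the two layers.
partials = [matmul(matmul(x, w1_shards[r]), w2_shards[r]) for r in range(tp)]
y_tp = all_reduce_sum(partials)            # single all-reduce at the end of forward
y_ref = matmul(matmul(x, w1), w2)          # unsharded reference
```

Rank `r` holds column block `r` of `w1`, so its intermediate output is exactly the slice that row block `r` of `w2` expects. That alignment is why no all-reduce is needed between the layers.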
+
+### Problems to solve
+
+1. **Generalizing worker-wait (D0)**: extend the defer/yield/drain pattern of `dist.all_reduce`
+   to every `ctx.wait` path. **This is the largest architectural decision in this ADR.**
+
+2. **Normalizing the launcher API (D1)**: current benches use hand-rolled greenlet loops.
+   Absorbing them into `torch.multiprocessing.spawn(fn, args, nprocs)` keeps the real-PyTorch API
+   surface and concentrates D0's scheduler drain in a single implementation site.
+
+3. **Expressing per-rank weight distribution**: each worker owns its own slice of the weight tensor.
+   This is expressed naturally with ADR-0024's `set_device(rank)` plus ADR-0026's intra-device
+   DPPolicy.
+
+4. **Forward-only scope**: KernBench currently has no backward pass (it exists for simulation).
+   This ADR supports **forward only** first. Training simulation is a separate ADR.
+
+5. **Where collectives are called**: RowParallelLinear calls `all_reduce` at the end of forward.
+   This works naturally on ADR-0024's multi-greenlet structure plus the D0 generalization.
+
+6. **TP group concept**: Megatron uses DP × TP × PP groups in combination. The initial scope
+   simplifies to **TP group = all SIPs**. Mixed DP+TP is future work.
+
+---
+
+## Decision
+
+### D0. Generalizing worker-wait: `ctx.wait` defers to main when called in a worker context
+
+**Restating the problem.** `kernel_runner.run` captures `greenlet.getcurrent()` at spawn time as
+the kernel greenlet's `_parent`
+([kernel_runner.py:94](src/kernbench/triton_emu/kernel_runner.py#L94)).
+When `env.run` runs in the main context, parent=main, which is safe. When `env.run` runs in a
+worker context, parent=worker, and the moment the worker yields or finishes, the kernel
+greenlet is orphaned → `GreenletExit` → the `ring_default_ws` failure of ADR-0024 Phase B.
+
+**Solution.** When a worker greenlet calls `ctx.wait(h)`, it **yields to the main scheduler**
+instead of driving `env.run` directly. Main drives `env.run`, and once the handle completes,
+control returns to the worker.
+
+#### D0.1 `RuntimeContext` extension
+
+```python
+# context.py
+@dataclass
+class RuntimeContext:
+    ...
+    _pending_worker_waits: list[RequestHandle] = field(default_factory=list, init=False)
+```
+
+#### D0.2 Worker fork in `ctx.wait`
+
+```python
+def wait(self, handle, *, _meta=None):
+    # Fast-path: already completed — skip enqueue + switch (consistent with
+    # D0.4-(3) idempotency). Avoids needless worker→main→worker round-trip
+    # and prevents redundant _pending_worker_waits growth.
+    if handle in self._completed:
+        completion, _trace = self.engine.get_completion(handle)
+        return completion
+
+    from greenlet import getcurrent
+    g = getcurrent()
+    if g.parent is not None and not g.parent.dead:
+        # Worker greenlet: defer to main. Push handle, yield to parent.
+        # Parent (scheduler loop) drains env.run, then switches back.
+        self._pending_worker_waits.append(handle)
+        g.parent.switch()
+        # On resume: handle must have completed (main drained the list).
+        # Fall through to the status-quo completion/trace assembly.
+
+    # Main context (or single-driver): drive engine directly.
+    wait_fn = getattr(self.engine, "wait", None)
+    if wait_fn is not None:
+        wait_fn(handle)
+    completion, trace = self.engine.get_completion(handle)
+    self._completed.add(handle)
+    if _meta is not None and trace is not None:
+        entry = dict(trace) if isinstance(trace, dict) else {"raw": trace}
+        entry.update(_meta)
+        self._traces.append(entry)
+    return completion
+```
+
+#### D0.3 Worker-context semantic contract of `ctx.wait` (normative)
+
+This ADR **explicitly changes** the semantics of `ctx.wait` in a worker context.
+
+- **Submit-vs-complete separation**: when called from a worker, `ctx.wait(h)` guarantees
+  completion "**after the next scheduler drain**", not "immediately". The point at which the
+  worker returns from `wait()` is the point at which main has finished `engine.wait` for that
+  handle. Calls from the main context remain immediately synchronous (status quo).
+- **Resume invariant (normative)**: in a worker-deferred `ctx.wait(h)`, at the moment
+  `g.parent.switch()` returns and the worker resumes, **`h in ctx._completed` must be
+  True**. If this invariant breaks, the worker proceeds to later steps in a
+  stale state, so any change to `_drain_pending`, the scheduler loop, or
+  `ctx.wait` must preserve it. T3.b asserts this
+  invariant directly.
+- **Observable change**: inside a worker, the pattern `h = ctx.submit(msg); ctx.wait(h);
+  read(handle_result)` still holds. The semantic spec, however, now includes the fact that a
+  main-drain is automatically inserted between `wait()` and `read`.
+- **Direct reads of host objects (see D0.5)**: the contract for calling `tensor.numpy()`
+  without `ctx.wait` is specified separately in D0.5.
+
+#### D0.4 Main scheduler drain: rules (normative)
+
+(Internal implementation of D1's `multiprocessing.spawn`. What follows is the semantic definition.)
+
+```python
+while alive:
+    for g in alive:              # (1) round-based worker switch
+        g.switch()
+    _drain_pending(ctx)           # (2) drain in main context
+```
+
+(See D0.5 for the actual definition of `_drain_pending`: an outer while-loop drains
+until both queues are empty.)
+
+**Rules**:
+
+1. **Round-based cooperative scheduling & the obligation to yield (worker contract)**.
+   `g.switch()` does not return until that worker **voluntarily yields**
+   (cooperative greenlet semantics). Therefore:
+   - If a worker runs a pure-compute loop such as `while True: do_compute()` without yielding,
+     `g.switch()` never returns and **the scheduler loop itself hard-blocks**
+     (other workers get no chance to be switched to, and no drain happens). This is
+     not starvation but **scheduler non-progress (equivalent to deadlock)**, which this
+     ADR defines as **unsupported**.
+   - A worker **must** call one of `ctx.wait(h)`, `dist.all_reduce`, or the host-read barrier
+     (D0.5) within a finite number of steps. A TP layer's `forward` includes a
+     launch→wait pair at the end of every layer, so it satisfies this condition naturally.
+     CCL kernels also yield inside `dist.all_reduce`.
+   - The implementation does not need to **detect** violations (timeouts, steps-since-yield
+     counters, and so on). This is a user contract, and the symptom of a violation is a "simulation hang".
+   - **Future extension**: if long non-collective compute paths become common, an
+     explicit `torch.distributed.cooperative_yield()` primitive (a no-op yield) could be
+     introduced. It is outside this ADR's scope and would not be a breaking change; it can be
+     added when needed.
+   - Within a round, every alive worker receives `switch` once. Even if one worker calls
+     wait several times within a single round, the handles are enqueued in order within that turn
+     and then processed together in a single scheduler drain (FIFO).
+
+2. **Drain order = submission order (FIFO)**. `_pending_worker_waits` is a strict FIFO via list
+   append/pop(0). It is drained in submission order, not completion order; because the
+   SimPy scheduler itself guarantees a causally correct completion order, draining in
+   submission order is safe. Do not confuse `completion order` with `drain order`.
+
+   **Two-queue ordering (worker waits → collectives)**: `_drain_pending` drains
+   the worker-wait queue first and the collective queue second. Rationale for this order:
+   - **The two queues have different dependency sources**: a worker wait is a handle the worker
+     itself created with a `submit + wait` pair (tensor deploy, MmuMap, etc.). The collective
+     queue holds kernel-launch handles that `dist.all_reduce` enqueued internally, and the
+     worker does not wait on them directly (see the two-queue drain model in D0.5).
+   - **Independent with respect to correctness**: from the worker's point of view a collective is
+     "already submitted, then yielded". Its completion only has to happen before the worker's
+     next action. There is no ordering dependency on the worker-wait queue.
+   - **Both complete within a single drain barrier**: under D0.5's loop-until-empty rule,
+     one barrier invocation drains worker → collective → (repeat if anything new
+     appeared). By the time the worker resumes, both queues are drained.
+   - **The alternative (collectives first) is also possible**: this ADR fixes worker-first only
+     for implementation simplicity; the two are semantically equivalent. Revisit if a difference in performance profile is observed.
+
+3. **Duplicate enqueue: correctness comes from idempotent drain; dedup is not guaranteed**.
+   `ctx.wait(h)` returns immediately if `h in ctx._completed`. `_drain_pending` has
+   the same guard. Even if the same handle is appended to `_pending_worker_waits`
+   several times, the actual `engine.wait` is called only once (idempotent).
+   - **Correctness**: relies on idempotent drain → safe.
+   - **Memory/performance**: this ADR does **not guarantee dedup** of
+     `_pending_worker_waits`. If the same handle is enqueued N times, the queue holds N elements
+     and the drain performs N pops plus in-set guard checks. Unless a single worker repeatedly
+     waits on the same handle, which is an abnormal pattern, N stays between 1 and a few.
+   - **Implementation freedom**: the implementation may optionally dedup (e.g. keep a `set` as a
+     side index, or check `h not in pending_set` before append). This is classified as an
+     optimization that does not change correctness.
+
+4. **Exception propagation + sibling cleanup**.
+   If a worker greenlet raises, `g.switch()` propagates the exception to main.
+   The scheduler loop stops immediately and performs the following cleanup **explicitly**:
+   ```python
+   try:
+       while True:
+           alive = [g for g in gs if not g.dead]
+           if not alive:
+               break
+           for g in alive:
+               if not g.dead:
+                   g.switch()
+           _drain_pending(ctx)
+   except Exception as outer:
+       # (a) Force-terminate surviving sibling worker greenlets.
+       for other in gs:
+           if not other.dead:
+               try:
+                   other.throw(SystemExit)
+               except Exception:
+                   pass          # silent: already handling an exception
+       # (b) Reset backend barrier / pending state (in preparation for a future epoch barrier).
+       backend = getattr(ctx.distributed, "_backend", None)
+       if backend is not None and hasattr(backend, "_barrier"):
+           backend._barrier.reset()
+       backend_pending = getattr(backend, "_pending_collective_handles", None)
+       if backend_pending is not None:
+           backend_pending.clear()
+       ctx._pending_worker_waits.clear()
+       # (c) 원인 예외는 SpawnException으로 래핑.
+       raise SpawnException(errors) from outer
+   ```
+
+   Rules:
+   - **Sibling abort guarantee**: when one worker raises, `SystemExit` is thrown
+     into every sibling greenlet, which terminates immediately. No greenlet leaks.
+   - **Explicit pending-queue clear**: both the worker-wait and the
+     collective-pending queues are emptied, which prevents contamination on reuse.
+   - **`SpawnException(errors)` wrapping**: `errors: dict[int, Exception]` holds
+     each rank's original exception, mirroring the failure pattern of
+     real-PyTorch `torch.multiprocessing.spawn`.
+     - **Scope limit**: `errors` contains **only ranks that raised from their own
+       code (root causes)**. Ranks terminated by `throw(SystemExit)` during
+       sibling cleanup do not appear in `errors` (SystemExit is not caught by the
+       `try/except Exception` in the D1.2 entry wrapper; this is intentional,
+       because sibling termination is a cleanup signal, not a failure). Stated
+       explicitly so readers do not expect every failed rank to be listed.
+   - **`ctx._traces` is a partial state up to the point of the exception**. Trace
+     completeness is not guaranteed (some launch/all_reduce calls may end
+     without leaving an entry).
+   - **Allocator / MemoryStore** keep their pre-exception state. Reuse is a
+     non-goal; creating a new `RuntimeContext` is recommended.
+   - **`join=False` / retry / partial recovery** are non-goals of this ADR.
+
+   `SpawnException` is defined in `runtime_api/multiprocessing.py`:
+
+   ```python
+   class SpawnException(RuntimeError):
+       def __init__(self, errors: dict[int, Exception]):
+           self.errors = errors
+           first = next(iter(errors.items()), None)
+           msg = (f"spawn failed on ranks {sorted(errors.keys())}"
+                  + (f": rank {first[0]} raised {first[1]!r}" if first else ""))
+           super().__init__(msg)
+   ```
+
+5. **Single-driver compatibility**. In main-only execution where
+   `g.parent is None` (legacy single-driver tests), the D0.2 worker-fork
+   condition is false, so the existing immediate-synchronous path is kept and
+   `_drain_pending` is never called.
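+
+A minimal sketch of the dedup tradeoff from item 3 of D0.4. `PendingWaits`,
+`enqueue`, `drain`, and `run` are hypothetical names invented for this example
+and are not part of KernBench. The sketch only aims to show that the number of
+`engine.wait` calls is the same with or without dedup, while the queue length
+differs.
+
+```python
+class PendingWaits:
+    """Hypothetical stand-in for ctx._pending_worker_waits with optional dedup."""
+
+    def __init__(self, dedup=False):
+        self._queue = []
+        self._seen = set() if dedup else None   # side index, only when dedup is on
+
+    def enqueue(self, handle):
+        if self._seen is not None:
+            if handle in self._seen:
+                return                          # already queued: skip the duplicate
+            self._seen.add(handle)
+        self._queue.append(handle)
+
+    def drain(self, completed, engine_wait):
+        """Pop every handle; engine_wait runs at most once per handle."""
+        while self._queue:
+            h = self._queue.pop(0)
+            if self._seen is not None:
+                self._seen.discard(h)
+            if h not in completed:              # same guard as the ADR's drain
+                engine_wait(h)
+                completed.add(h)
+
+
+def run(dedup):
+    """Enqueue h1 three times and h2 once; return (queue length, wait calls)."""
+    calls, completed = [], set()
+    q = PendingWaits(dedup=dedup)
+    for h in ["h1", "h1", "h2", "h1"]:
+        q.enqueue(h)
+    queued = len(q._queue)
+    q.drain(completed, calls.append)
+    return queued, calls
+```
+
+Under these assumptions, `run(False)` queues four elements and `run(True)`
+queues two, and both issue exactly one wait per distinct handle.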
+
+#### D0.5 Host-read barrier: decision (normative)
+
+Inside a worker, **host-observable reads** such as `tensor.numpy()`,
+`tensor.__getitem__`, and `tensor.data` are defined as an **automatic drain
+barrier**. Immediately before the call:
+
+1. If `ctx._pending_worker_waits` or `backend._pending_collective_handles` is
+   non-empty, the worker yields to main via `g.parent.switch()`; main runs
+   `_drain_pending`, then the worker resumes.
+2. If both queues are empty, the read proceeds immediately.
+
+**Barrier repetition rule (normative, re-entrance)**: `_drain_pending` drains in
+a while-loop **until both queues are completely empty**. It is not a single pass:
+
+```python
+def _drain_pending(ctx):
+    while ctx._pending_worker_waits or (
+        ctx.distributed._backend
+        and ctx.distributed._backend._pending_collective_handles
+    ):
+        while ctx._pending_worker_waits:
+            h = ctx._pending_worker_waits.pop(0)
+            if h not in ctx._completed:
+                ctx.engine.wait(h)
+        backend = ctx.distributed._backend
+        if backend is not None:
+            while backend._pending_collective_handles:
+                h, _sip_id, meta = backend._pending_collective_handles.pop(0)
+                ctx.wait(h, _meta=meta)  # main context: safe; ctx.wait does
+                                          # not push back onto pending
+```
+
+**Main-context ctx.wait non-recursion invariant (normative)**: the
+`ctx.wait(h, _meta=meta)` call inside `_drain_pending` runs in the main greenlet
+context. The D0.2 worker-fork condition
+(`g.parent is not None and not g.parent.dead`) is False there, so the call takes
+the immediate-synchronous path and **never enqueues onto
+`_pending_worker_waits`**. Thanks to this invariant, the drain loop finishes
+without recursion or queue regrowth. Implementations must keep
+`g.parent is None` true for the single main greenlet.
+
+**Why a loop**: `ctx.wait(h, _meta=meta)` is called from the main context, so per
+D0.2 it **drives the engine directly** (no extra enqueue, by the invariant
+above). A single pass is therefore sufficient in theory, but the rule is fixed
+as **loop-until-empty**, for these reasons:
+
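+1. **Safety under future extension**: a later implementation may enqueue new
+   pending entries during drain (for example, a tree-reduce collective with
+   sub-handles). The loop rule keeps correctness in that case too.
+2. **Readability**: the meaning closes in one sentence, "the barrier drains
+   until pending is empty", without depending on the non-trivial invariant that
+   `ctx.wait` never enqueues.
+3. **The barrier's semantics are "every dependency needed by this read has
+   completed"**: in the current model all pending entries are exactly all
+   dependencies, so "drain everything" and "drain the dependencies" coincide.
+   The user's mental model is the dependency-based one.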
+
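+**Termination guarantee**: described for two regimes.
+
+- **Current implementation**: when called from the main context, `ctx.wait`
+  drives the engine directly (D0.2) and enqueues no new pending entries. Each
+  iteration strictly shrinks pending via `pop(0)` + `engine.wait`, so the
+  iteration count is **bounded by the initial pending size** and the loop
+  terminates.
+- **Future extension (the bound that justifies the loop rule)**: if an
+  implementation that enqueues new pending entries during drain (for example,
+  tree-reduce sub-handles) is introduced, the initial-size bound no longer
+  holds. However, SimPy causality guarantees that handle dependencies form a
+  finite DAG, so the **nesting depth is finite**. The loop rule covers this
+  case automatically.
+
+Both regimes rule out an infinite loop. The single-pass bound of the current
+implementation is only a reference value for aggressive optimization; the rule
+stays loop-until-empty.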
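+
+The two regimes can be illustrated with a toy model. This is only a sketch:
+`drain_pending`, `make_ctx`, and the attribute names are hypothetical, plain
+lists stand in for the real queues, and adding a handle to `completed` stands in
+for `engine.wait`. The second element of each collective tuple models the
+"future extension" case, where finishing a collective exposes sub-handles.
+
+```python
+from types import SimpleNamespace
+
+
+def drain_pending(ctx):
+    """Loop-until-empty drain over both queues; returns the number of outer passes."""
+    passes = 0
+    backend = ctx.backend
+    while ctx.pending_worker_waits or backend.pending_collectives:
+        passes += 1
+        while ctx.pending_worker_waits:
+            h = ctx.pending_worker_waits.pop(0)
+            if h not in ctx.completed:
+                ctx.completed.add(h)            # stands in for engine.wait(h)
+        while backend.pending_collectives:
+            h, children = backend.pending_collectives.pop(0)
+            ctx.completed.add(h)                # stands in for ctx.wait(h)
+            # Future-extension case: a finished collective enqueues sub-handles.
+            ctx.pending_worker_waits.extend(children)
+    return passes
+
+
+def make_ctx(worker_waits, collectives):
+    """Build a toy context; collectives are (handle, [sub-handles]) tuples."""
+    backend = SimpleNamespace(pending_collectives=list(collectives))
+    return SimpleNamespace(pending_worker_waits=list(worker_waits),
+                           completed=set(), backend=backend)
+```
+
+With no sub-handles the loop finishes in one pass (the current regime). With
+sub-handles a second pass picks up the newly enqueued entries, which a
+single-pass drain would have left behind.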
+
+**Why implicit drain at read is the right choice**:
+
+- The earlier open question was a choice between (a) implicit drain and (b) an
+  explicit barrier. (b) is clear, but it forces TP-layer users to write the
+  three-step `out = fc1.forward(x); ctx.drain(); result = out.numpy()` every
+  time. (a) is a single rule, "a read always observes up-to-date values", which
+  matches the CUDA pattern of calling `cudaDeviceSynchronize` before a host
+  copy. It is not a hidden rule but the **contract of named entry points**.
+- This ADR adopts (a) and **explicitly closes** the list of entry points:
+  `Tensor.numpy()`, `Tensor.data` (numpy alias), `Tensor.__getitem__`, and
+  `Tensor.__repr__` (when it includes data). Any other official host-read API
+  is confirmed by a codebase search at implementation time. Every host-read API
+  added later must follow this contract (tests guard against regression).
+- Calling `numpy` directly after `ctx.submit` without a `wait` also triggers the
+  drain barrier (the handle is in the pending queue). Even if the user omits
+  the explicit wait, the invariant is restored at read time.
+
+**`Tensor.copy_(source)`: write barrier rule**:
+
+`copy_` is semantically a write to the target, but internally it calls
+`source.numpy()` to fetch the source data on the host and then writes each shard
+with `target._memory_store.write(...)`. Both directions are barriered:
+
+1. **Source side (read barrier)**: `source.numpy()` triggers the D0.5 read
+   barrier (when the source is itself a deployed tensor and pending entries exist).
+2. **Target side (write barrier, based on global pending)**: on entry to
+   `copy_`, if `ctx._pending_worker_waits` or
+   `backend._pending_collective_handles` is non-empty, drain via
+   `g.parent.switch()` before writing. **The criterion is the global pending
+   queue, not per-tensor / per-shard dependency tracking**.
+   - Why global: KernBench's handle representation carries no back-reference
+     saying which shard of the target a handle writes. The safe, conservative
+     rule is "drain if anything is pending globally". As a result, **pending
+     work on an unrelated tensor can also block `copy_`**; the drop-in
+     invariant takes priority.
+   - **Explicit tradeoff**: this rule can introduce unnecessary serialization
+     between independent tensors. Under the current single-queue execution
+     model the cost is acceptable: guaranteeing cross-rank correctness and the
+     "up to date when read" invariant with one simple rule matters more.
+   - Practical impact: within one layer step, a worker's pending entries are
+     mostly its own work, and the extra context switch from over-barriering
+     usually coincides with the end-of-round scheduler drain, so it is rarely
+     a problem.
+   - Future refinement: per-tensor pending tracking could narrow this rule,
+     but that is outside the scope of this ADR.
+
+**Non-barrier**:
+
+- **Metadata-only** accesses such as `tensor.shape`, `tensor.dtype`, and
+  `tensor.name` do not drain. They have no data dependency.
+- Raw address accessors such as `tensor.pa` and `tensor.va` do not drain either
+  (addresses only, not contents).
+
+**Official barrier entry points (closed set)**:
+
+| API | Kind | Rationale |
+|---|---|---|
+| `Tensor.numpy()` | read | host-observable copy |
+| `Tensor.data` | read | `numpy()` alias |
+| `Tensor.__getitem__` | read | shard-aligned read |
+| `Tensor.__repr__` (when it includes data) | read | debugging/log |
+| `Tensor.copy_(source)` | read + write | source read + target write |
+
+T5/T6 verify this contract directly.
+
+#### D0.6 Why the worker function API is unchanged (informative)
+
+- Internally, `torch.zeros(...)` is a `self.submit(msg)` + `self.wait(h)` pair.
+  Per D0.2/D0.3, `wait` automatically defers to main, so the call looks
+  synchronous from the outside but yields once.
+- `tensor.numpy()` is a host-read barrier per D0.5: drain then read if anything
+  is pending, otherwise read immediately.
+- `dist.all_reduce` keeps using the existing `_defer_wait=True` +
+  `_pending_collective_handles` path. The D0.4 drain handles both queues together.
+
+#### D0.7 Invariants
+
+- **The `_parent` of a kernel greenlet is always main**, because env.run never
+  runs in a worker context. (The core assertion of T3.)
+- **Cross-rank synchronization point**: drain happens only after every worker
+  has yielded, so the kernels of all ranks advance together within one round (a
+  prerequisite for cross-rank IPCQ exchange).
+- **Single-driver compatibility**: D0.4-(5).
+
+### D1. `torch.multiprocessing.spawn(fn, args, nprocs)`
+
+Real-PyTorch API parity, plus the single place where D0's scheduler loop is implemented.
+
+#### D1.0 API parity only, not execution parity (normative)
+
+The name `torch.multiprocessing.spawn` is limited to **API signature parity**.
+The actual execution model is a **cooperative greenlet scheduler** (one Python
+process, one OS thread, the round-robin drive of D0.4). The following are
+**properties this ADR does not provide**; they are guarantees of real-PyTorch
+`torch.multiprocessing.spawn` that are explicit **non-goals** here:
+
+- Process isolation (an independent OS process per rank).
+- Independent address spaces (each rank owning its own Python heap).
+- Failure isolation (a hard crash in one rank not affecting the others).
+- OS-level scheduler fairness (preemptive time slicing between ranks).
+- Inter-process primitives such as `mp.Queue` and `mp.Lock`.
+
+Actual properties of this implementation:
+
+- Every rank is a greenlet inside the same Python process. Shared global state
+  is visible as-is (an intended simulation convenience).
+- Single thread under the GIL, so no parallel execution. Only "logical
+  concurrency" is reproduced, through SimPy event ordering.
+- An unhandled exception in one worker aborts the whole simulation (D0.4-(4)).
+
+**Caller obligations**: when porting a real-PyTorch multi-process sample to
+KernBench, logic that depends on process isolation (for example `os.getpid`,
+independent temporary files, signal handling) must be removed. The namespace
+name is kept for code portability; the semantics differ.
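+
+To make the "cooperative, single-thread" execution model concrete, here is a
+toy scheduler. It is a sketch under loud assumptions: Python generators stand
+in for greenlets, each `yield` stands in for a wait or collective, and
+`SpawnError`, `toy_spawn`, and `worker` are hypothetical names. It is not the
+KernBench implementation.
+
+```python
+class SpawnError(RuntimeError):
+    """Hypothetical analogue of SpawnException: rank -> root-cause exception."""
+
+    def __init__(self, errors):
+        self.errors = errors
+        super().__init__(f"spawn failed on ranks {sorted(errors)}")
+
+
+def toy_spawn(fn, args, nprocs):
+    """Round-robin drive generator workers until all of them finish."""
+    workers = {rank: fn(rank, *args) for rank in range(nprocs)}
+    while workers:
+        for rank in list(workers):
+            try:
+                next(workers[rank])             # run until the worker's next yield
+            except StopIteration:
+                del workers[rank]               # worker finished normally
+            except Exception as exc:
+                for other in workers.values():
+                    other.close()               # sibling cleanup, not a failure
+                raise SpawnError({rank: exc}) from exc
+
+
+def worker(rank, log, fail_rank):
+    """Two-step worker; the rank equal to fail_rank raises in its second step."""
+    log.append((rank, "step0"))
+    yield
+    if rank == fail_rank:
+        raise ValueError(f"rank {rank} failed")
+    log.append((rank, "step1"))
+    yield
+```
+
+Because all ranks interleave in one thread, a failure in one rank stops the
+others before they reach their next step, which is the opposite of the failure
+isolation that separate OS processes would give.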
+
+#### D1.1 Public surface
+
+```python
+# runtime_api/multiprocessing.py (new)
+class _MultiprocessingNamespace:
+    def __init__(self, ctx):
+        self._ctx = ctx
+
+    def spawn(self, fn, args: tuple, nprocs: int, join: bool = True) -> None:
+        """Spawn `nprocs` worker greenlets, each calling fn(rank, *args).
+
+        Mirrors the torch.multiprocessing.spawn signature (minus `daemon` and
+        `start_method`).
+        Drives the D0 scheduler loop until all workers finish.
+        """
+        ...
+```
+
+#### D1.2 Implementation
+
+```python
+def spawn(self, fn, args, nprocs, join=True):
+    from greenlet import greenlet
+    ctx = self._ctx
+    dist = ctx.distributed
+    gs: list[greenlet] = []
+    errors: dict[int, Exception] = {}
+    for rank in range(nprocs):
+        def _entry(r=rank):
+            try:
+                fn(r, *args)
+            except Exception as e:
+                errors[r] = e
+                raise
+        g = greenlet(_entry)
+        dist._bind_rank(g, rank)
+        gs.append(g)
+
+    try:
+        while True:
+            alive = [g for g in gs if not g.dead]
+            if not alive:
+                break
+            for g in alive:
+                if not g.dead:
+                    g.switch()
+            _drain_pending(ctx)       # D0.5
+    except Exception as outer:
+        # Sibling cleanup per D0.4-(4)
+        for other in gs:
+            if not other.dead:
+                try:
+                    other.throw(SystemExit)
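+                except BaseException:
+                    # SystemExit propagates back from the dying greenlet and is
+                    # not an Exception subclass, so catch BaseException here.
+                    pass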
+        backend = getattr(dist, "_backend", None)
+        if backend is not None:
+            if hasattr(backend, "_barrier"):
+                backend._barrier.reset()
+            if getattr(backend, "_pending_collective_handles", None) is not None:
+                backend._pending_collective_handles.clear()
+        ctx._pending_worker_waits.clear()
+        raise SpawnException(errors) from outer
+    # `join=True` semantics: we already wait for all workers.
+```
+
+#### D1.3 `torch` namespace attach
+
+In `__post_init__` of `runtime_api/context.py`:
+```python
+self.multiprocessing = _MultiprocessingNamespace(self)
+```
+
+→ bench code calls `torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws)`.
+
+#### D1.4 Migrating existing benches
+
+The hand-rolled loop in `benches/ccl_allreduce.py` shrinks to a single
+`torch.multiprocessing.spawn` call. The existing matrix regression is kept
+as-is. `ring_default_ws`, currently xfail, is expected to turn into PASS thanks
+to D0 (workers no longer orphan kernel greenlets).
+
+### D2. New package `kernbench.tp`
+
+```
+src/kernbench/tp/
+    __init__.py          # public API re-exports
+    parallel_state.py    # TP group management (currently a single global group)
+    layers.py            # ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding
+    primitives.py        # copy/reduce/scatter/gather_to/from_tp_region
+    kernels.py           # gemm kernel launched by the TP layers (reusable)
+    mappings.py          # forward identity/all_reduce, backward stub
+```
+
+### D3. `parallel_state` — TP group
+
+```python
+# parallel_state.py
+_TP_WORLD_SIZE = None
+
+def initialize_model_parallel(tensor_model_parallel_size: int) -> None:
+    """Initialize TP group. Must be called after dist.init_process_group."""
+    global _TP_WORLD_SIZE
+    from kernbench.runtime_api.distributed import get_dist  # or torch.distributed
+    dist = get_dist()
+    total = dist.get_world_size()
+    if tensor_model_parallel_size != total:
+        raise NotImplementedError(
+            "Only TP == world_size supported in initial scope"
+        )
+    _TP_WORLD_SIZE = tensor_model_parallel_size
+
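+def get_tensor_model_parallel_world_size() -> int:
+    if _TP_WORLD_SIZE is None:
+        raise RuntimeError("initialize_model_parallel() has not been called")
+    return _TP_WORLD_SIZE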
+
+def get_tensor_model_parallel_rank() -> int:
+    from kernbench.runtime_api.distributed import get_dist
+    return get_dist().get_rank()         # ADR-0024 greenlet-local rank
+```
+
+Initial scope: TP size = world_size = topology SIP count. Pure TP model.
+
+### D4-pre. TP shard ownership vs DPPolicy: separation of roles (normative)
+
+Two concepts are kept strictly apart in how a TP layer represents weights and outputs:
+
+| Concept | Decided by | Scope |
+|---|---|---|
+| **TP shard ownership** (which rank owns which slice of the weight) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D2/D3) | **cross-rank, cross-SIP** |
+| **Intra-rank placement** (how the owned slice is distributed over cube × PE inside the rank) | `DPPolicy(cube=..., pe=...)` (ADR-0026) | **inside one rank (within the SIP boundary)** |
+
+So when `ColumnParallelLinear` creates a weight of shape
+`(in_features, out_features // ws)` and assigns
+`DPPolicy(cube="column_wise", pe="column_wise")`:
+
+- The slice owned by **rank r** is columns [r * k_local, (r+1) * k_local) of the
+  weight. **set_device(r)** decides this (the rank lives on SIP r).
+- The column-wise distribution over cube × PE **inside that slice** is decided
+  by **DPPolicy**.
+
+The two axes are **independent**. If two ranks build their slices with the same
+DPPolicy, the slices live on different SIPs but share the same intra-SIP
+placement pattern. Conversely, switching DPPolicy to
+`cube="replicate", pe="replicate"` keeps TP shard ownership and changes only
+the intra-rank placement.
+
+**Mistakes that blur this boundary** (forbidden by this ADR):
+
+- A "SIP axis" reappearing in DPPolicy (removed in ADR-0026).
+- A TP layer expressing cross-rank sharding with `DPPolicy` alone, without
+  `set_device`, which is indistinguishable from slicing vertically inside a
+  single rank.
+
+A TP layer in this ADR always treats weights and outputs as "rank = SIP = owner
+of one slice, with DPPolicy distributing it intra-SIP".
+
+### D4. `ColumnParallelLinear`
+
+**Important**: no new host-side `torch.matmul` abstraction is introduced. A
+layer's forward calls the existing gemm kernel through
+`torch.launch("gemm", gemm_kernel, ...)`, the pattern KernBench benches already use
+([benches/gemm_single_pe.py](benches/gemm_single_pe.py),
+[benches/gpt3_qkv.py](benches/gpt3_qkv.py)).
+
+```python
+# layers.py
+from kernbench.policy.placement.dp import DPPolicy
+from kernbench.tp.kernels import _gemm_kernel
+from kernbench.tp.parallel_state import (
+    get_tensor_model_parallel_rank,
+    get_tensor_model_parallel_world_size,
+)
+
+class ColumnParallelLinear:
+    """Shards the K (out_features) axis of the weight across TP ranks.
+
+    forward(x):
+        x: (M, N), fully replicated across ranks
+        W_k: (N, K / world_size), rank-local slice (placed on SIP r by set_device)
+        y_k = x @ W_k → (M, K / world_size), rank-local output
+
+    The output is column-sharded, the input layout RowParallelLinear expects.
+    """
+
+    def __init__(self, in_features: int, out_features: int, bias: bool = False,
+                 dtype: str = "f16", torch=None):
+        ws = get_tensor_model_parallel_world_size()
+        assert out_features % ws == 0
+        self.in_features = in_features
+        self.k_local = out_features // ws
+        self._torch = torch
+        # Each rank owns its slice; set_device(rank) places it on SIP r.
+        self.weight = torch.zeros(
+            (in_features, self.k_local), dtype=dtype,
+            dp=DPPolicy(cube="column_wise", pe="column_wise"),
+            name="col_parallel_w",
+        )
+        self.bias = None
+        if bias:
+            self.bias = torch.zeros(
+                (self.k_local,), dtype=dtype,
+                dp=DPPolicy(cube="replicate", pe="replicate"),
+                name="col_parallel_b",
+            )
+
+    def forward(self, x):
+        # x is fully replicated (guaranteed by the caller). Plain local gemm.
+        M = x.shape[0]
+        out = self._torch.empty(
+            (M, self.k_local), dtype=x.dtype,
+            dp=DPPolicy(cube="column_wise", pe="column_wise"),
+            name="col_parallel_out",
+        )
+        self._torch.launch(
+            "col_parallel_gemm", _gemm_kernel,
+            x, self.weight, out, M, self.in_features, self.k_local,
+        )
+        # The bias add is a separate kernel or a fused bias in a composite gemm.
+        # In the initial scope only bias=False is verified thoroughly.
+        return out
+```
+
+**Yield-safety contract (normative)**: `ColumnParallelLinear.forward` makes a
+single `torch.launch` call, which includes the kernel launch → internal
+`ctx.wait` pair. This automatically satisfies the D0.4-(1) condition "a worker
+yields within a finite number of steps", so TP-layer users never need to insert
+yield patterns by hand.
+
+### D5. `RowParallelLinear`
+
+```python
+class RowParallelLinear:
+    """Shards the N (in_features) axis of the weight across TP ranks.
+
+    forward(x):
+        x: (M, N / world_size), rank-local slice (the ColumnParallel output)
+        W_k: (N / world_size, K), rank-local slice
+        y_k = x @ W_k → (M, K), partial sum on each rank
+        y = all_reduce(y_k, op="sum") → (M, K) on every rank
+    """
+
+    def __init__(self, in_features: int, out_features: int, bias: bool = False,
+                 dtype: str = "f16", torch=None):
+        ws = get_tensor_model_parallel_world_size()
+        assert in_features % ws == 0
+        self.n_local = in_features // ws
+        self.out_features = out_features
+        self._torch = torch
+        self.weight = torch.zeros(
+            (self.n_local, out_features), dtype=dtype,
+            dp=DPPolicy(cube="column_wise", pe="column_wise"),
+            name="row_parallel_w",
+        )
+        # bias lives on rank 0 only (Megatron convention). Omitted in the initial scope.
+        self.bias = None
+
+    def forward(self, x):
+        M = x.shape[0]
+        y_partial = self._torch.empty(
+            (M, self.out_features), dtype=x.dtype,
+            dp=DPPolicy(cube="column_wise", pe="column_wise"),
+            name="row_parallel_partial",
+        )
+        self._torch.launch(
+            "row_parallel_gemm", _gemm_kernel,
+            x, self.weight, y_partial, M, self.n_local, self.out_features,
+        )
+        # Cross-rank reduce. dist.all_reduce from ADR-0024 works correctly under
+        # D0 + mp.spawn (the kernel parent stays main).
+        self._torch.distributed.all_reduce(y_partial, op="sum")
+        return y_partial
+```
+
+**Yield-safety contract (normative)**: `RowParallelLinear.forward` includes the
+launch → internal wait followed by `all_reduce` (defer + worker-yield pattern),
+so **at least 2 yields** per forward are guaranteed and the D0.4-(1) scheduler
+progress condition holds automatically. Every TP layer forward in this ADR keeps
+the invariant "contains at least one wait or collective and is therefore
+yield-safe"; TP primitives added later (VocabParallelEmbedding, etc.) must
+honor the same contract.
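+
+The sharding arithmetic behind D4 and D5 can be checked with a small host-only
+sketch. Everything here is an assumption made for illustration: plain Python
+lists replace device tensors, an element-wise sum replaces
+`dist.all_reduce(op="sum")`, and `matmul`, `col_slice`, `row_slice`, and
+`tp_mlp` are hypothetical helper names. The point is only that a
+column-parallel layer followed by a row-parallel layer and a sum reproduces the
+unsharded product `x @ W1 @ W2`.
+
+```python
+def matmul(a, b):
+    """Plain-Python matrix product of two lists of rows."""
+    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
+
+
+def col_slice(w, rank, ws):
+    """Columns [rank*k, (rank+1)*k) of w: the ColumnParallelLinear shard."""
+    k = len(w[0]) // ws
+    return [row[rank * k:(rank + 1) * k] for row in w]
+
+
+def row_slice(w, rank, ws):
+    """Rows [rank*n, (rank+1)*n) of w: the RowParallelLinear shard."""
+    n = len(w) // ws
+    return w[rank * n:(rank + 1) * n]
+
+
+def tp_mlp(x, w1, w2, ws):
+    """Per-rank fc1 then fc2, followed by an element-wise sum (the all_reduce)."""
+    partials = []
+    for rank in range(ws):
+        h = matmul(x, col_slice(w1, rank, ws))               # (M, hidden / ws)
+        partials.append(matmul(h, row_slice(w2, rank, ws)))  # (M, out) partial sum
+    # Sum the per-rank partial matrices element by element.
+    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
+```
+
+Integer inputs keep the comparison exact; with f16 data the sharded and
+unsharded results would agree only up to a numerical tolerance.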
+
+### D6. Primitive functions
+
+```python
+# primitives.py
+def copy_to_tp_region(x):
+    """Forward: identity. Backward: all-reduce. (Implemented when training is added.)"""
+    return x
+
+def reduce_from_tp_region(x, torch):
+    """Forward: all-reduce. Backward: identity."""
+    torch.distributed.all_reduce(x, op="sum")
+    return x
+
+def scatter_to_tp_region(x):
+    raise NotImplementedError(
+        "Phase 2: replaced by the user creating an already-sharded tensor"
+    )
+
+def gather_from_tp_region(x):
+    raise NotImplementedError(
+        "Phase 2: requires an all-gather kernel first (future)"
+    )
+```
+
+### D7. Sample bench: 2-layer MLP with TP
+
+```python
+# benches/tp_mlp.py (new)
+from kernbench.policy.placement.dp import DPPolicy
+import kernbench.tp as tp
+import numpy as np
+
+
+def worker(rank: int, world_size: int, torch):
+    torch.ahbm.set_device(rank)
+    tp.initialize_model_parallel(world_size)
+
+    B, D_in, D_hidden, D_out = 1, 512, 2048, 512
+    fc1 = tp.ColumnParallelLinear(D_in, D_hidden, torch=torch)
+    fc2 = tp.RowParallelLinear(D_hidden, D_out, torch=torch)
+
+    x = torch.zeros(
+        (B, D_in), dtype="f16",
+        dp=DPPolicy(cube="replicate", pe="replicate"),
+        name="x",
+    )
+    # init x with some pattern (e.g., constant)
+    x.copy_(torch.from_numpy(np.full((B, D_in), 0.1, dtype=np.float16)))
+
+    h = fc1.forward(x)      # column-sharded (B, D_hidden / ws)
+    y = fc2.forward(h)      # all-reduced (B, D_out) on every rank
+
+    # Only rank 0 prints / verifies the result
+    if rank == 0:
+        result = y.numpy()
+        # With zero-init weights every value is 0; in this scope the bench only
+        # verifies that the run completes.
+        print(f"  tp_mlp: shape={result.shape}, mean={float(result.mean()):.4f}")
+
+
+def run(torch):
+    torch.distributed.init_process_group(backend="ahbm")
+    ws = torch.distributed.get_world_size()
+    torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws)
+```
+
+### D8. Non-functional: no training support
+
+This ADR covers **inference/forward only**. Backward / gradient / optimizer are
+future work, which is natural because KernBench does not do training today.
+
+### D9. Initial scope constraints
+
+- TP size = world_size (no mixed DP+TP).
+- `scatter_to_tp_region` and `gather_from_tp_region` are unimplemented.
+- **Weights default to zero**. Proper init schemes (Xavier, Kaiming, etc.) are
+  future work. Tests, however, inject a deterministic non-zero pattern with
+  `tensor.copy_` to verify numerical correctness (T2/T6). In short: production
+  default = zero, verification = deterministic non-zero.
+- Bias is omitted in the initial scope (Megatron's rank-0-only bias policy is
+  future work).
+- Pipeline parallelism is out of scope.
+- VocabParallelEmbedding needs all-gather first, so it is stub only.
+
+### D10. Regression: lifting the `ring_default_ws` xfail (required acceptance)
+
+With D0 (worker-wait generalization) + D0.5 (host-read barrier), every
+worker-driven `ctx.wait` and host read is routed through the main-drain path,
+which eliminates the kernel-greenlet orphan cause from ADR-0024 Phase B.
+Turning the `ring_default_ws` strict-xfail case of the existing matrix test
+into **PASS** after this ADR is implemented is a **required regression
+criterion**. The observable acceptance criteria are specified in **T7** (no
+deadlock, no GreenletExit, numerical tolerance, etc.).
+
+---
+
+## Dependencies
+
+- **ADR-0024** (launcher): rank = SIP, greenlet-local rank,
+  `torch.ahbm.set_device(rank)`.
+- **ADR-0026** (DPPolicy intra-device): per-rank slice representation of weight tensors.
+- **ADR-0023 / ADR-0025** (IPCQ): the basis of the `dist.all_reduce` implementation.
+
+---
+
+## Non-goals
+
+- **Backward pass / training**: inference only. Training simulation is a separate ADR.
+- **Mixed parallelism (DP + TP + PP)**: pure TP only at first.
+- **Weight init schemes**: plain zero / debug pattern.
+- **Fused ops**: Megatron's fused matmul+bias+gelu is a kernel-level concern.
+- **DTensor integration**: future, ADR-0028.
+- **Host-side `torch.matmul` abstraction**: TP layers call the existing gemm
+  kernel via `torch.launch(gemm_kernel, ...)`. No new matmul host-op is introduced.
+
+---
+
+## Open questions
+
+- **Location of `initialize_model_parallel`**: `kernbench.tp.initialize_model_parallel`
+  (current decision) vs real-PyTorch's `torch.distributed.init_device_mesh`.
+  Kept in the TP-specific module.
+- **Weight init**: the ADR uses zero. A debug pattern (e.g., identity) may be
+  needed for meaningful verification; add it in Phase 1 tests if required.
+- **Bias placement policy**: Megatron keeps the RowParallelLinear bias on rank 0
+  only. The initial scope sidesteps this with bias=False.
+- **GEMM kernel location**: `kernbench.tp.kernels._gemm_kernel` vs importing from
+  the existing `benches/gemm_single_pe.py`. TP must not depend on benches, so
+  the kernel is duplicated inside tp. It can move to a shared
+  `kernbench.kernels` package later.
+
+**Resolved (open in earlier revisions)**:
+- ~~Drain timing when `tensor.numpy()` is called~~ → **decided in D0.5**: the
+  official host-read entry points (`numpy`, `data`, `__getitem__`, and
+  `__repr__` when it includes data) are automatic drain barriers. Metadata-only
+  accessors are not barriers.
+
+---
+
+## Consequences
+
+### Positive
+
+- **Easy to port Megatron code**: the API matches real training code.
+- **TP benchmarks become possible**: studying HW characteristics such as scaling
+  and communication-compute overlap.
+- **`ring_default_ws` xfail lifted**: as a by-product of D0, the ADR-0024 Phase B
+  blocker is resolved.
+- **Unified scheduler loop**: introducing D1 (`mp.spawn`) removes the hand-rolled
+  loop. Later collective/TP benches reuse the same pattern.
+- **Clearer DPPolicy semantics** (synergy with ADR-0026): TP layers are a model
+  example of using only intra-device DPPolicy.
+
+### Negative
+
+- Maintenance cost of a new module (`kernbench.tp`).
+- The initial scope is limited (pure TP only, forward only).
+- The D0 generalization changes the semantics of `ctx.wait`; compatibility with
+  single-driver tests must be verified explicitly (T7).
+
+### Neutral
+
+- A pure upper layer added on top of ADR-0024/0026. No impact on the hardware
+  simulation stack (except D0).
@@ -0,0 +1,279 @@
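+# ADR-0032: Inter-cube All-Reduce (pe0 cube-mesh reduce + multi-SIP exchange)
+
+## Status
+
+Accepted (supersedes ADR-0029).
+
+## Context
+
+### Goal
+
+Define a single all-reduce algorithm that exploits the topology hierarchy: the
+cube mesh inside each SIP (inter-cube) plus the exchange between SIPs. It is a
+single kernel with a single SFR configuration path, driven by `topology.yaml`
+and `ccl.yaml`.
+
+### Why this replaces ADR-0029 (hierarchical 3-level)
+
+ADR-0029 proposed a 3-level algorithm (intra-cube → inter-cube → inter-SIP) in
+which every PE in the system participates. In practice it does not match the
+common workload pattern where tensors are sharded **per cube**, not per PE
+inside a cube, and it adds the complexity of an intra-cube PE-to-PE stage
+(bidirectional reduce + chain broadcast).
+
+The hierarchical design also required:
+- Per-PE neighbor graph installation (multi-level `_build_pe_installs`)
+- A multi-level topology schema (`hierarchical_3level`)
+- The `all_pes` mapper + `multi_pe_sip_local` validator infrastructure
+
+The inter-cube algorithm below removes all of this: a **pe0-only, same-lane
+inter-cube reduce over the 4×4 cube mesh**, then an inter-SIP exchange at the
+root cube, then a broadcast back. The kernel and wiring are simpler, and the
+bandwidth characteristics are the same for the common per-cube DP workload.
+
+### Current state
+
+- `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py`: the kernel
+- `src/kernbench/ccl/sfr_config.py`: `configure_sfr_intercube_multisip`
+- `src/kernbench/runtime_api/distributed.py`: `AhbmCCLBackend` wires
+  everything automatically at `init_process_group` time.
+- The former `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`, and
+  `hierarchical_allreduce` modules and their tests are **removed**.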
+
+---
+
+## Decision
+
+### D1. Algorithm structure: 5 phases (center-root, bidirectional)
+
+The root cube sits at the geometric **center** of the cube mesh:
+
+```
+root_col  = cube_w // 2
+root_row  = cube_h // 2
+root_cube = root_row * cube_w + root_col   # center; 10 on a 4×4 mesh
+```
+
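+Each reduce/broadcast phase converges on or diverges from this center **from
+both directions**, shortening the intra-SIP critical path by about a third
+compared with a corner-root walk (4×4 mesh: 4 reduce hops + 4 broadcast hops vs
+6+6 for an SE-corner root).
+
+For each SIP (launched concurrently via `mp.spawn`):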
+
+```
+Phase 1: Row reduce converging at col == root_col (cube mesh, pe0 only):
+    The left half (col < root_col) moves W→E and the right half
+    (col > root_col) moves E→W; the root_col cube merges both → holds the row sum.
+
+Phase 2: Col reduce along col == root_col, converging at row == root_row:
+    The upper part (row < root_row) moves N→S and the lower part
+    (row > root_row) moves S→N; the root cube merges both → holds the full SIP sum.
+
+Phase 3: Inter-SIP exchange at cube_id == root_cube (pe0 only):
+    Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast,
+    selected by sip_topo_kind (sips.topology in topology.yaml).
+
+Phase 4: Col broadcast outward from root_row along col == root_col.
+
+Phase 5: Row broadcast outward from root_col across the cube mesh.
+```
+
+After all phases, pe0 of every cube holds the global sum.
+
+**Single-cube fast path**: when `cube_w == cube_h == 1` (one cube per rank, the
+common TP case), the intra-SIP reduce/broadcast phases are skipped and the
+kernel goes straight to the Phase 3 inter-SIP exchange.
+
+The kernel is a single function parameterized by `sip_topo_kind ∈ {0, 1, 2}`
+(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical across
+topologies; only phase 3 branches. The helpers `_inter_sip_ring`,
+`_inter_sip_torus_2d`, and `_inter_sip_mesh_2d` encode the three exchange patterns.
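+
+The root placement and hop counts quoted in D1 can be reproduced with a short
+sketch. The helper names `center_root` and `reduce_hops` are hypothetical, and
+the hop model is an assumption: one hop per mesh step, with the two halves of
+each phase progressing concurrently so that only the longer side counts.
+
+```python
+def center_root(cube_w, cube_h):
+    """Row-major id of the geometric-center cube, as defined in D1."""
+    root_col, root_row = cube_w // 2, cube_h // 2
+    return root_row * cube_w + root_col
+
+
+def reduce_hops(cube_w, cube_h, root_col, root_row):
+    """Critical-path hops of Phase 1 + Phase 2 toward (root_col, root_row)."""
+    row_hops = max(root_col, cube_w - 1 - root_col)   # longer side of the row reduce
+    col_hops = max(root_row, cube_h - 1 - root_row)   # longer side of the col reduce
+    return row_hops + col_hops
+
+
+center = reduce_hops(4, 4, 2, 2)   # center root on a 4x4 mesh
+corner = reduce_hops(4, 4, 3, 3)   # SE-corner root on a 4x4 mesh
+```
+
+Under this model the broadcast phases mirror the reduce phases, so the totals
+are 4 + 4 hops for the center root and 6 + 6 for the corner root.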
+
+### D2. Tensor layout (rank = SIP, per worker)
+
+Per ADR-0024, rank = SIP at the process-group level. Each worker allocates the
+tensor for its entire cube mesh:
+
+```python
+dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
+tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
+```
+
+Shard layout: 16 shards per SIP, one on pe0 of each cube. The kernel addresses
+each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2` (2 bytes per f16 element).
+
+### D3. SFR / IPCQ wiring: `configure_sfr_intercube_multisip`
+
+Replaces the rank-to-2-PE installation of ADR-0024. It wires the PE_IPCQ
+neighbor table for **pe0 of every cube of every SIP**, regardless of which cube
+is the root or which SIP topology is selected. This lets the kernel elect the
+root cube at runtime and supports topology switches without rewiring.
+
+| Level | Direction labels | Scope |
+|---|---|---|
+| Inter-cube, inside a SIP | N / S / E / W | pe0 of every cube → pe0 of its mesh neighbors (no wraparound) |
+| Inter-SIP (every cube) | global_E / global_W / global_N / global_S | pe0 of cube c on SIP A → pe0 of cube c on the peer SIP given by `sips.topology` |
+
+Inter-SIP directions use the `global_*` prefix to keep their namespace
+separate from the inter-cube directions. `_OPPOSITE_DIR` from ADR-0025 is
+extended with `global_E ↔ global_W` and `global_N ↔ global_S` so that the
+reverse-direction resolver handles a 2-SIP bidirectional ring correctly.
+
+Internally, the function calls `install_ipcq` with:
+- `world_size = n_sips × n_cubes`
+- `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
+- A `neighbors()` function, captured as a closure, that produces the mapping above.
+
+This `world_size` is internal to the IPCQ wiring and does not leak into
+process-group ranks.
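+
+A rough sketch of the intra-SIP part of this wiring. `rank_to_pe` follows the
+comprehension quoted above; `mesh_neighbors` is a hypothetical helper, and its
+direction convention is an assumption inferred from D1 (row-major cube ids, N
+toward smaller rows, W toward smaller columns). The real `neighbors()` closure
+and the `global_*` links are not modeled here.
+
+```python
+def rank_to_pe(n_sips, n_cubes):
+    """Internal IPCQ rank -> (sip, cube, pe) triple; only pe0 participates."""
+    return [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]
+
+
+def mesh_neighbors(cube, cube_w, cube_h):
+    """N/S/E/W neighbours of a cube in a row-major mesh without wraparound."""
+    row, col = divmod(cube, cube_w)
+    out = {}
+    if row > 0:
+        out["N"] = cube - cube_w
+    if row < cube_h - 1:
+        out["S"] = cube + cube_w
+    if col < cube_w - 1:
+        out["E"] = cube + 1
+    if col > 0:
+        out["W"] = cube - 1
+    return out
+```
+
+With 2 SIPs of 16 cubes each, the internal `world_size` is 32, corner cubes have
+two neighbours, and the center cube 10 has all four.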
+
+### D4. SIP topology: from `topology.yaml`
+
+```yaml
+system:
+  sips:
+    count: 2
+    topology: ring_1d       # or torus_2d, mesh_2d_no_wrap
+```
+
+- `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
+- `torus_2d`: a wrapped `w × h` mesh. A row ring over `global_E/W`, followed by
+  a col ring over `global_S/N`.
+- `mesh_2d_no_wrap`: a `w × h` mesh without wraparound. Per-dimension chain
+  reduce + broadcast.
+
+The 2D grid size `(w, h)` comes from `system.sips.w/h` (ADR-0024 D5). The
+square fallback (`round(sqrt(n_sips))²`) applies only when `w/h` is omitted, so
+rectangular grids (for example, 6 SIPs as `3×2`) are supported through an
+explicit `w/h`.
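+
+The grid-size rule can be sketched as follows. `resolve_sip_grid` is a
+hypothetical helper, not the KernBench function. Two details are assumptions of
+this sketch: it uses `math.isqrt` instead of `round(sqrt(...))` to test for a
+perfect square, and it rejects an explicit `w × h` that does not equal `n_sips`.
+
+```python
+import math
+
+
+def resolve_sip_grid(n_sips, w=None, h=None):
+    """Return (w, h); explicit values win, otherwise fall back to a square grid."""
+    if w is not None and h is not None:
+        if w * h != n_sips:
+            raise ValueError(f"w*h = {w * h} does not match n_sips = {n_sips}")
+        return w, h
+    k = math.isqrt(n_sips)
+    if k * k != n_sips:
+        raise ValueError(f"n_sips = {n_sips} is not a perfect square; set w/h explicitly")
+    return k, k
+```
+
+For example, 6 SIPs resolve to `(3, 2)` only when `w=3, h=2` is given; without
+`w/h` the call fails because 6 is not a perfect square.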
+
+### D5. Process-group integration: `AhbmCCLBackend`
+
+At `init_process_group` time, the backend:
+
+1. Loads `ccl.yaml` + `topology.yaml`.
+2. Derives `sip_topo_kind` from `system.sips.topology` through the algorithm
+   module's `TOPO_NAME_TO_KIND`, and derives `sip_topo_w, sip_topo_h` from
+   `system.sips.w/h` with the square fallback (ADR-0024 D5).
+3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)`: one-time SFR
+   wiring that mimics NCCL communicator creation.
+
+On each `dist.all_reduce(tensor)` call:
+
+1. Resolves `kernel_fn` from `cfg["module"]`.
+2. Builds the arguments `(n_elem, cube_w, cube_h, n_sips)` from
+   `kernel_args(world_size, n_elem)`.
+3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`, where
+   `sip_rank` is the rank bound to the current greenlet.
+4. Launches with `_defer_wait=True`; after every worker has submitted, the main
+   scheduler drains the pending handles (ADR-0027 D0.4).
+
+### D6. Configuration schema
+
+`ccl.yaml`:
+
+```yaml
+defaults:
+  algorithm: lrab_hierarchical_allreduce
+  buffer_kind: tcm
+  ...
+
+algorithms:
+  lrab_hierarchical_allreduce:
+    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
+    topology: none
+    buffer_kind: tcm
+    n_elem: 8
+    root_cube: 15   # currently unused: the kernel elects the geometric center
+                    # as root dynamically (see D1). Kept as a placeholder for a
+                    # future explicit root override / runtime election hook.
+```
+
+`topology.yaml`:
+
+```yaml
+system:
+  sips:
+    count: 2
+    topology: ring_1d
+sip:
+  cube_mesh: { w: 4, h: 4 }
+```
+
+### D7. Algorithm module contract
+
+A module loaded through `cfg["module"]` must export:
+
+| Name | Purpose |
+|---|---|
+| `kernel` | callable with signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
+| `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar arguments (per tensor) |
+| `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` names to kernel branch codes |
+| `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |
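+
+One way to check this contract at load time is sketched below. `REQUIRED`,
+`missing_exports`, and `stub` are hypothetical names; the stub is a
+`SimpleNamespace` standing in for a real algorithm module. Its `kernel_args`
+hard-codes a 4×4 cube mesh and treats `world_size` as `n_sips`, both of which
+are assumptions of this example only.
+
+```python
+from types import SimpleNamespace
+
+# Export names taken from the D7 contract table.
+REQUIRED = ("kernel", "kernel_args", "TOPO_NAME_TO_KIND",
+            "SIP_TOPO_RING", "SIP_TOPO_TORUS", "SIP_TOPO_MESH")
+
+
+def missing_exports(module):
+    """Names from the D7 table that `module` does not define."""
+    return [name for name in REQUIRED if not hasattr(module, name)]
+
+
+# Stand-in module: returns (n_elem, cube_w, cube_h, n_sips) from kernel_args.
+stub = SimpleNamespace(
+    kernel=lambda *args: None,
+    kernel_args=lambda world_size, n_elem: (n_elem, 4, 4, world_size),
+    TOPO_NAME_TO_KIND={"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2},
+    SIP_TOPO_RING=0, SIP_TOPO_TORUS=1, SIP_TOPO_MESH=2,
+)
+```
+
+A backend could call such a check right after importing `cfg["module"]` and
+fail early with the list of missing names, rather than failing later inside
+`dist.all_reduce`.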
+
+---
+
+## Dependencies
+
+- **ADR-0023**: IPCQ protocol (neighbor table, send/receive, credit return).
+- **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
+- **ADR-0025**: address-based IPCQ direction matching; `_OPPOSITE_DIR`
+  extended with the `global_*` pairs.
+- **ADR-0027**: worker-wait / collective pending drain in the main
+  scheduler.
+
+## Non-goals
+
+- **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope: the
+  workload of this algorithm is per-cube DP.
+- **The square-grid fallback requires `n_sips = k²`**: rectangular SIP grids
+  (non-square mesh/torus) are supported, but only when `system.sips.w/h` is
+  given explicitly (ADR-0024 D5). When `w/h` is omitted, 2D topologies fall
+  back to a square grid and still require `n_sips = k²`.
+- **Pipelined chunks**: a single tile per cube, no pipelining yet.
+- **Runtime election of the root cube**: the kernel currently uses the
+  geometric center, `root_cube = (cube_h // 2) * cube_w + (cube_w // 2)`, to
+  minimize the intra-SIP critical path. Because the SFR wiring covers every
+  cube, electing a different root at runtime is a pure kernel change when it
+  becomes necessary.
+
+---
+
+## Consequences
+
+### Positive
+
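+- **A single kernel and a single install path** handle all-reduce, replacing
+  the four removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
+- **Topology-agnostic kernel**: ring / torus / mesh is selected by one integer
+  parameter, with no kernel duplication.
+- **Automated through `dist.all_reduce`**: no algorithm selection at the bench
+  or user level; configuration-driven end to end.
+- **Complete SFR wiring**: every cube of every SIP holds inter-SIP links, which
+  supports future dynamic root-cube election.
+
+### Negative
+
+- **Unsuitable for per-PE sharded tensors**: a TP-layer-style tensor sharded
+  across the 8 PEs inside one cube cannot be addressed by this kernel. Such
+  workloads need a separate intra-cube all-reduce path (not implemented
+  yet).
+- **`configure_sfr_intercube_multisip` always wires every pe0**, even when a
+  given run needs only a subset (for example, 1 SIP, ring only). The install
+  cost is small but not zero.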
+
+---
+
+## Affected files
+
+| File | Change |
+|---|---|
+| `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` (new) | kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
+| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
+| `src/kernbench/ccl/topologies.py` | add `torus_2d`, `mesh_2d_no_wrap` |
+| `src/kernbench/ccl/install.py` | extend `_OPPOSITE_DIR` with `global_*` pairs |
+| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + adds sip_rank/topo arguments |
+| `ccl.yaml` | single `lrab_hierarchical_allreduce` entry |
+| `topology.yaml` | add `system.sips.topology` |
+| `benches/ccl_allreduce.py` | row-wise cube-mesh tensor layout |
+| `tests/sccl/` (test package) | configuration-driven ring/torus/mesh correctness + full `dist.all_reduce` path + latency/buffer-kind sweep (evaluation harness, ADR-0043) |
+| `tests/test_intercube_sfr_config.py` | SFR wiring verification |
+| Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
@@ -0,0 +1,152 @@
+# ADR-0033: Latency Model, Assumptions and Known Simplifications
+
+## Status
+
+Accepted
+
+## Context
+
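+This simulator is an analytical, event-driven performance model, not a
+cycle-accurate or RTL-level simulator. Many effects of real HW are
+approximated or omitted by design. To keep the whole model auditable and
+reviewable, this ADR consolidates those assumptions in one place. The
+individual component ADRs (ADR-0015, ADR-0017, ADR-0004) define the
+*mechanisms*; this document defines the *limits of fidelity*.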
+
+## Decisions
+
+### D1. What is modeled precisely
+
+- **BW occupancy per directed edge** (FIFO serialization via
+  `available_at`): ADR-0015 D2.
+- **Per-component switching/overhead latency** (`overhead_ns` attr).
+- **Per-pseudo-channel HBM parallelism**: a stateless `pc_avail[N]` array
+  with address-based PC selection (ADR-0034 D3). Burst granularity is
+  tunable (`burst_bytes`, default 256B). Each PC's `available_at` is shared
+  by reads and writes (on real HW the command bus is shared per PC).
+- **HBM direction-switch penalty mechanism**: per-PC last-direction
+  tracking plus a configurable `switch_penalty_ns`. Default 0; see D2.
+- **Wire chunk streaming (Phase 2c)**: each wire splits a payload-carrying
+  Transaction into `Flit` objects of `flit_bytes` each (default = HBM
+  `burst_bytes` = 256B). The wire emits each flit individually after
+  `prop_ns + flit_nbytes/bw_gbs`, so the link bandwidth paces the flit
+  arrival rate, matching the wormhole semantics of real HW.
+- **Separate Store per directed edge** (the key Phase 2c fix): the wire is
+  the *only* conduit between `src.out_ports[dst]` and `dst.in_ports[src]`.
+  Previously the two were aliased to the same `simpy.Store`. When the wire
+  put chunked flits back, the destination's `fan_in` could pull them before
+  the wire applied the bandwidth delay, letting half of the flits bypass
+  the bottleneck.
+- **Flit-aware pass-through** (`TransitComponent`, `HbmCtrlComponent`):
+  forwards each flit serially and applies the transaction overhead only
+  once, when the first flit arrives (header-decode model). Later flits pass
+  through the pipeline with no added delay. Wormhole behavior emerges
+  naturally across multi-hop paths.
+- **Per-flit PC commit at HBM CTRL**: each flit arriving at HBM CTRL
+  schedules a PC commit at `max(env.now, pc_avail[pc]) + chunk_time`, and
+  the `is_last` flit waits for the final PC commit before signaling
+  `txn.done`.
+- **Flit-unaware components (the default) reassemble flits in ``_fan_in``**
+  so that the legacy `_forward_txn` path runs. This preserves backward
+  compatibility for components not yet migrated to flit-aware handling
+  (e.g. the sub-txn generators in `MCpuComponent` and `IoCpuComponent`).
+  Such components reassemble *once per leg boundary*, not per hop, so
+  multi-hop wormhole timing through flit-aware router chains is preserved.
+
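The wire chunk-streaming bullet can be illustrated with a small sketch. The helper names below are hypothetical and are not the simulator's API; the only facts taken from the text are the 256B default flit size and the per-flit delay `prop_ns + flit_nbytes/bw_gbs` (1 GB/s equals 1 byte/ns, so the quotient is already in ns).

```python
def split_into_flits(nbytes: int, flit_bytes: int = 256) -> list[int]:
    """Sizes of the flits a payload of nbytes is split into (last may be short)."""
    full, rem = divmod(nbytes, flit_bytes)
    return [flit_bytes] * full + ([rem] if rem else [])


def flit_emit_delay_ns(flit_nbytes: int, bw_gbs: float, prop_ns: float) -> float:
    """Delay the wire applies to one flit: propagation plus serialization."""
    return prop_ns + flit_nbytes / bw_gbs


if __name__ == "__main__":
    for size in split_into_flits(600):
        # bw_gbs=128.0 and prop_ns=1.0 are illustrative values only.
        print(size, flit_emit_delay_ns(size, bw_gbs=128.0, prop_ns=1.0))
```

A zero-byte payload yields no flits in this sketch, which matches the document's statement that command-only Transactions are not split.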
+### D2. Approximated (with known error direction)
+
+| Effect | Real HW | This model | Error direction |
+|--------|---------|-----------|----------------|
+| Router output-port arbitration | Round-robin / weighted | Wire-edge FIFO + serial worker | Fair with one txn per cycle; multi-stream sharing is not modeled at flit level |
+| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Pessimistic for mixed R/W with dense alternation; the default `switch_penalty_ns = 0` assumes an ideal scheduler amortizes it |
+| Flit ↔ burst granularity | 32B flit < 256B burst | `flit_bytes = burst_bytes = 256B` | Sub-flit fine-timing noise; matters only for very small wire-arbitration windows |
+| Wire-level RR fairness | Per-cycle multi-flow arbitration on a shared link | A single serial wire process per edge | Fair only when a single transaction is in flight on a given edge. Concurrent multi-stream traffic on the same edge is serialized in FIFO order |
+
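+### D3. Ignored (out of scope)
+
+- Bank-level row-buffer conflict penalties (no conflicts assumed, the best
+  case; the model keeps no per-bank state inside a PC, so it cannot detect
+  same-bank reuse).
+- HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the
+  steady-state `burst_time = burst_bytes / pc_bw_gbs`).
+- Refresh, ECC, thermal throttling, power gating.
+- Clock-domain crossings, PLL lock time.
+- Upstream backpressure caused by downstream buffer occupancy (input ports
+  use an unbounded `simpy.Store`).
+- Sub-flit cycle-level arbitration in routers (flit granularity is the
+  smallest unit in this model).
+
+### D4. Workload sensitivity
+
+Workloads whose results are meaningfully affected by the simplifications
+above:
+
+- **Random scatter/gather**: bank conflicts are ignored → the model is
+  optimistic.
+- **Workloads heavy in mixed R/W** (e.g. GEMM bias accumulation): no HBM
+  scheduler. The default `switch_penalty_ns = 0` assumes ideal
+  amortization; a nonzero value models a pessimistic cost per alternation.
+- **High concurrency (>10 active flows on one link)**: HoL blocking and VC
+  limits are not modeled → the model is optimistic.
+- **Very small (sub-flit) transactions**: flit quantization noise.
+- **Concurrent multi-flow on a single wire**: the wire is a serial FIFO at
+  flit level, so per-flow fairness within a single edge is not modeled.
+  Pre-edge merging (several sources arriving at a router and being
+  forwarded onto the same downstream wire) is modeled correctly through the
+  flit-aware router's serial worker.
+
+### D5. Validation policy
+
+Before drawing absolute-value conclusions for the workloads in D4,
+cross-check against real HW or a cycle-accurate simulator. The model
+remains accurate for **relative comparisons** within the modeled domain.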
+
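+### D6. Future work
+
+Note: multi-stream merging at routers is already modeled correctly. Each
+in_port has its own fan_in process, all of them push into a shared inbox,
+and the router worker forwards in inbox FIFO order. Flits from different
+upstream streams interleave naturally at flit granularity. The items below
+are separate concerns, ordered by expected workload impact.
+
+**High impact (workload accuracy gaps)**:
+
+- [ ] **Bank-level conflict modeling** within a PC (opt-in via
+  `track_banks: true`). Same-bank reuse is currently assumed absent; random
+  scatter/gather workloads are optimistic here.
+- [ ] An **HBM scheduler** with write buffer + watermark drain (Tier 2 in
+  the design discussion). The default `switch_penalty_ns=0` is a stand-in
+  for ideal amortization; bursty mixed R/W workloads would benefit from
+  explicit modeling.
+- [ ] **Backpressure** modeling for finite component buffers. Matters under
+  high concurrency or sustained saturation, where buffer occupancy causes
+  upstream stalls.
+- [ ] **Integrating chunk streaming with op_log**: op_log currently fires
+  for PE-internal command messages that are not chunked (DmaReadCmd,
+  DmaWriteCmd, GemmCmd, MathCmd). Integration requires flit-aware
+  components to also emit per-transaction op_log start/end hooks (start on
+  the first flit, end on is_last).
+
+**Low impact (academic / specific use cases)**:
+
+- [ ] **Cycle-accurate router arbitration policies** (RR with
+  priority/age, iSLIP). The FIFO inbox is already approximately fair when
+  flit arrival times differ slightly between streams (the common case for
+  workloads with similar rates). The practical impact shows up only in (a)
+  priority/QoS modeling and (b) per-stream tail-latency analysis under
+  sustained saturation. It is not decisive for makespan or mean-latency
+  studies.
+- [ ] **Sub-flit (32B) granularity** for finer wire-arbitration cycles.
+  The model's `flit_bytes` equals the burst (256B), whereas real HW
+  arbitrates every 32B flit. The impact is small for most workloads
+  (sub-flit timing noise on small messages).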
+
+## Consequences
+
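+- A single review point for all model-fidelity questions. Every future PR
+  that touches latency must update the relevant section of this document.
+- The per-workload error envelope is explicit.
+- The builder-side derivation `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
+  enforces the ADR-0017 D8 invariant in code rather than relying on manual
+  consistency in yaml.
+- Wire transfer time is charged once per bottleneck-link traversal
+  (Phase 2c per-flit timing), not through `drain_ns` injection at the
+  terminal. A single transaction reaches
+  `drain + commit_time + small_overheads`; multi-hop preserves wormhole
+  pipelining; multi-stream merging serializes correctly in the shared
+  wire's FIFO.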
+
+## Cross-references
+
+- ADR-0015: component / port / wire model.
+- ADR-0017: cube NOC architecture and HBM connectivity.
+- ADR-0004: memory semantics, local HBM.
+- ADR-0034: HBM controller internal design.
@@ -0,0 +1,263 @@
+# ADR-0034: HBM Controller Internal Design
+
+## Status
+
+Accepted
+
+## Context
+
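+`HbmCtrlComponent` is the per-PE HBM partition endpoint located at the leaf
+of the cube NOC. One instance is created per PE under the topology node
+`sip{S}.cube{C}.hbm_ctrl.pe{idx}` and is attached to that PE's router
+(ADR-0017 D4). The component models per-pseudo-channel (PC) scheduling,
+burst-granular commit timing, address-based PC selection, and the routing
+of responses back to the requester.
+
+This ADR documents the component as currently implemented. ADR-0017 D4/D8
+define *where* HBM CTRL attaches and *what* aggregate bandwidth it must
+provide. ADR-0033 D1/D2 define *what fidelity* of HBM modeling is in scope.
+This ADR fills the gap between the two: the per-instance internal
+scheduling model.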
+
+## Decision
+
+### D1. Role
+
+`HbmCtrlComponent` is the per-PE HBM partition endpoint. One instance per
+PE (8 per cube by default, configured via
+`cube.memory_map.hbm_slices_per_cube`) is attached to that PE's router
+through the `peX.hbm` attachment list in `cube_mesh.yaml` (ADR-0017 D4).
+Under the default n:1 channel mapping (ADR-0017 D8), an instance aggregates
+`channels_per_pe` pseudo-channels into a single endpoint.
+
+The component models:
+
+- Per-PC scheduling (D2) and the shared R/W command bus.
+- Address-based PC selection (D3).
+- Burst-granular commit timing (D4).
+- Flit-aware per-flit PC commit and asynchronous finalize (D5, D6).
+- Command-only Transaction handling for read-data drain (D7).
+- Response routing back to the requester (D8).
+
+It does not model:
+
+- Bank-level row-buffer conflicts, refresh, ECC, thermal throttling
+  (ADR-0033 D3).
+- Inter-PE HBM contention beyond its own router edge (handled by the router
+  mesh; ADR-0017 D3).
+- 1:1 channel mode (ADR-0017 D8 future work).
+
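+### D2. Per-PC scheduling model
+
+Per-instance state initialized in `start()`:
+
+- `_pc_avail: list[float]`: the earliest simulation time at which each PC
+  is next free; length `num_pcs`, initial value 0.0.
+- `_pc_last_dir: list["R"|"W"|None]`: each PC's last committed direction,
+  used for switch-penalty detection (D4); initial value `None`.
+
+`num_pcs` and `burst_bytes` must each be a positive power of two so that
+address-based PC selection (D3) reduces to a shift and a mask.
+
+Read and write requests share the same `_pc_avail` slot per PC. On real HW
+the per-PC command bus is shared by read and write traffic, so issuing a
+write to PC k blocks a subsequent read to PC k for exactly one burst time.
+
+The request direction `dir` is inferred from the request type:
+
+- `MemoryWriteMsg` → `"W"`.
+- `PeDmaMsg` with `is_write=True` → `"W"`.
+- Everything else (`MemoryReadMsg`, read `PeDmaMsg`) → `"R"`.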
+
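+### D3. Address-based PC selection
+
+The PC index for an access is derived from the access address with a shift
+and a mask:
+
+```text
+pc_shift = log2(burst_bytes)         # default 8  (burst=256B)
+pc_mask  = num_pcs - 1               # default 7  (8 PCs)
+pc       = (address >> pc_shift) & pc_mask
+```
+
+Both values are computed once in `start()` from the topology configuration,
+so they stay consistent with alternative `(burst_bytes, num_pcs)` pairs.
+With the canonical default `(256, 8)`, the PC-select field sits at bits
+`[10:8]` of the HBM byte offset: bits `[7:0]` are within a burst (same PC),
+bits `[10:8]` are the 3-bit PC index, and bits `[36:11]` are the
+row/bank/column within the PC slice (see the comments in `phyaddr.py`).
+
+Unlike address-blind global round-robin, address-based striping preserves
+PC parallelism for concurrent transfers at disjoint offsets: each
+transfer's bursts land deterministically on the PC set implied by its own
+byte addresses, so multi-PE workloads accessing disjoint regions do not
+collide on a single PC.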
+
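The shift-and-mask rule can be exercised directly. The following is a minimal sketch under the stated defaults; `make_pc_selector` is a hypothetical helper name, not the component's `_pc_for_address` implementation.

```python
def make_pc_selector(burst_bytes: int = 256, num_pcs: int = 8):
    """Build an address -> PC index function from (burst_bytes, num_pcs)."""
    for name, value in (("burst_bytes", burst_bytes), ("num_pcs", num_pcs)):
        # Both must be positive powers of two for shift/mask to be valid.
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a positive power of two")
    pc_shift = burst_bytes.bit_length() - 1   # log2(burst_bytes)
    pc_mask = num_pcs - 1

    def pc_for_address(address: int) -> int:
        return (address >> pc_shift) & pc_mask

    return pc_for_address


if __name__ == "__main__":
    select = make_pc_selector()
    # Consecutive 256B bursts stripe across PCs 0..7 and then wrap.
    print([select(i * 256) for i in range(10)])
```

With the default `(256, 8)` pair, two addresses inside the same 256B burst map to the same PC, and a 2 KiB stride returns to the same PC.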
+### D4. Burst-granular time and PC commit timing
+
+Time taken by a single PC commit:
+
+```text
+chunk_time = burst_bytes / pc_bw_gbs    # ns
+```
+
+- `burst_bytes` (default 256) is the burst unit and matches the flit size
+  (ADR-0033 D1).
+- `pc_bw_gbs` is **derived in the builder**:
+  `hbm_to_router_bw_gbs / num_pcs` (`topology/builder.py`). This enforces
+  the ADR-0017 D8 invariant that the aggregate per-PE bandwidth must equal
+  the router-HBM link bandwidth.
+
+Per-PC commit scheduling for an access arriving at PC `pc` with direction
+`dir`:
+
+```text
+switch_cost = switch_penalty_ns
+              if pc_last_dir[pc] not in (None, dir) else 0
+start  = max(env.now, pc_avail[pc]) + switch_cost
+finish = start + chunk_time
+pc_avail[pc]    = finish
+pc_last_dir[pc] = dir
+```
+
+The default is `switch_penalty_ns = 0`: the Tier 0 assumption that an ideal
+HBM scheduler amortizes the R/W switching cost (ADR-0033 D2). A nonzero
+value models a pessimistic cost incurred on every alternation.
+
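The commit schedule above can be written as a small stand-alone Python class to see how the numbers evolve. This is a sketch that mirrors the pseudocode, not the `HbmCtrlComponent` source: the class name `PcSchedule` is hypothetical, and the current time is passed in as `now` instead of being read from a SimPy environment.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PcSchedule:
    num_pcs: int = 8
    burst_bytes: int = 256
    pc_bw_gbs: float = 32.0
    switch_penalty_ns: float = 0.0
    pc_avail: list[float] = field(init=False)
    pc_last_dir: list[Optional[str]] = field(init=False)

    def __post_init__(self) -> None:
        self.pc_avail = [0.0] * self.num_pcs
        self.pc_last_dir = [None] * self.num_pcs

    @property
    def chunk_time(self) -> float:
        # bytes / (GB/s) == ns, since 1 GB/s is 1 byte per ns.
        return self.burst_bytes / self.pc_bw_gbs

    def commit(self, now: float, pc: int, direction: str) -> float:
        """Schedule one burst on `pc`; returns its finish time in ns."""
        last = self.pc_last_dir[pc]
        switch_cost = self.switch_penalty_ns if last not in (None, direction) else 0.0
        start = max(now, self.pc_avail[pc]) + switch_cost
        finish = start + self.chunk_time
        self.pc_avail[pc] = finish
        self.pc_last_dir[pc] = direction
        return finish


if __name__ == "__main__":
    sched = PcSchedule(switch_penalty_ns=2.0)   # nonzero penalty for illustration
    print(sched.commit(0.0, 0, "R"), sched.commit(0.0, 0, "W"), sched.commit(0.0, 1, "W"))
```

Reads and writes share one `pc_avail` slot per PC, so a write issued right after a read on the same PC queues behind it, while an access to a different PC starts immediately.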
+### D5. Flit-aware per-flit PC commit (primary path)
+
+`_handle_flit` is the primary worker path. For each arriving `Flit`:
+
+1. If it is the **first** flit of the transaction (`tid = id(txn)` is not
+   in `_txn_state`):
+   - Apply `overhead_ns` once via `run(env, nbytes)`: the header-decode
+     model, following the first-flit overhead pattern (ADR-0033 D1).
+   - Initialize `_txn_state[tid] = {"last_finish": env.now}`.
+2. Compute `pc = _pc_for_address(flit.address)` (D3).
+3. Apply the per-PC schedule (D4) using the request direction (D2).
+4. Update `state["last_finish"] = max(state["last_finish"], finish)`.
+5. If `flit.is_last`: pop `_txn_state[tid]` and spawn `_finalize_txn`
+   (D6).
+
+Per-flit address-aware commit is the mechanism that lets concurrent
+multi-PE traffic to disjoint HBM offsets pipeline in parallel through
+different PCs.
+
+### D6. Per-transaction asynchronous finalize
+
+Once the last flit of a transaction has been scheduled, finalize runs in a
+separately spawned process:
+
+```python
+def _finalize_txn(env, txn, last_finish):
+    wait = last_finish - env.now
+    if wait > 0:
+        yield env.timeout(wait)
+    yield from _send_response(env, txn)
+```
+
+`_handle_flit` spawns this with `env.process(...)` and returns immediately,
+so the worker can pick up the next inbox message while the last PC commit
+is still draining.
+
+Without this separation (that is, if the worker itself did
+`yield env.timeout(wait)`), concurrent single-flit transactions whose
+addresses land on different PCs would still be serialized inside the
+worker by `chunk_time` each, hiding the PC parallelism that D3 and D5 are
+designed to expose.
+
+### D7. Non-flit fallback for command-only transactions
+
+`_handle_txn` runs when the inbox delivers a `Transaction` rather than a
+`Flit`. This is the path for command-only requests that the wire does not
+split into flits; the typical case is a `MemoryReadMsg` whose command
+transaction carries `nbytes=0` (the data drain is modeled in HBM CTRL
+post-processing, not as inbound flits).
+
+Procedure:
+
+1. `work_bytes = txn.nbytes if txn.nbytes > 0 else int(request.nbytes or 0)`:
+   for a read command the amount of work is determined by the request.
+2. If `work_bytes > 0`, `n_chunks = ceil(work_bytes / burst_bytes)`;
+   otherwise 0.
+3. When both are > 0, `chunk_interval = drain_ns / n_chunks`: chunks are
+   scheduled `drain/n_chunks` ns apart to model the data arrival rate of
+   the bottleneck link (ADR-0033 D1 chunk-loop drain).
+4. Apply `run(env, txn.nbytes)` once for `overhead_ns`.
+5. For each chunk `i`, advance by `chunk_interval` ns, then apply the D4
+   schedule with `pc = _pc_for_address(base_address + i * burst_bytes)`.
+6. After all chunks are scheduled, wait `last_finish - env.now`, then call
+   `_send_response`.
+
+`_handle_txn` shares the same `_pc_avail` / `_pc_last_dir` state as
+`_handle_flit`, so there is exactly one source of truth for PC scheduling
+across the two paths.
+
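Steps 1 to 3 and the per-chunk addresses of step 5 can be sketched as plain arithmetic. The function names are hypothetical, and reading "both are > 0" as "both `drain_ns` and `n_chunks` are positive" is an assumption about the original wording.

```python
import math
from typing import Optional


def plan_chunks(txn_nbytes: int, request_nbytes: Optional[int],
                drain_ns: float, burst_bytes: int = 256) -> tuple[int, float]:
    """Return (n_chunks, chunk_interval_ns) for a command-only transaction."""
    work_bytes = txn_nbytes if txn_nbytes > 0 else int(request_nbytes or 0)
    n_chunks = math.ceil(work_bytes / burst_bytes) if work_bytes > 0 else 0
    # Assumption: the interval is only defined when both inputs are positive.
    interval = drain_ns / n_chunks if (n_chunks > 0 and drain_ns > 0) else 0.0
    return n_chunks, interval


def chunk_addresses(base_address: int, n_chunks: int,
                    burst_bytes: int = 256) -> list[int]:
    """Address used for PC selection of each chunk i."""
    return [base_address + i * burst_bytes for i in range(n_chunks)]


if __name__ == "__main__":
    # A 1 KiB read command (txn.nbytes == 0) drained over 16 ns.
    print(plan_chunks(0, 1024, 16.0), chunk_addresses(0x1000, 4))
```

A 1 KiB read with a 16 ns drain produces four chunks spaced 4 ns apart, each landing on a different PC under the default striping.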
+### D8. Response routing
+
+`_send_response` dispatches on request type and path shape:
+
+| Case | Trigger | Response |
+| --- | --- | --- |
+| PE_DMA | `isinstance(txn.request, PeDmaMsg)` | new reverse-path Transaction (`is_response=True`, `nbytes=0`), same `done` |
+| Bypass: Memory Read | `"m_cpu" not in any(txn.path)` AND `MemoryReadMsg` | reverse-path Transaction with `nbytes=request.nbytes` (data return) |
+| Bypass: Memory Write | `"m_cpu" not in any(txn.path)` AND not Memory Read | `txn.done.succeed()` (the write completes locally) |
+| Default | otherwise | new `ResponseMsg(correlation_id, request_id, src_cube, src_pe, success=True)` on the reverse path |
+
+The "bypass" classification matches the Memory R/W fabric path defined in
+ADR-0015 D4 (PCIE_EP → io_noc → ucie → cube router → hbm_ctrl, not via
+M_CPU). The PE_DMA case is a dedicated reverse path that keeps inner-loop
+DMA fast (PE_DMA reads/writes do not synthesize a ResponseMsg envelope).
+
+In every reverse-path case, the response Transaction is put on
+`out_ports[reverse_path[1]]`, the first hop back along the recorded forward
+path. If `reverse_path` has fewer than 2 entries (a degenerate path), the
+original `txn.done` is signaled directly.
+
+### D9. Configurable attributes
+
+| Attribute | Default | Source | Notes |
+| --- | --- | --- | --- |
+| `num_pcs` | 8 | topology cube `hbm_ctrl.attrs` | must be a power of two |
+| `pc_bw_gbs` | 32.0 | builder-derived: `hbm_to_router_bw_gbs / num_pcs` | enforces the ADR-0017 D8 invariant |
+| `burst_bytes` | 256 | topology attrs | must be a power of two; equal to `flit_bytes` (ADR-0033 D1) |
+| `switch_penalty_ns` | 0.0 | topology attrs | Tier 0 default; a nonzero value models pessimistic R/W switching |
+| `efficiency` | 1.0 | topology attrs | applied to `hbm_to_router_bw_gbs` at build time (router-edge BW scaling only) |
+| `overhead_ns` | 0.0 | topology attrs | first-flit decode overhead (D5) |
+
+`pc_bw_gbs` is not set directly; it is derived in `topology/builder.py` so
+that the aggregate per-PE bandwidth matches the router-HBM link bandwidth
+without yaml-side duplication.
+
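The builder-side derivation is a one-line division, shown here as a hedged sketch. `derive_pc_bw_gbs` is a hypothetical name, not the function in `topology/builder.py`, and the 256 GB/s link figure is only what the table's defaults (32.0 GB/s per PC, 8 PCs) imply if they are mutually consistent.

```python
def derive_pc_bw_gbs(hbm_to_router_bw_gbs: float, num_pcs: int) -> float:
    """Per-PC bandwidth such that all PCs together equal the link bandwidth."""
    if num_pcs <= 0:
        raise ValueError("num_pcs must be positive")
    return hbm_to_router_bw_gbs / num_pcs


if __name__ == "__main__":
    link_bw = 256.0                       # assumed, implied by 32.0 * 8
    pc_bw = derive_pc_bw_gbs(link_bw, 8)
    # Invariant from ADR-0017 D8: aggregate per-PE BW equals the link BW.
    print(pc_bw, pc_bw * 8 == link_bw, 256 / pc_bw)   # last value: chunk_time in ns
```

Deriving the value keeps the invariant true by construction, whereas two independently edited YAML fields could drift apart.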
+## Consequences
+
+### Positive
+
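+- Address-based PC selection preserves the multi-stream HBM parallelism
+  that address-blind round-robin would destroy, which matters for multi-PE
+  workloads with disjoint HBM regions.
+- The flit-aware path (D5) plus asynchronous finalize (D6) preserves
+  wormhole pipelining and exposes PC parallelism for back-to-back
+  single-flit transactions.
+- A single source of truth for PC scheduling (the D4 mechanism is used by
+  both the D5 flit path and the D7 chunk-loop path).
+- The builder-derived `pc_bw_gbs` enforces ADR-0017 D8 in code rather than
+  through yaml discipline.
+
+### Negative
+
+- No bank-level conflict modeling inside a PC; the model is address-blind
+  to bank/row-buffer reuse (ADR-0033 D3).
+- No HBM scheduler (FR-FCFS / write buffer / watermark drain); a fixed FIFO
+  per PC. Bursty mixed R/W is approximated by `switch_penalty_ns`
+  (ADR-0033 D2).
+- `_txn_state` is a plain dict keyed by `id(txn)`; in-flight state
+  accumulates per concurrent transaction and is removed only on `is_last`.
+  This is sufficient for current workloads.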
+
+## Links
+
+- ADR-0001 (physical address layout: PC bit-field comments)
+- ADR-0015 D4 (Memory R/W fabric path: the bypass response cases)
+- ADR-0017 D4 (per-PE HBM partitioning: attachment to the PE router)
+- ADR-0017 D8 (HBM channel mapping modes: the n:1 aggregation this ADR
+  implements)
+- ADR-0017 D9 (AddressResolver: `hbm_ctrl.pe{pe_id}` endpoint resolution)
+- ADR-0033 D1 (precise modeling: per-PC parallelism, switch penalty,
+  flit-aware PC commit, first-flit overhead, chunk-loop drain)
+- ADR-0033 D2 (switch penalty default 0: amortization by an ideal
+  scheduler)
@@ -0,0 +1,273 @@
+# ADR-0035: M_CPU and M_CPU.DMA Component Model
+
+## Status
+
+Accepted
+
+## Context
+
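+M_CPU is the cube-level command processor. It receives commands from
+IO_CPU (or from PCIE_EP when the engine routes Memory R/W through M_CPU as
+a fallback), fans them out to the PEs in its cube, aggregates the per-PE
+responses into a single ResponseMsg, and sends it back to IO_CPU over the
+reverse path.
+
+M_CPU.DMA is the cube-level pair of DMA channels that handles Memory R/W
+fan-out. Per ADR-0015 D5 it is **not** a separate topology node; it exists
+as internal state of `MCpuComponent`.
+
+This ADR documents the M_CPU component implementation that realizes these
+responsibilities. It covers the three distinct fan-out paths (Memory R/W,
+Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource model, and the
+response-aggregation contract.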
+
+## Decision
+
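+### D1. Role
+
+M_CPU has three responsibilities:
+
+1. **Transit forwarding**: when it is not the terminal hop (e.g. on the
+   reverse response path PE → M_CPU → IO_CPU), it forwards the Transaction
+   to `next_hop` on the precomputed path.
+2. **Multi-PE fan-out at the terminal hop**: it dispatches to one of three
+   fan-out paths depending on the request type (D2).
+3. **Response aggregation**: it collects per-PE responses and sends a
+   single aggregated ResponseMsg back to IO_CPU over the reverse path.
+
+Per call (`run()`): applies `overhead_ns` once per incoming Transaction.
+
+M_CPU does **not**:
+
+- Make routing decisions: paths are precomputed by the router (ADR-0002).
+- Handle PE-internal execution: PE_CPU / PE_SCHEDULER / the engines do
+  (ADR-0014).
+- Decode addresses: `ctx.resolver.resolve(pa)` returns the per-PE
+  `hbm_ctrl.pe{X}` directly (ADR-0017 D9).
+- Interpret tensor or kernel semantics: fan-out dispatch relies solely on
+  Python isinstance checks.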
+
+### D2. Three fan-out paths dispatched by request type
+
+At the terminal hop the worker dispatches on the request type:
+
+```python
+elif self.ctx is not None and txn.request is not None:
+    if isinstance(txn.request, KernelLaunchMsg):
+        env.process(self._kernel_launch_fanout(env, txn))
+    elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
+        env.process(self._mmu_msg_fanout(env, txn))
+    else:
+        env.process(self._dma_fanout(env, txn))
+```
+
+Each path uses a different router method:
+
+- `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()`: the M_CPU-only DMA
+  path that bypasses the PE pipeline nodes.
+- `_kernel_launch_fanout` uses `ctx.router.find_node_path()`: the
+  general-purpose NOC command path toward PE_CPU.
+- `_mmu_msg_fanout` uses `ctx.router.find_node_path()`: the NOC command
+  path toward PE_MMU.
+
+### D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)
+
+`MCpuComponent.start()` initializes two SimPy resources:
+
+```python
+self._dma_write = simpy.Resource(env, capacity=1)  # MemoryWriteMsg
+self._dma_read  = simpy.Resource(env, capacity=1)  # MemoryReadMsg
+```
+
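+Properties:
+
+- **Not a topology node**: managed entirely inside `MCpuComponent`; it does
+  not appear in `topology.yaml` or in the compiled graph.
+- **Independent read/write channels**: concurrent in-flight Memory R/W is
+  allowed.
+- **capacity=1 per channel** serializes the **dispatch stage**
+  (`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory R/W
+  requests at this M_CPU. The actual fabric transfer time is modeled by the
+  wire processes between components (ADR-0015 D2) and by `drain_ns` at the
+  terminal hop; the DMA resource does not gate the transfer duration.
+
+Resource selection is based on the request type: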
+
+```python
+dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
+```
+
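+### D4. Transit forwarding at non-terminal hops
+
+When `txn.next_hop` is not None, typically on the reverse response path
+(PE → M_CPU → IO_CPU), the worker forwards normally:
+
+```python
+if next_hop:
+    yield self.out_ports[next_hop].put(txn.advance())
+```
+
+The fan-out branches fire only at the terminal hop. The same component
+therefore performs both the forward command-dispatch role and the reverse
+response-relay role.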
+
+### D5. DMA fan-out (`_dma_fanout`, Memory R/W)
+
+For each Memory R/W request at the terminal hop:
+
+1. `_resolve_dma_destinations(request)` returns the per-PE
+   `hbm_ctrl.pe{X}` derived from the request's PA via
+   `ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9).
+2. For each destination:
+   - Acquire the appropriate DMA resource (`_dma_write` or `_dma_read`)
+     via `with dma_res.request() as req`.
+   - Resolve the path with `ctx.router.find_mcpu_dma_path()`.
+   - Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`.
+   - Create a sub-Transaction carrying `drain_ns` and dispatch it to
+     `path[1]`.
+3. Track `max_drain_ns` across destinations and, once all responses have
+   arrived, record it as `txn.result_data["xfer_ns"]`.
+4. After all per-PE responses are collected (D8), send the aggregated
+   ResponseMsg on the reverse command path back to IO_CPU.
+
+The PA-decode fallback (`f"{cube_prefix}.hbm_ctrl"`) is legacy dead code:
+no such node has existed since the per-PE partitioning of ADR-0017 D4. It
+is kept defensively but never routes to a real destination.
+
+### D6. Kernel launch fan-out (`_kernel_launch_fanout`)
+
+For a `KernelLaunchMsg` at the terminal hop:
+
+1. `_resolve_pe_ids(target_pe)` → list of PE ids in this cube.
+2. For each PE: find the path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via
+   `ctx.router.find_node_path()`.
+3. **`target_start_ns` handling** (ADR-0009 D5):
+   - If the request already carries `target_start_ns` (stamped by IO_CPU
+     per ADR-0036 D3): **pass it through unchanged**.
+   - Otherwise (a direct M_CPU launch in unit tests): compute a per-cube
+     barrier as `env.now + max(per-PE leg latency)` and stamp it with
+     `dataclasses.replace`.
+4. Dispatch as sub-Transactions with `nbytes=0` (a kernel launch is a
+   control message; keeping nbytes=0 decouples the fan-out from the shared
+   first-hop fabric BW, mirroring ADR-0036 D4).
+5. After all per-PE responses have arrived (D8), aggregate the per-PE
+   metrics from each sub-Transaction's `result_data` into the parent
+   transaction:
+
+   ```python
+   txn.result_data["pe_exec_ns"]  = max(existing, max(pe_exec_values))
+   txn.result_data["dma_ns"]      = max(existing, max(dma_values))
+   txn.result_data["compute_ns"]  = max(existing, max(compute_values))
+   ```
+
+   The max-merge with the existing value matters because a cross-cube
+   IO_CPU fan-out shares the same parent `result_data`; the merge prevents
+   one cube from overwriting another cube's metrics.
+6. Send the aggregated ResponseMsg on the reverse path back to IO_CPU.
+
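The max-merge in step 5 can be sketched with plain dicts standing in for `result_data`. `merge_pe_metrics` is a hypothetical helper, and treating a missing existing value as 0.0 is an assumption of this sketch.

```python
def merge_pe_metrics(parent: dict, sub_results: list[dict],
                     keys=("pe_exec_ns", "dma_ns", "compute_ns")) -> dict:
    """Max-merge per-PE metrics into the shared parent result_data."""
    for key in keys:
        values = [r[key] for r in sub_results if key in r]
        if values:
            # Never lower a value that another cube has already recorded.
            parent[key] = max(parent.get(key, 0.0), max(values))
    return parent


if __name__ == "__main__":
    shared = {"pe_exec_ns": 120.0}            # written earlier by another cube
    this_cube = [{"pe_exec_ns": 90.0, "dma_ns": 30.0},
                 {"pe_exec_ns": 100.0, "dma_ns": 45.0}]
    print(merge_pe_metrics(shared, this_cube))
```

Here the slower cube's `pe_exec_ns` of 120.0 survives, which a plain assignment of this cube's maximum (100.0) would have overwritten.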
+### D7. MMU map/unmap fan-out (`_mmu_msg_fanout`)
+
+For `MmuMapMsg` / `MmuUnmapMsg` at the terminal hop:
+
+1. `_resolve_pe_ids(target_pe)` → PE ids.
+2. For each PE: find the path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via
+   `find_node_path()`.
+3. Dispatch as sub-Transactions with `nbytes=0`.
+4. PE_MMU is a terminal node: it does **not** send a ResponseMsg back.
+   Instead, the sub-Transaction's own `sub_done` event serves as the
+   completion signal.
+5. Wait inline for all `sub_done` events (the `_pending` counter is **not**
+   used; D8 applies only to fan-outs that carry responses).
+6. Send the aggregated ResponseMsg on the reverse path back to IO_CPU.
+
+### D8. Response aggregation (`_pending` + `_parent_txns`)
+
+For the DMA and kernel-launch fan-outs (which expect per-PE ResponseMsgs
+arriving over the reverse path):
+
+```python
+self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
+self._parent_txns: dict[str, Any] = {}
+```
+
+- On dispatch: register `(expected, received=0, all_done)` and remember the
+  parent transaction.
+- `_worker` recognizes responses by `is_response=True` and routes them to
+  `_collect_response`, which increments `received` and signals `all_done`
+  once `received >= expected`.
+- After `yield all_done`, the fan-out path builds the aggregated
+  ResponseMsg:
+
+  ```python
+  resp_msg = ResponseMsg(
+      correlation_id=request.correlation_id,
+      request_id=request.request_id,
+      src_cube=cube_id,
+      src_pe=-1,             # -1 = M_CPU aggregate, not a single PE
+      success=True,          # failure semantics are not implemented
+  )
+  ```
+
+- The response Transaction travels back to IO_CPU along
+  `list(reversed(txn.path))`.
+
+The MMU fan-out (D7) uses a simpler inline list of `sub_done` events
+because PE_MMU is terminal: there is no ResponseMsg path to intercept.
+
+### D9. Helpers and configurable attributes
+
+`_resolve_pe_ids(target_pe)`:
+
+- `int` → `[target_pe]`
+- `tuple[int, ...]` → `list(target_pe)`
+- `"all"` → `range(n_slices)`, where `n_slices` comes from the cube's
+  `memory_map.hbm_slices_per_cube` (default 8).
+
+Used by the kernel-launch and MMU fan-out paths.
+
+A single configurable attribute determines per-instance latency:
+
+| Site | impl name | overhead_ns |
+| --- | --- | --- |
+| cube `m_cpu` | `builtin.m_cpu` | 5.0 |
+
+Applied once per Transaction in `run()`; it models the
+command-interpretation and dispatch-decision time at M_CPU.
+
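The three `target_pe` forms map onto a short helper. This is a sketch of the documented contract only: the free function below is hypothetical (the real helper is a method that reads `n_slices` from the cube's memory map), and raising `TypeError` for other inputs is an assumption.

```python
from typing import Union


def resolve_pe_ids(target_pe: Union[int, tuple, str], n_slices: int = 8) -> list[int]:
    """Expand a target_pe spec into a concrete list of PE ids."""
    if isinstance(target_pe, str):
        if target_pe == "all":
            return list(range(n_slices))
        raise ValueError(f"unknown target_pe string: {target_pe!r}")
    if isinstance(target_pe, int):
        return [target_pe]
    if isinstance(target_pe, tuple):
        return list(target_pe)
    raise TypeError(f"unsupported target_pe type: {type(target_pe).__name__}")


if __name__ == "__main__":
    print(resolve_pe_ids(3), resolve_pe_ids((0, 2)), resolve_pe_ids("all"))
```

The string case is checked first so that `"all"` is never confused with an iterable of ids.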
+## Consequences
+
+### Positive
+
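+- The three fan-out paths are cleanly separated by request type; adding a
+  new request kind takes one isinstance branch and one fan-out method.
+- The M_CPU.DMA channels are independent (reads and writes run
+  concurrently), and capacity=1 serializes only the dispatch stage.
+- Transit versus terminal behavior is a single `if next_hop` check, so the
+  same component handles forward dispatch and reverse response relay
+  without duplicating roles.
+- The `target_start_ns` pass-through (D6) preserves the cross-cube barrier
+  established by IO_CPU (ADR-0036 D3), and the fallback computation keeps
+  direct M_CPU unit tests working.
+- The `max` merge of per-PE metrics against existing values in the parent
+  `result_data` is robust to cross-cube IO_CPU fan-outs that share the same
+  parent.
+
+### Negative
+
+- No partial-failure semantics: a missing per-PE response stalls the parent
+  `all_done` indefinitely. Acceptable for simulation, but not suitable for
+  a production-style endpoint.
+- The cube-global hbm_ctrl fallback in `_resolve_dma_destinations` is dead
+  code (no such node exists since ADR-0017 D4). It is kept defensively but
+  invites confusion, so a follow-up cleanup is recommended.
+- DMA resource serialization applies only to dispatch (a `put` on an
+  unbounded store is immediate). The capacity=1 channel models "only one
+  request in flight at a time at this M_CPU", not "serialized transfer
+  duration"; for actual transfer parallelism, refer to the wire processes
+  (ADR-0015 D2) and `drain_ns`.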
+
+## Links
+
+- ADR-0009 D3 (M_CPU fan-out and aggregated completion semantics)
+- ADR-0009 D5 (`target_start_ns`: passed through unchanged when present;
+  computed as a per-cube barrier when absent)
+- ADR-0011 D-VA3 (the MmuMapMsg fabric path includes M_CPU as the PE
+  fan-out point)
+- ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same contract
+  at cube level)
+- ADR-0015 D5 (M_CPU.DMA is an internal subcomponent of M_CPU, not a
+  topology node)
+- ADR-0017 D9 (AddressResolver returns the per-PE `hbm_ctrl.pe{X}`)
+- ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes it
+  through unchanged; the nbytes=0 invariant is preserved across the
+  fan-out)
@@ -0,0 +1,205 @@
+# ADR-0036: IO_CPU Component Model
+
+## Status
+
+Accepted
+
+## Context
+
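+IO_CPU is the host-facing endpoint of the IO chiplet inside the simulation
+graph. PCIE_EP receives host messages from the runtime API and routes them
+through io_noc; for command-carrying requests (KernelLaunch, MmuMap/Unmap)
+io_noc forwards to IO_CPU, which:
+
+- Fans the request out to the per-cube M_CPUs.
+- Aggregates per-cube responses into a single host-visible completion.
+- For kernel launches, stamps a global `target_start_ns` barrier so that
+  every PE of every targeted cube starts executing the kernel body at the
+  same simulation time (ADR-0009 D5).
+
+Memory R/W traffic bypasses IO_CPU per ADR-0015 D4 / ADR-0016 D3, so in
+normal operation this component handles command-plane traffic only.
+
+This ADR documents the IO_CPU component implementation that realizes these
+responsibilities.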
+
+## Decision
+
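+### D1. Role
+
+IO_CPU is the host-facing endpoint of the IO chiplet. It has two main
+responsibilities:
+
+1. **Multi-cube fan-out**: distributes KernelLaunchMsg / MmuMapMsg /
+   MmuUnmapMsg to the per-cube M_CPUs.
+2. **Response aggregation**: collects per-cube ResponseMsgs and signals the
+   parent `txn.done` once every targeted cube has responded.
+
+A third, narrower responsibility applies only to KernelLaunchMsg:
+**stamping the global `target_start_ns` barrier** (D3).
+
+The component does **not**:
+
+- Make routing decisions: paths are precomputed by the router (ADR-0002).
+- Decode tensor or kernel internals: those concerns belong to M_CPU /
+  PE_CPU / the engines.
+- Handle PE-level fan-out: M_CPU fans out within the cube (ADR-0009 D3).
+- Handle the Memory R/W data path: it bypasses IO_CPU per ADR-0015 D4 and
+  ADR-0016 D3 (the Memory R/W resolution code in `_resolve_cube_targets`
+  exists only as a defensive fallback).
+
+Per call (`run()`): applies the configured `overhead_ns` once per incoming
+Transaction (D8).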
+
+### D2. Forward path: multi-cube fan-out
+
+When a non-response Transaction arrives, the worker:
+
+1. Pays `overhead_ns` via `run()`.
+2. Calls `_resolve_cube_targets` to derive the list of `(sip, cube)`
+   targets from the request (D5).
+3. For each target:
+   - Resolves the M_CPU node id via `ctx.resolver.find_m_cpu(sip, cube)`.
+   - Resolves the path via `ctx.router.find_node_path(io_cpu, m_cpu)`.
+   - Creates a per-cube sub-Transaction with `path` populated and forwards
+     it to `path[1]` (the first hop, on io_noc).
+4. Registers the aggregation state: `_pending[request_id] = (expected,
+   received=0, parent_done)`.
+
+### D3. KernelLaunch `target_start_ns` global barrier (ADR-0009 D5)
+
+IO_CPU is the canonical stamper of `target_start_ns`. When the request is a
+`KernelLaunchMsg`, IO_CPU computes a single global barrier covering every
+PE of every targeted cube:
+
+```text
+for (sip, cube) in cube_targets:
+    leg1 = compute_path_latency_ns(io_cpu → m_cpu(sip, cube), nbytes=0)
+    for pe_id in target_pe_ids:
+        leg2 = compute_path_latency_ns(m_cpu → pe_cpu(sip, cube, pe_id),
+                                       nbytes=0)
+        latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns
+        global_max = max(global_max, latency)
+
+target_start_ns = env.now + global_max
+```
+
+The request is then replaced (via `dataclasses.replace`) so that the
+stamped value propagates through the fan-out.
+
+Two overhead corrections:
+
+- `io_overhead_ns` is subtracted because IO_CPU has already paid it in
+  `run()` before this method executes.
+- `m_overhead_ns` is subtracted once because it appears twice in the path
+  latency, as the end point of leg1 and as the start point of leg2, while
+  M_CPU pays it only once at runtime.
+
+Every downstream PE_CPU yields until `target_start_ns` before starting the
+kernel body; as a result all PEs start at the same simulation time
+regardless of how long each individual dispatch path took.
+
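The barrier loop can be run on illustrative numbers with a small sketch. The function name and the `legs` input shape are hypothetical; in the real component the leg latencies come from `compute_path_latency_ns`, which is replaced here by precomputed values.

```python
def global_target_start_ns(now: float,
                           legs: dict[tuple[int, int], tuple[float, list[float]]],
                           io_overhead_ns: float,
                           m_overhead_ns: float) -> float:
    """legs maps (sip, cube) -> (leg1_ns, [leg2_ns for each target PE])."""
    global_max = 0.0
    for leg1, leg2_list in legs.values():
        for leg2 in leg2_list:
            # IO_CPU overhead was already paid; M_CPU overhead is counted
            # in both legs but paid once at runtime.
            latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns
            global_max = max(global_max, latency)
    return now + global_max


if __name__ == "__main__":
    legs = {(0, 0): (30.0, [12.0, 14.0]),     # near cube (made-up latencies)
            (0, 1): (50.0, [12.0, 20.0])}     # far cube
    print(global_target_start_ns(100.0, legs, io_overhead_ns=10.0, m_overhead_ns=5.0))
```

The slowest (cube, PE) pair sets the barrier for everyone, so nearer PEs simply wait until that time before starting the kernel body.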
+### D4. KernelLaunch sub-Transactions carry `nbytes=0`
+
+Per-cube sub-Transactions for a KernelLaunchMsg ignore the parent
+`txn.nbytes` and force `nbytes=0`:
+
+- A kernel launch is a control message; at the data-fabric level the
+  payload size is irrelevant.
+- With `nbytes > 0`, every per-cube sub-transaction would occupy the shared
+  first-hop fabric BW of io_noc. With 16 cubes this serializes the fan-out,
+  so distant M_CPUs overshoot `target_start_ns` and the D3 invariant breaks.
+
+Non-KernelLaunch sub-Transactions preserve `txn.nbytes` (relevant only to
+the defensive Memory R/W fallback path, which carries the real payload
+size).
+
+### D5. Cube-target resolution by request type
+
+`_resolve_cube_targets` dispatches on request type:
+
+| Request type | Source of `(sip, cube)` | Meaning of `target_cubes="all"` |
+| --- | --- | --- |
+| `MemoryWriteMsg` | `dst_sip`, `dst_cube` (or `PhysAddr.decode(dst_pa).die_id` fallback) | the single cube derived by PA decode |
+| `MemoryReadMsg` | `src_sip`, `src_cube` (or `PhysAddr.decode(src_pa).die_id` fallback) | the single cube derived by PA decode |
+| `KernelLaunchMsg` | tensor shards filtered by `shard.sip == my_sip` | every cube that owns a shard on this SIP |
+| `MmuMapMsg` / `MmuUnmapMsg` | `target_cubes` list filtered to this SIP | `range(cubes_per_sip)` from the spec |
+
+Each IO_CPU instance fans out only within its own SIP: `_my_sip()` parses
+the SIP id from the node id (e.g. `sip0.io0.io_cpu` → 0).
+
+The Memory R/W rows exist for defensive completeness; the engine's normal
+path routes Memory R/W through `_process_memory_direct()` /
+`find_memory_path()`, bypassing IO_CPU entirely (ADR-0015 D4 /
+ADR-0016 D3).
+
+### D6. Response aggregation
+
+`_pending: dict[request_id → (expected, received, parent_done)]`:
+
+- On dispatch: register `(len(cube_targets), 0, txn.done)`.
+- `_worker` recognizes responses by `is_response=True` and routes them to
+  `_collect_response`.
+- `_collect_response` increments `received`; once `received >= expected`,
+  it calls `parent_done.succeed()` and removes the entry from `_pending`.
+
+This is a simple per-request counter. There is no per-cube identity
+tracking or partial-failure handling: a missing response stalls the parent
+done indefinitely. Production-style failure paths are outside the scope of
+the current simulator model.
+
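The per-request counter can be sketched without SimPy by replacing the `parent_done` event with a plain callback. `ResponseAggregator` and its method names are hypothetical and only illustrate the register / collect / remove lifecycle described above.

```python
from typing import Callable


class ResponseAggregator:
    """Counts responses per request and fires a callback when all arrived."""

    def __init__(self) -> None:
        self._pending: dict[str, list] = {}

    def register(self, request_id: str, expected: int,
                 on_done: Callable[[], None]) -> None:
        # [expected, received, completion callback standing in for parent_done]
        self._pending[request_id] = [expected, 0, on_done]

    def collect(self, request_id: str) -> bool:
        """Record one response; returns True when the request completed."""
        entry = self._pending[request_id]
        entry[1] += 1
        if entry[1] >= entry[0]:
            del self._pending[request_id]
            entry[2]()
            return True
        return False


if __name__ == "__main__":
    agg = ResponseAggregator()
    agg.register("req-1", expected=2, on_done=lambda: print("req-1 done"))
    print(agg.collect("req-1"), agg.collect("req-1"))
```

As in the text, nothing times out in this sketch: if a response never arrives, the entry stays in `_pending` and the callback never fires.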
+### D7. `target_pe` resolution helper
+
+`_resolve_pe_ids(target_pe)`:
+
+- `int` → `[target_pe]`.
+- `tuple[int, ...]` → `list(target_pe)`.
+- `"all"` → `range(n_slices)`, where `n_slices` comes from the cube's
+  `memory_map.hbm_slices_per_cube` (default 8).
+
+Used by the D3 barrier computation to enumerate all PE targets per cube.
+
+### D8. Configurable `overhead_ns`
+
+A single attribute determines per-instance latency:
+
+| Site | impl name | overhead_ns |
+| --- | --- | --- |
+| IO chiplet `io_cpu` | `builtin.io_cpu` | 10.0 |
+
+Applied once per Transaction in `run()`. It models the
+command-interpretation and dispatch-decision time at IO_CPU.
+
+## Consequences
+
+### Positive
+
+- 크로스 큐브 및 크로스 SIP 커널 런치가 단일 전역 배리어를 공유한다
+  (D3 + D4) — 시작 시각의 큐브별 분기가 없다.
+- nbytes=0 불변식이 팬아웃을 공유 first-hop 패브릭 BW로부터 떼어내,
+  대규모(16 큐브)에서도 배리어의 정확도를 보존한다.
+- 단일 카운터를 통한 응답 집계 → 최소 상태, 결정론적 완료 순서.
+- SIP별 스코핑(`_my_sip()`)이 서로 다른 SIP의 IO_CPU들을 깨끗이
+  독립시킨다.
+
+### Negative
+
+- 부분 실패 의미가 없음 — 누락된 큐브별 응답은 부모를 무기한
+  스톨시킨다. 시뮬레이션 용도로는 충분하나 프로덕션 스타일의
+  엔드포인트로는 적합하지 않다.
+- `_pending`은 일반 dict이다; in-flight 요청이 상태로 누적된다. 현재
+  벤치마크 워크로드(미해결 런치가 적음)에는 허용 가능하나, 원리적으로는
+  무한하다.
+- `_resolve_cube_targets`의 Memory R/W 해석 분기는 정상 엔진 경로에서
+  데드 코드이다. 방어적으로 남겨두었으나 우회 경로가 변경되면 드리프트
+  위험을 초래한다.
+
+## Links
+
+- ADR-0002 (routing distance: path computation)
+- ADR-0009 D1 (a kernel launch is an endpoint request to IO_CPU)
+- ADR-0009 D3 (M_CPU fans out within a cube; IO_CPU fans out across cubes)
+- ADR-0009 D5 (canonical stamping of target_start_ns at IO_CPU)
+- ADR-0011 D-VA3 (MmuMapMsg goes through IO_CPU for cube fan-out)
+- ADR-0012 (host ↔ IO_CPU message schema)
+- ADR-0015 D4 (Memory R/W bypasses IO_CPU; kernel launches go through IO_CPU)
+- ADR-0016 D1 (IO chiplet io_noc: IO_CPU attaches here)
+- ADR-0016 D3 (the Memory R/W path bypasses IO_CPU)
+- ADR-0016 D4 (kernel-launch path through IO_CPU for command interpretation)
@@ -0,0 +1,185 @@
+# ADR-0037: Forwarding Component (forwarding_v1)
+
+## Status
+
+Accepted
+
+## Context
+
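+The simulation graph contains many node positions that exist purely to
+model fabric traversal: NOC mesh routers, switches, UCIe protocol
+endpoints, the IO chiplet io_noc, and transit cubes. They share a common
+pattern: receive a message, apply a per-component overhead (modeling header
+decode plus routing-decision time), and forward to the next hop along the
+precomputed path.
+
+This ADR defines the contract for these transit nodes: a single component
+type (`TransitComponent`) that handles flit-aware forwarding with wormhole
+cut-through semantics and is used under several impl names according to
+the conceptual role each instance plays.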
+
+## Decision
+
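+### D1. Role
+
+The Forwarding component (the `TransitComponent` class) is a **stateless
+transit node** of the simulation graph. It models every fabric position
+that messages physically traverse but where no semantic processing happens.
+
+Per traversal, the component:
+
+1. Reads the incoming Transaction or Flit from an `in_port`.
+2. Applies the configured per-component overhead (`overhead_ns`). It is
+   applied **once per Transaction**, even for multi-flit payloads (see D2).
+3. Looks up the next hop along the Transaction's precomputed `path`.
+4. Forwards to the corresponding `out_port`; at a terminal node (no next
+   hop) it signals `txn.done` when the `is_last` flit arrives.
+
+The component does **not**:
+
+- Make routing decisions: paths are precomputed by the router (ADR-0002 /
+  ADR-0017 D2). Forwarding only executes the per-hop step.
+- Model wire propagation or bandwidth occupancy: the separate wire
+  processes between components handle that (ADR-0015 D2).
+- Resolve addresses: AddressResolver does (ADR-0017 D9).
+- Aggregate completions: the terminal endpoints (IO_CPU, M_CPU, HBM_CTRL)
+  do.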
+
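+### D2. First-flit overhead model (header decode)
+
+The per-Transaction `overhead_ns` is applied **exactly once**, when the
+first flit arrives:
+
+- `_txn_decoded: set[int]` tracks the Transactions that have already paid
+  the overhead at this node.
+- On the first flit of a Transaction: `yield self.run(env, msg.txn.nbytes)`
+  pays the overhead.
+- Subsequent flits of the same Transaction skip the overhead and pass
+  through the pipeline with no added delay.
+- On the `is_last` flit: the Transaction is removed from `_txn_decoded`.
+
+This models real HW behavior, where header decode and the routing decision
+happen once on the first flit and the following payload flits stream along
+the same path (wormhole cut-through). Multi-hop pipelining emerges
+naturally: each hop adds its own first-flit overhead, but flits after the
+first never pay the overhead again at any hop the first flit has already
+passed.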
+
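The once-per-Transaction rule can be sketched as a tiny bookkeeping class. `FirstFlitOverhead` is a hypothetical name; it returns the delay a flit would pay instead of yielding on a SimPy timeout, and it keys on an integer id in the same spirit as `_txn_decoded: set[int]`.

```python
class FirstFlitOverhead:
    """Charges overhead_ns on the first flit of each transaction only."""

    def __init__(self, overhead_ns: float) -> None:
        self.overhead_ns = overhead_ns
        self._txn_decoded: set[int] = set()

    def delay_for(self, txn_id: int, is_last: bool) -> float:
        delay = 0.0
        if txn_id not in self._txn_decoded:
            # First flit seen at this node: pay the header-decode cost.
            self._txn_decoded.add(txn_id)
            delay = self.overhead_ns
        if is_last:
            # Transaction fully forwarded: drop its bookkeeping entry.
            self._txn_decoded.discard(txn_id)
        return delay


if __name__ == "__main__":
    router = FirstFlitOverhead(overhead_ns=2.0)   # cube NOC router value from D8
    print([router.delay_for(1, last) for last in (False, False, True)])
```

A single-flit transaction is both first and last, so it pays the overhead once and leaves no entry behind.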
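+### D3. Serial-worker forwarding (order preservation)
+
+The component's worker is a single SimPy process that consumes flits from
+`_inbox` and forwards them serially in arrival order. The component does
+**not** spawn an `env.process(...)` per flit.
+
+Rationale: if later flits ran in parallel processes while the first flit
+yields on `overhead_ns`, a later flit could overtake the first one. That
+would produce out-of-order delivery and let the `is_last` flit reach the
+destination before the first flit, corrupting both the transaction's
+completion semantics and downstream flit-index-based processing.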
+
+### D4. 경로 기반 next-hop 라우팅
+
+라우팅은 Forwarding 컴포넌트의 관심사가 **아니다**. Transaction은 라우터에
+의해 사전 계산된 `path`(ADR-0002 / ADR-0017 D2)와 함께 도착한다.
+컴포넌트는 단지 자신의 경로상 위치를 찾아 `path[index + 1]`로 전달한다:
+
+```python
+def _next_hop_in_path(self, txn):
+    my_id = self.node.id
+    path = txn.path
+    for i, n in enumerate(path):
+        if n == my_id and i + 1 < len(path):
+            return path[i + 1]
+    return None
+```
+
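+If `next_hop` is found and present in `out_ports`, the flit is forwarded.
+Otherwise (terminal node), `txn.done.succeed()` is called when the `is_last`
+flit arrives.
+
+### D5. Flit-aware mode and non-Flit fallback
+
+`_FLIT_AWARE = True` excludes the component from the flit reassembly logic
+in the base class's `_fan_in`. Flits are placed directly in `_inbox` without
+reassembly, which enables per-flit processing in the worker loop (D2, D3).
+
+Non-Flit messages (0-byte control Transactions and other payloads that are
+not chunked) fall through to the base class's legacy `_forward_txn` path via
+`env.process`. This preserves backward compatibility for control-plane
+traffic, which gains nothing from flit-level processing.
+
+### D6. Multi-stream merging in the base class
+
+Multi-stream FIFO merging at a router is the responsibility of the base
+class, not of Forwarding. The base class's `_fan_in` spawns one process per
+`in_port`; all of them push into a single shared `_inbox`. Flits from
+different upstream streams are therefore interleaved at flit granularity in
+the FIFO order of `_inbox`.
+
+The Forwarding worker merely consumes `_inbox` in arrival order, so a fair
+FIFO over the shared inbox correctly models per-router multi-flow
+arbitration.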
+
+### D7. A single implementation under several impl names
+
+The single `TransitComponent` class is registered in `components.yaml` under
+four impl names:
+
+- `builtin.forwarding`: general-purpose forwarding (e.g. `io_noc`,
+  `noc_router`, UCIe conn bridges)
+- `builtin.switch`: tray-level switch
+- `builtin.noc`: cube-level NOC fabric (legacy singleton; NOC routers now
+  use `builtin.forwarding`)
+- `builtin.ucie`: UCIe protocol endpoint
+
+All four aliases instantiate the same class with identical behavior.
+Per-instance differentiation exists only in `attrs.overhead_ns`. The separate
+impl names exist as intent tags for readability and to allow future
+divergence without breaking backward compatibility.
+
+### D8. Configurable `overhead_ns`
+
+A single attribute determines per-instance latency:
+
+| Use site | impl name | overhead_ns |
+| --- | --- | --- |
+| Tray-level switch | `builtin.switch` | 5.0 |
+| Cube NOC router | `builtin.forwarding` | 2.0 |
+| IO chiplet io_noc | `builtin.forwarding` | 0.0 |
+| UCIe protocol endpoint (`ucie-{N,S,E,W}`) | `builtin.ucie` | 8.0 |
+| UCIe conn bridge (`ucie-{PORT}.conn{N}`) | `builtin.forwarding` | 0.0 |
+
+The default is 0.0. The attribute is read on every `run()` call, so dynamic
+reconfiguration is possible, although nothing uses it today.
+
+## Consequences
+
+### Positive
+
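+- A single class handles every transit-node role in the simulation graph,
+  giving a minimal code surface for a component type with many instances.
+- Flit-aware processing plus a serial worker preserves wormhole semantics
+  across multi-hop paths without per-flit process overhead.
+- `overhead_ns` is the only per-instance tunable; routing, bandwidth, and
+  address resolution are cleanly separated into their own
+  components/modules.
+- Multi-stream merging emerges naturally from the base-class structure; no
+  router-specific logic duplicates fair FIFO arbitration.
+- The non-Flit fallback path keeps control-plane traffic working without
+  forcing every message into the flit framework.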
+
+### Negative
+
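+- The single class hides use-site intent inside the `attrs.overhead_ns`
+  setting; to see which impl name maps to which behavior class, a reader
+  must consult `topology.yaml` + `components.yaml`.
+- The per-flit serial worker becomes a bottleneck when `overhead_ns` is
+  large and many concurrent transactions arrive at the same router; at the
+  current values (0–8 ns) this is negligible.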
+
+## Links
+
+- ADR-0002 (routing distance: path computation)
+- ADR-0015 D1 (component port model)
+- ADR-0015 D2 (wire process: BW + propagation, separate from this component)
+- ADR-0015 D6 (transit cube forwarding pattern)
+- ADR-0016 D1 (IO chiplet io_noc; uses this component)
+- ADR-0017 D1 (cube NOC router; uses this component)
+- ADR-0017 D6 (UCIe decomposition; `ucie-{PORT}` instances use this component)
+- ADR-0033 D1 (flit-aware traversal, first-flit overhead, multi-stream merge
+  semantics)
@@ -0,0 +1,133 @@
+# ADR-0038: PCIE_EP Component Model
+
+## Status
+
+Accepted (2026-05-20).
+
+A component-level ADR in the same vein as ADR-0035 (M_CPU), ADR-0036
+(IO_CPU), and ADR-0037 (Forwarding).
+
+## First action (the very first thing it does)
+
+It takes one Transaction from `_inbox`, calls `run()` via `_forward_txn`, and
+inside it applies the PCIe protocol processing delay as an `env.timeout()` of
+`node.attrs["overhead_ns"]`. From that point on it follows the forwarding
+convention defined by the generic `ComponentBase` worker (if there is a next
+hop, `out_ports[next_hop].put(...)`; otherwise consume `drain_ns` and call
+`txn.done.succeed()`).
+
+In other words, **the only first job of PCIE_EP is to "express PCIe protocol
+overhead as time"**; it makes no additional decisions such as routing,
+payload transformation, or MMIO decoding.
+
+## Context
+
+PCIE_EP acts as **the boundary point between host and device** in the
+topology graph. The builder (`topology/builder.py`) creates an IO chiplet
+instance per SIP, places `pcie_ep`, `io_cpu`, and `io_noc` inside it, and then
+lays bidirectional edges between the cross-SIP switch on the external host
+side and `pcie_ep`:
+
+- `switch → pcie_ep`: host → device traffic (MemoryWrite, MemoryRead, KernelLaunch).
+- `pcie_ep → switch`: device-side outbound (e.g. cross-SIP IPCQ tokens).
+
+Inside the IOChiplet a bidirectional `pcie_ep ↔ io_noc` edge is laid, and the
+next hop then branches to `io_cpu` or to the cube-side hbm_ctrl path (see the
+ADR-0036 IO_CPU model). The router and resolver already know the contract
+required by SPEC R7, "PCIE_EP is the endpoint for memory operations", so
+helpers such as `find_pcie_ep(sip)` and `find_memory_path(pcie_ep, dst_node)`
+take PCIE_EP as their starting point.
+
+The problem is that all of these dependencies live on the
+builder/router/resolver side, while **no ADR specifies PCIE_EP's own internal
+model**. As a result:
+
+- "Which latency does PCIE_EP model?" can only be answered by reading the code.
+- It is asymmetric with other components (IO_CPU = ADR-0036, M_CPU = ADR-0035).
+- The rationale for deciding whether to refine the PCIe link-layer model
+  later (e.g. TLP credits, retry) is scattered.
+
+This ADR explicitly pins down the current **thin PCIE_EP model** and records
+that it is an intentional simplification (aligned with the ADR-0033 latency
+model simplification policy).
+
+## Decision
+
+### D1. PCIE_EP uses the generic ComponentBase forwarding worker as-is
+
+`PcieEpComponent` inherits from `ComponentBase` and does not override
+`_worker`/`_forward_txn`. Every Transaction is therefore processed in the
+following order:
+
+1. `_fan_in` loads the incoming message (or the Transaction reassembled from
+   Flits) into `_inbox`.
+2. `_worker` takes one item from `_inbox` and forks it with
+   `env.process(self._forward_txn(env, txn))` (per-message pipelining).
+3. `_forward_txn` calls, in order, the op_log start hook, the `run()` delay,
+   and the op_log end hook.
+4. `run()` is a single line: `yield env.timeout(overhead_ns)`.
+5. If there is a next hop, `out_ports[next_hop].put(txn.advance())`; if not
+   (the Transaction arrived at its terminal), consume `drain_ns` and then
+   call `txn.done.succeed()`.
+
+### D2. `overhead_ns` is PCIE_EP's only timing model
+
+Only `node.attrs["overhead_ns"]` is recognized as a latency parameter. The
+code default is `0.0`, and the IOChiplet `components.pcie_ep.attrs` in
+`topology.yaml` specifies the actual value (current topology:
+`overhead_ns: 5.0`, in ns).
+
+There is no separate BW serialization resource (simpy.Resource), queue depth,
+or retry model. Link-level BW serialization is handled on the wire side: the
+`pcie_ep_to_noc_bw_gbs = 256.0 GB/s` link applies inside the IOChiplet, and
+the system's `io_ep_to_switch` link BW applies outside (ADR-0015 port/wire
+model). The PCIE_EP component itself takes no part in this BW accounting.
+
+### D3. PCIE_EP is aware of bidirectional use but does not change behavior by direction
+
+The topology builder lays bidirectional `switch ↔ pcie_ep` and
+`pcie_ep ↔ io_noc` edges. PCIE_EP therefore handles:
+
+- inbound (host→device): forwards a Transaction arriving from the switch
+  toward io_noc via the next-hop computation.
+- outbound (device→host): forwards a Transaction arriving from io_noc/io_cpu
+  toward the switch.
+
+Both cases are handled by the generic forwarding worker of D1, and the
+component code itself does not distinguish direction (it only follows
+`txn.next_hop`).
+
+### D4. PCIE_EP is not Flit-aware (legacy reassembly path)
+
+`_FLIT_AWARE` is not set to `True`. `_fan_in` therefore reassembles Flits
+chunkified upstream into their parent Transaction before placing it in
+`_inbox` (aligned with the ADR-0033 Phase 2c incremental rollout policy).
+
+D4 will be re-evaluated if PCIE_EP is ever extended with a PCIe TLP-level
+credit model.
+
+### D5. PCIE_EP is a **named node** for the routing helpers
+
+`find_pcie_ep(sip, io_id="io0")`, `find_all_pcie_eps()`, and
+`find_memory_path(pcie_ep, dst_node)` in `policy/routing/router.py` treat
+PCIE_EP as the start (or end) of a memory path. The component itself
+provides no information to these helpers; the naming convention
+(`sip{S}.{io_id}.pcie_ep`) is guaranteed by the topology builder.
+
+## Alternatives Considered
+
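+### A1. PCIe TLP-level model (credits, retry, MPS splitting)
+
+Rejected. It conflicts with the simplification principle stated in ADR-0033
+that "the current latency model is expressed as abstract overhead + BW
+serialization". Host↔device protocol conformance is intentionally
+out-of-scope per SPEC §5 "Non-Goals".
+
+### A2. Limit in-flight Transactions in PCIE_EP with its own simpy.Resource
+
+Rejected. Host traffic is not a contention bottleneck in current workloads.
+If it becomes necessary, it will be introduced in a separate ADR (for
+compatibility, D1 stays as-is and D2 is extended).
+
+### A3. Merge PCIE_EP with IO_CPU
+
+Rejected. PCIE_EP is the first protocol-boundary node encountered from the
+host side, and IO_CPU is the device-side control-plane processing node
+(ADR-0036). Decision costs such as traffic fan-out and command decoding
+accumulate in IO_CPU, and it makes sense for PCIE_EP to express only the
+link-edge overhead. Merging them would mix the two responsibilities and go
+against the spirit of ADR-0007 (runtime API/sim_engine boundary).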
+
+## Consequences
+
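+- PCIE_EP gets an explicit model ADR while carrying almost no code of its
+  own, which improves consistency and lowers maintenance cost.
+- If PCIe-level refinement becomes necessary, a new ADR that extends D2/D4
+  will supersede this one.
+- D5 makes explicit that router helpers such as `find_memory_path` depend on
+  PCIE_EP as a named node, so the blast radius of a component ID naming
+  change is clear.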
@@ -0,0 +1,194 @@
+# ADR-0039: PE_MMU Component Model (dual role: component + utility)
+
+## Status
+
+Accepted (2026-05-20).
+
+The VA model of ADR-0011 (PA/VA/LA address model) only declares that "PE_MMU
+translates VA→PA"; this ADR separately pins down **the behavioral model of
+the PE_MMU component itself**.
+
+## First action (the very first thing it does)
+
+At construction time it reads `node.attrs["page_size"]` (default `2 MiB`) and
+`node.attrs["tlb_overhead_ns"]` (default `0.0`) and instantiates the internal
+`PeMMU` object (`policy.address.pe_mmu.PeMMU`) exactly once. This object is
+the single owner of the page table, the sub-page region lists, and the TLB
+overhead.
+
+At runtime the first action splits into two paths:
+
+- **Component path (inbox consumption)**: `_worker` takes one Transaction from
+  `_inbox`; if its `request` is an `MmuMapMsg`, it calls
+  `self._mmu.map(va, pa, size)` for each entry and then `txn.done.succeed()`.
+  For an `MmuUnmapMsg` it calls `unmap(va, size)`, and any other type is
+  handed to the standard `_forward_txn`. In other words, **the MMU's first
+  job is to "apply map/unmap commands to the page table"**.
+- **Utility path (direct call)**: engines inside the same PE, such as PE_DMA /
+  PE_GEMM, call `pe_mmu.mmu.translate(va)` directly. No SimPy event occurs on
+  this path; the caller handles `yield env.timeout(mmu.overhead_ns)` in its
+  own process (when overhead_ns > 0).
+
+## Context
+
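+ADR-0011 defines the three address models PA/VA/LA and agreed only that "VA
+model = translation through PE_MMU". In the code, however, `PeMmuComponent`
+plays two complementary roles at once:
+
+1. **A component in the topology graph**: it receives `MmuMapMsg` /
+   `MmuUnmapMsg` sideband messages over the cube NoC and updates the page
+   table.
+2. **A PE-local utility object**: PE_DMA / PE_GEMM of the same PE call
+   `translate(va)` directly at zero latency (or with only `overhead_ns`
+   borne on the caller side).
+
+No ADR covers both roles, which leaves the following ambiguities:
+
+- "Why does MMU translation show no SimPy event?" (In fact the caller
+  accounts for it.)
+- What is the sub-page region model, and why that model? (It is in the code
+  docstring but has no ADR; only a reference to the memory note
+  `project_mmu_subpage_stopgap` exists.)
+- **From whom** do map/unmap messages arrive, and **by when** must the update
+  take effect (ordering contract)?
+
+In addition, `PeMMU.map()` has "later append, last-write-wins (reverse
+scan)" semantics. This is a **stopgap** added intentionally for DPPolicy
+sub-page sharding scenarios (e.g. 128 B payloads × 4 KB pages) that a simple
+single-PA page table model cannot express. The ADR needs to pin down that
+this is a simplification that differs from a real HW MMU.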
+
+## Decision
+
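+### D1. Making the dual role explicit: component and utility
+
+`PeMmuComponent` exposes the following two interfaces within a single class:
+
+- Component interface: `_inbox` consumption and the `_worker` loop (MMU
+  sideband message handling).
+- Utility interface: the `pe_mmu.mmu` attribute exposes the underlying
+  `PeMMU` object; PE_DMA / PE_GEMM hold this object directly and call
+  `translate()`.
+
+The latter is **not a layer skip**: the inside of a PE is a sibling
+relationship within the single "components" layer defined by ADR-0007, and a
+direct call on a PE_MMU object obtained from the same PE prefix is not
+cross-layer. A cross-layer violation applies only when a call crosses the
+runtime API / sim_engine / components boundaries.
+
+### D2. Latency model: `translate()` is a pure function, overhead is the caller's responsibility
+
+`PeMMU.translate()` is a pure function and performs no SimPy yield. After
+translation, the caller (a PE engine) issues
+`if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`
+in its own process.
+
+Reason: the PE engine's SimPy process already carries its own record_start /
+record_end (op_log) hooks, so it can capture timing consistently. If the MMU
+created a separate process, it would split the PE engine's processing flow
+in two and blur the op_log/pipeline overlap semantics.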
+
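+#### D2.1. Asymmetry in the current implementation: pipeline vs non-pipeline (known asymmetry)
+
+At the time of writing, the `pe_dma.py` implementation handles overhead
+differently on its two call paths:
+
+- **non-pipeline (`handle_command`)**: right after `translate()` it issues
+  `if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`.
+- **pipeline (`_do_pipeline_dma`)**: it only calls `translate()` and **omits**
+  the overhead timeout. The function comment says "same logic as
+  non-pipeline path", but the two do not actually match.
+
+In the default topology `tlb_overhead_ns = 0.0`, so the difference does not
+show up directly in timing. In a simulation configured with
+`tlb_overhead_ns > 0`, however, GEMM/Math on the pipeline path measures
+faster than the same workload on the non-pipeline path by the amount of the
+MMU overhead.
+
+The D2 contract is that "**every** caller is responsible for the overhead",
+and the omission on the pipeline path is **an implementation inconsistency,
+not an intended design**. Nothing in ADR-0014 D6 (pipeline self-routing)
+states that this overhead is waived.
+
+Remediation options (a separate Phase 1/2 proposal is required):
+
+- (a) Add `if mmu.overhead_ns > 0: yield env.timeout(...)` to
+  `_do_pipeline_dma` as well, matching the D2 contract. Recommended.
+- (b) Narrow the D2 contract to "applies to the non-pipeline path only" and
+  justify the pipeline-path waiver together with an update to ADR-0014 D6.
+  Not recommended, because it weakens the meaning of the overhead.
+
+This ADR recommends (a) and assumes it will be corrected by a separate small
+change before or right after acceptance.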
+
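+### D3. Page table structure: sub-page region list (stopgap)
+
+The structure
+`self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]]`
+allows a single page to hold several disjoint regions.
+
+- `map(va, pa, size)`: **appends** regions (several of them when the range
+  crosses pages).
+- `translate(va)`: fetches the region list by VPN, then scans it in
+  **reverse** and takes the first matching region (last-write-wins).
+- `unmap(va, size)`: removes only regions whose extent is **fully contained**
+  in the unmap range. A partial overlap with misaligned boundaries is left
+  in place, and the mapping caller is responsible for unmapping on the same
+  boundaries it mapped.
+
+ADR-0011's VA model is hereby supplemented with the statement that this is a
+**simulator stopgap** that differs from a real HW MMU. Its purpose is to
+prevent silent misrouting caused by last-write-wins overwrites under DPPolicy
+sub-page sharding (memory note: project_mmu_subpage_stopgap).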
+
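The region-list semantics of D3 can be sketched as a small standalone class. This is a minimal illustration under stated assumptions, not the real `PeMMU`: `SubpageTable` and `_chunks` are hypothetical names, and the per-page splitting in `map()` is one reading of "appends regions when the range crosses pages".

```python
from collections import defaultdict


class PageFault(Exception):
    """Raised when no region covers the requested virtual address."""


class SubpageTable:
    """Hypothetical stand-in for the region-list page table described in D3."""

    def __init__(self, page_size: int = 2 * 1024 * 1024) -> None:
        self.page_size = page_size
        # vpn -> list of (start_in_page, end_in_page, pa_at_offset_zero)
        self._table = defaultdict(list)

    def _chunks(self, va: int, size: int):
        """Split [va, va + size) into per-page (vpn, start, end, consumed)."""
        done = 0
        while done < size:
            vpn, start = divmod(va + done, self.page_size)
            end = min(self.page_size, start + size - done)
            yield vpn, start, end, done
            done += end - start

    def map(self, va: int, pa: int, size: int) -> None:
        for vpn, start, end, done in self._chunks(va, size):
            # Store the PA that page offset 0 would map to, so translate()
            # can simply add the in-page offset.
            self._table[vpn].append((start, end, pa + done - start))

    def translate(self, va: int) -> int:
        vpn, offset = divmod(va, self.page_size)
        # Reverse scan: the most recently appended matching region wins.
        for start, end, pa_zero in reversed(self._table.get(vpn, [])):
            if start <= offset < end:
                return pa_zero + offset
        raise PageFault(hex(va))

    def unmap(self, va: int, size: int) -> None:
        for vpn, start, end, _ in self._chunks(va, size):
            # Drop only regions fully contained in the unmapped extent.
            self._table[vpn] = [
                r for r in self._table.get(vpn, [])
                if not (start <= r[0] and r[1] <= end)
            ]


table = SubpageTable(page_size=4096)
table.map(0x1000, 0x8000, 128)  # first 128 B shard
table.map(0x1080, 0x9000, 128)  # second shard in the same page
table.map(0x1000, 0xA000, 64)   # later mapping overlapping the first shard
```

With these three mappings, `table.translate(0x1005)` resolves through the latest region and `table.translate(0x1064)` still falls back to the first shard. A caller that unmaps `(0x1000, 64)` removes only the 64 B region, because the 128 B region is not fully contained in that extent.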
+### D4. PageFault is a PA-fallback signal
+
+When `translate()` is called with an unmapped VA, a `PageFault` is raised.
+PE_DMA catches this exception and **uses the original address as a PA
+unchanged** (the PA-fallback compatibility path of ADR-0011). PageFault is
+therefore not an error but a signal meaning "with no VA mapping, interpret
+the address as a PA".
+
+This compatibility path is intentional behavior that keeps backward
+compatibility with the PA-only mode agreed in ADR-0011.
+
+### D5. Receive contract for MMU sideband messages
+
+`MmuMapMsg` / `MmuUnmapMsg` reach the PE_MMU component's `_inbox` through the
+fabric (as R10 states, "MMU map installation follows fabric latency"). The
+message schema is defined by the runtime API (`runtime_api/kernel.py`); the
+current format is:
+
+- `MmuMapMsg.entries: tuple[dict, ...]`: each dict has the keys
+  `{"va": int, "pa": int, "size": int}`.
+- `MmuUnmapMsg.entries: tuple[dict, ...]`: each dict has the keys
+  `{"va": int, "size": int}`.
+
+Receive handling on the PE_MMU side:
+
+1. `_worker` takes one message from `_inbox.get()`.
+2. It checks with `hasattr(msg, "request")` whether the message is a
+   Transaction wrapper.
+3. If `isinstance(msg.request, MmuMapMsg)`, it calls
+   `self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"])` for each entry.
+4. If `isinstance(msg.request, MmuUnmapMsg)`, it calls
+   `self._mmu.unmap(va=e["va"], size=e["size"])` for each entry.
+5. In both cases it signals completion with `msg.done.succeed()`.
+
+When the external caller (the runtime API side) awaits `done`, "the moment the
+mapping is installed on the device" is guaranteed in SimPy time. This wait
+realizes ADR-0011's requirement that "MMU map installation incurs measured
+fabric latency".
+
+This ADR does not define the **sender and fan-out policy** of sideband
+messages; that is the runtime API's responsibility. It specifies only the
+receive contract on the PE_MMU side.
+
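+### D6. Non-MMU Transactions are delegated to generic forwarding
+
+If the `request` of a message taken from the inbox is neither `MmuMapMsg` nor
+`MmuUnmapMsg` (or the message has no `request` attribute), `_worker` hands it
+to `_forward_txn`. This avoids ruling out future use of PE_MMU as a transit
+node on the cube-internal NOC (no such transit traffic exists today, but the
+behavior is safe against topology changes).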
+
+## Alternatives Considered
+
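+### A1. Make translate() a SimPy generator
+
+Rejected. As explained in D2, it would blur the PE engine's op_log/pipeline
+overlap semantics. The current pattern, where the caller issues the timeout,
+matches the op_log accounting.
+
+### A2. Shrink the page size itself (e.g. 128 B) instead of using sub-page region lists
+
+Rejected. It would blow up page table memory and the size of cube-wide map
+messages. Even if DPPolicy sharding requires 128 B, most other mappings are
+in 2 MiB units, so a small page size inflates the average cost.
+
+### A3. Keep PE_MMU only as a built-in helper of PE_CPU, not as a component
+
+Rejected. To express the "MMU map installation with latency measured through
+the fabric" required by ADR-0011 (the MmuMapMsg path), it must be a node in
+the topology graph. PE_MMU also needs to appear as a node in the cube NoC
+visualizer for debugging and diagnosis to stay consistent.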
+
+## Consequences
+
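+- The dual role of PE_MMU (component + utility) is justified at the ADR
+  level, which provides an argument against future refactoring pressure
+  ("unify it into one or the other").
+- The ADR states that the sub-page region model is a simulator stopgap, which
+  becomes the baseline for evaluating whether the stopgap can be removed
+  when the LA model (ADR-0011) is introduced.
+- The contract that `translate()` does not yield is fixed by the ADR, so a
+  future proposal to "put a timeout inside the MMU" can be declined on the
+  basis of D2.
+- PA fallback (D4) is stated to be a normal flow, which prevents anyone from
+  mistaking PageFault for an error and adding defensive logic.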
@@ -0,0 +1,142 @@
+# ADR-0040: PE_TCM Component Model (dual-channel BW serialization)
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0014 (PE Pipeline Execution Model) mentions that "PE_TCM is a BW-based
+serialized scratchpad memory" (D1); this ADR separately specifies the exact
+behavioral model of the TCM component itself.
+
+## First action (the very first thing it does)
+
+As soon as `start()` is called, it creates two
+`simpy.Resource(env, capacity=1)` objects and keeps them in
+`self._read_res` / `self._write_res`. These two resources are the single
+decision point that serializes the **read channel** and the **write
+channel** to one in-flight request each.
+
+First runtime action: `_worker` takes one message from `_inbox` and branches
+on its type:
+
+- `TcmRequest` (from `pe_fetch_store`): forked with
+  `env.process(self._handle_tcm_request)`. In other words, **the TCM's first
+  job is to "take the channel lock that matches the direction
+  (read/write)"**. After the lock is acquired, if `bw > 0 and nbytes > 0` it
+  waits `env.timeout` for `delay_ns = nbytes / bw`, then calls
+  `req.done.succeed()`.
+- Anything else (Transaction): forked with `env.process(self._forward_txn)`
+  (legacy fabric transit path).
+
+At construction time it reads and stores `node.attrs["read_bw_gbs"]` /
+`node.attrs["write_bw_gbs"]` (default `512.0 GB/s` each).
+
+## Context
+
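+In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives the following two kinds
+of traffic:
+
+1. **`TcmRequest` from PE_FETCH_STORE to PE_TCM**: for TCM ↔ Register File
+   transfers, PE_FETCH_STORE sends a short sideband request to obtain the
+   access latency serialized by the TCM's BW (`direction = "read"` or
+   `"write"`, `nbytes`, and a `done` event).
+2. **Legacy Transaction forwarding**: a generic forwarding path kept in case
+   the TCM ends up as a transit node in the topology graph (not used on the
+   current critical path, but preserved).
+
+The problem: ADR-0014 only says "PE_TCM is BW-based serialization". The code,
+however, explicitly implements the following:
+
+- **Reads and writes are separate channels and can proceed concurrently**,
+  while requests in the same direction are serialized with capacity=1.
+- BW is configurable separately through the two values `read_bw_gbs` /
+  `write_bw_gbs`.
+- The formula `delay_ns = nbytes / bw_gbs` (unit shorthand: GB/s × ns ≈ B).
+- When nbytes == 0 the BW term is skipped, but the channel lock is still taken.
+- `run()` yields for `overhead_ns` (default 0.0), but this is used only on
+  the legacy fabric path (Transaction forwarding).
+
+All of this needs to be pinned down in a separate ADR. In particular, "why
+are read and write separate channels" and "who decides the BW" need a clear
+rationale for when someone later wants to change, say, capacity to 2.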
+
+## Decision
+
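+### D1. Dual channel: read and write are independent resources
+
+`_read_res = simpy.Resource(env, capacity=1)`,
+`_write_res = simpy.Resource(env, capacity=1)`.
+Concurrent requests in the same direction are serialized in the resource
+queue, while requests in different directions can proceed at the same time.
+This matches the model of a TCM operated as dual-port (read port + write
+port) in real HW, and the separation is intentional so that the normal case
+in the GEMM pipeline, where fetch (read) and store (write) overlap in time,
+can be expressed in a BW-serialization model.
+
+### D2. BW model of a single channel: `nbytes / bw_gbs`
+
+After acquiring the channel lock, if `nbytes > 0 and bw > 0` then
+`yield env.timeout(nbytes / bw_gbs)`. The unit shorthand is GB/s × ns ≈ B,
+the same BW formula used throughout the simulator (see ADR-0033; the
+simulator uses one consistent shorthand unit).
+
+- `nbytes == 0`: the BW term is 0, but the lock is taken and released
+  immediately. This case is intentional: even when a plan generator that
+  emits an empty fetch/store has PE_FETCH_STORE send a request with only
+  `nbytes` set to 0, the TCM-side op_log / channel accounting is still
+  consumed exactly once, consistently.
+- `bw == 0` (a config mistake): the timeout call itself is skipped, so the
+  request passes in zero time. This does not occur in a correct setup.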
+
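The dual-channel timing of D1 and D2 can be approximated without SimPy by tracking when each channel becomes free. This is a minimal sketch under stated assumptions: `DualChannelTcm` and `submit` are hypothetical names, requests are assumed to be submitted in non-decreasing time order, and the unknown-direction check is an illustration rather than the real `_handle_tcm_request` behavior.

```python
class DualChannelTcm:
    """Hypothetical timing-only model: one in-flight request per direction."""

    def __init__(self, read_bw_gbs: float = 512.0,
                 write_bw_gbs: float = 512.0) -> None:
        self._bw = {"read": read_bw_gbs, "write": write_bw_gbs}
        self._free_at = {"read": 0.0, "write": 0.0}  # when each lock frees

    def submit(self, now_ns: float, direction: str, nbytes: int) -> float:
        """Return the completion time (ns) of a request issued at now_ns."""
        if direction not in self._bw:
            raise ValueError(f"unknown direction: {direction!r}")
        bw = self._bw[direction]
        # Wait for the channel lock; same-direction requests queue up here.
        start = max(now_ns, self._free_at[direction])
        # 1 GB/s moves 1 byte per ns, so nbytes / bw_gbs is already in ns.
        delay = nbytes / bw if (nbytes > 0 and bw > 0) else 0.0
        self._free_at[direction] = start + delay
        return self._free_at[direction]


tcm = DualChannelTcm()
first_read = tcm.submit(0.0, "read", 1024)   # occupies the read channel
second_read = tcm.submit(0.0, "read", 512)   # queues behind first_read
write = tcm.submit(0.0, "write", 512)        # independent write channel
```

The two reads serialize on the read channel while the write completes independently, which is the overlap of fetch and store that D1 is meant to allow.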
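+### D3. BW is configured through `read_bw_gbs` / `write_bw_gbs` in `node.attrs`
+
+The default is `512.0 GB/s`. The topology builder (`topology/builder.py`)
+passes these attrs when it instantiates the TCM from `pe_template`. Any
+change to the default must go together with a decision on the ADR-0014 D1 or
+ADR-0033 latency model side.
+
+### D4. PE_TCM owns the TcmRequest schema
+
+`@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
+is defined in `components/builtin/pe_tcm.py`. PE_FETCH_STORE only imports this
+dataclass to create and send instances. The caller side does not define the
+schema because:
+
+- The meaning of BW serialization is the TCM's responsibility: the TCM
+  decides which fields feed the serialization decision.
+- Validating that the `direction` string is narrowed to `"read"` / `"write"`
+  is also handled on the TCM side (the if/else branch in
+  `_handle_tcm_request`).
+
+### D5. Preserving the legacy Transaction forwarding path
+
+When `_worker` receives a message that is not a `TcmRequest`, it sends it to
+`_forward_txn`, where the `overhead_ns` of `run()` applies. In the current
+standard PE pipeline the TCM is never a transit node for a Transaction, but
+the path is preserved for future fabric topology changes (orthogonal to the
+usage pattern of D1).
+
+On the op_log side this path is recorded as ordinary Transaction accounting,
+and it does not take the BW channel lock.
+
+### D6. PE_TCM is not a data store of its own (timing only)
+
+The TCM models **time only**. The actual data payload is held by sim_engine's
+separate `memory_store` (if present), and the TCM component does not update
+it. PE_FETCH_STORE likewise obtains only the BW delay through a TcmRequest
+and handles the actual register contents through a separate path (the
+ADR-0020 2-pass data execution model, where data is processed in Phase 2).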
+
+## Alternatives Considered
+
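+### A1. Single channel (read+write sharing with capacity=2)
+
+Rejected. It would artificially serialize the normal case where fetch (read)
+and store (write) overlap in time, so the BW upper bound of the PE pipeline
+would be modeled incorrectly.
+
+### A2. Channel capacity > 1 (e.g. 2-banked TCM)
+
+Rejected. The current HW model assumes a single bank. Extending to multiple
+banks requires a separate ADR, which would then supersede D1. Raising
+capacity at this stage would leave the BW upper bound unchanged while only
+loosening the nominal serialization, which lowers model accuracy.
+
+### A3. Generalize the BW formula to `nbytes / bw + overhead_ns`
+
+Rejected. `overhead_ns` is used only on the legacy forwarding path of D5. If
+the fetch/store critical path ever needs extra overhead, it is more
+appropriate in terms of responsibility boundaries to place it in the
+PE_FETCH_STORE-side `run()` or in the register-file access model rather than
+in the TCM.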
+
+## Consequences
+
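+- The TCM's BW accounting is fixed at the ADR level, so questions that arise
+  when interpreting the op_log of GEMM/Math sweeps, such as "why did fetch
+  and store proceed concurrently" or "why is only the same direction
+  serialized", are quickly resolved by D1.
+- The blast radius of a future multi-bank TCM or a change to asymmetric
+  read/write BW becomes clear (which of D1, D2, D3 is modified).
+- The point that the TCM is not a data store (D6) is explicit, which firms up
+  the responsibility boundary with ADR-0020 2-pass execution.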
@@ -0,0 +1,187 @@
+# ADR-0041: Cube SRAM Component Model — terminal scratchpad on cube NoC
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0017 (Cube NOC and HBM Connectivity) only mentions that SRAM exists as an
+attachment to the cube NoC; this ADR fills that gap by specifying the
+latency/response model of the SRAM component itself.
+
+## First action (the very first thing it does)
+
+Right after `_worker` takes one Transaction from `_inbox`, the very first
+thing it does is call `yield from self.run(env, txn.nbytes)`, inside which it
+issues an `env.timeout()` of `node.attrs["overhead_ns"]` (default `0.0`).
+
+In other words, **the SRAM's first job is to "express access overhead as
+time"**. After the overhead is consumed it yields `drain_ns` (the terminal BW
+serialization cost assigned to that Transaction), and then it creates a
+`ResponseMsg` and sends it along the reverse path.
+
+This differs from the generic `ComponentBase._worker`: the SRAM knows it is a
+**terminal node**, so its own worker spells out the order `run → drain →
+_send_response` instead of going through `_forward_txn`.
+
+## Context
+
+The cube topology (`topology/builder.py`) creates the following named nodes
+per cube:
+
+- `sip{S}.cube{C}.m_cpu`
+- `sip{S}.cube{C}.sram`
+- `sip{S}.cube{C}.hbm_ctrl` (one partition per PE)
+- `sip{S}.cube{C}.pe{P}` (the sub-components inside the PE)
+
+SRAM is one of the cube NoC attachments and is attached to the nearest router
+(`topology/mesh_gen.py` determines the nearest router from the placement
+coordinates and adds it to `attach`). The builder lays a bidirectional
+`sram ↔ router` edge (BW: `sram_to_router_bw_gbs`, default `128.0 GB/s`).
+
+The two core roles of SRAM:
+
+1. **Fabric terminal**: the end point of memory-access Transactions headed to
+   SRAM over the cube NoC. SRAM consumes the access overhead and the drain,
+   then returns a response along the reverse path.
+2. **One of the IPCQ slot tiers**: of the `buffer_kind ∈ {tcm, sram, hbm}`
+   defined by ADR-0023 D9.7, the slot bw/overhead of the `sram` tier is read
+   from `common/ipcq_types._BUFFER_KIND_BW`; the current value is
+   `(512.0 GB/s, 2.0 ns)`. This value is separate from the `overhead_ns` in
+   the SRAM node attrs, and PE_DMA converts it to time at IPCQ slot
+   accounting time.
+
+Both roles are fulfilled by a single SRAM component, and without a dedicated
+ADR:
+
+- The answer to "which latency does SRAM model?" (fabric drain + overhead, or
+  the slot latency of the IPCQ tier?) is scattered.
+- It is unclear what the SRAM size attr (`size_mb`) will actually mean in the
+  future. The current code does not use size and models timing only.
+- The rationale for which cube router SRAM attaches to (placement-based)
+  lives only in the topology code.
+
+## Decision
+
+### D1. SRAM is a terminal scratchpad node on the cube NoC
+
+`SramComponent` inherits from `ComponentBase` but overrides `_worker` to
+express terminal semantics directly:
+
+```
+while True:
+    txn = yield self._inbox.get()
+    yield from self.run(env, txn.nbytes)     # overhead_ns
+    if drain_ns > 0: yield env.timeout(drain_ns)
+    yield from self._send_response(env, txn)
+```
+
+Because SRAM must know the reverse path, this pattern needs its own worker
+rather than the generic `_forward_txn` (which forwards to the next hop).
+
+#### D1.1. Currently unused: the `_worker` override is a dormant path
+
+In the codebase at the time of writing, **no component actually sends a
+Transaction to the SRAM node**. The locations found to reference the SRAM
+node ID are:
+
+- Routing helpers such as `policy/routing/router.py`: they only guarantee
+  that a path can be looked up.
+- `components/builtin/pe_dma.py::_handle_ipcq_inbound`: when the IPCQ slot has
+  `buffer_kind == "sram"`, it looks up only the *path* to
+  `bank_node = f"{cube_prefix}.sram"`, converts it with
+  `compute_drain_ns(path, ...)`, and **times out locally**. The Transaction
+  itself never flows to the SRAM node (see D4).
+- `tests/test_routing.py`: verifies connectivity only, via
+  `find_path("sip0.cube0.pe0", "sip0.cube0.sram")`.
+
+The `_worker`/`_send_response` override is therefore a **dormant code path**.
+It is preserved rather than deleted because:
+
+- It is immediately usable if a topology change makes SRAM the end point of
+  real fabric Transactions (e.g. explicit M_CPU → SRAM access).
+- Terminal behavior is semantically natural for the cube-attached scratchpad
+  defined by ADR-0017 (Cube NOC), so it is an intentional placeholder.
+
+The point at which this dormant state ends will be specified by a separate
+ADR (or a follow-up revision of this ADR).
+
+### D2. ResponseMsg creation and dispatch on the reverse path
+
+`_send_response`:
+
+1. Computes the reverse path with `reverse_path = list(reversed(txn.path))`.
+2. Creates `ResponseMsg(correlation_id=txn.request.correlation_id,
+   request_id=..., src_cube=<this cube>, src_pe=-1, success=True)`.
+3. Wraps it in `Transaction(request=resp_msg, path=reverse_path, step=0,
+   nbytes=0, done=env.event(), is_response=True)` and puts it on
+   `out_ports[reverse_path[1]]`.
+4. If the reverse path is abnormal (`< 2 hops`) or there is no ctx, falls back
+   to calling only the original `txn.done.succeed()`.
+
+`src_pe = -1` means "SRAM is not PE-localized". `src_cube` is filled by parsing
+the cube index out of the node ID (`sip{S}.cube{C}.sram`).
+
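The two pure pieces of `_send_response`, deriving the reverse route and parsing `src_cube`, can be sketched in isolation. This is a minimal illustration under stated assumptions: `cube_index` and `response_route` are hypothetical helper names, the router ID in the example path is invented, and the real method additionally builds the `ResponseMsg` and `Transaction` objects.

```python
import re
from typing import Optional


def cube_index(node_id: str) -> int:
    """Parse C out of a node ID shaped like 'sip{S}.cube{C}.sram'."""
    match = re.fullmatch(r"sip(\d+)\.cube(\d+)\.sram", node_id)
    if match is None:
        raise ValueError(f"not an SRAM node ID: {node_id!r}")
    return int(match.group(2))


def response_route(path: list[str]) -> Optional[tuple[list[str], str]]:
    """Return (reverse_path, first_hop), or None when the path is too short."""
    reverse_path = list(reversed(path))
    if len(reverse_path) < 2:
        # Fewer than 2 hops: the caller falls back to txn.done.succeed().
        return None
    # reverse_path[0] is the SRAM itself, so the first hop is index 1.
    return reverse_path, reverse_path[1]


# Hypothetical request path ending at the SRAM node.
request_path = ["sip0.cube0.pe0", "sip0.cube0.r3", "sip0.cube0.sram"]
route = response_route(request_path)
src_cube = cube_index(request_path[-1])
```

Returning `None` for a too-short path mirrors the fallback in step 4, where only the original `txn.done.succeed()` is called.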
+### D3. Timing parameters are split into `overhead_ns` and wire-side `drain_ns`
+
+- **Component-side latency**: `node.attrs["overhead_ns"]`. The default
+  topology sets it to about `2.0 ns`.
+- **Link-side serialization**: `drain_ns` is the value the Transaction carries
+  on arrival and is the result of the wire-side BW serialization of ADR-0015
+  (port/wire model). SRAM simply yields it as-is.
+- The `size_mb` attr (default `32 MiB`) is not currently used for timing; it
+  will be given a meaning once a capacity-aware model is introduced (in a
+  separate ADR).
+
+### D4. The SRAM component does not model IPCQ slot accounting directly
+
+Per ADR-0023 D9.7, the SRAM-tier write latency of an IPCQ slot is consumed by
+PE_DMA's `_handle_ipcq_inbound`, which calls
+`slot_io_latency_ns("sram", nbytes)` directly (that function uses the value of
+`common/ipcq_types._BUFFER_KIND_BW["sram"]`). That is:
+
+- When the SRAM component receives and handles a fabric Transaction, only
+  **D1, D2, and D3** apply.
+- When an IPCQ slot lives in SRAM, PE_DMA pays the time separately at IPCQ
+  slot-write time; this is unrelated to the SRAM component code and belongs
+  to IPCQ-side accounting.
+
+The separation is intentional: IPCQ is a fast path (sub-cycle slot
+bookkeeping) that does not go through fabric Transactions, so SRAM has no
+need to know about IPCQ.
+
+### D5. SRAM's attachment point on the cube NoC is placement-driven
+
+`topology/mesh_gen.py` looks at `placement.sram.pos_mm` (`topology.yaml`
+default `[1.5, 9.0]`) and adds `"sram"` to the `attach` of the nearest router.
+The builder (the attachment loop in `topology/builder.py`) reads that attach
+information and lays a bidirectional edge between the `sram` node and the
+router.
+
+This decision lives outside the SRAM component code (in mesh_gen / builder),
+and the component does not need to know which router it is attached to. It
+only needs to know that `txn.path` / `reverse_path` reach it through a
+router.
+
+### D6. SRAM is not a data store of its own (timing-only)
+
+In the same vein as ADR-0040 D6: the SRAM component models time only, and the
+actual data payload is held by sim_engine's `memory_store` (when present).
+
+## Alternatives Considered
+
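+### A1. Have SRAM use `_forward_txn` as-is and add a separate response node, like IO_CPU / HBM_CTRL
+
+Rejected. SRAM is a terminal on the cube NoC; adding a separate node to
+receive the response would add a meaningless hop and go against the cube NoC
+simplification spirit of ADR-0017.
+
+### A2. Have SRAM model BW serialization with its own resource
+
+Rejected. Link-side BW serialization (`drain_ns`) already captures the
+semantics well enough. Adding another `simpy.Resource` inside the component
+would double-count against the ADR-0015 wire-side model.
+
+### A3. Have SRAM handle IPCQ slot accounting on the component side
+
+Rejected. As stated in D4, IPCQ is a fast path and does not pass through
+fabric Transactions. If SRAM were aware of IPCQ, responsibility would split
+in two and become harder to reason about.
+
+### A4. Capacity-aware latency model driven by `size_mb`
+
+Rejected (at this stage). Capacity is used for little more than labeling in
+the topology visualizer, and no actual timing effect is modeled yet. It will
+be introduced in a separate ADR if it becomes necessary.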
+
+## Consequences
+
+- The SRAM timing model is fixed at the ADR level as
+  `overhead_ns + drain_ns + ResponseMsg(reverse_path)`, so an attempt to add
+  IPCQ slot latency to the SRAM component can be declined on the basis of D4.
+- `size_mb` is stated to be timing-neutral today (D3), which narrows the
+  compatibility impact of introducing a capacity-aware model later.
+- Placement-driven router attachment (D5) is explicit, so it is clear which
+  part is affected when the SRAM coordinates move (`mesh_gen` only).
@@ -0,0 +1,194 @@
+# ADR-0042: Tile Plan Generators (GEMM/Math pipeline plan builders)
+
+## Status
+
+Accepted (2026-05-20).
+
+This ADR states explicitly that `tiling.py` is not a SimPy component but a
+**plan-generator module**.
+
+D6 (tile plan / self-routing) of ADR-0014 (PE Pipeline Execution Model) does
+not directly define the tile-plan generation algorithm, so this ADR fills
+that gap.
+
+## First action (the very first thing it does)
+
+When `generate_gemm_plan(M, K, N, tile_m, tile_k, tile_n, ..., pe_prefix,
+a_pinned, b_pinned, epilogue_specs)` is called, the very first thing it does
+is **compute the tile counts and build the component ID strings**:
+
+```
+M_tiles = max(1, ceil(M / tile_m))
+K_tiles = max(1, ceil(K / tile_k))
+N_tiles = max(1, ceil(N / tile_n))
+dma_id   = f"{pe_prefix}.pe_dma"
+fetch_id = f"{pe_prefix}.pe_fetch_store"
+gemm_id  = f"{pe_prefix}.pe_gemm"
+math_id  = f"{pe_prefix}.pe_math"
+```
+
+In other words, **the plan generator's first job is to "derive the tile
+counts by ceiling division and build the four sub-component IDs of this PE
+in one go"**. It never touches SimPy events or environment objects; this
+module consists of pure functions.
+
+`generate_math_plan(M, N, tile_m, tile_n, ..., math_op, src_addr, dst_addr,
+pe_prefix)` likewise starts by deriving `M_tiles` and `N_tiles` and building
+three component IDs (`dma_id`, `fetch_id`, `math_id`).
+
+## Context
+
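+ADR-0014 D6 agreed only that "when PE_SCHEDULER receives a CompositeCmd it
+generates a TilePlan and feeds self-routing tile tokens". In the code,
+however, **the concrete plan-generation algorithm** lives in the
+`src/kernbench/components/builtin/tiling.py` module, and this module:
+
+- Is a collection of **pure functions**, not a component
+  (`generate_gemm_plan`, `generate_math_plan`).
+- Does not depend on the SimPy environment, queues, op_log, hooks, and so on.
+- Returns a `PipelinePlan` (dataclass) as its result.
+
+The earlier G4 analysis wrongly assumed `tiling.py` was a component, whereas
+it is actually a set of plan-builder functions injected into PE_SCHEDULER.
+This difference must be pinned down in a separate ADR paired with ADR-0014
+D6; otherwise:
+
+- It is ambiguous whether PE_SCHEDULER or a separate module is responsible
+  for building the tile plan.
+- The rationale for whether the stage sequences of the GEMM plan and the
+  Math plan are consistent (e.g. the position of FETCH/STORE) is scattered.
+- There is no stated reason why options such as `a_pinned` / `b_pinned` /
+  `epilogue_specs` branch at the plan level.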
+## Decision
+
+### D1. tiling is a pure plan-generator module, not a component
+
+`components/builtin/tiling.py` defines no subclass of ComponentBase. It
+exposes only two module-level functions:
+
+- `generate_gemm_plan(...) -> PipelinePlan`
+- `generate_math_plan(...) -> PipelinePlan`
+
+No node named `tiling` exists in the topology graph. It lives in the
+`builtin/` directory because it is a direct helper of PE_SCHEDULER (ADR-0014
+D6); semantically it is closer to a utility internal to PE_SCHEDULER.
+
+### D2. Stage sequence of the GEMM plan: `M → N → K` order
+
+The stage sequence for each (m, n, k) tile (baseline with no operand pinning
+and no epilogue applied):
+
+```
+[DMA_READ(A)] → [DMA_READ(B)] → FETCH → GEMM
+                                ↑
+                                ↓
+(last k tile only)              [MATH(output_tile)]* → STORE → DMA_WRITE
+```
+
+A `k_tile` epilogue runs right after GEMM on every K tile; an `output_tile`
+epilogue runs once per (m, n), on the last K tile, right before
+STORE/DMA_WRITE. The K-loop accumulator stays in the RegFile, so no
+STORE/DMA_WRITE occurs between K tiles (output happens only at last_k).
+
+### D3. Operand pinning: `a_pinned` / `b_pinned`
+
+When the caller passes `a_pinned=True`, **the A DMA_READ is omitted on every
+(m, n, k) tile**. Meaning: it signals to the plan generator that the caller
+(e.g. `tl.composite`) has already loaded all of A into TCM once with
+`tl.load`.
+
+This branch is decided at the plan level (it is not a runtime branch). The
+number of stage records in the op_log therefore varies deterministically
+with pinning, and the sweep analysis side (e.g. the stage record count of
+gemm_sweep) sees this decision as-is.
+
+### D4. Epilogue scope: `k_tile` vs `output_tile`
+
+`epilogue_specs` is an iterable of op-spec objects. Each op object is assumed
+to have the following attributes:
+
+- `op.kind: str`: the math op name (e.g. `"dequant"`, `"bias"`, `"relu"`,
+  `"scale"`). It goes into the stage's `params["op_kind"]`.
+- `op.scope: Scope`: `Scope.K_TILE` or `Scope.OUTPUT_TILE` (`Scope` is an enum
+  defined in `kernbench.common.pe_commands`).
+- Op-specific extra fields (e.g. `bias`, `scale`, `factor`): not used by the
+  plan generator today; consumed by the runtime (PE_MATH) side.
+
+The plan generator splits the ops into two groups based on
+`getattr(o, "scope", None)`:
+
+- `scope == Scope.K_TILE`: adds a MATH stage right after GEMM on every K tile.
+- `scope == Scope.OUTPUT_TILE`: adds a MATH stage right before STORE on the
+  last K tile of each (m, n).
+
+An op that has no `scope` attribute, or whose scope is neither enum value, is
+**not included in the plan** (`getattr(..., None) == Scope.X` is False for
+both). Applying the default (`output_tile`) is **the caller's responsibility
+(e.g. `tl.composite`)**; the plan generator only branches on a scope value
+that is already filled in (aligned with the composite epilogue contract of
+ADR-0014).
+
+The `Scope` import is done lazily inside the function to avoid the circular
+reference `pe_commands ← pe_types ← tiling`. This is an intentional pattern
+and not something to fix (from the D1 view that "tiling is a utility of
+PE_SCHEDULER", having no compile-time dependency on pe_commands keeps the
+module boundary clean).
+
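The stage sequencing of D2 through D4 can be sketched as a function that emits stage labels only. This is a minimal illustration under stated assumptions, not `generate_gemm_plan`: `gemm_stage_names` is a hypothetical name, the local `Scope` enum merely stands in for the one in `kernbench.common.pe_commands`, and emitting one stage list per (m, n, k) tile is one reading of the text above.

```python
from enum import Enum
from math import ceil
from types import SimpleNamespace


class Scope(Enum):
    """Stand-in for the Scope enum in kernbench.common.pe_commands."""
    K_TILE = "k_tile"
    OUTPUT_TILE = "output_tile"


def gemm_stage_names(M, K, N, tile_m, tile_k, tile_n,
                     a_pinned=False, b_pinned=False, epilogue_specs=()):
    """Return one list of stage labels per (m, n, k) tile, M -> N -> K."""
    m_tiles = max(1, ceil(M / tile_m))
    k_tiles = max(1, ceil(K / tile_k))
    n_tiles = max(1, ceil(N / tile_n))
    specs = list(epilogue_specs)  # allow any iterable, including generators
    # Ops with a missing or unknown scope fall into neither group.
    k_ops = [o for o in specs if getattr(o, "scope", None) == Scope.K_TILE]
    out_ops = [o for o in specs
               if getattr(o, "scope", None) == Scope.OUTPUT_TILE]

    tiles = []
    for _m in range(m_tiles):
        for _n in range(n_tiles):
            for k in range(k_tiles):
                stages = []
                if not a_pinned:
                    stages.append("DMA_READ(A)")
                if not b_pinned:
                    stages.append("DMA_READ(B)")
                stages += ["FETCH", "GEMM"]
                stages += [f"MATH({o.kind})" for o in k_ops]
                if k == k_tiles - 1:
                    # Output leaves the RegFile only on the last K tile.
                    stages += [f"MATH({o.kind})" for o in out_ops]
                    stages += ["STORE", "DMA_WRITE"]
                tiles.append(stages)
    return tiles


specs = [
    SimpleNamespace(kind="dequant", scope=Scope.K_TILE),
    SimpleNamespace(kind="bias", scope=Scope.OUTPUT_TILE),
    SimpleNamespace(kind="relu"),  # no scope attribute: dropped from the plan
]
plan = gemm_stage_names(M=64, K=128, N=64, tile_m=64, tile_k=64, tile_n=64,
                        a_pinned=True, epilogue_specs=specs)
```

With `K=128` and `tile_k=64` the plan has two tiles: both skip the A read because of pinning, both run the `k_tile` op after GEMM, and only the second carries the `output_tile` op followed by STORE and DMA_WRITE.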
+### D5. Stage sequence of the Math plan: `M → N` order
+
+The stage sequence for each (m, n) tile:
+
+```
+DMA_READ → FETCH → MATH → STORE → DMA_WRITE
+```
+
+There is no K dimension, so concepts such as epilogue / accumulator residency
+do not apply. The register-file accounting of PE_FETCH_STORE is handled the
+same way as in the GEMM plan.
+
+### D6. A plan is data: no SimPy dependency
+
+`PipelinePlan` is a dataclass defined in `pe_types.py` and holds
+`tiles: list[TilePlan]`. Each `TilePlan` holds `stages: tuple[Stage, ...]`.
+The plan itself is a nearly immutable data structure (only the
+`params: dict` of a Stage is mutable) and holds no SimPy objects or events.
+
+At runtime PE_SCHEDULER looks at the first stage of a plan, creates a
+`TileToken`, and feeds it into the pipeline; the TileToken carries
+`plan: TilePlan`, `stage_idx: int`, and `params: dict`. Self-routing proceeds
+by having `TileToken.advance()` cache the `params` of the next stage
+(ADR-0014 D6).
+
+### D7. Contract of the plan generator: pure, deterministic, idempotent
+
+Calling it twice with the same input returns the same PipelinePlan
+(deterministic down to the order of `TilePlan.stages`). This contract is
+aligned with the "deterministic tile dispatch order" requirement of ADR-0014
+D6.
+
+There are no side effects (SimPy events, file I/O, global state), so it can be
+called in tests without an environment object (some cases in
+`tests/test_pe_pipeline.py` do this).
+
+## Alternatives Considered
+
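+### A1. Make tiling a component (e.g., PE_PLANNER)
+
+Rejected. Plan generation is a decision algorithm that consumes no SimPy time.
+Making it a component would (a) drag in unnecessary infrastructure such as an
+inbox and resources, and (b) force PE_SCHEDULER to split "receive plan" and
+"feed tiles" into two steps, adding a meaningless hop.
+
+### A2. Move plan generation into a PE_SCHEDULER class method
+
+Rejected (for now). Keeping it a separate module gives (1) testability and
+(2) extensibility: introducing another plan algorithm (e.g. a DTensor-aware
+plan) only requires defining an additional function. If the number of plan kinds
+grows enough to need explicit dispatch, a plan factory in PE_SCHEDULER will be
+introduced through a separate ADR.
+
+### A3. Force plans to be immutable (frozen dataclass + tuple)
+
+Partially adopted. `Stage` and `TilePlan` are dataclasses but not frozen. Reason:
+`Stage.params: dict` is filled in at plan-generation time and only read at
+runtime (TileToken merely caches it on advance). Fully freezing would cost a
+dict → frozendict migration for little benefit. The convention remains: do not
+mutate a plan outside the planning step.
+
+## Consequences
+
+- The ADR now states explicitly that `tiling.py` is a plan-generator module, not
+  a component, which heads off future analyses like G4 that conclude "this
+  component has no ADR".
+- The GEMM plan stage sequence (D2) and the pinning/epilogue branches (D3, D4)
+  are fixed in an ADR, so the basis for interpreting stage record counts in the
+  sweep analysis (`scripts/gemm_sweep.py`) is clear.
+- Thanks to the pure contract of the plan generator (D7), tests can verify plans
+  without an environment, in line with the spirit of ADR-0013 (verification
+  strategy): "behavior validated by tests with meaningful input cases".
+- When new plan kinds (DTensor-aware plan, K-major plan, etc.) are added, this
+  ADR serves as the baseline: add a new function and follow D1, D6, and D7.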
@@ -0,0 +1,131 @@
+# ADR-0043: Allreduce evaluation harness (`tests/sccl/`)
+
+## Status
+
+Accepted
+
+Documents the `tests/sccl/` evaluation harness; verified against the
+implementation (constants, file set, and sweep dimensions cross-checked).
+
+**Amended by ADR-0054**: the driver core, sweep, and renderers moved to the
+`milestone-1h-ccl` bench (single home); `tests/sccl/_allreduce_helpers.py` now
+re-exports from there (the pytest-only param builders and the
+`_run_distributed` wrapper stay local). The figure tests are unchanged.
+
+## Context
+
+ADR-0032 defines the intercube all-reduce *algorithm*, and ADR-0023/0024/0027
+define the IPCQ backend, the rank=SIP launcher, and `mp.spawn`. None of them,
+however, describes **how allreduce is driven and characterized**: correctness
+tests, latency/buffer-kind sweeps, and derived plots. Where ADR-0013
+(verification strategy) is the general policy, this ADR pins down the concrete
+allreduce harness so that the "evaluation" half of the work is documented
+alongside the implementation.
+
+The harness lives in `tests/sccl/` (the package created when the allreduce tests
+were consolidated). It replaces the earlier flat layout of
+`tests/test_allreduce_multidevice.py` + `tests/test_distributed_*`.
+
+## Decision
+
+### D1. Drive evaluation through the public `torch.distributed` path
+
+Correctness tests and sweeps run the collective through the real DDP-shaped
+path, `init_process_group(backend="ahbm") → mp.spawn → dist.all_reduce`
+(ADR-0024/0027), and do not use the lower-level `ctx.launch`.
+The shared helper
+`_run_distributed(tmp_path, monkeypatch, topo_path, corr_id, n_elem)` in
+`tests/sccl/_allreduce_helpers.py` builds the engine, runs the workers, and
+returns `(engine, n_cubes)`. `monkeypatch.chdir` points the backend's
+`load_ccl_config()` (a cwd lookup) at a per-case temporary `ccl.yaml`.
+
+The direct-launch reference (`run_allreduce`) stays in the same helper module,
+not for the distributed tests, but because the IPCQ buffer-kind / root-center
+micro tests under `tests/` import it.
+
+### D2. One file per evaluation concern
+
+| File | Concern | `torch.distributed`? |
+|---|---|---|
+| `test_allreduce_ring_torus_mesh.py` | ring_1d / torus_2d (2×3) / mesh_2d_no_wrap (2×3) correctness | yes |
+| `test_distributed_default_topology.py` | full path with `topology.yaml` as-is | yes |
+| `test_plot_latency_sweep.py` | latency sweep rows (n_elem × topology) | yes |
+| `test_plot_buffer_kind_sweep.py` | TCM/SRAM/HBM sweep rows | yes |
+| `test_plot_topology_diagram.py` | topology.png (pure matplotlib) | no |
+| `test_plot_comparison_fsim.py` | broken-axis model vs FSIM comparison | no |
+| `test_intercube_root_center.py` | ADR-0032 center-root latency guard (direct path) | no |
+
+`_allreduce_helpers.py` holds the shared plumbing (driver, config writer,
+sweep/buffer-kind constants, plot aggregators, topology-diagram + FSIM
+comparison emitters). It is not collected (no `test_` prefix).
+
+### D3. Latency metric: critical-path `pe_exec_ns`
+
+The latency reported per config is `crit_ns = max(pe_exec_ns)` over
+`engine._results`, i.e. the PE execution time of the slowest rank. This is the
+value drawn in every latency chart and written to `summary.csv`.
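
As a rough illustration of the critical-path metric, the sketch below takes the maximum `pe_exec_ns` over per-rank records. The record layout (a list of dicts with a `pe_exec_ns` key) and the sample numbers are assumptions made for the example; the real shape of `engine._results` may differ.

```python
def critical_path_ns(results):
    """Return the PE execution time of the slowest rank."""
    return max(record["pe_exec_ns"] for record in results)


# Hypothetical per-rank records for a 3-rank run (values are made up).
sample_results = [
    {"rank": 0, "pe_exec_ns": 1200.0},
    {"rank": 1, "pe_exec_ns": 1850.0},
    {"rank": 2, "pe_exec_ns": 1490.0},
]
crit_ns = critical_path_ns(sample_results)
```

Taking the maximum rather than the mean reflects that a collective is only finished once its slowest participant is.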
+
+### D4. Sweep dimensions
+
+- **Latency sweep**: `n_elem ∈ {8, 32, 64, 128, 512, 1024, 2048, 4096,
+  8192, 16384, 32768, 49152}` (16 is excluded because it collides with
+  `n_cubes`) × topology ∈ {ring_1d (6), torus_2d 2×3 (6),
+  mesh_2d_no_wrap 2×3 (6)}.
+- **Buffer-kind sweep**: `buffer_kind ∈ {tcm, sram, hbm}` × a smaller `n_elem`
+  grid, on the torus_2d 6-SIP (3×2) topology. buffer_kind is set in the
+  temporary `ccl.yaml`, which the backend reads at `init_process_group` time
+  (ADR-0023 D6).
+
+The 2×3 / 3×2 grids exercise the explicit `w/h` SIP resolution (ADR-0024 D5).
+
+### D5. Derived plots via the `pytest_sessionfinish` aggregator
+
+The sweep tests are xdist-friendly: each parametrized case writes one JSON row
+to a staging directory. The conftest `pytest_sessionfinish` hook (controller
+node only) calls the aggregators in `_allreduce_helpers.py`:
+
+- `_aggregate_sweep_plots()` → per-topology PNGs + `summary.csv`
+- `aggregate_buffer_kind_plot()` → TCM/SRAM/HBM comparison PNG + csv
+
+The topology-diagram and FSIM-comparison figures are emitted directly by their
+own `test_plot_*` tests (no row staging; they are pure functions of
+`topology.yaml` and `summary.csv` respectively). All outputs land in
+`docs/diagrams/allreduce_latency_plots/` and, per CLAUDE.md, are **derived
+artifacts** (consistent with the ADRs, no Phase-2 gate).
+
+### D6. The FSIM comparison reference is a hardcoded constant
+
+`emit_comparison_fsim_plot()` overlays the model curve with a single external
+FSIM single-device reference (`366 µs`), which is held as a literal; there is no
+external data file. The "measured" series comes from the simulator (`op_log`
+GEMM counts, `composite_window_ns`), and the "theoretical" series comes from a
+hand-derived analytical model (the same model that ADR-0044 D5 marks as not yet
+verified against the ADRs).
+
+## Consequences
+
+### Positive
+
+- Allreduce is evaluated through the same API as a real DDP script, so the
+  harness doubles as an integration test for ADR-0024/0027.
+- Figures are regenerated from committed data on every `pytest` run; there is
+  no manual plotting step.
+- The rectangular-grid sweeps provided the regression coverage that surfaced the
+  ADR-0024 D5 `w/h` fix.
+
+### Negative / limitations
+
+- The full latency sweep runs in the default `pytest` (on the order of
+  minutes); it is not marked `slow`. (Contrast with ADR-0044, which marks the
+  GEMM sweep `slow`.)
+- `test_intercube_root_center.py` holds a latency *threshold* assertion (the
+  ADR-0032 center-root guard). It is the only absolute-latency assertion in the
+  suite and is sensitive to latency-model changes (ADR-0033).
+
+## Dependencies
+
+- **ADR-0013**: verification strategy (the general policy this ADR specializes).
+- **ADR-0023 / ADR-0024 / ADR-0027**: IPCQ backend, rank=SIP launcher,
+  `mp.spawn`; the path D1 drives.
+- **ADR-0032**: the algorithm under evaluation; the D4 grid exercises its
+  topology branches.
+- **ADR-0044**: the sibling GEMM evaluation harness.
+
+## Open questions
+
+- Should the latency sweep be marked `slow` for consistency with the GEMM sweep?
+- Should the FSIM reference move from a hardcoded constant to a versioned data
+  file?
@@ -0,0 +1,133 @@
+# ADR-0044: GEMM evaluation harness (`scripts/gemm_sweep.py` + `tests/gemm/`)
+
+## Status
+
+Accepted
+
+Documents the GEMM evaluation/characterization harness; verified against the
+implementation (constants, tile sizes, figure set, and the script↔test split
+cross-checked). The D5/D6 caveats are recorded limitations, not inaccuracies.
+
+**Amended by ADR-0054**: the sweep and renderers moved to the
+`milestone-1h-gemm` bench (single home); `scripts/gemm_sweep.py` and
+`tests/gemm/` now re-export from there. The D1/D2 rule "data generation is a
+manual script / heavy work is opt-in" is superseded by the evaluation-bench
+pattern (one bench regenerates everything; `MILESTONE_FAST=1` reuses the
+committed JSON).
+
+## Context
+
+ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generator) define the GEMM
+*implementation*, and ADR-0033 defines the latency model. None of them, however,
+describes **how GEMM performance is swept and characterized**: the shape/variant
+sweep that produces timing data and the figures that interpret it. This ADR
+pins down that harness.
+
+Unlike the allreduce harness (ADR-0043), the GEMM sweep is **heavy** (24 sim
+runs: 8 shapes × 3 operand-staging variants; the `512` shape alone is 2048
+tiles). That weight drives the split below.
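
The figures quoted above can be reproduced with a short calculation. This sketch assumes the tile count is the product of the per-dimension tile counts over M, K, and N, using the `TILE_M/K/N = 32/64/32` values this ADR cites in D4; `tile_count` is a name invented for the example.

```python
import math

# Tile sizes as recorded from PeSchedulerComponent.TILE_M/K/N (see D4).
TILE_M, TILE_K, TILE_N = 32, 64, 32


def tile_count(m, k, n):
    """Number of (m, k, n) tiles, assuming one tile per M x K x N block."""
    return (
        math.ceil(m / TILE_M) * math.ceil(k / TILE_K) * math.ceil(n / TILE_N)
    )


tiles_512 = tile_count(512, 512, 512)  # 16 * 8 * 16
sim_runs = 8 * 3                       # shapes x operand-staging variants
```

Under that assumption the `512` cube works out to 16 × 8 × 16 = 2048 tiles, matching the number in the text.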
+
+## Decision
+
+### D1. Two-tier split: heavy data generation (script) vs. fast figures (test)
+
+- **Data generation stays a manual script**: `scripts/gemm_sweep.py` runs
+  `matmul-composite` (the ADR-0042 plan) across shape × variant through the same
+  `run_bench` path as the CLI, harvests `result.engine.op_log`, and writes
+  `docs/diagrams/gemm_sweep.json` (per-stage/per-engine wall-clock + occupancy
+  + record count + pe/composite window).
+- **Figure rendering is test-generated**: `tests/gemm/` reads the committed
+  `gemm_sweep.json` and renders matplotlib PNGs into
+  `docs/diagrams/gemm_plots/`. These tests are fast and run by default.
+
+Rationale: a slide-deck-scale sim sweep does not belong in every `pytest` run,
+but the figures (cheap and deterministic) should be freely regenerable and
+guarded by CI. This mirrors the script-vs-test split in CLAUDE.md (heavy/manual
+generation goes in scripts; fast assertions go in tests).
+
+### D2. A slow regenerator test wraps the script
+
+`tests/gemm/test_gemm_sweep.py` is marked `@pytest.mark.slow` (excluded by the
+default `addopts: -m "not slow"`). It invokes `scripts/gemm_sweep.py` as a
+subprocess to regenerate `gemm_sweep.json` on demand
+(`pytest -m slow tests/gemm/test_gemm_sweep.py`). The sweep logic has a single
+home (the script) and the test only wraps it, so the sim-driving code is not
+duplicated.
+
+### D3. Figure set (3 charts, `load_ref` variant)
+
+| Test | PNG | Content |
+|---|---|---|
+| `test_plot_gemm_stage_breakdown.py` | `gemm_stage_breakdown.png` | per-stage engine wall-clock (DMA in / Fetch / GEMM / DMA out) |
+| `test_plot_gemm_mac_utilization.py` | `gemm_mac_utilization_measured.png` | GEMM util % + useful eff % |
+| `test_plot_gemm_mac_utilization.py` | `gemm_mac_utilization_theoretical_vs_measured.png` | theoretical vs simulator-measured util/eff |
+
+`tests/gemm/_gemm_plot_helpers.py` holds the shared renderers (the series logic
+mirrors the GEMM `_render_*` functions in `scripts/build_overview_slides.py`,
+which still draws natively into the PPTX). It is not collected (no `test_`
+prefix). Each `test_plot_*` skips if `gemm_sweep.json` is missing.
+
+### D4. Tile sizes are data-driven; under-tile shapes are flagged
+
+Tile sizes are read from `gemm_sweep.json` (`tile_sizes`), which the sweep
+records from `PeSchedulerComponent.TILE_M/K/N = 32/64/32`, the authoritative
+source. Shapes with `M<TILE_M ∨ K<TILE_K ∨ N<TILE_N` are flagged ("under-tile")
+in the charts. The `512³` shape is excluded from the figures
+(`EXCLUDED_SHAPES`).
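
The under-tile condition can be written as a small predicate. This is only a sketch: `is_under_tile` and `shape_label` are hypothetical helper names, and the label format is an assumption, not what the real renderers print.

```python
# Tile sizes as recorded from PeSchedulerComponent.TILE_M/K/N per this ADR.
TILE_M, TILE_K, TILE_N = 32, 64, 32


def is_under_tile(m, k, n):
    """True if any dimension is smaller than its tile size."""
    return m < TILE_M or k < TILE_K or n < TILE_N


def shape_label(m, k, n):
    """Chart label for a shape, with an under-tile marker when applicable."""
    base = f"{m}x{k}x{n}"
    return f"{base} (under-tile)" if is_under_tile(m, k, n) else base
```

For example, a `16x64x32` shape is flagged because M is below `TILE_M`, while `32x64x32` fills exactly one tile and is not.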
+
+### D5. Theoretical model: inherited constants, not yet verified against ADRs
+
+The "theoretical" curve uses an analytical ideal-pipeline model with constants
+copied verbatim from `scripts/build_overview_slides.py`:
+
+```
+HBM_GBS = 256.0   # GB/s        T_STAGE = 16.0 ns
+D_STAGES = 3                    BPE = 2
+```
+
+**These values have not yet been sourced against an ADR.** In particular, the
+`256` in ADR-0033 is `burst_bytes` (256 B), a *different* quantity from this
+`256 GB/s`, and ADR-0033 derives bandwidth as
+`pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`. `T_STAGE` and the stage count have
+not been traced to ADR-0014 here either. The model is therefore **only
+consistent with the existing deck script, not verified against the ADRs**, and
+the constants are duplicated (deck + helper). Reconciling them (sourcing from
+topology/ADR-0033/0014 and removing the duplication) is deferred; see Open
+questions.
+
+### D6. Known naming caveat: the `_measured` chart
+
+`gemm_mac_utilization_measured.png` currently plots the *theoretical*
+ideal-pipeline numbers (the footnote says so). Only the filename says
+"measured". This is a known misnomer, pending a decision on whether to repoint
+its content to the simulator-measured series or to change the title.
+
+## Consequences
+
+### Positive
+
+- GEMM figures are test-generated and CI-guarded, like the allreduce ones.
+- The heavy sweep stays opt-in, so the default test run is fast.
+- The slow test reuses the single source of sweep logic (the script).
+
+### Negative / limitations
+
+- The theoretical model constants (D5) are unverified and duplicated.
+- The `_measured` figure is a misnomer (D6).
+- `build_overview_slides.py` still does not embed these PNGs and instead draws
+  the GEMM bars natively from `gemm_sweep.json`; rewiring the deck to consume
+  the test artifacts is not done yet.
+
+## Dependencies
+
+- **ADR-0013**: verification strategy.
+- **ADR-0014 / ADR-0042**: PE pipeline + tile-plan generator, the GEMM
+  implementation the sweep measures; the stage record counts in D4 come from
+  ADR-0042 D2/D3.
+- **ADR-0033**: latency model, the source the D5 constants should (but do not
+  yet) trace to.
+- **ADR-0043**: the sibling allreduce evaluation harness.
+
+## Open questions
+
+- Should the D5 constants be reconciled against `topology.yaml` / ADR-0033 /
+  ADR-0014 and deduplicated (a single source for model parameters)?
+- How should the D6 `_measured` naming be resolved (repoint content vs. change
+  the title)?
+- Should `build_overview_slides.py` be rewired to embed the `gemm_plots/` PNGs
+  instead of drawing bars natively?
@@ -0,0 +1,265 @@
+# ADR-0045: Bench Module Contract — registration, dispatch, and authoring
+
+## Status
+
+Accepted (2026-05-21).
+
+Defines, in one place, the registration mechanism (@bench) of the
+`src/kernbench/benches/` package, the CLI dispatch path (`kernbench run/list`),
+and the contract to follow when writing a new bench module. ADR-0010 (CLI
+surface) specifies the `kernbench list/run` interface, but **how a bench is
+registered and which function signature it must follow** was not covered at the
+ADR level.
+
+**Extended by ADR-0054**: a third pattern is added to D5's single-configuration
+rule. *Evaluation benches* (e.g. `milestone-1h-*`) drive multiple
+configurations, build their own engine per configuration, and submit a sentinel
+tensor to satisfy D4.
+
+## First action
+
+When the `kernbench.benches` package is imported, `__init__.py` immediately
+calls `_eager_import_and_audit(__path__, __name__)`. The first thing this
+function does is enumerate every sibling module in the package directory with
+`pkgutil.iter_modules(__path__)` and then **eagerly load**, via
+`importlib.import_module(...)`, every module that matches neither of the
+following exclusions:
+
+- the name is `registry` (the infrastructure itself)
+- the name starts with `_` (helper modules)
+
+At import time, each `@bench(name=..., description=...)` decorator inside a
+module runs, appending a `(name, description, fn)` tuple to the `_PENDING` list
+and adding `fn.__module__` to the `_REGISTERED_MODULES` set.
+
+Once all imports finish, `_audit_modules(imported, _REGISTERED_MODULES)` is
+called; if any module **was imported but never called @bench**, it immediately
+raises `RuntimeError("Bench module(s) missing @bench decorator: ...")`.
+When this audit passes, index assignment has not happened yet; on the first
+`list_all()` / `resolve(...)` call, `_finalize()` assigns 1-based indices in
+alphabetical order of name.
+
+In short, **the first job of the bench infrastructure is to import every
+non-helper module in the package directory and audit that each one called
+@bench at least once**.
+
+## Context
+
+`src/kernbench/benches/` currently holds 8 bench modules (`ccl_allreduce`,
+`gemm_single_pe`, `gpt3_qkv`, `ipcq_allreduce`, `matmul_composite`, `qkv_gemm`,
+`qkv_gemm_multi_pe`, `va_offset_verify`). Every bench follows this unified flow:
+
+```
+kernbench run --topology <T> --bench <N>
+   ↓
+cli/main.py::cmd_run
+   ↓  resolve_topology(T)  + resolve(N)  + resolve_device(device_arg)
+   ↓
+runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
+   ↓  engine_factory(topology, device) → GraphEngine
+   ↓  RuntimeContext(engine, target_device, correlation_id, spec)
+   ↓
+bench_fn(ctx)        ← calls the run(torch) defined by the bench
+   ↓  submits via ctx.empty/zeros/from_numpy/launch/distributed.* etc.
+   ↓
+ctx.wait_all()       ← drains any incomplete handles
+   ↓
+BenchResult(completion, correlation_id, trace, traces, engine)
+```
+
+ADR-0010 covers only the CLI surface (`run/list/probe/web`), and ADR-0007 covers
+only the responsibility boundary between the runtime API and sim_engine. The
+question "what shape must a new bench file take?" can only be answered by
+tracing code conventions. As a result:
+
+- The calling convention of the @bench decorator (kebab-case name, non-empty
+  description) exists only in code.
+- The bench function signature (`def run(torch)`) is effectively a convention,
+  enforced only by the fact that CLI dispatch calls `spec.run`.
+- Authors of new benches learn that "helper modules must be split off with a
+  `_` prefix" only after hitting the audit RuntimeError.
+- It is unclear how far the single-device convention (CLAUDE.md Part 2 CLI
+  Semantics) applies from a bench author's point of view (are CCL multi-SIP
+  benches an exception?).
+
+This ADR resolves these ambiguities in one place.
+
+## Decision
+
+### D1. @bench decorator contract
+
+```python
+from kernbench.benches.registry import bench
+
+@bench(name="my-bench", description="Short, complete-sentence description.")
+def run(torch):
+    ...
+```
+
+- `name`: a kebab-case string. Must match the regex
+  `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`: lowercase letters, digits, and dashes only;
+  no underscores (`_`); must start with a letter.
+- `description`: a non-empty string (length > 0 after strip). Shown verbatim in
+  the CLI `list` output.
+- The decorator **returns fn unmodified**, so it can still be called directly.
+  Its only side effect is adding a registration to `_PENDING`.
+
+Violating either of the first two rules raises `ValueError` immediately. A
+duplicate name is caught at `_finalize()` time as
+`RuntimeError("duplicate bench name: ...")`.
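
A minimal sketch of a decorator that follows the contract above. It is written from the rules in this section only and is not the actual `registry.py`; the error messages and the `finalize` helper (a stand-in for `_finalize()`) are illustrative assumptions.

```python
import re

_NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")
_PENDING = []  # (name, description, fn) tuples awaiting index assignment


def bench(*, name, description):
    """Validate the arguments eagerly, then register fn and return it as-is."""
    if not isinstance(name, str) or not _NAME_RE.match(name):
        raise ValueError(f"bench name must be kebab-case: {name!r}")
    if not isinstance(description, str) or not description.strip():
        raise ValueError("bench description must be a non-empty string")

    def decorator(fn):
        _PENDING.append((name, description, fn))
        return fn  # returned unmodified, so direct calls still work

    return decorator


def finalize(pending):
    """Detect duplicates, then assign 1-based indices in alphabetical order."""
    names = [n for n, _, _ in pending]
    dupes = sorted({n for n in names if names.count(n) > 1})
    if dupes:
        raise RuntimeError(f"duplicate bench name: {', '.join(dupes)}")
    ordered = sorted(pending, key=lambda entry: entry[0])
    return {n: (i, d, fn) for i, (n, d, fn) in enumerate(ordered, start=1)}


@bench(name="zeta-bench", description="Registered first.")
def run_zeta(torch):
    return "zeta"


@bench(name="alpha-bench", description="Registered second.")
def run_alpha(torch):
    return "alpha"


registry = finalize(_PENDING)
```

Because indices come from the sorted names rather than the registration order, `alpha-bench` ends up with index 1 even though it was registered second.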
+
+### D2. Module file conventions
+
+`src/kernbench/benches/<slug>.py` must be one of the following:
+
+- **bench module**: `@bench(...)` runs at least once on the top-level import
+  path, registering at least one bench.
+- **helper module**: the filename starts with `_` (e.g. `_shared_helpers.py`).
+  It is skipped during the `iter_modules` traversal.
+
+The audit (`_audit_modules`) does not allow a non-helper module that never calls
+@bench. The intended result: simply adding a new file to `benches/` registers it
+automatically, and the distinction from helpers is marked clearly by a single
+**filename prefix**.
+
+### D3. The bench function signature is `def run(torch)`
+
+The decorator does not enforce the function name, but **CLI dispatch calls
+`spec_entry.run` (i.e. the decorated callable)**. The convention is therefore:
+
+- Function name: `run`. Decorating a differently named function still works, but
+  always use `run` for readability and grep-ability.
+- Argument: a single positional argument `torch`. It is actually a
+  `RuntimeContext` instance that exposes a PyTorch-style namespace
+  (zeros/empty/launch/distributed/...) (ADR-0024 D3).
+- Return value: arbitrary (`Any`). `run_bench` currently ignores the return
+  value and tracks completion through `ctx.handles()` /
+  `engine.get_completion()`.
+
+The name `torch` mimics the PyTorch-compatible idiom; the actual PyTorch module
+is not passed in (aligned with the "rank = SIP" launcher convention of
+ADR-0024).
+
+### D4. A bench must perform at least one submit
+
+If `ctx.handles()` is empty, `run_bench` returns BenchResult.completion as
+`ok=False, error_code="NO_REQUESTS"`. A meaningful bench must therefore call at
+least one of the following:
+
+- Tensor creation APIs: `torch.zeros(...)`, `torch.empty(...)`. Internally these
+  submit `MmuMapMsg` and (for zeros) `MemoryWriteMsg`.
+- Kernel execution API: `torch.launch(name, fn, *args)`, which submits a
+  `KernelLaunchMsg` per SIP.
+- (Exception) empty placeholder benches: a stub that only calls `print(...)`,
+  like `ipcq_allreduce.py`, gets the NO_REQUESTS result. CI is assumed to
+  recognize it as a placeholder and handle it separately.
+
+### D5. Single-device convention + multi-SIP exception (ADR-0024/0027)
+
+The **"benchmarks MUST remain single-device"** convention stated in CLAUDE.md
+Part 2 CLI Semantics is interpreted as follows:
+
+- **Regular bench (single SIP)**: defines tensor placement with
+  `dp = DPPolicy(...)` and launches kernels with `torch.launch(...)`. The SIP
+  index is determined by `--device` (the CLI's responsibility).
+- **CCL bench (multi-SIP)**: as an exception, follows the rank = SIP pattern
+  (ADR-0024 D3) using
+  `torch.distributed.init_process_group(backend="ahbm")` +
+  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)`. `--device` is ignored
+  (or assumed to be `all`), and each spawned worker binds its own SIP with
+  `torch.ahbm.set_device(rank)`.
+
+This ADR forbids multi-device calls outside these two patterns (e.g. a single
+bench function launching directly onto several SIPs in the same process). Even
+when the CLI is invoked with `--device all`, the bench runs only once; to handle
+multiple SIPs inside it, use the second pattern of D5.
+
+### D6. Name and index resolution (`resolve`)
+
+`resolve(identifier: str)` returns a BenchSpec in the following order:
+
+1. `identifier.isdigit()` → convert to an integer and return the spec among the
+   `_REGISTRY` entries whose `index` equals it. If none,
+   `ValueError("No bench with index ...")`.
+2. `identifier in _REGISTRY` → direct lookup.
+3. Otherwise → `ValueError("Unknown bench ...")`.
+
+An empty or whitespace identifier raises
+`ValueError("bench identifier must be a non-empty string.")`.
+
+The CLI passes the `--bench` argument to `resolve` unchanged. Users can
+therefore use either `kernbench run --bench gemm-single-pe` or
+`kernbench run --bench 2`.
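
The resolution order can be sketched as below. The `BenchSpec` fields, the explicit `registry` argument, and checking the empty identifier first are assumptions made to keep the example self-contained; the real `resolve` reads the module-level `_REGISTRY`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchSpec:
    """Hypothetical minimal spec; the real BenchSpec carries more fields."""
    index: int
    name: str


def resolve(identifier, registry):
    """Resolve a bench by 1-based index (digits) or by name."""
    if not isinstance(identifier, str) or not identifier.strip():
        raise ValueError("bench identifier must be a non-empty string.")
    if identifier.isdigit():
        index = int(identifier)
        for spec in registry.values():
            if spec.index == index:
                return spec
        raise ValueError(f"No bench with index {index}")
    if identifier in registry:
        return registry[identifier]
    raise ValueError(f"Unknown bench {identifier!r}")


demo_registry = {
    "ccl-allreduce": BenchSpec(index=1, name="ccl-allreduce"),
    "gemm-single-pe": BenchSpec(index=2, name="gemm-single-pe"),
}
by_name = resolve("gemm-single-pe", demo_registry)
by_index = resolve("2", demo_registry)
```

Both lookups return the same spec here, but only the name stays valid if another bench is later inserted earlier in alphabetical order.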
+
+### D7. Indices are not a stable API
+
+`_finalize()` sorts `_PENDING` **alphabetically by name** and then assigns
+1-based indices, so adding a new bench can shift the indices of existing ones.
+Therefore:
+
+- Human-friendly interactive use: indices are fine.
+- Scripts / CI automation: always use names.
+
+This is stated in the `registry.py` module docstring.
+
+### D8. The surface RuntimeContext exposes to benches
+
+The surface a bench function may legitimately use through the `torch`
+parameter:
+
+- **Tensor creation**: `torch.empty(shape, dtype=..., dp=DPPolicy(...), name=...)`,
+  `torch.zeros(...)`, `torch.from_numpy(arr)`. All submit host-side metadata +
+  device deployment (MmuMap + MemoryWrite).
+- **Kernel launch**: `torch.launch(kernel_name, kernel_fn, *args)` converts
+  `(Tensor, int, float)` positional arguments to `TensorArg` / `ScalarArg`,
+  issues a `KernelLaunchMsg` per SIP, then drains.
+- **Synchronization**: `torch.wait(handle)`, `torch.wait_all()` (called
+  automatically by run_bench).
+- **Distributed**: `torch.distributed.init_process_group(backend="ahbm")`,
+  `torch.distributed.get_world_size()`, `torch.distributed.all_reduce(t, op=...)`
+  (ADR-0024/0027).
+- **Multi-process (rank=SIP)**: `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)`
+  (ADR-0024 D3 / ADR-0027).
+- **Device binding**: `torch.ahbm.set_device(rank)` or
+  `torch.accelerator.set_device_index(rank)` (both point to the same namespace).
+- **IPCQ install**: `torch.install_ipcq(algorithm=..., ccl_yaml=...)` (ADR-0023 D10).
+- **Spec lookup**: `torch.spec`, a dict built by the topology builder (system,
+  cube_mesh, HBM parameters, etc.), so that benches can be written without
+  depending directly on topology.yaml parameters.
+
+A bench must not directly access private members of RuntimeContext that are not
+listed above (`_handles`, `_traces`, `_allocators`, etc.). This aligns with the
+layer-boundary spirit of ADR-0007: only the one direction bench → runtime API →
+sim_engine is allowed.
+
+### D9. Parameterization via environment variables is allowed
+
+Injecting bench parameters from outside, as `matmul_composite.py` does with
+`os.environ.get("MATMUL_M", ...)`, is allowed. Reasons:
+
+- D3 fixes the bench function signature to `def run(torch)`, so passing
+  parameters as positional/keyword arguments is impractical.
+- The environment-variable pattern is a natural hook for run-time sweeps such as
+  `MATMUL_VARIANT`.
+- External drivers such as `scripts/gemm_sweep.py` (ADR-0044) use this hook.
+
+However, if environment variables change a bench's behavior, every variable must
+be listed in the module docstring (matmul_composite.py is the example).
+
+## Alternatives Considered
+
+### A1. Keep the bench list in an explicit manifest file (YAML)
+
+Rejected. The @bench decorator + audit pattern guarantees "adding a file =
+automatic registration", concentrating the cognitive cost for a new bench author
+in one place (writing the file). A separate manifest carries a high risk of
+drift during maintenance, and helper separation is already clear from the `_`
+prefix.
+
+### A2. Take the bench function name as a decorator argument (`@bench(name=..., entry="run_xxx")`)
+
+Rejected. It breaks the simplicity of dispatch calling only `spec.run`. The
+`run` convention is sufficient; if variants are needed, register several
+functions in the same module (each decorated with @bench).
+
+### A3. A separate `@multi_device_bench` decorator for CCL benches
+
+Rejected. The two patterns stated in D5 (single + ADR-0024 multi-SIP) can
+express all 8 current benches. A separate decorator would force a branch on the
+dispatch side and add complexity, and the intent to use multiple SIPs is already
+evident from the `init_process_group(...)` call in the bench body.
+
+### A4. Make indices a stable API (registration order / explicit index= argument)
+
+Rejected. This is the trade-off stated in D7: user friendliness comes first
+(alphabetically sorted indices read naturally as 1, 2, 3... in the list
+output). Scripts can simply use names.
+
+## Consequences
+
+- "How to add a bench" is consolidated into one ADR, so new authors only need
+  to follow D1-D3 and D8 without grepping the code.
+- The pattern of splitting helper modules with a `_` prefix is justified at the
+  ADR level, which frees up future shared helpers of the form `benches/_*.py`.
+- D5 makes explicit that the single-device convention in CLAUDE.md Part 2 CLI
+  Semantics does not contradict multi-SIP CCL benches; the two are orthogonal.
+- The rationale (D9) for `scripts/gemm_sweep.py` in ADR-0044 (GEMM eval harness)
+  using the environment-variable hook is fixed in this ADR.
+- Index instability (D7) is now explicit, so code such as
+  `kernbench run --bench 3` in CI should be reviewed right after this ADR is
+  accepted.
@@ -0,0 +1,307 @@
+# ADR-0046: TLContext — Kernel-side `tl.*` API Contract
+
+## Status
+
+Accepted (2026-05-22).
+
+Specifies the set of `tl.*` primitives exposed by `TLContext` in
+`src/kernbench/triton_emu/`, their semantics, and the contract of the two
+execution modes (command-list / greenlet runner). ADR-0014/0020 define the PE
+pipeline and the 2-pass execution model, but **the `tl.*` surface that a bench's
+kernel function calls** had not been written down at the ADR level.
+
+## First action
+
+When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id, num_cubes,
+scratch_base, scratch_size)` is constructed, it first initializes the following
+six groups of fields:
+
+- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes`: the
+  values that `tl.program_id` / `tl.num_programs` return.
+- `self._dispatch_cycles`: the cycle count of the `PeCpuOverheadCmd(cycles)`
+  that is issued automatically at the start of every non-metadata `tl.*` API
+  call (see D6).
+- `self._runner`: a `KernelRunner` instance (greenlet mode if present,
+  command-list mode if absent).
+- `self._commands: list[PeCommand] = []`: the command sequence accumulated in
+  command-list mode.
+- `self._handle_counter = 0`, `self._completion_counter = 0`: used to generate
+  new TensorHandle / CompletionHandle ids.
+- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0`: the
+  PE-local scratch region (for allocating output handle addresses of
+  math/dot/composite).
+
+In short, **the first job of TLContext is to fill in the metadata describing
+where this kernel instance runs (sip/cube/pe), at what scale
+(num_programs/num_cubes), and in which mode (runner or not) it will issue
+commands**. At this point there are no SimPy events and no commands have been
+issued.
+
+The first runtime action happens when the kernel function first calls
+`tl.<api>()`. The standard entry behavior of a `tl.*` API is:
+
+1. Call `self._emit_dispatch_overhead()`: if `dispatch_cycles > 0`, immediately
+   `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
+2. API-specific processing (create the TensorHandle, build the command).
+3. `self._emit(cmd)`: in runner mode, pass cmd to the SimPy side via
+   greenlet.switch; otherwise append it to `self._commands`.
+
+## Context
+
+The `tl.*` surface consists of the methods exposed by `TLContext`, and the `tl`
+parameter a kernel function receives is this object. The contract visible to the
+user (bench author):
+
+- which primitives exist
+- what data flow each primitive triggers (DMA / compute / IPCQ / metadata-only)
+- how the `space` and `addr` of a TensorHandle are determined
+- how command-list mode and greenlet mode differ
+
+ADR-0014 (PE pipeline) defines the PeCommands that PE_SCHEDULER receives, but
+how `tl.*` emits them exists only as code convention. Likewise, ADR-0020 (2-pass
+data execution) mentions the existence of greenlet mode in D3, but the
+difference between the runner and non-runner paths (return-value handling) is
+not stated at the ADR level. This ADR fills that gap.
+
+## Decision
+
+### D1. The `tl` parameter is a `TLContext` instance
+
+A bench's kernel function follows this signature:
+
+```python
+# Kernel arguments come first (two shown as an example), followed by `tl`.
+def _kernel(arg1, arg2, tl, **kwargs):
+    ...
+```
+
+`tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name mimics
+real Triton's `triton.language` module; the real Triton module is not passed
+in.
+
+A kernel function is a plain Python function with no `yield` / `async`. `tl.*`
+calls trigger SimPy events, but from the caller's (kernel's) side they look like
+synchronous calls, because in greenlet mode the KernelRunner relays between
+SimPy and the kernel (ADR-0020 D3).
+
+### D2. Two execution modes: command-list / greenlet runner
+
+- **command-list mode (`runner is None`)**: `tl.*` calls accumulate PeCommands
+  in the `self._commands` list. DMA / GEMM / Math consume no actual SimPy time,
+  and return values are metadata-only TensorHandles (data=None). PE_SCHEDULER /
+  sim_engine later replay the command sequence in time.
+
+- **greenlet runner mode (`runner is not None`)**: `tl.*` calls hand control to
+  the parent greenlet (SimPy) through `self._emit(cmd)`, which calls
+  `runner.switch_to_simpy(cmd)`. The parent distributes cmd to components,
+  consumes SimPy time, and for a DMA read returns the actual numpy data. The
+  kernel receives the result and proceeds to the next line (the data-aware
+  execution model of ADR-0020 D3).
+
+The mode is selected by whether a KernelRunner instance is injected into the
+TLContext; the `tl.*` methods are unaware of the difference and behave
+consistently through the `_emit()` helper.
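
The way a single `_emit()` helper hides the mode difference can be sketched as follows. `MiniTL` and `FakeRunner` are hypothetical stand-ins: the fake runner records commands and returns a dummy value instead of switching greenlets, and commands are plain tuples rather than real `PeCommand` objects.

```python
class FakeRunner:
    """Hypothetical stand-in for KernelRunner; no greenlet is involved."""

    def __init__(self):
        self.seen = []

    def switch_to_simpy(self, cmd):
        self.seen.append(cmd)
        return f"result-of-{cmd[0]}"


class MiniTL:
    """Toy context showing only the emit path of the two modes."""

    def __init__(self, runner=None, dispatch_cycles=1):
        self._runner = runner
        self._dispatch_cycles = dispatch_cycles
        self._commands = []

    def _emit(self, cmd):
        if self._runner is not None:      # runner mode: hand off and wait
            return self._runner.switch_to_simpy(cmd)
        self._commands.append(cmd)        # command-list mode: just record
        return None

    def _emit_dispatch_overhead(self):
        if self._dispatch_cycles > 0:
            self._emit(("PeCpuOverheadCmd", self._dispatch_cycles))

    def load(self, ptr):
        self._emit_dispatch_overhead()
        return self._emit(("DmaReadCmd", ptr))


list_tl = MiniTL()                         # command-list mode
list_result = list_tl.load(0x100)

runner = FakeRunner()
runner_tl = MiniTL(runner=runner, dispatch_cycles=0)
runner_result = runner_tl.load(0x100)
```

The `load` method itself never checks which mode it is in; only `_emit()` does, which is what lets a new primitive work in both modes.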
+
+### D3. Primitive categories
+
+#### D3.1. Reference (no DMA, metadata only)
+
+- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: creates only a handle that
+  references HBM data; no DMA is issued. Used when the composite scheduler
+  streams per tile (e.g. the b operand of GEMM).
+
+#### D3.2. Data movement (blocking, DMA engine)
+
+- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → result handle.
+  Issues `DmaReadCmd`. In greenlet mode the actual numpy array is attached to
+  the result handle's `.data`; in command-list mode it is a placeholder. The
+  returned handle has `space="hbm"`, `pinned=True`.
+- `tl.store(ptr, handle) -> None`: TCM → HBM. Issues `DmaWriteCmd`. In greenlet
+  mode, `_store.write("hbm", ptr, data)` is called first, only when
+  `handle.data` is present (visibility = issue time, ADR-0020 D3).
+
+#### D3.3. GEMM / compute (blocking)
+
+- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must be in TCM, with
+  shapes (M,K) × (K,N) → (M,N). Issues `GemmCmd`; the output handle is allocated
+  in PE-local scratch by `_make_compute_out(shape, dtype)`.
+- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
+  acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking tiled
+  pipeline. Issues `CompositeCmd`. `epilogue` is a list of dicts, each with an
+  `"op"` key + op-specific fields + an optional `"scope"` (k_tile /
+  output_tile); an unknown op or a missing field raises ValueError immediately.
+  Synchronize on the returned CompletionHandle with `tl.wait(h)`.
+
+#### D3.4. Math: unary (blocking)
+
+- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
+  `tl.cos(x)`, `tl.sin(x)`: each issues
+  `MathCmd(op=<name>, inputs=(x,), out=...)`. `out` is a scratch allocation of
+  the same shape/dtype.
+
+#### D3.5. Math: binary, ternary, and softmax (blocking)
+
+- `tl.maximum(a, b)`, `tl.minimum(a, b)`: implemented via `_binary_math`.
+- `tl.fma(a, b, c)`: `a*b + c`. Three inputs.
+- `tl.clamp(x, min, max)`: `MathCmd(op="clamp", inputs=(x, min, max))`.
+- `tl.where(cond, a, b)`: `MathCmd(op="where", inputs=(cond, a, b))`.
+- `tl.softmax(x, axis=-1)`: a single MathCmd(op="softmax"), so time is
+  accounted for once. The Phase 2 DataExecutor expands it into the canonical
+  form (x-max → exp → sum → div).
+
+#### D3.6. Reduction (blocking)
+
+- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)`: return an output
+  handle with the size of the given axis reduced to 1. Each issues
+  `MathCmd(op=<name>, inputs=(x,), out=..., axis=axis)`.
+
+#### D3.7. Index / scalar (PE_CPU, no engine)
+
+- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE index),
+  `axis==1` → cube_id (ADR-0022).
+- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per cube),
+  `axis==1` → num_cubes.
+- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range in TCM.
+  Metadata only; no command is issued.
+- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape, value,
+  dtype="f16") -> TensorHandle`: placeholders in TCM. No command is issued.
+
+#### D3.8. Scalar helpers (no command, no engine)
+
+- `TLContext.cdiv(a, b) -> int` (static): ceiling division `-(-a // b)`.
+  Mimics real Triton's `tl.cdiv`.
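
The ceiling-division idiom works because Python's `//` floors toward negative infinity, so negating before and after the division rounds up instead. A standalone sketch (a free function here, whereas the text describes a static method on `TLContext`):

```python
def cdiv(a, b):
    """Ceiling division for integers: smallest q such that q * b >= a (b > 0)."""
    return -(-a // b)


tiles_exact = cdiv(512, 64)   # divides evenly
tiles_ragged = cdiv(513, 64)  # one extra partial tile
tiles_small = cdiv(1, 64)     # anything positive needs at least one tile
```

This is the usual way to compute how many tiles cover a dimension that is not a multiple of the tile size.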
+
+#### D3.9. Metadata-only (no compute, no DMA)
+
+- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims of the
+  shape swapped. Shares the same addr/data; no command is issued.
+
+#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
+
+- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
+  dtype="f16", space="tcm") -> None`: blocking send. Accepts either the handle
+  form or the raw-address form. Issues `IpcqSendCmd`. When a snapshot of the
+  handle's `.data` is carried in the command, it avoids a race on the data the
+  recv side receives.
+- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
+  dst_space=None) -> TensorHandle`: blocking recv. If both `dst_addr` and
+  `dst_space` are given it runs in "copy_to_dst" mode, otherwise in
+  "return_slot" mode. In greenlet mode the actual data is attached to the
+  handle's `.data`.
+- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
+  **DIAGNOSTIC ONLY**. The blocking recv synchronization still applies, but the
+  slot-read latency (slot-IO + PE↔bank fabric drain) is skipped. Intended for
+  an apples-to-apples comparison with `tl.store` in the pe2pe overview plot.
+  Must not be used in production kernels; it is isolated as a separate command
+  branch (`consume=False`), so it cannot be triggered by an accidental flag.
+- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking recv.
+  Returns a `RecvFuture`; collect the result later with `tl.wait(future)`.
+
+#### D3.11. Composite + control
+
+- `tl.composite(...)`: described in D3.3.
+- `tl.wait(handle=None)`: waits on a `CompletionHandle` (composite), a
+  `RecvFuture` (async recv), or `None` (all pending composites).
+- `tl.cycles(n)`: explicitly declares PE_CPU scalar execution overhead. Issues
+  `PeCpuOverheadCmd(cycles=n)`.
+
+### D4. TensorHandle arithmetic operators: thread-local TLContext
+
+When the `tl_context.py` module is loaded, `_enable_tensor_ops()` is called and
+patches `TensorHandle.__add__`, `__sub__`, `__mul__`, and `__truediv__`. Each
+operator calls `_binary_math` on the active TLContext stored in the thread-local
+`_ctx` (a module variable).
+
+Inside a kernel, `c = a + b` is therefore equivalent to issuing
+`MathCmd(op="add", inputs=(a,b), out=...)` and returning a new TensorHandle.
+
+Active TLContext management:
+
+- `TLContext._set_active(ctx)`: sets the active ctx of the current
+  thread/greenlet.
+- `TLContext._get_active()`: lookup (RuntimeError if none).
+- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets active on
+  entry, runs the kernel, and restores it to None on exit.
+
+On every cmd dispatch, `KernelRunner`'s `_switch_kernel` calls
+`_set_active(tl)` directly, so the active ctx is restored even if another PE
+runner in the same thread has overwritten it.
+
+### D5. Scratch allocator: compute output handles
+
+Ops that produce a result, such as `tl.dot`, `tl.exp`, and `tl.add`
+(TensorHandle `__add__`), call `_make_compute_out(shape, dtype)` to allocate a
+16-byte aligned scratch address. The address is issued with `space="tcm"`, and
+`tl.send` / `tl.store` can later use the handle as a source.
+
+If `_scratch_base == 0` (e.g. command-list mode), the allocated address is
+returned as 0 and the handle cannot be used as a source for send/store (in that
+case only handles obtained from `tl.load` can be sources).
+
+A RuntimeError is raised when the cursor exceeds `_scratch_size` (default 1 MiB).
+The cursor must be reset to 0 at the start of every kernel invocation; in the
+current code this happens naturally because KernelRunner creates a new TLContext
+each time.
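
A bump allocator with the properties listed above might look like the sketch below. It is an illustration built from this section's description, not the actual `_make_compute_out`: the class name, the byte-count argument, and the assumption that `base` is itself 16-byte aligned are all hypothetical.

```python
class ScratchAllocator:
    """Toy bump allocator: 16-byte aligned offsets inside a fixed region."""

    ALIGN = 16

    def __init__(self, base, size=1 << 20):  # 1 MiB default, as in the text
        self.base = base
        self.size = size
        self.cursor = 0

    def alloc(self, nbytes):
        if self.base == 0:
            return 0  # no scratch region: handle is unusable as a source
        # Round the cursor up to the next multiple of ALIGN.
        start = (self.cursor + self.ALIGN - 1) // self.ALIGN * self.ALIGN
        end = start + nbytes
        if end > self.size:
            raise RuntimeError("scratch region exhausted")
        self.cursor = end
        return self.base + start


scratch = ScratchAllocator(base=0x1000)
first = scratch.alloc(10)   # occupies offsets 0..9
second = scratch.alloc(4)   # bumped to the next 16-byte boundary
no_scratch = ScratchAllocator(base=0).alloc(8)
```

Creating a fresh allocator per kernel invocation gives the cursor reset for free, which mirrors how a new TLContext is created each time.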
+
+### D6. Dispatch overhead: `PeCpuOverheadCmd(dispatch_cycles)`
+
+`_emit_dispatch_overhead()` is called at the entry of every non-metadata `tl.*`
+call and issues `PeCpuOverheadCmd(dispatch_cycles)` when `dispatch_cycles > 0`.
+This models the cycle cost PE_CPU spends on command dispatch itself.
+
+Defaults:
+
+- Default of the `dispatch_cycles` parameter of `TLContext.__init__`: 1 cycle.
+- TLContext created by `KernelRunner`: 0 cycles (greenlet mode accounts for
+  cycles separately, in the spirit of ADR-0020 D3).
+
+### D7. Kernel registry (`triton_emu/registry.py`)
+
+A separate `_kernels: dict[str, Callable]` holds the kernel name → function
+mapping:
+
+- `register_kernel(name, fn)`: ValueError on duplicate registration.
+- `get_kernel(name)`: KeyError if not registered.
+- `clear_registry()`: test-only.
+
+`RuntimeContext.launch(kernel_name, kernel_fn, *args)` overwrites
+`_kernels[kernel_name] = kernel_fn` idempotently on every call (last call
+wins). This is consistent with the launch behavior in ADR-0045 D8.
+
+PE_CPU looks up the kernel function in the registry by `KernelRef.name` and
+then runs it with KernelRunner.
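
The two registration paths (strict `register_kernel` versus the overwrite done by `launch`) can be contrasted in a small sketch. The function bodies are reconstructed from the description above; `launch_register` is a hypothetical name for the overwrite step that the text attributes to `RuntimeContext.launch`.

```python
_kernels = {}  # kernel name -> kernel function


def register_kernel(name, fn):
    """Strict registration: refuses to replace an existing entry."""
    if name in _kernels:
        raise ValueError(f"kernel already registered: {name!r}")
    _kernels[name] = fn


def get_kernel(name):
    """Lookup; raises KeyError for an unknown name."""
    return _kernels[name]


def clear_registry():
    """Test-only reset."""
    _kernels.clear()


def launch_register(name, fn):
    """Overwrite path, as described for launch: the last call wins."""
    _kernels[name] = fn


def kernel_v1(tl):
    return "v1"


def kernel_v2(tl):
    return "v2"


register_kernel("matmul", kernel_v1)
launch_register("matmul", kernel_v2)  # silently replaces kernel_v1
current = get_kernel("matmul")
```

The practical consequence is that re-launching under an existing name never fails, whereas an explicit second `register_kernel` call does.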
+
+## Alternatives Considered
+
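+### A1. Fold tl.* into ADR-0014 / ADR-0020
+
+Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption of
+PeCommands), and ADR-0020 covers 2-pass execution (Phase 1 timing / Phase 2
+data). `tl.*` is the API surface kernel authors meet, so keeping it separate is
+better for searchability and onboarding.
+
+### A2. Deprecate command-list mode
+
+Rejected (for now). Command-list mode runs lightly for simple unit tests and
+kernel verification. It is kept as an entry point for inspecting PeCommand
+sequences without the greenlet dependency. When the semantics unique to
+greenlet mode (real data, Phase 2) are needed, they are opted into explicitly
+through the mode selection in D2.
+
+### A3. Remove the TensorHandle arithmetic operators
+
+Rejected. They mimic the readability of real Triton kernel code (e.g.
+`c = a + b`), and the thread-local active ctx pattern works cleanly. An explicit
+`tl.add(a, b)` is also exposed in D3.5, so the functional call can be used if
+the operators are confusing.
+
+### A4. Expand softmax into an explicit sequence (max → exp → sum → div)
+
+Partially adopted. `tl.softmax` handles timing accounting once, as a single
+`MathCmd(op="softmax")` (D3.5). The real-data expansion is unrolled into the
+canonical sequence by the Phase 2 DataExecutor. That is, the time model is
+atomic and the data model is expanded; the two concerns are deliberately
+separated.
+
+## Consequences
+
+- Every `tl.*` primitive a bench author meets is classified and defined in one
+  ADR. Paired with the host-side surface of ADR-0045 D8 (torch.empty etc.), this
+  completes the authoring guide for both "inside and outside the kernel".
+- The difference between command-list and greenlet modes is stated in D2, so a
+  new `tl.*` primitive is automatically compatible with both as long as it
+  follows the `_emit()` pattern.
+- The thread-local active ctx pattern (D4) is justified at the ADR level, which
+  clarifies where the reset responsibility lies for future multi-PE same-thread
+  execution (`_switch_kernel` restores active on cmd dispatch, as part of the
+  KernelRunner.run contract).
+- The diagnostic-only isolation of `tl.recv_no_consume` (D3.10) is fixed in the
+  ADR, making clear which layer prevents accidental use in production kernels.
+- The registry (D7) is split out as its own D item, specifying the behavior for
+  kernel name collisions / dynamic re-registration.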
@@ -0,0 +1,243 @@
+# ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies `AhbmCCLBackend` + `DistributedContext` in
+`runtime_api/distributed.py`: what the
+`torch.distributed.init_process_group(backend="ahbm")` entry point actually
+installs, and with what semantics it implements `all_reduce`, `barrier`,
+`get_rank`, and so on. ADR-0023 D11 mentions the "torch.distributed
+compatibility" intent, but the **behavioral model of the backend itself** was
+never captured at the ADR level.
+
+## First action
+
+`RuntimeContext.__post_init__` automatically creates a `DistributedContext()`
+instance and attaches it to `self.distributed`. The first actions at that
+point are:
+
+1. Initialize `self._backend: AhbmCCLBackend | None = None` (not yet
+   initialized).
+2. Initialize the greenlet-local rank registry as
+   `self._rank_by_greenlet: dict = {}` (ADR-0024 D2).
+3. The caller (RuntimeContext) plants a back-reference with
+   `dc._ctx_ref = self`, so that `init_process_group` can later reach
+   `ctx.engine` / `ctx.spec` / `ctx.launch`.
+
+In short, **the first action of DistributedContext is to attach itself to the
+RuntimeContext together with a back-reference and to leave the backend slot
+empty**. The actual backend installation (IPCQ install, world_size
+resolution, algorithm module load) happens only when user code calls
+`torch.distributed.init_process_group(backend="ahbm")`.
+
+At that point, the first actions of `init_process_group` are:
+
+1. If `backend != "ahbm"`, raise `ValueError("Unsupported backend ...")`
+   immediately.
+2. If `getattr(self, "_ctx_ref", None)` is None, raise
+   `RuntimeError("DistributedContext not bound to a RuntimeContext")`.
+3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)`: inside this constructor,
+   the ccl.yaml load, algorithm module import, world_size resolution, SFR
+   configuration, and IPCQ install all take place.
+4. `self._backend._dist_ctx = self`: lets the backend reach back into
+   `_rank_by_greenlet`.
+
+## Context
+
+The purpose of `AhbmCCLBackend` is to let PyTorch DDP collective calls
+(`init_process_group`, `all_reduce`, etc.) be used as-is, so that bench code
+looks the same as a "real DDP training script" (aligned with the launcher
+model of ADR-0024 + ADR-0027).
+
+This backend is responsible for:
+
+- **Installing the IPCQ neighbor table once** at `init_process_group` time
+  (similar to real NCCL communicator creation).
+- **Launching the configured algorithm's kernel function through
+  `ctx.launch(...)`** when `all_reduce(tensor, op="sum")` is called.
+- Answering `get_world_size` / `get_rank` consistently from the
+  greenlet-local rank registry and ccl.yaml/topology.
+
+ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) cover parts of
+this, but **the scope of responsibility and the decision order of
+`AhbmCCLBackend` itself** are not specified anywhere. This ADR fills that gap.
+
+## Decision
+
+### D1. The backend is created only at `init_process_group(backend="ahbm")` time
+
+`DistributedContext` starts with `_backend = None` at `__init__` time. The
+backend object does not exist until the user calls
+`dist.init_process_group(backend="ahbm")`; if any other API (`is_initialized`,
+`get_world_size`, `all_reduce`, `barrier`) is called while the backend is
+None, it raises
+`RuntimeError("Default process group has not been initialized...")` (via the
+`_ensure_initialized` helper).
+
+`backend != "ahbm"` raises `ValueError` immediately. Other backend names
+(nccl, gloo, etc.) are not recognized.
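+
+The lazy-creation rule can be sketched as below. This is an illustration
+under assumptions, not the code in `runtime_api/distributed.py`:
+`_StubBackend` stands in for `AhbmCCLBackend` (whose real constructor performs
+the D3 installation), its fixed `world_size=4` is made up for the demo, and
+the class name `DistributedContextSketch` is hypothetical.
+
```python
class _StubBackend:
    """Stand-in for AhbmCCLBackend; the real constructor does the D3 install."""

    def __init__(self, torch_ctx, world_size=4):
        self.ctx = torch_ctx
        self.world_size = world_size  # made-up value for the demo
        self._dist_ctx = None


class DistributedContextSketch:
    def __init__(self):
        self._backend = None          # empty slot until init_process_group
        self._rank_by_greenlet = {}   # greenlet -> rank registry

    def _ensure_initialized(self):
        if self._backend is None:
            raise RuntimeError(
                "Default process group has not been initialized..."
            )
        return self._backend

    def init_process_group(self, backend="ahbm", world_size=None, rank=None,
                           **kwargs):
        # world_size / rank are accepted but ignored (see D2).
        if backend != "ahbm":
            raise ValueError(f"Unsupported backend {backend!r}")
        ctx = getattr(self, "_ctx_ref", None)
        if ctx is None:
            raise RuntimeError(
                "DistributedContext not bound to a RuntimeContext"
            )
        self._backend = _StubBackend(torch_ctx=ctx)
        self._backend._dist_ctx = self  # back-reference for rank lookup

    def get_world_size(self):
        return self._ensure_initialized().world_size

    def get_backend(self):
        self._ensure_initialized()
        return "ahbm"
```
+
+Because nothing is installed before `init_process_group`, a bench that never
+touches the distributed surface pays no cost and cannot fail on a missing
+ccl.yaml.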
+
+### D2. world_size resolution priority: algorithm > defaults > topology
+
+`AhbmCCLBackend._resolve_world_size` (ADR-0024 D1) decides in this order:
+
+1. If the algorithm entry in `ccl.yaml` has `world_size`, use it.
+2. Otherwise, if `defaults.world_size` exists, use it.
+3. If neither exists, use `spec.system.sips.count` (= the number of SIPs in
+   the topology).
+
+The default meaning is **rank = SIP** (ADR-0024). Cube/PE-level parallelism is
+expressed inside each rank through DPPolicy and does not affect world_size. An
+explicit world_size override in `ccl.yaml` is honored as-is for the legacy
+"rank = flat PE index" test path.
+
+The user arguments of `init_process_group(world_size=..., rank=...)` are
+**accepted but ignored** (the same meaning as the `RANK` / `WORLD_SIZE` env
+vars of real PyTorch).
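+
+The priority order above amounts to a three-level fallback. A minimal sketch,
+assuming the parsed ccl.yaml sections arrive as plain dicts (the function name
+and parameter names are hypothetical, not the signature of
+`_resolve_world_size`):
+
```python
def resolve_world_size(algo_cfg: dict, defaults_cfg: dict,
                       topology_sip_count: int) -> int:
    """algorithm entry > defaults > topology SIP count."""
    if "world_size" in algo_cfg:
        return int(algo_cfg["world_size"])
    if "world_size" in defaults_cfg:
        return int(defaults_cfg["world_size"])
    return int(topology_sip_count)
```
+
+With no override anywhere, the result is the SIP count, which is the
+"rank = SIP" default.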
+
+### D3. The five installation steps `init_process_group` performs immediately
+
+Inside `AhbmCCLBackend.__init__`, the following run in order:
+
+1. **Load ccl.yaml**: `kernbench.ccl.install.load_ccl_config()` →
+   `resolve_algorithm_config(_cfg_all)` produces the merged config for
+   `defaults.algorithm` (or the algorithm specified by the user).
+2. **Import the algorithm module**: `importlib.import_module(self._merged["module"])`.
+   The module must expose a `kernel` function, a
+   `kernel_args(world_size, n_elem, cube_w, cube_h)` helper, and an optional
+   `TOPO_NAME_TO_KIND` mapping.
+3. **Resolve world_size** (D2).
+4. **Collect topology metadata**: from `spec`, read `n_sips`, `sip_topo`
+   (default `ring_1d`), `cube_w`/`cube_h`, and `sips.w`/`sips.h`. If the SIP
+   topology is not ring_1d, derive `_sip_topo_w/h` from explicit `w`/`h` or
+   from the square root (guaranteeing `w*h == n_sips`). Raise `ValueError` on
+   mismatch.
+5. **Install SFR + IPCQ**: call
+   `kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine, spec,
+   self._merged)`. This function pushes the IPCQ neighbor table to pe0 of
+   every SIP/cube (the counterpart of the one-time setup of a real NCCL
+   communicator).
+
+If this order changes (e.g. the algorithm module load fails before the SFR
+step), a partially initialized state can result. D3 is therefore treated as
+one atomic sequence of five steps: on failure, the backend remains
+uninstalled.
+
+### D4. Greenlet-local rank registration (ADR-0024 D2)
+
+`DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps each spawned
+worker greenlet to a rank. When the bench launcher (e.g.
+`torch.multiprocessing.spawn`) starts a worker, it calls
+`dc._bind_rank(g, rank)` to register it.
+
+`get_rank()` looks up the greenlet returned by `getcurrent()`. An unregistered
+greenlet falls back to 0, which preserves single-driver / test compatibility.
+
+The backend obtains the current greenlet's rank during `all_reduce` through
+`_dist_ctx._rank_by_greenlet` (D5).
+
+### D5. Behavior of `all_reduce(tensor, op="sum")`
+
+Validation:
+
+- `op != "sum"` → `NotImplementedError`. The current kernels implement only
+  add reduction.
+- `tensor._handle is None` → `RuntimeError("not deployed")`.
+- Empty `tensor._handle.shards` → `RuntimeError("no shards")`.
+
+Preparation:
+
+- `n_elem = shards[0].nbytes // tensor.itemsize`: the element count of a
+  single shard.
+- `kernel_fn = self._algo_module.kernel`: the entry function of the algorithm
+  module imported in D3.
+- Effective cube dims: if the first SIP has exactly one cube, use (1,1) and
+  treat it as scalar; otherwise use the topology's `cube_w`/`cube_h`. This
+  naturally absorbs the case where TP uses only some of the cubes.
+- `kernel_args = self._algo_module.kernel_args(world_size, n_elem, cube_w,
+  cube_h)`: the algorithm decides the argument set passed to its own kernel.
+
+Dispatch:
+
+- Look up the current greenlet's rank with `_rank_by_greenlet.get(g, 0)`.
+- Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`.
+- `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor, *kernel_args,
+  *extra_args, _defer_wait=True)`: `_defer_wait=True` delegates the collective
+  drain to the main scheduler (ADR-0027 D0.4).
+
+Drain:
+
+- If a parent greenlet is alive (multi-greenlet mode), enqueue onto
+  `_pending_collective_handles` and switch to the parent. The main scheduler
+  drains in bulk after every rank has launched.
+- In single-driver mode, drain inline right away with
+  `for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
+
+### D6. `barrier()` is a no-op (single-driver model)
+
+kernbench handles all ranks as greenlets inside a single Python process. No
+situation requires inter-process synchronization, so `barrier()` is callable
+but performs no actual synchronization. It is kept for API compatibility with
+real PyTorch DDP (so that callers do not get a NotImplementedError).
+
+If a multi-process kernbench is introduced in the future (e.g. an independent
+SimPy event loop per process), a new ADR superseding D6 will be needed.
+
+### D7. Semantics of `get_rank` / `get_world_size` / `get_backend`
+
+- `get_rank()` (D4): the bound rank of the current greenlet. Unregistered
+  greenlets get 0.
+- `get_world_size()` (D2): the world_size the backend resolved in D3.
+- `get_backend()`: always the string `"ahbm"`. If the backend object does not
+  exist, `_ensure_initialized` raises RuntimeError.
+
+Differences from real PyTorch:
+
+- Real PyTorch `get_rank()` is a process-global value, whereas in kernbench it
+  is greenlet-local. Called inside a spawned worker it returns that worker's
+  rank; called from the main thread it returns 0. Bench authors should expect
+  a meaningful rank only inside the worker function.
+
+### D8. Supported API surface (final)
+
+APIs exposed by `DistributedContext`:
+
+- `init_process_group(backend="ahbm", world_size=None, rank=None, **kwargs)`
+- `is_initialized() -> bool`
+- `get_world_size() -> int`
+- `get_rank() -> int`
+- `get_backend() -> str`
+- `all_reduce(tensor, op="sum") -> None`
+- `barrier() -> None`
+- (internal) `_bind_rank(g, rank)`
+
+Other PyTorch distributed APIs (broadcast, reduce, all_gather, gather,
+scatter, send/recv, etc.) are **not implemented yet**. At the kernel level
+they can be expressed directly with `tl.send`/`tl.recv` (ADR-0046 D3.10), but
+they are not exposed on the dist.* surface. When an additional collective is
+needed, extend D8 by adding a pair: a separate algorithm module plus a
+`DistributedContext` method.
+
+## Alternatives Considered
+
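+### A1. Create the backend eagerly in `RuntimeContext.__init__`
+
+Rejected. If ccl.yaml is missing or the algorithm module cannot be imported,
+RuntimeContext creation itself would fail even when the bench does not use any
+distributed feature. Installing only at call time (D1) is the right lazy
+semantics.
+
+### A2. Always derive world_size from the topology (no override)
+
+Rejected. The "explicit override" path of ADR-0024 D1 is in use by legacy
+tests. It is kept for diagnostic scenarios that need to define PE-level ranks
+separately within a single SIP.
+
+### A3. Treat `op != "sum"` as a silent fallback
+
+Rejected. If the user intended `op="prod"` / `"max"` / `"avg"` and sum ran
+silently instead, validating the result would be very hard. An explicit
+`NotImplementedError` is safer.
+
+### A4. Implement `barrier` as a SimPy event
+
+Rejected (for now). In the single-driver model cross-process synchronization
+has no meaning, so a no-op is semantically accurate. A SimPy fake barrier
+would only add code complexity without adding meaning. Re-evaluate when a
+multi-process kernbench is introduced.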
+
+## Consequences
+
+- The five-step installation (D3) of
+  `torch.distributed.init_process_group(backend="ahbm")` is fixed at the ADR
+  level, so it is clear where a future collective algorithm should hook in.
+- The priority in D2 (algorithm > defaults > topology) is explicit, so the
+  impact of a ccl.yaml change can be assessed quickly.
+- The barrier no-op decision in D6 is fixed at the ADR level, so it is clear
+  that introducing a multi-process kernbench requires a superseding ADR.
+- The list of unsupported APIs in D8 is explicit, which gives a clear basis
+  for rejection when a user tries to call `dist.broadcast(...)`.
@@ -0,0 +1,262 @@
+# ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies the free-list algorithm, page alignment, and coalescing
+rules used by `_FreeList` / `PEMemAllocator` in `policy/address/allocator.py`
+and by `VirtualAllocator` in `va_allocator.py`. ADR-0001 (PhysAddr layout)
+and ADR-0011 (PA/VA/LA model) define the address scheme, but the **allocation
+algorithm** had no ADR of its own.
+
+## First action
+
+### `_FreeList(capacity)`
+
+On construction it initializes `self._capacity = capacity`, `self._used = 0`,
+and `self._free = [(0, capacity)]`. Its first action is to **set up the whole
+region as a single free block**: the one tuple `(offset=0, size=capacity)` is
+the only element of the free list.
+
+### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`
+
+On construction it creates two `_FreeList` instances:
+
+- `self._hbm = _FreeList(cfg.hbm_slice_bytes)`: sized to the HBM slice owned
+  by this PE (`hbm_bytes_per_cube // hbm_slices_per_cube`).
+- `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)`: sized to
+  `tcm_bytes_per_pe - tcm_scheduler_reserved_bytes` (the scheduler
+  reservation is carved out in advance).
+
+So the first action of PEMemAllocator is to **set up this PE's HBM slice and
+user TCM region each as a single free block**.
+
+### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`
+
+On construction it sets `self._va_base = va_base`, `self._va_size = va_size`,
+`self._page_size = page_size`, `self._used = 0`, and
+`self._free = [(va_base, va_size)]`. Its first action is to **set up the range
+from the VA base up to size as a single block and record page_size**.
+
+## Context
+
+`runtime_api/context.py::_ensure_allocators` builds the allocator set in the
+following steps:
+
+1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`, the
+   SIP range per target_device, and so on from the spec.
+2. Pack all parameters into a frozen `AddressConfig`.
+3. Create one `PEMemAllocator(sip, cube, pe, cfg)` instance for every
+   combination of target SIP range × cube × PE.
+4. Create one `VirtualAllocator(va_base=0x1_0000_0000, va_size=64 GiB,
+   page_size=pe_mmu.page_size)`.
+
+Responsibilities of the allocators:
+
+- **PEMemAllocator**: PA-space allocation of the PE-local HBM slice / TCM
+  (including PhysAddr encoding).
+- **VirtualAllocator**: allocates the device-wide VA space with page
+  alignment. Afterwards `RuntimeContext._create_tensor` pushes the VA → PA
+  mapping to the fabric as `MmuMapMsg`.
+
+These algorithms:
+
+- are simple **first-fit**.
+- keep the free block list **sorted by start offset**.
+- **coalesce with adjacent blocks on both sides** on `free()`.
+
+The rationale for these decisions is recorded nowhere, so a reference is
+needed to answer future questions such as "why not best-fit", "why not a
+buddy allocator", and "why is a partial-overlap free silently allowed". This
+ADR provides that reference.
+
+## Decision
+
+### D1. `_FreeList`: offset-based first-fit + coalescing
+
+`policy/address/allocator.py::_FreeList`:
+
+- Internal representation: `list[tuple[int, int]]` = `[(start_offset, size), ...]`,
+  a list of free blocks sorted by start offset.
+- `alloc(nbytes)`:
+  1. Walk the free list from the front (first-fit).
+  2. Carve the allocation from the front of the first block with
+     `size >= nbytes`.
+  3. On an exact match, remove the whole block; otherwise shrink it to
+     `(start+nbytes, size-nbytes)`.
+  4. `_used += nbytes`, and return the carved `start`.
+  5. If no block fits, raise `AllocationError("overflow ... largest free block
+     ...")`.
+- `free(offset, nbytes)`:
+  1. `_used -= nbytes`.
+  2. Find the insertion position with `bisect_left(self._free, (offset,))`.
+  3. If adjacent to the preceding block (`prev_start + prev_size == offset`),
+     merge.
+  4. If adjacent to the following block (`offset+nbytes == next_start`),
+     merge.
+  5. Insert the coalesced range at its sorted position.
+
+This algorithm is weak against fragmentation (compared with best-fit / buddy),
+but the design assumption is that it is sufficient for this simulator's
+workload (deploy/free patterns are almost stack-like). If the workload
+changes, D1 is a candidate for being superseded.
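+
+The alloc/free steps listed above can be sketched as a small standalone
+class. This is an illustration of first-fit with two-sided coalescing, not
+the code in `policy/address/allocator.py`: the class name `FreeListSketch`
+and the exact error message are assumptions.
+
```python
from bisect import bisect_left


class AllocationError(Exception):
    pass


class FreeListSketch:
    def __init__(self, capacity: int):
        self._capacity = capacity
        self._used = 0
        self._free = [(0, capacity)]  # sorted list of (start, size)

    def alloc(self, nbytes: int) -> int:
        for i, (start, size) in enumerate(self._free):  # first-fit scan
            if size >= nbytes:
                if size == nbytes:
                    del self._free[i]                   # exact match
                else:
                    self._free[i] = (start + nbytes, size - nbytes)
                self._used += nbytes
                return start
        largest = max((s for _, s in self._free), default=0)
        raise AllocationError(
            f"overflow: need {nbytes}, largest free block {largest}"
        )

    def free(self, offset: int, nbytes: int) -> None:
        # No validation of (offset, nbytes): the caller is trusted.
        self._used -= nbytes
        i = bisect_left(self._free, (offset,))
        start, size = offset, nbytes
        if i > 0:                                       # merge with previous
            p_start, p_size = self._free[i - 1]
            if p_start + p_size == start:
                start, size = p_start, p_size + size
                i -= 1
                del self._free[i]
        if i < len(self._free):                         # merge with next
            n_start, n_size = self._free[i]
            if start + size == n_start:
                size += n_size
                del self._free[i]
        self._free.insert(i, (start, size))
```
+
+Freeing three adjacent allocations in any order collapses the list back to
+the single initial block, which is the property that keeps stack-like
+workloads free of lasting fragmentation.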
+
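+### D2. Partial-overlap frees are **not checked**
+
+`_FreeList.free(offset, nbytes)` trusts the caller to pass the exact (offset,
+nbytes). It does not verify:
+
+- whether the range was actually allocated.
+- whether the range overlaps another allocated region.
+
+Reason: in the simulator context the caller always stores the return value of
+`alloc()` and passes it back to `free()`; it is not external user input.
+Adding safety checks would cost O(N) on every free and affect simulation
+wall-clock time.
+
+If this trust model breaks (e.g. a code path is introduced where two tensors
+point at the same PA), revisit it at the ADR level immediately.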
+
+### D3. `PEMemAllocator`: separate HBM and TCM channels
+
+`PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList` instances:
+
+- `_hbm`: sized `cfg.hbm_slice_bytes`.
+- `_tcm`: sized `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe -
+  tcm_scheduler_reserved_bytes`).
+
+`alloc_hbm(nbytes) -> PhysAddr`:
+
+- Obtain the offset with `_hbm.alloc(nbytes)`.
+- Encode the PA with `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id,
+  pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes)`.
+- On failure, raise `AllocationError("HBM overflow ...")`.
+
+`free_hbm(pa, nbytes)`:
+
+- Recover the PE-local offset as `pa.hbm_offset - pe_id * cfg.hbm_slice_bytes`.
+- `_hbm.free(offset, nbytes)`.
+
+`alloc_tcm(nbytes) -> PhysAddr`: encoded similarly with `PhysAddr.pe_tcm_addr`.
+
+`free_tcm(pa, nbytes)`: uses `pa.sub_offset` as-is (for TCM the PE-local
+offset is the sub_offset).
+
+The allocator is unaware of the scheduler-reserved TCM region
+(`cfg.tcm_scheduler_reserved_bytes`); it is subtracted from the capacity of
+`_tcm` in advance. This is consistent with the internal buffer reservation of
+PE_SCHEDULER in ADR-0014.
+
+### D4. `VirtualAllocator`: page-aligned first-fit + coalescing
+
+`policy/address/va_allocator.py::VirtualAllocator`:
+
+- Internal representation: the same sorted `list[tuple[int, int]]` as
+  `_FreeList`. Initially: `[(va_base, va_size)]`.
+- `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
+- `alloc(nbytes) -> int`:
+  1. `aligned = _align_up(nbytes)`.
+  2. Search first-fit for a block with `size >= aligned`.
+  3. Carve `aligned` bytes from the front of the block. Remove it on an exact
+     match.
+  4. `_used += aligned`. Return the block `start` (= the aligned VA).
+  5. On failure, raise `VaAllocationError`.
+- `free(va, nbytes)`: frees in units of `_align_up(nbytes)`, with the same
+  coalesce algorithm as `_FreeList`.
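+
+The `_align_up` rounding can be written with integer arithmetic only. A
+minimal sketch (the standalone function name `align_up` and the `PAGE_2M`
+constant are illustrative; the 2 MiB and 4 KiB page sizes are the two
+defaults this section mentions):
+
```python
def align_up(nbytes: int, page_size: int) -> int:
    """Round nbytes up to the next multiple of page_size (ceil division)."""
    return -(-nbytes // page_size) * page_size


PAGE_2M = 2 * 1024 * 1024
```
+
+Under this rounding a 5000-byte request consumes 8192 bytes of VA with a
+4 KiB page and a full 2 MiB with the standalone default, so `_used` tracks
+aligned bytes rather than requested bytes.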
+
+The actual value of `page_size` has different defaults in two places:
+
+- The parameter default of `VirtualAllocator.__init__`: `2 MiB`. Tests that
+  call it directly get this value.
+- When `RuntimeContext._ensure_allocators` instantiates it:
+  `pe_mmu.attrs.get("page_size", 4096)`, i.e. the value of
+  `pe_mmu.attrs.page_size` in `topology.yaml` if present, otherwise a 4 KiB
+  fallback.
+
+Why the two defaults differ: the standalone default of VirtualAllocator
+matches the PE_MMU stopgap default of ADR-0039 (2 MiB), which is natural for
+direct tests, and the 4 KiB context fallback is a safe minimum page when the
+topology does not set one. The real usage path is always the latter (because
+`_ensure_allocators` does the instantiation), and when `page_size` is set in
+`topology.yaml` that value flows consistently to both sides (MMU + VA
+allocator).
+
+If this agreement breaks (e.g. VirtualAllocator is instantiated with a
+page_size different from PE_MMU), MMU `map()` falls into sub-page region mode
+(ADR-0039 D3).
+
+Default VA range: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64 GiB`.
+These values are hardcoded in `_ensure_allocators` and carry no direct meaning
+in the VA model of ADR-0011; they simply reserve a device-wide address space
+large enough not to collide with host code.
+
+### D5. Lifecycle of allocator instances
+
+- `RuntimeContext._ensure_allocators` is called lazily (at the first call of
+  `_create_tensor`).
+- Once created, the allocator dict (`self._allocators`) is reused for the
+  lifetime of the RuntimeContext. A second deploy in the same process does
+  not create new objects.
+- `RuntimeContext.cleanup()` calls `_free_tensor()` on every living tensor →
+  MMU unmap + `va_allocator.free` + `pemem_allocator.free_hbm` restore the
+  free lists to their original state. The next RuntimeContext starts again
+  from the initial state.
+
+Allocator state is not shared between RuntimeContexts, which guarantees
+determinism for deploy → cleanup → deploy across consecutive runs in a single
+process.
+
+### D6. Allocator failures raise (no silent OOM)
+
+Both `_FreeList.alloc` and `VirtualAllocator.alloc` raise `AllocationError` /
+`VaAllocationError` when no free block is large enough. The message includes
+"required size + largest available block", so fragmentation can be told apart
+from a true OOM.
+
+A silent fallback (e.g. allocating only as much as the largest block) is
+strictly forbidden: if a partially allocated tensor enters the SimPy stage,
+routing and DMA see a wrong PA and simulation accuracy breaks.
+
+### D7. One-to-one correspondence between address spaces and allocators
+
+Physical address space separation is expressed by the PhysAddr sub-unit
+(ADR-0001 D2.3), and each sub-unit has its own allocator instance:
+
+- HBM slice → `PEMemAllocator._hbm`.
+- PE TCM → `PEMemAllocator._tcm`.
+- (currently unused) M_CPU local memory, CUBE SRAM → would need separate
+  allocators. The current implementation still treats them as IPCQ-only slots
+  (ADR-0023 D9.7) and does not share the PA space, so there is no separate
+  free list.
+
+When a cube-level SRAM allocator becomes necessary, add a
+`_FreeList(cfg.sram_bytes_per_cube)` instance per cube
+(`cfg.sram_bytes_per_cube` is already defined in `AddressConfig`, so the data
+model is ready).
+
+## Alternatives Considered
+
+### A1. best-fit / buddy allocator
+
+Rejected (for now). The workload's alloc/free pattern is stack-like (deploy
+order = free order), so first-fit + coalescing keeps fragmentation well under
+control. If long-running fragmentation is observed in LLM kernel sweeps,
+write a separate ADR to switch to buddy.
+
+### A2. Add partial-overlap free validation
+
+Rejected, because of the trust model in D2 and the O(N) check cost. Enabling
+it in a debug mode (e.g. a `KERNBENCH_DEBUG` env var) remains possible as
+follow-up work.
+
+### A3. A unified allocator for VA and PA
+
+Rejected. The VA space (64 GiB device-wide) and the PA space (~6 GiB per
+slice) differ in meaning: VA is the host kernel's view, PA is the device
+sub-unit's view. Allocators are kept separate to stay consistent with the VA
+model of ADR-0011 (the MMU maps between the two).
+
+### A4. Multi-tier page_size support (large page + small page)
+
+Rejected (for now). A single page_size (currently 2 MiB) fits the tensor
+sizes of LLM kernels (several MiB to several GiB), and small mappings are
+absorbed by the sub-page region of ADR-0039 D3 when needed. Multi-tier pages
+would require extending the MMU model itself, so they are a candidate for a
+separate ADR.
+
+## Consequences
+
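+- The allocator algorithms are fixed at the ADR level (D1, D3, D4), so when a
+  fragmentation issue appears in a new simulation scenario it is clear that
+  first-fit + coalesce is in use here.
+- The trust model of D2 is explicit, so it can be recognized early that this
+  ADR must be superseded if a path that takes alloc/free directly from user
+  input is introduced.
+- The one-to-one allocator correspondence per sub-unit in D7 is explicit, so
+  it is clear where to add a free list when separate M_CPU/SRAM regions are
+  needed.
+- D4 states that the page_size of `VirtualAllocator` must match the PE_MMU
+  setting, so the interaction with the ADR-0039 stopgap behavior can be
+  assessed quickly when page_size changes in topology.yaml.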
@@ -0,0 +1,231 @@
+# ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies the traffic-pattern catalog exposed by `run_probe(...)` in
+`probes/probe.py`, the formula vs actual comparison, and the meaning of
+invariant checks such as monotonicity and D2H≥H2D. ADR-0010 (CLI surface)
+enumerates the `kernbench probe` subcommand, but **what probe actually
+measures** and **which invariants it judges as PASS/FAIL** were not captured
+at the ADR level.
+
+## First action
+
+The first four actions of `run_probe(topology_path, case_filter=None)`:
+
+1. Compute the absolute path with `Path(topology_path).expanduser().resolve()`.
+2. `load_topology(path)` → a `TopologyGraph` instance (graph + spec).
+3. `_build_edge_map(graph)` → a `{(src, dst): Edge}` fast lookup table.
+4. Instantiate `AddressResolver(graph)` + `PathRouter(graph)`.
+
+It then sets `nbytes = 32768` (= 32 KiB, the reference data size of the
+summary table) and `show_all = (case_filter is None or case_filter == "all")`.
+
+In short, **the first action of probe is to load the topology once, prepare
+the edge map / resolver / router, and fix 32 KiB as the standard measurement
+size**. After that, the cases of the three categories H2D → D2H → PE DMA each
+run in their own `GraphEngine` instance (blocking cross-talk between cases).
+
+## Context
+
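+`kernbench probe` is a verification tool introduced with the following intent:
+
+- **Ground truth for manual analysis**: when the latency of a real simulation
+  (`kernbench run --bench ...`) looks abnormal, obtain the correct answer for
+  a simple traffic pattern separately and compare.
+- **Formula vs actual comparison**: check whether the analytical model (wire
+  latency + overhead + drain) matches the simulation result (`total_ns`). A
+  mismatch hints at which simplifying assumption of the model (ADR-0033) is
+  missing something.
+- **Monotonicity check**: automatic confirmation of the invariant that
+  latency must increase monotonically as the hop count grows.
+- **Utilization sweep**: a table of BW utilization per data size (4 KiB ~
+  1 MiB).
+
+Without an ADR-level specification of this tool's behavior:
+
+- Anyone adding a different kind of traffic pattern (e.g. MCpuDma, IPCQ) has
+  difficulty following the table format / measurement units of the existing
+  categories consistently.
+- It is ambiguous what monotonicity is checked against (hop count? cube
+  distance? wire length?).
+- The meaning of the 32 KiB reference size and of the `[4 KiB, 16 KiB, 64 KiB,
+  256 KiB, 1 MiB]` sweep can be confirmed only by grepping the code.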
+
+## Decision
+
+### D1. Three case categories: H2D / D2H / PE DMA
+
+Each category has a distinct data path in the topology and is printed as its
+own summary table + sweep table + route detail block.
+
+- **H2D (Host→Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube,
+  dst_pe=0, pattern="zero")` flows along the
+  `pcie_ep → io_cpu → m_cpu → hbm_ctrl` path. The hop count grows with the
+  cube index:
+  - h2d-1hop: cube=0, hops=1
+  - h2d-2hop: cube=4, hops=2
+  - h2d-3hop: cube=8, hops=3
+  - h2d-4hop: cube=12, hops=4
+- **D2H (Device→Host Read)**: `MemoryReadMsg(src_sip=0, src_cube, src_pe=0)`.
+  The latency is the sum of the forward command path + the reverse data path.
+  Same four hop categories.
+- **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe, dst_pa)`.
+  Five cases that vary the cube/PE position:
+  - pe-local-hbm: same cube, same PE
+  - pe-same-half-hbm: same cube, different PE (PE 1)
+  - pe-cross-half-hbm: same cube, far PE (PE 4)
+  - pe-cross-cube-hbm-best: adjacent cube (cube 1)
+  - pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
+
+The indices 4/8/12 (H2D) and 1/4/15 (PE DMA) are meaningful because of how
+distance is defined on the 4x4 cube mesh (sip.cube_mesh.w=4, h=4); if the
+cube_mesh size changes later, these values must be updated with it.
+
+### D2. Standard measurement size: `nbytes = 32768` (32 KiB)
+
+The summary table of every case shows the result of one run with
+`nbytes=32768`. 32 KiB was chosen because:
+
+- It is a moderate size at which neither DMA overhead nor BW drain dominates.
+- It is comparable with the single-transfer unit of many sub-units (TCM,
+  register file).
+
+The change in utilization with size is shown by the separate sweep table (D3).
+
+### D3. Utilization sweep: `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`
+
+`SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`,
+`SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]`. For each size the
+following formulas apply:
+
+```
+drain   = nbytes / bottleneck_bw
+total   = overhead + wire + drain
+eff_bw  = nbytes / total
+util%   = eff_bw / bottleneck_bw × 100
+```
+
+If `bn_bw is None or <= 0`, that column is printed as 0.0 %. The point: one
+table shows the pattern where, as the hop count grows, small transfers become
+overhead-bound and large transfers become drain-bound.
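+
+The sweep formulas translate directly into a few lines of Python. This is a
+sketch under one stated assumption: bandwidth is expressed in bytes per
+nanosecond (numerically equal to GB/s with 1 GB = 10^9 bytes), so that
+`nbytes / bn_bw` is already in ns. The function names are hypothetical and
+not taken from `probes/probe.py`.
+
```python
SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]
SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]


def utilization_pct(nbytes, overhead_ns, wire_ns, bn_bw):
    """util% = eff_bw / bottleneck_bw * 100, or 0.0 without a usable BW."""
    if bn_bw is None or bn_bw <= 0:
        return 0.0
    drain = nbytes / bn_bw                 # ns, assuming bytes-per-ns BW
    total = overhead_ns + wire_ns + drain
    eff_bw = nbytes / total
    return eff_bw / bn_bw * 100.0


def sweep_row(overhead_ns, wire_ns, bn_bw):
    """One sweep-table row: label -> utilization percentage."""
    return {
        label: utilization_pct(size, overhead_ns, wire_ns, bn_bw)
        for size, label in zip(SWEEP_SIZES, SWEEP_LABELS)
    }
```
+
+Algebraically `util%` reduces to `drain / total × 100`, which is why
+utilization climbs toward 100 % as the transfer grows and the fixed
+overhead + wire cost is amortized.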
+
+### D4. Measured items: actual / formula / breakdown
+
+Columns shown in each case row:
+
+- `Actual` (total_ns): `trace["total_ns"]` from the SimPy run.
+- `Ovhd`: the sum of `node.attrs["overhead_ns"]` over every node on the path
+  (formula breakdown).
+- `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
+- `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
+- `Ovhd%` / `Drain%`: the share of Ovhd/Drain in Actual (the formula's Wire
+  is usually very small and is not shown).
+- `Eff.BW`: `nbytes / total_ns` (measured BW).
+- `BN.BW`: bottleneck bandwidth (formula), the minimum BW over all edges on
+  the path. "-" if there is no edge BW.
+- `Util%`: `Eff.BW / BN.BW × 100`. At 100% the single-stream BW upper bound
+  is reached.
+
+A large gap between the formula sum (`wire + ovhd + drain`) and actual signals
+a factor that the model simplification does not capture (a check on the
+ADR-0033 assumptions).
+
+### D5. Automatic invariant checks: PASS/FAIL
+
+The following invariants are checked automatically and printed as `[v] PASS` /
+`[x] FAIL`:
+
+- **H2D / D2H monotonic increase**: actual latency must increase
+  monotonically as the hop count grows. `all(lats[i] < lats[i+1] for ...)`.
+- **D2H ≥ H2D**: at the same hop index, D2H ≥ H2D (because D2H has two legs,
+  forward command + reverse data). `all(d2h[i].total >= h2d[i].total)`.
+- **PE DMA best < worst**: cross-cube best (adjacent) latency < cross-cube
+  worst (diagonal) latency.
+- **PE DMA local vs remote**: prints a comparison of local BN BW vs remote BN
+  BW (informational, not PASS/FAIL).
+
+When a check FAILs, it is printed clearly on a single line so that a person
+notices the model/topology regression immediately.
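+
+The two list-based invariants above can be sketched as small predicates. The
+helper names and the text after the `[v]` / `[x]` prefix are assumptions made
+for illustration; only the comparisons and the prefixes come from this
+section.
+
```python
def strictly_increasing(lats):
    """H2D / D2H monotonicity: each hop count is slower than the previous."""
    return all(a < b for a, b in zip(lats, lats[1:]))


def d2h_at_least_h2d(h2d_totals, d2h_totals):
    """At the same hop index, D2H latency must be >= H2D latency."""
    return all(d >= h for h, d in zip(h2d_totals, d2h_totals))


def verdict(ok, label):
    """One-line result with the grep-friendly prefix."""
    prefix = "[v] PASS" if ok else "[x] FAIL"
    return f"{prefix} {label}"
```
+
+With a single data point (for example when `--case` selects one case) both
+predicates are vacuously true, matching the "naturally skipped" behavior
+described for filtered runs.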
+
+### D6. Route detail: per-hop timestamp trace
+
+After the summary and sweep tables, the path of each case and the per-hop
+cumulative time (`_hop_timestamps`) are printed in a separate section:
+
+- H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3
+  (`m_cpu → hbm_ctrl`) + per-hop trace.
+- D2H: forward (cmd, no data) and reverse (data) traces shown separately.
+- PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.
+
+The timestamp of each hop is the cumulative `wire_ns + overhead_ns`. The
+annotation of the terminal hop carries `drain:Xns`. The bottleneck edge is
+marked `<BN:XXGB/s>` so it can be identified visually.
+
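+### D7. Meaning of the case_filter argument
+
+- `None` or `"all"`: run every case (default).
+- Any other string: run only the case whose name matches exactly. Example:
+  `kernbench probe --case h2d-2hop`.
+
+Within each category a case is skipped when `name != case_filter`, and the
+category's monotonicity / D2H≥H2D comparison is naturally skipped when only
+one data point exists.
+
+The `--case` default of the CLI parser is `"all"`, so omitting the argument
+runs everything.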
+
+### D8. A fresh GraphEngine per case
+
+The 4 H2D, 4 D2H, and 5 PE DMA cases each run in a **new GraphEngine**
+instance (`engine = GraphEngine(graph)`). Reasons:
+
+- Isolate accumulated per-case state (op_log, completion tracking,
+  allocators, etc.) so that it does not cross-talk.
+- Guarantee that the traffic of one case does not affect the BW measurement
+  of another.
+
+This isolation lets probe results be read as the latency of **each case as a
+standalone single flow**. Multi-flow contention measurement is the
+responsibility of other tools (e.g. the `pe2pe_overview` plot, the multi-flow
+merging model of ADR-0033).
+
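+### D9. Stability of the output format
+
+The stdout output of probe is meant for humans; exact column widths,
+separators, and whitespace are not a machine-readable contract. For an
+automated tool to parse probe results, a separate JSON output mode would have
+to be added (not implemented yet).
+
+The `[v]` / `[x]` prefixes of the PASS/FAIL lines are guaranteed stable as
+anchors for CI grep.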
+
+## Alternatives Considered
+
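+### A1. Register probe as a separate bench (`@bench(name="probe")`)
+
+Rejected. probe is intended as a verification tool, not a bench: its essence
+is multi-engine execution for sweep / analysis plus invariant PASS/FAIL
+output, which does not fit the "single device + single RuntimeContext" bench
+model of ADR-0045.
+
+### A2. Exit code 1 on a monotonicity violation
+
+Rejected (for now). The tool is intended mainly for human inspection, so it
+prints the PASS/FAIL lines and exits with 0. If CI should fail on a violation,
+a separate wrapper can decide from the result of `grep "\[x\]"`. A
+strict-mode flag (`--strict`) can be introduced as a follow-up.
+
+### A3. Move probe case definitions to external YAML
+
+Rejected (for now). The 13 cases (4 H2D + 4 D2H + 5 PE DMA) are hardcoded and
+their meaning is tightly bound to the topology mesh structure. Moving them to
+external YAML would require documenting the meaning of the cube indices (4,
+8, 12 / 1, 4, 15) separately, which loses cohesion. If cases are added
+frequently, introduce it then through a separate ADR.
+
+### A4. Add multi-flow contention measurement
+
+Rejected (out of probe scope). The single-flow isolation model stated in D8
+is the core intent of probe. Multi-flow contention belongs to a different
+area of the ADR-0033 latency model and is handled by a separate tool or a
+separate case category.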
+
+## Consequences
+
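+- The case catalog (D1) and measurement units (D2/D3) of probe are explicit
+  at the ADR level, so it is clear which table format to follow when adding a
+  new traffic category.
+- The meaning of the formula vs actual columns (D4) is fixed, so a question
+  such as "why is Drain% 5% / 70%" when reading probe results can be linked
+  quickly to a check of the ADR-0033 assumptions.
+- The automatic invariant checks (D5) are fixed in the ADR, establishing the
+  safety net that probe immediately catches monotonicity / D2H≥H2D
+  regressions when the latency model changes.
+- The isolation between cases in D8 is explicit, so probe results can safely
+  be read as single-flow measurements. It is clear that a separate tool track
+  is needed if multi-flow measurement becomes necessary.
+- The strict-mode flag of A2 is recorded as a follow-up candidate, so it can
+  be introduced with minimal extra work when CI integration is required.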
@@ -0,0 +1,308 @@
+# ADR-0050: CCL Algorithm Module Contract — `ccl/algorithms/*.py`
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies the interface a module in the
+`src/kernbench/ccl/algorithms/` directory must provide to be used as a
+collective algorithm by the AHBM CCL backend (ADR-0047), the kernel
+signature, and the procedure for adding a new algorithm. ADR-0047 D3 only
+mentions in one line that "the algorithm module must expose `kernel`,
+`kernel_args`, and optional `TOPO_NAME_TO_KIND`"; the **contract that
+algorithm module authors must follow** has never been organized at the ADR
+level. It pairs with ADR-0045, which covers the bench module contract.
+
+## First action
+
+An algorithm module is imported at two points:
+
+1. **AHBM backend entry**: when user code calls
+   `dist.init_process_group(backend="ahbm")`, `self._algo_module =
+   importlib.import_module(self._merged["module"])` runs inside
+   `AhbmCCLBackend.__init__`. The first things that happen at module level
+   are:
+   - Integer constants such as `SIP_TOPO_RING/TORUS/MESH` are exposed in the
+     module namespace.
+   - The `TOPO_NAME_TO_KIND` dict is exposed in the module namespace; the
+     backend looks it up with
+     `topo_map = getattr(self._algo_module, "TOPO_NAME_TO_KIND", None)`.
+   - The `kernel_args` function is defined, for the caller to use when
+     invoked.
+   - An algorithm function such as `allreduce_intercube_multidevice` is
+     defined.
+   - On the last line of the module, the alias
+     `kernel = allreduce_intercube_multidevice` is exposed.
+
+2. **ccl.yaml install stage**: the same algorithm module is imported when
+   `kernbench.ccl.install.install_ipcq` is called to push the IPCQ neighbor
+   table.
+
+In short, **the first action of an algorithm module is to expose the
+topology-kind constants, the `TOPO_NAME_TO_KIND` dict, the `kernel_args`
+function, and the `kernel` alias in the module namespace**. Import-time side
+effects are sufficient for all of this; no separate initialization function
+call is needed.
+
+## Context
+
+At process group initialization, `AhbmCCLBackend` (ADR-0047) obtains the
+module path from `defaults.algorithm` in `ccl.yaml` (or the algorithm name
+specified by the user) and imports it dynamically. The backend expects the
+following four things from that module:
+
+- `kernel`: the entry function of the collective.
+- `kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple`: the bundle of
+  positional arguments passed to the kernel.
+- `TOPO_NAME_TO_KIND` (optional): a dict that maps the `sips.topology`
+  string of `topology.yaml` (e.g. `"ring_1d"`, `"torus_2d"`,
+  `"mesh_2d_no_wrap"`) to integer constants.
+- (indirect) IPCQ neighbor table installation:
+  `configure_sfr_intercube_multisip` decides the SFR from the algorithm
+  module's `TOPO_NAME_TO_KIND` and `cube_w/h`.
+
+The only algorithm module currently in the corpus is
+`lrab_hierarchical_allreduce.py` (248 lines). The name stands for
+"**l**eft-**r**ight **a**lternating **b**roadcast **hierarchical
+allreduce**". Whenever modules such as `ring_allreduce`, `tree_allreduce`, or
+`broadcast` are added in the future, they must follow this contract for
+dispatch to stay consistent.
+
+Without this contract at the ADR level:
+
+- A new algorithm author has to infer the signature from the single line in
+  ADR-0047 D3.
+- The meaning of the kernel function argument order (in particular `t_ptr,
+  n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w,
+  sip_topo_h, tl`) is not clear without grepping the code.
+- What arguments `kernel_args` takes and what tuple it must return is fixed
+  only by convention.
+
+## Decision
+
+### D1. An algorithm module exposes four public symbols
+
+```python
+# src/kernbench/ccl/algorithms/<name>.py
+
+from __future__ import annotations
+
+# (required) topology-kind constants, used inside the algorithm
+SIP_TOPO_RING = 0
+SIP_TOPO_TORUS = 1
+SIP_TOPO_MESH = 2
+
+# (optional) topology name → kind mapping. The backend uses it to convert the
+# string SIP topology from ccl.yaml/topology into an integer.
+TOPO_NAME_TO_KIND = {
+    "ring_1d": SIP_TOPO_RING,
+    "torus_2d": SIP_TOPO_TORUS,
+    "mesh_2d_no_wrap": SIP_TOPO_MESH,
+}
+
+# (required) kernel argument builder
+def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
+    return (n_elem, cube_w, cube_h, world_size)
+
+# (required) kernel function (TLContext is injected through the `tl=...` keyword)
+def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
+                         sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
+    ...
+
+# (required) kernel alias; the backend accesses it as `module.kernel`
+kernel = my_allreduce_kernel
+```
+
+- The `kernel` alias is the entry point the backend calls directly. Whatever
+  the function is named (as with `allreduce_intercube_multidevice`), it must
+  be exposed as `module.kernel = fn`.
+- Without `kernel_args`, the backend has no way to build the algorithm
+  arguments. See D2 for the signature.
+- Without `TOPO_NAME_TO_KIND`, the backend falls back to `sip_topo_kind = 0`.
+  An algorithm that supports only a single topology may omit it.
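+
+How a loader might check this contract and apply the `sip_topo_kind = 0`
+fallback can be sketched as below. Both helper names are hypothetical, the
+`TypeError` on a missing symbol is an assumption (the text does not say how
+the backend reports it), and a topology name absent from an existing map is
+left to raise `KeyError` because the text does not cover that case.
+
```python
import types


def check_contract(algo_module) -> None:
    """Verify the required callables `kernel` and `kernel_args` exist."""
    missing = [
        name for name in ("kernel", "kernel_args")
        if not callable(getattr(algo_module, name, None))
    ]
    if missing:
        raise TypeError(f"algorithm module lacks: {', '.join(missing)}")


def resolve_topo_kind(algo_module, topo_name: str) -> int:
    """Map a topology string to its kind; 0 when the module has no map."""
    topo_map = getattr(algo_module, "TOPO_NAME_TO_KIND", None)
    if topo_map is None:
        return 0
    return topo_map[topo_name]


def make_demo_module() -> types.ModuleType:
    """Build an in-memory module that satisfies the required part."""
    mod = types.ModuleType("demo_algo")
    mod.kernel_args = (
        lambda world_size, n_elem, *, cube_w=4, cube_h=4:
        (n_elem, cube_w, cube_h, world_size)
    )
    mod.kernel = lambda *args, tl=None: None
    return mod
```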
+
+### D2. `kernel_args` signature: `(world_size, n_elem, *, cube_w, cube_h)`
+
+```python
+def kernel_args(world_size: int, n_elem: int, *,
+                cube_w: int = 4, cube_h: int = 4) -> tuple:
+    return (n_elem, cube_w, cube_h, world_size)
+```
+
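+- **Positional arguments**: `world_size` (= number of ranks), `n_elem` (=
+  element count of a single shard, in f16 terms).
+- **Keyword arguments**: `cube_w`, `cube_h` (= cube mesh size). The default
+  is 4×4, consistent with the default of `sip.cube_mesh` in `topology.yaml`.
+- **Return**: a tuple bundled in the order of the kernel's positional
+  arguments.
+
+When the backend's `all_reduce` makes the call: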
+
+```python
+kernel_args_tuple = self._algo_module.kernel_args(
+    self._world_size, n_elem, cube_w=eff_cube_w, cube_h=eff_cube_h,
+)
+extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
+pending = self.ctx.launch(
+    self._merged["algorithm"], kernel_fn, tensor,
+    *kernel_args_tuple, *extra_args, _defer_wait=True,
+)
+```
+
+So the final positional arguments of the kernel are `(tensor_ptr,
+*kernel_args_tuple, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`, and
+`tl=...` is injected automatically as a keyword on top of that. The
+length/order of the tuple returned by `kernel_args` must **match the kernel
+signature 1:1**.
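+
+The argument assembly described above can be exercised without the simulator.
+In this sketch `demo_kernel` and `call_like_backend` are hypothetical names,
+the kernel only echoes what it received, and the direct call replaces
+`ctx.launch(...)`; the argument order is the part taken from this section.
+
```python
def kernel_args(world_size, n_elem, *, cube_w=4, cube_h=4):
    # Order must match the kernel's positional parameters after t_ptr.
    return (n_elem, cube_w, cube_h, world_size)


def demo_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    # Echo the bound arguments so the mapping can be inspected.
    return {
        "t_ptr": t_ptr, "n_elem": n_elem,
        "cube": (cube_w, cube_h), "n_sips": n_sips,
        "sip_rank": sip_rank, "sip_topo_kind": sip_topo_kind,
        "sip_topo": (sip_topo_w, sip_topo_h), "tl": tl,
    }


def call_like_backend(kernel_fn, tensor_ptr, world_size, n_elem,
                      cube_w, cube_h, extra_args, tl_ctx):
    args = kernel_args(world_size, n_elem, cube_w=cube_w, cube_h=cube_h)
    # tensor first, then kernel_args, then the 4 backend extras, tl by keyword
    return kernel_fn(tensor_ptr, *args, *extra_args, tl=tl_ctx)
```
+
+If `kernel_args` returned a tuple of a different length, the call would fail
+with a `TypeError` at bind time, which is the practical meaning of the 1:1
+requirement.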
+
+### D3. `kernel` function signature: a standardized 9 + tl arguments
+
+Recommended signature:
+
+```python
+def my_kernel(
+    t_ptr: int,             # VA base of the row-wise-sharded tensor on this SIP
+    n_elem: int,            # element count per cube tile (or per shard)
+    cube_w: int,            # cube mesh width  (from kernel_args)
+    cube_h: int,            # cube mesh height (from kernel_args)
+    n_sips: int,            # same as world_size (rank = SIP, ADR-0024)
+    sip_rank: int,          # rank of this SIP
+    sip_topo_kind: int,     # result of the TOPO_NAME_TO_KIND lookup
+    sip_topo_w: int,        # SIP mesh width (0 for ring_1d)
+    sip_topo_h: int,        # SIP mesh height (0 for ring_1d)
+    *, tl,                  # TLContext (auto-injected)
+) -> None:
+    ...                     # body elided; signature only
+```
+
+Even if `kernel_args` adopts a different positional order, the kernel's
+**last four positional arguments are always
+`(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`**; the backend appends
+them as `extra_args` (ADR-0047 D5). User-defined algorithms must accept these
+four arguments too, but a single-SIP algorithm can simply ignore them.
+
+`tl` is injected as a keyword, not as a positional argument:
+`RuntimeContext.launch` adds `tl=tl_ctx` immediately before calling the
+kernel. The `tl` parameter in the kernel signature must therefore be
+keyword-only (`*, tl`) or the last keyword parameter.
+
+### D4. Freedom and constraints of the kernel body
+
+Surface available inside the kernel body: every `tl.*` primitive in
+ADR-0046 D3.
+
+Frequently used patterns:
+
+- `cube_id = tl.program_id(axis=1)`: index of the cube this PE belongs to.
+- `pe_addr = t_ptr + cube_id * nbytes`: VA of the per-cube tile.
+- `acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")`: load the local
+  data.
+- `tl.send(dir=...)` / `tl.recv(dir=..., shape=, dtype=)`: IPCQ
+  collective.
+- `acc = acc + recv`: TensorHandle arithmetic operator (ADR-0046 D4).
+- `tl.store(pe_addr, acc)`: store the result.
+
+The kernel body is an ordinary Python function, so branching and looping are
+unrestricted, with these exceptions:
+
+- No SimPy `yield` or `async` (ADR-0046 D1).
+- No direct access to a TensorHandle's `.data`: the phase 1 timing model
+  does not know about data dependencies (the 2-pass separation of ADR-0020).
+- Kernel execution must be deterministic: running twice with the same input
+  must emit the same op sequence. No randomness or external IO.
+
+### D5. SIP topology semantics: the meaning of `sip_topo_kind`
+
+The backend looks up the `system.sips.topology` string from `topology.yaml`
+in the algorithm module's `TOPO_NAME_TO_KIND` and converts it to the integer
+`sip_topo_kind`. The algorithm branches on this integer:
+
+```python
+if sip_topo_kind == SIP_TOPO_RING:
+    acc = _inter_sip_ring(...)
+elif sip_topo_kind == SIP_TOPO_TORUS:
+    acc = _inter_sip_torus_2d(...)
+elif sip_topo_kind == SIP_TOPO_MESH:
+    acc = _inter_sip_mesh_2d(...)
+```
+
+Each topology branch communicates with its peers through IPCQ direction
+names (for example `"global_E"`, `"W"`, `"S"`, `"N"`). ADR-0023/0025 define
+what each direction means, and `configure_sfr_intercube_multisip` installs
+the IPCQ neighbor table accordingly.
+
+When an algorithm module receives a topology kind it does not support, it
+should explicitly
+`raise ValueError(f"unsupported topology kind {sip_topo_kind}")` rather than
+silently no-op, so that an accidental mis-dispatch from the backend fails
+fast.
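
A minimal sketch of the fail-loud dispatch recommended above. The integer values of the `SIP_TOPO_*` constants and the topology names `"torus_2d"` and `"mesh_2d"` are assumptions made for illustration (only `ring_1d` appears in this ADR), and the branch bodies are placeholders for the real inter-SIP phases.

```python
SIP_TOPO_RING = 0   # assumed value; matches the backend's fallback of 0
SIP_TOPO_TORUS = 1  # assumed value
SIP_TOPO_MESH = 2   # assumed value

# Hypothetical name table; the real names come from topology.yaml.
TOPO_NAME_TO_KIND = {
    "ring_1d": SIP_TOPO_RING,
    "torus_2d": SIP_TOPO_TORUS,
    "mesh_2d": SIP_TOPO_MESH,
}


def inter_sip_phase(sip_topo_kind: int) -> str:
    """Pick the inter-SIP phase, raising on an unsupported kind."""
    if sip_topo_kind == SIP_TOPO_RING:
        return "ring"
    if sip_topo_kind == SIP_TOPO_TORUS:
        return "torus_2d"
    if sip_topo_kind == SIP_TOPO_MESH:
        return "mesh_2d"
    # Fail loudly instead of silently doing nothing.
    raise ValueError(f"unsupported topology kind {sip_topo_kind}")


chosen = inter_sip_phase(TOPO_NAME_TO_KIND["ring_1d"])
```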
+
+### D6. Structure of an algorithm entry in ccl.yaml
+
+Each algorithm module is paired with an entry in `ccl.yaml` (ADR-0023 D10 +
+ADR-0047 D3):
+
+```yaml
+defaults:
+  algorithm: lrab_hierarchical_allreduce
+  n_elem: 8
+
+algorithms:
+  lrab_hierarchical_allreduce:
+    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
+    # optional: world_size override
+    # optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
+```
+
+- `module`: the full Python module path. The backend passes this string
+  unchanged to `importlib.import_module`.
+- `world_size` (optional): when present, it overrides the topology fallback
+  (ADR-0047 D2).
+- Algorithm-specific parameters are consumed by
+  `configure_sfr_intercube_multisip`.
+
+To add a new algorithm:
+
+1. Write `src/kernbench/ccl/algorithms/<name>.py` (following the D1
+   convention).
+2. Add an entry to the `algorithms` section of `ccl.yaml`.
+3. (If needed) add an SFR installation branch to `kernbench.ccl.sfr_config`.
+4. Add tests (for example `tests/sccl/test_<name>.py`, extending the
+   ADR-0043 eval harness).
+
+### D7. Legacy "rank = flat PE index" mode
+
+The `world_size` override path in `ccl.yaml` specified by ADR-0047 D2 is used
+by the legacy "rank = flat PE index" tests. In this mode too, the algorithm
+module can assume that `n_sips=world_size` ranks arrive: the backend
+resolves the rank↔(SIP, cube, PE) mapping beforehand, so the algorithm body
+needs no mode-specific branching.
+
+For single-cube workloads, however, `cube_w=cube_h=1` is passed in, and the
+algorithm must be written so that the mesh-based phases are skipped (see the
+`single_cube = (cube_w == 1 and cube_h == 1)` pattern in
+`lrab_hierarchical_allreduce.py`).
+
+## Alternatives Considered
+
+### A1. Structure algorithm modules as classes (`class Allreduce: kernel(...)`, etc.)
+
+Rejected. The Python module namespace itself already serves as the
+algorithm's identity (`importlib.import_module` in ADR-0047 D3); a class
+layer only adds indirection and thickens the dispatch code. The module-level
+free function plus `kernel` alias pattern is clear enough.
+
+### A2. Type `kernel_args` as an explicit dataclass
+
+Rejected (for now). It is normal for the argument count to differ between
+algorithms, and forcing a single dataclass would make compatibility across
+algorithms harder. Returning a tuple is simple and meshes cleanly with the
+backend's `*kernel_args_tuple` unpacking. If an algorithm needs stronger
+typing of its own, it is free to use a NamedTuple inside its module.
+
+### A3. Move SFR installation into the algorithm module
+
+Rejected. SFR installation (`configure_sfr_intercube_multisip`) is a
+cross-module decision that looks at both topology and algorithm to install
+the IPCQ neighbor table, so a dedicated location such as
+`kernbench.ccl.sfr_config` fits better than the inside of an algorithm
+module. The "add an sfr_config branch if needed" workflow in D6 separates
+the responsibilities more clearly.
+
+### A4. Auto-register algorithm names in the module namespace (decorator)
+
+Rejected. Unlike ADR-0045 (bench), an algorithm is tied directly to its
+ccl.yaml entry, so an additional registry would be redundant. A single
+`module` string mapping is enough.
+
+## Consequences
+
+- The one-line contract in ADR-0047 D3 is expanded into the author-friendly
+  guide of D1–D7, so adding a new algorithm no longer requires inferring the
+  signature by grep.
+- The 9 + tl argument signature in D3 is standardized and lines up with the
+  backend's `extra_args` append (ADR-0047 D5). It is now explicit that future
+  single-SIP-only algorithms must also accept the four sip_* arguments.
+- With the fail-loud recommendation in D5, if the configured topology is one
+  the algorithm does not support, the run fails with a ValueError instead of
+  a silent wrong result.
+- The step-by-step procedure in D6 makes clear how far adding a new
+  algorithm reaches into sfr_config, tests, and ccl.yaml.
@@ -0,0 +1,267 @@
+# ADR-0051: Routing Helper API — `AddressResolver` + `PathRouter`
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies every public API, argument, and return value of the two
+helper classes exposed by `policy/routing/router.py` (`AddressResolver`,
+`PathRouter`), along with where each of the four distinct adjacency graphs is
+used. ADR-0002 defines routing distance, ordering, and bypass rules, but the
+**helper API surface itself** has never been documented at the ADR level.
+
+## First action
+
+### `AddressResolver(graph)`
+
+On construction it caches two things:
+
+1. `self._node_ids = set(graph.nodes)`: the set of all node ids (for lookup).
+2. `self._hbm_slice_bytes = hbm_total_gb * (1 << 30) // slices_per_cube`:
+   derived from `graph.spec.cube.memory_map` (default `48 GB / 8 slices = 6
+   GB`). `resolve()` uses this value to recover `pe_id` from the
+   `hbm_offset` of an HBM PA.
+
+In short, **the first thing AddressResolver does is precompute the full set
+of node ids and the HBM slice size**. It does not retain the graph itself.
+
+### `PathRouter(graph)`
+
+On construction it **builds four separate adjacency graphs at once**:
+
+1. `self._adj_all`: includes every edge (for component-to-component routing).
+2. `self._adj`: only edges with `kind != "command"` (PE DMA / general data
+   path).
+3. `self._adj_mcpu_dma`: excludes
+   `_MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}` (so that M_CPU DMA is
+   not misrouted through PE pipeline nodes).
+4. `self._adj_local`: excludes the eight `_UCIE_KINDS` (for cube-local
+   routing; this keeps Dijkstra from preferring UCIe over the mesh because
+   it looks like a zero-distance bus).
+
+Each graph is a `defaultdict(list)` of `(neighbor, weight)` pairs, with
+`edge.routing_weight_mm or edge.distance_mm` used as the weight.
+
+In short, **the first thing PathRouter does is classify the topology edges
+under four different policies at once and build four adjacency lists**. Each
+`find_*()` call then picks the appropriate graph and runs Dijkstra on it.
+
+## Context
+
+`policy/routing/router.py` carries two responsibilities:
+
+- **Name mapping**: it is the single owner of the topology naming rules
+  (`sip{S}.cube{C}.<comp>`, `sip{S}.io{I}.pcie_ep`, and so on). Components,
+  probes, IPCQ install, and the runtime API call the helper instead of
+  building name strings themselves.
+- **Path selection**: policy is split by edge `kind`. Even for the same
+  src→dst pair, the routing intent (PE DMA vs M_CPU DMA vs general component
+  routing) decides which adjacency is used, and the result differs
+  accordingly.
+
+Although this helper API is consumed widely across the code (probe.py /
+distributed.py / install.py / various components / tests), its **exact
+signatures, return semantics, and choice of adjacency** are not collected in
+one place at the ADR level. This ADR fills that gap.
+
+## Decision
+
+### D1. The five public APIs of `AddressResolver`
+
+#### D1.1. `resolve(addr: PhysAddr) -> str`
+
+Converts a `PhysAddr` instance to the destination node id in the topology.
+
+```
+addr.kind == "hbm"             → f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
+  where pe_id = addr.hbm_offset // self._hbm_slice_bytes  (ADR-0017 D4/D9)
+
+addr.kind == "pe_resource":
+  addr.unit_type == PE         → f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
+  addr.unit_type == SRAM       → f"sip{s}.cube{d}.sram"
+  addr.unit_type == MCPU       → f"sip{s}.cube{d}.m_cpu"
+  otherwise                    → RoutingError("unsupported unit_type")
+
+any other kind                 → RoutingError("unsupported address kind")
+```
+
+If the computed node id is not in `self._node_ids`, it raises
+`RoutingError(f"node {node_id} not found in topology")`. That is, even a
+syntactically valid address fails loudly when no node in the topology
+actually maps to it.
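
The mapping above can be sketched as a free function. This is an illustration under assumptions, not the real `AddressResolver`: the `PhysAddr` and `UnitType` shapes below are hypothetical (only the field and unit names that appear in the mapping are taken from the text), and the slice size uses the stated default of 48 GB over 8 slices.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RoutingError(Exception):
    pass


class UnitType(Enum):  # hypothetical enum; only the three names come from the text
    PE = "pe"
    SRAM = "sram"
    MCPU = "mcpu"


@dataclass
class PhysAddr:  # hypothetical shape of the real PhysAddr
    kind: str
    sip: int
    cube: int
    hbm_offset: int = 0
    unit_type: Optional[UnitType] = None
    pe_id: int = 0


HBM_SLICE_BYTES = 48 * (1 << 30) // 8  # default: 48 GB / 8 slices = 6 GB


def resolve(addr: PhysAddr, node_ids: set, slice_bytes: int = HBM_SLICE_BYTES) -> str:
    prefix = f"sip{addr.sip}.cube{addr.cube}"
    if addr.kind == "hbm":
        # The PE id is recovered from the offset within the cube's HBM.
        node_id = f"{prefix}.hbm_ctrl.pe{addr.hbm_offset // slice_bytes}"
    elif addr.kind == "pe_resource":
        if addr.unit_type is UnitType.PE:
            node_id = f"{prefix}.pe{addr.pe_id}.pe_tcm"
        elif addr.unit_type is UnitType.SRAM:
            node_id = f"{prefix}.sram"
        elif addr.unit_type is UnitType.MCPU:
            node_id = f"{prefix}.m_cpu"
        else:
            raise RoutingError("unsupported unit_type")
    else:
        raise RoutingError("unsupported address kind")
    if node_id not in node_ids:
        # A syntactically valid address still fails if the node is absent.
        raise RoutingError(f"node {node_id} not found in topology")
    return node_id


nodes = {"sip0.cube0.hbm_ctrl.pe3", "sip0.cube0.sram"}
hbm_node = resolve(PhysAddr("hbm", 0, 0, hbm_offset=3 * HBM_SLICE_BYTES + 64), nodes)
```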
+
+#### D1.2. `find_m_cpu(sip, cube) -> str`
+
+Returns `f"sip{sip}.cube{cube}.m_cpu"`. Raises `RoutingError` if the node does
+not exist.
+
+#### D1.3. `find_pcie_ep(sip, io_id="io0") -> str`
+
+Returns `f"sip{sip}.{io_id}.pcie_ep"`. Raises `RoutingError` if the node does
+not exist.
+
+#### D1.4. `find_io_cpu(sip, io_id="io0") -> str`
+
+Returns `f"sip{sip}.{io_id}.io_cpu"`. Raises `RoutingError` if the node does
+not exist.
+
+#### D1.5. `find_all_pcie_eps() -> list[str]`
+
+Returns a sorted list of the PCIE_EP node ids across all SIPs, filtered with
+`endswith(".pcie_ep")`. Used when cross-SIP IPCQ enumerates every PCIE_EP.
+
+This class is the single owner of the naming rules
+(`sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`) (ADR-0015 D4). The
+topology builder creates nodes under the same naming rules, and components
+never build name strings directly; everything goes through the helper.
+
+### D2. The four adjacency graphs of `PathRouter`
+
+The constructor builds all four at once. The edge `kind` determines the
+policy:
+
+| graph             | excluded edge kinds                           | purpose                                    |
+|-------------------|-----------------------------------------------|--------------------------------------------|
+| `_adj_all`        | (none)                                        | IO_CPU/M_CPU routing, including M_CPU↔NOC command edges |
+| `_adj`            | `"command"`                                   | PE DMA / general data path                 |
+| `_adj_mcpu_dma`   | `"pe_internal"`, `"pe_to_router"`             | M_CPU DMA (bypasses the PE pipeline)       |
+| `_adj_local`      | `_UCIE_KINDS` (`ucie_internal`, `ucie_conn_to_router`, `router_to_ucie_conn`, `ucie_conn_to_noc`, `noc_to_ucie_conn`, `ucie_mesh`, `io_to_cube`, `cube_to_io`) | same-cube routing (bypasses the UCIe bus) |
+
+Each graph is a `dict[node_id, list[(neighbor, weight)]]`, with
+`edge.routing_weight_mm or edge.distance_mm` used as the weight. The split
+makes the routing influence of command edges explicit, and the separate
+`_adj_local` graph, which keeps UCIe's "0-distance bus" from being preferred
+over the mesh, is consistent with the cross-PE-slice mesh-distance
+requirement in ADR-0017 D7.
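
A minimal sketch of how one pass over the edges could populate the four graphs in the table. The `Edge` dataclass is a hypothetical stand-in for the real topology edge type, the `"mesh"` kind in the sample data is invented for illustration, and the exclusion sets are applied exactly as the table states (the real constructor may differ in detail).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Edge:  # hypothetical stand-in for the topology edge type
    src: str
    dst: str
    kind: str
    distance_mm: float
    routing_weight_mm: Optional[float] = None


_MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}
_UCIE_KINDS = {
    "ucie_internal", "ucie_conn_to_router", "router_to_ucie_conn",
    "ucie_conn_to_noc", "noc_to_ucie_conn", "ucie_mesh",
    "io_to_cube", "cube_to_io",
}


def build_adjacency(edges):
    graphs = {name: defaultdict(list)
              for name in ("adj_all", "adj", "adj_mcpu_dma", "adj_local")}
    for e in edges:
        # Same weight rule as the ADR: routing weight wins when it is set.
        entry = (e.dst, e.routing_weight_mm or e.distance_mm)
        graphs["adj_all"][e.src].append(entry)
        if e.kind != "command":
            graphs["adj"][e.src].append(entry)
        if e.kind not in _MCPU_DMA_EXCLUDE:
            graphs["adj_mcpu_dma"][e.src].append(entry)
        if e.kind not in _UCIE_KINDS:
            graphs["adj_local"][e.src].append(entry)
    return graphs


edges = [
    Edge("a", "b", "mesh", 2.0),
    Edge("a", "c", "command", 1.0),
    Edge("a", "d", "ucie_mesh", 5.0, routing_weight_mm=0.5),
    Edge("a", "e", "pe_internal", 1.0),
]
graphs = build_adjacency(edges)
```

Because the classification happens once in the constructor, each `find_*()` call only has to pick a dictionary.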
+
+### D3. The five public APIs of `PathRouter` (+ 2 backward-compat)
+
+#### D3.1. `find_path(src_pe: str, dst_node: str) -> list[str]`
+
+**PE DMA routing**. `src_pe` is a PE prefix (for example `"sip0.cube0.pe0"`);
+the function automatically appends `.pe_dma`, so the actual start node is
+`"sip0.cube0.pe0.pe_dma"`.
+
+The adjacency is chosen by whether the route is cube-local (`_same_cube`):
+
+- **same-cube** (src and dst share the `sip{S}.cube{C}.` prefix): uses
+  `_adj_local`. Blocking the UCIe shortcut makes cross-PE-slice traffic pay
+  the true mesh distance (ADR-0017 D7).
+- **cross-cube**: uses `_adj`. UCIe is naturally included as the best choice
+  for the cross-cube path.
+
+#### D3.2. `find_path_with_distance(src_pe, dst_node) -> tuple[list[str], float]`
+
+Uses the same adjacency policy as D3.1 but returns `(path, total_distance)`
+together. Used by probes and analysis tools that need the distance metric.
+
+#### D3.3. `find_mcpu_dma_path(m_cpu_id: str, dst_hbm_id: str) -> list[str]`
+
+**M_CPU DMA path**. Uses `_adj_local` when the cube is the same (the route
+stays inside the mesh) and `_adj_all` otherwise (via UCIe).
+`_MCPU_DMA_EXCLUDE` automatically excludes PE pipeline nodes, so no incorrect
+path is produced in which the M_CPU routes through a PE's internal stages.
+
+#### D3.4. `find_memory_path(src: str, dst: str) -> list[str]`
+
+A direct memory path such as `pcie_ep → io_noc → cube → router mesh →
+hbm_ctrl`. It uses `_adj_mcpu_dma`, which excludes `pe_internal` and
+`pe_to_router` edges, guaranteeing that host-issued reads/writes do not leak
+into the PE pipeline. Called directly by the probe (the H2D/D2H cases of
+ADR-0049 D1).
+
+#### D3.5. `find_node_path(src: str, dst: str) -> list[str]`
+
+A path between two arbitrary nodes. **Includes command edges** (uses
+`_adj_all`). Called by IoCpuComponent, MCpuComponent, and others that must
+traverse command-kind links such as M_CPU ↔ NOC.
+
+#### D3.6. backward-compat shims
+
+- `_dijkstra(start, goal) -> list[str]`: a thin wrapper around
+  `_run_dijkstra(self._adj, …)`.
+- `_dijkstra_with_dist(start, goal) -> tuple[list[str], float]`: the variant
+  that also returns the distance.
+
+The underscore prefix suggests an internal API, but existing tests call
+these directly. New code should use D3.1–D3.5; the two shims are deprecation
+candidates.
+
+### D4. Dijkstra algorithm: single-source shortest path
+
+`_run_dijkstra_with_dist(adj, start, goal)`:
+
+- `heapq` priority queue.
+- `best: dict[node, distance]`: cache of the shortest distance per node.
+- `prev: dict[node, predecessor]`: for path reconstruction.
+- The weight is `routing_weight_mm or distance_mm`. Some edges, such as
+  UCIe, declare a routing_weight that differs from their distance, so the
+  separate weight is intentional.
+
+`start == goal` returns the fast path `([start], 0.0)`. An unreachable goal
+raises `RoutingError(f"no path from {start} to {goal}")`.
+
+The algorithm is **deterministic**: the same graph and start/goal yield the
+same path, consistent with the SPEC R1 requirement that "Routing MUST be
+deterministic". Tie-breaks follow the `heapq` push order (Python list order
+is deterministic).
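
A minimal sketch of a Dijkstra routine with the properties listed above. It is not the code in router.py; in particular, the explicit push counter used here to make ties follow push order is an assumption about how that behavior could be obtained, and the function name is illustrative.

```python
import heapq
from itertools import count


class RoutingError(Exception):
    pass


def run_dijkstra_with_dist(adj, start, goal):
    if start == goal:
        return [start], 0.0  # fast path
    best = {start: 0.0}      # shortest known distance per node
    prev = {}                # predecessor map for path reconstruction
    seq = count()            # push counter: equal distances pop in push order
    heap = [(0.0, next(seq), start)]
    while heap:
        dist, _, node = heapq.heappop(heap)
        if node == goal:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, weight in adj.get(node, ()):
            cand = dist + weight
            if cand < best.get(nbr, float("inf")):
                best[nbr] = cand
                prev[nbr] = node
                heapq.heappush(heap, (cand, next(seq), nbr))
    if goal not in best:
        raise RoutingError(f"no path from {start} to {goal}")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path, best[goal]


adj = {"a": [("b", 1.0), ("c", 5.0)], "b": [("c", 1.0)], "c": []}
path, dist = run_dijkstra_with_dist(adj, "a", "c")

# Two equal-cost routes: the neighbor pushed first ("x") wins the tie.
tie_adj = {"s": [("x", 1.0), ("y", 1.0)], "x": [("g", 1.0)], "y": [("g", 1.0)]}
tie_path, tie_dist = run_dijkstra_with_dist(tie_adj, "s", "g")
```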
+
+### D5. Single-owner principle for the helper API
+
+The following are decided only inside router.py:
+
+- Naming rules: `sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`,
+  `sip{S}.cube{C}.hbm_ctrl.pe{pe_id}`.
+- Adjacency policy: which edge kinds are included in which graph.
+- How a PE id is recovered from the HBM slice size.
+- How the Dijkstra weight is chosen (`routing_weight_mm or distance_mm`).
+
+If this single-owner principle breaks (for example, components start
+building `f"sip{s}..."` themselves), the blast radius of any naming change
+explodes. This aligns with the spirit of ADR-0015 D4.
+
+### D6. Consumers of the helper API
+
+Call sites of the methods this helper exposes (as of the current codebase):
+
+- `probes/probe.py` (ADR-0049): `find_pcie_ep`, `find_io_cpu`,
+  `find_m_cpu`, `find_node_path`, `find_mcpu_dma_path`,
+  `find_memory_path`, `find_path`, `resolve`.
+- `runtime_api/distributed.py` (ADR-0047): indirect (engine-internal
+  routing).
+- `ccl/install.py` (ADR-0023): `find_all_pcie_eps`, `resolve`.
+- `sim_engine/event_log.py`: like the probe, `find_pcie_ep` and
+  `find_memory_path`.
+- `components/builtin/m_cpu.py`, `components/builtin/io_cpu.py`:
+  `find_node_path`, `find_mcpu_dma_path`.
+- Various tests (test_routing.py, test_cross_sip_routing.py, etc.): most of
+  D3.1–D3.5.
+
+When a new consumer is added, D1/D3 of this ADR are the first reference for
+deciding whether a method matching its intent already exists or a new one
+must be added.
+
+## Alternatives Considered
+
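+### A1. A single adjacency graph with edge-kind filtering applied dynamically
+
+Rejected. Re-filtering the graph on every `find_*()` call hurts Dijkstra's
+cache locality and performance. Building the four graphs up front (D2) costs
+little memory (edges number in the tens of thousands at most), and policy
+selection at call time is O(1).
+
+### A2. Split adjacency by separate metadata rather than the edge `kind`
+
+Rejected. The edge `kind` is already assigned by the topology builder
+(ADR-0015 D4 + ADR-0017); introducing separate metadata would create
+duplication in which two systems must be kept in sync.
+
+### A3. BFS with uniform weights instead of Dijkstra
+
+Rejected. Since routing_weight_mm differs per edge (mesh link / UCIe /
+IO-internal), BFS only minimizes hop count and does not guarantee the
+shortest latency or distance. That would violate the deterministic, accurate
+routing required by SPEC R1 + R2.
+
+### A4. Expose the helper API as module functions rather than class methods
+
+Rejected. The two classes (`AddressResolver`, `PathRouter`) each need to
+hold cached state (`_node_ids`, `_hbm_slice_bytes`, the four adjacency
+graphs), and many routing queries hit the same graph instance. Module
+functions would have to rebuild the state on every call or keep it global,
+degrading safety and performance.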
+
+## Consequences
+
+- As long as components, probes, IPCQ install, and the runtime API call only
+  the helpers in router.py, a naming change (for example `.io0.` →
+  `.iochiplet0.`) is a single-file edit (D5).
+- With the four-graph split of D2 fixed in an ADR, there is a clear basis
+  for deciding which graphs a new edge kind (for example a new kind for an
+  inter-die UCIe link) belongs in.
+- The cube-local vs cross-cube branch in D3.1 (ADR-0017 D7) is explicit, so
+  anyone changing routing behavior knows which adjacency to touch.
+- The consumer list in D6 makes the PR review scope clear when the helper
+  API changes, and the backward-compat shims (D3.6) are identified as
+  deprecation candidates.
@@ -0,0 +1,352 @@
+# ADR-0052: OpLog + MemoryStore Schemas — sim_engine internals
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies the `OpRecord` schema in `sim_engine/op_log.py`, the
+behavior of `OpLogger`'s record_start / record_end / record_copy, and the
+(space, addr) address-space namespace and read/write semantics used by
+`MemoryStore` in `sim_engine/memory_store.py`. ADR-0020 (2-pass data
+execution) declares that both pieces of infrastructure exist, but the **exact
+fields and meaning of the records** were never documented at the ADR level,
+even though several ADRs, including ADR-0046 D3.2 (`tl.store` visibility)
+and ADR-0023 D9 (IPCQ copy record), depend on their behavior.
+
+## First action
+
+### `OpLogger(memory_store=None)`
+
+On construction it initializes three fields:
+
+1. `self._records: list[OpRecord] = []`: the accumulated op records.
+2. `self._pending: dict[int, dict] = {}`: partial records keyed by `id(msg)`
+   (created at record_start, with record_end still outstanding).
+3. `self._memory_store = memory_store`: an optional MemoryStore reference,
+   used to capture input snapshots for math ops and HBM source snapshots for
+   dma_write.
+
+At construction both records and pending are empty; `record_*` calls
+accumulate data sequentially.
+
+### `MemoryStore()`
+
+On construction it initializes a single field,
+`self._storage: dict[str, dict[int, np.ndarray]] = {}`. It is a two-level
+dict (`space → addr → ndarray`) whose inner dicts are filled lazily as each
+space is first needed.
+
+In short, **the first thing both pieces of infrastructure do is create an
+empty accumulation buffer and a per-space sparse dict**. The fields start
+filling only when the first record or write actually arrives.
+
+## Context
+
+D2/D5/D7 of ADR-0020 (2-pass data execution) declare the following:
+
+- During Phase 1 (timing), the `ComponentBase._on_process_start/end` hooks
+  call `OpLogger.record_start/end` to record the timing and metadata of
+  every data op.
+- Phase 2 (data) replays the op_log in t_start order to compute the actual
+  data results.
+- The data payload itself is kept in `MemoryStore` under (space, addr) keys.
+
+Later ADRs such as ADR-0023 D9 (IPCQ atomic write), ADR-0027 (avoiding
+Megatron TP scratch overwrite), and ADR-0046 D3.2 (`tl.store` visibility)
+depend on the behavior of op_log and MemoryStore, yet the **exact record
+fields, space names, and snapshot timing** can only be confirmed by grepping
+the code. This ADR documents them.
+
+## Decision
+
+### D1. `OpRecord` schema: seven fields
+
+```python
+@dataclass
+class OpRecord:
+    t_start: float
+    t_end: float
+    component_id: str
+    op_kind: str               # "memory" | "gemm" | "math" | "unknown"
+    op_name: str               # e.g. "dma_read", "gemm_f16", "exp",
+                               #     "TileToken/DMA_READ", "composite_gemm",
+                               #     "ipcq_copy"
+    params: dict[str, Any]
+    dependency_ids: list[int] = field(default_factory=list)
+```
+
+- **`t_start` / `t_end`**: SimPy time (float ns). `t_start` is when the
+  component started the op and `t_end` is when it completed;
+  duration = `t_end - t_start`.
+- **`component_id`**: the node id where the op occurred (for example
+  `"sip0.cube0.pe0.pe_dma"`).
+- **`op_kind`**: one of four values. The Phase 2 DataExecutor branches on it.
+- **`op_name`**: a human-friendly name for debugging and analysis. For a
+  TileToken it is `"TileToken/{stage_type}"` (for example
+  `"TileToken/DMA_READ"`), which distinguishes the stage.
+- **`params`**: an op-specific metadata dict (see D3).
+- **`dependency_ids`**: currently unused (default `[]`). Reserved for when
+  cross-op dependency tracking becomes necessary.
+
+### D2. `OpLogger.records`: guaranteed t_start ordering
+
+```python
+@property
+def records(self) -> list[OpRecord]:
+    self._records.sort(key=lambda r: r.t_start)
+    return self._records
+```
+
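+Every access performs a stable sort by `t_start`, so records with the same
+t_start keep their insertion order. This matches the "t_start stable
+ordering" requirement of ADR-0020 D5.
+
+The Phase 2 DataExecutor always goes through the `records` property, so even
+when record_end calls arrive in an order different from t_start (for example
+a short op starts later than a long op but finishes first), the records are
+re-sorted and the executor sees a consistent sequence.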
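
The ordering guarantee can be illustrated with a small stand-alone sketch. The `append` method is a hypothetical shortcut for this example only (the real logger adds records through `record_end` and `record_copy`), and the op names are invented.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str
    op_name: str
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)


class OpLogger:
    def __init__(self) -> None:
        self._records: list[OpRecord] = []

    def append(self, rec: OpRecord) -> None:
        # Hypothetical helper: records arrive in completion order.
        self._records.append(rec)

    @property
    def records(self) -> list[OpRecord]:
        # list.sort is stable, so equal t_start keeps insertion order.
        self._records.sort(key=lambda r: r.t_start)
        return self._records


log = OpLogger()
log.append(OpRecord(2.0, 3.0, "pe0", "math", "short_op", {}))   # ends first
log.append(OpRecord(0.0, 10.0, "pe0", "gemm", "long_op", {}))   # started first
log.append(OpRecord(2.0, 4.0, "pe1", "math", "tie_op", {}))     # same t_start as short_op
order = [r.op_name for r in log.records]
```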
+
+### D3. `params` schema per op_name (`_extract_op_info` mapping)
+
+#### D3.1. `op_kind="memory", op_name="dma_read"` (DmaReadCmd)
+
+```python
+{"src_addr": int, "nbytes": int, "handle_id": str}
+```
+
+#### D3.2. `op_kind="memory", op_name="dma_write"` (DmaWriteCmd)
+
+```python
+{
+    "src_space": str,   # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
+    "src_addr": int,    # handle.addr
+    "shape": tuple, "dtype": str,
+    "dst_space": "hbm", # DmaWrite always targets HBM
+    "dst_addr": int,
+    "nbytes": int,
+    "handle_id": str,
+    # if src_space == "hbm" at record_end, a snapshot is added (D4)
+    "snapshot": np.ndarray | None,
+}
+```
+
+#### D3.3. `op_kind="gemm", op_name=f"gemm_{dtype_a}"` (GemmCmd)
+
+```python
+{
+    "src_a_addr": int, "src_b_addr": int, "dst_addr": int,
+    "shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
+    "dtype_in": str, "dtype_out": str,
+    "m": int, "k": int, "n": int,
+    # ADR-0027: preserve the per-operand and output spaces
+    "src_a_space": str, "src_b_space": str, "dst_space": str,
+}
+```
+
+#### D3.4. `op_kind="math", op_name=msg.op` (MathCmd; op = "exp", "sum", "add", "where", etc.)
+
+```python
+{
+    "input_addrs": list[int],   # addr of each input handle
+    "input_shapes": list[tuple],
+    "input_spaces": list[str],
+    "input_dtypes": list[str],
+    "dst_addr": int, "dst_space": str,
+    "shape_out": tuple, "dtype": str,
+    "axis": int | None,         # meaningful only for reductions
+    # at record_end, snapshots of all inputs are filled in (D4)
+    "input_snapshots": list[np.ndarray | None],
+}
+```
+
+#### D3.5. `op_kind="gemm" or "math", op_name=f"composite_{op}"` (CompositeCmd)
+
+```python
+{
+    "op": str,              # "gemm" | "math"
+    "out_addr": int, "out_nbytes": int,
+    # when op == "gemm", the same fields as GemmCmd are added:
+    "src_a_addr": int, "src_b_addr": int,
+    "shape_a": tuple, "shape_b": tuple,
+    "dtype_in": str, "dtype_out": str,
+    "src_a_space": str, "src_b_space": str,
+    "dst_space": "hbm", "dst_addr": int,  # = out_addr
+}
+```
+
+If `op == "gemm"` then `op_kind = "gemm"`; otherwise it is `"math"`. The
+alias lets Phase 2 replay it on the same path as GemmCmd.
+
+#### D3.6. `op_kind="memory", op_name="ipcq_copy"` (record_copy-only path)
+
+```python
+{
+    "src_space": str, "src_addr": int,
+    "dst_space": str, "dst_addr": int,
+    "shape": tuple, "dtype": str, "nbytes": int,
+    "snapshot": np.ndarray | None,   # passed by the caller; if absent, record_copy does a fresh read
+}
+```
+
+`PE_DMA._handle_ipcq_inbound` (ADR-0023 D9) emits this record so that Phase 2
+can replay the inbound copy of the IPCQ slot. The record does not go through
+`record_start` / `record_end`; it is pushed directly via `record_copy()`.
+
+#### D3.7. `op_kind="unknown", op_name=type(msg).__name__`
+
+Fallback for messages that `_extract_op_info` does not recognize. params =
+`{}`. The DataExecutor skips this op_kind, so it has no effect on Phase 2
+replay.
+
+### D4. Snapshot capture timing
+
+When `OpLogger._memory_store` is set, record_end does the following:
+
+- **math op**: calls `self._memory_store.read(...)` with each input's
+  addr/shape/space/dtype and attaches ndarray copies to
+  `params["input_snapshots"]`. A failed read yields None.
+- **dma_write op**: attaches a snapshot of the source HBM to
+  `params["snapshot"]` only when `src_space == "hbm"`. A TCM source is
+  **explicitly skipped**: TCM (PE scratch) is refilled by Phase 2 math/gemm
+  replay, so a Phase-1-time snapshot risks capturing stale data from a
+  previous kernel (ADR-0027 postmortem: TP gemm → all_reduce race).
+- **ipcq_copy**: the `record_copy` caller passes an in-flight snapshot such
+  as `snapshot=token.data`. If none is given, record_copy attempts a fresh
+  read instead.
+
+Snapshots are taken with `.copy()` (`ndarray.copy()` is a fresh allocation),
+so they are safe from later storage mutation. This is the foundation for
+avoiding the "cross-PE Phase 2 ordering" race of ADR-0027.
+
+When `memory_store` is None (Phase 1 timing-only mode), the snapshot step is
+skipped entirely. Only the timing information of each record is kept, and
+data replay is not possible.
+
+### D5. TileToken handling: record_start captures the stage information
+
+For the self-routing tile token of ADR-0014 D6 (pipeline mode), stage_idx
+may already have advanced by the time record_end runs (the TileToken caches
+the next stage's params as it moves to the next component). Therefore:
+
+`record_start` stores the following in `pending[id(msg)]["snap"]` ahead of
+time:
+
+```python
+snap["stage_type"] = stage.stage_type.name        # "DMA_READ", "GEMM", etc.
+snap["stage_params"] = dict(stage.params)         # copy of the params at this point
+```
+
+`record_end` pulls this snap out and merges it into params:
+
+- `params["stage_type"]` is added to the final params.
+- The keys of `stage_params` are merged in (keys already present are
+  preserved).
+- If `op_name == "TileToken"`, `op_name` is rewritten to
+  `f"TileToken/{stage_type}"` (for example `"TileToken/DMA_READ"`), which
+  disambiguates records of different stages emitted by the same component.
+
+Thanks to this mechanism, DMA_READ vs DMA_WRITE and FETCH vs STORE remain
+distinguishable on the reporting side even when they occur on the same
+component (for example pe_dma).
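
The capture-then-merge behavior can be sketched with two free functions. These are simplified stand-ins, not the real `OpLogger` methods: the function signatures, the `Token` class, and the sample params are all assumptions made for illustration.

```python
def record_start(pending: dict, msg: object, stage_type: str, stage_params: dict) -> None:
    # Copy the stage information now, before the token advances.
    pending[id(msg)] = {
        "snap": {"stage_type": stage_type, "stage_params": dict(stage_params)}
    }


def record_end(pending: dict, msg: object, op_name: str, params: dict):
    snap = pending.pop(id(msg))["snap"]
    merged = dict(params)
    merged["stage_type"] = snap["stage_type"]
    for key, value in snap["stage_params"].items():
        merged.setdefault(key, value)  # keys already present are preserved
    if op_name == "TileToken":
        op_name = f"TileToken/{snap['stage_type']}"
    return op_name, merged


class Token:  # hypothetical stand-in for a TileToken
    pass


pending: dict = {}
tok = Token()
live_params = {"nbytes": 64}
record_start(pending, tok, "DMA_READ", live_params)
live_params["nbytes"] = 128  # the token moves on and its params change
final_name, final_params = record_end(pending, tok, "TileToken", {"handle_id": "h0"})
```

The record keeps `nbytes=64` because the snapshot was taken at `record_start`, not at `record_end`.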
+
+### D6. `MemoryStore`: a two-level (space, addr) dict
+
+```python
+class MemoryStore:
+    def __init__(self) -> None:
+        self._storage: dict[str, dict[int, np.ndarray]] = {}
+
+    def write(self, space, addr, data): self._storage.setdefault(space, {})[addr] = data
+    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
+    def has(self, space, addr) -> bool: ...
+    def snapshot(self) -> MemoryStore: ...
+```
+
+#### D6.1. space namespace
+
+String keys. Standard values:
+
+- `"hbm"`: HBM data (deploy_tensor + Phase 2 dma_write results).
+- `"tcm"`: PE-local TCM (Phase 2 math/gemm results).
+- `"sram"`: cube-level SRAM (the IPCQ slot tier of ADR-0023 D9.7).
+
+Other spaces (for example `"reg"`) are freely allowed: `_storage` is a lazy
+dict, so a new space is created automatically on its first write.
+
+#### D6.2. address keying
+
+`addr` is an integer. It may be a **physical address (PA) or a virtual
+address (VA)**: MemoryStore itself does not know the meaning of the address
+space and simply uses the value as a key. Phase 1's `MemoryWriteMsg` writes
+under both PA and VA (`_create_tensor` zero-inits at the PA and also at the
+VA base), while Phase 2 reads and writes at the addresses captured in the
+op_log.
+
+The caller decides what `addr` means; `MemoryStore` only provides lookup.
+
+#### D6.3. read/write semantics: reference store (no copy)
+
+`write(space, addr, data)`: stores a reference to the `data` ndarray. **It
+does not copy**. If the caller later mutates the same ndarray, the stored
+value changes too.
+
+`read(space, addr, shape=None, dtype=None)`: returns a reference to the
+stored ndarray. When `shape` or `dtype` is given:
+
+- `dtype != stored.dtype`: reinterpret cast via `arr.view(np_dtype)` (no
+  copy).
+- `shape != stored.shape`: `arr.reshape(shape)` (a view) if `nbytes` matches.
+- `nbytes` mismatch: `ValueError`.
+
+To isolate data safely, the caller must call `arr.copy()`. This is why the
+race avoidance of ADR-0027 enforces an explicit copy at the op_log snapshot
+step.
+
+#### D6.4. `has(space, addr) -> bool`
+
+Only checks whether the key exists. It does not instantiate any data.
+
+#### D6.5. `snapshot() -> MemoryStore`
+
+Shallow copy: new inner dict instances are created, but the ndarray
+references are shared. Used at Phase 2 initialization to fork the Phase 1
+store so that Phase 2 mutations do not affect other users of the Phase 1
+store.
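
The reference-store and shallow-snapshot semantics can be demonstrated with a simplified model. To keep the sketch dependency-free it uses `bytearray` payloads in place of `np.ndarray` and omits the `shape`/`dtype` handling of `read`; both simplifications are assumptions of this example, not properties of the real class.

```python
class MemoryStore:
    def __init__(self) -> None:
        self._storage: dict = {}

    def write(self, space: str, addr: int, data) -> None:
        # Stores the reference itself; no copy is made.
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space: str, addr: int):
        return self._storage[space][addr]

    def has(self, space: str, addr: int) -> bool:
        return addr in self._storage.get(space, {})

    def snapshot(self) -> "MemoryStore":
        # Shallow fork: new inner dicts, shared payload objects.
        fork = MemoryStore()
        fork._storage = {space: dict(inner) for space, inner in self._storage.items()}
        return fork


store = MemoryStore()
buf = bytearray(b"\x01\x02")
store.write("hbm", 0x100, buf)
store.write("tcm", 0x0, bytearray(b"\xff"))
buf[0] = 9  # mutating the caller's buffer changes the stored value too

fork = store.snapshot()
fork.write("hbm", 0x100, bytearray(b"\x00\x00"))  # rebinding in the fork only
```

Rebinding a key in the fork leaves the original store untouched, while payloads that were not rebound remain the same shared objects.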
+
+### D7. op_log assumes single-threaded SimPy
+
+`OpLogger`'s `_records` and `_pending` are used without locks, on the
+assumption that SimPy is single-threaded and no other thread can interleave
+between `record_start` and `record_end`.
+
+If multi-process kernbench (ADR-0047 D6) is introduced, OpLogger must be
+split per process; a single OpLogger instance cannot receive records from
+multiple processes.
+
+## Alternatives Considered
+
+### A1. Move op_log to an external store such as SQLite / parquet
+
+Rejected (for now). An in-memory list minimizes the Phase 1 → Phase 2
+handoff latency. Externalizing may make sense for long-running batch runs,
+but for the current single-run workloads it only adds overhead.
+
+### A2. Capture snapshots at record_start
+
+Rejected. At record_start the inputs are often not yet filled (for example
+when a math op's input is the output of the immediately preceding op).
+record_end is the correct point.
+
+### A3. Split MemoryStore into per-component stores
+
+Rejected. The (space, addr) key already disambiguates sufficiently, and a
+per-component split would complicate cross-PE IPCQ copy (ADR-0023 D9), which
+has to access both the source and destination stores.
+
+### A4. Record explicit cross-op dependency edges in op_log
+
+Partially adopted. The `dependency_ids` field exists on OpRecord but is
+currently unused (D1). The Phase 2 DataExecutor determines ordering by the
+t_start sort plus a secondary sort (memory ops before math at the same
+t_start); if an explicit dependency graph becomes necessary, this field is
+where it would go. The ordering rule is sufficient for now, so the field
+stays unused.
+
+## Consequences
+
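+- The op_log / MemoryStore declaration in ADR-0020 is expanded into the
+  concrete schema of D1–D6, so exact field semantics are available from the
+  ADR without grepping when writing or modifying the Phase 2 DataExecutor.
+- The per-op_name params schema in D3 makes clear where to add a branch in
+  `_extract_op_info` when introducing a new op (for example a new reduction
+  type).
+- The difference in snapshot behavior in D4 (math = input snapshots,
+  dma_write = HBM-only snapshot) is fixed in an ADR, so the cross-PE race
+  avoidance decision of ADR-0027 will not silently break in a future
+  refactor.
+- The reference-store semantics of D6.3 are explicit, so callers are aware
+  of their responsibility for mutation safety. This justifies the explicit
+  `.copy()` pattern of ADR-0027.
+- The single-thread assumption of D7 is explicit, making clear that
+  OpLogger must be split when multi-process kernbench (a candidate to
+  supersede ADR-0047 D6) is introduced.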
@@ -0,0 +1,307 @@
+# ADR-0053: Topology Builder + Visualizer Algorithms
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies the core algorithmic choices (placement-driven router
+attachment, mesh auto-layout, source_hash cache, view projection, SVG
+rendering) of the topology compile and visualization pipeline carried out
+jointly by `topology/builder.py`, `topology/mesh_gen.py`, and
+`topology/visualizer.py`. ADR-0006 defines the high-level intent of topology
+compilation (compiled topology, distance extraction, automatic diagram
+generation), but **which algorithms the builder actually uses** could only
+be confirmed by grepping the code.
+
+## First action
+
+When `resolve_topology(path_str)` is called, four steps run in order:
+
+1. **Path validation** (`builder.py::resolve_topology`):
+   `Path(path_str).expanduser().resolve()`, then checks that the path exists
+   and is a file. Raises `FileNotFoundError` or `ValueError` on failure.
+2. **YAML parsing** (`_read_spec`): `yaml.safe_load`. A parse error raises a
+   `ValueError` that includes line/column information. Anything other than
+   a dict is rejected.
+3. **Mesh auto-generation** (`mesh_gen.ensure_mesh_file`): creates
+   `cube_mesh.yaml` in the same directory as the topology yaml (when the
+   cache is invalid) or reuses it (on a cache hit). This step determines the
+   cube NoC router grid and the attachment information.
+4. **Graph compilation** (`_compile_graph`): accumulates nodes/edges in the
+   order system → IO chiplets → cubes → inter-cube edges → IO↔cube edges →
+   system↔IO edges, then builds the four view projections (system, sip,
+   cube, pe) and bundles everything into a `TopologyGraph`.
+
+In short, **the first thing topology compile does is read topology.yaml
+into a dict, generate or validate cube_mesh.yaml in the same directory, and
+then build the flat graph plus the 4-view projection in system→sip→cube→pe
+order**.
+
+## Context
+
+Responsibilities of the `topology/` package:
+
+- **builder.py** (1207 lines): compiles topology.yaml into a
+  `TopologyGraph` (nodes + edges + 4 view projections).
+- **mesh_gen.py** (305 lines): automatically determines the cube NoC router
+  grid and the PE/UCIe/M_CPU/SRAM attachment positions, and caches them as
+  `cube_mesh.yaml`.
+- **visualizer.py** (887 lines): generates four SVG diagrams (system / sip /
+  cube / pe) from the `TopologyGraph`.
+
+ADR-0006 defines the high-level decision that "the result of topology
+compilation is the single source for distance metadata and diagram
+generation", but the concrete algorithms (for example placement-driven
+nearest-router attachment, computing the HBM exclusion zone, and which
+source_hash fields trigger invalidation) are not in any ADR.
+
+In particular, the following decisions are missing at the ADR level:
+
+- Why is the mesh_gen output cached in a separate file (`cube_mesh.yaml`)?
+- Which fields does source_hash cover, and which changes force regeneration?
+- Why are placement coordinates in mm rather than cube coordinates?
+- How are the HBM exclusion zone and the UCIe N/S/E/W distribution decided
+  within the mesh?
+- How do the abstraction levels of the four view projections
+  (system/sip/cube/pe) differ?
+
+This ADR collects these decisions in one place.
+
+## Decision
+
+### D1. Compile pipeline: six stages
+
+`_compile_graph(spec)`:
+
+1. **Create system nodes** (`_instantiate_system`): adds system-level nodes
+   such as `fabric.switch0` and the host CPU.
+2. **Per-SIP loop** (`for sip_id in range(system.sips.count)`):
+   - **IO chiplets** (`_instantiate_io_chiplets`): creates the pcie_ep /
+     io_cpu / io_noc / io_ucie PHY / conn nodes plus their internal
+     bidirectional edges.
+   - **Cube instantiation** (`_instantiate_cube`): based on the router grid
+     in cube_mesh.yaml, instantiates the per-cube routers, the PE
+     sub-components (pe_cpu, pe_dma, pe_fetch_store, pe_gemm, pe_math,
+     pe_mmu, pe_tcm, pe_scheduler, pe_ipcq), m_cpu, sram, and hbm_ctrl, and
+     lays down the internal edges.
+   - **Inter-cube edges** (`_add_inter_cube_edges`): UCIe N/S/E/W mesh
+     edges.
+   - **IO ↔ cube edges** (`_add_io_to_cube_edges`): connects io_noc to the
+     cube's edge UCIe PHYs.
+3. **Switch ↔ IO edges** (`_add_system_to_io_edges`): bidirectional edges
+   between `fabric.switch0` and each SIP's `pcie_ep` (ADR-0038 D3 + the
+   cross-SIP IPCQ path of ADR-0010).
+4. **Build the four view projections**:
+   - `_build_system_view(spec)`: tray level, the SIPs and the system switch.
+   - `_build_sip_view(spec)`: the cube mesh + IO chiplets inside a SIP.
+   - `_build_cube_view(spec)`: the router grid + PE/M_CPU/SRAM/HBM_CTRL
+     attachments inside a single cube.
+   - `_build_pe_view(spec)`: the 9 sub-components + internal edges inside a
+     single PE.
+5. **Return the TopologyGraph**: `TopologyGraph(spec, nodes, edges,
+   system_view, sip_view, cube_view, pe_view)`.
+
+**The order of these six stages matters**: inter-cube edges have a valid
+src/dst only after the cubes exist, and the IO chiplets must be created
+first for the IO ↔ cube edges to reference them. When inserting a new kind
+of node, place it according to these dependencies.
+
+### D2. `cube_mesh.yaml`: a separate file with a source_hash cache
+
+`mesh_gen.ensure_mesh_file(cube_spec, mesh_path)`:
+
+1. Computes `source_hash = _compute_source_hash(cube_spec)`. Input fields:
+   - `geometry` (cube_mm.w/h, etc.).
+   - `pe_layout` (corners, pe_per_corner).
+   - `ucie.n_connections`.
+   - `memory_map.hbm_mapping_mode`.
+   - `placement` (m_cpu/sram pos_mm).
+2. If `mesh_path` (= `cube_mesh.yaml` in the same directory as
+   `topology.yaml`) exists and `existing.source_hash == source_hash`, it is
+   reused (cache hit).
+3. Otherwise `_generate_mesh(cube_spec, source_hash)` generates a new mesh
+   and saves it as yaml.
+
+Reasons for caching in a separate file:
+
+- Mesh generation involves PE/UCIe/router attachment computation and is
+  too heavy to redo every time.
+- Repeated runs with the same cube spec must be guaranteed the same mesh.
+- It becomes an artifact that a person can inspect and debug directly.
+
+The five fields listed for `source_hash` are what determine the mesh shape;
+changes to anything else (for example bandwidth, overhead_ns) do not trigger
+mesh regeneration.
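
One way such a field-scoped hash could work is sketched here. The choice of SHA-256 over a sorted-key JSON dump is an assumption of this example (the ADR does not say how `_compute_source_hash` serializes or hashes its inputs), and the sample spec values are invented; only the five field paths come from the list above.

```python
import copy
import hashlib
import json

# The five inputs named in the ADR, as key paths into the cube spec.
_HASH_FIELDS = (
    ("geometry",),
    ("pe_layout",),
    ("ucie", "n_connections"),
    ("memory_map", "hbm_mapping_mode"),
    ("placement",),
)


def compute_source_hash(cube_spec: dict) -> str:
    picked = {}
    for path in _HASH_FIELDS:
        node = cube_spec
        for key in path:
            node = node.get(key) if isinstance(node, dict) else None
        picked[".".join(path)] = node
    # Sorted keys make the serialization independent of dict order.
    blob = json.dumps(picked, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


base_spec = {
    "geometry": {"cube_mm": {"w": 12.0, "h": 12.0}},
    "pe_layout": {"corners": ["NW", "NE", "SW", "SE"], "pe_per_corner": 2},
    "ucie": {"n_connections": 2, "bandwidth": 64},
    "memory_map": {"hbm_mapping_mode": "per_pe"},
    "placement": {"m_cpu": {"pos_mm": [1.5, 5.5]}, "sram": {"pos_mm": [1.5, 8.5]}},
}
base_hash = compute_source_hash(base_spec)

bandwidth_only = copy.deepcopy(base_spec)
bandwidth_only["ucie"]["bandwidth"] = 128      # not a hashed field

more_links = copy.deepcopy(base_spec)
more_links["ucie"]["n_connections"] = 4        # a hashed field
```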
+
+### D3. cube NoC mesh auto-layout algorithm
+
+`_generate_mesh(cube_spec, source_hash)`:
+
+#### D3.1. Determining rows/columns
+
+- `pe_positions = _corner_pe_positions(cube_w, cube_h)`: PE center
+  coordinates (mm) for each of the 4 corners (NW/NE/SW/SE). Hardcoded
+  `(1.5, 1.5)` / `(cube_w-1.5, cube_h-1.5)` pattern; with
+  `pe_per_corner=2`, each corner holds 2 PE positions.
+- `col_xs = _compute_col_positions(...)`: the union of the PE x
+  coordinates, plus relay columns inserted into any gap larger than
+  `max_spacing = 3.0 mm`.
+- `row_ys, rows_per_half = _compute_row_positions(cube_h, n_connections,
+  pe_positions)`:
+  - `n_conn = max(n_connections, 2)` (hot path minimum).
+  - `rows_per_half = ceil(n_conn / 2)`.
+  - Top half + two HBM rows + bottom half. The HBM rows sit at
+    `(cube_h/2 - 1.5, cube_h/2 + 1.5)`. Between the PE rows and the HBM
+    rows, `hbm_gap = 1.5 mm`.
+
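A small sketch of the row/column arithmetic in D3.1, under stated assumptions: the function names below are illustrative (they are not the private helpers themselves), and relay columns are spaced evenly inside a wide gap, which D3.1 does not specify.

```python
import math


def rows_per_half(n_connections: int) -> int:
    """Rows in each PE half, per the two rules quoted in D3.1."""
    n_conn = max(n_connections, 2)
    return math.ceil(n_conn / 2)


def hbm_row_ys(cube_h: float) -> tuple:
    """y positions (mm) of the two HBM rows around the cube's vertical center."""
    return (cube_h / 2 - 1.5, cube_h / 2 + 1.5)


def compute_col_positions(pe_xs, max_spacing: float = 3.0) -> list:
    """Union of PE x coordinates plus relay columns in gaps wider than max_spacing.

    Assumption: relays are spaced evenly inside a gap; the real helper may
    place them differently.
    """
    xs = sorted(set(pe_xs))
    cols = [xs[0]]
    for x in xs[1:]:
        start = cols[-1]
        gap = x - start
        if gap > max_spacing:
            n_seg = math.ceil(gap / max_spacing)  # segments no wider than max_spacing
            step = gap / n_seg
            cols.extend(start + step * k for k in range(1, n_seg))
        cols.append(x)
    return cols
```

For PE columns at 1.5 mm and 10.5 mm, the 9 mm gap is split into three 3 mm segments, giving two relay columns between them.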
+#### D3.2. HBM exclusion zone
+
+`hbm_row_start = rows_per_half`, `hbm_row_end = rows_per_half + 1`.
+`hbm_col_start = n_cols // 2 - 1`, `hbm_col_end = n_cols // 2`.
+
+Router slots inside this (row, col) rectangle are marked `None` (no
+router). The actual HBM controllers are attached as separate
+`hbm_ctrl.pe{X}` nodes, following the per-PE partition pattern of
+ADR-0017 D9.
+
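The exclusion rectangle of D3.2 can be illustrated with a short sketch. It assumes the start/end indices are inclusive bounds and uses the `r{R}c{C}` router naming from D4; `build_router_grid` is a hypothetical name.

```python
def build_router_grid(n_rows: int, n_cols: int, rows_per_half: int) -> list:
    """Return an n_rows x n_cols grid of router names with the HBM zone set to None.

    Assumption: the start/end indices from D3.2 are inclusive bounds.
    """
    hbm_row_start, hbm_row_end = rows_per_half, rows_per_half + 1
    hbm_col_start, hbm_col_end = n_cols // 2 - 1, n_cols // 2
    grid = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            in_hbm = (hbm_row_start <= r <= hbm_row_end
                      and hbm_col_start <= c <= hbm_col_end)
            # None marks a slot with no router (occupied by the HBM block).
            row.append(None if in_hbm else f"r{r}c{c}")
        grid.append(row)
    return grid
```

With 6 rows, 4 columns, and `rows_per_half = 2`, the four slots at rows 2..3 and columns 1..2 end up as `None`.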
+#### D3.3. PE attachment
+
+The PEs of each corner map to the following rows:
+
+- Top half: NW → row 0, NE → row 1 (index within top_corners).
+- Bottom half: SW → row `hbm_row_end + 1`, SE → row `hbm_row_end + 2`.
+
+Each PE attaches to the router in the column nearest to its x coordinate
+(`min(range(n_cols), key=lambda c: abs(col_xs[c] - pe_x))`). Three entries
+are attached, `pe{pe_idx}.dma`, `pe{pe_idx}.cpu`, and `pe{pe_idx}.hbm`
+(pushed onto the per-router attach list).
+
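A sketch of the nearest-column attachment in D3.3. The `attach` dictionary keyed by `(row, col)` and the function name `attach_pe` are assumptions made for illustration; only the `min(...)` expression and the three attachment names come from the text.

```python
def attach_pe(col_xs, pe_idx: int, pe_x: float, row: int, attach: dict) -> int:
    """Attach one PE's three endpoints to the router in the nearest column."""
    n_cols = len(col_xs)
    # min() returns the first minimum, so ties resolve to the lower column index.
    col = min(range(n_cols), key=lambda c: abs(col_xs[c] - pe_x))
    attach.setdefault((row, col), []).extend(
        [f"pe{pe_idx}.dma", f"pe{pe_idx}.cpu", f"pe{pe_idx}.hbm"]
    )
    return col
```

Note the tie-break: a PE exactly halfway between two columns lands on the lower-indexed one in this sketch.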
+#### D3.4. M_CPU / SRAM attachment: nearest router by Euclidean distance
+
+From the coordinates `placement.m_cpu.pos_mm` (default `[1.5, 5.5]`) and
+`placement.sram.pos_mm` (default `[1.5, 8.5]`), find the nearest router by
+Euclidean distance and add `"m_cpu"` / `"sram"` to its attach list.
+
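The nearest-router lookup of D3.4 reduces to a `min` over Euclidean distances. In this sketch `router_pos` (a mapping from `(row, col)` to mm coordinates) and `nearest_router` are hypothetical names; the mapping is assumed to already exclude the `None` slots of the HBM zone.

```python
import math


def nearest_router(router_pos: dict, pos_mm) -> tuple:
    """Return the (row, col) key whose router center is closest to pos_mm.

    router_pos maps (row, col) -> (x_mm, y_mm) and should already exclude
    the None slots of the HBM exclusion zone.
    """
    return min(router_pos, key=lambda rc: math.dist(router_pos[rc], pos_mm))
```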
+#### D3.5. UCIe N/S/E/W distribution
+
+`ucie_pe_rows = top_pe_rows + bot_pe_rows` (`2 * rows_per_half` rows in
+total).
+
+- UCIe-E: `ucie_e.c{i}` on the router in the rightmost column of every PE
+  row.
+- UCIe-W: `ucie_w.c{i}` on the router in the leftmost column (mirror of E).
+- UCIe-N/S: split the PE columns into a left half and a right half and
+  attach to the matching columns of the top row / bottom row.
+
+Each UCIe connection carries a `c{i}` index, so as many PHYs as
+ucie_n_connections are spread out (ADR-0017 D5+).
+
+### D4. Node naming convention: single owner
+
+builder.py creates nodes under the following naming convention (the
+single-owner principle of ADR-0051 D5):
+
+- `fabric.switch0`: system-level switch.
+- `sip{S}.{io_id}.{pcie_ep|io_cpu|io_noc|io_ucie.{dir}|conn.{id}}`: IO
+  chiplet.
+- `sip{S}.cube{C}.{m_cpu|sram|hbm_ctrl.pe{X}|noc.r{R}c{C}|...}`: cube
+  internals (in `noc.r{R}c{C}` the trailing `{C}` is the router column, not
+  the cube index).
+- `sip{S}.cube{C}.pe{P}.{pe_cpu|pe_dma|pe_fetch_store|pe_gemm|pe_math|pe_mmu|pe_tcm|pe_scheduler|pe_ipcq}`: PE sub-components.
+
+Changing this convention requires updating both builder.py and the helpers
+in router.py (ADR-0051) together. Components do not know the naming
+convention directly; they only call the helpers.
+
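To illustrate the single-owner idea in D4, the sketch below keeps all string formatting in two helper functions. The helper names `router_node` and `pe_node` are hypothetical and are not the actual builder.py / router.py API; only the name patterns come from the list above.

```python
# The nine PE sub-component names listed in D4.
PE_SUB_COMPONENTS = frozenset({
    "pe_cpu", "pe_dma", "pe_fetch_store", "pe_gemm", "pe_math",
    "pe_mmu", "pe_tcm", "pe_scheduler", "pe_ipcq",
})


def router_node(sip: int, cube: int, row: int, col: int) -> str:
    """Node id of a cube NoC router: sip{S}.cube{C}.noc.r{R}c{C}."""
    return f"sip{sip}.cube{cube}.noc.r{row}c{col}"


def pe_node(sip: int, cube: int, pe: int, sub: str) -> str:
    """Node id of a PE sub-component; rejects names outside the D4 list."""
    if sub not in PE_SUB_COMPONENTS:
        raise ValueError(f"unknown PE sub-component: {sub!r}")
    return f"sip{sip}.cube{cube}.pe{pe}.{sub}"
```

Callers that go through helpers like these never format node ids themselves, so a naming change touches one place.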
+### D5. Edge `kind` classification
+
+The `kind` assigned to each edge is the input to the routing policy
+(ADR-0051 D2). Main kind values:
+
+- `"pe_internal"`: between sub-components inside a PE.
+- `"pe_to_router"`: PE_DMA ↔ cube NoC router.
+- `"router_mesh"`: between cube NoC routers.
+- `"router_to_hbm"`, `"router_to_mcpu"`, `"router_to_sram"`,
+  `"sram_to_router"`, etc.: between cube-attached components.
+- `"ucie_internal"`, `"ucie_conn_to_router"`, `"router_to_ucie_conn"`,
+  `"ucie_conn_to_noc"`, `"noc_to_ucie_conn"`, `"ucie_mesh"`: UCIe-related.
+- `"io_internal"`: inside the IO chiplet.
+- `"io_to_cube"`, `"cube_to_io"`: IO ↔ cube boundary.
+- `"pcie"`: switch ↔ pcie_ep.
+- `"command"`: control-plane only edges (M_CPU ↔ NOC, etc.; excluded from
+  the PE DMA path).
+
+When adding a new edge kind, decide which category of the 4 adjacency
+graphs in router.py (ADR-0051 D2) it belongs to; otherwise it is included
+only in `_adj_all` by default, which can produce unintended routing.
+
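One way to guard against the silent fallback described in D5 is to reject unclassified kinds when the adjacency graphs are built. The sketch below is an assumption-laden illustration: it models only two of the four graphs, the name `adj_pe_dma` is hypothetical, and `KNOWN_KINDS` contains just the kinds named in D5 (whose list ends with "etc.").

```python
from collections import defaultdict

# Kinds named in D5; the ADR's list is not exhaustive, so this set is illustrative.
KNOWN_KINDS = frozenset({
    "pe_internal", "pe_to_router", "router_mesh",
    "router_to_hbm", "router_to_mcpu", "router_to_sram", "sram_to_router",
    "ucie_internal", "ucie_conn_to_router", "router_to_ucie_conn",
    "ucie_conn_to_noc", "noc_to_ucie_conn", "ucie_mesh",
    "io_internal", "io_to_cube", "cube_to_io", "pcie", "command",
})


def build_adjacency(edges):
    """Build (adj_all, adj_pe_dma) from (src, dst, kind) triples.

    Unknown kinds raise instead of silently landing only in adj_all.
    """
    adj_all = defaultdict(list)
    adj_pe_dma = defaultdict(list)  # hypothetical name for the PE DMA graph
    for src, dst, kind in edges:
        if kind not in KNOWN_KINDS:
            raise ValueError(f"unclassified edge kind: {kind!r}")
        adj_all[src].append(dst)
        if kind != "command":  # control-plane edges are excluded from the DMA path
            adj_pe_dma[src].append(dst)
    return adj_all, adj_pe_dma
```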
+### D6. View projection: 4 abstraction levels
+
+Besides the flat (nodes + edges) form, `TopologyGraph` holds 4 view
+projections:
+
+- **system_view** (`_build_system_view`): tray level. SIP boxes +
+  `fabric.switch0`, with the PCIE links shown. High-level overview for
+  external presentations.
+- **sip_view** (`_build_sip_view`): inside one SIP. Cube mesh + IO chiplet
+  (pcie_ep + io_cpu + io_noc). UCIe N/S/E/W appear as inter-cube links.
+- **cube_view** (`_build_cube_view`): inside one cube. Router grid +
+  PE/M_CPU/SRAM/HBM_CTRL attachments + the UCIe PHY edge portion. For
+  diagnosing intra-cube routing / placement.
+- **pe_view** (`_build_pe_view`): inside one PE. 9 sub-components + internal
+  edges (pe_internal kind). For reviewing the detailed dataflow inside a PE.
+
+Views are emitted selectively from the spec, e.g.
+`visualization.emit_views: [system, sip, cube]` (ADR-0006). The pe view is
+left out of the default output, but the code is kept (for detailed
+debugging).
+
+### D7. visualizer.py: SVG diagram output
+
+`emit_diagrams(graph, out_dir)` renders every view to SVG. Key functions:
+
+- `_render_view_svg(view)`: generic view rendering (when there is no router
+  grid).
+- `_render_cube_view_svg(view, spec)`: cube view only (drawing the HBM
+  block, router grid layout, PE/M_CPU/SRAM/HBM positioning).
+- `_draw_node`, `_draw_edge`: visual representation of nodes / edges.
+- `_pick_scale`, `_compute_node_sizes`: automatic scaling.
+
+The visualizer is classified as a **derived artifact** (ADR-0006) and is
+not subject to the production check when its code changes. This is
+consistent with the "Derived Artifacts" item in CLAUDE.md.
+
+### D8. Impact scope of spec changes
+
+| spec field                            | Impact                 | mesh regenerated |
+|---------------------------------------|------------------------|------------------|
+| `system.sips.count`                   | SIP count, node count  | No               |
+| `sip.cube_mesh.w/h`                   | cube mesh shape        | No               |
+| `cube.geometry.cube_mm.w/h`           | cube size (mm)         | **Yes**          |
+| `cube.pe_layout.corners/pe_per_corner`| PE attachment position | **Yes**          |
+| `cube.ucie.n_connections`             | UCIe PHY distribution  | **Yes**          |
+| `cube.memory_map.hbm_mapping_mode`    | HBM distribution mode  | **Yes**          |
+| `cube.placement`                      | M_CPU/SRAM position    | **Yes**          |
+| `cube.memory_map.*` (excluding above) | HBM capacity / BW      | No               |
+| `*.links.*.bw_gbs`                    | edge bandwidth         | No               |
+| `*.attrs.overhead_ns`                 | component latency      | No               |
+
+The table above matches the `_compute_source_hash` inputs in D2. Changes
+that require mesh regeneration automatically invalidate the source_hash in
+`cube_mesh.yaml`.
+
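+## Alternatives Considered
+
+### A1. Regenerate the mesh on every compile, without a separate cache file
+
+Rejected. Every case that calls with the same spec multiple times (CLI run,
+probe, test) would pay the mesh generation cost again. The artifact that
+lets a person inspect the mesh would also disappear.
+
+### A2. Merge mesh generation into builder.py
+
+Rejected (for now). It is a self-contained 305-line algorithm, and the mesh
+layout decisions (placement-driven router attachment, HBM exclusion zone)
+differ from the builder's general node/edge creation responsibility. Keeping
+them separate fits the single-responsibility principle better.
+
+### A3. Express placement coordinates in cube coordinates (col/row)
+
+Rejected. mm coordinates are used consistently on both the visualization
+side (visualizer) and the mesh layout side (nearest-router computation).
+Cube coordinates are undefined until the router grid has been determined,
+so they are unsuitable as placement input.
+
+### A4. Generate view projections lazily
+
+Rejected (for now). Building the 4 views is cheap (typically < 100 ms), and
+eager generation guarantees a single source of truth through
+`TopologyGraph`.
+
+### A5. Also support visualizer output formats other than SVG (PNG/PDF)
+
+Rejected. SVG is the most flexible format: vector, text-searchable, and
+directly renderable in a browser. If PNG conversion is needed, post-process
+with a separate tool (rsvg-convert, etc.).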
+
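+## Consequences
+
+- The high-level intent of ADR-0006 is made concrete in D1–D7, and the D8
+  table gives a quick estimate of the impact of a topology change.
+- The mesh auto-layout algorithm in D3 is fixed at the ADR level, so it is
+  clear which step is affected when a new PE attachment pattern (e.g., a
+  6-zone HBM split) is introduced later.
+- The edge kind list in D5 and the view structure in D6/D7 are spelled out,
+  so a PR reviewer can see at a glance how far a new component kind reaches
+  (builder + router + visualizer).
+- The source_hash invalidation rule in D2 is explicit, so it is clear that
+  a cube_mesh.yaml left unregenerated (e.g., when only a bw value changed)
+  is normal behavior.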
@@ -0,0 +1,138 @@
+# ADR-0054: Milestone evaluation benches: self-contained sweep + figure benches
+
+## Status
+
+Accepted (2026-05-22).
+
+Amends ADR-0044 (D1/D2) and ADR-0045 (D5), and replaces the ADR-0043/0044
+arrangement in which "the logic lives in `scripts/` + `tests/`": the
+GEMM/allreduce evaluation harnesses now become self-contained **benches**
+that the user runs to regenerate all results + figures.
+
+## Context
+
+ADR-0043 (allreduce evaluation) and ADR-0044 (GEMM evaluation) split each
+harness into a **sweep** (a manual `scripts/` driver or, for allreduce, the
+parametrized tests themselves) + a **figure test** that renders committed
+data. The sweep/render logic therefore lived in `scripts/gemm_sweep.py`,
+`tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.
+
+The milestone requirement ("refactor the allreduce + GEMM evaluation so the
+user runs *one bench* to generate all results and plots") cannot be met
+with that arrangement: a bench is production code and **cannot import
+`tests/`** (ADR-0007 layer direction). The evaluation logic had to move into
+production so the bench can reach it.
+
+The chosen home is the bench module itself rather than a separate
+`kernbench.eval` package. A bench file may contain arbitrary module-level
+code, and merging the harness into the bench keeps one file per domain and
+removes one package layer.
+
+## Decision
+
+### D1. The two milestone benches own the evaluation logic
+
+- `src/kernbench/benches/milestone_1h_gemm.py`: GEMM shape×variant sweep
+  + three figure renderers (moved from `scripts/gemm_sweep.py` +
+  `tests/gemm/_gemm_plot_helpers.py`).
+- `src/kernbench/benches/milestone_1h_ccl.py`: distributed allreduce
+  driver, latency + buffer-kind sweep, topology diagram, FSIM comparison,
+  and the direct-launch parity reference (moved from
+  `tests/sccl/_allreduce_helpers.py`).
+
+Each file is the **single home** of its domain's evaluation logic.
+
+### D2. The "evaluation bench" pattern (extends ADR-0045 D5)
+
+ADR-0045 D5 pinned a bench to a single configuration (single-SIP, or the
+ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:
+
+- An **evaluation bench** may drive *multiple* configurations and render
+  figures. Instead of the outer `run_bench` engine, it builds its own
+  `GraphEngine` / `RuntimeContext` at each sweep point.
+- Since no handle is then submitted to the outer ctx, the bench submits a
+  **sentinel tensor** (`torch.zeros((1, 1), …)`) at the end to satisfy
+  `run_bench`'s "submit at least once" contract (ADR-0045 D4) so that the
+  CLI exits with 0.
+
+### D3. Output location
+
+Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`
+(user request: artifacts next to the bench). The directory holds only
+generated PNG/CSV/JSON (no `.py`/`__init__.py`), so the eager-import audit
+(ADR-0045's first action) ignores it: `pkgutil.iter_modules` does not yield
+non-package subdirectories. Like the `docs/diagrams/` artifacts, they are
+**committed** (so the figures can be viewed on the remote); re-running the
+bench regenerates them in place.
+
+### D4. Heavy GEMM sweep: fresh by default, reuse with `MILESTONE_FAST`
+
+`milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one
+shape has 2048 tiles). `MILESTONE_FAST=1` reuses the committed
+`docs/diagrams/gemm_sweep.json` and only renders (seconds). This reverses
+ADR-0044 D1/D2's "keep the heavy sweep as a manual/`slow` step": running
+the bench is the regeneration. The slow path is exercised by a
+`@pytest.mark.slow` bench test, and the fast path runs by default.
+
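The fresh-versus-reuse switch in D4 can be sketched as below. This is not the bench's actual code: `load_or_run_sweep` and its parameters are hypothetical, and the sweep is represented by a caller-supplied callback.

```python
import json
import os
from pathlib import Path


def load_or_run_sweep(sweep_json: Path, run_sweep, environ=None) -> dict:
    """Reuse the committed sweep when MILESTONE_FAST=1, else run it fresh.

    `run_sweep` is a hypothetical zero-argument callable returning a
    JSON-serializable dict; `environ` defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    if environ.get("MILESTONE_FAST") == "1":
        # Fast path: render-only runs read the previously committed data.
        return json.loads(sweep_json.read_text())
    results = run_sweep()  # slow path: the full sweep
    sweep_json.write_text(json.dumps(results))
    return results
```

Passing `environ` explicitly keeps the function testable without mutating the process environment.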
+### D5. Tests + scripts reuse via thin re-export shims (single home preserved)
+
+The existing figure tests and the `scripts/gemm_sweep.py` entry point are
+kept and now reuse the bench modules:
+
+- `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers +
+  `GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
+  `kernbench.benches.milestone_1h_gemm`.
+- `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config
+  writer, sweep constants, renderers, and disk aggregator from
+  `kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces
+  local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` /
+  `_bk_params`) and the fixture-coupled `_run_distributed` wrapper
+  (`monkeypatch.chdir` + `_drive_distributed`).
+- `scripts/gemm_sweep.py` → a thin wrapper over the bench's `run_sweep`.
+
+Tests importing the bench modules is allowed (tests sit above production,
+ADR-0007); it triggers the eager audit of the whole package, which already
+runs on every `kernbench` invocation. matplotlib stays a lazy import inside
+the renderers, so the audit's startup cost is unchanged.
+
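+### D6. Flat module naming (no `benches/` subfolder)
+
+A `benches/` subpackage named `1H_milestone…` is not possible: a name that
+starts with a digit is not a valid Python identifier, so it cannot be
+imported with a regular `import` statement. The benches are therefore the
+flat modules `milestone_1h_gemm.py` / `milestone_1h_ccl.py`, and the bench
+names are `milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, starting
+with a letter per ADR-0045 D1).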
+
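The naming constraint in D6 can be checked with `str.isidentifier()`. The mapping from kebab-case bench name to module name shown here (replace `-` with `_`) is an assumption inferred from the two name pairs in D6, and `bench_module_name` is a hypothetical helper.

```python
def bench_module_name(bench_name: str) -> str:
    """Map a kebab-case bench name to a flat module name.

    Assumption: the mapping is a plain '-' -> '_' substitution, as the
    name pairs in D6 suggest.
    """
    module = bench_name.replace("-", "_")
    if not module.isidentifier():
        # e.g. a leading digit makes the name unusable in an import statement
        raise ValueError(f"{bench_name!r} does not map to a valid module name")
    return module
```

A directory such as `1H_milestone_output` fails the same check, which is why it can hold artifacts but never be an importable package.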
+## Consequences
+
+### Positive
+
+- `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of
+  a domain's results + figures with one command, which is the milestone
+  requirement.
+- A single source for the evaluation logic (the bench), reused by tests and
+  scripts through shims; no duplication.
+- The figure tests and `scripts/gemm_sweep.py` keep working unchanged.
+
+### Negative / limitations
+
+- The two bench files are large (the CCL one mixes the distributed driver,
+  sweeps, and matplotlib drawing). A "bench" that is mostly an evaluation
+  harness is unusual, and this ADR justifies it.
+- Generated artifacts live inside the source tree
+  (`src/kernbench/benches/`) by explicit request and are committed (so the
+  figures can be viewed on the remote); re-running the bench regenerates
+  them.
+- `milestone-1h-ccl` (and the default `milestone-1h-gemm`) take minutes:
+  acceptable for an on-demand milestone artifact, not for everyday runs.
+
+## Dependencies
+
+- **ADR-0007**: layer direction (why tests may import production but a
+  bench may not import tests).
+- **ADR-0043 / ADR-0044**: the allreduce / GEMM evaluation harnesses that
+  this ADR moves into benches.
+- **ADR-0045**: bench module contract; D2 here extends its D5
+  (single-device rule) with the evaluation-bench pattern and relies on D4
+  (NO_REQUESTS) for the sentinel.
+- **ADR-0024**: the rank = SIP launcher driven by the allreduce sweep.
+
+## Open questions
+
+- Should the GEMM theoretical model constants (ADR-0044 D5) be sourced from
+  ADR-0033/0014 instead of copied? Unchanged by this ADR.
+- Should `build_overview_slides.py` consume the milestone output PNGs
+  instead of drawing the GEMM bars natively? Still open (ADR-0044 D6 /
+  Negative).
@@ -0,0 +1,175 @@
+# ADR Index
+
+Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **47**.
+
+Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.
+
+## Design Principles
+
+- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — Verification strategy and Phase 1 test plan
+- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — Latency model: assumptions and known simplifications
+
+## High-level Architecture
+
+- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — Target system hierarchy and modeling scope  _(System hierarchy (Tray / SIP / CUBE / PE))_
+- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — Runtime API and simulation engine boundaries  _(Runtime API ↔ sim_engine boundaries)_
+- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NoC and memory data path  _(IOChiplet NOC and memory data path)_
+- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — Cube NoC and HBM connectivity  _(Cube NOC and HBM connectivity)_
+
+## Detailed Architecture
+
+One subsection per component file under `src/kernbench/components/builtin/`.
+
+### forwarding
+
+- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding component (forwarding_v1)
+
+### hbm_ctrl
+
+- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM controller internal design
+
+### io_cpu
+
+- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU component model
+
+### m_cpu
+
+- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU and M_CPU.DMA component model
+
+### pcie_ep
+
+- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model
+
+### pe_cpu
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
+
+### pe_dma
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
+- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
+
+### pe_fetch_store
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
+
+### pe_gemm
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
+
+### pe_ipcq
+
+- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
+
+### pe_math
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
+
+### pe_mmu
+
+- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — component + utility dual role
+
+### pe_scheduler
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
+
+### pe_tcm
+
+- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — dual-channel BW serialization
+
+### sram
+
+- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC
+
+### tiling
+
+- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math pipeline plan builders
+
+## Implementation Decisions
+
+### Address Scheme
+
+- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51-bit physical address layout and decoding contract
+- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — Memory addressing — PA / VA / LA address model
+
+### Routing & Helper API
+
+- [ADR-0002](./ADR-0002-lat-routing-distance.md) — Routing distance, ordering, and detour rules
+- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`
+
+### Memory Semantics & Local-HBM Bandwidth
+
+- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — Memory semantics and local HBM bandwidth guarantee
+
+### Topology Compilation, Diagrams & Builder Algorithms
+
+- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — Diagram views and distance-based layout rules
+- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — Topology compilation, distance extraction, and automatic diagram generation
+- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms
+
+### Tensor Deployment and Allocation
+
+- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — Tensor deployment and allocation (host allocator, PA-first)
+
+### Kernel Execution and Host-Device Messaging
+
+- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — Kernel execution messaging and completion semantics
+- [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU message schema (PA-first, PE-tagged)
+
+### CLI Surface and Semantics
+
+- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — Command-line interface and execution semantics
+
+### Component Port/Wire Fabric Model
+
+- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — Component port/wire model and fabric routing
+
+### Two-Pass Data Execution
+
+- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-pass data execution model (timing / data separation)
+
+### 2D Grid Program Identity
+
+- [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D grid program_id semantics
+
+### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)
+
+- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
+- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — sip/num_sips fields removed
+- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
+- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
+- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`
+
+### IPCQ Direction Addressing
+
+- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching
+
+### Intercube All-Reduce
+
+- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — Inter-cube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
+
+### Evaluation Harnesses
+
+- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce evaluation harness — `tests/sccl/`
+- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM evaluation harness — `scripts/gemm_sweep.py` + `tests/gemm/`
+- [ADR-0054](./ADR-0054-eval-milestone-benches.md) — Milestone evaluation benches — self-contained sweep + figure benches
+
+### Bench Module Contract
+
+- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring
+
+### Kernel-side tl.* API (TLContext)
+
+- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract
+
+### Memory Allocator Algorithms
+
+- [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
+
+### Probe Subcommand
+
+- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
+
+### Sim-engine Op Log and Memory Store Schemas
+
+- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -2,7 +2,7 @@

 ## Status

-Proposed (Blocked on ADR-0031 — PhysAddr PE-resource extension)
+Proposed

 ## Context

@@ -340,7 +340,7 @@ encoding can be plugged in later" 약속이 이행된 것.
 | `src/kernbench/sim_engine/memory_store.py` | D3: verify whether the IPCQ buffer shares the existing space |
 | `src/kernbench/sim_engine/engine.py` | D4: IPCQ token routing uses the PhysAddr-based fabric path |
 | `src/kernbench/ccl/diagnostics.py` | D5: improve pointer_dump with PhysAddr formatting |
-| `docs/adr/ADR-0023-ipcq-pe-collective.md` | D6: D2.5 amendment note |
+| `docs/adr/ADR-0023-dev-ipcq-pe-collective.md` | D6: D2.5 amendment note |
 | `tests/test_ipcq_physaddr.py` (new) | T1 |
 | `tests/test_ipcq_alloc.py` (new) | T2 |
 | `tests/test_ccl_install_plan.py` | T3 extension |
@@ -0,0 +1,362 @@
+# ADR-0001: 51-bit Physical Address Layout & Decoding Contract
+
+## Status
+
+Accepted (Revision 2 — 2026-04-27: concrete bit layout, rack_id removal,
+Tray->SIP / SIP->DIE renaming, PE/MCPU/IOCPU sub-unit tables.
+Supersedes ADR-0031.)
+
+## Date
+
+2026-04-27 (original: 2026-02-27)
+
+## Context
+
+KernBench requires a stable, parsable physical address scheme that:
+
+- can be decoded into routing domains (SIP / die / HBM / PE-resource / IOCPU)
+- remains topology-agnostic (no hardcoded counts)
+- supports swappable policy and DI-first components
+- covers multiple SIPs, AHBM dies, and IO chiplet dies in a unified space
+
+### History
+
+- Original ADR-0001 defined a 51-bit layout with `rack_id(4) + sip_id(4) +
+  sip_seg(5) + local_offset(38)`. `rack_id` was never used in practice.
+- ADR-0031 (stub) requested PE-resource range partition but was never
+  implemented.
+
+Revision 2 removes `rack_id`, renames `sip_seg -> die_id`, and provides
+concrete sub-unit tables for PE, MCPU, CUBE_SRAM, and IOCPU resources.
+ADR-0031 is superseded.
+
+## Decision
+
+We define a **PhysAddr value object** and an **address decoding contract**
+that converts an integer address into routing domains.
+
+### D1. PhysAddr is an immutable value object
+
+- PhysAddr is immutable and comparable as a pure value.
+- Any allocator returns a **fully specified PhysAddr** (not partial metadata).
+- No global state may be required to interpret a PhysAddr.
+
+### D2. 51-bit Physical Address Layout
+
+A 51-bit physical address is adopted.
+
+#### 2.1 Top-Level Address Map
+
+```text
+[50:47] sip_id        (4)     -- 16 SIPs
+[46:42] die_id        (5)     -- 32 dies per SIP
+[41: 0] local_offset  (42)    -- 4 TB per die
+```
+
+```text
+50      47 46      42 41                      0
++---------+----------+-------------------------+
+| sip_id  | die_id   |      local_offset       |
++---------+----------+-------------------------+
+```
+
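The top-level split in 2.1 can be exercised with a few shifts and masks. The sketch below is illustrative only; `encode_top` and `decode_top` are hypothetical names, not the factory methods mentioned in the Implementation Notes.

```python
SIP_SHIFT = 47                 # sip_id occupies bits [50:47]
DIE_SHIFT = 42                 # die_id occupies bits [46:42]
SIP_MASK = 0xF                 # 4 bits
DIE_MASK = 0x1F                # 5 bits
LOCAL_MASK = (1 << 42) - 1     # local_offset occupies bits [41:0]


def encode_top(sip_id: int, die_id: int, local_offset: int) -> int:
    """Pack the three top-level fields into a 51-bit integer."""
    if not 0 <= sip_id <= SIP_MASK:
        raise ValueError("sip_id out of range")
    if not 0 <= die_id <= DIE_MASK:
        raise ValueError("die_id out of range")
    if not 0 <= local_offset <= LOCAL_MASK:
        raise ValueError("local_offset out of range")
    return (sip_id << SIP_SHIFT) | (die_id << DIE_SHIFT) | local_offset


def decode_top(addr: int) -> tuple:
    """Unpack a 51-bit integer into (sip_id, die_id, local_offset)."""
    if not 0 <= addr < (1 << 51):
        raise ValueError("address does not fit in 51 bits")
    return (
        (addr >> SIP_SHIFT) & SIP_MASK,
        (addr >> DIE_SHIFT) & DIE_MASK,
        addr & LOCAL_MASK,
    )
```

Decoding needs no state beyond the integer itself, which is the property D1 and D3 rely on.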
+#### 2.2 die_id Allocation
+
+| die_id | Meaning |
+|--------|---------|
+| 0..15  | AHBM dies |
+| 16..20 | IOCHIPLET dies |
+| 21..31 | Reserved |
+
+#### 2.3 AHBM Die Layout
+
+Only the lower 256 GB of the 4 TB die-local window is assigned.
+
+```text
+[41:38] MBZ            (4)
+[37]    addr_space      (1)    -- 0 = local resource, 1 = HBM memory
+[36: 0] sub-address    (37)
+```
+
+| addr_space | Meaning |
+|------------|---------|
+| 0 | Local resource |
+| 1 | HBM memory |
+
+##### 2.3.1 HBM Window (addr_space = 1)
+
+```text
+[36:0] hbm_offset     (37)    -- 128 GB decode window
+```
+
+The architectural decode window is fixed at 128 GB. Implemented capacity
+may be smaller depending on SKU/topology (see D4).
+
+##### 2.3.2 Resource Window (addr_space = 0)
+
+```text
+[36:34] resource_kind  (3)
+[33: 0] kind_local    (34)    -- 16 GB per kind
+```
+
+| resource_kind | Meaning |
+|---------------|---------|
+| 000 | PE_LOCAL |
+| 001 | MCPU_LOCAL |
+| 010 | CUBE_SRAM |
+| 011..111 | Reserved |
+
+Each kind gets a 16 GB decode region.
+
+##### 2.3.3 PE_LOCAL (resource_kind = 000)
+
+```text
+[33]    MBZ            (1)
+[32:29] pe_id          (4)     -- 0..15
+[28:25] pe_sub_unit    (4)
+[24: 0] sub_offset    (25)    -- 32 MB per slot
+```
+
+16 PEs x 16 sub-unit slots x 32 MB = 8 GB active decode.
+
+| pe_sub_unit | Name | Budget |
+|-------------|------|--------|
+| 0 | PE_CPU_DTCM | 8 KB |
+| 1 | MATH_ENGINE_DTCM | 8 KB |
+| 2 | IPCQ | 256 KB |
+| 3 | PE_CPU_SFR | 16 KB |
+| 4 | MATH_ENGINE_SFR | 16 KB |
+| 5 | DMA_ENGINE_SFR | 192 KB |
+| 6 | PE_TCM | 2 MB |
+| 7..15 | Reserved | -- |
+
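A sketch of building a PE_LOCAL `local_offset` from the table above. Raising `ValueError` for reserved slots and over-budget offsets is one possible policy (the ADR leaves fault behavior open), and `pe_local_offset` is a hypothetical name rather than the `pe_resource_addr` factory.

```python
KB = 1024
MB = 1024 * KB

# Implemented budgets per pe_sub_unit, copied from the PE_LOCAL table.
PE_SUB_UNIT_BUDGET = {
    0: 8 * KB,     # PE_CPU_DTCM
    1: 8 * KB,     # MATH_ENGINE_DTCM
    2: 256 * KB,   # IPCQ
    3: 16 * KB,    # PE_CPU_SFR
    4: 16 * KB,    # MATH_ENGINE_SFR
    5: 192 * KB,   # DMA_ENGINE_SFR
    6: 2 * MB,     # PE_TCM
}


def pe_local_offset(pe_id: int, pe_sub_unit: int, sub_offset: int) -> int:
    """Build the die-local offset for a PE_LOCAL resource (addr_space = 0)."""
    if not 0 <= pe_id <= 15:
        raise ValueError("pe_id out of range")
    budget = PE_SUB_UNIT_BUDGET.get(pe_sub_unit)
    if budget is None:
        raise ValueError("reserved pe_sub_unit")
    if not 0 <= sub_offset < budget:
        raise ValueError("sub_offset beyond the implemented budget")
    # resource_kind = 0b000 sits in bits [36:34]; bit [33] is MBZ and stays 0.
    return (0b000 << 34) | (pe_id << 29) | (pe_sub_unit << 25) | sub_offset
```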
+##### 2.3.4 MCPU_LOCAL (resource_kind = 001)
+
+```text
+[33:30] MBZ            (4)
+[29:25] mcpu_sub_unit  (5)
+[24: 0] sub_offset    (25)    -- 32 MB per slot
+```
+
+1 GB active decode.
+
+| mcpu_sub_unit | Name | Budget |
+|---------------|------|--------|
+| 0 | MCPU_ITCM | 512 KB |
+| 1 | MCPU_DTCM | 512 KB |
+| 2 | IPCQ | 256 KB |
+| 3 | MCPU_SFR | 8 KB |
+| 4 | MCPU_DMA_SFR | 16 KB |
+| 5 | MCPU_SRAM | 10 MB |
+| 6..31 | Reserved | -- |
+
+##### 2.3.5 CUBE_SRAM (resource_kind = 010)
+
+```text
+[33:25] MBZ            (9)
+[24: 0] sram_offset   (25)    -- flat 32 MB
+```
+
+#### 2.4 IOCHIPLET Die Layout
+
+Only the lower 1 TB of the 4 TB die-local window is assigned.
+
+```text
+[41:40] MBZ            (2)
+[39: 0] chiplet_offset (40)   -- 1 TB
+```
+
+Region split by address range:
+
+| Range | Meaning | Decode condition |
+|-------|---------|------------------|
+| [0, 2 GB) | IOCPU resource | chiplet_offset < 0x8000_0000 |
+| [2 GB, 1 TB) | UAL | chiplet_offset >= 0x8000_0000 |
+
+##### 2.4.1 IOCPU Region
+
+```text
+[30:27] iocpu_sub_unit (4)
+[26: 0] sub_offset    (27)    -- 128 MB per slot
+```
+
+16 x 128 MB slots. 2 GB active decode.
+
+| iocpu_sub_unit | Name | Budget |
+|----------------|------|--------|
+| 0 | IOCPU_ITCM | 512 KB |
+| 1 | IOCPU_DTCM | 512 KB |
+| 2 | IPCQ | 2 MB |
+| 3 | IOCPU_SFR | 8 KB |
+| 4 | IO_DMA_SFR | 16 KB |
+| 5 | IO_SRAM | 64 MB |
+| 6..15 | Reserved | -- |
+
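The IOCHIPLET region split and the IOCPU sub-unit fields can be decoded as sketched below. `decode_chiplet_offset` is a hypothetical name; the UAL branch returns the raw offset because its sub-layout is still TBD.

```python
IOCPU_LIMIT = 0x8000_0000      # 2 GB: offsets below this are IOCPU resources
CHIPLET_LIMIT = 1 << 40        # 1 TB assigned window; bits [41:40] are MBZ
SUB_OFFSET_MASK = (1 << 27) - 1


def decode_chiplet_offset(chiplet_offset: int) -> tuple:
    """Return (region, iocpu_sub_unit or None, offset) for an IOCHIPLET die."""
    if not 0 <= chiplet_offset < CHIPLET_LIMIT:
        raise ValueError("offset outside the 1 TB chiplet window (MBZ violation)")
    if chiplet_offset < IOCPU_LIMIT:
        sub_unit = (chiplet_offset >> 27) & 0xF   # bits [30:27]
        return ("IOCPU", sub_unit, chiplet_offset & SUB_OFFSET_MASK)
    # UAL sub-layout is not defined yet, so the offset is passed through as-is.
    return ("UAL", None, chiplet_offset)
```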
+##### 2.4.2 UAL Region
+
+Sub-layout TBD (separate ADR).
+
+#### 2.5 Addressing Rules
+
+1. MBZ bits must be zero. An address with non-zero MBZ bits is
+   **architecturally invalid**. Implementation may raise a decode fault
+   or return an error -- behavior is not prescribed by this ADR.
+2. Fixed slot sizes are chosen for simple hardware decode; actual
+   implemented capacity may be smaller than the slot.
+3. Access beyond a sub-unit's implemented budget within a slot is
+   **architecturally invalid** (same policy as MBZ).
+
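Rule 1 can be illustrated for an AHBM die as follows. The sketch checks only the outer MBZ bits [41:38] and the reserved `resource_kind` values; the per-kind MBZ bits (for example bit [33] of PE_LOCAL) are left out for brevity. Raising `ValueError` is one possible reaction, since the ADR does not prescribe the fault behavior, and `decode_ahbm_local` is a hypothetical name.

```python
RESOURCE_KINDS = {0: "PE_LOCAL", 1: "MCPU_LOCAL", 2: "CUBE_SRAM"}


def decode_ahbm_local(local_offset: int) -> tuple:
    """Return (target_kind, sub_address) for an AHBM die-local offset."""
    if local_offset >> 38:
        # Any bit at or above [38] set (or a negative value) violates MBZ.
        raise ValueError("MBZ bits [41:38] must be zero")
    if (local_offset >> 37) & 1:
        return ("HBM", local_offset & ((1 << 37) - 1))      # addr_space = 1
    kind = (local_offset >> 34) & 0x7                        # bits [36:34]
    name = RESOURCE_KINDS.get(kind)
    if name is None:
        raise ValueError("reserved resource_kind")
    return (name, local_offset & ((1 << 34) - 1))
```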
+### D3. Bitfield decoding is deterministic
+
+Given an integer address, field extraction (`sip_id`, `die_id`, `kind`,
+`sub_unit`, `offset`) is purely positional. No runtime state is required.
+Decoding deterministically maps an integer address to destination domains:
+`sip_id`, `die_id`, target kind (HBM / PE_LOCAL / MCPU_LOCAL / CUBE_SRAM /
+IOCPU / UAL).
+
+### D4. Capacity validation may depend on topology config
+
+Whether a decoded address falls within **implemented capacity** (e.g.,
+HBM 96 GB on a specific SKU) is checked against topology parameters
+provided via DI/config. Decode itself (D3) never consults topology --
+only validation does. These parameters must live in the topology/config
+layer, not in node implementations.
+
+### D5. Routing consumes decoded domains, not raw bits
+
+Routing policy uses decoded domains:
+
+- `src` location (sip / die / pe or node_id)
+- `dst` domains derived from PhysAddr decoding
+- `size_bytes` for size-aware link latency
+
+Routing must not inspect raw bit-fields directly except inside the
+decoding module.
+
+## Alternatives Considered
+
+1. **Keep `rack_id` (4 bits)**: Rejected -- the field was never used in
+   practice, and keeping it would hold back the 4 bits that now widen
+   `local_offset` to 42 bits (needed for the 1 TB IOCHIPLET window).
+
+2. **Uniform 256 GB per die**: Rejected -- IOCHIPLET UAL requires ~1 TB.
+   Freed rack_id bits enable 42-bit local_offset.
+
+3. **Variable-width die windows (AHBM 256 GB, CHIPLET 1 TB via multi-seg
+   spanning)**: Rejected -- complicates D3 (deterministic decoding).
+   Uniform 4 TB window with MBZ padding is simpler.
+
+4. **Use raw integers everywhere, decode ad-hoc in routing**: Rejected --
+   leads to duplicated logic, inconsistent routing, and hidden
+   assumptions.
+
+5. **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**:
+   Rejected -- violates SPEC R3 and breaks swappability.
+
+6. **Put decoding inside memory controllers or routers**: Rejected --
+   leaks policy into components, violates SPEC R4 / D5.
+
+## Consequences
+
+### Positive
+
+- Simple hierarchical decoder: SIP -> die -> kind -> sub-unit.
+- Clean separation of memory (HBM) vs local resource (PE/MCPU/SRAM/IOCPU).
+- Deterministic routing domains enable clear test invariants (SPEC R1, R5).
+- Expandable: 11 reserved die_id slots, reserved resource_kind / sub-unit
+  slots, reserved MBZ bits.
+- DI-first: decoder can be swapped without changing components (SPEC R4).
+
+### Tradeoffs
+
+- Sparse address holes due to power-of-2 slot alignment.
+- Large reserved/MBZ regions (intentional for future extension).
+- Requires explicit configuration for topology-derived sizes (D4).
+- Introduces a single "blessed" decoding module that must remain stable
+  and well-tested.
+
+## Supersedes
+
+- **ADR-0031 (PhysAddr PE-Resource Extension)**: stub status. The
+  PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables in D2.3.3-D2.3.5
+  fulfill ADR-0031's stated goals.
+
+## Implementation Notes (Non-normative)
+
+- Recommended module: `src/kernbench/policy/address/phyaddr.py`
+- Tests should cover: encode/decode round-trip per kind, MBZ enforcement,
+  die_id dispatch (AHBM / IOCHIPLET / reserved), sub-unit boundary
+  values, backward compatibility of factory APIs.
+- Factory methods: `hbm_addr`, `pe_hbm_addr`, `pe_tcm_addr`,
+  `cube_sram_addr` retain signatures (minus `rack_id`); `cube_id`
+  parameter renamed to `die_id`.
+- New factories: `pe_resource_addr`, `mcpu_resource_addr`,
+  `iocpu_resource_addr`, `ual_addr`.
+
+## Appendix A. Address Examples
+
+### A.1 AHBM HBM access
+
+sip=2, die=5, HBM offset=0x1000
+
+```text
+sip_id     = 2       -> [50:47] = 0b0010
+die_id     = 5       -> [46:42] = 0b00101
+addr_space = 1       -> [37]    = 1 (HBM)
+hbm_offset = 0x1000  -> [36:0]
+
+51-bit addr = (2 << 47) | (5 << 42) | (1 << 37) | 0x1000
+```
+
+### A.2 AHBM PE_LOCAL -- PE3 PE_TCM, offset=0x400
+
+```text
+sip_id        = 0  -> [50:47] = 0
+die_id        = 0  -> [46:42] = 0
+addr_space    = 0  -> [37]    = 0
+resource_kind = 0  -> [36:34] = 000 (PE_LOCAL)
+pe_id         = 3  -> [32:29] = 0011
+pe_sub_unit   = 6  -> [28:25] = 0110 (PE_TCM)
+sub_offset    = 0x400 -> [24:0]
+
+local_offset = (0 << 34) | (3 << 29) | (6 << 25) | 0x400
+```
+
+### A.3 AHBM MCPU_LOCAL -- MCPU_SRAM, offset=0x0
+
+```text
+sip_id        = 1  -> [50:47] = 0001
+die_id        = 3  -> [46:42] = 00011
+addr_space    = 0  -> [37]    = 0
+resource_kind = 1  -> [36:34] = 001 (MCPU_LOCAL)
+mcpu_sub_unit = 5  -> [29:25] = 00101 (MCPU_SRAM)
+sub_offset    = 0  -> [24:0]  = 0
+
+local_offset = (1 << 34) | (5 << 25)
+```
+
+### A.4 IOCHIPLET -- IOCPU IPCQ, offset=0x20000
+
+```text
+sip_id         = 1   -> [50:47] = 0001
+die_id         = 17  -> [46:42] = 10001 (IOCHIPLET[1])
+iocpu_sub_unit = 2   -> [30:27] = 0010 (IPCQ)
+sub_offset     = 0x20000 -> [26:0]
+
+chiplet_offset = (2 << 27) | 0x20000
+                 (< 0x8000_0000 -> IOCPU region)
+```
+
+### A.5 IOCHIPLET -- UAL region, offset=4 GB
+
+```text
+sip_id         = 0   -> [50:47] = 0
+die_id         = 16  -> [46:42] = 10000 (IOCHIPLET[0])
+chiplet_offset = 0x1_0000_0000 (4 GB >= 2 GB -> UAL region)
+```
+
+## Links
+
+- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first),
+  R5 (multi-domain comm)
+- ADR-0031: Superseded
@@ -1,108 +0,0 @@
-# ADR-0001: PhysAddr Layout & Address Decoding Contract
-
-## Status
-
-Accepted
-
-## Date
-
-2026-02-27
-
-## Context
-
-KernBench Graph Latency Simulator must route requests deterministically and compute end-to-end latency strictly by graph traversal.
-To model local vs remote traffic (same/different SIP, same/different CUBE, optional PE-group), requests need a stable, parsable address/location scheme that:
-
- can be decoded into routing domains (SIP/CUBE/HBM/PE-resource, etc.)
- remains topology-agnostic (no hardcoded counts)
- supports swappable policy and DI-first components without leaking topology assumptions into node implementations
-
-## Decision
-
-We define a **PhysAddr value object** and an **address decoding contract** that converts an integer address into routing domains.
-
-### D1. PhysAddr is an immutable value object
-
- PhysAddr is immutable and comparable as a pure value.
- Any allocator returns a **fully specified PhysAddr** (not partial metadata).
- No global state may be required to interpret a PhysAddr.
-
-### D2. PhysAddr fields (logical contract)
-
-PhysAddr must be able to represent at least:
-
- `rack_id` (optional but reserved for scale-out)
- `sip_id`  (device / SIP domain)
- `sip_seg` (SIP-level segment/window selection, e.g., cube window)
- `local_offset` (offset within the chosen segment/window)
-
-Decoded/derived fields may include (optional):
-
- `cube_id`
- `kind` (e.g., HBM vs PE-resource vs raw)
- `unit_type` / `pe_id` (if PE-level addressing is modeled)
-
-**Important:** The exact bit allocation may evolve, but the *semantic fields above* must remain decodable without hidden assumptions.
-
-### D3. Decoding is deterministic and policy-compatible
-
- Decoding must deterministically map an integer address to:
-  - destination SIP domain (`sip_id`)
-  - destination sub-domain (`cube_id` if applicable)
-  - destination target kind (HBM/PE-resource/other)
- Decoding must not depend on runtime topology sizes; it may depend on **explicit topology parameters** provided through configuration (e.g., segment size, slice size), and those parameters must live in the topology/config layer (not in random components).
-
-### D4. Topology-derived constants live in the topology layer
-
-Constants such as segment sizes (e.g., HBM slice size / window size) are derived from topology configuration (YAML/JSON/dict) and are provided to the decoder via DI/config.
-They must not be hardcoded in node implementations.
-
-### D5. Routing consumes decoded domains, not raw bits
-
-Routing policy uses decoded domains:
-
- `src` location (sip/cube/pe or node_id)
- `dst` domains derived from PhysAddr decoding
- `size_bytes` for size-aware link latency
-Routing must not inspect raw bit-fields directly except inside the decoding module.
-
-## Alternatives Considered
-
-1) **Use raw integers everywhere, decode ad-hoc in routing**
-
- Rejected: leads to duplicated logic, inconsistent routing, and hidden assumptions embedded in multiple components.
-
-1) **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**
-
- Rejected: violates SPEC (R3) and breaks swappability and configuration-driven topologies.
-
-1) **Put decoding inside memory controllers or routers**
-
- Rejected: leaks policy into components and undermines DI-first, swappable implementations (SPEC R4).
-
-## Consequences
-
-### Positive
-
- Deterministic routing domains enable clear test invariants for local vs remote paths (SPEC R1, R5).
- Keeps topology variability (SPEC R3) while preserving consistent semantics.
- DI-first: decoder can be swapped or extended without changing components or tests (SPEC R4).
-
-### Tradeoffs / Costs
-
- Requires explicit configuration for any topology-derived sizes.
- Introduces a single “blessed” decoding module that must remain stable and well-tested.
-
-## Implementation Notes (Non-normative)
-
- Recommended module boundary:
-  - `src/kernbench/policy/address/phyaddr.py`
-
- Tests should cover:
-  - deterministic decoding
-  - local vs remote classification from decoded fields
-  - invariants: “allocator returns full PhysAddr”, “decoding requires no global state”
-
-## Links
-
- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first), R5 (multi-domain comm)
@@ -35,7 +35,7 @@ shortcuts that obscure control paths.

 ### D3. Bypass is explicit and graph-represented
 - All paths must be explicitly represented in the graph and subject to latency accumulation.
- Example: PE_DMA connects to the NOC router mesh (ADR-0019). All destinations
+- Example: PE_DMA connects to the NOC router mesh (ADR-0017 D7). All destinations
  (HBM, shared SRAM, inter-cube UCIe) are reached via explicit mesh hops.
  Local HBM access has minimal hops (switching overhead only); remote access
  traverses additional routers.
@@ -35,11 +35,13 @@ We model the system hierarchy explicitly:

 - A CUBE contains:
  - HBM + memory controller (HBM_CTRL)
-  - NOC router mesh: 2D grid of explicit routers (from cube_mesh.yaml) with XY routing;
-    carries all intra-cube traffic including HBM data, inter-cube (UCIe),
-    command (M_CPU↔PE_CPU), and shared SRAM access.
-    HBM_CTRL is attached to PE routers (local HBM = 0 hop).
-    See ADR-0017 and ADR-0019 for full architecture.
+  - NOC (on-die fabric): carries all intra-cube traffic including HBM data,
+    inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access.
+    Must provide: full-BW PE↔local HBM path, PE↔SRAM connectivity,
+    PE↔UCIe connectivity, M_CPU↔PE command path.
+    NOC topology is an implementation choice (e.g., 2D mesh, ring, crossbar);
+    current implementation uses a 2D mesh with XY routing (see ADR-0017).
+    HBM_CTRL is attached to each PE's local NOC port (local HBM = minimal hop).
  - Shared SRAM: cube-level shared memory accessible by all PEs via NOC
  - management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
  - multiple PEs
@@ -15,7 +15,7 @@ Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth,

 - Each PE is assigned a logically defined “local HBM” region.
 - Local HBM corresponds to the pseudo-channel subset directly attached to that PE’s
-  router in the NOC mesh (ADR-0019).
+  router in the NOC mesh (ADR-0017 D4).
 - The path is: PE_DMA → local router → HBM_CTRL (switching overhead only, 0 mesh hops).
 - The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration.

@@ -33,12 +33,17 @@ Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth,
 - This guarantee is modeled by:
  - a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point,
  - while still incurring non-zero latency along explicitly modeled components.
+- HBM CTRL internal modeling (PC striping, cut-through, scheduling fidelity)
+  is consolidated in ADR-0033 (Latency Model: Assumptions and Known
+  Simplifications). The aggregate BW guarantee here remains the contract;
+  ADR-0033 documents how the per-PC model realizes it and which scheduler
+  effects are intentionally simplified.

 ### D3. Remote PE HBM semantics (intra-cube)

- A PE that accesses another PE's local HBM traverses the router mesh:
-  - PE_DMA → local router → (mesh hops) → target PE's router → HBM_CTRL
- Router mesh bandwidth and hop count may limit remote HBM access relative to local access.
+- A PE that accesses another PE's local HBM traverses the NOC:
+  - PE_DMA → NOC → (fabric hops) → target PE's NOC port → HBM_CTRL
+- NOC bandwidth and hop count may limit remote HBM access relative to local access.

 ### D4. Non-local HBM semantics (inter-cube / inter-SIP)

@@ -20,7 +20,9 @@ Diagrams must reflect this distance by default.

 ---

-## Global Defaults
+## Decision
+
+### D1. Global Defaults

 - All diagrams MUST be **distance-aware by default**.
 - All diagrams MUST render **representative views** of the architecture.
@@ -31,7 +33,7 @@ Diagrams must reflect this distance by default.

 ---

-## Representative Rendering Rule
+### D2. Representative Rendering Rule

 - All CUBEs share the same internal structure.
 - All PEs share the same internal structure.
@@ -47,9 +49,9 @@ unless explicitly requested.

 ---

-## Diagram Views
+### D3. Diagram Views

-### View A — SIP-Level Diagram
+#### View A — SIP-Level Diagram

 **Purpose**
 Explain system-scale structure and connectivity.
@@ -75,7 +77,7 @@ Explain system-scale structure and connectivity.

 ---

-### View B — CUBE-Level Diagram
+#### View B — CUBE-Level Diagram

 **Purpose**
 Explain cube-internal structure and data/control flow.
@@ -106,7 +108,7 @@ Explain cube-internal structure and data/control flow.

 ---

-### View C — PE-Level Diagram
+#### View C — PE-Level Diagram

 **Purpose**
 Explain internal PE behavior and execution structure.
@@ -128,14 +130,14 @@ Explain internal PE behavior and execution structure.

 ---

-## Distance-Aware Layout (Default)
+### D4. Distance-Aware Layout (Default)

-### Distance definition
+#### Distance definition

 - Distance is defined as **accumulated latency**, consistent with ADR-0002.
 - Distance is computed from a single anchor node.

-### Default anchor selection
+#### Default anchor selection

 - SIP view: IO chiplet (or Host CPU if present)
 - CUBE view: a representative PE
@@ -143,7 +145,7 @@ Explain internal PE behavior and execution structure.

 Anchors are **implicit defaults** and MUST NOT be required to be specified.

-### Layout rules
+#### Layout rules

 - Diagrams MUST be laid out in layers based on distance buckets.
 - Layout direction MUST be consistent within a view type
@@ -156,7 +158,7 @@ without affecting distance semantics.

 ---

-## Generation Contract (for Tools / Claude Code)
+### D5. Generation Contract (for Tools / Claude Code)

 When generating diagrams:

@@ -63,7 +63,7 @@ For each view (SIP / CUBE / PE):
 - CUBE-level projection MUST include:
  - Router mesh (from cube_mesh.yaml), HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
    and PEs as opaque blocks.
-  - All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0019).
+  - All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0017).
 - Default anchors are implicit (ADR-0005) and MUST NOT require instance indices.

 ### D6. Output formats and determinism
@@ -42,21 +42,25 @@ The runtime API MUST NOT:

 ---

-### D2. Simulation engine executes and schedules requests
+### D2. Simulation engine wires components and tracks completion

 The simulation engine (sim_engine) MUST:

- inject requests into the compiled topology graph,
+- wire components at initialization (create port stores + start wire
+  processes per the component port/wire framework — ADR-0015),
+- inject requests into the compiled topology graph at entry components
+  (e.g., PCIE_EP for memory operations, IO_CPU for kernel launch),
 - schedule and execute events using a discrete-event model,
- manage correlation ids and completion tracking,
- decompose operations into low-level requests when required
-  (e.g., MemoryWrite events).
+- manage correlation ids and completion tracking.

 The simulation engine MUST NOT:

 - define tensor semantics,
 - define kernel execution policies,
- expose internal graph details to the runtime API.
+- expose internal graph details to the runtime API,
+- walk the topology path during request execution,
+- call component `run()` methods directly,
+- track per-hop latency or decompose fan-out (components own this).

 ---

@@ -87,3 +91,5 @@ component-level fan-out explicitly.
 - SPEC R4, R7, R8
 - ADR-0008 (Tensor deployment)
 - ADR-0009 (Kernel execution)
+- ADR-0015 (Component port/wire model and engine role)
+- ADR-0010 (CLI surface and execution semantics — runtime API consumer)
@@ -94,7 +94,7 @@ The Phase 0 PA shard map remains a valid fast-path configuration.

 ## Links

- ADR-0011 (PA-first)
+- ADR-0011 (Memory Addressing — PA / VA / LA)
 - ADR-0012 (Host↔IO_CPU schema)
 - ADR-0007 (runtime_api vs sim_engine boundaries)
 - ADR-0009 (Kernel execution)
@@ -0,0 +1,146 @@
+# ADR-0009: Kernel Execution Messaging and Completion Semantics
+
+## Status
+
+Accepted
+
+## Context
+
+Kernel execution is initiated by the host and proceeds through
+device control components:
+
+Host → IO_CPU → M_CPU → PE_CPU → schedulers → engines
+
+Completion propagates in reverse order.
+
+To keep benchmarks simple and topology-agnostic,
+kernel execution must be endpoint-driven with deterministic aggregation.
+
+---
+
+## Decision
+
+### D1. Kernel launch is an endpoint request
+
+A kernel launch is initiated by submitting a single KernelLaunch request
+to the IO_CPU endpoint.
+
+The runtime API MUST:
+
+- construct the kernel launch request,
+- submit it to IO_CPU,
+- await a single completion result.
+
+The runtime API MUST NOT orchestrate internal fan-out.
+
+---
+
+### D2. Tensor arguments are passed by metadata
+
+KernelLaunch requests MUST reference tensor arguments via:
+
+- host-owned tensor handles, or
+- resolved device address maps derived from those handles.
+
+Bulk tensor data MUST NOT be embedded in kernel launch messages.
+
+---
+
+### D3. Fan-out and aggregation are component responsibilities
+
+- IO_CPU fans out work to M_CPUs.
+- M_CPU fans out work to PE_CPUs.
+- PE_CPU manages kernel execution and engine dispatch.
+
+Completion semantics:
+
+- M_CPU completes when all targeted PEs complete or a failure policy triggers.
+- IO_CPU completes when all targeted CUBEs complete or a failure policy triggers.
+
+---
+
+### D4. Completion and failure propagation
+
+- All messages MUST carry correlation identifiers.
+- Completion and failure MUST propagate deterministically to the host.
+- The simulation engine provides futures/handles to observe completion.
+
+---
+
+### D5. Launch timing is endpoint-synchronized
+
+All PEs targeted by a single kernel launch MUST begin executing the kernel
+body at the same simulated time, regardless of their dispatch path length
+from the launch entry point.
+
+Rationale. The dispatch tree Host → IO_CPU → M_CPU → PE_CPU has variable
+latency at every level. PEs near their M_CPU receive the launch earlier
+than PEs farther away; cubes near an IO_CPU receive it earlier than cubes
+farther away. Without synchronization, each PE's kernel begins at a
+different `env.now`, making per-PE metrics such as `pe_exec_ns` a function
+of dispatch-path geometry rather than of the kernel's behavior —
+producing measurement artifacts in benchmarks that time kernel-internal
+waits (for example `tl.recv` on cross-cube or cross-SIP hops).
+
+Mechanism.
+
+- `KernelLaunchMsg` carries an optional `target_start_ns: float | None`.
+- **IO_CPU** is the canonical stamper. On fan-out to M_CPUs, it
+  computes `target_start_ns = env.now + max_latency` where
+  `max_latency` is the maximum, over every target (sip, cube, pe)
+  tuple, of the **two-leg dispatch chain**:
+
+  ```
+  chain_latency(sip, cube, pe) =
+      compute_path_latency_ns(find_node_path(io_cpu, m_cpu(sip, cube)))
+    + compute_path_latency_ns(find_node_path(m_cpu(sip, cube), pe_cpu))
+    - io_cpu.overhead_ns
+    - m_cpu.overhead_ns
+
+  max_latency = max over all targets of chain_latency(sip, cube, pe)
+  ```
+
+  This models the actual dispatch as **two sequential Transactions**
+  (IO_CPU → M_CPU, then M_CPU → PE_CPU). Each leg's
+  `compute_path_latency_ns` adds its endpoints' `overhead_ns`;
+  `io_cpu.overhead_ns` is subtracted because IO_CPU has already
+  paid it before this method runs, and `m_cpu.overhead_ns` is
+  subtracted once because it is counted as the endpoint of leg 1 *and*
+  the start of leg 2 but is paid only once at run time. A single
+  `find_node_path(io_cpu, pe_cpu)` walk is **not** equivalent —
+  it can pick a graph path that bypasses M_CPU and silently
+  under-shoots the prediction for far cubes, breaking the D5
+  invariant.
+
+  The fanned-out sub-Transactions carry **`nbytes = 0`** for
+  `KernelLaunchMsg` (control message only). Without this,
+  large kernel-launch payloads would occupy fabric BW on the
+  shared first hop and serialize the per-cube dispatch, pushing
+  far M_CPUs past `target_start_ns` and re-introducing the
+  late-arrival violation.
+- **M_CPU** passes an already-stamped `target_start_ns` through
+  unchanged. Only when the value is absent (e.g. a direct
+  launch-to-M_CPU unit test) does M_CPU compute a per-cube barrier
+  `env.now + max(local command-path latency)`.
+- **PE_CPU** yields `env.timeout(target_start_ns - env.now)` at the top
+  of `_execute_kernel`, before recording `pe_exec_start` and invoking
+  the kernel body.
+- When `target_start_ns is None`, PE_CPU falls through to the legacy
+  unsynchronized behavior — preserving backward compatibility.
+
+IO_CPU-level stamping guarantees every PE across every targeted cube
+uses the same barrier sim-time, eliminating both the within-cube
+dispatch-offset artifact *and* the cross-cube offset artifact in
+multi-cube launches. This models a real-hardware timed-broadcast
+launch (latency-equalized dispatch tree).
+
+The synchronization is internal to the engine / IO_CPU / M_CPU / PE_CPU
+control plane — runtime API and application kernels are unchanged.
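+
+A minimal sketch of the stamping arithmetic above, assuming the two
+per-leg latencies have already been computed by
+`compute_path_latency_ns`. `DispatchTarget`, `stamp_target_start_ns`,
+and `pe_wait_ns` are hypothetical names used only for illustration,
+and clamping the wait at zero is an assumption of this sketch, not
+part of the decision:
+
+```python
+from dataclasses import dataclass
+from typing import Iterable, Optional
+
+
+@dataclass(frozen=True)
+class DispatchTarget:
+    """Leg latencies for one (sip, cube, pe) target.
+
+    Each leg includes the overhead_ns of both of its endpoints.
+    """
+    leg1_ns: float  # IO_CPU -> M_CPU
+    leg2_ns: float  # M_CPU -> PE_CPU
+
+
+def stamp_target_start_ns(
+    now_ns: float,
+    targets: Iterable[DispatchTarget],
+    io_cpu_overhead_ns: float,
+    m_cpu_overhead_ns: float,
+) -> float:
+    """Return now + the longest two-leg dispatch chain over all targets."""
+    chains = [
+        # io_cpu overhead is already paid; m_cpu overhead is counted in
+        # both legs but paid once, so each is subtracted once.
+        t.leg1_ns + t.leg2_ns - io_cpu_overhead_ns - m_cpu_overhead_ns
+        for t in targets
+    ]
+    return now_ns + max(chains)
+
+
+def pe_wait_ns(target_start_ns: Optional[float], now_ns: float) -> float:
+    """Delay a PE_CPU waits before starting the kernel body."""
+    if target_start_ns is None:
+        return 0.0  # legacy unsynchronized behavior
+    # Clamping at zero is an assumption of this sketch.
+    return max(0.0, target_start_ns - now_ns)
+```
+
+With a near target (legs 30 ns and 20 ns) and a far target (legs 80 ns
+and 25 ns), overheads of 5 ns and 10 ns, and `now_ns = 100`, the far
+chain of 90 ns sets the barrier at 190 ns for both targets.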
+
+---
+
+## Links
+
+- SPEC R1, R2, R7, R8
+- ADR-0007 (Runtime API boundaries)
+- ADR-0008 (Tensor deployment)
+- ADR-0013 (Verification strategy — V2 fan-out tests)
+- ADR-0015 D4 (concrete fabric path for kernel launch)
@@ -1,74 +0,0 @@
-# ADR-0009: Kernel Execution Messaging and Completion Semantics
-
-## Status
-
-Accepted
-
-## Context
-
-Kernel execution is initiated by the host and proceeds through
-device control components:
-
-Host → IO_CPU → M_CPU → PE_CPU → schedulers → engines
-
-Completion propagates in reverse order.
-
-To keep benchmarks simple and topology-agnostic,
-kernel execution must be endpoint-driven with deterministic aggregation.
-
---
-
-## Decision
-
-### D1. Kernel launch is an endpoint request
-
-A kernel launch is initiated by submitting a single KernelLaunch request
-to the IO_CPU endpoint.
-
-The runtime API MUST:
-
- construct the kernel launch request,
- submit it to IO_CPU,
- await a single completion result.
-
-The runtime API MUST NOT orchestrate internal fan-out.
-
---
-
-### D2. Tensor arguments are passed by metadata
-
-KernelLaunch requests MUST reference tensor arguments via:
-
- host-owned tensor handles, or
- resolved device address maps derived from those handles.
-
-Bulk tensor data MUST NOT be embedded in kernel launch messages.
-
---
-
-### D3. Fan-out and aggregation are component responsibilities
-
- IO_CPU fans out work to M_CPUs.
- M_CPU fans out work to PE_CPUs.
- PE_CPU manages kernel execution and engine dispatch.
-
-Completion semantics:
-
- M_CPU completes when all targeted PEs complete or a failure policy triggers.
- IO_CPU completes when all targeted CUBEs complete or a failure policy triggers.
-
---
-
-### D4. Completion and failure propagation
-
- All messages MUST carry correlation identifiers.
- Completion and failure MUST propagate deterministically to the host.
- The simulation engine provides futures/handles to observe completion.
-
---
-
-## Links
-
- SPEC R1, R2, R7, R8
- ADR-0007 (Runtime API boundaries)
- ADR-0008 (Tensor deployment)
@@ -0,0 +1,152 @@
+# ADR-0010: Command Line Interface and Execution Semantics
+
+## Status
+
+Accepted
+
+## Context
+
+The `kernbench` CLI is the user-facing entry point of the simulator. It
+exposes four subcommands:
+
+- `run` — execute a benchmark against a topology.
+- `list` — enumerate registered benches.
+- `probe` — diagnostic utility for latency / BW measurement.
+- `web` — interactive topology viewer.
+
+Device enumeration is centralized in the CLI; neither the runtime API
+nor the simulation engine enumerates devices. Benchmarks remain
+single-device by design and accept a device identifier as input.
+
+## Decision
+
+### D1. Benchmark contract — single-device by design
+
+- A benchmark MUST define behavior for a single device only.
+- A benchmark MUST accept a device identifier as input.
+- Benchmarks MUST NOT enumerate or loop over multiple devices.
+
+Multi-device execution is the CLI's concern (D3), not the benchmark's.
+
+### D2. `kernbench run` — benchmark execution
+
+Required arguments:
+
+- `--topology <path>`: topology YAML file path. Loaded via
+  `resolve_topology()`.
+- `--bench <identifier>`: benchmark identifier. Resolved via
+  `kernbench.benches.registry.resolve()`, which accepts either the
+  registered kebab-case name (e.g., `gemm-single-pe`) or a numeric
+  index from `kernbench list`.
+
+Optional arguments:
+
+- `--device <selector>` (default: `all`):
+  - `all` — run once per discovered SIP (see D3).
+  - `sip:<N>` — run only on SIP N.
+  - Parsed via `resolve_device()`.
+- `--verify-data` (default: off) — enable Phase 2 data verification
+  (see ADR-0020). When set, `engine_factory` constructs the engine
+  with `enable_data=True`. After the benchmark runs, a diagnostic
+  summary of recorded ops is printed.
+
+Each invocation runs the benchmark once within a single simulation
+instance.
+
+### D3. Multi-device execution is logically parallel
+
+When `--device all` (or omitted) and the topology has multiple SIPs:
+
+- Benchmark executions are submitted to a single simulation engine
+  instance.
+- Executions are logically parallel in simulation time.
+- Inter-device contention is naturally modeled (shared fabric
+  bandwidth, cross-SIP traffic, etc.).
+
+The CLI does NOT spawn multiple OS processes or independent
+simulation runs — parallelism is internal to one simulation instance.
+
+### D4. `kernbench list` — enumerate registered benches
+
+No arguments. Prints each registered bench's auto-assigned index,
+registered name, and one-line description.
+
+Benches register themselves via the `@bench(name=..., description=...)`
+decorator (`kernbench.benches.registry`). Every non-underscore module
+under `kernbench.benches/` MUST register at least one bench; a missing
+decorator raises `RuntimeError` at package import time.
+
+Indices are assigned alphabetically by name at import time. They are a
+CLI convenience (shorthand for `--bench`), not a stable API — a new
+bench inserted alphabetically will shift later indices.
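+
+The index-shift caveat can be illustrated with a small standalone
+registry. This is a sketch only, not the actual
+`kernbench.benches.registry` code: the zero-based index, the
+`list_benches` helper, and computing indices on demand rather than at
+import time are all assumptions made for illustration.
+
+```python
+from typing import Callable
+
+# name -> (description, bench function); hypothetical stand-in registry
+_REGISTRY: dict[str, tuple[str, Callable]] = {}
+
+
+def bench(name: str, description: str):
+    """Register the decorated function under a kebab-case name."""
+    def decorator(fn: Callable) -> Callable:
+        if name in _REGISTRY:
+            raise ValueError(f"duplicate bench name: {name}")
+        _REGISTRY[name] = (description, fn)
+        return fn
+    return decorator
+
+
+def list_benches() -> list[tuple[int, str, str]]:
+    """Return (index, name, description), indexed alphabetically by name."""
+    return [(i, n, _REGISTRY[n][0]) for i, n in enumerate(sorted(_REGISTRY))]
+
+
+def resolve(identifier: str) -> Callable:
+    """Accept either a registered name or a numeric index."""
+    if identifier.isdigit():
+        name = sorted(_REGISTRY)[int(identifier)]
+    else:
+        name = identifier
+    return _REGISTRY[name][1]
+```
+
+Registering a bench whose name sorts before `gemm-single-pe` changes
+what `resolve("0")` returns, which is why scripts should prefer names
+over indices.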
+
+### D5. `kernbench probe` — latency / BW diagnostic utility
+
+Required argument:
+
+- `--topology <path>`: topology YAML file path.
+
+Optional argument:
+
+- `--case <name>` (default: `all`) — run a predefined traffic
+  pattern, or `all` to run every defined case.
+
+Probe runs each pattern through the simulation engine and reports
+per case:
+
+- End-to-end latency (ns).
+- Effective bandwidth (nbytes / total_ns).
+- Bottleneck bandwidth (min edge BW along the chosen path).
+- Utilization (effective / bottleneck).
+
+Probe additionally validates monotonicity invariants — for example
+that local-HBM access ≤ cross-PE-within-cube ≤ cross-cube ≤
+cross-SIP — and reports violations. Probe is a developer tool for
+verifying the latency / BW model; it is not a benchmark.
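+
+A minimal sketch of the per-case metrics and the monotonicity check,
+assuming edge bandwidths are expressed in GB/s (so that bytes per
+nanosecond compare directly). `probe_metrics` and
+`monotonicity_violations` are hypothetical names, not the
+`kernbench.probes` API:
+
+```python
+def probe_metrics(nbytes: int, total_ns: float, edge_bw_gbs: list[float]) -> dict:
+    """Compute the per-case figures reported by a probe run."""
+    effective = nbytes / total_ns   # bytes/ns is numerically GB/s
+    bottleneck = min(edge_bw_gbs)   # slowest edge on the chosen path
+    return {
+        "effective_gbs": effective,
+        "bottleneck_gbs": bottleneck,
+        "utilization": effective / bottleneck,
+    }
+
+
+def monotonicity_violations(cases: list[tuple[str, float]]) -> list[tuple[str, str]]:
+    """Return adjacent (nearer, farther) pairs whose latency ordering is broken.
+
+    `cases` is ordered from nearest to farthest, e.g. local HBM first
+    and cross-SIP last, each with its end-to-end latency in ns.
+    """
+    return [
+        (near, far)
+        for (near, near_ns), (far, far_ns) in zip(cases, cases[1:])
+        if near_ns > far_ns
+    ]
+```
+
+For example, 4096 bytes in 256 ns over edges of 32, 64, and 128 GB/s
+gives 16 GB/s effective against a 32 GB/s bottleneck, a utilization
+of 0.5.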
+
+### D6. `kernbench web` — topology viewer
+
+Optional arguments:
+
+- `--port <N>` (default: `8765`) — HTTP port.
+- `--no-open` — do not auto-open the browser.
+
+Launches a local HTTP server that renders the compiled topology in
+the browser. Distinct from the static `docs/diagrams/` artifacts:
+
+- `docs/diagrams/` files are derived at topology-compile time
+  (ADR-0006).
+- `kernbench web` is interactive — pan/zoom, hover for component
+  attributes, switch between SIP / CUBE / PE views.
+
+### D7. Runtime API and simulation engine remain device-scoped
+
+- Runtime API calls operate on one device per invocation.
+- The simulation engine schedules all requests deterministically.
+- Neither layer enumerates devices.
+
+This invariant keeps each layer testable in isolation; device
+enumeration and multi-device fan-out live only in the CLI's `run`
+command (D3).
+
+The `probe` implementation lives under `kernbench.probes` (separate
+from `kernbench.benches`), reflecting that probes are diagnostic
+utilities, not registered benches.
+
+## Consequences
+
+- Benchmark authors write single-device logic; multi-device behavior
+  emerges from the CLI dispatching across SIPs.
+- Adding a new subcommand (e.g., trace export, replay) does not
+  require benchmark or runtime-API changes — the CLI is the
+  extension point.
+- `probe` and `web` are diagnostic / visualization tools, not
+  benchmarks; they bypass the benchmark loader path.
+
+## Links
+
+- SPEC R7, R8, R9
+- ADR-0007 (Runtime API and Simulation Engine Boundaries)
+- ADR-0020 (Two-pass data execution — `--verify-data`)
+- ADR-0006 (Topology compilation and diagram generation —
+  background for `kernbench web`)
@@ -1,62 +0,0 @@
-# ADR-0010: CLI Device Selection and Multi-Device Execution Semantics
-
-## Status
-
-Accepted
-
-## Context
-
-Benchmarks represent device-agnostic workloads that operate on a single device.
-Users may want to run a benchmark:
-
- on a specific device, or
- across all devices in the system.
-
-Device enumeration must not leak into benchmarks or runtime APIs.
-
---
-
-## Decision
-
-### D1. Benchmarks are single-device by design
-
- A benchmark MUST define behavior for a single device only.
- A benchmark MUST accept a device identifier as input.
- Benchmarks MUST NOT enumerate or loop over multiple devices.
-
---
-
-### D2. CLI controls device selection
-
-The `kernbench run` command supports an optional `--device` argument:
-
- If `--device <id>` is specified:
-  - the benchmark executes once for the specified device.
-
- If `--device` is omitted:
-  - the benchmark executes once using all the SIPs discovered in the topology.
-
---
-
-### D3. Multi-device execution is logically parallel
-
-When running on multiple devices:
-
- benchmark executions are submitted to a single simulation engine instance,
- executions are logically parallel in simulation time,
- inter-device contention is naturally modeled.
-
---
-
-### D4. Runtime API and simulation engine remain device-scoped
-
- Runtime API calls operate on one device per invocation.
- The simulation engine schedules all requests deterministically.
- Neither layer enumerates devices.
-
---
-
-## Links
-
- SPEC R7, R8
- ADR-0007 (Runtime API boundaries)
@@ -0,0 +1,521 @@
+# ADR-0011: Memory Addressing — PA / VA / LA Address Models
+
+## Status
+
+Accepted.
+
+- **VA model: currently implemented (default).**
+- PA model: implemented as PageFault fallback in PE_DMA.
+- LA model: proposed, not implemented.
+
+## Context
+
+KernBench's address model evolved through three design points, each
+addressing a limitation of the previous. This ADR documents all three
+in one place because future implementation work selects among them.
+
+### PA-only baseline
+
+Phase 0 of KernBench treated all device memory operations
+(MemoryRead/MemoryWrite) as raw physical-address transfers. No
+host-side virtual addressing, no MMU/IOMMU translation. Allocators
+returned PA mappings; DMA requests carried PA directly.
+
+This was sufficient for early correctness/latency work but
+insufficient for running standard Triton kernels that use
+`base_addr + offset` patterns on sharded tensors: each PE's shard
+has a different PA, but the kernel needs a single contiguous address
+space to compute offsets.
+
+### Why VA/MMU (current default)
+
+A realistic system uses host-side virtual addressing and an
+MMU/IOMMU-style translation path for DMA: the host allocates physical
+memory at PE level, maps it into a virtual address space, installs
+mappings, and DMA requests use virtual addresses that are translated
+to physical addresses.
+
+Adopting this model lets kernels use `base_addr + offset` over a
+contiguous VA range while the device-side MMU translates each access
+to the appropriate PA.
+
+### Why LA/BAAW (proposed)
+
+VA/MMU treats HBM as a single backing space. KernBench needs to
+explore architectures where HBM is composed of multiple pseudo
+channels in parallel:
+
+- A CUBE's HBM has 32 or 64 pseudo channels.
+- In a PE-Local-HBM model, each PE is assigned N pseudo channels
+  (N = `hbm_pseudo_channels / pes_per_cube`).
+- Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW
+  (N × per-channel).
+
+Two channel-mapping modes need to be modelable:
+
+- **1:1 mode** — one logical access → N per-channel requests.
+  Precise per-channel BW contention modelling.
+- **n:1 mode (default)** — one logical access → one aggregated
+  request. Channels are assumed to interleave; aggregated BW model.
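+
+The channel arithmetic and the two request shapes can be sketched as
+follows. The function names are hypothetical, and splitting a 1:1
+access evenly across channels is an assumption of this sketch; the
+ADR only states that one logical access becomes N per-channel
+requests.
+
+```python
+def channels_per_pe(hbm_pseudo_channels: int, pes_per_cube: int) -> int:
+    """N = hbm_pseudo_channels / pes_per_cube, required to divide evenly."""
+    n, rem = divmod(hbm_pseudo_channels, pes_per_cube)
+    if rem:
+        raise ValueError("pseudo channels must divide evenly across PEs")
+    return n
+
+
+def aggregate_pe_bw_gbs(n_channels: int, channel_bw_gbs: float) -> float:
+    """Aggregate PE bandwidth = N x per-channel bandwidth."""
+    return n_channels * channel_bw_gbs
+
+
+def request_sizes(nbytes: int, n_channels: int, mode: str) -> list[int]:
+    """Byte sizes of the physical requests for one logical access."""
+    if mode == "n_to_one":
+        return [nbytes]  # one aggregated request
+    if mode == "one_to_one":
+        # Assumed even split; leftover bytes go to the first channels.
+        base, extra = divmod(nbytes, n_channels)
+        return [base + (1 if i < extra else 0) for i in range(n_channels)]
+    raise ValueError(f"unknown mapping mode: {mode}")
+```
+
+With 64 pseudo channels, 8 PEs per cube, and 32 GB/s per channel,
+each PE owns 8 channels and 256 GB/s in either mode; only the request
+shape differs.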
+
+VA's `tl.load(va_ptr)` produces a single DMA request to a single
+target. Decomposing that into per-channel requests inside PE_DMA
+requires the address layer to be aware of channels. This is the
+role of the LA (Logical Address) abstraction with BAAW
+(Logical-to-Physical Mapping Unit).
+
+Core requirements driving the LA design:
+
+- PE_DMA → HBM_CTRL effective bandwidth semantics must be identical
+  in both modes (only request shape and resource model differ).
+- Kernel programming model is unchanged — physical channel
+  information is never exposed to kernel code.
+- Mode switch is a topology-level configuration.
+
+### Design space summary
+
+| Model | Status | Key idea |
+|-------|--------|----------|
+| PA | fallback (implemented) | Direct physical addressing, no translation |
+| VA | current default (implemented) | Per-tensor contiguous VA range; MMU translates per access |
+| LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes |
+
+---
+
+## Decision
+
+This ADR defines three address models. At any given time the system
+operates in exactly one model. Selection is topology- / configuration-
+driven; coexistence within one simulation run is not required.
+
+---
+
+### Address Model: PA (Physical Address) — fallback
+
+#### D-PA1. PA-only semantics
+
+- All device memory accesses (MemoryRead/MemoryWrite) operate on
+  device physical addresses (PA) plus size.
+- PA-only mode remains functional via the PageFault fallback path in
+  PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats
+  the value as a PA directly.
+
+#### D-PA2. Allocation produces PA mappings
+
+Device allocation selects PE-local memory regions and returns PA
+mappings sufficient to execute kernels and issue DMA requests.
+
+The PA model is retained primarily for backward compatibility with
+PA-only tests and as the underlying physical layer that the VA and LA
+models resolve into.
+
+---
+
+### Address Model: VA (Virtual Address with MMU) — current default
+
+#### D-VA1. Virtual Address Model
+
+- Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
+- `TensorShard` does NOT carry a `va` field — shard VA is derived as
+  `va_base + offset_bytes`.
+- Kernels receive `va_base` as their pointer argument (via
+  `TensorArg.va_base`).
+- `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).
+
+#### D-VA2. PE_MMU Component
+
+- Hybrid design: SimPy component (inbox for `MmuMapMsg`) + utility
+  (synchronous `translate()` called by PE_DMA).
+- Page-aligned dict lookup for O(1) VA → PA translation.
+- `tlb_overhead_ns` configurable per-access latency.
+- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA
+  directly (preserves PA model for backward compatibility).
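+
+A minimal sketch of the page-aligned lookup and the PageFault
+fallback, assuming one dict entry per mapped page. `PeMmuSketch`,
+`map_range`, and `resolve_dma_addr` are hypothetical names and do not
+mirror the real `PeMMU` interface; `tlb_overhead_ns` and the SimPy
+inbox are omitted.
+
+```python
+class PageFault(Exception):
+    """Raised when a VA has no installed mapping."""
+
+
+class PeMmuSketch:
+    def __init__(self, page_size: int = 4096) -> None:
+        self.page_size = page_size
+        self._pages: dict[int, int] = {}  # VA page base -> PA page base
+
+    def map_range(self, va: int, pa: int, size: int) -> None:
+        """Install page-aligned mappings covering [va, va + size)."""
+        if va % self.page_size or pa % self.page_size:
+            raise ValueError("va and pa must be page-aligned")
+        n_pages = -(-size // self.page_size)  # ceiling division
+        for i in range(n_pages):
+            step = i * self.page_size
+            self._pages[va + step] = pa + step
+
+    def translate(self, va: int) -> int:
+        """O(1) VA -> PA lookup keyed on the page base."""
+        offset = va % self.page_size
+        try:
+            return self._pages[va - offset] + offset
+        except KeyError:
+            raise PageFault(hex(va)) from None
+
+
+def resolve_dma_addr(mmu: PeMmuSketch, addr: int) -> int:
+    """PE_DMA-side helper: fall back to treating the address as a PA."""
+    try:
+        return mmu.translate(addr)
+    except PageFault:
+        return addr
+```
+
+The fallback is what keeps PA-only tests working: an address with no
+mapping passes through unchanged.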
+
+#### D-VA3. Mapping Installation
+
+- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube
+  fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured
+  end-to-end.
+- `MmuMapMsg.target_sips` controls SIP-level routing to prevent
+  cross-SIP mapping contamination for replicated tensors.
+- Mapping strategy based on `DPPolicy.cube`:
+  - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping
+    only. Each cube's PEs see only their local PA. No cross-cube
+    mapping installed.
+  - **Sharded** (`cube="column_wise"`, etc.): broadcast all shard
+    mappings to all target cubes. Enables cross-PE and cross-cube
+    DMA.
+
+#### D-VA4. Tensor Lifecycle
+
+- `del tensor` triggers automatic cleanup via `Tensor.__del__` and a
+  `weakref` to `RuntimeContext`: it sends `MmuUnmapMsg` through the
+  fabric and returns the VA and PA space to their allocators.
+- `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
+- `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
+- `PEMemAllocator` uses a free list with coalescing (not a bump allocator).
+- `VirtualAllocator` uses a free list with coalescing for VA space.
+
+#### D-VA5. Allocators
+
+- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free
+  with coalescing.
+- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with
+  coalescing.
+- Page size configurable via `topology.yaml` `pe_mmu` attrs
+  (default 4096).
+
+#### Consequences (VA model)
+
+- Triton kernels use `base_addr + offset` patterns naturally on
+  sharded tensors.
+- All latency remains explicit via graph traversal, including MMU
+  mapping installation and per-access TLB overhead.
+- PA-only mode retained as fallback (PageFault → treat as PA).
+- IPCQ and other fixed-address resources bypass MMU (use PA directly).
+
+---
+
+### Address Model: LA (Logical Address with BAAW) — proposed
+
+LA replaces VA when channel-level HBM modelling is required.
+Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the
+removed artifacts). Coexistence with VA in the same run is not a goal.
+
+#### D-LA1. LA introduction — replaces VA infrastructure
+
+LA is the sole address space used by kernel code (`tl.load`,
+`tl.store`, `tl.composite`). Properties:
+
+- Can map a Tensor to a contiguous logical space (like VA).
+- Expresses `(logical buffer + offset)`.
+- Does NOT contain physical channel information directly.
+- Stays as an intermediate abstraction until physical resolution.
+
+LA address space:
+
+| Item | Value |
+|------|-------|
+| LA start | `0x1_0000_0000` (4 GB, preserves former VA start) |
+| LA space size | 64 GB per PE |
+| Alignment unit | segment (see D-LA3) |
+
+LA is PE-local: different PEs may use the same LA value; BAAW segment
+tables differ → they resolve to different PAs.
+
+VA infrastructure removed when LA is adopted:
+
+| Removed | Replacement |
+|---------|-------------|
+| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed) |
+| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
+| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
+| `runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg` | `BaawSegmentInstallMsg` |
+| `runtime_api/context.py`: VA alloc + MMU install | LA alloc + BAAW segment install |
+| `runtime_api/tensor.py`: `va_base` | `la_base` |
+| `topology.yaml`: `pe_mmu` component entry | Removed |
+
+#### D-LA2. Mapping mode setting
+
+Topology-level (cube) configuration:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
+    hbm_pseudo_channels: 64       # total pseudo channel count
+    hbm_channels_per_pe: 8        # per-PE local channel count
+    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth
+```
+
+Consumed by the graph compiler (topology builder) and BAAW
+initialisation.
+
+#### D-LA3. Segment and BAAW
+
+Segments partition the LA space; each segment maps to a specific HBM
+channel or channel group. The runtime allocator creates segments at
+tensor deploy time. BAAW resolves LA → physical request(s) using the
+segment table.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class BaawSegment:
+    la_base: int          # segment start LA
+    la_size: int          # segment size (bytes)
+    mode: str             # "one_to_one" | "n_to_one"
+    # 1:1 mode fields
+    channel_count: int    # channels assigned to this segment (e.g. 8)
+    pa_bases: list[int]   # per-channel PA bases (len = channel_count)
+    channel_ids: list[int]   # per-channel logical IDs (e.g. [0..7])
+    channel_size: int     # per-channel size (la_size // channel_count)
+    # n:1 mode fields
+    agg_pa_base: int      # aggregated PA base
+    agg_node_id: str      # aggregated router node_id
+```
+
+Segment lifecycle:
+
+1. **Allocate** (tensor deploy): RuntimeContext allocates LA from LA
+   allocator. PEMemAllocator allocates per-channel PA (1:1) or
+   aggregated PA (n:1). `BaawSegmentInstallMsg` registers the segment
+   with PE_DMA.
+2. **Use** (kernel run): kernel `tl.load(la_ptr)` → `DmaReadCmd
+   (src_addr=LA)`. PE_DMA's BAAW front-end looks up the segment and
+   converts to PA(s).
+3. **Free** (tensor free): segment removed from table; LA and PA
+   returned.
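The install / lookup / remove steps of the segment lifecycle can be sketched with a sorted table. This is an assumption-laden illustration: `Segment` is a cut-down stand-in for `BaawSegment`, and `SegmentTable`, the overlap check, and the `bisect`-based lookup are hypothetical choices, not the project's BAAW implementation.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    la_base: int
    la_size: int


class SegmentTable:
    """Hypothetical per-PE segment table keyed by la_base."""

    def __init__(self) -> None:
        self._bases = []     # sorted la_base values
        self._segments = []  # segments in the same order

    def install(self, seg: Segment) -> None:
        i = bisect.bisect_left(self._bases, seg.la_base)
        # Reject overlap with the left and right neighbours.
        if i > 0:
            left = self._segments[i - 1]
            if left.la_base + left.la_size > seg.la_base:
                raise ValueError("segment overlaps its left neighbour")
        if i < len(self._segments):
            if seg.la_base + seg.la_size > self._segments[i].la_base:
                raise ValueError("segment overlaps its right neighbour")
        self._bases.insert(i, seg.la_base)
        self._segments.insert(i, seg)

    def find(self, la: int) -> Segment:
        # Condition from D-LA4: la_base <= la < la_base + la_size.
        i = bisect.bisect_right(self._bases, la) - 1
        if i >= 0:
            seg = self._segments[i]
            if la < seg.la_base + seg.la_size:
                return seg
        raise KeyError(f"no segment covers LA {la:#x}")

    def remove(self, la_base: int) -> None:
        i = bisect.bisect_left(self._bases, la_base)
        if i == len(self._bases) or self._bases[i] != la_base:
            raise KeyError(f"no segment installed at {la_base:#x}")
        del self._bases[i]
        del self._segments[i]
```

Rejecting overlapping installs at this point is one way to satisfy the "multiple tensors in distinct segments do not collide" test requirement.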
+
+#### D-LA4. BAAW resolution logic
+
+BAAW is a front-end stage inside PE_DMA, not a separate SimPy
+component. It is synchronous address-resolution logic that runs at the
+start of PE_DMA's `handle_command()`.
+
+Input: `(LA, nbytes)`. Output:
+
+- **1:1 mode**: `list[PhysicalRequest]` — one per channel.
+- **n:1 mode**: single `PhysicalRequest`.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class PhysicalRequest:
+    pa: int           # 51-bit Physical Address
+    nbytes: int       # transfer size for this request
+    dst_node: str     # target node_id (channel router or aggregated router)
+
+
+def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
+    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
+    offset = la - seg.la_base
+
+    if seg.mode == "n_to_one":
+        pa = seg.agg_pa_base + offset
+        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
+
+    # one_to_one
+    requests = []
+    per_ch_size = seg.channel_size
+    for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids):
+        ch_offset = offset % per_ch_size
+        ch_nbytes = nbytes // seg.channel_count
+        pa = pa_base + ch_offset
+        dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
+        requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
+    return requests
+```
+
+BAAW responsibilities:
+
+- Convert logical access → physical request units.
+- Apply mode-dependent fan-out (1:1) or pass-through (n:1).
+- Compute PA and target node.
+
+BAAW non-responsibilities:
+
+- Performing actual data movement.
+- Executing NOC routing.
+- Simulating bandwidth occupation (downstream components' job).
+
+BAAW output is directly usable by the simulator's routing and resource
+model without additional address decoding.
+
+#### D-LA5. PE_DMA `handle_command()` change
+
+Current (VA-based) flow:
+
+```
+DmaReadCmd.src_addr (VA)
+  → MMU.translate(VA) → PA
+  → PhysAddr.decode(PA) → PhysAddr object
+  → resolver.resolve(PhysAddr) → dst_node_id
+  → router.find_path(pe_prefix, dst_node_id) → path
+  → 1 sub-Transaction → fabric inject
+```
+
+LA-based flow:
+
+```
+DmaReadCmd.src_addr (LA)
+  → BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
+  → for each PhysicalRequest:
+      → router.find_path(pe_prefix, req.dst_node) → path
+      → compute_drain_ns(path, req.nbytes) → drain
+      → sub-Transaction → fabric inject
+  → await all sub-Transactions
+  → pe_txn.done.succeed()
+```
+
+Key changes:
+
+- MMU reference removed → BAAW resolve.
+- `PhysAddr.decode()` + `resolver.resolve()` → BAAW returns `dst_node`
+  directly.
+- 1 request → N parallel requests in 1:1 mode.
+
+#### D-LA6. 1:1 mode detail
+
+- One logical access → N physical requests (N = `channels_per_pe`).
+- N = `hbm_pseudo_channels / pes_per_cube`.
+- Each request: fully-resolved 51-bit PA, targets a specific channel
+  router (`{pe_prefix}.ch_r{channel_id}`).
+- Per-channel link models BW contention.
+- PE_DMA injects N sub-transactions concurrently.
+
+Example: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`.
+PE0 owns ch0-7.
+
+```text
+Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "one_to_one", channel_count: 8,
+    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
+    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
+    channel_size: 512,
+}
+
+BAAW resolve result (8 requests):
+  → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
+  → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
+  → ...
+  → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
+
+PE_DMA: injects 8 sub-transactions in parallel
+  per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
+  Total effective BW = 8 × channel_bw_gbs
+```
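The 4 KB example above can be reproduced with a standalone version of the 1:1 branch of `resolve()`. This is a sketch under stated assumptions: `resolve_one_to_one` is a hypothetical free function that mirrors the D-LA4 arithmetic, and the numeric `PA_BASES` values are invented because the ADR leaves `PA_ch0` to `PA_ch7` symbolic. How an access whose size is not a multiple of the channel count should be split is not specified in the ADR, so the sketch does not address it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalRequest:
    pa: int
    nbytes: int
    dst_node: str


def resolve_one_to_one(la, nbytes, *, la_base, channel_size,
                       pa_bases, channel_ids, pe_prefix):
    """Fan one logical access out to one request per channel (D-LA4 logic)."""
    offset = la - la_base
    ch_offset = offset % channel_size
    ch_nbytes = nbytes // len(pa_bases)
    return [
        PhysicalRequest(pa_base + ch_offset, ch_nbytes,
                        f"{pe_prefix}.ch_r{ch_id}")
        for pa_base, ch_id in zip(pa_bases, channel_ids)
    ]


# Hypothetical per-channel PA bases (not taken from the ADR).
PA_BASES = [0x4000_0000 + i * 0x0100_0000 for i in range(8)]

requests = resolve_one_to_one(
    0x1_0000_0000, 4096,
    la_base=0x1_0000_0000, channel_size=512,
    pa_bases=PA_BASES, channel_ids=list(range(8)),
    pe_prefix="sip0.cube0.pe0",
)
for req in requests:
    print(f"{req.dst_node}: pa={req.pa:#x} nbytes={req.nbytes}")
```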
+
+Other N values:
+
+- `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`,
+  4 requests
+- `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`,
+  16 requests
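The N values listed above follow from a single division. The sketch below recomputes them; the function names are hypothetical, the even-division check is an added assumption, and the contiguous ownership rule in `owned_channels` is extrapolated from the single data point "PE0 owns ch0-7".

```python
def channels_per_pe(hbm_pseudo_channels: int, pes_per_cube: int) -> int:
    """N in D-LA6: pseudo channels shared evenly among the PEs of a cube."""
    if hbm_pseudo_channels % pes_per_cube != 0:
        # Assumption: the ADR only shows configurations that divide evenly.
        raise ValueError("pseudo channels must divide evenly across PEs")
    return hbm_pseudo_channels // pes_per_cube


def owned_channels(pe_index: int, n: int) -> list:
    # Assumption: PE k owns the contiguous block [k*n, (k+1)*n).
    return list(range(pe_index * n, (pe_index + 1) * n))


for total, pes in [(64, 8), (32, 8), (64, 4)]:
    n = channels_per_pe(total, pes)
    print(f"hbm_pseudo_channels={total} pes_per_cube={pes} -> {n} requests")
```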
+
+#### D-LA7. n:1 mode detail
+
+- One logical access → one aggregated request.
+- Target: aggregated router → hbm_ctrl (see ADR-0017 D8).
+- Aggregated link BW = `channels_per_pe × channel_bw_gbs`
+  (e.g. 8 × 32 = 256 GB/s).
+- Single queue / resource for modelling.
+- No per-channel PA decomposition.
+
+```text
+Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "n_to_one",
+    agg_pa_base: PA_agg,
+    agg_node_id: "sip0.cube0.pe0.agg_router",
+}
+
+BAAW resolve result:
+  → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
+
+PE_DMA: 1 sub-transaction
+  aggregated router → hbm_ctrl link (256 GB/s)
+```
+
+#### D-LA8. Kernel model preserved
+
+- Kernel still issues single memory ops (`tl.load`, `tl.store`,
+  `tl.composite`).
+- LA is the address scheme exposed to kernel code.
+- Channel decomposition / aggregation happens inside PE_DMA's BAAW.
+- Kernel code never sees physical channel information.
+
+#### Consequences (LA model, proposed)
+
+Positive:
+
+- 1:1 vs n:1 semantics live in one place (BAAW).
+- Kernel abstraction preserved — no kernel code changes.
+- Topology-based policy control (mode switch via yaml).
+- Improved simulation-model consistency and debuggability.
+- Segment-based mapping is simpler than page tables; lower overhead.
+
+Negative:
+
+- Full VA/MMU code refactor required.
+- Request-generation path more complex (N requests in 1:1 mode).
+- Reduced per-channel visibility in n:1 mode.
+- VA-related tests need rewriting.
+
+---
+
+## Migration Path
+
+- **PA → VA** was an extension. PA mode is retained as the PageFault
+  fallback inside PE_DMA. Switching does not require removing PA
+  code.
+- **VA → LA**, if adopted, is a replacement, not coexistence. See
+  D-LA1 for the VA infrastructure removal list. PA fallback inside
+  PE_DMA may be retained orthogonally for tests.
+
+## Alternatives Considered (LA model)
+
+1. **Keep VA + fan-out in MMU**: MMU returns per-channel PAs.
+   Rejected: MMU's role would grow beyond translation to request
+   decomposition; aggregation (n:1) becomes awkward to express.
+2. **Channel-aware kernel API**: kernels call per-channel load/store
+   directly. Rejected: abstraction leakage, portability loss, all
+   benchmarks need rewriting.
+3. **Always PA (no LA)**: runtime passes per-channel PA to kernel
+   directly. Rejected: incompatible with aggregation; conversion
+   timing unclear; channel info leaks to kernel.
+
+## Test Requirements
+
+### VA model (current, regression)
+
+- Cross-PE / cross-cube DMA paths over installed mappings.
+- `MmuMapMsg` / `MmuUnmapMsg` fabric traversal with measured latency.
+- TLB-overhead-per-access timing.
+- PageFault fallback path preserves PA-only behaviour.
+
+### LA model (when implemented)
+
+- 1:1 mode: same logical access → N per-channel requests.
+- n:1 mode: same logical access → 1 aggregated request.
+- Bandwidth equivalence between modes for identical workload.
+- 1:1 mode: per-channel contention modelled correctly.
+- n:1 mode: aggregated bandwidth correctly reflected.
+- Kernel code unchanged across mode switch.
+- BAAW segment install / uninstall correctness.
+- Multiple tensors in distinct segments do not collide.
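The bandwidth-equivalence requirement can be checked by hand for the 4 KB example. The sketch below is idealised: `drain_ns` is a hypothetical helper (not the project's `compute_drain_ns`), it treats 1 GB/s as 1 byte/ns, and it ignores hop latency, switching overhead, and contention from other traffic.

```python
def drain_ns(nbytes: int, bw_gbs: float) -> float:
    # With GB = 1e9 bytes, 1 GB/s moves exactly 1 byte per nanosecond.
    return nbytes / bw_gbs


CHANNELS = 8
CHANNEL_BW_GBS = 32.0
NBYTES = 4096

# 1:1 mode: 8 parallel requests of 512 B, each on its own 32 GB/s link.
one_to_one_ns = max(
    drain_ns(NBYTES // CHANNELS, CHANNEL_BW_GBS) for _ in range(CHANNELS)
)

# n:1 mode: one 4096 B request on the aggregated 256 GB/s link.
n_to_one_ns = drain_ns(NBYTES, CHANNELS * CHANNEL_BW_GBS)

print(one_to_one_ns, n_to_one_ns)
```

Under these assumptions both modes drain in the same time, which is the property the equivalence test would assert before contention is introduced.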
+
+## Implementation Order (LA, when scheduled)
+
+1. LA allocator (`policy/address/la_allocator.py`).
+2. BAAW segment table (`policy/address/baaw.py`).
+3. `BaawSegmentInstallMsg` (`runtime_api/kernel.py`).
+4. PE_DMA BAAW integration (`components/builtin/pe_dma.py`
+   `handle_command()`).
+5. RuntimeContext: LA alloc + segment install
+   (`runtime_api/context.py`).
+6. `Tensor.va_base` → `Tensor.la_base` (`runtime_api/tensor.py`).
+7. Remove VA/MMU code.
+8. Remove `pe_mmu` from `topology.yaml`; add mapping mode settings.
+9. Test migration:
+
+| Test file | Action |
+|-----------|--------|
+| `tests/test_mmu_component.py` | Remove → BAAW segment install tests |
+| `tests/test_mmu_fabric.py` | Remove → BAAW + fabric integration tests |
+| `tests/test_pe_mmu.py` | Remove |
+| `tests/test_va_allocator.py` | Replace with LA allocator tests |
+| `tests/test_va_integration.py` | Replace with LA + BAAW integration tests |
+| `tests/test_va_offset.py` | Replace with LA offset tests |
+
+## Links
+
+- ADR-0007 (runtime_api vs sim_engine boundaries)
+- ADR-0008 (tensor deployment)
+- ADR-0009 (kernel execution)
+- ADR-0013 (Verification strategy — V1 PA tagging)
+- ADR-0014 (PE-internal execution model)
+- ADR-0015 (component port/wire model)
+- ADR-0017 (Cube NOC and HBM connectivity — LA model topology consumer)
+- SPEC R2 (latency by traversal), R10 (memory addressing)
@@ -1,100 +0,0 @@
-# ADR-0011: Memory Addressing — PA-first with VA/MMU Extension
-
-## Status
-
-Accepted (Phase 1 VA/MMU implemented)
-
-## Context
-
-A realistic system uses host-side virtual addressing and an MMU/IOMMU-style
-translation path for DMA: host allocates physical memory at PE level, maps it
-into a virtual address space, installs mappings, and DMA requests use virtual
-addresses that are translated to physical addresses.
-
-The PA-only model (Phase 0) was insufficient for running standard Triton kernels
-that use `base_addr + offset` patterns on sharded tensors — each PE's shard has
-a different PA, but the kernel needs a single contiguous address space.
-
---
-
-## Decision
-
-### D1. Phase 0 model is PA-only (original, retained as fallback)
-
- All device memory accesses (MemoryRead/MemoryWrite) operate on device physical
-  addresses (PA) plus size.
- PA-only mode remains functional via PageFault fallback in PE_DMA.
-
-### D2. Allocation produces PA mappings
-
-Device allocation selects PE-local memory regions and returns PA mappings
-sufficient to execute kernels and issue DMA requests.
-
-### D3. Phase 1: VA/MMU layer (implemented)
-
-#### D3.1 Virtual Address Model
-
- Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
- `TensorShard` does NOT carry a `va` field — shard VA is derived as
-  `va_base + offset_bytes`.
- Kernels receive `va_base` as their pointer argument (via `TensorArg.va_base`).
- `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).
-
-#### D3.2 PE_MMU Component
-
- Hybrid design: SimPy component (inbox for MmuMapMsg) + utility (synchronous
-  `translate()` called by PE_DMA).
- Page-aligned dict lookup for O(1) VA→PA translation.
- `tlb_overhead_ns` configurable per-access latency.
- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA directly
-  (backward compatibility with PA-only tests).
-
-#### D3.3 Mapping Installation
-
- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube fan-out) →
-  M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured end-to-end.
- `MmuMapMsg.target_sips` controls SIP-level routing to prevent cross-SIP
-  mapping contamination for replicated tensors.
- Mapping strategy based on `DPPolicy.cube`:
-  - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping only.
-    Each cube's PEs see only their local PA. No cross-cube mapping installed.
-  - **Sharded** (`cube="column_wise"`, etc.): broadcast all shard mappings to all
-    target cubes. Enables cross-PE and cross-cube DMA.
-
-#### D3.4 Tensor Lifecycle
-
- `del tensor` triggers automatic cleanup via `Tensor.__del__` + `weakref` to
-  RuntimeContext. Sends `MmuUnmapMsg` through fabric, returns VA and PA space.
- `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
- `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
- `PEMemAllocator` uses free-list with coalescing (not bump allocator).
- `VirtualAllocator` uses free-list with coalescing for VA space.
-
-#### D3.5 Allocators
-
- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free with
-  coalescing.
- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with coalescing.
- Page size configurable via `topology.yaml` pe_mmu attrs (default 4096).
-
---
-
-## Consequences
-
- Triton kernels use `base_addr + offset` patterns naturally on sharded tensors.
- All latency remains explicit via graph traversal, including MMU mapping
-  installation and per-access TLB overhead.
- PA-only mode retained as fallback (PageFault → treat as PA).
- Benchmark parameter renamed `ctx` → `torch` for PyTorch code compatibility.
- IPCQ and other fixed-address resources bypass MMU (use PA directly).
-
---
-
-## Links
-
- ADR-0007 (runtime_api vs sim_engine boundaries)
- ADR-0008 (tensor deployment)
- ADR-0009 (kernel execution)
- ADR-0014 (PE-internal execution model)
- ADR-0015 (component port/wire model)
- SPEC R2 (latency by traversal)
@@ -18,7 +18,7 @@ We define stable, minimal message schemas for Host ↔ IO_CPU so that:
 - IO_CPU-internal fan-out/aggregation can evolve independently,
 - completion and failure propagation is deterministic.

-We also require PE-tagging (A 방식): each shard explicitly carries (sip,cube,pe)
+We also require PE-tagging (Scheme A): each shard explicitly carries (sip,cube,pe)
 so IO_CPU can deterministically route/fan-out without relying on PA decoding.

 ---
@@ -93,7 +93,7 @@ Rules:
 Mandatory fields:

 - common envelope fields (D3)
- destination placement tags (A 방식):
+- destination placement tags (Scheme A):
  - `dst_sip: int`
  - `dst_cube: int`
  - `dst_pe: int`
@@ -130,7 +130,7 @@ Notes:
 Mandatory fields:

 - common envelope fields (D3)
- source placement tags (A 방식):
+- source placement tags (Scheme A):
  - `src_sip: int`
  - `src_cube: int`
  - `src_pe: int`
@@ -183,7 +183,7 @@ Tensor arg (mandatory):

 - `shards: list[TensorShard]`

-`TensorShard` MUST have (A 방식 강제):
+`TensorShard` MUST have (Scheme A enforced):

 - `sip: int`
 - `cube: int`
@@ -226,7 +226,8 @@ Tests SHOULD validate:

 ## Links

- ADR-0011 (PA-first memory addressing)
+- ADR-0011 (Memory Addressing — PA / VA / LA)
 - ADR-0007 (runtime_api vs sim_engine boundaries)
 - ADR-0009 (kernel execution fan-out/aggregation)
+- ADR-0013 (Verification strategy — V1 message schema validation)
 - SPEC R2, R7, R8
@@ -134,6 +134,6 @@ Phase 2 (Apply) MUST:
 ## Links

 - SPEC 0.1, R2, R6
- ADR-0011 (PA-first memory addressing)
+- ADR-0011 (Memory Addressing — PA / VA / LA)
 - ADR-0012 (Host ↔ IO_CPU message schema)
 - ADR-0009 (Kernel execution semantics)
@@ -0,0 +1,451 @@
+# ADR-0014: PE Pipeline Execution Model
+
+## Status
+
+Accepted
+
+## Context
+
+This ADR defines the PE-internal kernel execution model:
+
+- Role decomposition of PE-internal components
+- Command dispatch paths (simple / composite / multi-op composite with epilogue)
+- TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
+- TCM-centric dataflow with a register-file intermediary
+- Engine resource model
+- Observability and trace contract
+- Topology representation
+
+PE-internal structure (7 components in scope; 2 cross-referenced):
+
+- `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`,
+  `pe_tcm` — defined here
+- `pe_mmu` — VA model, defined in ADR-0011 D-VA
+- `pe_ipcq` — collective communication, defined in ADR-0023
+
+The goal is a deterministic, trace-friendly execution contract that keeps
+each block independently swappable.
+
+## Decision
+
+### D1. PE-internal component roles
+
+**PE_CPU**
+
+- Executes kernel instruction stream / control logic.
+- Generates PE commands and submits them to `PE_SCHEDULER` (via
+  `PeInternalTxn`).
+- Does NOT enqueue work directly into engine queues.
+
+**PE_SCHEDULER**
+
+- Sole dispatcher inside a PE.
+- Receives commands from `PE_CPU`. Dispatch by command type:
+  - Simple command (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`)
+    → forward directly to the target engine.
+  - `CompositeCmd` → generate a `TilePlan`, feed tiles into the pipeline
+    via a single `_feed_loop` (D6).
+- Does not participate in stage-to-stage chaining within a composite;
+  that is handled by token self-routing (D6).
+
+**PE_DMA**
+
+- Handles memory transfers between TCM and external memory domains
+  (HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
+- Two execution channels:
+  - `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1) — see D4.
+- Additional virtual channels:
+  - `vc_compute` — load/store/writeback traffic for GEMM/MATH tiles.
+  - `vc_comm` — IPCQ collective send data (defined in ADR-0023 D8).
+
+**PE_FETCH_STORE**
+
+- TCM ↔ Register File transfer unit.
+- Isolates register-file access semantics from compute engines so that
+  GEMM/MATH stay pure compute components.
+- BW-based latency model; TCM access contention naturally serializes
+  through `PE_TCM`'s BW resource.
+
+**PE_GEMM**
+
+- MAC array. Reads operands from the register file; writes results to
+  the register file. Does not touch `PE_TCM` directly.
+
+**PE_MATH**
+
+- Element-wise / reduction / SIMD unit. Reads / writes the register file.
+
+**PE_TCM**
+
+- Tightly-coupled scratchpad with BW-serialized access. Two logical
+  regions partitioned by ownership (see D5).
+
+**Cross-referenced components** (defined elsewhere):
+
+- `pe_mmu` — VA→PA translation per access (ADR-0011 D-VA).
+- `pe_ipcq` — collective ring buffers and peer endpoint metadata
+  (ADR-0023).
+
+### D2. Command lifecycle and queues
+
+`PE_SCHEDULER` maintains three logical structures:
+
+**SubmissionQueue** — written by `PE_CPU`; consumed by the scheduler.
+
+**InflightTable** — owned and mutated only by `PE_SCHEDULER`; tracks
+expanded sub-commands, dependency state, engine assignment, and
+completion status.
+
+**CompletionQueue** — written by `PE_SCHEDULER`; holds final completion
+records.
+
+**Single-writer rule**: only `PE_SCHEDULER` mutates command completion
+state. Engines report completion via explicit events / messages
+consumed by the scheduler.
+
+**Command completion**: when all sub-commands complete, `PE_SCHEDULER`
+publishes a completion record.
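The single-writer rule and the "publish when all sub-commands complete" behaviour can be sketched as a small table that only the scheduler touches. The class and method names are hypothetical; the ADR names the three structures but does not define their interfaces.

```python
class InflightTable:
    """Hypothetical scheduler-owned table of outstanding sub-commands."""

    def __init__(self) -> None:
        self._pending = {}          # cmd_id -> set of unfinished sub_ids
        self.completion_queue = []  # final completion records, in order

    def track(self, cmd_id: str, sub_ids) -> None:
        # Called by the scheduler when it expands a submitted command.
        self._pending[cmd_id] = set(sub_ids)

    def on_engine_complete(self, cmd_id: str, sub_id: str) -> None:
        # Engines only report; the scheduler is the one that mutates state.
        pending = self._pending[cmd_id]
        pending.remove(sub_id)  # KeyError exposes a duplicate report
        if not pending:
            del self._pending[cmd_id]
            self.completion_queue.append(cmd_id)


table = InflightTable()
table.track("cmd0", ["dma_read", "gemm", "dma_write"])
for sub in ["dma_read", "gemm", "dma_write"]:
    table.on_engine_complete("cmd0", sub)
print(table.completion_queue)
```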
+
+### D3. Dispatch modes
+
+#### D3.1 Simple command
+
+A simple command expands to exactly one engine sub-command:
+
+- `DmaReadCmd` / `DmaWriteCmd` → `PE_DMA`
+- `GemmCmd` → `PE_GEMM`
+- `MathCmd` → `PE_MATH`
+
+Flow:
+
+```text
+PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
+       → completion → PE_SCHEDULER → CompletionQueue
+```
+
+#### D3.2 Composite command (single-op tiled pipeline)
+
+The default `CompositeCmd` runs a single compute op as a tile-pipelined
+sequence:
+
+```text
+DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE
+```
+
+`PE_SCHEDULER` splits the DMA payload into hardware tiles and emits one
+`TileToken` per tile with a monotonically increasing `tile_id`.
+
+Tile dependency (within one tile `t`):
+
+```text
+DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)
+```
+
+Inter-tile overlap is allowed wherever engine resources permit
+(D4 governs the constraints):
+
+```text
+DMA_READ(t+1) ∥ COMPUTE(t)
+DMA_WRITE(t-1) ∥ COMPUTE(t)
+```
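The benefit of inter-tile overlap can be estimated with a simple recurrence: a tile enters a stage once it has left the previous stage and the stage has finished the previous tile. The sketch assumes each of the five stages is an independent capacity-1 resource and that tiles stay in order; the stage durations are invented, and the model ignores the fact that FETCH and STORE both run on `PE_FETCH_STORE`.

```python
def pipeline_finish_times(num_tiles: int, stage_ns):
    """Return finish[t][s]: time at which tile t leaves stage s."""
    finish = []
    for t in range(num_tiles):
        row = []
        for s, dur in enumerate(stage_ns):
            ready = row[s - 1] if s > 0 else 0.0        # tile left stage s-1
            free = finish[t - 1][s] if t > 0 else 0.0   # stage done with t-1
            row.append(max(ready, free) + dur)
        finish.append(row)
    return finish


# Hypothetical durations: DMA_READ, FETCH, COMPUTE, STORE, DMA_WRITE.
STAGE_NS = [4, 1, 6, 1, 4]
finish = pipeline_finish_times(3, STAGE_NS)
pipelined_ns = finish[-1][-1]
serial_ns = 3 * sum(STAGE_NS)
print(pipelined_ns, serial_ns)
```

With these numbers the compute stage is the bottleneck, so the makespan approaches one full tile latency plus one compute duration per additional tile.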
+
+#### D3.3 Multi-op composite (head + epilogue with scope)
+
+A `CompositeCmd` MAY carry `ops: tuple[OpSpec, ...]` to express a
+multi-op pipeline:
+
+```python
+@dataclass(frozen=True)
+class OpSpec:
+    kind: str         # "gemm" | "math.exp" | "math.bias_add" | ...
+    scope: Scope      # "per_k_tile" | "per_output_tile" | "once"
+    ...
+```
+
+- `ops[0]` (head) defines tile geometry (e.g., the head GEMM determines
+  M/K/N partition).
+- `ops[1:]` (epilogue) are subsequent stages whose `scope` decides how
+  often they fire:
+  - `per_k_tile` — every K-reduction step.
+  - `per_output_tile` — once per output tile.
+  - `once` — once per kernel.
+
+Cross-engine chains (e.g., GEMM head → MATH epilogue) are natural —
+each stage is dispatched via token self-routing (D6), so GEMM and MATH
+participate serially within the same composite even though they share
+the compute slot (D4).
+
+The empty-`ops` form is the legacy single-op path.
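How often an epilogue stage fires under each `scope` can be sketched as a count over the head GEMM's tile grid. The function name and the `m_tiles` / `n_tiles` / `k_tiles` parameters are hypothetical, and the sketch assumes that "every K-reduction step" means one firing per K tile of every output tile.

```python
def epilogue_fire_count(scope: str, m_tiles: int, n_tiles: int,
                        k_tiles: int) -> int:
    """Number of times an epilogue op fires for a tiled GEMM head."""
    output_tiles = m_tiles * n_tiles
    if scope == "per_k_tile":
        return output_tiles * k_tiles   # every K-reduction step
    if scope == "per_output_tile":
        return output_tiles             # once per output tile
    if scope == "once":
        return 1                        # once per kernel
    raise ValueError(f"unknown scope: {scope!r}")


for scope in ("per_k_tile", "per_output_tile", "once"):
    print(scope, epilogue_fire_count(scope, m_tiles=2, n_tiles=3, k_tiles=4))
```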
+
+### D4. Engine resource model
+
+**DMA engine**:
+
+- `DMA_READ`: `simpy.Resource(capacity=1)`.
+- `DMA_WRITE`: `simpy.Resource(capacity=1)`.
+- Both channels run concurrently (READ ∥ WRITE allowed).
+- Within a channel, requests serialize (READ ∥ READ disallowed; same
+  for WRITE).
+- `vc_comm` is an orthogonal channel for IPCQ traffic defined in
+  ADR-0023 D8 — out of scope for this ADR.
+
+**Compute engine**:
+
+- `accel_slot`: `simpy.Resource(capacity=1)` shared by `PE_GEMM` and
+  `PE_MATH`.
+- At most one compute op runs at a time within a PE.
+- Multi-op composite chains (D3.3) execute their compute stages serially
+  through this slot; token self-routing (D6) ensures the next stage
+  starts only after the previous compute releases the slot.
+
+**Engine completion**: each engine emits a completion event consumed by
+the scheduler / `PipelineContext` (D6).
+
+### D5. Dataflow
+
+**Input path (HBM source)**:
+
+```text
+HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
+PE_TCM → PE_FETCH_STORE → Register File
+Register File → PE_GEMM | PE_MATH
+```
+
+**Input path (shared SRAM source)**:
+
+```text
+Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
+PE_TCM → PE_FETCH_STORE → Register File
+```
+
+**Output path (HBM destination)**:
+
+```text
+Register File → PE_FETCH_STORE → PE_TCM
+PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM
+```
+
+GEMM/MATH never touch `PE_TCM` directly — `PE_FETCH_STORE` is the
+single TCM↔register-file gateway. This makes TCM BW contention
+explicit and lets fetch unit policies (e.g., prefetch) be replaced
+independently of compute engines.
+
+#### D5.1 PE_TCM partitioning
+
+`PE_TCM` is split into two logical regions:
+
+**SchedulerReservedTCM**
+
+- Owned exclusively by `PE_SCHEDULER`.
+- Holds composite-command tile buffers.
+- `PE_SCHEDULER` partitions this region, assigns buffers per DMA_READ /
+  COMPUTE / DMA_WRITE stage, guarantees input/output separation, and
+  manages tile-buffer lifetimes.
+
+**AllocatableTCM**
+
+- General-purpose region managed by `PEMemAllocator`.
+- Used for host / DP-visible allocations.
+
+**Visibility rule (hard isolation)**: `PEMemAllocator` MUST NOT see or
+allocate inside `SchedulerReservedTCM`. The reserved region is excluded
+from allocator-managed ranges by construction.
+
+**Tile buffer rules**:
+
+- Input and output buffers within `SchedulerReservedTCM` MUST NOT
+  overlap during a tile's active lifetime.
+- A tile buffer remains valid until the corresponding `DMA_WRITE`
+  completes.
+- Buffer reuse is permitted only after the consuming tile's lifetime
+  ends.
+
+### D6. TileToken self-routing pipeline
+
+A composite's stage-to-stage progression happens **without** routing
+through the scheduler. Each component forwards the token directly to
+the next stage's component using the token's `plan`:
+
+```text
+Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
+              ↑ chaining: no scheduler hop                          ↑
+                                                  PipelineContext.complete_tile()
+```
+
+This mirrors real-HW done-wire chains. The scheduler handles only
+**initial dispatch + completion aggregation**.
+
+#### TilePlan / Stage
+
+```python
+from dataclasses import dataclass
+from enum import Enum
+
+
+class StageType(Enum):
+    DMA_READ = 0
+    FETCH = 1
+    GEMM = 2
+    MATH = 3
+    STORE = 4
+    DMA_WRITE = 5
+
+@dataclass(frozen=True)
+class Stage:
+    stage_type: StageType
+    component: str         # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
+    params: dict           # stage-specific parameters
+
+@dataclass(frozen=True)
+class TilePlan:
+    tile_id: int
+    stages: tuple[Stage, ...]
+```
+
+#### TileToken
+
+```python
+@dataclass
+class TileToken:
+    tile_id: int
+    pipeline_ctx: PipelineContext
+    plan: TilePlan
+    stage_idx: int
+    params: dict             # cached current stage params
+    data_op: bool = True     # op_log opt-in (ADR-0020 D4)
+```
+
+Single-owner invariant: a token is owned by exactly one component at a
+time. Lifecycle: scheduler creates with `stage_idx=0` → component
+`_process()` → increment `stage_idx` → put to next stage's `in_port` →
+last stage calls `pipeline_ctx.complete_tile()`.
+
+#### PipelineContext (exactly-once completion)
+
+```python
+@dataclass
+class PipelineContext:
+    id: str
+    total_tiles: int
+    completed_tiles: int = 0
+    done_event: simpy.Event | None = None
+
+    def complete_tile(self) -> None:
+        self.completed_tiles += 1
+        if self.completed_tiles == self.total_tiles:
+            self.done_event.succeed()
+```
+
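+Each tile's last stage MUST call `complete_tile()` exactly once.
+Duplicate calls are bugs (SimPy `Event` can succeed at most once): a
+duplicate made before the final tile lets `completed_tiles` reach
+`total_tiles` early, so `done_event` fires before the pipeline has
+drained.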
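One way to make the exactly-once rule self-checking is to track completed tile ids instead of a bare counter. This is a sketch, not the project's `PipelineContext`: `GuardedPipelineContext`, the `tile_id` argument, and `OneShotEvent` (a stand-in for `simpy.Event` so the example runs without SimPy) are all assumptions.

```python
from dataclasses import dataclass, field


class OneShotEvent:
    """Stand-in for simpy.Event: may be triggered at most once."""

    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        if self.triggered:
            raise RuntimeError("event has already been triggered")
        self.triggered = True


@dataclass
class GuardedPipelineContext:
    id: str
    total_tiles: int
    completed: set = field(default_factory=set)
    done_event: OneShotEvent = field(default_factory=OneShotEvent)

    def complete_tile(self, tile_id: int) -> None:
        # A repeated tile id is reported instead of silently counted.
        if tile_id in self.completed:
            raise RuntimeError(f"tile {tile_id} completed twice")
        self.completed.add(tile_id)
        if len(self.completed) == self.total_tiles:
            self.done_event.succeed()
```

A counter alone cannot tell a duplicate from a genuine completion, which is why the sketch keys on tile ids.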
+
+#### Feed ordering
+
+`PE_SCHEDULER` has exactly one `_feed_loop` process consuming a
+`_pending_feeds` FIFO. Composite commands are enqueued in submission
+order; tile feed for a command runs to completion before the next
+command's feed begins. **Tile-feed interleaving between commands is
+disallowed.**
+
+Within a single command's tiles, downstream pipeline overlap arises
+naturally — earlier tiles progress through later stages while the feeder
+keeps pushing remaining tiles into the first stage queue (SimPy Store
+backpressure governs flow control). If the first-stage queue is full,
+only the feeder blocks; the scheduler worker's inbox processing
+continues.
+
+#### Token routing pattern (base class)
+
+```python
+def _pipeline_worker(self, env):
+    while True:
+        token = yield self._inbox.get()
+        yield from self._process(env, token)       # stage-specific logic
+        next_idx = token.stage_idx + 1
+        if next_idx < len(token.plan.stages):
+            next_stage = token.plan.stages[next_idx]
+            token.stage_idx = next_idx
+            token.params = next_stage.params
+            yield self.out_ports[next_stage.component].put(token)
+        else:
+            token.pipeline_ctx.complete_tile()
+```
+
+Each component implements only `_process()`; chaining lives in the
+base class.
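The routing pattern can be exercised without SimPy by replacing inboxes with plain queues and stepping components round-robin. This is an illustration only: `Token`, `run_pipeline`, and the round-robin loop are hypothetical, stage processing is reduced to a trace entry, and no timing or backpressure is modelled.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Token:
    tile_id: int
    plan: tuple          # component names, one per stage
    stage_idx: int = 0


def run_pipeline(tokens, components):
    """Step every component until all inboxes are empty."""
    inboxes = {name: deque() for name in components}
    trace, completed = [], []
    # Scheduler role: initial dispatch only.
    for tok in tokens:
        inboxes[tok.plan[0]].append(tok)
    progressed = True
    while progressed:
        progressed = False
        for name in components:
            if not inboxes[name]:
                continue
            tok = inboxes[name].popleft()
            trace.append((name, tok.tile_id))     # stands in for _process()
            next_idx = tok.stage_idx + 1
            if next_idx < len(tok.plan):
                tok.stage_idx = next_idx
                inboxes[tok.plan[next_idx]].append(tok)  # no scheduler hop
            else:
                completed.append(tok.tile_id)     # complete_tile() equivalent
            progressed = True
    return trace, completed


PLAN = ("dma", "fetch", "gemm", "fetch", "dma")
trace, completed = run_pipeline(
    [Token(0, PLAN), Token(1, PLAN)], ["dma", "fetch", "gemm"]
)
print(completed)
```

Each component only looks at the token's plan to find its successor, which is what lets an epilogue stage be added by changing the plan rather than the scheduler.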
+
+### D7. Observability and trace contract
+
+The simulator emits deterministic trace events:
+
+- `command_submitted`
+- `sub_command_dispatched`
+- `engine_start`
+- `engine_complete`
+- `tile_ready`
+- `command_complete`
+
+For identical inputs, trace ordering MUST be deterministic.
+
+### D8. Topology representation
+
+PE-internal components are declared in `cube.pe_template`:
+
+```yaml
+pe_template:
+  components:
+    pe_cpu:         { kind: pe_cpu,         impl: builtin.pe_cpu,         attrs: { overhead_ns: ... } }
+    pe_scheduler:   { kind: pe_scheduler,   impl: builtin.pe_scheduler,   attrs: { overhead_ns: ... } }
+    pe_dma:         { kind: pe_dma,         impl: builtin.pe_dma,         attrs: { rd_engines: 1, wr_engines: 1 } }
+    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
+    pe_gemm:        { kind: pe_gemm,        impl: builtin.pe_gemm,        attrs: { shared_resource: accel_slot, ... } }
+    pe_math:        { kind: pe_math,        impl: builtin.pe_math,        attrs: { shared_resource: accel_slot, ... } }
+    pe_tcm:         { kind: pe_tcm,         impl: builtin.pe_tcm,         attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
+    pe_mmu:         { kind: pe_mmu,         impl: builtin.pe_mmu,         attrs: { ... } }   # ADR-0011 D-VA
+    pe_ipcq:        { kind: pe_ipcq,        impl: builtin.pe_ipcq,        attrs: { ... } }   # ADR-0023
+  links:
+    # Scheduler dispatch edges (initial)
+    scheduler_to_dma_mm:         0.0
+    scheduler_to_fetch_store_mm: 0.0
+    scheduler_to_gemm_mm:        0.0
+    scheduler_to_math_mm:        0.0
+    # Pipeline chaining edges (token self-routing per D6)
+    dma_to_fetch_store_mm:       0.0
+    fetch_store_to_gemm_mm:      0.0
+    fetch_store_to_math_mm:      0.0
+    gemm_to_fetch_store_mm:      0.0
+    gemm_to_math_mm:             0.0
+    math_to_fetch_store_mm:      0.0
+    fetch_store_to_dma_mm:       0.0
+    fetch_store_to_tcm_bw_gbs:   ...
+```
+
+Template is instantiated once per PE. PE instances are derived from
+`cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔
+cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).
+
+## Consequences
+
+### Positive
+
+- Each block is an independent topology node — individually swappable
+  via DI (ADR-0015).
+- PE-internal structure is visible in the topology graph.
+- Components do not know their downstream — plan-based routing gives
+  flexibility (e.g., epilogue chains require no scheduler change).
+- DMA and compute overlap naturally via SimPy Store backpressure.
+- Multi-op composite expresses fused operations (e.g., GEMM + bias_add)
+  without engine-level coupling.
+- TCM access contention is realistic — `PE_FETCH_STORE` is the single
+  TCM↔RF gateway.
+
+### Negative
+
+- Intra-PE component count is higher than a coarser model (7 base + 2
+  cross-referenced) — more topology nodes/edges.
+- Intra-PE token forwarding is explicit in traces (acceptable trade for
+  HW fidelity).
+
+## Links
+
+- ADR-0011 D-VA (PE_MMU component, VA translation)
+- ADR-0015 D4 (component port/wire model)
+- ADR-0020 (greenlet kernel execution / two-pass)
+- ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
+- SPEC R3, R4
@@ -1,365 +0,0 @@
-# ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)
-
-## Status
-
-Accepted
-
-## Context
-
-ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:
-
- the dispatch model inside a PE,
- the responsibilities of PE_SCHEDULER,
- the PE_TCM-centric dataflow contract used by accelerator engines.
-
-We need a deterministic and debuggable PE-internal execution contract that supports:
-
- simple single-engine commands
- composite commands that build a tiled pipeline across DMA and accelerator engines
-
-The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.
-
-## Decision
-
-### D1. PE internal component roles
-
-Each PE contains the following logical components.
-
-**PE_CPU**
-
- Executes kernel instruction stream or kernel control logic.
- Generates PE commands.
- Submits commands to PE_SCHEDULER.
- PE_CPU does NOT enqueue work directly into engine queues.
-
-**PE_SCHEDULER**
-
- The sole dispatcher inside a PE.
- Receives commands from PE_CPU.
- Expands composite commands into sub-commands.
- Tracks dependencies and command state.
- Dispatches work to engine queues.
- Manages tile scheduling for composite commands.
-
-**PE_DMA**
-
- Handles memory transfers between PE_TCM and external memory domains.
- PE_DMA connects to the NOC router mesh at the CUBE level (ADR-0019):
-  - All destinations (HBM, shared SRAM, inter-cube UCIe) are reached via the router mesh
-  - Local HBM access: PE_DMA → local router → hbm_ctrl (switching overhead only)
-  - Remote/shared: PE_DMA → local router → (mesh hops) → destination
- Supported directions include:
-  - HBM → PE_TCM (via router mesh)
-  - PE_TCM → HBM (via router mesh)
-  - PE_TCM → shared SRAM (via router mesh)
-  - PE_TCM → other memory domains (via router mesh, if supported by topology)
-
-**PE_GEMM**
-
- Matrix multiplication engine.
- Reads activations from PE_TCM.
- May stream weights directly from HBM.
-
-**PE_MATH**
-
- Element-wise computation engine.
- Reads and writes PE_TCM.
-
-**PE_TCM**
-
- Local SRAM used as the staging memory for accelerator operations.
-
---
-
-### D2. Command lifecycle and queues
-
-PE_SCHEDULER maintains three logical structures.
-
-**SubmissionQueue**
-
- Written by PE_CPU.
- Contains incoming PE commands waiting to be processed.
-
-**InflightTable**
-
- Owned and mutated only by PE_SCHEDULER.
- Tracks:
-  - expanded sub-commands
-  - dependency state
-  - engine assignment
-  - completion status
-
-**CompletionQueue**
-
- Written by PE_SCHEDULER.
- Contains final completion records for commands.
-
-**Single-writer rule**
-
- Only PE_SCHEDULER is allowed to mutate command completion state.
- Engine components must report completion via explicit completion events/messages.
-
-**Command completion**
-
-A command becomes DONE when:
-
- all sub-commands complete
- PE_SCHEDULER publishes a completion record to CompletionQueue.
-
---
-
-### D3. Dispatch modes
-
-PE commands are divided into two categories.
-
-#### D3.1 Simple command
-
-A simple command expands to exactly one engine sub-command.
-
-Examples include:
-
- DMA transfer
- GEMM compute
- MATH compute
-
-Execution flow:
-
-```text
-PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
-```
-
-#### D3.2 Composite command (tiled pipeline)
-
-Composite commands implement tiled pipelined execution across engines.
-
-Each tile executes the following pipeline:
-
-```text
-Input DMA (READ)
-→ Compute (GEMM or MATH)
-→ Output DMA (WRITE)
-```
-
-**Tiling rule**
-
-If the DMA payload exceeds hardware tile size, PE_SCHEDULER splits the transfer into tiles.
-Each tile is assigned a monotonically increasing `tile_id`.
-
-**Tile dependency rules**
-
-For tile `t`:
-
- Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)`
- Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)`
- All dependencies are enforced by PE_SCHEDULER.
-
-**Overlap policy (Phase 0 default)**
-
-Operations for different tiles may overlap when engine resources permit.
-
-Allowed overlaps:
-
-```text
-DMA_READ(t+1) ∥ COMPUTE(t)
-DMA_WRITE(t−1) ∥ COMPUTE(t)
-DMA_READ(t) ∥ DMA_WRITE(t)
-```
-
-Disallowed overlaps:
-
-```text
-GEMM(t) ∥ GEMM(t′)
-MATH(t) ∥ MATH(t′)
-GEMM(t) ∥ MATH(t′)
-```
-
---
-
-### D4. Engine execution model (Phase 0 default)
-
-Each engine behaves as a deterministic service resource.
-
-**DMA engine**
-
-PE_DMA contains two independent channels.
-
-```text
-DMA_READ capacity  = 1
-DMA_WRITE capacity = 1
-```
-
-Rules:
-
- DMA_READ and DMA_WRITE may execute concurrently.
- Multiple READs cannot overlap.
- Multiple WRITEs cannot overlap.
-
-Example allowed:
-
-```text
-DMA_READ(t+1) ∥ DMA_WRITE(t)
-```
-
-Example not allowed:
-
-```text
-DMA_READ(t) ∥ DMA_READ(t+1)
-DMA_WRITE(t) ∥ DMA_WRITE(t+1)
-```
-
-**Compute engine**
-
-Compute operations share a single compute resource.
-
-```text
-PE_ACCEL capacity = 1
-```
-
-Both GEMM and MATH require this shared compute slot.
-
-Consequences:
-
- GEMM ∥ GEMM not allowed
- MATH ∥ MATH not allowed
- GEMM ∥ MATH not allowed
-
-Only one compute operation can run in a PE at a time.
-
-**Compute opcode restriction**
-
-Composite commands contain one compute opcode only.
-
-Examples:
-
-```text
-COMPOSITE_GEMM
-COMPOSITE_MATH
-```
-
-Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0.
-
-**Engine completion signaling**
-
-Every engine emits a completion event when a sub-command finishes.
-Completion events are delivered to PE_SCHEDULER.
-
---
-
-### D5. Dataflow model
-
-Compute operations use a TCM-centric dataflow model.
-
-**Input path (HBM)**
-
-```text
-HBM → router mesh → PE_DMA (DMA_READ) → PE_TCM
-```
-
-**Input path (shared SRAM)**
-
-```text
-Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
-```
-
-**Compute stage**
-
-Compute engines read input tensors from PE_TCM.
-
-```text
-PE_TCM → GEMM / MATH
-```
-
-Weights for GEMM may optionally stream directly from HBM (via router mesh).
-
-**Output path (HBM)**
-
-Compute results are written to PE_TCM, then DMA writes to HBM.
-
-```text
-PE_TCM → PE_DMA (DMA_WRITE) → router mesh → HBM
-```
-
-**Output path (shared SRAM)**
-
-```text
-PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
-```
-
-#### D5.1 PE_TCM partitioning and ownership boundary
-
-The PE_TCM address space is partitioned into two logical regions.
-
-**SchedulerReservedTCM**
-
- A staging region owned exclusively by PE_SCHEDULER.
- This region is used for composite command tile buffers.
- PE_SCHEDULER:
-  - partitions this region into tile buffers
-  - assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
-  - guarantees input/output buffer separation
-  - manages tile buffer lifetime
-
-**AllocatableTCM**
-
- General-purpose region managed by PEMemAllocator.
- Used by host or DP-visible allocations.
-
-**Visibility rule (hard isolation)**
-
- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
- SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
- This prevents DP or host allocations from interfering with scheduler staging buffers.
-
-**Tile buffer rules**
-
-Within SchedulerReservedTCM:
-
- input buffers and output buffers must not overlap
- PE_SCHEDULER assigns tile buffers for DMA and compute stages
- tile buffers remain valid until the corresponding DMA_WRITE completes
- Buffer reuse is allowed only after the tile lifetime finishes.
-
---
-
-### D6. Observability and trace contract
-
-The simulator must emit deterministic trace events.
-
-Required events include:
-
- `command_submitted`
- `sub_command_dispatched`
- `engine_start`
- `engine_complete`
- `tile_ready`
- `command_complete`
-
-Trace ordering must be deterministic for identical inputs.
-
---
-
-### D7. Topology representation
-
-PE internal components are declared in `cube.pe_template`.
-
-The template is instantiated once per PE.
-
-PE instances are derived from `cube.pe_layout`.
-
-External connectivity such as:
-
- PE_DMA → router mesh → HBM (data path, ADR-0019)
- PE_DMA → router mesh → shared SRAM, inter-cube UCIe (non-HBM data path)
- router mesh → PE_CPU (command path from M_CPU)
-
-is modeled at the CUBE level (see ADR-0003 D3).
-
---
-
-## Links
-
- SPEC R3, R4
- ADR-0003 D4 (PE-level system hierarchy)
- ADR-0005 View C (PE-level diagram)
- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)
@@ -6,20 +6,19 @@ Accepted

 ## Context

-ADR-0007 D2 assigns path-walking and low-level request decomposition to the simulation engine.
-In practice, the engine iterates the topology path and calls `run()` on each component
-sequentially — conflating routing policy with component behavior and preventing realistic
-hardware modeling (queues, contention, fan-out).
-
-ADR-0007 D3 already states that components own fan-out and aggregation, but the current
-implementation does not enforce this for fabric traversal.
+Realistic hardware modeling — queues, contention, fan-out — requires
+that components own fabric traversal while the simulation engine
+handles only initialization and completion observation. Direct method
+calls between components, or path-walking inside the engine, defeat
+queueing and contention semantics.

 This ADR defines:

 - how components communicate via typed port queues,
 - how propagation delay is modeled (wire processes with BW occupancy),
- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch (via M_CPU),
- the reduced role of the simulation engine,
+- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch
+  (via M_CPU),
+- the engine's reduced role (wire init + completion observation only),
 - M_CPU.DMA as an internal subcomponent of M_CPU.

 ---
@@ -88,9 +87,6 @@ The simulation engine MUST NOT:
 - call component `run()` methods directly,
 - track per-hop latency or decompose fan-out.

-This supersedes ADR-0007 D2's "decompose operations into low-level requests" clause.
-ADR-0007 D2 must be amended accordingly.
-
 ---

 ### D4. Fabric paths for Memory R/W and Kernel Launch
@@ -192,16 +188,15 @@ It is used for shard comparison in `_route_kernel` and as a regression guard.
 - Propagation delay is modeled accurately per edge.
 - Engine is decoupled from routing policy.
 - Component implementations remain swappable via DI (ADR-0007 D3).
- ADR-0007 D2 must be amended to remove path-walking from engine responsibilities.
- ADR-0009 D3 should be updated to reference the unified fabric path (D4 above).

 ---

 ## Links

- ADR-0007 D2 (to be amended: engine path-walking clause)
- ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced)
+- ADR-0007 D2 (engine role boundary)
+- ADR-0009 D3 (kernel execution fan-out hierarchy)
 - ADR-0014 D4 (DMA engine capacity=1)
 - ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
 - ADR-0016 (IOChiplet NOC and memory data path)
 - ADR-0017 (cube NOC 2D mesh architecture)
+- ADR-0033 (Latency model assumptions built on these mechanisms)
@@ -1,189 +0,0 @@
-# ADR-0017: Cube NOC 2D Mesh Architecture
-
-## Status
-
-Accepted
-
-## Context
-
-ADR-0003 D3 defines the cube-level NOC as a "distributed on-die fabric" but
-does not specify the internal routing model, contention semantics, or
-attachment topology. The implementation uses a 2D mesh router grid with
-XY routing and per-segment contention modeling. This ADR formalizes that
-architecture.
-
-## Decision
-
-### D1. NOC node and router grid
-
-Each cube contains a 2D router mesh generated by `mesh_gen.py`.
-Each router is a separate topology node (`sip{S}.cube{C}.r{row}c{col}`)
-implemented as `forwarding_v1`. (Supersedes the original single-node
-`noc_2d_mesh_v1` design — see ADR-0019.)
-
-Grid properties:
-
- Default dimensions: 6x6 routers (derived from PE layout + UCIe connections)
- Router naming: `r{row}c{col}` (e.g., `r0c0`, `r5c5`)
- HBM exclusion zone: center rows/columns are excluded where HBM physically
-  occupies space (e.g., r2c2, r2c3, r3c2, r3c3)
- Router positions are derived from physical PE corner placement and cube
-  geometry
-
-The NOC overhead_ns is 0.0. Latency is modeled by Manhattan distance
-traversal within the mesh (distance_mm x ns_per_mm).
-
-### D2. XY routing algorithm
-
-The NOC uses deterministic XY routing:
-
-1. Horizontal segment: route from source X to destination X at source Y
-2. Vertical segment: route from destination X at source Y to destination Y
-
-Each directed segment is identified by a unique link key:
-
- Horizontal: `("H", y_band, x_min, x_max, direction)`
- Vertical: `("V", x_band, y_min, y_max, direction)`
-
-Grid positions are snapped to the router grid, excluding the HBM zone.
-
-### D3. Contention model
-
-Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
-sharing a segment (same row or column band, same direction) contend for the
-resource. This models link-level serialization in a wormhole-routed mesh.
-
-With no contention, NOC traversal latency equals the Manhattan distance
-multiplied by `ns_per_mm`. Under contention, additional queueing delay
-is added by SimPy's resource scheduling.
-
-### D4. NOC attachment points
-
-The NOC connects to all major cube-level components:
-
-```text
-                    UCIe-N (conn x4)
-                         |
-           +---------+---+---+---------+
-           |         |       |         |
-PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
-PE0.cpu <--+         |       |         +--< PE2.cpu
-           |         |       |         |
-UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
-(conn x4)  |         | zone  |         |  (conn x4)
-           |  r2c0   |       |         |
-M_CPU <--->+         |       |         |
-           |  r3c0   |       |         |
-SRAM <---->+         |       |         |
-           |         |       |         |
-PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
-PE4.cpu <--+         |       |         +--< PE6.cpu
-           |         |       |         |
-           +---------+---+---+---------+
-                         |
-                    UCIe-S (conn x4)
-
-HBM attach: hbm_ctrl is also connected to each router that hosts a PE (ADR-0019 D1)
-(xbar_top/xbar_bot were removed by ADR-0019)
-```
-
-### D5. NOC edge bandwidths and distances
-
-| Connection | BW (GB/s) | Distance | Notes |
-| --- | --- | --- | --- |
-| PE_DMA -> NOC | 256.0 | Physical (PE pos) | Matches HBM slice BW |
-| NOC -> PE_CPU | - | 0.0 mm | Command path only |
-| Router <-> HBM_CTRL | 256.0 | 0.0 mm | Per PE router (ADR-0019) |
-| NOC <-> M_CPU | - | 0.0 mm | Command path |
-| NOC <-> SRAM | 128.0 x4 | 0.0 mm | 512 GB/s aggregate |
-| NOC <-> UCIe conn | 128.0 | 0.0 mm | Per connection, 4 per port |
-
-Distance 0.0 mm for most connections reflects the distributed nature of
-the NOC; the actual traversal distance is computed internally via Manhattan
-distance within the router grid.
-
-### D6. UCIe decomposition and inter-cube traffic
-
-Each cube has 4 UCIe ports (N, S, E, W). Each port is decomposed into:
-
- 1 `ucie-{PORT}` node: UCIe protocol endpoint (overhead = 8.0 ns)
- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe
-
-This decomposition enables N=4 independent NOC-to-UCIe connections per port,
-each with 128 GB/s bandwidth. Total aggregate per port: 512 GB/s.
-
-Inter-cube traffic path:
-
-```text
-Source: PE_DMA -> NOC -> conn{i} -> ucie-{PORT}
-                    [UCIe link: 512 GB/s, 1.0mm seam distance]
-Target: ucie-{PORT} -> conn{i} -> r{x}c{y} -> (mesh hops) -> hbm_ctrl
-```
-
-UCIe overhead (8.0 ns) is applied at each ucie-{PORT} node, so a
-full crossing incurs 16 ns (TX port + RX port).
-
-### D7. Data paths through the NOC
-
-**PE DMA to local HBM (same half):**
-
-```text
-PE_DMA -> r{x}c{y} -> hbm_ctrl  (local: 0 mesh hops, switching overhead only)
-```
-
-**PE DMA to remote PE's HBM:**
-
-```text
-PE_DMA -> r{x}c{y} -> (mesh hops) -> r{x'}c{y'} -> hbm_ctrl
-```
-
-**PE DMA to remote cube HBM:**
-
-```text
-PE_DMA -> r{x}c{y} -> conn -> ucie-E -> [seam] -> ucie-W -> conn -> r{x'}c{y'} -> hbm_ctrl
-```
-
-**Kernel Launch command to PE:**
-
-```text
-[from io_noc] -> ucie -> conn -> r{x}c{y} -> (mesh hops) -> M_CPU -> (mesh hops) -> PE_CPU
-```
-
-**Shared SRAM access:**
-
-```text
-PE_DMA -> r{x}c{y} -> (mesh hops) -> SRAM
-```
-
-### D8. Mesh generation
-
-The router grid is generated by `mesh_gen.py` based on:
-
- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner
- `cube.geometry`: cube physical dimensions and HBM zone
- `cube.ucie.n_connections`: determines router count for UCIe attachment
-
-The generator produces a `mesh_data` dictionary containing:
-
- Router grid with positions and HBM exclusion zones
- PE-to-router attachments (pe_dma, pe_cpu per PE)
- UCIe-to-router attachments (N/S/E/W, distributed across edge routers)
- M_CPU and SRAM router attachments
- HBM attachment per PE router (ADR-0019)
-
-## Consequences
-
- NOC provides position-aware routing with deterministic latency
- Contention is captured per directed segment (not per-node)
- All cube-internal traffic is explicitly routed through the NOC
- HBM exclusion zone reflects physical die layout constraints
- The mesh generation is fully parameterized by `topology.yaml`
-
-## Links
-
- ADR-0003 D3 (cube-level NOC definition — extended by this ADR)
- ADR-0004 D1 (PE DMA to local HBM path via router mesh)
- ADR-0014 D1 (PE_DMA egress via router mesh)
- ADR-0019 (NOC-Local HBM: xbar/bridge removed, explicit router mesh)
- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
- ADR-0016 D1 (IOChiplet io_noc — analogous pattern at IO chiplet level)
@@ -0,0 +1,291 @@
+# ADR-0017: Cube NOC and HBM Connectivity
+
+## Status
+
+Accepted
+
+## Context
+
+The CUBE-level NOC is a 2D router mesh that carries every intra-cube
+request: PE-to-HBM data, PE-to-PE traffic, command paths
+(M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic.
+
+The CUBE's HBM is exposed through per-PE controller endpoints attached
+to PE routers. This per-PE partitioning makes local-vs-remote HBM
+distinguishable by mesh distance: a PE's own HBM partition sits at its
+own router (switching overhead only); another PE's HBM partition is
+reachable by mesh hops to that PE's router.
+
+Two channel-mapping modes are supported in the design space:
+
+- **n:1 (default, implemented)** — each PE's HBM partition aggregates
+  `channels_per_pe` pseudo-channels into one endpoint. Effective
+  per-PE BW = N × per-channel BW.
+- **1:1 (future)** — each PE router decomposes into per-channel
+  mini-routers; per-channel BW contention is modeled directly.
+
+In both modes the per-PE effective BW is identical; only the connectivity
+granularity differs.
+
+## Decision
+
+### D1. 2D router mesh
+
+Each cube contains a 2D mesh of NOC routers generated by `mesh_gen.py`.
+
+- Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`).
+- Implementation: `forwarding_v1`. NOC `overhead_ns = 0`.
+- Default 6×6 grid (sized from PE corner placement + UCIe attachment
+  count); larger PE counts scale the grid up.
+- HBM exclusion zone: center rows/columns are excluded where HBM die
+  physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6).
+- Latency = Manhattan distance × `ns_per_mm`.
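
As a rough illustration of the latency rule in the last bullet, the sketch below computes uncontended traversal time between two router positions. The function name, the millimetre coordinates, and the `ns_per_mm` value in the usage line are hypothetical; the ADR states only the formula.

```python
def uncontended_latency_ns(src_mm, dst_mm, ns_per_mm):
    """Manhattan distance between two (x_mm, y_mm) positions times ns_per_mm.

    Assumption: positions are router coordinates in millimetres, as a
    mesh generator might emit them. No contention or overhead is modelled.
    """
    dx = abs(dst_mm[0] - src_mm[0])
    dy = abs(dst_mm[1] - src_mm[1])
    return (dx + dy) * ns_per_mm


# Hypothetical values: routers 4 mm apart in x and 1 mm apart in y.
print(uncontended_latency_ns((0.0, 0.0), (4.0, 1.0), ns_per_mm=0.5))
```

Because NOC `overhead_ns` is 0, distance is the only term in this uncontended case.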
+
+### D2. XY routing algorithm
+
+Deterministic XY routing:
+
+1. Horizontal segment: route from source X to destination X at source Y.
+2. Vertical segment: route from destination X at source Y to destination Y.
+
+Each directed segment carries a unique key:
+
+- Horizontal: `("H", y_band, x_min, x_max, direction)`
+- Vertical:   `("V", x_band, y_min, y_max, direction)`
+
+Grid positions are snapped to the router grid, excluding the HBM zone.
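
The two-step rule and the key tuples above could be sketched as follows. This is an illustration only: the function name is hypothetical, coordinates are assumed to be already snapped to the router grid, and the `"+x"`/`"-x"`/`"+y"`/`"-y"` direction encoding is an assumption, since the ADR does not say how `direction` is represented.

```python
def xy_segment_keys(src, dst):
    """Return the directed segment keys for an XY route.

    src and dst are (x, y) grid coordinates. The horizontal segment runs
    at the source Y; the vertical segment runs at the destination X.
    """
    src_x, src_y = src
    dst_x, dst_y = dst
    keys = []
    if src_x != dst_x:
        direction = "+x" if dst_x > src_x else "-x"  # assumed encoding
        keys.append(("H", src_y, min(src_x, dst_x), max(src_x, dst_x), direction))
    if src_y != dst_y:
        direction = "+y" if dst_y > src_y else "-y"  # assumed encoding
        keys.append(("V", dst_x, min(src_y, dst_y), max(src_y, dst_y), direction))
    return keys


print(xy_segment_keys((0, 0), (4, 1)))
```

Two transactions that produce an identical key share a segment, which is what makes the key usable as a lookup for a per-segment resource.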
+
+### D3. Per-segment contention model
+
+Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
+sharing a segment (same row or column band, same direction) contend for
+the resource — modelling link-level serialization in a wormhole-routed
+mesh.
+
+With no contention, NOC traversal latency equals Manhattan distance ×
+`ns_per_mm`. Under contention, SimPy's resource scheduling adds queueing
+delay.
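
To show what capacity-1 serialization means for timing, here is a minimal pure-Python sketch that does not use SimPy. It assumes FIFO granting in arrival order and a caller-supplied hold time per transaction. Both are assumptions, and the function name is hypothetical.

```python
def serialize_on_segment(requests):
    """Model one directed segment with capacity 1.

    requests: iterable of (arrival_ns, hold_ns) pairs.
    Returns a list of (start_ns, finish_ns) in arrival order. A request
    starts when it has arrived and the segment is free, whichever is later.
    """
    free_at = 0.0
    schedule = []
    for arrival_ns, hold_ns in sorted(requests):
        start_ns = max(arrival_ns, free_at)
        finish_ns = start_ns + hold_ns
        schedule.append((start_ns, finish_ns))
        free_at = finish_ns
    return schedule


# The second request arrives at t=1 but waits until t=2 for the segment.
print(serialize_on_segment([(0.0, 2.0), (1.0, 2.0)]))
```

The gap between a request's arrival and its start time is the queueing delay added on top of the distance-based latency.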
+
+### D4. NOC attachment points (per-PE HBM partition)
+
+Every PE router carries three attachments: `pe{idx}.dma`, `pe{idx}.cpu`,
+and `pe{idx}.hbm`. The last is the per-PE HBM controller endpoint —
+`sip{S}.cube{C}.hbm_ctrl.pe{idx}` — which owns one slice of the cube's
+HBM (one pseudo-channel group; see D8).
+
+Other attachments:
+
+- M_CPU and shared SRAM each occupy a dedicated edge router.
+- UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed
+  along that edge (see D6).
+
+```text
+                    UCIe-N (conn x4)
+                         |
+           +---------+---+---+---------+
+           |         |       |         |
+PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
+PE0.cpu <--+ +hbm.pe0|       | +hbm.pe2+--< PE2.cpu
+           |         |       |         |
+UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
+(conn x4)  |         | zone  |         |  (conn x4)
+           |  r2c0   |       |         |
+M_CPU <--->+         |       |         |
+           |  r3c0   |       |         |
+SRAM <---->+         |       |         |
+           |         |       |         |
+PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
+PE4.cpu <--+ +hbm.pe4|       | +hbm.pe6+--< PE6.cpu
+           |         |       |         |
+           +---------+---+---+---------+
+                         |
+                    UCIe-S (conn x4)
+```
+
+Per-PE HBM partitioning is the key invariant that makes local vs
+cross-PE HBM distinguishable by mesh distance (see D7).
+
+### D5. NOC edge bandwidths and distances
+
+| Connection                    | BW (GB/s)  | Distance      | Notes                                       |
+| ----------------------------- | ---------- | ------------- | ------------------------------------------- |
+| PE_DMA → NOC                  | 256.0      | Physical (PE) | Matches local-HBM aggregate BW              |
+| NOC → PE_CPU                  | —          | 0.0 mm        | Command path only                           |
+| Router ↔ hbm_ctrl.pe{idx}     | 256.0      | 0.0 mm        | Per PE router; N × per-channel BW (see D8)  |
+| NOC ↔ M_CPU                   | —          | 0.0 mm        | Command path                                |
+| NOC ↔ SRAM                    | 128.0 × 4  | 0.0 mm        | 512 GB/s aggregate                          |
+| NOC ↔ UCIe conn               | 128.0      | 0.0 mm        | Per connection; 4 conn per port             |
+
+`0.0 mm` distances reflect the distributed nature of the NOC; actual
+traversal distance is computed via Manhattan distance within the router
+grid.
+
+### D6. UCIe decomposition and inter-cube traffic
+
+Each of the 4 UCIe ports (N, S, E, W) decomposes into:
+
+- 1 `ucie-{PORT}` node: UCIe protocol endpoint (`overhead = 8.0 ns`).
+- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe.
+
+This decomposition gives 4 independent NOC↔UCIe connections per port,
+each with 128 GB/s bandwidth (512 GB/s aggregate per port).
+
+Inter-cube traffic path:
+
+```text
+Source: PE_DMA → NOC → conn{i} → ucie-{PORT}
+                  [UCIe link: 512 GB/s, 1.0mm seam distance]
+Target: ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx}
+```
+
+UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full
+crossing incurs 16 ns (TX port + RX port).
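
The per-port figures above follow from simple arithmetic, sketched here with the values quoted in this section. The constant and function names are hypothetical, not taken from the code base.

```python
UCIE_CONNECTIONS_PER_PORT = 4    # conn0..conn3 per ucie-{PORT}
UCIE_CONN_BW_GBS = 128.0         # bandwidth of each NOC-to-UCIe connection
UCIE_PORT_OVERHEAD_NS = 8.0      # applied at each ucie-{PORT} node


def ucie_port_aggregate_bw_gbs():
    """Aggregate bandwidth of one UCIe port across its connections."""
    return UCIE_CONNECTIONS_PER_PORT * UCIE_CONN_BW_GBS


def ucie_crossing_overhead_ns():
    """Protocol overhead of one full crossing: TX port plus RX port."""
    return 2 * UCIE_PORT_OVERHEAD_NS


print(ucie_port_aggregate_bw_gbs(), ucie_crossing_overhead_ns())
```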
+
+### D7. Data paths through the NOC
+
+All intra-cube traffic uses the same router mesh — no separate fast
+paths.
+
+**Local HBM** (same PE's own partition; 0 mesh hops):
+
+```text
+PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx}   (switching overhead only)
+```
+
+**Cross-PE HBM within cube** (target PE's partition, reached by mesh):
+
+```text
+PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'}
+```
+
+Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 on `r1c4`):
+
+```text
+PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2
+```
+
+Dijkstra computes the shortest path within the mesh.
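
The example path can be reproduced with a small hop enumerator. This sketch walks columns first and then rows on an unobstructed grid; it ignores the HBM exclusion zone and is not the Dijkstra router itself. The function name is hypothetical.

```python
def xy_hops(src, dst):
    """List router names from src to dst, each given as (row, col).

    Walks along the source row to the destination column, then along
    that column to the destination row. Assumes no excluded routers
    lie on the path.
    """
    row, col = src
    dst_row, dst_col = dst
    hops = [f"r{row}c{col}"]
    col_step = 1 if dst_col > col else -1
    while col != dst_col:
        col += col_step
        hops.append(f"r{row}c{col}")
    row_step = 1 if dst_row > row else -1
    while row != dst_row:
        row += row_step
        hops.append(f"r{row}c{col}")
    return hops


print(" → ".join(xy_hops((0, 0), (1, 4))))
```

For `r0c0` to `r1c4` this yields the same six routers as the example, which is one of several equal-length shortest paths on a uniform grid.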
+
+**Cross-cube HBM** (UCIe traversal):
+
+```text
+PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn
+       → r{x'}c{y'} → hbm_ctrl.pe{idx'}
+```
+
+**Kernel launch command to PE**:
+
+```text
+[from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU
+```
+
+**Shared SRAM access**:
+
+```text
+PE_DMA → r{x}c{y} → (mesh) → SRAM
+```
+
+### D8. HBM channel mapping mode
+
+Channel mapping is configured at cube scope:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one       # one_to_one | n_to_one
+    hbm_pseudo_channels: 64          # total pseudo-channel count
+    hbm_channels_per_pe: 8           # per-PE local channel count
+    hbm_channel_bw_gbs: 32.0         # per-channel bandwidth (GB/s)
+    hbm_slices_per_cube: 8           # number of per-PE partitions
+    hbm_total_gb_per_cube: 48
+```
+
+**n:1 mode (default, implemented).** Each PE's HBM partition is a single
+endpoint `hbm_ctrl.pe{idx}` that aggregates `channels_per_pe` pseudo-
+channels. The `Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals
+`channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to
+interleave; only aggregate per-PE BW is modeled. No separate aggregated
+router node exists — the per-PE router itself serves that role.
+
+**1:1 mode (future).** Each PE router decomposes into N channel
+mini-routers; per-channel routing carries fully-resolved PA + channel ID.
+A `ChannelSplitter` resolves a logical access to N per-channel physical
+requests. Each per-channel link models BW contention. Cross-PE channel
+access semantics are deferred to the implementation ADR.
+
+**BW math (defaults).**
+
+| Parameter                          | Value                      |
+| ---------------------------------- | -------------------------- |
+| pseudo channels per cube           | 64 (parameter)             |
+| PEs per cube                       | 8 (parameter)              |
+| channels per PE (N)                | 64 / 8 = 8                 |
+| per-channel BW                     | 32 GB/s (parameter)        |
+| per-PE local BW                    | N × 32 = 256 GB/s          |
+| cube total HBM BW                  | 64 × 32 = 2048 GB/s        |
+
+Both modes give the same per-PE effective BW; only the request shape and
+contention model differ.
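
The BW table can be derived from the `memory_map` parameters. The sketch below mirrors the YAML field names in a hypothetical dataclass; the class, its methods, and the consistency check are illustrative assumptions rather than existing code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HbmMemoryMap:
    """Hypothetical holder for the cube.memory_map HBM parameters."""
    hbm_pseudo_channels: int = 64
    hbm_channels_per_pe: int = 8
    hbm_channel_bw_gbs: float = 32.0
    hbm_slices_per_cube: int = 8

    def validate(self):
        # Assumed invariant: the channels divide evenly across the partitions.
        expected = self.hbm_channels_per_pe * self.hbm_slices_per_cube
        if expected != self.hbm_pseudo_channels:
            raise ValueError("channels_per_pe x slices must equal pseudo_channels")

    def per_pe_bw_gbs(self):
        """Router-to-hbm_ctrl.pe{idx} link bandwidth in n:1 mode."""
        return self.hbm_channels_per_pe * self.hbm_channel_bw_gbs

    def cube_total_bw_gbs(self):
        """Total HBM bandwidth of the cube."""
        return self.hbm_pseudo_channels * self.hbm_channel_bw_gbs


defaults = HbmMemoryMap()
defaults.validate()
print(defaults.per_pe_bw_gbs(), defaults.cube_total_bw_gbs())
```

With the defaults this gives 256 GB/s per PE and 2048 GB/s per cube, matching the table.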
+
+### D9. AddressResolver — per-PE HBM endpoint
+
+The address resolver decodes a PA's HBM offset to the owning PE's
+partition:
+
+```python
+# policy/routing/router.py
+hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube
+
+if addr.kind == "hbm":
+    pe_id = int(addr.hbm_offset) // hbm_slice_bytes
+    return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
+```
+
+The pe_id computation is intrinsic to the routing layer (not a
+topology-time concern). Any HBM PA falls within exactly one partition,
+yielding deterministic routing.
+
+External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the
+same resolver path — there is no separate fast path.
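
A self-contained version of the offset-to-partition step might look like the following. It is a sketch under the default parameters from D8: the function names and the range check are additions for illustration, and it takes a raw integer offset rather than the `addr` object used above.

```python
GIB = 1 << 30


def hbm_slice_bytes(total_gb_per_cube=48, slices_per_cube=8):
    """Size in bytes of one per-PE HBM partition."""
    return total_gb_per_cube * GIB // slices_per_cube


def resolve_hbm_endpoint(hbm_offset, sip, cube,
                         total_gb_per_cube=48, slices_per_cube=8):
    """Map an HBM offset to the owning PE's controller endpoint name."""
    slice_bytes = hbm_slice_bytes(total_gb_per_cube, slices_per_cube)
    # Range check is an assumption; the ADR snippet does not show one.
    if not 0 <= hbm_offset < slice_bytes * slices_per_cube:
        raise ValueError(f"HBM offset out of range: {hbm_offset:#x}")
    pe_id = hbm_offset // slice_bytes
    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"


# With 48 GB split 8 ways, each partition is 6 GiB.
print(resolve_hbm_endpoint(6 * GIB, sip=0, cube=0))
```

Every in-range offset falls into exactly one partition, so the mapping is deterministic.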
+
+### D10. Mesh generation parameters
+
+`mesh_gen.py` produces `cube_mesh.yaml` from:
+
+- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner.
+- `cube.geometry`: cube physical dimensions and HBM zone.
+- `cube.ucie.n_connections`: determines router count for UCIe attachment.
+
+The output `mesh_data` dictionary contains:
+
+- Router grid with positions and HBM exclusion zones.
+- PE-to-router attachments (`pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm`
+  per PE).
+- UCIe-to-router attachments (N/S/E/W distributed across edge routers).
+- M_CPU and SRAM router attachments.
+
+## Consequences
+
+- Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM
+  (mesh hops) are naturally distinguishable, satisfying SPEC R5
+  (multi-domain communication) and ADR-0002 (no zero-latency end-to-end
+  paths).
+- All cube-internal traffic routes through one mesh — single contention
+  model, single layout, single set of edge BWs.
+- Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each
+  PE's partition is the n:1 aggregate of its assigned pseudo-channels.
+- 1:1 mode extension is structurally natural — split each PE router into
+  N channel routers.
+- Mesh generation is fully parameterized by `topology.yaml`; PE/cube
+  geometry changes propagate without code edits.
+
+## Links
+
+- ADR-0002 (Routing distance, ordering, no zero-latency paths)
+- ADR-0003 D3 (cube-level NOC definition — extended here)
+- ADR-0004 (Memory semantics, local HBM)
+- ADR-0011 (Memory addressing — LA model consumes per-PE partition)
+- ADR-0014 D1 (PE_DMA egress via router mesh)
+- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
+- ADR-0016 (IOChiplet io_noc — analogous pattern at IO chiplet level)
+- ADR-0033 (Latency model: per-PC parallelism, switch penalty)
@@ -1,431 +0,0 @@
-# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC
-
-## Status
-
-Proposed
-
-## Context
-
-ADR-0018 introduced LA-based address abstraction and BAAW,
-defining how a logical memory access is translated into the following two forms of requests:
-
- 1:1 mode: one logical access → N per-channel requests
- n:1 mode: one logical access → one aggregated request
-
-Here N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`),
-determined by topology parameters.
-
-### Problems with the Existing Structure
-
-In the current implementation (`topology/builder.py`):
-
- PE_DMA → NOC → xbar_top/xbar_bot → HBM_CTRL.slice{0-7} path is used
- HBM is modeled as 8 slice (= per-PE) nodes
- Local/remote access use different paths:
-  - local: NOC → xbar → HBM slice
-  - cross-half: NOC → xbar_top → bridge → xbar_bot → HBM slice
-  - remote cube: NOC → UCIe → remote NOC → remote xbar → remote HBM slice
-
-Limitations of this structure:
-
- Cannot model at the pseudo-channel granularity (slice = per-PE granularity, not per-channel)
- xbar/bridge bifurcate local/remote paths
- Cannot express 1:1 / n:1 modes consistently
-
---
-
-## Decision
-
-### D1. HBM Attaches to PE Routers
-
-Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**,
-and attach the HBM access point to the same router where the PE is attached.
-
- n:1 mode: PE's local HBM access goes directly from its own router (switching overhead only, 0 hops)
- Remote PE's HBM access: reaches the target PE's router via mesh hops
- The read/write resource model within the HBM controller is preserved
-
-Node naming changes:
-
-| Current | After Change |
-| ---- | ------- |
-| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |
-
-In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that
-the builder generates an edge between that router and hbm_ctrl.
-
---
-
-### D2. Complete Removal of xbar, bridge, and Single NOC Node
-
-Remove all of the following nodes and related edges:
-
- `{cube}.xbar_top`, `{cube}.xbar_bot`
- `{cube}.bridge.left`, `{cube}.bridge.right`
- `{cube}.noc` (single TwoDMeshNocComponent node)
- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
- Edges of type `xbar_to_bridge`, `bridge_to_xbar`
- Edges of type `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`, etc. referencing the single noc node
-
-Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**.
-Each router (r0c0, r0c1, ...) from the 6x6 router grid generated by `mesh_gen.py`
-is created as a separate SimPy node in the topology graph,
-and adjacent routers are connected via XY mesh edges.
-
---
-
-### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)
-
-#### Router Nodes Based on cube_mesh.yaml
-
-Each non-null router from cube_mesh.yaml generated by `mesh_gen.py`
-is created as a **separate SimPy node** in the topology graph.
-
- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
- kind: `noc_router`, impl: `forwarding_v1`
- pos_mm: taken from cube_mesh.yaml
-
-Based on the attach information in cube_mesh.yaml, components are connected to each router:
- `pe{p}.dma` → PE_DMA ↔ router edge
- `pe{p}.cpu` → PE_CPU ↔ router edge
- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
- `m_cpu` → M_CPU ↔ router edge
- `sram` → SRAM ↔ router edge
- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge
-
-Router-to-router XY mesh edges: bidirectional edges between adjacent routers.
-Null routers (HBM exclusion zones) are skipped.
-
-#### 1:1 Mode Extension (To Be Implemented Later)
-
-In 1:1 mode, each router differentiates into N channel mini-routers.
-Per-channel routing and ChannelSplitter (LA → per-channel PA) introduction are required.
-N GEMM engines per PE are also added at this point.
-
---
-
-### D4. Cross-PE HBM Access (n:1 Mode)
-
-In n:1 mode, when a PE accesses another PE's local HBM,
-it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.
-
-Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:
-
-```text
-PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
-```
-
-The Dijkstra router finds the shortest path in the mesh.
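
The hop count in the D4 example can be checked with a minimal sketch of shortest-path routing over the router grid. The helper names `build_mesh` and `shortest_path` are hypothetical and are not taken from `topology/builder.py`. The sketch assumes a uniform cost of 1 per mesh hop and an empty null-router set, because the contents of cube_mesh.yaml are not shown here.

```python
import heapq


def build_mesh(rows, cols, null_routers=frozenset()):
    """Adjacency map over non-null routers; every mesh edge is bidirectional."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            if (r, c) in null_routers:
                continue  # null routers (HBM exclusion zones) are skipped
            nbrs = []
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and (nr, nc) not in null_routers:
                    nbrs.append((nr, nc))
            adj[(r, c)] = nbrs
    return adj


def shortest_path(adj, src, dst):
    """Dijkstra with unit edge cost; returns router names from src to dst."""
    if src not in adj or dst not in adj:
        raise ValueError("source or destination is not a router in the mesh")
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist[node]:
            continue  # stale heap entry
        for nxt in adj[node]:
            nd = d + 1
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if dst not in dist:
        raise ValueError("no route between the two routers")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return [f"r{r}c{c}" for r, c in reversed(path)]


mesh = build_mesh(6, 6)
path = shortest_path(mesh, (0, 0), (1, 4))
print(len(path) - 1, path[0], path[-1])
```

Under these assumptions the route from r0c0 to r1c4 takes 5 mesh hops, matching the path listed in the example. Which of the equal-length routes is chosen depends on tie-breaking, so the sketch does not claim to reproduce the exact router sequence.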
-
-Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3.
-
---
-
-### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh
-
-In n:1 mode, no separate "aggregated router" is created.
-The existing router grid from cube_mesh.yaml serves that role.
-
-#### Connection Structure
-
-PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:
-
-```text
-sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
-sip0.cube0.hbm_ctrl   ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
-```
-
-Routers are connected via XY mesh edges. PE's local HBM access goes
-directly from its own router (switching overhead only).
-
-#### n:1 Mode Full Data Paths
-
-**Local HBM (0 hops):**
-```text
-PE0.pe_dma → r0c0 → hbm_ctrl  (switching overhead only)
-```
-
-**Remote HBM (mesh hops):**
-```text
-PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
-```
-
-**M_CPU DMA:**
-```text
-M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
-```
-
---
-
-### D6. All Traffic Is Unified onto the Same Router Mesh
-
- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
- Local access does not use a separate fast path (xbar)
- Cross-cube (remote) access path:
-
-```text
-PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
-  → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
-```
-
-UCIe connections maintain the existing structure,
-but both endpoints become mesh routers instead of xbars.
-
-The number of UCIe lines is determined by BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.
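
The line-count rule above is a ceiling division, sketched below. The function name and the sample bandwidth values are illustrative assumptions; this section does not state the actual UCIe or NOC line bandwidths.

```python
import math


def ucie_lines_per_side(ucie_bw_gbs: float, noc_line_bw_gbs: float) -> int:
    """Number of NOC lines needed so their combined BW covers the UCIe BW."""
    if ucie_bw_gbs <= 0 or noc_line_bw_gbs <= 0:
        raise ValueError("bandwidths must be positive")
    return math.ceil(ucie_bw_gbs / noc_line_bw_gbs)


# Hypothetical values, for illustration only.
print(ucie_lines_per_side(512.0, 256.0))
print(ucie_lines_per_side(600.0, 256.0))
```

Rounding up means a UCIe bandwidth that is not an exact multiple of the line bandwidth still gets enough lines, at the cost of some unused capacity on the last line.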
-
---
-
-### D7. AddressResolver Changes
-
-Current `AddressResolver.resolve()`:
-
-```python
-# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
-pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
-return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
-```
-
-After change:
-
-```python
-# Changed: HBM → single endpoint
-return f"sip{s}.cube{c}.hbm_ctrl"
-```
-
-The pe_slice calculation is removed.
-In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.
-
-resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility.
-
---
-
-### D8. topology.yaml Configuration Changes
-
-#### Added Settings
-
-```yaml
-cube:
-  memory_map:
-    hbm_mapping_mode: n_to_one          # one_to_one | n_to_one
-    hbm_pseudo_channels: 64             # total pseudo channel count
-    hbm_channels_per_pe: 8              # local channels per PE (= pseudo_channels / pes_per_cube)
-    hbm_channel_bw_gbs: 32.0            # per-channel bandwidth (GB/s)
-    hbm_total_gb_per_cube: 48           # retained
-```
-
-#### Removed Settings
-
-```yaml
-# To be removed
-links:
-  xbar_to_hbm_bw_gbs: 256.0            # → replaced by channel_bw_gbs × channels_per_pe
-  xbar_to_hbm_mm: 2.5                  # → replaced by ch_router_to_hbm_mm
-  xbar_to_bridge_bw_gbs: 128.0         # → removed (no bridge)
-  xbar_to_bridge_mm: 3.0               # → removed
-  noc_to_xbar_bw_gbs: ...              # → removed
-  noc_to_xbar_mm: ...                  # → removed
-```
-
-#### Added Link Settings
-
-```yaml
-links:
-  router_link_bw_gbs: 256.0            # XY mesh link BW between routers
-  router_overhead_ns: 2.0              # router switching overhead
-  pe_to_router_bw_gbs: 256.0           # PE_DMA ↔ router
-  hbm_to_router_bw_gbs: 256.0          # HBM ↔ router (= N × channel_bw)
-```
-
---
-
-### D9. Bandwidth Numerical Consistency
-
-| Configuration | Value |
-| ---- | --- |
-| pseudo channels per cube | 64 (parameter) |
-| PEs per cube | 8 (parameter) |
-| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
-| per-channel BW | 32 GB/s (parameter) |
-| per-PE local BW | N × 32 = 256 GB/s |
-| cube total HBM BW | 64 × 32 = 2048 GB/s |
-
-The effective BW per PE is identical in both modes:
-
- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s
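
The D9 arithmetic can be reproduced with a small sketch. `HbmBandwidthConfig` is a hypothetical holder class, not an existing kernbench type; its field names mirror the topology.yaml keys from D8, and `pes_per_cube` is assumed to be available alongside them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HbmBandwidthConfig:
    """Hypothetical container for the D8/D9 bandwidth parameters."""

    hbm_pseudo_channels: int = 64
    pes_per_cube: int = 8
    hbm_channel_bw_gbs: float = 32.0

    @property
    def channels_per_pe(self) -> int:
        # N = pseudo_channels / pes_per_cube; reject uneven splits explicitly.
        if self.hbm_pseudo_channels % self.pes_per_cube != 0:
            raise ValueError("pseudo channels must divide evenly across PEs")
        return self.hbm_pseudo_channels // self.pes_per_cube

    def per_pe_bw_gbs(self, mode: str) -> float:
        n = self.channels_per_pe
        if mode == "one_to_one":
            # N separate channel links, each at the per-channel BW.
            return sum(self.hbm_channel_bw_gbs for _ in range(n))
        if mode == "n_to_one":
            # One aggregated link sized at N x per-channel BW.
            return n * self.hbm_channel_bw_gbs
        raise ValueError(f"unknown hbm_mapping_mode: {mode}")

    @property
    def cube_total_bw_gbs(self) -> float:
        return self.hbm_pseudo_channels * self.hbm_channel_bw_gbs


cfg = HbmBandwidthConfig()
print(cfg.channels_per_pe, cfg.per_pe_bw_gbs("n_to_one"), cfg.cube_total_bw_gbs)
```

With the default parameters this gives N = 8, 256 GB/s per PE in either mode, and 2048 GB/s per cube, the same figures as the table.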
-
---
-
-## Consequences
-
-### Positive
-
- The router mesh based on cube_mesh.yaml accurately reflects physical placement
- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
- Local / remote / command traffic is unified onto the same mesh, which simplifies the model
- Aligns well with graph compiler-based topology generation
- Channel count and PE count are both parameterized, enabling testing of various configurations
- The 1:1 mode extension follows naturally through router differentiation
-
-### Negative
-
- The number of SimPy nodes increases due to explicit router nodes (6x6 grid, up to 32 routers per cube)
- Requires a complete rewrite of existing xbar/bridge/single NOC-based tests
- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model
-
---
-
-## Alternatives
-
-### A1. Retain Existing xbar + HBM Slices
-
- Local/remote paths remain bifurcated
- Cannot model at pseudo-channel granularity
- Cannot switch between 1:1/n:1 modes
-
-### A2. Always Generate Per-Channel Links and Aggregate Only in n:1
-
- Topology structure always has 1:1 size
- Expressing n:1 semantics via link aggregation is complex
- No reduction in router node count
-
-### A3. Gradual Transition (Retain xbar + Add NOC Path)
-
- Higher compatibility, but dual-path coexistence increases complexity
- Since xbar removal is ultimately necessary, the intermediate step provides little value
-
---
-
-## Implementation Notes
-
-### topology/builder.py Change Details
-
-#### Code to Remove (within current `_instantiate_cube()`)
-
- xbar_top, xbar_bot node creation (~line 495-508)
- bridge.left, bridge.right node creation
- noc ↔ xbar edge creation (~line 540-555)
- xbar ↔ hbm_ctrl.slice edge creation (~line 510-538)
- xbar ↔ bridge edge creation (~line 557-572)
-
-#### Code to Add
-
-1:1 mode:
-
-```python
-N = hbm_channels_per_pe  # from topology config
-total_ch = hbm_pseudo_channels
-
-# Create channel router nodes
-for ch_id in range(total_ch):
-    pe_id = ch_id // N
-    nodes[f"{cp}.ch_r{ch_id}"] = Node(
-        id=f"{cp}.ch_r{ch_id}", kind="noc_router", impl="noc_v1",
-        attrs={}, pos_mm=(...),  # horizontal row = ch_id % N
-    )
-
-# PE_DMA ↔ local channel router edges
-for pe_id in range(pes_per_cube):
-    for local_ch in range(N):
-        ch_id = pe_id * N + local_ch
-        edges.append(Edge(
-            src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.ch_r{ch_id}",
-            bw_gbs=channel_bw, kind="pe_to_ch_router", ...))
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.pe{pe_id}.pe_dma",
-            bw_gbs=channel_bw, kind="ch_router_to_pe", ...))
-
-# Channel router ↔ hbm_ctrl edges
-for ch_id in range(total_ch):
-    edges.append(Edge(
-        src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.hbm_ctrl",
-        bw_gbs=channel_bw, kind="ch_router_to_hbm", ...))
-    edges.append(Edge(
-        src=f"{cp}.hbm_ctrl", dst=f"{cp}.ch_r{ch_id}",
-        bw_gbs=channel_bw, kind="hbm_to_ch_router", ...))
-
-# Horizontal line edges (same logical index)
-for row in range(N):
-    for p in range(pes_per_cube - 1):
-        ch_a = p * N + row
-        ch_b = (p + 1) * N + row
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_a}", dst=f"{cp}.ch_r{ch_b}",
-            bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_b}", dst=f"{cp}.ch_r{ch_a}",
-            bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
-```
-
-n:1 mode:
-
-```python
-# Create aggregated router nodes
-for pe_id in range(pes_per_cube):
-    nodes[f"{cp}.pe{pe_id}.agg_router"] = Node(
-        id=f"{cp}.pe{pe_id}.agg_router", kind="noc_router", impl="noc_v1",
-        attrs={}, pos_mm=(...),
-    )
-
-agg_bw = N * channel_bw  # aggregated BW
-
-# PE_DMA ↔ aggregated router
-for pe_id in range(pes_per_cube):
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.pe{pe_id}.agg_router",
-        bw_gbs=agg_bw, kind="pe_to_agg_router", ...))
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.pe{pe_id}.pe_dma",
-        bw_gbs=agg_bw, kind="agg_router_to_pe", ...))
-
-# Aggregated router ↔ hbm_ctrl
-for pe_id in range(pes_per_cube):
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.hbm_ctrl",
-        bw_gbs=agg_bw, kind="agg_to_hbm", ...))
-    edges.append(Edge(
-        src=f"{cp}.hbm_ctrl", dst=f"{cp}.pe{pe_id}.agg_router",
-        bw_gbs=agg_bw, kind="hbm_to_agg", ...))
-
-# Horizontal links between aggregated routers
-for p in range(pes_per_cube - 1):
-    edges.append(Edge(
-        src=f"{cp}.pe{p}.agg_router", dst=f"{cp}.pe{p+1}.agg_router",
-        bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
-    edges.append(Edge(
-        src=f"{cp}.pe{p+1}.agg_router", dst=f"{cp}.pe{p}.agg_router",
-        bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
-```
-
-### Affected Existing Tests
-
-| Test File | Impact |
-| ---------- | ---- |
-| `tests/test_topology_compile.py` | Remove xbar/bridge node references, add channel router verification |
-| `tests/test_topology_load.py` | Reflect topology.yaml configuration changes |
-| `tests/test_pe_components.py` | PE_DMA routing path changes |
-| `tests/test_sip_parallel.py` | Cross-PE access path changes |
-| Cases that directly test xbar/bridge | Remove |
-
---
-
-## Test Requirements
-
- Verify that requests are delivered via per-channel links in 1:1 mode
- Verify that requests are delivered via the aggregated link in n:1 mode
- Verify that topology is correctly generated in both modes:
-  - 1:1: `total_ch` channel routers + per-PE links + horizontal links
-  - n:1: `pes_per_cube` aggregated routers + per-PE links
- Verify that effective BW is consistent across both modes for the same workload
- Verify that horizontal line routing works for cross-PE access
- Verify that routing through UCIe works for cross-cube access
- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)
-
---
-
-## Links
-
- ADR-0018 (LA + BAAW) → addressing-side integration
- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
- ADR-0004 (Memory Semantics) → BW model redefinition
- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
@@ -1,431 +0,0 @@
-# ADR-0019: Per-Channel and Aggregated HBM Connection Model in the CUBE NOC
-
-## Status
-
-Proposed
-
-## Context
-
-ADR-0018 introduced LA-based address abstraction and BAAW, and defined
-how a logical memory access is converted into one of the following two request forms.
-
- 1:1 mode: one logical access → N per-channel requests
- n:1 mode: one logical access → one aggregated request
-
-Here N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`),
-which is determined by topology parameters.
-
-### Problems with the Existing Structure
-
-In the current implementation (`topology/builder.py`):
-
- The path PE_DMA → NOC → xbar_top/xbar_bot → HBM_CTRL.slice{0-7} is used
- HBM is modeled as 8 slice nodes (= number of PEs)
- Local and remote accesses use different paths:
-  - local: NOC → xbar → HBM slice
-  - cross-half: NOC → xbar_top → bridge → xbar_bot → HBM slice
-  - remote cube: NOC → UCIe → remote NOC → remote xbar → remote HBM slice
-
-Limitations of this structure:
-
- Cannot model at the pseudo-channel granularity (slice = per-PE granularity, not per-channel)
- xbar/bridge split local and remote accesses into separate paths
- Cannot express 1:1 / n:1 modes consistently
-
---
-
-## Decision
-
-### D1. HBM Attaches to PE Routers
-
-Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**,
-and attach the HBM access point to the same router where the PE is attached.
-
- n:1 mode: a PE's local HBM access goes directly from its own router (switching overhead only, 0 hops)
- Remote PE's HBM access: reaches the target PE's router via mesh hops
- The read/write resource model within the HBM controller is preserved
-
-Node naming changes:
-
-| Current | After Change |
-| ---- | ------- |
-| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |
-
-In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that
-the builder generates an edge between that router and hbm_ctrl.
-
---
-
-### D2. Complete Removal of xbar, bridge, and Single NOC Node
-
-Remove all of the following nodes and their related edges:
-
- `{cube}.xbar_top`, `{cube}.xbar_bot`
- `{cube}.bridge.left`, `{cube}.bridge.right`
- `{cube}.noc` (single TwoDMeshNocComponent node)
- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
- Edges of type `xbar_to_bridge`, `bridge_to_xbar`
- Edges such as `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu` that reference the single noc node
-
-Their role is taken over by an **explicit router mesh based on cube_mesh.yaml**.
-Each router (r0c0, r0c1, ...) in the 6×6 router grid generated by the existing `mesh_gen.py`
-is created as a separate SimPy node in the topology graph,
-and adjacent routers are connected via XY mesh edges.
-
---
-
-### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)
-
-#### Router Nodes Based on cube_mesh.yaml
-
-Each non-null router in the cube_mesh.yaml generated by `mesh_gen.py`
-is created as a **separate SimPy node** in the topology graph.
-
- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
- kind: `noc_router`, impl: `forwarding_v1`
- pos_mm: taken from cube_mesh.yaml
-
-Based on the existing attach information in cube_mesh.yaml, components are connected to each router:
- `pe{p}.dma` → PE_DMA ↔ router edge
- `pe{p}.cpu` → PE_CPU ↔ router edge
- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
- `m_cpu` → M_CPU ↔ router edge
- `sram` → SRAM ↔ router edge
- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge
-
-Router-to-router XY mesh edges: bidirectional edges between adjacent routers.
-Null routers (HBM exclusion zones) are skipped.
-
-#### 1:1 Mode Extension (To Be Implemented Later)
-
-In 1:1 mode, each router differentiates into N channel mini-routers.
-Per-channel routing and a ChannelSplitter (LA → per-channel PA) must be introduced.
-N GEMM engines per PE are also added at this point.
-
---
-
-### D4. Cross-PE HBM Access (n:1 Mode)
-
-In n:1 mode, when a PE accesses another PE's local HBM,
-it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.
-
-Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:
-
-```text
-PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
-```
-
-The Dijkstra router finds the shortest path in the mesh.
-
-Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3.
-
---
-
-### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh
-
-In n:1 mode, no separate "aggregated router" is created.
-The existing router grid from cube_mesh.yaml serves that role.
-
-#### Connection Structure
-
-PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:
-
-```text
-sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
-sip0.cube0.hbm_ctrl   ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
-```
-
-Routers are connected via XY mesh edges. A PE's local HBM access goes
-directly from its own router (switching overhead only).
-
-#### n:1 Mode Full Data Paths
-
-**Local HBM (0 hops):**
-```text
-PE0.pe_dma → r0c0 → hbm_ctrl  (switching overhead only)
-```
-
-**remote HBM (mesh hops):**
-```text
-PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
-```
-
-**M_CPU DMA:**
-```text
-M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
-```
-
---
-
-### D6. All Traffic Is Unified onto the Same Router Mesh
-
- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
- Local access does not use a separate fast path (xbar) either
- Cross-cube (remote) access path:
-
-```text
-PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
-  → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
-```
-
-UCIe connections keep the existing structure,
-but both endpoints become mesh routers instead of xbars.
-
-The number of UCIe lines is determined by the BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.
-
---
-
-### D7. AddressResolver Changes
-
-Current `AddressResolver.resolve()`:
-
-```python
-# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
-pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
-return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
-```
-
-After change:
-
-```python
-# Changed: HBM → single endpoint
-return f"sip{s}.cube{c}.hbm_ctrl"
-```
-
-The pe_slice calculation is removed.
-In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.
-
-resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility.
-
---
-
-### D8. topology.yaml Configuration Changes
-
-#### Added Settings
-
-```yaml
-cube:
-  memory_map:
-    hbm_mapping_mode: n_to_one          # one_to_one | n_to_one
-    hbm_pseudo_channels: 64             # total pseudo channel count
-    hbm_channels_per_pe: 8              # local channels per PE (= pseudo_channels / pes_per_cube)
-    hbm_channel_bw_gbs: 32.0            # per-channel bandwidth (GB/s)
-    hbm_total_gb_per_cube: 48           # retained
-```
-
-#### Removed Settings
-
-```yaml
-# To be removed
-links:
-  xbar_to_hbm_bw_gbs: 256.0            # → replaced by channel_bw_gbs × channels_per_pe
-  xbar_to_hbm_mm: 2.5                  # → replaced by ch_router_to_hbm_mm
-  xbar_to_bridge_bw_gbs: 128.0         # → removed (no bridge)
-  xbar_to_bridge_mm: 3.0               # → removed
-  noc_to_xbar_bw_gbs: ...              # → removed
-  noc_to_xbar_mm: ...                  # → removed
-```
-
-#### Added Link Settings
-
-```yaml
-links:
-  router_link_bw_gbs: 256.0            # XY mesh link BW between routers
-  router_overhead_ns: 2.0              # router switching overhead
-  pe_to_router_bw_gbs: 256.0           # PE_DMA ↔ router
-  hbm_to_router_bw_gbs: 256.0          # HBM ↔ router (= N × channel_bw)
-```
-
---
-
-### D9. Bandwidth Numerical Consistency
-
-| Configuration | Value |
-| ---- | --- |
-| pseudo channels per cube | 64 (parameter) |
-| PEs per cube | 8 (parameter) |
-| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
-| per-channel BW | 32 GB/s (parameter) |
-| per-PE local BW | N × 32 = 256 GB/s |
-| cube total HBM BW | 64 × 32 = 2048 GB/s |
-
-The effective BW per PE is identical in both modes:
-
- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s
-
---
-
-## Consequences
-
-### Positive
-
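- The router mesh based on cube_mesh.yaml accurately reflects physical placement
- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
- Local / remote / command traffic is unified onto the same mesh, which simplifies the model
- Aligns well with graph compiler-based topology generation
- Channel count and PE count are both parameterized, enabling testing of various configurations
- The 1:1 mode extension follows naturally through router differentiation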
-
-### Negative
-
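- The number of SimPy nodes increases due to explicit router nodes (6×6 grid, up to 32 routers per cube)
- Requires a complete rewrite of existing xbar/bridge/single NOC-based tests
- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model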
-
---
-
-## Alternatives
-
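-### A1. Retain Existing xbar + HBM Slices
-
- Local/remote paths remain bifurcated
- Cannot model at pseudo-channel granularity
- Cannot switch between 1:1/n:1 modes
-
-### A2. Always Generate Per-Channel Links and Aggregate Only in n:1
-
- Topology structure always has 1:1 size
- Expressing n:1 semantics via link aggregation is complex
- No reduction in router node count
-
-### A3. Gradual Transition (Retain xbar + Add NOC Path)
-
- Higher compatibility, but dual-path coexistence increases complexity
- Since xbar removal is ultimately necessary, the intermediate step provides little value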
-
---
-
-## Implementation Notes
-
-### topology/builder.py Change Details
-
-#### Code to Remove (within current `_instantiate_cube()`)
-
- xbar_top, xbar_bot node creation (~line 495-508)
- bridge.left, bridge.right node creation
- noc ↔ xbar edge creation (~line 540-555)
- xbar ↔ hbm_ctrl.slice edge creation (~line 510-538)
- xbar ↔ bridge edge creation (~line 557-572)
-
-#### Code to Add
-
-1:1 mode:
-
-```python
-N = hbm_channels_per_pe  # from topology config
-total_ch = hbm_pseudo_channels
-
-# Create channel router nodes
-for ch_id in range(total_ch):
-    pe_id = ch_id // N
-    nodes[f"{cp}.ch_r{ch_id}"] = Node(
-        id=f"{cp}.ch_r{ch_id}", kind="noc_router", impl="noc_v1",
-        attrs={}, pos_mm=(...),  # horizontal row = ch_id % N
-    )
-
-# PE_DMA ↔ local channel router edges
-for pe_id in range(pes_per_cube):
-    for local_ch in range(N):
-        ch_id = pe_id * N + local_ch
-        edges.append(Edge(
-            src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.ch_r{ch_id}",
-            bw_gbs=channel_bw, kind="pe_to_ch_router", ...))
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.pe{pe_id}.pe_dma",
-            bw_gbs=channel_bw, kind="ch_router_to_pe", ...))
-
-# channel router ↔ hbm_ctrl edges
-for ch_id in range(total_ch):
-    edges.append(Edge(
-        src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.hbm_ctrl",
-        bw_gbs=channel_bw, kind="ch_router_to_hbm", ...))
-    edges.append(Edge(
-        src=f"{cp}.hbm_ctrl", dst=f"{cp}.ch_r{ch_id}",
-        bw_gbs=channel_bw, kind="hbm_to_ch_router", ...))
-
-# horizontal line edges (same logical index)
-for row in range(N):
-    for p in range(pes_per_cube - 1):
-        ch_a = p * N + row
-        ch_b = (p + 1) * N + row
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_a}", dst=f"{cp}.ch_r{ch_b}",
-            bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_b}", dst=f"{cp}.ch_r{ch_a}",
-            bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
-```
-
-n:1 mode:
-
-```python
-# Create aggregated router nodes
-for pe_id in range(pes_per_cube):
-    nodes[f"{cp}.pe{pe_id}.agg_router"] = Node(
-        id=f"{cp}.pe{pe_id}.agg_router", kind="noc_router", impl="noc_v1",
-        attrs={}, pos_mm=(...),
-    )
-
-agg_bw = N * channel_bw  # aggregated BW
-
-# PE_DMA ↔ aggregated router
-for pe_id in range(pes_per_cube):
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.pe{pe_id}.agg_router",
-        bw_gbs=agg_bw, kind="pe_to_agg_router", ...))
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.pe{pe_id}.pe_dma",
-        bw_gbs=agg_bw, kind="agg_router_to_pe", ...))
-
-# aggregated router ↔ hbm_ctrl
-for pe_id in range(pes_per_cube):
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.hbm_ctrl",
-        bw_gbs=agg_bw, kind="agg_to_hbm", ...))
-    edges.append(Edge(
-        src=f"{cp}.hbm_ctrl", dst=f"{cp}.pe{pe_id}.agg_router",
-        bw_gbs=agg_bw, kind="hbm_to_agg", ...))
-
-# Horizontal links between aggregated routers
-for p in range(pes_per_cube - 1):
-    edges.append(Edge(
-        src=f"{cp}.pe{p}.agg_router", dst=f"{cp}.pe{p+1}.agg_router",
-        bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
-    edges.append(Edge(
-        src=f"{cp}.pe{p+1}.agg_router", dst=f"{cp}.pe{p}.agg_router",
-        bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
-```
-
-### Affected Existing Tests
-
-| Test File | Impact |
-| ---------- | ---- |
-| `tests/test_topology_compile.py` | Remove xbar/bridge node references, add channel router verification |
-| `tests/test_topology_load.py` | Reflect topology.yaml configuration changes |
-| `tests/test_pe_components.py` | PE_DMA routing path changes |
-| `tests/test_sip_parallel.py` | Cross-PE access path changes |
-| Cases that directly test xbar/bridge | Remove |
-
---
-
-## Test Requirements
-
- Verify that requests are delivered via per-channel links in 1:1 mode
- Verify that requests are delivered via the aggregated link in n:1 mode
- Verify that topology is correctly generated in both modes:
-  - 1:1: `total_ch` channel routers + per-PE links + horizontal links
-  - n:1: `pes_per_cube` aggregated routers + per-PE links
- Verify that effective BW is consistent across both modes for the same workload
- Verify that horizontal line routing works for cross-PE access
- Verify that routing through UCIe works for cross-cube access
- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)
-
---
-
-## Links
-
- ADR-0018 (LA + BAAW) → addressing-side integration
- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
- ADR-0004 (Memory Semantics) → BW model redefinition
- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Accepted

 ## Context

@@ -16,21 +16,6 @@ but do not actually read tensor data or perform computations.
 2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
 3. Must minimize simulation performance degradation

-### Limitations of the Existing Kernel Execution Structure
-
-The current kernel execution is separated into 3 stages:
-
-```
-Phase 0: Kernel function execution in TLContext → PeCommand list generation (outside SimPy, no data)
-Phase 1: PE_CPU replays PeCommand list via SimPy (timing only)
-```
-
-Phase 0 requires the kernel to **complete execution entirely** before SimPy begins.
-`tl.load()` returns a TensorHandle (placeholder), so actual data cannot be accessed.
-Therefore, branching based on data values (dynamic control flow) is impossible.
-
-This ADR resolves this limitation **for memory operations only** (see D1, D3).
-
 ### Constraints

 - SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
@@ -532,22 +517,3 @@ Per-dtype tolerance policy:
  (computations execute in Phase 2, result values are undetermined in Phase 1).
  Memory-data-based branching is supported via greenlet.
 - greenlet C extension dependency added (pip install greenlet)
-
---
-
-## Affected Files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/components/base.py` | Add `_on_process_start/end` hooks |
-| `src/kernbench/common/pe_commands.py` | Add `data_op = True`, extend metadata fields |
-| `src/kernbench/sim_engine/op_log.py` | New: OpRecord, OpLogger |
-| `src/kernbench/sim_engine/data_executor.py` | New: DataExecutor, MemoryStore |
-| `src/kernbench/sim_engine/engine.py` | op_logger injection (optional) |
-| `src/kernbench/triton_emu/tl_context.py` | greenlet switch calls inside `tl.load()` etc. |
-| `src/kernbench/triton_emu/kernel_runner.py` | New: KernelRunner (greenlet ↔ SimPy bridge) |
-| `src/kernbench/components/builtin/pe_cpu.py` | Remove Phase 0, change to KernelRunner invocation |
-| `pyproject.toml` | Add greenlet dependency |
-
-Component implementation files (pe_gemm.py, pe_dma.py, hbm_ctrl.py, etc.): **no changes**
-Benchmark kernels (benches/*.py): **no user API changes**
@@ -1,537 +0,0 @@
-# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing
-
-## Status
-
-Proposed
-
-## Context
-
-### Problems with the Current Structure
-
-pe_accel (SchedulerV2Component) hides 5 hardware blocks (DmaIn, DmaWb, Gemm, Math, Tcm)
-**inside a single component**.
-
-```
-SchedulerV2Component (single topology node)
-├── DmaInBlock     ← directly connected via internal SimPy Store
-├── DmaWbBlock     ← not visible in topology
-├── GemmBlock      ← not replaceable
-├── MathBlock      ← not replaceable
-└── TcmBlock       ← not replaceable
-```
-
-Problems:
- Blocks directly reference the next block via `desc.next_block` — hardcoded routing
- Individual blocks cannot be replaced (violates ADR-0015 component replacement principle)
- PE internal structure is not visible in the topology
- GemmBlock and MathBlock each duplicate TCM load/store logic
-
-### Actual Hardware Structure
-
-```
-HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
-```
-
- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
- GEMM/MATH Engine: computation between Register Files (cycle-accurate)
- Completion signal: PE-internal 1-cycle wire signal (done pin assert)
-
---
-
-## Decision
-
-### D1. Separate Each Block into an Independent Component
-
-The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
-Existing 5 blocks + 1 Fetch/Store Unit = 6 components.
-
-| Component | Role | HW Correspondence |
-|-----------|------|-------------------|
-| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
-| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
-| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
-| PE_GEMM | MAC compute (register only) | MAC Array |
-| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
-| PE_TCM | BW-serialized scratchpad | SRAM Bank |
-
-Each component exists as a topology node and is connected via ports/wires.
-Replacing the `impl` allows changing the timing model of an individual block.
-
-### D2. Token Self-Routing — Scheduler Handles Only Dispatch + Completion
-
-**Components do not pass through the scheduler at every stage.**
-The token carries a plan so that components chain directly to the next stage.
-
-```
-Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
-              ↑ chaining: does not go through scheduler          completion only
-```
-
-This matches the actual HW structure where each block's done signal is directly
-connected to the next block via wire. The scheduler is responsible **only for
-initial dispatch + completion aggregation**.
-
-#### Stage Definition
-
-```python
-class StageType(Enum):
-    DMA_READ = 0
-    FETCH = 1
-    GEMM = 2
-    MATH = 3
-    STORE = 4
-    DMA_WRITE = 5
-```
-
-#### Plan Structure
-
-When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
-The plan defines the **stage sequence** for each tile:
-
-```python
-@dataclass
-class Stage:
-    stage_type: StageType
-    component: str       # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
-    params: dict         # per-stage parameters (dynamic)
-
-@dataclass(frozen=True)
-class TilePlan:
-    tile_id: int
-    stages: tuple[Stage, ...]  # list of stages to execute in order (immutable)
-```
-
-The stage sequence varies depending on the plan:
-
-```python
-# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
-stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)
-
-# GEMM directly from TCM data (skip DMA read):
-stages = (FETCH, GEMM, STORE, DMA_WRITE)
-
-# MATH element-wise:
-stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)
-
-# GEMM + accumulation (intermediate K-tile, skip writeback):
-stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only
-```
-
-**Components do not hardcode the next component.**
-They read the next stage from the token's plan and forward it directly via out_port.
-This is the same pattern as a network packet carrying a routing header.
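
The routing-header idea can be illustrated with a minimal, SimPy-free sketch in which a token walks its own plan. The simplified `Stage` and `TileToken` shapes and the `advance` helper are assumptions made for illustration, as are the node IDs other than `sip0.cube0.pe0.pe_dma`. The real token described in this ADR also carries a pipeline context and per-stage params.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5


@dataclass(frozen=True)
class Stage:
    stage_type: StageType
    component: str  # topology node ID of the block that runs this stage


@dataclass
class TileToken:
    tile_id: int
    stages: tuple  # tuple of Stage, in execution order
    stage_idx: int = 0


def advance(token: TileToken) -> Optional[str]:
    """Step to the next stage; return its component, or None after the last stage."""
    token.stage_idx += 1
    if token.stage_idx >= len(token.stages):
        return None
    return token.stages[token.stage_idx].component


PE = "sip0.cube0.pe0"
# Intermediate K-tile plan: the result is stored to TCM only, with no writeback.
accumulate_plan = (
    Stage(StageType.DMA_READ, f"{PE}.pe_dma"),
    Stage(StageType.FETCH, f"{PE}.pe_fetch_store"),
    Stage(StageType.GEMM, f"{PE}.pe_gemm"),
    Stage(StageType.STORE, f"{PE}.pe_fetch_store"),
)

token = TileToken(tile_id=0, stages=accumulate_plan)
visited = [token.stages[0].component]
while (nxt := advance(token)) is not None:
    visited.append(nxt)
print(visited)
```

No component in this sketch names its successor. The visiting order comes entirely from the plan carried by the token, so a different plan changes the route without touching any component.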
-
-#### Pipeline Context
-
-```python
-@dataclass
-class PipelineContext:
-    id: str
-    total_tiles: int
-    completed_tiles: int = 0
-    done_event: simpy.Event | None = None  # succeeds when all tiles are complete
-
-    def complete_tile(self) -> None:
-        self.completed_tiles += 1
-        if self.completed_tiles == self.total_tiles:
-            self.done_event.succeed()
-```
-
-**Completion follows an exactly-once contract**: the last stage of each tile must call
-`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
-succeed only once (SimPy Event constraint).
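
One way to make the exactly-once contract fail loudly is to track completed tile IDs, sketched below under stated assumptions. `OneShotEvent` is a stand-in for `simpy.Event` so the example runs without SimPy, and `GuardedPipelineContext` with its `tile_id` argument is a hypothetical variation. The `complete_tile()` defined in this ADR takes no argument.

```python
from dataclasses import dataclass, field


class OneShotEvent:
    """Stand-in for simpy.Event: it may be triggered at most once."""

    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        if self.triggered:
            raise RuntimeError("event has already been triggered")
        self.triggered = True


@dataclass
class GuardedPipelineContext:
    id: str
    total_tiles: int
    done_event: OneShotEvent
    _completed: set = field(default_factory=set)

    def complete_tile(self, tile_id: int) -> None:
        # A duplicate completion is a bug, so surface it immediately.
        if tile_id in self._completed:
            raise RuntimeError(f"tile {tile_id} completed twice")
        self._completed.add(tile_id)
        # Fire the pipeline-level event only when the last tile finishes.
        if len(self._completed) == self.total_tiles:
            self.done_event.succeed()


ctx = GuardedPipelineContext(id="p0", total_tiles=2, done_event=OneShotEvent())
ctx.complete_tile(0)
ctx.complete_tile(1)
print(ctx.done_event.triggered)
```

Tracking IDs rather than a bare counter costs a small set per pipeline, but it turns a silent over-count into an immediate error at the offending call.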
-
-#### Scheduler Role (Reduced)
-
-When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext,
-enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately.
-
-Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
-This feeder consumes `_pending_feeds` in FIFO order and
-**does not allow tile feed interleaving across composite commands.**
-That is, the feed for the next command begins only after all tiles of the current
-command have been injected into the first stage queue.
-
-There is **exactly one `_feed_loop`** per scheduler, and
-tile feed for composite commands is performed exclusively through this single process.
-Command issue order refers to **the order in which PE_SCHEDULER receives PeInternalTxn**.
-
-This structure maintains command issue order while ensuring that when the first stage
-queue is full, only the feeder process blocks — the scheduler worker's inbox processing
-itself does not stall.
-
-```python
-class PeSchedulerV2(PeEngineBase):
-    _pipelines: dict[str, PipelineContext]
-    _pending_feeds: simpy.Store   # FIFO of (plan, ctx)
-
-    def start(self, env):
-        super().start(env)
-        self._pending_feeds = simpy.Store(env)
-        env.process(self._feed_loop(env))
-
-    def _dispatch_composite(self, env, pe_txn, cmd):
-        plan = generate_plan(cmd)
-        ctx = PipelineContext(
-            id=next_id(),
-            total_tiles=len(plan.tiles),
-            done_event=pe_txn.done,
-        )
-        self._pipelines[ctx.id] = ctx
-
-        # only enqueue to feeder queue and return immediately
-        yield self._pending_feeds.put((plan, ctx))
-
-    def _feed_loop(self, env):
-        """Single feeder process: feeds composite commands in FIFO order.
-
-        Tile feed interleaving across composite commands is not allowed.
-        The feed for the next command begins only after all tiles of the
-        current command have been injected into the first stage queue.
-
-        When the first stage queue is full, only this feeder blocks;
-        the scheduler worker's inbox processing does not stall.
-        """
-        while True:
-            plan, ctx = yield self._pending_feeds.get()
-            for tile in plan.tiles:
-                token = TileToken(
-                    tile_id=tile.tile_id,
-                    pipeline_ctx=ctx,
-                    plan=tile,
-                    stage_idx=0,
-                    params=tile.stages[0].params,
-                )
-                yield self.out_ports[tile.stages[0].component].put(token)
-                # queue capacity = HW queue depth → feeder blocks only when full
-```
-
-In this ADR, the scheduler can accept multiple composite commands,
-but tile submission order follows per-command FIFO.
-Within a command, tile-level pipeline overlap is allowed,
-but tile feed interleaving across commands is not.
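
The per-command FIFO rule can be illustrated without SimPy. This is a minimal sketch under stated assumptions: `first_stage_feed_order` is a hypothetical helper, commands are reduced to `(name, tile_count)` pairs, and backpressure is ignored, so only the injection order is modeled.

```python
from collections import deque
from typing import Iterator


def first_stage_feed_order(pending: deque) -> Iterator[tuple]:
    """Yield (command, tile_index) in the order a single feeder injects them."""
    while pending:
        cmd, n_tiles = pending.popleft()  # FIFO: oldest command first
        for tile_idx in range(n_tiles):   # drain this command completely
            yield cmd, tile_idx           # before touching the next command


order = list(first_stage_feed_order(deque([("cmd1", 3), ("cmd2", 2)])))
```

Because a single loop drains one command before popping the next, tiles of `cmd2` can never appear between tiles of `cmd1`.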
-
-### D3. Data Transfer vs. Completion Signal — HW Modeling Criteria
-
-| Communication Type | Method | HW Correspondence |
-|-------------------|--------|-------------------|
-| Tile token (work directive) | message via out_port | enqueue to command queue |
-| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
-| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |
-
-**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.
-
-**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
-is applied to intra-PE stage triggers. Chaining between components corresponds to
-PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost
-is incurred.
-
-**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
-When all tiles are complete, PipelineContext calls done_event.succeed().
-
-### D4. Asynchronous Pipeline — Natural Overlap
-
-The scheduler processes CompositeCmds **asynchronously**.
-However, tile feed does not spawn an independent process per command; instead,
-the scheduler's internal **single feeder process** performs the feed in FIFO order.
-Therefore, the scheduler can continue to receive the next command,
-but the first-stage tile injection order is guaranteed per command.
-
-Since **SimPy Store capacity = HW queue depth**:
-- When the queue is full, put() naturally blocks (backpressure)
-- While DMA is processing a later tile, downstream stages (fetch, then GEMM) can already work on tiles whose DMA has completed
-- When a second CompositeCmd arrives, its tiles are fed into the DMA queue directly after the first command's tiles
-
-```
-First-stage feed order (feeder → DMA queue):
-  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
-                                            ↑ cmd2 starts after cmd1 feed completes
-
-Runtime pipeline (downstream overlap):
-  PE_DMA:    [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
-  PE_FETCH:          [cmd1:t0][cmd1:t1]...
-  PE_GEMM:                   [cmd1:t0][cmd1:t1]...
-                              ↑ pipeline overlap within the same command
-```
-
-Here, the overlap does not come from tile feed interleaving across different commands,
-but occurs naturally as tiles from earlier commands progress to downstream stages
-while the feeder continues injecting subsequent tiles.
-
-For example, tile feed for cmd2 does not start until all tiles of cmd1 have been
-injected into the first stage queue. However, while cmd1.tile0 has already progressed
-to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so
-**pipeline overlap within the same command occurs naturally**.
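
The overlap described above can be estimated with a small recurrence. This is a sketch under simplifying assumptions that are not part of the ADR: a linear pipeline, one tile per stage at a time, unbounded queues between stages, and zero chaining latency (consistent with D3). The latencies are illustrative values, not measurements.

```python
def pipeline_finish_times(stage_latency: list, n_tiles: int) -> list:
    """finish[s][t] = time at which tile t leaves stage s."""
    finish = [[0.0] * n_tiles for _ in stage_latency]
    for s, lat in enumerate(stage_latency):
        for t in range(n_tiles):
            ready = finish[s - 1][t] if s > 0 else 0.0  # tile arrives from upstream
            free = finish[s][t - 1] if t > 0 else 0.0   # stage finished the previous tile
            finish[s][t] = max(ready, free) + lat
    return finish


# Illustrative latencies only: DMA=2, FETCH=1, GEMM=3, three tiles.
finish = pipeline_finish_times([2.0, 1.0, 3.0], n_tiles=3)
makespan = finish[-1][-1]
serial = 3 * (2.0 + 1.0 + 3.0)
```

With these numbers tile 0 occupies GEMM from t=3 to t=6 while tile 2 is still in DMA from t=4 to t=6, so the pipelined makespan is shorter than running the three tiles back to back.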
-
-#### Component Chaining Pattern
-
-All components follow the same pattern:
-
-```python
-def _pipeline_worker(self, env):
-    while True:
-        token = yield self._inbox.get()
-
-        # process own stage
-        yield from self._process(env, token)
-
-        # chain to next stage (read from plan)
-        next_idx = token.stage_idx + 1
-        if next_idx < len(token.plan.stages):
-            next_stage = token.plan.stages[next_idx]
-            token.stage_idx = next_idx
-            token.params = next_stage.params
-            yield self.out_ports[next_stage.component].put(token)
-        else:
-            # last stage — pipeline completion
-            token.pipeline_ctx.complete_tile()
-```
-
-### D5. PE_FETCH_STORE — Dedicated TCM ↔ Register File Transfer
-
-Previously, GemmBlock and MathBlock each implemented their own TCM read/write.
-This is separated into a **PE_FETCH_STORE component**.
-
-```python
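-# PE_FETCH_STORE._process()
-def _process(self, env, token):
-    # tcm_id and tcm_done are placeholders in this sketch: the PE_TCM node ID
-    # and the event that signals completion of the TCM transfer.
-    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
-    yield tcm_done
-    # chaining is handled by the base class (D4 pattern)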
-```
-
-Advantages:
-- GEMM/MATH perform **pure compute only**, with no TCM access logic
-- Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
-- Prefetch strategies can be experimented with by replacing the fetch unit alone
-
-### D6. Simplification of Each Compute Component
-
-GEMM/MATH perform compute only with register data already prepared.
-**Chaining follows the common pattern (D4), so only _process() needs to be implemented:**
-
-```python
-# PE_GEMM._process()
-def _process(self, env, token):
-    yield env.timeout(self._mac_latency(token.params))
-
-# PE_MATH._process()
-def _process(self, env, token):
-    yield env.timeout(self._simd_latency(token.params))
-
-# PE_FETCH_STORE._process()
-def _process(self, env, token):
-    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
-    yield tcm_done
-
-# PE_DMA._process()
-def _process(self, env, token):
-    yield from self._do_fabric_dma(token.params)
-```
-
-By replacing only the timing model, one can freely switch between cycle-accurate
-and analytical models. Since the chaining logic resides in the base class,
-each component only implements its pure stage logic.
-
-### D7. Topology Changes
-
-Add PE_FETCH_STORE to the PE template:
-
-```yaml
-pe_template:
-  components:
-    pe_cpu:         { kind: pe_cpu,         impl: pe_cpu_v1, ... }
-    pe_scheduler:   { kind: pe_scheduler,   impl: pe_scheduler_v2, ... }
-    pe_dma:         { kind: pe_dma,         impl: pe_dma_v1, ... }
-    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
-    pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1, ... }
-    pe_math:        { kind: pe_math,        impl: pe_math_v1, ... }
-    pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1, ... }
-    pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1, ... }
-  links:
-    # existing links...
-    fetch_store_to_tcm_bw_gbs: 512.0
-    fetch_store_to_tcm_mm: 0.0
-```
-
-PE internal edge connections:
-```
-PE_SCHEDULER → PE_DMA (initial dispatch)
-PE_SCHEDULER → PE_FETCH_STORE (initial dispatch)
-PE_SCHEDULER → PE_GEMM (initial dispatch)
-PE_SCHEDULER → PE_MATH (initial dispatch)
-PE_DMA → PE_FETCH_STORE (chaining)
-PE_FETCH_STORE → PE_GEMM (chaining)
-PE_FETCH_STORE → PE_MATH (chaining)
-PE_GEMM → PE_FETCH_STORE (store chaining)
-PE_MATH → PE_FETCH_STORE (store chaining)
-PE_FETCH_STORE → PE_DMA (writeback chaining)
-PE_FETCH_STORE → PE_TCM (BW request)
-```
-
-Topology edges cover both **control/dispatch visibility and runtime chaining**.
-Scheduler → sub-component edges are initial dispatch paths, while
-inter-component edges are runtime chaining paths driven by token self-routing.
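
Since tokens route themselves, a plan is only executable if every hop exists as a topology edge. The following is a minimal sketch of such a check, using the edge list above. `plan_is_routable` is a hypothetical helper, and the mapping of a GEMM plan onto component names (read and writeback on PE_DMA, fetch and store on PE_FETCH_STORE) is an assumption.

```python
# Edges copied from the PE internal edge list (source, destination).
PE_EDGES = {
    ("PE_SCHEDULER", "PE_DMA"),
    ("PE_SCHEDULER", "PE_FETCH_STORE"),
    ("PE_SCHEDULER", "PE_GEMM"),
    ("PE_SCHEDULER", "PE_MATH"),
    ("PE_DMA", "PE_FETCH_STORE"),
    ("PE_FETCH_STORE", "PE_GEMM"),
    ("PE_FETCH_STORE", "PE_MATH"),
    ("PE_GEMM", "PE_FETCH_STORE"),
    ("PE_MATH", "PE_FETCH_STORE"),
    ("PE_FETCH_STORE", "PE_DMA"),
    ("PE_FETCH_STORE", "PE_TCM"),
}


def plan_is_routable(components: list) -> bool:
    """True if the scheduler can dispatch to the first stage and every hop is an edge."""
    if not components:
        return False
    if ("PE_SCHEDULER", components[0]) not in PE_EDGES:
        return False
    return all((a, b) in PE_EDGES for a, b in zip(components, components[1:]))


gemm_plan = ["PE_DMA", "PE_FETCH_STORE", "PE_GEMM", "PE_FETCH_STORE", "PE_DMA"]
bad_plan = ["PE_DMA", "PE_GEMM"]  # skips the fetch stage: no such edge
```

A check like this could run at plan generation time, so that a missing chaining edge fails early rather than as a missing `out_ports` key during simulation.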
-
-### D8. Existing Code Migration — Builtin Integration
-
-The existing builtin v1 components and pe_accel are **replaced with new builtin components**.
-
-#### Migration Strategy
-
-1. Back up existing `components/builtin/` → `components/builtin_legacy/` (preserved without modification)
-2. Back up existing `components/custom/pe_accel/` → likewise
-3. Re-implement new `components/builtin/` with the ADR-0021 architecture
-4. Maintain **only one** topology.yaml (including pe_fetch_store)
-5. components.yaml points to the new builtin
-
-```yaml
-# components.yaml — new builtin
-pe_scheduler_v1: kernbench.components.builtin.pe_scheduler:PeSchedulerComponent
-pe_gemm_v1:      kernbench.components.builtin.pe_gemm:PeGemmComponent
-pe_math_v1:      kernbench.components.builtin.pe_math:PeMathComponent
-pe_dma_v1:       kernbench.components.builtin.pe_dma:PeDmaComponent
-pe_fetch_store_v1: kernbench.components.builtin.pe_fetch_store:PeFetchStoreComponent
-pe_tcm_v1:       kernbench.components.builtin.pe_tcm:PeTcmComponent
-```
-
-The impl names (pe_gemm_v1, etc.) are preserved, but **the implementations are replaced
-with the ADR-0021 architecture**. Existing benchmarks and tests referencing topology.yaml
-continue to work without changes.
-
-#### Latency Model Inheritance
-
-The latency modeling of the new builtin components (MAC cycle calculation, SIMD latency,
-TCM BW serialization, DMA fabric latency, etc.) is **based on the current pe_accel
-implementation**. The tile schedule generation logic from tiling.py is also carried over.
-Only the architecture (component separation, self-routing) changes; timing accuracy
-is preserved.
-
-#### Test Strategy
-
-#### Test Plan
-
-**1. Existing tests pass** (regression):
-After migration is complete, all existing tests (366) must pass.
-
-**2. Latency regression**:
-Verify that the new builtin produces identical latency for the same inputs as pe_accel.
-
-**3. Phase 1 → Phase 2 end-to-end**:
-Integration test from SimPy simulation (Phase 1) op_log generation → DataExecutor
-(Phase 2) actual numpy computation → result correctness verification.
-- GEMM: tl.composite(gemm) → op_log → Phase 2 matmul → allclose verification
-- MATH: tl.exp / tl.add, etc. → op_log → Phase 2 numpy op → allclose verification
-- Chaining: GEMM output → MATH input → final result end-to-end verification
-
-**4. TileToken self-routing**:
-- Verify that tiles chain according to the plan's stage sequence
-- Verify PipelineContext.complete_tile() exactly-once at the last stage
-- Queue backpressure: verify that only the feeder blocks when DMA queue capacity is exceeded
-
-**5. Asynchronous pipeline overlap**:
-- Verify that inter-tile stage overlap occurs within the same command (tile0 in GEMM while tile1 in DMA)
-- Multiple commands: verify that cmd2 feed starts after cmd1 feed completes (FIFO order)
-
-### D9. TileToken Message Definition
-
-A message used for passing tile work between components.
-The token carries the plan and stage index, enabling self-routing.
-
-```python
-@dataclass
-class TileToken:
-    tile_id: int
-    pipeline_ctx: PipelineContext    # completion tracking
-    plan: TilePlan                   # full stage sequence for this tile (immutable)
-    stage_idx: int                   # current stage index in plan.stages
-    params: dict                     # current stage parameter cache (canonical: plan.stages[stage_idx].params)
-    data_op: bool = True             # op_log recording target (ADR-0020)
-```
-
-A TileToken is **owned by exactly one component at a time** and
-is never referenced by multiple components simultaneously (single-owner).
-
-Token lifecycle:
-1. Scheduler creates it with stage_idx=0 and puts it to the first stage component
-2. The component executes _process(), increments stage_idx, and puts it to the next component
-3. The last stage component calls pipeline_ctx.complete_tile()
-4. When all tiles are complete, PipelineContext calls done_event.succeed()
-
-Relationship with existing PeInternalTxn:
-- PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged)
-- TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)
-
-----
-
-## Non-goals
-
-- **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is not modified
-  (PeInternalTxn-based, ADR-0014 maintained)
-- **Resource contention model across multiple pipelines**: the current scope focuses on
-  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines
-  are future work.
-- **builtin_legacy maintenance**: kept for backup purposes only; not a target for
-  bug fixes or feature additions.
-
-## Open Questions
-
-- **Register File capacity model**: whether to model capacity limits when the fetch unit
-  loads into registers. Capacity is expressed in bytes (register_file_bytes), and
-  the number of tiles that can be held simultaneously is determined by tile size.
-  When capacity is exceeded, fetch stalls, creating natural backpressure.
-- **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite
-  commands. Therefore, overlap arises not from pre-injection across commands, but
-  naturally from pipeline progression of tiles within the same command.
-  If additional prefetch is needed, it should be considered at the level of tile ordering
-  within the same command or fetch/store unit policy, not cross-command injection.
-- **PE_DMA coalescing**: per-tile DMA may cause fragmentation.
-  Direction is to merge/coalesce within DMA without scheduler involvement.
-- **Synchronous execution mode**: this ADR adopts asynchronous pipeline as the
-  default/sole execution model. If a sync mode is needed for debug or validation
-  purposes, it will be considered in a future ADR.
-- **TCM bank conflict across multiple pipelines**: currently based on a single pipeline.
-  Bank conflict modeling when multiple pipelines simultaneously access TCM is future work.
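
For the Register File capacity question, the relationship between `register_file_bytes` and the number of resident tiles is simple integer division. This is a minimal sketch: `max_resident_tiles` and its tile-shape parameters are hypothetical names, and the 64 KiB capacity and 32x32 two-byte tiles are illustrative values only.

```python
def max_resident_tiles(register_file_bytes: int, tile_rows: int,
                       tile_cols: int, elem_bytes: int) -> int:
    """Number of whole tiles that fit in the register file at once."""
    tile_bytes = tile_rows * tile_cols * elem_bytes
    if tile_bytes <= 0:
        raise ValueError("tile dimensions and element size must be positive")
    return register_file_bytes // tile_bytes


# Illustrative values: 64 KiB register file, 32x32 tiles of 2-byte elements.
resident = max_resident_tiles(64 * 1024, tile_rows=32, tile_cols=32, elem_bytes=2)
```

If this limit were modeled, one option would be to use the result as the capacity of the queue in front of the compute stage, so that a full register file stalls fetch through the same backpressure mechanism as D4.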
-
----
-
-## Consequences
-
-### Positive
-
-- Each block is an independent component and is individually replaceable (ADR-0015 compliant)
-- PE internal structure is visible in the topology
-- Components do not know the next component; plan-based routing provides flexibility
-- Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
-- Improved HW modeling accuracy (done signal = Event, data transfer = message)
-- Fetch/store separation enables accurate TCM BW contention modeling
-
-### Negative
-
-- Increased number of PE internal components (5 → 6), which means more topology nodes/edges
-- Component separation makes intra-PE token forwarding more explicit than before
-- Breaking change from existing builtin/pe_accel; migration required
-
-----
-
-## Affected Files
-
-| File | Change |
-|------|--------|
-| `topology.yaml` | Add pe_fetch_store component, add chaining edges |
-| `components.yaml` | Register new builtin components |
-| `src/kernbench/topology/builder.py` | Add fetch_store + chaining edges to PE internal edges |
-| `src/kernbench/common/pe_commands.py` | Add TileToken definition |
-| `src/kernbench/components/builtin/pe_scheduler.py` | Re-implement (feeder + plan-based dispatch) |
-| `src/kernbench/components/builtin/pe_gemm.py` | Re-implement (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_math.py` | Re-implement (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_dma.py` | Re-implement (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_fetch_store.py` | New |
-| `src/kernbench/components/builtin/pe_tcm.py` | Re-implement (TcmRequest service) |
-| `src/kernbench/components/builtin/types.py` | New: TilePlan, Stage, StageType, PipelineContext, TileToken |
-| `src/kernbench/components/builtin/tiling.py` | Ported from pe_accel: plan generation logic |
-
-Backup:
-
-| Path | Note |
-|------|------|
-| `src/kernbench/components/builtin_legacy/` | Full backup of existing builtin (preserved without modification) |
-| `src/kernbench/components/custom/pe_accel/` | Backup of existing pe_accel (preserved without modification) |
@@ -1,528 +0,0 @@
-# ADR-0021: PE Pipeline Refactoring (Component Separation + Scheduler-Based Routing)
-
-## Status
-
-Proposed
-
-## Context
-
-### Problems with the Current Structure
-
-pe_accel (SchedulerV2Component) hides five hardware blocks (DmaIn, DmaWb, Gemm, Math, Tcm)
-**inside a single component**.
-
-```
-SchedulerV2Component (single topology node)
-├── DmaInBlock     ← connected directly via an internal SimPy Store
-├── DmaWbBlock     ← not visible in the topology
-├── GemmBlock      ← not replaceable
-├── MathBlock      ← not replaceable
-└── TcmBlock       ← not replaceable
-```
-
-Problems:
-- Each block references the next block directly via `desc.next_block` (hard-coded routing)
-- Individual blocks cannot be replaced (violates the ADR-0015 component replacement principle)
-- The PE internal structure is not visible in the topology
-- GemmBlock and MathBlock each duplicate the TCM load/store logic
-
-### Actual Hardware Structure
-
-```
-HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
-```
-
-- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
-- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
-- GEMM/MATH Engine: Register File to Register File compute (cycle-accurate)
-- Completion signal: PE-internal 1-cycle wire signal (done pin assert)
-
-----
-
-## Decision
-
-### D1. Separate Each Block into an Independent Component
-
-The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
-The existing 5 plus 1 Fetch/Store Unit make 6 components.
-
-| Component | Role | HW Correspondence |
-|-----------|------|-------------------|
-| PE_SCHEDULER | plan generation, tile state management, stage routing | Scheduler/Sequencer |
-| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
-| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
-| PE_GEMM | MAC compute (register only) | MAC Array |
-| PE_MATH | element-wise/reduction (register only) | SIMD/Vector Unit |
-| PE_TCM | BW-serialized scratchpad | SRAM Bank |
-
-Each component exists as a topology node and is connected via ports/wires.
-Replacing `impl` changes the timing model of an individual block.
-
-### D2. Token Self-Routing: the Scheduler Handles Only Dispatch and Completion
-
-**Components do not go through the scheduler at every stage.**
-The token carries the plan, so each component chains directly to the next stage.
-
-```
-Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
-              ↑ chaining: bypasses the scheduler             completion only
-```
-
-This matches the real HW structure, in which each block's done signal is wired
-directly to the next block. The scheduler is responsible **only for initial dispatch and completion aggregation**.
-
-#### Stage Definition
-
-```python
-class StageType(Enum):
-    DMA_READ = 0
-    FETCH = 1
-    GEMM = 2
-    MATH = 3
-    STORE = 4
-    DMA_WRITE = 5
-```
-
-#### Plan Structure
-
-When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
-The plan defines the **stage sequence** of each tile:
-
-```python
-@dataclass
-class Stage:
-    stage_type: StageType
-    component: str       # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
-    params: dict         # per-stage parameters (dynamic)
-
-@dataclass(frozen=True)
-class TilePlan:
-    tile_id: int
-    stages: tuple[Stage, ...]  # stages to execute, in order (immutable)
-```
-
-The stage sequence varies by plan:
-
-```python
-# Regular GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
-stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)
-
-# GEMM directly on data already in TCM (DMA read skipped):
-stages = (FETCH, GEMM, STORE, DMA_WRITE)
-
-# MATH element-wise:
-stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)
-
-# GEMM + accumulation (intermediate K-tile, writeback skipped):
-stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only
-```
-
-**Components do not hard-code the next component.**
-Each reads the next stage from the token's plan and forwards the token directly via out_port.
-This is the same pattern as a network packet carrying a routing header.
-
-#### Pipeline Context
-
-```python
-@dataclass
-class PipelineContext:
-    id: str
-    total_tiles: int
-    completed_tiles: int = 0
-    done_event: simpy.Event | None = None  # succeeds when all tiles are complete
-
-    def complete_tile(self) -> None:
-        self.completed_tiles += 1
-        if self.completed_tiles == self.total_tiles:
-            self.done_event.succeed()
-```
-
-**Completion follows an exactly-once contract**: the last stage of each tile must call
-`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
-succeed only once (SimPy Event constraint).
-
-#### Scheduler Role (Reduced)
-
-When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext,
-enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately.
-
-Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
-This feeder consumes `_pending_feeds` in FIFO order and
-**does not allow tile feed interleaving across composite commands.**
-That is, the feed for the next command begins only after all tiles of the current
-command have been injected into the first stage queue.
-
-There is **exactly one `_feed_loop`** per scheduler, and
-tile feed for composite commands is performed exclusively through this single process.
-Command issue order refers to **the order in which PE_SCHEDULER receives PeInternalTxn**.
-
-This structure maintains command issue order while ensuring that when the first stage
-queue is full, only the feeder process blocks; the scheduler worker's inbox processing
-itself does not stall.
-
-```python
-class PeSchedulerV2(PeEngineBase):
-    _pipelines: dict[str, PipelineContext]
-    _pending_feeds: simpy.Store   # FIFO of (plan, ctx)
-
-    def start(self, env):
-        super().start(env)
-        self._pending_feeds = simpy.Store(env)
-        env.process(self._feed_loop(env))
-
-    def _dispatch_composite(self, env, pe_txn, cmd):
-        plan = generate_plan(cmd)
-        ctx = PipelineContext(
-            id=next_id(),
-            total_tiles=len(plan.tiles),
-            done_event=pe_txn.done,
-        )
-        self._pipelines[ctx.id] = ctx
-
-        # only enqueue to the feeder queue and return immediately
-        yield self._pending_feeds.put((plan, ctx))
-
-    def _feed_loop(self, env):
-        """Single feeder process: feeds composite commands in FIFO order.
-
-        Tile feed interleaving across composite commands is not allowed.
-        The feed for the next command begins only after all tiles of the
-        current command have been injected into the first stage queue.
-
-        When the first stage queue is full, only this feeder blocks;
-        the scheduler worker's inbox processing does not stall.
-        """
-        while True:
-            plan, ctx = yield self._pending_feeds.get()
-            for tile in plan.tiles:
-                token = TileToken(
-                    tile_id=tile.tile_id,
-                    pipeline_ctx=ctx,
-                    plan=tile,
-                    stage_idx=0,
-                    params=tile.stages[0].params,
-                )
-                yield self.out_ports[tile.stages[0].component].put(token)
-                # queue capacity = HW queue depth → feeder blocks only when full
-```
-
-In this ADR, the scheduler can accept multiple composite commands,
-but tile submission order follows per-command FIFO.
-Within a command, tile-level pipeline overlap is allowed,
-but tile feed interleaving across commands is not.
-
-### D3. Data Transfer vs. Completion Signal: HW Modeling Criteria
-
-| Communication Type | Method | HW Correspondence |
-|-------------------|--------|-------------------|
-| Tile token (work directive) | message via out_port | enqueue to command queue |
-| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
-| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |
-
-**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.
-
-**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
-is applied to intra-PE stage triggers. Chaining between components corresponds to
-PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost
-is incurred.
-
-**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
-When all tiles are complete, PipelineContext calls done_event.succeed().
-
-### D4. Asynchronous Pipeline: Natural Overlap
-
-The scheduler processes CompositeCmds **asynchronously**.
-However, tile feed does not spawn an independent process per command; instead,
-the scheduler's internal **single feeder process** performs the feed in FIFO order.
-Therefore, the scheduler can continue to receive the next command,
-but the first-stage tile injection order is guaranteed per command.
-
-Since **SimPy Store capacity = HW queue depth**:
-- When the queue is full, put() naturally blocks (backpressure)
-- While DMA is processing a later tile, downstream stages (fetch, then GEMM) can already work on tiles whose DMA has completed
-- When a second CompositeCmd arrives, its tiles are fed into the DMA queue directly after the first command's tiles
-
-```
-First-stage feed order (feeder → DMA queue):
-  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
-                                            ↑ cmd2 starts after cmd1 feed completes
-
-Runtime pipeline (downstream overlap):
-  PE_DMA:    [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
-  PE_FETCH:          [cmd1:t0][cmd1:t1]...
-  PE_GEMM:                   [cmd1:t0][cmd1:t1]...
-                              ↑ pipeline overlap within the same command
-```
-
-Here, the overlap does not come from tile feed interleaving across different commands,
-but occurs naturally as tiles from earlier commands progress to downstream stages
-while the feeder continues injecting subsequent tiles.
-
-For example, tile feed for cmd2 does not start until all tiles of cmd1 have been
-injected into the first stage queue. However, while cmd1.tile0 has already progressed
-to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so
-**pipeline overlap within the same command occurs naturally**.
-
-#### Component Chaining Pattern
-
-All components follow the same pattern:
-
-```python
-def _pipeline_worker(self, env):
-    while True:
-        token = yield self._inbox.get()
-
-        # process own stage
-        yield from self._process(env, token)
-
-        # chain to next stage (read from plan)
-        next_idx = token.stage_idx + 1
-        if next_idx < len(token.plan.stages):
-            next_stage = token.plan.stages[next_idx]
-            token.stage_idx = next_idx
-            token.params = next_stage.params
-            yield self.out_ports[next_stage.component].put(token)
-        else:
-            # last stage: pipeline completion
-            token.pipeline_ctx.complete_tile()
-```
-
-### D5. PE_FETCH_STORE: Dedicated TCM ↔ Register File Transfer
-
-Previously, GemmBlock and MathBlock each implemented their own TCM read/write.
-This is separated into a **PE_FETCH_STORE component**.
-
-```python
-# PE_FETCH_STORE._process()
-def _process(self, env, token):
-    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
-    yield tcm_done
-    # chaining is handled by the base class (D4 pattern)
-```
-
-Advantages:
-- GEMM/MATH perform **pure compute only**, with no TCM access logic
-- Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
-- Prefetch strategies can be experimented with by replacing the fetch unit alone
-
-### D6. Simplification of Each Compute Component
-
-GEMM/MATH perform compute only with register data already prepared.
-**Chaining follows the common pattern (D4), so only _process() needs to be implemented:**
-
-```python
-# PE_GEMM._process()
-def _process(self, env, token):
-    yield env.timeout(self._mac_latency(token.params))
-
-# PE_MATH._process()
-def _process(self, env, token):
-    yield env.timeout(self._simd_latency(token.params))
-
-# PE_FETCH_STORE._process()
-def _process(self, env, token):
-    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
-    yield tcm_done
-
-# PE_DMA._process()
-def _process(self, env, token):
-    yield from self._do_fabric_dma(token.params)
-```
-
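-By replacing only the timing model, one can freely switch between cycle-accurate
-and analytical models. Since the chaining logic resides in the base class,
-each component only implements its pure stage logic.
-
-### D7. Topology Changes
-
-Add PE_FETCH_STORE to the PE template: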
-
-```yaml
-pe_template:
-  components:
-    pe_cpu:         { kind: pe_cpu,         impl: pe_cpu_v1, ... }
-    pe_scheduler:   { kind: pe_scheduler,   impl: pe_scheduler_v2, ... }
-    pe_dma:         { kind: pe_dma,         impl: pe_dma_v1, ... }
-    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
-    pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1, ... }
-    pe_math:        { kind: pe_math,        impl: pe_math_v1, ... }
-    pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1, ... }
-    pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1, ... }
-  links:
-    # existing links...
-    fetch_store_to_tcm_bw_gbs: 512.0
-    fetch_store_to_tcm_mm: 0.0
-```
-
-PE internal edge connections:
-```
-PE_SCHEDULER → PE_DMA (initial dispatch)
-PE_SCHEDULER → PE_FETCH_STORE (initial dispatch)
-PE_SCHEDULER → PE_GEMM (initial dispatch)
-PE_SCHEDULER → PE_MATH (initial dispatch)
-PE_DMA → PE_FETCH_STORE (chaining)
-PE_FETCH_STORE → PE_GEMM (chaining)
-PE_FETCH_STORE → PE_MATH (chaining)
-PE_GEMM → PE_FETCH_STORE (store chaining)
-PE_MATH → PE_FETCH_STORE (store chaining)
-PE_FETCH_STORE → PE_DMA (writeback chaining)
-PE_FETCH_STORE → PE_TCM (BW request)
-```
-
-Topology edges cover both **control/dispatch visibility and runtime chaining**.
-Scheduler → sub-component edges are initial dispatch paths, while
-inter-component edges are runtime chaining paths driven by token self-routing.
-
-### D8. Existing Code Migration: Builtin Integration
-
-The existing builtin v1 components and pe_accel are **replaced with new builtin components**.
-
-#### Migration Strategy
-
-1. Back up existing `components/builtin/` → `components/builtin_legacy/` (preserved without modification)
-2. Back up existing `components/custom/pe_accel/` → likewise
-3. Re-implement new `components/builtin/` with the ADR-0021 architecture
-4. Maintain **only one** topology.yaml (including pe_fetch_store)
-5. components.yaml points to the new builtin
-
-```yaml
-# components.yaml: new builtin
-pe_scheduler_v1: kernbench.components.builtin.pe_scheduler:PeSchedulerComponent
-pe_gemm_v1:      kernbench.components.builtin.pe_gemm:PeGemmComponent
-pe_math_v1:      kernbench.components.builtin.pe_math:PeMathComponent
-pe_dma_v1:       kernbench.components.builtin.pe_dma:PeDmaComponent
-pe_fetch_store_v1: kernbench.components.builtin.pe_fetch_store:PeFetchStoreComponent
-pe_tcm_v1:       kernbench.components.builtin.pe_tcm:PeTcmComponent
-```
-
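-The impl names (pe_gemm_v1, etc.) are preserved, but **the implementations are replaced
-with the ADR-0021 architecture**. Existing benchmarks and tests referencing topology.yaml
-continue to work without changes.
-
-#### Latency Model Inheritance
-
-The latency modeling of the new builtin components (MAC cycle calculation, SIMD latency,
-TCM BW serialization, DMA fabric latency, etc.) is **based on the current pe_accel
-implementation**. The tile schedule generation logic from tiling.py is also carried over.
-Only the architecture (component separation, self-routing) changes; timing accuracy
-is preserved.
-
-#### Test Strategy
-
-#### Test Plan
-
-**1. Existing tests pass** (regression):
-After migration is complete, all existing tests (366) must pass.
-
-**2. Latency regression**:
-Verify that the new builtin produces identical latency for the same inputs as pe_accel.
-
-**3. Phase 1 → Phase 2 end-to-end**:
-Integration test from SimPy simulation (Phase 1) op_log generation → DataExecutor
-(Phase 2) actual numpy computation → result correctness verification.
-- GEMM: tl.composite(gemm) → op_log → Phase 2 matmul → allclose verification
-- MATH: tl.exp / tl.add, etc. → op_log → Phase 2 numpy op → allclose verification
-- Chaining: GEMM output → MATH input → final result end-to-end verification
-
-**4. TileToken self-routing**:
-- Verify that tiles chain according to the plan's stage sequence
-- Verify PipelineContext.complete_tile() exactly-once at the last stage
-- Queue backpressure: verify that only the feeder blocks when DMA queue capacity is exceeded
-
-**5. Asynchronous pipeline overlap**:
-- Verify that inter-tile stage overlap occurs within the same command (tile0 in GEMM while tile1 in DMA)
-- Multiple commands: verify that cmd2 feed starts after cmd1 feed completes (FIFO order)
-
-### D9. TileToken Message Definition
-
-A message used for passing tile work between components.
-The token carries the plan and stage index, enabling self-routing.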
-
-```python
-@dataclass
-class TileToken:
-    tile_id: int
-    pipeline_ctx: PipelineContext    # completion tracking
-    plan: TilePlan                   # full stage sequence for this tile (immutable)
-    stage_idx: int                   # current stage index in plan.stages
-    params: dict                     # current stage parameter cache (canonical: plan.stages[stage_idx].params)
-    data_op: bool = True             # op_log recording target (ADR-0020)
-```
-
-TileToken은 한 시점에 **하나의 컴포넌트에 의해서만 소유**되며,
-동시에 여러 컴포넌트에 의해 참조되지 않는다 (single-owner).
-
-Token lifecycle:
-1. Scheduler가 stage_idx=0으로 생성, 첫 stage 컴포넌트에 put
-2. 컴포넌트가 _process() 실행 후 stage_idx 증가, 다음 컴포넌트에 put
-3. 마지막 stage 컴포넌트가 pipeline_ctx.complete_tile() 호출
-4. 모든 tile 완료 시 PipelineContext가 done_event.succeed()
-
-기존 PeInternalTxn과의 관계:
- PeInternalTxn: PE_CPU → PE_SCHEDULER 간 command 전달 (기존 유지)
- TileToken: PE_SCHEDULER → 하위 컴포넌트 간 tile 단위 작업 전달 (신규, self-routing)
-
---
-
-## Non-goals
-
- **PE_CPU 변경**: PE_CPU → PE_SCHEDULER 인터페이스는 변경하지 않음
-  (PeInternalTxn 기반, ADR-0014 유지)
- **다중 pipeline 간 자원 경합 모델**: 현재 범위에서는 단일 pipeline의
-  정확한 모델링에 집중. 다중 pipeline 간 TCM bank conflict 등은 future work.
- **builtin_legacy 유지보수**: 백업 목적이며, 버그 수정이나 기능 추가 대상이 아님.
-
-## Open Questions
-
- **Register File 용량 모델**: fetch unit이 register에 로드할 때 용량 제한을
-  모델링할지. 용량은 바이트 단위(register_file_bytes)로 표현하며,
-  동시에 보유 가능한 tile 수는 tile 크기에 따라 결정된다.
-  용량 초과 시 fetch가 stall되어 자연스러운 backpressure가 발생한다.
- **Prefetch 전략**: 본 ADR에서는 composite command 간 tile feed interleaving을
-  허용하지 않는다. 따라서 overlap은 command 간 선행 투입이 아니라,
-  같은 command 내부 tile들의 pipeline progression에서 자연스럽게 발생한다.
-  추가적인 prefetch가 필요하면 command 간 투입이 아니라, 같은 command 내부에서의
-  tile ordering 또는 fetch/store unit policy 차원에서 검토한다.
- **PE_DMA coalescing**: tile 단위 DMA는 fragmentation 발생 가능.
-  DMA 내부에서 merge/coalesce하되 scheduler는 관여하지 않는 방향.
- **동기 실행 모드**: 본 ADR에서는 비동기 pipeline을 기본/유일 execution model로
-  채택한다. 디버그 또는 validation 목적의 sync mode가 필요하면 future ADR에서 검토.
- **다중 pipeline 간 TCM bank conflict**: 현재 단일 pipeline 기준.
-  다중 pipeline이 동시에 TCM에 접근할 때의 bank conflict 모델은 future work.
-
---
-
-## Consequences
-
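-### Positive
-
- Each block is an independent component and can be replaced individually (complies with ADR-0015)
- The PE-internal structure becomes visible in the topology
- A component does not know its successor; plan-based routing provides flexibility
- Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
- Better HW modeling fidelity (done signal = Event, data transfer = message)
- Separating fetch and store models TCM BW contention accurately
-
-### Negative
-
- More PE-internal components (5 → 6), so more topology nodes/edges
- Component separation makes intra-PE token forwarding more explicit than before
- Breaking change relative to the existing builtin/pe_accel; migration is required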
-
---
-
-## Affected files
-
-| File | Change |
-|------|------|
-| `topology.yaml` | Add the pe_fetch_store component and chaining edges |
-| `components.yaml` | Register the new builtin components |
-| `src/kernbench/topology/builder.py` | Add fetch_store + chaining edges to the PE-internal edges |
-| `src/kernbench/common/pe_commands.py` | Add the TileToken definition |
-| `src/kernbench/components/builtin/pe_scheduler.py` | Reimplemented (feeder + plan-based dispatch) |
-| `src/kernbench/components/builtin/pe_gemm.py` | Reimplemented (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_math.py` | Reimplemented (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_dma.py` | Reimplemented (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_fetch_store.py` | New |
-| `src/kernbench/components/builtin/pe_tcm.py` | Reimplemented (TcmRequest service) |
-| `src/kernbench/components/builtin/types.py` | New: TilePlan, Stage, StageType, PipelineContext, TileToken |
-| `src/kernbench/components/builtin/tiling.py` | Ported from pe_accel: plan-generation logic |
-
-Backup:
-| `src/kernbench/components/builtin_legacy/` | Full backup of the existing builtin (kept unmodified) |
-| `src/kernbench/components/custom/pe_accel/` | Backup of the existing pe_accel (kept unmodified) |
@@ -1,10 +1,10 @@
 # ADR-0022: 2D Grid program_id Semantics

- **Status**: Accepted
- **Date**: 2026-04-09
- **Context**: Triton-style kernel addressing for multi-cube PE topology
+## Status

-## Problem
+Accepted
+
+## Context

 Triton kernels use `tl.program_id(axis)` to identify their position in a launch grid.
 Our hardware has a 2-level hierarchy: **cubes** contain **PEs**.
@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Accepted

 ## Context

@@ -19,17 +19,6 @@ queues. Host-level collectives (`dist.all_reduce`) are deferred to
 **future work**; this ADR focuses solely on the kernel-side collective
 infrastructure.

-### Current state
-
- ADR-0021 PE pipeline refactor: each PE is decomposed into components
-  (PE_CPU, PE_SCHEDULER, PE_DMA, PE_FETCH_STORE, PE_GEMM, PE_MATH,
-  PE_TCM, PE_MMU).
- No direct PE-to-PE channel exists today. All data movement goes
-  through PE_DMA → cube_noc / UCIe / PCIE → HBM.
- A pre-ADR host CCL skeleton exists (`dist.init_process_group(backend="ahbm")`,
-  `_run_ccl_bench` running per-rank greenlets concurrently). The
-  collective itself is a stub.
-
 ### Problems to solve

 1. PE-to-PE direct data movement (writing into a peer's memory).
@@ -372,24 +361,41 @@ When the receiver frees a slot, the sender must learn about it
 travel through general vc_comm fabric — it uses a **separate fast
 path**, an abstraction of the NVLink / UCIe credit-return wire.

-**Latency** is computed from the **bottleneck BW on the path**, not a
-magic constant:
+**Latency** is computed from the **full path latency** (per-node
+overhead + edge propagation + drain), not a magic constant:

 ```
 credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
-path = router.find_path(self_pe, peer_pe)
-latency = compute_drain_ns(path, credit_size_bytes)
-        = credit_size_bytes / bottleneck_bw_on_path
+path = router.find_path(self_pe, peer_pe.pe_dma)
+latency = compute_path_latency_ns(path, credit_size_bytes)
+        = sum(edge.distance_mm * ns_per_mm)
+        + sum(node_overhead_ns[n] for n in path)
+        + credit_size_bytes / bottleneck_bw_on_path
 ```
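
As a rough illustration of the three-term formula above, here is a minimal sketch. It is not the simulator's `compute_path_latency_ns`: the `Edge` type, the `bw_bytes_per_ns` field, the function name, and every number are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge record; the real edge_map entries may differ."""
    distance_mm: float      # physical length of the link
    bw_bytes_per_ns: float  # link bandwidth


def path_latency_ns(edges, node_overheads_ns, nbytes, ns_per_mm):
    """Edge propagation + per-node overhead + drain at the bottleneck link."""
    propagation = sum(e.distance_mm * ns_per_mm for e in edges)
    overhead = sum(node_overheads_ns)
    drain = nbytes / min(e.bw_bytes_per_ns for e in edges)
    return propagation + overhead + drain


# Illustrative two-hop path carrying one 16 B credit (made-up numbers).
edges = [Edge(distance_mm=2.0, bw_bytes_per_ns=64.0),
         Edge(distance_mm=1.0, bw_bytes_per_ns=32.0)]
latency = path_latency_ns(edges, node_overheads_ns=[1.0, 2.0, 1.0],
                          nbytes=16, ns_per_mm=0.5)
```

With these inputs the propagation term is 1.5 ns, the node overhead 4.0 ns, and the drain 0.5 ns, so a longer path raises the credit-return cost through the first two terms even though the drain stays small.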

+The router auto-appends `.pe_dma` to the source only, so the
+destination MUST be spelled with the explicit `.pe_dma` suffix.
+Without it, `find_path` raises and the credit silently teleports at
+zero cost (a latent bug fixed alongside this update).
+
+`tl.recv` blocks on the credit-emit completion (recv yields-from
+`_delayed_credit_send` rather than spawning it as a fork). This puts
+the credit-return cost on the receiver's `pe_exec_ns`, modeling the
+IPCQ control-plane completing the consume-acknowledgement before
+recv returns to the kernel — the protocol equivalent of a non-posted
+`tl.store` waiting for an HBM ack on the raw DMA path.
+
 That gives us:

 - **Topology-proportional approximation**: an in-cube credit return is
  automatically faster than a cross-SIP credit return.
- **No magic constants**: no arbitrary `ipcq_ctrl_latency_ns`.
+- **No magic constants**: every nanosecond comes from
+  `compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
+  as data traffic.
 - **No deadlock risk**: unlike piggyback, B can issue credit even when
-  it has no data to send back.
- **Reuses existing utility**: `ComponentContext.compute_drain_ns`.
+  it has no data to send back. `peer_credit_store.put` is unbounded.
+- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
+  cost on recv balances the HBM ack-trip cost RAW pays on the sender.

 #### Component coupling — SimPy Store channel

@@ -420,11 +426,21 @@ fan-out (see `IpcqInitMsg` in D12).
 #### PE_DMA's added responsibility

 When `vc_comm` receives a token, PE_DMA processes it as the following
-**atomic** sequence. **No SimPy yield is allowed between the two steps**
-(invariant I6):
+sequence: pay the Transaction's terminal BW drain, then atomically
+write data and forward metadata. **No SimPy yield is allowed between
+the data write and the metadata forward** (invariant I6). The drain
+yield must sit before the atomic block, not inside it:

 ```python
-def _on_vc_comm_recv(self, env, token):
+def _on_vc_comm_recv(self, env, txn):
+    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
+    # sender PE_DMA). MUST happen before the atomic block so recv only
+    # wakes after the bytes have "landed".
+    drain = getattr(txn, "drain_ns", 0.0)
+    if drain > 0:
+        yield env.timeout(drain)
+
+    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    data = self._memory_store.read(token.src_space, token.src_addr,
                                   shape=..., dtype=...)
@@ -439,6 +455,33 @@ The final `put` is yieldable but uses an unbounded internal store, so
 it completes in a single step. That `put` is the closing call of the
 atomic block; nothing may be inserted before it.

+#### Drain-at-inbound semantics (D9 timing model)
+
+The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
+stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
+is paid at each forwarding component via `run()`, and the remaining
+BW drain is paid once at the Transaction's terminal. Every non-IPCQ
+Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
+`ComponentBase._forward_txn` at the terminal node. For IPCQ, the
+destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
+to perform the IPCQ-specific data write and metadata forward, which
+bypasses that terminal payment. **The drain MUST therefore be paid
+explicitly at the top of that handler** to keep IPCQ's timing model on
+par with every other fabric Transaction.
+
+Side-effects of paying drain here:
+
+- **SRC `tl.send`** is unchanged — fire-and-forget semantics are
+  preserved because the sender PE_DMA does not `yield sub_done`. The
+  `sub_done.succeed()` call (made after metadata forward below) is an
+  event with no listener on the sender side.
+- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
+  when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
+  forward now happens after the drain, recv observes the full fabric
+  transfer time including bandwidth cost.
+
+Matches the physical picture: send dispatches and leaves; recv waits
+until the bytes have actually been drained into its inbox.
+
 ### D9.5. ADR-0020 (2-pass) integration

 `tl.send` / `tl.recv` integrates with ADR-0020's two-pass model. Phase
@@ -666,7 +709,7 @@ piggyback, tail updates via the D9 fast-path channel.

 ### D13. Test strategy

-Following the ADR-0021 D8 pattern.
+Test plan:

 #### T1. Unit tests (component-level)

@@ -758,7 +801,7 @@ F5. **Slot full + infinite backpressure**: the peer never recvs.
 ### D15. Algorithm-author cheat sheet

 Full step-by-step lives in
-[`docs/ccl-author-guide.en.md`](../ccl-author-guide.en.md). The
+[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
 shortest version:

 | Things you touch | Things you don't |
@@ -778,6 +821,432 @@ fairness from `tl.recv()` round-robin, confusing

 ---

+## HW Realization Notes (Informative)
+
+**Status of this section**: Forward-looking. Describes how the simulator
+contract (D1–D15) would map to silicon. Not currently implemented;
+subject to revision before tapeout. The simulator implements the
+contract via Python/SimPy equivalents in
+[pe_ipcq.py](../../src/kernbench/components/builtin/pe_ipcq.py) and
+[pe_dma.py](../../src/kernbench/components/builtin/pe_dma.py).
+
+### D16. Proposed HW block diagram and end-to-end dataflow
+
+![PE Baseline Architecture](../diagrams/pe_baseline.png)
+
+> Source: [`../diagrams/pe_baseline.d2`](../diagrams/pe_baseline.d2) — `d2 --layout=elk --scale 1.5`.
+
+![PE Proposed Architecture](../diagrams/pe_proposed.png)
+
+> Source: [`../diagrams/pe_proposed.d2`](../diagrams/pe_proposed.d2) — `d2 --layout=elk`.
+
+**Baseline → Proposed key changes**:
+
+- Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW)
+- PE_IPCQ (SimPy component) → **IPCQ Controller** (HW register + combinational logic)
+- **IPCQ Slot Region reserved area** within TCM
+- Credit Injector / Receiver connect directly to the NoC via the Fabric Port
+
+#### End-to-end sequence (HW view)
+
+```mermaid
+sequenceDiagram
+    participant CPU_A as PE_A: PE_CPU
+    participant IPCQ_A as PE_A: IPCQ Ctrl
+    participant DMA_A as PE_A: DMA
+    participant NOC as NoC Fabric
+    participant DMA_B as PE_B: DMA
+    participant IPCQ_B as PE_B: IPCQ Ctrl
+    participant TCM_B as PE_B: TCM
+    participant CPU_B as PE_B: PE_CPU
+
+    Note over CPU_A: tl.send(dir="E", src=0x1000)
+
+    CPU_A->>IPCQ_A: MMIO: send request
+    Note over IPCQ_A: Backpressure check:<br/>(head - peer_tail_cache) < n_slots → PASS<br/>Slot addr gen:<br/>dst = peer_rx_base + (head%n) × slot_size
+    IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head}
+    Note over IPCQ_A: my_head++
+    IPCQ_A-->>CPU_A: send returns (fire-and-forget)
+
+    Note over DMA_A: TCM read → snapshot in read buffer<br/>Flit pack: data + {sender_seq, dst_addr}
+    DMA_A->>NOC: IPCQ data flit(s)
+
+    Note over NOC: hop latency + BW drain
+
+    NOC->>DMA_B: IPCQ data flit(s)
+    Note over DMA_B: Terminal BW drain<br/>Slot write latency
+
+    rect rgb(255, 240, 220)
+        Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall
+        DMA_B->>TCM_B: write data → slot address
+        DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr}
+    end
+
+    Note over IPCQ_B: Range match dst_addr → direction "W"<br/>peer_head_cache["W"] = sender_seq + 1
+    IPCQ_B-->>CPU_B: recv_wake signal
+
+    Note over CPU_B: tl.recv(dir="W") wakes up
+    CPU_B->>IPCQ_B: recv request
+    Note over IPCQ_B: peer_head_cache > my_tail → YES<br/>slot_addr = rx_base + (tail%n) × slot_size
+    IPCQ_B-->>CPU_B: return slot_addr
+    CPU_B->>TCM_B: read data from slot
+    Note over IPCQ_B: my_tail++
+
+    IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa}
+    Note over NOC: credit traversal (NoC latency)
+    NOC->>IPCQ_A: Credit arrival
+
+    Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled)
+```
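
The pointer bookkeeping in the sequence above can be sketched as a small timing-free model. This is an illustrative reduction, not `PeIpcqComponent`: the `QueuePair` class and its method names are hypothetical, and fabric delivery is replaced by direct method calls.

```python
class QueuePair:
    """One direction of an IPCQ link as seen from a single PE (hypothetical model)."""

    def __init__(self, n_slots):
        self.n_slots = n_slots
        self.my_head = 0          # sender write position (monotonic)
        self.my_tail = 0          # receiver read position (monotonic)
        self.peer_head_cache = 0  # advanced when data metadata arrives
        self.peer_tail_cache = 0  # advanced when a credit arrives

    def try_send(self):
        """Return sender_seq, or None when the peer ring is full."""
        if self.my_head - self.peer_tail_cache >= self.n_slots:
            return None
        seq = self.my_head
        self.my_head += 1
        return seq

    def on_data(self, sender_seq):
        self.peer_head_cache = max(self.peer_head_cache, sender_seq + 1)

    def try_recv(self):
        """Return consumer_seq for the credit, or None when nothing is waiting."""
        if self.peer_head_cache <= self.my_tail:
            return None
        self.my_tail += 1
        return self.my_tail

    def on_credit(self, consumer_seq):
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)


a, b = QueuePair(n_slots=2), QueuePair(n_slots=2)  # A sends "E", B receives "W"
sent = [a.try_send() for _ in range(3)]  # third attempt hits backpressure
for seq in sent[:2]:
    b.on_data(seq)                       # fabric delivers data + metadata
a.on_credit(b.try_recv())                # B consumes one slot; credit returns
resumed = a.try_send()                   # the stalled send can now proceed
```

The third send stalls until the credit arrives, which is the behavior the backpressure check and the final credit step in the diagram describe.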
+
+### D17. IPCQ Controller HW Module (NEW)
+
+The hardware control block sitting between PE_CPU and the DMA Engine.
+Corresponds to the simulator's `PeIpcqComponent`.
+
+#### QPair Register File
+
+Per-direction queue-pair state held in flip-flops. The PE_CPU reads /
+writes them via MMIO (CSRs); software populates them at init time.
+
+```
+Per-direction registers (each 64-bit):
+  my_head          — sender write position (monotonic)
+  my_tail          — receiver read position (monotonic)
+  peer_head_cache  — last known peer head (updated by Meta Extractor)
+  peer_tail_cache  — last known peer tail (updated by Credit Receiver)
+  rx_base_pa       — this PE's rx buffer base physical address
+  peer_rx_base_pa  — peer's rx buffer base physical address
+  n_slots          — ring depth (power-of-2 constraint, see D21)
+  slot_size        — bytes per slot
+  peer_credit_tgt  — peer PE's credit-receive address
+
+Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare)
+Total: 8 dirs × 9 regs × 8 B = 576 B of flip-flops
+```
+
+#### Slot Address Generator (combinational)
+
+```
+Input:  pointer (my_head or my_tail), n_slots, slot_size, base_pa
+Output: slot_addr = base_pa + (pointer % n_slots) * slot_size
+
+Implementation:
+  n_slots power-of-2 → pointer & (n_slots - 1)   (AND mask, 1 gate)
+  slot_size power-of-2 → barrel shift             (1 cycle)
+  64-bit add → ripple / Kogge-Stone adder         (1 cycle)
+
+Latency: 1–2 combinational cycles
+```
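
To show why the power-of-2 constraint lets the mask and shift stand in for modulo and multiply, here is a small sketch. The function names and the example addresses are assumptions for illustration only.

```python
def slot_addr_modmul(pointer, n_slots, slot_size, base_pa):
    """Reference form: general modulo and multiply."""
    return base_pa + (pointer % n_slots) * slot_size


def slot_addr_maskshift(pointer, n_slots, slot_size, base_pa):
    """Mask-and-shift form; valid only when both sizes are powers of two."""
    assert n_slots > 0 and n_slots & (n_slots - 1) == 0
    assert slot_size > 0 and slot_size & (slot_size - 1) == 0
    index = pointer & (n_slots - 1)        # pointer % n_slots
    shift = slot_size.bit_length() - 1     # log2(slot_size)
    return base_pa + (index << shift)      # index * slot_size


# Illustrative values: 8 slots of 64 KB starting at 0xE00000.
addr = slot_addr_maskshift(pointer=11, n_slots=8, slot_size=0x10000,
                           base_pa=0xE00000)
```

Pointer 11 wraps to slot index 3, so both forms produce `0xE30000` for these inputs.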
+
+#### Backpressure Comparator (combinational)
+
+```
+full = (my_head - peer_tail_cache) >= n_slots
+
+Implementation: 64-bit subtract + unsigned compare
+Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold
+Latency: 1 cycle
+```
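
A short sketch of the subtract-and-compare, assuming the counters are 64-bit registers that eventually wrap. The `is_full` name and the masking approach are illustrative; the simulator's Python integers do not wrap.

```python
MASK64 = (1 << 64) - 1  # models a 64-bit register width


def is_full(my_head, peer_tail_cache, n_slots):
    """Unsigned 64-bit subtract followed by an unsigned compare."""
    in_flight = (my_head - peer_tail_cache) & MASK64
    return in_flight >= n_slots


# Head has wrapped to 1 while the cached tail is still 2**64 - 1:
# two slots are in flight, so a 2-slot ring is full.
wrapped_full = is_full(my_head=1, peer_tail_cache=MASK64, n_slots=2)
```

Because the subtraction is taken modulo 2^64, the comparison stays correct across counter wraparound as long as fewer than 2^64 slots are ever in flight.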
+
+#### Meta Extractor (inbound datapath sideband)
+
+Wired into the DMA Engine's inbound vc_comm path. Extracts metadata
+from arriving IPCQ flit headers and updates queue-pair state.
+
+```
+Trigger: DMA inbound write completion (same cycle)
+Extract: {sender_seq, dst_addr} from flit header
+
+Direction matching (ADR-0025 D2):
+  for each dir:
+    match = (rx_base_pa[dir] <= dst_addr) && (dst_addr < rx_base_pa[dir] + n_slots[dir] * slot_size[dir])
+  8× parallel range comparators + priority encoder
+
+Update: peer_head_cache[matched_dir] = max(peer_head_cache[matched_dir], sender_seq + 1)
+Output: recv_wake signal → PE_CPU interrupt / flag
+Latency: 1 cycle (pipelined with the DMA write — I6 atomicity is intrinsic)
+```
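
The range match and the monotonic update can be sketched sequentially as follows. The hardware would evaluate all directions in parallel; the dictionary layout, function names, and addresses here are assumptions for the example.

```python
def match_direction(dst_addr, rx_regions):
    """rx_regions maps direction -> (rx_base_pa, n_slots, slot_size)."""
    for direction, (base, n_slots, slot_size) in rx_regions.items():
        if base <= dst_addr < base + n_slots * slot_size:
            return direction
    return None  # address falls outside every IPCQ rx ring


def on_meta(peer_head_cache, rx_regions, sender_seq, dst_addr):
    """Advance the matched direction's cached head; never move it backwards."""
    direction = match_direction(dst_addr, rx_regions)
    if direction is not None:
        peer_head_cache[direction] = max(peer_head_cache[direction],
                                         sender_seq + 1)
    return direction


# Illustrative layout: two adjacent 4-slot rings of 4 KB slots.
regions = {"N": (0xE00000, 4, 0x1000), "W": (0xE04000, 4, 0x1000)}
heads = {"N": 0, "W": 0}
hit = on_meta(heads, regions, sender_seq=5, dst_addr=0xE05000)
```

The `max` keeps the cache monotonic, so a late or duplicated arrival with a lower sequence number cannot roll the head back.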
+
+#### Credit Injector (outbound)
+
+```
+Trigger: recv completion (after my_tail increments)
+Action:  pack a 16 B credit packet → DMA vc_comm (or a dedicated credit VC)
+
+Packet: {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa}
+Latency: 1 cycle to generate; then NoC traversal
+```
+
+#### Credit Receiver (inbound sideband)
+
+```
+Trigger: 16 B credit packet arrival (from NoC)
+Extract: {consumer_seq, dst_rx_base_pa}
+
+Direction matching (ADR-0025 D3):
+  for each dir:
+    match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa)
+
+Update: peer_tail_cache[matched_dir] = max(peer_tail_cache[matched_dir], consumer_seq)
+Output: send_wake signal → deassert backpressure stall
+Latency: 1 cycle
+```
+
+### D18. DMA Engine vc_comm IPCQ-aware mode
+
+Add IPCQ-flit handling to the existing vc_comm channel (D8).
+
+**Outbound**:
+
+1. Receive a command from the IPCQ Controller: `{src_addr, dst_addr, nbytes, sender_seq}`.
+2. Read `src_addr` from TCM → snapshot into the DMA read buffer (standard DMA behavior).
+3. Pack flit: data + piggyback metadata (`sender_seq`, `dst_addr`).
+4. Inject into the NoC fabric port.
+5. Fire-and-forget (no completion wait).
+
+**Inbound**:
+
+1. Receive an IPCQ flit from the NoC.
+2. Charge terminal BW drain (`drain_ns = nbytes / bottleneck_bw`).
+3. Charge slot write latency (per backing memory tier).
+4. **ATOMIC** (same pipeline stage, no stall insertion):
+   - TCM write: data → slot address.
+   - Meta Extractor trigger: `sender_seq` + `dst_addr` → IPCQ Controller.
+5. Done.
+
+**I6 atomicity guaranteed in hardware**: TCM write completion and Meta
+Extractor trigger occur in the same pipeline stage, so no separate
+synchronization is needed. The simulator's "no SimPy yield between
+`MemoryStore.write` and `IpcqMetaArrival` put" (D9, I6) is preserved
+naturally.
+
+#### Data snapshot semantics
+
+Data latched into the DMA read buffer is unaffected by subsequent
+writes to `src` memory. This is standard DMA read-then-write
+behavior; no extra HW is required.
+
+#### Credit virtual channel (optional)
+
+- **Option A**: multiplex credits onto vc_comm (distinguish via 16 B
+  header-only flits).
+- **Option B**: add a third dedicated credit VC (strict priority > data).
+
+Option B is friendlier to deadlock prevention, but a 16 B credit's BW
+impact is negligible, so Option A suffices.
+
+### D19. Fabric flit format extension
+
+```
+Generic data flit (e.g. 512-bit):
+┌──────────────────────────────────────────┐
+│ [511:480] routing header (32b)           │
+│ [479:0]   payload (480b = 60 B)          │
+└──────────────────────────────────────────┘
+
+IPCQ data flit (only the first flit carries metadata):
+┌──────────────────────────────────────────┐
+│ [511:480] routing header (32b)           │
+│   [511]    ipcq_flag (1b)                │  ← IPCQ vs. normal DMA
+│   [510:509] vc_id (2b)                   │
+│   [508:480] route + hop count            │
+│ [479:416] ipcq_metadata (64b)            │  ← piggyback
+│   [479:448] sender_seq (32b)             │
+│   [447:416] dst_addr[31:0] (32b)         │  ← used for direction match
+│ [415:0]   payload (416b = 52 B)          │
+└──────────────────────────────────────────┘
+Subsequent flits: full 60 B payload (no metadata).
+
+Credit-only flit (128-bit, header-only):
+┌──────────────────────────────────────────┐
+│ [127:96]  routing header (32b)           │
+│   [127]   credit_flag (1b)               │
+│ [95:64]   consumer_seq (32b)             │
+│ [63:0]    dst_rx_base_pa (64b)           │
+└──────────────────────────────────────────┘
+```
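
As a sanity check on the bit positions in the first-flit layout, the sketch below packs and unpacks the top 96 bits (`[511:416]`). The function names are hypothetical, and truncating `sender_seq` to 32 bits is an assumption that follows the field width shown in the layout.

```python
def pack_ipcq_header(vc_id, route, sender_seq, dst_addr):
    """Pack bits [511:416] of the first IPCQ flit into one integer."""
    assert 0 <= vc_id < 4 and 0 <= route < (1 << 29)
    header = (1 << 31) | (vc_id << 29) | route  # ipcq_flag = 1
    meta = ((sender_seq & 0xFFFFFFFF) << 32) | (dst_addr & 0xFFFFFFFF)
    return (header << 64) | meta


def unpack_ipcq_header(bits):
    """Inverse of pack_ipcq_header."""
    header, meta = bits >> 64, bits & ((1 << 64) - 1)
    return {
        "ipcq_flag": header >> 31,
        "vc_id": (header >> 29) & 0x3,
        "route": header & ((1 << 29) - 1),
        "sender_seq": meta >> 32,
        "dst_addr_lo": meta & 0xFFFFFFFF,
    }


fields = unpack_ipcq_header(
    pack_ipcq_header(vc_id=1, route=0x1234, sender_seq=7, dst_addr=0xE05000))
```

Only the low 32 bits of `dst_addr` travel in the flit, so the direction match on the receiving side has to work on that truncated address.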
+
+First-flit payload shrinks from 60 B to 52 B (13 % overhead). For
+multi-flit transfers the subsequent flits carry full payloads, so the
+8 B of metadata falls below 1 % of payload capacity once a transfer
+spans 14 or more flits (about 800 B).
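
The overhead figures can be reproduced with a short calculation. This sketch assumes the 60 B / 52 B payload sizes from the layout above and defines overhead as metadata bytes divided by total payload capacity; the function names are illustrative.

```python
import math

FLIT_PAYLOAD_B = 60        # payload of a generic flit
FIRST_FLIT_PAYLOAD_B = 52  # first IPCQ flit gives up 8 B to metadata
METADATA_B = FLIT_PAYLOAD_B - FIRST_FLIT_PAYLOAD_B


def ipcq_flit_count(nbytes):
    """Flits needed when only the first flit carries metadata."""
    if nbytes <= FIRST_FLIT_PAYLOAD_B:
        return 1
    return 1 + math.ceil((nbytes - FIRST_FLIT_PAYLOAD_B) / FLIT_PAYLOAD_B)


def metadata_fraction(nbytes):
    """Metadata bytes as a share of the payload capacity of all flits."""
    return METADATA_B / (ipcq_flit_count(nbytes) * FLIT_PAYLOAD_B)


single = metadata_fraction(52)    # one flit: 8 / 60
large = metadata_fraction(4096)   # 69 flits: 8 / 4140
```

A single-flit message pays about 13 %, while a 4 KB transfer pays roughly 0.2 %.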
+
+### D20. TCM IPCQ slot region layout
+
+```
+TCM Memory Map (16 MB):
+┌─────────────────────────────┐ 0x000000
+│  Kernel Working Memory      │
+│  (compute tensors)          │
+│  ~14 MB                     │
+├─────────────────────────────┤ 0xE00000
+│  IPCQ RX Buffers            │
+│  Dir N: slots × slot_size   │
+│  Dir S: slots × slot_size   │
+│  Dir E: slots × slot_size   │
+│  Dir W: slots × slot_size   │
+│  ~1 MB                      │
+├─────────────────────────────┤ 0xF00000
+│  IPCQ Metadata / Scratch    │
+│  ~1 MB                      │
+└─────────────────────────────┘ 0xFFFFFF
+```
+
+Place the IPCQ region in the upper TCM bank to minimize bank conflict
+with compute accesses (see Risk D22).
+
+### D21. 2 nm implementation analysis
+
+#### Area estimate
+
+| Module | Gate count | Area (2 nm est.) | Notes |
+|---|---|---|---|
+| QPair Register File | ~4.6 K FF | 0.002 mm² | 576 B of flip-flops |
+| Slot Addr Gen + Backpressure | ~5 K gates | 0.001 mm² | Combinational |
+| Meta Extractor + Credit Logic | ~3 K gates | 0.001 mm² | 8× parallel comparators |
+| **IPCQ Controller subtotal** | **~12.6 K** | **~0.004 mm²** | **< 0.1 % of the PE area** |
+| DMA vc_comm extension | ~2 K gates | 0.002 mm² | Flit pack / unpack |
+| **Total delta** | **~14.6 K** | **~0.006 mm²** | |
+
+#### Timing
+
+| Path | Delay (2 nm est.) | Target clock | Margin |
+|---|---|---|---|
+| Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× |
+| Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× |
+| Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× |
+| Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× |
+
+All critical paths fit within one cycle. Timing closure is not a
+concern.
+
+#### Power
+
+- Active: ~1 mW (register R/W + comparators while sending / receiving).
+- Idle: leakage only.
+- Negligible vs. total PE power.
+
+#### Constraints
+
+| Item | Constraint | Rationale |
+|---|---|---|
+| `n_slots` | **must be power-of-2** | mod → AND mask (1 gate). Arbitrary values need a divider (~10 cycles). |
+| `slot_size` | **power-of-2 recommended** | mul → barrel shift. Arbitrary values need a multiplier. |
+| TCM IPCQ region | **dedicated bank** | Prevents bank conflict with compute accesses. |
+
+### D22. Risk assessment
+
+#### TCM bank conflict
+
+- **Risk**: IPCQ slot write and compute read both target the same TCM
+  bank → stall.
+- **Mitigation**: place the IPCQ region in a dedicated upper-address
+  bank (D20).
+- **Cost**: a small loss of TCM banking flexibility.
+- **Severity**: Medium (performance), Low (no correctness issue).
+
+#### Credit return latency under congestion
+
+- **Risk**: NoC congestion → credit-return delay → sender backpressure
+  stall.
+- **Mitigation**:
+  - Put credits on a separate VC with strict priority (16 B →
+    negligible BW impact).
+  - Or pick `n_slots` generously (8+) so credit delay is absorbed by
+    buffer depth.
+- **Severity**: Low (16 B credits contribute almost nothing to
+  congestion).
+
+#### Inter-direction ordering
+
+- **Risk**: simultaneous sends from one PE on multiple directions.
+- **Mitigation**: per-direction monotonic `sender_seq` suffices.
+  Inter-direction ordering is the kernel's (software's)
+  responsibility — same as the simulator model (D2 + D4).
+- **Severity**: Low (resolved by design).
+
+### D23. HW alternatives considered
+
+#### Doorbell + polling (traditional)
+
+```
+Send: DMA write data → DMA write a doorbell register at the peer → peer polls doorbell
+Recv: polling loop on the doorbell, or interrupt-driven
+```
+
+| Pros | Cons |
+|---|---|
+| Simple HW (no IPCQ controller) | Two DMA transactions (data + doorbell) |
+| Reuses existing DMA | Needs explicit fence between data and doorbell |
+| | Polling burns power; interrupt adds latency |
+
+**Verdict**: 2–3× latency vs. piggyback. **Rejected.**
+
+#### Hardware message queue (NVIDIA NVLink style)
+
+```
+Send: CPU → push a descriptor onto HMQ → HW relays it to the peer HMQ
+Recv: pop a descriptor from HMQ → use the data pointer
+```
+
+| Pros | Cons |
+|---|---|
+| CPU only writes descriptors | Needs a separate HMQ engine (~0.05 mm²) |
+| Descriptor / data separation is flexible | Separate datapath from DMA → area / power overlap |
+| | Large tensors still need DMA |
+
+**Verdict**: With CCL's large-tensor pattern, DMA is still required,
+so HMQ + DMA is a duplicated datapath. **Rejected.**
+
+#### RDMA-style completion queue (CQ)
+
+```
+Send: DMA write → CQE auto-posted at the peer
+Recv: CQ poll / interrupt → read data location
+```
+
+| Pros | Cons |
+|---|---|
+| Mature InfiniBand / RoCE model | CQ management logic + CQE memory overhead |
+| Good multi-tenant isolation | CQE / data ordering needs extra plumbing |
+| | Over-engineered for PE-to-PE CCL |
+
+**Verdict**: RDMA CQ is suited to host-facing NICs with multi-tenant
+isolation. For single-owner PE-to-PE this is needless complexity.
+**Rejected.**
+
+#### Credit-in-data piggyback (v2 optimization candidate)
+
+In the current design the credit return is a separate 16 B packet.
+For bidirectional traffic patterns, **the credit can be folded into a
+reverse-direction data flit**.
+
+```
+PE_A →E→ PE_B: data + sender_seq=3
+PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4  ← credit folded into data
+```
+
+| Pros | Cons |
+|---|---|
+| Removes the dedicated credit packet → NoC BW savings | Needs fallback for unidirectional patterns |
+| Bidirectional allreduce: credit latency → 0 | +8 B in the flit header (negligible) |
+| | Slightly more logic complexity |
+
+**Verdict**: A strong optimization. Eliminates the credit packet for
+bidirectional allreduce; the standalone credit fallback is retained.
+**Recommended for v2.**
+
+### Open HW questions
+
+- What fraction of TCM may the IPCQ slot region occupy? (Current
+  assumption: ~1 MB / 16 MB = 6.25 %.)
+- Dedicated credit VC vs. vc_comm multiplexing? (See D18.)
+- Inter-SIP link flit-format compatibility verification.
+- Maximum `n_slots`? (8 directions × 8 slots × 64 KB = 4 MB → 25 % of
+  TCM.)
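
The sizing arithmetic behind these questions can be checked with a small helper. The function name and the 16 MB default are assumptions taken from the D20 memory map, not an existing API.

```python
def ipcq_rx_footprint(n_dirs, n_slots, slot_size_bytes, tcm_bytes=16 * 2**20):
    """Return (total rx-buffer bytes, fraction of TCM they occupy)."""
    total = n_dirs * n_slots * slot_size_bytes
    return total, total / tcm_bytes


# Worst case from the last question: 8 directions x 8 slots x 64 KB.
worst_bytes, worst_share = ipcq_rx_footprint(8, 8, 64 * 1024)

# A configuration that fits the ~1 MB region: 4 directions x 8 slots x 32 KB.
fit_bytes, fit_share = ipcq_rx_footprint(4, 8, 32 * 1024)
```

The worst case needs 4 MB (25 % of TCM), while four directions with 32 KB slots exactly fill the 1 MB region (6.25 %).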
+
+---
+
 ## Non-goals

 - **Host collective**: a model where `dist.all_reduce` itself moves
@@ -837,30 +1306,3 @@ fairness from `tl.recv()` round-robin, confusing
 - VC arbitration is a first-order approximation; heavy contention
  scenarios may report slightly optimistic latency vs real HW (D8).
 - Chunk-level interleave makes PE_DMA implementation more complex.
-
---
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `topology.yaml` | Add `pe_ipcq` to `pe_template`, plus the IPCQ ↔ DMA / CPU / TCM edges. |
-| `components.yaml` | Register `pe_ipcq_v1`. |
-| `src/kernbench/topology/builder.py` | Wire the IPCQ chain into PE-internal edges. |
-| `src/kernbench/components/builtin/pe_ipcq.py` | New. |
-| `src/kernbench/components/builtin/pe_dma.py` | Add VCs, handle `IpcqDmaToken`. |
-| `src/kernbench/common/pe_commands.py` | `IpcqSendCmd`, `IpcqRecvCmd`, `IpcqDmaToken`. |
-| `src/kernbench/triton_emu/tl_context.py` | `tl.send` / `tl.recv` API. |
-| `src/kernbench/runtime_api/distributed.py` | Eager IPCQ install in `AhbmCCLBackend.__init__`. |
-| `src/kernbench/runtime_api/kernel.py` | `IpcqInitMsg` definition. |
-| `src/kernbench/ccl/__init__.py` | New CCL package. |
-| `src/kernbench/ccl/topologies.py` | Builtin topology generators + `resolve_topology()`. |
-| `src/kernbench/ccl/helpers.py` | Algorithm-author helpers (`chunked`, `ring_step`, `tree_step`). |
-| `src/kernbench/ccl/testing.py` | Mock CCL runtime (`run_kernel_in_mock`). |
-| `src/kernbench/ccl/algorithms/*.py` | Algorithm modules (kernel + `kernel_args` + optional `neighbors`). |
-| `ccl.yaml` | Algorithm metadata + IPCQ defaults. |
-| `tests/test_pe_ipcq.py` | PE_IPCQ unit tests. |
-| `tests/test_pe_dma_vc.py` | PE_DMA VC tests. |
-| `tests/test_ipcq_e2e.py` | end-to-end send/recv tests. |
-| `tests/test_ccl_topologies.py` | Builtin topology generator tests. |
-| `tests/test_ccl_allreduce_matrix.py` | Unified bench × algorithm matrix. |
--- a/Show More
+++ b/Show More