diff --git a/.claude/commands/report.md b/.claude/commands/report.md
index a6ea2bc..d7d4a88 100644
--- a/.claude/commands/report.md
+++ b/.claude/commands/report.md
@@ -62,6 +62,10 @@ After writing the document, report to the user in the chat response:
 - **G2 gaps** — ADRs missing **Context** or **Decision**. Alternatives
   and Consequences are optional; their absence is NOT a gap.
 - **G3 gaps** — ADR cross-references without a back-reference.
+  Only flag when the referencer's ADR number is **less than** the
+  referenced ADR's number (older → newer). Newer ADRs citing older
+  infrastructure ADRs (higher number → lower number) are expected to
+  be one-way and are NOT flagged.
 - **G4 suggestions** — areas where an ADR seems missing based on the
   ADR corpus + SPEC reading. Phrase as suggestions, not findings. Each
   G4 item must say *why* it's suggested and remain falsifiable.
@@ -99,7 +103,10 @@ For each `docs/adr/ADR-NNNN-*.md`:
 - Record presence/absence of **Context** and **Decision** for G2.
   Alternatives and Consequences presence is recorded for use during
   authoring, but their absence is not a gap.
-- Record ADR-NNNN cross-references for G3.
+- Record ADR-NNNN cross-references for G3, preserving the direction
+  (referencer → referenced). G3 evaluation uses ADR numbers to
+  distinguish older→newer (flagged when missing back-link) from
+  newer→older (not flagged; see *Output Contract* G3).
 - Record Status (e.g., Accepted, Superseded, Draft) and any "supersedes
   ADR-NNNN" text in the body for G5a.
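+
+The direction-aware G3 check described above can be sketched as follows. This is
+a minimal illustration only: the function name `g3_gaps` and the `refs` mapping
+(ADR number to the set of ADR numbers it cites) are assumptions made for the
+example, not part of this command.
+
+```python
+def g3_gaps(refs: dict[int, set[int]]) -> list[tuple[int, int]]:
+    """Return (referencer, referenced) pairs that G3 would flag."""
+    gaps = []
+    for src, targets in sorted(refs.items()):
+        for dst in sorted(targets):
+            if src >= dst:
+                continue  # newer -> older (or self): one-way is expected
+            if src not in refs.get(dst, set()):
+                gaps.append((src, dst))  # older -> newer with no back-link
+    return gaps
+
+
+# Hypothetical corpus: 14 <-> 42 is mutual, 42 -> 7 is newer -> older,
+# and 35 -> 38 has no back-reference.
+refs = {7: set(), 14: {42}, 35: {38}, 38: set(), 42: {7, 14}}
+print(g3_gaps(refs))
+```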
 
@@ -263,9 +270,11 @@ In **dry-run mode**, replace the `Wrote:` line with:
 - ADR-NNNN: missing <Context|Decision>
 - (or "none")
 
-**G3 — Broken cross-references**
-- ADR-NNNN cites ADR-MMMM; ADR-MMMM does not back-reference
+**G3 — Broken cross-references** (older → newer only)
+- ADR-NNNN cites ADR-MMMM (NNNN < MMMM); ADR-MMMM does not back-reference
 - (or "none")
+- Note: newer ADRs citing older infrastructure ADRs (NNNN > MMMM) are
+  not flagged here — one-way references are the expected pattern.
 
 **G4 — Suggested topics that may warrant a new ADR (verify before acting)**
 - <topic>: <why agent thinks it may be missing — must be falsifiable>
diff --git a/docs/adr-ko/ADR-0038-dev-pcie-ep-component-model.md b/docs/adr-ko/ADR-0038-dev-pcie-ep-component-model.md
new file mode 100644
index 0000000..73d876b
--- /dev/null
+++ b/docs/adr-ko/ADR-0038-dev-pcie-ep-component-model.md
@@ -0,0 +1,133 @@
+# ADR-0038: PCIE_EP Component Model
+
+## Status
+
+Accepted (2026-05-20).
+
+A component-level ADR in the same vein as ADR-0035 (M_CPU), ADR-0036 (IO_CPU),
+and ADR-0037 (Forwarding).
+
+## First action
+
+PCIE_EP takes one Transaction from `_inbox`, calls `run()` via `_forward_txn`,
+and inside it applies the PCIe protocol processing delay with `env.timeout()`
+for `node.attrs["overhead_ns"]`. From that point on it follows the forwarding
+convention defined by the generic `ComponentBase` worker (if there is a next hop,
+`out_ports[next_hop].put(...)`; otherwise consume `drain_ns` and call `txn.done.succeed()`).
+
+In other words, **the first and only job of PCIE_EP is to express PCIe protocol overhead
+as time**; it makes no further decisions such as routing, payload conversion, or MMIO decoding.
+
+## Context
+
+PCIE_EP serves as **the boundary point between host and device** in the topology
+graph. The builder (`topology/builder.py`) creates an IO chiplet instance per SIP,
+places `pcie_ep`, `io_cpu`, and `io_noc` inside it, and then lays bidirectional
+edges between the external host-side cross-SIP switch and `pcie_ep`:
+
+- `switch → pcie_ep`: host → device traffic (MemoryWrite, MemoryRead, KernelLaunch).
+- `pcie_ep → switch`: device-side outbound (e.g., cross-SIP IPCQ tokens).
+
+Inside the IOChiplet, a bidirectional `pcie_ep ↔ io_noc` edge is laid, and the next hop
+branches to `io_cpu` or to the cube-side hbm_ctrl path (see the ADR-0036 IO_CPU model).
+The router and resolver already recognize the contract required by SPEC R7, "PCIE_EP is
+the endpoint for memory operations", so helpers such as `find_pcie_ep(sip)` and
+`find_memory_path(pcie_ep, dst_node)` use PCIE_EP as their starting point.
+
+The problem is that all of these dependencies live on the builder/router/resolver side,
+while **no ADR specifies PCIE_EP's own internal model**. As a result:
+
+- "What latency does PCIE_EP model?" can only be answered by reading the code.
+- There is an asymmetry with other components (IO_CPU = ADR-0036, M_CPU = ADR-0035).
+- The rationale for whether to refine the PCIe link-layer model (e.g., TLP credit,
+  retry) in the future is scattered.
+
+This ADR explicitly pins down the current **thin PCIE_EP model** and records that it is
+an intentional simplification (aligned with the ADR-0033 latency model simplification policy).
+
+## Decision
+
+### D1. PCIE_EP uses the generic ComponentBase forwarding worker as-is
+
+`PcieEpComponent` inherits from `ComponentBase` and does not override `_worker` or
+`_forward_txn`. Every Transaction is therefore processed in the following order:
+
+1. `_fan_in` loads each incoming message (or a Transaction reassembled from Flits)
+   into `_inbox`.
+2. `_worker` takes one item from `_inbox` and forks it with
+   `env.process(self._forward_txn(env, txn))` (per-message pipelining).
+3. `_forward_txn` invokes, in order, the op_log start hook, the `run()` delay, and the
+   op_log end hook.
+4. `run()` is a single line: `yield env.timeout(overhead_ns)`.
+5. If there is a next hop, `out_ports[next_hop].put(txn.advance())`; if not (the node
+   is terminal), consume `drain_ns` and then call `txn.done.succeed()`.
+
+### D2. `overhead_ns` is PCIE_EP's only timing model
+
+Only `node.attrs["overhead_ns"]` is recognized as a latency parameter. The code default
+is `0.0`, and the IOChiplet `components.pcie_ep.attrs` in `topology.yaml` sets the actual
+value (current topology: `overhead_ns: 5.0`, in ns).
+
+There is no separate BW serialization resource (simpy.Resource), queue depth, or retry
+model. Link-level BW serialization is handled on the wire side: inside the IOChiplet the
+`pcie_ep_to_noc_bw_gbs = 256.0 GB/s` link applies, and outside it the system's
+`io_ep_to_switch` link BW applies (ADR-0015 port/wire model). The PCIE_EP component
+itself takes no part in this BW accounting.
+
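+### D3. PCIE_EP is aware of bidirectional use but does not change behavior by direction
+
+The topology builder lays bidirectional `switch ↔ pcie_ep` and `pcie_ep ↔ io_noc` edges.
+PCIE_EP therefore handles:
+
+- inbound (host→device): a Transaction arriving from the switch is forwarded toward
+  io_noc according to the next-hop computation.
+- outbound (device→host): a Transaction arriving from io_noc/io_cpu is forwarded toward
+  the switch.
+
+Both cases are handled by the generic forwarding worker of D1, and the component code
+itself does not distinguish direction (it only follows `txn.next_hop`).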
+
+### D4. PCIE_EP is not Flit-aware (legacy reassembly path)
+
+`_FLIT_AWARE` is not set to `True`. `_fan_in` therefore reassembles Flits chunkified
+upstream into their parent Transaction and puts it into `_inbox` (aligned with the
+ADR-0033 Phase 2c incremental rollout policy).
+
+D4 will be re-evaluated if PCIE_EP is later extended with a PCIe TLP-level credit model.
+
+### D5. PCIE_EP is a **named node** for routing helpers
+
+`find_pcie_ep(sip, io_id="io0")`, `find_all_pcie_eps()`, and
+`find_memory_path(pcie_ep, dst_node)` in `policy/routing/router.py` treat PCIE_EP as the
+start (or end) of a memory path. The component itself provides no information to these
+helpers; the naming convention (`sip{S}.{io_id}.pcie_ep`) is guaranteed by the topology builder.
+
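+## Alternatives Considered
+
+### A1. PCIe TLP-level model (credit, retry, MPS splitting)
+
+Rejected. It violates the simplification principle stated in ADR-0033: "the current
+latency model is expressed as abstract overhead + BW serialization". Host↔device
+protocol fidelity is intentionally out of scope per SPEC §5 "Non-Goals".
+
+### A2. Limiting in-flight Transactions in PCIE_EP with its own simpy.Resource
+
+Rejected. Host traffic is not a contention bottleneck in current workloads. If it becomes
+necessary, it will be introduced in a separate ADR (keeping D1 as-is and extending D2 for compatibility).
+
+### A3. Merging PCIE_EP with IO_CPU
+
+Rejected. PCIE_EP is the first protocol boundary node encountered from the host side,
+whereas IO_CPU is the device-side control-plane processing node (ADR-0036). Decision costs
+such as traffic fan-out and command decoding are concentrated in IO_CPU, so it is
+meaningful for PCIE_EP to express only link-edge overhead. Merging them would mix the two
+responsibilities and go against the spirit of ADR-0007 (runtime API/sim_engine boundary).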
+
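+## Consequences
+
+- PCIE_EP gains an explicit model ADR while having almost no code of its own, which
+  improves consistency and lowers maintenance cost.
+- If PCIe-level refinement is needed later, a new ADR that extends D2/D4 will supersede
+  this one.
+- Because D5 states that router helpers such as `find_memory_path` depend on PCIE_EP as a
+  named node, the impact of changing the component ID naming convention becomes clear.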
diff --git a/docs/adr-ko/ADR-0039-dev-pe-mmu-component-model.md b/docs/adr-ko/ADR-0039-dev-pe-mmu-component-model.md
new file mode 100644
index 0000000..a93eeba
--- /dev/null
+++ b/docs/adr-ko/ADR-0039-dev-pe-mmu-component-model.md
@@ -0,0 +1,194 @@
+# ADR-0039: PE_MMU Component Model (Dual Role as Component and Utility)
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0011 (PA/VA/LA address model) only declares, in its VA model, that "PE_MMU performs
+VA→PA translation"; this ADR separately pins down **the behavior model of the PE_MMU component itself**.
+
+## First action
+
+At construction time, the component reads `node.attrs["page_size"]` (default `2 MiB`) and
+`node.attrs["tlb_overhead_ns"]` (default `0.0`) and instantiates the internal `PeMMU`
+object (`policy.address.pe_mmu.PeMMU`) exactly once. This object is the single owner of
+the page table, the sub-page region lists, and the TLB overhead.
+
+At runtime, the first action splits into two paths:
+
+- **Component path (inbox consumption)**: `_worker` takes one Transaction from `_inbox`;
+  if its `request` is an `MmuMapMsg`, it calls `self._mmu.map(va, pa, size)` for each
+  entry and then `txn.done.succeed()`. For an `MmuUnmapMsg` it calls `unmap(va, size)`;
+  any other type is handed to the standard `_forward_txn`. In other words, **the MMU's
+  first job is to apply map/unmap commands to the page table**.
+- **Utility path (direct call)**: engines inside the same PE, such as PE_DMA / PE_GEMM,
+  call `pe_mmu.mmu.translate(va)` directly. No SimPy event occurs on this path; the
+  caller (when overhead_ns > 0) handles `yield env.timeout(mmu.overhead_ns)` in its own
+  process.
+
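+## Context
+
+ADR-0011 defines the three address models PA/VA/LA and agrees only that "VA model =
+translation through PE_MMU". In the code, however, `PeMmuComponent` plays two
+complementary roles at once:
+
+1. **A component on the topology graph**: it receives `MmuMapMsg` / `MmuUnmapMsg`
+   sideband messages from the cube NoC and updates the page table.
+2. **A PE-local utility object**: PE_DMA / PE_GEMM in the same PE call `translate(va)`
+   directly at zero latency (or with the caller paying only `overhead_ns`).
+
+No ADR covers both roles, which leaves the following ambiguities:
+
+- "Why does MMU translation show no SimPy event?" (In fact the caller accounts for it.)
+- What is the sub-page region model, and why that model? (It is in the code docstring but
+  has no ADR; only a reference to the memory note `project_mmu_subpage_stopgap` exists.)
+- **From whom** do map/unmap messages come, and **by when** must the update take effect
+  (ordering contract)?
+
+In addition, `PeMMU.map()` has "later append, last-write-wins (reverse search)" semantics.
+This is a **stopgap** added intentionally for DPPolicy sub-page sharding scenarios (e.g.,
+128B payload × 4KB page), which a simple single-PA-per-page table cannot express. The ADR
+needs to record that this is a simplification that differs from a real HW MMU.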
+
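+## Decision
+
+### D1. Making the dual role explicit: component and utility
+
+`PeMmuComponent` exposes the following two interfaces within a single class:
+
+- Component interface: `_inbox` consumption and the `_worker` loop (MMU sideband message handling).
+- Utility interface: the `pe_mmu.mmu` attribute exposes the underlying `PeMMU` object;
+  PE_DMA / PE_GEMM hold this object directly and call `translate()`.
+
+The latter is **not a layer skip**: the inside of a PE is a sibling relationship within the
+single "components" layer defined by ADR-0007, and a direct call on a PE_MMU object obtained
+from the same PE prefix is not cross-layer. A cross-layer violation applies only when the
+runtime API / sim_engine / components boundary is crossed.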
+
+### D2. Latency model: `translate()` is a pure function; overhead is the caller's responsibility
+
+`PeMMU.translate()` is a pure function and never yields to SimPy. After translation, the
+caller (a PE engine) issues `if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`
+in its own process.
+
+Rationale: a PE engine's SimPy process already carries its own record_start / record_end
+(op_log) hooks, so it can capture timing consistently. If the MMU created a separate process,
+it would split the PE engine's flow in two and blur the op_log/pipeline overlap semantics.
+
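+#### D2.1. Asymmetry in the current implementation: pipeline vs non-pipeline (Known asymmetry)
+
+At the time of writing, the `pe_dma.py` implementation handles overhead differently on its two call paths:
+
+- **non-pipeline (`handle_command`)**: immediately after `translate()` it issues
+  `if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`,
+  so the overhead is paid.
+- **pipeline (`_do_pipeline_dma`)**: it only calls `translate()` and **omits** the overhead
+  timeout. The function comment says "same logic as non-pipeline path", but the two
+  do not actually match.
+
+In the default topology `tlb_overhead_ns = 0.0`, so the difference does not show up in
+timing. In a simulation configured with `tlb_overhead_ns > 0`, however, GEMM/Math on the
+pipeline path measures faster than the same non-pipeline workload by the MMU overhead.
+
+The D2 contract is that "**every** caller is responsible for the overhead"; the omission on
+the pipeline path is **an implementation inconsistency, not an intended design**. Nothing
+in ADR-0014 D6 (pipeline self-routing) states that this overhead is waived.
+
+Options for action (a separate Phase 1/2 proposal is required):
+
+- (a) Add `if mmu.overhead_ns > 0: yield env.timeout(...)` to `_do_pipeline_dma` as well
+  so that it matches the D2 contract. Recommended.
+- (b) Narrow the D2 contract to "applies to the non-pipeline path only" and justify the pipeline
+  exemption together with an update to ADR-0014 D6. Not recommended, since it weakens the meaning of overhead.
+
+This ADR recommends (a) and assumes it will be corrected in a separate small change before
+or right after acceptance.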
+
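+### D3. Page table structure: sub-page region lists (stopgap)
+
+With `self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]]`,
+a single page can hold multiple disjoint regions.
+- `map(va, pa, size)`: **appends** regions (one per page when the range crosses pages).
+- `translate(va)`: fetches the region list by VPN, then iterates in **reverse** and
+  takes the first matching region (last-write-wins).
+- `unmap(va, size)`: removes only regions whose extent is **fully contained** in the unmap
+  range. Partially overlapping regions with misaligned boundaries are left in place; the
+  mapping caller is responsible for unmapping with the same boundaries as the mapping.
+
+This supplements ADR-0011's VA model by stating that the structure is a **simulator stopgap** that
+differs from a real HW MMU. Its purpose is to prevent the silent misrouting that a whole-page
+last-write-wins overwrite would cause under DPPolicy sub-page sharding (memory note: project_mmu_subpage_stopgap).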
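+
+The region-list semantics above can be sketched in plain Python. This is a minimal
+illustration under stated assumptions, not the actual `PeMMU` code: the class name
+`SubpageTable` is hypothetical, and storing `pa - offset` as `pa_at_offset_zero`
+is one plausible reading of the field name.
+
+```python
+class PageFault(Exception):
+    """Raised when no region covers the requested VA."""
+
+
+class SubpageTable:
+    def __init__(self, page_size=2 * 1024 * 1024):
+        self.page_size = page_size
+        # vpn -> list of (start_in_page, end_in_page, pa_at_offset_zero)
+        self.table = {}
+
+    def map(self, va, pa, size):
+        end = va + size
+        while va < end:
+            vpn, off = divmod(va, self.page_size)
+            chunk = min(end - va, self.page_size - off)
+            # Append only; an older overlapping region is shadowed, not edited.
+            self.table.setdefault(vpn, []).append((off, off + chunk, pa - off))
+            va += chunk
+            pa += chunk
+
+    def translate(self, va):
+        vpn, off = divmod(va, self.page_size)
+        # Reverse scan: the most recently appended match wins.
+        for start, stop, pa0 in reversed(self.table.get(vpn, [])):
+            if start <= off < stop:
+                return pa0 + off
+        raise PageFault(hex(va))
+
+    def unmap(self, va, size):
+        if size <= 0:
+            return
+        end = va + size
+        for vpn in range(va // self.page_size, (end - 1) // self.page_size + 1):
+            base = vpn * self.page_size
+            # Drop only regions fully contained in [va, end).
+            kept = [r for r in self.table.get(vpn, [])
+                    if not (va <= base + r[0] and base + r[1] <= end)]
+            if kept:
+                self.table[vpn] = kept
+            else:
+                self.table.pop(vpn, None)
+
+
+t = SubpageTable(page_size=4096)
+t.map(0x1000, 0x8000, 128)   # first 128B shard of the page
+t.map(0x1080, 0x9000, 128)   # second shard, different PA
+print(hex(t.translate(0x1010)), hex(t.translate(0x1090)))
+```
+
+A caller that wants the PA fallback behavior would wrap `translate()` in
+`try/except PageFault` and keep the original address.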
+
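+### D4. PageFault is a PA fallback signal
+
+When `translate()` is called with an unmapped VA, a `PageFault` is raised. PE_DMA catches
+this exception and **uses the original address as a PA as-is** (the PA fallback compatibility
+path of ADR-0011). PageFault is therefore not an error but a signal meaning "with no VA mapping, interpret as PA".
+
+This compatibility path is intended behavior that preserves backward compatibility with
+the PA-only mode agreed in ADR-0011.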
+
+### D5. Receive contract for MMU sideband messages
+
+`MmuMapMsg` / `MmuUnmapMsg` reach the PE_MMU component's `_inbox` through the fabric (R10
+states that "MMU map installation follows fabric latency"). The message schema is defined
+by the runtime API (`runtime_api/kernel.py`); the current format is:
+
+- `MmuMapMsg.entries: tuple[dict, ...]`: each dict has the keys
+  `{"va": int, "pa": int, "size": int}`.
+- `MmuUnmapMsg.entries: tuple[dict, ...]`: each dict has the keys
+  `{"va": int, "size": int}`.
+
+Receive handling on the PE_MMU side:
+
+1. `_worker` takes one message from `_inbox.get()`.
+2. It checks with `hasattr(msg, "request")` whether the message is a Transaction wrapper.
+3. If `isinstance(msg.request, MmuMapMsg)`, it calls, for each entry,
+   `self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"])`.
+4. If `isinstance(msg.request, MmuUnmapMsg)`, it calls, for each entry,
+   `self._mmu.unmap(va=e["va"], size=e["size"])`.
+5. In both cases it signals completion with `msg.done.succeed()`.
+
+When an external caller (the runtime API side) awaits `done`, the point at which the mapping
+is installed on the device is guaranteed in SimPy time. This wait realizes the ADR-0011
+requirement that "MMU map installation incurs measured fabric latency".
+
+This ADR does not define the **sender and fan-out policy** of sideband messages; that is the
+runtime API's responsibility. It specifies only the receive contract on the PE_MMU side.
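+
+The receive steps above can be sketched without SimPy. Everything below is a stand-in
+written for illustration: `FakeEvent` replaces `simpy.Event`, `Txn` and `RecordingMmu`
+are hypothetical, and the message classes only mirror the `entries` field named in this ADR.
+
+```python
+from dataclasses import dataclass, field
+
+
+@dataclass(frozen=True)
+class MmuMapMsg:
+    entries: tuple
+
+
+@dataclass(frozen=True)
+class MmuUnmapMsg:
+    entries: tuple
+
+
+class FakeEvent:
+    """Stand-in for simpy.Event: only records that succeed() was called."""
+    def __init__(self):
+        self.triggered = False
+
+    def succeed(self):
+        self.triggered = True
+
+
+@dataclass
+class Txn:
+    request: object
+    done: FakeEvent = field(default_factory=FakeEvent)
+
+
+class RecordingMmu:
+    """Records calls instead of updating a real page table."""
+    def __init__(self):
+        self.calls = []
+
+    def map(self, va, pa, size):
+        self.calls.append(("map", va, pa, size))
+
+    def unmap(self, va, size):
+        self.calls.append(("unmap", va, size))
+
+
+def handle_sideband(mmu, msg):
+    """Return True if msg was consumed as MMU sideband, False if it must be forwarded."""
+    req = getattr(msg, "request", None)
+    if isinstance(req, MmuMapMsg):
+        for e in req.entries:
+            mmu.map(va=e["va"], pa=e["pa"], size=e["size"])
+    elif isinstance(req, MmuUnmapMsg):
+        for e in req.entries:
+            mmu.unmap(va=e["va"], size=e["size"])
+    else:
+        return False  # non-MMU traffic goes to generic forwarding
+    msg.done.succeed()
+    return True
+
+
+mmu = RecordingMmu()
+txn = Txn(MmuMapMsg(entries=({"va": 0x1000, "pa": 0x8000, "size": 128},)))
+print(handle_sideband(mmu, txn), txn.done.triggered, mmu.calls)
+```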
+
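+### D6. Non-MMU Transactions are delegated to generic forwarding
+
+If the `request` of a message taken from the inbox by `_worker` is neither `MmuMapMsg` nor
+`MmuUnmapMsg` (or the message has no `request` attribute), it is handed to `_forward_txn`.
+This keeps open the possibility of PE_MMU being used as a pass-through node on the cube-internal
+NOC in the future (there is no such pass-through traffic today, but it is safe against topology changes).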
+
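+## Alternatives Considered
+
+### A1. Making translate() a SimPy generator
+
+Rejected. As explained in D2, it would blur the PE engine's op_log/pipeline overlap semantics.
+The current pattern, where the caller raises the timeout, is consistent with op_log accounting.
+
+### A2. Shrinking the page size itself (e.g., to 128B) instead of using sub-page region lists
+
+Rejected. It would blow up page table memory and the size of cube-wide map messages.
+Even if DPPolicy sharding requires 128B, the vast majority of other mappings are in 2MiB units,
+so a small page size would inflate the average cost.
+
+### A3. Keeping PE_MMU only as a built-in helper of PE_CPU rather than a component
+
+Rejected. Expressing "MMU map installation with latency measured through the fabric" (the
+MmuMapMsg path), as ADR-0011 requires, needs a node on the topology graph. PE_MMU must also
+appear as a node in the cube NoC visualizer for debugging and diagnosis to stay consistent.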
+
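+## Consequences
+
+- The dual role of PE_MMU (component + utility) is justified at the ADR level, providing an
+  argument against future refactoring pressure (to unify it into one or the other).
+- The ADR states that the sub-page region model is a simulator stopgap, which gives a basis
+  for evaluating whether the stopgap can be removed once the LA model (ADR-0011) is introduced.
+- The contract that `translate()` does not yield is fixed by the ADR, so a future proposal
+  to "put a timeout inside the MMU" can be declined on the basis of D2.
+- PA fallback (D4) is stated to be the normal flow, which prevents PageFault from being
+  mistaken for an error and defensive logic from being added.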
diff --git a/docs/adr-ko/ADR-0040-dev-pe-tcm-component-model.md b/docs/adr-ko/ADR-0040-dev-pe-tcm-component-model.md
new file mode 100644
index 0000000..6592c82
--- /dev/null
+++ b/docs/adr-ko/ADR-0040-dev-pe-tcm-component-model.md
@@ -0,0 +1,142 @@
+# ADR-0040: PE_TCM Component Model (Dual-Channel BW Serialization)
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0014 (PE Pipeline Execution Model) mentions that "PE_TCM is a BW-based serializing
+scratchpad memory" (D1); this ADR separately specifies the exact behavior model of the
+TCM component itself.
+
+## First action
+
+As soon as `start()` is called, it creates two `simpy.Resource(env, capacity=1)` objects and
+stores them in `self._read_res` / `self._write_res`. These two resources are the single decision
+point that serializes the **read channel** and the **write channel** to one in-flight request each.
+
+First runtime action: `_worker` takes one message from `_inbox` and branches on its type:
+
+- `TcmRequest` (from `pe_fetch_store`): forked with `env.process(self._handle_tcm_request)`.
+  In other words, **the TCM's first job is to acquire the channel lock matching the direction (read/write)**.
+  After acquiring the lock, if `bw > 0 and nbytes > 0` it issues `env.timeout` for
+  `delay_ns = nbytes / bw`, and then calls `req.done.succeed()`.
+- Anything else (Transaction): forked with `env.process(self._forward_txn)` (legacy fabric
+  pass-through path).
+
+At construction time it reads and stores `node.attrs["read_bw_gbs"]` / `node.attrs["write_bw_gbs"]`
+(default `512.0 GB/s` each).
+
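+## Context
+
+In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives the following two kinds of traffic:
+
+1. **`TcmRequest` from PE_FETCH_STORE to PE_TCM**: for TCM ↔ Register File transfers,
+   PE_FETCH_STORE sends a short sideband request to obtain the access latency serialized
+   by the TCM's BW (`direction = "read"` or `"write"`, `nbytes`, and a
+   `done` event).
+2. **Legacy Transaction forwarding**: a generic forwarding path kept in case the TCM is
+   picked as a pass-through node on the topology graph (not used on the current critical
+   path, but preserved).
+
+Problem: ADR-0014 only says "PE_TCM is BW-based serialization". The code, however, explicitly
+implements the following:
+
+- **Reads and writes are separate channels and can proceed concurrently**, while requests
+  in the same direction are serialized with cap=1.
+- BW can be configured separately through the two values `read_bw_gbs` / `write_bw_gbs`.
+- The formula `delay_ns = nbytes / bw_gbs` (unit shorthand: GB/s × ns ≈ B).
+- When nbytes==0, the BW term is skipped but the channel lock is still taken.
+- `run()` yields for `overhead_ns` (default 0.0), but this is used only on the legacy fabric
+  path (Transaction forwarding).
+
+All of this needs to be pinned down in a separate ADR. In particular, "why are read/write
+separate channels" and "who decides the BW" need a clear rationale for when someone later
+tries to change capacity to 2 or similar.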
+
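+## Decision
+
+### D1. Dual channel: read and write are independent resources
+
+`_read_res = simpy.Resource(env, capacity=1)`,
+`_write_res = simpy.Resource(env, capacity=1)`.
+Concurrent requests in the same direction are serialized in the resource queue, but requests in different directions can proceed concurrently.
+This matches a model in which the TCM operates as dual-ported (read port + write port) in real
+HW, and the split is intentional so that the normal GEMM pipeline case, where fetch (read) and
+store (write) overlap in time, can be expressed in the BW-serialization model.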
+
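+### D2. BW model of a single channel: `nbytes / bw_gbs`
+
+After acquiring the channel lock, if `nbytes > 0 and bw > 0`, `yield env.timeout(nbytes / bw_gbs)`.
+The unit shorthand is GB/s × ns ≈ B, the same BW formula used across the simulator
+(see ADR-0033; the simulator uses a consistent shorthand for units).
+
+- `nbytes == 0`: the BW term is 0, but the lock is taken and released immediately. This case is
+  intentional: even when a plan generator that emits an empty fetch/store sends the request from
+  the PE_FETCH_STORE side with only `nbytes` set to 0, the TCM-side op_log / channel accounting
+  is still consumed exactly once, consistently.
+- `bw == 0` (config mistake): the timeout call itself is skipped, so it is a 0-time pass. This
+  does not occur in a normal setup.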
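+
+The dual-channel timing of D1 and D2 can be approximated with a small timeline model.
+This sketch deliberately avoids SimPy: the hypothetical `ChannelTimeline` class tracks
+when a capacity=1 channel becomes free, and it assumes requests are served in call
+order, which is a simplification of a resource queue rather than the component's code.
+
+```python
+class ChannelTimeline:
+    """One capacity=1 channel: requests are served back to back."""
+
+    def __init__(self, bw_gbs):
+        self.bw_gbs = bw_gbs
+        self.free_at = 0.0  # ns at which the channel lock is next available
+
+    def serve(self, arrival_ns, nbytes):
+        """Return the completion time (ns) of a request arriving at arrival_ns."""
+        start = max(arrival_ns, self.free_at)  # wait for the lock
+        # GB/s x ns ~= B, so bytes / (GB/s) gives ns; skipped when either is 0.
+        busy = nbytes > 0 and self.bw_gbs > 0
+        delay = nbytes / self.bw_gbs if busy else 0.0
+        self.free_at = start + delay
+        return self.free_at
+
+
+read_ch = ChannelTimeline(bw_gbs=512.0)
+write_ch = ChannelTimeline(bw_gbs=512.0)
+
+r1 = read_ch.serve(0.0, 4096)    # first read: 4096 / 512 = 8 ns
+r2 = read_ch.serve(0.0, 4096)    # second read queues behind the first
+w1 = write_ch.serve(0.0, 4096)   # write overlaps with the reads
+print(r1, r2, w1)
+```
+
+With both channels at 512 GB/s, the two reads finish at 8 ns and 16 ns while the write
+finishes at 8 ns, which is the same-direction serialization and cross-direction overlap
+that D1 describes.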
+
+### D3. BW is configured via `read_bw_gbs` / `write_bw_gbs` in `node.attrs`
+
+The default is `512.0 GB/s`. The topology builder (`topology/builder.py`) passes these attrs
+when it instantiates the TCM from `pe_template`. Changing the default must go together with
+the decisions in ADR-0014 D1 or the ADR-0033 latency model.
+
+### D4. PE_TCM owns the TcmRequest schema
+
+`@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
+is defined in `components/builtin/pe_tcm.py`. PE_FETCH_STORE only imports this dataclass to
+create and send it. The reasons the caller side does not define the schema:
+
+- The meaning of BW serialization is the TCM's responsibility; the TCM decides which fields
+  are used for the serialization decision.
+- Validation that narrows the `direction` string to `"read"` / `"write"` is also handled on
+  the TCM side (the if/else branch in `_handle_tcm_request`).
+
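+### D5. Preserving the legacy Transaction forwarding path
+
+When `_worker` receives a message that is not a `TcmRequest`, it sends it to `_forward_txn`.
+The `overhead_ns` of `run()` applies here. In the current standard PE pipeline the TCM is not
+picked as a pass-through node for Transactions, but the path is preserved for future changes
+to the fabric topology (orthogonal to the usage pattern of D1).
+
+This path is recorded by op_log as ordinary Transaction accounting and does not take a BW channel lock.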
+
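+### D6. PE_TCM is not a data store of its own (timing only)
+
+The TCM models **time only**. The actual data payload is held by sim_engine's separate
+`memory_store` (if present), and the TCM component does not update it.
+PE_FETCH_STORE likewise obtains only the BW delay through TcmRequest and handles the actual
+register contents through a separate path (ADR-0020 2-pass data execution model; data is
+processed in Phase 2).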
+
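+## Alternatives Considered
+
+### A1. Single channel (read+write sharing capacity=2)
+
+Rejected. It would artificially serialize the normal case where fetch (read) and store (write)
+overlap in time, so the BW upper bound of the PE pipeline would be modeled incorrectly.
+
+### A2. Channel capacity > 1 (e.g., 2-banked TCM)
+
+Rejected. The current HW model assumes a single bank. Extending to multi-bank requires a separate
+ADR, which would then supersede D1. Raising capacity now would leave the BW upper bound unchanged
+while only loosening the nominal serialization, lowering actual model accuracy.
+
+### A3. Generalizing the BW formula to `nbytes / bw + overhead_ns`
+
+Rejected. `overhead_ns` is used only on the legacy forwarding path of D5. If additional overhead
+becomes necessary on the fetch/store critical path, it belongs not in the TCM but in
+PE_FETCH_STORE's `run()` or in the register-file access model, which is more appropriate in
+terms of responsibility boundaries.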
+
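+## Consequences
+
+- The TCM's BW accounting is fixed at the ADR level, so when interpreting the op_log of GEMM/Math
+  sweeps, questions like "why did fetch and store proceed concurrently" / "why is only the same
+  direction serialized" are quickly resolved by D1.
+- The impact of a future multi-bank TCM or a change to the asymmetric read/write BW model becomes
+  clear (which of D1, D2, D3 is being modified).
+- Stating that the TCM is not a data store (D6) firms up the responsibility boundary with
+  ADR-0020 2-pass execution.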
diff --git a/docs/adr-ko/ADR-0041-dev-cube-sram-component-model.md b/docs/adr-ko/ADR-0041-dev-cube-sram-component-model.md
new file mode 100644
index 0000000..5a32110
--- /dev/null
+++ b/docs/adr-ko/ADR-0041-dev-cube-sram-component-model.md
@@ -0,0 +1,187 @@
+# ADR-0041: Cube SRAM Component Model — terminal scratchpad on cube NoC
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0017 (Cube NOC and HBM Connectivity) only mentions that SRAM exists as an attachment of
+the cube NoC; this ADR fills that gap by specifying the latency/response model of the SRAM
+component itself.
+
+## First action
+
+Right after `_worker` takes one Transaction from `_inbox`, the very first thing it does is
+call `yield from self.run(env, txn.nbytes)`, inside which it issues `env.timeout()` for
+`node.attrs["overhead_ns"]` (default `0.0`).
+
+In other words, **the SRAM's first job is to express access overhead as time**.
+After consuming the overhead it yields `drain_ns` (the terminal BW serialization cost assigned
+to that Transaction), and then creates a `ResponseMsg` and sends it along the reverse path.
+
+This differs from the generic `ComponentBase._worker`: SRAM knows it is a **terminal node**,
+so it does not go through `_forward_txn`; its own worker spells out the order
+`run → drain → _send_response`.
+
+## Context
+
+The cube topology (`topology/builder.py`) creates the following named nodes per cube:
+
+- `sip{S}.cube{C}.m_cpu`
+- `sip{S}.cube{C}.sram`
+- `sip{S}.cube{C}.hbm_ctrl` (partition per PE)
+- `sip{S}.cube{C}.pe{P}` (sub-components inside the PE)
+
+SRAM is one of the cube NoC attachments and is attached to the nearest router
+(`topology/mesh_gen.py` determines the nearest router from the placement coordinates and adds
+it to `attach`). The builder lays a bidirectional `sram ↔ router` edge (BW: `sram_to_router_bw_gbs`,
+default `128.0 GB/s`).
+
+The two key roles of SRAM:
+
+1. **Fabric terminal**: the endpoint of memory-access Transactions headed to SRAM on the cube
+   NoC. SRAM consumes the access overhead and drain, then returns a response along the reverse
+   path.
+2. **One of the IPCQ slot tiers**: of the `buffer_kind ∈ {tcm, sram, hbm}` tiers defined by
+   ADR-0023 D9.7, the slot bw/overhead of the `sram` tier is looked up in
+   `common/ipcq_types._BUFFER_KIND_BW`; the current value is `(512.0 GB/s, 2.0 ns)`.
+   This value is separate from the `overhead_ns` of the SRAM node attrs, and PE_DMA converts it
+   to time at IPCQ slot accounting.
+
+Both roles are fulfilled by a single SRAM component, but without a dedicated ADR:
+
+- "What latency does SRAM model?" (fabric drain + overhead, or the slot latency of the IPCQ
+  tier?) has a scattered answer.
+- It is unclear what the SRAM size (`size_mb`) attr will actually mean in the future. The current
+  code does not use size and models timing only.
+- The rationale for which router of the cube SRAM attaches to (placement-based) exists only
+  inside the topology code.
+
+## Decision
+
+### D1. SRAM is a terminal scratchpad node on the cube NoC
+
+`SramComponent` inherits from `ComponentBase` but overrides `_worker` to express terminal
+semantics directly:
+
+```
+while True:
+    txn = yield self._inbox.get()
+    yield from self.run(env, txn.nbytes)     # overhead_ns
+    if drain_ns > 0: yield env.timeout(drain_ns)
+    yield from self._send_response(env, txn)
+```
+
+Because SRAM must know the reverse path, this pattern needs its own worker rather than the
+generic `_forward_txn` (which forwards to the next hop).
+
+#### D1.1. Currently unused: the `_worker` override is a dormant path
+
+In the codebase at the time of writing, **no component actually sends a Transaction to the
+SRAM node**. The locations found to reference the SRAM node ID are:
+
+- Routing helpers such as `policy/routing/router.py`: they only guarantee that a path can be looked up.
+- `components/builtin/pe_dma.py::_handle_ipcq_inbound`: when an IPCQ slot has
+  `buffer_kind == "sram"`, it looks up only the *path* to `bank_node = f"{cube_prefix}.sram"`,
+  converts it with `compute_drain_ns(path, ...)`, and **times out locally**.
+  The Transaction itself does not flow to the SRAM node (see D4).
+- `tests/test_routing.py`: it verifies connectivity only, with
+  `find_path("sip0.cube0.pe0", "sip0.cube0.sram")`.
+
+The `_worker`/`_send_response` override is therefore a **dormant code path**.
+It is preserved rather than deleted for these reasons:
+
+- It is immediately usable if a topology change makes SRAM the endpoint of real fabric
+  Transactions (e.g., explicit M_CPU → SRAM access).
+- Endpoint behavior is semantically natural for the cube-attached scratchpad defined by
+  ADR-0017 (Cube NOC), so this is an intentional placeholder.
+
+When this dormant state ends will be specified by a separate ADR (or a follow-up revision of
+this ADR).
+
+### D2. Creating the ResponseMsg and sending it along the reverse path
+
+`_send_response`:
+
+1. Computes the reverse path with `reverse_path = list(reversed(txn.path))`.
+2. Creates `ResponseMsg(correlation_id=txn.request.correlation_id, request_id=...,
+   src_cube=<this cube>, src_pe=-1, success=True)`.
+3. Wraps it in `Transaction(request=resp_msg, path=reverse_path, step=0, nbytes=0,
+   done=env.event(), is_response=True)` and puts it on
+   `out_ports[reverse_path[1]]`.
+4. If the reverse path is abnormal (`< 2 hops`) or there is no ctx, it falls back to calling
+   only the original `txn.done.succeed()`.
+
+`src_pe = -1` means "SRAM is not PE-localized". `src_cube` is filled by parsing the cube index
+from the node ID (`sip{S}.cube{C}.sram`).
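+
+The path reversal and `src_cube` parsing can be sketched as below. This is an
+illustration under assumptions, not the component's code: `build_response` and
+`cube_index` are hypothetical helpers, the `ResponseMsg` here carries only the fields
+named in this ADR, and the router ID in the example path is made up.
+
+```python
+import re
+from dataclasses import dataclass
+
+
+@dataclass
+class ResponseMsg:
+    correlation_id: int
+    request_id: int
+    src_cube: int
+    src_pe: int
+    success: bool
+
+
+def cube_index(node_id):
+    """Parse C from an ID shaped like sip{S}.cube{C}.sram; -1 if absent."""
+    m = re.search(r"\.cube(\d+)\.", node_id)
+    return int(m.group(1)) if m else -1
+
+
+def build_response(node_id, path, correlation_id, request_id):
+    """Return (response, reverse_path, first_hop), or None when the path is too short."""
+    reverse_path = list(reversed(path))
+    if len(reverse_path) < 2:
+        return None  # caller falls back to succeeding the original done event
+    resp = ResponseMsg(
+        correlation_id=correlation_id,
+        request_id=request_id,
+        src_cube=cube_index(node_id),
+        src_pe=-1,  # SRAM is not PE-localized
+        success=True,
+    )
+    return resp, reverse_path, reverse_path[1]
+
+
+path = ["sip0.cube3.pe0", "sip0.cube3.r1", "sip0.cube3.sram"]
+print(build_response("sip0.cube3.sram", path, correlation_id=7, request_id=1))
+```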
+
+### D3. Timing parameters are split into `overhead_ns` and wire-side `drain_ns`
+
+- **Component-side latency**: `node.attrs["overhead_ns"]`. In the default topology it is set to
+  around `2.0 ns`.
+- **Link-side serialization**: `drain_ns` is a value the Transaction carries on arrival and is
+  the result of the wire-side BW serialization of ADR-0015 (port/wire model). SRAM simply
+  yields it as-is.
+- The `size_mb` attr (default `32 MiB`) is not currently used for timing; it will be given
+  meaning when a capacity-aware model is introduced (in a separate ADR).
+
+### D4. IPCQ slot accounting is not modeled by the SRAM component directly
+
+The SRAM-tier write latency of an IPCQ slot per ADR-0023 D9.7 is consumed by PE_DMA's
+`_handle_ipcq_inbound`, which directly calls `slot_io_latency_ns("sram", nbytes)`
+(that function uses the value of `common/ipcq_types._BUFFER_KIND_BW["sram"]`).
+That is:
+
+- When the SRAM component receives and processes a fabric Transaction, only **D1, D2, D3** apply.
+- When an IPCQ slot lives in SRAM, PE_DMA pays the time separately at IPCQ slot-write time;
+  this is unrelated to the SRAM component code and is IPCQ-side accounting.
+
+This separation is intentional: IPCQ is a fast path (sub-cycle slot bookkeeping) that does not
+go through fabric Transactions, so SRAM does not need to be aware of IPCQ.
+
+### D5. The SRAM attachment point on the cube NoC is placement-driven
+
+`topology/mesh_gen.py` looks at `placement.sram.pos_mm` (default `[1.5, 9.0]` in
+`topology.yaml`) and adds `"sram"` to the `attach` of the nearest router. The builder (the
+attachment loop in `topology/builder.py`) reads that attach information and lays a bidirectional
+edge between the `sram` node and the router.
+
+This decision lives outside the SRAM component code (in mesh_gen / builder), and the component
+does not need to know which router it is attached to. It only needs to know that `txn.path` /
+`reverse_path` reach it through a router.
+
+### D6. SRAM is not a data store of its own (timing-only)
+
+In the same vein as ADR-0040 D6, the SRAM component models time only, and the actual data
+payload is held by sim_engine's `memory_store` (when present).
+
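+## Alternatives Considered
+
+### A1. SRAM uses `_forward_txn` as-is, with a separate response node like IO_CPU / HBM_CTRL
+
+Rejected. SRAM is terminal on the cube NoC; adding a separate node to take the response would
+add meaningless hops and go against the cube NoC simplification spirit of ADR-0017.
+
+### A2. SRAM models BW serialization with its own resource
+
+Rejected. Link-side BW serialization (`drain_ns`) already captures the semantics sufficiently.
+Adding another `simpy.Resource` inside the component would double-count against the ADR-0015 wire-side model.
+
+### A3. SRAM handles IPCQ slot accounting on the component side
+
+Rejected. As stated in D4, IPCQ is a fast path and does not pass through fabric Transactions.
+If SRAM were aware of IPCQ, responsibility would split in two and become harder to reason about.
+
+### A4. Capacity-aware latency model based on `size_mb`
+
+Rejected (at this stage). Capacity is used only for things like labeling in the topology
+visualizer, and its effect on timing is not yet modeled. If needed, it will be introduced in a separate ADR.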
+
+## Consequences
+
+- The SRAM timing model is fixed at the ADR level as `overhead_ns + drain_ns + ResponseMsg(reverse_path)`,
+  so an attempt to add IPCQ slot latency to the SRAM component can be declined on the
+  basis of D4.
+- `size_mb` is stated to be timing-neutral for now (D3), which narrows the compatibility impact
+  of introducing a capacity-aware model later.
+- Placement-driven router attachment (D5) is made explicit, so it is clear which part is
+  affected (`mesh_gen` only) when the SRAM coordinates move.
diff --git a/docs/adr-ko/ADR-0042-prog-tile-plan-generators.md b/docs/adr-ko/ADR-0042-prog-tile-plan-generators.md
new file mode 100644
index 0000000..7e03174
--- /dev/null
+++ b/docs/adr-ko/ADR-0042-prog-tile-plan-generators.md
@@ -0,0 +1,194 @@
+# ADR-0042: Tile Plan Generators (GEMM/Math Pipeline Plan Builders)
+
+## Status
+
+Accepted (2026-05-20).
+
+This ADR states that `tiling.py` is not a SimPy component but a
+**plan-generator module**.
+
+Because D6 (tile plan / self-routing) of ADR-0014 (PE Pipeline Execution Model) does not
+directly define the tile-plan generation algorithm, this ADR fills that
+gap.
+
+## First action
+
+When `generate_gemm_plan(M, K, N, tile_m, tile_k, tile_n, ..., pe_prefix, a_pinned,
+b_pinned, epilogue_specs)` is called, the very first thing it does is **compute the tile
+counts and build the component ID strings**:
+
+```
+M_tiles = max(1, ceil(M / tile_m))
+K_tiles = max(1, ceil(K / tile_k))
+N_tiles = max(1, ceil(N / tile_n))
+dma_id   = f"{pe_prefix}.pe_dma"
+fetch_id = f"{pe_prefix}.pe_fetch_store"
+gemm_id  = f"{pe_prefix}.pe_gemm"
+math_id  = f"{pe_prefix}.pe_math"
+```
+
+In other words, **the plan generator's first job is to compute the tile counts with ceiling
+division and build the four sub-component IDs of this PE in one go**. It never touches SimPy
+events or environment objects; this module consists of pure functions.
+
+`generate_math_plan(M, N, tile_m, tile_n, ..., math_op, src_addr, dst_addr,
+pe_prefix)` likewise starts by computing `M_tiles` and `N_tiles` and building three component
+IDs (`dma_id`, `fetch_id`, `math_id`).
+
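+## Context
+
+ADR-0014 D6 agreed only that "when PE_SCHEDULER receives a CompositeCmd, it generates a TilePlan
+and feeds self-routing tile tokens". In the code, however, **the concrete plan generation
+algorithm** lives in the `src/kernbench/components/builtin/tiling.py`
+module, and this module:
+
+- is not a component but a collection of **pure functions** (`generate_gemm_plan`,
+  `generate_math_plan`).
+- does not depend on the SimPy environment, queues, op_log, hooks, and so on.
+- returns a `PipelinePlan` (dataclass) as its result.
+
+The earlier G4 analysis wrongly assumed `tiling.py` was a component, but it is actually a
+plan-builder function injected into PE_SCHEDULER. This difference must be pinned down in a
+separate ADR that pairs with D6 of ADR-0014; otherwise:
+
+- It is ambiguous whether PE_SCHEDULER or a separate module is responsible for building the tile plan.
+- The rationale for whether the stage sequences of the GEMM plan and the Math plan are
+  consistent (e.g., FETCH/STORE position) is scattered.
+- There is no rationale for why options such as `a_pinned` / `b_pinned` / `epilogue_specs`
+  branch at the plan level.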
+
+## Decision
+
+### D1. tiling is a pure plan-generator module, not a component
+
+`components/builtin/tiling.py` does not define any ComponentBase subclass.
+It exposes only two module-level functions:
+
+- `generate_gemm_plan(...) -> PipelinePlan`
+- `generate_math_plan(...) -> PipelinePlan`
+
+No node named `tiling` exists in the topology graph. It sits in the `builtin/` directory
+because it is a direct helper of PE_SCHEDULER (ADR-0014 D6); semantically it is closer to
+an internal utility of PE_SCHEDULER.
+
+### D2. Stage sequence of the GEMM plan: `M → N → K` order
+
+Stage sequence for each (m, n, k) tile (baseline with no operand pinning or epilogue applied):
+
+```
+[DMA_READ(A)] → [DMA_READ(B)] → FETCH → GEMM
+                                ↑
+                                ↓
+(last k tile only)              [MATH(output_tile)]* → STORE → DMA_WRITE
+```
+
+A `k_tile` epilogue runs right after GEMM on every K tile; an `output_tile` epilogue runs once
+per (m, n), on the last K tile, right before STORE/DMA_WRITE. The K-loop accumulator stays in
+the RegFile, so no STORE/DMA_WRITE occurs between K tiles (output happens only at
+last_k).
+
+### D3. Operand pinning: `a_pinned` / `b_pinned`
+
+When the caller passes `a_pinned=True`, **the A DMA_READ is omitted for every (m, n, k)
+tile**. Meaning: it signals to the plan generator that the caller (e.g., `tl.composite`) has
+already loaded all of A into TCM once via `tl.load`.
+
+This branch is decided at the plan level (it is not a runtime branch). The number of stage
+records in op_log therefore varies deterministically with pinning, and sweep analysis (e.g.,
+the stage record count of gemm_sweep) sees this decision directly.
+
+### D4. Epilogue scope: `k_tile` vs `output_tile`
+
+`epilogue_specs` is an iterable of op-spec objects. Each op object is assumed to have the
+following attributes:
+
+- `op.kind: str`: the math op name (e.g., `"dequant"`, `"bias"`, `"relu"`, `"scale"`).
+  It goes into the stage's `params["op_kind"]`.
+- `op.scope: Scope`: `Scope.K_TILE` or `Scope.OUTPUT_TILE` (`Scope` is an enum defined in
+  `kernbench.common.pe_commands`).
+- Additional per-op fields (e.g., `bias`, `scale`, `factor`): not used by the plan generator
+  at present; they are consumed by the runtime (PE_MATH).
+
+The plan generator splits ops into two groups based on `getattr(o, "scope", None)`:
+
+- `scope == Scope.K_TILE`: a MATH stage is added right after GEMM on every K tile.
+- `scope == Scope.OUTPUT_TILE`: a MATH stage is added right before STORE on the last K tile
+  of each (m, n).
+
+An op with no `scope` attribute, or whose scope is neither enum value, is **not included in the
+plan** (`getattr(..., None) == Scope.X` is False for both). Applying the default (`output_tile`)
+is **the caller's responsibility (e.g., `tl.composite`)**; the plan generator only branches on
+the scope value already filled in (aligned with the composite epilogue contract of ADR-0014).
+
+`Scope` is imported lazily inside the function to avoid the circular reference
+`pe_commands ← pe_types ← tiling`. This is an intentional pattern, not something to improve
+(from the D1 view that "tiling is a utility of PE_SCHEDULER", having no import-time dependency
+on pe_commands keeps the module boundary clean).
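+
+The stage ordering of D2 together with the pinning (D3) and scope (D4) branches can be
+sketched as a pure function. This is a simplified illustration, not `generate_gemm_plan`
+itself: `gemm_stage_names` is a hypothetical helper, stages are plain strings instead of
+`Stage` objects, and the `Scope` and `Op` definitions below are local stand-ins.
+
+```python
+from dataclasses import dataclass
+from enum import Enum
+from math import ceil
+
+
+class Scope(Enum):
+    K_TILE = "k_tile"
+    OUTPUT_TILE = "output_tile"
+
+
+@dataclass
+class Op:
+    kind: str
+    scope: Scope
+
+
+def gemm_stage_names(M, K, N, tile_m, tile_k, tile_n,
+                     a_pinned=False, b_pinned=False, epilogue_specs=()):
+    """Return [((m, n, k), stage_names)] in M -> N -> K order."""
+    m_tiles = max(1, ceil(M / tile_m))
+    k_tiles = max(1, ceil(K / tile_k))
+    n_tiles = max(1, ceil(N / tile_n))
+    # Ops without a recognized scope fall into neither group and are dropped.
+    k_ops = [o for o in epilogue_specs if getattr(o, "scope", None) == Scope.K_TILE]
+    out_ops = [o for o in epilogue_specs if getattr(o, "scope", None) == Scope.OUTPUT_TILE]
+    plan = []
+    for m in range(m_tiles):
+        for n in range(n_tiles):
+            for k in range(k_tiles):
+                stages = []
+                if not a_pinned:
+                    stages.append("DMA_READ(A)")
+                if not b_pinned:
+                    stages.append("DMA_READ(B)")
+                stages += ["FETCH", "GEMM"]
+                stages += [f"MATH({o.kind})" for o in k_ops]
+                if k == k_tiles - 1:  # output leaves the RegFile only on the last K tile
+                    stages += [f"MATH({o.kind})" for o in out_ops]
+                    stages += ["STORE", "DMA_WRITE"]
+                plan.append(((m, n, k), tuple(stages)))
+    return plan
+
+
+specs = [Op("dequant", Scope.K_TILE), Op("relu", Scope.OUTPUT_TILE), object()]
+for tile, stages in gemm_stage_names(2, 4, 2, 2, 2, 2, a_pinned=True, epilogue_specs=specs):
+    print(tile, stages)
+```
+
+Because the function has no side effects, calling it twice with the same arguments yields
+equal results, which is the property a plan generator needs for deterministic dispatch.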
+
+### D5. Stage sequence of the Math plan: `M → N` order
+
+Stage sequence for each (m, n) tile:
+
+```
+DMA_READ → FETCH → MATH → STORE → DMA_WRITE
+```
+
+There is no K dimension, so concepts such as epilogue / accumulator residency do not apply.
+The register-file accounting of PE_FETCH_STORE is handled the same way as in the GEMM plan.
+
+### D6. A plan is data: no SimPy dependency
+
+`PipelinePlan` is a dataclass defined in `pe_types.py` that holds `tiles: list[TilePlan]`.
+Each `TilePlan` holds `stages: tuple[Stage, ...]`. The plan itself is a nearly immutable data
+structure (only `params: dict` of Stage is mutable) and holds no SimPy
+objects or events.
+
+At runtime, PE_SCHEDULER looks at the first stage of the plan, creates a `TileToken`, and feeds
+it into the pipeline; the TileToken carries `plan: TilePlan`, `stage_idx: int`, and
+`params: dict`. Self-routing proceeds by having `TileToken.advance()` cache the `params` of the
+next stage (ADR-0014 D6).
+
+### D7. Contract of the plan generator: pure, deterministic, idempotent
+
+Calling it twice with the same input returns the same PipelinePlan (deterministic down to the
+order of `TilePlan.stages`). This contract aligns with the "deterministic tile dispatch order"
+requirement of ADR-0014 D6.
+
+There are no side effects (SimPy events, file I/O, global state), so tests can call it without
+an environment object (some cases in `tests/test_pe_pipeline.py` do so).
+
+## Alternatives Considered
+
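+### A1. Make tiling a component (e.g., PE_PLANNER)
+
+Rejected. Plan generation is a decision algorithm that consumes no SimPy time.
+Making it a component would (a) drag in unnecessary infrastructure such as an
+inbox and resources, and (b) force PE_SCHEDULER to split "receive plan" and
+"feed tiles" into two steps, creating a meaningless hop.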
+
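+### A2. Move plan generation into a PE_SCHEDULER class method
+
+Rejected (for now). Keeping a separate module gives (1) testability and (2)
+extensibility: introducing another plan algorithm (e.g., a DTensor-aware plan)
+only requires defining an additional function. If the number of plan kinds grows
+enough to need explicit dispatch, a plan factory in PE_SCHEDULER will be
+introduced through a separate ADR.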
+
+### A3. Enforce plan immutability (frozen dataclass + tuple)
+
+Partially adopted. `Stage` and `TilePlan` are dataclasses but not frozen. Reason:
+`Stage.params: dict` is filled at plan-generation time and only read at runtime
+(TileToken merely caches it on advance). Going fully frozen offers little benefit
+relative to the cost of a dict → frozendict migration. The convention remains
+that nothing mutates a plan outside the planning step.
+
+## Consequences
+
+- It is now stated at ADR level that `tiling.py` is a plan-generator module, not
+  a component, which heads off future G4-style findings of "this component has
+  no ADR".
+- The GEMM plan's stage sequence (D2) and the pinning/epilogue branching (D3, D4)
+  are fixed by ADR, so the basis for interpreting stage record counts in sweep
+  analysis (`scripts/gemm_sweep.py`) is clear.
+- Thanks to the plan generator's pure contract (D7), tests can validate plans
+  without an environment, in line with the spirit of ADR-0013 (verification
+  strategy): "behavior validated by tests with meaningful input cases".
+- When new plan kinds (DTensor-aware plan, K-major plan, etc.) are added, this
+  ADR serves as the baseline: add a new function and follow D1, D6, and D7.
diff --git a/docs/adr/ADR-0038-dev-pcie-ep-component-model.md b/docs/adr/ADR-0038-dev-pcie-ep-component-model.md
new file mode 100644
index 0000000..e88bdc8
--- /dev/null
+++ b/docs/adr/ADR-0038-dev-pcie-ep-component-model.md
@@ -0,0 +1,139 @@
+# ADR-0038: PCIE_EP Component Model
+
+## Status
+
+Accepted (2026-05-20).
+
+Companion to ADR-0035 (M_CPU), ADR-0036 (IO_CPU), and
+ADR-0037 (Forwarding) at the same component-model level.
+
+## First action
+
+Pull one Transaction from `_inbox` and let `_forward_txn` invoke `run()`, which
+applies a single `env.timeout(node.attrs["overhead_ns"])` for PCIe protocol
+handling. After that the standard `ComponentBase` worker rules take over: if
+`next_hop` exists, put the advanced Transaction on `out_ports[next_hop]`;
+otherwise consume `drain_ns` and call `txn.done.succeed()`.
+
+In other words, **PCIE_EP's first (and only) act is to spend the configured
+overhead as simulator time** — no routing decisions, no payload transformation,
+no MMIO decoding.
+
+## Context
+
+PCIE_EP is the **host ↔ device boundary** in the topology graph. The builder
+(`topology/builder.py`) creates an IO chiplet instance per SIP that contains
+`pcie_ep`, `io_cpu`, and `io_noc`, and lays bidirectional edges between the
+external `fabric.switch0` and each `pcie_ep`:
+
+- `switch → pcie_ep`: host → device traffic (MemoryWrite, MemoryRead,
+  KernelLaunch).
+- `pcie_ep → switch`: device-side outbound (e.g., cross-SIP IPCQ tokens).
+
+Inside the IO chiplet there are bidirectional `pcie_ep ↔ io_noc` edges, and
+from there traffic branches to `io_cpu` or to the cube-side `hbm_ctrl` path
+(see ADR-0036 IO_CPU model). The router and resolver already know — per SPEC
+R7 — that PCIE_EP is the endpoint for memory operations, so helpers like
+`find_pcie_ep(sip)` and `find_memory_path(pcie_ep, dst_node)` treat PCIE_EP as
+the start (or end) of the memory path.
+
+The problem is that all of this dependency lives in builder/router/resolver,
+while **PCIE_EP's own internal model has no ADR**. The consequence:
+
+- "What latency does PCIE_EP model?" requires reading the source.
+- The asymmetry with peer components (IO_CPU = ADR-0036, M_CPU = ADR-0035) is
+  awkward.
+- Future decisions about a more detailed PCIe link-layer model (TLP credits,
+  retry, MPS chunking) lack a documented baseline.
+
+This ADR pins down the current **thin PCIE_EP model** and records that this
+thinness is intentional (aligned with ADR-0033's latency-model simplification
+policy).
+
+## Decision
+
+### D1. PCIE_EP uses ComponentBase's generic forwarding worker as-is
+
+`PcieEpComponent` extends `ComponentBase` and does **not** override `_worker` or
+`_forward_txn`. Every Transaction flows through the standard sequence:
+
+1. `_fan_in` accumulates inbound messages (and reassembles Flits, per ADR-0033
+   Phase 2c) into `_inbox`.
+2. `_worker` pulls one message off `_inbox` and spawns
+   `env.process(self._forward_txn(env, txn))` for per-message pipelining.
+3. `_forward_txn` calls the op_log start hook → `run()` for latency → op_log
+   end hook.
+4. `run()` is a single line: `yield env.timeout(overhead_ns)`.
+5. If a next hop exists, `out_ports[next_hop].put(txn.advance())`. Otherwise
+   (terminal arrival) consume `drain_ns` and call `txn.done.succeed()`.
+
+### D2. The only timing parameter is `overhead_ns`
+
+Only `node.attrs["overhead_ns"]` is accepted as a latency parameter. The code
+default is `0.0`; `topology.yaml`'s IOChiplet `components.pcie_ep.attrs`
+supplies the real value (current topology: `overhead_ns: 5.0`, i.e., 5 ns).
+
+No separate BW-serialization resource (`simpy.Resource`), no queue depth, no
+retry model is introduced. Link-level BW serialization is handled wire-side —
+inside the IOChiplet by `pcie_ep_to_noc_bw_gbs = 256.0 GB/s`, and externally by
+the system's `io_ep_to_switch` link BW (ADR-0015 port/wire model). PCIE_EP
+itself takes no part in that accounting.
+
+### D3. PCIE_EP is direction-aware in topology but direction-blind in code
+
+The builder lays both `switch ↔ pcie_ep` and `pcie_ep ↔ io_noc` edges, so
+PCIE_EP serves:
+
+- inbound (host → device): forward Transactions arriving from the switch onto
+  io_noc-side next-hop.
+- outbound (device → host): forward Transactions arriving from io_noc/io_cpu
+  back to the switch.
+
+Both are handled by D1's generic forwarding worker; the component code never
+distinguishes direction (it just follows `txn.next_hop`).
+
+### D4. PCIE_EP is not Flit-aware (legacy reassembly path)
+
+`_FLIT_AWARE` is left at the inherited `False`, so `_fan_in` reassembles
+upstream-chunkified Flits into the parent Transaction before delivery to
+`_inbox` (aligned with ADR-0033 Phase 2c incremental rollout).
+
+A future PCIe TLP-level credit model would revisit D4.
+
+### D5. PCIE_EP is a **named node** for routing helpers
+
+`policy/routing/router.py` provides `find_pcie_ep(sip, io_id="io0")`,
+`find_all_pcie_eps()`, and `find_memory_path(pcie_ep, dst_node)` — all of
+which treat PCIE_EP as the start (or end) of the memory path. The component
+itself supplies no information to these helpers; the naming convention
+(`sip{S}.{io_id}.pcie_ep`) is guaranteed by the topology builder.
+
+## Alternatives Considered
+
+### A1. Full PCIe TLP-level model (credits, retry, MPS chunking)
+
+Rejected. Violates ADR-0033's "current latency model = abstract overhead + BW
+serialization" simplification. Host↔device protocol fidelity is explicitly
+out-of-scope in SPEC §5 "Non-Goals".
+
+### A2. Per-PCIE_EP `simpy.Resource` for in-flight cap
+
+Rejected. Host traffic is not a contention bottleneck in current workloads.
+Defer to a separate ADR if it becomes one (in which case D1 stays and D2 is
+extended).
+
+### A3. Merge PCIE_EP into IO_CPU
+
+Rejected. PCIE_EP is the protocol-boundary node first hit on the host side;
+IO_CPU is the device-side control-plane processing node (ADR-0036). Traffic
+fan-out and command decoding costs concentrate in IO_CPU, while PCIE_EP only
+expresses link-edge overhead. Merging them would mix two responsibilities and
+violate the spirit of ADR-0007 (runtime API/sim_engine boundaries).
+
+## Consequences
+
+- PCIE_EP gets an explicit model ADR despite having near-zero code — consistent
+  with peer component ADRs, lower maintenance friction.
+- A future PCIe-level refinement should supersede this ADR by extending D2/D4
+  in a new ADR.
+- D5 makes the named-node dependency explicit, so any future renaming of
+  component IDs has a clearly bounded blast radius.
diff --git a/docs/adr/ADR-0039-dev-pe-mmu-component-model.md b/docs/adr/ADR-0039-dev-pe-mmu-component-model.md
new file mode 100644
index 0000000..cb41d67
--- /dev/null
+++ b/docs/adr/ADR-0039-dev-pe-mmu-component-model.md
@@ -0,0 +1,203 @@
+# ADR-0039: PE_MMU Component Model — Component + Utility Dual Role
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0011 (PA/VA/LA address model) only states that "the VA model translates
+VA→PA via PE_MMU"; this ADR pins down **the PE_MMU component's own behavior
+model**.
+
+## First action
+
+At construction, read `node.attrs["page_size"]` (default `2 MiB`) and
+`node.attrs["tlb_overhead_ns"]` (default `0.0`) and instantiate the internal
+`PeMMU` utility object (`policy.address.pe_mmu.PeMMU`) exactly once. That
+object is the single owner of the page table, the sub-page region lists, and
+the TLB overhead value.
+
+At runtime the first action splits into two paths:
+
+- **Component path (inbox consumption)**: `_worker` pulls a Transaction off
+  `_inbox`; if `request` is a `MmuMapMsg`, call `self._mmu.map(va, pa, size)`
+  for each entry and then `txn.done.succeed()`. For `MmuUnmapMsg`, call
+  `unmap(va, size)`. Any other type falls through to standard `_forward_txn`.
+  In other words, **the component's first act is "apply map/unmap commands to
+  the page table"**.
+- **Utility path (direct call)**: a sibling PE engine (PE_DMA / PE_GEMM) calls
+  `pe_mmu.mmu.translate(va)` directly. This path produces no SimPy events;
+  the caller (when `overhead_ns > 0`) issues a `yield env.timeout(mmu.overhead_ns)`
+  in its own process.
+
+## Context
+
+ADR-0011 defined three address models (PA/VA/LA) and agreed that "VA model =
+translation via PE_MMU". But in code, `PeMmuComponent` performs two
+complementary roles simultaneously:
+
+1. **A topology-graph component**: it receives `MmuMapMsg` / `MmuUnmapMsg`
+   sideband messages over the cube NoC and updates the page table.
+2. **A PE-local utility**: PE_DMA / PE_GEMM on the same PE call
+   `translate(va)` directly with zero SimPy latency (the caller pays
+   `overhead_ns` if any).
+
+Without an ADR covering both roles, the following questions are ambiguous:
+
+- "Why isn't there a SimPy event for the MMU translate?" (Answer: the caller
+  pays it.)
+- What is the sub-page region model, and why? (The code docstring has it, but
+  no ADR — only a memory note `project_mmu_subpage_stopgap`.)
+- Who sends map/unmap, and when must they be visible? (Ordering contract.)
+
+Additionally, `PeMMU.map()` has "append, last-write-wins on overlap"
+semantics, which a one-PA-per-entry page table cannot express. This is a
+deliberate **simulator stopgap**: it supports DPPolicy sub-page sharding
+(e.g., 128 B payloads against 4 KiB pages) without the silent misrouting that
+results when a later mapping replaces an earlier one for the whole page. This
+deviation from real HW MMU semantics must be ADR-pinned.
+
+## Decision
+
+### D1. Explicit dual role — component and utility
+
+`PeMmuComponent` exposes two interfaces from a single class:
+
+- Component interface: `_inbox` consumption, `_worker` loop (handles MMU
+  sideband messages).
+- Utility interface: the `mmu` property exposes the underlying `PeMMU` object,
+  which PE_DMA / PE_GEMM hold directly and invoke `translate()` on.
+
+The latter is **not a layer skip**: inside a PE, the engines and PE_MMU are
+siblings under the "components" layer (ADR-0007). Cross-layer violations only
+apply to runtime API ↔ sim_engine ↔ components boundaries.
+
+### D2. Latency model — `translate()` is pure; caller owns the timeout
+
+`PeMMU.translate()` is a pure function and yields nothing in SimPy. The caller
+(a PE engine) issues `if mmu.overhead_ns > 0: yield env.timeout(mmu.overhead_ns)`
+in its own process after translation.
+
+Rationale: the PE engine process already holds its own `record_start` /
+`record_end` (op_log) hooks, so keeping timing inside the caller's process
+preserves consistent timing accounting. A separate MMU process would split the
+engine's processing flow and blur op_log / pipeline overlap semantics.
+
+#### D2.1. Current implementation asymmetry — pipeline vs non-pipeline (known)
+
+At the time of writing, `pe_dma.py` handles MMU overhead differently in its
+two call paths:
+
+- **non-pipeline (`handle_command`)**: after `translate()`, applies
+  `if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`.
+- **pipeline (`_do_pipeline_dma`)**: calls `translate()` only, **omitting**
+  the overhead timeout — though the comment says "same logic as non-pipeline
+  path", the behaviors differ.
+
+In the default topology, `tlb_overhead_ns = 0.0`, so this asymmetry does not
+manifest. With `tlb_overhead_ns > 0`, however, GEMM/Math work on the pipeline
+path appears faster than the equivalent non-pipeline workload by the omitted
+MMU overhead.
+
+The D2 contract states that **all** callers pay the overhead; the pipeline
+omission is **not an intentional design** — ADR-0014 D6 (pipeline self-routing)
+does not exempt it. Remediation options (require a separate Phase 1/2):
+
+- (a) Add `if mmu.overhead_ns > 0: yield env.timeout(...)` in
+  `_do_pipeline_dma` to align with D2 — **preferred**.
+- (b) Narrow the D2 contract to "non-pipeline only" and document the pipeline
+  exemption in an ADR-0014 update — discouraged, since it weakens the
+  overhead's meaning.
+
+This ADR recommends (a) and assumes a small follow-up change either before or
+just after acceptance.
+
+### D3. Page table structure — sub-page region list (stopgap)
+
+`self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]]`
+holds multiple disjoint regions per page.
+
+- `map(va, pa, size)`: append a region to each page the range touches,
+  splitting the range where it crosses a page boundary.
+- `translate(va)`: look up regions for the VPN and iterate **in reverse** so
+  the most recent overlapping region wins (last-write-wins).
+- `unmap(va, size)`: remove only regions whose extent is **fully contained**
+  within the unmap range; partial-overlap boundaries are left in place and the
+  caller is expected to unmap on the same boundaries used for map.
+
+This is documented as a **simulator stopgap** that supplements the VA model
+from ADR-0011. It prevents silent last-write-wins misrouting when DPPolicy
+shards below page granularity. Memory note: `project_mmu_subpage_stopgap`.
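+
+A minimal sketch of the region-list behavior described above, assuming a
+4 KiB page for readability. `SubPageTable` is a hypothetical stand-in, not the
+real `PeMMU`; the per-page split in `map` and the reading of
+`pa_at_offset_zero` as "the PA that offset 0 of the page would map to" are
+assumptions made for illustration.
+
+```python
+class PageFault(Exception):
+    """Raised when no region covers the requested VA."""
+
+
+class SubPageTable:
+    """Hypothetical stand-in for the region-list page table in D3."""
+
+    def __init__(self, page_size=4096):
+        self.page_size = page_size
+        # vpn -> list of (start_in_page, end_in_page, pa_at_offset_zero)
+        self._table = {}
+
+    def map(self, va, pa, size):
+        end = va + size
+        while va < end:
+            vpn, start = divmod(va, self.page_size)
+            chunk = min(end - va, self.page_size - start)
+            self._table.setdefault(vpn, []).append((start, start + chunk, pa - start))
+            va += chunk
+            pa += chunk
+
+    def translate(self, va):
+        vpn, off = divmod(va, self.page_size)
+        # Reverse iteration: the most recently appended overlapping region wins.
+        for start, end, pa0 in reversed(self._table.get(vpn, [])):
+            if start <= off < end:
+                return pa0 + off
+        raise PageFault(hex(va))
+
+    def unmap(self, va, size):
+        end = va + size
+        for vpn in range(va // self.page_size, (end - 1) // self.page_size + 1):
+            base = vpn * self.page_size
+            # Keep every region that is not fully contained in [va, end).
+            kept = [r for r in self._table.get(vpn, [])
+                    if not (va <= base + r[0] and base + r[1] <= end)]
+            if kept:
+                self._table[vpn] = kept
+            else:
+                self._table.pop(vpn, None)
+
+
+mmu = SubPageTable(page_size=4096)
+mmu.map(0x1000, 0x8000, 128)     # shard 0
+mmu.map(0x1080, 0x20000, 128)    # shard 1: same page, different PA
+print(hex(mmu.translate(0x1004)), hex(mmu.translate(0x1080)))
+```
+
+Both shards live in the same page yet resolve to different PAs, which is the
+case a one-PA-per-page table cannot represent.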
+
+### D4. PageFault signals PA fallback
+
+If `translate()` is called with an unmapped VA, `PageFault` is raised. PE_DMA
+catches the exception and **uses the original address as a PA** (the PA-only
+backward-compatibility path from ADR-0011). PageFault is therefore not an
+error — it is the signal for "no VA mapping, interpret as PA".
+
+This path is intentional and preserves backward compatibility with the
+ADR-0011 PA-only mode.
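+
+The fallback can be pictured with the sketch below. It is illustrative only:
+`translate` is a toy lookup and `resolve_address` is a hypothetical helper
+name, neither taken from `pe_dma.py`.
+
+```python
+class PageFault(Exception):
+    """Stand-in for the exception raised on an unmapped VA."""
+
+
+def translate(table, va):
+    # Toy translate: `table` maps exact VAs to PAs.
+    try:
+        return table[va]
+    except KeyError:
+        raise PageFault(hex(va)) from None
+
+
+def resolve_address(table, addr):
+    """Return (pa, was_translated); an unmapped address is used as a PA."""
+    try:
+        return translate(table, addr), True
+    except PageFault:
+        return addr, False
+
+
+table = {0x1000: 0x8000}
+print(resolve_address(table, 0x1000))  # mapped VA
+print(resolve_address(table, 0x2000))  # unmapped: treated as a PA
+```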
+
+### D5. MMU sideband-message reception contract
+
+`MmuMapMsg` / `MmuUnmapMsg` arrive over the fabric at PE_MMU's `_inbox`
+(SPEC R10: "MMU map installation incurs measured fabric latency"). Schemas
+live in `runtime_api/kernel.py`:
+
+- `MmuMapMsg.entries: tuple[dict, ...]` — each dict is
+  `{"va": int, "pa": int, "size": int}`.
+- `MmuUnmapMsg.entries: tuple[dict, ...]` — each dict is
+  `{"va": int, "size": int}`.
+
+PE_MMU reception flow:
+
+1. `_worker` does `_inbox.get()` for one message.
+2. `hasattr(msg, "request")` confirms a Transaction wrapper.
+3. `isinstance(msg.request, MmuMapMsg)` → for each entry, call
+   `self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"])`.
+4. `isinstance(msg.request, MmuUnmapMsg)` → for each entry, call
+   `self._mmu.unmap(va=e["va"], size=e["size"])`.
+5. Both signal `msg.done.succeed()` after completion.
+
+An external caller (runtime API) waiting on `done` therefore receives a SimPy
+guarantee that "the mapping is installed on-device" — this is the realization
+of ADR-0011's "MMU map installation incurs measured fabric latency".
+
+This ADR does **not** define the **sender or fan-out policy** for the sideband
+message — those are runtime API responsibilities. Only the receive contract
+belongs here.
+
+### D6. Non-MMU Transactions delegate to generic forwarding
+
+If a message pulled from `_inbox` is not `MmuMapMsg` / `MmuUnmapMsg` (or
+lacks a `request` attribute), `_forward_txn` handles it normally. This keeps
+the door open for future topologies where PE_MMU sits on a pass-through path —
+current code never sends such traffic, but the routing remains safe.
+
+## Alternatives Considered
+
+### A1. Make `translate()` a SimPy generator
+
+Rejected. As D2 explains, this blurs op_log / pipeline overlap accounting in
+the PE engine.
+
+### A2. Use small page size (e.g., 128 B) instead of sub-page regions
+
+Rejected. Would explode page-table memory and cube-wide map message size. Most
+mappings are 2 MiB; pushing the page size below that for the few DPPolicy
+sharding cases inflates average cost.
+
+### A3. Make PE_MMU a PE_CPU helper only (not a topology node)
+
+Rejected. ADR-0011 requires that MMU map installation incur measured fabric
+latency (via `MmuMapMsg`), which requires PE_MMU to be a node on the graph.
+It also keeps cube NoC visualizer output consistent.
+
+## Consequences
+
+- PE_MMU's dual role is justified at ADR level, so future "unify into one"
+  refactor pressure has a documented counterpoint.
+- The sub-page region model is explicitly labeled a stopgap, providing a
+  basis for deprecating it when the LA model (ADR-0011) lands.
+- The "`translate()` does not yield" contract is locked in (D2), so any
+  future proposal to add an internal MMU timeout can be denied with a
+  documented rationale.
+- PA fallback (D4) is normalized, preventing defensive logic from treating
+  PageFault as an error.
diff --git a/docs/adr/ADR-0040-dev-pe-tcm-component-model.md b/docs/adr/ADR-0040-dev-pe-tcm-component-model.md
new file mode 100644
index 0000000..e66d1bb
--- /dev/null
+++ b/docs/adr/ADR-0040-dev-pe-tcm-component-model.md
@@ -0,0 +1,149 @@
+# ADR-0040: PE_TCM Component Model — Dual-Channel BW Serialization
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0014 (PE Pipeline Execution Model, D1) references PE_TCM as a "BW-based
+serialized scratchpad memory" but does not pin down the component's own model.
+This ADR fills that gap.
+
+## First action
+
+When `start()` is invoked, immediately create two `simpy.Resource(env, capacity=1)`
+instances and store them in `self._read_res` / `self._write_res`. These two
+resources are the single decision points that serialize the **read channel**
+and **write channel** to one in-flight request each.
+
+The runtime first action: `_worker` pulls a message off `_inbox` and branches
+by type:
+
+- `TcmRequest` (from `pe_fetch_store`): spawn `env.process(self._handle_tcm_request)`.
+  Hence **TCM's first act is "acquire the lock matching the direction
+  (read/write)"**. After lock acquisition, if `bw > 0 and nbytes > 0`, yield
+  `env.timeout(delay_ns = nbytes / bw)`, then `req.done.succeed()`.
+- Anything else (Transaction): spawn `env.process(self._forward_txn)` (legacy
+  fabric pass-through).
+
+At construction, `node.attrs["read_bw_gbs"]` and `node.attrs["write_bw_gbs"]`
+(default `512.0 GB/s` each) are captured and held.
+
+## Context
+
+In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives two kinds of traffic:
+
+1. **`TcmRequest` from PE_FETCH_STORE** — when moving data between TCM and
+   the register file, PE_FETCH_STORE sends a short sideband request to obtain
+   BW-serialized access latency (`direction = "read"` or `"write"`, `nbytes`,
+   `done` event).
+2. **Legacy Transaction forwarding** — a fallback in case TCM ends up as a
+   pass-through node on the fabric graph (not used by the current critical
+   path, but preserved).
+
+The problem: ADR-0014 only says "BW-based serialization" without specifying:
+
+- Read and write are **independent channels** running in parallel; only
+  same-direction concurrency serializes at `capacity=1`.
+- BW is split into two configurable values (`read_bw_gbs` / `write_bw_gbs`).
+- The formula is `delay_ns = nbytes / bw_gbs` (loose unit convention:
+  GB/s × ns ≈ B).
+- `nbytes == 0` still acquires the lock but skips the BW term.
+- `run()`'s `overhead_ns` (default `0.0`) is only used in the legacy fabric
+  forwarding path.
+
+Each of these requires an ADR. In particular, "why are read and write
+separate channels" and "who owns the BW values" must be documented so that
+future changes (e.g., `capacity=2`) have a clear basis.
+
+## Decision
+
+### D1. Dual channel — read and write are independent resources
+
+`_read_res = simpy.Resource(env, capacity=1)`,
+`_write_res = simpy.Resource(env, capacity=1)`.
+Same-direction concurrent requests queue on the resource and serialize;
+opposite-direction requests proceed in parallel. This matches the hardware
+model where TCM has a dual-port (read + write) configuration, and it allows
+the simulator to express the GEMM-pipeline case where fetch (read) and store
+(write) overlap in time — modeled as BW-serialized inside each direction but
+independent across directions.
+
+### D2. Per-channel BW model — `nbytes / bw_gbs`
+
+After lock acquisition, if `nbytes > 0 and bw > 0`, yield
+`env.timeout(nbytes / bw_gbs)`. The unit convention is GB/s × ns ≈ B,
+consistent with the simulator-wide loose convention (see ADR-0033).
+
+- `nbytes == 0`: BW term is zero, but the lock is acquired and released. This
+  is intentional: when a plan generator emits an empty fetch/store on the
+  PE_FETCH_STORE side, the op_log / channel accounting on the TCM side still
+  records one consumption.
+- `bw == 0` (config error): the timeout call is skipped (0-time pass). Should
+  not occur with normal settings.
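+
+As a worked example of the formula, the sketch below computes per-request
+delays and completion times without SimPy. It assumes all requests are issued
+at t = 0 and served FIFO per direction; `tcm_delay_ns` and `finish_times` are
+hypothetical helper names, not part of `pe_tcm.py`.
+
+```python
+def tcm_delay_ns(nbytes, bw_gbs):
+    """Delay in ns, treating 1 GB/s as 1 B/ns (the loose unit convention)."""
+    if nbytes > 0 and bw_gbs > 0:
+        return nbytes / bw_gbs
+    return 0.0  # empty request or bw == 0: no BW term
+
+
+def finish_times(requests, read_bw_gbs=512.0, write_bw_gbs=512.0):
+    """Completion time of each (direction, nbytes) request issued at t = 0."""
+    bw = {"read": read_bw_gbs, "write": write_bw_gbs}
+    free_at = {"read": 0.0, "write": 0.0}  # one independent channel per direction
+    out = []
+    for direction, nbytes in requests:
+        free_at[direction] += tcm_delay_ns(nbytes, bw[direction])
+        out.append(free_at[direction])
+    return out
+
+
+print(tcm_delay_ns(4096, 512.0))
+print(finish_times([("read", 4096), ("write", 2048), ("read", 4096)]))
+```
+
+With the default 512 GB/s, a 4096 B request takes 8 ns. The second read
+finishes at 16 ns because it queues behind the first, while the write
+finishes at 4 ns on its own channel.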
+
+### D3. BW values come from `node.attrs.read_bw_gbs` / `write_bw_gbs`
+
+Defaults `512.0 GB/s`. The topology builder (`topology/builder.py`) passes
+these attrs when instantiating TCM from `pe_template`. Default changes should
+coincide with related decisions in ADR-0014 D1 or ADR-0033.
+
+### D4. TcmRequest schema is owned by PE_TCM
+
+`@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
+lives in `components/builtin/pe_tcm.py`. PE_FETCH_STORE imports the dataclass
+and only constructs/sends it. The caller does not define the schema because:
+
+- The meaning of BW serialization is TCM's responsibility — TCM decides which
+  fields drive serialization.
+- The valid-value check for `direction` (must be `"read"` or `"write"`) lives
+  in `_handle_tcm_request`'s if/else branch.
+
+### D5. Legacy Transaction forwarding path is preserved
+
+When `_worker` receives a non-`TcmRequest` message, it dispatches to
+`_forward_txn`, applying `run()`'s `overhead_ns`. The current standard PE
+pipeline does not route Transactions through TCM, but the path is kept to
+avoid breakage if fabric topology changes.
+
+This path is accounted for via standard Transaction op_log; the BW channel
+locks are **not** acquired (orthogonal to D1's usage).
+
+### D6. PE_TCM is not a data store (timing only)
+
+TCM models **time only**. The actual data payload is held by sim_engine's
+`memory_store` (when present); the TCM component never updates it.
+PE_FETCH_STORE obtains BW delay through `TcmRequest`, and register contents
+are handled separately in the data path (ADR-0020 2-pass data execution —
+Phase 2).
+
+## Alternatives Considered
+
+### A1. Single channel (`capacity=2` for shared read+write)
+
+Rejected. Would artificially serialize the normal-case overlap of fetch
+(read) and store (write) and yield an incorrect BW upper bound for the PE
+pipeline.
+
+### A2. `capacity > 1` (e.g., 2-banked TCM)
+
+Rejected. Current hardware model assumes a single bank. Multi-bank extension
+needs its own ADR that would supersede D1. Bumping capacity now would loosen
+the nominal serialization without raising the BW upper bound, producing less
+accurate modeling.
+
+### A3. Generalize BW formula to `nbytes / bw + overhead_ns`
+
+Rejected. `overhead_ns` is reserved for the legacy forwarding path (D5).
+Additional fetch/store-path overhead, if needed, belongs in PE_FETCH_STORE's
+`run()` or in a register-file access model — closer to the responsibility
+boundary.
+
+## Consequences
+
+- TCM's BW accounting is locked at ADR level. Questions arising from op_log
+  in GEMM/Math sweeps — "why did fetch and store overlap?", "why do only
+  same-direction requests serialize?" — resolve quickly to D1.
+- Future multi-bank TCM models or asymmetric read/write BW changes have a
+  clear blast radius (D1 / D2 / D3 — pick one).
+- D6 ("TCM is not a data store") sharpens the responsibility boundary with
+  ADR-0020 2-pass execution.
diff --git a/docs/adr/ADR-0041-dev-cube-sram-component-model.md b/docs/adr/ADR-0041-dev-cube-sram-component-model.md
new file mode 100644
index 0000000..8dfc0f1
--- /dev/null
+++ b/docs/adr/ADR-0041-dev-cube-sram-component-model.md
@@ -0,0 +1,195 @@
+# ADR-0041: Cube SRAM Component Model — terminal scratchpad on cube NoC
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0017 (Cube NOC and HBM Connectivity) describes SRAM as a cube-NoC
+attachment but does not specify the SRAM component's own latency / response
+model. This ADR fills that gap.
+
+## First action
+
+Inside `_worker`, immediately after pulling a Transaction off `_inbox`, the
+very first action is `yield from self.run(env, txn.nbytes)`. Inside `run()`,
+the component applies `env.timeout(node.attrs["overhead_ns"])`
+(default `0.0`).
+
+In short, **SRAM's first act is "express access overhead as simulator time"**.
+After overhead, the worker yields `drain_ns` (the terminal BW-serialization
+cost stamped on the Transaction) and then constructs and dispatches a
+`ResponseMsg` on the reverse path.
+
+This differs from a generic `ComponentBase._worker`: SRAM knows it is a
+**terminal node**, so it does not go through `_forward_txn`. Its own worker
+explicitly performs `run → drain → _send_response`.
+
+## Context
+
+The cube topology (`topology/builder.py`) creates the following named nodes
+per cube:
+
+- `sip{S}.cube{C}.m_cpu`
+- `sip{S}.cube{C}.sram`
+- `sip{S}.cube{C}.hbm_ctrl` (per-PE partitions)
+- `sip{S}.cube{C}.pe{P}` (and its PE-internal sub-components)
+
+SRAM is one of the cube-NoC attachments — `topology/mesh_gen.py` assigns it
+to the nearest router by placement coordinates and adds `"sram"` to that
+router's `attach` list. The builder lays bidirectional `sram ↔ router` edges
+(BW: `sram_to_router_bw_gbs`, default `128.0 GB/s`).
+
+SRAM has two intertwined roles:
+
+1. **Fabric terminal**: the endpoint for cube-NoC memory-access Transactions
+   destined for SRAM. SRAM consumes access overhead + drain, then sends a
+   response back on the reverse path.
+2. **One of the IPCQ slot tiers**: ADR-0023 D9.7 defines
+   `buffer_kind ∈ {tcm, sram, hbm}`; the `sram` tier's per-access cost is
+   `(512.0 GB/s, 2.0 ns)` in `common/ipcq_types._BUFFER_KIND_BW`. This is
+   separate from the SRAM node's `overhead_ns` attr; PE_DMA accounts for it
+   directly at the IPCQ slot-write moment.
+
+Without an ADR covering both roles, the following questions are ambiguous:
+
+- "What latency does SRAM model?" — fabric drain + overhead, or the IPCQ
+  tier slot latency? — answers scatter.
+- What does the `size_mb` (`32`) attr mean in the future? Currently it is not
+  used; SRAM only models timing.
+- Which cube router does SRAM attach to? (placement-based; lives in topology
+  code only.)
+
+## Decision
+
+### D1. SRAM is a terminal scratchpad node on the cube NoC
+
+`SramComponent` extends `ComponentBase` but overrides `_worker` to express
+terminal semantics directly:
+
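+```
+# Sketch of the SramComponent._worker body (terminal semantics).
+while True:
+    txn = yield self._inbox.get()
+    yield from self.run(env, txn.nbytes)     # overhead_ns
+    # drain_ns: wire-side cost stamped on the Transaction (see D3).
+    if drain_ns > 0:
+        yield env.timeout(drain_ns)
+    yield from self._send_response(env, txn)
+```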
+
+This pattern is necessary because SRAM must know the reverse path; the
+generic `_forward_txn` (which forwards to the next hop) does not fit a
+terminal.
+
+#### D1.1. Currently dormant — the `_worker` override is an unused path
+
+At the time of writing, **no component actually sends a Transaction to the
+SRAM node**. The verified references to the SRAM node ID are:
+
+- `policy/routing/router.py` and friends — guarantee path lookups.
+- `components/builtin/pe_dma.py::_handle_ipcq_inbound` — for
+  `buffer_kind == "sram"`, computes the *path* to
+  `bank_node = f"{cube_prefix}.sram"` via `compute_drain_ns(path, ...)` and
+  yields a **local** timeout. The Transaction itself does not flow to the
+  SRAM node (see D4).
+- `tests/test_routing.py` — checks connectivity via
+  `find_path("sip0.cube0.pe0", "sip0.cube0.sram")`.
+
+So the `_worker` / `_send_response` override is currently a **dormant code
+path**. It is preserved deliberately:
+
+- Topology changes that route fabric Transactions to SRAM terminally (e.g.,
+  explicit M_CPU → SRAM accesses) would activate it immediately.
+- ADR-0017's "cube-attached scratchpad" semantics naturally implies terminal
+  behavior; the override is an intentional placeholder.
+
+A future ADR (or a revision to this one) will mark dormancy resolved when an
+actual sender is added.
+
+### D2. ResponseMsg construction and reverse-path dispatch
+
+`_send_response`:
+
+1. `reverse_path = list(reversed(txn.path))` — derive the reverse path.
+2. Construct `ResponseMsg(correlation_id=txn.request.correlation_id,
+   request_id=..., src_cube=<this cube>, src_pe=-1, success=True)`.
+3. Wrap in `Transaction(request=resp_msg, path=reverse_path, step=0,
+   nbytes=0, done=env.event(), is_response=True)` and put on
+   `out_ports[reverse_path[1]]`.
+4. If the reverse path is too short (`< 2 hops`) or `ctx` is absent, fall
+   back to calling the original `txn.done.succeed()`.
+
+`src_pe = -1` means "SRAM is not PE-localized". `src_cube` is parsed from the
+node ID (`sip{S}.cube{C}.sram`).
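+
+The path handling in steps 1, 3, and 4 can be sketched in plain Python. This
+is a hedged illustration: `plan_response` and `src_cube_from_node_id` are
+hypothetical names, the router node ID in the example is made up, and
+"`< 2 hops`" is read here as "fewer than two entries in the reversed path".
+
+```python
+def plan_response(path):
+    """Return (reverse_path, first_hop), or None if no response can be routed."""
+    reverse_path = list(reversed(path))
+    if len(reverse_path) < 2:
+        return None  # caller falls back to txn.done.succeed()
+    return reverse_path, reverse_path[1]
+
+
+def src_cube_from_node_id(node_id):
+    # Assumes the sip{S}.cube{C}.sram naming convention.
+    cube_part = node_id.split(".")[1]
+    return int(cube_part[len("cube"):])
+
+
+path = ["sip0.cube3.pe0", "sip0.cube3.router2", "sip0.cube3.sram"]
+reverse_path, first_hop = plan_response(path)
+print(first_hop, src_cube_from_node_id(path[-1]))
+```
+
+Index 0 of the reversed path is the SRAM node itself, which is why the
+response is put on the port for `reverse_path[1]`.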
+
+### D3. Timing parameters: `overhead_ns` and wire-side `drain_ns`
+
+- **Component-side latency**: `node.attrs["overhead_ns"]`. Default topology
+  uses `2.0 ns`.
+- **Link-side serialization**: `drain_ns` arrives stamped on the Transaction
+  — the wire-side BW serialization result from ADR-0015. SRAM only yields it.
+- The `size_mb` (default `32 MiB`) attr is currently timing-neutral. If a
+  capacity-aware model is added in the future, a separate ADR will give it
+  meaning.
+
+### D4. IPCQ slot accounting is not modeled by the SRAM component
+
+Per ADR-0023 D9.7, the IPCQ slot-write latency for the SRAM tier is incurred
+inside PE_DMA's `_handle_ipcq_inbound`, which calls
+`slot_io_latency_ns("sram", nbytes)` using `_BUFFER_KIND_BW["sram"]`. That is:
+
+- When SRAM receives a fabric Transaction (D1, D2, D3 apply), it processes
+  normally.
+- When an IPCQ slot lives on SRAM, PE_DMA pays the slot-write time directly —
+  independent of the SRAM component.
+
+This separation is intentional: IPCQ is a fast path (sub-cycle slot
+bookkeeping) and does not traverse fabric Transactions, so SRAM does not need
+to know about IPCQ.
+
+### D5. SRAM's cube-NoC attachment is placement-driven
+
+`topology/mesh_gen.py` reads `placement.sram.pos_mm` (default `[1.5, 9.0]` in
+`topology.yaml`) and adds `"sram"` to the nearest router's `attach`. The
+builder (`topology/builder.py`'s attachment loop) then lays bidirectional
+edges between the `sram` node and that router.
+
+This decision lives outside the SRAM component (mesh_gen / builder); the
+component does not know which router it sits on. It only relies on
+`txn.path` / `reverse_path` to reach it via a router.
+
+### D6. SRAM is not a data store (timing only)
+
+Same context as ADR-0040 D6: the SRAM component models time only; the data
+payload (if any) lives in sim_engine's `memory_store`.
+
+## Alternatives Considered
+
+### A1. Use `_forward_txn` and route responses via separate nodes (à la IO_CPU / HBM_CTRL)
+
+Rejected. SRAM is a terminal on the cube NoC; adding a response node would
+introduce meaningless hops and violate ADR-0017's simplification spirit.
+
+### A2. Model BW serialization inside SRAM with its own resource
+
+Rejected. Wire-side BW serialization (`drain_ns`) already captures it. An
+internal `simpy.Resource` would double-count against ADR-0015 (port/wire
+model).
+
+### A3. Handle IPCQ slot accounting in the SRAM component
+
+Rejected. As D4 makes explicit, IPCQ is a fast path that does not traverse
+fabric Transactions. If SRAM knew about IPCQ, slot-latency accounting would be
+split across two places, making the timing model harder to reason about.
+
+### A4. Capacity-aware latency from `size_mb`
+
+Rejected for now. The capacity is currently a visualizer label; introducing
+a capacity-aware timing model requires a dedicated ADR.
+
+## Consequences
+
+- SRAM's timing model is pinned at ADR level as
+  `overhead_ns + drain_ns + ResponseMsg(reverse_path)`. Any proposal to push
+  IPCQ slot latency into the SRAM component can be refused with D4.
+- D3 records that `size_mb` is timing-neutral today, so a future
+  capacity-aware model has a narrow compatibility scope.
+- D5 documents the placement-driven attachment, so changes to the SRAM
+  coordinate have a clearly bounded impact (`mesh_gen` only).
diff --git a/docs/adr/ADR-0042-prog-tile-plan-generators.md b/docs/adr/ADR-0042-prog-tile-plan-generators.md
new file mode 100644
index 0000000..ebedd0a
--- /dev/null
+++ b/docs/adr/ADR-0042-prog-tile-plan-generators.md
@@ -0,0 +1,199 @@
+# ADR-0042: Tile Plan Generators — GEMM/Math Pipeline Plan Builders
+
+## Status
+
+Accepted (2026-05-20).
+
+This ADR pins down `tiling.py` as a **plan-generator
+module**, not a SimPy component.
+
+ADR-0014 (PE Pipeline Execution Model) D6 (tile plan / self-routing) does not
+specify the tile-plan generation algorithm itself; this ADR fills that gap.
+
+## First action
+
+When `generate_gemm_plan(M, K, N, tile_m, tile_k, tile_n, ..., pe_prefix,
+a_pinned, b_pinned, epilogue_specs)` is called, the very first action is
+**computing tile counts and constructing the PE-component ID strings**:
+
+```
+M_tiles = max(1, ceil(M / tile_m))
+K_tiles = max(1, ceil(K / tile_k))
+N_tiles = max(1, ceil(N / tile_n))
+dma_id   = f"{pe_prefix}.pe_dma"
+fetch_id = f"{pe_prefix}.pe_fetch_store"
+gemm_id  = f"{pe_prefix}.pe_gemm"
+math_id  = f"{pe_prefix}.pe_math"
+```
+
+In short, **the plan generator's first act is "compute ceiling tile counts
+and assemble the four sub-component IDs for this PE once"**. No SimPy event
+or environment is touched: `generate_gemm_plan` is a pure function.
+
+`generate_math_plan(M, N, tile_m, tile_n, ..., math_op, src_addr, dst_addr,
+pe_prefix)` likewise begins by computing `M_tiles`, `N_tiles` and assembling
+three component IDs (`dma_id`, `fetch_id`, `math_id`).
+
+## Context
+
+ADR-0014 D6 agreed that "PE_SCHEDULER, on receiving a CompositeCmd, generates
+a TilePlan and feeds self-routing tile tokens". But the **concrete plan
+generation algorithm** lives in `src/kernbench/components/builtin/tiling.py`,
+which:
+
+- Defines no component — it is a pair of **pure functions**
+  (`generate_gemm_plan`, `generate_math_plan`).
+- Does not depend on the SimPy environment, queues, op_log, or hooks.
+- Returns a `PipelinePlan` (dataclass).
+
+The original G4 analysis incorrectly described `tiling.py` as a component;
+it is in fact a plan-builder helper consumed by PE_SCHEDULER. Pinning this
+down in its own ADR (paired with ADR-0014 D6) prevents:
+
+- Ambiguity over whether plan generation belongs to PE_SCHEDULER or a
+  separate module.
+- Inconsistent rationale for stage sequences (e.g., FETCH/STORE position)
+  between GEMM and Math plans.
+- Undocumented branching rationale for `a_pinned` / `b_pinned` /
+  `epilogue_specs`.
+
+## Decision
+
+### D1. `tiling` is a pure plan-generator module, not a component
+
+`components/builtin/tiling.py` defines no `ComponentBase` subclass. It exports
+two module-level functions:
+
+- `generate_gemm_plan(...) -> PipelinePlan`
+- `generate_math_plan(...) -> PipelinePlan`
+
+There is no `tiling` node in the topology graph. It lives in `builtin/`
+because it is a direct helper for PE_SCHEDULER (ADR-0014 D6) and is
+conceptually a PE_SCHEDULER internal utility.
+
+### D2. GEMM plan stage sequence — `M → N → K` order
+
+For each `(m, n, k)` tile (bracketed stages are conditional; see D3 and D4):
+
+```
+every k tile:
+  [DMA_READ(A)] → [DMA_READ(B)] → FETCH → GEMM → [MATH(k_tile)]*
+last k tile only, appended after the stages above:
+  [MATH(output_tile)]* → STORE → DMA_WRITE
+```
+
+A `k_tile` epilogue inserts a MATH stage immediately after GEMM on every
+K-tile; an `output_tile` epilogue inserts MATH stages once per `(m, n)`, after
+the final K-tile but before STORE/DMA_WRITE. The K-loop accumulator stays
+in the register file across K-tiles, so STORE/DMA_WRITE is emitted only on
+the last K-tile (`last_k`).
+
+### D3. Operand pinning — `a_pinned` / `b_pinned`
+
+If a caller passes `a_pinned=True`, **the A DMA_READ is omitted from every
+(m, n, k) tile**. Semantically: the caller (e.g., `tl.composite`) has already
+staged all of A in TCM via a prior `tl.load`, and signals so to the plan
+generator.
+
+The branch is made at plan time (not at runtime). Therefore the stage record
+count in op_log changes deterministically with pinning, and sweep analyses
+(e.g., gemm_sweep's stage record count) see this decision directly.
+
+### D4. Epilogue scope — `k_tile` vs `output_tile`
+
+`epilogue_specs` is an iterable of op-spec objects. Each op object is
+expected to have:
+
+- `op.kind: str` — math op name (e.g., `"dequant"`, `"bias"`, `"relu"`,
+  `"scale"`). Placed into the stage's `params["op_kind"]`.
+- `op.scope: Scope` — `Scope.K_TILE` or `Scope.OUTPUT_TILE` (`Scope` enum
+  in `kernbench.common.pe_commands`).
+- Op-specific extras (e.g., `bias`, `scale`, `factor`) — currently not used
+  by the plan generator; consumed at runtime by PE_MATH.
+
+The plan generator partitions by `getattr(o, "scope", None)`:
+
+- `scope == Scope.K_TILE`: adds a MATH stage right after GEMM on every K-tile.
+- `scope == Scope.OUTPUT_TILE`: adds MATH stages just before STORE on the
+  last K-tile per `(m, n)`.
+
+Ops with neither `scope` value (e.g., a missing attribute) are **dropped
+silently**, because `getattr(o, "scope", None) == Scope.X` is False for both
+members. Picking a default (`output_tile`) is the **caller's responsibility**
+(e.g., `tl.composite`), not the plan generator's. This aligns with ADR-0014's
+composite epilogue contract.
+
+`Scope` is imported lazily inside the function to avoid the circular path
+`pe_commands ← pe_types ← tiling`. This is intentional and not a refactor
+target: keeping `tiling` free of import-time `pe_commands` dependencies
+preserves the module boundary (D1).
+
+### D5. Math plan stage sequence — `M → N` order
+
+For each `(m, n)` tile:
+
+```
+DMA_READ → FETCH → MATH → STORE → DMA_WRITE
+```
+
+There is no K dimension, so concepts like epilogue or accumulator residency
+do not apply. PE_FETCH_STORE's register-file accounting follows the same
+pattern as the GEMM plan.
+
+### D6. Plans are data — no SimPy dependency
+
+`PipelinePlan` is a dataclass in `pe_types.py` holding `tiles:
+list[TilePlan]`. Each `TilePlan` holds `stages: tuple[Stage, ...]`. The plan
+itself is near-immutable (only `Stage.params: dict` is mutable) and holds no
+SimPy objects.
+
+At runtime, PE_SCHEDULER consumes the plan's first stage, builds a `TileToken`,
+and feeds it into the pipeline. The TileToken carries `plan: TilePlan`,
+`stage_idx: int`, and a cached `params: dict`. Self-routing proceeds by
+`TileToken.advance()` caching the next stage's `params` (ADR-0014 D6).
+
+### D7. Plan generator contract — pure, deterministic, idempotent
+
+Two calls with identical inputs return structurally identical `PipelinePlan`
+values (including `TilePlan.stages` order). This contract aligns with
+ADR-0014 D6's "deterministic tile dispatch order".
+
+No side effects (no SimPy events, no file I/O, no global state) — tests can
+call the generators directly without an environment object (some cases in
+`tests/test_pe_pipeline.py` rely on this).
+
+## Alternatives Considered
+
+### A1. Make tiling a component (e.g., PE_PLANNER)
+
+Rejected. Plan generation consumes no SimPy time — it is a pure decision
+algorithm. Making it a component would (a) add unnecessary infrastructure
+(inbox, resources), and (b) split PE_SCHEDULER's flow into "receive plan"
+plus "feed tiles", inserting a meaningless hop.
+
+### A2. Move plan generation into PE_SCHEDULER as methods
+
+Rejected (for now). Module separation provides (1) testability and
+(2) extensibility: an additional plan algorithm (e.g., DTensor-aware) is just
+a new function. If plan kinds proliferate enough to require explicit
+dispatch, a future ADR can introduce a plan factory on PE_SCHEDULER.
+
+### A3. Make plans fully immutable (frozen dataclass + tuple)
+
+Partially adopted. `TilePlan.stages` is already a tuple (D6), but `Stage` and
+`TilePlan` are not frozen dataclasses, because `Stage.params: dict` is
+populated at plan-generation time and read at runtime (cached by TileToken on
+advance). Moving dict → frozendict would add migration cost without enough
+benefit. Convention: do not mutate a plan after generation.
+
+## Consequences
+
+- `tiling.py` is documented as a plan-generator module, not a component —
+  preempting future G4-style "this component lacks an ADR" analyses.
+- The GEMM plan's stage sequence (D2) and pinning / epilogue branching
+  (D3 / D4) are pinned, providing a clear interpretation basis for sweep
+  analyses (e.g., `scripts/gemm_sweep.py`'s stage record counts).
+- The plan generator's pure contract (D7) enables environment-free testing
+  in line with ADR-0013 (verification strategy).
+- Future plan kinds (DTensor-aware, K-major, ...) follow D1 / D6 / D7 as a
+  baseline — just add a new function.