ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)

Fill component-model coverage gaps surfaced by /report's G4 analysis. Each ADR
documents the component's First action, latency model, and honest notes on
dormant code or implementation asymmetries discovered during re-evaluation
against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding worker
  as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap; D2.1
  flags pipeline path missing mmu.overhead_ns timeout (asymmetric with
  non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1);
  TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path; D1.1
  flags _worker override as currently dormant (no Transaction actually targets
  the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects the G4
  misclassification; pins GEMM/Math stage sequences and epilogue scope contract

Also: /report skill G3 refinement: only flag older->newer asymmetric
cross-references; newer->older (e.g., 0034-0037 citing infrastructure ADRs) are
expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:43:03 -07:00
parent 049e3d8bb3
commit 1f36baa898
11 changed files with 1747 additions and 3 deletions
@@ -62,6 +62,10 @@ After writing the document, report to the user in the chat response:
 - **G2 gaps** — ADRs missing **Context** or **Decision**. Alternatives
  and Consequences are optional; their absence is NOT a gap.
 - **G3 gaps** — ADR cross-references without a back-reference.
  Only flag when the referencer's ADR number is **less than** the
  referenced ADR's number (older → newer). Newer ADRs citing older
  infrastructure ADRs (higher number → lower number) are expected to
  be one-way and are NOT flagged.
 - **G4 suggestions** — areas where an ADR seems missing based on the
  ADR corpus + SPEC reading. Phrase as suggestions, not findings. Each
  G4 item must say *why* it's suggested and remain falsifiable.
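The G3 direction rule can be sketched as follows. This is a minimal illustration, not the /report skill's implementation: `adr_number`, `g3_gaps`, and the sample `refs` table (with made-up ADR numbers) are hypothetical names introduced here.

```python
import re


def adr_number(adr_id: str) -> int:
    """Extract the numeric part of an ID such as 'ADR-0038'."""
    match = re.fullmatch(r"ADR-(\d{4})", adr_id)
    if match is None:
        raise ValueError(f"not an ADR id: {adr_id!r}")
    return int(match.group(1))


def g3_gaps(refs: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return (referencer, referenced) pairs that the G3 rule would flag.

    refs maps each ADR id to the set of ADR ids it cites.
    """
    gaps = []
    for src in sorted(refs):
        for dst in sorted(refs[src]):
            older_cites_newer = adr_number(src) < adr_number(dst)
            has_back_link = src in refs.get(dst, set())
            if older_cites_newer and not has_back_link:
                gaps.append((src, dst))
    return gaps


# Made-up corpus, for illustration only.
refs = {
    "ADR-0001": {"ADR-0003"},  # older -> newer, no back-link: flagged
    "ADR-0002": {"ADR-0003"},  # older -> newer, back-link exists: fine
    "ADR-0003": {"ADR-0002"},
    "ADR-0004": {"ADR-0001"},  # newer -> older: expected one-way
}
flagged = g3_gaps(refs)
```

Only the older-to-newer reference without a back-link ends up in `flagged`; the newer-to-older citation is ignored by construction.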
@@ -99,7 +103,10 @@ For each `docs/adr/ADR-NNNN-*.md`:
 - Record presence/absence of **Context** and **Decision** for G2.
  Alternatives and Consequences presence is recorded for use during
  authoring, but their absence is not a gap.
- Record ADR-NNNN cross-references for G3.
+- Record ADR-NNNN cross-references for G3, preserving the direction
  (referencer → referenced). G3 evaluation uses ADR numbers to
  distinguish older→newer (flagged when missing back-link) from
  newer→older (not flagged; see *Output Contract* G3).
 - Record Status (e.g., Accepted, Superseded, Draft) and any "supersedes
  ADR-NNNN" text in the body for G5a.
@@ -263,9 +270,11 @@ In **dry-run mode**, replace the `Wrote:` line with:
 - ADR-NNNN: missing <Context|Decision>
 - (or "none")
-**G3 — Broken cross-references**
+**G3 — Broken cross-references** (older → newer only)
- ADR-NNNN cites ADR-MMMM; ADR-MMMM does not back-reference
+- ADR-NNNN cites ADR-MMMM (NNNN < MMMM); ADR-MMMM does not back-reference
 - (or "none")
 - Note: newer ADRs citing older infrastructure ADRs (NNNN > MMMM) are
  not flagged here — one-way references are the expected pattern.
 **G4 — Suggested topics that may warrant a new ADR (verify before acting)**
 - <topic>: <why agent thinks it may be missing — must be falsifiable>
@@ -0,0 +1,133 @@
 # ADR-0038: PCIE_EP Component Model
 ## Status
 Accepted (2026-05-20).
 Companion to ADR-0035 (M_CPU), ADR-0036 (IO_CPU), and ADR-0037 (Forwarding);
 a component-level ADR in the same vein.
 ## First action
 Pull one Transaction from `_inbox` and call `run()` via `_forward_txn`, which
 applies the PCIe protocol-handling delay as an `env.timeout()` of
 `node.attrs["overhead_ns"]`. From that point on, the forwarding rules defined by
 the generic `ComponentBase` worker apply (if there is a next hop,
 `out_ports[next_hop].put(...)`; otherwise consume `drain_ns` and call
 `txn.done.succeed()`).
 In other words, **the only thing PCIE_EP does first is "express PCIe protocol
 overhead as time"**; it makes no further decisions such as routing, payload
 transformation, or MMIO decoding.
 ## Context
 PCIE_EP serves as the **boundary point between the host and the device** in the
 topology graph. The builder (`topology/builder.py`) creates an IO chiplet
 instance per SIP, places `pcie_ep`, `io_cpu`, and `io_noc` inside it, and then
 lays bidirectional edges between the external host-side cross-SIP switch and
 `pcie_ep`:
 - `switch → pcie_ep`: host → device traffic (MemoryWrite, MemoryRead, KernelLaunch).
 - `pcie_ep → switch`: device-side outbound traffic (e.g., cross-SIP IPCQ tokens).
 Inside the IOChiplet, a bidirectional `pcie_ep ↔ io_noc` edge is laid, and the
 next hop branches to `io_cpu` or to the cube-side hbm_ctrl path (see the
 ADR-0036 IO_CPU model).
 The router and resolver already honor the contract required by SPEC R7,
 "PCIE_EP is the endpoint for memory operations", so helpers such as
 `find_pcie_ep(sip)` and `find_memory_path(pcie_ep, dst_node)` start from PCIE_EP.
 The problem is that all of these dependencies live on the builder/router/resolver
 side, while **no ADR specifies PCIE_EP's own internal model**. As a result:
 - "Which latency does PCIE_EP model?" can only be answered by reading the code.
 - There is an asymmetry with other components (IO_CPU=ADR-0036, M_CPU=ADR-0035).
 - The rationale for whether to refine the PCIe link-layer model later
  (e.g., TLP credits, retry) is scattered.
 This ADR explicitly pins down the current **thin PCIE_EP model** and records
 that it is an intentional simplification (aligned with the ADR-0033 latency-model
 simplification policy).
 ## Decision
 ### D1. PCIE_EP uses the generic ComponentBase forwarding worker as-is
 `PcieEpComponent` inherits from `ComponentBase` and does not override
 `_worker`/`_forward_txn`. Every Transaction is therefore processed in this order:
 1. `_fan_in` loads incoming messages (or Transactions reassembled from Flits)
   into `_inbox`.
 2. `_worker` takes one item from `_inbox` and forks it with
   `env.process(self._forward_txn(env, txn))` (per-message pipelining).
 3. `_forward_txn` calls, in order: op_log start hook → `run()` delay → op_log
   end hook.
 4. `run()` is a single line: `yield env.timeout(overhead_ns)`.
 5. If there is a next hop, `out_ports[next_hop].put(txn.advance())`; if not (the
   Transaction arrived at its terminal), consume `drain_ns` and then call
   `txn.done.succeed()`.
 ### D2. `overhead_ns` is PCIE_EP's only timing model
 Only `node.attrs["overhead_ns"]` is recognized as a latency parameter. The code
 default is `0.0`, and the IOChiplet `components.pcie_ep.attrs` in `topology.yaml`
 sets the actual value (current topology: `overhead_ns: 5.0` ns).
 There is no separate BW-serialization resource (simpy.Resource), queue depth, or
 retry model. Link-level BW serialization is handled on the wire side: inside the
 IOChiplet the `pcie_ep_to_noc_bw_gbs = 256.0 GB/s` link applies, and outside it
 the system's `io_ep_to_switch` link BW applies (ADR-0015 port/wire model). The
 PCIE_EP component itself takes no part in this BW accounting.
 ### D3. PCIE_EP is aware of bidirectional use but does not change behavior by direction
 The topology builder lays bidirectional `switch ↔ pcie_ep` and
 `pcie_ep ↔ io_noc` edges. PCIE_EP therefore handles:
 - inbound (host→device): a Transaction arriving from the switch is forwarded
  toward io_noc via the next-hop computation.
 - outbound (device→host): a Transaction arriving from io_noc/io_cpu is forwarded
  toward the switch.
 Both cases are handled by the generic forwarding worker of D1, and the component
 code itself does not distinguish direction (it only follows `txn.next_hop`).
 ### D4. PCIE_EP is not Flit-aware (legacy reassembly path)
 `_FLIT_AWARE` is not set to `True`. `_fan_in` therefore reassembles Flits that
 were chunkified upstream into their parent Transaction and puts it into `_inbox`
 (aligned with the ADR-0033 Phase 2c incremental rollout policy).
 D4 will be re-evaluated if PCIE_EP is later extended with a PCIe TLP-level
 credit model.
 ### D5. PCIE_EP is a **named node** for routing helpers
 In `policy/routing/router.py`, `find_pcie_ep(sip, io_id="io0")`,
 `find_all_pcie_eps()`, and `find_memory_path(pcie_ep, dst_node)` treat PCIE_EP as
 the start (or end) of a memory path. The component body provides no information
 to these helpers; the naming convention (`sip{S}.{io_id}.pcie_ep`) is guaranteed
 by the topology builder.
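The naming convention that the router helpers rely on can be illustrated with a small sketch. `pcie_ep_node_id` and `is_pcie_ep_node` are hypothetical helpers written for this example; they are not functions from `policy/routing/router.py` or the topology builder.

```python
def pcie_ep_node_id(sip: int, io_id: str = "io0") -> str:
    """Build a node ID following the sip{S}.{io_id}.pcie_ep convention."""
    return f"sip{sip}.{io_id}.pcie_ep"


def is_pcie_ep_node(node_id: str) -> bool:
    """Check whether a node ID has the three-part PCIE_EP shape."""
    parts = node_id.split(".")
    return (
        len(parts) == 3
        and parts[0].startswith("sip")
        and parts[0][3:].isdigit()
        and parts[2] == "pcie_ep"
    )


node = pcie_ep_node_id(0)
```

A change to the ID format would have to be mirrored in every place that builds or matches such strings, which is the impact D5 makes visible.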
 ## Alternatives Considered
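 ### A1. PCIe TLP-level model (credit, retry, MPS splitting)
 Rejected. It conflicts with the simplification principle stated in ADR-0033,
 "the current latency model is expressed as abstract overhead + BW
 serialization". Host↔device protocol fidelity is intentionally out of scope
 under SPEC §5 "Non-Goals".
 ### A2. Limit in-flight traffic with a dedicated simpy.Resource in PCIE_EP
 Rejected. Host traffic is not a contention bottleneck in current workloads. If
 it becomes necessary, introduce it in a separate ADR (for compatibility, keep D1
 as-is and extend D2).
 ### A3. Merge PCIE_EP into IO_CPU
 Rejected. PCIE_EP is the first protocol-boundary node encountered from the host
 side, while IO_CPU is the device-side control-plane processing node (ADR-0036).
 Decision costs such as traffic fan-out and command decoding are concentrated in
 IO_CPU, so it makes sense for PCIE_EP to express only link-edge overhead.
 Merging them would mix the two responsibilities, against the spirit of ADR-0007
 (runtime API/sim_engine boundary).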
 ## Consequences
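 - PCIE_EP gains an explicit model ADR while having almost no code of its own:
  more consistency, lower maintenance cost.
 - If PCIe-level refinement becomes necessary, a new ADR that extends D2/D4 will
  supersede this one.
 - D5 makes explicit that router helpers such as `find_memory_path` depend on
  PCIE_EP as a named node, so the impact of changing the component ID naming
  convention is clear.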
@@ -0,0 +1,194 @@
 # ADR-0039: PE_MMU Component Model (dual role: component + utility)
 ## Status
 Accepted (2026-05-20).
 ADR-0011 (PA/VA/LA address model) only declares, in its VA model, that "PE_MMU
 translates VA→PA"; this ADR separately pins down **the behavioral model of the
 PE_MMU component itself**.
 ## First action
 At construction time, read `node.attrs["page_size"]` (default `2 MiB`) and
 `node.attrs["tlb_overhead_ns"]` (default `0.0`) and instantiate the internal
 `PeMMU` object (`policy.address.pe_mmu.PeMMU`) exactly once. This object is the
 single owner of the page table, the sub-page region lists, and the TLB overhead.
 At runtime, the first action splits into two paths:
 - **Component path (inbox consumption)**: `_worker` takes one Transaction from
  `_inbox`; if its `request` is an `MmuMapMsg`, it calls
  `self._mmu.map(va, pa, size)` for each entry and then `txn.done.succeed()`.
  For an `MmuUnmapMsg` it calls `unmap(va, size)`; any other type is handed to
  the standard `_forward_txn`. That is, **the MMU's first job is to "apply
  map/unmap commands to the page table"**.
 - **Utility path (direct call)**: engines inside the same PE, such as PE_DMA /
  PE_GEMM, call `pe_mmu.mmu.translate(va)` directly. No SimPy event is generated
  on this path; the caller handles `yield env.timeout(mmu.overhead_ns)` in its
  own process (when overhead_ns > 0).
 ## Context
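 ADR-0011 defines the three address models PA/VA/LA and agrees only that "VA
 model = translation through PE_MMU". In the code, however, `PeMmuComponent`
 plays two complementary roles at once:
 1. **A component on the topology graph**: it receives `MmuMapMsg` /
   `MmuUnmapMsg` sideband messages from the cube NoC and updates the page table.
 2. **A PE-local utility object**: PE_DMA / PE_GEMM of the same PE call
   `translate(va)` directly with zero latency (or with only `overhead_ns`, paid
   on the caller side).
 No ADR covers both roles, which leaves the following ambiguities:
 - "Why does MMU translation not show up as a SimPy event?" (It is actually
  accounted for on the caller side.)
 - What is the sub-page region model, and why that model? (It is in the code
  docstring but has no ADR; only a reference to a memory note called
  `project_mmu_subpage_stopgap` exists.)
 - **From whom** do map/unmap messages arrive, and **by when** must the update
  be applied (the ordering contract)?
 In addition, `PeMMU.map()` has "later append, last-write-wins (reverse scan)"
 semantics. This is a **stopgap** added deliberately for DPPolicy sub-page
 sharding scenarios (e.g., 128B payloads × 4KB pages) that a simple single-PA
 page-table model cannot express. The ADR needs to pin down that this is a
 simplification that differs from a real HW MMU.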
 ## Decision
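 ### D1. Making the dual role explicit: component and utility
 `PeMmuComponent` exposes two interfaces within a single class:
 - Component interface: `_inbox` consumption and the `_worker` loop (handling
  MMU sideband messages).
 - Utility interface: the `pe_mmu.mmu` attribute exposes the underlying `PeMMU`
  object; PE_DMA / PE_GEMM hold this object directly and call `translate()`.
 The latter is **not a layer skip**: the inside of a PE is a set of siblings
 within the single "components" layer defined by ADR-0007, and a direct call on
 the PE_MMU object obtained from the same PE prefix is not cross-layer. A
 cross-layer violation applies only when a call crosses the runtime API /
 sim_engine / components boundaries.
 ### D2. Latency model: `translate()` is a pure function; overhead is the caller's responsibility
 `PeMMU.translate()` is a pure function and never performs a SimPy yield. After
 the translation, the caller (the PE engine) issues
 `if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`
 in its own process.
 Rationale: the PE engine's SimPy process already holds its own record_start /
 record_end (op_log) hooks and can capture timing consistently. If the MMU
 created a separate process, it would split the PE engine's processing flow in
 two and blur the meaning of op_log/pipeline overlap.
 #### D2.1. Asymmetry in the current implementation: pipeline vs non-pipeline (Known asymmetry)
 At the time of writing, the `pe_dma.py` implementation handles overhead
 differently on its two call paths:
 - **non-pipeline (`handle_command`)**: issues
  `if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`
  right after `translate()`.
 - **pipeline (`_do_pipeline_dma`)**: only calls `translate()` and **omits** the
  overhead timeout. The function comment says "same logic as non-pipeline path",
  but the two do not actually match.
 Because the default topology has `tlb_overhead_ns = 0.0`, the difference does
 not show up directly in timing. In simulations configured with
 `tlb_overhead_ns > 0`, however, GEMM/Math on the pipeline path is measured as
 faster than the same workload on the non-pipeline path by the amount of the MMU
 overhead.
 The D2 contract is that "**every** caller is responsible for the overhead", and
 the omission on the pipeline path is an **implementation inconsistency, not an
 intended design**. Nothing in ADR-0014 D6 (pipeline self-routing) states that
 this overhead is waived.
 Remediation options (a separate Phase 1/2 proposal is needed):
 - (a) Add `if mmu.overhead_ns > 0: yield env.timeout(...)` to `_do_pipeline_dma`
  as well, matching the D2 contract. Recommended.
 - (b) Narrow the D2 contract to "applies only to the non-pipeline path" and
  justify the pipeline-path exemption together with an update to ADR-0014 D6.
  Not recommended, because it weakens the meaning of the overhead.
 This ADR recommends (a) and assumes it will be corrected in a separate small
 change before or right after acceptance.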
 ### D3. Page-table structure: sub-page region lists (stopgap)
 The structure
 `self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]]`
 allows one page to hold several disjoint regions.
 - `map(va, pa, size)`: when the range crosses pages, it **appends** the
  resulting regions.
 - `translate(va)`: fetches the region list by VPN, walks it in **reverse**, and
  takes the first matching region (last-write-wins).
 - `unmap(va, size)`: removes only regions whose extent is **fully contained** in
  the unmap range. Partial overlaps with misaligned boundaries are left in
  place; the mapping caller is responsible for unmapping on the same boundaries
  as the mapping.
 This is stated as a **simulator stopgap** that differs from a real HW MMU,
 supplementing the ADR-0011 VA model. Its purpose is to prevent silent misrouting
 caused by last-write-wins overwrites under DPPolicy sub-page sharding (memory
 note: project_mmu_subpage_stopgap).
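The D3 region-list semantics can be sketched in plain Python. This is an illustrative model only: `SubPageTable` is a hypothetical stand-in written for this example, not the real `PeMMU` class, and the per-page splitting in `map` is an assumption about how a page-crossing range is handled.

```python
class PageFault(Exception):
    """Raised when no region covers the requested virtual address."""


class SubPageTable:
    def __init__(self, page_size: int = 4096) -> None:
        self.page_size = page_size
        # vpn -> list of (start_in_page, end_in_page, pa_at_offset_zero)
        self._table: dict[int, list[tuple[int, int, int]]] = {}

    def _chunks(self, va: int, size: int):
        """Yield (vpn, start, end, consumed) for each page the range touches."""
        consumed = 0
        while consumed < size:
            vpn, start = divmod(va + consumed, self.page_size)
            end = min(self.page_size, start + (size - consumed))
            yield vpn, start, end, consumed
            consumed += end - start

    def map(self, va: int, pa: int, size: int) -> None:
        for vpn, start, end, consumed in self._chunks(va, size):
            # Store the PA that page offset 0 would translate to.
            pa_at_zero = pa + consumed - start
            self._table.setdefault(vpn, []).append((start, end, pa_at_zero))

    def translate(self, va: int) -> int:
        vpn, offset = divmod(va, self.page_size)
        # Reverse scan: the most recently appended matching region wins.
        for start, end, pa_at_zero in reversed(self._table.get(vpn, [])):
            if start <= offset < end:
                return pa_at_zero + offset
        raise PageFault(hex(va))

    def unmap(self, va: int, size: int) -> None:
        for vpn, start, end, _ in self._chunks(va, size):
            regions = self._table.get(vpn, [])
            # Drop only regions fully contained in the unmapped extent.
            self._table[vpn] = [
                r for r in regions if not (start <= r[0] and r[1] <= end)
            ]


table = SubPageTable(page_size=4096)
table.map(0x1000, 0x80000, 128)   # first 128B shard of the page
table.map(0x1080, 0x90000, 128)   # second shard goes to a different PA
```

Two 128B shards inside one 4KB page resolve to different physical addresses, which a single-PA-per-page table could not express.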
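 ### D4. PageFault is a PA-fallback signal
 Calling `translate()` with an unmapped VA raises `PageFault`. PE_DMA catches
 this exception and **uses the original address as a PA as-is** (the PA-fallback
 compatibility path of ADR-0011). PageFault is therefore not an error but a
 signal meaning "when no VA mapping exists, interpret the address as a PA".
 This compatibility path is intentional behavior that preserves backward
 compatibility with the PA-only mode agreed in ADR-0011.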
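The fallback pattern in D4 might look like the sketch below on the caller side. `resolve_pa` and `make_translate` are hypothetical helpers for illustration, and `PageFault` here is a local stand-in for the real exception class; none of this is the actual `pe_dma.py` code.

```python
class PageFault(Exception):
    """Local stand-in for the exception raised by translate()."""


def make_translate(mappings: dict[int, int]):
    """Build a toy translate() over an exact-address mapping table."""
    def translate(va: int) -> int:
        if va not in mappings:
            raise PageFault(hex(va))
        return mappings[va]
    return translate


def resolve_pa(translate, addr: int) -> int:
    """Translate addr, treating PageFault as 'addr is already a PA'."""
    try:
        return translate(addr)
    except PageFault:
        return addr


translate = make_translate({0x1000: 0x80000})
mapped = resolve_pa(translate, 0x1000)    # VA with a mapping
fallback = resolve_pa(translate, 0x2000)  # no mapping: used as a PA
```

The `except` branch is the normal flow in PA-only mode, so it deliberately does not log or re-raise.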
 ### D5. Receive contract for MMU sideband messages
 `MmuMapMsg` / `MmuUnmapMsg` reach the PE_MMU component's `_inbox` through the
 fabric (as R10 states, "MMU map installation follows fabric latency"). The
 message schema is defined by the runtime API (`runtime_api/kernel.py`); the
 current format is:
 - `MmuMapMsg.entries: tuple[dict, ...]`: each dict has the keys
  `{"va": int, "pa": int, "size": int}`.
 - `MmuUnmapMsg.entries: tuple[dict, ...]`: each dict has the keys
  `{"va": int, "size": int}`.
 Receive handling on the PE_MMU side:
 1. `_worker` takes one message from `_inbox.get()`.
 2. It checks `hasattr(msg, "request")` to see whether the message is a
   Transaction wrapper.
 3. If `isinstance(msg.request, MmuMapMsg)`, it calls
   `self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"])` for each entry.
 4. If `isinstance(msg.request, MmuUnmapMsg)`, it calls
   `self._mmu.unmap(va=e["va"], size=e["size"])` for each entry.
 5. In both cases it signals completion with `msg.done.succeed()`.
 When the external caller (the runtime API side) awaits `done`, "the point at
 which the mapping is installed on the device" is guaranteed in SimPy time; this
 wait is what realizes the ADR-0011 requirement that "MMU map installation
 incurs measured fabric latency".
 This ADR does not define the **sender or the fan-out policy** of sideband
 messages; that is the runtime API's responsibility. This ADR specifies only the
 receive contract on the PE_MMU side.
 ### D6. Non-MMU Transactions are delegated to generic forwarding
 If the `request` of a message taken from the inbox by `_worker` is neither
 `MmuMapMsg` nor `MmuUnmapMsg` (or the message has no `request` attribute), it is
 handed to `_forward_txn`. This keeps open the possibility of PE_MMU being used
 as a pass-through node on the cube-internal NoC in the future (there is no such
 pass-through traffic today, but this is safe against topology changes).
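The D5/D6 dispatch can be sketched without SimPy. The message classes below are simplified local stand-ins that carry only the `entries` field described above, and `RecordingMmu` and `apply_sideband` are hypothetical names for this example; the real worker also completes `done` and forwards unhandled messages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MmuMapMsg:
    entries: tuple  # each entry: {"va": int, "pa": int, "size": int}


@dataclass(frozen=True)
class MmuUnmapMsg:
    entries: tuple  # each entry: {"va": int, "size": int}


class RecordingMmu:
    """Toy MMU that records calls instead of editing a page table."""

    def __init__(self) -> None:
        self.calls = []

    def map(self, va: int, pa: int, size: int) -> None:
        self.calls.append(("map", va, pa, size))

    def unmap(self, va: int, size: int) -> None:
        self.calls.append(("unmap", va, size))


def apply_sideband(mmu, request) -> bool:
    """Apply a sideband request; return False if it should be forwarded."""
    if isinstance(request, MmuMapMsg):
        for e in request.entries:
            mmu.map(va=e["va"], pa=e["pa"], size=e["size"])
        return True
    if isinstance(request, MmuUnmapMsg):
        for e in request.entries:
            mmu.unmap(va=e["va"], size=e["size"])
        return True
    return False  # not an MMU message: generic forwarding applies


mmu = RecordingMmu()
handled_map = apply_sideband(
    mmu, MmuMapMsg(entries=({"va": 0x1000, "pa": 0x80000, "size": 128},))
)
handled_unmap = apply_sideband(
    mmu, MmuUnmapMsg(entries=({"va": 0x1000, "size": 128},))
)
handled_other = apply_sideband(mmu, object())
```

Returning a boolean keeps the "fall through to forwarding" decision in the caller, mirroring how the worker hands non-MMU traffic to `_forward_txn`.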
 ## Alternatives Considered
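 ### A1. Make translate() a SimPy generator
 Rejected. As explained in D2, it blurs the meaning of the PE engine's
 op_log/pipeline overlap. The current pattern, where the caller raises the
 timeout, matches op_log accounting.
 ### A2. Shrink the page size itself (e.g., 128B) instead of using sub-page region lists
 Rejected. It causes page-table memory blow-up and cube-wide map-message size
 blow-up. Even if DPPolicy sharding requires 128B, the vast majority of other
 mappings are in 2MiB units, so a small page size inflates the average cost.
 ### A3. Keep PE_MMU only as a built-in helper of PE_CPU rather than a component
 Rejected. Expressing "MMU map installation with latency measured through the
 fabric" (the MmuMapMsg path), as required by ADR-0011, needs a node on the
 topology graph. PE_MMU must also appear as a node in the cube NoC visualizer
 for debugging and diagnosis to stay consistent.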
 ## Consequences
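 - The dual role of PE_MMU (component + utility) is justified at the ADR level,
  providing an argument against future refactoring pressure ("unify into one of
  the two").
 - The ADR states that the sub-page region model is a simulator stopgap, which
  becomes the basis for evaluating whether the stopgap can be removed when the
  LA model (ADR-0011) is introduced.
 - The contract that `translate()` does not yield is fixed by the ADR, so a
  future proposal to "put a timeout inside the MMU" can be declined on the basis
  of D2.
 - PA fallback (D4) is stated to be the normal flow, which prevents anyone from
  mistaking PageFault for an error and adding defensive logic.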
@@ -0,0 +1,142 @@
 # ADR-0040: PE_TCM Component Model (dual-channel BW serialization)
 ## Status
 Accepted (2026-05-20).
 ADR-0014 (PE Pipeline Execution Model) mentions that "PE_TCM is a BW-based
 serializing scratchpad memory" (D1); this ADR separately specifies the exact
 behavioral model of the TCM component itself.
 ## First action
 As soon as `start()` is called, it creates two `simpy.Resource(env, capacity=1)`
 objects and stores them in `self._read_res` / `self._write_res`. These two
 resources are the single decision point that serializes the **read channel**
 and the **write channel** to one in-flight request each.
 First runtime action: `_worker` takes one message from `_inbox` and branches on
 its type:
 - `TcmRequest` (from `pe_fetch_store`): forked with
  `env.process(self._handle_tcm_request)`. That is, **the TCM's first job is to
  "acquire the channel lock that matches the direction (read/write)"**. After
  acquiring the lock, if `bw > 0 and nbytes > 0` it performs an `env.timeout`
  of `delay_ns = nbytes / bw`, then `req.done.succeed()`.
 - Anything else (Transaction): forked with `env.process(self._forward_txn)`
  (legacy fabric pass-through path).
 At construction time, `node.attrs["read_bw_gbs"]` / `node.attrs["write_bw_gbs"]`
 (default `512.0 GB/s` each) are read and stored.
 ## Context
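 In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives two kinds of traffic:
 1. **`TcmRequest` from PE_FETCH_STORE to PE_TCM**: for TCM ↔ Register File
   transfers, PE_FETCH_STORE sends a short sideband request
   (`direction = "read"` or `"write"`, `nbytes`, a `done` event) to obtain the
   access latency serialized by the TCM's BW.
 2. **Legacy Transaction forwarding**: a generic forwarding path kept in case the
   TCM is picked as a pass-through node on the topology graph (not used on the
   current critical path, but preserved).
 Problem: ADR-0014 only says that "PE_TCM is BW-based serialization". The code,
 however, explicitly has the following:
 - **Reads and writes are separate channels and can proceed concurrently**;
  requests in the same direction are serialized with cap=1.
 - BW can be configured separately through `read_bw_gbs` / `write_bw_gbs`.
 - The formula `delay_ns = nbytes / bw_gbs` (unit shorthand: GB/s × ns ≈ B).
 - When nbytes==0 the BW term is skipped, but the channel lock is still acquired.
 - `run()` yields for `overhead_ns` (default 0.0), but this is used only on the
  legacy fabric path (Transaction forwarding).
 All of this needs to be pinned down in a separate ADR. In particular, "why are
 read and write separate channels?" and "who decides the BW?" need a clear
 rationale for when someone later tries to change things, e.g., to capacity=2.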
 ## Decision
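 ### D1. Dual channel: read and write are independent resources
 `_read_res = simpy.Resource(env, capacity=1)`,
 `_write_res = simpy.Resource(env, capacity=1)`.
 Concurrent requests in the same direction are serialized in the resource queue,
 while requests in different directions can proceed concurrently.
 This matches a model in which real HW operates the TCM as dual-ported (read
 port + write port). The separation is intentional, so that the normal case in
 the GEMM pipeline where fetch (read) and store (write) overlap in time can be
 expressed in the BW-serialization model.
 ### D2. Per-channel BW model: `nbytes / bw_gbs`
 After acquiring the channel lock, if `nbytes > 0 and bw > 0`, the handler does
 `yield env.timeout(nbytes / bw_gbs)`.
 The unit shorthand is GB/s × ns ≈ B, the same BW formula used throughout the
 simulator (see ADR-0033: the simulator uses a consistent shorthand unit).
 - `nbytes == 0`: the BW term is 0, but the lock is acquired and released
  immediately. This case is intentional: even when a plan generator that emits
  empty fetch/store operations has PE_FETCH_STORE send a request with only
  `nbytes` set to 0, the TCM-side op_log / channel accounting is still consumed
  exactly once, consistently.
 - `bw == 0` (configuration mistake): the timeout call itself is skipped, so it
  is a 0-time pass. This does not occur with a normal configuration.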
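The combined effect of D1 and D2 can be worked through with a small SimPy-free model. It assumes requests are listed in arrival order and replaces the two `simpy.Resource` queues with a per-channel "free at" timestamp; `Req` and `completion_times` are hypothetical names, and rejecting unknown directions with `ValueError` is a choice made for this sketch, not documented behavior.

```python
from dataclasses import dataclass


@dataclass
class Req:
    direction: str          # "read" or "write"
    nbytes: int
    arrival_ns: float = 0.0


def completion_times(reqs, read_bw_gbs=512.0, write_bw_gbs=512.0):
    """Return the completion time (ns) of each request, in input order."""
    bw = {"read": read_bw_gbs, "write": write_bw_gbs}
    free_at = {"read": 0.0, "write": 0.0}   # one in-flight request per channel
    done = []
    for r in reqs:
        if r.direction not in bw:
            raise ValueError(f"unknown direction: {r.direction!r}")
        start = max(r.arrival_ns, free_at[r.direction])
        channel_bw = bw[r.direction]
        # GB/s x ns ~= bytes, so nbytes / bw_gbs is a delay in ns.
        delay = r.nbytes / channel_bw if r.nbytes > 0 and channel_bw > 0 else 0.0
        free_at[r.direction] = start + delay
        done.append(start + delay)
    return done


times = completion_times([
    Req("read", 1024),   # 2.0 ns on the read channel
    Req("read", 512),    # waits for the first read, then 1.0 ns
    Req("write", 512),   # independent channel, overlaps with the reads
    Req("read", 0),      # takes the channel in order but adds no delay
])
```

The two reads finish back to back while the write completes in parallel, which is the overlap D1 is meant to preserve.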
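 ### D3. BW is configured via `read_bw_gbs` / `write_bw_gbs` in `node.attrs`
 Default `512.0 GB/s`. The topology builder (`topology/builder.py`) passes these
 attrs when it instantiates the TCM from `pe_template`. Changing the default must
 go together with a decision on the ADR-0014 D1 or ADR-0033 latency-model side.
 ### D4. PE_TCM owns the TcmRequest schema
 `@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
 is defined in `components/builtin/pe_tcm.py`. PE_FETCH_STORE only imports this
 dataclass to create and send instances. The schema is not defined on the caller
 side because:
 - The meaning of BW serialization is the TCM's responsibility: the TCM decides
  which fields feed the serialization decision.
 - Validating that the `direction` string is narrowed to `"read"` / `"write"` is
  also handled on the TCM side (the if/else branch in `_handle_tcm_request`).
 ### D5. Preserving the legacy Transaction forwarding path
 When `_worker` receives a message that is not a `TcmRequest`, it sends it to
 `_forward_txn`, where the `overhead_ns` of `run()` applies. In the current
 standard PE pipeline the TCM is not picked as a pass-through node for
 Transactions, but the path is preserved for future changes to the fabric
 topology (orthogonal to the usage pattern of D1).
 On the op_log side this path is recorded as ordinary Transaction accounting,
 and it does not acquire a BW channel lock.
 ### D6. PE_TCM is not a data store of its own (timing only)
 The TCM models **time only**. The actual data payload is held by sim_engine's
 separate `memory_store` (if present), and the TCM component does not update it.
 PE_FETCH_STORE likewise only obtains the BW delay through TcmRequest and handles
 the actual register contents through a separate path (ADR-0020 2-pass data
 execution model; data is processed in Phase 2).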
 ## Alternatives Considered
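 ### A1. Single channel (read+write sharing capacity=2)
 Rejected. It would artificially serialize the normal case where fetch (read)
 and store (write) overlap in time, mis-modeling the BW upper bound of the PE
 pipeline.
 ### A2. Channel capacity > 1 (e.g., 2-banked TCM)
 Rejected. The current HW model assumes a single bank. Extending to multiple
 banks requires a separate ADR, which would then supersede D1. Raising capacity
 at this stage would leave the BW upper bound unchanged while only loosening the
 nominal serialization, reducing model accuracy.
 ### A3. Generalize the BW formula to `nbytes / bw + overhead_ns`
 Rejected. `overhead_ns` is used only on the legacy forwarding path of D5. If the
 fetch/store critical path needs additional overhead, it belongs in
 PE_FETCH_STORE's `run()` or in the register-file access model rather than in
 the TCM, which is a better fit for the responsibility boundary.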
 ## Consequences
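 - TCM BW accounting is fixed at the ADR level, so questions that come up when
  interpreting the op_log of GEMM/Math sweeps, such as "why did fetch and store
  proceed concurrently?" or "why is only the same direction serialized?", are
  quickly resolved by D1.
 - The impact of a future multi-bank TCM or a change to the asymmetric
  read/write BW model becomes clear (which of D1, D2, D3 is modified).
 - It is explicit that the TCM is not a data store (D6), which firms up the
  responsibility boundary with ADR-0020 2-pass execution.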
@@ -0,0 +1,187 @@
 # ADR-0041: Cube SRAM Component Model — terminal scratchpad on cube NoC
 ## Status
 Accepted (2026-05-20).
 This ADR supplements ADR-0017 (Cube NOC and HBM Connectivity), which only
 mentions that SRAM exists as an attachment on the cube NoC, by specifying the
 latency/response model of the SRAM component itself.
 ## First action
 Right after `_worker` takes one Transaction from `_inbox`, the very first thing
 it does is call `yield from self.run(env, txn.nbytes)`, which raises an
 `env.timeout()` of `node.attrs["overhead_ns"]` (default `0.0`).
 That is, **the SRAM's first job is to "express access overhead as time"**.
 After the overhead is consumed, it yields `drain_ns` (the terminal
 BW-serialization cost assigned to that Transaction) and then creates a
 `ResponseMsg` and sends it along the reverse path.
 This differs from the generic `ComponentBase._worker`: SRAM knows it is a
 **terminal node**, so its own worker spells out the order `run → drain →
 _send_response` without going through `_forward_txn`.
 ## Context
 The cube topology (`topology/builder.py`) creates the following named nodes per
 cube:
 - `sip{S}.cube{C}.m_cpu`
 - `sip{S}.cube{C}.sram`
 - `sip{S}.cube{C}.hbm_ctrl` (one partition per PE)
 - `sip{S}.cube{C}.pe{P}` (PE-internal sub-components)
 SRAM is one of the attachments on the cube NoC and is attached to the nearest
 router (`topology/mesh_gen.py` determines the nearest router from the placement
 coordinates and adds it to `attach`). The builder lays a bidirectional
 `sram ↔ router` edge (BW: `sram_to_router_bw_gbs`, default `128.0 GB/s`).
 SRAM has two key roles:
 1. **Fabric terminal**: the endpoint of memory-access Transactions directed at
   SRAM on the cube NoC. SRAM consumes the access overhead and the drain, then
   returns a response along the reverse path.
 2. **One of the IPCQ slot tiers**: among `buffer_kind ∈ {tcm, sram, hbm}`
   defined by ADR-0023 D9.7, the slot bw/overhead of the `sram` tier is looked
   up in `common/ipcq_types._BUFFER_KIND_BW` (current value
   `(512.0 GB/s, 2.0 ns)`). This value is separate from the `overhead_ns` of the
   SRAM node attrs, and PE_DMA converts it into time at IPCQ slot-accounting
   time.
 Both roles are fulfilled by a single SRAM component, and without a dedicated
 ADR:
 - "Which latency does SRAM model?" (fabric drain + overhead, or the IPCQ tier's
  slot latency?) has a scattered answer.
 - It is unclear what the SRAM size (`size_mb`) attr will actually mean in the
  future. The current code does not use size and models timing only.
 - The rationale for which cube router SRAM attaches to (placement-based) exists
  only inside the topology code.
 ## Decision
 ### D1. SRAM is a terminal scratchpad node on the cube NoC
 `SramComponent` inherits from `ComponentBase` but overrides `_worker` to express
 terminal semantics directly:
 ```
 while True:
    txn = yield self._inbox.get()
    yield from self.run(env, txn.nbytes)     # overhead_ns
    if drain_ns > 0: yield env.timeout(drain_ns)
    yield from self._send_response(env, txn)
 ```
 Because SRAM needs to know the reverse path in this pattern, it requires its
 own worker rather than the generic `_forward_txn` (forward to the next hop).
 #### D1.1. Currently unused: the `_worker` override is a dormant path
 In the codebase at the time of writing, **no component actually sends a
 Transaction to the SRAM node**. The confirmed places that reference the SRAM
 node ID are:
 - Routing helpers such as `policy/routing/router.py`: they only guarantee that
  a path can be looked up.
 - `components/builtin/pe_dma.py::_handle_ipcq_inbound`: when an IPCQ slot has
  `buffer_kind == "sram"`, it looks up only the *path* of
  `bank_node = f"{cube_prefix}.sram"`, converts it with
  `compute_drain_ns(path, ...)`, and **times out locally**. The Transaction
  itself does not flow to the SRAM node (see D4).
 - `tests/test_routing.py`: verifies connectivity only, via
  `find_path("sip0.cube0.pe0", "sip0.cube0.sram")`.
 The `_worker`/`_send_response` override is therefore a **dormant code path**.
 It is kept rather than deleted because:
 - It is immediately usable if a topology change makes SRAM the endpoint of real
  fabric Transactions (e.g., explicit M_CPU → SRAM access).
 - Endpoint behavior follows naturally from the cube-attached scratchpad
  semantics defined by ADR-0017 (Cube NOC), so it is an intentional placeholder.
 The point at which this dormant state ends will be specified by a separate ADR
 (or a follow-up revision of this ADR).
 ### D2. ResponseMsg creation and dispatch on the reverse path
 `_send_response`:
 1. Computes the reverse path with `reverse_path = list(reversed(txn.path))`.
 2. Creates `ResponseMsg(correlation_id=txn.request.correlation_id, request_id=...,
   src_cube=<this cube>, src_pe=-1, success=True)`.
 3. Wraps it in `Transaction(request=resp_msg, path=reverse_path, step=0, nbytes=0,
   done=env.event(), is_response=True)` and puts it on
   `out_ports[reverse_path[1]]`.
 4. If the reverse path is abnormal (`< 2 hops`) or there is no ctx, falls back
   to calling only the original `txn.done.succeed()`.
 `src_pe = -1` means "SRAM is not PE-localized". `src_cube` is filled in by
 parsing the cube index from the node ID (`sip{S}.cube{C}.sram`).
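The path reversal and cube-index parsing in D2 can be sketched as pure functions. `plan_response` and `cube_index` are hypothetical helpers for illustration, the router node name in the sample path is made up, and the real `_send_response` additionally builds the `ResponseMsg` and `Transaction` objects.

```python
import re

_SRAM_ID = re.compile(r"sip(\d+)\.cube(\d+)\.sram")


def cube_index(node_id: str) -> int:
    """Parse C out of a node ID shaped like sip{S}.cube{C}.sram."""
    match = _SRAM_ID.fullmatch(node_id)
    if match is None:
        raise ValueError(f"not an SRAM node id: {node_id!r}")
    return int(match.group(2))


def plan_response(path):
    """Return (reverse_path, first_hop), or None when the fallback applies."""
    reverse_path = list(reversed(path))
    if len(reverse_path) < 2:
        return None  # caller would only complete the original txn.done
    return reverse_path, reverse_path[1]


# Made-up request path ending at the SRAM node.
request_path = ["sip0.cube1.pe0", "sip0.cube1.router3", "sip0.cube1.sram"]
planned = plan_response(request_path)
src_cube = cube_index(request_path[-1])
```

Index 1 of the reversed path is the first node after SRAM itself, which is why the response is put on `out_ports[reverse_path[1]]`.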
 ### D3. Timing parameters are split into `overhead_ns` and wire-side `drain_ns`
 - **Component-side latency**: `node.attrs["overhead_ns"]`. The default topology
  sets it to roughly `2.0 ns`.
 - **Link-side serialization**: `drain_ns` is a value the Transaction carries
  when it arrives, the result of wire-side BW serialization under ADR-0015
  (port/wire model). SRAM simply yields it as-is.
 - The `size_mb` attr (default `32 MiB`) is not currently used for timing; it
  will be given meaning if a capacity-aware model is introduced later (in a
  separate ADR).
 ### D4. The SRAM component does not model IPCQ slot accounting itself
 The SRAM-tier write latency of an IPCQ slot under ADR-0023 D9.7 is consumed by
 PE_DMA's `_handle_ipcq_inbound`, which calls
 `slot_io_latency_ns("sram", nbytes)` directly (that function uses the value of
 `common/ipcq_types._BUFFER_KIND_BW["sram"]`). That is:
 - When the SRAM component receives and processes a fabric Transaction, only
  **D1, D2, and D3** apply.
 - When an IPCQ slot lives in SRAM, PE_DMA pays the time separately at IPCQ
  slot-write time; this is unrelated to the SRAM component code and is
  IPCQ-side accounting.
 This separation is intentional: IPCQ is a fast path (sub-cycle slot
 bookkeeping) that does not go through fabric Transactions, so SRAM has no need
 to be aware of IPCQ.
 ### D5. SRAM's attachment point on the cube NoC is placement-driven
 `topology/mesh_gen.py` reads `placement.sram.pos_mm` (`topology.yaml` default
 `[1.5, 9.0]`) and adds `"sram"` to the `attach` of the nearest router. The
 builder (the attachment loop in `topology/builder.py`) reads that attach
 information and lays a bidirectional edge between the `sram` node and the
 router.
 This decision lives outside the SRAM component code (in mesh_gen / builder),
 and the component does not need to know which router it is attached to. It only
 needs to know that `txn.path` / `reverse_path` reaches it through a router.
 ### D6. SRAM is not a data store of its own (timing-only)
 In the same vein as ADR-0040 D6: the SRAM component models time only, and the
 actual data payload is held by sim_engine's `memory_store` (when present).
 ## Alternatives Considered
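 ### A1. SRAM uses `_forward_txn` as-is and has a separate response node, like IO_CPU / HBM_CTRL
 Rejected. SRAM is terminal on the cube NoC; a separate node to receive
 responses would add meaningless hops and go against the cube NoC simplification
 spirit of ADR-0017.
 ### A2. SRAM models BW serialization with its own resource
 Rejected. Link-side BW serialization (`drain_ns`) already captures the
 semantics sufficiently. Adding another `simpy.Resource` inside the component
 would double-count against the ADR-0015 wire-side model.
 ### A3. SRAM handles IPCQ slot accounting on the component side
 Rejected. As stated in D4, IPCQ is a fast path and does not pass through fabric
 Transactions. If SRAM were aware of IPCQ, responsibility would split in two and
 become harder to reason about.
 ### A4. Capacity-aware latency model based on `size_mb`
 Rejected (at this stage). Capacity is used only for things like labeling in the
 topology visualizer, and its actual timing impact is not modeled yet. If needed,
 introduce it in a separate ADR.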
 ## Consequences
 - SRAM's timing model is fixed at the ADR level as
  `overhead_ns + drain_ns + ResponseMsg(reverse_path)`, so an attempt to add
  IPCQ slot latency to the SRAM component can be declined on the basis of D4.
 - It is explicit that `size_mb` is currently timing-neutral (D3), so the
  compatibility impact of introducing a capacity-aware model later is narrow.
 - Placement-driven router attachment (D5) is explicit, so it is clear which
  part is affected when the SRAM coordinates move (`mesh_gen` only).
@@ -0,0 +1,194 @@
 # ADR-0042: Tile Plan Generators (GEMM/Math pipeline plan builders)
 ## Status
 Accepted (2026-05-20).
 This ADR states that `tiling.py` is not a SimPy component but a
 **plan-generator module**.
 ADR-0014 (PE Pipeline Execution Model) D6 (tile plan / self-routing) does not
 directly define the tile-plan generation algorithm, so this ADR fills that gap.
 ## First action
 When `generate_gemm_plan(M, K, N, tile_m, tile_k, tile_n, ..., pe_prefix, a_pinned,
 b_pinned, epilogue_specs)` is called, the very first thing it does is **compute
 the tile counts and build the component ID strings**:
 ```
 M_tiles = max(1, ceil(M / tile_m))
 K_tiles = max(1, ceil(K / tile_k))
 N_tiles = max(1, ceil(N / tile_n))
 dma_id   = f"{pe_prefix}.pe_dma"
 fetch_id = f"{pe_prefix}.pe_fetch_store"
 gemm_id  = f"{pe_prefix}.pe_gemm"
 math_id  = f"{pe_prefix}.pe_math"
 ```
 That is, **the plan generator's first job is to "compute the tile counts with
 ceiling division and build this PE's four sub-component IDs in one go"**. It
 never touches SimPy events or environment objects; the module consists of pure
 functions.
 `generate_math_plan(M, N, tile_m, tile_n, ..., math_op, src_addr, dst_addr,
 pe_prefix)` likewise starts by computing `M_tiles` and `N_tiles` and building
 three component IDs (`dma_id`, `fetch_id`, `math_id`).
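The first-action computation can be reproduced as a standalone sketch. `tile_counts` and `sub_component_ids` are hypothetical helper names for this example (the real logic sits inline in `generate_gemm_plan`), and the PE prefix used below is illustrative.

```python
import math


def tile_counts(M, K, N, tile_m, tile_k, tile_n):
    """Ceiling-divide each dimension, with a floor of one tile per dimension."""
    return (
        max(1, math.ceil(M / tile_m)),
        max(1, math.ceil(K / tile_k)),
        max(1, math.ceil(N / tile_n)),
    )


def sub_component_ids(pe_prefix: str) -> dict:
    """Build the four per-PE sub-component IDs used by a GEMM plan."""
    return {
        "dma": f"{pe_prefix}.pe_dma",
        "fetch": f"{pe_prefix}.pe_fetch_store",
        "gemm": f"{pe_prefix}.pe_gemm",
        "math": f"{pe_prefix}.pe_math",
    }


counts = tile_counts(M=100, K=64, N=33, tile_m=32, tile_k=32, tile_n=32)
ids = sub_component_ids("sip0.cube0.pe0")
```

The `max(1, ...)` floor means a zero-sized dimension still produces one tile, which keeps the plan non-empty.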
 ## Context
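 ADR-0014 D6 agreed only that "when PE_SCHEDULER receives a CompositeCmd, it
 generates a TilePlan and feeds self-routing tile tokens". In the code, however,
 **the concrete plan-generation algorithm** lives in the
 `src/kernbench/components/builtin/tiling.py` module, and this module:
 - is a collection of **pure functions** (`generate_gemm_plan`,
  `generate_math_plan`), not a component.
 - does not depend on the SimPy environment, queues, op_log, hooks, and so on.
 - returns a `PipelinePlan` (dataclass) as its result.
 The earlier G4 analysis wrongly assumed `tiling.py` was a component, but it is
 actually a plan-builder function injected into PE_SCHEDULER. This difference
 must be pinned down in a separate ADR paired with ADR-0014 D6; otherwise:
 - It is ambiguous whether the responsibility for building tile plans lies with
  PE_SCHEDULER or with a separate module.
 - The rationale for whether the stage sequences of the GEMM plan and the Math
  plan are consistent (e.g., the position of FETCH/STORE) is scattered.
 - There is no rationale for why options such as `a_pinned` / `b_pinned` /
  `epilogue_specs` branch at the plan level.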
 ## Decision
 ### D1. tiling is a pure plan-generator module, not a component
 `components/builtin/tiling.py` defines no ComponentBase subclass. It exposes
 only two module-level functions:
 - `generate_gemm_plan(...) -> PipelinePlan`
 - `generate_math_plan(...) -> PipelinePlan`
 There is no node called `tiling` in the topology graph. It is located in the
 `builtin/` directory because it is a direct helper of PE_SCHEDULER (ADR-0014
 D6); semantically it is closer to an internal utility of PE_SCHEDULER.
 ### D2. GEMM plan stage sequence: `M → N → K` order
 Stage sequence for each (m, n, k) tile (baseline, with no operand pinning and
 no epilogue applied):
 ```
 [DMA_READ(A)] → [DMA_READ(B)] → FETCH → GEMM
                                ↑
                                ↓
 (last k tile only)              [MATH(output_tile)]* → STORE → DMA_WRITE
 ```
 A `k_tile` epilogue runs right after GEMM on every K-tile; an `output_tile`
 epilogue runs once per (m, n), on the last K-tile, right before
 STORE/DMA_WRITE. The K-loop accumulator stays in the RegFile, so no
 STORE/DMA_WRITE occurs between K-tiles (output happens only at last_k).
 ### D3. Operand pinning: `a_pinned` / `b_pinned`
 When the caller passes `a_pinned=True`, **the A DMA_READ is omitted for every
 (m, n, k) tile**. Meaning: it signals to the plan generator that the caller
 (e.g., `tl.composite`) has already loaded all of A into the TCM once with
 `tl.load`.
 This branch is decided at the plan level (not at runtime). The number of stage
 records in the op_log therefore varies deterministically with pinning, and the
 sweep-analysis side (e.g., the stage record count of gemm_sweep) sees this
 decision as-is.
 ### D4. Epilogue scope: `k_tile` vs `output_tile`
 `epilogue_specs` is an iterable of op-spec objects. Each op object is assumed
 to have the following attributes:
 - `op.kind: str`: the math op name (e.g., `"dequant"`, `"bias"`, `"relu"`,
  `"scale"`). It goes into the stage's `params["op_kind"]`.
 - `op.scope: Scope`: `Scope.K_TILE` or `Scope.OUTPUT_TILE` (`Scope` is an enum
  defined in `kernbench.common.pe_commands`).
 - Additional per-op fields (e.g., `bias`, `scale`, `factor`): not currently used
  by the plan generator; consumed by the runtime (PE_MATH) side.
 The plan generator splits ops into two groups based on
 `getattr(o, "scope", None)`:
 - `scope == Scope.K_TILE`: add a MATH stage right after the GEMM of every
  K-tile.
 - `scope == Scope.OUTPUT_TILE`: add a MATH stage right before STORE on the last
  K-tile of each (m, n).
 An op with no `scope` attribute, or whose scope is neither enum value, is **not
 included in the plan** (both `getattr(..., None) == Scope.X` comparisons are
 False). Applying the default (`output_tile`) is the **responsibility of the
 caller (e.g., `tl.composite`)**; the plan generator only branches on an
 already-populated scope value (aligned with the composite epilogue contract of
 ADR-0014).
 The `Scope` import is done lazily inside the function to avoid the circular
 import `pe_commands ← pe_types ← tiling`. This is an intentional pattern and
 not a candidate for cleanup (from D1's view that "tiling is a utility of
 PE_SCHEDULER", having no import-time dependency on pe_commands keeps the module
 boundary clean).
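The interplay of D2, D3, and D4 can be sketched as a function that emits stage names per (m, n, k) tile. This is an illustration under assumptions: `gemm_stage_names`, `EpilogueOp`, and the local `Scope` enum are stand-ins written for this example, stages are plain strings rather than `Stage` objects, and emitting one stage list per (m, n, k) is a simplification that may not match how the real `TilePlan` objects are grouped.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product
from typing import Optional


class Scope(Enum):
    K_TILE = "k_tile"
    OUTPUT_TILE = "output_tile"


@dataclass
class EpilogueOp:
    kind: str
    scope: Optional[Scope] = None


def gemm_stage_names(m_tiles, n_tiles, k_tiles,
                     a_pinned=False, b_pinned=False, epilogue_specs=()):
    """Return one list of stage names per (m, n, k) tile, in M -> N -> K order."""
    k_ops = [o for o in epilogue_specs
             if getattr(o, "scope", None) == Scope.K_TILE]
    out_ops = [o for o in epilogue_specs
               if getattr(o, "scope", None) == Scope.OUTPUT_TILE]
    tiles = []
    for _m, _n, k in product(range(m_tiles), range(n_tiles), range(k_tiles)):
        stages = []
        if not a_pinned:
            stages.append("DMA_READ(A)")
        if not b_pinned:
            stages.append("DMA_READ(B)")
        stages += ["FETCH", "GEMM"]
        stages += [f"MATH({o.kind})" for o in k_ops]        # every K-tile
        if k == k_tiles - 1:                                 # last K-tile only
            stages += [f"MATH({o.kind})" for o in out_ops]
            stages += ["STORE", "DMA_WRITE"]
        tiles.append(stages)
    return tiles


plan = gemm_stage_names(
    1, 1, 2,
    a_pinned=True,
    epilogue_specs=[
        EpilogueOp("dequant", Scope.K_TILE),
        EpilogueOp("relu", Scope.OUTPUT_TILE),
        EpilogueOp("bias"),   # no scope: dropped from the plan
    ],
)
```

Because pinning and scope are resolved while the lists are built, the number of stages per tile is fixed before anything runs, which is what makes stage record counts deterministic.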
 ### D5. Math plan stage sequence: `M → N` order
 Stage sequence for each (m, n) tile:
 ```
 DMA_READ → FETCH → MATH → STORE → DMA_WRITE
 ```
 There is no K dimension, so concepts such as epilogue and accumulator residency
 do not apply. PE_FETCH_STORE's register-file accounting is handled the same way
 as in the GEMM plan.
 ### D6. A plan is data: no SimPy dependency
 `PipelinePlan` is a dataclass defined in `pe_types.py` that holds
 `tiles: list[TilePlan]`. Each `TilePlan` holds `stages: tuple[Stage, ...]`. The
 plan itself is a nearly immutable data structure (only `Stage.params: dict` is
 mutable) and holds no SimPy objects or events.
 At runtime, PE_SCHEDULER looks at the first stage of the plan, creates a
 `TileToken`, and feeds it into the pipeline; the TileToken carries
 `plan: TilePlan`, `stage_idx: int`, and `params: dict`. Self-routing proceeds by
 `TileToken.advance()` caching the `params` of the next stage (ADR-0014 D6).
 ### D7. Plan generator contract: pure, deterministic, idempotent
 Calling it twice with the same input returns the same PipelinePlan
 (deterministic down to the order of `TilePlan.stages`). This contract is aligned
 with the "deterministic tile dispatch order" requirement of ADR-0014 D6.
 There are no side effects (SimPy events, file I/O, global state), so tests can
 call it without an environment object (some cases in
 `tests/test_pe_pipeline.py` do this).
 ## Alternatives Considered
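 ### A1. Make tiling a component (e.g., PE_PLANNER)
 Rejected. Plan generation is a decision algorithm that consumes no SimPy time.
 Making it a component would (a) drag in unnecessary infrastructure such as an
 inbox and resources, and (b) make PE_SCHEDULER handle "receive plan" → "feed
 tiles" as two separate steps, creating a meaningless hop.
 ### A2. Move plan generation into a PE_SCHEDULER class method
 Rejected (for now). The module split gives (1) testability and (2)
 extensibility: introducing another plan algorithm (e.g., a DTensor-aware plan)
 only requires defining an additional function. If plan types multiply and
 explicit dispatch becomes necessary, a plan factory in PE_SCHEDULER can be
 introduced in a separate ADR at that point.
 ### A3. Force plans to be immutable (frozen dataclass + tuple)
 Partially adopted. `Stage` and `TilePlan` are dataclasses but not frozen.
 Reason: `Stage.params: dict` is populated at plan-generation time and only read
 at runtime (TileToken merely caches it on advance). Fully freezing would cost a
 dict → frozendict migration for little benefit. The convention remains,
 however, that nothing outside the plan stage mutates it.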
 ## Consequences
 - The fact that `tiling.py` is a plan-generator module, not a component, is
  stated at ADR level, which forestalls future G4-style findings of "this
  component has no ADR".
 - The GEMM plan's stage sequence (D2) and the pinning/epilogue branches
  (D3, D4) are fixed in an ADR, so the basis for interpreting stage record
  counts in sweep analysis (`scripts/gemm_sweep.py`) becomes clear.
 - Thanks to the plan generator's pure contract (D7), tests can validate plans
  without an environment, in line with the spirit of ADR-0013 (verification
  strategy): "behavior validated by tests with meaningful input cases".
 - When new plan kinds such as a DTensor-aware plan or a K-major plan are
  added, this ADR serves as the baseline: add a new function and follow D1,
  D6, and D7.
@@ -0,0 +1,139 @@
 # ADR-0038: PCIE_EP Component Model
 ## Status
 Accepted (2026-05-20).
 Companion to ADR-0035 (M_CPU), ADR-0036 (IO_CPU), and
 ADR-0037 (Forwarding) at the same component-model level.
 ## First action
 Pull one Transaction from `_inbox` and let `_forward_txn` invoke `run()`, which
 applies a single `env.timeout(node.attrs["overhead_ns"])` for PCIe protocol
 handling. After that the standard `ComponentBase` worker rules take over: if
 `next_hop` exists, put the advanced Transaction on `out_ports[next_hop]`;
 otherwise consume `drain_ns` and call `txn.done.succeed()`.
 In other words, **PCIE_EP's first (and only) act is to spend the configured
 overhead as simulator time** — no routing decisions, no payload transformation,
 no MMIO decoding.
 ## Context
 PCIE_EP is the **host ↔ device boundary** in the topology graph. The builder
 (`topology/builder.py`) creates an IO chiplet instance per SIP that contains
 `pcie_ep`, `io_cpu`, and `io_noc`, and lays bidirectional edges between the
 external `fabric.switch0` and each `pcie_ep`:
 - `switch → pcie_ep`: host → device traffic (MemoryWrite, MemoryRead,
  KernelLaunch).
 - `pcie_ep → switch`: device-side outbound (e.g., cross-SIP IPCQ tokens).
 Inside the IO chiplet there are bidirectional `pcie_ep ↔ io_noc` edges, and
 from there traffic branches to `io_cpu` or to the cube-side `hbm_ctrl` path
 (see ADR-0036 IO_CPU model). The router and resolver already know — per SPEC
 R7 — that PCIE_EP is the endpoint for memory operations, so helpers like
 `find_pcie_ep(sip)` and `find_memory_path(pcie_ep, dst_node)` treat PCIE_EP as
 the start (or end) of the memory path.
 The problem is that all of this dependency lives in builder/router/resolver,
 while **PCIE_EP's own internal model has no ADR**. The consequence:
 - "What latency does PCIE_EP model?" requires reading the source.
 - The asymmetry with peer components (IO_CPU = ADR-0036, M_CPU = ADR-0035) is
  awkward.
 - Future decisions about a more detailed PCIe link-layer model (TLP credits,
  retry, MPS chunking) lack a documented baseline.
 This ADR pins down the current **thin PCIE_EP model** and records that this
 thinness is intentional (aligned with ADR-0033's latency-model simplification
 policy).
 ## Decision
 ### D1. PCIE_EP uses ComponentBase's generic forwarding worker as-is
 `PcieEpComponent` extends `ComponentBase` and does **not** override `_worker` or
 `_forward_txn`. Every Transaction flows through the standard sequence:
 1. `_fan_in` accumulates inbound messages (and reassembles Flits, per ADR-0033
   Phase 2c) into `_inbox`.
 2. `_worker` pulls one message off `_inbox` and spawns
   `env.process(self._forward_txn(env, txn))` for per-message pipelining.
 3. `_forward_txn` calls the op_log start hook → `run()` for latency → op_log
   end hook.
 4. `run()` is a single line: `yield env.timeout(overhead_ns)`.
 5. If a next hop exists, `out_ports[next_hop].put(txn.advance())`. Otherwise
   (terminal arrival) consume `drain_ns` and call `txn.done.succeed()`.
 ### D2. The only timing parameter is `overhead_ns`
 Only `node.attrs["overhead_ns"]` is accepted as a latency parameter. The code
 default is `0.0`; `topology.yaml`'s IOChiplet `components.pcie_ep.attrs`
 supplies the real value (current topology: `overhead_ns: 5.0` ns).
 No separate BW-serialization resource (`simpy.Resource`), no queue depth, no
 retry model is introduced. Link-level BW serialization is handled wire-side —
 inside the IOChiplet by `pcie_ep_to_noc_bw_gbs = 256.0 GB/s`, and externally by
 the system's `io_ep_to_switch` link BW (ADR-0015 port/wire model). PCIE_EP
 itself takes no part in that accounting.
 ### D3. PCIE_EP is direction-aware in topology but direction-blind in code
 The builder lays both `switch ↔ pcie_ep` and `pcie_ep ↔ io_noc` edges, so
 PCIE_EP serves:
 - inbound (host → device): forward Transactions arriving from the switch onto
  io_noc-side next-hop.
 - outbound (device → host): forward Transactions arriving from io_noc/io_cpu
  back to the switch.
 Both are handled by D1's generic forwarding worker; the component code never
 distinguishes direction (it just follows `txn.next_hop`).
 ### D4. PCIE_EP is not Flit-aware (legacy reassembly path)
 `_FLIT_AWARE` is left at the inherited `False`, so `_fan_in` reassembles
 upstream-chunkified Flits into the parent Transaction before delivery to
 `_inbox` (aligned with ADR-0033 Phase 2c incremental rollout).
 A future PCIe TLP-level credit model would revisit D4.
 ### D5. PCIE_EP is a **named node** for routing helpers
 `policy/routing/router.py` provides `find_pcie_ep(sip, io_id="io0")`,
 `find_all_pcie_eps()`, and `find_memory_path(pcie_ep, dst_node)` — all of
 which treat PCIE_EP as the start (or end) of the memory path. The component
 itself supplies no information to these helpers; the naming convention
 (`sip{S}.{io_id}.pcie_ep`) is guaranteed by the topology builder.
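To make the naming contract concrete, here is a small sketch that builds and parses node IDs following the `sip{S}.{io_id}.pcie_ep` convention. Both function names are hypothetical and are not the router's `find_pcie_ep`; they only illustrate the string format the helpers rely on.

```python
import re


def pcie_ep_node_id(sip, io_id="io0"):
    """Build a PCIE_EP node ID following the sip{S}.{io_id}.pcie_ep convention."""
    return f"sip{sip}.{io_id}.pcie_ep"


def parse_pcie_ep_node_id(node_id):
    """Return (sip, io_id) for a PCIE_EP node ID, or raise ValueError."""
    m = re.fullmatch(r"sip(\d+)\.(\w+)\.pcie_ep", node_id)
    if m is None:
        raise ValueError(f"not a PCIE_EP node ID: {node_id!r}")
    return int(m.group(1)), m.group(2)


print(pcie_ep_node_id(0))
print(parse_pcie_ep_node_id("sip3.io1.pcie_ep"))
```

A rename of the component ID would break exactly this kind of pattern match, which is the dependency D5 records.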
 ## Alternatives Considered
 ### A1. Full PCIe TLP-level model (credits, retry, MPS chunking)
 Rejected. Violates ADR-0033's "current latency model = abstract overhead + BW
 serialization" simplification. Host↔device protocol fidelity is explicitly
 out-of-scope in SPEC §5 "Non-Goals".
 ### A2. Per-PCIE_EP `simpy.Resource` for in-flight cap
 Rejected. Host traffic is not a contention bottleneck in current workloads.
 Defer to a separate ADR if it becomes one (in which case D1 stays and D2 is
 extended).
 ### A3. Merge PCIE_EP into IO_CPU
 Rejected. PCIE_EP is the protocol-boundary node first hit on the host side;
 IO_CPU is the device-side control-plane processing node (ADR-0036). Traffic
 fan-out and command decoding costs concentrate in IO_CPU, while PCIE_EP only
 expresses link-edge overhead. Merging them would mix two responsibilities and
 violate the spirit of ADR-0007 (runtime API/sim_engine boundaries).
 ## Consequences
 - PCIE_EP gets an explicit model ADR despite having near-zero code — consistent
  with peer component ADRs, lower maintenance friction.
 - A future PCIe-level refinement would supersede this ADR by extending D2/D4
  in a new ADR.
 - D5 makes the named-node dependency explicit, so any future renaming of
  component IDs has a clearly bounded blast radius.
@@ -0,0 +1,203 @@
 # ADR-0039: PE_MMU Component Model — Component + Utility Dual Role
 ## Status
 Accepted (2026-05-20).
 ADR-0011 (PA/VA/LA address model) only states that "the VA model translates
 VA→PA via PE_MMU"; this ADR pins down **the PE_MMU component's own behavior
 model**.
 ## First action
 At construction, read `node.attrs["page_size"]` (default `2 MiB`) and
 `node.attrs["tlb_overhead_ns"]` (default `0.0`) and instantiate the internal
 `PeMMU` utility object (`policy.address.pe_mmu.PeMMU`) exactly once. That
 object is the single owner of the page table, the sub-page region lists, and
 the TLB overhead value.
 At runtime the first action splits into two paths:
 - **Component path (inbox consumption)**: `_worker` pulls a Transaction off
  `_inbox`; if `request` is a `MmuMapMsg`, call `self._mmu.map(va, pa, size)`
  for each entry and then `txn.done.succeed()`. For `MmuUnmapMsg`, call
  `unmap(va, size)`. Any other type falls through to standard `_forward_txn`.
  In other words, **the component's first act is "apply map/unmap commands to
  the page table"**.
 - **Utility path (direct call)**: a sibling PE engine (PE_DMA / PE_GEMM) calls
  `pe_mmu.mmu.translate(va)` directly. This path produces no SimPy events;
  the caller (when `overhead_ns > 0`) issues a `yield env.timeout(mmu.overhead_ns)`
  in its own process.
 ## Context
 ADR-0011 defined three address models (PA/VA/LA) and agreed that "VA model =
 translation via PE_MMU". But in code, `PeMmuComponent` performs two
 complementary roles simultaneously:
 1. **A topology-graph component**: it receives `MmuMapMsg` / `MmuUnmapMsg`
   sideband messages over the cube NoC and updates the page table.
 2. **A PE-local utility**: PE_DMA / PE_GEMM on the same PE call
   `translate(va)` directly with zero SimPy latency (the caller pays
   `overhead_ns` if any).
 Without an ADR covering both roles, the following questions are ambiguous:
 - "Why isn't there a SimPy event for the MMU translate?" (Answer: the caller
  pays it.)
 - What is the sub-page region model, and why? (The code docstring has it, but
  no ADR — only a memory note `project_mmu_subpage_stopgap`.)
 - Who sends map/unmap, and when must they be visible? (Ordering contract.)
 Additionally, `PeMMU.map()` has "append, last-write-wins on overlap"
 semantics, which cannot be expressed with a one-PA-per-entry page table.
 That is a deliberate **simulator stopgap** to support DPPolicy sub-page sharding
 (e.g., 128 B payloads against 4 KiB pages): with a single PA per page, a second
 shard mapped into the same page would silently overwrite the first and
 misroute its traffic. This deviation from real HW MMU semantics must be
 ADR-pinned.
 ## Decision
 ### D1. Explicit dual role — component and utility
 `PeMmuComponent` exposes two interfaces from a single class:
 - Component interface: `_inbox` consumption, `_worker` loop (handles MMU
  sideband messages).
 - Utility interface: the `mmu` property exposes the underlying `PeMMU` object,
  which PE_DMA / PE_GEMM hold directly and invoke `translate()` on.
 The latter is **not a layer skip**: inside a PE, the engines and PE_MMU are
 siblings under the "components" layer (ADR-0007). Cross-layer violations only
 apply to runtime API ↔ sim_engine ↔ components boundaries.
 ### D2. Latency model — `translate()` is pure; caller owns the timeout
 `PeMMU.translate()` is a pure function and yields nothing in SimPy. The caller
 (a PE engine) issues `if mmu.overhead_ns > 0: yield env.timeout(mmu.overhead_ns)`
 in its own process after translation.
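The "caller owns the timeout" contract can be sketched without SimPy by letting a generator yield the delays it would otherwise pass to `env.timeout()`. `FakeMmu`, `engine_step`, and `run` are hypothetical stand-ins written for this illustration; they are not the real `PeMMU` or PE engine code.

```python
class FakeMmu:
    """Stand-in for PeMMU: translate() is a plain function and yields nothing."""

    def __init__(self, overhead_ns, offset):
        self.overhead_ns = overhead_ns
        self._offset = offset

    def translate(self, va):
        return va + self._offset


def engine_step(mmu, va):
    """Generator standing in for a PE engine process.

    Each yielded number is a delay the engine would hand to env.timeout().
    """
    pa = mmu.translate(va)       # pure call, consumes no simulated time
    if mmu.overhead_ns > 0:
        yield mmu.overhead_ns    # the caller, not the MMU, pays the overhead
    return pa


def run(gen):
    """Drive the generator, summing delays; return (elapsed_ns, result)."""
    elapsed = 0.0
    while True:
        try:
            elapsed += next(gen)
        except StopIteration as stop:
            return elapsed, stop.value


print(run(engine_step(FakeMmu(0.0, 0x1000), 0x20)))
print(run(engine_step(FakeMmu(1.5, 0x1000), 0x20)))
```

With `overhead_ns == 0` no delay is yielded at all, which is why the default topology hides any caller that forgets the timeout.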
 Rationale: the PE engine process already holds its own `record_start` /
 `record_end` (op_log) hooks, so keeping timing inside the caller's process
 preserves consistent timing accounting. A separate MMU process would split the
 engine's processing flow and blur op_log / pipeline overlap semantics.
 #### D2.1. Current implementation asymmetry — pipeline vs non-pipeline (known)
 At the time of writing, `pe_dma.py` handles MMU overhead differently in its
 two call paths:
 - **non-pipeline (`handle_command`)**: after `translate()`, applies
  `if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`.
 - **pipeline (`_do_pipeline_dma`)**: calls `translate()` only, **omitting**
  the overhead timeout — though the comment says "same logic as non-pipeline
  path", the behaviors differ.
 In the default topology, `tlb_overhead_ns = 0.0`, so this asymmetry does not
 manifest. With `tlb_overhead_ns > 0`, however, GEMM/Math work issued via the
 pipeline path would appear faster than the equivalent non-pipeline workload by
 exactly the omitted MMU overhead.
 The D2 contract states that **all** callers pay the overhead; the pipeline
 omission is **not an intentional design** — ADR-0014 D6 (pipeline self-routing)
 does not exempt it. Remediation options (require a separate Phase 1/2):
 - (a) Add `if mmu.overhead_ns > 0: yield env.timeout(...)` in
  `_do_pipeline_dma` to align with D2 — **preferred**.
 - (b) Narrow the D2 contract to "non-pipeline only" and document the pipeline
  exemption in an ADR-0014 update — discouraged, since it weakens the
  overhead's meaning.
 This ADR recommends (a) and assumes a small follow-up change either before or
 just after acceptance.
 ### D3. Page table structure — sub-page region list (stopgap)
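 `self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]]`
 holds multiple regions per page (normally disjoint; overlaps are resolved in
 `translate`).
 - `map(va, pa, size)`: append one region for each page the range touches,
  splitting the range wherever it crosses a page boundary.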
 - `translate(va)`: look up regions for the VPN and iterate **in reverse** so
  the most recent overlapping region wins (last-write-wins).
 - `unmap(va, size)`: remove only regions whose extent is **fully contained**
  within the unmap range; partial-overlap boundaries are left in place and the
  caller is expected to unmap on the same boundaries used for map.
 This is documented as a **simulator stopgap** that supplements the VA model
 from ADR-0011. It prevents silent last-write-wins misrouting when DPPolicy
 shards below page granularity. Memory note: `project_mmu_subpage_stopgap`.
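The region-list behavior described in D3 can be sketched as follows. This is an illustrative model, not the `PeMMU` source: the class name `SubPageTable` is hypothetical, `end_in_page` is assumed to be exclusive, and a 4 KiB page is used only to keep the example small.

```python
class PageFault(Exception):
    """Raised when no region covers the requested VA."""


class SubPageTable:
    def __init__(self, page_size=4096):
        self.page_size = page_size
        # vpn -> list of (start_in_page, end_in_page, pa_at_offset_zero)
        self._table = {}

    def map(self, va, pa, size):
        """Append one region per touched page, splitting at page boundaries."""
        end = va + size
        while va < end:
            vpn, start = divmod(va, self.page_size)
            chunk = min(end - va, self.page_size - start)
            self._table.setdefault(vpn, []).append((start, start + chunk, pa - start))
            va += chunk
            pa += chunk

    def translate(self, va):
        """Scan regions newest-first so the latest overlapping mapping wins."""
        vpn, off = divmod(va, self.page_size)
        for start, end, pa0 in reversed(self._table.get(vpn, [])):
            if start <= off < end:
                return pa0 + off
        raise PageFault(hex(va))

    def unmap(self, va, size):
        """Remove only regions fully contained in [va, va + size)."""
        lo, hi = va, va + size
        for vpn in list(self._table):
            base = vpn * self.page_size
            kept = [r for r in self._table[vpn]
                    if not (lo <= base + r[0] and base + r[1] <= hi)]
            if kept:
                self._table[vpn] = kept
            else:
                del self._table[vpn]


t = SubPageTable()
t.map(0x1000, 0xA000, 128)   # shard 0
t.map(0x1080, 0xB000, 128)   # shard 1: same page, disjoint range
print(hex(t.translate(0x1010)), hex(t.translate(0x1090)))
```

Both shards resolve to their own PA even though they share one page, which is the case a one-PA-per-page table cannot represent.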
 ### D4. PageFault signals PA fallback
 If `translate()` is called with an unmapped VA, `PageFault` is raised. PE_DMA
 catches the exception and **uses the original address as a PA** (the PA-only
 backward-compatibility path from ADR-0011). PageFault is therefore not an
 error — it is the signal for "no VA mapping, interpret as PA".
 This path is intentional and preserves backward compatibility with the
 ADR-0011 PA-only mode.
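A minimal sketch of the fallback, assuming a `translate` callable that raises `PageFault` for unmapped addresses. `resolve_pa` and the `MAPPED` table are hypothetical names used only for this illustration; the actual handling lives in PE_DMA.

```python
class PageFault(Exception):
    """Signals 'no VA mapping'; treated as a PA fallback, not an error."""


MAPPED = {0x1000: 0xA000}   # toy VA -> PA table for the example


def translate(va):
    try:
        return MAPPED[va]
    except KeyError:
        raise PageFault(hex(va)) from None


def resolve_pa(translate_fn, addr):
    """Translate addr if mapped; otherwise interpret addr itself as a PA."""
    try:
        return translate_fn(addr)
    except PageFault:
        return addr


print(hex(resolve_pa(translate, 0x1000)), hex(resolve_pa(translate, 0x2000)))
```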
 ### D5. MMU sideband-message reception contract
 `MmuMapMsg` / `MmuUnmapMsg` arrive over the fabric at PE_MMU's `_inbox`
 (SPEC R10: "MMU map installation incurs measured fabric latency"). Schemas
 live in `runtime_api/kernel.py`:
 - `MmuMapMsg.entries: tuple[dict, ...]` — each dict is
  `{"va": int, "pa": int, "size": int}`.
 - `MmuUnmapMsg.entries: tuple[dict, ...]` — each dict is
  `{"va": int, "size": int}`.
 PE_MMU reception flow:
 1. `_worker` does `_inbox.get()` for one message.
 2. `hasattr(msg, "request")` confirms a Transaction wrapper.
 3. `isinstance(msg.request, MmuMapMsg)` → for each entry, call
   `self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"])`.
 4. `isinstance(msg.request, MmuUnmapMsg)` → for each entry, call
   `self._mmu.unmap(va=e["va"], size=e["size"])`.
 5. Both signal `msg.done.succeed()` after completion.
 An external caller (runtime API) waiting on `done` therefore receives a SimPy
 guarantee that "the mapping is installed on-device"; this is the realization
 of ADR-0011's "MMU map installation incurs measured fabric latency".
 This ADR does **not** define the **sender or fan-out policy** for the sideband
 message — those are runtime API responsibilities. Only the receive contract
 belongs here.
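The reception flow can be sketched as a plain dispatch function. The message dataclasses below are local stand-ins for the schemas in `runtime_api/kernel.py`, `RecordingMmu` and `handle_sideband` are hypothetical names, and the `done.succeed()` signalling is left out because it needs a SimPy event.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class MmuMapMsg:
    entries: tuple      # each entry: {"va": int, "pa": int, "size": int}


@dataclass(frozen=True)
class MmuUnmapMsg:
    entries: tuple      # each entry: {"va": int, "size": int}


class RecordingMmu:
    """Records calls instead of maintaining a page table."""

    def __init__(self):
        self.calls = []

    def map(self, va, pa, size):
        self.calls.append(("map", va, pa, size))

    def unmap(self, va, size):
        self.calls.append(("unmap", va, size))


def handle_sideband(mmu, msg):
    """Apply map/unmap entries; return False when msg needs generic forwarding."""
    request = getattr(msg, "request", None)
    if isinstance(request, MmuMapMsg):
        for e in request.entries:
            mmu.map(va=e["va"], pa=e["pa"], size=e["size"])
    elif isinstance(request, MmuUnmapMsg):
        for e in request.entries:
            mmu.unmap(va=e["va"], size=e["size"])
    else:
        return False
    return True


mmu = RecordingMmu()
map_txn = SimpleNamespace(request=MmuMapMsg(({"va": 0x1000, "pa": 0xA000, "size": 128},)))
unmap_txn = SimpleNamespace(request=MmuUnmapMsg(({"va": 0x1000, "size": 128},)))
other_txn = SimpleNamespace(request="something else")
results = [handle_sideband(mmu, m) for m in (map_txn, unmap_txn, other_txn)]
print(results, mmu.calls)
```

The `False` return corresponds to the pass-through case, where the real worker hands the message to `_forward_txn`.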
 ### D6. Non-MMU Transactions delegate to generic forwarding
 If a message pulled from `_inbox` is not `MmuMapMsg` / `MmuUnmapMsg` (or
 lacks a `request` attribute), `_forward_txn` handles it normally. This keeps
 the door open for future topologies where PE_MMU sits on a pass-through path —
 current code never sends such traffic, but the routing remains safe.
 ## Alternatives Considered
 ### A1. Make `translate()` a SimPy generator
 Rejected. As D2 explains, this blurs op_log / pipeline overlap accounting in
 the PE engine.
 ### A2. Use small page size (e.g., 128 B) instead of sub-page regions
 Rejected. Would explode page-table memory and cube-wide map message size. Most
 mappings are 2 MiB; pushing the page size below that for the few DPPolicy
 sharding cases inflates average cost.
 ### A3. Make PE_MMU a PE_CPU helper only (not a topology node)
 Rejected. ADR-0011 requires that MMU map installation incur measured fabric
 latency (via `MmuMapMsg`), which requires PE_MMU to be a node on the graph.
 It also keeps cube NoC visualizer output consistent.
 ## Consequences
 - PE_MMU's dual role is justified at ADR level, so future "unify into one"
  refactor pressure has a documented counterpoint.
 - The sub-page region model is explicitly labeled a stopgap, providing a
  basis for deprecating it when the LA model (ADR-0011) lands.
 - The "`translate()` does not yield" contract is locked in (D2), so any
  future proposal to add an internal MMU timeout can be denied with a
  documented rationale.
 - PA fallback (D4) is normalized, preventing defensive logic from treating
  PageFault as an error.
@@ -0,0 +1,149 @@
 # ADR-0040: PE_TCM Component Model — Dual-Channel BW Serialization
 ## Status
 Accepted (2026-05-20).
 ADR-0014 (PE Pipeline Execution Model, D1) references PE_TCM as a "BW-based
 serialized scratchpad memory" but does not pin down the component's own model.
 This ADR fills that gap.
 ## First action
 When `start()` is invoked, immediately create two `simpy.Resource(env, capacity=1)`
 instances and store them in `self._read_res` / `self._write_res`. These two
 resources are the single decision points that serialize the **read channel**
 and **write channel** to one in-flight request each.
 The runtime first action: `_worker` pulls a message off `_inbox` and branches
 by type:
 - `TcmRequest` (from `pe_fetch_store`): spawn `env.process(self._handle_tcm_request)`.
  Hence **TCM's first act is "acquire the lock matching the direction
  (read/write)"**. After lock acquisition, if `bw > 0 and nbytes > 0`, yield
  `env.timeout(delay_ns)` with `delay_ns = nbytes / bw`, then call
  `req.done.succeed()`.
 - Anything else (Transaction): spawn `env.process(self._forward_txn)` (legacy
  fabric pass-through).
 At construction, `node.attrs["read_bw_gbs"]` and `node.attrs["write_bw_gbs"]`
 (default `512.0 GB/s` each) are captured and held.
 ## Context
 In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives two kinds of traffic:
 1. **`TcmRequest` from PE_FETCH_STORE** — when moving data between TCM and
   the register file, PE_FETCH_STORE sends a short sideband request to obtain
   BW-serialized access latency (`direction = "read"` or `"write"`, `nbytes`,
   `done` event).
 2. **Legacy Transaction forwarding** — a fallback in case TCM ends up as a
   pass-through node on the fabric graph (not used by the current critical
   path, but preserved).
 The problem: ADR-0014 only says "BW-based serialization" without specifying:
 - Read and write are **independent channels** running in parallel; only
  same-direction concurrency serializes at `capacity=1`.
 - BW is split into two configurable values (`read_bw_gbs` / `write_bw_gbs`).
 - The formula is `delay_ns = nbytes / bw_gbs` (loose unit convention:
  GB/s × ns ≈ B).
 - `nbytes == 0` still acquires the lock but skips the BW term.
 - `run()`'s `overhead_ns` (default `0.0`) is only used in the legacy fabric
  forwarding path.
 Each of these requires an ADR. In particular, "why are read and write
 separate channels" and "who owns the BW values" must be documented so that
 future changes (e.g., `capacity=2`) have a clear basis.
 ## Decision
 ### D1. Dual channel — read and write are independent resources
 `_read_res = simpy.Resource(env, capacity=1)`,
 `_write_res = simpy.Resource(env, capacity=1)`.
 Same-direction concurrent requests queue on the resource and serialize;
 opposite-direction requests proceed in parallel. This matches the hardware
 model where TCM has a dual-port (read + write) configuration, and it allows
 the simulator to express the GEMM-pipeline case where fetch (read) and store
 (write) overlap in time — modeled as BW-serialized inside each direction but
 independent across directions.
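The effect of two independent capacity-1 channels can be checked with a small analytic model that avoids SimPy altogether. It assumes FIFO service within each channel and takes the per-request service time as given; `completion_times` is a hypothetical helper, not part of `pe_tcm.py`.

```python
def completion_times(requests):
    """Compute completion times for (direction, arrival_ns, duration_ns) requests.

    Requests are given in arrival order. Each direction has one channel that
    serves one request at a time; the two channels never block each other.
    """
    free_at = {"read": 0.0, "write": 0.0}
    done = []
    for direction, arrival_ns, duration_ns in requests:
        start = max(arrival_ns, free_at[direction])   # wait for this channel only
        free_at[direction] = start + duration_ns
        done.append(free_at[direction])
    return done


reqs = [("read", 0.0, 8.0), ("read", 0.0, 8.0), ("write", 0.0, 8.0)]
print(completion_times(reqs))   # [8.0, 16.0, 8.0]
```

The second read waits behind the first, while the write finishes at the same time as the first read, which is the overlap D1 is meant to allow.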
 ### D2. Per-channel BW model — `nbytes / bw_gbs`
 After lock acquisition, if `nbytes > 0 and bw > 0`, yield
 `env.timeout(nbytes / bw_gbs)`. The unit convention is GB/s × ns ≈ B,
 consistent with the simulator-wide loose convention (see ADR-0033).
 - `nbytes == 0`: BW term is zero, but the lock is acquired and released. This
  is intentional: when a plan generator emits an empty fetch/store on the
  PE_FETCH_STORE side, the op_log / channel accounting on the TCM side still
  records one consumption.
 - `bw == 0` (config error): the timeout call is skipped (0-time pass). Should
  not occur with normal settings.
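As a worked instance of the formula, the sketch below computes the BW term and its two zero cases. It relies on 1 GB/s being 1 B/ns when GB means 10^9 bytes; `tcm_delay_ns` is a hypothetical helper name.

```python
def tcm_delay_ns(nbytes, bw_gbs):
    """Return the BW-serialization delay in ns, or 0.0 when the term is skipped."""
    if nbytes > 0 and bw_gbs > 0:
        return nbytes / bw_gbs   # bytes / (bytes per ns) = ns
    return 0.0


print(tcm_delay_ns(4096, 512.0))   # 8.0
print(tcm_delay_ns(0, 512.0))      # 0.0 (empty request: lock only, no BW term)
print(tcm_delay_ns(4096, 0.0))     # 0.0 (misconfigured BW: timeout skipped)
```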
 ### D3. BW values come from `node.attrs.read_bw_gbs` / `write_bw_gbs`
 Defaults `512.0 GB/s`. The topology builder (`topology/builder.py`) passes
 these attrs when instantiating TCM from `pe_template`. Default changes should
 coincide with related decisions in ADR-0014 D1 or ADR-0033.
 ### D4. TcmRequest schema is owned by PE_TCM
 `@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
 lives in `components/builtin/pe_tcm.py`. PE_FETCH_STORE imports the dataclass
 and only constructs/sends it. The caller does not define the schema because:
 - The meaning of BW serialization is TCM's responsibility — TCM decides which
  fields drive serialization.
 - The valid-value check for `direction` (must be `"read"` or `"write"`) lives
  in `_handle_tcm_request`'s if/else branch.
 ### D5. Legacy Transaction forwarding path is preserved
 When `_worker` receives a non-`TcmRequest` message, it dispatches to
 `_forward_txn`, applying `run()`'s `overhead_ns`. The current standard PE
 pipeline does not route Transactions through TCM, but the path is kept to
 avoid breakage if fabric topology changes.
 This path is accounted for via standard Transaction op_log; the BW channel
 locks are **not** acquired (orthogonal to D1's usage).
 ### D6. PE_TCM is not a data store (timing only)
 TCM models **time only**. The actual data payload is held by sim_engine's
 `memory_store` (when present); the TCM component never updates it.
 PE_FETCH_STORE obtains BW delay through `TcmRequest`, and register contents
 are handled separately in the data path (ADR-0020 2-pass data execution —
 Phase 2).
 ## Alternatives Considered
 ### A1. Single channel (`capacity=1` shared by read+write)
 Rejected. Would artificially serialize the normal-case overlap of fetch
 (read) and store (write) and yield an incorrect BW upper bound for the PE
 pipeline.
 ### A2. `capacity > 1` (e.g., 2-banked TCM)
 Rejected. Current hardware model assumes a single bank. Multi-bank extension
 needs its own ADR that would supersede D1. Bumping capacity now would loosen
 the nominal serialization without raising the BW upper bound, producing less
 accurate modeling.
 ### A3. Generalize BW formula to `nbytes / bw + overhead_ns`
 Rejected. `overhead_ns` is reserved for the legacy forwarding path (D5).
 Additional fetch/store-path overhead, if needed, belongs in PE_FETCH_STORE's
 `run()` or in a register-file access model — closer to the responsibility
 boundary.
 ## Consequences
 - TCM's BW accounting is locked at ADR level. Questions arising from op_log
  in GEMM/Math sweeps — "why did fetch and store overlap?", "why do only
  same-direction requests serialize?" — resolve quickly to D1.
 - Future multi-bank TCM models or asymmetric read/write BW changes have a
  clear blast radius (D1 / D2 / D3 — pick one).
 - D6 ("TCM is not a data store") sharpens the responsibility boundary with
  ADR-0020 2-pass execution.
@@ -0,0 +1,195 @@
 # ADR-0041: Cube SRAM Component Model — terminal scratchpad on cube NoC
 ## Status
 Accepted (2026-05-20).
 ADR-0017 (Cube NOC and HBM Connectivity) describes SRAM as a cube-NoC
 attachment but does not specify the SRAM component's own latency / response
 model. This ADR fills that gap.
 ## First action
 Inside `_worker`, immediately after pulling a Transaction off `_inbox`, the
 very first action is `yield from self.run(env, txn.nbytes)`. Inside `run()`,
 the component applies `env.timeout(node.attrs["overhead_ns"])`
 (default `0.0`).
 In short, **SRAM's first act is "express access overhead as simulator time"**.
 After overhead, the worker yields `drain_ns` (the terminal BW-serialization
 cost stamped on the Transaction) and then constructs and dispatches a
 `ResponseMsg` on the reverse path.
 This differs from a generic `ComponentBase._worker`: SRAM knows it is a
 **terminal node**, so it does not go through `_forward_txn`. Its own worker
 explicitly performs `run → drain → _send_response`.
 ## Context
 The cube topology (`topology/builder.py`) creates the following named nodes
 per cube:
 - `sip{S}.cube{C}.m_cpu`
 - `sip{S}.cube{C}.sram`
 - `sip{S}.cube{C}.hbm_ctrl` (per-PE partitions)
 - `sip{S}.cube{C}.pe{P}` (and its PE-internal sub-components)
 SRAM is one of the cube-NoC attachments — `topology/mesh_gen.py` assigns it
 to the nearest router by placement coordinates and adds `"sram"` to that
 router's `attach` list. The builder lays bidirectional `sram ↔ router` edges
 (BW: `sram_to_router_bw_gbs`, default `128.0 GB/s`).
 SRAM has two intertwined roles:
 1. **Fabric terminal**: the endpoint for cube-NoC memory-access Transactions
   destined for SRAM. SRAM consumes access overhead + drain, then sends a
   response back on the reverse path.
 2. **One of the IPCQ slot tiers**: ADR-0023 D9.7 defines
   `buffer_kind ∈ {tcm, sram, hbm}`; the `sram` tier's per-access cost is
   `(512.0 GB/s, 2.0 ns)` in `common/ipcq_types._BUFFER_KIND_BW`. This is
   separate from the SRAM node's `overhead_ns` attr; PE_DMA accounts for it
   directly at the IPCQ slot-write moment.
 Without an ADR covering both roles, the following questions are ambiguous:
 - "What latency does SRAM model?" — fabric drain + overhead, or the IPCQ
  tier slot latency? — answers scatter.
 - What does the `size_mb` (`32`) attr mean in the future? Currently it is not
  used; SRAM only models timing.
 - Which cube router does SRAM attach to? (placement-based; lives in topology
  code only.)
 ## Decision
 ### D1. SRAM is a terminal scratchpad node on the cube NoC
 `SramComponent` extends `ComponentBase` but overrides `_worker` to express
 terminal semantics directly:
 ```
 while True:
    txn = yield self._inbox.get()
    yield from self.run(env, txn.nbytes)     # overhead_ns
    if drain_ns > 0: yield env.timeout(drain_ns)
    yield from self._send_response(env, txn)
 ```
 This pattern is necessary because SRAM must know the reverse path; the
 generic `_forward_txn` (which forwards to the next hop) does not fit a
 terminal.
 #### D1.1. Currently dormant — the `_worker` override is an unused path
 At the time of writing, **no component actually sends a Transaction to the
 SRAM node**. The verified references to the SRAM node ID are:
 - `policy/routing/router.py` and friends — guarantee path lookups.
 - `components/builtin/pe_dma.py::_handle_ipcq_inbound` — for
  `buffer_kind == "sram"`, computes the *path* to
  `bank_node = f"{cube_prefix}.sram"` via `compute_drain_ns(path, ...)` and
  yields a **local** timeout. The Transaction itself does not flow to the
  SRAM node (see D4).
 - `tests/test_routing.py` — checks connectivity via
  `find_path("sip0.cube0.pe0", "sip0.cube0.sram")`.
 So the `_worker` / `_send_response` override is currently a **dormant code
 path**. It is preserved deliberately:
 - Topology changes that route fabric Transactions to SRAM terminally (e.g.,
  explicit M_CPU → SRAM accesses) would activate it immediately.
 - ADR-0017's "cube-attached scratchpad" semantics naturally implies terminal
  behavior; the override is an intentional placeholder.
 A future ADR (or a revision to this one) will mark dormancy resolved when an
 actual sender is added.
 ### D2. ResponseMsg construction and reverse-path dispatch
 `_send_response`:
 1. `reverse_path = list(reversed(txn.path))` — derive the reverse path.
 2. Construct `ResponseMsg(correlation_id=txn.request.correlation_id,
   request_id=..., src_cube=<this cube>, src_pe=-1, success=True)`.
 3. Wrap in `Transaction(request=resp_msg, path=reverse_path, step=0,
   nbytes=0, done=env.event(), is_response=True)` and put on
   `out_ports[reverse_path[1]]`.
 4. If the reverse path is too short (fewer than 2 nodes, so there is no
   `reverse_path[1]`) or `ctx` is absent, fall back to calling the original
   `txn.done.succeed()`.
 `src_pe = -1` means "SRAM is not PE-localized". `src_cube` is parsed from the
 node ID (`sip{S}.cube{C}.sram`).
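The path reversal and `src_cube` parsing can be sketched in isolation. The helper names and the router node ID in the example path are hypothetical; constructing the actual `ResponseMsg` and `Transaction` is omitted because their full constructors are not shown here.

```python
import re


def parse_src_cube(node_id):
    """Extract C from a node ID of the form sip{S}.cube{C}.sram."""
    m = re.fullmatch(r"sip(\d+)\.cube(\d+)\.sram", node_id)
    if m is None:
        raise ValueError(f"not an SRAM node ID: {node_id!r}")
    return int(m.group(2))


def response_route(path):
    """Return (reverse_path, first_out_port), or None when the fallback applies."""
    reverse_path = list(reversed(path))
    if len(reverse_path) < 2:
        return None              # caller falls back to txn.done.succeed()
    return reverse_path, reverse_path[1]


path = ["sip0.cube1.pe0.pe_dma", "sip0.cube1.r3", "sip0.cube1.sram"]
print(parse_src_cube(path[-1]))
print(response_route(path))
```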
 ### D3. Timing parameters: `overhead_ns` and wire-side `drain_ns`
 - **Component-side latency**: `node.attrs["overhead_ns"]`. Default topology
  uses `2.0 ns`.
 - **Link-side serialization**: `drain_ns` arrives stamped on the Transaction
  — the wire-side BW serialization result from ADR-0015. SRAM only yields it.
 - The `size_mb` (default `32 MiB`) attr is currently timing-neutral. If a
  capacity-aware model is added in the future, a separate ADR will give it
  meaning.
 ### D4. IPCQ slot accounting is not modeled by the SRAM component
 Per ADR-0023 D9.7, the IPCQ slot-write latency for the SRAM tier is incurred
 inside PE_DMA's `_handle_ipcq_inbound`, which calls
 `slot_io_latency_ns("sram", nbytes)` using `_BUFFER_KIND_BW["sram"]`. That is:
 - When SRAM receives a fabric Transaction (D1, D2, D3 apply), it processes
  normally.
 - When an IPCQ slot lives on SRAM, PE_DMA pays the slot-write time directly —
  independent of the SRAM component.
 This separation is intentional: IPCQ is a fast path (sub-cycle slot
 bookkeeping) and does not traverse fabric Transactions, so SRAM does not need
 to know about IPCQ.
 ### D5. SRAM's cube-NoC attachment is placement-driven
 `topology/mesh_gen.py` reads `placement.sram.pos_mm` (default `[1.5, 9.0]` in
 `topology.yaml`) and adds `"sram"` to the nearest router's `attach`. The
 builder (`topology/builder.py`'s attachment loop) then lays bidirectional
 edges between the `sram` node and that router.
 This decision lives outside the SRAM component (mesh_gen / builder); the
 component does not know which router it sits on. It only relies on
 `txn.path` / `reverse_path` to reach it via a router.
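A rough sketch of placement-driven attachment follows. It assumes Euclidean distance and first-wins tie-breaking, neither of which is confirmed by the text; the router IDs and coordinates are made up, and `nearest_router` is a hypothetical helper, not the `mesh_gen.py` implementation.

```python
import math


def nearest_router(pos_mm, routers):
    """Return the ID of the router closest to pos_mm.

    routers maps router_id -> (x_mm, y_mm). Ties go to the first router in
    dict order (an assumption made for this sketch).
    """
    return min(routers, key=lambda rid: math.dist(pos_mm, routers[rid]))


routers = {"r0": (0.0, 8.0), "r1": (2.0, 9.0), "r2": (4.0, 4.0)}
attach = {rid: [] for rid in routers}

chosen = nearest_router((1.5, 9.0), routers)
attach[chosen].append("sram")
print(chosen, attach)
```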
 ### D6. SRAM is not a data store (timing only)
 Same context as ADR-0040 D6: the SRAM component models time only; the data
 payload (if any) lives in sim_engine's `memory_store`.
 ## Alternatives Considered
 ### A1. Use `_forward_txn` and route responses via separate nodes (à la IO_CPU / HBM_CTRL)
 Rejected. SRAM is a terminal on the cube NoC; adding a response node would
 introduce meaningless hops and violate ADR-0017's simplification spirit.
 ### A2. Model BW serialization inside SRAM with its own resource
 Rejected. Wire-side BW serialization (`drain_ns`) already captures it. An
 internal `simpy.Resource` would double-count against ADR-0015 (port/wire
 model).
 ### A3. Handle IPCQ slot accounting in the SRAM component
 Rejected. As D4 makes explicit, IPCQ is a fast path that does not traverse
 fabric Transactions. If SRAM knew about IPCQ, the responsibility would split
 across two places and obscure reasoning.
 ### A4. Capacity-aware latency from `size_mb`
 Rejected for now. The capacity is currently a visualizer label; introducing
 a capacity-aware timing model requires a dedicated ADR.
 ## Consequences
 - SRAM's timing model is pinned at ADR level as
  `overhead_ns + drain_ns + ResponseMsg(reverse_path)`. Any proposal to push
  IPCQ slot latency into the SRAM component can be refused with D4.
 - D3 records that `size_mb` is timing-neutral today, so a future
  capacity-aware model has a narrow compatibility scope.
 - D5 documents the placement-driven attachment, so changes to the SRAM
  coordinate have a clearly bounded impact (`mesh_gen` only).
@@ -0,0 +1,199 @@
 # ADR-0042: Tile Plan Generators — GEMM/Math Pipeline Plan Builders
 ## Status
 Accepted (2026-05-20).
 This ADR pins down `tiling.py` as a **plan-generator
 module**, not a SimPy component.
 ADR-0014 (PE Pipeline Execution Model) D6 (tile plan / self-routing) does not
 specify the tile-plan generation algorithm itself; this ADR fills that gap.
 ## First action
 When `generate_gemm_plan(M, K, N, tile_m, tile_k, tile_n, ..., pe_prefix,
 a_pinned, b_pinned, epilogue_specs)` is called, the very first action is
 **computing tile counts and constructing the PE-component ID strings**:
 ```
 M_tiles = max(1, ceil(M / tile_m))
 K_tiles = max(1, ceil(K / tile_k))
 N_tiles = max(1, ceil(N / tile_n))
 dma_id   = f"{pe_prefix}.pe_dma"
 fetch_id = f"{pe_prefix}.pe_fetch_store"
 gemm_id  = f"{pe_prefix}.pe_gemm"
 math_id  = f"{pe_prefix}.pe_math"
 ```
 In short, **the plan generator's first act is "compute ceiling tile counts
 and assemble the four sub-component IDs for this PE once"**. No SimPy event
 or environment is touched — this module is a pure function.
 `generate_math_plan(M, N, tile_m, tile_n, ..., math_op, src_addr, dst_addr,
 pe_prefix)` likewise begins by computing `M_tiles`, `N_tiles` and assembling
 three component IDs (`dma_id`, `fetch_id`, `math_id`).
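The first action described above can be sketched as a standalone helper. This is a minimal illustration, not the actual `tiling.py` code: the helper name `first_action` and the example prefix `"cube0.pe3"` are assumptions made for this sketch. Only the tile-count formulas and the four ID suffixes come from the text.

```python
from math import ceil


def first_action(M, K, N, tile_m, tile_k, tile_n, pe_prefix):
    """Hypothetical helper: ceiling tile counts plus the four PE sub-component IDs."""
    counts = {
        # max(1, ...) keeps a zero-sized dimension at one tile.
        "M_tiles": max(1, ceil(M / tile_m)),
        "K_tiles": max(1, ceil(K / tile_k)),
        "N_tiles": max(1, ceil(N / tile_n)),
    }
    ids = {
        "dma_id": f"{pe_prefix}.pe_dma",
        "fetch_id": f"{pe_prefix}.pe_fetch_store",
        "gemm_id": f"{pe_prefix}.pe_gemm",
        "math_id": f"{pe_prefix}.pe_math",
    }
    return counts, ids


# "cube0.pe3" is an illustrative prefix, not a name taken from the codebase.
counts, ids = first_action(300, 128, 64, tile_m=128, tile_k=64, tile_n=64,
                           pe_prefix="cube0.pe3")
```

Because the helper touches no SimPy object, it can be called from a plain Python session or a unit test without constructing an environment.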
 ## Context
 ADR-0014 D6 agreed that "PE_SCHEDULER, on receiving a CompositeCmd, generates
 a TilePlan and feeds self-routing tile tokens". But the **concrete plan
 generation algorithm** lives in `src/kernbench/components/builtin/tiling.py`,
 which:
 - Defines no component — it is a pair of **pure functions**
  (`generate_gemm_plan`, `generate_math_plan`).
 - Does not depend on the SimPy environment, queues, op_log, or hooks.
 - Returns a `PipelinePlan` (dataclass).
 The original G4 analysis incorrectly described `tiling.py` as a component;
 it is in fact a plan-builder helper consumed by PE_SCHEDULER. Pinning this
 down in its own ADR (paired with ADR-0014 D6) prevents:
 - Ambiguity over whether plan generation belongs to PE_SCHEDULER or a
  separate module.
 - Inconsistent rationale for stage sequences (e.g., FETCH/STORE position)
  between GEMM and Math plans.
 - Undocumented branching rationale for `a_pinned` / `b_pinned` /
  `epilogue_specs`.
 ## Decision
 ### D1. `tiling` is a pure plan-generator module, not a component
 `components/builtin/tiling.py` defines no `ComponentBase` subclass. It exports
 two module-level functions:
 - `generate_gemm_plan(...) -> PipelinePlan`
 - `generate_math_plan(...) -> PipelinePlan`
 There is no `tiling` node in the topology graph. It lives in `builtin/`
 because it is a direct helper for PE_SCHEDULER (ADR-0014 D6) and is
 conceptually a PE_SCHEDULER internal utility.
 ### D2. GEMM plan stage sequence — `M → N → K` order
 For each `(m, n, k)` tile, visited with `m` outermost and `k` innermost (shown for the default case; D3 and D4 describe how pinning and epilogue ops alter the sequence):
 ```
 [DMA_READ(A)] → [DMA_READ(B)] → FETCH → GEMM
                                ↑
                                ↓
 (last k tile only)              [MATH(output_tile)]* → STORE → DMA_WRITE
 ```
 `k_tile` epilogue inserts a MATH stage immediately after GEMM on every
 K-tile; `output_tile` epilogue inserts MATH stages once per `(m, n)` after
 the final K-tile but before STORE/DMA_WRITE. The K-loop accumulator stays
 in the register file across K-tiles — STORE/DMA_WRITE happens only when
 `last_k`.
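The loop structure behind this sequence can be sketched as follows. This is an illustrative model that emits stage names as strings, not the real `Stage` objects. The function name `gemm_stage_names` and the two op-name parameters are hypothetical, and operand pinning (D3) is left out to keep the sketch short.

```python
def gemm_stage_names(M_tiles, N_tiles, K_tiles, k_tile_ops=(), output_tile_ops=()):
    """Hypothetical sketch: stage names per (m, n, k) tile in M -> N -> K order."""
    tiles = []
    for m in range(M_tiles):
        for n in range(N_tiles):
            for k in range(K_tiles):
                last_k = k == K_tiles - 1
                stages = ["DMA_READ(A)", "DMA_READ(B)", "FETCH", "GEMM"]
                # k_tile epilogue: MATH right after GEMM on every K-tile.
                stages += [f"MATH({op})" for op in k_tile_ops]
                if last_k:
                    # output_tile epilogue: once per (m, n), before STORE.
                    stages += [f"MATH({op})" for op in output_tile_ops]
                    # The accumulator leaves the register file only here.
                    stages += ["STORE", "DMA_WRITE"]
                tiles.append(((m, n, k), stages))
    return tiles


example = gemm_stage_names(1, 1, 2, k_tile_ops=["dequant"],
                           output_tile_ops=["bias", "relu"])
```

In `example`, the first K-tile ends at `MATH(dequant)`, while the second (last) K-tile continues through `MATH(bias)`, `MATH(relu)`, `STORE`, and `DMA_WRITE`.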
 ### D3. Operand pinning — `a_pinned` / `b_pinned`
 If a caller passes `a_pinned=True`, **the A DMA_READ is omitted from every
 (m, n, k) tile**. Semantically: the caller (e.g., `tl.composite`) has already
 staged all of A in TCM via a prior `tl.load`, and signals so to the plan
 generator.
 The branch is made at plan time (not at runtime). Therefore the stage record
 count in op_log changes deterministically with pinning, and sweep analyses
 (e.g., gemm_sweep's stage record count) see this decision directly.
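To make the effect on stage counts concrete, the sketch below computes the number of stages in a GEMM plan in closed form. It rests on several assumptions that the text does not confirm: `b_pinned` is treated as symmetric to `a_pinned`, no epilogue ops are present, and each stage yields exactly one op_log record. The function name `gemm_stage_count` is hypothetical.

```python
def gemm_stage_count(M_tiles, N_tiles, K_tiles, a_pinned=False, b_pinned=False):
    """Hypothetical closed-form stage count for a GEMM plan without epilogue."""
    # Assumption: each unpinned operand contributes one DMA_READ per (m, n, k).
    reads = (0 if a_pinned else 1) + (0 if b_pinned else 1)
    per_k_tile = reads + 2   # FETCH + GEMM on every K-tile
    per_output_tile = 2      # STORE + DMA_WRITE on the last K-tile only
    return M_tiles * N_tiles * (K_tiles * per_k_tile + per_output_tile)


baseline = gemm_stage_count(2, 2, 4)
with_a_pinned = gemm_stage_count(2, 2, 4, a_pinned=True)
saved = baseline - with_a_pinned  # one DMA_READ fewer per (m, n, k) tile
```

Under these assumptions, pinning A removes exactly `M_tiles * N_tiles * K_tiles` stages, which is the kind of deterministic shift a sweep analysis would observe.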
 ### D4. Epilogue scope — `k_tile` vs `output_tile`
 `epilogue_specs` is an iterable of op-spec objects. Each op object is
 expected to have:
 - `op.kind: str` — math op name (e.g., `"dequant"`, `"bias"`, `"relu"`,
  `"scale"`). Placed into the stage's `params["op_kind"]`.
 - `op.scope: Scope` — `Scope.K_TILE` or `Scope.OUTPUT_TILE` (`Scope` enum
  in `kernbench.common.pe_commands`).
 - Op-specific extras (e.g., `bias`, `scale`, `factor`) — currently not used
  by the plan generator; consumed at runtime by PE_MATH.
 The plan generator partitions by `getattr(o, "scope", None)`:
 - `scope == Scope.K_TILE`: adds a MATH stage right after GEMM on every K-tile.
 - `scope == Scope.OUTPUT_TILE`: adds MATH stages just before STORE on the
  last K-tile per `(m, n)`.
 Ops with neither `scope` value (e.g., missing attribute) are **dropped
 silently** — `getattr(..., None) == Scope.X` is False for both. Picking a
 default (`output_tile`) is the **caller's responsibility** (e.g.,
 `tl.composite`), not the plan generator's. This aligns with ADR-0014's
 composite epilogue contract.
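The partitioning rule, including the silent drop, can be sketched as below. The `Scope` enum here is a local stand-in for the one in `kernbench.common.pe_commands` (its member values are assumed), and `EpilogueOp`, `NoScopeOp`, and `partition_epilogue` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    """Stand-in for kernbench.common.pe_commands.Scope; values are assumed."""
    K_TILE = "k_tile"
    OUTPUT_TILE = "output_tile"


@dataclass
class EpilogueOp:
    """Hypothetical op-spec carrying the two attributes the generator reads."""
    kind: str
    scope: Scope


class NoScopeOp:
    """Hypothetical op-spec that lacks a `scope` attribute."""
    kind = "scale"


def partition_epilogue(epilogue_specs):
    """Split ops by scope; ops matching neither scope are dropped silently."""
    specs = list(epilogue_specs or ())
    k_tile = [o for o in specs if getattr(o, "scope", None) == Scope.K_TILE]
    output_tile = [o for o in specs if getattr(o, "scope", None) == Scope.OUTPUT_TILE]
    return k_tile, output_tile


k_ops, out_ops = partition_epilogue([
    EpilogueOp("dequant", Scope.K_TILE),
    EpilogueOp("bias", Scope.OUTPUT_TILE),
    NoScopeOp(),  # appears in neither list
])
```

A caller that wants `output_tile` as the default would set `scope` before passing the op in, since the partition step never assigns one.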
 `Scope` is imported lazily inside the function to avoid the circular import
 path `pe_commands ← pe_types ← tiling`. This is intentional and not a refactor
 target: keeping `tiling` free of import-time `pe_commands` dependencies
 preserves the module boundary (D1).
 ### D5. Math plan stage sequence — `M → N` order
 For each `(m, n)` tile:
 ```
 DMA_READ → FETCH → MATH → STORE → DMA_WRITE
 ```
 There is no K dimension, so concepts like epilogue or accumulator residency
 do not apply. PE_FETCH_STORE's register-file accounting follows the same
 pattern as the GEMM plan.
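For comparison with the GEMM case, the Math plan's shape can be sketched in a few lines. As before, this emits stage names rather than real `Stage` objects, and the function name `math_stage_names` is hypothetical.

```python
from math import ceil

MATH_STAGES = ("DMA_READ", "FETCH", "MATH", "STORE", "DMA_WRITE")


def math_stage_names(M, N, tile_m, tile_n):
    """Hypothetical sketch: one fixed five-stage sequence per (m, n) tile."""
    M_tiles = max(1, ceil(M / tile_m))
    N_tiles = max(1, ceil(N / tile_n))
    # M -> N order: m is the outer loop, n the inner loop.
    return [((m, n), MATH_STAGES) for m in range(M_tiles) for n in range(N_tiles)]


math_tiles = math_stage_names(256, 100, tile_m=128, tile_n=64)
```

Every tile carries the same sequence, so the total stage count is simply `5 * M_tiles * N_tiles`.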
 ### D6. Plans are data — no SimPy dependency
 `PipelinePlan` is a dataclass in `pe_types.py` holding `tiles:
 list[TilePlan]`. Each `TilePlan` holds `stages: tuple[Stage, ...]`. The plan
 itself is near-immutable (only `Stage.params: dict` is mutable) and holds no
 SimPy objects.
 At runtime, PE_SCHEDULER consumes the plan's first stage, builds a `TileToken`,
 and feeds it into the pipeline. The TileToken carries `plan: TilePlan`,
 `stage_idx: int`, and a cached `params: dict`. Self-routing proceeds by
 `TileToken.advance()` caching the next stage's `params` (ADR-0014 D6).
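A minimal sketch of these data shapes and of self-routing follows. Only the names `PipelinePlan.tiles`, `TilePlan.stages`, `Stage.params`, `TileToken.plan`, `stage_idx`, `params`, and `advance()` come from the text. The `component_id` field, the `start_token` helper, and the boolean return value of `advance()` are assumptions made for illustration and may differ from `pe_types.py`.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    component_id: str                      # assumed field name
    params: dict = field(default_factory=dict)


@dataclass
class TilePlan:
    stages: tuple


@dataclass
class PipelinePlan:
    tiles: list


@dataclass
class TileToken:
    plan: TilePlan
    stage_idx: int = 0
    params: dict = field(default_factory=dict)

    def advance(self) -> bool:
        """Move to the next stage and cache its params; False when exhausted."""
        nxt = self.stage_idx + 1
        if nxt >= len(self.plan.stages):
            return False
        self.stage_idx = nxt
        self.params = self.plan.stages[nxt].params
        return True


def start_token(tile: TilePlan) -> TileToken:
    """Hypothetical helper: build a token positioned on the tile's first stage."""
    return TileToken(plan=tile, stage_idx=0, params=tile.stages[0].params)


tile = TilePlan(stages=(
    Stage("pe_dma", {"op": "read"}),
    Stage("pe_math", {"op": "relu"}),
    Stage("pe_dma", {"op": "write"}),
))
plan = PipelinePlan(tiles=[tile])

token = start_token(plan.tiles[0])
visited = [(token.plan.stages[token.stage_idx].component_id, token.params["op"])]
while token.advance():
    visited.append((token.plan.stages[token.stage_idx].component_id, token.params["op"]))
```

Nothing in the sketch imports SimPy, which mirrors the point of this decision: the plan and the token are plain data, and only the components that receive the token deal with simulated time.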
 ### D7. Plan generator contract — pure, deterministic, idempotent
 Two calls with identical inputs return equal `PipelinePlan` values
 (including `TilePlan.stages` order), though not necessarily the same object.
 This contract aligns with ADR-0014 D6's "deterministic tile dispatch order".
 No side effects (no SimPy events, no file I/O, no global state) — tests can
 call the generators directly without an environment object (some cases in
 `tests/test_pe_pipeline.py` rely on this).
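An environment-free determinism check might look like the sketch below. It uses a toy generator (`toy_plan`) as a stand-in for `generate_gemm_plan` / `generate_math_plan`, since their full signatures are not reproduced here. The test name and the toy stages are hypothetical and are not taken from `tests/test_pe_pipeline.py`.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    name: str
    params: dict = field(default_factory=dict)


def toy_plan(M_tiles, N_tiles):
    """Hypothetical stand-in for a pure plan generator."""
    return [
        (Stage("DMA_READ", {"m": m, "n": n}), Stage("MATH", {"m": m, "n": n}))
        for m in range(M_tiles)
        for n in range(N_tiles)
    ]


def test_plan_is_deterministic():
    first = toy_plan(2, 3)
    second = toy_plan(2, 3)
    assert first == second       # equal by value, stage order included
    assert first is not second   # but freshly built on each call


test_plan_is_deterministic()
```

Dataclasses compare field by field by default, so value equality is enough to express the contract without any custom comparison code.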
 ## Alternatives Considered
 ### A1. Make tiling a component (e.g., PE_PLANNER)
 Rejected. Plan generation consumes no SimPy time — it is a pure decision
 algorithm. Making it a component would (a) add unnecessary infrastructure
 (inbox, resources), and (b) split PE_SCHEDULER's flow into "receive plan"
 plus "feed tiles", inserting a meaningless hop.
 ### A2. Move plan generation into PE_SCHEDULER as methods
 Rejected (currently). Keeping the generators in a separate module provides
 (1) testability and (2) extensibility: an additional plan algorithm (e.g.,
 DTensor-aware) is just a new function. If plan kinds proliferate enough to
 require explicit dispatch, a future ADR can introduce a plan factory on
 PE_SCHEDULER.
 ### A3. Make plans fully immutable (frozen dataclass + tuple)
 Partially adopted. `Stage` and `TilePlan` are dataclasses but not frozen,
 because `Stage.params: dict` is populated at plan-generation time and read
 at runtime (cached by TileToken on advance). Replacing `dict` with an
 immutable mapping (e.g., `frozendict`) would incur migration cost without
 enough benefit. Convention: do not mutate after generation.
 ## Consequences
 - `tiling.py` is documented as a plan-generator module, not a component —
  preempting future G4-style "this component lacks an ADR" analyses.
 - The GEMM plan's stage sequence (D2) and pinning / epilogue branching
  (D3 / D4) are pinned, providing a clear interpretation basis for sweep
  analyses (e.g., `scripts/gemm_sweep.py`'s stage record counts).
 - The plan generator's pure contract (D7) enables environment-free testing
  in line with ADR-0013 (verification strategy).
 - Future plan kinds (DTensor-aware, K-major, ...) follow D1 / D6 / D7 as a
  baseline — just add a new function.