ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)

Fill component-model coverage gaps surfaced by /report's G4 analysis. Each ADR documents the component's First action, latency model, and honest notes on dormant code or implementation asymmetries discovered during re-evaluation against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding worker as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap; D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric with non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1); TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path; D1.1 flags _worker override as currently dormant (no Transaction actually targets the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects the G4 misclassification; pins GEMM/Math stage sequences and epilogue scope contract

Also, /report skill G3 refinement: only flag older->newer asymmetric cross-references; newer->older (e.g., 0034-0037 citing infrastructure ADRs) are expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:43:03 -07:00
parent 049e3d8bb3
commit 1f36baa898
11 changed files with 1747 additions and 3 deletions
@@ -0,0 +1,139 @@
+# ADR-0038: PCIE_EP Component Model
+
+## Status
+
+Accepted (2026-05-20).
+
+Companion to ADR-0035 (M_CPU), ADR-0036 (IO_CPU), and
+ADR-0037 (Forwarding) at the same component-model level.
+
+## First action
+
+Pull one Transaction from `_inbox` and let `_forward_txn` invoke `run()`, which
+applies a single `env.timeout(node.attrs["overhead_ns"])` for PCIe protocol
+handling. After that the standard `ComponentBase` worker rules take over: if
+`next_hop` exists, put the advanced Transaction on `out_ports[next_hop]`;
+otherwise consume `drain_ns` and call `txn.done.succeed()`.
+
+In other words, **PCIE_EP's first (and only) act is to spend the configured
+overhead as simulator time** — no routing decisions, no payload transformation,
+no MMIO decoding.
+
+## Context
+
+PCIE_EP is the **host ↔ device boundary** in the topology graph. The builder
+(`topology/builder.py`) creates an IO chiplet instance per SIP that contains
+`pcie_ep`, `io_cpu`, and `io_noc`, and lays bidirectional edges between the
+external `fabric.switch0` and each `pcie_ep`:
+
+- `switch → pcie_ep`: host → device traffic (MemoryWrite, MemoryRead,
+  KernelLaunch).
+- `pcie_ep → switch`: device-side outbound (e.g., cross-SIP IPCQ tokens).
+
+Inside the IO chiplet there are bidirectional `pcie_ep ↔ io_noc` edges, and
+from there traffic branches to `io_cpu` or to the cube-side `hbm_ctrl` path
+(see ADR-0036 IO_CPU model). The router and resolver already know — per SPEC
+R7 — that PCIE_EP is the endpoint for memory operations, so helpers like
+`find_pcie_ep(sip)` and `find_memory_path(pcie_ep, dst_node)` treat PCIE_EP as
+the start (or end) of the memory path.
+
+The problem is that all of this dependency lives in builder/router/resolver,
+while **PCIE_EP's own internal model has no ADR**. The consequence:
+
+- "What latency does PCIE_EP model?" requires reading the source.
+- The asymmetry with peer components (IO_CPU = ADR-0036, M_CPU = ADR-0035) is
+  awkward.
+- Future decisions about a more detailed PCIe link-layer model (TLP credits,
+  retry, MPS chunking) lack a documented baseline.
+
+This ADR pins down the current **thin PCIE_EP model** and records that this
+thinness is intentional (aligned with ADR-0033's latency-model simplification
+policy).
+
+## Decision
+
+### D1. PCIE_EP uses ComponentBase's generic forwarding worker as-is
+
+`PcieEpComponent` extends `ComponentBase` and does **not** override `_worker` or
+`_forward_txn`. Every Transaction flows through the standard sequence:
+
+1. `_fan_in` accumulates inbound messages (and reassembles Flits, per ADR-0033
+   Phase 2c) into `_inbox`.
+2. `_worker` pulls one message off `_inbox` and spawns
+   `env.process(self._forward_txn(env, txn))` for per-message pipelining.
+3. `_forward_txn` calls the op_log start hook → `run()` for latency → op_log
+   end hook.
+4. `run()` is a single line: `yield env.timeout(overhead_ns)`.
+5. If a next hop exists, `out_ports[next_hop].put(txn.advance())`. Otherwise
+   (terminal arrival) consume `drain_ns` and call `txn.done.succeed()`.
+
+### D2. The only timing parameter is `overhead_ns`
+
+Only `node.attrs["overhead_ns"]` is accepted as a latency parameter. The code
+default is `0.0`; `topology.yaml`'s IOChiplet `components.pcie_ep.attrs`
+supplies the real value (current topology: `overhead_ns: 5.0` ns).
+
+No separate BW-serialization resource (`simpy.Resource`), no queue depth, no
+retry model is introduced. Link-level BW serialization is handled wire-side —
+inside the IOChiplet by `pcie_ep_to_noc_bw_gbs = 256.0 GB/s`, and externally by
+the system's `io_ep_to_switch` link BW (ADR-0015 port/wire model). PCIE_EP
+itself takes no part in that accounting.
+
+### D3. PCIE_EP is direction-aware in topology but direction-blind in code
+
+The builder lays both `switch ↔ pcie_ep` and `pcie_ep ↔ io_noc` edges, so
+PCIE_EP serves:
+
+- inbound (host → device): forward Transactions arriving from the switch onto
+  io_noc-side next-hop.
+- outbound (device → host): forward Transactions arriving from io_noc/io_cpu
+  back to the switch.
+
+Both are handled by D1's generic forwarding worker; the component code never
+distinguishes direction (it just follows `txn.next_hop`).
+
+### D4. PCIE_EP is not Flit-aware (legacy reassembly path)
+
+`_FLIT_AWARE` is left at the inherited `False`, so `_fan_in` reassembles
+upstream-chunkified Flits into the parent Transaction before delivery to
+`_inbox` (aligned with ADR-0033 Phase 2c incremental rollout).
+
+A future PCIe TLP-level credit model would revisit D4.
+
+### D5. PCIE_EP is a **named node** for routing helpers
+
+`policy/routing/router.py` provides `find_pcie_ep(sip, io_id="io0")`,
+`find_all_pcie_eps()`, and `find_memory_path(pcie_ep, dst_node)` — all of
+which treat PCIE_EP as the start (or end) of the memory path. The component
+itself supplies no information to these helpers; the naming convention
+(`sip{S}.{io_id}.pcie_ep`) is guaranteed by the topology builder.
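+
+As a minimal sketch of the naming convention above: the helper names
+`pcie_ep_node_id` and `is_pcie_ep` are hypothetical and are not the router's
+`find_pcie_ep`; they only illustrate how a node ID of this shape could be
+built and recognized.
+
```python
def pcie_ep_node_id(sip: int, io_id: str = "io0") -> str:
    """Build the conventional node ID `sip{S}.{io_id}.pcie_ep` (illustrative helper)."""
    return f"sip{sip}.{io_id}.pcie_ep"


def is_pcie_ep(node_id: str) -> bool:
    """True when the ID has three dot-separated parts shaped like sipN.<io>.pcie_ep."""
    parts = node_id.split(".")
    return len(parts) == 3 and parts[0].startswith("sip") and parts[2] == "pcie_ep"
```
+
+Any renaming of component IDs would have to keep helpers of this kind and the
+builder in step, which is the blast radius D5 refers to.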
+
+## Alternatives Considered
+
+### A1. Full PCIe TLP-level model (credits, retry, MPS chunking)
+
+Rejected. Violates ADR-0033's "current latency model = abstract overhead + BW
+serialization" simplification. Host↔device protocol fidelity is explicitly
+out-of-scope in SPEC §5 "Non-Goals".
+
+### A2. Per-PCIE_EP `simpy.Resource` for in-flight cap
+
+Rejected. Host traffic is not a contention bottleneck in current workloads.
+Defer to a separate ADR if it becomes one (in which case D1 stays and D2 is
+extended).
+
+### A3. Merge PCIE_EP into IO_CPU
+
+Rejected. PCIE_EP is the protocol-boundary node first hit on the host side;
+IO_CPU is the device-side control-plane processing node (ADR-0036). Traffic
+fan-out and command decoding costs concentrate in IO_CPU, while PCIE_EP only
+expresses link-edge overhead. Merging them would mix two responsibilities and
+violate the spirit of ADR-0007 (runtime API/sim_engine boundaries).
+
+## Consequences
+
+- PCIE_EP gets an explicit model ADR despite having near-zero code — consistent
+  with peer component ADRs, lower maintenance friction.
+- A future PCIe-level refinement would supersede this ADR by extending D2/D4
+  in a new ADR.
+- D5 makes the named-node dependency explicit, so any future renaming of
+  component IDs has a clearly bounded blast radius.
@@ -0,0 +1,203 @@
+# ADR-0039: PE_MMU Component Model — Component + Utility Dual Role
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0011 (PA/VA/LA address model) only states that "the VA model translates
+VA→PA via PE_MMU"; this ADR pins down **the PE_MMU component's own behavior
+model**.
+
+## First action
+
+At construction, read `node.attrs["page_size"]` (default `2 MiB`) and
+`node.attrs["tlb_overhead_ns"]` (default `0.0`) and instantiate the internal
+`PeMMU` utility object (`policy.address.pe_mmu.PeMMU`) exactly once. That
+object is the single owner of the page table, the sub-page region lists, and
+the TLB overhead value.
+
+At runtime the first action splits into two paths:
+
+- **Component path (inbox consumption)**: `_worker` pulls a Transaction off
+  `_inbox`; if `request` is a `MmuMapMsg`, call `self._mmu.map(va, pa, size)`
+  for each entry and then `txn.done.succeed()`. For `MmuUnmapMsg`, call
+  `unmap(va, size)`. Any other type falls through to standard `_forward_txn`.
+  In other words, **the component's first act is "apply map/unmap commands to
+  the page table"**.
+- **Utility path (direct call)**: a sibling PE engine (PE_DMA / PE_GEMM) calls
+  `pe_mmu.mmu.translate(va)` directly. This path produces no SimPy events;
+  the caller (when `overhead_ns > 0`) issues a `yield env.timeout(mmu.overhead_ns)`
+  in its own process.
+
+## Context
+
+ADR-0011 defined three address models (PA/VA/LA) and agreed that "VA model =
+translation via PE_MMU". But in code, `PeMmuComponent` performs two
+complementary roles simultaneously:
+
+1. **A topology-graph component**: it receives `MmuMapMsg` / `MmuUnmapMsg`
+   sideband messages over the cube NoC and updates the page table.
+2. **A PE-local utility**: PE_DMA / PE_GEMM on the same PE call
+   `translate(va)` directly with zero SimPy latency (the caller pays
+   `overhead_ns` if any).
+
+Without an ADR covering both roles, the following questions are ambiguous:
+
+- "Why isn't there a SimPy event for the MMU translate?" (Answer: the caller
+  pays it.)
+- What is the sub-page region model, and why? (The code docstring has it, but
+  no ADR — only a memory note `project_mmu_subpage_stopgap`.)
+- Who sends map/unmap, and when must they be visible? (Ordering contract.)
+
+Additionally, `PeMMU.map()` has "append, last-write-wins on overlap"
+semantics, which cannot be expressed with a one-PA-per-entry page table.
+This is a deliberate **simulator stopgap**: it lets DPPolicy shard below page
+granularity (e.g., 128 B payloads against 4 KiB pages) without a later map
+silently replacing an earlier, non-overlapping mapping in the same page and
+misrouting its traffic. This deviation from real HW MMU semantics must be
+ADR-pinned.
+
+## Decision
+
+### D1. Explicit dual role — component and utility
+
+`PeMmuComponent` exposes two interfaces from a single class:
+
+- Component interface: `_inbox` consumption, `_worker` loop (handles MMU
+  sideband messages).
+- Utility interface: the `mmu` property exposes the underlying `PeMMU` object,
+  which PE_DMA / PE_GEMM hold directly and invoke `translate()` on.
+
+The latter is **not a layer skip**: inside a PE, the engines and PE_MMU are
+siblings under the "components" layer (ADR-0007). Cross-layer violations only
+apply to runtime API ↔ sim_engine ↔ components boundaries.
+
+### D2. Latency model — `translate()` is pure; caller owns the timeout
+
+`PeMMU.translate()` is a pure function and yields nothing in SimPy. The caller
+(a PE engine) issues `if mmu.overhead_ns > 0: yield env.timeout(mmu.overhead_ns)`
+in its own process after translation.
+
+Rationale: the PE engine process already holds its own `record_start` /
+`record_end` (op_log) hooks, so keeping timing inside the caller's process
+preserves consistent timing accounting. A separate MMU process would split the
+engine's processing flow and blur op_log / pipeline overlap semantics.
+
+#### D2.1. Current implementation asymmetry — pipeline vs non-pipeline (known)
+
+At the time of writing, `pe_dma.py` handles MMU overhead differently in its
+two call paths:
+
+- **non-pipeline (`handle_command`)**: after `translate()`, applies
+  `if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`.
+- **pipeline (`_do_pipeline_dma`)**: calls `translate()` only, **omitting**
+  the overhead timeout — though the comment says "same logic as non-pipeline
+  path", the behaviors differ.
+
+In the default topology, `tlb_overhead_ns = 0.0`, so this asymmetry does not
+manifest. With `tlb_overhead_ns > 0`, however, GEMM/Math work issued through
+the pipeline path finishes earlier than the equivalent non-pipeline workload
+by the omitted MMU overhead.
+
+The D2 contract states that **all** callers pay the overhead; the pipeline
+omission is **not an intentional design** — ADR-0014 D6 (pipeline self-routing)
+does not exempt it. Remediation options (require a separate Phase 1/2):
+
+- (a) Add `if mmu.overhead_ns > 0: yield env.timeout(...)` in
+  `_do_pipeline_dma` to align with D2 — **preferred**.
+- (b) Narrow the D2 contract to "non-pipeline only" and document the pipeline
+  exemption in an ADR-0014 update — discouraged, since it weakens the
+  overhead's meaning.
+
+This ADR recommends (a) and assumes a small follow-up change either before or
+just after acceptance.
+
+### D3. Page table structure — sub-page region list (stopgap)
+
+`self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]]`
+holds multiple disjoint regions per page.
+
+- `map(va, pa, size)`: append a region to each page the range touches; a
+  range that crosses a page boundary is split into one region per page.
+- `translate(va)`: look up regions for the VPN and iterate **in reverse** so
+  the most recent overlapping region wins (last-write-wins).
+- `unmap(va, size)`: remove only regions whose extent is **fully contained**
+  within the unmap range; partial-overlap boundaries are left in place and the
+  caller is expected to unmap on the same boundaries used for map.
+
+This is documented as a **simulator stopgap** that supplements the VA model
+from ADR-0011. It prevents silent last-write-wins misrouting when DPPolicy
+shards below page granularity. Memory note: `project_mmu_subpage_stopgap`.
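+
+The three operations above can be sketched in plain Python. This is a sketch
+under stated assumptions, not the repository's `PeMMU`: the class name
+`RegionTable` is hypothetical, and splitting a range at page boundaries is an
+assumption drawn from the `dict[vpn, list[...]]` layout.
+
```python
from collections import defaultdict


class PageFault(Exception):
    """Raised when no region covers the requested virtual address."""


class RegionTable:
    """Illustrative sub-page region table; not the repository's PeMMU class."""

    def __init__(self, page_size: int = 2 * 1024 * 1024) -> None:
        self.page_size = page_size
        # vpn -> list of (start_in_page, end_in_page, pa_at_offset_zero)
        self._table = defaultdict(list)

    def map(self, va: int, pa: int, size: int) -> None:
        end = va + size
        while va < end:
            vpn, start = divmod(va, self.page_size)
            chunk = min(end - va, self.page_size - start)
            # Store the PA that page offset 0 would have under this mapping.
            self._table[vpn].append((start, start + chunk, pa - start))
            va += chunk
            pa += chunk

    def translate(self, va: int) -> int:
        vpn, off = divmod(va, self.page_size)
        # Reverse iteration: the most recently appended region wins.
        for start, end, pa0 in reversed(self._table.get(vpn, [])):
            if start <= off < end:
                return pa0 + off
        raise PageFault(hex(va))

    def unmap(self, va: int, size: int) -> None:
        end = va + size
        for vpn in range(va // self.page_size, (end - 1) // self.page_size + 1):
            base = vpn * self.page_size
            # Keep any region that is not fully inside [va, end).
            self._table[vpn] = [
                r for r in self._table.get(vpn, [])
                if not (va <= base + r[0] and base + r[1] <= end)
            ]
```
+
+With a 4 KiB page, two 128 B shards mapped at offsets 0 and 128 of the same
+page both stay resolvable, which a one-PA-per-page table could not express.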
+
+### D4. PageFault signals PA fallback
+
+If `translate()` is called with an unmapped VA, `PageFault` is raised. PE_DMA
+catches the exception and **uses the original address as a PA** (the PA-only
+backward-compatibility path from ADR-0011). PageFault is therefore not an
+error — it is the signal for "no VA mapping, interpret as PA".
+
+This path is intentional and preserves backward compatibility with the
+ADR-0011 PA-only mode.
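+
+A minimal sketch of the fallback, assuming a toy whole-address lookup in place
+of the real `translate()`; the function name `resolve_address` is hypothetical
+and does not appear in `pe_dma.py`.
+
```python
class PageFault(Exception):
    """Signals that a virtual address has no mapping."""


def translate(page_table: dict, va: int) -> int:
    """Toy whole-address lookup standing in for the MMU's translate()."""
    try:
        return page_table[va]
    except KeyError:
        raise PageFault(hex(va)) from None


def resolve_address(page_table: dict, addr: int) -> int:
    """Return the translated PA, or treat addr as a PA when it is unmapped."""
    try:
        return translate(page_table, addr)
    except PageFault:
        # Not an error: no VA mapping means the address is already a PA.
        return addr
```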
+
+### D5. MMU sideband-message reception contract
+
+`MmuMapMsg` / `MmuUnmapMsg` arrive over the fabric at PE_MMU's `_inbox`
+(SPEC R10: "MMU map installation incurs measured fabric latency"). Schemas
+live in `runtime_api/kernel.py`:
+
+- `MmuMapMsg.entries: tuple[dict, ...]` — each dict is
+  `{"va": int, "pa": int, "size": int}`.
+- `MmuUnmapMsg.entries: tuple[dict, ...]` — each dict is
+  `{"va": int, "size": int}`.
+
+PE_MMU reception flow:
+
+1. `_worker` does `_inbox.get()` for one message.
+2. `hasattr(msg, "request")` confirms a Transaction wrapper.
+3. `isinstance(msg.request, MmuMapMsg)` → for each entry, call
+   `self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"])`.
+4. `isinstance(msg.request, MmuUnmapMsg)` → for each entry, call
+   `self._mmu.unmap(va=e["va"], size=e["size"])`.
+5. Both signal `msg.done.succeed()` after completion.
+
+An external caller (runtime API) waiting on `done` therefore has a SimPy-level
+guarantee that "the mapping is installed on-device" by the time the event
+fires. This realizes ADR-0011's "MMU map installation incurs measured fabric
+latency".
+
+This ADR does **not** define the **sender or fan-out policy** for the sideband
+message — those are runtime API responsibilities. Only the receive contract
+belongs here.
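+
+The receive-side dispatch can be sketched without SimPy. The two dataclasses
+below are stand-ins that carry only the documented `entries` field (the real
+schemas live in `runtime_api/kernel.py`); `apply_sideband` and `RecordingMmu`
+are hypothetical names used for illustration.
+
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MmuMapMsg:  # stand-in with the documented `entries` field only
    entries: tuple


@dataclass(frozen=True)
class MmuUnmapMsg:  # stand-in with the documented `entries` field only
    entries: tuple


def apply_sideband(request, mmu) -> bool:
    """Apply a map/unmap request to `mmu`; return False for any other type."""
    if isinstance(request, MmuMapMsg):
        for e in request.entries:
            mmu.map(va=e["va"], pa=e["pa"], size=e["size"])
        return True
    if isinstance(request, MmuUnmapMsg):
        for e in request.entries:
            mmu.unmap(va=e["va"], size=e["size"])
        return True
    return False  # caller would fall through to generic forwarding


class RecordingMmu:
    """Test double that records calls instead of editing a page table."""

    def __init__(self) -> None:
        self.calls = []

    def map(self, va, pa, size):
        self.calls.append(("map", va, pa, size))

    def unmap(self, va, size):
        self.calls.append(("unmap", va, size))
```
+
+In the component, a `True` result would be followed by `msg.done.succeed()`;
+a `False` result corresponds to the generic forwarding case.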
+
+### D6. Non-MMU Transactions delegate to generic forwarding
+
+If a message pulled from `_inbox` is not `MmuMapMsg` / `MmuUnmapMsg` (or
+lacks a `request` attribute), `_forward_txn` handles it normally. This keeps
+the door open for future topologies where PE_MMU sits on a pass-through path —
+current code never sends such traffic, but the routing remains safe.
+
+## Alternatives Considered
+
+### A1. Make `translate()` a SimPy generator
+
+Rejected. As D2 explains, this blurs op_log / pipeline overlap accounting in
+the PE engine.
+
+### A2. Use small page size (e.g., 128 B) instead of sub-page regions
+
+Rejected. Would explode page-table memory and cube-wide map message size. Most
+mappings are 2 MiB; pushing the page size below that for the few DPPolicy
+sharding cases inflates average cost.
+
+### A3. Make PE_MMU a PE_CPU helper only (not a topology node)
+
+Rejected. ADR-0011 requires that MMU map installation incur measured fabric
+latency (via `MmuMapMsg`), which requires PE_MMU to be a node on the graph.
+It also keeps cube NoC visualizer output consistent.
+
+## Consequences
+
+- PE_MMU's dual role is justified at ADR level, so future "unify into one"
+  refactor pressure has a documented counterpoint.
+- The sub-page region model is explicitly labeled a stopgap, providing a
+  basis for deprecating it when the LA model (ADR-0011) lands.
+- The "`translate()` does not yield" contract is locked in (D2), so any
+  future proposal to add an internal MMU timeout can be denied with a
+  documented rationale.
+- PA fallback (D4) is normalized, preventing defensive logic from treating
+  PageFault as an error.
@@ -0,0 +1,149 @@
+# ADR-0040: PE_TCM Component Model — Dual-Channel BW Serialization
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0014 (PE Pipeline Execution Model, D1) references PE_TCM as a "BW-based
+serialized scratchpad memory" but does not pin down the component's own model.
+This ADR fills that gap.
+
+## First action
+
+When `start()` is invoked, immediately create two `simpy.Resource(env, capacity=1)`
+instances and store them in `self._read_res` / `self._write_res`. These two
+resources are the single decision points that serialize the **read channel**
+and **write channel** to one in-flight request each.
+
+The runtime first action: `_worker` pulls a message off `_inbox` and branches
+by type:
+
+- `TcmRequest` (from `pe_fetch_store`): spawn `env.process(self._handle_tcm_request)`.
+  Hence **TCM's first act is "acquire the lock matching the direction
+  (read/write)"**. After lock acquisition, if `bw > 0 and nbytes > 0`, yield
+  `env.timeout(delay_ns = nbytes / bw)`, then `req.done.succeed()`.
+- Anything else (Transaction): spawn `env.process(self._forward_txn)` (legacy
+  fabric pass-through).
+
+At construction, `node.attrs["read_bw_gbs"]` and `node.attrs["write_bw_gbs"]`
+(default `512.0 GB/s` each) are captured and held.
+
+## Context
+
+In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives two kinds of traffic:
+
+1. **`TcmRequest` from PE_FETCH_STORE** — when moving data between TCM and
+   the register file, PE_FETCH_STORE sends a short sideband request to obtain
+   BW-serialized access latency (`direction = "read"` or `"write"`, `nbytes`,
+   `done` event).
+2. **Legacy Transaction forwarding** — a fallback in case TCM ends up as a
+   pass-through node on the fabric graph (not used by the current critical
+   path, but preserved).
+
+The problem: ADR-0014 only says "BW-based serialization" without specifying:
+
+- Read and write are **independent channels** running in parallel; only
+  same-direction concurrency serializes at `capacity=1`.
+- BW is split into two configurable values (`read_bw_gbs` / `write_bw_gbs`).
+- The formula is `delay_ns = nbytes / bw_gbs` (loose unit convention:
+  GB/s × ns ≈ B).
+- `nbytes == 0` still acquires the lock but skips the BW term.
+- `run()`'s `overhead_ns` (default `0.0`) is only used in the legacy fabric
+  forwarding path.
+
+Each of these points needs to be pinned in an ADR. In particular, "why are read and write
+separate channels" and "who owns the BW values" must be documented so that
+future changes (e.g., `capacity=2`) have a clear basis.
+
+## Decision
+
+### D1. Dual channel — read and write are independent resources
+
+`_read_res = simpy.Resource(env, capacity=1)`,
+`_write_res = simpy.Resource(env, capacity=1)`.
+Same-direction concurrent requests queue on the resource and serialize;
+opposite-direction requests proceed in parallel. This matches the hardware
+model where TCM has a dual-port (read + write) configuration, and it allows
+the simulator to express the GEMM-pipeline case where fetch (read) and store
+(write) overlap in time — modeled as BW-serialized inside each direction but
+independent across directions.
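+
+The serialization rule can be illustrated with a pure-Python stand-in for the
+two `capacity=1` resources. This sketch assumes requests are served in list
+order (FIFO) and takes a precomputed service time per request; the function
+name `tcm_schedule` is hypothetical.
+
```python
def tcm_schedule(requests):
    """Return finish times (ns) for (arrival_ns, direction, service_ns) requests.

    Each direction has one in-flight slot; requests are served in list order.
    """
    free_at = {"read": 0.0, "write": 0.0}  # when each channel is next idle
    finish = []
    for arrival_ns, direction, service_ns in requests:
        start = max(arrival_ns, free_at[direction])  # wait for the channel lock
        free_at[direction] = start + service_ns
        finish.append(free_at[direction])
    return finish
```
+
+Two reads arriving together finish one after the other, while a write that
+arrives at the same time finishes alongside the first read.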
+
+### D2. Per-channel BW model — `nbytes / bw_gbs`
+
+After lock acquisition, if `nbytes > 0 and bw > 0`, yield
+`env.timeout(nbytes / bw_gbs)`. The unit convention is GB/s × ns ≈ B,
+consistent with the simulator-wide loose convention (see ADR-0033).
+
+- `nbytes == 0`: BW term is zero, but the lock is acquired and released. This
+  is intentional: when a plan generator emits an empty fetch/store on the
+  PE_FETCH_STORE side, the op_log / channel accounting on the TCM side still
+  records one consumption.
+- `bw == 0` (config error): the timeout call is skipped (0-time pass). Should
+  not occur with normal settings.
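+
+As a worked example of the formula, assuming decimal units so that 1 GB/s
+equals 1 byte/ns (the helper name `tcm_delay_ns` is hypothetical):
+
```python
def tcm_delay_ns(nbytes: int, bw_gbs: float) -> float:
    """BW-serialization delay in ns, using 1 GB/s == 1 byte/ns (decimal GB)."""
    if nbytes <= 0 or bw_gbs <= 0:
        return 0.0  # empty request or zero-BW config: no BW term
    return nbytes / bw_gbs
```
+
+A 4096 B request at the default `512.0 GB/s` therefore costs 8 ns on its
+channel.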
+
+### D3. BW values come from `node.attrs.read_bw_gbs` / `write_bw_gbs`
+
+Defaults `512.0 GB/s`. The topology builder (`topology/builder.py`) passes
+these attrs when instantiating TCM from `pe_template`. Default changes should
+coincide with related decisions in ADR-0014 D1 or ADR-0033.
+
+### D4. TcmRequest schema is owned by PE_TCM
+
+`@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
+lives in `components/builtin/pe_tcm.py`. PE_FETCH_STORE imports the dataclass
+and only constructs/sends it. The caller does not define the schema because:
+
+- The meaning of BW serialization is TCM's responsibility — TCM decides which
+  fields drive serialization.
+- The valid-value check for `direction` (must be `"read"` or `"write"`) lives
+  in `_handle_tcm_request`'s if/else branch.
+
+### D5. Legacy Transaction forwarding path is preserved
+
+When `_worker` receives a non-`TcmRequest` message, it dispatches to
+`_forward_txn`, applying `run()`'s `overhead_ns`. The current standard PE
+pipeline does not route Transactions through TCM, but the path is kept to
+avoid breakage if fabric topology changes.
+
+This path is accounted for via standard Transaction op_log; the BW channel
+locks are **not** acquired (orthogonal to D1's usage).
+
+### D6. PE_TCM is not a data store (timing only)
+
+TCM models **time only**. The actual data payload is held by sim_engine's
+`memory_store` (when present); the TCM component never updates it.
+PE_FETCH_STORE obtains BW delay through `TcmRequest`, and register contents
+are handled separately in the data path (ADR-0020 2-pass data execution —
+Phase 2).
+
+## Alternatives Considered
+
+### A1. Single channel (`capacity=1` shared by read+write)
+
+Rejected. Would artificially serialize the normal-case overlap of fetch
+(read) and store (write) and yield an incorrect BW upper bound for the PE
+pipeline.
+
+### A2. `capacity > 1` (e.g., 2-banked TCM)
+
+Rejected. Current hardware model assumes a single bank. Multi-bank extension
+needs its own ADR that would supersede D1. Bumping capacity now would loosen
+the nominal serialization without raising the BW upper bound, producing less
+accurate modeling.
+
+### A3. Generalize BW formula to `nbytes / bw + overhead_ns`
+
+Rejected. `overhead_ns` is reserved for the legacy forwarding path (D5).
+Additional fetch/store-path overhead, if needed, belongs in PE_FETCH_STORE's
+`run()` or in a register-file access model — closer to the responsibility
+boundary.
+
+## Consequences
+
+- TCM's BW accounting is locked at ADR level. Questions arising from op_log
+  in GEMM/Math sweeps — "why did fetch and store overlap?", "why do only
+  same-direction requests serialize?" — resolve quickly to D1.
+- Future multi-bank TCM models or asymmetric read/write BW changes have a
+  clear blast radius (D1 / D2 / D3 — pick one).
+- D6 ("TCM is not a data store") sharpens the responsibility boundary with
+  ADR-0020 2-pass execution.
@@ -0,0 +1,195 @@
+# ADR-0041: Cube SRAM Component Model — terminal scratchpad on cube NoC
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0017 (Cube NOC and HBM Connectivity) describes SRAM as a cube-NoC
+attachment but does not specify the SRAM component's own latency / response
+model. This ADR fills that gap.
+
+## First action
+
+Inside `_worker`, immediately after pulling a Transaction off `_inbox`, the
+very first action is `yield from self.run(env, txn.nbytes)`. Inside `run()`,
+the component applies `env.timeout(node.attrs["overhead_ns"])`
+(default `0.0`).
+
+In short, **SRAM's first act is "express access overhead as simulator time"**.
+After overhead, the worker yields `drain_ns` (the terminal BW-serialization
+cost stamped on the Transaction) and then constructs and dispatches a
+`ResponseMsg` on the reverse path.
+
+This differs from a generic `ComponentBase._worker`: SRAM knows it is a
+**terminal node**, so it does not go through `_forward_txn`. Its own worker
+explicitly performs `run → drain → _send_response`.
+
+## Context
+
+The cube topology (`topology/builder.py`) creates the following named nodes
+per cube:
+
+- `sip{S}.cube{C}.m_cpu`
+- `sip{S}.cube{C}.sram`
+- `sip{S}.cube{C}.hbm_ctrl` (per-PE partitions)
+- `sip{S}.cube{C}.pe{P}` (and its PE-internal sub-components)
+
+SRAM is one of the cube-NoC attachments — `topology/mesh_gen.py` assigns it
+to the nearest router by placement coordinates and adds `"sram"` to that
+router's `attach` list. The builder lays bidirectional `sram ↔ router` edges
+(BW: `sram_to_router_bw_gbs`, default `128.0 GB/s`).
+
+SRAM has two intertwined roles:
+
+1. **Fabric terminal**: the endpoint for cube-NoC memory-access Transactions
+   destined for SRAM. SRAM consumes access overhead + drain, then sends a
+   response back on the reverse path.
+2. **One of the IPCQ slot tiers**: ADR-0023 D9.7 defines
+   `buffer_kind ∈ {tcm, sram, hbm}`; the `sram` tier's per-access cost is
+   `(512.0 GB/s, 2.0 ns)` in `common/ipcq_types._BUFFER_KIND_BW`. This is
+   separate from the SRAM node's `overhead_ns` attr; PE_DMA accounts for it
+   directly at the IPCQ slot-write moment.
+
+Without an ADR covering both roles, the following questions are ambiguous:
+
+- "What latency does SRAM model?" Fabric drain + overhead, or the IPCQ tier
+  slot latency? Today the answers are scattered across the code.
+- What does the `size_mb` attr (`32`) mean? It is currently unused; SRAM
+  models timing only.
+- Which cube router does SRAM attach to? (placement-based; lives in topology
+  code only.)
+
+## Decision
+
+### D1. SRAM is a terminal scratchpad node on the cube NoC
+
+`SramComponent` extends `ComponentBase` but overrides `_worker` to express
+terminal semantics directly:
+
+```
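+while True:
+    txn = yield self._inbox.get()             # one Transaction per iteration
+    yield from self.run(env, txn.nbytes)      # overhead_ns
+    # drain_ns: terminal BW-serialization cost stamped on the Transaction
+    if drain_ns > 0:
+        yield env.timeout(drain_ns)
+    yield from self._send_response(env, txn)  # ResponseMsg on the reverse path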
+```
+
+This pattern is necessary because SRAM must know the reverse path; the
+generic `_forward_txn` (which forwards to the next hop) does not fit a
+terminal.
+
+#### D1.1. Currently dormant — the `_worker` override is an unused path
+
+At the time of writing, **no component actually sends a Transaction to the
+SRAM node**. The verified references to the SRAM node ID are:
+
+- `policy/routing/router.py` and related helpers: path lookups only, which
+  guarantee that a route to the node exists.
+- `components/builtin/pe_dma.py::_handle_ipcq_inbound` — for
+  `buffer_kind == "sram"`, computes the *path* to
+  `bank_node = f"{cube_prefix}.sram"` via `compute_drain_ns(path, ...)` and
+  yields a **local** timeout. The Transaction itself does not flow to the
+  SRAM node (see D4).
+- `tests/test_routing.py` — checks connectivity via
+  `find_path("sip0.cube0.pe0", "sip0.cube0.sram")`.
+
+So the `_worker` / `_send_response` override is currently a **dormant code
+path**. It is preserved deliberately:
+
+- Topology changes that route fabric Transactions to SRAM terminally (e.g.,
+  explicit M_CPU → SRAM accesses) would activate it immediately.
+- ADR-0017's "cube-attached scratchpad" semantics naturally implies terminal
+  behavior; the override is an intentional placeholder.
+
+A future ADR (or a revision to this one) will mark dormancy resolved when an
+actual sender is added.
+
+### D2. ResponseMsg construction and reverse-path dispatch
+
+`_send_response`:
+
+1. `reverse_path = list(reversed(txn.path))` — derive the reverse path.
+2. Construct `ResponseMsg(correlation_id=txn.request.correlation_id,
+   request_id=..., src_cube=<this cube>, src_pe=-1, success=True)`.
+3. Wrap in `Transaction(request=resp_msg, path=reverse_path, step=0,
+   nbytes=0, done=env.event(), is_response=True)` and put on
+   `out_ports[reverse_path[1]]`.
+4. If the reverse path is too short (`< 2 hops`) or `ctx` is absent, fall
+   back to calling the original `txn.done.succeed()`.
+
+`src_pe = -1` means "SRAM is not PE-localized". `src_cube` is parsed from the
+node ID (`sip{S}.cube{C}.sram`).
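+
+Two pieces of this flow can be sketched in isolation: parsing `src_cube` from
+the node ID and picking the first hop of the reverse path. The function names
+and the router ID used in the usage note are hypothetical; the real
+`_send_response` also builds the `ResponseMsg` and `Transaction`.
+
```python
import re
from typing import List, Optional


def parse_src_cube(node_id: str) -> int:
    """Extract C from a node ID shaped like `sip{S}.cube{C}.sram`."""
    match = re.fullmatch(r"sip\d+\.cube(\d+)\.sram", node_id)
    if match is None:
        raise ValueError(f"not an SRAM node ID: {node_id!r}")
    return int(match.group(1))


def response_first_hop(path: List[str]) -> Optional[str]:
    """Return the first hop of the reversed path, or None if it is too short."""
    reverse_path = list(reversed(path))
    # reverse_path[0] is the SRAM node itself; [1] is where the response goes.
    return reverse_path[1] if len(reverse_path) >= 2 else None
```
+
+For a path such as `["sip0.cube0.pe0", "sip0.cube0.r1", "sip0.cube0.sram"]`,
+the response would leave through `sip0.cube0.r1`; a `None` result corresponds
+to the fallback in step 4.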
+
+### D3. Timing parameters: `overhead_ns` and wire-side `drain_ns`
+
+- **Component-side latency**: `node.attrs["overhead_ns"]`. Default topology
+  uses `2.0 ns`.
+- **Link-side serialization**: `drain_ns` arrives stamped on the Transaction
+  — the wire-side BW serialization result from ADR-0015. SRAM only yields it.
+- The `size_mb` (default `32 MiB`) attr is currently timing-neutral. If a
+  capacity-aware model is added in the future, a separate ADR will give it
+  meaning.
+
+### D4. IPCQ slot accounting is not modeled by the SRAM component
+
+Per ADR-0023 D9.7, the IPCQ slot-write latency for the SRAM tier is incurred
+inside PE_DMA's `_handle_ipcq_inbound`, which calls
+`slot_io_latency_ns("sram", nbytes)` using `_BUFFER_KIND_BW["sram"]`. That is:
+
+- When SRAM receives a fabric Transaction (D1, D2, D3 apply), it processes
+  normally.
+- When an IPCQ slot lives on SRAM, PE_DMA pays the slot-write time directly —
+  independent of the SRAM component.
+
+This separation is intentional: IPCQ is a fast path (sub-cycle slot
+bookkeeping) and does not traverse fabric Transactions, so SRAM does not need
+to know about IPCQ.
+
+### D5. SRAM's cube-NoC attachment is placement-driven
+
+`topology/mesh_gen.py` reads `placement.sram.pos_mm` (default `[1.5, 9.0]` in
+`topology.yaml`) and adds `"sram"` to the nearest router's `attach`. The
+builder (`topology/builder.py`'s attachment loop) then lays bidirectional
+edges between the `sram` node and that router.
+
+This decision lives outside the SRAM component (mesh_gen / builder); the
+component does not know which router it sits on. It only relies on
+`txn.path` / `reverse_path` to reach it via a router.
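+
+A minimal sketch of placement-driven attachment, assuming Euclidean distance
+and first-match tie-breaking; neither is stated for `mesh_gen.py`, and the
+router IDs and coordinates in the usage note are made up for illustration.
+
```python
import math


def nearest_router(pos_mm, routers):
    """Return the ID of the router closest (Euclidean) to `pos_mm`.

    `routers` maps a router ID to its (x_mm, y_mm) coordinate.
    """
    return min(routers, key=lambda rid: math.dist(pos_mm, routers[rid]))
```
+
+With routers at `(0.0, 0.0)`, `(2.0, 8.0)`, and `(6.0, 8.0)`, the default
+`pos_mm` of `[1.5, 9.0]` would attach to the second one.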
+
+### D6. SRAM is not a data store (timing only)
+
+Same context as ADR-0040 D6: the SRAM component models time only; the data
+payload (if any) lives in sim_engine's `memory_store`.
+
+## Alternatives Considered
+
+### A1. Use `_forward_txn` and route responses via separate nodes (à la IO_CPU / HBM_CTRL)
+
+Rejected. SRAM is a terminal on the cube NoC; adding a response node would
+introduce meaningless hops and violate ADR-0017's simplification spirit.
+
+### A2. Model BW serialization inside SRAM with its own resource
+
+Rejected. Wire-side BW serialization (`drain_ns`) already captures it. An
+internal `simpy.Resource` would double-count against ADR-0015 (port/wire
+model).
+
+### A3. Handle IPCQ slot accounting in the SRAM component
+
+Rejected. As D4 makes explicit, IPCQ is a fast path that does not traverse
+fabric Transactions. If SRAM knew about IPCQ, the responsibility would split
+across two places and obscure reasoning.
+
+### A4. Capacity-aware latency from `size_mb`
+
+Rejected for now. The capacity is currently a visualizer label; introducing
+a capacity-aware timing model requires a dedicated ADR.
+
+## Consequences
+
+- SRAM's timing model is pinned at ADR level as
+  `overhead_ns + drain_ns + ResponseMsg(reverse_path)`. Any proposal to push
+  IPCQ slot latency into the SRAM component can be refused with D4.
+- D3 records that `size_mb` is timing-neutral today, so a future
+  capacity-aware model has a narrow compatibility scope.
+- D5 documents the placement-driven attachment, so changes to the SRAM
+  coordinate have a clearly bounded impact (`mesh_gen` only).
@@ -0,0 +1,199 @@
+# ADR-0042: Tile Plan Generators — GEMM/Math Pipeline Plan Builders
+
+## Status
+
+Accepted (2026-05-20).
+
+This ADR pins down `tiling.py` as a **plan-generator module**, not a SimPy
+component.
+
+ADR-0014 (PE Pipeline Execution Model) D6 (tile plan / self-routing) does not
+specify the tile-plan generation algorithm itself; this ADR fills that gap.
+
+## First action
+
+When `generate_gemm_plan(M, K, N, tile_m, tile_k, tile_n, ..., pe_prefix,
+a_pinned, b_pinned, epilogue_specs)` is called, the very first action is
+**computing tile counts and constructing the PE-component ID strings**:
+
+```
+M_tiles = max(1, ceil(M / tile_m))
+K_tiles = max(1, ceil(K / tile_k))
+N_tiles = max(1, ceil(N / tile_n))
+dma_id   = f"{pe_prefix}.pe_dma"
+fetch_id = f"{pe_prefix}.pe_fetch_store"
+gemm_id  = f"{pe_prefix}.pe_gemm"
+math_id  = f"{pe_prefix}.pe_math"
+```
+
+In short, **the plan generator's first act is "compute ceiling tile counts
+and assemble the four sub-component IDs for this PE once"**. No SimPy event
+or environment is touched — this module is a pure function.
+
+`generate_math_plan(M, N, tile_m, tile_n, ..., math_op, src_addr, dst_addr,
+pe_prefix)` likewise begins by computing `M_tiles`, `N_tiles` and assembling
+three component IDs (`dma_id`, `fetch_id`, `math_id`).
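+
+As a rough illustration, the GEMM first action can be written as a standalone
+helper. `first_action` and the `"pe0"` prefix are hypothetical; only the
+tile-count formula and the ID suffixes come from this ADR.
+
+```python
+import math
+
+def first_action(M, K, N, tile_m, tile_k, tile_n, pe_prefix):
+    """Hypothetical helper mirroring the first action described above."""
+    # Ceiling division, clamped so a zero-sized dimension still yields one tile.
+    counts = tuple(
+        max(1, math.ceil(dim / tile))
+        for dim, tile in ((M, tile_m), (K, tile_k), (N, tile_n))
+    )
+    # The four PE sub-component IDs, assembled once per call.
+    ids = {
+        name: f"{pe_prefix}.{name}"
+        for name in ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_math")
+    }
+    return counts, ids
+
+# counts is ordered (M_tiles, K_tiles, N_tiles).
+counts, ids = first_action(130, 64, 256, 64, 32, 128, "pe0")
+```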
+
+## Context
+
+ADR-0014 D6 agreed that "PE_SCHEDULER, on receiving a CompositeCmd, generates
+a TilePlan and feeds self-routing tile tokens". But the **concrete plan
+generation algorithm** lives in `src/kernbench/components/builtin/tiling.py`,
+which:
+
+- Defines no component — it is a pair of **pure functions**
+  (`generate_gemm_plan`, `generate_math_plan`).
+- Does not depend on the SimPy environment, queues, op_log, or hooks.
+- Returns a `PipelinePlan` (dataclass).
+
+The original G4 analysis incorrectly described `tiling.py` as a component;
+it is in fact a plan-builder helper consumed by PE_SCHEDULER. Pinning this
+down in its own ADR (paired with ADR-0014 D6) prevents:
+
+- Ambiguity over whether plan generation belongs to PE_SCHEDULER or a
+  separate module.
+- Inconsistent rationale for stage sequences (e.g., FETCH/STORE position)
+  between GEMM and Math plans.
+- Undocumented branching rationale for `a_pinned` / `b_pinned` /
+  `epilogue_specs`.
+
+## Decision
+
+### D1. `tiling` is a pure plan-generator module, not a component
+
+`components/builtin/tiling.py` defines no `ComponentBase` subclass. It exports
+two module-level functions:
+
+- `generate_gemm_plan(...) -> PipelinePlan`
+- `generate_math_plan(...) -> PipelinePlan`
+
+There is no `tiling` node in the topology graph. It lives in `builtin/`
+because it is a direct helper for PE_SCHEDULER (ADR-0014 D6) and is
+conceptually a PE_SCHEDULER internal utility.
+
+### D2. GEMM plan stage sequence — `M → N → K` order
+
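+Each `(m, n, k)` tile runs the stages below. Bracketed stages are
+conditional: the DMA_READs are present by default and dropped under operand
+pinning (D3); the MATH stages appear only when epilogue ops are supplied (D4).
+
+```
+every k tile:      [DMA_READ(A)] → [DMA_READ(B)] → FETCH → GEMM → [MATH(k_tile)]*
+last k tile only:  ... → [MATH(output_tile)]* → STORE → DMA_WRITE
+```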
+
+`k_tile` epilogue inserts a MATH stage immediately after GEMM on every
+K-tile; `output_tile` epilogue inserts MATH stages once per `(m, n)` after
+the final K-tile but before STORE/DMA_WRITE. The K-loop accumulator stays
+in the register file across K-tiles — STORE/DMA_WRITE happens only when
+`last_k`.
+
+### D3. Operand pinning — `a_pinned` / `b_pinned`
+
+If a caller passes `a_pinned=True`, **the A DMA_READ is omitted from every
+(m, n, k) tile**; `b_pinned=True` does the same for the B DMA_READ.
+Semantically: the caller (e.g., `tl.composite`) has already staged the whole
+operand in TCM via a prior `tl.load`, and signals so to the plan generator.
+
+The branch is made at plan time (not at runtime). Therefore the stage record
+count in op_log changes deterministically with pinning, and sweep analyses
+(e.g., gemm_sweep's stage record count) see this decision directly.
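+
+The loop order from D2 and the plan-time pinning branch can be sketched
+together. This is a simplified stand-in, not `generate_gemm_plan`: it emits
+stage names only, ignores epilogue ops, and `gemm_stage_names` is a
+hypothetical name.
+
+```python
+from itertools import product
+
+def gemm_stage_names(M_tiles, N_tiles, K_tiles, a_pinned=False, b_pinned=False):
+    """Yield ((m, n, k), [stage names]) in M -> N -> K order (k innermost)."""
+    for m, n, k in product(range(M_tiles), range(N_tiles), range(K_tiles)):
+        stages = []
+        if not a_pinned:              # decided at plan time, not at runtime
+            stages.append("DMA_READ(A)")
+        if not b_pinned:
+            stages.append("DMA_READ(B)")
+        stages += ["FETCH", "GEMM"]
+        if k == K_tiles - 1:          # accumulator leaves only on the last k tile
+            stages += ["STORE", "DMA_WRITE"]
+        yield (m, n, k), stages
+
+def total_stages(**pinning):
+    return sum(len(s) for _, s in gemm_stage_names(2, 2, 3, **pinning))
+
+default_total = total_stages()
+pinned_total = total_stages(a_pinned=True)
+```
+
+With 2 x 2 x 3 tiles, pinning A removes exactly one planned stage per tile, so
+the two totals differ by the tile count (12).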
+
+### D4. Epilogue scope — `k_tile` vs `output_tile`
+
+`epilogue_specs` is an iterable of op-spec objects. Each op object is
+expected to have:
+
+- `op.kind: str` — math op name (e.g., `"dequant"`, `"bias"`, `"relu"`,
+  `"scale"`). Placed into the stage's `params["op_kind"]`.
+- `op.scope: Scope` — `Scope.K_TILE` or `Scope.OUTPUT_TILE` (`Scope` enum
+  in `kernbench.common.pe_commands`).
+- Op-specific extras (e.g., `bias`, `scale`, `factor`) — currently not used
+  by the plan generator; consumed at runtime by PE_MATH.
+
+The plan generator partitions by `getattr(o, "scope", None)`:
+
+- `scope == Scope.K_TILE`: adds a MATH stage right after GEMM on every K-tile.
+- `scope == Scope.OUTPUT_TILE`: adds MATH stages just before STORE on the
+  last K-tile per `(m, n)`.
+
+Ops with neither `scope` value (e.g., missing attribute) are **dropped
+silently** — `getattr(..., None) == Scope.X` is False for both. Picking a
+default (`output_tile`) is the **caller's responsibility** (e.g.,
+`tl.composite`), not the plan generator's. This aligns with ADR-0014's
+composite epilogue contract.
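+
+The partitioning rule, including the silent drop, can be sketched as below.
+The `Scope` class here is a stand-in for the enum in
+`kernbench.common.pe_commands` (its member values are assumed), and
+`partition_epilogue` is a hypothetical name.
+
+```python
+from enum import Enum
+from types import SimpleNamespace
+
+class Scope(Enum):  # stand-in; member values are assumptions
+    K_TILE = "k_tile"
+    OUTPUT_TILE = "output_tile"
+
+def partition_epilogue(epilogue_specs):
+    """Split op specs by scope; ops matching neither scope are dropped."""
+    specs = list(epilogue_specs or ())
+    k_tile = [o for o in specs if getattr(o, "scope", None) == Scope.K_TILE]
+    output_tile = [o for o in specs if getattr(o, "scope", None) == Scope.OUTPUT_TILE]
+    return k_tile, output_tile
+
+ops = [
+    SimpleNamespace(kind="dequant", scope=Scope.K_TILE),
+    SimpleNamespace(kind="bias", scope=Scope.OUTPUT_TILE),
+    SimpleNamespace(kind="relu"),  # no scope attribute: lands in neither list
+]
+k_ops, out_ops = partition_epilogue(ops)
+```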
+
+`Scope` is imported lazily inside the function to avoid the circular import
+`pe_commands ← pe_types ← tiling`. This is intentional and not a refactor
+target: keeping `tiling` free of module-level (import-time) `pe_commands`
+dependencies preserves the module boundary (D1).
+
+### D5. Math plan stage sequence — `M → N` order
+
+For each `(m, n)` tile:
+
+```
+DMA_READ → FETCH → MATH → STORE → DMA_WRITE
+```
+
+There is no K dimension, so concepts like epilogue or accumulator residency
+do not apply. PE_FETCH_STORE's register-file accounting follows the same
+pattern as the GEMM plan.
+
+### D6. Plans are data — no SimPy dependency
+
+`PipelinePlan` is a dataclass in `pe_types.py` holding `tiles:
+list[TilePlan]`. Each `TilePlan` holds `stages: tuple[Stage, ...]`. The plan
+itself is near-immutable (only `Stage.params: dict` is mutable) and holds no
+SimPy objects.
+
+At runtime, PE_SCHEDULER consumes the plan's first stage, builds a `TileToken`,
+and feeds it into the pipeline. The TileToken carries `plan: TilePlan`,
+`stage_idx: int`, and a cached `params: dict`. Self-routing proceeds by
+`TileToken.advance()` caching the next stage's `params` (ADR-0014 D6).
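+
+A minimal sketch of the data shapes and of self-routing follows. It is not the
+code in `pe_types.py`: the `Stage.component` field, the `advance()` return
+value, and the example params are assumptions; only the `TileToken` field
+names come from this ADR.
+
+```python
+from dataclasses import dataclass, field
+
+@dataclass
+class Stage:
+    component: str                     # assumed field: target component ID
+    params: dict = field(default_factory=dict)
+
+@dataclass
+class TilePlan:
+    stages: tuple
+
+@dataclass
+class TileToken:
+    plan: TilePlan
+    stage_idx: int = 0
+    params: dict = field(default_factory=dict)
+
+    def advance(self):
+        """Step to the next stage and cache its params; False when exhausted."""
+        self.stage_idx += 1
+        if self.stage_idx >= len(self.plan.stages):
+            return False
+        self.params = self.plan.stages[self.stage_idx].params
+        return True
+
+plan = TilePlan(stages=(
+    Stage("pe0.pe_dma", {"op": "read"}),
+    Stage("pe0.pe_fetch_store", {"op": "fetch"}),
+))
+# The scheduler seeds the token from the plan's first stage.
+token = TileToken(plan=plan, params=plan.stages[0].params)
+```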
+
+### D7. Plan generator contract — pure, deterministic, idempotent
+
+Two calls with identical inputs return equal `PipelinePlan` values (same
+tiles, same `TilePlan.stages` order), though not necessarily the same object.
+This contract aligns with ADR-0014 D6's "deterministic tile dispatch order".
+
+No side effects (no SimPy events, no file I/O, no global state) — tests can
+call the generators directly without an environment object (some cases in
+`tests/test_pe_pipeline.py` rely on this).
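+
+An environment-free check of this contract might look like the sketch below.
+`ToyPlan` and `toy_generate` are hypothetical stand-ins for `PipelinePlan` and
+the real generators, used only so the example runs without kernbench.
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class ToyPlan:
+    tiles: list
+
+def toy_generate(m_tiles, n_tiles):
+    """Pure stand-in generator: no SimPy, no I/O, no global state."""
+    return ToyPlan(tiles=[(m, n) for m in range(m_tiles) for n in range(n_tiles)])
+
+first = toy_generate(2, 3)
+second = toy_generate(2, 3)
+same_value = first == second    # dataclass __eq__ compares field by field
+same_object = first is second   # separate calls build separate objects
+```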
+
+## Alternatives Considered
+
+### A1. Make tiling a component (e.g., PE_PLANNER)
+
+Rejected. Plan generation consumes no SimPy time — it is a pure decision
+algorithm. Making it a component would (a) add unnecessary infrastructure
+(inbox, resources), and (b) split PE_SCHEDULER's flow into "receive plan"
+plus "feed tiles", inserting a meaningless hop.
+
+### A2. Move plan generation into PE_SCHEDULER as methods
+
+Rejected (currently). Module separation provides (1) testability and
+(2) extensibility: an additional plan algorithm (e.g., DTensor-aware) is
+just a new function. If plan kinds proliferate enough to require explicit
+dispatch, a future ADR can introduce a plan factory on PE_SCHEDULER.
+
+### A3. Make plans fully immutable (frozen dataclass + tuple)
+
+Partially adopted. `Stage` and `TilePlan` are dataclasses but not frozen,
+because `Stage.params: dict` is populated at plan-generation time and read
+at runtime (cached by TileToken on advance). Moving dict → frozendict pays
+migration cost without enough benefit. Convention: do not mutate after
+generation.
+
+## Consequences
+
+- `tiling.py` is documented as a plan-generator module, not a component —
+  preempting future G4-style "this component lacks an ADR" analyses.
+- The GEMM plan's stage sequence (D2) and pinning / epilogue branching
+  (D3 / D4) are pinned, providing a clear interpretation basis for sweep
+  analyses (e.g., `scripts/gemm_sweep.py`'s stage record counts).
+- The plan generator's pure contract (D7) enables environment-free testing
+  in line with ADR-0013 (verification strategy).
+- Future plan kinds (DTensor-aware, K-major, ...) follow D1 / D6 / D7 as a
+  baseline — just add a new function.