kernbench2/docs/adr/ADR-0039-dev-pe-mmu-component-model.md
ywkang 1f36baa898 ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)
Fill component-model coverage gaps surfaced by /report's G4 analysis.
Each ADR documents the component's First action, latency model, and
honest notes on dormant code or implementation asymmetries discovered
during re-evaluation against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding
  worker as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap;
  D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric
  with non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1);
  TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path;
  D1.1 flags _worker override as currently dormant (no Transaction
  actually targets the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects
  the G4 misclassification; pins GEMM/Math stage sequences and
  epilogue scope contract

Also: /report skill G3 refinement — only flag older->newer asymmetric
cross-references; newer->older (e.g., 0034-0037 citing infrastructure
ADRs) are expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:43:03 -07:00

# ADR-0039: PE_MMU Component Model — Component + Utility Dual Role
## Status
Accepted (2026-05-20).
ADR-0011 (PA/VA/LA address model) only states that "the VA model translates
VA→PA via PE_MMU"; this ADR pins down **the PE_MMU component's own behavior
model**.
## First action
At construction, read `node.attrs["page_size"]` (default `2 MiB`) and
`node.attrs["tlb_overhead_ns"]` (default `0.0`) and instantiate the internal
`PeMMU` utility object (`policy.address.pe_mmu.PeMMU`) exactly once. That
object is the single owner of the page table, the sub-page region lists, and
the TLB overhead value.
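
The construction step can be sketched as follows. This is a minimal illustration, not the actual `PeMmuComponent`: the `Node` class and the `PeMMU` constructor signature are assumptions made for the example, and only the attribute names and defaults come from the text above.

```python
from dataclasses import dataclass, field

MIB = 1024 * 1024


class PeMMU:
    """Stand-in for policy.address.pe_mmu.PeMMU (constructor args are assumed)."""

    def __init__(self, page_size: int, overhead_ns: float) -> None:
        self.page_size = page_size
        self.overhead_ns = overhead_ns


@dataclass
class Node:
    """Hypothetical topology node that carries free-form attributes."""

    attrs: dict = field(default_factory=dict)


class PeMmuComponent:
    def __init__(self, node: Node) -> None:
        # Read both attributes once, applying the documented defaults.
        page_size = int(node.attrs.get("page_size", 2 * MIB))
        overhead_ns = float(node.attrs.get("tlb_overhead_ns", 0.0))
        # The utility object is created exactly once and owns all MMU state.
        self._mmu = PeMMU(page_size, overhead_ns)

    @property
    def mmu(self) -> PeMMU:
        # Sibling engines hold this object and call translate() on it.
        return self._mmu
```

A node without attributes yields a 2 MiB page size and zero overhead; repeated reads of `mmu` return the same object.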
At runtime the first action splits into two paths:
- **Component path (inbox consumption)**: `_worker` pulls a Transaction off
`_inbox`; if `request` is a `MmuMapMsg`, call `self._mmu.map(va, pa, size)`
for each entry and then `txn.done.succeed()`. For `MmuUnmapMsg`, call
`unmap(va, size)`. Any other type falls through to standard `_forward_txn`.
In other words, **the component's first act is "apply map/unmap commands to
the page table"**.
- **Utility path (direct call)**: a sibling PE engine (PE_DMA / PE_GEMM) calls
`pe_mmu.mmu.translate(va)` directly. This path produces no SimPy events;
the caller (when `overhead_ns > 0`) issues a `yield env.timeout(mmu.overhead_ns)`
in its own process.
## Context
ADR-0011 defined three address models (PA/VA/LA) and agreed that "VA model =
translation via PE_MMU". But in code, `PeMmuComponent` performs two
complementary roles simultaneously:
1. **A topology-graph component**: it receives `MmuMapMsg` / `MmuUnmapMsg`
sideband messages over the cube NoC and updates the page table.
2. **A PE-local utility**: PE_DMA / PE_GEMM on the same PE call
`translate(va)` directly with zero SimPy latency (the caller pays
`overhead_ns` if any).
Without an ADR covering both roles, the following questions are ambiguous:
- "Why isn't there a SimPy event for the MMU translate?" (Answer: the caller
pays it.)
- What is the sub-page region model, and why? (The code docstring has it, but
no ADR — only a memory note `project_mmu_subpage_stopgap`.)
- Who sends map/unmap, and when must they be visible? (Ordering contract.)
Additionally, `PeMMU.map()` has "append, last-write-wins on overlap"
semantics: each call appends a region, and only overlapping regions shadow one
another. A one-PA-per-entry page table cannot express this. The behavior is a
deliberate **simulator stopgap** that lets DPPolicy shard below page
granularity (e.g., 128 B payloads against 4 KiB pages) without a later mapping
silently replacing an earlier, disjoint mapping in the same page and
misrouting its traffic. This deviation from real HW MMU semantics must be
ADR-pinned.
## Decision
### D1. Explicit dual role — component and utility
`PeMmuComponent` exposes two interfaces from a single class:
- Component interface: `_inbox` consumption, `_worker` loop (handles MMU
sideband messages).
- Utility interface: the `mmu` property exposes the underlying `PeMMU` object,
which PE_DMA / PE_GEMM hold directly and invoke `translate()` on.
The latter is **not a layer skip**: inside a PE, the engines and PE_MMU are
siblings under the "components" layer (ADR-0007). Cross-layer violations only
apply to runtime API ↔ sim_engine ↔ components boundaries.
### D2. Latency model — `translate()` is pure; caller owns the timeout
`PeMMU.translate()` is a pure function and yields nothing in SimPy. The caller
(a PE engine) issues `if mmu.overhead_ns > 0: yield env.timeout(mmu.overhead_ns)`
in its own process after translation.
Rationale: the PE engine process already holds its own `record_start` /
`record_end` (op_log) hooks, so keeping timing inside the caller's process
preserves consistent timing accounting. A separate MMU process would split the
engine's processing flow and blur op_log / pipeline overlap semantics.
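
A minimal sketch of the D2 contract, assuming a toy `PeMMU` and a stand-in environment in place of SimPy so the example runs with the standard library only. `FakeEnv`, `engine_step`, `run`, and the fixed-offset translation are hypothetical names and behavior chosen for illustration; only the `if mmu.overhead_ns > 0: yield env.timeout(...)` pattern comes from the text.

```python
class FakeEnv:
    """Stand-in for a SimPy environment: timeout() simply returns the delay."""

    def timeout(self, delay: float) -> float:
        return delay


class PeMMU:
    def __init__(self, overhead_ns: float = 0.0) -> None:
        self.overhead_ns = overhead_ns
        self._offset = 0x1000_0000  # toy mapping: PA = VA + fixed offset

    def translate(self, va: int) -> int:
        # Pure function: touches no environment and yields nothing.
        return va + self._offset


def engine_step(env, mmu, va):
    """Caller-side process: translate, then pay the overhead itself."""
    pa = mmu.translate(va)
    if mmu.overhead_ns > 0:
        yield env.timeout(mmu.overhead_ns)
    return pa


def run(gen):
    """Drive a generator to completion and sum the delays it yielded."""
    elapsed = 0.0
    try:
        while True:
            elapsed += next(gen)
    except StopIteration as stop:
        return stop.value, elapsed
```

With `overhead_ns = 0.0` the process finishes without yielding at all, which is why the default topology shows no MMU cost in the engine's timeline.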
#### D2.1. Current implementation asymmetry — pipeline vs non-pipeline (known)
At the time of writing, `pe_dma.py` handles MMU overhead differently in its
two call paths:
- **non-pipeline (`handle_command`)**: after `translate()`, applies
`if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`.
- **pipeline (`_do_pipeline_dma`)**: calls `translate()` only, **omitting**
the overhead timeout — though the comment says "same logic as non-pipeline
path", the behaviors differ.
In the default topology, `tlb_overhead_ns = 0.0`, so this asymmetry does not
manifest. With `tlb_overhead_ns > 0`, however, GEMM/Math work that goes through
the pipeline path appears faster than the equivalent non-pipeline workload, by
one `overhead_ns` for every translation whose timeout is skipped.
The D2 contract states that **all** callers pay the overhead; the pipeline
omission is **not an intentional design** — ADR-0014 D6 (pipeline self-routing)
does not exempt it. Remediation options (require a separate Phase 1/2):
- (a) Add `if mmu.overhead_ns > 0: yield env.timeout(...)` in
`_do_pipeline_dma` to align with D2 — **preferred**.
- (b) Narrow the D2 contract to "non-pipeline only" and document the pipeline
exemption in an ADR-0014 update — discouraged, since it weakens the
overhead's meaning.
This ADR recommends (a) and assumes a small follow-up change either before or
just after acceptance.
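
The size of the gap can be illustrated with a toy timing-accounting model. This is not the `pe_dma.py` code: `ToyMMU`, both functions, and the fixed `transfer_ns` per access are assumptions for the example, and the `charge_overhead` flag exists only to contrast the current behavior with option (a).

```python
from dataclasses import dataclass


@dataclass
class ToyMMU:
    overhead_ns: float

    def translate(self, va: int) -> int:
        return va  # identity mapping; only the timing matters here


def non_pipeline_ns(mmu: ToyMMU, vas, transfer_ns: float) -> float:
    """handle_command-style accounting: overhead is paid after each translate."""
    total = 0.0
    for va in vas:
        mmu.translate(va)
        if mmu.overhead_ns > 0:
            total += mmu.overhead_ns
        total += transfer_ns
    return total


def pipeline_ns(mmu: ToyMMU, vas, transfer_ns: float, charge_overhead: bool) -> float:
    """_do_pipeline_dma-style accounting.

    charge_overhead=False mirrors the omission described above;
    charge_overhead=True mirrors remediation option (a).
    """
    total = 0.0
    for va in vas:
        mmu.translate(va)
        if charge_overhead and mmu.overhead_ns > 0:
            total += mmu.overhead_ns
        total += transfer_ns
    return total
```

In this model, four translations with `overhead_ns = 2.0` leave the uncorrected pipeline path 8.0 ns ahead, and the gap vanishes when the overhead is 0.0 or when option (a) is applied.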
### D3. Page table structure — sub-page region list (stopgap)
`self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]]`
holds multiple disjoint regions per page.
- `map(va, pa, size)`: append one region per page touched, splitting the range
wherever it crosses a page boundary.
- `translate(va)`: look up regions for the VPN and iterate **in reverse** so
the most recent overlapping region wins (last-write-wins).
- `unmap(va, size)`: remove only regions whose extent is **fully contained**
within the unmap range; partial-overlap boundaries are left in place and the
caller is expected to unmap on the same boundaries used for map.
This is documented as a **simulator stopgap** that supplements the VA model
from ADR-0011. It prevents silent last-write-wins misrouting when DPPolicy
shards below page granularity. Memory note: `project_mmu_subpage_stopgap`.
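
The three operations above can be sketched as a small stand-alone class. This is an illustration under stated assumptions, not the real `PeMMU`: it assumes `end_in_page` is exclusive and that `pa_at_offset_zero` is the PA that page offset 0 would translate to, so a lookup returns `pa_at_offset_zero + offset`.

```python
class PageFault(Exception):
    """Raised when no region covers the requested VA."""


class PeMMU:
    def __init__(self, page_size: int = 2 * 1024 * 1024) -> None:
        self.page_size = page_size
        # vpn -> list of (start_in_page, end_in_page, pa_at_offset_zero)
        self._table = {}

    def map(self, va: int, pa: int, size: int) -> None:
        end = va + size
        while va < end:
            vpn, start = divmod(va, self.page_size)
            # Clip the chunk at the end of the current page.
            chunk = min(end - va, self.page_size - start)
            self._table.setdefault(vpn, []).append((start, start + chunk, pa - start))
            va += chunk
            pa += chunk

    def translate(self, va: int) -> int:
        vpn, off = divmod(va, self.page_size)
        # Reverse iteration: the most recently appended region wins on overlap.
        for start, stop, pa0 in reversed(self._table.get(vpn, [])):
            if start <= off < stop:
                return pa0 + off
        raise PageFault(hex(va))

    def unmap(self, va: int, size: int) -> None:
        end = va + size
        for vpn in range(va // self.page_size, (end - 1) // self.page_size + 1):
            base = vpn * self.page_size
            lo = max(va, base) - base
            hi = min(end, base + self.page_size) - base
            # Drop only regions fully contained in [lo, hi); keep partial overlaps.
            kept = [r for r in self._table.get(vpn, []) if not (lo <= r[0] and r[1] <= hi)]
            if kept:
                self._table[vpn] = kept
            else:
                self._table.pop(vpn, None)
```

Two 128 B mappings in the same 4 KiB page coexist in this sketch, which is the DPPolicy sharding case the stopgap targets; a one-PA-per-page table would keep only the later one.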
### D4. PageFault signals PA fallback
If `translate()` is called with an unmapped VA, `PageFault` is raised. PE_DMA
catches the exception and **uses the original address as a PA** (the PA-only
backward-compatibility path from ADR-0011). PageFault is therefore not an
error — it is the signal for "no VA mapping, interpret as PA".
This path is intentional and preserves backward compatibility with the
ADR-0011 PA-only mode.
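
The fallback can be sketched as a small helper. `resolve_address` and `ToyMMU` are hypothetical names for illustration; the actual handling lives inside PE_DMA and may be structured differently.

```python
class PageFault(Exception):
    """Signal for "no VA mapping"; not treated as an error by the caller."""


class ToyMMU:
    def __init__(self, mappings: dict) -> None:
        self._mappings = dict(mappings)  # va -> pa, exact-match toy table

    def translate(self, va: int) -> int:
        try:
            return self._mappings[va]
        except KeyError:
            raise PageFault(hex(va)) from None


def resolve_address(mmu: ToyMMU, addr: int) -> int:
    """Return the translated PA, or the address itself when it is unmapped."""
    try:
        return mmu.translate(addr)
    except PageFault:
        # PA-only backward-compatibility path: interpret the address as a PA.
        return addr
```

A mapped address resolves to its PA, while an unmapped address passes through unchanged.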
### D5. MMU sideband-message reception contract
`MmuMapMsg` / `MmuUnmapMsg` arrive over the fabric at PE_MMU's `_inbox`
(SPEC R10: "MMU map installation incurs measured fabric latency"). Schemas
live in `runtime_api/kernel.py`:
- `MmuMapMsg.entries: tuple[dict, ...]` — each dict is
`{"va": int, "pa": int, "size": int}`.
- `MmuUnmapMsg.entries: tuple[dict, ...]` — each dict is
`{"va": int, "size": int}`.
PE_MMU reception flow:
1. `_worker` does `_inbox.get()` for one message.
2. `hasattr(msg, "request")` confirms a Transaction wrapper.
3. `isinstance(msg.request, MmuMapMsg)` → for each entry, call
`self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"])`.
4. `isinstance(msg.request, MmuUnmapMsg)` → for each entry, call
`self._mmu.unmap(va=e["va"], size=e["size"])`.
5. Both signal `msg.done.succeed()` after completion.
An external caller (runtime API) waiting on `done` therefore receives a SimPy
guarantee that "the mapping is installed on-device". This is the realization
of ADR-0011's "MMU map installation incurs measured fabric latency".
This ADR does **not** define the **sender or fan-out policy** for the sideband
message — those are runtime API responsibilities. Only the receive contract
belongs here.
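
Steps 2 to 5 of the reception flow can be sketched for a single message as below. The message and transaction classes are simplified stand-ins (the real schemas live in `runtime_api/kernel.py`), `DoneEvent` models only `succeed()` instead of a real SimPy event, and `handle_one`, `RecordingMMU`, and the `forward` callback are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MmuMapMsg:
    entries: tuple  # each entry: {"va": int, "pa": int, "size": int}


@dataclass(frozen=True)
class MmuUnmapMsg:
    entries: tuple  # each entry: {"va": int, "size": int}


class DoneEvent:
    """Stand-in for a SimPy event; only succeed() is modeled."""

    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        self.triggered = True


@dataclass
class Transaction:
    request: object
    done: DoneEvent = field(default_factory=DoneEvent)


class RecordingMMU:
    """Records calls instead of maintaining a page table."""

    def __init__(self) -> None:
        self.calls = []

    def map(self, va, pa, size):
        self.calls.append(("map", va, pa, size))

    def unmap(self, va, size):
        self.calls.append(("unmap", va, size))


def handle_one(mmu, msg, forward) -> None:
    """Dispatch one inbox message the way the numbered steps describe."""
    request = getattr(msg, "request", None)
    if isinstance(request, MmuMapMsg):
        for e in request.entries:
            mmu.map(va=e["va"], pa=e["pa"], size=e["size"])
        msg.done.succeed()
    elif isinstance(request, MmuUnmapMsg):
        for e in request.entries:
            mmu.unmap(va=e["va"], size=e["size"])
        msg.done.succeed()
    else:
        # Anything else goes to generic forwarding.
        forward(msg)
```

`done` is signalled only after every entry has been applied, which is what lets a waiting sender treat completion as "mapping installed".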
### D6. Non-MMU Transactions delegate to generic forwarding
If a message pulled from `_inbox` is not `MmuMapMsg` / `MmuUnmapMsg` (or
lacks a `request` attribute), `_forward_txn` handles it normally. This keeps
the door open for future topologies where PE_MMU sits on a pass-through path —
current code never sends such traffic, but the routing remains safe.
## Alternatives Considered
### A1. Make `translate()` a SimPy generator
Rejected. As D2 explains, this blurs op_log / pipeline overlap accounting in
the PE engine.
### A2. Use small page size (e.g., 128 B) instead of sub-page regions
Rejected. Would explode page-table memory and cube-wide map message size. Most
mappings are 2 MiB; pushing the page size below that for the few DPPolicy
sharding cases inflates average cost.
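
The cost argument can be checked with simple arithmetic. The helper below is a hypothetical illustration that counts one table entry per page for an aligned mapping; it does not model the real page-table layout.

```python
MIB = 1024 * 1024


def entries_needed(mapping_bytes: int, page_bytes: int) -> int:
    """Number of page-sized entries needed to cover an aligned mapping."""
    return -(-mapping_bytes // page_bytes)  # ceiling division


# One 2 MiB mapping under the default page size versus a 128 B page size.
default_entries = entries_needed(2 * MIB, 2 * MIB)
small_page_entries = entries_needed(2 * MIB, 128)
```

Under these assumptions a single 2 MiB mapping needs 1 entry with 2 MiB pages and 16384 entries with 128 B pages, and the map message carrying those entries would grow by the same factor.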
### A3. Make PE_MMU a PE_CPU helper only (not a topology node)
Rejected. ADR-0011 requires that MMU map installation incur measured fabric
latency (via `MmuMapMsg`), which requires PE_MMU to be a node on the graph.
It also keeps cube NoC visualizer output consistent.
## Consequences
- PE_MMU's dual role is justified at ADR level, so future "unify into one"
refactor pressure has a documented counterpoint.
- The sub-page region model is explicitly labeled a stopgap, providing a
basis for deprecating it when the LA model (ADR-0011) lands.
- The "`translate()` does not yield" contract is locked in (D2), so any
future proposal to add an internal MMU timeout can be denied with a
documented rationale.
- PA fallback (D4) is normalized, preventing defensive logic from treating
PageFault as an error.