Fill component-model coverage gaps surfaced by /report's G4 analysis.

Each ADR documents the component's First action, latency model, and honest notes on dormant code or implementation asymmetries discovered during re-evaluation against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding worker as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap; D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric with non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1); TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path; D1.1 flags _worker override as currently dormant (no Transaction actually targets the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects the G4 misclassification; pins GEMM/Math stage sequences and epilogue scope contract

Also, /report skill G3 refinement: only flag older->newer asymmetric cross-references; newer->older (e.g., 0034-0037 citing infrastructure ADRs) are expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-0039: PE_MMU Component Model — Component + Utility Dual Role
Status
Accepted (2026-05-20).
ADR-0011 (PA/VA/LA address model) only states that "the VA model translates VA→PA via PE_MMU"; this ADR pins down the PE_MMU component's own behavior model.
First action
At construction, read node.attrs["page_size"] (default 2 MiB) and
node.attrs["tlb_overhead_ns"] (default 0.0) and instantiate the internal
PeMMU utility object (policy.address.pe_mmu.PeMMU) exactly once. That
object is the single owner of the page table, the sub-page region lists, and
the TLB overhead value.
At runtime the first action splits into two paths:
- Component path (inbox consumption): _worker pulls a Transaction off _inbox; if request is a MmuMapMsg, call self._mmu.map(va, pa, size) for each entry and then txn.done.succeed(). For MmuUnmapMsg, call unmap(va, size). Any other type falls through to standard _forward_txn. In other words, the component's first act is "apply map/unmap commands to the page table".
- Utility path (direct call): a sibling PE engine (PE_DMA / PE_GEMM) calls pe_mmu.mmu.translate(va) directly. This path produces no SimPy events; the caller (when overhead_ns > 0) issues a yield env.timeout(mmu.overhead_ns) in its own process.
Context
ADR-0011 defined three address models (PA/VA/LA) and agreed that "VA model =
translation via PE_MMU". But in code, PeMmuComponent performs two
complementary roles simultaneously:
- A topology-graph component: it receives MmuMapMsg / MmuUnmapMsg sideband messages over the cube NoC and updates the page table.
- A PE-local utility: PE_DMA / PE_GEMM on the same PE call translate(va) directly with zero SimPy latency (the caller pays overhead_ns if any).
Without an ADR covering both roles, the following questions are ambiguous:
- "Why isn't there a SimPy event for the MMU translate?" (Answer: the caller pays it.)
- What is the sub-page region model, and why? (The code docstring has it, but there is no ADR, only a memory note project_mmu_subpage_stopgap.)
- Who sends map/unmap, and when must they be visible? (Ordering contract.)
Additionally, PeMMU.map() has "append, last-write-wins on overlap"
semantics, which cannot be expressed with a one-PA-per-entry page table.
This is a deliberate simulator stopgap to support DPPolicy sub-page sharding
(e.g., 128 B payloads against 4 KiB pages): disjoint sub-page mappings within
one page coexist instead of silently overwriting one another, which would
misroute accesses. This deviation from real HW MMU semantics must be
ADR-pinned.
Decision
D1. Explicit dual role — component and utility
PeMmuComponent exposes two interfaces from a single class:
- Component interface: _inbox consumption, _worker loop (handles MMU sideband messages).
- Utility interface: the mmu property exposes the underlying PeMMU object, which PE_DMA / PE_GEMM hold directly and invoke translate() on.
The latter is not a layer skip: inside a PE, the engines and PE_MMU are siblings under the "components" layer (ADR-0007). Cross-layer violations only apply to runtime API ↔ sim_engine ↔ components boundaries.
D2. Latency model — translate() is pure; caller owns the timeout
PeMMU.translate() is a pure function and yields nothing in SimPy. The caller
(a PE engine) issues if mmu.overhead_ns > 0: yield env.timeout(mmu.overhead_ns)
in its own process after translation.
Rationale: the PE engine process already holds its own record_start /
record_end (op_log) hooks, so keeping timing inside the caller's process
preserves consistent timing accounting. A separate MMU process would split the
engine's processing flow and blur op_log / pipeline overlap semantics.
D2.1. Current implementation asymmetry — pipeline vs non-pipeline (known)
At the time of writing, pe_dma.py handles MMU overhead differently in its
two call paths:
- non-pipeline (handle_command): after translate(), applies if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns).
- pipeline (_do_pipeline_dma): calls translate() only, omitting the overhead timeout. Although the comment says "same logic as non-pipeline path", the behaviors differ.
In the default topology, tlb_overhead_ns = 0.0, so this asymmetry does not
manifest. With tlb_overhead_ns > 0, however, GEMM/Math work issued via the
pipeline path appears faster than the equivalent non-pipeline workload by the
omitted MMU overhead.
The D2 contract states that all callers pay the overhead; the pipeline omission is not an intentional design — ADR-0014 D6 (pipeline self-routing) does not exempt it. Remediation options (require a separate Phase 1/2):
- (a) Add if mmu.overhead_ns > 0: yield env.timeout(...) in _do_pipeline_dma to align with D2. Preferred.
- (b) Narrow the D2 contract to "non-pipeline only" and document the pipeline exemption in an ADR-0014 update. Discouraged, since it weakens the overhead's meaning.
This ADR recommends (a) and assumes a small follow-up change either before or just after acceptance.
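The gap described in D2.1 can be illustrated with a small model. This is a sketch only: FakeEnv stands in for a SimPy environment (its timeout() simply returns the delay), StubMmu is a hypothetical identity-translating MMU, TRANSFER_NS is an invented payload time, and none of the function names below are taken from pe_dma.py.

```python
TRANSFER_NS = 100.0  # hypothetical payload transfer time, for illustration only


class FakeEnv:
    """Stand-in for a SimPy environment: timeout() just returns the delay."""

    def timeout(self, delay: float) -> float:
        return delay


class StubMmu:
    """Hypothetical MMU: identity translation plus a configurable overhead."""

    def __init__(self, overhead_ns: float) -> None:
        self.overhead_ns = overhead_ns

    def translate(self, va: int) -> int:
        return va  # pure: yields nothing, as D2 requires


def non_pipeline(env, mmu, va):
    """Caller pays the MMU overhead itself, then transfers."""
    mmu.translate(va)
    if mmu.overhead_ns > 0:
        yield env.timeout(mmu.overhead_ns)
    yield env.timeout(TRANSFER_NS)


def pipeline_as_described(env, mmu, va):
    """Mirrors the behavior D2.1 reports: translate only, no overhead."""
    mmu.translate(va)
    yield env.timeout(TRANSFER_NS)


def pipeline_option_a(env, mmu, va):
    """Remediation (a): same overhead check as the non-pipeline path."""
    mmu.translate(va)
    if mmu.overhead_ns > 0:
        yield env.timeout(mmu.overhead_ns)
    yield env.timeout(TRANSFER_NS)


def elapsed_ns(process) -> float:
    """Drive a generator to completion and sum the delays it yielded."""
    return sum(process)
```

With overhead_ns = 0.0 all three paths report the same elapsed time, which is why the default topology hides the asymmetry. With any positive overhead, pipeline_as_described falls short of the other two by exactly that overhead per translation.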
D3. Page table structure — sub-page region list (stopgap)
self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]]
holds multiple disjoint regions per page.
- map(va, pa, size): append regions, one per page when the range crosses a page boundary.
- translate(va): look up regions for the VPN and iterate in reverse so the most recent overlapping region wins (last-write-wins).
- unmap(va, size): remove only regions whose extent is fully contained within the unmap range; partial-overlap boundaries are left in place and the caller is expected to unmap on the same boundaries used for map.
This is documented as a simulator stopgap that supplements the VA model
from ADR-0011. It prevents silent last-write-wins misrouting when DPPolicy
shards below page granularity. Memory note: project_mmu_subpage_stopgap.
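The three operations above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PeMMU source: the class name SubPageTable is hypothetical, end_in_page is assumed to be exclusive, and pa_at_offset_zero is assumed to mean "the PA that in-page offset 0 would map to", so that translation is a single addition.

```python
from collections import defaultdict


class PageFault(Exception):
    """Raised when no region covers the requested virtual address."""


class SubPageTable:
    """Illustrative stand-in for the sub-page region table (assumed layout)."""

    def __init__(self, page_size: int = 2 * 1024 * 1024) -> None:
        self.page_size = page_size
        # vpn -> list of (start_in_page, end_in_page, pa_at_offset_zero)
        self._table = defaultdict(list)

    def map(self, va: int, pa: int, size: int) -> None:
        """Append one region per page touched by [va, va + size)."""
        end = va + size
        while va < end:
            vpn, start = divmod(va, self.page_size)
            chunk = min(end - va, self.page_size - start)
            # pa - start is the PA that in-page offset 0 would map to.
            self._table[vpn].append((start, start + chunk, pa - start))
            va += chunk
            pa += chunk

    def translate(self, va: int) -> int:
        """Pure lookup: newest region first, so the last write wins."""
        vpn, offset = divmod(va, self.page_size)
        for start, stop, pa0 in reversed(self._table.get(vpn, [])):
            if start <= offset < stop:
                return pa0 + offset
        raise PageFault(hex(va))

    def unmap(self, va: int, size: int) -> None:
        """Drop only regions fully contained in [va, va + size)."""
        end = va + size
        first_vpn = va // self.page_size
        last_vpn = (end - 1) // self.page_size
        for vpn in range(first_vpn, last_vpn + 1):
            base = vpn * self.page_size
            lo, hi = va - base, end - base  # unmap range, page-local
            kept = [
                region
                for region in self._table.get(vpn, [])
                if not (lo <= region[0] and region[1] <= hi)
            ]
            if kept:
                self._table[vpn] = kept
            else:
                self._table.pop(vpn, None)
```

For example, with page_size=4096, mapping 128 B at VA 0x1000 and another 128 B at VA 0x1080 to unrelated PAs leaves both translations intact, whereas a one-PA-per-page table would keep only the second.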
D4. PageFault signals PA fallback
If translate() is called with an unmapped VA, PageFault is raised. PE_DMA
catches the exception and uses the original address as a PA (the PA-only
backward-compatibility path from ADR-0011). PageFault is therefore not an
error — it is the signal for "no VA mapping, interpret as PA".
This path is intentional and preserves backward compatibility with the ADR-0011 PA-only mode.
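A minimal sketch of the fallback, assuming a hypothetical helper named resolve_address on the caller side. DictMmu is an invented exact-match MMU used only to make the example runnable; the PageFault class here is a local stand-in for the project's exception.

```python
class PageFault(Exception):
    """Local stand-in for the project's PageFault exception."""


class DictMmu:
    """Hypothetical minimal MMU: exact-match VA -> PA lookups only."""

    def __init__(self, mappings: dict) -> None:
        self._mappings = mappings

    def translate(self, va: int) -> int:
        try:
            return self._mappings[va]
        except KeyError:
            raise PageFault(hex(va)) from None


def resolve_address(mmu: DictMmu, addr: int) -> int:
    """Return the translated PA, or addr itself when no VA mapping exists."""
    try:
        return mmu.translate(addr)
    except PageFault:
        # Not an error: per D4 this means "no VA mapping, interpret as PA".
        return addr
```

Catching only PageFault keeps the fallback narrow: any other exception from translate() still propagates, so a genuine bug is not mistaken for PA-only mode.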
D5. MMU sideband-message reception contract
MmuMapMsg / MmuUnmapMsg arrive over the fabric at PE_MMU's _inbox
(SPEC R10: "MMU map installation incurs measured fabric latency"). Schemas
live in runtime_api/kernel.py:
- MmuMapMsg.entries: tuple[dict, ...], where each dict is {"va": int, "pa": int, "size": int}.
- MmuUnmapMsg.entries: tuple[dict, ...], where each dict is {"va": int, "size": int}.
PE_MMU reception flow:
1. _worker does _inbox.get() for one message.
2. hasattr(msg, "request") confirms a Transaction wrapper.
3. isinstance(msg.request, MmuMapMsg) → for each entry, call self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"]).
4. isinstance(msg.request, MmuUnmapMsg) → for each entry, call self._mmu.unmap(va=e["va"], size=e["size"]).
5. Both signal msg.done.succeed() after completion.
An external caller (runtime API) awaiting done therefore receives a SimPy
guarantee that "the mapping is installed on-device" — this is the realization
of ADR-0011's "MMU map installation incurs measured fabric latency".
This ADR does not define the sender or fan-out policy for the sideband message — those are runtime API responsibilities. Only the receive contract belongs here.
D6. Non-MMU Transactions delegate to generic forwarding
If a message pulled from _inbox is not MmuMapMsg / MmuUnmapMsg (or
lacks a request attribute), _forward_txn handles it normally. This keeps
the door open for future topologies where PE_MMU sits on a pass-through path —
current code never sends such traffic, but the routing remains safe.
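The D5 reception flow and the D6 fall-through can be sketched together as one dispatch function. Everything here is a stand-in written for illustration: the message dataclasses only mirror the entries field described above, DoneFlag replaces the SimPy event behind done, RecordingMmu logs calls instead of editing a page table, and handle_message and its forward callback are hypothetical names rather than the component's actual methods.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MmuMapMsg:
    entries: tuple  # each entry: {"va": int, "pa": int, "size": int}


@dataclass(frozen=True)
class MmuUnmapMsg:
    entries: tuple  # each entry: {"va": int, "size": int}


class DoneFlag:
    """Stand-in for a SimPy event; only succeed() is modeled."""

    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        self.triggered = True


@dataclass
class Transaction:
    request: object
    done: DoneFlag = field(default_factory=DoneFlag)


class RecordingMmu:
    """Logs map/unmap calls so the dispatch can be inspected."""

    def __init__(self) -> None:
        self.calls = []

    def map(self, va, pa, size):
        self.calls.append(("map", va, pa, size))

    def unmap(self, va, size):
        self.calls.append(("unmap", va, size))


def handle_message(mmu, msg, forward) -> None:
    """Apply MMU sideband messages; hand anything else to forward()."""
    # getattr with a default plays the role of the hasattr() check in D5.
    request = getattr(msg, "request", None)
    if isinstance(request, MmuMapMsg):
        for e in request.entries:
            mmu.map(va=e["va"], pa=e["pa"], size=e["size"])
        msg.done.succeed()
    elif isinstance(request, MmuUnmapMsg):
        for e in request.entries:
            mmu.unmap(va=e["va"], size=e["size"])
        msg.done.succeed()
    else:
        forward(msg)  # D6: non-MMU traffic is not this function's concern
```

In this sketch done is triggered only after every entry has been applied, which is the property a sender awaiting done relies on.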
Alternatives Considered
A1. Make translate() a SimPy generator
Rejected. As D2 explains, this blurs op_log / pipeline overlap accounting in the PE engine.
A2. Use small page size (e.g., 128 B) instead of sub-page regions
Rejected. Would explode page-table memory and cube-wide map message size. Most mappings are 2 MiB; pushing the page size below that for the few DPPolicy sharding cases inflates average cost.
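The cost behind this rejection can be estimated with simple arithmetic. The helper below is a hypothetical illustration that assumes one table entry per page and a page-aligned start address; it is not part of the simulator.

```python
MIB = 1024 * 1024


def entries_needed(span_bytes: int, page_size: int) -> int:
    """Pages needed to cover span_bytes, assuming a page-aligned start."""
    return -(-span_bytes // page_size)  # ceiling division


# One typical 2 MiB mapping under each candidate page size.
entries_at_2mib = entries_needed(2 * MIB, 2 * MIB)
entries_at_128b = entries_needed(2 * MIB, 128)
```

Under these assumptions a single 2 MiB mapping grows from 1 entry to 16384 entries when the page size drops to 128 B, and a map message carrying one entry per page would grow by the same factor.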
A3. Make PE_MMU a PE_CPU helper only (not a topology node)
Rejected. ADR-0011 requires that MMU map installation incur measured fabric
latency (via MmuMapMsg), which requires PE_MMU to be a node on the graph.
It also keeps cube NoC visualizer output consistent.
Consequences
- PE_MMU's dual role is justified at ADR level, so future "unify into one" refactor pressure has a documented counterpoint.
- The sub-page region model is explicitly labeled a stopgap, providing a basis for deprecating it when LA model (ADR-0011) lands.
- The "translate() does not yield" contract is locked in (D2), so any future proposal to add an internal MMU timeout can be denied with a documented rationale.
- PA fallback (D4) is normalized, preventing defensive logic from treating PageFault as an error.