kernbench2/docs/adr/ADR-0039-dev-pe-mmu-component-model.md
ywkang 1f36baa898 ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling)
Fill component-model coverage gaps surfaced by /report's G4 analysis.
Each ADR documents the component's First action, latency model, and
honest notes on dormant code or implementation asymmetries discovered
during re-evaluation against current code.

- 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding
  worker as-is; named-node contract for router helpers
- 0039 pe_mmu: component + utility dual role; sub-page region stopgap;
  D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric
  with non-pipeline; not visible at default tlb_overhead_ns=0)
- 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1);
  TcmRequest schema owned by TCM; timing-only (no data store)
- 0041 sram: terminal scratchpad model + ResponseMsg on reverse path;
  D1.1 flags _worker override as currently dormant (no Transaction
  actually targets the SRAM node today)
- 0042 tiling: pure plan-generator module, not a component; corrects
  the G4 misclassification; pins GEMM/Math stage sequences and
  epilogue scope contract

Also: /report skill G3 refinement — only flag older->newer asymmetric
cross-references; newer->older (e.g., 0034-0037 citing infrastructure
ADRs) are expected one-way and no longer reported.

Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:43:03 -07:00

ADR-0039: PE_MMU Component Model — Component + Utility Dual Role

Status

Accepted (2026-05-20).

ADR-0011 (PA/VA/LA address model) only states that "the VA model translates VA→PA via PE_MMU"; this ADR pins down the PE_MMU component's own behavior model.

First action

At construction, read node.attrs["page_size"] (default 2 MiB) and node.attrs["tlb_overhead_ns"] (default 0.0) and instantiate the internal PeMMU utility object (policy.address.pe_mmu.PeMMU) exactly once. That object is the single owner of the page table, the sub-page region lists, and the TLB overhead value.

At runtime the first action splits into two paths:

  • Component path (inbox consumption): _worker pulls a Transaction off _inbox. If txn.request is an MmuMapMsg, it calls self._mmu.map(va, pa, size) for each entry; if it is an MmuUnmapMsg, it calls self._mmu.unmap(va, size) for each entry. In both cases it then calls txn.done.succeed(). Any other type falls through to the standard _forward_txn. In other words, the component's first act is "apply map/unmap commands to the page table".
  • Utility path (direct call): a sibling PE engine (PE_DMA / PE_GEMM) calls pe_mmu.mmu.translate(va) directly. This path produces no SimPy events; the caller (when overhead_ns > 0) issues a yield env.timeout(mmu.overhead_ns) in its own process.
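The construction step and the utility accessor can be sketched as follows. This is a minimal illustration, not the project's code: the Node class and the PeMMU constructor signature are assumptions made for the example; only the attribute names, defaults, and the mmu property come from the text above.

```python
from dataclasses import dataclass, field

MIB = 1024 * 1024


class PeMMU:
    """Stand-in for policy.address.pe_mmu.PeMMU; constructor shape is assumed."""

    def __init__(self, page_size: int, overhead_ns: float) -> None:
        self.page_size = page_size
        self.overhead_ns = overhead_ns
        self.table: dict = {}  # single owner of the page table


@dataclass
class Node:
    """Hypothetical topology node that carries an attrs dict."""

    attrs: dict = field(default_factory=dict)


class PeMmuComponent:
    def __init__(self, node: Node) -> None:
        # Read both attributes once, applying the documented defaults.
        page_size = node.attrs.get("page_size", 2 * MIB)
        overhead_ns = node.attrs.get("tlb_overhead_ns", 0.0)
        # The utility object is created exactly once per component.
        self._mmu = PeMMU(page_size, overhead_ns)

    @property
    def mmu(self) -> PeMMU:
        # Sibling engines hold this object and call translate() on it.
        return self._mmu


default_component = PeMmuComponent(Node())
tuned_component = PeMmuComponent(Node({"page_size": 4096, "tlb_overhead_ns": 1.5}))
```

Because the property always returns the same object, every engine on the PE sees the same page table and the same overhead value.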

Context

ADR-0011 defined three address models (PA/VA/LA) and agreed that "VA model = translation via PE_MMU". But in code, PeMmuComponent performs two complementary roles simultaneously:

  1. A topology-graph component: it receives MmuMapMsg / MmuUnmapMsg sideband messages over the cube NoC and updates the page table.
  2. A PE-local utility: PE_DMA / PE_GEMM on the same PE call translate(va) directly with zero SimPy latency (the caller pays overhead_ns if any).

Without an ADR covering both roles, the following questions are ambiguous:

  • "Why isn't there a SimPy event for the MMU translate?" (Answer: the caller pays it.)
  • What is the sub-page region model, and why? (The code docstring has it, but no ADR — only a memory note project_mmu_subpage_stopgap.)
  • Who sends map/unmap, and when must they be visible? (Ordering contract.)

Additionally, PeMMU.map() has "append, last-write-wins on overlap" semantics, which a one-PA-per-entry page table cannot express. This is a deliberate simulator stopgap: it lets DPPolicy sub-page sharding (e.g., 128 B payloads against 4 KiB pages) place several disjoint regions in one page, where a one-PA-per-page table would silently overwrite earlier shards and misroute them. This deviation from real HW MMU semantics must be ADR-pinned.

Decision

D1. Explicit dual role — component and utility

PeMmuComponent exposes two interfaces from a single class:

  • Component interface: _inbox consumption, _worker loop (handles MMU sideband messages).
  • Utility interface: the mmu property exposes the underlying PeMMU object, which PE_DMA / PE_GEMM hold directly and invoke translate() on.

The latter is not a layer skip: inside a PE, the engines and PE_MMU are siblings under the "components" layer (ADR-0007). Cross-layer violations only apply to runtime API ↔ sim_engine ↔ components boundaries.

D2. Latency model — translate() is pure; caller owns the timeout

PeMMU.translate() is a pure function and yields nothing in SimPy. The caller (a PE engine) issues if mmu.overhead_ns > 0: yield env.timeout(mmu.overhead_ns) in its own process after translation.

Rationale: the PE engine process already holds its own record_start / record_end (op_log) hooks, so keeping timing inside the caller's process preserves consistent timing accounting. A separate MMU process would split the engine's processing flow and blur op_log / pipeline overlap semantics.
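A sketch of the D2 contract, written without SimPy so that it runs standalone. Yielding a plain number here stands in for `yield env.timeout(...)`, and the tiny `run` driver stands in for the SimPy environment; FakeMMU, engine_process, and run are all hypothetical names for this example.

```python
class FakeMMU:
    """Hypothetical MMU with a fixed offset mapping and a nonzero overhead."""

    overhead_ns = 2.0

    def translate(self, va: int) -> int:
        # Pure function: no yield, no simulated time.
        return va + 0x1000


def engine_process(mmu, va: int, work_ns: float):
    """Engine-side process: translate, pay the overhead, then do the work."""
    pa = mmu.translate(va)
    if mmu.overhead_ns > 0:
        # Stand-in for `yield env.timeout(mmu.overhead_ns)` in the caller.
        yield mmu.overhead_ns
    # Stand-in for the engine's own work, timed in the same process.
    yield work_ns
    return pa


def run(proc):
    """Drive a generator to completion, summing the delays it yields."""
    elapsed = 0.0
    try:
        while True:
            elapsed += next(proc)
    except StopIteration as stop:
        return elapsed, stop.value


elapsed_ns, pa = run(engine_process(FakeMMU(), va=0x10, work_ns=5.0))
```

All simulated time accrues inside the engine's own process, which is the property the rationale above relies on.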

D2.1. Current implementation asymmetry — pipeline vs non-pipeline (known)

At the time of writing, pe_dma.py handles MMU overhead differently in its two call paths:

  • non-pipeline (handle_command): after translate(), applies if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns).
  • pipeline (_do_pipeline_dma): calls translate() only, omitting the overhead timeout — though the comment says "same logic as non-pipeline path", the behaviors differ.

In the default topology, tlb_overhead_ns = 0.0, so this asymmetry does not manifest. With tlb_overhead_ns > 0, however, GEMM/Math work that goes through the pipeline path appears faster than the equivalent non-pipeline workload, because it never pays the MMU overhead.

The D2 contract states that all callers pay the overhead; the pipeline omission is not an intentional design — ADR-0014 D6 (pipeline self-routing) does not exempt it. Remediation options (require a separate Phase 1/2):

  • (a) Add if mmu.overhead_ns > 0: yield env.timeout(...) in _do_pipeline_dma to align with D2 — preferred.
  • (b) Narrow the D2 contract to "non-pipeline only" and document the pipeline exemption in an ADR-0014 update — discouraged, since it weakens the overhead's meaning.

This ADR recommends (a) and assumes a small follow-up change either before or just after acceptance.
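The size of the asymmetry can be illustrated with a toy model. The two generator functions below are hypothetical reductions of the paths described above, not the contents of pe_dma.py; yielded numbers stand in for SimPy timeouts.

```python
def non_pipeline_dma(overhead_ns: float, transfer_ns: float):
    # handle_command-style path: pay the MMU overhead, then transfer.
    if overhead_ns > 0:
        yield overhead_ns
    yield transfer_ns


def pipeline_dma(overhead_ns: float, transfer_ns: float):
    # _do_pipeline_dma-style path as described: no overhead timeout.
    yield transfer_ns


def elapsed(proc) -> float:
    """Total simulated time: the sum of all yielded delays."""
    return sum(proc)


def gap(overhead_ns: float, transfer_ns: float = 10.0) -> float:
    """How much faster the pipeline path appears for one DMA."""
    return (elapsed(non_pipeline_dma(overhead_ns, transfer_ns))
            - elapsed(pipeline_dma(overhead_ns, transfer_ns)))


gap_default = gap(0.0)  # default tlb_overhead_ns: paths agree
gap_tuned = gap(3.0)    # nonzero overhead: paths diverge by the overhead
```

Under option (a), pipeline_dma would gain the same guarded yield as non_pipeline_dma and the gap would be zero for every overhead value.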

D3. Page table structure — sub-page region list (stopgap)

self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]] holds multiple disjoint regions per page.

  • map(va, pa, size): append a region to the list of each page the range touches, splitting the range where it crosses a page boundary.
  • translate(va): look up regions for the VPN and iterate in reverse so the most recent overlapping region wins (last-write-wins).
  • unmap(va, size): remove only regions that lie entirely within the unmap range. Regions that only partially overlap it are left in place, so the caller is expected to unmap on the same boundaries used for map.

This is documented as a simulator stopgap that supplements the VA model from ADR-0011. It prevents silent last-write-wins misrouting when DPPolicy shards below page granularity. Memory note: project_mmu_subpage_stopgap.
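The three operations above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the project's PeMMU: region ends are treated as exclusive, and pa_at_offset_zero is taken to mean the PA that in-page offset 0 would translate to.

```python
class PageFault(Exception):
    """Raised when no region covers the requested VA."""


class PeMMU:
    def __init__(self, page_size: int = 2 * 1024 * 1024) -> None:
        self.page_size = page_size
        # vpn -> list of (start_in_page, end_in_page, pa_at_offset_zero)
        self._table: dict[int, list[tuple[int, int, int]]] = {}

    def map(self, va: int, pa: int, size: int) -> None:
        end = va + size
        while va < end:
            vpn, start = divmod(va, self.page_size)
            # Clip the chunk at the page boundary, then append a region.
            chunk = min(end - va, self.page_size - start)
            self._table.setdefault(vpn, []).append((start, start + chunk, pa - start))
            va += chunk
            pa += chunk

    def translate(self, va: int) -> int:
        vpn, offset = divmod(va, self.page_size)
        # Reverse iteration: the most recently mapped overlapping region wins.
        for start, stop, pa_zero in reversed(self._table.get(vpn, [])):
            if start <= offset < stop:
                return pa_zero + offset
        raise PageFault(hex(va))

    def unmap(self, va: int, size: int) -> None:
        end = va + size
        for vpn in range(va // self.page_size, (end - 1) // self.page_size + 1):
            base = vpn * self.page_size
            # Keep every region that is not fully inside [va, end).
            kept = [r for r in self._table.get(vpn, [])
                    if not (va <= base + r[0] and base + r[1] <= end)]
            if kept:
                self._table[vpn] = kept
            else:
                self._table.pop(vpn, None)


mmu = PeMMU(page_size=4096)
mmu.map(va=0x1000, pa=0xA000, size=128)  # shard 0
mmu.map(va=0x1080, pa=0xB000, size=128)  # shard 1, same page, disjoint
shard0_pa = mmu.translate(0x1000)
shard1_pa = mmu.translate(0x1080)
```

With a one-PA-per-page table the second map call would have replaced the first; here both shards in the same 4 KiB page resolve to their own PA.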

D4. PageFault signals PA fallback

If translate() is called with an unmapped VA, PageFault is raised. PE_DMA catches the exception and uses the original address as a PA (the PA-only backward-compatibility path from ADR-0011). PageFault is therefore not an error — it is the signal for "no VA mapping, interpret as PA".

This path is intentional and preserves backward compatibility with the ADR-0011 PA-only mode.
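A minimal sketch of the fallback, assuming a dictionary-backed stand-in for the MMU. TinyMMU and resolve_address are hypothetical names; only the catch-and-use-as-PA behavior comes from D4.

```python
class PageFault(Exception):
    """Signals 'no VA mapping'; in this model it is not an error."""


class TinyMMU:
    """Hypothetical exact-address MMU, just enough to raise PageFault."""

    def __init__(self, mappings: dict[int, int]) -> None:
        self._mappings = dict(mappings)

    def translate(self, va: int) -> int:
        try:
            return self._mappings[va]
        except KeyError:
            raise PageFault(hex(va)) from None


def resolve_address(mmu: TinyMMU, addr: int) -> int:
    """Engine-side lookup: a mapped VA translates, anything else is a PA."""
    try:
        return mmu.translate(addr)
    except PageFault:
        # PA-only backward-compatibility path: use the address unchanged.
        return addr


tiny = TinyMMU({0x4000: 0x9000})
mapped = resolve_address(tiny, 0x4000)
fallback = resolve_address(tiny, 0x5000)
```

A caller that logged or re-raised PageFault here would break PA-only workloads, which is why the exception is treated as a signal rather than a failure.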

D5. MMU sideband-message reception contract

MmuMapMsg / MmuUnmapMsg arrive over the fabric at PE_MMU's _inbox (SPEC R10: "MMU map installation incurs measured fabric latency"). Schemas live in runtime_api/kernel.py:

  • MmuMapMsg.entries: tuple[dict, ...] — each dict is {"va": int, "pa": int, "size": int}.
  • MmuUnmapMsg.entries: tuple[dict, ...] — each dict is {"va": int, "size": int}.

PE_MMU reception flow:

  1. _worker does _inbox.get() for one message.
  2. hasattr(msg, "request") confirms a Transaction wrapper.
  3. isinstance(msg.request, MmuMapMsg) → for each entry, call self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"]).
  4. isinstance(msg.request, MmuUnmapMsg) → for each entry, call self._mmu.unmap(va=e["va"], size=e["size"]).
  5. Both signal msg.done.succeed() after completion.

An external caller (runtime API) awaiting done therefore receives a SimPy guarantee that "the mapping is installed on-device" — this is the realization of ADR-0011's "MMU map installation incurs measured fabric latency".

This ADR does not define the sender or fan-out policy for the sideband message — those are runtime API responsibilities. Only the receive contract belongs here.

D6. Non-MMU Transactions delegate to generic forwarding

If a message pulled from _inbox is not MmuMapMsg / MmuUnmapMsg (or lacks a request attribute), _forward_txn handles it normally. This keeps the door open for future topologies where PE_MMU sits on a pass-through path — current code never sends such traffic, but the routing remains safe.
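The reception flow in D5 and the fall-through in D6 can be sketched together as a single dispatch function. The message classes, DoneEvent, and RecordingMMU below are simplified stand-ins written for this example (the real schemas live in runtime_api/kernel.py), and `forward` is a hypothetical callable playing the role of _forward_txn.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MmuMapMsg:
    entries: tuple  # each entry: {"va": int, "pa": int, "size": int}


@dataclass(frozen=True)
class MmuUnmapMsg:
    entries: tuple  # each entry: {"va": int, "size": int}


class DoneEvent:
    """Stand-in for a SimPy event; records that succeed() was called."""

    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        self.triggered = True


@dataclass
class Transaction:
    request: object
    done: DoneEvent = field(default_factory=DoneEvent)


class RecordingMMU:
    """Records map/unmap calls instead of maintaining a page table."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def map(self, va: int, pa: int, size: int) -> None:
        self.calls.append(("map", va, pa, size))

    def unmap(self, va: int, size: int) -> None:
        self.calls.append(("unmap", va, size))


def handle_message(mmu, msg, forward) -> None:
    """One iteration of the worker body, after _inbox.get() returned msg."""
    request = getattr(msg, "request", None)  # None if not a Transaction wrapper
    if isinstance(request, MmuMapMsg):
        for e in request.entries:
            mmu.map(va=e["va"], pa=e["pa"], size=e["size"])
        msg.done.succeed()  # mapping is installed before done fires
    elif isinstance(request, MmuUnmapMsg):
        for e in request.entries:
            mmu.unmap(va=e["va"], size=e["size"])
        msg.done.succeed()
    else:
        forward(msg)  # D6: generic forwarding for everything else


recorder = RecordingMMU()
forwarded: list = []
map_txn = Transaction(MmuMapMsg(({"va": 0x1000, "pa": 0xA000, "size": 128},)))
other_txn = Transaction(request="dma-read")
handle_message(recorder, map_txn, forwarded.append)
handle_message(recorder, other_txn, forwarded.append)
```

Signaling done only after the last entry is applied is what lets a waiting sender treat the event as "mapping installed".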

Alternatives Considered

A1. Make translate() a SimPy generator

Rejected. As D2 explains, this blurs op_log / pipeline overlap accounting in the PE engine.

A2. Use small page size (e.g., 128 B) instead of sub-page regions

Rejected. Would explode page-table memory and cube-wide map message size. Most mappings are 2 MiB; pushing the page size below that for the few DPPolicy sharding cases inflates average cost.
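A back-of-the-envelope check of that cost, assuming one page-table entry (and one MmuMapMsg entry) per page. The helper name is hypothetical.

```python
MIB = 1024 * 1024


def entries_needed(mapping_bytes: int, page_bytes: int) -> int:
    """Pages required to cover a mapping, rounding up."""
    return -(-mapping_bytes // page_bytes)


entries_at_2mib_pages = entries_needed(2 * MIB, 2 * MIB)  # 1
entries_at_128b_pages = entries_needed(2 * MIB, 128)      # 16384
```

Every ordinary 2 MiB mapping would grow from one entry to 16384, whereas the sub-page region list adds entries only for the pages that are actually sharded.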

A3. Make PE_MMU a PE_CPU helper only (not a topology node)

Rejected. ADR-0011 requires that MMU map installation incur measured fabric latency (via MmuMapMsg), which requires PE_MMU to be a node on the graph. It also keeps cube NoC visualizer output consistent.

Consequences

  • PE_MMU's dual role is justified at ADR level, so future "unify into one" refactor pressure has a documented counterpoint.
  • The sub-page region model is explicitly labeled a stopgap, providing a basis for deprecating it when the LA model (ADR-0011) lands.
  • The "translate() does not yield" contract is locked in (D2), so any future proposal to add an internal MMU timeout can be denied with a documented rationale.
  • PA fallback (D4) is normalized, preventing defensive logic from treating PageFault as an error.