kernbench2/docs/adr/ADR-0039-dev-pe-mmu-component-model.md

# ADR-0039: PE_MMU Component Model — Component + Utility Dual Role

## Status

Accepted (2026-05-20).

ADR-0011 (PA/VA/LA address model) only states that "the VA model translates
VA→PA via PE_MMU"; this ADR pins down **the PE_MMU component's own behavior
model**.

## First action

At construction, read `node.attrs["page_size"]` (default `2 MiB`) and
`node.attrs["tlb_overhead_ns"]` (default `0.0`) and instantiate the internal
`PeMMU` utility object (`policy.address.pe_mmu.PeMMU`) exactly once. That
object is the single owner of the page table, the sub-page region lists, and
the TLB overhead value.
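A minimal sketch of the construction step, under stated assumptions: `PeMMUStub` and `build_mmu` are hypothetical stand-ins (the real class is `policy.address.pe_mmu.PeMMU`, whose constructor signature is not shown in this ADR), and `page_size` is assumed to be stored as an integer byte count.

```python
from dataclasses import dataclass, field

MIB = 1024 * 1024


@dataclass
class PeMMUStub:
    """Hypothetical stand-in for policy.address.pe_mmu.PeMMU."""
    page_size: int = 2 * MIB
    overhead_ns: float = 0.0
    # Single owner of the page table (vpn -> region list).
    table: dict = field(default_factory=dict)


def build_mmu(attrs: dict) -> PeMMUStub:
    """Read node attributes once, falling back to the documented defaults."""
    return PeMMUStub(
        page_size=int(attrs.get("page_size", 2 * MIB)),
        overhead_ns=float(attrs.get("tlb_overhead_ns", 0.0)),
    )
```

The component would call something like `build_mmu(node.attrs)` exactly once and keep the result for its lifetime.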

At runtime the first action splits into two paths:

- **Component path (inbox consumption)**: `_worker` pulls a Transaction off
  `_inbox`; if `request` is a `MmuMapMsg`, call `self._mmu.map(va, pa, size)`
  for each entry and then `txn.done.succeed()`. For `MmuUnmapMsg`, call
  `unmap(va, size)`. Any other type falls through to standard `_forward_txn`.
  In other words, **the component's first act is "apply map/unmap commands to
  the page table"**.
- **Utility path (direct call)**: a sibling PE engine (PE_DMA / PE_GEMM) calls
  `pe_mmu.mmu.translate(va)` directly. This path produces no SimPy events;
  the caller (when `overhead_ns > 0`) issues a `yield env.timeout(mmu.overhead_ns)`
  in its own process.

## Context

ADR-0011 defined three address models (PA/VA/LA) and agreed that "VA model =
translation via PE_MMU". But in code, `PeMmuComponent` performs two
complementary roles simultaneously:

1. **A topology-graph component**: it receives `MmuMapMsg` / `MmuUnmapMsg`
   sideband messages over the cube NoC and updates the page table.
2. **A PE-local utility**: PE_DMA / PE_GEMM on the same PE call
   `translate(va)` directly with zero SimPy latency (the caller pays
   `overhead_ns` if any).

Without an ADR covering both roles, the following questions are ambiguous:

- "Why isn't there a SimPy event for the MMU translate?" (Answer: the caller
  pays it.)
- What is the sub-page region model, and why? (The code docstring has it, but
  no ADR — only a memory note `project_mmu_subpage_stopgap`.)
- Who sends map/unmap, and when must they be visible? (Ordering contract.)

Additionally, `PeMMU.map()` has "append, last-write-wins on overlap"
semantics, which a one-PA-per-entry page table cannot express. This is a
deliberate **simulator stopgap**: it lets DPPolicy shard below page
granularity (e.g., 128 B payloads against 4 KiB pages) so that several
mappings can coexist within one page instead of silently overwriting one
another and misrouting accesses. This deviation from real HW MMU semantics
must be ADR-pinned.

## Decision

### D1. Explicit dual role — component and utility

`PeMmuComponent` exposes two interfaces from a single class:

- Component interface: `_inbox` consumption, `_worker` loop (handles MMU
  sideband messages).
- Utility interface: the `mmu` property exposes the underlying `PeMMU` object,
  which PE_DMA / PE_GEMM hold directly and invoke `translate()` on.

The latter is **not a layer skip**: inside a PE, the engines and PE_MMU are
siblings under the "components" layer (ADR-0007). Cross-layer violations only
apply to runtime API ↔ sim_engine ↔ components boundaries.

### D2. Latency model — `translate()` is pure; caller owns the timeout

`PeMMU.translate()` is a pure function and yields nothing in SimPy. The caller
(a PE engine) issues `if mmu.overhead_ns > 0: yield env.timeout(mmu.overhead_ns)`
in its own process after translation.

Rationale: the PE engine process already holds its own `record_start` /
`record_end` (op_log) hooks, so keeping timing inside the caller's process
preserves consistent timing accounting. A separate MMU process would split the
engine's processing flow and blur op_log / pipeline overlap semantics.
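The contract can be illustrated without SimPy. In this sketch, `StubEnv`, `StubMMU`, `engine_process`, and `run` are all hypothetical names: `StubEnv.timeout()` simply returns the delay, and `run` sums the delays a process yields, standing in for simulated time.

```python
class StubEnv:
    """Stand-in for a SimPy environment: timeout() just returns the delay."""

    def timeout(self, delay):
        return delay


class StubMMU:
    """Stand-in MMU with a fixed VA->PA offset, for illustration only."""

    def __init__(self, overhead_ns):
        self.overhead_ns = overhead_ns

    def translate(self, va):
        # Pure function: no yields and no events.
        return va + 0x1000


def engine_process(env, mmu, va, out):
    """Engine-side process: translate first, then pay the overhead itself."""
    pa = mmu.translate(va)
    if mmu.overhead_ns > 0:
        yield env.timeout(mmu.overhead_ns)
    out.append(pa)


def run(process):
    """Drive a process to completion and return the total delay it yielded."""
    return sum(process)
```

With `overhead_ns == 0` the process yields nothing, so no event is scheduled at all.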

#### D2.1. Current implementation asymmetry — pipeline vs non-pipeline (known)

At the time of writing, `pe_dma.py` handles MMU overhead differently in its
two call paths:

- **non-pipeline (`handle_command`)**: after `translate()`, applies
  `if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`.
- **pipeline (`_do_pipeline_dma`)**: calls `translate()` only, **omitting**
  the overhead timeout — though the comment says "same logic as non-pipeline
  path", the behaviors differ.

In the default topology, `tlb_overhead_ns = 0.0`, so this asymmetry does not
manifest. With `tlb_overhead_ns > 0`, however, GEMM/Math work routed through
the pipeline path appears faster than the equivalent non-pipeline workload by
the skipped MMU overhead.

The D2 contract states that **all** callers pay the overhead; the pipeline
omission is **not an intentional design** — ADR-0014 D6 (pipeline self-routing)
does not exempt it. Remediation options (require a separate Phase 1/2):

- (a) Add `if mmu.overhead_ns > 0: yield env.timeout(...)` in
  `_do_pipeline_dma` to align with D2 — **preferred**.
- (b) Narrow the D2 contract to "non-pipeline only" and document the pipeline
  exemption in an ADR-0014 update — discouraged, since it weakens the
  overhead's meaning.

This ADR recommends (a) and assumes a small follow-up change either before or
just after acceptance.
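One possible shape for option (a) is a single helper that both call paths delegate to, so the two cannot drift apart again. This is a sketch, not the actual `pe_dma.py` change: `translate_with_overhead`, the stub classes, and `drive` are hypothetical, and the two path functions only mirror the names used above.

```python
class StubEnv:
    """Stand-in for a SimPy environment: timeout() just returns the delay."""

    def timeout(self, delay):
        return delay


class StubMMU:
    def __init__(self, overhead_ns):
        self.overhead_ns = overhead_ns

    def translate(self, va):
        return va  # identity mapping, for illustration only


def translate_with_overhead(env, mmu, va):
    """Shared helper: translate, then charge the overhead to the caller."""
    pa = mmu.translate(va)
    if mmu.overhead_ns > 0:
        yield env.timeout(mmu.overhead_ns)
    return pa


def handle_command(env, mmu, va):
    """Non-pipeline path."""
    return (yield from translate_with_overhead(env, mmu, va))


def do_pipeline_dma(env, mmu, va):
    """Pipeline path, now paying the same overhead."""
    return (yield from translate_with_overhead(env, mmu, va))


def drive(process):
    """Run a process; return (total yielded delay, return value)."""
    elapsed = 0.0
    while True:
        try:
            elapsed += next(process)
        except StopIteration as stop:
            return elapsed, stop.value
```

Under this arrangement both paths accrue identical simulated time for the same `overhead_ns`.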

### D3. Page table structure — sub-page region list (stopgap)

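`self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]]`
holds multiple regions per page. Regions are normally disjoint; overlaps are
resolved at lookup time.

- `map(va, pa, size)`: append a region to each page the range touches,
  splitting the range where it crosses a page boundary.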
- `translate(va)`: look up regions for the VPN and iterate **in reverse** so
  the most recent overlapping region wins (last-write-wins).
- `unmap(va, size)`: remove only regions whose extent is **fully contained**
  within the unmap range; partial-overlap boundaries are left in place and the
  caller is expected to unmap on the same boundaries used for map.

This is documented as a **simulator stopgap** that supplements the VA model
from ADR-0011. It prevents silent last-write-wins misrouting when DPPolicy
shards below page granularity. Memory note: `project_mmu_subpage_stopgap`.
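The three operations above can be sketched as follows. This is an illustrative reimplementation, not the `PeMMU` source: `SubPageTable` is a hypothetical name, the 4 KiB default page size is chosen only to keep the numbers small, and `pa_at_offset_zero` is assumed to mean "the PA that page offset 0 would have under this mapping", so a lookup is `pa_at_offset_zero + offset_in_page`.

```python
class PageFault(Exception):
    """Raised when no region covers the requested VA."""


class SubPageTable:
    def __init__(self, page_size=4096):
        self.page_size = page_size
        # vpn -> list of (start_in_page, end_in_page, pa_at_offset_zero)
        self._table = {}

    def map(self, va, pa, size):
        end = va + size
        while va < end:
            vpn, start = divmod(va, self.page_size)
            # Clip the chunk at the page boundary.
            chunk = min(end - va, self.page_size - start)
            region = (start, start + chunk, pa - start)
            self._table.setdefault(vpn, []).append(region)
            va += chunk
            pa += chunk

    def translate(self, va):
        vpn, off = divmod(va, self.page_size)
        # Newest region first, so the last write wins on overlap.
        for start, stop, pa0 in reversed(self._table.get(vpn, [])):
            if start <= off < stop:
                return pa0 + off
        raise PageFault(hex(va))

    def unmap(self, va, size):
        end = va + size
        for vpn in range(va // self.page_size, (end - 1) // self.page_size + 1):
            base = vpn * self.page_size
            lo = max(va, base) - base
            hi = min(end, base + self.page_size) - base
            # Drop only regions fully contained in [lo, hi).
            kept = [r for r in self._table.get(vpn, [])
                    if not (lo <= r[0] and r[1] <= hi)]
            if kept:
                self._table[vpn] = kept
            else:
                self._table.pop(vpn, None)
```

Two 128 B shards mapped into the same page then translate to their own PAs, and an `unmap` that covers only half of a region leaves that region in place.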

### D4. PageFault signals PA fallback

If `translate()` is called with an unmapped VA, `PageFault` is raised. PE_DMA
catches the exception and **uses the original address as a PA** (the PA-only
backward-compatibility path from ADR-0011). PageFault is therefore not an
error — it is the signal for "no VA mapping, interpret as PA".

This path is intentional and preserves backward compatibility with the
ADR-0011 PA-only mode.
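A sketch of the fallback on the caller side, assuming a hypothetical `resolve_address` helper and a toy `DictMMU` that maps exact VAs only (the real lookup goes through the region list in D3):

```python
class PageFault(Exception):
    """Signals 'no VA mapping'; the caller interprets the address as a PA."""


class DictMMU:
    """Hypothetical stand-in: exact-VA dictionary instead of a page table."""

    def __init__(self, mapping):
        self._mapping = dict(mapping)

    def translate(self, va):
        try:
            return self._mapping[va]
        except KeyError:
            raise PageFault(hex(va)) from None


def resolve_address(mmu, addr):
    """Return the PA for addr, falling back to addr itself when unmapped."""
    try:
        return mmu.translate(addr)
    except PageFault:
        # No VA mapping: treat the address as already physical.
        return addr
```

Because the fallback is part of the contract, callers should not log or re-raise `PageFault` on this path.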

### D5. MMU sideband-message reception contract

`MmuMapMsg` / `MmuUnmapMsg` arrive over the fabric at PE_MMU's `_inbox`
(SPEC R10: "MMU map installation incurs measured fabric latency"). Schemas
live in `runtime_api/kernel.py`:

- `MmuMapMsg.entries: tuple[dict, ...]` — each dict is
  `{"va": int, "pa": int, "size": int}`.
- `MmuUnmapMsg.entries: tuple[dict, ...]` — each dict is
  `{"va": int, "size": int}`.

PE_MMU reception flow:

1. `_worker` does `_inbox.get()` for one message.
2. `hasattr(msg, "request")` confirms a Transaction wrapper.
3. `isinstance(msg.request, MmuMapMsg)` → for each entry, call
   `self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"])`.
4. `isinstance(msg.request, MmuUnmapMsg)` → for each entry, call
   `self._mmu.unmap(va=e["va"], size=e["size"])`.
5. Both signal `msg.done.succeed()` after completion.

An external caller (runtime API) that waits on `done` is therefore
guaranteed, in simulated time, that "the mapping is installed on-device".
This realizes ADR-0011's "MMU map installation incurs measured fabric
latency".
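The reception flow above can be sketched as a single dispatch step. Everything here is a stand-in: the message classes only mimic the `entries` schema described above, `DoneStub` replaces the SimPy event, `RecordingMMU` records calls instead of editing a page table, and `handle_one` and `forward` are hypothetical names for one `_worker` iteration and `_forward_txn`.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class MmuMapMsg:
    entries: tuple  # each entry: {"va": int, "pa": int, "size": int}


@dataclass
class MmuUnmapMsg:
    entries: tuple  # each entry: {"va": int, "size": int}


class DoneStub:
    """Stand-in for the Transaction's completion event."""

    def __init__(self):
        self.triggered = False

    def succeed(self):
        self.triggered = True


@dataclass
class Txn:
    request: object
    done: DoneStub


class RecordingMMU:
    def __init__(self):
        self.calls = []

    def map(self, va, pa, size):
        self.calls.append(("map", va, pa, size))

    def unmap(self, va, size):
        self.calls.append(("unmap", va, size))


def handle_one(inbox, mmu, forward):
    """Process one inbox message: apply map/unmap, else forward it."""
    msg = inbox.popleft()
    request = getattr(msg, "request", None)
    if isinstance(request, MmuMapMsg):
        for e in request.entries:
            mmu.map(va=e["va"], pa=e["pa"], size=e["size"])
        msg.done.succeed()
    elif isinstance(request, MmuUnmapMsg):
        for e in request.entries:
            mmu.unmap(va=e["va"], size=e["size"])
        msg.done.succeed()
    else:
        # Not an MMU sideband message: generic forwarding (see D6).
        forward(msg)
```

Note that `done` is signalled only after every entry in the message has been applied, which is what lets a waiter treat completion as "all mappings installed".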

This ADR does **not** define the **sender or fan-out policy** for the sideband
message — those are runtime API responsibilities. Only the receive contract
belongs here.

### D6. Non-MMU Transactions delegate to generic forwarding

If a message pulled from `_inbox` is not `MmuMapMsg` / `MmuUnmapMsg` (or
lacks a `request` attribute), `_forward_txn` handles it normally. This keeps
the door open for future topologies where PE_MMU sits on a pass-through path —
current code never sends such traffic, but the routing remains safe.

## Alternatives Considered

### A1. Make `translate()` a SimPy generator

Rejected. As D2 explains, this blurs op_log / pipeline overlap accounting in
the PE engine.

### A2. Use a small page size (e.g., 128 B) instead of sub-page regions

Rejected. It would inflate both page-table memory and the size of cube-wide
map messages. Most mappings are 2 MiB; pushing the page size below that for
the few DPPolicy sharding cases raises the average cost of every mapping.

### A3. Make PE_MMU a PE_CPU helper only (not a topology node)

Rejected. ADR-0011 requires that MMU map installation incur measured fabric
latency (via `MmuMapMsg`), which requires PE_MMU to be a node on the graph.
It also keeps cube NoC visualizer output consistent.

## Consequences

- PE_MMU's dual role is justified at ADR level, so future "unify into one"
  refactor pressure has a documented counterpoint.
- The sub-page region model is explicitly labeled a stopgap, providing a
  basis for deprecating it when the LA model (ADR-0011) lands.
- The "`translate()` does not yield" contract is locked in (D2), so any
  future proposal to add an internal MMU timeout can be denied with a
  documented rationale.
- PA fallback (D4) is normalized, preventing defensive logic from treating
  PageFault as an error.