# ADR-0039: PE_MMU Component Model — Component + Utility Dual Role

## Status

Accepted (2026-05-20).

ADR-0011 (PA/VA/LA address model) only states that "the VA model translates
VA→PA via PE_MMU"; this ADR pins down **the PE_MMU component's own behavior
model**.

## First action

At construction, read `node.attrs["page_size"]` (default `2 MiB`) and
`node.attrs["tlb_overhead_ns"]` (default `0.0`) and instantiate the internal
`PeMMU` utility object (`policy.address.pe_mmu.PeMMU`) exactly once. That
object is the single owner of the page table, the sub-page region lists, and
the TLB overhead value.

At runtime the first action splits into two paths:

- **Component path (inbox consumption)**: `_worker` pulls a Transaction off
  `_inbox`; if `request` is a `MmuMapMsg`, call `self._mmu.map(va, pa, size)`
  for each entry and then `txn.done.succeed()`. For `MmuUnmapMsg`, call
  `self._mmu.unmap(va, size)` for each entry and signal `done` the same way.
  Any other type falls through to standard `_forward_txn`. In other words,
  **the component's first act is "apply map/unmap commands to the page
  table"**.
- **Utility path (direct call)**: a sibling PE engine (PE_DMA / PE_GEMM) calls
  `pe_mmu.mmu.translate(va)` directly. This path produces no SimPy events;
  the caller (when `overhead_ns > 0`) issues a
  `yield env.timeout(mmu.overhead_ns)` in its own process.

## Context

ADR-0011 defined three address models (PA/VA/LA) and agreed that "VA model =
translation via PE_MMU". But in code, `PeMmuComponent` performs two
complementary roles simultaneously:

1. **A topology-graph component**: it receives `MmuMapMsg` / `MmuUnmapMsg`
   sideband messages over the cube NoC and updates the page table.
2. **A PE-local utility**: PE_DMA / PE_GEMM on the same PE call
   `translate(va)` directly with zero SimPy latency (the caller pays
   `overhead_ns` if any).

Without an ADR covering both roles, the following questions remain ambiguous:

- "Why isn't there a SimPy event for the MMU translate?" (Answer: the caller
  pays it.)
- What is the sub-page region model, and why? (The code docstring covers it,
  but no ADR does; the only other record is the memory note
  `project_mmu_subpage_stopgap`.)
- Who sends map/unmap, and when must they be visible? (Ordering contract.)

Additionally, `PeMMU.map()` has "append, last-write-wins on overlap"
semantics, which cannot be expressed with a one-PA-per-entry page table.
This is a deliberate **simulator stopgap**: it supports DPPolicy sub-page
sharding (e.g., 128 B payloads against 4 KiB pages) without the silent
last-write-wins misrouting that a one-PA-per-page table would produce when
two shards share a page. This deviation from real HW MMU semantics must be
ADR-pinned.

## Decision

### D1. Explicit dual role — component and utility

`PeMmuComponent` exposes two interfaces from a single class:

- Component interface: `_inbox` consumption, `_worker` loop (handles MMU
  sideband messages).
- Utility interface: the `mmu` property exposes the underlying `PeMMU` object,
  which PE_DMA / PE_GEMM hold directly and invoke `translate()` on.

The latter is **not a layer skip**: inside a PE, the engines and PE_MMU are
siblings under the "components" layer (ADR-0007). The cross-layer rule applies
only to the runtime API ↔ sim_engine ↔ components boundaries.

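The shape described in D1 can be illustrated with a minimal sketch. It assumes `node.attrs` behaves like a dict and that `page_size` is stored as a byte count; `StubPeMMU` and `DualRoleMmu` are hypothetical stand-ins, not the project's `PeMMU` or `PeMmuComponent`.

```python
from types import SimpleNamespace

DEFAULT_PAGE_SIZE = 2 * 1024 * 1024  # 2 MiB, the default named in this ADR


class StubPeMMU:
    """Hypothetical stand-in for policy.address.pe_mmu.PeMMU."""

    def __init__(self, page_size, overhead_ns):
        self.page_size = page_size
        self.overhead_ns = overhead_ns


class DualRoleMmu:
    """Sketch of a component that owns exactly one utility object."""

    def __init__(self, node):
        # Read both attributes once, falling back to the documented defaults.
        page_size = node.attrs.get("page_size", DEFAULT_PAGE_SIZE)
        overhead_ns = node.attrs.get("tlb_overhead_ns", 0.0)
        self._mmu = StubPeMMU(page_size, overhead_ns)

    @property
    def mmu(self):
        # Utility interface: sibling engines hold this object directly.
        return self._mmu


component = DualRoleMmu(SimpleNamespace(attrs={"tlb_overhead_ns": 1.5}))
engine_view = component.mmu  # what a sibling engine would keep a reference to
```

Because the property always returns the same object, map/unmap applied through the component path are visible to every engine that holds `engine_view`.
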
### D2. Latency model — `translate()` is pure; caller owns the timeout

`PeMMU.translate()` is a pure function and yields nothing in SimPy. The caller
(a PE engine) issues `if mmu.overhead_ns > 0: yield env.timeout(mmu.overhead_ns)`
in its own process after translation.

Rationale: the PE engine process already holds its own `record_start` /
`record_end` (op_log) hooks, so keeping timing inside the caller's process
preserves consistent timing accounting. A separate MMU process would split the
engine's processing flow and blur op_log / pipeline overlap semantics.

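The caller-owned timeout in D2 can be sketched without SimPy installed. `StubEnv`, `StubMmu`, `engine_process`, and `run` below are hypothetical stand-ins: the stub `timeout()` returns the delay itself rather than a SimPy event, and `run` plays the role of the scheduler by advancing a clock.

```python
class StubEnv:
    """Hypothetical stand-in for a SimPy environment; only tracks time."""

    def __init__(self):
        self.now = 0.0

    def timeout(self, delay_ns):
        # A real environment returns an event; the stub returns the delay.
        return delay_ns


class StubMmu:
    """Hypothetical pure translator with a fixed offset mapping."""

    def __init__(self, overhead_ns):
        self.overhead_ns = overhead_ns

    def translate(self, va):
        return va + 0x1000  # pure: no yield, no clock access


def engine_process(env, mmu, va, out):
    """Caller-side pattern from D2: translate, then pay the overhead."""
    pa = mmu.translate(va)
    if mmu.overhead_ns > 0:
        yield env.timeout(mmu.overhead_ns)
    out.append(pa)


def run(env, process):
    """Drive a generator, advancing the stub clock by each yielded delay."""
    for delay_ns in process:
        env.now += delay_ns


env = StubEnv()
results = []
run(env, engine_process(env, StubMmu(overhead_ns=2.0), 0x40, results))
```

With `overhead_ns=0.0` the same process yields nothing and the clock does not move, which matches the default-topology behavior described above.
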
#### D2.1. Current implementation asymmetry — pipeline vs non-pipeline (known)

At the time of writing, `pe_dma.py` handles MMU overhead differently in its
two call paths:

- **non-pipeline (`handle_command`)**: after `translate()`, applies
  `if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`.
- **pipeline (`_do_pipeline_dma`)**: calls `translate()` only, **omitting**
  the overhead timeout. The code comment says "same logic as non-pipeline
  path", but the behaviors differ.

In the default topology, `tlb_overhead_ns = 0.0`, so this asymmetry does not
manifest. With `tlb_overhead_ns > 0`, however, GEMM/Math work that goes
through the pipeline path skips the MMU overhead and therefore appears faster
than the equivalent non-pipeline workload.

The D2 contract states that **all** callers pay the overhead; the pipeline
omission is **not an intentional design**, and ADR-0014 D6 (pipeline
self-routing) does not exempt it. Remediation options (both require a separate
Phase 1/2):

- (a) Add `if mmu.overhead_ns > 0: yield env.timeout(...)` in
  `_do_pipeline_dma` to align with D2. **Preferred.**
- (b) Narrow the D2 contract to "non-pipeline only" and document the pipeline
  exemption in an ADR-0014 update. Discouraged, since it weakens the
  overhead's meaning.

This ADR recommends (a) and assumes a small follow-up change either before or
just after acceptance.

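One possible way to realize option (a) is to route both paths through a single helper, so the two cannot drift apart again. This is only a sketch under assumptions: `translate_with_overhead`, `non_pipeline_sketch`, and `pipeline_sketch` are hypothetical names, the stubs replace SimPy and `PeMMU`, and the real `handle_command` / `_do_pipeline_dma` bodies are not shown in this ADR.

```python
class StubEnv:
    """Hypothetical stand-in for a SimPy environment; only tracks time."""

    def __init__(self):
        self.now = 0.0

    def timeout(self, delay_ns):
        return delay_ns  # stub: the delay itself stands in for an event


class StubMmu:
    """Hypothetical pure translator."""

    def __init__(self, overhead_ns):
        self.overhead_ns = overhead_ns

    def translate(self, va):
        return va | 0x8000_0000


def translate_with_overhead(env, mmu, va):
    """Hypothetical shared helper: translate, pay overhead, return the PA."""
    pa = mmu.translate(va)
    if mmu.overhead_ns > 0:
        yield env.timeout(mmu.overhead_ns)
    return pa


def non_pipeline_sketch(env, mmu, va):
    return (yield from translate_with_overhead(env, mmu, va))


def pipeline_sketch(env, mmu, vas):
    pas = []
    for va in vas:
        pas.append((yield from translate_with_overhead(env, mmu, va)))
    return pas


def run(env, process):
    """Drive a generator to completion and return its return value."""
    try:
        while True:
            env.now += next(process)
    except StopIteration as stop:
        return stop.value


mmu = StubMmu(overhead_ns=3.0)
env_a, env_b = StubEnv(), StubEnv()
single = run(env_a, non_pipeline_sketch(env_a, mmu, 0x10))
batch = run(env_b, pipeline_sketch(env_b, mmu, [0x10, 0x20]))
```

In this sketch each translation costs the same 3.0 ns on either path, which is the symmetry the D2 contract asks for.
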
### D3. Page table structure — sub-page region list (stopgap)

`self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]]`
holds a list of regions per page, so several disjoint sub-page mappings can
coexist in one page.

- `map(va, pa, size)`: append regions, splitting the range where it crosses a
  page boundary.
- `translate(va)`: look up regions for the VPN and iterate **in reverse** so
  the most recent overlapping region wins (last-write-wins).
- `unmap(va, size)`: remove only regions whose extent is **fully contained**
  within the unmap range; partial-overlap boundaries are left in place and the
  caller is expected to unmap on the same boundaries used for map.

This is documented as a **simulator stopgap** that supplements the VA model
from ADR-0011. It prevents silent last-write-wins misrouting when DPPolicy
shards below page granularity. Memory note: `project_mmu_subpage_stopgap`.

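The three operations in D3 can be sketched as follows. This is an illustrative reconstruction from the description above, not the project's `PeMMU`: `SubPageTable` and `_split` are hypothetical names, and the 4 KiB page size is chosen only to match the sharding example in the Context section.

```python
class PageFault(Exception):
    """Raised when no mapped region covers the virtual address."""


class SubPageTable:
    """Illustrative region-list table; names and layout are assumptions."""

    def __init__(self, page_size=4096):
        self.page_size = page_size
        # vpn -> [(start_in_page, end_in_page, pa_at_offset_zero), ...]
        self._table = {}

    def _split(self, va, size):
        """Yield (vpn, start_in_page, end_in_page, bytes_before) per page."""
        done = 0
        while done < size:
            vpn, start = divmod(va + done, self.page_size)
            chunk = min(self.page_size - start, size - done)
            yield vpn, start, start + chunk, done
            done += chunk

    def map(self, va, pa, size):
        for vpn, start, end, done in self._split(va, size):
            # Store the PA that offset 0 of this page would translate to.
            region = (start, end, pa + done - start)
            self._table.setdefault(vpn, []).append(region)

    def translate(self, va):
        vpn, offset = divmod(va, self.page_size)
        # Newest region first, so the latest overlapping map() wins.
        for start, end, pa_at_zero in reversed(self._table.get(vpn, [])):
            if start <= offset < end:
                return pa_at_zero + offset
        raise PageFault(hex(va))

    def unmap(self, va, size):
        for vpn, lo, hi, _ in self._split(va, size):
            # Drop only regions fully contained in the unmapped extent.
            self._table[vpn] = [
                r for r in self._table.get(vpn, [])
                if not (lo <= r[0] and r[1] <= hi)
            ]


table = SubPageTable(page_size=4096)
table.map(va=0x1000, pa=0x8000, size=128)    # first 128 B shard
table.map(va=0x1080, pa=0x20000, size=128)   # second shard, same page
```

Both shards live in the same page yet resolve to different physical targets, which a one-PA-per-page table could not represent.
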
### D4. PageFault signals PA fallback

If `translate()` is called with an unmapped VA, `PageFault` is raised. PE_DMA
catches the exception and **uses the original address as a PA** (the PA-only
backward-compatibility path from ADR-0011). PageFault is therefore not an
error; it is the signal for "no VA mapping, interpret as PA".

This path is intentional and preserves backward compatibility with the
ADR-0011 PA-only mode.

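The fallback in D4 amounts to a small try/except on the caller side. The sketch below uses hypothetical names (`StubMmu`, `resolve_address`) and a dict-backed translator purely for illustration; the actual PE_DMA code is not reproduced here.

```python
class PageFault(Exception):
    """Signals "no VA mapping"; treated as a fallback cue, not an error."""


class StubMmu:
    """Hypothetical translator backed by an exact-address dict."""

    def __init__(self, mappings):
        self._mappings = dict(mappings)

    def translate(self, va):
        try:
            return self._mappings[va]
        except KeyError:
            raise PageFault(hex(va)) from None


def resolve_address(mmu, addr):
    """Return the translated PA, or addr itself when no mapping exists."""
    try:
        return mmu.translate(addr)
    except PageFault:
        return addr  # PA-only mode: interpret the address as physical


mmu = StubMmu({0x1000: 0x8000})
mapped = resolve_address(mmu, 0x1000)
unmapped = resolve_address(mmu, 0x2000)
```
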
### D5. MMU sideband-message reception contract

`MmuMapMsg` / `MmuUnmapMsg` arrive over the fabric at PE_MMU's `_inbox`
(SPEC R10: "MMU map installation incurs measured fabric latency"). Schemas
live in `runtime_api/kernel.py`:

- `MmuMapMsg.entries: tuple[dict, ...]`, where each dict is
  `{"va": int, "pa": int, "size": int}`.
- `MmuUnmapMsg.entries: tuple[dict, ...]`, where each dict is
  `{"va": int, "size": int}`.

PE_MMU reception flow:

1. `_worker` does `_inbox.get()` for one message.
2. `hasattr(msg, "request")` confirms a Transaction wrapper.
3. `isinstance(msg.request, MmuMapMsg)` → for each entry, call
   `self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"])`.
4. `isinstance(msg.request, MmuUnmapMsg)` → for each entry, call
   `self._mmu.unmap(va=e["va"], size=e["size"])`.
5. Both branches signal `msg.done.succeed()` after completion.

An external caller (runtime API) waiting on `done` therefore has a SimPy-level
guarantee that "the mapping is installed on-device". This is the realization
of ADR-0011's "MMU map installation incurs measured fabric latency".

This ADR does **not** define the **sender or fan-out policy** for the sideband
message; those are runtime API responsibilities. Only the receive contract
belongs here.

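Steps 2 to 5 of the reception flow can be sketched as a single dispatch function. The message classes below only mirror the `entries` schema quoted above; `StubDone`, `StubTxn`, `RecordingMmu`, and `handle_message` are hypothetical stand-ins, and the real definitions in `runtime_api/kernel.py` may carry more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MmuMapMsg:
    entries: tuple[dict, ...]  # each: {"va": int, "pa": int, "size": int}


@dataclass(frozen=True)
class MmuUnmapMsg:
    entries: tuple[dict, ...]  # each: {"va": int, "size": int}


class StubDone:
    """Hypothetical stand-in for the Transaction's completion event."""

    def __init__(self):
        self.triggered = False

    def succeed(self):
        self.triggered = True


@dataclass
class StubTxn:
    request: object
    done: StubDone


class RecordingMmu:
    """Hypothetical MMU that records calls instead of editing a table."""

    def __init__(self):
        self.calls = []

    def map(self, va, pa, size):
        self.calls.append(("map", va, pa, size))

    def unmap(self, va, size):
        self.calls.append(("unmap", va, size))


def handle_message(mmu, msg):
    """Return True if msg was MMU sideband traffic, False to forward it."""
    request = getattr(msg, "request", None)
    if isinstance(request, MmuMapMsg):
        for e in request.entries:
            mmu.map(va=e["va"], pa=e["pa"], size=e["size"])
    elif isinstance(request, MmuUnmapMsg):
        for e in request.entries:
            mmu.unmap(va=e["va"], size=e["size"])
    else:
        return False  # not sideband traffic: leave it to generic forwarding
    msg.done.succeed()  # signalled only after every entry is applied
    return True


mmu = RecordingMmu()
txn = StubTxn(MmuMapMsg(({"va": 0x1000, "pa": 0x8000, "size": 128},)), StubDone())
handled = handle_message(mmu, txn)
```

Signalling `done` only after the loop finishes is what lets a waiter treat completion as "every entry in this message is installed".
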
### D6. Non-MMU Transactions delegate to generic forwarding

If the `request` of a message pulled from `_inbox` is neither `MmuMapMsg` nor
`MmuUnmapMsg` (or the message lacks a `request` attribute), `_forward_txn`
handles it normally. This keeps the door open for future topologies where
PE_MMU sits on a pass-through path. Current code never sends such traffic,
but the routing remains safe.

## Alternatives Considered

### A1. Make `translate()` a SimPy generator

Rejected. As D2 explains, this blurs op_log / pipeline overlap accounting in
the PE engine.

### A2. Use small page size (e.g., 128 B) instead of sub-page regions

Rejected. This would explode page-table memory and cube-wide map message size.
Most mappings are 2 MiB; pushing the page size below that for the few DPPolicy
sharding cases inflates the average cost.

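The cost argument in A2 comes down to simple arithmetic. The sketch below assumes one page-table entry (and one map-message entry) per page; `entries_per_mapping` is a hypothetical helper used only for this illustration.

```python
MIB = 1024 * 1024


def entries_per_mapping(mapping_bytes, page_bytes):
    """Entries needed to cover one mapping, rounded up to whole pages."""
    return -(-mapping_bytes // page_bytes)


at_2mib_pages = entries_per_mapping(2 * MIB, 2 * MIB)  # 1
at_128b_pages = entries_per_mapping(2 * MIB, 128)      # 16384
```

Under that assumption, a typical 2 MiB mapping would grow from one entry to 16384 entries if the page size were lowered to 128 B.
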
### A3. Make PE_MMU a PE_CPU helper only (not a topology node)

Rejected. ADR-0011 requires that MMU map installation incur measured fabric
latency (via `MmuMapMsg`), which requires PE_MMU to be a node on the graph.
Keeping PE_MMU on the graph also keeps cube NoC visualizer output consistent.

## Consequences

- PE_MMU's dual role is justified at ADR level, so future "unify into one"
  refactor pressure has a documented counterpoint.
- The sub-page region model is explicitly labeled a stopgap, providing a
  basis for deprecating it when the LA model (ADR-0011) lands.
- The "`translate()` does not yield" contract is locked in (D2), so any
  future proposal to add an internal MMU timeout can be denied with a
  documented rationale.
- PA fallback (D4) is normalized, preventing defensive logic from treating
  PageFault as an error.