Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg

Implement VA/MMU layer (ADR-0011 Phase 1) enabling Triton kernels to use contiguous virtual addresses on sharded tensors.

Key changes:
- PE_MMU component: hybrid inbox (MmuMapMsg) + sync translate() for PE_DMA
- VirtualAllocator + PEMemAllocator: free-list with coalescing
- MmuMapMsg/MmuUnmapMsg fabric path with SIP-level routing
- DPPolicy-based mapping: replicate=local, sharded=broadcast
- Tensor lifecycle: del + weakref cleanup, context manager
- Rename: TensorHandle.pa→addr, DmaReadCmd.src_pa→src_addr, ctx→torch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 00:01:47 -07:00
parent 62fb01ae18
commit 08812eda58
34 changed files with 2131 additions and 139 deletions
@@ -1,8 +1,8 @@
-# ADR-0011: Memory Addressing Simplification (PA-first)
+# ADR-0011: Memory Addressing — PA-first with VA/MMU Extension

 ## Status

-Accepted
+Accepted (Phase 1 VA/MMU implemented)

 ## Context

@@ -11,49 +11,82 @@ translation path for DMA: host allocates physical memory at PE level, maps it
 into a virtual address space, installs mappings, and DMA requests use virtual
 addresses that are translated to physical addresses.

-For early development, we want a minimal, deterministic model that enables:
-
- correct routing and latency accounting through the graph,
- stable tensor deployment and kernel execution semantics,
- future extension toward VA/MMU without rewriting workflows.
+The PA-only model (Phase 0) was insufficient for running standard Triton kernels
+that use `base_addr + offset` patterns on sharded tensors — each PE's shard has
+a different PA, but the kernel needs a single contiguous address space.

 ---

 ## Decision

-### D1. Phase 0 model is PA-only
-
-The simulator uses a PA-first model:
+### D1. Phase 0 model is PA-only (original, retained as fallback)

 - All device memory accesses (MemoryRead/MemoryWrite) operate on device physical
  addresses (PA) plus size.
- Tensor handles store PA-based shard mappings after deployment.
- KernelLaunch passes tensor arguments as PA-based mappings (or references to them).
- MMU/IOMMU concepts (virtual address spaces, page tables, translation latency)
-  are NOT modeled in Phase 0.
+- PA-only mode remains functional via PageFault fallback in PE_DMA.

 ### D2. Allocation produces PA mappings

 Device allocation selects PE-local memory regions and returns PA mappings
 sufficient to execute kernels and issue DMA requests.

-### D3. Extension path (non-breaking)
+### D3. Phase 1: VA/MMU layer (implemented)

-A future ADR MAY introduce an optional VA/MMU layer by:
+#### D3.1 Virtual Address Model

- introducing virtual addresses in tensor handles,
- adding a mapping-install step,
- modeling translation latency and page granularity.
+- Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
+- `TensorShard` does NOT carry a `va` field — shard VA is derived as
+  `va_base + offset_bytes`.
+- Kernels receive `va_base` as their pointer argument (via `TensorArg.va_base`).
+- `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).

-The Phase 0 PA model remains a valid fast-path configuration.
+#### D3.2 PE_MMU Component
+
+- Hybrid design: SimPy component (inbox for MmuMapMsg) + utility (synchronous
+  `translate()` called by PE_DMA).
+- Page-aligned dict lookup for O(1) VA→PA translation.
+- `tlb_overhead_ns` configurable per-access latency.
+- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA directly
+  (backward compatibility with PA-only tests).
+
+#### D3.3 Mapping Installation
+
+- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube fan-out) →
+  M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured end-to-end.
+- `MmuMapMsg.target_sips` controls SIP-level routing to prevent cross-SIP
+  mapping contamination for replicated tensors.
+- Mapping strategy based on `DPPolicy.cube`:
+  - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping only.
+    Each cube's PEs see only their local PA. No cross-cube mapping installed.
+  - **Sharded** (`cube="shard_m"`, etc.): broadcast all shard mappings to all
+    target cubes. Enables cross-PE and cross-cube DMA.
+
+#### D3.4 Tensor Lifecycle
+
+- `del tensor` triggers automatic cleanup via `Tensor.__del__` + `weakref` to
+  RuntimeContext. Sends `MmuUnmapMsg` through fabric, returns VA and PA space.
+- `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
+- `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
+- `PEMemAllocator` uses free-list with coalescing (not bump allocator).
+- `VirtualAllocator` uses free-list with coalescing for VA space.
+
+#### D3.5 Allocators
+
+- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free with
+  coalescing.
+- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with coalescing.
+- Page size configurable via `topology.yaml` pe_mmu attrs (default 4096).

 ---

 ## Consequences

- Early implementation stays simple and testable.
- All latency remains explicit via graph traversal, not hidden translation.
- Future VA/MMU modeling can be added without breaking existing benchmarks.
+- Triton kernels use `base_addr + offset` patterns naturally on sharded tensors.
+- All latency remains explicit via graph traversal, including MMU mapping
+  installation and per-access TLB overhead.
+- PA-only mode retained as fallback (PageFault → treat as PA).
+- Benchmark parameter renamed `ctx` → `torch` for PyTorch code compatibility.
+- IPCQ and other fixed-address resources bypass MMU (use PA directly).

 ---

@@ -62,4 +95,6 @@ The Phase 0 PA model remains a valid fast-path configuration.
 - ADR-0007 (runtime_api vs sim_engine boundaries)
 - ADR-0008 (tensor deployment)
 - ADR-0009 (kernel execution)
+- ADR-0014 (PE-internal execution model)
+- ADR-0015 (component port/wire model)
 - SPEC R2 (latency by traversal)
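
Reviewer note: D3.1 and D3.2 of the ADR describe shard VA derivation (`va_base + offset_bytes`), a page-aligned dict lookup, and a PageFault fallback that treats an unmapped VA as a PA. The sketch below is one possible reading of that behavior, not the code in this commit. `SketchMmu`, `PageFault`, `resolve`, and `shard_va` are hypothetical names chosen for illustration; only `tlb_overhead_ns` and the 4096-byte default page size come from the ADR text, and the SimPy inbox side of PE_MMU is left out entirely.

```python
from dataclasses import dataclass, field


class PageFault(Exception):
    """Raised when a VA has no installed mapping (hypothetical name)."""


@dataclass
class SketchMmu:
    """Illustrative stand-in for the translate() side of PE_MMU."""
    page_size: int = 4096          # ADR default; configurable in topology.yaml
    tlb_overhead_ns: float = 0.0   # per-access latency, accounted for by the caller
    _table: dict = field(default_factory=dict)  # page-aligned VA -> page-aligned PA

    def map(self, va: int, pa: int, size: int) -> None:
        """Install one table entry per page covering [va, va + size)."""
        if va % self.page_size or pa % self.page_size:
            raise ValueError("va and pa must be page-aligned")
        for off in range(0, size, self.page_size):
            self._table[va + off] = pa + off

    def unmap(self, va: int, size: int) -> None:
        """Remove the entries installed by map(); missing pages are ignored."""
        for off in range(0, size, self.page_size):
            self._table.pop(va + off, None)

    def translate(self, va: int) -> int:
        """O(1) lookup: split the VA into page base and in-page offset."""
        page = va - (va % self.page_size)
        try:
            return self._table[page] + (va - page)
        except KeyError:
            raise PageFault(hex(va)) from None


def resolve(mmu: SketchMmu, addr: int) -> int:
    """What a DMA engine might do: translate, else treat addr as a PA."""
    try:
        return mmu.translate(addr)
    except PageFault:
        return addr  # PA-only fallback described in D3.2


def shard_va(va_base: int, offset_bytes: int) -> int:
    """Shard VA is derived from the tensor base, not stored per shard (D3.1)."""
    return va_base + offset_bytes
```

For example, after `mmu.map(0x10000, 0x8000_0000, 8192)`, an access to `shard_va(0x10000, 4100)` resolves into the second mapped page, while an address that was never mapped passes through `resolve` unchanged.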