Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg

Implement VA/MMU layer (ADR-0011 Phase 1) enabling Triton kernels to use contiguous virtual addresses on sharded tensors.

Key changes:
- PE_MMU component: hybrid inbox (MmuMapMsg) + sync translate() for PE_DMA
- VirtualAllocator + PEMemAllocator: free-list with coalescing
- MmuMapMsg/MmuUnmapMsg fabric path with SIP-level routing
- DPPolicy-based mapping: replicate=local, sharded=broadcast
- Tensor lifecycle: del + weakref cleanup, context manager
- Rename: TensorHandle.pa→addr, DmaReadCmd.src_pa→src_addr, ctx→torch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 00:01:47 -07:00
parent 62fb01ae18
commit 08812eda58
34 changed files with 2131 additions and 139 deletions
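
The commit message mentions a "free-list with coalescing" for `VirtualAllocator` and `PEMemAllocator`, but the allocator code is not part of this hunk. As a rough illustration of the technique only, the sketch below shows a first-fit free list that merges adjacent free ranges on release. The class name `FreeListAllocator`, the 64-byte alignment, and the method names are assumptions made for this example and are not taken from the commit.

```python
class FreeListAllocator:
    """Hypothetical first-fit allocator over [base, base + size)."""

    def __init__(self, base, size, align=64):
        # Assumption: base and size are already multiples of align.
        self.align = align
        self.free = [(base, size)]  # sorted list of (start, length)

    def _round_up(self, nbytes):
        return -(-nbytes // self.align) * self.align

    def alloc(self, nbytes):
        need = self._round_up(nbytes)
        for i, (start, length) in enumerate(self.free):
            if length >= need:
                if length == need:
                    self.free.pop(i)  # range consumed exactly
                else:
                    self.free[i] = (start + need, length - need)
                return start
        raise MemoryError(f"no free range of {need} bytes")

    def release(self, addr, nbytes):
        # Return the range, then coalesce neighbours that touch.
        self.free.append((addr, self._round_up(nbytes)))
        self.free.sort()
        merged = []
        for start, length in self.free:
            if merged and merged[-1][0] + merged[-1][1] == start:
                prev_start, prev_len = merged[-1]
                merged[-1] = (prev_start, prev_len + length)
            else:
                merged.append((start, length))
        self.free = merged


heap = FreeListAllocator(base=0x1000, size=4096)
x = heap.alloc(100)   # rounded up to 128 bytes
y = heap.alloc(200)   # rounded up to 256 bytes
z = heap.alloc(64)
heap.release(y, 200)
heap.release(x, 100)  # merges with the range freed for y
heap.release(z, 64)   # everything coalesces back into one range
```

Sorting and re-merging on every release keeps the sketch short; a real allocator would more likely insert in place and check only the two neighbours.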
@@ -207,12 +207,15 @@ benchmark instances by default.

 ## R10. Memory Addressing (Phase 0)

-In Phase 0, the simulator uses a **PA-first memory model**:
+The simulator uses a **VA/PA memory model** (ADR-0011):

- All memory operations use device physical addresses (PA) only.
- Virtual addressing, MMU/IOMMU, and address translation latency are out of scope.
+- Tensors are assigned a contiguous virtual address (VA) range at deployment.
+- PE_MMU translates VA→PA per access; TLB overhead is configurable.
+- Mapping installation (MmuMapMsg) traverses the fabric with measured latency.
+- Replicate tensors use per-cube local PA mapping; sharded tensors broadcast.
+- PA-only fallback is retained for backward compatibility.
 - Tensor placement is represented as a list of PA shards, each explicitly tagged
-  with `(sip, cube, pe)`.
+  with `(sip, cube, pe)`, plus a tensor-wide `va_base`.

 All memory access latency MUST be modeled explicitly via graph traversal.
 No implicit translation or hidden latency is allowed.
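
The updated requirement describes tensor placement as a list of PA shards tagged with `(sip, cube, pe)` plus a tensor-wide `va_base`. A minimal sketch of how a VA could be resolved against such a placement is shown below. It assumes shards occupy the VA range back to back in list order, which the requirement does not state, and the names `Shard`, `ShardMap`, and this `translate` signature are hypothetical rather than the simulator's actual API. Latency modeling is deliberately left out.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    sip: int
    cube: int
    pe: int
    pa: int       # physical base address of this shard
    nbytes: int   # shard length in bytes


class ShardMap:
    """Hypothetical VA -> (location, PA) lookup for one tensor."""

    def __init__(self, va_base, shards):
        self.va_base = va_base
        self.shards = list(shards)
        # Assumption: shards are contiguous in VA space, in list order.
        self.starts = []
        offset = 0
        for shard in self.shards:
            self.starts.append(offset)
            offset += shard.nbytes
        self.total = offset

    def translate(self, va):
        offset = va - self.va_base
        if not 0 <= offset < self.total:
            raise ValueError(f"VA {va:#x} is not mapped")
        # Find the last shard whose starting offset is <= offset.
        i = bisect_right(self.starts, offset) - 1
        shard = self.shards[i]
        pa = shard.pa + (offset - self.starts[i])
        return (shard.sip, shard.cube, shard.pe), pa


tensor = ShardMap(
    va_base=0x10000,
    shards=[Shard(0, 0, 0, pa=0x2000, nbytes=256),
            Shard(0, 0, 1, pa=0x8000, nbytes=256)],
)
location, pa = tensor.translate(0x10100)  # first byte of the second shard
```

In the simulator itself, each such lookup would still need to be charged as explicit latency through graph traversal, per the rule above.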