Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg

Implement VA/MMU layer (ADR-0011 Phase 1) enabling Triton kernels to use
contiguous virtual addresses on sharded tensors.

Key changes:
- PE_MMU component: hybrid inbox (MmuMapMsg) + sync translate() for PE_DMA
- VirtualAllocator + PEMemAllocator: free-list with coalescing
- MmuMapMsg/MmuUnmapMsg fabric path with SIP-level routing
- DPPolicy-based mapping: replicate=local, sharded=broadcast
- Tensor lifecycle: del + weakref cleanup, context manager
- Rename: TensorHandle.pa→addr, DmaReadCmd.src_pa→src_addr, ctx→torch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-03-26 00:01:47 -07:00
parent 62fb01ae18
commit 08812eda58
34 changed files with 2131 additions and 139 deletions
+7 -4
View File
@@ -207,12 +207,15 @@ benchmark instances by default.
## R10. Memory Addressing (Phase 0)
In Phase 0, the simulator uses a **PA-first memory model**:
The simulator uses a **VA/PA memory model** (ADR-0011):
- All memory operations use device physical addresses (PA) only.
- Virtual addressing, MMU/IOMMU, and address translation latency are out of scope.
- Tensors are assigned a contiguous virtual address (VA) range at deployment.
- PE_MMU translates VA→PA per access; TLB overhead is configurable.
- Mapping installation (MmuMapMsg) traverses the fabric with measured latency.
- Replicate tensors use per-cube local PA mapping; sharded tensors broadcast.
- PA-only fallback is retained for backward compatibility.
- Tensor placement is represented as a list of PA shards, each explicitly tagged
with `(sip, cube, pe)`.
with `(sip, cube, pe)`, plus a tensor-wide `va_base`.
All memory access latency MUST be modeled explicitly via graph traversal.
No implicit translation or hidden latency is allowed.