# ADR-0011: Memory Addressing — PA-first with VA/MMU Extension

## Status

Accepted (Phase 1 VA/MMU implemented)

## Context

A realistic system uses host-side virtual addressing and an MMU/IOMMU-style translation path for DMA: the host allocates physical memory at PE level, maps it into a virtual address space, and installs the mappings; DMA requests then use virtual addresses that are translated to physical addresses.

The PA-only model (Phase 0) was insufficient for running standard Triton kernels that use `base_addr + offset` patterns on sharded tensors: each PE's shard has a different PA, but the kernel needs a single contiguous address space.

---

## Decision

### D1. Phase 0 model is PA-only (original, retained as fallback)

- All device memory accesses (MemoryRead/MemoryWrite) operate on device physical addresses (PA) plus size.
- PA-only mode remains functional via PageFault fallback in PE_DMA.

### D2. Allocation produces PA mappings

Device allocation selects PE-local memory regions and returns PA mappings sufficient to execute kernels and issue DMA requests.

### D3. Phase 1: VA/MMU layer (implemented)

#### D3.1 Virtual Address Model

- Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
- `TensorShard` does NOT carry a `va` field; shard VA is derived as `va_base + offset_bytes`.
- Kernels receive `va_base` as their pointer argument (via `TensorArg.va_base`).
- `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).

#### D3.2 PE_MMU Component

- Hybrid design: SimPy component (inbox for MmuMapMsg) + utility (synchronous `translate()` called by PE_DMA).
- Page-aligned dict lookup for O(1) VA→PA translation.
- `tlb_overhead_ns` is a configurable per-access latency.
- PageFault fallback: if a VA has no mapping, PE_DMA treats it as a PA directly (backward compatibility with PA-only tests).

#### D3.3 Mapping Installation

- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured end-to-end.
- `MmuMapMsg.target_sips` controls SIP-level routing to prevent cross-SIP mapping contamination for replicated tensors.
- Mapping strategy based on `DPPolicy.cube`:
  - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping only. Each cube's PEs see only their local PA. No cross-cube mapping is installed.
  - **Sharded** (`cube="shard_m"`, etc.): broadcast all shard mappings to all target cubes. Enables cross-PE and cross-cube DMA.

#### D3.4 Tensor Lifecycle

- `del tensor` triggers automatic cleanup via `Tensor.__del__` + `weakref` to RuntimeContext. It sends `MmuUnmapMsg` through the fabric and returns the VA and PA space.
- `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
- `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
- `PEMemAllocator` uses a free list with coalescing (not a bump allocator).
- `VirtualAllocator` uses a free list with coalescing for VA space.

#### D3.5 Allocators

- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free with coalescing.
- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with coalescing.
- Page size is configurable via `topology.yaml` pe_mmu attrs (default 4096).

---

## Consequences

- Triton kernels use `base_addr + offset` patterns naturally on sharded tensors.
- All latency remains explicit via graph traversal, including MMU mapping installation and per-access TLB overhead.
- PA-only mode retained as fallback (PageFault → treat as PA).
- Benchmark parameter renamed `ctx` → `torch` for PyTorch code compatibility.
- IPCQ and other fixed-address resources bypass MMU (use PA directly).

---

## Links

- ADR-0007 (runtime_api vs sim_engine boundaries)
- ADR-0008 (tensor deployment)
- ADR-0009 (kernel execution)
- ADR-0014 (PE-internal execution model)
- ADR-0015 (component port/wire model)
- SPEC R2 (latency by traversal)
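
---

## Appendix: translation sketch (illustrative)

The addressing rules in D3.1 and D3.2 (shard VA derived as `va_base + offset_bytes`, page-aligned dict lookup, PageFault fallback to PA) can be sketched roughly as below. This is a minimal illustration, not the project's implementation: `SketchMmu`, `PageFault` as an exception class, `map_range`, `unmap_range`, `dma_resolve`, and `shard_va` are hypothetical names introduced here. The SimPy inbox, `MmuMapMsg` handling, and the `tlb_overhead_ns` delay are omitted. Only `translate()`, `va_base`, `offset_bytes`, and the 4096-byte default page size come from the ADR text.

```python
from dataclasses import dataclass, field
from typing import Dict

PAGE_SIZE = 4096  # default page size stated in D3.5


class PageFault(Exception):
    """Hypothetical exception: the VA has no installed mapping."""


@dataclass
class SketchMmu:
    """Hypothetical stand-in for PE_MMU's synchronous translate() utility."""

    page_size: int = PAGE_SIZE
    # Maps a page-aligned VA to the page-aligned PA backing it.
    pages: Dict[int, int] = field(default_factory=dict)

    def map_range(self, va: int, pa: int, size: int) -> None:
        """Install one dict entry per page covered by [va, va + size)."""
        if va % self.page_size or pa % self.page_size:
            raise ValueError("va and pa must be page-aligned")
        for off in range(0, size, self.page_size):
            self.pages[va + off] = pa + off

    def unmap_range(self, va: int, size: int) -> None:
        """Remove the entries installed by map_range for the same range."""
        for off in range(0, size, self.page_size):
            self.pages.pop(va + off, None)

    def translate(self, va: int) -> int:
        """O(1) lookup: split VA into page base and in-page offset."""
        page_off = va % self.page_size
        page_base = va - page_off
        if page_base not in self.pages:
            raise PageFault(hex(va))
        return self.pages[page_base] + page_off


def dma_resolve(mmu: SketchMmu, addr: int) -> int:
    """PA-only fallback: an unmapped address is used as a PA directly."""
    try:
        return mmu.translate(addr)
    except PageFault:
        return addr


def shard_va(va_base: int, offset_bytes: int) -> int:
    """Shards carry no VA of their own; it is derived from the tensor base."""
    return va_base + offset_bytes


if __name__ == "__main__":
    mmu = SketchMmu()
    mmu.map_range(va=0x10000, pa=0x80000, size=2 * PAGE_SIZE)
    # An address in the second page of the mapped tensor resolves through the MMU.
    assert dma_resolve(mmu, shard_va(0x10000, PAGE_SIZE + 16)) == 0x81010
    # An unmapped address falls back to being treated as a PA.
    assert dma_resolve(mmu, 0x2000) == 0x2000
```

Keeping the fallback in the DMA-side helper rather than inside `translate()` mirrors the ADR's statement that PE_DMA, not the MMU, decides to treat an unmapped address as a PA.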