ADR-0011: Memory Addressing — PA-first with VA/MMU Extension
Status
Accepted (Phase 1 VA/MMU implemented)
Context
A realistic system uses host-side virtual addressing and an MMU/IOMMU-style translation path for DMA: the host allocates physical memory at PE level, maps it into a virtual address space, and installs the mappings. DMA requests then carry virtual addresses that are translated to physical addresses.
The PA-only model (Phase 0) was insufficient for running standard Triton kernels
that use base_addr + offset patterns on sharded tensors — each PE's shard has
a different PA, but the kernel needs a single contiguous address space.
Decision
D1. Phase 0 model is PA-only (original, retained as fallback)
- All device memory accesses (MemoryRead/MemoryWrite) operate on device physical addresses (PA) plus size.
- PA-only mode remains functional via PageFault fallback in PE_DMA.
D2. Allocation produces PA mappings
Device allocation selects PE-local memory regions and returns PA mappings sufficient to execute kernels and issue DMA requests.
D3. Phase 1: VA/MMU layer (implemented)
D3.1 Virtual Address Model
- Each tensor gets a single contiguous VA range (TensorHandle.va_base).
- TensorShard does NOT carry a va field; shard VA is derived as va_base + offset_bytes.
- Kernels receive va_base as their pointer argument (via TensorArg.va_base).
- DmaReadCmd.src_addr and DmaWriteCmd.dst_addr carry VA (not PA).
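The derivation rule above can be illustrated with a minimal sketch. The Shard dataclass and shard_va helper are hypothetical stand-ins, not the project's TensorShard or TensorHandle; only the rule va_base + offset_bytes comes from this ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """Hypothetical stand-in for TensorShard: it stores no VA, only an offset."""
    pe_id: int
    pa: int            # PE-local physical address of the shard
    offset_bytes: int  # byte offset of the shard inside the whole tensor
    size_bytes: int


def shard_va(va_base: int, shard: Shard) -> int:
    """Derive the shard's VA from the tensor-wide base (rule from D3.1)."""
    return va_base + shard.offset_bytes


va_base = 0x1000_0000  # assumed value for illustration
shards = [
    Shard(pe_id=0, pa=0x0000, offset_bytes=0, size_bytes=4096),
    Shard(pe_id=1, pa=0x8000, offset_bytes=4096, size_bytes=4096),
]
shard_vas = [shard_va(va_base, s) for s in shards]
```

The two shards live at unrelated PAs, yet their VAs are back to back, which is what lets a kernel index the tensor with a single base pointer.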
D3.2 PE_MMU Component
- Hybrid design: SimPy component (inbox for MmuMapMsg) + utility (synchronous translate() called by PE_DMA).
- Page-aligned dict lookup for O(1) VA→PA translation.
- tlb_overhead_ns configurable per-access latency.
- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA directly (backward compatibility with PA-only tests).
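A minimal sketch of the page-aligned lookup and the PageFault fallback described above. SimpleMmu, dma_resolve, and the method signatures are assumptions made for illustration; the SimPy inbox and the tlb_overhead_ns latency are deliberately left out.

```python
class PageFault(Exception):
    """Raised when a VA has no installed mapping."""


class SimpleMmu:
    """Hypothetical model of the translate() utility: one dict entry per page."""

    def __init__(self, page_size: int = 4096):
        self.page_size = page_size
        self._pages = {}  # VA page base -> PA page base

    def map(self, va: int, pa: int, size: int) -> None:
        if va % self.page_size or pa % self.page_size:
            raise ValueError("va and pa must be page-aligned")
        for off in range(0, size, self.page_size):
            self._pages[va + off] = pa + off

    def translate(self, va: int) -> int:
        page_va = va - (va % self.page_size)  # round down to the page base
        try:
            page_pa = self._pages[page_va]    # single dict lookup, O(1)
        except KeyError:
            raise PageFault(hex(va)) from None
        return page_pa + (va - page_va)


def dma_resolve(mmu: SimpleMmu, addr: int) -> int:
    """Fallback behaviour: an unmapped address is treated as a PA."""
    try:
        return mmu.translate(addr)
    except PageFault:
        return addr


mmu = SimpleMmu()
mmu.map(va=0x1000_0000, pa=0x8000, size=8192)
mapped = dma_resolve(mmu, 0x1000_1010)  # second page, offset 0x10
unmapped = dma_resolve(mmu, 0x40)       # no mapping, passes through as PA
```

Keeping the fallback in the DMA-side helper rather than in translate() keeps the MMU strict while still letting PA-only callers work unchanged.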
D3.3 Mapping Installation
- MmuMapMsg traverses the fabric: Host → PCIE_EP → IO_CPU (cube fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured end-to-end.
- MmuMapMsg.target_sips controls SIP-level routing to prevent cross-SIP mapping contamination for replicated tensors.
- Mapping strategy based on DPPolicy.cube:
  - Replicate (cube="replicate"): per-(sip, cube) local mapping only. Each cube's PEs see only their local PA. No cross-cube mapping installed.
  - Sharded (cube="column_wise", etc.): broadcast all shard mappings to all target cubes. Enables cross-PE and cross-cube DMA.
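The two mapping strategies can be contrasted with a small planning function. This is a sketch under stated assumptions: plan_mappings and the (sip, cube, va, pa) tuple layout are hypothetical, per-PE granularity is omitted, and the real MmuMapMsg fields are not shown in this ADR.

```python
def plan_mappings(cube_policy, shards, target_cubes):
    """Decide which (va, pa) pairs each target cube should receive.

    shards:       list of (sip, cube, va, pa) tuples (hypothetical layout)
    target_cubes: list of (sip, cube) tuples
    Returns {(sip, cube): [(va, pa), ...]}.
    """
    plan = {tc: [] for tc in target_cubes}
    for sip, cube, va, pa in shards:
        if cube_policy == "replicate":
            # Local-only: a cube sees the PA of its own copy and nothing else.
            if (sip, cube) in plan:
                plan[(sip, cube)].append((va, pa))
        else:
            # Sharded policies: every target cube learns every shard mapping.
            for tc in target_cubes:
                plan[tc].append((va, pa))
    return plan


VA = 0x1000_0000  # assumed base address for illustration
cubes = [(0, 0), (0, 1)]

# Replicated tensor: same VA on both cubes, different local PA.
replicated = plan_mappings(
    "replicate", [(0, 0, VA, 0x1000), (0, 1, VA, 0x2000)], cubes
)

# Column-wise sharded tensor: two shards at consecutive VAs.
sharded = plan_mappings(
    "column_wise", [(0, 0, VA, 0x1000), (0, 1, VA + 4096, 0x2000)], cubes
)
```

In the replicated case the same VA resolves to a different PA on each cube, which is why broadcasting those mappings would overwrite one another.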
D3.4 Tensor Lifecycle
- del tensor triggers automatic cleanup via Tensor.__del__ + weakref to RuntimeContext. Sends MmuUnmapMsg through fabric, returns VA and PA space.
- with RuntimeContext(...) as ctx: provides scope-based bulk cleanup.
- RuntimeContext._tensors uses weakref.ref to avoid preventing GC.
- PEMemAllocator uses free-list with coalescing (not bump allocator).
- VirtualAllocator uses free-list with coalescing for VA space.
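The lifecycle above can be sketched with plain Python weak references. Context and Tensor here are simplified, hypothetical stand-ins for RuntimeContext and the real tensor class: release() only records the freed va_base, where the real system would send MmuUnmapMsg and return VA and PA space.

```python
import weakref


class Context:
    """Hypothetical stand-in for RuntimeContext."""

    def __init__(self):
        self._tensors = []  # weakref.ref entries, so tensors can still be GC'd
        self.unmapped = []  # va_base values released so far

    def register(self, tensor):
        self._tensors.append(weakref.ref(tensor))

    def release(self, va_base):
        self.unmapped.append(va_base)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Scope-based bulk cleanup: free every tensor that is still alive.
        for ref in self._tensors:
            tensor = ref()
            if tensor is not None:
                tensor.free()
        self._tensors.clear()
        return False


class Tensor:
    def __init__(self, ctx, va_base):
        self._ctx = weakref.ref(ctx)  # weak back-reference to the context
        self.va_base = va_base
        self._freed = False
        ctx.register(self)

    def free(self):
        if self._freed:               # idempotent: __del__ may run after free()
            return
        self._freed = True
        ctx = self._ctx()
        if ctx is not None:
            ctx.release(self.va_base)

    def __del__(self):
        self.free()


with Context() as ctx:
    a = Tensor(ctx, 0x1000)
    b = Tensor(ctx, 0x2000)
    del a  # CPython refcounting runs __del__ right away
    released_early = list(ctx.unmapped)
released_all = sorted(ctx.unmapped)
```

The idempotent free() matters because a tensor cleaned up by the context exit will still have __del__ called later.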
D3.5 Allocators
- VirtualAllocator: device-wide VA space, page-aligned alloc/free with coalescing.
- PEMemAllocator: per-PE HBM/TCM, free-list based alloc/free with coalescing.
- Page size configurable via topology.yaml pe_mmu attrs (default 4096).
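A minimal first-fit free-list allocator with coalescing, to show why freed neighbours become reusable as one block. FreeListAllocator and its method signatures are assumptions for illustration, not the project's VirtualAllocator or PEMemAllocator API; only the page-aligned, coalescing behaviour and the 4096 default come from this ADR.

```python
class FreeListAllocator:
    """Hypothetical page-aligned first-fit allocator with coalescing."""

    def __init__(self, base: int, size: int, page_size: int = 4096):
        self.page_size = page_size
        self._free = [(base, size)]  # sorted list of (start, length) holes
        self._live = {}              # start -> length of live allocations

    def alloc(self, nbytes: int) -> int:
        # Round the request up to a whole number of pages.
        need = -(-nbytes // self.page_size) * self.page_size
        for i, (start, length) in enumerate(self._free):
            if length >= need:
                if length == need:
                    del self._free[i]
                else:
                    self._free[i] = (start + need, length - need)
                self._live[start] = need
                return start
        raise MemoryError(f"no free block of {need} bytes")

    def free(self, start: int) -> None:
        length = self._live.pop(start)
        self._free.append((start, length))
        self._free.sort()
        # Coalesce: merge every hole that touches the one before it.
        merged = []
        for s, n in self._free:
            if merged and merged[-1][0] + merged[-1][1] == s:
                merged[-1] = (merged[-1][0], merged[-1][1] + n)
            else:
                merged.append((s, n))
        self._free = merged


heap = FreeListAllocator(base=0x1000_0000, size=4 * 4096)
a = heap.alloc(100)    # rounded up to one page
b = heap.alloc(4096)   # one page
c = heap.alloc(4097)   # rounded up to two pages, heap is now full
heap.free(a)
heap.free(b)           # a and b are adjacent, so they merge into one hole
d = heap.alloc(8192)   # fits only because the two holes were coalesced
```

A bump allocator could not serve the final request, since it never reclaims space; a free list without coalescing would also fail, because it would hold two separate one-page holes.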
Consequences
- Triton kernels use base_addr + offset patterns naturally on sharded tensors.
- All latency remains explicit via graph traversal, including MMU mapping installation and per-access TLB overhead.
- PA-only mode retained as fallback (PageFault → treat as PA).
- Benchmark parameter renamed ctx→torch for PyTorch code compatibility.
- IPCQ and other fixed-address resources bypass MMU (use PA directly).
Links
- ADR-0007 (runtime_api vs sim_engine boundaries)
- ADR-0008 (tensor deployment)
- ADR-0009 (kernel execution)
- ADR-0014 (PE-internal execution model)
- ADR-0015 (component port/wire model)
- SPEC R2 (latency by traversal)