# ADR-0011: Memory Addressing — PA-first with VA/MMU Extension

## Status

Accepted (Phase 1 VA/MMU implemented)

## Context

A realistic system uses host-side virtual addressing and an MMU/IOMMU-style
translation path for DMA: the host allocates physical memory at PE level, maps
it into a virtual address space, and installs the mappings. DMA requests then
carry virtual addresses that are translated to physical addresses.

The PA-only model (Phase 0) was insufficient for running standard Triton kernels
that use `base_addr + offset` patterns on sharded tensors: each PE's shard has
a different PA, but the kernel needs a single contiguous address space.

---

## Decision

### D1. Phase 0 model is PA-only (original, retained as fallback)

- All device memory accesses (MemoryRead/MemoryWrite) operate on device physical
  addresses (PA) plus size.
- PA-only mode remains functional via PageFault fallback in PE_DMA.

### D2. Allocation produces PA mappings

Device allocation selects PE-local memory regions and returns PA mappings
sufficient to execute kernels and issue DMA requests.

### D3. Phase 1: VA/MMU layer (implemented)

#### D3.1 Virtual Address Model

- Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
- `TensorShard` does NOT carry a `va` field; shard VA is derived as
  `va_base + offset_bytes`.
- Kernels receive `va_base` as their pointer argument (via `TensorArg.va_base`).
- `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).
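
The derivation rule in D3.1 can be illustrated with a minimal sketch. Only `TensorHandle.va_base`, `TensorShard`, and `offset_bytes` come from this ADR; the other fields (`pe_id`, `pa`, `size_bytes`), the `shard_va` and `owning_shard` helpers, and all address values are assumptions made for illustration and may not match the real classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TensorShard:
    """One PE-local piece of a tensor. Deliberately has no `va` field."""
    pe_id: int         # hypothetical: which PE owns this shard
    pa: int            # physical address in that PE's local memory
    offset_bytes: int  # byte offset of the shard within the whole tensor
    size_bytes: int    # hypothetical: shard length in bytes


@dataclass(frozen=True)
class TensorHandle:
    va_base: int
    shards: List[TensorShard]

    def shard_va(self, shard: TensorShard) -> int:
        # Shard VA is derived on demand, never stored.
        return self.va_base + shard.offset_bytes

    def owning_shard(self, va: int) -> TensorShard:
        # Hypothetical helper: find the shard that backs a kernel address.
        for shard in self.shards:
            start = self.shard_va(shard)
            if start <= va < start + shard.size_bytes:
                return shard
        raise ValueError(f"VA {va:#x} is outside the tensor")


handle = TensorHandle(
    va_base=0x1000_0000,
    shards=[
        TensorShard(pe_id=0, pa=0x0000_4000, offset_bytes=0, size_bytes=4096),
        TensorShard(pe_id=1, pa=0x0009_0000, offset_bytes=4096, size_bytes=4096),
    ],
)

# The shard PAs are unrelated, but the derived VAs are contiguous.
shard_vas = [handle.shard_va(s) for s in handle.shards]

# A kernel computing `base_addr + offset` for float32 element 1500
# lands in the second shard without knowing about sharding.
elem_va = handle.va_base + 1500 * 4
elem_owner = handle.owning_shard(elem_va).pe_id
```

The kernel only ever sees `va_base`; which PE backs a given address is resolved later by the translation layer.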

#### D3.2 PE_MMU Component

- Hybrid design: SimPy component (inbox for MmuMapMsg) + utility (synchronous
  `translate()` called by PE_DMA).
- Page-aligned dict lookup for O(1) VA→PA translation.
- `tlb_overhead_ns` configurable per-access latency.
- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA directly
  (backward compatibility with PA-only tests).
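
A page-aligned dict lookup with the PageFault fallback might look roughly like the sketch below. It is not the PE_MMU implementation: the class name `PageTable`, the `map()` signature, and the `resolve_dma_address()` helper are hypothetical, and the SimPy inbox and `tlb_overhead_ns` latency are omitted so the example stays self-contained.

```python
from typing import Dict


class PageFault(Exception):
    """Raised when a VA has no installed mapping."""


class PageTable:
    def __init__(self, page_size: int = 4096) -> None:
        self.page_size = page_size
        self._pages: Dict[int, int] = {}  # VA page base -> PA page base

    def map(self, va: int, pa: int, size: int) -> None:
        if va % self.page_size or pa % self.page_size:
            raise ValueError("va and pa must be page-aligned")
        # Install one dict entry per page covered by the mapping.
        for off in range(0, size, self.page_size):
            self._pages[va + off] = pa + off

    def translate(self, va: int) -> int:
        # O(1): split into page base and in-page offset, then one dict lookup.
        page_off = va % self.page_size
        pa_page = self._pages.get(va - page_off)
        if pa_page is None:
            raise PageFault(hex(va))
        return pa_page + page_off


def resolve_dma_address(mmu: PageTable, addr: int) -> int:
    """Stand-in for the DMA side: on PageFault, treat the address as a PA."""
    try:
        return mmu.translate(addr)
    except PageFault:
        return addr


mmu = PageTable()
mmu.map(va=0x1000_0000, pa=0x4000, size=8192)

mapped = resolve_dma_address(mmu, 0x1000_1010)  # second page, offset 0x10
fallback = resolve_dma_address(mmu, 0x2000)     # no mapping: used as PA
```

Keeping the fallback on the caller's side, rather than inside `translate()`, means a page fault stays observable to any caller that wants to handle it differently.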

#### D3.3 Mapping Installation

- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube fan-out) →
  M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured end-to-end.
- `MmuMapMsg.target_sips` controls SIP-level routing to prevent cross-SIP
  mapping contamination for replicated tensors.
- Mapping strategy based on `DPPolicy.cube`:
  - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping only.
    Each cube's PEs see only their local PA. No cross-cube mapping installed.
  - **Sharded** (`cube="column_wise"`, etc.): broadcast all shard mappings to all
    target cubes. Enables cross-PE and cross-cube DMA.
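
The difference between the two mapping strategies can be sketched as a planning function that decides which mappings each (sip, cube) target receives. Everything here is illustrative: `Shard`, `plan_mappings`, and the addresses are hypothetical names and values, the example uses a single SIP, and the real routing (including `target_sips` handling) is not modelled.

```python
from typing import Dict, List, NamedTuple, Tuple


class Shard(NamedTuple):
    sip: int
    cube: int
    va: int
    pa: int


Target = Tuple[int, int]  # (sip, cube)


def plan_mappings(shards: List[Shard], cube_policy: str) -> Dict[Target, List[Shard]]:
    """Return, per (sip, cube) target, the mappings to install there."""
    targets = sorted({(s.sip, s.cube) for s in shards})
    plan: Dict[Target, List[Shard]] = {t: [] for t in targets}
    for shard in shards:
        if cube_policy == "replicate":
            # Local only: a cube never learns another cube's PA.
            plan[(shard.sip, shard.cube)].append(shard)
        else:
            # Sharded policies: every target cube gets every shard mapping.
            for target in targets:
                plan[target].append(shard)
    return plan


# Replicated tensor: same VA in both cubes, different local PA.
replicated = [
    Shard(sip=0, cube=0, va=0x1000, pa=0xA000),
    Shard(sip=0, cube=1, va=0x1000, pa=0xB000),
]
plan_rep = plan_mappings(replicated, "replicate")

# Column-wise sharded tensor: distinct VA ranges, visible from both cubes.
sharded = [
    Shard(sip=0, cube=0, va=0x1000, pa=0xA000),
    Shard(sip=0, cube=1, va=0x2000, pa=0xB000),
]
plan_sh = plan_mappings(sharded, "column_wise")
```

In the replicated case both cubes map the same VA to different PAs, which is why those mappings must not leak across cubes; in the sharded case the VAs are disjoint, so broadcasting is safe and is what makes cross-cube DMA resolvable.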

#### D3.4 Tensor Lifecycle

- `del tensor` triggers automatic cleanup via `Tensor.__del__` + `weakref` to
  RuntimeContext: it sends `MmuUnmapMsg` through the fabric and returns the VA
  and PA space to their allocators.
- `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
- `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
- `PEMemAllocator` uses free-list with coalescing (not bump allocator).
- `VirtualAllocator` uses free-list with coalescing for VA space.

#### D3.5 Allocators

- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free with
  coalescing.
- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with coalescing.
- Page size configurable via `topology.yaml` pe_mmu attrs (default 4096).
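
To show why coalescing matters compared with a bump allocator, here is a minimal first-fit free-list sketch with page-aligned sizes. It is an illustration under assumptions, not the code of `VirtualAllocator` or `PEMemAllocator`: the class name `FreeListAllocator`, its method signatures, and the first-fit policy are all hypothetical.

```python
from typing import List, Tuple


class FreeListAllocator:
    def __init__(self, base: int, size: int, page_size: int = 4096) -> None:
        self.page_size = page_size
        # Free regions as (start, length), kept sorted by start address.
        self._free: List[Tuple[int, int]] = [(base, size)]

    def _round_up(self, n: int) -> int:
        return -(-n // self.page_size) * self.page_size

    def alloc(self, size: int) -> int:
        need = self._round_up(size)
        for i, (start, length) in enumerate(self._free):
            if length >= need:  # first fit
                if length == need:
                    del self._free[i]
                else:
                    self._free[i] = (start + need, length - need)
                return start
        raise MemoryError("no free region large enough")

    def free(self, addr: int, size: int) -> None:
        self._free.append((addr, self._round_up(size)))
        self._free.sort()
        # Coalesce: merge every region that touches its predecessor.
        merged = [self._free[0]]
        for start, length in self._free[1:]:
            last_start, last_len = merged[-1]
            if last_start + last_len == start:
                merged[-1] = (last_start, last_len + length)
            else:
                merged.append((start, length))
        self._free = merged


va_space = FreeListAllocator(base=0x1000_0000, size=4 * 4096)
a = va_space.alloc(100)    # rounded up to one page
b = va_space.alloc(4096)
c = va_space.alloc(8192)   # address space is now fully used

va_space.free(a, 100)
va_space.free(b, 4096)     # adjacent to a's region, so the two merge

# This two-page request only fits because the freed pages were coalesced.
d = va_space.alloc(8192)
```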

---

## Consequences

- Triton kernels use `base_addr + offset` patterns naturally on sharded tensors.
- All latency remains explicit via graph traversal, including MMU mapping
  installation and per-access TLB overhead.
- PA-only mode retained as fallback (PageFault → treat as PA).
- Benchmark parameter renamed `ctx` → `torch` for PyTorch code compatibility.
- IPCQ and other fixed-address resources bypass MMU (use PA directly).

---

## Links

- ADR-0007 (runtime_api vs sim_engine boundaries)
- ADR-0008 (tensor deployment)
- ADR-0009 (kernel execution)
- ADR-0014 (PE-internal execution model)
- ADR-0015 (component port/wire model)
- SPEC R2 (latency by traversal)