# ADR-0011: Memory Addressing — PA / VA / LA Address Models ## Status Accepted. - **VA model: currently implemented (default).** - PA model: implemented as PageFault fallback in PE_DMA. - LA model: proposed, not implemented. ## Context KernBench's address model evolved through three design points, each addressing a limitation of the previous. This ADR documents all three in one place because future implementation work selects among them. ### PA-only baseline Phase 0 of KernBench treated all device memory operations (MemoryRead/MemoryWrite) as raw physical-address transfers. No host-side virtual addressing, no MMU/IOMMU translation. Allocators returned PA mappings; DMA requests carried PA directly. This was sufficient for early correctness/latency work but insufficient for running standard Triton kernels that use `base_addr + offset` patterns on sharded tensors: each PE's shard has a different PA, but the kernel needs a single contiguous address space to compute offsets. ### Why VA/MMU (current default) A realistic system uses host-side virtual addressing and an MMU/IOMMU-style translation path for DMA: the host allocates physical memory at PE level, maps it into a virtual address space, installs mappings, and DMA requests use virtual addresses that are translated to physical addresses. Adopting this model lets kernels use `base_addr + offset` over a contiguous VA range while the device-side MMU translates each access to the appropriate PA. ### Why LA/BAAW (proposed) VA/MMU treats HBM as a single backing space. KernBench needs to explore architectures where HBM is composed of multiple pseudo channels in parallel: - CUBE's HBM has 32 or 64 pseudo channels. - In a PE-Local-HBM model, each PE is assigned N pseudo channels (N = `hbm_pseudo_channels / pes_per_cube`). - Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW (N × per-channel). Two channel-mapping modes need to be modelable: - **1:1 mode** — one logical access → N per-channel requests. Precise per-channel BW contention modelling. - **n:1 mode (default)** — one logical access → one aggregated request. Channels are assumed to interleave; aggregated BW model. VA's `tl.load(va_ptr)` produces a single DMA request to a single target. Decomposing that into per-channel requests inside PE_DMA requires the address layer to be aware of channels. This is the role of the LA (Logical Address) abstraction with BAAW (Logical-to-Physical Mapping Unit). Core requirements driving the LA design: - PE_DMA → HBM_CTRL effective bandwidth semantics must be identical in both modes (only request shape and resource model differ). - Kernel programming model is unchanged — physical channel information is never exposed to kernel code. - Mode switch is a topology-level configuration. ### Design space summary | Model | Status | Key idea | |-------|--------|----------| | PA | fallback (implemented) | Direct physical addressing, no translation | | VA | current default (implemented) | Per-tensor contiguous VA range; MMU translates per access | | LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes | --- ## Decision This ADR defines three address models. At any given time the system operates in exactly one model. Selection is topology- / configuration- driven; coexistence within one simulation run is not required. --- ### Address Model: PA (Physical Address) — fallback #### D-PA1. PA-only semantics - All device memory accesses (MemoryRead/MemoryWrite) operate on device physical addresses (PA) plus size. - PA-only mode remains functional via the PageFault fallback path in PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats the value as a PA directly. #### D-PA2. Allocation produces PA mappings Device allocation selects PE-local memory regions and returns PA mappings sufficient to execute kernels and issue DMA requests. PA model is retained primarily for backward compatibility with PA-only tests and as the underlying physical layer that VA / LA models resolve into. --- ### Address Model: VA (Virtual Address with MMU) — current default #### D-VA1. Virtual Address Model - Each tensor gets a single contiguous VA range (`TensorHandle.va_base`). - `TensorShard` does NOT carry a `va` field — shard VA is derived as `va_base + offset_bytes`. - Kernels receive `va_base` as their pointer argument (via `TensorArg.va_base`). - `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA). #### D-VA2. PE_MMU Component - Hybrid design: SimPy component (inbox for `MmuMapMsg`) + utility (synchronous `translate()` called by PE_DMA). - Page-aligned dict lookup for O(1) VA → PA translation. - `tlb_overhead_ns` configurable per-access latency. - PageFault fallback: if VA has no mapping, PE_DMA treats it as PA directly (preserves PA model for backward compatibility). #### D-VA3. Mapping Installation - `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured end-to-end. - `MmuMapMsg.target_sips` controls SIP-level routing to prevent cross-SIP mapping contamination for replicated tensors. - Mapping strategy based on `DPPolicy.cube`: - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping only. Each cube's PEs see only their local PA. No cross-cube mapping installed. - **Sharded** (`cube="column_wise"`, etc.): broadcast all shard mappings to all target cubes. Enables cross-PE and cross-cube DMA. #### D-VA4. Tensor Lifecycle - `del tensor` triggers automatic cleanup via `Tensor.__del__` + `weakref` to `RuntimeContext`. Sends `MmuUnmapMsg` through fabric, returns VA and PA space. - `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup. - `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC. - `PEMemAllocator` uses free-list with coalescing (not bump allocator). - `VirtualAllocator` uses free-list with coalescing for VA space. #### D-VA5. Allocators - `VirtualAllocator`: device-wide VA space, page-aligned alloc/free with coalescing. - `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with coalescing. - Page size configurable via `topology.yaml` `pe_mmu` attrs (default 4096). #### Consequences (VA model) - Triton kernels use `base_addr + offset` patterns naturally on sharded tensors. - All latency remains explicit via graph traversal, including MMU mapping installation and per-access TLB overhead. - PA-only mode retained as fallback (PageFault → treat as PA). - IPCQ and other fixed-address resources bypass MMU (use PA directly). --- ### Address Model: LA (Logical Address with BAAW) — proposed LA replaces VA when channel-level HBM modelling is required. Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the removed artifacts). Coexistence with VA in the same run is not a goal. #### D-LA1. LA introduction — replaces VA infrastructure LA is the sole address space used by kernel code (`tl.load`, `tl.store`, `tl.composite`). Properties: - Can map a Tensor to a contiguous logical space (like VA). - Expresses `(logical buffer + offset)`. - Does NOT contain physical channel information directly. - Stays as an intermediate abstraction until physical resolution. LA address space: | Item | Value | |------|-------| | LA start | `0x1_0000_0000` (4 GB, preserves former VA start) | | LA space size | 64 GB per PE | | Alignment unit | segment (see D-LA3) | LA is PE-local: different PEs may use the same LA value; BAAW segment tables differ → they resolve to different PAs. VA infrastructure removed when LA is adopted: | Removed | Replacement | |---------|-------------| | `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed) | | `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) | | `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component | | `runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg` | `BaawSegmentInstallMsg` | | `runtime_api/context.py`: VA alloc + MMU install | LA alloc + BAAW segment install | | `runtime_api/tensor.py`: `va_base` | `la_base` | | `topology.yaml`: `pe_mmu` component entry | Removed | #### D-LA2. Mapping mode setting Topology-level (cube) configuration: ```yaml cube: memory_map: hbm_mapping_mode: n_to_one # one_to_one | n_to_one hbm_pseudo_channels: 64 # total pseudo channel count hbm_channels_per_pe: 8 # per-PE local channel count hbm_channel_bw_gbs: 32.0 # per-channel bandwidth ``` Consumed by the graph compiler (topology builder) and BAAW initialisation. #### D-LA3. Segment and BAAW Segment partitions the LA space; each segment maps to a specific HBM channel or channel group. Created at tensor deploy time by the runtime allocator. BAAW resolves LA → physical request(s) using the segment table. ```python @dataclass class BaawSegment: la_base: int # segment start LA la_size: int # segment size (bytes) mode: str # "one_to_one" | "n_to_one" # 1:1 mode fields channel_count: int # channels assigned to this segment (e.g. 8) pa_bases: list[int] # per-channel PA bases (len = channel_count) channel_ids: list[int] # per-channel logical IDs (e.g. [0..7]) channel_size: int # per-channel size (la_size // channel_count) # n:1 mode fields agg_pa_base: int # aggregated PA base agg_node_id: str # aggregated router node_id ``` Segment lifecycle: 1. **Allocate** (tensor deploy): RuntimeContext allocates LA from LA allocator. PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1). `BaawSegmentInstallMsg` registers the segment with PE_DMA. 2. **Use** (kernel run): kernel `tl.load(la_ptr)` → `DmaReadCmd (src_addr=LA)`. PE_DMA's BAAW front-end looks up the segment and converts to PA(s). 3. **Free** (tensor free): segment removed from table; LA and PA returned. #### D-LA4. BAAW resolution logic BAAW is a front-end stage inside PE_DMA, not a separate SimPy component. Synchronous address-resolution logic executed at the start of PE_DMA's `handle_command()`. Input: `(LA, nbytes)`. Output: - **1:1 mode**: `list[PhysicalRequest]` — one per channel. - **n:1 mode**: single `PhysicalRequest`. ```python @dataclass class PhysicalRequest: pa: int # 51-bit Physical Address nbytes: int # transfer size for this request dst_node: str # target node_id (channel router or aggregated router) def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]: seg = self._find_segment(la) # la_base <= la < la_base + la_size offset = la - seg.la_base if seg.mode == "n_to_one": pa = seg.agg_pa_base + offset return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)] # one_to_one requests = [] per_ch_size = seg.channel_size for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)): ch_offset = offset % per_ch_size ch_nbytes = nbytes // seg.channel_count pa = pa_base + ch_offset dst_node = f"{self._pe_prefix}.ch_r{ch_id}" requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node)) return requests ``` BAAW responsibilities: - Convert logical access → physical request units. - Apply mode-dependent fan-out (1:1) or pass-through (n:1). - Compute PA and target node. BAAW non-responsibilities: - Performing actual data movement. - Executing NOC routing. - Simulating bandwidth occupation (downstream components' job). BAAW output is directly usable by the simulator's routing and resource model without additional address decoding. #### D-LA5. PE_DMA `handle_command()` change Current (VA-based) flow: ``` DmaReadCmd.src_addr (VA) → MMU.translate(VA) → PA → PhysAddr.decode(PA) → PhysAddr object → resolver.resolve(PhysAddr) → dst_node_id → router.find_path(pe_prefix, dst_node_id) → path → 1 sub-Transaction → fabric inject ``` LA-based flow: ``` DmaReadCmd.src_addr (LA) → BAAW.resolve(LA, nbytes) → list[PhysicalRequest] → for each PhysicalRequest: → router.find_path(pe_prefix, req.dst_node) → path → compute_drain_ns(path, req.nbytes) → drain → sub-Transaction → fabric inject → await all sub-Transactions → pe_txn.done.succeed() ``` Key changes: - MMU reference removed → BAAW resolve. - `PhysAddr.decode()` + `resolver.resolve()` → BAAW returns `dst_node` directly. - 1 request → N parallel requests in 1:1 mode. #### D-LA6. 1:1 mode detail - One logical access → N physical requests (N = `channels_per_pe`). - N = `hbm_pseudo_channels / pes_per_cube`. - Each request: fully-resolved 51-bit PA, targets a specific channel router (`{pe_prefix}.ch_r{channel_id}`). - Per-channel link models BW contention. - PE_DMA injects N sub-transactions concurrently. Example: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`. PE0 owns ch0-7. ```text Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes BAAW segment: { la_base: 0x1_0000_0000, la_size: 4096, mode: "one_to_one", channel_count: 8, pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7], channel_ids: [0, 1, 2, 3, 4, 5, 6, 7], channel_size: 512, } BAAW resolve result (8 requests): → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0") → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1") → ... → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7") PE_DMA: 8 sub-transactions parallel inject per-channel router → hbm_ctrl link (channel_bw_gbs) per channel Total effective BW = 8 × channel_bw_gbs ``` Other N values: - `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`, 4 requests - `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`, 16 requests #### D-LA7. n:1 mode detail - One logical access → one aggregated request. - Target: aggregated router → hbm_ctrl (see ADR-0019). - Aggregated link BW = `channels_per_pe × channel_bw_gbs` (e.g. 8 × 32 = 256 GB/s). - Single queue / resource for modelling. - No per-channel PA decomposition. ```text Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes BAAW segment: { la_base: 0x1_0000_0000, la_size: 4096, mode: "n_to_one", agg_pa_base: PA_agg, agg_node_id: "sip0.cube0.pe0.agg_router", } BAAW resolve result: → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router") PE_DMA: 1 sub-transaction aggregated router → hbm_ctrl link (256 GB/s) ``` #### D-LA8. Kernel model preserved - Kernel still issues single memory ops (`tl.load`, `tl.store`, `tl.composite`). - LA is the address scheme exposed to kernel code. - Channel decomposition / aggregation happens inside PE_DMA's BAAW. - Kernel code never sees physical channel information. #### Consequences (LA model, proposed) Positive: - 1:1 vs n:1 semantics live in one place (BAAW). - Kernel abstraction preserved — no kernel code changes. - Topology-based policy control (mode switch via yaml). - Improved simulation-model consistency and debuggability. - Segment-based mapping is simpler than page tables; lower overhead. Negative: - Full VA/MMU code refactor required. - Request-generation path more complex (N requests in 1:1 mode). - Reduced per-channel visibility in n:1 mode. - VA-related tests need rewriting. --- ## Migration Path - **PA → VA** was an extension. PA mode is retained as the PageFault fallback inside PE_DMA. Switching does not require removing PA code. - **VA → LA**, if adopted, is a replacement, not coexistence. See D-LA1 for the VA infrastructure removal list. PA fallback inside PE_DMA may be retained orthogonally for tests. ## Alternatives Considered (LA model) 1. **Keep VA + fan-out in MMU**: MMU returns per-channel PAs. Rejected: MMU's role would grow beyond translation to request decomposition; aggregation (n:1) becomes awkward to express. 2. **Channel-aware kernel API**: kernels call per-channel load/store directly. Rejected: abstraction leakage, portability loss, all benchmarks need rewriting. 3. **Always PA (no LA)**: runtime passes per-channel PA to kernel directly. Rejected: incompatible with aggregation; conversion timing unclear; channel info leaks to kernel. ## Test Requirements ### VA model (current, regression) - Cross-PE / cross-cube DMA paths over installed mappings. - `MmuMapMsg` / `MmuUnmapMsg` fabric traversal with measured latency. - TLB-overhead-per-access timing. - PageFault fallback path preserves PA-only behaviour. ### LA model (when implemented) - 1:1 mode: same logical access → N per-channel requests. - n:1 mode: same logical access → 1 aggregated request. - Bandwidth equivalence between modes for identical workload. - 1:1 mode: per-channel contention modelled correctly. - n:1 mode: aggregated bandwidth correctly reflected. - Kernel code unchanged across mode switch. - BAAW segment install / uninstall correctness. - Multiple tensors in distinct segments do not collide. ## Implementation Order (LA, when scheduled) 1. LA type (`policy/address/la_allocator.py`). 2. BAAW segment table (`policy/address/baaw.py`). 3. `BaawSegmentInstallMsg` (`runtime_api/kernel.py`). 4. PE_DMA BAAW integration (`components/builtin/pe_dma.py` `handle_command()`). 5. RuntimeContext: LA alloc + segment install (`runtime_api/context.py`). 6. `Tensor.va_base` → `Tensor.la_base` (`runtime_api/tensor.py`). 7. Remove VA/MMU code. 8. Remove `pe_mmu` from `topology.yaml`; add mapping mode settings. 9. Test migration: | Test file | Action | |-----------|--------| | `tests/test_mmu_component.py` | Remove → BAAW segment install tests | | `tests/test_mmu_fabric.py` | Remove → BAAW + fabric integration tests | | `tests/test_pe_mmu.py` | Remove | | `tests/test_va_allocator.py` | Replace with LA allocator tests | | `tests/test_va_integration.py` | Replace with LA + BAAW integration tests | | `tests/test_va_offset.py` | Replace with LA offset tests | ## Links - ADR-0007 (runtime_api vs sim_engine boundaries) - ADR-0008 (tensor deployment) - ADR-0009 (kernel execution) - ADR-0014 (PE-internal execution model) - ADR-0015 (component port/wire model) - ADR-0019 (NOC + per-channel HBM connectivity — LA model topology consumer) - SPEC R2 (latency by traversal), R10 (memory addressing)