kernbench2/docs/adr/ADR-0011-memory-addressing-simplification.md
ywkang 08812eda58 Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg
Implement VA/MMU layer (ADR-0011 Phase 1) enabling Triton kernels to use
contiguous virtual addresses on sharded tensors.

Key changes:
- PE_MMU component: hybrid inbox (MmuMapMsg) + sync translate() for PE_DMA
- VirtualAllocator + PEMemAllocator: free-list with coalescing
- MmuMapMsg/MmuUnmapMsg fabric path with SIP-level routing
- DPPolicy-based mapping: replicate=local, sharded=broadcast
- Tensor lifecycle: del + weakref cleanup, context manager
- Rename: TensorHandle.pa→addr, DmaReadCmd.src_pa→src_addr, ctx→torch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 00:01:47 -07:00


ADR-0011: Memory Addressing — PA-first with VA/MMU Extension

Status

Accepted (Phase 1 VA/MMU implemented)

Context

A realistic system uses host-side virtual addressing and an MMU/IOMMU-style translation path for DMA: the host allocates physical memory at the PE level, maps it into a virtual address space, and installs the mappings. DMA requests then carry virtual addresses, which are translated to physical addresses.

The PA-only model (Phase 0) was insufficient for running standard Triton kernels that use base_addr + offset patterns on sharded tensors — each PE's shard has a different PA, but the kernel needs a single contiguous address space.


Decision

D1. Phase 0 model is PA-only (original, retained as fallback)

  • All device memory accesses (MemoryRead/MemoryWrite) operate on device physical addresses (PA) plus size.
  • PA-only mode remains functional via PageFault fallback in PE_DMA.

D2. Allocation produces PA mappings

Device allocation selects PE-local memory regions and returns PA mappings sufficient to execute kernels and issue DMA requests.

D3. Phase 1: VA/MMU layer (implemented)

D3.1 Virtual Address Model

  • Each tensor gets a single contiguous VA range (TensorHandle.va_base).
  • TensorShard does NOT carry a va field — shard VA is derived as va_base + offset_bytes.
  • Kernels receive va_base as their pointer argument (via TensorArg.va_base).
  • DmaReadCmd.src_addr and DmaWriteCmd.dst_addr carry VA (not PA).
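
The address derivation above can be sketched as follows. This is a minimal illustration, not the kernbench2 implementation: only va_base and offset_bytes come from this ADR, while the pe_id, size_bytes, pa, and shards fields and the shard_va helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorShard:
    pe_id: int         # hypothetical: PE that owns this shard
    offset_bytes: int  # byte offset of the shard inside the tensor
    size_bytes: int    # hypothetical: shard length in bytes
    pa: int            # hypothetical: PE-local physical address


@dataclass(frozen=True)
class TensorHandle:
    va_base: int       # start of the tensor's contiguous VA range
    shards: tuple      # hypothetical: all shards of the tensor


def shard_va(handle: TensorHandle, shard: TensorShard) -> int:
    """Derive a shard's VA; the shard itself stores no VA field."""
    return handle.va_base + shard.offset_bytes


handle = TensorHandle(
    va_base=0x1000_0000,
    shards=(
        TensorShard(pe_id=0, offset_bytes=0, size_bytes=4096, pa=0x8000),
        TensorShard(pe_id=1, offset_bytes=4096, size_bytes=4096, pa=0x2000),
    ),
)

# The PAs are unrelated, but the derived VAs form one contiguous range.
shard_vas = [shard_va(handle, s) for s in handle.shards]
```

A kernel that computes va_base + offset therefore lands in the correct shard regardless of where each PE placed its physical memory.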

D3.2 PE_MMU Component

  • Hybrid design: SimPy component (inbox for MmuMapMsg) + utility (synchronous translate() called by PE_DMA).
  • Page-aligned dict lookup for O(1) VA→PA translation.
  • tlb_overhead_ns sets a configurable per-access translation latency.
  • PageFault fallback: if VA has no mapping, PE_DMA treats it as PA directly (backward compatibility with PA-only tests).
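
A rough sketch of the translate path and the PA fallback described above. It assumes PageFault is modeled as an exception and omits the SimPy inbox entirely; the class name PeMmuSketch, the map/unmap methods, and the dma_resolve helper are hypothetical. The tlb_overhead_ns value is only stored here, not applied to simulated time.

```python
class PageFault(Exception):
    """Raised when a VA has no installed mapping (assumed representation)."""


class PeMmuSketch:
    def __init__(self, page_size=4096, tlb_overhead_ns=0.0):
        self.page_size = page_size
        self.tlb_overhead_ns = tlb_overhead_ns  # stored only, not simulated
        self._pages = {}  # page-aligned VA -> page-aligned PA

    def map(self, va, pa, size):
        """Install one dict entry per page covered by [va, va + size)."""
        if va % self.page_size or pa % self.page_size:
            raise ValueError("va and pa must be page-aligned")
        for off in range(0, size, self.page_size):
            self._pages[va + off] = pa + off

    def unmap(self, va, size):
        for off in range(0, size, self.page_size):
            self._pages.pop(va + off, None)

    def translate(self, va):
        """O(1) lookup: align down to the page, then re-add the offset."""
        page_va = va - (va % self.page_size)
        try:
            return self._pages[page_va] + (va - page_va)
        except KeyError:
            raise PageFault(hex(va)) from None


def dma_resolve(mmu, addr):
    """PE_DMA-side fallback: an unmapped address is used as a PA."""
    try:
        return mmu.translate(addr)
    except PageFault:
        return addr


mmu = PeMmuSketch(page_size=4096)
mmu.map(va=0x1000_0000, pa=0x8000, size=8192)
mapped = dma_resolve(mmu, 0x1000_1010)  # second page, offset 0x10
legacy = dma_resolve(mmu, 0x4000)       # no mapping, passes through as PA
```

One consequence of the fallback is that a stale or mistyped VA is silently used as a PA rather than reported, which is the price of keeping PA-only tests working.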

D3.3 Mapping Installation

  • MmuMapMsg traverses the fabric: Host → PCIE_EP → IO_CPU (cube fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured end-to-end.
  • MmuMapMsg.target_sips controls SIP-level routing to prevent cross-SIP mapping contamination for replicated tensors.
  • Mapping strategy based on DPPolicy.cube:
    • Replicate (cube="replicate"): per-(sip, cube) local mapping only. Each cube's PEs see only their local PA. No cross-cube mapping installed.
    • Sharded (cube="shard_m", etc.): broadcast all shard mappings to all target cubes. Enables cross-PE and cross-cube DMA.
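
The two mapping strategies can be illustrated with a small planning function. This is a sketch under stated assumptions: Shard, MapEntry, and plan_mappings are hypothetical names, the target set is assumed to be every (sip, cube) that holds a shard, and fabric routing and target_sips filtering are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    sip: int
    cube: int
    offset_bytes: int
    size_bytes: int
    pa: int


@dataclass(frozen=True)
class MapEntry:
    sip: int   # destination SIP of the mapping
    cube: int  # destination cube of the mapping
    va: int
    pa: int
    size_bytes: int


def plan_mappings(va_base, shards, cube_policy):
    """Return the mappings to install on each (sip, cube) target."""
    targets = sorted({(s.sip, s.cube) for s in shards})
    entries = []
    for sip, cube in targets:
        for s in shards:
            is_local = (s.sip, s.cube) == (sip, cube)
            # Replicate: a cube only ever sees its own copy.
            if cube_policy == "replicate" and not is_local:
                continue
            # Sharded: every target receives every shard's mapping.
            entries.append(
                MapEntry(sip, cube, va_base + s.offset_bytes, s.pa, s.size_bytes)
            )
    return entries


VA_BASE = 0x1000_0000

# Replicated tensor: each cube holds a full copy at offset 0.
replicated = [Shard(0, 0, 0, 4096, 0x8000), Shard(0, 1, 0, 4096, 0x8000)]
rep_plan = plan_mappings(VA_BASE, replicated, "replicate")

# Sharded tensor: each cube holds a different slice.
sharded = [Shard(0, 0, 0, 4096, 0x8000), Shard(0, 1, 4096, 4096, 0x2000)]
shard_plan = plan_mappings(VA_BASE, sharded, "shard_m")
```

In the replicated case both cubes map the same VA to their own local PA, so installing one cube's mapping on the other would alias the wrong memory; the sharded case needs the full cross product so that any PE can reach any slice.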

D3.4 Tensor Lifecycle

  • del tensor triggers automatic cleanup via Tensor.__del__ and a weakref to RuntimeContext. The cleanup sends MmuUnmapMsg through the fabric and returns the tensor's VA and PA ranges to their allocators.
  • with RuntimeContext(...) as ctx: provides scope-based bulk cleanup.
  • RuntimeContext._tensors uses weakref.ref to avoid preventing GC.
  • PEMemAllocator uses free-list with coalescing (not bump allocator).
  • VirtualAllocator uses free-list with coalescing for VA space.
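
The lifecycle rules above could look roughly like this. It is a simplified sketch: RuntimeContextSketch, TensorSketch, the tensor() factory, release(), and the unmapped list (a stand-in for sending MmuUnmapMsg and freeing allocator space) are all hypothetical. The immediate cleanup on del relies on CPython reference counting.

```python
import weakref


class RuntimeContextSketch:
    def __init__(self):
        self._tensors = []  # weakref.ref entries, so tensors can be GC'd
        self.unmapped = []  # stand-in for MmuUnmapMsg + allocator free

    def tensor(self, va_base, size):
        t = TensorSketch(self, va_base, size)
        self._tensors.append(weakref.ref(t))
        return t

    def release(self, va_base, size):
        self.unmapped.append((va_base, size))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Bulk cleanup: free every tensor that is still alive.
        for ref in self._tensors:
            t = ref()
            if t is not None:
                t.free()
        self._tensors.clear()
        return False


class TensorSketch:
    def __init__(self, ctx, va_base, size):
        self._ctx = weakref.ref(ctx)  # tensor must not keep ctx alive
        self.va_base = va_base
        self.size = size
        self._freed = False

    def free(self):
        if self._freed:
            return  # idempotent: safe after context exit
        self._freed = True
        ctx = self._ctx()
        if ctx is not None:
            ctx.release(self.va_base, self.size)

    def __del__(self):
        self.free()


with RuntimeContextSketch() as ctx:
    a = ctx.tensor(0x1000, 4096)
    b = ctx.tensor(0x2000, 4096)
    del a  # last reference dropped, so __del__ runs under CPython
    after_del = list(ctx.unmapped)
after_exit = list(ctx.unmapped)
```

Making free() idempotent matters because a tensor released by the context exit will still run __del__ later, and that second call must not unmap twice.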

D3.5 Allocators

  • VirtualAllocator: device-wide VA space, page-aligned alloc/free with coalescing.
  • PEMemAllocator: per-PE HBM/TCM, free-list based alloc/free with coalescing.
  • Page size configurable via topology.yaml pe_mmu attrs (default 4096).
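
A minimal free-list allocator with coalescing, in the spirit of VirtualAllocator and PEMemAllocator. This is an illustrative first-fit sketch, not the project's code: the class name, the first-fit policy, and the method signatures are assumptions, and base is assumed to be page-aligned.

```python
class FreeListAllocator:
    def __init__(self, base, size, page_size=4096):
        self.page_size = page_size
        self._free = [(base, size)]  # sorted list of (start, length)
        self._live = {}              # start -> allocated length

    def _round_up(self, n):
        """Round n up to a whole number of pages."""
        return -(-n // self.page_size) * self.page_size

    def alloc(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        need = self._round_up(size)
        # First fit: take the lowest free block that is large enough.
        for i, (start, length) in enumerate(self._free):
            if length >= need:
                if length == need:
                    del self._free[i]
                else:
                    self._free[i] = (start + need, length - need)
                self._live[start] = need
                return start
        raise MemoryError(f"no free block of {need} bytes")

    def free(self, addr):
        length = self._live.pop(addr)  # KeyError on double free
        self._free.append((addr, length))
        self._free.sort()
        # Coalesce: merge each block into its predecessor when adjacent.
        merged = []
        for start, ln in self._free:
            if merged and merged[-1][0] + merged[-1][1] == start:
                merged[-1] = (merged[-1][0], merged[-1][1] + ln)
            else:
                merged.append((start, ln))
        self._free = merged


va_space = FreeListAllocator(base=0x1000_0000, size=4 * 4096)
a = va_space.alloc(100)   # rounded up to one page
b = va_space.alloc(4096)
c = va_space.alloc(8192)  # space is now fully allocated
va_space.free(a)
va_space.free(b)          # a and b coalesce into one 8 KiB block
d = va_space.alloc(8192)  # fits only because of coalescing
```

Without the merge step, the final 8 KiB request would fail even though 8 KiB is free, which is the fragmentation a bump allocator cannot recover from.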

Consequences

  • Triton kernels use base_addr + offset patterns naturally on sharded tensors.
  • All latency remains explicit via graph traversal, including MMU mapping installation and per-access TLB overhead.
  • PA-only mode retained as fallback (PageFault → treat as PA).
  • Benchmark parameter renamed ctx → torch for PyTorch code compatibility.
  • IPCQ and other fixed-address resources bypass MMU (use PA directly).

Related

  • ADR-0007 (runtime_api vs sim_engine boundaries)
  • ADR-0008 (tensor deployment)
  • ADR-0009 (kernel execution)
  • ADR-0014 (PE-internal execution model)
  • ADR-0015 (component port/wire model)
  • SPEC R2 (latency by traversal)