ywkang 63669f82cb Add SIP-level tensor parallelism, component registry YAML, VA offset verification

- DPPolicy: 3-level (sip/cube/pe), unified naming (column_wise/row_wise)
- PE_CPU: auto num_programs from cube shard count
- context.launch(): per-SIP KernelLaunchMsg with local va_base + auto local shape
- deploy_tensor: removed mmus param, MMU mapping is context-only responsibility
- ComponentRegistry: YAML-based lazy loading (components.yaml), impls→builtin rename
- VA offset bench + tests: 2D/1D, standard Triton kernel pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-26 01:13:17 -07:00


ADR-0011: Memory Addressing — PA-first with VA/MMU Extension

Status

Accepted (Phase 1 VA/MMU implemented)

Context

A realistic system uses host-side virtual addressing and an MMU/IOMMU-style translation path for DMA: the host allocates physical memory at the PE level, maps it into a virtual address space, and installs the mappings. DMA requests then carry virtual addresses, which are translated to physical addresses.

The PA-only model (Phase 0) was insufficient for running standard Triton kernels that use base_addr + offset patterns on sharded tensors: each PE's shard lives at a different PA, but the kernel needs a single contiguous address space.

Decision

D1. Phase 0 model is PA-only (original, retained as fallback)

All device memory accesses (MemoryRead/MemoryWrite) operate on device physical addresses (PA) plus size.
PA-only mode remains functional via PageFault fallback in PE_DMA.

D2. Allocation produces PA mappings

Device allocation selects PE-local memory regions and returns PA mappings sufficient to execute kernels and issue DMA requests.

D3. Phase 1: VA/MMU layer (implemented)

D3.1 Virtual Address Model

Each tensor gets a single contiguous VA range (TensorHandle.va_base).
TensorShard does NOT carry a va field — shard VA is derived as va_base + offset_bytes.
Kernels receive va_base as their pointer argument (via TensorArg.va_base).
DmaReadCmd.src_addr and DmaWriteCmd.dst_addr carry VA (not PA).
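
The derivation rule in D3.1 can be illustrated with a minimal sketch. TensorHandle, TensorShard, va_base, and offset_bytes are named in this ADR; the other fields (pe_id, pa, size_bytes), the shard_va() helper, and all address values are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TensorShard:
    # No va field: the shard VA is always derived from the handle.
    pe_id: int          # hypothetical field: owning PE
    pa: int             # hypothetical field: PE-local physical address
    offset_bytes: int   # byte offset of this shard inside the tensor
    size_bytes: int     # hypothetical field: shard length


@dataclass
class TensorHandle:
    va_base: int
    shards: list = field(default_factory=list)

    def shard_va(self, shard: TensorShard) -> int:
        # Hypothetical helper implementing va_base + offset_bytes.
        return self.va_base + shard.offset_bytes


handle = TensorHandle(
    va_base=0x1000_0000,
    shards=[
        TensorShard(pe_id=0, pa=0x8000, offset_bytes=0, size_bytes=4096),
        TensorShard(pe_id=1, pa=0x2000, offset_bytes=4096, size_bytes=4096),
    ],
)

# The two shards sit at unrelated PAs but form one contiguous VA range,
# so a kernel can address them as va_base + offset.
shard_vas = [handle.shard_va(s) for s in handle.shards]
```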

D3.2 PE_MMU Component

Hybrid design: SimPy component (inbox for MmuMapMsg) + utility (synchronous translate() called by PE_DMA).
Page-aligned dict lookup for O(1) VA→PA translation.
tlb_overhead_ns sets a configurable per-access translation latency.
PageFault fallback: if VA has no mapping, PE_DMA treats it as PA directly (backward compatibility with PA-only tests).
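
A rough sketch of the page-aligned dict lookup and the PageFault fallback follows. It assumes PageFault is raised as an exception and that the table is keyed by page-aligned VA; the class PeMmuSketch, its map() method, and the dma_resolve() function are hypothetical names, not the actual PE_MMU or PE_DMA code. The SimPy inbox and the tlb_overhead_ns delay are left out.

```python
class PageFault(Exception):
    """Raised when a VA has no installed mapping (assumed representation)."""


class PeMmuSketch:
    def __init__(self, page_size: int = 4096):
        self.page_size = page_size
        self._table = {}  # page-aligned VA -> page-aligned PA

    def map(self, va: int, pa: int, size: int) -> None:
        if va % self.page_size or pa % self.page_size:
            raise ValueError("va and pa must be page-aligned")
        # One dict entry per page covered by the mapping.
        for off in range(0, size, self.page_size):
            self._table[va + off] = pa + off

    def translate(self, va: int) -> int:
        page = va - (va % self.page_size)
        try:
            pa_page = self._table[page]  # single dict lookup, O(1) on average
        except KeyError:
            raise PageFault(hex(va)) from None
        return pa_page + (va - page)


def dma_resolve(mmu: PeMmuSketch, addr: int) -> int:
    # Fallback described in D3.2: an unmapped address is treated as a PA.
    try:
        return mmu.translate(addr)
    except PageFault:
        return addr


mmu = PeMmuSketch()
mmu.map(va=0x10000, pa=0x8000, size=8192)
```

With this mapping, dma_resolve(mmu, 0x11004) lands in the second mapped page, while an address with no mapping, such as 0x500, passes through unchanged.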

D3.3 Mapping Installation

MmuMapMsg traverses the fabric: Host → PCIE_EP → IO_CPU (cube fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured end-to-end.
MmuMapMsg.target_sips controls SIP-level routing to prevent cross-SIP mapping contamination for replicated tensors.
Mapping strategy based on DPPolicy.cube:
- Replicate (cube="replicate"): per-(sip, cube) local mapping only. Each cube's PEs see only their local PA. No cross-cube mapping installed.
- Sharded (cube="column_wise", etc.): broadcast all shard mappings to all target cubes. Enables cross-PE and cross-cube DMA.
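
One way to picture the two strategies is a planning function that decides which mappings each (sip, cube) receives. This is only a sketch of the rule as worded above: plan_mappings(), the shard dict keys, and the cubes_per_sip parameter are hypothetical, and the real routing goes through MmuMapMsg and the fabric rather than a plain function.

```python
def plan_mappings(cube_policy, shards, target_sips, cubes_per_sip):
    """Return {(sip, cube): [(va, pa, size), ...]} for the target cubes.

    shards: list of dicts with keys sip, cube, va, pa, size (assumed layout).
    """
    plan = {}
    for sip in target_sips:
        for cube in range(cubes_per_sip):
            if cube_policy == "replicate":
                # Local mapping only: a cube sees just its own copy's PA.
                visible = [s for s in shards
                           if s["sip"] == sip and s["cube"] == cube]
            else:
                # Sharded policies: every shard mapping goes to every target cube.
                visible = list(shards)
            plan[(sip, cube)] = [(s["va"], s["pa"], s["size"]) for s in visible]
    return plan


# Replicated tensor: same VA in both cubes, different local PAs.
replicated = [
    {"sip": 0, "cube": 0, "va": 0x10000, "pa": 0x8000, "size": 4096},
    {"sip": 0, "cube": 1, "va": 0x10000, "pa": 0x3000, "size": 4096},
]
# Sharded tensor: consecutive VA ranges, one shard per cube.
sharded = [
    {"sip": 0, "cube": 0, "va": 0x20000, "pa": 0x9000, "size": 4096},
    {"sip": 0, "cube": 1, "va": 0x21000, "pa": 0x4000, "size": 4096},
]

rep_plan = plan_mappings("replicate", replicated, target_sips=[0], cubes_per_sip=2)
col_plan = plan_mappings("column_wise", sharded, target_sips=[0], cubes_per_sip=2)
```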

D3.4 Tensor Lifecycle

del tensor triggers automatic cleanup via Tensor.__del__ and a weakref to RuntimeContext. Cleanup sends MmuUnmapMsg through the fabric and returns the tensor's VA and PA space to the allocators.
with RuntimeContext(...) as ctx: provides scope-based bulk cleanup.
RuntimeContext._tensors uses weakref.ref to avoid preventing GC.
PEMemAllocator uses free-list with coalescing (not bump allocator).
VirtualAllocator uses free-list with coalescing for VA space.
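
The interplay of __del__, weak references, and scope-based cleanup can be sketched as below. RuntimeContextSketch and TensorSketch are hypothetical stand-ins: the unmapped list replaces the MmuUnmapMsg send and allocator frees, and release() is an assumed helper that makes cleanup idempotent. Immediate finalization on del relies on CPython reference counting.

```python
import weakref


class TensorSketch:
    def __init__(self, ctx, va_base):
        self._ctx = weakref.ref(ctx)   # tensor does not keep the context alive
        self.va_base = va_base
        self._released = False

    def release(self):
        if self._released:             # safe to call more than once
            return
        self._released = True
        ctx = self._ctx()
        if ctx is not None:
            ctx._unmap(self.va_base)

    def __del__(self):
        self.release()


class RuntimeContextSketch:
    def __init__(self):
        self._tensors = []             # weak references only
        self.unmapped = []             # stand-in for unmap messages + frees

    def tensor(self, va_base):
        t = TensorSketch(self, va_base)
        self._tensors.append(weakref.ref(t))
        return t

    def _unmap(self, va_base):
        self.unmapped.append(va_base)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Bulk cleanup of every tensor that is still alive.
        for ref in self._tensors:
            t = ref()
            if t is not None:
                t.release()
        self._tensors.clear()
        return False


with RuntimeContextSketch() as ctx:
    a = ctx.tensor(0x1000)
    b = ctx.tensor(0x2000)
    del a                              # CPython: __del__ runs right away
    freed_early = list(ctx.unmapped)
```

Because the context stores only weakref.ref objects, dropping the last user reference to a tensor is enough to trigger its cleanup; tensors still alive at scope exit are released in __exit__.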

D3.5 Allocators

VirtualAllocator: device-wide VA space, page-aligned alloc/free with coalescing.
PEMemAllocator: per-PE HBM/TCM, free-list based alloc/free with coalescing.
Page size configurable via topology.yaml pe_mmu attrs (default 4096).
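
A free-list allocator with coalescing, as named above, might look like the following sketch. It uses first-fit and a sorted list of (start, length) blocks; the class name FreeListAllocator, the first-fit choice, and the alloc/free signatures are assumptions for illustration, not the actual VirtualAllocator or PEMemAllocator interfaces.

```python
class FreeListAllocator:
    def __init__(self, base: int, size: int, page_size: int = 4096):
        self.page_size = page_size
        self._free = [(base, size)]    # sorted list of (start, length)

    def _round_up(self, n: int) -> int:
        return -(-n // self.page_size) * self.page_size

    def alloc(self, size: int) -> int:
        need = self._round_up(size)
        for i, (start, length) in enumerate(self._free):
            if length >= need:         # first fit
                if length == need:
                    del self._free[i]
                else:
                    self._free[i] = (start + need, length - need)
                return start
        raise MemoryError(f"no free block of {need} bytes")

    def free(self, addr: int, size: int) -> None:
        self._free.append((addr, self._round_up(size)))
        self._free.sort()
        merged = []
        for start, length in self._free:
            # Coalesce with the previous block when the two are adjacent.
            if merged and merged[-1][0] + merged[-1][1] == start:
                merged[-1] = (merged[-1][0], merged[-1][1] + length)
            else:
                merged.append((start, length))
        self._free = merged


heap = FreeListAllocator(base=0x10000, size=4 * 4096)
x = heap.alloc(100)       # rounded up to one page
y = heap.alloc(4096)
z = heap.alloc(8192)
heap.free(x, 100)
heap.free(z, 8192)
heap.free(y, 4096)        # the middle block rejoins both neighbours
```

Unlike a bump allocator, freeing the middle block here merges all three ranges back into one, so a later full-size allocation can succeed.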

Consequences

Triton kernels use base_addr + offset patterns naturally on sharded tensors.
All latency remains explicit via graph traversal, including MMU mapping installation and per-access TLB overhead.
PA-only mode retained as fallback (PageFault → treat as PA).
Benchmark parameter renamed ctx → torch for PyTorch code compatibility.
IPCQ and other fixed-address resources bypass MMU (use PA directly).
