Files
kernbench2/docs/adr/ADR-0011-memory-addressing-simplification.md
T
ywkang 22fd0d2b9d ADR: introduce docs/history/, merge 0011+0018, prune migration cruft
- CLAUDE.md: add ADR Lifecycle subsection (superseded → docs/history/,
  immutable numbering, no renumber)
- ADR-0011: merge ADR-0018 content as "Address Model: LA" section
  alongside PA / VA; status notes VA model is currently implemented
- ADR-0018 / 0029 / 0031: moved to docs/history/ with status updates
  (0018 merged into 0011, 0029 superseded by 0032, 0031 absorbed
  into 0001 rev 2)
- ADR-0019: rewrite Context as PE-HBM connectivity decision
  (self-contained, no LA model framing)
- ADR-0019/0020/0021/0023/0025/0027: Status Proposed → Accepted
  (code verified) and prune Implementation Notes / Affected files /
  Test strategy / "현재 상태" sub-sections describing pre-impl state
- ADR-0024/0026: same migration-flavor cleanup; 0026 also drops D6
  Migration and D8 docs-update sub-decisions
- ADR-0030: status simplified (blocker ADR-0031 now superseded)
- SPEC.md: R10 + §0.2 reflect PA / VA / LA model names
- ADR-0008/0012/0013: refresh ADR-0011 subtitle in Links

21 files changed, 553 insertions(+), 1290 deletions(-).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 11:42:45 -07:00

18 KiB
Raw Blame History

ADR-0011: Memory Addressing — PA / VA / LA Address Models

Status

Accepted.

  • VA model: currently implemented (default).
  • PA model: implemented as PageFault fallback in PE_DMA.
  • LA model: proposed, not implemented.

Context

KernBench's address model evolved through three design points, each addressing a limitation of the previous. This ADR documents all three in one place because future implementation work selects among them.

PA-only baseline

Phase 0 of KernBench treated all device memory operations (MemoryRead/MemoryWrite) as raw physical-address transfers. No host-side virtual addressing, no MMU/IOMMU translation. Allocators returned PA mappings; DMA requests carried PA directly.

This was sufficient for early correctness/latency work but insufficient for running standard Triton kernels that use base_addr + offset patterns on sharded tensors: each PE's shard has a different PA, but the kernel needs a single contiguous address space to compute offsets.

Why VA/MMU (current default)

A realistic system uses host-side virtual addressing and an MMU/IOMMU-style translation path for DMA: the host allocates physical memory at PE level, maps it into a virtual address space, installs mappings, and DMA requests use virtual addresses that are translated to physical addresses.

Adopting this model lets kernels use base_addr + offset over a contiguous VA range while the device-side MMU translates each access to the appropriate PA.

Why LA/BAAW (proposed)

VA/MMU treats HBM as a single backing space. KernBench needs to explore architectures where HBM is composed of multiple pseudo channels in parallel:

  • CUBE's HBM has 32 or 64 pseudo channels.
  • In a PE-Local-HBM model, each PE is assigned N pseudo channels (N = hbm_pseudo_channels / pes_per_cube).
  • Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW (N × per-channel).

Two channel-mapping modes need to be modelable:

  • 1:1 mode — one logical access → N per-channel requests. Precise per-channel BW contention modelling.
  • n:1 mode (default) — one logical access → one aggregated request. Channels are assumed to interleave; aggregated BW model.

VA's tl.load(va_ptr) produces a single DMA request to a single target. Decomposing that into per-channel requests inside PE_DMA requires the address layer to be aware of channels. This is the role of the LA (Logical Address) abstraction with BAAW (Logical-to-Physical Mapping Unit).

Core requirements driving the LA design:

  • PE_DMA → HBM_CTRL effective bandwidth semantics must be identical in both modes (only request shape and resource model differ).
  • Kernel programming model is unchanged — physical channel information is never exposed to kernel code.
  • Mode switch is a topology-level configuration.

Design space summary

Model Status Key idea
PA fallback (implemented) Direct physical addressing, no translation
VA current default (implemented) Per-tensor contiguous VA range; MMU translates per access
LA proposed LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes

Decision

This ADR defines three address models. At any given time the system operates in exactly one model. Selection is topology- / configuration- driven; coexistence within one simulation run is not required.


Address Model: PA (Physical Address) — fallback

D-PA1. PA-only semantics

  • All device memory accesses (MemoryRead/MemoryWrite) operate on device physical addresses (PA) plus size.
  • PA-only mode remains functional via the PageFault fallback path in PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats the value as a PA directly.

D-PA2. Allocation produces PA mappings

Device allocation selects PE-local memory regions and returns PA mappings sufficient to execute kernels and issue DMA requests.

PA model is retained primarily for backward compatibility with PA-only tests and as the underlying physical layer that VA / LA models resolve into.


Address Model: VA (Virtual Address with MMU) — current default

D-VA1. Virtual Address Model

  • Each tensor gets a single contiguous VA range (TensorHandle.va_base).
  • TensorShard does NOT carry a va field — shard VA is derived as va_base + offset_bytes.
  • Kernels receive va_base as their pointer argument (via TensorArg.va_base).
  • DmaReadCmd.src_addr and DmaWriteCmd.dst_addr carry VA (not PA).

D-VA2. PE_MMU Component

  • Hybrid design: SimPy component (inbox for MmuMapMsg) + utility (synchronous translate() called by PE_DMA).
  • Page-aligned dict lookup for O(1) VA → PA translation.
  • tlb_overhead_ns configurable per-access latency.
  • PageFault fallback: if VA has no mapping, PE_DMA treats it as PA directly (preserves PA model for backward compatibility).

D-VA3. Mapping Installation

  • MmuMapMsg traverses the fabric: Host → PCIE_EP → IO_CPU (cube fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured end-to-end.
  • MmuMapMsg.target_sips controls SIP-level routing to prevent cross-SIP mapping contamination for replicated tensors.
  • Mapping strategy based on DPPolicy.cube:
    • Replicate (cube="replicate"): per-(sip, cube) local mapping only. Each cube's PEs see only their local PA. No cross-cube mapping installed.
    • Sharded (cube="column_wise", etc.): broadcast all shard mappings to all target cubes. Enables cross-PE and cross-cube DMA.

D-VA4. Tensor Lifecycle

  • del tensor triggers automatic cleanup via Tensor.__del__ + weakref to RuntimeContext. Sends MmuUnmapMsg through fabric, returns VA and PA space.
  • with RuntimeContext(...) as ctx: provides scope-based bulk cleanup.
  • RuntimeContext._tensors uses weakref.ref to avoid preventing GC.
  • PEMemAllocator uses free-list with coalescing (not bump allocator).
  • VirtualAllocator uses free-list with coalescing for VA space.

D-VA5. Allocators

  • VirtualAllocator: device-wide VA space, page-aligned alloc/free with coalescing.
  • PEMemAllocator: per-PE HBM/TCM, free-list based alloc/free with coalescing.
  • Page size configurable via topology.yaml pe_mmu attrs (default 4096).

Consequences (VA model)

  • Triton kernels use base_addr + offset patterns naturally on sharded tensors.
  • All latency remains explicit via graph traversal, including MMU mapping installation and per-access TLB overhead.
  • PA-only mode retained as fallback (PageFault → treat as PA).
  • IPCQ and other fixed-address resources bypass MMU (use PA directly).

Address Model: LA (Logical Address with BAAW) — proposed

LA replaces VA when channel-level HBM modelling is required. Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the removed artifacts). Coexistence with VA in the same run is not a goal.

D-LA1. LA introduction — replaces VA infrastructure

LA is the sole address space used by kernel code (tl.load, tl.store, tl.composite). Properties:

  • Can map a Tensor to a contiguous logical space (like VA).
  • Expresses (logical buffer + offset).
  • Does NOT contain physical channel information directly.
  • Stays as an intermediate abstraction until physical resolution.

LA address space:

Item Value
LA start 0x1_0000_0000 (4 GB, preserves former VA start)
LA space size 64 GB per PE
Alignment unit segment (see D-LA3)

LA is PE-local: different PEs may use the same LA value; BAAW segment tables differ → they resolve to different PAs.

VA infrastructure removed when LA is adopted:

Removed Replacement
policy/address/va_allocator.py (VirtualAllocator) LA allocator (same free-list approach, renamed)
policy/address/pe_mmu.py (PeMMU) BAAW segment table (inside PE_DMA)
components/builtin/pe_mmu.py (PeMmuComponent) Removed — BAAW is internal PE_DMA logic, not a separate component
runtime_api/kernel.py: MmuMapMsg, MmuUnmapMsg BaawSegmentInstallMsg
runtime_api/context.py: VA alloc + MMU install LA alloc + BAAW segment install
runtime_api/tensor.py: va_base la_base
topology.yaml: pe_mmu component entry Removed

D-LA2. Mapping mode setting

Topology-level (cube) configuration:

cube:
  memory_map:
    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
    hbm_pseudo_channels: 64       # total pseudo channel count
    hbm_channels_per_pe: 8        # per-PE local channel count
    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth

Consumed by the graph compiler (topology builder) and BAAW initialisation.

D-LA3. Segment and BAAW

Segment partitions the LA space; each segment maps to a specific HBM channel or channel group. Created at tensor deploy time by the runtime allocator. BAAW resolves LA → physical request(s) using the segment table.

@dataclass
class BaawSegment:
    la_base: int          # segment start LA
    la_size: int          # segment size (bytes)
    mode: str             # "one_to_one" | "n_to_one"
    # 1:1 mode fields
    channel_count: int    # channels assigned to this segment (e.g. 8)
    pa_bases: list[int]   # per-channel PA bases (len = channel_count)
    channel_ids: list[int]   # per-channel logical IDs (e.g. [0..7])
    channel_size: int     # per-channel size (la_size // channel_count)
    # n:1 mode fields
    agg_pa_base: int      # aggregated PA base
    agg_node_id: str      # aggregated router node_id

Segment lifecycle:

  1. Allocate (tensor deploy): RuntimeContext allocates LA from LA allocator. PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1). BaawSegmentInstallMsg registers the segment with PE_DMA.
  2. Use (kernel run): kernel tl.load(la_ptr)DmaReadCmd (src_addr=LA). PE_DMA's BAAW front-end looks up the segment and converts to PA(s).
  3. Free (tensor free): segment removed from table; LA and PA returned.

D-LA4. BAAW resolution logic

BAAW is a front-end stage inside PE_DMA, not a separate SimPy component. Synchronous address-resolution logic executed at the start of PE_DMA's handle_command().

Input: (LA, nbytes). Output:

  • 1:1 mode: list[PhysicalRequest] — one per channel.
  • n:1 mode: single PhysicalRequest.
@dataclass
class PhysicalRequest:
    pa: int           # 51-bit Physical Address
    nbytes: int       # transfer size for this request
    dst_node: str     # target node_id (channel router or aggregated router)


def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
    offset = la - seg.la_base

    if seg.mode == "n_to_one":
        pa = seg.agg_pa_base + offset
        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]

    # one_to_one
    requests = []
    per_ch_size = seg.channel_size
    for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)):
        ch_offset = offset % per_ch_size
        ch_nbytes = nbytes // seg.channel_count
        pa = pa_base + ch_offset
        dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
        requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
    return requests

BAAW responsibilities:

  • Convert logical access → physical request units.
  • Apply mode-dependent fan-out (1:1) or pass-through (n:1).
  • Compute PA and target node.

BAAW non-responsibilities:

  • Performing actual data movement.
  • Executing NOC routing.
  • Simulating bandwidth occupation (downstream components' job).

BAAW output is directly usable by the simulator's routing and resource model without additional address decoding.

D-LA5. PE_DMA handle_command() change

Current (VA-based) flow:

DmaReadCmd.src_addr (VA)
  → MMU.translate(VA) → PA
  → PhysAddr.decode(PA) → PhysAddr object
  → resolver.resolve(PhysAddr) → dst_node_id
  → router.find_path(pe_prefix, dst_node_id) → path
  → 1 sub-Transaction → fabric inject

LA-based flow:

DmaReadCmd.src_addr (LA)
  → BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
  → for each PhysicalRequest:
      → router.find_path(pe_prefix, req.dst_node) → path
      → compute_drain_ns(path, req.nbytes) → drain
      → sub-Transaction → fabric inject
  → await all sub-Transactions
  → pe_txn.done.succeed()

Key changes:

  • MMU reference removed → BAAW resolve.
  • PhysAddr.decode() + resolver.resolve() → BAAW returns dst_node directly.
  • 1 request → N parallel requests in 1:1 mode.

D-LA6. 1:1 mode detail

  • One logical access → N physical requests (N = channels_per_pe).
  • N = hbm_pseudo_channels / pes_per_cube.
  • Each request: fully-resolved 51-bit PA, targets a specific channel router ({pe_prefix}.ch_r{channel_id}).
  • Per-channel link models BW contention.
  • PE_DMA injects N sub-transactions concurrently.

Example: hbm_pseudo_channels=64, pes_per_cube=8channels_per_pe=8. PE0 owns ch0-7.

Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
    la_base: 0x1_0000_0000, la_size: 4096,
    mode: "one_to_one", channel_count: 8,
    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
    channel_size: 512,
}

BAAW resolve result (8 requests):
  → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
  → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
  → ...
  → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")

PE_DMA: 8 sub-transactions parallel inject
  per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
  Total effective BW = 8 × channel_bw_gbs

Other N values:

  • hbm_pseudo_channels=32, pes_per_cube=8channels_per_pe=4, 4 requests
  • hbm_pseudo_channels=64, pes_per_cube=4channels_per_pe=16, 16 requests

D-LA7. n:1 mode detail

  • One logical access → one aggregated request.
  • Target: aggregated router → hbm_ctrl (see ADR-0019).
  • Aggregated link BW = channels_per_pe × channel_bw_gbs (e.g. 8 × 32 = 256 GB/s).
  • Single queue / resource for modelling.
  • No per-channel PA decomposition.
Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
    la_base: 0x1_0000_0000, la_size: 4096,
    mode: "n_to_one",
    agg_pa_base: PA_agg,
    agg_node_id: "sip0.cube0.pe0.agg_router",
}

BAAW resolve result:
  → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")

PE_DMA: 1 sub-transaction
  aggregated router → hbm_ctrl link (256 GB/s)

D-LA8. Kernel model preserved

  • Kernel still issues single memory ops (tl.load, tl.store, tl.composite).
  • LA is the address scheme exposed to kernel code.
  • Channel decomposition / aggregation happens inside PE_DMA's BAAW.
  • Kernel code never sees physical channel information.

Consequences (LA model, proposed)

Positive:

  • 1:1 vs n:1 semantics live in one place (BAAW).
  • Kernel abstraction preserved — no kernel code changes.
  • Topology-based policy control (mode switch via yaml).
  • Improved simulation-model consistency and debuggability.
  • Segment-based mapping is simpler than page tables; lower overhead.

Negative:

  • Full VA/MMU code refactor required.
  • Request-generation path more complex (N requests in 1:1 mode).
  • Reduced per-channel visibility in n:1 mode.
  • VA-related tests need rewriting.

Migration Path

  • PA → VA was an extension. PA mode is retained as the PageFault fallback inside PE_DMA. Switching does not require removing PA code.
  • VA → LA, if adopted, is a replacement, not coexistence. See D-LA1 for the VA infrastructure removal list. PA fallback inside PE_DMA may be retained orthogonally for tests.

Alternatives Considered (LA model)

  1. Keep VA + fan-out in MMU: MMU returns per-channel PAs. Rejected: MMU's role would grow beyond translation to request decomposition; aggregation (n:1) becomes awkward to express.
  2. Channel-aware kernel API: kernels call per-channel load/store directly. Rejected: abstraction leakage, portability loss, all benchmarks need rewriting.
  3. Always PA (no LA): runtime passes per-channel PA to kernel directly. Rejected: incompatible with aggregation; conversion timing unclear; channel info leaks to kernel.

Test Requirements

VA model (current, regression)

  • Cross-PE / cross-cube DMA paths over installed mappings.
  • MmuMapMsg / MmuUnmapMsg fabric traversal with measured latency.
  • TLB-overhead-per-access timing.
  • PageFault fallback path preserves PA-only behaviour.

LA model (when implemented)

  • 1:1 mode: same logical access → N per-channel requests.
  • n:1 mode: same logical access → 1 aggregated request.
  • Bandwidth equivalence between modes for identical workload.
  • 1:1 mode: per-channel contention modelled correctly.
  • n:1 mode: aggregated bandwidth correctly reflected.
  • Kernel code unchanged across mode switch.
  • BAAW segment install / uninstall correctness.
  • Multiple tensors in distinct segments do not collide.

Implementation Order (LA, when scheduled)

  1. LA type (policy/address/la_allocator.py).
  2. BAAW segment table (policy/address/baaw.py).
  3. BaawSegmentInstallMsg (runtime_api/kernel.py).
  4. PE_DMA BAAW integration (components/builtin/pe_dma.py handle_command()).
  5. RuntimeContext: LA alloc + segment install (runtime_api/context.py).
  6. Tensor.va_baseTensor.la_base (runtime_api/tensor.py).
  7. Remove VA/MMU code.
  8. Remove pe_mmu from topology.yaml; add mapping mode settings.
  9. Test migration:
Test file Action
tests/test_mmu_component.py Remove → BAAW segment install tests
tests/test_mmu_fabric.py Remove → BAAW + fabric integration tests
tests/test_pe_mmu.py Remove
tests/test_va_allocator.py Replace with LA allocator tests
tests/test_va_integration.py Replace with LA + BAAW integration tests
tests/test_va_offset.py Replace with LA offset tests
  • ADR-0007 (runtime_api vs sim_engine boundaries)
  • ADR-0008 (tensor deployment)
  • ADR-0009 (kernel execution)
  • ADR-0014 (PE-internal execution model)
  • ADR-0015 (component port/wire model)
  • ADR-0019 (NOC + per-channel HBM connectivity — LA model topology consumer)
  • SPEC R2 (latency by traversal), R10 (memory addressing)