Files

T

ywkang 22fd0d2b9d ADR: introduce docs/history/, merge 0011+0018, prune migration cruft

- CLAUDE.md: add ADR Lifecycle subsection (superseded → docs/history/,
  immutable numbering, no renumber)
- ADR-0011: merge ADR-0018 content as "Address Model: LA" section
  alongside PA / VA; status notes VA model is currently implemented
- ADR-0018 / 0029 / 0031: moved to docs/history/ with status updates
  (0018 merged into 0011, 0029 superseded by 0032, 0031 absorbed
  into 0001 rev 2)
- ADR-0019: rewrite Context as PE-HBM connectivity decision
  (self-contained, no LA model framing)
- ADR-0019/0020/0021/0023/0025/0027: Status Proposed → Accepted
  (code verified) and prune Implementation Notes / Affected files /
  Test strategy / "현재 상태" sub-sections describing pre-impl state
- ADR-0024/0026: same migration-flavor cleanup; 0026 also drops D6
  Migration and D8 docs-update sub-decisions
- ADR-0030: status simplified (blocker ADR-0031 now superseded)
- SPEC.md: R10 + §0.2 reflect PA / VA / LA model names
- ADR-0008/0012/0013: refresh ADR-0011 subtitle in Links

21 files changed, 553 insertions(+), 1290 deletions(-).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-19 11:42:45 -07:00

18 KiB

Raw Blame History

ADR-0011: Memory Addressing — PA / VA / LA Address Models

Status

Accepted.

VA model: currently implemented (default).
PA model: implemented as PageFault fallback in PE_DMA.
LA model: proposed, not implemented.

Context

KernBench's address model evolved through three design points, each addressing a limitation of the previous. This ADR documents all three in one place because future implementation work selects among them.

PA-only baseline

Phase 0 of KernBench treated all device memory operations (MemoryRead/MemoryWrite) as raw physical-address transfers. No host-side virtual addressing, no MMU/IOMMU translation. Allocators returned PA mappings; DMA requests carried PA directly.

This was sufficient for early correctness/latency work but insufficient for running standard Triton kernels that use base_addr + offset patterns on sharded tensors: each PE's shard has a different PA, but the kernel needs a single contiguous address space to compute offsets.

Why VA/MMU (current default)

A realistic system uses host-side virtual addressing and an MMU/IOMMU-style translation path for DMA: the host allocates physical memory at PE level, maps it into a virtual address space, installs mappings, and DMA requests use virtual addresses that are translated to physical addresses.

Adopting this model lets kernels use base_addr + offset over a contiguous VA range while the device-side MMU translates each access to the appropriate PA.

Why LA/BAAW (proposed)

VA/MMU treats HBM as a single backing space. KernBench needs to explore architectures where HBM is composed of multiple pseudo channels in parallel:

CUBE's HBM has 32 or 64 pseudo channels.
In a PE-Local-HBM model, each PE is assigned N pseudo channels (N = hbm_pseudo_channels / pes_per_cube).
Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW (N × per-channel).

Two channel-mapping modes need to be modelable:

1:1 mode — one logical access → N per-channel requests. Precise per-channel BW contention modelling.
n:1 mode (default) — one logical access → one aggregated request. Channels are assumed to interleave; aggregated BW model.

VA's tl.load(va_ptr) produces a single DMA request to a single target. Decomposing that into per-channel requests inside PE_DMA requires the address layer to be aware of channels. This is the role of the LA (Logical Address) abstraction with BAAW (Logical-to-Physical Mapping Unit).

Core requirements driving the LA design:

PE_DMA → HBM_CTRL effective bandwidth semantics must be identical in both modes (only request shape and resource model differ).
Kernel programming model is unchanged — physical channel information is never exposed to kernel code.
Mode switch is a topology-level configuration.

Design space summary

Model	Status	Key idea
PA	fallback (implemented)	Direct physical addressing, no translation
VA	current default (implemented)	Per-tensor contiguous VA range; MMU translates per access
LA	proposed	LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes

Decision

This ADR defines three address models. At any given time the system operates in exactly one model. Selection is topology- / configuration- driven; coexistence within one simulation run is not required.

Address Model: PA (Physical Address) — fallback

D-PA1. PA-only semantics

All device memory accesses (MemoryRead/MemoryWrite) operate on device physical addresses (PA) plus size.
PA-only mode remains functional via the PageFault fallback path in PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats the value as a PA directly.

D-PA2. Allocation produces PA mappings

Device allocation selects PE-local memory regions and returns PA mappings sufficient to execute kernels and issue DMA requests.

PA model is retained primarily for backward compatibility with PA-only tests and as the underlying physical layer that VA / LA models resolve into.

Address Model: VA (Virtual Address with MMU) — current default

D-VA1. Virtual Address Model

Each tensor gets a single contiguous VA range (TensorHandle.va_base).
TensorShard does NOT carry a va field — shard VA is derived as va_base + offset_bytes.
Kernels receive va_base as their pointer argument (via TensorArg.va_base).
DmaReadCmd.src_addr and DmaWriteCmd.dst_addr carry VA (not PA).

D-VA2. PE_MMU Component

Hybrid design: SimPy component (inbox for MmuMapMsg) + utility (synchronous translate() called by PE_DMA).
Page-aligned dict lookup for O(1) VA → PA translation.
tlb_overhead_ns configurable per-access latency.
PageFault fallback: if VA has no mapping, PE_DMA treats it as PA directly (preserves PA model for backward compatibility).

D-VA3. Mapping Installation

MmuMapMsg traverses the fabric: Host → PCIE_EP → IO_CPU (cube fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured end-to-end.
MmuMapMsg.target_sips controls SIP-level routing to prevent cross-SIP mapping contamination for replicated tensors.
Mapping strategy based on DPPolicy.cube:
- Replicate (cube="replicate"): per-(sip, cube) local mapping only. Each cube's PEs see only their local PA. No cross-cube mapping installed.
- Sharded (cube="column_wise", etc.): broadcast all shard mappings to all target cubes. Enables cross-PE and cross-cube DMA.

D-VA4. Tensor Lifecycle

del tensor triggers automatic cleanup via Tensor.__del__ + weakref to RuntimeContext. Sends MmuUnmapMsg through fabric, returns VA and PA space.
with RuntimeContext(...) as ctx: provides scope-based bulk cleanup.
RuntimeContext._tensors uses weakref.ref to avoid preventing GC.
PEMemAllocator uses free-list with coalescing (not bump allocator).
VirtualAllocator uses free-list with coalescing for VA space.

D-VA5. Allocators

VirtualAllocator: device-wide VA space, page-aligned alloc/free with coalescing.
PEMemAllocator: per-PE HBM/TCM, free-list based alloc/free with coalescing.
Page size configurable via topology.yaml pe_mmu attrs (default 4096).

Consequences (VA model)

Triton kernels use base_addr + offset patterns naturally on sharded tensors.
All latency remains explicit via graph traversal, including MMU mapping installation and per-access TLB overhead.
PA-only mode retained as fallback (PageFault → treat as PA).
IPCQ and other fixed-address resources bypass MMU (use PA directly).

Address Model: LA (Logical Address with BAAW) — proposed

LA replaces VA when channel-level HBM modelling is required. Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the removed artifacts). Coexistence with VA in the same run is not a goal.

D-LA1. LA introduction — replaces VA infrastructure

LA is the sole address space used by kernel code (tl.load, tl.store, tl.composite). Properties:

Can map a Tensor to a contiguous logical space (like VA).
Expresses (logical buffer + offset).
Does NOT contain physical channel information directly.
Stays as an intermediate abstraction until physical resolution.

LA address space:

Item	Value
LA start	`0x1_0000_0000` (4 GB, preserves former VA start)
LA space size	64 GB per PE
Alignment unit	segment (see D-LA3)

LA is PE-local: different PEs may use the same LA value; BAAW segment tables differ → they resolve to different PAs.

VA infrastructure removed when LA is adopted:

Removed	Replacement
`policy/address/va_allocator.py` (VirtualAllocator)	LA allocator (same free-list approach, renamed)
`policy/address/pe_mmu.py` (PeMMU)	BAAW segment table (inside PE_DMA)
`components/builtin/pe_mmu.py` (PeMmuComponent)	Removed — BAAW is internal PE_DMA logic, not a separate component
`runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg`	`BaawSegmentInstallMsg`
`runtime_api/context.py`: VA alloc + MMU install	LA alloc + BAAW segment install
`runtime_api/tensor.py`: `va_base`	`la_base`
`topology.yaml`: `pe_mmu` component entry	Removed

D-LA2. Mapping mode setting

Topology-level (cube) configuration:

cube:
  memory_map:
    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
    hbm_pseudo_channels: 64       # total pseudo channel count
    hbm_channels_per_pe: 8        # per-PE local channel count
    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth

Consumed by the graph compiler (topology builder) and BAAW initialisation.

D-LA3. Segment and BAAW

Segment partitions the LA space; each segment maps to a specific HBM channel or channel group. Created at tensor deploy time by the runtime allocator. BAAW resolves LA → physical request(s) using the segment table.

@dataclass
class BaawSegment:
    la_base: int          # segment start LA
    la_size: int          # segment size (bytes)
    mode: str             # "one_to_one" | "n_to_one"
    # 1:1 mode fields
    channel_count: int    # channels assigned to this segment (e.g. 8)
    pa_bases: list[int]   # per-channel PA bases (len = channel_count)
    channel_ids: list[int]   # per-channel logical IDs (e.g. [0..7])
    channel_size: int     # per-channel size (la_size // channel_count)
    # n:1 mode fields
    agg_pa_base: int      # aggregated PA base
    agg_node_id: str      # aggregated router node_id

Segment lifecycle:

Allocate (tensor deploy): RuntimeContext allocates LA from LA allocator. PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1). BaawSegmentInstallMsg registers the segment with PE_DMA.
Use (kernel run): kernel tl.load(la_ptr) → DmaReadCmd (src_addr=LA). PE_DMA's BAAW front-end looks up the segment and converts to PA(s).
Free (tensor free): segment removed from table; LA and PA returned.

D-LA4. BAAW resolution logic

BAAW is a front-end stage inside PE_DMA, not a separate SimPy component. Synchronous address-resolution logic executed at the start of PE_DMA's handle_command().

Input: (LA, nbytes). Output:

1:1 mode: list[PhysicalRequest] — one per channel.
n:1 mode: single PhysicalRequest.

@dataclass
class PhysicalRequest:
    pa: int           # 51-bit Physical Address
    nbytes: int       # transfer size for this request
    dst_node: str     # target node_id (channel router or aggregated router)


def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
    offset = la - seg.la_base

    if seg.mode == "n_to_one":
        pa = seg.agg_pa_base + offset
        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]

    # one_to_one
    requests = []
    per_ch_size = seg.channel_size
    for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)):
        ch_offset = offset % per_ch_size
        ch_nbytes = nbytes // seg.channel_count
        pa = pa_base + ch_offset
        dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
        requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
    return requests

BAAW responsibilities:

Convert logical access → physical request units.
Apply mode-dependent fan-out (1:1) or pass-through (n:1).
Compute PA and target node.

BAAW non-responsibilities:

Performing actual data movement.
Executing NOC routing.
Simulating bandwidth occupation (downstream components' job).

BAAW output is directly usable by the simulator's routing and resource model without additional address decoding.

D-LA5. PE_DMA `handle_command()` change

Current (VA-based) flow:

DmaReadCmd.src_addr (VA)
  → MMU.translate(VA) → PA
  → PhysAddr.decode(PA) → PhysAddr object
  → resolver.resolve(PhysAddr) → dst_node_id
  → router.find_path(pe_prefix, dst_node_id) → path
  → 1 sub-Transaction → fabric inject

LA-based flow:

DmaReadCmd.src_addr (LA)
  → BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
  → for each PhysicalRequest:
      → router.find_path(pe_prefix, req.dst_node) → path
      → compute_drain_ns(path, req.nbytes) → drain
      → sub-Transaction → fabric inject
  → await all sub-Transactions
  → pe_txn.done.succeed()

Key changes:

MMU reference removed → BAAW resolve.
PhysAddr.decode() + resolver.resolve() → BAAW returns dst_node directly.
1 request → N parallel requests in 1:1 mode.

D-LA6. 1:1 mode detail

One logical access → N physical requests (N = channels_per_pe).
N = hbm_pseudo_channels / pes_per_cube.
Each request: fully-resolved 51-bit PA, targets a specific channel router ({pe_prefix}.ch_r{channel_id}).
Per-channel link models BW contention.
PE_DMA injects N sub-transactions concurrently.

Example: hbm_pseudo_channels=64, pes_per_cube=8 → channels_per_pe=8. PE0 owns ch0-7.

Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
    la_base: 0x1_0000_0000, la_size: 4096,
    mode: "one_to_one", channel_count: 8,
    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
    channel_size: 512,
}

BAAW resolve result (8 requests):
  → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
  → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
  → ...
  → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")

PE_DMA: 8 sub-transactions parallel inject
  per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
  Total effective BW = 8 × channel_bw_gbs

Other N values:

hbm_pseudo_channels=32, pes_per_cube=8 → channels_per_pe=4, 4 requests
hbm_pseudo_channels=64, pes_per_cube=4 → channels_per_pe=16, 16 requests

D-LA7. n:1 mode detail

One logical access → one aggregated request.
Target: aggregated router → hbm_ctrl (see ADR-0019).
Aggregated link BW = channels_per_pe × channel_bw_gbs (e.g. 8 × 32 = 256 GB/s).
Single queue / resource for modelling.
No per-channel PA decomposition.

Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
    la_base: 0x1_0000_0000, la_size: 4096,
    mode: "n_to_one",
    agg_pa_base: PA_agg,
    agg_node_id: "sip0.cube0.pe0.agg_router",
}

BAAW resolve result:
  → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")

PE_DMA: 1 sub-transaction
  aggregated router → hbm_ctrl link (256 GB/s)

D-LA8. Kernel model preserved

Kernel still issues single memory ops (tl.load, tl.store, tl.composite).
LA is the address scheme exposed to kernel code.
Channel decomposition / aggregation happens inside PE_DMA's BAAW.
Kernel code never sees physical channel information.

Consequences (LA model, proposed)

Positive:

1:1 vs n:1 semantics live in one place (BAAW).
Kernel abstraction preserved — no kernel code changes.
Topology-based policy control (mode switch via yaml).
Improved simulation-model consistency and debuggability.
Segment-based mapping is simpler than page tables; lower overhead.

Negative:

Full VA/MMU code refactor required.
Request-generation path more complex (N requests in 1:1 mode).
Reduced per-channel visibility in n:1 mode.
VA-related tests need rewriting.

Migration Path

PA → VA was an extension. PA mode is retained as the PageFault fallback inside PE_DMA. Switching does not require removing PA code.
VA → LA, if adopted, is a replacement, not coexistence. See D-LA1 for the VA infrastructure removal list. PA fallback inside PE_DMA may be retained orthogonally for tests.

Alternatives Considered (LA model)

Keep VA + fan-out in MMU: MMU returns per-channel PAs. Rejected: MMU's role would grow beyond translation to request decomposition; aggregation (n:1) becomes awkward to express.
Channel-aware kernel API: kernels call per-channel load/store directly. Rejected: abstraction leakage, portability loss, all benchmarks need rewriting.
Always PA (no LA): runtime passes per-channel PA to kernel directly. Rejected: incompatible with aggregation; conversion timing unclear; channel info leaks to kernel.

Test Requirements

VA model (current, regression)

Cross-PE / cross-cube DMA paths over installed mappings.
MmuMapMsg / MmuUnmapMsg fabric traversal with measured latency.
TLB-overhead-per-access timing.
PageFault fallback path preserves PA-only behaviour.

LA model (when implemented)

1:1 mode: same logical access → N per-channel requests.
n:1 mode: same logical access → 1 aggregated request.
Bandwidth equivalence between modes for identical workload.
1:1 mode: per-channel contention modelled correctly.
n:1 mode: aggregated bandwidth correctly reflected.
Kernel code unchanged across mode switch.
BAAW segment install / uninstall correctness.
Multiple tensors in distinct segments do not collide.

Implementation Order (LA, when scheduled)

LA type (policy/address/la_allocator.py).
BAAW segment table (policy/address/baaw.py).
BaawSegmentInstallMsg (runtime_api/kernel.py).
PE_DMA BAAW integration (components/builtin/pe_dma.py handle_command()).
RuntimeContext: LA alloc + segment install (runtime_api/context.py).
Tensor.va_base → Tensor.la_base (runtime_api/tensor.py).
Remove VA/MMU code.
Remove pe_mmu from topology.yaml; add mapping mode settings.
Test migration:

Test file	Action
`tests/test_mmu_component.py`	Remove → BAAW segment install tests
`tests/test_mmu_fabric.py`	Remove → BAAW + fabric integration tests
`tests/test_pe_mmu.py`	Remove
`tests/test_va_allocator.py`	Replace with LA allocator tests
`tests/test_va_integration.py`	Replace with LA + BAAW integration tests
`tests/test_va_offset.py`	Replace with LA offset tests

18 KiB Raw Blame History Unescape Escape