ADR-0011: Memory Addressing — PA / VA / LA Address Models
Status
Accepted.
- VA model: currently implemented (default).
- PA model: implemented as PageFault fallback in PE_DMA.
- LA model: proposed, not implemented.
Context
KernBench's address model evolved through three design points, each addressing a limitation of the previous. This ADR documents all three in one place because future implementation work selects among them.
PA-only baseline
Phase 0 of KernBench treated all device memory operations (MemoryRead/MemoryWrite) as raw physical-address transfers. No host-side virtual addressing, no MMU/IOMMU translation. Allocators returned PA mappings; DMA requests carried PA directly.
This was sufficient for early correctness/latency work but
insufficient for running standard Triton kernels that use
base_addr + offset patterns on sharded tensors: each PE's shard
has a different PA, but the kernel needs a single contiguous address
space to compute offsets.
Why VA/MMU (current default)
A realistic system uses host-side virtual addressing and an MMU/IOMMU-style translation path for DMA: the host allocates physical memory at PE level, maps it into a virtual address space, installs mappings, and DMA requests use virtual addresses that are translated to physical addresses.
Adopting this model lets kernels use base_addr + offset over a
contiguous VA range while the device-side MMU translates each access
to the appropriate PA.
Why LA/BAAW (proposed)
VA/MMU treats HBM as a single backing space. KernBench needs to explore architectures where HBM is composed of multiple pseudo channels in parallel:
- CUBE's HBM has 32 or 64 pseudo channels.
- In a PE-Local-HBM model, each PE is assigned N pseudo channels (N = hbm_pseudo_channels / pes_per_cube).
- Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW (N × per-channel).
Two channel-mapping modes need to be modelable:
- 1:1 mode — one logical access → N per-channel requests. Precise per-channel BW contention modelling.
- n:1 mode (default) — one logical access → one aggregated request. Channels are assumed to interleave; aggregated BW model.
VA's tl.load(va_ptr) produces a single DMA request to a single
target. Decomposing that into per-channel requests inside PE_DMA
requires the address layer to be aware of channels. This is the
role of the LA (Logical Address) abstraction with BAAW
(Logical-to-Physical Mapping Unit).
Core requirements driving the LA design:
- PE_DMA → HBM_CTRL effective bandwidth semantics must be identical in both modes (only request shape and resource model differ).
- Kernel programming model is unchanged — physical channel information is never exposed to kernel code.
- Mode switch is a topology-level configuration.
Design space summary
| Model | Status | Key idea |
|---|---|---|
| PA | fallback (implemented) | Direct physical addressing, no translation |
| VA | current default (implemented) | Per-tensor contiguous VA range; MMU translates per access |
| LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes |
Decision
This ADR defines three address models. At any given time the system operates in exactly one model. Selection is driven by topology or configuration; coexistence within one simulation run is not required.
Address Model: PA (Physical Address) — fallback
D-PA1. PA-only semantics
- All device memory accesses (MemoryRead/MemoryWrite) operate on device physical addresses (PA) plus size.
- PA-only mode remains functional via the PageFault fallback path in PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats the value as a PA directly.
D-PA2. Allocation produces PA mappings
Device allocation selects PE-local memory regions and returns PA mappings sufficient to execute kernels and issue DMA requests.
PA model is retained primarily for backward compatibility with PA-only tests and as the underlying physical layer that VA / LA models resolve into.
Address Model: VA (Virtual Address with MMU) — current default
D-VA1. Virtual Address Model
- Each tensor gets a single contiguous VA range (TensorHandle.va_base).
- TensorShard does NOT carry a va field; shard VA is derived as va_base + offset_bytes.
- Kernels receive va_base as their pointer argument (via TensorArg.va_base).
- DmaReadCmd.src_addr and DmaWriteCmd.dst_addr carry VA (not PA).
D-VA2. PE_MMU Component
- Hybrid design: SimPy component (inbox for MmuMapMsg) + utility (synchronous translate() called by PE_DMA).
- Page-aligned dict lookup for O(1) VA → PA translation.
- tlb_overhead_ns configurable per-access latency.
- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA directly (preserves PA model for backward compatibility).
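As an illustration of the page-aligned lookup and the PageFault fallback described above, here is a minimal sketch. The names PeMmuSketch and dma_resolve are hypothetical and are not taken from the KernBench code; the sketch also assumes page-aligned mappings and ignores tlb_overhead_ns.

```python
from dataclasses import dataclass, field
from typing import Optional

PAGE_SIZE = 4096  # default page size named in D-VA5


@dataclass
class PeMmuSketch:
    """Hypothetical stand-in for PE_MMU: dict from VA page base to PA page base."""
    page_size: int = PAGE_SIZE
    pages: dict = field(default_factory=dict)

    def map(self, va: int, pa: int, size: int) -> None:
        # One entry per page; va and pa are assumed to be page-aligned.
        assert va % self.page_size == 0 and pa % self.page_size == 0
        for off in range(0, size, self.page_size):
            self.pages[va + off] = pa + off

    def translate(self, va: int) -> Optional[int]:
        # O(1) lookup: strip the in-page offset, look up the page base.
        page = va - (va % self.page_size)
        base = self.pages.get(page)
        return None if base is None else base + (va - page)


def dma_resolve(mmu: PeMmuSketch, addr: int) -> int:
    """PE_DMA-side fallback: an address with no mapping is used as a PA."""
    pa = mmu.translate(addr)
    return addr if pa is None else pa


# Two shards of one tensor: contiguous in VA, far apart in PA.
mmu = PeMmuSketch()
va_base = 0x1_0000_0000
mmu.map(va_base, 0x4000_0000, 4096)
mmu.map(va_base + 4096, 0x9000_0000, 4096)
```

With these mappings, va_base + 4096 + 16 lands in the second shard's PA region, while an unmapped value such as 0x2000 passes through unchanged.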
D-VA3. Mapping Installation
- MmuMapMsg traverses the fabric: Host → PCIE_EP → IO_CPU (cube fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured end-to-end.
- MmuMapMsg.target_sips controls SIP-level routing to prevent cross-SIP mapping contamination for replicated tensors.
- Mapping strategy based on DPPolicy.cube:
  - Replicate (cube="replicate"): per-(sip, cube) local mapping only. Each cube's PEs see only their local PA. No cross-cube mapping installed.
  - Sharded (cube="column_wise", etc.): broadcast all shard mappings to all target cubes. Enables cross-PE and cross-cube DMA.
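The difference between the two mapping strategies can be sketched as a planning function that decides which mappings each cube receives. Shard and plan_mappings are hypothetical names for this illustration, and the sketch assumes all shards belong to SIPs already selected by target_sips.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    sip: int
    cube: int
    va: int
    pa: int
    size: int


def plan_mappings(shards: list, cube_policy: str) -> dict:
    """Return {(sip, cube): [shards whose mappings are installed there]}."""
    targets = sorted({(s.sip, s.cube) for s in shards})
    if cube_policy == "replicate":
        # Local only: each cube sees just the replica that lives on it.
        return {t: [s for s in shards if (s.sip, s.cube) == t] for t in targets}
    # Sharded policies: every target cube receives every shard's mapping.
    return {t: list(shards) for t in targets}


VA = 0x1_0000_0000
replicas = [Shard(0, 0, VA, 0x1000, 4096), Shard(0, 1, VA, 0x8000, 4096)]
columns = [Shard(0, 0, VA, 0x1000, 4096), Shard(0, 1, VA + 4096, 0x8000, 4096)]
```

Under "replicate" both cubes map the same VA to their own local PA; under a sharded policy each cube holds both mappings, which is what makes cross-cube DMA resolvable.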
D-VA4. Tensor Lifecycle
- del tensor triggers automatic cleanup via Tensor.__del__ + weakref to RuntimeContext. Sends MmuUnmapMsg through fabric, returns VA and PA space.
- with RuntimeContext(...) as ctx: provides scope-based bulk cleanup.
- RuntimeContext._tensors uses weakref.ref to avoid preventing GC.
- PEMemAllocator uses free-list with coalescing (not bump allocator).
- VirtualAllocator uses free-list with coalescing for VA space.
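The lifecycle rules above can be sketched with Python's weakref module. ContextSketch and TensorSketch are hypothetical stand-ins, and the sketch records released VA bases in a list instead of sending MmuUnmapMsg through the fabric.

```python
import weakref


class ContextSketch:
    """Hypothetical stand-in for RuntimeContext."""

    def __init__(self):
        self._tensors = []   # weakref.ref entries: the context never keeps a tensor alive
        self.unmapped = []   # VA bases whose mappings were torn down

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Scope-based bulk cleanup of every tensor that is still alive.
        for ref in self._tensors:
            tensor = ref()
            if tensor is not None:
                tensor.free()
        return False

    def tensor(self, va_base):
        t = TensorSketch(self, va_base)
        self._tensors.append(weakref.ref(t))
        return t

    def release(self, va_base):
        self.unmapped.append(va_base)


class TensorSketch:
    def __init__(self, ctx, va_base):
        self._ctx = weakref.ref(ctx)   # weak back-reference avoids a reference cycle
        self.va_base = va_base
        self._freed = False

    def free(self):
        ctx = self._ctx()
        if not self._freed and ctx is not None:
            self._freed = True
            ctx.release(self.va_base)

    def __del__(self):
        self.free()
```

On CPython, del on the last reference runs __del__ immediately because of reference counting; other interpreters may defer it, which is one reason the with-block cleanup is useful as a deterministic path.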
D-VA5. Allocators
- VirtualAllocator: device-wide VA space, page-aligned alloc/free with coalescing.
- PEMemAllocator: per-PE HBM/TCM, free-list based alloc/free with coalescing.
- Page size configurable via topology.yaml pe_mmu attrs (default 4096).
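A free-list allocator with coalescing, as named for both allocators, might look like the following sketch. FreeListAllocator is a hypothetical name; first-fit placement and a page-aligned base are assumptions of this example, not documented properties of VirtualAllocator or PEMemAllocator.

```python
class FreeListAllocator:
    """Hypothetical page-aligned free-list allocator that coalesces on free."""

    def __init__(self, base: int, size: int, page_size: int = 4096):
        self.page_size = page_size
        self._free = [(base, size)]   # sorted list of (start, length) holes
        self._live = {}               # start -> length of live allocations

    def alloc(self, nbytes: int) -> int:
        # Round up to whole pages, then take the first hole that fits.
        need = -(-nbytes // self.page_size) * self.page_size
        for i, (start, length) in enumerate(self._free):
            if length >= need:
                if length == need:
                    del self._free[i]
                else:
                    self._free[i] = (start + need, length - need)
                self._live[start] = need
                return start
        raise MemoryError(f"no free block of {need} bytes")

    def free(self, addr: int) -> None:
        length = self._live.pop(addr)
        self._free.append((addr, length))
        self._free.sort()
        # Coalesce: merge each hole into its predecessor when they touch.
        merged = [self._free[0]]
        for start, ln in self._free[1:]:
            prev_start, prev_len = merged[-1]
            if prev_start + prev_len == start:
                merged[-1] = (prev_start, prev_len + ln)
            else:
                merged.append((start, ln))
        self._free = merged
```

Unlike a bump allocator, freeing the middle block between two holes restores one large hole, so a later allocation of the full range can succeed again.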
Consequences (VA model)
- Triton kernels use base_addr + offset patterns naturally on sharded tensors.
- All latency remains explicit via graph traversal, including MMU mapping installation and per-access TLB overhead.
- PA-only mode retained as fallback (PageFault → treat as PA).
- IPCQ and other fixed-address resources bypass MMU (use PA directly).
Address Model: LA (Logical Address with BAAW) — proposed
LA replaces VA when channel-level HBM modelling is required. Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the removed artifacts). Coexistence with VA in the same run is not a goal.
D-LA1. LA introduction — replaces VA infrastructure
LA is the sole address space used by kernel code (tl.load,
tl.store, tl.composite). Properties:
- Can map a Tensor to a contiguous logical space (like VA).
- Expresses (logical buffer + offset).
- Does NOT contain physical channel information directly.
- Stays as an intermediate abstraction until physical resolution.
LA address space:
| Item | Value |
|---|---|
| LA start | 0x1_0000_0000 (4 GB, preserves former VA start) |
| LA space size | 64 GB per PE |
| Alignment unit | segment (see D-LA3) |
LA is PE-local: different PEs may use the same LA value; BAAW segment tables differ → they resolve to different PAs.
VA infrastructure removed when LA is adopted:
| Removed | Replacement |
|---|---|
| policy/address/va_allocator.py (VirtualAllocator) | LA allocator (same free-list approach, renamed) |
| policy/address/pe_mmu.py (PeMMU) | BAAW segment table (inside PE_DMA) |
| components/builtin/pe_mmu.py (PeMmuComponent) | Removed; BAAW is internal PE_DMA logic, not a separate component |
| runtime_api/kernel.py: MmuMapMsg, MmuUnmapMsg | BaawSegmentInstallMsg |
| runtime_api/context.py: VA alloc + MMU install | LA alloc + BAAW segment install |
| runtime_api/tensor.py: va_base | la_base |
| topology.yaml: pe_mmu component entry | Removed |
D-LA2. Mapping mode setting
Topology-level (cube) configuration:
cube:
  memory_map:
    hbm_mapping_mode: n_to_one     # one_to_one | n_to_one
    hbm_pseudo_channels: 64        # total pseudo channel count
    hbm_channels_per_pe: 8         # per-PE local channel count
    hbm_channel_bw_gbs: 32.0       # per-channel bandwidth
Consumed by the graph compiler (topology builder) and BAAW initialisation.
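A consumer of these settings would plausibly validate them before building the topology. The sketch below is one way to do that; HbmMapConfig is a hypothetical class, and pes_per_cube is assumed to be available from elsewhere in the topology, since it is not part of the memory_map block shown above.

```python
from dataclasses import dataclass

MODES = ("one_to_one", "n_to_one")


@dataclass(frozen=True)
class HbmMapConfig:
    hbm_mapping_mode: str
    hbm_pseudo_channels: int
    hbm_channels_per_pe: int
    hbm_channel_bw_gbs: float
    pes_per_cube: int   # assumption: supplied by the surrounding cube config

    def __post_init__(self):
        if self.hbm_mapping_mode not in MODES:
            raise ValueError(f"unknown hbm_mapping_mode: {self.hbm_mapping_mode!r}")
        # N = hbm_pseudo_channels / pes_per_cube must match the per-PE count.
        if self.hbm_channels_per_pe * self.pes_per_cube != self.hbm_pseudo_channels:
            raise ValueError(
                "hbm_channels_per_pe must equal hbm_pseudo_channels / pes_per_cube"
            )

    @property
    def pe_aggregate_bw_gbs(self) -> float:
        # Aggregate PE bandwidth = N x per-channel bandwidth.
        return self.hbm_channels_per_pe * self.hbm_channel_bw_gbs


cfg = HbmMapConfig("n_to_one", 64, 8, 32.0, pes_per_cube=8)
```

For the values in the YAML fragment this yields an aggregate of 8 × 32 = 256 GB/s per PE, and an inconsistent channel count is rejected at construction time.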
D-LA3. Segment and BAAW
Segment partitions the LA space; each segment maps to a specific HBM channel or channel group. Created at tensor deploy time by the runtime allocator. BAAW resolves LA → physical request(s) using the segment table.
from dataclasses import dataclass
@dataclass
class BaawSegment:
    la_base: int              # segment start LA
    la_size: int              # segment size (bytes)
    mode: str                 # "one_to_one" | "n_to_one"
    # 1:1 mode fields
    channel_count: int        # channels assigned to this segment (e.g. 8)
    pa_bases: list[int]       # per-channel PA bases (len = channel_count)
    channel_ids: list[int]    # per-channel logical IDs (e.g. [0..7])
    channel_size: int         # per-channel size (la_size // channel_count)
    # n:1 mode fields
    agg_pa_base: int          # aggregated PA base
    agg_node_id: str          # aggregated router node_id
Segment lifecycle:
- Allocate (tensor deploy): RuntimeContext allocates LA from the LA allocator. PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1). BaawSegmentInstallMsg registers the segment with PE_DMA.
- Use (kernel run): kernel tl.load(la_ptr) → DmaReadCmd (src_addr=LA). PE_DMA's BAAW front-end looks up the segment and converts to PA(s).
- Free (tensor free): segment removed from table; LA and PA returned.
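The install / lookup / free cycle can be sketched as a small per-PE table. SegmentTableSketch is a hypothetical name, the payload stands in for a BaawSegment, and the sorted-list lookup via bisect is an implementation choice made for this example only.

```python
from bisect import bisect_right


class SegmentTableSketch:
    """Hypothetical per-PE segment table keyed by la_base."""

    def __init__(self):
        self._bases = []   # sorted la_base values
        self._segs = {}    # la_base -> (la_size, payload)

    def install(self, la_base, la_size, payload):
        i = bisect_right(self._bases, la_base)
        # Reject overlap with the neighbouring segment on either side.
        if i > 0:
            prev = self._bases[i - 1]
            if prev + self._segs[prev][0] > la_base:
                raise ValueError("segment overlaps previous segment")
        if i < len(self._bases) and la_base + la_size > self._bases[i]:
            raise ValueError("segment overlaps next segment")
        self._bases.insert(i, la_base)
        self._segs[la_base] = (la_size, payload)

    def find(self, la):
        # Rightmost segment with base <= la, then bounds-check against its size.
        i = bisect_right(self._bases, la)
        if i:
            base = self._bases[i - 1]
            size, payload = self._segs[base]
            if la < base + size:
                return base, payload
        raise KeyError(f"no segment covers LA {la:#x}")

    def uninstall(self, la_base):
        self._bases.remove(la_base)
        del self._segs[la_base]


# LA is PE-local: the same LA resolves differently in each PE's table.
LA = 0x1_0000_0000
pe0, pe1 = SegmentTableSketch(), SegmentTableSketch()
pe0.install(LA, 4096, "PA_pe0")
pe1.install(LA, 4096, "PA_pe1")
```

Keeping one table per PE is what lets two PEs reuse the same LA value without collision, while the overlap check guards the "distinct segments do not collide" requirement within a single PE.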
D-LA4. BAAW resolution logic
BAAW is a front-end stage inside PE_DMA, not a separate SimPy
component. Synchronous address-resolution logic executed at the start
of PE_DMA's handle_command().
Input: (LA, nbytes). Output:
- 1:1 mode: list[PhysicalRequest], one per channel.
- n:1 mode: single PhysicalRequest.
@dataclass
class PhysicalRequest:
    pa: int          # 51-bit Physical Address
    nbytes: int      # transfer size for this request
    dst_node: str    # target node_id (channel router or aggregated router)
# BAAW method, shown outside its class for brevity.
def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
    seg = self._find_segment(la)   # la_base <= la < la_base + la_size
    offset = la - seg.la_base
    if seg.mode == "n_to_one":
        # Pass-through: one aggregated request.
        pa = seg.agg_pa_base + offset
        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
    # one_to_one: fan out to one request per channel.
    requests = []
    per_ch_size = seg.channel_size
    for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids):
        ch_offset = offset % per_ch_size
        ch_nbytes = nbytes // seg.channel_count   # integer division; a remainder is not distributed
        pa = pa_base + ch_offset
        dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
        requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
    return requests
BAAW responsibilities:
- Convert logical access → physical request units.
- Apply mode-dependent fan-out (1:1) or pass-through (n:1).
- Compute PA and target node.
BAAW non-responsibilities:
- Performing actual data movement.
- Executing NOC routing.
- Simulating bandwidth occupation (downstream components' job).
BAAW output is directly usable by the simulator's routing and resource model without additional address decoding.
D-LA5. PE_DMA handle_command() change
Current (VA-based) flow:
DmaReadCmd.src_addr (VA)
  → MMU.translate(VA) → PA
  → PhysAddr.decode(PA) → PhysAddr object
  → resolver.resolve(PhysAddr) → dst_node_id
  → router.find_path(pe_prefix, dst_node_id) → path
  → 1 sub-Transaction → fabric inject
LA-based flow:
DmaReadCmd.src_addr (LA)
  → BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
  → for each PhysicalRequest:
      → router.find_path(pe_prefix, req.dst_node) → path
      → compute_drain_ns(path, req.nbytes) → drain
      → sub-Transaction → fabric inject
  → await all sub-Transactions
  → pe_txn.done.succeed()
Key changes:
- MMU reference removed → BAAW resolve.
- PhysAddr.decode() + resolver.resolve() → BAAW returns dst_node directly.
- 1 request → N parallel requests in 1:1 mode.
D-LA6. 1:1 mode detail
- One logical access → N physical requests (N = channels_per_pe).
- N = hbm_pseudo_channels / pes_per_cube.
- Each request: fully-resolved 51-bit PA, targets a specific channel router ({pe_prefix}.ch_r{channel_id}).
- Per-channel link models BW contention.
- PE_DMA injects N sub-transactions concurrently.
Example: hbm_pseudo_channels=64, pes_per_cube=8 → channels_per_pe=8.
PE0 owns ch0-7.
Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
  la_base: 0x1_0000_0000, la_size: 4096,
  mode: "one_to_one", channel_count: 8,
  pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
  channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
  channel_size: 512,
}
BAAW resolve result (8 requests):
  → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
  → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
  → ...
  → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
PE_DMA: 8 sub-transactions parallel inject
  per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
  Total effective BW = 8 × channel_bw_gbs
Other N values:
- hbm_pseudo_channels=32, pes_per_cube=8 → channels_per_pe=4, 4 requests
- hbm_pseudo_channels=64, pes_per_cube=4 → channels_per_pe=16, 16 requests
D-LA7. n:1 mode detail
- One logical access → one aggregated request.
- Target: aggregated router → hbm_ctrl (see ADR-0019).
- Aggregated link BW = channels_per_pe × channel_bw_gbs (e.g. 8 × 32 = 256 GB/s).
- Single queue / resource for modelling.
- No per-channel PA decomposition.
Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
  la_base: 0x1_0000_0000, la_size: 4096,
  mode: "n_to_one",
  agg_pa_base: PA_agg,
  agg_node_id: "sip0.cube0.pe0.agg_router",
}
BAAW resolve result:
  → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
PE_DMA: 1 sub-transaction
  aggregated router → hbm_ctrl link (256 GB/s)
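The requirement that both modes deliver the same effective bandwidth can be checked with a back-of-the-envelope calculation on the two 4 KB examples. The helpers drain_ns and completion_ns are hypothetical; they model pure serialization time on an uncontended link and are not the simulator's compute_drain_ns.

```python
def drain_ns(nbytes: int, bw_gbs: float) -> float:
    # 1 GB/s (decimal) is 1 byte per ns, so bytes / (GB/s) yields ns.
    return nbytes / bw_gbs


def completion_ns(requests: list) -> float:
    """requests: (nbytes, link_bw_gbs) per sub-transaction, all injected at t=0."""
    # Parallel sub-transactions finish when the slowest one does.
    return max(drain_ns(n, bw) for n, bw in requests)


channel_bw = 32.0
one_to_one = [(512, channel_bw)] * 8    # 1:1 example: 8 requests of 512 B
n_to_one = [(4096, 8 * channel_bw)]     # n:1 example: 1 request at 256 GB/s

t_1to1 = completion_ns(one_to_one)
t_nto1 = completion_ns(n_to_one)
```

Both evaluate to 16 ns (512 / 32 and 4096 / 256), which is the equivalence the design asks for when there is no contention; the modes are expected to diverge only once per-channel contention enters the 1:1 model.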
D-LA8. Kernel model preserved
- Kernel still issues single memory ops (tl.load, tl.store, tl.composite).
- LA is the address scheme exposed to kernel code.
- Channel decomposition / aggregation happens inside PE_DMA's BAAW.
- Kernel code never sees physical channel information.
Consequences (LA model, proposed)
Positive:
- 1:1 vs n:1 semantics live in one place (BAAW).
- Kernel abstraction preserved — no kernel code changes.
- Topology-based policy control (mode switch via yaml).
- Improved simulation-model consistency and debuggability.
- Segment-based mapping is simpler than page tables; lower overhead.
Negative:
- Full VA/MMU code refactor required.
- Request-generation path more complex (N requests in 1:1 mode).
- Reduced per-channel visibility in n:1 mode.
- VA-related tests need rewriting.
Migration Path
- PA → VA was an extension. PA mode is retained as the PageFault fallback inside PE_DMA. Switching does not require removing PA code.
- VA → LA, if adopted, is a replacement, not coexistence. See D-LA1 for the VA infrastructure removal list. PA fallback inside PE_DMA may be retained orthogonally for tests.
Alternatives Considered (LA model)
- Keep VA + fan-out in MMU: MMU returns per-channel PAs. Rejected: MMU's role would grow beyond translation to request decomposition; aggregation (n:1) becomes awkward to express.
- Channel-aware kernel API: kernels call per-channel load/store directly. Rejected: abstraction leakage, portability loss, all benchmarks need rewriting.
- Always PA (no LA): runtime passes per-channel PA to kernel directly. Rejected: incompatible with aggregation; conversion timing unclear; channel info leaks to kernel.
Test Requirements
VA model (current, regression)
- Cross-PE / cross-cube DMA paths over installed mappings.
- MmuMapMsg / MmuUnmapMsg fabric traversal with measured latency.
- TLB-overhead-per-access timing.
- PageFault fallback path preserves PA-only behaviour.
LA model (when implemented)
- 1:1 mode: same logical access → N per-channel requests.
- n:1 mode: same logical access → 1 aggregated request.
- Bandwidth equivalence between modes for identical workload.
- 1:1 mode: per-channel contention modelled correctly.
- n:1 mode: aggregated bandwidth correctly reflected.
- Kernel code unchanged across mode switch.
- BAAW segment install / uninstall correctness.
- Multiple tensors in distinct segments do not collide.
Implementation Order (LA, when scheduled)
- LA type (policy/address/la_allocator.py).
- BAAW segment table (policy/address/baaw.py).
- BaawSegmentInstallMsg (runtime_api/kernel.py).
- PE_DMA BAAW integration (components/builtin/pe_dma.py handle_command()).
- RuntimeContext: LA alloc + segment install (runtime_api/context.py).
- Tensor.va_base → Tensor.la_base (runtime_api/tensor.py).
- Remove VA/MMU code.
- Remove pe_mmu from topology.yaml; add mapping mode settings.
- Test migration:
| Test file | Action |
|---|---|
| tests/test_mmu_component.py | Remove → BAAW segment install tests |
| tests/test_mmu_fabric.py | Remove → BAAW + fabric integration tests |
| tests/test_pe_mmu.py | Remove |
| tests/test_va_allocator.py | Replace with LA allocator tests |
| tests/test_va_integration.py | Replace with LA + BAAW integration tests |
| tests/test_va_offset.py | Replace with LA offset tests |
Links
- ADR-0007 (runtime_api vs sim_engine boundaries)
- ADR-0008 (tensor deployment)
- ADR-0009 (kernel execution)
- ADR-0014 (PE-internal execution model)
- ADR-0015 (component port/wire model)
- ADR-0019 (NOC + per-channel HBM connectivity — LA model topology consumer)
- SPEC R2 (latency by traversal), R10 (memory addressing)