Files
kernbench2/docs/adr/ADR-0011-memory-addressing-simplification.md
T
ywkang 22fd0d2b9d ADR: introduce docs/history/, merge 0011+0018, prune migration cruft
- CLAUDE.md: add ADR Lifecycle subsection (superseded → docs/history/,
  immutable numbering, no renumber)
- ADR-0011: merge ADR-0018 content as "Address Model: LA" section
  alongside PA / VA; status notes VA model is currently implemented
- ADR-0018 / 0029 / 0031: moved to docs/history/ with status updates
  (0018 merged into 0011, 0029 superseded by 0032, 0031 absorbed
  into 0001 rev 2)
- ADR-0019: rewrite Context as PE-HBM connectivity decision
  (self-contained, no LA model framing)
- ADR-0019/0020/0021/0023/0025/0027: Status Proposed → Accepted
  (code verified) and prune Implementation Notes / Affected files /
  Test strategy / "현재 상태" sub-sections describing pre-impl state
- ADR-0024/0026: same migration-flavor cleanup; 0026 also drops D6
  Migration and D8 docs-update sub-decisions
- ADR-0030: status simplified (blocker ADR-0031 now superseded)
- SPEC.md: R10 + §0.2 reflect PA / VA / LA model names
- ADR-0008/0012/0013: refresh ADR-0011 subtitle in Links

21 files changed, 553 insertions(+), 1290 deletions(-).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 11:42:45 -07:00

522 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ADR-0011: Memory Addressing — PA / VA / LA Address Models
## Status
Accepted.
- **VA model: currently implemented (default).**
- PA model: implemented as PageFault fallback in PE_DMA.
- LA model: proposed, not implemented.
## Context
KernBench's address model evolved through three design points, each
addressing a limitation of the previous. This ADR documents all three
in one place because future implementation work selects among them.
### PA-only baseline
Phase 0 of KernBench treated all device memory operations
(MemoryRead/MemoryWrite) as raw physical-address transfers. No
host-side virtual addressing, no MMU/IOMMU translation. Allocators
returned PA mappings; DMA requests carried PA directly.
This was sufficient for early correctness/latency work but
insufficient for running standard Triton kernels that use
`base_addr + offset` patterns on sharded tensors: each PE's shard
has a different PA, but the kernel needs a single contiguous address
space to compute offsets.
### Why VA/MMU (current default)
A realistic system uses host-side virtual addressing and an
MMU/IOMMU-style translation path for DMA: the host allocates physical
memory at PE level, maps it into a virtual address space, installs
mappings, and DMA requests use virtual addresses that are translated
to physical addresses.
Adopting this model lets kernels use `base_addr + offset` over a
contiguous VA range while the device-side MMU translates each access
to the appropriate PA.
### Why LA/BAAW (proposed)
VA/MMU treats HBM as a single backing space. KernBench needs to
explore architectures where HBM is composed of multiple pseudo
channels in parallel:
- CUBE's HBM has 32 or 64 pseudo channels.
- In a PE-Local-HBM model, each PE is assigned N pseudo channels
(N = `hbm_pseudo_channels / pes_per_cube`).
- Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW
(N × per-channel).
Two channel-mapping modes need to be modelable:
- **1:1 mode** — one logical access → N per-channel requests.
Precise per-channel BW contention modelling.
- **n:1 mode (default)** — one logical access → one aggregated
request. Channels are assumed to interleave; aggregated BW model.
VA's `tl.load(va_ptr)` produces a single DMA request to a single
target. Decomposing that into per-channel requests inside PE_DMA
requires the address layer to be aware of channels. This is the
role of the LA (Logical Address) abstraction with BAAW
(Logical-to-Physical Mapping Unit).
Core requirements driving the LA design:
- PE_DMA → HBM_CTRL effective bandwidth semantics must be identical
in both modes (only request shape and resource model differ).
- Kernel programming model is unchanged — physical channel
information is never exposed to kernel code.
- Mode switch is a topology-level configuration.
### Design space summary
| Model | Status | Key idea |
|-------|--------|----------|
| PA | fallback (implemented) | Direct physical addressing, no translation |
| VA | current default (implemented) | Per-tensor contiguous VA range; MMU translates per access |
| LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes |
---
## Decision
This ADR defines three address models. At any given time the system
operates in exactly one model. Selection is topology- / configuration-
driven; coexistence within one simulation run is not required.
---
### Address Model: PA (Physical Address) — fallback
#### D-PA1. PA-only semantics
- All device memory accesses (MemoryRead/MemoryWrite) operate on
device physical addresses (PA) plus size.
- PA-only mode remains functional via the PageFault fallback path in
PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats
the value as a PA directly.
#### D-PA2. Allocation produces PA mappings
Device allocation selects PE-local memory regions and returns PA
mappings sufficient to execute kernels and issue DMA requests.
PA model is retained primarily for backward compatibility with PA-only
tests and as the underlying physical layer that VA / LA models resolve
into.
---
### Address Model: VA (Virtual Address with MMU) — current default
#### D-VA1. Virtual Address Model
- Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
- `TensorShard` does NOT carry a `va` field — shard VA is derived as
`va_base + offset_bytes`.
- Kernels receive `va_base` as their pointer argument (via
`TensorArg.va_base`).
- `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).
#### D-VA2. PE_MMU Component
- Hybrid design: SimPy component (inbox for `MmuMapMsg`) + utility
(synchronous `translate()` called by PE_DMA).
- Page-aligned dict lookup for O(1) VA → PA translation.
- `tlb_overhead_ns` configurable per-access latency.
- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA
directly (preserves PA model for backward compatibility).
#### D-VA3. Mapping Installation
- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube
fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured
end-to-end.
- `MmuMapMsg.target_sips` controls SIP-level routing to prevent
cross-SIP mapping contamination for replicated tensors.
- Mapping strategy based on `DPPolicy.cube`:
- **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping
only. Each cube's PEs see only their local PA. No cross-cube
mapping installed.
- **Sharded** (`cube="column_wise"`, etc.): broadcast all shard
mappings to all target cubes. Enables cross-PE and cross-cube
DMA.
#### D-VA4. Tensor Lifecycle
- `del tensor` triggers automatic cleanup via `Tensor.__del__` +
`weakref` to `RuntimeContext`. Sends `MmuUnmapMsg` through fabric,
returns VA and PA space.
- `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
- `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
- `PEMemAllocator` uses free-list with coalescing (not bump allocator).
- `VirtualAllocator` uses free-list with coalescing for VA space.
#### D-VA5. Allocators
- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free
with coalescing.
- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with
coalescing.
- Page size configurable via `topology.yaml` `pe_mmu` attrs
(default 4096).
#### Consequences (VA model)
- Triton kernels use `base_addr + offset` patterns naturally on
sharded tensors.
- All latency remains explicit via graph traversal, including MMU
mapping installation and per-access TLB overhead.
- PA-only mode retained as fallback (PageFault → treat as PA).
- IPCQ and other fixed-address resources bypass MMU (use PA directly).
---
### Address Model: LA (Logical Address with BAAW) — proposed
LA replaces VA when channel-level HBM modelling is required.
Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the
removed artifacts). Coexistence with VA in the same run is not a goal.
#### D-LA1. LA introduction — replaces VA infrastructure
LA is the sole address space used by kernel code (`tl.load`,
`tl.store`, `tl.composite`). Properties:
- Can map a Tensor to a contiguous logical space (like VA).
- Expresses `(logical buffer + offset)`.
- Does NOT contain physical channel information directly.
- Stays as an intermediate abstraction until physical resolution.
LA address space:
| Item | Value |
|------|-------|
| LA start | `0x1_0000_0000` (4 GB, preserves former VA start) |
| LA space size | 64 GB per PE |
| Alignment unit | segment (see D-LA3) |
LA is PE-local: different PEs may use the same LA value; BAAW segment
tables differ → they resolve to different PAs.
VA infrastructure removed when LA is adopted:
| Removed | Replacement |
|---------|-------------|
| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed) |
| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
| `runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg` | `BaawSegmentInstallMsg` |
| `runtime_api/context.py`: VA alloc + MMU install | LA alloc + BAAW segment install |
| `runtime_api/tensor.py`: `va_base` | `la_base` |
| `topology.yaml`: `pe_mmu` component entry | Removed |
#### D-LA2. Mapping mode setting
Topology-level (cube) configuration:
```yaml
cube:
memory_map:
hbm_mapping_mode: n_to_one # one_to_one | n_to_one
hbm_pseudo_channels: 64 # total pseudo channel count
hbm_channels_per_pe: 8 # per-PE local channel count
hbm_channel_bw_gbs: 32.0 # per-channel bandwidth
```
Consumed by the graph compiler (topology builder) and BAAW
initialisation.
#### D-LA3. Segment and BAAW
Segment partitions the LA space; each segment maps to a specific HBM
channel or channel group. Created at tensor deploy time by the runtime
allocator. BAAW resolves LA → physical request(s) using the segment
table.
```python
@dataclass
class BaawSegment:
la_base: int # segment start LA
la_size: int # segment size (bytes)
mode: str # "one_to_one" | "n_to_one"
# 1:1 mode fields
channel_count: int # channels assigned to this segment (e.g. 8)
pa_bases: list[int] # per-channel PA bases (len = channel_count)
channel_ids: list[int] # per-channel logical IDs (e.g. [0..7])
channel_size: int # per-channel size (la_size // channel_count)
# n:1 mode fields
agg_pa_base: int # aggregated PA base
agg_node_id: str # aggregated router node_id
```
Segment lifecycle:
1. **Allocate** (tensor deploy): RuntimeContext allocates LA from LA
allocator. PEMemAllocator allocates per-channel PA (1:1) or
aggregated PA (n:1). `BaawSegmentInstallMsg` registers the segment
with PE_DMA.
2. **Use** (kernel run): kernel `tl.load(la_ptr)` → `DmaReadCmd
(src_addr=LA)`. PE_DMA's BAAW front-end looks up the segment and
converts to PA(s).
3. **Free** (tensor free): segment removed from table; LA and PA
returned.
#### D-LA4. BAAW resolution logic
BAAW is a front-end stage inside PE_DMA, not a separate SimPy
component. Synchronous address-resolution logic executed at the start
of PE_DMA's `handle_command()`.
Input: `(LA, nbytes)`. Output:
- **1:1 mode**: `list[PhysicalRequest]` — one per channel.
- **n:1 mode**: single `PhysicalRequest`.
```python
@dataclass
class PhysicalRequest:
pa: int # 51-bit Physical Address
nbytes: int # transfer size for this request
dst_node: str # target node_id (channel router or aggregated router)
def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
seg = self._find_segment(la) # la_base <= la < la_base + la_size
offset = la - seg.la_base
if seg.mode == "n_to_one":
pa = seg.agg_pa_base + offset
return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
# one_to_one
requests = []
per_ch_size = seg.channel_size
for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)):
ch_offset = offset % per_ch_size
ch_nbytes = nbytes // seg.channel_count
pa = pa_base + ch_offset
dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
return requests
```
BAAW responsibilities:
- Convert logical access → physical request units.
- Apply mode-dependent fan-out (1:1) or pass-through (n:1).
- Compute PA and target node.
BAAW non-responsibilities:
- Performing actual data movement.
- Executing NOC routing.
- Simulating bandwidth occupation (downstream components' job).
BAAW output is directly usable by the simulator's routing and resource
model without additional address decoding.
#### D-LA5. PE_DMA `handle_command()` change
Current (VA-based) flow:
```
DmaReadCmd.src_addr (VA)
→ MMU.translate(VA) → PA
→ PhysAddr.decode(PA) → PhysAddr object
→ resolver.resolve(PhysAddr) → dst_node_id
→ router.find_path(pe_prefix, dst_node_id) → path
→ 1 sub-Transaction → fabric inject
```
LA-based flow:
```
DmaReadCmd.src_addr (LA)
→ BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
→ for each PhysicalRequest:
→ router.find_path(pe_prefix, req.dst_node) → path
→ compute_drain_ns(path, req.nbytes) → drain
→ sub-Transaction → fabric inject
→ await all sub-Transactions
→ pe_txn.done.succeed()
```
Key changes:
- MMU reference removed → BAAW resolve.
- `PhysAddr.decode()` + `resolver.resolve()` → BAAW returns `dst_node`
directly.
- 1 request → N parallel requests in 1:1 mode.
#### D-LA6. 1:1 mode detail
- One logical access → N physical requests (N = `channels_per_pe`).
- N = `hbm_pseudo_channels / pes_per_cube`.
- Each request: fully-resolved 51-bit PA, targets a specific channel
router (`{pe_prefix}.ch_r{channel_id}`).
- Per-channel link models BW contention.
- PE_DMA injects N sub-transactions concurrently.
Example: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`.
PE0 owns ch0-7.
```text
Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
la_base: 0x1_0000_0000, la_size: 4096,
mode: "one_to_one", channel_count: 8,
pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
channel_size: 512,
}
BAAW resolve result (8 requests):
→ PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
→ PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
→ ...
→ PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
PE_DMA: 8 sub-transactions parallel inject
per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
Total effective BW = 8 × channel_bw_gbs
```
Other N values:
- `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`,
4 requests
- `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`,
16 requests
#### D-LA7. n:1 mode detail
- One logical access → one aggregated request.
- Target: aggregated router → hbm_ctrl (see ADR-0019).
- Aggregated link BW = `channels_per_pe × channel_bw_gbs`
(e.g. 8 × 32 = 256 GB/s).
- Single queue / resource for modelling.
- No per-channel PA decomposition.
```text
Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
la_base: 0x1_0000_0000, la_size: 4096,
mode: "n_to_one",
agg_pa_base: PA_agg,
agg_node_id: "sip0.cube0.pe0.agg_router",
}
BAAW resolve result:
→ PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
PE_DMA: 1 sub-transaction
aggregated router → hbm_ctrl link (256 GB/s)
```
#### D-LA8. Kernel model preserved
- Kernel still issues single memory ops (`tl.load`, `tl.store`,
`tl.composite`).
- LA is the address scheme exposed to kernel code.
- Channel decomposition / aggregation happens inside PE_DMA's BAAW.
- Kernel code never sees physical channel information.
#### Consequences (LA model, proposed)
Positive:
- 1:1 vs n:1 semantics live in one place (BAAW).
- Kernel abstraction preserved — no kernel code changes.
- Topology-based policy control (mode switch via yaml).
- Improved simulation-model consistency and debuggability.
- Segment-based mapping is simpler than page tables; lower overhead.
Negative:
- Full VA/MMU code refactor required.
- Request-generation path more complex (N requests in 1:1 mode).
- Reduced per-channel visibility in n:1 mode.
- VA-related tests need rewriting.
---
## Migration Path
- **PA → VA** was an extension. PA mode is retained as the PageFault
fallback inside PE_DMA. Switching does not require removing PA
code.
- **VA → LA**, if adopted, is a replacement, not coexistence. See
D-LA1 for the VA infrastructure removal list. PA fallback inside
PE_DMA may be retained orthogonally for tests.
## Alternatives Considered (LA model)
1. **Keep VA + fan-out in MMU**: MMU returns per-channel PAs.
Rejected: MMU's role would grow beyond translation to request
decomposition; aggregation (n:1) becomes awkward to express.
2. **Channel-aware kernel API**: kernels call per-channel load/store
directly. Rejected: abstraction leakage, portability loss, all
benchmarks need rewriting.
3. **Always PA (no LA)**: runtime passes per-channel PA to kernel
directly. Rejected: incompatible with aggregation; conversion
timing unclear; channel info leaks to kernel.
## Test Requirements
### VA model (current, regression)
- Cross-PE / cross-cube DMA paths over installed mappings.
- `MmuMapMsg` / `MmuUnmapMsg` fabric traversal with measured latency.
- TLB-overhead-per-access timing.
- PageFault fallback path preserves PA-only behaviour.
### LA model (when implemented)
- 1:1 mode: same logical access → N per-channel requests.
- n:1 mode: same logical access → 1 aggregated request.
- Bandwidth equivalence between modes for identical workload.
- 1:1 mode: per-channel contention modelled correctly.
- n:1 mode: aggregated bandwidth correctly reflected.
- Kernel code unchanged across mode switch.
- BAAW segment install / uninstall correctness.
- Multiple tensors in distinct segments do not collide.
## Implementation Order (LA, when scheduled)
1. LA type (`policy/address/la_allocator.py`).
2. BAAW segment table (`policy/address/baaw.py`).
3. `BaawSegmentInstallMsg` (`runtime_api/kernel.py`).
4. PE_DMA BAAW integration (`components/builtin/pe_dma.py`
`handle_command()`).
5. RuntimeContext: LA alloc + segment install
(`runtime_api/context.py`).
6. `Tensor.va_base` → `Tensor.la_base` (`runtime_api/tensor.py`).
7. Remove VA/MMU code.
8. Remove `pe_mmu` from `topology.yaml`; add mapping mode settings.
9. Test migration:
| Test file | Action |
|-----------|--------|
| `tests/test_mmu_component.py` | Remove → BAAW segment install tests |
| `tests/test_mmu_fabric.py` | Remove → BAAW + fabric integration tests |
| `tests/test_pe_mmu.py` | Remove |
| `tests/test_va_allocator.py` | Replace with LA allocator tests |
| `tests/test_va_integration.py` | Replace with LA + BAAW integration tests |
| `tests/test_va_offset.py` | Replace with LA offset tests |
## Links
- ADR-0007 (runtime_api vs sim_engine boundaries)
- ADR-0008 (tensor deployment)
- ADR-0009 (kernel execution)
- ADR-0014 (PE-internal execution model)
- ADR-0015 (component port/wire model)
- ADR-0019 (NOC + per-channel HBM connectivity — LA model topology
consumer)
- SPEC R2 (latency by traversal), R10 (memory addressing)