kernbench2/docs/adr/ADR-0011-memory-addressing-simplification.md

# ADR-0011: Memory Addressing — PA / VA / LA Address Models

## Status

Accepted.

- **VA model: currently implemented (default).**
- PA model: implemented as PageFault fallback in PE_DMA.
- LA model: proposed, not implemented.

## Context

KernBench's address model evolved through three design points, each
addressing a limitation of the previous. This ADR documents all three
in one place because future implementation work selects among them.

### PA-only baseline

Phase 0 of KernBench treated all device memory operations
(MemoryRead/MemoryWrite) as raw physical-address transfers. No
host-side virtual addressing, no MMU/IOMMU translation. Allocators
returned PA mappings; DMA requests carried PA directly.

This was sufficient for early correctness/latency work but
insufficient for running standard Triton kernels that use
`base_addr + offset` patterns on sharded tensors: each PE's shard
has a different PA, but the kernel needs a single contiguous address
space to compute offsets.

### Why VA/MMU (current default)

A realistic system uses host-side virtual addressing with an
MMU/IOMMU-style translation path for DMA. The host allocates physical
memory at PE level, maps it into a virtual address space, and installs
the mappings. DMA requests then carry virtual addresses that are
translated to physical addresses.

Adopting this model lets kernels use `base_addr + offset` over a
contiguous VA range while the device-side MMU translates each access
to the appropriate PA.

### Why LA/BAAW (proposed)

VA/MMU treats HBM as a single backing space. KernBench needs to
explore architectures where HBM is composed of multiple pseudo
channels in parallel:

- CUBE's HBM has 32 or 64 pseudo channels.
- In a PE-Local-HBM model, each PE is assigned N pseudo channels
  (N = `hbm_pseudo_channels / pes_per_cube`).
- Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW
  (N × per-channel).
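
The channel arithmetic above can be sketched in a few lines. The helper names are hypothetical, and the even-divisibility check is an assumption of this sketch rather than a stated KernBench rule.

```python
def channels_per_pe(hbm_pseudo_channels: int, pes_per_cube: int) -> int:
    """Return N, the number of pseudo channels local to each PE."""
    if hbm_pseudo_channels % pes_per_cube != 0:
        # Assumption: channels are split evenly across the PEs of a cube.
        raise ValueError("pseudo channels must divide evenly across PEs")
    return hbm_pseudo_channels // pes_per_cube


def aggregate_pe_bw_gbs(n_channels: int, channel_bw_gbs: float) -> float:
    """Aggregate PE bandwidth = N x per-channel bandwidth."""
    return n_channels * channel_bw_gbs


n = channels_per_pe(hbm_pseudo_channels=64, pes_per_cube=8)
pe_bw = aggregate_pe_bw_gbs(n, channel_bw_gbs=32.0)
```

With 64 pseudo channels, 8 PEs per cube, and 32 GB/s per channel, this gives N = 8 and 256 GB/s per PE.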

Two channel-mapping modes need to be modelable:

- **1:1 mode** — one logical access → N per-channel requests.
  Precise per-channel BW contention modelling.
- **n:1 mode (default)** — one logical access → one aggregated
  request. Channels are assumed to interleave; aggregated BW model.

VA's `tl.load(va_ptr)` produces a single DMA request to a single
target. Decomposing that into per-channel requests inside PE_DMA
requires the address layer to be aware of channels. This is the
role of the LA (Logical Address) abstraction together with BAAW,
the logical-to-physical mapping unit.

Core requirements driving the LA design:

- PE_DMA → HBM_CTRL effective bandwidth semantics must be identical
  in both modes (only request shape and resource model differ).
- Kernel programming model is unchanged — physical channel
  information is never exposed to kernel code.
- Mode switch is a topology-level configuration.

### Design space summary

| Model | Status | Key idea |
|-------|--------|----------|
| PA | fallback (implemented) | Direct physical addressing, no translation |
| VA | current default (implemented) | Per-tensor contiguous VA range; MMU translates per access |
| LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes |

---

## Decision

This ADR defines three address models. At any given time the system
operates in exactly one of them. Selection is driven by topology and
configuration; coexistence within one simulation run is not required.

---

### Address Model: PA (Physical Address) — fallback

#### D-PA1. PA-only semantics

- All device memory accesses (MemoryRead/MemoryWrite) operate on
  device physical addresses (PA) plus size.
- PA-only mode remains functional via the PageFault fallback path in
  PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats
  the value as a PA directly.

#### D-PA2. Allocation produces PA mappings

Device allocation selects PE-local memory regions and returns PA
mappings sufficient to execute kernels and issue DMA requests.

The PA model is retained primarily for backward compatibility with
PA-only tests and as the underlying physical layer that the VA and LA
models resolve into.

---

### Address Model: VA (Virtual Address with MMU) — current default

#### D-VA1. Virtual Address Model

- Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
- `TensorShard` does NOT carry a `va` field — shard VA is derived as
  `va_base + offset_bytes`.
- Kernels receive `va_base` as their pointer argument (via
  `TensorArg.va_base`).
- `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).

#### D-VA2. PE_MMU Component

- Hybrid design: SimPy component (inbox for `MmuMapMsg`) + utility
  (synchronous `translate()` called by PE_DMA).
- Page-aligned dict lookup for O(1) VA → PA translation.
- `tlb_overhead_ns` configurable per-access latency.
- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA
  directly (preserves PA model for backward compatibility).
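
A minimal sketch of the translate path described above, assuming a plain dict keyed by page base. `PeMmuSketch` and its method signatures are hypothetical and do not mirror the actual `PeMMU` class; `tlb_overhead_ns` is only stored here, on the assumption that the caller charges it per access.

```python
class PeMmuSketch:
    """Hypothetical stand-in for the translate() utility in D-VA2."""

    def __init__(self, page_size: int = 4096, tlb_overhead_ns: float = 0.0):
        self.page_size = page_size
        self.tlb_overhead_ns = tlb_overhead_ns  # charged by the caller per access
        self._pages: dict[int, int] = {}        # VA page base -> PA page base

    def map(self, va: int, pa: int, size: int) -> None:
        """Install page-aligned mappings covering [va, va + size)."""
        if va % self.page_size or pa % self.page_size:
            raise ValueError("va and pa must be page-aligned")
        for off in range(0, size, self.page_size):
            self._pages[va + off] = pa + off

    def translate(self, addr: int) -> tuple[int, bool]:
        """Return (pa, hit). On a miss the address is passed through as PA."""
        page_off = addr % self.page_size
        pa_page = self._pages.get(addr - page_off)  # O(1) dict lookup
        if pa_page is None:
            return addr, False  # PageFault fallback: treat the value as PA
        return pa_page + page_off, True


mmu = PeMmuSketch(page_size=4096, tlb_overhead_ns=2.0)
mmu.map(va=0x1_0000_0000, pa=0x0040_0000, size=8192)
hit = mmu.translate(0x1_0000_1010)   # mapped VA
miss = mmu.translate(0x0000_2000)    # unmapped: falls back to PA
```

Returning the hit flag alongside the address is a choice made for this sketch so the fallback is visible to the caller.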

#### D-VA3. Mapping Installation

- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube
  fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured
  end-to-end.
- `MmuMapMsg.target_sips` controls SIP-level routing to prevent
  cross-SIP mapping contamination for replicated tensors.
- Mapping strategy based on `DPPolicy.cube`:
  - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping
    only. Each cube's PEs see only their local PA. No cross-cube
    mapping installed.
  - **Sharded** (`cube="column_wise"`, etc.): broadcast all shard
    mappings to all target cubes. Enables cross-PE and cross-cube
    DMA.
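
The two mapping strategies can be illustrated with a small planning function. `ShardMapping` and `plan_mappings` are hypothetical names introduced for this sketch; the real installation path goes through `MmuMapMsg` and the fabric.

```python
from typing import NamedTuple


class ShardMapping(NamedTuple):
    cube: int  # cube that owns the backing PA
    va: int
    pa: int


def plan_mappings(cube_policy: str, shards: list[ShardMapping],
                  target_cubes: list[int]) -> dict[int, list[ShardMapping]]:
    """Return, per target cube, the mappings that would be installed there."""
    if cube_policy == "replicate":
        # Each cube sees only its own local copy.
        return {c: [s for s in shards if s.cube == c] for c in target_cubes}
    # Sharded policies: broadcast every shard mapping to every target cube.
    return {c: list(shards) for c in target_cubes}


VA = 0x1_0000_0000
replicas = [ShardMapping(0, VA, 0x1000), ShardMapping(1, VA, 0x1000)]
shards = [ShardMapping(0, VA, 0x1000), ShardMapping(1, VA + 4096, 0x1000)]

replicated = plan_mappings("replicate", replicas, target_cubes=[0, 1])
sharded = plan_mappings("column_wise", shards, target_cubes=[0, 1])
```

In the replicated plan each cube receives one mapping, while in the sharded plan both cubes receive both mappings, which is what makes cross-cube DMA resolvable.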

#### D-VA4. Tensor Lifecycle

- `del tensor` triggers automatic cleanup via `Tensor.__del__` +
  `weakref` to `RuntimeContext`. Sends `MmuUnmapMsg` through fabric,
  returns VA and PA space.
- `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
- `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
- `PEMemAllocator` uses free-list with coalescing (not bump allocator).
- `VirtualAllocator` uses free-list with coalescing for VA space.
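
The lifecycle rules above can be sketched as follows. `ContextSketch` and `TensorSketch` are hypothetical stand-ins, `release()` merely records the freed `va_base` instead of sending `MmuUnmapMsg`, and the immediate effect of `del` relies on CPython reference counting.

```python
import weakref


class ContextSketch:
    """Hypothetical stand-in for RuntimeContext's tensor tracking."""

    def __init__(self):
        self._tensors: list[weakref.ref] = []  # weak refs: do not block GC
        self.released: list[int] = []          # va_base values released so far

    def track(self, tensor: "TensorSketch") -> None:
        self._tensors.append(weakref.ref(tensor))

    def release(self, va_base: int) -> None:
        # The real path would send MmuUnmapMsg and return VA/PA space.
        self.released.append(va_base)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Scope-based bulk cleanup of every tensor that is still alive.
        for ref in self._tensors:
            tensor = ref()
            if tensor is not None:
                tensor.close()
        self._tensors.clear()


class TensorSketch:
    def __init__(self, ctx: ContextSketch, va_base: int):
        self._ctx = weakref.ref(ctx)  # tensor does not keep the context alive
        self.va_base = va_base
        self._closed = False
        ctx.track(self)

    def close(self) -> None:
        ctx = self._ctx()
        if not self._closed and ctx is not None:
            ctx.release(self.va_base)
        self._closed = True  # idempotent: a second close() is a no-op

    def __del__(self):
        self.close()


with ContextSketch() as ctx:
    t1 = TensorSketch(ctx, 0x1_0000_0000)
    t2 = TensorSketch(ctx, 0x1_0000_1000)
    del t1  # immediate cleanup on CPython (reference counting)
    released_after_del = list(ctx.released)
released_at_exit = list(ctx.released)
```

Making `close()` idempotent keeps the explicit, scope-based, and `__del__` paths from releasing the same range twice.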

#### D-VA5. Allocators

- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free
  with coalescing.
- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with
  coalescing.
- Page size configurable via `topology.yaml` `pe_mmu` attrs
  (default 4096).
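
As an illustration of "free-list with coalescing", here is a minimal first-fit sketch. `FreeListAllocator` is a hypothetical class, not the actual `VirtualAllocator` or `PEMemAllocator`; it assumes an aligned base and takes the size again on `free()` for simplicity.

```python
class FreeListAllocator:
    """Hypothetical first-fit free-list allocator with coalescing."""

    def __init__(self, base: int, size: int, align: int = 4096):
        self.align = align
        self._free: list[tuple[int, int]] = [(base, size)]  # sorted (start, size)

    def _round_up(self, nbytes: int) -> int:
        return -(-nbytes // self.align) * self.align

    def alloc(self, nbytes: int) -> int:
        need = self._round_up(nbytes)
        for i, (start, size) in enumerate(self._free):
            if size >= need:
                if size == need:
                    del self._free[i]
                else:
                    self._free[i] = (start + need, size - need)
                return start
        raise MemoryError("out of address space")

    def free(self, addr: int, nbytes: int) -> None:
        self._free.append((addr, self._round_up(nbytes)))
        self._free.sort()
        merged = [self._free[0]]
        for start, size in self._free[1:]:
            last_start, last_size = merged[-1]
            if last_start + last_size == start:  # adjacent blocks: coalesce
                merged[-1] = (last_start, last_size + size)
            else:
                merged.append((start, size))
        self._free = merged


va = FreeListAllocator(base=0x1_0000_0000, size=4 * 4096)
a = va.alloc(4096)
b = va.alloc(100)   # rounded up to one page
va.free(a, 4096)
va.free(b, 100)     # coalesces back into a single free block
```

Unlike a bump allocator, freed ranges become reusable, and coalescing keeps adjacent free ranges from fragmenting the space.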

#### Consequences (VA model)

- Triton kernels use `base_addr + offset` patterns naturally on
  sharded tensors.
- All latency remains explicit via graph traversal, including MMU
  mapping installation and per-access TLB overhead.
- PA-only mode retained as fallback (PageFault → treat as PA).
- IPCQ and other fixed-address resources bypass MMU (use PA directly).

---

### Address Model: LA (Logical Address with BAAW) — proposed

LA replaces VA when channel-level HBM modelling is required.
Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the
removed artifacts). Coexistence with VA in the same run is not a goal.

#### D-LA1. LA introduction — replaces VA infrastructure

LA is the sole address space used by kernel code (`tl.load`,
`tl.store`, `tl.composite`). Properties:

- Can map a Tensor to a contiguous logical space (like VA).
- Expresses `(logical buffer + offset)`.
- Does NOT contain physical channel information directly.
- Stays as an intermediate abstraction until physical resolution.

LA address space:

| Item | Value |
|------|-------|
| LA start | `0x1_0000_0000` (4 GB, preserves former VA start) |
| LA space size | 64 GB per PE |
| Alignment unit | segment (see D-LA3) |

LA is PE-local: different PEs may use the same LA value; BAAW segment
tables differ → they resolve to different PAs.

VA infrastructure removed when LA is adopted:

| Removed | Replacement |
|---------|-------------|
| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed) |
| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
| `runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg` | `BaawSegmentInstallMsg` |
| `runtime_api/context.py`: VA alloc + MMU install | LA alloc + BAAW segment install |
| `runtime_api/tensor.py`: `va_base` | `la_base` |
| `topology.yaml`: `pe_mmu` component entry | Removed |

#### D-LA2. Mapping mode setting

Topology-level (cube) configuration:

```yaml
cube:
  memory_map:
    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
    hbm_pseudo_channels: 64       # total pseudo channel count
    hbm_channels_per_pe: 8        # per-PE local channel count
    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth
```

Consumed by the graph compiler (topology builder) and BAAW
initialisation.
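
A consumer of these settings might validate them before building the topology. The sketch below assumes the YAML has already been loaded into a dict and that `pes_per_cube` is available from elsewhere in the topology; `validate_memory_map` is a hypothetical helper, and the consistency rule is inferred from N = `hbm_pseudo_channels / pes_per_cube`.

```python
VALID_MODES = ("one_to_one", "n_to_one")


def validate_memory_map(memory_map: dict, pes_per_cube: int) -> int:
    """Check the cube memory_map settings and return channels per PE."""
    mode = memory_map["hbm_mapping_mode"]
    if mode not in VALID_MODES:
        raise ValueError(f"unknown hbm_mapping_mode: {mode!r}")
    total = memory_map["hbm_pseudo_channels"]
    per_pe = memory_map["hbm_channels_per_pe"]
    if per_pe * pes_per_cube != total:
        raise ValueError(
            "hbm_channels_per_pe x pes_per_cube must equal hbm_pseudo_channels"
        )
    return per_pe


memory_map = {
    "hbm_mapping_mode": "n_to_one",
    "hbm_pseudo_channels": 64,
    "hbm_channels_per_pe": 8,
    "hbm_channel_bw_gbs": 32.0,
}
n_channels = validate_memory_map(memory_map, pes_per_cube=8)
```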

#### D-LA3. Segment and BAAW

Segments partition the LA space; each segment maps to a specific HBM
channel or channel group. Segments are created at tensor deploy time
by the runtime allocator. BAAW resolves an LA to one or more physical
requests using the segment table.

```python
from dataclasses import dataclass


@dataclass
class BaawSegment:
    la_base: int          # segment start LA
    la_size: int          # segment size (bytes)
    mode: str             # "one_to_one" | "n_to_one"
    # 1:1 mode fields
    channel_count: int    # channels assigned to this segment (e.g. 8)
    pa_bases: list[int]   # per-channel PA bases (len = channel_count)
    channel_ids: list[int]   # per-channel logical IDs (e.g. [0..7])
    channel_size: int     # per-channel size (la_size // channel_count)
    # n:1 mode fields
    agg_pa_base: int      # aggregated PA base
    agg_node_id: str      # aggregated router node_id
```

Segment lifecycle:

1. **Allocate** (tensor deploy): RuntimeContext allocates LA from LA
   allocator. PEMemAllocator allocates per-channel PA (1:1) or
   aggregated PA (n:1). `BaawSegmentInstallMsg` registers the segment
   with PE_DMA.
2. **Use** (kernel run): kernel `tl.load(la_ptr)` → `DmaReadCmd
   (src_addr=LA)`. PE_DMA's BAAW front-end looks up the segment and
   converts to PA(s).
3. **Free** (tensor free): segment removed from table; LA and PA
   returned.
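
The lifecycle can be sketched with a per-PE table that supports install, lookup, and remove. `SegmentSketch` is a simplified, hypothetical segment with a single PA base (as in n:1 mode), and the overlap check is an assumption of this sketch. It also shows that the same LA resolves to different PAs on different PEs.

```python
from dataclasses import dataclass


@dataclass
class SegmentSketch:
    la_base: int
    la_size: int
    pa_base: int  # simplified: one PA base, as in n:1 mode


class SegmentTableSketch:
    """Hypothetical per-PE segment table (install / lookup / remove)."""

    def __init__(self):
        self._segments: dict[int, SegmentSketch] = {}  # keyed by la_base

    def install(self, seg: SegmentSketch) -> None:
        for other in self._segments.values():
            if (seg.la_base < other.la_base + other.la_size
                    and other.la_base < seg.la_base + seg.la_size):
                raise ValueError("segment overlaps an installed segment")
        self._segments[seg.la_base] = seg

    def find(self, la: int) -> SegmentSketch:
        for seg in self._segments.values():
            if seg.la_base <= la < seg.la_base + seg.la_size:
                return seg
        raise KeyError(f"no segment covers LA {la:#x}")

    def remove(self, la_base: int) -> None:
        del self._segments[la_base]


LA = 0x1_0000_0000
pe0, pe1 = SegmentTableSketch(), SegmentTableSketch()
pe0.install(SegmentSketch(LA, 4096, pa_base=0x0010_0000))
pe1.install(SegmentSketch(LA, 4096, pa_base=0x0090_0000))

# Same LA, different PE-local tables, different PAs.
pa_on_pe0 = pe0.find(LA + 64).pa_base + 64
pa_on_pe1 = pe1.find(LA + 64).pa_base + 64
pe0.remove(LA)  # "Free": the segment no longer resolves on PE0
```

A linear scan is enough for a sketch; a sorted structure would be the natural choice if segment counts grow.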

#### D-LA4. BAAW resolution logic

BAAW is a front-end stage inside PE_DMA, not a separate SimPy
component. It is synchronous address-resolution logic executed at the
start of PE_DMA's `handle_command()`.

Input: `(LA, nbytes)`. Output:

- **1:1 mode**: `list[PhysicalRequest]` — one per channel.
- **n:1 mode**: single `PhysicalRequest`.

```python
from dataclasses import dataclass


@dataclass
class PhysicalRequest:
    pa: int           # 51-bit Physical Address
    nbytes: int       # transfer size for this request
    dst_node: str     # target node_id (channel router or aggregated router)


# Method of the BAAW front-end inside PE_DMA.
def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
    offset = la - seg.la_base

    if seg.mode == "n_to_one":
        # Pass-through: one aggregated request at the same offset.
        pa = seg.agg_pa_base + offset
        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]

    # one_to_one: fan out evenly. Each channel serves 1/N of the bytes,
    # so its offset is 1/N of the logical offset (assumes accesses are
    # striped evenly across the segment's channels).
    ch_offset = offset // seg.channel_count
    ch_nbytes = nbytes // seg.channel_count
    requests = []
    for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids):
        dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
        requests.append(
            PhysicalRequest(pa=pa_base + ch_offset, nbytes=ch_nbytes, dst_node=dst_node)
        )
    return requests
```

BAAW responsibilities:

- Convert logical access → physical request units.
- Apply mode-dependent fan-out (1:1) or pass-through (n:1).
- Compute PA and target node.

BAAW non-responsibilities:

- Performing actual data movement.
- Executing NOC routing.
- Simulating bandwidth occupation (downstream components' job).

BAAW output is directly usable by the simulator's routing and resource
model without additional address decoding.

#### D-LA5. PE_DMA `handle_command()` change

Current (VA-based) flow:

```text
DmaReadCmd.src_addr (VA)
  → MMU.translate(VA) → PA
  → PhysAddr.decode(PA) → PhysAddr object
  → resolver.resolve(PhysAddr) → dst_node_id
  → router.find_path(pe_prefix, dst_node_id) → path
  → 1 sub-Transaction → fabric inject
```

LA-based flow:

```text
DmaReadCmd.src_addr (LA)
  → BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
  → for each PhysicalRequest:
      → router.find_path(pe_prefix, req.dst_node) → path
      → compute_drain_ns(path, req.nbytes) → drain
      → sub-Transaction → fabric inject
  → await all sub-Transactions
  → pe_txn.done.succeed()
```

Key changes:

- MMU reference removed → BAAW resolve.
- `PhysAddr.decode()` + `resolver.resolve()` → BAAW returns `dst_node`
  directly.
- 1 request → N parallel requests in 1:1 mode.

#### D-LA6. 1:1 mode detail

- One logical access → N physical requests (N = `channels_per_pe`).
- N = `hbm_pseudo_channels / pes_per_cube`.
- Each request: fully-resolved 51-bit PA, targets a specific channel
  router (`{pe_prefix}.ch_r{channel_id}`).
- Per-channel link models BW contention.
- PE_DMA injects N sub-transactions concurrently.

Example: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`.
PE0 owns ch0-7.

```text
Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
    la_base: 0x1_0000_0000, la_size: 4096,
    mode: "one_to_one", channel_count: 8,
    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
    channel_size: 512,
}

BAAW resolve result (8 requests):
  → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
  → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
  → ...
  → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")

PE_DMA: injects 8 sub-transactions in parallel
  per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
  Total effective BW = 8 × channel_bw_gbs
```

Other N values:

- `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`,
  4 requests
- `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`,
  16 requests

#### D-LA7. n:1 mode detail

- One logical access → one aggregated request.
- Target: aggregated router → hbm_ctrl (see ADR-0019).
- Aggregated link BW = `channels_per_pe × channel_bw_gbs`
  (e.g. 8 × 32 = 256 GB/s).
- Single queue / resource for modelling.
- No per-channel PA decomposition.

```text
Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
    la_base: 0x1_0000_0000, la_size: 4096,
    mode: "n_to_one",
    agg_pa_base: PA_agg,
    agg_node_id: "sip0.cube0.pe0.agg_router",
}

BAAW resolve result:
  → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")

PE_DMA: 1 sub-transaction
  aggregated router → hbm_ctrl link (256 GB/s)
```
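
The bandwidth-equivalence requirement can be cross-checked against the 4 KB examples in D-LA6 and D-LA7. This is a simplified model under stated assumptions: it counts only serialization time, ignores latency and contention, and uses hypothetical helpers `drain_ns` and `completion_ns` rather than the simulator's own `compute_drain_ns`.

```python
def drain_ns(nbytes: int, bw_gbs: float) -> float:
    """Serialization time in ns; 1 GB/s equals 1 byte/ns."""
    return nbytes / bw_gbs


def completion_ns(requests: list[tuple[int, float]]) -> float:
    """Parallel sub-transactions finish when the slowest one drains."""
    return max(drain_ns(nbytes, bw) for nbytes, bw in requests)


# 4 KB access, 8 channels at 32 GB/s each.
one_to_one = [(4096 // 8, 32.0)] * 8   # 8 per-channel requests of 512 B
n_to_one = [(4096, 8 * 32.0)]          # 1 aggregated request at 256 GB/s

t_one_to_one = completion_ns(one_to_one)
t_n_to_one = completion_ns(n_to_one)
```

Both modes drain in 16 ns under this model, so only the request shape differs, not the effective bandwidth.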

#### D-LA8. Kernel model preserved

- Kernel still issues single memory ops (`tl.load`, `tl.store`,
  `tl.composite`).
- LA is the address scheme exposed to kernel code.
- Channel decomposition / aggregation happens inside PE_DMA's BAAW.
- Kernel code never sees physical channel information.

#### Consequences (LA model, proposed)

Positive:

- 1:1 vs n:1 semantics live in one place (BAAW).
- Kernel abstraction preserved — no kernel code changes.
- Topology-based policy control (mode switch via yaml).
- Improved simulation-model consistency and debuggability.
- Segment-based mapping is simpler than page tables; lower overhead.

Negative:

- Full VA/MMU code refactor required.
- Request-generation path more complex (N requests in 1:1 mode).
- Reduced per-channel visibility in n:1 mode.
- VA-related tests need rewriting.

---

## Migration Path

- **PA → VA** was an extension. PA mode is retained as the PageFault
  fallback inside PE_DMA. Switching does not require removing PA
  code.
- **VA → LA**, if adopted, is a replacement, not coexistence. See
  D-LA1 for the VA infrastructure removal list. PA fallback inside
  PE_DMA may be retained orthogonally for tests.

## Alternatives Considered (LA model)

1. **Keep VA + fan-out in MMU**: MMU returns per-channel PAs.
   Rejected: MMU's role would grow beyond translation to request
   decomposition; aggregation (n:1) becomes awkward to express.
2. **Channel-aware kernel API**: kernels call per-channel load/store
   directly. Rejected: abstraction leakage, portability loss, all
   benchmarks need rewriting.
3. **Always PA (no LA)**: runtime passes per-channel PA to kernel
   directly. Rejected: incompatible with aggregation; conversion
   timing unclear; channel info leaks to kernel.

## Test Requirements

### VA model (current, regression)

- Cross-PE / cross-cube DMA paths over installed mappings.
- `MmuMapMsg` / `MmuUnmapMsg` fabric traversal with measured latency.
- TLB-overhead-per-access timing.
- PageFault fallback path preserves PA-only behaviour.

### LA model (when implemented)

- 1:1 mode: same logical access → N per-channel requests.
- n:1 mode: same logical access → 1 aggregated request.
- Bandwidth equivalence between modes for identical workload.
- 1:1 mode: per-channel contention modelled correctly.
- n:1 mode: aggregated bandwidth correctly reflected.
- Kernel code unchanged across mode switch.
- BAAW segment install / uninstall correctness.
- Multiple tensors in distinct segments do not collide.

## Implementation Order (LA, when scheduled)

1. LA allocator (`policy/address/la_allocator.py`).
2. BAAW segment table (`policy/address/baaw.py`).
3. `BaawSegmentInstallMsg` (`runtime_api/kernel.py`).
4. PE_DMA BAAW integration (`components/builtin/pe_dma.py`
   `handle_command()`).
5. RuntimeContext: LA alloc + segment install
   (`runtime_api/context.py`).
6. `Tensor.va_base` → `Tensor.la_base` (`runtime_api/tensor.py`).
7. Remove VA/MMU code.
8. Remove `pe_mmu` from `topology.yaml`; add mapping mode settings.
9. Test migration:

| Test file | Action |
|-----------|--------|
| `tests/test_mmu_component.py` | Remove; add BAAW segment install tests |
| `tests/test_mmu_fabric.py` | Remove; add BAAW + fabric integration tests |
| `tests/test_pe_mmu.py` | Remove |
| `tests/test_va_allocator.py` | Replace with LA allocator tests |
| `tests/test_va_integration.py` | Replace with LA + BAAW integration tests |
| `tests/test_va_offset.py` | Replace with LA offset tests |

## Links

- ADR-0007 (runtime_api vs sim_engine boundaries)
- ADR-0008 (tensor deployment)
- ADR-0009 (kernel execution)
- ADR-0014 (PE-internal execution model)
- ADR-0015 (component port/wire model)
- ADR-0019 (NOC + per-channel HBM connectivity — LA model topology
  consumer)
- SPEC R2 (latency by traversal), R10 (memory addressing)