kernbench2/docs/adr-history/ADR-0018-mem-logical-address.en.md

# ADR-0018: LA-Based Memory Address Abstraction and HBM Channel Mapping Mode Introduction

## Status

Merged into ADR-0011 (Address Model: LA section).

## Context

Kernbench simulates memory access between PE_DMA and Local-HBM within a CUBE.
Currently, a VA-based access path is used; however, the following two channel mapping models
are difficult to represent consistently.

### Background: Local-HBM Pseudo Channel Structure

The HBM in a CUBE consists of 32 or 64 pseudo channels.
In the PE-Local-HBM model, each PE is responsible for an equal number of pseudo channels.

Example: 64 pseudo channels, 8 PEs per cube -> each PE accesses 8 pseudo channels as local HBM

Both the number of pseudo channels and the number of PEs are topology parameters.
`N = hbm_pseudo_channels / pes_per_cube` (= channels_per_pe) determines
the number of local channels per PE.

The routing path BW between DMA and each pseudo channel matches the BW of each pseudo channel
(e.g., 32 GB/s), so if a PE sends simultaneous requests to N channels, it can utilize the
maximum memory BW.

### Limitations of the Current VA Model

When channels are divided into 8, requests must also be generated per channel and sent to DMA.
However, in the current architecture, the kernel generates requests with VA (`tl.load`)
and passes them directly to DMA, making it difficult for PE_CPU to generate per-channel DMA requests.

Therefore, instead of VA, we propose using **Logical Address (LA)**,
where the **BAAW (Logical-to-Physical Mapping Unit)** inside PE_DMA
converts LA to PA or a list of PAs based on segment-based mapping.

### Two Channel Mapping Modes

- **1:1 mode**: Creates and executes per-channel requests. Precise per-channel modeling.
- **n:1 mode (default)**: Assumes interleaving across local HBM channels. Aggregated BW modeling.

By supporting both modes, the overhead of the n:1 mode can be measured and evaluated.

### Core Requirements

- The effective bandwidth semantics of PE_DMA -> HBM_CTRL must be identical in both modes
- The difference must only be in the request representation and resource modeling approach
- The kernel programming model must not be changed
- Physical channel information must not be exposed to the kernel

### Existing Physical Address

The current system's 51-bit Physical Address is defined in `policy/address/phyaddr.py`:

```
[50:47] rack_id (4 bit)
[46:43] sip_id  (4 bit)
[42:38] cube_id (5 bit, sip_seg)
[37]    hbm_selector (1=HBM window)
[36:0]  hbm_offset   (37 bit, 128GB per cube)
```

PA is used to represent the final routable canonical physical destination,
and this role is preserved.
However, the timing and policy of logical access -> physical request conversion are not clearly separated.

---

## Decision

### D1. Introduction of LA (Logical Address) — Replacing VA

The existing VA (Virtual Address) infrastructure is replaced with LA (Logical Address).

#### Characteristics of LA

- Like VA, tensors can be mapped to a contiguous memory space
- Represents logical buffer + offset
- Does not directly contain physical channel information
- An intermediate abstraction maintained until physical resolution
- The sole address scheme used by kernel code (`tl.load`, `tl.store`, `tl.composite`)

#### LA Space Definition

| Item | Value |
|------|-------|
| LA start address | `0x1_0000_0000` (4 GB, preserving the existing VA start point) |
| LA space size | 64 GB per PE |
| Alignment unit | Segment-based (see D3 below) |

LA is a PE-local address space.
Even if different PEs use the same LA value, they resolve to different PAs
because each PE has a different BAAW segment table.

#### VA Infrastructure Removal Scope

With the introduction of LA, the following existing code will be replaced/removed:

| Removal Target | Replacement |
|----------------|-------------|
| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, name/role changed) |
| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
| `runtime_api/kernel.py`: MmuMapMsg, MmuUnmapMsg | Replaced with BaawSegmentInstallMsg |
| `runtime_api/context.py`: VA alloc + MMU mapping install | LA alloc + BAAW segment install |
| `runtime_api/tensor.py`: `va_base` field | `la_base` field |
| `topology.yaml`: pe_mmu component entry | Removed |

---

### D2. Mapping Mode Configuration

The mapping mode is configured at the cube level in topology.yaml:

```yaml
cube:
  memory_map:
    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
    hbm_pseudo_channels: 64       # total pseudo channel count
    hbm_channels_per_pe: 8        # local channel count per PE
    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth
```

This configuration is referenced during graph compilation (topology builder) and BAAW initialization.

---

### D3. Segments and BAAW

#### Segment Definition

A segment is a logical allocation unit that partitions the LA space so that each segment
maps to a specific HBM channel or channel group.

Segments are created by the runtime allocator during tensor deployment,
and BAAW uses them to convert LA into physical requests.

#### BAAW Segment Table Entry

```python
@dataclass
class BaawSegment:
    la_base: int          # segment start LA
    la_size: int          # segment size (bytes)
    mode: str             # "one_to_one" | "n_to_one"
    # 1:1 mode fields
    channel_count: int    # number of channels assigned to this segment (e.g., 8)
    pa_bases: list[int]   # per-channel PA start address list (len = channel_count)
    channel_ids: list[int]  # per-channel logical IDs (e.g., [0,1,2,...,7])
    channel_size: int     # per-channel size (la_size // channel_count)
    # n:1 mode fields
    agg_pa_base: int      # aggregated PA start address
    agg_node_id: str      # aggregated router node_id (for routing)
```

#### Segment Lifecycle

1. **Allocation time** (tensor deploy):
   - RuntimeContext allocates LA space from the LA allocator
   - PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1)
   - Sends `BaawSegmentInstallMsg` to PE_DMA to register in the segment table

2. **Usage time** (kernel execution):
   - Kernel issues `tl.load(la_ptr)` -> DmaReadCmd(src_addr=LA)
   - PE_DMA looks up the segment corresponding to the LA in BAAW
   - Converts to PA(s) according to the mode

3. **Deallocation time** (tensor free):
   - Removed from the segment table
   - LA space returned, PA deallocated

---

### D4. BAAW (Logical-to-Physical Mapping Unit)

#### Location

BAAW is placed as a front-end stage inside PE_DMA.
It is not a separate SimPy component; it is synchronous address resolution logic
executed at the beginning of PE_DMA's `handle_command()`.

#### Input

- LA (Logical Address) — DmaReadCmd.src_addr or DmaWriteCmd.dst_addr
- access size (bytes)

#### Output

- 1:1 mode: `list[PhysicalRequest]` — each request is (PA, nbytes, channel_node_id)
- n:1 mode: 1 `PhysicalRequest` — (agg_PA, nbytes, agg_node_id)

```python
@dataclass
class PhysicalRequest:
    pa: int           # 51-bit Physical Address
    nbytes: int       # transfer size for this request
    dst_node: str     # target node_id (channel router or aggregated router)
```

#### BAAW Resolve Logic

```python
def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
    offset = la - seg.la_base

    if seg.mode == "n_to_one":
        pa = seg.agg_pa_base + offset
        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]

    elif seg.mode == "one_to_one":
        requests = []
        per_ch_size = seg.channel_size
        for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)):
            ch_offset = offset % per_ch_size  # interleaved or striped
            ch_nbytes = nbytes // seg.channel_count
            pa = pa_base + ch_offset
            dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
            requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
        return requests
```

#### Scope of Responsibility

BAAW is responsible for:
- Converting logical accesses into physical request units
- Performing fan-out (1:1) or pass-through (n:1) according to the mapping mode
- Generating Physical Addresses and determining target nodes

BAAW is NOT responsible for:
- Performing actual data movement
- Executing NOC routing
- Simulating bandwidth consumption (this is the role of downstream components)

#### Output Contract

The output of BAAW must be request units that can be directly used by the simulator's
routing and resource model without any additional address decoding.

---

### D5. PE_DMA handle_command() Changes

#### Current Flow (VA-based)

```
DmaReadCmd.src_addr (VA)
  -> MMU.translate(VA) -> PA
  -> PhysAddr.decode(PA) -> PhysAddr object
  -> resolver.resolve(PhysAddr) -> dst_node_id (e.g., "sip0.cube0.hbm_ctrl")
  -> router.find_path(pe_prefix, dst_node_id) -> path
  -> 1 sub-Transaction created -> fabric inject
```

#### New Flow (LA-based)

```
DmaReadCmd.src_addr (LA)
  -> BAAW.resolve(LA, nbytes) -> list[PhysicalRequest]
  -> For each PhysicalRequest:
      -> router.find_path(pe_prefix, req.dst_node) -> path
      -> compute_drain_ns(path, req.nbytes) -> drain
      -> sub-Transaction created -> fabric inject
  -> Wait for all sub-Transactions to complete
  -> pe_txn.done.succeed()
```

Key changes:
- MMU reference removed -> replaced with BAAW resolve
- PhysAddr.decode() + resolver.resolve() -> BAAW directly returns dst_node
- 1 request -> N requests injected in parallel (1:1 mode)

---

### D6. 1:1 Mode Details

- One logical access -> N (= `channels_per_pe`) physical requests
- N is a parameter determined by `hbm_pseudo_channels / pes_per_cube`
- Each request:
  - Fully resolved 51-bit PA
  - Targets a specific channel router (`{pe_prefix}.ch_r{channel_id}`)
- BW contention modeling via per-channel links
- PE_DMA injects N sub-transactions simultaneously

#### 1:1 Mode Example

Configuration: `hbm_pseudo_channels=64`, `pes_per_cube=8`
-> `channels_per_pe=8`, PE0 owns ch0-7

```text
Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
    la_base: 0x1_0000_0000, la_size: 4096,
    mode: "one_to_one", channel_count: 8,  # = channels_per_pe
    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
    channel_size: 512,  # = la_size / channel_count
}

BAAW resolve result (N=8 requests):
  -> PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
  -> PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
  -> ...
  -> PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")

PE_DMA: N sub-transactions injected in parallel
  Each accesses HBM via channel router -> hbm_ctrl link (channel_bw_gbs)
  Total effective BW = N x channel_bw_gbs
```

Examples with different N values:
- `hbm_pseudo_channels=32`, `pes_per_cube=8` -> `channels_per_pe=4`, 4 requests
- `hbm_pseudo_channels=64`, `pes_per_cube=4` -> `channels_per_pe=16`, 16 requests

---

### D7. n:1 Mode Details

- One logical access -> one aggregated request
- Target: aggregated router -> hbm_ctrl (see ADR-0019)
- Aggregated link BW = `channels_per_pe` x `channel_bw_gbs` (e.g., 8 x 32 = 256 GB/s)
- Modeled as a single queue / resource
- No per-channel PA decomposition

#### n:1 Mode Example

```
Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
    la_base: 0x1_0000_0000, la_size: 4096,
    mode: "n_to_one",
    agg_pa_base: PA_agg,
    agg_node_id: "sip0.cube0.pe0.agg_router",
}

BAAW resolve result:
  -> PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")

PE_DMA: 1 sub-transaction injected
  Accesses HBM via aggregated router -> hbm_ctrl link (256 GB/s)
```

---

### D8. Kernel Model Preservation

- The kernel still issues only single memory ops (`tl.load`, `tl.store`, `tl.composite`)
- LA is the address scheme passed to the kernel
- Channel decomposition/aggregation is performed by BAAW inside PE_DMA
- Physical channel information is not exposed to kernel code

---

## Consequences

### Positive

- 1:1 vs n:1 semantics are clearly separated at a single point: BAAW
- Kernel abstraction is preserved — no kernel code changes required
- Topology-based policy control is possible (mode switching via yaml)
- Improved simulation model consistency and debuggability
- Segment-based mapping is simpler and has lower overhead compared to page tables

### Negative

- Full refactoring of VA/MMU-based code is required
- Increased complexity in the request generation path (managing N requests in 1:1 mode)
- Reduced per-channel visibility in n:1 mode
- Existing VA-related tests must be rewritten

---

## Alternatives

### A1. Keep VA + Fan-out at MMU

- Extend MMU to return per-channel PAs
- Problem: MMU's role expands beyond address translation to include request decomposition
- Problem: Aggregation representation is difficult in n:1 mode

### A2. Kernel Generates Channel-Aware Requests

- Kernel directly calls per-channel load/store
- Problem: Abstraction leakage, reduced portability
- Problem: All benchmark code must be modified

### A3. Always Use PA (Without LA)

- Runtime directly passes per-channel PA to the kernel
- Problem: Conflicts with the aggregation model
- Problem: Conversion timing is unclear, channel information exposed to kernel

---

## Implementation Notes

### Implementation Order

1. Introduce LA type (`policy/address/la_allocator.py`)
2. Implement BAAW segment table (`policy/address/baaw.py`)
3. Add `BaawSegmentInstallMsg` message type (`runtime_api/kernel.py`)
4. Integrate BAAW into PE_DMA (`components/builtin/pe_dma.py` handle_command changes)
5. Modify RuntimeContext: LA alloc + segment install (`runtime_api/context.py`)
6. Change Tensor.va_base -> la_base (`runtime_api/tensor.py`)
7. Remove VA/MMU code
8. Remove pe_mmu from topology.yaml, add mapping mode configuration
9. Test migration

### Affected Existing Tests

| Test File | Impact |
|-----------|--------|
| `tests/test_mmu_component.py` | Remove -> replace with BAAW segment install test |
| `tests/test_mmu_fabric.py` | Remove -> replace with BAAW + fabric integration test |
| `tests/test_pe_mmu.py` | Remove |
| `tests/test_va_allocator.py` | Replace with LA allocator test |
| `tests/test_va_integration.py` | Replace with LA + BAAW integration test |
| `tests/test_va_offset.py` | Replace with LA offset test |

---

## Test Requirements

- For the same logical access:
  - 1:1 -> verify N requests are generated
  - n:1 -> verify 1 aggregated request is generated
- Verify effective bandwidth consistency across both modes
- 1:1 -> verify per-channel contention modeling
- n:1 -> verify aggregated bandwidth is reflected
- Verify operation without kernel code changes
- Verify correct BAAW segment install/uninstall operation
- Verify no conflicts when multiple tensors are assigned to different segments

---

## Links

- ADR-0011 (Memory Addressing Simplification — PA-first, VA/MMU introduction) -> superseded by this ADR
- ADR-0019 (NOC Per-Channel HBM Connection Model) -> topology-side integration
- ADR-0014 (PE Internal Execution Model) -> PE_DMA change impact