Files
kernbench2/docs/adr-history/ADR-0018-mem-logical-address.en.md
ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00

442 lines
15 KiB
Markdown

# ADR-0018: LA-Based Memory Address Abstraction and HBM Channel Mapping Mode Introduction
## Status
Merged into ADR-0011 (Address Model: LA section).
## Context
Kernbench simulates memory access between PE_DMA and Local-HBM within a CUBE.
Currently, a VA-based access path is used; however, the following two channel mapping models
are difficult to represent consistently.
### Background: Local-HBM Pseudo Channel Structure
The HBM in a CUBE consists of 32 or 64 pseudo channels.
In the PE-Local-HBM model, each PE is responsible for an equal number of pseudo channels.
Example: 64 pseudo channels, 8 PEs per cube -> each PE accesses 8 pseudo channels as local HBM
Both the number of pseudo channels and the number of PEs are topology parameters.
`N = hbm_pseudo_channels / pes_per_cube` (= channels_per_pe) determines
the number of local channels per PE.
The routing path BW between DMA and each pseudo channel matches the BW of each pseudo channel
(e.g., 32 GB/s), so if a PE sends simultaneous requests to N channels, it can utilize the
maximum memory BW.
### Limitations of the Current VA Model
When channels are divided into 8, requests must also be generated per channel and sent to DMA.
However, in the current architecture, the kernel generates requests with VA (`tl.load`)
and passes them directly to DMA, making it difficult for PE_CPU to generate per-channel DMA requests.
Therefore, instead of VA, we propose using **Logical Address (LA)**,
where the **BAAW (Logical-to-Physical Mapping Unit)** inside PE_DMA
converts LA to PA or a list of PAs based on segment-based mapping.
### Two Channel Mapping Modes
- **1:1 mode**: Creates and executes per-channel requests. Precise per-channel modeling.
- **n:1 mode (default)**: Assumes interleaving across local HBM channels. Aggregated BW modeling.
By supporting both modes, the overhead of the n:1 mode can be measured and evaluated.
### Core Requirements
- The effective bandwidth semantics of PE_DMA -> HBM_CTRL must be identical in both modes
- The difference must only be in the request representation and resource modeling approach
- The kernel programming model must not be changed
- Physical channel information must not be exposed to the kernel
### Existing Physical Address
The current system's 51-bit Physical Address is defined in `policy/address/phyaddr.py`:
```
[50:47] rack_id (4 bit)
[46:43] sip_id (4 bit)
[42:38] cube_id (5 bit, sip_seg)
[37] hbm_selector (1=HBM window)
[36:0] hbm_offset (37 bit, 128GB per cube)
```
PA is used to represent the final routable canonical physical destination,
and this role is preserved.
However, the timing and policy of logical access -> physical request conversion are not clearly separated.
---
## Decision
### D1. Introduction of LA (Logical Address) — Replacing VA
The existing VA (Virtual Address) infrastructure is replaced with LA (Logical Address).
#### Characteristics of LA
- Like VA, tensors can be mapped to a contiguous memory space
- Represents logical buffer + offset
- Does not directly contain physical channel information
- An intermediate abstraction maintained until physical resolution
- The sole address scheme used by kernel code (`tl.load`, `tl.store`, `tl.composite`)
#### LA Space Definition
| Item | Value |
|------|-------|
| LA start address | `0x1_0000_0000` (4 GB, preserving the existing VA start point) |
| LA space size | 64 GB per PE |
| Alignment unit | Segment-based (see D3 below) |
LA is a PE-local address space.
Even if different PEs use the same LA value, they resolve to different PAs
because each PE has a different BAAW segment table.
#### VA Infrastructure Removal Scope
With the introduction of LA, the following existing code will be replaced/removed:
| Removal Target | Replacement |
|----------------|-------------|
| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, name/role changed) |
| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
| `runtime_api/kernel.py`: MmuMapMsg, MmuUnmapMsg | Replaced with BaawSegmentInstallMsg |
| `runtime_api/context.py`: VA alloc + MMU mapping install | LA alloc + BAAW segment install |
| `runtime_api/tensor.py`: `va_base` field | `la_base` field |
| `topology.yaml`: pe_mmu component entry | Removed |
---
### D2. Mapping Mode Configuration
The mapping mode is configured at the cube level in topology.yaml:
```yaml
cube:
memory_map:
hbm_mapping_mode: n_to_one # one_to_one | n_to_one
hbm_pseudo_channels: 64 # total pseudo channel count
hbm_channels_per_pe: 8 # local channel count per PE
hbm_channel_bw_gbs: 32.0 # per-channel bandwidth
```
This configuration is referenced during graph compilation (topology builder) and BAAW initialization.
---
### D3. Segments and BAAW
#### Segment Definition
A segment is a logical allocation unit that partitions the LA space so that each segment
maps to a specific HBM channel or channel group.
Segments are created by the runtime allocator during tensor deployment,
and BAAW uses them to convert LA into physical requests.
#### BAAW Segment Table Entry
```python
@dataclass
class BaawSegment:
la_base: int # segment start LA
la_size: int # segment size (bytes)
mode: str # "one_to_one" | "n_to_one"
# 1:1 mode fields
channel_count: int # number of channels assigned to this segment (e.g., 8)
pa_bases: list[int] # per-channel PA start address list (len = channel_count)
channel_ids: list[int] # per-channel logical IDs (e.g., [0,1,2,...,7])
channel_size: int # per-channel size (la_size // channel_count)
# n:1 mode fields
agg_pa_base: int # aggregated PA start address
agg_node_id: str # aggregated router node_id (for routing)
```
#### Segment Lifecycle
1. **Allocation time** (tensor deploy):
- RuntimeContext allocates LA space from the LA allocator
- PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1)
- Sends `BaawSegmentInstallMsg` to PE_DMA to register in the segment table
2. **Usage time** (kernel execution):
- Kernel issues `tl.load(la_ptr)` -> DmaReadCmd(src_addr=LA)
- PE_DMA looks up the segment corresponding to the LA in BAAW
- Converts to PA(s) according to the mode
3. **Deallocation time** (tensor free):
- Removed from the segment table
- LA space returned, PA deallocated
---
### D4. BAAW (Logical-to-Physical Mapping Unit)
#### Location
BAAW is placed as a front-end stage inside PE_DMA.
It is not a separate SimPy component; it is synchronous address resolution logic
executed at the beginning of PE_DMA's `handle_command()`.
#### Input
- LA (Logical Address) — DmaReadCmd.src_addr or DmaWriteCmd.dst_addr
- access size (bytes)
#### Output
- 1:1 mode: `list[PhysicalRequest]` — each request is (PA, nbytes, channel_node_id)
- n:1 mode: 1 `PhysicalRequest` — (agg_PA, nbytes, agg_node_id)
```python
@dataclass
class PhysicalRequest:
pa: int # 51-bit Physical Address
nbytes: int # transfer size for this request
dst_node: str # target node_id (channel router or aggregated router)
```
#### BAAW Resolve Logic
```python
def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
seg = self._find_segment(la) # la_base <= la < la_base + la_size
offset = la - seg.la_base
if seg.mode == "n_to_one":
pa = seg.agg_pa_base + offset
return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
elif seg.mode == "one_to_one":
requests = []
per_ch_size = seg.channel_size
for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)):
ch_offset = offset % per_ch_size # interleaved or striped
ch_nbytes = nbytes // seg.channel_count
pa = pa_base + ch_offset
dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
return requests
```
#### Scope of Responsibility
BAAW is responsible for:
- Converting logical accesses into physical request units
- Performing fan-out (1:1) or pass-through (n:1) according to the mapping mode
- Generating Physical Addresses and determining target nodes
BAAW is NOT responsible for:
- Performing actual data movement
- Executing NOC routing
- Simulating bandwidth consumption (this is the role of downstream components)
#### Output Contract
The output of BAAW must be request units that can be directly used by the simulator's
routing and resource model without any additional address decoding.
---
### D5. PE_DMA handle_command() Changes
#### Current Flow (VA-based)
```
DmaReadCmd.src_addr (VA)
-> MMU.translate(VA) -> PA
-> PhysAddr.decode(PA) -> PhysAddr object
-> resolver.resolve(PhysAddr) -> dst_node_id (e.g., "sip0.cube0.hbm_ctrl")
-> router.find_path(pe_prefix, dst_node_id) -> path
-> 1 sub-Transaction created -> fabric inject
```
#### New Flow (LA-based)
```
DmaReadCmd.src_addr (LA)
-> BAAW.resolve(LA, nbytes) -> list[PhysicalRequest]
-> For each PhysicalRequest:
-> router.find_path(pe_prefix, req.dst_node) -> path
-> compute_drain_ns(path, req.nbytes) -> drain
-> sub-Transaction created -> fabric inject
-> Wait for all sub-Transactions to complete
-> pe_txn.done.succeed()
```
Key changes:
- MMU reference removed -> replaced with BAAW resolve
- PhysAddr.decode() + resolver.resolve() -> BAAW directly returns dst_node
- 1 request -> N requests injected in parallel (1:1 mode)
---
### D6. 1:1 Mode Details
- One logical access -> N (= `channels_per_pe`) physical requests
- N is a parameter determined by `hbm_pseudo_channels / pes_per_cube`
- Each request:
- Fully resolved 51-bit PA
- Targets a specific channel router (`{pe_prefix}.ch_r{channel_id}`)
- BW contention modeling via per-channel links
- PE_DMA injects N sub-transactions simultaneously
#### 1:1 Mode Example
Configuration: `hbm_pseudo_channels=64`, `pes_per_cube=8`
-> `channels_per_pe=8`, PE0 owns ch0-7
```text
Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
la_base: 0x1_0000_0000, la_size: 4096,
mode: "one_to_one", channel_count: 8, # = channels_per_pe
pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
channel_size: 512, # = la_size / channel_count
}
BAAW resolve result (N=8 requests):
-> PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
-> PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
-> ...
-> PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
PE_DMA: N sub-transactions injected in parallel
Each accesses HBM via channel router -> hbm_ctrl link (channel_bw_gbs)
Total effective BW = N x channel_bw_gbs
```
Examples with different N values:
- `hbm_pseudo_channels=32`, `pes_per_cube=8` -> `channels_per_pe=4`, 4 requests
- `hbm_pseudo_channels=64`, `pes_per_cube=4` -> `channels_per_pe=16`, 16 requests
---
### D7. n:1 Mode Details
- One logical access -> one aggregated request
- Target: aggregated router -> hbm_ctrl (see ADR-0019)
- Aggregated link BW = `channels_per_pe` x `channel_bw_gbs` (e.g., 8 x 32 = 256 GB/s)
- Modeled as a single queue / resource
- No per-channel PA decomposition
#### n:1 Mode Example
```
Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
la_base: 0x1_0000_0000, la_size: 4096,
mode: "n_to_one",
agg_pa_base: PA_agg,
agg_node_id: "sip0.cube0.pe0.agg_router",
}
BAAW resolve result:
-> PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
PE_DMA: 1 sub-transaction injected
Accesses HBM via aggregated router -> hbm_ctrl link (256 GB/s)
```
---
### D8. Kernel Model Preservation
- The kernel still issues only single memory ops (`tl.load`, `tl.store`, `tl.composite`)
- LA is the address scheme passed to the kernel
- Channel decomposition/aggregation is performed by BAAW inside PE_DMA
- Physical channel information is not exposed to kernel code
---
## Consequences
### Positive
- 1:1 vs n:1 semantics are clearly separated at a single point: BAAW
- Kernel abstraction is preserved — no kernel code changes required
- Topology-based policy control is possible (mode switching via yaml)
- Improved simulation model consistency and debuggability
- Segment-based mapping is simpler and has lower overhead compared to page tables
### Negative
- Full refactoring of VA/MMU-based code is required
- Increased complexity in the request generation path (managing N requests in 1:1 mode)
- Reduced per-channel visibility in n:1 mode
- Existing VA-related tests must be rewritten
---
## Alternatives
### A1. Keep VA + Fan-out at MMU
- Extend MMU to return per-channel PAs
- Problem: MMU's role expands beyond address translation to include request decomposition
- Problem: Aggregation representation is difficult in n:1 mode
### A2. Kernel Generates Channel-Aware Requests
- Kernel directly calls per-channel load/store
- Problem: Abstraction leakage, reduced portability
- Problem: All benchmark code must be modified
### A3. Always Use PA (Without LA)
- Runtime directly passes per-channel PA to the kernel
- Problem: Conflicts with the aggregation model
- Problem: Conversion timing is unclear, channel information exposed to kernel
---
## Implementation Notes
### Implementation Order
1. Introduce LA type (`policy/address/la_allocator.py`)
2. Implement BAAW segment table (`policy/address/baaw.py`)
3. Add `BaawSegmentInstallMsg` message type (`runtime_api/kernel.py`)
4. Integrate BAAW into PE_DMA (`components/builtin/pe_dma.py` handle_command changes)
5. Modify RuntimeContext: LA alloc + segment install (`runtime_api/context.py`)
6. Change Tensor.va_base -> la_base (`runtime_api/tensor.py`)
7. Remove VA/MMU code
8. Remove pe_mmu from topology.yaml, add mapping mode configuration
9. Test migration
### Affected Existing Tests
| Test File | Impact |
|-----------|--------|
| `tests/test_mmu_component.py` | Remove -> replace with BAAW segment install test |
| `tests/test_mmu_fabric.py` | Remove -> replace with BAAW + fabric integration test |
| `tests/test_pe_mmu.py` | Remove |
| `tests/test_va_allocator.py` | Replace with LA allocator test |
| `tests/test_va_integration.py` | Replace with LA + BAAW integration test |
| `tests/test_va_offset.py` | Replace with LA offset test |
---
## Test Requirements
- For the same logical access:
- 1:1 -> verify N requests are generated
- n:1 -> verify 1 aggregated request is generated
- Verify effective bandwidth consistency across both modes
- 1:1 -> verify per-channel contention modeling
- n:1 -> verify aggregated bandwidth is reflected
- Verify operation without kernel code changes
- Verify correct BAAW segment install/uninstall operation
- Verify no conflicts when multiple tensors are assigned to different segments
---
## Links
- ADR-0011 (Memory Addressing Simplification — PA-first, VA/MMU introduction) -> superseded by this ADR
- ADR-0019 (NOC Per-Channel HBM Connection Model) -> topology-side integration
- ADR-0014 (PE Internal Execution Model) -> PE_DMA change impact