Add English translations for ADR-0018, 0019, 0020, 0021
- ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping - ADR-0019: CUBE NOC per-channel and aggregated HBM connection model - ADR-0020: 2-pass data execution model (timing/data separation, greenlet) - ADR-0021: PE pipeline refactor (component separation + token self-routing) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,441 @@
|
||||
# ADR-0018: LA-Based Memory Address Abstraction and HBM Channel Mapping Mode Introduction
|
||||
|
||||
## Status
|
||||
|
||||
Proposed
|
||||
|
||||
## Context
|
||||
|
||||
Kernbench simulates memory access between PE_DMA and Local-HBM within a CUBE.
|
||||
Currently, a VA-based access path is used; however, the following two channel mapping models
|
||||
are difficult to represent consistently.
|
||||
|
||||
### Background: Local-HBM Pseudo Channel Structure
|
||||
|
||||
The HBM in a CUBE consists of 32 or 64 pseudo channels.
|
||||
In the PE-Local-HBM model, each PE is responsible for an equal number of pseudo channels.
|
||||
|
||||
Example: 64 pseudo channels, 8 PEs per cube -> each PE accesses 8 pseudo channels as local HBM
|
||||
|
||||
Both the number of pseudo channels and the number of PEs are topology parameters.
|
||||
`N = hbm_pseudo_channels / pes_per_cube` (= channels_per_pe) determines
|
||||
the number of local channels per PE.
|
||||
|
||||
The routing path BW between DMA and each pseudo channel matches the BW of each pseudo channel
|
||||
(e.g., 32 GB/s), so if a PE sends simultaneous requests to N channels, it can utilize the
|
||||
maximum memory BW.
|
||||
|
||||
### Limitations of the Current VA Model
|
||||
|
||||
When channels are divided into 8, requests must also be generated per channel and sent to DMA.
|
||||
However, in the current architecture, the kernel generates requests with VA (`tl.load`)
|
||||
and passes them directly to DMA, making it difficult for PE_CPU to generate per-channel DMA requests.
|
||||
|
||||
Therefore, instead of VA, we propose using **Logical Address (LA)**,
|
||||
where the **BAAW (Logical-to-Physical Mapping Unit)** inside PE_DMA
|
||||
converts LA to PA or a list of PAs based on segment-based mapping.
|
||||
|
||||
### Two Channel Mapping Modes
|
||||
|
||||
- **1:1 mode**: Creates and executes per-channel requests. Precise per-channel modeling.
|
||||
- **n:1 mode (default)**: Assumes interleaving across local HBM channels. Aggregated BW modeling.
|
||||
|
||||
By supporting both modes, the overhead of the n:1 mode can be measured and evaluated.
|
||||
|
||||
### Core Requirements
|
||||
|
||||
- The effective bandwidth semantics of PE_DMA -> HBM_CTRL must be identical in both modes
|
||||
- The difference must only be in the request representation and resource modeling approach
|
||||
- The kernel programming model must not be changed
|
||||
- Physical channel information must not be exposed to the kernel
|
||||
|
||||
### Existing Physical Address
|
||||
|
||||
The current system's 51-bit Physical Address is defined in `policy/address/phyaddr.py`:
|
||||
|
||||
```
|
||||
[50:47] rack_id (4 bit)
|
||||
[46:43] sip_id (4 bit)
|
||||
[42:38] cube_id (5 bit, sip_seg)
|
||||
[37] hbm_selector (1=HBM window)
|
||||
[36:0] hbm_offset (37 bit, 128GB per cube)
|
||||
```
|
||||
|
||||
PA is used to represent the final routable canonical physical destination,
|
||||
and this role is preserved.
|
||||
However, the timing and policy of logical access -> physical request conversion are not clearly separated.
|
||||
|
||||
---
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Introduction of LA (Logical Address) — Replacing VA
|
||||
|
||||
The existing VA (Virtual Address) infrastructure is replaced with LA (Logical Address).
|
||||
|
||||
#### Characteristics of LA
|
||||
|
||||
- Like VA, tensors can be mapped to a contiguous memory space
|
||||
- Represents logical buffer + offset
|
||||
- Does not directly contain physical channel information
|
||||
- An intermediate abstraction maintained until physical resolution
|
||||
- The sole address scheme used by kernel code (`tl.load`, `tl.store`, `tl.composite`)
|
||||
|
||||
#### LA Space Definition
|
||||
|
||||
| Item | Value |
|
||||
|------|-------|
|
||||
| LA start address | `0x1_0000_0000` (4 GB, preserving the existing VA start point) |
|
||||
| LA space size | 64 GB per PE |
|
||||
| Alignment unit | Segment-based (see D3 below) |
|
||||
|
||||
LA is a PE-local address space.
|
||||
Even if different PEs use the same LA value, they resolve to different PAs
|
||||
because each PE has a different BAAW segment table.
|
||||
|
||||
#### VA Infrastructure Removal Scope
|
||||
|
||||
With the introduction of LA, the following existing code will be replaced/removed:
|
||||
|
||||
| Removal Target | Replacement |
|
||||
|----------------|-------------|
|
||||
| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, name/role changed) |
|
||||
| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
|
||||
| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
|
||||
| `runtime_api/kernel.py`: MmuMapMsg, MmuUnmapMsg | Replaced with BaawSegmentInstallMsg |
|
||||
| `runtime_api/context.py`: VA alloc + MMU mapping install | LA alloc + BAAW segment install |
|
||||
| `runtime_api/tensor.py`: `va_base` field | `la_base` field |
|
||||
| `topology.yaml`: pe_mmu component entry | Removed |
|
||||
|
||||
---
|
||||
|
||||
### D2. Mapping Mode Configuration
|
||||
|
||||
The mapping mode is configured at the cube level in topology.yaml:
|
||||
|
||||
```yaml
|
||||
cube:
|
||||
memory_map:
|
||||
hbm_mapping_mode: n_to_one # one_to_one | n_to_one
|
||||
hbm_pseudo_channels: 64 # total pseudo channel count
|
||||
hbm_channels_per_pe: 8 # local channel count per PE
|
||||
hbm_channel_bw_gbs: 32.0 # per-channel bandwidth
|
||||
```
|
||||
|
||||
This configuration is referenced during graph compilation (topology builder) and BAAW initialization.
|
||||
|
||||
---
|
||||
|
||||
### D3. Segments and BAAW
|
||||
|
||||
#### Segment Definition
|
||||
|
||||
A segment is a logical allocation unit that partitions the LA space so that each segment
|
||||
maps to a specific HBM channel or channel group.
|
||||
|
||||
Segments are created by the runtime allocator during tensor deployment,
|
||||
and BAAW uses them to convert LA into physical requests.
|
||||
|
||||
#### BAAW Segment Table Entry
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class BaawSegment:
|
||||
la_base: int # segment start LA
|
||||
la_size: int # segment size (bytes)
|
||||
mode: str # "one_to_one" | "n_to_one"
|
||||
# 1:1 mode fields
|
||||
channel_count: int # number of channels assigned to this segment (e.g., 8)
|
||||
pa_bases: list[int] # per-channel PA start address list (len = channel_count)
|
||||
channel_ids: list[int] # per-channel logical IDs (e.g., [0,1,2,...,7])
|
||||
channel_size: int # per-channel size (la_size // channel_count)
|
||||
# n:1 mode fields
|
||||
agg_pa_base: int # aggregated PA start address
|
||||
agg_node_id: str # aggregated router node_id (for routing)
|
||||
```
|
||||
|
||||
#### Segment Lifecycle
|
||||
|
||||
1. **Allocation time** (tensor deploy):
|
||||
- RuntimeContext allocates LA space from the LA allocator
|
||||
- PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1)
|
||||
- Sends `BaawSegmentInstallMsg` to PE_DMA to register in the segment table
|
||||
|
||||
2. **Usage time** (kernel execution):
|
||||
- Kernel issues `tl.load(la_ptr)` -> DmaReadCmd(src_addr=LA)
|
||||
- PE_DMA looks up the segment corresponding to the LA in BAAW
|
||||
- Converts to PA(s) according to the mode
|
||||
|
||||
3. **Deallocation time** (tensor free):
|
||||
- Removed from the segment table
|
||||
- LA space returned, PA deallocated
|
||||
|
||||
---
|
||||
|
||||
### D4. BAAW (Logical-to-Physical Mapping Unit)
|
||||
|
||||
#### Location
|
||||
|
||||
BAAW is placed as a front-end stage inside PE_DMA.
|
||||
It is not a separate SimPy component; it is synchronous address resolution logic
|
||||
executed at the beginning of PE_DMA's `handle_command()`.
|
||||
|
||||
#### Input
|
||||
|
||||
- LA (Logical Address) — DmaReadCmd.src_addr or DmaWriteCmd.dst_addr
|
||||
- access size (bytes)
|
||||
|
||||
#### Output
|
||||
|
||||
- 1:1 mode: `list[PhysicalRequest]` — each request is (PA, nbytes, channel_node_id)
|
||||
- n:1 mode: 1 `PhysicalRequest` — (agg_PA, nbytes, agg_node_id)
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class PhysicalRequest:
|
||||
pa: int # 51-bit Physical Address
|
||||
nbytes: int # transfer size for this request
|
||||
dst_node: str # target node_id (channel router or aggregated router)
|
||||
```
|
||||
|
||||
#### BAAW Resolve Logic
|
||||
|
||||
```python
|
||||
def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
|
||||
seg = self._find_segment(la) # la_base <= la < la_base + la_size
|
||||
offset = la - seg.la_base
|
||||
|
||||
if seg.mode == "n_to_one":
|
||||
pa = seg.agg_pa_base + offset
|
||||
return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
|
||||
|
||||
elif seg.mode == "one_to_one":
|
||||
requests = []
|
||||
per_ch_size = seg.channel_size
|
||||
for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)):
|
||||
ch_offset = offset % per_ch_size # interleaved or striped
|
||||
ch_nbytes = nbytes // seg.channel_count
|
||||
pa = pa_base + ch_offset
|
||||
dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
|
||||
requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
|
||||
return requests
|
||||
```
|
||||
|
||||
#### Scope of Responsibility
|
||||
|
||||
BAAW is responsible for:
|
||||
- Converting logical accesses into physical request units
|
||||
- Performing fan-out (1:1) or pass-through (n:1) according to the mapping mode
|
||||
- Generating Physical Addresses and determining target nodes
|
||||
|
||||
BAAW is NOT responsible for:
|
||||
- Performing actual data movement
|
||||
- Executing NOC routing
|
||||
- Simulating bandwidth consumption (this is the role of downstream components)
|
||||
|
||||
#### Output Contract
|
||||
|
||||
The output of BAAW must be request units that can be directly used by the simulator's
|
||||
routing and resource model without any additional address decoding.
|
||||
|
||||
---
|
||||
|
||||
### D5. PE_DMA handle_command() Changes
|
||||
|
||||
#### Current Flow (VA-based)
|
||||
|
||||
```
|
||||
DmaReadCmd.src_addr (VA)
|
||||
-> MMU.translate(VA) -> PA
|
||||
-> PhysAddr.decode(PA) -> PhysAddr object
|
||||
-> resolver.resolve(PhysAddr) -> dst_node_id (e.g., "sip0.cube0.hbm_ctrl")
|
||||
-> router.find_path(pe_prefix, dst_node_id) -> path
|
||||
-> 1 sub-Transaction created -> fabric inject
|
||||
```
|
||||
|
||||
#### New Flow (LA-based)
|
||||
|
||||
```
|
||||
DmaReadCmd.src_addr (LA)
|
||||
-> BAAW.resolve(LA, nbytes) -> list[PhysicalRequest]
|
||||
-> For each PhysicalRequest:
|
||||
-> router.find_path(pe_prefix, req.dst_node) -> path
|
||||
-> compute_drain_ns(path, req.nbytes) -> drain
|
||||
-> sub-Transaction created -> fabric inject
|
||||
-> Wait for all sub-Transactions to complete
|
||||
-> pe_txn.done.succeed()
|
||||
```
|
||||
|
||||
Key changes:
|
||||
- MMU reference removed -> replaced with BAAW resolve
|
||||
- PhysAddr.decode() + resolver.resolve() -> BAAW directly returns dst_node
|
||||
- 1 request -> N requests injected in parallel (1:1 mode)
|
||||
|
||||
---
|
||||
|
||||
### D6. 1:1 Mode Details
|
||||
|
||||
- One logical access -> N (= `channels_per_pe`) physical requests
|
||||
- N is a parameter determined by `hbm_pseudo_channels / pes_per_cube`
|
||||
- Each request:
|
||||
- Fully resolved 51-bit PA
|
||||
- Targets a specific channel router (`{pe_prefix}.ch_r{channel_id}`)
|
||||
- BW contention modeling via per-channel links
|
||||
- PE_DMA injects N sub-transactions simultaneously
|
||||
|
||||
#### 1:1 Mode Example
|
||||
|
||||
Configuration: `hbm_pseudo_channels=64`, `pes_per_cube=8`
|
||||
-> `channels_per_pe=8`, PE0 owns ch0-7
|
||||
|
||||
```text
|
||||
Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
|
||||
BAAW segment: {
|
||||
la_base: 0x1_0000_0000, la_size: 4096,
|
||||
mode: "one_to_one", channel_count: 8, # = channels_per_pe
|
||||
pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
|
||||
channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
|
||||
channel_size: 512, # = la_size / channel_count
|
||||
}
|
||||
|
||||
BAAW resolve result (N=8 requests):
|
||||
-> PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
|
||||
-> PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
|
||||
-> ...
|
||||
-> PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
|
||||
|
||||
PE_DMA: N sub-transactions injected in parallel
|
||||
Each accesses HBM via channel router -> hbm_ctrl link (channel_bw_gbs)
|
||||
Total effective BW = N x channel_bw_gbs
|
||||
```
|
||||
|
||||
Examples with different N values:
|
||||
- `hbm_pseudo_channels=32`, `pes_per_cube=8` -> `channels_per_pe=4`, 4 requests
|
||||
- `hbm_pseudo_channels=64`, `pes_per_cube=4` -> `channels_per_pe=16`, 16 requests
|
||||
|
||||
---
|
||||
|
||||
### D7. n:1 Mode Details
|
||||
|
||||
- One logical access -> one aggregated request
|
||||
- Target: aggregated router -> hbm_ctrl (see ADR-0019)
|
||||
- Aggregated link BW = `channels_per_pe` x `channel_bw_gbs` (e.g., 8 x 32 = 256 GB/s)
|
||||
- Modeled as a single queue / resource
|
||||
- No per-channel PA decomposition
|
||||
|
||||
#### n:1 Mode Example
|
||||
|
||||
```
|
||||
Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
|
||||
BAAW segment: {
|
||||
la_base: 0x1_0000_0000, la_size: 4096,
|
||||
mode: "n_to_one",
|
||||
agg_pa_base: PA_agg,
|
||||
agg_node_id: "sip0.cube0.pe0.agg_router",
|
||||
}
|
||||
|
||||
BAAW resolve result:
|
||||
-> PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
|
||||
|
||||
PE_DMA: 1 sub-transaction injected
|
||||
Accesses HBM via aggregated router -> hbm_ctrl link (256 GB/s)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### D8. Kernel Model Preservation
|
||||
|
||||
- The kernel still issues only single memory ops (`tl.load`, `tl.store`, `tl.composite`)
|
||||
- LA is the address scheme passed to the kernel
|
||||
- Channel decomposition/aggregation is performed by BAAW inside PE_DMA
|
||||
- Physical channel information is not exposed to kernel code
|
||||
|
||||
---
|
||||
|
||||
## Consequences
|
||||
|
||||
### Positive
|
||||
|
||||
- 1:1 vs n:1 semantics are clearly separated at a single point: BAAW
|
||||
- Kernel abstraction is preserved — no kernel code changes required
|
||||
- Topology-based policy control is possible (mode switching via yaml)
|
||||
- Improved simulation model consistency and debuggability
|
||||
- Segment-based mapping is simpler and has lower overhead compared to page tables
|
||||
|
||||
### Negative
|
||||
|
||||
- Full refactoring of VA/MMU-based code is required
|
||||
- Increased complexity in the request generation path (managing N requests in 1:1 mode)
|
||||
- Reduced per-channel visibility in n:1 mode
|
||||
- Existing VA-related tests must be rewritten
|
||||
|
||||
---
|
||||
|
||||
## Alternatives
|
||||
|
||||
### A1. Keep VA + Fan-out at MMU
|
||||
|
||||
- Extend MMU to return per-channel PAs
|
||||
- Problem: MMU's role expands beyond address translation to include request decomposition
|
||||
- Problem: Aggregation representation is difficult in n:1 mode
|
||||
|
||||
### A2. Kernel Generates Channel-Aware Requests
|
||||
|
||||
- Kernel directly calls per-channel load/store
|
||||
- Problem: Abstraction leakage, reduced portability
|
||||
- Problem: All benchmark code must be modified
|
||||
|
||||
### A3. Always Use PA (Without LA)
|
||||
|
||||
- Runtime directly passes per-channel PA to the kernel
|
||||
- Problem: Conflicts with the aggregation model
|
||||
- Problem: Conversion timing is unclear, channel information exposed to kernel
|
||||
|
||||
---
|
||||
|
||||
## Implementation Notes
|
||||
|
||||
### Implementation Order
|
||||
|
||||
1. Introduce LA type (`policy/address/la_allocator.py`)
|
||||
2. Implement BAAW segment table (`policy/address/baaw.py`)
|
||||
3. Add `BaawSegmentInstallMsg` message type (`runtime_api/kernel.py`)
|
||||
4. Integrate BAAW into PE_DMA (`components/builtin/pe_dma.py` handle_command changes)
|
||||
5. Modify RuntimeContext: LA alloc + segment install (`runtime_api/context.py`)
|
||||
6. Change Tensor.va_base -> la_base (`runtime_api/tensor.py`)
|
||||
7. Remove VA/MMU code
|
||||
8. Remove pe_mmu from topology.yaml, add mapping mode configuration
|
||||
9. Test migration
|
||||
|
||||
### Affected Existing Tests
|
||||
|
||||
| Test File | Impact |
|
||||
|-----------|--------|
|
||||
| `tests/test_mmu_component.py` | Remove -> replace with BAAW segment install test |
|
||||
| `tests/test_mmu_fabric.py` | Remove -> replace with BAAW + fabric integration test |
|
||||
| `tests/test_pe_mmu.py` | Remove |
|
||||
| `tests/test_va_allocator.py` | Replace with LA allocator test |
|
||||
| `tests/test_va_integration.py` | Replace with LA + BAAW integration test |
|
||||
| `tests/test_va_offset.py` | Replace with LA offset test |
|
||||
|
||||
---
|
||||
|
||||
## Test Requirements
|
||||
|
||||
- For the same logical access:
|
||||
- 1:1 -> verify N requests are generated
|
||||
- n:1 -> verify 1 aggregated request is generated
|
||||
- Verify effective bandwidth consistency across both modes
|
||||
- 1:1 -> verify per-channel contention modeling
|
||||
- n:1 -> verify aggregated bandwidth is reflected
|
||||
- Verify operation without kernel code changes
|
||||
- Verify correct BAAW segment install/uninstall operation
|
||||
- Verify no conflicts when multiple tensors are assigned to different segments
|
||||
|
||||
---
|
||||
|
||||
## Links
|
||||
|
||||
- ADR-0011 (Memory Addressing Simplification — PA-first, VA/MMU introduction) -> superseded by this ADR
|
||||
- ADR-0019 (NOC Per-Channel HBM Connection Model) -> topology-side integration
|
||||
- ADR-0014 (PE Internal Execution Model) -> PE_DMA change impact
|
||||
Reference in New Issue
Block a user