Add English translations for ADR-0018, 0019, 0020, 0021

- ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping - ADR-0019: CUBE NOC per-channel and aggregated HBM connection model - ADR-0020: 2-pass data execution model (timing/data separation, greenlet) - ADR-0021: PE pipeline refactor (component separation + token self-routing) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 16:31:32 -07:00
parent 10b33b44ba
commit b2c52f0e34
4 changed files with 1962 additions and 0 deletions
@@ -0,0 +1,441 @@
+# ADR-0018: LA-Based Memory Address Abstraction and HBM Channel Mapping Mode Introduction
+
+## Status
+
+Proposed
+
+## Context
+
+Kernbench simulates memory access between PE_DMA and Local-HBM within a CUBE.
+Currently, a VA-based access path is used; however, the following two channel mapping models
+are difficult to represent consistently.
+
+### Background: Local-HBM Pseudo Channel Structure
+
+The HBM in a CUBE consists of 32 or 64 pseudo channels.
+In the PE-Local-HBM model, each PE is responsible for an equal number of pseudo channels.
+
+Example: 64 pseudo channels, 8 PEs per cube -> each PE accesses 8 pseudo channels as local HBM
+
+Both the number of pseudo channels and the number of PEs are topology parameters.
+`N = hbm_pseudo_channels / pes_per_cube` (= channels_per_pe) determines
+the number of local channels per PE.
+
+The routing path BW between DMA and each pseudo channel matches the BW of each pseudo channel
+(e.g., 32 GB/s), so if a PE sends simultaneous requests to N channels, it can utilize the
+maximum memory BW.
+
+### Limitations of the Current VA Model
+
+When channels are divided into 8, requests must also be generated per channel and sent to DMA.
+However, in the current architecture, the kernel generates requests with VA (`tl.load`)
+and passes them directly to DMA, making it difficult for PE_CPU to generate per-channel DMA requests.
+
+Therefore, instead of VA, we propose using **Logical Address (LA)**,
+where the **BAAW (Logical-to-Physical Mapping Unit)** inside PE_DMA
+converts LA to PA or a list of PAs based on segment-based mapping.
+
+### Two Channel Mapping Modes
+
+- **1:1 mode**: Creates and executes per-channel requests. Precise per-channel modeling.
+- **n:1 mode (default)**: Assumes interleaving across local HBM channels. Aggregated BW modeling.
+
+By supporting both modes, the overhead of the n:1 mode can be measured and evaluated.
+
+### Core Requirements
+
+- The effective bandwidth semantics of PE_DMA -> HBM_CTRL must be identical in both modes
+- The difference must only be in the request representation and resource modeling approach
+- The kernel programming model must not be changed
+- Physical channel information must not be exposed to the kernel
+
+### Existing Physical Address
+
+The current system's 51-bit Physical Address is defined in `policy/address/phyaddr.py`:
+
+```
+[50:47] rack_id (4 bit)
+[46:43] sip_id  (4 bit)
+[42:38] cube_id (5 bit, sip_seg)
+[37]    hbm_selector (1=HBM window)
+[36:0]  hbm_offset   (37 bit, 128GB per cube)
+```
+
+PA is used to represent the final routable canonical physical destination,
+and this role is preserved.
+However, the timing and policy of logical access -> physical request conversion are not clearly separated.
+
+---
+
+## Decision
+
+### D1. Introduction of LA (Logical Address) — Replacing VA
+
+The existing VA (Virtual Address) infrastructure is replaced with LA (Logical Address).
+
+#### Characteristics of LA
+
+- Like VA, tensors can be mapped to a contiguous memory space
+- Represents logical buffer + offset
+- Does not directly contain physical channel information
+- An intermediate abstraction maintained until physical resolution
+- The sole address scheme used by kernel code (`tl.load`, `tl.store`, `tl.composite`)
+
+#### LA Space Definition
+
+| Item | Value |
+|------|-------|
+| LA start address | `0x1_0000_0000` (4 GB, preserving the existing VA start point) |
+| LA space size | 64 GB per PE |
+| Alignment unit | Segment-based (see D3 below) |
+
+LA is a PE-local address space.
+Even if different PEs use the same LA value, they resolve to different PAs
+because each PE has a different BAAW segment table.
+
+#### VA Infrastructure Removal Scope
+
+With the introduction of LA, the following existing code will be replaced/removed:
+
+| Removal Target | Replacement |
+|----------------|-------------|
+| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, name/role changed) |
+| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
+| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
+| `runtime_api/kernel.py`: MmuMapMsg, MmuUnmapMsg | Replaced with BaawSegmentInstallMsg |
+| `runtime_api/context.py`: VA alloc + MMU mapping install | LA alloc + BAAW segment install |
+| `runtime_api/tensor.py`: `va_base` field | `la_base` field |
+| `topology.yaml`: pe_mmu component entry | Removed |
+
+---
+
+### D2. Mapping Mode Configuration
+
+The mapping mode is configured at the cube level in topology.yaml:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
+    hbm_pseudo_channels: 64       # total pseudo channel count
+    hbm_channels_per_pe: 8        # local channel count per PE
+    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth
+```
+
+This configuration is referenced during graph compilation (topology builder) and BAAW initialization.
+
+---
+
+### D3. Segments and BAAW
+
+#### Segment Definition
+
+A segment is a logical allocation unit that partitions the LA space so that each segment
+maps to a specific HBM channel or channel group.
+
+Segments are created by the runtime allocator during tensor deployment,
+and BAAW uses them to convert LA into physical requests.
+
+#### BAAW Segment Table Entry
+
+```python
+@dataclass
+class BaawSegment:
+    la_base: int          # segment start LA
+    la_size: int          # segment size (bytes)
+    mode: str             # "one_to_one" | "n_to_one"
+    # 1:1 mode fields
+    channel_count: int    # number of channels assigned to this segment (e.g., 8)
+    pa_bases: list[int]   # per-channel PA start address list (len = channel_count)
+    channel_ids: list[int]  # per-channel logical IDs (e.g., [0,1,2,...,7])
+    channel_size: int     # per-channel size (la_size // channel_count)
+    # n:1 mode fields
+    agg_pa_base: int      # aggregated PA start address
+    agg_node_id: str      # aggregated router node_id (for routing)
+```
+
+#### Segment Lifecycle
+
+1. **Allocation time** (tensor deploy):
+   - RuntimeContext allocates LA space from the LA allocator
+   - PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1)
+   - Sends `BaawSegmentInstallMsg` to PE_DMA to register in the segment table
+
+2. **Usage time** (kernel execution):
+   - Kernel issues `tl.load(la_ptr)` -> DmaReadCmd(src_addr=LA)
+   - PE_DMA looks up the segment corresponding to the LA in BAAW
+   - Converts to PA(s) according to the mode
+
+3. **Deallocation time** (tensor free):
+   - Removed from the segment table
+   - LA space returned, PA deallocated
+
+---
+
+### D4. BAAW (Logical-to-Physical Mapping Unit)
+
+#### Location
+
+BAAW is placed as a front-end stage inside PE_DMA.
+It is not a separate SimPy component; it is synchronous address resolution logic
+executed at the beginning of PE_DMA's `handle_command()`.
+
+#### Input
+
+- LA (Logical Address) — DmaReadCmd.src_addr or DmaWriteCmd.dst_addr
+- access size (bytes)
+
+#### Output
+
+- 1:1 mode: `list[PhysicalRequest]` — each request is (PA, nbytes, channel_node_id)
+- n:1 mode: 1 `PhysicalRequest` — (agg_PA, nbytes, agg_node_id)
+
+```python
+@dataclass
+class PhysicalRequest:
+    pa: int           # 51-bit Physical Address
+    nbytes: int       # transfer size for this request
+    dst_node: str     # target node_id (channel router or aggregated router)
+```
+
+#### BAAW Resolve Logic
+
+```python
+def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
+    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
+    offset = la - seg.la_base
+
+    if seg.mode == "n_to_one":
+        pa = seg.agg_pa_base + offset
+        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
+
+    elif seg.mode == "one_to_one":
+        requests = []
+        per_ch_size = seg.channel_size
+        for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)):
+            ch_offset = offset % per_ch_size  # interleaved or striped
+            ch_nbytes = nbytes // seg.channel_count
+            pa = pa_base + ch_offset
+            dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
+            requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
+        return requests
+```
+
+#### Scope of Responsibility
+
+BAAW is responsible for:
+- Converting logical accesses into physical request units
+- Performing fan-out (1:1) or pass-through (n:1) according to the mapping mode
+- Generating Physical Addresses and determining target nodes
+
+BAAW is NOT responsible for:
+- Performing actual data movement
+- Executing NOC routing
+- Simulating bandwidth consumption (this is the role of downstream components)
+
+#### Output Contract
+
+The output of BAAW must be request units that can be directly used by the simulator's
+routing and resource model without any additional address decoding.
+
+---
+
+### D5. PE_DMA handle_command() Changes
+
+#### Current Flow (VA-based)
+
+```
+DmaReadCmd.src_addr (VA)
+  -> MMU.translate(VA) -> PA
+  -> PhysAddr.decode(PA) -> PhysAddr object
+  -> resolver.resolve(PhysAddr) -> dst_node_id (e.g., "sip0.cube0.hbm_ctrl")
+  -> router.find_path(pe_prefix, dst_node_id) -> path
+  -> 1 sub-Transaction created -> fabric inject
+```
+
+#### New Flow (LA-based)
+
+```
+DmaReadCmd.src_addr (LA)
+  -> BAAW.resolve(LA, nbytes) -> list[PhysicalRequest]
+  -> For each PhysicalRequest:
+      -> router.find_path(pe_prefix, req.dst_node) -> path
+      -> compute_drain_ns(path, req.nbytes) -> drain
+      -> sub-Transaction created -> fabric inject
+  -> Wait for all sub-Transactions to complete
+  -> pe_txn.done.succeed()
+```
+
+Key changes:
+- MMU reference removed -> replaced with BAAW resolve
+- PhysAddr.decode() + resolver.resolve() -> BAAW directly returns dst_node
+- 1 request -> N requests injected in parallel (1:1 mode)
+
+---
+
+### D6. 1:1 Mode Details
+
+- One logical access -> N (= `channels_per_pe`) physical requests
+- N is a parameter determined by `hbm_pseudo_channels / pes_per_cube`
+- Each request:
+  - Fully resolved 51-bit PA
+  - Targets a specific channel router (`{pe_prefix}.ch_r{channel_id}`)
+- BW contention modeling via per-channel links
+- PE_DMA injects N sub-transactions simultaneously
+
+#### 1:1 Mode Example
+
+Configuration: `hbm_pseudo_channels=64`, `pes_per_cube=8`
+-> `channels_per_pe=8`, PE0 owns ch0-7
+
+```text
+Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "one_to_one", channel_count: 8,  # = channels_per_pe
+    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
+    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
+    channel_size: 512,  # = la_size / channel_count
+}
+
+BAAW resolve result (N=8 requests):
+  -> PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
+  -> PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
+  -> ...
+  -> PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
+
+PE_DMA: N sub-transactions injected in parallel
+  Each accesses HBM via channel router -> hbm_ctrl link (channel_bw_gbs)
+  Total effective BW = N x channel_bw_gbs
+```
+
+Examples with different N values:
+- `hbm_pseudo_channels=32`, `pes_per_cube=8` -> `channels_per_pe=4`, 4 requests
+- `hbm_pseudo_channels=64`, `pes_per_cube=4` -> `channels_per_pe=16`, 16 requests
+
+---
+
+### D7. n:1 Mode Details
+
+- One logical access -> one aggregated request
+- Target: aggregated router -> hbm_ctrl (see ADR-0019)
+- Aggregated link BW = `channels_per_pe` x `channel_bw_gbs` (e.g., 8 x 32 = 256 GB/s)
+- Modeled as a single queue / resource
+- No per-channel PA decomposition
+
+#### n:1 Mode Example
+
+```
+Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "n_to_one",
+    agg_pa_base: PA_agg,
+    agg_node_id: "sip0.cube0.pe0.agg_router",
+}
+
+BAAW resolve result:
+  -> PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
+
+PE_DMA: 1 sub-transaction injected
+  Accesses HBM via aggregated router -> hbm_ctrl link (256 GB/s)
+```
+
+---
+
+### D8. Kernel Model Preservation
+
+- The kernel still issues only single memory ops (`tl.load`, `tl.store`, `tl.composite`)
+- LA is the address scheme passed to the kernel
+- Channel decomposition/aggregation is performed by BAAW inside PE_DMA
+- Physical channel information is not exposed to kernel code
+
+---
+
+## Consequences
+
+### Positive
+
+- 1:1 vs n:1 semantics are clearly separated at a single point: BAAW
+- Kernel abstraction is preserved — no kernel code changes required
+- Topology-based policy control is possible (mode switching via yaml)
+- Improved simulation model consistency and debuggability
+- Segment-based mapping is simpler and has lower overhead compared to page tables
+
+### Negative
+
+- Full refactoring of VA/MMU-based code is required
+- Increased complexity in the request generation path (managing N requests in 1:1 mode)
+- Reduced per-channel visibility in n:1 mode
+- Existing VA-related tests must be rewritten
+
+---
+
+## Alternatives
+
+### A1. Keep VA + Fan-out at MMU
+
+- Extend MMU to return per-channel PAs
+- Problem: MMU's role expands beyond address translation to include request decomposition
+- Problem: Aggregation representation is difficult in n:1 mode
+
+### A2. Kernel Generates Channel-Aware Requests
+
+- Kernel directly calls per-channel load/store
+- Problem: Abstraction leakage, reduced portability
+- Problem: All benchmark code must be modified
+
+### A3. Always Use PA (Without LA)
+
+- Runtime directly passes per-channel PA to the kernel
+- Problem: Conflicts with the aggregation model
+- Problem: Conversion timing is unclear, channel information exposed to kernel
+
+---
+
+## Implementation Notes
+
+### Implementation Order
+
+1. Introduce LA type (`policy/address/la_allocator.py`)
+2. Implement BAAW segment table (`policy/address/baaw.py`)
+3. Add `BaawSegmentInstallMsg` message type (`runtime_api/kernel.py`)
+4. Integrate BAAW into PE_DMA (`components/builtin/pe_dma.py` handle_command changes)
+5. Modify RuntimeContext: LA alloc + segment install (`runtime_api/context.py`)
+6. Change Tensor.va_base -> la_base (`runtime_api/tensor.py`)
+7. Remove VA/MMU code
+8. Remove pe_mmu from topology.yaml, add mapping mode configuration
+9. Test migration
+
+### Affected Existing Tests
+
+| Test File | Impact |
+|-----------|--------|
+| `tests/test_mmu_component.py` | Remove -> replace with BAAW segment install test |
+| `tests/test_mmu_fabric.py` | Remove -> replace with BAAW + fabric integration test |
+| `tests/test_pe_mmu.py` | Remove |
+| `tests/test_va_allocator.py` | Replace with LA allocator test |
+| `tests/test_va_integration.py` | Replace with LA + BAAW integration test |
+| `tests/test_va_offset.py` | Replace with LA offset test |
+
+---
+
+## Test Requirements
+
+- For the same logical access:
+  - 1:1 -> verify N requests are generated
+  - n:1 -> verify 1 aggregated request is generated
+- Verify effective bandwidth consistency across both modes
+- 1:1 -> verify per-channel contention modeling
+- n:1 -> verify aggregated bandwidth is reflected
+- Verify operation without kernel code changes
+- Verify correct BAAW segment install/uninstall operation
+- Verify no conflicts when multiple tensors are assigned to different segments
+
+---
+
+## Links
+
+- ADR-0011 (Memory Addressing Simplification — PA-first, VA/MMU introduction) -> superseded by this ADR
+- ADR-0019 (NOC Per-Channel HBM Connection Model) -> topology-side integration
+- ADR-0014 (PE Internal Execution Model) -> PE_DMA change impact