Add English translations for ADR-0018, 0019, 0020, 0021

- ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping
- ADR-0019: CUBE NOC per-channel and aggregated HBM connection model
- ADR-0020: 2-pass data execution model (timing/data separation, greenlet)
- ADR-0021: PE pipeline refactor (component separation + token self-routing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-13 16:31:32 -07:00
parent 10b33b44ba
commit b2c52f0e34
4 changed files with 1962 additions and 0 deletions
+441
View File
@@ -0,0 +1,441 @@
# ADR-0018: LA-Based Memory Address Abstraction and HBM Channel Mapping Mode Introduction
## Status
Proposed
## Context
Kernbench simulates memory access between PE_DMA and Local-HBM within a CUBE.
Currently, a VA-based access path is used; however, the following two channel mapping models
are difficult to represent consistently.
### Background: Local-HBM Pseudo Channel Structure
The HBM in a CUBE consists of 32 or 64 pseudo channels.
In the PE-Local-HBM model, each PE is responsible for an equal number of pseudo channels.
Example: 64 pseudo channels, 8 PEs per cube -> each PE accesses 8 pseudo channels as local HBM
Both the number of pseudo channels and the number of PEs are topology parameters.
`N = hbm_pseudo_channels / pes_per_cube` (= channels_per_pe) determines
the number of local channels per PE.
The routing path BW between DMA and each pseudo channel matches the BW of each pseudo channel
(e.g., 32 GB/s), so if a PE sends simultaneous requests to N channels, it can utilize the
maximum memory BW.
### Limitations of the Current VA Model
When channels are divided into 8, requests must also be generated per channel and sent to DMA.
However, in the current architecture, the kernel generates requests with VA (`tl.load`)
and passes them directly to DMA, making it difficult for PE_CPU to generate per-channel DMA requests.
Therefore, instead of VA, we propose using **Logical Address (LA)**,
where the **BAAW (Logical-to-Physical Mapping Unit)** inside PE_DMA
converts LA to PA or a list of PAs based on segment-based mapping.
### Two Channel Mapping Modes
- **1:1 mode**: Creates and executes per-channel requests. Precise per-channel modeling.
- **n:1 mode (default)**: Assumes interleaving across local HBM channels. Aggregated BW modeling.
By supporting both modes, the overhead of the n:1 mode can be measured and evaluated.
### Core Requirements
- The effective bandwidth semantics of PE_DMA -> HBM_CTRL must be identical in both modes
- The difference must only be in the request representation and resource modeling approach
- The kernel programming model must not be changed
- Physical channel information must not be exposed to the kernel
### Existing Physical Address
The current system's 51-bit Physical Address is defined in `policy/address/phyaddr.py`:
```
[50:47] rack_id (4 bit)
[46:43] sip_id (4 bit)
[42:38] cube_id (5 bit, sip_seg)
[37] hbm_selector (1=HBM window)
[36:0] hbm_offset (37 bit, 128GB per cube)
```
PA is used to represent the final routable canonical physical destination,
and this role is preserved.
However, the timing and policy of logical access -> physical request conversion are not clearly separated.
---
## Decision
### D1. Introduction of LA (Logical Address) — Replacing VA
The existing VA (Virtual Address) infrastructure is replaced with LA (Logical Address).
#### Characteristics of LA
- Like VA, tensors can be mapped to a contiguous memory space
- Represents logical buffer + offset
- Does not directly contain physical channel information
- An intermediate abstraction maintained until physical resolution
- The sole address scheme used by kernel code (`tl.load`, `tl.store`, `tl.composite`)
#### LA Space Definition
| Item | Value |
|------|-------|
| LA start address | `0x1_0000_0000` (4 GB, preserving the existing VA start point) |
| LA space size | 64 GB per PE |
| Alignment unit | Segment-based (see D3 below) |
LA is a PE-local address space.
Even if different PEs use the same LA value, they resolve to different PAs
because each PE has a different BAAW segment table.
#### VA Infrastructure Removal Scope
With the introduction of LA, the following existing code will be replaced/removed:
| Removal Target | Replacement |
|----------------|-------------|
| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, name/role changed) |
| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
| `runtime_api/kernel.py`: MmuMapMsg, MmuUnmapMsg | Replaced with BaawSegmentInstallMsg |
| `runtime_api/context.py`: VA alloc + MMU mapping install | LA alloc + BAAW segment install |
| `runtime_api/tensor.py`: `va_base` field | `la_base` field |
| `topology.yaml`: pe_mmu component entry | Removed |
---
### D2. Mapping Mode Configuration
The mapping mode is configured at the cube level in topology.yaml:
```yaml
cube:
memory_map:
hbm_mapping_mode: n_to_one # one_to_one | n_to_one
hbm_pseudo_channels: 64 # total pseudo channel count
hbm_channels_per_pe: 8 # local channel count per PE
hbm_channel_bw_gbs: 32.0 # per-channel bandwidth
```
This configuration is referenced during graph compilation (topology builder) and BAAW initialization.
---
### D3. Segments and BAAW
#### Segment Definition
A segment is a logical allocation unit that partitions the LA space so that each segment
maps to a specific HBM channel or channel group.
Segments are created by the runtime allocator during tensor deployment,
and BAAW uses them to convert LA into physical requests.
#### BAAW Segment Table Entry
```python
@dataclass
class BaawSegment:
la_base: int # segment start LA
la_size: int # segment size (bytes)
mode: str # "one_to_one" | "n_to_one"
# 1:1 mode fields
channel_count: int # number of channels assigned to this segment (e.g., 8)
pa_bases: list[int] # per-channel PA start address list (len = channel_count)
channel_ids: list[int] # per-channel logical IDs (e.g., [0,1,2,...,7])
channel_size: int # per-channel size (la_size // channel_count)
# n:1 mode fields
agg_pa_base: int # aggregated PA start address
agg_node_id: str # aggregated router node_id (for routing)
```
#### Segment Lifecycle
1. **Allocation time** (tensor deploy):
- RuntimeContext allocates LA space from the LA allocator
- PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1)
- Sends `BaawSegmentInstallMsg` to PE_DMA to register in the segment table
2. **Usage time** (kernel execution):
- Kernel issues `tl.load(la_ptr)` -> DmaReadCmd(src_addr=LA)
- PE_DMA looks up the segment corresponding to the LA in BAAW
- Converts to PA(s) according to the mode
3. **Deallocation time** (tensor free):
- Removed from the segment table
- LA space returned, PA deallocated
---
### D4. BAAW (Logical-to-Physical Mapping Unit)
#### Location
BAAW is placed as a front-end stage inside PE_DMA.
It is not a separate SimPy component; it is synchronous address resolution logic
executed at the beginning of PE_DMA's `handle_command()`.
#### Input
- LA (Logical Address) — DmaReadCmd.src_addr or DmaWriteCmd.dst_addr
- access size (bytes)
#### Output
- 1:1 mode: `list[PhysicalRequest]` — each request is (PA, nbytes, channel_node_id)
- n:1 mode: 1 `PhysicalRequest` — (agg_PA, nbytes, agg_node_id)
```python
@dataclass
class PhysicalRequest:
pa: int # 51-bit Physical Address
nbytes: int # transfer size for this request
dst_node: str # target node_id (channel router or aggregated router)
```
#### BAAW Resolve Logic
```python
def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
seg = self._find_segment(la) # la_base <= la < la_base + la_size
offset = la - seg.la_base
if seg.mode == "n_to_one":
pa = seg.agg_pa_base + offset
return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
elif seg.mode == "one_to_one":
requests = []
per_ch_size = seg.channel_size
for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)):
ch_offset = offset % per_ch_size # interleaved or striped
ch_nbytes = nbytes // seg.channel_count
pa = pa_base + ch_offset
dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
return requests
```
#### Scope of Responsibility
BAAW is responsible for:
- Converting logical accesses into physical request units
- Performing fan-out (1:1) or pass-through (n:1) according to the mapping mode
- Generating Physical Addresses and determining target nodes
BAAW is NOT responsible for:
- Performing actual data movement
- Executing NOC routing
- Simulating bandwidth consumption (this is the role of downstream components)
#### Output Contract
The output of BAAW must be request units that can be directly used by the simulator's
routing and resource model without any additional address decoding.
---
### D5. PE_DMA handle_command() Changes
#### Current Flow (VA-based)
```
DmaReadCmd.src_addr (VA)
-> MMU.translate(VA) -> PA
-> PhysAddr.decode(PA) -> PhysAddr object
-> resolver.resolve(PhysAddr) -> dst_node_id (e.g., "sip0.cube0.hbm_ctrl")
-> router.find_path(pe_prefix, dst_node_id) -> path
-> 1 sub-Transaction created -> fabric inject
```
#### New Flow (LA-based)
```
DmaReadCmd.src_addr (LA)
-> BAAW.resolve(LA, nbytes) -> list[PhysicalRequest]
-> For each PhysicalRequest:
-> router.find_path(pe_prefix, req.dst_node) -> path
-> compute_drain_ns(path, req.nbytes) -> drain
-> sub-Transaction created -> fabric inject
-> Wait for all sub-Transactions to complete
-> pe_txn.done.succeed()
```
Key changes:
- MMU reference removed -> replaced with BAAW resolve
- PhysAddr.decode() + resolver.resolve() -> BAAW directly returns dst_node
- 1 request -> N requests injected in parallel (1:1 mode)
---
### D6. 1:1 Mode Details
- One logical access -> N (= `channels_per_pe`) physical requests
- N is a parameter determined by `hbm_pseudo_channels / pes_per_cube`
- Each request:
- Fully resolved 51-bit PA
- Targets a specific channel router (`{pe_prefix}.ch_r{channel_id}`)
- BW contention modeling via per-channel links
- PE_DMA injects N sub-transactions simultaneously
#### 1:1 Mode Example
Configuration: `hbm_pseudo_channels=64`, `pes_per_cube=8`
-> `channels_per_pe=8`, PE0 owns ch0-7
```text
Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
la_base: 0x1_0000_0000, la_size: 4096,
mode: "one_to_one", channel_count: 8, # = channels_per_pe
pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
channel_size: 512, # = la_size / channel_count
}
BAAW resolve result (N=8 requests):
-> PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
-> PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
-> ...
-> PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
PE_DMA: N sub-transactions injected in parallel
Each accesses HBM via channel router -> hbm_ctrl link (channel_bw_gbs)
Total effective BW = N x channel_bw_gbs
```
Examples with different N values:
- `hbm_pseudo_channels=32`, `pes_per_cube=8` -> `channels_per_pe=4`, 4 requests
- `hbm_pseudo_channels=64`, `pes_per_cube=4` -> `channels_per_pe=16`, 16 requests
---
### D7. n:1 Mode Details
- One logical access -> one aggregated request
- Target: aggregated router -> hbm_ctrl (see ADR-0019)
- Aggregated link BW = `channels_per_pe` x `channel_bw_gbs` (e.g., 8 x 32 = 256 GB/s)
- Modeled as a single queue / resource
- No per-channel PA decomposition
#### n:1 Mode Example
```
Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
la_base: 0x1_0000_0000, la_size: 4096,
mode: "n_to_one",
agg_pa_base: PA_agg,
agg_node_id: "sip0.cube0.pe0.agg_router",
}
BAAW resolve result:
-> PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
PE_DMA: 1 sub-transaction injected
Accesses HBM via aggregated router -> hbm_ctrl link (256 GB/s)
```
---
### D8. Kernel Model Preservation
- The kernel still issues only single memory ops (`tl.load`, `tl.store`, `tl.composite`)
- LA is the address scheme passed to the kernel
- Channel decomposition/aggregation is performed by BAAW inside PE_DMA
- Physical channel information is not exposed to kernel code
---
## Consequences
### Positive
- 1:1 vs n:1 semantics are clearly separated at a single point: BAAW
- Kernel abstraction is preserved — no kernel code changes required
- Topology-based policy control is possible (mode switching via yaml)
- Improved simulation model consistency and debuggability
- Segment-based mapping is simpler and has lower overhead compared to page tables
### Negative
- Full refactoring of VA/MMU-based code is required
- Increased complexity in the request generation path (managing N requests in 1:1 mode)
- Reduced per-channel visibility in n:1 mode
- Existing VA-related tests must be rewritten
---
## Alternatives
### A1. Keep VA + Fan-out at MMU
- Extend MMU to return per-channel PAs
- Problem: MMU's role expands beyond address translation to include request decomposition
- Problem: Aggregation representation is difficult in n:1 mode
### A2. Kernel Generates Channel-Aware Requests
- Kernel directly calls per-channel load/store
- Problem: Abstraction leakage, reduced portability
- Problem: All benchmark code must be modified
### A3. Always Use PA (Without LA)
- Runtime directly passes per-channel PA to the kernel
- Problem: Conflicts with the aggregation model
- Problem: Conversion timing is unclear, channel information exposed to kernel
---
## Implementation Notes
### Implementation Order
1. Introduce LA type (`policy/address/la_allocator.py`)
2. Implement BAAW segment table (`policy/address/baaw.py`)
3. Add `BaawSegmentInstallMsg` message type (`runtime_api/kernel.py`)
4. Integrate BAAW into PE_DMA (`components/builtin/pe_dma.py` handle_command changes)
5. Modify RuntimeContext: LA alloc + segment install (`runtime_api/context.py`)
6. Change Tensor.va_base -> la_base (`runtime_api/tensor.py`)
7. Remove VA/MMU code
8. Remove pe_mmu from topology.yaml, add mapping mode configuration
9. Test migration
### Affected Existing Tests
| Test File | Impact |
|-----------|--------|
| `tests/test_mmu_component.py` | Remove -> replace with BAAW segment install test |
| `tests/test_mmu_fabric.py` | Remove -> replace with BAAW + fabric integration test |
| `tests/test_pe_mmu.py` | Remove |
| `tests/test_va_allocator.py` | Replace with LA allocator test |
| `tests/test_va_integration.py` | Replace with LA + BAAW integration test |
| `tests/test_va_offset.py` | Replace with LA offset test |
---
## Test Requirements
- For the same logical access:
- 1:1 -> verify N requests are generated
- n:1 -> verify 1 aggregated request is generated
- Verify effective bandwidth consistency across both modes
- 1:1 -> verify per-channel contention modeling
- n:1 -> verify aggregated bandwidth is reflected
- Verify operation without kernel code changes
- Verify correct BAAW segment install/uninstall operation
- Verify no conflicts when multiple tensors are assigned to different segments
---
## Links
- ADR-0011 (Memory Addressing Simplification — PA-first, VA/MMU introduction) -> superseded by this ADR
- ADR-0019 (NOC Per-Channel HBM Connection Model) -> topology-side integration
- ADR-0014 (PE Internal Execution Model) -> PE_DMA change impact
+431
View File
@@ -0,0 +1,431 @@
# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC
## Status
Proposed
## Context
ADR-0018 introduced LA-based address abstraction and BAAW,
defining how a logical memory access is translated into the following two forms of requests:
- 1:1 mode: one logical access → N per-channel requests
- n:1 mode: one logical access → one aggregated request
Here N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`),
determined by topology parameters.
### Problems with the Existing Structure
In the current implementation (`topology/builder.py`):
- PE_DMA → NOC → xbar_top/xbar_bot → HBM_CTRL.slice{0-7} path is used
- HBM is modeled as 8 slice (= per-PE) nodes
- Local/remote access use different paths:
- local: NOC → xbar → HBM slice
- cross-half: NOC → xbar_top → bridge → xbar_bot → HBM slice
- remote cube: NOC → UCIe → remote NOC → remote xbar → remote HBM slice
Limitations of this structure:
- Cannot model at the pseudo-channel granularity (slice = per-PE granularity, not per-channel)
- xbar/bridge bifurcate local/remote paths
- Cannot express 1:1 / n:1 modes consistently
---
## Decision
### D1. HBM Attaches to PE Routers
Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**,
and attach the HBM access point to the same router where the PE is attached.
- n:1 mode: PE's local HBM access goes directly from its own router (switching overhead only, 0 hops)
- Remote PE's HBM access: reaches the target PE's router via mesh hops
- The read/write resource model within the HBM controller is preserved
Node naming changes:
| Current | After Change |
| ---- | ------- |
| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |
In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that
the builder generates an edge between that router and hbm_ctrl.
---
### D2. Complete Removal of xbar, bridge, and Single NOC Node
Remove all of the following nodes and related edges:
- `{cube}.xbar_top`, `{cube}.xbar_bot`
- `{cube}.bridge.left`, `{cube}.bridge.right`
- `{cube}.noc` (single TwoDMeshNocComponent node)
- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
- Edges of type `xbar_to_bridge`, `bridge_to_xbar`
- Edges of type `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`, etc. referencing the single noc node
Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**.
Each router (r0c0, r0c1, ...) from the 6x6 router grid generated by `mesh_gen.py`
is created as a separate SimPy node in the topology graph,
and adjacent routers are connected via XY mesh edges.
---
### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)
#### Router Nodes Based on cube_mesh.yaml
Each non-null router from cube_mesh.yaml generated by `mesh_gen.py`
is created as a **separate SimPy node** in the topology graph.
- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
- kind: `noc_router`, impl: `forwarding_v1`
- pos_mm: taken from cube_mesh.yaml
Based on the attach information in cube_mesh.yaml, components are connected to each router:
- `pe{p}.dma` → PE_DMA ↔ router edge
- `pe{p}.cpu` → PE_CPU ↔ router edge
- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
- `m_cpu` → M_CPU ↔ router edge
- `sram` → SRAM ↔ router edge
- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge
Router-to-router XY mesh edges: bidirectional edges between adjacent routers.
Null routers (HBM exclusion zones) are skipped.
#### 1:1 Mode Extension (To Be Implemented Later)
In 1:1 mode, each router differentiates into N channel mini-routers.
Per-channel routing and ChannelSplitter (LA → per-channel PA) introduction are required.
N GEMM engines per PE are also added at this point.
---
### D4. Cross-PE HBM Access (n:1 Mode)
In n:1 mode, when a PE accesses another PE's local HBM,
it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.
Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:
```text
PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
```
The Dijkstra router finds the shortest path in the mesh.
Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3.
---
### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh
In n:1 mode, no separate "aggregated router" is created.
The existing router grid from cube_mesh.yaml serves that role.
#### Connection Structure
PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:
```text
sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0 (bw: N × channel_bw_gbs)
sip0.cube0.hbm_ctrl ←→ sip0.cube0.r0c0 (bw: N × channel_bw_gbs)
```
Routers are connected via XY mesh edges. PE's local HBM access goes
directly from its own router (switching overhead only).
#### n:1 Mode Full Data Paths
**Local HBM (0 hops):**
```text
PE0.pe_dma → r0c0 → hbm_ctrl (switching overhead only)
```
**Remote HBM (mesh hops):**
```text
PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
```
**M_CPU DMA:**
```text
M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
```
---
### D6. All Traffic Is Unified onto the Same Router Mesh
- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
- Local access does not use a separate fast path (xbar)
- Cross-cube (remote) access path:
```text
PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
→ [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
```
UCIe connections maintain the existing structure,
but both endpoints become mesh routers instead of xbars.
The number of UCIe lines is determined by BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.
---
### D7. AddressResolver Changes
Current `AddressResolver.resolve()`:
```python
# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
```
After change:
```python
# Changed: HBM → single endpoint
return f"sip{s}.cube{c}.hbm_ctrl"
```
The pe_slice calculation is removed.
In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.
resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility.
---
### D8. topology.yaml Configuration Changes
#### Added Settings
```yaml
cube:
memory_map:
hbm_mapping_mode: n_to_one # one_to_one | n_to_one
hbm_pseudo_channels: 64 # total pseudo channel count
hbm_channels_per_pe: 8 # local channels per PE (= pseudo_channels / pes_per_cube)
hbm_channel_bw_gbs: 32.0 # per-channel bandwidth (GB/s)
hbm_total_gb_per_cube: 48 # retained
```
#### Removed Settings
```yaml
# To be removed
links:
xbar_to_hbm_bw_gbs: 256.0 # → replaced by channel_bw_gbs × channels_per_pe
xbar_to_hbm_mm: 2.5 # → replaced by ch_router_to_hbm_mm
xbar_to_bridge_bw_gbs: 128.0 # → removed (no bridge)
xbar_to_bridge_mm: 3.0 # → removed
noc_to_xbar_bw_gbs: ... # → removed
noc_to_xbar_mm: ... # → removed
```
#### Added Link Settings
```yaml
links:
router_link_bw_gbs: 256.0 # XY mesh link BW between routers
router_overhead_ns: 2.0 # router switching overhead
pe_to_router_bw_gbs: 256.0 # PE_DMA ↔ router
hbm_to_router_bw_gbs: 256.0 # HBM ↔ router (= N × channel_bw)
```
---
### D9. Bandwidth Numerical Consistency
| Configuration | Value |
| ---- | --- |
| pseudo channels per cube | 64 (parameter) |
| PEs per cube | 8 (parameter) |
| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
| per-channel BW | 32 GB/s (parameter) |
| per-PE local BW | N × 32 = 256 GB/s |
| cube total HBM BW | 64 × 32 = 2048 GB/s |
The effective BW per PE is identical in both modes:
- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s
---
## Consequences
### Positive
- The router mesh based on cube_mesh.yaml accurately reflects physical placement
- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
- Local / remote / command traffic is unified onto the same mesh, resulting in simplicity
- Aligns well with graph compiler-based topology generation
- Channel count and PE count are both parameterized, enabling testing of various configurations
- 1:1 mode extension naturally follows through router differentiation
### Negative
- The number of SimPy nodes increases due to explicit router nodes (6x6 = up to 32 routers/cube)
- Requires complete rewrite of existing xbar/bridge/single NOC-based tests
- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model
---
## Alternatives
### A1. Retain Existing xbar + HBM Slices
- Local/remote paths remain bifurcated
- Cannot model at pseudo-channel granularity
- Cannot switch between 1:1/n:1 modes
### A2. Always Generate Per-Channel Links and Aggregate Only in n:1
- Topology structure always has 1:1 size
- Expressing n:1 semantics via link aggregation is complex
- No reduction in router node count
### A3. Gradual Transition (Retain xbar + Add NOC Path)
- Higher compatibility, but dual-path coexistence increases complexity
- Since xbar removal is ultimately necessary, the intermediate step provides little value
---
## Implementation Notes
### topology/builder.py Change Details
#### Code to Remove (within current `_instantiate_cube()`)
- xbar_top, xbar_bot node creation (~line 495-508)
- bridge.left, bridge.right node creation
- noc ↔ xbar edge creation (~line 540-555)
- xbar ↔ hbm_ctrl.slice edge creation (~line 510-538)
- xbar ↔ bridge edge creation (~line 557-572)
#### Code to Add
1:1 mode:
```python
N = hbm_channels_per_pe # from topology config
total_ch = hbm_pseudo_channels
# Create channel router nodes
for ch_id in range(total_ch):
pe_id = ch_id // N
nodes[f"{cp}.ch_r{ch_id}"] = Node(
id=f"{cp}.ch_r{ch_id}", kind="noc_router", impl="noc_v1",
attrs={}, pos_mm=(...), # horizontal row = ch_id % N
)
# PE_DMA ↔ local channel router edges
for pe_id in range(pes_per_cube):
for local_ch in range(N):
ch_id = pe_id * N + local_ch
edges.append(Edge(
src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.ch_r{ch_id}",
bw_gbs=channel_bw, kind="pe_to_ch_router", ...))
edges.append(Edge(
src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.pe{pe_id}.pe_dma",
bw_gbs=channel_bw, kind="ch_router_to_pe", ...))
# Channel router ↔ hbm_ctrl edges
for ch_id in range(total_ch):
edges.append(Edge(
src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.hbm_ctrl",
bw_gbs=channel_bw, kind="ch_router_to_hbm", ...))
edges.append(Edge(
src=f"{cp}.hbm_ctrl", dst=f"{cp}.ch_r{ch_id}",
bw_gbs=channel_bw, kind="hbm_to_ch_router", ...))
# Horizontal line edges (same logical index)
for row in range(N):
for p in range(pes_per_cube - 1):
ch_a = p * N + row
ch_b = (p + 1) * N + row
edges.append(Edge(
src=f"{cp}.ch_r{ch_a}", dst=f"{cp}.ch_r{ch_b}",
bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
edges.append(Edge(
src=f"{cp}.ch_r{ch_b}", dst=f"{cp}.ch_r{ch_a}",
bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
```
n:1 mode:
```python
# Create aggregated router nodes
for pe_id in range(pes_per_cube):
nodes[f"{cp}.pe{pe_id}.agg_router"] = Node(
id=f"{cp}.pe{pe_id}.agg_router", kind="noc_router", impl="noc_v1",
attrs={}, pos_mm=(...),
)
agg_bw = N * channel_bw # aggregated BW
# PE_DMA ↔ aggregated router
for pe_id in range(pes_per_cube):
edges.append(Edge(
src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.pe{pe_id}.agg_router",
bw_gbs=agg_bw, kind="pe_to_agg_router", ...))
edges.append(Edge(
src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.pe{pe_id}.pe_dma",
bw_gbs=agg_bw, kind="agg_router_to_pe", ...))
# Aggregated router ↔ hbm_ctrl
for pe_id in range(pes_per_cube):
edges.append(Edge(
src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.hbm_ctrl",
bw_gbs=agg_bw, kind="agg_to_hbm", ...))
edges.append(Edge(
src=f"{cp}.hbm_ctrl", dst=f"{cp}.pe{pe_id}.agg_router",
bw_gbs=agg_bw, kind="hbm_to_agg", ...))
# Horizontal links between aggregated routers
for p in range(pes_per_cube - 1):
edges.append(Edge(
src=f"{cp}.pe{p}.agg_router", dst=f"{cp}.pe{p+1}.agg_router",
bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
edges.append(Edge(
src=f"{cp}.pe{p+1}.agg_router", dst=f"{cp}.pe{p}.agg_router",
bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
```
### Affected Existing Tests
| Test File | Impact |
| ---------- | ---- |
| `tests/test_topology_compile.py` | Remove xbar/bridge node references, add channel router verification |
| `tests/test_topology_load.py` | Reflect topology.yaml configuration changes |
| `tests/test_pe_components.py` | PE_DMA routing path changes |
| `tests/test_sip_parallel.py` | Cross-PE access path changes |
| Cases that directly test xbar/bridge | Remove |
---
## Test Requirements
- Verify that requests are delivered via per-channel links in 1:1 mode
- Verify that requests are delivered via the aggregated link in n:1 mode
- Verify that topology is correctly generated in both modes:
- 1:1: `total_ch` channel routers + per-PE links + horizontal links
- n:1: `pes_per_cube` aggregated routers + per-PE links
- Verify that effective BW is consistent across both modes for the same workload
- Verify that horizontal line routing works for cross-PE access
- Verify that routing through UCIe works for cross-cube access
- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)
---
## Links
- ADR-0018 (LA + BAAW) → addressing-side integration
- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
- ADR-0004 (Memory Semantics) → BW model redefinition
- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
@@ -0,0 +1,553 @@
# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
## Status
Proposed
## Context
The current simulation models **timing only**.
`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
but do not actually read tensor data or perform computations.
### Required Capabilities
1. Must be able to store and read actual data in HBM/TCM/SRAM
2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
3. Must minimize simulation performance degradation
### Limitations of the Existing Kernel Execution Structure
The current kernel execution is separated into 3 stages:
```
Phase 0: Kernel function execution in TLContext → PeCommand list generation (outside SimPy, no data)
Phase 1: PE_CPU replays PeCommand list via SimPy (timing only)
```
Phase 0 requires the kernel to **complete execution entirely** before SimPy begins.
`tl.load()` returns a TensorHandle (placeholder), so actual data cannot be accessed.
Therefore, branching based on data values (dynamic control flow) is impossible.
This ADR resolves this limitation **for memory operations only** (see D1, D3).
### Constraints
- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
- Kernel functions must remain plain Python functions (no generator/async transformation)
### Design Exploration Results
| Option | Approach | Verdict |
|--------|----------|---------|
| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |
---
## Decision
### D1. 2-Pass Execution Model — Phase 0 Elimination
The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.
Before:
```
Phase 0: Kernel → PeCommand list (no data, no branching)
Phase 1: Replay PeCommand list via SimPy (timing only)
```
After:
```
Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
- Memory read/write: SimPy timing + MemoryStore actual data
- Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
- Dynamic control flow possible (tl.load returns actual data)
Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
```
This ADR **extends Phase 1 to be data-aware for memory operations only**.
Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
Phase 2 handles GEMM/Math computation correctness verification.
Phase 2 is optional — if only timing is needed, run Phase 1 alone.
### D2. Op Log Recording — ComponentBase Hook
Op log recording is performed as a **hook in the component base class**.
Individual component implementations are not modified.
```python
class ComponentBase:
def _on_process_start(self, env, msg):
if self._op_logger and getattr(msg, 'data_op', False):
self._op_logger.record_start(env.now, self.node.id, msg)
def _on_process_end(self, env, msg):
if self._op_logger and getattr(msg, 'data_op', False):
self._op_logger.record_end(env.now, self.node.id, msg)
```
Hooks are called before and after `run()` within `_forward_txn()`.
`_op_logger` is optional — zero overhead when absent.
**Hook timing definitions**:
| Timing | Meaning |
|--------|---------|
| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |
Link traversal latency is not included in t_start/t_end.
Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
The existing Phase 0 (kernel → PeCommand list) is eliminated,
and **greenlet** is used to cooperatively interleave kernel and SimPy execution.
#### Operating Principle
greenlet is a C extension that provides cooperative context switching.
When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
to perform timing simulation, and after completion, returns to the kernel with actual data.
```
SimPy loop (parent greenlet) Kernel (child greenlet)
───────────────────────── ──────────────────────
g.switch() ─────────────────────────→ Kernel starts
a = tl.load(ptr, ...)
internal: parent.switch(DmaReadCmd)
cmd = DmaReadCmd ←────────────────── (kernel paused)
yield DmaReadMsg(...)
yield env.timeout(dma_latency)
data = memory_store.read(...)
g.switch(data) ─────────────────────→ (kernel resumed)
a = data ← actual numpy array
if a[0][0] > 0.5: ← branching possible
...
```
The kernel is maintained as a **plain Python function**.
greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
#### KernelRunner — Framework Layer
The greenlet loop resides not in the PE_CPU component but in the framework layer,
**KernelRunner**.
```python
# KernelRunner (framework — greenlet ↔ SimPy bridge)
class KernelRunner:
def run(self, env, kernel_fn, args, store):
g = greenlet(self._run_kernel)
cmd = g.switch(kernel_fn, args)
while cmd is not None:
if isinstance(cmd, DmaReadCmd):
yield from self._dispatch_dma(env, cmd)
data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
cmd = g.switch(data) # resume with actual data
elif isinstance(cmd, GemmCmd):
yield from self._dispatch_gemm(env, cmd)
cmd = g.switch() # resume (no data)
elif isinstance(cmd, DmaWriteCmd):
store.write(cmd.dst_addr, cmd.data) # visibility = issue time
yield from self._dispatch_dma(env, cmd) # timing only
cmd = g.switch()
# PE_CPU (component — kept simple, unaware of greenlet)
def _execute_kernel(self, env):
runner = KernelRunner(self.ctx)
yield from runner.run(env, kernel_fn, args, store)
```
**Op logging single source of truth**: KernelRunner does not record directly to op_log.
All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
the component base class hooks automatically record them.
**Layer separation**:
- **Kernel code**: plain function, unaware of greenlet
- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
- **ComponentBase hook**: the sole path for op_log recording
- **PE_CPU**: only calls KernelRunner, replaceable as a component
#### Handling Differences Between Memory Read/Write and Compute
| Operation | In Phase 1 | In Phase 2 |
|-----------|-----------|-----------|
| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |
Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
GEMM/Math operations are batch-executed in Phase 2 (performance separation).
#### Store Visibility Rule
`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
SimPy DMA timing is simulated separately afterward.
This is an intentional separation of timing and visibility:
- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
- **timing**: the point at which DMA latency completes in SimPy
This separation allows a load immediately after a store to see the latest data in dynamic control flow.
#### Result Handle Semantics
`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.
The key contract in Phase 1:
1. **All compute handles are always considered pending in Phase 1.**
2. `tl.wait(handle)` **expresses timing synchronization only**
and does not make the handle ready.
3. Accessing the handle's actual result data (`handle.data`, element access,
numpy conversion, etc.) is **only possible in Phase 2**.
4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
5. In contrast, `tl.load()` returns actual data in Phase 1, so
**memory-read-based control flow is supported**.
| Handle state | Phase | Allowed operations |
|------------|-------|----------|
| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
| ready | Phase 2 | Actual numpy data access, verification |
This restriction is intentional. If computations were executed in Phase 1,
the SimPy single-thread would block, defeating the purpose of 2-pass separation.
#### Phase 1 Materialization — Future Extension
If Phase 1 eager execution becomes necessary for small operations
(scalar, small reduction) in the future, selective materialization can be supported
by adding a `materialized_in_phase1: bool` flag to the op record.
This is not implemented in the current scope.
### D4. data_op Flag — Message Self-Declaration
The logging target is determined by the `data_op` attribute on the message instance,
not by message type. The framework does not hardcode message types.
```python
class MsgBase:
data_op: bool = False # default: no logging
class DmaReadCmd(MsgBase):
data_op = True # memory transfer → logging
class GemmCmd(MsgBase):
data_op = True # compute → logging
class MathCmd(MsgBase):
data_op = True # compute → logging
```
When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
enables automatic logging without modifying framework code.
### D5. Op Log Structure
#### Op Classification Scheme
A two-level classification is used:
| Level | Field | Role |
|-------|-------|------|
| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |
#### OpRecord Definition
```python
@dataclass
class OpRecord:
t_start: float # SimPy time (ns) — service start
t_end: float # SimPy time (ns) — service completion
component_id: str # e.g. "sip0.cube0.pe0.pe_gemm"
op_kind: str # "memory" | "gemm" | "math"
op_name: str # specific operation name
params: dict # per-operation parameters (see below)
dependency_ids: list[int] # currently based on in-memory record index, may be replaced with stable op_id in the future
```
#### dependency_ids Generation Rules
`dependency_ids` is **optional**, and by default the executor performs
address-based dependency inference (see D6).
Explicit setting is only needed when precise execution ordering is required:
- **Default (address-based inference)**: the executor analyzes read/write sets to
automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
- **Explicit setting**: set when logical dependencies cannot be expressed via addresses
at the TLContext or command generation stage.
Example: completion handle-based synchronization — handle dependencies depend on
logical completion order rather than memory addresses, so they cannot be captured
by address inference.
#### op_log Ordering
The op_log maintains **stable ordering** based on `t_start`.
Records with the same `t_start` preserve insertion order.
#### params Details
**memory (dma_read / dma_write)**:
```python
{
"src_addr": int, # source address (byte)
"dst_addr": int, # destination address (byte)
"nbytes": int, # transfer size
"src_space": str, # "hbm" | "tcm" | "sram"
"dst_space": str, # "hbm" | "tcm" | "sram"
}
```
**gemm**:
```python
{
"src_a_addr": int, # operand A address
"src_b_addr": int, # operand B address
"dst_addr": int, # output address
"shape_a": tuple, # e.g. (128, 256)
"shape_b": tuple, # e.g. (256, 128)
"shape_out": tuple, # e.g. (128, 128)
"dtype_in": str, # e.g. "f16"
"dtype_acc": str, # accumulation dtype, e.g. "f32"
"dtype_out": str, # output dtype, e.g. "f16"
"transpose_a": bool,
"transpose_b": bool,
"layout_a": str, # "row_major" | "col_major"
"layout_b": str,
"layout_out": str,
"addr_space": str, # "tcm" (GEMM operands are always in TCM)
}
```
**math**:
```python
{
"op": str, # "exp" | "add" | "sum" | "where" | ...
"input_addrs": list[int], # list of operand addresses
"input_shapes": list[tuple],
"dst_addr": int,
"shape_out": tuple,
"dtype": str,
"axis": int | None, # reduction axis
"addr_space": str, # "tcm"
}
```
### D6. Phase 2 Executor
Phase 2 executes the op_log outside of SimPy.
```python
class DataExecutor:
def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
self.store = initial_store # Takes the Phase 1 MemoryStore snapshot as input
def run(self):
for t, ops in groupby(op_log, key=lambda o: o.t_start):
batch = list(ops)
independent, sequential = self._classify(batch)
self._execute_parallel(independent)
self._execute_sequential(sequential)
```
**Parallel execution determination**:
Ops with the same `t_start` are considered **parallel candidates**.
The executor determines actual parallel execution based on the following criteria:
- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
- Whether predecessor ops specified in `dependency_ids` have completed
Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
**Batch optimization**: Only independent ops with the same op_name **and identical
shape, dtype, layout, and transpose flags** are eligible for batching.
Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
**Phase 2 execution order guarantee**:
Phase 2 does not consider data arrival timing,
and guarantees execution order solely through
dependencies (address-based inference + explicit dependency_ids).
### D7. Memory Store
`MemoryStore` logically follows byte-addressable semantics,
and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).
```python
class MemoryStore:
def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
```
**Internal storage format: numpy ndarray**
MemoryStore stores tensors as **numpy ndarrays**.
| Candidate | store/load speed | Phase 2 compute | Verdict |
|-----------|-----------------|-----------------|---------|
| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
| torch tensor | Immediate | torch operations available | Use only for GPU optimization |
- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
- read: **returns numpy array by reference** (no copy)
- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
- dtype uses numpy native (`np.float16`, `np.float32`, `np.bfloat16`, etc.)
- For byte-level access, convert via `.view(np.uint8)`
- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
**read/write contract**:
- read/write operates on a **contiguous tensor** basis.
If non-contiguous stride views are needed, express them as separate copy ops.
- In the normal benchmark path, producer/consumer dtype match is expected.
Reinterpret cast is a permissive behavior for low-level memory validation
or special test cases.
- addr is byte-aligned, with minimum alignment = dtype size.
- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
Shape mismatch is verified based on nbytes, and raises an error on mismatch.
- Correctness criteria follow address-range-based read/write semantics.
- A tensor object cache may be used as an implementation optimization,
but the canonical state is byte-addressable storage.
- At deploy time, the host injects initial tensor data.
### D8. Benchmark Kernel Code
The benchmark's **user code API is not changed**.
The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.
However, internal command/message schemas may be extended to include metadata
required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
### D9. No Component Changes
Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
Op log recording is the responsibility of the ComponentBase hook.
When custom components are replaced, only the timing model changes,
and Phase 2 data execution is unaffected.
### D10. Phase 2 is Optional
```python
engine = GraphEngine(graph)
engine.run(benchmark) # Phase 1: timing only
result = engine.get_timing_result()
if verify_data:
executor = DataExecutor(engine.op_log) # Phase 2: data
executor.run()
executor.verify(expected_output)
```
If only timing analysis is needed, Phase 2 is skipped.
If the op_logger is deactivated, Phase 1 performance is identical to the original.
### D11. Verification Contract
Basic verification **compares the final output tensor** against a reference backend (numpy).
Per-dtype tolerance policy:
| dtype | Comparison method | Tolerance |
|-------|----------|-----------|
| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
| int types | `np.array_equal` | exact |
- Default mode: compare final output only (end-to-end correctness)
- Debug mode: can compare intermediate tensors on a per-op basis
(MemoryStore snapshot at each op boundary)
---
## Non-goals
- **Compute-result-based control flow**: not supported.
All compute handles are in pending state during Phase 1,
`wait()` expresses timing synchronization only and does not imply data readiness.
Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
is **treated as an error**.
Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
Phase 1 materialization is a future extension (see D3).
- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
and do not reproduce the actual hardware PE microarchitecture.
## Open Questions
- **Aliasing / slice view**: How to represent slice/views referencing the same
backing storage in MemoryStore (stride-based view vs copy semantics)
- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
communication as memory ops or introduce a separate op_kind
- **Op log streaming**: Managing op_log memory usage in large-scale simulations
(in-memory list vs disk-backed streaming)
- **Fused operation**: Whether to record tl.composite's tiled pipeline
(READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
- **Math op schema generalization**: The current math params have a simple structure,
but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
scalar/immediate operands, where/mask expressions, etc.
- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
replacement with stable op_id is needed when introducing streaming/disk-backed mode
- **Phase 1 materialization policy**: See Future Extension in D3.
If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
needs to be defined
---
## Consequences
### Positive
- Minimal impact on SimPy simulation performance (only op_log append added)
- Free to use multi-threading/GPU in Phase 2
- Component replaceability preserved (ADR-0015 design philosophy maintained)
- No changes needed to benchmark user code API
- When adding new message types, only set the data_op flag
- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
- `tl.load()` returns actual data, making kernel debugging easier
### Negative
- op_log memory usage (for large-scale simulations)
- Phase 2 execution time is proportional to tensor size (large GEMM)
- Dynamic branching based on pending handles (incomplete computations) not possible
(computations execute in Phase 2, result values are undetermined in Phase 1).
Memory-data-based branching is supported via greenlet.
- greenlet C extension dependency added (pip install greenlet)
---
## Affected Files
| File | Change |
|------|--------|
| `src/kernbench/components/base.py` | Add `_on_process_start/end` hooks |
| `src/kernbench/common/pe_commands.py` | Add `data_op = True`, extend metadata fields |
| `src/kernbench/sim_engine/op_log.py` | New: OpRecord, OpLogger |
| `src/kernbench/sim_engine/data_executor.py` | New: DataExecutor, MemoryStore |
| `src/kernbench/sim_engine/engine.py` | op_logger injection (optional) |
| `src/kernbench/triton_emu/tl_context.py` | greenlet switch calls inside `tl.load()` etc. |
| `src/kernbench/triton_emu/kernel_runner.py` | New: KernelRunner (greenlet ↔ SimPy bridge) |
| `src/kernbench/components/builtin/pe_cpu.py` | Remove Phase 0, change to KernelRunner invocation |
| `pyproject.toml` | Add greenlet dependency |
Component implementation files (pe_gemm.py, pe_dma.py, hbm_ctrl.py, etc.): **no changes**
Benchmark kernels (benches/*.py): **no user API changes**
@@ -0,0 +1,537 @@
# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing
## Status
Proposed
## Context
### Problems with the Current Structure
pe_accel (SchedulerV2Component) hides 5 hardware blocks (DmaIn, DmaWb, Gemm, Math, Tcm)
**inside a single component**.
```
SchedulerV2Component (single topology node)
├── DmaInBlock ← directly connected via internal SimPy Store
├── DmaWbBlock ← not visible in topology
├── GemmBlock ← not replaceable
├── MathBlock ← not replaceable
└── TcmBlock ← not replaceable
```
Problems:
- Blocks directly reference the next block via `desc.next_block` — hardcoded routing
- Individual blocks cannot be replaced (violates ADR-0015 component replacement principle)
- PE internal structure is not visible in the topology
- GemmBlock and MathBlock each duplicate TCM load/store logic
### Actual Hardware Structure
```
HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
```
- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
- GEMM/MATH Engine: computation between Register Files (cycle-accurate)
- Completion signal: PE-internal 1-cycle wire signal (done pin assert)
---
## Decision
### D1. Separate Each Block into an Independent Component
The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
Existing 5 blocks + 1 Fetch/Store Unit = 6 components.
| Component | Role | HW Correspondence |
|-----------|------|-------------------|
| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
| PE_GEMM | MAC compute (register only) | MAC Array |
| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
| PE_TCM | BW-serialized scratchpad | SRAM Bank |
Each component exists as a topology node and is connected via ports/wires.
Replacing the `impl` allows changing the timing model of an individual block.
### D2. Token Self-Routing — Scheduler Handles Only Dispatch + Completion
**Components do not pass through the scheduler at every stage.**
The token carries a plan so that components chain directly to the next stage.
```
Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
↑ chaining: does not go through scheduler completion only
```
This matches the actual HW structure where each block's done signal is directly
connected to the next block via wire. The scheduler is responsible **only for
initial dispatch + completion aggregation**.
#### Stage Definition
```python
class StageType(Enum):
DMA_READ = 0
FETCH = 1
GEMM = 2
MATH = 3
STORE = 4
DMA_WRITE = 5
```
#### Plan Structure
When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
The plan defines the **stage sequence** for each tile:
```python
@dataclass
class Stage:
stage_type: StageType
component: str # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
params: dict # per-stage parameters (dynamic)
@dataclass(frozen=True)
class TilePlan:
tile_id: int
stages: tuple[Stage, ...] # list of stages to execute in order (immutable)
```
The stage sequence varies depending on the plan:
```python
# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)
# GEMM directly from TCM data (skip DMA read):
stages = (FETCH, GEMM, STORE, DMA_WRITE)
# MATH element-wise:
stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)
# GEMM + accumulation (intermediate K-tile, skip writeback):
stages = (DMA_READ, FETCH, GEMM, STORE) # store to TCM only
```
**Components do not hardcode the next component.**
They read the next stage from the token's plan and forward it directly via out_port.
This is the same pattern as a network packet carrying a routing header.
#### Pipeline Context
```python
@dataclass
class PipelineContext:
id: str
total_tiles: int
completed_tiles: int = 0
done_event: simpy.Event = None # succeeds when all tiles are complete
def complete_tile(self) -> None:
self.completed_tiles += 1
if self.completed_tiles == self.total_tiles:
self.done_event.succeed()
```
**Completion follows an exactly-once contract**: the last stage of each tile must call
`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
succeed only once (SimPy Event constraint).
#### Scheduler Role (Reduced)
When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext,
enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately.
Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
This feeder consumes `_pending_feeds` in FIFO order and
**does not allow tile feed interleaving across composite commands.**
That is, the feed for the next command begins only after all tiles of the current
command have been injected into the first stage queue.
There is **exactly one `_feed_loop`** per scheduler, and
tile feed for composite commands is performed exclusively through this single process.
Command issue order refers to **the order in which PE_SCHEDULER receives PeInternalTxn**.
This structure maintains command issue order while ensuring that when the first stage
queue is full, only the feeder process blocks — the scheduler worker's inbox processing
itself does not stall.
```python
class PeSchedulerV2(PeEngineBase):
_pipelines: dict[str, PipelineContext]
_pending_feeds: simpy.Store # FIFO of (plan, ctx)
def start(self, env):
super().start(env)
self._pending_feeds = simpy.Store(env)
env.process(self._feed_loop(env))
def _dispatch_composite(self, env, pe_txn, cmd):
plan = generate_plan(cmd)
ctx = PipelineContext(
id=next_id(),
total_tiles=len(plan.tiles),
done_event=pe_txn.done,
)
self._pipelines[ctx.id] = ctx
# only enqueue to feeder queue and return immediately
yield self._pending_feeds.put((plan, ctx))
def _feed_loop(self, env):
"""Single feeder process: feeds composite commands in FIFO order.
Tile feed interleaving across composite commands is not allowed.
The feed for the next command begins only after all tiles of the
current command have been injected into the first stage queue.
When the first stage queue is full, only this feeder blocks;
the scheduler worker's inbox processing does not stall.
"""
while True:
plan, ctx = yield self._pending_feeds.get()
for tile in plan.tiles:
token = TileToken(
tile_id=tile.tile_id,
pipeline_ctx=ctx,
plan=tile,
stage_idx=0,
params=tile.stages[0].params,
)
yield self.out_ports[tile.stages[0].component].put(token)
# queue capacity = HW queue depth → feeder blocks only when full
```
In this ADR, the scheduler can accept multiple composite commands,
but tile submission order follows per-command FIFO.
Within a command, tile-level pipeline overlap is allowed,
but tile feed interleaving across commands is not.
### D3. Data Transfer vs. Completion Signal — HW Modeling Criteria
| Communication Type | Method | HW Correspondence |
|-------------------|--------|-------------------|
| Tile token (work directive) | message via out_port | enqueue to command queue |
| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |
**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.
**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
is applied to intra-PE stage triggers. Chaining between components corresponds to
PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost
is incurred.
**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
When all tiles are complete, PipelineContext calls done_event.succeed().
### D4. Asynchronous Pipeline — Natural Overlap
The scheduler processes CompositeCmds **asynchronously**.
However, tile feed does not spawn an independent process per command; instead,
the scheduler's internal **single feeder process** performs the feed in FIFO order.
Therefore, the scheduler can continue to receive the next command,
but the first-stage tile injection order is guaranteed per command.
Since **SimPy Store capacity = HW queue depth**:
- When the queue is full, put() naturally blocks (backpressure)
- While DMA is processing tile 0, GEMM can start fetching an already-completed tile
- When a second CompositeCmd arrives, it is immediately queued to the DMA queue
```
First-stage feed order (feeder → DMA queue):
[cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
↑ cmd2 starts after cmd1 feed completes
Runtime pipeline (downstream overlap):
PE_DMA: [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
PE_FETCH: [cmd1:t0][cmd1:t1]...
PE_GEMM: [cmd1:t0][cmd1:t1]...
↑ pipeline overlap within the same command
```
Here, the overlap does not come from tile feed interleaving across different commands,
but occurs naturally as tiles from earlier commands progress to downstream stages
while the feeder continues injecting subsequent tiles.
For example, tile feed for cmd2 does not start until all tiles of cmd1 have been
injected into the first stage queue. However, while cmd1.tile0 has already progressed
to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so
**pipeline overlap within the same command occurs naturally**.
#### Component Chaining Pattern
All components follow the same pattern:
```python
def _pipeline_worker(self, env):
while True:
token = yield self._inbox.get()
# process own stage
yield from self._process(env, token)
# chain to next stage (read from plan)
next_idx = token.stage_idx + 1
if next_idx < len(token.plan.stages):
next_stage = token.plan.stages[next_idx]
token.stage_idx = next_idx
token.params = next_stage.params
yield self.out_ports[next_stage.component].put(token)
else:
# last stage — pipeline completion
token.pipeline_ctx.complete_tile()
```
### D5. PE_FETCH_STORE — Dedicated TCM ↔ Register File Transfer
Previously, GemmBlock and MathBlock each implemented their own TCM read/write.
This is separated into a **PE_FETCH_STORE component**.
```python
# PE_FETCH_STORE._process()
def _process(self, env, token):
yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
yield tcm_done
# chaining is handled by the base class (D4 pattern)
```
Advantages:
- GEMM/MATH perform **pure compute only** — no TCM access logic
- Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
- Prefetch strategies can be experimented with by replacing the fetch unit alone
### D6. Simplification of Each Compute Component
GEMM/MATH perform compute only with register data already prepared.
**Chaining follows the common pattern (D4), so only _process() needs to be implemented:**
```python
# PE_GEMM._process()
def _process(self, env, token):
yield env.timeout(self._mac_latency(token.params))
# PE_MATH._process()
def _process(self, env, token):
yield env.timeout(self._simd_latency(token.params))
# PE_FETCH_STORE._process()
def _process(self, env, token):
yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
yield tcm_done
# PE_DMA._process()
def _process(self, env, token):
yield from self._do_fabric_dma(token.params)
```
By replacing only the timing model, one can freely switch between cycle-accurate
and analytical models. Since the chaining logic resides in the base class,
each component only implements its pure stage logic.
### D7. Topology Changes
Add PE_FETCH_STORE to the PE template:
```yaml
pe_template:
components:
pe_cpu: { kind: pe_cpu, impl: pe_cpu_v1, ... }
pe_scheduler: { kind: pe_scheduler, impl: pe_scheduler_v2, ... }
pe_dma: { kind: pe_dma, impl: pe_dma_v1, ... }
pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
pe_gemm: { kind: pe_gemm, impl: pe_gemm_v1, ... }
pe_math: { kind: pe_math, impl: pe_math_v1, ... }
pe_mmu: { kind: pe_mmu, impl: pe_mmu_v1, ... }
pe_tcm: { kind: pe_tcm, impl: pe_tcm_v1, ... }
links:
# existing links...
fetch_store_to_tcm_bw_gbs: 512.0
fetch_store_to_tcm_mm: 0.0
```
PE internal edge connections:
```
PE_SCHEDULER → PE_DMA (initial dispatch)
PE_SCHEDULER → PE_FETCH_STORE (initial dispatch)
PE_SCHEDULER → PE_GEMM (initial dispatch)
PE_SCHEDULER → PE_MATH (initial dispatch)
PE_DMA → PE_FETCH_STORE (chaining)
PE_FETCH_STORE → PE_GEMM (chaining)
PE_FETCH_STORE → PE_MATH (chaining)
PE_GEMM → PE_FETCH_STORE (store chaining)
PE_MATH → PE_FETCH_STORE (store chaining)
PE_FETCH_STORE → PE_DMA (writeback chaining)
PE_FETCH_STORE → PE_TCM (BW request)
```
Topology edges encompass both **control/dispatch visibility + runtime chaining**.
Scheduler → sub-component edges are initial dispatch paths, while
inter-component edges are runtime chaining paths driven by token self-routing.
### D8. Existing Code Migration — Builtin Integration
The existing builtin v1 components and pe_accel are **replaced with new builtin components**.
#### Migration Strategy
1. Back up existing `components/builtin/``components/builtin_legacy/` (preserved without modification)
2. Back up existing `components/custom/pe_accel/` → likewise
3. Re-implement new `components/builtin/` with the ADR-0021 architecture
4. Maintain **only one** topology.yaml (including pe_fetch_store)
5. components.yaml points to the new builtin
```yaml
# components.yaml — new builtin
pe_scheduler_v1: kernbench.components.builtin.pe_scheduler:PeSchedulerComponent
pe_gemm_v1: kernbench.components.builtin.pe_gemm:PeGemmComponent
pe_math_v1: kernbench.components.builtin.pe_math:PeMathComponent
pe_dma_v1: kernbench.components.builtin.pe_dma:PeDmaComponent
pe_fetch_store_v1: kernbench.components.builtin.pe_fetch_store:PeFetchStoreComponent
pe_tcm_v1: kernbench.components.builtin.pe_tcm:PeTcmComponent
```
The impl names (pe_gemm_v1, etc.) are preserved, but **the implementations are replaced
with the ADR-0021 architecture**. Existing benchmarks and tests referencing topology.yaml
continue to work without changes.
#### Latency Model Inheritance
The latency modeling of the new builtin components (MAC cycle calculation, SIMD latency,
TCM BW serialization, DMA fabric latency, etc.) is **based on the current pe_accel
implementation**. The tile schedule generation logic from tiling.py is also carried over.
Only the architecture (component separation, self-routing) changes; timing accuracy
is preserved.
#### Test Strategy
#### Test Plan
**1. Existing test pass** (regression):
After migration is complete, all existing tests (366) must pass.
**2. Latency regression**:
Verify that the new builtin produces identical latency for the same inputs as pe_accel.
**3. Phase 1 → Phase 2 end-to-end**:
Integration test from SimPy simulation (Phase 1) op_log generation → DataExecutor
(Phase 2) actual numpy computation → result correctness verification.
- GEMM: tl.composite(gemm) → op_log → Phase 2 matmul → allclose verification
- MATH: tl.exp / tl.add, etc. → op_log → Phase 2 numpy op → allclose verification
- Chaining: GEMM output → MATH input → final result end-to-end verification
**4. TileToken self-routing**:
- Verify that tiles chain according to the plan's stage sequence
- Verify PipelineContext.complete_tile() exactly-once at the last stage
- Queue backpressure: verify that only the feeder blocks when DMA queue capacity is exceeded
**5. Asynchronous pipeline overlap**:
- Verify that inter-tile stage overlap occurs within the same command (tile0 in GEMM while tile1 in DMA)
- Multiple commands: verify that cmd2 feed starts after cmd1 feed completes (FIFO order)
### D9. TileToken Message Definition
A message used for passing tile work between components.
The token carries the plan and stage index, enabling self-routing.
```python
@dataclass
class TileToken:
tile_id: int
pipeline_ctx: PipelineContext # completion tracking
plan: TilePlan # full stage sequence for this tile (immutable)
stage_idx: int # current stage index in plan.stages
params: dict # current stage parameter cache (canonical: plan.stages[stage_idx].params)
data_op: bool = True # op_log recording target (ADR-0020)
```
A TileToken is **owned by exactly one component at a time** and
is never referenced by multiple components simultaneously (single-owner).
Token lifecycle:
1. Scheduler creates it with stage_idx=0 and puts it to the first stage component
2. The component executes _process(), increments stage_idx, and puts it to the next component
3. The last stage component calls pipeline_ctx.complete_tile()
4. When all tiles are complete, PipelineContext calls done_event.succeed()
Relationship with existing PeInternalTxn:
- PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged)
- TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)
---
## Non-goals
- **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is not modified
(PeInternalTxn-based, ADR-0014 maintained)
- **Resource contention model across multiple pipelines**: the current scope focuses on
accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines
are future work.
- **builtin_legacy maintenance**: kept for backup purposes only; not a target for
bug fixes or feature additions.
## Open Questions
- **Register File capacity model**: whether to model capacity limits when the fetch unit
loads into registers. Capacity is expressed in bytes (register_file_bytes), and
the number of tiles that can be held simultaneously is determined by tile size.
When capacity is exceeded, fetch stalls, creating natural backpressure.
- **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite
commands. Therefore, overlap arises not from pre-injection across commands, but
naturally from pipeline progression of tiles within the same command.
If additional prefetch is needed, it should be considered at the level of tile ordering
within the same command or fetch/store unit policy, not cross-command injection.
- **PE_DMA coalescing**: per-tile DMA may cause fragmentation.
Direction is to merge/coalesce within DMA without scheduler involvement.
- **Synchronous execution mode**: this ADR adopts asynchronous pipeline as the
default/sole execution model. If a sync mode is needed for debug or validation
purposes, it will be considered in a future ADR.
- **TCM bank conflict across multiple pipelines**: currently based on a single pipeline.
Bank conflict modeling when multiple pipelines simultaneously access TCM is future work.
---
## Consequences
### Positive
- Each block is an independent component — individually replaceable (ADR-0015 compliant)
- PE internal structure is visible in the topology
- Components do not know the next component — plan-based routing provides flexibility
- Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
- Improved HW modeling accuracy (done signal = Event, data transfer = message)
- Fetch/store separation enables accurate TCM BW contention modeling
### Negative
- Increased number of PE internal components (5 → 6) — more topology nodes/edges
- Component separation makes intra-PE token forwarding more explicit than before
- Breaking change from existing builtin/pe_accel — migration required
---
## Affected Files
| File | Change |
|------|--------|
| `topology.yaml` | Add pe_fetch_store component, add chaining edges |
| `components.yaml` | Register new builtin components |
| `src/kernbench/topology/builder.py` | Add fetch_store + chaining edges to PE internal edges |
| `src/kernbench/common/pe_commands.py` | Add TileToken definition |
| `src/kernbench/components/builtin/pe_scheduler.py` | Re-implement (feeder + plan-based dispatch) |
| `src/kernbench/components/builtin/pe_gemm.py` | Re-implement (TileToken, _process pattern) |
| `src/kernbench/components/builtin/pe_math.py` | Re-implement (TileToken, _process pattern) |
| `src/kernbench/components/builtin/pe_dma.py` | Re-implement (TileToken, _process pattern) |
| `src/kernbench/components/builtin/pe_fetch_store.py` | New |
| `src/kernbench/components/builtin/pe_tcm.py` | Re-implement (TcmRequest service) |
| `src/kernbench/components/builtin/types.py` | New: TilePlan, Stage, StageType, PipelineContext, TileToken |
| `src/kernbench/components/builtin/tiling.py` | Ported from pe_accel: plan generation logic |
Backup:
| `src/kernbench/components/builtin_legacy/` | Full backup of existing builtin (preserved without modification) |
| `src/kernbench/components/custom/pe_accel/` | Backup of existing pe_accel (preserved without modification) |