Add English translations for ADR-0018, 0019, 0020, 0021
- ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping - ADR-0019: CUBE NOC per-channel and aggregated HBM connection model - ADR-0020: 2-pass data execution model (timing/data separation, greenlet) - ADR-0021: PE pipeline refactor (component separation + token self-routing) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,441 @@
|
||||
# ADR-0018: LA-Based Memory Address Abstraction and HBM Channel Mapping Mode Introduction
|
||||
|
||||
## Status
|
||||
|
||||
Proposed
|
||||
|
||||
## Context
|
||||
|
||||
Kernbench simulates memory access between PE_DMA and Local-HBM within a CUBE.
|
||||
Currently, a VA-based access path is used; however, the following two channel mapping models
|
||||
are difficult to represent consistently.
|
||||
|
||||
### Background: Local-HBM Pseudo Channel Structure
|
||||
|
||||
The HBM in a CUBE consists of 32 or 64 pseudo channels.
|
||||
In the PE-Local-HBM model, each PE is responsible for an equal number of pseudo channels.
|
||||
|
||||
Example: 64 pseudo channels, 8 PEs per cube -> each PE accesses 8 pseudo channels as local HBM
|
||||
|
||||
Both the number of pseudo channels and the number of PEs are topology parameters.
|
||||
`N = hbm_pseudo_channels / pes_per_cube` (= channels_per_pe) determines
|
||||
the number of local channels per PE.
|
||||
|
||||
The routing path BW between DMA and each pseudo channel matches the BW of each pseudo channel
|
||||
(e.g., 32 GB/s), so if a PE sends simultaneous requests to N channels, it can utilize the
|
||||
maximum memory BW.
|
||||
|
||||
### Limitations of the Current VA Model
|
||||
|
||||
When channels are divided into 8, requests must also be generated per channel and sent to DMA.
|
||||
However, in the current architecture, the kernel generates requests with VA (`tl.load`)
|
||||
and passes them directly to DMA, making it difficult for PE_CPU to generate per-channel DMA requests.
|
||||
|
||||
Therefore, instead of VA, we propose using **Logical Address (LA)**,
|
||||
where the **BAAW (Logical-to-Physical Mapping Unit)** inside PE_DMA
|
||||
converts LA to PA or a list of PAs based on segment-based mapping.
|
||||
|
||||
### Two Channel Mapping Modes
|
||||
|
||||
- **1:1 mode**: Creates and executes per-channel requests. Precise per-channel modeling.
|
||||
- **n:1 mode (default)**: Assumes interleaving across local HBM channels. Aggregated BW modeling.
|
||||
|
||||
By supporting both modes, the overhead of the n:1 mode can be measured and evaluated.
|
||||
|
||||
### Core Requirements
|
||||
|
||||
- The effective bandwidth semantics of PE_DMA -> HBM_CTRL must be identical in both modes
|
||||
- The difference must only be in the request representation and resource modeling approach
|
||||
- The kernel programming model must not be changed
|
||||
- Physical channel information must not be exposed to the kernel
|
||||
|
||||
### Existing Physical Address
|
||||
|
||||
The current system's 51-bit Physical Address is defined in `policy/address/phyaddr.py`:
|
||||
|
||||
```
|
||||
[50:47] rack_id (4 bit)
|
||||
[46:43] sip_id (4 bit)
|
||||
[42:38] cube_id (5 bit, sip_seg)
|
||||
[37] hbm_selector (1=HBM window)
|
||||
[36:0] hbm_offset (37 bit, 128GB per cube)
|
||||
```
|
||||
|
||||
PA is used to represent the final routable canonical physical destination,
|
||||
and this role is preserved.
|
||||
However, the timing and policy of logical access -> physical request conversion are not clearly separated.
|
||||
|
||||
---
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Introduction of LA (Logical Address) — Replacing VA
|
||||
|
||||
The existing VA (Virtual Address) infrastructure is replaced with LA (Logical Address).
|
||||
|
||||
#### Characteristics of LA
|
||||
|
||||
- Like VA, tensors can be mapped to a contiguous memory space
|
||||
- Represents logical buffer + offset
|
||||
- Does not directly contain physical channel information
|
||||
- An intermediate abstraction maintained until physical resolution
|
||||
- The sole address scheme used by kernel code (`tl.load`, `tl.store`, `tl.composite`)
|
||||
|
||||
#### LA Space Definition
|
||||
|
||||
| Item | Value |
|
||||
|------|-------|
|
||||
| LA start address | `0x1_0000_0000` (4 GB, preserving the existing VA start point) |
|
||||
| LA space size | 64 GB per PE |
|
||||
| Alignment unit | Segment-based (see D3 below) |
|
||||
|
||||
LA is a PE-local address space.
|
||||
Even if different PEs use the same LA value, they resolve to different PAs
|
||||
because each PE has a different BAAW segment table.
|
||||
|
||||
#### VA Infrastructure Removal Scope
|
||||
|
||||
With the introduction of LA, the following existing code will be replaced/removed:
|
||||
|
||||
| Removal Target | Replacement |
|
||||
|----------------|-------------|
|
||||
| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, name/role changed) |
|
||||
| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
|
||||
| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
|
||||
| `runtime_api/kernel.py`: MmuMapMsg, MmuUnmapMsg | Replaced with BaawSegmentInstallMsg |
|
||||
| `runtime_api/context.py`: VA alloc + MMU mapping install | LA alloc + BAAW segment install |
|
||||
| `runtime_api/tensor.py`: `va_base` field | `la_base` field |
|
||||
| `topology.yaml`: pe_mmu component entry | Removed |
|
||||
|
||||
---
|
||||
|
||||
### D2. Mapping Mode Configuration
|
||||
|
||||
The mapping mode is configured at the cube level in topology.yaml:
|
||||
|
||||
```yaml
|
||||
cube:
|
||||
memory_map:
|
||||
hbm_mapping_mode: n_to_one # one_to_one | n_to_one
|
||||
hbm_pseudo_channels: 64 # total pseudo channel count
|
||||
hbm_channels_per_pe: 8 # local channel count per PE
|
||||
hbm_channel_bw_gbs: 32.0 # per-channel bandwidth
|
||||
```
|
||||
|
||||
This configuration is referenced during graph compilation (topology builder) and BAAW initialization.
|
||||
|
||||
---
|
||||
|
||||
### D3. Segments and BAAW
|
||||
|
||||
#### Segment Definition
|
||||
|
||||
A segment is a logical allocation unit that partitions the LA space so that each segment
|
||||
maps to a specific HBM channel or channel group.
|
||||
|
||||
Segments are created by the runtime allocator during tensor deployment,
|
||||
and BAAW uses them to convert LA into physical requests.
|
||||
|
||||
#### BAAW Segment Table Entry
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class BaawSegment:
|
||||
la_base: int # segment start LA
|
||||
la_size: int # segment size (bytes)
|
||||
mode: str # "one_to_one" | "n_to_one"
|
||||
# 1:1 mode fields
|
||||
channel_count: int # number of channels assigned to this segment (e.g., 8)
|
||||
pa_bases: list[int] # per-channel PA start address list (len = channel_count)
|
||||
channel_ids: list[int] # per-channel logical IDs (e.g., [0,1,2,...,7])
|
||||
channel_size: int # per-channel size (la_size // channel_count)
|
||||
# n:1 mode fields
|
||||
agg_pa_base: int # aggregated PA start address
|
||||
agg_node_id: str # aggregated router node_id (for routing)
|
||||
```
|
||||
|
||||
#### Segment Lifecycle
|
||||
|
||||
1. **Allocation time** (tensor deploy):
|
||||
- RuntimeContext allocates LA space from the LA allocator
|
||||
- PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1)
|
||||
- Sends `BaawSegmentInstallMsg` to PE_DMA to register in the segment table
|
||||
|
||||
2. **Usage time** (kernel execution):
|
||||
- Kernel issues `tl.load(la_ptr)` -> DmaReadCmd(src_addr=LA)
|
||||
- PE_DMA looks up the segment corresponding to the LA in BAAW
|
||||
- Converts to PA(s) according to the mode
|
||||
|
||||
3. **Deallocation time** (tensor free):
|
||||
- Removed from the segment table
|
||||
- LA space returned, PA deallocated
|
||||
|
||||
---
|
||||
|
||||
### D4. BAAW (Logical-to-Physical Mapping Unit)
|
||||
|
||||
#### Location
|
||||
|
||||
BAAW is placed as a front-end stage inside PE_DMA.
|
||||
It is not a separate SimPy component; it is synchronous address resolution logic
|
||||
executed at the beginning of PE_DMA's `handle_command()`.
|
||||
|
||||
#### Input
|
||||
|
||||
- LA (Logical Address) — DmaReadCmd.src_addr or DmaWriteCmd.dst_addr
|
||||
- access size (bytes)
|
||||
|
||||
#### Output
|
||||
|
||||
- 1:1 mode: `list[PhysicalRequest]` — each request is (PA, nbytes, channel_node_id)
|
||||
- n:1 mode: 1 `PhysicalRequest` — (agg_PA, nbytes, agg_node_id)
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class PhysicalRequest:
|
||||
pa: int # 51-bit Physical Address
|
||||
nbytes: int # transfer size for this request
|
||||
dst_node: str # target node_id (channel router or aggregated router)
|
||||
```
|
||||
|
||||
#### BAAW Resolve Logic
|
||||
|
||||
```python
|
||||
def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
|
||||
seg = self._find_segment(la) # la_base <= la < la_base + la_size
|
||||
offset = la - seg.la_base
|
||||
|
||||
if seg.mode == "n_to_one":
|
||||
pa = seg.agg_pa_base + offset
|
||||
return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
|
||||
|
||||
elif seg.mode == "one_to_one":
|
||||
requests = []
|
||||
per_ch_size = seg.channel_size
|
||||
for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)):
|
||||
ch_offset = offset % per_ch_size # interleaved or striped
|
||||
ch_nbytes = nbytes // seg.channel_count
|
||||
pa = pa_base + ch_offset
|
||||
dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
|
||||
requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
|
||||
return requests
|
||||
```
|
||||
|
||||
#### Scope of Responsibility
|
||||
|
||||
BAAW is responsible for:
|
||||
- Converting logical accesses into physical request units
|
||||
- Performing fan-out (1:1) or pass-through (n:1) according to the mapping mode
|
||||
- Generating Physical Addresses and determining target nodes
|
||||
|
||||
BAAW is NOT responsible for:
|
||||
- Performing actual data movement
|
||||
- Executing NOC routing
|
||||
- Simulating bandwidth consumption (this is the role of downstream components)
|
||||
|
||||
#### Output Contract
|
||||
|
||||
The output of BAAW must be request units that can be directly used by the simulator's
|
||||
routing and resource model without any additional address decoding.
|
||||
|
||||
---
|
||||
|
||||
### D5. PE_DMA handle_command() Changes
|
||||
|
||||
#### Current Flow (VA-based)
|
||||
|
||||
```
|
||||
DmaReadCmd.src_addr (VA)
|
||||
-> MMU.translate(VA) -> PA
|
||||
-> PhysAddr.decode(PA) -> PhysAddr object
|
||||
-> resolver.resolve(PhysAddr) -> dst_node_id (e.g., "sip0.cube0.hbm_ctrl")
|
||||
-> router.find_path(pe_prefix, dst_node_id) -> path
|
||||
-> 1 sub-Transaction created -> fabric inject
|
||||
```
|
||||
|
||||
#### New Flow (LA-based)
|
||||
|
||||
```
|
||||
DmaReadCmd.src_addr (LA)
|
||||
-> BAAW.resolve(LA, nbytes) -> list[PhysicalRequest]
|
||||
-> For each PhysicalRequest:
|
||||
-> router.find_path(pe_prefix, req.dst_node) -> path
|
||||
-> compute_drain_ns(path, req.nbytes) -> drain
|
||||
-> sub-Transaction created -> fabric inject
|
||||
-> Wait for all sub-Transactions to complete
|
||||
-> pe_txn.done.succeed()
|
||||
```
|
||||
|
||||
Key changes:
|
||||
- MMU reference removed -> replaced with BAAW resolve
|
||||
- PhysAddr.decode() + resolver.resolve() -> BAAW directly returns dst_node
|
||||
- 1 request -> N requests injected in parallel (1:1 mode)
|
||||
|
||||
---
|
||||
|
||||
### D6. 1:1 Mode Details
|
||||
|
||||
- One logical access -> N (= `channels_per_pe`) physical requests
|
||||
- N is a parameter determined by `hbm_pseudo_channels / pes_per_cube`
|
||||
- Each request:
|
||||
- Fully resolved 51-bit PA
|
||||
- Targets a specific channel router (`{pe_prefix}.ch_r{channel_id}`)
|
||||
- BW contention modeling via per-channel links
|
||||
- PE_DMA injects N sub-transactions simultaneously
|
||||
|
||||
#### 1:1 Mode Example
|
||||
|
||||
Configuration: `hbm_pseudo_channels=64`, `pes_per_cube=8`
|
||||
-> `channels_per_pe=8`, PE0 owns ch0-7
|
||||
|
||||
```text
|
||||
Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
|
||||
BAAW segment: {
|
||||
la_base: 0x1_0000_0000, la_size: 4096,
|
||||
mode: "one_to_one", channel_count: 8, # = channels_per_pe
|
||||
pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
|
||||
channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
|
||||
channel_size: 512, # = la_size / channel_count
|
||||
}
|
||||
|
||||
BAAW resolve result (N=8 requests):
|
||||
-> PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
|
||||
-> PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
|
||||
-> ...
|
||||
-> PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
|
||||
|
||||
PE_DMA: N sub-transactions injected in parallel
|
||||
Each accesses HBM via channel router -> hbm_ctrl link (channel_bw_gbs)
|
||||
Total effective BW = N x channel_bw_gbs
|
||||
```
|
||||
|
||||
Examples with different N values:
|
||||
- `hbm_pseudo_channels=32`, `pes_per_cube=8` -> `channels_per_pe=4`, 4 requests
|
||||
- `hbm_pseudo_channels=64`, `pes_per_cube=4` -> `channels_per_pe=16`, 16 requests
|
||||
|
||||
---
|
||||
|
||||
### D7. n:1 Mode Details
|
||||
|
||||
- One logical access -> one aggregated request
|
||||
- Target: aggregated router -> hbm_ctrl (see ADR-0019)
|
||||
- Aggregated link BW = `channels_per_pe` x `channel_bw_gbs` (e.g., 8 x 32 = 256 GB/s)
|
||||
- Modeled as a single queue / resource
|
||||
- No per-channel PA decomposition
|
||||
|
||||
#### n:1 Mode Example
|
||||
|
||||
```
|
||||
Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
|
||||
BAAW segment: {
|
||||
la_base: 0x1_0000_0000, la_size: 4096,
|
||||
mode: "n_to_one",
|
||||
agg_pa_base: PA_agg,
|
||||
agg_node_id: "sip0.cube0.pe0.agg_router",
|
||||
}
|
||||
|
||||
BAAW resolve result:
|
||||
-> PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
|
||||
|
||||
PE_DMA: 1 sub-transaction injected
|
||||
Accesses HBM via aggregated router -> hbm_ctrl link (256 GB/s)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### D8. Kernel Model Preservation
|
||||
|
||||
- The kernel still issues only single memory ops (`tl.load`, `tl.store`, `tl.composite`)
|
||||
- LA is the address scheme passed to the kernel
|
||||
- Channel decomposition/aggregation is performed by BAAW inside PE_DMA
|
||||
- Physical channel information is not exposed to kernel code
|
||||
|
||||
---
|
||||
|
||||
## Consequences
|
||||
|
||||
### Positive
|
||||
|
||||
- 1:1 vs n:1 semantics are clearly separated at a single point: BAAW
|
||||
- Kernel abstraction is preserved — no kernel code changes required
|
||||
- Topology-based policy control is possible (mode switching via yaml)
|
||||
- Improved simulation model consistency and debuggability
|
||||
- Segment-based mapping is simpler and has lower overhead compared to page tables
|
||||
|
||||
### Negative
|
||||
|
||||
- Full refactoring of VA/MMU-based code is required
|
||||
- Increased complexity in the request generation path (managing N requests in 1:1 mode)
|
||||
- Reduced per-channel visibility in n:1 mode
|
||||
- Existing VA-related tests must be rewritten
|
||||
|
||||
---
|
||||
|
||||
## Alternatives
|
||||
|
||||
### A1. Keep VA + Fan-out at MMU
|
||||
|
||||
- Extend MMU to return per-channel PAs
|
||||
- Problem: MMU's role expands beyond address translation to include request decomposition
|
||||
- Problem: Aggregation representation is difficult in n:1 mode
|
||||
|
||||
### A2. Kernel Generates Channel-Aware Requests
|
||||
|
||||
- Kernel directly calls per-channel load/store
|
||||
- Problem: Abstraction leakage, reduced portability
|
||||
- Problem: All benchmark code must be modified
|
||||
|
||||
### A3. Always Use PA (Without LA)
|
||||
|
||||
- Runtime directly passes per-channel PA to the kernel
|
||||
- Problem: Conflicts with the aggregation model
|
||||
- Problem: Conversion timing is unclear, channel information exposed to kernel
|
||||
|
||||
---
|
||||
|
||||
## Implementation Notes
|
||||
|
||||
### Implementation Order
|
||||
|
||||
1. Introduce LA type (`policy/address/la_allocator.py`)
|
||||
2. Implement BAAW segment table (`policy/address/baaw.py`)
|
||||
3. Add `BaawSegmentInstallMsg` message type (`runtime_api/kernel.py`)
|
||||
4. Integrate BAAW into PE_DMA (`components/builtin/pe_dma.py` handle_command changes)
|
||||
5. Modify RuntimeContext: LA alloc + segment install (`runtime_api/context.py`)
|
||||
6. Change Tensor.va_base -> la_base (`runtime_api/tensor.py`)
|
||||
7. Remove VA/MMU code
|
||||
8. Remove pe_mmu from topology.yaml, add mapping mode configuration
|
||||
9. Test migration
|
||||
|
||||
### Affected Existing Tests
|
||||
|
||||
| Test File | Impact |
|
||||
|-----------|--------|
|
||||
| `tests/test_mmu_component.py` | Remove -> replace with BAAW segment install test |
|
||||
| `tests/test_mmu_fabric.py` | Remove -> replace with BAAW + fabric integration test |
|
||||
| `tests/test_pe_mmu.py` | Remove |
|
||||
| `tests/test_va_allocator.py` | Replace with LA allocator test |
|
||||
| `tests/test_va_integration.py` | Replace with LA + BAAW integration test |
|
||||
| `tests/test_va_offset.py` | Replace with LA offset test |
|
||||
|
||||
---
|
||||
|
||||
## Test Requirements
|
||||
|
||||
- For the same logical access:
|
||||
- 1:1 -> verify N requests are generated
|
||||
- n:1 -> verify 1 aggregated request is generated
|
||||
- Verify effective bandwidth consistency across both modes
|
||||
- 1:1 -> verify per-channel contention modeling
|
||||
- n:1 -> verify aggregated bandwidth is reflected
|
||||
- Verify operation without kernel code changes
|
||||
- Verify correct BAAW segment install/uninstall operation
|
||||
- Verify no conflicts when multiple tensors are assigned to different segments
|
||||
|
||||
---
|
||||
|
||||
## Links
|
||||
|
||||
- ADR-0011 (Memory Addressing Simplification — PA-first, VA/MMU introduction) -> superseded by this ADR
|
||||
- ADR-0019 (NOC Per-Channel HBM Connection Model) -> topology-side integration
|
||||
- ADR-0014 (PE Internal Execution Model) -> PE_DMA change impact
|
||||
@@ -0,0 +1,431 @@
|
||||
# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC
|
||||
|
||||
## Status
|
||||
|
||||
Proposed
|
||||
|
||||
## Context
|
||||
|
||||
ADR-0018 introduced LA-based address abstraction and BAAW,
|
||||
defining how a logical memory access is translated into the following two forms of requests:
|
||||
|
||||
- 1:1 mode: one logical access → N per-channel requests
|
||||
- n:1 mode: one logical access → one aggregated request
|
||||
|
||||
Here N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`),
|
||||
determined by topology parameters.
|
||||
|
||||
### Problems with the Existing Structure
|
||||
|
||||
In the current implementation (`topology/builder.py`):
|
||||
|
||||
- PE_DMA → NOC → xbar_top/xbar_bot → HBM_CTRL.slice{0-7} path is used
|
||||
- HBM is modeled as 8 slice (= per-PE) nodes
|
||||
- Local/remote access use different paths:
|
||||
- local: NOC → xbar → HBM slice
|
||||
- cross-half: NOC → xbar_top → bridge → xbar_bot → HBM slice
|
||||
- remote cube: NOC → UCIe → remote NOC → remote xbar → remote HBM slice
|
||||
|
||||
Limitations of this structure:
|
||||
|
||||
- Cannot model at the pseudo-channel granularity (slice = per-PE granularity, not per-channel)
|
||||
- xbar/bridge bifurcate local/remote paths
|
||||
- Cannot express 1:1 / n:1 modes consistently
|
||||
|
||||
---
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. HBM Attaches to PE Routers
|
||||
|
||||
Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**,
|
||||
and attach the HBM access point to the same router where the PE is attached.
|
||||
|
||||
- n:1 mode: PE's local HBM access goes directly from its own router (switching overhead only, 0 hops)
|
||||
- Remote PE's HBM access: reaches the target PE's router via mesh hops
|
||||
- The read/write resource model within the HBM controller is preserved
|
||||
|
||||
Node naming changes:
|
||||
|
||||
| Current | After Change |
|
||||
| ---- | ------- |
|
||||
| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |
|
||||
|
||||
In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that
|
||||
the builder generates an edge between that router and hbm_ctrl.
|
||||
|
||||
---
|
||||
|
||||
### D2. Complete Removal of xbar, bridge, and Single NOC Node
|
||||
|
||||
Remove all of the following nodes and related edges:
|
||||
|
||||
- `{cube}.xbar_top`, `{cube}.xbar_bot`
|
||||
- `{cube}.bridge.left`, `{cube}.bridge.right`
|
||||
- `{cube}.noc` (single TwoDMeshNocComponent node)
|
||||
- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
|
||||
- Edges of type `xbar_to_bridge`, `bridge_to_xbar`
|
||||
- Edges of type `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`, etc. referencing the single noc node
|
||||
|
||||
Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**.
|
||||
Each router (r0c0, r0c1, ...) from the 6x6 router grid generated by `mesh_gen.py`
|
||||
is created as a separate SimPy node in the topology graph,
|
||||
and adjacent routers are connected via XY mesh edges.
|
||||
|
||||
---
|
||||
|
||||
### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)
|
||||
|
||||
#### Router Nodes Based on cube_mesh.yaml
|
||||
|
||||
Each non-null router from cube_mesh.yaml generated by `mesh_gen.py`
|
||||
is created as a **separate SimPy node** in the topology graph.
|
||||
|
||||
- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
|
||||
- kind: `noc_router`, impl: `forwarding_v1`
|
||||
- pos_mm: taken from cube_mesh.yaml
|
||||
|
||||
Based on the attach information in cube_mesh.yaml, components are connected to each router:
|
||||
- `pe{p}.dma` → PE_DMA ↔ router edge
|
||||
- `pe{p}.cpu` → PE_CPU ↔ router edge
|
||||
- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
|
||||
- `m_cpu` → M_CPU ↔ router edge
|
||||
- `sram` → SRAM ↔ router edge
|
||||
- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge
|
||||
|
||||
Router-to-router XY mesh edges: bidirectional edges between adjacent routers.
|
||||
Null routers (HBM exclusion zones) are skipped.
|
||||
|
||||
#### 1:1 Mode Extension (To Be Implemented Later)
|
||||
|
||||
In 1:1 mode, each router differentiates into N channel mini-routers.
|
||||
Per-channel routing and ChannelSplitter (LA → per-channel PA) introduction are required.
|
||||
N GEMM engines per PE are also added at this point.
|
||||
|
||||
---
|
||||
|
||||
### D4. Cross-PE HBM Access (n:1 Mode)
|
||||
|
||||
In n:1 mode, when a PE accesses another PE's local HBM,
|
||||
it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.
|
||||
|
||||
Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:
|
||||
|
||||
```text
|
||||
PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
|
||||
```
|
||||
|
||||
The Dijkstra router finds the shortest path in the mesh.
|
||||
|
||||
Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3.
|
||||
|
||||
---
|
||||
|
||||
### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh
|
||||
|
||||
In n:1 mode, no separate "aggregated router" is created.
|
||||
The existing router grid from cube_mesh.yaml serves that role.
|
||||
|
||||
#### Connection Structure
|
||||
|
||||
PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:
|
||||
|
||||
```text
|
||||
sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0 (bw: N × channel_bw_gbs)
|
||||
sip0.cube0.hbm_ctrl ←→ sip0.cube0.r0c0 (bw: N × channel_bw_gbs)
|
||||
```
|
||||
|
||||
Routers are connected via XY mesh edges. PE's local HBM access goes
|
||||
directly from its own router (switching overhead only).
|
||||
|
||||
#### n:1 Mode Full Data Paths
|
||||
|
||||
**Local HBM (0 hops):**
|
||||
```text
|
||||
PE0.pe_dma → r0c0 → hbm_ctrl (switching overhead only)
|
||||
```
|
||||
|
||||
**Remote HBM (mesh hops):**
|
||||
```text
|
||||
PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
|
||||
```
|
||||
|
||||
**M_CPU DMA:**
|
||||
```text
|
||||
M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### D6. All Traffic Is Unified onto the Same Router Mesh
|
||||
|
||||
- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
|
||||
- Local access does not use a separate fast path (xbar)
|
||||
- Cross-cube (remote) access path:
|
||||
|
||||
```text
|
||||
PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
|
||||
→ [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
|
||||
```
|
||||
|
||||
UCIe connections maintain the existing structure,
|
||||
but both endpoints become mesh routers instead of xbars.
|
||||
|
||||
The number of UCIe lines is determined by BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.
|
||||
|
||||
---
|
||||
|
||||
### D7. AddressResolver Changes
|
||||
|
||||
Current `AddressResolver.resolve()`:
|
||||
|
||||
```python
|
||||
# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
|
||||
pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
|
||||
return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
|
||||
```
|
||||
|
||||
After change:
|
||||
|
||||
```python
|
||||
# Changed: HBM → single endpoint
|
||||
return f"sip{s}.cube{c}.hbm_ctrl"
|
||||
```
|
||||
|
||||
The pe_slice calculation is removed.
|
||||
In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.
|
||||
|
||||
resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility.
|
||||
|
||||
---
|
||||
|
||||
### D8. topology.yaml Configuration Changes
|
||||
|
||||
#### Added Settings
|
||||
|
||||
```yaml
|
||||
cube:
|
||||
memory_map:
|
||||
hbm_mapping_mode: n_to_one # one_to_one | n_to_one
|
||||
hbm_pseudo_channels: 64 # total pseudo channel count
|
||||
hbm_channels_per_pe: 8 # local channels per PE (= pseudo_channels / pes_per_cube)
|
||||
hbm_channel_bw_gbs: 32.0 # per-channel bandwidth (GB/s)
|
||||
hbm_total_gb_per_cube: 48 # retained
|
||||
```
|
||||
|
||||
#### Removed Settings
|
||||
|
||||
```yaml
|
||||
# To be removed
|
||||
links:
|
||||
xbar_to_hbm_bw_gbs: 256.0 # → replaced by channel_bw_gbs × channels_per_pe
|
||||
xbar_to_hbm_mm: 2.5 # → replaced by ch_router_to_hbm_mm
|
||||
xbar_to_bridge_bw_gbs: 128.0 # → removed (no bridge)
|
||||
xbar_to_bridge_mm: 3.0 # → removed
|
||||
noc_to_xbar_bw_gbs: ... # → removed
|
||||
noc_to_xbar_mm: ... # → removed
|
||||
```
|
||||
|
||||
#### Added Link Settings
|
||||
|
||||
```yaml
|
||||
links:
|
||||
router_link_bw_gbs: 256.0 # XY mesh link BW between routers
|
||||
router_overhead_ns: 2.0 # router switching overhead
|
||||
pe_to_router_bw_gbs: 256.0 # PE_DMA ↔ router
|
||||
hbm_to_router_bw_gbs: 256.0 # HBM ↔ router (= N × channel_bw)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### D9. Bandwidth Numerical Consistency
|
||||
|
||||
| Configuration | Value |
|
||||
| ---- | --- |
|
||||
| pseudo channels per cube | 64 (parameter) |
|
||||
| PEs per cube | 8 (parameter) |
|
||||
| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
|
||||
| per-channel BW | 32 GB/s (parameter) |
|
||||
| per-PE local BW | N × 32 = 256 GB/s |
|
||||
| cube total HBM BW | 64 × 32 = 2048 GB/s |
|
||||
|
||||
The effective BW per PE is identical in both modes:
|
||||
|
||||
- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
|
||||
- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s
|
||||
|
||||
---
|
||||
|
||||
## Consequences
|
||||
|
||||
### Positive
|
||||
|
||||
- The router mesh based on cube_mesh.yaml accurately reflects physical placement
|
||||
- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
|
||||
- Local / remote / command traffic is unified onto the same mesh, resulting in simplicity
|
||||
- Aligns well with graph compiler-based topology generation
|
||||
- Channel count and PE count are both parameterized, enabling testing of various configurations
|
||||
- 1:1 mode extension naturally follows through router differentiation
|
||||
|
||||
### Negative
|
||||
|
||||
- The number of SimPy nodes increases due to explicit router nodes (6x6 = up to 32 routers/cube)
|
||||
- Requires complete rewrite of existing xbar/bridge/single NOC-based tests
|
||||
- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model
|
||||
|
||||
---
|
||||
|
||||
## Alternatives
|
||||
|
||||
### A1. Retain Existing xbar + HBM Slices
|
||||
|
||||
- Local/remote paths remain bifurcated
|
||||
- Cannot model at pseudo-channel granularity
|
||||
- Cannot switch between 1:1/n:1 modes
|
||||
|
||||
### A2. Always Generate Per-Channel Links and Aggregate Only in n:1
|
||||
|
||||
- Topology structure always has 1:1 size
|
||||
- Expressing n:1 semantics via link aggregation is complex
|
||||
- No reduction in router node count
|
||||
|
||||
### A3. Gradual Transition (Retain xbar + Add NOC Path)
|
||||
|
||||
- Higher compatibility, but dual-path coexistence increases complexity
|
||||
- Since xbar removal is ultimately necessary, the intermediate step provides little value
|
||||
|
||||
---
|
||||
|
||||
## Implementation Notes
|
||||
|
||||
### topology/builder.py Change Details
|
||||
|
||||
#### Code to Remove (within current `_instantiate_cube()`)
|
||||
|
||||
- xbar_top, xbar_bot node creation (~line 495-508)
|
||||
- bridge.left, bridge.right node creation
|
||||
- noc ↔ xbar edge creation (~line 540-555)
|
||||
- xbar ↔ hbm_ctrl.slice edge creation (~line 510-538)
|
||||
- xbar ↔ bridge edge creation (~line 557-572)
|
||||
|
||||
#### Code to Add
|
||||
|
||||
1:1 mode:
|
||||
|
||||
```python
|
||||
N = hbm_channels_per_pe # from topology config
|
||||
total_ch = hbm_pseudo_channels
|
||||
|
||||
# Create channel router nodes
|
||||
for ch_id in range(total_ch):
|
||||
pe_id = ch_id // N
|
||||
nodes[f"{cp}.ch_r{ch_id}"] = Node(
|
||||
id=f"{cp}.ch_r{ch_id}", kind="noc_router", impl="noc_v1",
|
||||
attrs={}, pos_mm=(...), # horizontal row = ch_id % N
|
||||
)
|
||||
|
||||
# PE_DMA ↔ local channel router edges
|
||||
for pe_id in range(pes_per_cube):
|
||||
for local_ch in range(N):
|
||||
ch_id = pe_id * N + local_ch
|
||||
edges.append(Edge(
|
||||
src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.ch_r{ch_id}",
|
||||
bw_gbs=channel_bw, kind="pe_to_ch_router", ...))
|
||||
edges.append(Edge(
|
||||
src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.pe{pe_id}.pe_dma",
|
||||
bw_gbs=channel_bw, kind="ch_router_to_pe", ...))
|
||||
|
||||
# Channel router ↔ hbm_ctrl edges
|
||||
for ch_id in range(total_ch):
|
||||
edges.append(Edge(
|
||||
src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.hbm_ctrl",
|
||||
bw_gbs=channel_bw, kind="ch_router_to_hbm", ...))
|
||||
edges.append(Edge(
|
||||
src=f"{cp}.hbm_ctrl", dst=f"{cp}.ch_r{ch_id}",
|
||||
bw_gbs=channel_bw, kind="hbm_to_ch_router", ...))
|
||||
|
||||
# Horizontal line edges (same logical index)
|
||||
for row in range(N):
|
||||
for p in range(pes_per_cube - 1):
|
||||
ch_a = p * N + row
|
||||
ch_b = (p + 1) * N + row
|
||||
edges.append(Edge(
|
||||
src=f"{cp}.ch_r{ch_a}", dst=f"{cp}.ch_r{ch_b}",
|
||||
bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
|
||||
edges.append(Edge(
|
||||
src=f"{cp}.ch_r{ch_b}", dst=f"{cp}.ch_r{ch_a}",
|
||||
bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
|
||||
```
|
||||
|
||||
n:1 mode:
|
||||
|
||||
```python
|
||||
# Create aggregated router nodes
|
||||
for pe_id in range(pes_per_cube):
|
||||
nodes[f"{cp}.pe{pe_id}.agg_router"] = Node(
|
||||
id=f"{cp}.pe{pe_id}.agg_router", kind="noc_router", impl="noc_v1",
|
||||
attrs={}, pos_mm=(...),
|
||||
)
|
||||
|
||||
agg_bw = N * channel_bw # aggregated BW
|
||||
|
||||
# PE_DMA ↔ aggregated router
|
||||
for pe_id in range(pes_per_cube):
|
||||
edges.append(Edge(
|
||||
src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.pe{pe_id}.agg_router",
|
||||
bw_gbs=agg_bw, kind="pe_to_agg_router", ...))
|
||||
edges.append(Edge(
|
||||
src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.pe{pe_id}.pe_dma",
|
||||
bw_gbs=agg_bw, kind="agg_router_to_pe", ...))
|
||||
|
||||
# Aggregated router ↔ hbm_ctrl
|
||||
for pe_id in range(pes_per_cube):
|
||||
edges.append(Edge(
|
||||
src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.hbm_ctrl",
|
||||
bw_gbs=agg_bw, kind="agg_to_hbm", ...))
|
||||
edges.append(Edge(
|
||||
src=f"{cp}.hbm_ctrl", dst=f"{cp}.pe{pe_id}.agg_router",
|
||||
bw_gbs=agg_bw, kind="hbm_to_agg", ...))
|
||||
|
||||
# Horizontal links between aggregated routers
|
||||
for p in range(pes_per_cube - 1):
|
||||
edges.append(Edge(
|
||||
src=f"{cp}.pe{p}.agg_router", dst=f"{cp}.pe{p+1}.agg_router",
|
||||
bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
|
||||
edges.append(Edge(
|
||||
src=f"{cp}.pe{p+1}.agg_router", dst=f"{cp}.pe{p}.agg_router",
|
||||
bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
|
||||
```
|
||||
|
||||
### Affected Existing Tests
|
||||
|
||||
| Test File | Impact |
|
||||
| ---------- | ---- |
|
||||
| `tests/test_topology_compile.py` | Remove xbar/bridge node references, add channel router verification |
|
||||
| `tests/test_topology_load.py` | Reflect topology.yaml configuration changes |
|
||||
| `tests/test_pe_components.py` | PE_DMA routing path changes |
|
||||
| `tests/test_sip_parallel.py` | Cross-PE access path changes |
|
||||
| Cases that directly test xbar/bridge | Remove |
|
||||
|
||||
---
|
||||
|
||||
## Test Requirements
|
||||
|
||||
- Verify that requests are delivered via per-channel links in 1:1 mode
|
||||
- Verify that requests are delivered via the aggregated link in n:1 mode
|
||||
- Verify that topology is correctly generated in both modes:
|
||||
- 1:1: `total_ch` channel routers + per-PE links + horizontal links
|
||||
- n:1: `pes_per_cube` aggregated routers + per-PE links
|
||||
- Verify that effective BW is consistent across both modes for the same workload
|
||||
- Verify that horizontal line routing works for cross-PE access
|
||||
- Verify that routing through UCIe works for cross-cube access
|
||||
- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)
|
||||
|
||||
---
|
||||
|
||||
## Links
|
||||
|
||||
- ADR-0018 (LA + BAAW) → addressing-side integration
|
||||
- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
|
||||
- ADR-0004 (Memory Semantics) → BW model redefinition
|
||||
- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
|
||||
@@ -0,0 +1,553 @@
|
||||
# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
|
||||
|
||||
## Status
|
||||
|
||||
Proposed
|
||||
|
||||
## Context
|
||||
|
||||
The current simulation models **timing only**.
|
||||
`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
|
||||
but do not actually read tensor data or perform computations.
|
||||
|
||||
### Required Capabilities
|
||||
|
||||
1. Must be able to store and read actual data in HBM/TCM/SRAM
|
||||
2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
|
||||
3. Must minimize simulation performance degradation
|
||||
|
||||
### Limitations of the Existing Kernel Execution Structure
|
||||
|
||||
The current kernel execution is separated into 3 stages:
|
||||
|
||||
```
|
||||
Phase 0: Kernel function execution in TLContext → PeCommand list generation (outside SimPy, no data)
|
||||
Phase 1: PE_CPU replays PeCommand list via SimPy (timing only)
|
||||
```
|
||||
|
||||
Phase 0 requires the kernel to **complete execution entirely** before SimPy begins.
|
||||
`tl.load()` returns a TensorHandle (placeholder), so actual data cannot be accessed.
|
||||
Therefore, branching based on data values (dynamic control flow) is impossible.
|
||||
|
||||
This ADR resolves this limitation **for memory operations only** (see D1, D3).
|
||||
|
||||
### Constraints
|
||||
|
||||
- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
|
||||
- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
|
||||
- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
|
||||
- Kernel functions must remain plain Python functions (no generator/async transformation)
|
||||
|
||||
### Design Exploration Results
|
||||
|
||||
| Option | Approach | Verdict |
|
||||
|--------|----------|---------|
|
||||
| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
|
||||
| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
|
||||
| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
|
||||
| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |
|
||||
|
||||
---
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. 2-Pass Execution Model — Phase 0 Elimination
|
||||
|
||||
The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.
|
||||
|
||||
Before:
|
||||
```
|
||||
Phase 0: Kernel → PeCommand list (no data, no branching)
|
||||
Phase 1: Replay PeCommand list via SimPy (timing only)
|
||||
```
|
||||
|
||||
After:
|
||||
```
|
||||
Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
|
||||
- Memory read/write: SimPy timing + MemoryStore actual data
|
||||
- Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
|
||||
- Dynamic control flow possible (tl.load returns actual data)
|
||||
|
||||
Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
|
||||
```
|
||||
|
||||
This ADR **extends Phase 1 to be data-aware for memory operations only**.
|
||||
Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
|
||||
Phase 2 handles GEMM/Math computation correctness verification.
|
||||
Phase 2 is optional — if only timing is needed, run Phase 1 alone.
|
||||
|
||||
### D2. Op Log Recording — ComponentBase Hook
|
||||
|
||||
Op log recording is performed as a **hook in the component base class**.
|
||||
Individual component implementations are not modified.
|
||||
|
||||
```python
|
||||
class ComponentBase:
|
||||
def _on_process_start(self, env, msg):
|
||||
if self._op_logger and getattr(msg, 'data_op', False):
|
||||
self._op_logger.record_start(env.now, self.node.id, msg)
|
||||
|
||||
def _on_process_end(self, env, msg):
|
||||
if self._op_logger and getattr(msg, 'data_op', False):
|
||||
self._op_logger.record_end(env.now, self.node.id, msg)
|
||||
```
|
||||
|
||||
Hooks are called before and after `run()` within `_forward_txn()`.
|
||||
`_op_logger` is optional — zero overhead when absent.
|
||||
|
||||
**Hook timing definitions**:
|
||||
|
||||
| Timing | Meaning |
|
||||
|--------|---------|
|
||||
| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
|
||||
| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |
|
||||
|
||||
Link traversal latency is not included in t_start/t_end.
|
||||
Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
|
||||
|
||||
### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
|
||||
|
||||
The existing Phase 0 (kernel → PeCommand list) is eliminated,
|
||||
and **greenlet** is used to cooperatively interleave kernel and SimPy execution.
|
||||
|
||||
#### Operating Principle
|
||||
|
||||
greenlet is a C extension that provides cooperative context switching.
|
||||
When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
|
||||
to perform timing simulation, and after completion, returns to the kernel with actual data.
|
||||
|
||||
```
|
||||
SimPy loop (parent greenlet) Kernel (child greenlet)
|
||||
───────────────────────── ──────────────────────
|
||||
g.switch() ─────────────────────────→ Kernel starts
|
||||
a = tl.load(ptr, ...)
|
||||
internal: parent.switch(DmaReadCmd)
|
||||
cmd = DmaReadCmd ←────────────────── (kernel paused)
|
||||
yield DmaReadMsg(...)
|
||||
yield env.timeout(dma_latency)
|
||||
data = memory_store.read(...)
|
||||
g.switch(data) ─────────────────────→ (kernel resumed)
|
||||
a = data ← actual numpy array
|
||||
if a[0][0] > 0.5: ← branching possible
|
||||
...
|
||||
```
|
||||
|
||||
The kernel is maintained as a **plain Python function**.
|
||||
greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
|
||||
|
||||
#### KernelRunner — Framework Layer
|
||||
|
||||
The greenlet loop resides not in the PE_CPU component but in the framework layer,
|
||||
**KernelRunner**.
|
||||
|
||||
```python
|
||||
# KernelRunner (framework — greenlet ↔ SimPy bridge)
|
||||
class KernelRunner:
|
||||
def run(self, env, kernel_fn, args, store):
|
||||
g = greenlet(self._run_kernel)
|
||||
cmd = g.switch(kernel_fn, args)
|
||||
|
||||
while cmd is not None:
|
||||
if isinstance(cmd, DmaReadCmd):
|
||||
yield from self._dispatch_dma(env, cmd)
|
||||
data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
|
||||
cmd = g.switch(data) # resume with actual data
|
||||
elif isinstance(cmd, GemmCmd):
|
||||
yield from self._dispatch_gemm(env, cmd)
|
||||
cmd = g.switch() # resume (no data)
|
||||
elif isinstance(cmd, DmaWriteCmd):
|
||||
store.write(cmd.dst_addr, cmd.data) # visibility = issue time
|
||||
yield from self._dispatch_dma(env, cmd) # timing only
|
||||
cmd = g.switch()
|
||||
|
||||
# PE_CPU (component — kept simple, unaware of greenlet)
|
||||
def _execute_kernel(self, env):
|
||||
runner = KernelRunner(self.ctx)
|
||||
yield from runner.run(env, kernel_fn, args, store)
|
||||
```
|
||||
|
||||
**Op logging single source of truth**: KernelRunner does not record directly to op_log.
|
||||
All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
|
||||
When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
|
||||
the component base class hooks automatically record them.
|
||||
|
||||
**Layer separation**:
|
||||
- **Kernel code**: plain function, unaware of greenlet
|
||||
- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
|
||||
- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
|
||||
- **ComponentBase hook**: the sole path for op_log recording
|
||||
- **PE_CPU**: only calls KernelRunner, replaceable as a component
|
||||
|
||||
#### Handling Differences Between Memory Read/Write and Compute
|
||||
|
||||
| Operation | In Phase 1 | In Phase 2 |
|
||||
|-----------|-----------|-----------|
|
||||
| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
|
||||
| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
|
||||
| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
|
||||
| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |
|
||||
|
||||
Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
|
||||
GEMM/Math operations are batch-executed in Phase 2 (performance separation).
|
||||
|
||||
#### Store Visibility Rule
|
||||
|
||||
`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
|
||||
SimPy DMA timing is simulated separately afterward.
|
||||
|
||||
This is an intentional separation of timing and visibility:
|
||||
- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
|
||||
- **timing**: the point at which DMA latency completes in SimPy
|
||||
|
||||
This separation allows a load immediately after a store to see the latest data in dynamic control flow.
|
||||
|
||||
#### Result Handle Semantics
|
||||
|
||||
`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.
|
||||
|
||||
The key contract in Phase 1:
|
||||
|
||||
1. **All compute handles are always considered pending in Phase 1.**
|
||||
2. `tl.wait(handle)` **expresses timing synchronization only**
|
||||
and does not make the handle ready.
|
||||
3. Accessing the handle's actual result data (`handle.data`, element access,
|
||||
numpy conversion, etc.) is **only possible in Phase 2**.
|
||||
4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
|
||||
5. In contrast, `tl.load()` returns actual data in Phase 1, so
|
||||
**memory-read-based control flow is supported**.
|
||||
|
||||
| Handle state | Phase | Allowed operations |
|
||||
|------------|-------|----------|
|
||||
| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
|
||||
| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
|
||||
| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
|
||||
| ready | Phase 2 | Actual numpy data access, verification |
|
||||
|
||||
This restriction is intentional. If computations were executed in Phase 1,
|
||||
the SimPy single-thread would block, defeating the purpose of 2-pass separation.
|
||||
|
||||
#### Phase 1 Materialization — Future Extension
|
||||
|
||||
If Phase 1 eager execution becomes necessary for small operations
|
||||
(scalar, small reduction) in the future, selective materialization can be supported
|
||||
by adding a `materialized_in_phase1: bool` flag to the op record.
|
||||
This is not implemented in the current scope.
|
||||
|
||||
### D4. data_op Flag — Message Self-Declaration
|
||||
|
||||
The logging target is determined by the `data_op` attribute on the message instance,
|
||||
not by message type. The framework does not hardcode message types.
|
||||
|
||||
```python
|
||||
class MsgBase:
|
||||
data_op: bool = False # default: no logging
|
||||
|
||||
class DmaReadCmd(MsgBase):
|
||||
data_op = True # memory transfer → logging
|
||||
|
||||
class GemmCmd(MsgBase):
|
||||
data_op = True # compute → logging
|
||||
|
||||
class MathCmd(MsgBase):
|
||||
data_op = True # compute → logging
|
||||
```
|
||||
|
||||
When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
|
||||
enables automatic logging without modifying framework code.
|
||||
|
||||
### D5. Op Log Structure
|
||||
|
||||
#### Op Classification Scheme
|
||||
|
||||
A two-level classification is used:
|
||||
|
||||
| Level | Field | Role |
|
||||
|-------|-------|------|
|
||||
| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
|
||||
| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |
|
||||
|
||||
#### OpRecord Definition
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class OpRecord:
|
||||
t_start: float # SimPy time (ns) — service start
|
||||
t_end: float # SimPy time (ns) — service completion
|
||||
component_id: str # e.g. "sip0.cube0.pe0.pe_gemm"
|
||||
op_kind: str # "memory" | "gemm" | "math"
|
||||
op_name: str # specific operation name
|
||||
params: dict # per-operation parameters (see below)
|
||||
dependency_ids: list[int] # currently based on in-memory record index, may be replaced with stable op_id in the future
|
||||
```
|
||||
|
||||
#### dependency_ids Generation Rules
|
||||
|
||||
`dependency_ids` is **optional**, and by default the executor performs
|
||||
address-based dependency inference (see D6).
|
||||
|
||||
Explicit setting is only needed when precise execution ordering is required:
|
||||
- **Default (address-based inference)**: the executor analyzes read/write sets to
|
||||
automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
|
||||
- **Explicit setting**: set when logical dependencies cannot be expressed via addresses
|
||||
at the TLContext or command generation stage.
|
||||
Example: completion handle-based synchronization — handle dependencies depend on
|
||||
logical completion order rather than memory addresses, so they cannot be captured
|
||||
by address inference.
|
||||
|
||||
#### op_log Ordering
|
||||
|
||||
The op_log maintains **stable ordering** based on `t_start`.
|
||||
Records with the same `t_start` preserve insertion order.
|
||||
|
||||
#### params Details
|
||||
|
||||
**memory (dma_read / dma_write)**:
|
||||
```python
|
||||
{
|
||||
"src_addr": int, # source address (byte)
|
||||
"dst_addr": int, # destination address (byte)
|
||||
"nbytes": int, # transfer size
|
||||
"src_space": str, # "hbm" | "tcm" | "sram"
|
||||
"dst_space": str, # "hbm" | "tcm" | "sram"
|
||||
}
|
||||
```
|
||||
|
||||
**gemm**:
|
||||
```python
|
||||
{
|
||||
"src_a_addr": int, # operand A address
|
||||
"src_b_addr": int, # operand B address
|
||||
"dst_addr": int, # output address
|
||||
"shape_a": tuple, # e.g. (128, 256)
|
||||
"shape_b": tuple, # e.g. (256, 128)
|
||||
"shape_out": tuple, # e.g. (128, 128)
|
||||
"dtype_in": str, # e.g. "f16"
|
||||
"dtype_acc": str, # accumulation dtype, e.g. "f32"
|
||||
"dtype_out": str, # output dtype, e.g. "f16"
|
||||
"transpose_a": bool,
|
||||
"transpose_b": bool,
|
||||
"layout_a": str, # "row_major" | "col_major"
|
||||
"layout_b": str,
|
||||
"layout_out": str,
|
||||
"addr_space": str, # "tcm" (GEMM operands are always in TCM)
|
||||
}
|
||||
```
|
||||
|
||||
**math**:
|
||||
```python
|
||||
{
|
||||
"op": str, # "exp" | "add" | "sum" | "where" | ...
|
||||
"input_addrs": list[int], # list of operand addresses
|
||||
"input_shapes": list[tuple],
|
||||
"dst_addr": int,
|
||||
"shape_out": tuple,
|
||||
"dtype": str,
|
||||
"axis": int | None, # reduction axis
|
||||
"addr_space": str, # "tcm"
|
||||
}
|
||||
```
|
||||
|
||||
### D6. Phase 2 Executor
|
||||
|
||||
Phase 2 executes the op_log outside of SimPy.
|
||||
|
||||
```python
|
||||
class DataExecutor:
|
||||
def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
|
||||
self.store = initial_store # Takes the Phase 1 MemoryStore snapshot as input
|
||||
|
||||
def run(self):
|
||||
for t, ops in groupby(op_log, key=lambda o: o.t_start):
|
||||
batch = list(ops)
|
||||
independent, sequential = self._classify(batch)
|
||||
self._execute_parallel(independent)
|
||||
self._execute_sequential(sequential)
|
||||
```
|
||||
|
||||
**Parallel execution determination**:
|
||||
|
||||
Ops with the same `t_start` are considered **parallel candidates**.
|
||||
The executor determines actual parallel execution based on the following criteria:
|
||||
- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
|
||||
- Whether predecessor ops specified in `dependency_ids` have completed
|
||||
|
||||
Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
|
||||
|
||||
**Batch optimization**: Only independent ops with the same op_name **and identical
|
||||
shape, dtype, layout, and transpose flags** are eligible for batching.
|
||||
Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
|
||||
Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
|
||||
|
||||
**Phase 2 execution order guarantee**:
|
||||
|
||||
Phase 2 does not consider data arrival timing,
|
||||
and guarantees execution order solely through
|
||||
dependencies (address-based inference + explicit dependency_ids).
|
||||
|
||||
### D7. Memory Store
|
||||
|
||||
`MemoryStore` logically follows byte-addressable semantics,
|
||||
and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).
|
||||
|
||||
```python
|
||||
class MemoryStore:
|
||||
def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
|
||||
def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
|
||||
```
|
||||
|
||||
**Internal storage format: numpy ndarray**
|
||||
|
||||
MemoryStore stores tensors as **numpy ndarrays**.
|
||||
|
||||
| Candidate | store/load speed | Phase 2 compute | Verdict |
|
||||
|-----------|-----------------|-----------------|---------|
|
||||
| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
|
||||
| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
|
||||
| torch tensor | Immediate | torch operations available | Use only for GPU optimization |
|
||||
|
||||
- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
|
||||
- read: **returns numpy array by reference** (no copy)
|
||||
- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
|
||||
- dtype uses numpy native (`np.float16`, `np.float32`, `np.bfloat16`, etc.)
|
||||
- For byte-level access, convert via `.view(np.uint8)`
|
||||
- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
|
||||
|
||||
**read/write contract**:
|
||||
|
||||
- read/write operates on a **contiguous tensor** basis.
|
||||
If non-contiguous stride views are needed, express them as separate copy ops.
|
||||
- In the normal benchmark path, producer/consumer dtype match is expected.
|
||||
Reinterpret cast is a permissive behavior for low-level memory validation
|
||||
or special test cases.
|
||||
- addr is byte-aligned, with minimum alignment = dtype size.
|
||||
- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
|
||||
Shape mismatch is verified based on nbytes, and raises an error on mismatch.
|
||||
- Correctness criteria follow address-range-based read/write semantics.
|
||||
- A tensor object cache may be used as an implementation optimization,
|
||||
but the canonical state is byte-addressable storage.
|
||||
- At deploy time, the host injects initial tensor data.
|
||||
|
||||
### D8. Benchmark Kernel Code
|
||||
|
||||
The benchmark's **user code API is not changed**.
|
||||
The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.
|
||||
|
||||
However, internal command/message schemas may be extended to include metadata
|
||||
required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
|
||||
|
||||
### D9. No Component Changes
|
||||
|
||||
Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
|
||||
Op log recording is the responsibility of the ComponentBase hook.
|
||||
When custom components are replaced, only the timing model changes,
|
||||
and Phase 2 data execution is unaffected.
|
||||
|
||||
### D10. Phase 2 is Optional
|
||||
|
||||
```python
|
||||
engine = GraphEngine(graph)
|
||||
engine.run(benchmark) # Phase 1: timing only
|
||||
result = engine.get_timing_result()
|
||||
|
||||
if verify_data:
|
||||
executor = DataExecutor(engine.op_log) # Phase 2: data
|
||||
executor.run()
|
||||
executor.verify(expected_output)
|
||||
```
|
||||
|
||||
If only timing analysis is needed, Phase 2 is skipped.
|
||||
If the op_logger is deactivated, Phase 1 performance is identical to the original.
|
||||
|
||||
### D11. Verification Contract
|
||||
|
||||
Basic verification **compares the final output tensor** against a reference backend (numpy).
|
||||
|
||||
Per-dtype tolerance policy:
|
||||
|
||||
| dtype | Comparison method | Tolerance |
|
||||
|-------|----------|-----------|
|
||||
| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
|
||||
| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
|
||||
| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
|
||||
| int types | `np.array_equal` | exact |
|
||||
|
||||
- Default mode: compare final output only (end-to-end correctness)
|
||||
- Debug mode: can compare intermediate tensors on a per-op basis
|
||||
(MemoryStore snapshot at each op boundary)
|
||||
|
||||
---
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **Compute-result-based control flow**: not supported.
|
||||
All compute handles are in pending state during Phase 1,
|
||||
`wait()` expresses timing synchronization only and does not imply data readiness.
|
||||
Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
|
||||
is **treated as an error**.
|
||||
Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
|
||||
Phase 1 materialization is a future extension (see D3).
|
||||
- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
|
||||
the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
|
||||
- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
|
||||
and do not reproduce the actual hardware PE microarchitecture.
|
||||
|
||||
## Open Questions
|
||||
|
||||
- **Aliasing / slice view**: How to represent slice/views referencing the same
|
||||
backing storage in MemoryStore (stride-based view vs copy semantics)
|
||||
- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
|
||||
communication as memory ops or introduce a separate op_kind
|
||||
- **Op log streaming**: Managing op_log memory usage in large-scale simulations
|
||||
(in-memory list vs disk-backed streaming)
|
||||
- **Fused operation**: Whether to record tl.composite's tiled pipeline
|
||||
(READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
|
||||
- **Math op schema generalization**: The current math params have a simple structure,
|
||||
but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
|
||||
scalar/immediate operands, where/mask expressions, etc.
|
||||
- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
|
||||
replacement with stable op_id is needed when introducing streaming/disk-backed mode
|
||||
- **Phase 1 materialization policy**: See Future Extension in D3.
|
||||
If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
|
||||
needs to be defined
|
||||
|
||||
---
|
||||
|
||||
## Consequences
|
||||
|
||||
### Positive
|
||||
|
||||
- Minimal impact on SimPy simulation performance (only op_log append added)
|
||||
- Free to use multi-threading/GPU in Phase 2
|
||||
- Component replaceability preserved (ADR-0015 design philosophy maintained)
|
||||
- No changes needed to benchmark user code API
|
||||
- When adding new message types, only set the data_op flag
|
||||
- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
|
||||
- `tl.load()` returns actual data, making kernel debugging easier
|
||||
|
||||
### Negative
|
||||
|
||||
- op_log memory usage (for large-scale simulations)
|
||||
- Phase 2 execution time is proportional to tensor size (large GEMM)
|
||||
- Dynamic branching based on pending handles (incomplete computations) not possible
|
||||
(computations execute in Phase 2, result values are undetermined in Phase 1).
|
||||
Memory-data-based branching is supported via greenlet.
|
||||
- greenlet C extension dependency added (pip install greenlet)
|
||||
|
||||
---
|
||||
|
||||
## Affected Files
|
||||
|
||||
| File | Change |
|
||||
|------|--------|
|
||||
| `src/kernbench/components/base.py` | Add `_on_process_start/end` hooks |
|
||||
| `src/kernbench/common/pe_commands.py` | Add `data_op = True`, extend metadata fields |
|
||||
| `src/kernbench/sim_engine/op_log.py` | New: OpRecord, OpLogger |
|
||||
| `src/kernbench/sim_engine/data_executor.py` | New: DataExecutor, MemoryStore |
|
||||
| `src/kernbench/sim_engine/engine.py` | op_logger injection (optional) |
|
||||
| `src/kernbench/triton_emu/tl_context.py` | greenlet switch calls inside `tl.load()` etc. |
|
||||
| `src/kernbench/triton_emu/kernel_runner.py` | New: KernelRunner (greenlet ↔ SimPy bridge) |
|
||||
| `src/kernbench/components/builtin/pe_cpu.py` | Remove Phase 0, change to KernelRunner invocation |
|
||||
| `pyproject.toml` | Add greenlet dependency |
|
||||
|
||||
Component implementation files (pe_gemm.py, pe_dma.py, hbm_ctrl.py, etc.): **no changes**
|
||||
Benchmark kernels (benches/*.py): **no user API changes**
|
||||
@@ -0,0 +1,537 @@
|
||||
# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing
|
||||
|
||||
## Status
|
||||
|
||||
Proposed
|
||||
|
||||
## Context
|
||||
|
||||
### Problems with the Current Structure
|
||||
|
||||
pe_accel (SchedulerV2Component) hides 5 hardware blocks (DmaIn, DmaWb, Gemm, Math, Tcm)
|
||||
**inside a single component**.
|
||||
|
||||
```
|
||||
SchedulerV2Component (single topology node)
|
||||
├── DmaInBlock ← directly connected via internal SimPy Store
|
||||
├── DmaWbBlock ← not visible in topology
|
||||
├── GemmBlock ← not replaceable
|
||||
├── MathBlock ← not replaceable
|
||||
└── TcmBlock ← not replaceable
|
||||
```
|
||||
|
||||
Problems:
|
||||
- Blocks directly reference the next block via `desc.next_block` — hardcoded routing
|
||||
- Individual blocks cannot be replaced (violates ADR-0015 component replacement principle)
|
||||
- PE internal structure is not visible in the topology
|
||||
- GemmBlock and MathBlock each duplicate TCM load/store logic
|
||||
|
||||
### Actual Hardware Structure
|
||||
|
||||
```
|
||||
HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
|
||||
```
|
||||
|
||||
- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
|
||||
- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
|
||||
- GEMM/MATH Engine: computation between Register Files (cycle-accurate)
|
||||
- Completion signal: PE-internal 1-cycle wire signal (done pin assert)
|
||||
|
||||
---
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Separate Each Block into an Independent Component
|
||||
|
||||
The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
|
||||
Existing 5 blocks + 1 Fetch/Store Unit = 6 components.
|
||||
|
||||
| Component | Role | HW Correspondence |
|
||||
|-----------|------|-------------------|
|
||||
| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
|
||||
| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
|
||||
| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
|
||||
| PE_GEMM | MAC compute (register only) | MAC Array |
|
||||
| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
|
||||
| PE_TCM | BW-serialized scratchpad | SRAM Bank |
|
||||
|
||||
Each component exists as a topology node and is connected via ports/wires.
|
||||
Replacing the `impl` allows changing the timing model of an individual block.
|
||||
|
||||
### D2. Token Self-Routing — Scheduler Handles Only Dispatch + Completion
|
||||
|
||||
**Components do not pass through the scheduler at every stage.**
|
||||
The token carries a plan so that components chain directly to the next stage.
|
||||
|
||||
```
|
||||
Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
|
||||
↑ chaining: does not go through scheduler completion only
|
||||
```
|
||||
|
||||
This matches the actual HW structure where each block's done signal is directly
|
||||
connected to the next block via wire. The scheduler is responsible **only for
|
||||
initial dispatch + completion aggregation**.
|
||||
|
||||
#### Stage Definition
|
||||
|
||||
```python
|
||||
class StageType(Enum):
|
||||
DMA_READ = 0
|
||||
FETCH = 1
|
||||
GEMM = 2
|
||||
MATH = 3
|
||||
STORE = 4
|
||||
DMA_WRITE = 5
|
||||
```
|
||||
|
||||
#### Plan Structure
|
||||
|
||||
When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
|
||||
The plan defines the **stage sequence** for each tile:
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class Stage:
|
||||
stage_type: StageType
|
||||
component: str # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
|
||||
params: dict # per-stage parameters (dynamic)
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class TilePlan:
|
||||
tile_id: int
|
||||
stages: tuple[Stage, ...] # list of stages to execute in order (immutable)
|
||||
```
|
||||
|
||||
The stage sequence varies depending on the plan:
|
||||
|
||||
```python
|
||||
# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
|
||||
stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)
|
||||
|
||||
# GEMM directly from TCM data (skip DMA read):
|
||||
stages = (FETCH, GEMM, STORE, DMA_WRITE)
|
||||
|
||||
# MATH element-wise:
|
||||
stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)
|
||||
|
||||
# GEMM + accumulation (intermediate K-tile, skip writeback):
|
||||
stages = (DMA_READ, FETCH, GEMM, STORE) # store to TCM only
|
||||
```
|
||||
|
||||
**Components do not hardcode the next component.**
|
||||
They read the next stage from the token's plan and forward it directly via out_port.
|
||||
This is the same pattern as a network packet carrying a routing header.
|
||||
|
||||
#### Pipeline Context
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class PipelineContext:
|
||||
id: str
|
||||
total_tiles: int
|
||||
completed_tiles: int = 0
|
||||
done_event: simpy.Event = None # succeeds when all tiles are complete
|
||||
|
||||
def complete_tile(self) -> None:
|
||||
self.completed_tiles += 1
|
||||
if self.completed_tiles == self.total_tiles:
|
||||
self.done_event.succeed()
|
||||
```
|
||||
|
||||
**Completion follows an exactly-once contract**: the last stage of each tile must call
|
||||
`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
|
||||
succeed only once (SimPy Event constraint).
|
||||
|
||||
#### Scheduler Role (Reduced)
|
||||
|
||||
When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext,
|
||||
enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately.
|
||||
|
||||
Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
|
||||
This feeder consumes `_pending_feeds` in FIFO order and
|
||||
**does not allow tile feed interleaving across composite commands.**
|
||||
That is, the feed for the next command begins only after all tiles of the current
|
||||
command have been injected into the first stage queue.
|
||||
|
||||
There is **exactly one `_feed_loop`** per scheduler, and
|
||||
tile feed for composite commands is performed exclusively through this single process.
|
||||
Command issue order refers to **the order in which PE_SCHEDULER receives PeInternalTxn**.
|
||||
|
||||
This structure maintains command issue order while ensuring that when the first stage
|
||||
queue is full, only the feeder process blocks — the scheduler worker's inbox processing
|
||||
itself does not stall.
|
||||
|
||||
```python
|
||||
class PeSchedulerV2(PeEngineBase):
|
||||
_pipelines: dict[str, PipelineContext]
|
||||
_pending_feeds: simpy.Store # FIFO of (plan, ctx)
|
||||
|
||||
def start(self, env):
|
||||
super().start(env)
|
||||
self._pending_feeds = simpy.Store(env)
|
||||
env.process(self._feed_loop(env))
|
||||
|
||||
def _dispatch_composite(self, env, pe_txn, cmd):
|
||||
plan = generate_plan(cmd)
|
||||
ctx = PipelineContext(
|
||||
id=next_id(),
|
||||
total_tiles=len(plan.tiles),
|
||||
done_event=pe_txn.done,
|
||||
)
|
||||
self._pipelines[ctx.id] = ctx
|
||||
|
||||
# only enqueue to feeder queue and return immediately
|
||||
yield self._pending_feeds.put((plan, ctx))
|
||||
|
||||
def _feed_loop(self, env):
|
||||
"""Single feeder process: feeds composite commands in FIFO order.
|
||||
|
||||
Tile feed interleaving across composite commands is not allowed.
|
||||
The feed for the next command begins only after all tiles of the
|
||||
current command have been injected into the first stage queue.
|
||||
|
||||
When the first stage queue is full, only this feeder blocks;
|
||||
the scheduler worker's inbox processing does not stall.
|
||||
"""
|
||||
while True:
|
||||
plan, ctx = yield self._pending_feeds.get()
|
||||
for tile in plan.tiles:
|
||||
token = TileToken(
|
||||
tile_id=tile.tile_id,
|
||||
pipeline_ctx=ctx,
|
||||
plan=tile,
|
||||
stage_idx=0,
|
||||
params=tile.stages[0].params,
|
||||
)
|
||||
yield self.out_ports[tile.stages[0].component].put(token)
|
||||
# queue capacity = HW queue depth → feeder blocks only when full
|
||||
```
|
||||
|
||||
In this ADR, the scheduler can accept multiple composite commands,
|
||||
but tile submission order follows per-command FIFO.
|
||||
Within a command, tile-level pipeline overlap is allowed,
|
||||
but tile feed interleaving across commands is not.
|
||||
|
||||
### D3. Data Transfer vs. Completion Signal — HW Modeling Criteria
|
||||
|
||||
| Communication Type | Method | HW Correspondence |
|
||||
|-------------------|--------|-------------------|
|
||||
| Tile token (work directive) | message via out_port | enqueue to command queue |
|
||||
| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
|
||||
| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |
|
||||
|
||||
**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.
|
||||
|
||||
**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
|
||||
is applied to intra-PE stage triggers. Chaining between components corresponds to
|
||||
PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost
|
||||
is incurred.
|
||||
|
||||
**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
|
||||
When all tiles are complete, PipelineContext calls done_event.succeed().
|
||||
|
||||
### D4. Asynchronous Pipeline — Natural Overlap
|
||||
|
||||
The scheduler processes CompositeCmds **asynchronously**.
|
||||
However, tile feed does not spawn an independent process per command; instead,
|
||||
the scheduler's internal **single feeder process** performs the feed in FIFO order.
|
||||
Therefore, the scheduler can continue to receive the next command,
|
||||
but the first-stage tile injection order is guaranteed per command.
|
||||
|
||||
Since **SimPy Store capacity = HW queue depth**:
|
||||
- When the queue is full, put() naturally blocks (backpressure)
|
||||
- While DMA is processing tile 0, GEMM can start fetching an already-completed tile
|
||||
- When a second CompositeCmd arrives, it is immediately queued to the DMA queue
|
||||
|
||||
```
|
||||
First-stage feed order (feeder → DMA queue):
|
||||
[cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
|
||||
↑ cmd2 starts after cmd1 feed completes
|
||||
|
||||
Runtime pipeline (downstream overlap):
|
||||
PE_DMA: [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
|
||||
PE_FETCH: [cmd1:t0][cmd1:t1]...
|
||||
PE_GEMM: [cmd1:t0][cmd1:t1]...
|
||||
↑ pipeline overlap within the same command
|
||||
```
|
||||
|
||||
Here, the overlap does not come from tile feed interleaving across different commands,
|
||||
but occurs naturally as tiles from earlier commands progress to downstream stages
|
||||
while the feeder continues injecting subsequent tiles.
|
||||
|
||||
For example, tile feed for cmd2 does not start until all tiles of cmd1 have been
|
||||
injected into the first stage queue. However, while cmd1.tile0 has already progressed
|
||||
to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so
|
||||
**pipeline overlap within the same command occurs naturally**.
|
||||
|
||||
#### Component Chaining Pattern
|
||||
|
||||
All components follow the same pattern:
|
||||
|
||||
```python
|
||||
def _pipeline_worker(self, env):
|
||||
while True:
|
||||
token = yield self._inbox.get()
|
||||
|
||||
# process own stage
|
||||
yield from self._process(env, token)
|
||||
|
||||
# chain to next stage (read from plan)
|
||||
next_idx = token.stage_idx + 1
|
||||
if next_idx < len(token.plan.stages):
|
||||
next_stage = token.plan.stages[next_idx]
|
||||
token.stage_idx = next_idx
|
||||
token.params = next_stage.params
|
||||
yield self.out_ports[next_stage.component].put(token)
|
||||
else:
|
||||
# last stage — pipeline completion
|
||||
token.pipeline_ctx.complete_tile()
|
||||
```
|
||||
|
||||
### D5. PE_FETCH_STORE — Dedicated TCM ↔ Register File Transfer
|
||||
|
||||
Previously, GemmBlock and MathBlock each implemented their own TCM read/write.
|
||||
This is separated into a **PE_FETCH_STORE component**.
|
||||
|
||||
```python
|
||||
# PE_FETCH_STORE._process()
|
||||
def _process(self, env, token):
|
||||
yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
|
||||
yield tcm_done
|
||||
# chaining is handled by the base class (D4 pattern)
|
||||
```
|
||||
|
||||
Advantages:
|
||||
- GEMM/MATH perform **pure compute only** — no TCM access logic
|
||||
- Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
|
||||
- Prefetch strategies can be experimented with by replacing the fetch unit alone
|
||||
|
||||
### D6. Simplification of Each Compute Component
|
||||
|
||||
GEMM/MATH perform compute only with register data already prepared.
|
||||
**Chaining follows the common pattern (D4), so only _process() needs to be implemented:**
|
||||
|
||||
```python
|
||||
# PE_GEMM._process()
|
||||
def _process(self, env, token):
|
||||
yield env.timeout(self._mac_latency(token.params))
|
||||
|
||||
# PE_MATH._process()
|
||||
def _process(self, env, token):
|
||||
yield env.timeout(self._simd_latency(token.params))
|
||||
|
||||
# PE_FETCH_STORE._process()
|
||||
def _process(self, env, token):
|
||||
yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
|
||||
yield tcm_done
|
||||
|
||||
# PE_DMA._process()
|
||||
def _process(self, env, token):
|
||||
yield from self._do_fabric_dma(token.params)
|
||||
```
|
||||
|
||||
By replacing only the timing model, one can freely switch between cycle-accurate
|
||||
and analytical models. Since the chaining logic resides in the base class,
|
||||
each component only implements its pure stage logic.
|
||||
|
||||
### D7. Topology Changes
|
||||
|
||||
Add PE_FETCH_STORE to the PE template:
|
||||
|
||||
```yaml
|
||||
pe_template:
|
||||
components:
|
||||
pe_cpu: { kind: pe_cpu, impl: pe_cpu_v1, ... }
|
||||
pe_scheduler: { kind: pe_scheduler, impl: pe_scheduler_v2, ... }
|
||||
pe_dma: { kind: pe_dma, impl: pe_dma_v1, ... }
|
||||
pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
|
||||
pe_gemm: { kind: pe_gemm, impl: pe_gemm_v1, ... }
|
||||
pe_math: { kind: pe_math, impl: pe_math_v1, ... }
|
||||
pe_mmu: { kind: pe_mmu, impl: pe_mmu_v1, ... }
|
||||
pe_tcm: { kind: pe_tcm, impl: pe_tcm_v1, ... }
|
||||
links:
|
||||
# existing links...
|
||||
fetch_store_to_tcm_bw_gbs: 512.0
|
||||
fetch_store_to_tcm_mm: 0.0
|
||||
```
|
||||
|
||||
PE internal edge connections:
|
||||
```
|
||||
PE_SCHEDULER → PE_DMA (initial dispatch)
|
||||
PE_SCHEDULER → PE_FETCH_STORE (initial dispatch)
|
||||
PE_SCHEDULER → PE_GEMM (initial dispatch)
|
||||
PE_SCHEDULER → PE_MATH (initial dispatch)
|
||||
PE_DMA → PE_FETCH_STORE (chaining)
|
||||
PE_FETCH_STORE → PE_GEMM (chaining)
|
||||
PE_FETCH_STORE → PE_MATH (chaining)
|
||||
PE_GEMM → PE_FETCH_STORE (store chaining)
|
||||
PE_MATH → PE_FETCH_STORE (store chaining)
|
||||
PE_FETCH_STORE → PE_DMA (writeback chaining)
|
||||
PE_FETCH_STORE → PE_TCM (BW request)
|
||||
```
|
||||
|
||||
Topology edges encompass both **control/dispatch visibility + runtime chaining**.
|
||||
Scheduler → sub-component edges are initial dispatch paths, while
|
||||
inter-component edges are runtime chaining paths driven by token self-routing.
|
||||
|
||||
### D8. Existing Code Migration — Builtin Integration
|
||||
|
||||
The existing builtin v1 components and pe_accel are **replaced with new builtin components**.
|
||||
|
||||
#### Migration Strategy
|
||||
|
||||
1. Back up existing `components/builtin/` → `components/builtin_legacy/` (preserved without modification)
|
||||
2. Back up existing `components/custom/pe_accel/` → likewise
|
||||
3. Re-implement new `components/builtin/` with the ADR-0021 architecture
|
||||
4. Maintain **only one** topology.yaml (including pe_fetch_store)
|
||||
5. components.yaml points to the new builtin
|
||||
|
||||
```yaml
|
||||
# components.yaml — new builtin
|
||||
pe_scheduler_v1: kernbench.components.builtin.pe_scheduler:PeSchedulerComponent
|
||||
pe_gemm_v1: kernbench.components.builtin.pe_gemm:PeGemmComponent
|
||||
pe_math_v1: kernbench.components.builtin.pe_math:PeMathComponent
|
||||
pe_dma_v1: kernbench.components.builtin.pe_dma:PeDmaComponent
|
||||
pe_fetch_store_v1: kernbench.components.builtin.pe_fetch_store:PeFetchStoreComponent
|
||||
pe_tcm_v1: kernbench.components.builtin.pe_tcm:PeTcmComponent
|
||||
```
|
||||
|
||||
The impl names (pe_gemm_v1, etc.) are preserved, but **the implementations are replaced
|
||||
with the ADR-0021 architecture**. Existing benchmarks and tests referencing topology.yaml
|
||||
continue to work without changes.
|
||||
|
||||
#### Latency Model Inheritance
|
||||
|
||||
The latency modeling of the new builtin components (MAC cycle calculation, SIMD latency,
|
||||
TCM BW serialization, DMA fabric latency, etc.) is **based on the current pe_accel
|
||||
implementation**. The tile schedule generation logic from tiling.py is also carried over.
|
||||
Only the architecture (component separation, self-routing) changes; timing accuracy
|
||||
is preserved.
|
||||
|
||||
#### Test Strategy
|
||||
|
||||
#### Test Plan
|
||||
|
||||
**1. Existing test pass** (regression):
|
||||
After migration is complete, all existing tests (366) must pass.
|
||||
|
||||
**2. Latency regression**:
|
||||
Verify that the new builtin produces identical latency for the same inputs as pe_accel.
|
||||
|
||||
**3. Phase 1 → Phase 2 end-to-end**:
|
||||
Integration test from SimPy simulation (Phase 1) op_log generation → DataExecutor
|
||||
(Phase 2) actual numpy computation → result correctness verification.
|
||||
- GEMM: tl.composite(gemm) → op_log → Phase 2 matmul → allclose verification
|
||||
- MATH: tl.exp / tl.add, etc. → op_log → Phase 2 numpy op → allclose verification
|
||||
- Chaining: GEMM output → MATH input → final result end-to-end verification
|
||||
|
||||
**4. TileToken self-routing**:
|
||||
- Verify that tiles chain according to the plan's stage sequence
|
||||
- Verify PipelineContext.complete_tile() exactly-once at the last stage
|
||||
- Queue backpressure: verify that only the feeder blocks when DMA queue capacity is exceeded
|
||||
|
||||
**5. Asynchronous pipeline overlap**:
|
||||
- Verify that inter-tile stage overlap occurs within the same command (tile0 in GEMM while tile1 in DMA)
|
||||
- Multiple commands: verify that cmd2 feed starts after cmd1 feed completes (FIFO order)
|
||||
|
||||
### D9. TileToken Message Definition
|
||||
|
||||
A message used for passing tile work between components.
|
||||
The token carries the plan and stage index, enabling self-routing.
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class TileToken:
|
||||
tile_id: int
|
||||
pipeline_ctx: PipelineContext # completion tracking
|
||||
plan: TilePlan # full stage sequence for this tile (immutable)
|
||||
stage_idx: int # current stage index in plan.stages
|
||||
params: dict # current stage parameter cache (canonical: plan.stages[stage_idx].params)
|
||||
data_op: bool = True # op_log recording target (ADR-0020)
|
||||
```
|
||||
|
||||
A TileToken is **owned by exactly one component at a time** and
|
||||
is never referenced by multiple components simultaneously (single-owner).
|
||||
|
||||
Token lifecycle:
|
||||
1. Scheduler creates it with stage_idx=0 and puts it to the first stage component
|
||||
2. The component executes _process(), increments stage_idx, and puts it to the next component
|
||||
3. The last stage component calls pipeline_ctx.complete_tile()
|
||||
4. When all tiles are complete, PipelineContext calls done_event.succeed()
|
||||
|
||||
Relationship with existing PeInternalTxn:
|
||||
- PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged)
|
||||
- TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)
|
||||
|
||||
---
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is not modified
|
||||
(PeInternalTxn-based, ADR-0014 maintained)
|
||||
- **Resource contention model across multiple pipelines**: the current scope focuses on
|
||||
accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines
|
||||
are future work.
|
||||
- **builtin_legacy maintenance**: kept for backup purposes only; not a target for
|
||||
bug fixes or feature additions.
|
||||
|
||||
## Open Questions
|
||||
|
||||
- **Register File capacity model**: whether to model capacity limits when the fetch unit
|
||||
loads into registers. Capacity is expressed in bytes (register_file_bytes), and
|
||||
the number of tiles that can be held simultaneously is determined by tile size.
|
||||
When capacity is exceeded, fetch stalls, creating natural backpressure.
|
||||
- **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite
|
||||
commands. Therefore, overlap arises not from pre-injection across commands, but
|
||||
naturally from pipeline progression of tiles within the same command.
|
||||
If additional prefetch is needed, it should be considered at the level of tile ordering
|
||||
within the same command or fetch/store unit policy, not cross-command injection.
|
||||
- **PE_DMA coalescing**: per-tile DMA may cause fragmentation.
|
||||
Direction is to merge/coalesce within DMA without scheduler involvement.
|
||||
- **Synchronous execution mode**: this ADR adopts asynchronous pipeline as the
|
||||
default/sole execution model. If a sync mode is needed for debug or validation
|
||||
purposes, it will be considered in a future ADR.
|
||||
- **TCM bank conflict across multiple pipelines**: currently based on a single pipeline.
|
||||
Bank conflict modeling when multiple pipelines simultaneously access TCM is future work.
|
||||
|
||||
---
|
||||
|
||||
## Consequences
|
||||
|
||||
### Positive
|
||||
|
||||
- Each block is an independent component — individually replaceable (ADR-0015 compliant)
|
||||
- PE internal structure is visible in the topology
|
||||
- Components do not know the next component — plan-based routing provides flexibility
|
||||
- Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
|
||||
- Improved HW modeling accuracy (done signal = Event, data transfer = message)
|
||||
- Fetch/store separation enables accurate TCM BW contention modeling
|
||||
|
||||
### Negative
|
||||
|
||||
- Increased number of PE internal components (5 → 6) — more topology nodes/edges
|
||||
- Component separation makes intra-PE token forwarding more explicit than before
|
||||
- Breaking change from existing builtin/pe_accel — migration required
|
||||
|
||||
---
|
||||
|
||||
## Affected Files
|
||||
|
||||
| File | Change |
|
||||
|------|--------|
|
||||
| `topology.yaml` | Add pe_fetch_store component, add chaining edges |
|
||||
| `components.yaml` | Register new builtin components |
|
||||
| `src/kernbench/topology/builder.py` | Add fetch_store + chaining edges to PE internal edges |
|
||||
| `src/kernbench/common/pe_commands.py` | Add TileToken definition |
|
||||
| `src/kernbench/components/builtin/pe_scheduler.py` | Re-implement (feeder + plan-based dispatch) |
|
||||
| `src/kernbench/components/builtin/pe_gemm.py` | Re-implement (TileToken, _process pattern) |
|
||||
| `src/kernbench/components/builtin/pe_math.py` | Re-implement (TileToken, _process pattern) |
|
||||
| `src/kernbench/components/builtin/pe_dma.py` | Re-implement (TileToken, _process pattern) |
|
||||
| `src/kernbench/components/builtin/pe_fetch_store.py` | New |
|
||||
| `src/kernbench/components/builtin/pe_tcm.py` | Re-implement (TcmRequest service) |
|
||||
| `src/kernbench/components/builtin/types.py` | New: TilePlan, Stage, StageType, PipelineContext, TileToken |
|
||||
| `src/kernbench/components/builtin/tiling.py` | Ported from pe_accel: plan generation logic |
|
||||
|
||||
Backup:
|
||||
| `src/kernbench/components/builtin_legacy/` | Full backup of existing builtin (preserved without modification) |
|
||||
| `src/kernbench/components/custom/pe_accel/` | Backup of existing pe_accel (preserved without modification) |
|
||||
Reference in New Issue
Block a user