diff --git a/docs/adr/ADR-0018-Logical Address.en.md b/docs/adr/ADR-0018-Logical Address.en.md new file mode 100644 index 0000000..2780ded --- /dev/null +++ b/docs/adr/ADR-0018-Logical Address.en.md @@ -0,0 +1,441 @@ +# ADR-0018: LA-Based Memory Address Abstraction and HBM Channel Mapping Mode Introduction + +## Status + +Proposed + +## Context + +Kernbench simulates memory access between PE_DMA and Local-HBM within a CUBE. +Currently, a VA-based access path is used; however, the following two channel mapping models +are difficult to represent consistently. + +### Background: Local-HBM Pseudo Channel Structure + +The HBM in a CUBE consists of 32 or 64 pseudo channels. +In the PE-Local-HBM model, each PE is responsible for an equal number of pseudo channels. + +Example: 64 pseudo channels, 8 PEs per cube -> each PE accesses 8 pseudo channels as local HBM + +Both the number of pseudo channels and the number of PEs are topology parameters. +`N = hbm_pseudo_channels / pes_per_cube` (= channels_per_pe) determines +the number of local channels per PE. + +The routing path BW between DMA and each pseudo channel matches the BW of each pseudo channel +(e.g., 32 GB/s), so if a PE sends simultaneous requests to N channels, it can utilize the +maximum memory BW. + +### Limitations of the Current VA Model + +When channels are divided into 8, requests must also be generated per channel and sent to DMA. +However, in the current architecture, the kernel generates requests with VA (`tl.load`) +and passes them directly to DMA, making it difficult for PE_CPU to generate per-channel DMA requests. + +Therefore, instead of VA, we propose using **Logical Address (LA)**, +where the **BAAW (Logical-to-Physical Mapping Unit)** inside PE_DMA +converts LA to PA or a list of PAs based on segment-based mapping. + +### Two Channel Mapping Modes + +- **1:1 mode**: Creates and executes per-channel requests. Precise per-channel modeling. +- **n:1 mode (default)**: Assumes interleaving across local HBM channels. Aggregated BW modeling. + +By supporting both modes, the overhead of the n:1 mode can be measured and evaluated. + +### Core Requirements + +- The effective bandwidth semantics of PE_DMA -> HBM_CTRL must be identical in both modes +- The difference must only be in the request representation and resource modeling approach +- The kernel programming model must not be changed +- Physical channel information must not be exposed to the kernel + +### Existing Physical Address + +The current system's 51-bit Physical Address is defined in `policy/address/phyaddr.py`: + +``` +[50:47] rack_id (4 bit) +[46:43] sip_id (4 bit) +[42:38] cube_id (5 bit, sip_seg) +[37] hbm_selector (1=HBM window) +[36:0] hbm_offset (37 bit, 128GB per cube) +``` + +PA is used to represent the final routable canonical physical destination, +and this role is preserved. +However, the timing and policy of logical access -> physical request conversion are not clearly separated. + +--- + +## Decision + +### D1. Introduction of LA (Logical Address) — Replacing VA + +The existing VA (Virtual Address) infrastructure is replaced with LA (Logical Address). + +#### Characteristics of LA + +- Like VA, tensors can be mapped to a contiguous memory space +- Represents logical buffer + offset +- Does not directly contain physical channel information +- An intermediate abstraction maintained until physical resolution +- The sole address scheme used by kernel code (`tl.load`, `tl.store`, `tl.composite`) + +#### LA Space Definition + +| Item | Value | +|------|-------| +| LA start address | `0x1_0000_0000` (4 GB, preserving the existing VA start point) | +| LA space size | 64 GB per PE | +| Alignment unit | Segment-based (see D3 below) | + +LA is a PE-local address space. +Even if different PEs use the same LA value, they resolve to different PAs +because each PE has a different BAAW segment table. + +#### VA Infrastructure Removal Scope + +With the introduction of LA, the following existing code will be replaced/removed: + +| Removal Target | Replacement | +|----------------|-------------| +| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, name/role changed) | +| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) | +| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component | +| `runtime_api/kernel.py`: MmuMapMsg, MmuUnmapMsg | Replaced with BaawSegmentInstallMsg | +| `runtime_api/context.py`: VA alloc + MMU mapping install | LA alloc + BAAW segment install | +| `runtime_api/tensor.py`: `va_base` field | `la_base` field | +| `topology.yaml`: pe_mmu component entry | Removed | + +--- + +### D2. Mapping Mode Configuration + +The mapping mode is configured at the cube level in topology.yaml: + +```yaml +cube: + memory_map: + hbm_mapping_mode: n_to_one # one_to_one | n_to_one + hbm_pseudo_channels: 64 # total pseudo channel count + hbm_channels_per_pe: 8 # local channel count per PE + hbm_channel_bw_gbs: 32.0 # per-channel bandwidth +``` + +This configuration is referenced during graph compilation (topology builder) and BAAW initialization. + +--- + +### D3. Segments and BAAW + +#### Segment Definition + +A segment is a logical allocation unit that partitions the LA space so that each segment +maps to a specific HBM channel or channel group. + +Segments are created by the runtime allocator during tensor deployment, +and BAAW uses them to convert LA into physical requests. + +#### BAAW Segment Table Entry + +```python +@dataclass +class BaawSegment: + la_base: int # segment start LA + la_size: int # segment size (bytes) + mode: str # "one_to_one" | "n_to_one" + # 1:1 mode fields + channel_count: int # number of channels assigned to this segment (e.g., 8) + pa_bases: list[int] # per-channel PA start address list (len = channel_count) + channel_ids: list[int] # per-channel logical IDs (e.g., [0,1,2,...,7]) + channel_size: int # per-channel size (la_size // channel_count) + # n:1 mode fields + agg_pa_base: int # aggregated PA start address + agg_node_id: str # aggregated router node_id (for routing) +``` + +#### Segment Lifecycle + +1. **Allocation time** (tensor deploy): + - RuntimeContext allocates LA space from the LA allocator + - PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1) + - Sends `BaawSegmentInstallMsg` to PE_DMA to register in the segment table + +2. **Usage time** (kernel execution): + - Kernel issues `tl.load(la_ptr)` -> DmaReadCmd(src_addr=LA) + - PE_DMA looks up the segment corresponding to the LA in BAAW + - Converts to PA(s) according to the mode + +3. **Deallocation time** (tensor free): + - Removed from the segment table + - LA space returned, PA deallocated + +--- + +### D4. BAAW (Logical-to-Physical Mapping Unit) + +#### Location + +BAAW is placed as a front-end stage inside PE_DMA. +It is not a separate SimPy component; it is synchronous address resolution logic +executed at the beginning of PE_DMA's `handle_command()`. + +#### Input + +- LA (Logical Address) — DmaReadCmd.src_addr or DmaWriteCmd.dst_addr +- access size (bytes) + +#### Output + +- 1:1 mode: `list[PhysicalRequest]` — each request is (PA, nbytes, channel_node_id) +- n:1 mode: 1 `PhysicalRequest` — (agg_PA, nbytes, agg_node_id) + +```python +@dataclass +class PhysicalRequest: + pa: int # 51-bit Physical Address + nbytes: int # transfer size for this request + dst_node: str # target node_id (channel router or aggregated router) +``` + +#### BAAW Resolve Logic + +```python +def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]: + seg = self._find_segment(la) # la_base <= la < la_base + la_size + offset = la - seg.la_base + + if seg.mode == "n_to_one": + pa = seg.agg_pa_base + offset + return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)] + + elif seg.mode == "one_to_one": + requests = [] + per_ch_size = seg.channel_size + for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)): + ch_offset = offset % per_ch_size # interleaved or striped + ch_nbytes = nbytes // seg.channel_count + pa = pa_base + ch_offset + dst_node = f"{self._pe_prefix}.ch_r{ch_id}" + requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node)) + return requests +``` + +#### Scope of Responsibility + +BAAW is responsible for: +- Converting logical accesses into physical request units +- Performing fan-out (1:1) or pass-through (n:1) according to the mapping mode +- Generating Physical Addresses and determining target nodes + +BAAW is NOT responsible for: +- Performing actual data movement +- Executing NOC routing +- Simulating bandwidth consumption (this is the role of downstream components) + +#### Output Contract + +The output of BAAW must be request units that can be directly used by the simulator's +routing and resource model without any additional address decoding. + +--- + +### D5. PE_DMA handle_command() Changes + +#### Current Flow (VA-based) + +``` +DmaReadCmd.src_addr (VA) + -> MMU.translate(VA) -> PA + -> PhysAddr.decode(PA) -> PhysAddr object + -> resolver.resolve(PhysAddr) -> dst_node_id (e.g., "sip0.cube0.hbm_ctrl") + -> router.find_path(pe_prefix, dst_node_id) -> path + -> 1 sub-Transaction created -> fabric inject +``` + +#### New Flow (LA-based) + +``` +DmaReadCmd.src_addr (LA) + -> BAAW.resolve(LA, nbytes) -> list[PhysicalRequest] + -> For each PhysicalRequest: + -> router.find_path(pe_prefix, req.dst_node) -> path + -> compute_drain_ns(path, req.nbytes) -> drain + -> sub-Transaction created -> fabric inject + -> Wait for all sub-Transactions to complete + -> pe_txn.done.succeed() +``` + +Key changes: +- MMU reference removed -> replaced with BAAW resolve +- PhysAddr.decode() + resolver.resolve() -> BAAW directly returns dst_node +- 1 request -> N requests injected in parallel (1:1 mode) + +--- + +### D6. 1:1 Mode Details + +- One logical access -> N (= `channels_per_pe`) physical requests +- N is a parameter determined by `hbm_pseudo_channels / pes_per_cube` +- Each request: + - Fully resolved 51-bit PA + - Targets a specific channel router (`{pe_prefix}.ch_r{channel_id}`) +- BW contention modeling via per-channel links +- PE_DMA injects N sub-transactions simultaneously + +#### 1:1 Mode Example + +Configuration: `hbm_pseudo_channels=64`, `pes_per_cube=8` +-> `channels_per_pe=8`, PE0 owns ch0-7 + +```text +Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes +BAAW segment: { + la_base: 0x1_0000_0000, la_size: 4096, + mode: "one_to_one", channel_count: 8, # = channels_per_pe + pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7], + channel_ids: [0, 1, 2, 3, 4, 5, 6, 7], + channel_size: 512, # = la_size / channel_count +} + +BAAW resolve result (N=8 requests): + -> PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0") + -> PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1") + -> ... + -> PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7") + +PE_DMA: N sub-transactions injected in parallel + Each accesses HBM via channel router -> hbm_ctrl link (channel_bw_gbs) + Total effective BW = N x channel_bw_gbs +``` + +Examples with different N values: +- `hbm_pseudo_channels=32`, `pes_per_cube=8` -> `channels_per_pe=4`, 4 requests +- `hbm_pseudo_channels=64`, `pes_per_cube=4` -> `channels_per_pe=16`, 16 requests + +--- + +### D7. n:1 Mode Details + +- One logical access -> one aggregated request +- Target: aggregated router -> hbm_ctrl (see ADR-0019) +- Aggregated link BW = `channels_per_pe` x `channel_bw_gbs` (e.g., 8 x 32 = 256 GB/s) +- Modeled as a single queue / resource +- No per-channel PA decomposition + +#### n:1 Mode Example + +``` +Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes +BAAW segment: { + la_base: 0x1_0000_0000, la_size: 4096, + mode: "n_to_one", + agg_pa_base: PA_agg, + agg_node_id: "sip0.cube0.pe0.agg_router", +} + +BAAW resolve result: + -> PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router") + +PE_DMA: 1 sub-transaction injected + Accesses HBM via aggregated router -> hbm_ctrl link (256 GB/s) +``` + +--- + +### D8. Kernel Model Preservation + +- The kernel still issues only single memory ops (`tl.load`, `tl.store`, `tl.composite`) +- LA is the address scheme passed to the kernel +- Channel decomposition/aggregation is performed by BAAW inside PE_DMA +- Physical channel information is not exposed to kernel code + +--- + +## Consequences + +### Positive + +- 1:1 vs n:1 semantics are clearly separated at a single point: BAAW +- Kernel abstraction is preserved — no kernel code changes required +- Topology-based policy control is possible (mode switching via yaml) +- Improved simulation model consistency and debuggability +- Segment-based mapping is simpler and has lower overhead compared to page tables + +### Negative + +- Full refactoring of VA/MMU-based code is required +- Increased complexity in the request generation path (managing N requests in 1:1 mode) +- Reduced per-channel visibility in n:1 mode +- Existing VA-related tests must be rewritten + +--- + +## Alternatives + +### A1. Keep VA + Fan-out at MMU + +- Extend MMU to return per-channel PAs +- Problem: MMU's role expands beyond address translation to include request decomposition +- Problem: Aggregation representation is difficult in n:1 mode + +### A2. Kernel Generates Channel-Aware Requests + +- Kernel directly calls per-channel load/store +- Problem: Abstraction leakage, reduced portability +- Problem: All benchmark code must be modified + +### A3. Always Use PA (Without LA) + +- Runtime directly passes per-channel PA to the kernel +- Problem: Conflicts with the aggregation model +- Problem: Conversion timing is unclear, channel information exposed to kernel + +--- + +## Implementation Notes + +### Implementation Order + +1. Introduce LA type (`policy/address/la_allocator.py`) +2. Implement BAAW segment table (`policy/address/baaw.py`) +3. Add `BaawSegmentInstallMsg` message type (`runtime_api/kernel.py`) +4. Integrate BAAW into PE_DMA (`components/builtin/pe_dma.py` handle_command changes) +5. Modify RuntimeContext: LA alloc + segment install (`runtime_api/context.py`) +6. Change Tensor.va_base -> la_base (`runtime_api/tensor.py`) +7. Remove VA/MMU code +8. Remove pe_mmu from topology.yaml, add mapping mode configuration +9. Test migration + +### Affected Existing Tests + +| Test File | Impact | +|-----------|--------| +| `tests/test_mmu_component.py` | Remove -> replace with BAAW segment install test | +| `tests/test_mmu_fabric.py` | Remove -> replace with BAAW + fabric integration test | +| `tests/test_pe_mmu.py` | Remove | +| `tests/test_va_allocator.py` | Replace with LA allocator test | +| `tests/test_va_integration.py` | Replace with LA + BAAW integration test | +| `tests/test_va_offset.py` | Replace with LA offset test | + +--- + +## Test Requirements + +- For the same logical access: + - 1:1 -> verify N requests are generated + - n:1 -> verify 1 aggregated request is generated +- Verify effective bandwidth consistency across both modes +- 1:1 -> verify per-channel contention modeling +- n:1 -> verify aggregated bandwidth is reflected +- Verify operation without kernel code changes +- Verify correct BAAW segment install/uninstall operation +- Verify no conflicts when multiple tensors are assigned to different segments + +--- + +## Links + +- ADR-0011 (Memory Addressing Simplification — PA-first, VA/MMU introduction) -> superseded by this ADR +- ADR-0019 (NOC Per-Channel HBM Connection Model) -> topology-side integration +- ADR-0014 (PE Internal Execution Model) -> PE_DMA change impact diff --git a/docs/adr/ADR-0019-NOC-Local HBM.en.md b/docs/adr/ADR-0019-NOC-Local HBM.en.md new file mode 100644 index 0000000..ac772b0 --- /dev/null +++ b/docs/adr/ADR-0019-NOC-Local HBM.en.md @@ -0,0 +1,431 @@ +# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC + +## Status + +Proposed + +## Context + +ADR-0018 introduced LA-based address abstraction and BAAW, +defining how a logical memory access is translated into the following two forms of requests: + +- 1:1 mode: one logical access → N per-channel requests +- n:1 mode: one logical access → one aggregated request + +Here N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`), +determined by topology parameters. + +### Problems with the Existing Structure + +In the current implementation (`topology/builder.py`): + +- PE_DMA → NOC → xbar_top/xbar_bot → HBM_CTRL.slice{0-7} path is used +- HBM is modeled as 8 slice (= per-PE) nodes +- Local/remote access use different paths: + - local: NOC → xbar → HBM slice + - cross-half: NOC → xbar_top → bridge → xbar_bot → HBM slice + - remote cube: NOC → UCIe → remote NOC → remote xbar → remote HBM slice + +Limitations of this structure: + +- Cannot model at the pseudo-channel granularity (slice = per-PE granularity, not per-channel) +- xbar/bridge bifurcate local/remote paths +- Cannot express 1:1 / n:1 modes consistently + +--- + +## Decision + +### D1. HBM Attaches to PE Routers + +Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**, +and attach the HBM access point to the same router where the PE is attached. + +- n:1 mode: PE's local HBM access goes directly from its own router (switching overhead only, 0 hops) +- Remote PE's HBM access: reaches the target PE's router via mesh hops +- The read/write resource model within the HBM controller is preserved + +Node naming changes: + +| Current | After Change | +| ---- | ------- | +| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) | + +In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that +the builder generates an edge between that router and hbm_ctrl. + +--- + +### D2. Complete Removal of xbar, bridge, and Single NOC Node + +Remove all of the following nodes and related edges: + +- `{cube}.xbar_top`, `{cube}.xbar_bot` +- `{cube}.bridge.left`, `{cube}.bridge.right` +- `{cube}.noc` (single TwoDMeshNocComponent node) +- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar` +- Edges of type `xbar_to_bridge`, `bridge_to_xbar` +- Edges of type `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`, etc. referencing the single noc node + +Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**. +Each router (r0c0, r0c1, ...) from the 6x6 router grid generated by `mesh_gen.py` +is created as a separate SimPy node in the topology graph, +and adjacent routers are connected via XY mesh edges. + +--- + +### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1) + +#### Router Nodes Based on cube_mesh.yaml + +Each non-null router from cube_mesh.yaml generated by `mesh_gen.py` +is created as a **separate SimPy node** in the topology graph. + +- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`) +- kind: `noc_router`, impl: `forwarding_v1` +- pos_mm: taken from cube_mesh.yaml + +Based on the attach information in cube_mesh.yaml, components are connected to each router: +- `pe{p}.dma` → PE_DMA ↔ router edge +- `pe{p}.cpu` → PE_CPU ↔ router edge +- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1) +- `m_cpu` → M_CPU ↔ router edge +- `sram` → SRAM ↔ router edge +- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge + +Router-to-router XY mesh edges: bidirectional edges between adjacent routers. +Null routers (HBM exclusion zones) are skipped. + +#### 1:1 Mode Extension (To Be Implemented Later) + +In 1:1 mode, each router differentiates into N channel mini-routers. +Per-channel routing and ChannelSplitter (LA → per-channel PA) introduction are required. +N GEMM engines per PE are also added at this point. + +--- + +### D4. Cross-PE HBM Access (n:1 Mode) + +In n:1 mode, when a PE accesses another PE's local HBM, +it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router. + +Example: PE0 (r0c0) accessing PE2's (r1c4) HBM: + +```text +PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl +``` + +The Dijkstra router finds the shortest path in the mesh. + +Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3. + +--- + +### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh + +In n:1 mode, no separate "aggregated router" is created. +The existing router grid from cube_mesh.yaml serves that role. + +#### Connection Structure + +PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached: + +```text +sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0 (bw: N × channel_bw_gbs) +sip0.cube0.hbm_ctrl ←→ sip0.cube0.r0c0 (bw: N × channel_bw_gbs) +``` + +Routers are connected via XY mesh edges. PE's local HBM access goes +directly from its own router (switching overhead only). + +#### n:1 Mode Full Data Paths + +**Local HBM (0 hops):** +```text +PE0.pe_dma → r0c0 → hbm_ctrl (switching overhead only) +``` + +**Remote HBM (mesh hops):** +```text +PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl +``` + +**M_CPU DMA:** +```text +M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl +``` + +--- + +### D6. All Traffic Is Unified onto the Same Router Mesh + +- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh +- Local access does not use a separate fast path (xbar) +- Cross-cube (remote) access path: + +```text +PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT} + → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl +``` + +UCIe connections maintain the existing structure, +but both endpoints become mesh routers instead of xbars. + +The number of UCIe lines is determined by BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`. + +--- + +### D7. AddressResolver Changes + +Current `AddressResolver.resolve()`: + +```python +# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}" +pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes) +return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}" +``` + +After change: + +```python +# Changed: HBM → single endpoint +return f"sip{s}.cube{c}.hbm_ctrl" +``` + +The pe_slice calculation is removed. +In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router. + +resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility. + +--- + +### D8. topology.yaml Configuration Changes + +#### Added Settings + +```yaml +cube: + memory_map: + hbm_mapping_mode: n_to_one # one_to_one | n_to_one + hbm_pseudo_channels: 64 # total pseudo channel count + hbm_channels_per_pe: 8 # local channels per PE (= pseudo_channels / pes_per_cube) + hbm_channel_bw_gbs: 32.0 # per-channel bandwidth (GB/s) + hbm_total_gb_per_cube: 48 # retained +``` + +#### Removed Settings + +```yaml +# To be removed +links: + xbar_to_hbm_bw_gbs: 256.0 # → replaced by channel_bw_gbs × channels_per_pe + xbar_to_hbm_mm: 2.5 # → replaced by ch_router_to_hbm_mm + xbar_to_bridge_bw_gbs: 128.0 # → removed (no bridge) + xbar_to_bridge_mm: 3.0 # → removed + noc_to_xbar_bw_gbs: ... # → removed + noc_to_xbar_mm: ... # → removed +``` + +#### Added Link Settings + +```yaml +links: + router_link_bw_gbs: 256.0 # XY mesh link BW between routers + router_overhead_ns: 2.0 # router switching overhead + pe_to_router_bw_gbs: 256.0 # PE_DMA ↔ router + hbm_to_router_bw_gbs: 256.0 # HBM ↔ router (= N × channel_bw) +``` + +--- + +### D9. Bandwidth Numerical Consistency + +| Configuration | Value | +| ---- | --- | +| pseudo channels per cube | 64 (parameter) | +| PEs per cube | 8 (parameter) | +| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 | +| per-channel BW | 32 GB/s (parameter) | +| per-PE local BW | N × 32 = 256 GB/s | +| cube total HBM BW | 64 × 32 = 2048 GB/s | + +The effective BW per PE is identical in both modes: + +- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s +- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s + +--- + +## Consequences + +### Positive + +- The router mesh based on cube_mesh.yaml accurately reflects physical placement +- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low +- Local / remote / command traffic is unified onto the same mesh, resulting in simplicity +- Aligns well with graph compiler-based topology generation +- Channel count and PE count are both parameterized, enabling testing of various configurations +- 1:1 mode extension naturally follows through router differentiation + +### Negative + +- The number of SimPy nodes increases due to explicit router nodes (6x6 = up to 32 routers/cube) +- Requires complete rewrite of existing xbar/bridge/single NOC-based tests +- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model + +--- + +## Alternatives + +### A1. Retain Existing xbar + HBM Slices + +- Local/remote paths remain bifurcated +- Cannot model at pseudo-channel granularity +- Cannot switch between 1:1/n:1 modes + +### A2. Always Generate Per-Channel Links and Aggregate Only in n:1 + +- Topology structure always has 1:1 size +- Expressing n:1 semantics via link aggregation is complex +- No reduction in router node count + +### A3. Gradual Transition (Retain xbar + Add NOC Path) + +- Higher compatibility, but dual-path coexistence increases complexity +- Since xbar removal is ultimately necessary, the intermediate step provides little value + +--- + +## Implementation Notes + +### topology/builder.py Change Details + +#### Code to Remove (within current `_instantiate_cube()`) + +- xbar_top, xbar_bot node creation (~line 495-508) +- bridge.left, bridge.right node creation +- noc ↔ xbar edge creation (~line 540-555) +- xbar ↔ hbm_ctrl.slice edge creation (~line 510-538) +- xbar ↔ bridge edge creation (~line 557-572) + +#### Code to Add + +1:1 mode: + +```python +N = hbm_channels_per_pe # from topology config +total_ch = hbm_pseudo_channels + +# Create channel router nodes +for ch_id in range(total_ch): + pe_id = ch_id // N + nodes[f"{cp}.ch_r{ch_id}"] = Node( + id=f"{cp}.ch_r{ch_id}", kind="noc_router", impl="noc_v1", + attrs={}, pos_mm=(...), # horizontal row = ch_id % N + ) + +# PE_DMA ↔ local channel router edges +for pe_id in range(pes_per_cube): + for local_ch in range(N): + ch_id = pe_id * N + local_ch + edges.append(Edge( + src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.ch_r{ch_id}", + bw_gbs=channel_bw, kind="pe_to_ch_router", ...)) + edges.append(Edge( + src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.pe{pe_id}.pe_dma", + bw_gbs=channel_bw, kind="ch_router_to_pe", ...)) + +# Channel router ↔ hbm_ctrl edges +for ch_id in range(total_ch): + edges.append(Edge( + src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.hbm_ctrl", + bw_gbs=channel_bw, kind="ch_router_to_hbm", ...)) + edges.append(Edge( + src=f"{cp}.hbm_ctrl", dst=f"{cp}.ch_r{ch_id}", + bw_gbs=channel_bw, kind="hbm_to_ch_router", ...)) + +# Horizontal line edges (same logical index) +for row in range(N): + for p in range(pes_per_cube - 1): + ch_a = p * N + row + ch_b = (p + 1) * N + row + edges.append(Edge( + src=f"{cp}.ch_r{ch_a}", dst=f"{cp}.ch_r{ch_b}", + bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...)) + edges.append(Edge( + src=f"{cp}.ch_r{ch_b}", dst=f"{cp}.ch_r{ch_a}", + bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...)) +``` + +n:1 mode: + +```python +# Create aggregated router nodes +for pe_id in range(pes_per_cube): + nodes[f"{cp}.pe{pe_id}.agg_router"] = Node( + id=f"{cp}.pe{pe_id}.agg_router", kind="noc_router", impl="noc_v1", + attrs={}, pos_mm=(...), + ) + +agg_bw = N * channel_bw # aggregated BW + +# PE_DMA ↔ aggregated router +for pe_id in range(pes_per_cube): + edges.append(Edge( + src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.pe{pe_id}.agg_router", + bw_gbs=agg_bw, kind="pe_to_agg_router", ...)) + edges.append(Edge( + src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.pe{pe_id}.pe_dma", + bw_gbs=agg_bw, kind="agg_router_to_pe", ...)) + +# Aggregated router ↔ hbm_ctrl +for pe_id in range(pes_per_cube): + edges.append(Edge( + src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.hbm_ctrl", + bw_gbs=agg_bw, kind="agg_to_hbm", ...)) + edges.append(Edge( + src=f"{cp}.hbm_ctrl", dst=f"{cp}.pe{pe_id}.agg_router", + bw_gbs=agg_bw, kind="hbm_to_agg", ...)) + +# Horizontal links between aggregated routers +for p in range(pes_per_cube - 1): + edges.append(Edge( + src=f"{cp}.pe{p}.agg_router", dst=f"{cp}.pe{p+1}.agg_router", + bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...)) + edges.append(Edge( + src=f"{cp}.pe{p+1}.agg_router", dst=f"{cp}.pe{p}.agg_router", + bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...)) +``` + +### Affected Existing Tests + +| Test File | Impact | +| ---------- | ---- | +| `tests/test_topology_compile.py` | Remove xbar/bridge node references, add channel router verification | +| `tests/test_topology_load.py` | Reflect topology.yaml configuration changes | +| `tests/test_pe_components.py` | PE_DMA routing path changes | +| `tests/test_sip_parallel.py` | Cross-PE access path changes | +| Cases that directly test xbar/bridge | Remove | + +--- + +## Test Requirements + +- Verify that requests are delivered via per-channel links in 1:1 mode +- Verify that requests are delivered via the aggregated link in n:1 mode +- Verify that topology is correctly generated in both modes: + - 1:1: `total_ch` channel routers + per-PE links + horizontal links + - n:1: `pes_per_cube` aggregated routers + per-PE links +- Verify that effective BW is consistent across both modes for the same workload +- Verify that horizontal line routing works for cross-PE access +- Verify that routing through UCIe works for cross-cube access +- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.) + +--- + +## Links + +- ADR-0018 (LA + BAAW) → addressing-side integration +- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion +- ADR-0004 (Memory Semantics) → BW model redefinition +- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes diff --git a/docs/adr/ADR-0020-data-execution-two-pass.en.md b/docs/adr/ADR-0020-data-execution-two-pass.en.md new file mode 100644 index 0000000..208922d --- /dev/null +++ b/docs/adr/ADR-0020-data-execution-two-pass.en.md @@ -0,0 +1,553 @@ +# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation) + +## Status + +Proposed + +## Context + +The current simulation models **timing only**. +`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies, +but do not actually read tensor data or perform computations. + +### Required Capabilities + +1. Must be able to store and read actual data in HBM/TCM/SRAM +2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results +3. Must minimize simulation performance degradation + +### Limitations of the Existing Kernel Execution Structure + +The current kernel execution is separated into 3 stages: + +``` +Phase 0: Kernel function execution in TLContext → PeCommand list generation (outside SimPy, no data) +Phase 1: PE_CPU replays PeCommand list via SimPy (timing only) +``` + +Phase 0 requires the kernel to **complete execution entirely** before SimPy begins. +`tl.load()` returns a TensorHandle (placeholder), so actual data cannot be accessed. +Therefore, branching based on data values (dynamic control flow) is impossible. + +This ADR resolves this limitation **for memory operations only** (see D1, D3). + +### Constraints + +- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything +- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations +- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused +- Kernel functions must remain plain Python functions (no generator/async transformation) + +### Design Exploration Results + +| Option | Approach | Verdict | +|--------|----------|---------| +| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block | +| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests | +| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads | +| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact | + +--- + +## Decision + +### D1. 2-Pass Execution Model — Phase 0 Elimination + +The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**. + +Before: +``` +Phase 0: Kernel → PeCommand list (no data, no branching) +Phase 1: Replay PeCommand list via SimPy (timing only) +``` + +After: +``` +Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based + - Memory read/write: SimPy timing + MemoryStore actual data + - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2) + - Dynamic control flow possible (tl.load returns actual data) + +Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable +``` + +This ADR **extends Phase 1 to be data-aware for memory operations only**. +Phase 1 handles latency/BW bottleneck analysis + memory data tracking, +Phase 2 handles GEMM/Math computation correctness verification. +Phase 2 is optional — if only timing is needed, run Phase 1 alone. + +### D2. Op Log Recording — ComponentBase Hook + +Op log recording is performed as a **hook in the component base class**. +Individual component implementations are not modified. + +```python +class ComponentBase: + def _on_process_start(self, env, msg): + if self._op_logger and getattr(msg, 'data_op', False): + self._op_logger.record_start(env.now, self.node.id, msg) + + def _on_process_end(self, env, msg): + if self._op_logger and getattr(msg, 'data_op', False): + self._op_logger.record_end(env.now, self.node.id, msg) +``` + +Hooks are called before and after `run()` within `_forward_txn()`. +`_op_logger` is optional — zero overhead when absent. + +**Hook timing definitions**: + +| Timing | Meaning | +|--------|---------| +| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) | +| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) | + +Link traversal latency is not included in t_start/t_end. +Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start. + +### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination + +The existing Phase 0 (kernel → PeCommand list) is eliminated, +and **greenlet** is used to cooperatively interleave kernel and SimPy execution. + +#### Operating Principle + +greenlet is a C extension that provides cooperative context switching. +When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet) +to perform timing simulation, and after completion, returns to the kernel with actual data. + +``` +SimPy loop (parent greenlet) Kernel (child greenlet) +───────────────────────── ────────────────────── +g.switch() ─────────────────────────→ Kernel starts + a = tl.load(ptr, ...) + internal: parent.switch(DmaReadCmd) +cmd = DmaReadCmd ←────────────────── (kernel paused) + yield DmaReadMsg(...) + yield env.timeout(dma_latency) + data = memory_store.read(...) +g.switch(data) ─────────────────────→ (kernel resumed) + a = data ← actual numpy array + if a[0][0] > 0.5: ← branching possible + ... +``` + +The kernel is maintained as a **plain Python function**. +greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc. + +#### KernelRunner — Framework Layer + +The greenlet loop resides not in the PE_CPU component but in the framework layer, +**KernelRunner**. + +```python +# KernelRunner (framework — greenlet ↔ SimPy bridge) +class KernelRunner: + def run(self, env, kernel_fn, args, store): + g = greenlet(self._run_kernel) + cmd = g.switch(kernel_fn, args) + + while cmd is not None: + if isinstance(cmd, DmaReadCmd): + yield from self._dispatch_dma(env, cmd) + data = store.read(cmd.src_addr, cmd.shape, cmd.dtype) + cmd = g.switch(data) # resume with actual data + elif isinstance(cmd, GemmCmd): + yield from self._dispatch_gemm(env, cmd) + cmd = g.switch() # resume (no data) + elif isinstance(cmd, DmaWriteCmd): + store.write(cmd.dst_addr, cmd.data) # visibility = issue time + yield from self._dispatch_dma(env, cmd) # timing only + cmd = g.switch() + +# PE_CPU (component — kept simple, unaware of greenlet) +def _execute_kernel(self, env): + runner = KernelRunner(self.ctx) + yield from runner.run(env, kernel_fn, args, store) +``` + +**Op logging single source of truth**: KernelRunner does not record directly to op_log. +All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**. +When KernelRunner delivers messages to components via `_dispatch_gemm()` etc., +the component base class hooks automatically record them. + +**Layer separation**: +- **Kernel code**: plain function, unaware of greenlet +- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()` +- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**. +- **ComponentBase hook**: the sole path for op_log recording +- **PE_CPU**: only calls KernelRunner, replaceable as a component + +#### Handling Differences Between Memory Read/Write and Compute + +| Operation | In Phase 1 | In Phase 2 | +|-----------|-----------|-----------| +| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — | +| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — | +| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation | +| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation | + +Memory read/write is processed immediately in Phase 1 (numpy slice, fast). +GEMM/Math operations are batch-executed in Phase 2 (performance separation). + +#### Store Visibility Rule + +`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue). +SimPy DMA timing is simulated separately afterward. + +This is an intentional separation of timing and visibility: +- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called +- **timing**: the point at which DMA latency completes in SimPy + +This separation allows a load immediately after a store to see the latest data in dynamic control flow. + +#### Result Handle Semantics + +`tl.composite()` (sync/async) returns a **handle** referencing the result tensor. + +The key contract in Phase 1: + +1. **All compute handles are always considered pending in Phase 1.** +2. `tl.wait(handle)` **expresses timing synchronization only** + and does not make the handle ready. +3. Accessing the handle's actual result data (`handle.data`, element access, + numpy conversion, etc.) is **only possible in Phase 2**. +4. Therefore, **compute-result-based control flow is not supported in Phase 1.** +5. In contrast, `tl.load()` returns actual data in Phase 1, so + **memory-read-based control flow is supported**. + +| Handle state | Phase | Allowed operations | +|------------|-------|----------| +| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only | +| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) | +| pending | Phase 1 | **Data access not allowed** — value-based branching not possible | +| ready | Phase 2 | Actual numpy data access, verification | + +This restriction is intentional. If computations were executed in Phase 1, +the SimPy single-thread would block, defeating the purpose of 2-pass separation. + +#### Phase 1 Materialization — Future Extension + +If Phase 1 eager execution becomes necessary for small operations +(scalar, small reduction) in the future, selective materialization can be supported +by adding a `materialized_in_phase1: bool` flag to the op record. +This is not implemented in the current scope. + +### D4. data_op Flag — Message Self-Declaration + +The logging target is determined by the `data_op` attribute on the message instance, +not by message type. The framework does not hardcode message types. + +```python +class MsgBase: + data_op: bool = False # default: no logging + +class DmaReadCmd(MsgBase): + data_op = True # memory transfer → logging + +class GemmCmd(MsgBase): + data_op = True # compute → logging + +class MathCmd(MsgBase): + data_op = True # compute → logging +``` + +When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True` +enables automatic logging without modifying framework code. + +### D5. Op Log Structure + +#### Op Classification Scheme + +A two-level classification is used: + +| Level | Field | Role | +|-------|-------|------| +| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion | +| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification | + +#### OpRecord Definition + +```python +@dataclass +class OpRecord: + t_start: float # SimPy time (ns) — service start + t_end: float # SimPy time (ns) — service completion + component_id: str # e.g. "sip0.cube0.pe0.pe_gemm" + op_kind: str # "memory" | "gemm" | "math" + op_name: str # specific operation name + params: dict # per-operation parameters (see below) + dependency_ids: list[int] # currently based on in-memory record index, may be replaced with stable op_id in the future +``` + +#### dependency_ids Generation Rules + +`dependency_ids` is **optional**, and by default the executor performs +address-based dependency inference (see D6). + +Explicit setting is only needed when precise execution ordering is required: +- **Default (address-based inference)**: the executor analyzes read/write sets to + automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases. +- **Explicit setting**: set when logical dependencies cannot be expressed via addresses + at the TLContext or command generation stage. + Example: completion handle-based synchronization — handle dependencies depend on + logical completion order rather than memory addresses, so they cannot be captured + by address inference. + +#### op_log Ordering + +The op_log maintains **stable ordering** based on `t_start`. +Records with the same `t_start` preserve insertion order. + +#### params Details + +**memory (dma_read / dma_write)**: +```python +{ + "src_addr": int, # source address (byte) + "dst_addr": int, # destination address (byte) + "nbytes": int, # transfer size + "src_space": str, # "hbm" | "tcm" | "sram" + "dst_space": str, # "hbm" | "tcm" | "sram" +} +``` + +**gemm**: +```python +{ + "src_a_addr": int, # operand A address + "src_b_addr": int, # operand B address + "dst_addr": int, # output address + "shape_a": tuple, # e.g. (128, 256) + "shape_b": tuple, # e.g. (256, 128) + "shape_out": tuple, # e.g. (128, 128) + "dtype_in": str, # e.g. "f16" + "dtype_acc": str, # accumulation dtype, e.g. "f32" + "dtype_out": str, # output dtype, e.g. "f16" + "transpose_a": bool, + "transpose_b": bool, + "layout_a": str, # "row_major" | "col_major" + "layout_b": str, + "layout_out": str, + "addr_space": str, # "tcm" (GEMM operands are always in TCM) +} +``` + +**math**: +```python +{ + "op": str, # "exp" | "add" | "sum" | "where" | ... + "input_addrs": list[int], # list of operand addresses + "input_shapes": list[tuple], + "dst_addr": int, + "shape_out": tuple, + "dtype": str, + "axis": int | None, # reduction axis + "addr_space": str, # "tcm" +} +``` + +### D6. Phase 2 Executor + +Phase 2 executes the op_log outside of SimPy. + +```python +class DataExecutor: + def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore): + self.store = initial_store # Takes the Phase 1 MemoryStore snapshot as input + + def run(self): + for t, ops in groupby(op_log, key=lambda o: o.t_start): + batch = list(ops) + independent, sequential = self._classify(batch) + self._execute_parallel(independent) + self._execute_sequential(sequential) +``` + +**Parallel execution determination**: + +Ops with the same `t_start` are considered **parallel candidates**. +The executor determines actual parallel execution based on the following criteria: +- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks) +- Whether predecessor ops specified in `dependency_ids` have completed + +Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel. + +**Batch optimization**: Only independent ops with the same op_name **and identical +shape, dtype, layout, and transpose flags** are eligible for batching. +Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call. +Improves BLAS efficiency on CPU, reduces launch overhead on GPU. + +**Phase 2 execution order guarantee**: + +Phase 2 does not consider data arrival timing, +and guarantees execution order solely through +dependencies (address-based inference + explicit dependency_ids). + +### D7. Memory Store + +`MemoryStore` logically follows byte-addressable semantics, +and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping). + +```python +class MemoryStore: + def write(self, space: str, addr: int, data: np.ndarray) -> None: ... + def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ... +``` + +**Internal storage format: numpy ndarray** + +MemoryStore stores tensors as **numpy ndarrays**. + +| Candidate | store/load speed | Phase 2 compute | Verdict | +|-----------|-----------------|-----------------|---------| +| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** | +| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected | +| torch tensor | Immediate | torch operations available | Use only for GPU optimization | + +- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup +- read: **returns numpy array by reference** (no copy) +- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported) +- dtype uses numpy native (`np.float16`, `np.float32`, `np.bfloat16`, etc.) +- For byte-level access, convert via `.view(np.uint8)` +- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility + +**read/write contract**: + +- read/write operates on a **contiguous tensor** basis. + If non-contiguous stride views are needed, express them as separate copy ops. +- In the normal benchmark path, producer/consumer dtype match is expected. + Reinterpret cast is a permissive behavior for low-level memory validation + or special test cases. +- addr is byte-aligned, with minimum alignment = dtype size. +- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast. + Shape mismatch is verified based on nbytes, and raises an error on mismatch. +- Correctness criteria follow address-range-based read/write semantics. +- A tensor object cache may be used as an implementation optimization, + but the canonical state is byte-addressable storage. +- At deploy time, the host injects initial tensor data. + +### D8. Benchmark Kernel Code + +The benchmark's **user code API is not changed**. +The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained. + +However, internal command/message schemas may be extended to include metadata +required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose). + +### D9. No Component Changes + +Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified. +Op log recording is the responsibility of the ComponentBase hook. +When custom components are replaced, only the timing model changes, +and Phase 2 data execution is unaffected. + +### D10. Phase 2 is Optional + +```python +engine = GraphEngine(graph) +engine.run(benchmark) # Phase 1: timing only +result = engine.get_timing_result() + +if verify_data: + executor = DataExecutor(engine.op_log) # Phase 2: data + executor.run() + executor.verify(expected_output) +``` + +If only timing analysis is needed, Phase 2 is skipped. +If the op_logger is deactivated, Phase 1 performance is identical to the original. + +### D11. Verification Contract + +Basic verification **compares the final output tensor** against a reference backend (numpy). + +Per-dtype tolerance policy: + +| dtype | Comparison method | Tolerance | +|-------|----------|-----------| +| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 | +| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 | +| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 | +| int types | `np.array_equal` | exact | + +- Default mode: compare final output only (end-to-end correctness) +- Debug mode: can compare intermediate tensors on a per-op basis + (MemoryStore snapshot at each op boundary) + +--- + +## Non-goals + +- **Compute-result-based control flow**: not supported. + All compute handles are in pending state during Phase 1, + `wait()` expresses timing synchronization only and does not imply data readiness. + Accessing `handle.data`, element access, or truth-value evaluation in Phase 1 + is **treated as an error**. + Memory-data-based branching (results of `tl.load()`) is supported via greenlet. + Phase 1 materialization is a future extension (see D3). +- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce + the execution time overlap from Phase 1. Phase 2 only verifies data correctness. +- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls + and do not reproduce the actual hardware PE microarchitecture. + +## Open Questions + +- **Aliasing / slice view**: How to represent slice/views referencing the same + backing storage in MemoryStore (stride-based view vs copy semantics) +- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE + communication as memory ops or introduce a separate op_kind +- **Op log streaming**: Managing op_log memory usage in large-scale simulations + (in-memory list vs disk-backed streaming) +- **Fused operation**: Whether to record tl.composite's tiled pipeline + (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops +- **Math op schema generalization**: The current math params have a simple structure, + but generalization may be needed for broadcasting rules, per-input dtype, keepdims, + scalar/immediate operands, where/mask expressions, etc. +- **Op record identifier**: Currently dependency_ids are based on in-memory list indices; + replacement with stable op_id is needed when introducing streaming/disk-backed mode +- **Phase 1 materialization policy**: See Future Extension in D3. + If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops + needs to be defined + +--- + +## Consequences + +### Positive + +- Minimal impact on SimPy simulation performance (only op_log append added) +- Free to use multi-threading/GPU in Phase 2 +- Component replaceability preserved (ADR-0015 design philosophy maintained) +- No changes needed to benchmark user code API +- When adding new message types, only set the data_op flag +- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported +- `tl.load()` returns actual data, making kernel debugging easier + +### Negative + +- op_log memory usage (for large-scale simulations) +- Phase 2 execution time is proportional to tensor size (large GEMM) +- Dynamic branching based on pending handles (incomplete computations) not possible + (computations execute in Phase 2, result values are undetermined in Phase 1). + Memory-data-based branching is supported via greenlet. +- greenlet C extension dependency added (pip install greenlet) + +--- + +## Affected Files + +| File | Change | +|------|--------| +| `src/kernbench/components/base.py` | Add `_on_process_start/end` hooks | +| `src/kernbench/common/pe_commands.py` | Add `data_op = True`, extend metadata fields | +| `src/kernbench/sim_engine/op_log.py` | New: OpRecord, OpLogger | +| `src/kernbench/sim_engine/data_executor.py` | New: DataExecutor, MemoryStore | +| `src/kernbench/sim_engine/engine.py` | op_logger injection (optional) | +| `src/kernbench/triton_emu/tl_context.py` | greenlet switch calls inside `tl.load()` etc. | +| `src/kernbench/triton_emu/kernel_runner.py` | New: KernelRunner (greenlet ↔ SimPy bridge) | +| `src/kernbench/components/builtin/pe_cpu.py` | Remove Phase 0, change to KernelRunner invocation | +| `pyproject.toml` | Add greenlet dependency | + +Component implementation files (pe_gemm.py, pe_dma.py, hbm_ctrl.py, etc.): **no changes** +Benchmark kernels (benches/*.py): **no user API changes** diff --git a/docs/adr/ADR-0021-pe-pipeline-refactor.en.md b/docs/adr/ADR-0021-pe-pipeline-refactor.en.md new file mode 100644 index 0000000..afb4e4f --- /dev/null +++ b/docs/adr/ADR-0021-pe-pipeline-refactor.en.md @@ -0,0 +1,537 @@ +# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing + +## Status + +Proposed + +## Context + +### Problems with the Current Structure + +pe_accel (SchedulerV2Component) hides 5 hardware blocks (DmaIn, DmaWb, Gemm, Math, Tcm) +**inside a single component**. + +``` +SchedulerV2Component (single topology node) +├── DmaInBlock ← directly connected via internal SimPy Store +├── DmaWbBlock ← not visible in topology +├── GemmBlock ← not replaceable +├── MathBlock ← not replaceable +└── TcmBlock ← not replaceable +``` + +Problems: +- Blocks directly reference the next block via `desc.next_block` — hardcoded routing +- Individual blocks cannot be replaced (violates ADR-0015 component replacement principle) +- PE internal structure is not visible in the topology +- GemmBlock and MathBlock each duplicate TCM load/store logic + +### Actual Hardware Structure + +``` +HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine +``` + +- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns) +- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns) +- GEMM/MATH Engine: computation between Register Files (cycle-accurate) +- Completion signal: PE-internal 1-cycle wire signal (done pin assert) + +--- + +## Decision + +### D1. Separate Each Block into an Independent Component + +The internal blocks of pe_accel are separated into **independent PeEngineBase components**. +Existing 5 blocks + 1 Fetch/Store Unit = 6 components. + +| Component | Role | HW Correspondence | +|-----------|------|-------------------| +| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer | +| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine | +| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit | +| PE_GEMM | MAC compute (register only) | MAC Array | +| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit | +| PE_TCM | BW-serialized scratchpad | SRAM Bank | + +Each component exists as a topology node and is connected via ports/wires. +Replacing the `impl` allows changing the timing model of an individual block. + +### D2. Token Self-Routing — Scheduler Handles Only Dispatch + Completion + +**Components do not pass through the scheduler at every stage.** +The token carries a plan so that components chain directly to the next stage. + +``` +Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler + ↑ chaining: does not go through scheduler completion only +``` + +This matches the actual HW structure where each block's done signal is directly +connected to the next block via wire. The scheduler is responsible **only for +initial dispatch + completion aggregation**. + +#### Stage Definition + +```python +class StageType(Enum): + DMA_READ = 0 + FETCH = 1 + GEMM = 2 + MATH = 3 + STORE = 4 + DMA_WRITE = 5 +``` + +#### Plan Structure + +When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**. +The plan defines the **stage sequence** for each tile: + +```python +@dataclass +class Stage: + stage_type: StageType + component: str # topology node ID (e.g. "sip0.cube0.pe0.pe_dma") + params: dict # per-stage parameters (dynamic) + +@dataclass(frozen=True) +class TilePlan: + tile_id: int + stages: tuple[Stage, ...] # list of stages to execute in order (immutable) +``` + +The stage sequence varies depending on the plan: + +```python +# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM +stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE) + +# GEMM directly from TCM data (skip DMA read): +stages = (FETCH, GEMM, STORE, DMA_WRITE) + +# MATH element-wise: +stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE) + +# GEMM + accumulation (intermediate K-tile, skip writeback): +stages = (DMA_READ, FETCH, GEMM, STORE) # store to TCM only +``` + +**Components do not hardcode the next component.** +They read the next stage from the token's plan and forward it directly via out_port. +This is the same pattern as a network packet carrying a routing header. + +#### Pipeline Context + +```python +@dataclass +class PipelineContext: + id: str + total_tiles: int + completed_tiles: int = 0 + done_event: simpy.Event = None # succeeds when all tiles are complete + + def complete_tile(self) -> None: + self.completed_tiles += 1 + if self.completed_tiles == self.total_tiles: + self.done_event.succeed() +``` + +**Completion follows an exactly-once contract**: the last stage of each tile must call +`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must +succeed only once (SimPy Event constraint). + +#### Scheduler Role (Reduced) + +When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext, +enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately. + +Actual tile injection is handled by a **single feeder process** (`_feed_loop`). +This feeder consumes `_pending_feeds` in FIFO order and +**does not allow tile feed interleaving across composite commands.** +That is, the feed for the next command begins only after all tiles of the current +command have been injected into the first stage queue. + +There is **exactly one `_feed_loop`** per scheduler, and +tile feed for composite commands is performed exclusively through this single process. +Command issue order refers to **the order in which PE_SCHEDULER receives PeInternalTxn**. + +This structure maintains command issue order while ensuring that when the first stage +queue is full, only the feeder process blocks — the scheduler worker's inbox processing +itself does not stall. + +```python +class PeSchedulerV2(PeEngineBase): + _pipelines: dict[str, PipelineContext] + _pending_feeds: simpy.Store # FIFO of (plan, ctx) + + def start(self, env): + super().start(env) + self._pending_feeds = simpy.Store(env) + env.process(self._feed_loop(env)) + + def _dispatch_composite(self, env, pe_txn, cmd): + plan = generate_plan(cmd) + ctx = PipelineContext( + id=next_id(), + total_tiles=len(plan.tiles), + done_event=pe_txn.done, + ) + self._pipelines[ctx.id] = ctx + + # only enqueue to feeder queue and return immediately + yield self._pending_feeds.put((plan, ctx)) + + def _feed_loop(self, env): + """Single feeder process: feeds composite commands in FIFO order. + + Tile feed interleaving across composite commands is not allowed. + The feed for the next command begins only after all tiles of the + current command have been injected into the first stage queue. + + When the first stage queue is full, only this feeder blocks; + the scheduler worker's inbox processing does not stall. + """ + while True: + plan, ctx = yield self._pending_feeds.get() + for tile in plan.tiles: + token = TileToken( + tile_id=tile.tile_id, + pipeline_ctx=ctx, + plan=tile, + stage_idx=0, + params=tile.stages[0].params, + ) + yield self.out_ports[tile.stages[0].component].put(token) + # queue capacity = HW queue depth → feeder blocks only when full +``` + +In this ADR, the scheduler can accept multiple composite commands, +but tile submission order follows per-command FIFO. +Within a command, tile-level pipeline overlap is allowed, +but tile feed interleaving across commands is not. + +### D3. Data Transfer vs. Completion Signal — HW Modeling Criteria + +| Communication Type | Method | HW Correspondence | +|-------------------|--------|-------------------| +| Tile token (work directive) | message via out_port | enqueue to command queue | +| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue | +| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt | + +**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth. + +**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model +is applied to intra-PE stage triggers. Chaining between components corresponds to +PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost +is incurred. + +**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`. +When all tiles are complete, PipelineContext calls done_event.succeed(). + +### D4. Asynchronous Pipeline — Natural Overlap + +The scheduler processes CompositeCmds **asynchronously**. +However, tile feed does not spawn an independent process per command; instead, +the scheduler's internal **single feeder process** performs the feed in FIFO order. +Therefore, the scheduler can continue to receive the next command, +but the first-stage tile injection order is guaranteed per command. + +Since **SimPy Store capacity = HW queue depth**: +- When the queue is full, put() naturally blocks (backpressure) +- While DMA is processing tile 0, GEMM can start fetching an already-completed tile +- When a second CompositeCmd arrives, it is immediately queued to the DMA queue + +``` +First-stage feed order (feeder → DMA queue): + [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]... + ↑ cmd2 starts after cmd1 feed completes + +Runtime pipeline (downstream overlap): + PE_DMA: [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]... + PE_FETCH: [cmd1:t0][cmd1:t1]... + PE_GEMM: [cmd1:t0][cmd1:t1]... + ↑ pipeline overlap within the same command +``` + +Here, the overlap does not come from tile feed interleaving across different commands, +but occurs naturally as tiles from earlier commands progress to downstream stages +while the feeder continues injecting subsequent tiles. + +For example, tile feed for cmd2 does not start until all tiles of cmd1 have been +injected into the first stage queue. However, while cmd1.tile0 has already progressed +to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so +**pipeline overlap within the same command occurs naturally**. + +#### Component Chaining Pattern + +All components follow the same pattern: + +```python +def _pipeline_worker(self, env): + while True: + token = yield self._inbox.get() + + # process own stage + yield from self._process(env, token) + + # chain to next stage (read from plan) + next_idx = token.stage_idx + 1 + if next_idx < len(token.plan.stages): + next_stage = token.plan.stages[next_idx] + token.stage_idx = next_idx + token.params = next_stage.params + yield self.out_ports[next_stage.component].put(token) + else: + # last stage — pipeline completion + token.pipeline_ctx.complete_tile() +``` + +### D5. PE_FETCH_STORE — Dedicated TCM ↔ Register File Transfer + +Previously, GemmBlock and MathBlock each implemented their own TCM read/write. +This is separated into a **PE_FETCH_STORE component**. + +```python +# PE_FETCH_STORE._process() +def _process(self, env, token): + yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...)) + yield tcm_done + # chaining is handled by the base class (D4 pattern) +``` + +Advantages: +- GEMM/MATH perform **pure compute only** — no TCM access logic +- Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource) +- Prefetch strategies can be experimented with by replacing the fetch unit alone + +### D6. Simplification of Each Compute Component + +GEMM/MATH perform compute only with register data already prepared. +**Chaining follows the common pattern (D4), so only _process() needs to be implemented:** + +```python +# PE_GEMM._process() +def _process(self, env, token): + yield env.timeout(self._mac_latency(token.params)) + +# PE_MATH._process() +def _process(self, env, token): + yield env.timeout(self._simd_latency(token.params)) + +# PE_FETCH_STORE._process() +def _process(self, env, token): + yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...)) + yield tcm_done + +# PE_DMA._process() +def _process(self, env, token): + yield from self._do_fabric_dma(token.params) +``` + +By replacing only the timing model, one can freely switch between cycle-accurate +and analytical models. Since the chaining logic resides in the base class, +each component only implements its pure stage logic. + +### D7. Topology Changes + +Add PE_FETCH_STORE to the PE template: + +```yaml +pe_template: + components: + pe_cpu: { kind: pe_cpu, impl: pe_cpu_v1, ... } + pe_scheduler: { kind: pe_scheduler, impl: pe_scheduler_v2, ... } + pe_dma: { kind: pe_dma, impl: pe_dma_v1, ... } + pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... } + pe_gemm: { kind: pe_gemm, impl: pe_gemm_v1, ... } + pe_math: { kind: pe_math, impl: pe_math_v1, ... } + pe_mmu: { kind: pe_mmu, impl: pe_mmu_v1, ... } + pe_tcm: { kind: pe_tcm, impl: pe_tcm_v1, ... } + links: + # existing links... + fetch_store_to_tcm_bw_gbs: 512.0 + fetch_store_to_tcm_mm: 0.0 +``` + +PE internal edge connections: +``` +PE_SCHEDULER → PE_DMA (initial dispatch) +PE_SCHEDULER → PE_FETCH_STORE (initial dispatch) +PE_SCHEDULER → PE_GEMM (initial dispatch) +PE_SCHEDULER → PE_MATH (initial dispatch) +PE_DMA → PE_FETCH_STORE (chaining) +PE_FETCH_STORE → PE_GEMM (chaining) +PE_FETCH_STORE → PE_MATH (chaining) +PE_GEMM → PE_FETCH_STORE (store chaining) +PE_MATH → PE_FETCH_STORE (store chaining) +PE_FETCH_STORE → PE_DMA (writeback chaining) +PE_FETCH_STORE → PE_TCM (BW request) +``` + +Topology edges encompass both **control/dispatch visibility + runtime chaining**. +Scheduler → sub-component edges are initial dispatch paths, while +inter-component edges are runtime chaining paths driven by token self-routing. + +### D8. Existing Code Migration — Builtin Integration + +The existing builtin v1 components and pe_accel are **replaced with new builtin components**. + +#### Migration Strategy + +1. Back up existing `components/builtin/` → `components/builtin_legacy/` (preserved without modification) +2. Back up existing `components/custom/pe_accel/` → likewise +3. Re-implement new `components/builtin/` with the ADR-0021 architecture +4. Maintain **only one** topology.yaml (including pe_fetch_store) +5. components.yaml points to the new builtin + +```yaml +# components.yaml — new builtin +pe_scheduler_v1: kernbench.components.builtin.pe_scheduler:PeSchedulerComponent +pe_gemm_v1: kernbench.components.builtin.pe_gemm:PeGemmComponent +pe_math_v1: kernbench.components.builtin.pe_math:PeMathComponent +pe_dma_v1: kernbench.components.builtin.pe_dma:PeDmaComponent +pe_fetch_store_v1: kernbench.components.builtin.pe_fetch_store:PeFetchStoreComponent +pe_tcm_v1: kernbench.components.builtin.pe_tcm:PeTcmComponent +``` + +The impl names (pe_gemm_v1, etc.) are preserved, but **the implementations are replaced +with the ADR-0021 architecture**. Existing benchmarks and tests referencing topology.yaml +continue to work without changes. + +#### Latency Model Inheritance + +The latency modeling of the new builtin components (MAC cycle calculation, SIMD latency, +TCM BW serialization, DMA fabric latency, etc.) is **based on the current pe_accel +implementation**. The tile schedule generation logic from tiling.py is also carried over. +Only the architecture (component separation, self-routing) changes; timing accuracy +is preserved. + +#### Test Strategy + +#### Test Plan + +**1. Existing test pass** (regression): +After migration is complete, all existing tests (366) must pass. + +**2. Latency regression**: +Verify that the new builtin produces identical latency for the same inputs as pe_accel. + +**3. Phase 1 → Phase 2 end-to-end**: +Integration test from SimPy simulation (Phase 1) op_log generation → DataExecutor +(Phase 2) actual numpy computation → result correctness verification. +- GEMM: tl.composite(gemm) → op_log → Phase 2 matmul → allclose verification +- MATH: tl.exp / tl.add, etc. → op_log → Phase 2 numpy op → allclose verification +- Chaining: GEMM output → MATH input → final result end-to-end verification + +**4. TileToken self-routing**: +- Verify that tiles chain according to the plan's stage sequence +- Verify PipelineContext.complete_tile() exactly-once at the last stage +- Queue backpressure: verify that only the feeder blocks when DMA queue capacity is exceeded + +**5. Asynchronous pipeline overlap**: +- Verify that inter-tile stage overlap occurs within the same command (tile0 in GEMM while tile1 in DMA) +- Multiple commands: verify that cmd2 feed starts after cmd1 feed completes (FIFO order) + +### D9. TileToken Message Definition + +A message used for passing tile work between components. +The token carries the plan and stage index, enabling self-routing. + +```python +@dataclass +class TileToken: + tile_id: int + pipeline_ctx: PipelineContext # completion tracking + plan: TilePlan # full stage sequence for this tile (immutable) + stage_idx: int # current stage index in plan.stages + params: dict # current stage parameter cache (canonical: plan.stages[stage_idx].params) + data_op: bool = True # op_log recording target (ADR-0020) +``` + +A TileToken is **owned by exactly one component at a time** and +is never referenced by multiple components simultaneously (single-owner). + +Token lifecycle: +1. Scheduler creates it with stage_idx=0 and puts it to the first stage component +2. The component executes _process(), increments stage_idx, and puts it to the next component +3. The last stage component calls pipeline_ctx.complete_tile() +4. When all tiles are complete, PipelineContext calls done_event.succeed() + +Relationship with existing PeInternalTxn: +- PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged) +- TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing) + +--- + +## Non-goals + +- **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is not modified + (PeInternalTxn-based, ADR-0014 maintained) +- **Resource contention model across multiple pipelines**: the current scope focuses on + accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines + are future work. +- **builtin_legacy maintenance**: kept for backup purposes only; not a target for + bug fixes or feature additions. + +## Open Questions + +- **Register File capacity model**: whether to model capacity limits when the fetch unit + loads into registers. Capacity is expressed in bytes (register_file_bytes), and + the number of tiles that can be held simultaneously is determined by tile size. + When capacity is exceeded, fetch stalls, creating natural backpressure. +- **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite + commands. Therefore, overlap arises not from pre-injection across commands, but + naturally from pipeline progression of tiles within the same command. + If additional prefetch is needed, it should be considered at the level of tile ordering + within the same command or fetch/store unit policy, not cross-command injection. +- **PE_DMA coalescing**: per-tile DMA may cause fragmentation. + Direction is to merge/coalesce within DMA without scheduler involvement. +- **Synchronous execution mode**: this ADR adopts asynchronous pipeline as the + default/sole execution model. If a sync mode is needed for debug or validation + purposes, it will be considered in a future ADR. +- **TCM bank conflict across multiple pipelines**: currently based on a single pipeline. + Bank conflict modeling when multiple pipelines simultaneously access TCM is future work. + +--- + +## Consequences + +### Positive + +- Each block is an independent component — individually replaceable (ADR-0015 compliant) +- PE internal structure is visible in the topology +- Components do not know the next component — plan-based routing provides flexibility +- Natural pipeline overlap between DMA and compute (SimPy Store backpressure) +- Improved HW modeling accuracy (done signal = Event, data transfer = message) +- Fetch/store separation enables accurate TCM BW contention modeling + +### Negative + +- Increased number of PE internal components (5 → 6) — more topology nodes/edges +- Component separation makes intra-PE token forwarding more explicit than before +- Breaking change from existing builtin/pe_accel — migration required + +--- + +## Affected Files + +| File | Change | +|------|--------| +| `topology.yaml` | Add pe_fetch_store component, add chaining edges | +| `components.yaml` | Register new builtin components | +| `src/kernbench/topology/builder.py` | Add fetch_store + chaining edges to PE internal edges | +| `src/kernbench/common/pe_commands.py` | Add TileToken definition | +| `src/kernbench/components/builtin/pe_scheduler.py` | Re-implement (feeder + plan-based dispatch) | +| `src/kernbench/components/builtin/pe_gemm.py` | Re-implement (TileToken, _process pattern) | +| `src/kernbench/components/builtin/pe_math.py` | Re-implement (TileToken, _process pattern) | +| `src/kernbench/components/builtin/pe_dma.py` | Re-implement (TileToken, _process pattern) | +| `src/kernbench/components/builtin/pe_fetch_store.py` | New | +| `src/kernbench/components/builtin/pe_tcm.py` | Re-implement (TcmRequest service) | +| `src/kernbench/components/builtin/types.py` | New: TilePlan, Stage, StageType, PipelineContext, TileToken | +| `src/kernbench/components/builtin/tiling.py` | Ported from pe_accel: plan generation logic | + +Backup: +| `src/kernbench/components/builtin_legacy/` | Full backup of existing builtin (preserved without modification) | +| `src/kernbench/components/custom/pe_accel/` | Backup of existing pe_accel (preserved without modification) |