Add English translations for ADR-0018, 0019, 0020, 0021

- ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping
- ADR-0019: CUBE NOC per-channel and aggregated HBM connection model
- ADR-0020: 2-pass data execution model (timing/data separation, greenlet)
- ADR-0021: PE pipeline refactor (component separation + token self-routing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 16:31:32 -07:00
parent 10b33b44ba
commit b2c52f0e34
4 changed files with 1962 additions and 0 deletions
@@ -0,0 +1,441 @@
+# ADR-0018: LA-Based Memory Address Abstraction and HBM Channel Mapping Mode Introduction
+
+## Status
+
+Proposed
+
+## Context
+
+Kernbench simulates memory access between PE_DMA and Local-HBM within a CUBE.
+The current access path is VA-based, which makes it difficult to represent the two
+channel mapping models described below consistently.
+
+### Background: Local-HBM Pseudo Channel Structure
+
+The HBM in a CUBE consists of 32 or 64 pseudo channels.
+In the PE-Local-HBM model, each PE is responsible for an equal number of pseudo channels.
+
+Example: 64 pseudo channels, 8 PEs per cube -> each PE accesses 8 pseudo channels as local HBM
+
+Both the number of pseudo channels and the number of PEs are topology parameters.
+`N = hbm_pseudo_channels / pes_per_cube` (= channels_per_pe) determines
+the number of local channels per PE.
+
+The routing path BW between DMA and each pseudo channel matches the BW of each pseudo channel
+(e.g., 32 GB/s), so if a PE sends simultaneous requests to N channels, it can utilize the
+maximum memory BW.
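+
+The relationship above can be sketched as follows. This is a minimal illustration rather than Kernbench code: the helper name `local_channels` is hypothetical, and the sketch assumes that PE `p` owns the contiguous channel IDs `p*N` to `p*N + N-1`, in line with the "PE0 owns ch0-7" example in D6.
+
+```python
+def local_channels(pe_id: int, hbm_pseudo_channels: int, pes_per_cube: int) -> list[int]:
+    """Return the pseudo-channel IDs local to one PE (hypothetical helper)."""
+    if hbm_pseudo_channels % pes_per_cube != 0:
+        raise ValueError("pseudo channels must divide evenly across PEs")
+    n = hbm_pseudo_channels // pes_per_cube  # channels_per_pe
+    return list(range(pe_id * n, (pe_id + 1) * n))
+
+
+channel_bw_gbs = 32.0  # example per-channel bandwidth from the text
+chans = local_channels(pe_id=0, hbm_pseudo_channels=64, pes_per_cube=8)
+peak_bw_gbs = len(chans) * channel_bw_gbs  # all N local channels busy at once
+print(chans, peak_bw_gbs)
+```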
+
+### Limitations of the Current VA Model
+
+When a PE's local HBM is divided into 8 channels, requests must also be generated per channel and sent to DMA.
+However, in the current architecture, the kernel generates a request with a VA (`tl.load`)
+and passes it directly to DMA, making it difficult for PE_CPU to generate per-channel DMA requests.
+
+Therefore, instead of VA, we propose using **Logical Address (LA)**,
+where the **BAAW (Logical-to-Physical Mapping Unit)** inside PE_DMA
+converts LA to PA or a list of PAs based on segment-based mapping.
+
+### Two Channel Mapping Modes
+
+- **1:1 mode**: Creates and executes per-channel requests. Precise per-channel modeling.
+- **n:1 mode (default)**: Assumes interleaving across local HBM channels. Aggregated BW modeling.
+
+By supporting both modes, the overhead of the n:1 mode can be measured and evaluated.
+
+### Core Requirements
+
+- The effective bandwidth semantics of PE_DMA -> HBM_CTRL must be identical in both modes
+- The difference must only be in the request representation and resource modeling approach
+- The kernel programming model must not be changed
+- Physical channel information must not be exposed to the kernel
+
+### Existing Physical Address
+
+The current system's 51-bit Physical Address is defined in `policy/address/phyaddr.py`:
+
+```
+[50:47] rack_id (4 bit)
+[46:43] sip_id  (4 bit)
+[42:38] cube_id (5 bit, sip_seg)
+[37]    hbm_selector (1=HBM window)
+[36:0]  hbm_offset   (37 bit, 128GB per cube)
+```
+
+PA represents the final, routable, canonical physical destination,
+and this role is preserved.
+However, when and under which policy a logical access is converted into physical requests is not clearly separated.
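+
+To make the bit layout concrete, the following sketch packs and unpacks the fields listed above. It is not the code in `policy/address/phyaddr.py`; the names `FIELDS`, `encode_pa`, and `decode_pa` are hypothetical and only the field positions are taken from the table.
+
+```python
+# Field layout (msb, lsb) copied from the table above; helper names are hypothetical.
+FIELDS = {
+    "rack_id": (50, 47),
+    "sip_id": (46, 43),
+    "cube_id": (42, 38),
+    "hbm_selector": (37, 37),
+    "hbm_offset": (36, 0),
+}
+
+
+def encode_pa(**values: int) -> int:
+    """Pack field values into a 51-bit physical address (missing fields are 0)."""
+    pa = 0
+    for name, (msb, lsb) in FIELDS.items():
+        width = msb - lsb + 1
+        value = values.get(name, 0)
+        if value >> width:
+            raise ValueError(f"{name} does not fit in {width} bits")
+        pa |= value << lsb
+    return pa
+
+
+def decode_pa(pa: int) -> dict[str, int]:
+    """Unpack a 51-bit physical address into its fields."""
+    return {
+        name: (pa >> lsb) & ((1 << (msb - lsb + 1)) - 1)
+        for name, (msb, lsb) in FIELDS.items()
+    }
+
+
+pa = encode_pa(sip_id=1, cube_id=3, hbm_selector=1, hbm_offset=0x1000)
+print(hex(pa), decode_pa(pa))
+```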
+
+---
+
+## Decision
+
+### D1. Introduction of LA (Logical Address) — Replacing VA
+
+The existing VA (Virtual Address) infrastructure is replaced with LA (Logical Address).
+
+#### Characteristics of LA
+
+- As with VA, tensors can be mapped to a contiguous memory space
+- Represents logical buffer + offset
+- Does not directly contain physical channel information
+- An intermediate abstraction maintained until physical resolution
+- The sole address scheme used by kernel code (`tl.load`, `tl.store`, `tl.composite`)
+
+#### LA Space Definition
+
+| Item | Value |
+|------|-------|
+| LA start address | `0x1_0000_0000` (4 GB, preserving the existing VA start point) |
+| LA space size | 64 GB per PE |
+| Alignment unit | Segment-based (see D3 below) |
+
+LA is a PE-local address space.
+Even if different PEs use the same LA value, they resolve to different PAs
+because each PE has a different BAAW segment table.
+
+#### VA Infrastructure Removal Scope
+
+With the introduction of LA, the following existing code will be replaced/removed:
+
+| Removal Target | Replacement |
+|----------------|-------------|
+| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, name/role changed) |
+| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
+| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
+| `runtime_api/kernel.py`: MmuMapMsg, MmuUnmapMsg | Replaced with BaawSegmentInstallMsg |
+| `runtime_api/context.py`: VA alloc + MMU mapping install | LA alloc + BAAW segment install |
+| `runtime_api/tensor.py`: `va_base` field | `la_base` field |
+| `topology.yaml`: pe_mmu component entry | Removed |
+
+---
+
+### D2. Mapping Mode Configuration
+
+The mapping mode is configured at the cube level in topology.yaml:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
+    hbm_pseudo_channels: 64       # total pseudo channel count
+    hbm_channels_per_pe: 8        # local channel count per PE
+    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth
+```
+
+This configuration is referenced during graph compilation (topology builder) and BAAW initialization.
+
+---
+
+### D3. Segments and BAAW
+
+#### Segment Definition
+
+A segment is a logical allocation unit that partitions the LA space so that each segment
+maps to a specific HBM channel or channel group.
+
+Segments are created by the runtime allocator during tensor deployment,
+and BAAW uses them to convert LA into physical requests.
+
+#### BAAW Segment Table Entry
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class BaawSegment:
+    la_base: int          # segment start LA
+    la_size: int          # segment size (bytes)
+    mode: str             # "one_to_one" | "n_to_one"
+    # 1:1 mode fields
+    channel_count: int    # number of channels assigned to this segment (e.g., 8)
+    pa_bases: list[int]   # per-channel PA start address list (len = channel_count)
+    channel_ids: list[int]  # per-channel logical IDs (e.g., [0,1,2,...,7])
+    channel_size: int     # per-channel size (la_size // channel_count)
+    # n:1 mode fields
+    agg_pa_base: int      # aggregated PA start address
+    agg_node_id: str      # aggregated router node_id (for routing)
+```
+
+#### Segment Lifecycle
+
+1. **Allocation time** (tensor deploy):
+   - RuntimeContext allocates LA space from the LA allocator
+   - PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1)
+   - Sends `BaawSegmentInstallMsg` to PE_DMA to register in the segment table
+
+2. **Usage time** (kernel execution):
+   - Kernel issues `tl.load(la_ptr)` -> DmaReadCmd(src_addr=LA)
+   - PE_DMA looks up the segment corresponding to the LA in BAAW
+   - Converts to PA(s) according to the mode
+
+3. **Deallocation time** (tensor free):
+   - Removed from the segment table
+   - LA space returned, PA deallocated
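+
+The lifecycle above can be sketched with a simplified per-PE table. This is an assumption-laden illustration, not the planned `policy/address/baaw.py`: `Segment` keeps only the n:1 fields of `BaawSegment`, `SegmentTable` and its method names are hypothetical, and the PA values are made up. It also shows why the same LA resolves to different PAs on different PEs.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class Segment:
+    """Simplified stand-in for BaawSegment (n:1 fields only)."""
+    la_base: int
+    la_size: int
+    agg_pa_base: int
+
+
+class SegmentTable:
+    """Hypothetical per-PE table: install, look up, and uninstall segments."""
+
+    def __init__(self) -> None:
+        self._segments: list[Segment] = []
+
+    def install(self, seg: Segment) -> None:
+        seg_end = seg.la_base + seg.la_size
+        for other in self._segments:
+            if seg.la_base < other.la_base + other.la_size and other.la_base < seg_end:
+                raise ValueError("segment overlaps an installed segment")
+        self._segments.append(seg)
+
+    def find(self, la: int) -> Segment:
+        for seg in self._segments:
+            if seg.la_base <= la < seg.la_base + seg.la_size:
+                return seg
+        raise KeyError(f"no segment covers LA {la:#x}")
+
+    def uninstall(self, la_base: int) -> None:
+        self._segments = [s for s in self._segments if s.la_base != la_base]
+
+
+LA = 0x1_0000_0000
+pe0, pe1 = SegmentTable(), SegmentTable()
+pe0.install(Segment(la_base=LA, la_size=4096, agg_pa_base=0x20_0000_0000))
+pe1.install(Segment(la_base=LA, la_size=4096, agg_pa_base=0x20_4000_0000))
+
+# Same LA, different PE-local tables, different PAs.
+for table in (pe0, pe1):
+    seg = table.find(LA + 64)
+    print(hex(seg.agg_pa_base + (LA + 64 - seg.la_base)))
+```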
+
+---
+
+### D4. BAAW (Logical-to-Physical Mapping Unit)
+
+#### Location
+
+BAAW is placed as a front-end stage inside PE_DMA.
+It is not a separate SimPy component; it is synchronous address resolution logic
+executed at the beginning of PE_DMA's `handle_command()`.
+
+#### Input
+
+- LA (Logical Address) — DmaReadCmd.src_addr or DmaWriteCmd.dst_addr
+- access size (bytes)
+
+#### Output
+
+- 1:1 mode: `list[PhysicalRequest]` — each request is (PA, nbytes, channel_node_id)
+- n:1 mode: 1 `PhysicalRequest` — (agg_PA, nbytes, agg_node_id)
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class PhysicalRequest:
+    pa: int           # 51-bit Physical Address
+    nbytes: int       # transfer size for this request
+    dst_node: str     # target node_id (channel router or aggregated router)
+```
+
+#### BAAW Resolve Logic
+
+```python
+def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
+    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
+    offset = la - seg.la_base
+
+    if seg.mode == "n_to_one":
+        # Pass-through: one aggregated request
+        pa = seg.agg_pa_base + offset
+        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
+
+    if seg.mode == "one_to_one":
+        # Fan-out: one request per local channel
+        requests = []
+        per_ch_size = seg.channel_size
+        # Spread any remainder over the first channels so no bytes are dropped
+        base_nbytes, extra = divmod(nbytes, seg.channel_count)
+        for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)):
+            ch_offset = offset % per_ch_size  # interleaved or striped
+            ch_nbytes = base_nbytes + (1 if i < extra else 0)
+            pa = pa_base + ch_offset
+            dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
+            requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
+        return requests
+
+    raise ValueError(f"unknown mapping mode: {seg.mode!r}")
+```
+
+#### Scope of Responsibility
+
+BAAW is responsible for:
+- Converting logical accesses into physical request units
+- Performing fan-out (1:1) or pass-through (n:1) according to the mapping mode
+- Generating Physical Addresses and determining target nodes
+
+BAAW is NOT responsible for:
+- Performing actual data movement
+- Executing NOC routing
+- Simulating bandwidth consumption (this is the role of downstream components)
+
+#### Output Contract
+
+The output of BAAW must be request units that can be directly used by the simulator's
+routing and resource model without any additional address decoding.
+
+---
+
+### D5. PE_DMA handle_command() Changes
+
+#### Current Flow (VA-based)
+
+```
+DmaReadCmd.src_addr (VA)
+  -> MMU.translate(VA) -> PA
+  -> PhysAddr.decode(PA) -> PhysAddr object
+  -> resolver.resolve(PhysAddr) -> dst_node_id (e.g., "sip0.cube0.hbm_ctrl")
+  -> router.find_path(pe_prefix, dst_node_id) -> path
+  -> 1 sub-Transaction created -> fabric inject
+```
+
+#### New Flow (LA-based)
+
+```
+DmaReadCmd.src_addr (LA)
+  -> BAAW.resolve(LA, nbytes) -> list[PhysicalRequest]
+  -> For each PhysicalRequest:
+      -> router.find_path(pe_prefix, req.dst_node) -> path
+      -> compute_drain_ns(path, req.nbytes) -> drain
+      -> sub-Transaction created -> fabric inject
+  -> Wait for all sub-Transactions to complete
+  -> pe_txn.done.succeed()
+```
+
+Key changes:
+- MMU reference removed -> replaced with BAAW resolve
+- PhysAddr.decode() + resolver.resolve() -> BAAW directly returns dst_node
+- 1 request -> N requests injected in parallel (1:1 mode)
+
+---
+
+### D6. 1:1 Mode Details
+
+- One logical access -> N (= `channels_per_pe`) physical requests
+- N is a parameter determined by `hbm_pseudo_channels / pes_per_cube`
+- Each request:
+  - Fully resolved 51-bit PA
+  - Targets a specific channel router (`{pe_prefix}.ch_r{channel_id}`)
+- BW contention modeling via per-channel links
+- PE_DMA injects N sub-transactions simultaneously
+
+#### 1:1 Mode Example
+
+Configuration: `hbm_pseudo_channels=64`, `pes_per_cube=8`
+-> `channels_per_pe=8`, PE0 owns ch0-7
+
+```text
+Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "one_to_one", channel_count: 8,  # = channels_per_pe
+    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
+    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
+    channel_size: 512,  # = la_size / channel_count
+}
+
+BAAW resolve result (N=8 requests):
+  -> PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
+  -> PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
+  -> ...
+  -> PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
+
+PE_DMA: N sub-transactions injected in parallel
+  Each accesses HBM via channel router -> hbm_ctrl link (channel_bw_gbs)
+  Total effective BW = N x channel_bw_gbs
+```
+
+Examples with different N values:
+- `hbm_pseudo_channels=32`, `pes_per_cube=8` -> `channels_per_pe=4`, 4 requests
+- `hbm_pseudo_channels=64`, `pes_per_cube=4` -> `channels_per_pe=16`, 16 requests
+
+---
+
+### D7. n:1 Mode Details
+
+- One logical access -> one aggregated request
+- Target: aggregated router -> hbm_ctrl (see ADR-0019)
+- Aggregated link BW = `channels_per_pe` x `channel_bw_gbs` (e.g., 8 x 32 = 256 GB/s)
+- Modeled as a single queue / resource
+- No per-channel PA decomposition
+
+#### n:1 Mode Example
+
+```
+Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "n_to_one",
+    agg_pa_base: PA_agg,
+    agg_node_id: "sip0.cube0.pe0.agg_router",
+}
+
+BAAW resolve result:
+  -> PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
+
+PE_DMA: 1 sub-transaction injected
+  Accesses HBM via aggregated router -> hbm_ctrl link (256 GB/s)
+```
+
+---
+
+### D8. Kernel Model Preservation
+
+- The kernel still issues only single memory ops (`tl.load`, `tl.store`, `tl.composite`)
+- LA is the address scheme passed to the kernel
+- Channel decomposition/aggregation is performed by BAAW inside PE_DMA
+- Physical channel information is not exposed to kernel code
+
+---
+
+## Consequences
+
+### Positive
+
+- 1:1 vs n:1 semantics are clearly separated at a single point: BAAW
+- Kernel abstraction is preserved — no kernel code changes required
+- Topology-based policy control is possible (mode switching via yaml)
+- Improved simulation model consistency and debuggability
+- Segment-based mapping is simpler and has lower overhead compared to page tables
+
+### Negative
+
+- Full refactoring of VA/MMU-based code is required
+- Increased complexity in the request generation path (managing N requests in 1:1 mode)
+- Reduced per-channel visibility in n:1 mode
+- Existing VA-related tests must be rewritten
+
+---
+
+## Alternatives
+
+### A1. Keep VA + Fan-out at MMU
+
+- Extend MMU to return per-channel PAs
+- Problem: MMU's role expands beyond address translation to include request decomposition
+- Problem: Aggregation representation is difficult in n:1 mode
+
+### A2. Kernel Generates Channel-Aware Requests
+
+- Kernel directly calls per-channel load/store
+- Problem: Abstraction leakage, reduced portability
+- Problem: All benchmark code must be modified
+
+### A3. Always Use PA (Without LA)
+
+- Runtime directly passes per-channel PA to the kernel
+- Problem: Conflicts with the aggregation model
+- Problem: Conversion timing is unclear, channel information exposed to kernel
+
+---
+
+## Implementation Notes
+
+### Implementation Order
+
+1. Introduce LA type (`policy/address/la_allocator.py`)
+2. Implement BAAW segment table (`policy/address/baaw.py`)
+3. Add `BaawSegmentInstallMsg` message type (`runtime_api/kernel.py`)
+4. Integrate BAAW into PE_DMA (`components/builtin/pe_dma.py` handle_command changes)
+5. Modify RuntimeContext: LA alloc + segment install (`runtime_api/context.py`)
+6. Change Tensor.va_base -> la_base (`runtime_api/tensor.py`)
+7. Remove VA/MMU code
+8. Remove pe_mmu from topology.yaml, add mapping mode configuration
+9. Test migration
+
+### Affected Existing Tests
+
+| Test File | Impact |
+|-----------|--------|
+| `tests/test_mmu_component.py` | Remove -> replace with BAAW segment install test |
+| `tests/test_mmu_fabric.py` | Remove -> replace with BAAW + fabric integration test |
+| `tests/test_pe_mmu.py` | Remove |
+| `tests/test_va_allocator.py` | Replace with LA allocator test |
+| `tests/test_va_integration.py` | Replace with LA + BAAW integration test |
+| `tests/test_va_offset.py` | Replace with LA offset test |
+
+---
+
+## Test Requirements
+
+- For the same logical access:
+  - 1:1 -> verify N requests are generated
+  - n:1 -> verify 1 aggregated request is generated
+- Verify effective bandwidth consistency across both modes
+- 1:1 -> verify per-channel contention modeling
+- n:1 -> verify aggregated bandwidth is reflected
+- Verify operation without kernel code changes
+- Verify correct BAAW segment install/uninstall operation
+- Verify no conflicts when multiple tensors are assigned to different segments
+
+---
+
+## Links
+
+- ADR-0011 (Memory Addressing Simplification — PA-first, VA/MMU introduction) -> superseded by this ADR
+- ADR-0019 (NOC Per-Channel HBM Connection Model) -> topology-side integration
+- ADR-0014 (PE Internal Execution Model) -> PE_DMA change impact
@@ -0,0 +1,431 @@
+# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC
+
+## Status
+
+Proposed
+
+## Context
+
+ADR-0018 introduced LA-based address abstraction and BAAW,
+defining how a logical memory access is translated into the following two forms of requests:
+
+- 1:1 mode: one logical access → N per-channel requests
+- n:1 mode: one logical access → one aggregated request
+
+Here N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`),
+determined by topology parameters.
+
+### Problems with the Existing Structure
+
+In the current implementation (`topology/builder.py`):
+
+- PE_DMA → NOC → xbar_top/xbar_bot → HBM_CTRL.slice{0-7} path is used
+- HBM is modeled as 8 slice (= per-PE) nodes
+- Local and remote accesses use different paths:
+  - local: NOC → xbar → HBM slice
+  - cross-half: NOC → xbar_top → bridge → xbar_bot → HBM slice
+  - remote cube: NOC → UCIe → remote NOC → remote xbar → remote HBM slice
+
+Limitations of this structure:
+
+- Cannot model at the pseudo-channel granularity (slice = per-PE granularity, not per-channel)
+- xbar/bridge bifurcate local/remote paths
+- Cannot express 1:1 / n:1 modes consistently
+
+---
+
+## Decision
+
+### D1. HBM Attaches to PE Routers
+
+Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**,
+and attach the HBM access point to the same router where the PE is attached.
+
+- n:1 mode: PE's local HBM access goes directly from its own router (switching overhead only, 0 hops)
+- Remote PE's HBM access: reaches the target PE's router via mesh hops
+- The read/write resource model within the HBM controller is preserved
+
+Node naming changes:
+
+| Current | After Change |
+| ---- | ------- |
+| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |
+
+In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that
+the builder generates an edge between that router and hbm_ctrl.
+
+---
+
+### D2. Complete Removal of xbar, bridge, and Single NOC Node
+
+Remove all of the following nodes and related edges:
+
+- `{cube}.xbar_top`, `{cube}.xbar_bot`
+- `{cube}.bridge.left`, `{cube}.bridge.right`
+- `{cube}.noc` (single TwoDMeshNocComponent node)
+- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
+- Edges of type `xbar_to_bridge`, `bridge_to_xbar`
+- Edges of type `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`, etc. referencing the single noc node
+
+Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**.
+Each router (r0c0, r0c1, ...) from the 6x6 router grid generated by `mesh_gen.py`
+is created as a separate SimPy node in the topology graph,
+and adjacent routers are connected via XY mesh edges.
+
+---
+
+### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)
+
+#### Router Nodes Based on cube_mesh.yaml
+
+Each non-null router from cube_mesh.yaml generated by `mesh_gen.py`
+is created as a **separate SimPy node** in the topology graph.
+
+- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
+- kind: `noc_router`, impl: `forwarding_v1`
+- pos_mm: taken from cube_mesh.yaml
+
+Based on the attach information in cube_mesh.yaml, components are connected to each router:
+- `pe{p}.dma` → PE_DMA ↔ router edge
+- `pe{p}.cpu` → PE_CPU ↔ router edge
+- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
+- `m_cpu` → M_CPU ↔ router edge
+- `sram` → SRAM ↔ router edge
+- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge
+
+Router-to-router XY mesh edges: bidirectional edges between adjacent routers.
+Null routers (HBM exclusion zones) are skipped.
+
+#### 1:1 Mode Extension (To Be Implemented Later)
+
+In 1:1 mode, each router is split into N channel mini-routers.
+This requires per-channel routing and a ChannelSplitter (LA → per-channel PA).
+N GEMM engines per PE are also added at this point.
+
+---
+
+### D4. Cross-PE HBM Access (n:1 Mode)
+
+In n:1 mode, when a PE accesses another PE's local HBM,
+it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.
+
+Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:
+
+```text
+PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
+```
+
+The Dijkstra router finds the shortest path in the mesh.
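+
+A minimal sketch of shortest-path routing over such a mesh is shown below. It is not the Kernbench router: the function names, the uniform cost of 1 per hop, and the positions of the null routers are assumptions made for illustration.
+
+```python
+import heapq
+
+Coord = tuple[int, int]  # (row, col)
+
+
+def neighbors(node: Coord, rows: int, cols: int, null: set[Coord]) -> list[Coord]:
+    """Adjacent mesh routers, skipping null routers and positions off the grid."""
+    r, c = node
+    steps = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
+    return [(nr, nc) for nr, nc in steps
+            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in null]
+
+
+def shortest_path(src: Coord, dst: Coord, rows: int, cols: int,
+                  null: set[Coord]) -> list[Coord]:
+    """Dijkstra with a uniform cost of 1 per hop (an assumption of this sketch)."""
+    dist = {src: 0}
+    prev: dict[Coord, Coord] = {}
+    heap = [(0, src)]
+    while heap:
+        d, node = heapq.heappop(heap)
+        if node == dst:
+            break
+        if d > dist[node]:
+            continue  # stale heap entry
+        for nxt in neighbors(node, rows, cols, null):
+            if d + 1 < dist.get(nxt, float("inf")):
+                dist[nxt] = d + 1
+                prev[nxt] = node
+                heapq.heappush(heap, (d + 1, nxt))
+    if dst not in dist:
+        raise ValueError("destination unreachable")
+    path = [dst]
+    while path[-1] != src:
+        path.append(prev[path[-1]])
+    return path[::-1]
+
+
+NULL = {(2, 2), (2, 3), (3, 2), (3, 3)}  # hypothetical HBM exclusion zone
+path = shortest_path((0, 0), (1, 4), rows=6, cols=6, null=NULL)
+print(" -> ".join(f"r{r}c{c}" for r, c in path))
+```
+
+Several equally short routes exist between r0c0 and r1c4, so the sketch may return a different 5-hop path than the one listed above.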
+
+Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3.
+
+---
+
+### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh
+
+In n:1 mode, no separate "aggregated router" is created.
+The existing router grid from cube_mesh.yaml serves that role.
+
+#### Connection Structure
+
+PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:
+
+```text
+sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
+sip0.cube0.hbm_ctrl   ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
+```
+
+Routers are connected via XY mesh edges. PE's local HBM access goes
+directly from its own router (switching overhead only).
+
+#### n:1 Mode Full Data Paths
+
+**Local HBM (0 hops):**
+```text
+PE0.pe_dma → r0c0 → hbm_ctrl  (switching overhead only)
+```
+
+**Remote HBM (mesh hops):**
+```text
+PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
+```
+
+**M_CPU DMA:**
+```text
+M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
+```
+
+---
+
+### D6. All Traffic Is Unified onto the Same Router Mesh
+
+- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
+- Local access does not use a separate fast path (xbar)
+- Cross-cube (remote) access path:
+
+```text
+PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
+  → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
+```
+
+UCIe connections maintain the existing structure,
+but both endpoints become mesh routers instead of xbars.
+
+The number of UCIe lines is determined by BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.
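+
+As a worked example of this formula, the sketch below uses made-up bandwidth values; the function name and the numbers are illustrative and not taken from topology.yaml.
+
+```python
+import math
+
+
+def ucie_lines_per_side(ucie_bw_gbs: float, noc_line_bw_gbs: float) -> int:
+    """Number of NOC lines needed to carry the UCIe bandwidth on one side."""
+    if noc_line_bw_gbs <= 0:
+        raise ValueError("noc_line_bw_gbs must be positive")
+    return math.ceil(ucie_bw_gbs / noc_line_bw_gbs)
+
+
+exact = ucie_lines_per_side(512.0, 256.0)       # divides evenly
+rounded_up = ucie_lines_per_side(600.0, 256.0)  # partial line rounds up
+print(exact, rounded_up)
+```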
+
+---
+
+### D7. AddressResolver Changes
+
+Current `AddressResolver.resolve()`:
+
+```python
+# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
+pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
+return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
+```
+
+After change:
+
+```python
+# Changed: HBM → single endpoint
+return f"sip{s}.cube{c}.hbm_ctrl"
+```
+
+The pe_slice calculation is removed.
+In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.
+
+resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility.
+
+---
+
+### D8. topology.yaml Configuration Changes
+
+#### Added Settings
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one          # one_to_one | n_to_one
+    hbm_pseudo_channels: 64             # total pseudo channel count
+    hbm_channels_per_pe: 8              # local channels per PE (= pseudo_channels / pes_per_cube)
+    hbm_channel_bw_gbs: 32.0            # per-channel bandwidth (GB/s)
+    hbm_total_gb_per_cube: 48           # retained
+```
+
+#### Removed Settings
+
+```yaml
+# To be removed
+links:
+  xbar_to_hbm_bw_gbs: 256.0            # → replaced by channel_bw_gbs × channels_per_pe
+  xbar_to_hbm_mm: 2.5                  # → replaced by ch_router_to_hbm_mm
+  xbar_to_bridge_bw_gbs: 128.0         # → removed (no bridge)
+  xbar_to_bridge_mm: 3.0               # → removed
+  noc_to_xbar_bw_gbs: ...              # → removed
+  noc_to_xbar_mm: ...                  # → removed
+```
+
+#### Added Link Settings
+
+```yaml
+links:
+  router_link_bw_gbs: 256.0            # XY mesh link BW between routers
+  router_overhead_ns: 2.0              # router switching overhead
+  pe_to_router_bw_gbs: 256.0           # PE_DMA ↔ router
+  hbm_to_router_bw_gbs: 256.0          # HBM ↔ router (= N × channel_bw)
+```
+
+---
+
+### D9. Bandwidth Numerical Consistency
+
+| Configuration | Value |
+| ---- | --- |
+| pseudo channels per cube | 64 (parameter) |
+| PEs per cube | 8 (parameter) |
+| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
+| per-channel BW | 32 GB/s (parameter) |
+| per-PE local BW | N × 32 = 256 GB/s |
+| cube total HBM BW | 64 × 32 = 2048 GB/s |
+
+The effective BW per PE is identical in both modes:
+
+- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
+- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s
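+
+The equivalence can be checked with a small calculation. The sketch below is a simplification under stated assumptions: it ignores router overhead and contention, treats 1 GB/s as 1 byte/ns, assumes the access splits evenly across channels, and uses hypothetical function names.
+
+```python
+def drain_ns(nbytes: float, bw_gbs: float) -> float:
+    """Time to push nbytes through a link; 1 GB/s equals 1 byte/ns."""
+    return nbytes / bw_gbs
+
+
+def one_to_one_ns(nbytes: int, n: int, channel_bw_gbs: float) -> float:
+    """N equal sub-requests run in parallel; the access ends with the slowest one."""
+    per_channel = nbytes / n
+    return max(drain_ns(per_channel, channel_bw_gbs) for _ in range(n))
+
+
+def n_to_one_ns(nbytes: int, n: int, channel_bw_gbs: float) -> float:
+    """One request over a single link with the aggregated bandwidth."""
+    return drain_ns(nbytes, n * channel_bw_gbs)
+
+
+t_1to1 = one_to_one_ns(4096, n=8, channel_bw_gbs=32.0)
+t_nto1 = n_to_one_ns(4096, n=8, channel_bw_gbs=32.0)
+print(t_1to1, t_nto1)
+```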
+
+---
+
+## Consequences
+
+### Positive
+
+- The router mesh based on cube_mesh.yaml accurately reflects physical placement
+- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
+- Local / remote / command traffic is unified onto the same mesh, which simplifies the model
+- Aligns well with graph compiler-based topology generation
+- Channel count and PE count are both parameterized, enabling testing of various configurations
+- The 1:1 mode extension follows naturally by splitting each router into channel mini-routers
+
+### Negative
+
+- The number of SimPy nodes increases due to explicit router nodes (up to 32 routers per cube from the 6x6 grid, since null routers are skipped)
+- Requires complete rewrite of existing xbar/bridge/single NOC-based tests
+- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model
+
+---
+
+## Alternatives
+
+### A1. Retain Existing xbar + HBM Slices
+
+- Local/remote paths remain bifurcated
+- Cannot model at pseudo-channel granularity
+- Cannot switch between 1:1/n:1 modes
+
+### A2. Always Generate Per-Channel Links and Aggregate Only in n:1
+
+- The topology always has the size of the 1:1 configuration
+- Expressing n:1 semantics via link aggregation is complex
+- No reduction in router node count
+
+### A3. Gradual Transition (Retain xbar + Add NOC Path)
+
+- Higher compatibility, but dual-path coexistence increases complexity
+- Since xbar removal is ultimately necessary, the intermediate step provides little value
+
+---
+
+## Implementation Notes
+
+### topology/builder.py Change Details
+
+#### Code to Remove (within current `_instantiate_cube()`)
+
+- xbar_top, xbar_bot node creation (~line 495-508)
+- bridge.left, bridge.right node creation
+- noc ↔ xbar edge creation (~line 540-555)
+- xbar ↔ hbm_ctrl.slice edge creation (~line 510-538)
+- xbar ↔ bridge edge creation (~line 557-572)
+
+#### Code to Add
+
+1:1 mode:
+
+```python
+N = hbm_channels_per_pe  # from topology config
+total_ch = hbm_pseudo_channels
+
+# Create channel router nodes
+for ch_id in range(total_ch):
+    pe_id = ch_id // N
+    nodes[f"{cp}.ch_r{ch_id}"] = Node(
+        id=f"{cp}.ch_r{ch_id}", kind="noc_router", impl="noc_v1",
+        attrs={}, pos_mm=(...),  # horizontal row = ch_id % N
+    )
+
+# PE_DMA ↔ local channel router edges
+for pe_id in range(pes_per_cube):
+    for local_ch in range(N):
+        ch_id = pe_id * N + local_ch
+        edges.append(Edge(
+            src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.ch_r{ch_id}",
+            bw_gbs=channel_bw, kind="pe_to_ch_router", ...))
+        edges.append(Edge(
+            src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.pe{pe_id}.pe_dma",
+            bw_gbs=channel_bw, kind="ch_router_to_pe", ...))
+
+# Channel router ↔ hbm_ctrl edges
+for ch_id in range(total_ch):
+    edges.append(Edge(
+        src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.hbm_ctrl",
+        bw_gbs=channel_bw, kind="ch_router_to_hbm", ...))
+    edges.append(Edge(
+        src=f"{cp}.hbm_ctrl", dst=f"{cp}.ch_r{ch_id}",
+        bw_gbs=channel_bw, kind="hbm_to_ch_router", ...))
+
+# Horizontal line edges (same logical index)
+for row in range(N):
+    for p in range(pes_per_cube - 1):
+        ch_a = p * N + row
+        ch_b = (p + 1) * N + row
+        edges.append(Edge(
+            src=f"{cp}.ch_r{ch_a}", dst=f"{cp}.ch_r{ch_b}",
+            bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
+        edges.append(Edge(
+            src=f"{cp}.ch_r{ch_b}", dst=f"{cp}.ch_r{ch_a}",
+            bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
+```
+
+n:1 mode:
+
+```python
+# Create aggregated router nodes
+for pe_id in range(pes_per_cube):
+    nodes[f"{cp}.pe{pe_id}.agg_router"] = Node(
+        id=f"{cp}.pe{pe_id}.agg_router", kind="noc_router", impl="noc_v1",
+        attrs={}, pos_mm=(...),
+    )
+
+agg_bw = N * channel_bw  # aggregated BW
+
+# PE_DMA ↔ aggregated router
+for pe_id in range(pes_per_cube):
+    edges.append(Edge(
+        src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.pe{pe_id}.agg_router",
+        bw_gbs=agg_bw, kind="pe_to_agg_router", ...))
+    edges.append(Edge(
+        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.pe{pe_id}.pe_dma",
+        bw_gbs=agg_bw, kind="agg_router_to_pe", ...))
+
+# Aggregated router ↔ hbm_ctrl
+for pe_id in range(pes_per_cube):
+    edges.append(Edge(
+        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.hbm_ctrl",
+        bw_gbs=agg_bw, kind="agg_to_hbm", ...))
+    edges.append(Edge(
+        src=f"{cp}.hbm_ctrl", dst=f"{cp}.pe{pe_id}.agg_router",
+        bw_gbs=agg_bw, kind="hbm_to_agg", ...))
+
+# Horizontal links between aggregated routers
+for p in range(pes_per_cube - 1):
+    edges.append(Edge(
+        src=f"{cp}.pe{p}.agg_router", dst=f"{cp}.pe{p+1}.agg_router",
+        bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
+    edges.append(Edge(
+        src=f"{cp}.pe{p+1}.agg_router", dst=f"{cp}.pe{p}.agg_router",
+        bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
+```
+
+### Affected Existing Tests
+
+| Test File | Impact |
+| ---------- | ---- |
+| `tests/test_topology_compile.py` | Remove xbar/bridge node references, add channel router verification |
+| `tests/test_topology_load.py` | Reflect topology.yaml configuration changes |
+| `tests/test_pe_components.py` | PE_DMA routing path changes |
+| `tests/test_sip_parallel.py` | Cross-PE access path changes |
+| Cases that directly test xbar/bridge | Remove |
+
+---
+
+## Test Requirements
+
+- Verify that requests are delivered via per-channel links in 1:1 mode
+- Verify that requests are delivered via the aggregated link in n:1 mode
+- Verify that topology is correctly generated in both modes:
+  - 1:1: `total_ch` channel routers + per-PE links + horizontal links
+  - n:1: `pes_per_cube` aggregated routers + per-PE links
+- Verify that effective BW is consistent across both modes for the same workload
+- Verify that horizontal line routing works for cross-PE access
+- Verify that routing through UCIe works for cross-cube access
+- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)
+
+---
+
+## Links
+
+- ADR-0018 (LA + BAAW) → addressing-side integration
+- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
+- ADR-0004 (Memory Semantics) → BW model redefinition
+- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
@@ -0,0 +1,553 @@
+# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
+
+## Status
+
+Proposed
+
+## Context
+
+The current simulation models **timing only**.
+`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
+but do not actually read tensor data or perform computations.
+
+### Required Capabilities
+
+1. Must be able to store and read actual data in HBM/TCM/SRAM
+2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
+3. Must minimize simulation performance degradation
+
+### Limitations of the Existing Kernel Execution Structure
+
+The current kernel execution is separated into 2 stages:
+
+```
+Phase 0: Kernel function execution in TLContext → PeCommand list generation (outside SimPy, no data)
+Phase 1: PE_CPU replays PeCommand list via SimPy (timing only)
+```
+
+Phase 0 requires the kernel to **complete execution entirely** before SimPy begins.
+`tl.load()` returns a TensorHandle (placeholder), so actual data cannot be accessed.
+Therefore, branching based on data values (dynamic control flow) is impossible.
+
+This ADR resolves this limitation **for memory operations only** (see D1, D3).
+
+### Constraints
+
+- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
+- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
+- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
+- Kernel functions must remain plain Python functions (no generator/async transformation)
+
+### Design Exploration Results
+
+| Option | Approach | Verdict |
+|--------|----------|---------|
+| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
+| SimPy + ThreadPool | `executor.submit()` → timeout → `future.result()` | Rejected: blocks on `result()` for back-to-back requests |
+| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
+| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Adopted: full separation, minimal impact on simulation performance |
+
+---
+
+## Decision
+
+### D1. 2-Pass Execution Model — Phase 0 Elimination
+
+Phase 0 is eliminated. Adding a data pass on top of the existing flow would give 3 stages
+(Phase 0 → Phase 1 → Phase 2); these are **consolidated into 2 stages**.
+
+Before:
+```
+Phase 0: Kernel → PeCommand list (no data, no branching)
+Phase 1: Replay PeCommand list via SimPy (timing only)
+```
+
+After:
+```
+Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
+  - Memory read/write: SimPy timing + MemoryStore actual data
+  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
+  - Dynamic control flow possible (tl.load returns actual data)
+
+Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
+```
+
+This ADR **extends Phase 1 to be data-aware for memory operations only**.
+Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
+Phase 2 handles GEMM/Math computation correctness verification.
+Phase 2 is optional — if only timing is needed, run Phase 1 alone.
+
+### D2. Op Log Recording — ComponentBase Hook
+
+Op log recording is performed as a **hook in the component base class**.
+Individual component implementations are not modified.
+
+```python
+class ComponentBase:
+    def _on_process_start(self, env, msg):
+        if self._op_logger and getattr(msg, 'data_op', False):
+            self._op_logger.record_start(env.now, self.node.id, msg)
+
+    def _on_process_end(self, env, msg):
+        if self._op_logger and getattr(msg, 'data_op', False):
+            self._op_logger.record_end(env.now, self.node.id, msg)
+```
+
+Hooks are called before and after `run()` within `_forward_txn()`.
+`_op_logger` is optional; when it is absent, each hook costs only a single attribute check.
+
+**Hook timing definitions**:
+
+| Timing | Meaning |
+|--------|---------|
+| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
+| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |
+
+Link traversal latency is not included in t_start/t_end.
+Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
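+
+The timing definitions above can be sketched as a small logger that pairs each `record_start` with its `record_end`. This is a minimal illustration, not the planned `OpLogger`: the `ServiceSpan` type, the use of `id(msg)` as a key, and the `link_latency` helper are all assumptions made for the example.
+
+```python
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass
+class ServiceSpan:
+    """Hypothetical record of one component servicing one message."""
+    component_id: str
+    msg_id: int
+    t_start: float
+    t_end: Optional[float] = None
+
+
+class OpLogger:
+    """Pairs record_start/record_end calls into spans (illustrative only)."""
+
+    def __init__(self):
+        self.spans = []
+        self._open = {}  # (component_id, msg_id) -> span still being serviced
+
+    def record_start(self, now, component_id, msg):
+        span = ServiceSpan(component_id, id(msg), now)
+        self._open[(component_id, id(msg))] = span
+        self.spans.append(span)
+
+    def record_end(self, now, component_id, msg):
+        # Close the span opened by the matching record_start.
+        self._open.pop((component_id, id(msg))).t_end = now
+
+
+def link_latency(sender: ServiceSpan, receiver: ServiceSpan) -> float:
+    # Link traversal belongs to neither span; it is the gap between them.
+    return receiver.t_start - sender.t_end
+```
+
+With spans recorded this way, a message serviced by PE_DMA until t=25 and picked up by HBM_CTRL at t=30 shows a link latency of 5 without either component measuring it.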
+
+### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
+
+The existing Phase 0 (kernel → PeCommand list) is eliminated,
+and **greenlet** is used to cooperatively interleave kernel and SimPy execution.
+
+#### Operating Principle
+
+greenlet is a C extension that provides cooperative context switching.
+When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
+to perform timing simulation, and after completion, returns to the kernel with actual data.
+
+```
+SimPy loop (parent greenlet)           Kernel (child greenlet)
+─────────────────────────              ──────────────────────
+g.switch() ─────────────────────────→ Kernel starts
+                                       a = tl.load(ptr, ...)
+                                         internal: parent.switch(DmaReadCmd)
+cmd = DmaReadCmd ←──────────────────  (kernel paused)
+  yield DmaReadMsg(...)
+  yield env.timeout(dma_latency)
+  data = memory_store.read(...)
+g.switch(data) ─────────────────────→ (kernel resumed)
+                                       a = data  ← actual numpy array
+                                       if a[0][0] > 0.5:  ← branching possible
+                                         ...
+```
+
+The kernel is maintained as a **plain Python function**.
+greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
+
+#### KernelRunner — Framework Layer
+
+The greenlet loop resides not in the PE_CPU component but in the framework layer,
+**KernelRunner**.
+
+```python
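+# KernelRunner (framework: greenlet ↔ SimPy bridge)
+from greenlet import greenlet
+
+class KernelRunner:
+    def run(self, env, kernel_fn, args, store):
+        g = greenlet(self._run_kernel)
+        cmd = g.switch(kernel_fn, args)     # run the kernel until its first tl.* call
+
+        while cmd is not None:              # None: the kernel function has returned
+            if isinstance(cmd, DmaReadCmd):
+                yield from self._dispatch_dma(env, cmd)
+                # MemoryStore takes the address space first (see D7);
+                # src_space/dst_space on the command are assumed field names
+                data = store.read(cmd.src_space, cmd.src_addr, cmd.shape, cmd.dtype)
+                cmd = g.switch(data)            # resume with actual data
+            elif isinstance(cmd, GemmCmd):
+                yield from self._dispatch_gemm(env, cmd)
+                cmd = g.switch()                # resume (no data)
+            elif isinstance(cmd, DmaWriteCmd):
+                store.write(cmd.dst_space, cmd.dst_addr, cmd.data)  # visibility = issue time
+                yield from self._dispatch_dma(env, cmd)  # timing only
+                cmd = g.switch()
+            else:
+                # without this branch an unknown command would spin forever
+                raise TypeError(f"unsupported command: {type(cmd).__name__}")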
+
+# PE_CPU (component — kept simple, unaware of greenlet)
+def _execute_kernel(self, env):
+    runner = KernelRunner(self.ctx)
+    yield from runner.run(env, kernel_fn, args, store)
+```
+
+**Op logging single source of truth**: KernelRunner does not record directly to op_log.
+All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
+When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
+the component base class hooks automatically record them.
+
+**Layer separation**:
+- **Kernel code**: plain function, unaware of greenlet
+- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
+- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
+- **ComponentBase hook**: the sole path for op_log recording
+- **PE_CPU**: only calls KernelRunner, replaceable as a component
+
+#### Handling Differences Between Memory Read/Write and Compute
+
+| Operation | In Phase 1 | In Phase 2 |
+|-----------|-----------|-----------|
+| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
+| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
+| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
+| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |
+
+Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
+GEMM/Math operations are batch-executed in Phase 2 (performance separation).
+
+#### Store Visibility Rule
+
+`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
+SimPy DMA timing is simulated separately afterward.
+
+This is an intentional separation of timing and visibility:
+- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
+- **timing**: the point at which DMA latency completes in SimPy
+
+This separation allows a load immediately after a store to see the latest data in dynamic control flow.
+
+#### Result Handle Semantics
+
+`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.
+
+The key contract in Phase 1:
+
+1. **All compute handles are always considered pending in Phase 1.**
+2. `tl.wait(handle)` **expresses timing synchronization only**
+   and does not make the handle ready.
+3. Accessing the handle's actual result data (`handle.data`, element access,
+   numpy conversion, etc.) is **only possible in Phase 2**.
+4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
+5. In contrast, `tl.load()` returns actual data in Phase 1, so
+   **memory-read-based control flow is supported**.
+
+| Handle state | Phase | Allowed operations |
+|------------|-------|----------|
+| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
+| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
+| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
+| ready | Phase 2 | Actual numpy data access, verification |
+
+This restriction is intentional. If computations were executed in Phase 1,
+the SimPy single-thread would block, defeating the purpose of 2-pass separation.
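+
+One way to enforce this contract is to make the handle itself refuse data access until Phase 2 has filled it in. The sketch below is illustrative only; the class name `ComputeHandle`, the `materialize()` method, and the exception type are assumptions and not part of the existing API.
+
+```python
+class Phase1DataAccessError(RuntimeError):
+    """Raised when a compute result is touched before Phase 2 has produced it."""
+
+
+class ComputeHandle:
+    def __init__(self, op_index):
+        self.op_index = op_index   # position of the producing op in the op_log
+        self._data = None          # filled in by the Phase 2 executor
+        self._ready = False
+
+    def _require_ready(self):
+        if not self._ready:
+            raise Phase1DataAccessError(
+                f"op {self.op_index}: result is pending; data exists only in Phase 2"
+            )
+
+    @property
+    def data(self):
+        self._require_ready()
+        return self._data
+
+    def __getitem__(self, key):
+        # Element access goes through .data, so it fails while pending.
+        return self.data[key]
+
+    def __bool__(self):
+        # Blocks `if handle:` style branching in Phase 1.
+        self._require_ready()
+        return bool(self._data)
+
+    def materialize(self, value):
+        # Called by the Phase 2 executor once the op has actually been computed.
+        self._data = value
+        self._ready = True
+```
+
+`tl.wait(handle)` would leave `_ready` untouched, so a kernel that branches on a compute result fails loudly in Phase 1 instead of silently reading placeholder data.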
+
+#### Phase 1 Materialization — Future Extension
+
+If Phase 1 eager execution becomes necessary for small operations
+(scalar, small reduction) in the future, selective materialization can be supported
+by adding a `materialized_in_phase1: bool` flag to the op record.
+This is not implemented in the current scope.
+
+### D4. data_op Flag — Message Self-Declaration
+
+The logging target is determined by the `data_op` attribute that each message declares,
+not by type checks in the framework. The framework does not hardcode message types.
+
+```python
+class MsgBase:
+    data_op: bool = False       # default: no logging
+
+class DmaReadCmd(MsgBase):
+    data_op = True              # memory transfer → logging
+
+class GemmCmd(MsgBase):
+    data_op = True              # compute → logging
+
+class MathCmd(MsgBase):
+    data_op = True              # compute → logging
+```
+
+When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
+enables automatic logging without modifying framework code.
+
+### D5. Op Log Structure
+
+#### Op Classification Scheme
+
+A two-level classification is used:
+
+| Level | Field | Role |
+|-------|-------|------|
+| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
+| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |
+
+#### OpRecord Definition
+
+```python
+from dataclasses import dataclass, field
+
+@dataclass
+class OpRecord:
+    t_start: float              # SimPy time (ns), service start
+    t_end: float                # SimPy time (ns), service completion
+    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
+    op_kind: str                # "memory" | "gemm" | "math"
+    op_name: str                # specific operation name
+    params: dict                # per-operation parameters (see below)
+    # optional; currently in-memory record indices, may be replaced with a stable op_id in the future
+    dependency_ids: list[int] = field(default_factory=list)
+```
+
+#### dependency_ids Generation Rules
+
+`dependency_ids` is **optional**, and by default the executor performs
+address-based dependency inference (see D6).
+
+Explicit setting is only needed when precise execution ordering is required:
+- **Default (address-based inference)**: the executor analyzes read/write sets to
+  automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
+- **Explicit setting**: set when logical dependencies cannot be expressed via addresses
+  at the TLContext or command generation stage.
+  Example: completion handle-based synchronization — handle dependencies depend on
+  logical completion order rather than memory addresses, so they cannot be captured
+  by address inference.
+
+#### op_log Ordering
+
+The op_log maintains **stable ordering** based on `t_start`.
+Records with the same `t_start` preserve insertion order.
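+
+The ordering rule can be met by inserting each record to the right of any existing record with the same `t_start`. The following is a minimal sketch under that assumption; `Rec` and `OrderedOpLog` are hypothetical names used only for illustration.
+
+```python
+import bisect
+from dataclasses import dataclass
+
+
+@dataclass
+class Rec:
+    t_start: float
+    name: str
+
+
+class OrderedOpLog:
+    def __init__(self):
+        self.records = []
+        self._keys = []  # t_start of each record, kept parallel to self.records
+
+    def add(self, rec):
+        # bisect_right lands after existing entries with an equal key,
+        # so records that tie on t_start keep their insertion order.
+        i = bisect.bisect_right(self._keys, rec.t_start)
+        self._keys.insert(i, rec.t_start)
+        self.records.insert(i, rec)
+```
+
+If records always arrive in non-decreasing `t_start` order, a plain `append` gives the same result; the sorted insert only matters when components report out of order.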
+
+#### params Details
+
+**memory (dma_read / dma_write)**:
+```python
+{
+    "src_addr": int,            # source address (byte)
+    "dst_addr": int,            # destination address (byte)
+    "nbytes": int,              # transfer size
+    "src_space": str,           # "hbm" | "tcm" | "sram"
+    "dst_space": str,           # "hbm" | "tcm" | "sram"
+}
+```
+
+**gemm**:
+```python
+{
+    "src_a_addr": int,          # operand A address
+    "src_b_addr": int,          # operand B address
+    "dst_addr": int,            # output address
+    "shape_a": tuple,           # e.g. (128, 256)
+    "shape_b": tuple,           # e.g. (256, 128)
+    "shape_out": tuple,         # e.g. (128, 128)
+    "dtype_in": str,            # e.g. "f16"
+    "dtype_acc": str,           # accumulation dtype, e.g. "f32"
+    "dtype_out": str,           # output dtype, e.g. "f16"
+    "transpose_a": bool,
+    "transpose_b": bool,
+    "layout_a": str,            # "row_major" | "col_major"
+    "layout_b": str,
+    "layout_out": str,
+    "addr_space": str,          # "tcm" (GEMM operands are always in TCM)
+}
+```
+
+**math**:
+```python
+{
+    "op": str,                  # "exp" | "add" | "sum" | "where" | ...
+    "input_addrs": list[int],   # list of operand addresses
+    "input_shapes": list[tuple],
+    "dst_addr": int,
+    "shape_out": tuple,
+    "dtype": str,
+    "axis": int | None,         # reduction axis
+    "addr_space": str,          # "tcm"
+}
+```
+
+### D6. Phase 2 Executor
+
+Phase 2 executes the op_log outside of SimPy.
+
+```python
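+from itertools import groupby
+
+class DataExecutor:
+    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
+        self.op_log = op_log        # ordered by t_start (see "op_log Ordering")
+        self.store = initial_store  # Takes the Phase 1 MemoryStore snapshot as input
+
+    def run(self):
+        # groupby only merges adjacent records, which the t_start ordering guarantees
+        for _, ops in groupby(self.op_log, key=lambda o: o.t_start):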
+            batch = list(ops)
+            independent, sequential = self._classify(batch)
+            self._execute_parallel(independent)
+            self._execute_sequential(sequential)
+```
+
+**Parallel execution determination**:
+
+Ops with the same `t_start` are considered **parallel candidates**.
+The executor determines actual parallel execution based on the following criteria:
+- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
+- Whether predecessor ops specified in `dependency_ids` have completed
+
+Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
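+
+The address-range check can be sketched as follows. This is an illustrative helper rather than the executor's actual `_classify` logic; the `OpAccess` type and the `(space, start, end)` range encoding are assumptions introduced for the example.
+
+```python
+from dataclasses import dataclass, field
+
+
+def overlaps(a, b):
+    """Ranges are (space, start_byte, end_byte_exclusive) tuples."""
+    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]
+
+
+@dataclass
+class OpAccess:
+    reads: list = field(default_factory=list)
+    writes: list = field(default_factory=list)
+
+
+def hazards(earlier, later):
+    """Return the hazards that force `later` to run after `earlier`."""
+    found = set()
+    if any(overlaps(w, r) for w in earlier.writes for r in later.reads):
+        found.add("RAW")   # later reads what earlier wrote
+    if any(overlaps(w1, w2) for w1 in earlier.writes for w2 in later.writes):
+        found.add("WAW")   # both write the same bytes
+    if any(overlaps(r, w) for r in earlier.reads for w in later.writes):
+        found.add("WAR")   # later overwrites what earlier still reads
+    return found
+```
+
+Two ops in the same `t_start` batch would be parallel candidates only when `hazards()` returns an empty set in both directions and neither lists the other in `dependency_ids`.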
+
+**Batch optimization**: Only independent ops with the same op_name **and identical
+shape, dtype, layout, and transpose flags** are eligible for batching.
+Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
+Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
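+
+The batching eligibility rule amounts to grouping independent ops by a signature built from the fields that must match. A minimal sketch, assuming the gemm params schema from D5; the `BATCH_KEYS` tuple and `group_for_batching` helper are hypothetical.
+
+```python
+from collections import defaultdict
+
+# Fields of the gemm params (D5) that must be identical within one batch.
+BATCH_KEYS = (
+    "shape_a", "shape_b", "dtype_in", "dtype_acc", "dtype_out",
+    "transpose_a", "transpose_b", "layout_a", "layout_b", "layout_out",
+)
+
+
+def group_for_batching(ops):
+    """ops: iterable of (op_name, params) pairs already known to be independent."""
+    groups = defaultdict(list)
+    for op_name, params in ops:
+        signature = (op_name,) + tuple(params[k] for k in BATCH_KEYS)
+        groups[signature].append(params)
+    # dicts keep insertion order, so groups come out in first-seen order
+    return list(groups.values())
+```
+
+Each returned group could then be stacked along a new leading axis and passed to a single batched matmul call, while groups of size one fall back to the ordinary per-op path.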
+
+**Phase 2 execution order guarantee**:
+
+Phase 2 does not consider data arrival timing,
+and guarantees execution order solely through
+dependencies (address-based inference + explicit dependency_ids).
+
+### D7. Memory Store
+
+`MemoryStore` logically follows byte-addressable semantics,
+and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).
+
+```python
+class MemoryStore:
+    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
+    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
+```
+
+**Internal storage format: numpy ndarray**
+
+MemoryStore stores tensors as **numpy ndarrays**.
+
+| Candidate | store/load speed | Phase 2 compute | Verdict |
+|-----------|-----------------|-----------------|---------|
+| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
+| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
+| torch tensor | Immediate | torch operations available | Use only for GPU optimization |
+
+- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
+- read: **returns numpy array by reference** (no copy)
+- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
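+- dtype uses numpy native types (`np.float16`, `np.float32`, etc.); bf16 is not a native NumPy dtype and requires an extension type such as the one provided by the `ml_dtypes` package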
+- For byte-level access, convert via `.view(np.uint8)`
+- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
+
+**read/write contract**:
+
+- read/write operates on a **contiguous tensor** basis.
+  If non-contiguous stride views are needed, express them as separate copy ops.
+- In the normal benchmark path, producer/consumer dtype match is expected.
+- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
+  This is a permissive behavior intended for low-level memory validation
+  or special test cases.
+- addr is byte-aligned, with minimum alignment = dtype size.
+- Shape mismatch is verified based on nbytes, and raises an error on mismatch.
+- Correctness criteria follow address-range-based read/write semantics.
+- A tensor object cache may be used as an implementation optimization,
+  but the canonical state is byte-addressable storage.
+- At deploy time, the host injects initial tensor data.
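+
+The contract above can be sketched as a tensor-granular store. This is a simplified illustration under stated assumptions, not the planned implementation: it keys tensors by `(space, addr)`, accepts any dtype name NumPy understands, and ignores partially overlapping address ranges.
+
+```python
+import numpy as np
+
+
+class MemoryStore:
+    def __init__(self):
+        self._tensors = {}  # (space, addr) -> np.ndarray
+
+    def write(self, space, addr, data):
+        if not data.flags["C_CONTIGUOUS"]:
+            raise ValueError("write expects a contiguous tensor")
+        if addr % data.dtype.itemsize != 0:
+            raise ValueError("addr must be aligned to the dtype size")
+        # Stored by reference; a second write to the same addr replaces the tensor.
+        self._tensors[(space, addr)] = data
+
+    def read(self, space, addr, shape, dtype):
+        stored = self._tensors[(space, addr)]
+        dtype = np.dtype(dtype)
+        wanted = int(np.prod(shape)) * dtype.itemsize
+        if wanted != stored.nbytes:
+            raise ValueError(f"requested {wanted} bytes, stored {stored.nbytes}")
+        if stored.dtype == dtype and stored.shape == tuple(shape):
+            return stored  # same object, no copy
+        # Reinterpret cast: flatten to a 1-D view, relabel the bytes, reshape.
+        return stored.reshape(-1).view(dtype).reshape(shape)
+```
+
+Because both paths return views of the original buffer, a kernel that mutates a loaded array in place would also change the stored tensor; whether that aliasing is acceptable is part of the open question on slice views.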
+
+### D8. Benchmark Kernel Code
+
+The benchmark's **user code API is not changed**.
+The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.
+
+However, internal command/message schemas may be extended to include metadata
+required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
+
+### D9. No Component Changes
+
+Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
+Op log recording is the responsibility of the ComponentBase hook.
+When custom components are replaced, only the timing model changes,
+and Phase 2 data execution is unaffected.
+
+### D10. Phase 2 is Optional
+
+```python
+engine = GraphEngine(graph)
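+engine.run(benchmark)                       # Phase 1: timing + memory data
+result = engine.get_timing_result()
+
+if verify_data:
+    # memory_store: illustrative attribute name for the Phase 1 MemoryStore snapshot (see D6)
+    executor = DataExecutor(engine.op_log, engine.memory_store)  # Phase 2: data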
+    executor.run()
+    executor.verify(expected_output)
+```
+
+If only timing analysis is needed, Phase 2 is skipped.
+If the op_logger is deactivated, the hooks record nothing and Phase 1 incurs no op-logging cost.
+
+### D11. Verification Contract
+
+Basic verification **compares the final output tensor** against a reference backend (numpy).
+
+Per-dtype tolerance policy:
+
+| dtype | Comparison method | Tolerance |
+|-------|----------|-----------|
+| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
+| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
+| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
+| int types | `np.array_equal` | exact |
+
+- Default mode: compare final output only (end-to-end correctness)
+- Debug mode: can compare intermediate tensors on a per-op basis
+  (MemoryStore snapshot at each op boundary)
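+
+The tolerance table maps directly onto a lookup plus a comparison call. A minimal sketch, assuming dtypes are identified by the short strings used in the table; the `outputs_match` helper is hypothetical, and any dtype not listed in `TOLERANCE` is compared exactly.
+
+```python
+import numpy as np
+
+TOLERANCE = {
+    "f32": dict(rtol=1e-5, atol=1e-5),
+    "f16": dict(rtol=1e-3, atol=1e-3),
+    "bf16": dict(rtol=1e-2, atol=1e-2),
+}
+
+
+def outputs_match(actual, expected, dtype):
+    if dtype in TOLERANCE:
+        # Compare in float64 so the comparison itself adds no rounding error.
+        return bool(np.allclose(
+            np.asarray(actual, dtype=np.float64),
+            np.asarray(expected, dtype=np.float64),
+            **TOLERANCE[dtype],
+        ))
+    # Integer types (and anything not listed above): exact comparison.
+    return bool(np.array_equal(actual, expected))
+```
+
+`np.allclose` accepts a difference of up to `atol + rtol * |expected|` per element, so the same absolute error can pass for f16 and fail for f32.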
+
+---
+
+## Non-goals
+
+- **Compute-result-based control flow**: not supported.
+  All compute handles are in pending state during Phase 1,
+  `wait()` expresses timing synchronization only and does not imply data readiness.
+  Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
+  is **treated as an error**.
+  Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
+  Phase 1 materialization is a future extension (see D3).
+- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
+  the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
+- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
+  and do not reproduce the actual hardware PE microarchitecture.
+
+## Open Questions
+
+- **Aliasing / slice view**: How to represent slice/views referencing the same
+  backing storage in MemoryStore (stride-based view vs copy semantics)
+- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
+  communication as memory ops or introduce a separate op_kind
+- **Op log streaming**: Managing op_log memory usage in large-scale simulations
+  (in-memory list vs disk-backed streaming)
+- **Fused operation**: Whether to record tl.composite's tiled pipeline
+  (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
+- **Math op schema generalization**: The current math params have a simple structure,
+  but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
+  scalar/immediate operands, where/mask expressions, etc.
+- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
+  replacement with stable op_id is needed when introducing streaming/disk-backed mode
+- **Phase 1 materialization policy**: See Future Extension in D3.
+  If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
+  needs to be defined
+
+---
+
+## Consequences
+
+### Positive
+
+- Minimal impact on SimPy simulation performance (only op_log append added)
+- Free to use multi-threading/GPU in Phase 2
+- Component replaceability preserved (ADR-0015 design philosophy maintained)
+- No changes needed to benchmark user code API
+- When adding new message types, only set the data_op flag
+- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
+- `tl.load()` returns actual data, making kernel debugging easier
+
+### Negative
+
+- op_log memory usage (for large-scale simulations)
+- Phase 2 execution time is proportional to tensor size (large GEMM)
+- Dynamic branching based on pending handles (incomplete computations) not possible
+  (computations execute in Phase 2, result values are undetermined in Phase 1).
+  Memory-data-based branching is supported via greenlet.
+- greenlet C extension dependency added (pip install greenlet)
+
+---
+
+## Affected Files
+
+| File | Change |
+|------|--------|
+| `src/kernbench/components/base.py` | Add `_on_process_start/end` hooks |
+| `src/kernbench/common/pe_commands.py` | Add `data_op = True`, extend metadata fields |
+| `src/kernbench/sim_engine/op_log.py` | New: OpRecord, OpLogger |
+| `src/kernbench/sim_engine/data_executor.py` | New: DataExecutor, MemoryStore |
+| `src/kernbench/sim_engine/engine.py` | op_logger injection (optional) |
+| `src/kernbench/triton_emu/tl_context.py` | greenlet switch calls inside `tl.load()` etc. |
+| `src/kernbench/triton_emu/kernel_runner.py` | New: KernelRunner (greenlet ↔ SimPy bridge) |
+| `src/kernbench/components/builtin/pe_cpu.py` | Remove Phase 0, change to KernelRunner invocation |
+| `pyproject.toml` | Add greenlet dependency |
+
+Component implementation files (pe_gemm.py, pe_dma.py, hbm_ctrl.py, etc.): **no changes**
+Benchmark kernels (benches/*.py): **no user API changes**
@@ -0,0 +1,537 @@
+# ADR-0021: PE Pipeline Refactoring (Component Separation + Token Self-Routing)
+
+## Status
+
+Proposed
+
+## Context
+
+### Problems with the Current Structure
+
+pe_accel (SchedulerV2Component) hides 5 hardware blocks (DmaIn, DmaWb, Gemm, Math, Tcm)
+**inside a single component**.
+
+```
+SchedulerV2Component (single topology node)
+├── DmaInBlock     ← directly connected via internal SimPy Store
+├── DmaWbBlock     ← not visible in topology
+├── GemmBlock      ← not replaceable
+├── MathBlock      ← not replaceable
+└── TcmBlock       ← not replaceable
+```
+
+Problems:
+- Blocks directly reference the next block via `desc.next_block` — hardcoded routing
+- Individual blocks cannot be replaced (violates ADR-0015 component replacement principle)
+- PE internal structure is not visible in the topology
+- GemmBlock and MathBlock each duplicate TCM load/store logic
+
+### Actual Hardware Structure
+
+```
+HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
+```
+
+- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
+- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
+- GEMM/MATH Engine: computation between Register Files (cycle-accurate)
+- Completion signal: PE-internal 1-cycle wire signal (done pin assert)
+
+---
+
+## Decision
+
+### D1. Separate Each Block into an Independent Component
+
+The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
+Together with the scheduler and the new Fetch/Store unit, this yields the 6 components listed below.
+
+| Component | Role | HW Correspondence |
+|-----------|------|-------------------|
+| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
+| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
+| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
+| PE_GEMM | MAC compute (register only) | MAC Array |
+| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
+| PE_TCM | BW-serialized scratchpad | SRAM Bank |
+
+Each component exists as a topology node and is connected via ports/wires.
+Replacing the `impl` allows changing the timing model of an individual block.
+
+### D2. Token Self-Routing — Scheduler Handles Only Dispatch + Completion
+
+**Tokens do not return to the scheduler at every stage.**
+The token carries a plan so that components chain directly to the next stage.
+
+```
+Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
+              ↑ chaining: does not go through scheduler          completion only
+```
+
+This matches the actual HW structure where each block's done signal is directly
+connected to the next block via wire. The scheduler is responsible **only for
+initial dispatch + completion aggregation**.
+
+#### Stage Definition
+
+```python
+class StageType(Enum):
+    DMA_READ = 0
+    FETCH = 1
+    GEMM = 2
+    MATH = 3
+    STORE = 4
+    DMA_WRITE = 5
+```
+
+#### Plan Structure
+
+When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
+The plan defines the **stage sequence** for each tile:
+
+```python
+@dataclass
+class Stage:
+    stage_type: StageType
+    component: str       # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
+    params: dict         # per-stage parameters (dynamic)
+
+@dataclass(frozen=True)
+class TilePlan:
+    tile_id: int
+    stages: tuple[Stage, ...]  # list of stages to execute in order (immutable)
+```
+
+The stage sequence varies depending on the plan:
+
+```python
+# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
+stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)
+
+# GEMM directly from TCM data (skip DMA read):
+stages = (FETCH, GEMM, STORE, DMA_WRITE)
+
+# MATH element-wise:
+stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)
+
+# GEMM + accumulation (intermediate K-tile, skip writeback):
+stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only
+```
+
+**Components do not hardcode the next component.**
+They read the next stage from the token's plan and forward it directly via out_port.
+This is the same pattern as a network packet carrying a routing header.
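+
+The routing-header idea can be sketched without SimPy: the plan is a tuple of stages, and the next hop is found by index. The `COMPONENT_FOR` mapping, `build_stages`, and `next_hop` below are hypothetical helpers written for illustration, and the short component names stand in for full topology node IDs.
+
+```python
+from enum import Enum
+
+
+class StageType(Enum):
+    DMA_READ = 0
+    FETCH = 1
+    GEMM = 2
+    MATH = 3
+    STORE = 4
+    DMA_WRITE = 5
+
+
+# Assumed mapping from stage type to the component that services it.
+COMPONENT_FOR = {
+    StageType.DMA_READ: "pe_dma",
+    StageType.FETCH: "pe_fetch_store",
+    StageType.GEMM: "pe_gemm",
+    StageType.MATH: "pe_math",
+    StageType.STORE: "pe_fetch_store",
+    StageType.DMA_WRITE: "pe_dma",
+}
+
+
+def build_stages(from_hbm, compute, writeback):
+    """Assemble the stage sequence for one tile."""
+    stages = [StageType.DMA_READ] if from_hbm else []
+    stages += [StageType.FETCH, compute, StageType.STORE]
+    if writeback:
+        stages.append(StageType.DMA_WRITE)
+    return tuple(stages)
+
+
+def next_hop(stages, stage_idx):
+    """Component to forward to after stage_idx, or None at the last stage."""
+    nxt = stage_idx + 1
+    return COMPONENT_FOR[stages[nxt]] if nxt < len(stages) else None
+```
+
+A component that receives a token at `stage_idx` only needs `next_hop()` to decide where to send it, so none of the four plan shapes above requires a component-side special case.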
+
+#### Pipeline Context
+
+```python
+@dataclass
+class PipelineContext:
+    id: str
+    total_tiles: int
+    completed_tiles: int = 0
+    done_event: simpy.Event = None  # succeeds when all tiles are complete
+
+    def complete_tile(self) -> None:
+        self.completed_tiles += 1
+        if self.completed_tiles == self.total_tiles:
+            self.done_event.succeed()
+```
+
+**Completion follows an exactly-once contract**: the last stage of each tile must call
+`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
+succeed only once (SimPy Event constraint).
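+
+If duplicate completions should be detected rather than only forbidden, the context can track which tiles have finished. The sketch below is one possible guard, not the proposed `PipelineContext`: it changes `complete_tile()` to take a `tile_id`, and `OneShotEvent` is a stand-in for the single-trigger behaviour of a SimPy event.
+
+```python
+from dataclasses import dataclass, field
+
+
+class OneShotEvent:
+    """Stand-in for an event that may be triggered only once."""
+
+    def __init__(self):
+        self.triggered = False
+
+    def succeed(self):
+        if self.triggered:
+            raise RuntimeError("event already triggered")
+        self.triggered = True
+
+
+@dataclass
+class GuardedPipelineContext:
+    id: str
+    total_tiles: int
+    done_event: OneShotEvent
+    _completed: set = field(default_factory=set, init=False)
+
+    def complete_tile(self, tile_id):
+        if tile_id in self._completed:
+            # A duplicate call is a bug in the last-stage component.
+            raise RuntimeError(f"tile {tile_id} completed twice")
+        self._completed.add(tile_id)
+        if len(self._completed) == self.total_tiles:
+            self.done_event.succeed()
+```
+
+Tracking tile IDs costs one set entry per tile but turns a silent over-count into an immediate error at the offending call site.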
+
+#### Scheduler Role (Reduced)
+
+When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext,
+enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately.
+
+Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
+This feeder consumes `_pending_feeds` in FIFO order and
+**does not allow tile feed interleaving across composite commands.**
+That is, the feed for the next command begins only after all tiles of the current
+command have been injected into the first stage queue.
+
+There is **exactly one `_feed_loop`** per scheduler, and
+tile feed for composite commands is performed exclusively through this single process.
+Command issue order refers to **the order in which PE_SCHEDULER receives PeInternalTxn**.
+
+This structure maintains command issue order while ensuring that when the first stage
+queue is full, only the feeder process blocks — the scheduler worker's inbox processing
+itself does not stall.
+
+```python
+class PeSchedulerV2(PeEngineBase):
+    _pipelines: dict[str, PipelineContext]
+    _pending_feeds: simpy.Store   # FIFO of (plan, ctx)
+
+    def start(self, env):
+        super().start(env)
+        self._pending_feeds = simpy.Store(env)
+        env.process(self._feed_loop(env))
+
+    def _dispatch_composite(self, env, pe_txn, cmd):
+        plan = generate_plan(cmd)
+        ctx = PipelineContext(
+            id=next_id(),
+            total_tiles=len(plan.tiles),
+            done_event=pe_txn.done,
+        )
+        self._pipelines[ctx.id] = ctx
+
+        # only enqueue to feeder queue and return immediately
+        yield self._pending_feeds.put((plan, ctx))
+
+    def _feed_loop(self, env):
+        """Single feeder process: feeds composite commands in FIFO order.
+
+        Tile feed interleaving across composite commands is not allowed.
+        The feed for the next command begins only after all tiles of the
+        current command have been injected into the first stage queue.
+
+        When the first stage queue is full, only this feeder blocks;
+        the scheduler worker's inbox processing does not stall.
+        """
+        while True:
+            plan, ctx = yield self._pending_feeds.get()
+            for tile in plan.tiles:
+                token = TileToken(
+                    tile_id=tile.tile_id,
+                    pipeline_ctx=ctx,
+                    plan=tile,
+                    stage_idx=0,
+                    params=tile.stages[0].params,
+                )
+                yield self.out_ports[tile.stages[0].component].put(token)
+                # queue capacity = HW queue depth → feeder blocks only when full
+```
+
+In this ADR, the scheduler can accept multiple composite commands,
+but tile submission order follows per-command FIFO.
+Within a command, tile-level pipeline overlap is allowed,
+but tile feed interleaving across commands is not.
+
+### D3. Data Transfer vs. Completion Signal — HW Modeling Criteria
+
+| Communication Type | Method | HW Correspondence |
+|-------------------|--------|-------------------|
+| Tile token (work directive) | message via out_port | enqueue to command queue |
+| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
+| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |
+
+**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.
+
+**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
+is applied to intra-PE stage triggers. Chaining between components corresponds to
+PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost
+is incurred.
+
+**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
+When all tiles are complete, PipelineContext calls done_event.succeed().
+
+### D4. Asynchronous Pipeline — Natural Overlap
+
+The scheduler processes CompositeCmds **asynchronously**.
+However, tile feed does not spawn an independent process per command; instead,
+the scheduler's internal **single feeder process** performs the feed in FIFO order.
+Therefore, the scheduler can continue to receive the next command,
+but the first-stage tile injection order is guaranteed per command.
+
+Since **SimPy Store capacity = HW queue depth**:
+- When the queue is full, put() naturally blocks (backpressure)
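+- While DMA is processing tile 1, FETCH/GEMM can already work on tile 0, which DMA has completed
+- When a second CompositeCmd arrives, it is immediately accepted into `_pending_feeds`; its tiles enter the DMA queue once the feed for the first command has finished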
+
+```
+First-stage feed order (feeder → DMA queue):
+  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
+                                            ↑ cmd2 starts after cmd1 feed completes
+
+Runtime pipeline (downstream overlap):
+  PE_DMA:    [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
+  PE_FETCH:          [cmd1:t0][cmd1:t1]...
+  PE_GEMM:                   [cmd1:t0][cmd1:t1]...
+                              ↑ pipeline overlap within the same command
+```
+
+Here, the overlap does not come from tile feed interleaving across different commands,
+but occurs naturally as tiles from earlier commands progress to downstream stages
+while the feeder continues injecting subsequent tiles.
+
+For example, tile feed for cmd2 does not start until all tiles of cmd1 have been
+injected into the first stage queue. However, while cmd1.tile0 has already progressed
+to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so
+**pipeline overlap within the same command occurs naturally**.
+
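The feed order and the downstream overlap can be reproduced with a few lines of Python. The sketch assumes every stage takes exactly one time step and never stalls, which the ADR does not claim; `feed_order` and `occupancy` are hypothetical helper names used only for illustration.

```python
def feed_order(commands):
    """Flatten (name, n_tiles) commands into first-stage feed order (per-command FIFO)."""
    return [(name, t) for name, n_tiles in commands for t in range(n_tiles)]


def occupancy(commands, stages, step):
    """Tile held by each stage at a time step, assuming unit latency and no stalls."""
    order = feed_order(commands)
    held = {}
    for depth, stage in enumerate(stages):
        index = step - depth             # a tile fed at step i reaches depth d at i + d
        held[stage] = order[index] if 0 <= index < len(order) else None
    return held


commands = [("cmd1", 3), ("cmd2", 2)]
stages = ["PE_DMA", "PE_FETCH", "PE_GEMM"]
at_step_2 = occupancy(commands, stages, 2)
at_step_3 = occupancy(commands, stages, 3)
```

At step 2 the three stages hold cmd1:t2, cmd1:t1, and cmd1:t0. The first tile of cmd2 reaches PE_DMA only at step 3, after every cmd1 tile has been fed.
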
+#### Component Chaining Pattern
+
+All components follow the same pattern:
+
+```python
+def _pipeline_worker(self, env):
+    while True:
+        token = yield self._inbox.get()
+
+        # process own stage
+        yield from self._process(env, token)
+
+        # chain to next stage (read from plan)
+        next_idx = token.stage_idx + 1
+        if next_idx < len(token.plan.stages):
+            next_stage = token.plan.stages[next_idx]
+            token.stage_idx = next_idx
+            token.params = next_stage.params
+            yield self.out_ports[next_stage.component].put(token)
+        else:
+            # last stage — pipeline completion
+            token.pipeline_ctx.complete_tile()
+```
+
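The routing decision inside `_pipeline_worker` can be isolated from SimPy and exercised on its own. The sketch below is an assumption-laden reduction: `Stage`, `TilePlan`, and `TileToken` are trimmed to the fields the routing step needs, `advance` is a hypothetical helper, and the example plan and its `direction` value are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Stage:
    component: str
    params: dict


@dataclass(frozen=True)
class TilePlan:
    stages: Tuple[Stage, ...]


@dataclass
class TileToken:
    tile_id: int
    plan: TilePlan
    stage_idx: int = 0


def advance(token: TileToken) -> Optional[str]:
    """Move the token to its next stage; return the target component, or None at the end."""
    next_idx = token.stage_idx + 1
    if next_idx >= len(token.plan.stages):
        return None                      # last stage: the caller reports completion
    token.stage_idx = next_idx
    return token.plan.stages[next_idx].component


plan = TilePlan((
    Stage("pe_dma", {}),
    Stage("pe_fetch_store", {"direction": "load"}),
    Stage("pe_gemm", {}),
))
token = TileToken(tile_id=0, plan=plan)
visited = [plan.stages[0].component]
while (target := advance(token)) is not None:
    visited.append(target)
```

No component name appears in the routing code itself; the visiting order comes entirely from the plan carried by the token.
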
+### D5. PE_FETCH_STORE — Dedicated TCM ↔ Register File Transfer
+
+Previously, GemmBlock and MathBlock each implemented their own TCM read/write.
+This logic is now factored out into a dedicated **PE_FETCH_STORE** component.
+
+```python
+# PE_FETCH_STORE._process()
+def _process(self, env, token):
+    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
+    yield tcm_done
+    # chaining is handled by the base class (D4 pattern)
+```
+
+Advantages:
+- GEMM/MATH perform **pure compute only** — no TCM access logic
+- Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
+- Prefetch strategies can be experimented with by replacing the fetch unit alone
+
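The serialization effect mentioned above can be illustrated with a simple FIFO model of one shared link. This is a sketch under stated assumptions: GB means 1e9 bytes (so 1 GB/s equals 1 byte/ns), requests are served strictly in arrival order, and `transfer_ns` and `serialize` are hypothetical names. The 512 GB/s figure mirrors the `fetch_store_to_tcm_bw_gbs` value in this ADR's topology; the 64 KiB transfer size is made up.

```python
def transfer_ns(num_bytes: int, bw_gbs: float) -> float:
    """Transfer time in ns; 1 GB/s equals 1 byte/ns when GB means 1e9 bytes."""
    return num_bytes / bw_gbs


def serialize(requests, bw_gbs):
    """Serve (arrival_ns, num_bytes) requests FIFO on one shared link.

    Returns one (start_ns, finish_ns) pair per request.
    """
    schedule = []
    link_free_at = 0.0
    for arrival_ns, num_bytes in requests:
        start = max(arrival_ns, link_free_at)    # wait while the link is busy
        finish = start + transfer_ns(num_bytes, bw_gbs)
        schedule.append((start, finish))
        link_free_at = finish
    return schedule


# Two 64 KiB transfers (for example one fetch and one store) arriving together.
schedule = serialize([(0.0, 65536), (0.0, 65536)], bw_gbs=512.0)
```

Each transfer takes 128 ns on its own, so the second one finishes at 256 ns: the contention shows up as queueing delay rather than as reduced per-request bandwidth.
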
+### D6. Simplification of Each Compute Component
+
+GEMM/MATH perform compute only, operating on register data that has already been prepared.
+**Chaining follows the common pattern (D4), so only _process() needs to be implemented:**
+
+```python
+# PE_GEMM._process()
+def _process(self, env, token):
+    yield env.timeout(self._mac_latency(token.params))
+
+# PE_MATH._process()
+def _process(self, env, token):
+    yield env.timeout(self._simd_latency(token.params))
+
+# PE_FETCH_STORE._process()
+def _process(self, env, token):
+    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
+    yield tcm_done
+
+# PE_DMA._process()
+def _process(self, env, token):
+    yield from self._do_fabric_dma(token.params)
+```
+
+By replacing only the timing model, one can freely switch between cycle-accurate
+and analytical models. Since the chaining logic resides in the base class,
+each component only implements its pure stage logic.
+
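One way to picture the swappable timing model is to inject the latency function into the component. The sketch below is hypothetical throughout: `PeGemmSketch`, both latency functions, the `m`/`n`/`k` parameter names, and the numbers (1024 MACs per cycle, 16 cycles of overhead) are assumptions, not values taken from pe_accel.

```python
import math
from typing import Callable, Dict

LatencyModel = Callable[[Dict[str, int]], int]


def analytical_mac_cycles(params: Dict[str, int], macs_per_cycle: int = 1024) -> int:
    """Ideal throughput bound: total MACs divided by peak MACs per cycle."""
    macs = params["m"] * params["n"] * params["k"]
    return math.ceil(macs / macs_per_cycle)


def fixed_overhead_cycles(params: Dict[str, int], overhead: int = 16) -> int:
    """Same bound plus a constant startup cost (a stand-in for a finer model)."""
    return analytical_mac_cycles(params) + overhead


class PeGemmSketch:
    """Compute-only stage whose latency comes from an injected model."""

    def __init__(self, latency_model: LatencyModel):
        self._latency_model = latency_model

    def process_cycles(self, params: Dict[str, int]) -> int:
        return self._latency_model(params)


tile = {"m": 64, "n": 64, "k": 64}
fast = PeGemmSketch(analytical_mac_cycles).process_cycles(tile)
slow = PeGemmSketch(fixed_overhead_cycles).process_cycles(tile)
```

Because chaining lives elsewhere, swapping the model changes only the number of cycles the stage waits, not how tokens move.
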
+### D7. Topology Changes
+
+Add PE_FETCH_STORE to the PE template:
+
+```yaml
+pe_template:
+  components:
+    pe_cpu:         { kind: pe_cpu,         impl: pe_cpu_v1, ... }
+    pe_scheduler:   { kind: pe_scheduler,   impl: pe_scheduler_v2, ... }
+    pe_dma:         { kind: pe_dma,         impl: pe_dma_v1, ... }
+    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
+    pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1, ... }
+    pe_math:        { kind: pe_math,        impl: pe_math_v1, ... }
+    pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1, ... }
+    pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1, ... }
+  links:
+    # existing links...
+    fetch_store_to_tcm_bw_gbs: 512.0
+    fetch_store_to_tcm_mm: 0.0
+```
+
+PE internal edge connections:
+```
+PE_SCHEDULER → PE_DMA (initial dispatch)
+PE_SCHEDULER → PE_FETCH_STORE (initial dispatch)
+PE_SCHEDULER → PE_GEMM (initial dispatch)
+PE_SCHEDULER → PE_MATH (initial dispatch)
+PE_DMA → PE_FETCH_STORE (chaining)
+PE_FETCH_STORE → PE_GEMM (chaining)
+PE_FETCH_STORE → PE_MATH (chaining)
+PE_GEMM → PE_FETCH_STORE (store chaining)
+PE_MATH → PE_FETCH_STORE (store chaining)
+PE_FETCH_STORE → PE_DMA (writeback chaining)
+PE_FETCH_STORE → PE_TCM (BW request)
+```
+
+Topology edges cover both **control/dispatch visibility and runtime chaining**.
+Scheduler → sub-component edges are initial dispatch paths, while
+inter-component edges are runtime chaining paths driven by token self-routing.
+
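Since token self-routing can only follow edges that exist in the topology, a plan's stage sequence can be checked against the chaining edges before simulation. The edge set below is copied from the list above; `missing_edges` and the two example plans are hypothetical and shown only to illustrate the check.

```python
# Runtime chaining edges from the list above (scheduler dispatch edges omitted).
CHAINING_EDGES = {
    ("PE_DMA", "PE_FETCH_STORE"),
    ("PE_FETCH_STORE", "PE_GEMM"),
    ("PE_FETCH_STORE", "PE_MATH"),
    ("PE_GEMM", "PE_FETCH_STORE"),
    ("PE_MATH", "PE_FETCH_STORE"),
    ("PE_FETCH_STORE", "PE_DMA"),
}


def missing_edges(stage_sequence, edges=CHAINING_EDGES):
    """Return the consecutive stage pairs that have no chaining edge."""
    pairs = zip(stage_sequence, stage_sequence[1:])
    return [pair for pair in pairs if pair not in edges]


# Load, fetch, compute, store, write back.
gemm_plan = ["PE_DMA", "PE_FETCH_STORE", "PE_GEMM", "PE_FETCH_STORE", "PE_DMA"]
bad_plan = ["PE_DMA", "PE_GEMM"]         # skips the fetch stage
```

An empty result means every hop in the plan is backed by a topology edge; `bad_plan` is rejected because PE_DMA has no direct edge to PE_GEMM.
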
+### D8. Existing Code Migration — Builtin Integration
+
+The existing builtin v1 components and pe_accel are **replaced with new builtin components**.
+
+#### Migration Strategy
+
+1. Back up existing `components/builtin/` → `components/builtin_legacy/` (preserved without modification)
+2. Keep existing `components/custom/pe_accel/` as a backup (likewise preserved without modification)
+3. Re-implement new `components/builtin/` with the ADR-0021 architecture
+4. Maintain **only one** topology.yaml (including pe_fetch_store)
+5. components.yaml points to the new builtin
+
+```yaml
+# components.yaml — new builtin
+pe_scheduler_v1: kernbench.components.builtin.pe_scheduler:PeSchedulerComponent
+pe_gemm_v1:      kernbench.components.builtin.pe_gemm:PeGemmComponent
+pe_math_v1:      kernbench.components.builtin.pe_math:PeMathComponent
+pe_dma_v1:       kernbench.components.builtin.pe_dma:PeDmaComponent
+pe_fetch_store_v1: kernbench.components.builtin.pe_fetch_store:PeFetchStoreComponent
+pe_tcm_v1:       kernbench.components.builtin.pe_tcm:PeTcmComponent
+```
+
+The impl names (pe_gemm_v1, etc.) are preserved, but **the implementations are replaced
+with the ADR-0021 architecture**. Existing benchmarks and tests referencing topology.yaml
+continue to work without changes.
+
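The entries in components.yaml use a `module.path:ClassName` form. How Kernbench resolves them is not shown in this ADR; the following is only a sketch of one common way to resolve such strings with `importlib`, with `resolve_impl` as a hypothetical helper name.

```python
import importlib


def resolve_impl(spec: str):
    """Resolve a 'package.module:ClassName' string to the object it names."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:Class', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Demonstrated on a standard-library class so the snippet runs anywhere.
cls = resolve_impl("collections:OrderedDict")
```

With this kind of indirection, keeping the impl name (`pe_gemm_v1`) stable while changing the class behind it requires no edits to topology.yaml.
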
+#### Latency Model Inheritance
+
+The latency modeling of the new builtin components (MAC cycle calculation, SIMD latency,
+TCM BW serialization, DMA fabric latency, etc.) is **based on the current pe_accel
+implementation**. The tile schedule generation logic from tiling.py is also carried over.
+Only the architecture (component separation, self-routing) changes; timing accuracy
+is preserved.
+
+#### Test Strategy
+
+**1. Existing test pass** (regression):
+After migration is complete, all existing tests (366) must pass.
+
+**2. Latency regression**:
+Verify that, for the same inputs, the new builtin produces the same latency as pe_accel.
+
+**3. Phase 1 → Phase 2 end-to-end**:
+Integration test from SimPy simulation (Phase 1) op_log generation → DataExecutor
+(Phase 2) actual numpy computation → result correctness verification.
+- GEMM: tl.composite(gemm) → op_log → Phase 2 matmul → allclose verification
+- MATH: tl.exp / tl.add, etc. → op_log → Phase 2 numpy op → allclose verification
+- Chaining: GEMM output → MATH input → final result end-to-end verification
+
+**4. TileToken self-routing**:
+- Verify that tiles chain according to the plan's stage sequence
+- Verify that PipelineContext.complete_tile() is called exactly once per tile, at the last stage
+- Queue backpressure: verify that only the feeder blocks when DMA queue capacity is exceeded
+
+**5. Asynchronous pipeline overlap**:
+- Verify that inter-tile stage overlap occurs within the same command (tile0 in GEMM while tile1 in DMA)
+- Multiple commands: verify that cmd2 feed starts after cmd1 feed completes (FIFO order)
+
+### D9. TileToken Message Definition
+
+A message used for passing tile work between components.
+The token carries the plan and stage index, enabling self-routing.
+
+```python
+@dataclass
+class TileToken:
+    tile_id: int
+    pipeline_ctx: PipelineContext    # completion tracking
+    plan: TilePlan                   # full stage sequence for this tile (immutable)
+    stage_idx: int                   # current stage index in plan.stages
+    params: dict                     # current stage parameter cache (canonical: plan.stages[stage_idx].params)
+    data_op: bool = True             # op_log recording target (ADR-0020)
+```
+
+A TileToken is **owned by exactly one component at a time** and
+is never referenced by multiple components simultaneously (single-owner).
+
+Token lifecycle:
+1. Scheduler creates it with stage_idx=0 and puts it to the first stage component
+2. The component executes _process(), increments stage_idx, and puts it to the next component
+3. The last stage component calls pipeline_ctx.complete_tile()
+4. When all tiles are complete, PipelineContext calls done_event.succeed()
+
+Relationship with existing PeInternalTxn:
+- PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged)
+- TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)
+
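The two invariants stated for TileToken (an immutable plan, and `params` as a cache of the canonical `plan.stages[stage_idx].params`) can be made checkable. The sketch below is an assumption: it trims the dataclasses to the relevant fields, adds a hypothetical `params_in_sync` helper, and uses made-up stage parameters. Note that `frozen=True` only blocks attribute reassignment; the `params` dicts themselves remain mutable.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Tuple


@dataclass(frozen=True)
class Stage:
    component: str
    params: dict


@dataclass(frozen=True)
class TilePlan:
    stages: Tuple[Stage, ...]            # tuple: the stage sequence cannot be reordered


@dataclass
class TileToken:
    tile_id: int
    plan: TilePlan
    stage_idx: int
    params: dict

    def params_in_sync(self) -> bool:
        # The cached params must be the canonical entry for the current stage.
        return self.params is self.plan.stages[self.stage_idx].params


plan = TilePlan((Stage("pe_dma", {"bytes": 4096}), Stage("pe_gemm", {"m": 64})))
token = TileToken(tile_id=0, plan=plan, stage_idx=0, params=plan.stages[0].params)

token.stage_idx = 1                      # stage advanced, cache not yet refreshed
stale = not token.params_in_sync()
token.params = plan.stages[1].params     # refresh, as the chaining pattern does

try:
    plan.stages = ()                     # a frozen dataclass rejects reassignment
    plan_is_frozen = False
except FrozenInstanceError:
    plan_is_frozen = True
```
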
+---
+
+## Non-goals
+
+- **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is not modified
+  (PeInternalTxn-based, ADR-0014 maintained)
+- **Resource contention model across multiple pipelines**: the current scope focuses on
+  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines
+  are future work.
+- **builtin_legacy maintenance**: kept for backup purposes only; not a target for
+  bug fixes or feature additions.
+
+## Open Questions
+
+- **Register File capacity model**: whether to model capacity limits when the fetch unit
+  loads into registers. Capacity is expressed in bytes (register_file_bytes), and
+  the number of tiles that can be held simultaneously is determined by tile size.
+  When capacity is exceeded, fetch stalls, creating natural backpressure.
+- **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite
+  commands. Therefore, overlap arises not from pre-injection across commands, but
+  naturally from pipeline progression of tiles within the same command.
+  If additional prefetch is needed, it should be considered at the level of tile ordering
+  within the same command or fetch/store unit policy, not cross-command injection.
+- **PE_DMA coalescing**: per-tile DMA may cause fragmentation.
+  The intended direction is to merge/coalesce within DMA without scheduler involvement.
+- **Synchronous execution mode**: this ADR adopts asynchronous pipeline as the
+  default/sole execution model. If a sync mode is needed for debug or validation
+  purposes, it will be considered in a future ADR.
+- **TCM bank conflict across multiple pipelines**: currently based on a single pipeline.
+  Bank conflict modeling when multiple pipelines simultaneously access TCM is future work.
+
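The register file question above reduces to simple arithmetic once a tile size is fixed. The sketch below uses hypothetical sizes (a 256 KiB register file and a 64x64 fp16 tile of 8 KiB) and hypothetical helper names; it shows only how `register_file_bytes` and tile size would bound the number of resident tiles and when fetch would stall.

```python
def resident_tiles(register_file_bytes: int, tile_bytes: int) -> int:
    """How many whole tiles fit in the register file at once."""
    if tile_bytes <= 0:
        raise ValueError("tile_bytes must be positive")
    return register_file_bytes // tile_bytes


def fetch_may_issue(tiles_in_flight: int, register_file_bytes: int, tile_bytes: int) -> bool:
    """Fetch stalls (returns False) once the register file is full."""
    return tiles_in_flight < resident_tiles(register_file_bytes, tile_bytes)


# Hypothetical sizes: 256 KiB register file, 64x64 fp16 tile (2 bytes per element).
RF_BYTES = 256 * 1024
TILE_BYTES = 64 * 64 * 2
capacity = resident_tiles(RF_BYTES, TILE_BYTES)
```
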
+---
+
+## Consequences
+
+### Positive
+
+- Each block is an independent component — individually replaceable (ADR-0015 compliant)
+- PE internal structure is visible in the topology
+- Components do not know the next component — plan-based routing provides flexibility
+- Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
+- Improved HW modeling accuracy (done signal = Event, data transfer = message)
+- Fetch/store separation enables accurate TCM BW contention modeling
+
+### Negative
+
+- Increased number of PE internal components (5 → 6) — more topology nodes/edges
+- Component separation makes intra-PE token forwarding more explicit than before
+- Breaking change from existing builtin/pe_accel — migration required
+
+---
+
+## Affected Files
+
+| File | Change |
+|------|--------|
+| `topology.yaml` | Add pe_fetch_store component, add chaining edges |
+| `components.yaml` | Register new builtin components |
+| `src/kernbench/topology/builder.py` | Add fetch_store + chaining edges to PE internal edges |
+| `src/kernbench/common/pe_commands.py` | Add TileToken definition |
+| `src/kernbench/components/builtin/pe_scheduler.py` | Re-implement (feeder + plan-based dispatch) |
+| `src/kernbench/components/builtin/pe_gemm.py` | Re-implement (TileToken, _process pattern) |
+| `src/kernbench/components/builtin/pe_math.py` | Re-implement (TileToken, _process pattern) |
+| `src/kernbench/components/builtin/pe_dma.py` | Re-implement (TileToken, _process pattern) |
+| `src/kernbench/components/builtin/pe_fetch_store.py` | New |
+| `src/kernbench/components/builtin/pe_tcm.py` | Re-implement (TcmRequest service) |
+| `src/kernbench/components/builtin/types.py` | New: TilePlan, Stage, StageType, PipelineContext, TileToken |
+| `src/kernbench/components/builtin/tiling.py` | Ported from pe_accel: plan generation logic |
+
+Backup:
+
+| Path | Change |
+|------|--------|
+| `src/kernbench/components/builtin_legacy/` | Full backup of existing builtin (preserved without modification) |
+| `src/kernbench/components/custom/pe_accel/` | Backup of existing pe_accel (preserved without modification) |