Add English translations for ADR-0018, 0019, 0020, 0021

- ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping
- ADR-0019: CUBE NOC per-channel and aggregated HBM connection model
- ADR-0020: 2-pass data execution model (timing/data separation, greenlet)
- ADR-0021: PE pipeline refactor (component separation + token self-routing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 16:31:32 -07:00
parent 10b33b44ba
commit b2c52f0e34
4 changed files with 1962 additions and 0 deletions
@@ -0,0 +1,537 @@
+# ADR-0021: PE Pipeline Refactoring (Component Separation + Token Self-Routing)
+
+## Status
+
+Proposed
+
+## Context
+
+### Problems with the Current Structure
+
+pe_accel (SchedulerV2Component) hides 5 hardware blocks (DmaIn, DmaWb, Gemm, Math, Tcm)
+**inside a single component**.
+
+```
+SchedulerV2Component (single topology node)
+├── DmaInBlock     ← directly connected via internal SimPy Store
+├── DmaWbBlock     ← not visible in topology
+├── GemmBlock      ← not replaceable
+├── MathBlock      ← not replaceable
+└── TcmBlock       ← not replaceable
+```
+
+Problems:
+- Blocks directly reference the next block via `desc.next_block` — hardcoded routing
+- Individual blocks cannot be replaced (violates ADR-0015 component replacement principle)
+- PE internal structure is not visible in the topology
+- GemmBlock and MathBlock each duplicate TCM load/store logic
+
+### Actual Hardware Structure
+
+```
+HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
+```
+
+- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
+- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
+- GEMM/MATH Engine: computation between Register Files (cycle-accurate)
+- Completion signal: PE-internal 1-cycle wire signal (done pin assert)
+
+---
+
+## Decision
+
+### D1. Separate Each Block into an Independent Component
+
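+The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
+DmaInBlock and DmaWbBlock are served by a single PE_DMA, the scheduler itself becomes PE_SCHEDULER,
+and a new Fetch/Store Unit is added, giving 6 components in total.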
+
+| Component | Role | HW Correspondence |
+|-----------|------|-------------------|
+| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
+| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
+| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
+| PE_GEMM | MAC compute (register only) | MAC Array |
+| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
+| PE_TCM | BW-serialized scratchpad | SRAM Bank |
+
+Each component exists as a topology node and is connected via ports/wires.
+Replacing the `impl` allows changing the timing model of an individual block.
+
+### D2. Token Self-Routing — Scheduler Handles Only Dispatch + Completion
+
+**Components do not pass through the scheduler at every stage.**
+The token carries a plan so that components chain directly to the next stage.
+
+```
+Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
+              ↑ chaining: does not go through scheduler          completion only
+```
+
+This matches the actual HW structure where each block's done signal is directly
+connected to the next block via wire. The scheduler is responsible **only for
+initial dispatch + completion aggregation**.
+
+#### Stage Definition
+
+```python
+from enum import Enum
+
+class StageType(Enum):
+    DMA_READ = 0
+    FETCH = 1
+    GEMM = 2
+    MATH = 3
+    STORE = 4
+    DMA_WRITE = 5
+```
+
+#### Plan Structure
+
+When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
+The plan defines the **stage sequence** for each tile:
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class Stage:
+    stage_type: StageType
+    component: str       # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
+    params: dict         # per-stage parameters (dynamic)
+
+@dataclass(frozen=True)
+class TilePlan:
+    tile_id: int
+    stages: tuple[Stage, ...]  # stages to execute in order (immutable tuple)
+```
+
+The stage sequence varies depending on the plan:
+
+```python
+# Shorthand: each name below stands for a Stage whose stage_type is that StageType member.
+
+# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
+stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)
+
+# GEMM directly from TCM data (skip DMA read):
+stages = (FETCH, GEMM, STORE, DMA_WRITE)
+
+# MATH element-wise:
+stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)
+
+# GEMM + accumulation (intermediate K-tile, skip writeback):
+stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only
+```
+
+**Components do not hardcode the next component.**
+They read the next stage from the token's plan and forward it directly via out_port.
+This is the same pattern as a network packet carrying a routing header.
+
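
The routing-header analogy can be illustrated with a minimal sketch that runs without SimPy. The `Token`, `make_plan`, and `route` names are hypothetical and exist only for this illustration, and the stage-to-component ownership table is an assumption inferred from the chaining edges listed in D7. Real components forward tokens through `out_ports` as described in D4.

```python
from dataclasses import dataclass, field
from enum import Enum


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5


@dataclass
class Stage:
    stage_type: StageType
    component: str  # short hypothetical node names, not real topology IDs


@dataclass
class Token:
    stages: tuple
    stage_idx: int = 0
    trace: list = field(default_factory=list)  # components visited, in order


def make_plan(*stage_types):
    """Build a stage tuple, mapping each StageType to its assumed owner."""
    owner = {
        StageType.DMA_READ: "pe_dma",
        StageType.FETCH: "pe_fetch_store",
        StageType.GEMM: "pe_gemm",
        StageType.MATH: "pe_math",
        StageType.STORE: "pe_fetch_store",
        StageType.DMA_WRITE: "pe_dma",
    }
    return tuple(Stage(st, owner[st]) for st in stage_types)


def route(token):
    """Walk the token through its plan; no component names its successor."""
    while token.stage_idx < len(token.stages):
        stage = token.stages[token.stage_idx]
        token.trace.append(stage.component)  # the component "processes" here
        token.stage_idx += 1                 # next hop is read from the plan
    return token.trace


S = StageType
normal_gemm = route(Token(make_plan(S.DMA_READ, S.FETCH, S.GEMM, S.STORE, S.DMA_WRITE)))
accumulate = route(Token(make_plan(S.DMA_READ, S.FETCH, S.GEMM, S.STORE)))
print(normal_gemm)
print(accumulate)
```

Changing the plan tuple changes the route. None of the per-component logic has to change, which is the property the plan-based design relies on.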
+#### Pipeline Context
+
+```python
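+from dataclasses import dataclass
+from typing import Optional
+
+import simpy
+
+@dataclass
+class PipelineContext:
+    id: str
+    total_tiles: int
+    completed_tiles: int = 0
+    done_event: Optional[simpy.Event] = None  # succeeds when all tiles are complete
+
+    def complete_tile(self) -> None:
+        # exactly-once contract: more calls than tiles indicates a bug
+        assert self.completed_tiles < self.total_tiles, "duplicate complete_tile()"
+        self.completed_tiles += 1
+        if self.completed_tiles == self.total_tiles:
+            self.done_event.succeed()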
+
+**Completion follows an exactly-once contract**: the last stage of each tile must call
+`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
+succeed only once (SimPy Event constraint).
+
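
One way the exactly-once contract could be enforced defensively is sketched below. It assumes a plain callback in place of `simpy.Event` so the snippet runs without SimPy, and it tracks tile IDs explicitly, which the proposed `complete_tile()` does not do. `GuardedContext` and `on_done` are hypothetical names, not part of the proposed code.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GuardedContext:
    total_tiles: int
    on_done: Callable[[], None]          # stand-in for done_event.succeed
    _completed: set = field(default_factory=set)

    def complete_tile(self, tile_id: int) -> None:
        # Reject a second completion of the same tile instead of miscounting.
        if tile_id in self._completed:
            raise RuntimeError(f"tile {tile_id} completed twice")
        self._completed.add(tile_id)
        if len(self._completed) == self.total_tiles:
            self.on_done()               # reached only when the last new tile arrives


fired = []
ctx = GuardedContext(total_tiles=2, on_done=lambda: fired.append("done"))
ctx.complete_tile(0)
ctx.complete_tile(1)

try:
    ctx.complete_tile(1)                 # duplicate call: contract violation
    duplicate_rejected = False
except RuntimeError:
    duplicate_rejected = True

print(fired, duplicate_rejected)
```

Tracking IDs makes a duplicate call fail loudly at the offending call site, at the cost of one set entry per tile.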
+#### Scheduler Role (Reduced)
+
+When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext,
+enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately.
+
+Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
+This feeder consumes `_pending_feeds` in FIFO order and
+**does not allow tile feed interleaving across composite commands.**
+That is, the feed for the next command begins only after all tiles of the current
+command have been injected into the first stage queue.
+
+There is **exactly one `_feed_loop`** per scheduler, and
+tile feed for composite commands is performed exclusively through this single process.
+Command issue order refers to **the order in which PE_SCHEDULER receives PeInternalTxn**.
+
+This structure maintains command issue order while ensuring that when the first stage
+queue is full, only the feeder process blocks — the scheduler worker's inbox processing
+itself does not stall.
+
+```python
+class PeSchedulerV2(PeEngineBase):
+    _pipelines: dict[str, PipelineContext]
+    _pending_feeds: simpy.Store   # FIFO of (plan, ctx)
+
+    def start(self, env):
+        super().start(env)
+        self._pending_feeds = simpy.Store(env)
+        env.process(self._feed_loop(env))
+
+    def _dispatch_composite(self, env, pe_txn, cmd):
+        plan = generate_plan(cmd)
+        ctx = PipelineContext(
+            id=next_id(),
+            total_tiles=len(plan.tiles),
+            done_event=pe_txn.done,
+        )
+        self._pipelines[ctx.id] = ctx
+
+        # only enqueue to feeder queue and return immediately
+        yield self._pending_feeds.put((plan, ctx))
+
+    def _feed_loop(self, env):
+        """Single feeder process: feeds composite commands in FIFO order.
+
+        Tile feed interleaving across composite commands is not allowed.
+        The feed for the next command begins only after all tiles of the
+        current command have been injected into the first stage queue.
+
+        When the first stage queue is full, only this feeder blocks;
+        the scheduler worker's inbox processing does not stall.
+        """
+        while True:
+            plan, ctx = yield self._pending_feeds.get()
+            for tile in plan.tiles:
+                token = TileToken(
+                    tile_id=tile.tile_id,
+                    pipeline_ctx=ctx,
+                    plan=tile,
+                    stage_idx=0,
+                    params=tile.stages[0].params,
+                )
+                yield self.out_ports[tile.stages[0].component].put(token)
+                # queue capacity = HW queue depth → feeder blocks only when full
+```
+
+In this ADR, the scheduler can accept multiple composite commands,
+but tile submission order follows per-command FIFO.
+Within a command, tile-level pipeline overlap is allowed,
+but tile feed interleaving across commands is not.
+
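
The feeder behavior can be tried out in isolation. The sketch below deliberately swaps SimPy for the standard-library `asyncio.Queue` as a stand-in for `simpy.Store`: a bounded queue models the first-stage HW queue and an unbounded one models `_pending_feeds`. The function names and the `(cmd_id, n_tiles)` command shape are assumptions made for this illustration only.

```python
import asyncio


async def feed_loop(pending: asyncio.Queue, first_stage: asyncio.Queue) -> None:
    """Single feeder: drains commands in FIFO order, one command at a time."""
    while True:
        cmd_id, n_tiles = await pending.get()
        for tile_id in range(n_tiles):
            # Blocks here when first_stage is full; only the feeder waits.
            await first_stage.put((cmd_id, tile_id))


async def main() -> list:
    pending = asyncio.Queue()               # unbounded, like _pending_feeds
    first_stage = asyncio.Queue(maxsize=2)  # capacity models HW queue depth
    feeder = asyncio.create_task(feed_loop(pending, first_stage))

    # The "scheduler" accepts both commands immediately, even though the
    # first-stage queue can only hold two tiles.
    pending.put_nowait(("cmd1", 3))
    pending.put_nowait(("cmd2", 2))

    seen = []
    for _ in range(5):                      # downstream stage drains tiles
        seen.append(await first_stage.get())
    feeder.cancel()
    return seen


order = asyncio.run(main())
print(order)
```

Both commands are accepted without waiting, yet every `cmd1` tile reaches the first-stage queue before any `cmd2` tile, which is the ordering guarantee described above.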
+### D3. Data Transfer vs. Completion Signal — HW Modeling Criteria
+
+| Communication Type | Method | HW Correspondence |
+|-------------------|--------|-------------------|
+| Tile token (work directive) | message via out_port | enqueue to command queue |
+| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
+| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |
+
+**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.
+
+**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
+is applied to intra-PE stage triggers. Chaining between components corresponds to
+PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost
+is incurred.
+
+**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
+When all tiles are complete, PipelineContext calls done_event.succeed().
+
+### D4. Asynchronous Pipeline — Natural Overlap
+
+The scheduler processes CompositeCmds **asynchronously**.
+However, tile feed does not spawn an independent process per command; instead,
+the scheduler's internal **single feeder process** performs the feed in FIFO order.
+Therefore, the scheduler can continue to receive the next command,
+but the first-stage tile injection order is guaranteed per command.
+
+Since **SimPy Store capacity = HW queue depth**:
+- When the queue is full, put() naturally blocks (backpressure)
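+- While DMA is processing a later tile, FETCH/GEMM can already work on a tile whose DMA has completed
+- When a second CompositeCmd arrives, it is immediately enqueued to `_pending_feeds`; its tiles enter the DMA queue only after the current command's feed completes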
+
+```
+First-stage feed order (feeder → DMA queue):
+  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
+                                            ↑ cmd2 starts after cmd1 feed completes
+
+Runtime pipeline (downstream overlap):
+  PE_DMA:    [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
+  PE_FETCH:          [cmd1:t0][cmd1:t1]...
+  PE_GEMM:                   [cmd1:t0][cmd1:t1]...
+                              ↑ pipeline overlap within the same command
+```
+
+Here, the overlap does not come from tile feed interleaving across different commands,
+but occurs naturally as tiles from earlier commands progress to downstream stages
+while the feeder continues injecting subsequent tiles.
+
+For example, tile feed for cmd2 does not start until all tiles of cmd1 have been
+injected into the first stage queue. However, while cmd1.tile0 has already progressed
+to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so
+**pipeline overlap within the same command occurs naturally**.
+
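
The benefit of this overlap can be estimated with a standard pipeline recurrence. The sketch makes simplifying assumptions: one tile in service per stage, in-order tiles, unbounded queues between stages (so no backpressure stalls), and fixed per-stage latencies. The latency values are made up for illustration and are not measurements.

```python
def pipeline_finish_times(stage_latencies, n_tiles):
    """Return finish[t][s]: time at which tile t leaves stage s.

    A tile enters stage s once it has left stage s-1 and the previous
    tile has left stage s, whichever is later.
    """
    finish = [[0.0] * len(stage_latencies) for _ in range(n_tiles)]
    for t in range(n_tiles):
        for s, latency in enumerate(stage_latencies):
            ready = finish[t][s - 1] if s > 0 else 0.0   # upstream stage done
            free = finish[t - 1][s] if t > 0 else 0.0    # this stage idle again
            finish[t][s] = max(ready, free) + latency
    return finish


# Hypothetical latencies (ns) for DMA, FETCH, GEMM; three tiles of one command.
latencies = [10.0, 2.0, 5.0]
finish = pipeline_finish_times(latencies, n_tiles=3)

pipelined = finish[-1][-1]
serial = sum(latencies) * 3
print(pipelined, serial)  # 37.0 51.0
```

With these numbers the slowest stage (DMA) sets the steady-state rate, so the pipelined makespan is 37 ns against 51 ns for fully serial execution.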
+#### Component Chaining Pattern
+
+All components follow the same pattern:
+
+```python
+def _pipeline_worker(self, env):
+    while True:
+        token = yield self._inbox.get()
+
+        # process own stage
+        yield from self._process(env, token)
+
+        # chain to next stage (read from plan)
+        next_idx = token.stage_idx + 1
+        if next_idx < len(token.plan.stages):
+            next_stage = token.plan.stages[next_idx]
+            token.stage_idx = next_idx
+            token.params = next_stage.params
+            yield self.out_ports[next_stage.component].put(token)
+        else:
+            # last stage — pipeline completion
+            token.pipeline_ctx.complete_tile()
+```
+
+### D5. PE_FETCH_STORE — Dedicated TCM ↔ Register File Transfer
+
+Previously, GemmBlock and MathBlock each implemented their own TCM read/write.
+This is separated into a **PE_FETCH_STORE component**.
+
+```python
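+# PE_FETCH_STORE._process()
+def _process(self, env, token):
+    # tcm_id: node ID of this PE's PE_TCM; "direction" selects fetch vs. store
+    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
+    # tcm_done: completion event for this TCM request
+    yield tcm_done
+    # chaining is handled by the base class (D4 pattern)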
+
+Advantages:
+- GEMM/MATH perform **pure compute only** — no TCM access logic
+- Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
+- Prefetch strategies can be experimented with by replacing the fetch unit alone
+
+### D6. Simplification of Each Compute Component
+
+GEMM/MATH perform compute only with register data already prepared.
+**Chaining follows the common pattern (D4), so only _process() needs to be implemented:**
+
+```python
+# PE_GEMM._process()
+def _process(self, env, token):
+    yield env.timeout(self._mac_latency(token.params))
+
+# PE_MATH._process()
+def _process(self, env, token):
+    yield env.timeout(self._simd_latency(token.params))
+
+# PE_FETCH_STORE._process()
+def _process(self, env, token):
+    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
+    yield tcm_done
+
+# PE_DMA._process()
+def _process(self, env, token):
+    yield from self._do_fabric_dma(token.params)
+```
+
+By replacing only the timing model, one can freely switch between cycle-accurate
+and analytical models. Since the chaining logic resides in the base class,
+each component only implements its pure stage logic.
+
+### D7. Topology Changes
+
+Add PE_FETCH_STORE to the PE template:
+
+```yaml
+pe_template:
+  components:
+    pe_cpu:         { kind: pe_cpu,         impl: pe_cpu_v1, ... }
+    pe_scheduler:   { kind: pe_scheduler,   impl: pe_scheduler_v2, ... }
+    pe_dma:         { kind: pe_dma,         impl: pe_dma_v1, ... }
+    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
+    pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1, ... }
+    pe_math:        { kind: pe_math,        impl: pe_math_v1, ... }
+    pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1, ... }
+    pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1, ... }
+  links:
+    # existing links...
+    fetch_store_to_tcm_bw_gbs: 512.0
+    fetch_store_to_tcm_mm: 0.0
+```
+
+PE internal edge connections:
+```
+PE_SCHEDULER → PE_DMA (initial dispatch)
+PE_SCHEDULER → PE_FETCH_STORE (initial dispatch)
+PE_SCHEDULER → PE_GEMM (initial dispatch)
+PE_SCHEDULER → PE_MATH (initial dispatch)
+PE_DMA → PE_FETCH_STORE (chaining)
+PE_FETCH_STORE → PE_GEMM (chaining)
+PE_FETCH_STORE → PE_MATH (chaining)
+PE_GEMM → PE_FETCH_STORE (store chaining)
+PE_MATH → PE_FETCH_STORE (store chaining)
+PE_FETCH_STORE → PE_DMA (writeback chaining)
+PE_FETCH_STORE → PE_TCM (BW request)
+```
+
+Topology edges cover both **control/dispatch visibility and runtime chaining**.
+Scheduler → sub-component edges are initial dispatch paths, while
+inter-component edges are runtime chaining paths driven by token self-routing.
+
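
Because routing is driven by plans, it may be worth checking at plan-generation time that every consecutive hop in a plan has a matching chaining edge. The sketch below is one possible check. The edge set is copied from the list above, while the `OWNER` mapping from stage to component and the `missing_edges` helper are assumptions introduced for illustration.

```python
# Chaining edges copied from the D7 edge list (dispatch and TCM edges omitted).
CHAIN_EDGES = {
    ("PE_DMA", "PE_FETCH_STORE"),
    ("PE_FETCH_STORE", "PE_GEMM"),
    ("PE_FETCH_STORE", "PE_MATH"),
    ("PE_GEMM", "PE_FETCH_STORE"),
    ("PE_MATH", "PE_FETCH_STORE"),
    ("PE_FETCH_STORE", "PE_DMA"),
}

# Assumed stage-to-component ownership, inferred from the edge comments.
OWNER = {
    "DMA_READ": "PE_DMA",
    "FETCH": "PE_FETCH_STORE",
    "GEMM": "PE_GEMM",
    "MATH": "PE_MATH",
    "STORE": "PE_FETCH_STORE",
    "DMA_WRITE": "PE_DMA",
}


def missing_edges(stages):
    """Return the consecutive component hops that have no chaining edge."""
    hops = zip(stages, stages[1:])
    return [
        (OWNER[a], OWNER[b])
        for a, b in hops
        if (OWNER[a], OWNER[b]) not in CHAIN_EDGES
    ]


plans = {
    "normal_gemm": ("DMA_READ", "FETCH", "GEMM", "STORE", "DMA_WRITE"),
    "gemm_from_tcm": ("FETCH", "GEMM", "STORE", "DMA_WRITE"),
    "math": ("DMA_READ", "FETCH", "MATH", "STORE", "DMA_WRITE"),
    "gemm_accumulate": ("DMA_READ", "FETCH", "GEMM", "STORE"),
}
report = {name: missing_edges(stages) for name, stages in plans.items()}
bad_plan = missing_edges(("DMA_READ", "GEMM"))  # skips FETCH on purpose
print(report)
print(bad_plan)
```

All four stage sequences from D2 pass, while a plan that jumps from DMA straight to GEMM is reported as needing a `PE_DMA → PE_GEMM` edge that the topology does not define.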
+### D8. Existing Code Migration — Builtin Integration
+
+The existing builtin v1 components and pe_accel are **replaced with new builtin components**.
+
+#### Migration Strategy
+
+1. Back up existing `components/builtin/` → `components/builtin_legacy/` (preserved without modification)
+2. Keep existing `components/custom/pe_accel/` as a backup as well (preserved without modification)
+3. Re-implement new `components/builtin/` with the ADR-0021 architecture
+4. Maintain **only one** topology.yaml (including pe_fetch_store)
+5. components.yaml points to the new builtin
+
+```yaml
+# components.yaml — new builtin
+pe_scheduler_v1: kernbench.components.builtin.pe_scheduler:PeSchedulerComponent
+pe_gemm_v1:      kernbench.components.builtin.pe_gemm:PeGemmComponent
+pe_math_v1:      kernbench.components.builtin.pe_math:PeMathComponent
+pe_dma_v1:       kernbench.components.builtin.pe_dma:PeDmaComponent
+pe_fetch_store_v1: kernbench.components.builtin.pe_fetch_store:PeFetchStoreComponent
+pe_tcm_v1:       kernbench.components.builtin.pe_tcm:PeTcmComponent
+```
+
+The impl names (pe_gemm_v1, etc.) are preserved, but **the implementations are replaced
+with the ADR-0021 architecture**. Existing benchmarks and tests referencing topology.yaml
+continue to work without changes.
+
+#### Latency Model Inheritance
+
+The latency modeling of the new builtin components (MAC cycle calculation, SIMD latency,
+TCM BW serialization, DMA fabric latency, etc.) is **based on the current pe_accel
+implementation**. The tile schedule generation logic from tiling.py is also carried over.
+Only the architecture (component separation, self-routing) changes; timing accuracy
+is preserved.
+
+#### Test Plan
+
+**1. Existing test pass** (regression):
+After migration is complete, all existing tests (366) must pass.
+
+**2. Latency regression**:
+Verify that the new builtin produces identical latency for the same inputs as pe_accel.
+
+**3. Phase 1 → Phase 2 end-to-end**:
+Integration test from SimPy simulation (Phase 1) op_log generation → DataExecutor
+(Phase 2) actual numpy computation → result correctness verification.
+- GEMM: tl.composite(gemm) → op_log → Phase 2 matmul → allclose verification
+- MATH: tl.exp / tl.add, etc. → op_log → Phase 2 numpy op → allclose verification
+- Chaining: GEMM output → MATH input → final result end-to-end verification
+
+**4. TileToken self-routing**:
+- Verify that tiles chain according to the plan's stage sequence
+- Verify PipelineContext.complete_tile() exactly-once at the last stage
+- Queue backpressure: verify that only the feeder blocks when DMA queue capacity is exceeded
+
+**5. Asynchronous pipeline overlap**:
+- Verify that inter-tile stage overlap occurs within the same command (tile0 in GEMM while tile1 in DMA)
+- Multiple commands: verify that cmd2 feed starts after cmd1 feed completes (FIFO order)
+
+### D9. TileToken Message Definition
+
+A message used for passing tile work between components.
+The token carries the plan and stage index, enabling self-routing.
+
+```python
+@dataclass
+class TileToken:
+    tile_id: int
+    pipeline_ctx: PipelineContext    # completion tracking
+    plan: TilePlan                   # full stage sequence for this tile (immutable)
+    stage_idx: int                   # current stage index in plan.stages
+    params: dict                     # current stage parameter cache (canonical: plan.stages[stage_idx].params)
+    data_op: bool = True             # op_log recording target (ADR-0020)
+```
+
+A TileToken is **owned by exactly one component at a time** and
+is never referenced by multiple components simultaneously (single-owner).
+
+Token lifecycle:
+1. Scheduler creates it with stage_idx=0 and puts it to the first stage component
+2. The component executes _process(), increments stage_idx, and puts it to the next component
+3. The last stage component calls pipeline_ctx.complete_tile()
+4. When all tiles are complete, PipelineContext calls done_event.succeed()
+
+Relationship with existing PeInternalTxn:
+- PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged)
+- TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)
+
+---
+
+## Non-goals
+
+- **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is not modified
+  (PeInternalTxn-based, ADR-0014 maintained)
+- **Resource contention model across multiple pipelines**: the current scope focuses on
+  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines
+  are future work.
+- **builtin_legacy maintenance**: kept for backup purposes only; not a target for
+  bug fixes or feature additions.
+
+## Open Questions
+
+- **Register File capacity model**: whether to model capacity limits when the fetch unit
+  loads into registers. Capacity is expressed in bytes (register_file_bytes), and
+  the number of tiles that can be held simultaneously is determined by tile size.
+  When capacity is exceeded, fetch stalls, creating natural backpressure.
+- **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite
+  commands. Therefore, overlap arises not from pre-injection across commands, but
+  naturally from pipeline progression of tiles within the same command.
+  If additional prefetch is needed, it should be considered at the level of tile ordering
+  within the same command or fetch/store unit policy, not cross-command injection.
+- **PE_DMA coalescing**: per-tile DMA may cause fragmentation.
+  Direction is to merge/coalesce within DMA without scheduler involvement.
+- **Synchronous execution mode**: this ADR adopts asynchronous pipeline as the
+  default/sole execution model. If a sync mode is needed for debug or validation
+  purposes, it will be considered in a future ADR.
+- **TCM bank conflict across multiple pipelines**: currently based on a single pipeline.
+  Bank conflict modeling when multiple pipelines simultaneously access TCM is future work.
+
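
For the register file capacity question, the relationship between capacity, tile size, and fetch stalls can be written down directly. This is only a sketch of one possible model: `register_file_bytes` is the name used above, while `resident_tiles`, `fetch_must_stall`, and the byte sizes are hypothetical.

```python
def resident_tiles(register_file_bytes: int, tile_bytes: int) -> int:
    """Whole tiles that fit in the register file at once."""
    if tile_bytes <= 0:
        raise ValueError("tile_bytes must be positive")
    return register_file_bytes // tile_bytes


def fetch_must_stall(in_flight_tiles: int, register_file_bytes: int, tile_bytes: int) -> bool:
    """True when admitting one more tile would exceed capacity."""
    return in_flight_tiles + 1 > resident_tiles(register_file_bytes, tile_bytes)


# Hypothetical sizes: 64 KiB register file, 16 KiB per tile.
capacity = resident_tiles(64 * 1024, 16 * 1024)
print(capacity)                                   # 4
print(fetch_must_stall(3, 64 * 1024, 16 * 1024))  # False
print(fetch_must_stall(4, 64 * 1024, 16 * 1024))  # True
```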
+---
+
+## Consequences
+
+### Positive
+
+- Each block is an independent component — individually replaceable (ADR-0015 compliant)
+- PE internal structure is visible in the topology
+- Components do not know the next component — plan-based routing provides flexibility
+- Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
+- Improved HW modeling accuracy (done signal = Event, data transfer = message)
+- Fetch/store separation enables accurate TCM BW contention modeling
+
+### Negative
+
+- Increased number of PE internal components (5 → 6) — more topology nodes/edges
+- Component separation makes intra-PE token forwarding more explicit than before
+- Breaking change from existing builtin/pe_accel — migration required
+
+---
+
+## Affected Files
+
+| File | Change |
+|------|--------|
+| `topology.yaml` | Add pe_fetch_store component, add chaining edges |
+| `components.yaml` | Register new builtin components |
+| `src/kernbench/topology/builder.py` | Add fetch_store + chaining edges to PE internal edges |
+| `src/kernbench/common/pe_commands.py` | Add TileToken definition |
+| `src/kernbench/components/builtin/pe_scheduler.py` | Re-implement (feeder + plan-based dispatch) |
+| `src/kernbench/components/builtin/pe_gemm.py` | Re-implement (TileToken, _process pattern) |
+| `src/kernbench/components/builtin/pe_math.py` | Re-implement (TileToken, _process pattern) |
+| `src/kernbench/components/builtin/pe_dma.py` | Re-implement (TileToken, _process pattern) |
+| `src/kernbench/components/builtin/pe_fetch_store.py` | New |
+| `src/kernbench/components/builtin/pe_tcm.py` | Re-implement (TcmRequest service) |
+| `src/kernbench/components/builtin/types.py` | New: TilePlan, Stage, StageType, PipelineContext, TileToken |
+| `src/kernbench/components/builtin/tiling.py` | Ported from pe_accel: plan generation logic |
+
+Backup:
+
+| Path | Status |
+|------|--------|
+| `src/kernbench/components/builtin_legacy/` | Full backup of existing builtin (preserved without modification) |
+| `src/kernbench/components/custom/pe_accel/` | Backup of existing pe_accel (preserved without modification) |