# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing

## Status

Accepted

## Context

### Actual Hardware Structure

```
HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
```

- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
- GEMM/MATH Engine: computation between Register Files (cycle-accurate)
- Completion signal: PE-internal 1-cycle wire signal (done pin assert)

---

## Decision

### D1. Separate Each Block into an Independent Component

The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
The five existing blocks plus the new Fetch/Store Unit make six components.

| Component | Role | HW Correspondence |
|-----------|------|-------------------|
| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
| PE_GEMM | MAC compute (register only) | MAC Array |
| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
| PE_TCM | BW-serialized scratchpad | SRAM Bank |

Each component exists as a topology node and is connected via ports/wires.
Replacing the `impl` allows changing the timing model of an individual block.

### D2. Token Self-Routing — Scheduler Handles Only Dispatch + Completion

**Tokens do not return to the scheduler between stages.**
Each token carries its plan, so a component chains it directly to the next stage.

```
Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
              ↑ chaining: does not go through scheduler          completion only
```

This matches the actual HW structure where each block's done signal is directly
connected to the next block via wire. The scheduler is responsible **only for
initial dispatch + completion aggregation**.

#### Stage Definition

```python
from enum import Enum


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5
```

#### Plan Structure

When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
The plan defines the **stage sequence** for each tile:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    stage_type: StageType
    component: str       # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
    params: dict         # per-stage parameters (dynamic)

@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]  # stages to execute in order (tuple, so the sequence is immutable)
```

The stage sequence varies depending on the plan:

```python
# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)

# GEMM directly from TCM data (skip DMA read):
stages = (FETCH, GEMM, STORE, DMA_WRITE)

# MATH element-wise:
stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)

# GEMM + accumulation (intermediate K-tile, skip writeback):
stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only
```

**Components do not hardcode the next component.**
They read the next stage from the token's plan and forward the token directly via out_port.
This is the same pattern as a network packet carrying a routing header.
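
The routing-header analogy can be illustrated with a minimal sketch. The `next_hop` helper, the node IDs, and the `"direction"` values are assumptions made for illustration; only `StageType` and `Stage` mirror the definitions above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5


@dataclass
class Stage:
    stage_type: StageType
    component: str  # topology node ID
    params: dict


def next_hop(stages: tuple, stage_idx: int) -> Optional[str]:
    """Return the component of the stage after stage_idx, or None at the end."""
    nxt = stage_idx + 1
    return stages[nxt].component if nxt < len(stages) else None


# Hypothetical node IDs, following the "sip0.cube0.pe0.pe_dma" naming above.
PE = "sip0.cube0.pe0"
stages = (
    Stage(StageType.DMA_READ, f"{PE}.pe_dma", {}),
    Stage(StageType.FETCH, f"{PE}.pe_fetch_store", {"direction": "load"}),
    Stage(StageType.GEMM, f"{PE}.pe_gemm", {}),
    Stage(StageType.STORE, f"{PE}.pe_fetch_store", {"direction": "store"}),
)

# Walk the plan the way a token would, one hop at a time.
route = [stages[0].component]
idx = 0
while (hop := next_hop(stages, idx)) is not None:
    route.append(hop)
    idx += 1
```

No component in this walk names its successor; the route comes entirely from the plan, which is why the same PE_FETCH_STORE node can appear twice with different parameters.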

#### Pipeline Context

```python
from dataclasses import dataclass
from typing import Optional

import simpy

@dataclass
class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: Optional[simpy.Event] = None  # succeeds when all tiles are complete

    def complete_tile(self) -> None:
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()
```

**Completion follows an exactly-once contract**: the last stage of each tile must call
`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
succeed only once (SimPy Event constraint).
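
One way to surface violations of this contract early is to track completed tile IDs instead of a bare counter. The sketch below is a hypothetical variant, not the `PipelineContext` defined above: `CheckedPipelineContext`, the `tile_id` argument, and `FakeEvent` (a stand-in for `simpy.Event` so the example runs without SimPy) are all assumptions.

```python
from dataclasses import dataclass, field


class FakeEvent:
    """Stand-in for simpy.Event (assumption): succeed() may fire only once."""

    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        if self.triggered:
            raise RuntimeError("event already triggered")
        self.triggered = True


@dataclass
class CheckedPipelineContext:
    id: str
    total_tiles: int
    done_event: FakeEvent
    _completed: set = field(default_factory=set)

    def complete_tile(self, tile_id: int) -> None:
        # Reject a second completion for the same tile instead of
        # silently over-counting.
        if tile_id in self._completed:
            raise RuntimeError(f"tile {tile_id} completed twice")
        self._completed.add(tile_id)
        if len(self._completed) == self.total_tiles:
            self.done_event.succeed()


ctx = CheckedPipelineContext(id="p0", total_tiles=2, done_event=FakeEvent())
ctx.complete_tile(0)
first_done = ctx.done_event.triggered  # still pending after one of two tiles
ctx.complete_tile(1)
try:
    ctx.complete_tile(1)  # duplicate call: a contract violation
    duplicate_rejected = False
except RuntimeError:
    duplicate_rejected = True
```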

#### Scheduler Role (Reduced)

When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext,
enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately.

Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
This feeder consumes `_pending_feeds` in FIFO order and
**does not allow tile feed interleaving across composite commands.**
That is, the feed for the next command begins only after all tiles of the current
command have been injected into the first stage queue.

There is **exactly one `_feed_loop`** per scheduler, and
tile feed for composite commands is performed exclusively through this single process.
Command issue order refers to **the order in which PE_SCHEDULER receives PeInternalTxn**.

This structure maintains command issue order while ensuring that when the first stage
queue is full, only the feeder process blocks — the scheduler worker's inbox processing
itself does not stall.

```python
class PeSchedulerV2(PeEngineBase):
    _pipelines: dict[str, PipelineContext]
    _pending_feeds: simpy.Store   # FIFO of (plan, ctx)

    def start(self, env):
        super().start(env)
        self._pending_feeds = simpy.Store(env)
        env.process(self._feed_loop(env))

    def _dispatch_composite(self, env, pe_txn, cmd):
        plan = generate_plan(cmd)
        ctx = PipelineContext(
            id=next_id(),
            total_tiles=len(plan.tiles),
            done_event=pe_txn.done,
        )
        self._pipelines[ctx.id] = ctx

        # only enqueue to feeder queue and return immediately
        yield self._pending_feeds.put((plan, ctx))

    def _feed_loop(self, env):
        """Single feeder process: feeds composite commands in FIFO order.

        Tile feed interleaving across composite commands is not allowed.
        The feed for the next command begins only after all tiles of the
        current command have been injected into the first stage queue.

        When the first stage queue is full, only this feeder blocks;
        the scheduler worker's inbox processing does not stall.
        """
        while True:
            plan, ctx = yield self._pending_feeds.get()
            for tile in plan.tiles:
                token = TileToken(
                    tile_id=tile.tile_id,
                    pipeline_ctx=ctx,
                    plan=tile,
                    stage_idx=0,
                    params=tile.stages[0].params,
                )
                yield self.out_ports[tile.stages[0].component].put(token)
                # queue capacity = HW queue depth → feeder blocks only when full
```

In this ADR, the scheduler can accept multiple composite commands,
but tile submission order follows per-command FIFO.
Within a command, tile-level pipeline overlap is allowed,
but tile feed interleaving across commands is not.
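
The ordering guarantee can be checked in isolation with a small model. This sketch uses `asyncio.Queue` as a stand-in for `simpy.Store` (an assumption made so it runs on the standard library alone); the queue depth of 2 and the command and tile values are hypothetical.

```python
import asyncio


async def feed_loop(pending: asyncio.Queue, first_stage: asyncio.Queue) -> None:
    """Single feeder: inject all tiles of one command before taking the next."""
    while True:
        cmd, tiles = await pending.get()
        for tile in tiles:
            # Blocks only this task when first_stage is full.
            await first_stage.put((cmd, tile))


async def main() -> list:
    pending: asyncio.Queue = asyncio.Queue()               # unbounded FIFO
    first_stage: asyncio.Queue = asyncio.Queue(maxsize=2)  # hypothetical depth
    feeder = asyncio.create_task(feed_loop(pending, first_stage))

    # Both commands are accepted immediately, even though the
    # first-stage queue can hold only two tiles at a time.
    pending.put_nowait(("cmd1", [0, 1, 2]))
    pending.put_nowait(("cmd2", [0, 1]))

    order = []
    for _ in range(5):
        order.append(await first_stage.get())
    feeder.cancel()
    return order


feed_order = asyncio.run(main())
```

Because there is exactly one feeder, the consumer sees every `cmd1` tile before any `cmd2` tile, regardless of how often the bounded queue fills up.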

### D3. Data Transfer vs. Completion Signal — HW Modeling Criteria

| Communication Type | Method | HW Correspondence |
|-------------------|--------|-------------------|
| Tile token (work directive) | message via out_port | enqueue to command queue |
| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |

**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.

**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
is applied to intra-PE stage triggers. Chaining between components corresponds to
PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost
is incurred.

**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
When all tiles are complete, PipelineContext calls done_event.succeed().

### D4. Asynchronous Pipeline — Natural Overlap

The scheduler processes CompositeCmds **asynchronously**.
However, tile feed does not spawn an independent process per command; instead,
the scheduler's internal **single feeder process** performs the feed in FIFO order.
Therefore, the scheduler can continue to receive the next command,
but the first-stage tile injection order is guaranteed per command.

Since **SimPy Store capacity = HW queue depth**:
- When the queue is full, put() naturally blocks (backpressure)
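- While DMA is processing a later tile, FETCH and GEMM can already work on tiles whose DMA has completed
- When a second CompositeCmd arrives, it is immediately enqueued to `_pending_feeds`; its tiles enter the first-stage queue only after the current command's feed completes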

```
First-stage feed order (feeder → DMA queue):
  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
                                            ↑ cmd2 starts after cmd1 feed completes

Runtime pipeline (downstream overlap):
  PE_DMA:    [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
  PE_FETCH:          [cmd1:t0][cmd1:t1]...
  PE_GEMM:                   [cmd1:t0][cmd1:t1]...
                              ↑ pipeline overlap within the same command
```

Here, the overlap does not come from tile feed interleaving across different commands,
but occurs naturally as tiles from earlier commands progress to downstream stages
while the feeder continues injecting subsequent tiles.

For example, tile feed for cmd2 does not start until all tiles of cmd1 have been
injected into the first stage queue. However, while cmd1.tile0 has already progressed
to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so
**pipeline overlap within the same command occurs naturally**.
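
To make the benefit of this overlap concrete, the following sketch computes per-stage finish times for a simple in-order pipeline. The latencies are hypothetical, and the model assumes each stage handles one tile at a time with queues deep enough never to block; it is an illustration, not the simulator's timing model.

```python
def pipeline_finish_times(latencies, num_tiles):
    """finish[s][t]: time at which stage s finishes tile t.

    A stage starts a tile once the previous stage has delivered it and
    the stage itself has finished the preceding tile.
    """
    finish = [[0.0] * num_tiles for _ in latencies]
    for s, lat in enumerate(latencies):
        for t in range(num_tiles):
            upstream = finish[s - 1][t] if s > 0 else 0.0
            stage_free = finish[s][t - 1] if t > 0 else 0.0
            finish[s][t] = max(upstream, stage_free) + lat
    return finish


# Hypothetical per-tile latencies in ns for DMA_READ, FETCH, GEMM.
latencies = [100.0, 10.0, 40.0]
finish = pipeline_finish_times(latencies, num_tiles=3)

overlapped = finish[-1][-1]   # makespan with pipeline overlap
serial = sum(latencies) * 3   # makespan if tiles ran strictly one after another
```

With these numbers the slowest stage (DMA) dominates: the overlapped makespan is 350 ns against 450 ns for fully serial execution.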

#### Component Chaining Pattern

All components follow the same pattern:

```python
def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()

        # process own stage
        yield from self._process(env, token)

        # chain to next stage (read from plan)
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            # last stage — pipeline completion
            token.pipeline_ctx.complete_tile()
```

### D5. PE_FETCH_STORE — Dedicated TCM ↔ Register File Transfer

Previously, GemmBlock and MathBlock each implemented their own TCM read/write.
This logic is now factored out into a dedicated **PE_FETCH_STORE component**.

```python
# PE_FETCH_STORE._process()
def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done
    # chaining is handled by the base class (D4 pattern)
```

Advantages:
- GEMM/MATH perform **pure compute only** — no TCM access logic
- Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
- Prefetch strategies can be experimented with by replacing the fetch unit alone

### D6. Simplification of Each Compute Component

GEMM/MATH compute only on register data that has already been loaded.
**Chaining follows the common pattern (D4), so only _process() needs to be implemented:**

```python
# PE_GEMM._process()
def _process(self, env, token):
    yield env.timeout(self._mac_latency(token.params))

# PE_MATH._process()
def _process(self, env, token):
    yield env.timeout(self._simd_latency(token.params))

# PE_FETCH_STORE._process()
def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done

# PE_DMA._process()
def _process(self, env, token):
    yield from self._do_fabric_dma(token.params)
```

By replacing only the timing model, one can freely switch between cycle-accurate
and analytical models. Since the chaining logic resides in the base class,
each component only implements its pure stage logic.
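
As a rough illustration of swapping the timing model, the sketch below injects the latency function into a small holder class. Everything here is hypothetical: `GemmTiming`, both latency functions, and the parameter names (`m`, `n`, `k`, `macs_per_cycle`, `fill_drain_cycles`) are assumptions and are not part of PE_GEMM.

```python
from typing import Callable

LatencyModel = Callable[[dict], float]


def analytical_mac_latency(params: dict) -> float:
    """Ideal latency in cycles: total MACs divided by array throughput."""
    macs = params["m"] * params["n"] * params["k"]
    return macs / params["macs_per_cycle"]


def padded_mac_latency(params: dict) -> float:
    """Ideal latency plus a fixed fill/drain overhead."""
    return analytical_mac_latency(params) + params.get("fill_drain_cycles", 0)


class GemmTiming:
    """Holds the latency model so it can be swapped without touching chaining."""

    def __init__(self, model: LatencyModel) -> None:
        self._mac_latency = model

    def latency(self, params: dict) -> float:
        return self._mac_latency(params)


params = {"m": 32, "n": 32, "k": 64, "macs_per_cycle": 1024, "fill_drain_cycles": 16}
ideal = GemmTiming(analytical_mac_latency).latency(params)
padded = GemmTiming(padded_mac_latency).latency(params)
```

The caller stays the same in both cases; only the function passed to the constructor changes.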

### D7. Topology Changes

Add PE_FETCH_STORE to the PE template:

```yaml
pe_template:
  components:
    pe_cpu:         { kind: pe_cpu,         impl: pe_cpu_v1, ... }
    pe_scheduler:   { kind: pe_scheduler,   impl: pe_scheduler_v2, ... }
    pe_dma:         { kind: pe_dma,         impl: pe_dma_v1, ... }
    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
    pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1, ... }
    pe_math:        { kind: pe_math,        impl: pe_math_v1, ... }
    pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1, ... }
    pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1, ... }
  links:
    # existing links...
    fetch_store_to_tcm_bw_gbs: 512.0
    fetch_store_to_tcm_mm: 0.0
```

PE internal edge connections:
```
PE_SCHEDULER → PE_DMA (initial dispatch)
PE_SCHEDULER → PE_FETCH_STORE (initial dispatch)
PE_SCHEDULER → PE_GEMM (initial dispatch)
PE_SCHEDULER → PE_MATH (initial dispatch)
PE_DMA → PE_FETCH_STORE (chaining)
PE_FETCH_STORE → PE_GEMM (chaining)
PE_FETCH_STORE → PE_MATH (chaining)
PE_GEMM → PE_FETCH_STORE (store chaining)
PE_MATH → PE_FETCH_STORE (store chaining)
PE_FETCH_STORE → PE_DMA (writeback chaining)
PE_FETCH_STORE → PE_TCM (BW request)
```

Topology edges cover both **control/dispatch visibility** and **runtime chaining**.
Scheduler → sub-component edges are initial dispatch paths, while
inter-component edges are runtime chaining paths driven by token self-routing.
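
Since every hop a token takes must correspond to one of these edges, a plan can be checked against the edge list before dispatch. The sketch below is an assumption-laden illustration: `missing_edges` is a hypothetical helper, and short component names are used in place of full topology node IDs.

```python
# Directed intra-PE edges from the list above. The PE_FETCH_STORE -> PE_TCM
# edge is omitted because it carries BW requests, not token chaining.
EDGES = {
    ("pe_scheduler", "pe_dma"),
    ("pe_scheduler", "pe_fetch_store"),
    ("pe_scheduler", "pe_gemm"),
    ("pe_scheduler", "pe_math"),
    ("pe_dma", "pe_fetch_store"),
    ("pe_fetch_store", "pe_gemm"),
    ("pe_fetch_store", "pe_math"),
    ("pe_gemm", "pe_fetch_store"),
    ("pe_math", "pe_fetch_store"),
    ("pe_fetch_store", "pe_dma"),
}

# Stage name -> component, per the D1 table.
COMPONENT_OF = {
    "DMA_READ": "pe_dma",
    "FETCH": "pe_fetch_store",
    "GEMM": "pe_gemm",
    "MATH": "pe_math",
    "STORE": "pe_fetch_store",
    "DMA_WRITE": "pe_dma",
}


def missing_edges(stage_names):
    """Return the hops of a plan that have no matching topology edge."""
    path = ["pe_scheduler"] + [COMPONENT_OF[s] for s in stage_names]
    return [hop for hop in zip(path, path[1:]) if hop not in EDGES]


# The four stage sequences listed in D2.
plans = {
    "gemm": ("DMA_READ", "FETCH", "GEMM", "STORE", "DMA_WRITE"),
    "gemm_from_tcm": ("FETCH", "GEMM", "STORE", "DMA_WRITE"),
    "math": ("DMA_READ", "FETCH", "MATH", "STORE", "DMA_WRITE"),
    "gemm_accumulate": ("DMA_READ", "FETCH", "GEMM", "STORE"),
}
unroutable = {name: missing_edges(p) for name, p in plans.items() if missing_edges(p)}
bad = missing_edges(("DMA_READ", "GEMM"))  # skips FETCH, so PE_DMA -> PE_GEMM is needed
```

All four D2 sequences route cleanly over the listed edges, while a plan that skips FETCH is reported as needing a PE_DMA → PE_GEMM edge that does not exist.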

### D9. TileToken Message Definition

TileToken is the message that passes tile work between components.
The token carries the plan and stage index, enabling self-routing.

```python
@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext    # completion tracking
    plan: TilePlan                   # full stage sequence for this tile (immutable)
    stage_idx: int                   # current stage index in plan.stages
    params: dict                     # current stage parameter cache (canonical: plan.stages[stage_idx].params)
    data_op: bool = True             # op_log recording target (ADR-0020)
```

A TileToken is **owned by exactly one component at a time** and
is never referenced by multiple components simultaneously (single-owner).

Token lifecycle:
1. Scheduler creates it with stage_idx=0 and puts it to the first stage component
2. The component executes _process(), increments stage_idx, and puts it to the next component
3. The last stage component calls pipeline_ctx.complete_tile()
4. When all tiles are complete, PipelineContext calls done_event.succeed()

Relationship with existing PeInternalTxn:
- PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged)
- TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)

---

## Non-goals

- **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is not modified
  (PeInternalTxn-based, ADR-0014 maintained)
- **Resource contention model across multiple pipelines**: the current scope focuses on
  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines
  are future work.

## Open Questions

- **Register File capacity model**: whether to model capacity limits when the fetch unit
  loads into registers. Capacity is expressed in bytes (register_file_bytes), and
  the number of tiles that can be held simultaneously is determined by tile size.
  When capacity is exceeded, fetch stalls, creating natural backpressure.
- **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite
  commands. Therefore, overlap arises not from pre-injection across commands, but
  naturally from pipeline progression of tiles within the same command.
  If additional prefetch is needed, it should be considered at the level of tile ordering
  within the same command or fetch/store unit policy, not cross-command injection.
- **PE_DMA coalescing**: per-tile DMA may cause fragmentation.
  The intended direction is to merge/coalesce inside PE_DMA without scheduler involvement.
- **Synchronous execution mode**: this ADR adopts asynchronous pipeline as the
  default/sole execution model. If a sync mode is needed for debug or validation
  purposes, it will be considered in a future ADR.
- **TCM bank conflict across multiple pipelines**: currently based on a single pipeline.
  Bank conflict modeling when multiple pipelines simultaneously access TCM is future work.
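
For the Register File capacity question above, the arithmetic behind "the number of tiles that can be held simultaneously is determined by tile size" can be sketched as follows. The helper names and the 64 KiB / 16 KiB figures are hypothetical; whether such a limit is modeled at all remains open.

```python
def tiles_resident(register_file_bytes: int, tile_bytes: int) -> int:
    """How many whole tiles fit in the register file at once."""
    if tile_bytes <= 0:
        raise ValueError("tile_bytes must be positive")
    return register_file_bytes // tile_bytes


def fetch_must_stall(in_flight: int, register_file_bytes: int, tile_bytes: int) -> bool:
    """True if loading one more tile would exceed register file capacity."""
    return in_flight + 1 > tiles_resident(register_file_bytes, tile_bytes)


# Hypothetical sizes: 64 KiB register file, 16 KiB per tile.
capacity = tiles_resident(64 * 1024, 16 * 1024)
stall_at_three = fetch_must_stall(3, 64 * 1024, 16 * 1024)
stall_at_four = fetch_must_stall(4, 64 * 1024, 16 * 1024)
```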

---

## Consequences

### Positive

- Each block is an independent component — individually replaceable (ADR-0015 compliant)
- PE internal structure is visible in the topology
- Components do not know the next component — plan-based routing provides flexibility
- Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
- Improved HW modeling accuracy (done signal = Event, data transfer = message)
- Fetch/store separation enables accurate TCM BW contention modeling

### Negative

- Increased number of PE internal components (5 → 6) — more topology nodes/edges
- Component separation makes intra-PE token forwarding more explicit than before