kernbench2/docs/adr/ADR-0014-dev-pe-pipeline-execution-model.md

# ADR-0014: PE Pipeline Execution Model

## Status

Accepted

## Context

This ADR defines the PE-internal kernel execution model:

- Role decomposition of PE-internal components
- Command dispatch paths (simple / composite / multi-op composite with epilogue)
- TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
- TCM-centric dataflow with a register-file intermediary
- Engine resource model
- Observability and trace contract
- Topology representation

PE-internal structure (7 components in scope; 2 cross-referenced):

- `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`,
  `pe_tcm` — defined here
- `pe_mmu` — VA model, defined in ADR-0011 D-VA
- `pe_ipcq` — collective communication, defined in ADR-0023

The goal is a deterministic, trace-friendly execution contract that keeps
each block independently swappable.

## Decision

### D1. PE-internal component roles

**PE_CPU**

- Executes kernel instruction stream / control logic.
- Generates PE commands and submits them to `PE_SCHEDULER` (via
  `PeInternalTxn`).
- Does NOT enqueue work directly into engine queues.

**PE_SCHEDULER**

- Sole dispatcher inside a PE.
- Receives commands from `PE_CPU` and dispatches by command type:
  - Simple command (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`)
    → forward directly to the target engine.
  - `CompositeCmd` → generate a `TilePlan`, feed tiles into the pipeline
    via a single `_feed_loop` (D6).
- Does not participate in stage-to-stage chaining within a composite;
  that is handled by token self-routing (D6).
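
The dispatch rule above can be sketched as a small routing function. This is an illustration only: the command classes are empty stand-ins (their real fields are not specified here), and `route`, `SIMPLE_TARGETS`, and the `"feed_loop"` return value are hypothetical names.

```python
from dataclasses import dataclass


# Empty stand-ins for the command types named above (real fields omitted).
@dataclass(frozen=True)
class DmaReadCmd:
    pass


@dataclass(frozen=True)
class DmaWriteCmd:
    pass


@dataclass(frozen=True)
class GemmCmd:
    pass


@dataclass(frozen=True)
class MathCmd:
    pass


@dataclass(frozen=True)
class CompositeCmd:
    pass


# Simple commands map to exactly one target engine.
SIMPLE_TARGETS = {
    DmaReadCmd: "pe_dma",
    DmaWriteCmd: "pe_dma",
    GemmCmd: "pe_gemm",
    MathCmd: "pe_math",
}


def route(cmd) -> str:
    """Return where the scheduler sends `cmd` (hypothetical helper)."""
    if isinstance(cmd, CompositeCmd):
        return "feed_loop"  # TilePlan generation, then the single _feed_loop
    target = SIMPLE_TARGETS.get(type(cmd))
    if target is None:
        raise TypeError(f"unknown command type: {type(cmd).__name__}")
    return target
```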

**PE_DMA**

- Handles memory transfers between TCM and external memory domains
  (HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
- Two execution channels:
  - `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1) — see D4.
- Additional virtual channels:
  - `vc_compute` — load/store/writeback traffic for GEMM/MATH tiles.
  - `vc_comm` — IPCQ collective send data (defined in ADR-0023 D8).

**PE_FETCH_STORE**

- TCM ↔ Register File transfer unit.
- Isolates register-file access semantics from compute engines so that
  GEMM/MATH stay pure compute components.
- BW-based latency model; TCM access contention naturally serializes
  through `PE_TCM`'s BW resource.

**PE_GEMM**

- MAC array. Reads operands from the register file; writes results to
  the register file. Does not touch `PE_TCM` directly.

**PE_MATH**

- Element-wise / reduction / SIMD unit. Reads / writes the register file.

**PE_TCM**

- Tightly-coupled scratchpad with BW-serialized access. Two logical
  regions partitioned by ownership (see D5.1).

**Cross-referenced components** (defined elsewhere):

- `pe_mmu` — VA→PA translation per access (ADR-0011 D-VA).
- `pe_ipcq` — collective ring buffers and peer endpoint metadata
  (ADR-0023).

### D2. Command lifecycle and queues

`PE_SCHEDULER` maintains three logical structures:

**SubmissionQueue** — written by `PE_CPU`; consumed by the scheduler.

**InflightTable** — owned and mutated only by `PE_SCHEDULER`; tracks
expanded sub-commands, dependency state, engine assignment, and
completion status.

**CompletionQueue** — written by `PE_SCHEDULER`; holds final completion
records.

**Single-writer rule**: only `PE_SCHEDULER` mutates command completion
state. Engines report completion via explicit events / messages
consumed by the scheduler.

**Command completion**: when all sub-commands complete, `PE_SCHEDULER`
publishes a completion record.
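
A minimal sketch of the single-writer rule and the completion condition, assuming sub-commands are identified by string ids. `InflightEntry`, `SchedulerState`, and their methods are hypothetical names, not the simulator's actual classes.

```python
from dataclasses import dataclass, field


@dataclass
class InflightEntry:
    """Hypothetical per-command record kept in the InflightTable."""
    cmd_id: int
    pending: set = field(default_factory=set)  # sub-command ids not yet done


class SchedulerState:
    """Only this class mutates completion state (single-writer rule)."""

    def __init__(self) -> None:
        self.inflight: dict = {}          # cmd_id -> InflightEntry
        self.completion_queue: list = []  # cmd_ids in completion order

    def dispatch(self, cmd_id: int, sub_ids: list) -> None:
        self.inflight[cmd_id] = InflightEntry(cmd_id, set(sub_ids))

    def on_engine_complete(self, cmd_id: int, sub_id: str) -> None:
        """Engines report via an event; the scheduler applies the change."""
        entry = self.inflight[cmd_id]
        entry.pending.remove(sub_id)  # KeyError exposes a duplicate report
        if not entry.pending:         # all sub-commands complete
            del self.inflight[cmd_id]
            self.completion_queue.append(cmd_id)
```

Engines never touch `inflight` or `completion_queue` directly in this sketch; they only call `on_engine_complete`, which stands in for the event or message the scheduler consumes.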

### D3. Dispatch modes

#### D3.1 Simple command

A simple command expands to exactly one engine sub-command:

- `DmaReadCmd` / `DmaWriteCmd` → `PE_DMA`
- `GemmCmd` → `PE_GEMM`
- `MathCmd` → `PE_MATH`

Flow:

```text
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
       → completion → PE_SCHEDULER → CompletionQueue
```

#### D3.2 Composite command (single-op tiled pipeline)

The default `CompositeCmd` runs a single compute op as a tile-pipelined
sequence:

```text
DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE
```

`PE_SCHEDULER` splits the DMA payload into hardware tiles and emits one
`TileToken` per tile with a monotonically increasing `tile_id`.

Tile dependency (within one tile `t`):

```text
DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)
```

Inter-tile overlap is allowed wherever engine resources permit
(D4 governs the constraints):

```text
DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t-1) ∥ COMPUTE(t)
```
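
As an illustration of the tile split, the sketch below cuts a payload into fixed-size tiles and numbers them in increasing order. It assumes a flat byte-range split with a hypothetical `tile_bytes` parameter; the actual tile geometry (for example an M/K/N partition for GEMM) is not modeled here.

```python
def split_into_tiles(payload_bytes: int, tile_bytes: int) -> list:
    """Return (tile_id, offset, length) triples that cover the payload."""
    if payload_bytes < 0 or tile_bytes <= 0:
        raise ValueError("payload_bytes must be >= 0 and tile_bytes > 0")
    tiles = []
    offset = 0
    tile_id = 0
    while offset < payload_bytes:
        length = min(tile_bytes, payload_bytes - offset)  # last tile may be short
        tiles.append((tile_id, offset, length))
        offset += length
        tile_id += 1  # monotonically increasing tile_id
    return tiles
```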

#### D3.3 Multi-op composite (head + epilogue with scope)

A `CompositeCmd` MAY carry `ops: tuple[OpSpec, ...]` to express a
multi-op pipeline:

```python
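from dataclasses import dataclass
from typing import Literal

# Assumed alias; the scope values are the three strings listed below.
Scope = Literal["per_k_tile", "per_output_tile", "once"]


@dataclass(frozen=True)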
class OpSpec:
    kind: str         # "gemm" | "math.exp" | "math.bias_add" | ...
    scope: Scope      # "per_k_tile" | "per_output_tile" | "once"
    ...
```

- `ops[0]` (head) defines tile geometry (e.g., the head GEMM determines
  M/K/N partition).
- `ops[1:]` (epilogue) are subsequent stages whose `scope` decides how
  often they fire:
  - `per_k_tile` — every K-reduction step.
  - `per_output_tile` — once per output tile.
  - `once` — once per kernel.

Cross-engine chains (e.g., GEMM head → MATH epilogue) need no special
handling: each stage is reached via token self-routing (D6), and because
GEMM and MATH share the compute slot (D4), their stages run serially
within the same composite.

The empty-`ops` form is the legacy single-op path.
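
To make the three scopes concrete, the sketch below counts how often an epilogue op would fire for a tiled GEMM head. It assumes that output tiles form an M×N grid and that each output tile takes one K-reduction step per K tile; the function and parameter names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive b."""
    return -(-a // b)


def epilogue_fire_counts(m: int, k: int, n: int,
                         tile_m: int, tile_k: int, tile_n: int) -> dict:
    """Firing count per scope for one composite (assumed tiling model)."""
    m_tiles = ceil_div(m, tile_m)
    k_tiles = ceil_div(k, tile_k)
    n_tiles = ceil_div(n, tile_n)
    output_tiles = m_tiles * n_tiles
    return {
        "per_k_tile": output_tiles * k_tiles,  # every K-reduction step
        "per_output_tile": output_tiles,       # once per output tile
        "once": 1,
    }
```

For example, `epilogue_fire_counts(256, 512, 128, 64, 128, 64)` gives 4 × 2 output tiles with 4 K steps each, so a `per_k_tile` op fires 32 times and a `per_output_tile` op fires 8 times.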

### D4. Engine resource model

**DMA engine**:

- `DMA_READ`: `simpy.Resource(capacity=1)`.
- `DMA_WRITE`: `simpy.Resource(capacity=1)`.
- Both channels run concurrently (READ ∥ WRITE allowed).
- Within a channel, requests serialize (READ ∥ READ disallowed; same
  for WRITE).
- `vc_comm` is an orthogonal channel for IPCQ traffic defined in
  ADR-0023 D8 — out of scope for this ADR.

**Compute engine**:

- `accel_slot`: `simpy.Resource(capacity=1)` shared by `PE_GEMM` and
  `PE_MATH`.
- At most one compute op runs at a time within a PE.
- Multi-op composite chains (D3.3) execute their compute stages serially
  through this slot; token self-routing (D6) ensures the next stage
  starts only after the previous compute releases the slot.

**Engine completion**: each engine emits a completion event consumed by
the scheduler / `PipelineContext` (D6).
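
The effect of these capacities on inter-tile overlap can be illustrated with a small earliest-start calculation. This is a simplified model rather than the simulator: it assumes fixed per-stage latencies, in-order tiles, and unbounded queues between stages, and it leaves out FETCH and STORE. `TileTimes` and `schedule` are hypothetical names.

```python
from typing import NamedTuple


class TileTimes(NamedTuple):
    tile_id: int
    read: tuple     # (start, end) on DMA_READ
    compute: tuple  # (start, end) on accel_slot
    write: tuple    # (start, end) on DMA_WRITE


def schedule(n_tiles: int, t_read: int, t_compute: int, t_write: int) -> list:
    """Earliest-start schedule with one capacity-1 resource per stage."""
    read_free = compute_free = write_free = 0  # when each resource is next idle
    out = []
    for tile_id in range(n_tiles):
        r0 = read_free              # reads serialize on the READ channel
        r1 = r0 + t_read
        c0 = max(r1, compute_free)  # needs its data and the shared accel_slot
        c1 = c0 + t_compute
        w0 = max(c1, write_free)    # needs its result and the WRITE channel
        w1 = w0 + t_write
        read_free, compute_free, write_free = r1, c1, w1
        out.append(TileTimes(tile_id, (r0, r1), (c0, c1), (w0, w1)))
    return out
```

With `schedule(3, 2, 3, 1)`, tile 1's read occupies [2, 4] while tile 0 computes over [2, 5], and the last write ends at 12 instead of the 18 time units a fully serial run would take.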

### D5. Dataflow

**Input path (HBM source)**:

```text
HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File
Register File → PE_GEMM | PE_MATH
```

**Input path (shared SRAM source)**:

```text
Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File
```

**Output path (HBM destination)**:

```text
Register File → PE_FETCH_STORE → PE_TCM
PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM
```

GEMM/MATH never touch `PE_TCM` directly — `PE_FETCH_STORE` is the
single TCM↔register-file gateway. This makes TCM BW contention
explicit and lets fetch unit policies (e.g., prefetch) be replaced
independently of compute engines.

#### D5.1 PE_TCM partitioning

`PE_TCM` is split into two logical regions:

**SchedulerReservedTCM**

- Owned exclusively by `PE_SCHEDULER`.
- Holds composite-command tile buffers.
- `PE_SCHEDULER` partitions this region, assigns buffers per DMA_READ /
  COMPUTE / DMA_WRITE stage, guarantees input/output separation, and
  manages tile-buffer lifetimes.

**AllocatableTCM**

- General-purpose region managed by `PEMemAllocator`.
- Used for host / DP-visible allocations.

**Visibility rule (hard isolation)**: `PEMemAllocator` MUST NOT see or
allocate inside `SchedulerReservedTCM`. The reserved region is excluded
from allocator-managed ranges by construction.

**Tile buffer rules**:

- Input and output buffers within `SchedulerReservedTCM` MUST NOT
  overlap during a tile's active lifetime.
- A tile buffer remains valid until the corresponding `DMA_WRITE`
  completes.
- Buffer reuse is permitted only after the consuming tile's lifetime
  ends.
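
A minimal sketch of the isolation and lifetime rules, assuming the reserved region sits at the start of the TCM and is divided into fixed-size slots. `allocatable_range`, `ReservedTilePool`, and the slot layout are hypothetical.

```python
from collections import deque
from typing import Optional


def allocatable_range(tcm_bytes: int, reserved_bytes: int) -> tuple:
    """Range handed to PEMemAllocator; the reserved prefix is never included."""
    if not 0 <= reserved_bytes <= tcm_bytes:
        raise ValueError("reserved region must fit inside the TCM")
    return (reserved_bytes, tcm_bytes)


class ReservedTilePool:
    """Fixed-size slots inside SchedulerReservedTCM (assumed layout)."""

    def __init__(self, slot_bytes: int, n_slots: int) -> None:
        self._free = deque(i * slot_bytes for i in range(n_slots))
        self._held: dict = {}  # tile_id -> (in_addr, out_addr)

    def acquire(self, tile_id: int) -> Optional[tuple]:
        """Give a tile distinct input/output slots, or None if too few are free."""
        if tile_id in self._held:
            raise ValueError(f"tile {tile_id} already holds buffers")
        if len(self._free) < 2:
            return None
        pair = (self._free.popleft(), self._free.popleft())
        self._held[tile_id] = pair
        return pair

    def on_dma_write_complete(self, tile_id: int) -> None:
        """Slots become reusable only once the tile's DMA_WRITE has finished."""
        self._free.extend(self._held.pop(tile_id))
```

Because input and output addresses come from separate slots, they cannot overlap while the tile is live, and a slot returns to the free list only through `on_dma_write_complete`.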

### D6. TileToken self-routing pipeline

A composite's stage-to-stage progression happens **without** routing
through the scheduler. Each component forwards the token directly to
the next stage's component using the token's `plan`:

```text
Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
              ↑ chaining: no scheduler hop                          ↑
                                                  PipelineContext.complete_tile()
```

This mirrors real-HW done-wire chains. The scheduler handles only
**initial dispatch + completion aggregation**.

#### TilePlan / Stage

```python
from dataclasses import dataclass
from enum import Enum


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5

@dataclass(frozen=True)
class Stage:
    stage_type: StageType
    component: str         # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
    params: dict           # stage-specific parameters

@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]
```

#### TileToken

```python
from dataclasses import dataclass


@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext
    plan: TilePlan
    stage_idx: int
    params: dict             # cached current stage params
    data_op: bool = True     # op_log opt-in (ADR-0020 D4)
```

Single-owner invariant: a token is owned by exactly one component at a
time. Lifecycle: scheduler creates with `stage_idx=0` → component
`_process()` → increment `stage_idx` → put to next stage's `in_port` →
last stage calls `pipeline_ctx.complete_tile()`.

#### PipelineContext (exactly-once completion)

```python
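from dataclasses import dataclass
from typing import Optional

import simpy


@dataclass
class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: Optional[simpy.Event] = None  # must be set before the first complete_tile()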

    def complete_tile(self) -> None:
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()
```

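Each tile's last stage MUST call `complete_tile()` exactly once. A
duplicate call is a bug: it makes `completed_tiles` reach `total_tiles`
early, so `done_event` fires before every tile has finished. (A SimPy
`Event` can also be triggered at most once.)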
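
One way to turn a duplicate call into an immediate error is to track completed tile ids instead of a bare counter. The sketch below is a variation, not the `PipelineContext` defined above: `complete_tile` takes a `tile_id` argument, and `OneShotEvent` is a stdlib stand-in for `simpy.Event` so the example runs without SimPy.

```python
from dataclasses import dataclass, field


class OneShotEvent:
    """Stand-in for simpy.Event: may be triggered at most once."""

    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        if self.triggered:
            raise RuntimeError("event has already been triggered")
        self.triggered = True


@dataclass
class GuardedPipelineContext:
    """Hypothetical variant that rejects a second completion of the same tile."""
    id: str
    total_tiles: int
    done_event: OneShotEvent = field(default_factory=OneShotEvent)
    _done_ids: set = field(default_factory=set)

    def complete_tile(self, tile_id: int) -> None:
        if tile_id in self._done_ids:
            raise RuntimeError(f"tile {tile_id} completed twice")
        self._done_ids.add(tile_id)
        if len(self._done_ids) == self.total_tiles:
            self.done_event.succeed()
```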

#### Feed ordering

`PE_SCHEDULER` has exactly one `_feed_loop` process consuming a
`_pending_feeds` FIFO. Composite commands are enqueued in submission
order; tile feed for a command runs to completion before the next
command's feed begins. **Tile-feed interleaving between commands is
disallowed.**

Within a single command's tiles, downstream pipeline overlap arises
naturally — earlier tiles progress through later stages while the feeder
keeps pushing remaining tiles into the first stage queue (SimPy Store
backpressure governs flow control). If the first-stage queue is full,
only the feeder blocks; the scheduler worker's inbox processing
continues.
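
The ordering guarantee can be shown with a plain FIFO. The sketch assumes each pending feed is a `(command_id, n_tiles)` pair and ignores timing and backpressure; `feed_order` is a hypothetical helper.

```python
from collections import deque


def feed_order(pending: list) -> list:
    """Emit (command_id, tile_id) pairs in the order a single feeder would."""
    fifo = deque(pending)  # submission order
    out = []
    while fifo:
        cmd_id, n_tiles = fifo.popleft()
        # All tiles of this command are fed before the next command starts.
        for tile_id in range(n_tiles):
            out.append((cmd_id, tile_id))
    return out
```

For example, `feed_order([("A", 2), ("B", 3)])` yields both tiles of `A` before any tile of `B`.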

#### Token routing pattern (base class)

```python
def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()
        yield from self._process(env, token)       # stage-specific logic
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            token.pipeline_ctx.complete_tile()
```

Each component implements only `_process()`; chaining lives in the
base class.
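
To illustrate that routing is driven entirely by the plan, the sketch below builds the stage list for a composite and lists the component-to-component hops a token makes after initial dispatch. It is a synchronous stand-in with no SimPy processes; `build_stages`, `hops`, and `ENGINE_OF` are hypothetical, and the node ids follow the `sip0.cube0.pe0.pe_dma` pattern.

```python
# Compute stage type -> component suffix (assumed mapping).
ENGINE_OF = {"GEMM": "pe_gemm", "MATH": "pe_math"}


def build_stages(pe: str, compute_ops: list) -> list:
    """Return (stage_type, component) pairs for one tile's plan."""
    stages = [("DMA_READ", f"{pe}.pe_dma"), ("FETCH", f"{pe}.pe_fetch_store")]
    stages += [(op, f"{pe}.{ENGINE_OF[op]}") for op in compute_ops]
    stages += [("STORE", f"{pe}.pe_fetch_store"), ("DMA_WRITE", f"{pe}.pe_dma")]
    return stages


def hops(stages: list) -> list:
    """(src, dst) component pairs the token traverses between stages."""
    return [(a[1], b[1]) for a, b in zip(stages, stages[1:])]
```

Adding a MATH epilogue changes only the argument to `build_stages`, for example `["GEMM", "MATH"]` instead of `["GEMM"]`; none of the resulting hops involve the scheduler.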

### D7. Observability and trace contract

The simulator emits deterministic trace events:

- `command_submitted`
- `sub_command_dispatched`
- `engine_start`
- `engine_complete`
- `tile_ready`
- `command_complete`

For identical inputs, trace ordering MUST be deterministic.
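
One possible way to get a deterministic order is to stamp each event with a sequence number and sort by `(time, sequence)`. This is a sketch under that assumption: the record fields (`time_ns`, `seq`, `source`) and the `TraceLog` class are hypothetical; only the six event names come from the list above.

```python
import itertools
from dataclasses import dataclass

EVENT_NAMES = frozenset({
    "command_submitted", "sub_command_dispatched", "engine_start",
    "engine_complete", "tile_ready", "command_complete",
})


@dataclass(frozen=True)
class TraceEvent:
    time_ns: int
    seq: int      # emission order, used to break ties at equal timestamps
    name: str
    source: str


class TraceLog:
    def __init__(self) -> None:
        self._seq = itertools.count()
        self._events = []

    def emit(self, time_ns: int, name: str, source: str) -> None:
        if name not in EVENT_NAMES:
            raise ValueError(f"unknown trace event: {name}")
        self._events.append(TraceEvent(time_ns, next(self._seq), name, source))

    def ordered(self) -> list:
        """Events sorted by time, then by emission order."""
        return sorted(self._events, key=lambda e: (e.time_ns, e.seq))
```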

### D8. Topology representation

PE-internal components are declared in `cube.pe_template`:

```yaml
pe_template:
  components:
    pe_cpu:         { kind: pe_cpu,         impl: builtin.pe_cpu,         attrs: { overhead_ns: ... } }
    pe_scheduler:   { kind: pe_scheduler,   impl: builtin.pe_scheduler,   attrs: { overhead_ns: ... } }
    pe_dma:         { kind: pe_dma,         impl: builtin.pe_dma,         attrs: { rd_engines: 1, wr_engines: 1 } }
    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
    pe_gemm:        { kind: pe_gemm,        impl: builtin.pe_gemm,        attrs: { shared_resource: accel_slot, ... } }
    pe_math:        { kind: pe_math,        impl: builtin.pe_math,        attrs: { shared_resource: accel_slot, ... } }
    pe_tcm:         { kind: pe_tcm,         impl: builtin.pe_tcm,         attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
    pe_mmu:         { kind: pe_mmu,         impl: builtin.pe_mmu,         attrs: { ... } }   # ADR-0011 D-VA
    pe_ipcq:        { kind: pe_ipcq,        impl: builtin.pe_ipcq,        attrs: { ... } }   # ADR-0023
  links:
    # Scheduler dispatch edges (initial)
    scheduler_to_dma_mm:         0.0
    scheduler_to_fetch_store_mm: 0.0
    scheduler_to_gemm_mm:        0.0
    scheduler_to_math_mm:        0.0
    # Pipeline chaining edges (token self-routing per D6)
    dma_to_fetch_store_mm:       0.0
    fetch_store_to_gemm_mm:      0.0
    fetch_store_to_math_mm:      0.0
    gemm_to_fetch_store_mm:      0.0
    gemm_to_math_mm:             0.0
    math_to_fetch_store_mm:      0.0
    fetch_store_to_dma_mm:       0.0
    fetch_store_to_tcm_bw_gbs:   ...
```

The template is instantiated once per PE. PE instances are derived from
`cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔
cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).
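
A template like this can be checked against a stage sequence before simulation. The sketch below assumes link keys follow the `<src>_to_<dst>_mm` pattern seen above and that the `links` mapping has already been loaded into a dict; `chaining_edges` and `missing_edges` are hypothetical helpers.

```python
def chaining_edges(links: dict) -> set:
    """Extract (src, dst) pairs from keys shaped like '<src>_to_<dst>_mm'."""
    edges = set()
    for key in links:
        if key.endswith("_mm") and "_to_" in key:
            src, dst = key[: -len("_mm")].split("_to_", 1)
            edges.add((src, dst))
    return edges


def missing_edges(path: list, links: dict) -> list:
    """Consecutive (src, dst) pairs on `path` that have no declared link."""
    have = chaining_edges(links)
    return [(a, b) for a, b in zip(path, path[1:]) if (a, b) not in have]
```

For the GEMM + MATH chain, the path `["dma", "fetch_store", "gemm", "math", "fetch_store", "dma"]` is fully covered by the links listed above, while a path that goes from `math` to `gemm` would be reported as missing an edge.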

## Consequences

### Positive

- Each block is an independent topology node — individually swappable
  via DI (ADR-0015).
- PE-internal structure is visible in the topology graph.
- Components do not know their downstream neighbors: routing comes from
  the plan, so new chains (e.g., epilogue chains) require no scheduler
  change.
- DMA and compute overlap naturally via SimPy Store backpressure.
- Multi-op composite expresses fused operations (e.g., GEMM + bias_add)
  without engine-level coupling.
- TCM access contention is realistic — `PE_FETCH_STORE` is the single
  TCM↔RF gateway.

### Negative

- Intra-PE component count is higher than a coarser model (7 base + 2
  cross-referenced) — more topology nodes/edges.
- Intra-PE token forwarding is explicit in traces (acceptable trade for
  HW fidelity).

## Links

- ADR-0011 D-VA (PE_MMU component, VA translation)
- ADR-0015 D4 (component port/wire model)
- ADR-0017 D4 (cube-level external connectivity)
- ADR-0020 (greenlet kernel execution / two-pass)
- ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
- SPEC R3, R4