# ADR-0014: PE Pipeline Execution Model

## Status

Accepted

## Context

This ADR defines the PE-internal kernel execution model:

- Role decomposition of PE-internal components
- Command dispatch paths (simple / composite / multi-op composite with epilogue)
- TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
- TCM-centric dataflow with a register-file intermediary
- Engine resource model
- Observability and trace contract
- Topology representation

PE-internal structure (7 components in scope; 2 cross-referenced):

- `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`, `pe_tcm`: defined here
- `pe_mmu`: VA model, defined in ADR-0011 D-VA
- `pe_ipcq`: collective communication, defined in ADR-0023

The goal is a deterministic, trace-friendly execution contract that keeps each block independently swappable.

## Decision

### D1. PE-internal component roles

**PE_CPU**

- Executes kernel instruction stream / control logic.
- Generates PE commands and submits them to `PE_SCHEDULER` (via `PeInternalTxn`).
- Does NOT enqueue work directly into engine queues.

**PE_SCHEDULER**

- Sole dispatcher inside a PE.
- Receives commands from `PE_CPU`. Dispatch by command type:
  - Simple command (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`) → forward directly to the target engine.
  - `CompositeCmd` → generate a `TilePlan`, feed tiles into the pipeline via a single `_feed_loop` (D6).
- Does not participate in stage-to-stage chaining within a composite; that is handled by token self-routing (D6).

**PE_DMA**

- Handles memory transfers between TCM and external memory domains (HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
- Two execution channels:
  - `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1); see D4.
- Additional virtual channels:
  - `vc_compute`: load/store/writeback traffic for GEMM/MATH tiles.
  - `vc_comm`: IPCQ collective send data (defined in ADR-0023 D8).

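
The dispatch rule described for `PE_SCHEDULER` above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the simulator's implementation: the command classes are empty stand-ins for the real ones, and `nbytes`, `tile_bytes`, `SIMPLE_TARGETS`, and `dispatch` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the command types named in D1.
# The real classes carry addresses, shapes, and so on.
@dataclass(frozen=True)
class DmaReadCmd:
    pass


@dataclass(frozen=True)
class DmaWriteCmd:
    pass


@dataclass(frozen=True)
class GemmCmd:
    pass


@dataclass(frozen=True)
class MathCmd:
    pass


@dataclass(frozen=True)
class CompositeCmd:
    nbytes: int      # assumed field: total DMA payload size
    tile_bytes: int  # assumed field: hardware tile size


# Simple commands map to exactly one target engine.
SIMPLE_TARGETS = {
    DmaReadCmd: "pe_dma",
    DmaWriteCmd: "pe_dma",
    GemmCmd: "pe_gemm",
    MathCmd: "pe_math",
}


def dispatch(cmd):
    """Return ("simple", engine) or ("composite", number_of_tiles)."""
    target = SIMPLE_TARGETS.get(type(cmd))
    if target is not None:
        return ("simple", target)
    if isinstance(cmd, CompositeCmd):
        n_tiles = -(-cmd.nbytes // cmd.tile_bytes)  # ceiling division
        return ("composite", n_tiles)
    raise TypeError(f"unknown command type: {type(cmd).__name__}")
```

In this sketch the composite branch only counts tiles; in the model described here it would build a `TilePlan` per tile and hand the tokens to `_feed_loop` (D6).
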
**PE_FETCH_STORE**

- TCM ↔ Register File transfer unit.
- Isolates register-file access semantics from compute engines so that GEMM/MATH stay pure compute components.
- BW-based latency model; TCM access contention naturally serializes through `PE_TCM`'s BW resource.

**PE_GEMM**

- MAC array. Reads operands from the register file; writes results to the register file. Does not touch `PE_TCM` directly.

**PE_MATH**

- Element-wise / reduction / SIMD unit. Reads / writes the register file.

**PE_TCM**

- Tightly-coupled scratchpad with BW-serialized access. Two logical regions partitioned by ownership (see D5).

**Cross-referenced components** (defined elsewhere):

- `pe_mmu`: VA→PA translation per access (ADR-0011 D-VA).
- `pe_ipcq`: collective ring buffers and peer endpoint metadata (ADR-0023).

### D2. Command lifecycle and queues

`PE_SCHEDULER` maintains three logical structures:

**SubmissionQueue**: written by `PE_CPU`; consumed by the scheduler.

**InflightTable**: owned and mutated only by `PE_SCHEDULER`; tracks expanded sub-commands, dependency state, engine assignment, and completion status.

**CompletionQueue**: written by `PE_SCHEDULER`; holds final completion records.

**Single-writer rule**: only `PE_SCHEDULER` mutates command completion state. Engines report completion via explicit events / messages consumed by the scheduler.

**Command completion**: when all sub-commands complete, `PE_SCHEDULER` publishes a completion record.

### D3. Dispatch modes

#### D3.1 Simple command

A simple command expands to exactly one engine sub-command:

- `DmaReadCmd` / `DmaWriteCmd` → `PE_DMA`
- `GemmCmd` → `PE_GEMM`
- `MathCmd` → `PE_MATH`

Flow:

```text
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion → PE_SCHEDULER → CompletionQueue
```

#### D3.2 Composite command (single-op tiled pipeline)

The default `CompositeCmd` runs a single compute op as a tile-pipelined sequence:

```text
DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE
```

`PE_SCHEDULER` splits the DMA payload into hardware tiles and emits one `TileToken` per tile with a monotonically increasing `tile_id`.

Tile dependency (within one tile `t`):

```text
DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)
```

Inter-tile overlap is allowed wherever engine resources permit (D4 governs the constraints):

```text
DMA_READ(t+1)  ∥ COMPUTE(t)
DMA_WRITE(t-1) ∥ COMPUTE(t)
```

#### D3.3 Multi-op composite (head + epilogue with scope)

A `CompositeCmd` MAY carry `ops: tuple[OpSpec, ...]` to express a multi-op pipeline:

```python
@dataclass(frozen=True)
class OpSpec:
    kind: str     # "gemm" | "math.exp" | "math.bias_add" | ...
    scope: Scope  # "per_k_tile" | "per_output_tile" | "once"
    ...
```

- `ops[0]` (head) defines tile geometry (e.g., the head GEMM determines M/K/N partition).
- `ops[1:]` (epilogue) are subsequent stages whose `scope` decides how often they fire:
  - `per_k_tile`: every K-reduction step.
  - `per_output_tile`: once per output tile.
  - `once`: once per kernel.

Cross-engine chains (e.g., GEMM head → MATH epilogue) are natural: each stage is dispatched via token self-routing (D6), so GEMM and MATH participate serially within the same composite even though they share the compute slot (D4).

The empty-`ops` form is the legacy single-op path.

### D4. Engine resource model

**DMA engine**:

- `DMA_READ`: `simpy.Resource(capacity=1)`.
- `DMA_WRITE`: `simpy.Resource(capacity=1)`.
- Both channels run concurrently (READ ∥ WRITE allowed).
- Within a channel, requests serialize (READ ∥ READ disallowed; same for WRITE).
- `vc_comm` is an orthogonal channel for IPCQ traffic defined in ADR-0023 D8; it is out of scope for this ADR.

**Compute engine**:

- `accel_slot`: `simpy.Resource(capacity=1)` shared by `PE_GEMM` and `PE_MATH`.
- At most one compute op runs at a time within a PE.
- Multi-op composite chains (D3.3) execute their compute stages serially through this slot; token self-routing (D6) ensures the next stage starts only after the previous compute releases the slot.

**Engine completion**: each engine emits a completion event consumed by the scheduler / `PipelineContext` (D6).

### D5. Dataflow

**Input path (HBM source)**:

```text
HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File
Register File → PE_GEMM | PE_MATH
```

**Input path (shared SRAM source)**:

```text
Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File
```

**Output path (HBM destination)**:

```text
Register File → PE_FETCH_STORE → PE_TCM
PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM
```

GEMM/MATH never touch `PE_TCM` directly; `PE_FETCH_STORE` is the single TCM↔register-file gateway. This makes TCM BW contention explicit and lets fetch unit policies (e.g., prefetch) be replaced independently of compute engines.

#### D5.1 PE_TCM partitioning

`PE_TCM` is split into two logical regions:

**SchedulerReservedTCM**

- Owned exclusively by `PE_SCHEDULER`.
- Holds composite-command tile buffers.
- `PE_SCHEDULER` partitions this region, assigns buffers per DMA_READ / COMPUTE / DMA_WRITE stage, guarantees input/output separation, and manages tile-buffer lifetimes.

**AllocatableTCM**

- General-purpose region managed by `PEMemAllocator`.
- Used for host / DP-visible allocations.

**Visibility rule (hard isolation)**: `PEMemAllocator` MUST NOT see or allocate inside `SchedulerReservedTCM`.
The reserved region is excluded from allocator-managed ranges by construction.

**Tile buffer rules**:

- Input and output buffers within `SchedulerReservedTCM` MUST NOT overlap during a tile's active lifetime.
- A tile buffer remains valid until the corresponding `DMA_WRITE` completes.
- Buffer reuse is permitted only after the consuming tile's lifetime ends.

### D6. TileToken self-routing pipeline

A composite's stage-to-stage progression happens **without** routing through the scheduler. Each component forwards the token directly to the next stage's component using the token's `plan`:

```text
Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
            ↑ chaining: no scheduler hop                       ↑ PipelineContext.complete_tile()
```

This mirrors real-HW done-wire chains. The scheduler handles only **initial dispatch + completion aggregation**.

#### TilePlan / Stage

```python
from dataclasses import dataclass
from enum import Enum


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5


@dataclass(frozen=True)
class Stage:
    stage_type: StageType
    component: str  # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
    params: dict    # stage-specific parameters


@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]
```

#### TileToken

```python
@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext
    plan: TilePlan
    stage_idx: int
    params: dict          # cached current stage params
    data_op: bool = True  # op_log opt-in (ADR-0020 D4)
```

Single-owner invariant: a token is owned by exactly one component at a time.

Lifecycle: scheduler creates with `stage_idx=0` → component `_process()` → increment `stage_idx` → put to next stage's `in_port` → last stage calls `pipeline_ctx.complete_tile()`.

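
The token lifecycle above can be walked through in plain Python, without SimPy. The sketch below is an illustration under stated assumptions: `MiniStage`, `MiniCtx`, `MiniToken`, and `run` are hypothetical, simplified stand-ins for `Stage`, `PipelineContext`, `TileToken`, and the per-component workers. Plain deques replace the `in_port` stores, component ids are shortened, and there is no timing, resource contention, or backpressure.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniStage:
    component: str  # component that owns the token during this stage


@dataclass
class MiniCtx:
    total_tiles: int
    completed_tiles: int = 0

    def complete_tile(self) -> None:
        self.completed_tiles += 1


@dataclass
class MiniToken:
    tile_id: int
    ctx: MiniCtx
    stages: tuple
    stage_idx: int = 0  # the scheduler creates every token at stage 0


def run(tokens, components):
    """Walk tokens through their plans; return (tile_id, component) records."""
    in_ports = {name: deque() for name in components}
    trace = []
    for tok in tokens:  # scheduler: initial dispatch only
        in_ports[tok.stages[0].component].append(tok)
    progressed = True
    while progressed:
        progressed = False
        for name in components:
            if not in_ports[name]:
                continue
            tok = in_ports[name].popleft()     # this component is now the sole owner
            trace.append((tok.tile_id, name))  # stands in for _process()
            tok.stage_idx += 1
            if tok.stage_idx < len(tok.stages):
                # forward directly to the next stage's component, no scheduler hop
                in_ports[tok.stages[tok.stage_idx].component].append(tok)
            else:
                tok.ctx.complete_tile()        # last stage reports completion
            progressed = True
    return trace


PLAN = tuple(MiniStage(c) for c in
             ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_fetch_store", "pe_dma"))
ctx = MiniCtx(total_tiles=2)
tokens = [MiniToken(t, ctx, PLAN) for t in range(2)]
trace = run(tokens, ["pe_dma", "pe_fetch_store", "pe_gemm"])
```

Because a token sits in exactly one deque (or is held by exactly one component) at any moment, the single-owner invariant holds by construction in this sketch. Each tile's records in `trace` follow the plan order, while records from different tiles interleave.
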
#### PipelineContext (exactly-once completion)

```python
from dataclasses import dataclass
from typing import Optional

import simpy


@dataclass
class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: Optional[simpy.Event] = None  # must be set before the last tile completes

    def complete_tile(self) -> None:
        self.completed_tiles += 1
        # Fire done_event when the last tile reports completion.
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()
```

Each tile's last stage MUST call `complete_tile()` exactly once. Duplicate calls are bugs (SimPy `Event` can succeed at most once).

#### Feed ordering

`PE_SCHEDULER` has exactly one `_feed_loop` process consuming a `_pending_feeds` FIFO. Composite commands are enqueued in submission order; tile feed for a command runs to completion before the next command's feed begins. **Tile-feed interleaving between commands is disallowed.**

Within a single command's tiles, downstream pipeline overlap arises naturally: earlier tiles progress through later stages while the feeder keeps pushing remaining tiles into the first stage queue (SimPy Store backpressure governs flow control). If the first-stage queue is full, only the feeder blocks; the scheduler worker's inbox processing continues.

#### Token routing pattern (base class)

```python
def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()
        yield from self._process(env, token)  # stage-specific logic
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            # Hand the token to the next stage's component.
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            # Last stage: report tile completion.
            token.pipeline_ctx.complete_tile()
```

Each component implements only `_process()`; chaining lives in the base class.

### D7. Observability and trace contract

The simulator emits deterministic trace events:

- `command_submitted`
- `sub_command_dispatched`
- `engine_start`
- `engine_complete`
- `tile_ready`
- `command_complete`

For identical inputs, trace ordering MUST be deterministic.

### D8. Topology representation

PE-internal components are declared in `cube.pe_template`:

```yaml
pe_template:
  components:
    pe_cpu: { kind: pe_cpu, impl: builtin.pe_cpu, attrs: { overhead_ns: ... } }
    pe_scheduler: { kind: pe_scheduler, impl: builtin.pe_scheduler, attrs: { overhead_ns: ... } }
    pe_dma: { kind: pe_dma, impl: builtin.pe_dma, attrs: { rd_engines: 1, wr_engines: 1 } }
    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
    pe_gemm: { kind: pe_gemm, impl: builtin.pe_gemm, attrs: { shared_resource: accel_slot, ... } }
    pe_math: { kind: pe_math, impl: builtin.pe_math, attrs: { shared_resource: accel_slot, ... } }
    pe_tcm: { kind: pe_tcm, impl: builtin.pe_tcm, attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
    pe_mmu: { kind: pe_mmu, impl: builtin.pe_mmu, attrs: { ... } }  # ADR-0011 D-VA
    pe_ipcq: { kind: pe_ipcq, impl: builtin.pe_ipcq, attrs: { ... } }  # ADR-0023
  links:
    # Scheduler dispatch edges (initial)
    scheduler_to_dma_mm: 0.0
    scheduler_to_fetch_store_mm: 0.0
    scheduler_to_gemm_mm: 0.0
    scheduler_to_math_mm: 0.0
    # Pipeline chaining edges (token self-routing per D6)
    dma_to_fetch_store_mm: 0.0
    fetch_store_to_gemm_mm: 0.0
    fetch_store_to_math_mm: 0.0
    gemm_to_fetch_store_mm: 0.0
    gemm_to_math_mm: 0.0
    math_to_fetch_store_mm: 0.0
    fetch_store_to_dma_mm: 0.0
    fetch_store_to_tcm_bw_gbs: ...
```

The template is instantiated once per PE. PE instances are derived from `cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔ cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).

## Consequences

### Positive

- Each block is an independent topology node, individually swappable via DI (ADR-0015).
- PE-internal structure is visible in the topology graph.
- Components do not know their downstream; plan-based routing gives flexibility (e.g., epilogue chains require no scheduler change).
- DMA and compute overlap naturally via SimPy Store backpressure.
- Multi-op composite expresses fused operations (e.g., GEMM + bias_add) without engine-level coupling.
- TCM access contention is realistic: `PE_FETCH_STORE` is the single TCM↔RF gateway.

### Negative

- Intra-PE component count is higher than a coarser model (7 base + 2 cross-referenced), which means more topology nodes/edges.
- Intra-PE token forwarding is explicit in traces (acceptable trade for HW fidelity).

## Links

- ADR-0011 D-VA (PE_MMU component, VA translation)
- ADR-0015 D4 (component port/wire model)
- ADR-0020 (greenlet kernel execution / two-pass)
- ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
- SPEC R3, R4