# ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)

## Status

Proposed

## Context

ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:

- the dispatch model inside a PE,
- the responsibilities of PE_SCHEDULER,
- the PE_TCM-centric dataflow contract used by accelerator engines.

We need a deterministic and debuggable PE-internal execution contract that supports:

- simple single-engine commands
- composite commands that build a tiled pipeline across DMA and accelerator engines

The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.

## Decision

### D1. PE internal component roles

Each PE contains the following logical components.

**PE_CPU**

- Executes kernel instruction stream or kernel control logic.
- Generates PE commands.
- Submits commands to PE_SCHEDULER.
- PE_CPU does NOT enqueue work directly into engine queues.

**PE_SCHEDULER**

- The sole dispatcher inside a PE.
- Receives commands from PE_CPU.
- Expands composite commands into sub-commands.
- Tracks dependencies and command state.
- Dispatches work to engine queues.
- Manages tile scheduling for composite commands.

**PE_DMA**

- Handles memory transfers between PE_TCM and external memory domains.
- PE_DMA has **dual egress** at the CUBE level:
  - **→ XBAR**: dedicated path to HBM (local and cross-half via bridge)
  - **→ NOC**: path to non-HBM destinations (shared SRAM, inter-cube UCIe, etc.)
- Supported directions include:
  - HBM → PE_TCM (via XBAR)
  - PE_TCM → HBM (via XBAR)
  - PE_TCM → shared SRAM (via NOC)
  - PE_TCM → other memory domains (via NOC, if supported by topology)

**PE_GEMM**

- Matrix multiplication engine.
- Reads activations from PE_TCM.
- May stream weights directly from HBM.

**PE_MATH**

- Element-wise computation engine.
- Reads and writes PE_TCM.

**PE_TCM**

- Local SRAM used as the staging memory for accelerator operations.

---

### D2. Command lifecycle and queues

PE_SCHEDULER maintains three logical structures.

**SubmissionQueue**

- Written by PE_CPU.
- Contains incoming PE commands waiting to be processed.

**InflightTable**

- Owned and mutated only by PE_SCHEDULER.
- Tracks:
  - expanded sub-commands
  - dependency state
  - engine assignment
  - completion status

**CompletionQueue**

- Written by PE_SCHEDULER.
- Contains final completion records for commands.

**Single-writer rule**

- Only PE_SCHEDULER is allowed to mutate command completion state.
- Engine components must report completion via explicit completion events/messages.

**Command completion**

A command becomes DONE when:

- all sub-commands complete
- PE_SCHEDULER publishes a completion record to CompletionQueue.

---

### D3. Dispatch modes

PE commands are divided into two categories.

#### D3.1 Simple command

A simple command expands to exactly one engine sub-command. Examples include:

- DMA transfer
- GEMM compute
- MATH compute

Execution flow:

```
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
```

#### D3.2 Composite command (tiled pipeline)

Composite commands implement tiled pipelined execution across engines. Each tile executes the following pipeline:

```
Input DMA (READ) → Compute (GEMM or MATH) → Output DMA (WRITE)
```

**Tiling rule**

If the DMA payload exceeds hardware tile size, PE_SCHEDULER splits the transfer into tiles. Each tile is assigned a monotonically increasing `tile_id`.

**Tile dependency rules**

For tile `t`:

- Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)`
- Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)`
- All dependencies are enforced by PE_SCHEDULER.

**Overlap policy (Phase 0 default)**

Operations for different tiles may overlap when engine resources permit.

Allowed overlaps:

```
DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t−1) ∥ COMPUTE(t)
DMA_READ(t+1) ∥ DMA_WRITE(t)
```

Disallowed overlaps:

```
GEMM(t) ∥ GEMM(t′)
MATH(t) ∥ MATH(t′)
GEMM(t) ∥ MATH(t′)
```

---

### D4. Engine execution model (Phase 0 default)

Each engine behaves as a deterministic service resource.

**DMA engine**

PE_DMA contains two independent channels.

```
DMA_READ capacity = 1
DMA_WRITE capacity = 1
```

Rules:

- DMA_READ and DMA_WRITE may execute concurrently.
- Multiple READs cannot overlap.
- Multiple WRITEs cannot overlap.

Example allowed:

```
DMA_READ(t+1) ∥ DMA_WRITE(t)
```

Example not allowed:

```
DMA_READ(t) ∥ DMA_READ(t+1)
DMA_WRITE(t) ∥ DMA_WRITE(t+1)
```

**Compute engine**

Compute operations share a single compute resource.

```
PE_ACCEL capacity = 1
```

Both GEMM and MATH require this shared compute slot. Consequences:

- GEMM ∥ GEMM not allowed
- MATH ∥ MATH not allowed
- GEMM ∥ MATH not allowed

Only one compute operation can run in a PE at a time.

**Compute opcode restriction**

Composite commands contain one compute opcode only. Examples:

```
COMPOSITE_GEMM
COMPOSITE_MATH
```

Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0.

**Engine completion signaling**

Every engine emits a completion event when a sub-command finishes. Completion events are delivered to PE_SCHEDULER.

---

### D5. Dataflow model

Compute operations use a TCM-centric dataflow model.

**Input path (HBM)**

```
HBM → XBAR → PE_DMA (DMA_READ) → PE_TCM
```

**Input path (shared SRAM)**

```
Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
```

**Compute stage**

Compute engines read input tensors from PE_TCM.

```
PE_TCM → GEMM / MATH
```

Weights for GEMM may optionally stream directly from HBM (via XBAR).

**Output path (HBM)**

Compute results are written to PE_TCM, then DMA writes to HBM.

```
PE_TCM → PE_DMA (DMA_WRITE) → XBAR → HBM
```

**Output path (shared SRAM)**

```
PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
```

#### D5.1 PE_TCM partitioning and ownership boundary

The PE_TCM address space is partitioned into two logical regions.

**SchedulerReservedTCM**

- A staging region owned exclusively by PE_SCHEDULER.
- This region is used for composite command tile buffers.
- PE_SCHEDULER:
  - partitions this region into tile buffers
  - assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
  - guarantees input/output buffer separation
  - manages tile buffer lifetime

**AllocatableTCM**

- General-purpose region managed by PEMemAllocator.
- Used by host or DP-visible allocations.

**Visibility rule (hard isolation)**

- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
- SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
- This prevents DP or host allocations from interfering with scheduler staging buffers.

**Tile buffer rules**

Within SchedulerReservedTCM:

- input buffers and output buffers must not overlap
- PE_SCHEDULER assigns tile buffers for DMA and compute stages
- tile buffers remain valid until the corresponding DMA_WRITE completes
- buffer reuse is allowed only after the tile lifetime finishes

---

### D6. Observability and trace contract

The simulator must emit deterministic trace events. Required events include:

- `command_submitted`
- `sub_command_dispatched`
- `engine_start`
- `engine_complete`
- `tile_ready`
- `command_complete`

Trace ordering must be deterministic for identical inputs.

---

### D7. Topology representation

PE internal components are declared in `cube.pe_template`. The template is instantiated once per PE. PE instances are derived from `cube.pe_layout`.

External connectivity such as:

- PE_DMA → XBAR (HBM data path)
- PE_DMA → NOC (non-HBM data path: shared SRAM, inter-cube UCIe)
- NOC → PE_CPU (command path from M_CPU)

is modeled at the CUBE level (see ADR-0003 D3).

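
To illustrate how the tile dependency rules (D3.2), the capacity-1 engine slots (D4), and the tile buffer lifetime rule (D5.1) interact, here is a minimal Python sketch of a deterministic composite-command schedule. It is not the simulator's implementation. The function name `schedule_composite`, the `TraceEvent` record, the fixed per-stage latencies, the `num_buffers` parameter, and the tie-break order for events with equal timestamps are all assumptions made for illustration. Only the event names `engine_start` and `engine_complete` and the slot names come from this ADR.

```python
from dataclasses import dataclass

# Engine slots from D4; each has capacity 1, so one tile at a time per slot.
STAGES = ("DMA_READ", "PE_ACCEL", "DMA_WRITE")


@dataclass(frozen=True)
class TraceEvent:
    time: int      # hypothetical integer timestamp (e.g. cycles)
    event: str     # "engine_start" or "engine_complete" (names from D6)
    engine: str    # one of STAGES
    tile_id: int


def schedule_composite(num_tiles, read_lat, compute_lat, write_lat, num_buffers=2):
    """Schedule one composite command; return (trace, makespan).

    Assumes positive, fixed latencies per stage and `num_buffers` tile
    buffers that are recycled in tile order (both are illustrative).
    """
    if num_buffers < 1:
        raise ValueError("num_buffers must be at least 1")
    latency = dict(zip(STAGES, (read_lat, compute_lat, write_lat)))
    engine_free = {stage: 0 for stage in STAGES}  # time each slot becomes idle
    write_done = []                               # DMA_WRITE completion per tile
    events = []
    for tile_id in range(num_tiles):
        # D5.1: a buffer is reusable only after the DMA_WRITE of the tile
        # that previously occupied it has completed.
        ready = write_done[tile_id - num_buffers] if tile_id >= num_buffers else 0
        for stage in STAGES:
            # D3.2 dependency (previous stage of this tile) and D4 capacity.
            start = max(ready, engine_free[stage])
            end = start + latency[stage]
            engine_free[stage] = end
            events.append(TraceEvent(start, "engine_start", stage, tile_id))
            events.append(TraceEvent(end, "engine_complete", stage, tile_id))
            ready = end
        write_done.append(ready)
    # Assumed tie-break: time, completions before starts, tile_id, stage order.
    events.sort(key=lambda e: (e.time, e.event != "engine_complete",
                               e.tile_id, STAGES.index(e.engine)))
    makespan = write_done[-1] if write_done else 0
    return events, makespan


if __name__ == "__main__":
    trace, makespan = schedule_composite(3, read_lat=2, compute_lat=3, write_lat=2)
    for ev in trace:
        print(ev.time, ev.event, ev.engine, ev.tile_id)
    print("makespan:", makespan)
```

Because every start time is a pure function of the inputs and the sort key is total, repeated runs with identical inputs yield identical traces, which is the property D6 requires. Setting `num_buffers=1` removes all cross-tile overlap and reduces the schedule to strictly serial tile execution.
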

---

## Links

- SPEC R3, R4
- ADR-0003 D4 (PE-level system hierarchy)
- ADR-0005 View C (PE-level diagram)
- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)