commit - release 1

2026-03-18 11:47:48 -07:00
commit 6f43807900
109 changed files with 14909 additions and 0 deletions
@@ -0,0 +1,364 @@
+# ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)
+
+## Status
+
+Proposed
+
+## Context
+
+ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:
+
+- the dispatch model inside a PE,
+- the responsibilities of PE_SCHEDULER,
+- the PE_TCM-centric dataflow contract used by accelerator engines.
+
+We need a deterministic and debuggable PE-internal execution contract that supports:
+
+- simple single-engine commands
+- composite commands that build a tiled pipeline across DMA and accelerator engines
+
+The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.
+
+## Decision
+
+### D1. PE internal component roles
+
+Each PE contains the following logical components.
+
+**PE_CPU**
+
+- Executes the kernel instruction stream or kernel control logic.
+- Generates PE commands.
+- Submits commands to PE_SCHEDULER.
+- PE_CPU does NOT enqueue work directly into engine queues.
+
+**PE_SCHEDULER**
+
+- The sole dispatcher inside a PE.
+- Receives commands from PE_CPU.
+- Expands composite commands into sub-commands.
+- Tracks dependencies and command state.
+- Dispatches work to engine queues.
+- Manages tile scheduling for composite commands.
+
+**PE_DMA**
+
+- Handles memory transfers between PE_TCM and external memory domains.
+- PE_DMA has **dual egress** at the CUBE level:
+  - **→ XBAR**: dedicated path to HBM (local and cross-half via bridge)
+  - **→ NOC**: path to non-HBM destinations (shared SRAM, inter-cube UCIe, etc.)
+- Supported directions include:
+  - HBM → PE_TCM (via XBAR)
+  - PE_TCM → HBM (via XBAR)
+  - Shared SRAM → PE_TCM (via NOC)
+  - PE_TCM → shared SRAM (via NOC)
+  - PE_TCM → other memory domains (via NOC, if supported by topology)
+
+**PE_GEMM**
+
+- Matrix multiplication engine.
+- Reads activations from PE_TCM.
+- May stream weights directly from HBM.
+
+**PE_MATH**
+
+- Element-wise computation engine.
+- Reads and writes PE_TCM.
+
+**PE_TCM**
+
+- Local SRAM used as the staging memory for accelerator operations.
+
+---
+
+### D2. Command lifecycle and queues
+
+PE_SCHEDULER maintains three logical structures.
+
+**SubmissionQueue**
+
+- Written by PE_CPU.
+- Contains incoming PE commands waiting to be processed.
+
+**InflightTable**
+
+- Owned and mutated only by PE_SCHEDULER.
+- Tracks:
+  - expanded sub-commands
+  - dependency state
+  - engine assignment
+  - completion status
+
+**CompletionQueue**
+
+- Written by PE_SCHEDULER.
+- Contains final completion records for commands.
+
+**Single-writer rule**
+
+- Only PE_SCHEDULER is allowed to mutate command completion state.
+- Engine components must report completion via explicit completion events/messages.
+
+**Command completion**
+
+A command becomes DONE only when both of the following have occurred:
+
+- all of its sub-commands have completed,
+- PE_SCHEDULER has published a completion record to CompletionQueue.
+
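The queue structure and single-writer rule above can be sketched as follows. This is a minimal illustration, not the simulator's implementation: the class names `PeScheduler` and `InflightEntry`, the method names, and the use of string sub-command ids are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InflightEntry:
    pending: set          # sub-command ids that have not completed yet
    done: bool = False    # set only by the scheduler


@dataclass
class PeScheduler:
    submission_queue: deque = field(default_factory=deque)  # written by PE_CPU
    inflight: dict = field(default_factory=dict)            # mutated only by the scheduler
    completion_queue: deque = field(default_factory=deque)  # written only by the scheduler

    def submit(self, cmd_id, sub_ids):
        """PE_CPU side: enqueue a command with its (hypothetical) sub-command ids."""
        self.submission_queue.append((cmd_id, tuple(sub_ids)))

    def accept_next(self):
        """Move the oldest submitted command into the InflightTable."""
        cmd_id, sub_ids = self.submission_queue.popleft()
        self.inflight[cmd_id] = InflightEntry(pending=set(sub_ids))

    def on_engine_complete(self, cmd_id, sub_id):
        """Handle a completion event; engines never touch command state directly."""
        entry = self.inflight[cmd_id]
        entry.pending.discard(sub_id)
        if not entry.pending and not entry.done:
            entry.done = True
            self.completion_queue.append(cmd_id)  # publish the completion record
```

In this sketch an engine only ever calls `on_engine_complete`, so every change to completion state passes through the scheduler.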
+---
+
+### D3. Dispatch modes
+
+PE commands are divided into two categories.
+
+#### D3.1 Simple command
+
+A simple command expands to exactly one engine sub-command.
+
+Examples include:
+
+- DMA transfer
+- GEMM compute
+- MATH compute
+
+Execution flow:
+
+```
+PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
+```
+
+#### D3.2 Composite command (tiled pipeline)
+
+Composite commands implement tiled pipelined execution across engines.
+
+Each tile executes the following pipeline:
+
+```
+Input DMA (READ)
+→ Compute (GEMM or MATH)
+→ Output DMA (WRITE)
+```
+
+**Tiling rule**
+
+If the DMA payload exceeds the hardware tile size, PE_SCHEDULER splits the transfer into tiles.
+Each tile is assigned a monotonically increasing `tile_id`.
+
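As an illustration of the tiling rule, the sketch below splits a payload into tiles with increasing `tile_id`. The function name, the byte-based units, and the `(tile_id, offset, size)` tuple layout are assumptions for the example; the ADR does not fix them.

```python
def split_into_tiles(payload_bytes, tile_bytes):
    """Return (tile_id, offset, size) tuples covering the payload in order."""
    if tile_bytes <= 0:
        raise ValueError("tile_bytes must be positive")
    tiles = []
    offset = 0
    tile_id = 0
    while offset < payload_bytes:
        size = min(tile_bytes, payload_bytes - offset)  # last tile may be short
        tiles.append((tile_id, offset, size))
        offset += size
        tile_id += 1  # monotonically increasing
    return tiles
```

A payload that already fits in one tile yields a single entry with `tile_id` 0.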
+**Tile dependency rules**
+
+For tile `t`:
+
+- Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)`
+- Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)`
+- All dependencies are enforced by PE_SCHEDULER.
+
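The per-tile dependency chain can be written down as a small graph. The sketch below is one possible encoding, assuming operations are identified by hypothetical `(stage, tile_id)` pairs; it is not taken from the simulator.

```python
def tile_dependencies(num_tiles):
    """Map each (stage, tile_id) operation to the set of operations it waits for."""
    deps = {}
    for t in range(num_tiles):
        deps[("DMA_READ", t)] = set()
        deps[("COMPUTE", t)] = {("DMA_READ", t)}    # DMA_READ(t) -> COMPUTE(t)
        deps[("DMA_WRITE", t)] = {("COMPUTE", t)}   # COMPUTE(t) -> DMA_WRITE(t)
    return deps


def ready_ops(deps, completed):
    """Operations whose prerequisites are all complete, in a fixed sorted order."""
    return sorted(
        op for op, prereqs in deps.items()
        if op not in completed and prereqs <= completed
    )
```

Returning the ready set in sorted order keeps the scheduler's choice independent of dictionary or set iteration order.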
+**Overlap policy (Phase 0 default)**
+
+Operations for different tiles may overlap when engine resources permit.
+
+Allowed overlaps:
+
+```
+DMA_READ(t+1) ∥ COMPUTE(t)
+DMA_WRITE(t−1) ∥ COMPUTE(t)
+DMA_READ(t+1) ∥ DMA_WRITE(t)
+```
+
+Disallowed overlaps:
+
+```
+GEMM(t) ∥ GEMM(t′)
+MATH(t) ∥ MATH(t′)
+GEMM(t) ∥ MATH(t′)
+```
+
+---
+
+### D4. Engine execution model (Phase 0 default)
+
+Each engine behaves as a deterministic service resource.
+
+**DMA engine**
+
+PE_DMA contains two independent channels.
+
+```
+DMA_READ capacity  = 1
+DMA_WRITE capacity = 1
+```
+
+Rules:
+
+- DMA_READ and DMA_WRITE may execute concurrently.
+- Multiple READs cannot overlap.
+- Multiple WRITEs cannot overlap.
+
+Example allowed:
+
+```
+DMA_READ(t+1) ∥ DMA_WRITE(t)
+```
+
+Example not allowed:
+
+```
+DMA_READ(t) ∥ DMA_READ(t+1)
+DMA_WRITE(t) ∥ DMA_WRITE(t+1)
+```
+
+**Compute engine**
+
+PE_GEMM and PE_MATH share a single logical compute resource, referred to here as `PE_ACCEL`.
+
+```
+PE_ACCEL capacity = 1
+```
+
+Both GEMM and MATH require this shared compute slot.
+
+Consequences:
+
+- GEMM ∥ GEMM not allowed
+- MATH ∥ MATH not allowed
+- GEMM ∥ MATH not allowed
+
+Only one compute operation can run in a PE at a time.
+
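To see how the three capacity-1 resources interact, the following sketch computes a timeline for a composite command. It assumes fixed per-stage durations in arbitrary time units, issues tiles strictly in `tile_id` order, and ignores tile-buffer limits; all three are simplifications introduced for the example.

```python
def schedule(num_tiles, read_t, compute_t, write_t):
    """Return (resource, tile_id, start, end) tuples for a tiled pipeline."""
    free_at = {"DMA_READ": 0, "PE_ACCEL": 0, "DMA_WRITE": 0}  # capacity 1 each
    stages = (("DMA_READ", read_t), ("PE_ACCEL", compute_t), ("DMA_WRITE", write_t))
    timeline = []
    for t in range(num_tiles):
        ready_at = 0  # time at which the previous stage of this tile finished
        for resource, duration in stages:
            # Wait for both the tile's own dependency and the resource to be free.
            start = max(ready_at, free_at[resource])
            end = start + duration
            free_at[resource] = end
            timeline.append((resource, t, start, end))
            ready_at = end
    return timeline
```

With three tiles and durations 2, 3, 2, `DMA_READ(1)` runs during `COMPUTE(0)`, and the last write ends at time 13 rather than the 21 units a fully serial execution would take.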
+**Compute opcode restriction**
+
+Each composite command contains exactly one compute opcode.
+
+Examples:
+
+```
+COMPOSITE_GEMM
+COMPOSITE_MATH
+```
+
+Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0.
+
+**Engine completion signaling**
+
+Every engine emits a completion event when a sub-command finishes.
+Completion events are delivered to PE_SCHEDULER.
+
+---
+
+### D5. Dataflow model
+
+Compute operations use a TCM-centric dataflow model.
+
+**Input path (HBM)**
+
+```
+HBM → XBAR → PE_DMA (DMA_READ) → PE_TCM
+```
+
+**Input path (shared SRAM)**
+
+```
+Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
+```
+
+**Compute stage**
+
+Compute engines read input tensors from PE_TCM.
+
+```
+PE_TCM → GEMM / MATH
+```
+
+Weights for GEMM may optionally stream directly from HBM (via XBAR).
+
+**Output path (HBM)**
+
+Compute results are written to PE_TCM, and PE_DMA then writes them to HBM.
+
+```
+PE_TCM → PE_DMA (DMA_WRITE) → XBAR → HBM
+```
+
+**Output path (shared SRAM)**
+
+```
+PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
+```
+
+#### D5.1 PE_TCM partitioning and ownership boundary
+
+The PE_TCM address space is partitioned into two logical regions.
+
+**SchedulerReservedTCM**
+
+- A staging region owned exclusively by PE_SCHEDULER.
+- This region is used for composite command tile buffers.
+- PE_SCHEDULER:
+  - partitions this region into tile buffers
+  - assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
+  - guarantees input/output buffer separation
+  - manages tile buffer lifetime
+
+**AllocatableTCM**
+
+- General-purpose region managed by PEMemAllocator.
+- Used for host-visible or DP-visible allocations.
+
+**Visibility rule (hard isolation)**
+
+- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
+- SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
+- This prevents DP or host allocations from interfering with scheduler staging buffers.
+
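One way to get isolation "by construction" is to hand the allocator only the allocatable range, so the reserved range is not representable in its state. The sketch below assumes the reserved region sits at the low end of PE_TCM and uses a simple bump allocator; the class name, the layout, and the allocation strategy are all hypothetical.

```python
class PEMemAllocatorSketch:
    """Bump allocator that is only ever given the AllocatableTCM range."""

    def __init__(self, base, size):
        self.base = base
        self.limit = base + size
        self.next = base

    def alloc(self, nbytes):
        if self.next + nbytes > self.limit:
            raise MemoryError("AllocatableTCM exhausted")
        addr = self.next
        self.next += nbytes
        return addr


def partition_tcm(tcm_bytes, reserved_bytes):
    """Split PE_TCM into a scheduler-owned range and an allocator-managed range."""
    if not 0 <= reserved_bytes <= tcm_bytes:
        raise ValueError("reserved region must fit inside PE_TCM")
    reserved = (0, reserved_bytes)  # [start, end), owned by PE_SCHEDULER
    allocator = PEMemAllocatorSketch(base=reserved_bytes,
                                     size=tcm_bytes - reserved_bytes)
    return reserved, allocator
```

Because the allocator never learns the reserved bounds, no allocation request can return an address inside SchedulerReservedTCM.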
+**Tile buffer rules**
+
+Within SchedulerReservedTCM:
+
+- input buffers and output buffers must not overlap
+- PE_SCHEDULER assigns tile buffers for DMA and compute stages
+- tile buffers remain valid until the corresponding DMA_WRITE completes
+- buffer reuse is allowed only after the tile lifetime ends
+
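The tile buffer rules amount to a small pool with an explicit lifetime. The sketch below assumes equally sized slots and one input plus one output slot per tile; the class and method names are illustrative only.

```python
class TileBufferPool:
    """Fixed pool of equally sized slots inside SchedulerReservedTCM."""

    def __init__(self, num_slots):
        self.free = list(range(num_slots))  # slot indices, lowest first
        self.owner = {}                     # tile_id -> (input_slot, output_slot)

    def acquire(self, tile_id):
        """Reserve distinct input and output slots, or return None if none are free."""
        if len(self.free) < 2:
            return None
        pair = (self.free.pop(0), self.free.pop(0))
        self.owner[tile_id] = pair
        return pair

    def on_dma_write_complete(self, tile_id):
        """End the tile lifetime; only now do its slots become reusable."""
        self.free.extend(self.owner.pop(tile_id))
        self.free.sort()  # keep slot assignment order deterministic
```

A tile that cannot acquire buffers simply waits, which bounds how far DMA_READ can run ahead of DMA_WRITE.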
+---
+
+### D6. Observability and trace contract
+
+The simulator must emit deterministic trace events.
+
+Required events include:
+
+- `command_submitted`
+- `sub_command_dispatched`
+- `engine_start`
+- `engine_complete`
+- `tile_ready`
+- `command_complete`
+
+Trace ordering must be deterministic for identical inputs.
+
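Deterministic ordering usually needs an explicit tie-break for events that share a timestamp. The sketch below shows one possible rule, sorting by time, then event kind, then command id, then tile id. That tie-break order and the `(time, name, cmd_id, tile_id)` tuple layout are assumptions for illustration; the ADR only requires that the ordering be deterministic.

```python
EVENT_ORDER = (
    "command_submitted",
    "sub_command_dispatched",
    "engine_start",
    "engine_complete",
    "tile_ready",
    "command_complete",
)
RANK = {name: i for i, name in enumerate(EVENT_ORDER)}


def ordered_trace(events):
    """Sort (time, name, cmd_id, tile_id) tuples with a total, input-independent order."""
    return sorted(events, key=lambda e: (e[0], RANK[e[1]], e[2], e[3]))
```

Because the sort key is a total order over distinct events, the emitted trace does not depend on the order in which engines happened to report them.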
+---
+
+### D7. Topology representation
+
+PE internal components are declared in `cube.pe_template`.
+
+The template is instantiated once per PE.
+
+PE instances are derived from `cube.pe_layout`.
+
+External connectivity such as:
+
+- PE_DMA → XBAR (HBM data path)
+- PE_DMA → NOC (non-HBM data path: shared SRAM, inter-cube UCIe)
+- NOC → PE_CPU (command path from M_CPU)
+
+is modeled at the CUBE level (see ADR-0003 D3).
+
+---
+
+## Links
+
+- SPEC R3, R4
+- ADR-0003 D4 (PE-level system hierarchy)
+- ADR-0005 View C (PE-level diagram)
+- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
+- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)