commit - release 1

This commit is contained in:
2026-03-18 11:47:48 -07:00
commit 6f43807900
109 changed files with 14909 additions and 0 deletions
@@ -0,0 +1,364 @@
# ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)
## Status
Proposed
## Context
ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:
- the dispatch model inside a PE,
- the responsibilities of PE_SCHEDULER,
- the PE_TCM-centric dataflow contract used by accelerator engines.
We need a deterministic and debuggable PE-internal execution contract that supports:
- simple single-engine commands
- composite commands that build a tiled pipeline across DMA and accelerator engines
The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.
## Decision
### D1. PE internal component roles
Each PE contains the following logical components.
**PE_CPU**
- Executes kernel instruction stream or kernel control logic.
- Generates PE commands.
- Submits commands to PE_SCHEDULER.
- PE_CPU does NOT enqueue work directly into engine queues.
**PE_SCHEDULER**
- The sole dispatcher inside a PE.
- Receives commands from PE_CPU.
- Expands composite commands into sub-commands.
- Tracks dependencies and command state.
- Dispatches work to engine queues.
- Manages tile scheduling for composite commands.
**PE_DMA**
- Handles memory transfers between PE_TCM and external memory domains.
- PE_DMA has **dual egress** at the CUBE level:
- **→ XBAR**: dedicated path to HBM (local and cross-half via bridge)
- **→ NOC**: path to non-HBM destinations (shared SRAM, inter-cube UCIe, etc.)
- Supported directions include:
- HBM → PE_TCM (via XBAR)
- PE_TCM → HBM (via XBAR)
- PE_TCM → shared SRAM (via NOC)
- PE_TCM → other memory domains (via NOC, if supported by topology)
**PE_GEMM**
- Matrix multiplication engine.
- Reads activations from PE_TCM.
- May stream weights directly from HBM.
**PE_MATH**
- Element-wise computation engine.
- Reads and writes PE_TCM.
**PE_TCM**
- Local SRAM used as the staging memory for accelerator operations.
---
### D2. Command lifecycle and queues
PE_SCHEDULER maintains three logical structures.
**SubmissionQueue**
- Written by PE_CPU.
- Contains incoming PE commands waiting to be processed.
**InflightTable**
- Owned and mutated only by PE_SCHEDULER.
- Tracks:
- expanded sub-commands
- dependency state
- engine assignment
- completion status
**CompletionQueue**
- Written by PE_SCHEDULER.
- Contains final completion records for commands.
**Single-writer rule**
- Only PE_SCHEDULER is allowed to mutate command completion state.
- Engine components must report completion via explicit completion events/messages.
**Command completion**
A command becomes DONE when:
- all sub-commands complete
- PE_SCHEDULER publishes a completion record to CompletionQueue.
---
### D3. Dispatch modes
PE commands are divided into two categories.
#### D3.1 Simple command
A simple command expands to exactly one engine sub-command.
Examples include:
- DMA transfer
- GEMM compute
- MATH compute
Execution flow:
```
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
```
#### D3.2 Composite command (tiled pipeline)
Composite commands implement tiled pipelined execution across engines.
Each tile executes the following pipeline:
```
Input DMA (READ)
→ Compute (GEMM or MATH)
→ Output DMA (WRITE)
```
**Tiling rule**
If the DMA payload exceeds hardware tile size, PE_SCHEDULER splits the transfer into tiles.
Each tile is assigned a monotonically increasing `tile_id`.
**Tile dependency rules**
For tile `t`:
- Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)`
- Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)`
- All dependencies are enforced by PE_SCHEDULER.
**Overlap policy (Phase 0 default)**
Operations for different tiles may overlap when engine resources permit.
Allowed overlaps:
```
DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t1) ∥ COMPUTE(t)
DMA_READ(t) ∥ DMA_WRITE(t)
```
Disallowed overlaps:
```
GEMM(t) ∥ GEMM(t)
MATH(t) ∥ MATH(t)
GEMM(t) ∥ MATH(t)
```
---
### D4. Engine execution model (Phase 0 default)
Each engine behaves as a deterministic service resource.
**DMA engine**
PE_DMA contains two independent channels.
```
DMA_READ capacity = 1
DMA_WRITE capacity = 1
```
Rules:
- DMA_READ and DMA_WRITE may execute concurrently.
- Multiple READs cannot overlap.
- Multiple WRITEs cannot overlap.
Example allowed:
```
DMA_READ(t+1) ∥ DMA_WRITE(t)
```
Example not allowed:
```
DMA_READ(t) ∥ DMA_READ(t+1)
DMA_WRITE(t) ∥ DMA_WRITE(t+1)
```
**Compute engine**
Compute operations share a single compute resource.
```
PE_ACCEL capacity = 1
```
Both GEMM and MATH require this shared compute slot.
Consequences:
- GEMM ∥ GEMM not allowed
- MATH ∥ MATH not allowed
- GEMM ∥ MATH not allowed
Only one compute operation can run in a PE at a time.
**Compute opcode restriction**
Composite commands contain one compute opcode only.
Examples:
```
COMPOSITE_GEMM
COMPOSITE_MATH
```
Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0.
**Engine completion signaling**
Every engine emits a completion event when a sub-command finishes.
Completion events are delivered to PE_SCHEDULER.
---
### D5. Dataflow model
Compute operations use a TCM-centric dataflow model.
**Input path (HBM)**
```
HBM → XBAR → PE_DMA (DMA_READ) → PE_TCM
```
**Input path (shared SRAM)**
```
Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
```
**Compute stage**
Compute engines read input tensors from PE_TCM.
```
PE_TCM → GEMM / MATH
```
Weights for GEMM may optionally stream directly from HBM (via XBAR).
**Output path (HBM)**
Compute results are written to PE_TCM, then DMA writes to HBM.
```
PE_TCM → PE_DMA (DMA_WRITE) → XBAR → HBM
```
**Output path (shared SRAM)**
```
PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
```
#### D5.1 PE_TCM partitioning and ownership boundary
The PE_TCM address space is partitioned into two logical regions.
**SchedulerReservedTCM**
- A staging region owned exclusively by PE_SCHEDULER.
- This region is used for composite command tile buffers.
- PE_SCHEDULER:
- partitions this region into tile buffers
- assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
- guarantees input/output buffer separation
- manages tile buffer lifetime
**AllocatableTCM**
- General-purpose region managed by PEMemAllocator.
- Used by host or DP-visible allocations.
**Visibility rule (hard isolation)**
- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
- SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
- This prevents DP or host allocations from interfering with scheduler staging buffers.
**Tile buffer rules**
Within SchedulerReservedTCM:
- input buffers and output buffers must not overlap
- PE_SCHEDULER assigns tile buffers for DMA and compute stages
- tile buffers remain valid until the corresponding DMA_WRITE completes
- Buffer reuse is allowed only after the tile lifetime finishes.
---
### D6. Observability and trace contract
The simulator must emit deterministic trace events.
Required events include:
- `command_submitted`
- `sub_command_dispatched`
- `engine_start`
- `engine_complete`
- `tile_ready`
- `command_complete`
Trace ordering must be deterministic for identical inputs.
---
### D7. Topology representation
PE internal components are declared in `cube.pe_template`.
The template is instantiated once per PE.
PE instances are derived from `cube.pe_layout`.
External connectivity such as:
- PE_DMA → XBAR (HBM data path)
- PE_DMA → NOC (non-HBM data path: shared SRAM, inter-cube UCIe)
- NOC → PE_CPU (command path from M_CPU)
is modeled at the CUBE level (see ADR-0003 D3).
---
## Links
- SPEC R3, R4
- ADR-0003 D4 (PE-level system hierarchy)
- ADR-0005 View C (PE-level diagram)
- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)