Files
kernbench2/docs/adr/ADR-0014-pe-internal-execution-model.md
T
2026-03-18 11:47:48 -07:00

365 lines
8.0 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)
## Status
Proposed
## Context
ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:
- the dispatch model inside a PE,
- the responsibilities of PE_SCHEDULER,
- the PE_TCM-centric dataflow contract used by accelerator engines.
We need a deterministic and debuggable PE-internal execution contract that supports:
- simple single-engine commands
- composite commands that build a tiled pipeline across DMA and accelerator engines
The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.
## Decision
### D1. PE internal component roles
Each PE contains the following logical components.
**PE_CPU**
- Executes kernel instruction stream or kernel control logic.
- Generates PE commands.
- Submits commands to PE_SCHEDULER.
- PE_CPU does NOT enqueue work directly into engine queues.
**PE_SCHEDULER**
- The sole dispatcher inside a PE.
- Receives commands from PE_CPU.
- Expands composite commands into sub-commands.
- Tracks dependencies and command state.
- Dispatches work to engine queues.
- Manages tile scheduling for composite commands.
**PE_DMA**
- Handles memory transfers between PE_TCM and external memory domains.
- PE_DMA has **dual egress** at the CUBE level:
- **→ XBAR**: dedicated path to HBM (local and cross-half via bridge)
- **→ NOC**: path to non-HBM destinations (shared SRAM, inter-cube UCIe, etc.)
- Supported directions include:
- HBM → PE_TCM (via XBAR)
- PE_TCM → HBM (via XBAR)
- PE_TCM → shared SRAM (via NOC)
- PE_TCM → other memory domains (via NOC, if supported by topology)
**PE_GEMM**
- Matrix multiplication engine.
- Reads activations from PE_TCM.
- May stream weights directly from HBM.
**PE_MATH**
- Element-wise computation engine.
- Reads and writes PE_TCM.
**PE_TCM**
- Local SRAM used as the staging memory for accelerator operations.
---
### D2. Command lifecycle and queues
PE_SCHEDULER maintains three logical structures.
**SubmissionQueue**
- Written by PE_CPU.
- Contains incoming PE commands waiting to be processed.
**InflightTable**
- Owned and mutated only by PE_SCHEDULER.
- Tracks:
- expanded sub-commands
- dependency state
- engine assignment
- completion status
**CompletionQueue**
- Written by PE_SCHEDULER.
- Contains final completion records for commands.
**Single-writer rule**
- Only PE_SCHEDULER is allowed to mutate command completion state.
- Engine components must report completion via explicit completion events/messages.
**Command completion**
A command becomes DONE when:
- all sub-commands complete
- PE_SCHEDULER publishes a completion record to CompletionQueue.
---
### D3. Dispatch modes
PE commands are divided into two categories.
#### D3.1 Simple command
A simple command expands to exactly one engine sub-command.
Examples include:
- DMA transfer
- GEMM compute
- MATH compute
Execution flow:
```
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
```
#### D3.2 Composite command (tiled pipeline)
Composite commands implement tiled pipelined execution across engines.
Each tile executes the following pipeline:
```
Input DMA (READ)
→ Compute (GEMM or MATH)
→ Output DMA (WRITE)
```
**Tiling rule**
If the DMA payload exceeds hardware tile size, PE_SCHEDULER splits the transfer into tiles.
Each tile is assigned a monotonically increasing `tile_id`.
**Tile dependency rules**
For tile `t`:
- Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)`
- Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)`
- All dependencies are enforced by PE_SCHEDULER.
**Overlap policy (Phase 0 default)**
Operations for different tiles may overlap when engine resources permit.
Allowed overlaps:
```
DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t1) ∥ COMPUTE(t)
DMA_READ(t) ∥ DMA_WRITE(t)
```
Disallowed overlaps:
```
GEMM(t) ∥ GEMM(t)
MATH(t) ∥ MATH(t)
GEMM(t) ∥ MATH(t)
```
---
### D4. Engine execution model (Phase 0 default)
Each engine behaves as a deterministic service resource.
**DMA engine**
PE_DMA contains two independent channels.
```
DMA_READ capacity = 1
DMA_WRITE capacity = 1
```
Rules:
- DMA_READ and DMA_WRITE may execute concurrently.
- Multiple READs cannot overlap.
- Multiple WRITEs cannot overlap.
Example allowed:
```
DMA_READ(t+1) ∥ DMA_WRITE(t)
```
Example not allowed:
```
DMA_READ(t) ∥ DMA_READ(t+1)
DMA_WRITE(t) ∥ DMA_WRITE(t+1)
```
**Compute engine**
Compute operations share a single compute resource.
```
PE_ACCEL capacity = 1
```
Both GEMM and MATH require this shared compute slot.
Consequences:
- GEMM ∥ GEMM not allowed
- MATH ∥ MATH not allowed
- GEMM ∥ MATH not allowed
Only one compute operation can run in a PE at a time.
**Compute opcode restriction**
Composite commands contain one compute opcode only.
Examples:
```
COMPOSITE_GEMM
COMPOSITE_MATH
```
Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0.
**Engine completion signaling**
Every engine emits a completion event when a sub-command finishes.
Completion events are delivered to PE_SCHEDULER.
---
### D5. Dataflow model
Compute operations use a TCM-centric dataflow model.
**Input path (HBM)**
```
HBM → XBAR → PE_DMA (DMA_READ) → PE_TCM
```
**Input path (shared SRAM)**
```
Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
```
**Compute stage**
Compute engines read input tensors from PE_TCM.
```
PE_TCM → GEMM / MATH
```
Weights for GEMM may optionally stream directly from HBM (via XBAR).
**Output path (HBM)**
Compute results are written to PE_TCM, then DMA writes to HBM.
```
PE_TCM → PE_DMA (DMA_WRITE) → XBAR → HBM
```
**Output path (shared SRAM)**
```
PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
```
#### D5.1 PE_TCM partitioning and ownership boundary
The PE_TCM address space is partitioned into two logical regions.
**SchedulerReservedTCM**
- A staging region owned exclusively by PE_SCHEDULER.
- This region is used for composite command tile buffers.
- PE_SCHEDULER:
- partitions this region into tile buffers
- assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
- guarantees input/output buffer separation
- manages tile buffer lifetime
**AllocatableTCM**
- General-purpose region managed by PEMemAllocator.
- Used by host or DP-visible allocations.
**Visibility rule (hard isolation)**
- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
- SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
- This prevents DP or host allocations from interfering with scheduler staging buffers.
**Tile buffer rules**
Within SchedulerReservedTCM:
- input buffers and output buffers must not overlap
- PE_SCHEDULER assigns tile buffers for DMA and compute stages
- tile buffers remain valid until the corresponding DMA_WRITE completes
- Buffer reuse is allowed only after the tile lifetime finishes.
---
### D6. Observability and trace contract
The simulator must emit deterministic trace events.
Required events include:
- `command_submitted`
- `sub_command_dispatched`
- `engine_start`
- `engine_complete`
- `tile_ready`
- `command_complete`
Trace ordering must be deterministic for identical inputs.
---
### D7. Topology representation
PE internal components are declared in `cube.pe_template`.
The template is instantiated once per PE.
PE instances are derived from `cube.pe_layout`.
External connectivity such as:
- PE_DMA → XBAR (HBM data path)
- PE_DMA → NOC (non-HBM data path: shared SRAM, inter-cube UCIe)
- NOC → PE_CPU (command path from M_CPU)
is modeled at the CUBE level (see ADR-0003 D3).
---
## Links
- SPEC R3, R4
- ADR-0003 D4 (PE-level system hierarchy)
- ADR-0005 View C (PE-level diagram)
- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)