# ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)

## Status

Proposed

## Context

ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:

- the dispatch model inside a PE,
- the responsibilities of PE_SCHEDULER,
- the PE_TCM-centric dataflow contract used by accelerator engines.

We need a deterministic and debuggable PE-internal execution contract that supports:

- simple single-engine commands
- composite commands that build a tiled pipeline across DMA and accelerator engines

The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.

## Decision

### D1. PE internal component roles

Each PE contains the following logical components.

**PE_CPU**

- Executes the kernel instruction stream or kernel control logic.
- Generates PE commands.
- Submits commands to PE_SCHEDULER.
- Does NOT enqueue work directly into engine queues.

**PE_SCHEDULER**

- The sole dispatcher inside a PE.
- Receives commands from PE_CPU.
- Expands composite commands into sub-commands.
- Tracks dependencies and command state.
- Dispatches work to engine queues.
- Manages tile scheduling for composite commands.

**PE_DMA**

- Handles memory transfers between PE_TCM and external memory domains.
- PE_DMA has **dual egress** at the CUBE level:
  - **→ XBAR**: dedicated path to HBM (local and cross-half via bridge)
  - **→ NOC**: path to non-HBM destinations (shared SRAM, inter-cube UCIe, etc.)
- Supported directions include:
  - HBM → PE_TCM (via XBAR)
  - PE_TCM → HBM (via XBAR)
  - shared SRAM → PE_TCM (via NOC, see D5)
  - PE_TCM → shared SRAM (via NOC)
  - PE_TCM → other memory domains (via NOC, if supported by topology)

**PE_GEMM**

- Matrix multiplication engine.
- Reads activations from PE_TCM.
- May stream weights directly from HBM.

**PE_MATH**

- Element-wise computation engine.
- Reads and writes PE_TCM.

**PE_TCM**

- Local SRAM used as the staging memory for accelerator operations.

---

### D2. Command lifecycle and queues

PE_SCHEDULER maintains three logical structures.

**SubmissionQueue**

- Written by PE_CPU.
- Contains incoming PE commands waiting to be processed.

**InflightTable**

- Owned and mutated only by PE_SCHEDULER.
- Tracks:
  - expanded sub-commands
  - dependency state
  - engine assignment
  - completion status

**CompletionQueue**

- Written by PE_SCHEDULER.
- Contains final completion records for commands.

**Single-writer rule**

- Only PE_SCHEDULER is allowed to mutate command completion state.
- Engine components must report completion via explicit completion events/messages.

**Command completion**

A command becomes DONE when both of the following hold:

- all of its sub-commands have completed
- PE_SCHEDULER has published a completion record to CompletionQueue

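The lifecycle above can be sketched as a small Python model. This is a minimal illustration under stated assumptions, not the simulator's actual API: the class `PEScheduler`, the `InflightEntry` record, and every method name are hypothetical, and sub-commands are represented as plain strings.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InflightEntry:
    # One InflightTable row: the sub-commands still pending for a command.
    pending: set = field(default_factory=set)
    state: str = "INFLIGHT"


class PEScheduler:
    """Hypothetical model of the three D2 structures (names are assumptions)."""

    def __init__(self):
        self.submission_queue = deque()   # written by PE_CPU
        self.inflight_table = {}          # mutated only by PE_SCHEDULER
        self.completion_queue = deque()   # written by PE_SCHEDULER

    def submit(self, cmd_id, sub_commands):
        # PE_CPU side: enqueue only; never touch engine queues or command state.
        self.submission_queue.append((cmd_id, list(sub_commands)))

    def process_submissions(self):
        # Scheduler side: expand each command into an InflightTable entry.
        dispatched = []
        while self.submission_queue:
            cmd_id, subs = self.submission_queue.popleft()
            self.inflight_table[cmd_id] = InflightEntry(pending=set(subs))
            dispatched.extend((cmd_id, s) for s in subs)
        return dispatched  # a real model would push these to engine queues

    def on_completion_event(self, cmd_id, sub_command):
        # Engines only report; the scheduler is the single writer of state.
        entry = self.inflight_table[cmd_id]
        entry.pending.discard(sub_command)
        if not entry.pending and entry.state != "DONE":
            entry.state = "DONE"
            self.completion_queue.append(cmd_id)
```

In this sketch a command reaches DONE in the same step that its completion record is published, so the two conditions cannot be observed separately.
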
---

### D3. Dispatch modes

PE commands are divided into two categories.

#### D3.1 Simple command

A simple command expands to exactly one engine sub-command.

Examples include:

- DMA transfer
- GEMM compute
- MATH compute

Execution flow:

```
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
```

#### D3.2 Composite command (tiled pipeline)

Composite commands implement tiled pipelined execution across engines.

Each tile executes the following pipeline:

```
Input DMA (READ)
→ Compute (GEMM or MATH)
→ Output DMA (WRITE)
```

**Tiling rule**

If the DMA payload exceeds the hardware tile size, PE_SCHEDULER splits the transfer into tiles.
Each tile is assigned a monotonically increasing `tile_id`.

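A minimal sketch of the tiling rule follows. The function name `split_into_tiles`, the byte units, the `tile_id` starting at 0, and the short final tile are assumptions for illustration; the ADR fixes only that tiles are produced when the payload exceeds the tile size and that `tile_id` increases monotonically.

```python
def split_into_tiles(payload_bytes, tile_bytes):
    """Return (tile_id, offset, length) triples covering the payload."""
    if payload_bytes < 0 or tile_bytes <= 0:
        raise ValueError("payload must be >= 0 and tile size must be > 0")
    tiles = []
    offset = 0
    tile_id = 0
    while offset < payload_bytes:
        # The last tile may be shorter than the hardware tile size.
        length = min(tile_bytes, payload_bytes - offset)
        tiles.append((tile_id, offset, length))
        offset += length
        tile_id += 1  # monotonically increasing, in address order
    return tiles
```

For example, a 10-byte payload with a 4-byte tile size yields three tiles with lengths 4, 4, and 2, while a payload that fits in one tile yields a single tile with `tile_id` 0.
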
**Tile dependency rules**

For tile `t`:

- Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)`
- Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)`
- All dependencies are enforced by PE_SCHEDULER.

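The two intra-tile edges can be written down as an explicit dependency list that a scheduler model consults before dispatch. This is a sketch only: the tuple encoding `(stage, tile_id)` and the helper names `tile_dependencies` and `ready` are hypothetical.

```python
def tile_dependencies(num_tiles):
    """Return the (predecessor, successor) edges for every tile."""
    edges = []
    for t in range(num_tiles):
        edges.append((("DMA_READ", t), ("COMPUTE", t)))
        edges.append((("COMPUTE", t), ("DMA_WRITE", t)))
    return edges


def ready(op, done, edges):
    """An op may be dispatched once all of its predecessors are in `done`."""
    return all(pred in done for pred, succ in edges if succ == op)
```

A read has no predecessor, so `ready(("DMA_READ", 0), set(), edges)` is true, whereas `("COMPUTE", 0)` becomes ready only after `("DMA_READ", 0)` is recorded as done.
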
**Overlap policy (Phase 0 default)**

Operations for different tiles may overlap when engine resources permit. Operations of the same tile never overlap, because the tile dependency rules serialize them.

Allowed overlaps:

```
DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t−1) ∥ COMPUTE(t)
DMA_READ(t+1) ∥ DMA_WRITE(t)
```

Disallowed overlaps:

```
GEMM(t) ∥ GEMM(t′)
MATH(t) ∥ MATH(t′)
GEMM(t) ∥ MATH(t′)
```

---

### D4. Engine execution model (Phase 0 default)

Each engine behaves as a deterministic service resource.

**DMA engine**

PE_DMA contains two independent channels.

```
DMA_READ capacity = 1
DMA_WRITE capacity = 1
```

Rules:

- DMA_READ and DMA_WRITE may execute concurrently.
- Multiple READs cannot overlap.
- Multiple WRITEs cannot overlap.

Example allowed:

```
DMA_READ(t+1) ∥ DMA_WRITE(t)
```

Example not allowed:

```
DMA_READ(t) ∥ DMA_READ(t+1)
DMA_WRITE(t) ∥ DMA_WRITE(t+1)
```

**Compute engine**

Compute operations share a single compute resource, PE_ACCEL.

```
PE_ACCEL capacity = 1
```

Both GEMM and MATH require this shared compute slot.

Consequences:

- GEMM ∥ GEMM not allowed
- MATH ∥ MATH not allowed
- GEMM ∥ MATH not allowed

Only one compute operation can run in a PE at a time.

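To see how the capacity-1 resources and the tile dependencies combine, the sketch below computes start and end times for an in-order tiled pipeline. It is an illustration under assumptions: fixed per-stage durations, tiles issued strictly in `tile_id` order, and no limit on the number of tile buffers in PE_TCM. The function name `schedule_tiles` and its parameters are hypothetical.

```python
def schedule_tiles(num_tiles, t_read, t_compute, t_write):
    """Return {(stage, tile_id): (start, end)} for an in-order tiled pipeline."""
    # Next time at which each capacity-1 resource becomes free.
    free = {"DMA_READ": 0, "PE_ACCEL": 0, "DMA_WRITE": 0}
    times = {}
    for t in range(num_tiles):
        # A stage starts when its resource is free and its predecessor has ended.
        r_start = free["DMA_READ"]
        r_end = r_start + t_read
        free["DMA_READ"] = r_end

        c_start = max(free["PE_ACCEL"], r_end)
        c_end = c_start + t_compute
        free["PE_ACCEL"] = c_end

        w_start = max(free["DMA_WRITE"], c_end)
        w_end = w_start + t_write
        free["DMA_WRITE"] = w_end

        times[("DMA_READ", t)] = (r_start, r_end)
        times[("COMPUTE", t)] = (c_start, c_end)
        times[("DMA_WRITE", t)] = (w_start, w_end)
    return times
```

With `schedule_tiles(3, 2, 3, 2)`, `DMA_READ(1)` occupies time 2 to 4 while `COMPUTE(0)` occupies 2 to 5, and the last write ends at time 13 instead of the 21 that fully serial execution would take. Because the function uses no randomness and no wall-clock input, identical arguments always give the identical schedule.
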
**Compute opcode restriction**

Composite commands contain one compute opcode only.

Examples:

```
COMPOSITE_GEMM
COMPOSITE_MATH
```

Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0.

**Engine completion signaling**

Every engine emits a completion event when a sub-command finishes.
Completion events are delivered to PE_SCHEDULER.

---

### D5. Dataflow model

Compute operations use a TCM-centric dataflow model.

**Input path (HBM)**

```
HBM → XBAR → PE_DMA (DMA_READ) → PE_TCM
```

**Input path (shared SRAM)**

```
Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
```

**Compute stage**

Compute engines read input tensors from PE_TCM.

```
PE_TCM → GEMM / MATH
```

Weights for GEMM may optionally stream directly from HBM (via XBAR).

**Output path (HBM)**

Compute results are written to PE_TCM, then DMA writes to HBM.

```
PE_TCM → PE_DMA (DMA_WRITE) → XBAR → HBM
```

**Output path (shared SRAM)**

```
PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
```

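The four DMA paths above follow one rule: HBM traffic goes through XBAR and shared SRAM traffic goes through NOC. A minimal sketch of that rule is shown below; the function name `dma_route` and the string labels `"READ"`, `"WRITE"`, `"HBM"`, and `"SHARED_SRAM"` are hypothetical, and other NOC destinations are left out.

```python
def dma_route(direction, domain):
    """Return the hop list for a PE_DMA transfer.

    direction: "READ" (into PE_TCM) or "WRITE" (out of PE_TCM).
    domain: "HBM" or "SHARED_SRAM" (illustrative labels, not from the ADR).
    """
    # Egress selection: XBAR for HBM, NOC for shared SRAM.
    fabric = {"HBM": "XBAR", "SHARED_SRAM": "NOC"}
    if domain not in fabric:
        raise ValueError(f"unsupported memory domain: {domain}")
    if direction == "READ":
        return [domain, fabric[domain], "PE_DMA", "PE_TCM"]
    if direction == "WRITE":
        return ["PE_TCM", "PE_DMA", fabric[domain], domain]
    raise ValueError(f"unsupported direction: {direction}")
```
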
#### D5.1 PE_TCM partitioning and ownership boundary

The PE_TCM address space is partitioned into two logical regions.

**SchedulerReservedTCM**

- A staging region owned exclusively by PE_SCHEDULER.
- This region is used for composite command tile buffers.
- PE_SCHEDULER:
  - partitions this region into tile buffers
  - assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
  - guarantees input/output buffer separation
  - manages tile buffer lifetime

**AllocatableTCM**

- General-purpose region managed by PEMemAllocator.
- Used for host-visible or DP-visible allocations.

**Visibility rule (hard isolation)**

- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
- SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
- This prevents DP or host allocations from interfering with scheduler staging buffers.

**Tile buffer rules**

Within SchedulerReservedTCM:

- Input buffers and output buffers must not overlap.
- PE_SCHEDULER assigns tile buffers for DMA and compute stages.
- Tile buffers remain valid until the corresponding DMA_WRITE completes.
- Buffer reuse is allowed only after the tile lifetime finishes.

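One way to model the partitioning and the tile buffer rules is a fixed pool of buffers carved out of the reserved region, with the allocator handed only the remaining range. The sketch below is an assumption-heavy illustration: it places SchedulerReservedTCM at the bottom of PE_TCM, uses fixed-size buffers, and the names `TileBufferPool` and `allocatable_range` are hypothetical.

```python
class TileBufferPool:
    """Hypothetical pool of fixed-size tile buffers in SchedulerReservedTCM."""

    def __init__(self, base, size, buf_size):
        # Buffers are carved once; PEMemAllocator never sees this range.
        self.free = [base + i * buf_size for i in range(size // buf_size)]
        self.owner = {}  # buffer address -> tile_id

    def acquire(self, tile_id):
        if not self.free:
            return None  # caller must stall the tile until a buffer is released
        addr = self.free.pop(0)  # lowest address first, for determinism
        self.owner[addr] = tile_id
        return addr

    def release_on_write_complete(self, addr):
        # Reuse is allowed only after the tile's DMA_WRITE has completed.
        del self.owner[addr]
        self.free.append(addr)
        self.free.sort()


def allocatable_range(tcm_size, reserved_size):
    """Range handed to PEMemAllocator: everything above the reserved region."""
    if reserved_size > tcm_size:
        raise ValueError("reserved region exceeds PE_TCM size")
    return (reserved_size, tcm_size)
```

A tile that needs separate input and output buffers would call `acquire` twice; the pool returns distinct addresses, which gives the required input/output separation in this model.
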
---

### D6. Observability and trace contract

The simulator must emit deterministic trace events.

Required events include:

- `command_submitted`
- `sub_command_dispatched`
- `engine_start`
- `engine_complete`
- `tile_ready`
- `command_complete`

Trace ordering must be deterministic for identical inputs.

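A common way to get deterministic ordering is to sort events by simulated time and break ties with an emission counter, never by wall-clock time or hash order. The recorder below is a sketch of that idea, not the simulator's trace format: the class `TraceRecorder` and the field names `time`, `seq`, and `event` are assumptions.

```python
import json


class TraceRecorder:
    """Hypothetical recorder; field names are illustrative only."""

    EVENTS = {
        "command_submitted", "sub_command_dispatched", "engine_start",
        "engine_complete", "tile_ready", "command_complete",
    }

    def __init__(self):
        self._events = []
        self._seq = 0  # emission counter, breaks ties at equal timestamps

    def emit(self, time, event, **fields):
        if event not in self.EVENTS:
            raise ValueError(f"unknown trace event: {event}")
        record = {"time": time, "seq": self._seq, "event": event, **fields}
        self._events.append(record)
        self._seq += 1

    def dump(self):
        # Sort by (time, seq); sort_keys fixes the key order in each line.
        ordered = sorted(self._events, key=lambda e: (e["time"], e["seq"]))
        return [json.dumps(e, sort_keys=True) for e in ordered]
```

Two runs that emit the same events in the same order produce byte-identical `dump()` output, which makes traces directly diffable.
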
---

### D7. Topology representation

PE internal components are declared in `cube.pe_template`.

The template is instantiated once per PE.

PE instances are derived from `cube.pe_layout`.

External connectivity is modeled at the CUBE level (see ADR-0003 D3). This includes:

- PE_DMA → XBAR (HBM data path)
- PE_DMA → NOC (non-HBM data path: shared SRAM, inter-cube UCIe)
- NOC → PE_CPU (command path from M_CPU)

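The template-per-PE idea can be sketched as follows. The ADR does not define the schema of `cube.pe_template` or `cube.pe_layout`, so the dictionary shapes, the grid layout with `rows` and `cols`, and the `PE_<row>_<col>` naming are all assumptions made for illustration.

```python
import copy

# Hypothetical shapes for cube.pe_template and cube.pe_layout.
pe_template = {
    "components": ["PE_CPU", "PE_SCHEDULER", "PE_DMA", "PE_GEMM", "PE_MATH", "PE_TCM"],
}
pe_layout = {"rows": 2, "cols": 2}


def instantiate_pes(template, layout):
    """Create one independent copy of the template per PE in the layout."""
    pes = {}
    for row in range(layout["rows"]):
        for col in range(layout["cols"]):
            name = f"PE_{row}_{col}"
            # Deep copy so per-PE state never aliases the shared template.
            pes[name] = copy.deepcopy(template)
    return pes
```

Iterating rows and columns in a fixed order keeps the instance order stable, which fits the determinism requirement elsewhere in this ADR.
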
---

## Links

- SPEC R3, R4
- ADR-0003 D4 (PE-level system hierarchy)
- ADR-0005 View C (PE-level diagram)
- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)