# ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)

## Status

Accepted

## Context

ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:

- the dispatch model inside a PE,
- the responsibilities of PE_SCHEDULER,
- the PE_TCM-centric dataflow contract used by accelerator engines.

We need a deterministic and debuggable PE-internal execution contract that supports:

- simple single-engine commands
- composite commands that build a tiled pipeline across DMA and accelerator engines

The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.

## Decision

### D1. PE internal component roles

Each PE contains the following logical components.

**PE_CPU**

- Executes kernel instruction stream or kernel control logic.
- Generates PE commands.
- Submits commands to PE_SCHEDULER.
- PE_CPU does NOT enqueue work directly into engine queues.

**PE_SCHEDULER**

- The sole dispatcher inside a PE.
- Receives commands from PE_CPU.
- Expands composite commands into sub-commands.
- Tracks dependencies and command state.
- Dispatches work to engine queues.
- Manages tile scheduling for composite commands.

**PE_DMA**

- Handles memory transfers between PE_TCM and external memory domains.
- PE_DMA connects to the NOC router mesh at the CUBE level (ADR-0019):
  - All destinations (HBM, shared SRAM, inter-cube UCIe) are reached via the router mesh
  - Local HBM access: PE_DMA → local router → hbm_ctrl (switching overhead only)
  - Remote/shared: PE_DMA → local router → (mesh hops) → destination
- Supported directions include:
  - HBM → PE_TCM (via router mesh)
  - PE_TCM → HBM (via router mesh)
  - PE_TCM → shared SRAM (via router mesh)
  - PE_TCM → other memory domains (via router mesh, if supported by topology)

**PE_GEMM**

- Matrix multiplication engine.
- Reads activations from PE_TCM.
- May stream weights directly from HBM.

**PE_MATH**

- Element-wise computation engine.
- Reads and writes PE_TCM.

**PE_TCM**

- Local SRAM used as the staging memory for accelerator operations.

---

### D2. Command lifecycle and queues

PE_SCHEDULER maintains three logical structures.

**SubmissionQueue**

- Written by PE_CPU.
- Contains incoming PE commands waiting to be processed.

**InflightTable**

- Owned and mutated only by PE_SCHEDULER.
- Tracks:
  - expanded sub-commands
  - dependency state
  - engine assignment
  - completion status

**CompletionQueue**

- Written by PE_SCHEDULER.
- Contains final completion records for commands.

**Single-writer rule**

- Only PE_SCHEDULER is allowed to mutate command completion state.
- Engine components must report completion via explicit completion events/messages.

**Command completion**

A command becomes DONE when both of the following hold:

- all of its sub-commands have completed, and
- PE_SCHEDULER has published a completion record to CompletionQueue.
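
The lifecycle above can be sketched in a few lines of plain Python. This is only an illustration of the single-writer rule, not the simulator's implementation: the class `PEScheduler`, the `InflightEntry` layout, and the method names `submit`, `accept_next`, and `on_engine_complete` are all hypothetical. Engines never touch command state; they only call the completion hook, and the scheduler alone moves a command from the InflightTable to the CompletionQueue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InflightEntry:
    """Scheduler-private bookkeeping for one command (hypothetical layout)."""
    cmd_id: int
    pending: set = field(default_factory=set)  # sub-command ids not yet complete


class PEScheduler:
    """Hypothetical model of the three D2 structures."""

    def __init__(self):
        self.submission_queue = deque()  # written by PE_CPU
        self.inflight = {}               # mutated only by the scheduler
        self.completion_queue = deque()  # written only by the scheduler

    def submit(self, cmd_id, sub_ids):
        """PE_CPU side: enqueue a command with the sub-commands it expands to."""
        self.submission_queue.append((cmd_id, tuple(sub_ids)))

    def accept_next(self):
        """Move the oldest submitted command into the InflightTable."""
        cmd_id, sub_ids = self.submission_queue.popleft()
        self.inflight[cmd_id] = InflightEntry(cmd_id, set(sub_ids))

    def on_engine_complete(self, cmd_id, sub_id):
        """Engines report completion here; only the scheduler updates state."""
        entry = self.inflight[cmd_id]
        entry.pending.discard(sub_id)
        if not entry.pending:
            # All sub-commands finished: publish the completion record.
            del self.inflight[cmd_id]
            self.completion_queue.append(cmd_id)


sched = PEScheduler()
sched.submit(cmd_id=7, sub_ids=["dma_read", "gemm", "dma_write"])
sched.accept_next()
for sub in ["dma_read", "gemm"]:
    sched.on_engine_complete(7, sub)
still_inflight = 7 in sched.inflight      # one sub-command is still pending
sched.on_engine_complete(7, "dma_write")
completed = list(sched.completion_queue)  # command 7 is now DONE
```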

---

### D3. Dispatch modes

PE commands are divided into two categories.

#### D3.1 Simple command

A simple command expands to exactly one engine sub-command.

Examples include:

- DMA transfer
- GEMM compute
- MATH compute

Execution flow:

```text
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
```

#### D3.2 Composite command (tiled pipeline)

Composite commands implement tiled pipelined execution across engines.

Each tile executes the following pipeline:

```text
Input DMA (READ)
→ Compute (GEMM or MATH)
→ Output DMA (WRITE)
```

**Tiling rule**

If the DMA payload exceeds the hardware tile size, PE_SCHEDULER splits the transfer into tiles.
Each tile is assigned a monotonically increasing `tile_id`.
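
A minimal sketch of the tiling rule, assuming tiles are contiguous byte ranges and that the last tile may be shorter than the hardware tile size. The function name `split_into_tiles` and its parameters are hypothetical, and the sizes in the example are made up for illustration.

```python
def split_into_tiles(payload_bytes, tile_bytes):
    """Return (tile_id, offset, length) tuples covering the payload in order.

    Assumption: tiles are contiguous and the final tile may be short.
    A payload that fits in one tile yields a single tile with tile_id 0.
    """
    if payload_bytes < 0 or tile_bytes <= 0:
        raise ValueError("payload_bytes must be >= 0 and tile_bytes must be > 0")
    tiles = []
    offset = 0
    tile_id = 0
    while offset < payload_bytes:
        length = min(tile_bytes, payload_bytes - offset)
        tiles.append((tile_id, offset, length))
        offset += length
        tile_id += 1  # tile_id increases monotonically
    return tiles


tiles = split_into_tiles(payload_bytes=10_000, tile_bytes=4_096)
# tiles == [(0, 0, 4096), (1, 4096, 4096), (2, 8192, 1808)]
```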

**Tile dependency rules**

For tile `t`:

- Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)`
- Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)`
- All dependencies are enforced by PE_SCHEDULER.
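
The per-tile dependency rules can be written down as a small dependency map. This is a sketch only; `tile_dependencies` and `is_ready` are hypothetical helper names, and the tuple keys `("DMA_READ", t)` are just one convenient encoding of the sub-commands.

```python
def tile_dependencies(num_tiles):
    """Map each sub-command to the sub-commands it must wait for."""
    deps = {}
    for t in range(num_tiles):
        deps[("DMA_READ", t)] = []                   # no intra-tile prerequisite
        deps[("COMPUTE", t)] = [("DMA_READ", t)]     # DMA_READ(t) -> COMPUTE(t)
        deps[("DMA_WRITE", t)] = [("COMPUTE", t)]    # COMPUTE(t) -> DMA_WRITE(t)
    return deps


def is_ready(sub, done, deps):
    """A sub-command may be dispatched once all of its prerequisites are done."""
    return all(d in done for d in deps[sub])


deps = tile_dependencies(2)
done = {("DMA_READ", 0)}
compute0_ready = is_ready(("COMPUTE", 0), done, deps)    # True
write0_ready = is_ready(("DMA_WRITE", 0), done, deps)    # False, COMPUTE(0) pending
read1_ready = is_ready(("DMA_READ", 1), done, deps)      # True, other tile
```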

**Overlap policy (Phase 0 default)**

Operations for different tiles may overlap when engine resources permit.

Allowed overlaps:

```text
DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t−1) ∥ COMPUTE(t)
DMA_READ(t+1) ∥ DMA_WRITE(t)
```

Disallowed overlaps:

```text
GEMM(t) ∥ GEMM(t′)
MATH(t) ∥ MATH(t′)
GEMM(t) ∥ MATH(t′)
```

---

### D4. Engine execution model (Phase 0 default)

Each engine behaves as a deterministic service resource.

**DMA engine**

PE_DMA contains two independent channels.

```text
DMA_READ capacity = 1
DMA_WRITE capacity = 1
```

Rules:

- DMA_READ and DMA_WRITE may execute concurrently.
- Multiple READs cannot overlap.
- Multiple WRITEs cannot overlap.

Example allowed:

```text
DMA_READ(t+1) ∥ DMA_WRITE(t)
```

Example not allowed:

```text
DMA_READ(t) ∥ DMA_READ(t+1)
DMA_WRITE(t) ∥ DMA_WRITE(t+1)
```

**Compute engine**

Compute operations share a single compute resource.

```text
PE_ACCEL capacity = 1
```

Both GEMM and MATH require this shared compute slot.

Consequences:

- GEMM ∥ GEMM not allowed
- MATH ∥ MATH not allowed
- GEMM ∥ MATH not allowed

Only one compute operation can run in a PE at a time.
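
To make the capacity rules concrete, the following sketch computes a deterministic timeline for a composite command with one READ channel, one WRITE channel, and one shared compute slot. It is a simplified model under stated assumptions: fixed per-stage durations, tiles served strictly in `tile_id` order, and no limit on tile buffers (so reads may run ahead freely). The function name `schedule_tiles` and the duration values are hypothetical.

```python
def schedule_tiles(num_tiles, read_t, compute_t, write_t):
    """Return per-tile (start, end) times for READ, COMPUTE, and WRITE.

    Each resource has capacity 1, so a stage starts at the later of
    (a) its prerequisite finishing and (b) its resource becoming free.
    """
    free = {"DMA_READ": 0, "PE_ACCEL": 0, "DMA_WRITE": 0}
    timeline = []
    for t in range(num_tiles):
        r_start = free["DMA_READ"]
        r_end = r_start + read_t
        free["DMA_READ"] = r_end

        c_start = max(r_end, free["PE_ACCEL"])   # DMA_READ(t) -> COMPUTE(t)
        c_end = c_start + compute_t
        free["PE_ACCEL"] = c_end

        w_start = max(c_end, free["DMA_WRITE"])  # COMPUTE(t) -> DMA_WRITE(t)
        w_end = w_start + write_t
        free["DMA_WRITE"] = w_end

        timeline.append({
            "tile": t,
            "read": (r_start, r_end),
            "compute": (c_start, c_end),
            "write": (w_start, w_end),
        })
    return timeline


timeline = schedule_tiles(num_tiles=3, read_t=2, compute_t=3, write_t=2)
# Tile 1 is read during (2, 4) while tile 0 computes during (2, 5):
# DMA_READ(t+1) overlaps COMPUTE(t), but compute intervals never overlap.
```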

**Compute opcode restriction**

Composite commands contain one compute opcode only.

Examples:

```text
COMPOSITE_GEMM
COMPOSITE_MATH
```

Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0.

**Engine completion signaling**

Every engine emits a completion event when a sub-command finishes.
Completion events are delivered to PE_SCHEDULER.

---

### D5. Dataflow model

Compute operations use a TCM-centric dataflow model.

**Input path (HBM)**

```text
HBM → router mesh → PE_DMA (DMA_READ) → PE_TCM
```

**Input path (shared SRAM)**

```text
Shared SRAM → router mesh → PE_DMA (DMA_READ) → PE_TCM
```

**Compute stage**

Compute engines read input tensors from PE_TCM.

```text
PE_TCM → GEMM / MATH
```

Weights for GEMM may optionally stream directly from HBM (via router mesh).

**Output path (HBM)**

Compute results are written to PE_TCM, then DMA writes to HBM.

```text
PE_TCM → PE_DMA (DMA_WRITE) → router mesh → HBM
```

**Output path (shared SRAM)**

```text
PE_TCM → PE_DMA (DMA_WRITE) → router mesh → Shared SRAM
```

#### D5.1 PE_TCM partitioning and ownership boundary

The PE_TCM address space is partitioned into two logical regions.

**SchedulerReservedTCM**

- A staging region owned exclusively by PE_SCHEDULER.
- This region is used for composite command tile buffers.
- PE_SCHEDULER:
  - partitions this region into tile buffers
  - assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
  - guarantees input/output buffer separation
  - manages tile buffer lifetime

**AllocatableTCM**

- General-purpose region managed by PEMemAllocator.
- Used for host-visible or DP-visible allocations.

**Visibility rule (hard isolation)**

- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
- SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
- This prevents DP or host allocations from interfering with scheduler staging buffers.

**Tile buffer rules**

Within SchedulerReservedTCM:

- input buffers and output buffers must not overlap
- PE_SCHEDULER assigns tile buffers for DMA and compute stages
- tile buffers remain valid until the corresponding DMA_WRITE completes
- buffer reuse is allowed only after the tile lifetime finishes
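
The partitioning and tile buffer rules can be illustrated with a small address-range sketch. Everything concrete here is an assumption made for the example: the reserved region is placed at the bottom of the TCM, each tile slot gets one input and one output buffer of equal size, and the names `Region`, `partition_tcm`, and `carve_tile_buffers` as well as the byte sizes are hypothetical. The allocator would be handed only the second region, which is how exclusion "by construction" is modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Half-open address range [base, base + size)."""
    base: int
    size: int

    @property
    def end(self):
        return self.base + self.size

    def overlaps(self, other):
        return self.base < other.end and other.base < self.end


def partition_tcm(tcm_bytes, reserved_bytes):
    """Split PE_TCM into (SchedulerReservedTCM, AllocatableTCM)."""
    if not 0 <= reserved_bytes <= tcm_bytes:
        raise ValueError("reserved_bytes must fit inside the TCM")
    reserved = Region(0, reserved_bytes)
    allocatable = Region(reserved_bytes, tcm_bytes - reserved_bytes)
    return reserved, allocatable


def carve_tile_buffers(reserved, tile_bytes, num_slots):
    """Carve num_slots disjoint (input, output) buffer pairs from the reserved region."""
    if 2 * tile_bytes * num_slots > reserved.size:
        raise ValueError("reserved region too small for requested tile buffers")
    pairs = []
    for slot in range(num_slots):
        in_buf = Region(reserved.base + 2 * slot * tile_bytes, tile_bytes)
        out_buf = Region(in_buf.end, tile_bytes)  # output never aliases input
        pairs.append((in_buf, out_buf))
    return pairs


reserved, allocatable = partition_tcm(tcm_bytes=256 * 1024, reserved_bytes=64 * 1024)
pairs = carve_tile_buffers(reserved, tile_bytes=8 * 1024, num_slots=2)
```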

---

### D6. Observability and trace contract

The simulator must emit deterministic trace events.

Required events include:

- `command_submitted`
- `sub_command_dispatched`
- `engine_start`
- `engine_complete`
- `tile_ready`
- `command_complete`

Trace ordering must be deterministic for identical inputs.
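
One way to obtain a deterministic ordering is to tag every event with a per-run sequence number and order by `(time, sequence)`, so that events at the same simulated time keep their emission order. The sketch below assumes that approach; the `TraceLog` class, the record fields, and the point at which `tile_ready` is emitted are illustrative choices, not part of this ADR.

```python
import json


class TraceLog:
    """Hypothetical trace sink with a stable tie-break for same-time events."""

    def __init__(self):
        self._events = []
        self._seq = 0

    def emit(self, time, event, **fields):
        self._events.append((time, self._seq, event, fields))
        self._seq += 1

    def dump(self):
        # Order by simulated time, then by emission order; serialize with
        # sorted keys so the text output is byte-identical across runs.
        ordered = sorted(self._events, key=lambda e: (e[0], e[1]))
        return [
            json.dumps({"t": t, "seq": s, "event": ev, **f}, sort_keys=True)
            for t, s, ev, f in ordered
        ]


def run_once():
    log = TraceLog()
    log.emit(0, "command_submitted", cmd=1)
    log.emit(0, "sub_command_dispatched", cmd=1, engine="DMA_READ", tile=0)
    log.emit(0, "engine_start", engine="DMA_READ", tile=0)
    log.emit(2, "engine_complete", engine="DMA_READ", tile=0)
    log.emit(2, "tile_ready", tile=0)  # assumed to follow the tile's DMA_READ
    log.emit(9, "command_complete", cmd=1)
    return log.dump()


first, second = run_once(), run_once()
identical = first == second  # identical inputs give identical traces
```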

---

### D7. Topology representation

PE internal components are declared in `cube.pe_template`.

The template is instantiated once per PE.

PE instances are derived from `cube.pe_layout`.

External connectivity such as:

- PE_DMA → router mesh → HBM (data path, ADR-0019)
- PE_DMA → router mesh → shared SRAM, inter-cube UCIe (non-HBM data path)
- router mesh → PE_CPU (command path from M_CPU)

is modeled at the CUBE level (see ADR-0003 D3).

---

## Links

- SPEC R3, R4
- ADR-0003 D4 (PE-level system hierarchy)
- ADR-0005 View C (PE-level diagram)
- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)