- Remove xbar_top/bot, bridge, single noc node from topology
- Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col})
- HBM_CTRL consolidated to single node per cube, attached to all routers
- All traffic (DMA data + PE command) routes through same router mesh
- Update AddressResolver (no slice suffix), PathRouter (_adj_local)
- Update ADR-0002~0019, SPEC.md to remove xbar/bridge references
- Regenerate SVG diagrams for new topology structure
- Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired)
326 passed, 13 skipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)
Status
Accepted
Context
ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:
- the dispatch model inside a PE,
- the responsibilities of PE_SCHEDULER,
- the PE_TCM-centric dataflow contract used by accelerator engines.
We need a deterministic and debuggable PE-internal execution contract that supports:
- simple single-engine commands
- composite commands that build a tiled pipeline across DMA and accelerator engines
The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.
Decision
D1. PE internal component roles
Each PE contains the following logical components.
PE_CPU
- Executes the kernel instruction stream or kernel control logic.
- Generates PE commands.
- Submits commands to PE_SCHEDULER.
- PE_CPU does NOT enqueue work directly into engine queues.
PE_SCHEDULER
- The sole dispatcher inside a PE.
- Receives commands from PE_CPU.
- Expands composite commands into sub-commands.
- Tracks dependencies and command state.
- Dispatches work to engine queues.
- Manages tile scheduling for composite commands.
PE_DMA
- Handles memory transfers between PE_TCM and external memory domains.
- PE_DMA connects to the NOC router mesh at the CUBE level (ADR-0019):
  - All destinations (HBM, shared SRAM, inter-cube UCIe) are reached via the router mesh.
  - Local HBM access: PE_DMA → local router → hbm_ctrl (switching overhead only)
  - Remote/shared: PE_DMA → local router → (mesh hops) → destination
- Supported directions include:
  - HBM → PE_TCM (via router mesh)
  - PE_TCM → HBM (via router mesh)
  - PE_TCM → shared SRAM (via router mesh)
  - PE_TCM → other memory domains (via router mesh, if supported by topology)
PE_GEMM
- Matrix multiplication engine.
- Reads activations from PE_TCM.
- May stream weights directly from HBM.
PE_MATH
- Element-wise computation engine.
- Reads and writes PE_TCM.
PE_TCM
- Local SRAM used as the staging memory for accelerator operations.
D2. Command lifecycle and queues
PE_SCHEDULER maintains three logical structures.
SubmissionQueue
- Written by PE_CPU.
- Contains incoming PE commands waiting to be processed.
InflightTable
- Owned and mutated only by PE_SCHEDULER.
- Tracks:
  - expanded sub-commands
  - dependency state
  - engine assignment
  - completion status
CompletionQueue
- Written by PE_SCHEDULER.
- Contains final completion records for commands.
Single-writer rule
- Only PE_SCHEDULER is allowed to mutate command completion state.
- Engine components must report completion via explicit completion events/messages.
Command completion
A command becomes DONE when both of the following hold:
- all of its sub-commands have completed
- PE_SCHEDULER has published a completion record to CompletionQueue.
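The three structures and the single-writer rule can be sketched as follows. This is a minimal illustration, not the simulator's implementation: the class name `PEScheduler`, the `InflightEntry` fields, and the method names are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InflightEntry:
    """Scheduler-owned record for one command (field names are illustrative)."""
    command_id: int
    pending: set = field(default_factory=set)  # sub-command ids not yet complete


class PEScheduler:
    """Hypothetical model of SubmissionQueue, InflightTable, and CompletionQueue."""

    def __init__(self):
        self.submission_queue = deque()   # written by PE_CPU
        self.inflight = {}                # mutated only by the scheduler
        self.completion_queue = deque()   # written only by the scheduler

    def submit(self, command_id, sub_command_ids):
        """PE_CPU side: enqueue a command; it never touches engine queues."""
        self.submission_queue.append((command_id, list(sub_command_ids)))

    def accept_next(self):
        """Move one command from SubmissionQueue into the InflightTable."""
        command_id, subs = self.submission_queue.popleft()
        self.inflight[command_id] = InflightEntry(command_id, set(subs))
        return subs  # sub-commands the scheduler would now dispatch

    def on_engine_complete(self, command_id, sub_id):
        """Engines report completion as an event; the scheduler updates state."""
        entry = self.inflight[command_id]
        entry.pending.discard(sub_id)
        if not entry.pending:
            # All sub-commands finished: publish the completion record.
            del self.inflight[command_id]
            self.completion_queue.append(
                {"command_id": command_id, "status": "DONE"}
            )


sched = PEScheduler()
sched.submit(7, ["dma_read", "gemm", "dma_write"])
sched.accept_next()
for sub in ["dma_read", "gemm"]:
    sched.on_engine_complete(7, sub)
partial = list(sched.completion_queue)   # still empty: one sub-command pending
sched.on_engine_complete(7, "dma_write")
final = list(sched.completion_queue)
```

Engines never write to `inflight` or `completion_queue` themselves; they only call `on_engine_complete`, which keeps all state mutation inside the scheduler.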
D3. Dispatch modes
PE commands are divided into two categories.
D3.1 Simple command
A simple command expands to exactly one engine sub-command.
Examples include:
- DMA transfer
- GEMM compute
- MATH compute
Execution flow:
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
D3.2 Composite command (tiled pipeline)
Composite commands implement tiled pipelined execution across engines.
Each tile executes the following pipeline:
Input DMA (READ)
→ Compute (GEMM or MATH)
→ Output DMA (WRITE)
Tiling rule
If the DMA payload exceeds hardware tile size, PE_SCHEDULER splits the transfer into tiles.
Each tile is assigned a monotonically increasing tile_id.
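A minimal sketch of the tiling rule, assuming tile ids start at 0 and that the last tile may be shorter than the hardware tile size (neither detail is stated here). The function name `split_into_tiles` is hypothetical.

```python
def split_into_tiles(total_bytes, tile_bytes):
    """Return (tile_id, offset, length) tuples that cover total_bytes.

    tile_id increases monotonically; a payload that fits in one tile
    yields a single entry.
    """
    if total_bytes < 0 or tile_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and tile_bytes > 0")
    tiles = []
    offset = 0
    tile_id = 0
    while offset < total_bytes:
        length = min(tile_bytes, total_bytes - offset)
        tiles.append((tile_id, offset, length))
        offset += length
        tile_id += 1
    return tiles


tiles = split_into_tiles(10_000, 4096)
```

For a 10,000-byte payload and a 4,096-byte tile this produces three tiles, the last one covering the remaining 1,808 bytes.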
Tile dependency rules
For tile t:
- Compute must wait for input DMA:
  DMA_READ(t) → COMPUTE(t)
- Output DMA must wait for compute:
  COMPUTE(t) → DMA_WRITE(t)
- All dependencies are enforced by PE_SCHEDULER.
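The per-tile dependency rules can be written as a small dependency table that the scheduler consults before dispatch. This is a sketch only; the tuple keys and the helper names `tile_dependencies` and `is_ready` are assumptions for illustration.

```python
def tile_dependencies(num_tiles):
    """Map each sub-command to the sub-commands it must wait for."""
    deps = {}
    for t in range(num_tiles):
        deps[("DMA_READ", t)] = []                    # no intra-tile predecessor
        deps[("COMPUTE", t)] = [("DMA_READ", t)]      # DMA_READ(t) -> COMPUTE(t)
        deps[("DMA_WRITE", t)] = [("COMPUTE", t)]     # COMPUTE(t) -> DMA_WRITE(t)
    return deps


def is_ready(sub_command, deps, done):
    """A sub-command may be dispatched once all its predecessors are done."""
    return all(d in done for d in deps[sub_command])


deps = tile_dependencies(2)
ready_before = is_ready(("COMPUTE", 0), deps, done=set())
ready_after = is_ready(("COMPUTE", 0), deps, done={("DMA_READ", 0)})
```

Note that the table has no edges between different tiles, which is what leaves room for cross-tile overlap when engine resources permit.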
Overlap policy (Phase 0 default)
Operations for different tiles may overlap when engine resources permit.
Allowed overlaps:
DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t−1) ∥ COMPUTE(t)
DMA_READ(t+1) ∥ DMA_WRITE(t)
Disallowed overlaps:
GEMM(t) ∥ GEMM(t′)
MATH(t) ∥ MATH(t′)
GEMM(t) ∥ MATH(t′)
D4. Engine execution model (Phase 0 default)
Each engine behaves as a deterministic service resource.
DMA engine
PE_DMA contains two independent channels.
DMA_READ capacity = 1
DMA_WRITE capacity = 1
Rules:
- DMA_READ and DMA_WRITE may execute concurrently.
- Multiple READs cannot overlap.
- Multiple WRITEs cannot overlap.
Example allowed:
DMA_READ(t+1) ∥ DMA_WRITE(t)
Example not allowed:
DMA_READ(t) ∥ DMA_READ(t+1)
DMA_WRITE(t) ∥ DMA_WRITE(t+1)
Compute engine
Compute operations share a single compute resource.
PE_ACCEL capacity = 1
Both GEMM and MATH require this shared compute slot.
Consequences:
- GEMM ∥ GEMM not allowed
- MATH ∥ MATH not allowed
- GEMM ∥ MATH not allowed
Only one compute operation can run in a PE at a time.
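To see how the capacity-1 resources shape a composite command's timeline, the sketch below computes start and end times for each tile. It assumes fixed per-stage durations, in-order issue by tile_id, and no limit on tile buffers; these are simplifications for illustration, and `schedule_tiles` is a hypothetical name.

```python
def schedule_tiles(num_tiles, read_t, compute_t, write_t):
    """Return per-tile (start, end) times for read, compute, and write.

    Each of DMA_READ, PE_ACCEL, and DMA_WRITE serves one operation at a
    time, and each stage also waits for the previous stage of its own tile.
    """
    read_free = accel_free = write_free = 0  # time each resource becomes idle
    timeline = []
    for t in range(num_tiles):
        r0 = read_free
        r1 = r0 + read_t
        read_free = r1

        c0 = max(r1, accel_free)   # needs DMA_READ(t) and the compute slot
        c1 = c0 + compute_t
        accel_free = c1

        w0 = max(c1, write_free)   # needs COMPUTE(t) and the write channel
        w1 = w0 + write_t
        write_free = w1

        timeline.append(
            {"tile": t, "read": (r0, r1), "compute": (c0, c1), "write": (w0, w1)}
        )
    return timeline


timeline = schedule_tiles(3, read_t=2, compute_t=3, write_t=1)
makespan = timeline[-1]["write"][1]
```

With these durations the three tiles finish at time 12, compared with 18 for fully serial execution, because the read of tile t+1 proceeds while tile t is computing.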
Compute opcode restriction
Composite commands contain one compute opcode only.
Examples:
COMPOSITE_GEMM
COMPOSITE_MATH
Mixed compute pipelines such as GEMM → MATH are not supported in Phase 0.
Engine completion signaling
Every engine emits a completion event when a sub-command finishes. Completion events are delivered to PE_SCHEDULER.
D5. Dataflow model
Compute operations use a TCM-centric dataflow model.
Input path (HBM)
HBM → router mesh → PE_DMA (DMA_READ) → PE_TCM
Input path (shared SRAM)
Shared SRAM → router mesh → PE_DMA (DMA_READ) → PE_TCM
Compute stage
Compute engines read input tensors from PE_TCM.
PE_TCM → GEMM / MATH
Weights for GEMM may optionally stream directly from HBM (via router mesh).
Output path (HBM)
Compute results are written to PE_TCM, then DMA writes to HBM.
PE_TCM → PE_DMA (DMA_WRITE) → router mesh → HBM
Output path (shared SRAM)
PE_TCM → PE_DMA (DMA_WRITE) → router mesh → Shared SRAM
D5.1 PE_TCM partitioning and ownership boundary
The PE_TCM address space is partitioned into two logical regions.
SchedulerReservedTCM
- A staging region owned exclusively by PE_SCHEDULER.
- This region is used for composite command tile buffers.
- PE_SCHEDULER:
  - partitions this region into tile buffers
  - assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
  - guarantees input/output buffer separation
  - manages tile buffer lifetime
AllocatableTCM
- General-purpose region managed by PEMemAllocator.
- Used for host-visible or DP-visible allocations.
Visibility rule (hard isolation)
- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
- SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
- This prevents DP or host allocations from interfering with scheduler staging buffers.
Tile buffer rules
Within SchedulerReservedTCM:
- input buffers and output buffers must not overlap
- PE_SCHEDULER assigns tile buffers for DMA and compute stages
- tile buffers remain valid until the corresponding DMA_WRITE completes
- Buffer reuse is allowed only after the tile lifetime finishes.
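The isolation and non-overlap rules can be illustrated with a small partitioning sketch. The layout (reserved region at the bottom of PE_TCM, adjacent input/output buffer pairs) and the names `Region`, `partition_tcm`, and `carve_tile_buffers` are assumptions for the example, not the simulator's actual layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    base: int
    size: int

    @property
    def end(self):
        return self.base + self.size

    def overlaps(self, other):
        return self.base < other.end and other.base < self.end


def partition_tcm(tcm_size, reserved_size):
    """Split PE_TCM into SchedulerReservedTCM and AllocatableTCM."""
    if not 0 <= reserved_size <= tcm_size:
        raise ValueError("reserved_size must lie within the TCM")
    reserved = Region(0, reserved_size)
    # The allocator is constructed over this range only, so it cannot
    # hand out addresses inside the reserved region.
    allocatable = Region(reserved_size, tcm_size - reserved_size)
    return reserved, allocatable


def carve_tile_buffers(reserved, tile_bytes, num_slots):
    """Carve num_slots (input, output) buffer pairs from the reserved region."""
    if 2 * tile_bytes * num_slots > reserved.size:
        raise ValueError("reserved region too small for requested tile buffers")
    slots = []
    for i in range(num_slots):
        base = reserved.base + 2 * i * tile_bytes
        slots.append((Region(base, tile_bytes), Region(base + tile_bytes, tile_bytes)))
    return slots


reserved, allocatable = partition_tcm(tcm_size=64 * 1024, reserved_size=16 * 1024)
slots = carve_tile_buffers(reserved, tile_bytes=4096, num_slots=2)
```

Two slots allow one tile to be staged while another is being computed or written back; a slot would be reused only after its tile's DMA_WRITE completes.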
D6. Observability and trace contract
The simulator must emit deterministic trace events.
Required events include:
command_submitted
sub_command_dispatched
engine_start
engine_complete
tile_ready
command_complete
Trace ordering must be deterministic for identical inputs.
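One way to get a reproducible ordering is to sort events by simulation time and break ties with an emission sequence number. The sketch below assumes that tie-break rule and a JSON-lines output format; neither is specified by this ADR, and `TraceLog` is a hypothetical name.

```python
import json

EVENT_TYPES = (
    "command_submitted",
    "sub_command_dispatched",
    "engine_start",
    "engine_complete",
    "tile_ready",
    "command_complete",
)


class TraceLog:
    """Collects trace events and dumps them in a deterministic order."""

    def __init__(self):
        self._events = []
        self._seq = 0

    def emit(self, time, event, **fields):
        if event not in EVENT_TYPES:
            raise ValueError(f"unknown trace event: {event}")
        self._events.append((time, self._seq, event, fields))
        self._seq += 1

    def dump(self):
        # Sort by (time, sequence number); sort_keys fixes the field order.
        ordered = sorted(self._events, key=lambda e: (e[0], e[1]))
        return "\n".join(
            json.dumps({"time": t, "seq": s, "event": ev, **f}, sort_keys=True)
            for t, s, ev, f in ordered
        )


def run_once():
    log = TraceLog()
    log.emit(0, "command_submitted", command_id=1)
    log.emit(0, "sub_command_dispatched", command_id=1, engine="DMA_READ", tile_id=0)
    log.emit(0, "engine_start", engine="DMA_READ", tile_id=0)
    log.emit(2, "engine_complete", engine="DMA_READ", tile_id=0)
    log.emit(2, "tile_ready", tile_id=0)
    return log.dump()


first, second = run_once(), run_once()
```

Comparing the dumps of two runs with identical inputs is a simple regression check for accidental nondeterminism.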
D7. Topology representation
PE internal components are declared in cube.pe_template.
The template is instantiated once per PE.
PE instances are derived from cube.pe_layout.
External connectivity such as:
- PE_DMA → router mesh → HBM (data path, ADR-0019)
- PE_DMA → router mesh → shared SRAM, inter-cube UCIe (non-HBM data path)
- router mesh → PE_CPU (command path from M_CPU)
is modeled at the CUBE level (see ADR-0003 D3).
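As an illustration of template instantiation, the sketch below expands a template once per PE listed in the layout. The dictionary shape (a `components` list under `pe_template`, a list of PE names under `pe_layout`) and the `pe.component` naming are assumptions; the real topology schema may differ.

```python
def instantiate_pes(cube):
    """Expand cube['pe_template'] once for every PE in cube['pe_layout']."""
    template = cube["pe_template"]
    nodes = []
    for pe_name in cube["pe_layout"]:
        for component in template["components"]:
            nodes.append(f"{pe_name}.{component}")
    return nodes


# Hypothetical cube description; key names are illustrative only.
cube = {
    "pe_template": {
        "components": ["PE_CPU", "PE_SCHEDULER", "PE_DMA", "PE_GEMM", "PE_MATH", "PE_TCM"],
    },
    "pe_layout": ["pe0", "pe1"],
}
nodes = instantiate_pes(cube)
```

Every PE gets the same six internal components, while the links from PE_DMA and PE_CPU to the router mesh would be added separately at the CUBE level.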
Links
- SPEC R3, R4
- ADR-0003 D4 (PE-level system hierarchy)
- ADR-0005 View C (PE-level diagram)
- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)