ywkang 32b29a1e5c ADR-0003/0014: generalize "router mesh" to "NOC"

NOC topology is an implementation choice (mesh, ring, crossbar, etc.).
ADR-0017 covers the current 2D mesh choice; ADRs at the system-level
shouldn't bind to that specific implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-14 23:23:46 -07:00

ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)

Status

Accepted

Context

ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:

the dispatch model inside a PE,
the responsibilities of PE_SCHEDULER,
the PE_TCM-centric dataflow contract used by accelerator engines.

We need a deterministic and debuggable PE-internal execution contract that supports:

simple single-engine commands
composite commands that build a tiled pipeline across DMA and accelerator engines

The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.

Decision

D1. PE internal component roles

Each PE contains the following logical components.

PE_CPU

Executes the kernel instruction stream or kernel control logic.
Generates PE commands.
Submits commands to PE_SCHEDULER.
PE_CPU does NOT enqueue work directly into engine queues.

PE_SCHEDULER

The sole dispatcher inside a PE.
Receives commands from PE_CPU.
Expands composite commands into sub-commands.
Tracks dependencies and command state.
Dispatches work to engine queues.
Manages tile scheduling for composite commands.

PE_DMA

Handles memory transfers between PE_TCM and external memory domains.
PE_DMA connects to the cube-level NOC (on-die fabric):
- All destinations (HBM, shared SRAM, inter-cube UCIe) are reached via the NOC
- Local HBM access: PE_DMA → NOC → hbm_ctrl (minimal hop)
- Remote/shared: PE_DMA → NOC → (fabric hops) → destination
Supported directions include:
- HBM → PE_TCM (via NOC)
- PE_TCM → HBM (via NOC)
- PE_TCM → shared SRAM (via NOC)
- PE_TCM → other memory domains (via NOC, if supported by topology)

PE_GEMM

Matrix multiplication engine.
Reads activations from PE_TCM.
May stream weights directly from HBM.

PE_MATH

Element-wise computation engine.
Reads and writes PE_TCM.

PE_TCM

Local SRAM used as the staging memory for accelerator operations.

D2. Command lifecycle and queues

PE_SCHEDULER maintains three logical structures.

SubmissionQueue

Written by PE_CPU.
Contains incoming PE commands waiting to be processed.

InflightTable

Owned and mutated only by PE_SCHEDULER.
Tracks:
- expanded sub-commands
- dependency state
- engine assignment
- completion status

CompletionQueue

Written by PE_SCHEDULER.
Contains final completion records for commands.

Single-writer rule

Only PE_SCHEDULER is allowed to mutate command completion state.
Engine components must report completion via explicit completion events/messages.

Command completion

A command becomes DONE only when both of the following hold:

all of its sub-commands have completed
PE_SCHEDULER has published a completion record to CompletionQueue
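
The three structures and the single-writer rule can be sketched as below. This is a minimal illustration, not the simulator's API: the class name `Scheduler`, the methods `submit`, `expand_next`, and `on_engine_complete`, and the use of string ids are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InflightEntry:
    # Sub-command ids still outstanding for one command (hypothetical layout).
    pending: set = field(default_factory=set)


class Scheduler:
    """Illustrative stand-in for PE_SCHEDULER; all names are assumptions."""

    def __init__(self):
        self.submission_queue = deque()   # written by PE_CPU
        self.inflight = {}                # mutated only by the scheduler
        self.completion_queue = deque()   # written only by the scheduler

    def submit(self, cmd_id, sub_ids):
        # PE_CPU side: enqueue a command; it never touches engine queues.
        self.submission_queue.append((cmd_id, tuple(sub_ids)))

    def expand_next(self):
        # Scheduler side: move one command from SubmissionQueue to InflightTable.
        cmd_id, sub_ids = self.submission_queue.popleft()
        self.inflight[cmd_id] = InflightEntry(pending=set(sub_ids))

    def on_engine_complete(self, cmd_id, sub_id):
        # Engines only deliver this event; the scheduler mutates the state.
        entry = self.inflight[cmd_id]
        entry.pending.discard(sub_id)
        if not entry.pending:
            del self.inflight[cmd_id]
            self.completion_queue.append(cmd_id)  # command is now DONE


sched = Scheduler()
sched.submit("cmd0", ["dma_read", "gemm", "dma_write"])
sched.expand_next()
for sub in ("dma_read", "gemm"):
    sched.on_engine_complete("cmd0", sub)
done_before_last = list(sched.completion_queue)
sched.on_engine_complete("cmd0", "dma_write")
done_after_last = list(sched.completion_queue)
```

In this sketch the completion record appears only after the last sub-command reports in, and no engine-side code ever writes to `inflight` or `completion_queue`.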

D3. Dispatch modes

PE commands are divided into two categories.

D3.1 Simple command

A simple command expands to exactly one engine sub-command.

Examples include:

DMA transfer
GEMM compute
MATH compute

Execution flow:

PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue

D3.2 Composite command (tiled pipeline)

Composite commands implement tiled pipelined execution across engines.

Each tile executes the following pipeline:

Input DMA (READ)
→ Compute (GEMM or MATH)
→ Output DMA (WRITE)

Tiling rule

If the DMA payload exceeds the hardware tile size, PE_SCHEDULER splits the transfer into tiles. Each tile is assigned a monotonically increasing tile_id.
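
A minimal sketch of the tiling rule follows. It assumes details this ADR does not fix: tile ids start at 0, sizes are in bytes, tiles are cut in address order, and the last tile may be shorter than the hardware tile size. `Tile` and `split_into_tiles` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    tile_id: int
    offset: int   # byte offset into the payload
    size: int     # bytes in this tile


def split_into_tiles(payload_bytes, tile_bytes):
    """Split a payload into tiles of at most tile_bytes, in address order."""
    if payload_bytes < 0 or tile_bytes <= 0:
        raise ValueError("payload must be >= 0 and tile size must be > 0")
    tiles = []
    offset = 0
    while offset < payload_bytes:
        size = min(tile_bytes, payload_bytes - offset)
        # len(tiles) grows by one per tile, so tile_id is monotonically increasing.
        tiles.append(Tile(tile_id=len(tiles), offset=offset, size=size))
        offset += size
    return tiles


tiles = split_into_tiles(payload_bytes=10_000, tile_bytes=4096)
```

With a 10,000-byte payload and 4,096-byte tiles this yields three tiles with ids 0, 1, 2, the last one 1,808 bytes long. A payload that fits in one tile produces a single tile.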

Tile dependency rules

For tile t:

Compute must wait for input DMA: DMA_READ(t) → COMPUTE(t)
Output DMA must wait for compute: COMPUTE(t) → DMA_WRITE(t)
All dependencies are enforced by PE_SCHEDULER.
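
The per-tile dependency rules can be expressed as an explicit edge list that a scheduler checks before dispatch. This is a sketch under assumptions: the stage labels reuse the names above, while `tile_dependencies` and `is_ready` are hypothetical helpers.

```python
def tile_dependencies(num_tiles):
    """Return (before, after) edges for the per-tile pipeline."""
    edges = []
    for t in range(num_tiles):
        edges.append((("DMA_READ", t), ("COMPUTE", t)))
        edges.append((("COMPUTE", t), ("DMA_WRITE", t)))
    return edges


def is_ready(stage, done, edges):
    """A stage is ready once every predecessor in edges is in done."""
    return all(before in done for before, after in edges if after == stage)


edges = tile_dependencies(2)
done = {("DMA_READ", 0)}
compute0_ready = is_ready(("COMPUTE", 0), done, edges)    # input DMA finished
write0_ready = is_ready(("DMA_WRITE", 0), done, edges)    # compute not finished
read1_ready = is_ready(("DMA_READ", 1), done, edges)      # no predecessors
```

Note that the edges only link stages of the same tile. Ordering between tiles comes from engine capacity, not from these dependencies.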

Overlap policy (Phase 0 default)

Operations for different tiles may overlap when engine resources permit.

Allowed overlaps:

DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t−1) ∥ COMPUTE(t)
DMA_READ(t+1) ∥ DMA_WRITE(t)

Disallowed overlaps:

GEMM(t) ∥ GEMM(t′)
MATH(t) ∥ MATH(t′)
GEMM(t) ∥ MATH(t′)

D4. Engine execution model (Phase 0 default)

Each engine behaves as a deterministic service resource.

DMA engine

PE_DMA contains two independent channels.

DMA_READ capacity  = 1
DMA_WRITE capacity = 1

Rules:

DMA_READ and DMA_WRITE may execute concurrently.
Multiple READs cannot overlap.
Multiple WRITEs cannot overlap.

Example allowed:

DMA_READ(t+1) ∥ DMA_WRITE(t)

Example not allowed:

DMA_READ(t) ∥ DMA_READ(t+1)
DMA_WRITE(t) ∥ DMA_WRITE(t+1)

Compute engine

Compute operations share a single compute resource.

PE_ACCEL capacity = 1

Both GEMM and MATH require this shared compute slot.

Consequences:

GEMM ∥ GEMM not allowed
MATH ∥ MATH not allowed
GEMM ∥ MATH not allowed

Only one compute operation can run in a PE at a time.
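
The capacity rules can be illustrated with a small deterministic timeline model. This is a sketch, not the simulator's scheduler. It assumes three capacity-1 resources (DMA_READ, PE_ACCEL, DMA_WRITE), in-order issue per resource, made-up cycle costs, and enough tile buffers that reads never stall on buffer space.

```python
def schedule(num_tiles, read_cycles, compute_cycles, write_cycles):
    """Return {(stage, tile): (start, end)} for an in-order tiled pipeline."""
    free = {"DMA_READ": 0, "PE_ACCEL": 0, "DMA_WRITE": 0}  # next free cycle
    cost = {"DMA_READ": read_cycles, "PE_ACCEL": compute_cycles,
            "DMA_WRITE": write_cycles}
    times = {}
    for t in range(num_tiles):
        prev_end = 0  # end time of the previous stage of this tile
        for stage in ("DMA_READ", "PE_ACCEL", "DMA_WRITE"):
            # Start when the resource is free AND the tile's prior stage is done.
            start = max(free[stage], prev_end)
            end = start + cost[stage]
            free[stage] = end
            times[(stage, t)] = (start, end)
            prev_end = end
    return times


times = schedule(num_tiles=3, read_cycles=4, compute_cycles=6, write_cycles=4)
```

With these example costs, DMA_READ for tile 1 occupies cycles 4 to 8 while compute for tile 0 occupies cycles 4 to 10, and DMA_WRITE for tile 0 (cycles 10 to 14) overlaps compute for tile 1 (cycles 10 to 16). No two intervals on the same resource overlap.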

Compute opcode restriction

Composite commands contain one compute opcode only.

Examples:

COMPOSITE_GEMM
COMPOSITE_MATH

Mixed compute pipelines such as GEMM → MATH are not supported in Phase 0.
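
One way to enforce the single-opcode restriction is a lookup that rejects anything other than the two listed opcodes. This is a sketch only: `compute_engine_for` is a hypothetical helper, and `COMPOSITE_GEMM_MATH` in the usage lines is an invented opcode used to show the rejection path.

```python
# Composite opcode -> the one compute engine it uses (from the examples above).
COMPUTE_OPCODES = {"COMPOSITE_GEMM": "GEMM", "COMPOSITE_MATH": "MATH"}


def compute_engine_for(opcode):
    """Map a composite opcode to its single compute engine, or raise."""
    try:
        return COMPUTE_OPCODES[opcode]
    except KeyError:
        raise ValueError(f"unsupported composite opcode in Phase 0: {opcode}") from None


engine = compute_engine_for("COMPOSITE_GEMM")
try:
    compute_engine_for("COMPOSITE_GEMM_MATH")  # invented mixed opcode
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
```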

Engine completion signaling

Every engine emits a completion event when a sub-command finishes. Completion events are delivered to PE_SCHEDULER.

D5. Dataflow model

Compute operations use a TCM-centric dataflow model.

Input path (HBM)

HBM → NOC → PE_DMA (DMA_READ) → PE_TCM

Input path (shared SRAM)

Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM

Compute stage

Compute engines read input tensors from PE_TCM.

PE_TCM → GEMM / MATH

Weights for GEMM may optionally stream directly from HBM (via NOC).

Output path (HBM)

Compute results are written to PE_TCM, then DMA writes to HBM.

PE_TCM → PE_DMA (DMA_WRITE) → NOC → HBM

Output path (shared SRAM)

PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM

D5.1 PE_TCM partitioning and ownership boundary

The PE_TCM address space is partitioned into two logical regions.

SchedulerReservedTCM

A staging region owned exclusively by PE_SCHEDULER.
This region is used for composite command tile buffers.
PE_SCHEDULER:
- partitions this region into tile buffers
- assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
- guarantees input/output buffer separation
- manages tile buffer lifetime

AllocatableTCM

General-purpose region managed by PEMemAllocator.
Used for host-visible or DP-visible allocations.

Visibility rule (hard isolation)

PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
This prevents DP or host allocations from interfering with scheduler staging buffers.
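
"Excluded by construction" can be sketched as an allocator that is never handed the reserved range at all. The class below is illustrative and is not the real PEMemAllocator. It assumes a bump allocator and that SchedulerReservedTCM occupies the low addresses `[0, reserved_bytes)`; neither is specified by this ADR.

```python
class PEMemAllocatorSketch:
    """Bump allocator over AllocatableTCM only (illustrative, not the real one)."""

    def __init__(self, tcm_bytes, reserved_bytes):
        if not 0 <= reserved_bytes <= tcm_bytes:
            raise ValueError("reserved region must fit inside PE_TCM")
        # Assumed layout: SchedulerReservedTCM occupies [0, reserved_bytes),
        # so the allocator's managed range starts right after it.
        self._next = reserved_bytes
        self._limit = tcm_bytes

    def alloc(self, nbytes):
        if nbytes <= 0:
            raise ValueError("allocation size must be positive")
        if self._next + nbytes > self._limit:
            raise MemoryError("AllocatableTCM exhausted")
        addr = self._next
        self._next += nbytes
        return addr


alloc = PEMemAllocatorSketch(tcm_bytes=1 << 20, reserved_bytes=256 << 10)
first = alloc.alloc(4096)
```

Because the managed range never includes the reserved region, no allocation request can return an address inside it, regardless of request size or order.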

Tile buffer rules

Within SchedulerReservedTCM:

input buffers and output buffers must not overlap
PE_SCHEDULER assigns tile buffers for DMA and compute stages
tile buffers remain valid until the corresponding DMA_WRITE completes
Buffer reuse is allowed only after the tile lifetime finishes, that is, after the DMA_WRITE for that tile completes.
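
The tile buffer rules above can be sketched as a small pool that releases buffers only when a tile's DMA_WRITE completes. `TileBufferPool` and its methods are hypothetical names, and buffers are modeled as plain indices rather than address ranges.

```python
class TileBufferPool:
    """Fixed set of tile buffers carved out of SchedulerReservedTCM (sketch)."""

    def __init__(self, num_buffers):
        self._free = list(range(num_buffers))   # buffer indices, lowest first
        self._owner = {}                        # buffer index -> (tile_id, role)

    def acquire(self, tile_id, role):
        # role is "in" or "out"; distinct buffers keep input/output separate.
        if not self._free:
            return None                         # caller must stall the tile
        buf = self._free.pop(0)
        self._owner[buf] = (tile_id, role)
        return buf

    def on_dma_write_complete(self, tile_id):
        # The tile lifetime ends here, so its buffers become reusable.
        released = sorted(b for b, (t, _) in self._owner.items() if t == tile_id)
        for buf in released:
            del self._owner[buf]
        self._free = sorted(self._free + released)
        return released


pool = TileBufferPool(num_buffers=2)
in0 = pool.acquire(0, "in")
out0 = pool.acquire(0, "out")
stalled = pool.acquire(1, "in")          # no buffer free while tile 0 is live
released = pool.on_dma_write_complete(0)
in1 = pool.acquire(1, "in")              # reuse is allowed only now
```

Picking the lowest free index and returning released buffers in sorted order keeps buffer assignment reproducible, which matters for deterministic traces.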

D6. Observability and trace contract

The simulator must emit deterministic trace events.

Required events include:

command_submitted
sub_command_dispatched
engine_start
engine_complete
tile_ready
command_complete

Trace ordering must be deterministic for identical inputs.
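
One way to obtain a deterministic ordering is to sort events by simulated cycle and break ties with an emission sequence number. The sketch below assumes that approach; `TraceLog`, the `cycle` and `seq` fields, and the JSON-lines output are illustrative choices, not the simulator's trace format.

```python
import json


class TraceLog:
    """Collects events and serializes them in a reproducible order."""

    def __init__(self):
        self._events = []
        self._seq = 0   # tie-breaker for events at the same cycle

    def emit(self, cycle, event, **fields):
        self._events.append({"cycle": cycle, "seq": self._seq,
                             "event": event, **fields})
        self._seq += 1

    def dump(self):
        ordered = sorted(self._events, key=lambda e: (e["cycle"], e["seq"]))
        # sort_keys keeps the serialized text independent of field order.
        return "\n".join(json.dumps(e, sort_keys=True) for e in ordered)


def run_once():
    log = TraceLog()
    log.emit(0, "command_submitted", cmd="cmd0")
    log.emit(1, "sub_command_dispatched", cmd="cmd0", engine="DMA_READ", tile=0)
    log.emit(1, "engine_start", engine="DMA_READ", tile=0)
    log.emit(5, "engine_complete", engine="DMA_READ", tile=0)
    return log.dump()


trace_a = run_once()
trace_b = run_once()
```

Two runs with identical inputs produce byte-identical trace text, which makes traces directly diffable in regression tests.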

D7. Topology representation

PE internal components are declared in cube.pe_template.

The template is instantiated once per PE.

PE instances are derived from cube.pe_layout.
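
Template instantiation can be sketched as follows. The schema of cube.pe_template and cube.pe_layout is not defined in this ADR, so the dictionary shapes, the PE ids, and `instantiate_pes` are purely hypothetical.

```python
import copy

# Hypothetical shapes; the real cube.pe_template / cube.pe_layout schema is
# not specified in this ADR.
cube = {
    "pe_template": {"components": ["PE_CPU", "PE_SCHEDULER", "PE_DMA",
                                   "PE_GEMM", "PE_MATH", "PE_TCM"]},
    "pe_layout": ["pe0", "pe1", "pe2", "pe3"],
}


def instantiate_pes(cube):
    """Instantiate the template once per PE listed in the layout."""
    # deepcopy gives each PE its own component list, so per-PE state
    # never aliases the shared template.
    return {pe_id: copy.deepcopy(cube["pe_template"])
            for pe_id in cube["pe_layout"]}


pes = instantiate_pes(cube)
```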

External connectivity such as:

PE_DMA → NOC → HBM (data path)
PE_DMA → NOC → shared SRAM, inter-cube UCIe (non-HBM data path)
NOC → PE_CPU (command path from M_CPU)

is modeled at the CUBE level (see ADR-0003 D3).
