kernbench2/docs/adr/ADR-0014-pe-internal-execution-model.md
ywkang 32b29a1e5c ADR-0003/0014: generalize "router mesh" to "NOC"
NOC topology is an implementation choice (mesh, ring, crossbar, etc.).
ADR-0017 covers the current 2D mesh choice; ADRs at the system-level
shouldn't bind to that specific implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:23:46 -07:00


ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)

Status

Accepted

Context

ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:

  • the dispatch model inside a PE,
  • the responsibilities of PE_SCHEDULER,
  • the PE_TCM-centric dataflow contract used by accelerator engines.

We need a deterministic and debuggable PE-internal execution contract that supports:

  • simple single-engine commands
  • composite commands that build a tiled pipeline across DMA and accelerator engines

The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.

Decision

D1. PE internal component roles

Each PE contains the following logical components.

PE_CPU

  • Executes the kernel instruction stream or kernel control logic.
  • Generates PE commands.
  • Submits commands to PE_SCHEDULER.
  • PE_CPU does NOT enqueue work directly into engine queues.

PE_SCHEDULER

  • The sole dispatcher inside a PE.
  • Receives commands from PE_CPU.
  • Expands composite commands into sub-commands.
  • Tracks dependencies and command state.
  • Dispatches work to engine queues.
  • Manages tile scheduling for composite commands.

PE_DMA

  • Handles memory transfers between PE_TCM and external memory domains.
  • PE_DMA connects to the cube-level NOC (on-die fabric):
    • All destinations (HBM, shared SRAM, inter-cube UCIe) are reached via the NOC
    • Local HBM access: PE_DMA → NOC → hbm_ctrl (minimal hop)
    • Remote/shared: PE_DMA → NOC → (fabric hops) → destination
  • Supported directions include:
    • HBM → PE_TCM (via NOC)
    • PE_TCM → HBM (via NOC)
    • PE_TCM → shared SRAM (via NOC)
    • PE_TCM → other memory domains (via NOC, if supported by topology)

PE_GEMM

  • Matrix multiplication engine.
  • Reads activations from PE_TCM.
  • May stream weights directly from HBM.

PE_MATH

  • Element-wise computation engine.
  • Reads and writes PE_TCM.

PE_TCM

  • Local SRAM used as the staging memory for accelerator operations.

D2. Command lifecycle and queues

PE_SCHEDULER maintains three logical structures.

SubmissionQueue

  • Written by PE_CPU.
  • Contains incoming PE commands waiting to be processed.

InflightTable

  • Owned and mutated only by PE_SCHEDULER.
  • Tracks:
    • expanded sub-commands
    • dependency state
    • engine assignment
    • completion status

CompletionQueue

  • Written by PE_SCHEDULER.
  • Contains final completion records for commands.

Single-writer rule

  • Only PE_SCHEDULER is allowed to mutate command completion state.
  • Engine components must report completion via explicit completion events/messages.

Command completion

A command becomes DONE when both of the following hold:

  • all of its sub-commands have completed, and
  • PE_SCHEDULER has published a completion record to CompletionQueue.
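The three structures and the single-writer rule can be illustrated with a minimal sketch. All class, method, and field names below (PEScheduler, InflightEntry, submit, accept_next, on_engine_complete) are hypothetical and not taken from the simulator. As a simplification, sub-command ids are supplied at submission instead of being produced by scheduler-side expansion.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InflightEntry:
    """Scheduler-private record for one command (hypothetical layout)."""
    pending: set = field(default_factory=set)  # sub-command ids not yet complete
    done: bool = False


class PEScheduler:
    def __init__(self):
        self.submission_queue = deque()  # written by PE_CPU
        self.inflight = {}               # mutated only by PE_SCHEDULER
        self.completion_queue = deque()  # written by PE_SCHEDULER

    def submit(self, cmd_id, sub_ids):
        """PE_CPU side: enqueue a command together with its sub-command ids."""
        self.submission_queue.append((cmd_id, tuple(sub_ids)))

    def accept_next(self):
        """Move the oldest command from SubmissionQueue into InflightTable."""
        cmd_id, sub_ids = self.submission_queue.popleft()
        self.inflight[cmd_id] = InflightEntry(pending=set(sub_ids))

    def on_engine_complete(self, cmd_id, sub_id):
        """Engines only deliver this event; the scheduler alone updates state."""
        entry = self.inflight[cmd_id]
        entry.pending.discard(sub_id)
        if not entry.pending and not entry.done:
            entry.done = True
            self.completion_queue.append(cmd_id)


sched = PEScheduler()
sched.submit("cmd0", ["read0", "gemm0", "write0"])
sched.accept_next()
sched.on_engine_complete("cmd0", "read0")
sched.on_engine_complete("cmd0", "gemm0")
still_pending = len(sched.completion_queue) == 0  # one sub-command outstanding
sched.on_engine_complete("cmd0", "write0")
```

In this sketch the completion record is published only inside on_engine_complete, so no engine-side code path can mark a command DONE.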

D3. Dispatch modes

PE commands are divided into two categories.

D3.1 Simple command

A simple command expands to exactly one engine sub-command.

Examples include:

  • DMA transfer
  • GEMM compute
  • MATH compute

Execution flow:

PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue

D3.2 Composite command (tiled pipeline)

Composite commands implement tiled pipelined execution across engines.

Each tile executes the following pipeline:

Input DMA (READ)
→ Compute (GEMM or MATH)
→ Output DMA (WRITE)

Tiling rule

If the DMA payload exceeds the hardware tile size, PE_SCHEDULER splits the transfer into tiles. Each tile is assigned a monotonically increasing tile_id.
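A minimal sketch of the tiling rule follows. The function name split_into_tiles, the byte-based units, the tile_id starting at 0, and the short final tile are assumptions made for illustration; the ADR only fixes that tile_id increases monotonically.

```python
def split_into_tiles(payload_bytes, tile_bytes):
    """Return (tile_id, offset, length) tuples covering the payload in order."""
    if payload_bytes <= 0 or tile_bytes <= 0:
        raise ValueError("sizes must be positive")
    tiles = []
    offset = 0
    tile_id = 0
    while offset < payload_bytes:
        # The last tile may be shorter than the hardware tile size (assumption).
        length = min(tile_bytes, payload_bytes - offset)
        tiles.append((tile_id, offset, length))
        offset += length
        tile_id += 1
    return tiles


tiles = split_into_tiles(payload_bytes=10_000, tile_bytes=4096)
# tiles == [(0, 0, 4096), (1, 4096, 4096), (2, 8192, 1808)]
```

A payload that fits in one tile yields a single entry with tile_id 0, which matches the simple (untiled) case.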

Tile dependency rules

For tile t:

  • Compute must wait for input DMA: DMA_READ(t) → COMPUTE(t)
  • Output DMA must wait for compute: COMPUTE(t) → DMA_WRITE(t)
  • All dependencies are enforced by PE_SCHEDULER.
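The per-tile dependency rules can be written down as a small dependency map that the scheduler consults before dispatch. This is a sketch only; tile_dependencies and is_ready are hypothetical helper names, and sub-commands are represented as (stage, tile_id) tuples for brevity.

```python
def tile_dependencies(num_tiles):
    """Map each sub-command to the sub-commands it must wait for."""
    deps = {}
    for t in range(num_tiles):
        deps[("DMA_READ", t)] = []                   # no prerequisite in this sketch
        deps[("COMPUTE", t)] = [("DMA_READ", t)]     # DMA_READ(t) -> COMPUTE(t)
        deps[("DMA_WRITE", t)] = [("COMPUTE", t)]    # COMPUTE(t) -> DMA_WRITE(t)
    return deps


def is_ready(sub, deps, completed):
    """A sub-command may be dispatched once all of its prerequisites completed."""
    return all(d in completed for d in deps[sub])


deps = tile_dependencies(num_tiles=2)
completed = {("DMA_READ", 0)}
compute0_ready = is_ready(("COMPUTE", 0), deps, completed)  # True
write0_ready = is_ready(("DMA_WRITE", 0), deps, completed)  # False
```

Note that the map contains no edge between different tiles, which is what leaves room for the cross-tile overlaps described next; resource limits are a separate constraint.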

Overlap policy (Phase 0 default)

Operations for different tiles may overlap when engine resources permit.

Allowed overlaps:

DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t-1) ∥ COMPUTE(t)
DMA_READ(t+1) ∥ DMA_WRITE(t)

Disallowed overlaps (for any two tiles t and u):

GEMM(t) ∥ GEMM(u)
MATH(t) ∥ MATH(u)
GEMM(t) ∥ MATH(u)

D4. Engine execution model (Phase 0 default)

Each engine behaves as a deterministic service resource.

DMA engine

PE_DMA contains two independent channels.

DMA_READ capacity  = 1
DMA_WRITE capacity = 1

Rules:

  • DMA_READ and DMA_WRITE may execute concurrently.
  • Multiple READs cannot overlap.
  • Multiple WRITEs cannot overlap.

Example allowed:

DMA_READ(t+1) ∥ DMA_WRITE(t)

Example not allowed:

DMA_READ(t) ∥ DMA_READ(t+1)
DMA_WRITE(t) ∥ DMA_WRITE(t+1)

Compute engine

Compute operations share a single compute resource, referred to here as PE_ACCEL.

PE_ACCEL capacity = 1

Both GEMM and MATH require this shared compute slot.

Consequences:

  • GEMM ∥ GEMM not allowed
  • MATH ∥ MATH not allowed
  • GEMM ∥ MATH not allowed

Only one compute operation can run in a PE at a time.
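The capacity rules above are enough to derive a deterministic timeline. The sketch below assumes fixed per-stage durations in abstract time units, in-order tile issue, and no tile-buffer pressure; none of these are specified by this ADR, and schedule_tiles is a hypothetical name.

```python
def schedule_tiles(num_tiles, read_t, compute_t, write_t):
    """Return {(stage, tile): (start, end)} under capacity-1 resources."""
    # Time at which each capacity-1 resource next becomes free.
    free = {"DMA_READ": 0, "PE_ACCEL": 0, "DMA_WRITE": 0}
    times = {}
    for t in range(num_tiles):
        ready = 0  # time at which the previous stage of this tile finished
        for stage, resource, dur in (
            ("DMA_READ", "DMA_READ", read_t),
            ("COMPUTE", "PE_ACCEL", compute_t),
            ("DMA_WRITE", "DMA_WRITE", write_t),
        ):
            # A stage starts once its dependency is met and its resource is free.
            start = max(ready, free[resource])
            end = start + dur
            times[(stage, t)] = (start, end)
            free[resource] = end
            ready = end
    return times


times = schedule_tiles(num_tiles=3, read_t=2, compute_t=3, write_t=2)
# DMA_READ(1) runs over (2, 4) while COMPUTE(0) runs over (2, 5).
```

With these durations the compute slot is the bottleneck: the three tiles finish at time 13 instead of the 21 units a fully serialized pipeline would need.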

Compute opcode restriction

Composite commands contain one compute opcode only.

Examples:

COMPOSITE_GEMM
COMPOSITE_MATH

Mixed compute pipelines such as GEMM → MATH are not supported in Phase 0.
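One way to enforce the single-opcode restriction is to resolve the compute engine from the composite opcode and reject anything else. This is an illustrative sketch: the lookup table and compute_engine_for are hypothetical, and COMPOSITE_GEMM_MATH is an invented opcode used only to show the rejection path.

```python
# Composite opcodes named in this ADR, mapped to their single compute engine.
COMPUTE_ENGINE = {
    "COMPOSITE_GEMM": "GEMM",
    "COMPOSITE_MATH": "MATH",
}


def compute_engine_for(opcode):
    """Return the one compute engine a composite command uses in Phase 0."""
    try:
        return COMPUTE_ENGINE[opcode]
    except KeyError:
        raise ValueError(f"unsupported composite opcode in Phase 0: {opcode}") from None


engine = compute_engine_for("COMPOSITE_GEMM")  # "GEMM"
```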

Engine completion signaling

Every engine emits a completion event when a sub-command finishes. Completion events are delivered to PE_SCHEDULER.


D5. Dataflow model

Compute operations use a TCM-centric dataflow model.

Input path (HBM)

HBM → NOC → PE_DMA (DMA_READ) → PE_TCM

Input path (shared SRAM)

Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM

Compute stage

Compute engines read input tensors from PE_TCM.

PE_TCM → GEMM / MATH

Weights for GEMM may optionally stream directly from HBM (via NOC).

Output path (HBM)

Compute results are written to PE_TCM, then DMA writes to HBM.

PE_TCM → PE_DMA (DMA_WRITE) → NOC → HBM

Output path (shared SRAM)

PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM

D5.1 PE_TCM partitioning and ownership boundary

The PE_TCM address space is partitioned into two logical regions.

SchedulerReservedTCM

  • A staging region owned exclusively by PE_SCHEDULER.
  • This region is used for composite command tile buffers.
  • PE_SCHEDULER:
    • partitions this region into tile buffers
    • assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
    • guarantees input/output buffer separation
    • manages tile buffer lifetime

AllocatableTCM

  • General-purpose region managed by PEMemAllocator.
  • Used for host-visible or DP-visible allocations.

Visibility rule (hard isolation)

  • PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
  • SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
  • This prevents DP or host allocations from interfering with scheduler staging buffers.
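Exclusion "by construction" can be sketched by handing the allocator only the allocatable range, so it has no way to name an address in the reserved region. The layout below (reserved region at the low addresses), the sizes, the Region type, and the bump-allocation strategy are assumptions; the PEMemAllocator class here is a stand-in for illustration, not the project's allocator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    base: int
    size: int

    @property
    def end(self):
        return self.base + self.size


def partition_tcm(tcm_size, reserved_size):
    """Split [0, tcm_size) into (SchedulerReservedTCM, AllocatableTCM)."""
    if not 0 <= reserved_size <= tcm_size:
        raise ValueError("reserved region must fit inside PE_TCM")
    reserved = Region(0, reserved_size)
    allocatable = Region(reserved_size, tcm_size - reserved_size)
    return reserved, allocatable


class PEMemAllocator:
    """Bump allocator that is only ever given the allocatable region."""

    def __init__(self, region):
        self._region = region
        self._next = region.base

    def alloc(self, size):
        if self._next + size > self._region.end:
            raise MemoryError("AllocatableTCM exhausted")
        addr = self._next
        self._next += size
        return addr


reserved, allocatable = partition_tcm(tcm_size=1 << 20, reserved_size=256 << 10)
allocator = PEMemAllocator(allocatable)
addr = allocator.alloc(4096)  # first address past the reserved region
```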

Tile buffer rules

Within SchedulerReservedTCM:

  • Input buffers and output buffers must not overlap.
  • PE_SCHEDULER assigns tile buffers for DMA and compute stages.
  • Tile buffers remain valid until the corresponding DMA_WRITE completes.
  • Buffer reuse is allowed only after the tile lifetime finishes.
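The tile buffer rules can be sketched as a small pool that hands out disjoint input/output buffers per tile and recycles them only when the tile's DMA_WRITE completes. TileBufferPool, its methods, the equal-sized buffers, and the return of None when the pool is short are all assumptions for illustration.

```python
class TileBufferPool:
    """Fixed set of equally sized buffers inside SchedulerReservedTCM."""

    def __init__(self, base, buf_size, count):
        self._free = [base + i * buf_size for i in range(count)]
        self._owned = {}  # tile_id -> (input_addr, output_addr)

    def acquire(self, tile_id):
        """Reserve separate input and output buffers, or None if too few are free."""
        if len(self._free) < 2:
            return None
        pair = (self._free.pop(0), self._free.pop(0))
        self._owned[tile_id] = pair
        return pair

    def on_dma_write_complete(self, tile_id):
        """The tile lifetime ends here; only now may its buffers be reused."""
        self._free.extend(self._owned.pop(tile_id))


pool = TileBufferPool(base=0, buf_size=4096, count=4)
tile0 = pool.acquire(0)         # (0, 4096)
tile1 = pool.acquire(1)         # (8192, 12288)
blocked = pool.acquire(2)       # None: every buffer is still live
pool.on_dma_write_complete(0)
tile2 = pool.acquire(2)         # (0, 4096), reused after tile 0 finished
```

In this sketch a pool of four buffers is the minimum that lets two tiles be in flight at once, which is what the DMA_READ(t+1) ∥ COMPUTE(t) overlap needs.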

D6. Observability and trace contract

The simulator must emit deterministic trace events.

Required events include:

  • command_submitted
  • sub_command_dispatched
  • engine_start
  • engine_complete
  • tile_ready
  • command_complete

Trace ordering must be deterministic for identical inputs.
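One way to obtain deterministic ordering is to sort events by a total key, so that events sharing a timestamp never depend on arrival order. The key used below (time, event rank, command id, tile id) and the dict-based event layout are assumptions; the ADR requires determinism but does not prescribe a tie-break policy.

```python
import random

# Required event names from this section, in an assumed tie-break order.
EVENT_ORDER = (
    "command_submitted",
    "sub_command_dispatched",
    "engine_start",
    "engine_complete",
    "tile_ready",
    "command_complete",
)
_RANK = {name: i for i, name in enumerate(EVENT_ORDER)}


def sort_trace(events):
    """Order events by (time, event rank, command id, tile id)."""
    return sorted(
        events,
        key=lambda e: (e["time"], _RANK[e["event"]], e["cmd"], e.get("tile", -1)),
    )


events = [
    {"time": 0, "event": "command_submitted", "cmd": 0},
    {"time": 0, "event": "sub_command_dispatched", "cmd": 0, "tile": 0},
    {"time": 0, "event": "engine_start", "cmd": 0, "tile": 0},
    {"time": 2, "event": "engine_complete", "cmd": 0, "tile": 0},
    {"time": 2, "event": "tile_ready", "cmd": 0, "tile": 0},
    {"time": 2, "event": "command_complete", "cmd": 0},
]

shuffled = events[:]
random.Random(1).shuffle(shuffled)  # simulate a different arrival order
trace = sort_trace(shuffled)
```

The key is total only if no two events agree on all four fields; a real trace would likely need an additional field such as the engine name to guarantee that.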


D7. Topology representation

PE internal components are declared in cube.pe_template.

The template is instantiated once per PE.

PE instances are derived from cube.pe_layout.

External connectivity such as:

  • PE_DMA → NOC → HBM (data path)
  • PE_DMA → NOC → shared SRAM, inter-cube UCIe (non-HBM data path)
  • NOC → PE_CPU (command path from M_CPU)

is modeled at the CUBE level (see ADR-0003 D3).


References

  • SPEC R3, R4
  • ADR-0003 D4 (PE-level system hierarchy)
  • ADR-0005 View C (PE-level diagram)
  • ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
  • ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)