kernbench2/docs/adr/ADR-0014-pe-internal-execution-model.md

# ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)

## Status

Proposed

## Context

ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:

- the dispatch model inside a PE,
- the responsibilities of PE_SCHEDULER,
- the PE_TCM-centric dataflow contract used by accelerator engines.

We need a deterministic and debuggable PE-internal execution contract that supports:

- simple single-engine commands
- composite commands that build a tiled pipeline across DMA and accelerator engines

The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.

## Decision

### D1. PE internal component roles

Each PE contains the following logical components.

**PE_CPU**

- Executes the kernel instruction stream or kernel control logic.
- Generates PE commands.
- Submits commands to PE_SCHEDULER.
- Does NOT enqueue work directly into engine queues.

**PE_SCHEDULER**

- The sole dispatcher inside a PE.
- Receives commands from PE_CPU.
- Expands composite commands into sub-commands.
- Tracks dependencies and command state.
- Dispatches work to engine queues.
- Manages tile scheduling for composite commands.

**PE_DMA**

- Handles memory transfers between PE_TCM and external memory domains.
- PE_DMA has **dual egress** at the CUBE level:
  - **→ XBAR**: dedicated path to HBM (local and cross-half via bridge)
  - **→ NOC**: path to non-HBM destinations (shared SRAM, inter-cube UCIe, etc.)
- Supported directions include:
  - HBM → PE_TCM (via XBAR)
  - PE_TCM → HBM (via XBAR)
  - shared SRAM → PE_TCM (via NOC)
  - PE_TCM → shared SRAM (via NOC)
  - PE_TCM → other memory domains (via NOC, if supported by topology)
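The dual-egress rule can be read as a small routing function. The following is a minimal sketch, not the simulator's API: `MemDomain` and `dma_egress` are hypothetical names, and the set of non-HBM domains is only an illustrative subset.

```python
from enum import Enum


class MemDomain(Enum):
    # Hypothetical labels for the memory domains named in this ADR.
    HBM = "hbm"
    SHARED_SRAM = "shared_sram"
    REMOTE_CUBE = "remote_cube"  # reached over inter-cube UCIe


def dma_egress(domain: MemDomain) -> str:
    """Pick the CUBE-level egress for a PE_DMA transfer.

    HBM traffic (local or cross-half) uses the dedicated XBAR path;
    every other destination goes through the NOC.
    """
    return "XBAR" if domain is MemDomain.HBM else "NOC"


routes = {d.name: dma_egress(d) for d in MemDomain}
```

The same function applies to both transfer directions, since the egress depends only on the external memory domain and not on whether PE_TCM is the source or the destination.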

**PE_GEMM**

- Matrix multiplication engine.
- Reads activations from PE_TCM.
- May stream weights directly from HBM.

**PE_MATH**

- Element-wise computation engine.
- Reads and writes PE_TCM.

**PE_TCM**

- Local SRAM used as the staging memory for accelerator operations.

---

### D2. Command lifecycle and queues

PE_SCHEDULER maintains three logical structures.

**SubmissionQueue**

- Written by PE_CPU.
- Contains incoming PE commands waiting to be processed.

**InflightTable**

- Owned and mutated only by PE_SCHEDULER.
- Tracks:
  - expanded sub-commands
  - dependency state
  - engine assignment
  - completion status

**CompletionQueue**

- Written by PE_SCHEDULER.
- Contains final completion records for commands.

**Single-writer rule**

- Only PE_SCHEDULER is allowed to mutate command completion state.
- Engine components must report completion via explicit completion events/messages.

**Command completion**

A command becomes DONE when both of the following hold:

- all of its sub-commands have completed, and
- PE_SCHEDULER has published a completion record to CompletionQueue.
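As an illustration of the lifecycle and the single-writer rule, the sketch below models the three structures with plain Python containers. All class and method names are hypothetical. To keep it short, the sub-command list is passed in at submission time, whereas in this ADR the scheduler performs the expansion itself.

```python
from collections import deque


class PeSchedulerSketch:
    """Hypothetical model of SubmissionQueue, InflightTable, CompletionQueue."""

    def __init__(self):
        self.submission_queue = deque()  # written by PE_CPU
        self.inflight = {}               # mutated only by the scheduler
        self.completion_queue = deque()  # written by the scheduler

    def submit(self, cmd_id, sub_ids):
        # PE_CPU side: hand a command to the scheduler, never to an engine.
        self.submission_queue.append((cmd_id, tuple(sub_ids)))

    def accept_next(self):
        # Move the oldest command from SubmissionQueue into InflightTable.
        cmd_id, sub_ids = self.submission_queue.popleft()
        self.inflight[cmd_id] = set(sub_ids)  # sub-commands still pending

    def on_engine_complete(self, cmd_id, sub_id):
        # Engines only deliver this event; state changes happen here.
        pending = self.inflight[cmd_id]
        pending.discard(sub_id)
        if pending:
            return "INFLIGHT"
        # All sub-commands finished: publish the record, then retire.
        del self.inflight[cmd_id]
        self.completion_queue.append(cmd_id)
        return "DONE"


sched = PeSchedulerSketch()
sched.submit("cmd0", ["dma_read", "gemm", "dma_write"])
sched.accept_next()
states = [sched.on_engine_complete("cmd0", s)
          for s in ("dma_read", "gemm", "dma_write")]
```

Here `states` ends in `"DONE"` only after the last completion event, and the completion record appears in `completion_queue` at that same step.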

---

### D3. Dispatch modes

PE commands are divided into two categories.

#### D3.1 Simple command

A simple command expands to exactly one engine sub-command.

Examples include:

- DMA transfer
- GEMM compute
- MATH compute

Execution flow:

```
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
```

#### D3.2 Composite command (tiled pipeline)

Composite commands implement tiled pipelined execution across engines.

Each tile executes the following pipeline:

```
Input DMA (READ)
→ Compute (GEMM or MATH)
→ Output DMA (WRITE)
```

**Tiling rule**

If the DMA payload exceeds the hardware tile size, PE_SCHEDULER splits the transfer into tiles.
Each tile is assigned a monotonically increasing `tile_id`.
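A minimal sketch of the tiling rule follows. The function name, the byte-based units, and the choice to start `tile_id` at 0 are assumptions made for illustration; the ADR only requires that `tile_id` increase monotonically.

```python
def split_into_tiles(total_bytes, tile_bytes):
    """Return (tile_id, offset, length) triples that cover the payload.

    A payload no larger than tile_bytes yields a single tile.
    """
    if total_bytes < 0 or tile_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and tile_bytes > 0")
    tiles = []
    offset = 0
    tile_id = 0
    while offset < total_bytes:
        # The last tile may be shorter than the hardware tile size.
        length = min(tile_bytes, total_bytes - offset)
        tiles.append((tile_id, offset, length))
        offset += length
        tile_id += 1
    return tiles


tiles = split_into_tiles(total_bytes=10_000, tile_bytes=4096)
# tiles == [(0, 0, 4096), (1, 4096, 4096), (2, 8192, 1808)]
```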

**Tile dependency rules**

For tile `t`:

- Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)`
- Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)`
- All dependencies are enforced by PE_SCHEDULER.
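The per-tile dependency chain can be represented as a small predecessor map that the scheduler consults before dispatching. This is a sketch under assumed names (`tile_dependencies`, `is_ready`), not the simulator's data structure.

```python
def tile_dependencies(num_tiles):
    """Map each (stage, tile_id) to the operations it must wait for."""
    deps = {}
    for t in range(num_tiles):
        deps[("DMA_READ", t)] = []
        deps[("COMPUTE", t)] = [("DMA_READ", t)]    # DMA_READ(t) -> COMPUTE(t)
        deps[("DMA_WRITE", t)] = [("COMPUTE", t)]   # COMPUTE(t) -> DMA_WRITE(t)
    return deps


def is_ready(op, deps, done):
    """An operation may be dispatched once all its predecessors are done."""
    return all(pred in done for pred in deps[op])


deps = tile_dependencies(2)
done = {("DMA_READ", 0)}
ready_now = sorted(op for op in deps if op not in done and is_ready(op, deps, done))
# ready_now == [("COMPUTE", 0), ("DMA_READ", 1)]
```

The map contains no edges between different tiles, which is what leaves room for the cross-tile overlap described next; engine capacity is a separate constraint.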

**Overlap policy (Phase 0 default)**

Operations for different tiles may overlap when engine resources permit.

Allowed overlaps:

```
DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t−1) ∥ COMPUTE(t)
DMA_READ(t+1) ∥ DMA_WRITE(t)
```

Disallowed overlaps:

```
GEMM(t) ∥ GEMM(t′)
MATH(t) ∥ MATH(t′)
GEMM(t) ∥ MATH(t′)
```

---

### D4. Engine execution model (Phase 0 default)

Each engine behaves as a deterministic service resource.

**DMA engine**

PE_DMA contains two independent channels.

```
DMA_READ capacity  = 1
DMA_WRITE capacity = 1
```

Rules:

- DMA_READ and DMA_WRITE may execute concurrently.
- Multiple READs cannot overlap.
- Multiple WRITEs cannot overlap.

Example allowed:

```
DMA_READ(t+1) ∥ DMA_WRITE(t)
```

Example not allowed:

```
DMA_READ(t) ∥ DMA_READ(t+1)
DMA_WRITE(t) ∥ DMA_WRITE(t+1)
```

**Compute engine**

Compute operations share a single compute resource, referred to here as PE_ACCEL.

```
PE_ACCEL capacity = 1
```

Both GEMM and MATH require this shared compute slot.

Consequences:

- GEMM ∥ GEMM not allowed
- MATH ∥ MATH not allowed
- GEMM ∥ MATH not allowed

Only one compute operation can run in a PE at a time.
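To see how these capacity rules shape a tiled pipeline, the sketch below computes a deterministic timeline with one read channel, one write channel, and one compute slot. It assumes fixed per-stage durations in abstract cycles, in-order tile issue, and enough tile buffers that buffering never stalls the pipeline; none of these assumptions come from this ADR.

```python
def schedule(num_tiles, t_read, t_compute, t_write):
    """Return (resource, tile_id, start, end) tuples in issue order."""
    # Earliest time at which each capacity-1 resource is free again.
    free_at = {"DMA_READ": 0, "PE_ACCEL": 0, "DMA_WRITE": 0}
    stages = (("DMA_READ", t_read), ("PE_ACCEL", t_compute), ("DMA_WRITE", t_write))
    timeline = []
    for tile in range(num_tiles):
        ready = 0  # time at which the previous stage of this tile finished
        for resource, duration in stages:
            # Wait for both the tile dependency and the resource.
            start = max(ready, free_at[resource])
            end = start + duration
            free_at[resource] = end
            timeline.append((resource, tile, start, end))
            ready = end
    return timeline


timeline = schedule(num_tiles=3, t_read=2, t_compute=3, t_write=2)
makespan = max(end for _, _, _, end in timeline)
# makespan == 13, versus 3 * (2 + 3 + 2) == 21 with no overlap
```

In this run `DMA_READ(1)` occupies cycles 2 to 4 while the compute slot works on tile 0 during cycles 2 to 5, and no two compute intervals overlap.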

**Compute opcode restriction**

Composite commands contain one compute opcode only.

Examples:

```
COMPOSITE_GEMM
COMPOSITE_MATH
```

Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0.

**Engine completion signaling**

Every engine emits a completion event when a sub-command finishes.
Completion events are delivered to PE_SCHEDULER.

---

### D5. Dataflow model

Compute operations use a TCM-centric dataflow model.

**Input path (HBM)**

```
HBM → XBAR → PE_DMA (DMA_READ) → PE_TCM
```

**Input path (shared SRAM)**

```
Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
```

**Compute stage**

Compute engines read input tensors from PE_TCM.

```
PE_TCM → GEMM / MATH
```

Weights for GEMM may optionally stream directly from HBM (via XBAR).

**Output path (HBM)**

Compute results are written to PE_TCM; PE_DMA then writes them to HBM.

```
PE_TCM → PE_DMA (DMA_WRITE) → XBAR → HBM
```

**Output path (shared SRAM)**

```
PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
```

#### D5.1 PE_TCM partitioning and ownership boundary

The PE_TCM address space is partitioned into two logical regions.

**SchedulerReservedTCM**

- A staging region owned exclusively by PE_SCHEDULER.
- This region is used for composite command tile buffers.
- PE_SCHEDULER:
  - partitions this region into tile buffers
  - assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
  - guarantees input/output buffer separation
  - manages tile buffer lifetime

**AllocatableTCM**

- General-purpose region managed by PEMemAllocator.
- Used for host-visible or DP-visible allocations.

**Visibility rule (hard isolation)**

- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
- SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
- This prevents DP or host allocations from interfering with scheduler staging buffers.
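One way to obtain isolation "by construction" is to build the allocator from the allocatable range only, so that it never holds an address inside the reserved region. The sketch below assumes the reserved region sits at the bottom of the PE_TCM address space and uses a simple bump allocator. Both choices are illustrative, and `SketchAllocator` merely stands in for PEMemAllocator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    base: int
    size: int

    @property
    def end(self):
        return self.base + self.size


def partition_tcm(tcm_bytes, reserved_bytes):
    """Split PE_TCM into (SchedulerReservedTCM, AllocatableTCM)."""
    if not 0 <= reserved_bytes <= tcm_bytes:
        raise ValueError("reserved region must fit inside PE_TCM")
    reserved = Region(0, reserved_bytes)  # assumed placement: low addresses
    allocatable = Region(reserved_bytes, tcm_bytes - reserved_bytes)
    return reserved, allocatable


class SketchAllocator:
    """Hypothetical stand-in for PEMemAllocator; sees AllocatableTCM only."""

    def __init__(self, region):
        self._region = region
        self._next = region.base

    def alloc(self, nbytes):
        if self._next + nbytes > self._region.end:
            raise MemoryError("AllocatableTCM exhausted")
        addr = self._next
        self._next += nbytes
        return addr


reserved, allocatable = partition_tcm(tcm_bytes=1 << 20, reserved_bytes=256 << 10)
allocator = SketchAllocator(allocatable)
addr = allocator.alloc(4096)
```

Because the allocator is never given the reserved range, no request, however large, can return an address below `reserved.end`.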

**Tile buffer rules**

Within SchedulerReservedTCM:

- input buffers and output buffers must not overlap
- PE_SCHEDULER assigns tile buffers for DMA and compute stages
- tile buffers remain valid until the corresponding DMA_WRITE completes
- buffer reuse is allowed only after the tile lifetime ends, that is, after the tile's DMA_WRITE completes
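The lifetime rule can be sketched as a fixed pool of slots, where each slot stands for one disjoint input/output buffer pair inside SchedulerReservedTCM. `TileBufferPool` and its method names are hypothetical, and the lowest-free-slot policy is just one deterministic choice.

```python
class TileBufferPool:
    """Hypothetical slot pool; one slot = one input/output buffer pair."""

    def __init__(self, num_slots):
        self._free = list(range(num_slots))
        self._owner = {}  # tile_id -> slot held for the tile's lifetime

    def acquire(self, tile_id):
        # Returns None when every slot is still owned by a live tile,
        # in which case the scheduler must delay DMA_READ for this tile.
        if not self._free:
            return None
        slot = self._free.pop(0)  # lowest free slot, for determinism
        self._owner[tile_id] = slot
        return slot

    def on_dma_write_complete(self, tile_id):
        # The tile lifetime ends here, so the slot becomes reusable.
        slot = self._owner.pop(tile_id)
        self._free.append(slot)
        self._free.sort()
        return slot


pool = TileBufferPool(num_slots=2)
first = pool.acquire(0)
second = pool.acquire(1)
blocked = pool.acquire(2)          # no slot until a tile retires
pool.on_dma_write_complete(0)
reused = pool.acquire(2)           # tile 2 reuses tile 0's slot
```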

---

### D6. Observability and trace contract

The simulator must emit deterministic trace events.

Required events include:

- `command_submitted`
- `sub_command_dispatched`
- `engine_start`
- `engine_complete`
- `tile_ready`
- `command_complete`

Trace ordering must be deterministic for identical inputs.
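A sketch of one way to make ordering reproducible: stamp every event with the simulated cycle plus a per-run sequence number, and avoid wall-clock time or unordered containers in the record. The `TraceRecorder` class and the field names (`cmd`, `engine`, `tile`) are assumptions, not a defined schema.

```python
REQUIRED_EVENTS = (
    "command_submitted",
    "sub_command_dispatched",
    "engine_start",
    "engine_complete",
    "tile_ready",
    "command_complete",
)


class TraceRecorder:
    def __init__(self):
        self._events = []

    def emit(self, cycle, name, **fields):
        # seq breaks ties between events in the same cycle, so ordering
        # never depends on wall-clock time or hash iteration order.
        seq = len(self._events)
        self._events.append((cycle, seq, name, tuple(sorted(fields.items()))))

    def dump(self):
        # Sorted by (cycle, seq); seq is unique, so the order is total.
        return sorted(self._events)


def run_once():
    trace = TraceRecorder()
    trace.emit(0, "command_submitted", cmd=7)
    trace.emit(1, "sub_command_dispatched", cmd=7, engine="DMA_READ", tile=0)
    trace.emit(1, "engine_start", engine="DMA_READ", tile=0)
    trace.emit(5, "engine_complete", engine="DMA_READ", tile=0)
    return trace.dump()


first_run, second_run = run_once(), run_once()
```

Comparing `first_run` and `second_run` for equality is the kind of check a regression test could use to guard the determinism requirement.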

---

### D7. Topology representation

PE internal components are declared in `cube.pe_template`.

The template is instantiated once per PE.

PE instances are derived from `cube.pe_layout`.

External connectivity such as:

- PE_DMA → XBAR (HBM data path)
- PE_DMA → NOC (non-HBM data path: shared SRAM, inter-cube UCIe)
- NOC → PE_CPU (command path from M_CPU)

is modeled at the CUBE level (see ADR-0003 D3).

---

## Links

- SPEC R3, R4
- ADR-0003 D4 (PE-level system hierarchy)
- ADR-0005 View C (PE-level diagram)
- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)