Files
kernbench2/docs/adr/ADR-0014-dev-pe-pipeline-execution-model.md
T
ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00

452 lines
14 KiB
Markdown

# ADR-0014: PE Pipeline Execution Model
## Status
Accepted
## Context
This ADR defines the PE-internal kernel execution model:
- Role decomposition of PE-internal components
- Command dispatch paths (simple / composite / multi-op composite with epilogue)
- TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
- TCM-centric dataflow with a register-file intermediary
- Engine resource model
- Observability and trace contract
- Topology representation
PE-internal structure (7 components in scope; 2 cross-referenced):
- `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`,
`pe_tcm` — defined here
- `pe_mmu` — VA model, defined in ADR-0011 D-VA
- `pe_ipcq` — collective communication, defined in ADR-0023
The goal is a deterministic, trace-friendly execution contract that keeps
each block independently swappable.
## Decision
### D1. PE-internal component roles
**PE_CPU**
- Executes kernel instruction stream / control logic.
- Generates PE commands and submits them to `PE_SCHEDULER` (via
`PeInternalTxn`).
- Does NOT enqueue work directly into engine queues.
**PE_SCHEDULER**
- Sole dispatcher inside a PE.
- Receives commands from `PE_CPU`. Dispatch by command type:
- Simple command (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`)
→ forward directly to the target engine.
- `CompositeCmd` → generate a `TilePlan`, feed tiles into the pipeline
via a single `_feed_loop` (D6).
- Does not participate in stage-to-stage chaining within a composite;
that is handled by token self-routing (D6).
**PE_DMA**
- Handles memory transfers between TCM and external memory domains
(HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
- Two execution channels:
- `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1) — see D4.
- Additional virtual channels:
- `vc_compute` — load/store/writeback traffic for GEMM/MATH tiles.
- `vc_comm` — IPCQ collective send data (defined in ADR-0023 D8).
**PE_FETCH_STORE**
- TCM ↔ Register File transfer unit.
- Isolates register-file access semantics from compute engines so that
GEMM/MATH stay pure compute components.
- BW-based latency model; TCM access contention naturally serializes
through `PE_TCM`'s BW resource.
**PE_GEMM**
- MAC array. Reads operands from the register file; writes results to
the register file. Does not touch `PE_TCM` directly.
**PE_MATH**
- Element-wise / reduction / SIMD unit. Reads / writes the register file.
**PE_TCM**
- Tightly-coupled scratchpad with BW-serialized access. Two logical
regions partitioned by ownership (see D5).
**Cross-referenced components** (defined elsewhere):
- `pe_mmu` — VA→PA translation per access (ADR-0011 D-VA).
- `pe_ipcq` — collective ring buffers and peer endpoint metadata
(ADR-0023).
### D2. Command lifecycle and queues
`PE_SCHEDULER` maintains three logical structures:
**SubmissionQueue** — written by `PE_CPU`; consumed by the scheduler.
**InflightTable** — owned and mutated only by `PE_SCHEDULER`; tracks
expanded sub-commands, dependency state, engine assignment, and
completion status.
**CompletionQueue** — written by `PE_SCHEDULER`; holds final completion
records.
**Single-writer rule**: only `PE_SCHEDULER` mutates command completion
state. Engines report completion via explicit events / messages
consumed by the scheduler.
**Command completion**: when all sub-commands complete, `PE_SCHEDULER`
publishes a completion record.
### D3. Dispatch modes
#### D3.1 Simple command
A simple command expands to exactly one engine sub-command:
- `DmaReadCmd` / `DmaWriteCmd``PE_DMA`
- `GemmCmd``PE_GEMM`
- `MathCmd``PE_MATH`
Flow:
```text
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
→ completion → PE_SCHEDULER → CompletionQueue
```
#### D3.2 Composite command (single-op tiled pipeline)
The default `CompositeCmd` runs a single compute op as a tile-pipelined
sequence:
```text
DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE
```
`PE_SCHEDULER` splits the DMA payload into hardware tiles and emits one
`TileToken` per tile with a monotonically increasing `tile_id`.
Tile dependency (within one tile `t`):
```text
DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)
```
Inter-tile overlap is allowed wherever engine resources permit
(D4 governs the constraints):
```text
DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t-1) ∥ COMPUTE(t)
```
#### D3.3 Multi-op composite (head + epilogue with scope)
A `CompositeCmd` MAY carry `ops: tuple[OpSpec, ...]` to express a
multi-op pipeline:
```python
@dataclass(frozen=True)
class OpSpec:
kind: str # "gemm" | "math.exp" | "math.bias_add" | ...
scope: Scope # "per_k_tile" | "per_output_tile" | "once"
...
```
- `ops[0]` (head) defines tile geometry (e.g., the head GEMM determines
M/K/N partition).
- `ops[1:]` (epilogue) are subsequent stages whose `scope` decides how
often they fire:
- `per_k_tile` — every K-reduction step.
- `per_output_tile` — once per output tile.
- `once` — once per kernel.
Cross-engine chains (e.g., GEMM head → MATH epilogue) are natural —
each stage is dispatched via token self-routing (D6), so GEMM and MATH
participate serially within the same composite even though they share
the compute slot (D4).
The empty-`ops` form is the legacy single-op path.
### D4. Engine resource model
**DMA engine**:
- `DMA_READ`: `simpy.Resource(capacity=1)`.
- `DMA_WRITE`: `simpy.Resource(capacity=1)`.
- Both channels run concurrently (READ ∥ WRITE allowed).
- Within a channel, requests serialize (READ ∥ READ disallowed; same
for WRITE).
- `vc_comm` is an orthogonal channel for IPCQ traffic defined in
ADR-0023 D8 — out of scope for this ADR.
**Compute engine**:
- `accel_slot`: `simpy.Resource(capacity=1)` shared by `PE_GEMM` and
`PE_MATH`.
- At most one compute op runs at a time within a PE.
- Multi-op composite chains (D3.3) execute their compute stages serially
through this slot; token self-routing (D6) ensures the next stage
starts only after the previous compute releases the slot.
**Engine completion**: each engine emits a completion event consumed by
the scheduler / `PipelineContext` (D6).
### D5. Dataflow
**Input path (HBM source)**:
```text
HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File
Register File → PE_GEMM | PE_MATH
```
**Input path (shared SRAM source)**:
```text
Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File
```
**Output path (HBM destination)**:
```text
Register File → PE_FETCH_STORE → PE_TCM
PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM
```
GEMM/MATH never touch `PE_TCM` directly — `PE_FETCH_STORE` is the
single TCM↔register-file gateway. This makes TCM BW contention
explicit and lets fetch unit policies (e.g., prefetch) be replaced
independently of compute engines.
#### D5.1 PE_TCM partitioning
`PE_TCM` is split into two logical regions:
**SchedulerReservedTCM**
- Owned exclusively by `PE_SCHEDULER`.
- Holds composite-command tile buffers.
- `PE_SCHEDULER` partitions this region, assigns buffers per DMA_READ /
COMPUTE / DMA_WRITE stage, guarantees input/output separation, and
manages tile-buffer lifetimes.
**AllocatableTCM**
- General-purpose region managed by `PEMemAllocator`.
- Used for host / DP-visible allocations.
**Visibility rule (hard isolation)**: `PEMemAllocator` MUST NOT see or
allocate inside `SchedulerReservedTCM`. The reserved region is excluded
from allocator-managed ranges by construction.
**Tile buffer rules**:
- Input and output buffers within `SchedulerReservedTCM` MUST NOT
overlap during a tile's active lifetime.
- A tile buffer remains valid until the corresponding `DMA_WRITE`
completes.
- Buffer reuse is permitted only after the consuming tile's lifetime
ends.
### D6. TileToken self-routing pipeline
A composite's stage-to-stage progression happens **without** routing
through the scheduler. Each component forwards the token directly to
the next stage's component using the token's `plan`:
```text
Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
↑ chaining: no scheduler hop ↑
PipelineContext.complete_tile()
```
This mirrors real-HW done-wire chains. The scheduler handles only
**initial dispatch + completion aggregation**.
#### TilePlan / Stage
```python
class StageType(Enum):
DMA_READ = 0
FETCH = 1
GEMM = 2
MATH = 3
STORE = 4
DMA_WRITE = 5
@dataclass(frozen=True)
class Stage:
stage_type: StageType
component: str # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
params: dict # stage-specific parameters
@dataclass(frozen=True)
class TilePlan:
tile_id: int
stages: tuple[Stage, ...]
```
#### TileToken
```python
@dataclass
class TileToken:
tile_id: int
pipeline_ctx: PipelineContext
plan: TilePlan
stage_idx: int
params: dict # cached current stage params
data_op: bool = True # op_log opt-in (ADR-0020 D4)
```
Single-owner invariant: a token is owned by exactly one component at a
time. Lifecycle: scheduler creates with `stage_idx=0` → component
`_process()` → increment `stage_idx` → put to next stage's `in_port`
last stage calls `pipeline_ctx.complete_tile()`.
#### PipelineContext (exactly-once completion)
```python
@dataclass
class PipelineContext:
id: str
total_tiles: int
completed_tiles: int = 0
done_event: simpy.Event = None
def complete_tile(self) -> None:
self.completed_tiles += 1
if self.completed_tiles == self.total_tiles:
self.done_event.succeed()
```
Each tile's last stage MUST call `complete_tile()` exactly once.
Duplicate calls are bugs (SimPy `Event` can succeed at most once).
#### Feed ordering
`PE_SCHEDULER` has exactly one `_feed_loop` process consuming a
`_pending_feeds` FIFO. Composite commands are enqueued in submission
order; tile feed for a command runs to completion before the next
command's feed begins. **Tile-feed interleaving between commands is
disallowed.**
Within a single command's tiles, downstream pipeline overlap arises
naturally — earlier tiles progress through later stages while the feeder
keeps pushing remaining tiles into the first stage queue (SimPy Store
backpressure governs flow control). If the first-stage queue is full,
only the feeder blocks; the scheduler worker's inbox processing
continues.
#### Token routing pattern (base class)
```python
def _pipeline_worker(self, env):
while True:
token = yield self._inbox.get()
yield from self._process(env, token) # stage-specific logic
next_idx = token.stage_idx + 1
if next_idx < len(token.plan.stages):
next_stage = token.plan.stages[next_idx]
token.stage_idx = next_idx
token.params = next_stage.params
yield self.out_ports[next_stage.component].put(token)
else:
token.pipeline_ctx.complete_tile()
```
Each component implements only `_process()`; chaining lives in the
base class.
### D7. Observability and trace contract
The simulator emits deterministic trace events:
- `command_submitted`
- `sub_command_dispatched`
- `engine_start`
- `engine_complete`
- `tile_ready`
- `command_complete`
For identical inputs, trace ordering MUST be deterministic.
### D8. Topology representation
PE-internal components are declared in `cube.pe_template`:
```yaml
pe_template:
components:
pe_cpu: { kind: pe_cpu, impl: builtin.pe_cpu, attrs: { overhead_ns: ... } }
pe_scheduler: { kind: pe_scheduler, impl: builtin.pe_scheduler, attrs: { overhead_ns: ... } }
pe_dma: { kind: pe_dma, impl: builtin.pe_dma, attrs: { rd_engines: 1, wr_engines: 1 } }
pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
pe_gemm: { kind: pe_gemm, impl: builtin.pe_gemm, attrs: { shared_resource: accel_slot, ... } }
pe_math: { kind: pe_math, impl: builtin.pe_math, attrs: { shared_resource: accel_slot, ... } }
pe_tcm: { kind: pe_tcm, impl: builtin.pe_tcm, attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
pe_mmu: { kind: pe_mmu, impl: builtin.pe_mmu, attrs: { ... } } # ADR-0011 D-VA
pe_ipcq: { kind: pe_ipcq, impl: builtin.pe_ipcq, attrs: { ... } } # ADR-0023
links:
# Scheduler dispatch edges (initial)
scheduler_to_dma_mm: 0.0
scheduler_to_fetch_store_mm: 0.0
scheduler_to_gemm_mm: 0.0
scheduler_to_math_mm: 0.0
# Pipeline chaining edges (token self-routing per D6)
dma_to_fetch_store_mm: 0.0
fetch_store_to_gemm_mm: 0.0
fetch_store_to_math_mm: 0.0
gemm_to_fetch_store_mm: 0.0
gemm_to_math_mm: 0.0
math_to_fetch_store_mm: 0.0
fetch_store_to_dma_mm: 0.0
fetch_store_to_tcm_bw_gbs: ...
```
Template is instantiated once per PE. PE instances are derived from
`cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔
cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).
## Consequences
### Positive
- Each block is an independent topology node — individually swappable
via DI (ADR-0015).
- PE-internal structure is visible in the topology graph.
- Components do not know their downstream — plan-based routing gives
flexibility (e.g., epilogue chains require no scheduler change).
- DMA and compute overlap naturally via SimPy Store backpressure.
- Multi-op composite expresses fused operations (e.g., GEMM + bias_add)
without engine-level coupling.
- TCM access contention is realistic — `PE_FETCH_STORE` is the single
TCM↔RF gateway.
### Negative
- Intra-PE component count is higher than a coarser model (7 base + 2
cross-referenced) — more topology nodes/edges.
- Intra-PE token forwarding is explicit in traces (acceptable trade for
HW fidelity).
## Links
- ADR-0011 D-VA (PE_MMU component, VA translation)
- ADR-0015 D4 (component port/wire model)
- ADR-0020 (greenlet kernel execution / two-pass)
- ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
- SPEC R3, R4