# ADR-0014: PE Pipeline Execution Model

## Status

Accepted

## Context

This ADR defines the PE-internal kernel execution model:

- Role decomposition of PE-internal components
- Command dispatch paths (simple / composite / multi-op composite with epilogue)
- TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
- TCM-centric dataflow with a register-file intermediary
- Engine resource model
- Observability and trace contract
- Topology representation

PE-internal structure (7 components in scope; 2 cross-referenced):

- `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`,
  `pe_tcm` — defined here
- `pe_mmu` — VA model, defined in ADR-0011 D-VA
- `pe_ipcq` — collective communication, defined in ADR-0023

The goal is a deterministic, trace-friendly execution contract that keeps
each block independently swappable.

## Decision

### D1. PE-internal component roles

**PE_CPU**

- Executes kernel instruction stream / control logic.
- Generates PE commands and submits them to `PE_SCHEDULER` (via
  `PeInternalTxn`).
- Does NOT enqueue work directly into engine queues.

**PE_SCHEDULER**

- Sole dispatcher inside a PE.
- Receives commands from `PE_CPU`. Dispatch by command type:
  - Simple command (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`)
    → forward directly to the target engine.
  - `CompositeCmd` → generate a `TilePlan`, feed tiles into the pipeline
    via a single `_feed_loop` (D6).
- Does not participate in stage-to-stage chaining within a composite;
  that is handled by token self-routing (D6).

**PE_DMA**

- Handles memory transfers between TCM and external memory domains
  (HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
- Two execution channels:
  - `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1) — see D4.
- Additional virtual channels:
  - `vc_compute` — load/store/writeback traffic for GEMM/MATH tiles.
  - `vc_comm` — IPCQ collective send data (defined in ADR-0023 D8).

**PE_FETCH_STORE**

- TCM ↔ Register File transfer unit.
- Isolates register-file access semantics from compute engines so that
  GEMM/MATH stay pure compute components.
- BW-based latency model; TCM access contention naturally serializes
  through `PE_TCM`'s BW resource.

**PE_GEMM**

- MAC array. Reads operands from the register file; writes results to
  the register file. Does not touch `PE_TCM` directly.

**PE_MATH**

- Element-wise / reduction / SIMD unit. Reads / writes the register file.

**PE_TCM**

- Tightly-coupled scratchpad with BW-serialized access. Two logical
  regions partitioned by ownership (see D5).

**Cross-referenced components** (defined elsewhere):

- `pe_mmu` — VA→PA translation per access (ADR-0011 D-VA).
- `pe_ipcq` — collective ring buffers and peer endpoint metadata
  (ADR-0023).

### D2. Command lifecycle and queues

`PE_SCHEDULER` maintains three logical structures:

**SubmissionQueue** — written by `PE_CPU`; consumed by the scheduler.

**InflightTable** — owned and mutated only by `PE_SCHEDULER`; tracks
expanded sub-commands, dependency state, engine assignment, and
completion status.

**CompletionQueue** — written by `PE_SCHEDULER`; holds final completion
records.

**Single-writer rule**: only `PE_SCHEDULER` mutates command completion
state. Engines report completion via explicit events / messages
consumed by the scheduler.

**Command completion**: when all sub-commands complete, `PE_SCHEDULER`
publishes a completion record.

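
As a rough illustration of the single-writer rule, the three structures can be modelled in plain Python. Only the queue names follow D2; the class `SchedulerState`, the `InflightEntry` layout, and the method names `submit`, `dispatch_next`, and `on_engine_done` are hypothetical and are not taken from the simulator.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InflightEntry:
    """Scheduler-private record for one command (hypothetical layout)."""
    pending_subcmds: set[str] = field(default_factory=set)


class SchedulerState:
    """Illustrative model of the D2 structures; not the simulator's class."""

    def __init__(self) -> None:
        self.submission_queue = deque()   # written by PE_CPU
        self.inflight = {}                # mutated only by the scheduler
        self.completion_queue = deque()   # written by the scheduler

    def submit(self, cmd_id: str, subcmds: list[str]) -> None:
        # The only entry point PE_CPU uses.
        self.submission_queue.append((cmd_id, subcmds))

    def dispatch_next(self) -> list[str]:
        # Move one command from the submission queue into the inflight table.
        cmd_id, subcmds = self.submission_queue.popleft()
        self.inflight[cmd_id] = InflightEntry(set(subcmds))
        return subcmds

    def on_engine_done(self, cmd_id: str, subcmd: str) -> None:
        # Engines only report; the scheduler alone updates completion state.
        entry = self.inflight[cmd_id]
        entry.pending_subcmds.discard(subcmd)
        if not entry.pending_subcmds:
            del self.inflight[cmd_id]
            self.completion_queue.append(cmd_id)


state = SchedulerState()
state.submit("cmd0", ["dma_read", "gemm"])
state.dispatch_next()
state.on_engine_done("cmd0", "dma_read")
partial = list(state.completion_queue)   # one sub-command still pending
state.on_engine_done("cmd0", "gemm")
final = list(state.completion_queue)     # completion record published
```

In this sketch the completion record appears only after the last sub-command reports, which is the behaviour the "Command completion" rule describes.
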
### D3. Dispatch modes

#### D3.1 Simple command

A simple command expands to exactly one engine sub-command:

- `DmaReadCmd` / `DmaWriteCmd` → `PE_DMA`
- `GemmCmd` → `PE_GEMM`
- `MathCmd` → `PE_MATH`

Flow:

```text
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
       → completion → PE_SCHEDULER → CompletionQueue
```

#### D3.2 Composite command (single-op tiled pipeline)

The default `CompositeCmd` runs a single compute op as a tile-pipelined
sequence:

```text
DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE
```

`PE_SCHEDULER` splits the DMA payload into hardware tiles and emits one
`TileToken` per tile with a monotonically increasing `tile_id`.

Tile dependency (within one tile `t`):

```text
DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)
```

Inter-tile overlap is allowed wherever engine resources permit
(D4 governs the constraints):

```text
DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t-1) ∥ COMPUTE(t)
```

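
The tile split in D3.2 can be pictured with a small helper. This is a sketch under a simplifying assumption: tiles are contiguous byte ranges of a fixed `tile_bytes` size. The function name `split_into_tiles` and both parameters are hypothetical; the ADR does not spell out the hardware tile geometry.

```python
def split_into_tiles(payload_bytes: int, tile_bytes: int) -> list[tuple[int, int, int]]:
    """Return (tile_id, offset, size) triples covering the payload in order."""
    if payload_bytes < 0 or tile_bytes <= 0:
        raise ValueError("payload_bytes must be >= 0 and tile_bytes > 0")
    tiles = []
    offset = 0
    tile_id = 0
    while offset < payload_bytes:
        # The last tile may be shorter than tile_bytes.
        size = min(tile_bytes, payload_bytes - offset)
        tiles.append((tile_id, offset, size))
        offset += size
        tile_id += 1   # monotonically increasing, as D3.2 requires
    return tiles


tiles = split_into_tiles(payload_bytes=10_000, tile_bytes=4_096)
# tiles == [(0, 0, 4096), (1, 4096, 4096), (2, 8192, 1808)]
```

Each triple would correspond to one `TileToken`; the `tile_id` values increase by one per tile and the sizes sum to the payload.
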
#### D3.3 Multi-op composite (head + epilogue with scope)

A `CompositeCmd` MAY carry `ops: tuple[OpSpec, ...]` to express a
multi-op pipeline:

```python
@dataclass(frozen=True)
class OpSpec:
    kind: str        # "gemm" | "math.exp" | "math.bias_add" | ...
    scope: Scope     # "per_k_tile" | "per_output_tile" | "once"
    ...
```

- `ops[0]` (head) defines tile geometry (e.g., the head GEMM determines
  M/K/N partition).
- `ops[1:]` (epilogue) are subsequent stages whose `scope` decides how
  often they fire:
  - `per_k_tile` — every K-reduction step.
  - `per_output_tile` — once per output tile.
  - `once` — once per kernel.

Cross-engine chains (e.g., GEMM head → MATH epilogue) are natural —
each stage is dispatched via token self-routing (D6), so GEMM and MATH
participate serially within the same composite even though they share
the compute slot (D4).

The empty-`ops` form is the legacy single-op path.

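
To make the three scopes concrete, the sketch below counts how often an epilogue op would fire for a tiled GEMM head. It assumes a conventional tiling in which every output tile (one per M-tile and N-tile pair) is accumulated over `k_tiles` reduction steps; that assumption, the function name `fire_count`, and its parameters are illustrative and not taken from the ADR.

```python
def fire_count(scope: str, m_tiles: int, n_tiles: int, k_tiles: int) -> int:
    """Number of times an epilogue op with the given scope fires (assumed tiling)."""
    output_tiles = m_tiles * n_tiles
    if scope == "per_k_tile":
        # Once per K-reduction step of every output tile.
        return output_tiles * k_tiles
    if scope == "per_output_tile":
        return output_tiles
    if scope == "once":
        return 1
    raise ValueError(f"unknown scope: {scope!r}")


counts = {
    scope: fire_count(scope, m_tiles=2, n_tiles=3, k_tiles=4)
    for scope in ("per_k_tile", "per_output_tile", "once")
}
# counts == {"per_k_tile": 24, "per_output_tile": 6, "once": 1}
```
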
### D4. Engine resource model

**DMA engine**:

- `DMA_READ`: `simpy.Resource(capacity=1)`.
- `DMA_WRITE`: `simpy.Resource(capacity=1)`.
- Both channels run concurrently (READ ∥ WRITE allowed).
- Within a channel, requests serialize (READ ∥ READ disallowed; same
  for WRITE).
- `vc_comm` is an orthogonal channel for IPCQ traffic defined in
  ADR-0023 D8 — out of scope for this ADR.

**Compute engine**:

- `accel_slot`: `simpy.Resource(capacity=1)` shared by `PE_GEMM` and
  `PE_MATH`.
- At most one compute op runs at a time within a PE.
- Multi-op composite chains (D3.3) execute their compute stages serially
  through this slot; token self-routing (D6) ensures the next stage
  starts only after the previous compute releases the slot.

**Engine completion**: each engine emits a completion event consumed by
the scheduler / `PipelineContext` (D6).

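
The effect of three capacity-1 resources on inter-tile overlap can be worked through without SimPy. The sketch below is a simplified model: it keeps only `DMA_READ`, `COMPUTE`, and `DMA_WRITE`, ignores FETCH/STORE and TCM bandwidth, assumes unbounded buffering between stages, and uses made-up durations. The function name `schedule` is hypothetical.

```python
def schedule(num_tiles: int, durations: dict[str, int]) -> dict[tuple[int, str], tuple[int, int]]:
    """Compute (start, end) per (tile, stage) with one capacity-1 resource per stage."""
    stages = ("DMA_READ", "COMPUTE", "DMA_WRITE")
    free_at = {stage: 0 for stage in stages}   # when each resource is next free
    timeline = {}
    for tile in range(num_tiles):              # tiles enter in tile_id order
        ready = 0                              # when this tile's previous stage ended
        for stage in stages:
            start = max(ready, free_at[stage])
            end = start + durations[stage]
            timeline[(tile, stage)] = (start, end)
            free_at[stage] = end
            ready = end
    return timeline


timeline = schedule(3, {"DMA_READ": 2, "COMPUTE": 3, "DMA_WRITE": 2})
makespan = max(end for _, end in timeline.values())
# DMA_READ of tile 1 runs over [2, 4) while COMPUTE of tile 0 runs over [2, 5).
```

With these illustrative durations the three tiles finish at time 13, compared with 21 if every stage of every tile ran back to back; the saving comes entirely from READ, COMPUTE, and WRITE running on separate resources.
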
### D5. Dataflow

**Input path (HBM source)**:

```text
HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File
Register File → PE_GEMM | PE_MATH
```

**Input path (shared SRAM source)**:

```text
Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File
```

**Output path (HBM destination)**:

```text
Register File → PE_FETCH_STORE → PE_TCM
PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM
```

GEMM/MATH never touch `PE_TCM` directly — `PE_FETCH_STORE` is the
single TCM↔register-file gateway. This makes TCM BW contention
explicit and lets fetch unit policies (e.g., prefetch) be replaced
independently of compute engines.

#### D5.1 PE_TCM partitioning

`PE_TCM` is split into two logical regions:

**SchedulerReservedTCM**

- Owned exclusively by `PE_SCHEDULER`.
- Holds composite-command tile buffers.
- `PE_SCHEDULER` partitions this region, assigns buffers per DMA_READ /
  COMPUTE / DMA_WRITE stage, guarantees input/output separation, and
  manages tile-buffer lifetimes.

**AllocatableTCM**

- General-purpose region managed by `PEMemAllocator`.
- Used for host / DP-visible allocations.

**Visibility rule (hard isolation)**: `PEMemAllocator` MUST NOT see or
allocate inside `SchedulerReservedTCM`. The reserved region is excluded
from allocator-managed ranges by construction.

**Tile buffer rules**:

- Input and output buffers within `SchedulerReservedTCM` MUST NOT
  overlap during a tile's active lifetime.
- A tile buffer remains valid until the corresponding `DMA_WRITE`
  completes.
- Buffer reuse is permitted only after the consuming tile's lifetime
  ends.

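
One way to read "excluded by construction" is that the allocator is only ever handed the address range outside the reserved region, so it has no way to return an address inside it. The sketch below shows that idea with a toy bump allocator standing in for `PEMemAllocator`. Placing the reserved region at the low addresses is an assumption, and `BumpAllocator` and `partition_tcm` are hypothetical names.

```python
class BumpAllocator:
    """Toy allocator over [base, limit); a stand-in for PEMemAllocator."""

    def __init__(self, base: int, limit: int) -> None:
        self._next = base
        self._limit = limit

    def alloc(self, size: int) -> int:
        if size <= 0:
            raise ValueError("size must be positive")
        if self._next + size > self._limit:
            raise MemoryError("AllocatableTCM exhausted")
        addr = self._next
        self._next += size
        return addr


def partition_tcm(total_bytes: int, reserved_bytes: int) -> tuple[range, BumpAllocator]:
    """Split TCM into a scheduler-reserved range and an allocator for the rest."""
    if not 0 <= reserved_bytes <= total_bytes:
        raise ValueError("reserved_bytes must lie within [0, total_bytes]")
    reserved = range(0, reserved_bytes)   # assumed layout: reserved region first
    return reserved, BumpAllocator(base=reserved_bytes, limit=total_bytes)


reserved, allocator = partition_tcm(total_bytes=1024, reserved_bytes=256)
first = allocator.alloc(512)
second = allocator.alloc(256)
try:
    allocator.alloc(1)
    exhausted = False
except MemoryError:
    exhausted = True   # the allocator never spills into the reserved range
```
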
### D6. TileToken self-routing pipeline

A composite's stage-to-stage progression happens **without** routing
through the scheduler. Each component forwards the token directly to
the next stage's component using the token's `plan`:

```text
Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
            ↑          chaining: no scheduler hop          ↑
                                            PipelineContext.complete_tile()
```

This mirrors real-HW done-wire chains. The scheduler handles only
**initial dispatch + completion aggregation**.

#### TilePlan / Stage

```python
from dataclasses import dataclass
from enum import Enum


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5


@dataclass(frozen=True)
class Stage:
    stage_type: StageType
    component: str        # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
    params: dict          # stage-specific parameters


@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]
```

#### TileToken

```python
@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext
    plan: TilePlan
    stage_idx: int
    params: dict          # cached current stage params
    data_op: bool = True  # op_log opt-in (ADR-0020 D4)
```

Single-owner invariant: a token is owned by exactly one component at a
time. Lifecycle: scheduler creates with `stage_idx=0` → component
`_process()` → increment `stage_idx` → put to next stage's `in_port` →
last stage calls `pipeline_ctx.complete_tile()`.

#### PipelineContext (exactly-once completion)

```python
from dataclasses import dataclass

import simpy


@dataclass
class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: simpy.Event = None

    def complete_tile(self) -> None:
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()
```

Each tile's last stage MUST call `complete_tile()` exactly once.
Duplicate calls are bugs (SimPy `Event` can succeed at most once).

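
Because the counter only compares for equality, a duplicate `complete_tile()` call can fire completion early or pass unnoticed rather than fail loudly. A debugging variant that tracks tile ids makes the exactly-once rule checkable. The sketch below is not the ADR's contract: `GuardedPipelineContext` is a hypothetical class, its `complete_tile(tile_id)` signature differs from the one above, and a plain `done` flag stands in for `simpy.Event` so the example needs only the standard library.

```python
from dataclasses import dataclass, field


@dataclass
class GuardedPipelineContext:
    """Debug-oriented variant that rejects duplicate completions."""
    id: str
    total_tiles: int
    done: bool = False                       # stand-in for simpy.Event
    _seen: set[int] = field(default_factory=set)

    def complete_tile(self, tile_id: int) -> None:
        if tile_id in self._seen:
            raise RuntimeError(f"tile {tile_id} completed twice in pipeline {self.id}")
        self._seen.add(tile_id)
        if len(self._seen) == self.total_tiles:
            self.done = True


ctx = GuardedPipelineContext(id="cmd0", total_tiles=2)
ctx.complete_tile(0)
done_after_first = ctx.done
ctx.complete_tile(1)
try:
    ctx.complete_tile(1)       # duplicate call: flagged instead of ignored
    duplicate_detected = False
except RuntimeError:
    duplicate_detected = True
```
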
#### Feed ordering

`PE_SCHEDULER` has exactly one `_feed_loop` process consuming a
`_pending_feeds` FIFO. Composite commands are enqueued in submission
order; tile feed for a command runs to completion before the next
command's feed begins. **Tile-feed interleaving between commands is
disallowed.**

Within a single command's tiles, downstream pipeline overlap arises
naturally — earlier tiles progress through later stages while the feeder
keeps pushing remaining tiles into the first stage queue (SimPy Store
backpressure governs flow control). If the first-stage queue is full,
only the feeder blocks; the scheduler worker's inbox processing
continues.

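
The no-interleaving rule amounts to draining one command's tiles completely before touching the next entry in the FIFO. A minimal generator sketch of that ordering follows; it leaves out SimPy and backpressure entirely, and the function name `feed_order` and the `(cmd_id, num_tiles)` tuple shape are assumptions made for illustration.

```python
from collections import deque


def feed_order(pending: deque):
    """Yield (cmd_id, tile_id) in the order a single feed loop would emit them."""
    while pending:
        cmd_id, num_tiles = pending.popleft()   # commands leave in submission order
        for tile_id in range(num_tiles):        # all tiles of this command first
            yield (cmd_id, tile_id)


pending_feeds = deque([("A", 2), ("B", 3)])
order = list(feed_order(pending_feeds))
# order == [("A", 0), ("A", 1), ("B", 0), ("B", 1), ("B", 2)]
```
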
#### Token routing pattern (base class)

```python
def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()
        yield from self._process(env, token)   # stage-specific logic
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            token.pipeline_ctx.complete_tile()
```

Each component implements only `_process()`; chaining lives in the
base class.

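
The routing pattern can be exercised without SimPy by replacing inboxes with plain deques and stepping components round-robin. The sketch below is a deliberately simplified, untimed model: `Token`, `run`, and the short component names are hypothetical, `_process()` is reduced to recording the visit, and a token's route is a tuple of component names rather than a full `TilePlan`.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Token:
    tile_id: int
    route: tuple[str, ...]   # component name per stage, in order
    stage_idx: int = 0


def run(tokens: list[Token], component_names: list[str]):
    """Route tokens by their own plan; return (visit log, completed tile ids)."""
    inboxes = {name: deque() for name in component_names}
    visited, completed = [], []
    for tok in tokens:                          # initial dispatch by the scheduler
        inboxes[tok.route[0]].append(tok)
    progressed = True
    while progressed:
        progressed = False
        for name, inbox in inboxes.items():
            if not inbox:
                continue
            tok = inbox.popleft()
            visited.append((name, tok.tile_id))  # stands in for _process()
            nxt = tok.stage_idx + 1
            if nxt < len(tok.route):
                tok.stage_idx = nxt
                inboxes[tok.route[nxt]].append(tok)   # forward; no scheduler hop
            else:
                completed.append(tok.tile_id)          # last stage completes the tile
            progressed = True
    return visited, completed


route = ("dma", "fetch_store", "gemm", "math", "fetch_store", "dma")
visited, completed = run(
    [Token(0, route), Token(1, route)],
    ["dma", "fetch_store", "gemm", "math"],
)
tile0_path = [name for name, tile_id in visited if tile_id == 0]
```

No component in this model knows its successor; changing `route` (for example dropping `"math"`) reroutes tokens without touching the loop, which is the flexibility the plan-based design is after.
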
### D7. Observability and trace contract

The simulator emits deterministic trace events:

- `command_submitted`
- `sub_command_dispatched`
- `engine_start`
- `engine_complete`
- `tile_ready`
- `command_complete`

For identical inputs, trace ordering MUST be deterministic.

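
A determinism requirement like this is typically checked by running the same input twice and comparing the traces element by element. The helper below is a sketch of such a comparison; the function name `first_divergence` and the `(time_ns, event, cmd_id)` record shape are assumptions, since the ADR lists only the event names.

```python
def first_divergence(trace_a: list, trace_b: list):
    """Return the index of the first differing record, or None if traces match."""
    for index, (rec_a, rec_b) in enumerate(zip(trace_a, trace_b)):
        if rec_a != rec_b:
            return index
    if len(trace_a) != len(trace_b):
        # One trace is a strict prefix of the other.
        return min(len(trace_a), len(trace_b))
    return None


run_a = [
    (0, "command_submitted", "cmd0"),
    (1, "sub_command_dispatched", "cmd0"),
    (5, "command_complete", "cmd0"),
]
run_b = list(run_a)
run_c = [run_a[0], (2, "sub_command_dispatched", "cmd0"), run_a[2]]

same = first_divergence(run_a, run_b)        # None: identical ordering
diverged = first_divergence(run_a, run_c)    # 1: second record differs
```
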
### D8. Topology representation

PE-internal components are declared in `cube.pe_template`:

```yaml
pe_template:
  components:
    pe_cpu:         { kind: pe_cpu,         impl: builtin.pe_cpu,         attrs: { overhead_ns: ... } }
    pe_scheduler:   { kind: pe_scheduler,   impl: builtin.pe_scheduler,   attrs: { overhead_ns: ... } }
    pe_dma:         { kind: pe_dma,         impl: builtin.pe_dma,         attrs: { rd_engines: 1, wr_engines: 1 } }
    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
    pe_gemm:        { kind: pe_gemm,        impl: builtin.pe_gemm,        attrs: { shared_resource: accel_slot, ... } }
    pe_math:        { kind: pe_math,        impl: builtin.pe_math,        attrs: { shared_resource: accel_slot, ... } }
    pe_tcm:         { kind: pe_tcm,         impl: builtin.pe_tcm,         attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
    pe_mmu:         { kind: pe_mmu,         impl: builtin.pe_mmu,         attrs: { ... } }   # ADR-0011 D-VA
    pe_ipcq:        { kind: pe_ipcq,        impl: builtin.pe_ipcq,        attrs: { ... } }   # ADR-0023
  links:
    # Scheduler dispatch edges (initial)
    scheduler_to_dma_mm: 0.0
    scheduler_to_fetch_store_mm: 0.0
    scheduler_to_gemm_mm: 0.0
    scheduler_to_math_mm: 0.0
    # Pipeline chaining edges (token self-routing per D6)
    dma_to_fetch_store_mm: 0.0
    fetch_store_to_gemm_mm: 0.0
    fetch_store_to_math_mm: 0.0
    gemm_to_fetch_store_mm: 0.0
    gemm_to_math_mm: 0.0
    math_to_fetch_store_mm: 0.0
    fetch_store_to_dma_mm: 0.0
    fetch_store_to_tcm_bw_gbs: ...
```

Template is instantiated once per PE. PE instances are derived from
`cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔
cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).

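
Since token self-routing relies on a chaining edge existing between every pair of consecutive stages, a plan can be sanity-checked against the declared links. The sketch below copies the chaining edge names from the template above; the function `missing_links`, the short component names used in routes, and the idea of running such a check at all are assumptions rather than documented simulator behaviour.

```python
# Pipeline chaining edges as declared in the pe_template links section.
CHAIN_LINKS = {
    "dma_to_fetch_store_mm",
    "fetch_store_to_gemm_mm",
    "fetch_store_to_math_mm",
    "gemm_to_fetch_store_mm",
    "gemm_to_math_mm",
    "math_to_fetch_store_mm",
    "fetch_store_to_dma_mm",
}


def missing_links(route: tuple[str, ...], links: set[str]) -> list[str]:
    """Return the chaining edges a route needs that are not declared."""
    needed = (f"{src}_to_{dst}_mm" for src, dst in zip(route, route[1:]))
    return [name for name in needed if name not in links]


# GEMM head with a MATH epilogue: every hop has a declared edge.
fused = ("dma", "fetch_store", "gemm", "math", "fetch_store", "dma")
fused_missing = missing_links(fused, CHAIN_LINKS)

# A MATH -> GEMM hop has no edge in the template.
reversed_chain = ("dma", "fetch_store", "math", "gemm", "fetch_store", "dma")
reversed_missing = missing_links(reversed_chain, CHAIN_LINKS)
```
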
## Consequences

### Positive

- Each block is an independent topology node — individually swappable
  via DI (ADR-0015).
- PE-internal structure is visible in the topology graph.
- Components do not know their downstream — plan-based routing gives
  flexibility (e.g., epilogue chains require no scheduler change).
- DMA and compute overlap naturally via SimPy Store backpressure.
- Multi-op composite expresses fused operations (e.g., GEMM + bias_add)
  without engine-level coupling.
- TCM access contention is realistic — `PE_FETCH_STORE` is the single
  TCM↔RF gateway.

### Negative

- Intra-PE component count is higher than a coarser model (7 base + 2
  cross-referenced) — more topology nodes/edges.
- Intra-PE token forwarding is explicit in traces (acceptable trade for
  HW fidelity).

## Links

- ADR-0011 D-VA (PE_MMU component, VA translation)
- ADR-0015 D4 (component port/wire model)
- ADR-0020 (greenlet kernel execution / two-pass)
- ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
- SPEC R3, R4