kernbench2/docs/adr/ADR-0014-dev-pe-pipeline-execution-model.md
ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00


ADR-0014: PE Pipeline Execution Model

Status

Accepted

Context

This ADR defines the PE-internal kernel execution model:

  • Role decomposition of PE-internal components
  • Command dispatch paths (simple / composite / multi-op composite with epilogue)
  • TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
  • TCM-centric dataflow with a register-file intermediary
  • Engine resource model
  • Observability and trace contract
  • Topology representation

PE-internal structure (7 components in scope; 2 cross-referenced):

  • pe_cpu, pe_scheduler, pe_dma, pe_fetch_store, pe_gemm, pe_math, pe_tcm — defined here
  • pe_mmu — VA model, defined in ADR-0011 D-VA
  • pe_ipcq — collective communication, defined in ADR-0023

The goal is a deterministic, trace-friendly execution contract that keeps each block independently swappable.

Decision

D1. PE-internal component roles

PE_CPU

  • Executes kernel instruction stream / control logic.
  • Generates PE commands and submits them to PE_SCHEDULER (via PeInternalTxn).
  • Does NOT enqueue work directly into engine queues.

PE_SCHEDULER

  • Sole dispatcher inside a PE.
  • Receives commands from PE_CPU. Dispatch by command type:
    • Simple command (DmaReadCmd, DmaWriteCmd, GemmCmd, MathCmd) → forward directly to the target engine.
    • CompositeCmd → generate a TilePlan, feed tiles into the pipeline via a single _feed_loop (D6).
  • Does not participate in stage-to-stage chaining within a composite; that is handled by token self-routing (D6).

PE_DMA

  • Handles memory transfers between TCM and external memory domains (HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
  • Two execution channels:
    • DMA_READ (capacity = 1) and DMA_WRITE (capacity = 1) — see D4.
  • Additional virtual channels:
    • vc_compute — load/store/writeback traffic for GEMM/MATH tiles.
    • vc_comm — IPCQ collective send data (defined in ADR-0023 D8).

PE_FETCH_STORE

  • TCM ↔ Register File transfer unit.
  • Isolates register-file access semantics from compute engines so that GEMM/MATH stay pure compute components.
  • BW-based latency model; TCM access contention naturally serializes through PE_TCM's BW resource.

PE_GEMM

  • MAC array. Reads operands from the register file; writes results to the register file. Does not touch PE_TCM directly.

PE_MATH

  • Element-wise / reduction / SIMD unit. Reads / writes the register file.

PE_TCM

  • Tightly-coupled scratchpad with BW-serialized access. Two logical regions partitioned by ownership (see D5.1).

Cross-referenced components (defined elsewhere):

  • pe_mmu — VA→PA translation per access (ADR-0011 D-VA).
  • pe_ipcq — collective ring buffers and peer endpoint metadata (ADR-0023).

D2. Command lifecycle and queues

PE_SCHEDULER maintains three logical structures:

SubmissionQueue — written by PE_CPU; consumed by the scheduler.

InflightTable — owned and mutated only by PE_SCHEDULER; tracks expanded sub-commands, dependency state, engine assignment, and completion status.

CompletionQueue — written by PE_SCHEDULER; holds final completion records.

Single-writer rule: only PE_SCHEDULER mutates command completion state. Engines report completion via explicit events / messages consumed by the scheduler.

Command completion: when all sub-commands complete, PE_SCHEDULER publishes a completion record.
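The three structures and the single-writer rule can be sketched as follows. This is a minimal stdlib-only illustration, not the simulator's code: `SchedulerState` and its method names are hypothetical, and the real InflightTable also tracks dependency state and engine assignment.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SchedulerState:
    """Hypothetical model of the three D2 structures (names are illustrative)."""
    submission_queue: deque = field(default_factory=deque)  # written by PE_CPU
    inflight: dict = field(default_factory=dict)            # cmd_id -> pending sub-commands
    completion_queue: deque = field(default_factory=deque)  # written by the scheduler only

    def submit(self, cmd_id: str, sub_cmds: list) -> None:
        # PE_CPU side: only appends to the submission queue.
        self.submission_queue.append((cmd_id, sub_cmds))

    def dispatch_next(self) -> None:
        # Scheduler side: expand one command into the inflight table.
        cmd_id, sub_cmds = self.submission_queue.popleft()
        self.inflight[cmd_id] = set(sub_cmds)

    def on_engine_complete(self, cmd_id: str, sub_cmd: str) -> None:
        # Engines only report; the scheduler mutates completion state.
        pending = self.inflight[cmd_id]
        pending.remove(sub_cmd)
        if not pending:
            del self.inflight[cmd_id]
            self.completion_queue.append(cmd_id)


state = SchedulerState()
state.submit("cmd0", ["dma_read", "gemm"])
state.dispatch_next()
state.on_engine_complete("cmd0", "dma_read")
partial = list(state.completion_queue)  # one sub-command is still pending here
state.on_engine_complete("cmd0", "gemm")
```

The completion record is published only by `on_engine_complete`, which runs on the scheduler side, so engines never write completion state themselves.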

D3. Dispatch modes

D3.1 Simple command

A simple command expands to exactly one engine sub-command:

  • DmaReadCmd / DmaWriteCmd → PE_DMA
  • GemmCmd → PE_GEMM
  • MathCmd → PE_MATH

Flow:

PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
       → completion → PE_SCHEDULER → CompletionQueue
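The simple-versus-composite dispatch decision can be sketched as a lookup. The command classes below are empty stand-ins for the ones named in D1, and `route` plus the `"feed_loop"` return value are hypothetical names used only for illustration.

```python
# Empty stand-ins for the command classes named in D1/D3 (real fields omitted).
class DmaReadCmd: pass
class DmaWriteCmd: pass
class GemmCmd: pass
class MathCmd: pass
class CompositeCmd: pass


# Simple commands map to exactly one target engine (D3.1).
SIMPLE_TARGETS = {
    DmaReadCmd: "pe_dma",
    DmaWriteCmd: "pe_dma",
    GemmCmd: "pe_gemm",
    MathCmd: "pe_math",
}


def route(cmd) -> str:
    """Return the engine for a simple command, or 'feed_loop' for a composite."""
    target = SIMPLE_TARGETS.get(type(cmd))
    if target is not None:
        return target
    if isinstance(cmd, CompositeCmd):
        return "feed_loop"  # composites go through the tile feeder (D6)
    raise TypeError(f"unsupported command type: {type(cmd).__name__}")


routes = [route(c) for c in (DmaReadCmd(), GemmCmd(), MathCmd(), CompositeCmd())]
```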

D3.2 Composite command (single-op tiled pipeline)

The default CompositeCmd runs a single compute op as a tile-pipelined sequence:

DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE

PE_SCHEDULER splits the DMA payload into hardware tiles and emits one TileToken per tile with a monotonically increasing tile_id.

Tile dependency (within one tile t):

DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)

Inter-tile overlap is allowed wherever engine resources permit (D4 governs the constraints):

DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t-1) ∥ COMPUTE(t)
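The overlap can be made concrete with a small timing recurrence. The sketch assumes that each of the five stages is backed by its own capacity-1 resource and that tiles enter in tile_id order. That is a simplification: FETCH and STORE share PE_FETCH_STORE in this ADR, so the real model may serialize them further. The function name and the durations are illustrative only.

```python
STAGES = ("DMA_READ", "FETCH", "COMPUTE", "STORE", "DMA_WRITE")


def pipeline_finish_times(num_tiles: int, duration_ns: dict) -> list:
    """Return finish[t][s] assuming one capacity-1 resource per stage."""
    finish = []
    for t in range(num_tiles):
        row = []
        for s, stage in enumerate(STAGES):
            prev_stage_done = row[s - 1] if s > 0 else 0.0      # same tile, previous stage
            resource_free = finish[t - 1][s] if t > 0 else 0.0  # previous tile, same stage
            row.append(max(prev_stage_done, resource_free) + duration_ns[stage])
        finish.append(row)
    return finish


durations = {"DMA_READ": 10.0, "FETCH": 2.0, "COMPUTE": 10.0, "STORE": 2.0, "DMA_WRITE": 10.0}
finish = pipeline_finish_times(3, durations)
serial_total = 3 * sum(durations.values())  # no overlap at all
pipelined_total = finish[-1][-1]            # finish time of the last DMA_WRITE
```

With these durations DMA_READ(1) occupies 10 to 20 ns while COMPUTE(0) occupies 12 to 22 ns, which is the DMA_READ(t+1) ∥ COMPUTE(t) overlap.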

D3.3 Multi-op composite (head + epilogue with scope)

A CompositeCmd MAY carry ops: tuple[OpSpec, ...] to express a multi-op pipeline:

@dataclass(frozen=True)
class OpSpec:
    kind: str         # "gemm" | "math.exp" | "math.bias_add" | ...
    scope: Scope      # "per_k_tile" | "per_output_tile" | "once"
    ...

  • ops[0] (head) defines tile geometry (e.g., the head GEMM determines M/K/N partition).
  • ops[1:] (epilogue) are subsequent stages whose scope decides how often they fire:
    • per_k_tile — every K-reduction step.
    • per_output_tile — once per output tile.
    • once — once per kernel.

Cross-engine chains (e.g., GEMM head → MATH epilogue) need no special handling: each stage is dispatched via token self-routing (D6), so GEMM and MATH stages run serially within the same composite, consistent with the shared compute slot (D4).

The empty-ops form is the legacy single-op path.
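How scope translates into firing counts can be sketched as below. `OpSpecSketch` is a reduced stand-in for OpSpec (the remaining fields are elided in this ADR and are not guessed here). Two assumptions: "every K-reduction step" means once per (output tile, K step) pair, and the head GEMM uses scope per_k_tile.

```python
from dataclasses import dataclass
from typing import Literal

Scope = Literal["per_k_tile", "per_output_tile", "once"]


@dataclass(frozen=True)
class OpSpecSketch:
    """Reduced stand-in for OpSpec; only the two fields shown in D3.3."""
    kind: str
    scope: Scope


def fire_count(op: OpSpecSketch, num_output_tiles: int, num_k_tiles: int) -> int:
    # Assumption: per_k_tile fires once per (output tile, K step) pair.
    if op.scope == "per_k_tile":
        return num_output_tiles * num_k_tiles
    if op.scope == "per_output_tile":
        return num_output_tiles
    if op.scope == "once":
        return 1
    raise ValueError(f"unknown scope: {op.scope!r}")


ops = (
    OpSpecSketch("gemm", "per_k_tile"),                # head (scope is an assumption)
    OpSpecSketch("math.bias_add", "per_output_tile"),  # epilogue
)
counts = {op.kind: fire_count(op, num_output_tiles=4, num_k_tiles=8) for op in ops}
```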

D4. Engine resource model

DMA engine:

  • DMA_READ: simpy.Resource(capacity=1).
  • DMA_WRITE: simpy.Resource(capacity=1).
  • Both channels run concurrently (READ ∥ WRITE allowed).
  • Within a channel, requests serialize (READ ∥ READ disallowed; same for WRITE).
  • vc_comm is an orthogonal channel for IPCQ traffic defined in ADR-0023 D8 — out of scope for this ADR.

Compute engine:

  • accel_slot: simpy.Resource(capacity=1) shared by PE_GEMM and PE_MATH.
  • At most one compute op runs at a time within a PE.
  • Multi-op composite chains (D3.3) execute their compute stages serially through this slot; token self-routing (D6) ensures the next stage starts only after the previous compute releases the slot.

Engine completion: each engine emits a completion event consumed by the scheduler / PipelineContext (D6).
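The concurrency rules above can be checked with a stdlib-only free-time model. It stands in for the simpy.Resource(capacity=1) objects and assumes requests are served in issue order; `schedule` and the operation names are hypothetical.

```python
def schedule(ops: list) -> dict:
    """ops: (name, resource, ready_ns, duration_ns) tuples in issue order.

    Returns {name: (start_ns, end_ns)}; every resource has capacity 1.
    """
    free_at = {}   # resource name -> time at which it becomes free
    timeline = {}
    for name, resource, ready_ns, duration_ns in ops:
        start = max(ready_ns, free_at.get(resource, 0.0))
        end = start + duration_ns
        free_at[resource] = end
        timeline[name] = (start, end)
    return timeline


timeline = schedule([
    ("read0", "DMA_READ", 0.0, 10.0),
    ("write0", "DMA_WRITE", 0.0, 10.0),   # different channel: runs alongside read0
    ("read1", "DMA_READ", 0.0, 10.0),     # same channel: waits for read0
    ("gemm0", "accel_slot", 0.0, 8.0),
    ("math0", "accel_slot", 0.0, 4.0),    # shares accel_slot: waits for gemm0
])
```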

D5. Dataflow

Input path (HBM source):

HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File
Register File → PE_GEMM | PE_MATH

Input path (shared SRAM source):

Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File

Output path (HBM destination):

Register File → PE_FETCH_STORE → PE_TCM
PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM

GEMM/MATH never touch PE_TCM directly — PE_FETCH_STORE is the single TCM↔register-file gateway. This makes TCM BW contention explicit and lets fetch unit policies (e.g., prefetch) be replaced independently of compute engines.

D5.1 PE_TCM partitioning

PE_TCM is split into two logical regions:

SchedulerReservedTCM

  • Owned exclusively by PE_SCHEDULER.
  • Holds composite-command tile buffers.
  • PE_SCHEDULER partitions this region, assigns buffers per DMA_READ / COMPUTE / DMA_WRITE stage, guarantees input/output separation, and manages tile-buffer lifetimes.

AllocatableTCM

  • General-purpose region managed by PEMemAllocator.
  • Used for host / DP-visible allocations.

Visibility rule (hard isolation): PEMemAllocator MUST NOT see or allocate inside SchedulerReservedTCM. The reserved region is excluded from allocator-managed ranges by construction.

Tile buffer rules:

  • Input and output buffers within SchedulerReservedTCM MUST NOT overlap during a tile's active lifetime.
  • A tile buffer remains valid until the corresponding DMA_WRITE completes.
  • Buffer reuse is permitted only after the consuming tile's lifetime ends.
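Hard isolation "by construction" can be sketched by handing the allocator only the allocatable range. The layout (reserved region at the low addresses), the bump-allocation policy, and the names `PEMemAllocatorSketch` and `partition_tcm` are assumptions for illustration, not the actual PEMemAllocator.

```python
class PEMemAllocatorSketch:
    """Hypothetical bump allocator that only ever sees the allocatable range."""

    def __init__(self, base: int, size: int) -> None:
        self._end = base + size
        self._next = base

    def alloc(self, nbytes: int) -> int:
        if self._next + nbytes > self._end:
            raise MemoryError("AllocatableTCM exhausted")
        addr = self._next
        self._next += nbytes
        return addr


def partition_tcm(total_bytes: int, reserved_bytes: int):
    """Split TCM into (reserved address range, allocator over the rest)."""
    if not 0 <= reserved_bytes <= total_bytes:
        raise ValueError("reserved region must fit inside the TCM")
    reserved = range(0, reserved_bytes)  # assumed layout: reserved region first
    allocator = PEMemAllocatorSketch(base=reserved_bytes, size=total_bytes - reserved_bytes)
    return reserved, allocator


reserved, allocator = partition_tcm(total_bytes=1024, reserved_bytes=256)
first = allocator.alloc(128)
second = allocator.alloc(128)
```

Because the allocator is never given the reserved range, no allocation request can return an address inside SchedulerReservedTCM.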

D6. TileToken self-routing pipeline

A composite's stage-to-stage progression happens without routing through the scheduler. Each component forwards the token directly to the next stage's component using the token's plan:

Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
              ↑ chaining: no scheduler hop                          ↑
                                                  PipelineContext.complete_tile()

This mirrors real-HW done-wire chains. The scheduler handles only initial dispatch + completion aggregation.

TilePlan / Stage

from dataclasses import dataclass
from enum import Enum

class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5

@dataclass(frozen=True)
class Stage:
    stage_type: StageType
    component: str         # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
    params: dict           # stage-specific parameters

@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]

TileToken

@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext
    plan: TilePlan
    stage_idx: int
    params: dict             # cached current stage params
    data_op: bool = True     # op_log opt-in (ADR-0020 D4)

Single-owner invariant: a token is owned by exactly one component at a time. Lifecycle: scheduler creates with stage_idx=0 → component _process() → increment stage_idx → put to next stage's in_port → last stage calls pipeline_ctx.complete_tile().

PipelineContext (exactly-once completion)

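@dataclass
class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: simpy.Event | None = None   # must be set before the final complete_tile()

    def complete_tile(self) -> None:
        # Called by the last stage of each tile; fires done_event on the final tile.
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()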

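Each tile's last stage MUST call complete_tile() exactly once. Duplicate calls are bugs: an extra call inflates completed_tiles, so done_event can fire before every tile has finished, and a SimPy Event can succeed at most once.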
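One way to surface a violation of the exactly-once rule early is to track completed tile ids. The variant below is a stdlib-only sketch, not the ADR's PipelineContext: the class name, the `tile_id` argument, and the boolean `done` flag (standing in for done_event) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CheckedPipelineContext:
    total_tiles: int
    done: bool = False                        # stands in for done_event
    _seen: set = field(default_factory=set)   # tile ids already completed

    def complete_tile(self, tile_id: int) -> None:
        if tile_id in self._seen:
            raise RuntimeError(f"tile {tile_id} completed twice")
        self._seen.add(tile_id)
        if len(self._seen) == self.total_tiles:
            self.done = True


ctx = CheckedPipelineContext(total_tiles=2)
ctx.complete_tile(0)
try:
    ctx.complete_tile(0)          # duplicate: rejected instead of firing early
    duplicate_rejected = False
except RuntimeError:
    duplicate_rejected = True
done_after_duplicate = ctx.done
ctx.complete_tile(1)
```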

Feed ordering

PE_SCHEDULER has exactly one _feed_loop process consuming a _pending_feeds FIFO. Composite commands are enqueued in submission order; tile feed for a command runs to completion before the next command's feed begins. Tile-feed interleaving between commands is disallowed.

Within a single command's tiles, downstream pipeline overlap arises naturally — earlier tiles progress through later stages while the feeder keeps pushing remaining tiles into the first stage queue (SimPy Store backpressure governs flow control). If the first-stage queue is full, only the feeder blocks; the scheduler worker's inbox processing continues.
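The no-interleaving rule fixes the order in which tiles enter the first stage. A minimal sketch, assuming each pending feed is a (command id, tile count) pair; `feed_order` is a hypothetical helper and it ignores backpressure and timing.

```python
from collections import deque


def feed_order(pending_feeds):
    """Yield (cmd_id, tile_id) in the order a single feeder pushes tiles."""
    queue = deque(pending_feeds)   # (cmd_id, num_tiles) in submission order
    while queue:
        cmd_id, num_tiles = queue.popleft()
        for tile_id in range(num_tiles):   # runs to completion before the next command
            yield cmd_id, tile_id


order = list(feed_order([("A", 2), ("B", 3)]))
```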

Token routing pattern (base class)

def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()
        yield from self._process(env, token)       # stage-specific logic
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            token.pipeline_ctx.complete_tile()

Each component implements only _process(); chaining lives in the base class.
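The routing pattern can be exercised without SimPy. The sketch below replaces processes with a round-robin loop over per-component inboxes and replaces _process() with a visit record. `Token`, `run_pipeline`, and the plan as a tuple of component names are simplifications, not the ADR's types.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Token:
    tile_id: int
    plan: tuple                  # component names in stage order
    stage_idx: int = 0
    visited: list = field(default_factory=list)


def run_pipeline(tokens, components):
    inbox = {name: deque() for name in components}
    completed = []
    for tok in tokens:                         # scheduler: initial dispatch only
        inbox[tok.plan[0]].append(tok)
    progressed = True
    while progressed:
        progressed = False
        for name in components:
            if not inbox[name]:
                continue
            tok = inbox[name].popleft()
            tok.visited.append(name)           # stands in for _process()
            next_idx = tok.stage_idx + 1
            if next_idx < len(tok.plan):
                tok.stage_idx = next_idx
                inbox[tok.plan[next_idx]].append(tok)   # forward without a scheduler hop
            else:
                completed.append(tok.tile_id)           # stands in for complete_tile()
            progressed = True
    return completed


plan = ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_math", "pe_fetch_store", "pe_dma")
tokens = [Token(tile_id=i, plan=plan) for i in range(2)]
completed = run_pipeline(tokens, ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_math"])
```

No component names its successor; the only routing information is the plan carried by the token.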

D7. Observability and trace contract

The simulator emits deterministic trace events:

  • command_submitted
  • sub_command_dispatched
  • engine_start
  • engine_complete
  • tile_ready
  • command_complete

For identical inputs, trace ordering MUST be deterministic.
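This ADR does not specify how ties between events at the same simulated time are broken. One common approach, shown here purely as an assumption, is to stamp each event with a monotonically increasing sequence number and order by (time, sequence). `TraceLog` is a hypothetical name.

```python
import itertools


class TraceLog:
    """Hypothetical trace buffer with a deterministic tie-break on emission order."""

    def __init__(self) -> None:
        self._seq = itertools.count()
        self._events = []

    def emit(self, time_ns: float, kind: str, **fields) -> None:
        self._events.append((time_ns, next(self._seq), kind, fields))

    def ordered(self) -> list:
        # Sort on (time, sequence) only, so payload dicts are never compared.
        events = sorted(self._events, key=lambda e: (e[0], e[1]))
        return [(time_ns, kind) for time_ns, _seq, kind, _fields in events]


log = TraceLog()
log.emit(5.0, "engine_complete", engine="pe_gemm")
log.emit(0.0, "command_submitted", cmd="cmd0")
log.emit(5.0, "tile_ready", tile_id=0)      # same timestamp as engine_complete
log.emit(0.0, "sub_command_dispatched", cmd="cmd0")
trace = log.ordered()
```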

D8. Topology representation

PE-internal components are declared in cube.pe_template:

pe_template:
  components:
    pe_cpu:         { kind: pe_cpu,         impl: builtin.pe_cpu,         attrs: { overhead_ns: ... } }
    pe_scheduler:   { kind: pe_scheduler,   impl: builtin.pe_scheduler,   attrs: { overhead_ns: ... } }
    pe_dma:         { kind: pe_dma,         impl: builtin.pe_dma,         attrs: { rd_engines: 1, wr_engines: 1 } }
    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
    pe_gemm:        { kind: pe_gemm,        impl: builtin.pe_gemm,        attrs: { shared_resource: accel_slot, ... } }
    pe_math:        { kind: pe_math,        impl: builtin.pe_math,        attrs: { shared_resource: accel_slot, ... } }
    pe_tcm:         { kind: pe_tcm,         impl: builtin.pe_tcm,         attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
    pe_mmu:         { kind: pe_mmu,         impl: builtin.pe_mmu,         attrs: { ... } }   # ADR-0011 D-VA
    pe_ipcq:        { kind: pe_ipcq,        impl: builtin.pe_ipcq,        attrs: { ... } }   # ADR-0023
  links:
    # Scheduler dispatch edges (initial)
    scheduler_to_dma_mm:         0.0
    scheduler_to_fetch_store_mm: 0.0
    scheduler_to_gemm_mm:        0.0
    scheduler_to_math_mm:        0.0
    # Pipeline chaining edges (token self-routing per D6)
    dma_to_fetch_store_mm:       0.0
    fetch_store_to_gemm_mm:      0.0
    fetch_store_to_math_mm:      0.0
    gemm_to_fetch_store_mm:      0.0
    gemm_to_math_mm:             0.0
    math_to_fetch_store_mm:      0.0
    fetch_store_to_dma_mm:       0.0
    fetch_store_to_tcm_bw_gbs:   ...

The template is instantiated once per PE. PE instances are derived from cube.pe_layout (corner placement). External connectivity (PE_DMA ↔ cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).
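Per-PE instantiation can be sketched as a name expansion. The node id format follows the example in D6 ("sip0.cube0.pe0.pe_dma"). `instantiate_pe` and the fixed PE count are hypothetical, since the real count and placement come from cube.pe_layout.

```python
TEMPLATE_COMPONENTS = (
    "pe_cpu", "pe_scheduler", "pe_dma", "pe_fetch_store",
    "pe_gemm", "pe_math", "pe_tcm", "pe_mmu", "pe_ipcq",
)


def instantiate_pe(sip: int, cube: int, pe: int) -> dict:
    """Map each template component name to a topology node id for one PE."""
    prefix = f"sip{sip}.cube{cube}.pe{pe}"
    return {name: f"{prefix}.{name}" for name in TEMPLATE_COMPONENTS}


# Assumed PE count for illustration only.
nodes = [instantiate_pe(sip=0, cube=0, pe=i) for i in range(4)]
all_ids = [node_id for pe_nodes in nodes for node_id in pe_nodes.values()]
```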

Consequences

Positive

  • Each block is an independent topology node — individually swappable via DI (ADR-0015).
  • PE-internal structure is visible in the topology graph.
  • Components do not know their downstream — plan-based routing gives flexibility (e.g., epilogue chains require no scheduler change).
  • DMA and compute overlap naturally via SimPy Store backpressure.
  • Multi-op composite expresses fused operations (e.g., GEMM + bias_add) without engine-level coupling.
  • TCM access contention is realistic — PE_FETCH_STORE is the single TCM↔RF gateway.

Negative

  • Intra-PE component count is higher than a coarser model (7 base + 2 cross-referenced) — more topology nodes/edges.
  • Intra-PE token forwarding is explicit in traces (acceptable trade for HW fidelity).

References

  • ADR-0011 D-VA (PE_MMU component, VA translation)
  • ADR-0015 D4 (component port/wire model)
  • ADR-0020 (greenlet kernel execution / two-pass)
  • ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
  • SPEC R3, R4