kernbench2/docs/adr-ko/ADR-0014-dev-pe-pipeline-execution-model.md

ADR-0014: PE Pipeline Execution Model

Status

Accepted

Context

This ADR defines the PE-internal kernel execution model:

  • Role decomposition of PE-internal components
  • Command dispatch paths (simple / composite / multi-op composite with epilogue)
  • TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
  • TCM-centric dataflow with a register-file intermediary
  • Engine resource model
  • Observability and trace contract
  • Topology representation

PE-internal structure (7 components in scope; 2 cross-referenced):

  • pe_cpu, pe_scheduler, pe_dma, pe_fetch_store, pe_gemm, pe_math, pe_tcm — defined here
  • pe_mmu — VA model, defined in ADR-0011 D-VA
  • pe_ipcq — collective communication, defined in ADR-0023

The goal is a deterministic, trace-friendly execution contract that keeps each block independently swappable.

Decision

D1. PE-internal component roles

PE_CPU

  • Executes kernel instruction stream / control logic.
  • Generates PE commands and submits them to PE_SCHEDULER (via PeInternalTxn).
  • Does NOT enqueue work directly into engine queues.

PE_SCHEDULER

  • Sole dispatcher inside a PE.
  • Receives commands from PE_CPU. Dispatch by command type:
    • Simple command (DmaReadCmd, DmaWriteCmd, GemmCmd, MathCmd) → forward directly to the target engine.
    • CompositeCmd → generate a TilePlan, feed tiles into the pipeline via a single _feed_loop (D6).
  • Does not participate in stage-to-stage chaining within a composite; that is handled by token self-routing (D6).

PE_DMA

  • Handles memory transfers between TCM and external memory domains (HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
  • Two execution channels:
    • DMA_READ (capacity = 1) and DMA_WRITE (capacity = 1) — see D4.
  • Additional virtual channels:
    • vc_compute — load/store/writeback traffic for GEMM/MATH tiles.
    • vc_comm — IPCQ collective send data (defined in ADR-0023 D8).

PE_FETCH_STORE

  • TCM ↔ Register File transfer unit.
  • Isolates register-file access semantics from compute engines so that GEMM/MATH stay pure compute components.
  • BW-based latency model; TCM access contention naturally serializes through PE_TCM's BW resource.

PE_GEMM

  • MAC array. Reads operands from the register file; writes results to the register file. Does not touch PE_TCM directly.

PE_MATH

  • Element-wise / reduction / SIMD unit. Reads / writes the register file.

PE_TCM

  • Tightly-coupled scratchpad with BW-serialized access. Two logical regions partitioned by ownership (see D5.1).

Cross-referenced components (defined elsewhere):

  • pe_mmu — VA→PA translation per access (ADR-0011 D-VA).
  • pe_ipcq — collective ring buffers and peer endpoint metadata (ADR-0023).

D2. Command lifecycle and queues

PE_SCHEDULER maintains three logical structures:

SubmissionQueue — written by PE_CPU; consumed by the scheduler.

InflightTable — owned and mutated only by PE_SCHEDULER; tracks expanded sub-commands, dependency state, engine assignment, and completion status.

CompletionQueue — written by PE_SCHEDULER; holds final completion records.

Single-writer rule: only PE_SCHEDULER mutates command completion state. Engines report completion via explicit events / messages consumed by the scheduler.

Command completion: when all sub-commands complete, PE_SCHEDULER publishes a completion record.
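The single-writer rule can be illustrated with a minimal sketch. The class and field names below (InflightEntry, SchedulerState, pending) are hypothetical and not taken from the simulator; the sketch only assumes that the scheduler tracks a set of outstanding sub-commands per command and publishes a completion record when that set becomes empty.

```python
from dataclasses import dataclass, field


@dataclass
class InflightEntry:
    """Hypothetical per-command record; field names are illustrative."""
    cmd_id: int
    pending: set[str] = field(default_factory=set)  # sub-commands not yet done


class SchedulerState:
    """Sketch of the three structures; only this class mutates completion state."""

    def __init__(self) -> None:
        self.submission_queue: list[tuple[int, list[str]]] = []
        self.inflight: dict[int, InflightEntry] = {}
        self.completion_queue: list[int] = []

    def submit(self, cmd_id: int, sub_cmds: list[str]) -> None:
        # PE_CPU side: enqueue only, never touches inflight or completion state.
        self.submission_queue.append((cmd_id, sub_cmds))

    def dispatch_next(self) -> None:
        # Scheduler side: move one command from SubmissionQueue to InflightTable.
        cmd_id, sub_cmds = self.submission_queue.pop(0)
        self.inflight[cmd_id] = InflightEntry(cmd_id, set(sub_cmds))

    def on_engine_complete(self, cmd_id: int, sub_cmd: str) -> None:
        # Engines report completion as an event; the scheduler applies it.
        entry = self.inflight[cmd_id]
        entry.pending.remove(sub_cmd)
        if not entry.pending:
            del self.inflight[cmd_id]
            self.completion_queue.append(cmd_id)


state = SchedulerState()
state.submit(7, ["dma_read", "gemm"])
state.dispatch_next()
state.on_engine_complete(7, "dma_read")
state.on_engine_complete(7, "gemm")
```

Because engines only call on_engine_complete and never write the tables themselves, completion state has exactly one mutator in this sketch.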

D3. Dispatch modes

D3.1 Simple command

A simple command expands to exactly one engine sub-command:

  • DmaReadCmd / DmaWriteCmd → PE_DMA
  • GemmCmd → PE_GEMM
  • MathCmd → PE_MATH

Flow:

PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
       → completion → PE_SCHEDULER → CompletionQueue
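One way to picture the simple-command mapping is a type-keyed lookup table. This is a sketch only: the empty command classes stand in for the real command types, and route_simple is a hypothetical helper name.

```python
# Placeholder command types; the real classes carry payload fields.
class DmaReadCmd: pass
class DmaWriteCmd: pass
class GemmCmd: pass
class MathCmd: pass


# Each simple command expands to exactly one engine sub-command.
SIMPLE_TARGETS = {
    DmaReadCmd: "PE_DMA",
    DmaWriteCmd: "PE_DMA",
    GemmCmd: "PE_GEMM",
    MathCmd: "PE_MATH",
}


def route_simple(cmd: object) -> str:
    """Return the target engine for a simple command (hypothetical helper)."""
    try:
        return SIMPLE_TARGETS[type(cmd)]
    except KeyError:
        raise TypeError(f"not a simple command: {type(cmd).__name__}") from None


targets = [route_simple(c) for c in (DmaReadCmd(), DmaWriteCmd(), GemmCmd(), MathCmd())]
```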

D3.2 Composite command (single-op tiled pipeline)

The default CompositeCmd runs a single compute op as a tile-pipelined sequence:

DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE

PE_SCHEDULER splits the DMA payload into hardware tiles and emits one TileToken per tile with a monotonically increasing tile_id.

Tile dependency (within one tile t):

DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)

Inter-tile overlap is allowed wherever engine resources permit (D4 governs the constraints):

DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t-1) ∥ COMPUTE(t)
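The effect of inter-tile overlap on total latency can be estimated with a small recurrence. The sketch below is a simplification, not the simulator's model: it assumes every stage behaves as an independent capacity-1 unit with a fixed duration and unbounded queues in between, and it ignores that FETCH and STORE share PE_FETCH_STORE and TCM bandwidth. The durations are made-up numbers.

```python
STAGES = ("DMA_READ", "FETCH", "COMPUTE", "STORE", "DMA_WRITE")


def schedule(num_tiles: int, durations: tuple[float, ...]) -> list[list[float]]:
    """Return finish[t][s]: the time at which tile t leaves stage s.

    A stage starts tile t once (a) tile t has left the previous stage and
    (b) the stage has finished tile t-1.
    """
    finish = [[0.0] * len(durations) for _ in range(num_tiles)]
    for t in range(num_tiles):
        for s, d in enumerate(durations):
            prev_stage = finish[t][s - 1] if s > 0 else 0.0
            prev_tile = finish[t - 1][s] if t > 0 else 0.0
            finish[t][s] = max(prev_stage, prev_tile) + d
    return finish


durations = (2, 1, 4, 1, 2)           # illustrative per-stage costs
finish = schedule(3, durations)
pipelined = finish[-1][-1]            # makespan with overlap
serial = 3 * sum(durations)           # makespan if tiles ran back to back
```

With these numbers COMPUTE is the bottleneck stage, so DMA_READ(t+1) and DMA_WRITE(t-1) hide behind COMPUTE(t) and the pipelined makespan is well below the serial one.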

D3.3 Multi-op composite (head + epilogue with scope)

A CompositeCmd MAY carry ops: tuple[OpSpec, ...] to express a multi-op pipeline:

@dataclass(frozen=True)
class OpSpec:
    kind: str         # "gemm" | "math.exp" | "math.bias_add" | ...
    scope: Scope      # "per_k_tile" | "per_output_tile" | "once"
    ...
  • ops[0] (head) defines tile geometry (e.g., the head GEMM determines M/K/N partition).
  • ops[1:] (epilogue) are subsequent stages whose scope decides how often they fire:
    • per_k_tile — every K-reduction step.
    • per_output_tile — once per output tile.
    • once — once per kernel.

Cross-engine chains (e.g., GEMM head → MATH epilogue) need no special handling: each stage is dispatched via token self-routing (D6), and because GEMM and MATH share the compute slot (D4), they run serially within the same composite.

A CompositeCmd with empty ops takes the legacy single-op path (D3.2).
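To make the three scopes concrete, the following sketch counts how often an epilogue op would fire for a given tiling. It assumes that every output tile goes through the same number of K-reduction steps, so per_k_tile fires once per (output tile, K step) pair; fire_count is a hypothetical helper and the OpSpec here carries only the two fields shown above.

```python
from dataclasses import dataclass
from typing import Literal

Scope = Literal["per_k_tile", "per_output_tile", "once"]


@dataclass(frozen=True)
class OpSpec:
    kind: str
    scope: Scope


def fire_count(op: OpSpec, k_tiles: int, output_tiles: int) -> int:
    """Number of times an epilogue op fires (assumes uniform K tiling)."""
    if op.scope == "per_k_tile":
        return k_tiles * output_tiles
    if op.scope == "per_output_tile":
        return output_tiles
    if op.scope == "once":
        return 1
    raise ValueError(f"unknown scope: {op.scope!r}")


epilogue = (
    OpSpec("math.exp", "per_k_tile"),
    OpSpec("math.bias_add", "per_output_tile"),
)
counts = {op.kind: fire_count(op, k_tiles=4, output_tiles=6) for op in epilogue}
```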

D4. Engine resource model

DMA engine:

  • DMA_READ: simpy.Resource(capacity=1).
  • DMA_WRITE: simpy.Resource(capacity=1).
  • Both channels run concurrently (READ ∥ WRITE allowed).
  • Within a channel, requests serialize (READ ∥ READ disallowed; same for WRITE).
  • vc_comm is an orthogonal channel for IPCQ traffic defined in ADR-0023 D8 — out of scope for this ADR.

Compute engine:

  • accel_slot: simpy.Resource(capacity=1) shared by PE_GEMM and PE_MATH.
  • At most one compute op runs at a time within a PE.
  • Multi-op composite chains (D3.3) execute their compute stages serially through this slot; token self-routing (D6) ensures the next stage starts only after the previous compute releases the slot.

Engine completion: each engine emits a completion event consumed by the scheduler / PipelineContext (D6).
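The resource rules above can be checked against a recorded schedule. The sketch below is a hypothetical trace validator, not part of the simulator: it maps each operation to the capacity-1 resource it occupies and reports any two intervals that overlap on the same resource. The interval tuple layout (op, start, end) is an assumption.

```python
from itertools import combinations

# Which capacity-1 resource each operation occupies (per D4).
RESOURCE_OF = {
    "DMA_READ": "dma_read",
    "DMA_WRITE": "dma_write",
    "GEMM": "accel_slot",
    "MATH": "accel_slot",
}

Interval = tuple[str, float, float]  # (op, start, end), end exclusive


def find_conflicts(intervals: list[Interval]) -> list[tuple[Interval, Interval]]:
    """Return every pair of intervals that overlaps on the same resource."""
    conflicts = []
    for a, b in combinations(intervals, 2):
        same_resource = RESOURCE_OF[a[0]] == RESOURCE_OF[b[0]]
        overlaps = a[1] < b[2] and b[1] < a[2]
        if same_resource and overlaps:
            conflicts.append((a, b))
    return conflicts


# READ and WRITE may overlap; GEMM and MATH must not.
legal = [("DMA_READ", 0, 4), ("DMA_WRITE", 1, 3), ("GEMM", 0, 5), ("MATH", 5, 6)]
illegal = [("GEMM", 0, 5), ("MATH", 4, 6), ("DMA_READ", 0, 2), ("DMA_READ", 1, 3)]
legal_conflicts = find_conflicts(legal)
illegal_conflicts = find_conflicts(illegal)
```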

D5. Dataflow

Input path (HBM source):

HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File
Register File → PE_GEMM | PE_MATH

Input path (shared SRAM source):

Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File

Output path (HBM destination):

Register File → PE_FETCH_STORE → PE_TCM
PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM

GEMM/MATH never touch PE_TCM directly — PE_FETCH_STORE is the single TCM↔register-file gateway. This makes TCM BW contention explicit and lets fetch unit policies (e.g., prefetch) be replaced independently of compute engines.

D5.1 PE_TCM partitioning

PE_TCM is split into two logical regions:

SchedulerReservedTCM

  • Owned exclusively by PE_SCHEDULER.
  • Holds composite-command tile buffers.
  • PE_SCHEDULER partitions this region, assigns buffers per DMA_READ / COMPUTE / DMA_WRITE stage, guarantees input/output separation, and manages tile-buffer lifetimes.

AllocatableTCM

  • General-purpose region managed by PEMemAllocator.
  • Used for host / DP-visible allocations.

Visibility rule (hard isolation): PEMemAllocator MUST NOT see or allocate inside SchedulerReservedTCM. The reserved region is excluded from allocator-managed ranges by construction.

Tile buffer rules:

  • Input and output buffers within SchedulerReservedTCM MUST NOT overlap during a tile's active lifetime.
  • A tile buffer remains valid until the corresponding DMA_WRITE completes.
  • Buffer reuse is permitted only after the consuming tile's lifetime ends.
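"Excluded by construction" can be read as: the allocator is only ever handed the allocatable range, so no code path can return a reserved address. The sketch below illustrates that idea with a bump allocator. It is not the real PEMemAllocator; the class name, the constructor arguments, and the choice to place the reserved region at the low addresses are all assumptions.

```python
class SketchPEMemAllocator:
    """Hypothetical bump allocator that only ever sees AllocatableTCM."""

    def __init__(self, tcm_size: int, reserved_size: int) -> None:
        if not 0 <= reserved_size <= tcm_size:
            raise ValueError("reserved region must fit inside the TCM")
        # Assumption: SchedulerReservedTCM occupies [0, reserved_size).
        self._limit = tcm_size
        self._next = reserved_size  # first allocatable byte

    def alloc(self, nbytes: int) -> int:
        """Return the start address of a fresh block inside AllocatableTCM."""
        if nbytes <= 0:
            raise ValueError("allocation size must be positive")
        if self._next + nbytes > self._limit:
            raise MemoryError("AllocatableTCM exhausted")
        addr = self._next
        self._next += nbytes
        return addr


allocator = SketchPEMemAllocator(tcm_size=1024, reserved_size=256)
first = allocator.alloc(100)
second = allocator.alloc(668)
```

Every address the sketch returns is at or above reserved_size, which is the property the visibility rule asks for.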

D6. TileToken self-routing pipeline

A composite's stage-to-stage progression happens without routing through the scheduler. Each component forwards the token directly to the next stage's component using the token's plan:

Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
              ↑ chaining: no scheduler hop                          ↑
                                                  PipelineContext.complete_tile()

This mirrors real-HW done-wire chains. The scheduler handles only initial dispatch + completion aggregation.

TilePlan / Stage

class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5

@dataclass(frozen=True)
class Stage:
    stage_type: StageType
    component: str         # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
    params: dict           # stage-specific parameters

@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]

TileToken

@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext
    plan: TilePlan
    stage_idx: int
    params: dict             # cached current stage params
    data_op: bool = True     # op_log opt-in (ADR-0020 D4)

Single-owner invariant: a token is owned by exactly one component at a time. Lifecycle: scheduler creates with stage_idx=0 → component _process() → increment stage_idx → put to next stage's in_port → last stage calls pipeline_ctx.complete_tile().

PipelineContext (exactly-once completion)

@dataclass
class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: simpy.Event | None = None   # must be set before the first tile completes

    def complete_tile(self) -> None:
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()

Each tile's last stage MUST call complete_tile() exactly once. A duplicate call is a bug: it inflates completed_tiles and can fire done_event before every tile has finished, and a SimPy Event can be triggered at most once.
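A debugging variant that detects duplicate completions could look like the sketch below. It differs from the PipelineContext shown above in two assumed ways: complete_tile takes the tile_id so duplicates can be identified, and a plain callback stands in for the SimPy done_event so the example runs without SimPy. TileCompletionTracker is a hypothetical name.

```python
from typing import Callable


class TileCompletionTracker:
    """Hypothetical exactly-once guard around tile completion."""

    def __init__(self, total_tiles: int, on_done: Callable[[], None]) -> None:
        self.total_tiles = total_tiles
        self._on_done = on_done
        self._done_ids: set[int] = set()

    def complete_tile(self, tile_id: int) -> None:
        if tile_id in self._done_ids:
            # Fail loudly instead of silently over-counting.
            raise RuntimeError(f"tile {tile_id} completed twice")
        self._done_ids.add(tile_id)
        if len(self._done_ids) == self.total_tiles:
            self._on_done()


fired: list[str] = []
tracker = TileCompletionTracker(total_tiles=2, on_done=lambda: fired.append("done"))
tracker.complete_tile(0)
fired_after_first = list(fired)
tracker.complete_tile(1)
```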

Feed ordering

PE_SCHEDULER has exactly one _feed_loop process consuming a _pending_feeds FIFO. Composite commands are enqueued in submission order; tile feed for a command runs to completion before the next command's feed begins. Tile-feed interleaving between commands is disallowed.

Within a single command's tiles, downstream pipeline overlap arises naturally — earlier tiles progress through later stages while the feeder keeps pushing remaining tiles into the first stage queue (SimPy Store backpressure governs flow control). If the first-stage queue is full, only the feeder blocks; the scheduler worker's inbox processing continues.
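The no-interleaving rule amounts to draining one FIFO entry completely before touching the next. The sketch below shows only the resulting feed order; it leaves out SimPy, backpressure, and the scheduler worker, and the (cmd_id, num_tiles) entry layout is an assumption.

```python
from collections import deque


def feed_order(pending_feeds: deque) -> list[tuple[str, int]]:
    """Return (cmd_id, tile_id) pairs in the order a single feeder emits them."""
    fed = []
    while pending_feeds:
        cmd_id, num_tiles = pending_feeds.popleft()   # submission order
        for tile_id in range(num_tiles):              # run to completion
            fed.append((cmd_id, tile_id))
    return fed


order = feed_order(deque([("A", 2), ("B", 3)]))
```

All tiles of command A are fed before the first tile of command B, even though A's tiles may still be in flight downstream when B's feed begins.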

Token routing pattern (base class)

def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()
        yield from self._process(env, token)       # stage-specific logic
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            token.pipeline_ctx.complete_tile()

Each component implements only _process(); chaining lives in the base class.
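As an illustration of what a token's plan looks like for the single-op path in D3.2, the sketch below builds the five-stage sequence and walks it the way the base class advances stage_idx. It is a synchronous stand-in without SimPy; single_op_plan and walk are hypothetical helpers, Stage omits the params field, and the component ids follow the "sip0.cube0.pe0.pe_dma" pattern shown earlier.

```python
from dataclasses import dataclass
from enum import Enum


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5


@dataclass(frozen=True)
class Stage:
    stage_type: StageType
    component: str


def single_op_plan(pe: str, compute: StageType) -> tuple[Stage, ...]:
    """Build the D3.2 stage sequence for one tile (helper name is hypothetical)."""
    engine = {StageType.GEMM: "pe_gemm", StageType.MATH: "pe_math"}[compute]
    return (
        Stage(StageType.DMA_READ, f"{pe}.pe_dma"),
        Stage(StageType.FETCH, f"{pe}.pe_fetch_store"),
        Stage(compute, f"{pe}.{engine}"),
        Stage(StageType.STORE, f"{pe}.pe_fetch_store"),
        Stage(StageType.DMA_WRITE, f"{pe}.pe_dma"),
    )


def walk(stages: tuple[Stage, ...]) -> list[str]:
    """Follow stage_idx from 0 to the end, recording each hop's component."""
    hops = []
    stage_idx = 0
    while stage_idx < len(stages):
        hops.append(stages[stage_idx].component)
        stage_idx += 1  # the base class increments before forwarding the token
    return hops


hops = walk(single_op_plan("sip0.cube0.pe0", StageType.GEMM))
```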

D7. Observability and trace contract

The simulator emits deterministic trace events:

  • command_submitted
  • sub_command_dispatched
  • engine_start
  • engine_complete
  • tile_ready
  • command_complete

For identical inputs, trace ordering MUST be deterministic.
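One way to test the determinism requirement is to hash the full event sequence of two runs and compare the digests. The sketch below is a hypothetical recorder, not the simulator's trace API: it attaches a monotonically increasing sequence number to each event so that events with the same timestamp keep their emission order, and it sorts field names so that keyword order does not affect the digest.

```python
import hashlib
import itertools


class TraceRecorder:
    """Hypothetical recorder; event names follow the list above."""

    def __init__(self) -> None:
        self._seq = itertools.count()
        self.events: list[tuple] = []

    def emit(self, time_ns: int, name: str, **fields) -> None:
        # (time, seq) gives a total order even for same-timestamp events.
        record = (time_ns, next(self._seq), name, tuple(sorted(fields.items())))
        self.events.append(record)

    def digest(self) -> str:
        h = hashlib.sha256()
        for record in self.events:
            h.update(repr(record).encode("utf-8"))
        return h.hexdigest()


def run_once() -> TraceRecorder:
    """Stand-in for one simulation run with fixed inputs."""
    rec = TraceRecorder()
    rec.emit(0, "command_submitted", cmd_id=1)
    rec.emit(5, "sub_command_dispatched", cmd_id=1, engine="PE_GEMM")
    rec.emit(5, "engine_start", engine="PE_GEMM", cmd_id=1)
    rec.emit(9, "engine_complete", cmd_id=1, engine="PE_GEMM")
    rec.emit(9, "command_complete", cmd_id=1)
    return rec


digest_a = run_once().digest()
digest_b = run_once().digest()
```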

D8. Topology representation

PE-internal components are declared in cube.pe_template:

pe_template:
  components:
    pe_cpu:         { kind: pe_cpu,         impl: builtin.pe_cpu,         attrs: { overhead_ns: ... } }
    pe_scheduler:   { kind: pe_scheduler,   impl: builtin.pe_scheduler,   attrs: { overhead_ns: ... } }
    pe_dma:         { kind: pe_dma,         impl: builtin.pe_dma,         attrs: { rd_engines: 1, wr_engines: 1 } }
    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
    pe_gemm:        { kind: pe_gemm,        impl: builtin.pe_gemm,        attrs: { shared_resource: accel_slot, ... } }
    pe_math:        { kind: pe_math,        impl: builtin.pe_math,        attrs: { shared_resource: accel_slot, ... } }
    pe_tcm:         { kind: pe_tcm,         impl: builtin.pe_tcm,         attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
    pe_mmu:         { kind: pe_mmu,         impl: builtin.pe_mmu,         attrs: { ... } }   # ADR-0011 D-VA
    pe_ipcq:        { kind: pe_ipcq,        impl: builtin.pe_ipcq,        attrs: { ... } }   # ADR-0023
  links:
    # Scheduler dispatch edges (initial)
    scheduler_to_dma_mm:         0.0
    scheduler_to_fetch_store_mm: 0.0
    scheduler_to_gemm_mm:        0.0
    scheduler_to_math_mm:        0.0
    # Pipeline chaining edges (token self-routing per D6)
    dma_to_fetch_store_mm:       0.0
    fetch_store_to_gemm_mm:      0.0
    fetch_store_to_math_mm:      0.0
    gemm_to_fetch_store_mm:      0.0
    gemm_to_math_mm:             0.0
    math_to_fetch_store_mm:      0.0
    fetch_store_to_dma_mm:       0.0
    fetch_store_to_tcm_bw_gbs:   ...

The template is instantiated once per PE. PE instances are derived from cube.pe_layout (corner placement). External connectivity (PE_DMA ↔ cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).
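A plan is only routable if every consecutive pair of stages has a matching chaining edge in the template. The sketch below checks that, under the assumption that edge names can be derived as "<src>_to_<dst>_mm" from the component names with the "pe_" prefix dropped; that naming rule is inferred from the link list above and is not a documented contract. missing_links is a hypothetical helper.

```python
# Pipeline chaining edges copied from the pe_template links section.
CHAIN_LINKS = {
    "dma_to_fetch_store_mm",
    "fetch_store_to_gemm_mm",
    "fetch_store_to_math_mm",
    "gemm_to_fetch_store_mm",
    "gemm_to_math_mm",
    "math_to_fetch_store_mm",
    "fetch_store_to_dma_mm",
}


def missing_links(components: list[str]) -> list[str]:
    """Return edge names a component sequence needs but the template lacks."""
    missing = []
    for src, dst in zip(components, components[1:]):
        # Assumed naming rule: strip the "pe_" prefix and join with "_to_".
        name = f"{src.removeprefix('pe_')}_to_{dst.removeprefix('pe_')}_mm"
        if name not in CHAIN_LINKS:
            missing.append(name)
    return missing


# GEMM head with a MATH epilogue (D3.3).
fused = ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_math", "pe_fetch_store", "pe_dma"]
# A sequence that skips the STORE stage, which the template does not wire.
broken = ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_dma"]
fused_missing = missing_links(fused)
broken_missing = missing_links(broken)
```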

Consequences

Positive

  • Each block is an independent topology node — individually swappable via DI (ADR-0015).
  • PE-internal structure is visible in the topology graph.
  • Components do not know their downstream — plan-based routing gives flexibility (e.g., epilogue chains require no scheduler change).
  • DMA and compute overlap naturally via SimPy Store backpressure.
  • Multi-op composite expresses fused operations (e.g., GEMM + bias_add) without engine-level coupling.
  • TCM access contention is realistic — PE_FETCH_STORE is the single TCM↔RF gateway.

Negative

  • Intra-PE component count is higher than a coarser model (7 base + 2 cross-referenced) — more topology nodes/edges.
  • Intra-PE token forwarding is explicit in traces (acceptable trade for HW fidelity).

Related

  • ADR-0011 D-VA (PE_MMU component, VA translation)
  • ADR-0015 D4 (component port/wire model)
  • ADR-0020 (greenlet kernel execution / two-pass)
  • ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
  • SPEC R3, R4