ywkang a796c1d2f7 ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/

Establish English as the canonical ADR language with Korean translations
held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror).
Promotion from adr-proposed/ to adr/ now writes English to adr/ and the
Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.

- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English,
  2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix
  dropped). ADR-0023 EN regenerated against KO source which had newer
  HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark
  docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline
  section covering bidirectional sync, conflict resolution (EN wins),
  and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair
  completeness, filename mirroring, ADR-ID match, Status byte-equality.
  Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF
  normalization, em-dash title separator, underscore-slug edge case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 01:38:44 -07:00

14 KiB

ADR-0014: PE Pipeline Execution Model

Status

Accepted

Context

This ADR defines the PE-internal kernel execution model:

Role decomposition of PE-internal components
Command dispatch paths (simple / composite / multi-op composite with epilogue)
TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
TCM-centric dataflow with a register-file intermediary
Engine resource model
Observability and trace contract
Topology representation

PE-internal structure (7 components in scope; 2 cross-referenced):

pe_cpu, pe_scheduler, pe_dma, pe_fetch_store, pe_gemm, pe_math, pe_tcm — defined here
pe_mmu — VA model, defined in ADR-0011 D-VA
pe_ipcq — collective communication, defined in ADR-0023

The goal is a deterministic, trace-friendly execution contract that keeps each block independently swappable.

Decision

D1. PE-internal component roles

PE_CPU

Executes kernel instruction stream / control logic.
Generates PE commands and submits them to PE_SCHEDULER (via PeInternalTxn).
Does NOT enqueue work directly into engine queues.

PE_SCHEDULER

Sole dispatcher inside a PE.
Receives commands from PE_CPU. Dispatch by command type:
- Simple command (DmaReadCmd, DmaWriteCmd, GemmCmd, MathCmd) → forward directly to the target engine.
- CompositeCmd → generate a TilePlan, feed tiles into the pipeline via a single _feed_loop (D6).
Does not participate in stage-to-stage chaining within a composite; that is handled by token self-routing (D6).

PE_DMA

Handles memory transfers between TCM and external memory domains (HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
Two execution channels:
- DMA_READ (capacity = 1) and DMA_WRITE (capacity = 1) — see D4.
Additional virtual channels:
- vc_compute — load/store/writeback traffic for GEMM/MATH tiles.
- vc_comm — IPCQ collective send data (defined in ADR-0023 D8).

PE_FETCH_STORE

TCM ↔ Register File transfer unit.
Isolates register-file access semantics from compute engines so that GEMM/MATH stay pure compute components.
Latency is modeled from bandwidth (BW); contending TCM accesses serialize through PE_TCM's BW resource.

PE_GEMM

MAC array. Reads operands from the register file; writes results to the register file. Does not touch PE_TCM directly.

PE_MATH

Element-wise / reduction / SIMD unit. Reads / writes the register file.

PE_TCM

Tightly-coupled scratchpad with BW-serialized access. Two logical regions partitioned by ownership (see D5).

Cross-referenced components (defined elsewhere):

pe_mmu — VA→PA translation per access (ADR-0011 D-VA).
pe_ipcq — collective ring buffers and peer endpoint metadata (ADR-0023).

D2. Command lifecycle and queues

PE_SCHEDULER maintains three logical structures:

SubmissionQueue — written by PE_CPU; consumed by the scheduler.

InflightTable — owned and mutated only by PE_SCHEDULER; tracks expanded sub-commands, dependency state, engine assignment, and completion status.

CompletionQueue — written by PE_SCHEDULER; holds final completion records.

Single-writer rule: only PE_SCHEDULER mutates command completion state. Engines report completion via explicit events / messages consumed by the scheduler.

Command completion: when all sub-commands complete, PE_SCHEDULER publishes a completion record.
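The three structures and the single-writer rule can be sketched as follows. This is a minimal illustration rather than the simulator's code: `SchedulerState`, `InflightEntry`, the method names, and the record shapes are hypothetical; only the three structure names come from the text above.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InflightEntry:
    """Hypothetical per-command record kept in the InflightTable."""
    cmd_id: int
    pending: set[str] = field(default_factory=set)  # sub-commands not yet done


class SchedulerState:
    """Sketch of PE_SCHEDULER's three logical structures (names from D2)."""

    def __init__(self) -> None:
        self.submission_queue: deque[tuple[int, list[str]]] = deque()  # written by PE_CPU
        self.inflight_table: dict[int, InflightEntry] = {}             # scheduler-only
        self.completion_queue: deque[int] = deque()                    # written by scheduler

    def submit(self, cmd_id: int, sub_cmds: list[str]) -> None:
        # PE_CPU side: the CPU only appends a submission.
        self.submission_queue.append((cmd_id, sub_cmds))

    def expand_next(self) -> None:
        # Scheduler side: pop one submission and record its sub-commands.
        cmd_id, sub_cmds = self.submission_queue.popleft()
        self.inflight_table[cmd_id] = InflightEntry(cmd_id, set(sub_cmds))

    def on_engine_done(self, cmd_id: int, sub_cmd: str) -> None:
        # Engines only report; the scheduler alone mutates completion state.
        entry = self.inflight_table[cmd_id]
        entry.pending.remove(sub_cmd)  # raises KeyError on a duplicate report
        if not entry.pending:
            del self.inflight_table[cmd_id]
            self.completion_queue.append(cmd_id)


state = SchedulerState()
state.submit(7, ["dma_read", "gemm"])
state.expand_next()
state.on_engine_done(7, "dma_read")
state.on_engine_done(7, "gemm")
```

The completion record for command 7 is published only after both sub-commands have reported, and no code path outside `SchedulerState` touches the in-flight entry.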

D3. Dispatch modes

D3.1 Simple command

A simple command expands to exactly one engine sub-command:

DmaReadCmd / DmaWriteCmd → PE_DMA
GemmCmd → PE_GEMM
MathCmd → PE_MATH

Flow:

PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
       → completion → PE_SCHEDULER → CompletionQueue
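The one-to-one mapping from simple command type to engine could be expressed as a lookup table. The sketch below assumes the four command classes exist as plain Python types; their bodies are placeholders, and `SIMPLE_DISPATCH` and `target_engine` are hypothetical names.

```python
class DmaReadCmd: ...
class DmaWriteCmd: ...
class GemmCmd: ...
class MathCmd: ...


# Command type -> target engine, as listed in D3.1.
SIMPLE_DISPATCH = {
    DmaReadCmd: "PE_DMA",
    DmaWriteCmd: "PE_DMA",
    GemmCmd: "PE_GEMM",
    MathCmd: "PE_MATH",
}


def target_engine(cmd: object) -> str:
    """Return the engine a simple command is forwarded to."""
    try:
        return SIMPLE_DISPATCH[type(cmd)]
    except KeyError:
        raise TypeError(f"not a simple command: {type(cmd).__name__}") from None
```

Anything absent from the table (for example a CompositeCmd) is rejected here and would take the tiled path instead.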

D3.2 Composite command (single-op tiled pipeline)

The default CompositeCmd runs a single compute op as a tile-pipelined sequence:

DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE

PE_SCHEDULER splits the DMA payload into hardware tiles and emits one TileToken per tile with a monotonically increasing tile_id.

Tile dependency (within one tile t):

DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)

Inter-tile overlap is allowed wherever engine resources permit (D4 governs the constraints):

DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t-1) ∥ COMPUTE(t)
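To see where the overlap comes from, the following sketch computes start and end times for each tile and stage. It rests on simplifying assumptions that are not part of this ADR: every stage is treated as its own capacity-1 resource (FETCH and STORE are not made to contend with each other), tiles enter in order, and the durations are made-up numbers in arbitrary time units.

```python
STAGES = ("DMA_READ", "FETCH", "COMPUTE", "STORE", "DMA_WRITE")


def schedule(num_tiles: int, dur: dict[str, int]) -> list[dict[str, tuple[int, int]]]:
    """Return per-tile {stage: (start, end)} under in-order, capacity-1 stages."""
    free_at = {s: 0 for s in STAGES}        # when each stage resource is next free
    timeline = []
    for _ in range(num_tiles):
        ready = 0                            # when this tile's previous stage ended
        spans = {}
        for s in STAGES:
            start = max(ready, free_at[s])   # wait for the tile and for the stage
            end = start + dur[s]
            spans[s] = (start, end)
            free_at[s] = end
            ready = end
        timeline.append(spans)
    return timeline


def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if two half-open intervals share any time."""
    return a[0] < b[1] and b[0] < a[1]


durations = {"DMA_READ": 4, "FETCH": 1, "COMPUTE": 3, "STORE": 1, "DMA_WRITE": 4}
tl = schedule(3, durations)
```

With these numbers, tile 1's DMA_READ runs while tile 0 is in COMPUTE, and tile 0's DMA_WRITE runs while tile 1 is in COMPUTE, so three tiles finish at t = 21 instead of the fully serial 39.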

D3.3 Multi-op composite (head + epilogue with scope)

A CompositeCmd MAY carry ops: tuple[OpSpec, ...] to express a multi-op pipeline:

@dataclass(frozen=True)
class OpSpec:
    kind: str         # "gemm" | "math.exp" | "math.bias_add" | ...
    scope: Scope      # "per_k_tile" | "per_output_tile" | "once"
    ...

ops[0] (head) defines tile geometry (e.g., the head GEMM determines M/K/N partition).
ops[1:] (epilogue) are subsequent stages whose scope decides how often they fire:
- per_k_tile — every K-reduction step.
- per_output_tile — once per output tile.
- once — once per kernel.
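As a rough illustration of the three scopes, the sketch below counts how often an epilogue op would fire for a given tile geometry. `TileGeometry` and `fire_count` are hypothetical names, and reading `per_k_tile` as "once per K step of every output tile" is an assumption about the intended semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileGeometry:
    """Hypothetical tile counts derived from the head op."""
    output_tiles: int   # number of M x N output tiles
    k_tiles: int        # K-reduction steps per output tile


def fire_count(scope: str, geo: TileGeometry) -> int:
    """How many times an epilogue op with this scope fires in one composite."""
    if scope == "per_k_tile":
        return geo.output_tiles * geo.k_tiles
    if scope == "per_output_tile":
        return geo.output_tiles
    if scope == "once":
        return 1
    raise ValueError(f"unknown scope: {scope!r}")


geo = TileGeometry(output_tiles=4, k_tiles=8)
counts = {s: fire_count(s, geo) for s in ("per_k_tile", "per_output_tile", "once")}
```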

Cross-engine chains (e.g., GEMM head → MATH epilogue) need no special handling: each stage is dispatched via token self-routing (D6), so the GEMM and MATH stages of one composite take turns on the shared compute slot (D4).

A CompositeCmd with empty ops takes the legacy single-op path (D3.2).

D4. Engine resource model

DMA engine:

DMA_READ: simpy.Resource(capacity=1).
DMA_WRITE: simpy.Resource(capacity=1).
Both channels run concurrently (READ ∥ WRITE allowed).
Within a channel, requests serialize (READ ∥ READ disallowed; same for WRITE).
vc_comm is an orthogonal channel for IPCQ traffic defined in ADR-0023 D8 — out of scope for this ADR.

Compute engine:

accel_slot: simpy.Resource(capacity=1) shared by PE_GEMM and PE_MATH.
At most one compute op runs at a time within a PE.
Multi-op composite chains (D3.3) execute their compute stages serially through this slot; token self-routing (D6) ensures the next stage starts only after the previous compute releases the slot.

Engine completion: each engine emits a completion event consumed by the scheduler / PipelineContext (D6).
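The concurrency rules above can be checked with a toy model that stands in for `simpy.Resource(capacity=1)` without using SimPy. The `Slot` class, the first-come-first-served order, and all durations are assumptions made for illustration.

```python
class Slot:
    """Capacity-1 resource modeled as a 'busy until' timestamp."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.free_at = 0

    def run(self, ready: int, duration: int) -> tuple[int, int]:
        """Occupy the slot as soon as both the request and the slot are ready."""
        start = max(ready, self.free_at)
        self.free_at = start + duration
        return start, self.free_at


dma_read, dma_write = Slot("DMA_READ"), Slot("DMA_WRITE")
accel_slot = Slot("accel_slot")                  # shared by PE_GEMM and PE_MATH

read_a = dma_read.run(ready=0, duration=4)       # first read starts immediately
read_b = dma_read.run(ready=1, duration=4)       # READ || READ disallowed: waits
write_a = dma_write.run(ready=1, duration=4)     # READ || WRITE allowed: no wait
gemm = accel_slot.run(ready=0, duration=5)       # GEMM holds the compute slot
math_epi = accel_slot.run(ready=2, duration=3)   # MATH waits for GEMM to release
```

The second read is pushed back to t = 4, the write starts at t = 1 alongside the first read, and the MATH epilogue begins only at t = 5 when GEMM releases `accel_slot`.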

D5. Dataflow

Input path (HBM source):

HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File
Register File → PE_GEMM | PE_MATH

Input path (shared SRAM source):

Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File

Output path (HBM destination):

Register File → PE_FETCH_STORE → PE_TCM
PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM

GEMM/MATH never touch PE_TCM directly: PE_FETCH_STORE is the single TCM↔register-file gateway. This makes TCM BW contention explicit and lets fetch-unit policies (e.g., prefetch) be replaced independently of the compute engines.

D5.1 PE_TCM partitioning

PE_TCM is split into two logical regions:

SchedulerReservedTCM

Owned exclusively by PE_SCHEDULER.
Holds composite-command tile buffers.
PE_SCHEDULER partitions this region, assigns buffers per DMA_READ / COMPUTE / DMA_WRITE stage, guarantees input/output separation, and manages tile-buffer lifetimes.

AllocatableTCM

General-purpose region managed by PEMemAllocator.
Used for host / DP-visible allocations.

Visibility rule (hard isolation): PEMemAllocator MUST NOT see or allocate inside SchedulerReservedTCM. The reserved region is excluded from allocator-managed ranges by construction.

Tile buffer rules:

Input and output buffers within SchedulerReservedTCM MUST NOT overlap during a tile's active lifetime.
A tile buffer remains valid until the corresponding DMA_WRITE completes.
Buffer reuse is permitted only after the consuming tile's lifetime ends.
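Hard isolation "by construction" could look like the sketch below, where the allocator is built from the non-reserved address range only. The bump-allocation strategy, the byte-addressed layout with the reserved region at the low addresses, and the `partition_tcm` helper are all assumptions; only the names PEMemAllocator, SchedulerReservedTCM, and AllocatableTCM come from this ADR.

```python
class PEMemAllocator:
    """Bump allocator over AllocatableTCM only (sketch)."""

    def __init__(self, base: int, size: int) -> None:
        self._next = base
        self._limit = base + size

    def alloc(self, nbytes: int) -> range:
        if nbytes <= 0 or self._next + nbytes > self._limit:
            raise MemoryError("AllocatableTCM exhausted")
        block = range(self._next, self._next + nbytes)
        self._next += nbytes
        return block


def partition_tcm(tcm_size: int, reserved_size: int) -> tuple[range, PEMemAllocator]:
    """Split TCM into [0, reserved) for the scheduler and the rest for the allocator."""
    if not 0 <= reserved_size <= tcm_size:
        raise ValueError("reserved region must fit inside TCM")
    reserved = range(0, reserved_size)
    # The allocator is constructed from the non-reserved range only, so it
    # has no way to hand out an address inside SchedulerReservedTCM.
    return reserved, PEMemAllocator(base=reserved_size, size=tcm_size - reserved_size)


reserved, allocator = partition_tcm(tcm_size=1024, reserved_size=256)
block = allocator.alloc(512)
```

Because the allocator never receives the reserved range, no runtime check is needed to keep it out of the scheduler's tile buffers.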

D6. TileToken self-routing pipeline

A composite's stage-to-stage progression happens without routing through the scheduler. Each component forwards the token directly to the next stage's component using the token's plan:

Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
              ↑ chaining: no scheduler hop                          ↑
                                                  PipelineContext.complete_tile()

This mirrors real-HW done-wire chains. The scheduler handles only initial dispatch + completion aggregation.

TilePlan / Stage

from dataclasses import dataclass
from enum import Enum

class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5

@dataclass(frozen=True)
class Stage:
    stage_type: StageType
    component: str         # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
    params: dict           # stage-specific parameters

@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]

TileToken

@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext
    plan: TilePlan
    stage_idx: int
    params: dict             # cached current stage params
    data_op: bool = True     # op_log opt-in (ADR-0020 D4)

Single-owner invariant: a token is owned by exactly one component at a time. Lifecycle: scheduler creates with stage_idx=0 → component _process() → increment stage_idx → put to next stage's in_port → last stage calls pipeline_ctx.complete_tile().

PipelineContext (exactly-once completion)

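import simpy
from dataclasses import dataclass

@dataclass
class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: simpy.Event | None = None   # must be set before the first complete_tile() call

    def complete_tile(self) -> None:
        # Called once by the last stage of each tile.
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()        # fires when the final tile completes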

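Each tile's last stage MUST call complete_tile() exactly once. Duplicate calls are bugs: an extra call can make completed_tiles reach total_tiles early, so done_event fires before every tile has finished. A SimPy Event can succeed at most once, so a premature trigger cannot be corrected afterwards.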
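One way to surface a duplicate completion immediately is to track completed tile ids instead of a bare counter. The variant below is a sketch under stated assumptions: `CheckedPipelineContext` is a hypothetical name, `complete_tile` takes a `tile_id` argument that the ADR's version does not, and a plain callback stands in for `done_event.succeed()` so the example runs without SimPy.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CheckedPipelineContext:
    """PipelineContext variant that rejects duplicate completions (sketch)."""
    id: str
    total_tiles: int
    on_done: Callable[[], None]                  # stands in for done_event.succeed
    _completed: set[int] = field(default_factory=set)

    def complete_tile(self, tile_id: int) -> None:
        if not 0 <= tile_id < self.total_tiles:
            raise ValueError(f"{self.id}: tile {tile_id} out of range")
        if tile_id in self._completed:
            raise RuntimeError(f"{self.id}: tile {tile_id} completed twice")
        self._completed.add(tile_id)
        if len(self._completed) == self.total_tiles:
            self.on_done()


fired = []
ctx = CheckedPipelineContext("cmd0", total_tiles=2, on_done=lambda: fired.append("done"))
ctx.complete_tile(0)
try:
    ctx.complete_tile(0)          # duplicate: rejected, completion state unchanged
except RuntimeError as exc:
    duplicate_error = str(exc)
ctx.complete_tile(1)
```

The duplicate is reported at the call site, and the done callback still fires exactly once when tile 1 completes.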

Feed ordering

PE_SCHEDULER has exactly one _feed_loop process consuming a _pending_feeds FIFO. Composite commands are enqueued in submission order; tile feed for a command runs to completion before the next command's feed begins. Tile-feed interleaving between commands is disallowed.

Within a single command's tiles, downstream pipeline overlap arises naturally — earlier tiles progress through later stages while the feeder keeps pushing remaining tiles into the first stage queue (SimPy Store backpressure governs flow control). If the first-stage queue is full, only the feeder blocks; the scheduler worker's inbox processing continues.
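The no-interleaving rule amounts to draining one command's tiles before touching the next FIFO entry. The sketch below shows only the resulting feed order; backpressure from the first-stage queue is not modeled, and the `(command_id, total_tiles)` entry shape is an assumption.

```python
from collections import deque
from typing import Iterator


def feed_order(pending_feeds: deque[tuple[str, int]]) -> Iterator[tuple[str, int]]:
    """Yield (command_id, tile_id) in the order a single feed loop would push them."""
    while pending_feeds:
        cmd_id, total_tiles = pending_feeds.popleft()   # FIFO: submission order
        for tile_id in range(total_tiles):              # finish this command's feed first
            yield cmd_id, tile_id


pending = deque([("A", 3), ("B", 2)])
order = list(feed_order(pending))
```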

Token routing pattern (base class)

def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()
        yield from self._process(env, token)       # stage-specific logic
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            token.pipeline_ctx.complete_tile()

Each component implements only _process(); chaining lives in the base class.
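The effect of plan-based routing can be tried out with a synchronous stand-in that uses plain deques instead of SimPy stores. Everything here is a simplification for illustration: `Token`, `run`, the short component names, and the round-robin servicing order are hypothetical, and `_process()` is reduced to appending to a visit log.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    component: str


@dataclass
class Token:
    tile_id: int
    stages: tuple[Stage, ...]
    stage_idx: int = 0


def run(tokens: list[Token]) -> tuple[list[tuple[int, str]], list[int]]:
    """Route tokens through per-component inboxes; return (visit log, completed ids)."""
    inboxes: dict[str, deque[Token]] = {}
    visits: list[tuple[int, str]] = []
    completed: list[int] = []
    for tok in tokens:                                  # initial dispatch by the scheduler
        inboxes.setdefault(tok.stages[0].component, deque()).append(tok)
    while any(inboxes.values()):
        for name in list(inboxes):
            if not inboxes[name]:
                continue
            tok = inboxes[name].popleft()               # component takes ownership
            visits.append((tok.tile_id, name))          # stand-in for _process()
            tok.stage_idx += 1
            if tok.stage_idx < len(tok.stages):         # forward directly, no scheduler hop
                nxt = tok.stages[tok.stage_idx].component
                inboxes.setdefault(nxt, deque()).append(tok)
            else:
                completed.append(tok.tile_id)           # stand-in for complete_tile()
    return visits, completed


plain = tuple(Stage(c) for c in ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_fetch_store", "pe_dma"))
fused = plain[:3] + (Stage("pe_math"),) + plain[3:]     # same route plus a MATH epilogue
visits, completed = run([Token(0, plain), Token(1, fused)])
```

Tile 1 picks up the extra `pe_math` hop purely because its plan lists it; the routing loop itself is unchanged.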

D7. Observability and trace contract

The simulator emits deterministic trace events:

command_submitted
sub_command_dispatched
engine_start
engine_complete
tile_ready
command_complete

For identical inputs, trace ordering MUST be deterministic.
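A determinism check can be as simple as running the same scenario twice and comparing the traces. In the sketch below the event names come from the list above, while the `TraceEvent` fields (`t_ns`, `seq`, `subject`), the `Tracer` class, and the hard-coded scenario in `run_once` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class TraceEvent:
    t_ns: int       # simulation time (hypothetical field)
    seq: int        # emission counter used as a tie-breaker (hypothetical field)
    name: str       # one of the event names listed in D7
    subject: str    # command or sub-command identifier (hypothetical field)


class Tracer:
    def __init__(self) -> None:
        self.events: list[TraceEvent] = []

    def emit(self, t_ns: int, name: str, subject: str) -> None:
        self.events.append(TraceEvent(t_ns, len(self.events), name, subject))


def run_once() -> list[TraceEvent]:
    """Stand-in for one simulator run with fixed inputs."""
    tr = Tracer()
    tr.emit(0, "command_submitted", "cmd0")
    tr.emit(0, "sub_command_dispatched", "cmd0/dma_read")
    tr.emit(0, "engine_start", "cmd0/dma_read")
    tr.emit(4, "engine_complete", "cmd0/dma_read")
    tr.emit(4, "command_complete", "cmd0")
    return tr.events


first, second = run_once(), run_once()
```

Comparing `first == second` in a regression test catches ordering drift, and the `(t_ns, seq)` prefix gives events at the same timestamp a stable total order.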

D8. Topology representation

PE-internal components are declared in cube.pe_template:

pe_template:
  components:
    pe_cpu:         { kind: pe_cpu,         impl: builtin.pe_cpu,         attrs: { overhead_ns: ... } }
    pe_scheduler:   { kind: pe_scheduler,   impl: builtin.pe_scheduler,   attrs: { overhead_ns: ... } }
    pe_dma:         { kind: pe_dma,         impl: builtin.pe_dma,         attrs: { rd_engines: 1, wr_engines: 1 } }
    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
    pe_gemm:        { kind: pe_gemm,        impl: builtin.pe_gemm,        attrs: { shared_resource: accel_slot, ... } }
    pe_math:        { kind: pe_math,        impl: builtin.pe_math,        attrs: { shared_resource: accel_slot, ... } }
    pe_tcm:         { kind: pe_tcm,         impl: builtin.pe_tcm,         attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
    pe_mmu:         { kind: pe_mmu,         impl: builtin.pe_mmu,         attrs: { ... } }   # ADR-0011 D-VA
    pe_ipcq:        { kind: pe_ipcq,        impl: builtin.pe_ipcq,        attrs: { ... } }   # ADR-0023
  links:
    # Scheduler dispatch edges (initial)
    scheduler_to_dma_mm:         0.0
    scheduler_to_fetch_store_mm: 0.0
    scheduler_to_gemm_mm:        0.0
    scheduler_to_math_mm:        0.0
    # Pipeline chaining edges (token self-routing per D6)
    dma_to_fetch_store_mm:       0.0
    fetch_store_to_gemm_mm:      0.0
    fetch_store_to_math_mm:      0.0
    gemm_to_fetch_store_mm:      0.0
    gemm_to_math_mm:             0.0
    math_to_fetch_store_mm:      0.0
    fetch_store_to_dma_mm:       0.0
    fetch_store_to_tcm_bw_gbs:   ...

The template is instantiated once per PE. PE instances are derived from cube.pe_layout (corner placement). External connectivity (PE_DMA ↔ cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).
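Stamping the template out per PE might look like the sketch below. The node-id format follows the `"sip0.cube0.pe0.pe_dma"` example given in D6; the `instantiate` helper and the choice of four PEs per cube are assumptions, and the real PE count would come from cube.pe_layout.

```python
PE_TEMPLATE_COMPONENTS = (
    "pe_cpu", "pe_scheduler", "pe_dma", "pe_fetch_store",
    "pe_gemm", "pe_math", "pe_tcm", "pe_mmu", "pe_ipcq",
)


def instantiate(cube_id: str, num_pes: int) -> dict[str, str]:
    """Map each per-PE node id to the template component it was stamped from."""
    nodes = {}
    for pe in range(num_pes):
        for comp in PE_TEMPLATE_COMPONENTS:
            nodes[f"{cube_id}.pe{pe}.{comp}"] = comp
    return nodes


nodes = instantiate("sip0.cube0", num_pes=4)   # num_pes=4 is an illustrative value
```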

Consequences

Positive

Each block is an independent topology node — individually swappable via DI (ADR-0015).
PE-internal structure is visible in the topology graph.
Components do not hard-code their downstream stage; the route comes from the token's plan, so new chains (e.g., epilogue chains) require no scheduler change.
DMA and compute overlap naturally via SimPy Store backpressure.
Multi-op composite expresses fused operations (e.g., GEMM + bias_add) without engine-level coupling.
TCM access contention is realistic — PE_FETCH_STORE is the single TCM↔RF gateway.

Negative

Intra-PE component count is higher than in a coarser model (7 base + 2 cross-referenced), which means more topology nodes and edges.
Intra-PE token forwarding is explicit in traces (acceptable trade for HW fidelity).
