ADR-0014: PE Pipeline Execution Model
Status
Accepted
Context
This ADR defines the PE-internal kernel execution model:
- Role decomposition of PE-internal components
- Command dispatch paths (simple / composite / multi-op composite with epilogue)
- TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
- TCM-centric dataflow with a register-file intermediary
- Engine resource model
- Observability and trace contract
- Topology representation
PE-internal structure (7 components in scope; 2 cross-referenced):
- pe_cpu, pe_scheduler, pe_dma, pe_fetch_store, pe_gemm, pe_math, pe_tcm: defined here
- pe_mmu: VA model, defined in ADR-0011 D-VA
- pe_ipcq: collective communication, defined in ADR-0023
The goal is a deterministic, trace-friendly execution contract that keeps each block independently swappable.
Decision
D1. PE-internal component roles
PE_CPU
- Executes kernel instruction stream / control logic.
- Generates PE commands and submits them to PE_SCHEDULER (via PeInternalTxn).
- Does NOT enqueue work directly into engine queues.
PE_SCHEDULER
- Sole dispatcher inside a PE.
- Receives commands from PE_CPU. Dispatch by command type:
  - Simple command (DmaReadCmd, DmaWriteCmd, GemmCmd, MathCmd) → forward directly to the target engine.
  - CompositeCmd → generate a TilePlan, feed tiles into the pipeline via a single _feed_loop (D6).
- Does not participate in stage-to-stage chaining within a composite; that is handled by token self-routing (D6).
PE_DMA
- Handles memory transfers between TCM and external memory domains (HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
- Two execution channels: DMA_READ (capacity = 1) and DMA_WRITE (capacity = 1); see D4.
- Additional virtual channels:
  - vc_compute: load/store/writeback traffic for GEMM/MATH tiles.
  - vc_comm: IPCQ collective send data (defined in ADR-0023 D8).
PE_FETCH_STORE
- TCM ↔ Register File transfer unit.
- Isolates register-file access semantics from compute engines so that GEMM/MATH stay pure compute components.
- BW-based latency model; TCM access contention naturally serializes through PE_TCM's BW resource.
PE_GEMM
- MAC array. Reads operands from the register file; writes results to the register file. Does not touch PE_TCM directly.
PE_MATH
- Element-wise / reduction / SIMD unit. Reads / writes the register file.
PE_TCM
- Tightly-coupled scratchpad with BW-serialized access. Two logical regions partitioned by ownership (see D5).
Cross-referenced components (defined elsewhere):
- pe_mmu: VA→PA translation per access (ADR-0011 D-VA).
- pe_ipcq: collective ring buffers and peer endpoint metadata (ADR-0023).
D2. Command lifecycle and queues
PE_SCHEDULER maintains three logical structures:
- SubmissionQueue: written by PE_CPU; consumed by the scheduler.
- InflightTable: owned and mutated only by PE_SCHEDULER; tracks expanded sub-commands, dependency state, engine assignment, and completion status.
- CompletionQueue: written by PE_SCHEDULER; holds final completion records.
Single-writer rule: only PE_SCHEDULER mutates command completion
state. Engines report completion via explicit events / messages
consumed by the scheduler.
Command completion: when all sub-commands complete, PE_SCHEDULER
publishes a completion record.
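The single-writer rule can be illustrated with a small stdlib-only sketch. The class and field names below (`InflightEntry`, `SchedulerModel`, `on_engine_done`) are hypothetical and chosen for illustration; the sketch only assumes that engines report completion and that the scheduler alone applies the state change.

```python
from dataclasses import dataclass, field


@dataclass
class InflightEntry:
    """Scheduler-private record for one command (hypothetical layout)."""
    cmd_id: int
    pending: set[str] = field(default_factory=set)  # outstanding sub-command ids


class SchedulerModel:
    """Toy stand-in for PE_SCHEDULER: the only object that mutates state."""

    def __init__(self) -> None:
        self.inflight: dict[int, InflightEntry] = {}
        self.completion_queue: list[int] = []

    def dispatch(self, cmd_id: int, sub_ids: list[str]) -> None:
        # Expand a command into its sub-commands and start tracking them.
        self.inflight[cmd_id] = InflightEntry(cmd_id, set(sub_ids))

    def on_engine_done(self, cmd_id: int, sub_id: str) -> None:
        # Engines only report; the scheduler applies the state change.
        entry = self.inflight[cmd_id]
        entry.pending.remove(sub_id)  # KeyError flags a duplicate report
        if not entry.pending:
            # All sub-commands finished: publish one completion record.
            del self.inflight[cmd_id]
            self.completion_queue.append(cmd_id)


sched = SchedulerModel()
sched.dispatch(7, ["dma_read", "gemm"])
sched.on_engine_done(7, "dma_read")
partial = list(sched.completion_queue)   # nothing published yet
sched.on_engine_done(7, "gemm")
final = list(sched.completion_queue)     # command 7 published once
```

Because engines never touch `inflight` or `completion_queue`, a completion record can only be published from one code path, which is what keeps the trace ordering easy to reason about.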
D3. Dispatch modes
D3.1 Simple command
A simple command expands to exactly one engine sub-command:
- DmaReadCmd / DmaWriteCmd → PE_DMA
- GemmCmd → PE_GEMM
- MathCmd → PE_MATH
Flow:
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
→ completion → PE_SCHEDULER → CompletionQueue
D3.2 Composite command (single-op tiled pipeline)
The default CompositeCmd runs a single compute op as a tile-pipelined
sequence:
DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE
PE_SCHEDULER splits the DMA payload into hardware tiles and emits one
TileToken per tile with a monotonically increasing tile_id.
Tile dependency (within one tile t):
DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)
Inter-tile overlap is allowed wherever engine resources permit (D4 governs the constraints):
DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t-1) ∥ COMPUTE(t)
D3.3 Multi-op composite (head + epilogue with scope)
A CompositeCmd MAY carry ops: tuple[OpSpec, ...] to express a
multi-op pipeline:
@dataclass(frozen=True)
class OpSpec:
    kind: str     # "gemm" | "math.exp" | "math.bias_add" | ...
    scope: Scope  # "per_k_tile" | "per_output_tile" | "once"
    ...
- ops[0] (head) defines tile geometry (e.g., the head GEMM determines M/K/N partition).
- ops[1:] (epilogue) are subsequent stages whose scope decides how often they fire:
  - per_k_tile: every K-reduction step.
  - per_output_tile: once per output tile.
  - once: once per kernel.
Cross-engine chains (e.g., GEMM head → MATH epilogue) are natural — each stage is dispatched via token self-routing (D6), so GEMM and MATH participate serially within the same composite even though they share the compute slot (D4).
The empty-ops form is the legacy single-op path.
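As a rough illustration of how scope affects epilogue frequency, the sketch below counts firings for one composite command. It assumes that `per_k_tile` fires once per K-reduction step of every output tile, giving `output_tiles * k_tiles` firings; that reading, the `Scope` enum layout, and the `fire_count` helper are illustrative assumptions rather than part of this ADR.

```python
from enum import Enum


class Scope(Enum):
    PER_K_TILE = "per_k_tile"
    PER_OUTPUT_TILE = "per_output_tile"
    ONCE = "once"


def fire_count(scope: Scope, output_tiles: int, k_tiles: int) -> int:
    """Number of times an epilogue op fires for one composite command."""
    if scope is Scope.PER_K_TILE:
        # Assumed: one firing per K step, for each output tile.
        return output_tiles * k_tiles
    if scope is Scope.PER_OUTPUT_TILE:
        return output_tiles
    return 1  # Scope.ONCE


# Example: 4 output tiles, each reduced over 3 K tiles.
counts = {s.value: fire_count(s, output_tiles=4, k_tiles=3) for s in Scope}
```

With these numbers, a `per_k_tile` epilogue fires 12 times, a `per_output_tile` epilogue 4 times, and a `once` epilogue a single time.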
D4. Engine resource model
DMA engine:
- DMA_READ: simpy.Resource(capacity=1).
- DMA_WRITE: simpy.Resource(capacity=1).
- Both channels run concurrently (READ ∥ WRITE allowed).
- Within a channel, requests serialize (READ ∥ READ disallowed; same for WRITE).
- vc_comm is an orthogonal channel for IPCQ traffic defined in ADR-0023 D8; out of scope for this ADR.
Compute engine:
- accel_slot: simpy.Resource(capacity=1) shared by PE_GEMM and PE_MATH.
- At most one compute op runs at a time within a PE.
- Multi-op composite chains (D3.3) execute their compute stages serially through this slot; token self-routing (D6) ensures the next stage starts only after the previous compute releases the slot.
Engine completion: each engine emits a completion event consumed by
the scheduler / PipelineContext (D6).
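The effect of three capacity-1 resources on tile overlap (D3.2) can be estimated with a simple timeline calculation. The sketch below is a simplification under stated assumptions: FETCH and STORE latencies are treated as zero, per-stage durations are fixed, and TCM buffer backpressure is ignored, so reads may run arbitrarily far ahead. The function name `schedule` and the chosen durations are illustrative.

```python
def schedule(n_tiles: int, read_ns: int, compute_ns: int, write_ns: int):
    """Greedy in-order timeline for DMA_READ, accel_slot, and DMA_WRITE."""
    read_free = compute_free = write_free = 0  # time each resource frees up
    rows = []
    for t in range(n_tiles):
        r0 = read_free                 # DMA_READ channel serializes reads
        r1 = r0 + read_ns
        read_free = r1
        c0 = max(r1, compute_free)     # needs the tile's data and accel_slot
        c1 = c0 + compute_ns
        compute_free = c1
        w0 = max(c1, write_free)       # needs the result and DMA_WRITE channel
        w1 = w0 + write_ns
        write_free = w1
        rows.append((t, r0, r1, c0, c1, w0, w1))
    return rows


rows = schedule(n_tiles=3, read_ns=2, compute_ns=3, write_ns=2)
makespan = rows[-1][-1]
serial = 3 * (2 + 3 + 2)
```

For these inputs the read of tile 1 occupies 2..4 ns while tile 0 computes during 2..5 ns, and the pipelined makespan is 13 ns against 21 ns for fully serial execution.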
D5. Dataflow
Input path (HBM source):
HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File
Register File → PE_GEMM | PE_MATH
Input path (shared SRAM source):
Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File
Output path (HBM destination):
Register File → PE_FETCH_STORE → PE_TCM
PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM
GEMM/MATH never touch PE_TCM directly — PE_FETCH_STORE is the
single TCM↔register-file gateway. This makes TCM BW contention
explicit and lets fetch unit policies (e.g., prefetch) be replaced
independently of compute engines.
D5.1 PE_TCM partitioning
PE_TCM is split into two logical regions:
SchedulerReservedTCM
- Owned exclusively by PE_SCHEDULER.
- Holds composite-command tile buffers.
- PE_SCHEDULER partitions this region, assigns buffers per DMA_READ / COMPUTE / DMA_WRITE stage, guarantees input/output separation, and manages tile-buffer lifetimes.
AllocatableTCM
- General-purpose region managed by PEMemAllocator.
- Used for host / DP-visible allocations.
Visibility rule (hard isolation): PEMemAllocator MUST NOT see or
allocate inside SchedulerReservedTCM. The reserved region is excluded
from allocator-managed ranges by construction.
Tile buffer rules:
- Input and output buffers within SchedulerReservedTCM MUST NOT overlap during a tile's active lifetime.
- A tile buffer remains valid until the corresponding DMA_WRITE completes.
- Buffer reuse is permitted only after the consuming tile's lifetime ends.
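The isolation and non-overlap rules above reduce to interval checks. The following sketch assumes the reserved region sits at the bottom of the TCM address range, which this ADR does not specify; `Region` and `split_tcm` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    base: int   # start offset in bytes
    size: int   # length in bytes

    @property
    def end(self) -> int:
        return self.base + self.size

    def overlaps(self, other: "Region") -> bool:
        # Half-open intervals [base, end) intersect.
        return self.base < other.end and other.base < self.end


def split_tcm(total_bytes: int, reserved_bytes: int) -> tuple[Region, Region]:
    """Return (SchedulerReservedTCM, AllocatableTCM) as disjoint ranges."""
    if not 0 <= reserved_bytes <= total_bytes:
        raise ValueError("reserved size must fit inside the TCM")
    reserved = Region(0, reserved_bytes)  # assumed placement at offset 0
    allocatable = Region(reserved_bytes, total_bytes - reserved_bytes)
    return reserved, allocatable


reserved, allocatable = split_tcm(total_bytes=1024, reserved_bytes=256)
in_buf = Region(0, 128)     # tile input buffer inside the reserved region
out_buf = Region(128, 128)  # tile output buffer, adjacent but disjoint
```

Handing only `allocatable` to the allocator gives the "excluded by construction" property: there is no range inside the reserved region that the allocator could return.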
D6. TileToken self-routing pipeline
A composite's stage-to-stage progression happens without routing
through the scheduler. Each component forwards the token directly to
the next stage's component using the token's plan:
Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
↑ chaining: no scheduler hop ↑
PipelineContext.complete_tile()
This mirrors real-HW done-wire chains. The scheduler handles only initial dispatch + completion aggregation.
TilePlan / Stage
class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5

@dataclass(frozen=True)
class Stage:
    stage_type: StageType
    component: str  # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
    params: dict    # stage-specific parameters

@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]
TileToken
@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext
    plan: TilePlan
    stage_idx: int
    params: dict           # cached current stage params
    data_op: bool = True   # op_log opt-in (ADR-0020 D4)
Single-owner invariant: a token is owned by exactly one component at a
time. Lifecycle: scheduler creates with stage_idx=0 → component
_process() → increment stage_idx → put to next stage's in_port →
last stage calls pipeline_ctx.complete_tile().
PipelineContext (exactly-once completion)
@dataclass
class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: simpy.Event = None

    def complete_tile(self) -> None:
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()
Each tile's last stage MUST call complete_tile() exactly once.
Duplicate calls are bugs (SimPy Event can succeed at most once).
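One subtlety is that a duplicate call made before the final tile would inflate the counter and fire the event early instead of raising. A possible debugging variant, sketched below with stdlib only, tracks tile ids so the duplicate is caught at the call site. `OneShotEvent` is a stand-in for a SimPy event, and `CheckedPipelineContext` with its `tile_id` argument is a hypothetical variant, not the interface defined in this ADR.

```python
class OneShotEvent:
    """Stdlib stand-in for a SimPy event: may be triggered at most once."""

    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        if self.triggered:
            raise RuntimeError("event already triggered")
        self.triggered = True


class CheckedPipelineContext:
    """Hypothetical variant that detects duplicate completions per tile."""

    def __init__(self, total_tiles: int) -> None:
        self.total_tiles = total_tiles
        self.done_tiles: set[int] = set()
        self.done_event = OneShotEvent()

    def complete_tile(self, tile_id: int) -> None:
        if tile_id in self.done_tiles:
            raise RuntimeError(f"tile {tile_id} completed twice")
        self.done_tiles.add(tile_id)
        if len(self.done_tiles) == self.total_tiles:
            self.done_event.succeed()


ctx = CheckedPipelineContext(total_tiles=2)
ctx.complete_tile(0)
try:
    ctx.complete_tile(0)          # duplicate: rejected, state unchanged
    duplicate_rejected = False
except RuntimeError:
    duplicate_rejected = True
fired_early = ctx.done_event.triggered
ctx.complete_tile(1)
```

The set costs one entry per tile, so a check like this is probably best kept behind a debug flag.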
Feed ordering
PE_SCHEDULER has exactly one _feed_loop process consuming a
_pending_feeds FIFO. Composite commands are enqueued in submission
order; tile feed for a command runs to completion before the next
command's feed begins. Tile-feed interleaving between commands is
disallowed.
Within a single command's tiles, downstream pipeline overlap arises naturally — earlier tiles progress through later stages while the feeder keeps pushing remaining tiles into the first stage queue (SimPy Store backpressure governs flow control). If the first-stage queue is full, only the feeder blocks; the scheduler worker's inbox processing continues.
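The feed-ordering rule amounts to draining a FIFO of commands one at a time. The sketch below models only the order in which tiles enter the first stage; it has no timing or backpressure, and `feed_order` is an illustrative name.

```python
from collections import deque


def feed_order(commands: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Return (cmd_id, tile_id) pairs in the order a single feeder emits them.

    `commands` is a list of (cmd_id, n_tiles) in submission order.
    """
    pending = deque(commands)          # plays the role of _pending_feeds
    order = []
    while pending:
        cmd_id, n_tiles = pending.popleft()
        # One command's tiles are fed to completion before the next starts.
        for tile_id in range(n_tiles):
            order.append((cmd_id, tile_id))
    return order


order = feed_order([("A", 2), ("B", 3)])
```

No tile of command B appears before the last tile of command A, which is the no-interleaving property stated above.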
Token routing pattern (base class)
def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()
        yield from self._process(env, token)  # stage-specific logic
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            token.pipeline_ctx.complete_tile()
Each component implements only _process(); chaining lives in the
base class.
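The routing pattern can be exercised without SimPy by replacing inboxes with plain deques and visiting components round-robin. This is an untimed sketch under simplifying assumptions: a plan is reduced to a tuple of component names, `_process()` is a no-op, and `Token` and `run` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Token:
    tile_id: int
    plan: tuple[str, ...]   # component name per stage (simplified TilePlan)
    stage_idx: int = 0


def run(tokens: list[Token]):
    """Route tokens stage to stage; return (visit trace, completion order)."""
    inbox: dict[str, deque] = {}
    for tok in tokens:                       # initial dispatch by the scheduler
        inbox.setdefault(tok.plan[0], deque()).append(tok)
    trace, completed = [], []
    while any(inbox.values()):
        for name in list(inbox):             # one step per component per round
            if not inbox[name]:
                continue
            tok = inbox[name].popleft()
            trace.append((name, tok.tile_id))
            nxt = tok.stage_idx + 1
            if nxt < len(tok.plan):
                tok.stage_idx = nxt          # forward without a scheduler hop
                inbox.setdefault(tok.plan[nxt], deque()).append(tok)
            else:
                completed.append(tok.tile_id)  # stands in for complete_tile()
    return trace, completed


plan = ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_fetch_store", "pe_dma")
trace, completed = run([Token(0, plan), Token(1, plan)])
```

The same component (`pe_dma`, `pe_fetch_store`) appears twice in the plan, which shows why routing is driven by `stage_idx` rather than by the component's own notion of what comes next.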
D7. Observability and trace contract
The simulator emits deterministic trace events:
- command_submitted
- sub_command_dispatched
- engine_start
- engine_complete
- tile_ready
- command_complete
For identical inputs, trace ordering MUST be deterministic.
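One common way to obtain a deterministic order when several events share a timestamp is to break ties with a monotonically increasing sequence number. The sketch below shows that technique with `heapq`; it is not necessarily how the simulator implements its trace, and `TraceLog` is a hypothetical name.

```python
import heapq
import itertools


class TraceLog:
    """Orders events by (time, emission sequence) so ties are stable."""

    def __init__(self) -> None:
        self._seq = itertools.count()
        self._heap: list[tuple[int, int, str, str]] = []

    def emit(self, time_ns: int, event: str, subject: str) -> None:
        # The sequence number makes every key unique and order-preserving.
        heapq.heappush(self._heap, (time_ns, next(self._seq), event, subject))

    def drain(self) -> list[tuple[int, str, str]]:
        out = []
        while self._heap:
            time_ns, _, event, subject = heapq.heappop(self._heap)
            out.append((time_ns, event, subject))
        return out


log = TraceLog()
log.emit(5, "engine_start", "pe_gemm")
log.emit(5, "engine_start", "pe_dma")
log.emit(1, "command_submitted", "cmd0")
events = log.drain()
```

The two events at 5 ns come out in emission order, so identical inputs that emit in the same order produce an identical trace.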
D8. Topology representation
PE-internal components are declared in cube.pe_template:
pe_template:
  components:
    pe_cpu: { kind: pe_cpu, impl: builtin.pe_cpu, attrs: { overhead_ns: ... } }
    pe_scheduler: { kind: pe_scheduler, impl: builtin.pe_scheduler, attrs: { overhead_ns: ... } }
    pe_dma: { kind: pe_dma, impl: builtin.pe_dma, attrs: { rd_engines: 1, wr_engines: 1 } }
    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
    pe_gemm: { kind: pe_gemm, impl: builtin.pe_gemm, attrs: { shared_resource: accel_slot, ... } }
    pe_math: { kind: pe_math, impl: builtin.pe_math, attrs: { shared_resource: accel_slot, ... } }
    pe_tcm: { kind: pe_tcm, impl: builtin.pe_tcm, attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
    pe_mmu: { kind: pe_mmu, impl: builtin.pe_mmu, attrs: { ... } }    # ADR-0011 D-VA
    pe_ipcq: { kind: pe_ipcq, impl: builtin.pe_ipcq, attrs: { ... } } # ADR-0023
  links:
    # Scheduler dispatch edges (initial)
    scheduler_to_dma_mm: 0.0
    scheduler_to_fetch_store_mm: 0.0
    scheduler_to_gemm_mm: 0.0
    scheduler_to_math_mm: 0.0
    # Pipeline chaining edges (token self-routing per D6)
    dma_to_fetch_store_mm: 0.0
    fetch_store_to_gemm_mm: 0.0
    fetch_store_to_math_mm: 0.0
    gemm_to_fetch_store_mm: 0.0
    gemm_to_math_mm: 0.0
    math_to_fetch_store_mm: 0.0
    fetch_store_to_dma_mm: 0.0
    fetch_store_to_tcm_bw_gbs: ...
Template is instantiated once per PE. PE instances are derived from
cube.pe_layout (corner placement). External connectivity (PE_DMA ↔
cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).
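Since token self-routing (D6) forwards along the chaining edges, a plan is only routable if every consecutive pair of stages has a declared link. The sketch below checks that, assuming link keys follow the `<src>_to_<dst>_mm` pattern seen in the template with the `pe_` prefix dropped; the helpers `link_key` and `missing_links` are hypothetical, and the ADR does not state that the simulator performs this check.

```python
# Chaining edges copied from the pe_template links section.
CHAIN_LINKS = {
    "dma_to_fetch_store_mm",
    "fetch_store_to_gemm_mm",
    "fetch_store_to_math_mm",
    "gemm_to_fetch_store_mm",
    "gemm_to_math_mm",
    "math_to_fetch_store_mm",
    "fetch_store_to_dma_mm",
}


def link_key(src: str, dst: str) -> str:
    """Derive the assumed link name from two topology node ids."""
    def short(node: str) -> str:
        # "sip0.cube0.pe0.pe_dma" -> "dma"
        return node.rsplit(".", 1)[-1].removeprefix("pe_")
    return f"{short(src)}_to_{short(dst)}_mm"


def missing_links(plan: tuple[str, ...], links: set[str]) -> list[str]:
    """Return link names a plan needs that the template does not declare."""
    keys = (link_key(a, b) for a, b in zip(plan, plan[1:]))
    return [k for k in keys if k not in links]


pe = "sip0.cube0.pe0."
fused = tuple(pe + c for c in
              ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_math",
               "pe_fetch_store", "pe_dma"))
ok = missing_links(fused, CHAIN_LINKS)
bad = missing_links((pe + "pe_gemm", pe + "pe_dma"), CHAIN_LINKS)
```

The fused GEMM plus MATH plan resolves entirely through declared edges, while a plan that tries to go from `pe_gemm` straight to `pe_dma` reports `gemm_to_dma_mm` as missing, consistent with D5's rule that compute engines reach TCM only through PE_FETCH_STORE.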
Consequences
Positive
- Each block is an independent topology node — individually swappable via DI (ADR-0015).
- PE-internal structure is visible in the topology graph.
- Components do not know their downstream — plan-based routing gives flexibility (e.g., epilogue chains require no scheduler change).
- DMA and compute overlap naturally via SimPy Store backpressure.
- Multi-op composite expresses fused operations (e.g., GEMM + bias_add) without engine-level coupling.
- TCM access contention is realistic: PE_FETCH_STORE is the single TCM↔RF gateway.
Negative
- Intra-PE component count is higher than a coarser model (7 base + 2 cross-referenced) — more topology nodes/edges.
- Intra-PE token forwarding is explicit in traces (acceptable trade for HW fidelity).
Links
- ADR-0011 D-VA (PE_MMU component, VA translation)
- ADR-0015 D4 (component port/wire model)
- ADR-0020 (greenlet kernel execution / two-pass)
- ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
- SPEC R3, R4