kernbench2/docs/adr/ADR-0021-pe-pipeline-refactor.en.md

ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing

Status

Accepted

Context

Actual Hardware Structure

HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
  • DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
  • Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
  • GEMM/MATH Engine: computation between Register Files (cycle-accurate)
  • Completion signal: PE-internal 1-cycle wire signal (done pin assert)

Decision

D1. Separate Each Block into an Independent Component

The internal blocks of pe_accel are separated into independent PeEngineBase components. Existing 5 blocks + 1 Fetch/Store Unit = 6 components.

| Component | Role | HW Correspondence |
|---|---|---|
| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
| PE_GEMM | MAC compute (register only) | MAC Array |
| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
| PE_TCM | BW-serialized scratchpad | SRAM Bank |

Each component exists as a topology node and connects to the others via ports/wires. Swapping a component's impl changes the timing model of that block alone.

D2. Token Self-Routing — Scheduler Handles Only Dispatch + Completion

Tokens do not pass through the scheduler at every stage. Each token carries its plan, so a component chains directly to the next stage.

Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
              ↑ chaining: does not go through scheduler          completion only

This matches the actual HW structure where each block's done signal is directly connected to the next block via wire. The scheduler is responsible only for initial dispatch + completion aggregation.

Stage Definition

from enum import Enum

class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5

Plan Structure

When the scheduler receives a CompositeCmd, it generates a per-tile execution plan. The plan defines the stage sequence for each tile:

from dataclasses import dataclass

@dataclass
class Stage:
    stage_type: StageType
    component: str       # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
    params: dict         # per-stage parameters (dynamic)

@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]  # stages to execute in order (immutable tuple)

The stage sequence varies depending on the plan:

# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)

# GEMM directly from TCM data (skip DMA read):
stages = (FETCH, GEMM, STORE, DMA_WRITE)

# MATH element-wise:
stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)

# GEMM + accumulation (intermediate K-tile, skip writeback):
stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only

Components do not hardcode the next component. They read the next stage from the token's plan and forward it directly via out_port. This is the same pattern as a network packet carrying a routing header.
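The stage sequences listed above can be derived from a few per-tile flags. The following is a minimal sketch, not the actual plan generator: `build_stage_types`, `next_stage`, and the `data_in_tcm` / `writeback` flags are hypothetical names introduced for illustration, and the sketch works on bare `StageType` values rather than full `Stage` objects.

```python
from enum import Enum


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5


def build_stage_types(op, *, data_in_tcm=False, writeback=True):
    """Return the stage sequence for one tile (hypothetical helper).

    op           -- "gemm" or "math"
    data_in_tcm  -- operands already reside in TCM, so DMA_READ is skipped
    writeback    -- False for an intermediate K-tile that stays in TCM
    """
    compute = {"gemm": StageType.GEMM, "math": StageType.MATH}[op]
    stages = []
    if not data_in_tcm:
        stages.append(StageType.DMA_READ)
    stages += [StageType.FETCH, compute, StageType.STORE]
    if writeback:
        stages.append(StageType.DMA_WRITE)
    return tuple(stages)


def next_stage(stages, stage_idx):
    """Routing-header lookup: the stage after stage_idx, or None at the end."""
    nxt = stage_idx + 1
    return stages[nxt] if nxt < len(stages) else None
```

With these assumptions, `build_stage_types("gemm", writeback=False)` yields the four-stage accumulation sequence, and `next_stage` is the only lookup a component needs to find its successor.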

Pipeline Context

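from dataclasses import dataclass
from typing import Optional

import simpy

@dataclass
class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: Optional[simpy.Event] = None  # succeeds when all tiles are complete

    def complete_tile(self) -> None:
        # called exactly once per tile, by the tile's last stage
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()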

Completion follows an exactly-once contract: the last stage of each tile must call complete_tile() exactly once. Duplicate calls are a bug, and done_event must succeed only once (SimPy Event constraint).
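One way to make the exactly-once contract checkable during bring-up is to track completed tile IDs rather than a bare counter. This is a sketch under stated assumptions: `CheckedPipelineContext` is a hypothetical variant (the `complete_tile()` defined above takes no `tile_id`), and `OneShotEvent` is a stand-in that only mimics the SimPy behavior of raising `RuntimeError` on a second `succeed()`, so the example runs without a simulation environment.

```python
from dataclasses import dataclass, field


class OneShotEvent:
    """Stand-in for simpy.Event: a second succeed() raises RuntimeError."""

    def __init__(self):
        self.triggered = False

    def succeed(self):
        if self.triggered:
            raise RuntimeError("event has already been triggered")
        self.triggered = True


@dataclass
class CheckedPipelineContext:
    """Hypothetical PipelineContext variant that rejects duplicate completions."""

    id: str
    total_tiles: int
    done_event: OneShotEvent
    _done_tiles: set = field(default_factory=set)

    def complete_tile(self, tile_id: int) -> None:
        if tile_id in self._done_tiles:
            raise RuntimeError(f"tile {tile_id} completed twice")
        self._done_tiles.add(tile_id)
        # Fires once, when the last distinct tile reports completion.
        if len(self._done_tiles) == self.total_tiles:
            self.done_event.succeed()
```

A duplicate call then fails at the offending call site instead of silently advancing the counter and triggering `done_event` early.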

Scheduler Role (Reduced)

When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext, enqueues them into the scheduler's internal _pending_feeds FIFO, and returns immediately.

Actual tile injection is handled by a single feeder process (_feed_loop). This feeder consumes _pending_feeds in FIFO order and does not allow tile feed interleaving across composite commands. That is, the feed for the next command begins only after all tiles of the current command have been injected into the first stage queue.

There is exactly one _feed_loop per scheduler, and tile feed for composite commands is performed exclusively through this single process. Command issue order refers to the order in which PE_SCHEDULER receives PeInternalTxn.

This structure maintains command issue order while ensuring that when the first stage queue is full, only the feeder process blocks — the scheduler worker's inbox processing itself does not stall.

class PeSchedulerV2(PeEngineBase):
    _pipelines: dict[str, PipelineContext]
    _pending_feeds: simpy.Store   # FIFO of (plan, ctx)

    def start(self, env):
        super().start(env)
        self._pending_feeds = simpy.Store(env)
        env.process(self._feed_loop(env))

    def _dispatch_composite(self, env, pe_txn, cmd):
        plan = generate_plan(cmd)
        ctx = PipelineContext(
            id=next_id(),
            total_tiles=len(plan.tiles),
            done_event=pe_txn.done,
        )
        self._pipelines[ctx.id] = ctx

        # only enqueue to feeder queue and return immediately
        yield self._pending_feeds.put((plan, ctx))

    def _feed_loop(self, env):
        """Single feeder process: feeds composite commands in FIFO order.

        Tile feed interleaving across composite commands is not allowed.
        The feed for the next command begins only after all tiles of the
        current command have been injected into the first stage queue.

        When the first stage queue is full, only this feeder blocks;
        the scheduler worker's inbox processing does not stall.
        """
        while True:
            plan, ctx = yield self._pending_feeds.get()
            for tile in plan.tiles:
                token = TileToken(
                    tile_id=tile.tile_id,
                    pipeline_ctx=ctx,
                    plan=tile,
                    stage_idx=0,
                    params=tile.stages[0].params,
                )
                yield self.out_ports[tile.stages[0].component].put(token)
                # queue capacity = HW queue depth → feeder blocks only when full

In this ADR, the scheduler can accept multiple composite commands, but tile submission order follows per-command FIFO. Within a command, tile-level pipeline overlap is allowed, but tile feed interleaving across commands is not.
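The ordering guarantee can be illustrated without a simulator. The sketch below is a plain-Python reduction of the feeder, not the SimPy implementation: `first_stage_feed_order` is a hypothetical helper, and a `deque` plays the role of `_pending_feeds`.

```python
from collections import deque


def first_stage_feed_order(commands):
    """Yield (cmd_id, tile_id) in the order tiles enter the first stage queue.

    commands -- iterable of (cmd_id, tile_count), in command issue order
    """
    pending_feeds = deque(commands)      # plays the role of _pending_feeds
    while pending_feeds:
        cmd_id, tile_count = pending_feeds.popleft()
        # The current command is drained fully before the next one starts.
        for tile_id in range(tile_count):
            yield (cmd_id, tile_id)


order = list(first_stage_feed_order([("cmd1", 3), ("cmd2", 2)]))
print(order)
# [('cmd1', 0), ('cmd1', 1), ('cmd1', 2), ('cmd2', 0), ('cmd2', 1)]
```

No cmd2 tile appears before the last cmd1 tile, which is the non-interleaving property the single feeder provides.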

D3. Data Transfer vs. Completion Signal — HW Modeling Criteria

| Communication Type | Method | HW Correspondence |
|---|---|---|
| Tile token (work directive) | message via out_port | enqueue to command queue |
| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |

Tile token: uses out_port.put(). SimPy Store capacity = HW queue depth.

Intra-PE chaining latency: within the scope of this ADR, no explicit latency model is applied to intra-PE stage triggers. Chaining between components corresponds to PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost is incurred.

Pipeline completion: the component at the last stage calls pipeline_ctx.complete_tile(). When all tiles are complete, PipelineContext calls done_event.succeed().

D4. Asynchronous Pipeline — Natural Overlap

The scheduler processes CompositeCmds asynchronously. However, tile feed does not spawn an independent process per command; instead, the scheduler's internal single feeder process performs the feed in FIFO order. Therefore, the scheduler can continue to receive the next command, but the first-stage tile injection order is guaranteed per command.

Since SimPy Store capacity = HW queue depth:

  • When the queue is full, put() naturally blocks (backpressure)
  • While PE_DMA is processing a later tile, PE_GEMM can already compute on a tile that has finished DMA and fetch
  • When a second CompositeCmd arrives, it is immediately enqueued to _pending_feeds; its tiles enter the DMA queue only after all cmd1 tiles have been fed

First-stage feed order (feeder → DMA queue):
  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
                                            ↑ cmd2 starts after cmd1 feed completes

Runtime pipeline (downstream overlap):
  PE_DMA:    [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
  PE_FETCH:          [cmd1:t0][cmd1:t1]...
  PE_GEMM:                   [cmd1:t0][cmd1:t1]...
                              ↑ pipeline overlap within the same command

Here, the overlap does not come from tile feed interleaving across different commands, but occurs naturally as tiles from earlier commands progress to downstream stages while the feeder continues injecting subsequent tiles.

For example, tile feed for cmd2 does not start until all tiles of cmd1 have been injected into the first stage queue. However, while cmd1.tile0 has already progressed to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so pipeline overlap within the same command occurs naturally.
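The benefit of this overlap can be estimated with a standard pipeline recurrence. The sketch below is an illustration under simplifying assumptions, not the simulator's timing model: every stage is treated as its own single server (ignoring that DMA_READ and DMA_WRITE share PE_DMA, and FETCH and STORE share PE_FETCH_STORE), queues are unbounded, and the latencies are made-up numbers.

```python
def pipeline_finish_times(stage_latencies_ns, num_tiles):
    """finish[s][t] = time at which tile t leaves stage s.

    A tile starts a stage once it has left the previous stage AND the
    stage has finished the previous tile.
    """
    finish = [[0.0] * num_tiles for _ in stage_latencies_ns]
    for s, latency in enumerate(stage_latencies_ns):
        for t in range(num_tiles):
            ready = finish[s - 1][t] if s > 0 else 0.0   # tile has arrived
            free = finish[s][t - 1] if t > 0 else 0.0    # stage is idle again
            finish[s][t] = max(ready, free) + latency
    return finish


# Hypothetical latencies (ns) for DMA_READ, FETCH, GEMM, STORE, DMA_WRITE.
latencies = [100.0, 5.0, 40.0, 5.0, 100.0]
tiles = 4
pipelined = pipeline_finish_times(latencies, tiles)[-1][-1]
serial = sum(latencies) * tiles
print(pipelined, serial)   # 550.0 1000.0
```

Under these assumptions, four tiles finish in 550 ns instead of the 1000 ns a fully serialized execution would take, and the slowest stage (DMA) sets the steady-state rate.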

Component Chaining Pattern

All components follow the same pattern:

def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()

        # process own stage
        yield from self._process(env, token)

        # chain to next stage (read from plan)
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            # last stage — pipeline completion
            token.pipeline_ctx.complete_tile()
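The same chaining rule can be exercised synchronously to check that routing depends only on the token. This is a reduced sketch, not the SimPy worker: `Token`, `Component`, and `run` are hypothetical stand-ins, the short component names are assumed for brevity, and `_process()` is replaced by appending to a trace.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    stages: tuple          # component names, in execution order
    stage_idx: int = 0
    trace: list = field(default_factory=list)


class Component:
    def __init__(self, name):
        self.name = name

    def handle(self, token):
        """Process own stage, then return the next component name (or None)."""
        token.trace.append(self.name)           # stands in for _process()
        nxt = token.stage_idx + 1
        if nxt < len(token.stages):
            token.stage_idx = nxt
            return token.stages[nxt]            # read from the plan, not hardcoded
        return None                             # last stage: tile is complete


def run(token, components):
    target = token.stages[0]                    # scheduler's initial dispatch
    while target is not None:
        target = components[target].handle(token)
    return token.trace


names = ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_fetch_store", "pe_dma")
components = {n: Component(n) for n in set(names)}
trace = run(Token(stages=names), components)
```

The resulting `trace` equals the plan's component sequence and never contains the scheduler, which is what the self-routing design intends.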

D5. PE_FETCH_STORE — Dedicated TCM ↔ Register File Transfer

Previously, GemmBlock and MathBlock each implemented their own TCM read/write. That logic is extracted into a dedicated PE_FETCH_STORE component.

# PE_FETCH_STORE._process()
def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done
    # chaining is handled by the base class (D4 pattern)

Advantages:

  • GEMM/MATH perform pure compute only — no TCM access logic
  • Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
  • Prefetch strategies can be experimented with by replacing the fetch unit alone
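The BW-serialization point above can be made concrete with a small calculation. This is a sketch under assumptions: `transfer_ns` and `serialized_finish_ns` are hypothetical helpers, a single TCM port is assumed, GB/s is taken as 10^9 bytes per second, and the 512 GB/s figure matches the `fetch_store_to_tcm_bw_gbs` value in the topology template.

```python
def transfer_ns(num_bytes, bw_gbs):
    """Transfer time in ns; 1 GB/s moves exactly 1 byte per ns."""
    return num_bytes / bw_gbs


def serialized_finish_ns(requests, bw_gbs):
    """Finish time of each request when one port serves them back to back.

    requests -- list of (arrival_ns, num_bytes), sorted by arrival time
    """
    finish, port_free = [], 0.0
    for arrival_ns, num_bytes in requests:
        start = max(arrival_ns, port_free)      # wait while the port is busy
        port_free = start + transfer_ns(num_bytes, bw_gbs)
        finish.append(port_free)
    return finish


# A 64 KiB fetch and a 64 KiB store that both arrive at t = 0.
print(serialized_finish_ns([(0.0, 65536), (0.0, 65536)], bw_gbs=512.0))
# [128.0, 256.0]
```

The second request finishes at 256 ns rather than 128 ns, which is the contention effect that a shared PE_TCM resource would expose.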

D6. Simplification of Each Compute Component

GEMM/MATH compute only on register data that has already been fetched. Chaining follows the common pattern (D4), so only _process() needs to be implemented:

# PE_GEMM._process()
def _process(self, env, token):
    yield env.timeout(self._mac_latency(token.params))

# PE_MATH._process()
def _process(self, env, token):
    yield env.timeout(self._simd_latency(token.params))

# PE_FETCH_STORE._process()
def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done

# PE_DMA._process()
def _process(self, env, token):
    yield from self._do_fabric_dma(token.params)

By replacing only the timing model, one can freely switch between cycle-accurate and analytical models. Since the chaining logic resides in the base class, each component only implements its pure stage logic.
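Swapping the timing model can be sketched as injecting a latency function. Everything here is illustrative: `analytical_mac_ns`, `cycle_mac_ns`, `GemmStage`, the parameter keys `m`/`n`/`k`, and the array and throughput figures are assumptions, not the formulas used by the actual `pe_gemm` implementations.

```python
import math


def analytical_mac_ns(params, peak_macs_per_ns=1024.0):
    """Throughput-bound estimate: total MACs divided by peak MAC rate."""
    return params["m"] * params["n"] * params["k"] / peak_macs_per_ns


def cycle_mac_ns(params, array_rows=32, array_cols=32, clock_ghz=1.0):
    """Pass-count estimate: one K-deep pass per array-sized output block."""
    passes = math.ceil(params["m"] / array_rows) * math.ceil(params["n"] / array_cols)
    return passes * params["k"] / clock_ghz


class GemmStage:
    """Holds only the stage logic; the latency model is injected."""

    def __init__(self, latency_model):
        self._mac_latency = latency_model

    def latency_ns(self, params):
        return self._mac_latency(params)


tile = {"m": 48, "n": 64, "k": 128}
print(GemmStage(analytical_mac_ns).latency_ns(tile))   # 384.0
print(GemmStage(cycle_mac_ns).latency_ns(tile))        # 512.0
```

The two models disagree for this tile because 48 rows do not fill a 32-row array evenly, a padding effect the pass-count model captures and the throughput-bound model ignores.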

D7. Topology Changes

Add PE_FETCH_STORE to the PE template:

pe_template:
  components:
    pe_cpu:         { kind: pe_cpu,         impl: pe_cpu_v1, ... }
    pe_scheduler:   { kind: pe_scheduler,   impl: pe_scheduler_v2, ... }
    pe_dma:         { kind: pe_dma,         impl: pe_dma_v1, ... }
    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
    pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1, ... }
    pe_math:        { kind: pe_math,        impl: pe_math_v1, ... }
    pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1, ... }
    pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1, ... }
  links:
    # existing links...
    fetch_store_to_tcm_bw_gbs: 512.0
    fetch_store_to_tcm_mm: 0.0

PE internal edge connections:

PE_SCHEDULER → PE_DMA (initial dispatch)
PE_SCHEDULER → PE_FETCH_STORE (initial dispatch)
PE_SCHEDULER → PE_GEMM (initial dispatch)
PE_SCHEDULER → PE_MATH (initial dispatch)
PE_DMA → PE_FETCH_STORE (chaining)
PE_FETCH_STORE → PE_GEMM (chaining)
PE_FETCH_STORE → PE_MATH (chaining)
PE_GEMM → PE_FETCH_STORE (store chaining)
PE_MATH → PE_FETCH_STORE (store chaining)
PE_FETCH_STORE → PE_DMA (writeback chaining)
PE_FETCH_STORE → PE_TCM (BW request)

Topology edges cover both initial dispatch and runtime chaining. Scheduler → sub-component edges are initial dispatch paths, while inter-component edges are runtime chaining paths driven by token self-routing.
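Because routing is driven by the plan, a plan is only executable if each hop has a matching edge. The following sketch checks that property; `missing_edges` is a hypothetical helper, short lowercase node names are assumed in place of full topology IDs, and the PE_FETCH_STORE → PE_TCM edge is left out because it carries BW requests rather than tokens.

```python
EDGES = {
    ("pe_scheduler", "pe_dma"),
    ("pe_scheduler", "pe_fetch_store"),
    ("pe_scheduler", "pe_gemm"),
    ("pe_scheduler", "pe_math"),
    ("pe_dma", "pe_fetch_store"),
    ("pe_fetch_store", "pe_gemm"),
    ("pe_fetch_store", "pe_math"),
    ("pe_gemm", "pe_fetch_store"),
    ("pe_math", "pe_fetch_store"),
    ("pe_fetch_store", "pe_dma"),
}


def missing_edges(stage_components, edges=EDGES, source="pe_scheduler"):
    """Return the hops of a plan that have no matching topology edge."""
    path = [source, *stage_components]
    return [hop for hop in zip(path, path[1:]) if hop not in edges]


normal_gemm = ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_fetch_store", "pe_dma"]
print(missing_edges(normal_gemm))              # []
print(missing_edges(["pe_dma", "pe_gemm"]))    # [('pe_dma', 'pe_gemm')]
```

A check like this could run at plan-generation time, so that a malformed stage sequence is rejected before any token is injected.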

D9. TileToken Message Definition

A message used for passing tile work between components. The token carries the plan and stage index, enabling self-routing.

@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext    # completion tracking
    plan: TilePlan                   # full stage sequence for this tile (immutable)
    stage_idx: int                   # current stage index in plan.stages
    params: dict                     # current stage parameter cache (canonical: plan.stages[stage_idx].params)
    data_op: bool = True             # op_log recording target (ADR-0020)

A TileToken is owned by exactly one component at a time and is never referenced by multiple components simultaneously (single-owner).

Token lifecycle:

  1. Scheduler creates it with stage_idx=0 and puts it to the first stage component
  2. The component executes _process(), increments stage_idx, and puts it to the next component
  3. The last stage component calls pipeline_ctx.complete_tile()
  4. When all tiles are complete, PipelineContext calls done_event.succeed()

Relationship with existing PeInternalTxn:

  • PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged)
  • TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)

Non-goals

  • PE_CPU changes: the PE_CPU → PE_SCHEDULER interface is not modified (PeInternalTxn-based, ADR-0014 maintained)
  • Resource contention model across multiple pipelines: the current scope focuses on accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines are future work.

Open Questions

  • Register File capacity model: whether to model capacity limits when the fetch unit loads into registers. Capacity is expressed in bytes (register_file_bytes), and the number of tiles that can be held simultaneously is determined by tile size. When capacity is exceeded, fetch stalls, creating natural backpressure.
  • Prefetch strategy: this ADR does not allow tile feed interleaving across composite commands. Therefore, overlap arises not from pre-injection across commands, but naturally from pipeline progression of tiles within the same command. If additional prefetch is needed, it should be considered at the level of tile ordering within the same command or fetch/store unit policy, not cross-command injection.
  • PE_DMA coalescing: per-tile DMA may cause fragmentation. Direction is to merge/coalesce within DMA without scheduler involvement.
  • Synchronous execution mode: this ADR adopts asynchronous pipeline as the default/sole execution model. If a sync mode is needed for debug or validation purposes, it will be considered in a future ADR.
  • TCM bank conflict across multiple pipelines: currently based on a single pipeline. Bank conflict modeling when multiple pipelines simultaneously access TCM is future work.
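For the Register File capacity question, the byte-based rule described above reduces to an integer division. This is only one possible reading of that open question: `resident_tiles` and `fetch_must_stall` are hypothetical helpers, and a uniform tile size is assumed.

```python
def resident_tiles(register_file_bytes, tile_bytes):
    """Number of whole tiles that fit in the register file at once."""
    if tile_bytes <= 0:
        raise ValueError("tile_bytes must be positive")
    return register_file_bytes // tile_bytes


def fetch_must_stall(tiles_in_registers, register_file_bytes, tile_bytes):
    """True when loading one more tile would exceed the capacity."""
    return tiles_in_registers >= resident_tiles(register_file_bytes, tile_bytes)


# Example: a 256 KiB register file with 64 KiB tiles holds four tiles.
print(resident_tiles(256 * 1024, 64 * 1024))        # 4
print(fetch_must_stall(4, 256 * 1024, 64 * 1024))   # True
```

If adopted, the stall condition would act as the backpressure source on the fetch side, in the same way Store capacity does for the stage queues.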

Consequences

Positive

  • Each block is an independent component — individually replaceable (ADR-0015 compliant)
  • PE internal structure is visible in the topology
  • Components do not know the next component — plan-based routing provides flexibility
  • Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
  • Improved HW modeling accuracy (done signal = Event, data transfer = message)
  • Fetch/store separation enables accurate TCM BW contention modeling

Negative

  • Increased number of PE internal components (5 → 6) — more topology nodes/edges
  • Component separation makes intra-PE token forwarding more explicit than before