kernbench2/docs/adr/ADR-0021-pe-pipeline-refactor.en.md
ywkang b2c52f0e34 Add English translations for ADR-0018, 0019, 0020, 0021
- ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping
- ADR-0019: CUBE NOC per-channel and aggregated HBM connection model
- ADR-0020: 2-pass data execution model (timing/data separation, greenlet)
- ADR-0021: PE pipeline refactor (component separation + token self-routing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 16:31:32 -07:00


ADR-0021: PE Pipeline Refactoring (Component Separation + Token Self-Routing)

Status

Proposed

Context

Problems with the Current Structure

pe_accel (SchedulerV2Component) hides 5 hardware blocks (DmaIn, DmaWb, Gemm, Math, Tcm) inside a single component.

SchedulerV2Component (single topology node)
├── DmaInBlock     ← directly connected via internal SimPy Store
├── DmaWbBlock     ← not visible in topology
├── GemmBlock      ← not replaceable
├── MathBlock      ← not replaceable
└── TcmBlock       ← not replaceable

Problems:

  • Blocks directly reference the next block via desc.next_block — hardcoded routing
  • Individual blocks cannot be replaced (violates ADR-0015 component replacement principle)
  • PE internal structure is not visible in the topology
  • GemmBlock and MathBlock each duplicate TCM load/store logic

Actual Hardware Structure

HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
  • DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
  • Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
  • GEMM/MATH Engine: computation between Register Files (cycle-accurate)
  • Completion signal: PE-internal 1-cycle wire signal (done pin assert)

Decision

D1. Separate Each Block into an Independent Component

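The internal blocks of pe_accel are separated into independent PeEngineBase components. DmaInBlock and DmaWbBlock merge into a single PE_DMA, the scheduler becomes its own PE_SCHEDULER node, and a new Fetch/Store Unit is added, giving 6 components in total.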

| Component | Role | HW Correspondence |
|---|---|---|
| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
| PE_GEMM | MAC compute (register only) | MAC Array |
| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
| PE_TCM | BW-serialized scratchpad | SRAM Bank |

Each component exists as a topology node and is connected via ports/wires. Replacing the impl allows changing the timing model of an individual block.

D2. Token Self-Routing — Scheduler Handles Only Dispatch + Completion

Tile tokens do not return to the scheduler after every stage. Each token carries its plan, so a component chains directly to the next stage.

Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
              ↑ chaining: does not go through scheduler          completion only

This matches the actual HW structure where each block's done signal is directly connected to the next block via wire. The scheduler is responsible only for initial dispatch + completion aggregation.

Stage Definition

class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5

Plan Structure

When the scheduler receives a CompositeCmd, it generates a per-tile execution plan. The plan defines the stage sequence for each tile:

@dataclass
class Stage:
    stage_type: StageType
    component: str       # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
    params: dict         # per-stage parameters (dynamic)

@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]  # list of stages to execute in order (immutable)

The stage sequence varies depending on the plan:

# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)

# GEMM directly from TCM data (skip DMA read):
stages = (FETCH, GEMM, STORE, DMA_WRITE)

# MATH element-wise:
stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)

# GEMM + accumulation (intermediate K-tile, skip writeback):
stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only

Components do not hardcode the next component. They read the next stage from the token's plan and forward it directly via out_port. This is the same pattern as a network packet carrying a routing header.
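As a minimal sketch of how the stage sequences above could be expanded into a TilePlan, the following repeats the StageType, Stage, and TilePlan definitions so it runs on its own. The helper `make_tile_plan`, the `STAGE_TO_COMPONENT` mapping, and the default `pe_prefix` are assumptions made for illustration; the mapping is inferred from the D1 table rather than taken from the implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5


@dataclass
class Stage:
    stage_type: StageType
    component: str
    params: dict = field(default_factory=dict)


@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]


# Assumption: which component serves each stage type (inferred from the D1 table).
STAGE_TO_COMPONENT = {
    StageType.DMA_READ: "pe_dma",
    StageType.FETCH: "pe_fetch_store",
    StageType.GEMM: "pe_gemm",
    StageType.MATH: "pe_math",
    StageType.STORE: "pe_fetch_store",
    StageType.DMA_WRITE: "pe_dma",
}


def make_tile_plan(tile_id, stage_types, pe_prefix="sip0.cube0.pe0"):
    """Hypothetical helper: expand a stage-type sequence into a TilePlan."""
    stages = tuple(
        Stage(st, f"{pe_prefix}.{STAGE_TO_COMPONENT[st]}") for st in stage_types
    )
    return TilePlan(tile_id=tile_id, stages=stages)


S = StageType
normal_gemm = make_tile_plan(0, (S.DMA_READ, S.FETCH, S.GEMM, S.STORE, S.DMA_WRITE))
accumulate = make_tile_plan(1, (S.DMA_READ, S.FETCH, S.GEMM, S.STORE))

# Component visited at each stage, with the PE prefix stripped for readability.
route = [stage.component.rsplit(".", 1)[-1] for stage in normal_gemm.stages]
print(route)
```

Because the route is data in the plan, skipping the DMA read or the writeback only changes the tuple passed to the helper; no component code has to change.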

Pipeline Context

@dataclass
class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: simpy.Event = None  # succeeds when all tiles are complete

    def complete_tile(self) -> None:
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()

Completion follows an exactly-once contract: the last stage of each tile must call complete_tile() exactly once. Duplicate calls are a bug, and done_event must succeed only once (SimPy Event constraint).
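One way the exactly-once contract could be made to fail loudly is sketched below. It is not the ADR's PipelineContext: `PipelineContextSketch` is a hypothetical stand-in that replaces the SimPy `done_event` with a plain `done` flag so the example runs without dependencies.

```python
from dataclasses import dataclass


@dataclass
class PipelineContextSketch:
    """Stand-in for PipelineContext; `done` replaces the SimPy done_event."""
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done: bool = False

    def complete_tile(self) -> None:
        # Catches calls beyond total_tiles. A duplicate that arrives before the
        # last tile would need per-tile tracking, which this sketch omits.
        if self.completed_tiles >= self.total_tiles:
            raise RuntimeError(f"pipeline {self.id}: complete_tile() called too often")
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done = True  # corresponds to done_event.succeed()


ctx = PipelineContextSketch(id="p0", total_tiles=2)
ctx.complete_tile()
ctx.complete_tile()

try:
    ctx.complete_tile()  # one call too many
    duplicate_detected = False
except RuntimeError:
    duplicate_detected = True
```

Raising here surfaces the bug at the offending call instead of relying on SimPy to reject the second `succeed()`.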

Scheduler Role (Reduced)

When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext, enqueues them into the scheduler's internal _pending_feeds FIFO, and returns immediately.

Actual tile injection is handled by a single feeder process (_feed_loop), and there is exactly one per scheduler. The feeder consumes _pending_feeds in FIFO order and does not interleave tile feeds across composite commands: the feed for the next command begins only after all tiles of the current command have been injected into the first stage queue. Command issue order is the order in which PE_SCHEDULER receives PeInternalTxn.

This structure preserves command issue order, and when the first stage queue is full only the feeder process blocks; the scheduler worker's inbox processing does not stall.

class PeSchedulerV2(PeEngineBase):
    _pipelines: dict[str, PipelineContext]
    _pending_feeds: simpy.Store   # FIFO of (plan, ctx)

    def start(self, env):
        super().start(env)
        self._pending_feeds = simpy.Store(env)
        env.process(self._feed_loop(env))

    def _dispatch_composite(self, env, pe_txn, cmd):
        plan = generate_plan(cmd)
        ctx = PipelineContext(
            id=next_id(),
            total_tiles=len(plan.tiles),
            done_event=pe_txn.done,
        )
        self._pipelines[ctx.id] = ctx

        # only enqueue to feeder queue and return immediately
        yield self._pending_feeds.put((plan, ctx))

    def _feed_loop(self, env):
        """Single feeder process: feeds composite commands in FIFO order.

        Tile feed interleaving across composite commands is not allowed.
        The feed for the next command begins only after all tiles of the
        current command have been injected into the first stage queue.

        When the first stage queue is full, only this feeder blocks;
        the scheduler worker's inbox processing does not stall.
        """
        while True:
            plan, ctx = yield self._pending_feeds.get()
            for tile in plan.tiles:
                token = TileToken(
                    tile_id=tile.tile_id,
                    pipeline_ctx=ctx,
                    plan=tile,
                    stage_idx=0,
                    params=tile.stages[0].params,
                )
                yield self.out_ports[tile.stages[0].component].put(token)
                # queue capacity = HW queue depth → feeder blocks only when full

In this ADR, the scheduler can accept multiple composite commands, but tile submission order follows per-command FIFO. Within a command, tile-level pipeline overlap is allowed, but tile feed interleaving across commands is not.
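The ordering rule can be illustrated without SimPy. In the sketch below a `deque` stands in for the `_pending_feeds` Store, and the function name `feed_order` and the `(name, tile_count)` command shape are assumptions made for the example.

```python
from collections import deque


def feed_order(commands):
    """Return the first-stage injection order for (cmd_name, tile_count) pairs.

    Mirrors _feed_loop: one command is taken from the FIFO at a time and all of
    its tiles are injected before the next command is touched.
    """
    pending_feeds = deque(commands)  # stand-in for the _pending_feeds Store
    injected = []
    while pending_feeds:
        cmd_name, tile_count = pending_feeds.popleft()
        for tile_id in range(tile_count):
            injected.append(f"{cmd_name}:t{tile_id}")
    return injected


order = feed_order([("cmd1", 3), ("cmd2", 2)])
print(order)
# ['cmd1:t0', 'cmd1:t1', 'cmd1:t2', 'cmd2:t0', 'cmd2:t1']
```

The sketch leaves out blocking on a full first-stage queue; it only shows that no cmd2 tile can appear between two cmd1 tiles.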

D3. Data Transfer vs. Completion Signal — HW Modeling Criteria

| Communication Type | Method | HW Correspondence |
|---|---|---|
| Tile token (work directive) | message via out_port | enqueue to command queue |
| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |

Tile token: uses out_port.put(). SimPy Store capacity = HW queue depth.

Intra-PE chaining latency: within the scope of this ADR, no explicit latency model is applied to intra-PE stage triggers. Chaining between components corresponds to PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost is incurred.

Pipeline completion: the component at the last stage calls pipeline_ctx.complete_tile(). When all tiles are complete, PipelineContext calls done_event.succeed().

D4. Asynchronous Pipeline — Natural Overlap

The scheduler processes CompositeCmds asynchronously. However, tile feed does not spawn an independent process per command; instead, the scheduler's internal single feeder process performs the feed in FIFO order. Therefore, the scheduler can continue to receive the next command, but the first-stage tile injection order is guaranteed per command.

Since SimPy Store capacity = HW queue depth:

  • When the queue is full, put() naturally blocks (backpressure)
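  • While PE_DMA is transferring tile 1, PE_FETCH_STORE and PE_GEMM can already work on tile 0, whose DMA read has completed
  • When a second CompositeCmd arrives, it is immediately enqueued in _pending_feeds; its tiles enter the DMA queue only after all tiles of the first command have been injected

First-stage feed order (feeder → DMA queue):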
  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
                                            ↑ cmd2 starts after cmd1 feed completes

Runtime pipeline (downstream overlap):
  PE_DMA:    [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
  PE_FETCH:          [cmd1:t0][cmd1:t1]...
  PE_GEMM:                   [cmd1:t0][cmd1:t1]...
                              ↑ pipeline overlap within the same command

Here, the overlap does not come from tile feed interleaving across different commands, but occurs naturally as tiles from earlier commands progress to downstream stages while the feeder continues injecting subsequent tiles.

For example, tile feed for cmd2 does not start until all tiles of cmd1 have been injected into the first stage queue. However, while cmd1.tile0 has already progressed to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so pipeline overlap within the same command occurs naturally.
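The overlap can be checked with a small timing calculation. The sketch below assumes each stage handles one tile at a time, queues between stages are unbounded, and hand-off costs no time; the latencies are hypothetical values chosen only to make the overlap visible.

```python
def schedule(num_tiles, stage_latency_ns):
    """Compute (start, end) per tile and stage for an in-order pipeline.

    Assumptions: one tile per stage at a time, unbounded queues between
    stages, zero hand-off cost.
    """
    stages = list(stage_latency_ns)
    times = []  # times[tile][stage_name] = (start, end)
    for tile in range(num_tiles):
        row = {}
        prev_stage_end = 0.0
        for name in stages:
            # A stage is busy until the previous tile has left it.
            stage_free = times[tile - 1][name][1] if tile > 0 else 0.0
            start = max(prev_stage_end, stage_free)
            end = start + stage_latency_ns[name]
            row[name] = (start, end)
            prev_stage_end = end
        times.append(row)
    return times


# Hypothetical latencies in ns.
latency = {"PE_DMA": 100.0, "PE_FETCH": 10.0, "PE_GEMM": 50.0}
t = schedule(3, latency)

dma_t1 = t[1]["PE_DMA"]    # (100.0, 200.0)
gemm_t0 = t[0]["PE_GEMM"]  # (110.0, 160.0)
overlap = gemm_t0[0] < dma_t1[1] and dma_t1[0] < gemm_t0[1]
```

With these numbers tile 0 runs in PE_GEMM from 110 ns to 160 ns while tile 1 occupies PE_DMA from 100 ns to 200 ns, which is the intra-command overlap described above.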

Component Chaining Pattern

All components follow the same pattern:

def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()

        # process own stage
        yield from self._process(env, token)

        # chain to next stage (read from plan)
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            # last stage — pipeline completion
            token.pipeline_ctx.complete_tile()
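The routing part of this pattern can be exercised in isolation. The sketch below is a reduced, SimPy-free version: `TokenSketch`, `advance`, and `run` are hypothetical names, and the token carries only a tuple of component IDs instead of a full TilePlan.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TokenSketch:
    """Reduced TileToken: only the fields needed for routing."""
    tile_id: int
    route: tuple[str, ...]  # component ID per stage, taken from the plan
    stage_idx: int = 0


def advance(token):
    """Return the next component ID, or None when the last stage has finished."""
    next_idx = token.stage_idx + 1
    if next_idx < len(token.route):
        token.stage_idx = next_idx
        return token.route[next_idx]
    return None


def run(token):
    """Walk a token through its route and record which component ran each stage."""
    visited = [token.route[token.stage_idx]]
    while (nxt := advance(token)) is not None:
        visited.append(nxt)
    return visited


token = TokenSketch(
    tile_id=0,
    route=("pe_dma", "pe_fetch_store", "pe_gemm", "pe_fetch_store"),
)
visited = run(token)
```

In the real pattern the `None` branch is where `pipeline_ctx.complete_tile()` would be called.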

D5. PE_FETCH_STORE — Dedicated TCM ↔ Register File Transfer

Previously, GemmBlock and MathBlock each implemented their own TCM read/write. This logic moves into a dedicated PE_FETCH_STORE component.

# PE_FETCH_STORE._process()
def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done
    # chaining is handled by the base class (D4 pattern)

Advantages:

  • GEMM/MATH perform pure compute only — no TCM access logic
  • Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
  • Prefetch strategies can be experimented with by replacing the fetch unit alone

D6. Simplification of Each Compute Component

GEMM/MATH perform compute only with register data already prepared. Chaining follows the common pattern (D4), so only _process() needs to be implemented:

# PE_GEMM._process()
def _process(self, env, token):
    yield env.timeout(self._mac_latency(token.params))

# PE_MATH._process()
def _process(self, env, token):
    yield env.timeout(self._simd_latency(token.params))

# PE_FETCH_STORE._process()
def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done

# PE_DMA._process()
def _process(self, env, token):
    yield from self._do_fabric_dma(token.params)

By replacing only the timing model, one can freely switch between cycle-accurate and analytical models. Since the chaining logic resides in the base class, each component only implements its pure stage logic.
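To illustrate what swapping the timing model might look like, the sketch below plugs two latency functions into the same stage logic. Everything here is hypothetical: the ADR selects implementations through `impl` entries rather than constructor injection, and the formulas and parameter keys (`m`, `n`, `k`, `macs_per_cycle`, `cycle_ns`) are illustrative assumptions, not the pe_accel latency model.

```python
import math
from typing import Callable

LatencyModel = Callable[[dict], float]


def analytical_mac_latency(params: dict) -> float:
    """Hypothetical analytical model: total MACs divided by peak throughput."""
    macs = params["m"] * params["n"] * params["k"]
    return macs / params["macs_per_cycle"] * params["cycle_ns"]


def cycle_mac_latency(params: dict) -> float:
    """Hypothetical cycle-granular model: partial cycles round up."""
    macs = params["m"] * params["n"] * params["k"]
    cycles = math.ceil(macs / params["macs_per_cycle"])
    return cycles * params["cycle_ns"]


class GemmSketch:
    """Stand-in for PE_GEMM: the stage logic only asks the model for a delay."""

    def __init__(self, latency_model: LatencyModel):
        self._mac_latency = latency_model

    def process_delay_ns(self, params: dict) -> float:
        return self._mac_latency(params)


params = {"m": 10, "n": 10, "k": 10, "macs_per_cycle": 512, "cycle_ns": 1.0}
smooth = GemmSketch(analytical_mac_latency).process_delay_ns(params)
rounded = GemmSketch(cycle_mac_latency).process_delay_ns(params)
```

The stage code is identical in both cases; only the function that produces the delay differs.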

D7. Topology Changes

Add PE_FETCH_STORE to the PE template:

pe_template:
  components:
    pe_cpu:         { kind: pe_cpu,         impl: pe_cpu_v1, ... }
    pe_scheduler:   { kind: pe_scheduler,   impl: pe_scheduler_v2, ... }
    pe_dma:         { kind: pe_dma,         impl: pe_dma_v1, ... }
    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
    pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1, ... }
    pe_math:        { kind: pe_math,        impl: pe_math_v1, ... }
    pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1, ... }
    pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1, ... }
  links:
    # existing links...
    fetch_store_to_tcm_bw_gbs: 512.0
    fetch_store_to_tcm_mm: 0.0

PE internal edge connections:

PE_SCHEDULER → PE_DMA (initial dispatch)
PE_SCHEDULER → PE_FETCH_STORE (initial dispatch)
PE_SCHEDULER → PE_GEMM (initial dispatch)
PE_SCHEDULER → PE_MATH (initial dispatch)
PE_DMA → PE_FETCH_STORE (chaining)
PE_FETCH_STORE → PE_GEMM (chaining)
PE_FETCH_STORE → PE_MATH (chaining)
PE_GEMM → PE_FETCH_STORE (store chaining)
PE_MATH → PE_FETCH_STORE (store chaining)
PE_FETCH_STORE → PE_DMA (writeback chaining)
PE_FETCH_STORE → PE_TCM (BW request)

Topology edges cover both initial dispatch and runtime chaining. Scheduler → sub-component edges are initial dispatch paths, while inter-component edges are runtime chaining paths driven by token self-routing.
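A plan is only executable if every hop in its route has a matching edge. The sketch below checks this for the edge list above; the function name `route_is_connected` and the tuple-based route format are assumptions, and the stage-to-component mapping (STORE on PE_FETCH_STORE, DMA_WRITE on PE_DMA) is inferred from the D1 table.

```python
# Edges as listed above, written as (source, destination) pairs.
EDGES = {
    ("PE_SCHEDULER", "PE_DMA"),
    ("PE_SCHEDULER", "PE_FETCH_STORE"),
    ("PE_SCHEDULER", "PE_GEMM"),
    ("PE_SCHEDULER", "PE_MATH"),
    ("PE_DMA", "PE_FETCH_STORE"),
    ("PE_FETCH_STORE", "PE_GEMM"),
    ("PE_FETCH_STORE", "PE_MATH"),
    ("PE_GEMM", "PE_FETCH_STORE"),
    ("PE_MATH", "PE_FETCH_STORE"),
    ("PE_FETCH_STORE", "PE_DMA"),
    ("PE_FETCH_STORE", "PE_TCM"),
}


def route_is_connected(route):
    """True if the scheduler dispatch and every chaining hop has an edge."""
    hops = zip(("PE_SCHEDULER",) + tuple(route), route)
    return all(hop in EDGES for hop in hops)


# Normal GEMM: DMA_READ, FETCH, GEMM, STORE, DMA_WRITE
normal_gemm = ("PE_DMA", "PE_FETCH_STORE", "PE_GEMM", "PE_FETCH_STORE", "PE_DMA")
# GEMM from TCM-resident data: FETCH, GEMM, STORE, DMA_WRITE
tcm_gemm = ("PE_FETCH_STORE", "PE_GEMM", "PE_FETCH_STORE", "PE_DMA")
# A route that skips the fetch stage has no edge to carry it.
missing_fetch = ("PE_DMA", "PE_GEMM")

results = [route_is_connected(r) for r in (normal_gemm, tcm_gemm, missing_fetch)]
```

A check of this kind could run at plan-generation time so that a missing edge fails before any token is injected.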

D8. Existing Code Migration — Builtin Integration

The existing builtin v1 components and pe_accel are replaced with new builtin components.

Migration Strategy

  1. Back up existing components/builtin/ → components/builtin_legacy/ (preserved without modification)
  2. Back up existing components/custom/pe_accel/ → likewise
  3. Re-implement new components/builtin/ with the ADR-0021 architecture
  4. Maintain only one topology.yaml (including pe_fetch_store)
  5. components.yaml points to the new builtin

# components.yaml: new builtin
pe_scheduler_v1: kernbench.components.builtin.pe_scheduler:PeSchedulerComponent
pe_gemm_v1:      kernbench.components.builtin.pe_gemm:PeGemmComponent
pe_math_v1:      kernbench.components.builtin.pe_math:PeMathComponent
pe_dma_v1:       kernbench.components.builtin.pe_dma:PeDmaComponent
pe_fetch_store_v1: kernbench.components.builtin.pe_fetch_store:PeFetchStoreComponent
pe_tcm_v1:       kernbench.components.builtin.pe_tcm:PeTcmComponent

The impl names (pe_gemm_v1, etc.) are preserved, but the implementations are replaced with the ADR-0021 architecture. Existing benchmarks and tests referencing topology.yaml continue to work without changes.

Latency Model Inheritance

The latency modeling of the new builtin components (MAC cycle calculation, SIMD latency, TCM BW serialization, DMA fabric latency, etc.) is based on the current pe_accel implementation. The tile schedule generation logic from tiling.py is also carried over. Only the architecture (component separation, self-routing) changes; timing accuracy is preserved.

Test Strategy

Test Plan

1. Existing test pass (regression): After migration is complete, all existing tests (366) must pass.

2. Latency regression: Verify that the new builtin produces identical latency for the same inputs as pe_accel.

3. Phase 1 → Phase 2 end-to-end: Integration test from SimPy simulation (Phase 1) op_log generation → DataExecutor (Phase 2) actual numpy computation → result correctness verification.

  • GEMM: tl.composite(gemm) → op_log → Phase 2 matmul → allclose verification
  • MATH: tl.exp / tl.add, etc. → op_log → Phase 2 numpy op → allclose verification
  • Chaining: GEMM output → MATH input → final result end-to-end verification

4. TileToken self-routing:

  • Verify that tiles chain according to the plan's stage sequence
  • Verify PipelineContext.complete_tile() exactly-once at the last stage
  • Queue backpressure: verify that only the feeder blocks when DMA queue capacity is exceeded

5. Asynchronous pipeline overlap:

  • Verify that inter-tile stage overlap occurs within the same command (tile0 in GEMM while tile1 in DMA)
  • Multiple commands: verify that cmd2 feed starts after cmd1 feed completes (FIFO order)

D9. TileToken Message Definition

A message used for passing tile work between components. The token carries the plan and stage index, enabling self-routing.

@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext    # completion tracking
    plan: TilePlan                   # full stage sequence for this tile (immutable)
    stage_idx: int                   # current stage index in plan.stages
    params: dict                     # current stage parameter cache (canonical: plan.stages[stage_idx].params)
    data_op: bool = True             # op_log recording target (ADR-0020)

A TileToken is owned by exactly one component at a time and is never referenced by multiple components simultaneously (single-owner).

Token lifecycle:

  1. Scheduler creates it with stage_idx=0 and puts it to the first stage component
  2. The component executes _process(), increments stage_idx, and puts it to the next component
  3. The last stage component calls pipeline_ctx.complete_tile()
  4. When all tiles are complete, PipelineContext calls done_event.succeed()

Relationship with existing PeInternalTxn:

  • PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged)
  • TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)

Non-goals

  • PE_CPU changes: the PE_CPU → PE_SCHEDULER interface is not modified (PeInternalTxn-based, ADR-0014 maintained)
  • Resource contention model across multiple pipelines: the current scope focuses on accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines are future work.
  • builtin_legacy maintenance: kept for backup purposes only; not a target for bug fixes or feature additions.

Open Questions

  • Register File capacity model: whether to model capacity limits when the fetch unit loads into registers. Capacity is expressed in bytes (register_file_bytes), and the number of tiles that can be held simultaneously is determined by tile size. When capacity is exceeded, fetch stalls, creating natural backpressure.
  • Prefetch strategy: this ADR does not allow tile feed interleaving across composite commands. Therefore, overlap arises not from pre-injection across commands, but naturally from pipeline progression of tiles within the same command. If additional prefetch is needed, it should be considered at the level of tile ordering within the same command or fetch/store unit policy, not cross-command injection.
  • PE_DMA coalescing: per-tile DMA may cause fragmentation. Direction is to merge/coalesce within DMA without scheduler involvement.
  • Synchronous execution mode: this ADR adopts asynchronous pipeline as the default/sole execution model. If a sync mode is needed for debug or validation purposes, it will be considered in a future ADR.
  • TCM bank conflict across multiple pipelines: currently based on a single pipeline. Bank conflict modeling when multiple pipelines simultaneously access TCM is future work.
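For the Register File capacity question above, the relation between capacity in bytes and the number of resident tiles is simple arithmetic. The sketch below uses hypothetical sizes and hypothetical helper names; only `register_file_bytes` comes from the text.

```python
def max_resident_tiles(register_file_bytes: int, tile_bytes: int) -> int:
    """Number of whole tiles that fit in the register file at once."""
    if tile_bytes <= 0:
        raise ValueError("tile_bytes must be positive")
    return register_file_bytes // tile_bytes


def fetch_must_stall(resident: int, register_file_bytes: int, tile_bytes: int) -> bool:
    """True when loading one more tile would exceed capacity."""
    return resident >= max_resident_tiles(register_file_bytes, tile_bytes)


# Hypothetical sizes: a 32 KiB register file and 64x64 tiles of 2-byte elements.
register_file_bytes = 32 * 1024
tile_bytes = 64 * 64 * 2  # 8192 bytes per tile
capacity = max_resident_tiles(register_file_bytes, tile_bytes)
```

With these numbers four tiles fit, so a fifth fetch would stall until a tile is stored back.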

Consequences

Positive

  • Each block is an independent component — individually replaceable (ADR-0015 compliant)
  • PE internal structure is visible in the topology
  • Components do not know the next component — plan-based routing provides flexibility
  • Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
  • Improved HW modeling accuracy (done signal = Event, data transfer = message)
  • Fetch/store separation enables accurate TCM BW contention modeling

Negative

  • Increased number of PE internal components (5 → 6) — more topology nodes/edges
  • Component separation makes intra-PE token forwarding more explicit than before
  • Breaking change from existing builtin/pe_accel — migration required

Affected Files

| File | Change |
|---|---|
| topology.yaml | Add pe_fetch_store component, add chaining edges |
| components.yaml | Register new builtin components |
| src/kernbench/topology/builder.py | Add fetch_store + chaining edges to PE internal edges |
| src/kernbench/common/pe_commands.py | Add TileToken definition |
| src/kernbench/components/builtin/pe_scheduler.py | Re-implement (feeder + plan-based dispatch) |
| src/kernbench/components/builtin/pe_gemm.py | Re-implement (TileToken, _process pattern) |
| src/kernbench/components/builtin/pe_math.py | Re-implement (TileToken, _process pattern) |
| src/kernbench/components/builtin/pe_dma.py | Re-implement (TileToken, _process pattern) |
| src/kernbench/components/builtin/pe_fetch_store.py | New |
| src/kernbench/components/builtin/pe_tcm.py | Re-implement (TcmRequest service) |
| src/kernbench/components/builtin/types.py | New: TilePlan, Stage, StageType, PipelineContext, TileToken |
| src/kernbench/components/builtin/tiling.py | Ported from pe_accel: plan generation logic |

Backup:

| File | Change |
|---|---|
| src/kernbench/components/builtin_legacy/ | Full backup of existing builtin (preserved without modification) |
| src/kernbench/components/custom/pe_accel/ | Backup of existing pe_accel (preserved without modification) |