# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing

## Status

Accepted

## Context

### Actual Hardware Structure

```
HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
```

- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
- GEMM/MATH Engine: computation between Register Files (cycle-accurate)
- Completion signal: PE-internal 1-cycle wire signal (done pin assert)

---

## Decision

### D1. Separate Each Block into an Independent Component

The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
Existing 5 blocks + 1 Fetch/Store Unit = 6 components.

| Component | Role | HW Correspondence |
|-----------|------|-------------------|
| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
| PE_GEMM | MAC compute (register only) | MAC Array |
| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
| PE_TCM | BW-serialized scratchpad | SRAM Bank |

Each component exists as a topology node and is connected via ports/wires.
Replacing the `impl` allows changing the timing model of an individual block.

### D2. Token Self-Routing — Scheduler Handles Only Dispatch + Completion

**Components do not route through the scheduler at every stage.**
The token carries a plan so that components chain directly to the next stage.

```
Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
            ↑ chaining: does not go through scheduler    ↑ completion only
```

This matches the actual HW structure, where each block's done signal is wired
directly to the next block. The scheduler is responsible **only for
initial dispatch + completion aggregation**.

#### Stage Definition

```python
from enum import Enum


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5
```

#### Plan Structure

When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
The plan defines the **stage sequence** for each tile:

```python
from dataclasses import dataclass


@dataclass
class Stage:
    stage_type: StageType
    component: str  # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
    params: dict    # per-stage parameters (dynamic)


@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]  # stages to execute in order (immutable tuple)
```

The stage sequence varies depending on the plan:

```python
# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)

# GEMM directly from TCM data (skip DMA read):
stages = (FETCH, GEMM, STORE, DMA_WRITE)

# MATH element-wise:
stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)

# GEMM + accumulation (intermediate K-tile, skip writeback):
stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only
```

**Components do not hardcode the next component.**
They read the next stage from the token's plan and forward it directly via out_port.
This is the same pattern as a network packet carrying a routing header.

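
The routing-header idea above can be sketched without SimPy. This is a minimal illustration rather than the simulator's code: `build_stages`, `next_hop`, and the `COMPONENT_OF` mapping are hypothetical helpers, and the `pe0.*` component IDs are placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5


@dataclass
class Stage:
    stage_type: StageType
    component: str
    params: dict = field(default_factory=dict)


@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple


# Hypothetical mapping from stage type to topology node ID (placeholder names).
COMPONENT_OF = {
    StageType.DMA_READ: "pe0.pe_dma",
    StageType.FETCH: "pe0.pe_fetch_store",
    StageType.GEMM: "pe0.pe_gemm",
    StageType.MATH: "pe0.pe_math",
    StageType.STORE: "pe0.pe_fetch_store",
    StageType.DMA_WRITE: "pe0.pe_dma",
}


def build_stages(compute, read_from_hbm=True, write_back=True):
    """Assemble the stage tuple for one tile (hypothetical helper)."""
    kinds = []
    if read_from_hbm:
        kinds.append(StageType.DMA_READ)
    kinds += [StageType.FETCH, compute, StageType.STORE]
    if write_back:
        kinds.append(StageType.DMA_WRITE)
    return tuple(Stage(k, COMPONENT_OF[k]) for k in kinds)


def next_hop(plan, stage_idx):
    """Component ID of the stage after stage_idx, or None at the last stage."""
    nxt = stage_idx + 1
    return plan.stages[nxt].component if nxt < len(plan.stages) else None


plan = TilePlan(tile_id=0, stages=build_stages(StageType.GEMM))
hops = [next_hop(plan, i) for i in range(len(plan.stages))]
```

Each component only needs `next_hop`: it never names a downstream component itself, so skipping the DMA read or the writeback is purely a matter of which stages the plan contains.
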
#### Pipeline Context

```python
@dataclass
class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: simpy.Event | None = None  # succeeds when all tiles are complete

    def complete_tile(self) -> None:
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()
```

**Completion follows an exactly-once contract**: the last stage of each tile must call
`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
succeed only once (a SimPy `Event` can be triggered only once).

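
One way to make a violation of the exactly-once contract visible is a guard inside `complete_tile()`. The sketch below is an assumption, not part of the decision: `GuardedPipelineContext` is a hypothetical variant, and `OnceEvent` is a small stand-in for `simpy.Event` so the example runs without SimPy.

```python
from dataclasses import dataclass


class OnceEvent:
    """Stand-in for simpy.Event: may be triggered only once."""

    def __init__(self):
        self.triggered = False

    def succeed(self):
        if self.triggered:
            raise RuntimeError("event already triggered")
        self.triggered = True


@dataclass
class GuardedPipelineContext:
    id: str
    total_tiles: int
    done_event: OnceEvent
    completed_tiles: int = 0

    def complete_tile(self) -> None:
        # Fail loudly on an over-count instead of silently passing total_tiles.
        if self.completed_tiles >= self.total_tiles:
            raise RuntimeError(f"pipeline {self.id}: duplicate complete_tile()")
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()


ctx = GuardedPipelineContext(id="p0", total_tiles=2, done_event=OnceEvent())
ctx.complete_tile()
fired_early = ctx.done_event.triggered   # one tile still outstanding
ctx.complete_tile()
fired_at_end = ctx.done_event.triggered  # all tiles complete

try:
    ctx.complete_tile()                  # third call for a two-tile pipeline
    duplicate_detected = False
except RuntimeError:
    duplicate_detected = True
```

This guard only catches calls beyond `total_tiles`. A duplicate call for one tile while another tile is still in flight would still fire `done_event` early; catching that case would require tracking completed tile IDs, which this sketch does not do.
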
#### Scheduler Role (Reduced)

When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext,
enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately.

Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
This feeder consumes `_pending_feeds` in FIFO order and
**does not allow tile feed interleaving across composite commands.**
That is, the feed for the next command begins only after all tiles of the current
command have been injected into the first stage queue.

There is **exactly one `_feed_loop`** per scheduler, and
tile feed for composite commands is performed exclusively through this single process.
Command issue order refers to **the order in which PE_SCHEDULER receives PeInternalTxn**.

This structure maintains command issue order while ensuring that when the first stage
queue is full, only the feeder process blocks — the scheduler worker's inbox processing
itself does not stall.

```python
class PeSchedulerV2(PeEngineBase):
    _pipelines: dict[str, PipelineContext]
    _pending_feeds: simpy.Store  # FIFO of (plan, ctx)

    def start(self, env):
        super().start(env)
        self._pending_feeds = simpy.Store(env)
        env.process(self._feed_loop(env))

    def _dispatch_composite(self, env, pe_txn, cmd):
        plan = generate_plan(cmd)
        ctx = PipelineContext(
            id=next_id(),
            total_tiles=len(plan.tiles),
            done_event=pe_txn.done,
        )
        self._pipelines[ctx.id] = ctx

        # only enqueue to feeder queue and return immediately
        yield self._pending_feeds.put((plan, ctx))

    def _feed_loop(self, env):
        """Single feeder process: feeds composite commands in FIFO order.

        Tile feed interleaving across composite commands is not allowed.
        The feed for the next command begins only after all tiles of the
        current command have been injected into the first stage queue.

        When the first stage queue is full, only this feeder blocks;
        the scheduler worker's inbox processing does not stall.
        """
        while True:
            plan, ctx = yield self._pending_feeds.get()
            for tile in plan.tiles:
                token = TileToken(
                    tile_id=tile.tile_id,
                    pipeline_ctx=ctx,
                    plan=tile,
                    stage_idx=0,
                    params=tile.stages[0].params,
                )
                yield self.out_ports[tile.stages[0].component].put(token)
                # queue capacity = HW queue depth → feeder blocks only when full
```

In this ADR, the scheduler can accept multiple composite commands,
but tile submission order follows per-command FIFO.
Within a command, tile-level pipeline overlap is allowed,
but tile feed interleaving across commands is not.

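
The feeder behavior described above can be mimicked with the standard library. In this sketch `asyncio.Queue` stands in for `simpy.Store` (a bounded queue blocks `put()` when full in both), the queue depth of 2 is an assumed value, and `feed_loop` / `first_stage_worker` are hypothetical names; it illustrates the ordering rule only, not simulated time.

```python
import asyncio


async def feed_loop(pending, first_stage, log):
    """Single feeder: drains commands in FIFO order, one command at a time."""
    while True:
        cmd = await pending.get()
        if cmd is None:                  # sentinel: stop the sketch
            await first_stage.put(None)
            return
        name, n_tiles = cmd
        for t in range(n_tiles):
            # Blocks here when the first-stage queue is full (backpressure).
            await first_stage.put((name, t))
            log.append((name, t))


async def first_stage_worker(first_stage, seen):
    """Consumer standing in for the first pipeline stage."""
    while True:
        item = await first_stage.get()
        if item is None:
            return
        await asyncio.sleep(0)           # yield control, as a busy stage would
        seen.append(item)


async def run():
    pending = asyncio.Queue()                # unbounded, like _pending_feeds
    first_stage = asyncio.Queue(maxsize=2)   # assumed HW queue depth of 2
    log, seen = [], []
    # Both commands are accepted immediately, before any tile is fed.
    pending.put_nowait(("cmd1", 3))
    pending.put_nowait(("cmd2", 2))
    pending.put_nowait(None)
    await asyncio.gather(
        feed_loop(pending, first_stage, log),
        first_stage_worker(first_stage, seen),
    )
    return log, seen


feed_order, seen_order = asyncio.run(run())
```

Because there is only one feeder and it finishes a command's tile loop before taking the next command, no `cmd2` tile can appear between `cmd1` tiles, even though both commands were accepted up front.
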
### D3. Data Transfer vs. Completion Signal — HW Modeling Criteria

| Communication Type | Method | HW Correspondence |
|-------------------|--------|-------------------|
| Tile token (work directive) | message via out_port | enqueue to command queue |
| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |

**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.

**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
is applied to intra-PE stage triggers. Chaining between components corresponds to
PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost
is incurred.

**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
When all tiles are complete, PipelineContext calls done_event.succeed().

### D4. Asynchronous Pipeline — Natural Overlap

The scheduler processes CompositeCmds **asynchronously**.
However, tile feed does not spawn an independent process per command; instead,
the scheduler's internal **single feeder process** performs the feed in FIFO order.
Therefore, the scheduler can continue to receive the next command,
but the first-stage tile injection order is guaranteed per command.

Since **SimPy Store capacity = HW queue depth**:
- When a queue is full, `put()` naturally blocks (backpressure)
- While DMA is processing a later tile, downstream stages (fetch, GEMM) can already work on a tile whose DMA has completed
- When a second CompositeCmd arrives, the scheduler accepts it immediately and enqueues it in `_pending_feeds`; its tiles enter the DMA queue once the feeder has finished injecting the first command's tiles

```
First-stage feed order (feeder → DMA queue):
  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
                                            ↑ cmd2 starts after cmd1 feed completes

Runtime pipeline (downstream overlap):
  PE_DMA:   [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
  PE_FETCH:          [cmd1:t0][cmd1:t1]...
  PE_GEMM:                    [cmd1:t0][cmd1:t1]...
                     ↑ pipeline overlap within the same command
```

Here, the overlap does not come from tile feed interleaving across different commands,
but occurs naturally as tiles from earlier commands progress to downstream stages
while the feeder continues injecting subsequent tiles.

For example, tile feed for cmd2 does not start until all tiles of cmd1 have been
injected into the first stage queue. However, while cmd1.tile0 has already progressed
to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so
**pipeline overlap within the same command occurs naturally**.

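
The benefit of this overlap can be estimated with a simple recurrence. The sketch below is a back-of-the-envelope model under stated assumptions, not the simulator's timing: each stage handles one tile at a time in order, inter-stage queues are unbounded (so finite-queue backpressure is ignored), and the per-stage latencies are made-up numbers.

```python
def pipeline_finish_times(stage_latencies, n_tiles):
    """finish[s][t]: time at which stage s finishes tile t."""
    finish = [[0.0] * n_tiles for _ in stage_latencies]
    for s, lat in enumerate(stage_latencies):
        for t in range(n_tiles):
            ready = finish[s - 1][t] if s > 0 else 0.0  # tile arrives from upstream
            free = finish[s][t - 1] if t > 0 else 0.0   # stage done with previous tile
            finish[s][t] = max(ready, free) + lat
    return finish


# Hypothetical per-tile latencies in ns for DMA_READ, FETCH, GEMM.
latencies = [100.0, 4.0, 40.0]
finish = pipeline_finish_times(latencies, n_tiles=3)

overlapped_total = finish[-1][-1]   # last stage finishes the last tile
serial_total = sum(latencies) * 3   # no overlap: every tile runs start to finish alone
```

With these numbers the slowest stage (DMA) sets the pace: each additional tile costs roughly one DMA latency rather than the sum of all stage latencies.
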
#### Component Chaining Pattern

All components follow the same pattern:

```python
def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()

        # process own stage
        yield from self._process(env, token)

        # chain to next stage (read from plan)
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            # last stage — pipeline completion
            token.pipeline_ctx.complete_tile()
```

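
To see the chaining pattern end to end, the following sketch replays it with plain Python queues instead of SimPy processes. It is an illustration under assumptions: `Token` is a reduced stand-in for `TileToken` whose `route` holds component IDs directly, `run_chain` is a hypothetical driver, and no timing is modeled.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Token:
    tile_id: int
    route: tuple        # component IDs in execution order (the routing header)
    stage_idx: int = 0


def run_chain(tokens, components):
    """Deliver tokens hop by hop; return the visit trace and completed tile IDs."""
    inbox = {name: deque() for name in components}
    for tok in tokens:                      # initial dispatch by the scheduler
        inbox[tok.route[0]].append(tok)
    trace, completed = [], []
    while any(inbox.values()):
        for name in components:
            if not inbox[name]:
                continue
            tok = inbox[name].popleft()
            trace.append((tok.tile_id, name))           # stand-in for _process()
            nxt = tok.stage_idx + 1
            if nxt < len(tok.route):
                tok.stage_idx = nxt
                inbox[tok.route[nxt]].append(tok)       # chain directly to next hop
            else:
                completed.append(tok.tile_id)           # last stage: complete_tile()
    return trace, completed


components = ["pe_dma", "pe_fetch_store", "pe_gemm"]
gemm_route = ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_fetch_store", "pe_dma")
tcm_route = ("pe_fetch_store", "pe_gemm", "pe_fetch_store", "pe_dma")  # skips DMA read

trace, completed = run_chain(
    [Token(0, gemm_route), Token(1, tcm_route)], components
)
tile0_path = tuple(name for tid, name in trace if tid == 0)
tile1_path = tuple(name for tid, name in trace if tid == 1)
```

The scheduler appears only in the initial dispatch loop; every later hop is decided by the token's own route, and the same worker logic serves both plans.
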
### D5. PE_FETCH_STORE — Dedicated TCM ↔ Register File Transfer

Previously, GemmBlock and MathBlock each implemented their own TCM read/write.
This is separated into a **PE_FETCH_STORE component**.

```python
# PE_FETCH_STORE._process()
def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done
    # chaining is handled by the base class (D4 pattern)
```

Advantages:
- GEMM/MATH perform **pure compute only** — no TCM access logic
- Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
- Prefetch strategies can be experimented with by replacing the fetch unit alone

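
A BW-based transfer time, and the serialization that produces contention, can be sketched as follows. This is an illustrative calculation, not the PE_TCM implementation: the function names and the 4 KiB / 2 KiB request sizes are assumptions, while 512 GB/s is the `fetch_store_to_tcm_bw_gbs` value used in D7.

```python
def transfer_time_ns(num_bytes: int, bw_gbs: float) -> float:
    """Time to move num_bytes over a link of bw_gbs GB/s (1 GB/s == 1 byte/ns)."""
    if bw_gbs <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / bw_gbs


def serialized_finish_times(request_bytes, bw_gbs):
    """Finish times when requests share one link and are served back to back."""
    t, out = 0.0, []
    for n in request_bytes:
        t += transfer_time_ns(n, bw_gbs)
        out.append(t)
    return out


# Hypothetical 4 KiB tile over a 512 GB/s link.
one_tile_ns = transfer_time_ns(4096, 512.0)

# A fetch (4 KiB) and a store (2 KiB) contending for the same TCM link.
two_requests = serialized_finish_times([4096, 2048], 512.0)
```

The second request finishes later than it would on an idle link, which is the contention effect that serializing through a single PE_TCM resource captures.
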
### D6. Simplification of Each Compute Component

GEMM/MATH perform compute only with register data already prepared.
**Chaining follows the common pattern (D4), so only _process() needs to be implemented:**

```python
# PE_GEMM._process()
def _process(self, env, token):
    yield env.timeout(self._mac_latency(token.params))

# PE_MATH._process()
def _process(self, env, token):
    yield env.timeout(self._simd_latency(token.params))

# PE_FETCH_STORE._process()
def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done

# PE_DMA._process()
def _process(self, env, token):
    yield from self._do_fabric_dma(token.params)
```

By replacing only the timing model, one can freely switch between cycle-accurate
and analytical models. Since the chaining logic resides in the base class,
each component only implements its pure stage logic.

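
As an example of what swapping the timing model could look like, here are two interchangeable cycle estimates for a GEMM tile. Both are assumptions for illustration: this ADR does not define `_mac_latency`, and the function names, the 32×32 array shape, and the tile dimensions are hypothetical.

```python
def analytical_mac_cycles(m: int, n: int, k: int, macs_per_cycle: int) -> int:
    """Ideal cycle count: total MACs divided by array throughput, rounded up."""
    total_macs = m * n * k
    return (total_macs + macs_per_cycle - 1) // macs_per_cycle


def tiled_mac_cycles(m: int, n: int, k: int, rows: int, cols: int) -> int:
    """Block-granular estimate: the output is covered by rows x cols blocks,
    and each block takes k cycles, so partial blocks cost a full pass."""
    row_blocks = (m + rows - 1) // rows
    col_blocks = (n + cols - 1) // cols
    return row_blocks * col_blocks * k


def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles / clock_ghz


params = {"m": 64, "n": 64, "k": 128}
ideal = analytical_mac_cycles(**params, macs_per_cycle=32 * 32)
tiled = tiled_mac_cycles(**params, rows=32, cols=32)

# A tile that does not divide evenly exposes the difference between the models.
ragged_ideal = analytical_mac_cycles(m=65, n=64, k=128, macs_per_cycle=32 * 32)
ragged_tiled = tiled_mac_cycles(m=65, n=64, k=128, rows=32, cols=32)
```

Either function could back a component's latency method without touching the chaining logic, since `_process()` only needs a duration to pass to `env.timeout()`.
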
### D7. Topology Changes

Add PE_FETCH_STORE to the PE template:

```yaml
pe_template:
  components:
    pe_cpu: { kind: pe_cpu, impl: pe_cpu_v1, ... }
    pe_scheduler: { kind: pe_scheduler, impl: pe_scheduler_v2, ... }
    pe_dma: { kind: pe_dma, impl: pe_dma_v1, ... }
    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
    pe_gemm: { kind: pe_gemm, impl: pe_gemm_v1, ... }
    pe_math: { kind: pe_math, impl: pe_math_v1, ... }
    pe_mmu: { kind: pe_mmu, impl: pe_mmu_v1, ... }
    pe_tcm: { kind: pe_tcm, impl: pe_tcm_v1, ... }
  links:
    # existing links...
    fetch_store_to_tcm_bw_gbs: 512.0
    fetch_store_to_tcm_mm: 0.0
```

PE internal edge connections:
```
PE_SCHEDULER   → PE_DMA           (initial dispatch)
PE_SCHEDULER   → PE_FETCH_STORE   (initial dispatch)
PE_SCHEDULER   → PE_GEMM          (initial dispatch)
PE_SCHEDULER   → PE_MATH          (initial dispatch)
PE_DMA         → PE_FETCH_STORE   (chaining)
PE_FETCH_STORE → PE_GEMM          (chaining)
PE_FETCH_STORE → PE_MATH          (chaining)
PE_GEMM        → PE_FETCH_STORE   (store chaining)
PE_MATH        → PE_FETCH_STORE   (store chaining)
PE_FETCH_STORE → PE_DMA           (writeback chaining)
PE_FETCH_STORE → PE_TCM           (BW request)
```

Topology edges encompass both **control/dispatch visibility + runtime chaining**.
Scheduler → sub-component edges are initial dispatch paths, while
inter-component edges are runtime chaining paths driven by token self-routing.

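
Since every hop in a plan needs a matching topology edge, a plan can be checked against the edge list before it is dispatched. The sketch below is one possible check, not an existing validator: `EDGES` is transcribed from the list above, and `missing_edges` is a hypothetical helper.

```python
# Directed intra-PE edges, transcribed from the edge list above.
EDGES = {
    ("PE_SCHEDULER", "PE_DMA"),
    ("PE_SCHEDULER", "PE_FETCH_STORE"),
    ("PE_SCHEDULER", "PE_GEMM"),
    ("PE_SCHEDULER", "PE_MATH"),
    ("PE_DMA", "PE_FETCH_STORE"),
    ("PE_FETCH_STORE", "PE_GEMM"),
    ("PE_FETCH_STORE", "PE_MATH"),
    ("PE_GEMM", "PE_FETCH_STORE"),
    ("PE_MATH", "PE_FETCH_STORE"),
    ("PE_FETCH_STORE", "PE_DMA"),
    ("PE_FETCH_STORE", "PE_TCM"),
}


def missing_edges(route, edges=EDGES, source="PE_SCHEDULER"):
    """Return the hops in source -> route[0] -> route[1] -> ... that have no edge."""
    hops = zip((source, *route), route)
    return [hop for hop in hops if hop not in edges]


# Normal GEMM: DMA_READ, FETCH, GEMM, STORE, DMA_WRITE mapped to components.
normal_gemm = ("PE_DMA", "PE_FETCH_STORE", "PE_GEMM", "PE_FETCH_STORE", "PE_DMA")
ok = missing_edges(normal_gemm)

# A route that tries to feed GEMM straight from DMA, bypassing the register file.
bypass = missing_edges(("PE_DMA", "PE_GEMM"))

# GEMM chained directly into MATH, as drawn in the D2 overview diagram.
fused = missing_edges(
    ("PE_DMA", "PE_FETCH_STORE", "PE_GEMM", "PE_MATH", "PE_FETCH_STORE", "PE_DMA")
)
```

With the edges exactly as listed, the fused route is reported as missing a `PE_GEMM → PE_MATH` edge, so a plan of that shape would presumably need either that edge or an intermediate store/fetch step.
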
### D9. TileToken Message Definition

A message used for passing tile work between components.
The token carries the plan and stage index, enabling self-routing.

```python
@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext  # completion tracking
    plan: TilePlan                 # full stage sequence for this tile (immutable)
    stage_idx: int                 # current stage index in plan.stages
    params: dict                   # current stage parameter cache (canonical: plan.stages[stage_idx].params)
    data_op: bool = True           # op_log recording target (ADR-0020)
```

A TileToken is **owned by exactly one component at a time** and
is never referenced by multiple components simultaneously (single-owner).

Token lifecycle:
1. Scheduler creates it with stage_idx=0 and puts it to the first stage component
2. The component executes _process(), increments stage_idx, and puts it to the next component
3. The last stage component calls pipeline_ctx.complete_tile()
4. When all tiles are complete, PipelineContext calls done_event.succeed()

Relationship with existing PeInternalTxn:
- PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged)
- TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)

---

## Non-goals

- **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is not modified
  (PeInternalTxn-based, ADR-0014 maintained)
- **Resource contention model across multiple pipelines**: the current scope focuses on
  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines
  are future work.

## Open Questions

- **Register File capacity model**: whether to model capacity limits when the fetch unit
  loads into registers. Capacity is expressed in bytes (register_file_bytes), and
  the number of tiles that can be held simultaneously is determined by tile size.
  When capacity is exceeded, fetch stalls, creating natural backpressure.
- **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite
  commands. Therefore, overlap arises not from pre-injection across commands, but
  naturally from pipeline progression of tiles within the same command.
  If additional prefetch is needed, it should be considered at the level of tile ordering
  within the same command or fetch/store unit policy, not cross-command injection.
- **PE_DMA coalescing**: per-tile DMA may cause fragmentation.
  Direction is to merge/coalesce within DMA without scheduler involvement.
- **Synchronous execution mode**: this ADR adopts asynchronous pipeline as the
  default/sole execution model. If a sync mode is needed for debug or validation
  purposes, it will be considered in a future ADR.
- **TCM bank conflict across multiple pipelines**: currently based on a single pipeline.
  Bank conflict modeling when multiple pipelines simultaneously access TCM is future work.

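
For the Register File capacity question, the relation between `register_file_bytes` and tile size amounts to an integer division. The sketch below only illustrates that arithmetic; the helper names and the 64 KiB / 16 KiB figures are hypothetical, and whether capacity is modeled at all remains open.

```python
def max_resident_tiles(register_file_bytes: int, tile_bytes: int) -> int:
    """Number of whole tiles that fit in the register file at once."""
    if tile_bytes <= 0:
        raise ValueError("tile_bytes must be positive")
    return register_file_bytes // tile_bytes


def fetch_must_stall(resident_tiles: int, register_file_bytes: int, tile_bytes: int) -> bool:
    """True when loading one more tile would exceed register file capacity."""
    return resident_tiles >= max_resident_tiles(register_file_bytes, tile_bytes)


# Hypothetical sizes: 64 KiB register file, 16 KiB per tile.
capacity = max_resident_tiles(64 * 1024, 16 * 1024)
stall_at_three = fetch_must_stall(3, 64 * 1024, 16 * 1024)
stall_at_four = fetch_must_stall(4, 64 * 1024, 16 * 1024)
```

In a SimPy model the same limit could be expressed as the capacity of a bounded resource held by the fetch unit, which would turn the stall into ordinary backpressure.
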
---

## Consequences

### Positive

- Each block is an independent component — individually replaceable (ADR-0015 compliant)
- PE internal structure is visible in the topology
- Components do not know the next component — plan-based routing provides flexibility
- Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
- Improved HW modeling accuracy (done signal = Event, data transfer = message)
- Fetch/store separation enables accurate TCM BW contention modeling

### Negative

- Increased number of PE internal components (5 → 6) — more topology nodes/edges
- Component separation makes intra-PE token forwarding more explicit than before