Add English translations for ADR-0018, 0019, 0020, 0021

- ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping
- ADR-0019: CUBE NOC per-channel and aggregated HBM connection model
- ADR-0020: 2-pass data execution model (timing/data separation, greenlet)
- ADR-0021: PE pipeline refactor (component separation + token self-routing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,537 @@
# ADR-0021: PE Pipeline Refactoring (Component Separation + Token Self-Routing)

## Status

Proposed

## Context

### Problems with the Current Structure

pe_accel (SchedulerV2Component) hides 5 hardware blocks (DmaIn, DmaWb, Gemm, Math, Tcm)
**inside a single component**.

```
SchedulerV2Component (single topology node)
  ├── DmaInBlock    ← directly connected via internal SimPy Store
  ├── DmaWbBlock    ← not visible in topology
  ├── GemmBlock     ← not replaceable
  ├── MathBlock     ← not replaceable
  └── TcmBlock      ← not replaceable
```

Problems:
- Blocks directly reference the next block via `desc.next_block` (hardcoded routing)
- Individual blocks cannot be replaced (violates the ADR-0015 component replacement principle)
- PE internal structure is not visible in the topology
- GemmBlock and MathBlock each duplicate TCM load/store logic

### Actual Hardware Structure

```
HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
```

- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
- GEMM/MATH Engine: computation between Register Files (cycle-accurate)
- Completion signal: PE-internal 1-cycle wire signal (done pin assert)

---

## Decision

### D1. Separate Each Block into an Independent Component

The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
The result is 6 components: the scheduler, a single DMA component (covering the former
DmaIn and DmaWb blocks), GEMM, MATH, TCM, and a new Fetch/Store Unit.

| Component | Role | HW Correspondence |
|-----------|------|-------------------|
| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
| PE_GEMM | MAC compute (register only) | MAC Array |
| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
| PE_TCM | BW-serialized scratchpad | SRAM Bank |

Each component exists as a topology node and is connected via ports/wires.
Replacing the `impl` allows changing the timing model of an individual block.

### D2. Token Self-Routing: Scheduler Handles Only Dispatch + Completion

**Components do not pass through the scheduler at every stage.**
The token carries a plan so that components chain directly to the next stage.

```
Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
            ↑ chaining: does not go through scheduler    ↑ completion only
```

This matches the actual HW structure where each block's done signal is directly
connected to the next block via wire. The scheduler is responsible **only for
initial dispatch + completion aggregation**.

#### Stage Definition

```python
from enum import Enum


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5
```

#### Plan Structure

When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
The plan defines the **stage sequence** for each tile:

```python
from dataclasses import dataclass


@dataclass
class Stage:
    stage_type: StageType
    component: str  # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
    params: dict    # per-stage parameters (dynamic)


@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]  # stages to execute in order (immutable tuple)
```

The stage sequence varies depending on the plan:

```python
# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)

# GEMM directly from TCM data (skip DMA read):
stages = (FETCH, GEMM, STORE, DMA_WRITE)

# MATH element-wise:
stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)

# GEMM + accumulation (intermediate K-tile, skip writeback):
stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only
```
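
As a rough illustration of how these sequences could be turned into `TilePlan` objects, the sketch below maps each stage type to a topology node. The `COMPONENT_OF` table, the `make_plan` helper, and the default node prefix are assumptions made for this example; the ADR itself does not specify how `generate_plan` assigns components.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Tuple


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5


@dataclass
class Stage:
    stage_type: StageType
    component: str
    params: dict = field(default_factory=dict)


@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: Tuple[Stage, ...]


# Hypothetical mapping: FETCH and STORE share the fetch/store unit,
# DMA_READ and DMA_WRITE share the DMA component.
COMPONENT_OF = {
    StageType.DMA_READ: "pe_dma",
    StageType.FETCH: "pe_fetch_store",
    StageType.GEMM: "pe_gemm",
    StageType.MATH: "pe_math",
    StageType.STORE: "pe_fetch_store",
    StageType.DMA_WRITE: "pe_dma",
}


def make_plan(tile_id, stage_types, pe_prefix="sip0.cube0.pe0"):
    """Build an immutable TilePlan from a sequence of stage types."""
    stages = tuple(
        Stage(st, f"{pe_prefix}.{COMPONENT_OF[st]}") for st in stage_types
    )
    return TilePlan(tile_id, stages)


S = StageType
normal_gemm = make_plan(0, (S.DMA_READ, S.FETCH, S.GEMM, S.STORE, S.DMA_WRITE))
k_tile = make_plan(1, (S.DMA_READ, S.FETCH, S.GEMM, S.STORE))  # no writeback
```

Because the variants differ only in the tuple of stage types, adding a new sequence in this sketch does not require touching any component code.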

**Components do not hardcode the next component.**
They read the next stage from the token's plan and forward it directly via out_port.
This is the same pattern as a network packet carrying a routing header.

#### Pipeline Context

```python
@dataclass
class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: simpy.Event = None  # succeeds when all tiles are complete

    def complete_tile(self) -> None:
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()
```

**Completion follows an exactly-once contract**: the last stage of each tile must call
`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
succeed only once (SimPy Event constraint).
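
One way the exactly-once contract could be enforced during development is to track completed tile IDs and fail loudly on a duplicate. The sketch below is an assumption, not the ADR's design: `GuardedPipelineContext` changes the signature to take a `tile_id`, and `FakeEvent` is a stand-in that mimics only the single-trigger behaviour of a SimPy event so the example runs without SimPy.

```python
from dataclasses import dataclass, field


class FakeEvent:
    """Stand-in for simpy.Event: succeed() may be called only once."""

    def __init__(self):
        self.triggered = False

    def succeed(self):
        if self.triggered:
            raise RuntimeError("event has already been triggered")
        self.triggered = True


@dataclass
class GuardedPipelineContext:
    """Hypothetical variant of PipelineContext that detects duplicates."""

    id: str
    total_tiles: int
    done_event: FakeEvent
    _done_tiles: set = field(default_factory=set)

    def complete_tile(self, tile_id: int) -> None:
        # A tile that reports completion twice violates the contract.
        if tile_id in self._done_tiles:
            raise RuntimeError(
                f"tile {tile_id} completed twice in pipeline {self.id}"
            )
        self._done_tiles.add(tile_id)
        # Fire the completion event exactly when the last tile arrives.
        if len(self._done_tiles) == self.total_tiles:
            self.done_event.succeed()


ctx = GuardedPipelineContext(id="p0", total_tiles=2, done_event=FakeEvent())
ctx.complete_tile(0)
ctx.complete_tile(1)  # last tile: done_event fires here
```

Tracking IDs rather than a bare counter means a duplicate call is reported at its source instead of surfacing later as a double `succeed()`.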

#### Scheduler Role (Reduced)

When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext,
enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately.

Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
This feeder consumes `_pending_feeds` in FIFO order and
**does not allow tile feed interleaving across composite commands.**
That is, the feed for the next command begins only after all tiles of the current
command have been injected into the first stage queue.

There is **exactly one `_feed_loop`** per scheduler, and
tile feed for composite commands is performed exclusively through this single process.
Command issue order refers to **the order in which PE_SCHEDULER receives PeInternalTxn**.

This structure maintains command issue order while ensuring that when the first stage
queue is full, only the feeder process blocks; the scheduler worker's inbox processing
itself does not stall.

```python
class PeSchedulerV2(PeEngineBase):
    _pipelines: dict[str, PipelineContext]
    _pending_feeds: simpy.Store  # FIFO of (plan, ctx)

    def start(self, env):
        super().start(env)
        self._pending_feeds = simpy.Store(env)
        env.process(self._feed_loop(env))

    def _dispatch_composite(self, env, pe_txn, cmd):
        plan = generate_plan(cmd)
        ctx = PipelineContext(
            id=next_id(),
            total_tiles=len(plan.tiles),
            done_event=pe_txn.done,
        )
        self._pipelines[ctx.id] = ctx

        # only enqueue to feeder queue and return immediately
        yield self._pending_feeds.put((plan, ctx))

    def _feed_loop(self, env):
        """Single feeder process: feeds composite commands in FIFO order.

        Tile feed interleaving across composite commands is not allowed.
        The feed for the next command begins only after all tiles of the
        current command have been injected into the first stage queue.

        When the first stage queue is full, only this feeder blocks;
        the scheduler worker's inbox processing does not stall.
        """
        while True:
            plan, ctx = yield self._pending_feeds.get()
            for tile in plan.tiles:
                token = TileToken(
                    tile_id=tile.tile_id,
                    pipeline_ctx=ctx,
                    plan=tile,
                    stage_idx=0,
                    params=tile.stages[0].params,
                )
                yield self.out_ports[tile.stages[0].component].put(token)
                # queue capacity = HW queue depth → feeder blocks only when full
```

In this ADR, the scheduler can accept multiple composite commands,
but tile submission order follows per-command FIFO.
Within a command, tile-level pipeline overlap is allowed,
but tile feed interleaving across commands is not.
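
The two properties above (commands are accepted without stalling, while tiles are injected strictly per-command FIFO) can be checked with a small tick-based model. This is a simplified sketch that does not use SimPy: the `simulate` function, the one-tile-per-tick feeder, and the fixed drain rate are all assumptions chosen to keep the example short.

```python
from collections import deque


def simulate(arrivals, queue_depth, drain_every, ticks):
    """arrivals maps tick -> (cmd_name, n_tiles)."""
    pending = deque()  # stands in for _pending_feeds
    feeding = deque()  # tiles of the command currently being fed
    stage_q = deque()  # bounded first-stage queue
    accepted, injected = [], []
    for tick in range(ticks):
        # Inbox processing: always accepts, never waits on the stage queue.
        if tick in arrivals:
            cmd, n_tiles = arrivals[tick]
            pending.append((cmd, n_tiles))
            accepted.append((tick, cmd))
        # Downstream consumes one queued tile every `drain_every` ticks.
        if stage_q and tick % drain_every == 0:
            stage_q.popleft()
        # Feeder: picks the next command only when the current one is fully fed.
        if not feeding and pending:
            cmd, n_tiles = pending.popleft()
            feeding.extend(f"{cmd}:t{i}" for i in range(n_tiles))
        # Feeder blocks (does nothing this tick) while the queue is full.
        if feeding and len(stage_q) < queue_depth:
            tile = feeding.popleft()
            stage_q.append(tile)
            injected.append(tile)
    return accepted, injected


accepted, injected = simulate(
    {0: ("cmd1", 3), 1: ("cmd2", 2)}, queue_depth=1, drain_every=2, ticks=12
)
```

In this run cmd2 is accepted at tick 1 while the feeder is still blocked on cmd1, yet its tiles are injected only after `cmd1:t2`.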

### D3. Data Transfer vs. Completion Signal: HW Modeling Criteria

| Communication Type | Method | HW Correspondence |
|-------------------|--------|-------------------|
| Tile token (work directive) | message via out_port | enqueue to command queue |
| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |

**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.

**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
is applied to intra-PE stage triggers. Chaining between components corresponds to
PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost
is incurred.

**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
When all tiles are complete, PipelineContext calls done_event.succeed().

### D4. Asynchronous Pipeline: Natural Overlap

The scheduler processes CompositeCmds **asynchronously**.
However, tile feed does not spawn an independent process per command; instead,
the scheduler's internal **single feeder process** performs the feed in FIFO order.
Therefore, the scheduler can continue to receive the next command,
but the first-stage tile injection order is guaranteed per command.

Since **SimPy Store capacity = HW queue depth**:
- When the queue is full, put() naturally blocks (backpressure)
- While DMA is processing one tile, downstream stages (fetch, GEMM) can already work on a tile whose DMA has completed
- When a second CompositeCmd arrives, it is immediately enqueued to `_pending_feeds`; its tiles enter the DMA queue once the feed for the first command completes

```
First-stage feed order (feeder → DMA queue):
  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
                                            ↑ cmd2 starts after cmd1 feed completes

Runtime pipeline (downstream overlap):
  PE_DMA:   [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
  PE_FETCH:          [cmd1:t0][cmd1:t1]...
  PE_GEMM:                    [cmd1:t0][cmd1:t1]...
            ↑ pipeline overlap within the same command
```

Here, the overlap does not come from tile feed interleaving across different commands,
but occurs naturally as tiles from earlier commands progress to downstream stages
while the feeder continues injecting subsequent tiles.

For example, tile feed for cmd2 does not start until all tiles of cmd1 have been
injected into the first stage queue. However, while cmd1.tile0 has already progressed
to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so
**pipeline overlap within the same command occurs naturally**.
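
The benefit of this overlap can be estimated with the usual in-order pipeline recurrence: a tile starts a stage once it has left the previous stage and the stage has finished the previous tile. The sketch below assumes one tile per stage at a time, unbounded queues, and zero chaining latency (in line with D3); the function name and the latency values are illustrative only.

```python
def pipeline_finish_times(stage_latencies_ns, n_tiles):
    """Return finish[t][s]: time at which tile t leaves stage s."""
    finish = []
    for t in range(n_tiles):
        row = []
        for s, latency in enumerate(stage_latencies_ns):
            # Time the tile becomes available from the previous stage.
            ready = row[s - 1] if s > 0 else 0.0
            # Time this stage finished the previous tile.
            free = finish[t - 1][s] if t > 0 else 0.0
            row.append(max(ready, free) + latency)
        finish.append(row)
    return finish


# Illustrative latencies for DMA, fetch and GEMM (not measured values).
latencies = [100.0, 5.0, 20.0]
finish = pipeline_finish_times(latencies, n_tiles=3)
pipelined_total = finish[-1][-1]
serial_total = 3 * sum(latencies)
```

With these example numbers the last tile finishes at 325 ns, compared with 375 ns if the three tiles ran strictly one after another.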

#### Component Chaining Pattern

All components follow the same pattern:

```python
def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()

        # process own stage
        yield from self._process(env, token)

        # chain to next stage (read from plan)
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            # last stage: pipeline completion
            token.pipeline_ctx.complete_tile()
```
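
Stripped of SimPy, the routing decision in this worker reduces to a pure function of the token. The following sketch walks one token through its plan synchronously to show which components it would visit; `MiniStage`, `MiniToken`, `advance`, and `walk` are hypothetical names introduced for the example and omit timing and queues entirely.

```python
from dataclasses import dataclass, field


@dataclass
class MiniStage:
    component: str
    params: dict = field(default_factory=dict)


@dataclass
class MiniToken:
    tile_id: int
    stages: tuple
    stage_idx: int = 0
    params: dict = field(default_factory=dict)


def advance(token):
    """Move the token to its next stage; return the target or None if done."""
    next_idx = token.stage_idx + 1
    if next_idx < len(token.stages):
        token.stage_idx = next_idx
        token.params = token.stages[next_idx].params
        return token.stages[next_idx].component
    return None


def walk(token, on_complete):
    """Follow the plan from the current stage and report completion once."""
    visited = [token.stages[token.stage_idx].component]
    while (target := advance(token)) is not None:
        visited.append(target)
    on_complete(token.tile_id)
    return visited


completions = []
token = MiniToken(
    tile_id=7,
    stages=tuple(
        MiniStage(name)
        for name in ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_fetch_store", "pe_dma")
    ),
)
visited = walk(token, completions.append)
```

No component name appears in `advance` itself; the visiting order comes entirely from the plan carried by the token.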

### D5. PE_FETCH_STORE: Dedicated TCM ↔ Register File Transfer

Previously, GemmBlock and MathBlock each implemented their own TCM read/write.
This is separated into a **PE_FETCH_STORE component**.

```python
# PE_FETCH_STORE._process()
def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done
    # chaining is handled by the base class (D4 pattern)
```

Advantages:
- GEMM/MATH perform **pure compute only** (no TCM access logic)
- Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
- Prefetch strategies can be experimented with by replacing the fetch unit alone

### D6. Simplification of Each Compute Component

GEMM/MATH perform compute only with register data already prepared.
**Chaining follows the common pattern (D4), so only _process() needs to be implemented:**

```python
# PE_GEMM._process()
def _process(self, env, token):
    yield env.timeout(self._mac_latency(token.params))

# PE_MATH._process()
def _process(self, env, token):
    yield env.timeout(self._simd_latency(token.params))

# PE_FETCH_STORE._process()
def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done

# PE_DMA._process()
def _process(self, env, token):
    yield from self._do_fabric_dma(token.params)
```

By replacing only the timing model, one can freely switch between cycle-accurate
and analytical models. Since the chaining logic resides in the base class,
each component only implements its pure stage logic.

### D7. Topology Changes

Add PE_FETCH_STORE to the PE template:

```yaml
pe_template:
  components:
    pe_cpu:         { kind: pe_cpu,         impl: pe_cpu_v1, ... }
    pe_scheduler:   { kind: pe_scheduler,   impl: pe_scheduler_v2, ... }
    pe_dma:         { kind: pe_dma,         impl: pe_dma_v1, ... }
    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
    pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1, ... }
    pe_math:        { kind: pe_math,        impl: pe_math_v1, ... }
    pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1, ... }
    pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1, ... }
  links:
    # existing links...
    fetch_store_to_tcm_bw_gbs: 512.0
    fetch_store_to_tcm_mm: 0.0
```
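
To give a feel for what the `fetch_store_to_tcm_bw_gbs` value implies, the sketch below converts a transfer size into a duration and accumulates back-to-back requests through one serialized link. It assumes the suffix `_bw_gbs` means GB/s with 1 GB = 10^9 bytes, so that 1 GB/s equals 1 byte per ns; the function names and request sizes are made up for the example and are not the pe_tcm implementation.

```python
def transfer_ns(num_bytes, bw_gbs):
    """Duration of one transfer; 1 GB/s is 1 byte per ns (10^9-byte GB assumed)."""
    return num_bytes / bw_gbs


def serialized_finish_ns(request_bytes, bw_gbs):
    """Finish times when requests share one link and are served in order."""
    now = 0.0
    finish = []
    for size in request_bytes:
        now += transfer_ns(size, bw_gbs)
        finish.append(now)
    return finish


single = transfer_ns(2048, 512.0)
back_to_back = serialized_finish_ns([2048, 1024], 512.0)
```

At 512 GB/s a 2048-byte tile takes 4 ns, which is consistent with the "a few ns" figure given for the Fetch/Store Unit in the Context section.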

PE internal edge connections:
```
PE_SCHEDULER   → PE_DMA           (initial dispatch)
PE_SCHEDULER   → PE_FETCH_STORE   (initial dispatch)
PE_SCHEDULER   → PE_GEMM          (initial dispatch)
PE_SCHEDULER   → PE_MATH          (initial dispatch)
PE_DMA         → PE_FETCH_STORE   (chaining)
PE_FETCH_STORE → PE_GEMM          (chaining)
PE_FETCH_STORE → PE_MATH          (chaining)
PE_GEMM        → PE_FETCH_STORE   (store chaining)
PE_MATH        → PE_FETCH_STORE   (store chaining)
PE_FETCH_STORE → PE_DMA           (writeback chaining)
PE_FETCH_STORE → PE_TCM           (BW request)
```

Topology edges encompass both **control/dispatch visibility + runtime chaining**.
Scheduler → sub-component edges are initial dispatch paths, while
inter-component edges are runtime chaining paths driven by token self-routing.
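
Since a plan can only be executed if every hop it takes exists as a topology edge, a plan could be validated against the edge list before dispatch. The check below is a sketch under that assumption: `EDGES` transcribes the list above, while `missing_edges` and the example chains are hypothetical.

```python
EDGES = {
    ("PE_SCHEDULER", "PE_DMA"),
    ("PE_SCHEDULER", "PE_FETCH_STORE"),
    ("PE_SCHEDULER", "PE_GEMM"),
    ("PE_SCHEDULER", "PE_MATH"),
    ("PE_DMA", "PE_FETCH_STORE"),
    ("PE_FETCH_STORE", "PE_GEMM"),
    ("PE_FETCH_STORE", "PE_MATH"),
    ("PE_GEMM", "PE_FETCH_STORE"),
    ("PE_MATH", "PE_FETCH_STORE"),
    ("PE_FETCH_STORE", "PE_DMA"),
    ("PE_FETCH_STORE", "PE_TCM"),
}


def missing_edges(chain):
    """Return the hops of a plan (including initial dispatch) with no edge."""
    hops = [("PE_SCHEDULER", chain[0])] + list(zip(chain, chain[1:]))
    return [hop for hop in hops if hop not in EDGES]


# Normal GEMM: DMA_READ, FETCH, GEMM, STORE, DMA_WRITE
normal_gemm = ["PE_DMA", "PE_FETCH_STORE", "PE_GEMM", "PE_FETCH_STORE", "PE_DMA"]
# A fused chain that hands GEMM output directly to MATH
fused = ["PE_DMA", "PE_FETCH_STORE", "PE_GEMM", "PE_MATH", "PE_FETCH_STORE", "PE_DMA"]

gemm_gaps = missing_edges(normal_gemm)
fused_gaps = missing_edges(fused)
```

Under this check the four sequences from D2 are fully covered, whereas a plan that chains GEMM directly into MATH would need an additional PE_GEMM → PE_MATH edge.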

### D8. Existing Code Migration: Builtin Integration

The existing builtin v1 components and pe_accel are **replaced with new builtin components**.

#### Migration Strategy

1. Back up existing `components/builtin/` → `components/builtin_legacy/` (preserved without modification)
2. Back up existing `components/custom/pe_accel/` → likewise
3. Re-implement new `components/builtin/` with the ADR-0021 architecture
4. Maintain **only one** topology.yaml (including pe_fetch_store)
5. components.yaml points to the new builtin

```yaml
# components.yaml: new builtin
pe_scheduler_v1: kernbench.components.builtin.pe_scheduler:PeSchedulerComponent
pe_gemm_v1: kernbench.components.builtin.pe_gemm:PeGemmComponent
pe_math_v1: kernbench.components.builtin.pe_math:PeMathComponent
pe_dma_v1: kernbench.components.builtin.pe_dma:PeDmaComponent
pe_fetch_store_v1: kernbench.components.builtin.pe_fetch_store:PeFetchStoreComponent
pe_tcm_v1: kernbench.components.builtin.pe_tcm:PeTcmComponent
```

The impl names (pe_gemm_v1, etc.) are preserved, but **the implementations are replaced
with the ADR-0021 architecture**. Existing benchmarks and tests referencing topology.yaml
continue to work without changes.

#### Latency Model Inheritance

The latency modeling of the new builtin components (MAC cycle calculation, SIMD latency,
TCM BW serialization, DMA fabric latency, etc.) is **based on the current pe_accel
implementation**. The tile schedule generation logic from tiling.py is also carried over.
Only the architecture (component separation, self-routing) changes; timing accuracy
is preserved.

#### Test Plan

**1. Existing test pass** (regression):
After migration is complete, all existing tests (366) must pass.

**2. Latency regression**:
Verify that the new builtin produces identical latency for the same inputs as pe_accel.

**3. Phase 1 → Phase 2 end-to-end**:
Integration test from SimPy simulation (Phase 1) op_log generation → DataExecutor
(Phase 2) actual numpy computation → result correctness verification.
- GEMM: tl.composite(gemm) → op_log → Phase 2 matmul → allclose verification
- MATH: tl.exp / tl.add, etc. → op_log → Phase 2 numpy op → allclose verification
- Chaining: GEMM output → MATH input → final result end-to-end verification

**4. TileToken self-routing**:
- Verify that tiles chain according to the plan's stage sequence
- Verify PipelineContext.complete_tile() exactly-once at the last stage
- Queue backpressure: verify that only the feeder blocks when DMA queue capacity is exceeded

**5. Asynchronous pipeline overlap**:
- Verify that inter-tile stage overlap occurs within the same command (tile0 in GEMM while tile1 in DMA)
- Multiple commands: verify that cmd2 feed starts after cmd1 feed completes (FIFO order)

### D9. TileToken Message Definition

A message used for passing tile work between components.
The token carries the plan and stage index, enabling self-routing.

```python
@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext  # completion tracking
    plan: TilePlan                 # full stage sequence for this tile (immutable)
    stage_idx: int                 # current stage index in plan.stages
    params: dict                   # current stage parameter cache (canonical: plan.stages[stage_idx].params)
    data_op: bool = True           # op_log recording target (ADR-0020)
```

A TileToken is **owned by exactly one component at a time** and
is never referenced by multiple components simultaneously (single-owner).

Token lifecycle:
1. Scheduler creates it with stage_idx=0 and puts it to the first stage component
2. The component executes _process(), increments stage_idx, and puts it to the next component
3. The last stage component calls pipeline_ctx.complete_tile()
4. When all tiles are complete, PipelineContext calls done_event.succeed()

Relationship with existing PeInternalTxn:
- PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged)
- TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)

---

## Non-goals

- **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is not modified
  (PeInternalTxn-based, ADR-0014 maintained)
- **Resource contention model across multiple pipelines**: the current scope focuses on
  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines
  are future work.
- **builtin_legacy maintenance**: kept for backup purposes only; not a target for
  bug fixes or feature additions.

## Open Questions

- **Register File capacity model**: whether to model capacity limits when the fetch unit
  loads into registers. Capacity is expressed in bytes (register_file_bytes), and
  the number of tiles that can be held simultaneously is determined by tile size.
  When capacity is exceeded, fetch stalls, creating natural backpressure.
- **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite
  commands. Therefore, overlap arises not from pre-injection across commands, but
  naturally from pipeline progression of tiles within the same command.
  If additional prefetch is needed, it should be considered at the level of tile ordering
  within the same command or fetch/store unit policy, not cross-command injection.
- **PE_DMA coalescing**: per-tile DMA may cause fragmentation.
  Direction is to merge/coalesce within DMA without scheduler involvement.
- **Synchronous execution mode**: this ADR adopts asynchronous pipeline as the
  default/sole execution model. If a sync mode is needed for debug or validation
  purposes, it will be considered in a future ADR.
- **TCM bank conflict across multiple pipelines**: currently based on a single pipeline.
  Bank conflict modeling when multiple pipelines simultaneously access TCM is future work.

---

## Consequences

### Positive

- Each block is an independent component and individually replaceable (ADR-0015 compliant)
- PE internal structure is visible in the topology
- Components do not know the next component; plan-based routing provides flexibility
- Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
- Improved HW modeling accuracy (done signal = Event, data transfer = message)
- Fetch/store separation enables accurate TCM BW contention modeling

### Negative

- Increased number of PE internal components (5 → 6), which means more topology nodes/edges
- Component separation makes intra-PE token forwarding more explicit than before
- Breaking change from existing builtin/pe_accel; migration required

---

## Affected Files

| File | Change |
|------|--------|
| `topology.yaml` | Add pe_fetch_store component, add chaining edges |
| `components.yaml` | Register new builtin components |
| `src/kernbench/topology/builder.py` | Add fetch_store + chaining edges to PE internal edges |
| `src/kernbench/common/pe_commands.py` | Add TileToken definition |
| `src/kernbench/components/builtin/pe_scheduler.py` | Re-implement (feeder + plan-based dispatch) |
| `src/kernbench/components/builtin/pe_gemm.py` | Re-implement (TileToken, _process pattern) |
| `src/kernbench/components/builtin/pe_math.py` | Re-implement (TileToken, _process pattern) |
| `src/kernbench/components/builtin/pe_dma.py` | Re-implement (TileToken, _process pattern) |
| `src/kernbench/components/builtin/pe_fetch_store.py` | New |
| `src/kernbench/components/builtin/pe_tcm.py` | Re-implement (TcmRequest service) |
| `src/kernbench/components/builtin/types.py` | New: TilePlan, Stage, StageType, PipelineContext, TileToken |
| `src/kernbench/components/builtin/tiling.py` | Ported from pe_accel: plan generation logic |

Backup:

| File | Change |
|------|--------|
| `src/kernbench/components/builtin_legacy/` | Full backup of existing builtin (preserved without modification) |
| `src/kernbench/components/custom/pe_accel/` | Backup of existing pe_accel (preserved without modification) |