ADR: introduce docs/history/, merge 0011+0018, prune migration cruft

- CLAUDE.md: add ADR Lifecycle subsection (superseded → docs/history/,
  immutable numbering, no renumber)
- ADR-0011: merge ADR-0018 content as "Address Model: LA" section
  alongside PA / VA; status notes VA model is currently implemented
- ADR-0018 / 0029 / 0031: moved to docs/history/ with status updates
  (0018 merged into 0011, 0029 superseded by 0032, 0031 absorbed
  into 0001 rev 2)
- ADR-0019: rewrite Context as PE-HBM connectivity decision
  (self-contained, no LA model framing)
- ADR-0019/0020/0021/0023/0025/0027: Status Proposed → Accepted
  (code verified) and prune Implementation Notes / Affected files /
  Test strategy / "현재 상태" (Current State) sub-sections describing pre-impl state
- ADR-0024/0026: same migration-flavor cleanup; 0026 also drops D6
  Migration and D8 docs-update sub-decisions
- ADR-0030: status simplified (blocker ADR-0031 now superseded)
- SPEC.md: R10 + §0.2 reflect PA / VA / LA model names
- ADR-0008/0012/0013: refresh ADR-0011 subtitle in Links

21 files changed, 553 insertions(+), 1290 deletions(-).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 11:42:45 -07:00
parent ecc57d050d
commit 22fd0d2b9d
23 changed files with 553 additions and 1290 deletions
+1 -106
@@ -2,30 +2,10 @@
## Status
Proposed
Accepted
## Context
### Problems with the Current Structure
pe_accel (SchedulerV2Component) hides 5 hardware blocks (DmaIn, DmaWb, Gemm, Math, Tcm)
**inside a single component**.
```
SchedulerV2Component (single topology node)
├── DmaInBlock ← directly connected via internal SimPy Store
├── DmaWbBlock ← not visible in topology
├── GemmBlock ← not replaceable
├── MathBlock ← not replaceable
└── TcmBlock ← not replaceable
```
Problems:
- Blocks directly reference the next block via `desc.next_block` — hardcoded routing
- Individual blocks cannot be replaced (violates ADR-0015 component replacement principle)
- PE internal structure is not visible in the topology
- GemmBlock and MathBlock each duplicate TCM load/store logic
### Actual Hardware Structure
```
@@ -374,66 +354,6 @@ Topology edges encompass both **control/dispatch visibility + runtime chaining**
Scheduler → sub-component edges are initial dispatch paths, while
inter-component edges are runtime chaining paths driven by token self-routing.
### D8. Existing Code Migration — Builtin Integration
The existing builtin v1 components and pe_accel are **replaced with new builtin components**.
#### Migration Strategy
1. Back up existing `components/builtin/` → `components/builtin_legacy/` (preserved without modification)
2. Back up existing `components/custom/pe_accel/` → likewise
3. Re-implement new `components/builtin/` with the ADR-0021 architecture
4. Maintain **only one** topology.yaml (including pe_fetch_store)
5. components.yaml points to the new builtin
```yaml
# components.yaml — new builtin
pe_scheduler_v1: kernbench.components.builtin.pe_scheduler:PeSchedulerComponent
pe_gemm_v1: kernbench.components.builtin.pe_gemm:PeGemmComponent
pe_math_v1: kernbench.components.builtin.pe_math:PeMathComponent
pe_dma_v1: kernbench.components.builtin.pe_dma:PeDmaComponent
pe_fetch_store_v1: kernbench.components.builtin.pe_fetch_store:PeFetchStoreComponent
pe_tcm_v1: kernbench.components.builtin.pe_tcm:PeTcmComponent
```
The impl names (pe_gemm_v1, etc.) are preserved, but **the implementations are replaced
with the ADR-0021 architecture**. Existing benchmarks and tests referencing topology.yaml
continue to work without changes.
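To illustrate how an impl name can stay stable while the class behind it changes, here is a minimal sketch of resolving a `module.path:ClassName` string into a class. It assumes components.yaml values follow that import-spec convention; `resolve_impl` and the `registry` dict are hypothetical names, not the kernbench loader, and a standard-library class stands in for a real component so the snippet runs on its own.

```python
import importlib


def resolve_impl(spec: str):
    """Resolve a 'module.path:ClassName' string to the named attribute.

    Hypothetical helper: the real kernbench loader is not shown in this ADR.
    """
    module_path, sep, attr = spec.partition(":")
    if not sep or not module_path or not attr:
        raise ValueError(f"expected 'module:Class', got {spec!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr)


# Stand-in registry: the key plays the role of an impl name such as
# pe_gemm_v1, the value the role of its import spec. Only the value
# changes when an implementation is swapped out.
registry = {"ordered_dict_v1": "collections:OrderedDict"}
resolved = {name: resolve_impl(spec) for name, spec in registry.items()}
```

Because callers look components up by impl name only, repointing the spec on the right-hand side is invisible to topology.yaml consumers.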
#### Latency Model Inheritance
The latency modeling of the new builtin components (MAC cycle calculation, SIMD latency,
TCM BW serialization, DMA fabric latency, etc.) is **based on the current pe_accel
implementation**. The tile schedule generation logic from tiling.py is also carried over.
Only the architecture (component separation, self-routing) changes; timing accuracy
is preserved.
#### Test Strategy
#### Test Plan
**1. Existing tests pass** (regression):
After migration is complete, all existing tests (366) must pass.
**2. Latency regression**:
Verify that the new builtin produces identical latency for the same inputs as pe_accel.
**3. Phase 1 → Phase 2 end-to-end**:
Integration test from SimPy simulation (Phase 1) op_log generation → DataExecutor
(Phase 2) actual numpy computation → result correctness verification.
- GEMM: tl.composite(gemm) → op_log → Phase 2 matmul → allclose verification
- MATH: tl.exp / tl.add, etc. → op_log → Phase 2 numpy op → allclose verification
- Chaining: GEMM output → MATH input → final result end-to-end verification
**4. TileToken self-routing**:
- Verify that tiles chain according to the plan's stage sequence
- Verify PipelineContext.complete_tile() exactly-once at the last stage
- Queue backpressure: verify that only the feeder blocks when DMA queue capacity is exceeded
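The first two checks above can be sketched without SimPy. The classes below are illustrative stand-ins, not the ADR's TileToken or PipelineContext definitions: the fields (`stages`, `stage_idx`, `trace`) and the `route` helper are assumptions made for the example, and queues and backpressure are left out.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineContextSketch:
    """Stand-in context that enforces exactly-once tile completion."""
    total_tiles: int
    completed: set = field(default_factory=set)

    def complete_tile(self, tile_id: int) -> None:
        if tile_id in self.completed:
            raise RuntimeError(f"tile {tile_id} completed twice")
        self.completed.add(tile_id)

    @property
    def done(self) -> bool:
        return len(self.completed) == self.total_tiles


@dataclass
class TileTokenSketch:
    """Stand-in token carrying its own stage sequence (self-routing)."""
    tile_id: int
    stages: tuple
    ctx: PipelineContextSketch
    stage_idx: int = 0
    trace: list = field(default_factory=list)


def route(token: TileTokenSketch) -> list:
    """Advance the token through its plan, then complete it once."""
    while token.stage_idx < len(token.stages):
        token.trace.append(token.stages[token.stage_idx])  # visit stage
        token.stage_idx += 1                               # pick next hop
    token.ctx.complete_tile(token.tile_id)  # only after the last stage
    return token.trace


plan = ("dma_in", "gemm", "math", "dma_wb")  # assumed stage names
ctx = PipelineContextSketch(total_tiles=2)
traces = [route(TileTokenSketch(i, plan, ctx)) for i in range(2)]
```

A test in this style would assert that every trace equals the plan's stage sequence and that a second `complete_tile()` call for the same tile raises.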
**5. Asynchronous pipeline overlap**:
- Verify that inter-tile stage overlap occurs within the same command (tile0 in GEMM while tile1 in DMA)
- Multiple commands: verify that cmd2 feed starts after cmd1 feed completes (FIFO order)
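The overlap condition can be stated as an interval check on a computed schedule. The sketch below uses a simple in-order pipeline recurrence (a tile starts a stage when both the tile and the stage are free) with made-up latencies and unbounded queues; it is not the kernbench timing model, and `pipeline_schedule` and `overlaps` are hypothetical helpers.

```python
def pipeline_schedule(n_tiles: int, stage_latency: dict) -> dict:
    """Return {stage: [(start, end), ...]} with one tile per stage at a time."""
    stages = list(stage_latency)
    sched = {s: [] for s in stages}
    for _ in range(n_tiles):
        ready = 0  # time this tile leaves its previous stage
        for s in stages:
            free = sched[s][-1][1] if sched[s] else 0  # stage idle time
            start = max(ready, free)
            end = start + stage_latency[s]
            sched[s].append((start, end))
            ready = end
    return sched


def overlaps(a: tuple, b: tuple) -> bool:
    """True if half-open intervals [a0, a1) and [b0, b1) intersect."""
    return a[0] < b[1] and b[0] < a[1]


# Illustrative latencies in arbitrary cycles (assumed values).
sched = pipeline_schedule(2, {"dma": 4, "gemm": 6})
tile0_in_gemm = sched["gemm"][0]
tile1_in_dma = sched["dma"][1]
has_overlap = overlaps(tile0_in_gemm, tile1_in_dma)
```

With these numbers tile0 occupies GEMM during cycles 4 to 10 while tile1 occupies DMA during cycles 4 to 8, so `has_overlap` is true; a serialized implementation would fail the same check.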
### D9. TileToken Message Definition
A message used for passing tile work between components.
@@ -472,8 +392,6 @@ Relationship with existing PeInternalTxn:
- **Resource contention model across multiple pipelines**: the current scope focuses on
accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines
are future work.
- **builtin_legacy maintenance**: kept for backup purposes only; not a target for
bug fixes or feature additions.
## Open Questions
@@ -511,27 +429,4 @@ Relationship with existing PeInternalTxn:
- Increased number of PE internal components (5 → 6) — more topology nodes/edges
- Component separation makes intra-PE token forwarding more explicit than before
- Breaking change from existing builtin/pe_accel — migration required
---
## Affected Files
| File | Change |
|------|--------|
| `topology.yaml` | Add pe_fetch_store component, add chaining edges |
| `components.yaml` | Register new builtin components |
| `src/kernbench/topology/builder.py` | Add fetch_store + chaining edges to PE internal edges |
| `src/kernbench/common/pe_commands.py` | Add TileToken definition |
| `src/kernbench/components/builtin/pe_scheduler.py` | Re-implement (feeder + plan-based dispatch) |
| `src/kernbench/components/builtin/pe_gemm.py` | Re-implement (TileToken, _process pattern) |
| `src/kernbench/components/builtin/pe_math.py` | Re-implement (TileToken, _process pattern) |
| `src/kernbench/components/builtin/pe_dma.py` | Re-implement (TileToken, _process pattern) |
| `src/kernbench/components/builtin/pe_fetch_store.py` | New |
| `src/kernbench/components/builtin/pe_tcm.py` | Re-implement (TcmRequest service) |
| `src/kernbench/components/builtin/types.py` | New: TilePlan, Stage, StageType, PipelineContext, TileToken |
| `src/kernbench/components/builtin/tiling.py` | Ported from pe_accel: plan generation logic |
Backup:
| `src/kernbench/components/builtin_legacy/` | Full backup of existing builtin (preserved without modification) |
| `src/kernbench/components/custom/pe_accel/` | Backup of existing pe_accel (preserved without modification) |