ADR: introduce docs/history/, merge 0011+0018, prune migration cruft

- CLAUDE.md: add ADR Lifecycle subsection (superseded → docs/history/, immutable numbering, no renumber)
- ADR-0011: merge ADR-0018 content as "Address Model: LA" section alongside PA / VA; status notes VA model is currently implemented
- ADR-0018 / 0029 / 0031: moved to docs/history/ with status updates (0018 merged into 0011, 0029 superseded by 0032, 0031 absorbed into 0001 rev 2)
- ADR-0019: rewrite Context as PE-HBM connectivity decision (self-contained, no LA model framing)
- ADR-0019/0020/0021/0023/0025/0027: Status Proposed → Accepted (code verified) and prune Implementation Notes / Affected files / Test strategy / "현재 상태" ("Current state") sub-sections describing pre-impl state
- ADR-0024/0026: same migration-flavor cleanup; 0026 also drops D6 Migration and D8 docs-update sub-decisions
- ADR-0030: status simplified (blocker ADR-0031 now superseded)
- SPEC.md: R10 + §0.2 reflect PA / VA / LA model names
- ADR-0008/0012/0013: refresh ADR-0011 subtitle in Links

21 files changed, 553 insertions(+), 1290 deletions(-).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 11:42:45 -07:00
parent ecc57d050d
commit 22fd0d2b9d
23 changed files with 553 additions and 1290 deletions
@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Accepted

 ## Context

@@ -19,17 +19,6 @@ queues. Host-level collectives (`dist.all_reduce`) are deferred to
 **future work**; this ADR focuses solely on the kernel-side collective
 infrastructure.

-### Current state
-
- ADR-0021 PE pipeline refactor: each PE is decomposed into components
-  (PE_CPU, PE_SCHEDULER, PE_DMA, PE_FETCH_STORE, PE_GEMM, PE_MATH,
-  PE_TCM, PE_MMU).
- No direct PE-to-PE channel exists today. All data movement goes
-  through PE_DMA → cube_noc / UCIe / PCIE → HBM.
- A pre-ADR host CCL skeleton exists (`dist.init_process_group(backend="ahbm")`,
-  `_run_ccl_bench` running per-rank greenlets concurrently). The
-  collective itself is a stub.
-
 ### Problems to solve

 1. PE-to-PE direct data movement (writing into a peer's memory).
@@ -891,30 +880,3 @@ fairness from `tl.recv()` round-robin, confusing
 - VC arbitration is a first-order approximation; heavy contention
  scenarios may report slightly optimistic latency vs real HW (D8).
 - Chunk-level interleave makes PE_DMA implementation more complex.
-
---
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `topology.yaml` | Add `pe_ipcq` to `pe_template`, plus the IPCQ ↔ DMA / CPU / TCM edges. |
-| `components.yaml` | Register `pe_ipcq_v1`. |
-| `src/kernbench/topology/builder.py` | Wire the IPCQ chain into PE-internal edges. |
-| `src/kernbench/components/builtin/pe_ipcq.py` | New. |
-| `src/kernbench/components/builtin/pe_dma.py` | Add VCs, handle `IpcqDmaToken`. |
-| `src/kernbench/common/pe_commands.py` | `IpcqSendCmd`, `IpcqRecvCmd`, `IpcqDmaToken`. |
-| `src/kernbench/triton_emu/tl_context.py` | `tl.send` / `tl.recv` API. |
-| `src/kernbench/runtime_api/distributed.py` | Eager IPCQ install in `AhbmCCLBackend.__init__`. |
-| `src/kernbench/runtime_api/kernel.py` | `IpcqInitMsg` definition. |
-| `src/kernbench/ccl/__init__.py` | New CCL package. |
-| `src/kernbench/ccl/topologies.py` | Builtin topology generators + `resolve_topology()`. |
-| `src/kernbench/ccl/helpers.py` | Algorithm-author helpers (`chunked`, `ring_step`, `tree_step`). |
-| `src/kernbench/ccl/testing.py` | Mock CCL runtime (`run_kernel_in_mock`). |
-| `src/kernbench/ccl/algorithms/*.py` | Algorithm modules (kernel + `kernel_args` + optional `neighbors`). |
-| `ccl.yaml` | Algorithm metadata + IPCQ defaults. |
-| `tests/test_pe_ipcq.py` | PE_IPCQ unit tests. |
-| `tests/test_pe_dma_vc.py` | PE_DMA VC tests. |
-| `tests/test_ipcq_e2e.py` | end-to-end send/recv tests. |
-| `tests/test_ccl_topologies.py` | Builtin topology generator tests. |
-| `tests/test_ccl_allreduce_matrix.py` | Unified bench × algorithm matrix. |