adr: add INDEX.md (auto-generated by tools/generate_adr_index.py)
Adds a section-based table of contents for the 46-ADR corpus, mirroring the /report skill's classification (Design Principles / High-level Architecture / Detailed Architecture by component / Implementation Decisions by topic). Generated for both docs/adr/ (EN titles) and docs/adr-ko/ (KO titles) from one tool.

tools/generate_adr_index.py:
- Single CLASSIFICATION dict with one entry per ADR; add an entry when introducing a new ADR. The script fails loudly if any file is missing from the table.
- DETAILED_COMPONENTS lists each builtin component and the ADR(s) that cover it (ADR-0014 appears under six PE engines; ADR-0023 under pe_dma + pe_ipcq).
- Accepts both ":" and "—" title separators (matching ADR-0033's existing format).
- --check mode for CI: exits 1 if INDEX.md is stale.

Also includes the docs/report/architecture-2026-1H.md generated by the prior /report write (the public-facing architecture document; 836 lines, 76 source-attribution comments).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
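
The generator behaviors listed in the commit message can be pictured with a minimal sketch. This is not the contents of `tools/generate_adr_index.py`: the function names (`parse_title`, `check_classified`, `check_exit_code`) and the `# ADR-NNNN <separator> Title` heading shape are assumptions made for illustration only.

```python
import re

# Assumed heading shape: "# ADR-0033 — Latency Model" or "# ADR-0001: Layout".
TITLE_RE = re.compile(r"^#\s*(ADR-\d{4})\s*(?::|—)\s*(.+?)\s*$")


def parse_title(first_line: str) -> tuple[str, str]:
    """Split an ADR heading into (id, title), accepting ':' or '—'."""
    match = TITLE_RE.match(first_line)
    if match is None:
        raise ValueError(f"unrecognized ADR heading: {first_line!r}")
    return match.group(1), match.group(2)


def check_classified(adr_ids: list[str], classification: dict[str, str]) -> None:
    """Fail loudly if any ADR found on disk has no classification entry."""
    missing = sorted(set(adr_ids) - set(classification))
    if missing:
        raise SystemExit(f"unclassified ADRs: {', '.join(missing)}")


def check_exit_code(rendered: str, on_disk: str) -> int:
    """Exit code for a --check style run: 1 when the index is stale."""
    return 0 if rendered == on_disk else 1
```

Only the first separator after the ADR number is consumed, so a colon inside the title itself survives intact.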
@@ -0,0 +1,174 @@
# ADR Index

Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **46**.

Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.

## Design Principles

- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — 검증 전략 및 Phase 1 테스트 계획
- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — 레이턴시 모델: 가정 및 알려진 단순화

## High-level Architecture

- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — 타겟 시스템 계층 및 모델링 범위 _(System hierarchy (Tray / SIP / CUBE / PE))_
- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — 런타임 API 및 시뮬레이션 엔진 경계 _(Runtime API ↔ sim_engine boundaries)_
- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NoC와 메모리 데이터 경로 _(IOChiplet NOC and memory data path)_
- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — 큐브 NoC와 HBM 연결성 _(Cube NOC and HBM connectivity)_

## Detailed Architecture

One subsection per component file under `src/kernbench/components/builtin/`.

### forwarding

- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding 컴포넌트 (forwarding_v1)

### hbm_ctrl

- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM 컨트롤러 내부 설계

### io_cpu

- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU 컴포넌트 모델

### m_cpu

- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU 및 M_CPU.DMA 컴포넌트 모델

### pcie_ep

- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model

### pe_cpu

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델

### pe_dma

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication

### pe_fetch_store

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델

### pe_gemm

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델

### pe_ipcq

- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication

### pe_math

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델

### pe_mmu

- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — 컴포넌트 + 유틸리티 이중 역할

### pe_scheduler

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델

### pe_tcm

- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — 듀얼 채널 BW 직렬화

### sram

- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC

### tiling

- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math 파이프라인 plan 빌더

## Implementation Decisions

### Address Scheme

- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51비트 물리 주소 레이아웃 및 디코딩 계약
- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — 메모리 주소 지정 — PA / VA / LA 주소 모델

### Routing & Helper API

- [ADR-0002](./ADR-0002-lat-routing-distance.md) — 라우팅 거리, 순서 및 우회 규칙
- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`

### Memory Semantics & Local-HBM Bandwidth

- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — 메모리 시맨틱 및 로컬 HBM 대역폭 보장

### Topology Compilation, Diagrams & Builder Algorithms

- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — 다이어그램 뷰 및 거리 기반 레이아웃 규칙
- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — 토폴로지 컴파일, 거리 추출, 그리고 자동 다이어그램 생성
- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms

### Tensor Deployment and Allocation

- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — 텐서 배포 및 할당 (호스트 할당기, PA 우선)

### Kernel Execution and Host-Device Messaging

- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — 커널 실행 메시징 및 완료 시맨틱
- [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU 메시지 스키마 (PA-우선, PE-태깅)

### CLI Surface and Semantics

- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — 명령줄 인터페이스 및 실행 시맨틱

### Component Port/Wire Fabric Model

- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — 컴포넌트 포트/와이어 모델과 패브릭 라우팅

### Two-Pass Data Execution

- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리)

### 2D Grid Program Identity

- [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D 그리드 program_id 시맨틱

### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)

- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — sip/num_sips 필드 제거
- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`

### IPCQ Direction Addressing

- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching

### Intercube All-Reduce

- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — 큐브 간 All-Reduce — pe0 큐브-메시 리듀스 + 다중-SIP 교환

### Evaluation Harnesses

- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce 평가 하니스 — `tests/sccl/`
- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM 평가 하니스 — `scripts/gemm_sweep.py` + `tests/gemm/`

### Bench Module Contract

- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring

### Kernel-side tl.* API (TLContext)

- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract

### Memory Allocator Algorithms

- [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator

### Probe Subcommand

- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness

### Sim-engine Op Log and Memory Store Schemas

- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -0,0 +1,174 @@
# ADR Index

Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **46**.

Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.

## Design Principles

- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — Verification Strategy and Phase 1 Test Plan
- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — Latency Model: Assumptions and Known Simplifications

## High-level Architecture

- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — Target System Hierarchy & Modeling Scope _(System hierarchy (Tray / SIP / CUBE / PE))_
- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — Runtime API and Simulation Engine Boundaries _(Runtime API ↔ sim_engine boundaries)_
- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NOC and Memory Data Path _(IOChiplet NOC and memory data path)_
- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — Cube NOC and HBM Connectivity _(Cube NOC and HBM connectivity)_

## Detailed Architecture

One subsection per component file under `src/kernbench/components/builtin/`.

### forwarding

- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding Component (forwarding_v1)

### hbm_ctrl

- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM Controller Internal Design

### io_cpu

- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU Component Model

### m_cpu

- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU and M_CPU.DMA Component Model

### pcie_ep

- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model

### pe_cpu

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model

### pe_dma

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication

### pe_fetch_store

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model

### pe_gemm

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model

### pe_ipcq

- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication

### pe_math

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model

### pe_mmu

- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — Component + Utility Dual Role

### pe_scheduler

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model

### pe_tcm

- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — Dual-Channel BW Serialization

### sram

- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC

### tiling

- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math Pipeline Plan Builders

## Implementation Decisions

### Address Scheme

- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51-bit Physical Address Layout & Decoding Contract
- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — Memory Addressing — PA / VA / LA Address Models

### Routing & Helper API

- [ADR-0002](./ADR-0002-lat-routing-distance.md) — Routing Distance, Ordering & Bypass Rules
- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`

### Memory Semantics & Local-HBM Bandwidth

- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — Memory Semantics & Local-HBM Bandwidth Guarantee

### Topology Compilation, Diagrams & Builder Algorithms

- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — Diagram Views & Distance-Aware Layout Rules
- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — Topology Compilation, Distance Extraction, and Automatic Diagram Generation
- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms

### Tensor Deployment and Allocation

- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — Tensor Deployment and Allocation (Host Allocator, PA-first)

### Kernel Execution and Host-Device Messaging

- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — Kernel Execution Messaging and Completion Semantics
- [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)

### CLI Surface and Semantics

- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — Command Line Interface and Execution Semantics

### Component Port/Wire Fabric Model

- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — Component Port/Wire Model and Fabric Routing

### Two-Pass Data Execution

- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass Data Execution Model (Timing / Data Separation)

### 2D Grid Program Identity

- [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D Grid program_id Semantics

### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)

- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — remove sip/num_sips fields
- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`

### IPCQ Direction Addressing

- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching

### Intercube All-Reduce

- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange

### Evaluation Harnesses

- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce Evaluation Harness — `tests/sccl/`
- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`

### Bench Module Contract

- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring

### Kernel-side tl.* API (TLContext)

- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract

### Memory Allocator Algorithms

- [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator

### Probe Subcommand

- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness

### Sim-engine Op Log and Memory Store Schemas

- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -0,0 +1,836 @@
# KernBench — Architecture Design Document
*2026 1H*

KernBench is a system-level, discrete-event simulator for AI-accelerator
chiplet systems. It models the data-movement and control paths across
the full hardware hierarchy and reports end-to-end execution latency
for kernels dispatched to the device's compute units.

This document is a public summary of the architecture as designed and
implemented in the first half of 2026. It assumes no prior knowledge of
the simulator's internal documents; terms specific to the system are
defined on first use.

---

## Design Principles

KernBench is grounded in two foundational commitments: every measured
latency must trace to explicit, modeled events on the simulator's graph,
and every behavioral claim must be verifiable through tests that target
spec-level invariants rather than incidental implementation details.

<!-- src: ADR-0013 Context, Decision -->
Development is verification-driven. Tests are written to validate the
architectural contracts that the simulator exposes (correct routing,
deterministic results, monotonic latency under increasing hop counts)
rather than to mirror the call graph of the implementation. Two phases
coexist: a fast timing phase that exercises the simulator's
discrete-event engine and produces a log of operations with timestamps,
and an optional data-replay phase that uses that log to compute real
numerical results. Tests can target either phase.

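
The split between the timing phase and the data-replay phase can be pictured with a small sketch. It is an illustration under assumptions, not KernBench's actual log schema: the `OpRecord` fields, the `replay` function, and the two toy operations are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpRecord:
    """One timing-phase log entry (hypothetical schema)."""
    t_ns: float              # timestamp assigned by the timing phase
    op: str                  # "add" or "mul" in this toy example
    dst: str                 # name of the output buffer
    srcs: tuple[str, str]    # names of the two input buffers


def replay(log: list[OpRecord], memory: dict[str, float]) -> dict[str, float]:
    """Data-replay phase: apply logged operations in timestamp order."""
    # sorted() is stable, so records with equal timestamps keep log order.
    for rec in sorted(log, key=lambda r: r.t_ns):
        a, b = memory[rec.srcs[0]], memory[rec.srcs[1]]
        if rec.op == "add":
            memory[rec.dst] = a + b
        elif rec.op == "mul":
            memory[rec.dst] = a * b
        else:
            raise ValueError(f"unknown op: {rec.op!r}")
    return memory
```

Because the replay consumes only the log, a test can check numerical results without re-running the timing phase.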
<!-- src: ADR-0033 Context, Decision -->
The latency model is intentionally abstract rather than
cycle-accurate. Each modeled node contributes a configurable per-node
overhead, each link contributes wire delay plus byte-over-bandwidth
serialization, and each terminal service contributes its own service
time. The simulator does not attempt to reproduce cache coherence
protocols, microarchitectural pipelines, or full PCIe/UCIe protocol
correctness; those are explicitly out of scope. The aim is a
simulator that compares system-level configurations meaningfully and
deterministically, not one that claims microarchitectural accuracy.

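
The three latency contributions named above can be combined in a short worked sketch. The `Hop` type, its field names, and the choice to charge serialization on every hop (store-and-forward) are assumptions for illustration; the simulator's real accounting may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One node plus its outgoing link (hypothetical structure)."""
    node_overhead_ns: float  # configurable per-node processing overhead
    wire_delay_ns: float     # propagation delay of the outgoing link
    bytes_per_ns: float      # link bandwidth (1 byte/ns equals 1 GB/s)


def path_latency_ns(hops: list[Hop], nbytes: int, service_ns: float) -> float:
    """Sum node overhead, wire delay, and serialization along a path."""
    total = service_ns  # terminal service time, paid once at the endpoint
    for hop in hops:
        serialization_ns = nbytes / hop.bytes_per_ns
        total += hop.node_overhead_ns + hop.wire_delay_ns + serialization_ns
    return total
```

With every term non-negative, adding a hop can never reduce the total, which is the monotonicity property the verification strategy tests for.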
<!-- src: ADR-0033 Decision, Consequences -->
Determinism is a hard requirement. Given identical inputs (topology,
routing policy, and request stream), the simulator must produce
identical outputs, hop traces included. This rules out reliance on
unordered set iteration on the critical path and forces every latency
contribution to come from an explicitly scheduled event on a modeled
component or link. There are no implicit waits, no hardcoded magic
delays, and no shortcuts that bypass the modeled graph.

---

## High-level Architecture

<!-- src: ADR-0003 Context, Decision -->
The simulated system is a four-level hierarchy. A **Tray** holds one or
more **SIPs** (systems-in-package), each containing a 2D mesh of
**CUBEs** plus one or more **IO chiplets** that connect the SIP to the
host. Each CUBE contains a regular grid of **PEs** (processing
elements) plus its own attached resources: high-bandwidth memory
(HBM), a per-cube SRAM scratchpad, and a management CPU (M_CPU). The PE
itself is a composite of nine sub-components rather than a monolithic
core. This hierarchy is fixed; the parameters along each axis (counts,
mesh dimensions, link widths) are configurable through the topology
spec.

<!-- src: ADR-0007 Context, Decision -->
A clean separation runs along the request flow. A **runtime API** at
the top is the host-facing surface; it exposes tensor and kernel
operations, owns host-side allocation metadata, and is
topology-agnostic: it does not route or fan out. Below it, the
**simulation engine** decomposes runtime operations into discrete graph
requests (memory writes, memory reads, kernel launches, MMU map
installs) and schedules events deterministically. At the bottom,
**components** model device behavior on a graph of nodes connected by
links; they implement the actual latency contributions and pass
requests along. No component reaches up into the runtime API, and no
runtime call bypasses the engine.

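
One common way to schedule events deterministically is to break timestamp ties with an insertion counter. The sketch below shows that technique with the standard library; the `EventQueue` class is a hypothetical stand-in, and nothing here is taken from the simulator's actual engine.

```python
import heapq
import itertools
from typing import Callable


class EventQueue:
    """Minimal discrete-event queue with deterministic tie-breaking."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, Callable[[], None]]] = []
        self._seq = itertools.count()  # unique, increasing insertion index
        self.now = 0.0

    def schedule(self, delay: float, action: Callable[[], None]) -> None:
        # (time, seq) is always unique, so the callable is never compared.
        heapq.heappush(self._heap, (self.now + delay, next(self._seq), action))

    def run(self) -> None:
        while self._heap:
            self.now, _, action = heapq.heappop(self._heap)
            action()
```

Events scheduled for the same instant fire in the order they were scheduled, so two runs with identical inputs produce identical traces.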
<!-- DIAGRAM: Four-level system hierarchy — Tray containing SIPs, each SIP showing its 2D cube mesh and IO chiplet; one cube blown up to show the router mesh, attached PEs, M_CPU, SRAM, and HBM partition. -->

### Tray

<!-- src: ADR-0003 Decision -->
The Tray is the outermost boundary. It owns the host CPU on one side
and one or more SIPs on the other, connected through a fabric switch.
For collective communication that must traverse multiple SIPs, the
fabric switch acts as the common rendezvous: device-side outbound
traffic from one SIP routes through the switch and back into the
target SIP's IO chiplet.

### SIP

<!-- src: ADR-0003 Decision, ADR-0017 Context -->
A SIP packages a 2D mesh of CUBEs and one or more IO chiplets. The
default topology used by the simulator is a 4×4 cube mesh; the
mesh dimensions are configurable. Each cube on the boundary of the
mesh connects to its neighbors over UCIe (die-to-die) links arranged
on the four cardinal sides: north, south, east, and west. The IO
chiplets sit on one side of the SIP and provide the bridge to the host
across PCIe.

<!-- src: ADR-0016 Context, Decision -->
The IO chiplet itself contains its own internal network. A
host-facing PCIe endpoint passes traffic to a small NOC ("network on
chip"); from there it can branch to a control-plane CPU that processes
kernel-launch messages, or it can take the direct memory data path to
the cube's HBM controller. The decision to provide a direct memory
path that bypasses the control CPU was a deliberate concession to
keep host-issued memory writes from paying control-plane overhead on
the data path.

### CUBE

<!-- src: ADR-0017 Decision -->
Each CUBE owns a 2D mesh of NOC routers and a set of attached
resources: PEs, the cube-local SRAM scratchpad, the management CPU
(M_CPU), and the HBM partition (split across multiple PE-private
slices for bandwidth). The router mesh uses deterministic XY routing.
Attached components do not connect to each other directly; they all
sit on the router mesh, and every cube-internal transfer pays the
mesh distance from source to destination.

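
Deterministic XY routing on a 2D mesh can be sketched in a few lines: travel along X until the column matches, then along Y. The function name and the `(x, y)` coordinate convention are assumptions for illustration, not the simulator's routing code.

```python
def xy_route(src: tuple[int, int], dst: tuple[int, int]) -> list[tuple[int, int]]:
    """Return the router coordinates visited from src to dst, inclusive."""
    x, y = src
    path = [(x, y)]
    # Phase 1: move along the X axis until the destination column is reached.
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step_x
        path.append((x, y))
    # Phase 2: move along the Y axis until the destination row is reached.
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step_y
        path.append((x, y))
    return path
```

The hop count is always the Manhattan distance between the two routers, which is the "mesh distance" every cube-internal transfer pays.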
<!-- src: ADR-0017 Decision -->
The HBM partition is per-PE: each PE owns one HBM slice, and the
controller exposes per-PE channels so that the same PE always
addresses the same set of HBM channels. This makes the local-HBM
bandwidth from a PE to its own slice predictable, while accesses to
another PE's slice (or a different cube's slice) pay the mesh
distance and any UCIe crossings.

### PE

<!-- src: ADR-0014 Context, Decision -->
A PE is not a monolithic core. Internally it is a set of nine
sub-components, each modeling one stage of a request's flow: a small
control CPU, a tile-pipeline scheduler, a DMA engine, a fetch-store
engine that moves data between the on-PE scratchpad and the register
file, a GEMM compute engine, a math compute engine, the
tightly-coupled memory (TCM, the on-PE scratchpad), an MMU for
virtual-to-physical address translation, and an inter-PE collective
queue (IPCQ). The scheduler decomposes higher-level operations into
per-tile stage sequences, and tile tokens self-route from one
sub-component to the next.

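
The idea of a tile token that carries its own stage sequence can be sketched as follows. The stage names follow the diagram note for this section (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE); the `TileToken` class, its fields, and `run_token` are hypothetical and only illustrate self-routing.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

GEMM_STAGES = ("DMA_READ", "FETCH", "GEMM", "STORE", "DMA_WRITE")


@dataclass
class TileToken:
    """A tile's work item; it knows which stage comes next."""
    tile_id: int
    stages: tuple[str, ...]
    cursor: int = 0
    visited: list[str] = field(default_factory=list)

    def next_stage(self) -> Optional[str]:
        if self.cursor >= len(self.stages):
            return None  # sequence exhausted: the tile is done
        stage = self.stages[self.cursor]
        self.cursor += 1
        return stage


def run_token(token: TileToken,
              handlers: dict[str, Callable[[TileToken], None]]) -> list[str]:
    """Hand the token to each stage's handler in the order it dictates."""
    while (stage := token.next_stage()) is not None:
        handlers[stage](token)
    return token.visited
```

Because the sequence lives in the token, a scheduler in this style only builds the plan and never needs to mediate each hand-off.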
<!-- DIAGRAM: PE internal layout — the nine sub-components and the edges that connect them; tile token flowing through DMA_READ → FETCH → GEMM → STORE → DMA_WRITE. -->

---

## Detailed Architecture

This section describes each modeled device-side component in turn.
Components are listed in the alphabetical order used by the
simulator's source tree.

### forwarding

<!-- src: ADR-0037 Context, Decision -->
The forwarding component is the generic routing relay used wherever a
node only needs to apply a small processing overhead and pass the
request to the next hop. NOC routers, conn nodes, and UCIe PHYs all
reduce to this component. Its first act on receiving a request is to
apply the per-node overhead configured for it in the topology spec;
after the overhead it simply hands the request to the next hop along
the path.

<!-- src: ADR-0037 Decision, Consequences -->
The decision to share one implementation across these roles was made
to keep the simulator's component set small without sacrificing
modeling fidelity. Each instance still carries its own overhead and
its own link bandwidth contributions, so different roles still produce
different timing. What is shared is the dispatcher loop, not the
parameter values.

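
A minimal sketch of "one implementation, different parameters" might look like the following. The `Forwarder` class, the node names, and the overhead values are hypothetical, and link delays are left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Forwarder:
    """Generic relay: apply this node's overhead, then pass the request on."""
    name: str
    overhead_ns: float                       # per-instance, from the topology spec
    next_hop: Optional["Forwarder"] = None   # None marks the end of this chain

    def handle(self, arrival_ns: float, trace: list[str]) -> float:
        trace.append(self.name)
        depart_ns = arrival_ns + self.overhead_ns
        if self.next_hop is None:
            return depart_ns
        return self.next_hop.handle(depart_ns, trace)
```

A router and a UCIe PHY are then two instances of the same class that differ only in their configured overhead.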
### hbm_ctrl

<!-- src: ADR-0034 Context, Decision -->
The HBM controller is the terminal node for all memory traffic that
reaches HBM. Internally it owns a number of pseudo channels, partitioned
per-PE so that each PE addresses a deterministic subset. When a request
arrives, the controller first selects the right pseudo channel from the
target address, then enters a chunk loop that drains the requested
size in fixed-size flits over the channel's bandwidth.

<!-- src: ADR-0034 Decision, Consequences -->
The chunk-loop pattern replaces an earlier all-at-once drain. The
benefit is that the controller no longer presents a flit-aware fabric
with a single bulk transfer; instead it emits flits at a paced rate
matching the channel bandwidth, which makes cross-flow contention
visible. The bandwidth budget is calibrated against the configured
total HBM bandwidth divided across the channel count.

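
The chunk loop can be sketched as a generator that paces flits at the per-channel bandwidth. Everything below is illustrative: the address-to-channel mapping in `select_channel` is an invented example of a per-PE partition, and the parameter names are assumptions rather than the controller's real configuration keys.

```python
from typing import Iterator


def select_channel(addr: int, pe_id: int, channels_per_pe: int, flit_bytes: int) -> int:
    """Pick a pseudo channel inside the PE's own group (hypothetical mapping)."""
    local = (addr // flit_bytes) % channels_per_pe
    return pe_id * channels_per_pe + local


def drain_times_ns(nbytes: int, flit_bytes: int, total_bytes_per_ns: float,
                   num_channels: int, start_ns: float = 0.0) -> Iterator[float]:
    """Yield the completion time of each flit on one channel."""
    # Per-channel budget: total bandwidth divided across the channel count.
    channel_bw = total_bytes_per_ns / num_channels
    now = start_ns
    remaining = nbytes
    while remaining > 0:
        chunk = min(flit_bytes, remaining)  # the last flit may be partial
        now += chunk / channel_bw
        remaining -= chunk
        yield now
```

Emitting one timestamp per flit, instead of one for the whole transfer, is what lets another flow interleave between flits.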
### io_cpu

<!-- src: ADR-0036 Context, Decision -->
The IO_CPU is the control-plane processor sitting inside the IO chiplet.
It receives kernel-launch messages from the host, decodes them, and
dispatches per-cube launches to the cube's management CPU. Pure memory
operations bypass it entirely, taking the direct data path established
inside the IO chiplet.

<!-- src: ADR-0036 Decision -->
On receiving a kernel-launch message, the IO_CPU consults the message's
shard list, which already names the target SIP, cube, and PE for each
piece of the tensor argument, and forwards a per-cube launch to each
cube the kernel needs to reach. This makes the IO_CPU a deterministic
fan-out point: it does not decode physical addresses to route, it just
follows the explicit per-shard targets it was handed.

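
The shard-driven fan-out can be sketched as a grouping step. The `Shard` type and `fan_out` function are hypothetical; the only property taken from the text is that targets come from explicit per-shard fields rather than from address decoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """Explicit placement of one piece of a tensor argument."""
    sip: int
    cube: int
    pe: int


def fan_out(shards: list[Shard]) -> dict[tuple[int, int], list[int]]:
    """Group shards into one launch per (sip, cube), listing the target PEs."""
    launches: dict[tuple[int, int], list[int]] = {}
    # Sorting first makes the launch order independent of message order.
    for shard in sorted(shards, key=lambda s: (s.sip, s.cube, s.pe)):
        launches.setdefault((shard.sip, shard.cube), []).append(shard.pe)
    return launches
```

Python dictionaries preserve insertion order, so iterating the result visits cubes in sorted order on every run.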
### m_cpu

<!-- src: ADR-0035 Context, Decision -->
The M_CPU is the cube's management processor. It has two distinct
roles: a control-plane fan-out point for kernel launches arriving
from the IO chiplet, and a DMA endpoint for host-initiated memory
writes that need to land in this cube's HBM. The control role
forwards launches to the right PE control CPUs; the DMA role places
the actual bytes into HBM through the router mesh.

<!-- src: ADR-0035 Decision -->
The component model deliberately distinguishes the two roles because
their routing differs: the control fan-out path uses command-kind
links that do not appear on data-path routes, while the DMA path uses
the same router mesh as PE-initiated DMA, with PE-internal nodes
excluded. The routing layer knows about both modes and selects the
appropriate adjacency at request time.

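
Selecting an adjacency per routing mode can be sketched as a filter over typed links. The `"command"` link kind comes from the text above; the `"data"` kind, the node names, and the `adjacency` function are assumptions for illustration.

```python
from typing import Iterable

Link = tuple[str, str, str]  # (source node, destination node, link kind)

# Link kinds each routing mode may traverse ("data" is an assumed name).
ALLOWED_KINDS = {"control": {"command"}, "data": {"data"}}


def adjacency(links: Iterable[Link], mode: str,
              excluded: frozenset[str] = frozenset()) -> dict[str, list[str]]:
    """Build the adjacency list visible to one routing mode."""
    allowed = ALLOWED_KINDS[mode]
    adj: dict[str, list[str]] = {}
    for src, dst, kind in links:
        # Skip links of the wrong kind and links touching excluded nodes.
        if kind in allowed and src not in excluded and dst not in excluded:
            adj.setdefault(src, []).append(dst)
    return adj
```

The `excluded` set models the rule that PE-internal nodes are left out of M_CPU DMA routes.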
### pcie_ep

<!-- src: ADR-0038 Context, Decision -->
The PCIe endpoint is the protocol boundary at the host-device edge.
Its first act on each incoming request is to apply a configured
protocol-processing overhead; after that it simply forwards. There is
no internal queuing model, no retry, and no TLP-level fidelity; those
are deliberately out of scope. The endpoint is bidirectional: host →
device traffic (memory writes, kernel launches) flows one way, and
device-side outbound traffic (cross-SIP collective sends) flows the
other.

<!-- src: ADR-0038 Decision, Alternatives Considered, Consequences -->
|
||||||
|
A more detailed PCIe model was considered and rejected. The simulator
|
||||||
|
is targeting system-level latency comparisons; making the endpoint
|
||||||
|
heavier with credit-management and retry logic would not improve the
|
||||||
|
metrics being studied. The decision keeps the endpoint as the
|
||||||
|
documented protocol-boundary node, named consistently so routing
|
||||||
|
helpers can locate it by SIP and IO instance.
|
||||||
|
|
||||||
|
### pe_cpu
|
||||||
|
|
||||||
|
<!-- src: ADR-0014 Decision -->
|
||||||
|
The PE control CPU is the entry point for kernel work arriving from
|
||||||
|
the cube's management CPU. It receives kernel-launch messages, resolves
|
||||||
|
the kernel function by name, and hands execution to the scheduler with
|
||||||
|
the resolved tensor arguments. From the scheduler's point of view, the
|
||||||
|
PE_CPU is the upstream source of high-level commands; from the rest
|
||||||
|
of the system's point of view, the PE_CPU is where a kernel's
|
||||||
|
execution begins on a given PE.
|
||||||
|
|
||||||
|
### pe_dma
|
||||||
|
|
||||||
|
<!-- src: ADR-0014 Decision, ADR-0023 Decision -->
|
||||||
|
The DMA engine on each PE has two distinct modes. In the standard PE
|
||||||
|
pipeline it consumes tile tokens issued by the scheduler, acquires a
|
||||||
|
read or write channel (modeled as a one-in-flight resource per
|
||||||
|
direction), and moves the bytes to or from HBM through the mesh. In
|
||||||
|
its collective mode it forwards send tokens for the cube's IPCQ into
|
||||||
|
the fabric, snapshotting the source data at send time so later
|
||||||
|
mutations cannot race the receiver's read. Both modes share the same
|
||||||
|
channel resources but differ in their downstream handling — one
|
||||||
|
returns when the round-trip completes, the other dispatches
|
||||||
|
fire-and-forget.
|
||||||
|
|
||||||
|
### pe_fetch_store
|
||||||
|
|
||||||
|
<!-- src: ADR-0014 Decision -->
|
||||||
|
The fetch-store engine is the bridge between the on-PE scratchpad
|
||||||
|
(TCM) and the register file. It does not run DMA; it only moves bytes
|
||||||
|
internally. On receiving a tile-stage token it sends a short request
|
||||||
|
to the TCM, waits for the bandwidth-serialized delay, and continues
|
||||||
|
the pipeline. The split between this engine and the TCM lets the
|
||||||
|
scratchpad model its own read/write bandwidth independently.
|
||||||
|
|
||||||
|
### pe_gemm
|
||||||
|
|
||||||
|
<!-- src: ADR-0014 Decision -->
|
||||||
|
The GEMM engine is the matrix-multiply compute unit. Tile tokens
|
||||||
|
arriving at this stage carry the per-tile dimensions, and the engine
|
||||||
|
contributes a service time that accounts for one fused multiply-add per
|
||||||
|
MAC in the tile. Composite operations (where the same tensor pair is
|
||||||
|
streamed across many tiles) reuse the engine through the scheduler;
|
||||||
|
the engine itself is stateless between tiles.
|
||||||
|
|
||||||
|
### pe_ipcq
|
||||||
|
|
||||||
|
<!-- src: ADR-0023 Context, Decision -->
|
||||||
|
The IPCQ — inter-process communication queue — is each PE's
|
||||||
|
collective-communication endpoint. It owns ring buffers that hold
|
||||||
|
inbound messages from neighbor PEs and bookkeeping for send credits.
|
||||||
|
Direction names ("N", "S", "E", "W" for cube-internal neighbors and
|
||||||
|
"global_*" for cross-SIP neighbors) are resolved to physical peer
|
||||||
|
endpoints by a neighbor table installed at process-group creation
|
||||||
|
time. The component itself does not move bytes — it issues DMA tokens
|
||||||
|
through the local PE_DMA, which performs the actual cross-PE
|
||||||
|
transfer.
|
||||||
|
|
||||||
|
<!-- src: ADR-0023 Decision, Consequences -->
|
||||||
|
A key invariant is that the inbound terminal — where data lands at
|
||||||
|
the receiver — pays the link bandwidth drain plus any cube-internal
|
||||||
|
mesh hop to the slot's backing memory. This prevents IPCQ from
|
||||||
|
silently outpacing raw DMA at large transfer sizes. Outbound sends
|
||||||
|
are fire-and-forget; credit return is the only backpressure signal.
|
||||||
|
|
||||||
|
### pe_math
|
||||||
|
|
||||||
|
<!-- src: ADR-0014 Decision -->
|
||||||
|
The math engine handles element-wise and reduction operations. It
|
||||||
|
consumes tile tokens carrying an operation kind (`exp`, `sum`, `max`,
|
||||||
|
`where`, etc.) and contributes a service time proportional to the
|
||||||
|
number of elements processed. Like the GEMM engine it is stateless;
|
||||||
|
chained epilogues (a sequence of math operations after a GEMM tile)
|
||||||
|
are scheduled as separate stages.
|
||||||
|
|
||||||
|
### pe_mmu
|
||||||
|
|
||||||
|
<!-- src: ADR-0039 Context, Decision -->
|
||||||
|
The MMU has two roles, exposed through one component. As a node on
|
||||||
|
the cube NOC it receives MMU-map and MMU-unmap messages and updates
|
||||||
|
its internal page table, so that the runtime API can install
|
||||||
|
virtual-to-physical mappings with measured fabric latency. As a
|
||||||
|
utility object held inside the PE it offers synchronous translate
|
||||||
|
calls to the PE's DMA and GEMM engines without taking simulator time
|
||||||
|
itself; the calling engine pays any configured TLB overhead in its
|
||||||
|
own process.
|
||||||
|
|
||||||
|
<!-- src: ADR-0039 Decision, Alternatives Considered -->
|
||||||
|
The page table supports multiple disjoint regions inside a single
|
||||||
|
page, with later-write-wins semantics on overlap. This is a deliberate
|
||||||
|
simulator stopgap: it lets parallelization policies shard data at
|
||||||
|
sub-page granularity, where the one-PA-per-entry assumption of a real
|
||||||
|
hardware MMU would silently mis-route accesses. A real MMU does not work
|
||||||
|
this way; the model documents this as a simplification.
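
A minimal sketch of the later-write-wins lookup follows. The class and method names are hypothetical, and the unmapped-address behavior (returning the input address unchanged) is this example's reading of the PA fallback mentioned elsewhere in this report, not a confirmed detail of the implementation.

```python
class PageTable:
    """Illustrative region table: several disjoint regions may share a page."""

    def __init__(self):
        # (va_start, va_end, pa_start) tuples, kept in insertion order.
        self._regions = []

    def map(self, va, size, pa):
        self._regions.append((va, va + size, pa))

    def translate(self, va):
        # Scan newest first so a later mapping wins on overlap.
        for start, end, pa in reversed(self._regions):
            if start <= va < end:
                return pa + (va - start)
        # Assumption: no mapping means the address is treated as a PA.
        return va


pt = PageTable()
pt.map(0x1000, 0x800, 0xA000)   # first half of a 4 KiB page
pt.map(0x1800, 0x800, 0xB000)   # second half, different backing PA
pt.map(0x1400, 0x400, 0xC000)   # overlaps the first region: newest wins
```

The linear scan is chosen for clarity; a real table would likely index regions by page.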
|
||||||
|
|
||||||
|
### pe_scheduler
|
||||||
|
|
||||||
|
<!-- src: ADR-0014 Decision -->
|
||||||
|
The scheduler is the sole dispatcher inside a PE. Simple commands are
|
||||||
|
routed directly to the right engine. Composite commands generate a
|
||||||
|
tile plan, and the resulting tile tokens are fed into the pipeline.
|
||||||
|
Self-routing keeps the scheduler off the per-stage hot path: each
|
||||||
|
engine, on finishing a stage, advances the token to the next stage's
|
||||||
|
component itself, so the scheduler only does initial dispatch and
|
||||||
|
completion tracking.
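
The self-routing idea can be illustrated with a toy model in which each engine advances the token itself. Everything here is a hypothetical sketch: the token is a plain dictionary, engines call each other directly instead of scheduling simulator events, and the stage list is invented for the example.

```python
class Engine:
    """Toy engine that does its stage, then forwards the token itself."""

    def __init__(self, name, registry, trace):
        self.name, self.registry, self.trace = name, registry, trace
        registry[name] = self

    def handle(self, token):
        self.trace.append(self.name)          # stand-in for the stage's work
        token["cursor"] += 1
        if token["cursor"] < len(token["stages"]):
            nxt = token["stages"][token["cursor"]]
            self.registry[nxt].handle(token)  # self-routing: no scheduler hop
        else:
            token["on_done"](token)


class Scheduler:
    """Only performs the initial dispatch and counts completions."""

    def __init__(self, registry):
        self.registry = registry
        self.completed = 0

    def dispatch(self, stages):
        token = {"stages": stages, "cursor": 0, "on_done": self._complete}
        self.registry[stages[0]].handle(token)

    def _complete(self, token):
        self.completed += 1


registry, trace = {}, []
for name in ("pe_dma", "pe_fetch_store", "pe_gemm"):
    Engine(name, registry, trace)
sched = Scheduler(registry)
sched.dispatch(["pe_dma", "pe_fetch_store", "pe_gemm", "pe_fetch_store", "pe_dma"])
```

The scheduler is touched twice per token (dispatch and completion) regardless of how many stages the token visits.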
|
||||||
|
|
||||||
|
### pe_tcm
|
||||||
|
|
||||||
|
<!-- src: ADR-0040 Context, Decision -->
|
||||||
|
The TCM is the per-PE tightly-coupled scratchpad memory. It models
|
||||||
|
time only, not data — the actual payload lives in the simulator's
|
||||||
|
memory store. Read and write are independent channels: each is
|
||||||
|
modeled as a one-in-flight resource, so same-direction requests
|
||||||
|
serialize but a read and a write can overlap. The bandwidth of each
|
||||||
|
direction is configured separately and applied as bytes-over-bandwidth
|
||||||
|
on each request.
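
The timing rule above (one in-flight request per direction, bytes-over-bandwidth per request) can be sketched without an event engine. The class names and the bandwidth figures are assumptions made for this example.

```python
class Channel:
    """One-in-flight resource: same-direction requests serialize."""

    def __init__(self, bytes_per_ns):
        self.bytes_per_ns = bytes_per_ns
        self.busy_until = 0.0

    def request(self, now, nbytes):
        start = max(now, self.busy_until)      # wait for the previous request
        self.busy_until = start + nbytes / self.bytes_per_ns
        return self.busy_until                 # completion time


class TcmTiming:
    """Timing-only model: separate read and write channels, no payload."""

    def __init__(self, read_bw, write_bw):
        self.read = Channel(read_bw)
        self.write = Channel(write_bw)


tcm = TcmTiming(read_bw=64.0, write_bw=32.0)
r1 = tcm.read.request(now=0.0, nbytes=640)    # finishes at 10.0
r2 = tcm.read.request(now=0.0, nbytes=640)    # queued behind r1, finishes at 20.0
w1 = tcm.write.request(now=0.0, nbytes=320)   # overlaps the reads, finishes at 10.0
```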
|
||||||
|
|
||||||
|
<!-- src: ADR-0040 Decision, Alternatives Considered -->
|
||||||
|
The decision to keep read and write on separate channels was made
|
||||||
|
because the PE pipeline's normal case overlaps fetch (read) and store
|
||||||
|
(write). Collapsing them into a single shared channel would have
|
||||||
|
artificially serialized that overlap and produced an incorrect
|
||||||
|
bandwidth ceiling.
|
||||||
|
|
||||||
|
### sram
|
||||||
|
|
||||||
|
<!-- src: ADR-0041 Context, Decision -->
|
||||||
|
The cube SRAM is a per-cube scratchpad attached to one of the cube's
|
||||||
|
routers. As a node it applies a configured access overhead, pays the
|
||||||
|
link-bandwidth drain stamped on the incoming request, and sends a
|
||||||
|
response on the reverse path. It is a terminal — it does not forward.
|
||||||
|
|
||||||
|
<!-- src: ADR-0041 Decision, Consequences -->
|
||||||
|
A second role is as one of three backing-memory tiers (TCM, SRAM, HBM)
|
||||||
|
that an inter-PE collective slot can live in. When the slot lives in
|
||||||
|
SRAM, the PE_DMA pays the slot read or write latency directly using
|
||||||
|
the configured SRAM bandwidth and overhead; the SRAM component does
|
||||||
|
not need to know about collective semantics. This separation keeps
|
||||||
|
the SRAM component agnostic to the collective subsystem.
|
||||||
|
|
||||||
|
### tiling
|
||||||
|
|
||||||
|
<!-- src: ADR-0042 Context, Decision -->
|
||||||
|
The tile-plan generator is not a runtime component — it is a pure
|
||||||
|
module of functions that take a problem shape (matrix dimensions, tile
|
||||||
|
sizes) and produce an ordered list of tile-stage sequences. The
|
||||||
|
scheduler consumes this list. Each tile's stage sequence depends on
|
||||||
|
how its operands are staged: operands streamed from HBM produce
|
||||||
|
DMA_READ stages, operands already resident in TCM (because they were
|
||||||
|
loaded eagerly upfront) skip them.
|
||||||
|
|
||||||
|
<!-- src: ADR-0042 Decision, Consequences -->
|
||||||
|
The plan generator is intentionally pure — given the same input it
|
||||||
|
returns the same plan, with no simulator events created. This lets
|
||||||
|
the rest of the system reason about tile sequences as data, and it
|
||||||
|
makes the plan testable in isolation without simulator state. New
|
||||||
|
plan variants (for example, K-major or DTensor-aware plans) can be
|
||||||
|
added as new functions following the same shape.
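
A pure plan generator in this spirit might look like the sketch below. Only `DMA_READ` is a stage name taken from the text; the other stage names, the function signature, and the `resident` parameter are hypothetical.

```python
def tile_plan(m, n, tile_m, tile_n, resident=()):
    """Return an ordered list of per-tile stage sequences.

    Pure function: the result depends only on the arguments and no
    simulator state is touched. Operands named in `resident` are
    assumed to be in TCM already, so their DMA_READ stage is skipped.
    """
    plan = []
    for row in range(0, m, tile_m):
        for col in range(0, n, tile_n):
            stages = [("DMA_READ", op) for op in ("a", "b") if op not in resident]
            stages += [("FETCH", None), ("GEMM", None),
                       ("STORE", None), ("DMA_WRITE", "c")]
            plan.append({"row": row, "col": col, "stages": stages})
    return plan


streamed = tile_plan(64, 64, 32, 32)
a_resident = tile_plan(64, 64, 32, 32, resident=("a",))
```

Because the plan is plain data, a test can compare two plans directly, as with the streamed and resident variants here.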
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Implementation Decisions
|
||||||
|
|
||||||
|
This section collects cross-cutting decisions — algorithms, policies,
|
||||||
|
schemes, and contracts — that span multiple components rather than
|
||||||
|
living inside one.
|
||||||
|
|
||||||
|
### Address Scheme
|
||||||
|
|
||||||
|
<!-- src: ADR-0001 Context, Decision -->
|
||||||
|
Every physical address in the simulator decodes into a structured
|
||||||
|
location. A fixed-width physical address carries the SIP id, the
|
||||||
|
cube id within the SIP, a type discriminator (HBM vs PE-resource vs
|
||||||
|
others), and a type-specific offset. HBM addresses additionally encode
|
||||||
|
the per-PE slice offset so the controller can determine which PE
|
||||||
|
owns the target slice without external lookup. The layout is
|
||||||
|
deliberately reserved rather than packed-to-fit, so new sub-units can
|
||||||
|
be added at the type-discriminator level without rewriting existing
|
||||||
|
addresses.
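
To make the decode concrete, here is a sketch with invented field widths. The report does not state the actual bit layout, so the 4/4/4/36 split, the field order, and all names below are assumptions for illustration only.

```python
from typing import NamedTuple

# Hypothetical widths; the real layout is defined by ADR-0001.
OFFSET_BITS, TYPE_BITS, CUBE_BITS, SIP_BITS = 36, 4, 4, 4


class Location(NamedTuple):
    sip: int
    cube: int
    kind: int      # type discriminator, e.g. HBM vs PE resource
    offset: int


def encode(loc):
    pa = loc.sip
    pa = (pa << CUBE_BITS) | loc.cube
    pa = (pa << TYPE_BITS) | loc.kind
    pa = (pa << OFFSET_BITS) | loc.offset
    return pa


def decode(pa):
    offset = pa & ((1 << OFFSET_BITS) - 1)
    pa >>= OFFSET_BITS
    kind = pa & ((1 << TYPE_BITS) - 1)
    pa >>= TYPE_BITS
    cube = pa & ((1 << CUBE_BITS) - 1)
    pa >>= CUBE_BITS
    sip = pa & ((1 << SIP_BITS) - 1)
    return Location(sip, cube, kind, offset)


loc = Location(sip=1, cube=3, kind=0, offset=0x1234)
round_trip = decode(encode(loc))
```

Leaving unused values in the type-discriminator field is what allows new sub-units to be added later without moving existing addresses.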
|
||||||
|
|
||||||
|
<!-- src: ADR-0011 Context, Decision -->
|
||||||
|
On top of physical addressing, the simulator supports three address
|
||||||
|
models that the runtime API selects between. Direct physical
|
||||||
|
addressing is retained as a fallback. Virtual addressing — the
|
||||||
|
current default — gives each tensor a contiguous virtual range at
|
||||||
|
deployment, with the per-PE MMU translating per access; an
|
||||||
|
alternative logical-address scheme remains a future option. The
|
||||||
|
virtual-address path is what every current test path takes; the PA
|
||||||
|
fallback is used by the MMU itself when no mapping exists for an
|
||||||
|
address (a deliberate signal, not an error).
|
||||||
|
|
||||||
|
<!-- src: ADR-0011 Decision, Consequences -->
|
||||||
|
Tensor placement is represented as a list of physical-address shards,
|
||||||
|
each tagged with target SIP, cube, and PE, plus a single tensor-wide
|
||||||
|
virtual base. This means a kernel sees one virtual base for the whole
|
||||||
|
tensor while the host driver and the engine still know exactly where
|
||||||
|
each shard lives. Replicated tensors get per-cube local PA mappings;
|
||||||
|
sharded tensors broadcast their mapping across cubes within a SIP.
|
||||||
|
|
||||||
|
### Routing, Distance & Helper API
|
||||||
|
|
||||||
|
<!-- src: ADR-0002 Context, Decision -->
|
||||||
|
Routing is policy-driven, deterministic, and topology-aware. Given a
|
||||||
|
source, a destination, and an intent — for example, PE-initiated
|
||||||
|
DMA versus host-initiated memory write versus a generic
|
||||||
|
component-to-component query — the routing layer picks the right
|
||||||
|
path. The intent matters because different traffic types must avoid
|
||||||
|
different categories of edges: PE-initiated DMA should not traverse
|
||||||
|
command-only links; M_CPU DMA should not pass through PE-internal
|
||||||
|
pipeline edges; cube-local transfers should not use the
|
||||||
|
zero-distance UCIe bus that would otherwise look attractive to a
|
||||||
|
shortest-path search.
|
||||||
|
|
||||||
|
<!-- src: ADR-0051 Decision -->
|
||||||
|
The routing layer therefore maintains four separate adjacency graphs
|
||||||
|
at construction, each excluding a different category of edges, and
|
||||||
|
picks the appropriate one per intent. On top of the graphs sits a
|
||||||
|
helper API that hides the topology's naming convention: callers ask
|
||||||
|
for the PCIe endpoint of a given SIP, the M_CPU of a given cube, or
|
||||||
|
the HBM destination for a given physical address, and receive the
|
||||||
|
corresponding node id. No component constructs node-id strings
|
||||||
|
directly; if the naming convention ever changes, the change is local
|
||||||
|
to the helper layer.
|
||||||
|
|
||||||
|
<!-- src: ADR-0051 Decision, Consequences -->
|
||||||
|
Path-finding itself uses Dijkstra with explicit per-edge weights
|
||||||
|
(routing weight is allowed to differ from physical distance — for
|
||||||
|
example, UCIe is configured to be routing-preferable). Tie-breaks
|
||||||
|
follow insertion order, which keeps results deterministic. Paths
|
||||||
|
between unreachable nodes raise rather than returning empty, surfacing
|
||||||
|
topology errors immediately.
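
The path-finding behavior can be sketched with the standard-library heap. This is a generic Dijkstra written for illustration, not the project's routing code: the adjacency format and the choice of `LookupError` are assumptions. A monotonically increasing counter is used so that equal-cost candidates are popped in the order they were pushed.

```python
import heapq
from itertools import count


def shortest_path(adj, src, dst):
    """adj maps node -> list of (neighbor, weight) in insertion order."""
    order = count()                       # tie-breaker: earlier push wins
    heap = [(0, next(order), src, None)]
    parent = {}
    while heap:
        dist, _, node, prev = heapq.heappop(heap)
        if node in parent:
            continue                      # already settled via a better entry
        parent[node] = prev
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return dist, path[::-1]
        for nbr, weight in adj.get(node, []):
            if nbr not in parent:
                heapq.heappush(heap, (dist + weight, next(order), nbr, node))
    # Unreachable destinations raise instead of returning an empty path.
    raise LookupError(f"no route from {src} to {dst}")


adj = {"a": [("b", 1), ("c", 1)], "b": [("d", 1)], "c": [("d", 1)], "d": []}
cost, path = shortest_path(adj, "a", "d")
```

Both `a-b-d` and `a-c-d` cost 2 here; the route through `b` is returned because `b` was listed first.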
|
||||||
|
|
||||||
|
### Memory Semantics and Local-HBM Bandwidth
|
||||||
|
|
||||||
|
<!-- src: ADR-0004 Context, Decision -->
|
||||||
|
A PE accessing its own HBM slice through its own cube's NOC must see
|
||||||
|
the full local HBM bandwidth — that is the model's intent. Memory
|
||||||
|
traffic accumulates latency from per-component overhead and
|
||||||
|
bytes-over-link-bandwidth serialization along the path, but the
|
||||||
|
controller does not throttle below the slice's allotted bandwidth.
|
||||||
|
Cross-PE-slice accesses inside the same cube, cross-cube accesses
|
||||||
|
through UCIe, and cross-SIP accesses through PCIe each pay
|
||||||
|
progressively more overhead as the path grows.
|
||||||
|
|
||||||
|
### Topology Compilation, Diagrams & Builder Algorithms
|
||||||
|
|
||||||
|
<!-- src: ADR-0006 Context, Decision -->
|
||||||
|
Topology is configurable, not hardcoded. The simulator reads a YAML
|
||||||
|
spec, compiles it into a flat graph of nodes and edges plus four
|
||||||
|
view projections at different abstraction levels — system, SIP, cube,
|
||||||
|
PE — and uses the compiled graph as the single source for both
|
||||||
|
execution and visualization. Distance metadata used by routing is
|
||||||
|
extracted at compile time so that diagrams and routing decisions
|
||||||
|
agree by construction.
|
||||||
|
|
||||||
|
<!-- src: ADR-0005 Context, Decision -->
|
||||||
|
Diagrams are derived artifacts of the compiled topology. The visualizer
|
||||||
|
produces one SVG per view at the appropriate abstraction level; nothing
|
||||||
|
in the diagrams is hand-drawn or hand-positioned. Distance-aware
|
||||||
|
layout rules place nodes in the diagrams using the same coordinates
|
||||||
|
that routing uses to compute distance, so a diagram that "looks
|
||||||
|
wrong" is a signal that the topology itself has a problem, not the
|
||||||
|
visualizer.
|
||||||
|
|
||||||
|
<!-- src: ADR-0053 Decision -->
|
||||||
|
Inside a cube the router mesh is generated automatically. PE corner
|
||||||
|
positions are fixed by convention; the relay-column algorithm
|
||||||
|
inserts additional grid columns whenever the gap between adjacent PE
|
||||||
|
columns would exceed a tunable maximum. HBM occupies a central
|
||||||
|
exclusion zone — router slots inside the zone are deliberately empty,
|
||||||
|
since HBM controllers attach as separate named nodes. M_CPU and SRAM
|
||||||
|
attach to the nearest router by Euclidean distance from their
|
||||||
|
configured placement coordinates, and UCIe physical lanes distribute
|
||||||
|
along the boundary rows and columns. The whole mesh is cached
|
||||||
|
beside the topology spec and invalidated only when one of a small set
|
||||||
|
of layout-relevant fields changes.
|
||||||
|
|
||||||
|
<!-- DIAGRAM: One cube's router mesh — rows × cols of routers with HBM exclusion zone in the middle, PEs/M_CPU/SRAM attaching to nearest routers, UCIe PHYs along the perimeter. -->
|
||||||
|
|
||||||
|
### Tensor Deployment and Allocation
|
||||||
|
|
||||||
|
<!-- src: ADR-0008 Context, Decision -->
|
||||||
|
Tensor deployment in the runtime API produces a list of physical-address
|
||||||
|
shards plus a single tensor-wide virtual base. The host allocator
|
||||||
|
walks the data-parallelism policy, computes per-shard placement, and
|
||||||
|
emits the per-shard physical addresses through the per-PE allocators.
|
||||||
|
No separate "allocate then later attach to a device" RPC exists —
|
||||||
|
allocation and deployment are a single operation that produces a
|
||||||
|
deployed tensor handle.
|
||||||
|
|
||||||
|
### Memory Allocator Algorithms
|
||||||
|
|
||||||
|
<!-- src: ADR-0048 Context, Decision -->
|
||||||
|
Each per-PE allocator owns two channels — HBM slice and TCM — each
|
||||||
|
backed by an offset-keyed free-list. Allocation is first-fit; freeing
|
||||||
|
coalesces with adjacent free blocks. A device-wide virtual allocator
|
||||||
|
sits above the per-PE allocators, aligns requests up to the configured
|
||||||
|
page size, and coalesces on free in the same way. The trade-off is
|
||||||
|
explicit: first-fit is simpler and cheaper than best-fit or buddy
|
||||||
|
allocation, and the simulator's workload is stack-like enough
|
||||||
|
(deploy / kernel / free in matched order) that fragmentation is not
|
||||||
|
a practical concern.
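
A first-fit allocator with coalescing on free can be sketched as follows. The sorted list of `(offset, length)` pairs stands in for the offset-keyed free-list, and the class and method names are hypothetical.

```python
class FirstFitAllocator:
    """Illustrative first-fit allocator over a single address range."""

    def __init__(self, size):
        self.free = [(0, size)]           # sorted (offset, length) blocks

    def alloc(self, nbytes):
        for i, (off, length) in enumerate(self.free):
            if length >= nbytes:          # first block that fits
                if length == nbytes:
                    del self.free[i]
                else:
                    self.free[i] = (off + nbytes, length - nbytes)
                return off
        # Fail loudly rather than hand back a partial allocation.
        raise MemoryError(f"cannot allocate {nbytes} bytes")

    def release(self, off, nbytes):
        # Trusts the caller to pass back exactly what was allocated.
        self.free.append((off, nbytes))
        self.free.sort()
        merged = [self.free[0]]
        for o, n in self.free[1:]:
            last_off, last_len = merged[-1]
            if last_off + last_len == o:  # adjacent blocks coalesce
                merged[-1] = (last_off, last_len + n)
            else:
                merged.append((o, n))
        self.free = merged


heap = FirstFitAllocator(1024)
x = heap.alloc(256)
y = heap.alloc(256)
heap.release(x, 256)
heap.release(y, 256)
```

After both blocks are released the free list collapses back to a single 1024-byte block, which is the behavior that keeps a deploy / kernel / free workload from fragmenting.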
|
||||||
|
|
||||||
|
<!-- src: ADR-0048 Decision, Consequences -->
|
||||||
|
Allocation failure raises rather than silently returning a partial
|
||||||
|
result. A partial tensor reaching the engine would route over wrong
|
||||||
|
PAs and silently corrupt simulator output, so an out-of-memory signal
|
||||||
|
is preferred. The free path trusts its caller to pass back exactly
|
||||||
|
what was allocated; the small risk of caller error in exchange for
|
||||||
|
fast common-case freeing is documented as a deliberate trade.
|
||||||
|
|
||||||
|
### Kernel Execution and Host-Device Messaging
|
||||||
|
|
||||||
|
<!-- src: ADR-0009 Context, Decision -->
|
||||||
|
Kernel execution decomposes into a small set of messages that travel
|
||||||
|
the device graph. The host issues a single kernel-launch message; the
|
||||||
|
IO_CPU fans it out per-cube; the cube M_CPU fans it out per-PE; the
|
||||||
|
PE_CPU resolves the kernel and runs it through the scheduler.
|
||||||
|
Completion flows back the same way, gated by per-shard completion
|
||||||
|
tracking. Memory operations follow the same pattern: a memory write
|
||||||
|
or read travels as one message that the engine routes to the right
|
||||||
|
HBM controller, with a response taking the reverse path.
|
||||||
|
|
||||||
|
<!-- src: ADR-0012 Context, Decision -->
|
||||||
|
The schema between the host and the device-side IO_CPU is PA-first
|
||||||
|
and shard-tagged. Every byte of host-issued payload arrives with an
|
||||||
|
explicit target SIP, cube, PE, and physical address. The IO_CPU does
|
||||||
|
not decode addresses to derive placement — placement is named
|
||||||
|
explicitly by the shard list. This makes the host-device interface
|
||||||
|
deterministic and keeps the routing helper free of host-derived
|
||||||
|
intent.
|
||||||
|
|
||||||
|
### CLI Surface and Semantics
|
||||||
|
|
||||||
|
<!-- src: ADR-0010 Context, Decision -->
|
||||||
|
The command-line interface exposes four subcommands. A bench runner
|
||||||
|
loads a topology, resolves a registered benchmark by name or index,
|
||||||
|
and runs it on a selected device. A bench-listing command enumerates
|
||||||
|
the registered benchmarks. A probe utility runs a fixed catalog of
|
||||||
|
traffic patterns through the engine for latency and bandwidth
|
||||||
|
verification. A web viewer renders the topology in a browser. A
|
||||||
|
benchmark instance is always single-device by convention; multi-SIP
|
||||||
|
collective work happens inside the benchmark through the launcher
|
||||||
|
abstraction, not by multiplexing the CLI.
|
||||||
|
|
||||||
|
### Component Port and Wire Fabric Model
|
||||||
|
|
||||||
|
<!-- src: ADR-0015 Context, Decision -->
|
||||||
|
Every modeled component exposes input and output ports, and every
|
||||||
|
edge in the topology connects an output port on one component to an
|
||||||
|
input port on another. Bandwidth and propagation delay are properties
|
||||||
|
of the wire between ports, not of the component endpoints. A
|
||||||
|
component's responsibility is to apply its configured per-node
|
||||||
|
overhead and either forward to the next hop or terminate; the wire
|
||||||
|
charges the byte-over-bandwidth serialization separately.
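
Under this split, an end-to-end latency estimate is the sum of node overheads plus the per-wire charges. The sketch below assumes a simple store-and-forward path with no contention; the `Wire` type, the function name, and the numbers are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wire:
    bytes_per_ns: float      # bandwidth is a property of the wire
    propagation_ns: float    # so is propagation delay


def path_latency_ns(nbytes, node_overheads_ns, wires):
    """Nodes charge a fixed overhead; wires charge propagation + serialization."""
    if len(node_overheads_ns) != len(wires) + 1:
        raise ValueError("a path of N wires connects N + 1 nodes")
    component_time = sum(node_overheads_ns)
    wire_time = sum(w.propagation_ns + nbytes / w.bytes_per_ns for w in wires)
    return component_time + wire_time


latency = path_latency_ns(
    nbytes=1024,
    node_overheads_ns=[2.0, 1.0, 4.0],
    wires=[Wire(128.0, 0.5), Wire(64.0, 0.5)],
)
```

With these numbers the nodes contribute 7.0 ns and the two wires contribute 8.5 ns and 16.5 ns, for 32.0 ns in total.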
|
||||||
|
|
||||||
|
<!-- src: ADR-0015 Decision, Consequences -->
|
||||||
|
This separation lets components be swapped behind their port
|
||||||
|
interface without changing the rest of the model, and it keeps
|
||||||
|
bandwidth contention at the wire level where multiple components may
|
||||||
|
contend for the same edge. Future component models can refine
|
||||||
|
internal behavior without disturbing the fabric.
|
||||||
|
|
||||||
|
### Two-Pass Data Execution
|
||||||
|
|
||||||
|
<!-- src: ADR-0020 Context, Decision -->
|
||||||
|
The simulator runs in two passes. The first pass — fast and always
|
||||||
|
on — runs the discrete-event engine and records every data operation
|
||||||
|
in an operation log with timestamps, component identifiers, and
|
||||||
|
per-operation parameters. The second pass, optional and opt-in, replays
|
||||||
|
the log against an in-memory tensor store to produce actual numerical
|
||||||
|
results. Tests that only need timing skip the second pass; tests that
|
||||||
|
need to verify correctness opt in.
|
||||||
|
|
||||||
|
<!-- src: ADR-0020 Decision, Consequences -->
|
||||||
|
The split lets the timing engine remain unconcerned with data
|
||||||
|
semantics: kernels move handles around, not bytes. The replay phase
|
||||||
|
recovers data semantics from the recorded operations, in their
|
||||||
|
original time order with a small set of secondary-sort rules. The
|
||||||
|
op-log records carry enough metadata — input snapshots for compute
|
||||||
|
operations, source snapshots for cross-component copies — that the
|
||||||
|
replay phase cannot mis-order with respect to in-flight mutations.
|
||||||
|
|
||||||
|
### Sim-engine Op Log and Memory Store Schemas
|
||||||
|
|
||||||
|
<!-- src: ADR-0052 Context, Decision -->
|
||||||
|
The operation log holds typed records with seven fields each: start
|
||||||
|
and end timestamps, the component that issued the operation, an
|
||||||
|
operation kind ("memory", "gemm", "math"), an operation name, a
|
||||||
|
parameter dictionary, and a (currently unused) dependency list.
|
||||||
|
Records are kept in stable timestamp order. The parameter dictionary
|
||||||
|
varies by operation: a DMA read carries source address and byte count;
|
||||||
|
a GEMM carries operand shapes, dtypes, and address spaces; a math
|
||||||
|
operation carries input addresses and snapshots.
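
A record with these seven fields could be modeled as a dataclass. The field names, the component identifiers, and the choice to sort on the start timestamp are assumptions of this sketch; only the field list and the three operation kinds come from the text.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component: str
    kind: str                      # "memory", "gemm", or "math"
    name: str
    params: dict[str, Any]
    deps: list[int] = field(default_factory=list)   # currently unused


log = [
    OpRecord(5.0, 9.0, "pe0.gemm", "gemm", "dot", {"m": 16, "n": 16, "k": 32}),
    OpRecord(0.0, 4.0, "pe0.dma", "memory", "dma_read", {"src": 0x1000, "nbytes": 512}),
    OpRecord(5.0, 6.0, "pe0.math", "math", "exp", {"n": 256}),
]
# list.sort is stable: records with equal start times keep their relative order.
log.sort(key=lambda r: r.t_start)
names = [r.name for r in log]
```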
|
||||||
|
|
||||||
|
<!-- src: ADR-0052 Decision, Consequences -->
|
||||||
|
The companion memory store is a two-level dictionary keyed by
|
||||||
|
address space ("hbm", "tcm", "sram", others) and integer address.
|
||||||
|
Reads and writes are reference-based — no copy by default — so
|
||||||
|
callers wanting to detach a snapshot must copy explicitly. This is
|
||||||
|
deliberate: the engine-internal snapshot paths copy at well-defined
|
||||||
|
points (math input capture, HBM source capture for DMA writes,
|
||||||
|
inbound collective copies) and downstream replay code therefore
|
||||||
|
sees stable data even when slot or scratch addresses are reused by
|
||||||
|
later operations.
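
The reference-versus-snapshot distinction is easy to demonstrate with a toy store. The class below is a hypothetical stand-in using plain Python lists as payloads; it is not the simulator's memory store.

```python
import copy


class MemoryStore:
    """Two-level dictionary: address space -> integer address -> payload."""

    def __init__(self):
        self._spaces = {}

    def write(self, space, addr, value):
        # Stores the reference; no copy is made.
        self._spaces.setdefault(space, {})[addr] = value

    def read(self, space, addr):
        # Returns the same object that was written.
        return self._spaces[space][addr]


store = MemoryStore()
tile = [1.0, 2.0]
store.write("tcm", 0x40, tile)

alias = store.read("tcm", 0x40)                       # shares storage with tile
snapshot = copy.deepcopy(store.read("tcm", 0x40))     # detached explicitly

tile[0] = 99.0   # a later operation reuses the slot
```

After the mutation, `alias` sees the new value while `snapshot` still holds the original, which is why the snapshot points named above copy explicitly.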
|
||||||
|
|
||||||
|
### 2D Grid Program Identity
|
||||||
|
|
||||||
|
<!-- src: ADR-0022 Context, Decision -->
|
||||||
|
Inside a kernel the program identity is two-dimensional. The
|
||||||
|
first axis corresponds to the PE index within a cube; the second
|
||||||
|
corresponds to the cube index within a SIP. Together they let a
|
||||||
|
kernel address its position both within its cube and within the
|
||||||
|
larger system without needing to know the full topology. Total
|
||||||
|
program counts along each axis are exposed symmetrically.
|
||||||
|
|
||||||
|
### Parallelism — SIP Launcher, DPPolicy, Megatron-TP, AHBM Backend, and CCL Algorithm Module
|
||||||
|
|
||||||
|
<!-- src: ADR-0024 Context, Decision -->
|
||||||
|
The launcher model treats each SIP as one rank. Inside a process the
|
||||||
|
launcher spawns one greenlet per SIP rank; the rank is bound to its
|
||||||
|
greenlet so that any code running in that worker sees the right
|
||||||
|
distributed-style rank. This is a deliberately PyTorch-compatible
|
||||||
|
shape: a benchmark looks like a small DDP training script — initialize
|
||||||
|
a process group, spawn workers, each worker runs the same body.
|
||||||
|
|
||||||
|
<!-- src: ADR-0026 Context, Decision -->
|
||||||
|
Data-parallelism policy lives in a single object that names the
|
||||||
|
sharding strategy along the cube axis (replicate, row-wise,
|
||||||
|
column-wise) and along the PE axis (same set of values), and optionally
|
||||||
|
overrides the number of cubes or PEs participating. The policy is
|
||||||
|
intra-device — it does not cross SIP boundaries. SIP-level parallelism
|
||||||
|
is the launcher's responsibility, and the two axes compose
|
||||||
|
orthogonally.
|
||||||
|
|
||||||
|
<!-- src: ADR-0027 Context, Decision -->
|
||||||
|
A Megatron-style tensor-parallel API sits on top of the launcher and
|
||||||
|
the DP policy. Layer-level building blocks — column-parallel linear,
|
||||||
|
row-parallel linear, all-reduce — name their sharding intent in terms
|
||||||
|
the launcher and the placement policy can compose. This is the layer
|
||||||
|
that bench code typically writes against.
|
||||||
|
|
||||||
|
<!-- src: ADR-0047 Context, Decision -->
|
||||||
|
For collective operations the runtime exposes a PyTorch-compatible
|
||||||
|
distributed backend named "ahbm". On process-group initialization the
|
||||||
|
backend loads the configured collective-algorithm module, resolves
|
||||||
|
the world size (priority: explicit ccl.yaml override → defaults
|
||||||
|
section → topology SIP count), imports the algorithm module
|
||||||
|
dynamically, derives the SIP topology kind, and pushes the inter-PE
|
||||||
|
neighbor table to every participating PE. From that point on, an
|
||||||
|
all-reduce call dispatches the algorithm's kernel function across
|
||||||
|
all ranks.
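
The world-size priority chain can be written as a small resolver. The configuration keys (`world_size`, `defaults`) are guesses made for this sketch; the actual ccl.yaml schema is not shown in this report.

```python
def resolve_world_size(ccl_config, topology_sip_count):
    """Explicit override, then defaults section, then topology SIP count."""
    override = ccl_config.get("world_size")
    if override is not None:
        return override
    default = ccl_config.get("defaults", {}).get("world_size")
    if default is not None:
        return default
    return topology_sip_count


explicit = resolve_world_size({"world_size": 2, "defaults": {"world_size": 4}}, 8)
from_defaults = resolve_world_size({"defaults": {"world_size": 4}}, 8)
from_topology = resolve_world_size({}, 8)
```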
|
||||||
|
|
||||||
|
<!-- src: ADR-0050 Context, Decision -->
|
||||||
|
A collective-algorithm module is a Python module with a small, fixed
|
||||||
|
contract. It exposes topology-kind integer constants, a name-to-kind
|
||||||
|
mapping for the YAML configuration, a kernel-arguments builder, and
|
||||||
|
a kernel function — the kernel function being aliased to the name
|
||||||
|
`kernel` so the backend can find it generically. The kernel itself
|
||||||
|
takes the tensor pointer, the per-cube element count, cube mesh
|
||||||
|
width and height, the world size, the current rank, and the SIP
|
||||||
|
topology dimensions; the backend appends those last four arguments
|
||||||
|
automatically. New collectives slot in by adding a new module that
|
||||||
|
follows this shape.
|
||||||
|
|
||||||
|
<!-- src: ADR-0027 Decision, Consequences -->
|
||||||
|
The combination is deliberate: bench authors get to write code that
|
||||||
|
looks like a regular distributed training script, while the launcher,
|
||||||
|
backend, and placement policies behind it remain free to redirect
|
||||||
|
work to the right SIP, cube, and PE without exposing topology to the
|
||||||
|
kernel.
|
||||||
|
|
||||||
|
### IPCQ Direction Addressing
|
||||||
|
|
||||||
|
<!-- src: ADR-0025 Context, Decision -->
|
||||||
|
Inside a collective algorithm, peer PEs are named by direction —
|
||||||
|
"N", "S", "E", "W" for cube-internal neighbors, and "global_*" for
|
||||||
|
cross-SIP neighbors. Direction addressing is the addressing scheme:
|
||||||
|
the algorithm names a direction, the IPCQ neighbor table installed
|
||||||
|
at process-group time resolves the direction to the peer endpoint's
|
||||||
|
physical-address coordinates, and the PE_DMA performs the actual
|
||||||
|
transfer. The algorithm itself does not see PA arithmetic — direction
|
||||||
|
is the user-facing handle.
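
For the cube-internal directions, a neighbor table on a 2D grid might be derived as in the sketch below. The coordinate convention (north decreases `y`), the non-wrapping edges, and the function name are assumptions; the real table resolves to physical-address coordinates and also carries the `global_*` entries, which are omitted here.

```python
def neighbor_table(x, y, width, height):
    """Map direction names to peer grid coordinates, dropping off-grid peers."""
    candidates = {
        "N": (x, y - 1),
        "S": (x, y + 1),
        "E": (x + 1, y),
        "W": (x - 1, y),
    }
    return {
        direction: pos
        for direction, pos in candidates.items()
        if 0 <= pos[0] < width and 0 <= pos[1] < height
    }


corner = neighbor_table(0, 0, width=2, height=2)
interior = neighbor_table(1, 1, width=3, height=3)
```

An algorithm written against this table only ever names a direction; whether a peer exists in that direction is a property of the table, not of the kernel.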
|
||||||
|
|
||||||
|
### Intercube All-Reduce
|
||||||
|
|
||||||
|
<!-- src: ADR-0032 Context, Decision -->
|
||||||
|
The default all-reduce algorithm uses a center-rooted bidirectional
|
||||||
|
phase inside each SIP's cube mesh followed by an inter-SIP exchange
|
||||||
|
on the mesh's root cube, and then a bidirectional broadcast back
|
||||||
|
out. Center-rooting roughly halves the in-cube hop count compared with a
|
||||||
|
corner-rooted walk. The inter-SIP exchange itself follows the
|
||||||
|
configured SIP topology — ring, torus, or non-wrapping mesh —
|
||||||
|
selected at runtime through the SIP-topology kind integer the
|
||||||
|
backend passes to the kernel.
|
||||||
|
|
||||||
|
### Evaluation Harnesses
|
||||||
|
|
||||||
|
<!-- src: ADR-0043 Context, Decision -->
|
||||||
|
The all-reduce evaluation harness drives correctness and the
|
||||||
|
latency/buffer-kind sweeps through the public distributed path —
|
||||||
|
initialize process group, spawn workers, call all-reduce — rather
|
||||||
|
than the lower-level engine interface. A shared helper module factors
|
||||||
|
out the setup; sweep tests cover the buffer-kind tiers (TCM, SRAM,
|
||||||
|
HBM) and the inter-SIP topology variants. The plots produced by the
|
||||||
|
harness are part of its output contract; the harness regenerates them
|
||||||
|
on demand.
|
||||||
|
|
||||||
|
<!-- src: ADR-0044 Context, Decision -->
|
||||||
|
The GEMM evaluation harness is split into two layers. A heavy
|
||||||
|
shape-and-variant sweep lives as a manual script — it runs the same
|
||||||
|
composite-GEMM benchmark across many shapes and operand-staging
|
||||||
|
variants, harvests the resulting op-log, and writes a JSON summary.
|
||||||
|
A faster figure-generation layer lives in the test suite and consumes
|
||||||
|
that JSON to render plots. The split keeps the heavy data
|
||||||
|
generation explicit and out of the regular test path.

### Bench Module Contract

<!-- src: ADR-0045 Context, Decision -->
Adding a new benchmark requires only dropping a file into the
benchmarks directory. The file registers one or more benchmark
functions through a small decorator that takes a kebab-case name and
a human-readable description. The decorator is the registration
mechanism; there is no separate manifest. Each benchmark function
takes one argument, conventionally named `torch`, which is the
runtime context exposing tensor allocation, kernel launch,
distributed APIs, and process-spawning. The function name is `run` by
convention.

<!-- src: ADR-0045 Decision, Consequences -->
A benchmark must submit at least one operation, or the runner
returns an error. A benchmark instance is single-device by default;
when a benchmark is collective, it uses the distributed-process-spawn
pattern internally: one worker greenlet per rank, with each worker
binding to its rank. Multi-device benchmark patterns outside that
shape are not supported.

### Kernel-side `tl.*` API

<!-- src: ADR-0046 Context, Decision -->
Inside a kernel function, the `tl` argument exposes the kernel-side
API in a shape that mirrors the conventions of established
GPU-kernel languages. Its operations fall into six categories:
reference handles that name HBM data without issuing DMA; data
movement (load, store) that does issue DMA; GEMM and math compute
(dot, composite, the unary and binary math operations, reductions);
index and scalar helpers (program identity, range-builders);
metadata-only operations like transpose; and the collective
primitives (send, receive, non-blocking receive). Tensor handles
support arithmetic operators via a thread-local active context so
kernel code reads naturally.

<!-- src: ADR-0046 Decision, Consequences -->
The API supports two execution modes. A command-list mode records
operations into a list without consuming simulator time, which is
useful for inspection and lightweight tests. A greenlet-driven mode
runs the kernel as a child greenlet that switches back to the
simulator on each `tl.*` call; the simulator drives the event
scheduler and hands real data back to the kernel as DMA reads
complete. The two modes share the same surface; the kernel does not
know which one it is running under.
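
The command-list mode and the operator support on tensor handles can be sketched together. The classes `CommandListContext` and `Handle`, the tuple-shaped commands, and the `load`/`store` signatures are assumptions made for illustration; the actual `tl.*` surface is defined by ADR-0046.

```python
import threading
from dataclasses import dataclass, field

_active = threading.local()  # holds the context the current thread records into


@dataclass(frozen=True)
class Handle:
    """Opaque tensor handle; arithmetic is forwarded to the active context."""
    id: int

    def __add__(self, other: "Handle") -> "Handle":
        ctx = _active.ctx
        out = ctx.new_handle()
        ctx.commands.append(("add", self.id, other.id, out.id))
        return out


@dataclass
class CommandListContext:
    """Records operations instead of executing them (no simulator time)."""
    commands: list = field(default_factory=list)
    _next_id: int = 0

    def __enter__(self) -> "CommandListContext":
        _active.ctx = self
        return self

    def __exit__(self, *exc) -> None:
        _active.ctx = None

    def new_handle(self) -> Handle:
        handle = Handle(self._next_id)
        self._next_id += 1
        return handle

    def load(self, name: str) -> Handle:
        out = self.new_handle()
        self.commands.append(("load", name, out.id))
        return out

    def store(self, name: str, value: Handle) -> None:
        self.commands.append(("store", name, value.id))


def kernel(tl) -> None:
    a = tl.load("x")
    b = tl.load("y")
    tl.store("z", a + b)  # "+" records an op through the thread-local context


with CommandListContext() as tl:
    kernel(tl)
recorded = tl.commands
```

A greenlet-driven context would expose the same `load` and `store` methods but hand control back to the simulator inside each call, which is why the kernel body above would not need to change.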

### Probe Subcommand

<!-- src: ADR-0049 Context, Decision -->
The probe utility runs three families of traffic patterns through
the engine (host-to-device writes at increasing hop counts,
device-to-host reads at increasing hop counts, and PE-initiated DMA
across the cube mesh) and reports actual latency, the analytical
formula breakdown, effective bandwidth, bottleneck bandwidth, and
utilization. A fixed reference size is used for the summary table;
a separate utilization-versus-size sweep covers a logarithmic range
of transfer sizes. Each case runs in its own engine instance so
cases do not perturb each other.
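
The relationship between the reported quantities can be sketched as follows. The definitions used here (effective bandwidth as bytes over latency, utilization as effective over bottleneck bandwidth) and the toy latency model with a fixed overhead are assumptions for illustration; the probe's actual analytical formula is specified in ADR-0049, and the numbers are placeholders.

```python
def effective_bandwidth(size_bytes: int, latency_ns: float) -> float:
    """Bytes per nanosecond, which is numerically equal to GB/s."""
    return size_bytes / latency_ns


def utilization(size_bytes: int, latency_ns: float, bottleneck_gbps: float) -> float:
    """Fraction of the bottleneck link bandwidth actually achieved."""
    return effective_bandwidth(size_bytes, latency_ns) / bottleneck_gbps


def modeled_latency_ns(size_bytes: int, fixed_ns: float, bottleneck_gbps: float) -> float:
    """Toy model: fixed per-transfer overhead plus serialization time."""
    return fixed_ns + size_bytes / bottleneck_gbps


FIXED_NS = 500.0      # hypothetical fixed overhead
BOTTLENECK = 64.0     # hypothetical bottleneck bandwidth in GB/s

# Logarithmic sweep of transfer sizes: 64 B up to 1 MiB.
sizes = [2 ** k for k in range(6, 21, 2)]
sweep = [
    (s, utilization(s, modeled_latency_ns(s, FIXED_NS, BOTTLENECK), BOTTLENECK))
    for s in sizes
]
for size, util in sweep:
    print(f"{size:>8d} B  utilization={util:.3f}")
```

Under this model small transfers are dominated by the fixed overhead, so utilization rises toward 1.0 only as the transfer size grows, which is the shape a utilization-versus-size sweep is meant to expose.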

<!-- src: ADR-0049 Decision, Consequences -->
The probe also checks a small set of invariants automatically:
monotonic latency increase with hop count, device-to-host latency
at least as large as host-to-device latency for the same hop count,
and a best-case cross-cube PE DMA path that is faster than the
worst-case path. Invariant failures are printed prominently. The
output is meant for human reading; automated parsing should not
depend on column widths or whitespace.
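
The three invariants can be expressed as a short checker. The function name, the dictionary inputs keyed by hop count, and the choice to require a strict increase are assumptions of this sketch, not the probe's implementation.

```python
from typing import Dict, List


def check_invariants(
    h2d_ns: Dict[int, float],
    d2h_ns: Dict[int, float],
    pe_best_ns: float,
    pe_worst_ns: float,
) -> List[str]:
    """Return human-readable failures; an empty list means all checks pass."""
    failures: List[str] = []

    # 1. Latency grows with hop count (assumed strict in this sketch).
    for label, series in (("H2D", h2d_ns), ("D2H", d2h_ns)):
        hops = sorted(series)
        for lo, hi in zip(hops, hops[1:]):
            if series[hi] <= series[lo]:
                failures.append(
                    f"{label}: latency did not increase from {lo} to {hi} hops"
                )

    # 2. D2H latency is at least H2D latency at the same hop count.
    for hop in sorted(set(h2d_ns) & set(d2h_ns)):
        if d2h_ns[hop] < h2d_ns[hop]:
            failures.append(f"D2H faster than H2D at {hop} hops")

    # 3. Best-case cross-cube PE DMA beats the worst case.
    if not pe_best_ns < pe_worst_ns:
        failures.append("PE DMA best case is not faster than worst case")

    return failures


# Placeholder latencies in nanoseconds, not measurements.
good = check_invariants({1: 100.0, 2: 140.0}, {1: 120.0, 2: 160.0}, 80.0, 200.0)
bad = check_invariants({1: 100.0, 2: 90.0}, {1: 95.0, 2: 160.0}, 200.0, 200.0)
for message in bad:
    print("INVARIANT FAILED:", message)
```

Returning messages instead of raising lets a caller print every failure in one run rather than stopping at the first.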

---

This document summarizes 46 architecture decisions captured during
the first half of 2026. It is regenerated mechanically from the
decision corpus; sources are recorded in HTML comments throughout.
@@ -0,0 +1,333 @@
"""Generate docs/adr/INDEX.md (and docs/adr-ko/INDEX.md) from the ADR corpus.

Auto-derives a section-based index following the same classification as
the /report skill: Design Principles / High-level Architecture /
Detailed Architecture (by component) / Implementation Decisions
(by topic). Run before publishing to refresh INDEX.md.

The classification table below is the single source of truth. When a new
ADR is added under docs/adr/, append an entry to ``CLASSIFICATION``. The
script exits 1 if any ADR file is missing from the table or any title
cannot be parsed, so omissions surface in CI.

Usage:
    python tools/generate_adr_index.py [--root <repo-root>] [--check]

    --check : exit 1 if the generated INDEX differs from the on-disk file
              (used by CI to detect un-regenerated indexes).
"""

from __future__ import annotations

import argparse
import re
import sys
from pathlib import Path

ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-([a-z0-9_-]+)\.md$")
# Title separator may be ":" (most ADRs) or "—" (em-dash; ADR-0033 uses
# this). The verifier (tools/verify_adr_lang_pairs.py) only checks the
# number, so both styles already coexist in the corpus.
TITLE_RE = re.compile(r"^# ADR-(\d{4})\s*[:—]\s*(.+?)\s*$")

DESIGN_PRINCIPLES = "Design Principles"
HIGH_LEVEL = "High-level Architecture"
DETAILED = "Detailed Architecture"
IMPL_DECISIONS = "Implementation Decisions"


# (section, subgroup) per ADR. subgroup is used to sub-divide Detailed
# (by component, see DETAILED_COMPONENTS) and Implementation (by topic).
# Add a line here when introducing a new ADR.
CLASSIFICATION: dict[int, tuple[str, str | None]] = {
    # Design Principles
    13: (DESIGN_PRINCIPLES, None),
    33: (DESIGN_PRINCIPLES, None),

    # High-level Architecture
    3: (HIGH_LEVEL, "System hierarchy (Tray / SIP / CUBE / PE)"),
    7: (HIGH_LEVEL, "Runtime API ↔ sim_engine boundaries"),
    16: (HIGH_LEVEL, "IOChiplet NOC and memory data path"),
    17: (HIGH_LEVEL, "Cube NOC and HBM connectivity"),

    # Detailed Architecture (subgroup is informational; rendering uses DETAILED_COMPONENTS)
    14: (DETAILED, "pe_pipeline"),  # covers pe_cpu/pe_dma/pe_fetch_store/pe_gemm/pe_math/pe_scheduler
    23: (DETAILED, "pe_ipcq"),
    34: (DETAILED, "hbm_ctrl"),
    35: (DETAILED, "m_cpu"),
    36: (DETAILED, "io_cpu"),
    37: (DETAILED, "forwarding"),
    38: (DETAILED, "pcie_ep"),
    39: (DETAILED, "pe_mmu"),
    40: (DETAILED, "pe_tcm"),
    41: (DETAILED, "sram"),
    42: (DETAILED, "tiling"),

    # Implementation Decisions
    1: (IMPL_DECISIONS, "Address Scheme"),
    2: (IMPL_DECISIONS, "Routing & Helper API"),
    4: (IMPL_DECISIONS, "Memory Semantics & Local-HBM Bandwidth"),
    5: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
    6: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
    8: (IMPL_DECISIONS, "Tensor Deployment and Allocation"),
    9: (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"),
    10: (IMPL_DECISIONS, "CLI Surface and Semantics"),
    11: (IMPL_DECISIONS, "Address Scheme"),
    12: (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"),
    15: (IMPL_DECISIONS, "Component Port/Wire Fabric Model"),
    20: (IMPL_DECISIONS, "Two-Pass Data Execution"),
    22: (IMPL_DECISIONS, "2D Grid Program Identity"),
    24: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    25: (IMPL_DECISIONS, "IPCQ Direction Addressing"),
    26: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    27: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    32: (IMPL_DECISIONS, "Intercube All-Reduce"),
    43: (IMPL_DECISIONS, "Evaluation Harnesses"),
    44: (IMPL_DECISIONS, "Evaluation Harnesses"),
    45: (IMPL_DECISIONS, "Bench Module Contract"),
    46: (IMPL_DECISIONS, "Kernel-side tl.* API (TLContext)"),
    47: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    48: (IMPL_DECISIONS, "Memory Allocator Algorithms"),
    49: (IMPL_DECISIONS, "Probe Subcommand"),
    50: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    51: (IMPL_DECISIONS, "Routing & Helper API"),
    52: (IMPL_DECISIONS, "Sim-engine Op Log and Memory Store Schemas"),
    53: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
}

# Canonical component order for the Detailed Architecture section.
# Each entry: (component_name, list[ADR-numbers that cover it]).
# Order matches src/kernbench/components/builtin/*.py alphabetical
# (the same order /report uses).
DETAILED_COMPONENTS: list[tuple[str, list[int]]] = [
    ("forwarding", [37]),
    ("hbm_ctrl", [34]),
    ("io_cpu", [36]),
    ("m_cpu", [35]),
    ("pcie_ep", [38]),
    ("pe_cpu", [14]),
    ("pe_dma", [14, 23]),
    ("pe_fetch_store", [14]),
    ("pe_gemm", [14]),
    ("pe_ipcq", [23]),
    ("pe_math", [14]),
    ("pe_mmu", [39]),
    ("pe_scheduler", [14]),
    ("pe_tcm", [40]),
    ("sram", [41]),
    ("tiling", [42]),
]


def _strip_bom(text: str) -> str:
    """Strip leading UTF-8 BOM if present."""
    if text and ord(text[0]) == 0xFEFF:
        return text[1:]
    return text


def _find_adrs(adr_dir: Path) -> list[tuple[int, str, Path]]:
    """Return [(num, slug, path), ...] for ADR files in adr_dir, sorted by num."""
    out: list[tuple[int, str, Path]] = []
    for p in sorted(adr_dir.iterdir()):
        if not p.is_file():
            continue
        m = ADR_FILENAME_RE.match(p.name)
        if not m:
            continue
        out.append((int(m.group(1)), m.group(2), p))
    out.sort(key=lambda t: t[0])
    return out


def _extract_title(path: Path) -> str:
    """Parse the title from the first line `# ADR-NNNN: <title>` (":" or "—" separator). Strips BOM."""
    text = _strip_bom(path.read_text(encoding="utf-8"))
    first_line = text.split("\n", 1)[0] if text else ""
    m = TITLE_RE.match(first_line)
    if not m:
        raise ValueError(
            f"{path.name}: cannot parse title from first line: {first_line!r}"
        )
    return m.group(2)


def _build_index(adr_dir: Path, link_prefix: str) -> str:
    """Build the INDEX.md text for adr_dir.

    link_prefix is the relative href used for ADR links (e.g., ``./``
    so links resolve relative to the INDEX file location).
    """
    adrs = _find_adrs(adr_dir)
    if not adrs:
        raise RuntimeError(f"No ADR files found under {adr_dir}")

    # Validate every ADR is classified.
    missing = sorted(num for num, _slug, _ in adrs if num not in CLASSIFICATION)
    if missing:
        raise RuntimeError(
            "ADR(s) missing from CLASSIFICATION table in "
            "tools/generate_adr_index.py: "
            + ", ".join(f"ADR-{n:04d}" for n in missing)
            + ". Add an entry for each."
        )

    # Map: num → (filename, title)
    num_to_meta: dict[int, tuple[str, str]] = {}
    for num, _slug, path in adrs:
        num_to_meta[num] = (path.name, _extract_title(path))

    # ── Section assembly ────────────────────────────────────────────
    lines: list[str] = []
    lines.append("# ADR Index")
    lines.append("")
    lines.append(
        f"Auto-generated by `tools/generate_adr_index.py`. "
        f"Total ADRs: **{len(adrs)}**."
    )
    lines.append("")
    lines.append(
        "Classification mirrors the `/report` skill's section assignment. "
        "When adding a new ADR, also add an entry to the "
        "`CLASSIFICATION` table in `tools/generate_adr_index.py`."
    )
    lines.append("")

    def fmt_entry(num: int) -> str:
        fname, title = num_to_meta[num]
        return f"- [ADR-{num:04d}]({link_prefix}{fname}) — {title}"

    # Design Principles
    lines.append("## Design Principles")
    lines.append("")
    nums = sorted(n for n, (sec, _) in CLASSIFICATION.items()
                  if sec == DESIGN_PRINCIPLES and n in num_to_meta)
    for n in nums:
        lines.append(fmt_entry(n))
    lines.append("")

    # High-level Architecture (sorted by ADR number; subgroup shown in italics)
    lines.append("## High-level Architecture")
    lines.append("")
    nums = sorted(n for n, (sec, _) in CLASSIFICATION.items()
                  if sec == HIGH_LEVEL and n in num_to_meta)
    for n in nums:
        sub = CLASSIFICATION[n][1] or ""
        fname, title = num_to_meta[n]
        if sub:
            lines.append(
                f"- [ADR-{n:04d}]({link_prefix}{fname}) — {title}"
                f" _({sub})_"
            )
        else:
            lines.append(fmt_entry(n))
    lines.append("")

    # Detailed Architecture (canonical component order)
    lines.append("## Detailed Architecture")
    lines.append("")
    lines.append("One subsection per component file under `src/kernbench/components/builtin/`.")
    lines.append("")
    for comp, adr_nums in DETAILED_COMPONENTS:
        lines.append(f"### {comp}")
        lines.append("")
        if adr_nums:
            for n in adr_nums:
                if n not in num_to_meta:
                    raise RuntimeError(
                        f"DETAILED_COMPONENTS references ADR-{n:04d} for "
                        f"'{comp}' but no such ADR file exists."
                    )
                lines.append(fmt_entry(n))
        else:
            lines.append("_(no ADR coverage)_")
        lines.append("")

    # Implementation Decisions: group ADRs by topic (the subgroup string).
    lines.append("## Implementation Decisions")
    lines.append("")
    topic_order: list[str] = []
    topic_to_nums: dict[str, list[int]] = {}
    for n, (sec, sub) in CLASSIFICATION.items():
        if sec != IMPL_DECISIONS or n not in num_to_meta:
            continue
        topic = sub or "Uncategorized"
        if topic not in topic_to_nums:
            topic_order.append(topic)
            topic_to_nums[topic] = []
        topic_to_nums[topic].append(n)
    # Stable order: by smallest ADR-number in topic, so older infra appears first.
    topic_order.sort(key=lambda t: min(topic_to_nums[t]))
    for topic in topic_order:
        lines.append(f"### {topic}")
        lines.append("")
        for n in sorted(topic_to_nums[topic]):
            lines.append(fmt_entry(n))
        lines.append("")

    return "\n".join(lines).rstrip() + "\n"


def _check_or_write(path: Path, content: str, check: bool) -> bool:
    """Write content to path, or only compare in --check mode. Returns True only when --check finds a diff."""
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    if check:
        if existing != content:
            print(f"[diff] {path} would change.")
            return True
        return False
    path.write_text(content, encoding="utf-8")
    if existing != content:
        print(f"[wrote] {path}")
    else:
        print(f"[unchanged] {path}")
    return False


def main(argv: list[str] | None = None) -> int:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument(
        "--root", type=Path, default=Path.cwd(),
        help="Repository root (default: cwd)",
    )
    p.add_argument(
        "--check", action="store_true",
        help="Exit 1 if generated INDEX would differ from disk",
    )
    args = p.parse_args(argv)

    en_dir = args.root / "docs" / "adr"
    ko_dir = args.root / "docs" / "adr-ko"

    if not en_dir.is_dir():
        print(f"error: {en_dir} does not exist", file=sys.stderr)
        return 1

    any_diff = False
    try:
        en_index = _build_index(en_dir, link_prefix="./")
    except (RuntimeError, ValueError) as e:
        print(f"error (EN): {e}", file=sys.stderr)
        return 1
    any_diff |= _check_or_write(en_dir / "INDEX.md", en_index, args.check)

    if ko_dir.is_dir():
        try:
            ko_index = _build_index(ko_dir, link_prefix="./")
        except (RuntimeError, ValueError) as e:
            print(f"error (KO): {e}", file=sys.stderr)
            return 1
        any_diff |= _check_or_write(ko_dir / "INDEX.md", ko_index, args.check)

    if args.check and any_diff:
        print(
            "INDEX.md is out of date. "
            "Run `python tools/generate_adr_index.py` to refresh.",
            file=sys.stderr,
        )
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())