adr: add INDEX.md (auto-generated by tools/generate_adr_index.py)

Adds a section-based table of contents for the 46-ADR corpus, mirroring the /report skill's classification (Design Principles / High-level Architecture / Detailed Architecture by component / Implementation Decisions by topic). Generated for both docs/adr/ (EN titles) and docs/adr-ko/ (KO titles) from one tool.

tools/generate_adr_index.py:
- A single CLASSIFICATION dict with one entry per ADR. Add an entry when introducing a new ADR; the script fails loudly if any file is missing from the table.
- DETAILED_COMPONENTS lists each builtin component and the ADR(s) that cover it (ADR-0014 appears under six PE engines; ADR-0023 under pe_dma + pe_ipcq).
- Accepts both ":" and "—" title separators (matching ADR-0033's existing format).
- --check mode for CI: exits 1 if INDEX.md is stale.

Also includes docs/report/architecture-2026-1H.md, generated by the prior /report write (the public-facing architecture document; 836 lines, 76 source-attribution comments).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,174 @@
# ADR Index

Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **46**.

Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.

## Design Principles

- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — 검증 전략 및 Phase 1 테스트 계획
- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — 레이턴시 모델: 가정 및 알려진 단순화

## High-level Architecture

- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — 타겟 시스템 계층 및 모델링 범위 _(System hierarchy (Tray / SIP / CUBE / PE))_
- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — 런타임 API 및 시뮬레이션 엔진 경계 _(Runtime API ↔ sim_engine boundaries)_
- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NoC와 메모리 데이터 경로 _(IOChiplet NOC and memory data path)_
- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — 큐브 NoC와 HBM 연결성 _(Cube NOC and HBM connectivity)_

## Detailed Architecture

One subsection per component file under `src/kernbench/components/builtin/`.

### forwarding

- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding 컴포넌트 (forwarding_v1)

### hbm_ctrl

- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM 컨트롤러 내부 설계

### io_cpu

- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU 컴포넌트 모델

### m_cpu

- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU 및 M_CPU.DMA 컴포넌트 모델

### pcie_ep

- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model

### pe_cpu

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델

### pe_dma

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication

### pe_fetch_store

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델

### pe_gemm

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델

### pe_ipcq

- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication

### pe_math

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델

### pe_mmu

- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — 컴포넌트 + 유틸리티 이중 역할

### pe_scheduler

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델

### pe_tcm

- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — 듀얼 채널 BW 직렬화

### sram

- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC

### tiling

- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math 파이프라인 plan 빌더

## Implementation Decisions

### Address Scheme

- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51비트 물리 주소 레이아웃 및 디코딩 계약
- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — 메모리 주소 지정 — PA / VA / LA 주소 모델

### Routing & Helper API

- [ADR-0002](./ADR-0002-lat-routing-distance.md) — 라우팅 거리, 순서 및 우회 규칙
- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`

### Memory Semantics & Local-HBM Bandwidth

- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — 메모리 시맨틱 및 로컬 HBM 대역폭 보장

### Topology Compilation, Diagrams & Builder Algorithms

- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — 다이어그램 뷰 및 거리 기반 레이아웃 규칙
- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — 토폴로지 컴파일, 거리 추출, 그리고 자동 다이어그램 생성
- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms

### Tensor Deployment and Allocation

- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — 텐서 배포 및 할당 (호스트 할당기, PA 우선)

### Kernel Execution and Host-Device Messaging

- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — 커널 실행 메시징 및 완료 시맨틱
- [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU 메시지 스키마 (PA-우선, PE-태깅)

### CLI Surface and Semantics

- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — 명령줄 인터페이스 및 실행 시맨틱

### Component Port/Wire Fabric Model

- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — 컴포넌트 포트/와이어 모델과 패브릭 라우팅

### Two-Pass Data Execution

- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리)

### 2D Grid Program Identity

- [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D 그리드 program_id 시맨틱

### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)

- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — sip/num_sips 필드 제거
- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`

### IPCQ Direction Addressing

- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching

### Intercube All-Reduce

- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — 큐브 간 All-Reduce — pe0 큐브-메시 리듀스 + 다중-SIP 교환

### Evaluation Harnesses

- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce 평가 하니스 — `tests/sccl/`
- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM 평가 하니스 — `scripts/gemm_sweep.py` + `tests/gemm/`

### Bench Module Contract

- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring

### Kernel-side tl.* API (TLContext)

- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract

### Memory Allocator Algorithms

- [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator

### Probe Subcommand

- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness

### Sim-engine Op Log and Memory Store Schemas

- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -0,0 +1,174 @@
# ADR Index

Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **46**.

Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.

## Design Principles

- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — Verification Strategy and Phase 1 Test Plan
- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — Latency Model: Assumptions and Known Simplifications

## High-level Architecture

- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — Target System Hierarchy & Modeling Scope _(System hierarchy (Tray / SIP / CUBE / PE))_
- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — Runtime API and Simulation Engine Boundaries _(Runtime API ↔ sim_engine boundaries)_
- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NOC and Memory Data Path _(IOChiplet NOC and memory data path)_
- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — Cube NOC and HBM Connectivity _(Cube NOC and HBM connectivity)_

## Detailed Architecture

One subsection per component file under `src/kernbench/components/builtin/`.

### forwarding

- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding Component (forwarding_v1)

### hbm_ctrl

- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM Controller Internal Design

### io_cpu

- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU Component Model

### m_cpu

- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU and M_CPU.DMA Component Model

### pcie_ep

- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model

### pe_cpu

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model

### pe_dma

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication

### pe_fetch_store

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model

### pe_gemm

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model

### pe_ipcq

- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication

### pe_math

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model

### pe_mmu

- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — Component + Utility Dual Role

### pe_scheduler

- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model

### pe_tcm

- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — Dual-Channel BW Serialization

### sram

- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC

### tiling

- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math Pipeline Plan Builders

## Implementation Decisions

### Address Scheme

- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51-bit Physical Address Layout & Decoding Contract
- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — Memory Addressing — PA / VA / LA Address Models

### Routing & Helper API

- [ADR-0002](./ADR-0002-lat-routing-distance.md) — Routing Distance, Ordering & Bypass Rules
- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`

### Memory Semantics & Local-HBM Bandwidth

- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — Memory Semantics & Local-HBM Bandwidth Guarantee

### Topology Compilation, Diagrams & Builder Algorithms

- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — Diagram Views & Distance-Aware Layout Rules
- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — Topology Compilation, Distance Extraction, and Automatic Diagram Generation
- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms

### Tensor Deployment and Allocation

- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — Tensor Deployment and Allocation (Host Allocator, PA-first)

### Kernel Execution and Host-Device Messaging

- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — Kernel Execution Messaging and Completion Semantics
- [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)

### CLI Surface and Semantics

- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — Command Line Interface and Execution Semantics

### Component Port/Wire Fabric Model

- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — Component Port/Wire Model and Fabric Routing

### Two-Pass Data Execution

- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass Data Execution Model (Timing / Data Separation)

### 2D Grid Program Identity

- [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D Grid program_id Semantics

### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)

- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — remove sip/num_sips fields
- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`

### IPCQ Direction Addressing

- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching

### Intercube All-Reduce

- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange

### Evaluation Harnesses

- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce Evaluation Harness — `tests/sccl/`
- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`

### Bench Module Contract

- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring

### Kernel-side tl.* API (TLContext)

- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract

### Memory Allocator Algorithms

- [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator

### Probe Subcommand

- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness

### Sim-engine Op Log and Memory Store Schemas

- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -0,0 +1,836 @@
# KernBench — Architecture Design Document
*2026 1H*

KernBench is a system-level, discrete-event simulator for AI-accelerator
chiplet systems. It models the data-movement and control paths across
the full hardware hierarchy and reports end-to-end execution latency
for kernels dispatched to the device's compute units.

This document is a public summary of the architecture as designed and
implemented in the first half of 2026. It assumes no prior knowledge of
the simulator's internal documents; terms specific to the system are
defined on first use.

---

## Design Principles

KernBench is grounded in two foundational commitments: every measured
latency must trace to explicit, modeled events on the simulator's graph,
and every behavioral claim must be verifiable through tests that target
spec-level invariants rather than incidental implementation details.

<!-- src: ADR-0013 Context, Decision -->
The verification strategy is contract-driven. Tests are written to
validate the architectural contracts that the simulator exposes
(correct routing, deterministic results, monotonic latency under
increasing hop counts) rather than to mirror the call graph of the
implementation. Two phases coexist: a fast timing phase that exercises
the simulator's discrete-event engine and produces a log of operations
with timestamps, and an optional data-replay phase that uses that log
to compute real numerical results. Tests can target either phase.
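
As a rough illustration of the two phases, the sketch below replays a timestamped operation log to compute values. The `OpRecord` schema, the `replay` function, and the two-operand `add`/`mul` operations are assumptions made for this example; they are not KernBench's actual log format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpRecord:
    """One timed operation emitted by the timing phase (hypothetical schema)."""
    start_ns: float
    end_ns: float
    op: str
    dst: str
    srcs: tuple


def replay(log, memory):
    """Data phase: apply logged ops in timestamp order to compute real values."""
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    # Sort on (start time, destination) so ties are broken the same way every run.
    for record in sorted(log, key=lambda r: (r.start_ns, r.dst)):
        a, b = (memory[name] for name in record.srcs)
        memory[record.dst] = ops[record.op](a, b)
    return memory


# The log may arrive in any order; timestamps decide the replay order.
log = [
    OpRecord(10.0, 12.0, "mul", "t1", ("t0", "y")),
    OpRecord(0.0, 4.0, "add", "t0", ("x", "y")),
]
result = replay(log, {"x": 2.0, "y": 3.0})
```

A test that targets the timing phase would assert on the timestamps in `log`; one that targets the data phase would assert on `result`.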

<!-- src: ADR-0033 Context, Decision -->
The latency model is intentionally abstract rather than
cycle-accurate. Each modeled node contributes a configurable per-node
overhead, each link contributes wire delay plus byte-over-bandwidth
serialization, and each terminal service contributes its own service
time. The simulator does not attempt to reproduce cache coherence
protocols, microarchitectural pipelines, or full PCIe/UCIe protocol
correctness; those are explicitly out of scope. The aim is a
simulator that compares system-level configurations meaningfully and
deterministically, not one that reproduces microarchitectural detail.
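
The three latency contributions can be sketched as a simple sum. This is a minimal illustration, not the simulator's code: the `Hop` type, the field names, and the choice to charge serialization at every hop (store-and-forward) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One node plus the link leaving it (illustrative names only)."""
    node_overhead_ns: float        # configurable per-node overhead
    wire_delay_ns: float           # fixed delay of the outgoing link
    bandwidth_bytes_per_ns: float  # link bandwidth used for serialization


def path_latency_ns(hops, size_bytes, service_ns):
    """Sum node overhead, wire delay, and bytes/bandwidth for every hop,
    then add the terminal service time."""
    total = 0.0
    for hop in hops:
        serialization = size_bytes / hop.bandwidth_bytes_per_ns
        total += hop.node_overhead_ns + hop.wire_delay_ns + serialization
    return total + service_ns


two_hops = [Hop(2.0, 1.0, 64.0), Hop(2.0, 1.0, 32.0)]
latency = path_latency_ns(two_hops, size_bytes=256, service_ns=10.0)
```

With these made-up numbers the first hop costs 2 + 1 + 4 ns, the second 2 + 1 + 8 ns, and the terminal adds 10 ns, for 28 ns in total.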

<!-- src: ADR-0033 Decision, Consequences -->
Determinism is a hard requirement. Given identical inputs (topology,
routing policy, and request stream), the simulator must produce
identical outputs, hop traces included. This rules out reliance on
unordered set iteration on the critical path and forces every latency
contribution to come from an explicitly scheduled event on a modeled
component or link. There are no implicit waits, no hardcoded magic
delays, and no shortcuts that bypass the modeled graph.
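
One common way to get this kind of determinism in a discrete-event engine is to break timestamp ties with an insertion counter and to iterate sorted collections instead of sets. The sketch below assumes that approach; `EventQueue` and the node names are hypothetical and do not describe KernBench's engine.

```python
import heapq
import itertools


class EventQueue:
    """Min-heap of (time, sequence, label); the sequence number breaks ties
    by insertion order, so equal timestamps always pop in the same order."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def schedule(self, time_ns, label):
        heapq.heappush(self._heap, (time_ns, next(self._seq), label))

    def run(self):
        trace = []
        while self._heap:
            time_ns, _, label = heapq.heappop(self._heap)
            trace.append((time_ns, label))
        return trace


def simulate():
    queue = EventQueue()
    # Iterate a sorted list, not the set itself, so order never depends on hashing.
    for node in sorted({"router_b", "router_a", "hbm"}):
        queue.schedule(5.0, node)
    queue.schedule(1.0, "pcie_ep")
    return queue.run()


trace = simulate()
```

Running `simulate()` twice yields the same trace, including the order of the three events that share the 5.0 ns timestamp.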

---

## High-level Architecture

<!-- src: ADR-0003 Context, Decision -->
The simulated system is a four-level hierarchy. A **Tray** holds one or
more **SIPs** (system-in-package), each containing a 2D mesh of
**CUBEs** plus one or more **IO chiplets** that connect the SIP to the
host. Each CUBE contains a regular grid of **PEs** (processing
elements) plus its own attached resources: high-bandwidth memory
(HBM), a per-cube SRAM scratchpad, and a management CPU (M_CPU). The PE
itself is a composite of nine sub-components rather than a monolithic
core. This hierarchy is fixed; the parameters along each axis (counts,
mesh dimensions, link widths) are configurable through the topology
spec.

<!-- src: ADR-0007 Context, Decision -->
A clean separation runs along the request flow. A **runtime API** at
the top is the host-facing surface; it exposes tensor and kernel
operations, owns host-side allocation metadata, and is
topology-agnostic: it does not route or fan out. Below it, the
**simulation engine** decomposes runtime operations into discrete graph
requests (memory writes, memory reads, kernel launches, MMU map
installs) and schedules events deterministically. At the bottom,
**components** model device behavior on a graph of nodes connected by
links; they implement the actual latency contributions and pass
requests along. No component reaches up into the runtime API, and no
runtime call shortcuts the engine.

<!-- DIAGRAM: Four-level system hierarchy — Tray containing SIPs, each SIP showing its 2D cube mesh and IO chiplet; one cube blown up to show the router mesh, attached PEs, M_CPU, SRAM, and HBM partition. -->

### Tray

<!-- src: ADR-0003 Decision -->
The Tray is the outermost boundary. It owns the host CPU on one side
and one or more SIPs on the other, connected through a fabric switch.
For collective communication that must traverse multiple SIPs, the
fabric switch acts as the common rendezvous: device-side outbound
traffic from one SIP routes through the switch and back into the
target SIP's IO chiplet.

### SIP

<!-- src: ADR-0003 Decision, ADR-0017 Context -->
A SIP packages a 2D mesh of CUBEs and one or more IO chiplets. The
default topology used by the simulator is a 4×4 cube mesh; the
mesh dimensions are configurable. Each cube on the boundary of the
mesh connects to its neighbors over UCIe (die-to-die) links arranged
on the four cardinal sides: north, south, east, and west. The IO
chiplets sit on one side of the SIP and provide the bridge to the host
across PCIe.

<!-- src: ADR-0016 Context, Decision -->
The IO chiplet contains its own internal network. A host-facing PCIe
endpoint passes traffic to a small NOC ("network on chip"); from there
traffic can branch to a control-plane CPU that processes kernel-launch
messages, or it can take the direct memory data path to the cube's HBM
controller. Providing a direct memory path that bypasses the control
CPU was a deliberate choice to keep host-issued memory writes from
paying control-plane overhead on the data path.

### CUBE

<!-- src: ADR-0017 Decision -->
Each CUBE owns a 2D mesh of NOC routers and a set of attached
resources: PEs, the cube-local SRAM scratchpad, the management CPU
(M_CPU), and the HBM partition (split across multiple PE-private
slices for bandwidth). The router mesh uses deterministic XY routing.
Attached components do not connect to each other directly; they all
sit on the router mesh, and every cube-internal transfer pays the
mesh distance from source to destination.
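
Deterministic XY routing is a standard mesh algorithm: travel along the X axis until the column matches, then along Y. The sketch below shows the textbook form; the `(x, y)` coordinate convention and the function name are assumptions, not taken from the simulator.

```python
def xy_route(src, dst):
    """Return the routers visited from src to dst, moving along X first, then Y."""
    x, y = src
    path = [(x, y)]
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step_x
        path.append((x, y))
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step_y
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
hop_count = len(route) - 1  # the mesh distance a transfer pays
```

Because the path depends only on the two endpoints, the same source and destination always produce the same hop trace.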

<!-- src: ADR-0017 Decision -->
The HBM partition is per-PE: each PE owns one HBM slice, and the
controller exposes per-PE channels so that the same PE always
addresses the same set of HBM channels. This makes the local-HBM
bandwidth from a PE to its own slice predictable, while accesses to
another PE's slice (or a different cube's slice) pay the mesh
distance and any UCIe crossings.

### PE

<!-- src: ADR-0014 Context, Decision -->
A PE is not a monolithic core. Internally it is a set of nine
sub-components, each modeling one stage of a request's flow: a small
control CPU, a tile-pipeline scheduler, a DMA engine, a fetch-store
engine that moves data between the on-PE scratchpad and the register
file, a GEMM compute engine, a math compute engine, the tightly
coupled memory (TCM, the on-PE scratchpad), an MMU for
virtual-to-physical address translation, and an inter-PE collective
queue (IPCQ). The scheduler decomposes higher-level operations into
per-tile stage sequences, and tile tokens self-route from one
sub-component to the next.
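
A self-routing tile token can be pictured as an object that carries its own remaining stage list. The sketch below is illustrative only: `TileToken`, `plan_gemm_tiles`, and the exact stage sequence (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE) are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TileToken:
    """A tile's remaining stage sequence; the token knows where to go next."""
    tile_id: int
    stages: list
    visited: list = field(default_factory=list)

    def advance(self) -> Optional[str]:
        """Pop and return the next stage name, or None when the tile is done."""
        if not self.stages:
            return None
        stage = self.stages.pop(0)
        self.visited.append(stage)
        return stage


def plan_gemm_tiles(num_tiles):
    """Scheduler role: expand one high-level op into per-tile stage sequences."""
    sequence = ["DMA_READ", "FETCH", "GEMM", "STORE", "DMA_WRITE"]
    return [TileToken(tile_id=i, stages=list(sequence)) for i in range(num_tiles)]


tokens = plan_gemm_tiles(2)
for token in tokens:
    while token.advance() is not None:
        pass  # a real sub-component would add its service time here
```

In this picture the scheduler only builds the plan; each sub-component handles the token and forwards it to whichever stage the token names next.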

<!-- DIAGRAM: PE internal layout — the nine sub-components and the edges that connect them; tile token flowing through DMA_READ → FETCH → GEMM → STORE → DMA_WRITE. -->

---

## Detailed Architecture

This section describes each modeled device-side component in turn.
Components are listed in the alphabetical order used by the
simulator's source tree.

### forwarding

<!-- src: ADR-0037 Context, Decision -->
The forwarding component is the generic routing relay used wherever a
node only needs to apply a small processing overhead and pass the
request to the next hop. NOC routers, conn nodes, and UCIe PHYs all
reduce to this. On receiving a request it first applies the per-node
overhead configured for it in the topology spec, then hands the
request to the next hop along the path.

<!-- src: ADR-0037 Decision, Consequences -->
The decision to share one implementation across these roles was made
to keep the simulator's component set small without sacrificing
modeling fidelity. Each instance still carries its own overhead and
its own link bandwidth contributions, so different roles still produce
different timing. What is shared is the dispatcher loop, not the
parameter values.

### hbm_ctrl

<!-- src: ADR-0034 Context, Decision -->
The HBM controller is the terminal node for all memory traffic that
reaches HBM. Internally it owns a number of pseudo channels, partitioned
per-PE so that each PE addresses a deterministic subset. When a request
arrives, the controller first selects the pseudo channel from the
target address, then enters a chunk loop that drains the requested
size in fixed-size flits over the channel's bandwidth.

<!-- src: ADR-0034 Decision, Consequences -->
The chunk-loop pattern replaces an earlier all-at-once drain. The
benefit is that the controller no longer presents a flit-aware fabric
with a single bulk transfer; instead it emits flits at a paced rate
matching the channel bandwidth, which makes cross-flow contention
visible. Each channel's bandwidth budget is the configured total HBM
bandwidth divided by the channel count.
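
The chunk loop and the per-channel bandwidth split can be sketched as follows. The function names, the even split across channels, and the numbers are assumptions for illustration; the address-to-channel mapping is deliberately left out.

```python
def per_channel_bandwidth(total_bytes_per_ns, num_channels):
    """Split the configured total HBM bandwidth evenly across pseudo channels."""
    return total_bytes_per_ns / num_channels


def drain_in_flits(size_bytes, flit_bytes, channel_bw, start_ns=0.0):
    """Return (completion_time_ns, bytes) for each flit, paced at channel_bw."""
    events = []
    now = start_ns
    remaining = size_bytes
    while remaining > 0:
        chunk = min(flit_bytes, remaining)  # the last flit may be partial
        now += chunk / channel_bw
        events.append((now, chunk))
        remaining -= chunk
    return events


channel_bw = per_channel_bandwidth(total_bytes_per_ns=512.0, num_channels=8)
flits = drain_in_flits(size_bytes=160, flit_bytes=64, channel_bw=channel_bw)
```

Because each flit gets its own completion time, another flow's flits can interleave between them, which a single bulk transfer would hide.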

### io_cpu

<!-- src: ADR-0036 Context, Decision -->
The IO_CPU is the control-plane processor sitting inside the IO chiplet.
It receives kernel-launch messages from the host, decodes them, and
dispatches per-cube launches to the cube's management CPU. Pure memory
operations bypass it entirely, taking the direct data path established
inside the IO chiplet.

<!-- src: ADR-0036 Decision -->
On receiving a kernel-launch message, the IO_CPU consults the message's
shard list, which already names the target SIP, cube, and PE for each
piece of the tensor argument, and forwards a per-cube launch to each
cube the kernel needs to reach. This makes the IO_CPU a deterministic
fan-out point: it does not decode physical addresses to route, it just
follows the explicit per-shard targets it was handed.
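
The fan-out described above amounts to grouping explicit shard targets by cube. A minimal sketch, assuming a hypothetical `Shard` record with `sip`, `cube`, and `pe` fields (the real message schema is not shown in this document):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """Explicit placement of one piece of a tensor argument (hypothetical fields)."""
    sip: int
    cube: int
    pe: int


def fan_out(shards):
    """Group shard targets by (sip, cube) without decoding any address."""
    per_cube = defaultdict(list)
    for shard in shards:
        per_cube[(shard.sip, shard.cube)].append(shard.pe)
    # Sort keys and PE lists so the launch order is identical on every run.
    return {key: sorted(pes) for key, pes in sorted(per_cube.items())}


launches = fan_out([Shard(0, 1, 3), Shard(0, 0, 2), Shard(0, 1, 0)])
```

Each key in `launches` would correspond to one per-cube launch, carrying the list of PEs that the cube's management CPU then has to reach.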

### m_cpu

<!-- src: ADR-0035 Context, Decision -->
The M_CPU is the cube's management processor. It has two distinct
roles: a control-plane fan-out point for kernel launches arriving
from the IO chiplet, and a DMA endpoint for host-initiated memory
writes that need to land in this cube's HBM. The control role
forwards launches to the right PE control CPUs; the DMA role places
the actual bytes into HBM through the router mesh.

<!-- src: ADR-0035 Decision -->
The component model deliberately distinguishes the two roles because
their routing differs: the control fan-out path uses command-kind
links that do not appear on data-path routes, while the DMA path uses
the same router mesh as PE-initiated DMA, with PE-internal nodes
excluded. The routing layer knows about both modes and selects the
appropriate adjacency at request time.
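
Selecting an adjacency per request can be sketched as filtering one edge list by link kind before searching for a path. The edge list, node names, and the `pe` prefix used to exclude PE-internal nodes are all hypothetical.

```python
from collections import deque

# Hypothetical edge list: (source, destination, link kind).
EDGES = [
    ("m_cpu", "pe0.cpu", "command"),
    ("m_cpu", "router0", "data"),
    ("router0", "router1", "data"),
    ("router1", "hbm_ctrl", "data"),
    ("router0", "pe0.dma", "data"),
]


def adjacency(kind, exclude_prefix=""):
    """Build the adjacency for one link kind, optionally dropping some nodes."""
    adj = {}
    for src, dst, edge_kind in EDGES:
        if edge_kind != kind:
            continue
        if exclude_prefix and (src.startswith(exclude_prefix)
                               or dst.startswith(exclude_prefix)):
            continue
        adj.setdefault(src, []).append(dst)
    return adj


def shortest_path(adj, src, dst):
    """Breadth-first search; returns the node list or None if unreachable."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


control_path = shortest_path(adjacency("command"), "m_cpu", "pe0.cpu")
dma_path = shortest_path(adjacency("data", exclude_prefix="pe"), "m_cpu", "hbm_ctrl")
```

The control path never touches a router, and the DMA path cannot wander into a PE, because each request only ever sees the edges of its own mode.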

### pcie_ep

<!-- src: ADR-0038 Context, Decision -->
The PCIe endpoint is the protocol boundary at the host-device edge.
On each incoming request it first applies a configured
protocol-processing overhead; after that it simply forwards. There is
no internal queuing model, no retry, and no TLP-level fidelity; those
are deliberately out of scope. The endpoint is bidirectional:
host-to-device traffic (memory writes, kernel launches) flows one way,
and device-side outbound traffic (cross-SIP collective sends) flows the
other.

<!-- src: ADR-0038 Decision, Alternatives Considered, Consequences -->
A more detailed PCIe model was considered and rejected. The simulator
targets system-level latency comparisons; making the endpoint
heavier with credit-management and retry logic would not improve the
metrics being studied. The decision keeps the endpoint as the
documented protocol-boundary node, named consistently so routing
helpers can locate it by SIP and IO instance.

### pe_cpu

<!-- src: ADR-0014 Decision -->
The PE control CPU is the entry point for kernel work arriving from
the cube's management CPU. It receives kernel-launch messages, resolves
the kernel function by name, and hands execution to the scheduler with
the resolved tensor arguments. From the scheduler's point of view, the
PE_CPU is the upstream source of high-level commands; from the rest
of the system's point of view, the PE_CPU is where a kernel's
execution begins on a given PE.
|
||||
|
||||
### pe_dma
|
||||
|
||||
<!-- src: ADR-0014 Decision, ADR-0023 Decision -->
|
||||
The DMA engine on each PE has two distinct modes. In the standard PE
|
||||
pipeline it consumes tile tokens issued by the scheduler, acquires a
|
||||
read or write channel (modeled as a one-in-flight resource per
|
||||
direction), and runs the bytes to or from HBM through the mesh. In
|
||||
its collective mode it forwards send tokens for the cube's IPCQ into
|
||||
the fabric, snapshotting the source data at send time so later
|
||||
mutations cannot race the receiver's read. Both modes share the same
|
||||
channel resources but differ in their downstream handling — one
|
||||
returns when the round-trip completes, the other dispatches
|
||||
fire-and-forget.
|
||||
|
||||
### pe_fetch_store
|
||||
|
||||
<!-- src: ADR-0014 Decision -->
|
||||
The fetch-store engine is the bridge between the on-PE scratchpad
|
||||
(TCM) and the register file. It does not run DMA; it only moves bytes
|
||||
internally. On receiving a tile-stage token it sends a short request
|
||||
to the TCM, waits for the bandwidth-serialized delay, and continues
|
||||
the pipeline. The split between this engine and the TCM lets the
|
||||
scratchpad model its own read/write bandwidth independently.
|
||||
|
||||
### pe_gemm
|
||||
|
||||
<!-- src: ADR-0014 Decision -->
|
||||
The GEMM engine is the matrix-multiply compute unit. Tile tokens
|
||||
arriving at this stage carry the per-tile dimensions, and the engine
|
||||
contributes a service time that accounts for one fused multiply-add per
MAC in the tile. Composite operations (where the same tensor pair is
|
||||
streamed across many tiles) reuse the engine through the scheduler;
|
||||
the engine itself is stateless between tiles.
|
||||
|
||||
### pe_ipcq
|
||||
|
||||
<!-- src: ADR-0023 Context, Decision -->
|
||||
The IPCQ — inter-process communication queue — is each PE's
|
||||
collective-communication endpoint. It owns ring buffers that hold
|
||||
inbound messages from neighbor PEs and bookkeeping for send credits.
|
||||
Direction names ("N", "S", "E", "W" for cube-internal neighbors and
|
||||
"global_*" for cross-SIP neighbors) are resolved to physical peer
|
||||
endpoints by a neighbor table installed at process-group creation
|
||||
time. The component itself does not move bytes — it issues DMA tokens
|
||||
through the local PE_DMA, which performs the actual cross-PE
|
||||
transfer.
|
||||
|
||||
<!-- src: ADR-0023 Decision, Consequences -->
|
||||
A key invariant is that the inbound terminal — where data lands at
|
||||
the receiver — pays the link bandwidth drain plus any cube-internal
|
||||
mesh hop to the slot's backing memory. This prevents IPCQ from
|
||||
silently outpacing raw DMA at large transfer sizes. Outbound sends
|
||||
are fire-and-forget; credit return is the only backpressure signal.
|
||||
|
||||
### pe_math
|
||||
|
||||
<!-- src: ADR-0014 Decision -->
|
||||
The math engine handles element-wise and reduction operations. It
|
||||
consumes tile tokens carrying an operation kind (`exp`, `sum`, `max`,
|
||||
`where`, etc.) and contributes a service time proportional to the
|
||||
number of elements processed. Like the GEMM engine it is stateless;
|
||||
chained epilogues (a sequence of math operations after a GEMM tile)
|
||||
are scheduled as separate stages.
|
||||
|
||||
### pe_mmu
|
||||
|
||||
<!-- src: ADR-0039 Context, Decision -->
|
||||
The MMU has two roles, exposed through one component. As a node on
|
||||
the cube NOC it receives MMU-map and MMU-unmap messages and updates
|
||||
its internal page table, so that the runtime API can install
|
||||
virtual-to-physical mappings with measured fabric latency. As a
|
||||
utility object held inside the PE it offers synchronous translate
|
||||
calls to the PE's DMA and GEMM engines without taking simulator time
|
||||
itself; the calling engine pays any configured TLB overhead in its
|
||||
own process.
|
||||
|
||||
<!-- src: ADR-0039 Decision, Alternatives Considered -->
|
||||
The page table supports multiple disjoint regions inside a single
|
||||
page, with later-write-wins semantics on overlap. This is a deliberate
|
||||
simulator stopgap: it lets parallelization policies shard data at
sub-page granularity, which a real hardware MMU's one-PA-per-entry
table would silently mis-route. A real MMU does not work
|
||||
this way; the model documents this as a simplification.
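
The sub-page behaviour can be illustrated with a small sketch. It is not the simulator's page table: the class name `SubPageTable`, the 4 KiB page size, and the addresses are assumptions chosen for the example. It shows two regions inside one page, later-write-wins on the overlapping bytes, and an unmapped address passing through unchanged as a PA.

```python
from __future__ import annotations


class SubPageTable:
    """Hypothetical page table that allows several regions per page."""

    def __init__(self, page_size: int = 4096):
        self.page_size = page_size
        # page number -> [(va_start, va_end, pa_start)] in install order
        self._pages: dict[int, list[tuple[int, int, int]]] = {}

    def map(self, va: int, pa: int, length: int) -> None:
        # Assumes the region stays inside a single page.
        page = va // self.page_size
        self._pages.setdefault(page, []).append((va, va + length, pa))

    def translate(self, va: int) -> int:
        regions = self._pages.get(va // self.page_size, [])
        for start, end, pa in reversed(regions):  # newest mapping wins
            if start <= va < end:
                return pa + (va - start)
        return va  # no mapping: treat the address as a physical address


table = SubPageTable()
table.map(va=0x1000, pa=0x8000, length=0x100)
table.map(va=0x1080, pa=0x9000, length=0x100)  # overlaps the first region's tail
```

Translating `0x1010` resolves through the first region, while `0x1090` lies in the overlap and resolves through the second, later mapping.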
|
||||
|
||||
### pe_scheduler
|
||||
|
||||
<!-- src: ADR-0014 Decision -->
|
||||
The scheduler is the sole dispatcher inside a PE. Simple commands are
|
||||
routed directly to the right engine. Composite commands generate a
|
||||
tile plan, and the resulting tile tokens are fed into the pipeline.
|
||||
Self-routing keeps the scheduler off the per-stage hot path: each
|
||||
engine, on finishing a stage, advances the token to the next stage's
|
||||
component itself, so the scheduler only does initial dispatch and
|
||||
completion tracking.
|
||||
|
||||
### pe_tcm
|
||||
|
||||
<!-- src: ADR-0040 Context, Decision -->
|
||||
The TCM is the per-PE tightly-coupled scratchpad memory. It models
|
||||
time only, not data — the actual payload lives in the simulator's
|
||||
memory store. Read and write are independent channels: each is
|
||||
modeled as a one-in-flight resource, so same-direction requests
|
||||
serialize but a read and a write can overlap. The bandwidth of each
|
||||
direction is configured separately and applied as bytes-over-bandwidth
|
||||
on each request.
|
||||
|
||||
<!-- src: ADR-0040 Decision, Alternatives Considered -->
|
||||
The decision to keep read and write on separate channels was made
|
||||
because the PE pipeline's normal case overlaps fetch (read) and store
|
||||
(write). Collapsing them into a single shared channel would have
|
||||
artificially serialized that overlap and produced an incorrect
|
||||
bandwidth ceiling.
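
A minimal timing sketch of the two independent channels follows. The class name `TcmTiming`, the bytes-per-nanosecond unit, and the bandwidth figures are assumptions for illustration; the real component is driven by the simulator's event engine rather than by explicit `now` arguments.

```python
class TcmTiming:
    """Timing-only scratchpad model: one in-flight request per direction."""

    def __init__(self, read_bw: float, write_bw: float):
        self.bw = {"read": read_bw, "write": write_bw}  # bytes per ns (assumed unit)
        self.free_at = {"read": 0.0, "write": 0.0}      # when each channel is next idle

    def request(self, direction: str, nbytes: int, now: float) -> float:
        """Return the completion time of a request issued at time `now`."""
        start = max(now, self.free_at[direction])       # same direction serializes
        done = start + nbytes / self.bw[direction]      # bytes-over-bandwidth
        self.free_at[direction] = done
        return done


tcm = TcmTiming(read_bw=64.0, write_bw=32.0)
r1 = tcm.request("read", 640, now=0.0)   # finishes at t=10.0
r2 = tcm.request("read", 640, now=0.0)   # queues behind r1, finishes at t=20.0
w1 = tcm.request("write", 640, now=0.0)  # overlaps the reads, finishes at t=20.0
```

With a single shared channel the write would instead have queued behind both reads, which is the artificial serialization the decision avoids.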
|
||||
|
||||
### sram
|
||||
|
||||
<!-- src: ADR-0041 Context, Decision -->
|
||||
The cube SRAM is a per-cube scratchpad attached to one of the cube's
|
||||
routers. As a node it applies a configured access overhead, pays the
|
||||
link-bandwidth drain stamped on the incoming request, and sends a
|
||||
response on the reverse path. It is a terminal — it does not forward.
|
||||
|
||||
<!-- src: ADR-0041 Decision, Consequences -->
|
||||
A second role is as one of three backing-memory tiers (TCM, SRAM, HBM)
|
||||
that an inter-PE collective slot can live in. When the slot lives in
|
||||
SRAM, the PE_DMA pays the slot read or write latency directly using
|
||||
the configured SRAM bandwidth and overhead; the SRAM component does
|
||||
not need to know about collective semantics. This separation keeps
|
||||
the SRAM component agnostic to the collective subsystem.
|
||||
|
||||
### tiling
|
||||
|
||||
<!-- src: ADR-0042 Context, Decision -->
|
||||
The tile-plan generator is not a runtime component; it is a module of
pure functions that take a problem shape (matrix dimensions, tile
|
||||
sizes) and produce an ordered list of tile-stage sequences. The
|
||||
scheduler consumes this list. Each tile's stage sequence depends on
|
||||
how its operands are staged: operands streamed from HBM produce
|
||||
DMA_READ stages, operands already resident in TCM (because they were
|
||||
loaded eagerly upfront) skip them.
|
||||
|
||||
<!-- src: ADR-0042 Decision, Consequences -->
|
||||
The plan generator is intentionally pure — given the same input it
|
||||
returns the same plan, with no simulator events created. This lets
|
||||
the rest of the system reason about tile sequences as data, and it
|
||||
makes the plan testable in isolation without simulator state. New
|
||||
plan variants (for example, K-major or DTensor-aware plans) can be
|
||||
added as new functions following the same shape.
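
As a rough sketch of what such a pure plan function might look like, assuming a simple M/N/K loop order and hypothetical stage names (`DMA_READ:A`, `FETCH`, `GEMM`, `STORE`); only the idea that resident operands skip their DMA_READ stage comes from the text above:

```python
from math import ceil


def plan_gemm_tiles(m, n, k, tile_m, tile_n, tile_k,
                    a_resident=False, b_resident=False):
    """Return an ordered list of (tile_index, stages); creates no simulator events."""
    plan = []
    for i in range(ceil(m / tile_m)):
        for j in range(ceil(n / tile_n)):
            for kk in range(ceil(k / tile_k)):
                stages = []
                if not a_resident:
                    stages.append("DMA_READ:A")  # operand A streamed from HBM
                if not b_resident:
                    stages.append("DMA_READ:B")  # operand B streamed from HBM
                stages += ["FETCH", "GEMM", "STORE"]
                plan.append(((i, j, kk), tuple(stages)))
    return plan


streamed = plan_gemm_tiles(4, 4, 4, 2, 2, 4)
resident = plan_gemm_tiles(4, 4, 4, 2, 2, 4, a_resident=True, b_resident=True)
```

Because the function only builds a list, it can be unit-tested by comparing plans directly, without constructing an engine.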
|
||||
|
||||
---
|
||||
|
||||
## Implementation Decisions
|
||||
|
||||
This section collects cross-cutting decisions — algorithms, policies,
|
||||
schemes, and contracts — that span multiple components rather than
|
||||
living inside one.
|
||||
|
||||
### Address Scheme
|
||||
|
||||
<!-- src: ADR-0001 Context, Decision -->
|
||||
Every physical address in the simulator decodes into a structured
|
||||
location. A fixed-width physical address carries the SIP id, the
|
||||
cube id within the SIP, a type discriminator (HBM vs PE-resource vs
|
||||
others), and a type-specific offset. HBM addresses additionally encode
|
||||
the per-PE slice offset so the controller can determine which PE
|
||||
owns the target slice without external lookup. The layout is
|
||||
deliberately leaves reserved space rather than being packed to fit, so new sub-units can
|
||||
be added at the type-discriminator level without rewriting existing
|
||||
addresses.
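
To make the decoding concrete, here is a hedged sketch of packing and unpacking such an address. The field widths (4/4/4/36 bits) and the names `Location`, `encode`, and `decode` are assumptions for illustration only; the actual widths are defined by ADR-0001 and are not reproduced here.

```python
from typing import NamedTuple

# Assumed field widths, most significant field first.
SIP_BITS, CUBE_BITS, TYPE_BITS, OFFSET_BITS = 4, 4, 4, 36


class Location(NamedTuple):
    sip: int     # SIP id
    cube: int    # cube id within the SIP
    kind: int    # type discriminator (HBM, PE resource, ...)
    offset: int  # type-specific offset


def encode(loc: Location) -> int:
    pa = loc.sip
    pa = (pa << CUBE_BITS) | loc.cube
    pa = (pa << TYPE_BITS) | loc.kind
    pa = (pa << OFFSET_BITS) | loc.offset
    return pa


def decode(pa: int) -> Location:
    offset = pa & ((1 << OFFSET_BITS) - 1)
    kind = (pa >> OFFSET_BITS) & ((1 << TYPE_BITS) - 1)
    cube = (pa >> (OFFSET_BITS + TYPE_BITS)) & ((1 << CUBE_BITS) - 1)
    sip = (pa >> (OFFSET_BITS + TYPE_BITS + CUBE_BITS)) & ((1 << SIP_BITS) - 1)
    return Location(sip, cube, kind, offset)


example = Location(sip=1, cube=2, kind=3, offset=0x1234)
```

Adding a new sub-unit in this sketch means assigning it an unused `kind` value; addresses that were already encoded keep their meaning.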
|
||||
|
||||
<!-- src: ADR-0011 Context, Decision -->
|
||||
On top of physical addressing, the simulator supports three address
|
||||
models that the runtime API selects between. Direct physical
|
||||
addressing is retained as a fallback. Virtual addressing — the
|
||||
current default — gives each tensor a contiguous virtual range at
|
||||
deployment, with the per-PE MMU translating per access; an
|
||||
alternative logical-address scheme remains a future option. The
|
||||
virtual-address path is what every modern test path takes; the PA
|
||||
fallback is used by the MMU itself when no mapping exists for an
|
||||
address (a deliberate signal, not an error).
|
||||
|
||||
<!-- src: ADR-0011 Decision, Consequences -->
|
||||
Tensor placement is represented as a list of physical-address shards,
|
||||
each tagged with target SIP, cube, and PE, plus a single tensor-wide
|
||||
virtual base. This means a kernel sees one virtual base for the whole
|
||||
tensor while the host driver and the engine still know exactly where
|
||||
each shard lives. Replicated tensors get per-cube local PA mappings;
|
||||
sharded tensors broadcast their mapping across cubes within a SIP.
|
||||
|
||||
### Routing, Distance & Helper API
|
||||
|
||||
<!-- src: ADR-0002 Context, Decision -->
|
||||
Routing is policy-driven, deterministic, and topology-aware. Given a
|
||||
source, a destination, and an intent — for example, PE-initiated
|
||||
DMA versus host-initiated memory write versus a generic
|
||||
component-to-component query — the routing layer picks the right
|
||||
path. The intent matters because different traffic types must avoid
|
||||
different categories of edges: PE-initiated DMA should not traverse
|
||||
command-only links; M_CPU DMA should not pass through PE-internal
|
||||
pipeline edges; cube-local transfers should not use the
|
||||
zero-distance UCIe bus that would otherwise look attractive to a
|
||||
shortest-path search.
|
||||
|
||||
<!-- src: ADR-0051 Decision -->
|
||||
The routing layer therefore maintains four separate adjacency graphs
|
||||
at construction, each excluding a different category of edges, and
|
||||
picks the appropriate one per intent. On top of the graphs sits a
|
||||
helper API that hides the topology's naming convention: callers ask
|
||||
for the PCIe endpoint of a given SIP, the M_CPU of a given cube, or
|
||||
the HBM destination for a given physical address, and receive the
|
||||
corresponding node id. No component constructs node-id strings
|
||||
directly; if the naming convention ever changes, the change is local
|
||||
to the helper layer.
|
||||
|
||||
<!-- src: ADR-0051 Decision, Consequences -->
|
||||
Path-finding itself uses Dijkstra with explicit per-edge weights
|
||||
(routing weight is allowed to differ from physical distance — for
|
||||
example, UCIe is weighted so that routing prefers it). Tie-breaks
|
||||
follow insertion order, which keeps results deterministic. Paths
|
||||
between unreachable nodes raise rather than returning empty, surfacing
|
||||
topology errors immediately.
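
The path-finding behaviour can be sketched with the standard-library `heapq` module. This is an illustration under assumptions, not the routing layer's code: the function name, the adjacency-dict shape, and the use of a push counter to realize insertion-order tie-breaks are all choices made for the example.

```python
import heapq
from itertools import count


def shortest_path(adj, src, dst):
    """Dijkstra over adj = {node: [(neighbor, weight), ...]}.

    Equal-cost candidates are popped in the order they were pushed,
    so results do not depend on hash ordering.
    """
    order = count()  # increasing sequence number used as the tie-breaker
    heap = [(0.0, next(order), src, [src])]
    settled = set()
    while heap:
        dist, _, node, path = heapq.heappop(heap)
        if node == dst:
            return dist, path
        if node in settled:
            continue
        settled.add(node)
        for nbr, weight in adj.get(node, []):
            if nbr not in settled:
                heapq.heappush(heap, (dist + weight, next(order), nbr, path + [nbr]))
    raise ValueError(f"no route from {src!r} to {dst!r}")


# Two equal-cost routes from "a" to "d"; the one through "b" is listed first.
demo = {"a": [("b", 1), ("c", 1)], "b": [("d", 1)], "c": [("d", 1)], "d": []}
cost, route = shortest_path(demo, "a", "d")
```

Here `route` goes through `"b"` because that edge was inserted first, and asking for a path from `"d"` back to `"a"` raises `ValueError` rather than returning an empty list.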
|
||||
|
||||
### Memory Semantics and Local-HBM Bandwidth
|
||||
|
||||
<!-- src: ADR-0004 Context, Decision -->
|
||||
A PE accessing its own HBM slice through its own cube's NOC must see
|
||||
the full local HBM bandwidth — that is the model's intent. Memory
|
||||
traffic accumulates latency from per-component overhead and
|
||||
bytes-over-link-bandwidth serialization along the path, but the
|
||||
controller does not throttle below the slice's allotted bandwidth.
|
||||
Cross-PE-slice accesses inside the same cube, cross-cube accesses
|
||||
through UCIe, and cross-SIP accesses through PCIe each pay
|
||||
progressively more overhead as the path grows.
|
||||
|
||||
### Topology Compilation, Diagrams & Builder Algorithms
|
||||
|
||||
<!-- src: ADR-0006 Context, Decision -->
|
||||
Topology is configurable, not hardcoded. The simulator reads a YAML
|
||||
spec, compiles it into a flat graph of nodes and edges plus four
|
||||
view projections at different abstraction levels — system, SIP, cube,
|
||||
PE — and uses the compiled graph as the single source for both
|
||||
execution and visualization. Distance metadata used by routing is
|
||||
extracted at compile time so that diagrams and routing decisions
|
||||
agree by construction.
|
||||
|
||||
<!-- src: ADR-0005 Context, Decision -->
|
||||
Diagrams are derived artifacts of the compiled topology. The visualizer
|
||||
produces one SVG per view at the appropriate abstraction level; nothing
|
||||
in the diagrams is hand-drawn or hand-positioned. Distance-aware
|
||||
layout rules place nodes in the diagrams using the same coordinates
|
||||
that routing uses to compute distance, so a diagram that "looks
|
||||
wrong" is a signal that the topology itself has a problem, not the
|
||||
visualizer.
|
||||
|
||||
<!-- src: ADR-0053 Decision -->
|
||||
Inside a cube the router mesh is generated automatically. PE corner
|
||||
positions are fixed by convention; the relay-column algorithm
|
||||
inserts additional grid columns whenever the gap between adjacent PE
|
||||
columns would exceed a tunable maximum. HBM occupies a central
|
||||
exclusion zone — router slots inside the zone are deliberately empty,
|
||||
since HBM controllers attach as separate named nodes. M_CPU and SRAM
|
||||
attach to the nearest router by Euclidean distance from their
|
||||
configured placement coordinates, and UCIe physical lanes distribute
|
||||
along the boundary rows and columns. The whole mesh is cached
|
||||
beside the topology spec and invalidated only when one of a small set
|
||||
of layout-relevant fields changes.
|
||||
|
||||
<!-- DIAGRAM: One cube's router mesh — rows × cols of routers with HBM exclusion zone in the middle, PEs/M_CPU/SRAM attaching to nearest routers, UCIe PHYs along the perimeter. -->
|
||||
|
||||
### Tensor Deployment and Allocation
|
||||
|
||||
<!-- src: ADR-0008 Context, Decision -->
|
||||
Tensor deployment in the runtime API produces a list of physical-address
|
||||
shards plus a single tensor-wide virtual base. The host allocator
|
||||
walks the data-parallelism policy, computes per-shard placement, and
|
||||
emits the per-shard physical addresses through the per-PE allocators.
|
||||
No separate "allocate then later attach to a device" RPC exists —
|
||||
allocation and deployment are a single operation that produces a
|
||||
deployed tensor handle.
|
||||
|
||||
### Memory Allocator Algorithms
|
||||
|
||||
<!-- src: ADR-0048 Context, Decision -->
|
||||
Each per-PE allocator owns two channels — HBM slice and TCM — each
|
||||
backed by an offset-keyed free-list. Allocation is first-fit; freeing
|
||||
coalesces with adjacent free blocks. A device-wide virtual allocator
|
||||
sits above the per-PE allocators, aligns requests up to the configured
|
||||
page size, and coalesces on free in the same way. The trade-off is
|
||||
explicit: first-fit is simpler and cheaper than best-fit or buddy
|
||||
allocation, and the simulator's workload is stack-like enough
|
||||
(deploy / kernel / free in matched order) that fragmentation is not
|
||||
a practical concern.
|
||||
|
||||
<!-- src: ADR-0048 Decision, Consequences -->
|
||||
Allocation failure raises rather than silently returning a partial
|
||||
result. A partial tensor reaching the engine would route over wrong
|
||||
PAs and silently corrupt simulator output, so an out-of-memory signal
|
||||
is preferred. The free path trusts its caller to pass back exactly
|
||||
what was allocated; the small risk of caller error in exchange for
|
||||
fast common-case freeing is documented as a deliberate trade.
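
A compact sketch of a first-fit allocator with coalescing on free is shown here. The class and method names are hypothetical, and the free-list is a sorted list of `(offset, length)` pairs rather than whatever structure the simulator actually uses; like the described free path, `release` trusts the caller to pass back exactly what was allocated.

```python
class FirstFitAllocator:
    """Hypothetical first-fit allocator over one channel (e.g. an HBM slice)."""

    def __init__(self, size: int):
        self.free = [(0, size)]  # sorted list of (offset, length) free blocks

    def alloc(self, nbytes: int) -> int:
        for i, (off, length) in enumerate(self.free):
            if length >= nbytes:  # first block that fits
                if length == nbytes:
                    self.free.pop(i)
                else:
                    self.free[i] = (off + nbytes, length - nbytes)
                return off
        # Fail loudly instead of handing back a partial allocation.
        raise MemoryError(f"cannot allocate {nbytes} bytes")

    def release(self, off: int, nbytes: int) -> None:
        self.free.append((off, nbytes))
        self.free.sort()
        merged = [self.free[0]]
        for o, n in self.free[1:]:
            last_off, last_len = merged[-1]
            if last_off + last_len == o:  # adjacent blocks: coalesce
                merged[-1] = (last_off, last_len + n)
            else:
                merged.append((o, n))
        self.free = merged


heap = FirstFitAllocator(100)
x = heap.alloc(40)
y = heap.alloc(40)
heap.release(x, 40)
heap.release(y, 40)
```

After both releases the free-list collapses back to a single block covering the whole range, which is the stack-like pattern the text relies on.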
|
||||
|
||||
### Kernel Execution and Host-Device Messaging
|
||||
|
||||
<!-- src: ADR-0009 Context, Decision -->
|
||||
Kernel execution decomposes into a small set of messages that travel
|
||||
the device graph. The host issues a single kernel-launch message; the
|
||||
IO_CPU fans it out per-cube; the cube M_CPU fans it out per-PE; the
|
||||
PE CPU resolves the kernel and runs it through the scheduler.
|
||||
Completion flows back the same way, gated by per-shard completion
|
||||
tracking. Memory operations follow the same pattern: a memory write
|
||||
or read travels as one message that the engine routes to the right
|
||||
HBM controller, with a response taking the reverse path.
|
||||
|
||||
<!-- src: ADR-0012 Context, Decision -->
|
||||
The schema between the host and the device-side IO CPU is PA-first
|
||||
and shard-tagged. Every byte of host-issued payload arrives with an
|
||||
explicit target SIP, cube, PE, and physical address. The IO_CPU does
|
||||
not decode addresses to derive placement — placement is named
|
||||
explicitly by the shard list. This makes the host-device interface
|
||||
deterministic and keeps the routing helper free of host-derived
|
||||
intent.
|
||||
|
||||
### CLI Surface and Semantics
|
||||
|
||||
<!-- src: ADR-0010 Context, Decision -->
|
||||
The command-line interface exposes four subcommands. A bench runner
|
||||
loads a topology, resolves a registered benchmark by name or index,
|
||||
and runs it on a selected device. A bench-listing command enumerates
|
||||
the registered benchmarks. A probe utility runs a fixed catalog of
|
||||
traffic patterns through the engine for latency and bandwidth
|
||||
verification. A web viewer renders the topology in a browser. A
|
||||
benchmark instance is always single-device by convention; multi-SIP
|
||||
collective work happens inside the benchmark through the launcher
|
||||
abstraction, not by multiplexing the CLI.
|
||||
|
||||
### Component Port and Wire Fabric Model
|
||||
|
||||
<!-- src: ADR-0015 Context, Decision -->
|
||||
Every modeled component exposes input and output ports, and every
|
||||
edge in the topology connects an output port on one component to an
|
||||
input port on another. Bandwidth and propagation delay are properties
|
||||
of the wire between ports, not of the component endpoints. A
|
||||
component's responsibility is to apply its configured per-node
|
||||
overhead and either forward to the next hop or terminate; the wire
|
||||
charges the byte-over-bandwidth serialization separately.
|
||||
|
||||
<!-- src: ADR-0015 Decision, Consequences -->
|
||||
This separation lets components be swapped behind their port
|
||||
interface without changing the rest of the model, and it keeps
|
||||
bandwidth contention at the wire level where multiple components may
|
||||
contend for the same edge. Future component models can refine
|
||||
internal behavior without disturbing the fabric.
|
||||
|
||||
### Two-Pass Data Execution
|
||||
|
||||
<!-- src: ADR-0020 Context, Decision -->
|
||||
The simulator runs in two passes. The first pass — fast and always
|
||||
on — runs the discrete-event engine and records every data operation
|
||||
in an operation log with timestamps, component identifiers, and
per-operation parameters. The second pass (optional, opt-in) replays
|
||||
the log against an in-memory tensor store to produce actual numerical
|
||||
results. Tests that only need timing skip the second pass; tests that
|
||||
need to verify correctness opt in.
|
||||
|
||||
<!-- src: ADR-0020 Decision, Consequences -->
|
||||
The split lets the timing engine remain unconcerned with data
|
||||
semantics: kernels move handles around, not bytes. The replay phase
|
||||
recovers data semantics from the recorded operations, in their
|
||||
original time order with a small set of secondary-sort rules. The
|
||||
op-log records carry enough metadata — input snapshots for compute
|
||||
operations, source snapshots for cross-component copies — that the
|
||||
replay phase cannot mis-order with respect to in-flight mutations.
|
||||
|
||||
### Sim-engine Op Log and Memory Store Schemas
|
||||
|
||||
<!-- src: ADR-0052 Context, Decision -->
|
||||
The operation log holds typed records with seven fields each: start
|
||||
and end timestamps, the component that issued the operation, an
|
||||
operation kind ("memory", "gemm", "math"), an operation name, a
|
||||
parameter dictionary, and a (currently unused) dependency list.
|
||||
Records are kept in stable timestamp order. The parameter dictionary
|
||||
varies by operation: a DMA read carries source address and byte count;
|
||||
a GEMM carries operand shapes, dtypes, and address spaces; a math
|
||||
operation carries input addresses and snapshots.
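
One way to picture a seven-field record is a small dataclass. The field names, component strings, and parameter keys here are assumptions for illustration, as is sorting by start timestamp; the source fixes only the number and meaning of the fields.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component: str                                  # issuing component
    kind: str                                       # "memory", "gemm", or "math"
    name: str                                       # operation name
    params: dict[str, Any] = field(default_factory=dict)
    deps: list[int] = field(default_factory=list)   # currently unused


log = [
    OpRecord(5.0, 9.0, "pe0.gemm", "gemm", "dot", {"m": 16, "n": 16, "k": 16}),
    OpRecord(0.0, 4.0, "pe0.dma", "memory", "dma_read", {"src": 0x1000, "nbytes": 512}),
    OpRecord(5.0, 7.0, "pe0.math", "math", "exp"),
]
# list.sort is stable, so records with equal timestamps keep their insertion order.
log.sort(key=lambda rec: rec.t_start)
```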
|
||||
|
||||
<!-- src: ADR-0052 Decision, Consequences -->
|
||||
The companion memory store is a two-level dictionary keyed by
|
||||
address space ("hbm", "tcm", "sram", others) and integer address.
|
||||
Reads and writes are reference-based — no copy by default — so
|
||||
callers wanting to detach a snapshot must copy explicitly. This is
|
||||
deliberate: the engine-internal snapshot paths copy at well-defined
|
||||
points (math input capture, HBM source capture for DMA writes,
|
||||
inbound collective copies) and downstream replay code therefore
|
||||
sees stable data even when slot or scratch addresses are reused by
|
||||
later operations.
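
The reference-versus-copy distinction is easy to demonstrate with a toy store. The class below is a hypothetical stand-in (the names `MemoryStore`, `read`, and `write` are assumptions), and a Python list stands in for tensor data.

```python
import copy


class MemoryStore:
    """Toy two-level store: address space -> integer address -> payload."""

    def __init__(self):
        self._spaces = {}

    def write(self, space, addr, value):
        self._spaces.setdefault(space, {})[addr] = value  # stores the reference

    def read(self, space, addr):
        return self._spaces[space][addr]  # returns the same object, no copy


store = MemoryStore()
buf = [1, 2, 3]
store.write("tcm", 0x100, buf)

alias = store.read("tcm", 0x100)                          # shares storage with buf
snapshot = copy.deepcopy(store.read("tcm", 0x100))        # explicit detach
buf[0] = 99                                               # later mutation of the slot
```

After the mutation, `alias` sees the new value while `snapshot` still holds the data as it was at capture time, which is why the engine copies at its well-defined snapshot points.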
|
||||
|
||||
### 2D Grid Program Identity
|
||||
|
||||
<!-- src: ADR-0022 Context, Decision -->
|
||||
Inside a kernel the program identity is two-dimensional. The
|
||||
first axis corresponds to the PE index within a cube; the second
|
||||
corresponds to the cube index within a SIP. Together they let a
|
||||
kernel address its position both within its cube and within the
|
||||
larger system without needing to know the full topology. Total
|
||||
program counts along each axis are exposed symmetrically.
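
A small sketch of the two-axis identity is shown here, using hypothetical names (`ProgramId`, `enumerate_programs`, `flat_rank`). The flattening helper is an illustrative assumption, not something the kernel API is stated to provide.

```python
from __future__ import annotations

from typing import NamedTuple


class ProgramId(NamedTuple):
    pe: int    # axis 0: PE index within the cube
    cube: int  # axis 1: cube index within the SIP


def enumerate_programs(num_pes: int, num_cubes: int) -> list[ProgramId]:
    """All program identities in one SIP, cube-major."""
    return [ProgramId(pe, cube) for cube in range(num_cubes) for pe in range(num_pes)]


def flat_rank(pid: ProgramId, num_pes: int) -> int:
    """Collapse the 2D identity into a single index (illustrative only)."""
    return pid.cube * num_pes + pid.pe


programs = enumerate_programs(num_pes=4, num_cubes=2)
```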
|
||||
|
||||
### Parallelism — SIP Launcher, DPPolicy, Megatron-TP, AHBM Backend, and CCL Algorithm Module
|
||||
|
||||
<!-- src: ADR-0024 Context, Decision -->
|
||||
The launcher model treats each SIP as one rank. Inside a process the
|
||||
launcher spawns one greenlet per SIP rank; the rank is bound to its
|
||||
greenlet so that any code running in that worker sees the right
|
||||
distributed-style rank. This is a deliberately PyTorch-compatible
|
||||
shape: a benchmark looks like a small DDP training script — initialize
|
||||
a process group, spawn workers, each worker runs the same body.
|
||||
|
||||
<!-- src: ADR-0026 Context, Decision -->
|
||||
Data-parallelism policy lives in a single object that names the
|
||||
sharding strategy along the cube axis (replicate, row-wise,
|
||||
column-wise) and along the PE axis (same set of values), and optionally
|
||||
overrides the number of cubes or PEs participating. The policy is
|
||||
intra-device — it does not cross SIP boundaries. SIP-level parallelism
|
||||
is the launcher's responsibility, and the two axes compose
|
||||
orthogonally.
|
||||
|
||||
<!-- src: ADR-0027 Context, Decision -->
|
||||
A Megatron-style tensor-parallel API sits on top of the launcher and
|
||||
the DP policy. Layer-level building blocks — column-parallel linear,
|
||||
row-parallel linear, all-reduce — name their sharding intent in terms
|
||||
the launcher and the placement policy can compose. This is the layer
|
||||
that bench code typically writes against.
|
||||
|
||||
<!-- src: ADR-0047 Context, Decision -->
|
||||
For collective operations the runtime exposes a PyTorch-compatible
|
||||
distributed backend named "ahbm". On process-group initialization the
|
||||
backend loads the configured collective-algorithm module, resolves
|
||||
the world size (priority: explicit ccl.yaml override → defaults
|
||||
section → topology SIP count), imports the algorithm module
|
||||
dynamically, derives the SIP topology kind, and pushes the inter-PE
|
||||
neighbor table to every participating PE. From that point on, an
|
||||
all-reduce call dispatches the algorithm's kernel function across
|
||||
all ranks.
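
The world-size priority chain can be sketched as a single function. The configuration keys (`world_size`, `defaults`) and the function name are assumptions; only the ordering (explicit override, then defaults section, then topology SIP count) comes from the description above.

```python
def resolve_world_size(ccl_cfg: dict, topology_sip_count: int) -> int:
    """Pick the world size using override -> defaults -> topology priority."""
    override = ccl_cfg.get("world_size")  # hypothetical explicit override key
    if override is not None:
        return int(override)
    default = ccl_cfg.get("defaults", {}).get("world_size")  # hypothetical key
    if default is not None:
        return int(default)
    return topology_sip_count  # fall back to the number of SIPs in the topology


explicit = resolve_world_size({"world_size": 2, "defaults": {"world_size": 8}}, 4)
from_defaults = resolve_world_size({"defaults": {"world_size": 8}}, 4)
from_topology = resolve_world_size({}, 4)
```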
|
||||
|
||||
<!-- src: ADR-0050 Context, Decision -->
|
||||
A collective-algorithm module is a Python module with a small, fixed
|
||||
contract. It exposes topology-kind integer constants, a name-to-kind
|
||||
mapping for the YAML configuration, a kernel-arguments builder, and
|
||||
a kernel function — the kernel function being aliased to the name
|
||||
`kernel` so the backend can find it generically. The kernel itself
|
||||
takes the tensor pointer, the per-cube element count, cube mesh
|
||||
width and height, the world size, the current rank, and the SIP
|
||||
topology dimensions; the backend appends those last four arguments
|
||||
automatically. New collectives slot in by adding a new module that
|
||||
follows this shape.
|
||||
|
||||
<!-- src: ADR-0027 Decision, Consequences -->
|
||||
The combination is deliberate: bench authors get to write code that
|
||||
looks like a regular distributed training script, while the launcher,
|
||||
backend, and placement policies behind it remain free to redirect
|
||||
work to the right SIP, cube, and PE without exposing topology to the
|
||||
kernel.
|
||||
|
||||
### IPCQ Direction Addressing
|
||||
|
||||
<!-- src: ADR-0025 Context, Decision -->
|
||||
Inside a collective algorithm, peer PEs are named by direction —
|
||||
"N", "S", "E", "W" for cube-internal neighbors, and "global_*" for
|
||||
cross-SIP neighbors. Direction addressing is the addressing scheme:
|
||||
the algorithm names a direction, the IPCQ neighbor table installed
|
||||
at process-group time resolves the direction to the peer endpoint's
|
||||
physical-address coordinates, and the PE_DMA performs the actual
|
||||
transfer. The algorithm itself does not see PA arithmetic — direction
|
||||
is the user-facing handle.
|
||||
|
||||
### Intercube All-Reduce
|
||||
|
||||
<!-- src: ADR-0032 Context, Decision -->
|
||||
The default all-reduce algorithm uses a center-rooted bidirectional
|
||||
phase inside each SIP's cube mesh followed by an inter-SIP exchange
|
||||
on the mesh's root cube, and then a bidirectional broadcast back
|
||||
out. Center-rooting halves the in-cube hop count compared with a
|
||||
corner-rooted walk. The inter-SIP exchange itself follows the
|
||||
configured SIP topology — ring, torus, or non-wrapping mesh —
|
||||
selected at runtime through the SIP-topology kind integer the
|
||||
backend passes to the kernel.
|
||||
|
||||
### Evaluation Harnesses
|
||||
|
||||
<!-- src: ADR-0043 Context, Decision -->
|
||||
The all-reduce evaluation harness drives the correctness checks and the
latency and buffer-kind sweeps through the public distributed path
(initialize process group, spawn workers, call all-reduce) rather
|
||||
than the lower-level engine interface. A shared helper module factors
|
||||
out the setup; sweep tests cover the buffer-kind tiers (TCM, SRAM,
|
||||
HBM) and the inter-SIP topology variants. The plots produced by the
|
||||
harness are part of its output contract; the harness regenerates them
|
||||
on demand.
|
||||
|
||||
<!-- src: ADR-0044 Context, Decision -->
|
||||
The GEMM evaluation harness is split into two layers. A heavy
|
||||
shape-and-variant sweep lives as a manual script — it runs the same
|
||||
composite-GEMM benchmark across many shapes and operand-staging
|
||||
variants, harvests the resulting op-log, and writes a JSON summary.
|
||||
A faster figure-generation layer lives in the test suite and consumes
|
||||
that JSON to render plots. The split keeps the heavy data
|
||||
generation explicit and out of the regular test path.
|
||||
|
||||
### Bench Module Contract
|
||||
|
||||
<!-- src: ADR-0045 Context, Decision -->
|
||||
Adding a new benchmark requires only dropping a file into the
|
||||
benchmarks directory. The file registers one or more benchmark
|
||||
functions through a small decorator that takes a kebab-case name and
|
||||
a human-readable description. The decorator is the registration
|
||||
mechanism — there is no separate manifest. Each benchmark function
|
||||
takes one argument, conventionally named `torch`, which is the
|
||||
runtime context exposing tensor allocation, kernel launch,
|
||||
distributed APIs, and process-spawning. The function name is `run` by
|
||||
convention.
|
||||
|
||||
<!-- src: ADR-0045 Decision, Consequences -->
|
||||
A benchmark must submit at least one operation, or the runner
|
||||
returns an error. A benchmark instance is single-device by default;
|
||||
when a benchmark is collective, it uses the distributed-process-spawn
|
||||
pattern internally — one worker greenlet per rank, with each worker
|
||||
binding to its rank. Multi-device benchmark patterns outside that
|
||||
shape are not supported.
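
The registration pattern might look roughly like this. Everything named here is hypothetical: the decorator is called `benchmark`, the registry is a plain dict, and the runtime context is faked with a `SimpleNamespace` whose `submitted` list stands in for real operation submission.

```python
from types import SimpleNamespace

REGISTRY = {}  # kebab-case name -> (description, function)


def benchmark(name, description):
    """Hypothetical registration decorator; no separate manifest is needed."""
    def register(fn):
        REGISTRY[name] = (description, fn)
        return fn
    return register


@benchmark("example-noop", "Submits one placeholder operation")
def run(torch):
    # Stand-in for real tensor allocation or kernel launch on the context.
    torch.submitted.append("placeholder-op")


def run_benchmark(name):
    _description, fn = REGISTRY[name]
    ctx = SimpleNamespace(submitted=[])  # fake runtime context for this sketch
    fn(ctx)
    if not ctx.submitted:
        raise RuntimeError(f"benchmark {name!r} submitted no operations")
    return ctx.submitted


ops = run_benchmark("example-noop")
```

In this sketch a benchmark whose body submits nothing would make `run_benchmark` raise, mirroring the rule that a benchmark must submit at least one operation.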
|
||||
|
||||
### Kernel-side `tl.*` API
|
||||
|
||||
<!-- src: ADR-0046 Context, Decision -->
|
||||
Inside a kernel function, the `tl` argument exposes the kernel-side
|
||||
API in a shape that mirrors the conventions of established
|
||||
GPU-kernel languages. Categories: reference handles that name HBM
|
||||
data without issuing DMA; data movement (load, store) that does
|
||||
issue DMA; GEMM and math compute (dot, composite, the unary and
|
||||
binary math operations, reductions); index and scalar helpers
|
||||
(program identity, range-builders); metadata-only operations like
|
||||
transpose; and the collective primitives (send, receive,
|
||||
non-blocking receive). Tensor handles support arithmetic operators
|
||||
via a thread-local active context so kernel code reads naturally.
|
||||
|
||||
<!-- src: ADR-0046 Decision, Consequences -->
|
||||
The API supports two execution modes. A command-list mode records
|
||||
operations into a list without consuming simulator time — useful for
|
||||
inspection and lightweight tests. A greenlet-driven mode runs the
|
||||
kernel as a child greenlet that switches back to the simulator on
|
||||
each `tl.*` call; the simulator drives the event scheduler and hands
|
||||
real data back to the kernel as DMA reads complete. The two modes
|
||||
share the same surface; the kernel does not know which one it is
|
||||
running under.
|
||||
|
||||
### Probe Subcommand
|
||||
|
||||
<!-- src: ADR-0049 Context, Decision -->
|
||||
The probe utility runs three families of traffic patterns through
|
||||
the engine — host-to-device writes at increasing hop counts,
|
||||
device-to-host reads at increasing hop counts, and PE-initiated DMA
|
||||
across the cube mesh — and reports actual latency, the analytical
|
||||
formula breakdown, effective bandwidth, bottleneck bandwidth, and
|
||||
utilization. A fixed reference size is used for the summary table;
|
||||
a separate utilization-versus-size sweep covers a logarithmic range
|
||||
of transfer sizes. Each case runs in its own engine instance so
|
||||
cases do not perturb each other.
|
||||
|
||||
<!-- src: ADR-0049 Decision, Consequences -->
|
||||
The probe also checks a small set of invariants automatically:
|
||||
monotonic latency increase with hop count, device-to-host latency
|
||||
at least as large as host-to-device for the same hop count, and a
|
||||
faster best-case path than worst-case for cross-cube PE DMA. Failures
|
||||
print prominently. The output is meant for human reading; automated
|
||||
parsing should not depend on column widths or whitespace.
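
The invariant checks can be expressed as a short function over measured latencies. This is a sketch under assumptions: the function name, the `{hop_count: latency}` input shape, and the reading of "monotonic" as non-decreasing are illustrative choices, not the probe's actual code.

```python
def check_probe_invariants(h2d, d2h, best_dma, worst_dma):
    """Return a list of failure messages; an empty list means all checks passed.

    h2d and d2h map hop count -> latency; best_dma and worst_dma are the
    cross-cube PE DMA latencies for the best-case and worst-case paths.
    """
    failures = []
    for name, series in (("h2d", h2d), ("d2h", d2h)):
        hops = sorted(series)
        for lo, hi in zip(hops, hops[1:]):
            if series[hi] < series[lo]:  # latency must not drop as hops grow
                failures.append(f"{name}: latency drops from {lo} to {hi} hops")
    for hop in sorted(set(h2d) & set(d2h)):
        if d2h[hop] < h2d[hop]:  # reads should cost at least as much as writes
            failures.append(f"d2h faster than h2d at {hop} hops")
    if not best_dma < worst_dma:
        failures.append("best-case DMA is not faster than worst-case")
    return failures


clean = check_probe_invariants({1: 10, 2: 15}, {1: 12, 2: 18}, 5, 9)
broken = check_probe_invariants({1: 10, 2: 8}, {1: 12, 2: 18}, 5, 9)
```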
|
||||
|
||||
---
|
||||
|
||||
This document summarizes 46 architecture decisions captured during
|
||||
the first half of 2026. It is regenerated mechanically from the
|
||||
decision corpus; sources are recorded in HTML comments throughout.
|
||||
@@ -0,0 +1,333 @@
"""Generate docs/adr/INDEX.md (and docs/adr-ko/INDEX.md) from the ADR corpus.

Auto-derives a section-based index following the same classification as
the /report skill — Design Principles / High-level Architecture /
Detailed Architecture (by component) / Implementation Decisions
(by topic). Run before publishing to refresh INDEX.md.

The classification table below is the single source of truth. When a new
ADR is added under docs/adr/, append an entry to ``CLASSIFICATION``. The
script exits 1 if any ADR file is missing from the table or any title
cannot be parsed, so omissions surface in CI.

Usage:
    python tools/generate_adr_index.py [--root <repo-root>] [--check]

    --check : exit 1 if the generated INDEX differs from the on-disk file
              (used by CI to detect un-regenerated indexes).
"""

from __future__ import annotations

import argparse
import re
import sys
from pathlib import Path

ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-([a-z0-9_-]+)\.md$")
# Title separator may be ":" (most ADRs) or "—" (em-dash; ADR-0033 uses
# this). The verifier (tools/verify_adr_lang_pairs.py) only checks the
# number, so both styles already coexist in the corpus.
TITLE_RE = re.compile(r"^# ADR-(\d{4})\s*[:—]\s*(.+?)\s*$")

DESIGN_PRINCIPLES = "Design Principles"
HIGH_LEVEL = "High-level Architecture"
DETAILED = "Detailed Architecture"
IMPL_DECISIONS = "Implementation Decisions"


# (section, subgroup) per ADR. For High-level entries the subgroup is
# rendered as an italic annotation; for Implementation entries it is the
# topic heading. The Detailed section is laid out from DETAILED_COMPONENTS,
# so the subgroup there is informational only.
# Add a line here when introducing a new ADR.
CLASSIFICATION: dict[int, tuple[str, str | None]] = {
    # Design Principles
    13: (DESIGN_PRINCIPLES, None),
    33: (DESIGN_PRINCIPLES, None),

    # High-level Architecture
    3: (HIGH_LEVEL, "System hierarchy (Tray / SIP / CUBE / PE)"),
    7: (HIGH_LEVEL, "Runtime API ↔ sim_engine boundaries"),
    16: (HIGH_LEVEL, "IOChiplet NOC and memory data path"),
    17: (HIGH_LEVEL, "Cube NOC and HBM connectivity"),

    # Detailed Architecture (rendered via DETAILED_COMPONENTS, not the subgroup)
    14: (DETAILED, "pe_pipeline"),  # covers pe_cpu/pe_dma/pe_fetch_store/pe_gemm/pe_math/pe_scheduler
    23: (DETAILED, "pe_ipcq"),
    34: (DETAILED, "hbm_ctrl"),
    35: (DETAILED, "m_cpu"),
    36: (DETAILED, "io_cpu"),
    37: (DETAILED, "forwarding"),
    38: (DETAILED, "pcie_ep"),
    39: (DETAILED, "pe_mmu"),
    40: (DETAILED, "pe_tcm"),
    41: (DETAILED, "sram"),
    42: (DETAILED, "tiling"),

    # Implementation Decisions
    1: (IMPL_DECISIONS, "Address Scheme"),
    2: (IMPL_DECISIONS, "Routing & Helper API"),
    4: (IMPL_DECISIONS, "Memory Semantics & Local-HBM Bandwidth"),
    5: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
    6: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
    8: (IMPL_DECISIONS, "Tensor Deployment and Allocation"),
    9: (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"),
    10: (IMPL_DECISIONS, "CLI Surface and Semantics"),
    11: (IMPL_DECISIONS, "Address Scheme"),
    12: (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"),
    15: (IMPL_DECISIONS, "Component Port/Wire Fabric Model"),
    20: (IMPL_DECISIONS, "Two-Pass Data Execution"),
    22: (IMPL_DECISIONS, "2D Grid Program Identity"),
    24: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    25: (IMPL_DECISIONS, "IPCQ Direction Addressing"),
    26: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    27: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    32: (IMPL_DECISIONS, "Intercube All-Reduce"),
    43: (IMPL_DECISIONS, "Evaluation Harnesses"),
    44: (IMPL_DECISIONS, "Evaluation Harnesses"),
    45: (IMPL_DECISIONS, "Bench Module Contract"),
    46: (IMPL_DECISIONS, "Kernel-side tl.* API (TLContext)"),
    47: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    48: (IMPL_DECISIONS, "Memory Allocator Algorithms"),
    49: (IMPL_DECISIONS, "Probe Subcommand"),
    50: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    51: (IMPL_DECISIONS, "Routing & Helper API"),
    52: (IMPL_DECISIONS, "Sim-engine Op Log and Memory Store Schemas"),
    53: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
}

# Canonical component order for the Detailed Architecture section.
# Each entry: (component_name, list[ADR-numbers that cover it]).
# Order matches src/kernbench/components/builtin/*.py alphabetical
# (the same order /report uses).
DETAILED_COMPONENTS: list[tuple[str, list[int]]] = [
    ("forwarding", [37]),
    ("hbm_ctrl", [34]),
    ("io_cpu", [36]),
    ("m_cpu", [35]),
    ("pcie_ep", [38]),
    ("pe_cpu", [14]),
    ("pe_dma", [14, 23]),
    ("pe_fetch_store", [14]),
    ("pe_gemm", [14]),
    ("pe_ipcq", [23]),
    ("pe_math", [14]),
    ("pe_mmu", [39]),
    ("pe_scheduler", [14]),
    ("pe_tcm", [40]),
    ("sram", [41]),
    ("tiling", [42]),
]


def _strip_bom(text: str) -> str:
    """Strip leading UTF-8 BOM if present."""
    if text and ord(text[0]) == 0xFEFF:
        return text[1:]
    return text


def _find_adrs(adr_dir: Path) -> list[tuple[int, str, Path]]:
    """Return [(num, slug, path), ...] for ADR files in adr_dir, sorted by num."""
    out: list[tuple[int, str, Path]] = []
    for p in sorted(adr_dir.iterdir()):
        if not p.is_file():
            continue
        m = ADR_FILENAME_RE.match(p.name)
        if not m:
            continue
        out.append((int(m.group(1)), m.group(2), p))
    out.sort(key=lambda t: t[0])
    return out


def _extract_title(path: Path) -> str:
    """Parse the title from the first line `# ADR-NNNN: <title>`. Strips BOM."""
    text = _strip_bom(path.read_text(encoding="utf-8"))
    first_line = text.split("\n", 1)[0] if text else ""
    m = TITLE_RE.match(first_line)
    if not m:
        raise ValueError(
            f"{path.name}: cannot parse title from first line: {first_line!r}"
        )
    return m.group(2)


def _build_index(adr_dir: Path, link_prefix: str) -> str:
    """Build the INDEX.md text for adr_dir.

    link_prefix is the relative href used for ADR links (e.g., ``./``
    so links resolve relative to the INDEX file location).
    """
    adrs = _find_adrs(adr_dir)
    if not adrs:
        raise RuntimeError(f"No ADR files found under {adr_dir}")

    # Validate every ADR is classified.
    missing = sorted(num for num, _slug, _ in adrs if num not in CLASSIFICATION)
    if missing:
        raise RuntimeError(
            "ADR(s) missing from CLASSIFICATION table in "
            "tools/generate_adr_index.py: "
            + ", ".join(f"ADR-{n:04d}" for n in missing)
            + ". Add an entry for each."
        )

    # Map: num → (filename, title)
    num_to_meta: dict[int, tuple[str, str]] = {}
    for num, _slug, path in adrs:
        num_to_meta[num] = (path.name, _extract_title(path))

    # ── Section assembly ────────────────────────────────────────────
    lines: list[str] = []
    lines.append("# ADR Index")
    lines.append("")
    lines.append(
        f"Auto-generated by `tools/generate_adr_index.py`. "
        f"Total ADRs: **{len(adrs)}**."
    )
    lines.append("")
    lines.append(
        "Classification mirrors the `/report` skill's section assignment. "
        "When adding a new ADR, also add an entry to the "
        "`CLASSIFICATION` table in `tools/generate_adr_index.py`."
    )
    lines.append("")

    def fmt_entry(num: int) -> str:
        fname, title = num_to_meta[num]
        return f"- [ADR-{num:04d}]({link_prefix}{fname}) — {title}"

    # Design Principles (sorted by ADR number)
    lines.append("## Design Principles")
    lines.append("")
    nums = sorted(n for n, (sec, _) in CLASSIFICATION.items()
                  if sec == DESIGN_PRINCIPLES and n in num_to_meta)
    for n in nums:
        lines.append(fmt_entry(n))
    lines.append("")

    # High-level Architecture (sorted by ADR number; the subgroup, when
    # present, is appended as an italic annotation)
    lines.append("## High-level Architecture")
    lines.append("")
    nums = sorted(n for n, (sec, _) in CLASSIFICATION.items()
                  if sec == HIGH_LEVEL and n in num_to_meta)
    for n in nums:
        sub = CLASSIFICATION[n][1] or ""
        fname, title = num_to_meta[n]
        if sub:
            lines.append(
                f"- [ADR-{n:04d}]({link_prefix}{fname}) — {title}"
                f" _({sub})_"
            )
        else:
            lines.append(fmt_entry(n))
    lines.append("")

    # Detailed Architecture (canonical component order)
    lines.append("## Detailed Architecture")
    lines.append("")
    lines.append("One subsection per component file under `src/kernbench/components/builtin/`.")
    lines.append("")
    for comp, adr_nums in DETAILED_COMPONENTS:
        lines.append(f"### {comp}")
        lines.append("")
        if adr_nums:
            for n in adr_nums:
                if n not in num_to_meta:
                    raise RuntimeError(
                        f"DETAILED_COMPONENTS references ADR-{n:04d} for "
                        f"'{comp}' but no such ADR file exists."
                    )
                lines.append(fmt_entry(n))
        else:
            lines.append("_(no ADR coverage)_")
        lines.append("")

    # Implementation Decisions: group ADRs by topic (the subgroup string).
    lines.append("## Implementation Decisions")
    lines.append("")
    topic_order: list[str] = []
    topic_to_nums: dict[str, list[int]] = {}
    for n, (sec, sub) in CLASSIFICATION.items():
        if sec != IMPL_DECISIONS or n not in num_to_meta:
            continue
        topic = sub or "Uncategorized"
        if topic not in topic_to_nums:
            topic_order.append(topic)
            topic_to_nums[topic] = []
        topic_to_nums[topic].append(n)
    # Stable order: by smallest ADR-number in topic, so older infra appears first.
    topic_order.sort(key=lambda t: min(topic_to_nums[t]))
    for topic in topic_order:
        lines.append(f"### {topic}")
        lines.append("")
        for n in sorted(topic_to_nums[topic]):
            lines.append(fmt_entry(n))
        lines.append("")

    return "\n".join(lines).rstrip() + "\n"


def _check_or_write(path: Path, content: str, check: bool) -> bool:
    """Write content to path, or only compare when check is True.

    Returns True only in --check mode when the on-disk file differs from
    content. In write mode the file is written and False is returned.
    """
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    if check:
        if existing != content:
            print(f"[diff] {path} would change.")
            return True
        return False
    path.write_text(content, encoding="utf-8")
    if existing != content:
        print(f"[wrote] {path}")
    else:
        print(f"[unchanged] {path}")
    return False


def main(argv: list[str] | None = None) -> int:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument(
        "--root", type=Path, default=Path.cwd(),
        help="Repository root (default: cwd)",
    )
    p.add_argument(
        "--check", action="store_true",
        help="Exit 1 if generated INDEX would differ from disk",
    )
    args = p.parse_args(argv)

    en_dir = args.root / "docs" / "adr"
    ko_dir = args.root / "docs" / "adr-ko"

    if not en_dir.is_dir():
        print(f"error: {en_dir} does not exist", file=sys.stderr)
        return 1

    any_diff = False
    try:
        en_index = _build_index(en_dir, link_prefix="./")
    except (RuntimeError, ValueError) as e:
        print(f"error (EN): {e}", file=sys.stderr)
        return 1
    any_diff |= _check_or_write(en_dir / "INDEX.md", en_index, args.check)

    if ko_dir.is_dir():
        try:
            ko_index = _build_index(ko_dir, link_prefix="./")
        except (RuntimeError, ValueError) as e:
            print(f"error (KO): {e}", file=sys.stderr)
            return 1
        any_diff |= _check_or_write(ko_dir / "INDEX.md", ko_index, args.check)

    if args.check and any_diff:
        print(
            "INDEX.md is out of date. "
            "Run `python tools/generate_adr_index.py` to refresh.",
            file=sys.stderr,
        )
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())