adr: add INDEX.md (auto-generated by tools/generate_adr_index.py)

Adds a section-based table of contents for the 46-ADR corpus, mirroring the /report skill's classification (Design Principles / High-level Architecture / Detailed Architecture by component / Implementation Decisions by topic). Generated for both docs/adr/ (EN titles) and docs/adr-ko/ (KO titles) from one tool.

tools/generate_adr_index.py:
- Single CLASSIFICATION dict per ADR: add an entry when introducing a new ADR; the script fails loud if any file is missing from the table.
- DETAILED_COMPONENTS lists each builtin component and the ADR(s) that cover it (ADR-0014 appears under six PE engines; ADR-0023 under pe_dma + pe_ipcq).
- Accepts both ":" and "—" title separators (matching ADR-0033's existing format).
- --check mode for CI: exits 1 if INDEX.md is stale.

Also includes the docs/report/architecture-2026-1H.md generated by the prior /report write (the public-facing architecture document; 836 lines, 76 source-attribution comments).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 11:15:37 -07:00
parent bd49c93703
commit e33e76f2d1
4 changed files with 1517 additions and 0 deletions
@@ -0,0 +1,174 @@
+# ADR Index
+
+Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **46**.
+
+Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.
+
+## Design Principles
+
+- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — 검증 전략 및 Phase 1 테스트 계획
+- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — 레이턴시 모델: 가정 및 알려진 단순화
+
+## High-level Architecture
+
+- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — 타겟 시스템 계층 및 모델링 범위  _(System hierarchy (Tray / SIP / CUBE / PE))_
+- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — 런타임 API 및 시뮬레이션 엔진 경계  _(Runtime API ↔ sim_engine boundaries)_
+- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NoC와 메모리 데이터 경로  _(IOChiplet NOC and memory data path)_
+- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — 큐브 NoC와 HBM 연결성  _(Cube NOC and HBM connectivity)_
+
+## Detailed Architecture
+
+One subsection per component file under `src/kernbench/components/builtin/`.
+
+### forwarding
+
+- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding 컴포넌트 (forwarding_v1)
+
+### hbm_ctrl
+
+- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM 컨트롤러 내부 설계
+
+### io_cpu
+
+- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU 컴포넌트 모델
+
+### m_cpu
+
+- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU 및 M_CPU.DMA 컴포넌트 모델
+
+### pcie_ep
+
+- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model
+
+### pe_cpu
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
+
+### pe_dma
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
+- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
+
+### pe_fetch_store
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
+
+### pe_gemm
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
+
+### pe_ipcq
+
+- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
+
+### pe_math
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
+
+### pe_mmu
+
+- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — 컴포넌트 + 유틸리티 이중 역할
+
+### pe_scheduler
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
+
+### pe_tcm
+
+- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — 듀얼 채널 BW 직렬화
+
+### sram
+
+- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC
+
+### tiling
+
+- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math 파이프라인 plan 빌더
+
+## Implementation Decisions
+
+### Address Scheme
+
+- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51비트 물리 주소 레이아웃 및 디코딩 계약
+- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — 메모리 주소 지정 — PA / VA / LA 주소 모델
+
+### Routing & Helper API
+
+- [ADR-0002](./ADR-0002-lat-routing-distance.md) — 라우팅 거리, 순서 및 우회 규칙
+- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`
+
+### Memory Semantics & Local-HBM Bandwidth
+
+- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — 메모리 시맨틱 및 로컬 HBM 대역폭 보장
+
+### Topology Compilation, Diagrams & Builder Algorithms
+
+- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — 다이어그램 뷰 및 거리 기반 레이아웃 규칙
+- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — 토폴로지 컴파일, 거리 추출, 그리고 자동 다이어그램 생성
+- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms
+
+### Tensor Deployment and Allocation
+
+- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — 텐서 배포 및 할당 (호스트 할당기, PA 우선)
+
+### Kernel Execution and Host-Device Messaging
+
+- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — 커널 실행 메시징 및 완료 시맨틱
+- [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU 메시지 스키마 (PA-우선, PE-태깅)
+
+### CLI Surface and Semantics
+
+- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — 명령줄 인터페이스 및 실행 시맨틱
+
+### Component Port/Wire Fabric Model
+
+- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — 컴포넌트 포트/와이어 모델과 패브릭 라우팅
+
+### Two-Pass Data Execution
+
+- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리)
+
+### 2D Grid Program Identity
+
+- [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D 그리드 program_id 시맨틱
+
+### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)
+
+- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
+- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — sip/num_sips 필드 제거
+- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
+- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
+- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`
+
+### IPCQ Direction Addressing
+
+- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching
+
+### Intercube All-Reduce
+
+- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — 큐브 간 All-Reduce — pe0 큐브-메시 리듀스 + 다중-SIP 교환
+
+### Evaluation Harnesses
+
+- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce 평가 하니스 — `tests/sccl/`
+- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM 평가 하니스 — `scripts/gemm_sweep.py` + `tests/gemm/`
+
+### Bench Module Contract
+
+- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring
+
+### Kernel-side tl.* API (TLContext)
+
+- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract
+
+### Memory Allocator Algorithms
+
+- [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
+
+### Probe Subcommand
+
+- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
+
+### Sim-engine Op Log and Memory Store Schemas
+
+- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -0,0 +1,174 @@
+# ADR Index
+
+Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **46**.
+
+Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.
+
+## Design Principles
+
+- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — Verification Strategy and Phase 1 Test Plan
+- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — Latency Model: Assumptions and Known Simplifications
+
+## High-level Architecture
+
+- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — Target System Hierarchy & Modeling Scope  _(System hierarchy (Tray / SIP / CUBE / PE))_
+- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — Runtime API and Simulation Engine Boundaries  _(Runtime API ↔ sim_engine boundaries)_
+- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NOC and Memory Data Path  _(IOChiplet NOC and memory data path)_
+- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — Cube NOC and HBM Connectivity  _(Cube NOC and HBM connectivity)_
+
+## Detailed Architecture
+
+One subsection per component file under `src/kernbench/components/builtin/`.
+
+### forwarding
+
+- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding Component (forwarding_v1)
+
+### hbm_ctrl
+
+- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM Controller Internal Design
+
+### io_cpu
+
+- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU Component Model
+
+### m_cpu
+
+- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU and M_CPU.DMA Component Model
+
+### pcie_ep
+
+- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model
+
+### pe_cpu
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+
+### pe_dma
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
+
+### pe_fetch_store
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+
+### pe_gemm
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+
+### pe_ipcq
+
+- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
+
+### pe_math
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+
+### pe_mmu
+
+- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — Component + Utility Dual Role
+
+### pe_scheduler
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+
+### pe_tcm
+
+- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — Dual-Channel BW Serialization
+
+### sram
+
+- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC
+
+### tiling
+
+- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math Pipeline Plan Builders
+
+## Implementation Decisions
+
+### Address Scheme
+
+- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51-bit Physical Address Layout & Decoding Contract
+- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — Memory Addressing — PA / VA / LA Address Models
+
+### Routing & Helper API
+
+- [ADR-0002](./ADR-0002-lat-routing-distance.md) — Routing Distance, Ordering & Bypass Rules
+- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`
+
+### Memory Semantics & Local-HBM Bandwidth
+
+- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — Memory Semantics & Local-HBM Bandwidth Guarantee
+
+### Topology Compilation, Diagrams & Builder Algorithms
+
+- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — Diagram Views & Distance-Aware Layout Rules
+- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — Topology Compilation, Distance Extraction, and Automatic Diagram Generation
+- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms
+
+### Tensor Deployment and Allocation
+
+- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — Tensor Deployment and Allocation (Host Allocator, PA-first)
+
+### Kernel Execution and Host-Device Messaging
+
+- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — Kernel Execution Messaging and Completion Semantics
+- [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
+
+### CLI Surface and Semantics
+
+- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — Command Line Interface and Execution Semantics
+
+### Component Port/Wire Fabric Model
+
+- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — Component Port/Wire Model and Fabric Routing
+
+### Two-Pass Data Execution
+
+- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass Data Execution Model (Timing / Data Separation)
+
+### 2D Grid Program Identity
+
+- [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D Grid program_id Semantics
+
+### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)
+
+- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
+- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — remove sip/num_sips fields
+- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
+- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
+- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`
+
+### IPCQ Direction Addressing
+
+- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching
+
+### Intercube All-Reduce
+
+- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
+
+### Evaluation Harnesses
+
+- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce Evaluation Harness — `tests/sccl/`
+- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`
+
+### Bench Module Contract
+
+- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring
+
+### Kernel-side tl.* API (TLContext)
+
+- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract
+
+### Memory Allocator Algorithms
+
+- [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
+
+### Probe Subcommand
+
+- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
+
+### Sim-engine Op Log and Memory Store Schemas
+
+- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -0,0 +1,836 @@
+# KernBench — Architecture Design Document
+*2026 1H*
+
+KernBench is a system-level, discrete-event simulator for AI-accelerator
+chiplet systems. It models the data-movement and control paths across
+the full hardware hierarchy and reports end-to-end execution latency
+for kernels dispatched to the device's compute units.
+
+This document is a public summary of the architecture as designed and
+implemented in the first half of 2026. It assumes no prior knowledge of
+the simulator's internal documents; terms specific to the system are
+defined on first use.
+
+---
+
+## Design Principles
+
+KernBench is grounded in two foundational commitments: every measured
+latency must trace to explicit, modeled events on the simulator's graph,
+and every behavioral claim must be verifiable through tests that target
+spec-level invariants rather than incidental implementation details.
+
+<!-- src: ADR-0013 Context, Decision -->
+The development approach is verification-driven. Tests are written to
+validate the architectural contracts that the simulator exposes —
+correct routing, deterministic results, monotonic latency under
+increasing hop counts — rather than to mirror the call graph of the
+implementation. Two phases coexist: a fast timing phase that exercises
+the simulator's discrete-event engine and produces a log of operations
+with timestamps, and an optional data-replay phase that uses that log
+to compute real numerical results. Tests can target either phase.
+
+<!-- src: ADR-0033 Context, Decision -->
+The latency model is intentionally abstract rather than
+cycle-accurate. Each modeled node contributes a configurable per-node
+overhead, each link contributes wire delay plus bytes-over-bandwidth
+serialization, and each terminal service contributes its own service
+time. The simulator does not attempt to reproduce cache coherence
+protocols, microarchitectural pipelines, or full PCIe/UCIe protocol
+correctness; those are explicitly outside the scope. The aim is a
+simulator that compares system-level configurations meaningfully and
+deterministically, not one that reproduces microarchitectural detail.
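A minimal sketch of the additive latency model described above. The names `Hop` and `path_latency_ns`, and the choice of nanoseconds and bytes-per-nanosecond as units, are assumptions made for illustration; they are not taken from the simulator's source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One modeled node plus its outgoing link (hypothetical structure)."""
    node_overhead_ns: float        # configurable per-node overhead
    wire_delay_ns: float           # propagation delay of the link
    bandwidth_bytes_per_ns: float  # link bandwidth (1 GB/s == 1 byte/ns)


def path_latency_ns(hops, size_bytes, service_time_ns):
    """Sum per-node overhead, wire delay, and serialization over the path,
    then add the terminal's service time."""
    total = service_time_ns
    for hop in hops:
        serialization_ns = size_bytes / hop.bandwidth_bytes_per_ns
        total += hop.node_overhead_ns + hop.wire_delay_ns + serialization_ns
    return total


hops = [Hop(2.0, 1.0, 64.0), Hop(2.0, 1.0, 32.0)]
latency = path_latency_ns(hops, size_bytes=256, service_time_ns=10.0)
```

Every term in the sum corresponds to a modeled node, link, or terminal, so a reported latency can be broken down hop by hop.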
+
+<!-- src: ADR-0033 Decision, Consequences -->
+Determinism is a hard requirement. Given identical inputs — topology,
+routing policy, and request stream — the simulator must produce
+identical outputs, hop traces included. This rules out reliance on
+unordered set iteration on the critical path and forces every latency
+contribution to come from an explicitly scheduled event on a modeled
+component or link. There are no implicit waits, no hardcoded magic
+delays, and no shortcuts that bypass the modeled graph.
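One common way to get this kind of determinism in a discrete-event loop is to break timestamp ties with a monotonically increasing sequence number rather than relying on container iteration order. The sketch below assumes a hypothetical `EventQueue` class and is not the simulator's engine.

```python
import heapq
import itertools


class EventQueue:
    """Hypothetical event queue: ties on time resolve by insertion order."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker, never reused
        self.now = 0

    def schedule(self, delay, label):
        heapq.heappush(self._heap, (self.now + delay, next(self._seq), label))

    def run(self):
        trace = []
        while self._heap:
            self.now, _, label = heapq.heappop(self._heap)
            trace.append((self.now, label))
        return trace


queue = EventQueue()
queue.schedule(5, "b")
queue.schedule(5, "a")
queue.schedule(1, "c")
trace = queue.run()
```

Because the sequence number is part of the heap key, two runs with the same schedule calls produce the same trace, including the order of simultaneous events.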
+
+---
+
+## High-level Architecture
+
+<!-- src: ADR-0003 Context, Decision -->
+The simulated system is a four-level hierarchy. A **Tray** holds one or
+more **SIPs** (system-in-package), each containing a 2D mesh of
+**CUBEs** plus one or more **IO chiplets** that connect the SIP to the
+host. Each CUBE contains a regular grid of **PEs** (processing
+elements) plus its own attached resources — high-bandwidth memory
+(HBM), a per-cube SRAM scratchpad, and a management CPU (M_CPU). The PE
+itself is a composite of nine sub-components rather than a monolithic
+core. This hierarchy is fixed; the parameters along each axis (counts,
+mesh dimensions, link widths) are configurable through the topology
+spec.
+
+<!-- src: ADR-0007 Context, Decision -->
+A clean separation runs along the request flow. A **runtime API** at
+the top is the host-facing surface; it exposes tensor and kernel
+operations, owns host-side allocation metadata, and is
+topology-agnostic; it does not route or fan out. Below it the **simulation
+engine** decomposes runtime operations into discrete graph requests
+(memory writes, memory reads, kernel launches, MMU map installs) and
+schedules events deterministically. At the bottom, **components** model
+device behavior on a graph of nodes connected by links; they
+implement the actual latency contributions and pass requests along.
+No component reaches up into the runtime API, and no runtime call
+shortcuts the engine.
+
+<!-- DIAGRAM: Four-level system hierarchy — Tray containing SIPs, each SIP showing its 2D cube mesh and IO chiplet; one cube blown up to show the router mesh, attached PEs, M_CPU, SRAM, and HBM partition. -->
+
+### Tray
+
+<!-- src: ADR-0003 Decision -->
+The Tray is the outermost boundary. It owns the host CPU on one side
+and one or more SIPs on the other, connected through a fabric switch.
+For collective communication that must traverse multiple SIPs, the
+fabric switch acts as the common rendezvous: device-side outbound
+traffic from one SIP routes through the switch and back into the
+target SIP's IO chiplet.
+
+### SIP
+
+<!-- src: ADR-0003 Decision, ADR-0017 Context -->
+A SIP packages a 2D mesh of CUBEs and one or more IO chiplets. The
+default topology used by the simulator is a 4×4 cube mesh; the
+mesh dimensions are configurable. Each cube on the boundary of the
+mesh connects to its neighbors over UCIe (die-to-die) links arranged
+on the four cardinal sides — north, south, east, and west. The IO
+chiplets sit on one side of the SIP and provide the bridge to the host
+across PCIe.
+
+<!-- src: ADR-0016 Context, Decision -->
+The IO chiplet itself contains its own internal network. A
+host-facing PCIe endpoint passes traffic to a small NOC ("network on
+chip"); from there it can branch to a control-plane CPU that processes
+kernel-launch messages, or it can take the direct memory data path to
+the cube's HBM controller. The decision to provide a direct memory
+path that bypasses the control CPU was a deliberate concession to
+keep host-issued memory writes from paying control-plane overhead on
+the data path.
+
+### CUBE
+
+<!-- src: ADR-0017 Decision -->
+Each CUBE owns a 2D mesh of NOC routers and a set of attached
+resources: PEs, the cube-local SRAM scratchpad, the management CPU
+(M_CPU), and the HBM partition (split across multiple PE-private
+slices for bandwidth). The router mesh uses deterministic XY routing.
+Attached components do not connect to each other directly — they all
+sit on the router mesh, and every cube-internal transfer pays the
+mesh distance from source to destination.
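Deterministic XY routing can be sketched as follows, assuming the usual dimension-ordered convention (move along X until the column matches, then along Y). The function name `xy_route` and the `(x, y)` coordinate tuples are illustrative assumptions.

```python
def xy_route(src, dst):
    """Return the router coordinates visited from src to dst, X first, then Y."""
    x, y = src
    path = [(x, y)]
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step_x
        path.append((x, y))
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step_y
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
hop_count = len(route) - 1  # mesh distance paid by the transfer
```

The same source and destination always yield the same path, which is what makes the mesh distance a stable input to the latency model.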
+
+<!-- src: ADR-0017 Decision -->
+The HBM partition is per-PE: each PE owns one HBM slice, and the
+controller exposes per-PE channels so that the same PE always
+addresses the same set of HBM channels. This makes the local-HBM
+bandwidth from a PE to its own slice predictable, while accesses to
+another PE's slice — or a different cube's slice — pay the mesh
+distance and any UCIe crossings.
+
+### PE
+
+<!-- src: ADR-0014 Context, Decision -->
+A PE is not a monolithic core. Internally it is a set of nine
+sub-components, each modeling one stage of a request's flow: a small
+control CPU, a tile-pipeline scheduler, a DMA engine, a fetch-store
+engine that moves data between the on-PE scratchpad and the register
+file, a GEMM compute engine, a math compute engine, the
+tightly-coupled memory (TCM, the on-PE scratchpad), an MMU for
+virtual-to-physical address translation, and an inter-PE collective queue
+(IPCQ). The scheduler decomposes higher-level operations into per-tile
+stage sequences, and tile tokens self-route from one sub-component
+to the next.
+
+<!-- DIAGRAM: PE internal layout — the nine sub-components and the edges that connect them; tile token flowing through DMA_READ → FETCH → GEMM → STORE → DMA_WRITE. -->
+
+---
+
+## Detailed Architecture
+
+This section describes each modeled device-side component in turn.
+Components are listed in the alphabetical order used by the
+simulator's source tree.
+
+### forwarding
+
+<!-- src: ADR-0037 Context, Decision -->
+The forwarding component is the generic routing relay used wherever a
+node only needs to apply a small processing overhead and pass the
+request to the next hop. NOC routers, conn nodes, and UCIe PHYs all
+reduce to this. Its first act on receiving a request is to apply the
+per-node overhead configured for it in the topology spec; after the
+overhead it simply hands the request to the next hop along the path.
+
+<!-- src: ADR-0037 Decision, Consequences -->
+The decision to share one implementation across these roles was made
+to keep the simulator's component set small without sacrificing
+modeling fidelity. Each instance still carries its own overhead and
+its own link bandwidth contributions, so different roles still produce
+different timing. What is shared is the dispatcher loop, not the
+parameter values.
+
+### hbm_ctrl
+
+<!-- src: ADR-0034 Context, Decision -->
+The HBM controller is the terminal node for all memory traffic that
+reaches HBM. Internally it owns a number of pseudo channels, partitioned
+per-PE so that each PE addresses a deterministic subset. On a request
+arrival the controller first selects the right pseudo channel from the
+target address, then enters a chunk-loop that drains the requested
+size in fixed-size flits over the channel's bandwidth.
+
+<!-- src: ADR-0034 Decision, Consequences -->
+The chunk-loop pattern replaces an earlier all-at-once drain. The
+benefit is that the controller no longer presents a flit-aware fabric
+with a single bulk transfer; instead it emits flits at a paced rate
+matching the channel bandwidth, which makes cross-flow contention
+visible. The bandwidth budget is calibrated against the configured
+HBM total bandwidth divided across the channel count.
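The chunk-loop and the per-channel bandwidth budget can be sketched as below. The function names, the flit size, and the bandwidth figures are assumptions for illustration only; channel selection from the target address is omitted.

```python
def per_channel_bandwidth(total_bw_bytes_per_ns, num_channels):
    """Split the configured HBM bandwidth evenly across pseudo channels."""
    return total_bw_bytes_per_ns / num_channels


def drain_in_flits(size_bytes, flit_bytes, channel_bw_bytes_per_ns, start_ns=0.0):
    """Return (completion_time_ns, bytes) per flit, paced at channel bandwidth."""
    t = start_ns
    remaining = size_bytes
    schedule = []
    while remaining > 0:
        chunk = min(flit_bytes, remaining)
        t += chunk / channel_bw_bytes_per_ns
        schedule.append((t, chunk))
        remaining -= chunk
    return schedule


channel_bw = per_channel_bandwidth(1024.0, 32)
schedule = drain_in_flits(160, flit_bytes=64, channel_bw_bytes_per_ns=channel_bw)
```

Each flit completes at its own timestamp, so a second flow sharing the channel could interleave between flits instead of waiting behind one bulk transfer.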
+
+### io_cpu
+
+<!-- src: ADR-0036 Context, Decision -->
+The IO_CPU is the control-plane processor sitting inside the IO chiplet.
+It receives kernel-launch messages from the host, decodes them, and
+dispatches per-cube launches to the cube's management CPU. Pure memory
+operations bypass it entirely, taking the direct data path established
+inside the IO chiplet.
+
+<!-- src: ADR-0036 Decision -->
+On receiving a kernel-launch message, the IO_CPU consults the message's
+shard list — which already names the target SIP, cube, and PE for each
+piece of the tensor argument — and forwards a per-cube launch to each
+cube the kernel needs to reach. This makes the IO_CPU a deterministic
+fan-out point: it does not decode physical addresses to route, it just
+follows the explicit per-shard targets it was handed.
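A minimal sketch of the shard-driven fan-out, assuming a hypothetical `Shard` record with `sip`, `cube`, and `pe` fields. The real message schema is not shown in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """Hypothetical per-shard target carried by a kernel-launch message."""
    sip: int
    cube: int
    pe: int


def fan_out(shards):
    """Group shard targets into one launch per (sip, cube).

    Python dicts preserve insertion order, so the launch order follows the
    order in which cubes first appear in the shard list.
    """
    launches = {}
    for shard in shards:
        launches.setdefault((shard.sip, shard.cube), []).append(shard.pe)
    return launches


launches = fan_out([Shard(0, 1, 3), Shard(0, 0, 2), Shard(0, 1, 4)])
```

No address decoding happens here: the targets come straight from the shard list.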
+
+### m_cpu
+
+<!-- src: ADR-0035 Context, Decision -->
+The M_CPU is the cube's management processor. It owns two distinct
+roles: as a control-plane fan-out point for kernel launches arriving
+from the IO chiplet, and as a DMA endpoint for host-initiated memory
+writes that need to land in this cube's HBM. The control role
+forwards launches to the right PE control CPUs; the DMA role places
+the actual bytes into HBM through the router mesh.
+
+<!-- src: ADR-0035 Decision -->
+The component model deliberately distinguishes the two roles because
+their routing differs: the control fan-out path uses command-kind
+links that do not appear on data-path routes, while the DMA path uses
+the same router mesh as PE-initiated DMA, with PE-internal nodes
+excluded. The routing layer knows about both modes and selects the
+appropriate adjacency at request time.
+
+### pcie_ep
+
+<!-- src: ADR-0038 Context, Decision -->
+The PCIe endpoint is the protocol boundary at the host-device edge.
+Its first act on each incoming request is to apply a configured
+protocol-processing overhead; after that it simply forwards. There is
+no internal queuing model, no retry, and no TLP-level fidelity — those
+are deliberately outside scope. The endpoint is bidirectional: host →
+device traffic (memory writes, kernel launches) flows one way, and
+device-side outbound traffic (cross-SIP collective sends) flows the
+other.
+
+<!-- src: ADR-0038 Decision, Alternatives Considered, Consequences -->
+A more detailed PCIe model was considered and rejected. The simulator
+targets system-level latency comparisons; making the endpoint
+heavier with credit-management and retry logic would not improve the
+metrics being studied. The decision keeps the endpoint as the
+documented protocol-boundary node, named consistently so routing
+helpers can locate it by SIP and IO instance.
+
+### pe_cpu
+
+<!-- src: ADR-0014 Decision -->
+The PE control CPU is the entry point for kernel work arriving from
+the cube's management CPU. It receives kernel-launch messages, resolves
+the kernel function by name, and hands execution to the scheduler with
+the resolved tensor arguments. From the scheduler's point of view, the
+PE_CPU is the upstream source of high-level commands; from the rest
+of the system's point of view, the PE_CPU is where a kernel's
+execution begins on a given PE.
+
+### pe_dma
+
+<!-- src: ADR-0014 Decision, ADR-0023 Decision -->
+The DMA engine on each PE has two distinct modes. In the standard PE
+pipeline it consumes tile tokens issued by the scheduler, acquires a
+read or write channel (modeled as a one-in-flight resource per
+direction), and runs the bytes to or from HBM through the mesh. In
+its collective mode it forwards send tokens for the cube's IPCQ into
+the fabric, snapshotting the source data at send time so later
+mutations cannot race the receiver's read. Both modes share the same
+channel resources but differ in their downstream handling — one
+returns when the round-trip completes, the other dispatches
+fire-and-forget.
+
+### pe_fetch_store
+
+<!-- src: ADR-0014 Decision -->
+The fetch-store engine is the bridge between the on-PE scratchpad
+(TCM) and the register file. It does not run DMA; it only moves bytes
+internally. On receiving a tile-stage token it sends a short request
+to the TCM, waits for the bandwidth-serialized delay, and continues
+the pipeline. The split between this engine and the TCM lets the
+scratchpad model its own read/write bandwidth independently.
+
+### pe_gemm
+
+<!-- src: ADR-0014 Decision -->
+The GEMM engine is the matrix-multiply compute unit. Tile tokens
+arriving at this stage carry the per-tile dimensions, and the engine
+contributes a service time accounting for one fused multiply-add per
+MAC in the tile. Composite operations (where the same tensor pair is
+streamed across many tiles) reuse the engine through the scheduler;
+the engine itself is stateless between tiles.
+
+### pe_ipcq
+
+<!-- src: ADR-0023 Context, Decision -->
+The IPCQ — inter-process communication queue — is each PE's
+collective-communication endpoint. It owns ring buffers that hold
+inbound messages from neighbor PEs and bookkeeping for send credits.
+Direction names ("N", "S", "E", "W" for cube-internal neighbors and
+"global_*" for cross-SIP neighbors) are resolved to physical peer
+endpoints by a neighbor table installed at process-group creation
+time. The component itself does not move bytes — it issues DMA tokens
+through the local PE_DMA, which performs the actual cross-PE
+transfer.
+
+<!-- src: ADR-0023 Decision, Consequences -->
+A key invariant is that the inbound terminal — where data lands at
+the receiver — pays the link bandwidth drain plus any cube-internal
+mesh hop to the slot's backing memory. This prevents IPCQ from
+silently outpacing raw DMA at large transfer sizes. Outbound sends
+are fire-and-forget; credit return is the only backpressure signal.
+
+### pe_math
+
+<!-- src: ADR-0014 Decision -->
+The math engine handles element-wise and reduction operations. It
+consumes tile tokens carrying an operation kind (`exp`, `sum`, `max`,
+`where`, etc.) and contributes a service time proportional to the
+number of elements processed. Like the GEMM engine it is stateless;
+chained epilogues (a sequence of math operations after a GEMM tile)
+are scheduled as separate stages.
+
+### pe_mmu
+
+<!-- src: ADR-0039 Context, Decision -->
+The MMU has two roles, exposed through one component. As a node on
+the cube NOC it receives MMU-map and MMU-unmap messages and updates
+its internal page table, so that the runtime API can install
+virtual-to-physical mappings with measured fabric latency. As a
+utility object held inside the PE it offers synchronous translate
+calls to the PE's DMA and GEMM engines without taking simulator time
+itself; the calling engine pays any configured TLB overhead in its
+own process.
+
+<!-- src: ADR-0039 Decision, Alternatives Considered -->
+The page table supports multiple disjoint regions inside a single
+page, with later-write-wins semantics on overlap. This is a deliberate
+simulator stopgap: it supports parallelization policies that shard data
+at sub-page granularity, which a real hardware MMU's one-PA-per-entry
+assumption would silently mis-route. A real MMU does not work
+this way; the model documents this as a simplification.
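Later-write-wins over possibly overlapping regions can be sketched by keeping mappings in insertion order and searching newest first. The `RegionTable` class, its method names, and the addresses used are illustrative assumptions, not the component's actual interface.

```python
class RegionTable:
    """Hypothetical sub-page region table with later-write-wins lookup."""

    def __init__(self):
        self._regions = []  # (va_start, length, pa_start), oldest first

    def map(self, va_start, length, pa_start):
        self._regions.append((va_start, length, pa_start))

    def translate(self, va):
        # Search newest first so a later mapping shadows an older overlap.
        for va_start, length, pa_start in reversed(self._regions):
            if va_start <= va < va_start + length:
                return pa_start + (va - va_start)
        raise KeyError(f"unmapped VA {va:#x}")


table = RegionTable()
table.map(0x1000, 0x800, 0xA000)  # two disjoint regions close together
table.map(0x1800, 0x800, 0xB000)
before = table.translate(0x1810)
table.map(0x1400, 0x800, 0xC000)  # overlaps the tail of one and head of the other
after = table.translate(0x1810)
```

Translation is a plain synchronous call with no simulated delay of its own, in line with the utility role described above.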
+
+### pe_scheduler
+
+<!-- src: ADR-0014 Decision -->
+The scheduler is the sole dispatcher inside a PE. Simple commands are
+routed directly to the right engine. Composite commands generate a
+tile plan, and the resulting tile tokens are fed into the pipeline.
+Self-routing keeps the scheduler off the per-stage hot path: each
+engine, on finishing a stage, advances the token to the next stage's
+component itself, so the scheduler only does initial dispatch and
+completion tracking.
+
+### pe_tcm
+
+<!-- src: ADR-0040 Context, Decision -->
+The TCM is the per-PE tightly-coupled scratchpad memory. It models
+time only, not data — the actual payload lives in the simulator's
+memory store. Read and write are independent channels: each is
+modeled as a one-in-flight resource, so same-direction requests
+serialize but a read and a write can overlap. The bandwidth of each
+direction is configured separately and applied as bytes-over-bandwidth
+on each request.
+
+<!-- src: ADR-0040 Decision, Alternatives Considered -->
+The decision to keep read and write on separate channels was made
+because the PE pipeline's normal case overlaps fetch (read) and store
+(write). Collapsing them into a single shared channel would have
+artificially serialized that overlap and produced an incorrect
+bandwidth ceiling.
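
The channel behavior above can be sketched as follows. This is an illustration under stated assumptions, not the simulator's component: `Channel`, `Tcm`, and the bytes-per-nanosecond unit are hypothetical, and the real model runs inside a discrete-event engine rather than computing finish times directly.

```python
class Channel:
    """One-in-flight resource: same-direction requests serialize."""

    def __init__(self, bytes_per_ns: float) -> None:
        self.bytes_per_ns = bytes_per_ns
        self.busy_until = 0.0

    def request(self, now: float, nbytes: int) -> float:
        # A request starts when the channel is free, then occupies it
        # for bytes-over-bandwidth.
        start = max(now, self.busy_until)
        self.busy_until = start + nbytes / self.bytes_per_ns
        return self.busy_until


class Tcm:
    """Timing-only scratchpad with independent read and write channels."""

    def __init__(self, read_bytes_per_ns: float, write_bytes_per_ns: float) -> None:
        self.read = Channel(read_bytes_per_ns)
        self.write = Channel(write_bytes_per_ns)
```

Two reads issued at the same time finish one after the other, while a write issued at that time overlaps with them because it uses the other channel.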
+
+### sram
+
+<!-- src: ADR-0041 Context, Decision -->
+The cube SRAM is a per-cube scratchpad attached to one of the cube's
+routers. As a node it applies a configured access overhead, pays the
+link-bandwidth drain stamped on the incoming request, and sends a
+response on the reverse path. It is a terminal — it does not forward.
+
+<!-- src: ADR-0041 Decision, Consequences -->
+A second role is as one of three backing-memory tiers (TCM, SRAM, HBM)
+that an inter-PE collective slot can live in. When the slot lives in
+SRAM, the PE_DMA pays the slot read or write latency directly using
+the configured SRAM bandwidth and overhead; the SRAM component does
+not need to know about collective semantics. This separation keeps
+the SRAM component agnostic to the collective subsystem.
+
+### tiling
+
+<!-- src: ADR-0042 Context, Decision -->
+The tile-plan generator is not a runtime component; it is a module of
+pure functions that take a problem shape (matrix dimensions, tile
+sizes) and produce an ordered list of tile-stage sequences. The
+scheduler consumes this list. Each tile's stage sequence depends on
+how its operands are staged: operands streamed from HBM produce
+DMA_READ stages, while operands already resident in TCM (because they
+were loaded eagerly upfront) skip them.
+
+<!-- src: ADR-0042 Decision, Consequences -->
+The plan generator is intentionally pure — given the same input it
+returns the same plan, with no simulator events created. This lets
+the rest of the system reason about tile sequences as data, and it
+makes the plan testable in isolation without simulator state. New
+plan variants (for example, K-major or DTensor-aware plans) can be
+added as new functions following the same shape.
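
A sketch of what such a pure plan function might look like. The function name, its parameters, and every stage label other than `DMA_READ` are placeholders chosen for illustration; the real stage vocabulary lives in ADR-0042 and ADR-0014.

```python
from itertools import product


def plan_gemm_tiles(m, n, k, tile_m, tile_n, tile_k,
                    a_resident=False, b_resident=False):
    """Return an ordered list of (tile_index, stages) with no side effects."""

    def ceil_div(x, y):
        return -(-x // y)

    plan = []
    grid = product(range(ceil_div(m, tile_m)),
                   range(ceil_div(n, tile_n)),
                   range(ceil_div(k, tile_k)))
    for tile_index in grid:
        stages = []
        # Operands streamed from HBM need a DMA_READ stage; operands
        # already resident in TCM skip it.
        if not a_resident:
            stages.append("DMA_READ:A")
        if not b_resident:
            stages.append("DMA_READ:B")
        stages += ["FETCH", "GEMM", "STORE"]  # placeholder stage names
        plan.append((tile_index, tuple(stages)))
    return plan
```

Because the function touches no simulator state, two calls with the same arguments compare equal, which is what makes the plan testable in isolation.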
+
+---
+
+## Implementation Decisions
+
+This section collects cross-cutting decisions — algorithms, policies,
+schemes, and contracts — that span multiple components rather than
+living inside one.
+
+### Address Scheme
+
+<!-- src: ADR-0001 Context, Decision -->
+Every physical address in the simulator decodes into a structured
+location. A fixed-width physical address carries the SIP id, the
+cube id within the SIP, a type discriminator (HBM vs PE-resource vs
+others), and a type-specific offset. HBM addresses additionally encode
+the per-PE slice offset so the controller can determine which PE
+owns the target slice without external lookup. The layout is
+deliberately reserved rather than packed-to-fit, so new sub-units can
+be added at the type-discriminator level without rewriting existing
+addresses.
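
To make the structured decode concrete, here is a sketch with made-up field widths. The widths, the field order, and the names `Location`, `encode`, and `decode` are all assumptions; ADR-0001 defines the actual layout, and the per-PE slice encoding for HBM is left out.

```python
from typing import NamedTuple

# Hypothetical field widths, most significant field first.
SIP_BITS, CUBE_BITS, KIND_BITS, OFFSET_BITS = 4, 4, 4, 36


class Location(NamedTuple):
    sip: int
    cube: int
    kind: int     # type discriminator (HBM, PE resource, ...)
    offset: int   # type-specific offset


def encode(loc: Location) -> int:
    pa = loc.sip
    pa = (pa << CUBE_BITS) | loc.cube
    pa = (pa << KIND_BITS) | loc.kind
    return (pa << OFFSET_BITS) | loc.offset


def decode(pa: int) -> Location:
    offset = pa & ((1 << OFFSET_BITS) - 1)
    pa >>= OFFSET_BITS
    kind = pa & ((1 << KIND_BITS) - 1)
    pa >>= KIND_BITS
    cube = pa & ((1 << CUBE_BITS) - 1)
    sip = pa >> CUBE_BITS
    return Location(sip, cube, kind, offset)
```

Keeping the discriminator as its own field is what allows a new sub-unit type to take an unused discriminator value without moving any existing address.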
+
+<!-- src: ADR-0011 Context, Decision -->
+On top of physical addressing, the simulator supports three address
+models that the runtime API selects between. Direct physical
+addressing is retained as a fallback. Virtual addressing, the
+current default, gives each tensor a contiguous virtual range at
+deployment, with the per-PE MMU translating per access. An
+alternative logical-address scheme remains a future option. All
+current test paths use virtual addressing; the MMU itself uses the PA
+fallback when no mapping exists for an address (a deliberate signal,
+not an error).
+
+<!-- src: ADR-0011 Decision, Consequences -->
+Tensor placement is represented as a list of physical-address shards,
+each tagged with target SIP, cube, and PE, plus a single tensor-wide
+virtual base. This means a kernel sees one virtual base for the whole
+tensor while the host driver and the engine still know exactly where
+each shard lives. Replicated tensors get per-cube local PA mappings;
+sharded tensors broadcast their mapping across cubes within a SIP.
+
+### Routing, Distance & Helper API
+
+<!-- src: ADR-0002 Context, Decision -->
+Routing is policy-driven, deterministic, and topology-aware. Given a
+source, a destination, and an intent — for example, PE-initiated
+DMA versus host-initiated memory write versus a generic
+component-to-component query — the routing layer picks the right
+path. The intent matters because different traffic types must avoid
+different categories of edges: PE-initiated DMA should not traverse
+command-only links; M_CPU DMA should not pass through PE-internal
+pipeline edges; cube-local transfers should not use the
+zero-distance UCIe bus that would otherwise look attractive to a
+shortest-path search.
+
+<!-- src: ADR-0051 Decision -->
+The routing layer therefore maintains four separate adjacency graphs
+at construction, each excluding a different category of edges, and
+picks the appropriate one per intent. On top of the graphs sits a
+helper API that hides the topology's naming convention: callers ask
+for the PCIe endpoint of a given SIP, the M_CPU of a given cube, or
+the HBM destination for a given physical address, and receive the
+corresponding node id. No component constructs node-id strings
+directly; if the naming convention ever changes, the change is local
+to the helper layer.
+
+<!-- src: ADR-0051 Decision, Consequences -->
+Path-finding itself uses Dijkstra with explicit per-edge weights.
+Routing weight is allowed to differ from physical distance; for
+example, UCIe is configured to be routing-preferable. Tie-breaks
+follow insertion order, which keeps results deterministic. A request
+for a path between unreachable nodes raises an error rather than
+returning an empty path, surfacing topology errors immediately.
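
The path-finding rules above can be sketched with a textbook Dijkstra. This is not the routing layer's code: the adjacency format, the function name, and the use of `LookupError` are assumptions. Insertion-order tie-breaking is modeled with a monotonically increasing sequence number in the heap key and a strict `<` when relaxing edges.

```python
import heapq
from itertools import count


def shortest_path(adj, src, dst):
    """adj maps node -> list of (neighbor, weight) in insertion order."""
    seq = count()                      # tie-breaker: earlier pushes pop first
    best = {src: 0}
    prev = {}
    heap = [(0, next(seq), src)]
    done = set()
    while heap:
        dist, _, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == dst:
            break
        for nbr, weight in adj.get(node, []):
            cand = dist + weight
            # Strict "<" keeps the first-discovered path on equal cost.
            if nbr not in best or cand < best[nbr]:
                best[nbr] = cand
                prev[nbr] = node
                heapq.heappush(heap, (cand, next(seq), nbr))
    if dst not in done:
        raise LookupError(f"no path from {src!r} to {dst!r}")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]
```

In a layout with several adjacency graphs, one per intent, the caller would pick the graph first and then run the same search on it.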
+
+### Memory Semantics and Local-HBM Bandwidth
+
+<!-- src: ADR-0004 Context, Decision -->
+A PE accessing its own HBM slice through its own cube's NOC must see
+the full local HBM bandwidth — that is the model's intent. Memory
+traffic accumulates latency from per-component overhead and
+bytes-over-link-bandwidth serialization along the path, but the
+controller does not throttle below the slice's allotted bandwidth.
+Cross-PE-slice accesses inside the same cube, cross-cube accesses
+through UCIe, and cross-SIP accesses through PCIe each pay
+progressively more overhead as the path grows.
+
+### Topology Compilation, Diagrams & Builder Algorithms
+
+<!-- src: ADR-0006 Context, Decision -->
+Topology is configurable, not hardcoded. The simulator reads a YAML
+spec, compiles it into a flat graph of nodes and edges plus four
+view projections at different abstraction levels — system, SIP, cube,
+PE — and uses the compiled graph as the single source for both
+execution and visualization. Distance metadata used by routing is
+extracted at compile time so that diagrams and routing decisions
+agree by construction.
+
+<!-- src: ADR-0005 Context, Decision -->
+Diagrams are derived artifacts of the compiled topology. The visualizer
+produces one SVG per view at the appropriate abstraction level; nothing
+in the diagrams is hand-drawn or hand-positioned. Distance-aware
+layout rules place nodes in the diagrams using the same coordinates
+that routing uses to compute distance, so a diagram that "looks
+wrong" is a signal that the topology itself has a problem, not the
+visualizer.
+
+<!-- src: ADR-0053 Decision -->
+Inside a cube the router mesh is generated automatically. PE corner
+positions are fixed by convention; the relay-column algorithm
+inserts additional grid columns whenever the gap between adjacent PE
+columns would exceed a tunable maximum. HBM occupies a central
+exclusion zone — router slots inside the zone are deliberately empty,
+since HBM controllers attach as separate named nodes. M_CPU and SRAM
+attach to the nearest router by Euclidean distance from their
+configured placement coordinates, and UCIe physical lanes distribute
+along the boundary rows and columns. The whole mesh is cached
+beside the topology spec and invalidated only when one of a small set
+of layout-relevant fields changes.
+
+<!-- DIAGRAM: One cube's router mesh — rows × cols of routers with HBM exclusion zone in the middle, PEs/M_CPU/SRAM attaching to nearest routers, UCIe PHYs along the perimeter. -->
+
+### Tensor Deployment and Allocation
+
+<!-- src: ADR-0008 Context, Decision -->
+Tensor deployment in the runtime API produces a list of physical-address
+shards plus a single tensor-wide virtual base. The host allocator
+walks the data-parallelism policy, computes per-shard placement, and
+emits the per-shard physical addresses through the per-PE allocators.
+No separate "allocate then later attach to a device" RPC exists —
+allocation and deployment are a single operation that produces a
+deployed tensor handle.
+
+### Memory Allocator Algorithms
+
+<!-- src: ADR-0048 Context, Decision -->
+Each per-PE allocator owns two channels — HBM slice and TCM — each
+backed by an offset-keyed free-list. Allocation is first-fit; freeing
+coalesces with adjacent free blocks. A device-wide virtual allocator
+sits above the per-PE allocators, aligns requests up to the configured
+page size, and coalesces on free in the same way. The trade-off is
+explicit: first-fit is simpler and cheaper than best-fit or buddy
+allocation, and the simulator's workload is stack-like enough
+(deploy / kernel / free in matched order) that fragmentation is not
+a practical concern.
+
+<!-- src: ADR-0048 Decision, Consequences -->
+Allocation failure raises rather than silently returning a partial
+result. A partial tensor reaching the engine would route over wrong
+PAs and silently corrupt simulator output, so an out-of-memory signal
+is preferred. The free path trusts its caller to pass back exactly
+what was allocated; the small risk of caller error in exchange for
+fast common-case freeing is documented as a deliberate trade.
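
A compact sketch of first-fit allocation with coalescing on free, under the assumption that the free-list is a sorted list of `(offset, length)` pairs. The class and method names are hypothetical, and page alignment in the virtual allocator is omitted.

```python
class FirstFitAllocator:
    def __init__(self, size: int) -> None:
        self.free_list = [(0, size)]   # sorted by offset

    def alloc(self, nbytes: int) -> int:
        # First fit: take the first free block that is large enough.
        for i, (offset, length) in enumerate(self.free_list):
            if length >= nbytes:
                if length == nbytes:
                    del self.free_list[i]
                else:
                    self.free_list[i] = (offset + nbytes, length - nbytes)
                return offset
        # Fail loudly instead of handing back a partial allocation.
        raise MemoryError(f"cannot allocate {nbytes} bytes")

    def release(self, offset: int, nbytes: int) -> None:
        # Trusts the caller to pass back exactly what was allocated.
        merged = []
        for off, length in sorted(self.free_list + [(offset, nbytes)]):
            if merged and merged[-1][0] + merged[-1][1] == off:
                merged[-1] = (merged[-1][0], merged[-1][1] + length)
            else:
                merged.append((off, length))
        self.free_list = merged
```

Freeing every block in any order leaves a single free block that covers the whole region.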
+
+### Kernel Execution and Host-Device Messaging
+
+<!-- src: ADR-0009 Context, Decision -->
+Kernel execution decomposes into a small set of messages that travel
+the device graph. The host issues a single kernel-launch message; the
+IO_CPU fans it out per-cube; the cube M_CPU fans it out per-PE; the
+PE CPU resolves the kernel and runs it through the scheduler.
+Completion flows back the same way, gated by per-shard completion
+tracking. Memory operations follow the same pattern: a memory write
+or read travels as one message that the engine routes to the right
+HBM controller, with a response taking the reverse path.
+
+<!-- src: ADR-0012 Context, Decision -->
+The schema between the host and the device-side IO CPU is PA-first
+and shard-tagged. Every byte of host-issued payload arrives with an
+explicit target SIP, cube, PE, and physical address. The IO_CPU does
+not decode addresses to derive placement — placement is named
+explicitly by the shard list. This makes the host-device interface
+deterministic and keeps the routing helper free of host-derived
+intent.
+
+### CLI Surface and Semantics
+
+<!-- src: ADR-0010 Context, Decision -->
+The command-line interface exposes four subcommands. A bench runner
+loads a topology, resolves a registered benchmark by name or index,
+and runs it on a selected device. A bench-listing command enumerates
+the registered benchmarks. A probe utility runs a fixed catalog of
+traffic patterns through the engine for latency and bandwidth
+verification. A web viewer renders the topology in a browser. A
+benchmark instance is always single-device by convention; multi-SIP
+collective work happens inside the benchmark through the launcher
+abstraction, not by multiplexing the CLI.
+
+### Component Port and Wire Fabric Model
+
+<!-- src: ADR-0015 Context, Decision -->
+Every modeled component exposes input and output ports, and every
+edge in the topology connects an output port on one component to an
+input port on another. Bandwidth and propagation delay are properties
+of the wire between ports, not of the component endpoints. A
+component's responsibility is to apply its configured per-node
+overhead and either forward to the next hop or terminate; the wire
+charges the byte-over-bandwidth serialization separately.
+
+<!-- src: ADR-0015 Decision, Consequences -->
+This separation lets components be swapped behind their port
+interface without changing the rest of the model, and it keeps
+bandwidth contention at the wire level where multiple components may
+contend for the same edge. Future component models can refine
+internal behavior without disturbing the fabric.
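
As a rough illustration of how the split between node overhead and wire cost adds up along a path, consider the sketch below. The tuple layout and units are hypothetical, and the sketch ignores contention; it only sums the three contributions the text names.

```python
def path_latency_ns(nbytes, hops):
    """hops: list of (node_overhead_ns, wire_bytes_per_ns, wire_delay_ns)."""
    total = 0.0
    for node_overhead_ns, wire_bytes_per_ns, wire_delay_ns in hops:
        total += node_overhead_ns            # charged by the component
        total += nbytes / wire_bytes_per_ns  # serialization, charged by the wire
        total += wire_delay_ns               # propagation, charged by the wire
    return total
```

For a 1024-byte transfer over two hops with overheads of 5 ns and 3 ns, wire bandwidths of 64 and 32 bytes/ns, and propagation delays of 1 ns and 2 ns, the terms are 5 + 16 + 1 + 3 + 32 + 2.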
+
+### Two-Pass Data Execution
+
+<!-- src: ADR-0020 Context, Decision -->
+The simulator runs in two passes. The first pass — fast and always
+on — runs the discrete-event engine and records every data operation
+in an operation log with timestamps, component identifiers, and per-
+operation parameters. The second pass — optional, opt-in — replays
+the log against an in-memory tensor store to produce actual numerical
+results. Tests that only need timing skip the second pass; tests that
+need to verify correctness opt in.
+
+<!-- src: ADR-0020 Decision, Consequences -->
+The split lets the timing engine remain unconcerned with data
+semantics: kernels move handles around, not bytes. The replay phase
+recovers data semantics from the recorded operations, in their
+original time order with a small set of secondary-sort rules. The
+op-log records carry enough metadata — input snapshots for compute
+operations, source snapshots for cross-component copies — that the
+replay phase cannot mis-order with respect to in-flight mutations.
+
+### Sim-engine Op Log and Memory Store Schemas
+
+<!-- src: ADR-0052 Context, Decision -->
+The operation log holds typed records with seven fields each: start
+and end timestamps, the component that issued the operation, an
+operation kind ("memory", "gemm", "math"), an operation name, a
+parameter dictionary, and a (currently unused) dependency list.
+Records are kept in stable timestamp order. The parameter dictionary
+varies by operation: a DMA read carries source address and byte count;
+a GEMM carries operand shapes, dtypes, and address spaces; a math
+operation carries input addresses and snapshots.
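
The seven-field record could be modeled as a small dataclass. The field names here are guesses for illustration, as is the choice to order by start timestamp; ADR-0052 holds the actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component: str
    kind: str                                   # "memory", "gemm", or "math"
    name: str
    params: dict = field(default_factory=dict)  # varies by operation
    deps: list = field(default_factory=list)    # currently unused


def append_in_order(log, record):
    """Insert so the log stays sorted by start time; ties keep arrival order."""
    i = len(log)
    while i > 0 and log[i - 1].t_start > record.t_start:
        i -= 1
    log.insert(i, record)
```

A record with an equal start time lands after the ones already present, which is the stable-order property the replay pass relies on.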
+
+<!-- src: ADR-0052 Decision, Consequences -->
+The companion memory store is a two-level dictionary keyed by
+address space ("hbm", "tcm", "sram", others) and integer address.
+Reads and writes are reference-based — no copy by default — so
+callers wanting to detach a snapshot must copy explicitly. This is
+deliberate: the engine-internal snapshot paths copy at well-defined
+points (math input capture, HBM source capture for DMA writes,
+inbound collective copies) and downstream replay code therefore
+sees stable data even when slot or scratch addresses are reused by
+later operations.
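
The reference-based behavior is easy to get wrong, so here is a small sketch of it. `MemoryStore` and its methods are hypothetical stand-ins, and plain Python lists stand in for tensor data.

```python
import copy


class MemoryStore:
    def __init__(self) -> None:
        self._spaces = {}   # address space -> {integer address -> value}

    def write(self, space: str, addr: int, value) -> None:
        self._spaces.setdefault(space, {})[addr] = value   # stores a reference

    def read(self, space: str, addr: int):
        return self._spaces[space][addr]                   # returns the same object


store = MemoryStore()
buf = [1, 2, 3]
store.write("tcm", 0x40, buf)
alias = store.read("tcm", 0x40)     # same object as buf
snapshot = copy.deepcopy(alias)     # detached copy
buf[0] = 99                         # later mutation of the stored object
```

After the mutation, `alias` sees the new value while `snapshot` still holds the original contents, which is why snapshot points have to copy explicitly.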
+
+### 2D Grid Program Identity
+
+<!-- src: ADR-0022 Context, Decision -->
+Inside a kernel the program identity is two-dimensional. The
+first axis corresponds to the PE index within a cube; the second
+corresponds to the cube index within a SIP. Together they let a
+kernel address its position both within its cube and within the
+larger system without needing to know the full topology. Total
+program counts along each axis are exposed symmetrically.
+
+### Parallelism — SIP Launcher, DPPolicy, Megatron-TP, AHBM Backend, and CCL Algorithm Module
+
+<!-- src: ADR-0024 Context, Decision -->
+The launcher model treats each SIP as one rank. Inside a process the
+launcher spawns one greenlet per SIP rank; the rank is bound to its
+greenlet so that any code running in that worker sees the right
+distributed-style rank. This is a deliberately PyTorch-compatible
+shape: a benchmark looks like a small DDP training script — initialize
+a process group, spawn workers, each worker runs the same body.
+
+<!-- src: ADR-0026 Context, Decision -->
+Data-parallelism policy lives in a single object that names the
+sharding strategy along the cube axis (replicate, row-wise,
+column-wise) and along the PE axis (same set of values), and optionally
+overrides the number of cubes or PEs participating. The policy is
+intra-device — it does not cross SIP boundaries. SIP-level parallelism
+is the launcher's responsibility, and the two axes compose
+orthogonally.
+
+<!-- src: ADR-0027 Context, Decision -->
+A Megatron-style tensor-parallel API sits on top of the launcher and
+the DP policy. Layer-level building blocks — column-parallel linear,
+row-parallel linear, all-reduce — name their sharding intent in terms
+the launcher and the placement policy can compose. This is the layer
+that bench code typically writes against.
+
+<!-- src: ADR-0047 Context, Decision -->
+For collective operations the runtime exposes a PyTorch-compatible
+distributed backend named "ahbm". On process-group initialization the
+backend loads the configured collective-algorithm module, resolves
+the world size (priority: explicit ccl.yaml override → defaults
+section → topology SIP count), imports the algorithm module
+dynamically, derives the SIP topology kind, and pushes the inter-PE
+neighbor table to every participating PE. From that point on, an
+all-reduce call dispatches the algorithm's kernel function across
+all ranks.
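
The world-size priority order can be written as a short fallback chain. The configuration keys `world_size` and `defaults` are assumptions about how the parsed ccl.yaml might look, not its documented schema.

```python
def resolve_world_size(ccl_cfg: dict, topology_sip_count: int) -> int:
    # 1. Explicit override in the collective configuration.
    explicit = ccl_cfg.get("world_size")
    if explicit is not None:
        return explicit
    # 2. Value from the defaults section.
    default = ccl_cfg.get("defaults", {}).get("world_size")
    if default is not None:
        return default
    # 3. Fall back to the number of SIPs in the topology.
    return topology_sip_count
```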
+
+<!-- src: ADR-0050 Context, Decision -->
+A collective-algorithm module is a Python module with a small, fixed
+contract. It exposes topology-kind integer constants, a name-to-kind
+mapping for the YAML configuration, a kernel-arguments builder, and
+a kernel function — the kernel function being aliased to the name
+`kernel` so the backend can find it generically. The kernel itself
+takes the tensor pointer, the per-cube element count, cube mesh
+width and height, the world size, the current rank, and the SIP
+topology dimensions; the backend appends those last four arguments
+automatically. New collectives slot in by adding a new module that
+follows this shape.
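
A skeleton of a module that follows this contract might look like the sketch below. Every identifier except the `kernel` alias is hypothetical. The sketch also assumes that the SIP topology dimensions are a width and a height, so the four backend-appended arguments are world size, rank, and those two dimensions.

```python
# Hypothetical collective-algorithm module.

# Topology-kind integer constants.
TOPO_RING = 0
TOPO_TORUS = 1
TOPO_MESH = 2

# Name-to-kind mapping used when reading the YAML configuration.
TOPOLOGY_KINDS = {"ring": TOPO_RING, "torus": TOPO_TORUS, "mesh": TOPO_MESH}


def build_kernel_args(tensor_ptr, elems_per_cube, mesh_w, mesh_h):
    """Arguments supplied by the caller; the backend appends the rest."""
    return (tensor_ptr, elems_per_cube, mesh_w, mesh_h)


def allreduce_kernel(tensor_ptr, elems_per_cube, mesh_w, mesh_h,
                     world_size, rank, sip_w, sip_h):
    raise NotImplementedError("algorithm body omitted in this sketch")


# The backend looks the kernel up under this fixed name.
kernel = allreduce_kernel


def complete_args(user_args, world_size, rank, sip_dims):
    """What a backend might do: append the last four arguments."""
    return tuple(user_args) + (world_size, rank) + tuple(sip_dims)
```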
+
+<!-- src: ADR-0027 Decision, Consequences -->
+The combination is deliberate: bench authors get to write code that
+looks like a regular distributed training script, while the launcher,
+backend, and placement policies behind it remain free to redirect
+work to the right SIP, cube, and PE without exposing topology to the
+kernel.
+
+### IPCQ Direction Addressing
+
+<!-- src: ADR-0025 Context, Decision -->
+Inside a collective algorithm, peer PEs are named by direction —
+"N", "S", "E", "W" for cube-internal neighbors, and "global_*" for
+cross-SIP neighbors. Direction addressing is the addressing scheme:
+the algorithm names a direction, the IPCQ neighbor table installed
+at process-group time resolves the direction to the peer endpoint's
+physical-address coordinates, and the PE_DMA performs the actual
+transfer. The algorithm itself does not see PA arithmetic — direction
+is the user-facing handle.
+
+### Intercube All-Reduce
+
+<!-- src: ADR-0032 Context, Decision -->
+The default all-reduce algorithm runs in three phases: a center-rooted
+bidirectional reduction inside each SIP's cube mesh, an inter-SIP
+exchange on the mesh's root cube, and a bidirectional broadcast back
+out. Center-rooting halves the in-cube hop count compared with a
+corner-rooted walk. The inter-SIP exchange itself follows the
+configured SIP topology — ring, torus, or non-wrapping mesh —
+selected at runtime through the SIP-topology kind integer the
+backend passes to the kernel.
+
+### Evaluation Harnesses
+
+<!-- src: ADR-0043 Context, Decision -->
+The all-reduce evaluation harness drives correctness and the
+latency/buffer-kind sweeps through the public distributed path —
+initialize process group, spawn workers, call all-reduce — rather
+than the lower-level engine interface. A shared helper module factors
+out the setup; sweep tests cover the buffer-kind tiers (TCM, SRAM,
+HBM) and the inter-SIP topology variants. The plots produced by the
+harness are part of its output contract; the harness regenerates them
+on demand.
+
+<!-- src: ADR-0044 Context, Decision -->
+The GEMM evaluation harness is split into two layers. A heavy
+shape-and-variant sweep lives as a manual script — it runs the same
+composite-GEMM benchmark across many shapes and operand-staging
+variants, harvests the resulting op-log, and writes a JSON summary.
+A faster figure-generation layer lives in the test suite and consumes
+that JSON to render plots. The split keeps the heavy data
+generation explicit and out of the regular test path.
+
+### Bench Module Contract
+
+<!-- src: ADR-0045 Context, Decision -->
+Adding a new benchmark requires only dropping a file into the
+benchmarks directory. The file registers one or more benchmark
+functions through a small decorator that takes a kebab-case name and
+a human-readable description. The decorator is the registration
+mechanism — there is no separate manifest. Each benchmark function
+takes one argument, conventionally named `torch`, which is the
+runtime context exposing tensor allocation, kernel launch,
+distributed APIs, and process-spawning. The function name is `run` by
+convention.
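
A decorator-based registry of this kind can be sketched in a few lines. The decorator name `benchmark`, the `BENCHMARKS` dictionary, and the example benchmark are hypothetical; only the shape (a kebab-case name, a description, a `run(torch)` function) comes from the text.

```python
BENCHMARKS = {}


def benchmark(name: str, description: str):
    """Register the decorated function under a kebab-case name."""
    def register(fn):
        if name in BENCHMARKS:
            raise ValueError(f"duplicate benchmark name: {name}")
        BENCHMARKS[name] = (description, fn)
        return fn
    return register


@benchmark("vector-add", "Element-wise add of two small tensors")
def run(torch):
    """`torch` is the runtime context handed in by the runner."""
    # A real benchmark would allocate tensors and launch a kernel here.
```

Importing the file is enough to register it, which is why no separate manifest is needed.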
+
+<!-- src: ADR-0045 Decision, Consequences -->
+A benchmark must submit at least one operation, or the runner
+returns an error. A benchmark instance is single-device by default;
+when a benchmark is collective, it uses the distributed-process-spawn
+pattern internally — one worker greenlet per rank, with each worker
+binding to its rank. Multi-device benchmark patterns outside that
+shape are not supported.
+
+### Kernel-side `tl.*` API
+
+<!-- src: ADR-0046 Context, Decision -->
+Inside a kernel function, the `tl` argument exposes the kernel-side
+API in a shape that mirrors the conventions of established
+GPU-kernel languages. Categories: reference handles that name HBM
+data without issuing DMA; data movement (load, store) that does
+issue DMA; GEMM and math compute (dot, composite, the unary and
+binary math operations, reductions); index and scalar helpers
+(program identity, range-builders); metadata-only operations like
+transpose; and the collective primitives (send, receive,
+non-blocking receive). Tensor handles support arithmetic operators
+via a thread-local active context so kernel code reads naturally.
+
+<!-- src: ADR-0046 Decision, Consequences -->
+The API supports two execution modes. A command-list mode records
+operations into a list without consuming simulator time — useful for
+inspection and lightweight tests. A greenlet-driven mode runs the
+kernel as a child greenlet that switches back to the simulator on
+each `tl.*` call; the simulator drives the event scheduler and hands
+real data back to the kernel as DMA reads complete. The two modes
+share the same surface; the kernel does not know which one it is
+running under.
+
+### Probe Subcommand
+
+<!-- src: ADR-0049 Context, Decision -->
+The probe utility runs three families of traffic patterns through
+the engine — host-to-device writes at increasing hop counts,
+device-to-host reads at increasing hop counts, and PE-initiated DMA
+across the cube mesh — and reports actual latency, the analytical
+formula breakdown, effective bandwidth, bottleneck bandwidth, and
+utilization. A fixed reference size is used for the summary table;
+a separate utilization-versus-size sweep covers a logarithmic range
+of transfer sizes. Each case runs in its own engine instance so
+cases do not perturb each other.
+
+<!-- src: ADR-0049 Decision, Consequences -->
+The probe also checks a small set of invariants automatically:
+monotonic latency increase with hop count, device-to-host latency
+at least as large as host-to-device for the same hop count, and a
+faster best-case path than worst-case for cross-cube PE DMA. Failures
+are printed prominently. The output is meant for human reading;
+automated parsing should not depend on column widths or whitespace.
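
The three invariants could be checked along these lines. The function signature is hypothetical, latencies are assumed to be lists indexed by hop count, and "monotonic" is read here as non-decreasing.

```python
def check_invariants(h2d_ns, d2h_ns, dma_best_ns, dma_worst_ns):
    """Return a list of human-readable failure messages (empty if all pass)."""
    failures = []
    # Latency must not drop as the hop count grows.
    for label, series in (("h2d", h2d_ns), ("d2h", d2h_ns)):
        if any(later < earlier for earlier, later in zip(series, series[1:])):
            failures.append(f"{label}: latency decreases as hop count grows")
    # Device-to-host must be at least as slow as host-to-device per hop count.
    if any(d < h for h, d in zip(h2d_ns, d2h_ns)):
        failures.append("d2h latency is below h2d at the same hop count")
    # Best-case cross-cube PE DMA must beat the worst case.
    if not dma_best_ns < dma_worst_ns:
        failures.append("best-case PE DMA is not faster than worst-case")
    return failures
```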
+
+---
+
+This document summarizes 46 architecture decisions captured during
+the first half of 2026. It is regenerated mechanically from the
+decision corpus; sources are recorded in HTML comments throughout.
@@ -0,0 +1,333 @@
+"""Generate docs/adr/INDEX.md (and docs/adr-ko/INDEX.md) from the ADR corpus.
+
+Auto-derives a section-based index following the same classification as
+the /report skill — Design Principles / High-level Architecture /
+Detailed Architecture (by component) / Implementation Decisions
+(by topic). Run before publishing to refresh INDEX.md.
+
+The classification table below is the single source of truth. When a new
+ADR is added under docs/adr/, append an entry to ``CLASSIFICATION``. The
+script exits 1 if any ADR file is missing from the table or any title
+cannot be parsed, so omissions surface in CI.
+
+Usage:
+    python tools/generate_adr_index.py [--root <repo-root>] [--check]
+
+  --check : exit 1 if the generated INDEX differs from the on-disk file
+            (used by CI to detect un-regenerated indexes).
+"""
+
+from __future__ import annotations
+
+import argparse
+import re
+import sys
+from pathlib import Path
+
+ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-([a-z0-9_-]+)\.md$")
+# Title separator may be ":" (most ADRs) or "—" (em-dash; ADR-0033 uses
+# this). The verifier (tools/verify_adr_lang_pairs.py) only checks the
+# number, so both styles already coexist in the corpus.
+TITLE_RE = re.compile(r"^# ADR-(\d{4})\s*[:—]\s*(.+?)\s*$")
+
+DESIGN_PRINCIPLES = "Design Principles"
+HIGH_LEVEL = "High-level Architecture"
+DETAILED = "Detailed Architecture"
+IMPL_DECISIONS = "Implementation Decisions"
+
+
+# (section, subgroup) per ADR. subgroup is used to sub-divide Detailed
+# (by component, see DETAILED_COMPONENTS) and Implementation (by topic).
+# Add a line here when introducing a new ADR.
+CLASSIFICATION: dict[int, tuple[str, str | None]] = {
+    # Design Principles
+    13: (DESIGN_PRINCIPLES, None),
+    33: (DESIGN_PRINCIPLES, None),
+
+    # High-level Architecture
+    3:  (HIGH_LEVEL, "System hierarchy (Tray / SIP / CUBE / PE)"),
+    7:  (HIGH_LEVEL, "Runtime API ↔ sim_engine boundaries"),
+    16: (HIGH_LEVEL, "IOChiplet NOC and memory data path"),
+    17: (HIGH_LEVEL, "Cube NOC and HBM connectivity"),
+
+    # Detailed Architecture (subgroup matches DETAILED_COMPONENTS entries)
+    14: (DETAILED, "pe_pipeline"),  # covers pe_cpu/pe_dma/pe_fetch_store/pe_gemm/pe_math/pe_scheduler
+    23: (DETAILED, "pe_ipcq"),
+    34: (DETAILED, "hbm_ctrl"),
+    35: (DETAILED, "m_cpu"),
+    36: (DETAILED, "io_cpu"),
+    37: (DETAILED, "forwarding"),
+    38: (DETAILED, "pcie_ep"),
+    39: (DETAILED, "pe_mmu"),
+    40: (DETAILED, "pe_tcm"),
+    41: (DETAILED, "sram"),
+    42: (DETAILED, "tiling"),
+
+    # Implementation Decisions
+    1:  (IMPL_DECISIONS, "Address Scheme"),
+    2:  (IMPL_DECISIONS, "Routing & Helper API"),
+    4:  (IMPL_DECISIONS, "Memory Semantics & Local-HBM Bandwidth"),
+    5:  (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
+    6:  (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
+    8:  (IMPL_DECISIONS, "Tensor Deployment and Allocation"),
+    9:  (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"),
+    10: (IMPL_DECISIONS, "CLI Surface and Semantics"),
+    11: (IMPL_DECISIONS, "Address Scheme"),
+    12: (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"),
+    15: (IMPL_DECISIONS, "Component Port/Wire Fabric Model"),
+    20: (IMPL_DECISIONS, "Two-Pass Data Execution"),
+    22: (IMPL_DECISIONS, "2D Grid Program Identity"),
+    24: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
+    25: (IMPL_DECISIONS, "IPCQ Direction Addressing"),
+    26: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
+    27: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
+    32: (IMPL_DECISIONS, "Intercube All-Reduce"),
+    43: (IMPL_DECISIONS, "Evaluation Harnesses"),
+    44: (IMPL_DECISIONS, "Evaluation Harnesses"),
+    45: (IMPL_DECISIONS, "Bench Module Contract"),
+    46: (IMPL_DECISIONS, "Kernel-side tl.* API (TLContext)"),
+    47: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
+    48: (IMPL_DECISIONS, "Memory Allocator Algorithms"),
+    49: (IMPL_DECISIONS, "Probe Subcommand"),
+    50: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
+    51: (IMPL_DECISIONS, "Routing & Helper API"),
+    52: (IMPL_DECISIONS, "Sim-engine Op Log and Memory Store Schemas"),
+    53: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
+}
+
+# Canonical component order for the Detailed Architecture section.
+# Each entry: (component_name, list[ADR-numbers that cover it]).
+# Order matches src/kernbench/components/builtin/*.py alphabetical
+# (the same order /report uses).
+DETAILED_COMPONENTS: list[tuple[str, list[int]]] = [
+    ("forwarding",      [37]),
+    ("hbm_ctrl",        [34]),
+    ("io_cpu",          [36]),
+    ("m_cpu",           [35]),
+    ("pcie_ep",         [38]),
+    ("pe_cpu",          [14]),
+    ("pe_dma",          [14, 23]),
+    ("pe_fetch_store",  [14]),
+    ("pe_gemm",         [14]),
+    ("pe_ipcq",         [23]),
+    ("pe_math",         [14]),
+    ("pe_mmu",          [39]),
+    ("pe_scheduler",    [14]),
+    ("pe_tcm",          [40]),
+    ("sram",            [41]),
+    ("tiling",          [42]),
+]
+
+
+def _strip_bom(text: str) -> str:
+    """Strip leading UTF-8 BOM if present."""
+    if text and ord(text[0]) == 0xFEFF:
+        return text[1:]
+    return text
+
+
+def _find_adrs(adr_dir: Path) -> list[tuple[int, str, Path]]:
+    """Return [(num, slug, path), ...] for ADR files in adr_dir, sorted by num."""
+    out: list[tuple[int, str, Path]] = []
+    for p in sorted(adr_dir.iterdir()):
+        if not p.is_file():
+            continue
+        m = ADR_FILENAME_RE.match(p.name)
+        if not m:
+            continue
+        out.append((int(m.group(1)), m.group(2), p))
+    out.sort(key=lambda t: t[0])
+    return out
+
+
+def _extract_title(path: Path) -> str:
+    """Parse the title from the first line `# ADR-NNNN: <title>` (`:` or em-dash separator). Strips BOM."""
+    text = _strip_bom(path.read_text(encoding="utf-8"))
+    first_line = text.split("\n", 1)[0] if text else ""
+    m = TITLE_RE.match(first_line)
+    if not m:
+        raise ValueError(
+            f"{path.name}: cannot parse title from first line: {first_line!r}"
+        )
+    return m.group(2)
+
+
+def _build_index(adr_dir: Path, link_prefix: str) -> str:
+    """Build the INDEX.md text for adr_dir.
+
+    link_prefix is the relative href used for ADR links (e.g., ``./``
+    so links resolve relative to the INDEX file location).
+    """
+    adrs = _find_adrs(adr_dir)
+    if not adrs:
+        raise RuntimeError(f"No ADR files found under {adr_dir}")
+
+    # Validate every ADR is classified.
+    missing = sorted(num for num, _slug, _ in adrs if num not in CLASSIFICATION)
+    if missing:
+        raise RuntimeError(
+            "ADR(s) missing from CLASSIFICATION table in "
+            "tools/generate_adr_index.py: "
+            + ", ".join(f"ADR-{n:04d}" for n in missing)
+            + ". Add an entry for each."
+        )
+
+    # Map: num → (filename, title)
+    num_to_meta: dict[int, tuple[str, str]] = {}
+    for num, _slug, path in adrs:
+        num_to_meta[num] = (path.name, _extract_title(path))
+
+    # ── Section assembly ────────────────────────────────────────────
+    lines: list[str] = []
+    lines.append("# ADR Index")
+    lines.append("")
+    lines.append(
+        f"Auto-generated by `tools/generate_adr_index.py`. "
+        f"Total ADRs: **{len(adrs)}**."
+    )
+    lines.append("")
+    lines.append(
+        "Classification mirrors the `/report` skill's section assignment. "
+        "When adding a new ADR, also add an entry to the "
+        "`CLASSIFICATION` table in `tools/generate_adr_index.py`."
+    )
+    lines.append("")
+
+    def fmt_entry(num: int) -> str:
+        fname, title = num_to_meta[num]
+        return f"- [ADR-{num:04d}]({link_prefix}{fname}) — {title}"
+
+    # Design Principles
+    lines.append("## Design Principles")
+    lines.append("")
+    nums = sorted(n for n, (sec, _) in CLASSIFICATION.items()
+                  if sec == DESIGN_PRINCIPLES and n in num_to_meta)
+    for n in nums:
+        lines.append(fmt_entry(n))
+    lines.append("")
+
+    # High-level Architecture (sorted by ADR number)
+    lines.append("## High-level Architecture")
+    lines.append("")
+    nums = sorted(n for n, (sec, _) in CLASSIFICATION.items()
+                  if sec == HIGH_LEVEL and n in num_to_meta)
+    for n in nums:
+        sub = CLASSIFICATION[n][1] or ""
+        entry = fmt_entry(n)
+        # Append the optional sub-label in italics after the title.
+        lines.append(f"{entry}  _({sub})_" if sub else entry)
+    lines.append("")
+
+    # Detailed Architecture (canonical component order)
+    lines.append("## Detailed Architecture")
+    lines.append("")
+    lines.append("One subsection per component file under `src/kernbench/components/builtin/`.")
+    lines.append("")
+    for comp, adr_nums in DETAILED_COMPONENTS:
+        lines.append(f"### {comp}")
+        lines.append("")
+        if adr_nums:
+            for n in adr_nums:
+                if n not in num_to_meta:
+                    raise RuntimeError(
+                        f"DETAILED_COMPONENTS references ADR-{n:04d} for "
+                        f"'{comp}' but no such ADR file exists."
+                    )
+                lines.append(fmt_entry(n))
+        else:
+            lines.append("_(no ADR coverage)_")
+        lines.append("")
+
+    # Implementation Decisions: group by subgroup; topics are ordered by their smallest ADR number.
+    lines.append("## Implementation Decisions")
+    lines.append("")
+    topic_order: list[str] = []
+    topic_to_nums: dict[str, list[int]] = {}
+    for n, (sec, sub) in CLASSIFICATION.items():
+        if sec != IMPL_DECISIONS or n not in num_to_meta:
+            continue
+        topic = sub or "Uncategorized"
+        if topic not in topic_to_nums:
+            topic_order.append(topic)
+            topic_to_nums[topic] = []
+        topic_to_nums[topic].append(n)
+    # Stable order: by smallest ADR-number in topic, so older infra appears first.
+    topic_order.sort(key=lambda t: min(topic_to_nums[t]))
+    for topic in topic_order:
+        lines.append(f"### {topic}")
+        lines.append("")
+        for n in sorted(topic_to_nums[topic]):
+            lines.append(fmt_entry(n))
+        lines.append("")
+
+    return "\n".join(lines).rstrip() + "\n"
+
+
+def _check_or_write(path: Path, content: str, check: bool) -> bool:
+    """Write content to path, or compare against disk in --check mode.
+
+    Returns True only when --check finds a difference; write mode always
+    returns False.
+    """
+    existing = path.read_text(encoding="utf-8") if path.exists() else ""
+    if check:
+        if existing != content:
+            print(f"[diff] {path} would change.")
+            return True
+        return False
+    path.write_text(content, encoding="utf-8")
+    if existing != content:
+        print(f"[wrote] {path}")
+    else:
+        print(f"[unchanged] {path}")
+    return False
+
+
+def main(argv: list[str] | None = None) -> int:
+    p = argparse.ArgumentParser(description=__doc__)
+    p.add_argument(
+        "--root", type=Path, default=Path.cwd(),
+        help="Repository root (default: cwd)",
+    )
+    p.add_argument(
+        "--check", action="store_true",
+        help="Exit 1 if generated INDEX would differ from disk",
+    )
+    args = p.parse_args(argv)
+
+    en_dir = args.root / "docs" / "adr"
+    ko_dir = args.root / "docs" / "adr-ko"
+
+    if not en_dir.is_dir():
+        print(f"error: {en_dir} does not exist", file=sys.stderr)
+        return 1
+
+    any_diff = False
+    try:
+        en_index = _build_index(en_dir, link_prefix="./")
+    except (RuntimeError, ValueError) as e:
+        print(f"error (EN): {e}", file=sys.stderr)
+        return 1
+    any_diff |= _check_or_write(en_dir / "INDEX.md", en_index, args.check)
+
+    if ko_dir.is_dir():
+        try:
+            ko_index = _build_index(ko_dir, link_prefix="./")
+        except (RuntimeError, ValueError) as e:
+            print(f"error (KO): {e}", file=sys.stderr)
+            return 1
+        any_diff |= _check_or_write(ko_dir / "INDEX.md", ko_index, args.check)
+
+    if args.check and any_diff:
+        print(
+            "INDEX.md is out of date. "
+            "Run `python tools/generate_adr_index.py` to refresh.",
+            file=sys.stderr,
+        )
+        return 1
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())