adr: add INDEX.md (auto-generated by tools/generate_adr_index.py)

Adds a section-based table of contents for the 46-ADR corpus, mirroring
the /report skill's classification (Design Principles / High-level
Architecture / Detailed Architecture by component / Implementation
Decisions by topic). Generated for both docs/adr/ (EN titles) and
docs/adr-ko/ (KO titles) from one tool.

tools/generate_adr_index.py:
- Single CLASSIFICATION dict per ADR: add an entry when introducing a
  new ADR; the script fails loud if any file is missing from the table.
- DETAILED_COMPONENTS lists each builtin component and the ADR(s) that
  cover it (ADR-0014 appears under six PE engines; ADR-0023 under
  pe_dma + pe_ipcq).
- Accepts both ":" and "—" title separators (matching ADR-0033's
  existing format).
- --check mode for CI: exits 1 if INDEX.md is stale.

Also includes the docs/report/architecture-2026-1H.md generated by the
prior /report write (the public-facing architecture document; 836
lines, 76 source-attribution comments).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 11:15:37 -07:00
parent bd49c93703
commit e33e76f2d1
4 changed files with 1517 additions and 0 deletions
@@ -0,0 +1,174 @@
 # ADR Index
 Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **46**.
 Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.
 ## Design Principles
 - [ADR-0013](./ADR-0013-ver-verification-strategy.md) — 검증 전략 및 Phase 1 테스트 계획
 - [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — 레이턴시 모델: 가정 및 알려진 단순화
 ## High-level Architecture
 - [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — 타겟 시스템 계층 및 모델링 범위  _(System hierarchy (Tray / SIP / CUBE / PE))_
 - [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — 런타임 API 및 시뮬레이션 엔진 경계  _(Runtime API ↔ sim_engine boundaries)_
 - [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NoC와 메모리 데이터 경로  _(IOChiplet NOC and memory data path)_
 - [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — 큐브 NoC와 HBM 연결성  _(Cube NOC and HBM connectivity)_
 ## Detailed Architecture
 One subsection per component file under `src/kernbench/components/builtin/`.
 ### forwarding
 - [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding 컴포넌트 (forwarding_v1)
 ### hbm_ctrl
 - [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM 컨트롤러 내부 설계
 ### io_cpu
 - [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU 컴포넌트 모델
 ### m_cpu
 - [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU 및 M_CPU.DMA 컴포넌트 모델
 ### pcie_ep
 - [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model
 ### pe_cpu
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
 ### pe_dma
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
 - [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
 ### pe_fetch_store
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
 ### pe_gemm
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
 ### pe_ipcq
 - [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
 ### pe_math
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
 ### pe_mmu
 - [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — 컴포넌트 + 유틸리티 이중 역할
 ### pe_scheduler
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
 ### pe_tcm
 - [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — 듀얼 채널 BW 직렬화
 ### sram
 - [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC
 ### tiling
 - [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math 파이프라인 plan 빌더
 ## Implementation Decisions
 ### Address Scheme
 - [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51비트 물리 주소 레이아웃 및 디코딩 계약
 - [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — 메모리 주소 지정 — PA / VA / LA 주소 모델
 ### Routing & Helper API
 - [ADR-0002](./ADR-0002-lat-routing-distance.md) — 라우팅 거리, 순서 및 우회 규칙
 - [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`
 ### Memory Semantics & Local-HBM Bandwidth
 - [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — 메모리 시맨틱 및 로컬 HBM 대역폭 보장
 ### Topology Compilation, Diagrams & Builder Algorithms
 - [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — 다이어그램 뷰 및 거리 기반 레이아웃 규칙
 - [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — 토폴로지 컴파일, 거리 추출, 그리고 자동 다이어그램 생성
 - [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms
 ### Tensor Deployment and Allocation
 - [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — 텐서 배포 및 할당 (호스트 할당기, PA 우선)
 ### Kernel Execution and Host-Device Messaging
 - [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — 커널 실행 메시징 및 완료 시맨틱
 - [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU 메시지 스키마 (PA-우선, PE-태깅)
 ### CLI Surface and Semantics
 - [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — 명령줄 인터페이스 및 실행 시맨틱
 ### Component Port/Wire Fabric Model
 - [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — 컴포넌트 포트/와이어 모델과 패브릭 라우팅
 ### Two-Pass Data Execution
 - [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리)
 ### 2D Grid Program Identity
 - [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D 그리드 program_id 시맨틱
 ### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)
 - [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
 - [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — sip/num_sips 필드 제거
 - [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
 - [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
 - [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`
 ### IPCQ Direction Addressing
 - [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching
 ### Intercube All-Reduce
 - [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — 큐브 간 All-Reduce — pe0 큐브-메시 리듀스 + 다중-SIP 교환
 ### Evaluation Harnesses
 - [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce 평가 하니스 — `tests/sccl/`
 - [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM 평가 하니스 — `scripts/gemm_sweep.py` + `tests/gemm/`
 ### Bench Module Contract
 - [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring
 ### Kernel-side tl.* API (TLContext)
 - [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract
 ### Memory Allocator Algorithms
 - [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
 ### Probe Subcommand
 - [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
 ### Sim-engine Op Log and Memory Store Schemas
 - [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -0,0 +1,174 @@
 # ADR Index
 Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **46**.
 Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.
 ## Design Principles
 - [ADR-0013](./ADR-0013-ver-verification-strategy.md) — Verification Strategy and Phase 1 Test Plan
 - [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — Latency Model: Assumptions and Known Simplifications
 ## High-level Architecture
 - [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — Target System Hierarchy & Modeling Scope  _(System hierarchy (Tray / SIP / CUBE / PE))_
 - [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — Runtime API and Simulation Engine Boundaries  _(Runtime API ↔ sim_engine boundaries)_
 - [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NOC and Memory Data Path  _(IOChiplet NOC and memory data path)_
 - [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — Cube NOC and HBM Connectivity  _(Cube NOC and HBM connectivity)_
 ## Detailed Architecture
 One subsection per component file under `src/kernbench/components/builtin/`.
 ### forwarding
 - [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding Component (forwarding_v1)
 ### hbm_ctrl
 - [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM Controller Internal Design
 ### io_cpu
 - [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU Component Model
 ### m_cpu
 - [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU and M_CPU.DMA Component Model
 ### pcie_ep
 - [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model
 ### pe_cpu
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 ### pe_dma
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 - [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
 ### pe_fetch_store
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 ### pe_gemm
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 ### pe_ipcq
 - [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
 ### pe_math
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 ### pe_mmu
 - [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — Component + Utility Dual Role
 ### pe_scheduler
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 ### pe_tcm
 - [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — Dual-Channel BW Serialization
 ### sram
 - [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC
 ### tiling
 - [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math Pipeline Plan Builders
 ## Implementation Decisions
 ### Address Scheme
 - [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51-bit Physical Address Layout & Decoding Contract
 - [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — Memory Addressing — PA / VA / LA Address Models
 ### Routing & Helper API
 - [ADR-0002](./ADR-0002-lat-routing-distance.md) — Routing Distance, Ordering & Bypass Rules
 - [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`
 ### Memory Semantics & Local-HBM Bandwidth
 - [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — Memory Semantics & Local-HBM Bandwidth Guarantee
 ### Topology Compilation, Diagrams & Builder Algorithms
 - [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — Diagram Views & Distance-Aware Layout Rules
 - [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — Topology Compilation, Distance Extraction, and Automatic Diagram Generation
 - [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms
 ### Tensor Deployment and Allocation
 - [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — Tensor Deployment and Allocation (Host Allocator, PA-first)
 ### Kernel Execution and Host-Device Messaging
 - [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — Kernel Execution Messaging and Completion Semantics
 - [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
 ### CLI Surface and Semantics
 - [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — Command Line Interface and Execution Semantics
 ### Component Port/Wire Fabric Model
 - [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — Component Port/Wire Model and Fabric Routing
 ### Two-Pass Data Execution
 - [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass Data Execution Model (Timing / Data Separation)
 ### 2D Grid Program Identity
 - [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D Grid program_id Semantics
 ### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)
 - [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
 - [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — remove sip/num_sips fields
 - [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
 - [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
 - [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`
 ### IPCQ Direction Addressing
 - [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching
 ### Intercube All-Reduce
 - [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
 ### Evaluation Harnesses
 - [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce Evaluation Harness — `tests/sccl/`
 - [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`
 ### Bench Module Contract
 - [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring
 ### Kernel-side tl.* API (TLContext)
 - [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract
 ### Memory Allocator Algorithms
 - [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
 ### Probe Subcommand
 - [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
 ### Sim-engine Op Log and Memory Store Schemas
 - [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -0,0 +1,836 @@
 # KernBench — Architecture Design Document
 *2026 1H*
 KernBench is a system-level, discrete-event simulator for AI-accelerator
 chiplet systems. It models the data-movement and control paths across
 the full hardware hierarchy and reports end-to-end execution latency
 for kernels dispatched to the device's compute units.
 This document is a public summary of the architecture as designed and
 implemented in the first half of 2026. It assumes no prior knowledge of
 the simulator's internal documents; terms specific to the system are
 defined on first use.
 ---
 ## Design Principles
 KernBench is grounded in two foundational commitments: every measured
 latency must trace to explicit, modeled events on the simulator's graph,
 and every behavioral claim must be verifiable through tests that target
 spec-level invariants rather than incidental implementation details.
 <!-- src: ADR-0013 Context, Decision -->
 The testing posture is verification-driven. Tests are written to
 validate the architectural contracts that the simulator exposes —
 correct routing, deterministic results, monotonic latency under
 increasing hop counts — rather than to mirror the call graph of the
 implementation. Two phases coexist: a fast timing phase that exercises
 the simulator's discrete-event engine and produces a log of operations
 with timestamps, and an optional data-replay phase that uses that log
 to compute real numerical results. Tests can target either phase.
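The split between the two phases can be sketched as follows. This is a minimal illustration, not KernBench's code: the function names, the log entry layout, and the strictly serial timing are assumptions made for the example.

```python
def timing_phase(ops):
    """Assign timestamps without touching data; returns the op log.
    Assumes ops run back to back (a simplification for this sketch)."""
    log, now = [], 0.0
    for name, args, duration_ns in ops:
        now += duration_ns
        log.append({"t_ns": now, "op": name, "args": args})
    return log


def replay_phase(log, memory):
    """Re-run the logged ops in timestamp order against real values."""
    memory = dict(memory)  # leave the caller's mapping untouched
    for entry in sorted(log, key=lambda e: e["t_ns"]):
        dst, a, b = entry["args"]
        if entry["op"] == "add":
            memory[dst] = memory[a] + memory[b]
        elif entry["op"] == "mul":
            memory[dst] = memory[a] * memory[b]
    return memory


# Hypothetical two-op program: c = a + b, then d = c * a.
ops = [("add", ("c", "a", "b"), 2.0), ("mul", ("d", "c", "a"), 3.0)]
op_log = timing_phase(ops)
result = replay_phase(op_log, {"a": 2, "b": 3})
```

A timing-only test would assert on `op_log`; a data test would assert on `result`.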
 <!-- src: ADR-0033 Context, Decision -->
 The latency model is intentionally abstract rather than
 cycle-accurate. Each modeled node contributes a configurable per-node
 overhead, each link contributes wire delay plus byte-over-bandwidth
 serialization, and each terminal service contributes its own service
 time. The simulator does not attempt to reproduce cache coherence
 protocols, microarchitectural pipelines, or full PCIe/UCIe protocol
 correctness; those are explicitly outside the scope. The aim is a
 simulator that compares system-level configurations meaningfully and
 deterministically, not one that ships microarchitectural truths.
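The additive latency model described above can be written down in a few lines. The sketch below is illustrative only: `Link`, `path_latency_ns`, and the units (nanoseconds, bytes per nanosecond) are names and choices introduced here, and it assumes every link on the path serializes the full request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical link record: fixed wire delay plus a bandwidth."""
    wire_delay_ns: float
    bandwidth_bytes_per_ns: float


def path_latency_ns(node_overheads_ns, links, service_time_ns, size_bytes):
    """Sum per-node overhead, per-link wire delay plus serialization,
    and the terminal's service time for one request of size_bytes."""
    total = sum(node_overheads_ns)
    for link in links:
        total += link.wire_delay_ns + size_bytes / link.bandwidth_bytes_per_ns
    return total + service_time_ns


# Two forwarding nodes, two links, one terminal; 256-byte request.
links = [Link(wire_delay_ns=1.0, bandwidth_bytes_per_ns=64.0),
         Link(wire_delay_ns=2.0, bandwidth_bytes_per_ns=128.0)]
latency = path_latency_ns([0.5, 0.5], links, service_time_ns=10.0,
                          size_bytes=256)
```

With these made-up numbers the nodes contribute 1 ns, the links 5 ns and 4 ns, and the terminal 10 ns.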
 <!-- src: ADR-0033 Decision, Consequences -->
 Determinism is a hard requirement. Given identical inputs — topology,
 routing policy, and request stream — the simulator must produce
 identical outputs, hop traces included. This rules out reliance on
 unordered set iteration on the critical path and forces every latency
 contribution to come from an explicitly scheduled event on a modeled
 component or link. There are no implicit waits, no hardcoded magic
 delays, and no shortcuts that bypass the modeled graph.
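One common way to get this kind of determinism in a discrete-event engine is to break timestamp ties with an insertion counter and to never iterate an unordered set when scheduling. The sketch below shows that technique under stated assumptions; `EventQueue` and `simulate` are hypothetical names, not the simulator's API.

```python
import heapq
import itertools


class EventQueue:
    """Min-heap of (time, sequence, label). The sequence number breaks
    ties in insertion order so equal-time events pop deterministically."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def schedule(self, time_ns, label):
        heapq.heappush(self._heap, (time_ns, next(self._seq), label))

    def run(self):
        trace = []
        while self._heap:
            time_ns, _, label = heapq.heappop(self._heap)
            trace.append((time_ns, label))
        return trace


def simulate():
    queue = EventQueue()
    # Iterate a sorted list, not the set itself, so scheduling order
    # does not depend on hash ordering.
    for name in sorted({"pe1", "pe0", "pe2"}):
        queue.schedule(5.0, name)
    queue.schedule(1.0, "host")
    return queue.run()


trace_a = simulate()
trace_b = simulate()
```

Two runs with identical inputs produce identical traces, including the order of the three events that share a timestamp.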
 ---
 ## High-level Architecture
 <!-- src: ADR-0003 Context, Decision -->
 The simulated system is a four-level hierarchy. A **Tray** holds one or
 more **SIPs** (system-in-package), each containing a 2D mesh of
 **CUBEs** plus one or more **IO chiplets** that connect the SIP to the
 host. Each CUBE contains a regular grid of **PEs** (processing
 elements) plus its own attached resources — high-bandwidth memory
 (HBM), a per-cube SRAM scratchpad, and a management CPU (M_CPU). The PE
 itself is a composite of nine sub-components rather than a monolithic
 core. This hierarchy is fixed; the parameters along each axis (counts,
 mesh dimensions, link widths) are configurable through the topology
 spec.
 <!-- src: ADR-0007 Context, Decision -->
 A clean separation runs along the request flow. A **runtime API** at
 the top is the host-facing surface; it exposes tensor and kernel
 operations, owns host-side allocation metadata, and is
 topology-agnostic: it does not route or fan out. Below it, the **simulation
 engine** decomposes runtime operations into discrete graph requests
 (memory writes, memory reads, kernel launches, MMU map installs) and
 schedules events deterministically. At the bottom, **components** model
 device behavior on a graph of nodes connected by links; they
 implement the actual latency contributions and pass requests along.
 No component reaches up into the runtime API, and no runtime call
 shortcuts the engine.
 <!-- DIAGRAM: Four-level system hierarchy — Tray containing SIPs, each SIP showing its 2D cube mesh and IO chiplet; one cube blown up to show the router mesh, attached PEs, M_CPU, SRAM, and HBM partition. -->
 ### Tray
 <!-- src: ADR-0003 Decision -->
 The Tray is the outermost boundary. It owns the host CPU on one side
 and one or more SIPs on the other, connected through a fabric switch.
 For collective communication that must traverse multiple SIPs, the
 fabric switch acts as the common rendezvous: device-side outbound
 traffic from one SIP routes through the switch and back into the
 target SIP's IO chiplet.
 ### SIP
 <!-- src: ADR-0003 Decision, ADR-0017 Context -->
 A SIP packages a 2D mesh of CUBEs and one or more IO chiplets. The
 default topology used by the simulator is a 4×4 cube mesh; the
 mesh dimensions are configurable. Each cube on the boundary of the
 mesh connects to its neighbors over UCIe (die-to-die) links arranged
 on the four cardinal sides — north, south, east, and west. The IO
 chiplets sit on one side of the SIP and provide the bridge to the host
 across PCIe.
 <!-- src: ADR-0016 Context, Decision -->
 The IO chiplet itself contains its own internal network. A
 host-facing PCIe endpoint passes traffic to a small NOC ("network on
 chip"); from there it can branch to a control-plane CPU that processes
 kernel-launch messages, or it can take the direct memory data path to
 the cube's HBM controller. The decision to provide a direct memory
 path that bypasses the control CPU was a deliberate concession to
 keep host-issued memory writes from paying control-plane overhead on
 the data path.
 ### CUBE
 <!-- src: ADR-0017 Decision -->
 Each CUBE owns a 2D mesh of NOC routers and a set of attached
 resources: PEs, the cube-local SRAM scratchpad, the management CPU
 (M_CPU), and the HBM partition (split across multiple PE-private
 slices for bandwidth). The router mesh uses deterministic XY routing.
 Attached components do not connect to each other directly — they all
 sit on the router mesh, and every cube-internal transfer pays the
 mesh distance from source to destination.
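Deterministic XY routing is the standard dimension-ordered scheme: travel along X until the column matches, then along Y. The sketch below assumes routers are addressed by `(x, y)` integer coordinates, which is a convention chosen for this example rather than something the source specifies.

```python
def xy_route(src, dst):
    """Return the router coordinates visited from src to dst, moving
    along X first and then along Y (dimension-ordered routing)."""
    x, y = src
    path = [(x, y)]
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step_x
        path.append((x, y))
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step_y
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
hops = len(route) - 1  # mesh distance paid by the transfer
```

Because the path depends only on the two endpoints, the same source and destination always yield the same hop trace.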
 <!-- src: ADR-0017 Decision -->
 The HBM partition is per-PE: each PE owns one HBM slice, and the
 controller exposes per-PE channels so that the same PE always
 addresses the same set of HBM channels. This makes the local-HBM
 bandwidth from a PE to its own slice predictable, while accesses to
 another PE's slice — or a different cube's slice — pay the mesh
 distance and any UCIe crossings.
 ### PE
 <!-- src: ADR-0014 Context, Decision -->
 A PE is not a monolithic core. Internally it is a set of nine
 sub-components, each modeling one stage of a request's flow: a small
 control CPU, a tile-pipeline scheduler, a DMA engine, a fetch-store
 engine that moves data between the on-PE scratchpad and the register
 file, a GEMM compute engine, a math compute engine, the
 tightly-coupled memory (TCM, the on-PE scratchpad), an MMU for
 virtual-to-physical address translation, and an inter-PE collective queue
 (IPCQ). The scheduler decomposes higher-level operations into per-tile
 stage sequences, and tile tokens self-route from one sub-component
 to the next.
 <!-- DIAGRAM: PE internal layout — the nine sub-components and the edges that connect them; tile token flowing through DMA_READ → FETCH → GEMM → STORE → DMA_WRITE. -->
 ---
 ## Detailed Architecture
 This section describes each modeled device-side component in turn.
 Components are listed in the alphabetical order used by the
 simulator's source tree.
 ### forwarding
 <!-- src: ADR-0037 Context, Decision -->
 The forwarding component is the generic routing relay used wherever a
 node only needs to apply a small processing overhead and pass the
 request to the next hop. NOC routers, conn nodes, and UCIe PHYs all
 reduce to this. Its first act on receiving a request is to apply the
 per-node overhead configured for it in the topology spec; after the
 overhead it simply hands the request to the next hop along the path.
 <!-- src: ADR-0037 Decision, Consequences -->
 The decision to share one implementation across these roles was made
 to keep the simulator's component set small without sacrificing
 modeling fidelity. Each instance still carries its own overhead and
 its own link bandwidth contributions, so different roles still produce
 different timing. What is shared is the dispatcher loop, not the
 parameter values.
 ### hbm_ctrl
 <!-- src: ADR-0034 Context, Decision -->
 The HBM controller is the terminal node for all memory traffic that
 reaches HBM. Internally it owns a number of pseudo channels, partitioned
 per-PE so that each PE addresses a deterministic subset. On a request
 arrival the controller first selects the right pseudo channel from the
 target address, then enters a chunk-loop that drains the requested
 size in fixed-size flits over the channel's bandwidth.
 <!-- src: ADR-0034 Decision, Consequences -->
 The chunk-loop pattern replaces an earlier all-at-once drain. The
 benefit is that the controller no longer presents a flit-aware fabric
 with a single bulk transfer; instead it emits flits at a paced rate
 matching the channel bandwidth, which makes cross-flow contention
 visible. The bandwidth budget is calibrated against the configured
 HBM total bandwidth divided across the channel count.
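A chunk-loop of this shape might look like the sketch below. It is a plain-Python illustration under assumptions: the function names and units are invented here, the last chunk is allowed to be shorter than the others, and a real discrete-event engine would yield simulator timeouts instead of returning timestamps.

```python
def channel_bandwidth(total_bw_bytes_per_ns, num_channels):
    """Per-channel budget: total HBM bandwidth split evenly."""
    return total_bw_bytes_per_ns / num_channels


def drain_in_chunks(size_bytes, chunk_bytes, bw_bytes_per_ns, start_ns=0.0):
    """Yield (finish_time_ns, bytes) for each chunk, paced at the
    channel bandwidth; the final chunk may be shorter."""
    now = start_ns
    remaining = size_bytes
    while remaining > 0:
        n = min(chunk_bytes, remaining)
        now += n / bw_bytes_per_ns
        remaining -= n
        yield now, n


# Made-up numbers: 512 B/ns total over 8 channels gives 64 B/ns each.
bw = channel_bandwidth(total_bw_bytes_per_ns=512.0, num_channels=8)
chunks = list(drain_in_chunks(size_bytes=160, chunk_bytes=64,
                              bw_bytes_per_ns=bw))
```

Emitting one completion per chunk is what lets another flow interleave between chunks instead of waiting behind a single bulk transfer.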
 ### io_cpu
 <!-- src: ADR-0036 Context, Decision -->
 The IO_CPU is the control-plane processor sitting inside the IO chiplet.
 It receives kernel-launch messages from the host, decodes them, and
 dispatches per-cube launches to the cube's management CPU. Pure memory
 operations bypass it entirely, taking the direct data path established
 inside the IO chiplet.
 <!-- src: ADR-0036 Decision -->
 On receiving a kernel-launch message, the IO_CPU consults the message's
 shard list — which already names the target SIP, cube, and PE for each
 piece of the tensor argument — and forwards a per-cube launch to each
 cube the kernel needs to reach. This makes the IO_CPU a deterministic
 fan-out point: it does not decode physical addresses to route, it just
 follows the explicit per-shard targets it was handed.
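The fan-out step amounts to grouping shards by their explicit target. The following sketch assumes a hypothetical `Shard` record with `sip`, `cube`, and `pe` fields; the real message schema is not shown in this document.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """Hypothetical shard record naming its explicit target."""
    sip: int
    cube: int
    pe: int


def fan_out(shards):
    """Group shards into one launch per (sip, cube), preserving the
    order in which cubes first appear in the shard list."""
    launches = defaultdict(list)
    for shard in shards:
        launches[(shard.sip, shard.cube)].append(shard.pe)
    return dict(launches)


shards = [Shard(0, 1, 0), Shard(0, 1, 3), Shard(0, 2, 0)]
launches = fan_out(shards)
```

No address decoding happens here: the targets come straight from the shard list, and dictionary insertion order keeps the fan-out order reproducible.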
 ### m_cpu
 <!-- src: ADR-0035 Context, Decision -->
 The M_CPU is the cube's management processor. It owns two distinct
 roles: as a control-plane fan-out point for kernel launches arriving
 from the IO chiplet, and as a DMA endpoint for host-initiated memory
 writes that need to land in this cube's HBM. The control role
 forwards launches to the right PE control CPUs; the DMA role places
 the actual bytes into HBM through the router mesh.
 <!-- src: ADR-0035 Decision -->
 The component model deliberately distinguishes the two roles because
 their routing differs: the control fan-out path uses command-kind
 links that do not appear on data-path routes, while the DMA path uses
 the same router mesh as PE-initiated DMA, with PE-internal nodes
 excluded. The routing layer knows about both modes and selects the
 appropriate adjacency at request time.
 ### pcie_ep
 <!-- src: ADR-0038 Context, Decision -->
 The PCIe endpoint is the protocol boundary at the host-device edge.
 Its first act on each incoming request is to apply a configured
 protocol-processing overhead; after that it simply forwards. There is
 no internal queuing model, no retry, and no TLP-level fidelity — those
 are deliberately outside scope. The endpoint is bidirectional: host →
 device traffic (memory writes, kernel launches) flows one way, and
 device-side outbound traffic (cross-SIP collective sends) flows the
 other.
 <!-- src: ADR-0038 Decision, Alternatives Considered, Consequences -->
 A more detailed PCIe model was considered and rejected. The simulator
 targets system-level latency comparisons; making the endpoint
 heavier with credit-management and retry logic would not improve the
 metrics being studied. The decision keeps the endpoint as the
 documented protocol-boundary node, named consistently so routing
 helpers can locate it by SIP and IO instance.
 ### pe_cpu
 <!-- src: ADR-0014 Decision -->
 The PE control CPU is the entry point for kernel work arriving from
 the cube's management CPU. It receives kernel-launch messages, resolves
 the kernel function by name, and hands execution to the scheduler with
 the resolved tensor arguments. From the scheduler's point of view, the
 PE_CPU is the upstream source of high-level commands; from the rest
 of the system's point of view, the PE_CPU is where a kernel's
 execution begins on a given PE.
 ### pe_dma
 <!-- src: ADR-0014 Decision, ADR-0023 Decision -->
 The DMA engine on each PE has two distinct modes. In the standard PE
 pipeline it consumes tile tokens issued by the scheduler, acquires a
 read or write channel (modeled as a one-in-flight resource per
 direction), and runs the bytes to or from HBM through the mesh. In
 its collective mode it forwards send tokens for the PE's IPCQ into
 the fabric, snapshotting the source data at send time so later
 mutations cannot race the receiver's read. Both modes share the same
 channel resources but differ in their downstream handling — one
 returns when the round-trip completes, the other dispatches
 fire-and-forget.
 ### pe_fetch_store
 <!-- src: ADR-0014 Decision -->
 The fetch-store engine is the bridge between the on-PE scratchpad
 (TCM) and the register file. It does not run DMA; it only moves bytes
 internally. On receiving a tile-stage token it sends a short request
 to the TCM, waits for the bandwidth-serialized delay, and continues
 the pipeline. The split between this engine and the TCM lets the
 scratchpad model its own read/write bandwidth independently.
 ### pe_gemm
 <!-- src: ADR-0014 Decision -->
 The GEMM engine is the matrix-multiply compute unit. Tile tokens
 arriving at this stage carry the per-tile dimensions, and the engine
 contributes a service time accounting for one fused multiply-add per
 multiply-accumulate (MAC) in the tile. Composite operations (where the
 same tensor pair is streamed across many tiles) reuse the engine
 through the scheduler;
 the engine itself is stateless between tiles.
 ### pe_ipcq
 <!-- src: ADR-0023 Context, Decision -->
 The IPCQ — inter-process communication queue — is each PE's
 collective-communication endpoint. It owns ring buffers that hold
 inbound messages from neighbor PEs and bookkeeping for send credits.
 Direction names ("N", "S", "E", "W" for cube-internal neighbors and
 "global_*" for cross-SIP neighbors) are resolved to physical peer
 endpoints by a neighbor table installed at process-group creation
 time. The component itself does not move bytes — it issues DMA tokens
 through the local PE_DMA, which performs the actual cross-PE
 transfer.
 <!-- src: ADR-0023 Decision, Consequences -->
 A key invariant is that the inbound terminal — where data lands at
 the receiver — pays the link bandwidth drain plus any cube-internal
 mesh hop to the slot's backing memory. This prevents IPCQ from
 silently outpacing raw DMA at large transfer sizes. Outbound sends
 are fire-and-forget; credit return is the only backpressure signal.
 ### pe_math
 <!-- src: ADR-0014 Decision -->
 The math engine handles element-wise and reduction operations. It
 consumes tile tokens carrying an operation kind (`exp`, `sum`, `max`,
 `where`, etc.) and contributes a service time proportional to the
 number of elements processed. Like the GEMM engine it is stateless;
 chained epilogues (a sequence of math operations after a GEMM tile)
 are scheduled as separate stages.
 ### pe_mmu
 <!-- src: ADR-0039 Context, Decision -->
 The MMU has two roles, exposed through one component. As a node on
 the cube NOC it receives MMU-map and MMU-unmap messages and updates
 its internal page table, so that the runtime API can install
 virtual-to-physical mappings with measured fabric latency. As a
 utility object held inside the PE it offers synchronous translate
 calls to the PE's DMA and GEMM engines without taking simulator time
 itself; the calling engine pays any configured TLB overhead in its
 own process.
 <!-- src: ADR-0039 Decision, Alternatives Considered -->
 The page table supports multiple disjoint regions inside a single
 page, with later-write-wins semantics on overlap. This is a deliberate
 simulator stopgap to support parallelization policies that shard data
 at sub-page granularity without silent mis-routing through a real
 hardware MMU's one-PA-per-entry assumption. A real MMU does not work
 this way; the model documents this as a simplification.
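A page table with these semantics can be sketched as a list of regions scanned newest-first. The class below is an assumption-laden illustration: `PageTable`, its method names, and the 4 KiB page size are all introduced for this example.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


def page_of(va):
    """Page number that a virtual address falls in."""
    return va // PAGE_SIZE


class PageTable:
    """Maps VA ranges to PAs. Several disjoint regions may share one
    page, and a later mapping wins where ranges overlap."""

    def __init__(self):
        self._regions = []  # (va_start, va_end, pa_start), oldest first

    def map(self, va, size, pa):
        self._regions.append((va, va + size, pa))

    def translate(self, va):
        # Scan newest-first so the most recent overlapping map wins.
        for start, end, pa in reversed(self._regions):
            if start <= va < end:
                return pa + (va - start)
        raise KeyError(f"unmapped VA {va:#x}")


pt = PageTable()
pt.map(va=0x0000, size=2048, pa=0x10_0000)  # first half of page 0
pt.map(va=0x0800, size=2048, pa=0x20_0000)  # second half, different PA
pt.map(va=0x0400, size=1024, pa=0x30_0000)  # overlaps the first region
```

All three regions sit inside page 0, yet each resolves to its own physical range, which a one-PA-per-entry table could not express.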
 ### pe_scheduler
 <!-- src: ADR-0014 Decision -->
 The scheduler is the sole dispatcher inside a PE. Simple commands are
 routed directly to the right engine. Composite commands generate a
 tile plan, and the resulting tile tokens are fed into the pipeline.
 Self-routing keeps the scheduler off the per-stage hot path: each
 engine, on finishing a stage, advances the token to the next stage's
 component itself, so the scheduler only does initial dispatch and
 completion tracking.
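Self-routing can be pictured as a token that carries its own remaining stage list. In the sketch below, `TileToken` and `run_stage` are hypothetical names, the stage labels are borrowed from the PE pipeline described earlier, and engines are collapsed into one function for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class TileToken:
    """Hypothetical token carrying its own remaining stage sequence."""
    tile_id: int
    stages: list
    visited: list = field(default_factory=list)


def run_stage(token, on_done):
    """An engine finishes the token's current stage, then hands the
    token to the next stage itself; only completion reaches the
    scheduler through on_done."""
    stage = token.stages.pop(0)
    token.visited.append(stage)
    if token.stages:
        run_stage(token, on_done)  # advance without the scheduler
    else:
        on_done(token)


completed = []
token = TileToken(0, ["DMA_READ", "FETCH", "GEMM", "STORE", "DMA_WRITE"])
run_stage(token, completed.append)  # the scheduler's only dispatch
```

The scheduler touches the token twice, once to dispatch and once to record completion, regardless of how many stages the tile passes through.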
 ### pe_tcm
 <!-- src: ADR-0040 Context, Decision -->
 The TCM is the per-PE tightly-coupled scratchpad memory. It models
 time only, not data — the actual payload lives in the simulator's
 memory store. Read and write are independent channels: each is
 modeled as a one-in-flight resource, so same-direction requests
 serialize but a read and a write can overlap. The bandwidth of each
 direction is configured separately and applied as bytes-over-bandwidth
 on each request.
 <!-- src: ADR-0040 Decision, Alternatives Considered -->
 The decision to keep read and write on separate channels was made
 because the PE pipeline's normal case overlaps fetch (read) and store
 (write). Collapsing them into a single shared channel would have
 artificially serialized that overlap and produced an incorrect
 bandwidth ceiling.
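A small timing-only sketch shows why the two channels matter. The class `TcmTimingModel` and its fields are illustrative assumptions, and time and bandwidth units are arbitrary.

```python
class TcmTimingModel:
    """Hypothetical timing-only TCM: one in-flight request per direction."""

    def __init__(self, read_bw: float, write_bw: float) -> None:
        self.bw = {"read": read_bw, "write": write_bw}  # bytes per time unit
        self.busy_until = {"read": 0.0, "write": 0.0}

    def access(self, direction: str, now: float, nbytes: int) -> float:
        """Return the completion time of a request issued at `now`."""
        # Same-direction requests serialize behind the previous one.
        start = max(now, self.busy_until[direction])
        # Service time is bytes over the configured bandwidth.
        done = start + nbytes / self.bw[direction]
        self.busy_until[direction] = done
        return done


tcm = TcmTimingModel(read_bw=100.0, write_bw=50.0)
first_read = tcm.access("read", now=0.0, nbytes=200)    # occupies the read channel
second_read = tcm.access("read", now=0.0, nbytes=100)   # queues behind first_read
write = tcm.access("write", now=0.0, nbytes=100)        # overlaps with the reads
```

With a single shared channel the write would have had to wait for both reads; with separate channels it finishes while the reads are still in flight.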
 ### sram
 <!-- src: ADR-0041 Context, Decision -->
 The cube SRAM is a per-cube scratchpad attached to one of the cube's
 routers. As a node it applies a configured access overhead, pays the
 link-bandwidth drain stamped on the incoming request, and sends a
 response on the reverse path. It is a terminal — it does not forward.
 <!-- src: ADR-0041 Decision, Consequences -->
 A second role is as one of three backing-memory tiers (TCM, SRAM, HBM)
 that an inter-PE collective slot can live in. When the slot lives in
 SRAM, the PE_DMA pays the slot read or write latency directly using
 the configured SRAM bandwidth and overhead; the SRAM component does
 not need to know about collective semantics. This separation keeps
 the SRAM component agnostic to the collective subsystem.
 ### tiling
 <!-- src: ADR-0042 Context, Decision -->
 The tile-plan generator is not a runtime component; it is a module
 of pure functions that take a problem shape (matrix dimensions, tile
 sizes) and produce an ordered list of tile-stage sequences. The
 scheduler consumes this list. Each tile's stage sequence depends on
 how its operands are staged: operands streamed from HBM produce
 DMA_READ stages, operands already resident in TCM (because they were
 loaded eagerly upfront) skip them.
 <!-- src: ADR-0042 Decision, Consequences -->
 The plan generator is intentionally pure — given the same input it
 returns the same plan, with no simulator events created. This lets
 the rest of the system reason about tile sequences as data, and it
 makes the plan testable in isolation without simulator state. New
 plan variants (for example, K-major or DTensor-aware plans) can be
 added as new functions following the same shape.
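As a rough sketch of what such a pure plan function could look like: only `DMA_READ` is a stage name from the text above; `GEMM` and `DMA_WRITE`, the function name `plan_tiles`, and the row-major tile order are placeholders chosen for illustration.

```python
from math import ceil


def plan_tiles(
    m: int, n: int, tile_m: int, tile_n: int, operands_in_tcm: bool = False
) -> list[tuple[tuple[int, int], tuple[str, ...]]]:
    """Return ((tile_row, tile_col), stages) for every output tile, row-major."""
    if operands_in_tcm:
        # Operands were loaded up front, so the per-tile read is skipped.
        stages = ("GEMM", "DMA_WRITE")
    else:
        stages = ("DMA_READ", "GEMM", "DMA_WRITE")
    return [
        ((i, j), stages)
        for i in range(ceil(m / tile_m))
        for j in range(ceil(n / tile_n))
    ]


streamed = plan_tiles(256, 256, 128, 128)
resident = plan_tiles(256, 256, 128, 128, operands_in_tcm=True)
```

Because the function touches no simulator state, a unit test can compare its output directly against an expected list.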
 ---
 ## Implementation Decisions
 This section collects cross-cutting decisions — algorithms, policies,
 schemes, and contracts — that span multiple components rather than
 living inside one.
 ### Address Scheme
 <!-- src: ADR-0001 Context, Decision -->
 Every physical address in the simulator decodes into a structured
 location. A fixed-width physical address carries the SIP id, the
 cube id within the SIP, a type discriminator (HBM vs PE-resource vs
 others), and a type-specific offset. HBM addresses additionally encode
 the per-PE slice offset so the controller can determine which PE
 owns the target slice without external lookup. The layout
 deliberately leaves reserved space rather than packing fields to fit,
 so new sub-units can be added at the type-discriminator level without
 rewriting existing addresses.
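The structured decode can be sketched as plain bit-field arithmetic. Every width below, the field order, and the names `Location`, `encode`, and `decode` are assumptions made for illustration; the real layout is defined by ADR-0001 and is not reproduced here.

```python
from typing import NamedTuple

# Illustrative widths only; not the layout from ADR-0001.
OFFSET_BITS = 36
TYPE_BITS = 4
CUBE_BITS = 4
SIP_BITS = 4


class Location(NamedTuple):
    sip: int
    cube: int
    kind: int    # type discriminator (HBM, PE resource, ...)
    offset: int  # type-specific offset


def encode(loc: Location) -> int:
    pa = loc.sip
    pa = (pa << CUBE_BITS) | loc.cube
    pa = (pa << TYPE_BITS) | loc.kind
    pa = (pa << OFFSET_BITS) | loc.offset
    return pa


def decode(pa: int) -> Location:
    # Peel fields off from the least significant end.
    offset = pa & ((1 << OFFSET_BITS) - 1)
    pa >>= OFFSET_BITS
    kind = pa & ((1 << TYPE_BITS) - 1)
    pa >>= TYPE_BITS
    cube = pa & ((1 << CUBE_BITS) - 1)
    pa >>= CUBE_BITS
    sip = pa & ((1 << SIP_BITS) - 1)
    return Location(sip, cube, kind, offset)
```

Keeping the type discriminator as its own field is what allows a new sub-unit to be introduced as a new `kind` value without moving any existing address.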
 <!-- src: ADR-0011 Context, Decision -->
 On top of physical addressing, the simulator defines three address
 models that the runtime API selects between. Direct physical
 addressing is retained as a fallback. Virtual addressing, the
 current default, gives each tensor a contiguous virtual range at
 deployment, with the per-PE MMU translating on each access. A
 logical-address scheme remains a future option. Every current test
 takes the virtual-address path; the MMU itself falls back to
 treating an address as physical when no mapping exists for it (a
 deliberate signal, not an error).
 <!-- src: ADR-0011 Decision, Consequences -->
 Tensor placement is represented as a list of physical-address shards,
 each tagged with target SIP, cube, and PE, plus a single tensor-wide
 virtual base. This means a kernel sees one virtual base for the whole
 tensor while the host driver and the engine still know exactly where
 each shard lives. Replicated tensors get per-cube local PA mappings;
 sharded tensors broadcast their mapping across cubes within a SIP.
 ### Routing, Distance & Helper API
 <!-- src: ADR-0002 Context, Decision -->
 Routing is policy-driven, deterministic, and topology-aware. Given a
 source, a destination, and an intent — for example, PE-initiated
 DMA versus host-initiated memory write versus a generic
 component-to-component query — the routing layer picks the right
 path. The intent matters because different traffic types must avoid
 different categories of edges: PE-initiated DMA should not traverse
 command-only links; M_CPU DMA should not pass through PE-internal
 pipeline edges; cube-local transfers should not use the
 zero-distance UCIe bus that would otherwise look attractive to a
 shortest-path search.
 <!-- src: ADR-0051 Decision -->
 The routing layer therefore maintains four separate adjacency graphs
 at construction, each excluding a different category of edges, and
 picks the appropriate one per intent. On top of the graphs sits a
 helper API that hides the topology's naming convention: callers ask
 for the PCIe endpoint of a given SIP, the M_CPU of a given cube, or
 the HBM destination for a given physical address, and receive the
 corresponding node id. No component constructs node-id strings
 directly; if the naming convention ever changes, the change is local
 to the helper layer.
 <!-- src: ADR-0051 Decision, Consequences -->
 Path-finding itself uses Dijkstra with explicit per-edge weights.
 Routing weight is allowed to differ from physical distance; for
 example, UCIe edges are weighted so that routing prefers them.
 Tie-breaks follow insertion order, which keeps results deterministic.
 A query between unreachable nodes raises an error rather than
 returning an empty path, surfacing topology errors immediately.
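A minimal sketch of deterministic Dijkstra with these two properties follows. The function name `shortest_path` and the adjacency format are assumptions; the simulator's per-intent graphs would each be passed in as a separate `adj`.

```python
import heapq
from itertools import count


def shortest_path(
    adj: dict[str, list[tuple[str, float]]], src: str, dst: str
) -> list[str]:
    """Dijkstra over `adj` (node -> [(neighbor, weight), ...]).

    Equal-cost candidates are ordered by a running sequence number, so ties
    resolve in the order edges were inserted into the adjacency lists.
    """
    seq = count()
    heap = [(0.0, next(seq), src, [src])]
    done: set[str] = set()
    while heap:
        dist, _, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in done:
            continue
        done.add(node)
        for nxt, weight in adj.get(node, []):
            if nxt not in done:
                heapq.heappush(heap, (dist + weight, next(seq), nxt, path + [nxt]))
    # Unreachable destinations raise instead of returning an empty path.
    raise ValueError(f"no path from {src!r} to {dst!r}")


diamond = {
    "a": [("b", 1.0), ("c", 1.0)],
    "b": [("d", 1.0)],
    "c": [("d", 1.0)],
    "d": [],
}
```

In the diamond graph both routes from `a` to `d` cost the same, and the route through `b` is chosen because its edge was inserted first.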
 ### Memory Semantics and Local-HBM Bandwidth
 <!-- src: ADR-0004 Context, Decision -->
 A PE accessing its own HBM slice through its own cube's NOC must see
 the full local HBM bandwidth — that is the model's intent. Memory
 traffic accumulates latency from per-component overhead and
 bytes-over-link-bandwidth serialization along the path, but the
 controller does not throttle below the slice's allotted bandwidth.
 Cross-PE-slice accesses inside the same cube, cross-cube accesses
 through UCIe, and cross-SIP accesses through PCIe each pay
 progressively more overhead as the path grows.
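The accumulation rule can be written down as a one-line sketch. It assumes each hop contributes its own overhead plus a full bytes-over-bandwidth term, which is a simplification for illustration; the function name `path_latency` and all numbers are made up.

```python
def path_latency(nbytes: int, hops: list[tuple[float, float]]) -> float:
    """Sum overhead + nbytes / bandwidth over (overhead, bandwidth) hops."""
    return sum(overhead + nbytes / bandwidth for overhead, bandwidth in hops)


# Hypothetical hop lists: a local access crosses fewer, faster hops.
local = path_latency(1000, [(5.0, 100.0)])
cross_cube = path_latency(1000, [(5.0, 100.0), (2.0, 50.0)])
```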
 ### Topology Compilation, Diagrams & Builder Algorithms
 <!-- src: ADR-0006 Context, Decision -->
 Topology is configurable, not hardcoded. The simulator reads a YAML
 spec, compiles it into a flat graph of nodes and edges plus four
 view projections at different abstraction levels — system, SIP, cube,
 PE — and uses the compiled graph as the single source for both
 execution and visualization. Distance metadata used by routing is
 extracted at compile time so that diagrams and routing decisions
 agree by construction.
 <!-- src: ADR-0005 Context, Decision -->
 Diagrams are derived artifacts of the compiled topology. The visualizer
 produces one SVG per view at the appropriate abstraction level; nothing
 in the diagrams is hand-drawn or hand-positioned. Distance-aware
 layout rules place nodes in the diagrams using the same coordinates
 that routing uses to compute distance, so a diagram that "looks
 wrong" is a signal that the topology itself has a problem, not the
 visualizer.
 <!-- src: ADR-0053 Decision -->
 Inside a cube the router mesh is generated automatically. PE corner
 positions are fixed by convention; the relay-column algorithm
 inserts additional grid columns whenever the gap between adjacent PE
 columns would exceed a tunable maximum. HBM occupies a central
 exclusion zone — router slots inside the zone are deliberately empty,
 since HBM controllers attach as separate named nodes. M_CPU and SRAM
 attach to the nearest router by Euclidean distance from their
 configured placement coordinates, and UCIe physical lanes distribute
 along the boundary rows and columns. The whole mesh is cached
 beside the topology spec and invalidated only when one of a small set
 of layout-relevant fields changes.
 <!-- DIAGRAM: One cube's router mesh — rows × cols of routers with HBM exclusion zone in the middle, PEs/M_CPU/SRAM attaching to nearest routers, UCIe PHYs along the perimeter. -->
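Two of the builder steps, relay-column insertion and nearest-router attachment, can be sketched as follows. The function names are hypothetical, and the even spacing of relay columns is an assumption; ADR-0053 may place them differently.

```python
from math import ceil, hypot


def relay_columns(pe_cols: list[float], max_gap: float) -> list[float]:
    """Insert evenly spaced relay columns wherever a gap exceeds max_gap."""
    cols = [pe_cols[0]]
    for nxt in pe_cols[1:]:
        prev = cols[-1]
        segments = max(1, ceil((nxt - prev) / max_gap))
        cols.extend(prev + (nxt - prev) * k / segments for k in range(1, segments))
        cols.append(nxt)
    return cols


def nearest_router(
    routers: dict[str, tuple[float, float]], pos: tuple[float, float]
) -> str:
    """Return the router closest to pos; ties go to the first one inserted."""
    return min(
        routers,
        key=lambda name: hypot(routers[name][0] - pos[0], routers[name][1] - pos[1]),
    )
```

`min` returns the first minimal element, so attachment stays deterministic even when a component sits exactly between two routers.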
 ### Tensor Deployment and Allocation
 <!-- src: ADR-0008 Context, Decision -->
 Tensor deployment in the runtime API produces a list of physical-address
 shards plus a single tensor-wide virtual base. The host allocator
 walks the data-parallelism policy, computes per-shard placement, and
 emits the per-shard physical addresses through the per-PE allocators.
 No separate "allocate then later attach to a device" RPC exists —
 allocation and deployment are a single operation that produces a
 deployed tensor handle.
 ### Memory Allocator Algorithms
 <!-- src: ADR-0048 Context, Decision -->
 Each per-PE allocator owns two channels — HBM slice and TCM — each
 backed by an offset-keyed free-list. Allocation is first-fit; freeing
 coalesces with adjacent free blocks. A device-wide virtual allocator
 sits above the per-PE allocators, aligns requests up to the configured
 page size, and coalesces on free in the same way. The trade-off is
 explicit: first-fit is simpler and cheaper than best-fit or buddy
 allocation, and the simulator's workload is stack-like enough
 (deploy / kernel / free in matched order) that fragmentation is not
 a practical concern.
 <!-- src: ADR-0048 Decision, Consequences -->
 Allocation failure raises rather than silently returning a partial
 result. A partial tensor reaching the engine would route over wrong
 PAs and silently corrupt simulator output, so an out-of-memory signal
 is preferred. The free path trusts its caller to pass back exactly
 what was allocated; the small risk of caller error in exchange for
 fast common-case freeing is documented as a deliberate trade.
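The allocator behavior described above can be sketched in a few lines. The class name `FirstFitAllocator` and the dict-based free-list are illustrative assumptions, not the simulator's implementation.

```python
class FirstFitAllocator:
    """Hypothetical offset-keyed free-list with first-fit and coalescing."""

    def __init__(self, base: int, size: int) -> None:
        self.free: dict[int, int] = {base: size}  # offset -> length

    def alloc(self, size: int) -> int:
        for off in sorted(self.free):  # first fit: lowest offset that is big enough
            length = self.free[off]
            if length >= size:
                del self.free[off]
                if length > size:
                    self.free[off + size] = length - size
                return off
        # Fail loudly rather than hand back a partial allocation.
        raise MemoryError(f"out of memory: {size} bytes requested")

    def release(self, off: int, size: int) -> None:
        # The caller is trusted to pass back exactly what alloc() returned.
        nxt = off + size
        if nxt in self.free:  # merge with the following free block
            size += self.free.pop(nxt)
        for prev_off, prev_len in self.free.items():
            if prev_off + prev_len == off:  # merge with the preceding free block
                self.free[prev_off] = prev_len + size
                return
        self.free[off] = size
```

With matched deploy and free order, the free-list collapses back to a single block, which is why fragmentation stays negligible for the workloads described.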
 ### Kernel Execution and Host-Device Messaging
 <!-- src: ADR-0009 Context, Decision -->
 Kernel execution decomposes into a small set of messages that travel
 the device graph. The host issues a single kernel-launch message; the
 IO_CPU fans it out per-cube; the cube M_CPU fans it out per-PE; the
 PE_CPU resolves the kernel and runs it through the scheduler.
 Completion flows back the same way, gated by per-shard completion
 tracking. Memory operations follow the same pattern: a memory write
 or read travels as one message that the engine routes to the right
 HBM controller, with a response taking the reverse path.
 <!-- src: ADR-0012 Context, Decision -->
 The schema between the host and the device-side IO_CPU is PA-first
 and shard-tagged. Every byte of host-issued payload arrives with an
 explicit target SIP, cube, PE, and physical address. The IO_CPU does
 not decode addresses to derive placement — placement is named
 explicitly by the shard list. This makes the host-device interface
 deterministic and keeps the routing helper free of host-derived
 intent.
 ### CLI Surface and Semantics
 <!-- src: ADR-0010 Context, Decision -->
 The command-line interface exposes four subcommands. A bench runner
 loads a topology, resolves a registered benchmark by name or index,
 and runs it on a selected device. A bench-listing command enumerates
 the registered benchmarks. A probe utility runs a fixed catalog of
 traffic patterns through the engine for latency and bandwidth
 verification. A web viewer renders the topology in a browser. A
 benchmark instance is always single-device by convention; multi-SIP
 collective work happens inside the benchmark through the launcher
 abstraction, not by multiplexing the CLI.
 ### Component Port and Wire Fabric Model
 <!-- src: ADR-0015 Context, Decision -->
 Every modeled component exposes input and output ports, and every
 edge in the topology connects an output port on one component to an
 input port on another. Bandwidth and propagation delay are properties
 of the wire between ports, not of the component endpoints. A
 component's responsibility is to apply its configured per-node
 overhead and either forward to the next hop or terminate; the wire
 charges the byte-over-bandwidth serialization separately.
 <!-- src: ADR-0015 Decision, Consequences -->
 This separation lets components be swapped behind their port
 interface without changing the rest of the model, and it keeps
 bandwidth contention at the wire level where multiple components may
 contend for the same edge. Future component models can refine
 internal behavior without disturbing the fabric.
 ### Two-Pass Data Execution
 <!-- src: ADR-0020 Context, Decision -->
 The simulator runs in two passes. The first pass, fast and always
 on, runs the discrete-event engine and records every data operation
 in an operation log with timestamps, component identifiers, and
 per-operation parameters. The second pass, which is opt-in, replays
 the log against an in-memory tensor store to produce actual numerical
 results. Tests that only need timing skip the second pass; tests that
 need to verify correctness opt in.
 <!-- src: ADR-0020 Decision, Consequences -->
 The split lets the timing engine remain unconcerned with data
 semantics: kernels move handles around, not bytes. The replay phase
 recovers data semantics from the recorded operations, in their
 original time order with a small set of secondary-sort rules. The
 op-log records carry enough metadata — input snapshots for compute
 operations, source snapshots for cross-component copies — that the
 replay phase cannot mis-order with respect to in-flight mutations.
 ### Sim-engine Op Log and Memory Store Schemas
 <!-- src: ADR-0052 Context, Decision -->
 The operation log holds typed records with seven fields each: start
 and end timestamps, the component that issued the operation, an
 operation kind ("memory", "gemm", "math"), an operation name, a
 parameter dictionary, and a (currently unused) dependency list.
 Records are kept in stable timestamp order. The parameter dictionary
 varies by operation: a DMA read carries source address and byte count;
 a GEMM carries operand shapes, dtypes, and address spaces; a math
 operation carries input addresses and snapshots.
 <!-- src: ADR-0052 Decision, Consequences -->
 The companion memory store is a two-level dictionary keyed by
 address space ("hbm", "tcm", "sram", others) and integer address.
 Reads and writes are reference-based — no copy by default — so
 callers wanting to detach a snapshot must copy explicitly. This is
 deliberate: the engine-internal snapshot paths copy at well-defined
 points (math input capture, HBM source capture for DMA writes,
 inbound collective copies) and downstream replay code therefore
 sees stable data even when slot or scratch addresses are reused by
 later operations.
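A sketch of both schemas, assuming hypothetical names (`OpRecord`, `MemoryStore`, and the field names), makes the reference-versus-copy distinction concrete.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    """Seven-field op-log record; field names are illustrative."""

    t_start: float
    t_end: float
    component: str
    kind: str  # "memory", "gemm", or "math"
    name: str
    params: dict[str, Any] = field(default_factory=dict)
    deps: list[int] = field(default_factory=list)  # currently unused


class MemoryStore:
    """Two-level dictionary: address space -> integer address -> value."""

    def __init__(self) -> None:
        self._spaces: dict[str, dict[int, Any]] = {}

    def write(self, space: str, addr: int, value: Any) -> None:
        self._spaces.setdefault(space, {})[addr] = value  # stores a reference

    def read(self, space: str, addr: int) -> Any:
        return self._spaces[space][addr]  # returns the same object, no copy


store = MemoryStore()
buf = [1, 2]
store.write("tcm", 0, buf)
buf.append(3)                           # visible through the store
snapshot = list(store.read("tcm", 0))   # explicit copy detaches the data
buf.append(4)                           # not visible in the snapshot
```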
 ### 2D Grid Program Identity
 <!-- src: ADR-0022 Context, Decision -->
 Inside a kernel the program identity is two-dimensional. The
 first axis corresponds to the PE index within a cube; the second
 corresponds to the cube index within a SIP. Together they let a
 kernel address its position both within its cube and within the
 larger system without needing to know the full topology. Total
 program counts along each axis are exposed symmetrically.
 ### Parallelism — SIP Launcher, DPPolicy, Megatron-TP, AHBM Backend, and CCL Algorithm Module
 <!-- src: ADR-0024 Context, Decision -->
 The launcher model treats each SIP as one rank. Inside a process the
 launcher spawns one greenlet per SIP rank; the rank is bound to its
 greenlet so that any code running in that worker sees the right
 distributed-style rank. This is a deliberately PyTorch-compatible
 shape: a benchmark looks like a small DDP training script — initialize
 a process group, spawn workers, each worker runs the same body.
 <!-- src: ADR-0026 Context, Decision -->
 Data-parallelism policy lives in a single object that names the
 sharding strategy along the cube axis (replicate, row-wise,
 column-wise) and along the PE axis (same set of values), and optionally
 overrides the number of cubes or PEs participating. The policy is
 intra-device — it does not cross SIP boundaries. SIP-level parallelism
 is the launcher's responsibility, and the two axes compose
 orthogonally.
 <!-- src: ADR-0027 Context, Decision -->
 A Megatron-style tensor-parallel API sits on top of the launcher and
 the DP policy. Layer-level building blocks — column-parallel linear,
 row-parallel linear, all-reduce — name their sharding intent in terms
 the launcher and the placement policy can compose. This is the layer
 that bench code typically writes against.
 <!-- src: ADR-0047 Context, Decision -->
 For collective operations the runtime exposes a PyTorch-compatible
 distributed backend named "ahbm". On process-group initialization the
 backend loads the configured collective-algorithm module, resolves
 the world size (priority: explicit ccl.yaml override → defaults
 section → topology SIP count), imports the algorithm module
 dynamically, derives the SIP topology kind, and pushes the inter-PE
 neighbor table to every participating PE. From that point on, an
 all-reduce call dispatches the algorithm's kernel function across
 all ranks.
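The world-size priority order can be sketched as a simple fall-through. The section and key names (`override`, `defaults`, `world_size`) are guesses for illustration; the actual ccl.yaml layout is not shown in this document.

```python
def resolve_world_size(ccl_cfg: dict, topology_sip_count: int) -> int:
    """Explicit override first, then the defaults section, then the topology."""
    for section in ("override", "defaults"):
        value = ccl_cfg.get(section, {}).get("world_size")
        if value is not None:
            return int(value)
    return topology_sip_count
```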
 <!-- src: ADR-0050 Context, Decision -->
 A collective-algorithm module is a Python module with a small, fixed
 contract. It exposes topology-kind integer constants, a name-to-kind
 mapping for the YAML configuration, a kernel-arguments builder, and
 a kernel function — the kernel function being aliased to the name
 `kernel` so the backend can find it generically. The kernel itself
 takes the tensor pointer, the per-cube element count, cube mesh
 width and height, the world size, the current rank, and the SIP
 topology dimensions; the backend appends those last four arguments
 automatically. New collectives slot in by adding a new module that
 follows this shape.
 <!-- src: ADR-0027 Decision, Consequences -->
 The combination is deliberate: bench authors get to write code that
 looks like a regular distributed training script, while the launcher,
 backend, and placement policies behind it remain free to redirect
 work to the right SIP, cube, and PE without exposing topology to the
 kernel.
 ### IPCQ Direction Addressing
 <!-- src: ADR-0025 Context, Decision -->
 Inside a collective algorithm, peer PEs are named by direction:
 "N", "S", "E", "W" for cube-internal neighbors, and "global_*" for
 cross-SIP neighbors. Direction is the only addressing handle the
 algorithm uses: it names a direction, the IPCQ neighbor table
 installed at process-group time resolves the direction to the peer
 endpoint's physical-address coordinates, and the PE_DMA performs the
 actual transfer. The algorithm itself never does PA arithmetic.
 ### Intercube All-Reduce
 <!-- src: ADR-0032 Context, Decision -->
 The default all-reduce algorithm runs in three phases: a
 center-rooted bidirectional phase inside each SIP's cube mesh, an
 inter-SIP exchange on the mesh's root cube, and a bidirectional
 broadcast back out. Center-rooting roughly halves the worst-case
 in-cube hop count compared with a corner-rooted walk. The inter-SIP
 exchange follows the configured SIP topology (ring, torus, or
 non-wrapping mesh), selected at runtime through the SIP-topology
 kind integer the backend passes to the kernel.
 ### Evaluation Harnesses
 <!-- src: ADR-0043 Context, Decision -->
 The all-reduce evaluation harness drives correctness and the
 latency/buffer-kind sweeps through the public distributed path —
 initialize process group, spawn workers, call all-reduce — rather
 than the lower-level engine interface. A shared helper module factors
 out the setup; sweep tests cover the buffer-kind tiers (TCM, SRAM,
 HBM) and the inter-SIP topology variants. The plots produced by the
 harness are part of its output contract; the harness regenerates them
 on demand.
 <!-- src: ADR-0044 Context, Decision -->
 The GEMM evaluation harness is split into two layers. A heavy
 shape-and-variant sweep lives as a manual script — it runs the same
 composite-GEMM benchmark across many shapes and operand-staging
 variants, harvests the resulting op-log, and writes a JSON summary.
 A faster figure-generation layer lives in the test suite and consumes
 that JSON to render plots. The split keeps the heavy data
 generation explicit and out of the regular test path.
 ### Bench Module Contract
 <!-- src: ADR-0045 Context, Decision -->
 Adding a new benchmark requires only dropping a file into the
 benchmarks directory. The file registers one or more benchmark
 functions through a small decorator that takes a kebab-case name and
 a human-readable description. The decorator is the registration
 mechanism — there is no separate manifest. Each benchmark function
 takes one argument, conventionally named `torch`, which is the
 runtime context exposing tensor allocation, kernel launch,
 distributed APIs, and process-spawning. The function name is `run` by
 convention.
 <!-- src: ADR-0045 Decision, Consequences -->
 A benchmark must submit at least one operation, or the runner
 returns an error. A benchmark instance is single-device by default;
 when a benchmark is collective, it uses the distributed-process-spawn
 pattern internally — one worker greenlet per rank, with each worker
 binding to its rank. Multi-device benchmark patterns outside that
 shape are not supported.
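Decorator-based registration of this kind can be sketched as below. The registry `BENCHMARKS`, the decorator name `benchmark`, and the example benchmark are hypothetical; only the kebab-case name, the description argument, and the `run(torch)` convention come from the text above.

```python
from typing import Callable

# Hypothetical registry: name -> (description, function).
BENCHMARKS: dict[str, tuple[str, Callable]] = {}


def benchmark(name: str, description: str) -> Callable[[Callable], Callable]:
    """Register the decorated function under a kebab-case name."""

    def register(fn: Callable) -> Callable:
        if name in BENCHMARKS:
            raise ValueError(f"benchmark {name!r} is already registered")
        BENCHMARKS[name] = (description, fn)
        return fn  # the function itself is left unchanged

    return register


@benchmark("vector-add", "Element-wise add on one device")
def run(torch):
    # A real benchmark would allocate tensors and launch a kernel through
    # the `torch` runtime context; this placeholder submits nothing.
    return None
```

Importing the file is enough to register the benchmark, which is why no separate manifest is needed.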
 ### Kernel-side `tl.*` API
 <!-- src: ADR-0046 Context, Decision -->
 Inside a kernel function, the `tl` argument exposes the kernel-side
 API in a shape that mirrors the conventions of established
 GPU-kernel languages. The API falls into six categories: reference
 handles that name HBM data without issuing DMA; data movement (load,
 store) that does issue DMA; GEMM and math compute (dot, composite,
 the unary and binary math operations, reductions); index and scalar
 helpers (program identity, range-builders); metadata-only operations
 such as transpose; and the collective primitives (send, receive,
 non-blocking receive). Tensor handles support arithmetic operators
 via a thread-local active context so kernel code reads naturally.
 <!-- src: ADR-0046 Decision, Consequences -->
 The API supports two execution modes. A command-list mode records
 operations into a list without consuming simulator time — useful for
 inspection and lightweight tests. A greenlet-driven mode runs the
 kernel as a child greenlet that switches back to the simulator on
 each `tl.*` call; the simulator drives the event scheduler and hands
 real data back to the kernel as DMA reads complete. The two modes
 share the same surface; the kernel does not know which one it is
 running under.
 ### Probe Subcommand
 <!-- src: ADR-0049 Context, Decision -->
 The probe utility runs three families of traffic patterns through
 the engine — host-to-device writes at increasing hop counts,
 device-to-host reads at increasing hop counts, and PE-initiated DMA
 across the cube mesh — and reports actual latency, the analytical
 formula breakdown, effective bandwidth, bottleneck bandwidth, and
 utilization. A fixed reference size is used for the summary table;
 a separate utilization-versus-size sweep covers a logarithmic range
 of transfer sizes. Each case runs in its own engine instance so
 cases do not perturb each other.
 <!-- src: ADR-0049 Decision, Consequences -->
 The probe also checks a small set of invariants automatically:
 monotonic latency increase with hop count, device-to-host latency
 at least as large as host-to-device for the same hop count, and a
 faster best-case path than worst-case for cross-cube PE DMA. Failures
 print prominently. The output is meant for human reading; automated
 parsing should not depend on column widths or whitespace.
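The three invariants can be expressed as a small checker. The function name and argument shapes are assumptions, and "monotonic" is read here as non-decreasing; the probe may apply a stricter test.

```python
def check_invariants(
    h2d: list[float], d2h: list[float], best_dma: float, worst_dma: float
) -> list[str]:
    """Return failure messages; h2d and d2h are latencies indexed by hop count."""
    failures: list[str] = []
    for label, series in (("host-to-device", h2d), ("device-to-host", d2h)):
        if any(later < earlier for earlier, later in zip(series, series[1:])):
            failures.append(f"{label}: latency decreases as hop count grows")
    if any(read < write for write, read in zip(h2d, d2h)):
        failures.append("device-to-host is faster than host-to-device at some hop count")
    if not best_dma < worst_dma:
        failures.append("best-case PE DMA path is not faster than the worst case")
    return failures
```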
 ---
 This document summarizes 46 architecture decisions captured during
 the first half of 2026. It is regenerated mechanically from the
 decision corpus; sources are recorded in HTML comments throughout.
@@ -0,0 +1,333 @@
 """Generate docs/adr/INDEX.md (and docs/adr-ko/INDEX.md) from the ADR corpus.
 Auto-derives a section-based index following the same classification as
 the /report skill — Design Principles / High-level Architecture /
 Detailed Architecture (by component) / Implementation Decisions
 (by topic). Run before publishing to refresh INDEX.md.
 The classification table below is the single source of truth. When a new
 ADR is added under docs/adr/, append an entry to ``CLASSIFICATION``. The
 script exits 1 if any ADR file is missing from the table or any title
 cannot be parsed, so omissions surface in CI.
 Usage:
    python tools/generate_adr_index.py [--root <repo-root>] [--check]
  --check : exit 1 if the generated INDEX differs from the on-disk file
            (used by CI to detect un-regenerated indexes).
 """
 from __future__ import annotations
 import argparse
 import re
 import sys
 from pathlib import Path
 ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-([a-z0-9_-]+)\.md$")
 # Title separator may be ":" (most ADRs) or "—" (em-dash; ADR-0033 uses
 # this). The verifier (tools/verify_adr_lang_pairs.py) only checks the
 # number, so both styles already coexist in the corpus.
 TITLE_RE = re.compile(r"^# ADR-(\d{4})\s*[:—]\s*(.+?)\s*$")
 DESIGN_PRINCIPLES = "Design Principles"
 HIGH_LEVEL = "High-level Architecture"
 DETAILED = "Detailed Architecture"
 IMPL_DECISIONS = "Implementation Decisions"
 # (section, subgroup) per ADR. subgroup is used to sub-divide Detailed
 # (by component, see DETAILED_COMPONENTS) and Implementation (by topic).
 # Add a line here when introducing a new ADR.
 CLASSIFICATION: dict[int, tuple[str, str | None]] = {
    # Design Principles
    13: (DESIGN_PRINCIPLES, None),
    33: (DESIGN_PRINCIPLES, None),
    # High-level Architecture
    3:  (HIGH_LEVEL, "System hierarchy (Tray / SIP / CUBE / PE)"),
    7:  (HIGH_LEVEL, "Runtime API ↔ sim_engine boundaries"),
    16: (HIGH_LEVEL, "IOChiplet NOC and memory data path"),
    17: (HIGH_LEVEL, "Cube NOC and HBM connectivity"),
    # Detailed Architecture (subgroup matches DETAILED_COMPONENTS entries)
    14: (DETAILED, "pe_pipeline"),  # covers pe_cpu/pe_dma/pe_fetch_store/pe_gemm/pe_math/pe_scheduler
    23: (DETAILED, "pe_ipcq"),
    34: (DETAILED, "hbm_ctrl"),
    35: (DETAILED, "m_cpu"),
    36: (DETAILED, "io_cpu"),
    37: (DETAILED, "forwarding"),
    38: (DETAILED, "pcie_ep"),
    39: (DETAILED, "pe_mmu"),
    40: (DETAILED, "pe_tcm"),
    41: (DETAILED, "sram"),
    42: (DETAILED, "tiling"),
    # Implementation Decisions
    1:  (IMPL_DECISIONS, "Address Scheme"),
    2:  (IMPL_DECISIONS, "Routing & Helper API"),
    4:  (IMPL_DECISIONS, "Memory Semantics & Local-HBM Bandwidth"),
    5:  (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
    6:  (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
    8:  (IMPL_DECISIONS, "Tensor Deployment and Allocation"),
    9:  (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"),
    10: (IMPL_DECISIONS, "CLI Surface and Semantics"),
    11: (IMPL_DECISIONS, "Address Scheme"),
    12: (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"),
    15: (IMPL_DECISIONS, "Component Port/Wire Fabric Model"),
    20: (IMPL_DECISIONS, "Two-Pass Data Execution"),
    22: (IMPL_DECISIONS, "2D Grid Program Identity"),
    24: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    25: (IMPL_DECISIONS, "IPCQ Direction Addressing"),
    26: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    27: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    32: (IMPL_DECISIONS, "Intercube All-Reduce"),
    43: (IMPL_DECISIONS, "Evaluation Harnesses"),
    44: (IMPL_DECISIONS, "Evaluation Harnesses"),
    45: (IMPL_DECISIONS, "Bench Module Contract"),
    46: (IMPL_DECISIONS, "Kernel-side tl.* API (TLContext)"),
    47: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    48: (IMPL_DECISIONS, "Memory Allocator Algorithms"),
    49: (IMPL_DECISIONS, "Probe Subcommand"),
    50: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    51: (IMPL_DECISIONS, "Routing & Helper API"),
    52: (IMPL_DECISIONS, "Sim-engine Op Log and Memory Store Schemas"),
    53: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
 }
 # Canonical component order for the Detailed Architecture section.
 # Each entry: (component_name, list[ADR-numbers that cover it]).
 # Order matches src/kernbench/components/builtin/*.py alphabetical
 # (the same order /report uses).
 DETAILED_COMPONENTS: list[tuple[str, list[int]]] = [
    ("forwarding",      [37]),
    ("hbm_ctrl",        [34]),
    ("io_cpu",          [36]),
    ("m_cpu",           [35]),
    ("pcie_ep",         [38]),
    ("pe_cpu",          [14]),
    ("pe_dma",          [14, 23]),
    ("pe_fetch_store",  [14]),
    ("pe_gemm",         [14]),
    ("pe_ipcq",         [23]),
    ("pe_math",         [14]),
    ("pe_mmu",          [39]),
    ("pe_scheduler",    [14]),
    ("pe_tcm",          [40]),
    ("sram",            [41]),
    ("tiling",          [42]),
 ]
 def _strip_bom(text: str) -> str:
    """Strip leading UTF-8 BOM if present."""
    if text and ord(text[0]) == 0xFEFF:
        return text[1:]
    return text
 def _find_adrs(adr_dir: Path) -> list[tuple[int, str, Path]]:
    """Return [(num, slug, path), ...] for ADR files in adr_dir, sorted by num."""
    out: list[tuple[int, str, Path]] = []
    for p in sorted(adr_dir.iterdir()):
        if not p.is_file():
            continue
        m = ADR_FILENAME_RE.match(p.name)
        if not m:
            continue
        out.append((int(m.group(1)), m.group(2), p))
    out.sort(key=lambda t: t[0])
    return out
 def _extract_title(path: Path) -> str:
    """Parse the title from the first line, `# ADR-NNNN: <title>` or `# ADR-NNNN — <title>`. Strips BOM."""
    text = _strip_bom(path.read_text(encoding="utf-8"))
    first_line = text.split("\n", 1)[0] if text else ""
    m = TITLE_RE.match(first_line)
    if not m:
        raise ValueError(
            f"{path.name}: cannot parse title from first line: {first_line!r}"
        )
    return m.group(2)
 def _build_index(adr_dir: Path, link_prefix: str) -> str:
    """Build the INDEX.md text for adr_dir.
    link_prefix is the relative href used for ADR links (e.g., ``./``
    so links resolve relative to the INDEX file location).
    """
    adrs = _find_adrs(adr_dir)
    if not adrs:
        raise RuntimeError(f"No ADR files found under {adr_dir}")
    # Validate every ADR is classified.
    missing = sorted(num for num, _slug, _ in adrs if num not in CLASSIFICATION)
    if missing:
        raise RuntimeError(
            "ADR(s) missing from CLASSIFICATION table in "
            "tools/generate_adr_index.py: "
            + ", ".join(f"ADR-{n:04d}" for n in missing)
            + ". Add an entry for each."
        )
    # Map: num → (filename, title)
    num_to_meta: dict[int, tuple[str, str]] = {}
    for num, _slug, path in adrs:
        num_to_meta[num] = (path.name, _extract_title(path))
    # ── Section assembly ────────────────────────────────────────────
    lines: list[str] = []
    lines.append("# ADR Index")
    lines.append("")
    lines.append(
        "Auto-generated by `tools/generate_adr_index.py`. "
        f"Total ADRs: **{len(adrs)}**."
    )
    lines.append("")
    lines.append(
        "Classification mirrors the `/report` skill's section assignment. "
        "When adding a new ADR, also add an entry to the "
        "`CLASSIFICATION` table in `tools/generate_adr_index.py`."
    )
    lines.append("")
    def fmt_entry(num: int) -> str:
        fname, title = num_to_meta[num]
        return f"- [ADR-{num:04d}]({link_prefix}{fname}) — {title}"
    # Design Principles
    lines.append("## Design Principles")
    lines.append("")
    nums = sorted(n for n, (sec, _) in CLASSIFICATION.items()
                  if sec == DESIGN_PRINCIPLES and n in num_to_meta)
    for n in nums:
        lines.append(fmt_entry(n))
    lines.append("")
    # High-level Architecture (sorted by ADR number; the optional subsection label is appended in italics)
    lines.append("## High-level Architecture")
    lines.append("")
    nums = sorted(n for n, (sec, _) in CLASSIFICATION.items()
                  if sec == HIGH_LEVEL and n in num_to_meta)
    for n in nums:
        sub = CLASSIFICATION[n][1]
        entry = fmt_entry(n)
        # Append the subsection label, when present, after two spaces.
        lines.append(f"{entry}  _({sub})_" if sub else entry)
    lines.append("")
    # Detailed Architecture (canonical component order)
    lines.append("## Detailed Architecture")
    lines.append("")
    lines.append("One subsection per component file under `src/kernbench/components/builtin/`.")
    lines.append("")
    for comp, adr_nums in DETAILED_COMPONENTS:
        lines.append(f"### {comp}")
        lines.append("")
        if adr_nums:
            for n in adr_nums:
                if n not in num_to_meta:
                    raise RuntimeError(
                        f"DETAILED_COMPONENTS references ADR-{n:04d} for "
                        f"'{comp}' but no such ADR file exists."
                    )
                lines.append(fmt_entry(n))
        else:
            lines.append("_(no ADR coverage)_")
        lines.append("")
    # Implementation Decisions: group ADRs by topic (the CLASSIFICATION subgroup); topics are ordered below.
    lines.append("## Implementation Decisions")
    lines.append("")
    topic_order: list[str] = []
    topic_to_nums: dict[str, list[int]] = {}
    for n, (sec, sub) in CLASSIFICATION.items():
        if sec != IMPL_DECISIONS or n not in num_to_meta:
            continue
        topic = sub or "Uncategorized"
        if topic not in topic_to_nums:
            topic_order.append(topic)
            topic_to_nums[topic] = []
        topic_to_nums[topic].append(n)
    # Stable order: by smallest ADR-number in topic, so older infra appears first.
    topic_order.sort(key=lambda t: min(topic_to_nums[t]))
    for topic in topic_order:
        lines.append(f"### {topic}")
        lines.append("")
        for n in sorted(topic_to_nums[topic]):
            lines.append(fmt_entry(n))
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
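def _check_or_write(path: Path, content: str, check: bool) -> bool:
    """Write content to path, or only compare against it in --check mode.

    Returns True only when check is set and the file on disk differs from
    content. In write mode the file is always written and False is returned.
    """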
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    if check:
        if existing != content:
            print(f"[diff] {path} would change.")
            return True
        return False
    path.write_text(content, encoding="utf-8")
    if existing != content:
        print(f"[wrote] {path}")
    else:
        print(f"[unchanged] {path}")
    return False
def main(argv: list[str] | None = None) -> int:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument(
        "--root", type=Path, default=Path.cwd(),
        help="Repository root (default: cwd)",
    )
    p.add_argument(
        "--check", action="store_true",
        help="Exit 1 if generated INDEX would differ from disk",
    )
    args = p.parse_args(argv)
    en_dir = args.root / "docs" / "adr"
    ko_dir = args.root / "docs" / "adr-ko"
    if not en_dir.is_dir():
        print(f"error: {en_dir} does not exist", file=sys.stderr)
        return 1
    any_diff = False
    try:
        en_index = _build_index(en_dir, link_prefix="./")
    except (RuntimeError, ValueError) as e:
        print(f"error (EN): {e}", file=sys.stderr)
        return 1
    any_diff |= _check_or_write(en_dir / "INDEX.md", en_index, args.check)
    if ko_dir.is_dir():
        try:
            ko_index = _build_index(ko_dir, link_prefix="./")
        except (RuntimeError, ValueError) as e:
            print(f"error (KO): {e}", file=sys.stderr)
            return 1
        any_diff |= _check_or_write(ko_dir / "INDEX.md", ko_index, args.check)
    if args.check and any_diff:
        print(
            "INDEX.md is out of date. "
            "Run `python tools/generate_adr_index.py` to refresh.",
            file=sys.stderr,
        )
        return 1
    return 0
if __name__ == "__main__":
    sys.exit(main())
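
The topic ordering used for the "Implementation Decisions" section can be checked in isolation. The following is a minimal sketch rather than part of the tool: `group_topics`, the `SAMPLE` table, the topic names, and the string value given to `IMPL_DECISIONS` are all hypothetical stand-ins, on the assumption that the real `CLASSIFICATION` maps an ADR number to a `(section, subgroup)` pair as the loop above suggests.

```python
from __future__ import annotations

# Hypothetical section constant; the real value is defined elsewhere in the tool.
IMPL_DECISIONS = "impl"

# Invented sample data: ADR number -> (section, subgroup or None).
SAMPLE: dict[int, tuple[str, str | None]] = {
    30: (IMPL_DECISIONS, "Logging"),
    12: (IMPL_DECISIONS, "Build"),
    5: (IMPL_DECISIONS, "Logging"),
    7: ("high_level", None),
    21: (IMPL_DECISIONS, None),
}


def group_topics(
    classification: dict[int, tuple[str, str | None]],
    available: set[int],
) -> list[tuple[str, list[int]]]:
    """Group implementation-decision ADRs by topic.

    Topics are ordered by their smallest ADR number, and the ADR numbers
    inside each topic are sorted ascending. ADRs without a subgroup fall
    under "Uncategorized"; ADRs not in `available` are skipped.
    """
    topic_to_nums: dict[str, list[int]] = {}
    for num, (section, sub) in classification.items():
        if section != IMPL_DECISIONS or num not in available:
            continue
        topic_to_nums.setdefault(sub or "Uncategorized", []).append(num)
    ordered = sorted(topic_to_nums, key=lambda t: min(topic_to_nums[t]))
    return [(topic, sorted(topic_to_nums[topic])) for topic in ordered]


if __name__ == "__main__":
    for topic, nums in group_topics(SAMPLE, available={5, 12, 21, 30}):
        print(topic, nums)
```

Because the final order depends only on the smallest ADR number in each topic, the insertion order of the table does not affect the generated output. That is the property `--check` relies on to stay stable between runs.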