From e33e76f2d1aa78fa179d99d729a495efa1215dbc Mon Sep 17 00:00:00 2001
From: Yangwook Kang
Date: Fri, 22 May 2026 11:15:37 -0700
Subject: [PATCH] adr: add INDEX.md (auto-generated by tools/generate_adr_index.py)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a section-based table of contents for the 46-ADR corpus, mirroring
the /report skill's classification (Design Principles / High-level
Architecture / Detailed Architecture by component / Implementation
Decisions by topic). Generated for both docs/adr/ (EN titles) and
docs/adr-ko/ (KO titles) from one tool.

tools/generate_adr_index.py:
- A single CLASSIFICATION dict with one entry per ADR; add an entry
  when introducing a new ADR. The script fails loudly if any file is
  missing from the table.
- DETAILED_COMPONENTS lists each builtin component and the ADR(s) that
  cover it (ADR-0014 appears under six PE engines; ADR-0023 under
  pe_dma + pe_ipcq).
- Accepts both ":" and "—" title separators (matching ADR-0033's
  existing format).
- --check mode for CI: exits 1 if INDEX.md is stale.

Also includes the docs/report/architecture-2026-1H.md generated by the
prior /report write (the public-facing architecture document; 836
lines, 76 source-attribution comments).

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/adr-ko/INDEX.md                | 174 ++++++
 docs/adr/INDEX.md                   | 174 ++++++
 docs/report/architecture-2026-1H.md | 836 ++++++++++++++++++++++++++++
 tools/generate_adr_index.py         | 333 +++++++++++
 4 files changed, 1517 insertions(+)
 create mode 100644 docs/adr-ko/INDEX.md
 create mode 100644 docs/adr/INDEX.md
 create mode 100644 docs/report/architecture-2026-1H.md
 create mode 100644 tools/generate_adr_index.py

diff --git a/docs/adr-ko/INDEX.md b/docs/adr-ko/INDEX.md
new file mode 100644
index 0000000..7ea6702
--- /dev/null
+++ b/docs/adr-ko/INDEX.md
@@ -0,0 +1,174 @@
+# ADR Index
+
+Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **46**.
+ +Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`. + +## Design Principles + +- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — 검증 전략 및 Phase 1 테스트 계획 +- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — 레이턴시 모델: 가정 및 알려진 단순화 + +## High-level Architecture + +- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — 타겟 시스템 계층 및 모델링 범위 _(System hierarchy (Tray / SIP / CUBE / PE))_ +- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — 런타임 API 및 시뮬레이션 엔진 경계 _(Runtime API ↔ sim_engine boundaries)_ +- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NoC와 메모리 데이터 경로 _(IOChiplet NOC and memory data path)_ +- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — 큐브 NoC와 HBM 연결성 _(Cube NOC and HBM connectivity)_ + +## Detailed Architecture + +One subsection per component file under `src/kernbench/components/builtin/`. + +### forwarding + +- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding 컴포넌트 (forwarding_v1) + +### hbm_ctrl + +- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM 컨트롤러 내부 설계 + +### io_cpu + +- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU 컴포넌트 모델 + +### m_cpu + +- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU 및 M_CPU.DMA 컴포넌트 모델 + +### pcie_ep + +- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model + +### pe_cpu + +- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델 + +### pe_dma + +- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델 +- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication + +### pe_fetch_store + +- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델 + +### pe_gemm + +- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델 + +### 
pe_ipcq + +- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication + +### pe_math + +- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델 + +### pe_mmu + +- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — 컴포넌트 + 유틸리티 이중 역할 + +### pe_scheduler + +- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델 + +### pe_tcm + +- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — 듀얼 채널 BW 직렬화 + +### sram + +- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC + +### tiling + +- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math 파이프라인 plan 빌더 + +## Implementation Decisions + +### Address Scheme + +- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51비트 물리 주소 레이아웃 및 디코딩 계약 +- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — 메모리 주소 지정 — PA / VA / LA 주소 모델 + +### Routing & Helper API + +- [ADR-0002](./ADR-0002-lat-routing-distance.md) — 라우팅 거리, 순서 및 우회 규칙 +- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter` + +### Memory Semantics & Local-HBM Bandwidth + +- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — 메모리 시맨틱 및 로컬 HBM 대역폭 보장 + +### Topology Compilation, Diagrams & Builder Algorithms + +- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — 다이어그램 뷰 및 거리 기반 레이아웃 규칙 +- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — 토폴로지 컴파일, 거리 추출, 그리고 자동 다이어그램 생성 +- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms + +### Tensor Deployment and Allocation + +- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — 텐서 배포 및 할당 (호스트 할당기, PA 우선) + +### Kernel Execution and Host-Device Messaging + +- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — 커널 실행 메시징 및 완료 시맨틱 +- 
[ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU 메시지 스키마 (PA-우선, PE-태깅) + +### CLI Surface and Semantics + +- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — 명령줄 인터페이스 및 실행 시맨틱 + +### Component Port/Wire Fabric Model + +- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — 컴포넌트 포트/와이어 모델과 패브릭 라우팅 + +### Two-Pass Data Execution + +- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리) + +### 2D Grid Program Identity + +- [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D 그리드 program_id 시맨틱 + +### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm) + +- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP +- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — sip/num_sips 필드 제거 +- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API +- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim +- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py` + +### IPCQ Direction Addressing + +- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching + +### Intercube All-Reduce + +- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — 큐브 간 All-Reduce — pe0 큐브-메시 리듀스 + 다중-SIP 교환 + +### Evaluation Harnesses + +- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce 평가 하니스 — `tests/sccl/` +- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM 평가 하니스 — `scripts/gemm_sweep.py` + `tests/gemm/` + +### Bench Module Contract + +- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring + +### Kernel-side tl.* API (TLContext) + +- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract + +### Memory Allocator Algorithms + +- 
[ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator + +### Probe Subcommand + +- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness + +### Sim-engine Op Log and Memory Store Schemas + +- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals diff --git a/docs/adr/INDEX.md b/docs/adr/INDEX.md new file mode 100644 index 0000000..7a6d610 --- /dev/null +++ b/docs/adr/INDEX.md @@ -0,0 +1,174 @@ +# ADR Index + +Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **46**. + +Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`. + +## Design Principles + +- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — Verification Strategy and Phase 1 Test Plan +- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — Latency Model: Assumptions and Known Simplifications + +## High-level Architecture + +- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — Target System Hierarchy & Modeling Scope _(System hierarchy (Tray / SIP / CUBE / PE))_ +- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — Runtime API and Simulation Engine Boundaries _(Runtime API ↔ sim_engine boundaries)_ +- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NOC and Memory Data Path _(IOChiplet NOC and memory data path)_ +- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — Cube NOC and HBM Connectivity _(Cube NOC and HBM connectivity)_ + +## Detailed Architecture + +One subsection per component file under `src/kernbench/components/builtin/`. 
+ +### forwarding + +- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding Component (forwarding_v1) + +### hbm_ctrl + +- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM Controller Internal Design + +### io_cpu + +- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU Component Model + +### m_cpu + +- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU and M_CPU.DMA Component Model + +### pcie_ep + +- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model + +### pe_cpu + +- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model + +### pe_dma + +- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model +- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication + +### pe_fetch_store + +- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model + +### pe_gemm + +- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model + +### pe_ipcq + +- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication + +### pe_math + +- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model + +### pe_mmu + +- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — Component + Utility Dual Role + +### pe_scheduler + +- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model + +### pe_tcm + +- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — Dual-Channel BW Serialization + +### sram + +- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC + +### tiling + +- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math Pipeline Plan Builders + +## Implementation Decisions + +### Address 
Scheme + +- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51-bit Physical Address Layout & Decoding Contract +- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — Memory Addressing — PA / VA / LA Address Models + +### Routing & Helper API + +- [ADR-0002](./ADR-0002-lat-routing-distance.md) — Routing Distance, Ordering & Bypass Rules +- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter` + +### Memory Semantics & Local-HBM Bandwidth + +- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — Memory Semantics & Local-HBM Bandwidth Guarantee + +### Topology Compilation, Diagrams & Builder Algorithms + +- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — Diagram Views & Distance-Aware Layout Rules +- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — Topology Compilation, Distance Extraction, and Automatic Diagram Generation +- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms + +### Tensor Deployment and Allocation + +- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — Tensor Deployment and Allocation (Host Allocator, PA-first) + +### Kernel Execution and Host-Device Messaging + +- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — Kernel Execution Messaging and Completion Semantics +- [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU Message Schema (PA-first, PE-tagged) + +### CLI Surface and Semantics + +- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — Command Line Interface and Execution Semantics + +### Component Port/Wire Fabric Model + +- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — Component Port/Wire Model and Fabric Routing + +### Two-Pass Data Execution + +- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass Data Execution Model (Timing / Data Separation) + +### 2D Grid Program Identity + +- 
[ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D Grid program_id Semantics + +### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm) + +- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP +- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — remove sip/num_sips fields +- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API +- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim +- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py` + +### IPCQ Direction Addressing + +- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching + +### Intercube All-Reduce + +- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange + +### Evaluation Harnesses + +- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce Evaluation Harness — `tests/sccl/` +- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/` + +### Bench Module Contract + +- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring + +### Kernel-side tl.* API (TLContext) + +- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract + +### Memory Allocator Algorithms + +- [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator + +### Probe Subcommand + +- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness + +### Sim-engine Op Log and Memory Store Schemas + +- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals diff --git 
a/docs/report/architecture-2026-1H.md b/docs/report/architecture-2026-1H.md
new file mode 100644
index 0000000..f87312c
--- /dev/null
+++ b/docs/report/architecture-2026-1H.md
@@ -0,0 +1,836 @@
+# KernBench — Architecture Design Document
+*2026 1H*
+
+KernBench is a system-level, discrete-event simulator for AI-accelerator
+chiplet systems. It models the data-movement and control paths across
+the full hardware hierarchy and reports end-to-end execution latency
+for kernels dispatched to the device's compute units.
+
+This document is a public summary of the architecture as designed and
+implemented in the first half of 2026. It assumes no prior knowledge of
+the simulator's internal documents; terms specific to the system are
+defined on first use.
+
+---
+
+## Design Principles
+
+KernBench is grounded in two foundational commitments: every measured
+latency must trace to explicit, modeled events on the simulator's graph,
+and every behavioral claim must be verifiable through tests that target
+spec-level invariants rather than incidental implementation details.
+
+
+The verification posture is contract-driven. Tests are written to
+validate the architectural contracts that the simulator exposes
+(correct routing, deterministic results, monotonic latency under
+increasing hop counts) rather than to mirror the call graph of the
+implementation. Two phases coexist: a fast timing phase that exercises
+the simulator's discrete-event engine and produces a log of operations
+with timestamps, and an optional data-replay phase that uses that log
+to compute real numerical results. Tests can target either phase.
+
+
+The latency model is intentionally abstract rather than
+cycle-accurate. Each modeled node contributes a configurable per-node
+overhead, each link contributes wire delay plus byte-over-bandwidth
+serialization, and each terminal service contributes its own service
+time. The simulator does not attempt to reproduce cache coherence
+protocols, microarchitectural pipelines, or full PCIe/UCIe protocol
+correctness; those are explicitly out of scope. The aim is a
+simulator that compares system-level configurations meaningfully and
+deterministically, not one that reproduces microarchitectural detail.
+
+
+Determinism is a hard requirement. Given identical inputs (topology,
+routing policy, and request stream), the simulator must produce
+identical outputs, hop traces included. This rules out reliance on
+unordered set iteration on the critical path and forces every latency
+contribution to come from an explicitly scheduled event on a modeled
+component or link. There are no implicit waits, no hardcoded magic
+delays, and no shortcuts that bypass the modeled graph.
+
+---
+
+## High-level Architecture
+
+
+The simulated system is a four-level hierarchy. A **Tray** holds one or
+more **SIPs** (system-in-package), each containing a 2D mesh of
+**CUBEs** plus one or more **IO chiplets** that connect the SIP to the
+host. Each CUBE contains a regular grid of **PEs** (processing
+elements) plus its own attached resources: high-bandwidth memory
+(HBM), a per-cube SRAM scratchpad, and a management CPU (M_CPU). The PE
+itself is a composite of nine sub-components rather than a monolithic
+core. This hierarchy is fixed; the parameters along each axis (counts,
+mesh dimensions, link widths) are configurable through the topology
+spec.
+
+
+A clean separation runs along the request flow. A **runtime API** at
+the top is the host-facing surface; it exposes tensor and kernel
+operations, owns host-side allocation metadata, and is
+topology-agnostic: it does not route or fan out. Below it the
+**simulation engine** decomposes runtime operations into discrete graph
+requests (memory writes, memory reads, kernel launches, MMU map
+installs) and schedules events deterministically. At the bottom,
+**components** model device behavior on a graph of nodes connected by
+links; they implement the actual latency contributions and pass
+requests along. No component reaches up into the runtime API, and no
+runtime call bypasses the engine.
+
+
+
+### Tray
+
+
+The Tray is the outermost boundary. It owns the host CPU on one side
+and one or more SIPs on the other, connected through a fabric switch.
+For collective communication that must traverse multiple SIPs, the
+fabric switch acts as the common rendezvous: device-side outbound
+traffic from one SIP routes through the switch and back into the
+target SIP's IO chiplet.
+
+### SIP
+
+
+A SIP packages a 2D mesh of CUBEs and one or more IO chiplets. The
+default topology used by the simulator is a 4×4 cube mesh; the
+mesh dimensions are configurable. Each cube on the boundary of the
+mesh connects to its neighbors over UCIe (die-to-die) links arranged
+on the four cardinal sides: north, south, east, and west. The IO
+chiplets sit on one side of the SIP and provide the bridge to the host
+across PCIe.
+
+
+The IO chiplet itself contains its own internal network. A
+host-facing PCIe endpoint passes traffic to a small NOC ("network on
+chip"); from there it can branch to a control-plane CPU that processes
+kernel-launch messages, or it can take the direct memory data path to
+the cube's HBM controller. The decision to provide a direct memory
+path that bypasses the control CPU was a deliberate choice to
+keep host-issued memory writes from paying control-plane overhead on
+the data path.
+
+### CUBE
+
+
+Each CUBE owns a 2D mesh of NOC routers and a set of attached
+resources: PEs, the cube-local SRAM scratchpad, the management CPU
+(M_CPU), and the HBM partition (split across multiple PE-private
+slices for bandwidth). The router mesh uses deterministic XY routing.
+Attached components do not connect to each other directly; they all
+sit on the router mesh, and every cube-internal transfer pays the
+mesh distance from source to destination.
+
+
+The HBM partition is per-PE: each PE owns one HBM slice, and the
+controller exposes per-PE channels so that the same PE always
+addresses the same set of HBM channels. This makes the local-HBM
+bandwidth from a PE to its own slice predictable, while accesses to
+another PE's slice, or to a different cube's slice, pay the mesh
+distance and any UCIe crossings.
+
+### PE
+
+
+A PE is not a monolithic core. Internally it is a set of nine
+sub-components, each modeling one stage of a request's flow: a small
+control CPU, a tile-pipeline scheduler, a DMA engine, a fetch-store
+engine that moves data between the on-PE scratchpad and the register
+file, a GEMM compute engine, a math compute engine, the
+tightly-coupled memory (TCM, the on-PE scratchpad), an MMU for
+virtual-to-physical address translation, and an inter-PE collective
+queue (IPCQ). The scheduler decomposes higher-level operations into
+per-tile stage sequences, and tile tokens self-route from one
+sub-component to the next.
+
+
+
+---
+
+## Detailed Architecture
+
+This section describes each modeled device-side component in turn.
+Components are listed in the alphabetical order used by the
+simulator's source tree.
+
+### forwarding
+
+
+The forwarding component is the generic routing relay used wherever a
+node only needs to apply a small processing overhead and pass the
+request to the next hop. NOC routers, conn nodes, and UCIe PHYs all
+reduce to this. Its first act on receiving a request is to apply the
+per-node overhead configured for it in the topology spec; after the
+overhead it simply hands the request to the next hop along the path.
+
+
+The decision to share one implementation across these roles was made
+to keep the simulator's component set small without sacrificing
+modeling fidelity. Each instance still carries its own overhead and
+its own link bandwidth contributions, so different roles still produce
+different timing. What is shared is the dispatcher loop, not the
+parameter values.
+
+### hbm_ctrl
+
+
+The HBM controller is the terminal node for all memory traffic that
+reaches HBM. Internally it owns a number of pseudo channels, partitioned
+per-PE so that each PE addresses a deterministic subset. On a request
+arrival the controller first selects the right pseudo channel from the
+target address, then enters a chunk-loop that drains the requested
+size in fixed-size flits over the channel's bandwidth.
+
+
+The chunk-loop pattern replaces an earlier all-at-once drain. The
+benefit is that the controller no longer presents a flit-aware fabric
+with a single bulk transfer; instead it emits flits at a paced rate
+matching the channel bandwidth, which makes cross-flow contention
+visible. The bandwidth budget is calibrated against the configured
+HBM total bandwidth divided across the channel count.
+
+### io_cpu
+
+
+The IO_CPU is the control-plane processor sitting inside the IO chiplet.
+It receives kernel-launch messages from the host, decodes them, and
+dispatches per-cube launches to the cube's management CPU. Pure memory
+operations bypass it entirely, taking the direct data path established
+inside the IO chiplet.
+
+
+On receiving a kernel-launch message, the IO_CPU consults the message's
+shard list (which already names the target SIP, cube, and PE for each
+piece of the tensor argument) and forwards a per-cube launch to each
+cube the kernel needs to reach. This makes the IO_CPU a deterministic
+fan-out point: it does not decode physical addresses to route; it just
+follows the explicit per-shard targets it was handed.
+
+### m_cpu
+
+
+The M_CPU is the cube's management processor. It has two distinct
+roles: as a control-plane fan-out point for kernel launches arriving
+from the IO chiplet, and as a DMA endpoint for host-initiated memory
+writes that need to land in this cube's HBM. The control role
+forwards launches to the right PE control CPUs; the DMA role places
+the actual bytes into HBM through the router mesh.
+
+
+The component model deliberately distinguishes the two roles because
+their routing differs: the control fan-out path uses command-kind
+links that do not appear on data-path routes, while the DMA path uses
+the same router mesh as PE-initiated DMA, with PE-internal nodes
+excluded. The routing layer knows about both modes and selects the
+appropriate adjacency at request time.
+
+### pcie_ep
+
+
+The PCIe endpoint is the protocol boundary at the host-device edge.
+Its first act on each incoming request is to apply a configured
+protocol-processing overhead; after that it simply forwards. There is
+no internal queuing model, no retry, and no TLP-level fidelity; those
+are deliberately out of scope. The endpoint is bidirectional: host →
+device traffic (memory writes, kernel launches) flows one way, and
+device-side outbound traffic (cross-SIP collective sends) flows the
+other.
+
+
+A more detailed PCIe model was considered and rejected. The simulator
+targets system-level latency comparisons; making the endpoint
+heavier with credit-management and retry logic would not improve the
+metrics being studied. The decision keeps the endpoint as the
+documented protocol-boundary node, named consistently so routing
+helpers can locate it by SIP and IO instance.
+
+### pe_cpu
+
+
+The PE control CPU is the entry point for kernel work arriving from
+the cube's management CPU. It receives kernel-launch messages, resolves
+the kernel function by name, and hands execution to the scheduler with
+the resolved tensor arguments. From the scheduler's point of view, the
+PE_CPU is the upstream source of high-level commands; from the rest
+of the system's point of view, the PE_CPU is where a kernel's
+execution begins on a given PE.
+
+### pe_dma
+
+
+The DMA engine on each PE has two distinct modes. In the standard PE
+pipeline it consumes tile tokens issued by the scheduler, acquires a
+read or write channel (modeled as a one-in-flight resource per
+direction), and runs the bytes to or from HBM through the mesh. In
+its collective mode it forwards send tokens for the cube's IPCQ into
+the fabric, snapshotting the source data at send time so later
+mutations cannot race the receiver's read. Both modes share the same
+channel resources but differ in their downstream handling: one
+returns when the round-trip completes, the other dispatches
+fire-and-forget.
+
+### pe_fetch_store
+
+
+The fetch-store engine is the bridge between the on-PE scratchpad
+(TCM) and the register file. It does not run DMA; it only moves bytes
+internally. On receiving a tile-stage token it sends a short request
+to the TCM, waits for the bandwidth-serialized delay, and continues
+the pipeline. The split between this engine and the TCM lets the
+scratchpad model its own read/write bandwidth independently.
+
+### pe_gemm
+
+
+The GEMM engine is the matrix-multiply compute unit. Tile tokens
+arriving at this stage carry the per-tile dimensions, and the engine
+contributes a service time that accounts for the fused multiply-adds
+(MACs) in the tile. Composite operations (where the same tensor pair is
+streamed across many tiles) reuse the engine through the scheduler;
+the engine itself is stateless between tiles.
+
+### pe_ipcq
+
+
+The IPCQ (inter-process communication queue) is each PE's
+collective-communication endpoint. It owns ring buffers that hold
+inbound messages from neighbor PEs and bookkeeping for send credits.
+Direction names ("N", "S", "E", "W" for cube-internal neighbors and
+"global_*" for cross-SIP neighbors) are resolved to physical peer
+endpoints by a neighbor table installed at process-group creation
+time. The component itself does not move bytes; it issues DMA tokens
+through the local PE_DMA, which performs the actual cross-PE
+transfer.
+
+
+A key invariant is that the inbound terminal, where data lands at
+the receiver, pays the link bandwidth drain plus any cube-internal
+mesh hop to the slot's backing memory. This prevents IPCQ from
+silently outpacing raw DMA at large transfer sizes. Outbound sends
+are fire-and-forget; credit return is the only backpressure signal.
+
+### pe_math
+
+
+The math engine handles element-wise and reduction operations. It
+consumes tile tokens carrying an operation kind (`exp`, `sum`, `max`,
+`where`, etc.) and contributes a service time proportional to the
+number of elements processed. Like the GEMM engine it is stateless;
+chained epilogues (a sequence of math operations after a GEMM tile)
+are scheduled as separate stages.
+
+### pe_mmu
+
+
+The MMU has two roles, exposed through one component. As a node on
+the cube NOC it receives MMU-map and MMU-unmap messages and updates
+its internal page table, so that the runtime API can install
+virtual-to-physical mappings with measured fabric latency. As a
+utility object held inside the PE it offers synchronous translate
+calls to the PE's DMA and GEMM engines without taking simulator time
+itself; the calling engine pays any configured TLB overhead in its
+own process.
+
+
+The page table supports multiple disjoint regions inside a single
+page, with later-write-wins semantics on overlap. This is a deliberate
+simulator stopgap for parallelization policies that shard data at
+sub-page granularity: under a real hardware MMU's one-PA-per-entry
+assumption, such shards would be silently mis-routed. A real MMU does
+not work this way; the model documents this as a simplification.
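
The sub-page page-table behavior described for pe_mmu can be pictured with a small Python sketch. It is illustrative only and is not taken from the KernBench source: the names `Region`, `SubPageTable`, `map_region`, and `translate` are hypothetical. The unmapped-address behavior (returning the input unchanged) is an assumption modeled on the PA fallback that the Address Scheme section mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One mapped span; several may live inside the same page (hypothetical type)."""
    va_start: int
    length: int
    pa_start: int


class SubPageTable:
    """Hypothetical page table that allows disjoint sub-page regions."""

    def __init__(self) -> None:
        # Regions are kept in install order so later writes can win on overlap.
        self._regions: list[Region] = []

    def map_region(self, va_start: int, length: int, pa_start: int) -> None:
        if length <= 0:
            raise ValueError("length must be positive")
        self._regions.append(Region(va_start, length, pa_start))

    def translate(self, va: int) -> int:
        # Scan newest-first: this is what gives later-write-wins semantics.
        for region in reversed(self._regions):
            if region.va_start <= va < region.va_start + region.length:
                return region.pa_start + (va - region.va_start)
        # Assumption: no mapping means the address is treated as already physical.
        return va


table = SubPageTable()
# Two disjoint regions inside one 4 KiB page (0x1000..0x1FFF).
table.map_region(0x1000, 0x800, 0xA000)
table.map_region(0x1800, 0x800, 0xB000)
# A later mapping that overlaps part of the first region.
table.map_region(0x1400, 0x200, 0xC000)
```

With these mappings, `table.translate(0x1004)` resolves through the first region, `table.translate(0x1410)` resolves through the later overlapping one, and an address with no mapping is returned unchanged.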
+
+### pe_scheduler
+
+
+The scheduler is the sole dispatcher inside a PE. Simple commands are
+routed directly to the right engine. Composite commands generate a
+tile plan, and the resulting tile tokens are fed into the pipeline.
+Self-routing keeps the scheduler off the per-stage hot path: each
+engine, on finishing a stage, advances the token to the next stage's
+component itself, so the scheduler only does initial dispatch and
+completion tracking.
+
+### pe_tcm
+
+
+The TCM is the per-PE tightly-coupled scratchpad memory. It models
+time only, not data; the actual payload lives in the simulator's
+memory store. Read and write are independent channels: each is
+modeled as a one-in-flight resource, so same-direction requests
+serialize but a read and a write can overlap. The bandwidth of each
+direction is configured separately and applied as bytes-over-bandwidth
+on each request.
+
+
+The decision to keep read and write on separate channels was made
+because the PE pipeline's normal case overlaps fetch (read) and store
+(write). Collapsing them into a single shared channel would have
+artificially serialized that overlap and produced an incorrect
+bandwidth ceiling.
+
+### sram
+
+
+The cube SRAM is a per-cube scratchpad attached to one of the cube's
+routers. As a node it applies a configured access overhead, pays the
+link-bandwidth drain stamped on the incoming request, and sends a
+response on the reverse path. It is a terminal; it does not forward.
+
+
+A second role is as one of three backing-memory tiers (TCM, SRAM, HBM)
+that an inter-PE collective slot can live in. When the slot lives in
+SRAM, the PE_DMA pays the slot read or write latency directly using
+the configured SRAM bandwidth and overhead; the SRAM component does
+not need to know about collective semantics. This separation keeps
+the SRAM component agnostic to the collective subsystem.
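
The slot latency that PE_DMA pays for an SRAM-backed collective slot follows the same overhead-plus-serialization shape as the rest of the latency model. A minimal Python sketch, assuming the cost is a fixed access overhead plus bytes over bandwidth: `TierConfig`, `slot_access_latency_ns`, and the numeric values are hypothetical and are not read from any KernBench topology spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierConfig:
    """Hypothetical per-tier parameters (not taken from a real spec)."""
    overhead_ns: float
    bandwidth_bytes_per_ns: float


def slot_access_latency_ns(nbytes: int, tier: TierConfig) -> float:
    """Fixed access overhead plus bytes-over-bandwidth serialization."""
    if nbytes < 0:
        raise ValueError("nbytes must be non-negative")
    return tier.overhead_ns + nbytes / tier.bandwidth_bytes_per_ns


# Illustrative numbers only: 10 ns overhead, 64 bytes per ns.
SRAM = TierConfig(overhead_ns=10.0, bandwidth_bytes_per_ns=64.0)
latency = slot_access_latency_ns(4096, SRAM)  # 10 + 4096 / 64 = 74.0 ns
```

Because the caller computes the cost from configuration alone, this shape is consistent with the SRAM component staying unaware of collective semantics.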
+ +### tiling + + +The tile-plan generator is not a runtime component — it is a pure +module of functions that take a problem shape (matrix dimensions, tile +sizes) and produce an ordered list of tile-stage sequences. The +scheduler consumes this list. Each tile's stage sequence depends on +how its operands are staged: operands streamed from HBM produce +DMA_READ stages, operands already resident in TCM (because they were +loaded eagerly upfront) skip them. + + +The plan generator is intentionally pure — given the same input it +returns the same plan, with no simulator events created. This lets +the rest of the system reason about tile sequences as data, and it +makes the plan testable in isolation without simulator state. New +plan variants (for example, K-major or DTensor-aware plans) can be +added as new functions following the same shape. + +--- + +## Implementation Decisions + +This section collects cross-cutting decisions — algorithms, policies, +schemes, and contracts — that span multiple components rather than +living inside one. + +### Address Scheme + + +Every physical address in the simulator decodes into a structured +location. A fixed-width physical address carries the SIP id, the +cube id within the SIP, a type discriminator (HBM vs PE-resource vs +others), and a type-specific offset. HBM addresses additionally encode +the per-PE slice offset so the controller can determine which PE +owns the target slice without external lookup. The layout is +deliberately reserved rather than packed-to-fit, so new sub-units can +be added at the type-discriminator level without rewriting existing +addresses. + + +On top of physical addressing, the simulator supports three address +models that the runtime API selects between. Direct physical +addressing is retained as a fallback. 
Virtual addressing — the +current default — gives each tensor a contiguous virtual range at +deployment, with the per-PE MMU translating per access; an +alternative logical-address scheme remains a future option. The +virtual-address path is what every modern test path takes; the PA +fallback is used by the MMU itself when no mapping exists for an +address (a deliberate signal, not an error). + + +Tensor placement is represented as a list of physical-address shards, +each tagged with target SIP, cube, and PE, plus a single tensor-wide +virtual base. This means a kernel sees one virtual base for the whole +tensor while the host driver and the engine still know exactly where +each shard lives. Replicated tensors get per-cube local PA mappings; +sharded tensors broadcast their mapping across cubes within a SIP. + +### Routing, Distance & Helper API + + +Routing is policy-driven, deterministic, and topology-aware. Given a +source, a destination, and an intent — for example, PE-initiated +DMA versus host-initiated memory write versus a generic +component-to-component query — the routing layer picks the right +path. The intent matters because different traffic types must avoid +different categories of edges: PE-initiated DMA should not traverse +command-only links; M_CPU DMA should not pass through PE-internal +pipeline edges; cube-local transfers should not use the +zero-distance UCIe bus that would otherwise look attractive to a +shortest-path search. + + +The routing layer therefore maintains four separate adjacency graphs +at construction, each excluding a different category of edges, and +picks the appropriate one per intent. On top of the graphs sits a +helper API that hides the topology's naming convention: callers ask +for the PCIe endpoint of a given SIP, the M_CPU of a given cube, or +the HBM destination for a given physical address, and receive the +corresponding node id. 
No component constructs node-id strings +directly; if the naming convention ever changes, the change is local +to the helper layer. + + +Path-finding itself uses Dijkstra with explicit per-edge weights +(routing weight is allowed to differ from physical distance — for +example, UCIe is configured to be routing-preferable). Tie-breaks +follow insertion order, which keeps results deterministic. Paths +between unreachable nodes raise rather than returning empty, surfacing +topology errors immediately. + +### Memory Semantics and Local-HBM Bandwidth + + +A PE accessing its own HBM slice through its own cube's NOC must see +the full local HBM bandwidth — that is the model's intent. Memory +traffic accumulates latency from per-component overhead and +bytes-over-link-bandwidth serialization along the path, but the +controller does not throttle below the slice's allotted bandwidth. +Cross-PE-slice accesses inside the same cube, cross-cube accesses +through UCIe, and cross-SIP accesses through PCIe each pay +progressively more overhead as the path grows. + +### Topology Compilation, Diagrams & Builder Algorithms + + +Topology is configurable, not hardcoded. The simulator reads a YAML +spec, compiles it into a flat graph of nodes and edges plus four +view projections at different abstraction levels — system, SIP, cube, +PE — and uses the compiled graph as the single source for both +execution and visualization. Distance metadata used by routing is +extracted at compile time so that diagrams and routing decisions +agree by construction. + + +Diagrams are derived artifacts of the compiled topology. The visualizer +produces one SVG per view at the appropriate abstraction level; nothing +in the diagrams is hand-drawn or hand-positioned. Distance-aware +layout rules place nodes in the diagrams using the same coordinates +that routing uses to compute distance, so a diagram that "looks +wrong" is a signal that the topology itself has a problem, not the +visualizer. 
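As a rough illustration of the compile step, the sketch below flattens a spec into nodes and edges and derives each edge's distance from the same coordinates a diagram would use. The spec schema (`nodes`, `links`, `x`, `y`), the Euclidean distance, and the `CompiledTopology` class are assumptions made for the example; an in-memory dict stands in for the YAML file.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CompiledTopology:
    coords: dict[str, tuple[float, float]]  # node id -> placement
    edges: list[tuple[str, str, float]]     # (src, dst, distance)

    def diagram_position(self, node: str) -> tuple[float, float]:
        # The visualizer reads the same coordinates the distances came from.
        return self.coords[node]


def compile_topology(spec: dict) -> CompiledTopology:
    """Flatten a spec into nodes and edges, stamping distance at compile time."""
    coords = {n["id"]: (float(n["x"]), float(n["y"])) for n in spec["nodes"]}
    edges = []
    for src, dst in spec["links"]:
        if src not in coords or dst not in coords:
            raise ValueError(f"link references unknown node: {src} -> {dst}")
        edges.append((src, dst, math.dist(coords[src], coords[dst])))
    return CompiledTopology(coords, edges)


# Hypothetical two-router spec.
spec = {
    "nodes": [
        {"id": "r0", "x": 0, "y": 0},
        {"id": "r1", "x": 3, "y": 4},
    ],
    "links": [("r0", "r1")],
}
topo = compile_topology(spec)
```

Since both the routing distance and the drawn position come from `coords`, the two cannot disagree, which is the "agree by construction" property described above.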
+ + +Inside a cube the router mesh is generated automatically. PE corner +positions are fixed by convention; the relay-column algorithm +inserts additional grid columns whenever the gap between adjacent PE +columns would exceed a tunable maximum. HBM occupies a central +exclusion zone — router slots inside the zone are deliberately empty, +since HBM controllers attach as separate named nodes. M_CPU and SRAM +attach to the nearest router by Euclidean distance from their +configured placement coordinates, and UCIe physical lanes distribute +along the boundary rows and columns. The whole mesh is cached +beside the topology spec and invalidated only when one of a small set +of layout-relevant fields changes. + + + +### Tensor Deployment and Allocation + + +Tensor deployment in the runtime API produces a list of physical-address +shards plus a single tensor-wide virtual base. The host allocator +walks the data-parallelism policy, computes per-shard placement, and +emits the per-shard physical addresses through the per-PE allocators. +No separate "allocate then later attach to a device" RPC exists — +allocation and deployment are a single operation that produces a +deployed tensor handle. + +### Memory Allocator Algorithms + + +Each per-PE allocator owns two channels — HBM slice and TCM — each +backed by an offset-keyed free-list. Allocation is first-fit; freeing +coalesces with adjacent free blocks. A device-wide virtual allocator +sits above the per-PE allocators, aligns requests up to the configured +page size, and coalesces on free in the same way. The trade-off is +explicit: first-fit is simpler and cheaper than best-fit or buddy +allocation, and the simulator's workload is stack-like enough +(deploy / kernel / free in matched order) that fragmentation is not +a practical concern. + + +Allocation failure raises rather than silently returning a partial +result. 
A partial tensor reaching the engine would route over wrong +PAs and silently corrupt simulator output, so an out-of-memory signal +is preferred. The free path trusts its caller to pass back exactly +what was allocated; the small risk of caller error in exchange for +fast common-case freeing is documented as a deliberate trade. + +### Kernel Execution and Host-Device Messaging + + +Kernel execution decomposes into a small set of messages that travel +the device graph. The host issues a single kernel-launch message; the +IO_CPU fans it out per-cube; the cube M_CPU fans it out per-PE; the +PE CPU resolves the kernel and runs it through the scheduler. +Completion flows back the same way, gated by per-shard completion +tracking. Memory operations follow the same pattern: a memory write +or read travels as one message that the engine routes to the right +HBM controller, with a response taking the reverse path. + + +The schema between the host and the device-side IO CPU is PA-first +and shard-tagged. Every byte of host-issued payload arrives with an +explicit target SIP, cube, PE, and physical address. The IO_CPU does +not decode addresses to derive placement — placement is named +explicitly by the shard list. This makes the host-device interface +deterministic and keeps the routing helper free of host-derived +intent. + +### CLI Surface and Semantics + + +The command-line interface exposes four subcommands. A bench runner +loads a topology, resolves a registered benchmark by name or index, +and runs it on a selected device. A bench-listing command enumerates +the registered benchmarks. A probe utility runs a fixed catalog of +traffic patterns through the engine for latency and bandwidth +verification. A web viewer renders the topology in a browser. A +benchmark instance is always single-device by convention; multi-SIP +collective work happens inside the benchmark through the launcher +abstraction, not by multiplexing the CLI. 
+ +### Component Port and Wire Fabric Model + + +Every modeled component exposes input and output ports, and every +edge in the topology connects an output port on one component to an +input port on another. Bandwidth and propagation delay are properties +of the wire between ports, not of the component endpoints. A +component's responsibility is to apply its configured per-node +overhead and either forward to the next hop or terminate; the wire +charges the byte-over-bandwidth serialization separately. + + +This separation lets components be swapped behind their port +interface without changing the rest of the model, and it keeps +bandwidth contention at the wire level where multiple components may +contend for the same edge. Future component models can refine +internal behavior without disturbing the fabric. + +### Two-Pass Data Execution + + +The simulator runs in two passes. The first pass — fast and always +on — runs the discrete-event engine and records every data operation +in an operation log with timestamps, component identifiers, and per- +operation parameters. The second pass — optional, opt-in — replays +the log against an in-memory tensor store to produce actual numerical +results. Tests that only need timing skip the second pass; tests that +need to verify correctness opt in. + + +The split lets the timing engine remain unconcerned with data +semantics: kernels move handles around, not bytes. The replay phase +recovers data semantics from the recorded operations, in their +original time order with a small set of secondary-sort rules. The +op-log records carry enough metadata — input snapshots for compute +operations, source snapshots for cross-component copies — that the +replay phase cannot mis-order with respect to in-flight mutations. 
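The two-pass split can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `OpRecord` fields, the `copy` and `add` operation names, the parameter keys, and the `replay` function are hypothetical, and the first pass is replaced by a hand-written log.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component: str
    kind: str  # e.g. "memory" or "math"
    name: str
    params: dict = field(default_factory=dict)


def replay(op_log: list[OpRecord], store: dict[str, dict[int, list[float]]]) -> None:
    """Second pass: apply logged operations to the store in time order."""
    for op in sorted(op_log, key=lambda r: r.t_start):  # sorted() is stable
        p = op.params
        if op.name == "copy":
            src = store[p["src_space"]][p["src_addr"]]
            store[p["dst_space"]][p["dst_addr"]] = list(src)  # explicit copy
        elif op.name == "add":
            # Compute ops read the snapshots captured when the op was logged.
            a, b = p["input_snapshots"]
            store[p["dst_space"]][p["dst_addr"]] = [x + y for x, y in zip(a, b)]
        else:
            raise ValueError(f"unknown operation: {op.name}")


store = {"hbm": {0x100: [1.0, 2.0]}, "tcm": {}}
log = [
    # Deliberately listed out of order; replay sorts by start time.
    OpRecord(5.0, 6.0, "pe_math", "math", "add",
             {"input_snapshots": ([1.0, 2.0], [10.0, 20.0]),
              "dst_space": "tcm", "dst_addr": 0x40}),
    OpRecord(0.0, 4.0, "pe_dma", "memory", "copy",
             {"src_space": "hbm", "src_addr": 0x100,
              "dst_space": "tcm", "dst_addr": 0x0}),
]
replay(log, store)
```

A timing-only test would stop after producing `log`; a correctness test would call `replay` and inspect `store`.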
+ +### Sim-engine Op Log and Memory Store Schemas + + +The operation log holds typed records with seven fields each: start +and end timestamps, the component that issued the operation, an +operation kind ("memory", "gemm", "math"), an operation name, a +parameter dictionary, and a (currently unused) dependency list. +Records are kept in stable timestamp order. The parameter dictionary +varies by operation: a DMA read carries source address and byte count; +a GEMM carries operand shapes, dtypes, and address spaces; a math +operation carries input addresses and snapshots. + + +The companion memory store is a two-level dictionary keyed by +address space ("hbm", "tcm", "sram", others) and integer address. +Reads and writes are reference-based — no copy by default — so +callers wanting to detach a snapshot must copy explicitly. This is +deliberate: the engine-internal snapshot paths copy at well-defined +points (math input capture, HBM source capture for DMA writes, +inbound collective copies) and downstream replay code therefore +sees stable data even when slot or scratch addresses are reused by +later operations. + +### 2D Grid Program Identity + + +Inside a kernel the program identity is two-dimensional. The +first axis corresponds to the PE index within a cube; the second +corresponds to the cube index within a SIP. Together they let a +kernel address its position both within its cube and within the +larger system without needing to know the full topology. Total +program counts along each axis are exposed symmetrically. + +### Parallelism — SIP Launcher, DPPolicy, Megatron-TP, AHBM Backend, and CCL Algorithm Module + + +The launcher model treats each SIP as one rank. Inside a process the +launcher spawns one greenlet per SIP rank; the rank is bound to its +greenlet so that any code running in that worker sees the right +distributed-style rank. 
This is a deliberately PyTorch-compatible +shape: a benchmark looks like a small DDP training script — initialize +a process group, spawn workers, each worker runs the same body. + + +Data-parallelism policy lives in a single object that names the +sharding strategy along the cube axis (replicate, row-wise, +column-wise) and along the PE axis (same set of values), and optionally +overrides the number of cubes or PEs participating. The policy is +intra-device — it does not cross SIP boundaries. SIP-level parallelism +is the launcher's responsibility, and the two axes compose +orthogonally. + + +A Megatron-style tensor-parallel API sits on top of the launcher and +the DP policy. Layer-level building blocks — column-parallel linear, +row-parallel linear, all-reduce — name their sharding intent in terms +the launcher and the placement policy can compose. This is the layer +that bench code typically writes against. + + +For collective operations the runtime exposes a PyTorch-compatible +distributed backend named "ahbm". On process-group initialization the +backend loads the configured collective-algorithm module, resolves +the world size (priority: explicit ccl.yaml override → defaults +section → topology SIP count), imports the algorithm module +dynamically, derives the SIP topology kind, and pushes the inter-PE +neighbor table to every participating PE. From that point on, an +all-reduce call dispatches the algorithm's kernel function across +all ranks. + + +A collective-algorithm module is a Python module with a small, fixed +contract. It exposes topology-kind integer constants, a name-to-kind +mapping for the YAML configuration, a kernel-arguments builder, and +a kernel function — the kernel function being aliased to the name +`kernel` so the backend can find it generically. 
The kernel itself +takes the tensor pointer, the per-cube element count, cube mesh +width and height, the world size, the current rank, and the SIP +topology dimensions; the backend appends those last four arguments +automatically. New collectives slot in by adding a new module that +follows this shape. + + +The combination is deliberate: bench authors get to write code that +looks like a regular distributed training script, while the launcher, +backend, and placement policies behind it remain free to redirect +work to the right SIP, cube, and PE without exposing topology to the +kernel. + +### IPCQ Direction Addressing + + +Inside a collective algorithm, peer PEs are named by direction — +"N", "S", "E", "W" for cube-internal neighbors, and "global_*" for +cross-SIP neighbors. Direction addressing is the addressing scheme: +the algorithm names a direction, the IPCQ neighbor table installed +at process-group time resolves the direction to the peer endpoint's +physical-address coordinates, and the PE_DMA performs the actual +transfer. The algorithm itself does not see PA arithmetic — direction +is the user-facing handle. + +### Intercube All-Reduce + + +The default all-reduce algorithm uses a center-rooted bidirectional +phase inside each SIP's cube mesh followed by an inter-SIP exchange +on the mesh's root cube, and then a bidirectional broadcast back +out. Center-rooting halves the in-cube hop count compared with a +corner-rooted walk. The inter-SIP exchange itself follows the +configured SIP topology — ring, torus, or non-wrapping mesh — +selected at runtime through the SIP-topology kind integer the +backend passes to the kernel. + +### Evaluation Harnesses + + +The all-reduce evaluation harness drives correctness and the +latency/buffer-kind sweeps through the public distributed path — +initialize process group, spawn workers, call all-reduce — rather +than the lower-level engine interface. 
A shared helper module factors +out the setup; sweep tests cover the buffer-kind tiers (TCM, SRAM, +HBM) and the inter-SIP topology variants. The plots produced by the +harness are part of its output contract; the harness regenerates them +on demand. + + +The GEMM evaluation harness is split into two layers. A heavy +shape-and-variant sweep lives as a manual script — it runs the same +composite-GEMM benchmark across many shapes and operand-staging +variants, harvests the resulting op-log, and writes a JSON summary. +A faster figure-generation layer lives in the test suite and consumes +that JSON to render plots. The split keeps the heavy data +generation explicit and out of the regular test path. + +### Bench Module Contract + + +Adding a new benchmark requires only dropping a file into the +benchmarks directory. The file registers one or more benchmark +functions through a small decorator that takes a kebab-case name and +a human-readable description. The decorator is the registration +mechanism — there is no separate manifest. Each benchmark function +takes one argument, conventionally named `torch`, which is the +runtime context exposing tensor allocation, kernel launch, +distributed APIs, and process-spawning. The function name is `run` by +convention. + + +A benchmark must submit at least one operation, or the runner +returns an error. A benchmark instance is single-device by default; +when a benchmark is collective, it uses the distributed-process-spawn +pattern internally — one worker greenlet per rank, with each worker +binding to its rank. Multi-device benchmark patterns outside that +shape are not supported. + +### Kernel-side `tl.*` API + + +Inside a kernel function, the `tl` argument exposes the kernel-side +API in a shape that mirrors the conventions of established +GPU-kernel languages. 
Categories: reference handles that name HBM +data without issuing DMA; data movement (load, store) that does +issue DMA; GEMM and math compute (dot, composite, the unary and +binary math operations, reductions); index and scalar helpers +(program identity, range-builders); metadata-only operations like +transpose; and the collective primitives (send, receive, +non-blocking receive). Tensor handles support arithmetic operators +via a thread-local active context so kernel code reads naturally. + + +The API supports two execution modes. A command-list mode records +operations into a list without consuming simulator time — useful for +inspection and lightweight tests. A greenlet-driven mode runs the +kernel as a child greenlet that switches back to the simulator on +each `tl.*` call; the simulator drives the event scheduler and hands +real data back to the kernel as DMA reads complete. The two modes +share the same surface; the kernel does not know which one it is +running under. + +### Probe Subcommand + + +The probe utility runs three families of traffic patterns through +the engine — host-to-device writes at increasing hop counts, +device-to-host reads at increasing hop counts, and PE-initiated DMA +across the cube mesh — and reports actual latency, the analytical +formula breakdown, effective bandwidth, bottleneck bandwidth, and +utilization. A fixed reference size is used for the summary table; +a separate utilization-versus-size sweep covers a logarithmic range +of transfer sizes. Each case runs in its own engine instance so +cases do not perturb each other. + + +The probe also checks a small set of invariants automatically: +monotonic latency increase with hop count, device-to-host latency +at least as large as host-to-device for the same hop count, and a +faster best-case path than worst-case for cross-cube PE DMA. Failures +print prominently. The output is meant for human reading; automated +parsing should not depend on column widths or whitespace. 
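The three probe invariants listed above can be sketched as a small checker. The function name, its arguments, and the failure strings are hypothetical, and treating "monotonic" as strictly increasing is an assumption.

```python
from __future__ import annotations


def check_probe_invariants(
    h2d_ns: list[float],
    d2h_ns: list[float],
    dma_best_ns: float,
    dma_worst_ns: float,
) -> list[str]:
    """Return human-readable failures; an empty list means all checks pass.

    h2d_ns[i] and d2h_ns[i] are the latencies measured at the i-th hop count.
    """
    failures: list[str] = []
    for label, series in (("host-to-device", h2d_ns), ("device-to-host", d2h_ns)):
        # Strict increase is an assumption; the text only says "monotonic".
        if any(later <= earlier for earlier, later in zip(series, series[1:])):
            failures.append(f"{label} latency is not increasing with hop count")
    if any(d2h < h2d for h2d, d2h in zip(h2d_ns, d2h_ns)):
        failures.append("device-to-host is faster than host-to-device at some hop count")
    if not dma_best_ns < dma_worst_ns:
        failures.append("best-case PE DMA is not faster than worst-case")
    return failures
```

Returning the failures as a list, rather than raising on the first one, matches the idea that the probe prints every violated invariant for a human reader.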
+ +--- + +This document summarizes 46 architecture decisions captured during +the first half of 2026. It is regenerated mechanically from the +decision corpus; sources are recorded in HTML comments throughout. diff --git a/tools/generate_adr_index.py b/tools/generate_adr_index.py new file mode 100644 index 0000000..5a7df60 --- /dev/null +++ b/tools/generate_adr_index.py @@ -0,0 +1,333 @@ +"""Generate docs/adr/INDEX.md (and docs/adr-ko/INDEX.md) from the ADR corpus. + +Auto-derives a section-based index following the same classification as +the /report skill — Design Principles / High-level Architecture / +Detailed Architecture (by component) / Implementation Decisions +(by topic). Run before publishing to refresh INDEX.md. + +The classification table below is the single source of truth. When a new +ADR is added under docs/adr/, append an entry to ``CLASSIFICATION``. The +script exits 1 if any ADR file is missing from the table or any title +cannot be parsed, so omissions surface in CI. + +Usage: + python tools/generate_adr_index.py [--root ] [--check] + + --check : exit 1 if the generated INDEX differs from the on-disk file + (used by CI to detect un-regenerated indexes). +""" + +from __future__ import annotations + +import argparse +import re +import sys +from pathlib import Path + +ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-([a-z0-9_-]+)\.md$") +# Title separator may be ":" (most ADRs) or "—" (em-dash; ADR-0033 uses +# this). The verifier (tools/verify_adr_lang_pairs.py) only checks the +# number, so both styles already coexist in the corpus. +TITLE_RE = re.compile(r"^# ADR-(\d{4})\s*[:—]\s*(.+?)\s*$") + +DESIGN_PRINCIPLES = "Design Principles" +HIGH_LEVEL = "High-level Architecture" +DETAILED = "Detailed Architecture" +IMPL_DECISIONS = "Implementation Decisions" + + +# (section, subgroup) per ADR. subgroup is used to sub-divide Detailed +# (by component, see DETAILED_COMPONENTS) and Implementation (by topic). +# Add a line here when introducing a new ADR. 
+CLASSIFICATION: dict[int, tuple[str, str | None]] = { + # Design Principles + 13: (DESIGN_PRINCIPLES, None), + 33: (DESIGN_PRINCIPLES, None), + + # High-level Architecture + 3: (HIGH_LEVEL, "System hierarchy (Tray / SIP / CUBE / PE)"), + 7: (HIGH_LEVEL, "Runtime API ↔ sim_engine boundaries"), + 16: (HIGH_LEVEL, "IOChiplet NOC and memory data path"), + 17: (HIGH_LEVEL, "Cube NOC and HBM connectivity"), + + # Detailed Architecture (subgroup matches DETAILED_COMPONENTS entries) + 14: (DETAILED, "pe_pipeline"), # covers pe_cpu/pe_dma/pe_fetch_store/pe_gemm/pe_math/pe_scheduler + 23: (DETAILED, "pe_ipcq"), + 34: (DETAILED, "hbm_ctrl"), + 35: (DETAILED, "m_cpu"), + 36: (DETAILED, "io_cpu"), + 37: (DETAILED, "forwarding"), + 38: (DETAILED, "pcie_ep"), + 39: (DETAILED, "pe_mmu"), + 40: (DETAILED, "pe_tcm"), + 41: (DETAILED, "sram"), + 42: (DETAILED, "tiling"), + + # Implementation Decisions + 1: (IMPL_DECISIONS, "Address Scheme"), + 2: (IMPL_DECISIONS, "Routing & Helper API"), + 4: (IMPL_DECISIONS, "Memory Semantics & Local-HBM Bandwidth"), + 5: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"), + 6: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"), + 8: (IMPL_DECISIONS, "Tensor Deployment and Allocation"), + 9: (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"), + 10: (IMPL_DECISIONS, "CLI Surface and Semantics"), + 11: (IMPL_DECISIONS, "Address Scheme"), + 12: (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"), + 15: (IMPL_DECISIONS, "Component Port/Wire Fabric Model"), + 20: (IMPL_DECISIONS, "Two-Pass Data Execution"), + 22: (IMPL_DECISIONS, "2D Grid Program Identity"), + 24: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"), + 25: (IMPL_DECISIONS, "IPCQ Direction Addressing"), + 26: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"), + 27: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"), + 32: (IMPL_DECISIONS, 
"Intercube All-Reduce"), + 43: (IMPL_DECISIONS, "Evaluation Harnesses"), + 44: (IMPL_DECISIONS, "Evaluation Harnesses"), + 45: (IMPL_DECISIONS, "Bench Module Contract"), + 46: (IMPL_DECISIONS, "Kernel-side tl.* API (TLContext)"), + 47: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"), + 48: (IMPL_DECISIONS, "Memory Allocator Algorithms"), + 49: (IMPL_DECISIONS, "Probe Subcommand"), + 50: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"), + 51: (IMPL_DECISIONS, "Routing & Helper API"), + 52: (IMPL_DECISIONS, "Sim-engine Op Log and Memory Store Schemas"), + 53: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"), +} + +# Canonical component order for the Detailed Architecture section. +# Each entry: (component_name, list[ADR-numbers that cover it]). +# Order matches src/kernbench/components/builtin/*.py alphabetical +# (the same order /report uses). +DETAILED_COMPONENTS: list[tuple[str, list[int]]] = [ + ("forwarding", [37]), + ("hbm_ctrl", [34]), + ("io_cpu", [36]), + ("m_cpu", [35]), + ("pcie_ep", [38]), + ("pe_cpu", [14]), + ("pe_dma", [14, 23]), + ("pe_fetch_store", [14]), + ("pe_gemm", [14]), + ("pe_ipcq", [23]), + ("pe_math", [14]), + ("pe_mmu", [39]), + ("pe_scheduler", [14]), + ("pe_tcm", [40]), + ("sram", [41]), + ("tiling", [42]), +] + + +def _strip_bom(text: str) -> str: + """Strip leading UTF-8 BOM if present.""" + if text and ord(text[0]) == 0xFEFF: + return text[1:] + return text + + +def _find_adrs(adr_dir: Path) -> list[tuple[int, str, Path]]: + """Return [(num, slug, path), ...] 
for ADR files in adr_dir, sorted by num.""" + out: list[tuple[int, str, Path]] = [] + for p in sorted(adr_dir.iterdir()): + if not p.is_file(): + continue + m = ADR_FILENAME_RE.match(p.name) + if not m: + continue + out.append((int(m.group(1)), m.group(2), p)) + out.sort(key=lambda t: t[0]) + return out + + +def _extract_title(path: Path) -> str: + """Parse the title from the first line `# ADR-NNNN: `. Strips BOM.""" + text = _strip_bom(path.read_text(encoding="utf-8")) + first_line = text.split("\n", 1)[0] if text else "" + m = TITLE_RE.match(first_line) + if not m: + raise ValueError( + f"{path.name}: cannot parse title from first line: {first_line!r}" + ) + return m.group(2) + + +def _build_index(adr_dir: Path, link_prefix: str) -> str: + """Build the INDEX.md text for adr_dir. + + link_prefix is the relative href used for ADR links (e.g., ``./`` + so links resolve relative to the INDEX file location). + """ + adrs = _find_adrs(adr_dir) + if not adrs: + raise RuntimeError(f"No ADR files found under {adr_dir}") + + # Validate every ADR is classified. + missing = sorted(num for num, _slug, _ in adrs if num not in CLASSIFICATION) + if missing: + raise RuntimeError( + "ADR(s) missing from CLASSIFICATION table in " + "tools/generate_adr_index.py: " + + ", ".join(f"ADR-{n:04d}" for n in missing) + + ". Add an entry for each." + ) + + # Map: num → (filename, title) + num_to_meta: dict[int, tuple[str, str]] = {} + for num, _slug, path in adrs: + num_to_meta[num] = (path.name, _extract_title(path)) + + # ── Section assembly ──────────────────────────────────────────── + lines: list[str] = [] + lines.append("# ADR Index") + lines.append("") + lines.append( + f"Auto-generated by `tools/generate_adr_index.py`. " + f"Total ADRs: **{len(adrs)}**." + ) + lines.append("") + lines.append( + "Classification mirrors the `/report` skill's section assignment. " + "When adding a new ADR, also add an entry to the " + "`CLASSIFICATION` table in `tools/generate_adr_index.py`." 
+ ) + lines.append("") + + def fmt_entry(num: int) -> str: + fname, title = num_to_meta[num] + return f"- [ADR-{num:04d}]({link_prefix}{fname}) — {title}" + + # Design Principles + lines.append("## Design Principles") + lines.append("") + nums = sorted(n for n, (sec, _) in CLASSIFICATION.items() + if sec == DESIGN_PRINCIPLES and n in num_to_meta) + for n in nums: + lines.append(fmt_entry(n)) + lines.append("") + + # High-level Architecture (sorted by ADR number) + lines.append("## High-level Architecture") + lines.append("") + nums = sorted(n for n, (sec, _) in CLASSIFICATION.items() + if sec == HIGH_LEVEL and n in num_to_meta) + for n in nums: + sub = CLASSIFICATION[n][1] or "" + fname, title = num_to_meta[n] + if sub: + lines.append( + f"- [ADR-{n:04d}]({link_prefix}{fname}) — {title}" + f" _({sub})_" + ) + else: + lines.append(fmt_entry(n)) + lines.append("") + + # Detailed Architecture (canonical component order) + lines.append("## Detailed Architecture") + lines.append("") + lines.append("One subsection per component file under `src/kernbench/components/builtin/`.") + lines.append("") + for comp, adr_nums in DETAILED_COMPONENTS: + lines.append(f"### {comp}") + lines.append("") + if adr_nums: + for n in adr_nums: + if n not in num_to_meta: + raise RuntimeError( + f"DETAILED_COMPONENTS references ADR-{n:04d} for " + f"'{comp}' but no such ADR file exists." + ) + lines.append(fmt_entry(n)) + else: + lines.append("_(no ADR coverage)_") + lines.append("") + + # Implementation Decisions: group by subgroup (topic); topics are emitted in order of their smallest ADR number.
+ lines.append("## Implementation Decisions") + lines.append("") + topic_order: list[str] = [] + topic_to_nums: dict[str, list[int]] = {} + for n, (sec, sub) in CLASSIFICATION.items(): + if sec != IMPL_DECISIONS or n not in num_to_meta: + continue + topic = sub or "Uncategorized" + if topic not in topic_to_nums: + topic_order.append(topic) + topic_to_nums[topic] = [] + topic_to_nums[topic].append(n) + # Stable order: by smallest ADR-number in topic, so older infra appears first. + topic_order.sort(key=lambda t: min(topic_to_nums[t])) + for topic in topic_order: + lines.append(f"### {topic}") + lines.append("") + for n in sorted(topic_to_nums[topic]): + lines.append(fmt_entry(n)) + lines.append("") + + return "\n".join(lines).rstrip() + "\n" + + +def _check_or_write(path: Path, content: str, check: bool) -> bool: + """Write content to path, or compare in --check mode. Returns True on diff.""" + existing = path.read_text(encoding="utf-8") if path.exists() else "" + if check: + if existing != content: + print(f"[diff] {path} would change.") + return True + return False + path.write_text(content, encoding="utf-8") + if existing != content: + print(f"[wrote] {path}") + else: + print(f"[unchanged] {path}") + return False + + +def main(argv: list[str] | None = None) -> int: + p = argparse.ArgumentParser(description=__doc__) + p.add_argument( + "--root", type=Path, default=Path.cwd(), + help="Repository root (default: cwd)", + ) + p.add_argument( + "--check", action="store_true", + help="Exit 1 if generated INDEX would differ from disk", + ) + args = p.parse_args(argv) + + en_dir = args.root / "docs" / "adr" + ko_dir = args.root / "docs" / "adr-ko" + + if not en_dir.is_dir(): + print(f"error: {en_dir} does not exist", file=sys.stderr) + return 1 + + any_diff = False + try: + en_index = _build_index(en_dir, link_prefix="./") + except (RuntimeError, ValueError) as e: + print(f"error (EN): {e}", file=sys.stderr) + return 1 + any_diff |= _check_or_write(en_dir / "INDEX.md", 
en_index, args.check) + + if ko_dir.is_dir(): + try: + ko_index = _build_index(ko_dir, link_prefix="./") + except (RuntimeError, ValueError) as e: + print(f"error (KO): {e}", file=sys.stderr) + return 1 + any_diff |= _check_or_write(ko_dir / "INDEX.md", ko_index, args.check) + + if args.check and any_diff: + print( + "INDEX.md is out of date. " + "Run `python tools/generate_adr_index.py` to refresh.", + file=sys.stderr, + ) + return 1 + return 0 + + +if __name__ == "__main__": + sys.exit(main())