cc1bbd0ab7
Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/
into two self-contained eval benches so a user can regenerate every
result + figure with one command per domain:
kernbench run --bench milestone-1h-gemm (MILESTONE_FAST=1 reuses JSON)
kernbench run --bench milestone-1h-ccl
- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
run(torch) entry drives the sweeps and writes figures into
benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
re-export/wrapper shims over the benches (single source preserved); the
pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).
ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated. Verified: milestone benches run
clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK.
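The eval-bench pattern described above (one entry point that drives many configs, caches results as JSON, and reuses them when MILESTONE_FAST=1) can be sketched roughly as follows. This is a minimal illustration, not the code in benches/milestone_1h_*.py: sweep_one, run_sweep, the results.json file name, and the config dictionaries are all hypothetical, and the figure rendering and sentinel-tensor submission are left out because their APIs are not shown here.

```python
import json
import os
import tempfile
from pathlib import Path


def sweep_one(cfg):
    """Hypothetical stand-in for building a per-config engine and running it."""
    m, n, k = cfg["m"], cfg["n"], cfg["k"]
    return {"m": m, "n": n, "k": k, "flops": 2 * m * n * k}


def run_sweep(out_dir, configs, fast=False):
    """Run every config, or reuse cached JSON when fast=True and a cache exists."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cache = out_dir / "results.json"  # hypothetical file name
    if fast and cache.exists():
        return json.loads(cache.read_text())
    results = [sweep_one(cfg) for cfg in configs]
    cache.write_text(json.dumps(results, indent=2))
    return results


def run(torch=None, out_dir=None):
    """Bench entry point (shape assumed from the run(torch) wording above).

    Figure rendering and the sentinel-tensor submission are omitted on purpose.
    """
    fast = os.environ.get("MILESTONE_FAST") == "1"
    out_dir = out_dir or tempfile.mkdtemp(prefix="milestone_demo_")
    configs = [{"m": 64, "n": 64, "k": 64}, {"m": 128, "n": 128, "k": 64}]
    return run_sweep(out_dir, configs, fast=fast)
```

Keeping the sweep behind a single function that owns its cache file is what would let thin shims in tests/ and scripts/ re-export it without duplicating logic.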
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR Index
Auto-generated by tools/generate_adr_index.py. Total ADRs: 47.
Classification mirrors the /report skill's section assignment. When adding a new ADR, also add an entry to the CLASSIFICATION table in tools/generate_adr_index.py.
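As a rough illustration of why a new ADR needs a CLASSIFICATION entry, the sketch below groups ADRs by section and refuses to render when one is unclassified. The real table in tools/generate_adr_index.py may be shaped differently: the dictionary layout, the TITLES mapping, and both function names are assumptions made for this example.

```python
# Hypothetical shape: ADR number -> (section, subsection or None).
CLASSIFICATION = {
    "0013": ("Design Principles", None),
    "0034": ("Detailed Architecture", "hbm_ctrl"),
    "0054": ("Implementation Decisions", "Evaluation Harnesses"),
}

# Titles as they appear in this index (a small subset).
TITLES = {
    "0013": "Verification Strategy and Phase 1 Test Plan",
    "0034": "HBM Controller Internal Design",
    "0054": "Milestone Eval Benches — self-contained sweep + figure benches",
}


def unclassified(titles, classification):
    """Return ADR numbers that have a title but no classification entry."""
    return sorted(num for num in titles if num not in classification)


def render_index(titles, classification):
    """Group ADR lines under their section; fail loudly on unclassified ADRs."""
    missing = unclassified(titles, classification)
    if missing:
        raise ValueError(f"ADRs missing from CLASSIFICATION: {missing}")
    groups = {}
    for num in sorted(titles):
        groups.setdefault(classification[num], []).append(
            f"- ADR-{num} — {titles[num]}"
        )
    lines = []
    for (section, sub), entries in groups.items():
        lines.append(section if sub is None else f"{section} / {sub}")
        lines.extend(entries)
    return "\n".join(lines)
```

Failing on an unclassified ADR, rather than silently dropping it, is one way a generator could keep the "Total ADRs" count and the listed entries in agreement.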
Design Principles
- ADR-0013 — Verification Strategy and Phase 1 Test Plan
- ADR-0033 — Latency Model: Assumptions and Known Simplifications
High-level Architecture
- ADR-0003 — Target System Hierarchy & Modeling Scope (System hierarchy (Tray / SIP / CUBE / PE))
- ADR-0007 — Runtime API and Simulation Engine Boundaries (Runtime API ↔ sim_engine boundaries)
- ADR-0016 — IOChiplet NOC and Memory Data Path (IOChiplet NOC and memory data path)
- ADR-0017 — Cube NOC and HBM Connectivity (Cube NOC and HBM connectivity)
Detailed Architecture
One subsection per component file under src/kernbench/components/builtin/.
forwarding
- ADR-0037 — Forwarding Component (forwarding_v1)
hbm_ctrl
- ADR-0034 — HBM Controller Internal Design
io_cpu
- ADR-0036 — IO_CPU Component Model
m_cpu
- ADR-0035 — M_CPU and M_CPU.DMA Component Model
pcie_ep
- ADR-0038 — PCIE_EP Component Model
pe_cpu
- ADR-0014 — PE Pipeline Execution Model
pe_dma
pe_fetch_store
- ADR-0014 — PE Pipeline Execution Model
pe_gemm
- ADR-0014 — PE Pipeline Execution Model
pe_ipcq
- ADR-0023 — PE-level IPCQ — Inter-PE Collective Communication
pe_math
- ADR-0014 — PE Pipeline Execution Model
pe_mmu
- ADR-0039 — PE_MMU Component Model — Component + Utility Dual Role
pe_scheduler
- ADR-0014 — PE Pipeline Execution Model
pe_tcm
- ADR-0040 — PE_TCM Component Model — Dual-Channel BW Serialization
sram
- ADR-0041 — Cube SRAM Component Model — terminal scratchpad on cube NoC
tiling
- ADR-0042 — Tile Plan Generators — GEMM/Math Pipeline Plan Builders
Implementation Decisions
Address Scheme
- ADR-0001 — 51-bit Physical Address Layout & Decoding Contract
- ADR-0011 — Memory Addressing — PA / VA / LA Address Models
Routing & Helper API
- ADR-0002 — Routing Distance, Ordering & Bypass Rules
- ADR-0051 — Routing Helper API — AddressResolver + PathRouter
Memory Semantics & Local-HBM Bandwidth
- ADR-0004 — Memory Semantics & Local-HBM Bandwidth Guarantee
Topology Compilation, Diagrams & Builder Algorithms
- ADR-0005 — Diagram Views & Distance-Aware Layout Rules
- ADR-0006 — Topology Compilation, Distance Extraction, and Automatic Diagram Generation
- ADR-0053 — Topology Builder + Visualizer Algorithms
Tensor Deployment and Allocation
- ADR-0008 — Tensor Deployment and Allocation (Host Allocator, PA-first)
Kernel Execution and Host-Device Messaging
- ADR-0009 — Kernel Execution Messaging and Completion Semantics
- ADR-0012 — Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
CLI Surface and Semantics
- ADR-0010 — Command Line Interface and Execution Semantics
Component Port/Wire Fabric Model
- ADR-0015 — Component Port/Wire Model and Fabric Routing
Two-Pass Data Execution
- ADR-0020 — 2-Pass Data Execution Model (Timing / Data Separation)
2D Grid Program Identity
- ADR-0022 — 2D Grid program_id Semantics
Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)
- ADR-0024 — SIP-level Launcher — rank = SIP
- ADR-0026 — DPPolicy = Intra-Device Only — remove sip/num_sips fields
- ADR-0027 — Megatron-style Tensor Parallelism API
- ADR-0047 — AHBM CCL Backend — torch.distributed-compat shim
- ADR-0050 — CCL Algorithm Module Contract — ccl/algorithms/*.py
IPCQ Direction Addressing
- ADR-0025 — IPCQ Direction Addressing — address-based matching
Intercube All-Reduce
- ADR-0032 — Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
Evaluation Harnesses
- ADR-0043 — Allreduce Evaluation Harness — tests/sccl/
- ADR-0044 — GEMM Evaluation Harness — scripts/gemm_sweep.py + tests/gemm/
- ADR-0054 — Milestone Eval Benches — self-contained sweep + figure benches
Bench Module Contract
- ADR-0045 — Bench Module Contract — registration, dispatch, and authoring
Kernel-side tl.* API (TLContext)
- ADR-0046 — TLContext — Kernel-side tl.* API Contract
Memory Allocator Algorithms
- ADR-0048 — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
Probe Subcommand
- ADR-0049 — kernbench probe Subcommand — Traffic-Pattern Verification Harness
Sim-engine Op Log and Memory Store Schemas
- ADR-0052 — OpLog + MemoryStore Schemas — sim_engine internals