kernbench2/docs/adr/INDEX.md
mukesh cc1bbd0ab7 eval: fold GEMM/allreduce harnesses into self-contained milestone benches
Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/
into two self-contained eval benches so a user can regenerate every
result + figure with one command:

  kernbench run --bench milestone-1h-gemm   (MILESTONE_FAST=1 reuses JSON)
  kernbench run --bench milestone-1h-ccl

- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
  run(torch) entry drives the sweeps and writes figures into
  benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
  sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
  re-export/wrapper shims over the benches (single source preserved); the
  pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
  per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).

ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated. Verified: milestone benches run
clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 15:19:52 -07:00


ADR Index

Auto-generated by tools/generate_adr_index.py. Total ADRs: 47.

Classification mirrors the /report skill's section assignment. When adding a new ADR, also add an entry to the CLASSIFICATION table in tools/generate_adr_index.py.
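This index does not show the actual format of the CLASSIFICATION table in tools/generate_adr_index.py. The following is a minimal sketch that assumes the table maps each ADR number to one or more (section, subsection) placements. Under that assumption, a generator could group the entries and render output shaped like the lists below. The table layout, `TITLES`, `SECTION_ORDER`, and `render_index` are all hypothetical:

```python
from collections import defaultdict

# Hypothetical stand-in for the CLASSIFICATION table in
# tools/generate_adr_index.py: ADR number -> list of (section, subsection).
# subsection=None means the ADR sits directly under the section heading.
# One ADR may be listed in several subsections, as ADR-0014 is.
CLASSIFICATION = {
    13: [("Design Principles", None)],
    33: [("Design Principles", None)],
    14: [("Detailed Architecture", "pe_cpu"), ("Detailed Architecture", "pe_dma")],
    34: [("Detailed Architecture", "hbm_ctrl")],
}

# Hypothetical titles; a real generator would read them from the ADR files.
TITLES = {
    13: "Verification Strategy and Phase 1 Test Plan",
    33: "Latency Model: Assumptions and Known Simplifications",
    14: "PE Pipeline Execution Model",
    34: "HBM Controller Internal Design",
}

SECTION_ORDER = ["Design Principles", "Detailed Architecture"]
BULLET, DASH = "\u2022", "\u2014"  # bullet and em dash, as in the rendered index


def render_index(classification, titles):
    """Group ADRs by section and subsection, then render a plain-text index."""
    grouped = defaultdict(lambda: defaultdict(list))
    for number, placements in classification.items():
        for section, subsection in placements:
            grouped[section][subsection].append(number)

    lines = [f"Total ADRs: {len(titles)}."]
    for section in SECTION_ORDER:
        if section not in grouped:
            continue
        lines += ["", section]
        # Subsections render alphabetically; None (no subsection) comes first.
        for subsection in sorted(grouped[section], key=lambda s: (s is not None, s or "")):
            if subsection is not None:
                lines += ["", subsection]
            for number in sorted(grouped[section][subsection]):
                lines.append(f"  {BULLET} ADR-{number:04d} {DASH} {titles[number]}")
    return "\n".join(lines)


print(render_index(CLASSIFICATION, TITLES))
```

Mapping each ADR to a list of placements lets one ADR, such as ADR-0014, appear under several component subsections without duplicating its title in the table.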

Design Principles

  • ADR-0013 — Verification Strategy and Phase 1 Test Plan
  • ADR-0033 — Latency Model: Assumptions and Known Simplifications

High-level Architecture

  • ADR-0003 — Target System Hierarchy & Modeling Scope (Tray / SIP / CUBE / PE)
  • ADR-0007 — Runtime API and Simulation Engine Boundaries (Runtime API ↔ sim_engine boundaries)
  • ADR-0016 — IOChiplet NOC and Memory Data Path
  • ADR-0017 — Cube NOC and HBM Connectivity

Detailed Architecture

One subsection per component file under src/kernbench/components/builtin/.

forwarding

  • ADR-0037 — Forwarding Component (forwarding_v1)

hbm_ctrl

  • ADR-0034 — HBM Controller Internal Design

io_cpu

m_cpu

  • ADR-0035 — M_CPU and M_CPU.DMA Component Model

pcie_ep

pe_cpu

  • ADR-0014 — PE Pipeline Execution Model

pe_dma

  • ADR-0014 — PE Pipeline Execution Model
  • ADR-0023 — PE-level IPCQ — Inter-PE Collective Communication

pe_fetch_store

  • ADR-0014 — PE Pipeline Execution Model

pe_gemm

  • ADR-0014 — PE Pipeline Execution Model

pe_ipcq

  • ADR-0023 — PE-level IPCQ — Inter-PE Collective Communication

pe_math

  • ADR-0014 — PE Pipeline Execution Model

pe_mmu

  • ADR-0039 — PE_MMU Component Model — Component + Utility Dual Role

pe_scheduler

  • ADR-0014 — PE Pipeline Execution Model

pe_tcm

  • ADR-0040 — PE_TCM Component Model — Dual-Channel BW Serialization

sram

  • ADR-0041 — Cube SRAM Component Model — terminal scratchpad on cube NoC

tiling

  • ADR-0042 — Tile Plan Generators — GEMM/Math Pipeline Plan Builders

Implementation Decisions

Address Scheme

  • ADR-0001 — 51-bit Physical Address Layout & Decoding Contract
  • ADR-0011 — Memory Addressing — PA / VA / LA Address Models

Routing & Helper API

  • ADR-0002 — Routing Distance, Ordering & Bypass Rules
  • ADR-0051 — Routing Helper API — AddressResolver + PathRouter

Memory Semantics & Local-HBM Bandwidth

  • ADR-0004 — Memory Semantics & Local-HBM Bandwidth Guarantee

Topology Compilation, Diagrams & Builder Algorithms

  • ADR-0005 — Diagram Views & Distance-Aware Layout Rules
  • ADR-0006 — Topology Compilation, Distance Extraction, and Automatic Diagram Generation
  • ADR-0053 — Topology Builder + Visualizer Algorithms

Tensor Deployment and Allocation

  • ADR-0008 — Tensor Deployment and Allocation (Host Allocator, PA-first)

Kernel Execution and Host-Device Messaging

  • ADR-0009 — Kernel Execution Messaging and Completion Semantics
  • ADR-0012 — Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)

CLI Surface and Semantics

  • ADR-0010 — Command Line Interface and Execution Semantics

Component Port/Wire Fabric Model

  • ADR-0015 — Component Port/Wire Model and Fabric Routing

Two-Pass Data Execution

  • ADR-0020 — 2-Pass Data Execution Model (Timing / Data Separation)

2D Grid Program Identity

  • ADR-0022 — 2D Grid program_id Semantics

Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)

  • ADR-0024 — SIP-level Launcher — rank = SIP
  • ADR-0026 — DPPolicy = Intra-Device Only — remove sip/num_sips fields
  • ADR-0027 — Megatron-style Tensor Parallelism API
  • ADR-0047 — AHBM CCL Backend — torch.distributed-compat shim
  • ADR-0050 — CCL Algorithm Module Contract — ccl/algorithms/*.py

IPCQ Direction Addressing

  • ADR-0025 — IPCQ Direction Addressing — address-based matching

Intercube All-Reduce

  • ADR-0032 — Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange

Evaluation Harnesses

  • ADR-0043 — Allreduce Evaluation Harness — tests/sccl/
  • ADR-0044 — GEMM Evaluation Harness — scripts/gemm_sweep.py + tests/gemm/
  • ADR-0054 — Milestone Eval Benches — self-contained sweep + figure benches
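The ADR-0054 benches regenerate every result and figure with a single `kernbench run --bench ...` command and write their output into benches/1H_milestone_output/{gemm,ccl}/. Setting `MILESTONE_FAST=1` reuses previously written JSON instead of re-running the sweep. The following is a minimal sketch of that reuse switch, assuming each sweep caches one JSON file. `load_or_sweep` and the file naming are hypothetical and are not taken from the benches' actual code:

```python
import json
import os
from pathlib import Path


def load_or_sweep(out_dir, name, sweep):
    """Return sweep results, reusing <out_dir>/<name>.json when MILESTONE_FAST=1.

    `sweep` is any zero-argument callable returning JSON-serializable data.
    Without MILESTONE_FAST=1 (or with no cached file) the sweep always runs
    and its results overwrite the cache.
    """
    cache = Path(out_dir) / f"{name}.json"
    if os.environ.get("MILESTONE_FAST") == "1" and cache.exists():
        # Fast path: skip the (expensive) sweep and reuse the last results.
        return json.loads(cache.read_text())
    results = sweep()
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_text(json.dumps(results, indent=2))
    return results
```

Keeping the cache check in one helper allows figure rendering to be iterated on without re-running the sweeps. Deleting the JSON file or unsetting `MILESTONE_FAST` forces a full regeneration.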

Bench Module Contract

  • ADR-0045 — Bench Module Contract — registration, dispatch, and authoring

Kernel-side tl.* API (TLContext)

  • ADR-0046 — TLContext — Kernel-side tl.* API Contract

Memory Allocator Algorithms

  • ADR-0048 — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator

Probe Subcommand

  • ADR-0049 — kernbench probe Subcommand — Traffic-Pattern Verification Harness

Sim-engine Op Log and Memory Store Schemas

  • ADR-0052 — OpLog + MemoryStore Schemas — sim_engine internals