cc1bbd0ab7
Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/
into two self-contained eval benches so a user can regenerate every
result + figure with one command per domain:
    kernbench run --bench milestone-1h-gemm   (MILESTONE_FAST=1 reuses JSON)
    kernbench run --bench milestone-1h-ccl
- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
run(torch) entry drives the sweeps and writes figures into
benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
re-export/wrapper shims over the benches (single source preserved); the
pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).
ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated. Verified: milestone benches run
clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
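The eval-bench pattern described in the commit message (one entry point that drives many configs, writes artifacts, then submits a sentinel tensor) might look roughly like the sketch below. It is not the code in benches/milestone_1h_gemm.py. The helper names `simulate_config`, `run_sweep`, and `StubTorch`, the `results.json` file name, and the extra `out_dir` parameter on `run` are all hypothetical. The reading of MILESTONE_FAST=1 as "load the JSON written by a previous sweep" is also an assumption.

```python
import json
import os
from pathlib import Path


def simulate_config(m, n, k):
    # Hypothetical stand-in for building a per-config engine and running it.
    return {"m": m, "n": n, "k": k, "flops": 2 * m * n * k}


def run_sweep(configs, out_dir, fast=False):
    """Run every config and cache the results as JSON under out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cache = out_dir / "results.json"  # hypothetical artifact name
    if fast and cache.exists():
        # Assumed meaning of fast mode: reuse the earlier sweep's JSON.
        return json.loads(cache.read_text()), True
    results = [simulate_config(*cfg) for cfg in configs]
    cache.write_text(json.dumps(results, indent=2))
    return results, False


def run(torch, out_dir="benches/1H_milestone_output/gemm"):
    """Bench entry: sweep, (render figures), then hand back a sentinel tensor."""
    fast = os.environ.get("MILESTONE_FAST") == "1"
    configs = [(64, 64, 64), (128, 128, 128)]
    results, reused = run_sweep(configs, out_dir, fast)
    # Figure rendering would happen here; it is omitted from this sketch.
    sentinel = torch.zeros(1)  # placeholder value for the bench contract
    return results, reused, sentinel


class StubTorch:
    """Hypothetical stand-in so the sketch runs without a real torch module."""

    @staticmethod
    def zeros(n):
        return [0.0] * n
```

Calling `run(StubTorch, out_dir=some_tmp_dir)` twice, the second time with MILESTONE_FAST=1 set, exercises both the full sweep and the cached path. The real bench receives the `torch` object from the kernbench runner rather than a stub.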
ADR Index
Auto-generated by tools/generate_adr_index.py. Total ADRs: 47.
Classification mirrors the /report skill's section assignment. When adding a new ADR, also add an entry to the CLASSIFICATION table in tools/generate_adr_index.py.
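As an illustration of the "add an entry to the CLASSIFICATION table" step, the sketch below shows one plausible shape for such a table and a check that flags ADRs missing from it. The real structure in tools/generate_adr_index.py is not shown here, so the dict layout and the `unclassified` and `group_by_section` helpers are assumptions made for the example.

```python
# Assumed shape: ADR number -> (top-level section, subsection).
CLASSIFICATION = {
    "0043": ("Implementation Decisions", "Evaluation Harnesses"),
    "0044": ("Implementation Decisions", "Evaluation Harnesses"),
    "0054": ("Implementation Decisions", "Evaluation Harnesses"),
}


def unclassified(adr_ids, table=CLASSIFICATION):
    """Return the ADR ids that have no entry in the table, sorted."""
    return sorted(a for a in adr_ids if a not in table)


def group_by_section(table=CLASSIFICATION):
    """Group ADR ids by their (section, subsection) key, in id order."""
    grouped = {}
    for adr_id in sorted(table):
        grouped.setdefault(table[adr_id], []).append(adr_id)
    return grouped
```

A generator built this way could refuse to emit the index while `unclassified(...)` is non-empty, which would catch a new ADR that was never assigned a section.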
Design Principles
High-level Architecture
- ADR-0003 — Target system hierarchy and modeling scope (System hierarchy (Tray / SIP / CUBE / PE))
- ADR-0007 — Runtime API and simulation engine boundaries (Runtime API ↔ sim_engine boundaries)
- ADR-0016 — IOChiplet NoC and memory data path
- ADR-0017 — Cube NoC and HBM connectivity
Detailed Architecture
One subsection per component file under src/kernbench/components/builtin/.
forwarding
- ADR-0037 — Forwarding component (forwarding_v1)
hbm_ctrl
- ADR-0034 — HBM controller internal design
io_cpu
- ADR-0036 — IO_CPU component model
m_cpu
- ADR-0035 — M_CPU and M_CPU.DMA component model
pcie_ep
- ADR-0038 — PCIE_EP Component Model
pe_cpu
- ADR-0014 — PE pipeline execution model
pe_dma
pe_fetch_store
- ADR-0014 — PE pipeline execution model
pe_gemm
- ADR-0014 — PE pipeline execution model
pe_ipcq
- ADR-0023 — PE-level IPCQ — Inter-PE Collective Communication
pe_math
- ADR-0014 — PE pipeline execution model
pe_mmu
- ADR-0039 — PE_MMU Component Model — component + utility dual role
pe_scheduler
- ADR-0014 — PE pipeline execution model
pe_tcm
- ADR-0040 — PE_TCM Component Model — dual-channel BW serialization
sram
- ADR-0041 — Cube SRAM Component Model — terminal scratchpad on cube NoC
tiling
- ADR-0042 — Tile Plan Generators — GEMM/Math pipeline plan builders
Implementation Decisions
Address Scheme
Routing & Helper API
Memory Semantics & Local-HBM Bandwidth
- ADR-0004 — Memory semantics and local HBM bandwidth guarantee
Topology Compilation, Diagrams & Builder Algorithms
- ADR-0005 — Diagram views and distance-based layout rules
- ADR-0006 — Topology compilation, distance extraction, and automatic diagram generation
- ADR-0053 — Topology Builder + Visualizer Algorithms
Tensor Deployment and Allocation
- ADR-0008 — Tensor deployment and allocation (host allocator, PA-first)
Kernel Execution and Host-Device Messaging
CLI Surface and Semantics
- ADR-0010 — Command-line interface and execution semantics
Component Port/Wire Fabric Model
- ADR-0015 — Component port/wire model and fabric routing
Two-Pass Data Execution
- ADR-0020 — 2-pass data execution model (timing / data separation)
2D Grid Program Identity
- ADR-0022 — 2D grid program_id semantics
Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)
- ADR-0024 — SIP-level Launcher — rank = SIP
- ADR-0026 — DPPolicy = Intra-Device Only — sip/num_sips fields removed
- ADR-0027 — Megatron-style Tensor Parallelism API
- ADR-0047 — AHBM CCL Backend — torch.distributed-compat shim
- ADR-0050 — CCL Algorithm Module Contract — ccl/algorithms/*.py
IPCQ Direction Addressing
- ADR-0025 — IPCQ Direction Addressing — address-based matching
Intercube All-Reduce
- ADR-0032 — Inter-cube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
Evaluation Harnesses
- ADR-0043 — Allreduce evaluation harness — tests/sccl/
- ADR-0044 — GEMM evaluation harness — scripts/gemm_sweep.py + tests/gemm/
- ADR-0054 — Milestone evaluation bench — self-contained sweep + figure bench
Bench Module Contract
- ADR-0045 — Bench Module Contract — registration, dispatch, and authoring
Kernel-side tl.* API (TLContext)
- ADR-0046 — TLContext — Kernel-side tl.* API Contract
Memory Allocator Algorithms
- ADR-0048 — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
Probe Subcommand
- ADR-0049 — kernbench probe Subcommand — Traffic-Pattern Verification Harness
Sim-engine Op Log and Memory Store Schemas
- ADR-0052 — OpLog + MemoryStore Schemas — sim_engine internals