Files
kernbench2/README.md
ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00

161 lines
5.7 KiB
Markdown

# kernbench
A discrete-event simulator for AI accelerator hardware, built on [SimPy](https://simpy.readthedocs.io/).
It models the full data path — from host PCIe injection through IO chiplet, NOC mesh,
crossbar, and HBM — to measure end-to-end latency with contention and queueing.
## Architecture
```text
Host (CLI)
|
+-- kernbench run -> run a benchmark (QKV GEMM, AllReduce, ...)
+-- kernbench probe -> latency/BW analysis for predefined traffic patterns
|
v
+---------------------------------------------------+
| Runtime API (runtime_api/) |
| MemoryWriteMsg, MemoryReadMsg, PeDmaMsg, |
| KernelLaunchMsg |
+---------------------------------------------------+
| Simulation Engine (sim_engine/) |
| SimPy processes, wire model, BW occupancy |
+---------------------------------------------------+
| Components (components/) |
| pcie_ep, io_cpu, m_cpu, noc, xbar, hbm_ctrl, |
| pe_cpu, pe_dma, pe_gemm, pe_math, pe_tcm, ... |
+---------------------------------------------------+
| Topology (topology/) |
| YAML-driven graph: 4x4 cube mesh, UCIe links, |
| IO chiplet with NOC, HBM slices |
+---------------------------------------------------+
```
## Prerequisites
- Python 3.10+
- Dependencies: `simpy`, `pyyaml`, `pytest`
## Installation
```bash
# Create virtual environment
python -m venv .venv
# Activate (Windows)
.venv\Scripts\activate
# Activate (Linux/macOS)
source .venv/bin/activate
# Install in editable mode
pip install -e ".[dev]"
```
## Usage
### Probe — Latency and Bandwidth Analysis
The `probe` command runs predefined traffic patterns (H2D write, D2H read,
PE DMA) and reports latency breakdown, bottleneck bandwidth, and utilization.
```bash
# Run all probe cases
kernbench probe --topology topology.yaml
# Run a specific case
kernbench probe --topology topology.yaml --case pe-local-hbm
```
Output includes:
- **Summary tables** — actual latency, overhead/drain/wire breakdown, effective BW, utilization
- **BW saturation sweep** — utilization at 4KB through 1MB to show saturation threshold
- **Per-hop route traces** — cumulative timestamps at every node along the path
### Run — Execute a Benchmark
```bash
# Run a benchmark on all devices
kernbench run --topology topology.yaml --bench qkv_gemm
# Run on a specific device
kernbench run --topology topology.yaml --bench qkv_gemm --device sip:0
```
Available benchmarks (in `benches/`):
- `qkv_gemm` — single-PE QKV GEMM
- `qkv_gemm_multi_pe` — multi-PE QKV GEMM
- `ipcq_allreduce` — IPCQ AllReduce
### Tests
```bash
# Run all tests (278 tests)
pytest
# Run a specific test file
pytest tests/test_probe.py -v
# Run a single test
pytest tests/test_probe.py::test_h2d_latency_monotonic -v
# Run with output shown
pytest -s tests/test_probe.py
```
Key test files:
| File | Coverage |
| --------------------------- | ------------------------------------------------------------- |
| `test_probe.py` | Probe latency invariants, monotonicity, determinism, BW sweep |
| `test_engine.py` | SimPy engine: submit/wait/complete, routing, multi-SIP |
| `test_bw_occupancy.py` | Wire BW contention, HOL blocking, back-to-back serialization |
| `test_iochiplet_noc_d2h.py` | IO chiplet NOC topology, H2D/D2H data paths |
| `test_noc_mesh.py` | 2D mesh NOC routing, Manhattan distance |
| `test_pe_components.py` | PE-internal components: cpu, scheduler, dma, gemm |
| `test_routing.py` | XY routing, address resolution, path finding |
| `test_topology_compile.py` | YAML topology compilation, node/edge validation |
## Topology Configuration
The system is configured via `topology.yaml`. Key parameters:
| Parameter | Default | Description |
| --- | --- | --- |
| `ns_per_mm` | 0.01 | Wire propagation delay (10 ps/mm) |
| `cube_mesh` | 4x4 | Cube grid dimensions per SIP |
| `ucie.overhead_ns` | 8.0 | UCIe protocol overhead per port (16ns per crossing) |
| `hbm_ctrl.efficiency` | 0.8 | HBM effective BW factor (256 to 204.8 GB/s) |
| `xbar.overhead_ns` | 2.0 | Crossbar arbitration delay |
| `xbar_to_hbm_bw_gbs` | 256.0 | Raw HBM bandwidth per slice |
## Project Structure
```text
kernbench/
+-- src/kernbench/
| +-- cli/ # CLI entry points (main, probe, report)
| +-- common/ # Shared types (Completion, RequestHandle, Trace)
| +-- components/ # Hardware component models (SimPy processes)
| +-- di/ # Dependency injection
| +-- policy/ # Routing (XY), address decoding (PhysAddr)
| +-- runtime_api/ # Host-facing API (messages, bench runner)
| +-- sim_engine/ # Discrete-event engine, transaction, wire model
| +-- topology/ # YAML builder, mesh generator, graph types
| +-- triton_emu/ # Triton kernel emulation
+-- benches/ # Benchmark implementations
+-- tests/ # pytest test suite (278 tests)
+-- docs/ # ADRs, latency model docs, diagrams
+-- topology.yaml # System topology configuration
+-- CHANGES.md # Changelog
```
## Documentation
- [CHANGES.md](CHANGES.md) — changelog with detailed descriptions of each release
- [docs/onboarding/latency-model.md](docs/onboarding/latency-model.md) — latency model explanation with worked examples
- [docs/onboarding/](docs/onboarding/) — onboarding guides (architecture overview, latency model, CCL author guide, intro presentation)
- [docs/adr/](docs/adr/) — Architecture Decision Records