Add 5 of the 6 figure renderers ADR-0057 D3 sub-cycle 4c specifies:
- gqa_op_log_{panel}.png × 4 — per-panel bar chart of the 5 op_log
counts (gemm, ipcq_send, ipcq_recv, dma_read, dma_write).
- gqa_comparison.png — cross-panel grouped bars over the same 5 series.
Sixth figure (gqa_scaling.png) depends on sub-cycle 4b's Q/cube ∈
{1, 2, 4} sweep on multi_user_* panels and is deferred until that
data exists; emit_all_gqa_plots returns just the 5 in-scope paths.
Add MILESTONE_FAST=1 mode to run(): skip the panel sweep, reuse the
committed sweep.json, render figures only. Validation mode unchanged.
The runtime errors clearly when neither env var is set, listing the
two supported modes.
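
The mode selection described above could look roughly like the sketch below. It is only an illustration, not the bench module's actual code: `MILESTONE_FAST` is taken from this message, while the `MILESTONE_VALIDATE` variable name, the `select_mode` helper, and the rule that fast mode wins when both are set are hypothetical.

```python
import os


def select_mode(env=None):
    """Return 'fast' or 'validate' depending on which env var is set."""
    env = os.environ if env is None else env
    if env.get("MILESTONE_FAST") == "1":
        # Skip the panel sweep and render figures from the committed sweep.json.
        return "fast"
    if env.get("MILESTONE_VALIDATE") == "1":  # hypothetical variable name
        return "validate"
    # Neither variable is set: fail loudly and list the supported modes.
    raise RuntimeError(
        "no run mode selected; supported modes: "
        "MILESTONE_FAST=1 (figures only), "
        "MILESTONE_VALIDATE=1 (full sweep; hypothetical name)"
    )
```

Passing a plain dict as `env` keeps the helper testable without touching the real process environment.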
Renderers live in the bench module (the milestone-1h-gemm pattern);
tests/gqa/_gqa_plot_helpers.py re-exports them for figure tests.
Tests: tests/gqa/test_plot_gqa_figures.py — 7 tests, all green:
- 4 parametrized per-panel emit assertions
- 1 comparison emit assertion
- 1 emit_all returns exactly 5 PNG paths
- 1 default out_dir matches the bench _OUTPUT_DIR
Commits the 5 PNG baselines under the bench output dir alongside
sweep.json, mirroring milestone-1h-gemm's committed-figures pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kernbench
A discrete-event simulator for AI accelerator hardware, built on SimPy. It models the full data path — from host PCIe injection through IO chiplet, NOC mesh, crossbar, and HBM — to measure end-to-end latency with contention and queueing.
Architecture
Host (CLI)
|
+-- kernbench run -> run a benchmark (QKV GEMM, AllReduce, ...)
+-- kernbench probe -> latency/BW analysis for predefined traffic patterns
|
v
+---------------------------------------------------+
| Runtime API (runtime_api/) |
| MemoryWriteMsg, MemoryReadMsg, PeDmaMsg, |
| KernelLaunchMsg |
+---------------------------------------------------+
| Simulation Engine (sim_engine/) |
| SimPy processes, wire model, BW occupancy |
+---------------------------------------------------+
| Components (components/) |
| pcie_ep, io_cpu, m_cpu, noc, xbar, hbm_ctrl, |
| pe_cpu, pe_dma, pe_gemm, pe_math, pe_tcm, ... |
+---------------------------------------------------+
| Topology (topology/) |
| YAML-driven graph: 4x4 cube mesh, UCIe links, |
| IO chiplet with NOC, HBM slices |
+---------------------------------------------------+
Prerequisites
- Python 3.10+
- Dependencies: simpy, pyyaml, pytest
Installation
# Create virtual environment
python -m venv .venv
# Activate (Windows)
.venv\Scripts\activate
# Activate (Linux/macOS)
source .venv/bin/activate
# Install in editable mode
pip install -e ".[dev]"
Usage
Probe — Latency and Bandwidth Analysis
The probe command runs predefined traffic patterns (H2D write, D2H read,
PE DMA) and reports latency breakdown, bottleneck bandwidth, and utilization.
# Run all probe cases
kernbench probe --topology topology.yaml
# Run a specific case
kernbench probe --topology topology.yaml --case pe-local-hbm
Output includes:
- Summary tables — actual latency, overhead/drain/wire breakdown, effective BW, utilization
- BW saturation sweep — utilization at 4KB through 1MB to show saturation threshold
- Per-hop route traces — cumulative timestamps at every node along the path
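
To see why utilization rises with transfer size in a sweep like this, here is a minimal sketch under a simplified model: total latency is a fixed overhead plus the time to drain the payload at peak bandwidth. The model, the `utilization` and `sweep` helpers, and the 100 ns overhead are illustrative assumptions, not kernbench's actual latency model or API.

```python
def utilization(size_bytes, peak_bw_gbs, fixed_overhead_ns):
    """Fraction of peak bandwidth achieved for a single transfer."""
    drain_ns = size_bytes / peak_bw_gbs  # 1 GB/s equals 1 byte/ns
    total_ns = fixed_overhead_ns + drain_ns
    return drain_ns / total_ns


def sweep(peak_bw_gbs=204.8, fixed_overhead_ns=100.0):
    """Utilization at power-of-two sizes from 4 KiB through 1 MiB."""
    sizes = [4096 * 2**i for i in range(9)]
    return [(s, utilization(s, peak_bw_gbs, fixed_overhead_ns)) for s in sizes]


if __name__ == "__main__":
    for size, util in sweep():
        print(f"{size:>8} B  {util:6.1%}")
```

Under this model the fixed overhead dominates small transfers and is amortized for large ones, which is the saturation threshold the sweep is meant to expose.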
Run — Execute a Benchmark
# Run a benchmark on all devices
kernbench run --topology topology.yaml --bench qkv_gemm
# Run on a specific device
kernbench run --topology topology.yaml --bench qkv_gemm --device sip:0
Available benchmarks (in benches/):
- qkv_gemm — single-PE QKV GEMM
- qkv_gemm_multi_pe — multi-PE QKV GEMM
- ipcq_allreduce — IPCQ AllReduce
Tests
# Run all tests (278 tests)
pytest
# Run a specific test file
pytest tests/test_probe.py -v
# Run a single test
pytest tests/test_probe.py::test_h2d_latency_monotonic -v
# Run with output shown
pytest -s tests/test_probe.py
Key test files:
| File | Coverage |
|---|---|
| test_probe.py | Probe latency invariants, monotonicity, determinism, BW sweep |
| test_engine.py | SimPy engine: submit/wait/complete, routing, multi-SIP |
| test_bw_occupancy.py | Wire BW contention, HOL blocking, back-to-back serialization |
| test_iochiplet_noc_d2h.py | IO chiplet NOC topology, H2D/D2H data paths |
| test_noc_mesh.py | 2D mesh NOC routing, Manhattan distance |
| test_pe_components.py | PE-internal components: cpu, scheduler, dma, gemm |
| test_routing.py | XY routing, address resolution, path finding |
| test_topology_compile.py | YAML topology compilation, node/edge validation |
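
As background for the mesh and routing tests, the sketch below shows generic dimension-order (XY) routing on a 2D mesh: move along X until the column matches, then along Y. The `xy_route` and `manhattan` helpers are hypothetical names for illustration; the project's own routing code under policy/ may be structured differently.

```python
def manhattan(src, dst):
    """Manhattan distance between two (x, y) mesh coordinates."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def xy_route(src, dst):
    """Return the list of (x, y) nodes visited, X dimension first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:  # resolve the X offset first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:  # then resolve the Y offset
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    route = xy_route((0, 0), (2, 1))
    print(route, "hops:", len(route) - 1)
```

Because the route never backtracks, its hop count always equals the Manhattan distance between source and destination, which is the kind of invariant a mesh test can check.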
Topology Configuration
The system is configured via topology.yaml. Key parameters:
| Parameter | Default | Description |
|---|---|---|
| ns_per_mm | 0.01 | Wire propagation delay (10 ps/mm) |
| cube_mesh | 4x4 | Cube grid dimensions per SIP |
| ucie.overhead_ns | 8.0 | UCIe protocol overhead per port (16 ns per crossing) |
| hbm_ctrl.efficiency | 0.8 | HBM effective BW factor (256 to 204.8 GB/s) |
| xbar.overhead_ns | 2.0 | Crossbar arbitration delay |
| xbar_to_hbm_bw_gbs | 256.0 | Raw HBM bandwidth per slice |
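
The derived figures in the Description column follow from simple arithmetic on the defaults. The sketch below reproduces them; the constant and function names are illustrative rather than kernbench identifiers, and the 16 ns figure assumes one link crossing traverses two UCIe ports.

```python
NS_PER_MM = 0.01            # wire propagation delay, ns per mm
UCIE_OVERHEAD_NS = 8.0      # protocol overhead per UCIe port
HBM_EFFICIENCY = 0.8        # effective-bandwidth factor
XBAR_TO_HBM_BW_GBS = 256.0  # raw HBM bandwidth per slice, GB/s


def wire_delay_ps(length_mm, ns_per_mm=NS_PER_MM):
    """Propagation delay of a wire in picoseconds."""
    return length_mm * ns_per_mm * 1000.0


def ucie_crossing_ns(overhead_ns=UCIE_OVERHEAD_NS, ports=2):
    """Overhead of one crossing, assuming it passes through two ports."""
    return ports * overhead_ns


def hbm_effective_bw_gbs(raw=XBAR_TO_HBM_BW_GBS, eff=HBM_EFFICIENCY):
    """Raw slice bandwidth scaled by the efficiency factor."""
    return raw * eff


if __name__ == "__main__":
    print(wire_delay_ps(1.0), "ps per mm")
    print(ucie_crossing_ns(), "ns per UCIe crossing")
    print(hbm_effective_bw_gbs(), "GB/s effective per HBM slice")
```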
Project Structure
kernbench/
+-- src/kernbench/
| +-- cli/ # CLI entry points (main, probe, report)
| +-- common/ # Shared types (Completion, RequestHandle, Trace)
| +-- components/ # Hardware component models (SimPy processes)
| +-- di/ # Dependency injection
| +-- policy/ # Routing (XY), address decoding (PhysAddr)
| +-- runtime_api/ # Host-facing API (messages, bench runner)
| +-- sim_engine/ # Discrete-event engine, transaction, wire model
| +-- topology/ # YAML builder, mesh generator, graph types
| +-- triton_emu/ # Triton kernel emulation
+-- benches/ # Benchmark implementations
+-- tests/ # pytest test suite (278 tests)
+-- docs/ # ADRs, latency model docs, diagrams
+-- topology.yaml # System topology configuration
+-- CHANGES.md # Changelog
Documentation
- CHANGES.md — changelog with detailed descriptions of each release
- docs/onboarding/latency-model.md — latency model explanation with worked examples
- docs/onboarding/ — onboarding guides (architecture overview, latency model, CCL author guide, intro presentation)
- docs/adr/ — Architecture Decision Records