# kernbench

A discrete-event simulator for AI accelerator hardware, built on [SimPy](https://simpy.readthedocs.io/). It models the full data path, from host PCIe injection through IO chiplet, NOC mesh, crossbar, and HBM, to measure end-to-end latency with contention and queueing.

## Architecture

```text
Host (CLI)
  |
  +-- kernbench run   -> run a benchmark (QKV GEMM, AllReduce, ...)
  +-- kernbench probe -> latency/BW analysis for predefined traffic patterns
  |
  v
+---------------------------------------------------+
|  Runtime API (runtime_api/)                       |
|    MemoryWriteMsg, MemoryReadMsg, PeDmaMsg,       |
|    KernelLaunchMsg                                |
+---------------------------------------------------+
|  Simulation Engine (sim_engine/)                  |
|    SimPy processes, wire model, BW occupancy      |
+---------------------------------------------------+
|  Components (components/)                         |
|    pcie_ep, io_cpu, m_cpu, noc, xbar, hbm_ctrl,   |
|    pe_cpu, pe_dma, pe_gemm, pe_math, pe_tcm, ...  |
+---------------------------------------------------+
|  Topology (topology/)                             |
|    YAML-driven graph: 4x4 cube mesh, UCIe links,  |
|    IO chiplet with NOC, HBM slices                |
+---------------------------------------------------+
```

## Prerequisites

- Python 3.10+
- Dependencies: `simpy`, `pyyaml`, `pytest`

## Installation

```bash
# Create virtual environment
python -m venv .venv

# Activate (Windows)
.venv\Scripts\activate

# Activate (Linux/macOS)
source .venv/bin/activate

# Install in editable mode
pip install -e ".[dev]"
```

## Usage

### Probe: Latency and Bandwidth Analysis

The `probe` command runs predefined traffic patterns (H2D write, D2H read, PE DMA) and reports latency breakdown, bottleneck bandwidth, and utilization.
```bash
# Run all probe cases
kernbench probe --topology topology.yaml

# Run a specific case
kernbench probe --topology topology.yaml --case pe-local-hbm
```

Output includes:

- **Summary tables**: actual latency, overhead/drain/wire breakdown, effective BW, utilization
- **BW saturation sweep**: utilization at 4KB through 1MB to show saturation threshold
- **Per-hop route traces**: cumulative timestamps at every node along the path

### Run: Execute a Benchmark

```bash
# Run a benchmark on all devices
kernbench run --topology topology.yaml --bench qkv_gemm

# Run on a specific device
kernbench run --topology topology.yaml --bench qkv_gemm --device sip:0
```

Available benchmarks (in `benches/`):

- `qkv_gemm`: single-PE QKV GEMM
- `qkv_gemm_multi_pe`: multi-PE QKV GEMM
- `ipcq_allreduce`: IPCQ AllReduce

### Tests

```bash
# Run all tests (278 tests)
pytest

# Run a specific test file
pytest tests/test_probe.py -v

# Run a single test
pytest tests/test_probe.py::test_h2d_latency_monotonic -v

# Run with output shown
pytest -s tests/test_probe.py
```

Key test files:

| File | Coverage |
| --- | --- |
| `test_probe.py` | Probe latency invariants, monotonicity, determinism, BW sweep |
| `test_engine.py` | SimPy engine: submit/wait/complete, routing, multi-SIP |
| `test_bw_occupancy.py` | Wire BW contention, HOL blocking, back-to-back serialization |
| `test_iochiplet_noc_d2h.py` | IO chiplet NOC topology, H2D/D2H data paths |
| `test_noc_mesh.py` | 2D mesh NOC routing, Manhattan distance |
| `test_pe_components.py` | PE-internal components: cpu, scheduler, dma, gemm |
| `test_routing.py` | XY routing, address resolution, path finding |
| `test_topology_compile.py` | YAML topology compilation, node/edge validation |

## Topology Configuration

The system is configured via `topology.yaml`.
Key parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| `ns_per_mm` | 0.01 | Wire propagation delay (10 ps/mm) |
| `cube_mesh` | 4x4 | Cube grid dimensions per SIP |
| `ucie.overhead_ns` | 8.0 | UCIe protocol overhead per port (16 ns per crossing) |
| `hbm_ctrl.efficiency` | 0.8 | HBM effective BW factor (256 to 204.8 GB/s) |
| `xbar.overhead_ns` | 2.0 | Crossbar arbitration delay |
| `xbar_to_hbm_bw_gbs` | 256.0 | Raw HBM bandwidth per slice |

## Project Structure

```text
kernbench/
+-- src/kernbench/
|   +-- cli/            # CLI entry points (main, probe, report)
|   +-- common/         # Shared types (Completion, RequestHandle, Trace)
|   +-- components/     # Hardware component models (SimPy processes)
|   +-- di/             # Dependency injection
|   +-- policy/         # Routing (XY), address decoding (PhysAddr)
|   +-- runtime_api/    # Host-facing API (messages, bench runner)
|   +-- sim_engine/     # Discrete-event engine, transaction, wire model
|   +-- topology/       # YAML builder, mesh generator, graph types
|   +-- triton_emu/     # Triton kernel emulation
+-- benches/            # Benchmark implementations
+-- tests/              # pytest test suite (278 tests)
+-- docs/               # ADRs, latency model docs, diagrams
+-- topology.yaml       # System topology configuration
+-- CHANGES.md          # Changelog
```

## Documentation

- [CHANGES.md](CHANGES.md): changelog with detailed descriptions of each release
- [docs/onboarding/latency-model.md](docs/onboarding/latency-model.md): latency model explanation with worked examples
- [docs/onboarding/](docs/onboarding/): onboarding guides (architecture overview, latency model, CCL author guide, intro presentation)
- [docs/adr/](docs/adr/): Architecture Decision Records
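
## Worked Example: Derived Default Values

The parenthetical figures in the Topology Configuration table (10 ps/mm, 16 ns per crossing, 204.8 GB/s) follow from the defaults by simple arithmetic. The sketch below reproduces them. It is illustrative only and is not part of the kernbench code base: the `TopologyDefaults` class, its field names, and the helper functions are hypothetical. It also assumes 1 GB/s equals 1 byte/ns (decimal gigabytes) and that a UCIe crossing traverses two ports, as the "per port" wording in the table implies. The `drain_ns` helper assumes a simple size-over-bandwidth estimate, which may differ from the simulator's actual drain model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopologyDefaults:
    """Defaults copied from the parameter table; field names are hypothetical."""
    ns_per_mm: float = 0.01            # wire propagation delay
    ucie_overhead_ns: float = 8.0      # UCIe protocol overhead per port
    hbm_efficiency: float = 0.8        # HBM effective BW factor
    xbar_overhead_ns: float = 2.0      # crossbar arbitration delay
    xbar_to_hbm_bw_gbs: float = 256.0  # raw HBM bandwidth per slice


def effective_hbm_bw_gbs(p: TopologyDefaults) -> float:
    """Raw HBM bandwidth scaled by the efficiency factor."""
    return p.xbar_to_hbm_bw_gbs * p.hbm_efficiency


def ucie_crossing_ns(p: TopologyDefaults) -> float:
    """Overhead of one crossing, assuming two ports (egress and ingress)."""
    return 2 * p.ucie_overhead_ns


def wire_delay_ps(p: TopologyDefaults, length_mm: float) -> float:
    """Propagation delay of a wire of the given length, in picoseconds."""
    return p.ns_per_mm * length_mm * 1000.0


def drain_ns(p: TopologyDefaults, size_bytes: int) -> float:
    """Assumed estimate: time to stream size_bytes at the effective HBM rate.

    Relies on 1 GB/s == 1 byte/ns (decimal gigabytes).
    """
    return size_bytes / effective_hbm_bw_gbs(p)


if __name__ == "__main__":
    p = TopologyDefaults()
    print(f"effective HBM BW : {effective_hbm_bw_gbs(p):.1f} GB/s")
    print(f"UCIe crossing    : {ucie_crossing_ns(p):.1f} ns")
    print(f"1 mm wire        : {wire_delay_ps(p, 1.0):.1f} ps")
    print(f"4 KB drain (est.): {drain_ns(p, 4096):.1f} ns")
```

With the defaults, the first three helpers return the values quoted in the table. Under the stated assumptions, a 4 KB transfer drains in roughly 20 ns at the effective HBM rate, which gives a feel for why fixed per-hop overheads matter at the small end of the BW saturation sweep.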