b8213d43a9968443955d7c62f259bee7efa4f816
Restores per-PE HBM controller partitioning that was lost in
commit 5917b34 ("Replace xbar/bridge/single-NOC with explicit
router mesh"), which had over-consolidated the per-slice HBM CTRL
into a single cube-wide ``hbm_ctrl`` connected to every router —
the opposite of what ADR-0019 D1/D4 specifies.
Builder splits ``hbm_ctrl`` into 8 ``hbm_ctrl.pe{X}`` instances per
cube, each reachable ONLY through PE_X's attaching router via the
existing ``peX.hbm`` attach metadata from cube_mesh.yaml. Cube
aggregate BW now matches the spec (8 PEs × 8 PCs × 32 GB/s =
2048 GB/s) instead of collapsing to 256 GB/s.
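The bandwidth figures above can be checked with a few lines of arithmetic. This is only a sketch: the constant and function names are hypothetical, "PC" is assumed to mean an HBM pseudo-channel, and reading the 256 GB/s figure as a single controller's 8 PCs × 32 GB/s is an inference from the numbers, not something the commit states.

```python
# Hypothetical names; the values come from the commit message above.
PES_PER_CUBE = 8   # one hbm_ctrl.pe{X} instance per PE
PCS_PER_PE = 8     # "PC" assumed to mean HBM pseudo-channel
GBS_PER_PC = 32    # GB/s per PC


def per_controller_bw_gbs() -> int:
    """Bandwidth of one HBM controller instance, in GB/s."""
    return PCS_PER_PE * GBS_PER_PC


def cube_aggregate_bw_gbs(num_controllers: int = PES_PER_CUBE) -> int:
    """Cube-wide bandwidth when each controller serves its own PCs."""
    return num_controllers * per_controller_bw_gbs()


partitioned = cube_aggregate_bw_gbs()    # 8 controllers, one per PE
consolidated = cube_aggregate_bw_gbs(1)  # single cube-wide hbm_ctrl
print(partitioned, consolidated)
```

With one controller per PE the aggregate is 2048 GB/s; with a single shared controller it drops to 256 GB/s, matching the figures quoted in the message.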
AddressResolver decodes the target PE from the HBM PA's hbm_offset
(``offset // slice_size``) and returns ``hbm_ctrl.pe{X}``. PathRouter
uses the existing ``_adj_local`` adjacency for same-cube PE_DMA so
the cube's own UCIe port can no longer appear as a zero-distance
shortcut between routers — local PE_DMA now traverses the mesh,
restoring the ADR-0019 D4 worked example
``PE0.pe_dma → r0c0 → … → r1c4 → hbm_ctrl``.
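A minimal sketch of the decode step described above, assuming equally sized, contiguous per-PE slices. The function name, the ``num_pes`` default, and the example slice size are hypothetical; the real AddressResolver may differ and may prefix the node id with a cube or SIP identifier.

```python
def resolve_hbm_ctrl(hbm_offset: int, slice_size: int, num_pes: int = 8) -> str:
    """Map an HBM offset to the per-PE controller node that owns it.

    Hypothetical helper: only the ``offset // slice_size`` rule and the
    ``hbm_ctrl.pe{X}`` naming are taken from the commit message.
    """
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    if hbm_offset < 0:
        raise ValueError("hbm_offset must be non-negative")
    pe = hbm_offset // slice_size  # integer division selects the slice
    if pe >= num_pes:
        raise ValueError(f"offset {hbm_offset:#x} is outside the {num_pes} slices")
    return f"hbm_ctrl.pe{pe}"


SLICE = 1 << 30  # assumed 1 GiB slice, for illustration only
print(resolve_hbm_ctrl(0, SLICE))
print(resolve_hbm_ctrl(3 * SLICE + 0x40, SLICE))
```

Rejecting out-of-range offsets is a design choice of this sketch: it surfaces a bad address at resolution time rather than letting it reach a router with no matching controller.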
Tests:
- New tests/test_per_pe_hbm_partition.py: 14 tests covering
topology shape, per-PE router exclusivity, PA resolution,
single-hop local path, cross-PE mesh traversal, and end-to-end
latency monotonicity. Probe CLI now reports
pe-local < pe-same-half < pe-cross-half (was uniform 141ns).
- Existing tests updated for new node ids, and two assertions that
locked in the wrong consolidation replaced:
test_noc_mesh.test_hbm_connects_to_all_routers and
test_topology_compile.test_hbm_ctrl_connects_all_routers are
now per-PE exclusivity assertions; test_routing
.test_all_pe_hbm_equidistant becomes
test_cross_pe_hbm_distance_increases_with_mesh_hops.
- test_ipcq_buffer_kind_locations.test_hbm_pe_hop_charged_at_large_payload
threshold recalibrated 4000→1500 ns: the prior figure reflected
serialization on the over-consolidated single hbm_ctrl; per-PE
partitioning removes that artificial contention so the gap
shrinks to the genuine PE↔HBM-hop cost.
Full suite: 645 passed, 1 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kernbench
A discrete-event simulator for AI accelerator hardware, built on SimPy. It models the full data path — from host PCIe injection through IO chiplet, NOC mesh, crossbar, and HBM — to measure end-to-end latency with contention and queueing.
Architecture
Host (CLI)
|
+-- kernbench run -> run a benchmark (QKV GEMM, AllReduce, ...)
+-- kernbench probe -> latency/BW analysis for predefined traffic patterns
|
v
+---------------------------------------------------+
| Runtime API (runtime_api/) |
| MemoryWriteMsg, MemoryReadMsg, PeDmaMsg, |
| KernelLaunchMsg |
+---------------------------------------------------+
| Simulation Engine (sim_engine/) |
| SimPy processes, wire model, BW occupancy |
+---------------------------------------------------+
| Components (components/) |
| pcie_ep, io_cpu, m_cpu, noc, xbar, hbm_ctrl, |
| pe_cpu, pe_dma, pe_gemm, pe_math, pe_tcm, ... |
+---------------------------------------------------+
| Topology (topology/) |
| YAML-driven graph: 4x4 cube mesh, UCIe links, |
| IO chiplet with NOC, HBM slices |
+---------------------------------------------------+
Prerequisites
- Python 3.10+
- Dependencies: simpy, pyyaml, pytest
Installation
# Create virtual environment
python -m venv .venv
# Activate (Windows)
.venv\Scripts\activate
# Activate (Linux/macOS)
source .venv/bin/activate
# Install in editable mode
pip install -e ".[dev]"
Usage
Probe — Latency and Bandwidth Analysis
The probe command runs predefined traffic patterns (H2D write, D2H read,
PE DMA) and reports latency breakdown, bottleneck bandwidth, and utilization.
# Run all probe cases
kernbench probe --topology topology.yaml
# Run a specific case
kernbench probe --topology topology.yaml --case pe-local-hbm
Output includes:
- Summary tables — actual latency, overhead/drain/wire breakdown, effective BW, utilization
- BW saturation sweep — utilization at 4KB through 1MB to show saturation threshold
- Per-hop route traces — cumulative timestamps at every node along the path
Run — Execute a Benchmark
# Run a benchmark on all devices
kernbench run --topology topology.yaml --bench qkv_gemm
# Run on a specific device
kernbench run --topology topology.yaml --bench qkv_gemm --device sip:0
Available benchmarks (in benches/):
- qkv_gemm: single-PE QKV GEMM
- qkv_gemm_multi_pe: multi-PE QKV GEMM
- ipcq_allreduce: IPCQ AllReduce
Tests
# Run all tests (278 tests)
pytest
# Run a specific test file
pytest tests/test_probe.py -v
# Run a single test
pytest tests/test_probe.py::test_h2d_latency_monotonic -v
# Run with output shown
pytest -s tests/test_probe.py
Key test files:
| File | Coverage |
|---|---|
| test_probe.py | Probe latency invariants, monotonicity, determinism, BW sweep |
| test_engine.py | SimPy engine: submit/wait/complete, routing, multi-SIP |
| test_bw_occupancy.py | Wire BW contention, HOL blocking, back-to-back serialization |
| test_iochiplet_noc_d2h.py | IO chiplet NOC topology, H2D/D2H data paths |
| test_noc_mesh.py | 2D mesh NOC routing, Manhattan distance |
| test_pe_components.py | PE-internal components: cpu, scheduler, dma, gemm |
| test_routing.py | XY routing, address resolution, path finding |
| test_topology_compile.py | YAML topology compilation, node/edge validation |
Topology Configuration
The system is configured via topology.yaml. Key parameters:
| Parameter | Default | Description |
|---|---|---|
| ns_per_mm | 0.01 | Wire propagation delay (10 ps/mm) |
| cube_mesh | 4x4 | Cube grid dimensions per SIP |
| ucie.overhead_ns | 8.0 | UCIe protocol overhead per port (16 ns per crossing) |
| hbm_ctrl.efficiency | 0.8 | HBM effective BW factor (256 to 204.8 GB/s) |
| xbar.overhead_ns | 2.0 | Crossbar arbitration delay |
| xbar_to_hbm_bw_gbs | 256.0 | Raw HBM bandwidth per slice |
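The derived figures in the table's descriptions follow directly from the defaults. The sketch below reproduces that arithmetic; the function names are hypothetical and are not part of kernbench, and the assumption that a UCIe crossing passes through exactly two ports (egress and ingress) is inferred from the "16 ns per crossing" note.

```python
import math

# Defaults copied from the parameter table above.
NS_PER_MM = 0.01             # wire propagation delay, ns per mm
UCIE_OVERHEAD_NS = 8.0       # per UCIe port
HBM_EFFICIENCY = 0.8         # effective BW factor
XBAR_TO_HBM_BW_GBS = 256.0   # raw HBM bandwidth per slice, GB/s


def wire_delay_ns(length_mm: float) -> float:
    """Propagation delay of a wire of the given length."""
    return length_mm * NS_PER_MM


def ucie_crossing_ns(ports: int = 2) -> float:
    """Overhead of one die-to-die crossing (assumed: two ports)."""
    return ports * UCIE_OVERHEAD_NS


def effective_hbm_bw_gbs() -> float:
    """Raw per-slice bandwidth scaled by the efficiency factor."""
    return XBAR_TO_HBM_BW_GBS * HBM_EFFICIENCY


print(wire_delay_ns(100.0), ucie_crossing_ns(), effective_hbm_bw_gbs())
assert math.isclose(effective_hbm_bw_gbs(), 204.8)
```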
Project Structure
kernbench/
+-- src/kernbench/
| +-- cli/ # CLI entry points (main, probe, report)
| +-- common/ # Shared types (Completion, RequestHandle, Trace)
| +-- components/ # Hardware component models (SimPy processes)
| +-- di/ # Dependency injection
| +-- policy/ # Routing (XY), address decoding (PhysAddr)
| +-- runtime_api/ # Host-facing API (messages, bench runner)
| +-- sim_engine/ # Discrete-event engine, transaction, wire model
| +-- topology/ # YAML builder, mesh generator, graph types
| +-- triton_emu/ # Triton kernel emulation
+-- benches/ # Benchmark implementations
+-- tests/ # pytest test suite (278 tests)
+-- docs/ # ADRs, latency model docs, diagrams
+-- topology.yaml # System topology configuration
+-- CHANGES.md # Changelog
Documentation
- CHANGES.md — changelog with detailed descriptions of each release
- docs/latency-model.md — latency model explanation with worked examples
- docs/adr/ — Architecture Decision Records