# kernbench

A discrete-event simulator for AI accelerator hardware, built on [SimPy](https://simpy.readthedocs.io/).
It models the full data path, from host PCIe injection through the IO chiplet, NOC mesh,
crossbar, and HBM, to measure end-to-end latency under contention and queueing.

## Architecture

```text
Host (CLI)
  |
  +-- kernbench run     -> run a benchmark (QKV GEMM, AllReduce, ...)
  +-- kernbench probe   -> latency/BW analysis for predefined traffic patterns
  |
  v
+---------------------------------------------------+
|  Runtime API          (runtime_api/)              |
|  MemoryWriteMsg, MemoryReadMsg, PeDmaMsg,         |
|  KernelLaunchMsg                                  |
+---------------------------------------------------+
|  Simulation Engine    (sim_engine/)               |
|  SimPy processes, wire model, BW occupancy        |
+---------------------------------------------------+
|  Components           (components/)               |
|  pcie_ep, io_cpu, m_cpu, noc, xbar, hbm_ctrl,    |
|  pe_cpu, pe_dma, pe_gemm, pe_math, pe_tcm, ...   |
+---------------------------------------------------+
|  Topology             (topology/)                 |
|  YAML-driven graph: 4x4 cube mesh, UCIe links,   |
|  IO chiplet with NOC, HBM slices                  |
+---------------------------------------------------+
```

## Prerequisites

- Python 3.10+
- Dependencies: `simpy`, `pyyaml`, `pytest`

## Installation

```bash
# Create virtual environment
python -m venv .venv

# Activate (Windows)
.venv\Scripts\activate

# Activate (Linux/macOS)
source .venv/bin/activate

# Install in editable mode
pip install -e ".[dev]"
```

## Usage

### Probe — Latency and Bandwidth Analysis

The `probe` command runs predefined traffic patterns (host-to-device (H2D) write,
device-to-host (D2H) read, PE DMA) and reports a latency breakdown, the bottleneck
bandwidth, and utilization.

```bash
# Run all probe cases
kernbench probe --topology topology.yaml

# Run a specific case
kernbench probe --topology topology.yaml --case pe-local-hbm
```

Output includes:

- **Summary tables** — actual latency, overhead/drain/wire breakdown, effective BW, utilization
- **BW saturation sweep** — utilization at transfer sizes from 4 KB through 1 MB, showing where bandwidth saturates
- **Per-hop route traces** — cumulative timestamps at every node along the path
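
To see why a sweep over transfer sizes exposes a saturation threshold, here is a minimal sketch. It assumes a simple model in which latency is a fixed per-transfer cost plus drain time (size divided by bandwidth), and utilization is the drain share of total latency. The function name, the 20 ns fixed cost, and this definition of utilization are illustrative assumptions, not kernbench's actual implementation. The 204.8 GB/s figure is the effective HBM bandwidth from the topology defaults.

```python
# Illustrative model only: latency = fixed cost + drain time.
# FIXED_NS is a hypothetical value chosen for the example.
FIXED_NS = 20.0
BW_GBS = 204.8  # 1 GB/s equals 1 byte/ns, so size / BW_GBS is in ns


def utilization(size_bytes: int, bw_gbs: float, fixed_ns: float) -> float:
    """Fraction of total latency spent actually draining data."""
    drain_ns = size_bytes / bw_gbs
    return drain_ns / (fixed_ns + drain_ns)


# Transfer sizes from 4 KB to 1 MB, doubling each step.
sizes = [4 * 1024 * 2**i for i in range(9)]

for size in sizes:
    u = utilization(size, BW_GBS, FIXED_NS)
    print(f"{size // 1024:>5} KB  utilization = {u:.3f}")
```

Under these assumptions, small transfers are dominated by the fixed cost and large transfers approach full utilization. The knee between the two is the saturation threshold.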

### Run — Execute a Benchmark

```bash
# Run a benchmark on all devices
kernbench run --topology topology.yaml --bench qkv_gemm

# Run on a specific device
kernbench run --topology topology.yaml --bench qkv_gemm --device sip:0
```

Available benchmarks (in `benches/`):

- `qkv_gemm` — single-PE QKV GEMM
- `qkv_gemm_multi_pe` — multi-PE QKV GEMM
- `ipcq_allreduce` — IPCQ AllReduce

### Tests

```bash
# Run all tests (278 tests)
pytest

# Run a specific test file
pytest tests/test_probe.py -v

# Run a single test
pytest tests/test_probe.py::test_h2d_latency_monotonic -v

# Run with output shown
pytest -s tests/test_probe.py
```

Key test files:

| File                        | Coverage                                                      |
| --------------------------- | ------------------------------------------------------------- |
| `test_probe.py`             | Probe latency invariants, monotonicity, determinism, BW sweep |
| `test_engine.py`            | SimPy engine: submit/wait/complete, routing, multi-SIP        |
| `test_bw_occupancy.py`      | Wire BW contention, HOL blocking, back-to-back serialization  |
| `test_iochiplet_noc_d2h.py` | IO chiplet NOC topology, H2D/D2H data paths                   |
| `test_noc_mesh.py`          | 2D mesh NOC routing, Manhattan distance                       |
| `test_pe_components.py`     | PE-internal components: cpu, scheduler, dma, gemm             |
| `test_routing.py`           | XY routing, address resolution, path finding                  |
| `test_topology_compile.py`  | YAML topology compilation, node/edge validation               |

## Topology Configuration

The system is configured via `topology.yaml`. Key parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| `ns_per_mm` | 0.01 | Wire propagation delay in ns per mm (10 ps/mm) |
| `cube_mesh` | 4x4 | Cube grid dimensions per SIP |
| `ucie.overhead_ns` | 8.0 | UCIe protocol overhead per port, in ns (16 ns per crossing) |
| `hbm_ctrl.efficiency` | 0.8 | HBM effective BW factor (256 GB/s raw becomes 204.8 GB/s effective) |
| `xbar.overhead_ns` | 2.0 | Crossbar arbitration delay, in ns |
| `xbar_to_hbm_bw_gbs` | 256.0 | Raw HBM bandwidth per slice, in GB/s |
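
The derived figures quoted in the table follow from simple arithmetic on the defaults. The sketch below reproduces them. The dictionary layout and helper names are hypothetical and do not reflect the real `topology.yaml` schema, and the "two ports per crossing" factor is inferred from the 8 ns and 16 ns figures above.

```python
# Defaults copied from the table; the flat key names are illustrative only.
params = {
    "ns_per_mm": 0.01,
    "ucie_overhead_ns": 8.0,
    "hbm_ctrl_efficiency": 0.8,
    "xbar_overhead_ns": 2.0,
    "xbar_to_hbm_bw_gbs": 256.0,
}


def wire_delay_ns(length_mm: float) -> float:
    """Propagation delay of a wire of the given length."""
    return length_mm * params["ns_per_mm"]


def ucie_crossing_ns() -> float:
    """One crossing pays the per-port overhead twice (assumed: two ports)."""
    return 2 * params["ucie_overhead_ns"]


def hbm_effective_bw_gbs() -> float:
    """Raw per-slice HBM bandwidth scaled by the controller efficiency."""
    return params["xbar_to_hbm_bw_gbs"] * params["hbm_ctrl_efficiency"]


def drain_ns(size_bytes: int, bw_gbs: float) -> float:
    """Time to push size_bytes through a link; 1 GB/s equals 1 byte/ns."""
    return size_bytes / bw_gbs


print(f"5 mm wire:       {wire_delay_ns(5.0):.2f} ns")
print(f"UCIe crossing:   {ucie_crossing_ns():.1f} ns")
print(f"HBM effective:   {hbm_effective_bw_gbs():.1f} GB/s")
print(f"4 KB HBM drain:  {drain_ns(4096, hbm_effective_bw_gbs()):.1f} ns")
```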

## Project Structure

```text
kernbench/
+-- src/kernbench/
|   +-- cli/            # CLI entry points (main, probe, report)
|   +-- common/         # Shared types (Completion, RequestHandle, Trace)
|   +-- components/     # Hardware component models (SimPy processes)
|   +-- di/             # Dependency injection
|   +-- policy/         # Routing (XY), address decoding (PhysAddr)
|   +-- runtime_api/    # Host-facing API (messages, bench runner)
|   +-- sim_engine/     # Discrete-event engine, transaction, wire model
|   +-- topology/       # YAML builder, mesh generator, graph types
|   +-- triton_emu/     # Triton kernel emulation
+-- benches/            # Benchmark implementations
+-- tests/              # pytest test suite (278 tests)
+-- docs/               # ADRs, latency model docs, diagrams
+-- topology.yaml       # System topology configuration
+-- CHANGES.md          # Changelog
```

## Documentation

- [CHANGES.md](CHANGES.md) — changelog with detailed descriptions of each release
- [docs/onboarding/latency-model.md](docs/onboarding/latency-model.md) — latency model explanation with worked examples
- [docs/onboarding/](docs/onboarding/) — onboarding guides (architecture overview, latency model, CCL author guide, intro presentation)
- [docs/adr/](docs/adr/) — Architecture Decision Records