ywkang a76487ca48 PE_DMA perf: SIP-wide scenarios + dual outputs + clearer naming
User asked to surface system-wide congestion (more accurate than
single-cube), bring back the latency-breakdown plot under a separate
filename, and rename the obscure ``streaming`` category.

Scenarios:
  Renamed all_pe_to_pe0 → all_pe_cube0_to_pe0 (clarify cube scope).
  Added two SIP-wide scenarios:
    sip_local_all     — every PE in sip0 (128 total) accesses its own
                        local slice. All paths disjoint (each PE owns
                        its own hbm_ctrl.peX), so the model should
                        scale linearly with cube count.
    sip_hotspot_pe0   — every PE in sip0 (128 total) targets
                        sip0.cube0.pe0_slice. Worst-case hotspot:
                        UCIe inbound + r0c0→hbm_ctrl.pe0 saturated.
  Each bar now carries an ``N=...`` annotation showing the issuer
  count, and the chart titles say the scope explicitly.

Effective BW + util at 16 KB:
  sip_local_all       N=128  eff= 27.2 TB/s  util_a= 83 %
  sip_hotspot_pe0     N=128  eff= 134 GB/s   util_a= 93 %
                                              (UCIe-into-cube0 saturated)

Plots:
  no_congestion.png + congestion.png        — Effective BW utilization
                                              (two bars: single vs aggregate peak)
  breakdown_no_congestion.png +
  breakdown_congestion.png                  — stacked latency breakdown
                                              (renamed from previous)
  summary.csv with columns for both views.

The visual y-cap on BW utilization is 150 %. Bars exceeding it (e.g.
sip_local_all's util_single = 10,639 %) are drawn at the cap with an
upward arrow and the real value annotated. The verification rule for
``util_single`` is loosened to ``≤ n_issuers × 100 % + 5 %`` so
massively-parallel disjoint scenarios pass.

Category renamed: ``streaming`` → ``wire_transfer``. It is the
bulk-transfer time = (n_flits − 1) × flit_bytes / bottleneck_bw — the
cost of streaming the rest of the payload through the slowest wire
after the first flit has arrived.

All checks PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 09:43:09 -07:00
2026-03-18 11:47:48 -07:00
2026-03-18 11:47:48 -07:00
2026-03-18 11:47:48 -07:00
2026-03-18 11:47:48 -07:00

kernbench

A discrete-event simulator for AI accelerator hardware, built on SimPy. It models the full data path — from host PCIe injection through IO chiplet, NOC mesh, crossbar, and HBM — to measure end-to-end latency with contention and queueing.

Architecture

Host (CLI)
  |
  +-- kernbench run     -> run a benchmark (QKV GEMM, AllReduce, ...)
  +-- kernbench probe   -> latency/BW analysis for predefined traffic patterns
  |
  v
+---------------------------------------------------+
|  Runtime API          (runtime_api/)              |
|  MemoryWriteMsg, MemoryReadMsg, PeDmaMsg,         |
|  KernelLaunchMsg                                  |
+---------------------------------------------------+
|  Simulation Engine    (sim_engine/)               |
|  SimPy processes, wire model, BW occupancy        |
+---------------------------------------------------+
|  Components           (components/)               |
|  pcie_ep, io_cpu, m_cpu, noc, xbar, hbm_ctrl,    |
|  pe_cpu, pe_dma, pe_gemm, pe_math, pe_tcm, ...   |
+---------------------------------------------------+
|  Topology             (topology/)                 |
|  YAML-driven graph: 4x4 cube mesh, UCIe links,   |
|  IO chiplet with NOC, HBM slices                  |
+---------------------------------------------------+

Prerequisites

  • Python 3.10+
  • Dependencies: simpy, pyyaml, pytest

Installation

# Create virtual environment
python -m venv .venv

# Activate (Windows)
.venv\Scripts\activate

# Activate (Linux/macOS)
source .venv/bin/activate

# Install in editable mode
pip install -e ".[dev]"

Usage

Probe — Latency and Bandwidth Analysis

The probe command runs predefined traffic patterns (H2D write, D2H read, PE DMA) and reports latency breakdown, bottleneck bandwidth, and utilization.

# Run all probe cases
kernbench probe --topology topology.yaml

# Run a specific case
kernbench probe --topology topology.yaml --case pe-local-hbm

Output includes:

  • Summary tables — actual latency, overhead/drain/wire breakdown, effective BW, utilization
  • BW saturation sweep — utilization at 4KB through 1MB to show saturation threshold
  • Per-hop route traces — cumulative timestamps at every node along the path

Run — Execute a Benchmark

# Run a benchmark on all devices
kernbench run --topology topology.yaml --bench qkv_gemm

# Run on a specific device
kernbench run --topology topology.yaml --bench qkv_gemm --device sip:0

Available benchmarks (in benches/):

  • qkv_gemm — single-PE QKV GEMM
  • qkv_gemm_multi_pe — multi-PE QKV GEMM
  • ipcq_allreduce — IPCQ AllReduce

Tests

# Run all tests (278 tests)
pytest

# Run a specific test file
pytest tests/test_probe.py -v

# Run a single test
pytest tests/test_probe.py::test_h2d_latency_monotonic -v

# Run with output shown
pytest -s tests/test_probe.py

Key test files:

File Coverage
test_probe.py Probe latency invariants, monotonicity, determinism, BW sweep
test_engine.py SimPy engine: submit/wait/complete, routing, multi-SIP
test_bw_occupancy.py Wire BW contention, HOL blocking, back-to-back serialization
test_iochiplet_noc_d2h.py IO chiplet NOC topology, H2D/D2H data paths
test_noc_mesh.py 2D mesh NOC routing, Manhattan distance
test_pe_components.py PE-internal components: cpu, scheduler, dma, gemm
test_routing.py XY routing, address resolution, path finding
test_topology_compile.py YAML topology compilation, node/edge validation

Topology Configuration

The system is configured via topology.yaml. Key parameters:

Parameter Default Description
ns_per_mm 0.01 Wire propagation delay (10 ps/mm)
cube_mesh 4x4 Cube grid dimensions per SIP
ucie.overhead_ns 8.0 UCIe protocol overhead per port (16ns per crossing)
hbm_ctrl.efficiency 0.8 HBM effective BW factor (256 to 204.8 GB/s)
xbar.overhead_ns 2.0 Crossbar arbitration delay
xbar_to_hbm_bw_gbs 256.0 Raw HBM bandwidth per slice

Project Structure

kernbench/
+-- src/kernbench/
|   +-- cli/            # CLI entry points (main, probe, report)
|   +-- common/         # Shared types (Completion, RequestHandle, Trace)
|   +-- components/     # Hardware component models (SimPy processes)
|   +-- di/             # Dependency injection
|   +-- policy/         # Routing (XY), address decoding (PhysAddr)
|   +-- runtime_api/    # Host-facing API (messages, bench runner)
|   +-- sim_engine/     # Discrete-event engine, transaction, wire model
|   +-- topology/       # YAML builder, mesh generator, graph types
|   +-- triton_emu/     # Triton kernel emulation
+-- benches/            # Benchmark implementations
+-- tests/              # pytest test suite (278 tests)
+-- docs/               # ADRs, latency model docs, diagrams
+-- topology.yaml       # System topology configuration
+-- CHANGES.md          # Changelog

Documentation

S
Description
No description provided
Readme 13 MiB
Languages
Python 96%
HTML 4%