mukesh b610cb0d9a sccl: drive allreduce tests via torch.distributed; reorganize into tests/sccl/
Convert the multidevice allreduce correctness + latency/buffer-kind sweeps
to run through the real PyTorch-distributed path
(init_process_group(backend="ahbm") -> mp.spawn -> dist.all_reduce) instead
of direct ctx.launch, and reorganize the CCL/allreduce tests into a
tests/sccl/ package split one test per file.

Production change (required for the distributed path on non-square SIP grids):
- AhbmCCLBackend now reads explicit system.sips.w/h from the spec, with a
  square-only sqrt fallback that raises on ambiguity, instead of silently
  guessing round(sqrt(count)). This fixes the 2x3 / 3x2 torus + mesh cases,
  which previously resolved to a wrong 2x2 grid. Mirrors the test helper's
  _sip_topo_dims precedence (explicit w/h > square fallback > raise).

Test reorganization (tests/sccl/):
- _allreduce_helpers.py: shared plumbing (distributed driver, config writers,
  direct-launch run_allreduce parity reference, sweep/buffer-kind constants,
  plot aggregators, topology-diagram + FSIM-comparison emitters).
- test_allreduce_ring_torus_mesh.py: correctness across ring/torus/mesh.
- test_distributed_default_topology.py: full distributed path on topology.yaml.
- test_plot_latency_sweep.py / test_plot_buffer_kind_sweep.py: sweep rows.
- test_plot_topology_diagram.py / test_plot_comparison_fsim.py: plot emitters.
- test_intercube_root_center.py: moved in (ADR-0032 center-root latency guard).

Also:
- Move the FSIM comparison plot generator out of scripts/ into the sccl suite.
- Delete superseded test files (test_allreduce_multidevice,
  test_distributed_lrab_hierarchical_allreduce, test_allreduce_buffer_kind_sweep)
  and repoint conftest aggregators + the ipcq buffer-kind importers.
- Regenerate the allreduce_latency_plots derived artifacts from the full sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 22:24:43 -07:00
2026-03-18 11:47:48 -07:00

kernbench

A discrete-event simulator for AI accelerator hardware, built on SimPy. It models the full data path — from host PCIe injection through IO chiplet, NOC mesh, crossbar, and HBM — to measure end-to-end latency with contention and queueing.
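To make "latency with contention and queueing" concrete, here is a minimal sketch in plain Python. It deliberately uses neither SimPy nor kernbench's engine: it models one shared link that serves transfers in arrival order. The function name, the request sizes, and the use of 256 B/ns (256 GB/s) for the link are illustrative assumptions, not kernbench's API.

```python
def fifo_link_latencies(requests, bw_bytes_per_ns, overhead_ns=0.0):
    """Return per-request latency (ns) on one link that serves transfers FIFO.

    requests: list of (arrival_ns, size_bytes), sorted by arrival time.
    This is a hypothetical helper for illustration, not part of kernbench.
    """
    link_free_at = 0.0
    latencies = []
    for arrival_ns, size_bytes in requests:
        # A request that finds the link busy queues until it is free.
        start = max(arrival_ns, link_free_at)
        finish = start + overhead_ns + size_bytes / bw_bytes_per_ns
        link_free_at = finish
        latencies.append(finish - arrival_ns)
    return latencies


# Two 4 KiB transfers arrive at t=0 on a 256 B/ns link (assumed figure).
lat = fifo_link_latencies([(0.0, 4096), (0.0, 4096)], bw_bytes_per_ns=256.0)
print(lat)  # [16.0, 32.0]
```

The second transfer takes twice as long end to end because it waits 16 ns for the first one to drain; that waiting time is the queueing component that a contention-free estimate would miss.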

Architecture

Host (CLI)
  |
  +-- kernbench run     -> run a benchmark (QKV GEMM, AllReduce, ...)
  +-- kernbench probe   -> latency/BW analysis for predefined traffic patterns
  |
  v
+---------------------------------------------------+
|  Runtime API          (runtime_api/)              |
|  MemoryWriteMsg, MemoryReadMsg, PeDmaMsg,         |
|  KernelLaunchMsg                                  |
+---------------------------------------------------+
|  Simulation Engine    (sim_engine/)               |
|  SimPy processes, wire model, BW occupancy        |
+---------------------------------------------------+
|  Components           (components/)               |
|  pcie_ep, io_cpu, m_cpu, noc, xbar, hbm_ctrl,    |
|  pe_cpu, pe_dma, pe_gemm, pe_math, pe_tcm, ...   |
+---------------------------------------------------+
|  Topology             (topology/)                 |
|  YAML-driven graph: 4x4 cube mesh, UCIe links,   |
|  IO chiplet with NOC, HBM slices                  |
+---------------------------------------------------+

Prerequisites

  • Python 3.10+
  • Dependencies: simpy, pyyaml, pytest

Installation

# Create virtual environment
python -m venv .venv

# Activate (Windows)
.venv\Scripts\activate

# Activate (Linux/macOS)
source .venv/bin/activate

# Install in editable mode
pip install -e ".[dev]"

Usage

Probe — Latency and Bandwidth Analysis

The probe command runs predefined traffic patterns (H2D write, D2H read, PE DMA) and reports latency breakdown, bottleneck bandwidth, and utilization.

# Run all probe cases
kernbench probe --topology topology.yaml

# Run a specific case
kernbench probe --topology topology.yaml --case pe-local-hbm

Output includes:

  • Summary tables — actual latency, overhead/drain/wire breakdown, effective BW, utilization
  • BW saturation sweep — utilization at 4KB through 1MB to show saturation threshold
  • Per-hop route traces — cumulative timestamps at every node along the path
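The shape of a saturation sweep can be sketched with a simplified model. This is an assumption for illustration, not the formula the probe uses: latency is taken as a fixed overhead plus the drain time `size / peak_bw`, and utilization is the fraction of that latency spent moving data. The 100 ns overhead and the 256 B/ns peak are hypothetical inputs.

```python
def utilization(size_bytes, peak_bw_bytes_per_ns, overhead_ns):
    """Fraction of total latency spent draining data (assumed simplified model)."""
    drain_ns = size_bytes / peak_bw_bytes_per_ns
    return drain_ns / (overhead_ns + drain_ns)


# Transfer sizes from 4 KiB to 1 MiB, doubling each step (9 points).
sizes = [4096 * 2**k for k in range(9)]

# Hypothetical inputs: 256 B/ns peak bandwidth, 100 ns fixed overhead.
sweep = [(s, utilization(s, 256.0, 100.0)) for s in sizes]

for size, util in sweep:
    print(f"{size // 1024:>5} KiB  {util:6.1%}")
```

Under this model small transfers are dominated by the fixed overhead, and utilization climbs toward 100% as the drain time grows, which is the saturation threshold the sweep is meant to expose.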

Run — Execute a Benchmark

# Run a benchmark on all devices
kernbench run --topology topology.yaml --bench qkv_gemm

# Run on a specific device
kernbench run --topology topology.yaml --bench qkv_gemm --device sip:0

Available benchmarks (in benches/):

  • qkv_gemm — single-PE QKV GEMM
  • qkv_gemm_multi_pe — multi-PE QKV GEMM
  • ipcq_allreduce — IPCQ AllReduce

Tests

# Run all tests (278 tests)
pytest

# Run a specific test file
pytest tests/test_probe.py -v

# Run a single test
pytest tests/test_probe.py::test_h2d_latency_monotonic -v

# Run with output shown
pytest -s tests/test_probe.py

Key test files:

| File | Coverage |
| --- | --- |
| test_probe.py | Probe latency invariants, monotonicity, determinism, BW sweep |
| test_engine.py | SimPy engine: submit/wait/complete, routing, multi-SIP |
| test_bw_occupancy.py | Wire BW contention, HOL blocking, back-to-back serialization |
| test_iochiplet_noc_d2h.py | IO chiplet NOC topology, H2D/D2H data paths |
| test_noc_mesh.py | 2D mesh NOC routing, Manhattan distance |
| test_pe_components.py | PE-internal components: cpu, scheduler, dma, gemm |
| test_routing.py | XY routing, address resolution, path finding |
| test_topology_compile.py | YAML topology compilation, node/edge validation |

Topology Configuration

The system is configured via topology.yaml. Key parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| ns_per_mm | 0.01 | Wire propagation delay in ns per mm (10 ps/mm) |
| cube_mesh | 4x4 | Cube grid dimensions per SIP |
| ucie.overhead_ns | 8.0 | UCIe protocol overhead per port (16 ns per crossing) |
| hbm_ctrl.efficiency | 0.8 | HBM effective BW factor (256 GB/s raw to 204.8 GB/s effective) |
| xbar.overhead_ns | 2.0 | Crossbar arbitration delay in ns |
| xbar_to_hbm_bw_gbs | 256.0 | Raw HBM bandwidth per slice in GB/s |
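The derived figures quoted in the descriptions follow from the defaults by simple arithmetic. The sketch below reproduces them; the flat dictionary is only a convenient stand-in for illustration and is not the actual layout of topology.yaml, and the 5 mm wire length is a made-up input.

```python
# Defaults copied from the parameter list above (flat dict is an assumption,
# not the real topology.yaml schema).
params = {
    "ns_per_mm": 0.01,
    "ucie.overhead_ns": 8.0,
    "hbm_ctrl.efficiency": 0.8,
    "xbar_to_hbm_bw_gbs": 256.0,
}


def wire_delay_ns(length_mm):
    """Propagation delay of a wire of the given length."""
    return params["ns_per_mm"] * length_mm


# A crossing passes through a port on each side of the link: 2 x 8 ns.
ucie_crossing_ns = 2 * params["ucie.overhead_ns"]

# Effective HBM bandwidth per slice: raw bandwidth scaled by efficiency.
hbm_effective_gbs = params["xbar_to_hbm_bw_gbs"] * params["hbm_ctrl.efficiency"]

print(f"5 mm wire:      {wire_delay_ns(5.0) * 1000:.0f} ps")
print(f"UCIe crossing:  {ucie_crossing_ns:.1f} ns")
print(f"HBM effective:  {hbm_effective_gbs:.1f} GB/s")
```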

Project Structure

kernbench/
+-- src/kernbench/
|   +-- cli/            # CLI entry points (main, probe, report)
|   +-- common/         # Shared types (Completion, RequestHandle, Trace)
|   +-- components/     # Hardware component models (SimPy processes)
|   +-- di/             # Dependency injection
|   +-- policy/         # Routing (XY), address decoding (PhysAddr)
|   +-- runtime_api/    # Host-facing API (messages, bench runner)
|   +-- sim_engine/     # Discrete-event engine, transaction, wire model
|   +-- topology/       # YAML builder, mesh generator, graph types
|   +-- triton_emu/     # Triton kernel emulation
+-- benches/            # Benchmark implementations
+-- tests/              # pytest test suite (278 tests)
+-- docs/               # ADRs, latency model docs, diagrams
+-- topology.yaml       # System topology configuration
+-- CHANGES.md          # Changelog

Documentation
