Add CHANGES.md, README, update SPEC/ADRs for release 2

- CHANGES.md: detailed changelog for release 1 and 2
- README.md: full project docs with install, probe, run, test usage
- SPEC.md: add ADR-0014~0017 references, update R7 for pcie_ep endpoint
- ADR-0003: update NOC description to reference ADR-0017
- ADR-0004: add HBM efficiency factor (0.8) to BW guarantee contract
- ADR-0014: status Proposed -> Accepted
- ADR-0015: update D4 to M_CPU bypass for Memory R/W, add ADR-0016/0017 links
- ADR-0016 (new): IOChiplet NOC and memory data path
- ADR-0017 (new): Cube NOC 2D mesh architecture
- Fix MD lint warnings (unfenced code blocks) across all docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -1,13 +1,159 @@
# Python Project (VS Code Template)
# kernbench

## Quick start
1. Create venv + install dev deps (editable):
- VS Code: Run Task → `deps: install (editable)`
2. Run tests:
- VS Code: Run Task → `test`
3. Lint / format:
- `lint`, `format` tasks

A discrete-event simulator for AI accelerator hardware, built on [SimPy](https://simpy.readthedocs.io/).
It models the full data path, from host PCIe injection through IO chiplet, NOC mesh,
crossbar, and HBM, to measure end-to-end latency with contention and queueing.

## Structure
- `src/` app code
- `tests/` pytest
## Architecture

```text
Host (CLI)
  |
  +-- kernbench run   -> run a benchmark (QKV GEMM, AllReduce, ...)
  +-- kernbench probe -> latency/BW analysis for predefined traffic patterns
  |
  v
+---------------------------------------------------+
| Runtime API (runtime_api/)                        |
|   MemoryWriteMsg, MemoryReadMsg, PeDmaMsg,        |
|   KernelLaunchMsg                                 |
+---------------------------------------------------+
| Simulation Engine (sim_engine/)                   |
|   SimPy processes, wire model, BW occupancy       |
+---------------------------------------------------+
| Components (components/)                          |
|   pcie_ep, io_cpu, m_cpu, noc, xbar, hbm_ctrl,    |
|   pe_cpu, pe_dma, pe_gemm, pe_math, pe_tcm, ...   |
+---------------------------------------------------+
| Topology (topology/)                              |
|   YAML-driven graph: 4x4 cube mesh, UCIe links,   |
|   IO chiplet with NOC, HBM slices                 |
+---------------------------------------------------+
```
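
The "BW occupancy" idea in the simulation engine layer can be pictured with a very small model: a link that serves one transfer at a time, so a later transfer queues behind an earlier one. The sketch below is only an illustration under that assumption. The `Link` class and its fields are hypothetical and are not taken from the kernbench sources, and it uses plain Python rather than SimPy to stay dependency-free.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical serial link: one transfer occupies it at a time."""

    bw_gbs: float            # bandwidth in GB/s (1 GB/s == 1 byte/ns)
    free_at_ns: float = 0.0  # time at which the link next becomes idle

    def transfer(self, arrival_ns: float, size_bytes: int) -> float:
        """Return the completion time (ns) of a transfer arriving at arrival_ns."""
        # A transfer cannot start before the link has drained earlier traffic.
        start_ns = max(arrival_ns, self.free_at_ns)
        self.free_at_ns = start_ns + size_bytes / self.bw_gbs
        return self.free_at_ns


link = Link(bw_gbs=256.0)
first_done = link.transfer(arrival_ns=0.0, size_bytes=4096)
second_done = link.transfer(arrival_ns=0.0, size_bytes=4096)
print(first_done, second_done)  # second transfer finishes later: it queued
```

Both transfers arrive at time zero, but the second one completes one full drain time after the first, which is the back-to-back serialization effect a contention-aware latency model has to capture.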

## Prerequisites

- Python 3.10+
- Dependencies: `simpy`, `pyyaml`, `pytest`

## Installation

```bash
# Create virtual environment
python -m venv .venv

# Activate (Windows)
.venv\Scripts\activate

# Activate (Linux/macOS)
source .venv/bin/activate

# Install in editable mode
pip install -e ".[dev]"
```

## Usage

### Probe: Latency and Bandwidth Analysis

The `probe` command runs predefined traffic patterns (H2D write, D2H read,
PE DMA) and reports latency breakdown, bottleneck bandwidth, and utilization.

```bash
# Run all probe cases
kernbench probe --topology topology.yaml

# Run a specific case
kernbench probe --topology topology.yaml --case pe-local-hbm
```

Output includes:

- **Summary tables**: actual latency, overhead/drain/wire breakdown, effective BW, utilization
- **BW saturation sweep**: utilization at 4KB through 1MB to show saturation threshold
- **Per-hop route traces**: cumulative timestamps at every node along the path
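
To see why a saturation sweep over transfer sizes is informative, consider a simplified model in which each transfer pays a fixed overhead plus a drain time of `size / bandwidth`. This is an assumption made for illustration, not kernbench's actual latency model. The function name, the 100 ns overhead, and the 4x size steps are all hypothetical; only the 204.8 GB/s effective HBM bandwidth comes from the topology parameters in this README.

```python
def utilization(size_bytes: int, bw_gbs: float, overhead_ns: float) -> float:
    """Fraction of total transfer time spent moving data (assumed model)."""
    drain_ns = size_bytes / bw_gbs  # 1 GB/s == 1 byte/ns
    return drain_ns / (overhead_ns + drain_ns)


# 4 KB to 1 MB in 4x steps: 4 KB, 16 KB, 64 KB, 256 KB, 1 MB
SIZES = [4 * 1024 * 4**i for i in range(5)]
EFFECTIVE_BW_GBS = 204.8  # 256 GB/s raw * 0.8 efficiency
OVERHEAD_NS = 100.0       # hypothetical fixed per-transfer overhead

sweep = [(s, utilization(s, EFFECTIVE_BW_GBS, OVERHEAD_NS)) for s in SIZES]
for size, util in sweep:
    print(f"{size // 1024:>5} KB  utilization={util:.3f}")
```

Under this model small transfers are dominated by the fixed overhead, and utilization climbs toward 1.0 as the drain time grows, which is the kind of threshold a sweep from 4KB to 1MB is meant to expose.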

### Run: Execute a Benchmark

```bash
# Run a benchmark on all devices
kernbench run --topology topology.yaml --bench qkv_gemm

# Run on a specific device
kernbench run --topology topology.yaml --bench qkv_gemm --device sip:0
```

Available benchmarks (in `benches/`):

- `qkv_gemm`: single-PE QKV GEMM
- `qkv_gemm_multi_pe`: multi-PE QKV GEMM
- `ipcq_allreduce`: IPCQ AllReduce

### Tests

```bash
# Run all tests (278 tests)
pytest

# Run a specific test file
pytest tests/test_probe.py -v

# Run a single test
pytest tests/test_probe.py::test_h2d_latency_monotonic -v

# Run with output shown
pytest -s tests/test_probe.py
```

Key test files:

| File                        | Coverage                                                      |
| --------------------------- | ------------------------------------------------------------- |
| `test_probe.py`             | Probe latency invariants, monotonicity, determinism, BW sweep |
| `test_engine.py`            | SimPy engine: submit/wait/complete, routing, multi-SIP        |
| `test_bw_occupancy.py`      | Wire BW contention, HOL blocking, back-to-back serialization  |
| `test_iochiplet_noc_d2h.py` | IO chiplet NOC topology, H2D/D2H data paths                   |
| `test_noc_mesh.py`          | 2D mesh NOC routing, Manhattan distance                       |
| `test_pe_components.py`     | PE-internal components: cpu, scheduler, dma, gemm             |
| `test_routing.py`           | XY routing, address resolution, path finding                  |
| `test_topology_compile.py`  | YAML topology compilation, node/edge validation               |
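
The routing tests mention XY routing and Manhattan distance on a 2D mesh. As a rough illustration of how the two relate, the sketch below implements textbook dimension-ordered (X first, then Y) routing on a grid. The function names and the coordinate convention are assumptions for this example and are not kernbench's API.

```python
Coord = tuple[int, int]


def xy_route(src: Coord, dst: Coord) -> list[Coord]:
    """Return the nodes visited when routing X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:          # move along X until the column matches
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:          # then move along Y until the row matches
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def manhattan(src: Coord, dst: Coord) -> int:
    """Manhattan distance between two mesh coordinates."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


route = xy_route((0, 0), (3, 2))
hops = len(route) - 1
print(route, hops, manhattan((0, 0), (3, 2)))
```

Because XY routing never moves away from the destination, its hop count always equals the Manhattan distance, which makes that distance a convenient invariant to assert in mesh routing tests.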

## Topology Configuration

The system is configured via `topology.yaml`. Key parameters:

| Parameter             | Default | Description                                                  |
| --------------------- | ------- | ------------------------------------------------------------ |
| `ns_per_mm`           | 0.01    | Wire propagation delay (10 ps/mm)                            |
| `cube_mesh`           | 4x4     | Cube grid dimensions per SIP                                 |
| `ucie.overhead_ns`    | 8.0     | UCIe protocol overhead per port (16 ns per crossing)         |
| `hbm_ctrl.efficiency` | 0.8     | HBM effective BW factor (256 GB/s raw to 204.8 GB/s effective) |
| `xbar.overhead_ns`    | 2.0     | Crossbar arbitration delay                                   |
| `xbar_to_hbm_bw_gbs`  | 256.0   | Raw HBM bandwidth per slice (GB/s)                           |
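
The derived figures quoted in the table follow from simple arithmetic on the defaults. The sketch below reproduces them. The helper names are hypothetical, the 5 mm wire length is an arbitrary example value, and the two-ports-per-crossing reading of the UCIe row is an assumption based on the 8 ns and 16 ns figures.

```python
def effective_hbm_bw_gbs(raw_bw_gbs: float, efficiency: float) -> float:
    """Raw HBM bandwidth scaled by the efficiency factor."""
    return raw_bw_gbs * efficiency


def ucie_crossing_ns(overhead_per_port_ns: float, ports: int = 2) -> float:
    """Protocol overhead for one crossing, assuming it traverses two ports."""
    return overhead_per_port_ns * ports


def wire_delay_ns(length_mm: float, ns_per_mm: float = 0.01) -> float:
    """Propagation delay over a wire of the given length."""
    return length_mm * ns_per_mm


hbm_bw = effective_hbm_bw_gbs(256.0, 0.8)   # 204.8 GB/s
crossing = ucie_crossing_ns(8.0)            # 16 ns
wire = wire_delay_ns(5.0)                   # 5 mm is an example length
print(hbm_bw, crossing, wire)
```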

## Project Structure

```text
kernbench/
+-- src/kernbench/
|   +-- cli/            # CLI entry points (main, probe, report)
|   +-- common/         # Shared types (Completion, RequestHandle, Trace)
|   +-- components/     # Hardware component models (SimPy processes)
|   +-- di/             # Dependency injection
|   +-- policy/         # Routing (XY), address decoding (PhysAddr)
|   +-- runtime_api/    # Host-facing API (messages, bench runner)
|   +-- sim_engine/     # Discrete-event engine, transaction, wire model
|   +-- topology/       # YAML builder, mesh generator, graph types
|   +-- triton_emu/     # Triton kernel emulation
+-- benches/            # Benchmark implementations
+-- tests/              # pytest test suite (278 tests)
+-- docs/               # ADRs, latency model docs, diagrams
+-- topology.yaml       # System topology configuration
+-- CHANGES.md          # Changelog
```

## Documentation

- [CHANGES.md](CHANGES.md): changelog with detailed descriptions of each release
- [docs/latency-model.md](docs/latency-model.md): latency model explanation with worked examples
- [docs/adr/](docs/adr/): Architecture Decision Records