diff --git a/CHANGES.md b/CHANGES.md new file mode 100644 index 0000000..c7363eb --- /dev/null +++ b/CHANGES.md @@ -0,0 +1,81 @@ +# Changelog + +## Release 2 (2026-03-19) + +### Probe CLI Improvements + +- **Restructured output**: tables are printed first (H2D, D2H, PE DMA), followed + by detailed per-hop route traces below. This makes it easier to scan summary + numbers before diving into routing details. +- **Per-hop timestamps**: each route trace now shows cumulative nanosecond + timestamps at every hop, so you can see exactly where time is spent. +- **D2H Read section**: added `MemoryReadMsg`-based D2H read probes. D2H models + the full round-trip: forward command path (pcie_ep → hbm) + reverse data path + (hbm → pcie_ep) with host-side drain. +- **Cross-cube best/worst split**: PE DMA cross-cube case is now reported as two + separate rows — best case (adjacent cube) and worst case (farthest cube) — to + show the latency range. +- **Multi-size BW saturation sweep**: each probe case now includes a saturation + table showing utilization at 4KB, 16KB, 64KB, 256KB, and 1MB. This reveals + the data size threshold (~64KB) where overhead becomes negligible and + utilization exceeds 90%. +- **Default data size changed from 4KB to 32KB** for more realistic baseline + measurements. + +### UCIe Overhead Tuning + +- UCIe `overhead_ns` increased from 1.0 to **8.0 ns per port** (16ns per + crossing = TX + RX). This fixes a latency inversion where cross-cube PE DMA + (which traverses UCIe) was incorrectly faster than cross-half PE DMA (which + traverses xbar bridges). Applied to both cube-to-cube UCIe and IO chiplet UCIe. + +### HBM Efficiency Factor + +- Added `efficiency: 0.8` parameter to `hbm_ctrl` in `topology.yaml`. +- The topology builder now applies this multiplicative factor to xbar→hbm edge + bandwidth: `256 GB/s × 0.8 = 204.8 GB/s` effective. 
+- This models real-world DRAM access inefficiency (refresh, bank conflicts, + page misses) rather than assuming ideal spec bandwidth. + +### IOChiplet NOC and D2H Topology + +- H2D MemoryWrite and D2H MemoryRead now route through `io_noc` (IO chiplet + NOC), bypassing `m_cpu`. The `m_cpu` path is reserved for KernelLaunch. +- IOChiplet topology includes: `io_noc`, per-PHY `io_ucie` nodes, and `conn` + nodes for NOC-to-UCIe bridging. +- Added comprehensive tests for IOChiplet structure, H2D/D2H data paths, + and latency invariants. + +### NOC Mesh, Crossbar, and BW Occupancy + +- Added `noc_2d_mesh_v1` component with Manhattan-distance-based internal + routing. +- Added `xbar_v1` crossbar component with top/bottom halves and bridge + interconnect. +- Added BW occupancy model to wires: each directed edge tracks + `available_at` for back-to-back serialization contention. +- Cube mesh visualization diagram (`docs/diagrams/cube_mesh_view.svg`). + +### Test Coverage + +- **278 tests** passing across 16 test files. +- New test files: `test_bw_occupancy.py`, `test_iochiplet_noc_d2h.py`, + `test_noc_mesh.py`. +- New probe tests: monotonic D2H latency, D2H >= H2D invariant, HBM + efficiency verification, sweep saturation, cross-cube best/worst ordering, + per-hop timestamp monotonicity. 
+ +--- + +## Release 1 (initial) + +- Core simulation engine (SimPy-based discrete-event) +- Topology builder with YAML-driven configuration +- Physical address encoding/decoding +- XY routing on 4x4 cube mesh +- Component model: pcie_ep, io_cpu, m_cpu, pe_cpu, pe_scheduler, pe_dma, + pe_gemm, pe_math, pe_tcm, hbm_ctrl, noc, xbar, sram +- Runtime API: MemoryWriteMsg, MemoryReadMsg, PeDmaMsg, KernelLaunchMsg +- Probe CLI for latency analysis +- Benchmark runner with device enumeration +- Benchmarks: QKV GEMM (single/multi-PE), IPCQ AllReduce diff --git a/README.md b/README.md index b276a9f..36d969e 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,159 @@ -# Python Project (VS Code Template) +# kernbench -## Quick start -1. Create venv + install dev deps (editable): - - VS Code: Run Task → `deps: install (editable)` -2. Run tests: - - VS Code: Run Task → `test` -3. Lint / format: - - `lint`, `format` tasks +A discrete-event simulator for AI accelerator hardware, built on [SimPy](https://simpy.readthedocs.io/). +It models the full data path — from host PCIe injection through IO chiplet, NOC mesh, +crossbar, and HBM — to measure end-to-end latency with contention and queueing. -## Structure -- `src/` app code -- `tests/` pytest +## Architecture + +```text +Host (CLI) + | + +-- kernbench run -> run a benchmark (QKV GEMM, AllReduce, ...) + +-- kernbench probe -> latency/BW analysis for predefined traffic patterns + | + v ++---------------------------------------------------+ +| Runtime API (runtime_api/) | +| MemoryWriteMsg, MemoryReadMsg, PeDmaMsg, | +| KernelLaunchMsg | ++---------------------------------------------------+ +| Simulation Engine (sim_engine/) | +| SimPy processes, wire model, BW occupancy | ++---------------------------------------------------+ +| Components (components/) | +| pcie_ep, io_cpu, m_cpu, noc, xbar, hbm_ctrl, | +| pe_cpu, pe_dma, pe_gemm, pe_math, pe_tcm, ... 
| ++---------------------------------------------------+ +| Topology (topology/) | +| YAML-driven graph: 4x4 cube mesh, UCIe links, | +| IO chiplet with NOC, HBM slices | ++---------------------------------------------------+ +``` + +## Prerequisites + +- Python 3.10+ +- Dependencies: `simpy`, `pyyaml`, `pytest` + +## Installation + +```bash +# Create virtual environment +python -m venv .venv + +# Activate (Windows) +.venv\Scripts\activate + +# Activate (Linux/macOS) +source .venv/bin/activate + +# Install in editable mode +pip install -e ".[dev]" +``` + +## Usage + +### Probe — Latency and Bandwidth Analysis + +The `probe` command runs predefined traffic patterns (H2D write, D2H read, +PE DMA) and reports latency breakdown, bottleneck bandwidth, and utilization. + +```bash +# Run all probe cases +kernbench probe --topology topology.yaml + +# Run a specific case +kernbench probe --topology topology.yaml --case pe-local-hbm +``` + +Output includes: + +- **Summary tables** — actual latency, overhead/drain/wire breakdown, effective BW, utilization +- **BW saturation sweep** — utilization at 4KB through 1MB to show saturation threshold +- **Per-hop route traces** — cumulative timestamps at every node along the path + +### Run — Execute a Benchmark + +```bash +# Run a benchmark on all devices +kernbench run --topology topology.yaml --bench qkv_gemm + +# Run on a specific device +kernbench run --topology topology.yaml --bench qkv_gemm --device sip:0 +``` + +Available benchmarks (in `benches/`): + +- `qkv_gemm` — single-PE QKV GEMM +- `qkv_gemm_multi_pe` — multi-PE QKV GEMM +- `ipcq_allreduce` — IPCQ AllReduce + +### Tests + +```bash +# Run all tests (278 tests) +pytest + +# Run a specific test file +pytest tests/test_probe.py -v + +# Run a single test +pytest tests/test_probe.py::test_h2d_latency_monotonic -v + +# Run with output shown +pytest -s tests/test_probe.py +``` + +Key test files: + +| File | Coverage | +| --------------------------- | 
------------------------------------------------------------- | +| `test_probe.py` | Probe latency invariants, monotonicity, determinism, BW sweep | +| `test_engine.py` | SimPy engine: submit/wait/complete, routing, multi-SIP | +| `test_bw_occupancy.py` | Wire BW contention, HOL blocking, back-to-back serialization | +| `test_iochiplet_noc_d2h.py` | IO chiplet NOC topology, H2D/D2H data paths | +| `test_noc_mesh.py` | 2D mesh NOC routing, Manhattan distance | +| `test_pe_components.py` | PE-internal components: cpu, scheduler, dma, gemm | +| `test_routing.py` | XY routing, address resolution, path finding | +| `test_topology_compile.py` | YAML topology compilation, node/edge validation | + +## Topology Configuration + +The system is configured via `topology.yaml`. Key parameters: + +| Parameter | Default | Description | +| --- | --- | --- | +| `ns_per_mm` | 0.01 | Wire propagation delay (10 ps/mm) | +| `cube_mesh` | 4x4 | Cube grid dimensions per SIP | +| `ucie.overhead_ns` | 8.0 | UCIe protocol overhead per port (16ns per crossing) | +| `hbm_ctrl.efficiency` | 0.8 | HBM effective BW factor (256 to 204.8 GB/s) | +| `xbar.overhead_ns` | 2.0 | Crossbar arbitration delay | +| `xbar_to_hbm_bw_gbs` | 256.0 | Raw HBM bandwidth per slice | + +## Project Structure + +```text +kernbench/ ++-- src/kernbench/ +| +-- cli/ # CLI entry points (main, probe, report) +| +-- common/ # Shared types (Completion, RequestHandle, Trace) +| +-- components/ # Hardware component models (SimPy processes) +| +-- di/ # Dependency injection +| +-- policy/ # Routing (XY), address decoding (PhysAddr) +| +-- runtime_api/ # Host-facing API (messages, bench runner) +| +-- sim_engine/ # Discrete-event engine, transaction, wire model +| +-- topology/ # YAML builder, mesh generator, graph types +| +-- triton_emu/ # Triton kernel emulation ++-- benches/ # Benchmark implementations ++-- tests/ # pytest test suite (278 tests) ++-- docs/ # ADRs, latency model docs, diagrams ++-- topology.yaml # System 
topology configuration ++-- CHANGES.md # Changelog +``` + +## Documentation + +- [CHANGES.md](CHANGES.md) — changelog with detailed descriptions of each release +- [docs/latency-model.md](docs/latency-model.md) — latency model explanation with worked examples +- [docs/adr/](docs/adr/) — Architecture Decision Records diff --git a/SPEC.md b/SPEC.md index e881bbe..a850c60 100644 --- a/SPEC.md +++ b/SPEC.md @@ -55,6 +55,10 @@ Major architectural decisions are documented in ADRs and referenced by number. - ADR-0011: Memory addressing simplification (PA-first) - ADR-0012: Host ↔ IO_CPU message schema (PA-first, PE-tagged shards) - ADR-0013: Verification strategy and Phase 1 test plan +- ADR-0014: PE internal execution model (PE_CPU, PE_SCHEDULER, composite commands) +- ADR-0015: Component port/wire model, BW occupancy, and fabric routing +- ADR-0016: IOChiplet NOC and memory data path (M_CPU bypass) +- ADR-0017: Cube NOC 2D mesh architecture (XY routing, contention, attachments) SPEC MUST remain consistent with accepted ADRs. @@ -165,7 +169,8 @@ Development MUST follow a verification-driven workflow: The simulator MUST provide a host-facing runtime API that: - exposes tensor deployment and kernel execution operations, -- submits requests only to endpoint components (e.g., IO_CPU), +- submits requests to endpoint components: PCIE_EP for memory operations + (MemoryWrite/Read), IO_CPU for kernel launch, - owns host-side tensor handles and allocation metadata as PA shard maps, - remains topology-agnostic and does not perform routing or fan-out. 
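Reviewer note on the SPEC.md hunk above: the revised requirement splits host submissions by message type, with memory operations entering at PCIE_EP and kernel launch at IO_CPU. A minimal sketch of that dispatch rule follows. The message class names come from the Runtime API list in this change; the field names, the `ENTRY_ENDPOINT` table, and the `entry_endpoint` helper are hypothetical and are not taken from the kernbench source.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the runtime API messages; the real classes
# in runtime_api/ may carry different fields.
@dataclass(frozen=True)
class MemoryWriteMsg:
    pa: int
    nbytes: int


@dataclass(frozen=True)
class MemoryReadMsg:
    pa: int
    nbytes: int


@dataclass(frozen=True)
class KernelLaunchMsg:
    kernel: str


# Entry endpoint per message type, as stated in the revised SPEC bullet:
# PCIE_EP for memory operations, IO_CPU for kernel launch.
ENTRY_ENDPOINT = {
    MemoryWriteMsg: "pcie_ep",
    MemoryReadMsg: "pcie_ep",
    KernelLaunchMsg: "io_cpu",
}


def entry_endpoint(msg) -> str:
    """Return the endpoint component a host request is submitted to.

    The runtime API only picks the entry point; routing and fan-out
    stay inside the simulated fabric, as the SPEC requires.
    """
    try:
        return ENTRY_ENDPOINT[type(msg)]
    except KeyError:
        raise TypeError(f"unsupported message type: {type(msg).__name__}") from None
```

For example, `entry_endpoint(MemoryReadMsg(pa=0, nbytes=4096))` returns `"pcie_ep"`, while a `KernelLaunchMsg` maps to `"io_cpu"`. A table lookup keeps the runtime API topology-agnostic: it names an endpoint and nothing else.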
diff --git a/docs/adr/ADR-0003-target-system-hierarchy.md b/docs/adr/ADR-0003-target-system-hierarchy.md index 4a685d8..f05bed7 100644 --- a/docs/adr/ADR-0003-target-system-hierarchy.md +++ b/docs/adr/ADR-0003-target-system-hierarchy.md @@ -37,8 +37,10 @@ We model the system hierarchy explicitly: - HBM + memory controller (HBM_CTRL) - XBAR (top/bottom): HBM pseudo-channel crossbar, PE's dedicated path to HBM - Bridge (left/right): connects XBAR.top ↔ XBAR.bottom for cross-half HBM access - - NOC: distributed on-die fabric spanning the entire cube (distance modeled as 0); - carries non-HBM traffic including inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access + - NOC: 2D mesh router grid spanning the entire cube with XY routing and + per-segment contention modeling; carries all intra-cube traffic including + PE DMA to xbar (HBM), inter-cube (UCIe), command (M_CPU↔PE_CPU), and + shared SRAM access. See ADR-0017 for full NOC architecture. - Shared SRAM: cube-level shared memory accessible by all PEs via NOC - management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation - multiple PEs @@ -62,3 +64,4 @@ We model the system hierarchy explicitly: - SPEC R3/R5 - ADR-0005 (diagram views) +- ADR-0017 (cube NOC 2D mesh architecture) diff --git a/docs/adr/ADR-0004-memory-semantics-local-hbm.md b/docs/adr/ADR-0004-memory-semantics-local-hbm.md index ed91e7d..189fcae 100644 --- a/docs/adr/ADR-0004-memory-semantics-local-hbm.md +++ b/docs/adr/ADR-0004-memory-semantics-local-hbm.md @@ -21,8 +21,15 @@ Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth, ### D2. Local HBM bandwidth guarantee contract -- Accesses from a PE to its local HBM MUST guarantee full HBM read/write bandwidth - independent of intervening fabric bandwidth limits. +- Accesses from a PE to its local HBM MUST guarantee full effective HBM + read/write bandwidth independent of intervening fabric bandwidth limits. 
+- Effective HBM bandwidth = spec bandwidth x efficiency factor. + The efficiency factor (configured via `hbm_ctrl.attrs.efficiency`, default 0.8) + models real-world DRAM inefficiencies (refresh cycles, bank conflicts, page + misses). For example: 256 GB/s spec x 0.8 = 204.8 GB/s effective. +- The topology builder applies the efficiency factor to xbar-to-hbm edge + bandwidth at graph construction time, so all downstream routing and latency + computation uses the effective value. - This guarantee is modeled by: - a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point, - while still incurring non-zero latency along explicitly modeled components. @@ -62,3 +69,4 @@ Tests should cover: - SPEC R2/R5 - ADR-0002 (distance/order & explicit bypass) +- ADR-0017 D7 (PE DMA data paths through NOC to HBM) diff --git a/docs/adr/ADR-0014-pe-internal-execution-model.md b/docs/adr/ADR-0014-pe-internal-execution-model.md index 99023a0..3a80216 100644 --- a/docs/adr/ADR-0014-pe-internal-execution-model.md +++ b/docs/adr/ADR-0014-pe-internal-execution-model.md @@ -2,7 +2,7 @@ ## Status -Proposed +Accepted ## Context @@ -123,7 +123,7 @@ Examples include: Execution flow: -``` +```text PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue ``` @@ -133,7 +133,7 @@ Composite commands implement tiled pipelined execution across engines. Each tile executes the following pipeline: -``` +```text Input DMA (READ) → Compute (GEMM or MATH) → Output DMA (WRITE) @@ -158,7 +158,7 @@ Operations for different tiles may overlap when engine resources permit. Allowed overlaps: -``` +```text DMA_READ(t+1) ∥ COMPUTE(t) DMA_WRITE(t−1) ∥ COMPUTE(t) DMA_READ(t) ∥ DMA_WRITE(t) @@ -166,7 +166,7 @@ DMA_READ(t) ∥ DMA_WRITE(t) Disallowed overlaps: -``` +```text GEMM(t) ∥ GEMM(t′) MATH(t) ∥ MATH(t′) GEMM(t) ∥ MATH(t′) @@ -182,7 +182,7 @@ Each engine behaves as a deterministic service resource. 
PE_DMA contains two independent channels. -``` +```text DMA_READ capacity = 1 DMA_WRITE capacity = 1 ``` @@ -195,13 +195,13 @@ Rules: Example allowed: -``` +```text DMA_READ(t+1) ∥ DMA_WRITE(t) ``` Example not allowed: -``` +```text DMA_READ(t) ∥ DMA_READ(t+1) DMA_WRITE(t) ∥ DMA_WRITE(t+1) ``` @@ -210,7 +210,7 @@ DMA_WRITE(t) ∥ DMA_WRITE(t+1) Compute operations share a single compute resource. -``` +```text PE_ACCEL capacity = 1 ``` @@ -230,7 +230,7 @@ Composite commands contain one compute opcode only. Examples: -``` +```text COMPOSITE_GEMM COMPOSITE_MATH ``` @@ -250,13 +250,13 @@ Compute operations use a TCM-centric dataflow model. **Input path (HBM)** -``` +```text HBM → XBAR → PE_DMA (DMA_READ) → PE_TCM ``` **Input path (shared SRAM)** -``` +```text Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM ``` @@ -264,7 +264,7 @@ Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM Compute engines read input tensors from PE_TCM. -``` +```text PE_TCM → GEMM / MATH ``` @@ -274,13 +274,13 @@ Weights for GEMM may optionally stream directly from HBM (via XBAR). Compute results are written to PE_TCM, then DMA writes to HBM. -``` +```text PE_TCM → PE_DMA (DMA_WRITE) → XBAR → HBM ``` **Output path (shared SRAM)** -``` +```text PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM ``` diff --git a/docs/adr/ADR-0015-component-port-wire-model.md b/docs/adr/ADR-0015-component-port-wire-model.md index 393a078..8bf53c1 100644 --- a/docs/adr/ADR-0015-component-port-wire-model.md +++ b/docs/adr/ADR-0015-component-port-wire-model.md @@ -17,8 +17,8 @@ implementation does not enforce this for fabric traversal. 
This ADR defines: - how components communicate via typed port queues, -- how propagation delay is modeled (wire processes), -- the fabric path for Memory R/W through M_CPU.DMA, +- how propagation delay is modeled (wire processes with BW occupancy), +- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch (via M_CPU), - the reduced role of the simulation engine, - M_CPU.DMA as an internal subcomponent of M_CPU. @@ -30,7 +30,7 @@ This ADR defines: Each component has typed input/output ports modeled as SimPy Stores: -``` +```text in_ports: dict[str, simpy.Store] # keyed by source node_id out_ports: dict[str, simpy.Store] # keyed by destination node_id ``` @@ -93,35 +93,51 @@ ADR-0007 D2 must be amended accordingly. --- -### D4. Unified fabric path for Memory R/W and Kernel Launch +### D4. Fabric paths for Memory R/W and Kernel Launch -Both Memory R/W and Kernel Launch use the same fabric path to reach the target cube's M_CPU. -The difference is what M_CPU does upon receiving the request. +Memory R/W and Kernel Launch use **different** fabric paths. +Memory operations bypass M_CPU and route directly to HBM via the crossbar. +Kernel Launch routes through M_CPU for PE fan-out. 
-**Forward path (IO_CPU → target M_CPU):** +**Memory R/W forward path (pcie_ep → hbm_ctrl, M_CPU bypass):** -``` -IO_CPU - → [transit cubes: ucie_out → wire → ucie_in → noc → ucie_out] (zero or more) - → target cube: ucie_in → noc → M_CPU +```text +pcie_ep → io_noc → io_ucie + → [transit cubes: ucie_in → noc → ucie_out] (zero or more) + → target cube: ucie_in → noc → xbar → hbm_ctrl ``` -**At M_CPU (diverges by operation type):** +**Memory R/W completion path:** -``` -Memory R/W: M_CPU → M_CPU.DMA → noc → hbm_ctrl -Kernel Launch: M_CPU → PE[0..n] (parallel fan-out) +```text +hbm_ctrl → xbar → noc → [transit cubes: ucie → noc → ucie] + → io_ucie → io_noc → pcie_ep ``` -**Completion path (reverse, same fabric):** +**Kernel Launch forward path (pcie_ep → io_cpu → M_CPU → PE):** +```text +pcie_ep → io_noc → io_cpu → io_noc → io_ucie + → [transit cubes: ucie_in → noc → ucie_out] (zero or more) + → target cube: ucie_in → noc → M_CPU → PE[0..n] (parallel fan-out) ``` -Memory R/W: hbm_ctrl → noc → M_CPU.DMA → M_CPU -Kernel Launch: PE[0..n] all complete → M_CPU (aggregation) -M_CPU → [transit cubes: ucie → noc → ucie] → IO_CPU → runtime_api +**Kernel Launch completion path:** + +```text +PE[0..n] all complete → M_CPU (aggregation) + → noc → [transit cubes: ucie → noc → ucie] + → io_ucie → io_noc → io_cpu → io_noc → pcie_ep ``` +**Rationale for M_CPU bypass on Memory R/W:** + +Memory write/read operations do not require command interpretation or PE +dispatch — they are direct data transfers to/from HBM. Routing through M_CPU +would add unnecessary overhead (5ns) without functional benefit. The io_noc +inside the IO chiplet handles the routing decision: memory operations go +directly to cube fabric, while kernel launches are forwarded to io_cpu first. + --- ### D5. M_CPU.DMA is an internal subcomponent of M_CPU @@ -146,7 +162,7 @@ M_CPU.DMA does not appear as a node in the compiled topology graph. 
A cube that is not the target of a memory or kernel request acts as a transit node. Transit cubes forward requests without consuming them: -``` +```text ucie_in (from upstream) → noc → ucie_out (to downstream) ``` @@ -187,3 +203,5 @@ It is used for shard comparison in `_route_kernel` and as a regression guard. - ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced) - ADR-0014 D4 (DMA engine capacity=1) - ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal) +- ADR-0016 (IOChiplet NOC and memory data path) +- ADR-0017 (cube NOC 2D mesh architecture) diff --git a/docs/adr/ADR-0016-iochiplet-noc-and-memory-path.md b/docs/adr/ADR-0016-iochiplet-noc-and-memory-path.md new file mode 100644 index 0000000..7808115 --- /dev/null +++ b/docs/adr/ADR-0016-iochiplet-noc-and-memory-path.md @@ -0,0 +1,98 @@ +# ADR-0016: IOChiplet NOC and Memory Data Path + +## Status + +Accepted + +## Context + +ADR-0003 D2 defines IO chiplets as SIP-level components providing PCIe-EP and +IO_CPU interfaces, but does not specify internal routing within the IO chiplet. +ADR-0015 D4 was updated to document the M_CPU bypass for Memory R/W, but the +IO chiplet's internal NOC architecture that enables this routing was not +formally documented. + +The IO chiplet needs an internal routing fabric (io_noc) to: + +- connect pcie_ep, io_cpu, and per-cube UCIe PHY ports +- route memory operations (MemoryWrite/Read) directly to cube fabric without + passing through io_cpu +- route kernel launch commands through io_cpu for command interpretation + +## Decision + +### D1. IOChiplet internal NOC (io_noc) + +Each IO chiplet instance contains an internal NOC node (`io_noc`) that connects: + +- `pcie_ep` — host-facing PCIe endpoint +- `io_cpu` — command processor for kernel launch interpretation +- `io_ucie-{PHY}.conn{N}` — per-PHY connection nodes to cube UCIe ports + +The io_noc is a forwarding-only fabric (`forwarding_v1` implementation) with +zero overhead. 
All routing decisions are made by the simulation engine based +on message type, not by io_noc itself. + +### D2. IOChiplet UCIe decomposition + +Each IO chiplet PHY port is decomposed into: + +- `io_ucie-{PHY}` — the UCIe protocol endpoint (overhead = 8ns) +- `io_ucie-{PHY}.conn{N}` — N connection nodes between io_noc and io_ucie + +This mirrors the cube-side UCIe decomposition (ADR-0015 D1) and allows +multiple independent NOC-to-UCIe connections per PHY. + +### D3. Memory R/W path (M_CPU bypass) + +Memory operations (MemoryWrite, MemoryRead) are routed directly from pcie_ep +through io_noc to the target cube, bypassing io_cpu entirely: + +```text +pcie_ep → io_noc → conn → io_ucie → [cube UCIe] → noc → xbar → hbm_ctrl +``` + +This avoids the 10ns io_cpu overhead for pure data transfers. The simulation +engine's `_process_memory_direct()` method uses `find_memory_path()` which +resolves the shortest path from pcie_ep to the target HBM node. + +### D4. Kernel Launch path (via io_cpu) + +Kernel launch commands require io_cpu for command interpretation and PE +fan-out setup: + +```text +pcie_ep → io_noc → io_cpu → io_noc → conn → io_ucie → [cube UCIe] + → noc → m_cpu → PE +``` + +The engine's `_entry_points()` method routes KernelLaunchMsg through both +pcie_ep (entry) and io_cpu (command processing). + +### D5. IOChiplet-to-cube port mapping + +Each IO chiplet instance declares which cube ports it connects to: + +```yaml +cube_ports: + - { cube: {xy: [0,0]}, cube_side: N, phy: P0, distance_mm: 2.0 } + - { cube: {xy: [1,0]}, cube_side: N, phy: P1, distance_mm: 2.0 } +``` + +The topology builder creates edges from io_ucie PHY nodes to the +corresponding cube UCIe port nodes, with the specified distance and +the IO chiplet's `per_connection_bw_gbs` as link bandwidth. 
+ +## Consequences + +- IO chiplet has a well-defined internal routing fabric +- Memory operations avoid unnecessary io_cpu overhead +- Kernel launch commands still get proper command interpretation +- The io_noc pattern is consistent with cube-level NOC design +- ADR-0003 D2 is extended (not contradicted) by this ADR + +## Links + +- ADR-0003 D2 (IO chiplet definition) +- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch) +- ADR-0012 D1 (host-to-IO_CPU message schema) diff --git a/docs/adr/ADR-0017-cube-noc-2d-mesh.md b/docs/adr/ADR-0017-cube-noc-2d-mesh.md new file mode 100644 index 0000000..9b7af00 --- /dev/null +++ b/docs/adr/ADR-0017-cube-noc-2d-mesh.md @@ -0,0 +1,189 @@ +# ADR-0017: Cube NOC 2D Mesh Architecture + +## Status + +Accepted + +## Context + +ADR-0003 D3 defines the cube-level NOC as a "distributed on-die fabric" but +does not specify the internal routing model, contention semantics, or +attachment topology. The implementation uses a 2D mesh router grid with +XY routing and per-segment contention modeling. This ADR formalizes that +architecture. + +## Decision + +### D1. NOC node and router grid + +Each cube contains a single NOC topology node (`sip{S}.cube{C}.noc`) +implemented as `noc_2d_mesh_v1`. Internally, the NOC models a 2D router +grid generated by `mesh_gen.py`. + +Grid properties: + +- Default dimensions: 6x6 routers (derived from PE layout + UCIe connections) +- Router naming: `r{row}c{col}` (e.g., `r0c0`, `r5c5`) +- HBM exclusion zone: center rows/columns are excluded where HBM physically + occupies space (e.g., r2c2, r2c3, r3c2, r3c3) +- Router positions are derived from physical PE corner placement and cube + geometry + +The NOC overhead_ns is 0.0. Latency is modeled by Manhattan distance +traversal within the mesh (distance_mm x ns_per_mm). + +### D2. XY routing algorithm + +The NOC uses deterministic XY routing: + +1. Horizontal segment: route from source X to destination X at source Y +2. 
Vertical segment: route from destination X at source Y to destination Y + +Each directed segment is identified by a unique link key: + +- Horizontal: `("H", y_band, x_min, x_max, direction)` +- Vertical: `("V", x_band, y_min, y_max, direction)` + +Grid positions are snapped to the router grid, excluding the HBM zone. + +### D3. Contention model + +Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions +sharing a segment (same row or column band, same direction) contend for the +resource. This models link-level serialization in a wormhole-routed mesh. + +With no contention, NOC traversal latency equals the Manhattan distance +multiplied by `ns_per_mm`. Under contention, additional queueing delay +is added by SimPy's resource scheduling. + +### D4. NOC attachment points + +The NOC connects to all major cube-level components: + +```text + UCIe-N (conn x4) + | + +---------+---+---+---------+ + | | | | +PE0.dma ---+ r0c0 | ... | r0c5 +--- PE2.dma +PE0.cpu <--+ | | +--< PE2.cpu + | | | | +UCIe-W ----+ ... | [HBM] | ... +---- UCIe-E +(conn x4) | | zone | | (conn x4) + | r2c0 | | | +M_CPU <--->+ | | | + | r3c0 | | | +SRAM <---->+ | | | + | | | | +PE4.dma ---+ r4c0 | ... | r4c5 +--- PE6.dma +PE4.cpu <--+ | | +--< PE6.cpu + | | | | + +---------+---+---+---------+ + | + UCIe-S (conn x4) + +xbar_top attached to: r0c0, r0c1, r1c4, r1c5 (top-half PE routers) +xbar_bot attached to: r4c0, r4c1, r5c4, r5c5 (bottom-half PE routers) +``` + +### D5. 
NOC edge bandwidths and distances + +| Connection | BW (GB/s) | Distance | Notes | +| --- | --- | --- | --- | +| PE_DMA -> NOC | 256.0 | Physical (PE pos) | Matches HBM slice BW | +| NOC -> PE_CPU | - | 0.0 mm | Command path only | +| NOC <-> xbar_top | 256.0 | 0.0 mm | Per xbar half | +| NOC <-> xbar_bot | 256.0 | 0.0 mm | Per xbar half | +| NOC <-> M_CPU | - | 0.0 mm | Command path | +| NOC <-> SRAM | 128.0 x4 | 0.0 mm | 512 GB/s aggregate | +| NOC <-> UCIe conn | 128.0 | 0.0 mm | Per connection, 4 per port | + +Distance 0.0 mm for most connections reflects the distributed nature of +the NOC; the actual traversal distance is computed internally via Manhattan +distance within the router grid. + +### D6. UCIe decomposition and inter-cube traffic + +Each cube has 4 UCIe ports (N, S, E, W). Each port is decomposed into: + +- 1 `ucie-{PORT}` node: UCIe protocol endpoint (overhead = 8.0 ns) +- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe + +This decomposition enables N=4 independent NOC-to-UCIe connections per port, +each with 128 GB/s bandwidth. Total aggregate per port: 512 GB/s. + +Inter-cube traffic path: + +```text +Source: PE_DMA -> NOC -> conn{i} -> ucie-{PORT} + [UCIe link: 512 GB/s, 1.0mm seam distance] +Target: ucie-{PORT} -> conn{i} -> NOC -> xbar -> HBM +``` + +UCIe overhead (8.0 ns) is applied at each ucie-{PORT} node, so a +full crossing incurs 16 ns (TX port + RX port). + +### D7. 
Data paths through the NOC + +**PE DMA to local HBM (same half):** + +```text +PE_DMA -> NOC -> xbar_top -> HBM_CTRL.slice{0-3} +``` + +**PE DMA to cross-half HBM:** + +```text +PE_DMA -> NOC -> xbar_top -> bridge -> xbar_bot -> HBM_CTRL.slice{4-7} +``` + +**PE DMA to remote cube HBM:** + +```text +PE_DMA -> NOC -> conn -> ucie-E -> [seam] -> ucie-W -> conn -> NOC -> xbar -> HBM +``` + +**Kernel Launch command to PE:** + +```text +[from io_noc] -> ucie -> conn -> NOC -> M_CPU -> NOC -> PE_CPU +``` + +**Shared SRAM access:** + +```text +PE_DMA -> NOC -> SRAM +``` + +### D8. Mesh generation + +The router grid is generated by `mesh_gen.py` based on: + +- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner +- `cube.geometry`: cube physical dimensions and HBM zone +- `cube.ucie.n_connections`: determines router count for UCIe attachment + +The generator produces a `mesh_data` dictionary containing: + +- Router grid with positions and HBM exclusion zones +- PE-to-router attachments (pe_dma, pe_cpu per PE) +- UCIe-to-router attachments (N/S/E/W, distributed across edge routers) +- M_CPU and SRAM router attachments +- xbar_top/bot router assignments (top-half vs bottom-half PE routers) + +## Consequences + +- NOC provides position-aware routing with deterministic latency +- Contention is captured per directed segment (not per-node) +- All cube-internal traffic is explicitly routed through the NOC +- HBM exclusion zone reflects physical die layout constraints +- The mesh generation is fully parameterized by `topology.yaml` + +## Links + +- ADR-0003 D3 (cube-level NOC definition — extended by this ADR) +- ADR-0004 D1 (PE DMA to local HBM path via xbar) +- ADR-0004 D3 (cross-half HBM via bridge) +- ADR-0014 D1 (PE_DMA dual egress: xbar for HBM, NOC for non-HBM) +- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch) +- ADR-0016 D1 (IOChiplet io_noc — analogous pattern at IO chiplet level) diff --git a/docs/latency-model.md b/docs/latency-model.md 
index c5e711e..bfea01f 100644 --- a/docs/latency-model.md +++ b/docs/latency-model.md @@ -7,7 +7,7 @@ Every request flows through a graph of **components** connected by **wires**. The total latency reported is the **actual SimPy wall-clock** (`env.now` delta), not a static formula—so contention and queueing are captured automatically. -``` +```text total_ns (actual) = wire_prop + component_overhead + drain + queueing ├── deterministic ──────────────────┘ │ └── contention-dependent ────────────────────┘ @@ -17,7 +17,7 @@ total_ns (actual) = wire_prop + component_overhead + drain + queueing ### 1. Wire Propagation -``` +```text wire_ns = distance_mm × ns_per_mm (global: 0.01 = 10 ps/mm) ``` @@ -29,7 +29,7 @@ and negligible compared to other costs. ### 2. Component Overhead (`overhead_ns`) -``` +```text component_ns = node.attrs["overhead_ns"] ``` @@ -53,7 +53,7 @@ This models arbitration, protocol processing, pipeline stages, etc. ### 3. Drain (Serialization Delay) -``` +```text drain_ns = nbytes / bottleneck_bw_gbs ``` @@ -65,7 +65,7 @@ Example: 4096 bytes through a path with bottleneck 128 GB/s → `4096 / 128 = 32 ### Formula (Theoretical Lower Bound) -``` +```text formula_ns = Σ(wire_prop) + Σ(overhead_ns) + drain_ns ``` @@ -159,7 +159,7 @@ a timeout or waits on a resource/store. 
The delta between start and done capture Each component is a SimPy process: -``` +```text _fan_in (per in_port) → _inbox (Store) → _worker → out_ports ``` @@ -215,7 +215,7 @@ If request A holds the resource and request B arrives: - SimPy advances B's `env.now` by A's remaining service time - This "extra" time shows up in B's `total_ns` automatically -``` +```text No contention: actual_ns == formula_ns Contention: actual_ns > formula_ns queueing_delay = actual_ns - formula_ns @@ -237,7 +237,7 @@ with self._resource.request() as req: This means a short request arriving during a long request's drain must wait for the full remaining drain time—classic head-of-line blocking: -``` +```text Request A: 4 KB, drain = 16.0 ns (arrives at t=0) Request B: 64 B, drain = 0.25 ns (arrives at t=5) @@ -274,7 +274,7 @@ Setup: PE0 and PE1 in cube0 both read 4096 bytes from their local HBM slices ### Paths -``` +```text DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0 DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1 ``` @@ -284,7 +284,7 @@ DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1 Since slice0 and slice1 are **separate** hbm_ctrl instances, each with its own `simpy.Resource(capacity=1)`, there is no resource competition. -``` +```text DMA A timeline: t=0.00 pe_dma dequeues txn t=0.00 xbar.pe0: overhead_ns=2.0 → t=2.00 @@ -304,13 +304,13 @@ Both complete at ~18.09 ns. `actual == formula` for both. Now suppose both PE0 and PE1 read from **slice0**: -``` +```text DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0 DMA B: pe1.pe_dma → xbar.pe1 → xbar.pe0 → hbm_ctrl.slice0 (chain traversal to reach slice0) ``` -``` +```text DMA A timeline: t=0.00 xbar.pe0(2.0) → wire → hbm_ctrl.slice0 t=2.025 yield req → immediate (first to arrive) @@ -343,7 +343,7 @@ compare `actual_ns` vs `formula_ns` (available in PE DMA traces). 
## Probe Output Explained -``` +```text === PE DMA Latency === Case Target Actual Ovhd Drain Wire Ovhd% Drain% Eff.BW BN.BW Util% pe-local-hbm c0.pe0->c0.slice0 18.09 2.0 16.0 0.08 11.1% 88.5% 226.49 256.0 88.5% @@ -368,7 +368,7 @@ pe-cross-half-hbm c0.pe0->c0.slice4 37.14 5.0 32.0 0.14 13.5% 86.1% fraction. For small transfers (4KB), overhead is significant relative to drain. For large transfers, drain dominates and utilization approaches 100%. -``` +```text 4 KB: Ovhd=2.0, Drain=16.0 → Util=88.5% (overhead is 11% of time) 64 KB: Ovhd=2.0, Drain=256.0 → Util=99.2% (overhead is <1% of time) ```
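Reviewer note on the utilization figures above: they can be reproduced from the latency formula in `docs/latency-model.md` (wire + overhead + drain). The sketch below assumes no contention, so actual latency equals the formula value, and uses the 256 GB/s bottleneck, 2.0 ns overhead, and roughly 0.08 ns wire delay shown in the `pe-local-hbm` row. The function name `probe_row` and its return keys are illustrative only and are not part of the probe CLI.

```python
def probe_row(nbytes: int, bottleneck_bw_gbs: float,
              overhead_ns: float, wire_ns: float) -> dict:
    """Recompute the deterministic columns of one probe table row.

    Assumes an uncontended path, so total latency equals the
    theoretical lower bound (formula_ns) from the latency model.
    """
    # 1 GB/s moves 1 byte per nanosecond, so bytes / (GB/s) yields ns.
    drain_ns = nbytes / bottleneck_bw_gbs
    formula_ns = wire_ns + overhead_ns + drain_ns
    return {
        "drain_ns": drain_ns,
        "formula_ns": formula_ns,
        # Effective bandwidth seen end to end, in GB/s.
        "eff_bw_gbs": nbytes / formula_ns,
        # Share of total time spent actually moving data.
        "util_pct": 100.0 * drain_ns / formula_ns,
    }


if __name__ == "__main__":
    for size in (4 * 1024, 64 * 1024):
        row = probe_row(size, bottleneck_bw_gbs=256.0,
                        overhead_ns=2.0, wire_ns=0.08)
        print(f"{size // 1024:>3} KB: drain={row['drain_ns']:.1f} ns "
              f"util={row['util_pct']:.1f}%")
```

With these inputs, utilization comes out at roughly 88.5% for 4 KB and 99.2% for 64 KB, matching the two rows above. The fixed overhead is amortized as drain time grows with transfer size.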