Add CHANGES.md, README, update SPEC/ADRs for release 2
- CHANGES.md: detailed changelog for release 1 and 2
- README.md: full project docs with install, probe, run, test usage
- SPEC.md: add ADR-0014~0017 references, update R7 for pcie_ep endpoint
- ADR-0003: update NOC description to reference ADR-0017
- ADR-0004: add HBM efficiency factor (0.8) to BW guarantee contract
- ADR-0014: status Proposed -> Accepted
- ADR-0015: update D4 to M_CPU bypass for Memory R/W, add ADR-0016/0017 links
- ADR-0016 (new): IOChiplet NOC and memory data path
- ADR-0017 (new): Cube NOC 2D mesh architecture
- Fix MD lint warnings (unfenced code blocks) across all docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
+81
@@ -0,0 +1,81 @@
|
||||
# Changelog
|
||||
|
||||
## Release 2 (2026-03-19)
|
||||
|
||||
### Probe CLI Improvements
|
||||
|
||||
- **Restructured output**: tables are printed first (H2D, D2H, PE DMA), followed
|
||||
by detailed per-hop route traces below. This makes it easier to scan summary
|
||||
numbers before diving into routing details.
|
||||
- **Per-hop timestamps**: each route trace now shows cumulative nanosecond
|
||||
timestamps at every hop, so you can see exactly where time is spent.
|
||||
- **D2H Read section**: added `MemoryReadMsg`-based D2H read probes. D2H models
|
||||
the full round-trip: forward command path (pcie_ep → hbm) + reverse data path
|
||||
(hbm → pcie_ep) with host-side drain.
|
||||
- **Cross-cube best/worst split**: PE DMA cross-cube case is now reported as two
|
||||
separate rows — best case (adjacent cube) and worst case (farthest cube) — to
|
||||
show the latency range.
|
||||
- **Multi-size BW saturation sweep**: each probe case now includes a saturation
|
||||
table showing utilization at 4KB, 16KB, 64KB, 256KB, and 1MB. This reveals
|
||||
the data size threshold (~64KB) where overhead becomes negligible and
|
||||
utilization exceeds 90%.
|
||||
- **Default data size changed from 4KB to 32KB** for more realistic baseline
|
||||
measurements.
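
To illustrate why utilization climbs with transfer size, here is a minimal sketch that assumes a single fixed per-transfer overhead on top of the 204.8 GB/s effective HBM bandwidth. `FIXED_OVERHEAD_NS = 30.0` and the `utilization` helper are assumptions made for this example; they are not values or functions taken from the probe.

```python
# Hypothetical model: utilization = ideal transfer time / (ideal time + fixed overhead).
EFFECTIVE_BW_GBS = 204.8   # 256 GB/s x 0.8 (see the HBM efficiency factor section)
FIXED_OVERHEAD_NS = 30.0   # assumed for illustration only, not a measured value


def utilization(size_bytes: int,
                bw_gbs: float = EFFECTIVE_BW_GBS,
                overhead_ns: float = FIXED_OVERHEAD_NS) -> float:
    """Fraction of the bottleneck bandwidth achieved for one transfer."""
    # 1 GB/s moves 1 byte per nanosecond, so bytes / (GB/s) yields nanoseconds.
    transfer_ns = size_bytes / bw_gbs
    return transfer_ns / (transfer_ns + overhead_ns)


# Same sizes as the probe sweep: 4KB, 16KB, 64KB, 256KB, 1MB.
SWEEP_SIZES = [4 * 1024, 16 * 1024, 64 * 1024, 256 * 1024, 1024 * 1024]

for size in SWEEP_SIZES:
    print(f"{size // 1024:>5} KB  {utilization(size):6.1%}")
```

Under these assumed numbers the 90% mark is crossed between 16KB and 64KB; the real threshold depends on the overheads along the actual route.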
|
||||
|
||||
### UCIe Overhead Tuning
|
||||
|
||||
- UCIe `overhead_ns` increased from 1.0 ns to **8.0 ns per port**, so each
  crossing now costs 16 ns (TX port + RX port). This fixes a latency inversion
  where cross-cube PE DMA (which traverses UCIe) was incorrectly faster than
  cross-half PE DMA (which traverses xbar bridges). Applied to both
  cube-to-cube UCIe and IO chiplet UCIe.
|
||||
|
||||
### HBM Efficiency Factor
|
||||
|
||||
- Added `efficiency: 0.8` parameter to `hbm_ctrl` in `topology.yaml`.
|
||||
- The topology builder now applies this multiplicative factor to xbar→hbm edge
|
||||
bandwidth: `256 GB/s × 0.8 = 204.8 GB/s` effective.
|
||||
- This models real-world DRAM access inefficiency (refresh, bank conflicts,
|
||||
page misses) rather than assuming ideal spec bandwidth.
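
The scaling step can be sketched as follows. The edge dictionaries, the `apply_hbm_efficiency` helper, and the node names are hypothetical stand-ins for whatever the topology builder actually uses; only the 256 GB/s and 0.8 figures come from this changelog.

```python
def effective_bw_gbs(spec_bw_gbs: float, efficiency: float) -> float:
    """Effective bandwidth = spec bandwidth x efficiency factor."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return spec_bw_gbs * efficiency


def apply_hbm_efficiency(edges: list[dict], efficiency: float) -> list[dict]:
    """Return a copy of `edges` with only xbar->hbm edges scaled (hypothetical edge format)."""
    scaled = []
    for edge in edges:
        edge = dict(edge)  # copy so the caller's data is left untouched
        if edge["src"].startswith("xbar") and edge["dst"].startswith("hbm"):
            edge["bw_gbs"] = effective_bw_gbs(edge["bw_gbs"], efficiency)
        scaled.append(edge)
    return scaled


edges = [
    {"src": "noc", "dst": "xbar_top", "bw_gbs": 256.0},
    {"src": "xbar_top", "dst": "hbm_ctrl.slice0", "bw_gbs": 256.0},
]
scaled_edges = apply_hbm_efficiency(edges, 0.8)
```

Scaling the edge once at build time means routing and latency code never needs to know about the efficiency factor.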
|
||||
|
||||
### IOChiplet NOC and D2H Topology
|
||||
|
||||
- H2D MemoryWrite and D2H MemoryRead now route through `io_noc` (IO chiplet
|
||||
NOC), bypassing `m_cpu`. The `m_cpu` path is reserved for KernelLaunch.
|
||||
- IOChiplet topology includes: `io_noc`, per-PHY `io_ucie` nodes, and `conn`
|
||||
nodes for NOC-to-UCIe bridging.
|
||||
- Added comprehensive tests for IOChiplet structure, H2D/D2H data paths,
|
||||
and latency invariants.
|
||||
|
||||
### NOC Mesh, Crossbar, and BW Occupancy
|
||||
|
||||
- Added `noc_2d_mesh_v1` component with Manhattan-distance-based internal
|
||||
routing.
|
||||
- Added `xbar_v1` crossbar component with top/bottom halves and bridge
|
||||
interconnect.
|
||||
- Added BW occupancy model to wires: each directed edge tracks
|
||||
`available_at` for back-to-back serialization contention.
|
||||
- Cube mesh visualization diagram (`docs/diagrams/cube_mesh_view.svg`).
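
The `available_at` idea can be sketched without SimPy. The `Wire` class below is a hypothetical illustration, not the simulator's wire model: it tracks only serialization time and ignores propagation delay.

```python
from dataclasses import dataclass


@dataclass
class Wire:
    """One directed edge with a bandwidth limit (hypothetical illustration)."""
    bw_gbs: float
    available_at: float = 0.0  # earliest time (ns) the edge can start a new transfer

    def send(self, now_ns: float, size_bytes: int) -> float:
        """Reserve the edge for one transfer and return its completion time in ns."""
        start = max(now_ns, self.available_at)   # wait if the edge is still busy
        done = start + size_bytes / self.bw_gbs  # 1 GB/s equals 1 byte per ns
        self.available_at = done                 # the next transfer queues behind this one
        return done


wire = Wire(bw_gbs=128.0)
first = wire.send(now_ns=0.0, size_bytes=1280)   # starts immediately
second = wire.send(now_ns=0.0, size_bytes=1280)  # back-to-back: waits for `first`
late = wire.send(now_ns=100.0, size_bytes=1280)  # edge is idle again by t=100
```

Two transfers issued at the same instant finish one after the other, which is the back-to-back serialization the bullet above describes.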
|
||||
|
||||
### Test Coverage
|
||||
|
||||
- **278 tests** passing across 16 test files.
|
||||
- New test files: `test_bw_occupancy.py`, `test_iochiplet_noc_d2h.py`,
|
||||
`test_noc_mesh.py`.
|
||||
- New probe tests: monotonic D2H latency, D2H >= H2D invariant, HBM
|
||||
efficiency verification, sweep saturation, cross-cube best/worst ordering,
|
||||
per-hop timestamp monotonicity.
|
||||
|
||||
---
|
||||
|
||||
## Release 1 (initial)
|
||||
|
||||
- Core simulation engine (SimPy-based discrete-event)
|
||||
- Topology builder with YAML-driven configuration
|
||||
- Physical address encoding/decoding
|
||||
- XY routing on 4x4 cube mesh
|
||||
- Component model: pcie_ep, io_cpu, m_cpu, pe_cpu, pe_scheduler, pe_dma,
|
||||
pe_gemm, pe_math, pe_tcm, hbm_ctrl, noc, xbar, sram
|
||||
- Runtime API: MemoryWriteMsg, MemoryReadMsg, PeDmaMsg, KernelLaunchMsg
|
||||
- Probe CLI for latency analysis
|
||||
- Benchmark runner with device enumeration
|
||||
- Benchmarks: QKV GEMM (single/multi-PE), IPCQ AllReduce
|
||||
@@ -1,13 +1,159 @@
|
||||
# Python Project (VS Code Template)
|
||||
# kernbench
|
||||
|
||||
## Quick start
|
||||
1. Create venv + install dev deps (editable):
|
||||
- VS Code: Run Task → `deps: install (editable)`
|
||||
2. Run tests:
|
||||
- VS Code: Run Task → `test`
|
||||
3. Lint / format:
|
||||
- `lint`, `format` tasks
|
||||
A discrete-event simulator for AI accelerator hardware, built on [SimPy](https://simpy.readthedocs.io/).
|
||||
It models the full data path — from host PCIe injection through IO chiplet, NOC mesh,
|
||||
crossbar, and HBM — to measure end-to-end latency with contention and queueing.
|
||||
|
||||
## Structure
|
||||
- `src/` app code
|
||||
- `tests/` pytest
|
||||
## Architecture
|
||||
|
||||
```text
Host (CLI)
  |
  +-- kernbench run    -> run a benchmark (QKV GEMM, AllReduce, ...)
  +-- kernbench probe  -> latency/BW analysis for predefined traffic patterns
  |
  v
+---------------------------------------------------+
| Runtime API (runtime_api/)                        |
|   MemoryWriteMsg, MemoryReadMsg, PeDmaMsg,        |
|   KernelLaunchMsg                                 |
+---------------------------------------------------+
| Simulation Engine (sim_engine/)                   |
|   SimPy processes, wire model, BW occupancy       |
+---------------------------------------------------+
| Components (components/)                          |
|   pcie_ep, io_cpu, m_cpu, noc, xbar, hbm_ctrl,    |
|   pe_cpu, pe_dma, pe_gemm, pe_math, pe_tcm, ...   |
+---------------------------------------------------+
| Topology (topology/)                              |
|   YAML-driven graph: 4x4 cube mesh, UCIe links,   |
|   IO chiplet with NOC, HBM slices                 |
+---------------------------------------------------+
```
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Python 3.10+
|
||||
- Dependencies: `simpy`, `pyyaml`, `pytest`
|
||||
|
||||
## Installation
|
||||
|
||||
```bash
# Create virtual environment
python -m venv .venv

# Activate (Windows)
.venv\Scripts\activate

# Activate (Linux/macOS)
source .venv/bin/activate

# Install in editable mode
pip install -e ".[dev]"
```
|
||||
|
||||
## Usage
|
||||
|
||||
### Probe — Latency and Bandwidth Analysis
|
||||
|
||||
The `probe` command runs predefined traffic patterns (H2D write, D2H read,
|
||||
PE DMA) and reports latency breakdown, bottleneck bandwidth, and utilization.
|
||||
|
||||
```bash
# Run all probe cases
kernbench probe --topology topology.yaml

# Run a specific case
kernbench probe --topology topology.yaml --case pe-local-hbm
```
|
||||
|
||||
Output includes:
|
||||
|
||||
- **Summary tables** — actual latency, overhead/drain/wire breakdown, effective BW, utilization
|
||||
- **BW saturation sweep** — utilization at 4KB through 1MB to show saturation threshold
|
||||
- **Per-hop route traces** — cumulative timestamps at every node along the path
|
||||
|
||||
### Run — Execute a Benchmark
|
||||
|
||||
```bash
# Run a benchmark on all devices
kernbench run --topology topology.yaml --bench qkv_gemm

# Run on a specific device
kernbench run --topology topology.yaml --bench qkv_gemm --device sip:0
```
|
||||
|
||||
Available benchmarks (in `benches/`):
|
||||
|
||||
- `qkv_gemm` — single-PE QKV GEMM
|
||||
- `qkv_gemm_multi_pe` — multi-PE QKV GEMM
|
||||
- `ipcq_allreduce` — IPCQ AllReduce
|
||||
|
||||
### Tests
|
||||
|
||||
```bash
# Run all tests (278 tests)
pytest

# Run a specific test file
pytest tests/test_probe.py -v

# Run a single test
pytest tests/test_probe.py::test_h2d_latency_monotonic -v

# Run with output shown
pytest -s tests/test_probe.py
```
|
||||
|
||||
Key test files:
|
||||
|
||||
| File                        | Coverage                                                      |
| --------------------------- | ------------------------------------------------------------- |
| `test_probe.py`             | Probe latency invariants, monotonicity, determinism, BW sweep |
| `test_engine.py`            | SimPy engine: submit/wait/complete, routing, multi-SIP        |
| `test_bw_occupancy.py`      | Wire BW contention, HOL blocking, back-to-back serialization  |
| `test_iochiplet_noc_d2h.py` | IO chiplet NOC topology, H2D/D2H data paths                   |
| `test_noc_mesh.py`          | 2D mesh NOC routing, Manhattan distance                       |
| `test_pe_components.py`     | PE-internal components: cpu, scheduler, dma, gemm             |
| `test_routing.py`           | XY routing, address resolution, path finding                  |
| `test_topology_compile.py`  | YAML topology compilation, node/edge validation               |
|
||||
|
||||
## Topology Configuration
|
||||
|
||||
The system is configured via `topology.yaml`. Key parameters:
|
||||
|
||||
| Parameter | Default | Description |
| --- | --- | --- |
| `ns_per_mm` | 0.01 | Wire propagation delay (10 ps/mm) |
| `cube_mesh` | 4x4 | Cube grid dimensions per SIP |
| `ucie.overhead_ns` | 8.0 | UCIe protocol overhead per port (16 ns per crossing) |
| `hbm_ctrl.efficiency` | 0.8 | HBM effective BW factor (256 to 204.8 GB/s) |
| `xbar.overhead_ns` | 2.0 | Crossbar arbitration delay |
| `xbar_to_hbm_bw_gbs` | 256.0 | Raw HBM bandwidth per slice |
|
||||
|
||||
## Project Structure
|
||||
|
||||
```text
kernbench/
+-- src/kernbench/
|   +-- cli/            # CLI entry points (main, probe, report)
|   +-- common/         # Shared types (Completion, RequestHandle, Trace)
|   +-- components/     # Hardware component models (SimPy processes)
|   +-- di/             # Dependency injection
|   +-- policy/         # Routing (XY), address decoding (PhysAddr)
|   +-- runtime_api/    # Host-facing API (messages, bench runner)
|   +-- sim_engine/     # Discrete-event engine, transaction, wire model
|   +-- topology/       # YAML builder, mesh generator, graph types
|   +-- triton_emu/     # Triton kernel emulation
+-- benches/            # Benchmark implementations
+-- tests/              # pytest test suite (278 tests)
+-- docs/               # ADRs, latency model docs, diagrams
+-- topology.yaml       # System topology configuration
+-- CHANGES.md          # Changelog
```
|
||||
|
||||
## Documentation
|
||||
|
||||
- [CHANGES.md](CHANGES.md) — changelog with detailed descriptions of each release
|
||||
- [docs/latency-model.md](docs/latency-model.md) — latency model explanation with worked examples
|
||||
- [docs/adr/](docs/adr/) — Architecture Decision Records
|
||||
|
||||
@@ -55,6 +55,10 @@ Major architectural decisions are documented in ADRs and referenced by number.
|
||||
- ADR-0011: Memory addressing simplification (PA-first)
|
||||
- ADR-0012: Host ↔ IO_CPU message schema (PA-first, PE-tagged shards)
|
||||
- ADR-0013: Verification strategy and Phase 1 test plan
|
||||
- ADR-0014: PE internal execution model (PE_CPU, PE_SCHEDULER, composite commands)
|
||||
- ADR-0015: Component port/wire model, BW occupancy, and fabric routing
|
||||
- ADR-0016: IOChiplet NOC and memory data path (M_CPU bypass)
|
||||
- ADR-0017: Cube NOC 2D mesh architecture (XY routing, contention, attachments)
|
||||
|
||||
SPEC MUST remain consistent with accepted ADRs.
|
||||
|
||||
@@ -165,7 +169,8 @@ Development MUST follow a verification-driven workflow:
|
||||
The simulator MUST provide a host-facing runtime API that:
|
||||
|
||||
- exposes tensor deployment and kernel execution operations,
|
||||
- submits requests only to endpoint components (e.g., IO_CPU),
|
||||
- submits requests to endpoint components: PCIE_EP for memory operations
|
||||
(MemoryWrite/Read), IO_CPU for kernel launch,
|
||||
- owns host-side tensor handles and allocation metadata as PA shard maps,
|
||||
- remains topology-agnostic and does not perform routing or fan-out.
|
||||
|
||||
|
||||
@@ -37,8 +37,10 @@ We model the system hierarchy explicitly:
|
||||
- HBM + memory controller (HBM_CTRL)
|
||||
- XBAR (top/bottom): HBM pseudo-channel crossbar, PE's dedicated path to HBM
|
||||
- Bridge (left/right): connects XBAR.top ↔ XBAR.bottom for cross-half HBM access
|
||||
- NOC: distributed on-die fabric spanning the entire cube (distance modeled as 0);
|
||||
carries non-HBM traffic including inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access
|
||||
- NOC: 2D mesh router grid spanning the entire cube with XY routing and
|
||||
per-segment contention modeling; carries all intra-cube traffic including
|
||||
PE DMA to xbar (HBM), inter-cube (UCIe), command (M_CPU↔PE_CPU), and
|
||||
shared SRAM access. See ADR-0017 for full NOC architecture.
|
||||
- Shared SRAM: cube-level shared memory accessible by all PEs via NOC
|
||||
- management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
|
||||
- multiple PEs
|
||||
@@ -62,3 +64,4 @@ We model the system hierarchy explicitly:
|
||||
|
||||
- SPEC R3/R5
|
||||
- ADR-0005 (diagram views)
|
||||
- ADR-0017 (cube NOC 2D mesh architecture)
|
||||
|
||||
@@ -21,8 +21,15 @@ Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth,
|
||||
|
||||
### D2. Local HBM bandwidth guarantee contract
|
||||
|
||||
- Accesses from a PE to its local HBM MUST guarantee full HBM read/write bandwidth
|
||||
independent of intervening fabric bandwidth limits.
|
||||
- Accesses from a PE to its local HBM MUST guarantee full effective HBM
|
||||
read/write bandwidth independent of intervening fabric bandwidth limits.
|
||||
- Effective HBM bandwidth = spec bandwidth x efficiency factor.
|
||||
The efficiency factor (configured via `hbm_ctrl.attrs.efficiency`, default 0.8)
|
||||
models real-world DRAM inefficiencies (refresh cycles, bank conflicts, page
|
||||
misses). For example: 256 GB/s spec x 0.8 = 204.8 GB/s effective.
|
||||
- The topology builder applies the efficiency factor to xbar-to-hbm edge
|
||||
bandwidth at graph construction time, so all downstream routing and latency
|
||||
computation uses the effective value.
|
||||
- This guarantee is modeled by:
|
||||
- a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point,
|
||||
- while still incurring non-zero latency along explicitly modeled components.
|
||||
@@ -62,3 +69,4 @@ Tests should cover:
|
||||
|
||||
- SPEC R2/R5
|
||||
- ADR-0002 (distance/order & explicit bypass)
|
||||
- ADR-0017 D7 (PE DMA data paths through NOC to HBM)
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
|
||||
## Status
|
||||
|
||||
Proposed
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
@@ -123,7 +123,7 @@ Examples include:
|
||||
|
||||
Execution flow:
|
||||
|
||||
```
|
||||
```text
|
||||
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
|
||||
```
|
||||
|
||||
@@ -133,7 +133,7 @@ Composite commands implement tiled pipelined execution across engines.
|
||||
|
||||
Each tile executes the following pipeline:
|
||||
|
||||
```
|
||||
```text
|
||||
Input DMA (READ)
|
||||
→ Compute (GEMM or MATH)
|
||||
→ Output DMA (WRITE)
|
||||
@@ -158,7 +158,7 @@ Operations for different tiles may overlap when engine resources permit.
|
||||
|
||||
Allowed overlaps:
|
||||
|
||||
```
|
||||
```text
|
||||
DMA_READ(t+1) ∥ COMPUTE(t)
|
||||
DMA_WRITE(t−1) ∥ COMPUTE(t)
|
||||
DMA_READ(t) ∥ DMA_WRITE(t)
|
||||
@@ -166,7 +166,7 @@ DMA_READ(t) ∥ DMA_WRITE(t)
|
||||
|
||||
Disallowed overlaps:
|
||||
|
||||
```
|
||||
```text
|
||||
GEMM(t) ∥ GEMM(t′)
|
||||
MATH(t) ∥ MATH(t′)
|
||||
GEMM(t) ∥ MATH(t′)
|
||||
@@ -182,7 +182,7 @@ Each engine behaves as a deterministic service resource.
|
||||
|
||||
PE_DMA contains two independent channels.
|
||||
|
||||
```
|
||||
```text
|
||||
DMA_READ capacity = 1
|
||||
DMA_WRITE capacity = 1
|
||||
```
|
||||
@@ -195,13 +195,13 @@ Rules:
|
||||
|
||||
Example allowed:
|
||||
|
||||
```
|
||||
```text
|
||||
DMA_READ(t+1) ∥ DMA_WRITE(t)
|
||||
```
|
||||
|
||||
Example not allowed:
|
||||
|
||||
```
|
||||
```text
|
||||
DMA_READ(t) ∥ DMA_READ(t+1)
|
||||
DMA_WRITE(t) ∥ DMA_WRITE(t+1)
|
||||
```
|
||||
@@ -210,7 +210,7 @@ DMA_WRITE(t) ∥ DMA_WRITE(t+1)
|
||||
|
||||
Compute operations share a single compute resource.
|
||||
|
||||
```
|
||||
```text
|
||||
PE_ACCEL capacity = 1
|
||||
```
|
||||
|
||||
@@ -230,7 +230,7 @@ Composite commands contain one compute opcode only.
|
||||
|
||||
Examples:
|
||||
|
||||
```
|
||||
```text
|
||||
COMPOSITE_GEMM
|
||||
COMPOSITE_MATH
|
||||
```
|
||||
@@ -250,13 +250,13 @@ Compute operations use a TCM-centric dataflow model.
|
||||
|
||||
**Input path (HBM)**
|
||||
|
||||
```
|
||||
```text
|
||||
HBM → XBAR → PE_DMA (DMA_READ) → PE_TCM
|
||||
```
|
||||
|
||||
**Input path (shared SRAM)**
|
||||
|
||||
```
|
||||
```text
|
||||
Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
|
||||
```
|
||||
|
||||
@@ -264,7 +264,7 @@ Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
|
||||
|
||||
Compute engines read input tensors from PE_TCM.
|
||||
|
||||
```
|
||||
```text
|
||||
PE_TCM → GEMM / MATH
|
||||
```
|
||||
|
||||
@@ -274,13 +274,13 @@ Weights for GEMM may optionally stream directly from HBM (via XBAR).
|
||||
|
||||
Compute results are written to PE_TCM, then DMA writes to HBM.
|
||||
|
||||
```
|
||||
```text
|
||||
PE_TCM → PE_DMA (DMA_WRITE) → XBAR → HBM
|
||||
```
|
||||
|
||||
**Output path (shared SRAM)**
|
||||
|
||||
```
|
||||
```text
|
||||
PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
|
||||
```
|
||||
|
||||
|
||||
@@ -17,8 +17,8 @@ implementation does not enforce this for fabric traversal.
|
||||
This ADR defines:
|
||||
|
||||
- how components communicate via typed port queues,
|
||||
- how propagation delay is modeled (wire processes),
|
||||
- the fabric path for Memory R/W through M_CPU.DMA,
|
||||
- how propagation delay is modeled (wire processes with BW occupancy),
|
||||
- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch (via M_CPU),
|
||||
- the reduced role of the simulation engine,
|
||||
- M_CPU.DMA as an internal subcomponent of M_CPU.
|
||||
|
||||
@@ -30,7 +30,7 @@ This ADR defines:
|
||||
|
||||
Each component has typed input/output ports modeled as SimPy Stores:
|
||||
|
||||
```
|
||||
```text
|
||||
in_ports: dict[str, simpy.Store] # keyed by source node_id
|
||||
out_ports: dict[str, simpy.Store] # keyed by destination node_id
|
||||
```
|
||||
@@ -93,35 +93,51 @@ ADR-0007 D2 must be amended accordingly.
|
||||
|
||||
---
|
||||
|
||||
### D4. Unified fabric path for Memory R/W and Kernel Launch
|
||||
### D4. Fabric paths for Memory R/W and Kernel Launch
|
||||
|
||||
Both Memory R/W and Kernel Launch use the same fabric path to reach the target cube's M_CPU.
|
||||
The difference is what M_CPU does upon receiving the request.
|
||||
Memory R/W and Kernel Launch use **different** fabric paths.
|
||||
Memory operations bypass M_CPU and route directly to HBM via the crossbar.
|
||||
Kernel Launch routes through M_CPU for PE fan-out.
|
||||
|
||||
**Forward path (IO_CPU → target M_CPU):**
|
||||
**Memory R/W forward path (pcie_ep → hbm_ctrl, M_CPU bypass):**
|
||||
|
||||
```
|
||||
IO_CPU
|
||||
→ [transit cubes: ucie_out → wire → ucie_in → noc → ucie_out] (zero or more)
|
||||
→ target cube: ucie_in → noc → M_CPU
|
||||
```text
|
||||
pcie_ep → io_noc → io_ucie
|
||||
→ [transit cubes: ucie_in → noc → ucie_out] (zero or more)
|
||||
→ target cube: ucie_in → noc → xbar → hbm_ctrl
|
||||
```
|
||||
|
||||
**At M_CPU (diverges by operation type):**
|
||||
**Memory R/W completion path:**
|
||||
|
||||
```
|
||||
Memory R/W: M_CPU → M_CPU.DMA → noc → hbm_ctrl
|
||||
Kernel Launch: M_CPU → PE[0..n] (parallel fan-out)
|
||||
```text
|
||||
hbm_ctrl → xbar → noc → [transit cubes: ucie → noc → ucie]
|
||||
→ io_ucie → io_noc → pcie_ep
|
||||
```
|
||||
|
||||
**Completion path (reverse, same fabric):**
|
||||
**Kernel Launch forward path (pcie_ep → io_cpu → M_CPU → PE):**
|
||||
|
||||
```text
|
||||
pcie_ep → io_noc → io_cpu → io_noc → io_ucie
|
||||
→ [transit cubes: ucie_in → noc → ucie_out] (zero or more)
|
||||
→ target cube: ucie_in → noc → M_CPU → PE[0..n] (parallel fan-out)
|
||||
```
|
||||
Memory R/W: hbm_ctrl → noc → M_CPU.DMA → M_CPU
|
||||
Kernel Launch: PE[0..n] all complete → M_CPU (aggregation)
|
||||
|
||||
M_CPU → [transit cubes: ucie → noc → ucie] → IO_CPU → runtime_api
|
||||
**Kernel Launch completion path:**
|
||||
|
||||
```text
|
||||
PE[0..n] all complete → M_CPU (aggregation)
|
||||
→ noc → [transit cubes: ucie → noc → ucie]
|
||||
→ io_ucie → io_noc → io_cpu → io_noc → pcie_ep
|
||||
```
|
||||
|
||||
**Rationale for M_CPU bypass on Memory R/W:**
|
||||
|
||||
Memory write/read operations do not require command interpretation or PE
|
||||
dispatch — they are direct data transfers to/from HBM. Routing through M_CPU
|
||||
would add unnecessary overhead (5ns) without functional benefit. The io_noc
|
||||
inside the IO chiplet handles the routing decision: memory operations go
|
||||
directly to cube fabric, while kernel launches are forwarded to io_cpu first.
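
The split between the two IO chiplet paths can be sketched as a lookup on message type. The `io_chiplet_hops` function is hypothetical; the hop lists are copied from the forward paths in this section, and any message type not covered there is left outside the sketch.

```python
MEMORY_OPS = {"MemoryWriteMsg", "MemoryReadMsg"}


def io_chiplet_hops(msg_type: str) -> list[str]:
    """Hops taken inside the IO chiplet before entering cube fabric (illustrative only)."""
    if msg_type in MEMORY_OPS:
        # Pure data transfer: no command interpretation, so io_cpu is skipped.
        return ["pcie_ep", "io_noc", "io_ucie"]
    if msg_type == "KernelLaunchMsg":
        # Kernel launches are interpreted by io_cpu before leaving the chiplet.
        return ["pcie_ep", "io_noc", "io_cpu", "io_noc", "io_ucie"]
    raise ValueError(f"message type not covered by this sketch: {msg_type}")


write_hops = io_chiplet_hops("MemoryWriteMsg")
launch_hops = io_chiplet_hops("KernelLaunchMsg")
```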
|
||||
|
||||
---
|
||||
|
||||
### D5. M_CPU.DMA is an internal subcomponent of M_CPU
|
||||
@@ -146,7 +162,7 @@ M_CPU.DMA does not appear as a node in the compiled topology graph.
|
||||
A cube that is not the target of a memory or kernel request acts as a transit node.
|
||||
Transit cubes forward requests without consuming them:
|
||||
|
||||
```
|
||||
```text
|
||||
ucie_in (from upstream) → noc → ucie_out (to downstream)
|
||||
```
|
||||
|
||||
@@ -187,3 +203,5 @@ It is used for shard comparison in `_route_kernel` and as a regression guard.
|
||||
- ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced)
|
||||
- ADR-0014 D4 (DMA engine capacity=1)
|
||||
- ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
|
||||
- ADR-0016 (IOChiplet NOC and memory data path)
|
||||
- ADR-0017 (cube NOC 2D mesh architecture)
|
||||
|
||||
@@ -0,0 +1,98 @@
|
||||
# ADR-0016: IOChiplet NOC and Memory Data Path
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
ADR-0003 D2 defines IO chiplets as SIP-level components providing PCIe-EP and
|
||||
IO_CPU interfaces, but does not specify internal routing within the IO chiplet.
|
||||
ADR-0015 D4 was updated to document the M_CPU bypass for Memory R/W, but the
|
||||
IO chiplet's internal NOC architecture that enables this routing was not
|
||||
formally documented.
|
||||
|
||||
The IO chiplet needs an internal routing fabric (io_noc) to:
|
||||
|
||||
- connect pcie_ep, io_cpu, and per-cube UCIe PHY ports
|
||||
- route memory operations (MemoryWrite/Read) directly to cube fabric without
|
||||
passing through io_cpu
|
||||
- route kernel launch commands through io_cpu for command interpretation
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. IOChiplet internal NOC (io_noc)
|
||||
|
||||
Each IO chiplet instance contains an internal NOC node (`io_noc`) that connects:
|
||||
|
||||
- `pcie_ep` — host-facing PCIe endpoint
|
||||
- `io_cpu` — command processor for kernel launch interpretation
|
||||
- `io_ucie-{PHY}.conn{N}` — per-PHY connection nodes to cube UCIe ports
|
||||
|
||||
The io_noc is a forwarding-only fabric (`forwarding_v1` implementation) with
|
||||
zero overhead. All routing decisions are made by the simulation engine based
|
||||
on message type, not by io_noc itself.
|
||||
|
||||
### D2. IOChiplet UCIe decomposition
|
||||
|
||||
Each IO chiplet PHY port is decomposed into:
|
||||
|
||||
- `io_ucie-{PHY}` — the UCIe protocol endpoint (overhead = 8ns)
|
||||
- `io_ucie-{PHY}.conn{N}` — N connection nodes between io_noc and io_ucie
|
||||
|
||||
This mirrors the cube-side UCIe decomposition (ADR-0015 D1) and allows
|
||||
multiple independent NOC-to-UCIe connections per PHY.
|
||||
|
||||
### D3. Memory R/W path (M_CPU bypass)
|
||||
|
||||
Memory operations (MemoryWrite, MemoryRead) are routed directly from pcie_ep
through io_noc to the target cube's HBM, bypassing io_cpu in the IO chiplet
and M_CPU in the target cube:
|
||||
|
||||
```text
|
||||
pcie_ep → io_noc → conn → io_ucie → [cube UCIe] → noc → xbar → hbm_ctrl
|
||||
```
|
||||
|
||||
This avoids the 10ns io_cpu overhead for pure data transfers. The simulation
|
||||
engine's `_process_memory_direct()` method uses `find_memory_path()` which
|
||||
resolves the shortest path from pcie_ep to the target HBM node.
|
||||
|
||||
### D4. Kernel Launch path (via io_cpu)
|
||||
|
||||
Kernel launch commands require io_cpu for command interpretation and PE
|
||||
fan-out setup:
|
||||
|
||||
```text
|
||||
pcie_ep → io_noc → io_cpu → io_noc → conn → io_ucie → [cube UCIe]
|
||||
→ noc → m_cpu → PE
|
||||
```
|
||||
|
||||
The engine's `_entry_points()` method routes KernelLaunchMsg through both
|
||||
pcie_ep (entry) and io_cpu (command processing).
|
||||
|
||||
### D5. IOChiplet-to-cube port mapping
|
||||
|
||||
Each IO chiplet instance declares which cube ports it connects to:
|
||||
|
||||
```yaml
cube_ports:
  - { cube: {xy: [0,0]}, cube_side: N, phy: P0, distance_mm: 2.0 }
  - { cube: {xy: [1,0]}, cube_side: N, phy: P1, distance_mm: 2.0 }
```
|
||||
|
||||
The topology builder creates edges from io_ucie PHY nodes to the
|
||||
corresponding cube UCIe port nodes, with the specified distance and
|
||||
the IO chiplet's `per_connection_bw_gbs` as link bandwidth.
|
||||
|
||||
## Consequences
|
||||
|
||||
- IO chiplet has a well-defined internal routing fabric
|
||||
- Memory operations avoid unnecessary io_cpu overhead
|
||||
- Kernel launch commands still get proper command interpretation
|
||||
- The io_noc pattern is consistent with cube-level NOC design
|
||||
- ADR-0003 D2 is extended (not contradicted) by this ADR
|
||||
|
||||
## Links
|
||||
|
||||
- ADR-0003 D2 (IO chiplet definition)
|
||||
- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
|
||||
- ADR-0012 D1 (host-to-IO_CPU message schema)
|
||||
@@ -0,0 +1,189 @@
|
||||
# ADR-0017: Cube NOC 2D Mesh Architecture
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
ADR-0003 D3 defines the cube-level NOC as a "distributed on-die fabric" but
|
||||
does not specify the internal routing model, contention semantics, or
|
||||
attachment topology. The implementation uses a 2D mesh router grid with
|
||||
XY routing and per-segment contention modeling. This ADR formalizes that
|
||||
architecture.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. NOC node and router grid
|
||||
|
||||
Each cube contains a single NOC topology node (`sip{S}.cube{C}.noc`)
|
||||
implemented as `noc_2d_mesh_v1`. Internally, the NOC models a 2D router
|
||||
grid generated by `mesh_gen.py`.
|
||||
|
||||
Grid properties:
|
||||
|
||||
- Default dimensions: 6x6 routers (derived from PE layout + UCIe connections)
|
||||
- Router naming: `r{row}c{col}` (e.g., `r0c0`, `r5c5`)
|
||||
- HBM exclusion zone: center rows/columns are excluded where HBM physically
|
||||
occupies space (e.g., r2c2, r2c3, r3c2, r3c3)
|
||||
- Router positions are derived from physical PE corner placement and cube
|
||||
geometry
|
||||
|
||||
The NOC overhead_ns is 0.0. Latency is modeled by Manhattan distance
|
||||
traversal within the mesh (distance_mm x ns_per_mm).
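
A minimal sketch of the uncontended latency rule follows. The `noc_latency_ns` helper and the example coordinates are hypothetical; the 0.01 ns/mm default is the `ns_per_mm` value listed in the README.

```python
def noc_latency_ns(src_mm: tuple[float, float],
                   dst_mm: tuple[float, float],
                   ns_per_mm: float = 0.01) -> float:
    """Uncontended NOC latency: Manhattan distance (mm) times ns_per_mm."""
    distance_mm = abs(src_mm[0] - dst_mm[0]) + abs(src_mm[1] - dst_mm[1])
    return distance_mm * ns_per_mm


# Hypothetical router positions, 3 mm apart in X and 4 mm apart in Y (7 mm total).
latency = noc_latency_ns((0.0, 0.0), (3.0, 4.0))
```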
|
||||
|
||||
### D2. XY routing algorithm
|
||||
|
||||
The NOC uses deterministic XY routing:
|
||||
|
||||
1. Horizontal segment: travel along the source row (source Y) from source X to destination X
2. Vertical segment: travel along the destination column (destination X) from source Y to destination Y
|
||||
|
||||
Each directed segment is identified by a unique link key:
|
||||
|
||||
- Horizontal: `("H", y_band, x_min, x_max, direction)`
|
||||
- Vertical: `("V", x_band, y_min, y_max, direction)`
|
||||
|
||||
Grid positions are snapped to the router grid, excluding the HBM zone.
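
The routing steps and link keys above can be sketched as follows. The `xy_route` function and the `"+"`/`"-"` direction encoding are assumptions for illustration, and snapping around the HBM zone is deliberately left out.

```python
Segment = tuple[str, int, int, int, str]


def xy_route(src: tuple[int, int], dst: tuple[int, int]) -> list[Segment]:
    """Directed segments for X-then-Y routing between two (x, y) grid positions."""
    (sx, sy), (dx, dy) = src, dst
    segments: list[Segment] = []
    if sx != dx:
        # Horizontal leg runs along the source row: ("H", y_band, x_min, x_max, direction).
        direction = "+" if dx > sx else "-"
        segments.append(("H", sy, min(sx, dx), max(sx, dx), direction))
    if sy != dy:
        # Vertical leg runs along the destination column: ("V", x_band, y_min, y_max, direction).
        direction = "+" if dy > sy else "-"
        segments.append(("V", dx, min(sy, dy), max(sy, dy), direction))
    return segments


route = xy_route((0, 0), (5, 4))
```

Keying each segment by band and direction lets two transactions on the same row or column, travelling the same way, map to the same contention key.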
|
||||
|
||||
### D3. Contention model
|
||||
|
||||
Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
|
||||
sharing a segment (same row or column band, same direction) contend for the
|
||||
resource. This models link-level serialization in a wormhole-routed mesh.
|
||||
|
||||
With no contention, NOC traversal latency equals the Manhattan distance
|
||||
multiplied by `ns_per_mm`. Under contention, additional queueing delay
|
||||
is added by SimPy's resource scheduling.
|
||||
|
||||
### D4. NOC attachment points
|
||||
|
||||
The NOC connects to all major cube-level components:
|
||||
|
||||
```text
                  UCIe-N (conn x4)
                         |
           +---------+---+---+---------+
           |         |       |         |
PE0.dma ---+ r0c0    |  ...  |    r0c5 +--- PE2.dma
PE0.cpu <--+         |       |         +--< PE2.cpu
           |         |       |         |
UCIe-W ----+  ...    | [HBM] |    ...  +---- UCIe-E
(conn x4)  |         | zone  |         |  (conn x4)
           | r2c0    |       |         |
M_CPU <--->+         |       |         |
           | r3c0    |       |         |
SRAM <---->+         |       |         |
           |         |       |         |
PE4.dma ---+ r4c0    |  ...  |    r4c5 +--- PE6.dma
PE4.cpu <--+         |       |         +--< PE6.cpu
           |         |       |         |
           +---------+---+---+---------+
                         |
                  UCIe-S (conn x4)

xbar_top attached to: r0c0, r0c1, r1c4, r1c5  (top-half PE routers)
xbar_bot attached to: r4c0, r4c1, r5c4, r5c5  (bottom-half PE routers)
```
|
||||
|
||||
### D5. NOC edge bandwidths and distances
|
||||
|
||||
| Connection | BW (GB/s) | Distance | Notes |
| --- | --- | --- | --- |
| PE_DMA -> NOC | 256.0 | Physical (PE pos) | Matches HBM slice BW |
| NOC -> PE_CPU | - | 0.0 mm | Command path only |
| NOC <-> xbar_top | 256.0 | 0.0 mm | Per xbar half |
| NOC <-> xbar_bot | 256.0 | 0.0 mm | Per xbar half |
| NOC <-> M_CPU | - | 0.0 mm | Command path |
| NOC <-> SRAM | 128.0 x4 | 0.0 mm | 512 GB/s aggregate |
| NOC <-> UCIe conn | 128.0 | 0.0 mm | Per connection, 4 per port |
|
||||
|
||||
Distance 0.0 mm for most connections reflects the distributed nature of
|
||||
the NOC; the actual traversal distance is computed internally via Manhattan
|
||||
distance within the router grid.
|
||||
|
||||
### D6. UCIe decomposition and inter-cube traffic
|
||||
|
||||
Each cube has 4 UCIe ports (N, S, E, W). Each port is decomposed into:
|
||||
|
||||
- 1 `ucie-{PORT}` node: UCIe protocol endpoint (overhead = 8.0 ns)
|
||||
- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe
|
||||
|
||||
This decomposition enables N=4 independent NOC-to-UCIe connections per port,
|
||||
each with 128 GB/s bandwidth. Total aggregate per port: 512 GB/s.
|
||||
|
||||
Inter-cube traffic path:
|
||||
|
||||
```text
Source:  PE_DMA -> NOC -> conn{i} -> ucie-{PORT}
           [UCIe link: 512 GB/s, 1.0mm seam distance]
Target:  ucie-{PORT} -> conn{i} -> NOC -> xbar -> HBM
```
|
||||
|
||||
UCIe overhead (8.0 ns) is applied at each ucie-{PORT} node, so a
|
||||
full crossing incurs 16 ns (TX port + RX port).
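
The arithmetic behind the per-port aggregate and the crossing cost can be checked with a small sketch. The constants come from this section; the helper names are hypothetical.

```python
UCIE_OVERHEAD_NS = 8.0         # per ucie-{PORT} node
CONNECTIONS_PER_PORT = 4       # conn0..conn3
PER_CONNECTION_BW_GBS = 128.0  # per NOC-to-UCIe connection


def port_aggregate_bw_gbs(n_connections: int = CONNECTIONS_PER_PORT,
                          per_connection_bw_gbs: float = PER_CONNECTION_BW_GBS) -> float:
    """Aggregate NOC-to-UCIe bandwidth of one port."""
    return n_connections * per_connection_bw_gbs


def crossing_overhead_ns(n_crossings: int,
                         overhead_ns: float = UCIE_OVERHEAD_NS) -> float:
    """Protocol overhead for n cube-boundary crossings (TX port + RX port each)."""
    return n_crossings * 2 * overhead_ns


aggregate = port_aggregate_bw_gbs()   # 4 x 128 GB/s
one_crossing = crossing_overhead_ns(1)
```

A path through several transit cubes pays the 16 ns once per boundary it crosses, so the overhead grows linearly with the number of crossings.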

### D7. Data paths through the NOC

**PE DMA to local HBM (same half):**

```text
PE_DMA -> NOC -> xbar_top -> HBM_CTRL.slice{0-3}
```

**PE DMA to cross-half HBM:**

```text
PE_DMA -> NOC -> xbar_top -> bridge -> xbar_bot -> HBM_CTRL.slice{4-7}
```

**PE DMA to remote cube HBM:**

```text
PE_DMA -> NOC -> conn -> ucie-E -> [seam] -> ucie-W -> conn -> NOC -> xbar -> HBM
```

**Kernel Launch command to PE:**

```text
[from io_noc] -> ucie -> conn -> NOC -> M_CPU -> NOC -> PE_CPU
```

**Shared SRAM access:**

```text
PE_DMA -> NOC -> SRAM
```
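
For quick sanity checks, the five paths above can be written down as plain
lists. This is only an illustrative encoding: the dictionary keys and the
helper names `hops` and `noc_traversals` are hypothetical, and the node labels
are copied verbatim from the diagrams rather than from simulator node IDs.

```python
DATA_PATHS = {
    "pe_dma_local_hbm": ["PE_DMA", "NOC", "xbar_top", "HBM_CTRL.slice{0-3}"],
    "pe_dma_cross_half_hbm": ["PE_DMA", "NOC", "xbar_top", "bridge",
                              "xbar_bot", "HBM_CTRL.slice{4-7}"],
    "pe_dma_remote_cube_hbm": ["PE_DMA", "NOC", "conn", "ucie-E", "[seam]",
                               "ucie-W", "conn", "NOC", "xbar", "HBM"],
    "kernel_launch": ["[from io_noc]", "ucie", "conn", "NOC", "M_CPU",
                      "NOC", "PE_CPU"],
    "shared_sram": ["PE_DMA", "NOC", "SRAM"],
}


def hops(path: list) -> int:
    """Number of edges along a path (one fewer than the number of labels)."""
    return len(path) - 1


def noc_traversals(path: list) -> int:
    """How many times the path passes through a NOC label."""
    return path.count("NOC")


every_path_uses_noc = all(noc_traversals(p) >= 1 for p in DATA_PATHS.values())
```

The remote-cube path is the only PE DMA path that touches the NOC twice, once
in the source cube and once in the target cube.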

### D8. Mesh generation

The router grid is generated by `mesh_gen.py` based on:

- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner
- `cube.geometry`: cube physical dimensions and HBM zone
- `cube.ucie.n_connections`: determines router count for UCIe attachment

The generator produces a `mesh_data` dictionary containing:

- Router grid with positions and HBM exclusion zones
- PE-to-router attachments (pe_dma, pe_cpu per PE)
- UCIe-to-router attachments (N/S/E/W, distributed across edge routers)
- M_CPU and SRAM router attachments
- xbar_top/bot router assignments (top-half vs bottom-half PE routers)
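
To make the idea of a router grid with an exclusion zone concrete, here is a toy
sketch. It is not the `mesh_gen.py` implementation: the function names, the
`(col, row)` indexing, the inclusive zone tuple, and the `routers` / `pe_attach`
keys are all assumptions made for illustration.

```python
def toy_router_grid(cols: int, rows: int, hbm_zone: tuple) -> list:
    """List (col, row) router positions, skipping an inclusive HBM rectangle.

    hbm_zone is (col_min, row_min, col_max, row_max) in router indices.
    """
    c0, r0, c1, r1 = hbm_zone
    return [
        (c, r)
        for r in range(rows)
        for c in range(cols)
        if not (c0 <= c <= c1 and r0 <= r <= r1)
    ]


def nearest_router(routers: list, pos: tuple) -> tuple:
    """Pick the router with the smallest Manhattan distance to pos."""
    return min(routers, key=lambda rc: abs(rc[0] - pos[0]) + abs(rc[1] - pos[1]))


routers = toy_router_grid(4, 4, hbm_zone=(1, 1, 2, 2))

# Hypothetical dictionary shape; the real mesh_data keys may differ.
toy_mesh_data = {
    "routers": routers,
    "pe_attach": {"pe0": nearest_router(routers, (0, 0))},
}
```

A 4x4 grid with a 2x2 zone removed leaves 12 routers, and a PE placed at the
NW corner attaches to router `(0, 0)`.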

## Consequences

- NOC provides position-aware routing with deterministic latency
- Contention is captured per directed segment (not per-node)
- All cube-internal traffic is explicitly routed through the NOC
- HBM exclusion zone reflects physical die layout constraints
- The mesh generation is fully parameterized by `topology.yaml`

## Links

- ADR-0003 D3 (cube-level NOC definition, extended by this ADR)
- ADR-0004 D1 (PE DMA to local HBM path via xbar)
- ADR-0004 D3 (cross-half HBM via bridge)
- ADR-0014 D1 (PE_DMA dual egress: xbar for HBM, NOC for non-HBM)
- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
- ADR-0016 D1 (IOChiplet io_noc, an analogous pattern at IO chiplet level)

+14
-14
@@ -7,7 +7,7 @@ Every request flows through a graph of **components** connected by **wires**.
The total latency reported is the **actual SimPy wall-clock** (`env.now` delta),
not a static formula, so contention and queueing are captured automatically.

```
```text
total_ns (actual) = wire_prop + component_overhead + drain + queueing
├── deterministic ──────────────────┘ │
└── contention-dependent ────────────────────┘
@@ -17,7 +17,7 @@ total_ns (actual) = wire_prop + component_overhead + drain + queueing

### 1. Wire Propagation

```
```text
wire_ns = distance_mm × ns_per_mm (global: 0.01 = 10 ps/mm)
```
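
A minimal sketch of the wire propagation formula above. The function name
`wire_ns` mirrors the formula; the default of 0.01 ns/mm is the global value
quoted there, and passing it as a parameter is an assumption of this sketch.

```python
NS_PER_MM = 0.01  # global default quoted above: 0.01 ns/mm = 10 ps/mm


def wire_ns(distance_mm: float, ns_per_mm: float = NS_PER_MM) -> float:
    """Propagation delay of a wire of the given length."""
    if distance_mm < 0:
        raise ValueError("distance_mm must be non-negative")
    return distance_mm * ns_per_mm


one_mm = wire_ns(1.0)       # 1 mm of wire
zero_len = wire_ns(0.0)     # zero-length connection
```

At 10 ps/mm, even several millimetres of wire contribute only hundredths of a
nanosecond.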

@@ -29,7 +29,7 @@ and negligible compared to other costs.

### 2. Component Overhead (`overhead_ns`)

```
```text
component_ns = node.attrs["overhead_ns"]
```

@@ -53,7 +53,7 @@ This models arbitration, protocol processing, pipeline stages, etc.

### 3. Drain (Serialization Delay)

```
```text
drain_ns = nbytes / bottleneck_bw_gbs
```
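
The units work out because 1 GB/s (decimal) equals 1 byte/ns, so dividing bytes
by GB/s yields nanoseconds directly. A small sketch of the drain formula, with
`drain_ns` as a hypothetical helper name:

```python
def drain_ns(nbytes: int, bottleneck_bw_gbs: float) -> float:
    """Serialization delay in ns; 1 GB/s is 1 byte per ns (decimal GB)."""
    if bottleneck_bw_gbs <= 0:
        raise ValueError("bottleneck_bw_gbs must be positive")
    return nbytes / bottleneck_bw_gbs


through_128 = drain_ns(4096, 128.0)   # 4 KB through a 128 GB/s bottleneck
through_256 = drain_ns(4096, 256.0)   # 4 KB through a 256 GB/s bottleneck
small_req = drain_ns(64, 256.0)       # 64 B through a 256 GB/s bottleneck
```

These evaluate to 32.0 ns, 16.0 ns, and 0.25 ns respectively.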

@@ -65,7 +65,7 @@ Example: 4096 bytes through a path with bottleneck 128 GB/s → `4096 / 128 = 32

### Formula (Theoretical Lower Bound)

```
```text
formula_ns = Σ(wire_prop) + Σ(overhead_ns) + drain_ns
```
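
The lower bound can be sketched as a plain sum. The helper names and the sample
inputs (one 2.0 ns component, 0.08 ns of wire, 4096 bytes at 256 GB/s) are
chosen for illustration only; the subtraction at the end follows the
`actual_ns - formula_ns` definition of queueing delay used in this document.

```python
from typing import Sequence


def formula_ns(wire_ns: Sequence[float], overhead_ns: Sequence[float],
               nbytes: int, bottleneck_bw_gbs: float) -> float:
    """Contention-free lower bound: all wires, all overheads, one drain."""
    drain = nbytes / bottleneck_bw_gbs  # 1 GB/s == 1 byte/ns
    return sum(wire_ns) + sum(overhead_ns) + drain


def queueing_delay_ns(actual_ns: float, formula: float) -> float:
    """Extra time attributable to contention (zero when uncontended)."""
    return actual_ns - formula


lower_bound = formula_ns(wire_ns=[0.08], overhead_ns=[2.0],
                         nbytes=4096, bottleneck_bw_gbs=256.0)
extra = queueing_delay_ns(actual_ns=20.0, formula=lower_bound)
```

Here the lower bound is about 18.08 ns, so a hypothetical measured latency of
20.0 ns would imply roughly 1.92 ns of queueing.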

@@ -159,7 +159,7 @@ a timeout or waits on a resource/store. The delta between start and done capture

Each component is a SimPy process:

```
```text
_fan_in (per in_port) → _inbox (Store) → _worker → out_ports
```

@@ -215,7 +215,7 @@ If request A holds the resource and request B arrives:
- SimPy advances B's `env.now` by A's remaining service time
- This "extra" time shows up in B's `total_ns` automatically

```
```text
No contention: actual_ns == formula_ns
Contention: actual_ns > formula_ns
queueing_delay = actual_ns - formula_ns
@@ -237,7 +237,7 @@ with self._resource.request() as req:
This means a short request arriving during a long request's drain must wait
for the full remaining drain time (classic head-of-line blocking):

```
```text
Request A: 4 KB, drain = 16.0 ns (arrives at t=0)
Request B: 64 B, drain = 0.25 ns (arrives at t=5)

@@ -274,7 +274,7 @@ Setup: PE0 and PE1 in cube0 both read 4096 bytes from their local HBM slices

### Paths

```
```text
DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
```
@@ -284,7 +284,7 @@ DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
Since slice0 and slice1 are **separate** hbm_ctrl instances, each with its own
`simpy.Resource(capacity=1)`, there is no resource competition.

```
```text
DMA A timeline:
t=0.00 pe_dma dequeues txn
t=0.00 xbar.pe0: overhead_ns=2.0 → t=2.00
@@ -304,13 +304,13 @@ Both complete at ~18.09 ns. `actual == formula` for both.

Now suppose both PE0 and PE1 read from **slice0**:

```
```text
DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → xbar.pe0 → hbm_ctrl.slice0
(chain traversal to reach slice0)
```

```
```text
DMA A timeline:
t=0.00 xbar.pe0(2.0) → wire → hbm_ctrl.slice0
t=2.025 yield req → immediate (first to arrive)
@@ -343,7 +343,7 @@ compare `actual_ns` vs `formula_ns` (available in PE DMA traces).

## Probe Output Explained

```
```text
=== PE DMA Latency ===
Case          Target             Actual  Ovhd  Drain  Wire  Ovhd%  Drain%  Eff.BW  BN.BW  Util%
pe-local-hbm  c0.pe0->c0.slice0  18.09   2.0   16.0   0.08  11.1%  88.5%   226.49  256.0  88.5%
@@ -368,7 +368,7 @@ pe-cross-half-hbm c0.pe0->c0.slice4 37.14 5.0 32.0 0.14 13.5% 86.1%
fraction. For small transfers (4KB), overhead is significant relative to drain.
For large transfers, drain dominates and utilization approaches 100%.

```
```text
4 KB: Ovhd=2.0, Drain=16.0 → Util=88.5% (overhead is 11% of time)
64 KB: Ovhd=2.0, Drain=256.0 → Util=99.2% (overhead is <1% of time)
```
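
The two utilization figures can be reproduced with a short calculation,
assuming utilization is drain time divided by total uncontended time and
taking the 0.08 ns wire value from the `pe-local-hbm` probe row. The helper
name `utilization_pct` is hypothetical.

```python
def utilization_pct(ovhd_ns: float, drain_ns: float, wire_ns: float = 0.0) -> float:
    """Share of total time spent draining data, as a percentage.

    Assumes no contention, so total time is overhead + drain + wire.
    """
    total_ns = ovhd_ns + drain_ns + wire_ns
    return 100.0 * drain_ns / total_ns


util_4kb = round(utilization_pct(2.0, 16.0, 0.08), 1)     # 4 KB at 256 GB/s
util_64kb = round(utilization_pct(2.0, 256.0, 0.08), 1)   # 64 KB at 256 GB/s
```

Under that assumption the sketch gives 88.5% for 4 KB and 99.2% for 64 KB,
matching the figures above.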