# Changelog

## Release 2 (2026-03-19)

### Probe CLI Improvements

- **Restructured output**: tables are printed first (H2D, D2H, PE DMA), followed by detailed per-hop route traces. This makes it easier to scan summary numbers before diving into routing details.
- **Per-hop timestamps**: each route trace now shows cumulative nanosecond timestamps at every hop, so you can see exactly where time is spent.
- **D2H Read section**: added `MemoryReadMsg`-based D2H read probes. D2H models the full round-trip: forward command path (pcie_ep → hbm) + reverse data path (hbm → pcie_ep) with host-side drain.
- **Cross-cube best/worst split**: PE DMA cross-cube case is now reported as two separate rows, best case (adjacent cube) and worst case (farthest cube), to show the latency range.
- **Multi-size BW saturation sweep**: each probe case now includes a saturation table showing utilization at 4 KB, 16 KB, 64 KB, 256 KB, and 1 MB. This reveals the data size threshold (~64 KB) where overhead becomes negligible and utilization exceeds 90%.
- **Default data size changed from 4 KB to 32 KB** for more realistic baseline measurements.

### UCIe Overhead Tuning

- UCIe `overhead_ns` increased from 1.0 to **8.0 ns per port** (16 ns per crossing = TX + RX). This fixes a latency inversion where cross-cube PE DMA (which traverses UCIe) was incorrectly faster than cross-half PE DMA (which traverses xbar bridges). Applied to both cube-to-cube UCIe and IO chiplet UCIe.

### HBM Efficiency Factor

- Added `efficiency: 0.8` parameter to `hbm_ctrl` in `topology.yaml`.
- The topology builder now applies this multiplicative factor to xbar→hbm edge bandwidth: `256 GB/s × 0.8 = 204.8 GB/s` effective.
- This models real-world DRAM access inefficiency (refresh, bank conflicts, page misses) rather than assuming ideal spec bandwidth.

### IOChiplet NOC and D2H Topology

- H2D MemoryWrite and D2H MemoryRead now route through `io_noc` (IO chiplet NOC), bypassing `m_cpu`. The `m_cpu` path is reserved for KernelLaunch.
- IOChiplet topology includes: `io_noc`, per-PHY `io_ucie` nodes, and `conn` nodes for NOC-to-UCIe bridging.
- Added comprehensive tests for IOChiplet structure, H2D/D2H data paths, and latency invariants.

### NOC Mesh, Crossbar, and BW Occupancy

- Added `noc_2d_mesh_v1` component with Manhattan-distance-based internal routing.
- Added `xbar_v1` crossbar component with top/bottom halves and bridge interconnect.
- Added BW occupancy model to wires: each directed edge tracks `available_at` for back-to-back serialization contention.
- Cube mesh visualization diagram (`docs/diagrams/cube_mesh_view.svg`).

### Test Coverage

- **278 tests** passing across 16 test files.
- New test files: `test_bw_occupancy.py`, `test_iochiplet_noc_d2h.py`, `test_noc_mesh.py`.
- New probe tests: monotonic D2H latency, D2H >= H2D invariant, HBM efficiency verification, sweep saturation, cross-cube best/worst ordering, per-hop timestamp monotonicity.

---

## Release 1 (initial)

- Core simulation engine (SimPy-based discrete-event)
- Topology builder with YAML-driven configuration
- Physical address encoding/decoding
- XY routing on 4x4 cube mesh
- Component model: pcie_ep, io_cpu, m_cpu, pe_cpu, pe_scheduler, pe_dma, pe_gemm, pe_math, pe_tcm, hbm_ctrl, noc, xbar, sram
- Runtime API: MemoryWriteMsg, MemoryReadMsg, PeDmaMsg, KernelLaunchMsg
- Probe CLI for latency analysis
- Benchmark runner with device enumeration
- Benchmarks: QKV GEMM (single/multi-PE), IPCQ AllReduce
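
As a worked illustration of three Release 2 figures (the HBM efficiency factor, the UCIe per-crossing overhead, and the `available_at` occupancy model), here is a minimal Python sketch. It is not taken from the simulator's code: the `Edge` class, the `effective_bw` helper, and every field name other than `available_at` are hypothetical. It also assumes that 1 GB/s equals 1 byte/ns and that the 32 KB default means 32768 bytes.

```python
from dataclasses import dataclass


def effective_bw(spec_gbps: float, efficiency: float) -> float:
    """Scale a spec bandwidth by a multiplicative efficiency factor."""
    return spec_gbps * efficiency


@dataclass
class Edge:
    """Hypothetical directed wire. Bandwidth is in GB/s (assumed 1 byte/ns)."""
    bw_gbps: float
    overhead_ns: float = 0.0
    available_at: float = 0.0  # time (ns) at which the wire is next free

    def transfer(self, now_ns: float, nbytes: int) -> float:
        """Return the arrival time (ns) of a transfer issued at now_ns."""
        start = max(now_ns, self.available_at)  # queue behind earlier traffic
        serialization_ns = nbytes / self.bw_gbps
        self.available_at = start + serialization_ns  # wire busy until then
        return self.available_at + self.overhead_ns


# HBM: 256 GB/s spec with efficiency 0.8 gives 204.8 GB/s effective.
hbm_bw = effective_bw(256.0, 0.8)

# UCIe: 8.0 ns per port, TX + RX per crossing gives 16 ns.
ucie_crossing_ns = 2 * 8.0

# Two 32 KB transfers issued at t=0 on the same edge serialize back to back.
edge = Edge(bw_gbps=hbm_bw)
first_done = edge.transfer(0.0, 32 * 1024)
second_done = edge.transfer(0.0, 32 * 1024)

print(f"effective HBM bandwidth: {hbm_bw:.1f} GB/s")
print(f"UCIe crossing overhead:  {ucie_crossing_ns:.1f} ns")
print(f"first transfer done:     {first_done:.1f} ns")
print(f"second transfer done:    {second_done:.1f} ns")
```

Under these assumptions the second transfer finishes at twice the time of the first, because it cannot start until the edge's `available_at` has passed. That is the back-to-back serialization contention the occupancy model is described as capturing.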