Add CHANGES.md, README, update SPEC/ADRs for release 2

- CHANGES.md: detailed changelog for release 1 and 2
- README.md: full project docs with install, probe, run, test usage
- SPEC.md: add ADR-0014~0017 references, update R7 for pcie_ep endpoint
- ADR-0003: update NOC description to reference ADR-0017
- ADR-0004: add HBM efficiency factor (0.8) to BW guarantee contract
- ADR-0014: status Proposed -> Accepted
- ADR-0015: update D4 to M_CPU bypass for Memory R/W, add ADR-0016/0017 links
- ADR-0016 (new): IOChiplet NOC and memory data path
- ADR-0017 (new): Cube NOC 2D mesh architecture
- Fix MD lint warnings (unfenced code blocks) across all docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -0,0 +1,81 @@
# Changelog

## Release 2 (2026-03-19)

### Probe CLI Improvements

- **Restructured output**: summary tables (H2D, D2H, PE DMA) are printed first,
  followed by detailed per-hop route traces. This makes it easier to scan
  summary numbers before diving into routing details.
- **Per-hop timestamps**: each route trace now shows cumulative nanosecond
  timestamps at every hop, so you can see exactly where time is spent.
- **D2H Read section**: added `MemoryReadMsg`-based D2H read probes. D2H models
  the full round-trip: forward command path (pcie_ep → hbm) + reverse data path
  (hbm → pcie_ep) with host-side drain.
- **Cross-cube best/worst split**: the PE DMA cross-cube case is now reported as
  two separate rows, best case (adjacent cube) and worst case (farthest cube),
  to show the latency range.
- **Multi-size BW saturation sweep**: each probe case now includes a saturation
  table showing utilization at 4KB, 16KB, 64KB, 256KB, and 1MB. This reveals
  the data size threshold (~64KB) above which fixed overhead is largely
  amortized and utilization exceeds 90%.
- **Default data size changed from 4KB to 32KB** for more realistic baseline
  measurements.
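
The saturation sweep can be illustrated with a simple first-order model. This is a sketch, not the probe's actual implementation: it assumes utilization is serialization time divided by total time, and the function names and the 30 ns fixed overhead are hypothetical values chosen only for illustration.

```python
# Hypothetical first-order model (not taken from the probe source):
# utilization = serialization time / (fixed overhead + serialization time).
SIZES_BYTES = [4 * 1024, 16 * 1024, 64 * 1024, 256 * 1024, 1024 * 1024]


def utilization(size_bytes: int, bw_gb_per_s: float, overhead_ns: float) -> float:
    """Fraction of the total transfer time spent moving payload."""
    # 1 GB/s equals 1 byte/ns, so bytes / (GB/s) yields nanoseconds.
    serialization_ns = size_bytes / bw_gb_per_s
    return serialization_ns / (overhead_ns + serialization_ns)


def sweep(bw_gb_per_s: float, overhead_ns: float) -> dict:
    """Utilization for every size in the sweep, keyed by size in bytes."""
    return {s: utilization(s, bw_gb_per_s, overhead_ns) for s in SIZES_BYTES}


if __name__ == "__main__":
    # overhead_ns=30.0 is an assumed value, not a measured one.
    for size, util in sweep(bw_gb_per_s=204.8, overhead_ns=30.0).items():
        print(f"{size // 1024:>5} KB  {util:6.1%}")
```

With these assumed inputs, utilization rises with transfer size and first exceeds 90% at 64 KB, which is the shape the sweep table is meant to expose.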

### UCIe Overhead Tuning

- UCIe `overhead_ns` increased from 1.0 to **8.0 ns per port** (16 ns per
  crossing = TX + RX). This fixes a latency inversion where cross-cube PE DMA
  (which traverses UCIe) was incorrectly faster than cross-half PE DMA (which
  traverses xbar bridges). Applied to both cube-to-cube UCIe and IO chiplet UCIe.
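
The per-crossing arithmetic can be written out as a minimal sketch. The helper name is hypothetical, and it assumes each crossing touches exactly two ports (TX and RX), as stated above.

```python
def ucie_crossing_overhead_ns(overhead_ns_per_port: float,
                              ports_per_crossing: int = 2) -> float:
    """Fixed overhead of one UCIe crossing: one TX port plus one RX port."""
    return overhead_ns_per_port * ports_per_crossing


if __name__ == "__main__":
    print(ucie_crossing_overhead_ns(1.0))  # previous per-port setting
    print(ucie_crossing_overhead_ns(8.0))  # new per-port setting
```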

### HBM Efficiency Factor

- Added `efficiency: 0.8` parameter to `hbm_ctrl` in `topology.yaml`.
- The topology builder now applies this multiplicative factor to xbar→hbm edge
  bandwidth: `256 GB/s × 0.8 = 204.8 GB/s` effective.
- This models real-world DRAM access inefficiency (refresh, bank conflicts,
  page misses) rather than assuming ideal spec bandwidth.
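
As a rough sketch of how such a factor could be applied when building an edge: the function name and the dictionary layout below are assumptions for illustration, and only the `efficiency` key and the 256 GB/s figure come from this changelog.

```python
def effective_bandwidth_gb_s(spec_bw_gb_s: float, efficiency: float = 1.0) -> float:
    """Scale the spec bandwidth by a multiplicative efficiency factor."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError(f"efficiency must be in (0, 1], got {efficiency}")
    return spec_bw_gb_s * efficiency


if __name__ == "__main__":
    # Hypothetical in-memory view of the hbm_ctrl entry from topology.yaml.
    hbm_ctrl = {"efficiency": 0.8}
    edge_bw = effective_bandwidth_gb_s(256.0, hbm_ctrl["efficiency"])
    print(f"xbar->hbm effective bandwidth: {edge_bw:.1f} GB/s")
```

Defaulting `efficiency` to 1.0 in this sketch keeps components without the parameter at their spec bandwidth.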

### IOChiplet NOC and D2H Topology

- H2D MemoryWrite and D2H MemoryRead now route through `io_noc` (IO chiplet
  NOC), bypassing `m_cpu`. The `m_cpu` path is reserved for KernelLaunch.
- IOChiplet topology includes: `io_noc`, per-PHY `io_ucie` nodes, and `conn`
  nodes for NOC-to-UCIe bridging.
- Added comprehensive tests for IOChiplet structure, H2D/D2H data paths,
  and latency invariants.
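
The path split described above can be pictured as a dispatch on message type. This is only an illustrative sketch: `select_io_path` is a hypothetical helper, and the real router may be structured quite differently. The message names are those of the Runtime API.

```python
# Hypothetical mapping from message type to the IO chiplet path it takes.
_IO_PATH_BY_MESSAGE = {
    "MemoryWriteMsg": "io_noc",   # H2D data path, bypasses m_cpu
    "MemoryReadMsg": "io_noc",    # D2H data path, bypasses m_cpu
    "KernelLaunchMsg": "m_cpu",   # control path reserved for kernel launch
}


def select_io_path(message_type: str) -> str:
    """Return the IO chiplet node a message of this type is routed through."""
    try:
        return _IO_PATH_BY_MESSAGE[message_type]
    except KeyError:
        raise ValueError(f"no IO chiplet path defined for {message_type!r}") from None


if __name__ == "__main__":
    for name in ("MemoryWriteMsg", "MemoryReadMsg", "KernelLaunchMsg"):
        print(f"{name:>16} -> {select_io_path(name)}")
```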

### NOC Mesh, Crossbar, and BW Occupancy

- Added `noc_2d_mesh_v1` component with Manhattan-distance-based internal
  routing.
- Added `xbar_v1` crossbar component with top/bottom halves and bridge
  interconnect.
- Added BW occupancy model to wires: each directed edge tracks
  `available_at` for back-to-back serialization contention.
- Added cube mesh visualization diagram (`docs/diagrams/cube_mesh_view.svg`).
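
A minimal sketch of the occupancy idea follows, assuming each directed edge serializes transfers one at a time. The class and method names are hypothetical. Only the `available_at` field is named in this changelog.

```python
from dataclasses import dataclass


@dataclass
class DirectedEdge:
    """One direction of a wire, with a simple occupancy tracker."""
    bw_gb_per_s: float         # 1 GB/s equals 1 byte/ns
    available_at: float = 0.0  # time (ns) at which the edge is next free

    def reserve(self, now_ns: float, size_bytes: int) -> float:
        """Serialize one transfer and return its completion time in ns."""
        # A transfer arriving while the edge is busy waits until it frees up.
        start = max(now_ns, self.available_at)
        finish = start + size_bytes / self.bw_gb_per_s
        self.available_at = finish
        return finish


if __name__ == "__main__":
    edge = DirectedEdge(bw_gb_per_s=256.0)
    first = edge.reserve(now_ns=0.0, size_bytes=25_600)
    second = edge.reserve(now_ns=0.0, size_bytes=25_600)  # queued behind first
    print(first, second)
```

Two transfers issued at the same instant complete back to back rather than in parallel, which is the contention effect the model is meant to capture.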

### Test Coverage

- **278 tests** passing across 16 test files.
- New test files: `test_bw_occupancy.py`, `test_iochiplet_noc_d2h.py`,
  `test_noc_mesh.py`.
- New probe tests: monotonic D2H latency, D2H >= H2D invariant, HBM
  efficiency verification, sweep saturation, cross-cube best/worst ordering,
  per-hop timestamp monotonicity.
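
One of these invariants, per-hop timestamp monotonicity, might be checked along the following lines. The trace format (a list of `(hop, cumulative_ns)` pairs) and the timestamp values are made up for illustration and do not reflect the actual test files or measured latencies.

```python
def is_monotonic(trace) -> bool:
    """True if cumulative timestamps never decrease from one hop to the next."""
    times = [t for _, t in trace]
    return all(earlier <= later for earlier, later in zip(times, times[1:]))


if __name__ == "__main__":
    # Illustrative trace only: hop names follow the changelog, times are invented.
    trace = [("pcie_ep", 0.0), ("io_noc", 5.0), ("io_ucie", 21.0), ("hbm", 60.0)]
    assert is_monotonic(trace)
    print("per-hop timestamps are monotonic")
```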

---

## Release 1 (initial)

- Core simulation engine (SimPy-based discrete-event)
- Topology builder with YAML-driven configuration
- Physical address encoding/decoding
- XY routing on 4x4 cube mesh
- Component model: pcie_ep, io_cpu, m_cpu, pe_cpu, pe_scheduler, pe_dma,
  pe_gemm, pe_math, pe_tcm, hbm_ctrl, noc, xbar, sram
- Runtime API: MemoryWriteMsg, MemoryReadMsg, PeDmaMsg, KernelLaunchMsg
- Probe CLI for latency analysis
- Benchmark runner with device enumeration
- Benchmarks: QKV GEMM (single/multi-PE), IPCQ AllReduce
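
For readers unfamiliar with XY routing, here is a generic sketch of dimension-ordered routing on a 4x4 mesh. It illustrates the textbook technique and is not the project's router. The coordinate convention and function name are assumptions.

```python
MESH_SIZE = 4  # 4x4 mesh; nodes are addressed as (x, y) with 0 <= x, y < 4


def xy_route(src, dst):
    """Return the nodes visited from src to dst, moving along X first, then Y."""
    for x, y in (src, dst):
        if not (0 <= x < MESH_SIZE and 0 <= y < MESH_SIZE):
            raise ValueError(f"node {(x, y)} is outside the {MESH_SIZE}x{MESH_SIZE} mesh")
    x, y = src
    path = [(x, y)]
    while x != dst[0]:           # resolve the X offset completely first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:           # then resolve the Y offset
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    route = xy_route((0, 0), (3, 2))
    print(route, "hops:", len(route) - 1)
```

The hop count of an XY route always equals the Manhattan distance between source and destination, since each step closes one unit of offset in one dimension.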