ywkang fc6abbc8ee Add CHANGES.md, README, update SPEC/ADRs for release 2
- CHANGES.md: detailed changelog for release 1 and 2
- README.md: full project docs with install, probe, run, test usage
- SPEC.md: add ADR-0014~0017 references, update R7 for pcie_ep endpoint
- ADR-0003: update NOC description to reference ADR-0017
- ADR-0004: add HBM efficiency factor (0.8) to BW guarantee contract
- ADR-0014: status Proposed -> Accepted
- ADR-0015: update D4 to M_CPU bypass for Memory R/W, add ADR-0016/0017 links
- ADR-0016 (new): IOChiplet NOC and memory data path
- ADR-0017 (new): Cube NOC 2D mesh architecture
- Fix MD lint warnings (unfenced code blocks) across all docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 01:43:15 -07:00

# Changelog
## Release 2 (2026-03-19)
### Probe CLI Improvements
- **Restructured output**: tables are printed first (H2D, D2H, PE DMA), followed
by detailed per-hop route traces below. This makes it easier to scan summary
numbers before diving into routing details.
- **Per-hop timestamps**: each route trace now shows cumulative nanosecond
timestamps at every hop, so you can see exactly where time is spent.
- **D2H Read section**: added `MemoryReadMsg`-based D2H read probes. D2H models
the full round-trip: forward command path (pcie_ep → hbm) + reverse data path
(hbm → pcie_ep) with host-side drain.
- **Cross-cube best/worst split**: the PE DMA cross-cube case is now reported as
two separate rows, best case (adjacent cube) and worst case (farthest cube), to
show the latency range.
- **Multi-size BW saturation sweep**: each probe case now includes a saturation
table showing utilization at 4KB, 16KB, 64KB, 256KB, and 1MB. This reveals
the data size threshold (~64KB) where overhead becomes negligible and
utilization exceeds 90%.
- **Default data size changed from 4KB to 32KB** for more realistic baseline
measurements.
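
The saturation behavior described in the sweep bullet can be illustrated with a simple first-order model: transfer time is a fixed overhead plus size divided by bandwidth, and utilization is the fraction of that time spent moving data. This is a minimal sketch, not the probe's actual code; the function names, the 64 GB/s bandwidth, and the 100 ns overhead are hypothetical values chosen only for illustration.

```python
# Hypothetical first-order model: total time = fixed overhead + size / bandwidth.
# The bandwidth and overhead used below are illustrative, not taken from the project.
SIZES_BYTES = [4 * 1024, 16 * 1024, 64 * 1024, 256 * 1024, 1024 * 1024]


def utilization(size_bytes, bw_gbps, overhead_ns):
    """Fraction of the total transfer time spent moving data at full bandwidth."""
    ideal_ns = size_bytes / bw_gbps  # 1 GB/s equals 1 byte/ns
    return ideal_ns / (ideal_ns + overhead_ns)


def sweep(bw_gbps, overhead_ns):
    """Return (size_bytes, utilization) pairs for the sweep sizes."""
    return [(s, utilization(s, bw_gbps, overhead_ns)) for s in SIZES_BYTES]


for size, util in sweep(bw_gbps=64.0, overhead_ns=100.0):
    print(f"{size // 1024:>5} KB  {util:6.1%}")
```

Under this model, utilization rises monotonically with transfer size because the fixed overhead is amortized over more bytes; where it crosses 90% depends on the overhead-to-bandwidth ratio.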
### UCIe Overhead Tuning
- UCIe `overhead_ns` increased from 1.0 to **8.0 ns per port** (16 ns per
crossing, since each crossing pays TX + RX). This fixes a latency inversion
where cross-cube PE DMA (which traverses UCIe) was incorrectly faster than
cross-half PE DMA (which traverses xbar bridges). Applied to both cube-to-cube
UCIe and IO chiplet UCIe.
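
The per-crossing arithmetic can be sketched as follows. This assumes a crossing pays the per-port overhead exactly twice (once at the TX port, once at the RX port), as the entry above states; the helper name and the `crossings` parameter are hypothetical.

```python
PORTS_PER_CROSSING = 2  # one TX port plus one RX port


def ucie_crossing_overhead_ns(overhead_ns_per_port, crossings=1):
    """Fixed UCIe overhead added to a route with the given number of crossings."""
    return overhead_ns_per_port * PORTS_PER_CROSSING * crossings


old = ucie_crossing_overhead_ns(1.0)  # previous per-port value
new = ucie_crossing_overhead_ns(8.0)  # value after this release
print(f"per-crossing overhead: {old} ns -> {new} ns")
```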
### HBM Efficiency Factor
- Added `efficiency: 0.8` parameter to `hbm_ctrl` in `topology.yaml`.
- The topology builder now applies this multiplicative factor to xbar→hbm edge
bandwidth: `256 GB/s × 0.8 = 204.8 GB/s` effective.
- This models real-world DRAM access inefficiency (refresh, bank conflicts,
page misses) rather than assuming ideal spec bandwidth.
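
As a rough illustration of how the factor is applied, the sketch below scales a spec bandwidth by an efficiency value. The dictionary shape and function name are assumptions for this example and do not reflect the actual `topology.yaml` schema or the builder's code.

```python
def effective_bw_gbps(spec_bw_gbps, efficiency=1.0):
    """Scale spec bandwidth by a multiplicative efficiency factor in (0, 1]."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError(f"efficiency must be in (0, 1], got {efficiency}")
    return spec_bw_gbps * efficiency


# Hypothetical in-memory form of the hbm_ctrl parameters.
hbm_ctrl = {"bw_gbps": 256.0, "efficiency": 0.8}

bw = effective_bw_gbps(hbm_ctrl["bw_gbps"], hbm_ctrl["efficiency"])
print(f"effective xbar->hbm bandwidth: {bw:.1f} GB/s")
```

Validating the range up front keeps a mistyped factor (for example `80` instead of `0.8`) from silently inflating the edge bandwidth.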
### IOChiplet NOC and D2H Topology
- H2D MemoryWrite and D2H MemoryRead now route through `io_noc` (IO chiplet
NOC), bypassing `m_cpu`. The `m_cpu` path is reserved for KernelLaunch.
- IOChiplet topology includes: `io_noc`, per-PHY `io_ucie` nodes, and `conn`
nodes for NOC-to-UCIe bridging.
- Added comprehensive tests for IOChiplet structure, H2D/D2H data paths,
and latency invariants.
### NOC Mesh, Crossbar, and BW Occupancy
- Added `noc_2d_mesh_v1` component with Manhattan-distance-based internal
routing.
- Added `xbar_v1` crossbar component with top/bottom halves and bridge
interconnect.
- Added BW occupancy model to wires: each directed edge tracks
`available_at` for back-to-back serialization contention.
- Cube mesh visualization diagram (`docs/diagrams/cube_mesh_view.svg`).
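
The `available_at` bookkeeping in the BW occupancy bullet can be sketched as below. This is a minimal illustration under the assumption that each directed edge serializes transfers one after another; the class name, method name, and units are hypothetical, and the real model may differ.

```python
from dataclasses import dataclass


@dataclass
class DirectedEdge:
    """One direction of a wire; tracks when it next becomes free."""

    bw_gbps: float
    available_at: float = 0.0  # simulation time in ns

    def reserve(self, now_ns, size_bytes):
        """Occupy the edge for one transfer and return (start_ns, finish_ns)."""
        start = max(now_ns, self.available_at)    # wait if the edge is busy
        finish = start + size_bytes / self.bw_gbps  # 1 GB/s equals 1 byte/ns
        self.available_at = finish
        return start, finish


edge = DirectedEdge(bw_gbps=64.0)
first = edge.reserve(now_ns=0.0, size_bytes=4096)
second = edge.reserve(now_ns=0.0, size_bytes=4096)  # issued back-to-back
print(first, second)
```

Two transfers issued at the same time on the same edge do not overlap: the second one starts only when the first has finished serializing.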
### Test Coverage
- **278 tests** passing across 16 test files.
- New test files: `test_bw_occupancy.py`, `test_iochiplet_noc_d2h.py`,
`test_noc_mesh.py`.
- New probe tests: monotonic D2H latency, D2H >= H2D invariant, HBM
efficiency verification, sweep saturation, cross-cube best/worst ordering,
per-hop timestamp monotonicity.
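
A check such as per-hop timestamp monotonicity could look like the sketch below. The helper names and the hop latencies are hypothetical and are not taken from the project's test suite.

```python
from itertools import accumulate


def cumulative_timestamps(hop_latencies_ns):
    """Turn per-hop latencies into cumulative timestamps along a route."""
    return list(accumulate(hop_latencies_ns))


def is_monotonic(timestamps):
    """True if every timestamp is at least as large as the previous one."""
    return all(a <= b for a, b in zip(timestamps, timestamps[1:]))


# Illustrative per-hop latencies in ns for a three-hop route.
trace = cumulative_timestamps([10.0, 16.0, 4.0])
assert is_monotonic(trace)
print(trace)
```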
---
## Release 1 (initial)
- Core simulation engine (SimPy-based discrete-event)
- Topology builder with YAML-driven configuration
- Physical address encoding/decoding
- XY routing on 4x4 cube mesh
- Component model: pcie_ep, io_cpu, m_cpu, pe_cpu, pe_scheduler, pe_dma,
pe_gemm, pe_math, pe_tcm, hbm_ctrl, noc, xbar, sram
- Runtime API: MemoryWriteMsg, MemoryReadMsg, PeDmaMsg, KernelLaunchMsg
- Probe CLI for latency analysis
- Benchmark runner with device enumeration
- Benchmarks: QKV GEMM (single/multi-PE), IPCQ AllReduce
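
For readers unfamiliar with XY routing, here is a minimal sketch on a 4x4 mesh. It assumes the conventional dimension-order scheme (move along X first, then along Y) and an `(x, y)` coordinate convention; both are assumptions, and the project's actual router may differ.

```python
MESH_W, MESH_H = 4, 4  # 4x4 mesh, as listed above


def xy_route(src, dst):
    """Return the list of (x, y) nodes visited, moving along X first, then Y."""
    for x, y in (src, dst):
        if not (0 <= x < MESH_W and 0 <= y < MESH_H):
            raise ValueError(f"node {(x, y)} is outside the mesh")
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:  # resolve the X offset first
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:  # then resolve the Y offset
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


route = xy_route((0, 0), (3, 2))
hops = len(route) - 1  # equals the Manhattan distance between src and dst
print(route, hops)
```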