kernbench2/CHANGES.md
ywkang fc6abbc8ee Add CHANGES.md, README, update SPEC/ADRs for release 2
- CHANGES.md: detailed changelog for release 1 and 2
- README.md: full project docs with install, probe, run, test usage
- SPEC.md: add ADR-0014~0017 references, update R7 for pcie_ep endpoint
- ADR-0003: update NOC description to reference ADR-0017
- ADR-0004: add HBM efficiency factor (0.8) to BW guarantee contract
- ADR-0014: status Proposed -> Accepted
- ADR-0015: update D4 to M_CPU bypass for Memory R/W, add ADR-0016/0017 links
- ADR-0016 (new): IOChiplet NOC and memory data path
- ADR-0017 (new): Cube NOC 2D mesh architecture
- Fix MD lint warnings (unfenced code blocks) across all docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 01:43:15 -07:00

Changelog

Release 2 (2026-03-19)

Probe CLI Improvements

  • Restructured output: summary tables (H2D, D2H, PE DMA) are printed first, followed by detailed per-hop route traces. This makes it easier to scan summary numbers before diving into routing details.
  • Per-hop timestamps: each route trace now shows cumulative nanosecond timestamps at every hop, so you can see exactly where time is spent.
  • D2H Read section: added MemoryReadMsg-based D2H read probes. D2H models the full round-trip: forward command path (pcie_ep → hbm) + reverse data path (hbm → pcie_ep) with host-side drain.
  • Cross-cube best/worst split: PE DMA cross-cube case is now reported as two separate rows — best case (adjacent cube) and worst case (farthest cube) — to show the latency range.
  • Multi-size BW saturation sweep: each probe case now includes a saturation table showing utilization at 4KB, 16KB, 64KB, 256KB, and 1MB. This reveals the data size threshold (~64KB) where overhead becomes negligible and utilization exceeds 90%.
  • Default data size changed from 4KB to 32KB for more realistic baseline measurements.
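
The saturation behaviour in the sweep can be reasoned about with a simple amortization model: a fixed, size-independent overhead is spread over a serialization time that grows with payload size. The sketch below is only an illustration under that assumption. The function name, the 204.8 bytes/ns bandwidth, and the 30 ns overhead are hypothetical values chosen for the example, not numbers read from the probe output.

```python
# Minimal sketch of a bandwidth-utilization sweep.
# All parameters are illustrative assumptions, not values taken from the probe.
BANDWIDTH_BYTES_PER_NS = 204.8  # 204.8 GB/s expressed as bytes per nanosecond
FIXED_OVERHEAD_NS = 30.0        # assumed size-independent path overhead

SWEEP_SIZES = {
    "4KB": 4 * 1024,
    "16KB": 16 * 1024,
    "64KB": 64 * 1024,
    "256KB": 256 * 1024,
    "1MB": 1024 * 1024,
}


def utilization(size_bytes, bandwidth=BANDWIDTH_BYTES_PER_NS,
                overhead_ns=FIXED_OVERHEAD_NS):
    """Fraction of peak bandwidth achieved by one transfer of size_bytes."""
    transfer_ns = size_bytes / bandwidth
    return transfer_ns / (overhead_ns + transfer_ns)


if __name__ == "__main__":
    for label, size in SWEEP_SIZES.items():
        print(f"{label:>6}: {utilization(size):6.1%}")
```

With these assumed parameters, utilization first exceeds 90% at the 64KB point; a different overhead or bandwidth moves that threshold.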

UCIe Overhead Tuning

  • UCIe overhead_ns increased from 1.0 ns to 8.0 ns per port (16 ns per crossing: TX + RX). This fixes a latency inversion in which cross-cube PE DMA (which traverses UCIe) was incorrectly faster than cross-half PE DMA (which traverses xbar bridges). The change applies to both cube-to-cube UCIe and IO chiplet UCIe.
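
The per-crossing figure is simply the per-port overhead counted once for the TX port and once for the RX port. A small sketch of that arithmetic follows; the function and constant names are hypothetical and do not come from the code base.

```python
# Sketch of the per-crossing UCIe overhead arithmetic.
# Names are illustrative; only the 1.0 ns and 8.0 ns values come from the changelog.
OLD_PORT_OVERHEAD_NS = 1.0
NEW_PORT_OVERHEAD_NS = 8.0
PORTS_PER_CROSSING = 2  # one TX port plus one RX port


def ucie_crossing_overhead_ns(port_overhead_ns, crossings=1):
    """Total UCIe overhead for a path with the given number of crossings."""
    return port_overhead_ns * PORTS_PER_CROSSING * crossings


if __name__ == "__main__":
    old = ucie_crossing_overhead_ns(OLD_PORT_OVERHEAD_NS)
    new = ucie_crossing_overhead_ns(NEW_PORT_OVERHEAD_NS)
    print(f"old: {old} ns, new: {new} ns, delta: {new - old} ns per crossing")
```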

HBM Efficiency Factor

  • Added efficiency: 0.8 parameter to hbm_ctrl in topology.yaml.
  • The topology builder now applies this multiplicative factor to xbar→hbm edge bandwidth: 256 GB/s × 0.8 = 204.8 GB/s effective.
  • This models real-world DRAM access inefficiency (refresh, bank conflicts, page misses) rather than assuming ideal spec bandwidth.
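
The effective bandwidth calculation is a single multiplication. The sketch below shows it with a range check on the factor; the function name and the validation rule are assumptions made for illustration, not the topology builder's actual code.

```python
# Sketch of applying an efficiency factor to a spec bandwidth.
# The function name and the range check are illustrative assumptions.
def effective_bandwidth_gbps(spec_gbps, efficiency):
    """Return spec bandwidth scaled by a multiplicative efficiency factor."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError(f"efficiency must be in (0, 1], got {efficiency}")
    return spec_gbps * efficiency


if __name__ == "__main__":
    # Values from the changelog: 256 GB/s spec bandwidth, efficiency 0.8.
    print(effective_bandwidth_gbps(256.0, 0.8))
```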

IOChiplet NOC and D2H Topology

  • H2D MemoryWrite and D2H MemoryRead now route through io_noc (IO chiplet NOC), bypassing m_cpu. The m_cpu path is reserved for KernelLaunch.
  • IOChiplet topology includes: io_noc, per-PHY io_ucie nodes, and conn nodes for NOC-to-UCIe bridging.
  • Added comprehensive tests for IOChiplet structure, H2D/D2H data paths, and latency invariants.
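
The routing split can be summarized as a lookup from message kind to the IO chiplet node it passes through. The sketch below only illustrates that rule. The enum, the table, and the function are hypothetical; the simulator presumably derives real paths from the topology graph rather than from a table like this.

```python
# Sketch of the message-kind routing rule inside the IO chiplet.
# MsgKind, _VIA_NODE, and io_chiplet_via are hypothetical names for illustration.
from enum import Enum


class MsgKind(Enum):
    MEMORY_WRITE = "MemoryWrite"    # H2D data
    MEMORY_READ = "MemoryRead"      # D2H data
    KERNEL_LAUNCH = "KernelLaunch"  # control


_VIA_NODE = {
    MsgKind.MEMORY_WRITE: "io_noc",   # bypasses m_cpu
    MsgKind.MEMORY_READ: "io_noc",    # bypasses m_cpu
    MsgKind.KERNEL_LAUNCH: "m_cpu",   # m_cpu path is reserved for launches
}


def io_chiplet_via(kind):
    """Return the IO chiplet node that a message of this kind traverses."""
    return _VIA_NODE[kind]


if __name__ == "__main__":
    for kind in MsgKind:
        print(f"{kind.value:>12} -> {io_chiplet_via(kind)}")
```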

NOC Mesh, Crossbar, and BW Occupancy

  • Added noc_2d_mesh_v1 component with Manhattan-distance-based internal routing.
  • Added xbar_v1 crossbar component with top/bottom halves and bridge interconnect.
  • Added BW occupancy model to wires: each directed edge tracks available_at for back-to-back serialization contention.
  • Added a cube mesh visualization diagram (docs/diagrams/cube_mesh_view.svg).
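
Two of the ideas above are easy to show in isolation: a Manhattan-distance hop count on a 2D mesh, and a directed edge that serializes back-to-back transfers by tracking when it next becomes free. The sketch below is a standalone illustration. Apart from available_at, which the changelog names, the class, method, and field names are assumptions and may not match the simulator.

```python
# Sketch of Manhattan hop counting and a per-edge bandwidth occupancy model.
# Only `available_at` is named in the changelog; other names are illustrative.
from dataclasses import dataclass


def manhattan_hops(src, dst):
    """Number of mesh hops between two (x, y) router coordinates."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


@dataclass
class DirectedEdge:
    bandwidth_bytes_per_ns: float
    available_at: float = 0.0  # time (ns) at which the edge is next free

    def reserve(self, now_ns, size_bytes):
        """Serialize a transfer on this edge and return its finish time."""
        start = max(now_ns, self.available_at)  # wait if the edge is busy
        finish = start + size_bytes / self.bandwidth_bytes_per_ns
        self.available_at = finish
        return finish


if __name__ == "__main__":
    edge = DirectedEdge(bandwidth_bytes_per_ns=100.0)
    print(edge.reserve(now_ns=0.0, size_bytes=1000))  # first transfer
    print(edge.reserve(now_ns=0.0, size_bytes=1000))  # queued behind the first
    print(manhattan_hops((0, 0), (3, 3)))
```

In this model, a second transfer issued while the edge is busy starts only when the first one finishes, which is the contention effect the occupancy model is meant to capture.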

Test Coverage

  • 278 tests passing across 16 test files.
  • New test files: test_bw_occupancy.py, test_iochiplet_noc_d2h.py, test_noc_mesh.py.
  • New probe tests: monotonic D2H latency, D2H >= H2D invariant, HBM efficiency verification, sweep saturation, cross-cube best/worst ordering, per-hop timestamp monotonicity.
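
An invariant such as per-hop timestamp monotonicity can be checked with a small helper. The sketch below assumes a trace is a list of cumulative nanosecond timestamps and treats "monotonic" as non-decreasing. The helper name and the sample trace are made up for illustration and are not taken from the test suite.

```python
# Sketch of a per-hop timestamp monotonicity check.
# The helper name and the sample trace values are hypothetical.
def is_monotonic(timestamps_ns):
    """True if cumulative timestamps never decrease from hop to hop."""
    return all(a <= b for a, b in zip(timestamps_ns, timestamps_ns[1:]))


def test_per_hop_timestamps_are_monotonic():
    # Made-up trace: cumulative ns at each hop of one route.
    trace = [0.0, 8.0, 16.5, 16.5, 42.0]
    assert is_monotonic(trace)


def test_out_of_order_trace_is_rejected():
    assert not is_monotonic([0.0, 10.0, 9.0])


if __name__ == "__main__":
    test_per_hop_timestamps_are_monotonic()
    test_out_of_order_trace_is_rejected()
    print("ok")
```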

Release 1 (initial)

  • Core simulation engine (SimPy-based discrete-event)
  • Topology builder with YAML-driven configuration
  • Physical address encoding/decoding
  • XY routing on 4x4 cube mesh
  • Component model: pcie_ep, io_cpu, m_cpu, pe_cpu, pe_scheduler, pe_dma, pe_gemm, pe_math, pe_tcm, hbm_ctrl, noc, xbar, sram
  • Runtime API: MemoryWriteMsg, MemoryReadMsg, PeDmaMsg, KernelLaunchMsg
  • Probe CLI for latency analysis
  • Benchmark runner with device enumeration
  • Benchmarks: QKV GEMM (single/multi-PE), IPCQ AllReduce
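
For readers unfamiliar with XY routing, the sketch below shows generic dimension-order routing on a 4x4 mesh: a packet first travels along X until the column matches, then along Y. The (x, y) coordinate convention and the function name are assumptions for illustration and may differ from the simulator's implementation.

```python
# Sketch of dimension-order (XY) routing on a 4x4 mesh.
# Coordinate convention and names are illustrative assumptions.
MESH_DIM = 4


def xy_route(src, dst, dim=MESH_DIM):
    """Return the list of (x, y) nodes visited, moving along X first, then Y."""
    for coord in (*src, *dst):
        if not 0 <= coord < dim:
            raise ValueError(f"coordinate {coord} outside {dim}x{dim} mesh")
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:  # resolve the X dimension first
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:  # then resolve the Y dimension
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```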