Add probe CLI improvements, D2H read, UCIe/HBM tuning, BW sweep

- Probe CLI: restructured output (tables first, routes below), per-hop
  timestamps, split cross-cube into best/worst cases, D2H read section
- UCIe overhead: 1ns -> 8ns per port (16ns per crossing) to fix
  cross-cube-best < cross-half latency inversion
- HBM efficiency: added efficiency=0.8 factor to hbm_ctrl, reducing
  effective BW from 256 to 204.8 GB/s
- Multi-size BW sweep: saturation tables (4KB-1MB) for all probe cases
- Probe default data size: 4KB -> 32KB for more realistic measurements
- IOChiplet NOC + D2H topology and tests
- NOC mesh, xbar, BW occupancy components and tests
- Cube mesh visualization diagram

278 tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-03-19 01:16:18 -07:00
parent 6f43807900
commit d75da439c6
24 changed files with 3456 additions and 501 deletions
+10 -7
View File
@@ -327,11 +327,13 @@ def test_formula_latency_lower_bound():
assert formula > 0, "formula must be > 0"
def test_formula_latency_exact_no_contention():
"""With no contention, formula should approximate actual for PE DMA.
def test_formula_latency_lower_bound_no_contention():
"""With no contention, formula is a lower bound for PE DMA.
PE DMA is single-request with no fan-out or aggregation,
so formula ≈ actual (within small tolerance for SimPy scheduling).
PE DMA routes through NOC, which applies internal mesh traversal
latency (XY routing based on physical positions) not captured by the
formula (NOC edges have distance_mm=0 since NOC is distributed).
Formula <= actual is the invariant.
"""
from kernbench.runtime_api.kernel import PeDmaMsg
from kernbench.policy.address.phyaddr import PhysAddr as PA
@@ -360,10 +362,11 @@ def test_formula_latency_exact_no_contention():
_, trace = engine.get_completion(h)
actual = trace["total_ns"]
# No contention: formula should equal actual
assert abs(formula - actual) < 0.01, (
f"formula ({formula:.4f}) actual ({actual:.4f}) expected with no contention"
# Formula is a lower bound; NOC internal traversal adds latency
assert formula <= actual + 0.01, (
f"formula ({formula:.4f}) must be <= actual ({actual:.4f})"
)
assert actual > 0
# ── 10. remote cube access succeeds with higher latency ────────────