Add probe CLI improvements, D2H read, UCIe/HBM tuning, BW sweep

- Probe CLI: restructured output (tables first, routes below), per-hop
  timestamps, split cross-cube into best/worst cases, D2H read section
- UCIe overhead: 1ns -> 8ns per port (16ns per crossing), so the best-case
  cross-cube latency is no longer lower than the cross-half latency
- HBM efficiency: added efficiency=0.8 factor to hbm_ctrl, reducing
  effective BW from 256 to 204.8 GB/s
- Multi-size BW sweep: saturation tables (4KB-1MB) for all probe cases
- Probe default data size: 4KB -> 32KB for more realistic measurements
- IOChiplet NOC + D2H topology and tests
- NOC mesh, xbar, BW occupancy components and tests
- Cube mesh visualization diagram
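
The tuning figures above can be checked with a short sketch. Only the numbers 256 GB/s, 0.8, 8ns per port, 16ns per crossing and the 4KB-1MB range come from this commit message. The function names, the two-ports-per-crossing assumption, the power-of-two size spacing, and the fixed-latency-plus-serialization bandwidth model are illustrative assumptions, not kernbench code.

```python
# Figures from the commit message; everything else is an illustrative assumption.
HBM_PEAK_GBPS = 256.0
HBM_EFFICIENCY = 0.8
UCIE_PORT_OVERHEAD_NS = 8.0
PORTS_PER_CROSSING = 2  # assumption: one egress and one ingress port per crossing


def effective_hbm_bw_gbps(peak=HBM_PEAK_GBPS, efficiency=HBM_EFFICIENCY):
    """Effective HBM bandwidth after applying the efficiency factor."""
    return peak * efficiency


def ucie_crossing_overhead_ns(per_port=UCIE_PORT_OVERHEAD_NS,
                              ports=PORTS_PER_CROSSING):
    """Total UCIe overhead added by one die-to-die crossing."""
    return per_port * ports


def sweep_sizes(start=4 * 1024, stop=1024 * 1024):
    """Transfer sizes from start to stop inclusive, doubling each step
    (the doubling is an assumption; the commit only gives the range)."""
    sizes = []
    size = start
    while size <= stop:
        sizes.append(size)
        size *= 2
    return sizes


def achieved_bw_gbps(nbytes, fixed_latency_ns, peak_gbps):
    """Assumed saturation model: time = fixed latency + nbytes / peak.

    1 GB/s equals 1 byte/ns, so nbytes / peak_gbps is already in ns.
    """
    total_ns = fixed_latency_ns + nbytes / peak_gbps
    return nbytes / total_ns


if __name__ == "__main__":
    peak = effective_hbm_bw_gbps()
    print(f"effective HBM BW: {peak:.1f} GB/s")
    print(f"UCIe crossing overhead: {ucie_crossing_overhead_ns():.0f} ns")
    for size in sweep_sizes():
        # 100 ns is a hypothetical fixed path latency, chosen only for illustration.
        bw = achieved_bw_gbps(size, 100.0, peak)
        print(f"{size // 1024:>5} KB  {bw:7.1f} GB/s")
```

Under this assumed model the achieved bandwidth rises with transfer size and approaches the effective peak without reaching it, which is the shape a saturation table would be expected to show.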

278 tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 01:16:18 -07:00
parent 6f43807900
commit d75da439c6
24 changed files with 3456 additions and 501 deletions
+24 -4
@@ -69,17 +69,37 @@ class HbmCtrlComponent(ComponentBase):
         yield from self._send_response(env, txn)

     def _send_response(self, env: simpy.Environment, txn: Any) -> Generator:
-        """Create ResponseMsg and send on reverse path back to originator.
+        """Route completion based on path type.

-        PeDmaMsg is a direct probe with no IO_CPU/M_CPU aggregation in the path,
-        so we succeed txn.done directly instead of sending a response Transaction.
+        - PeDmaMsg: succeed done directly (probe).
+        - Bypass path (no m_cpu): MemoryWrite succeeds done; MemoryRead sends
+          data back on reverse path with original done event.
+        - M_CPU DMA path: send ResponseMsg for m_cpu/io_cpu aggregation.
         """
-        from kernbench.runtime_api.kernel import PeDmaMsg
+        from kernbench.runtime_api.kernel import MemoryReadMsg, PeDmaMsg

         if isinstance(txn.request, PeDmaMsg):
             txn.done.succeed()
             return

+        # Bypass path: no m_cpu in the transaction path
+        is_bypass = not any("m_cpu" in n for n in txn.path)
+        if is_bypass:
+            if isinstance(txn.request, MemoryReadMsg):
+                # D2H: send data back on reverse path to pcie_ep
+                reverse_path = list(reversed(txn.path))
+                if len(reverse_path) >= 2:
+                    resp_txn = Transaction(
+                        request=txn.request, path=reverse_path, step=0,
+                        nbytes=txn.request.nbytes, done=txn.done,
+                    )
+                    yield self.out_ports[reverse_path[1]].put(resp_txn.advance())
+                    return
+            # MemoryWrite bypass or short path: done
+            txn.done.succeed()
+            return
+
+        # M_CPU DMA path: send ResponseMsg for aggregation
         reverse_path = list(reversed(txn.path))
         if len(reverse_path) >= 2 and self.ctx:
             from kernbench.runtime_api.kernel import ResponseMsg
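
The three completion routes listed in the `_send_response` docstring can be exercised in isolation with a minimal sketch. The message classes, `Txn`, `classify_completion`, `first_reverse_hop`, and the node names below are hypothetical stand-ins, not the kernbench types. The sketch has no simpy dependency and only mirrors the branching visible in this hunk. The `self.ctx` guard on the M_CPU path is omitted because the hunk is truncated there.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


# Hypothetical stand-ins for the message types named in the diff.
class PeDmaMsg: ...
class MemoryReadMsg: ...
class MemoryWriteMsg: ...


@dataclass
class Txn:
    """Simplified transaction: just the request and the forward path."""
    request: Any
    path: List[str]


def classify_completion(txn: Txn) -> str:
    """Return which completion route the hunk would take for this txn."""
    # Direct probe: complete immediately.
    if isinstance(txn.request, PeDmaMsg):
        return "succeed_done"

    # Bypass path: no node on the path has "m_cpu" in its name.
    is_bypass = not any("m_cpu" in node for node in txn.path)
    if is_bypass:
        # D2H read: data travels back along the reversed path,
        # provided there is at least one hop to send it to.
        if isinstance(txn.request, MemoryReadMsg) and len(txn.path) >= 2:
            return "send_data_reverse"
        # Write on the bypass path, or a path too short to reverse.
        return "succeed_done"

    # M_CPU DMA path: a ResponseMsg is sent for aggregation.
    return "send_response_msg"


def first_reverse_hop(path: List[str]) -> Optional[str]:
    """Node that receives the response first: element 1 of the reversed path."""
    reverse_path = list(reversed(path))
    return reverse_path[1] if len(reverse_path) >= 2 else None


if __name__ == "__main__":
    read = Txn(MemoryReadMsg(), ["pcie_ep", "noc", "hbm_ctrl"])
    print(classify_completion(read), "via", first_reverse_hop(read.path))
```

Separating the routing decision from the send, as this sketch does, is one way to unit-test the branch conditions without building a simulation environment. The commit itself keeps both in one generator method.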