ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots
Cube SRAM and HBM live on the cube NoC behind router-attached links (sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256). Previously the slot-IO model treated them as if they were per-PE local, so the buffer_kind sweep showed TCM ≈ SRAM at 64 KB / PE.

pe_ipcq._handle_recv and pe_dma._handle_ipcq_inbound now charge a PE→bank compute_drain_ns on top of the intrinsic slot-IO for SRAM/HBM. TCM stays free of this hop.

Adds an internal IpcqRecvCmd.consume field that gates the recv-side hop+slot-IO charges (used by a follow-up diagnostic API; default True keeps current behavior).

Post-fix at 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs. SRAM is slowest because its 128 GB/s bank link is the narrowest in the system, narrower than HBM's 256 GB/s.

The existing ordering test is rewritten from tcm<sram<hbm to tcm<hbm<sram, and a new test_ipcq_buffer_kind_locations adds 3 invariants on the gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
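As a rough illustration of why the narrower SRAM link makes the hop costlier than the HBM one, the sketch below estimates the PE→bank drain from link bandwidth alone. It is not the project's compute_drain_ns: the helper name hop_drain_ns and the bandwidth-only model are assumptions, and only the two bandwidth figures come from the commit message. It shows the ordering of the hop cost, not the absolute post-fix timings, which also include slot-IO and whatever else the path charges.

```python
# Hypothetical bandwidth-only estimate of the PE->bank hop. The real
# compute_drain_ns may also account for per-hop latency and contention.
LINK_BW_GBS = {"sram": 128.0, "hbm": 256.0}  # values from the commit message


def hop_drain_ns(buffer_kind: str, nbytes: int) -> float:
    """Estimate the drain time in ns; TCM is PE-local and pays no hop."""
    bw = LINK_BW_GBS.get(buffer_kind)
    if bw is None:
        return 0.0
    # 1 GB/s equals 1 byte/ns, so bytes / (GB/s) yields nanoseconds.
    return nbytes / bw


if __name__ == "__main__":
    for kind in ("tcm", "hbm", "sram"):
        print(kind, hop_drain_ns(kind, 64 * 1024))
```

Under this simplified model a 64 KB slot costs 256 ns over the HBM link and 512 ns over the SRAM link per traversal, which agrees in direction with the tcm<hbm<sram ordering reported above.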
@@ -332,12 +332,37 @@ class PeIpcqComponent(ComponentBase):
         # ADR-0023 D9.7: charge IPCQ slot-READ latency against the
         # backing-memory tier (tcm/sram/hbm). Recv blocks for the
         # kernel-side slot consume; pe_exec_ns reflects this cost.
+        # SRAM/HBM live on the cube NoC behind a router-attached link,
+        # so reading a slot also pays a PE→bank fabric drain. TCM is
+        # per-PE local and skips this hop.
+        #
+        # cmd.consume is a DIAGNOSTIC flag (default True). When False,
+        # the read charges below are skipped — used only by the pe2pe
+        # overview plot for an apples-to-apples comparison against
+        # tl.store (one-sided write, no read on DST). Real kernels
+        # always consume; this branch must not be exercised in
+        # production code paths.
         from kernbench.common.ipcq_types import slot_io_latency_ns
-        slot_read_ns = slot_io_latency_ns(
-            self._buffer_kind, req.result_data.get("nbytes", 0),
-        )
-        if slot_read_ns > 0:
-            yield env.timeout(slot_read_ns)
+        nbytes = req.result_data.get("nbytes", 0)
+        if cmd.consume:
+            if self._buffer_kind in ("sram", "hbm") and self.ctx is not None:
+                cube_prefix = self._pe_prefix.rsplit(".", 1)[0]
+                bank_node = (
+                    f"{cube_prefix}.sram" if self._buffer_kind == "sram"
+                    else f"{cube_prefix}.hbm_ctrl"
+                )
+                try:
+                    path = self.ctx.router.find_path(
+                        self._pe_prefix, bank_node,
+                    )
+                    bank_drain_ns = self.ctx.compute_drain_ns(path, nbytes)
+                    if bank_drain_ns > 0:
+                        yield env.timeout(bank_drain_ns)
+                except Exception:
+                    pass
+            slot_read_ns = slot_io_latency_ns(self._buffer_kind, nbytes)
+            if slot_read_ns > 0:
+                yield env.timeout(slot_read_ns)
 
         # Diagnostics trace (D14)
         from kernbench.ccl import diagnostics
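The control flow in the hunk above can be restated outside the simulator as a small pure function, which may make the gating easier to check. This is a sketch under stated assumptions: recv_charges_ns, the two callables and the stub cost functions are hypothetical names introduced here, and the constant slot-IO cost is a placeholder rather than what slot_io_latency_ns returns.

```python
from typing import Callable, List


def recv_charges_ns(
    buffer_kind: str,
    nbytes: int,
    consume: bool,
    hop_ns: Callable[[str, int], float],
    slot_io_ns: Callable[[str, int], float],
) -> List[float]:
    """List the delays a recv would wait on, in order (hypothetical helper)."""
    if not consume:
        return []  # diagnostic path: no read charges at all
    charges: List[float] = []
    if buffer_kind in ("sram", "hbm"):
        charges.append(hop_ns(buffer_kind, nbytes))  # PE->bank fabric drain
    charges.append(slot_io_ns(buffer_kind, nbytes))  # intrinsic slot read
    # Mirror the diff: zero-length waits are skipped.
    return [c for c in charges if c > 0]


# Stub cost models, assumed for illustration only.
def stub_hop_ns(kind: str, nbytes: int) -> float:
    return nbytes / {"sram": 128.0, "hbm": 256.0}[kind]


def stub_slot_io_ns(kind: str, nbytes: int) -> float:
    return 10.0  # placeholder constant


if __name__ == "__main__":
    for kind in ("tcm", "hbm", "sram"):
        print(kind, recv_charges_ns(kind, 1024, True, stub_hop_ns, stub_slot_io_ns))
```

Passing consume=False returns an empty list for every buffer kind, which corresponds to the diagnostic branch that skips both the hop and the slot read.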