ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots

Cube SRAM and HBM live on the cube NoC behind router-attached links (sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256). Previously the slot-IO model treated them as if they were per-PE local, so the buffer_kind sweep showed TCM ≈ SRAM at 64 KB / PE.

pe_ipcq._handle_recv and pe_dma._handle_ipcq_inbound now charge a PE→bank compute_drain_ns on top of the intrinsic slot-IO for SRAM/HBM. TCM stays free of this hop.

Adds an internal IpcqRecvCmd.consume field that gates the recv-side hop+slot-IO charges (used by a follow-up diagnostic API; default True keeps current behavior).

Post-fix at 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs. SRAM is slowest because its 128 GB/s bank link is the narrowest in the system, narrower than HBM's 256 GB/s.

The existing ordering test is rewritten from tcm<sram<hbm to tcm<hbm<sram and a new test_ipcq_buffer_kind_locations adds 3 invariants on the gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:20:28 -07:00
parent 533e699299
commit 9c129d6131
7 changed files with 317 additions and 44 deletions
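
The charging rule described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the helper names bank_hop_ns and recv_charge_ns are hypothetical, the drain is assumed to be a pure nbytes / bandwidth term (the real compute_drain_ns takes a routed path and may include more), and the intrinsic slot-IO cost is passed in as a parameter rather than modeled. Only the two link bandwidths are taken from the message above.

```python
# Illustrative sketch only: a bandwidth-only stand-in for the PE<->bank hop.
# The link bandwidths come from the commit message; the function names and
# the pure nbytes / bandwidth drain formula are assumptions.

BANK_LINK_BW_GBS = {"sram": 128.0, "hbm": 256.0}  # tcm has no bank link


def bank_hop_ns(buffer_kind: str, nbytes: int) -> float:
    """Hypothetical drain time for one PE<->bank transfer, in ns."""
    bw_gbs = BANK_LINK_BW_GBS.get(buffer_kind)
    if bw_gbs is None:  # tcm: per-PE local, so no fabric hop is charged
        return 0.0
    return nbytes / bw_gbs  # 1 GB/s equals 1 byte/ns


def recv_charge_ns(
    buffer_kind: str, nbytes: int, slot_io_ns: float, consume: bool = True
) -> float:
    """Recv-side charge: fabric hop plus the caller-supplied slot-IO cost."""
    if not consume:  # diagnostic path: both read charges are skipped
        return 0.0
    return bank_hop_ns(buffer_kind, nbytes) + slot_io_ns


if __name__ == "__main__":
    for kind in ("tcm", "hbm", "sram"):
        print(kind, bank_hop_ns(kind, 64 * 1024))
```

Under these assumptions a 64 KB slot costs twice as much per hop over the SRAM link as over the HBM link, which points in the same direction as the reported HBM < SRAM ordering. The sketch does not attempt to reproduce the measured 12.0 / 21.4 / 24.3 µs figures.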
@@ -135,6 +135,13 @@ class IpcqRecvCmd:
        "return_slot" — return slot address as-is (default, zero-copy).
                        Kernel uses the slot memory directly.
        "copy_to_dst" — copy slot data to dst_addr, then return.
+
+    ``consume`` (DIAGNOSTIC ONLY): when False, recv still blocks until the
+    payload lands in the slot, but skips the slot-read latency charge
+    (slot-IO + PE↔bank fabric drain for SRAM/HBM tiers). This exists
+    solely so the pe2pe overview plot can compare apples-to-apples
+    against tl.store (a one-sided write that pays no read on DST). Real
+    kernels always need the data they receive — leave this True.
    """

    direction: str | None        # None → round-robin (weak fairness, D4)
@@ -146,6 +153,7 @@ class IpcqRecvCmd:
    dst_space: str = ""          # used only when recv_mode == "copy_to_dst"
    blocking: bool = True
    data_op: bool = True
+    consume: bool = True         # DIAGNOSTIC: see docstring


 # ── D12: IpcqDmaToken (PE_IPCQ → PE_DMA, vc_comm) ───────────────────
@@ -222,10 +222,24 @@ class PeDmaComponent(PeEngineBase):
        # ADR-0023 D9.7: charge IPCQ slot-WRITE latency against the
        # backing-memory tier (tcm/sram/hbm) before the atomic block.
        # Must come BEFORE the atomic write→IpcqMetaArrival pair (I6).
+        # SRAM/HBM also pay a PE_DMA→bank fabric drain (slot lives on
+        # the cube NoC); TCM is per-PE local and skips this hop.
        from kernbench.common.ipcq_types import slot_io_latency_ns
-        slot_write_ns = slot_io_latency_ns(
-            token.dst_endpoint.buffer_kind, token.nbytes,
-        )
+        buffer_kind = token.dst_endpoint.buffer_kind
+        if buffer_kind in ("sram", "hbm") and self.ctx is not None:
+            cube_prefix = self._pe_prefix.rsplit(".", 1)[0]
+            bank_node = (
+                f"{cube_prefix}.sram" if buffer_kind == "sram"
+                else f"{cube_prefix}.hbm_ctrl"
+            )
+            try:
+                path = self.ctx.router.find_path(self._pe_prefix, bank_node)
+                bank_drain_ns = self.ctx.compute_drain_ns(path, token.nbytes)
+                if bank_drain_ns > 0:
+                    yield env.timeout(bank_drain_ns)
+            except Exception:
+                pass
+        slot_write_ns = slot_io_latency_ns(buffer_kind, token.nbytes)
        if slot_write_ns > 0:
            yield env.timeout(slot_write_ns)

@@ -332,12 +332,37 @@ class PeIpcqComponent(ComponentBase):
        # ADR-0023 D9.7: charge IPCQ slot-READ latency against the
        # backing-memory tier (tcm/sram/hbm). Recv blocks for the
        # kernel-side slot consume; pe_exec_ns reflects this cost.
+        # SRAM/HBM live on the cube NoC behind a router-attached link,
+        # so reading a slot also pays a PE→bank fabric drain. TCM is
+        # per-PE local and skips this hop.
+        #
+        # cmd.consume is a DIAGNOSTIC flag (default True). When False,
+        # the read charges below are skipped — used only by the pe2pe
+        # overview plot for an apples-to-apples comparison against
+        # tl.store (one-sided write, no read on DST). Real kernels
+        # always consume; this branch must not be exercised in
+        # production code paths.
        from kernbench.common.ipcq_types import slot_io_latency_ns
-        slot_read_ns = slot_io_latency_ns(
-            self._buffer_kind, req.result_data.get("nbytes", 0),
-        )
-        if slot_read_ns > 0:
-            yield env.timeout(slot_read_ns)
+        nbytes = req.result_data.get("nbytes", 0)
+        if cmd.consume:
+            if self._buffer_kind in ("sram", "hbm") and self.ctx is not None:
+                cube_prefix = self._pe_prefix.rsplit(".", 1)[0]
+                bank_node = (
+                    f"{cube_prefix}.sram" if self._buffer_kind == "sram"
+                    else f"{cube_prefix}.hbm_ctrl"
+                )
+                try:
+                    path = self.ctx.router.find_path(
+                        self._pe_prefix, bank_node,
+                    )
+                    bank_drain_ns = self.ctx.compute_drain_ns(path, nbytes)
+                    if bank_drain_ns > 0:
+                        yield env.timeout(bank_drain_ns)
+                except Exception:
+                    pass
+            slot_read_ns = slot_io_latency_ns(self._buffer_kind, nbytes)
+            if slot_read_ns > 0:
+                yield env.timeout(slot_read_ns)

        # Diagnostics trace (D14)
        from kernbench.ccl import diagnostics