ADR-0023 D9.7: IPCQ slot-memory latency model (TCM/SRAM/HBM)

Charge per-tier bandwidth + setup overhead at IPCQ slot WRITE (receiver
inbound DMA, in pe_dma._handle_ipcq_inbound) and slot READ (recv consume,
in pe_ipcq._handle_recv).

Tier table (common/ipcq_types.py):
  tcm  : 512 GB/s, 0 ns
  sram : 128 GB/s, 2 ns
  hbm  :  32 GB/s, 6 ns

Before this change, slot read/write was free regardless of buffer_kind,
making memory-tier choice invisible in simulated latency. After the
change, swapping buffer_kind in ccl.yaml produces measurable per-tier
separation in allreduce latency.

Tests:
  test_ipcq_buffer_kind_latency.py: three micro-tests asserting
    tcm < sram < hbm ordering, payload-scaling, and that buffer_kind
    sensitivity grows with payload (credit-only path stays fabric-bound).
  test_allreduce_buffer_kind_sweep.py: 12-config parametrized sweep
    emitting buffer_kind_sweep.png (3 lines, torus_2d).
  conftest sessionfinish hook generalised to dispatch multiple sweep
    aggregators (allreduce + buffer-kind).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
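To make the tier table concrete, here is a minimal standalone sketch of the arithmetic it implies. The tier values and the formula (nbytes / bandwidth + overhead) are taken from this commit; the helper `per_message_slot_cost_ns` is a hypothetical name introduced only for illustration, and it assumes that sender and receiver use the same buffer_kind and that each message pays exactly one slot write plus one slot read. It does not model the fabric or the rest of the simulator.

```python
# Illustrative sketch only; tier values copied from the commit message.
# Each entry is (bandwidth in GB/s, fixed setup overhead in ns).
TIERS: dict[str, tuple[float, float]] = {
    "tcm": (512.0, 0.0),
    "sram": (128.0, 2.0),
    "hbm": (32.0, 6.0),
}


def slot_io_latency_ns(buffer_kind: str, nbytes: int) -> float:
    """Latency of one slot read or write, in nanoseconds.

    1 GB/s equals 1 byte/ns, so bytes divided by GB/s is already in ns.
    Unknown tiers fall back to tcm, mirroring the diff in this commit.
    """
    bw_gbs, overhead_ns = TIERS.get(buffer_kind, TIERS["tcm"])
    return float(nbytes) / bw_gbs + overhead_ns


def per_message_slot_cost_ns(buffer_kind: str, nbytes: int) -> float:
    """Hypothetical helper: one slot write (inbound DMA) plus one slot
    read (recv consume), assuming both sides use the same tier."""
    return 2.0 * slot_io_latency_ns(buffer_kind, nbytes)


def main() -> None:
    for nbytes in (64, 4096, 65536):
        costs = {k: slot_io_latency_ns(k, nbytes) for k in TIERS}
        gap = costs["hbm"] - costs["tcm"]
        print(f"{nbytes:>6} B  {costs}  hbm-tcm gap = {gap} ns")


if __name__ == "__main__":
    main()
```

For a 4096-byte payload this gives 8 ns (tcm), 34 ns (sram), and 134 ns (hbm) per access, and the hbm-to-tcm gap widens from 7.875 ns at 64 bytes to 1926 ns at 65536 bytes, which is the ordering and payload-scaling behaviour the micro-tests are described as asserting.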
2026-04-27 21:28:34 -07:00
parent 1e39214f89
commit 84a1325e5c
8 changed files with 489 additions and 17 deletions
@@ -31,6 +31,26 @@ class IpcqInvalidDirection(ValueError):
    has no neighbor installed for this PE."""


+# ── ADR-0023 D9.7: IPCQ slot-memory latency model ───────────────────
+#
+# Per-tier (bw_gbs, overhead_ns) used to charge the slot write (inbound)
+# and slot read (recv consume). Mirrors topology.yaml component values.
+_BUFFER_KIND_BW: dict[str, tuple[float, float]] = {
+    "tcm":  (512.0, 0.0),
+    "sram": (128.0, 2.0),
+    "hbm":  (32.0,  6.0),
+}
+
+
+def slot_io_latency_ns(buffer_kind: str, nbytes: int) -> float:
+    """Per-access latency for one slot read/write of ``nbytes`` against
+    the IPCQ backing memory tier (``buffer_kind``)."""
+    bw_gbs, overhead_ns = _BUFFER_KIND_BW.get(
+        buffer_kind, _BUFFER_KIND_BW["tcm"],
+    )
+    return float(nbytes) / bw_gbs + overhead_ns
+
+
 # ── D2.5: IpcqEndpoint ───────────────────────────────────────────────


@@ -219,6 +219,16 @@ class PeDmaComponent(PeEngineBase):

        token = txn.request

+        # ADR-0023 D9.7: charge IPCQ slot-WRITE latency against the
+        # backing-memory tier (tcm/sram/hbm) before the atomic block.
+        # Must come BEFORE the atomic write→IpcqMetaArrival pair (I6).
+        from kernbench.common.ipcq_types import slot_io_latency_ns
+        slot_write_ns = slot_io_latency_ns(
+            token.dst_endpoint.buffer_kind, token.nbytes,
+        )
+        if slot_write_ns > 0:
+            yield env.timeout(slot_write_ns)
+
        # ── ATOMIC: do not introduce yield between these two operations ──
        # 1. Move data via MemoryStore (single-hop DMA write).
        # Prefer the in-flight snapshot stashed by the sender PE_DMA;
@@ -329,6 +329,16 @@ class PeIpcqComponent(ComponentBase):

        qp["my_tail"] += 1

+        # ADR-0023 D9.7: charge IPCQ slot-READ latency against the
+        # backing-memory tier (tcm/sram/hbm). Recv blocks for the
+        # kernel-side slot consume; pe_exec_ns reflects this cost.
+        from kernbench.common.ipcq_types import slot_io_latency_ns
+        slot_read_ns = slot_io_latency_ns(
+            self._buffer_kind, req.result_data.get("nbytes", 0),
+        )
+        if slot_read_ns > 0:
+            yield env.timeout(slot_read_ns)
+
        # Diagnostics trace (D14)
        from kernbench.ccl import diagnostics
        if diagnostics.trace_enabled():