Add tl.recv_no_consume diagnostic API for apples-to-apples pe2pe plot

The pe2pe overview compared IPCQ (tl.send + tl.recv) against raw DMA (tl.load + tl.store), but DMA is one-sided (DST never reads), while tl.recv pays a slot-read on DST. The comparison was unfair: IPCQ looked slower partly because it does more work.

Adds tl.recv_no_consume(), a separate, diagnostic-only entry point that blocks for slot arrival but skips the slot-read (and bank-hop) charge on DST. Production tl.recv is unchanged (no `consume` kwarg on the public API), so the diagnostic flag can never accidentally leak into real workloads.

Updates test_pe_to_pe_latency to call tl.recv_no_consume so that overview.png shows IPCQ no-consume vs raw DMA on equal footing.

Also fixes PLOT_DIR back to docs/diagrams/pe2pe_latency_plots/ (was lost in a merge).

Adds scripts/replot_pe2pe.py for label-only re-renders without re-measuring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
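
The fairness argument in the message above can be sketched with a toy latency model. This is only an illustration under stated assumptions: the `ToyCosts` class, its three cost terms, and all cycle counts are hypothetical and are not taken from the simulator. It shows that once the DST slot-read charge is dropped, any remaining gap between IPCQ and raw DMA comes from the queue mechanism itself rather than from extra work on DST.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCosts:
    """Hypothetical per-transfer latency components, in cycles."""
    fabric_write: int = 100   # assumed cost of moving the payload SRC -> DST
    queue_overhead: int = 10  # assumed IPCQ bookkeeping (slot arrival signalling)
    slot_read: int = 30       # assumed cost of DST reading the slot (incl. bank hop)


def raw_dma_latency(costs: ToyCosts) -> int:
    """One-sided transfer: DST never reads, so only the fabric write is paid."""
    return costs.fabric_write


def ipcq_latency(costs: ToyCosts, consume: bool) -> int:
    """Two-sided transfer: DST always waits for arrival, and reads only if consume."""
    latency = costs.fabric_write + costs.queue_overhead
    if consume:
        latency += costs.slot_read
    return latency


costs = ToyCosts()
dma = raw_dma_latency(costs)
ipcq_consume = ipcq_latency(costs, consume=True)      # models tl.recv
ipcq_no_consume = ipcq_latency(costs, consume=False)  # models tl.recv_no_consume

print(f"raw DMA:         {dma} cycles")
print(f"IPCQ consume:    {ipcq_consume} cycles (gap {ipcq_consume - dma})")
print(f"IPCQ no-consume: {ipcq_no_consume} cycles (gap {ipcq_no_consume - dma})")
```

In this model the consume variant trails raw DMA by the queue overhead plus the slot read, while the no-consume variant trails it by the queue overhead alone, which is the quantity the overview plot is meant to isolate.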
2026-04-28 18:20:44 -07:00
parent 9c129d6131
commit a563169e89
9 changed files with 245 additions and 48 deletions
@@ -492,6 +492,48 @@ class TLContext:
            )
        return self._make_handle(addr=0, shape=shape, dtype=dtype)

+    def recv_no_consume(
+        self,
+        dir: str | None = None,
+        shape: tuple[int, ...] = (),
+        dtype: str = "f16",
+    ) -> TensorHandle:
+        """DIAGNOSTIC ONLY — recv that blocks for arrival but skips slot read.
+
+        Same blocking semantics as ``tl.recv``: the kernel waits until
+        the payload has landed in the IPCQ slot. Differs from ``tl.recv``
+        by skipping the slot-read latency charge (slot-IO + PE↔bank
+        fabric drain) on DST.
+
+        This entry point exists solely so the pe2pe overview plot can
+        draw an apples-to-apples comparison against ``tl.store`` (a
+        one-sided fabric write that pays no read on DST). Production
+        kernels MUST use ``tl.recv`` — they need to consume the data
+        they receive. This API is segregated from ``tl.recv`` so the
+        diagnostic flag can never accidentally be set in real workloads.
+        """
+        self._emit_dispatch_overhead()
+        cmd = IpcqRecvCmd(
+            direction=dir,
+            shape=shape, dtype=dtype,
+            handle_id=self._next_handle_id(),
+            consume=False,
+        )
+        result = self._emit(cmd)  # type: ignore[arg-type]
+        if isinstance(result, dict):
+            slot_addr = int(result.get("src_addr", 0))
+            slot_space = str(result.get("src_space", "tcm"))
+            return TensorHandle(
+                id=self._next_handle_id(),
+                addr=slot_addr,
+                shape=shape,
+                dtype=dtype,
+                nbytes=self._nbytes(shape, dtype),
+                data=None,
+                space=slot_space,
+            )
+        return self._make_handle(addr=0, shape=shape, dtype=dtype)
+
    def recv_async(
        self,
        dir: str,