Add tl.recv_no_consume diagnostic API for apples-to-apples pe2pe plot

The pe2pe overview compared IPCQ (tl.send + tl.recv) against raw DMA
(tl.load + tl.store), but DMA is one-sided — DST never reads — while
tl.recv pays a slot-read on DST. The comparison was unfair: IPCQ
looked slower partly because it does more work.

Adds tl.recv_no_consume() — a separate, diagnostic-only entry point
that blocks for slot arrival but skips the slot-read (and bank-hop)
charge on DST. Production tl.recv is unchanged (no `consume` kwarg
on the public API), so the diagnostic flag can never accidentally
leak into real workloads.
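
As a rough illustration of why the original comparison was skewed, here is a toy cost model. It is only a sketch: `CostModel`, its cycle values, and the two latency functions are hypothetical names and numbers invented for this example, not taken from the simulator. It models only the DST-side read charge and ignores any other IPCQ overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    """Hypothetical cycle costs, chosen only for illustration."""
    transfer: int = 100   # cycles until the payload lands on DST
    slot_read: int = 30   # slot-IO charge when DST reads the slot
    bank_hop: int = 10    # PE<->bank fabric drain charge on DST


def dma_store_latency(m: CostModel) -> int:
    """One-sided write: DST never reads, so only the transfer is charged."""
    return m.transfer


def ipcq_recv_latency(m: CostModel, consume: bool = True) -> int:
    """Block until arrival; charge the DST-side read only when consuming."""
    latency = m.transfer
    if consume:
        latency += m.slot_read + m.bank_hop
    return latency


model = CostModel()
dma = dma_store_latency(model)
recv = ipcq_recv_latency(model, consume=True)
recv_no_consume = ipcq_recv_latency(model, consume=False)
print(f"dma={dma} recv={recv} recv_no_consume={recv_no_consume}")
```

In this toy model the gap between `recv` and `dma` is exactly the slot-read plus bank-hop charge, which is the term the no-consume variant removes from the plot.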

Updates test_pe_to_pe_latency to call tl.recv_no_consume so the
overview.png shows IPCQ no-consume vs raw DMA on equal footing.
Also fixes PLOT_DIR back to docs/diagrams/pe2pe_latency_plots/
(was lost in a merge). Adds scripts/replot_pe2pe.py for label-only
re-renders without re-measuring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:20:44 -07:00
parent 9c129d6131
commit a563169e89
9 changed files with 245 additions and 48 deletions
@@ -492,6 +492,48 @@ class TLContext:
            )
        return self._make_handle(addr=0, shape=shape, dtype=dtype)

    def recv_no_consume(
        self,
        dir: str | None = None,
        shape: tuple[int, ...] = (),
        dtype: str = "f16",
    ) -> TensorHandle:
        """DIAGNOSTIC ONLY: recv that blocks for arrival but skips slot read.

        Same blocking semantics as ``tl.recv``: the kernel waits until
        the payload has landed in the IPCQ slot. Differs from ``tl.recv``
        by skipping the slot-read latency charge (slot-IO + PE↔bank
        fabric drain) on DST.

        This entry point exists solely so the pe2pe overview plot can
        draw an apples-to-apples comparison against ``tl.store`` (a
        one-sided fabric write that pays no read on DST). Production
        kernels MUST use ``tl.recv`` because they need to consume the
        data they receive. This API is segregated from ``tl.recv`` so the
        diagnostic flag can never accidentally be set in real workloads.
        """
        self._emit_dispatch_overhead()
        cmd = IpcqRecvCmd(
            direction=dir,
            shape=shape, dtype=dtype,
            handle_id=self._next_handle_id(),
            consume=False,
        )
        result = self._emit(cmd)  # type: ignore[arg-type]
        if isinstance(result, dict):
            slot_addr = int(result.get("src_addr", 0))
            slot_space = str(result.get("src_space", "tcm"))
            return TensorHandle(
                id=self._next_handle_id(),
                addr=slot_addr,
                shape=shape,
                dtype=dtype,
                nbytes=self._nbytes(shape, dtype),
                data=None,
                space=slot_space,
            )
        return self._make_handle(addr=0, shape=shape, dtype=dtype)

    def recv_async(
        self,
        dir: str,