Add tl.recv_no_consume diagnostic API for apples-to-apples pe2pe plot
The pe2pe overview compared IPCQ (tl.send + tl.recv) against raw DMA (tl.load + tl.store), but DMA is one-sided (DST never reads), while tl.recv pays a slot-read on DST. The comparison was unfair: IPCQ looked slower partly because it does more work.

Adds tl.recv_no_consume(), a separate, diagnostic-only entry point that blocks for slot arrival but skips the slot-read (and bank-hop) charge on DST. Production tl.recv is unchanged (no `consume` kwarg on the public API), so the diagnostic flag can never accidentally leak into real workloads.

Updates test_pe_to_pe_latency to call tl.recv_no_consume so the overview.png shows IPCQ no-consume vs raw DMA on equal footing.

Also fixes PLOT_DIR back to docs/diagrams/pe2pe_latency_plots/ (was lost in a merge). Adds scripts/replot_pe2pe.py for label-only re-renders without re-measuring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -492,6 +492,48 @@ class TLContext:
             )
         return self._make_handle(addr=0, shape=shape, dtype=dtype)
 
+    def recv_no_consume(
+        self,
+        dir: str | None = None,
+        shape: tuple[int, ...] = (),
+        dtype: str = "f16",
+    ) -> TensorHandle:
+        """DIAGNOSTIC ONLY — recv that blocks for arrival but skips slot read.
+
+        Same blocking semantics as ``tl.recv``: the kernel waits until
+        the payload has landed in the IPCQ slot. Differs from ``tl.recv``
+        by skipping the slot-read latency charge (slot-IO + PE↔bank
+        fabric drain) on DST.
+
+        This entry point exists solely so the pe2pe overview plot can
+        draw an apples-to-apples comparison against ``tl.store`` (a
+        one-sided fabric write that pays no read on DST). Production
+        kernels MUST use ``tl.recv`` — they need to consume the data
+        they receive. This API is segregated from ``tl.recv`` so the
+        diagnostic flag can never accidentally be set in real workloads.
+        """
+        self._emit_dispatch_overhead()
+        cmd = IpcqRecvCmd(
+            direction=dir,
+            shape=shape, dtype=dtype,
+            handle_id=self._next_handle_id(),
+            consume=False,
+        )
+        result = self._emit(cmd)  # type: ignore[arg-type]
+        if isinstance(result, dict):
+            slot_addr = int(result.get("src_addr", 0))
+            slot_space = str(result.get("src_space", "tcm"))
+            return TensorHandle(
+                id=self._next_handle_id(),
+                addr=slot_addr,
+                shape=shape,
+                dtype=dtype,
+                nbytes=self._nbytes(shape, dtype),
+                data=None,
+                space=slot_space,
+            )
+        return self._make_handle(addr=0, shape=shape, dtype=dtype)
+
     def recv_async(
         self,
         dir: str,
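The fairness argument in the commit message can be illustrated with a toy latency model. This is only a sketch and not the simulator's accounting: `ToyCosts`, `dst_visible_latency`, and the cycle numbers are hypothetical names and values invented for illustration. It assumes that both receive variants block until the payload arrives and that the only difference is the DST-side slot-read charge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCosts:
    """Hypothetical cycle costs; NOT taken from the simulator."""
    transfer: int = 100   # assumed cost of moving the payload SRC -> DST
    slot_read: int = 30   # assumed DST-side slot-IO + PE<->bank drain


def dst_visible_latency(costs: ToyCosts, consume: bool) -> int:
    """Latency charged up to the point the receive returns on DST.

    Both variants wait for arrival; only the consuming variant is
    additionally charged the slot read.
    """
    latency = costs.transfer
    if consume:
        latency += costs.slot_read
    return latency


costs = ToyCosts()
recv = dst_visible_latency(costs, consume=True)              # models tl.recv
recv_no_consume = dst_visible_latency(costs, consume=False)  # models tl.recv_no_consume
raw_dma = costs.transfer  # one-sided write: DST pays no read

print(recv, recv_no_consume, raw_dma)  # -> 130 100 100
```

In this toy model the no-consume path matches raw DMA exactly because no other overhead is modelled. In a real measurement, any gap that remains after removing the slot read would come from protocol cost rather than the extra DST read.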