The pe2pe overview compared IPCQ (tl.send + tl.recv) against raw DMA
(tl.load + tl.store), but DMA is one-sided — DST never reads — while
tl.recv pays a slot-read on DST. The comparison was unfair: IPCQ
looked slower partly because it does more work.
Adds tl.recv_no_consume() — a separate, diagnostic-only entry point
that blocks for slot arrival but skips the slot-read (and bank-hop)
charge on DST. Production tl.recv is unchanged (no `consume` kwarg
on the public API), so the diagnostic flag can never accidentally
leak into real workloads.
Updates test_pe_to_pe_latency to call tl.recv_no_consume so the
overview.png shows IPCQ no-consume vs raw DMA on equal footing.
Also fixes PLOT_DIR back to docs/diagrams/pe2pe_latency_plots/
(was lost in a merge). Adds scripts/replot_pe2pe.py for label-only
re-renders without re-measuring.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove h5_inter_sip from the hop list and switch the overview grid
from 2x3 to 2x2. RAW DMA was unavailable for the cross-SIP hop, so
the panel only carried IPCQ data and was redundant with h4_inter_cube
for the topology comparison.
Regenerate pe2pe_latency_plots/overview.png and summary.csv; delete
the obsolete h5_inter_sip.png.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds tests/test_pe_to_pe_latency.py: a sweep that measures PE-to-PE
transfer latency for five hop types (intra-cube horizontal/vertical,
inter-cube horizontal/vertical, inter-SIP) across data sizes 128 B to
10 KB, on both the IPCQ (tl.send/tl.recv) and raw-DMA (tl.load+tl.store)
paths. Emits per-hop PNG plots, an overview PNG, and a CSV summary into
tests/pe2pe_latency_plots/. Latency is reported as max(pe_exec_ns) across
participating PEs, read from engine.get_completion(), so the measurement
captures the SRC/DST PE's kernel body time rather than the full launch+
response-aggregation envelope.
Two simulator fixes were needed to make this measurement meaningful:
- PeMMU now stores a list of (start, end, pa) sub-regions per page
rather than a single PA. DPPolicy layouts with shards smaller than
page_size (e.g. 128 B payloads with 4 KB pages) used to silently
overwrite each other through last-write-wins, causing DMAs intended
for cube0 to physically route to cube3 - inflating latency by ~170 ns
per DMA at small sizes. STOPGAP: real MMUs don't support sub-page
regions; long-term fix is either smaller MMU page size or DPPolicy
validation that refuses sub-page shards.
- M_CPU's per-PE metrics aggregation (pe_exec_ns, dma_ns, compute_ns)
now max-merges against the existing value in result_data rather than
overwriting. Multi-cube workloads share one result_data dict via
IO_CPU fanout; the previous overwrite caused whichever cube's M_CPU
finished last to clobber others' values, so multi-cube pe_exec_ns was
racy and frequently 0. Same fix applied in legacy/builtin/m_cpu.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>