scripts/plot_pe_dma_perf.py runs the simulator across six
no-congestion scenarios (SAME_CUBE_PE_LOCAL / REMOTE_BEST /
REMOTE_WORST, REMOTE_CUBE_BEST / REMOTE_WORST, REMOTE_SIP) and
five congestion scenarios (1/2/3-PE hot-target, 8-PE corresponding
cube-to-cube, 8-PE all-hit-pe0). It breaks each scenario's measured
total latency (or makespan) into pe_setup, noc_mesh, ucie, fabric,
streaming, hbm_ctrl, and a contention residual, using a
wormhole-pipelined model (first-flit arrival +
(n_flits-1)/bottleneck + final chunk_time).
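The wormhole-pipelined estimate above can be sketched as follows. This is a minimal illustration, not the code in plot_pe_dma_perf.py: the function names, parameter names, and units (ns, bytes/ns) are assumptions, and the inputs are made up rather than taken from summary.csv.

```python
def wormhole_latency_ns(first_flit_ns, n_flits, flit_bytes,
                        bottleneck_bw_bytes_per_ns, final_chunk_ns):
    """Pipelined estimate: first-flit arrival + streaming + final chunk."""
    if n_flits < 1:
        raise ValueError("n_flits must be at least 1")
    # The remaining (n_flits - 1) flits stream behind the head flit at the
    # rate of the slowest link on the path.
    streaming_ns = (n_flits - 1) * flit_bytes / bottleneck_bw_bytes_per_ns
    return first_flit_ns + streaming_ns + final_chunk_ns


def contention_residual_ns(actual_ns, formula_ns):
    """Part of the measured latency the pipelined formula does not explain."""
    return actual_ns - formula_ns


# Illustrative inputs only (hypothetical, not simulator output).
formula = wormhole_latency_ns(first_flit_ns=20.0, n_flits=9, flit_bytes=64,
                              bottleneck_bw_bytes_per_ns=32.0,
                              final_chunk_ns=4.0)
residual = contention_residual_ns(actual_ns=50.0, formula_ns=formula)
print(formula, residual)
```

With these inputs the streaming term is 8 x 64 / 32 = 16 ns, so the formula gives 40 ns and the residual against a 50 ns measurement is 10 ns.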
Outputs:
docs/diagrams/pe_dma_perf/no_congestion.png: single-PE latency
by topological distance. Shows latency growing from
SAME_CUBE_PE_LOCAL (77 ns) to REMOTE_CUBE_PE_REMOTE_WORST
(573 ns), with REMOTE_SIP at 409 ns.
docs/diagrams/pe_dma_perf/congestion.png: makespan as concurrent
issuer count grows. ctrl_hot_{1,2,3} = 82/158/230 ns; 8-PE
eastbound UCIe = 963 ns; 8-PE all-hit-pe0 = 558 ns.
docs/diagrams/pe_dma_perf/summary.csv: raw rows for re-plotting.
Built-in --verify harness asserts:
(1) distance monotonicity for no-congestion;
(2) same-cube paths contain zero UCIe budget;
(3) remote-cube/SIP paths carry positive UCIe budget;
(4) breakdown is internally consistent (formula ≤ actual);
(5) streaming term matches (n_flits-1) × flit_bytes /
bottleneck_bw within 5 % for the local scenario;
(6) congestion makespan is monotonic in issuer count;
(7) 8-PE hotspot strictly exceeds 3-PE hotspot.
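Checks (6) and (7) can be illustrated with the makespans reported above. This is a sketch under assumptions: it reads "monotonic" as non-decreasing, takes the 8-PE hotspot to be the all-hit-pe0 scenario, and uses hypothetical variable names that are not those of the --verify harness.

```python
def is_non_decreasing(values):
    """True when each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Makespans in ns, as reported for congestion.png.
ctrl_hot_ns = {1: 82, 2: 158, 3: 230}   # issuer count -> makespan
all_hit_pe0_8pe_ns = 558                # assumed to be the "8-PE hotspot"

makespans = [ctrl_hot_ns[n] for n in sorted(ctrl_hot_ns)]
check6 = is_non_decreasing(makespans)            # monotonic in issuer count
check7 = all_hit_pe0_8pe_ns > ctrl_hot_ns[3]     # 8-PE strictly above 3-PE
print(check6, check7)
```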
Cross-SIP gets a looser 70 % contention slack because the path
crosses two non-flit-aware (pcie_ep) boundaries that force
store-and-forward re-streaming, a cost the simple formula does
not attribute. Single-cube scenarios stay under 25 % residual.
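The consistency check with per-class slack might look like the sketch below. It assumes the residual is expressed as a fraction of the measured value and that 25 % and 70 % act as upper bounds; the names SLACK and breakdown_ok are hypothetical, and the harness may define the residual differently.

```python
# Assumed slack per scenario class, as fractions of the measured latency.
SLACK = {"single_cube": 0.25, "cross_sip": 0.70}


def residual_fraction(formula_ns, actual_ns):
    """Share of the measured latency not explained by the formula."""
    return (actual_ns - formula_ns) / actual_ns


def breakdown_ok(formula_ns, actual_ns, slack):
    """Formula must not exceed actual, and the residual must fit the slack."""
    if formula_ns > actual_ns:
        return False
    return residual_fraction(formula_ns, actual_ns) <= slack


# Illustrative values only (hypothetical, not simulator output).
print(breakdown_ok(80.0, 100.0, SLACK["single_cube"]))  # 20 % residual
print(breakdown_ok(40.0, 100.0, SLACK["single_cube"]))  # 60 % residual
print(breakdown_ok(40.0, 100.0, SLACK["cross_sip"]))    # 60 % residual
```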
All checks PASS against the current model (post ADR-0019 D1/D4
per-PE HBM CTRL restoration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related issues caused measured pipeline efficiency to look
worse than the simulator's actual behavior:
1. DMA timing recorded too early. The op-log start timestamp
for a DMA op fired when the request entered the queue, and
the DMA channel was released as soon as the request was
issued. Back-to-back DMAs therefore appeared to grab the
channel simultaneously, and per-op duration drifted upward
as queue depth grew. That drift was a logging artifact, not
real cost.
Fix: defer the start timestamp until after the channel is
acquired, and hold the channel through the full HBM
round-trip until the response returns. Per-op duration is
now constant and equal to the actual transfer interval;
serialization is visible as queue wait, not as inflated
service time.
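The effect of moving the start timestamp can be seen in a toy single-channel model. This is a simplified sketch, not the simulator's DMA engine: OpRecord, simulate, and the start_at_acquire switch are hypothetical names, and the model reproduces only the logging artifact (start stamped at enqueue versus at channel acquisition).

```python
from dataclasses import dataclass


@dataclass
class OpRecord:
    enqueue_ns: float
    start_ns: float
    end_ns: float

    @property
    def duration_ns(self):
        return self.end_ns - self.start_ns

    @property
    def queue_wait_ns(self):
        return self.start_ns - self.enqueue_ns


def simulate(enqueue_times_ns, service_ns, start_at_acquire):
    """Serve requests in order on one channel held for the full round trip."""
    channel_free_ns = 0.0
    records = []
    for enq in enqueue_times_ns:
        acquire = max(enq, channel_free_ns)   # wait until the channel is free
        end = acquire + service_ns            # response returns here
        channel_free_ns = end                 # release only after the response
        start = acquire if start_at_acquire else enq
        records.append(OpRecord(enq, start, end))
    return records


legacy = simulate([0.0, 0.0, 0.0], service_ns=100.0, start_at_acquire=False)
fixed = simulate([0.0, 0.0, 0.0], service_ns=100.0, start_at_acquire=True)
print([r.duration_ns for r in legacy])    # grows with queue depth
print([r.duration_ns for r in fixed])     # constant service time
print([r.queue_wait_ns for r in fixed])   # serialization shows up as wait
```

With three requests enqueued at t = 0 and a 100 ns service time, the legacy stamping reports durations of 100, 200, and 300 ns, while the deferred stamping reports 100 ns each with queue waits of 0, 100, and 200 ns.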
2. Sweep timing window folded in pre-composite work. The PE
timing window spanned every PE engine record, which
included the upfront pinned-operand DMA issued before the
composite GEMM begins. For large-K shapes that one-shot
load can be nearly half of the window, conflating
operand-staging cost with composite-pipeline behavior.
Fix: add a second window scoped to the composite pipeline
by filtering op_log records to those tagged with a
tile-pipeline stage; the legacy operand-load path is
untagged and naturally excluded. For 32x3072x32 load_ref
the window drops from 1765 ns to 992 ns, and measured
efficiency lines up with the steady-state DMA-bound stage
limit instead of being penalized for the one-time load.
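The two windows can be sketched as a filter over op-log records. The record layout here (dicts with "stage", "start_ns", "end_ns") is an assumption for illustration, and the timings are made up rather than taken from the 32x3072x32 run.

```python
def window_ns(records):
    """Span from the earliest start to the latest end; 0 for no records."""
    if not records:
        return 0
    return (max(r["end_ns"] for r in records)
            - min(r["start_ns"] for r in records))


# Hypothetical op log: the pinned-operand load carries no pipeline stage tag.
op_log = [
    {"op": "dma_pinned_load", "stage": None, "start_ns": 0, "end_ns": 400},
    {"op": "dma_tile", "stage": "load", "start_ns": 400, "end_ns": 520},
    {"op": "gemm_tile", "stage": "compute", "start_ns": 520, "end_ns": 600},
    {"op": "dma_tile", "stage": "store", "start_ns": 600, "end_ns": 700},
]

full_window = window_ns(op_log)
pipeline_window = window_ns([r for r in op_log if r["stage"] is not None])
print(full_window, pipeline_window)
```

Because the untagged load is dropped by the filter, the pipeline window (300 ns here) measures only the composite stages, while the full window (700 ns) still includes operand staging.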
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/build_overview_slides.py renders a 5-slide PPTX
(kernbench2_overview.pptx) summarizing architecture, model
correctness, IPCQ, allreduce, and buffer-kind tier comparison.
scripts/emit_overview_with_external_ref.py renders log-y and
broken-y variants of the allreduce overview (overview_log.png,
overview_broken.png), each including a 366 µs external-simulator
reference marker at 96 KB per PE.
Also includes cube_mesh_view.png rendered from the SVG.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pe2pe overview compared IPCQ (tl.send + tl.recv) against raw DMA
(tl.load + tl.store), but DMA is one-sided (DST never reads), while
tl.recv pays a slot-read on DST. The comparison was unfair: IPCQ
looked slower partly because it does more work.
Adds tl.recv_no_consume(), a separate, diagnostic-only entry point
that blocks for slot arrival but skips the slot-read (and bank-hop)
charge on DST. Production tl.recv is unchanged (no `consume` kwarg
on the public API), so the diagnostic behavior can never
accidentally leak into real workloads.
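The design choice (a separate entry point rather than a flag on the production call) can be illustrated with a toy cost model. Everything below is hypothetical: the function signatures, the SLOT_READ_NS constant, and the timing arithmetic are stand-ins and do not reflect the real tl.recv or tl.recv_no_consume implementations.

```python
import inspect

SLOT_READ_NS = 10.0  # hypothetical DST-side slot-read charge


def _wait_for_slot(now_ns, arrival_ns):
    """Block until the slot has arrived; returns the time the wait ends."""
    return max(now_ns, arrival_ns)


def recv(now_ns, arrival_ns):
    """Production path: wait for arrival, then pay the slot-read."""
    return _wait_for_slot(now_ns, arrival_ns) + SLOT_READ_NS


def recv_no_consume(now_ns, arrival_ns):
    """Diagnostic path: wait for arrival only, no slot-read charge."""
    return _wait_for_slot(now_ns, arrival_ns)


t_recv = recv(0.0, 50.0)
t_diag = recv_no_consume(0.0, 50.0)
# The production signature exposes no switch that could skip the charge.
has_consume_kwarg = "consume" in inspect.signature(recv).parameters
print(t_recv, t_diag, has_consume_kwarg)
```

Keeping the diagnostic path in its own function means a production caller cannot disable the slot-read by passing an argument; it would have to call a differently named function.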
Updates test_pe_to_pe_latency to call tl.recv_no_consume so the
overview.png shows IPCQ no-consume vs raw DMA on equal footing.
Also restores PLOT_DIR to docs/diagrams/pe2pe_latency_plots/
(the setting was lost in a merge). Adds scripts/replot_pe2pe.py
for label-only re-renders without re-measuring.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>