kernbench2

Author	SHA1	Message	Date
ywkang	a44f832be5	Regenerate latency plots/diagrams for post-Phase-2c model Allreduce + pe2pe + ipcq + pe_view auto-regenerated by test sweeps running against the new chunk-streaming wire timing (per-flit wormhole) — absolute numbers shift upward to reflect bottleneck-link transit charged once per flit (instead of the previous cut-through subtraction at HBM CTRL). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 23:24:01 -07:00
mukesh	a563169e89	Add tl.recv_no_consume diagnostic API for apples-to-apples pe2pe plot The pe2pe overview compared IPCQ (tl.send + tl.recv) against raw DMA (tl.load + tl.store), but DMA is one-sided — DST never reads — while tl.recv pays a slot-read on DST. The comparison was unfair: IPCQ looked slower partly because it does more work. Adds tl.recv_no_consume() — a separate, diagnostic-only entry point that blocks for slot arrival but skips the slot-read (and bank-hop) charge on DST. Production tl.recv is unchanged (no `consume` kwarg on the public API), so the diagnostic flag can never accidentally leak into real workloads. Updates test_pe_to_pe_latency to call tl.recv_no_consume so the overview.png shows IPCQ no-consume vs raw DMA on equal footing. Also fixes PLOT_DIR back to docs/diagrams/pe2pe_latency_plots/ (was lost in a merge). Adds scripts/replot_pe2pe.py for label-only re-renders without re-measuring. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:20:44 -07:00
mukesh	1c5752a9ec	Intercube allreduce: center root + bidirectional reduce Move the algorithmic root cube from the corner (cube_w-1, cube_h-1) to the geometric center (cube_w//2, cube_h//2) and have each phase converge bidirectionally so the intra-SIP critical path drops from ~12 hops to ~8 hops on a 4×4 mesh (left half W→E + right half E→W in row reduce; top half N→S + bottom half S→N in col reduce; mirrored on broadcast). Result on torus_2d 6 SIPs at 96 KB / PE on TCM: before (corner root) : 22.0 µs after (center root) : 17.2 µs (−22%) Same shape on ring_1d (−7%) and mesh_2d_no_wrap (−12%); also holds across SRAM and HBM (~−20% each). Phase 1 test (test_intercube_root_center.py) asserts the torus_2d 96 KB latency drops below 20.5 µs and that all 96 cubes still validate (correctness preserved). Plot updates: - overview.png: replace constant 10.6 µs theoretical line with user-supplied hand-derived curve (per-cube packet count = bytes_per_pe × 8 PEs ÷ 128 B; 1346 ns startup + 1.20 ns/pkt). - All summary.csv numbers and per-topology PNGs regenerated. - pe2pe_latency_plots and ipcq diagram emitter PNGs refreshed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:28:58 -07:00
mukesh	1e39214f89	Move generated diagrams to docs/diagrams/; add IPCQ diagram emitter Plot output dirs now live under docs/diagrams/ (the canonical "derived artifacts" location per CLAUDE.md): tests/allreduce_latency_plots/ → docs/diagrams/allreduce_latency_plots/ tests/pe2pe_latency_plots/ → docs/diagrams/pe2pe_latency_plots/ + new docs/diagrams/ipcq_diagram_plots/ with two presentation diagrams (ipcq_send_recv.png, ipcq_two_pe_dma.png) New test tests/test_emit_ipcq_diagram.py renders the two IPCQ diagrams from a static description (no simulation); it exists so the diagrams can be regenerated reproducibly. Path references updated in tests/test_pe_to_pe_latency.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:28:17 -07:00

4 Commits