kernbench2

Author	SHA1	Message	Date
ywkang	a759d58007	Add PE_DMA latency-breakdown plots + self-verification harness scripts/plot_pe_dma_perf.py runs the simulator across six no-congestion scenarios (SAME_CUBE_PE_LOCAL / REMOTE_BEST / REMOTE_WORST, REMOTE_CUBE_BEST / REMOTE_WORST, REMOTE_SIP) and five congestion scenarios (1/2/3 PE hot-target, 8-PE corresp. cube-to-cube, 8-PE all-hit-pe0). It categorises actual total / makespan into pe_setup, noc_mesh, ucie, fabric, streaming, hbm_ctrl, and a contention residual using a wormhole-pipelined model (first-flit arrival + (n_flits-1)/bottleneck + final chunk_time). Outputs: docs/diagrams/pe_dma_perf/no_congestion.png — single-PE latency by topological distance. Visualises monotonic growth from SAME_CUBE_PE_LOCAL (77 ns) up to REMOTE_CUBE_PE_REMOTE_WORST (573 ns) and REMOTE_SIP (409 ns). docs/diagrams/pe_dma_perf/congestion.png — makespan as concurrent issuer count grows. ctrl_hot_{1,2,3}=82/158/230 ns; 8-PE eastbound UCIe = 963 ns; 8-PE all-hit-pe0 = 558 ns. docs/diagrams/pe_dma_perf/summary.csv — raw rows for re-plotting. Built-in --verify harness asserts: (1) distance monotonicity for no-congestion; (2) same-cube paths contain zero UCIe budget; (3) remote-cube/SIP paths carry positive UCIe budget; (4) breakdown is internally consistent (formula ≤ actual); (5) streaming term matches (n_flits-1) × flit_bytes / bottleneck_bw within 5 % for the local scenario; (6) congestion makespan is monotonic in issuer count; (7) 8-PE hotspot strictly exceeds 3-PE hotspot. Cross-SIP gets a looser 70 % contention slack because the path crosses two non-flit-aware (pcie_ep) boundaries that force store-and-forward re-streaming the simple formula does not attribute. Single-cube scenarios stay under 25 % residual. All checks PASS at the current model (post ADR-0019 D1/D4 per-PE HBM CTRL restoration). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 01:23:42 -07:00
mukesh	f6d262e359	Honest measured pipeline efficiency: two timing fixes Two related issues caused measured pipeline efficiency to look worse than the simulator's actual behavior: 1. DMA timing recorded too early. The op-log start timestamp for a DMA op fired when the request entered the queue, and the DMA channel was released as soon as the request was issued. Back-to-back DMAs therefore appeared to grab the channel simultaneously, with per-op duration drifting upward as queue depth grew - an artifact, not real cost. Fix: defer the start timestamp until after the channel is acquired, and hold the channel through the full HBM round-trip until the response returns. Per-op duration is now constant and equal to the actual transfer interval; serialization is visible as queue wait, not as inflated service time. 2. Sweep timing window folded in pre-composite work. The PE timing window spanned every PE engine record, which included the upfront pinned-operand DMA issued before the composite GEMM begins. For large-K shapes that one-shot load can be nearly half of the window, conflating operand-staging cost with composite-pipeline behavior. Fix: add a second window scoped to the composite pipeline by filtering op_log records to those tagged with a tile-pipeline stage; the legacy operand-load path is untagged and naturally excluded. For 32x3072x32 load_ref the window drops from 1765ns to 992ns and measured eff lines up with the steady-state DMA-bound stage limit instead of being penalized for the one-time load. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 14:19:17 -07:00
mukesh	83ea97b05f	Composite GEMM: K-loop accumulator residency, pinned operands, sweep + deck Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 15:00:41 -07:00
mukesh	5accd98171	Add deck builder + overview-with-ref diagram scripts scripts/build_overview_slides.py renders a 5-slide PPTX (kernbench2_overview.pptx) summarizing architecture, model correctness, IPCQ, allreduce, and buffer-kind tier comparison. scripts/emit_overview_with_external_ref.py renders log-y and broken-y variants of the allreduce overview (overview_log.png, overview_broken.png) including a 366 µs ext-sim reference marker at 96 KB / PE. Also includes cube_mesh_view.png rendered from the SVG. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:20:54 -07:00
mukesh	a563169e89	Add tl.recv_no_consume diagnostic API for apples-to-apples pe2pe plot The pe2pe overview compared IPCQ (tl.send + tl.recv) against raw DMA (tl.load + tl.store), but DMA is one-sided — DST never reads — while tl.recv pays a slot-read on DST. The comparison was unfair: IPCQ looked slower partly because it does more work. Adds tl.recv_no_consume() — a separate, diagnostic-only entry point that blocks for slot arrival but skips the slot-read (and bank-hop) charge on DST. Production tl.recv is unchanged (no `consume` kwarg on the public API), so the diagnostic flag can never accidentally leak into real workloads. Updates test_pe_to_pe_latency to call tl.recv_no_consume so the overview.png shows IPCQ no-consume vs raw DMA on equal footing. Also fixes PLOT_DIR back to docs/diagrams/pe2pe_latency_plots/ (was lost in a merge). Adds scripts/replot_pe2pe.py for label-only re-renders without re-measuring. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:20:44 -07:00
ywkang	6f43807900	commit - release 1	2026-03-18 11:47:48 -07:00

6 Commits