kernbench2/docs/diagrams/allreduce_latency_plots/buffer_kind_sweep.csv at 5accd981717845fc8f6e9c7e97c1311f32e50447


mukesh 9c129d6131 ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots

Cube SRAM and HBM live on the cube NoC behind router-attached links
(sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256). Previously the
slot-IO model treated them as if they were per-PE local, so the
buffer_kind sweep showed TCM ≈ SRAM at 64 KB / PE.

pe_ipcq._handle_recv and pe_dma._handle_ipcq_inbound now charge a
PE→bank compute_drain_ns on top of the intrinsic slot-IO for SRAM/HBM.
TCM stays free of this hop. Adds an internal IpcqRecvCmd.consume field
that gates the recv-side hop+slot-IO charges (used by a follow-up
diagnostic API; default True keeps current behavior).
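
The charging rule described above can be sketched roughly as follows. This is a minimal illustration, not the kernbench2 code: the function names, the `slot_io_ns` argument, and the assumption that the hop cost is plain serialization (bytes divided by link bandwidth) are all hypothetical. Only the two bandwidth figures, the TCM exemption, and the `consume` default of True come from the commit message.

```python
# Hypothetical sketch of the recv-side charge; not the kernbench2 implementation.

# Bank-to-router link bandwidths quoted in the commit message, in GB/s.
# Assumes GB means 10**9 bytes, so 1 GB/s equals 1 byte/ns.
BANK_LINK_BW_GBS = {"sram": 128.0, "hbm": 256.0}


def fabric_hop_ns(buffer_kind: str, nbytes: int) -> float:
    """Assumed PE<->bank hop cost: pure serialization over the bank link."""
    bw = BANK_LINK_BW_GBS.get(buffer_kind)
    if bw is None:
        # TCM is PE-local, so no fabric hop is charged.
        return 0.0
    return nbytes / bw


def recv_charge_ns(buffer_kind: str, nbytes: int, slot_io_ns: float,
                   consume: bool = True) -> float:
    """Total recv-side charge; consume=False skips both hop and slot-IO."""
    if not consume:
        return 0.0
    return slot_io_ns + fabric_hop_ns(buffer_kind, nbytes)
```

Under these assumptions a single 64 KB hop costs 512 ns over the SRAM link, 256 ns over the HBM link, and nothing for TCM. The sketch is not meant to reproduce the end-to-end allreduce latencies in the sweep.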

Post-fix at 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs.
SRAM is slowest because its 128 GB/s bank link is the narrowest in
the system — narrower than HBM's 256 GB/s. The existing ordering test
is rewritten from tcm<sram<hbm to tcm<hbm<sram and a new
test_ipcq_buffer_kind_locations adds 3 invariants on the gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-28 18:20:28 -07:00

593 B


buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
hbm,torus_2d,6,128,256,1858.0399999999827
hbm,torus_2d,6,1024,2048,2389.0399999999827
hbm,torus_2d,6,8192,16384,6673.039999999986
hbm,torus_2d,6,32768,65536,21361.03999999992
sram,torus_2d,6,128,256,1774.0399999999827
sram,torus_2d,6,1024,2048,2389.0399999999827
sram,torus_2d,6,8192,16384,7345.039999999986
sram,torus_2d,6,32768,65536,24337.039999999935
tcm,torus_2d,6,128,256,1678.0399999999827
tcm,torus_2d,6,1024,2048,1957.0399999999827
tcm,torus_2d,6,8192,16384,4225.039999999986
tcm,torus_2d,6,32768,65536,12001.03999999992
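
One way to read the sweep is to rank the buffer kinds at each message size. The sketch below does that with the standard `csv` module. The helper name `rank_by_size` is hypothetical, and the embedded rows are copied from the file above with latencies rounded to 0.01 ns and the constant columns dropped.

```python
import csv
import io
from collections import defaultdict

# Rows copied from buffer_kind_sweep.csv; latencies rounded to 0.01 ns.
SWEEP_CSV = """buffer_kind,bytes_per_pe,latency_ns
hbm,256,1858.04
hbm,2048,2389.04
hbm,16384,6673.04
hbm,65536,21361.04
sram,256,1774.04
sram,2048,2389.04
sram,16384,7345.04
sram,65536,24337.04
tcm,256,1678.04
tcm,2048,1957.04
tcm,16384,4225.04
tcm,65536,12001.04
"""


def rank_by_size(text: str) -> dict[int, list[tuple[float, str]]]:
    """Map bytes_per_pe to (latency_ns, buffer_kind) pairs, fastest first."""
    by_size = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        entry = (float(row["latency_ns"]), row["buffer_kind"])
        by_size[int(row["bytes_per_pe"])].append(entry)
    # Sort sizes ascending and, within each size, latencies ascending.
    return {size: sorted(entries) for size, entries in sorted(by_size.items())}


ranking = rank_by_size(SWEEP_CSV)
for size, entries in ranking.items():
    line = ", ".join(f"{kind} {ns / 1000:.1f} us" for ns, kind in entries)
    print(f"{size:>6} B/PE: {line}")
```

At 65536 bytes per PE the ranking is tcm, hbm, sram, matching the ordering stated in the commit message. At 256 bytes per PE SRAM is still ahead of HBM, and at 2048 bytes the two are tied, so the tcm<hbm<sram ordering in this data holds only from 16 KB per PE upward.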
