kernbench2/docs/diagrams/allreduce_latency_plots/buffer_kind_sweep.csv
mukesh 9c129d6131 ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots
Cube SRAM and HBM live on the cube NoC behind router-attached links
(sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256). Previously the
slot-IO model treated them as if they were per-PE local, so the
buffer_kind sweep showed TCM ≈ SRAM at 64 KB / PE.

pe_ipcq._handle_recv and pe_dma._handle_ipcq_inbound now charge a
PE→bank compute_drain_ns on top of the intrinsic slot-IO for SRAM/HBM.
TCM stays free of this hop. Adds an internal IpcqRecvCmd.consume field
that gates the recv-side hop+slot-IO charges (used by a follow-up
diagnostic API; default True keeps current behavior).

Post-fix at 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs.
SRAM is slowest because its 128 GB/s bank link is the narrowest in
the system — narrower than HBM's 256 GB/s. The existing ordering test
is rewritten from tcm<sram<hbm to tcm<hbm<sram and a new
test_ipcq_buffer_kind_locations adds 3 invariants on the gap.
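
The bandwidth argument above can be checked with back-of-the-envelope
arithmetic. The sketch below computes only the raw serialization time of
one 64 KB payload over each bank link, assuming 1 GB/s = 1e9 bytes/s.
The helper name link_serialization_ns is hypothetical; this is not the
simulator's compute_drain_ns, which may model more than this.

```python
# Bank-link bandwidths quoted in the commit message (GB/s).
SRAM_TO_ROUTER_BW_GBS = 128
HBM_TO_ROUTER_BW_GBS = 256


def link_serialization_ns(n_bytes: int, bw_gbs: float) -> float:
    """Raw time to push n_bytes over a link of bw_gbs GB/s, in ns.

    Assumes 1 GB/s == 1e9 bytes/s, so bytes / (GB/s) is already in ns.
    Hypothetical helper for illustration only.
    """
    return n_bytes / bw_gbs


payload = 64 * 1024  # 64 KB per PE, the largest point in the sweep
sram_ns = link_serialization_ns(payload, SRAM_TO_ROUTER_BW_GBS)
hbm_ns = link_serialization_ns(payload, HBM_TO_ROUTER_BW_GBS)
print(f"SRAM hop: {sram_ns:.0f} ns, HBM hop: {hbm_ns:.0f} ns")
```

Each traversal of the SRAM link costs twice the HBM one under this
assumption. The measured gap at 64 KB / PE is larger than one such
difference, presumably because the allreduce crosses the link more than
once, which this sketch does not model.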

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:20:28 -07:00

593 B

buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
hbm,torus_2d,6,128,256,1858.0399999999827
hbm,torus_2d,6,1024,2048,2389.0399999999827
hbm,torus_2d,6,8192,16384,6673.039999999986
hbm,torus_2d,6,32768,65536,21361.03999999992
sram,torus_2d,6,128,256,1774.0399999999827
sram,torus_2d,6,1024,2048,2389.0399999999827
sram,torus_2d,6,8192,16384,7345.039999999986
sram,torus_2d,6,32768,65536,24337.039999999935
tcm,torus_2d,6,128,256,1678.0399999999827
tcm,torus_2d,6,1024,2048,1957.0399999999827
tcm,torus_2d,6,8192,16384,4225.039999999986
tcm,torus_2d,6,32768,65536,12001.03999999992
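
A minimal sketch of how the sweep could be read back to check the
tcm < hbm < sram ordering at 64 KB / PE. The three 64 KB rows are copied
inline so the snippet is self-contained; the helper name latency_by_kind
is hypothetical and is not part of the repository's test suite.

```python
import csv
import io

# 64 KB / PE rows copied from buffer_kind_sweep.csv.
CSV_TEXT = """\
buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
hbm,torus_2d,6,32768,65536,21361.03999999992
sram,torus_2d,6,32768,65536,24337.039999999935
tcm,torus_2d,6,32768,65536,12001.03999999992
"""


def latency_by_kind(csv_text: str, bytes_per_pe: int) -> dict[str, float]:
    """Map buffer_kind -> latency_ns for one payload size (hypothetical helper)."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {
        row["buffer_kind"]: float(row["latency_ns"])
        for row in rows
        if int(row["bytes_per_pe"]) == bytes_per_pe
    }


at_64k = latency_by_kind(CSV_TEXT, 65536)
# Sort buffer kinds from fastest to slowest.
ordering = sorted(at_64k, key=at_64k.get)
for kind in ordering:
    print(f"{kind}: {at_64k[kind] / 1000:.1f} us")
```

In the full sweep this ordering holds only at the two larger sizes: at
256 B / PE SRAM is faster than HBM, and at 2 KB / PE the two are equal.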