kernbench2

mukesh 9c129d6131 ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots

Cube SRAM and HBM live on the cube NoC behind router-attached links
(sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256). Previously the
slot-IO model treated them as if they were per-PE local, so the
buffer_kind sweep showed TCM ≈ SRAM at 64 KB / PE.

pe_ipcq._handle_recv and pe_dma._handle_ipcq_inbound now charge a
PE→bank compute_drain_ns on top of the intrinsic slot-IO for SRAM/HBM.
TCM stays free of this hop. Adds an internal IpcqRecvCmd.consume field
that gates the recv-side hop+slot-IO charges (used by a follow-up
diagnostic API; default True keeps current behavior).
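
The charge described above can be sketched roughly as follows. This is an
illustrative model only, not the code in pe_ipcq or pe_dma: the function
names, the slot_io_ns argument, and the assumption that the hop is a pure
bytes/bandwidth serialization term are all hypothetical. Only the two link
bandwidths and the meaning of the consume flag come from this commit message.

```python
# Hypothetical sketch of the PE->bank hop charge. Only the bandwidth values
# (sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256) are from the commit.
ROUTER_LINK_BW_GBS = {
    "sram": 128.0,
    "hbm": 256.0,
}


def fabric_hop_ns(buffer_kind: str, nbytes: int) -> float:
    """Time to drain nbytes across the bank's router link, in nanoseconds.

    TCM is treated as PE-local, so it pays no hop.
    Assumes decimal GB, so 1 GB/s equals 1 byte/ns.
    """
    if buffer_kind == "tcm":
        return 0.0
    return nbytes / ROUTER_LINK_BW_GBS[buffer_kind]


def recv_cost_ns(buffer_kind: str, nbytes: int, slot_io_ns: float,
                 consume: bool = True) -> float:
    """Recv-side charge: intrinsic slot-IO plus the fabric hop.

    consume=False skips both charges (assumed reading of the
    IpcqRecvCmd.consume gate); the default True charges both.
    """
    if not consume:
        return 0.0
    return slot_io_ns + fabric_hop_ns(buffer_kind, nbytes)


if __name__ == "__main__":
    for kind in ("tcm", "hbm", "sram"):
        print(kind, fabric_hop_ns(kind, 64 * 1024))
```

Under these assumptions a 64 KiB payload pays 512 ns over the SRAM link and
256 ns over the HBM link per hop. The sketch does not try to reproduce the
end-to-end latencies reported by the sweep.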

Post-fix at 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs.
SRAM is slowest because its 128 GB/s bank link is the narrowest in
the system — narrower than HBM's 256 GB/s. The existing ordering test
is rewritten from tcm<sram<hbm to tcm<hbm<sram and a new
test_ipcq_buffer_kind_locations adds 3 invariants on the gap.
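
The new ordering can be expressed as a simple check. This is a sketch, not
the actual test: the helper name and the dict layout are hypothetical, and
the three gap invariants in test_ipcq_buffer_kind_locations are not
reproduced here because this message does not spell them out. The latencies
are the 64 KB / PE figures quoted above.

```python
# Latencies in microseconds, as reported in this commit message.
reported_us = {"tcm": 12.0, "hbm": 21.4, "sram": 24.3}


def has_expected_ordering(lat_us: dict) -> bool:
    """Return True when latency follows tcm < hbm < sram."""
    return lat_us["tcm"] < lat_us["hbm"] < lat_us["sram"]


if __name__ == "__main__":
    assert has_expected_ordering(reported_us)
```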

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-28 18:20:28 -07:00

buffer_kind_sweep.csv

ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots

2026-04-28 18:20:28 -07:00

buffer_kind_sweep.png

ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots

2026-04-28 18:20:28 -07:00

mesh_2d_no_wrap.png

Intercube allreduce: center root + bidirectional reduce

2026-04-27 21:28:58 -07:00

overview.png

Intercube allreduce: center root + bidirectional reduce