kernbench2/docs/diagrams/allreduce_latency_plots/buffer_kind_sweep.csv at 84a1325e5c8b9b2236610a25d08f41a8a607e361

mukesh 84a1325e5c ADR-0023 D9.7: IPCQ slot-memory latency model (TCM/SRAM/HBM)

Charge per-tier bandwidth + setup overhead at IPCQ slot WRITE
(receiver inbound DMA, in pe_dma._handle_ipcq_inbound) and slot
READ (recv consume, in pe_ipcq._handle_recv). Tier table
(common/ipcq_types.py):
  tcm  : 512 GB/s, 0 ns
  sram : 128 GB/s, 2 ns
  hbm  :  32 GB/s, 6 ns
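The tier table above can be read as a simple per-access cost model. The sketch below is illustrative only: it assumes the charge is `setup + bytes / bandwidth`, which the commit message implies but does not spell out, and the names `SlotTier`, `TIERS`, and `slot_access_ns` are hypothetical rather than taken from common/ipcq_types.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotTier:
    """One row of the tier table (hypothetical container, not the repo's type)."""
    bandwidth_gbps: float  # GB/s; with decimal GB this equals bytes per ns
    setup_ns: float        # fixed overhead charged once per slot access


# Values copied from the tier table in the commit message.
TIERS = {
    "tcm": SlotTier(bandwidth_gbps=512.0, setup_ns=0.0),
    "sram": SlotTier(bandwidth_gbps=128.0, setup_ns=2.0),
    "hbm": SlotTier(bandwidth_gbps=32.0, setup_ns=6.0),
}


def slot_access_ns(buffer_kind: str, nbytes: int) -> float:
    """Assumed cost of one slot read or write: setup plus transfer time."""
    tier = TIERS[buffer_kind]
    return tier.setup_ns + nbytes / tier.bandwidth_gbps


if __name__ == "__main__":
    for kind in TIERS:
        print(kind, slot_access_ns(kind, 256))
```

Under this assumption the fixed setup term dominates for small payloads and the bandwidth term dominates for large ones, which would match the claim that buffer_kind sensitivity grows with payload.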

Before this change, slot read/write was free regardless of
buffer_kind, making memory-tier choice invisible in simulated
latency. After the change, swapping buffer_kind in ccl.yaml
produces measurable per-tier separation in allreduce latency.

Tests:
  test_ipcq_buffer_kind_latency.py — three micro-tests asserting
    tcm < sram < hbm ordering, payload-scaling, and that
    buffer_kind sensitivity grows with payload (credit-only path
    stays fabric-bound).
  test_allreduce_buffer_kind_sweep.py — 12-config parametrized
    sweep emitting buffer_kind_sweep.png (3 lines, torus_2d).

conftest sessionfinish hook generalised to dispatch multiple
sweep aggregators (allreduce + buffer-kind).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-27 21:28:34 -07:00

592 B


buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
hbm,torus_2d,6,128,256,2002.0399999999827
hbm,torus_2d,6,1024,2048,3541.0399999999827
hbm,torus_2d,6,8192,16384,15889.03999999999
hbm,torus_2d,6,32768,65536,58225.03999999998
sram,torus_2d,6,128,256,1762.0399999999827
sram,torus_2d,6,1024,2048,2293.0399999999827
sram,torus_2d,6,8192,16384,6577.039999999986
sram,torus_2d,6,32768,65536,21265.03999999992
tcm,torus_2d,6,128,256,1678.0399999999827
tcm,torus_2d,6,1024,2048,1957.0399999999827
tcm,torus_2d,6,8192,16384,4225.039999999986
tcm,torus_2d,6,32768,65536,12001.03999999992
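
One way to check the per-tier separation in this sweep is to compare hbm against tcm at each payload size. The sketch below is a minimal example, not part of the repository: it embeds only the `buffer_kind`, `bytes_per_pe`, and `latency_ns` columns from the rows above, and the helper names `load_sweep` and `hbm_over_tcm` are hypothetical.

```python
import csv
import io

# Subset of columns copied from the sweep rows above.
SWEEP_CSV = """\
buffer_kind,bytes_per_pe,latency_ns
hbm,256,2002.0399999999827
hbm,2048,3541.0399999999827
hbm,16384,15889.03999999999
hbm,65536,58225.03999999998
sram,256,1762.0399999999827
sram,2048,2293.0399999999827
sram,16384,6577.039999999986
sram,65536,21265.03999999992
tcm,256,1678.0399999999827
tcm,2048,1957.0399999999827
tcm,16384,4225.039999999986
tcm,65536,12001.03999999992
"""


def load_sweep(text: str) -> dict[int, dict[str, float]]:
    """Group latencies as {bytes_per_pe: {buffer_kind: latency_ns}}."""
    table: dict[int, dict[str, float]] = {}
    for row in csv.DictReader(io.StringIO(text)):
        size = int(row["bytes_per_pe"])
        table.setdefault(size, {})[row["buffer_kind"]] = float(row["latency_ns"])
    return table


def hbm_over_tcm(table: dict[int, dict[str, float]]) -> dict[int, float]:
    """Latency ratio hbm / tcm for each payload size, smallest size first."""
    return {size: kinds["hbm"] / kinds["tcm"] for size, kinds in sorted(table.items())}


if __name__ == "__main__":
    for size, ratio in hbm_over_tcm(load_sweep(SWEEP_CSV)).items():
        print(f"{size:>6} B/PE  hbm/tcm = {ratio:.2f}")
```

On this data the ratio rises with payload size, from roughly 1.2 at 256 bytes per PE to roughly 4.9 at 65536 bytes per PE.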
