ADR-0023 D9.7: IPCQ slot-memory latency model (TCM/SRAM/HBM)

Charge per-tier bandwidth + setup overhead at IPCQ slot WRITE (receiver
inbound DMA, in pe_dma._handle_ipcq_inbound) and slot READ (recv consume,
in pe_ipcq._handle_recv).

Tier table (common/ipcq_types.py):
  tcm  : 512 GB/s, 0 ns
  sram : 128 GB/s, 2 ns
  hbm  :  32 GB/s, 6 ns

Before this change, slot read/write was free regardless of buffer_kind,
making memory-tier choice invisible in simulated latency. After the
change, swapping buffer_kind in ccl.yaml produces measurable per-tier
separation in allreduce latency.

Tests:
  test_ipcq_buffer_kind_latency.py: three micro-tests asserting
    tcm < sram < hbm ordering, payload-scaling, and that buffer_kind
    sensitivity grows with payload (credit-only path stays fabric-bound).
  test_allreduce_buffer_kind_sweep.py: 12-config parametrized sweep
    emitting buffer_kind_sweep.png (3 lines, torus_2d).

conftest sessionfinish hook generalised to dispatch multiple sweep
aggregators (allreduce + buffer-kind).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
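A minimal sketch of the per-access charge described in the message, for illustration only. It assumes the cost of one slot access is setup overhead plus payload bytes divided by tier bandwidth, and that 1 GB/s equals 1 byte/ns (decimal GB). The names Tier, TIERS and slot_access_ns are hypothetical and are not taken from common/ipcq_types.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One slot-memory tier (hypothetical container, not the repo's type)."""
    bandwidth_gb_s: float  # sustained bandwidth in GB/s
    setup_ns: float        # fixed per-access overhead in ns


# Values copied from the tier table in the commit message.
TIERS = {
    "tcm": Tier(bandwidth_gb_s=512.0, setup_ns=0.0),
    "sram": Tier(bandwidth_gb_s=128.0, setup_ns=2.0),
    "hbm": Tier(bandwidth_gb_s=32.0, setup_ns=6.0),
}


def slot_access_ns(buffer_kind: str, nbytes: int) -> float:
    """Assumed cost of one slot WRITE or READ: setup + bytes / bandwidth.

    With decimal units, 1 GB/s is exactly 1 byte/ns, so dividing bytes by
    the GB/s figure yields nanoseconds directly.
    """
    tier = TIERS[buffer_kind]
    return tier.setup_ns + nbytes / tier.bandwidth_gb_s


if __name__ == "__main__":
    for kind in ("tcm", "sram", "hbm"):
        print(kind, slot_access_ns(kind, 65536))
```

Under these assumptions a 64 KiB slot access costs 128 ns on tcm, 514 ns on sram and 2054 ns on hbm, so the tier term grows linearly with payload while the setup term stays constant.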
2026-04-27 21:28:34 -07:00
parent 1e39214f89
commit 84a1325e5c
8 changed files with 489 additions and 17 deletions
@@ -0,0 +1,13 @@
+buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
+hbm,torus_2d,6,128,256,2002.0399999999827
+hbm,torus_2d,6,1024,2048,3541.0399999999827
+hbm,torus_2d,6,8192,16384,15889.03999999999
+hbm,torus_2d,6,32768,65536,58225.03999999998
+sram,torus_2d,6,128,256,1762.0399999999827
+sram,torus_2d,6,1024,2048,2293.0399999999827
+sram,torus_2d,6,8192,16384,6577.039999999986
+sram,torus_2d,6,32768,65536,21265.03999999992
+tcm,torus_2d,6,128,256,1678.0399999999827
+tcm,torus_2d,6,1024,2048,1957.0399999999827
+tcm,torus_2d,6,8192,16384,4225.039999999986
+tcm,torus_2d,6,32768,65536,12001.03999999992
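
The sweep rows above can be checked against the tier table with a short script. This is a sketch, not part of the commit: the CSV text is copied from the hunk with the leading "+" removed, the helper names are hypothetical, and the per-access formula (setup + bytes / bandwidth, 1 GB/s = 1 byte/ns) is an assumption.

```python
import csv
import io

# Rows copied verbatim from the diff hunk (leading '+' stripped).
SWEEP_CSV = """\
buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
hbm,torus_2d,6,128,256,2002.0399999999827
hbm,torus_2d,6,1024,2048,3541.0399999999827
hbm,torus_2d,6,8192,16384,15889.03999999999
hbm,torus_2d,6,32768,65536,58225.03999999998
sram,torus_2d,6,128,256,1762.0399999999827
sram,torus_2d,6,1024,2048,2293.0399999999827
sram,torus_2d,6,8192,16384,6577.039999999986
sram,torus_2d,6,32768,65536,21265.03999999992
tcm,torus_2d,6,128,256,1678.0399999999827
tcm,torus_2d,6,1024,2048,1957.0399999999827
tcm,torus_2d,6,8192,16384,4225.039999999986
tcm,torus_2d,6,32768,65536,12001.03999999992
"""


def load_sweep(text):
    """Map (buffer_kind, bytes_per_pe) -> latency_ns."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["buffer_kind"], int(row["bytes_per_pe"]))
        table[key] = float(row["latency_ns"])
    return table


def hbm_tcm_gap(table):
    """Latency difference hbm - tcm for each payload size, in ns."""
    sizes = sorted({nbytes for _, nbytes in table})
    return {n: table[("hbm", n)] - table[("tcm", n)] for n in sizes}


def implied_accesses(table):
    """Gap divided by the assumed per-access cost difference.

    Assumed model: cost = setup_ns + bytes / GB_per_s, with hbm at
    (6 ns, 32 GB/s) and tcm at (0 ns, 512 GB/s) from the commit message.
    """
    result = {}
    for nbytes, gap in hbm_tcm_gap(table).items():
        per_access = (6.0 + nbytes / 32.0) - (0.0 + nbytes / 512.0)
        result[nbytes] = gap / per_access
    return result


if __name__ == "__main__":
    sweep = load_sweep(SWEEP_CSV)
    for nbytes, gap in hbm_tcm_gap(sweep).items():
        print(nbytes, round(gap, 3), round(implied_accesses(sweep)[nbytes], 3))
```

The hbm-to-tcm gap widens from about 324 ns at 256 bytes to about 46224 ns at 65536 bytes, which matches the message's claim that buffer_kind sensitivity grows with payload. Under the assumed formula, every payload size works out to roughly 24 charged slot accesses per allreduce; the commit does not state this count, so treat it as an inference from the numbers.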