84a1325e5c
Charge per-tier bandwidth + setup overhead at IPCQ slot WRITE
(receiver inbound DMA, in pe_dma._handle_ipcq_inbound) and slot
READ (recv consume, in pe_ipcq._handle_recv). Tier table
(common/ipcq_types.py):
  tier   bandwidth   setup
  tcm    512 GB/s    0 ns
  sram   128 GB/s    2 ns
  hbm     32 GB/s    6 ns
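
As a rough illustration of what the tier table implies, the sketch below charges one slot access as setup overhead plus payload divided by bandwidth. The function name `slot_access_ns`, the `TIERS` dict, and the additive formula are assumptions made for this example, not the code in common/ipcq_types.py; 1 GB/s is taken as 1 byte/ns.

```python
# Tier parameters copied from the table above: (bandwidth in GB/s, setup in ns).
# Assumption: 1 GB/s == 1 byte/ns, so the bandwidth doubles as bytes per ns.
TIERS = {
    "tcm": (512.0, 0.0),
    "sram": (128.0, 2.0),
    "hbm": (32.0, 6.0),
}


def slot_access_ns(buffer_kind: str, nbytes: int) -> float:
    """Hypothetical cost of one slot WRITE or READ: setup + nbytes / bandwidth."""
    bytes_per_ns, setup_ns = TIERS[buffer_kind]
    return setup_ns + nbytes / bytes_per_ns


if __name__ == "__main__":
    # Print the per-access cost of a 2048-byte payload for each tier.
    for kind in TIERS:
        print(kind, slot_access_ns(kind, 2048))
```

Under this model a 2048-byte access costs 4 ns in tcm, 18 ns in sram, and 70 ns in hbm, so the gap between tiers widens as the payload grows.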
Before this change, slot read/write was free regardless of
buffer_kind, making memory-tier choice invisible in simulated
latency. After the change, swapping buffer_kind in ccl.yaml
produces measurable per-tier separation in allreduce latency.
Tests:
  test_ipcq_buffer_kind_latency.py: three micro-tests asserting
    tcm < sram < hbm ordering, payload-scaling, and that
    buffer_kind sensitivity grows with payload (credit-only path
    stays fabric-bound).
  test_allreduce_buffer_kind_sweep.py: 12-config parametrized
    sweep emitting buffer_kind_sweep.png (3 lines, torus_2d).
  conftest sessionfinish hook generalised to dispatch multiple
    sweep aggregators (allreduce + buffer-kind).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
592 B
| buffer_kind | sip_topology | n_sips | n_elem | bytes_per_pe | latency_ns |
|---|---|---|---|---|---|
| hbm | torus_2d | 6 | 128 | 256 | 2002.0399999999827 |
| hbm | torus_2d | 6 | 1024 | 2048 | 3541.0399999999827 |
| hbm | torus_2d | 6 | 8192 | 16384 | 15889.03999999999 |
| hbm | torus_2d | 6 | 32768 | 65536 | 58225.03999999998 |
| sram | torus_2d | 6 | 128 | 256 | 1762.0399999999827 |
| sram | torus_2d | 6 | 1024 | 2048 | 2293.0399999999827 |
| sram | torus_2d | 6 | 8192 | 16384 | 6577.039999999986 |
| sram | torus_2d | 6 | 32768 | 65536 | 21265.03999999992 |
| tcm | torus_2d | 6 | 128 | 256 | 1678.0399999999827 |
| tcm | torus_2d | 6 | 1024 | 2048 | 1957.0399999999827 |
| tcm | torus_2d | 6 | 8192 | 16384 | 4225.039999999986 |
| tcm | torus_2d | 6 | 32768 | 65536 | 12001.03999999992 |
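
The sweep rows can be cross-checked against the tier table with a short script. The sketch below assumes each charged slot access costs setup plus `bytes_per_pe` divided by bandwidth (1 GB/s taken as 1 byte/ns), and divides the latency gap between a tier and tcm by the per-access cost gap. The helper names are hypothetical, the latencies are the table values rounded to two decimals, and the resulting access count is inferred from the data rather than stated in the commit message.

```python
# (bandwidth in bytes/ns, setup in ns), from the tier table in the commit message.
TIERS = {"tcm": (512.0, 0.0), "sram": (128.0, 2.0), "hbm": (32.0, 6.0)}

# latency_ns from the sweep table, keyed by (buffer_kind, bytes_per_pe),
# rounded to two decimals.
LATENCY_NS = {
    ("hbm", 256): 2002.04, ("hbm", 2048): 3541.04,
    ("hbm", 16384): 15889.04, ("hbm", 65536): 58225.04,
    ("sram", 256): 1762.04, ("sram", 2048): 2293.04,
    ("sram", 16384): 6577.04, ("sram", 65536): 21265.04,
    ("tcm", 256): 1678.04, ("tcm", 2048): 1957.04,
    ("tcm", 16384): 4225.04, ("tcm", 65536): 12001.04,
}


def slot_access_ns(buffer_kind: str, nbytes: int) -> float:
    """Assumed per-access cost: setup + nbytes / bandwidth."""
    bytes_per_ns, setup_ns = TIERS[buffer_kind]
    return setup_ns + nbytes / bytes_per_ns


def implied_accesses(buffer_kind: str, nbytes: int) -> float:
    """Latency gap versus tcm divided by the per-access cost gap versus tcm."""
    latency_gap = LATENCY_NS[(buffer_kind, nbytes)] - LATENCY_NS[("tcm", nbytes)]
    cost_gap = slot_access_ns(buffer_kind, nbytes) - slot_access_ns("tcm", nbytes)
    return latency_gap / cost_gap


if __name__ == "__main__":
    for kind in ("sram", "hbm"):
        for nbytes in (256, 2048, 16384, 65536):
            print(kind, nbytes, round(implied_accesses(kind, nbytes), 3))
```

For all eight sram and hbm rows the ratio comes out at 24, which is consistent with 24 charged slot accesses on the allreduce critical path in this 6-SIP torus_2d configuration, if the assumed cost formula holds.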