ADR-0023 D9.7: IPCQ slot-memory latency model (TCM/SRAM/HBM)
Charge per-tier bandwidth + setup overhead at IPCQ slot WRITE
(receiver inbound DMA, in pe_dma._handle_ipcq_inbound) and slot
READ (recv consume, in pe_ipcq._handle_recv). Tier table
(common/ipcq_types.py):
  tier : bandwidth, setup overhead
  tcm  : 512 GB/s, 0 ns
  sram : 128 GB/s, 2 ns
  hbm  :  32 GB/s, 6 ns
Before this change, slot read/write was free regardless of
buffer_kind, making memory-tier choice invisible in simulated
latency. After the change, swapping buffer_kind in ccl.yaml
produces measurable per-tier separation in allreduce latency.
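The tier table can be read as a simple cost model. The sketch below is an illustration only: it assumes latency is the fixed setup overhead plus payload size divided by tier bandwidth, with 1 GB/s taken as 1 byte/ns. The real slot_io_latency_ns in common/ipcq_types.py is not shown on this page and may differ; TIER_TABLE and sketch_slot_io_latency_ns are hypothetical names.

```python
# Hypothetical sketch of the per-tier cost model; not the project's code.
# 1 GB/s is taken as 1e9 bytes/s, i.e. 1 byte per nanosecond.
TIER_TABLE = {
    "tcm": (512.0, 0.0),   # (bandwidth in GB/s, setup overhead in ns)
    "sram": (128.0, 2.0),
    "hbm": (32.0, 6.0),
}


def sketch_slot_io_latency_ns(buffer_kind: str, nbytes: int) -> float:
    """Assumed model: setup overhead plus transfer time at tier bandwidth."""
    bandwidth_gb_s, setup_ns = TIER_TABLE[buffer_kind]
    bytes_per_ns = bandwidth_gb_s  # GB/s equals bytes/ns under the 1e9 convention
    return setup_ns + nbytes / bytes_per_ns


if __name__ == "__main__":
    for kind in TIER_TABLE:
        print(kind, sketch_slot_io_latency_ns(kind, 4096))
```

Under these assumptions a 4096-byte slot costs 8 ns on tcm, 34 ns on sram, and 134 ns on hbm, so the gap between tiers widens as the payload grows.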
Tests:
  test_ipcq_buffer_kind_latency.py: three micro-tests asserting
    tcm < sram < hbm ordering, payload scaling, and that
    buffer_kind sensitivity grows with payload (the credit-only
    path stays fabric-bound).
  test_allreduce_buffer_kind_sweep.py: 12-config parametrized
    sweep emitting buffer_kind_sweep.png (3 lines, torus_2d).
  conftest: sessionfinish hook generalised to dispatch multiple
    sweep aggregators (allreduce + buffer-kind).
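One way a session-finish hook could dispatch to several sweep aggregators is a small registry keyed by sweep name. This is a minimal sketch under that assumption, not the project's conftest: pytest_sessionfinish(session, exitstatus) is the standard pytest hook, while SWEEP_AGGREGATORS, SWEEP_RESULTS, register_aggregator, and both aggregate_* functions are hypothetical. pytest ignores the hook's return value; the list is returned here only so the behaviour is easy to inspect.

```python
# Hypothetical sketch: names below are illustrative, not the project's conftest.
from typing import Callable, Dict, List

# Each sweep registers a callable that turns its collected rows into a report.
SWEEP_AGGREGATORS: Dict[str, Callable[[List[dict]], str]] = {}
SWEEP_RESULTS: Dict[str, List[dict]] = {}


def register_aggregator(name: str):
    """Decorator that adds an aggregator and an empty result list for `name`."""
    def decorator(fn):
        SWEEP_AGGREGATORS[name] = fn
        SWEEP_RESULTS.setdefault(name, [])
        return fn
    return decorator


@register_aggregator("allreduce")
def aggregate_allreduce(rows: List[dict]) -> str:
    return f"allreduce sweep: {len(rows)} rows"


@register_aggregator("buffer_kind")
def aggregate_buffer_kind(rows: List[dict]) -> str:
    kinds = sorted({row["buffer_kind"] for row in rows})
    return f"buffer_kind sweep: {len(rows)} rows, tiers={kinds}"


def pytest_sessionfinish(session, exitstatus):
    """Run every aggregator whose sweep collected at least one row."""
    reports = []
    for name, aggregate in SWEEP_AGGREGATORS.items():
        rows = SWEEP_RESULTS.get(name, [])
        if rows:  # skip sweeps that were not selected in this session
            reports.append(aggregate(rows))
    return reports
```

Skipping empty sweeps keeps a partial test run from emitting an empty plot for a sweep that never executed.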
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -329,6 +329,16 @@ class PeIpcqComponent(ComponentBase):
 
         qp["my_tail"] += 1
 
+        # ADR-0023 D9.7: charge IPCQ slot-READ latency against the
+        # backing-memory tier (tcm/sram/hbm). Recv blocks for the
+        # kernel-side slot consume; pe_exec_ns reflects this cost.
+        from kernbench.common.ipcq_types import slot_io_latency_ns
+        slot_read_ns = slot_io_latency_ns(
+            self._buffer_kind, req.result_data.get("nbytes", 0),
+        )
+        if slot_read_ns > 0:
+            yield env.timeout(slot_read_ns)
+
         # Diagnostics trace (D14)
         from kernbench.ccl import diagnostics
         if diagnostics.trace_enabled():