IPCQ-DMA co-design HW design doc + fix IPCQ slot BW model
Add hardware design document (docs/ipcq-dma-codesign-hw.md) covering PE_IPCQ high-level architecture, simulator verification, proposed HW implementation, and alternatives analysis. Include D2 block diagrams for baseline and proposed PE architectures.

Fix IPCQ slot-memory bandwidth parameters to match topology.yaml: SRAM 128→512 GB/s (intrinsic BW, NoC-bottlenecked at 128), HBM 32→256 GB/s (was per-channel, now per-PE aggregate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -47,8 +47,8 @@ from tests.test_allreduce_multidevice import (
 # pe_ipcq.py). Mirrors topology.yaml component values.
 _EXPECTED_BW = {
     "tcm": (512.0, 0.0),
-    "sram": (128.0, 2.0),
-    "hbm": (32.0, 6.0),
+    "sram": (512.0, 2.0),
+    "hbm": (256.0, 6.0),
 }
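The commit message describes the new SRAM figure as intrinsic bandwidth that the NoC still limits to 128 GB/s. A minimal sketch of one way to read that, assuming the effective rate is the smaller of the two limits. The helper names and the `min()` model are hypothetical and are not taken from pe_ipcq.py; only the numeric values come from this commit.

```python
# Hypothetical illustration; names and the min() model are assumptions,
# not the simulator's actual code.
NOC_CAP_GBPS = 128.0  # NoC limit quoted in the commit message for SRAM

# Intrinsic bandwidths after this commit (first element of each _EXPECTED_BW tuple).
INTRINSIC_BW_GBPS = {"tcm": 512.0, "sram": 512.0, "hbm": 256.0}


def effective_bw_gbps(kind, noc_cap=None):
    """Return the intrinsic bandwidth, optionally clamped by a NoC cap."""
    bw = INTRINSIC_BW_GBPS[kind]
    return bw if noc_cap is None else min(bw, noc_cap)


def transfer_ns(nbytes, bw_gbps):
    """Transfer time in ns; 1 GB/s equals 1 byte/ns, so bytes / (GB/s) is ns."""
    return nbytes / bw_gbps


sram_local = transfer_ns(4096, effective_bw_gbps("sram"))
sram_via_noc = transfer_ns(4096, effective_bw_gbps("sram", NOC_CAP_GBPS))
```

Under these assumptions a 4096-byte SRAM slot access takes 8 ns at the intrinsic rate and 32 ns when the NoC cap applies, which is why the slot model and the NoC model need to agree on which layer enforces the 128 GB/s limit.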
@@ -160,10 +160,10 @@ def test_slot_io_scales_linearly_with_nbytes(tmp_path):
     lat_8k = _run_torus_allreduce(tmp_path, buffer_kind="hbm", n_elem=4096)
 
     # Expected delta from doubling: at least one slot-IO event per cube
-    # in the critical path (very conservative). Per-access add = 4096/32 ≈ 128
+    # in the critical path (very conservative). Per-access add = 4096/256 = 16
     # ns on HBM going from 4k → 8k. Multiple slot accesses on the critical
     # path should make the observed delta meaningfully larger.
-    expected_min_delta = 0.5 * (4096 / 32.0) # ≈ 64 ns
+    expected_min_delta = 0.5 * (4096 / 256.0) # ≈ 8 ns
     assert lat_8k - lat_4k > expected_min_delta, (
         f"doubling nbytes on hbm should add ≥ {expected_min_delta:.1f} ns "
         f"of slot-IO latency, got delta={lat_8k - lat_4k:.1f} ns "
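The threshold change in this hunk follows directly from the bandwidth change. A small sketch of the arithmetic, assuming the 0.5 factor is simply a safety margin on one slot access; `min_delta_ns` is a hypothetical helper used only for illustration and does not exist in the test file.

```python
def min_delta_ns(extra_bytes, bw_gbps, safety=0.5):
    """Lower bound on added latency for one slot access moving extra_bytes.

    bytes / (GB/s) yields nanoseconds because 1 GB/s is 1 byte/ns.
    """
    return safety * (extra_bytes / bw_gbps)


# Going from 4k to 8k adds 4096 bytes per slot access.
old_threshold = min_delta_ns(4096, 32.0)   # per-channel HBM figure used before
new_threshold = min_delta_ns(4096, 256.0)  # per-PE aggregate HBM figure used now
```

With the old 32 GB/s figure the bound is 64 ns; with 256 GB/s it drops to 8 ns, so the assertion is eight times less strict and depends more heavily on several slot accesses sitting on the critical path.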