IPCQ-DMA co-design HW design doc + fix IPCQ slot BW model

Add hardware design document (docs/ipcq-dma-codesign-hw.md) covering
PE_IPCQ high-level architecture, simulator verification, proposed HW
implementation, and alternatives analysis. Include D2 block diagrams
for baseline and proposed PE architectures.

Fix IPCQ slot-memory bandwidth parameters to match topology.yaml:
SRAM 128→512 GB/s (intrinsic BW, NoC-bottlenecked at 128),
HBM 32→256 GB/s (was per-channel, now per-PE aggregate).
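
As a rough illustration of what the new parameters mean for the
bandwidth term of a single slot access, the sketch below compares a
4096-byte transfer under the old and new values. The helper
slot_io_ns is hypothetical and not taken from pe_ipcq.py; it assumes
the bandwidth term is simply nbytes / BW and ignores any fixed
per-access latency.

```python
# Hypothetical helper (not from pe_ipcq.py): bandwidth term of one slot access.
# 1 GB/s equals 1 byte/ns, so bytes divided by GB/s yields nanoseconds.
def slot_io_ns(nbytes: int, bw_gbps: float) -> float:
    return nbytes / bw_gbps


# Bandwidth values quoted in this commit message, in GB/s.
OLD_BW = {"sram": 128.0, "hbm": 32.0}   # before the fix
NEW_BW = {"sram": 512.0, "hbm": 256.0}  # after the fix, per topology.yaml

for kind in ("sram", "hbm"):
    old = slot_io_ns(4096, OLD_BW[kind])
    new = slot_io_ns(4096, NEW_BW[kind])
    print(f"{kind}: {old:.0f} ns -> {new:.0f} ns per 4096-byte access")
```

Under these assumptions the HBM term drops from 128 ns to 16 ns, which
is why the test threshold in the diff below shrinks from 64 ns to 8 ns.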

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 13:31:02 -07:00
parent 54fcb7e4bc
commit 533e699299
9 changed files with 1121 additions and 6 deletions
+4 -4
@@ -47,8 +47,8 @@ from tests.test_allreduce_multidevice import (
# pe_ipcq.py). Mirrors topology.yaml component values.
_EXPECTED_BW = {
"tcm": (512.0, 0.0),
-"sram": (128.0, 2.0),
-"hbm": (32.0, 6.0),
+"sram": (512.0, 2.0),
+"hbm": (256.0, 6.0),
}
@@ -160,10 +160,10 @@ def test_slot_io_scales_linearly_with_nbytes(tmp_path):
lat_8k = _run_torus_allreduce(tmp_path, buffer_kind="hbm", n_elem=4096)
# Expected delta from doubling: at least one slot-IO event per cube
-# in the critical path (very conservative). Per-access add = 4096/32 ≈ 128
+# in the critical path (very conservative). Per-access add = 4096/256 = 16
# ns on HBM going from 4k → 8k. Multiple slot accesses on the critical
# path should make the observed delta meaningfully larger.
-expected_min_delta = 0.5 * (4096 / 32.0) # ≈ 64 ns
+expected_min_delta = 0.5 * (4096 / 256.0) # ≈ 8 ns
assert lat_8k - lat_4k > expected_min_delta, (
f"doubling nbytes on hbm should add ≥ {expected_min_delta:.1f} ns "
f"of slot-IO latency, got delta={lat_8k - lat_4k:.1f} ns "