IPCQ-DMA co-design HW design doc + fix IPCQ slot BW model

Add hardware design document (docs/ipcq-dma-codesign-hw.md) covering PE_IPCQ high-level architecture, simulator verification, proposed HW implementation, and alternatives analysis. Include D2 block diagrams for baseline and proposed PE architectures.

Fix IPCQ slot-memory bandwidth parameters to match topology.yaml: SRAM 128→512 GB/s (intrinsic BW, NoC-bottlenecked at 128), HBM 32→256 GB/s (was per-channel, now per-PE aggregate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
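
A minimal sketch of the arithmetic behind the bandwidth change, assuming slot-IO transfer time is simply bytes divided by bandwidth (1 GB/s equals 1 byte/ns). The helper names `slot_io_ns` and `added_ns_per_access` and the `OLD_BW`/`NEW_BW` tables are hypothetical and exist only for illustration; they are not taken from pe_ipcq.py, and the real model may include further terms.

```python
# Hypothetical illustration only; not the simulator's actual code.
# Assumption: transfer time = nbytes / bandwidth, with 1 GB/s == 1 byte/ns.

# Bandwidths in GB/s before and after this commit (values from the message).
OLD_BW = {"sram": 128.0, "hbm": 32.0}
NEW_BW = {"sram": 512.0, "hbm": 256.0}


def slot_io_ns(nbytes: int, bw_gbps: float) -> float:
    """Return the transfer time in ns for nbytes at bw_gbps GB/s."""
    if bw_gbps <= 0:
        raise ValueError("bandwidth must be positive")
    return nbytes / bw_gbps


def added_ns_per_access(kind: str, extra_bytes: int, table: dict) -> float:
    """Extra ns per slot access when the payload grows by extra_bytes."""
    return slot_io_ns(extra_bytes, table[kind])


if __name__ == "__main__":
    for kind in ("sram", "hbm"):
        old = added_ns_per_access(kind, 4096, OLD_BW)
        new = added_ns_per_access(kind, 4096, NEW_BW)
        print(f"{kind}: +4096 B costs {old:.1f} ns (old) vs {new:.1f} ns (new)")
```

Under these assumptions, growing an HBM payload by 4096 bytes adds 16 ns per access with the new value rather than 128 ns with the old one, which is why the test's minimum expected delta drops from 64 ns to 8 ns.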
2026-04-28 13:31:02 -07:00
parent 54fcb7e4bc
commit 533e699299
9 changed files with 1121 additions and 6 deletions
@@ -47,8 +47,8 @@ from tests.test_allreduce_multidevice import (
 # pe_ipcq.py). Mirrors topology.yaml component values.
 _EXPECTED_BW = {
    "tcm":  (512.0, 0.0),
-    "sram": (128.0, 2.0),
-    "hbm":  (32.0,  6.0),
+    "sram": (512.0, 2.0),
+    "hbm":  (256.0, 6.0),
 }


@@ -160,10 +160,10 @@ def test_slot_io_scales_linearly_with_nbytes(tmp_path):
    lat_8k = _run_torus_allreduce(tmp_path, buffer_kind="hbm", n_elem=4096)

    # Expected delta from doubling: at least one slot-IO event per cube
-    # in the critical path (very conservative). Per-access add = 4096/32 ≈ 128
+    # in the critical path (very conservative). Per-access add = 4096/256 = 16
    # ns on HBM going from 4k → 8k. Multiple slot accesses on the critical
    # path should make the observed delta meaningfully larger.
-    expected_min_delta = 0.5 * (4096 / 32.0)  # ≈ 64 ns
+    expected_min_delta = 0.5 * (4096 / 256.0)  # ≈ 8 ns
    assert lat_8k - lat_4k > expected_min_delta, (
        f"doubling nbytes on hbm should add ≥ {expected_min_delta:.1f} ns "
        f"of slot-IO latency, got delta={lat_8k - lat_4k:.1f} ns "