9c129d6131
Cube SRAM and HBM live on the cube NoC behind router-attached links (sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256). Previously the slot-IO model treated them as if they were per-PE local, so the buffer_kind sweep showed TCM ≈ SRAM at 64 KB / PE.

pe_ipcq._handle_recv and pe_dma._handle_ipcq_inbound now charge a PE→bank compute_drain_ns on top of the intrinsic slot-IO for SRAM/HBM. TCM stays free of this hop. Adds an internal IpcqRecvCmd.consume field that gates the recv-side hop+slot-IO charges (used by a follow-up diagnostic API; default True keeps current behavior).

Post-fix at 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs. SRAM is slowest because its 128 GB/s bank link is the narrowest in the system, narrower than HBM's 256 GB/s.

The existing ordering test is rewritten from tcm<sram<hbm to tcm<hbm<sram, and a new test_ipcq_buffer_kind_locations adds 3 invariants on the gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
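To make the link-width argument concrete, here is a minimal sketch of a per-transfer PE→bank hop cost. It assumes the drain is simply payload bytes divided by the router-link bandwidth, with 1 GB/s taken as 1 byte/ns. The helper name bank_hop_ns and the table ROUTER_LINK_BW_GBS are hypothetical and are not taken from pe_ipcq or pe_dma. The real compute_drain_ns may include other terms, so this is not expected to reproduce the end-to-end sweep latencies.

```python
# Hypothetical illustration only: names and formula are assumptions,
# not the simulator's actual compute_drain_ns implementation.

# Router-attached link bandwidths quoted in the commit message (GB/s).
ROUTER_LINK_BW_GBS = {"sram": 128.0, "hbm": 256.0}


def bank_hop_ns(buffer_kind: str, nbytes: int, consume: bool = True) -> float:
    """Estimate the PE->bank hop time in ns for one transfer.

    Assumes drain time = bytes / link bandwidth, with 1 GB/s == 1 byte/ns.
    TCM is PE-local, so it pays no hop. consume=False skips the charge,
    mirroring how IpcqRecvCmd.consume is described as gating it.
    """
    if not consume or buffer_kind == "tcm":
        return 0.0
    return nbytes / ROUTER_LINK_BW_GBS[buffer_kind]


if __name__ == "__main__":
    for kind in ("tcm", "hbm", "sram"):
        print(kind, bank_hop_ns(kind, 65536))
```

Under these assumptions a single 64 KB transfer costs twice as long over the SRAM link as over the HBM link, which matches the direction of the post-fix ordering even though the absolute sweep numbers depend on more than one hop.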
593 B
| buffer_kind | sip_topology | n_sips | n_elem | bytes_per_pe | latency_ns |
|---|---|---|---|---|---|
| hbm | torus_2d | 6 | 128 | 256 | 1858.0399999999827 |
| hbm | torus_2d | 6 | 1024 | 2048 | 2389.0399999999827 |
| hbm | torus_2d | 6 | 8192 | 16384 | 6673.039999999986 |
| hbm | torus_2d | 6 | 32768 | 65536 | 21361.03999999992 |
| sram | torus_2d | 6 | 128 | 256 | 1774.0399999999827 |
| sram | torus_2d | 6 | 1024 | 2048 | 2389.0399999999827 |
| sram | torus_2d | 6 | 8192 | 16384 | 7345.039999999986 |
| sram | torus_2d | 6 | 32768 | 65536 | 24337.039999999935 |
| tcm | torus_2d | 6 | 128 | 256 | 1678.0399999999827 |
| tcm | torus_2d | 6 | 1024 | 2048 | 1957.0399999999827 |
| tcm | torus_2d | 6 | 8192 | 16384 | 4225.039999999986 |
| tcm | torus_2d | 6 | 32768 | 65536 | 12001.03999999992 |
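The ordering claimed in the commit message can be checked directly against the rows above. The sketch below hard-codes the latencies from the table and sorts the buffer kinds at a given payload size. The helper names ordering_at and gap_ns are illustrative and are not the contents of test_ipcq_buffer_kind_locations, whose three invariants are not shown here.

```python
# Latencies in ns, copied from the sweep table,
# keyed by (buffer_kind, bytes_per_pe).
LATENCY_NS = {
    ("hbm", 256): 1858.0399999999827,
    ("hbm", 2048): 2389.0399999999827,
    ("hbm", 16384): 6673.039999999986,
    ("hbm", 65536): 21361.03999999992,
    ("sram", 256): 1774.0399999999827,
    ("sram", 2048): 2389.0399999999827,
    ("sram", 16384): 7345.039999999986,
    ("sram", 65536): 24337.039999999935,
    ("tcm", 256): 1678.0399999999827,
    ("tcm", 2048): 1957.0399999999827,
    ("tcm", 16384): 4225.039999999986,
    ("tcm", 65536): 12001.03999999992,
}

KINDS = ("tcm", "hbm", "sram")


def ordering_at(bytes_per_pe: int) -> list[str]:
    """Return buffer kinds sorted fastest-first at one payload size."""
    return sorted(KINDS, key=lambda k: LATENCY_NS[(k, bytes_per_pe)])


def gap_ns(slow: str, fast: str, bytes_per_pe: int) -> float:
    """Latency difference slow - fast, in ns, at one payload size."""
    return LATENCY_NS[(slow, bytes_per_pe)] - LATENCY_NS[(fast, bytes_per_pe)]


if __name__ == "__main__":
    for size in (256, 2048, 16384, 65536):
        print(size, ordering_at(size), gap_ns("sram", "hbm", size))
```

In this data, tcm < hbm < sram holds strictly at 16384 and 65536 bytes per PE. At 2048 bytes HBM and SRAM tie, and at 256 bytes SRAM is slightly faster than HBM, so the ordering is a property of the larger transfers in the sweep.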