a759d58007
scripts/plot_pe_dma_perf.py runs the simulator across six
no-congestion scenarios (SAME_CUBE_PE_LOCAL / REMOTE_BEST /
REMOTE_WORST, REMOTE_CUBE_BEST / REMOTE_WORST, REMOTE_SIP) and
five congestion scenarios (1/2/3-PE hot-target, 8-PE corresponding
cube-to-cube, 8-PE all-hit-pe0). It breaks the measured total (or
makespan) down into pe_setup, noc_mesh, ucie, fabric, streaming,
hbm_ctrl, and a contention residual using a wormhole-pipelined
model (first-flit arrival + (n_flits-1) × flit_bytes /
bottleneck_bw + final chunk_time).
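As a rough illustration of the wormhole-pipelined estimate described above, the sketch below adds the three terms together. The function name, argument names, and the numbers in the usage line are hypothetical and are not taken from scripts/plot_pe_dma_perf.py; bandwidth is assumed to be in bytes per nanosecond.

```python
import math


def wormhole_latency_ns(first_flit_ns, nbytes, flit_bytes, bottleneck_bw, final_chunk_ns):
    """Estimate transfer latency under a wormhole-pipelined model.

    Hypothetical helper: bottleneck_bw is assumed to be in bytes/ns.
    """
    # Number of flits needed to carry the payload.
    n_flits = math.ceil(nbytes / flit_bytes)
    # After the first flit arrives, the remaining flits stream at the
    # rate of the slowest link on the path.
    streaming_ns = (n_flits - 1) * flit_bytes / bottleneck_bw
    return first_flit_ns + streaming_ns + final_chunk_ns


def contention_residual_ns(actual_ns, formula_ns):
    """Whatever the simulator measured beyond the formula's estimate."""
    return actual_ns - formula_ns


# Illustrative values only: 16 KiB payload, 64-byte flits, 64 bytes/ns bottleneck.
estimate = wormhole_latency_ns(10.0, 16384, 64, 64.0, 1.0)
print(estimate)
```

The residual is whatever the measured latency leaves over after this estimate, which is why the harness expects the formula never to exceed the actual value.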
Outputs:
docs/diagrams/pe_dma_perf/no_congestion.png — single-PE latency
by topological distance. Latency grows monotonically from
SAME_CUBE_PE_LOCAL (77 ns) to REMOTE_CUBE_PE_REMOTE_WORST
(573 ns); REMOTE_SIP comes in at 409 ns.
docs/diagrams/pe_dma_perf/congestion.png — makespan as concurrent
issuer count grows. ctrl_hot_{1,2,3}=82/158/230 ns; 8-PE
eastbound UCIe = 963 ns; 8-PE all-hit-pe0 = 558 ns.
docs/diagrams/pe_dma_perf/summary.csv — raw rows for re-plotting.
Built-in --verify harness asserts:
(1) distance monotonicity for no-congestion;
(2) same-cube paths contain zero UCIe budget;
(3) remote-cube/SIP paths carry positive UCIe budget;
(4) breakdown is internally consistent (formula ≤ actual);
(5) streaming term matches (n_flits-1) × flit_bytes /
bottleneck_bw within 5 % for the local scenario;
(6) congestion makespan is monotonic in issuer count;
(7) 8-PE hotspot strictly exceeds 3-PE hotspot.
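A few of these checks can be reproduced by hand from the summary rows. The sketch below is not the --verify harness itself; the data layout and function names are assumptions, and the values are rounded from summary.csv. Where REMOTE_SIP sits in the distance ordering is not stated here, so the sketch leaves it out of the monotonicity check and only uses it for the UCIe check.

```python
# (scenario, total_ns, ucie_ns), ordered by topological distance.
# Values rounded from summary.csv; the tuple layout is an assumption.
NO_CONGESTION = [
    ("local", 77.0, 0.0),
    ("same_cube_best", 82.06, 0.0),
    ("same_cube_worst", 117.5, 0.0),
    ("remote_cube_best", 202.52, 32.51),
    ("remote_cube_worst", 573.12, 219.06),
]
REMOTE_SIP = ("remote_sip", 408.52, 37.04)

# Issuer count -> makespan_ns for the pe0_slice hotspot scenarios.
HOTSPOT_MAKESPAN = {1: 82.06, 2: 158.345, 3: 230.075, 8: 558.25}


def is_strictly_increasing(values):
    return all(a < b for a, b in zip(values, values[1:]))


def verify():
    """Return a dict of check name -> bool for checks (1), (2), (3), (6), (7)."""
    same_cube = [row for row in NO_CONGESTION if not row[0].startswith("remote")]
    remote = [row for row in NO_CONGESTION if row[0].startswith("remote")] + [REMOTE_SIP]
    return {
        "distance_monotonic": is_strictly_increasing([t for _, t, _ in NO_CONGESTION]),
        "same_cube_zero_ucie": all(ucie == 0.0 for _, _, ucie in same_cube),
        "remote_positive_ucie": all(ucie > 0.0 for _, _, ucie in remote),
        "hotspot_monotonic": is_strictly_increasing(
            [HOTSPOT_MAKESPAN[n] for n in (1, 2, 3)]
        ),
        "eight_pe_exceeds_three_pe": HOTSPOT_MAKESPAN[8] > HOTSPOT_MAKESPAN[3],
    }


for name, ok in verify().items():
    print(f"{name}: {'PASS' if ok else 'FAIL'}")
```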
Cross-SIP gets a looser 70 % contention slack because the path
crosses two non-flit-aware (pcie_ep) boundaries that force
store-and-forward re-streaming, which the simple formula does not
account for. Single-cube scenarios stay under 25 % residual.
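One way to read the slack rule is as a cap on the contention residual relative to the measured total. The sketch below assumes that interpretation; the helper names and the scope keys are hypothetical, and the limit for remote-cube paths is not stated here, so it is left out.

```python
# Residual limits as fractions of the measured latency.
# Only the two limits mentioned in this message are included.
SLACK_LIMITS = {"same_cube": 0.25, "cross_sip": 0.70}


def residual_fraction(contention_ns, total_ns):
    """Share of the measured latency that the formula does not explain."""
    return contention_ns / total_ns


def within_slack(contention_ns, total_ns, scope):
    """True if the residual stays under the limit assumed for this scope."""
    return residual_fraction(contention_ns, total_ns) < SLACK_LIMITS[scope]


# Values rounded from summary.csv.
print(within_slack(18.25, 117.5, "same_cube"))     # same_cube_worst
print(within_slack(209.385, 408.52, "cross_sip"))  # remote_sip
```

Under this reading the remote_sip row would not fit the 25 % limit, which is consistent with it needing the looser one.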
All checks PASS at the current model (post ADR-0019 D1/D4
per-PE HBM CTRL restoration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| graph | scenario | label | nbytes | n_issuers | total_ns | makespan_ns | min_lat_ns | pe_setup | noc_mesh | ucie | hbm_ctrl | contention | path | first_path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no_congestion | local | SAME_CUBE PE_LOCAL | 16384 | | 77.0 | | | 1.0 | 2.0 | 0.0 | 9.0 | 2.0 | pe0.pe_dma -> cube0.r0c0 -> hbm_ctrl.pe0 | |
| no_congestion | same_cube_best | SAME_CUBE REMOTE_BEST (pe0→pe1) | 16384 | | 82.06 | | | 1.0 | 5.03 | 0.0 | 9.0 | 4.030000000000001 | pe0.pe_dma -> cube0.r0c0 -> cube0.r0c1 -> hbm_ctrl.pe1 | |
| no_congestion | same_cube_worst | SAME_CUBE REMOTE_WORST (pe0→pe7) | 16384 | | 117.50000000000001 | | | 1.0 | 26.25 | 0.0 | 9.0 | 18.250000000000014 | pe0.pe_dma -> cube0.r0c0 -> cube0.r1c0 -> cube0.r1c1 -> cube0.r1c2 -> cube0.r1c3 -> cube0.r4c3 -> cube0.r4c4 -> cube0.r5c4 -> cube0.r5c5 -> hbm_ctrl.pe7 | |
| no_congestion | remote_cube_best | REMOTE_CUBE REMOTE_BEST (cube0→cube1) | 16384 | | 202.51999999999998 | | | 1.0 | 6.0 | 32.510000000000005 | 9.0 | 28.00999999999999 | pe0.pe_dma -> cube0.r0c0 -> ucie-N.conn0 -> cube0.ucie-N -> ucie-N.conn3 -> cube0.r0c5 -> ucie-E.conn0 -> cube0.ucie-E -> cube1.ucie-W -> ucie-W.conn0 -> cube1.r0c0 -> hbm_ctrl.pe0 | |
| no_congestion | remote_cube_worst | REMOTE_CUBE REMOTE_WORST (cube0→cube15.pe7) | 16384 | | 573.1199999999999 | | | 1.0 | 30.0 | 219.05999999999995 | 9.0 | 188.05999999999995 | pe0.pe_dma -> cube0.r0c0 -> ucie-N.conn0 -> cube0.ucie-N -> ucie-N.conn3 -> cube0.r0c5 -> ucie-E.conn0 -> cube0.ucie-E -> cube1.ucie-W -> ucie-W.conn0 -> cube1.r0c0 -> ucie-N.conn0 -> cube1.ucie-N -> ucie-N.conn3 -> cube1.r0c5 -> ucie-E.conn0 -> cube1.ucie-E -> cube2.ucie-W -> ucie-W.conn0 -> cube2.r0c0 -> ucie-N.conn0 -> cube2.ucie-N -> ucie-N.conn3 -> cube2.r0c5 -> ucie-E.conn0 -> cube2.ucie-E -> cube3.ucie-W -> ucie-W.conn0 -> cube3.r0c0 -> ucie-N.conn0 -> cube3.ucie-N -> ucie-N.conn3 -> cube3.r0c5 -> ucie-E.conn0 -> cube3.ucie-E -> ucie-E.conn3 -> cube3.r5c5 -> ucie-S.conn3 -> cube3.ucie-S -> cube7.ucie-N -> ucie-N.conn3 -> cube7.r0c5 -> ucie-E.conn0 -> cube7.ucie-E -> ucie-E.conn3 -> cube7.r5c5 -> ucie-S.conn3 -> cube7.ucie-S -> cube11.ucie-N -> ucie-N.conn3 -> cube11.r0c5 -> ucie-E.conn0 -> cube11.ucie-E -> ucie-E.conn3 -> cube11.r5c5 -> ucie-S.conn3 -> cube11.ucie-S -> cube15.ucie-N -> ucie-N.conn3 -> cube15.r0c5 -> ucie-E.conn0 -> cube15.ucie-E -> ucie-E.conn3 -> cube15.r5c5 -> hbm_ctrl.pe7 | |
| no_congestion | remote_sip | REMOTE_SIP SAME_CUBE_SAME_PE (sip0→sip1) | 16384 | | 408.5216666666663 | | | 1.0 | 4.0 | 37.040000000000006 | 9.0 | 209.38499999999962 | pe0.pe_dma -> cube0.r0c0 -> ucie-N.conn0 -> cube0.ucie-N -> io0.ucie-P0 -> ucie-P0.conn0 -> io0.noc -> io0.pcie_ep -> fabric.switch0 -> io0.pcie_ep -> io0.noc -> ucie-P0.conn0 -> io0.ucie-P0 -> cube0.ucie-N -> ucie-N.conn0 -> cube0.r0c0 -> hbm_ctrl.pe0 | |
| congestion | ctrl_hot_1 | 1×PE → pe0_slice | 16384 | 1 | | 82.06 | 82.06 | 1.0 | 5.03 | 0.0 | 9.0 | 4.030000000000001 | | pe1.pe_dma -> cube0.r0c1 -> cube0.r0c0 -> hbm_ctrl.pe0 |
| congestion | ctrl_hot_2 | 2×PE → pe0_slice | 16384 | 2 | | 158.3450000000001 | 134.2400000000001 | 1.0 | 5.03 | 0.0 | 9.0 | 80.31500000000011 | | pe1.pe_dma -> cube0.r0c1 -> cube0.r0c0 -> hbm_ctrl.pe0 |
| congestion | ctrl_hot_3 | 3×PE → pe0_slice | 16384 | 3 | | 230.0750000000001 | 139.94000000000008 | 1.0 | 5.03 | 0.0 | 9.0 | 152.0450000000001 | | pe1.pe_dma -> cube0.r0c1 -> cube0.r0c0 -> hbm_ctrl.pe0 |
| congestion | ucie_eastbound | 8×PE corresp. cube0→cube1 | 16384 | 8 | | 962.52 | 438.52 | 1.0 | 6.0 | 32.510000000000005 | 9.0 | 788.01 | | pe0.pe_dma -> cube0.r0c0 -> ucie-N.conn0 -> cube0.ucie-N -> ucie-N.conn3 -> cube0.r0c5 -> ucie-E.conn0 -> cube0.ucie-E -> cube1.ucie-W -> ucie-W.conn0 -> cube1.r0c0 -> hbm_ctrl.pe0 |
| congestion | all_pe_to_pe0 | 8×PE → pe0_slice | 16384 | 8 | | 558.2499999999998 | 195.0 | 1.0 | 2.0 | 0.0 | 9.0 | 483.2499999999998 | | pe0.pe_dma -> cube0.r0c0 -> hbm_ctrl.pe0 |