Files
kernbench2/docs/diagrams/pe_dma_perf/summary.csv
T
ywkang a759d58007 Add PE_DMA latency-breakdown plots + self-verification harness
scripts/plot_pe_dma_perf.py runs the simulator across six
no-congestion scenarios (SAME_CUBE_PE_LOCAL / REMOTE_BEST /
REMOTE_WORST, REMOTE_CUBE_BEST / REMOTE_WORST, REMOTE_SIP) and
five congestion scenarios (1/2/3 PE hot-target, 8-PE corresp.
cube-to-cube, 8-PE all-hit-pe0). It categorises actual total /
makespan into pe_setup, noc_mesh, ucie, fabric, streaming,
hbm_ctrl, and a contention residual using a wormhole-pipelined
model (first-flit arrival + (n_flits-1)/bottleneck + final
chunk_time).

Outputs:
  docs/diagrams/pe_dma_perf/no_congestion.png — single-PE latency
    by topological distance. Visualises monotonic growth from
    SAME_CUBE_PE_LOCAL (77 ns) up to REMOTE_CUBE_PE_REMOTE_WORST
    (573 ns) and REMOTE_SIP (409 ns).
  docs/diagrams/pe_dma_perf/congestion.png — makespan as concurrent
    issuer count grows. ctrl_hot_{1,2,3}=82/158/230 ns; 8-PE
    eastbound UCIe = 963 ns; 8-PE all-hit-pe0 = 558 ns.
  docs/diagrams/pe_dma_perf/summary.csv — raw rows for re-plotting.

Built-in --verify harness asserts:
  (1) distance monotonicity for no-congestion;
  (2) same-cube paths contain zero UCIe budget;
  (3) remote-cube/SIP paths carry positive UCIe budget;
  (4) breakdown is internally consistent (formula ≤ actual);
  (5) streaming term matches (n_flits-1) × flit_bytes /
      bottleneck_bw within 5 % for the local scenario;
  (6) congestion makespan is monotonic in issuer count;
  (7) 8-PE hotspot strictly exceeds 3-PE hotspot.

Cross-SIP gets a looser 70 % contention slack because the path
crosses two non-flit-aware (pcie_ep) boundaries that force
store-and-forward re-streaming the simple formula does not
attribute. Single-cube scenarios stay under 25 % residual.

All checks PASS at the current model (post ADR-0019 D1/D4
per-PE HBM CTRL restoration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 01:23:42 -07:00

3.5 KiB

1graphscenariolabelnbytesn_issuerstotal_nsmakespan_nsmin_lat_nspe_setupnoc_meshuciehbm_ctrlcontentionpathfirst_path
2no_congestionlocalSAME_CUBE PE_LOCAL1638477.01.02.00.09.02.0pe0.pe_dma -> cube0.r0c0 -> hbm_ctrl.pe0
3no_congestionsame_cube_bestSAME_CUBE REMOTE_BEST (pe0→pe1)1638482.061.05.030.09.04.030000000000001pe0.pe_dma -> cube0.r0c0 -> cube0.r0c1 -> hbm_ctrl.pe1
4no_congestionsame_cube_worstSAME_CUBE REMOTE_WORST (pe0→pe7)16384117.500000000000011.026.250.09.018.250000000000014pe0.pe_dma -> cube0.r0c0 -> cube0.r1c0 -> cube0.r1c1 -> cube0.r1c2 -> cube0.r1c3 -> cube0.r4c3 -> cube0.r4c4 -> cube0.r5c4 -> cube0.r5c5 -> hbm_ctrl.pe7
5no_congestionremote_cube_bestREMOTE_CUBE REMOTE_BEST (cube0→cube1)16384202.519999999999981.06.032.5100000000000059.028.00999999999999pe0.pe_dma -> cube0.r0c0 -> ucie-N.conn0 -> cube0.ucie-N -> ucie-N.conn3 -> cube0.r0c5 -> ucie-E.conn0 -> cube0.ucie-E -> cube1.ucie-W -> ucie-W.conn0 -> cube1.r0c0 -> hbm_ctrl.pe0
6no_congestionremote_cube_worstREMOTE_CUBE REMOTE_WORST (cube0→cube15.pe7)16384573.11999999999991.030.0219.059999999999959.0188.05999999999995pe0.pe_dma -> cube0.r0c0 -> ucie-N.conn0 -> cube0.ucie-N -> ucie-N.conn3 -> cube0.r0c5 -> ucie-E.conn0 -> cube0.ucie-E -> cube1.ucie-W -> ucie-W.conn0 -> cube1.r0c0 -> ucie-N.conn0 -> cube1.ucie-N -> ucie-N.conn3 -> cube1.r0c5 -> ucie-E.conn0 -> cube1.ucie-E -> cube2.ucie-W -> ucie-W.conn0 -> cube2.r0c0 -> ucie-N.conn0 -> cube2.ucie-N -> ucie-N.conn3 -> cube2.r0c5 -> ucie-E.conn0 -> cube2.ucie-E -> cube3.ucie-W -> ucie-W.conn0 -> cube3.r0c0 -> ucie-N.conn0 -> cube3.ucie-N -> ucie-N.conn3 -> cube3.r0c5 -> ucie-E.conn0 -> cube3.ucie-E -> ucie-E.conn3 -> cube3.r5c5 -> ucie-S.conn3 -> cube3.ucie-S -> cube7.ucie-N -> ucie-N.conn3 -> cube7.r0c5 -> ucie-E.conn0 -> cube7.ucie-E -> ucie-E.conn3 -> cube7.r5c5 -> ucie-S.conn3 -> cube7.ucie-S -> cube11.ucie-N -> ucie-N.conn3 -> cube11.r0c5 -> ucie-E.conn0 -> cube11.ucie-E -> ucie-E.conn3 -> cube11.r5c5 -> ucie-S.conn3 -> cube11.ucie-S -> cube15.ucie-N -> ucie-N.conn3 -> cube15.r0c5 -> ucie-E.conn0 -> cube15.ucie-E -> ucie-E.conn3 -> cube15.r5c5 -> hbm_ctrl.pe7
7no_congestionremote_sipREMOTE_SIP SAME_CUBE_SAME_PE (sip0→sip1)16384408.52166666666631.04.037.0400000000000069.0209.38499999999962pe0.pe_dma -> cube0.r0c0 -> ucie-N.conn0 -> cube0.ucie-N -> io0.ucie-P0 -> ucie-P0.conn0 -> io0.noc -> io0.pcie_ep -> fabric.switch0 -> io0.pcie_ep -> io0.noc -> ucie-P0.conn0 -> io0.ucie-P0 -> cube0.ucie-N -> ucie-N.conn0 -> cube0.r0c0 -> hbm_ctrl.pe0
8congestionctrl_hot_11×PE → pe0_slice16384182.0682.061.05.030.09.04.030000000000001pe1.pe_dma -> cube0.r0c1 -> cube0.r0c0 -> hbm_ctrl.pe0
9congestionctrl_hot_22×PE → pe0_slice163842158.3450000000001134.24000000000011.05.030.09.080.31500000000011pe1.pe_dma -> cube0.r0c1 -> cube0.r0c0 -> hbm_ctrl.pe0
10congestionctrl_hot_33×PE → pe0_slice163843230.0750000000001139.940000000000081.05.030.09.0152.0450000000001pe1.pe_dma -> cube0.r0c1 -> cube0.r0c0 -> hbm_ctrl.pe0
11congestionucie_eastbound8×PE corresp. cube0→cube1163848962.52438.521.06.032.5100000000000059.0788.01pe0.pe_dma -> cube0.r0c0 -> ucie-N.conn0 -> cube0.ucie-N -> ucie-N.conn3 -> cube0.r0c5 -> ucie-E.conn0 -> cube0.ucie-E -> cube1.ucie-W -> ucie-W.conn0 -> cube1.r0c0 -> hbm_ctrl.pe0
12congestionall_pe_to_pe08×PE → pe0_slice163848558.2499999999998195.01.02.00.09.0483.2499999999998pe0.pe_dma -> cube0.r0c0 -> hbm_ctrl.pe0