kernbench2/docs/diagrams/pe_dma_perf/summary.csv
ywkang 0bf220fed0 Switch PE_DMA perf plots to Effective BW utilization
Replaces the latency-breakdown stacked bars with a single utilization
bar per scenario. Each bar shows ``effective_bw / peak_bottleneck_bw``
with both values annotated, and a horizontal "single-path peak" line at
100 %. The colour band (green ≥70 %, amber ≥40 %, red <40 %) makes the
no-congestion distance roll-off scannable at a glance.
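The colour thresholds above can be sketched as a small classifier. This is only an illustration: the function name `colour_band` is hypothetical and not taken from the plotting script, and the sample values are `util_pct` figures rounded from summary.csv.

```python
def colour_band(util_pct: float) -> str:
    """Map a utilization percentage to the band described in the commit message.

    Hypothetical helper: green >= 70 %, amber >= 40 %, red < 40 %.
    """
    if util_pct >= 70.0:
        return "green"
    if util_pct >= 40.0:
        return "amber"
    return "red"


# util_pct values rounded from summary.csv (no_congestion series).
samples = {
    "local": 83.1,
    "same_cube_worst": 54.5,
    "remote_cube_worst": 22.3,
}
bands = {name: colour_band(pct) for name, pct in samples.items()}
print(bands)
```

With these inputs the local case lands in the green band, the worst same-cube case in amber, and the worst remote-cube case in red.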

Definitions:
  effective_bw = (total bytes transferred) / wall-clock time
    no_congestion: nbytes / total_ns
    congestion:    n_issuers × nbytes / makespan_ns  (aggregate)
  peak_bw      = min(edge.bw_gbs) on first issuer's path
  util_pct     = effective_bw / peak_bw × 100
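As a worked check of these definitions, the sketch below recomputes two rows of summary.csv. The function names are illustrative and not taken from the repository. It relies on the fact that 1 byte/ns equals 1 GB/s (decimal), so no unit conversion is needed.

```python
def effective_bw_gbs(nbytes: int, time_ns: float, n_issuers: int = 1) -> float:
    """Aggregate bytes moved per nanosecond; 1 byte/ns == 1 GB/s."""
    return n_issuers * nbytes / time_ns


def util_pct(effective_bw: float, peak_bw: float) -> float:
    """Effective bandwidth as a percentage of the bottleneck peak."""
    return effective_bw / peak_bw * 100.0


# no_congestion, local: 16384 bytes in 77.0 ns against a 256 GB/s bottleneck.
local_bw = effective_bw_gbs(16384, 77.0)
local_util = util_pct(local_bw, 256.0)

# congestion, ucie_eastbound: 8 issuers, makespan 962.52 ns, 128 GB/s per path.
east_bw = effective_bw_gbs(16384, 962.52, n_issuers=8)
east_util = util_pct(east_bw, 128.0)

print(f"local:          {local_bw:.2f} GB/s, {local_util:.1f} %")
print(f"ucie_eastbound: {east_bw:.2f} GB/s, {east_util:.1f} %")
```

The second case reproduces the 106.4 % figure: the aggregate of eight issuers is divided by a single path's 128 GB/s, so values above 100 % are possible.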

The congestion graph shows that 8×PE eastbound exceeds 100 % of a
single-path peak (106.4 %): UCIe-N's 4 connections × 128 GB/s give
512 GB/s of aggregate eastbound capacity, so concurrent issuers across
disjoint conns sum past any single conn's 128 GB/s. The 8×PE→pe0_slice
hotspot reaches 91.7 %, almost saturating the shared r0c0→hbm_ctrl.pe0
bottleneck — the simulator's address-based PC striping + per-flit
arbitration model amortises the cost cleanly.

Self-verification updated to BW invariants:
  (1) effective BW shrinks as topological distance grows
  (2) util_pct ∈ (0, 250 %]
  (3) single-issuer util_pct ≤ 100 %
  (4) effective_bw = nbytes / total_ns for single requests
  (5) congestion aggregate BW grows monotonically with issuer count
      on the hot-target series
  (6) 8-PE all-hit-pe0 saturates ≥ 70 % of shared peak

All checks PASS at the current model.
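A minimal sketch of how invariants (2) to (6) could be checked, assuming rows are reduced to a few fields. The tuple layout, the function name `check_invariants`, and the result keys are hypothetical, and the numbers are rounded from summary.csv. Invariant (1) is left out because the distance ordering of scenarios is not spelled out here. Check (4) is applied in its generalised form, n_issuers × nbytes / time, so it also covers the congestion rows.

```python
import math

# (scenario, n_issuers, nbytes, time_ns, peak_bw_gbs, effective_bw_gbs)
# time_ns is total_ns for single requests and makespan_ns for congestion rows.
ROWS = [
    ("local", 1, 16384, 77.0, 256.0, 212.779),
    ("remote_cube_best", 1, 16384, 202.52, 128.0, 80.901),
    ("ctrl_hot_1", 1, 16384, 82.06, 256.0, 199.659),
    ("ctrl_hot_2", 2, 16384, 158.345, 256.0, 206.941),
    ("ctrl_hot_3", 3, 16384, 230.075, 256.0, 213.635),
    ("ucie_eastbound", 8, 16384, 962.52, 128.0, 136.176),
    ("all_pe_to_pe0", 8, 16384, 558.25, 256.0, 234.791),
]


def check_invariants(rows):
    """Return a dict of invariant name -> bool for checks (2) to (6)."""
    util = {r[0]: r[5] / r[4] * 100.0 for r in rows}
    hot = sorted(
        (r for r in rows if r[0].startswith("ctrl_hot_")), key=lambda r: r[1]
    )
    return {
        # (2) util_pct in (0, 250 %]
        "util_in_range": all(0.0 < u <= 250.0 for u in util.values()),
        # (3) a single issuer cannot exceed its own path's peak
        "single_issuer_le_100": all(
            util[r[0]] <= 100.0 for r in rows if r[1] == 1
        ),
        # (4) effective_bw matches its definition (tolerance covers rounding)
        "bw_matches_definition": all(
            math.isclose(r[5], r[1] * r[2] / r[3], rel_tol=1e-4) for r in rows
        ),
        # (5) aggregate BW grows with issuer count on the hot-target series
        "hot_series_monotonic": all(a[5] < b[5] for a, b in zip(hot, hot[1:])),
        # (6) 8-PE all-hit-pe0 reaches at least 70 % of the shared peak
        "all_hit_pe0_ge_70": util["all_pe_to_pe0"] >= 70.0,
    }


results = check_invariants(ROWS)
for name, ok in results.items():
    print(f"{name}: {'PASS' if ok else 'FAIL'}")
```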

The CSV retains all breakdown components (pe_setup, noc_mesh, ucie,
fabric, streaming, hbm_ctrl, contention) so a future replot can still
recover the latency-breakdown view without re-running the simulator.
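To illustrate how the breakdown view could be recovered from the CSV alone, the sketch below sums the seven component columns per row. The column names come from the CSV header; the helper names and the inline two-row sample (values rounded from summary.csv) are assumptions for illustration. In the rows of this file the components add up to the wall-clock latency, which the sample reflects.

```python
import csv
import io

COMPONENTS = (
    "pe_setup", "noc_mesh", "ucie", "fabric",
    "streaming", "hbm_ctrl", "contention",
)

# Two rows reduced to the relevant columns, values rounded from summary.csv.
SAMPLE = """scenario,pe_setup,noc_mesh,ucie,fabric,streaming,hbm_ctrl,contention
local,1.0,2.0,0.0,0.0,63.0,9.0,2.0
remote_cube_best,1.0,6.0,32.51,0.0,126.0,9.0,28.01
"""


def latency_from_breakdown(row):
    """Total latency in ns as the sum of the breakdown components."""
    return sum(float(row[name]) for name in COMPONENTS)


def breakdown_shares(row):
    """Fraction of the total latency contributed by each component."""
    total = latency_from_breakdown(row)
    return {name: float(row[name]) / total for name in COMPONENTS}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    total = latency_from_breakdown(row)
    streaming = breakdown_shares(row)["streaming"]
    print(f"{row['scenario']}: {total:.2f} ns, streaming {streaming:.0%}")
```

The per-component values (or their shares) are what a stacked-bar replot would need, so the simulator does not have to be re-run.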

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 07:59:45 -07:00

4.1 KiB

graph,scenario,label,nbytes,n_issuers,total_ns,makespan_ns,min_lat_ns,bottleneck_bw_gbs,effective_bw_gbs,util_pct,pe_setup,noc_mesh,ucie,fabric,streaming,hbm_ctrl,contention,path,first_path
no_congestion,local,SAME_CUBE PE_LOCAL,16384,1,77.0,,,256.0,212.7792207792208,83.11688311688312,1.0,2.0,0.0,0.0,63.0,9.0,2.0,pe0.pe_dma -> cube0.r0c0 -> hbm_ctrl.pe0,
no_congestion,same_cube_best,SAME_CUBE REMOTE_BEST (pe0→pe1),16384,1,82.06,,,256.0,199.6587862539605,77.99171338045332,1.0,5.03,0.0,0.0,63.0,9.0,4.030000000000001,pe0.pe_dma -> cube0.r0c0 -> cube0.r0c1 -> hbm_ctrl.pe1,
no_congestion,same_cube_worst,SAME_CUBE REMOTE_WORST (pe0→pe7),16384,1,117.50000000000001,,,256.0,139.4382978723404,54.46808510638297,1.0,26.25,0.0,0.0,63.0,9.0,18.250000000000014,pe0.pe_dma -> cube0.r0c0 -> cube0.r1c0 -> cube0.r1c1 -> cube0.r1c2 -> cube0.r1c3 -> cube0.r4c3 -> cube0.r4c4 -> cube0.r5c4 -> cube0.r5c5 -> hbm_ctrl.pe7,
no_congestion,remote_cube_best,REMOTE_CUBE REMOTE_BEST (cube0→cube1),16384,1,202.51999999999998,,,128.0,80.90065178747778,63.20363420896702,1.0,6.0,32.510000000000005,0.0,126.0,9.0,28.00999999999999,pe0.pe_dma -> cube0.r0c0 -> ucie-N.conn0 -> cube0.ucie-N -> ucie-N.conn3 -> cube0.r0c5 -> ucie-E.conn0 -> cube0.ucie-E -> cube1.ucie-W -> ucie-W.conn0 -> cube1.r0c0 -> hbm_ctrl.pe0,
no_congestion,remote_cube_worst,REMOTE_CUBE REMOTE_WORST (cube0→cube15.pe7),16384,1,573.1199999999999,,,128.0,28.587381351200452,22.333891680625353,1.0,30.0,219.05999999999995,0.0,126.0,9.0,188.05999999999995,pe0.pe_dma -> cube0.r0c0 -> ucie-N.conn0 -> cube0.ucie-N -> ucie-N.conn3 -> cube0.r0c5 -> ucie-E.conn0 -> cube0.ucie-E -> cube1.ucie-W -> ucie-W.conn0 -> cube1.r0c0 -> ucie-N.conn0 -> cube1.ucie-N -> ucie-N.conn3 -> cube1.r0c5 -> ucie-E.conn0 -> cube1.ucie-E -> cube2.ucie-W -> ucie-W.conn0 -> cube2.r0c0 -> ucie-N.conn0 -> cube2.ucie-N -> ucie-N.conn3 -> cube2.r0c5 -> ucie-E.conn0 -> cube2.ucie-E -> cube3.ucie-W -> ucie-W.conn0 -> cube3.r0c0 -> ucie-N.conn0 -> cube3.ucie-N -> ucie-N.conn3 -> cube3.r0c5 -> ucie-E.conn0 -> cube3.ucie-E -> ucie-E.conn3 -> cube3.r5c5 -> ucie-S.conn3 -> cube3.ucie-S -> cube7.ucie-N -> ucie-N.conn3 -> cube7.r0c5 -> ucie-E.conn0 -> cube7.ucie-E -> ucie-E.conn3 -> cube7.r5c5 -> ucie-S.conn3 -> cube7.ucie-S -> cube11.ucie-N -> ucie-N.conn3 -> cube11.r0c5 -> ucie-E.conn0 -> cube11.ucie-E -> ucie-E.conn3 -> cube11.r5c5 -> ucie-S.conn3 -> cube11.ucie-S -> cube15.ucie-N -> ucie-N.conn3 -> cube15.r0c5 -> ucie-E.conn0 -> cube15.ucie-E -> ucie-E.conn3 -> cube15.r5c5 -> hbm_ctrl.pe7,
no_congestion,remote_sip,REMOTE_SIP SAME_CUBE_SAME_PE (sip0→sip1),16384,1,408.5216666666663,,,128.0,40.10558395515541,31.332487464965165,1.0,4.0,37.040000000000006,22.09666666666667,126.0,9.0,209.38499999999962,pe0.pe_dma -> cube0.r0c0 -> ucie-N.conn0 -> cube0.ucie-N -> io0.ucie-P0 -> ucie-P0.conn0 -> io0.noc -> io0.pcie_ep -> fabric.switch0 -> io0.pcie_ep -> io0.noc -> ucie-P0.conn0 -> io0.ucie-P0 -> cube0.ucie-N -> ucie-N.conn0 -> cube0.r0c0 -> hbm_ctrl.pe0,
congestion,ctrl_hot_1,1×PE → pe0_slice,16384,1,,82.06,82.06,256.0,199.6587862539605,77.99171338045332,1.0,5.03,0.0,0.0,63.0,9.0,4.030000000000001,,pe1.pe_dma -> cube0.r0c1 -> cube0.r0c0 -> hbm_ctrl.pe0
congestion,ctrl_hot_2,2×PE → pe0_slice,16384,2,,158.3450000000001,134.2400000000001,256.0,206.94054122327813,80.83614891534302,1.0,5.03,0.0,0.0,63.0,9.0,80.31500000000011,,pe1.pe_dma -> cube0.r0c1 -> cube0.r0c0 -> hbm_ctrl.pe0
congestion,ctrl_hot_3,3×PE → pe0_slice,16384,3,,230.0750000000001,139.94000000000008,256.0,213.6346843420623,83.45104857111808,1.0,5.03,0.0,0.0,63.0,9.0,152.0450000000001,,pe1.pe_dma -> cube0.r0c1 -> cube0.r0c0 -> hbm_ctrl.pe0
congestion,ucie_eastbound,8×PE corresp. cube0→cube1,16384,8,,962.52,438.52,128.0,136.17587167019906,106.387399742343,1.0,6.0,32.510000000000005,0.0,126.0,9.0,788.01,,pe0.pe_dma -> cube0.r0c0 -> ucie-N.conn0 -> cube0.ucie-N -> ucie-N.conn3 -> cube0.r0c5 -> ucie-E.conn0 -> cube0.ucie-E -> cube1.ucie-W -> ucie-W.conn0 -> cube1.r0c0 -> hbm_ctrl.pe0
congestion,all_pe_to_pe0,8×PE → pe0_slice,16384,8,,558.2499999999998,195.0,256.0,234.7908643081058,91.71518137035383,1.0,2.0,0.0,0.0,63.0,9.0,483.2499999999998,,pe0.pe_dma -> cube0.r0c0 -> hbm_ctrl.pe0