ywkang a76487ca48 PE_DMA perf: SIP-wide scenarios + dual outputs + clearer naming
User asked to surface system-wide congestion (more accurate than
single-cube), bring back the latency-breakdown plot under a separate
filename, and rename the obscure ``streaming`` category.

Scenarios:
  Renamed all_pe_to_pe0 → all_pe_cube0_to_pe0 (clarify cube scope).
  Added two SIP-wide scenarios:
    sip_local_all     — every PE in sip0 (128 total) accesses its own
                        local slice. All paths disjoint (each PE owns
                        its own hbm_ctrl.peX), so the model should
                        scale linearly with cube count.
    sip_hotspot_pe0   — every PE in sip0 (128 total) targets
                        sip0.cube0.pe0_slice. Worst-case hotspot:
                        UCIe inbound + r0c0→hbm_ctrl.pe0 saturated.
  Each bar now carries an ``N=...`` annotation showing the issuer
  count, and the chart titles say the scope explicitly.
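
As a rough illustration of the two SIP-wide scenarios above, the sketch below enumerates issuer/target pairs. The helper names and tuple layout are hypothetical and not taken from this repository; the 16 cubes × 8 PEs layout is inferred from the cube0..cube15 and pe0..pe7 names in summary.csv and the stated total of 128.

```python
# Hypothetical helpers; the real scenario builder is not shown in this commit.
N_CUBES = 16       # cube0..cube15, inferred from the paths in summary.csv
PES_PER_CUBE = 8   # pe0..pe7


def sip_pes(sip=0):
    """All (sip, cube, pe) issuers in one SIP."""
    return [(sip, c, p) for c in range(N_CUBES) for p in range(PES_PER_CUBE)]


def slice_name(sip, cube, pe):
    """Name of the HBM slice owned by a PE, e.g. sip0.cube0.pe0_slice."""
    return f"sip{sip}.cube{cube}.pe{pe}_slice"


def sip_local_all(sip=0):
    """Every PE accesses its own local slice (all paths disjoint)."""
    return [(pe, slice_name(*pe)) for pe in sip_pes(sip)]


def sip_hotspot_pe0(sip=0):
    """Every PE targets the same slice on cube0.pe0 (worst-case hotspot)."""
    target = slice_name(sip, 0, 0)
    return [(pe, target) for pe in sip_pes(sip)]
```

Both lists have 128 entries; the first has 128 distinct targets, the second has exactly one.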

Effective BW + util at 16 KiB (nbytes = 16384):
  sip_local_all       N=128  eff= 27.2 TB/s  util_a= 83 %
  sip_hotspot_pe0     N=128  eff= 134 GB/s   util_a= 93 %
                                              (UCIe-into-cube0 saturated)
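
The numbers above can be reproduced from the summary.csv columns with the sketch below. The formula (issuers × bytes / makespan) is inferred from the CSV values rather than copied from the model code, and the function names are hypothetical.

```python
def effective_bw_gbs(nbytes, n_issuers, makespan_ns):
    """Aggregate effective bandwidth; bytes per ns is numerically GB/s."""
    return n_issuers * nbytes / makespan_ns


def util_pct(effective_gbs, peak_gbs):
    """Utilization of a given peak (single-path or aggregate), in percent."""
    return 100.0 * effective_gbs / peak_gbs


# Values taken from the two SIP-wide rows of summary.csv.
local_eff = effective_bw_gbs(16384, 128, 77.0)        # about 27.2 TB/s
local_util_a = util_pct(local_eff, 32768.0)           # about 83 %
local_util_s = util_pct(local_eff, 256.0)             # about 10,639 %

hot_eff = effective_bw_gbs(16384, 128, 15618.595)     # about 134 GB/s
hot_util_a = util_pct(hot_eff, 144.0)                 # about 93 %
```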

Plots:
  no_congestion.png + congestion.png        — Effective BW utilization
                                              (two bars: single vs aggregate peak)
  breakdown_no_congestion.png +
  breakdown_congestion.png                  — stacked latency breakdown
                                              (renamed from previous)
  summary.csv with columns for both views.

The visual y-cap on BW utilization is 150 %. Bars exceeding it (e.g.
sip_local_all's util_single = 10,639 %) are drawn at the cap with an
upward arrow and the real value annotated. The verification rule for
``util_single`` is loosened to ``≤ n_issuers × 100 % + 5 %`` so
massively-parallel disjoint scenarios pass.
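
A minimal sketch of the cap and the loosened check described above. The constant and function names are hypothetical; only the 150 % cap and the ``≤ n_issuers × 100 % + 5 %`` bound come from this commit message.

```python
Y_CAP_PCT = 150.0  # visual y-cap on the BW-utilization charts


def clamp_bar(util_pct):
    """Return (drawn_height, overflowed). Overflowing bars are drawn at the
    cap; the caller adds the upward arrow and annotates the real value."""
    return min(util_pct, Y_CAP_PCT), util_pct > Y_CAP_PCT


def util_single_ok(util_single_pct, n_issuers, slack_pct=5.0):
    """Loosened rule: N disjoint issuers may reach N x 100 % of a single path."""
    return util_single_pct <= n_issuers * 100.0 + slack_pct
```

For example, sip_local_all (util_single of about 10,639 % with 128 issuers) is drawn at the cap and still passes the check, because its bound is 12,805 %.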

Category renamed: ``streaming`` → ``wire_transfer``. It is the
bulk-transfer time = (n_flits − 1) × flit_bytes / bottleneck_bw — the
cost of streaming the rest of the payload through the slowest wire
after the first flit has arrived.
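
The formula can be sketched as follows. The 256-byte flit size is an assumption: it is not stated in this commit, but it reproduces the ``wire_transfer`` values in summary.csv (63.0 ns at 256 GB/s and 126.0 ns at 128 GB/s for a 16384-byte payload). The function name is hypothetical.

```python
import math


def wire_transfer_ns(nbytes, flit_bytes, bottleneck_bw_gbs):
    """Time to stream all flits after the first through the slowest wire.

    bottleneck_bw_gbs is in GB/s, which is numerically bytes per ns.
    """
    n_flits = math.ceil(nbytes / flit_bytes)
    return (n_flits - 1) * flit_bytes / bottleneck_bw_gbs


FLIT_BYTES = 256  # assumed, see note above

same_cube = wire_transfer_ns(16384, FLIT_BYTES, 256.0)    # 63.0
remote_cube = wire_transfer_ns(16384, FLIT_BYTES, 128.0)  # 126.0
```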

All checks PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 09:43:09 -07:00

graph,scenario,label,nbytes,n_issuers,total_ns,makespan_ns,min_lat_ns,peak_single_bw_gbs,peak_aggregate_bw_gbs,effective_bw_gbs,util_single_pct,util_aggregate_pct,pe_setup,noc_mesh,ucie,fabric,wire_transfer,hbm_ctrl,contention,path,first_path
no_congestion,local,SAME_CUBE PE_LOCAL,16384,1,77.0,,,256.0,256.0,212.7792207792208,83.11688311688312,83.11688311688312,1.0,2.0,0.0,0.0,63.0,9.0,2.0,pe0.pe_dma -> cube0.r0c0 -> hbm_ctrl.pe0,
no_congestion,same_cube_best,SAME_CUBE REMOTE_BEST (pe0→pe1),16384,1,82.06,,,256.0,256.0,199.6587862539605,77.99171338045332,77.99171338045332,1.0,5.03,0.0,0.0,63.0,9.0,4.030000000000001,pe0.pe_dma -> cube0.r0c0 -> cube0.r0c1 -> hbm_ctrl.pe1,
no_congestion,same_cube_worst,SAME_CUBE REMOTE_WORST (pe0→pe7),16384,1,117.50000000000001,,,256.0,256.0,139.4382978723404,54.46808510638297,54.46808510638297,1.0,26.25,0.0,0.0,63.0,9.0,18.250000000000014,pe0.pe_dma -> cube0.r0c0 -> cube0.r1c0 -> cube0.r1c1 -> cube0.r1c2 -> cube0.r1c3 -> cube0.r4c3 -> cube0.r4c4 -> cube0.r5c4 -> cube0.r5c5 -> hbm_ctrl.pe7,
no_congestion,remote_cube_best,REMOTE_CUBE REMOTE_BEST (cube0→cube1),16384,1,202.51999999999998,,,128.0,128.0,80.90065178747778,63.20363420896702,63.20363420896702,1.0,6.0,32.510000000000005,0.0,126.0,9.0,28.00999999999999,pe0.pe_dma -> cube0.r0c0 -> ucie-N.conn0 -> cube0.ucie-N -> ucie-N.conn3 -> cube0.r0c5 -> ucie-E.conn0 -> cube0.ucie-E -> cube1.ucie-W -> ucie-W.conn0 -> cube1.r0c0 -> hbm_ctrl.pe0,
no_congestion,remote_cube_worst,REMOTE_CUBE REMOTE_WORST (cube0→cube15.pe7),16384,1,573.1199999999999,,,128.0,128.0,28.587381351200452,22.333891680625353,22.333891680625353,1.0,30.0,219.05999999999995,0.0,126.0,9.0,188.05999999999995,pe0.pe_dma -> cube0.r0c0 -> ucie-N.conn0 -> cube0.ucie-N -> ucie-N.conn3 -> cube0.r0c5 -> ucie-E.conn0 -> cube0.ucie-E -> cube1.ucie-W -> ucie-W.conn0 -> cube1.r0c0 -> ucie-N.conn0 -> cube1.ucie-N -> ucie-N.conn3 -> cube1.r0c5 -> ucie-E.conn0 -> cube1.ucie-E -> cube2.ucie-W -> ucie-W.conn0 -> cube2.r0c0 -> ucie-N.conn0 -> cube2.ucie-N -> ucie-N.conn3 -> cube2.r0c5 -> ucie-E.conn0 -> cube2.ucie-E -> cube3.ucie-W -> ucie-W.conn0 -> cube3.r0c0 -> ucie-N.conn0 -> cube3.ucie-N -> ucie-N.conn3 -> cube3.r0c5 -> ucie-E.conn0 -> cube3.ucie-E -> ucie-E.conn3 -> cube3.r5c5 -> ucie-S.conn3 -> cube3.ucie-S -> cube7.ucie-N -> ucie-N.conn3 -> cube7.r0c5 -> ucie-E.conn0 -> cube7.ucie-E -> ucie-E.conn3 -> cube7.r5c5 -> ucie-S.conn3 -> cube7.ucie-S -> cube11.ucie-N -> ucie-N.conn3 -> cube11.r0c5 -> ucie-E.conn0 -> cube11.ucie-E -> ucie-E.conn3 -> cube11.r5c5 -> ucie-S.conn3 -> cube11.ucie-S -> cube15.ucie-N -> ucie-N.conn3 -> cube15.r0c5 -> ucie-E.conn0 -> cube15.ucie-E -> ucie-E.conn3 -> cube15.r5c5 -> hbm_ctrl.pe7,
no_congestion,remote_sip,REMOTE_SIP SAME_CUBE_SAME_PE (sip0→sip1),16384,1,408.5216666666663,,,128.0,128.0,40.10558395515541,31.332487464965165,31.332487464965165,1.0,4.0,37.040000000000006,22.09666666666667,126.0,9.0,209.38499999999962,pe0.pe_dma -> cube0.r0c0 -> ucie-N.conn0 -> cube0.ucie-N -> io0.ucie-P0 -> ucie-P0.conn0 -> io0.noc -> io0.pcie_ep -> fabric.switch0 -> io0.pcie_ep -> io0.noc -> ucie-P0.conn0 -> io0.ucie-P0 -> cube0.ucie-N -> ucie-N.conn0 -> cube0.r0c0 -> hbm_ctrl.pe0,
congestion,ctrl_hot_1,cube0 1×PE → pe0_slice,16384,1,,82.06,82.06,256.0,256.0,199.6587862539605,77.99171338045332,77.99171338045332,1.0,5.03,0.0,0.0,63.0,9.0,4.030000000000001,,pe1.pe_dma -> cube0.r0c1 -> cube0.r0c0 -> hbm_ctrl.pe0
congestion,ctrl_hot_2,cube0 2×PE → pe0_slice,16384,2,,158.3450000000001,134.2400000000001,256.0,256.0,206.94054122327813,80.83614891534302,80.83614891534302,1.0,5.03,0.0,0.0,63.0,9.0,80.31500000000011,,pe1.pe_dma -> cube0.r0c1 -> cube0.r0c0 -> hbm_ctrl.pe0
congestion,ctrl_hot_3,cube0 3×PE → pe0_slice,16384,3,,230.0750000000001,139.94000000000008,256.0,256.0,213.6346843420623,83.45104857111808,83.45104857111808,1.0,5.03,0.0,0.0,63.0,9.0,152.0450000000001,,pe1.pe_dma -> cube0.r0c1 -> cube0.r0c0 -> hbm_ctrl.pe0
congestion,ucie_eastbound,cube0 8×PE corresp. → cube1,16384,8,,962.52,438.52,128.0,159.99999999999997,136.17587167019906,106.387399742343,85.10991979387443,1.0,6.0,32.510000000000005,0.0,126.0,9.0,788.01,,pe0.pe_dma -> cube0.r0c0 -> ucie-N.conn0 -> cube0.ucie-N -> ucie-N.conn3 -> cube0.r0c5 -> ucie-E.conn0 -> cube0.ucie-E -> cube1.ucie-W -> ucie-W.conn0 -> cube1.r0c0 -> hbm_ctrl.pe0
congestion,all_pe_cube0_to_pe0,cube0 8×PE → pe0_slice,16384,8,,558.2499999999998,195.0,256.0,256.0,234.7908643081058,91.71518137035383,91.71518137035383,1.0,2.0,0.0,0.0,63.0,9.0,483.2499999999998,,pe0.pe_dma -> cube0.r0c0 -> hbm_ctrl.pe0
congestion,sip_local_all,sip0 128×PE → own slice,16384,128,,77.0,77.0,256.0,32768.0,27235.74025974026,10638.961038961039,83.11688311688312,1.0,2.0,0.0,0.0,63.0,9.0,2.0,,pe0.pe_dma -> cube0.r0c0 -> hbm_ctrl.pe0
congestion,sip_hotspot_pe0,sip0 128×PE → cube0.pe0_slice,16384,128,,15618.595000000001,204.0,256.0,143.9999999999998,134.2727690935068,52.4503004271511,93.24497853715764,1.0,2.0,0.0,0.0,63.0,9.0,15543.595000000001,,pe0.pe_dma -> cube0.r0c0 -> hbm_ctrl.pe0
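
One sanity check that the rows above appear to satisfy: the seven latency-breakdown columns add up to the row's end-to-end time (``total_ns`` or ``makespan_ns``, whichever is populated). The sketch below is an assumption-laden illustration, not the repository's own verification code; the helper names are hypothetical, and the sample row uses the remote_sip values from the table.

```python
import math

BREAKDOWN_COLS = ["pe_setup", "noc_mesh", "ucie", "fabric",
                  "wire_transfer", "hbm_ctrl", "contention"]


def breakdown_sum(row):
    """Sum of the stacked latency components of one CSV row (dict of strings)."""
    return sum(float(row[c]) for c in BREAKDOWN_COLS)


def row_is_consistent(row, tol=1e-6):
    """Compare the component sum with total_ns, falling back to makespan_ns."""
    reference = row.get("total_ns") or row.get("makespan_ns")
    return math.isclose(breakdown_sum(row), float(reference), abs_tol=tol)


# remote_sip row, as csv.DictReader would yield it (strings, empty if unset).
remote_sip = {
    "total_ns": "408.5216666666663", "makespan_ns": "",
    "pe_setup": "1.0", "noc_mesh": "4.0", "ucie": "37.040000000000006",
    "fabric": "22.09666666666667", "wire_transfer": "126.0",
    "hbm_ctrl": "9.0", "contention": "209.38499999999962",
}
ok = row_is_consistent(remote_sip)
```

With ``csv.DictReader`` from the standard library, the same check can be applied to every row of summary.csv.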