Replaces the latency-breakdown stacked bars with a single utilization
bar per scenario. Each bar shows ``effective_bw / peak_bottleneck_bw``
with both values annotated, and a horizontal "single-path peak" line at
100 %. The colour band (green ≥70 %, amber ≥40 %, red <40 %) makes the
no-congestion distance roll-off scannable at a glance.
Definitions:
effective_bw = (total bytes transferred) / wall-clock time
no_congestion: nbytes / total_ns
congestion: n_issuers × nbytes / makespan_ns (aggregate)
peak_bw = min(edge.bw_gbs) on first issuer's path
util_pct = effective_bw / peak_bw × 100
The congestion graph shows that 8×PE eastbound exceeds 100 % of a
single-path peak (106.4 %): UCIe-N's 4 connections × 128 GB/s give
512 GB/s of aggregate eastbound capacity, so concurrent issuers across
disjoint conns sum past any single conn's 128 GB/s. The 8×PE→pe0_slice
hotspot reaches 91.7 %, almost saturating the shared r0c0→hbm_ctrl.pe0
bottleneck — the simulator's address-based PC striping + per-flit
arbitration model amortises the cost cleanly.
Self-verification updated to BW invariants:
(1) effective BW shrinks as topological distance grows
(2) util_pct ∈ (0, 250 %]
(3) single-issuer util_pct ≤ 100 %
(4) effective_bw = nbytes / total_ns for single requests
(5) congestion aggregate BW grows monotonically with issuer count
on the hot-target series
(6) 8-PE all-hit-pe0 saturates ≥ 70 % of shared peak
All checks PASS at the current model.
The CSV retains all breakdown components (pe_setup, noc_mesh, ucie,
fabric, streaming, hbm_ctrl, contention) so a future replot can still
recover the latency-breakdown view without re-running the simulator.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/plot_pe_dma_perf.py runs the simulator across six
no-congestion scenarios (SAME_CUBE_PE_LOCAL / REMOTE_BEST /
REMOTE_WORST, REMOTE_CUBE_BEST / REMOTE_WORST, REMOTE_SIP) and
five congestion scenarios (1/2/3 PE hot-target, 8-PE corresp.
cube-to-cube, 8-PE all-hit-pe0). It categorises actual total /
makespan into pe_setup, noc_mesh, ucie, fabric, streaming,
hbm_ctrl, and a contention residual using a wormhole-pipelined
model (first-flit arrival + (n_flits-1)/bottleneck + final
chunk_time).
Outputs:
docs/diagrams/pe_dma_perf/no_congestion.png — single-PE latency
by topological distance. Visualises monotonic growth from
SAME_CUBE_PE_LOCAL (77 ns) up to REMOTE_CUBE_PE_REMOTE_WORST
(573 ns) and REMOTE_SIP (409 ns).
docs/diagrams/pe_dma_perf/congestion.png — makespan as concurrent
issuer count grows. ctrl_hot_{1,2,3}=82/158/230 ns; 8-PE
eastbound UCIe = 963 ns; 8-PE all-hit-pe0 = 558 ns.
docs/diagrams/pe_dma_perf/summary.csv — raw rows for re-plotting.
Built-in --verify harness asserts:
(1) distance monotonicity for no-congestion;
(2) same-cube paths contain zero UCIe budget;
(3) remote-cube/SIP paths carry positive UCIe budget;
(4) breakdown is internally consistent (formula ≤ actual);
(5) streaming term matches (n_flits-1) × flit_bytes /
bottleneck_bw within 5 % for the local scenario;
(6) congestion makespan is monotonic in issuer count;
(7) 8-PE hotspot strictly exceeds 3-PE hotspot.
Cross-SIP gets a looser 70 % contention slack because the path
crosses two non-flit-aware (pcie_ep) boundaries that force
store-and-forward re-streaming the simple formula does not
attribute. Single-cube scenarios stay under 25 % residual.
All checks PASS at the current model (post ADR-0019 D1/D4
per-PE HBM CTRL restoration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>