kernbench2

Author	SHA1	Message	Date
ywkang	a76487ca48	PE_DMA perf: SIP-wide scenarios + dual outputs + clearer naming User asked to surface system-wide congestion (more accurate than single-cube), bring back the latency-breakdown plot under a separate filename, and rename the obscure ``streaming`` category. Scenarios: Renamed all_pe_to_pe0 → all_pe_cube0_to_pe0 (clarify cube scope). Added two SIP-wide scenarios: sip_local_all — every PE in sip0 (128 total) accesses its own local slice. All paths disjoint (each PE owns its own hbm_ctrl.peX), so the model should scale linearly with cube count. sip_hotspot_pe0 — every PE in sip0 (128 total) targets sip0.cube0.pe0_slice. Worst-case hotspot: UCIe inbound + r0c0→hbm_ctrl.pe0 saturated. Each bar now carries an ``N=...`` annotation showing the issuer count, and the chart titles say the scope explicitly. Effective BW + util at 16 KB: sip_local_all N=128 eff= 27.2 TB/s util_a= 83 % sip_hotspot_pe0 N=128 eff= 134 GB/s util_a= 93 % (UCIe-into-cube0 saturated) Plots: no_congestion.png + congestion.png — Effective BW utilization (two bars: single vs aggregate peak) breakdown_no_congestion.png + breakdown_congestion.png — stacked latency breakdown (renamed from previous) summary.csv with columns for both views. The visual y-cap on BW utilization is 150 %. Bars exceeding it (e.g. sip_local_all's util_single = 10,639 %) are drawn at the cap with an upward arrow and the real value annotated. The verification rule for ``util_single`` is loosened to ``≤ n_issuers × 100 % + 5 %`` so massively-parallel disjoint scenarios pass. Category renamed: ``streaming`` → ``wire_transfer``. It is the bulk-transfer time = (n_flits − 1) × flit_bytes / bottleneck_bw — the cost of streaming the rest of the payload through the slowest wire after the first flit has arrived. All checks PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 09:43:09 -07:00
ywkang	a143925a12	PE_DMA perf: dual-peak utilisation (single-path + aggregate) Each scenario now shows TWO bars: util_single = effective_bw / single-path peak × 100 (peak = min bw_gbs on first issuer's path) util_aggregate = effective_bw / aggregate-resource peak × 100 (peak = max-min fair share across concurrent paths) Aggregate peak uses a max-min fair-share computation: each concurrent path's sustainable share on an edge is bw_gbs / usage_count, the per-path throughput is the min share along its edges, and the aggregate peak is the sum across paths. This produces the correct answer for both shared-bottleneck scenarios (N paths converge on one wire → aggregate = wire BW) and multi-lane shared resources (UCIe's 4 connections used in parallel → aggregate ≈ 4 × per-conn BW), without enumerating max-flow. Single-issuer (no_congestion) → util_single == util_aggregate by definition. Congestion exposes the divergence: ctrl_hot_{1,2,3}, all_pe_to_pe0 → both metrics agree (one shared bottleneck: r0c0→hbm_ctrl.pe0 @ 256 GB/s) 8×PE eastbound → util_single=106 % (single conn @ 128 GB/s) but util_aggregate=85 % (UCIe-W.conn0 @ 7-way shared, aggregate peak ≈ 160 GB/s under the current cross-cube routing that funnels via cube1.r0c0). Verification updated to assert: (2) util_aggregate ≤ 100 % (effective BW can't exceed the aggregate resource peak, by construction). (3) single-issuer util_single == util_aggregate. (7) ucie_eastbound: util_aggregate is meaningfully smaller than util_single (the multi-lane peak correction is observable). CSV grows with peak_aggregate_bw_gbs and util_aggregate_pct columns; breakdown columns retained. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 08:53:00 -07:00
ywkang	0bf220fed0	Switch PE_DMA perf plots to Effective BW utilization Replaces the latency-breakdown stacked bars with a single utilization bar per scenario. Each bar shows ``effective_bw / peak_bottleneck_bw`` with both values annotated, and a horizontal "single-path peak" line at 100 %. The colour band (green ≥70 %, amber ≥40 %, red <40 %) makes the no-congestion distance roll-off scannable at a glance. Definitions: effective_bw = (total bytes transferred) / wall-clock time no_congestion: nbytes / total_ns congestion: n_issuers × nbytes / makespan_ns (aggregate) peak_bw = min(edge.bw_gbs) on first issuer's path util_pct = effective_bw / peak_bw × 100 The congestion graph shows that 8×PE eastbound exceeds 100 % of a single-path peak (106.4 %): UCIe-N's 4 connections × 128 GB/s give 512 GB/s of aggregate eastbound capacity, so concurrent issuers across disjoint conns sum past any single conn's 128 GB/s. The 8×PE→pe0_slice hotspot reaches 91.7 %, almost saturating the shared r0c0→hbm_ctrl.pe0 bottleneck — the simulator's address-based PC striping + per-flit arbitration model amortises the cost cleanly. Self-verification updated to BW invariants: (1) effective BW shrinks as topological distance grows (2) util_pct ∈ (0, 250 %] (3) single-issuer util_pct ≤ 100 % (4) effective_bw = nbytes / total_ns for single requests (5) congestion aggregate BW grows monotonically with issuer count on the hot-target series (6) 8-PE all-hit-pe0 saturates ≥ 70 % of shared peak All checks PASS at the current model. The CSV retains all breakdown components (pe_setup, noc_mesh, ucie, fabric, streaming, hbm_ctrl, contention) so a future replot can still recover the latency-breakdown view without re-running the simulator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 07:59:45 -07:00
ywkang	a759d58007	Add PE_DMA latency-breakdown plots + self-verification harness scripts/plot_pe_dma_perf.py runs the simulator across six no-congestion scenarios (SAME_CUBE_PE_LOCAL / REMOTE_BEST / REMOTE_WORST, REMOTE_CUBE_BEST / REMOTE_WORST, REMOTE_SIP) and five congestion scenarios (1/2/3 PE hot-target, 8-PE corresp. cube-to-cube, 8-PE all-hit-pe0). It categorises actual total / makespan into pe_setup, noc_mesh, ucie, fabric, streaming, hbm_ctrl, and a contention residual using a wormhole-pipelined model (first-flit arrival + (n_flits-1)/bottleneck + final chunk_time). Outputs: docs/diagrams/pe_dma_perf/no_congestion.png — single-PE latency by topological distance. Visualises monotonic growth from SAME_CUBE_PE_LOCAL (77 ns) up to REMOTE_CUBE_PE_REMOTE_WORST (573 ns) and REMOTE_SIP (409 ns). docs/diagrams/pe_dma_perf/congestion.png — makespan as concurrent issuer count grows. ctrl_hot_{1,2,3}=82/158/230 ns; 8-PE eastbound UCIe = 963 ns; 8-PE all-hit-pe0 = 558 ns. docs/diagrams/pe_dma_perf/summary.csv — raw rows for re-plotting. Built-in --verify harness asserts: (1) distance monotonicity for no-congestion; (2) same-cube paths contain zero UCIe budget; (3) remote-cube/SIP paths carry positive UCIe budget; (4) breakdown is internally consistent (formula ≤ actual); (5) streaming term matches (n_flits-1) × flit_bytes / bottleneck_bw within 5 % for the local scenario; (6) congestion makespan is monotonic in issuer count; (7) 8-PE hotspot strictly exceeds 3-PE hotspot. Cross-SIP gets a looser 70 % contention slack because the path crosses two non-flit-aware (pcie_ep) boundaries that force store-and-forward re-streaming the simple formula does not attribute. Single-cube scenarios stay under 25 % residual. All checks PASS at the current model (post ADR-0019 D1/D4 per-PE HBM CTRL restoration). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 01:23:42 -07:00
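
The aggregate-peak rule described in commit a143925a12 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in scripts/plot_pe_dma_perf.py: the function names, the edge names, and the 1024 GB/s injection bandwidth are hypothetical. Only the rule itself (per-edge share = bw_gbs / usage_count, per-path throughput = min share along the path, aggregate = sum across paths) and the 256 GB/s and 128 GB/s figures come from the commit messages.

```python
from collections import Counter


def aggregate_peak_gbs(paths, edge_bw_gbs):
    """Fair-share aggregate peak as described in the commit message.

    paths       : list of paths, each a list of edge names (hypothetical names)
    edge_bw_gbs : dict mapping edge name -> bandwidth in GB/s
    """
    # Count how many concurrent paths use each edge (once per path).
    usage = Counter(edge for path in paths for edge in set(path))
    # Each path sustains the smallest per-edge share along its route.
    per_path = [min(edge_bw_gbs[e] / usage[e] for e in path) for path in paths]
    return sum(per_path)


def single_path_peak_gbs(paths, edge_bw_gbs):
    """Single-path peak: min bandwidth on the first issuer's path."""
    return min(edge_bw_gbs[e] for e in paths[0])


# Case 1: eight issuers converge on one 256 GB/s wire.
# The private injection edges and their 1024 GB/s rate are assumptions.
hot_bw = {"shared": 256.0}
hot_paths = []
for i in range(8):
    inject = f"pe{i}.inject"
    hot_bw[inject] = 1024.0
    hot_paths.append([inject, "shared"])

# Case 2: four issuers on four disjoint 128 GB/s connections.
lane_bw = {f"conn{i}": 128.0 for i in range(4)}
lane_paths = [[f"conn{i}"] for i in range(4)]

hot_peak = aggregate_peak_gbs(hot_paths, hot_bw)     # shared wire bandwidth
lane_peak = aggregate_peak_gbs(lane_paths, lane_bw)  # 4 x per-connection bandwidth
```

With these inputs the shared-bottleneck case collapses to the wire bandwidth and the disjoint-lane case sums to four times the per-connection bandwidth, which is the behaviour the commit message claims for the two scenario classes.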

4 Commits
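
The metric definitions quoted in the commit messages above (wire_transfer time, effective bandwidth, utilisation, and the loosened util_single bound) can be written out as a short sketch. The function names and the 64-byte flit size are assumptions made for illustration; the formulas, the 16 KB payload, the 256 GB/s bottleneck, and the sip_local_all figures are taken from the log. The sketch treats 1 GB/s as 1 byte/ns, so bytes divided by GB/s gives nanoseconds.

```python
def wire_transfer_ns(nbytes, flit_bytes, bottleneck_bw_gbs):
    """Bulk-transfer time: (n_flits - 1) * flit_bytes / bottleneck_bw."""
    n_flits = -(-nbytes // flit_bytes)  # ceiling division
    return (n_flits - 1) * flit_bytes / bottleneck_bw_gbs


def effective_bw_gbs(nbytes, n_issuers, makespan_ns):
    """Total bytes transferred divided by wall-clock time (bytes/ns == GB/s)."""
    return n_issuers * nbytes / makespan_ns


def util_pct(effective_bw, peak_bw):
    """Utilisation of a given peak, in percent."""
    return effective_bw / peak_bw * 100.0


def util_single_ok(util_single_pct, n_issuers):
    """Loosened rule from commit a76487ca48: util_single <= N * 100 % + 5 %."""
    return util_single_pct <= n_issuers * 100.0 + 5.0


# 16 KB payload through a 256 GB/s bottleneck; flit_bytes=64 is an assumption.
t_wire = wire_transfer_ns(16 * 1024, 64, 256.0)

# sip_local_all from the log: util_single = 10,639 % with N = 128 issuers.
passes = util_single_ok(10639.0, 128)
```

Under the 64-byte flit assumption, a 16 KB transfer has 256 flits, so the wire_transfer term covers the 255 flits that follow the first one.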