ywkang/kernbench2

Commit Graph

Author	SHA1	Message	Date

ywkang

a143925a12

PE_DMA perf: dual-peak utilisation (single-path + aggregate)

Each scenario now shows TWO bars:

  util_single    = effective_bw / single-path peak × 100
                   (peak = min bw_gbs on first issuer's path)
  util_aggregate = effective_bw / aggregate-resource peak × 100
                   (peak = max-min fair share across concurrent paths)

Aggregate peak uses a max-min fair-share computation: each concurrent
path's sustainable share on an edge is bw_gbs / usage_count, the
per-path throughput is the min share along its edges, and the aggregate
peak is the sum across paths. This produces the correct answer for both
shared-bottleneck scenarios (N paths converge on one wire → aggregate =
wire BW) and multi-lane shared resources (UCIe's 4 connections used in
parallel → aggregate ≈ 4 × per-conn BW), without solving a max-flow
problem.
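
The fair-share rule above can be sketched as follows. This is a minimal
illustration, not the code in scripts/plot_pe_dma_perf.py: the function
name, the edge names, and the path/bandwidth data layout are all
hypothetical.

```python
from collections import Counter


def aggregate_peak_gbs(paths, bw_gbs):
    """Equal-share aggregate peak as described in the commit message.

    paths  : list of paths, each a list of edge names (hypothetical layout)
    bw_gbs : dict mapping edge name -> bandwidth in GB/s
    """
    # usage_count: how many concurrent paths traverse each edge
    usage = Counter(edge for path in paths for edge in set(path))
    # per-path throughput is the smallest share along the path's edges
    per_path = [min(bw_gbs[e] / usage[e] for e in path) for path in paths]
    # aggregate peak is the sum over all concurrent paths
    return sum(per_path)


# Shared bottleneck: three paths converge on one 256 GB/s wire.
shared = aggregate_peak_gbs(
    [["a0", "wire"], ["a1", "wire"], ["a2", "wire"]],
    {"a0": 512, "a1": 512, "a2": 512, "wire": 256},
)

# Multi-lane resource: four paths on four disjoint 128 GB/s connections.
lanes = aggregate_peak_gbs(
    [["conn0"], ["conn1"], ["conn2"], ["conn3"]],
    {"conn0": 128, "conn1": 128, "conn2": 128, "conn3": 128},
)
```

In the first case the three shares sum back to the wire bandwidth; in
the second the aggregate is four times the per-connection bandwidth.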

Single-issuer (no_congestion) → util_single == util_aggregate by
definition. Congestion exposes the divergence:
  ctrl_hot_{1,2,3}, all_pe_to_pe0 → both metrics agree (one shared
                    bottleneck: r0c0→hbm_ctrl.pe0 @ 256 GB/s)
  8×PE eastbound → util_single=106 % (single conn @ 128 GB/s) but
                    util_aggregate=85 % (UCIe-W.conn0 @ 7-way shared,
                    aggregate peak ≈ 160 GB/s under the current
                    cross-cube routing that funnels via cube1.r0c0).

Verification updated to assert:
  (2) util_aggregate ≤ 100 % (effective BW can't exceed the aggregate
      resource peak, by construction).
  (3) single-issuer util_single == util_aggregate.
  (7) ucie_eastbound: util_aggregate is meaningfully smaller than
      util_single (the multi-lane peak correction is observable).

CSV gains peak_aggregate_bw_gbs and util_aggregate_pct columns; the
breakdown columns are retained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-15 08:53:00 -07:00

ywkang

0bf220fed0

Switch PE_DMA perf plots to Effective BW utilization

Replaces the latency-breakdown stacked bars with a single utilization
bar per scenario. Each bar shows ``effective_bw / peak_bottleneck_bw``
with both values annotated, and a horizontal "single-path peak" line at
100 %. The colour band (green ≥70 %, amber ≥40 %, red <40 %) makes the
no-congestion distance roll-off scannable at a glance.
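
The colour banding could be expressed as a small threshold function. A
sketch only: the function name is hypothetical and the returned strings
are band labels, not the actual plot colour values, which the message
does not give.

```python
def util_colour(util_pct):
    """Map a utilisation percentage to the band named in the commit message."""
    if util_pct >= 70:
        return "green"
    if util_pct >= 40:
        return "amber"
    return "red"
```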

Definitions:
  effective_bw = (total bytes transferred) / wall-clock time
    no_congestion: nbytes / total_ns
    congestion:    n_issuers × nbytes / makespan_ns  (aggregate)
  peak_bw      = min(edge.bw_gbs) on first issuer's path
  util_pct     = effective_bw / peak_bw × 100
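
A minimal sketch of these definitions, with hypothetical function names.
The transfer size and timings in the example are illustrative values
chosen for round numbers, not simulator output.

```python
def effective_bw_gbs(nbytes, elapsed_ns, n_issuers=1):
    """Bytes moved per nanosecond, which is numerically equal to GB/s."""
    return n_issuers * nbytes / elapsed_ns


def util_pct(effective_bw, peak_bw):
    """Utilisation of the chosen peak, in percent."""
    return effective_bw / peak_bw * 100.0


# Illustrative numbers only: 8 issuers, 16 KiB each, 1024 ns makespan,
# measured against a 128 GB/s single-path peak.
bw = effective_bw_gbs(nbytes=16384, elapsed_ns=1024, n_issuers=8)
util = util_pct(bw, peak_bw=128.0)
```

With these inputs the aggregate effective BW is 128 GB/s, i.e. 100 % of
the single-path peak; a shorter makespan would push the figure past
100 %.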

The congestion graph shows that 8×PE eastbound exceeds 100 % of a
single-path peak (106.4 %): UCIe-N's 4 connections × 128 GB/s give
512 GB/s of aggregate eastbound capacity, so concurrent issuers across
disjoint conns sum past any single conn's 128 GB/s. The 8×PE→pe0_slice
hotspot reaches 91.7 %, almost saturating the shared r0c0→hbm_ctrl.pe0
bottleneck — the simulator's address-based PC striping + per-flit
arbitration model amortises the cost cleanly.

Self-verification updated to BW invariants:
  (1) effective BW shrinks as topological distance grows
  (2) util_pct ∈ (0, 250 %]
  (3) single-issuer util_pct ≤ 100 %
  (4) effective_bw = nbytes / total_ns for single requests
  (5) congestion aggregate BW grows monotonically with issuer count
      on the hot-target series
  (6) 8-PE all-hit-pe0 saturates ≥ 70 % of shared peak
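
The invariants above might be checked along these lines. This is a
sketch under assumptions: the function name, argument layout, and
failure labels are hypothetical, and invariant (4) is omitted because it
only restates the definition of effective_bw.

```python
def check_bw_invariants(single_rows, hot_series_bw, all_hit_util_pct):
    """Return labels of failed invariants (an empty list means PASS).

    single_rows      : (effective_bw_gbs, util_pct) per no-congestion
                       scenario, ordered by increasing distance (assumed)
    hot_series_bw    : aggregate effective BW for 1, 2, 3, ... issuers
    all_hit_util_pct : utilisation of the 8-PE all-hit-pe0 scenario
    """
    failures = []
    bws = [bw for bw, _ in single_rows]
    utils = [u for _, u in single_rows]
    if any(later > earlier for earlier, later in zip(bws, bws[1:])):
        failures.append("(1) distance roll-off")
    if not all(0 < u <= 250 for u in utils + [all_hit_util_pct]):
        failures.append("(2) util range")
    if any(u > 100 for u in utils):
        failures.append("(3) single-issuer <= 100 %")
    if any(b < a for a, b in zip(hot_series_bw, hot_series_bw[1:])):
        failures.append("(5) hot-target monotonicity")
    if all_hit_util_pct < 70:
        failures.append("(6) hotspot saturation")
    return failures
```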

All checks PASS at the current model.

The CSV retains all breakdown components (pe_setup, noc_mesh, ucie,
fabric, streaming, hbm_ctrl, contention) so a future replot can still
recover the latency-breakdown view without re-running the simulator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-15 07:59:45 -07:00

ywkang

a759d58007

Add PE_DMA latency-breakdown plots + self-verification harness

scripts/plot_pe_dma_perf.py runs the simulator across six
no-congestion scenarios (SAME_CUBE_PE_LOCAL / REMOTE_BEST /
REMOTE_WORST, REMOTE_CUBE_BEST / REMOTE_WORST, REMOTE_SIP) and
five congestion scenarios (1/2/3 PE hot-target, 8-PE corresponding
cube-to-cube, 8-PE all-hit-pe0). It categorises actual total /
makespan into pe_setup, noc_mesh, ucie, fabric, streaming,
hbm_ctrl, and a contention residual using a wormhole-pipelined
model (first-flit arrival + (n_flits-1) × flit_bytes /
bottleneck_bw + final chunk_time).
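
A rough sketch of the wormhole-pipelined estimate and the contention
residual. The function names and parameters are hypothetical;
first_flit_ns and chunk_time_ns are taken as inputs because the message
does not say how the script derives them, and the example values are
illustrative.

```python
import math


def pipelined_latency_ns(nbytes, flit_bytes, first_flit_ns,
                         bottleneck_bw_gbs, chunk_time_ns):
    """Return (formula_total_ns, streaming_ns) for one transfer."""
    n_flits = math.ceil(nbytes / flit_bytes)
    # Remaining flits stream behind the first at the bottleneck rate;
    # bytes / (GB/s) gives nanoseconds.
    streaming_ns = (n_flits - 1) * flit_bytes / bottleneck_bw_gbs
    return first_flit_ns + streaming_ns + chunk_time_ns, streaming_ns


def contention_residual_ns(actual_ns, formula_ns):
    """Whatever the simulator measured beyond the formula's estimate."""
    return actual_ns - formula_ns


# Illustrative inputs: 4 KiB in 64-byte flits over a 128 GB/s bottleneck.
total, streaming = pipelined_latency_ns(
    nbytes=4096, flit_bytes=64, first_flit_ns=20.0,
    bottleneck_bw_gbs=128.0, chunk_time_ns=0.5,
)
residual = contention_residual_ns(actual_ns=60.0, formula_ns=total)
```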

Outputs:
  docs/diagrams/pe_dma_perf/no_congestion.png — single-PE latency
    by topological distance. Visualises latency growth from
    SAME_CUBE_PE_LOCAL (77 ns) up to REMOTE_CUBE_PE_REMOTE_WORST
    (573 ns), with REMOTE_SIP at 409 ns.
  docs/diagrams/pe_dma_perf/congestion.png — makespan as concurrent
    issuer count grows. ctrl_hot_{1,2,3}=82/158/230 ns; 8-PE
    eastbound UCIe = 963 ns; 8-PE all-hit-pe0 = 558 ns.
  docs/diagrams/pe_dma_perf/summary.csv — raw rows for re-plotting.

Built-in --verify harness asserts:
  (1) distance monotonicity for no-congestion;
  (2) same-cube paths contain zero UCIe budget;
  (3) remote-cube/SIP paths carry positive UCIe budget;
  (4) breakdown is internally consistent (formula ≤ actual);
  (5) streaming term matches (n_flits-1) × flit_bytes /
      bottleneck_bw within 5 % for the local scenario;
  (6) congestion makespan is monotonic in issuer count;
  (7) 8-PE hotspot strictly exceeds 3-PE hotspot.
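
Checks (6) and (7) can be illustrated with the makespans quoted in the
Outputs list. The helper and variable names are hypothetical, and
non-decreasing is an assumed reading of "monotonic"; the real --verify
harness may structure this differently.

```python
def is_non_decreasing(values):
    """True when each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Makespans from the Outputs list above, in ns.
ctrl_hot_makespan_ns = [82, 158, 230]   # 1, 2, 3 concurrent issuers
all_hit_pe0_makespan_ns = 558           # 8-PE all-hit-pe0

check_6 = is_non_decreasing(ctrl_hot_makespan_ns)
check_7 = all_hit_pe0_makespan_ns > ctrl_hot_makespan_ns[-1]
```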

Cross-SIP gets a looser 70 % contention slack because the path
crosses two non-flit-aware (pcie_ep) boundaries that force
store-and-forward re-streaming, which the simple formula does not
account for. Single-cube scenarios stay under 25 % residual.
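
The slack rule might look like this. It assumes the residual is measured
as a fraction of the actual latency, which the message does not state;
the names and the two scenario kinds are hypothetical, and no slack is
given here for other scenario classes.

```python
# Allowed contention residual as a fraction of actual latency (assumed basis).
SLACK = {"single_cube": 0.25, "cross_sip": 0.70}


def contention_fraction(actual_ns, formula_ns):
    """Share of the actual latency that the formula leaves unexplained."""
    return (actual_ns - formula_ns) / actual_ns


def within_slack(actual_ns, formula_ns, kind):
    """Formula must not exceed actual, and the residual must fit the slack."""
    frac = contention_fraction(actual_ns, formula_ns)
    return 0.0 <= frac <= SLACK[kind]
```

The lower bound of zero corresponds to check (4), formula ≤ actual.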

All checks PASS at the current model (post ADR-0019 D1/D4
per-PE HBM CTRL restoration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-15 01:23:42 -07:00

3 Commits