kernbench2

Author	SHA1	Message	Date
mukesh	cc1bbd0ab7	eval: fold GEMM/allreduce harnesses into self-contained milestone benches Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/ into two self-contained eval benches so a user can regenerate every result + figure with one command: kernbench run --bench milestone-1h-gemm (MILESTONE_FAST=1 reuses JSON) kernbench run --bench milestone-1h-ccl - benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the run(torch) entry drives the sweeps and writes figures into benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a sentinel tensor to satisfy the run_bench contract. - tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin re-export/wrapper shims over the benches (single source preserved); the pytest-only param builders + _run_distributed wrapper stay in the shim. - eval-bench pattern: a bench may drive many configs + build its own per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2). ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI Semantics amended; ADR INDEX regenerated. Verified: milestone benches run clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 15:19:52 -07:00
mukesh	b610cb0d9a	sccl: drive allreduce tests via torch.distributed; reorganize into tests/sccl/ Convert the multidevice allreduce correctness + latency/buffer-kind sweeps to run through the real PyTorch-distributed path (init_process_group(backend="ahbm") -> mp.spawn -> dist.all_reduce) instead of direct ctx.launch, and reorganize the CCL/allreduce tests into a tests/sccl/ package split one test per file. Production change (required for the distributed path on non-square SIP grids): - AhbmCCLBackend now reads explicit system.sips.w/h from the spec, with a square-only sqrt fallback that raises on ambiguity, instead of silently guessing round(sqrt(count)). This fixes the 2x3 / 3x2 torus + mesh cases, which previously resolved to a wrong 2x2 grid. Mirrors the test helper's _sip_topo_dims precedence (explicit w/h > square fallback > raise). Test reorganization (tests/sccl/): - _allreduce_helpers.py: shared plumbing (distributed driver, config writers, direct-launch run_allreduce parity reference, sweep/buffer-kind constants, plot aggregators, topology-diagram + FSIM-comparison emitters). - test_allreduce_ring_torus_mesh.py: correctness across ring/torus/mesh. - test_distributed_default_topology.py: full distributed path on topology.yaml. - test_plot_latency_sweep.py / test_plot_buffer_kind_sweep.py: sweep rows. - test_plot_topology_diagram.py / test_plot_comparison_fsim.py: plot emitters. - test_intercube_root_center.py: moved in (ADR-0032 center-root latency guard). Also: - Move the FSIM comparison plot generator out of scripts/ into the sccl suite. - Delete superseded test files (test_allreduce_multidevice, test_distributed_lrab_hierarchical_allreduce, test_allreduce_buffer_kind_sweep) and repoint conftest aggregators + the ipcq buffer-kind importers. - Regenerate the allreduce_latency_plots derived artifacts from the full sweep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 22:24:43 -07:00
mukesh	ff7d727ddd	CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots Rename the intercube all-reduce identity to lrab_hierarchical_allreduce (module, config key, distributed test) so the name reflects both levels it implements: LRAB intra-SIP (local reduce to center root + broadcast) and the hierarchical inter-SIP topology exchange (ring/torus/mesh). ADR-0032 slug kept as the stable decision id; pure rename, no logic change. Also in this batch: - ADR-0032 (EN+KO): document the shipped center-root bidirectional reduce (doc was stale corner-root); annotate ccl.yaml root_cube as a placeholder. - Rename allreduce + pe2pe latency plots to descriptive, title-matching filenames and retitle the in-plot headings; drop overview/overview_log. - Point the PPTX image refs at the new plot names. Doc + derived-artifact + rename only; no simulation behavior changed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 20:50:48 -07:00
ywkang	049e3d8bb3	benches: package as kernbench.benches, add @bench registry + list subcommand Move benches/ -> src/kernbench/benches/ and src/kernbench/cli/probe.py -> src/kernbench/probes/probe.py. Each bench self-registers via @bench(name=..., description=...); kernbench list enumerates benches with auto-assigned indices, --bench accepts kebab-case name or numeric index. Audit at package-import time fails if any non-underscore module forgets the decorator. ADR-0010 (EN + KO) updated to reflect the new resolver path, list subcommand, and probes package separation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 14:42:10 -07:00
ywkang	a76487ca48	PE_DMA perf: SIP-wide scenarios + dual outputs + clearer naming User asked to surface system-wide congestion (more accurate than single-cube), bring back the latency-breakdown plot under a separate filename, and rename the obscure ``streaming`` category. Scenarios: Renamed all_pe_to_pe0 → all_pe_cube0_to_pe0 (clarify cube scope). Added two SIP-wide scenarios: sip_local_all — every PE in sip0 (128 total) accesses its own local slice. All paths disjoint (each PE owns its own hbm_ctrl.peX), so the model should scale linearly with cube count. sip_hotspot_pe0 — every PE in sip0 (128 total) targets sip0.cube0.pe0_slice. Worst-case hotspot: UCIe inbound + r0c0→hbm_ctrl.pe0 saturated. Each bar now carries an ``N=...`` annotation showing the issuer count, and the chart titles say the scope explicitly. Effective BW + util at 16 KB: sip_local_all N=128 eff= 27.2 TB/s util_a= 83 % sip_hotspot_pe0 N=128 eff= 134 GB/s util_a= 93 % (UCIe-into-cube0 saturated) Plots: no_congestion.png + congestion.png — Effective BW utilization (two bars: single vs aggregate peak) breakdown_no_congestion.png + breakdown_congestion.png — stacked latency breakdown (renamed from previous) summary.csv with columns for both views. The visual y-cap on BW utilization is 150 %. Bars exceeding it (e.g. sip_local_all's util_single = 10,639 %) are drawn at the cap with an upward arrow and the real value annotated. The verification rule for ``util_single`` is loosened to ``≤ n_issuers × 100 % + 5 %`` so massively-parallel disjoint scenarios pass. Category renamed: ``streaming`` → ``wire_transfer``. It is the bulk-transfer time = (n_flits − 1) × flit_bytes / bottleneck_bw — the cost of streaming the rest of the payload through the slowest wire after the first flit has arrived. All checks PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 09:43:09 -07:00
ywkang	a143925a12	PE_DMA perf: dual-peak utilisation (single-path + aggregate) Each scenario now shows TWO bars: util_single = effective_bw / single-path peak × 100 (peak = min bw_gbs on first issuer's path) util_aggregate = effective_bw / aggregate-resource peak × 100 (peak = max-min fair share across concurrent paths) Aggregate peak uses a max-min fair-share computation: each concurrent path's sustainable share on an edge is bw_gbs / usage_count, the per-path throughput is the min share along its edges, and the aggregate peak is the sum across paths. This produces the correct answer for both shared-bottleneck scenarios (N paths converge on one wire → aggregate = wire BW) and multi-lane shared resources (UCIe's 4 connections used in parallel → aggregate ≈ 4 × per-conn BW), without enumerating max-flow. Single-issuer (no_congestion) → util_single == util_aggregate by definition. Congestion exposes the divergence: ctrl_hot_{1,2,3}, all_pe_to_pe0 → both metrics agree (one shared bottleneck: r0c0→hbm_ctrl.pe0 @ 256 GB/s) 8×PE eastbound → util_single=106 % (single conn @ 128 GB/s) but util_aggregate=85 % (UCIe-W.conn0 @ 7-way shared, aggregate peak ≈ 160 GB/s under the current cross-cube routing that funnels via cube1.r0c0). Verification updated to assert: (2) util_aggregate ≤ 100 % (effective BW can't exceed the aggregate resource peak, by construction). (3) single-issuer util_single == util_aggregate. (7) ucie_eastbound: util_aggregate is meaningfully smaller than util_single (the multi-lane peak correction is observable). CSV grows with peak_aggregate_bw_gbs and util_aggregate_pct columns; breakdown columns retained. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 08:53:00 -07:00
ywkang	0bf220fed0	Switch PE_DMA perf plots to Effective BW utilization Replaces the latency-breakdown stacked bars with a single utilization bar per scenario. Each bar shows ``effective_bw / peak_bottleneck_bw`` with both values annotated, and a horizontal "single-path peak" line at 100 %. The colour band (green ≥70 %, amber ≥40 %, red <40 %) makes the no-congestion distance roll-off scannable at a glance. Definitions: effective_bw = (total bytes transferred) / wall-clock time no_congestion: nbytes / total_ns congestion: n_issuers × nbytes / makespan_ns (aggregate) peak_bw = min(edge.bw_gbs) on first issuer's path util_pct = effective_bw / peak_bw × 100 The congestion graph shows that 8×PE eastbound exceeds 100 % of a single-path peak (106.4 %): UCIe-N's 4 connections × 128 GB/s give 512 GB/s of aggregate eastbound capacity, so concurrent issuers across disjoint conns sum past any single conn's 128 GB/s. The 8×PE→pe0_slice hotspot reaches 91.7 %, almost saturating the shared r0c0→hbm_ctrl.pe0 bottleneck — the simulator's address-based PC striping + per-flit arbitration model amortises the cost cleanly. Self-verification updated to BW invariants: (1) effective BW shrinks as topological distance grows (2) util_pct ∈ (0, 250 %] (3) single-issuer util_pct ≤ 100 % (4) effective_bw = nbytes / total_ns for single requests (5) congestion aggregate BW grows monotonically with issuer count on the hot-target series (6) 8-PE all-hit-pe0 saturates ≥ 70 % of shared peak All checks PASS at the current model. The CSV retains all breakdown components (pe_setup, noc_mesh, ucie, fabric, streaming, hbm_ctrl, contention) so a future replot can still recover the latency-breakdown view without re-running the simulator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 07:59:45 -07:00
ywkang	a759d58007	Add PE_DMA latency-breakdown plots + self-verification harness scripts/plot_pe_dma_perf.py runs the simulator across six no-congestion scenarios (SAME_CUBE_PE_LOCAL / REMOTE_BEST / REMOTE_WORST, REMOTE_CUBE_BEST / REMOTE_WORST, REMOTE_SIP) and five congestion scenarios (1/2/3 PE hot-target, 8-PE corresp. cube-to-cube, 8-PE all-hit-pe0). It categorises actual total / makespan into pe_setup, noc_mesh, ucie, fabric, streaming, hbm_ctrl, and a contention residual using a wormhole-pipelined model (first-flit arrival + (n_flits-1)/bottleneck + final chunk_time). Outputs: docs/diagrams/pe_dma_perf/no_congestion.png — single-PE latency by topological distance. Visualises monotonic growth from SAME_CUBE_PE_LOCAL (77 ns) up to REMOTE_CUBE_PE_REMOTE_WORST (573 ns) and REMOTE_SIP (409 ns). docs/diagrams/pe_dma_perf/congestion.png — makespan as concurrent issuer count grows. ctrl_hot_{1,2,3}=82/158/230 ns; 8-PE eastbound UCIe = 963 ns; 8-PE all-hit-pe0 = 558 ns. docs/diagrams/pe_dma_perf/summary.csv — raw rows for re-plotting. Built-in --verify harness asserts: (1) distance monotonicity for no-congestion; (2) same-cube paths contain zero UCIe budget; (3) remote-cube/SIP paths carry positive UCIe budget; (4) breakdown is internally consistent (formula ≤ actual); (5) streaming term matches (n_flits-1) × flit_bytes / bottleneck_bw within 5 % for the local scenario; (6) congestion makespan is monotonic in issuer count; (7) 8-PE hotspot strictly exceeds 3-PE hotspot. Cross-SIP gets a looser 70 % contention slack because the path crosses two non-flit-aware (pcie_ep) boundaries that force store-and-forward re-streaming the simple formula does not attribute. Single-cube scenarios stay under 25 % residual. All checks PASS at the current model (post ADR-0019 D1/D4 per-PE HBM CTRL restoration). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 01:23:42 -07:00
mukesh	f6d262e359	Honest measured pipeline efficiency: two timing fixes Two related issues caused measured pipeline efficiency to look worse than the simulator's actual behavior: 1. DMA timing recorded too early. The op-log start timestamp for a DMA op fired when the request entered the queue, and the DMA channel was released as soon as the request was issued. Back-to-back DMAs therefore appeared to grab the channel simultaneously, with per-op duration drifting upward as queue depth grew - an artifact, not real cost. Fix: defer the start timestamp until after the channel is acquired, and hold the channel through the full HBM round-trip until the response returns. Per-op duration is now constant and equal to the actual transfer interval; serialization is visible as queue wait, not as inflated service time. 2. Sweep timing window folded in pre-composite work. The PE timing window spanned every PE engine record, which included the upfront pinned-operand DMA issued before the composite GEMM begins. For large-K shapes that one-shot load can be nearly half of the window, conflating operand-staging cost with composite-pipeline behavior. Fix: add a second window scoped to the composite pipeline by filtering op_log records to those tagged with a tile-pipeline stage; the legacy operand-load path is untagged and naturally excluded. For 32x3072x32 load_ref the window drops from 1765ns to 992ns and measured eff lines up with the steady-state DMA-bound stage limit instead of being penalized for the one-time load. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 14:19:17 -07:00
mukesh	83ea97b05f	Composite GEMM: K-loop accumulator residency, pinned operands, sweep + deck Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 15:00:41 -07:00
mukesh	5accd98171	Add deck builder + overview-with-ref diagram scripts scripts/build_overview_slides.py renders a 5-slide PPTX (kernbench2_overview.pptx) summarizing architecture, model correctness, IPCQ, allreduce, and buffer-kind tier comparison. scripts/emit_overview_with_external_ref.py renders log-y and broken-y variants of the allreduce overview (overview_log.png, overview_broken.png) including a 366 µs ext-sim reference marker at 96 KB / PE. Also includes cube_mesh_view.png rendered from the SVG. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:20:54 -07:00
mukesh	a563169e89	Add tl.recv_no_consume diagnostic API for apples-to-apples pe2pe plot The pe2pe overview compared IPCQ (tl.send + tl.recv) against raw DMA (tl.load + tl.store), but DMA is one-sided — DST never reads — while tl.recv pays a slot-read on DST. The comparison was unfair: IPCQ looked slower partly because it does more work. Adds tl.recv_no_consume() — a separate, diagnostic-only entry point that blocks for slot arrival but skips the slot-read (and bank-hop) charge on DST. Production tl.recv is unchanged (no `consume` kwarg on the public API), so the diagnostic flag can never accidentally leak into real workloads. Updates test_pe_to_pe_latency to call tl.recv_no_consume so the overview.png shows IPCQ no-consume vs raw DMA on equal footing. Also fixes PLOT_DIR back to docs/diagrams/pe2pe_latency_plots/ (was lost in a merge). Adds scripts/replot_pe2pe.py for label-only re-renders without re-measuring. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:20:44 -07:00
ywkang	6f43807900	commit - release 1	2026-03-18 11:47:48 -07:00
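
Note on a143925a12 (aggregate peak): the fair-share rule described in that commit message can be sketched as below. This is a minimal illustration and not the repository's code. The function names, the edge labels, and the 512 GB/s ingress figure are hypothetical; only the rule itself (share on an edge = bw_gbs / usage_count, per-path throughput = min share along the path, aggregate peak = sum over paths) and the 256 GB/s and 128 GB/s figures come from the commit messages.

```python
from collections import Counter


def single_path_peak_gbs(path, edge_bw_gbs):
    """Single-path peak: the slowest edge on one issuer's path."""
    return min(edge_bw_gbs[edge] for edge in path)


def aggregate_peak_gbs(paths, edge_bw_gbs):
    """Aggregate peak following the rule in the commit message.

    paths:        list of paths, each a list of edge names (hypothetical labels)
    edge_bw_gbs:  dict mapping edge name -> bandwidth in GB/s
    """
    # How many concurrent paths cross each edge.
    usage = Counter(edge for path in paths for edge in set(path))
    # Each path gets bw / usage on every edge; its throughput is the
    # smallest such share, and the aggregate peak is the sum over paths.
    return sum(
        min(edge_bw_gbs[edge] / usage[edge] for edge in path)
        for path in paths
    )


# Shared bottleneck: three issuers converge on one 256 GB/s wire.
hot_bw = {"pe0.in": 512.0, "pe1.in": 512.0, "pe2.in": 512.0, "r0c0->hbm_ctrl.pe0": 256.0}
hot_paths = [[f"pe{i}.in", "r0c0->hbm_ctrl.pe0"] for i in range(3)]

# Multi-lane resource: four issuers on four disjoint 128 GB/s connections.
lane_bw = {f"conn{i}": 128.0 for i in range(4)}
lane_paths = [[f"conn{i}"] for i in range(4)]

if __name__ == "__main__":
    print(aggregate_peak_gbs(hot_paths, hot_bw))    # close to the shared wire's 256 GB/s
    print(aggregate_peak_gbs(lane_paths, lane_bw))  # 4 x 128 GB/s
```

In the first case the aggregate equals the shared wire's bandwidth, and in the second it is four times the per-connection bandwidth, which matches the two behaviours the commit message claims for this computation.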

13 Commits
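
Note on b610cb0d9a (SIP grid dimensions): the precedence that commit describes (explicit w/h, then a square-only fallback, then raise) might look roughly like the sketch below. It is an assumption-laden illustration, not the AhbmCCLBackend implementation: the function name, the dict layout with a `count` key, the consistency check between w*h and count, and the handling of a half-specified w/h pair are all hypothetical.

```python
import math


def resolve_sip_grid(sips):
    """Return (w, h) for a SIP grid spec such as {"count": 6, "w": 2, "h": 3}."""
    count = sips["count"]
    w, h = sips.get("w"), sips.get("h")

    # 1. Explicit dimensions win.
    if w is not None and h is not None:
        # Assumption: an explicit grid must account for every SIP.
        if w * h != count:
            raise ValueError(f"w*h = {w * h} does not match count = {count}")
        return w, h

    # Assumption: giving only one of w/h is treated as ambiguous.
    if w is not None or h is not None:
        raise ValueError("specify both w and h, or neither")

    # 2. Square-only fallback: accept perfect squares, never round.
    side = math.isqrt(count)
    if side * side == count:
        return side, side

    # 3. Anything else is ambiguous (6 could be 2x3, 3x2, 1x6, ...).
    raise ValueError(f"cannot infer a grid for count = {count}; set w and h")


if __name__ == "__main__":
    print(resolve_sip_grid({"count": 6, "w": 2, "h": 3}))  # explicit 2x3
    print(resolve_sip_grid({"count": 4}))                  # square fallback
```

With the old `round(sqrt(count))` guess, a count of 6 resolves to a 2x2 grid, which is the failure the commit message reports for the 2x3 and 3x2 cases. Here the same input without explicit dimensions raises instead.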