kernbench2

Author	SHA1	Message	Date
mukesh	b610cb0d9a	sccl: drive allreduce tests via torch.distributed; reorganize into tests/sccl/ Convert the multidevice allreduce correctness + latency/buffer-kind sweeps to run through the real PyTorch-distributed path (init_process_group(backend="ahbm") -> mp.spawn -> dist.all_reduce) instead of direct ctx.launch, and reorganize the CCL/allreduce tests into a tests/sccl/ package split one test per file. Production change (required for the distributed path on non-square SIP grids): - AhbmCCLBackend now reads explicit system.sips.w/h from the spec, with a square-only sqrt fallback that raises on ambiguity, instead of silently guessing round(sqrt(count)). This fixes the 2x3 / 3x2 torus + mesh cases, which previously resolved to a wrong 2x2 grid. Mirrors the test helper's _sip_topo_dims precedence (explicit w/h > square fallback > raise). Test reorganization (tests/sccl/): - _allreduce_helpers.py: shared plumbing (distributed driver, config writers, direct-launch run_allreduce parity reference, sweep/buffer-kind constants, plot aggregators, topology-diagram + FSIM-comparison emitters). - test_allreduce_ring_torus_mesh.py: correctness across ring/torus/mesh. - test_distributed_default_topology.py: full distributed path on topology.yaml. - test_plot_latency_sweep.py / test_plot_buffer_kind_sweep.py: sweep rows. - test_plot_topology_diagram.py / test_plot_comparison_fsim.py: plot emitters. - test_intercube_root_center.py: moved in (ADR-0032 center-root latency guard). Also: - Move the FSIM comparison plot generator out of scripts/ into the sccl suite. - Delete superseded test files (test_allreduce_multidevice, test_distributed_lrab_hierarchical_allreduce, test_allreduce_buffer_kind_sweep) and repoint conftest aggregators + the ipcq buffer-kind importers. - Regenerate the allreduce_latency_plots derived artifacts from the full sweep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 22:24:43 -07:00
mukesh	ff7d727ddd	CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots Rename the intercube all-reduce identity to lrab_hierarchical_allreduce (module, config key, distributed test) so the name reflects both levels it implements: LRAB intra-SIP (local reduce to center root + broadcast) and the hierarchical inter-SIP topology exchange (ring/torus/mesh). ADR-0032 slug kept as the stable decision id; pure rename, no logic change. Also in this batch: - ADR-0032 (EN+KO): document the shipped center-root bidirectional reduce (doc was stale corner-root); annotate ccl.yaml root_cube as a placeholder. - Rename allreduce + pe2pe latency plots to descriptive, title-matching filenames and retitle the in-plot headings; drop overview/overview_log. - Point the PPTX image refs at the new plot names. Doc + derived-artifact + rename only; no simulation behavior changed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 20:50:48 -07:00
mukesh	9c129d6131	ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots Cube SRAM and HBM live on the cube NoC behind router-attached links (sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256). Previously the slot-IO model treated them as if they were per-PE local, so the buffer_kind sweep showed TCM ≈ SRAM at 64 KB / PE. pe_ipcq._handle_recv and pe_dma._handle_ipcq_inbound now charge a PE→bank compute_drain_ns on top of the intrinsic slot-IO for SRAM/HBM. TCM stays free of this hop. Adds an internal IpcqRecvCmd.consume field that gates the recv-side hop+slot-IO charges (used by a follow-up diagnostic API; default True keeps current behavior). Post-fix at 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs. SRAM is slowest because its 128 GB/s bank link is the narrowest in the system — narrower than HBM's 256 GB/s. The existing ordering test is rewritten from tcm<sram<hbm to tcm<hbm<sram and a new test_ipcq_buffer_kind_locations adds 3 invariants on the gap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:20:28 -07:00
ywkang	533e699299	IPCQ-DMA co-design HW design doc + fix IPCQ slot BW model Add hardware design document (docs/ipcq-dma-codesign-hw.md) covering PE_IPCQ high-level architecture, simulator verification, proposed HW implementation, and alternatives analysis. Include D2 block diagrams for baseline and proposed PE architectures. Fix IPCQ slot-memory bandwidth parameters to match topology.yaml: SRAM 128→512 GB/s (intrinsic BW, NoC-bottlenecked at 128), HBM 32→256 GB/s (was per-channel, now per-PE aggregate). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 13:31:02 -07:00
mukesh	84a1325e5c	ADR-0023 D9.7: IPCQ slot-memory latency model (TCM/SRAM/HBM) Charge per-tier bandwidth + setup overhead at IPCQ slot WRITE (receiver inbound DMA, in pe_dma._handle_ipcq_inbound) and slot READ (recv consume, in pe_ipcq._handle_recv). Tier table (common/ipcq_types.py): tcm : 512 GB/s, 0 ns sram : 128 GB/s, 2 ns hbm : 32 GB/s, 6 ns Before this change, slot read/write was free regardless of buffer_kind, making memory-tier choice invisible in simulated latency. After the change, swapping buffer_kind in ccl.yaml produces measurable per-tier separation in allreduce latency. Tests: test_ipcq_buffer_kind_latency.py — three micro-tests asserting tcm < sram < hbm ordering, payload-scaling, and that buffer_kind sensitivity grows with payload (credit-only path stays fabric-bound). test_allreduce_buffer_kind_sweep.py — 12-config parametrized sweep emitting buffer_kind_sweep.png (3 lines, torus_2d). conftest sessionfinish hook generalised to dispatch multiple sweep aggregators (allreduce + buffer-kind). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:28:34 -07:00

5 Commits