kernbench2

Author	SHA1	Message	Date
ywkang	b6315c3c90	paper(1H): reflect D8 single-op cost model in §2/§3.4 + figure/diagram regen Two strands bundled as the 1H-codesign-paper refresh unit: (A) This session — single-op cost-model reflection (depends on `2d8271c`): - §2 Table 2 (tab:hw): split "FIXED per command" into "FIXED per single-op command" (8 cycles) and "FIXED per composite command" (40 cycles); §2 dispatch-overhead prose updated to the two-class split. - §3.4 (sec:gemm-vs-async): rename paragraph headers + prose to async-full / async-tiled; "atomic" -> "single-op" throughout; reframe mechanism #3 from the old DMA-only fast-path to the single-op fast-path. Headline narrative now: even with EVERY single-op cmd (96 DMA + 48 dot + 47 add) charged the light 8-cycle FIXED, composite still wins ~2.8x at K=3072 purely on command-count structure (1 vs 192 commands) -- down from the pre-D8 ~6.3x, and explicitly NOT a modelling artifact. Numbers refreshed from the regenerated sweep: async-full 3.83->3.91, async-tiled 1.14->~2.53, under-tile corner 1.06->1.21, depth-2 vs depth-inf spread <1%. New figure wired in. - build/main.pdf rebuilt (tectonic); pdftotext-verified (no broken refs; Table 2 split, single-op terms, 2.8x/2.53/192-host-commands all present). (B) Prior-session paper work riding along uncommitted: §4 all-reduce deep-edit, §5 GQA, §6 discussion trims; milestone_1h_ccl.py plot label "FSIM" -> "H2 2025 SW queue baseline"; regenerated diagrams under docs/diagrams/** and gemm output PNGs under 1H_milestone_output/gemm/. (Composite-window gemm plots are unaffected by D8 — D8 only changes single-op dispatch FIXED, which the composite window excludes.) All TODO items for the D8 single-op extension are now complete and pushed across 3 commits (`2d8271c` cost-model+ADR+tests, `821bbf2` bench harness, this paper refresh). Full regression green (826 passed, 1 skipped). No remaining work. 
NOTE for review (carried from `2d8271c`): ADR-0065's "2x CPU-offload win" headline for GQA decode opt2 may want a refresh to the post-D8 ~1.87x. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 15:12:31 -07:00
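The command-count argument in `b6315c3c90` can be checked with simple arithmetic. The sketch below uses only the two FIXED values and the single-op breakdown quoted in the message. It covers FIXED dispatch overhead alone, not the full cost model behind the ~2.8x headline, and the function and constant names are hypothetical. The listed breakdown (96 + 48 + 47) sums to 191, while the message cites 192 commands; the sketch uses the breakdown as written.

```python
# FIXED dispatch cycles per command class, as quoted in the commit message.
FIXED_SINGLE_OP_CYCLES = 8
FIXED_COMPOSITE_CYCLES = 40

# Single-op command counts listed in the message for K=3072.
SINGLE_OP_COMMANDS = {"dma": 96, "dot": 48, "add": 47}


def fixed_dispatch_cycles(num_commands: int, fixed_per_command: int) -> int:
    """Total FIXED dispatch overhead only; excludes all data-dependent cost."""
    return num_commands * fixed_per_command


single_op_count = sum(SINGLE_OP_COMMANDS.values())
single_op_fixed = fixed_dispatch_cycles(single_op_count, FIXED_SINGLE_OP_CYCLES)
composite_fixed = fixed_dispatch_cycles(1, FIXED_COMPOSITE_CYCLES)

print(f"single-op: {single_op_count} commands, {single_op_fixed} FIXED cycles")
print(f"composite: 1 command, {composite_fixed} FIXED cycles")
```

Even at the light 8-cycle charge, the single-op path accumulates far more FIXED overhead than one composite command, which is the structural effect the message describes.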
mukesh	ff7d727ddd	CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots Rename the intercube all-reduce identity to lrab_hierarchical_allreduce (module, config key, distributed test) so the name reflects both levels it implements: LRAB intra-SIP (local reduce to center root + broadcast) and the hierarchical inter-SIP topology exchange (ring/torus/mesh). ADR-0032 slug kept as the stable decision id; pure rename, no logic change. Also in this batch: - ADR-0032 (EN+KO): document the shipped center-root bidirectional reduce (doc was stale corner-root); annotate ccl.yaml root_cube as a placeholder. - Rename allreduce + pe2pe latency plots to descriptive, title-matching filenames and retitle the in-plot headings; drop overview/overview_log. - Point the PPTX image refs at the new plot names. Doc + derived-artifact + rename only; no simulation behavior changed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 20:50:48 -07:00
ywkang	b8213d43a9	ADR-0019 D1/D4: per-PE HBM CTRL partitioning Restores per-PE HBM controller partitioning that was lost in commit `5917b34` ("Replace xbar/bridge/single-NOC with explicit router mesh"), which had over-consolidated the per-slice HBM CTRL into a single cube-wide ``hbm_ctrl`` connected to every router, the opposite of what ADR-0019 D1/D4 specifies. Builder splits ``hbm_ctrl`` into 8 ``hbm_ctrl.pe{X}`` instances per cube, each reachable ONLY through PE_X's attaching router via the existing ``peX.hbm`` attach metadata from cube_mesh.yaml. Cube aggregate BW now matches the spec (8 PEs × 8 PCs × 32 GB/s = 2048 GB/s) instead of collapsing to 256 GB/s. AddressResolver decodes the target PE from the HBM PA's hbm_offset (``offset // slice_size``) and returns ``hbm_ctrl.pe{X}``. PathRouter uses the existing ``_adj_local`` adjacency for same-cube PE_DMA so the cube's own UCIe port can no longer appear as a zero-distance shortcut between routers; local PE_DMA now traverses the mesh, restoring the ADR-0019 D4 worked example ``PE0.pe_dma → r0c0 → … → r1c4 → hbm_ctrl``. Tests: - New tests/test_per_pe_hbm_partition.py: 14 tests covering topology shape, per-PE router exclusivity, PA resolution, single-hop local path, cross-PE mesh traversal, and end-to-end latency monotonicity. Probe CLI now reports pe-local < pe-same-half < pe-cross-half (was uniform 141 ns). - Existing tests updated for new node ids and replaced two assertions that locked in the wrong consolidation: test_noc_mesh.test_hbm_connects_to_all_routers and test_topology_compile.test_hbm_ctrl_connects_all_routers are now per-PE exclusivity assertions; test_routing.test_all_pe_hbm_equidistant becomes test_cross_pe_hbm_distance_increases_with_mesh_hops.
- test_ipcq_buffer_kind_locations.test_hbm_pe_hop_charged_at_large_payload threshold recalibrated 4000→1500 ns: the prior figure reflected serialization on the over-consolidated single hbm_ctrl; per-PE partitioning removes that artificial contention so the gap shrinks to the genuine PE↔HBM-hop cost. Full suite: 645 passed, 1 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 01:04:30 -07:00
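The bandwidth figure and the ``offset // slice_size`` decode from `b8213d43a9` can be illustrated with a minimal sketch. It is not the repository's AddressResolver: the function names and the 1 GiB slice size in the usage lines are assumptions for illustration. Only the 8 × 8 × 32 GB/s product, the floor-division decode, and the ``hbm_ctrl.pe{X}`` node id format come from the message.

```python
PES_PER_CUBE = 8   # per-PE HBM controllers per cube (from the commit message)
PCS_PER_PE = 8     # "PCs" per PE (from the commit message)
GBPS_PER_PC = 32   # GB/s per PC (from the commit message)


def cube_aggregate_bw_gbps() -> int:
    """Aggregate HBM bandwidth of one cube with per-PE partitioning."""
    return PES_PER_CUBE * PCS_PER_PE * GBPS_PER_PC


def resolve_hbm_ctrl(hbm_offset: int, slice_size: int) -> str:
    """Map an HBM offset to the owning per-PE controller node id."""
    pe = hbm_offset // slice_size
    if not 0 <= pe < PES_PER_CUBE:
        raise ValueError(f"offset {hbm_offset:#x} is outside the cube's HBM slices")
    return f"hbm_ctrl.pe{pe}"


SLICE_SIZE = 1 << 30  # hypothetical 1 GiB slice, for illustration only

print(cube_aggregate_bw_gbps())                         # 8 * 8 * 32
print(resolve_hbm_ctrl(0, SLICE_SIZE))                  # first slice
print(resolve_hbm_ctrl(3 * SLICE_SIZE + 5, SLICE_SIZE)) # fourth slice
```

The 256 GB/s consolidated figure equals 8 PCs × 32 GB/s, that is, one controller's share, which is consistent with the message's description of the bandwidth collapsing when all routers shared a single ``hbm_ctrl``.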
ywkang	a44f832be5	Regenerate latency plots/diagrams for post-Phase-2c model Allreduce + pe2pe + ipcq + pe_view auto-regenerated by test sweeps running against the new chunk-streaming wire timing (per-flit wormhole) — absolute numbers shift upward to reflect bottleneck-link transit charged once per flit (instead of the previous cut-through subtraction at HBM CTRL). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 23:24:01 -07:00
mukesh	a563169e89	Add tl.recv_no_consume diagnostic API for apples-to-apples pe2pe plot The pe2pe overview compared IPCQ (tl.send + tl.recv) against raw DMA (tl.load + tl.store), but DMA is one-sided — DST never reads — while tl.recv pays a slot-read on DST. The comparison was unfair: IPCQ looked slower partly because it does more work. Adds tl.recv_no_consume() — a separate, diagnostic-only entry point that blocks for slot arrival but skips the slot-read (and bank-hop) charge on DST. Production tl.recv is unchanged (no `consume` kwarg on the public API), so the diagnostic flag can never accidentally leak into real workloads. Updates test_pe_to_pe_latency to call tl.recv_no_consume so the overview.png shows IPCQ no-consume vs raw DMA on equal footing. Also fixes PLOT_DIR back to docs/diagrams/pe2pe_latency_plots/ (was lost in a merge). Adds scripts/replot_pe2pe.py for label-only re-renders without re-measuring. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:20:44 -07:00
mukesh	1c5752a9ec	Intercube allreduce: center root + bidirectional reduce Move the algorithmic root cube from the corner (cube_w-1, cube_h-1) to the geometric center (cube_w//2, cube_h//2) and have each phase converge bidirectionally so the intra-SIP critical path drops from ~12 hops to ~8 hops on a 4×4 mesh (left half W→E + right half E→W in row reduce; top half N→S + bottom half S→N in col reduce; mirrored on broadcast). Result on torus_2d 6 SIPs at 96 KB / PE on TCM: before (corner root) : 22.0 µs after (center root) : 17.2 µs (−22%) Same shape on ring_1d (−7%) and mesh_2d_no_wrap (−12%); also holds across SRAM and HBM (~−20% each). Phase 1 test (test_intercube_root_center.py) asserts the torus_2d 96 KB latency drops below 20.5 µs and that all 96 cubes still validate (correctness preserved). Plot updates: - overview.png: replace constant 10.6 µs theoretical line with user-supplied hand-derived curve (per-cube packet count = bytes_per_pe × 8 PEs ÷ 128 B; 1346 ns startup + 1.20 ns/pkt). - All summary.csv numbers and per-topology PNGs regenerated. - pe2pe_latency_plots and ipcq diagram emitter PNGs refreshed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:28:58 -07:00
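The root placement and the hand-derived curve described in `1c5752a9ec` reduce to a few formulas, sketched below. The function names are hypothetical, and treating 96 KB as 96 × 1024 bytes is an assumption; the constants (8 PEs, 128 B packets, 1346 ns startup, 1.20 ns per packet) and the corner and center coordinates are taken from the message.

```python
PES_PER_CUBE = 8
PACKET_BYTES = 128
STARTUP_NS = 1346.0
NS_PER_PACKET = 1.20


def corner_root(cube_w: int, cube_h: int) -> tuple[int, int]:
    """Previous root position: the far corner of the cube mesh."""
    return (cube_w - 1, cube_h - 1)


def center_root(cube_w: int, cube_h: int) -> tuple[int, int]:
    """New root position: the geometric center of the cube mesh."""
    return (cube_w // 2, cube_h // 2)


def packets_per_cube(bytes_per_pe: int) -> float:
    """Per-cube packet count = bytes_per_pe * 8 PEs / 128 B."""
    return bytes_per_pe * PES_PER_CUBE / PACKET_BYTES


def theoretical_latency_ns(bytes_per_pe: int) -> float:
    """Hand-derived curve: fixed startup plus a per-packet term."""
    return STARTUP_NS + NS_PER_PACKET * packets_per_cube(bytes_per_pe)


payload = 96 * 1024  # assumption: 96 KB per PE means 96 * 1024 bytes
print(corner_root(4, 4), center_root(4, 4))
print(packets_per_cube(payload), theoretical_latency_ns(payload))
```

On a 4×4 mesh the root moves from (3, 3) to (2, 2), which is what lets each phase converge from both sides and shortens the critical path as the message reports.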
mukesh	1e39214f89	Move generated diagrams to docs/diagrams/; add IPCQ diagram emitter Plot output dirs now live under docs/diagrams/ (the canonical "derived artifacts" location per CLAUDE.md): tests/allreduce_latency_plots/ → docs/diagrams/allreduce_latency_plots/ tests/pe2pe_latency_plots/ → docs/diagrams/pe2pe_latency_plots/ + new docs/diagrams/ipcq_diagram_plots/ with two presentation diagrams (ipcq_send_recv.png, ipcq_two_pe_dma.png) New test tests/test_emit_ipcq_diagram.py renders the two IPCQ diagrams from a static description (no simulation); it exists so the diagrams can be regenerated reproducibly. Path references updated in tests/test_pe_to_pe_latency.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:28:17 -07:00

7 Commits