kernbench2

Author	SHA1	Message	Date
ywkang	b8213d43a9	ADR-0019 D1/D4: per-PE HBM CTRL partitioning Restores per-PE HBM controller partitioning that was lost in commit `5917b34` ("Replace xbar/bridge/single-NOC with explicit router mesh"), which had over-consolidated the per-slice HBM CTRL into a single cube-wide ``hbm_ctrl`` connected to every router, the opposite of what ADR-0019 D1/D4 specifies. Builder splits ``hbm_ctrl`` into 8 ``hbm_ctrl.pe{X}`` instances per cube, each reachable ONLY through PE_X's attaching router via the existing ``peX.hbm`` attach metadata from cube_mesh.yaml. Cube aggregate BW now matches the spec (8 PEs × 8 PCs × 32 GB/s = 2048 GB/s) instead of collapsing to 256 GB/s. AddressResolver decodes the target PE from the HBM PA's hbm_offset (``offset // slice_size``) and returns ``hbm_ctrl.pe{X}``. PathRouter uses the existing ``_adj_local`` adjacency for same-cube PE_DMA so the cube's own UCIe port can no longer appear as a zero-distance shortcut between routers; local PE_DMA now traverses the mesh, restoring the ADR-0019 D4 worked example ``PE0.pe_dma → r0c0 → … → r1c4 → hbm_ctrl``. Tests: - New tests/test_per_pe_hbm_partition.py: 14 tests covering topology shape, per-PE router exclusivity, PA resolution, single-hop local path, cross-PE mesh traversal, and end-to-end latency monotonicity. Probe CLI now reports pe-local < pe-same-half < pe-cross-half (was uniform 141ns). - Existing tests updated for new node ids and replaced two assertions that locked in the wrong consolidation: test_noc_mesh.test_hbm_connects_to_all_routers and test_topology_compile.test_hbm_ctrl_connects_all_routers are now per-PE exclusivity assertions; test_routing.test_all_pe_hbm_equidistant becomes test_cross_pe_hbm_distance_increases_with_mesh_hops. - test_ipcq_buffer_kind_locations.test_hbm_pe_hop_charged_at_large_payload threshold recalibrated 4000→1500 ns: the prior figure reflected serialization on the over-consolidated single hbm_ctrl; per-PE partitioning removes that artificial contention so the gap shrinks to the genuine PE↔HBM-hop cost. Full suite: 645 passed, 1 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 01:04:30 -07:00
ywkang	a44f832be5	Regenerate latency plots/diagrams for post-Phase-2c model Allreduce + pe2pe + ipcq + pe_view auto-regenerated by test sweeps running against the new chunk-streaming wire timing (per-flit wormhole) — absolute numbers shift upward to reflect bottleneck-link transit charged once per flit (instead of the previous cut-through subtraction at HBM CTRL). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 23:24:01 -07:00
mukesh	5accd98171	Add deck builder + overview-with-ref diagram scripts scripts/build_overview_slides.py renders a 5-slide PPTX (kernbench2_overview.pptx) summarizing architecture, model correctness, IPCQ, allreduce, and buffer-kind tier comparison. scripts/emit_overview_with_external_ref.py renders log-y and broken-y variants of the allreduce overview (overview_log.png, overview_broken.png) including a 366 µs ext-sim reference marker at 96 KB / PE. Also includes cube_mesh_view.png rendered from the SVG. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:20:54 -07:00
mukesh	9c129d6131	ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots Cube SRAM and HBM live on the cube NoC behind router-attached links (sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256). Previously the slot-IO model treated them as if they were per-PE local, so the buffer_kind sweep showed TCM ≈ SRAM at 64 KB / PE. pe_ipcq._handle_recv and pe_dma._handle_ipcq_inbound now charge a PE→bank compute_drain_ns on top of the intrinsic slot-IO for SRAM/HBM. TCM stays free of this hop. Adds an internal IpcqRecvCmd.consume field that gates the recv-side hop+slot-IO charges (used by a follow-up diagnostic API; default True keeps current behavior). Post-fix at 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs. SRAM is slowest because its 128 GB/s bank link is the narrowest in the system — narrower than HBM's 256 GB/s. The existing ordering test is rewritten from tcm<sram<hbm to tcm<hbm<sram and a new test_ipcq_buffer_kind_locations adds 3 invariants on the gap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:20:28 -07:00
mukesh	ad5f01ab13	Merge origin/master: combine single-cube fast path + center-root reduce Conflict resolution: - intercube_allreduce.py: kept origin's `if single_cube:` early-exit (TP launches kernel on one cube/rank → skip intra-SIP mesh and go direct to inter-SIP exchange) AND replaced the multi-cube body with the local center-root + bidirectional reduce/broadcast (8-hop critical path on 4×4 vs 12 with corner root). - tests/{allreduce,pe2pe}_latency_plots/: kept the local move to docs/diagrams/; dropped origin's stale content edits to the old paths (regenerable derived artifacts). - docs/diagrams/pe2pe_latency_plots/summary.csv: kept local (post-Phase-2 + center-root values). Origin contributions retained as-is: - pyproject.toml: matplotlib >= 3.7 dep. - runtime_api/distributed.py: derive effective cube_w/h from tensor shard placement so single-cube TP paths get cube_w=cube_h=1. - kernel_args() now accepts optional cube_w/cube_h kwargs. Verified post-merge: - test_intercube_root_center.py: 2/2 (center-root multi-cube path). - test_tp_layers.py + test_tp_mlp.py: 10/10 (single-cube TP path). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:41:46 -07:00
mukesh	1c5752a9ec	Intercube allreduce: center root + bidirectional reduce Move the algorithmic root cube from the corner (cube_w-1, cube_h-1) to the geometric center (cube_w//2, cube_h//2) and have each phase converge bidirectionally so the intra-SIP critical path drops from ~12 hops to ~8 hops on a 4×4 mesh (left half W→E + right half E→W in row reduce; top half N→S + bottom half S→N in col reduce; mirrored on broadcast). Result on torus_2d 6 SIPs at 96 KB / PE on TCM: before (corner root) : 22.0 µs after (center root) : 17.2 µs (−22%) Same shape on ring_1d (−7%) and mesh_2d_no_wrap (−12%); also holds across SRAM and HBM (~−20% each). Phase 1 test (test_intercube_root_center.py) asserts the torus_2d 96 KB latency drops below 20.5 µs and that all 96 cubes still validate (correctness preserved). Plot updates: - overview.png: replace constant 10.6 µs theoretical line with user-supplied hand-derived curve (per-cube packet count = bytes_per_pe × 8 PEs ÷ 128 B; 1346 ns startup + 1.20 ns/pkt). - All summary.csv numbers and per-topology PNGs regenerated. - pe2pe_latency_plots and ipcq diagram emitter PNGs refreshed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:28:58 -07:00
mukesh	84a1325e5c	ADR-0023 D9.7: IPCQ slot-memory latency model (TCM/SRAM/HBM) Charge per-tier bandwidth + setup overhead at IPCQ slot WRITE (receiver inbound DMA, in pe_dma._handle_ipcq_inbound) and slot READ (recv consume, in pe_ipcq._handle_recv). Tier table (common/ipcq_types.py): tcm : 512 GB/s, 0 ns sram : 128 GB/s, 2 ns hbm : 32 GB/s, 6 ns Before this change, slot read/write was free regardless of buffer_kind, making memory-tier choice invisible in simulated latency. After the change, swapping buffer_kind in ccl.yaml produces measurable per-tier separation in allreduce latency. Tests: test_ipcq_buffer_kind_latency.py — three micro-tests asserting tcm < sram < hbm ordering, payload-scaling, and that buffer_kind sensitivity grows with payload (credit-only path stays fabric-bound). test_allreduce_buffer_kind_sweep.py — 12-config parametrized sweep emitting buffer_kind_sweep.png (3 lines, torus_2d). conftest sessionfinish hook generalised to dispatch multiple sweep aggregators (allreduce + buffer-kind). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:28:34 -07:00
mukesh	1e39214f89	Move generated diagrams to docs/diagrams/; add IPCQ diagram emitter Plot output dirs now live under docs/diagrams/ (the canonical "derived artifacts" location per CLAUDE.md): tests/allreduce_latency_plots/ → docs/diagrams/allreduce_latency_plots/ tests/pe2pe_latency_plots/ → docs/diagrams/pe2pe_latency_plots/ + new docs/diagrams/ipcq_diagram_plots/ with two presentation diagrams (ipcq_send_recv.png, ipcq_two_pe_dma.png) New test tests/test_emit_ipcq_diagram.py renders the two IPCQ diagrams from a static description (no simulation); it exists so the diagrams can be regenerated reproducibly. Path references updated in tests/test_pe_to_pe_latency.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:28:17 -07:00
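
The tier table in commit 84a1325e5c (tcm 512 GB/s / 0 ns, sram 128 GB/s / 2 ns, hbm 32 GB/s / 6 ns) can be read as a simple per-slot cost. The sketch below is illustrative only: it assumes the charge is additive (setup plus bytes divided by bandwidth), and the names `Tier`, `TIERS`, and `slot_io_ns` are hypothetical, not the identifiers used in common/ipcq_types.py. It covers the intrinsic slot-IO only; the PE-to-bank fabric hop added later in 9c129d6131 is not modeled, which is why this sketch orders sram before hbm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    bw_gbs: float    # bandwidth in GB/s; 1 GB/s equals 1 byte per ns
    setup_ns: float  # fixed setup overhead per slot access


# Values copied from the commit message; the container itself is hypothetical.
TIERS = {
    "tcm": Tier(bw_gbs=512, setup_ns=0),
    "sram": Tier(bw_gbs=128, setup_ns=2),
    "hbm": Tier(bw_gbs=32, setup_ns=6),
}


def slot_io_ns(buffer_kind: str, nbytes: int) -> float:
    """Assumed model: setup overhead plus a linear bandwidth term."""
    tier = TIERS[buffer_kind]
    return tier.setup_ns + nbytes / tier.bw_gbs


if __name__ == "__main__":
    for kind in ("tcm", "sram", "hbm"):
        print(kind, slot_io_ns(kind, 64 * 1024), "ns")
```

Under this assumption a 64 KB slot costs 128 ns on tcm, 514 ns on sram, and 2054 ns on hbm per access, which matches the tcm < sram < hbm ordering the original micro-tests assert.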

8 Commits
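
The per-PE partitioning described in commit b8213d43a9 comes down to two pieces of arithmetic: the cube aggregate bandwidth (8 PEs × 8 PCs × 32 GB/s) and the PE decode (`offset // slice_size`). A minimal sketch follows, assuming a hypothetical 1 GiB slice size; the function names are illustrative and are not the repository's AddressResolver API.

```python
PES_PER_CUBE = 8   # from the commit message
PCS_PER_PE = 8     # pseudo-channels per PE, from the commit message
PC_BW_GBS = 32     # per-PC bandwidth in GB/s, from the commit message


def cube_hbm_bw_gbs() -> int:
    """Aggregate HBM bandwidth of one cube with per-PE controllers."""
    return PES_PER_CUBE * PCS_PER_PE * PC_BW_GBS


def resolve_hbm_ctrl(hbm_offset: int, slice_size: int) -> str:
    """Map an HBM offset to its per-PE controller node id.

    Hypothetical helper: only the `offset // slice_size` decode and the
    `hbm_ctrl.pe{X}` naming come from the commit message.
    """
    if hbm_offset < 0 or slice_size <= 0:
        raise ValueError("offset must be >= 0 and slice_size must be > 0")
    pe = hbm_offset // slice_size
    if pe >= PES_PER_CUBE:
        raise ValueError(f"offset {hbm_offset:#x} is outside the cube's HBM")
    return f"hbm_ctrl.pe{pe}"


if __name__ == "__main__":
    slice_size = 1 << 30  # assumed 1 GiB per PE slice, for illustration only
    print(cube_hbm_bw_gbs())                              # 2048
    print(resolve_hbm_ctrl(3 * slice_size + 5, slice_size))  # hbm_ctrl.pe3
```

For comparison, the 256 GB/s figure quoted for the consolidated controller equals a single PE's share in this arithmetic (8 PCs × 32 GB/s).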