kernbench2

Author	SHA1	Message	Date
ywkang	687c98086d	ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037 Filename + lifecycle: - ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable. - ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2: docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft), docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for retroactive docs pending verification. Merges (one ADR per topic, no change-history annotations): - ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items) - ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl. TileToken self-routing and multi-op composite epilogue scope) - ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md deleted; ADR-0019/0021 moved to adr-history with one-line stub status Retroactive documentation (G4 closures, code-verified): - ADR-0037 forwarding component (TransitComponent: first-flit overhead, serial worker, path-based routing, single impl/multiple names) - ADR-0036 IO_CPU component (target_start_ns global barrier stamping, per-cube fan-out, response aggregation) - ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources, target_start_ns passthrough) - ADR-0034 HBM controller internal design (per-PC state, address-based selection, flit-aware per-flit commit, async finalize, command-only fallback path) Content updates: - ADR-0010 expanded to full CLI surface (run/probe/web), retitled "Command Line Interface and Execution Semantics" - ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned - ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata block replaced with standard Status header - ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4); ADR-0027 cleaned of supersession history - ADR-0033 D6 cleanup: address-based PC selection moved out of future-work (now documented in ADR-0034 D3); related D1/D3 wording realigned - Cross-references back-filled in 5 ADRs (G3 gaps closed) Onboarding docs split: - docs/onboarding/ created - moved: hw-architecture-overview.md, latency-model.md, di-presentation.md, ccl-author-guide{,.en}.md - references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8). No behavior change. Tooling: - tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py (ADR EN/KO pair invariant checker) - .claude/commands/report.md tracked (/report slash command) - .gitignore: allow .claude/commands/*.md while keeping settings files ignored Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:15:55 -07:00
mukesh	f6d262e359	Honest measured pipeline efficiency: two timing fixes Two related issues caused measured pipeline efficiency to look worse than the simulator's actual behavior: 1. DMA timing recorded too early. The op-log start timestamp for a DMA op fired when the request entered the queue, and the DMA channel was released as soon as the request was issued. Back-to-back DMAs therefore appeared to grab the channel simultaneously, with per-op duration drifting upward as queue depth grew - an artifact, not real cost. Fix: defer the start timestamp until after the channel is acquired, and hold the channel through the full HBM round-trip until the response returns. Per-op duration is now constant and equal to the actual transfer interval; serialization is visible as queue wait, not as inflated service time. 2. Sweep timing window folded in pre-composite work. The PE timing window spanned every PE engine record, which included the upfront pinned-operand DMA issued before the composite GEMM begins. For large-K shapes that one-shot load can be nearly half of the window, conflating operand-staging cost with composite-pipeline behavior. Fix: add a second window scoped to the composite pipeline by filtering op_log records to those tagged with a tile-pipeline stage; the legacy operand-load path is untagged and naturally excluded. For 32x3072x32 load_ref the window drops from 1765ns to 992ns and measured eff lines up with the steady-state DMA-bound stage limit instead of being penalized for the one-time load. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 14:19:17 -07:00
mukesh	9c129d6131	ADR-0023 D9.7+: charge PE↔bank fabric hop for SRAM/HBM IPCQ slots Cube SRAM and HBM live on the cube NoC behind router-attached links (sram_to_router_bw_gbs=128, hbm_to_router_bw_gbs=256). Previously the slot-IO model treated them as if they were per-PE local, so the buffer_kind sweep showed TCM ≈ SRAM at 64 KB / PE. pe_ipcq._handle_recv and pe_dma._handle_ipcq_inbound now charge a PE→bank compute_drain_ns on top of the intrinsic slot-IO for SRAM/HBM. TCM stays free of this hop. Adds an internal IpcqRecvCmd.consume field that gates the recv-side hop+slot-IO charges (used by a follow-up diagnostic API; default True keeps current behavior). Post-fix at 64 KB / PE: TCM 12.0 µs < HBM 21.4 µs < SRAM 24.3 µs. SRAM is slowest because its 128 GB/s bank link is the narrowest in the system — narrower than HBM's 256 GB/s. The existing ordering test is rewritten from tcm<sram<hbm to tcm<hbm<sram and a new test_ipcq_buffer_kind_locations adds 3 invariants on the gap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:20:28 -07:00
mukesh	84a1325e5c	ADR-0023 D9.7: IPCQ slot-memory latency model (TCM/SRAM/HBM) Charge per-tier bandwidth + setup overhead at IPCQ slot WRITE (receiver inbound DMA, in pe_dma._handle_ipcq_inbound) and slot READ (recv consume, in pe_ipcq._handle_recv). Tier table (common/ipcq_types.py): tcm : 512 GB/s, 0 ns sram : 128 GB/s, 2 ns hbm : 32 GB/s, 6 ns Before this change, slot read/write was free regardless of buffer_kind, making memory-tier choice invisible in simulated latency. After the change, swapping buffer_kind in ccl.yaml produces measurable per-tier separation in allreduce latency. Tests: test_ipcq_buffer_kind_latency.py — three micro-tests asserting tcm < sram < hbm ordering, payload-scaling, and that buffer_kind sensitivity grows with payload (credit-only path stays fabric-bound). test_allreduce_buffer_kind_sweep.py — 12-config parametrized sweep emitting buffer_kind_sweep.png (3 lines, torus_2d). conftest sessionfinish hook generalised to dispatch multiple sweep aggregators (allreduce + buffer-kind). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:28:34 -07:00
ywkang	81cc32c46b	ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables Remove rack_id (4 bits), rename sip_seg→die_id, shift fields to enable 42-bit local_offset (4 TB per die). Define PE_LOCAL/MCPU_LOCAL/CUBE_SRAM sub-unit tables for AHBM dies and IOCPU sub-unit table for IOCHIPLET dies (1 TB window). Supersedes ADR-0031. Also fixes latent VA/PA confusion in pe_dma pipeline DMA path where virtual addresses were decoded as physical addresses without MMU translation — previously masked by coincidental bit-position alignment. 529 passed (+6 recovered), 10 pre-existing failures unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-27 15:52:29 -07:00
mukesh	14d800b0ae	Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023) - KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier (max path latency across every target PE), M_CPU passes it through, PE_CPU yields until it before recording pe_exec_start. Every PE in a launch begins kernel execution at the same env.now regardless of its dispatch path length — eliminates per-PE dispatch-offset artifact in cross-PE and cross-cube latency measurements. - PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top, matching the terminal-drain behavior of ComponentBase._forward_txn for every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget (sender doesn't yield on sub_done); tl.recv now blocks until bytes have actually drained into its inbox. - ComponentContext: new compute_path_latency_ns helper + node_overhead_ns field populated by GraphEngine. - tests/test_kernel_launch_sync.py: asserts all PEs in one launch produce identical pe_exec_ns for a no-op kernel (zero spread). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 15:30:29 -07:00
ywkang	1c8ddc2d03	Fix Phase 1 slot-overwrite race + PE_MATH latency model (n_slots=4 safe) Root cause: In ring all-reduce, PE_IPCQ's recv handler advances my_tail and issues a credit return immediately. With tight credit latency (0.12ns intra-cube), the sender can refill the slot BEFORE the receiver's outbound PE_DMA reads from it for the next send. The outbound snapshot then captures stale data from a later round. Fix: Propagate TensorHandle.data (captured at recv-time, before credit return) through the entire send chain: tl.send(src=handle) → IpcqSendCmd.data → IpcqDmaToken.data PE_DMA outbound already prefers token.data over MemoryStore read, so the recv-time snapshot is used for the in-flight data. This eliminates the race: the snapshot is captured before the slot can be overwritten. Additional fixes: - PE_MATH handle_command: compute SIMD latency from output tensor element count via _compute_ns(), using max(overhead_ns, compute_ns). Previously used overhead_ns=0.0 for all standalone MathCmd, making math ops take 0ns in SimPy. - DataExecutor secondary sort: same-t_start ops sorted by op_kind (memory < gemm < math) so IPCQ slot writes execute before math reads. - ipcq_copy recorded at INBOUND time (receiver PE_DMA arrival) instead of outbound. Inbound time is after fabric propagation, so it sorts correctly relative to the receiver's math. - record_copy accepts explicit snapshot parameter (from token.data). Result: N_ELEM=32 + 256-rank + n_slots=4 + cross-SIP now passes. n_slots reverted to 4 (the deeper buffer was a workaround, not needed). 502 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 23:02:19 -07:00
ywkang	998cc85762	Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023) Major changes: PE-level IPCQ infrastructure: - New PE_IPCQ component: ring-buffer control plane with 4-direction neighbor mapping, head/tail pointers, backpressure (poll/sleep). - PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA, including in-flight data snapshot (D9) and op_log recording at outbound time for Phase 2 replay correctness. - IpcqDmaToken piggyback model: data + metadata travel together, atomic visibility at receiver (invariant I6). - Credit return fast path: bottleneck-BW latency, no fabric vc_comm. Phase 2 data execution (ADR-0020 integration): - op_log extended: DmaWriteCmd now captures src_space/src_addr for Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time. - DataExecutor replays dma_write + ipcq_copy in t_start order. - Engine._flush_data_phase: incremental cursor-based replay after each engine.wait() so host reads see post-Phase-2 data. - KernelRunner Phase 1 writes disabled when op_log is active to prevent stale data from corrupting the MemoryStore snapshot. TLContext / kernel API: - tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype), tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode. - TensorHandle operator overloading (add/sub/mul/div) via thread-local active TLContext → MathCmd dispatch through PE_MATH. - PE-local scratch allocator for math output handles. - tl.load returns space="hbm" handles for correct Phase 2 addressing. - Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv. Unified ccl_allreduce bench (PyTorch-compat host code): - Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch) split matching real PyTorch DDP worker pattern. - torch.distributed facade: init_process_group, get_world_size, get_rank, get_backend, all_reduce, barrier — only real PyTorch names. - AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches kernel via tensor shard metadata (n_elem from shards[0].nbytes). - world_size derived from topology spec (sips × cubes × pes_per_cube) with optional algorithm-level override in ccl.yaml. Tensor API (PyTorch-compat surface): - Tensor.numpy(): gather-aware (all shards via VA-based addressing). - Tensor.copy_(source): scatter from host tensor into sharded target. - RuntimeContext.from_numpy(arr): host-side staging tensor. - Tensor.data property fixed to use numpy() (was shards[0]-only). Algorithm modules moved to src/kernbench/ccl/algorithms/: - ring_allreduce, mesh_allreduce, tree_allreduce, hello_send. - Each module exports kernel_args(world_size, n_elem) helper. - ccl.yaml module paths updated to kernbench.ccl.algorithms.*. Dead code removed: - 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.). - _run_ccl_bench greenlet-per-SIP scheduler. - benches.loader.is_ccl_bench + run_rank detection. - benches/ccl/ directory. Tests: - New test_ccl_allreduce_matrix.py: 7 parametrized cases (ring×3 buffers, ring 8/16, mesh 4, tree 7). - New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests. - Existing tests updated for new import paths + world_size_override. Docs: - Korean ccl-author-guide.md and ADR-0023 paths updated. - New English versions: ccl-author-guide.en.md, ADR-0023.en.md. 502 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 19:36:59 -07:00
ywkang	1d95df4bee	Restructure legacy backups, remove pe_accel, fix DMA self-routing - Move builtin_legacy/ → legacy/builtin/ (cleaner structure) - Move pe_accel_legacy/ → legacy/pe_accel/ - Remove custom/pe_accel/ (replaced by new builtin) - Remove pe_scheduler_v2 from components.yaml - Switch topology.yaml to pe_scheduler_v1 (new builtin) - Fix PE_DMA self-routing: handle consecutive DMA_READ stages (same component consecutive stages processed in-place, not via port) 382 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 00:02:26 -07:00
ywkang	b6eb97c49a	Implement ADR-0021: PE pipeline refactor with token self-routing Step 1-2: Backup existing code - builtin/ → builtin_legacy/ (unchanged backup) - custom/pe_accel/ → custom/pe_accel_legacy/ (unchanged backup) Step 3-4: New pipeline types and tiling - pe_types.py: StageType, Stage, TilePlan, PipelinePlan, PipelineContext, TileToken - tiling.py: generate_gemm_plan, generate_math_plan (ported from pe_accel) Step 5: Component implementations (ADR-0021 D4-D6) - PE_SCHEDULER: _feed_loop (singleton FIFO feeder) + plan generation - PE_FETCH_STORE: new component — TCM ↔ Register File - PE_GEMM: TileToken pipeline + legacy PeInternalTxn dual-mode - PE_MATH: TileToken pipeline + legacy dual-mode - PE_DMA: TileToken pipeline + legacy + fabric Transaction triple-mode - PE_TCM: TcmRequest handler with dual-channel BW serialization Step 6: Infrastructure - topology.yaml: pe_fetch_store component + chaining edges - components.yaml: pe_fetch_store_v1 registration - builder.py: PE_COMP_OFFSETS, _add_pe_internal_edges, PE view positions - Tests: node/edge counts, PE component sets updated All components handle both TileToken (pipeline) and PeInternalTxn (legacy). Token self-routing: components read next stage from token.plan, chain via out_port. 366 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:35:31 -07:00
ywkang	eb792e6212	Remove xbar/noc remnants, rule-based cube-view connectors - Delete xbar.py and noc.py (TwoDMeshNocComponent) — unused since router mesh - Remove xbar_v1/noc_2d_mesh_v1 from components.yaml - Fix pe_to_xbar → pe_to_router in routing exclusion set - Fix xbar_to_hbm_bw_gbs → hbm_to_router_bw_gbs in report.py - Update all docstrings/comments referencing xbar/bridge → router mesh - Cube-view connectors: rule-based _connector_points helper - PE↔router: single diagonal line (not chevron) - UCIe N/S: 45°→horizontal→45° - UCIe E/W: 45°→vertical→45° - HBM ports: 45°→horizontal→45° Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 23:59:12 -07:00
ywkang	63669f82cb	Add SIP-level tensor parallelism, component registry YAML, VA offset verification - DPPolicy: 3-level (sip/cube/pe), unified naming (column_wise/row_wise) - PE_CPU: auto num_programs from cube shard count - context.launch(): per-SIP KernelLaunchMsg with local va_base + auto local shape - deploy_tensor: removed mmus param, MMU mapping is context-only responsibility - ComponentRegistry: YAML-based lazy loading (components.yaml), impls→builtin rename - VA offset bench + tests: 2D/1D, standard Triton kernel pattern Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 01:13:17 -07:00

12 Commits