kernbench2

Author	SHA1	Message	Date
mukesh	46291bf91b	PE-to-PE latency: drop h5 inter-SIP panel from overview Remove h5_inter_sip from the hop list and switch the overview grid from 2x3 to 2x2. RAW DMA was unavailable for the cross-SIP hop, so the panel only carried IPCQ data and was redundant with h4_inter_cube for the topology comparison. Regenerate pe2pe_latency_plots/overview.png and summary.csv; delete the obsolete h5_inter_sip.png. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:43:28 -07:00
mukesh	04c912f53e	Allreduce sweep: parametrized + xdist parallelism + topology diagram Refactor the latency sweep from one giant test into 36 parametrized cases that run in parallel under xdist (~6-8x faster: 1:49 instead of ~10 min). Each case writes a JSON row to a staging dir; conftest sessionfinish hook aggregates rows on the controller node into summary.csv and the per-topology + overview plots. Aggregator gains a CSV fallback so plot-only tweaks no longer require re-running the sweep. Overview plot updates: - 96 KB explicit x-axis marker with vertical dotted line - horizontal theoretical 2D-torus reference (10600 ns) - annotation showing both theoretical and simulated values at 96 KB - drop overlapping 128 KB tick New topology.png: 2x2 panel diagram showing device-level topology (ring, torus 2x3, mesh 2x3) and the cube-level reduction inside SIP 0. Wrap arrows anchor on box edges and arc outside rows/columns so they do not overlap any SIP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:43:19 -07:00
mukesh	1c33afec55	ADR-0032 + intra_* opposite directions in IPCQ install Add intra_N/S/E/W to install.py _OPPOSITE_DIR table so the intra-cube PE-to-PE namespace is symmetrical with intercube N/S/E/W. ADR-0032 documents the intercube allreduce algorithm (supersedes ADR-0029). Refresh ADR-0024/0025/0029 cross-refs and update test_intercube_sfr_config.py to cover the new intra_* mappings. Drop the obsolete test_ccl_round_robin_recv.py (replaced by intercube tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:43:01 -07:00
ywkang	81cc32c46b	ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables Remove rack_id (4 bits), rename sip_seg→die_id, shift fields to enable 42-bit local_offset (4 TB per die). Define PE_LOCAL/MCPU_LOCAL/CUBE_SRAM sub-unit tables for AHBM dies and IOCPU sub-unit table for IOCHIPLET dies (1 TB window). Supersedes ADR-0031. Also fixes latent VA/PA confusion in pe_dma pipeline DMA path where virtual addresses were decoded as physical addresses without MMU translation — previously masked by coincidental bit-position alignment. 529 passed (+6 recovered), 10 pre-existing failures unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-27 15:52:29 -07:00
mukesh	e9cc40f74d	Rectangular SIP topology + 6-device allreduce sweep mesh_2d, torus_2d, and mesh_2d_no_wrap accept optional w,h kwargs; sqrt fall-back preserved for square layouts (back-compat tests confirm 4-SIP and 9-SIP square configs still work). sfr_config reads system.sips.w/h from spec and threads dims through to the topology fn. test_allreduce_multidevice CONFIGS switched from 4 SIPs (square) to 6 SIPs: ring_1d_6sip, torus_2d_6sip_2x3, mesh_2d_no_wrap_6sip_2x3. _write_temp_configs writes system.sips.w/h when supplied; _sip_topo_dims reads them back. Latency sweep loop also moved to 6-SIP layouts. Linear-scale plot variants dropped -- only log-scale *.png + summary.csv emitted. Plots in tests/allreduce_latency_plots regenerated. New tests/test_sip_topology_rectangular.py asserts neighbor correctness for 2x3 layouts and back-compat for square fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:13:14 -07:00
mukesh	c1a5cf3a2a	ADR-0009 D5: chain-aware target_start_ns + zero-byte launch fanout The single-walk predictor (find_node_path(io_cpu, pe_cpu) + compute_path_latency_ns) under-shot actual dispatch latency for far cubes -- the routing graph could pick a path bypassing M_CPU, and non-zero-nbytes launch sub-txns serialized on shared first hops. Far PEs arrived at _execute_kernel after target_start_ns, silently skipped the barrier yield, and started pe_exec_start late. Their reported pe_exec_ns under-counted by exactly the late_ns amount (63 ns observed at h4 cube4.pe0 in the IPCQ test, up to 113 ns worst case for cubes 9-11), producing the suspicious flat region in the h4 IPCQ curve at 8192/10240 bytes. Fix: - IO_CPU predictor uses the explicit two-leg chain (IO_CPU->M_CPU + M_CPU->PE_CPU - io.overhead - m.overhead), so every PE on every targeted cube has a barrier >= its real dispatch arrival. - Kernel-launch fanout sub-txns carry nbytes=0 (control-plane, not data-plane), removing the per-cube fanout serialization that pushed far M_CPUs past the predictor. - Legacy io_cpu mirror updated. ADR-0009 D5 mechanism updated to specify the two-leg formula and the nbytes=0 requirement. New tests/test_d5_barrier_invariant.py asserts (a) no PE enters _execute_kernel after target_start_ns and (b) every PE in a multi-cube launch has identical pe_exec_start -- both regressions silently pass on the existing tests/test_kernel_launch_sync.py because that test only inspects post-aggregation max(pe_exec_ns). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:12:58 -07:00
mukesh	90874abbfe	ADR-0023 D9: blocking credit-emit with full-path latency PE_IPCQ._handle_recv now yields-from _delayed_credit_send instead of spawning it as a fork, so the receiver's pe_exec_ns includes the credit-return cost. _credit_latency_ns switches from compute_drain_ns(path, 16) to compute_path_latency_ns(path, 16) and fixes a latent find_path bug where the destination lacked the ".pe_dma" suffix (silently returned 0 ns under the bare except). Net effect on h3/h4 inter-cube pe-to-pe latency: IPCQ >= raw DMA at every size, matching real-HW posted-write semantics. tl.send remains fire-and-forget. ADR-0023 D9 amended; new diagnostic test tests/test_pe_to_pe_diagnostic.py captures per-PE pe_exec_ns, paths, drain, and meta-arrival timing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:12:38 -07:00
mukesh	19dfc86dc3	Allreduce latency sweep across topologies and data sizes Adds test_allreduce_latency_sweep that runs the existing intercube allreduce kernel under three SIP topologies (ring_1d, torus_2d, mesh_2d_no_wrap, all at n_sips=4) across 11 data sizes from 256 B/SIP up to 1 MB/SIP. For each point, captures max(pe_exec_ns) — the critical-path kernel time — and emits CSV plus log-x and linear-x plots, both per-topology and combined overview, with KB/MB-formatted tick labels. Reuses run_allreduce + _write_temp_configs and adds a slot_size auto-bump when n_elem*2 exceeds the default IPCQ slot. Sweep skips n_elem=16 because the runtime's dim_map scalar-arg remapping (context.py:761) collides any int-valued kernel scalar that matches a global tensor dim with its local shard size. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 10:16:29 -07:00
mukesh	14d800b0ae	Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023) - KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier (max path latency across every target PE), M_CPU passes it through, PE_CPU yields until it before recording pe_exec_start. Every PE in a launch begins kernel execution at the same env.now regardless of its dispatch path length — eliminates per-PE dispatch-offset artifact in cross-PE and cross-cube latency measurements. - PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top, matching the terminal-drain behavior of ComponentBase._forward_txn for every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget (sender doesn't yield on sub_done); tl.recv now blocks until bytes have actually drained into its inbox. - ComponentContext: new compute_path_latency_ns helper + node_overhead_ns field populated by GraphEngine. - tests/test_kernel_launch_sync.py: asserts all PEs in one launch produce identical pe_exec_ns for a no-op kernel (zero spread). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 15:30:29 -07:00
mukesh	6918e6e906	PE-to-PE latency test + supporting fixes Adds tests/test_pe_to_pe_latency.py: a sweep that measures PE-to-PE transfer latency for five hop types (intra-cube horizontal/vertical, inter-cube horizontal/vertical, inter-SIP) across data sizes 128 B to 10 KB, on both the IPCQ (tl.send/tl.recv) and raw-DMA (tl.load+tl.store) paths. Emits per-hop PNG plots, an overview PNG, and a CSV summary into tests/pe2pe_latency_plots/. Latency is reported as max(pe_exec_ns) across participating PEs, read from engine.get_completion(), so the measurement captures the SRC/DST PE's kernel body time rather than the full launch+ response-aggregation envelope. Two simulator fixes were needed to make this measurement meaningful: - PeMMU now stores a list of (start, end, pa) sub-regions per page rather than a single PA. DPPolicy layouts with shards smaller than page_size (e.g. 128 B payloads with 4 KB pages) used to silently overwrite each other through last-write-wins, causing DMAs intended for cube0 to physically route to cube3 - inflating latency by ~170 ns per DMA at small sizes. STOPGAP: real MMUs don't support sub-page regions; long-term fix is either smaller MMU page size or DPPolicy validation that refuses sub-page shards. - M_CPU's per-PE metrics aggregation (pe_exec_ns, dma_ns, compute_ns) now max-merges against the existing value in result_data rather than overwriting. Multi-cube workloads share one result_data dict via IO_CPU fanout; the previous overwrite caused whichever cube's M_CPU finished last to clobber others' values, so multi-cube pe_exec_ns was racy and frequently 0. Same fix applied in legacy/builtin/m_cpu.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 21:04:31 -07:00
mukesh	1d8b9401e5	Intercube allreduce: pe0 cube-mesh reduce + multi-SIP ring/torus/mesh New intercube allreduce kernel replacing the old flat ring algorithms. Reduces across the 4x4 cube mesh within each SIP (pe0-only, same-lane), then inter-SIP exchange on root cube, then broadcast back. Supports ring_1d, torus_2d, and mesh_2d_no_wrap SIP topologies driven by topology.yaml. Integrated with dist.init_process_group / dist.all_reduce. New files: - src/kernbench/ccl/algorithms/intercube_allreduce.py (kernel) - src/kernbench/ccl/sfr_config.py (configure_sfr_intercube_multisip) - tests/test_allreduce_multidevice.py (config-driven, 3 topologies) - tests/test_distributed_intercube_allreduce.py (full distributed path) - tests/test_intercube_sfr_config.py (SFR wiring verification) Modified: - distributed.py: AhbmCCLBackend uses configure_sfr_intercube_multisip - topologies.py: added torus_2d, mesh_2d_no_wrap - install.py: global_E/W/N/S in _OPPOSITE_DIR - topology.yaml: added system.sips.topology - ccl.yaml: single intercube_allreduce algorithm - benches/ccl_allreduce.py: row_wise cube-mesh tensor layout Removed old flat-ring algorithms and their tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 17:33:42 -07:00
ywkang	cfc2d74ec4	Refactor ccl_allreduce bench: rank=SIP only, remove rank=PE legacy path The unified ccl_allreduce bench previously carried two execution models in one worker with ``if world_size == n_sips:`` branching: - TP mode (rank = SIP, ADR-0024/0027): proper ProcessGroup semantics. - Legacy rank = PE mode: single-driver worker allocating one big tensor distributed across all PEs via _derive_dp, with kernel-level SPMD via program_id. The second model is unnecessary -- intra-SIP PE-level collectives are expressed inside the kernel (tl.send/tl.recv with program_id, IPCQ) and do not need a host-side ProcessGroup. Removing it lets the bench be a clean reference implementation of the TP launcher. benches/ccl_allreduce.py: - Config resolved once in run() via _resolve_cfg -> _BenchCfg dataclass. - rank != n_sips now raises RuntimeError explicitly. - _worker / _allocate_rank_tile / _init_with_rank_value / _report each have one concern; duplicated init + verification paths collapsed. - _derive_dp and the second verify+print block deleted. - 166 lines -> 91 lines. ccl.yaml: - mesh_allreduce_4 (world_size: 4) and tree_allreduce_7 (world_size: 7) algorithm entries removed (rank = PE only). - Algorithm kernel files (kernbench.ccl.algorithms.mesh_allreduce, tree_allreduce) kept as-is for direct-dispatch future use. tests/test_ccl_allreduce_matrix.py: - Matrix shrinks from 7 cases to 3: ring × {tcm, hbm, sram} at ws = topology SIP count (= 2). mesh_2x2, tree_binary_7, ring_multi_cube, and the three ring_*_8 cases removed. tests/test_ccl_performance.py: - _run_8rank renamed to _run_ring; world_size: 8 override dropped; now exercises rank = SIP ring all-reduce. tests/test_mp_spawn.py, tests/test_ccl_ddp_launcher.py: - Monkeypatch target updated from bench.worker to bench._worker (signature now takes BenchCfg instead of (rank, world_size)). 555 passed, 1 intentional skip. Tests that directly call install_ipcq(world_size_override=N) for kernel-level sanity (test_ccl_hello_world_guide, test_recv_copy_to_dst, test_tl_recv_async, test_ccl_deadlock_detection) are unchanged -- they never went through the bench and still exercise the kernel-only path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 16:45:27 -07:00
ywkang	105f1dc09e	ADR-0027: Megatron TP API + worker-wait generalization + mp.spawn Implements ADR-0027 Phase 2 end-to-end. All 559 tests pass (was 523 + 1 xfail; ring_default_ws strict-xfail is now resolved). D0 -- Worker-wait generalization (context.py): - _pending_worker_waits queue on RuntimeContext. - ctx.wait(h) in worker context defers to main via g.parent.switch(). Fast-path for already-completed handles. - Worker API is unchanged: tensor deploy, launch, etc. still look synchronous; they're transparently cooperatively scheduled. - Solves ADR-0024 Phase B kernel-greenlet orphan bug (env.run now only ever drives from main; kernel _parent is always main). D0.5 -- Host-read barrier (tensor.py): - Explicit _HOST_READ_BARRIERS registry (T5.g closed-set via code review, not reflection-magic). - numpy/data/__getitem__/__repr__ drain pending worker-waits before host-observable read. - copy_: source-side barrier via source.numpy(). Target-side write barrier is intentionally NOT applied -- global pending target barrier prematurely drains cross-rank collectives → deadlock. - Collective pending is excluded from barrier drain condition (collective is cross-rank; its own yield in all_reduce covers the invariant naturally). D1 -- torch.multiprocessing.spawn (runtime_api/multiprocessing.py): - API signature parity with real PyTorch spawn; execution is cooperative greenlet scheduler (process isolation etc. are explicit non-goals per D1.0). - _drain_pending drains worker-waits then collectives in one barrier, loop-until-empty. - Round-based exception handling with SystemExit sibling abort + SpawnException(errors) wrapping root-cause ranks. - RuntimeContext attaches ctx.multiprocessing in __post_init__. - benches/ccl_allreduce.py hand-rolled loop collapses to one torch.multiprocessing.spawn call. D2–D6 -- kernbench.tp package: - parallel_state: initialize_model_parallel, get_*_rank, get_*_world_size, with weak active-ctx registry in context.py. - layers: ColumnParallelLinear, RowParallelLinear (shape-only primitives -- fp16 gemm via tl.load + tl.dot + tl.store). - kernels: _gemm_kernel used by TP layers (self-contained; no bench dependency). - primitives / mappings stubs per D6/D8. Data-path fixes (surfaced by TP gemm + all_reduce sequence): - sim_engine/op_log.py: dma_write snapshot is skipped for TCM sources (PE scratch is repopulated by Phase 2 math/gemm replay -- capturing Phase-1-time snapshot picked up STALE data from prior kernel's output aliased at the same scratch addr, causing the later kernel's dma_write to overwrite Phase 2 result with stale value). - sim_engine/op_log.py + sim_engine/data_executor.py: per-operand space recorded on GemmCmd and composite gemm records so HBM-resident operands (tl.load output) don't default to TCM during replay. - runtime_api/context.py: ctx.zeros writes zero-init to MemoryStore at VA keys so kernels reading via VA see deterministic init even without explicit copy_(). Tests (Phase 1 + Phase 2): - test_worker_wait_drain (T3): orphan invariant + resume + multi-rank drain + idempotency + exception propagation. - test_mp_spawn (T4): spawn shape + bind + SpawnException scope. - test_host_read_barrier (T5): barrier contract per entry-point + closed-set registry check. - test_tp_parallel_state (T1): initialize + rank lookup. - test_tp_layers (T2): shape + deterministic numerical correctness (concat-matmul equality for RowParallel, not mean-only). - test_tp_mlp (T6): full 2-layer MLP with deterministic weight numerical match + rank-consistency post all-reduce. - test_ccl_allreduce_matrix: ring_default_ws xfail removed (T7). Regression: 523 pre + 35 new + 1 ex-xfail = 559 passed, 1 intentional skip (T3.e historical failure documentation). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 16:31:13 -07:00
ywkang	357cab525b	ADR-0026: DPPolicy intra-device only + ShardSpec structural coords DPPolicy no longer carries a cross-SIP axis. SIP-level placement is solely controlled by torch.ahbm.set_device(rank) (ADR-0024); DPPolicy itself describes only the cube × PE layout within one SIP. ShardSpec switches to structural (sip, cube, pe) coordinates; the flat pe_index field/property is fully removed -- silent drift between global-flat and SIP-local interpretations was a foot-gun flagged by ADR-0024 D11. Breaking API (explicit TypeError / AttributeError): - DPPolicy(sip=...) / DPPolicy(num_sips=...) -> TypeError - ShardSpec.pe_index -> AttributeError - ShardSpec(pe_index=...) -> TypeError - resolve_dp_policy now takes target_sip= (required), no num_sips. Downstream migration: - PE allocator dict keyed by (sip, cube, pe) tuples, in both _ensure_allocators and _free_tensor. deploy_tensor uses tuple lookup. - _create_tensor passes target_sip=current_sip; post-hoc pe_index shifting removed entirely. - launch._compute_local_shape drops the dp.sip branch. - Internal resolvers (column_wise / row_wise / replicate / tiled_*) return _LocalPeShard (cube-local identifier) instead of ShardSpec -- resolve_dp_policy lifts them to full structural coords. Tests: - New tests/test_adr0026_dppolicy_intra_device.py (12 tests) pins the contract end-to-end. - test_sip_parallel.py rewritten: SIP composition now modeled as two resolve_dp_policy(target_sip=...) calls (ADR-0024 launcher style). - Call-site migration: test_tensor, test_va_integration, test_va_offset, test_runtime_api_tensor, test_tl_recv_async, test_ccl_* and benches gemm_single_pe, gpt3_qkv, va_offset_verify, ccl_allreduce (legacy branch) all use intra-device DPPolicy and structural ShardSpec. Result: 523 passed, 1 strict xfail (ring_default_ws -- unchanged ADR-0024 Phase B blocker; architectural fix deferred to ADR-0027). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 13:02:19 -07:00
ywkang	787409ced1	ADR-0024 Phase B: update xfail reason with architectural blocker details Phase B Option A (freeze + defer to ADR-0027): the root cause of ring_default_ws strict-xfail is that bench workers call torch.zeros / copy_ which drive env.run in the WORKER-greenlet context. Any pending KernelLaunchMsg gets stepped inside that worker, spawning kernel_runner with parent = worker (not main). When the worker yields/finishes, the kernel greenlet is orphaned and its next switch_to_simpy raises GreenletExit mid-add — producing rank 0 mean=1 (expected 3). This is a larger architectural redesign (lazy-deploy tensor API, coroutine worker, or setup/verify split) and is parked until ADR-0027 (Megatron TP) starts, where the proper solution ships with TP use cases. No production changes; xfail reason + inline comment only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 12:46:33 -07:00
ywkang	79124daab1	ADR-0024 Phase B (partial): scheduler-level collective drain Root cause (hang diagnosis): `kernel_runner.run()` captures `greenlet.getcurrent()` at spawn time as the kernel greenlet's `_parent`. When a worker greenlet (say g0) calls `dist.all_reduce` → `ctx.wait(h)` → `env.run(until=h0)`, the SimPy scheduler steps pe_cpu processes, which in turn spawn kernel greenlets. Those kernels' `_parent` becomes g0 (current greenlet at spawn). When a kernel yields via switch_to_simpy, control jumps back up to g0's LAST switch point — which is the main scheduler's `g.switch()` call — rather than the kernel_runner's generator frame. Main then re-enters its `for g in alive: g.switch()` loop mid-wait, producing nested greenlet re-entry. Scheduler spins: g0 never completes, g1 appears to complete out of order, infinite loop at 100% CPU. Fix: - AhbmCCLBackend.all_reduce: in multi-greenlet mode, submit via launch(_defer_wait=True), extend backend._pending_collective_handles, and yield to the parent greenlet. Worker does NOT call wait. - benches/ccl_allreduce.py run(): after each scheduler round, the MAIN greenlet drains backend._pending_collective_handles. This keeps env.run invocation in the main context, so kernel_runner's spawned kernel greenlets have main as their _parent — no nested re-entry. - Legacy single-driver path (no bench scheduler): all_reduce falls back to inline wait when g.parent is None. Result: - Multi-greenlet cross-SIP ring no longer hangs (was 100% CPU infinite loop in kernel_runner._switch_kernel). - ring_default_ws still xfail(strict=True): now fails as a data correctness issue — DataExecutor reports only 1 math op for a 2-rank ring (expected 2). Cross-SIP op_log replay integration is the remaining Phase B task. 514 passed, 1 xfailed (strict). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 09:14:03 -07:00
ywkang	4ba0a83e71	Implement ADR-0024 Phase A: SIP-level TP launcher MVP Scope (Phase A): - D1: world_size fallback = SIP count (rank = SIP, TP boundary) - D9: greenlet-local get_rank + _bind_rank (single-driver fallback = 0) - D10: torch.ahbm.set_device + torch.accelerator.set_device_index alias - D11: tensor placement scoped to current-device SIP (post-hoc pe_index shift — ADR-0026 replaces with structural coords) - D12/D13: multi-greenlet run() with simple round-robin scheduler; hybrid dispatch (ws == SIP count → multi-greenlet, else legacy single-worker for ccl.yaml override compat) - D7 partial: backend.all_reduce submit + yield + wait via launch()'s new _defer_wait flag; parent-less greenlets skip yield - Relaxed shard-count check (len(shards) > 0 instead of == world_size) - rank_to_pe = SIP-representative [(r, 0, 0)] when ws <= n_sips Deferred to Phase B: - Engine-routed install (D2) — keeps sideband - install_plan.py module (D6) — keeps install.py - Epoch barrier (D7 full) — simple yield is sufficient for ring ws=2 mock - Validator registry (D8) - Cross-SIP multi-greenlet + real kernel integration — matrix ring_default_ws hangs in SimPy drain despite ADR-0025 direction fix; marked xfail(run=False) pending Phase B diagnosis (suspected per-rank kernel_args / program_id mismatch) Tests: - test_ccl_ddp_launcher.py (6 new tests) — D1/D9/D10/D11/D12/D13 - test_ccl_allreduce_matrix.py — ring_default_ws xfail'd, override cases (ring_tcm_8 / hbm_8 / sram_8 / multi_cube / mesh_2x2 / tree_binary_7) all pass via legacy path 514 tests pass, 1 xfail. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 09:00:28 -07:00
ywkang	32536daf2e	Fix ADR-0025: IPCQ direction addressing via address-based matching 2-rank bidirectional ring deadlock: when E and W neighbors point to the same peer, sender-coord matching in _handle_meta_arrival / _credit_worker picked the first direction in dict order, landing data in the wrong rx slot relative to what the kernel recv(W) was waiting on. Fix (ADR-0025 D1/D2/D3): - install.reverse_direction: prefer OPPOSITE direction (E↔W, N↔S) when peer has it pointing back to us; fallback to any matching for topologies without opposite convention (tree_binary parent/child). - _handle_meta_arrival: match by token.dst_addr range against each qp's my_rx_base_pa + n_slots × slot_size window (unambiguous). - _credit_worker: match by credit.dst_rx_base_pa == qp.peer.rx_base_pa. - IpcqCreditMetadata: new dst_rx_base_pa field carrying receiver-side rx base; _delayed_credit_send fills it from the consuming qp. Tests (Phase 1 → Phase 2): - test_reverse_direction_opposite_preference_2rank_ring - test_reverse_direction_opposite_preference_4rank_ring_sanity - test_meta_arrival_matches_by_dst_addr_same_peer - test_credit_matches_by_dst_rx_base_pa_same_peer - Existing credit-return test updated with dst_rx_base_pa. 508 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 00:38:41 -07:00
ywkang	10b33b44ba	Add Tensor indexing + hierarchical 3-level all-reduce kernel Tensor.__setitem__ / __getitem__: - Shard-aligned slice assignment and read on deployed tensors. - Scalar broadcast and numpy array assignment supported. - Cross-shard slices raise NotImplementedError (use copy_ for that). - 3 new tests: single-PE, multi-PE, cross-shard error case. Hierarchical all-reduce kernel (src/kernbench/ccl/algorithms/): - 3-level reduce: intra-cube (E/W) → inter-cube (N/S) → inter-SIP (parent). - Bidirectional ring reduce at each level: ceil((N-1)/2) rounds. Left half sends via dir_dec, right half via dir_inc (wrap). Representative receives from both sides. - Chain broadcast for reverse path: cube 0 PE 0 → all PE 0s → all PEs. - Registered in ccl.yaml as "hierarchical_allreduce" with topology: none (neighbors() override builds the full 3-level neighbor map). - kernel_args derives pes_per_cube/cubes_per_sip/num_sips from world_size. - Mock-verified at 8/16/32/64/128 ranks. Mock runtime fixes: - Direction pairing: explicit N↔S, E↔W, parent↔parent instead of "first matching reverse". Fixes 2-element rings where N and S both point to the same peer. - Deadlock detection: send-counter based (not just queue-depth-total) to catch chain reductions where send+recv pairs net to zero. - Multi-cube program_id: pes_per_cube parameter enables program_id(axis=0) = PE within cube, program_id(axis=1) = cube id. Legacy single-cube tests unaffected (default = world_size). 504 tests pass in 12s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 23:52:04 -07:00
ywkang	74f5f5cf08	Add session-scoped topology fixture in tests/conftest.py Provides a shared `topology` fixture that caches the parsed topology.yaml result per pytest-xdist worker session. Tests that build a GraphEngine can accept `topology` instead of calling resolve_topology("topology.yaml") repeatedly. Topology parsing costs ~32ms, so the practical saving per worker is modest (<1s across all tests). The fixture is mainly for architectural cleanliness — keeping the "parse once, build engine many" pattern explicit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 21:13:25 -07:00
ywkang	bcf941dcee	Speed up regression: 25min → 6min (test matrix + DataExecutor cleanup) Test matrix restructure: - 256-rank full-system ring runs only ONCE (marked pytest.mark.slow) instead of 7× across matrix + perf tests. Cross-SIP routing is verified by the single run; buffer variants (tcm/hbm/sram) are tested at 8-rank where they finish in <0.5s. - Performance tests use 8-rank instead of 256-rank. - `pytest -m "not slow"` completes in ~2.5min (local dev). - Full suite including slow: ~6min (CI). DataExecutor optimization: - Remove ThreadPoolExecutor from DataExecutor.run(). Same-t_start groups are almost always size 1, so the thread pool creation and dispatch overhead dominated. Simple sequential loop is faster. - Skip dma_read ops at the loop level (they are always no-ops in Phase 2 but were dispatched through _execute_op → _execute_memory). - Remove redundant CLI Phase 2 re-execution: engine._flush_data_phase already replays during engine.wait(); the CLI now only prints the diagnostic summary without re-running DataExecutor. 502 tests pass. Wall time: 25m30s → 5m43s (full), 2m28s (no slow). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 20:52:07 -07:00
ywkang	998cc85762	Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023) Major changes: PE-level IPCQ infrastructure: - New PE_IPCQ component: ring-buffer control plane with 4-direction neighbor mapping, head/tail pointers, backpressure (poll/sleep). - PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA, including in-flight data snapshot (D9) and op_log recording at outbound time for Phase 2 replay correctness. - IpcqDmaToken piggyback model: data + metadata travel together, atomic visibility at receiver (invariant I6). - Credit return fast path: bottleneck-BW latency, no fabric vc_comm. Phase 2 data execution (ADR-0020 integration): - op_log extended: DmaWriteCmd now captures src_space/src_addr for Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time. - DataExecutor replays dma_write + ipcq_copy in t_start order. - Engine._flush_data_phase: incremental cursor-based replay after each engine.wait() so host reads see post-Phase-2 data. - KernelRunner Phase 1 writes disabled when op_log is active to prevent stale data from corrupting the MemoryStore snapshot. TLContext / kernel API: - tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype), tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode. - TensorHandle operator overloading (add/sub/mul/div) via thread-local active TLContext → MathCmd dispatch through PE_MATH. - PE-local scratch allocator for math output handles. - tl.load returns space="hbm" handles for correct Phase 2 addressing. - Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv. Unified ccl_allreduce bench (PyTorch-compat host code): - Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch) split matching real PyTorch DDP worker pattern. - torch.distributed facade: init_process_group, get_world_size, get_rank, get_backend, all_reduce, barrier -- only real PyTorch names. - AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches kernel via tensor shard metadata (n_elem from shards[0].nbytes). - world_size derived from topology spec (sips × cubes × pes_per_cube) with optional algorithm-level override in ccl.yaml. Tensor API (PyTorch-compat surface): - Tensor.numpy(): gather-aware (all shards via VA-based addressing). - Tensor.copy_(source): scatter from host tensor into sharded target. - RuntimeContext.from_numpy(arr): host-side staging tensor. - Tensor.data property fixed to use numpy() (was shards[0]-only). Algorithm modules moved to src/kernbench/ccl/algorithms/: - ring_allreduce, mesh_allreduce, tree_allreduce, hello_send. - Each module exports kernel_args(world_size, n_elem) helper. - ccl.yaml module paths updated to kernbench.ccl.algorithms.*. Dead code removed: - 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.). - _run_ccl_bench greenlet-per-SIP scheduler. - benches.loader.is_ccl_bench + run_rank detection. - benches/ccl/ directory. Tests: - New test_ccl_allreduce_matrix.py: 7 parametrized cases (ring×3 buffers, ring 8/16, mesh 4, tree 7). - New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests. - Existing tests updated for new import paths + world_size_override. Docs: - Korean ccl-author-guide.md and ADR-0023 paths updated. - New English versions: ccl-author-guide.en.md, ADR-0023.en.md. 502 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 19:36:59 -07:00
ywkang	ff2c677a9c	Add 2D grid program_id semantics (ADR-0022) tl.program_id(axis=0) returns local PE id within cube, tl.program_id(axis=1) returns cube id. Enables cube-aware sharding in benchmark kernels. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 16:49:56 -07:00
ywkang	dc3fb02aed	Add --verify-data CLI flag, Tensor.data property, parallel DataExecutor - CLI: --verify-data flag enables Phase 2 data verification (ADR-0020) - Tensor.data: returns actual numpy values (verify-data) or zeros placeholder - Tensor.__repr__: shows value summary or data=N/A (placeholder) - DataExecutor: ThreadPoolExecutor for same-timestamp parallel op execution - BenchResult.engine: exposes op_log/memory_store for Phase 2 access Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 09:34:01 -07:00
ywkang	59e36f0c34	Add E2E pipeline tests: greenlet op_log, GEMM accuracy, latency regression ADR-0020 + ADR-0021 final verification: - CompositeCmd GEMM/Math pipeline completes through full chain - Greenlet mode generates op_log records (memory + gemm ops) - Phase 1→Phase 2: MemoryStore seed → greenlet → op_log → DataExecutor → allclose - Latency determinism: same kernel produces identical latency - Multi-tile > single-tile latency invariant 388 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 00:28:03 -07:00
ywkang	81ce55571d	Rename impl names: add builtin. prefix for clear provenance - components.yaml: all builtin impls use builtin.xxx naming - topology.yaml: all impl references updated to builtin.xxx - builder.py: hardcoded ucie impl → builtin.ucie - Tests: all impl string references updated Convention: builtin.<name> for built-in, custom.<name> for user-defined. 382 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 00:16:24 -07:00
ywkang	95d583ef9f	Add Phase 1→Phase 2 e2e data tests + GraphEngine enable_data mode GraphEngine(enable_data=True): - Creates MemoryStore + OpLogger - Injects op_logger into all components - Exposes engine.op_log and engine.memory_store properties E2E tests (test_e2e_data.py): - Engine data mode creates store + logger - Default engine has no store - PeDmaMsg completes successfully with data mode - DataExecutor GEMM accuracy: random f16 matmul with f32 accumulation - DataExecutor chain: GEMM → exp correctness - DataExecutor verify API: pass/fail per tensor - MemoryStore snapshot isolation between Phase 1 and Phase 2 382 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:49:28 -07:00
ywkang	f5d1606f9d	Add ADR-0021 pipeline tests: self-routing, tiling, overlap Test plan items 3-5: - TileToken self-routing: advance(), stage sequence, chain traversal - PipelineContext: completion tracking, exactly-once contract - Tiling plans: GEMM tile count, stage sequence, intermediate K no DMA_WRITE - Math plan: READ→FETCH→MATH→STORE→WRITE sequence - Pipeline overlap: SimPy simulation verifying intra-command tile overlap 9 new tests, all passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:40:19 -07:00
ywkang	b6eb97c49a	Implement ADR-0021: PE pipeline refactor with token self-routing Step 1-2: Backup existing code - builtin/ → builtin_legacy/ (unchanged backup) - custom/pe_accel/ → custom/pe_accel_legacy/ (unchanged backup) Step 3-4: New pipeline types and tiling - pe_types.py: StageType, Stage, TilePlan, PipelinePlan, PipelineContext, TileToken - tiling.py: generate_gemm_plan, generate_math_plan (ported from pe_accel) Step 5: Component implementations (ADR-0021 D4-D6) - PE_SCHEDULER: _feed_loop (singleton FIFO feeder) + plan generation - PE_FETCH_STORE: new component — TCM ↔ Register File - PE_GEMM: TileToken pipeline + legacy PeInternalTxn dual-mode - PE_MATH: TileToken pipeline + legacy dual-mode - PE_DMA: TileToken pipeline + legacy + fabric Transaction triple-mode - PE_TCM: TcmRequest handler with dual-channel BW serialization Step 6: Infrastructure - topology.yaml: pe_fetch_store component + chaining edges - components.yaml: pe_fetch_store_v1 registration - builder.py: PE_COMP_OFFSETS, _add_pe_internal_edges, PE view positions - Tests: node/edge counts, PE component sets updated All components handle both TileToken (pipeline) and PeInternalTxn (legacy). Token self-routing: components read next stage from token.plan, chain via out_port. 366 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:35:31 -07:00
ywkang	51004c311c	Implement ADR-0020: 2-pass data execution with greenlet kernel runner Step 1 — Foundation: - OpRecord/OpLogger: op log infrastructure with t_start stable ordering - MemoryStore: numpy ndarray tensor-granular storage (reference semantics) - data_op=True flag on DmaReadCmd, DmaWriteCmd, GemmCmd, MathCmd, CompositeCmd - numpy/greenlet dependencies added to pyproject.toml Step 2 — ComponentBase hooks: - _on_process_start/end hooks in _forward_txn (fabric messages) - _handle_with_hooks in PeEngineBase (PE-internal commands) - op_logger optional — zero overhead when disabled Step 3 — KernelRunner + greenlet: - KernelRunner: greenlet ↔ SimPy bridge in triton_emu/kernel_runner.py - TLContext: _emit() method routes to greenlet switch or command list - tl.load() returns real numpy data in greenlet mode - Dynamic control flow supported (memory-read based branching) Step 4 — PE_CPU integration: - Greenlet mode when ctx.memory_store is set, legacy fallback otherwise - Refactored into _execute_greenlet/_execute_legacy/_send_response - ComponentContext gains memory_store and op_logger fields Step 5 — DataExecutor: - Phase 2 numpy execution for GEMM/Math ops from op_log - _compute_math: all unary/binary/reduction ops - verify(): compare MemoryStore against expected with dtype tolerance 28 new tests, 366 total passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 00:22:44 -07:00
ywkang	eb792e6212	Remove xbar/noc remnants, rule-based cube-view connectors - Delete xbar.py and noc.py (TwoDMeshNocComponent) — unused since router mesh - Remove xbar_v1/noc_2d_mesh_v1 from components.yaml - Fix pe_to_xbar → pe_to_router in routing exclusion set - Fix xbar_to_hbm_bw_gbs → hbm_to_router_bw_gbs in report.py - Update all docstrings/comments referencing xbar/bridge → router mesh - Cube-view connectors: rule-based _connector_points helper - PE↔router: single diagonal line (not chevron) - UCIe N/S: 45°→horizontal→45° - UCIe E/W: 45°→vertical→45° - HBM ports: 45°→horizontal→45° Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 23:59:12 -07:00
ywkang	7640635f90	M_CPU/SRAM placement via pos_mm in topology.yaml (nearest router) Component placement uses mm coordinates in topology.yaml, mesh_gen finds the nearest router automatically. M_CPU moved to pos_mm=[7.5,2.0] (→ r0c2), SRAM at pos_mm=[1.5,9.0] (→ r3c0). No hardcoded router references in topology config. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 00:48:20 -07:00
ywkang	e94f1de078	Cube-view SVG: detailed topology validation rendering - Dedicated cube_view renderer showing 6×6 router grid with attachments - PE blocks drawn next to their router (above/below) - HBM pseudo channel port bar (64 ports, color-coded by PE owner) - Per-PE BW annotations on HBM links - Router color-coded by type (PE/M_CPU/SRAM/UCIe/relay) - Title shows mode, channel count, per-PE and total BW - Legend for all component types Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 22:03:38 -07:00
ywkang	91085733ba	Show individual routers in cube_view SVG, fix row Y overlap - cube_view now renders all 32 router nodes from cube_mesh.yaml instead of collapsed "router_mesh" placeholder - Fix mesh_gen row Y position overlap (r1/r2 and r3/r4 had same Y) by adding hbm_gap spacing between PE rows and HBM zone - Add noc_router to visualizer KIND_SIZE for proper sizing - Update cube view tests for individual router nodes 339 passed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 18:22:38 -07:00
ywkang	d2c92b8a18	Wire PE_MMU to router mesh for MmuMapMsg delivery Add router → PE_MMU edge so MmuMapMsg can reach PE_MMU via the router mesh. Unskip all PE_MMU fabric tests. 339 passed, 0 skipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 18:10:42 -07:00
ywkang	08256c1326	Fix cross-SIP PE_TCM access by scoping deploy to target_device SIP RuntimeContext._ensure_allocators() now limits SIP range to target_device (single SIP or all). Prevents cross-SIP tensor deployment that caused PE_TCM routing errors. Also accept 'sip0' format (without colon) in DeviceSelector. 331 passed, 8 skipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 18:03:11 -07:00
ywkang	5917b3497c	Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019) - Remove xbar_top/bot, bridge, single noc node from topology - Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col}) - HBM_CTRL consolidated to single node per cube, attached to all routers - All traffic (DMA data + PE command) routes through same router mesh - Update AddressResolver (no slice suffix), PathRouter (_adj_local) - Update ADR-0002 through ADR-0019, SPEC.md to remove xbar/bridge references - Regenerate SVG diagrams for new topology structure - Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired) 326 passed, 13 skipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 17:51:28 -07:00
ywkang	63669f82cb	Add SIP-level tensor parallelism, component registry YAML, VA offset verification - DPPolicy: 3-level (sip/cube/pe), unified naming (column_wise/row_wise) - PE_CPU: auto num_programs from cube shard count - context.launch(): per-SIP KernelLaunchMsg with local va_base + auto local shape - deploy_tensor: removed mmus param, MMU mapping is context-only responsibility - ComponentRegistry: YAML-based lazy loading (components.yaml), impls→builtin rename - VA offset bench + tests: 2D/1D, standard Triton kernel pattern Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 01:13:17 -07:00
ywkang	08812eda58	Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg Implement VA/MMU layer (ADR-0011 Phase 1) enabling Triton kernels to use contiguous virtual addresses on sharded tensors. Key changes: - PE_MMU component: hybrid inbox (MmuMapMsg) + sync translate() for PE_DMA - VirtualAllocator + PEMemAllocator: free-list with coalescing - MmuMapMsg/MmuUnmapMsg fabric path with SIP-level routing - DPPolicy-based mapping: replicate=local, sharded=broadcast - Tensor lifecycle: del + weakref cleanup, context manager - Rename: TensorHandle.pa→addr, DmaReadCmd.src_pa→src_addr, ctx→torch Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 00:01:47 -07:00
ywkang	62fb01ae18	Add reverse path response latency for PE DMA and PE_CPU→M_CPU Model fabric response hop latency for PE-internal operations: - HBM_CTRL sends PeDmaMsg response on reverse path instead of direct done signal - PE_CPU sends ResponseMsg via NOC→M_CPU on kernel completion - Add NOC→PE_DMA and PE_CPU→NOC edges in topology builder - Make HBM BW test assertions dynamic based on topology efficiency Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 15:40:56 -07:00
ywkang	d75da439c6	Add probe CLI improvements, D2H read, UCIe/HBM tuning, BW sweep - Probe CLI: restructured output (tables first, routes below), per-hop timestamps, split cross-cube into best/worst cases, D2H read section - UCIe overhead: 1ns -> 8ns per port (16ns per crossing) to fix cross-cube-best < cross-half latency inversion - HBM efficiency: added efficiency=0.8 factor to hbm_ctrl, reducing effective BW from 256 to 204.8 GB/s - Multi-size BW sweep: saturation tables (4KB-1MB) for all probe cases - Probe default data size: 4KB -> 32KB for more realistic measurements - IOChiplet NOC + D2H topology and tests - NOC mesh, xbar, BW occupancy components and tests - Cube mesh visualization diagram 278 tests pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 01:16:18 -07:00
ywkang	6f43807900	commit - release 1	2026-03-18 11:47:48 -07:00
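
The HBM and UCIe tuning figures quoted in commit d75da439c6 (256 GB/s reduced to 204.8 GB/s by efficiency=0.8; 8 ns per port giving 16 ns per crossing) follow from simple arithmetic. A minimal sketch of that arithmetic is below. The function names are hypothetical and not taken from the repository, and the assumption that one crossing traverses two UCIe ports is inferred from the 8 ns / 16 ns pair in the commit message.

```python
def effective_hbm_bw_gbs(peak_gbs: float, efficiency: float) -> float:
    """Effective bandwidth after a multiplicative efficiency factor.

    Hypothetical helper; the commit only states the factor and the result.
    """
    return peak_gbs * efficiency


def ucie_crossing_overhead_ns(per_port_ns: float, ports_per_crossing: int = 2) -> float:
    """Fixed overhead of one die-to-die crossing.

    Assumes each crossing passes through two ports (egress and ingress).
    """
    return per_port_ns * ports_per_crossing


if __name__ == "__main__":
    # Values taken from the commit message: 256 GB/s peak, efficiency 0.8.
    print(round(effective_hbm_bw_gbs(256.0, 0.8), 1))
    # 8 ns per port, two ports per crossing.
    print(ucie_crossing_overhead_ns(8.0))
```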

42 Commits