kernbench2

Author	SHA1	Message	Date
ywkang	e7f376ebaa	ADR-0027 rev7 (Megatron TP + worker-wait generalization) + ADR-0026 typo fix ADR-0027 is a design-only change (no production code). Rev 7 closes design across 7 iterations of review. Key decisions: - D0 (worker-wait generalization): ctx.wait in worker context yields to main scheduler, which drains env.run. Solves ADR-0024 Phase B orphan bug (ring_default_ws strict xfail). Normative contracts on resume invariant, fast-path, main-context non-reentrance, barrier loop-until-empty, and scheduler non-progress as user contract. - D0.5 (host-read barrier): Tensor.numpy/data/__getitem__/__repr__/copy_ auto-drain pending before reading. Closed-set via explicit registry (T5.g). copy_ uses global-pending barrier with explicit over-serialization tradeoff. - D1 (torch.multiprocessing.spawn): real-PyTorch API-signature parity, cooperative greenlet scheduler internally. Explicit non-goal on process isolation / address space / failure isolation. Sibling cleanup via SystemExit + SpawnException(errors) wrapping root-cause ranks. - D4/D5 (TP layers): ColumnParallelLinear / RowParallelLinear use torch.launch(gemm_kernel) — no host-side torch.matmul. Yield-safety contract normatively required for all TP forward paths. - Supersedes ADR-0024 D7/D12/D13 as design (none landed). Source of truth declared normative. Test strategy: T1-T8 with numerical-correctness primary (not mean/aggregate-only), orphan invariant direct assertion, host-read barrier closed-set via registry. Phase 2 acceptance = 524 passed + 0 xfail (ring_default_ws unblocked by D0). ADR-0026 typo fix: torch.cuda.set_device → torch.ahbm.set_device in DPPolicy docstring (ADR-0024 D10 convention). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 14:13:26 -07:00
ywkang	357cab525b	ADR-0026: DPPolicy intra-device only + ShardSpec structural coords DPPolicy no longer carries a cross-SIP axis. SIP-level placement is solely controlled by torch.ahbm.set_device(rank) (ADR-0024); DPPolicy itself describes only the cube × PE layout within one SIP. ShardSpec switches to structural (sip, cube, pe) coordinates; the flat pe_index field/property is fully removed — silent drift between global-flat and SIP-local interpretations was a foot-gun flagged by ADR-0024 D11. Breaking API (explicit TypeError / AttributeError): - DPPolicy(sip=...) / DPPolicy(num_sips=...) -> TypeError - ShardSpec.pe_index -> AttributeError - ShardSpec(pe_index=...) -> TypeError - resolve_dp_policy now takes target_sip= (required), no num_sips. Downstream migration: - PE allocator dict keyed by (sip, cube, pe) tuples, in both _ensure_allocators and _free_tensor. deploy_tensor uses tuple lookup. - _create_tensor passes target_sip=current_sip; post-hoc pe_index shifting removed entirely. - launch._compute_local_shape drops the dp.sip branch. - Internal resolvers (column_wise / row_wise / replicate / tiled_) return _LocalPeShard (cube-local identifier) instead of ShardSpec — resolve_dp_policy lifts them to full structural coords. Tests: - New tests/test_adr0026_dppolicy_intra_device.py (12 tests) pins the contract end-to-end. - test_sip_parallel.py rewritten: SIP composition now modeled as two resolve_dp_policy(target_sip=...) calls (ADR-0024 launcher style). - Call-site migration: test_tensor, test_va_integration, test_va_offset, test_runtime_api_tensor, test_tl_recv_async, test_ccl_ and benches gemm_single_pe, gpt3_qkv, va_offset_verify, ccl_allreduce (legacy branch) all use intra-device DPPolicy and structural ShardSpec. Result: 523 passed, 1 strict xfail (ring_default_ws — unchanged ADR-0024 Phase B blocker; architectural fix deferred to ADR-0027). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 13:02:19 -07:00
ywkang	787409ced1	ADR-0024 Phase B: update xfail reason with architectural blocker details Phase B Option A (freeze + defer to ADR-0027): the root cause of ring_default_ws strict-xfail is that bench workers call torch.zeros / copy_ which drive env.run in the WORKER-greenlet context. Any pending KernelLaunchMsg gets stepped inside that worker, spawning kernel_runner with parent = worker (not main). When the worker yields/finishes, the kernel greenlet is orphaned and its next switch_to_simpy raises GreenletExit mid-add — producing rank 0 mean=1 (expected 3). This is a larger architectural redesign (lazy-deploy tensor API, coroutine worker, or setup/verify split) and is parked until ADR-0027 (Megatron TP) starts, where the proper solution ships with TP use cases. No production changes; xfail reason + inline comment only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 12:46:33 -07:00
ywkang	79124daab1	ADR-0024 Phase B (partial): scheduler-level collective drain Root cause (hang diagnosis): `kernel_runner.run()` captures `greenlet.getcurrent()` at spawn time as the kernel greenlet's `_parent`. When a worker greenlet (say g0) calls `dist.all_reduce` → `ctx.wait(h)` → `env.run(until=h0)`, the SimPy scheduler steps pe_cpu processes, which in turn spawn kernel greenlets. Those kernels' `_parent` becomes g0 (current greenlet at spawn). When a kernel yields via switch_to_simpy, control jumps back up to g0's LAST switch point — which is the main scheduler's `g.switch()` call — rather than the kernel_runner's generator frame. Main then re-enters its `for g in alive: g.switch()` loop mid-wait, producing nested greenlet re-entry. Scheduler spins: g0 never completes, g1 appears to complete out of order, infinite loop at 100% CPU. Fix: - AhbmCCLBackend.all_reduce: in multi-greenlet mode, submit via launch(_defer_wait=True), extend backend._pending_collective_handles, and yield to the parent greenlet. Worker does NOT call wait. - benches/ccl_allreduce.py run(): after each scheduler round, the MAIN greenlet drains backend._pending_collective_handles. This keeps env.run invocation in the main context, so kernel_runner's spawned kernel greenlets have main as their _parent — no nested re-entry. - Legacy single-driver path (no bench scheduler): all_reduce falls back to inline wait when g.parent is None. Result: - Multi-greenlet cross-SIP ring no longer hangs (was 100% CPU infinite loop in kernel_runner._switch_kernel). - ring_default_ws still xfail(strict=True): now fails as a data correctness issue — DataExecutor reports only 1 math op for a 2-rank ring (expected 2). Cross-SIP op_log replay integration is the remaining Phase B task. 514 passed, 1 xfailed (strict). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 09:14:03 -07:00
ywkang	4ba0a83e71	Implement ADR-0024 Phase A: SIP-level TP launcher MVP Scope (Phase A): - D1: world_size fallback = SIP count (rank = SIP, TP boundary) - D9: greenlet-local get_rank + _bind_rank (single-driver fallback = 0) - D10: torch.ahbm.set_device + torch.accelerator.set_device_index alias - D11: tensor placement scoped to current-device SIP (post-hoc pe_index shift — ADR-0026 replaces with structural coords) - D12/D13: multi-greenlet run() with simple round-robin scheduler; hybrid dispatch (ws == SIP count → multi-greenlet, else legacy single-worker for ccl.yaml override compat) - D7 partial: backend.all_reduce submit + yield + wait via launch()'s new _defer_wait flag; parent-less greenlets skip yield - Relaxed shard-count check (len(shards) > 0 instead of == world_size) - rank_to_pe = SIP-representative [(r, 0, 0)] when ws <= n_sips Deferred to Phase B: - Engine-routed install (D2) — keeps sideband - install_plan.py module (D6) — keeps install.py - Epoch barrier (D7 full) — simple yield is sufficient for ring ws=2 mock - Validator registry (D8) - Cross-SIP multi-greenlet + real kernel integration — matrix ring_default_ws hangs in SimPy drain despite ADR-0025 direction fix; marked xfail(run=False) pending Phase B diagnosis (suspected per-rank kernel_args / program_id mismatch) Tests: - test_ccl_ddp_launcher.py (6 new tests) — D1/D9/D10/D11/D12/D13 - test_ccl_allreduce_matrix.py — ring_default_ws xfail'd, override cases (ring_tcm_8 / hbm_8 / sram_8 / multi_cube / mesh_2x2 / tree_binary_7) all pass via legacy path 514 tests pass, 1 xfail. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 09:00:28 -07:00
ywkang	32536daf2e	Fix ADR-0025: IPCQ direction addressing via address-based matching 2-rank bidirectional ring deadlock: when E and W neighbors point to the same peer, sender-coord matching in _handle_meta_arrival / _credit_worker picked the first direction in dict order, landing data in the wrong rx slot relative to what the kernel recv(W) was waiting on. Fix (ADR-0025 D1/D2/D3): - install.reverse_direction: prefer OPPOSITE direction (E↔W, N↔S) when peer has it pointing back to us; fallback to any matching for topologies without opposite convention (tree_binary parent/child). - _handle_meta_arrival: match by token.dst_addr range against each qp's my_rx_base_pa + n_slots × slot_size window (unambiguous). - _credit_worker: match by credit.dst_rx_base_pa == qp.peer.rx_base_pa. - IpcqCreditMetadata: new dst_rx_base_pa field carrying receiver-side rx base; _delayed_credit_send fills it from the consuming qp. Tests (Phase 1 → Phase 2): - test_reverse_direction_opposite_preference_2rank_ring - test_reverse_direction_opposite_preference_4rank_ring_sanity - test_meta_arrival_matches_by_dst_addr_same_peer - test_credit_matches_by_dst_rx_base_pa_same_peer - Existing credit-return test updated with dst_rx_base_pa. 508 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 00:38:41 -07:00
ywkang	e1084800ab	docs: add ADRs 0024–0031 for SIP-TP launcher stack ADR-0024 (SIP-level TP launcher): rank = SIP abstraction, engine-routed install, mp.spawn parity, epoch barrier, ShardSpec structural coords. ADR-0025 (IPCQ direction addressing): address-based matching for meta arrival and credit return; fixes 2-rank bidirectional ring deadlock. ADR-0026 (DPPolicy intra-device only): remove sip/num_sips fields; ShardSpec uses structural (sip, cube, pe); pe_index property removed. ADR-0027 (Megatron-style TP API): ColumnParallelLinear / RowParallelLinear on top of ADR-0024 launcher. Backlog until 0024/0025/0026 land. ADR-0028 (DTensor support): stub / future work. ADR-0029 (Hierarchical all-reduce): 3-level reduce using all_pes mapper and multi_pe_sip_local validator from ADR-0024. Backlog. ADR-0030 (IPCQ PhysAddr integration): blocked on ADR-0031. ADR-0031 (PhysAddr PE-resource extension): stub; local_offset range-based partition approach; specific ranges TBD. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 00:38:27 -07:00
ywkang	b2c52f0e34	Add English translations for ADR-0018, 0019, 0020, 0021 - ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping - ADR-0019: CUBE NOC per-channel and aggregated HBM connection model - ADR-0020: 2-pass data execution model (timing/data separation, greenlet) - ADR-0021: PE pipeline refactor (component separation + token self-routing) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 16:31:32 -07:00
ywkang	10b33b44ba	Add Tensor indexing + hierarchical 3-level all-reduce kernel Tensor.__setitem__ / __getitem__: - Shard-aligned slice assignment and read on deployed tensors. - Scalar broadcast and numpy array assignment supported. - Cross-shard slices raise NotImplementedError (use copy_ for that). - 3 new tests: single-PE, multi-PE, cross-shard error case. Hierarchical all-reduce kernel (src/kernbench/ccl/algorithms/): - 3-level reduce: intra-cube (E/W) → inter-cube (N/S) → inter-SIP (parent). - Bidirectional ring reduce at each level: ceil((N-1)/2) rounds. Left half sends via dir_dec, right half via dir_inc (wrap). Representative receives from both sides. - Chain broadcast for reverse path: cube 0 PE 0 → all PE 0s → all PEs. - Registered in ccl.yaml as "hierarchical_allreduce" with topology: none (neighbors() override builds the full 3-level neighbor map). - kernel_args derives pes_per_cube/cubes_per_sip/num_sips from world_size. - Mock-verified at 8/16/32/64/128 ranks. Mock runtime fixes: - Direction pairing: explicit N↔S, E↔W, parent↔parent instead of "first matching reverse". Fixes 2-element rings where N and S both point to the same peer. - Deadlock detection: send-counter based (not just queue-depth-total) to catch chain reductions where send+recv pairs net to zero. - Multi-cube program_id: pes_per_cube parameter enables program_id(axis=0) = PE within cube, program_id(axis=1) = cube id. Legacy single-cube tests unaffected (default = world_size). 504 tests pass in 12s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 23:52:04 -07:00
ywkang	1c8ddc2d03	Fix Phase 1 slot-overwrite race + PE_MATH latency model (n_slots=4 safe) Root cause: In ring all-reduce, PE_IPCQ's recv handler advances my_tail and issues a credit return immediately. With tight credit latency (0.12ns intra-cube), the sender can refill the slot BEFORE the receiver's outbound PE_DMA reads from it for the next send. The outbound snapshot then captures stale data from a later round. Fix: Propagate TensorHandle.data (captured at recv-time, before credit return) through the entire send chain: tl.send(src=handle) → IpcqSendCmd.data → IpcqDmaToken.data PE_DMA outbound already prefers token.data over MemoryStore read, so the recv-time snapshot is used for the in-flight data. This eliminates the race: the snapshot is captured before the slot can be overwritten. Additional fixes: - PE_MATH handle_command: compute SIMD latency from output tensor element count via _compute_ns(), using max(overhead_ns, compute_ns). Previously used overhead_ns=0.0 for all standalone MathCmd, making math ops take 0ns in SimPy. - DataExecutor secondary sort: same-t_start ops sorted by op_kind (memory < gemm < math) so IPCQ slot writes execute before math reads. - ipcq_copy recorded at INBOUND time (receiver PE_DMA arrival) instead of outbound. Inbound time is after fabric propagation, so it sorts correctly relative to the receiver's math. - record_copy accepts explicit snapshot parameter (from token.data). Result: N_ELEM=32 + 256-rank + n_slots=4 + cross-SIP now passes. n_slots reverted to 4 (the deeper buffer was a workaround, not needed). 502 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 23:02:19 -07:00
ywkang	74f5f5cf08	Add session-scoped topology fixture in tests/conftest.py Provides a shared `topology` fixture that caches the parsed topology.yaml result per pytest-xdist worker session. Tests that build a GraphEngine can accept `topology` instead of calling resolve_topology("topology.yaml") repeatedly. Topology parsing costs ~32ms, so the practical saving per worker is modest (<1s across all tests). The fixture is mainly for architectural cleanliness — keeping the "parse once, build engine many" pattern explicit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 21:13:25 -07:00
ywkang	372c987995	Reduce test time to 12s: shrink GEMM dims + enable pytest-xdist GEMM dimension reduction: - qkv_gemm.py: M,K,N = 128,256,128 → 32,64,32 (64 tiles → 1 tile). - qkv_gemm_multi_pe.py: same reduction. - Tests verify pipeline correctness, not large-matrix throughput. - Per-test time: 18s → 1.7s. 6 tests total: 108s → 10s. pytest-xdist parallel execution: - Add pytest-xdist to dev dependencies. - pyproject.toml addopts: -n auto (use all CPU cores), -m "not slow". - Default `pytest` runs 501 tests in ~12s (previously 148s). - Full suite including slow: `pytest -m ""` → 3m24s (previously 5m43s). pytest.mark.slow: - Registered in pyproject.toml markers section. - 256-rank full-system test is the only slow-marked test. - Run with: pytest -m "" (CI) or pytest (local dev, skips slow). 502 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 21:06:41 -07:00
ywkang	bcf941dcee	Speed up regression: 25min → 6min (test matrix + DataExecutor cleanup) Test matrix restructure: - 256-rank full-system ring runs only ONCE (marked pytest.mark.slow) instead of 7× across matrix + perf tests. Cross-SIP routing is verified by the single run; buffer variants (tcm/hbm/sram) are tested at 8-rank where they finish in <0.5s. - Performance tests use 8-rank instead of 256-rank. - `pytest -m "not slow"` completes in ~2.5min (local dev). - Full suite including slow: ~6min (CI). DataExecutor optimization: - Remove ThreadPoolExecutor from DataExecutor.run(). Same-t_start groups are almost always size 1, so the thread pool creation and dispatch overhead dominated. Simple sequential loop is faster. - Skip dma_read ops at the loop level (they are always no-ops in Phase 2 but were dispatched through _execute_op → _execute_memory). - Remove redundant CLI Phase 2 re-execution: engine._flush_data_phase already replays during engine.wait(); the CLI now only prints the diagnostic summary without re-running DataExecutor. 502 tests pass. Wall time: 25m30s → 5m43s (full), 2m28s (no slow). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 20:52:07 -07:00
ywkang	998cc85762	Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023) Major changes: PE-level IPCQ infrastructure: - New PE_IPCQ component: ring-buffer control plane with 4-direction neighbor mapping, head/tail pointers, backpressure (poll/sleep). - PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA, including in-flight data snapshot (D9) and op_log recording at outbound time for Phase 2 replay correctness. - IpcqDmaToken piggyback model: data + metadata travel together, atomic visibility at receiver (invariant I6). - Credit return fast path: bottleneck-BW latency, no fabric vc_comm. Phase 2 data execution (ADR-0020 integration): - op_log extended: DmaWriteCmd now captures src_space/src_addr for Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time. - DataExecutor replays dma_write + ipcq_copy in t_start order. - Engine._flush_data_phase: incremental cursor-based replay after each engine.wait() so host reads see post-Phase-2 data. - KernelRunner Phase 1 writes disabled when op_log is active to prevent stale data from corrupting the MemoryStore snapshot. TLContext / kernel API: - tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype), tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode. - TensorHandle operator overloading (add/sub/mul/div) via thread-local active TLContext → MathCmd dispatch through PE_MATH. - PE-local scratch allocator for math output handles. - tl.load returns space="hbm" handles for correct Phase 2 addressing. - Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv. Unified ccl_allreduce bench (PyTorch-compat host code): - Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch) split matching real PyTorch DDP worker pattern. - torch.distributed facade: init_process_group, get_world_size, get_rank, get_backend, all_reduce, barrier — only real PyTorch names. - AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches kernel via tensor shard metadata (n_elem from shards[0].nbytes). - world_size derived from topology spec (sips × cubes × pes_per_cube) with optional algorithm-level override in ccl.yaml. Tensor API (PyTorch-compat surface): - Tensor.numpy(): gather-aware (all shards via VA-based addressing). - Tensor.copy_(source): scatter from host tensor into sharded target. - RuntimeContext.from_numpy(arr): host-side staging tensor. - Tensor.data property fixed to use numpy() (was shards[0]-only). Algorithm modules moved to src/kernbench/ccl/algorithms/: - ring_allreduce, mesh_allreduce, tree_allreduce, hello_send. - Each module exports kernel_args(world_size, n_elem) helper. - ccl.yaml module paths updated to kernbench.ccl.algorithms.*. Dead code removed: - 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.). - _run_ccl_bench greenlet-per-SIP scheduler. - benches.loader.is_ccl_bench + run_rank detection. - benches/ccl/ directory. Tests: - New test_ccl_allreduce_matrix.py: 7 parametrized cases (ring×3 buffers, ring 8/16, mesh 4, tree 7). - New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests. - Existing tests updated for new import paths + world_size_override. Docs: - Korean ccl-author-guide.md and ADR-0023 paths updated. - New English versions: ccl-author-guide.en.md, ADR-0023.en.md. 502 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 19:36:59 -07:00
ywkang	ff2c677a9c	Add 2D grid program_id semantics (ADR-0022) tl.program_id(axis=0) returns local PE id within cube, tl.program_id(axis=1) returns cube id. Enables cube-aware sharding in benchmark kernels. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 16:49:56 -07:00
ywkang	dc3fb02aed	Add --verify-data CLI flag, Tensor.data property, parallel DataExecutor - CLI: --verify-data flag enables Phase 2 data verification (ADR-0020) - Tensor.data: returns actual numpy values (verify-data) or zeros placeholder - Tensor.__repr__: shows value summary or data=N/A (placeholder) - DataExecutor: ThreadPoolExecutor for same-timestamp parallel op execution - BenchResult.engine: exposes op_log/memory_store for Phase 2 access Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 09:34:01 -07:00
ywkang	59e36f0c34	Add E2E pipeline tests: greenlet op_log, GEMM accuracy, latency regression ADR-0020 + ADR-0021 final verification: - CompositeCmd GEMM/Math pipeline completes through full chain - Greenlet mode generates op_log records (memory + gemm ops) - Phase 1→Phase 2: MemoryStore seed → greenlet → op_log → DataExecutor → allclose - Latency determinism: same kernel produces identical latency - Multi-tile > single-tile latency invariant 388 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 00:28:03 -07:00
ywkang	81ce55571d	Rename impl names: add builtin. prefix for clear provenance - components.yaml: all builtin impls use builtin.xxx naming - topology.yaml: all impl references updated to builtin.xxx - builder.py: hardcoded ucie impl → builtin.ucie - Tests: all impl string references updated Convention: builtin.<name> for built-in, custom.<name> for user-defined. 382 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 00:16:24 -07:00
ywkang	1d95df4bee	Restructure legacy backups, remove pe_accel, fix DMA self-routing - Move builtin_legacy/ → legacy/builtin/ (cleaner structure) - Move pe_accel_legacy/ → legacy/pe_accel/ - Remove custom/pe_accel/ (replaced by new builtin) - Remove pe_scheduler_v2 from components.yaml - Switch topology.yaml to pe_scheduler_v1 (new builtin) - Fix PE_DMA self-routing: handle consecutive DMA_READ stages (same component consecutive stages processed in-place, not via port) 382 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 00:02:26 -07:00
ywkang	95d583ef9f	Add Phase 1→Phase 2 e2e data tests + GraphEngine enable_data mode GraphEngine(enable_data=True): - Creates MemoryStore + OpLogger - Injects op_logger into all components - Exposes engine.op_log and engine.memory_store properties E2E tests (test_e2e_data.py): - Engine data mode creates store + logger - Default engine has no store - PeDmaMsg completes successfully with data mode - DataExecutor GEMM accuracy: random f16 matmul with f32 accumulation - DataExecutor chain: GEMM → exp correctness - DataExecutor verify API: pass/fail per tensor - MemoryStore snapshot isolation between Phase 1 and Phase 2 382 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:49:28 -07:00
ywkang	f5d1606f9d	Add ADR-0021 pipeline tests: self-routing, tiling, overlap Test plan items 3-5: - TileToken self-routing: advance(), stage sequence, chain traversal - PipelineContext: completion tracking, exactly-once contract - Tiling plans: GEMM tile count, stage sequence, intermediate K no DMA_WRITE - Math plan: READ→FETCH→MATH→STORE→WRITE sequence - Pipeline overlap: SimPy simulation verifying intra-command tile overlap 9 new tests, all passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:40:19 -07:00
ywkang	b6eb97c49a	Implement ADR-0021: PE pipeline refactor with token self-routing Step 1-2: Backup existing code - builtin/ → builtin_legacy/ (unchanged backup) - custom/pe_accel/ → custom/pe_accel_legacy/ (unchanged backup) Step 3-4: New pipeline types and tiling - pe_types.py: StageType, Stage, TilePlan, PipelinePlan, PipelineContext, TileToken - tiling.py: generate_gemm_plan, generate_math_plan (ported from pe_accel) Step 5: Component implementations (ADR-0021 D4-D6) - PE_SCHEDULER: _feed_loop (singleton FIFO feeder) + plan generation - PE_FETCH_STORE: new component — TCM ↔ Register File - PE_GEMM: TileToken pipeline + legacy PeInternalTxn dual-mode - PE_MATH: TileToken pipeline + legacy dual-mode - PE_DMA: TileToken pipeline + legacy + fabric Transaction triple-mode - PE_TCM: TcmRequest handler with dual-channel BW serialization Step 6: Infrastructure - topology.yaml: pe_fetch_store component + chaining edges - components.yaml: pe_fetch_store_v1 registration - builder.py: PE_COMP_OFFSETS, _add_pe_internal_edges, PE view positions - Tests: node/edge counts, PE component sets updated All components handle both TileToken (pipeline) and PeInternalTxn (legacy). Token self-routing: components read next stage from token.plan, chain via out_port. 366 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:35:31 -07:00
ywkang	161132cdcb	ADR-0021: PE pipeline refactor — component separation + token self-routing Design for refactoring pe_accel monolith into independent builtin components: - D1: 6 independent components (scheduler, DMA, fetch_store, GEMM, MATH, TCM) - D2: Token self-routing — scheduler only dispatches + tracks completion - D3: done signal = simpy.Event (HW wire), data = message (queue) - D4: Async pipeline with single FIFO feeder, command-level ordering - D5: PE_FETCH_STORE separates TCM↔register from compute - D6: Compute components implement _process() only, chaining in base - D7: Topology adds pe_fetch_store + chaining edges - D8: Existing builtin/pe_accel → builtin_legacy backup, new builtin - D9: TileToken with plan + stage_idx for self-routing Key decisions from review: - No PipelineManager object — scheduler + existing ports sufficient - PipelineContext with exactly-once completion contract - _feed_loop singleton per scheduler, FIFO command ordering - Intra-PE chaining: no explicit latency model - Latency models ported from pe_accel current implementation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:21:40 -07:00
ywkang	51004c311c	Implement ADR-0020: 2-pass data execution with greenlet kernel runner Step 1 — Foundation: - OpRecord/OpLogger: op log infrastructure with t_start stable ordering - MemoryStore: numpy ndarray tensor-granular storage (reference semantics) - data_op=True flag on DmaReadCmd, DmaWriteCmd, GemmCmd, MathCmd, CompositeCmd - numpy/greenlet dependencies added to pyproject.toml Step 2 — ComponentBase hooks: - _on_process_start/end hooks in _forward_txn (fabric messages) - _handle_with_hooks in PeEngineBase (PE-internal commands) - op_logger optional — zero overhead when disabled Step 3 — KernelRunner + greenlet: - KernelRunner: greenlet ↔ SimPy bridge in triton_emu/kernel_runner.py - TLContext: _emit() method routes to greenlet switch or command list - tl.load() returns real numpy data in greenlet mode - Dynamic control flow supported (memory-read based branching) Step 4 — PE_CPU integration: - Greenlet mode when ctx.memory_store is set, legacy fallback otherwise - Refactored into _execute_greenlet/_execute_legacy/_send_response - ComponentContext gains memory_store and op_logger fields Step 5 — DataExecutor: - Phase 2 numpy execution for GEMM/Math ops from op_log - _compute_math: all unary/binary/reduction ops - verify(): compare MemoryStore against expected with dtype tolerance 28 new tests, 366 total passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 00:22:44 -07:00
ywkang	140b85436a	ADR-0020: 2-Pass data execution model with greenlet kernel runner Design for actual data storage/computation in HBM/TCM/SRAM components: - Phase 1: SimPy timing + MemoryStore (memory ops data-aware via greenlet) - Phase 2: op_log-based numpy execution for GEMM/Math verification - Greenlet-based KernelRunner replaces Phase 0 command list generation - tl.load() returns real data in Phase 1, enabling memory-based control flow - ComponentBase hook for op logging (single source of truth) - MemoryStore: numpy ndarray tensor-granular storage with reference semantics Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 23:53:49 -07:00
ywkang	eb792e6212	Remove xbar/noc remnants, rule-based cube-view connectors - Delete xbar.py and noc.py (TwoDMeshNocComponent) — unused since router mesh - Remove xbar_v1/noc_2d_mesh_v1 from components.yaml - Fix pe_to_xbar → pe_to_router in routing exclusion set - Fix xbar_to_hbm_bw_gbs → hbm_to_router_bw_gbs in report.py - Update all docstrings/comments referencing xbar/bridge → router mesh - Cube-view connectors: rule-based _connector_points helper - PE↔router: single diagonal line (not chevron) - UCIe N/S: 45°→horizontal→45° - UCIe E/W: 45°→vertical→45° - HBM ports: 45°→horizontal→45° Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 23:59:12 -07:00
ywkang	7640635f90	M_CPU/SRAM placement via pos_mm in topology.yaml (nearest router) Component placement uses mm coordinates in topology.yaml, mesh_gen finds the nearest router automatically. M_CPU moved to pos_mm=[7.5,2.0] (→ r0c2), SRAM at pos_mm=[1.5,9.0] (→ r3c0). No hardcoded router references in topology config. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 00:48:20 -07:00
ywkang	3ea4fa90f8	Cube-view: increase 45° stub length and component gap for visibility Stub length increased to 12px (PE/HBM) and 10px (UCIe). Gap between router and component increased to 30px so both 45° stubs (router end + component end) are clearly visible. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 00:38:27 -07:00
ywkang	5125d92c17	Cube-view: M_CPU north, 45° stub-straight-stub connector pattern - M_CPU placed north (above) its router - All connectors: 45° stub from router → straight → 45° stub to component - Consistent 4-point polyline pattern for PE, M_CPU, SRAM, HBM, UCIe Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 00:34:48 -07:00
ywkang	72acc5c8bb	Cube-view: UCIe flush against cube edges UCIe position calculated with minimal inset (0.3 × size) to place components flush against cube boundary edges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 00:28:58 -07:00
ywkang	bde76ec959	Cube-view: 45° diagonal from router, then straight to component All connectors now start with 45° diagonal from router edge, then go straight (vertical/horizontal) to the component block. Applies to PE, M_CPU/SRAM, PE→HBM, and UCIe connectors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 00:25:41 -07:00
ywkang	d3de982ea4	Cube-view: 90° router mesh links, 45° component connectors Router-router mesh links remain straight (horizontal/vertical). All component→router connectors use 45° L-bend polylines: - PE blocks: vertical then 45° diagonal to router - M_CPU/SRAM: horizontal then 45° diagonal to router - PE→HBM port group: vertical then 45° diagonal - UCIe port→router: direction-aware 45° bend Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 00:20:28 -07:00
ywkang	df81835d84	Cube-view: UCIe position/size from topology.yaml (ucie_mm.size=2.0) UCIe components placed at defined positions from _cube_local_positions with size from cube.geometry.ucie_mm.size. N/S horizontal, E/W vertical. Connection ports rendered as color-coded boxes inside UCIe component. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 00:11:11 -07:00
ywkang	66ec6cd40c	Cube-view: UCIe components inside cube boundary with port boxes - UCIe-N/S/E/W drawn as component blocks inside cube boundary (inset 3mm from edge) - Each UCIe has c0-c3 connection ports as color-coded boxes inside - Connector lines from each port box to its attached router - Removed old UCIe rendering that placed blocks outside cube Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 23:58:32 -07:00
ywkang	e766163a25	Cube-view: HBM pseudo channel ports on edges, UCIe flush to cube border - HBM pseudo channel ports split to top/bottom edges of HBM zone (32 ports each, 8 per PE, color-coded) - PE→HBM lines connect router to its port group center - Per-PE label: "PE0×8ch" with BW annotation - UCIe blocks flush against cube edges at router positions - UCIe blocks smaller (22×10px) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 22:38:10 -07:00
ywkang	24faf2e1d4	Cube-view: angle HBM lines, offset M_CPU/SRAM blocks - HBM connection lines angled 30% toward HBM center (not vertical) to distinguish from mesh links - M_CPU/SRAM blocks placed to the left of their router with horizontal connector lines (avoid mesh overlap) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 22:30:56 -07:00
ywkang	7cd30e106e	Fix Router→HBM_CTRL lines visibility in cube_view Draw HBM connection lines last (on top of component blocks). PE routers: thicker (1.5px, opacity 0.6) with dashed style. Relay routers: thinner (0.7px, opacity 0.2). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 22:25:40 -07:00
ywkang	109c9b4483	Cube-view: draw all attached components as separate blocks All router-attached components (PE, M_CPU, SRAM, UCIe) rendered as labeled blocks with explicit connector lines to their router. UCIe blocks positioned at cube edges matching port direction. Router→HBM_CTRL lines shown for all 32 routers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 22:09:08 -07:00
ywkang	e94f1de078	Cube-view SVG: detailed topology validation rendering - Dedicated cube_view renderer showing 6×6 router grid with attachments - PE blocks drawn next to their router (above/below) - HBM pseudo channel port bar (64 ports, color-coded by PE owner) - Per-PE BW annotations on HBM links - Router color-coded by type (PE/M_CPU/SRAM/UCIe/relay) - Title shows mode, channel count, per-PE and total BW - Legend for all component types Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 22:03:38 -07:00
ywkang	5c6abe6d12	Reduce SRAM/UCIe/M_CPU/HBM node sizes, thin HBM and mesh links Shrink cube-view component nodes to avoid clutter. HBM and router_mesh edge lines made thinner and more transparent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 21:51:41 -07:00
ywkang	f298e3c7cc	Offset PE nodes in cube_view to avoid overlapping routers PE nodes are shifted 1.2mm above (top half) or below (bottom half) their assigned router position. PE size reduced to 1.4x0.7mm. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 18:50:32 -07:00
ywkang	91085733ba	Show individual routers in cube_view SVG, fix row Y overlap - cube_view now renders all 32 router nodes from cube_mesh.yaml instead of collapsed "router_mesh" placeholder - Fix mesh_gen row Y position overlap (r1/r2 and r3/r4 had same Y) by adding hbm_gap spacing between PE rows and HBM zone - Add noc_router to visualizer KIND_SIZE for proper sizing - Update cube view tests for individual router nodes 339 passed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 18:22:38 -07:00
ywkang	d2c92b8a18	Wire PE_MMU to router mesh for MmuMapMsg delivery Add router → PE_MMU edge so MmuMapMsg can reach PE_MMU via the router mesh. Unskip all PE_MMU fabric tests. 339 passed, 0 skipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 18:10:42 -07:00
ywkang	08256c1326	Fix cross-SIP PE_TCM access by scoping deploy to target_device SIP RuntimeContext._ensure_allocators() now limits SIP range to target_device (single SIP or all). Prevents cross-SIP tensor deployment that caused PE_TCM routing errors. Also accept 'sip0' format (without colon) in DeviceSelector. 331 passed, 8 skipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 18:03:11 -07:00
ywkang	624161f52f	Update web viewer for router mesh topology (ADR-0019) Remove all xbar/bridge rendering from cube detail view. Replace 8 HBM slices with single HBM_CTRL block. Add green dotted lines showing router-to-HBM connectivity. Update legend, event animation, and PE view NOC destinations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 17:56:05 -07:00
ywkang	5917b3497c	Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019) - Remove xbar_top/bot, bridge, single noc node from topology - Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col}) - HBM_CTRL consolidated to single node per cube, attached to all routers - All traffic (DMA data + PE command) routes through same router mesh - Update AddressResolver (no slice suffix), PathRouter (_adj_local) - Update ADR-0002~0019, SPEC.md to remove xbar/bridge references - Regenerate SVG diagrams for new topology structure - Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired) 326 passed, 13 skipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 17:51:28 -07:00
ywkang	31c7110da7	Add ADR-0018 (LA/BAAW addressing) and ADR-0019 (NOC per-channel HBM) ADR-0018: LA replaces VA, BAAW segment-based mapping in PE_DMA, 1:1 (per-channel) and n:1 (aggregated) modes with parameterized channel count. ADR-0019: xbar/bridge removal, channel router topology with horizontal line layout, aggregated router for n:1 mode, unified NOC path for local/remote HBM access. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 01:05:27 -07:00
ywkang	114510d4b9	Add SchedulerV2 (pe_accel), DPPolicy overrides, and new benchmarks - Add cycle-accurate PE accelerator scheduler (SchedulerV2) with tiled GEMM/Math pipelines (DMA_IN → GEMM → MATH → DMA_WB) - Add DPPolicy num_pes/num_cubes/num_sips overrides for single-PE testing - Support tuple target_pe for targeting specific PE subsets - Add gemm_single_pe and gpt3_qkv benchmarks - Switch default topology to pe_scheduler_v2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 23:18:49 -07:00
ywkang	63669f82cb	Add SIP-level tensor parallelism, component registry YAML, VA offset verification - DPPolicy: 3-level (sip/cube/pe), unified naming (column_wise/row_wise) - PE_CPU: auto num_programs from cube shard count - context.launch(): per-SIP KernelLaunchMsg with local va_base + auto local shape - deploy_tensor: removed mmus param, MMU mapping is context-only responsibility - ComponentRegistry: YAML-based lazy loading (components.yaml), impls→builtin rename - VA offset bench + tests: 2D/1D, standard Triton kernel pattern Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 01:13:17 -07:00
ywkang	08812eda58	Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg Implement VA/MMU layer (ADR-0011 Phase 1) enabling Triton kernels to use contiguous virtual addresses on sharded tensors. Key changes: - PE_MMU component: hybrid inbox (MmuMapMsg) + sync translate() for PE_DMA - VirtualAllocator + PEMemAllocator: free-list with coalescing - MmuMapMsg/MmuUnmapMsg fabric path with SIP-level routing - DPPolicy-based mapping: replicate=local, sharded=broadcast - Tensor lifecycle: del + weakref cleanup, context manager - Rename: TensorHandle.pa→addr, DmaReadCmd.src_pa→src_addr, ctx→torch Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 00:01:47 -07:00

108 Commits