kernbench2

Author	SHA1	Message	Date
ywkang	74f5f5cf08	Add session-scoped topology fixture in tests/conftest.py Provides a shared `topology` fixture that caches the parsed topology.yaml result per pytest-xdist worker session. Tests that build a GraphEngine can accept `topology` instead of calling resolve_topology("topology.yaml") repeatedly. Topology parsing costs ~32ms, so the practical saving per worker is modest (<1s across all tests). The fixture is mainly for architectural cleanliness — keeping the "parse once, build engine many" pattern explicit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 21:13:25 -07:00
ywkang	372c987995	Reduce test time to 12s: shrink GEMM dims + enable pytest-xdist GEMM dimension reduction: - qkv_gemm.py: M,K,N = 128,256,128 → 32,64,32 (64 tiles → 1 tile). - qkv_gemm_multi_pe.py: same reduction. - Tests verify pipeline correctness, not large-matrix throughput. - Per-test time: 18s → 1.7s. 6 tests total: 108s → 10s. pytest-xdist parallel execution: - Add pytest-xdist to dev dependencies. - pyproject.toml addopts: -n auto (use all CPU cores), -m "not slow". - Default `pytest` runs 501 tests in ~12s (previously 148s). - Full suite including slow: `pytest -m ""` → 3m24s (previously 5m43s). pytest.mark.slow: - Registered in pyproject.toml markers section. - 256-rank full-system test is the only slow-marked test. - Run with: pytest -m "" (CI) or pytest (local dev, skips slow). 502 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 21:06:41 -07:00
ywkang	bcf941dcee	Speed up regression: 25min → 6min (test matrix + DataExecutor cleanup) Test matrix restructure: - 256-rank full-system ring runs only ONCE (marked pytest.mark.slow) instead of 7× across matrix + perf tests. Cross-SIP routing is verified by the single run; buffer variants (tcm/hbm/sram) are tested at 8-rank where they finish in <0.5s. - Performance tests use 8-rank instead of 256-rank. - `pytest -m "not slow"` completes in ~2.5min (local dev). - Full suite including slow: ~6min (CI). DataExecutor optimization: - Remove ThreadPoolExecutor from DataExecutor.run(). Same-t_start groups are almost always size 1, so the thread pool creation and dispatch overhead dominated. Simple sequential loop is faster. - Skip dma_read ops at the loop level (they are always no-ops in Phase 2 but were dispatched through _execute_op → _execute_memory). - Remove redundant CLI Phase 2 re-execution: engine._flush_data_phase already replays during engine.wait(); the CLI now only prints the diagnostic summary without re-running DataExecutor. 502 tests pass. Wall time: 25m30s → 5m43s (full), 2m28s (no slow). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 20:52:07 -07:00
ywkang	998cc85762	Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023) Major changes: PE-level IPCQ infrastructure: - New PE_IPCQ component: ring-buffer control plane with 4-direction neighbor mapping, head/tail pointers, backpressure (poll/sleep). - PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA, including in-flight data snapshot (D9) and op_log recording at outbound time for Phase 2 replay correctness. - IpcqDmaToken piggyback model: data + metadata travel together, atomic visibility at receiver (invariant I6). - Credit return fast path: bottleneck-BW latency, no fabric vc_comm. Phase 2 data execution (ADR-0020 integration): - op_log extended: DmaWriteCmd now captures src_space/src_addr for Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time. - DataExecutor replays dma_write + ipcq_copy in t_start order. - Engine._flush_data_phase: incremental cursor-based replay after each engine.wait() so host reads see post-Phase-2 data. - KernelRunner Phase 1 writes disabled when op_log is active to prevent stale data from corrupting the MemoryStore snapshot. TLContext / kernel API: - tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype), tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode. - TensorHandle operator overloading (add/sub/mul/div) via thread-local active TLContext → MathCmd dispatch through PE_MATH. - PE-local scratch allocator for math output handles. - tl.load returns space="hbm" handles for correct Phase 2 addressing. - Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv. Unified ccl_allreduce bench (PyTorch-compat host code): - Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch) split matching real PyTorch DDP worker pattern. - torch.distributed facade: init_process_group, get_world_size, get_rank, get_backend, all_reduce, barrier — only real PyTorch names. - AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches kernel via tensor shard metadata (n_elem from shards[0].nbytes). - world_size derived from topology spec (sips × cubes × pes_per_cube) with optional algorithm-level override in ccl.yaml. Tensor API (PyTorch-compat surface): - Tensor.numpy(): gather-aware (all shards via VA-based addressing). - Tensor.copy_(source): scatter from host tensor into sharded target. - RuntimeContext.from_numpy(arr): host-side staging tensor. - Tensor.data property fixed to use numpy() (was shards[0]-only). Algorithm modules moved to src/kernbench/ccl/algorithms/: - ring_allreduce, mesh_allreduce, tree_allreduce, hello_send. - Each module exports kernel_args(world_size, n_elem) helper. - ccl.yaml module paths updated to kernbench.ccl.algorithms.*. Dead code removed: - 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.). - _run_ccl_bench greenlet-per-SIP scheduler. - benches.loader.is_ccl_bench + run_rank detection. - benches/ccl/ directory. Tests: - New test_ccl_allreduce_matrix.py: 7 parametrized cases (ring×3 buffers, ring 8/16, mesh 4, tree 7). - New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests. - Existing tests updated for new import paths + world_size_override. Docs: - Korean ccl-author-guide.md and ADR-0023 paths updated. - New English versions: ccl-author-guide.en.md, ADR-0023.en.md. 502 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 19:36:59 -07:00
ywkang	ff2c677a9c	Add 2D grid program_id semantics (ADR-0022) tl.program_id(axis=0) returns local PE id within cube, tl.program_id(axis=1) returns cube id. Enables cube-aware sharding in benchmark kernels. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 16:49:56 -07:00
ywkang	dc3fb02aed	Add --verify-data CLI flag, Tensor.data property, parallel DataExecutor - CLI: --verify-data flag enables Phase 2 data verification (ADR-0020) - Tensor.data: returns actual numpy values (verify-data) or zeros placeholder - Tensor.__repr__: shows value summary or data=N/A (placeholder) - DataExecutor: ThreadPoolExecutor for same-timestamp parallel op execution - BenchResult.engine: exposes op_log/memory_store for Phase 2 access Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 09:34:01 -07:00
ywkang	59e36f0c34	Add E2E pipeline tests: greenlet op_log, GEMM accuracy, latency regression ADR-0020 + ADR-0021 final verification: - CompositeCmd GEMM/Math pipeline completes through full chain - Greenlet mode generates op_log records (memory + gemm ops) - Phase 1→Phase 2: MemoryStore seed → greenlet → op_log → DataExecutor → allclose - Latency determinism: same kernel produces identical latency - Multi-tile > single-tile latency invariant 388 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 00:28:03 -07:00
ywkang	81ce55571d	Rename impl names: add builtin. prefix for clear provenance - components.yaml: all builtin impls use builtin.xxx naming - topology.yaml: all impl references updated to builtin.xxx - builder.py: hardcoded ucie impl → builtin.ucie - Tests: all impl string references updated Convention: builtin.<name> for built-in, custom.<name> for user-defined. 382 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 00:16:24 -07:00
ywkang	1d95df4bee	Restructure legacy backups, remove pe_accel, fix DMA self-routing - Move builtin_legacy/ → legacy/builtin/ (cleaner structure) - Move pe_accel_legacy/ → legacy/pe_accel/ - Remove custom/pe_accel/ (replaced by new builtin) - Remove pe_scheduler_v2 from components.yaml - Switch topology.yaml to pe_scheduler_v1 (new builtin) - Fix PE_DMA self-routing: handle consecutive DMA_READ stages (same component consecutive stages processed in-place, not via port) 382 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 00:02:26 -07:00
ywkang	95d583ef9f	Add Phase 1→Phase 2 e2e data tests + GraphEngine enable_data mode GraphEngine(enable_data=True): - Creates MemoryStore + OpLogger - Injects op_logger into all components - Exposes engine.op_log and engine.memory_store properties E2E tests (test_e2e_data.py): - Engine data mode creates store + logger - Default engine has no store - PeDmaMsg completes successfully with data mode - DataExecutor GEMM accuracy: random f16 matmul with f32 accumulation - DataExecutor chain: GEMM → exp correctness - DataExecutor verify API: pass/fail per tensor - MemoryStore snapshot isolation between Phase 1 and Phase 2 382 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:49:28 -07:00
ywkang	f5d1606f9d	Add ADR-0021 pipeline tests: self-routing, tiling, overlap Test plan items 3-5: - TileToken self-routing: advance(), stage sequence, chain traversal - PipelineContext: completion tracking, exactly-once contract - Tiling plans: GEMM tile count, stage sequence, intermediate K no DMA_WRITE - Math plan: READ→FETCH→MATH→STORE→WRITE sequence - Pipeline overlap: SimPy simulation verifying intra-command tile overlap 9 new tests, all passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:40:19 -07:00
ywkang	b6eb97c49a	Implement ADR-0021: PE pipeline refactor with token self-routing Step 1-2: Backup existing code - builtin/ → builtin_legacy/ (unchanged backup) - custom/pe_accel/ → custom/pe_accel_legacy/ (unchanged backup) Step 3-4: New pipeline types and tiling - pe_types.py: StageType, Stage, TilePlan, PipelinePlan, PipelineContext, TileToken - tiling.py: generate_gemm_plan, generate_math_plan (ported from pe_accel) Step 5: Component implementations (ADR-0021 D4-D6) - PE_SCHEDULER: _feed_loop (singleton FIFO feeder) + plan generation - PE_FETCH_STORE: new component — TCM ↔ Register File - PE_GEMM: TileToken pipeline + legacy PeInternalTxn dual-mode - PE_MATH: TileToken pipeline + legacy dual-mode - PE_DMA: TileToken pipeline + legacy + fabric Transaction triple-mode - PE_TCM: TcmRequest handler with dual-channel BW serialization Step 6: Infrastructure - topology.yaml: pe_fetch_store component + chaining edges - components.yaml: pe_fetch_store_v1 registration - builder.py: PE_COMP_OFFSETS, _add_pe_internal_edges, PE view positions - Tests: node/edge counts, PE component sets updated All components handle both TileToken (pipeline) and PeInternalTxn (legacy). Token self-routing: components read next stage from token.plan, chain via out_port. 366 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:35:31 -07:00
ywkang	161132cdcb	ADR-0021: PE pipeline refactor — component separation + token self-routing Design for refactoring pe_accel monolith into independent builtin components: - D1: 6 independent components (scheduler, DMA, fetch_store, GEMM, MATH, TCM) - D2: Token self-routing — scheduler only dispatches + tracks completion - D3: done signal = simpy.Event (HW wire), data = message (queue) - D4: Async pipeline with single FIFO feeder, command-level ordering - D5: PE_FETCH_STORE separates TCM↔register from compute - D6: Compute components implement _process() only, chaining in base - D7: Topology adds pe_fetch_store + chaining edges - D8: Existing builtin/pe_accel → builtin_legacy backup, new builtin - D9: TileToken with plan + stage_idx for self-routing Key decisions from review: - No PipelineManager object — scheduler + existing ports sufficient - PipelineContext with exactly-once completion contract - _feed_loop singleton per scheduler, FIFO command ordering - Intra-PE chaining: no explicit latency model - Latency models ported from pe_accel current implementation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:21:40 -07:00
ywkang	51004c311c	Implement ADR-0020: 2-pass data execution with greenlet kernel runner Step 1 — Foundation: - OpRecord/OpLogger: op log infrastructure with t_start stable ordering - MemoryStore: numpy ndarray tensor-granular storage (reference semantics) - data_op=True flag on DmaReadCmd, DmaWriteCmd, GemmCmd, MathCmd, CompositeCmd - numpy/greenlet dependencies added to pyproject.toml Step 2 — ComponentBase hooks: - _on_process_start/end hooks in _forward_txn (fabric messages) - _handle_with_hooks in PeEngineBase (PE-internal commands) - op_logger optional — zero overhead when disabled Step 3 — KernelRunner + greenlet: - KernelRunner: greenlet ↔ SimPy bridge in triton_emu/kernel_runner.py - TLContext: _emit() method routes to greenlet switch or command list - tl.load() returns real numpy data in greenlet mode - Dynamic control flow supported (memory-read based branching) Step 4 — PE_CPU integration: - Greenlet mode when ctx.memory_store is set, legacy fallback otherwise - Refactored into _execute_greenlet/_execute_legacy/_send_response - ComponentContext gains memory_store and op_logger fields Step 5 — DataExecutor: - Phase 2 numpy execution for GEMM/Math ops from op_log - _compute_math: all unary/binary/reduction ops - verify(): compare MemoryStore against expected with dtype tolerance 28 new tests, 366 total passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 00:22:44 -07:00
ywkang	140b85436a	ADR-0020: 2-Pass data execution model with greenlet kernel runner Design for actual data storage/computation in HBM/TCM/SRAM components: - Phase 1: SimPy timing + MemoryStore (memory ops data-aware via greenlet) - Phase 2: op_log-based numpy execution for GEMM/Math verification - Greenlet-based KernelRunner replaces Phase 0 command list generation - tl.load() returns real data in Phase 1, enabling memory-based control flow - ComponentBase hook for op logging (single source of truth) - MemoryStore: numpy ndarray tensor-granular storage with reference semantics Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 23:53:49 -07:00
ywkang	eb792e6212	Remove xbar/noc remnants, rule-based cube-view connectors - Delete xbar.py and noc.py (TwoDMeshNocComponent) — unused since router mesh - Remove xbar_v1/noc_2d_mesh_v1 from components.yaml - Fix pe_to_xbar → pe_to_router in routing exclusion set - Fix xbar_to_hbm_bw_gbs → hbm_to_router_bw_gbs in report.py - Update all docstrings/comments referencing xbar/bridge → router mesh - Cube-view connectors: rule-based _connector_points helper - PE↔router: single diagonal line (not chevron) - UCIe N/S: 45°→horizontal→45° - UCIe E/W: 45°→vertical→45° - HBM ports: 45°→horizontal→45° Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 23:59:12 -07:00
ywkang	7640635f90	M_CPU/SRAM placement via pos_mm in topology.yaml (nearest router) Component placement uses mm coordinates in topology.yaml, mesh_gen finds the nearest router automatically. M_CPU moved to pos_mm=[7.5,2.0] (→ r0c2), SRAM at pos_mm=[1.5,9.0] (→ r3c0). No hardcoded router references in topology config. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 00:48:20 -07:00
ywkang	3ea4fa90f8	Cube-view: increase 45° stub length and component gap for visibility Stub length increased to 12px (PE/HBM) and 10px (UCIe). Gap between router and component increased to 30px so both 45° stubs (router end + component end) are clearly visible. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 00:38:27 -07:00
ywkang	5125d92c17	Cube-view: M_CPU north, 45° stub-straight-stub connector pattern - M_CPU placed north (above) its router - All connectors: 45° stub from router → straight → 45° stub to component - Consistent 4-point polyline pattern for PE, M_CPU, SRAM, HBM, UCIe Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 00:34:48 -07:00
ywkang	72acc5c8bb	Cube-view: UCIe flush against cube edges UCIe position calculated with minimal inset (0.3 × size) to place components flush against cube boundary edges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 00:28:58 -07:00
ywkang	bde76ec959	Cube-view: 45° diagonal from router, then straight to component All connectors now start with 45° diagonal from router edge, then go straight (vertical/horizontal) to the component block. Applies to PE, M_CPU/SRAM, PE→HBM, and UCIe connectors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 00:25:41 -07:00
ywkang	d3de982ea4	Cube-view: 90° router mesh links, 45° component connectors Router-router mesh links remain straight (horizontal/vertical). All component→router connectors use 45° L-bend polylines: - PE blocks: vertical then 45° diagonal to router - M_CPU/SRAM: horizontal then 45° diagonal to router - PE→HBM port group: vertical then 45° diagonal - UCIe port→router: direction-aware 45° bend Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 00:20:28 -07:00
ywkang	df81835d84	Cube-view: UCIe position/size from topology.yaml (ucie_mm.size=2.0) UCIe components placed at defined positions from _cube_local_positions with size from cube.geometry.ucie_mm.size. N/S horizontal, E/W vertical. Connection ports rendered as color-coded boxes inside UCIe component. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 00:11:11 -07:00
ywkang	66ec6cd40c	Cube-view: UCIe components inside cube boundary with port boxes - UCIe-N/S/E/W drawn as component blocks inside cube boundary (inset 3mm from edge) - Each UCIe has c0-c3 connection ports as color-coded boxes inside - Connector lines from each port box to its attached router - Removed old UCIe rendering that placed blocks outside cube Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 23:58:32 -07:00
ywkang	e766163a25	Cube-view: HBM pseudo channel ports on edges, UCIe flush to cube border - HBM pseudo channel ports split to top/bottom edges of HBM zone (32 ports each, 8 per PE, color-coded) - PE→HBM lines connect router to its port group center - Per-PE label: "PE0×8ch" with BW annotation - UCIe blocks flush against cube edges at router positions - UCIe blocks smaller (22×10px) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 22:38:10 -07:00
ywkang	24faf2e1d4	Cube-view: angle HBM lines, offset M_CPU/SRAM blocks - HBM connection lines angled 30% toward HBM center (not vertical) to distinguish from mesh links - M_CPU/SRAM blocks placed to the left of their router with horizontal connector lines (avoid mesh overlap) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 22:30:56 -07:00
ywkang	7cd30e106e	Fix Router→HBM_CTRL lines visibility in cube_view Draw HBM connection lines last (on top of component blocks). PE routers: thicker (1.5px, opacity 0.6) with dashed style. Relay routers: thinner (0.7px, opacity 0.2). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 22:25:40 -07:00
ywkang	109c9b4483	Cube-view: draw all attached components as separate blocks All router-attached components (PE, M_CPU, SRAM, UCIe) rendered as labeled blocks with explicit connector lines to their router. UCIe blocks positioned at cube edges matching port direction. Router→HBM_CTRL lines shown for all 32 routers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 22:09:08 -07:00
ywkang	e94f1de078	Cube-view SVG: detailed topology validation rendering - Dedicated cube_view renderer showing 6×6 router grid with attachments - PE blocks drawn next to their router (above/below) - HBM pseudo channel port bar (64 ports, color-coded by PE owner) - Per-PE BW annotations on HBM links - Router color-coded by type (PE/M_CPU/SRAM/UCIe/relay) - Title shows mode, channel count, per-PE and total BW - Legend for all component types Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 22:03:38 -07:00
ywkang	5c6abe6d12	Reduce SRAM/UCIe/M_CPU/HBM node sizes, thin HBM and mesh links Shrink cube-view component nodes to avoid clutter. HBM and router_mesh edge lines made thinner and more transparent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 21:51:41 -07:00
ywkang	f298e3c7cc	Offset PE nodes in cube_view to avoid overlapping routers PE nodes are shifted 1.2mm above (top half) or below (bottom half) their assigned router position. PE size reduced to 1.4x0.7mm. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 18:50:32 -07:00
ywkang	91085733ba	Show individual routers in cube_view SVG, fix row Y overlap - cube_view now renders all 32 router nodes from cube_mesh.yaml instead of collapsed "router_mesh" placeholder - Fix mesh_gen row Y position overlap (r1/r2 and r3/r4 had same Y) by adding hbm_gap spacing between PE rows and HBM zone - Add noc_router to visualizer KIND_SIZE for proper sizing - Update cube view tests for individual router nodes 339 passed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 18:22:38 -07:00
ywkang	d2c92b8a18	Wire PE_MMU to router mesh for MmuMapMsg delivery Add router → PE_MMU edge so MmuMapMsg can reach PE_MMU via the router mesh. Unskip all PE_MMU fabric tests. 339 passed, 0 skipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 18:10:42 -07:00
ywkang	08256c1326	Fix cross-SIP PE_TCM access by scoping deploy to target_device SIP RuntimeContext._ensure_allocators() now limits SIP range to target_device (single SIP or all). Prevents cross-SIP tensor deployment that caused PE_TCM routing errors. Also accept 'sip0' format (without colon) in DeviceSelector. 331 passed, 8 skipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 18:03:11 -07:00
ywkang	624161f52f	Update web viewer for router mesh topology (ADR-0019) Remove all xbar/bridge rendering from cube detail view. Replace 8 HBM slices with single HBM_CTRL block. Add green dotted lines showing router-to-HBM connectivity. Update legend, event animation, and PE view NOC destinations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 17:56:05 -07:00
ywkang	5917b3497c	Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019) - Remove xbar_top/bot, bridge, single noc node from topology - Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col}) - HBM_CTRL consolidated to single node per cube, attached to all routers - All traffic (DMA data + PE command) routes through same router mesh - Update AddressResolver (no slice suffix), PathRouter (_adj_local) - Update ADR-0002~0019, SPEC.md to remove xbar/bridge references - Regenerate SVG diagrams for new topology structure - Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired) 326 passed, 13 skipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 17:51:28 -07:00
ywkang	31c7110da7	Add ADR-0018 (LA/BAAW addressing) and ADR-0019 (NOC per-channel HBM) ADR-0018: LA replaces VA, BAAW segment-based mapping in PE_DMA, 1:1 (per-channel) and n:1 (aggregated) modes with parameterized channel count. ADR-0019: xbar/bridge removal, channel router topology with horizontal line layout, aggregated router for n:1 mode, unified NOC path for local/remote HBM access. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 01:05:27 -07:00
ywkang	114510d4b9	Add SchedulerV2 (pe_accel), DPPolicy overrides, and new benchmarks - Add cycle-accurate PE accelerator scheduler (SchedulerV2) with tiled GEMM/Math pipelines (DMA_IN → GEMM → MATH → DMA_WB) - Add DPPolicy num_pes/num_cubes/num_sips overrides for single-PE testing - Support tuple target_pe for targeting specific PE subsets - Add gemm_single_pe and gpt3_qkv benchmarks - Switch default topology to pe_scheduler_v2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 23:18:49 -07:00
ywkang	63669f82cb	Add SIP-level tensor parallelism, component registry YAML, VA offset verification - DPPolicy: 3-level (sip/cube/pe), unified naming (column_wise/row_wise) - PE_CPU: auto num_programs from cube shard count - context.launch(): per-SIP KernelLaunchMsg with local va_base + auto local shape - deploy_tensor: removed mmus param, MMU mapping is context-only responsibility - ComponentRegistry: YAML-based lazy loading (components.yaml), impls→builtin rename - VA offset bench + tests: 2D/1D, standard Triton kernel pattern Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 01:13:17 -07:00
ywkang	08812eda58	Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg Implement VA/MMU layer (ADR-0011 Phase 1) enabling Triton kernels to use contiguous virtual addresses on sharded tensors. Key changes: - PE_MMU component: hybrid inbox (MmuMapMsg) + sync translate() for PE_DMA - VirtualAllocator + PEMemAllocator: free-list with coalescing - MmuMapMsg/MmuUnmapMsg fabric path with SIP-level routing - DPPolicy-based mapping: replicate=local, sharded=broadcast - Tensor lifecycle: del + weakref cleanup, context manager - Rename: TensorHandle.pa→addr, DmaReadCmd.src_pa→src_addr, ctx→torch Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 00:01:47 -07:00
ywkang	62fb01ae18	Add reverse path response latency for PE DMA and PE_CPU→M_CPU Model fabric response hop latency for PE-internal operations: - HBM_CTRL sends PeDmaMsg response on reverse path instead of direct done signal - PE_CPU sends ResponseMsg via NOC→M_CPU on kernel completion - Add NOC→PE_DMA and PE_CPU→NOC edges in topology builder - Make HBM BW test assertions dynamic based on topology efficiency Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 15:40:56 -07:00
ywkang	8b5afef5eb	remove temp files	2026-03-20 00:10:11 -07:00
ywkang	0d89c5c074	Updated tasks.json to set the working directory and fixed '[dev]' (escape character issue)	2026-03-20 00:03:58 -07:00
Yangwook Kang	d40d0cceea	updated venv:create to use python3 instead of python	2026-03-19 23:55:41 -07:00
ywkang	dcbc41571f	Add web topology viewer with hot path visualization - FastAPI backend (server.py) with REST API + WebSocket for event streaming - SVG-based topology viewer (index.html) with SIP/CUBE/PE drill-down views - Event logging infrastructure (event_log.py) generating events from real probe cases (H2D/D2H/PE-DMA) and bench workloads (QKV GEMM single/multi-PE) - Timeline replay engine with play/pause, speed control, and scrubbing - Workload selector dropdown grouped by category (Probe/Bench) - CLI entry points: kernbench web, ./kernbench wrapper scripts Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 03:19:19 -07:00
ywkang	fc6abbc8ee	Add CHANGES.md, README, update SPEC/ADRs for release 2 - CHANGES.md: detailed changelog for release 1 and 2 - README.md: full project docs with install, probe, run, test usage - SPEC.md: add ADR-0014~0017 references, update R7 for pcie_ep endpoint - ADR-0003: update NOC description to reference ADR-0017 - ADR-0004: add HBM efficiency factor (0.8) to BW guarantee contract - ADR-0014: status Proposed -> Accepted - ADR-0015: update D4 to M_CPU bypass for Memory R/W, add ADR-0016/0017 links - ADR-0016 (new): IOChiplet NOC and memory data path - ADR-0017 (new): Cube NOC 2D mesh architecture - Fix MD lint warnings (unfenced code blocks) across all docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 01:43:15 -07:00
ywkang	d75da439c6	Add probe CLI improvements, D2H read, UCIe/HBM tuning, BW sweep - Probe CLI: restructured output (tables first, routes below), per-hop timestamps, split cross-cube into best/worst cases, D2H read section - UCIe overhead: 1ns -> 8ns per port (16ns per crossing) to fix cross-cube-best < cross-half latency inversion - HBM efficiency: added efficiency=0.8 factor to hbm_ctrl, reducing effective BW from 256 to 204.8 GB/s - Multi-size BW sweep: saturation tables (4KB-1MB) for all probe cases - Probe default data size: 4KB -> 32KB for more realistic measurements - IOChiplet NOC + D2H topology and tests - NOC mesh, xbar, BW occupancy components and tests - Cube mesh visualization diagram 278 tests pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 01:16:18 -07:00
ywkang	6f43807900	commit - release 1	2026-03-18 11:47:48 -07:00
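
Note on the Phase 2 replay order described in 51004c311c, 998cc85762 and bcf941dcee: the messages say that op_log records carry a t_start with stable ordering, that DataExecutor replays them sequentially in t_start order, and that dma_read ops are skipped at the loop level. The following is only a minimal sketch of that ordering rule. The `OpRecord` fields and the `replay_order` helper are hypothetical simplifications for illustration, not the repository's actual classes or signatures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpRecord:
    """Hypothetical, simplified stand-in for an op_log record."""
    t_start: float  # simulated start time of the op
    kind: str       # e.g. "dma_read", "dma_write", "ipcq_copy", "gemm"
    name: str       # label used only in this example


def replay_order(op_log):
    """Return the records a sequential replay loop would execute, in order.

    sorted() is stable, so records that share a t_start keep their
    original log order. dma_read records are dropped because the commit
    message describes them as no-ops in Phase 2.
    """
    ordered = sorted(op_log, key=lambda rec: rec.t_start)
    return [rec for rec in ordered if rec.kind != "dma_read"]


log = [
    OpRecord(2.0, "gemm", "g0"),
    OpRecord(1.0, "dma_read", "r0"),
    OpRecord(1.0, "dma_write", "w0"),
    OpRecord(1.0, "ipcq_copy", "c0"),
]
replayed = [rec.name for rec in replay_order(log)]
print(replayed)
```

With this input the three records at t_start 1.0 come before the GEMM at 2.0, the dma_read is filtered out, and the two remaining same-time records keep their log order.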

98 Commits