kernbench2

Author	SHA1	Message	Date
ywkang	f298e3c7cc	Offset PE nodes in cube_view to avoid overlapping routers. PE nodes are shifted 1.2mm above (top half) or below (bottom half) their assigned router position. PE size reduced to 1.4x0.7mm. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 18:50:32 -07:00
ywkang	91085733ba	Show individual routers in cube_view SVG, fix row Y overlap. cube_view now renders all 32 router nodes from cube_mesh.yaml instead of collapsed "router_mesh" placeholder. Fix mesh_gen row Y position overlap (r1/r2 and r3/r4 had same Y) by adding hbm_gap spacing between PE rows and HBM zone. Add noc_router to visualizer KIND_SIZE for proper sizing. Update cube view tests for individual router nodes. 339 passed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 18:22:38 -07:00
ywkang	d2c92b8a18	Wire PE_MMU to router mesh for MmuMapMsg delivery. Add router → PE_MMU edge so MmuMapMsg can reach PE_MMU via the router mesh. Unskip all PE_MMU fabric tests. 339 passed, 0 skipped. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 18:10:42 -07:00
ywkang	08256c1326	Fix cross-SIP PE_TCM access by scoping deploy to target_device SIP. RuntimeContext._ensure_allocators() now limits SIP range to target_device (single SIP or all). Prevents cross-SIP tensor deployment that caused PE_TCM routing errors. Also accept 'sip0' format (without colon) in DeviceSelector. 331 passed, 8 skipped. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 18:03:11 -07:00
ywkang	624161f52f	Update web viewer for router mesh topology (ADR-0019). Remove all xbar/bridge rendering from cube detail view. Replace 8 HBM slices with single HBM_CTRL block. Add green dotted lines showing router-to-HBM connectivity. Update legend, event animation, and PE view NOC destinations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 17:56:05 -07:00
ywkang	5917b3497c	Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019). Remove xbar_top/bot, bridge, single noc node from topology. Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col}). HBM_CTRL consolidated to single node per cube, attached to all routers. All traffic (DMA data + PE command) routes through same router mesh. Update AddressResolver (no slice suffix), PathRouter (_adj_local). Update ADR-0002~0019, SPEC.md to remove xbar/bridge references. Regenerate SVG diagrams for new topology structure. Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired). 326 passed, 13 skipped. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 17:51:28 -07:00
ywkang	114510d4b9	Add SchedulerV2 (pe_accel), DPPolicy overrides, and new benchmarks. Add cycle-accurate PE accelerator scheduler (SchedulerV2) with tiled GEMM/Math pipelines (DMA_IN → GEMM → MATH → DMA_WB). Add DPPolicy num_pes/num_cubes/num_sips overrides for single-PE testing. Support tuple target_pe for targeting specific PE subsets. Add gemm_single_pe and gpt3_qkv benchmarks. Switch default topology to pe_scheduler_v2. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 23:18:49 -07:00
ywkang	63669f82cb	Add SIP-level tensor parallelism, component registry YAML, VA offset verification. DPPolicy: 3-level (sip/cube/pe), unified naming (column_wise/row_wise). PE_CPU: auto num_programs from cube shard count. context.launch(): per-SIP KernelLaunchMsg with local va_base + auto local shape. deploy_tensor: removed mmus param, MMU mapping is context-only responsibility. ComponentRegistry: YAML-based lazy loading (components.yaml), impls→builtin rename. VA offset bench + tests: 2D/1D, standard Triton kernel pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 01:13:17 -07:00
ywkang	08812eda58	Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg. Implement VA/MMU layer (ADR-0011 Phase 1) enabling Triton kernels to use contiguous virtual addresses on sharded tensors. Key changes: PE_MMU component: hybrid inbox (MmuMapMsg) + sync translate() for PE_DMA; VirtualAllocator + PEMemAllocator: free-list with coalescing; MmuMapMsg/MmuUnmapMsg fabric path with SIP-level routing; DPPolicy-based mapping: replicate=local, sharded=broadcast; Tensor lifecycle: del + weakref cleanup, context manager; Rename: TensorHandle.pa→addr, DmaReadCmd.src_pa→src_addr, ctx→torch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 00:01:47 -07:00
ywkang	62fb01ae18	Add reverse path response latency for PE DMA and PE_CPU→M_CPU. Model fabric response hop latency for PE-internal operations: HBM_CTRL sends PeDmaMsg response on reverse path instead of direct done signal; PE_CPU sends ResponseMsg via NOC→M_CPU on kernel completion; Add NOC→PE_DMA and PE_CPU→NOC edges in topology builder; Make HBM BW test assertions dynamic based on topology efficiency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 15:40:56 -07:00
ywkang	dcbc41571f	Add web topology viewer with hot path visualization. FastAPI backend (server.py) with REST API + WebSocket for event streaming. SVG-based topology viewer (index.html) with SIP/CUBE/PE drill-down views. Event logging infrastructure (event_log.py) generating events from real probe cases (H2D/D2H/PE-DMA) and bench workloads (QKV GEMM single/multi-PE). Timeline replay engine with play/pause, speed control, and scrubbing. Workload selector dropdown grouped by category (Probe/Bench). CLI entry points: kernbench web, ./kernbench wrapper scripts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 03:19:19 -07:00
ywkang	d75da439c6	Add probe CLI improvements, D2H read, UCIe/HBM tuning, BW sweep. Probe CLI: restructured output (tables first, routes below), per-hop timestamps, split cross-cube into best/worst cases, D2H read section. UCIe overhead: 1ns -> 8ns per port (16ns per crossing) to fix cross-cube-best < cross-half latency inversion. HBM efficiency: added efficiency=0.8 factor to hbm_ctrl, reducing effective BW from 256 to 204.8 GB/s. Multi-size BW sweep: saturation tables (4KB-1MB) for all probe cases. Probe default data size: 4KB -> 32KB for more realistic measurements. IOChiplet NOC + D2H topology and tests. NOC mesh, xbar, BW occupancy components and tests. Cube mesh visualization diagram. 278 tests pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 01:16:18 -07:00
ywkang	6f43807900	commit - release 1	2026-03-18 11:47:48 -07:00

63 Commits