kernbench2

Author	SHA1	Message	Date
ywkang	049e3d8bb3	benches: package as kernbench.benches, add @bench registry + list subcommand Move benches/ -> src/kernbench/benches/ and src/kernbench/cli/probe.py -> src/kernbench/probes/probe.py. Each bench self-registers via @bench(name=..., description=...); kernbench list enumerates benches with auto-assigned indices, --bench accepts kebab-case name or numeric index. Audit at package-import time fails if any non-underscore module forgets the decorator. ADR-0010 (EN + KO) updated to reflect the new resolver path, list subcommand, and probes package separation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 14:42:10 -07:00
ywkang	6824a935c9	Calibrate 3 tests for ADR-0033 Phase 2c per-flit wire timing - test_h2d_local_cube_cut_through: threshold 65 → 80ns. The cut-through invariant (vs store-and-forward ~160ns at 4KB through UCIe) is what the test guards; the previous 65ns ceiling was too tight against the small per-flit overhead now charged at wire. - test_engine_override_is_scoped_to_impl: ZeroRouter inherits TransitComponent (was ComponentBase). Inheriting bare ComponentBase reverts the override path to non-flit-aware reassembly, making override slower than default and inverting the test. The test's intent is overhead=0 vs overhead=2, not flit-awareness. - test_intra_sip_critical_path_at_96k_below_threshold: threshold 20.5 → 30 µs. Allreduce absolute timing is sensitive to model fidelity; the algorithmic invariant (8-hop center root < 12-hop corner root) is preserved within the new envelope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 23:06:33 -07:00
ywkang	81cc32c46b	ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables Remove rack_id (4 bits), rename sip_seg→die_id, shift fields to enable 42-bit local_offset (4 TB per die). Define PE_LOCAL/MCPU_LOCAL/CUBE_SRAM sub-unit tables for AHBM dies and IOCPU sub-unit table for IOCHIPLET dies (1 TB window). Supersedes ADR-0031. Also fixes latent VA/PA confusion in pe_dma pipeline DMA path where virtual addresses were decoded as physical addresses without MMU translation — previously masked by coincidental bit-position alignment. 529 passed (+6 recovered), 10 pre-existing failures unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-27 15:52:29 -07:00
ywkang	eb792e6212	Remove xbar/noc remnants, rule-based cube-view connectors - Delete xbar.py and noc.py (TwoDMeshNocComponent) — unused since router mesh - Remove xbar_v1/noc_2d_mesh_v1 from components.yaml - Fix pe_to_xbar → pe_to_router in routing exclusion set - Fix xbar_to_hbm_bw_gbs → hbm_to_router_bw_gbs in report.py - Update all docstrings/comments referencing xbar/bridge → router mesh - Cube-view connectors: rule-based _connector_points helper - PE↔router: single diagonal line (not chevron) - UCIe N/S: 45°→horizontal→45° - UCIe E/W: 45°→vertical→45° - HBM ports: 45°→horizontal→45° Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 23:59:12 -07:00
ywkang	5917b3497c	Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019) - Remove xbar_top/bot, bridge, single noc node from topology - Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col}) - HBM_CTRL consolidated to single node per cube, attached to all routers - All traffic (DMA data + PE command) routes through same router mesh - Update AddressResolver (no slice suffix), PathRouter (_adj_local) - Update ADR-0002~0019, SPEC.md to remove xbar/bridge references - Regenerate SVG diagrams for new topology structure - Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired) 326 passed, 13 skipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 17:51:28 -07:00
ywkang	62fb01ae18	Add reverse path response latency for PE DMA and PE_CPU→M_CPU Model fabric response hop latency for PE-internal operations: - HBM_CTRL sends PeDmaMsg response on reverse path instead of direct done signal - PE_CPU sends ResponseMsg via NOC→M_CPU on kernel completion - Add NOC→PE_DMA and PE_CPU→NOC edges in topology builder - Make HBM BW test assertions dynamic based on topology efficiency Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 15:40:56 -07:00
ywkang	d75da439c6	Add probe CLI improvements, D2H read, UCIe/HBM tuning, BW sweep - Probe CLI: restructured output (tables first, routes below), per-hop timestamps, split cross-cube into best/worst cases, D2H read section - UCIe overhead: 1ns -> 8ns per port (16ns per crossing) to fix cross-cube-best < cross-half latency inversion - HBM efficiency: added efficiency=0.8 factor to hbm_ctrl, reducing effective BW from 256 to 204.8 GB/s - Multi-size BW sweep: saturation tables (4KB-1MB) for all probe cases - Probe default data size: 4KB -> 32KB for more realistic measurements - IOChiplet NOC + D2H topology and tests - NOC mesh, xbar, BW occupancy components and tests - Cube mesh visualization diagram 278 tests pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 01:16:18 -07:00
ywkang	6f43807900	commit - release 1	2026-03-18 11:47:48 -07:00

8 Commits