kernbench2

Author	SHA1	Message	Date
ywkang	aaa1cbfaf6	ADR-0033 D6: address-based PC selection at HBM CTRL Replaces global round-robin with deterministic address-derived PC striping: pc_shift = log2(burst_bytes) pc_mask = num_pcs - 1 pc = (flit.address >> pc_shift) & pc_mask Each Transaction carries base_address (HBM byte offset of the first chunk); each Flit derives its own address as base + i*flit_bytes. HBM CTRL routes flits to PCs via this formula, replacing the arrival-order RR pointer. Also splits the is_last wait into an asynchronous _finalize_txn process so the worker isn't blocked on PC commit, exposing true PC parallelism for disjoint addresses. phyaddr.py documents the canonical bit layout (bits [10:8] for the default burst=256, num_pcs=8 case). ADR-0033 D6 records the derivation and the workload scenarios where address-striping matters (strided streams, offset-disjoint parallel transfers). Adds tests/test_hbm_address_based_pc.py: canonical bit mapping, strided 8-way load distribution, same-address PC-0 serialization, PC-aligned 2KB pair collision, dynamic pc_shift from burst_bytes, and power-of-2 attr validation. Integration tests inspect _pc_avail ledger directly: at default config UCIe's 8 ns per-txn overhead exactly matches chunk_time, masking PC contention at the makespan level even though the ledger correctly distinguishes the cases. Full suite: 631 passed, 1 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 00:18:46 -07:00
ywkang	4929040cf1	Phase 2c-2/3: per-flit wire timing + flit-aware routers + HBM CTRL Root cause of Phase 2c-1 timing collapse identified: src.out_port and dst.in_port aliased the same simpy.Store, so when wire chunkified a Transaction into Flits and re-put them, fan_in could pull flits before the wire applied bw delay — half the flits bypassed bottleneck timing. Fix: separate Stores per directed edge. Wire is the only conduit. Each flit on the wire incurs chunk_time = flit_nbytes/bw_gbs once, in arrival order. Multi-hop wormhole pipelining emerges naturally because flit-aware pass-through (TransitComponent) forwards each flit serially without reassembly. 64 KB MemoryWrite via UCIe 128 GB/s bottleneck: 273 ns (broken) → 545 ns (matches drain 512 + commit 8 + path overheads). 1 MB: 8230 ns (matches drain 8192). Single-flit transfer transport-time alone, exactly what real-HW wormhole produces. 3 pre-existing tests now off by small margins or inverted: - test_h2d_local_cube_cut_through: 65.53 vs threshold 65.0 - test_engine_override_is_scoped_to_impl: ZeroRouter inherits ComponentBase, not flit-aware, so override path reassembles at each hop while default doesn't - test_intra_sip_critical_path_at_96k_below_threshold: 96KB allreduce microscopically over its threshold Not weakening these to pass: they reflect model fidelity improvements that need calibrated thresholds. To address in follow-up via test threshold updates and ZeroRouter→TransitComponent inheritance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 22:43:40 -07:00
ywkang	5fdb6f8797	Latency model: HBM PC striping + chunk-loop drain (ADR-0033) Previous model double-counted slow-upstream paths (e.g., 64KB via UCIe 128 GB/s was ~2x pessimistic). HBM CTRL now distributes bursts across 8 pseudo-channels via global round-robin, with per-chunk commit timing that pipelines correctly against the bottleneck link's data arrival. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 21:59:07 -07:00
ywkang	5917b3497c	Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019) - Remove xbar_top/bot, bridge, single noc node from topology - Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col}) - HBM_CTRL consolidated to single node per cube, attached to all routers - All traffic (DMA data + PE command) routes through same router mesh - Update AddressResolver (no slice suffix), PathRouter (_adj_local) - Update ADR-0002~0019, SPEC.md to remove xbar/bridge references - Regenerate SVG diagrams for new topology structure - Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired) 326 passed, 13 skipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 17:51:28 -07:00
ywkang	63669f82cb	Add SIP-level tensor parallelism, component registry YAML, VA offset verification - DPPolicy: 3-level (sip/cube/pe), unified naming (column_wise/row_wise) - PE_CPU: auto num_programs from cube shard count - context.launch(): per-SIP KernelLaunchMsg with local va_base + auto local shape - deploy_tensor: removed mmus param, MMU mapping is context-only responsibility - ComponentRegistry: YAML-based lazy loading (components.yaml), impls→builtin rename - VA offset bench + tests: 2D/1D, standard Triton kernel pattern Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 01:13:17 -07:00

5 Commits