Commit Graph

5 Commits

Author SHA1 Message Date
ywkang aaa1cbfaf6 ADR-0033 D6: address-based PC selection at HBM CTRL
Replaces global round-robin with deterministic address-derived PC
striping:

    pc_shift = log2(burst_bytes)
    pc_mask  = num_pcs - 1
    pc       = (flit.address >> pc_shift) & pc_mask

Each Transaction carries base_address (HBM byte offset of the first
chunk); each Flit derives its own address as base + i*flit_bytes.
HBM CTRL routes flits to PCs via this formula, replacing the
arrival-order RR pointer. Also splits the is_last wait into an
asynchronous _finalize_txn process so the worker isn't blocked on
PC commit, exposing true PC parallelism for disjoint addresses.

phyaddr.py documents the canonical bit layout (bits [10:8] for the
default burst=256, num_pcs=8 case). ADR-0033 D6 records the
derivation and the workload scenarios where address-striping
matters (strided streams, offset-disjoint parallel transfers).

Adds tests/test_hbm_address_based_pc.py: canonical bit mapping,
strided 8-way load distribution, same-address PC-0 serialization,
PC-aligned 2KB pair collision, dynamic pc_shift from burst_bytes,
and power-of-2 attr validation. Integration tests inspect
_pc_avail ledger directly: at default config UCIe's 8 ns per-txn
overhead exactly matches chunk_time, masking PC contention at the
makespan level even though the ledger correctly distinguishes the
cases.

Full suite: 631 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 00:18:46 -07:00
ywkang 4929040cf1 Phase 2c-2/3: per-flit wire timing + flit-aware routers + HBM CTRL
Root cause of Phase 2c-1 timing collapse identified: src.out_port and
dst.in_port aliased the same simpy.Store, so when wire chunkified a
Transaction into Flits and re-put them, fan_in could pull flits before
the wire applied bw delay — half the flits bypassed bottleneck timing.

Fix: separate Stores per directed edge. Wire is the only conduit. Each
flit on the wire incurs chunk_time = flit_nbytes/bw_gbs once, in arrival
order. Multi-hop wormhole pipelining emerges naturally because
flit-aware pass-through (TransitComponent) forwards each flit serially
without reassembly.

64 KB MemoryWrite via UCIe 128 GB/s bottleneck: 273 ns (broken) → 545 ns
(matches drain 512 + commit 8 + path overheads). 1 MB: 8230 ns (matches
drain 8192). Single-flit transfer transport-time alone, exactly what
real-HW wormhole produces.

3 pre-existing tests now off by small margins or inverted:
- test_h2d_local_cube_cut_through: 65.53 vs threshold 65.0
- test_engine_override_is_scoped_to_impl: ZeroRouter inherits
  ComponentBase, not flit-aware, so override path reassembles at each
  hop while default doesn't
- test_intra_sip_critical_path_at_96k_below_threshold: 96KB allreduce
  microscopically over its threshold

Not weakening these to pass: they reflect model fidelity improvements
that need calibrated thresholds. To address in follow-up via test
threshold updates and ZeroRouter→TransitComponent inheritance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 22:43:40 -07:00
ywkang 5fdb6f8797 Latency model: HBM PC striping + chunk-loop drain (ADR-0033)
Previous model double-counted slow-upstream paths (e.g., 64KB via UCIe
128 GB/s was ~2x pessimistic). HBM CTRL now distributes bursts across
8 pseudo-channels via global round-robin, with per-chunk commit timing
that pipelines correctly against the bottleneck link's data arrival.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 21:59:07 -07:00
ywkang 5917b3497c Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019)
- Remove xbar_top/bot, bridge, single noc node from topology
- Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col})
- HBM_CTRL consolidated to single node per cube, attached to all routers
- All traffic (DMA data + PE command) routes through same router mesh
- Update AddressResolver (no slice suffix), PathRouter (_adj_local)
- Update ADR-0002~0019, SPEC.md to remove xbar/bridge references
- Regenerate SVG diagrams for new topology structure
- Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired)

326 passed, 13 skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 17:51:28 -07:00
ywkang 63669f82cb Add SIP-level tensor parallelism, component registry YAML, VA offset verification
- DPPolicy: 3-level (sip/cube/pe), unified naming (column_wise/row_wise)
- PE_CPU: auto num_programs from cube shard count
- context.launch(): per-SIP KernelLaunchMsg with local va_base + auto local shape
- deploy_tensor: removed mmus param, MMU mapping is context-only responsibility
- ComponentRegistry: YAML-based lazy loading (components.yaml), impls→builtin rename
- VA offset bench + tests: 2D/1D, standard Triton kernel pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 01:13:17 -07:00