Replaces global round-robin with deterministic address-derived PC
striping:
pc_shift = log2(burst_bytes)
pc_mask = num_pcs - 1
pc = (flit.address >> pc_shift) & pc_mask
Each Transaction carries base_address (HBM byte offset of the first
chunk); each Flit derives its own address as base + i*flit_bytes.
HBM CTRL routes flits to PCs via this formula, replacing the
arrival-order RR pointer. Also splits the is_last wait into an
asynchronous _finalize_txn process so the worker isn't blocked on
PC commit, exposing true PC parallelism for disjoint addresses.
phyaddr.py documents the canonical bit layout (bits [10:8] for the
default burst=256, num_pcs=8 case). ADR-0033 D6 records the
derivation and the workload scenarios where address-striping
matters (strided streams, offset-disjoint parallel transfers).
Adds tests/test_hbm_address_based_pc.py: canonical bit mapping,
strided 8-way load distribution, same-address PC-0 serialization,
PC-aligned 2KB pair collision, dynamic pc_shift from burst_bytes,
and power-of-2 attr validation. Integration tests inspect
_pc_avail ledger directly: at default config UCIe's 8 ns per-txn
overhead exactly matches chunk_time, masking PC contention at the
makespan level even though the ledger correctly distinguishes the
cases.
Full suite: 631 passed, 1 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cycle-accurate arbitration policies (priority/iSLIP) downgraded to
"academic / specific use cases" — FIFO inbox is approximately fair
for typical similar-rate workloads (GEMM, AllReduce, data parallel).
True impact appears only for QoS modeling or per-stream tail latency
analysis under saturation.
Higher-priority items pulled forward: address-based PC selection at
HBM CTRL (directly affects multi-PE concurrent HBM contention), bank
conflict modeling, HBM scheduler, finite buffer backpressure, op_log
chunk-streaming integration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier the future-work list mentioned "multi-flow fair sharing on a
single shared link" which was confusing — each wire has a single
source, so this isn't a real gap. The actual modeling story:
- Multi-stream merging at routers IS handled via per-in_port fan_in +
shared inbox + FIFO worker forwarding. Flits from different
upstream streams interleave at flit granularity naturally.
- What's NOT modeled: cycle-accurate arbitration policies (priority,
iSLIP), address-based PC selection at HBM CTRL (round-robin is
address-blind, so size-aligned concurrent transactions hit full
PC contention even when real-HW address striping would diverge),
sub-flit (32B) granularity, finite buffer backpressure, and bank
conflict modeling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- test_op_log_per_transaction_not_per_flit (renamed from
..._records...): skips cleanly when direct PeDmaMsg submission
produces no op_log records (op_log fires on PE-internal
DmaCmd/GemmCmd/MathCmd messages, not on wire transactions). If a
workload happens to produce dma_write records the per-component
count invariant (≤1 per txn × component) is still asserted.
- ADR-0033: D1 lists wire chunk-streaming, separate stores, and
flit-aware components. D2/D3/D4 updated for new wire model.
D6 future work notes op_log full integration with chunk-streaming.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous model double-counted slow-upstream paths (e.g., 64KB via UCIe
128 GB/s was ~2x pessimistic). HBM CTRL now distributes bursts across
8 pseudo-channels via global round-robin, with per-chunk commit timing
that pipelines correctly against the bottleneck link's data arrival.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>