ADR-0033 D6: address-based PC selection at HBM CTRL

Replaces global round-robin with deterministic address-derived PC
striping:

    pc_shift = log2(burst_bytes)
    pc_mask  = num_pcs - 1
    pc       = (flit.address >> pc_shift) & pc_mask

Each Transaction carries base_address (HBM byte offset of the first
chunk); flit i derives its own address as base_address + i*flit_bytes.
HBM CTRL routes flits to PCs via this formula, replacing the
arrival-order RR pointer. Also splits the is_last wait into an
asynchronous _finalize_txn process so the worker isn't blocked on
PC commit, exposing true PC parallelism for disjoint addresses.
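
A minimal sketch of the routing rule above, not the actual HBM CTRL
code. The names select_pc and flit_addresses are hypothetical, and
raising ValueError for non-power-of-2 attributes is an assumption about
how the validation behaves:

```python
def select_pc(address: int, burst_bytes: int = 256, num_pcs: int = 8) -> int:
    """Return the pseudo-channel index for an HBM byte address."""
    # Shift/mask only work for powers of two (assumed validation behavior).
    for name, value in (("burst_bytes", burst_bytes), ("num_pcs", num_pcs)):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a power of two, got {value}")
    pc_shift = burst_bytes.bit_length() - 1  # log2(burst_bytes)
    pc_mask = num_pcs - 1
    return (address >> pc_shift) & pc_mask


def flit_addresses(base_address: int, num_flits: int, flit_bytes: int) -> list[int]:
    """Per-flit addresses: base_address + i * flit_bytes."""
    return [base_address + i * flit_bytes for i in range(num_flits)]


if __name__ == "__main__":
    # One 2KB transaction at offset 0, assuming flit_bytes == burst_bytes == 256.
    for addr in flit_addresses(0, 8, 256):
        print(hex(addr), "-> PC", select_pc(addr))
```

With the defaults, the eight flits of a 2KB transaction starting at
offset 0 land on PCs 0 through 7 in order, independent of arrival order.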

phyaddr.py documents the canonical bit layout (bits [10:8] for the
default burst=256, num_pcs=8 case). ADR-0033 D6 records the
derivation and the workload scenarios where address-striping
matters (strided streams, offset-disjoint parallel transfers).
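
To illustrate the bit layout, the sketch below computes where the PC
select field sits for a given config and splits an offset into its
fields. pc_field_bits and split_offset are hypothetical helpers, not
functions from phyaddr.py:

```python
def pc_field_bits(burst_bytes: int = 256, num_pcs: int = 8) -> tuple[int, int]:
    """Return (high_bit, low_bit) of the PC select field."""
    low = burst_bytes.bit_length() - 1   # log2(burst_bytes)
    width = num_pcs.bit_length() - 1     # log2(num_pcs)
    return low + width - 1, low


def split_offset(hbm_offset: int, burst_bytes: int = 256, num_pcs: int = 8):
    """Split an HBM byte offset into (upper_bits, pc, within_burst)."""
    high, low = pc_field_bits(burst_bytes, num_pcs)
    within_burst = hbm_offset & (burst_bytes - 1)  # same PC for all values
    pc = (hbm_offset >> low) & (num_pcs - 1)       # PC select field
    upper_bits = hbm_offset >> (high + 1)          # address within the PC slice
    return upper_bits, pc, within_burst


if __name__ == "__main__":
    print(pc_field_bits())          # default config
    print(pc_field_bits(512, 16))   # alternative config
    print(split_offset(0x1234))
```

For the default config the field is bits [10:8]; for burst=512 and
num_pcs=16 the same derivation gives bits [12:9].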

Adds tests/test_hbm_address_based_pc.py: canonical bit mapping,
strided 8-way load distribution, same-address PC-0 serialization,
PC-aligned 2KB pair collision, dynamic pc_shift from burst_bytes,
and power-of-2 attr validation. Integration tests inspect the
_pc_avail ledger directly because, at the default config, UCIe's
8 ns per-txn overhead exactly matches chunk_time. That masks PC
contention at the makespan level even though the ledger correctly
distinguishes the cases.
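
The address patterns behind these scenarios can be reproduced with a
small standalone sketch. It only counts flits per PC and does not model
timing or the _pc_avail ledger; pc_of, pc_load and pcs_touched are
hypothetical names, and flit_bytes == burst_bytes == 256 is an
assumption:

```python
from collections import Counter


def pc_of(address: int, burst_bytes: int = 256, num_pcs: int = 8) -> int:
    return (address >> (burst_bytes.bit_length() - 1)) & (num_pcs - 1)


def pc_load(bases, txn_bytes: int, flit_bytes: int = 256) -> dict[int, int]:
    """Count flits per PC for transactions starting at each base offset."""
    load = Counter()
    for base in bases:
        for off in range(0, txn_bytes, flit_bytes):
            load[pc_of(base + off)] += 1
    return dict(sorted(load.items()))


def pcs_touched(base: int, txn_bytes: int, flit_bytes: int = 256) -> set[int]:
    """Set of PCs that one transaction touches."""
    return {pc_of(base + off) for off in range(0, txn_bytes, flit_bytes)}


if __name__ == "__main__":
    # Eight single-burst transactions with a 256B stride: one flit per PC.
    print(pc_load([i * 256 for i in range(8)], txn_bytes=256))
    # Four transactions at the same address: all serialized on PC 0.
    print(pc_load([0, 0, 0, 0], txn_bytes=256))
    # Two PC-aligned 2KB transactions: every PC is claimed twice.
    print(pc_load([0, 2048], txn_bytes=2048))
    # Two offset-disjoint 1KB transactions: disjoint PC sets.
    print(pcs_touched(0, 1024), pcs_touched(1024, 1024))
```

The 2KB pair collides on every PC because each transaction spans
exactly num_pcs × burst_bytes, while the two 1KB transfers at offsets 0
and 1024 split cleanly into PCs 0 to 3 and PCs 4 to 7.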

Full suite: 631 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 00:18:46 -07:00
parent a44f832be5
commit aaa1cbfaf6
6 changed files with 292 additions and 27 deletions
+22 -6
@@ -111,12 +111,28 @@ below are different concerns, ordered by expected workload impact.
 **Higher impact (workload accuracy gap)**:
 - [ ] **Address-based PC selection at HBM CTRL** (replace the
-address-blind global round-robin). When two transactions of size
-`num_pcs × burst_bytes` (e.g., 2KB at 8 PCs × 256B) arrive
-concurrently, both claim PCs 0..7 via global RR, producing full
-per-PC contention even when real-HW address striping would put
-them on disjoint PC sets. Directly affects multi-PE concurrent
-HBM workload latencies.
+address-blind global round-robin). Compute the PC index from
+the HBM byte offset using parameters already in topology config:
+pc_shift = log2(burst_bytes) # default 8 (burst=256B)
+pc_mask = num_pcs - 1 # default 7 (8 PCs)
+pc = (hbm_offset >> pc_shift) & pc_mask
+For the default `burst_bytes=256, num_pcs=8` this places the PC
+select field at HBM byte-offset bits **[10:8]**: bits [7:0] are
+the within-burst offset (same PC), bits [10:8] are the 3-bit PC
+index, and bits [36:11] are row/bank/column within the PC slice.
+Shift/mask are derived from topology config rather than hardcoded
+so alternative `(burst_bytes, num_pcs)` pairs stay consistent.
+See `src/kernbench/policy/address/phyaddr.py` for the canonical
+comment.
+Real-HW workloads where this matters most: (a) strided multi-
+transaction streams that under global-RR collide on the same PCs
+but under address-striping land on disjoint sets; (b) offset-
+disjoint parallel transfers where address-striping preserves
+parallelism while global-RR re-serializes them. Directly affects
+multi-PE concurrent HBM workload latencies.
 - [ ] **Bank-level conflict modeling** within a PC (opt-in via
 `track_banks: true`). Currently we assume no same-bank reuse;
 random scatter/gather workloads are optimistic here.