ADR-0033 D6: address-based PC selection at HBM CTRL

Replaces global round-robin with deterministic address-derived PC
striping:

    pc_shift = log2(burst_bytes)
    pc_mask = num_pcs - 1
    pc = (flit.address >> pc_shift) & pc_mask

Each Transaction carries base_address (HBM byte offset of the first
chunk); each Flit derives its own address as base + i*flit_bytes.
HBM CTRL routes flits to PCs via this formula, replacing the
arrival-order RR pointer. Also splits the is_last wait into an
asynchronous _finalize_txn process so the worker isn't blocked on
PC commit, exposing true PC parallelism for disjoint addresses.
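A minimal Python sketch of the selection rule and the per-flit address derivation described above. The helper names `pc_for_address` and `flit_pcs` are hypothetical and do not come from the simulator; the defaults mirror the burst=256, num_pcs=8 case, and the power-of-two check is only an assumption about what the attribute validation might look like.

```python
def _require_power_of_two(name: str, value: int) -> None:
    """Reject values for which the shift/mask arithmetic would be invalid."""
    if value <= 0 or value & (value - 1):
        raise ValueError(f"{name} must be a positive power of two, got {value}")


def pc_for_address(address: int, burst_bytes: int = 256, num_pcs: int = 8) -> int:
    """Return the PC index for an HBM byte offset (hypothetical helper)."""
    _require_power_of_two("burst_bytes", burst_bytes)
    _require_power_of_two("num_pcs", num_pcs)
    pc_shift = burst_bytes.bit_length() - 1  # log2(burst_bytes)
    pc_mask = num_pcs - 1
    return (address >> pc_shift) & pc_mask


def flit_pcs(base_address: int, num_flits: int, flit_bytes: int,
             burst_bytes: int = 256, num_pcs: int = 8) -> list[int]:
    """PC index for each flit, with flit i at base_address + i * flit_bytes."""
    return [
        pc_for_address(base_address + i * flit_bytes, burst_bytes, num_pcs)
        for i in range(num_flits)
    ]
```

With 256-byte flits, `flit_pcs(0x000, 8, 256)` visits PCs 0 through 7 in order; with 64-byte flits, four consecutive flits share one PC before the index advances.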
phyaddr.py documents the canonical bit layout (bits [10:8] for the
default burst=256, num_pcs=8 case). ADR-0033 D6 records the
derivation and the workload scenarios where address-striping
matters (strided streams, offset-disjoint parallel transfers).
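As an illustration of the default bit layout, the sketch below splits an HBM byte offset into its fields for burst=256, num_pcs=8: bits [7:0] within-burst offset, bits [10:8] PC index, bits [36:11] the remaining address within the PC slice. `HbmOffsetFields` and `split_hbm_offset` are hypothetical names, not the contents of phyaddr.py, and the widths are hardcoded here only to keep the example short.

```python
from typing import NamedTuple


class HbmOffsetFields(NamedTuple):
    burst_offset: int  # bits [7:0], same PC for every value
    pc: int            # bits [10:8], 3-bit PC index
    upper: int         # bits [36:11], row/bank/column within the PC slice


def split_hbm_offset(offset: int) -> HbmOffsetFields:
    """Decode an HBM byte offset for the default burst=256, num_pcs=8 layout."""
    return HbmOffsetFields(
        burst_offset=offset & 0xFF,
        pc=(offset >> 8) & 0x7,
        upper=(offset >> 11) & ((1 << 26) - 1),  # 26 bits: 36 - 11 + 1
    )
```

For example, offset `0x7FF` decodes to the last byte of a burst on PC 7, and `0x800` wraps back to PC 0 with the upper field incremented.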
Adds tests/test_hbm_address_based_pc.py: canonical bit mapping,
strided 8-way load distribution, same-address PC-0 serialization,
PC-aligned 2KB pair collision, dynamic pc_shift from burst_bytes,
and power-of-2 attr validation. Integration tests inspect
_pc_avail ledger directly: at default config UCIe's 8 ns per-txn
overhead exactly matches chunk_time, masking PC contention at the
makespan level even though the ledger correctly distinguishes the
cases.
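The scenarios named above can be sketched with a toy per-PC availability ledger. This is one possible reading of the test names, not the test file itself: the `pc_avail` list only stands in for the `_pc_avail` ledger (its real structure is not shown here), chunks are assumed to be burst-sized, the 8 ns chunk time is taken from the default-config remark above, and UCIe overhead is ignored entirely.

```python
BURST_BYTES = 256
NUM_PCS = 8
CHUNK_TIME_NS = 8.0  # assumed per-chunk PC occupancy at default config


def pc_of(address: int) -> int:
    """PC index from an HBM byte offset, using the shift/mask rule."""
    return (address >> (BURST_BYTES.bit_length() - 1)) & (NUM_PCS - 1)


def commit(base_address: int, size_bytes: int, pc_avail: list[float],
           now: float = 0.0) -> float:
    """Book each burst-sized chunk on its PC; return the transfer's finish time."""
    finish = now
    for offset in range(0, size_bytes, BURST_BYTES):
        pc = pc_of(base_address + offset)
        start = max(now, pc_avail[pc])  # wait if the PC is still busy
        pc_avail[pc] = start + CHUNK_TIME_NS
        finish = max(finish, pc_avail[pc])
    return finish


# Offset-disjoint 1KB transfers land on PCs 0..3 and 4..7: no waiting.
disjoint = [0.0] * NUM_PCS
disjoint_finish = max(commit(0, 1024, disjoint), commit(1024, 1024, disjoint))

# PC-aligned 2KB transfers each cover PCs 0..7, so the second queues behind the first.
aligned = [0.0] * NUM_PCS
aligned_finish = max(commit(0, 2048, aligned), commit(2048, 2048, aligned))

# Eight 256B transfers to the same address all serialize on PC 0.
same = [0.0] * NUM_PCS
same_finish = max(commit(0, 256, same) for _ in range(8))
```

Under these assumptions the disjoint pair finishes at 8 ns, the aligned pair at 16 ns, and the same-address case at 64 ns, with only PC 0 booked in the last ledger.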
Full suite: 631 passed, 1 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -111,12 +111,28 @@ below are different concerns, ordered by expected workload impact.
 **Higher impact (workload accuracy gap)**:
 
 - [ ] **Address-based PC selection at HBM CTRL** (replace the
-  address-blind global round-robin). When two transactions of size
-  `num_pcs × burst_bytes` (e.g., 2KB at 8 PCs × 256B) arrive
-  concurrently, both claim PCs 0..7 via global RR, producing full
-  per-PC contention even when real-HW address striping would put
-  them on disjoint PC sets. Directly affects multi-PE concurrent
-  HBM workload latencies.
+  address-blind global round-robin). Compute the PC index from
+  the HBM byte offset using parameters already in topology config:
+
+      pc_shift = log2(burst_bytes)   # default 8 (burst=256B)
+      pc_mask = num_pcs - 1          # default 7 (8 PCs)
+      pc = (hbm_offset >> pc_shift) & pc_mask
+
+  For the default `burst_bytes=256, num_pcs=8` this places the PC
+  select field at HBM byte-offset bits **[10:8]**: bits [7:0] are
+  the within-burst offset (same PC), bits [10:8] are the 3-bit PC
+  index, and bits [36:11] are row/bank/column within the PC slice.
+  Shift/mask are derived from topology config rather than hardcoded
+  so alternative `(burst_bytes, num_pcs)` pairs stay consistent.
+  See `src/kernbench/policy/address/phyaddr.py` for the canonical
+  comment.
+
+  Real-HW workloads where this matters most: (a) strided multi-
+  transaction streams that under global-RR collide on the same PCs
+  but under address-striping land on disjoint sets; (b) offset-
+  disjoint parallel transfers where address-striping preserves
+  parallelism while global-RR re-serializes them. Directly affects
+  multi-PE concurrent HBM workload latencies.
 - [ ] **Bank-level conflict modeling** within a PC (opt-in via
   `track_banks: true`). Currently we assume no same-bank reuse;
   random scatter/gather workloads are optimistic here.