ADR-0033 D6: address-based PC selection at HBM CTRL

Replaces global round-robin with deterministic address-derived PC striping:

    pc_shift = log2(burst_bytes)
    pc_mask  = num_pcs - 1
    pc       = (flit.address >> pc_shift) & pc_mask

Each Transaction carries base_address (HBM byte offset of the first chunk); each Flit derives its own address as base + i*flit_bytes. HBM CTRL routes flits to PCs via this formula, replacing the arrival-order RR pointer.

Also splits the is_last wait into an asynchronous _finalize_txn process so the worker isn't blocked on PC commit, exposing true PC parallelism for disjoint addresses.

phyaddr.py documents the canonical bit layout (bits [10:8] for the default burst=256, num_pcs=8 case). ADR-0033 D6 records the derivation and the workload scenarios where address-striping matters (strided streams, offset-disjoint parallel transfers).

Adds tests/test_hbm_address_based_pc.py: canonical bit mapping, strided 8-way load distribution, same-address PC-0 serialization, PC-aligned 2KB pair collision, dynamic pc_shift from burst_bytes, and power-of-2 attr validation.

Integration tests inspect _pc_avail ledger directly: at default config UCIe's 8 ns per-txn overhead exactly matches chunk_time, masking PC contention at the makespan level even though the ledger correctly distinguishes the cases.

Full suite: 631 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
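The PC-selection formula from the commit message can be sketched in standalone Python as follows. This is a minimal illustration, not the code in this commit: the function names `pc_index` and `flit_pcs` are hypothetical, and the example assumes `flit_bytes` equals `burst_bytes` (256), which the commit message does not state.

```python
def _require_power_of_two(name: str, value: int) -> None:
    """Reject values for which the shift/mask arithmetic would be wrong."""
    if value <= 0 or value & (value - 1):
        raise ValueError(f"{name} must be a positive power of 2, got {value}")


def pc_index(address: int, burst_bytes: int = 256, num_pcs: int = 8) -> int:
    """Derive the pseudo-channel index from an HBM byte offset.

    Hypothetical helper mirroring the formula in the commit message:
    pc = (address >> log2(burst_bytes)) & (num_pcs - 1).
    """
    _require_power_of_two("burst_bytes", burst_bytes)
    _require_power_of_two("num_pcs", num_pcs)
    pc_shift = burst_bytes.bit_length() - 1  # exact log2 for a power of 2
    pc_mask = num_pcs - 1
    return (address >> pc_shift) & pc_mask


def flit_pcs(base_address: int, num_flits: int, flit_bytes: int = 256,
             burst_bytes: int = 256, num_pcs: int = 8) -> list[int]:
    """PC index of each flit, with flit i at base_address + i * flit_bytes.

    Assumption: flit_bytes defaults to burst_bytes for illustration only.
    """
    return [
        pc_index(base_address + i * flit_bytes, burst_bytes, num_pcs)
        for i in range(num_flits)
    ]


if __name__ == "__main__":
    # Default layout: bits [7:0] within-burst offset, bits [10:8] PC index.
    print(pc_index(0x0FF), pc_index(0x100), pc_index(0x7FF), pc_index(0x800))
    # Two 1KB transfers at offsets 0 and 0x400 land on disjoint PC sets.
    print(flit_pcs(0x000, 4), flit_pcs(0x400, 4))
    # Two PC-aligned 2KB transfers both cover every PC, so they collide.
    print(flit_pcs(0, 8), flit_pcs(2048, 8))
```

Under these assumptions, offset-disjoint transfers smaller than `num_pcs × burst_bytes` map to different PCs, while two transfers that each span a full 2KB stripe still touch all eight PCs, which matches the "PC-aligned 2KB pair collision" case named in the test list.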
2026-05-15 00:18:46 -07:00
parent a44f832be5
commit aaa1cbfaf6
6 changed files with 292 additions and 27 deletions
@@ -111,12 +111,28 @@ below are different concerns, ordered by expected workload impact.
 **Higher impact (workload accuracy gap)**:

 - [ ] **Address-based PC selection at HBM CTRL** (replace the
-  address-blind global round-robin). When two transactions of size
-  `num_pcs × burst_bytes` (e.g., 2KB at 8 PCs × 256B) arrive
-  concurrently, both claim PCs 0..7 via global RR, producing full
-  per-PC contention even when real-HW address striping would put
-  them on disjoint PC sets. Directly affects multi-PE concurrent
-  HBM workload latencies.
+  address-blind global round-robin). Compute the PC index from
+  the HBM byte offset using parameters already in topology config:
+
+      pc_shift = log2(burst_bytes)        # default 8 (burst=256B)
+      pc_mask  = num_pcs - 1              # default 7 (8 PCs)
+      pc       = (hbm_offset >> pc_shift) & pc_mask
+
+  For the default `burst_bytes=256, num_pcs=8` this places the PC
+  select field at HBM byte-offset bits **[10:8]**: bits [7:0] are
+  the within-burst offset (same PC), bits [10:8] are the 3-bit PC
+  index, and bits [36:11] are row/bank/column within the PC slice.
+  Shift/mask are derived from topology config rather than hardcoded
+  so alternative `(burst_bytes, num_pcs)` pairs stay consistent.
+  See `src/kernbench/policy/address/phyaddr.py` for the canonical
+  comment.
+
+  Real-HW workloads where this matters most: (a) strided multi-
+  transaction streams that under global-RR collide on the same PCs
+  but under address-striping land on disjoint sets; (b) offset-
+  disjoint parallel transfers where address-striping preserves
+  parallelism while global-RR re-serializes them. Directly affects
+  multi-PE concurrent HBM workload latencies.
 - [ ] **Bank-level conflict modeling** within a PC (opt-in via
  `track_banks: true`). Currently we assume no same-bank reuse;
  random scatter/gather workloads are optimistic here.