ADR-0033 D6: reorder future work by workload impact

Cycle-accurate arbitration policies (priority/iSLIP) downgraded to "academic / specific use cases": FIFO inbox is approximately fair for typical similar-rate workloads (GEMM, AllReduce, data parallel). True impact appears only for QoS modeling or per-stream tail latency analysis under saturation. Higher-priority items pulled forward: address-based PC selection at HBM CTRL (directly affects multi-PE concurrent HBM contention), bank conflict modeling, HBM scheduler, finite buffer backpressure, op_log chunk-streaming integration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:21:35 -07:00
parent 9beb140eaa
commit c9bd5387ac
1 changed files with 27 additions and 16 deletions
@@ -106,36 +106,47 @@ Note: multi-stream merging at routers IS modeled correctly — each
 in_port has its own fan_in process, all push to a shared inbox, and
 the router worker forwards in inbox FIFO order. Flits from different
 upstream streams naturally interleave at flit granularity. The items
-below are different concerns.
+below are different concerns, ordered by expected workload impact.
 **Higher impact (workload accuracy gap)**:
 - [ ] **Cycle-accurate router arbitration policies** (RR with
  priorities, age, iSLIP). Currently the inbox FIFO order is used as
  a proxy for fair RR — works when flit arrival times differ slightly
  between streams, but doesn't reflect intentional priority/QoS.
 - [ ] **Sub-flit (32B) granularity** for finer wire arbitration
  cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
  per 32B flit. Effect is small for most workloads (sub-flit timing
  noise).
 - [ ] **Address-based PC selection at HBM CTRL** (replace the
  address-blind global round-robin). When two transactions of size
  `num_pcs × burst_bytes` (e.g., 2KB at 8 PCs × 256B) arrive
  concurrently, both claim PCs 0..7 via global RR, producing full
-  per-PC contention. Real HW uses address bits to select PCs, so
-  different-address transactions hit different PC patterns. Address
-  modeling would let the simulator reflect cache-line/page-aware
-  layouts.
+  per-PC contention even when real-HW address striping would put
+  them on disjoint PC sets. Directly affects multi-PE concurrent
+  HBM workload latencies.
 - [ ] **Bank-level conflict modeling** within a PC (opt-in via
-  `track_banks: true`). Currently we assume no same-bank reuse.
+  `track_banks: true`). Currently we assume no same-bank reuse;
  random scatter/gather workloads are optimistic here.
 - [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
  from the design discussion). Default `switch_penalty_ns=0` is the
-  ideal-amortization stand-in.
- [ ] **Backpressure** modeling for finite component buffers.
+  ideal-amortization stand-in; bursty mixed R/W workloads benefit
+  from explicit modeling.
 - [ ] **Backpressure** modeling for finite component buffers. Matters
  at high concurrency / sustained saturation where buffer occupancy
  causes upstream stalls.
 - [ ] **Op_log integration with chunk-streaming**: currently op_log
  fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
  GemmCmd, MathCmd) which are not chunkified. Integration would
  require flit-aware components to also emit op_log start/end hooks
  per transaction (start on first flit, end on is_last).
 **Lower impact (academic / specific use cases)**:
 - [ ] **Cycle-accurate router arbitration policies** (RR with
  priorities, age, iSLIP). The FIFO inbox is already approximately
  fair when flit arrival times differ slightly between streams (the
  common case for similar-rate workloads). True impact appears only
  for: (a) priority/QoS modeling, (b) per-stream tail latency
  analysis under sustained saturation. Not critical for makespan or
  average-latency studies.
 - [ ] **Sub-flit (32B) granularity** for finer wire arbitration
  cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
  per 32B flit. Effect is small for most workloads (sub-flit timing
  noise on small messages).
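
The address-based PC selection item can be illustrated with a minimal sketch. It is not the simulator's code: the function names, the shared `rr` counter, and the `interleave_bytes` parameter are hypothetical, and the 2KB interleave granularity is an assumption chosen only so that the two transactions from the example (2KB each, 8 PCs × 256B) land on disjoint PCs. The real address-to-PC mapping is not specified here.

```python
from itertools import count

NUM_PCS = 8         # from the example above: 8 PCs
BURST_BYTES = 256   # from the example above: 256B burst


def pcs_global_rr(size_bytes, rr):
    """Address-blind: each burst takes the next PC from one shared counter."""
    return {next(rr) % NUM_PCS for _ in range(size_bytes // BURST_BYTES)}


def pcs_by_address(base_addr, size_bytes, interleave_bytes):
    """Address-based (hypothetical mapping): the PC index comes from the
    address bits just above the interleave offset."""
    return {((base_addr + off) // interleave_bytes) % NUM_PCS
            for off in range(0, size_bytes, BURST_BYTES)}


rr = count()                       # shared round-robin counter (assumed)
txn_a = (0x0000, 2048)             # (base address, size in bytes)
txn_b = (0x0800, 2048)

rr_a = pcs_global_rr(txn_a[1], rr)
rr_b = pcs_global_rr(txn_b[1], rr)
addr_a = pcs_by_address(*txn_a, interleave_bytes=2048)
addr_b = pcs_by_address(*txn_b, interleave_bytes=2048)

print("global RR overlap:    ", sorted(rr_a & rr_b))
print("address-based overlap:", sorted(addr_a & addr_b))
```

Under these assumptions the global round-robin puts both transactions on all eight PCs, while the address-based mapping gives them no PC in common. With a finer interleave (for example 256B) the same two transactions would overlap on every PC again, so the size of the effect depends on the mapping the hardware actually uses.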
 ## Consequences
 - Single review point for all model fidelity questions. Each future PR