ADR-0033 D6: reorder future work by workload impact

Cycle-accurate arbitration policies (priority/iSLIP) downgraded to
"academic / specific use cases" — FIFO inbox is approximately fair
for typical similar-rate workloads (GEMM, AllReduce, data parallel).
True impact appears only for QoS modeling or per-stream tail latency
analysis under saturation.
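
A minimal sketch of the fairness claim, assuming two streams that emit flits at the same period with a small phase offset. The helper `arrivals` and all timing values are hypothetical and are not taken from the simulator; the shared inbox is modeled simply as a merge by arrival time.

```python
import heapq

def arrivals(stream, start_ns, period_ns, count):
    """Return (arrival_time_ns, stream_name) tuples for one upstream stream."""
    return [(start_ns + i * period_ns, stream) for i in range(count)]

# Two similar-rate streams whose flits arrive slightly out of phase.
stream_a = arrivals("A", start_ns=0.0, period_ns=10.0, count=6)
stream_b = arrivals("B", start_ns=3.0, period_ns=10.0, count=6)

# A FIFO inbox forwards flits in arrival order, i.e. a merge by timestamp.
fifo_order = "".join(name for _, name in heapq.merge(stream_a, stream_b))
print(fifo_order)
```

Under these assumptions the forwarding order alternates strictly between the two streams, which is what a fair round-robin arbiter would also produce. The FIFO has no notion of priority, so it cannot express QoS classes.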

Higher-priority items pulled forward: address-based PC selection at
HBM CTRL (directly affects multi-PE concurrent HBM contention), bank
conflict modeling, HBM scheduler, finite buffer backpressure, op_log
chunk-streaming integration.
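
The PC-selection item can be illustrated with a minimal sketch. The function names, the burst-interleaved mapping `(addr // burst_bytes) % num_pcs`, and the lockstep issue model (one burst per transaction per step) are assumptions made for illustration; they are not the simulator's API or a real HBM controller's address map.

```python
def global_rr_pcs(num_bursts, num_pcs, rr_counter=0):
    """Address-blind: burst i takes the next PC from a shared round-robin counter."""
    pcs = [(rr_counter + i) % num_pcs for i in range(num_bursts)]
    return pcs, (rr_counter + num_bursts) % num_pcs

def address_striped_pcs(base_addr, num_bursts, num_pcs, burst_bytes):
    """Address-based (assumed mapping): consecutive bursts stripe across PCs."""
    first = base_addr // burst_bytes
    return [(first + i) % num_pcs for i in range(num_bursts)]

def lockstep_collisions(pcs_a, pcs_b):
    """Count steps where both transactions target the same PC at the same time."""
    return sum(a == b for a, b in zip(pcs_a, pcs_b))

NUM_PCS, BURST_BYTES = 8, 256
BURSTS = NUM_PCS  # a 2KB transaction is 8 bursts of 256B

# Global RR: the counter wraps after the first transaction, so both walk PCs 0..7.
rr_a, counter = global_rr_pcs(BURSTS, NUM_PCS)
rr_b, _ = global_rr_pcs(BURSTS, NUM_PCS, counter)
rr_collisions = lockstep_collisions(rr_a, rr_b)

# Address striping: base addresses 1KB apart start on different PCs.
st_a = address_striped_pcs(0x0000, BURSTS, NUM_PCS, BURST_BYTES)
st_b = address_striped_pcs(0x0400, BURSTS, NUM_PCS, BURST_BYTES)
striped_collisions = lockstep_collisions(st_a, st_b)

print(rr_collisions, striped_collisions)
```

With these inputs the address-blind assignment collides on every step, while the striped assignment never targets the same PC in the same step. The gap between those two outcomes is the contention that the current global round-robin overstates.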

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:21:35 -07:00
parent 9beb140eaa
commit c9bd5387ac
+27 -16
@@ -106,36 +106,47 @@ Note: multi-stream merging at routers IS modeled correctly — each
 in_port has its own fan_in process, all push to a shared inbox, and
 the router worker forwards in inbox FIFO order. Flits from different
 upstream streams naturally interleave at flit granularity. The items
-below are different concerns.
-- [ ] **Cycle-accurate router arbitration policies** (RR with
-priorities, age, iSLIP). Currently the inbox FIFO order is used as
-a proxy for fair RR — works when flit arrival times differ slightly
-between streams, but doesn't reflect intentional priority/QoS.
-- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
-cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
-per 32B flit. Effect is small for most workloads (sub-flit timing
-noise).
+below are different concerns, ordered by expected workload impact.
+**Higher impact (workload accuracy gap)**:
 - [ ] **Address-based PC selection at HBM CTRL** (replace the
 address-blind global round-robin). When two transactions of size
 `num_pcs × burst_bytes` (e.g., 2KB at 8 PCs × 256B) arrive
 concurrently, both claim PCs 0..7 via global RR, producing full
-per-PC contention. Real HW uses address bits to select PCs, so
-different-address transactions hit different PC patterns. Address
-modeling would let the simulator reflect cache-line/page-aware
-layouts.
+per-PC contention even when real-HW address striping would put
+them on disjoint PC sets. Directly affects multi-PE concurrent
+HBM workload latencies.
 - [ ] **Bank-level conflict modeling** within a PC (opt-in via
-`track_banks: true`). Currently we assume no same-bank reuse.
+`track_banks: true`). Currently we assume no same-bank reuse;
+random scatter/gather workloads are optimistic here.
 - [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
 from the design discussion). Default `switch_penalty_ns=0` is the
-ideal-amortization stand-in.
-- [ ] **Backpressure** modeling for finite component buffers.
+ideal-amortization stand-in; bursty mixed R/W workloads benefit
+from explicit modeling.
+- [ ] **Backpressure** modeling for finite component buffers. Matters
+at high concurrency / sustained saturation where buffer occupancy
+causes upstream stalls.
 - [ ] **Op_log integration with chunk-streaming**: currently op_log
 fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
 GemmCmd, MathCmd) which are not chunkified. Integration would
 require flit-aware components to also emit op_log start/end hooks
 per transaction (start on first flit, end on is_last).
+**Lower impact (academic / specific use cases)**:
+- [ ] **Cycle-accurate router arbitration policies** (RR with
+priorities, age, iSLIP). The FIFO inbox is already approximately
+fair when flit arrival times differ slightly between streams (the
+common case for similar-rate workloads). True impact appears only
+for: (a) priority/QoS modeling, (b) per-stream tail latency
+analysis under sustained saturation. Not critical for makespan or
+average-latency studies.
+- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
+cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
+per 32B flit. Effect is small for most workloads (sub-flit timing
+noise on small messages).
 ## Consequences
 - Single review point for all model fidelity questions. Each future PR