From c9bd5387ac76f255fdb12666abf14a3be2ee8f1e Mon Sep 17 00:00:00 2001
From: Yangwook Kang
Date: Thu, 14 May 2026 23:21:35 -0700
Subject: [PATCH] ADR-0033 D6: reorder future work by workload impact
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Cycle-accurate arbitration policies (priority/iSLIP) downgraded to
"academic / specific use cases" — FIFO inbox is approximately fair for
typical similar-rate workloads (GEMM, AllReduce, data parallel). True
impact appears only for QoS modeling or per-stream tail latency
analysis under saturation.

Higher-priority items pulled forward: address-based PC selection at
HBM CTRL (directly affects multi-PE concurrent HBM contention), bank
conflict modeling, HBM scheduler, finite buffer backpressure, op_log
chunk-streaming integration.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../adr/ADR-0033-latency-model-assumptions.md | 43 ++++++++++++-------
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/docs/adr/ADR-0033-latency-model-assumptions.md b/docs/adr/ADR-0033-latency-model-assumptions.md
index 4dec803..4ca622d 100644
--- a/docs/adr/ADR-0033-latency-model-assumptions.md
+++ b/docs/adr/ADR-0033-latency-model-assumptions.md
@@ -106,36 +106,47 @@ Note: multi-stream merging at routers IS modeled correctly — each
 in_port has its own fan_in process, all push to a shared inbox, and
 the router worker forwards in inbox FIFO order. Flits from different
 upstream streams naturally interleave at flit granularity. The items
-below are different concerns.
+below are different concerns, ordered by expected workload impact.
+
+**Higher impact (workload accuracy gap)**:
 
-- [ ] **Cycle-accurate router arbitration policies** (RR with
-  priorities, age, iSLIP). Currently the inbox FIFO order is used as
-  a proxy for fair RR — works when flit arrival times differ slightly
-  between streams, but doesn't reflect intentional priority/QoS.
-- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
-  cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
-  per 32B flit. Effect is small for most workloads (sub-flit timing
-  noise).
 - [ ] **Address-based PC selection at HBM CTRL** (replace the
   address-blind global round-robin). When two transactions of size
   `num_pcs × burst_bytes` (e.g., 2KB at 8 PCs × 256B) arrive
   concurrently, both claim PCs 0..7 via global RR, producing full
-  per-PC contention. Real HW uses address bits to select PCs, so
-  different-address transactions hit different PC patterns. Address
-  modeling would let the simulator reflect cache-line/page-aware
-  layouts.
+  per-PC contention even when real-HW address striping would put
+  them on disjoint PC sets. Directly affects multi-PE concurrent
+  HBM workload latencies.
 - [ ] **Bank-level conflict modeling** within a PC (opt-in via
-  `track_banks: true`). Currently we assume no same-bank reuse.
+  `track_banks: true`). Currently we assume no same-bank reuse;
+  random scatter/gather workloads are optimistic here.
 - [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
   from the design discussion). Default `switch_penalty_ns=0` is the
-  ideal-amortization stand-in.
+  ideal-amortization stand-in; bursty mixed R/W workloads benefit
+  from explicit modeling.
-- [ ] **Backpressure** modeling for finite component buffers.
+- [ ] **Backpressure** modeling for finite component buffers. Matters
+  at high concurrency / sustained saturation where buffer occupancy
+  causes upstream stalls.
 - [ ] **Op_log integration with chunk-streaming**: currently op_log
   fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
   GemmCmd, MathCmd) which are not chunkified. Integration would
   require flit-aware components to also emit op_log start/end hooks
   per transaction (start on first flit, end on is_last).

+**Lower impact (academic / specific use cases)**:
+
+- [ ] **Cycle-accurate router arbitration policies** (RR with
+  priorities, age, iSLIP). The FIFO inbox is already approximately
+  fair when flit arrival times differ slightly between streams (the
+  common case for similar-rate workloads). True impact appears only
+  for: (a) priority/QoS modeling, (b) per-stream tail latency
+  analysis under sustained saturation. Not critical for makespan or
+  average-latency studies.
+- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
+  cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
+  per 32B flit. Effect is small for most workloads (sub-flit timing
+  noise on small messages).
+
 ## Consequences
 
 - Single review point for all model fidelity questions. Each future PR