From c9bd5387ac76f255fdb12666abf14a3be2ee8f1e Mon Sep 17 00:00:00 2001
From: Yangwook Kang <ywkang80@gmail.com>
Date: Thu, 14 May 2026 23:21:35 -0700
Subject: [PATCH] ADR-0033 D6: reorder future work by workload impact
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Cycle-accurate arbitration policies (priority/iSLIP) and sub-flit (32B)
granularity are downgraded to "academic / specific use cases". The FIFO
inbox is approximately fair for typical similar-rate workloads (GEMM,
AllReduce, data parallel), so arbitration policy matters only for QoS
modeling or per-stream tail latency analysis under saturation.

Higher-priority items pulled forward: address-based PC selection at
HBM CTRL (directly affects multi-PE concurrent HBM contention), bank
conflict modeling, HBM scheduler, finite buffer backpressure, op_log
chunk-streaming integration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
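
Reviewer note (not part of the commit message): the PC-selection gap that
motivates pulling that item forward can be illustrated with a minimal
sketch. It is not the simulator's code. `BURST_BYTES` and `NUM_PCS` follow
the ADR example (8 PCs x 256B); `STRIPE_BYTES`, the function names, and the
address-to-PC mapping are hypothetical and chosen only for illustration.

```python
BURST_BYTES = 256     # burst size, from the ADR example
NUM_PCS = 8           # pseudo-channel count, from the ADR example
STRIPE_BYTES = 1024   # ASSUMPTION: bytes mapped to one PC before the next

def burst_addrs(base, size):
    """Split one transaction into burst-sized chunk addresses."""
    return [base + off for off in range(0, size, BURST_BYTES)]

def global_rr(transactions):
    """Address-blind: one shared counter assigns each burst to the next PC."""
    counter = 0
    per_txn = []
    for base, size in transactions:
        pcs = []
        for _ in burst_addrs(base, size):
            pcs.append(counter % NUM_PCS)
            counter += 1
        per_txn.append(pcs)
    return per_txn

def address_based(transactions):
    """Hypothetical striping: PC index is derived from address bits."""
    return [
        [(addr // STRIPE_BYTES) % NUM_PCS for addr in burst_addrs(base, size)]
        for base, size in transactions
    ]

def shared_pcs(per_txn):
    """PCs touched by both of two transactions (where they contend)."""
    first, second = (set(pcs) for pcs in per_txn)
    return sorted(first & second)

# Two concurrent 2KB transactions at different base addresses.
txns = [(0x0000, 2048), (0x0800, 2048)]
rr = global_rr(txns)
ab = address_based(txns)
print("global RR shared PCs:    ", shared_pcs(rr))
print("address-based shared PCs:", shared_pcs(ab))
```

Under these assumptions the address-blind policy places both transactions
on every PC, while the striped mapping keeps them on disjoint PC sets. A
different stripe size or address layout would change the overlap.
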
 .../adr/ADR-0033-latency-model-assumptions.md | 43 ++++++++++++-------
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/docs/adr/ADR-0033-latency-model-assumptions.md b/docs/adr/ADR-0033-latency-model-assumptions.md
index 4dec803..4ca622d 100644
--- a/docs/adr/ADR-0033-latency-model-assumptions.md
+++ b/docs/adr/ADR-0033-latency-model-assumptions.md
@@ -106,36 +106,47 @@ Note: multi-stream merging at routers IS modeled correctly — each
 in_port has its own fan_in process, all push to a shared inbox, and
 the router worker forwards in inbox FIFO order. Flits from different
 upstream streams naturally interleave at flit granularity. The items
-below are different concerns.
+below are different concerns, ordered by expected workload impact.
+
+**Higher impact (workload accuracy gap)**:
 
-- [ ] **Cycle-accurate router arbitration policies** (RR with
-  priorities, age, iSLIP). Currently the inbox FIFO order is used as
-  a proxy for fair RR — works when flit arrival times differ slightly
-  between streams, but doesn't reflect intentional priority/QoS.
-- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
-  cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
-  per 32B flit. Effect is small for most workloads (sub-flit timing
-  noise).
 - [ ] **Address-based PC selection at HBM CTRL** (replace the
   address-blind global round-robin). When two transactions of size
   `num_pcs × burst_bytes` (e.g., 2KB at 8 PCs × 256B) arrive
   concurrently, both claim PCs 0..7 via global RR, producing full
-  per-PC contention. Real HW uses address bits to select PCs, so
-  different-address transactions hit different PC patterns. Address
-  modeling would let the simulator reflect cache-line/page-aware
-  layouts.
+  per-PC contention even when real-HW address striping would put
+  them on disjoint PC sets. Directly affects multi-PE concurrent
+  HBM workload latencies.
 - [ ] **Bank-level conflict modeling** within a PC (opt-in via
-  `track_banks: true`). Currently we assume no same-bank reuse.
+  `track_banks: true`). Currently we assume no same-bank reuse,
+  so results for random scatter/gather workloads are optimistic.
 - [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
   from the design discussion). Default `switch_penalty_ns=0` is the
-  ideal-amortization stand-in.
-- [ ] **Backpressure** modeling for finite component buffers.
+  ideal-amortization stand-in; bursty mixed R/W workloads benefit
+  from explicit modeling.
+- [ ] **Backpressure** modeling for finite component buffers. Matters
+  at high concurrency / sustained saturation where buffer occupancy
+  causes upstream stalls.
 - [ ] **Op_log integration with chunk-streaming**: currently op_log
   fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
   GemmCmd, MathCmd) which are not chunkified. Integration would
   require flit-aware components to also emit op_log start/end hooks
   per transaction (start on first flit, end on is_last).
 
+**Lower impact (academic / specific use cases)**:
+
+- [ ] **Cycle-accurate router arbitration policies** (RR with
+  priorities, age, iSLIP). The FIFO inbox is already approximately
+  fair when flit arrival times differ slightly between streams (the
+  common case for similar-rate workloads). The policy matters only
+  for (a) priority/QoS modeling and (b) per-stream tail latency
+  analysis under sustained saturation. Not critical for makespan or
+  average-latency studies.
+- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
+  cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
+  per 32B flit. Effect is small for most workloads (sub-flit timing
+  noise on small messages).
+
 ## Consequences
 
 - Single review point for all model fidelity questions. Each future PR