diff --git a/docs/adr/ADR-0033-latency-model-assumptions.md b/docs/adr/ADR-0033-latency-model-assumptions.md
index 1655747..4dec803 100644
--- a/docs/adr/ADR-0033-latency-model-assumptions.md
+++ b/docs/adr/ADR-0033-latency-model-assumptions.md
@@ -102,18 +102,39 @@ accurate for **relative comparisons** within the modeled regime.
 
 ### D6. Future work
 
-- [ ] Bank-level conflict modeling (opt-in via `track_banks: true`).
-- [ ] HBM scheduler with write buffer + watermark drain (Tier 2 from the
-  design discussion).
-- [ ] Fluid wire model for multi-flow fairness on a single shared link
-  (currently FIFO serial).
-- [ ] Sub-flit (32B) granularity for cycle-accurate wire arbitration.
-- [ ] Backpressure modeling for finite component buffers.
-- [ ] Op_log integration with chunk-streaming (currently op_log fires on
-  PE-internal command messages — DmaReadCmd, DmaWriteCmd, GemmCmd,
-  MathCmd — which are not chunkified; integration would require
-  flit-aware components to also emit op_log start/end hooks per
-  transaction).
+Note: multi-stream merging at routers IS modeled correctly — each
+in_port has its own fan_in process, all push to a shared inbox, and
+the router worker forwards in inbox FIFO order. Flits from different
+upstream streams naturally interleave at flit granularity. The items
+below are different concerns.
+
+- [ ] **Cycle-accurate router arbitration policies** (RR with
+  priorities, age, iSLIP). Currently the inbox FIFO order is used as
+  a proxy for fair RR — works when flit arrival times differ slightly
+  between streams, but doesn't reflect intentional priority/QoS.
+- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
+  cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
+  per 32B flit. Effect is small for most workloads (sub-flit timing
+  noise).
+- [ ] **Address-based PC selection at HBM CTRL** (replace the
+  address-blind global round-robin). When two transactions of size
+  `num_pcs × burst_bytes` (e.g., 2KB at 8 PCs × 256B) arrive
+  concurrently, both claim PCs 0..7 via global RR, producing full
+  per-PC contention. Real HW uses address bits to select PCs, so
+  different-address transactions hit different PC patterns. Address
+  modeling would let the simulator reflect cache-line/page-aware
+  layouts.
+- [ ] **Bank-level conflict modeling** within a PC (opt-in via
+  `track_banks: true`). Currently we assume no same-bank reuse.
+- [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
+  from the design discussion). Default `switch_penalty_ns=0` is the
+  ideal-amortization stand-in.
+- [ ] **Backpressure** modeling for finite component buffers.
+- [ ] **Op_log integration with chunk-streaming**: currently op_log
+  fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
+  GemmCmd, MathCmd) which are not chunkified. Integration would
+  require flit-aware components to also emit op_log start/end hooks
+  per transaction (start on first flit, end on is_last).
 
 ## Consequences
 