# ADR-0033 — Latency Model: Assumptions and Known Simplifications

## Status

Accepted

## Context

The simulator is an analytical, event-driven performance model — not a cycle-accurate or RTL-level simulator. Many real-HW effects are approximated or omitted by design. To keep the model auditable and reviewable as a whole, this ADR consolidates the assumptions in one place. Individual component ADRs (ADR-0015, ADR-0019, ADR-0004) define the *mechanisms*; this document defines the *limits of fidelity*.

## Decisions

### D1. Modeled precisely

- **Per-directed-edge BW occupancy** (FIFO serialization via `available_at`) — ADR-0015 D2.
- **Per-component switching/overhead latency** (`overhead_ns` attr).
- **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array with global round-robin chunking. Burst granularity tunable (`burst_bytes`, default 256B). Read and write share each PC's `available_at` (real HW command bus is per-PC shared).
- **HBM direction switching penalty mechanism**: per-PC last-direction tracking + configurable `switch_penalty_ns`. Default 0 — see D2.
- **Wire chunk-streaming (Phase 2c)**: each wire decomposes Transactions with payload into `Flit` objects of `flit_bytes` (default = HBM `burst_bytes` = 256B). The wire emits each flit individually after `prop_ns + flit_nbytes/bw_gbs` so the link's bandwidth throttles flit arrival rate per real-HW wormhole semantics.
- **Separate Stores per directed edge** (Phase 2c key fix): the wire is the *only* conduit between `src.out_ports[dst]` and `dst.in_ports[src]`. Earlier the two were aliased to the same `simpy.Store`; when the wire put a chunkified flit back, the destination's `fan_in` could pull it before the wire applied bandwidth delay, leaving half the flits bypassing the bottleneck.
- **Flit-aware pass-through** (`TransitComponent`, `HbmCtrlComponent`): forward each flit serially with per-transaction overhead applied ONCE on the first-flit arrival (header decode model).
  Subsequent flits pipeline through with no extra delay. Wormhole emerges naturally across multi-hop paths.
- **HBM CTRL per-flit PC commit**: each flit arriving at HBM CTRL schedules a PC commit at `max(env.now, pc_avail[pc]) + chunk_time`, with the `is_last` flit waiting for the last PC commit before signaling `txn.done`.
- **Non-flit-aware components (default) reassemble flits at `_fan_in`** before the legacy `_forward_txn` path runs. This preserves backward compatibility for components that have not yet been migrated to flit-aware processing (e.g., `MCpuComponent`, `IoCpuComponent` sub-txn generators). Such components reassemble *once per leg boundary*, NOT per hop — multi-hop wormhole timing through a chain of flit-aware routers is preserved.

### D2. Approximated (with known directional error)

| Effect | Real HW | Our model | Error direction |
|--------|---------|-----------|----------------|
| Router output port arbitration | Round-robin / weighted | Wire edge FIFO + serial worker | Fair when one txn per cycle; multi-stream sharing not modeled at flit level |
| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Pessimistic for mixed R/W when alternations are dense — default `switch_penalty_ns = 0` assumes ideal scheduler amortizes |
| Flit ↔ burst granularity | 32B flit < 256B burst | `flit_bytes = burst_bytes = 256B` | Sub-flit fine-grained timing noise; affects very small wire arbitration windows only |
| Wire-level RR fairness | Per-cycle multi-flow arbitration on shared link | Single serial wire process per edge | Fair only when one transaction is in flight on a given edge at a time. Multi-stream concurrent traffic on the same edge serializes by FIFO order |

### D3. Ignored (out of scope)

- Bank-level row buffer conflict penalty (assume no conflicts — best case; round-robin chunk assignment is address-blind so we cannot detect same-bank reuse).
- HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state `burst_time = burst_bytes / pc_bw_gbs`).
- Refresh, ECC, thermal throttling, power gating.
- Clock domain crossings, PLL lock time.
- Upstream backpressure due to downstream buffer occupancy (input ports use unbounded `simpy.Store`).
- Sub-flit cycle-level arbitration at routers (flit granularity is our smallest unit).

### D4. Workload sensitivity

Workloads where the above simplifications meaningfully affect results:

- **Random scatter/gather**: bank conflicts are ignored → model optimistic.
- **Mixed R/W-intensive** (e.g., GEMM bias accumulation): HBM scheduler absent. With default `switch_penalty_ns = 0` we assume ideal amortization; setting it non-zero models a pessimistic per-alternation cost.
- **High concurrency (>10 active flows on one link)**: HoL blocking and VC limits are not modeled → model optimistic.
- **Very small (sub-flit) transactions**: flit quantization noise.
- **Concurrent multi-flow on a single wire**: the wire is a serial FIFO at the flit level, so per-flow fairness within a single edge is not modeled. Pre-edge merging (multiple sources arriving at a router and being forwarded to the same downstream wire) is correctly modeled via the flit-aware router's serial worker.

### D5. Verification policy

For workloads in D4, cross-check against real HW or a cycle-accurate simulator before drawing absolute-magnitude conclusions. The model remains accurate for **relative comparisons** within the modeled regime.

### D6. Future work

Note: multi-stream merging at routers IS modeled correctly — each in_port has its own fan_in process, all push to a shared inbox, and the router worker forwards in inbox FIFO order. Flits from different upstream streams naturally interleave at flit granularity. The items below are different concerns.

- [ ] **Cycle-accurate router arbitration policies** (RR with priorities, age, iSLIP).
  Currently the inbox FIFO order is used as a proxy for fair RR. This works when flit arrival times differ slightly between streams, but it does not reflect intentional priority/QoS.
- [ ] **Sub-flit (32B) granularity** for finer wire arbitration cycles. Our `flit_bytes` equals the burst size (256B); real HW arbitrates per 32B flit. Effect is small for most workloads (sub-flit timing noise).
- [ ] **Address-based PC selection at HBM CTRL** (replace the address-blind global round-robin). When two transactions of size `num_pcs × burst_bytes` (e.g., 2KB at 8 PCs × 256B) arrive concurrently, both claim PCs 0..7 via global RR, producing full per-PC contention. Real HW uses address bits to select PCs, so different-address transactions hit different PC patterns. Address modeling would let the simulator reflect cache-line/page-aware layouts.
- [ ] **Bank-level conflict modeling** within a PC (opt-in via `track_banks: true`). Currently we assume no same-bank reuse.
- [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2 from the design discussion). Default `switch_penalty_ns=0` is the ideal-amortization stand-in.
- [ ] **Backpressure** modeling for finite component buffers.
- [ ] **Op_log integration with chunk-streaming**: currently op_log fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd, GemmCmd, MathCmd), which are not chunkified. Integration would require flit-aware components to also emit op_log start/end hooks per transaction (start on first flit, end on is_last).

## Consequences

- Single review point for all model fidelity questions. Each future PR touching latency must update the relevant section here.
- Workload-specific magnitude error envelopes are explicit.
- Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs` enforces the ADR-0019 D9 invariant in code rather than relying on manual consistency in the YAML.
- Wire transfer time is charged once per bottleneck-link transit (Phase 2c per-flit timing) rather than via terminal `drain_ns` injection.
  Single transactions land at `drain + commit_time + small_overheads`; multi-hop preserves wormhole pipelining; multi-stream merge correctly serializes at the shared wire's FIFO.

## Cross-references

- ADR-0015 — component / port / wire model.
- ADR-0019 — NoC and local HBM topology.
- ADR-0004 — memory semantics, local HBM.
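
## Illustrative sketch: HBM per-flit PC commit

The per-flit commit rule from D1 (`max(env.now, pc_avail[pc]) + chunk_time`), the address-blind global round-robin, and the builder-side derivation `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs` can be sketched in a few lines of plain Python. This is a minimal sketch, not the simulator's code: the class `HbmPcSketch`, its method names, and the 256 GB/s link figure are hypothetical, 1 GB/s is assumed to mean 1 byte/ns, and every flit of a transaction is assumed to arrive at the same instant (the real wire spaces flits out by its own bandwidth). SimPy is deliberately left out so the arithmetic stays visible.

```python
class HbmPcSketch:
    """Hypothetical stand-in for the per-PC `pc_avail[N]` bookkeeping in D1."""

    def __init__(self, hbm_to_router_bw_gbs=256.0, num_pcs=8, burst_bytes=256):
        self.num_pcs = num_pcs
        self.burst_bytes = burst_bytes
        # Builder-side derivation named in Consequences (bandwidth value is illustrative).
        self.pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs
        self.pc_avail = [0.0] * num_pcs  # time (ns) at which each PC is next free
        self.next_pc = 0                 # global round-robin cursor, address-blind

    def commit_flit(self, now_ns, nbytes):
        """Schedule one flit on the next PC in round-robin order; return its commit time."""
        pc = self.next_pc
        self.next_pc = (pc + 1) % self.num_pcs
        chunk_time = nbytes / self.pc_bw_gbs  # bytes / (bytes per ns) = ns (assumed units)
        done = max(now_ns, self.pc_avail[pc]) + chunk_time
        self.pc_avail[pc] = done
        return done

    def commit_txn(self, now_ns, nbytes):
        """Split a transaction into bursts and return the time of its last PC commit.

        Simplification: all flits are treated as arriving at `now_ns`.
        """
        last_done = now_ns
        remaining = nbytes
        while remaining > 0:
            chunk = min(remaining, self.burst_bytes)
            last_done = max(last_done, self.commit_flit(now_ns, chunk))
            remaining -= chunk
        return last_done


if __name__ == "__main__":
    hbm = HbmPcSketch()
    first = hbm.commit_txn(0.0, 2048)   # claims PCs 0..7
    second = hbm.commit_txn(0.0, 2048)  # round-robin wraps, claims PCs 0..7 again
    print(first, second)                # prints: 8.0 16.0
```

With these assumed numbers, two concurrent 2KB transactions at 8 PCs × 256B both land on PCs 0..7, so the second finishes a full burst time after the first. That is the full per-PC contention case described under D6 for the address-blind round-robin.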