kernbench2/docs/adr/ADR-0033-latency-model-assumptions.md
ywkang 9beb140eaa ADR-0033 D6: clarify what multi-flow merging actually models
Earlier the future-work list mentioned "multi-flow fair sharing on a
single shared link" which was confusing — each wire has a single
source, so this isn't a real gap. The actual modeling story:

- Multi-stream merging at routers IS handled via per-in_port fan_in +
  shared inbox + FIFO worker forwarding. Flits from different
  upstream streams interleave at flit granularity naturally.
- What's NOT modeled: cycle-accurate arbitration policies (priority,
  iSLIP), address-based PC selection at HBM CTRL (round-robin is
  address-blind, so size-aligned concurrent transactions hit full
  PC contention even when real-HW address striping would diverge),
  sub-flit (32B) granularity, finite buffer backpressure, and bank
  conflict modeling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:18:19 -07:00


ADR-0033 — Latency Model: Assumptions and Known Simplifications

Status

Accepted

Context

The simulator is an analytical, event-driven performance model — not a cycle-accurate or RTL-level simulator. Many real-HW effects are approximated or omitted by design. To keep the model auditable and reviewable as a whole, this ADR consolidates the assumptions in one place. Individual component ADRs (ADR-0015, ADR-0019, ADR-0004) define the mechanisms; this document defines the limits of fidelity.

Decisions

D1. Modeled precisely

  • Per-directed-edge BW occupancy (FIFO serialization via available_at) — ADR-0015 D2.
  • Per-component switching/overhead latency (overhead_ns attr).
  • HBM per-pseudo-channel parallelism via stateless pc_avail[N] array with global round-robin chunking. Burst granularity is tunable (burst_bytes, default 256B). Reads and writes share each PC's available_at, because on real HW the command bus is shared per PC.
  • HBM direction switching penalty mechanism: per-PC last-direction tracking + configurable switch_penalty_ns. Default 0 — see D2.
  • Wire chunk-streaming (Phase 2c): each wire decomposes Transactions with payload into Flit objects of flit_bytes (default = HBM burst_bytes = 256B). The wire emits each flit individually after prop_ns + flit_nbytes/bw_gbs so the link's bandwidth throttles flit arrival rate per real-HW wormhole semantics.
  • Separate Stores per directed edge (Phase 2c key fix): the wire is the only conduit between src.out_ports[dst] and dst.in_ports[src]. Earlier the two were aliased to the same simpy.Store; when the wire put a chunkified flit back, the destination's fan_in could pull it before the wire applied bandwidth delay, leaving half the flits bypassing the bottleneck.
  • Flit-aware pass-through (TransitComponent, HbmCtrlComponent): forward each flit serially with per-transaction overhead applied ONCE on the first-flit arrival (header decode model). Subsequent flits pipeline through with no extra delay. Wormhole emerges naturally across multi-hop paths.
  • HBM CTRL per-flit PC commit: each flit arriving at HBM CTRL schedules a PC commit at max(env.now, pc_avail[pc]) + chunk_time, with the is_last flit waiting for the last PC commit before signaling txn.done.
  • Non-flit-aware components (default) reassemble flits at _fan_in before the legacy _forward_txn path runs. This preserves backward compatibility for components that have not yet been migrated to flit-aware processing (e.g., MCpuComponent, IoCpuComponent sub-txn generators). Such components reassemble once per leg boundary, NOT per hop — multi-hop wormhole timing through a chain of flit-aware routers is preserved.
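
The wire chunk-streaming bullet above can be illustrated with a small timing sketch. This is not the simulator's code: the function names are hypothetical, and it assumes that serialization time accumulates per flit while propagation delay overlaps across flits. Because 1 GB/s equals 1 byte/ns, flit_nbytes / bw_gbs is already in nanoseconds.

```python
def split_into_flits(nbytes, flit_bytes=256):
    """Split a payload into flit sizes; the last flit may be short."""
    full, rem = divmod(nbytes, flit_bytes)
    return [flit_bytes] * full + ([rem] if rem else [])


def flit_arrival_times(nbytes, bw_gbs, prop_ns, flit_bytes=256, start_ns=0.0):
    """Arrival time (ns) of each flit at the far end of one wire.

    Assumption: the wire serializes flits back to back, and each flit
    then sees the same propagation delay (propagation is pipelined).
    """
    wire_busy_until = start_ns
    arrivals = []
    for size in split_into_flits(nbytes, flit_bytes):
        wire_busy_until += size / bw_gbs  # bytes / (GB/s) == ns
        arrivals.append(wire_busy_until + prop_ns)
    return arrivals


# Hypothetical numbers: 1 KiB payload, 128 GB/s link, 1 ns propagation.
arrivals = flit_arrival_times(1024, bw_gbs=128.0, prop_ns=1.0)
print(arrivals)
```

With these illustrative numbers each 256B flit occupies the wire for 2 ns, so flits arrive 2 ns apart, which is the bandwidth throttling described above.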

D2. Approximated (with known directional error)

| Effect | Real HW | Our model | Error direction |
| --- | --- | --- | --- |
| Router output port arbitration | Round-robin / weighted | Wire edge FIFO + serial worker | Fair when one txn per cycle; multi-stream sharing not modeled at flit level |
| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Pessimistic for mixed R/W when alternations are dense; default switch_penalty_ns = 0 assumes ideal scheduler amortizes |
| Flit ↔ burst granularity | 32B flit < 256B burst | flit_bytes = burst_bytes = 256B | Sub-flit fine-grained timing noise; affects very small wire arbitration windows only |
| Wire-level RR fairness | Per-cycle multi-flow arbitration on shared link | Single serial wire process per edge | Fair only when one transaction is in flight on a given edge at a time. Multi-stream concurrent traffic on the same edge serializes by FIFO order |
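
To make the HBM scheduler row concrete, the following is a minimal sketch of per-PC last-direction tracking with a configurable switch_penalty_ns. The function name and the numbers are hypothetical; it models a single PC processing equal-sized chunks in FIFO order with no reordering.

```python
def pc_finish_time(directions, chunk_ns, switch_penalty_ns=0.0):
    """Finish time (ns) of a FIFO sequence of chunks on one PC.

    directions: iterable of "R" / "W", in arrival order.
    A penalty is charged whenever the direction differs from the
    previous chunk's direction (assumed mechanism, no reordering).
    """
    avail = 0.0
    last_dir = None
    for d in directions:
        if last_dir is not None and d != last_dir:
            avail += switch_penalty_ns
        avail += chunk_ns
        last_dir = d
    return avail


# Hypothetical values: 8 ns per chunk, 5 ns per direction switch.
default_cost = pc_finish_time("RWRW", chunk_ns=8.0)                       # penalty 0
dense_cost = pc_finish_time("RWRW", chunk_ns=8.0, switch_penalty_ns=5.0)
grouped_cost = pc_finish_time("RRWW", chunk_ns=8.0, switch_penalty_ns=5.0)
print(default_cost, dense_cost, grouped_cost)
```

Dense alternation pays the penalty on every switch, while a reordering scheduler would group same-direction accesses and pay it once. That gap is what a non-zero switch_penalty_ns overestimates under FIFO ordering.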

D3. Ignored (out of scope)

  • Bank-level row buffer conflict penalty (assume no conflicts — best case; round-robin chunk assignment is address-blind so we cannot detect same-bank reuse).
  • HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state burst_time = burst_bytes / pc_bw_gbs).
  • Refresh, ECC, thermal throttling, power gating.
  • Clock domain crossings, PLL lock time.
  • Upstream backpressure due to downstream buffer occupancy (input ports use unbounded simpy.Store).
  • Sub-flit cycle-level arbitration at routers (flit granularity is our smallest unit).

D4. Workload sensitivity

Workloads where the above simplifications meaningfully affect results:

  • Random scatter/gather: bank conflict ignored → model optimistic.
  • Mixed R/W-intensive workloads (e.g., GEMM bias accumulation): no HBM scheduler is modeled. With the default switch_penalty_ns = 0 we assume ideal amortization; setting it to a non-zero value models a pessimistic per-alternation cost.
  • High concurrency (>10 active flows on one link): head-of-line (HoL) blocking and virtual-channel (VC) limits are not modeled → model optimistic.
  • Very small (sub-flit) transactions: flit quantization noise.
  • Concurrent multi-flow on a single wire: wire is serial FIFO at the flit level, so per-flow fairness within a single edge is not modeled. Pre-edge merging (multiple sources arriving at a router and being forwarded to the same downstream wire) is correctly modeled via the flit-aware router's serial worker.
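
The pre-edge merging case in the last bullet can be sketched without simpy. This is an illustration under stated assumptions, not the router implementation: merge_and_forward is a hypothetical name, each stream's arrival times are assumed sorted, ties are broken by stream name, and the output wire is a single serial resource.

```python
import heapq


def merge_and_forward(streams, out_bw_gbs, flit_bytes=256):
    """Merge flits from several in_ports into one FIFO, then serialize
    them on a shared output wire.

    streams: dict of stream name -> sorted list of flit arrival times (ns).
    Returns a list of (stream, flit_index, departure_done_ns).
    """
    per_port = [
        [(t, name, idx) for idx, t in enumerate(times)]
        for name, times in streams.items()
    ]
    wire_free = 0.0
    departures = []
    # heapq.merge yields flits in global arrival order (the shared inbox).
    for arrive_ns, name, idx in heapq.merge(*per_port):
        start = max(arrive_ns, wire_free)
        wire_free = start + flit_bytes / out_bw_gbs  # bytes / (GB/s) == ns
        departures.append((name, idx, wire_free))
    return departures


# Hypothetical input: two upstream streams offset by 1 ns, 256 GB/s output.
out = merge_and_forward(
    {"A": [0.0, 2.0, 4.0], "B": [1.0, 3.0, 5.0]}, out_bw_gbs=256.0
)
print([name for name, _, _ in out])
```

In this example the two streams interleave flit by flit on the output wire. If instead all of one stream's flits reached the inbox first, FIFO order would forward them all before the other stream, which is the per-flow fairness gap noted above.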

D5. Verification policy

For workloads in D4, cross-check against real HW or a cycle-accurate simulator before drawing absolute-magnitude conclusions. The model remains accurate for relative comparisons within the modeled regime.

D6. Future work

Note: multi-stream merging at routers IS modeled correctly — each in_port has its own fan_in process, all push to a shared inbox, and the router worker forwards in inbox FIFO order. Flits from different upstream streams naturally interleave at flit granularity. The items below are different concerns.

  • Cycle-accurate router arbitration policies (RR with priorities, age, iSLIP). Currently the inbox FIFO order is used as a proxy for fair RR — works when flit arrival times differ slightly between streams, but doesn't reflect intentional priority/QoS.
  • Sub-flit (32B) granularity for finer wire arbitration cycles. Our flit_bytes equals burst (256B); real HW arbitrates per 32B flit. Effect is small for most workloads (sub-flit timing noise).
  • Address-based PC selection at HBM CTRL (replace the address-blind global round-robin). When two transactions of size num_pcs × burst_bytes (e.g., 2KB at 8 PCs × 256B) arrive concurrently, both claim PCs 0..7 via global RR, producing full per-PC contention. Real HW uses address bits to select PCs, so different-address transactions hit different PC patterns. Address modeling would let the simulator reflect cache-line/page-aware layouts.
  • Bank-level conflict modeling within a PC (opt-in via track_banks: true). Currently we assume no same-bank reuse.
  • HBM scheduler with write buffer + watermark drain (Tier 2 from the design discussion). Default switch_penalty_ns=0 is the ideal-amortization stand-in.
  • Backpressure modeling for finite component buffers.
  • Op_log integration with chunk-streaming: currently op_log fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd, GemmCmd, MathCmd) which are not chunkified. Integration would require flit-aware components to also emit op_log start/end hooks per transaction (start on first flit, end on is_last).
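
The PC contention case in the address-based PC selection item can be reproduced with a small sketch of global round-robin chunking. HbmCtrlSketch is a hypothetical class, the 32 GB/s per-PC bandwidth is an illustrative value, and for simplicity all chunks of a transaction are assumed to arrive at the same time rather than flit by flit.

```python
class HbmCtrlSketch:
    """Address-blind global round-robin over pseudo-channels (PCs)."""

    def __init__(self, num_pcs=8, burst_bytes=256, pc_bw_gbs=32.0):
        self.num_pcs = num_pcs
        self.burst_bytes = burst_bytes
        self.pc_bw_gbs = pc_bw_gbs
        self.pc_avail = [0.0] * num_pcs  # time each PC becomes free (ns)
        self.next_pc = 0                 # global RR pointer

    def commit_txn(self, now_ns, nbytes):
        """Commit every burst-sized chunk; return the last commit time."""
        done = now_ns
        remaining = nbytes
        while remaining > 0:
            chunk = min(remaining, self.burst_bytes)
            pc = self.next_pc
            self.next_pc = (pc + 1) % self.num_pcs
            finish = max(now_ns, self.pc_avail[pc]) + chunk / self.pc_bw_gbs
            self.pc_avail[pc] = finish
            done = max(done, finish)
            remaining -= chunk
        return done


# Two concurrent 2 KB transactions: both wrap onto PCs 0..7.
ctrl = HbmCtrlSketch()
first = ctrl.commit_txn(0.0, 2048)
second = ctrl.commit_txn(0.0, 2048)

# Two concurrent 1 KB transactions: PCs 0..3 and 4..7, no overlap.
ctrl_small = HbmCtrlSketch()
small_a = ctrl_small.commit_txn(0.0, 1024)
small_b = ctrl_small.commit_txn(0.0, 1024)
print(first, second, small_a, small_b)
```

The 2 KB pair serializes on every PC (8 ns, then 16 ns), while the 1 KB pair finishes together at 8 ns. Address-based selection would let the outcome depend on data layout rather than on transaction size alone.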

Consequences

  • Single review point for all model fidelity questions. Each future PR touching latency must update the relevant section here.
  • Workload-specific magnitude error envelopes are explicit.
  • Builder-side derivation of pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs enforces the ADR-0019 D9 invariant in code rather than relying on manually consistent YAML values.
  • Wire transfer time is charged once per bottleneck-link transit (Phase 2c per-flit timing) rather than via terminal drain_ns injection. Single transactions land at drain + commit_time + small_overheads; multi-hop preserves wormhole pipelining; multi-stream merge correctly serializes at the shared wire's FIFO.
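
The builder-side derivation mentioned above, combined with the steady-state burst_time = burst_bytes / pc_bw_gbs from D3, can be sketched as follows. The function name derive_hbm_timing and the input values are hypothetical; the actual builder may differ.

```python
def derive_hbm_timing(hbm_to_router_bw_gbs, num_pcs, burst_bytes=256):
    """Derive per-PC bandwidth and steady-state burst time.

    Deriving pc_bw_gbs here (instead of reading it from config) keeps
    the sum of PC bandwidths equal to the HBM-to-router link bandwidth.
    """
    if num_pcs <= 0:
        raise ValueError("num_pcs must be positive")
    pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs
    burst_time_ns = burst_bytes / pc_bw_gbs  # bytes / (GB/s) == ns
    return pc_bw_gbs, burst_time_ns


# Hypothetical configuration: 256 GB/s link split across 8 PCs.
pc_bw, burst_ns = derive_hbm_timing(256.0, num_pcs=8)
print(pc_bw, burst_ns)
```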

Cross-references

  • ADR-0015 — component / port / wire model.
  • ADR-0019 — NoC and local HBM topology.
  • ADR-0004 — memory semantics, local HBM.