kernbench2/docs/adr/ADR-0033-latency-model-assumptions.md
ywkang 5fdb6f8797 Latency model: HBM PC striping + chunk-loop drain (ADR-0033)
Previous model double-counted slow-upstream paths (e.g., 64KB via UCIe
128 GB/s was ~2x pessimistic). HBM CTRL now distributes bursts across
8 pseudo-channels via global round-robin, with per-chunk commit timing
that pipelines correctly against the bottleneck link's data arrival.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 21:59:07 -07:00

ADR-0033 — Latency Model: Assumptions and Known Simplifications

Status

Accepted

Context

The simulator is an analytical, event-driven performance model — not a cycle-accurate or RTL-level simulator. Many real-HW effects are approximated or omitted by design. To keep the model auditable and reviewable as a whole, this ADR consolidates the assumptions in one place. Individual component ADRs (ADR-0015, ADR-0019, ADR-0004) define the mechanisms; this document defines the limits of fidelity.

Decisions

D1. Modeled precisely

  • Per-directed-edge BW occupancy (FIFO serialization via available_at) — ADR-0015 D2.
  • Per-component switching/overhead latency (overhead_ns attr).
  • HBM per-pseudo-channel parallelism via stateless pc_avail[N] array with global round-robin chunking. Burst granularity is tunable (burst_bytes, default 256 B). Reads and writes share each PC's available_at, because on real HW the command bus is shared per PC.
  • HBM direction switching penalty mechanism: per-PC last-direction tracking + configurable switch_penalty_ns. Default 0 — see D2.
  • Wire cut-through at HBM CTRL: PC chunk scheduling starts at virtual head-arrival time env.now - txn.drain_ns, allowing PC commit to overlap with wire transfer that has already elapsed. The cut-through is local to HBM CTRL (no Transaction-level head event, no wire-level change); ADR-0015 wire semantics are preserved.
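The PC striping and cut-through bullets above can be sketched as follows. This is a minimal illustration, not the simulator's actual code: the function name `schedule_chunks`, the `rr_next` argument, the uniform byte-arrival assumption, and the 32 GB/s per-PC bandwidth are all hypothetical. Only `pc_avail`, `burst_bytes`, `drain_ns`, and the `env.now - txn.drain_ns` head-arrival rule come from the text.

```python
def schedule_chunks(now_ns, drain_ns, nbytes, pc_avail, rr_next=0,
                    burst_bytes=256, pc_bw_gbs=32.0):
    """Commit one transaction chunk by chunk.

    Returns (finish_ns, next round-robin index). pc_avail is mutated in
    place: pc_avail[p] is the time at which pseudo-channel p is next free.
    """
    head_ns = now_ns - drain_ns  # virtual head-arrival time (cut-through)
    finish_ns = head_ns
    sent = 0
    while sent < nbytes:
        chunk = min(burst_bytes, nbytes - sent)
        sent += chunk
        # Assumption: bytes arrive uniformly over the wire drain window.
        arrive_ns = head_ns + drain_ns * sent / nbytes
        # A chunk commits once it has arrived and its PC is free.
        start_ns = max(arrive_ns, pc_avail[rr_next])
        pc_avail[rr_next] = start_ns + chunk / pc_bw_gbs  # bytes / (GB/s) = ns
        finish_ns = max(finish_ns, pc_avail[rr_next])
        rr_next = (rr_next + 1) % len(pc_avail)  # global round-robin
    return finish_ns, rr_next


# 64 KiB arriving over a 128 GB/s link into 8 idle pseudo-channels.
pc_avail = [0.0] * 8
nbytes = 64 * 1024
drain_ns = nbytes / 128.0  # 512 ns of wire transfer already elapsed
finish_ns, rr_next = schedule_chunks(1000.0, drain_ns, nbytes, pc_avail)
print(finish_ns - 1000.0)  # time left after the tail arrives: 8.0 ns
```

With the slow link as the bottleneck, only the final burst remains to be committed after the tail arrives, so the HBM stage adds one burst time rather than a second full transfer time.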

D2. Approximated (with known directional error)

| Effect | Real HW | Our model | Error direction |
|---|---|---|---|
| Router output port arbitration | Round-robin / weighted | Wire edge FIFO | HoL blocking exaggerated; fairness not modeled |
| Multi-flow BW sharing | Per-flow fair share | FIFO atomic occupancy | Per-txn latency dist. differs; makespan correct |
| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Switching penalty over-charged when alternations are dense, but default switch_penalty_ns = 0 assumes ideal scheduler amortizes it (Tier 0) |
| Flit/cycle granularity | Discrete flits @ cycle rate | Continuous nbytes | Sub-flit small-message noise |
| Wire cut-through scope | Wormhole at every hop | Cut-through absorbed at HBM CTRL only | Intermediate hops still store-and-forward semantics; acceptable because component overheads at intermediate nodes are size-independent |
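The "Multi-flow BW sharing" row can be illustrated with a small worked comparison. The sketch below assumes all flows arrive at t = 0 on one link and uses idealized processor sharing as a stand-in for per-flow fair share. The function names and the example sizes are hypothetical.

```python
def fifo_finish(sizes, bw_gbs):
    """Finish time (ns) per flow when each one occupies the link atomically."""
    t = 0.0
    out = []
    for nbytes in sizes:
        t += nbytes / bw_gbs  # bytes / (GB/s) = ns
        out.append(t)
    return out


def fair_share_finish(sizes, bw_gbs):
    """Finish time (ns) per flow under ideal equal sharing of the link."""
    finish = [0.0] * len(sizes)
    t = 0.0
    served = 0.0          # bytes every still-active flow has received so far
    active = len(sizes)
    for i in sorted(range(len(sizes)), key=lambda k: sizes[k]):
        # Each active flow gets bw / active until the smallest one drains.
        t += (sizes[i] - served) * active / bw_gbs
        served = sizes[i]
        finish[i] = t
        active -= 1
    return finish


sizes = [4096, 1024]   # a large transaction queued ahead of a small one
fifo = fifo_finish(sizes, 128.0)
fair = fair_share_finish(sizes, 128.0)
print(fifo, fair)      # [32.0, 40.0] [40.0, 16.0]
```

Both disciplines drain the link at 40 ns, so the makespan agrees, but the small transaction finishes at 40 ns under FIFO against 16 ns under fair share. That gap is the exaggerated head-of-line blocking noted in the table.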

D3. Ignored (out of scope)

  • Bank-level row buffer conflict penalty (assume no conflicts — best case; round-robin chunk assignment is address-blind so we cannot detect same-bank reuse).
  • HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state burst_time = burst_bytes / pc_bw_gbs).
  • Refresh, ECC, thermal throttling, power gating.
  • Clock domain crossings, PLL lock time.
  • Flit-level discrete interleaving on links.
  • Upstream backpressure due to downstream buffer occupancy (input ports use unbounded simpy.Store).

D4. Workload sensitivity

Workloads where the above simplifications meaningfully affect results:

  • Random scatter/gather: bank conflict ignored → model optimistic.
  • Heavily mixed R/W workloads (e.g., GEMM bias accumulation): the HBM scheduler is not modeled. The default switch_penalty_ns = 0 assumes ideal amortization; a non-zero value charges a pessimistic cost on every read/write alternation.
  • High concurrency (>10 active flows on one link): HoL blocking and VC limits not modeled → model optimistic.
  • Very small (sub-flit) transactions: flit quantization noise.
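To make the mixed R/W sensitivity concrete, the following sketch charges a penalty on every direction change for a single pseudo-channel served in FIFO order. The function name `pc_busy_ns`, the 8 ns burst time, and the 10 ns penalty are hypothetical values chosen for illustration. Only `switch_penalty_ns` and the last-direction tracking idea come from this ADR.

```python
def pc_busy_ns(ops, burst_ns, switch_penalty_ns=0.0):
    """Total busy time of one PC for a FIFO sequence of 'R'/'W' bursts."""
    t = 0.0
    last = None  # last direction served on this PC
    for direction in ops:
        if last is not None and direction != last:
            t += switch_penalty_ns  # charged on every alternation, no reordering
        t += burst_ns
        last = direction
    return t


dense = "RWRWRWRW"     # worst case: alternates on every burst
grouped = "RRRRWWWW"   # what a reordering scheduler would approximate

print(pc_busy_ns(dense, 8.0), pc_busy_ns(grouped, 8.0))              # 64.0 64.0
print(pc_busy_ns(dense, 8.0, 10.0), pc_busy_ns(grouped, 8.0, 10.0))  # 134.0 74.0
```

With the default of 0 the two orderings cost the same, which matches the ideal-amortization assumption. With a non-zero penalty the FIFO model pays for all seven alternations in the dense case, whereas a real scheduler that batches reads and writes would pay closer to the grouped cost.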

D5. Verification policy

For workloads in D4, cross-check against real HW or a cycle-accurate simulator before drawing absolute-magnitude conclusions. The model remains accurate for relative comparisons within the modeled regime.

D6. Future work

  • Bank-level conflict modeling (opt-in via track_banks: true).
  • HBM scheduler with write buffer + watermark drain (Tier 2 from the design discussion).
  • Fluid wire model for multi-flow router contention.
  • Wire-level cut-through at intermediate routers (currently destination HBM CTRL only).
  • Backpressure modeling for finite component buffers.

Consequences

  • Single review point for all model fidelity questions. Each future PR touching latency must update the relevant section here.
  • Workload-specific magnitude error envelopes are explicit.
  • The builder derives pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs, which enforces the ADR-0019 D9 invariant in code instead of relying on manually consistent YAML values.
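A minimal sketch of the builder-side derivation, together with the steady-state burst time from D3, might look like the code below. The function `derive_hbm_attrs`, the dict-based config, and the example values (256 GB/s, 8 PCs) are assumptions for illustration. The real builder's interface is not shown in this ADR.

```python
def derive_hbm_attrs(cfg):
    """Derive per-PC attributes from the HBM-to-router link bandwidth.

    cfg is a plain dict standing in for the parsed YAML node (hypothetical).
    """
    num_pcs = cfg["num_pcs"]
    if num_pcs <= 0:
        raise ValueError("num_pcs must be positive")
    # Derived, never read from config, so the two values cannot drift apart.
    pc_bw_gbs = cfg["hbm_to_router_bw_gbs"] / num_pcs
    burst_bytes = cfg.get("burst_bytes", 256)
    return {
        "pc_bw_gbs": pc_bw_gbs,
        "burst_bytes": burst_bytes,
        "burst_time_ns": burst_bytes / pc_bw_gbs,  # bytes / (GB/s) = ns
    }


attrs = derive_hbm_attrs({"hbm_to_router_bw_gbs": 256.0, "num_pcs": 8})
print(attrs)  # {'pc_bw_gbs': 32.0, 'burst_bytes': 256, 'burst_time_ns': 8.0}
```

Because the per-PC bandwidth is computed rather than configured, the aggregate PC bandwidth always equals the link bandwidth by construction.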

Cross-references

  • ADR-0015 — component / port / wire model.
  • ADR-0019 — NoC and local HBM topology.
  • ADR-0004 — memory semantics, local HBM.