
ywkang 5fdb6f8797 Latency model: HBM PC striping + chunk-loop drain (ADR-0033)

Previous model double-counted slow-upstream paths (e.g., 64KB via UCIe
128 GB/s was ~2x pessimistic). HBM CTRL now distributes bursts across
8 pseudo-channels via global round-robin, with per-chunk commit timing
that pipelines correctly against the bottleneck link's data arrival.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-14 21:59:07 -07:00

4.6 KiB


ADR-0033 — Latency Model: Assumptions and Known Simplifications

Status

Accepted

Context

The simulator is an analytical, event-driven performance model — not a cycle-accurate or RTL-level simulator. Many real-HW effects are approximated or omitted by design. To keep the model auditable and reviewable as a whole, this ADR consolidates the assumptions in one place. Individual component ADRs (ADR-0015, ADR-0019, ADR-0004) define the mechanisms; this document defines the limits of fidelity.

Decisions

D1. Modeled precisely

Per-directed-edge BW occupancy (FIFO serialization via available_at) — ADR-0015 D2.
Per-component switching/overhead latency (overhead_ns attr).
HBM per-pseudo-channel (PC) parallelism via stateless pc_avail[N] array with global round-robin chunking. Burst granularity is tunable (burst_bytes, default 256B). Reads and writes share each PC's available_at, because in real HW the command bus is shared per PC.
HBM direction switching penalty mechanism: per-PC last-direction tracking + configurable switch_penalty_ns. Default 0 — see D2.
Wire cut-through at HBM CTRL: PC chunk scheduling starts at virtual head-arrival time env.now - txn.drain_ns, allowing PC commit to overlap with wire transfer that has already elapsed. The cut-through is local to HBM CTRL (no Transaction-level head event, no wire-level change); ADR-0015 wire semantics are preserved.
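
The PC striping, direction-switch penalty, and cut-through items above can be sketched roughly as follows. This is an illustrative stand-in, not the simulator's code: the class name HbmCtrlSketch, the commit() signature, the 16 GB/s per-PC bandwidth, and the rule that a transaction cannot complete before its tail arrives are all assumptions made for the example. Only pc_avail, burst_bytes, switch_penalty_ns, and the head-arrival expression now - drain_ns come from the text.

```python
class HbmCtrlSketch:
    """Hypothetical sketch of HBM CTRL chunk scheduling (names are illustrative)."""

    def __init__(self, num_pcs=8, burst_bytes=256, pc_bw_gbs=16.0,
                 switch_penalty_ns=0.0):
        self.num_pcs = num_pcs
        self.burst_bytes = burst_bytes
        self.pc_bw_gbs = pc_bw_gbs            # GB/s, treated as bytes per ns
        self.switch_penalty_ns = switch_penalty_ns
        self.pc_avail = [0.0] * num_pcs       # per-PC available_at, shared by R and W
        self.last_dir = [None] * num_pcs      # per-PC last direction
        self.next_pc = 0                      # global round-robin cursor

    def commit(self, now_ns, drain_ns, nbytes, direction):
        """Return the time (ns) at which the last chunk has committed."""
        # Cut-through: scheduling starts at the virtual head-arrival time.
        head_ns = now_ns - drain_ns
        done_ns = head_ns
        remaining = nbytes
        while remaining > 0:
            chunk = min(self.burst_bytes, remaining)
            remaining -= chunk
            pc = self.next_pc
            self.next_pc = (pc + 1) % self.num_pcs
            start_ns = max(head_ns, self.pc_avail[pc])
            # Charge the penalty only when this PC changes direction.
            if self.last_dir[pc] not in (None, direction):
                start_ns += self.switch_penalty_ns
            self.last_dir[pc] = direction
            self.pc_avail[pc] = start_ns + chunk / self.pc_bw_gbs
            done_ns = max(done_ns, self.pc_avail[pc])
        # Assumption: completion cannot precede the tail's arrival at now_ns.
        return max(done_ns, now_ns)


ctrl = HbmCtrlSketch()
# 4 KiB whose wire transfer took 100 ns: PC work (32 ns) hides behind the wire.
t_slow_wire = ctrl.commit(now_ns=100.0, drain_ns=100.0, nbytes=4096, direction="W")
# 2 KiB arriving with no wire drain: one 16 ns burst per PC after arrival.
t_no_drain = ctrl.commit(now_ns=100.0, drain_ns=0.0, nbytes=2048, direction="W")
print(t_slow_wire, t_no_drain)  # 100.0 116.0
```

In this sketch a slow upstream link fully hides the PC commit time, which is the behaviour the cut-through item is meant to capture. A faithful version would also need to gate each chunk on the arrival of its own data rather than on the head alone.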

D2. Approximated (with known directional error)

Effect	Real HW	Our model	Error direction
Router output port arbitration	Round-robin / weighted	Wire edge FIFO	HoL blocking exaggerated; fairness not modeled
Multi-flow BW sharing	Per-flow fair share	FIFO atomic occupancy	Per-transaction latency distribution differs; makespan is correct
HBM scheduler / write buffer	FR-FCFS + watermark drain	FIFO, no reordering	Switching penalty over-charged when alternations are dense. The default switch_penalty_ns = 0 assumes an ideal scheduler amortizes it (Tier 0)
Flit/cycle granularity	Discrete flits @ cycle rate	Continuous nbytes	Sub-flit small-message noise
Wire cut-through scope	Wormhole at every hop	Cut-through absorbed at HBM CTRL only	Intermediate hops still store-and-forward semantics; acceptable because component overheads at intermediate nodes are size-independent
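
To make the "Multi-flow BW sharing" row concrete, the following sketch compares FIFO atomic occupancy with an idealized per-flow fair share on a single link. Both function names are hypothetical, all flows are assumed to start at t = 0, and fair share is modeled as simple processor sharing, which is a simplification of any real arbiter.

```python
def fifo_finish_times(sizes_bytes, bw_gbs):
    """FIFO atomic occupancy: each transaction holds the link until it is done."""
    t, finish = 0.0, []
    for nbytes in sizes_bytes:
        t += nbytes / bw_gbs          # GB/s treated as bytes per ns
        finish.append(t)
    return finish


def fair_share_finish_times(sizes_bytes, bw_gbs):
    """Idealized fair share: active flows split the link bandwidth equally."""
    order = sorted(range(len(sizes_bytes)), key=lambda i: sizes_bytes[i])
    finish = [0.0] * len(sizes_bytes)
    t, sent, active = 0.0, 0.0, len(sizes_bytes)
    for i in order:
        # Every still-active flow has received `sent` bytes so far.
        t += (sizes_bytes[i] - sent) * active / bw_gbs
        sent = sizes_bytes[i]
        finish[i] = t
        active -= 1
    return finish


sizes = [512, 1536]                            # bytes, illustrative
print(fifo_finish_times(sizes, 128.0))         # [4.0, 16.0]
print(fair_share_finish_times(sizes, 128.0))   # [8.0, 16.0]
```

The per-transaction finish times differ between the two policies, but the last finish time (the makespan) is the same, because the link is never idle under either policy.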

D3. Ignored (out of scope)

Bank-level row buffer conflict penalty (assume no conflicts — best case; round-robin chunk assignment is address-blind so we cannot detect same-bank reuse).
HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state burst_time = burst_bytes / pc_bw_gbs).
Refresh, ECC, thermal throttling, power gating.
Clock domain crossings, PLL lock time.
Flit-level discrete interleaving on links.
Upstream backpressure due to downstream buffer occupancy (input ports use unbounded simpy.Store).
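
As a worked instance of the steady-state burst_time expression above: with bandwidth in GB/s and 1 GB taken as 10^9 bytes, 1 GB/s equals 1 byte/ns, so the quotient is directly in nanoseconds. The helper name and the 16 GB/s per-PC figure are assumptions chosen for the arithmetic, not values from the configuration.

```python
def burst_time_ns(burst_bytes: int, pc_bw_gbs: float) -> float:
    """Steady-state time for one burst on one PC, in ns (1 GB/s == 1 byte/ns)."""
    if pc_bw_gbs <= 0:
        raise ValueError("pc_bw_gbs must be positive")
    return burst_bytes / pc_bw_gbs


# Default 256 B burst on a hypothetical 16 GB/s pseudo-channel.
print(burst_time_ns(256, 16.0))  # 16.0
```

Any tRP / tRCD / tFAW / tRC cost is assumed to be folded into this single number, so the model cannot distinguish access patterns that would differ in those timings.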

D4. Workload sensitivity

Workloads where the above simplifications meaningfully affect results:

Random scatter/gather: bank conflict ignored → model optimistic.
Heavy mixed R/W traffic (e.g., GEMM bias accumulation): the HBM scheduler is not modeled. With the default switch_penalty_ns = 0 we assume ideal amortization; a non-zero value models a pessimistic per-alternation cost.
High concurrency (>10 active flows on one link): HoL blocking and VC limits not modeled → model optimistic.
Very small (sub-flit) transactions: flit quantization noise.
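
The sub-flit item can be illustrated by comparing the continuous-nbytes wire time with a flit-quantized one. This is a sketch only: the 64-byte flit size and the 128 GB/s link are hypothetical values, and the function names are not part of the simulator.

```python
import math


def continuous_ns(nbytes: int, bw_gbs: float) -> float:
    """Wire time under the continuous-nbytes model (GB/s == bytes/ns)."""
    return nbytes / bw_gbs


def flit_quantized_ns(nbytes: int, bw_gbs: float, flit_bytes: int) -> float:
    """Wire time if the payload is rounded up to whole flits."""
    n_flits = math.ceil(nbytes / flit_bytes)
    return n_flits * flit_bytes / bw_gbs


# A 16 B message on a 128 GB/s link with hypothetical 64 B flits.
print(continuous_ns(16, 128.0))             # 0.125
print(flit_quantized_ns(16, 128.0, 64))     # 0.5
# A 64 KiB message is a whole number of flits, so both models agree.
print(flit_quantized_ns(65536, 128.0, 64))  # 512.0
```

Under these assumed numbers the continuous model reports a quarter of the quantized time for the 16 B message and is exact for the 64 KiB one, which is why the error only matters for very small transactions.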

D5. Verification policy

For workloads in D4, cross-check against real HW or a cycle-accurate simulator before drawing absolute-magnitude conclusions. The model remains accurate for relative comparisons within the modeled regime.

D6. Future work

Bank-level conflict modeling (opt-in via track_banks: true).
HBM scheduler with write buffer + watermark drain (Tier 2 from the design discussion).
Fluid wire model for multi-flow router contention.
Wire-level cut-through at intermediate routers (currently destination HBM CTRL only).
Backpressure modeling for finite component buffers.

Consequences

Single review point for all model fidelity questions. Each future PR touching latency must update the relevant section here.
Workload-specific magnitude error envelopes are explicit.
Builder-side derivation of pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs enforces the ADR-0019 D9 invariant in code rather than relying on values being kept consistent by hand in YAML.
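
A minimal sketch of what such a builder-side derivation might look like, assuming the HBM section of the config arrives as a plain dict. The function name, the dict shape, and the choice to reject an explicitly supplied pc_bw_gbs are illustrative assumptions; only the formula and the two key names come from this ADR.

```python
def derive_pc_bw_gbs(hbm_cfg: dict) -> float:
    """Derive per-PC bandwidth so it cannot drift from the aggregate link BW."""
    # Assumed policy: a hand-written pc_bw_gbs is an error, not an override.
    if "pc_bw_gbs" in hbm_cfg:
        raise ValueError("pc_bw_gbs is derived; remove it from the config")
    num_pcs = hbm_cfg["num_pcs"]
    if num_pcs <= 0:
        raise ValueError("num_pcs must be positive")
    return hbm_cfg["hbm_to_router_bw_gbs"] / num_pcs


# Illustrative values: 8 PCs behind a 128 GB/s HBM-to-router link.
print(derive_pc_bw_gbs({"hbm_to_router_bw_gbs": 128.0, "num_pcs": 8}))  # 16.0
```

Deriving the value in one place means the sum of per-PC bandwidths always equals the HBM-to-router bandwidth, whatever the config author writes.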

Cross-references

ADR-0015 — component / port / wire model.
ADR-0019 — NoC and local HBM topology.
ADR-0004 — memory semantics, local HBM.
