# ADR-0033 — Latency Model: Assumptions and Known Simplifications

## Status

Accepted

## Context

The simulator is an analytical, event-driven performance model, not a
cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
or omitted by design. To keep the model auditable and reviewable as a whole,
this ADR consolidates the assumptions in one place. Individual component ADRs
(ADR-0015, ADR-0019, ADR-0004) define the *mechanisms*; this document defines
the *limits of fidelity*.

## Decisions

### D1. Modeled precisely

- **Per-directed-edge BW occupancy** (FIFO serialization via `available_at`):
  see ADR-0015 D2.
- **Per-component switching/overhead latency** (`overhead_ns` attr).
- **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array
  with global round-robin chunking. Burst granularity is tunable
  (`burst_bytes`, default 256 B). Reads and writes share each PC's
  `available_at` (in real HW the command bus is shared per PC).
- **HBM direction switching penalty mechanism**: per-PC last-direction
  tracking plus a configurable `switch_penalty_ns`. The default is 0 (see D2).
- **Wire cut-through at HBM CTRL**: PC chunk scheduling starts at the virtual
  head-arrival time `env.now - txn.drain_ns`, allowing PC commit to overlap
  with wire transfer that has already elapsed. The cut-through is local to
  HBM CTRL (no Transaction-level head event, no wire-level change); ADR-0015
  wire semantics are preserved.

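The pseudo-channel and cut-through bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the simulator's code: the class `HbmCtrlSketch`, its `schedule` method, and the `rr` cursor are hypothetical names, and the per-PC bandwidth of 32 GB/s is an example value. Only `pc_avail`, `burst_bytes`, `pc_bw_gbs`, and `drain_ns` are taken from this ADR. The sketch also assumes a transaction cannot complete before its last byte has arrived at `now_ns`.

```python
from dataclasses import dataclass, field


@dataclass
class HbmCtrlSketch:
    """Hypothetical sketch of round-robin PC chunking with cut-through."""

    num_pcs: int = 8
    burst_bytes: int = 256
    pc_bw_gbs: float = 32.0  # example value; 1 GB/s equals 1 byte/ns
    pc_avail: list = field(default_factory=list)
    rr: int = 0  # global round-robin cursor, shared by all transactions

    def __post_init__(self):
        # One "free at" timestamp per pseudo-channel.
        self.pc_avail = [0.0] * self.num_pcs

    def schedule(self, now_ns, nbytes, drain_ns):
        """Return the completion time (ns) of an nbytes transaction.

        now_ns   : time the tail of the transaction reached HBM CTRL
        drain_ns : wire time already spent draining the payload
        """
        # Virtual head-arrival time: chunks may start committing from here.
        head_ns = now_ns - drain_ns
        # Assumption: completion cannot precede full data arrival.
        finish_ns = now_ns
        remaining = nbytes
        while remaining > 0:
            chunk = min(self.burst_bytes, remaining)
            pc = self.rr
            self.rr = (self.rr + 1) % self.num_pcs
            start = max(head_ns, self.pc_avail[pc])
            done = start + chunk / self.pc_bw_gbs
            self.pc_avail[pc] = done
            finish_ns = max(finish_ns, done)
            remaining -= chunk
        return finish_ns
```

With these example values, a 2048 B transaction splits into eight 256 B chunks, one per PC, each taking 8 ns. If it arrives at `now_ns = 100` with `drain_ns = 16`, every chunk commits by 92 ns and the transaction completes at 100 ns, so the PC time is hidden behind the wire transfer. With `drain_ns = 0` the same transaction completes at 108 ns.
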
### D2. Approximated (with known directional error)

| Effect | Real HW | Our model | Error direction |
|--------|---------|-----------|-----------------|
| Router output port arbitration | Round-robin / weighted | Wire edge FIFO | HoL blocking exaggerated; fairness not modeled |
| Multi-flow BW sharing | Per-flow fair share | FIFO atomic occupancy | Per-txn latency distribution differs; makespan correct |
| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Switching penalty over-charged when alternations are dense; however, the default `switch_penalty_ns = 0` assumes an ideal scheduler amortizes it (Tier 0) |
| Flit/cycle granularity | Discrete flits @ cycle rate | Continuous nbytes | Sub-flit small-message noise |
| Wire cut-through scope | Wormhole at every hop | Cut-through absorbed at HBM CTRL only | Intermediate hops keep store-and-forward semantics; acceptable because component overheads at intermediate nodes are size-independent |

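The "Multi-flow BW sharing" row can be illustrated with a small comparison. The sketch below is not part of the simulator; `fifo_finish_times` and `fair_share_finish_times` are hypothetical helpers. It assumes a single link, all flows ready at time 0, and ideal equal sharing on the real-HW side. Under those assumptions both disciplines are work-conserving, so the last flow finishes at the same time, while individual finish times differ.

```python
def fifo_finish_times(sizes_bytes, bw_gbs):
    """FIFO atomic occupancy: each transaction holds the full link in turn."""
    t = 0.0
    finish = []
    for nbytes in sizes_bytes:
        t += nbytes / bw_gbs  # 1 GB/s equals 1 byte/ns
        finish.append(t)
    return finish


def fair_share_finish_times(sizes_bytes, bw_gbs):
    """Ideal per-flow fair share: active flows split the link equally."""
    order = sorted(range(len(sizes_bytes)), key=lambda i: sizes_bytes[i])
    finish = [0.0] * len(sizes_bytes)
    t = 0.0
    sent = 0.0  # bytes delivered so far to every still-active flow
    active = len(sizes_bytes)
    for i in order:
        # Each active flow gets bw/active, so the smallest remaining flow
        # needs (size - sent) * active / bw more time to finish.
        t += (sizes_bytes[i] - sent) * active / bw_gbs
        sent = sizes_bytes[i]
        finish[i] = t
        active -= 1
    return finish
```

For two flows of 512 B and 1536 B on a 128 GB/s link, FIFO finishes them at 4 ns and 16 ns, while fair sharing finishes them at 8 ns and 16 ns. The makespan matches, but the small flow's latency is understated by the FIFO model when it happens to be served first, and overstated when it is queued behind the large one.
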
### D3. Ignored (out of scope)

- Bank-level row buffer conflict penalty (the model assumes no conflicts,
  i.e. the best case; round-robin chunk assignment is address-blind, so
  same-bank reuse cannot be detected).
- HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
  `burst_time = burst_bytes / pc_bw_gbs`).
- Refresh, ECC, thermal throttling, power gating.
- Clock domain crossings, PLL lock time.
- Flit-level discrete interleaving on links.
- Upstream backpressure due to downstream buffer occupancy (input ports use
  unbounded `simpy.Store`).

### D4. Workload sensitivity

Workloads where the above simplifications meaningfully affect results:

- **Random scatter/gather**: bank conflicts are ignored → model optimistic.
- **Heavy mixed R/W traffic** (e.g., GEMM bias accumulation): the HBM
  scheduler is absent. With the default `switch_penalty_ns = 0` we assume
  ideal amortization; setting it non-zero models a pessimistic
  per-alternation cost.
- **High concurrency (>10 active flows on one link)**: HoL blocking and VC
  limits are not modeled → model optimistic.
- **Very small (sub-flit) transactions**: flit quantization noise.

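The mixed R/W sensitivity can be made concrete with a small sketch of per-PC last-direction tracking. This is an illustration only: `serve_on_one_pc` is a hypothetical helper, the 8 ns burst time and 10 ns penalty are example values, and the sketch assumes the first access on a PC pays no penalty.

```python
def serve_on_one_pc(directions, burst_time_ns, switch_penalty_ns=0.0):
    """Return when one PC becomes free after serving chunks in FIFO order.

    directions: iterable of "R" / "W", one entry per chunk.
    """
    available_at = 0.0
    last_dir = None  # per-PC last-direction tracking
    for d in directions:
        # Charge the penalty on every read<->write alternation.
        # Assumption: the very first access pays nothing.
        if last_dir is not None and d != last_dir:
            available_at += switch_penalty_ns
        available_at += burst_time_ns
        last_dir = d
    return available_at
```

With 8 ns bursts, eight strictly alternating chunks (`"RWRWRWRW"`) take 64 ns at the default penalty of 0 and 134 ns with a 10 ns penalty, because all seven alternations are charged. The same chunks grouped as `"RRRRWWWW"`, which is roughly what a reordering scheduler would aim for, take 74 ns. The gap between 134 ns and 74 ns is the over-charge that a FIFO model incurs when the penalty is non-zero.
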
### D5. Verification policy

For workloads in D4, cross-check against real HW or a cycle-accurate
simulator before drawing absolute-magnitude conclusions. The model remains
accurate for **relative comparisons** within the modeled regime.

### D6. Future work

- [ ] Bank-level conflict modeling (opt-in via `track_banks: true`).
- [ ] HBM scheduler with write buffer + watermark drain (Tier 2 from the
  design discussion).
- [ ] Fluid wire model for multi-flow router contention.
- [ ] Wire-level cut-through at intermediate routers (currently destination
  HBM CTRL only).
- [ ] Backpressure modeling for finite component buffers.

## Consequences

- Single review point for all model fidelity questions. Each future PR
  touching latency must update the relevant section here.
- Workload-specific magnitude error envelopes are explicit.
- Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
  enforces the ADR-0019 D9 invariant in code rather than relying on manual
  consistency in the YAML.

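The builder-side derivation and the steady-state burst time from D3 reduce to two divisions, sketched below. The function names and the 256 GB/s link bandwidth are hypothetical; only the formulas `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs` and `burst_time = burst_bytes / pc_bw_gbs` come from this ADR. The sketch assumes decimal units, so 1 GB/s equals 1 byte/ns and the burst time comes out in nanoseconds.

```python
def derive_pc_bw_gbs(hbm_to_router_bw_gbs, num_pcs):
    """Derive per-PC bandwidth so the PCs always sum to the link bandwidth."""
    if num_pcs <= 0:
        raise ValueError("num_pcs must be positive")
    return hbm_to_router_bw_gbs / num_pcs


def burst_time_ns(burst_bytes, pc_bw_gbs):
    """Steady-state time for one burst on one PC (bytes / (bytes per ns))."""
    return burst_bytes / pc_bw_gbs


# Example values only: a 256 GB/s HBM-to-router link split over 8 PCs.
pc_bw = derive_pc_bw_gbs(256.0, 8)
per_burst = burst_time_ns(256, pc_bw)
```

Here `pc_bw` is 32 GB/s and `per_burst` is 8 ns. Because the per-PC value is computed rather than configured, a YAML file cannot state a per-PC bandwidth that disagrees with the link bandwidth.
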
## Cross-references

- ADR-0015: component / port / wire model.
- ADR-0019: NoC and local HBM topology.
- ADR-0004: memory semantics, local HBM.