Latency model: HBM PC striping + chunk-loop drain (ADR-0033)
Previous model double-counted slow-upstream paths (e.g., 64KB via UCIe 128 GB/s was ~2x pessimistic). HBM CTRL now distributes bursts across 8 pseudo-channels via global round-robin, with per-chunk commit timing that pipelines correctly against the bottleneck link's data arrival.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,99 @@
# ADR-0033 — Latency Model: Assumptions and Known Simplifications

## Status

Accepted

## Context

The simulator is an analytical, event-driven performance model, not a
cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
or omitted by design. To keep the model auditable and reviewable as a whole,
this ADR consolidates the assumptions in one place. Individual component ADRs
(ADR-0015, ADR-0019, ADR-0004) define the *mechanisms*; this document defines
the *limits of fidelity*.

## Decisions

### D1. Modeled precisely

- **Per-directed-edge BW occupancy** (FIFO serialization via `available_at`);
  see ADR-0015 D2.
- **Per-component switching/overhead latency** (`overhead_ns` attr).
- **HBM per-pseudo-channel parallelism** via a stateless `pc_avail[N]` array
  with global round-robin chunking. Burst granularity is tunable
  (`burst_bytes`, default 256B). Reads and writes share each PC's
  `available_at` (on real HW the command bus is shared per PC).
- **HBM direction switching penalty mechanism**: per-PC last-direction
  tracking plus a configurable `switch_penalty_ns`. The default is 0; see D2.
- **Wire cut-through at HBM CTRL**: PC chunk scheduling starts at the virtual
  head-arrival time `env.now - txn.drain_ns`, so PC commit can overlap
  with wire transfer that has already elapsed. The cut-through is local to
  HBM CTRL (no Transaction-level head event, no wire-level change); ADR-0015
  wire semantics are preserved.

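
The interaction between round-robin chunking and cut-through in D1 can be illustrated with a small standalone sketch. It is not the simulator's code: the class name `HbmCtrlSketch`, the per-PC bandwidth of 16 GB/s, and the assumption that chunk data arrives linearly over `drain_ns` are all hypothetical choices made for illustration. Only `pc_avail`, `burst_bytes`, and the head-arrival expression `now - drain_ns` come from the text above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HbmCtrlSketch:
    """Hypothetical stand-in for the HBM CTRL chunk scheduler."""
    num_pcs: int = 8
    burst_bytes: int = 256
    pc_bw_gbs: float = 16.0              # assumed value; 1 GB/s == 1 byte/ns
    next_pc: int = 0                     # global round-robin pointer
    pc_avail: List[float] = field(init=False)

    def __post_init__(self) -> None:
        self.pc_avail = [0.0] * self.num_pcs

    def commit(self, now_ns: float, nbytes: int, drain_ns: float) -> float:
        """Return the time at which the last chunk of a transaction commits."""
        if nbytes <= 0:
            raise ValueError("nbytes must be positive")
        head_ns = now_ns - drain_ns      # virtual head-arrival time
        done_ns = head_ns
        sent = 0
        while sent < nbytes:
            chunk = min(self.burst_bytes, nbytes - sent)
            sent += chunk
            # Assumption: bytes arrive at a constant rate over drain_ns, so a
            # chunk is available once its last byte has left the wire.
            arrival_ns = head_ns + drain_ns * sent / nbytes
            pc = self.next_pc
            self.next_pc = (self.next_pc + 1) % self.num_pcs
            start_ns = max(arrival_ns, self.pc_avail[pc])
            self.pc_avail[pc] = start_ns + chunk / self.pc_bw_gbs
            done_ns = max(done_ns, self.pc_avail[pc])
        return done_ns


ctrl = HbmCtrlSketch()
# 64 KB that took 512 ns to drain over a 128 GB/s link, tail arriving at t=512.
print(ctrl.commit(now_ns=512.0, nbytes=65536, drain_ns=512.0))
```

Under these assumed numbers the last chunk commits at 528 ns, one burst time after the tail arrives. Charging the full 512 ns of PC time only after the tail arrives would give 1024 ns, which is the roughly 2x pessimism the commit message describes.
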
### D2. Approximated (with known directional error)

| Effect | Real HW | Our model | Error direction |
|--------|---------|-----------|-----------------|
| Router output port arbitration | Round-robin / weighted | Wire edge FIFO | HoL blocking exaggerated; fairness not modeled |
| Multi-flow BW sharing | Per-flow fair share | FIFO atomic occupancy | Per-transaction latency distribution differs; makespan is correct |
| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Switching penalty is over-charged when alternations are dense; the default `switch_penalty_ns = 0` instead assumes an ideal scheduler amortizes it (Tier 0) |
| Flit/cycle granularity | Discrete flits @ cycle rate | Continuous `nbytes` | Sub-flit noise on small messages |
| Wire cut-through scope | Wormhole at every hop | Cut-through absorbed at HBM CTRL only | Intermediate hops keep store-and-forward semantics; acceptable because component overheads at intermediate nodes are size-independent |

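
The "Multi-flow BW sharing" row can be made concrete with a toy comparison. The sketch below assumes all transfers are queued on one link at t = 0 and compares FIFO atomic occupancy against an idealized fair-share (processor-sharing) link. Both function names are hypothetical, and the fair-share model is a textbook idealization rather than a description of any particular router.

```python
def fifo_finish_times(sizes_bytes, bw_gbs):
    """FIFO atomic occupancy: each transfer holds the link until it completes."""
    available_at = 0.0
    finishes = []
    for nbytes in sizes_bytes:           # all transfers are queued at t = 0
        available_at += nbytes / bw_gbs  # 1 GB/s == 1 byte/ns
        finishes.append(available_at)
    return finishes


def fair_share_finish_times(sizes_bytes, bw_gbs):
    """Idealized fair share: active flows split the link bandwidth equally."""
    order = sorted((n, i) for i, n in enumerate(sizes_bytes))
    finishes = [0.0] * len(sizes_bytes)
    now = 0.0
    served = 0.0                         # bytes delivered so far to each active flow
    active = len(order)
    for nbytes, idx in order:
        # Every active flow runs at bw/active until the smallest one finishes.
        now += (nbytes - served) * active / bw_gbs
        served = nbytes
        finishes[idx] = now
        active -= 1
    return finishes


sizes = [1024, 3072]
print(fifo_finish_times(sizes, 128.0))        # per-transfer finish times, FIFO
print(fair_share_finish_times(sizes, 128.0))  # per-transfer finish times, fair share
```

For this input the last finish time is 32 ns in both models, while the first transfer finishes at 8 ns under FIFO and 16 ns under fair sharing. That is the sense in which makespan is preserved while the per-transaction latency distribution differs.
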
### D3. Ignored (out of scope)

- Bank-level row buffer conflict penalty. The model assumes no conflicts (best
  case); round-robin chunk assignment is address-blind, so it cannot detect
  same-bank reuse.
- HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
  `burst_time = burst_bytes / pc_bw_gbs`).
- Refresh, ECC, thermal throttling, power gating.
- Clock domain crossings, PLL lock time.
- Flit-level discrete interleaving on links.
- Upstream backpressure due to downstream buffer occupancy (input ports use
  unbounded `simpy.Store`).

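
As a worked example of the steady-state `burst_time` expression, the sketch below evaluates it for the default 256B burst. It assumes GB/s means 10^9 bytes per second (so bytes divided by GB/s is already in nanoseconds) and uses a hypothetical per-PC bandwidth of 16 GB/s. The helper names are illustrative only.

```python
import math


def burst_time_ns(burst_bytes: int, pc_bw_gbs: float) -> float:
    """Time one PC is busy per burst; no tRP/tRCD/tFAW/tRC term appears."""
    return burst_bytes / pc_bw_gbs


def steady_state_commit_ns(nbytes: int, burst_bytes: int,
                           pc_bw_gbs: float, num_pcs: int) -> float:
    """Upper bound on PC time for a transfer striped evenly over idle PCs.

    A trailing partial chunk is charged as a full burst here for simplicity.
    """
    chunks = math.ceil(nbytes / burst_bytes)
    rounds = math.ceil(chunks / num_pcs)  # bursts handled by the busiest PC
    return rounds * burst_time_ns(burst_bytes, pc_bw_gbs)


print(burst_time_ns(256, 16.0))                     # 16.0 ns per burst
print(steady_state_commit_ns(65536, 256, 16.0, 8))  # 512.0 ns for 64 KB
```

Because the only inputs are bytes and bandwidth, any row-activation or precharge cost on real hardware has to be folded into the chosen `pc_bw_gbs`.
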
### D4. Workload sensitivity

Workloads where the above simplifications meaningfully affect results:

- **Random scatter/gather**: bank conflicts are ignored → model optimistic.
- **Heavily mixed R/W traffic** (e.g., GEMM bias accumulation): the HBM
  scheduler is absent. With the default `switch_penalty_ns = 0` we assume ideal
  amortization; setting it non-zero models a pessimistic per-alternation cost.
- **High concurrency (>10 active flows on one link)**: HoL blocking and VC
  limits are not modeled → model optimistic.
- **Very small (sub-flit) transactions**: flit quantization noise.

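
The sensitivity to `switch_penalty_ns` for mixed R/W traffic can be sketched for a single pseudo-channel. The function below is a hypothetical illustration of per-PC last-direction tracking: it charges the penalty once per direction change and is not taken from the simulator. The 16 ns burst time and 8 ns penalty are assumed values.

```python
def pc_busy_ns(ops: str, burst_ns: float, switch_penalty_ns: float = 0.0) -> float:
    """Total busy time of one PC for a FIFO sequence of 'R'/'W' bursts."""
    available_at = 0.0
    last_dir = None
    for direction in ops:
        if last_dir is not None and direction != last_dir:
            available_at += switch_penalty_ns  # charged on every alternation
        available_at += burst_ns
        last_dir = direction
    return available_at


dense = "RWRWRWRW"      # worst case: direction flips on every burst
grouped = "RRRRWWWW"    # what a reordering scheduler would aim for

print(pc_busy_ns(dense, 16.0))                           # default penalty of 0
print(pc_busy_ns(dense, 16.0, switch_penalty_ns=8.0))    # pessimistic FIFO cost
print(pc_busy_ns(grouped, 16.0, switch_penalty_ns=8.0))  # one switch only
```

With these numbers the dense sequence costs 128 ns at the default and 184 ns with the penalty enabled, while the grouped sequence costs 136 ns. A real scheduler that batches reads and writes would sit closer to the grouped figure, which is why a non-zero penalty under FIFO ordering is pessimistic.
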
### D5. Verification policy

For workloads in D4, cross-check against real HW or a cycle-accurate
simulator before drawing absolute-magnitude conclusions. The model remains
accurate for **relative comparisons** within the modeled regime.

### D6. Future work

- [ ] Bank-level conflict modeling (opt-in via `track_banks: true`).
- [ ] HBM scheduler with write buffer + watermark drain (Tier 2 from the
  design discussion).
- [ ] Fluid wire model for multi-flow router contention.
- [ ] Wire-level cut-through at intermediate routers (currently only at the
  destination HBM CTRL).
- [ ] Backpressure modeling for finite component buffers.

## Consequences

- Single review point for all model fidelity questions. Each future PR
  touching latency must update the relevant section here.
- Workload-specific magnitude error envelopes are explicit.
- Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
  enforces the ADR-0019 D9 invariant in code rather than relying on manual
  consistency in the YAML config.

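
A builder-side derivation of this kind might look like the sketch below. It assumes the invariant is that aggregate PC bandwidth equals the HBM-to-router link bandwidth, and that the config is available as a plain dict. The function name and the handling of an explicitly supplied `pc_bw_gbs` are hypothetical.

```python
def derive_pc_bw_gbs(cfg: dict) -> float:
    """Derive per-PC bandwidth so that num_pcs * pc_bw_gbs matches the link."""
    num_pcs = cfg["num_pcs"]
    if num_pcs <= 0:
        raise ValueError("num_pcs must be positive")
    derived = cfg["hbm_to_router_bw_gbs"] / num_pcs
    # Assumption: a hand-written pc_bw_gbs that disagrees is a config error.
    explicit = cfg.get("pc_bw_gbs")
    if explicit is not None and abs(explicit - derived) > 1e-9:
        raise ValueError(
            f"pc_bw_gbs={explicit} conflicts with derived value {derived}"
        )
    return derived


print(derive_pc_bw_gbs({"hbm_to_router_bw_gbs": 128.0, "num_pcs": 8}))  # 16.0
```

Deriving the value in one place means a change to the link bandwidth or PC count cannot silently leave a stale per-PC figure behind.
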
## Cross-references
- ADR-0015 — component / port / wire model.
- ADR-0019 — NoC and local HBM topology.
- ADR-0004 — memory semantics, local HBM.