Latency model: HBM PC striping + chunk-loop drain (ADR-0033)

The previous model double-counted slow-upstream paths (e.g., a 64 KB transfer
over 128 GB/s UCIe came out ~2x pessimistic). HBM CTRL now distributes bursts across
8 pseudo-channels via global round-robin, with per-chunk commit timing
that pipelines correctly against the bottleneck link's data arrival.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 21:59:07 -07:00
parent f6d262e359
commit 5fdb6f8797
11 changed files with 1192 additions and 52 deletions
@@ -33,12 +33,17 @@ Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth,
- This guarantee is modeled by:
- a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point,
- while still incurring non-zero latency along explicitly modeled components.
- HBM CTRL internal modeling (PC striping, cut-through, scheduling fidelity)
is consolidated in ADR-0033 (Latency Model: Assumptions and Known
Simplifications). The aggregate BW guarantee here remains the contract;
ADR-0033 documents how the per-PC model realizes it and which scheduler
effects are intentionally simplified.
### D3. Remote PE HBM semantics (intra-cube)
- A PE that accesses another PE's local HBM traverses the router mesh:
- PE_DMA → local router → (mesh hops) → target PE's router → HBM_CTRL
- Router mesh bandwidth and hop count may limit remote HBM access relative to local access.
- A PE that accesses another PE's local HBM traverses the NOC:
- PE_DMA → NOC → (fabric hops) → target PE's NOC port → HBM_CTRL
- NOC bandwidth and hop count may limit remote HBM access relative to local access.
### D4. Non-local HBM semantics (inter-cube / inter-SIP)
@@ -0,0 +1,99 @@
# ADR-0033 — Latency Model: Assumptions and Known Simplifications
## Status
Accepted
## Context
The simulator is an analytical, event-driven performance model — not a
cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
or omitted by design. To keep the model auditable and reviewable as a whole,
this ADR consolidates the assumptions in one place. Individual component ADRs
(ADR-0015, ADR-0019, ADR-0004) define the *mechanisms*; this document defines
the *limits of fidelity*.
## Decisions
### D1. Modeled precisely
- **Per-directed-edge BW occupancy** (FIFO serialization via `available_at`) —
ADR-0015 D2.
- **Per-component switching/overhead latency** (`overhead_ns` attr).
- **HBM per-pseudo-channel (PC) parallelism** via a stateless `pc_avail[N]` array
with global round-robin chunking. Burst granularity is tunable
(`burst_bytes`, default 256 B). Reads and writes share each PC's
`available_at`, because in real HW the command bus is shared per PC.
- **HBM direction switching penalty mechanism**: per-PC last-direction
tracking + configurable `switch_penalty_ns`. Default 0 — see D2.
- **Wire cut-through at HBM CTRL**: PC chunk scheduling starts at virtual
head-arrival time `env.now - txn.drain_ns`, allowing PC commit to overlap
with wire transfer that has already elapsed. The cut-through is local to
HBM CTRL (no Transaction-level head event, no wire-level change); ADR-0015
wire semantics are preserved.
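
A minimal sketch of how the PC striping and the HBM CTRL cut-through listed above could fit together. The class name `HbmCtrlSketch`, the `next_pc` round-robin pointer, the 32 GB/s per-PC rate, and the rule that a transaction cannot complete before its tail arrives at `now_ns` are assumptions made for illustration, not the simulator's actual code. Direction switching is left out.

```python
class HbmCtrlSketch:
    """Illustrative only: stripes a transaction over PCs in round-robin order."""

    def __init__(self, num_pcs=8, burst_bytes=256, pc_bw_gbs=32.0):
        self.num_pcs = num_pcs
        self.burst_bytes = burst_bytes
        self.pc_bw_gbs = pc_bw_gbs          # 1 GB/s == 1 byte/ns
        self.pc_avail = [0.0] * num_pcs     # per-PC next-free time (ns)
        self.next_pc = 0                    # global round-robin pointer (assumed)

    def commit(self, now_ns, drain_ns, nbytes):
        """Return the completion time (ns) of a transaction of `nbytes`."""
        # Virtual head-arrival time: the wire transfer already took drain_ns.
        head_ns = now_ns - drain_ns
        done_ns = head_ns
        remaining = nbytes
        while remaining > 0:
            chunk = min(self.burst_bytes, remaining)
            pc = self.next_pc
            self.next_pc = (pc + 1) % self.num_pcs
            start = max(head_ns, self.pc_avail[pc])
            end = start + chunk / self.pc_bw_gbs
            self.pc_avail[pc] = end
            done_ns = max(done_ns, end)
            remaining -= chunk
        # Assumption: the commit cannot finish before the tail has arrived.
        return max(done_ns, now_ns)


ctrl = HbmCtrlSketch()
# 2048 B that took 16 ns on the wire: PC work overlaps the wire time.
t1 = ctrl.commit(now_ns=100.0, drain_ns=16.0, nbytes=2048)
# 4096 B with no wire time to hide behind: two bursts per PC.
t2 = ctrl.commit(now_ns=100.0, drain_ns=0.0, nbytes=4096)
print(t1, t2)
```

In the first call each PC finishes its 8 ns burst before the tail arrives, so the PC commit adds no latency beyond the wire. In the second call the bursts are fully exposed.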
### D2. Approximated (with known directional error)
| Effect | Real HW | Our model | Error direction |
|--------|---------|-----------|----------------|
| Router output port arbitration | Round-robin / weighted | Wire edge FIFO | HoL blocking exaggerated; fairness not modeled |
| Multi-flow BW sharing | Per-flow fair share | FIFO atomic occupancy | Per-transaction latency distribution differs; makespan is correct |
| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Switching penalty is over-charged when alternations are dense; the default `switch_penalty_ns = 0` instead assumes an ideal scheduler amortizes it (Tier 0) |
| Flit/cycle granularity | Discrete flits @ cycle rate | Continuous nbytes | Sub-flit small-message noise |
| Wire cut-through scope | Wormhole at every hop | Cut-through absorbed at HBM CTRL only | Intermediate hops still use store-and-forward semantics; acceptable because component overheads at intermediate nodes are size-independent |
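
To illustrate the "Multi-flow BW sharing" row, the sketch below compares FIFO atomic occupancy against an idealized per-flow fair share for transactions that all arrive at t = 0 on one link. Both function names and the 128 bytes/ns link rate are hypothetical; the fair-share side is a textbook processor-sharing calculation, not something the simulator implements.

```python
def fifo_finish_times(sizes, bw_bytes_per_ns):
    """Each transaction occupies the whole link in arrival order."""
    t = 0.0
    finish = []
    for nbytes in sizes:
        t += nbytes / bw_bytes_per_ns
        finish.append(t)
    return finish


def fair_share_finish_times(sizes, bw_bytes_per_ns):
    """Idealized fair share: active flows split the link equally."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    finish = [0.0] * len(sizes)
    t = 0.0
    served = 0.0            # bytes every still-active flow has received
    active = len(sizes)
    for i in order:
        # Time for flow i to get its remaining bytes at bw / active.
        t += (sizes[i] - served) * active / bw_bytes_per_ns
        served = sizes[i]
        finish[i] = t
        active -= 1
    return finish


sizes = [1024, 4096]
fifo = fifo_finish_times(sizes, 128.0)
fair = fair_share_finish_times(sizes, 128.0)
print(fifo, fair)
```

For these inputs the small transaction finishes earlier under FIFO than under fair share, while the last completion time is identical in both, which is the sense in which makespan is preserved.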
### D3. Ignored (out of scope)
- Bank-level row buffer conflict penalty (assume no conflicts — best case;
round-robin chunk assignment is address-blind so we cannot detect same-bank
reuse).
- HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
`burst_time = burst_bytes / pc_bw_gbs`).
- Refresh, ECC, thermal throttling, power gating.
- Clock domain crossings, PLL lock time.
- Flit-level discrete interleaving on links.
- Upstream backpressure due to downstream buffer occupancy (input ports use
unbounded `simpy.Store`).
### D4. Workload sensitivity
Workloads where the above simplifications meaningfully affect results:
- **Random scatter/gather**: bank conflicts are ignored → model optimistic.
- **Heavily mixed R/W traffic** (e.g., GEMM bias accumulation): the HBM
scheduler is not modeled. With the default `switch_penalty_ns = 0` we assume
ideal amortization; setting it non-zero models a pessimistic per-alternation cost.
- **High concurrency (>10 active flows on one link)**: HoL blocking and VC
limits not modeled → model optimistic.
- **Very small (sub-flit) transactions**: flit quantization noise.
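
The effect of `switch_penalty_ns` on mixed R/W traffic can be sketched for a single PC as follows. The function name, the 8 ns burst time, and the 4 ns penalty are illustrative assumptions; the only behavior taken from this ADR is that the penalty is charged whenever the direction differs from the PC's last direction, in FIFO order.

```python
def pc_service_time_ns(directions, burst_ns, switch_penalty_ns=0.0):
    """Total busy time of one PC serving bursts in FIFO order.

    `directions` is a sequence of "R"/"W" markers, one per burst.
    """
    total = 0.0
    last = None
    for d in directions:
        if last is not None and d != last:
            total += switch_penalty_ns   # charged on every alternation
        total += burst_ns
        last = d
    return total


dense = "RWRWRWRW"      # worst case for a FIFO model: 7 alternations
grouped = "RRRRWWWW"    # what a reordering scheduler would approximate

print(pc_service_time_ns(dense, 8.0))          # default: no penalty
print(pc_service_time_ns(dense, 8.0, 4.0))     # pessimistic per-alternation cost
print(pc_service_time_ns(grouped, 8.0, 4.0))   # same bursts, one switch
```

The gap between the dense and grouped cases is the over-charge that a FIFO model incurs relative to a scheduler that batches reads and writes.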
### D5. Verification policy
For workloads in D4, cross-check against real HW or a cycle-accurate
simulator before drawing absolute-magnitude conclusions. The model remains
accurate for **relative comparisons** within the modeled regime.
### D6. Future work
- [ ] Bank-level conflict modeling (opt-in via `track_banks: true`).
- [ ] HBM scheduler with write buffer + watermark drain (Tier 2 from the
design discussion).
- [ ] Fluid wire model for multi-flow router contention.
- [ ] Wire-level cut-through at intermediate routers (currently destination
HBM CTRL only).
- [ ] Backpressure modeling for finite component buffers.
## Consequences
- Single review point for all model fidelity questions. Each future PR
touching latency must update the relevant section here.
- Workload-specific magnitude error envelopes are explicit.
- Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
enforces the ADR-0019 D9 invariant in code rather than relying on values in
the YAML config being kept consistent by hand.
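
A hedged sketch of what such a builder-side derivation might look like, together with the steady-state `burst_time` from D3. The function name `derive_pc_params`, the returned dictionary, and the 256 GB/s aggregate figure are hypothetical; only the two formulas come from this ADR.

```python
def derive_pc_params(hbm_to_router_bw_gbs, num_pcs=8, burst_bytes=256):
    """Derive per-PC parameters from the aggregate HBM bandwidth."""
    if num_pcs <= 0:
        raise ValueError("num_pcs must be positive")
    if hbm_to_router_bw_gbs <= 0:
        raise ValueError("hbm_to_router_bw_gbs must be positive")
    # Per-PC bandwidth is derived, never configured independently, so the
    # PCs always sum to the aggregate rate.
    pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs
    # 1 GB/s == 1 byte/ns, so bytes / (GB/s) is already in nanoseconds.
    burst_time_ns = burst_bytes / pc_bw_gbs
    return {
        "pc_bw_gbs": pc_bw_gbs,
        "burst_time_ns": burst_time_ns,
        "num_pcs": num_pcs,
        "burst_bytes": burst_bytes,
    }


params = derive_pc_params(256.0)   # hypothetical 256 GB/s aggregate
print(params["pc_bw_gbs"], params["burst_time_ns"])
```

Because the per-PC rate is computed rather than read from config, the sum over all PCs cannot drift away from the aggregate value.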
## Cross-references
- ADR-0015 — component / port / wire model.
- ADR-0019 — NoC and local HBM topology.
- ADR-0004 — memory semantics, local HBM.