Latency model: HBM PC striping + chunk-loop drain (ADR-0033)

Previous model double-counted slow-upstream paths (e.g., 64KB via UCIe 128 GB/s was ~2x pessimistic). HBM CTRL now distributes bursts across 8 pseudo-channels via global round-robin, with per-chunk commit timing that pipelines correctly against the bottleneck link's data arrival.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 21:59:07 -07:00
parent f6d262e359
commit 5fdb6f8797
11 changed files with 1192 additions and 52 deletions
@@ -33,12 +33,17 @@ Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth,
 - This guarantee is modeled by:
  - a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point,
  - while still incurring non-zero latency along explicitly modeled components.
+- HBM CTRL internal modeling (PC striping, cut-through, scheduling fidelity)
+  is consolidated in ADR-0033 (Latency Model: Assumptions and Known
+  Simplifications). The aggregate BW guarantee here remains the contract;
+  ADR-0033 documents how the per-PC model realizes it and which scheduler
+  effects are intentionally simplified.

 ### D3. Remote PE HBM semantics (intra-cube)

- A PE that accesses another PE's local HBM traverses the router mesh:
-  - PE_DMA → local router → (mesh hops) → target PE's router → HBM_CTRL
- Router mesh bandwidth and hop count may limit remote HBM access relative to local access.
+- A PE that accesses another PE's local HBM traverses the NOC:
+  - PE_DMA → NOC → (fabric hops) → target PE's NOC port → HBM_CTRL
+- NOC bandwidth and hop count may limit remote HBM access relative to local access.

 ### D4. Non-local HBM semantics (inter-cube / inter-SIP)

@@ -0,0 +1,99 @@
+# ADR-0033 — Latency Model: Assumptions and Known Simplifications
+
+## Status
+
+Accepted
+
+## Context
+
+The simulator is an analytical, event-driven performance model — not a
+cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
+or omitted by design. To keep the model auditable and reviewable as a whole,
+this ADR consolidates the assumptions in one place. Individual component ADRs
+(ADR-0015, ADR-0019, ADR-0004) define the *mechanisms*; this document defines
+the *limits of fidelity*.
+
+## Decisions
+
+### D1. Modeled precisely
+
+- **Per-directed-edge BW occupancy** (FIFO serialization via `available_at`) —
+  ADR-0015 D2.
+- **Per-component switching/overhead latency** (`overhead_ns` attr).
+- **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array
+  with global round-robin chunking. Burst granularity tunable
+  (`burst_bytes`, default 256 B). Reads and writes share each PC's
+  `available_at`, because in real HW the command bus is shared per PC.
+- **HBM direction switching penalty mechanism**: per-PC last-direction
+  tracking + configurable `switch_penalty_ns`. Default 0 — see D2.
+- **Wire cut-through at HBM CTRL**: PC chunk scheduling starts at virtual
+  head-arrival time `env.now - txn.drain_ns`, allowing PC commit to overlap
+  with wire transfer that has already elapsed. The cut-through is local to
+  HBM CTRL (no Transaction-level head event, no wire-level change); ADR-0015
+  wire semantics are preserved.
+
+### D2. Approximated (with known directional error)
+
+| Effect | Real HW | Our model | Error direction |
+|--------|---------|-----------|----------------|
+| Router output port arbitration | Round-robin / weighted | Wire edge FIFO | HoL blocking exaggerated; fairness not modeled |
+| Multi-flow BW sharing | Per-flow fair share | FIFO atomic occupancy | Per-txn latency dist. differs; makespan correct |
+| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Switching penalty over-charged when alternations are dense — but default `switch_penalty_ns = 0` assumes ideal scheduler amortizes it (Tier 0) |
+| Flit/cycle granularity | Discrete flits @ cycle rate | Continuous nbytes | Sub-flit small-message noise |
+| Wire cut-through scope | Wormhole at every hop | Cut-through absorbed at HBM CTRL only | Intermediate hops keep store-and-forward semantics; acceptable because component overheads at intermediate nodes are size-independent |
+
+### D3. Ignored (out of scope)
+
+- Bank-level row buffer conflict penalty (assume no conflicts — best case;
+  round-robin chunk assignment is address-blind so we cannot detect same-bank
+  reuse).
+- HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
+  `burst_time = burst_bytes / pc_bw_gbs`).
+- Refresh, ECC, thermal throttling, power gating.
+- Clock domain crossings, PLL lock time.
+- Flit-level discrete interleaving on links.
+- Upstream backpressure due to downstream buffer occupancy (input ports use
+  unbounded `simpy.Store`).
+
+### D4. Workload sensitivity
+
+Workloads where the above simplifications meaningfully affect results:
+
+- **Random scatter/gather**: bank conflict ignored → model optimistic.
+- **Heavily mixed R/W workloads** (e.g., GEMM bias accumulation): HBM scheduler
+  absent. With default `switch_penalty_ns = 0` we assume ideal amortization;
+  setting it non-zero models pessimistic per-alternation cost.
+- **High concurrency (>10 active flows on one link)**: HoL blocking and VC
+  limits not modeled → model optimistic.
+- **Very small (sub-flit) transactions**: flit quantization noise.
+
+### D5. Verification policy
+
+For workloads in D4, cross-check against real HW or a cycle-accurate
+simulator before drawing absolute-magnitude conclusions. The model remains
+accurate for **relative comparisons** within the modeled regime.
+
+### D6. Future work
+
+- [ ] Bank-level conflict modeling (opt-in via `track_banks: true`).
+- [ ] HBM scheduler with write buffer + watermark drain (Tier 2 from the
+  design discussion).
+- [ ] Fluid wire model for multi-flow router contention.
+- [ ] Wire-level cut-through at intermediate routers (currently destination
+  HBM CTRL only).
+- [ ] Backpressure modeling for finite component buffers.
+
+## Consequences
+
+- Single review point for all model fidelity questions. Each future PR
+  touching latency must update the relevant section here.
+- Workload-specific magnitude error envelopes are explicit.
+- Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
+  enforces the ADR-0019 D9 invariant in code rather than relying on
+  manually kept consistency between YAML values.
+
+## Cross-references
+
+- ADR-0015 — component / port / wire model.
+- ADR-0019 — NoC and local HBM topology.
+- ADR-0004 — memory semantics, local HBM.
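
A minimal sketch of the D1 mechanisms (global round-robin PC striping, `pc_bw_gbs` derivation, and cut-through from the virtual head-arrival time) may help when reviewing the numbers. It is not the code from this commit: the class `HbmCtrlSketch`, the method `commit`, the 256 GB/s aggregate bandwidth, and the assumption that chunk data arrives evenly spread over `drain_ns` are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class HbmCtrlSketch:
    """Illustrative HBM CTRL timing model (hypothetical names and values)."""
    num_pcs: int = 8
    hbm_to_router_bw_gbs: float = 256.0  # assumed aggregate BW, not from the ADR
    burst_bytes: int = 256
    pc_avail: list = field(default_factory=list)  # per-PC available_at, in ns
    next_pc: int = 0                              # global round-robin pointer

    def __post_init__(self):
        # Builder-side derivation: per-PC BW is the aggregate split evenly.
        self.pc_bw_gbs = self.hbm_to_router_bw_gbs / self.num_pcs
        self.pc_avail = [0.0] * self.num_pcs

    def commit(self, now_ns: float, nbytes: int, drain_ns: float) -> float:
        """Return the time (ns) at which the last chunk is committed."""
        head_ns = now_ns - drain_ns  # virtual head-arrival time (cut-through)
        n_chunks = -(-nbytes // self.burst_bytes)  # ceiling division
        done_ns = now_ns
        for i in range(n_chunks):
            chunk = min(self.burst_bytes, nbytes - i * self.burst_bytes)
            # Assumption: chunk i is fully on hand after (i + 1) / n_chunks
            # of the wire transfer has elapsed.
            arrival_ns = head_ns + drain_ns * (i + 1) / n_chunks
            pc = self.next_pc
            self.next_pc = (pc + 1) % self.num_pcs
            start_ns = max(arrival_ns, self.pc_avail[pc])
            # bytes / (GB/s) yields nanoseconds.
            self.pc_avail[pc] = start_ns + chunk / self.pc_bw_gbs
            done_ns = max(done_ns, self.pc_avail[pc])
        return done_ns


# 64 KiB whose tail arrives at t = 512 ns over a 128 GB/s link.
drain_ns = 65536 / 128.0
pipelined_ns = HbmCtrlSketch().commit(now_ns=drain_ns, nbytes=65536, drain_ns=drain_ns)
# Store-and-forward reference: wait for the tail, then commit at aggregate BW.
serial_ns = drain_ns + 65536 / 256.0
# No upstream bottleneck: all data is on hand at t = 0.
local_ns = HbmCtrlSketch().commit(now_ns=0.0, nbytes=65536, drain_ns=0.0)
```

Under these assumed values the link is the bottleneck, so the pipelined commit finishes one burst time after the tail arrives, whereas the store-and-forward reference adds the full HBM commit time on top of the wire transfer. With `drain_ns = 0` the result reduces to the aggregate-bandwidth commit time, which is the contract the per-PC model is meant to preserve.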