ADR-0033 — Latency Model: Assumptions and Known Simplifications
Status
Accepted
Context
The simulator is an analytical, event-driven performance model — not a cycle-accurate or RTL-level simulator. Many real-HW effects are approximated or omitted by design. To keep the model auditable and reviewable as a whole, this ADR consolidates the assumptions in one place. Individual component ADRs (ADR-0015, ADR-0019, ADR-0004) define the mechanisms; this document defines the limits of fidelity.
Decisions
D1. Modeled precisely
- Per-directed-edge BW occupancy (FIFO serialization via `available_at`); see ADR-0015 D2.
- Per-component switching/overhead latency (`overhead_ns` attr).
- HBM per-pseudo-channel parallelism via stateless `pc_avail[N]` array with global round-robin chunking. Burst granularity tunable (`burst_bytes`, default 256B). Read and write share each PC's `available_at` (real HW command bus is per-PC shared).
- HBM direction switching penalty mechanism: per-PC last-direction tracking + configurable `switch_penalty_ns`. Default 0; see D2.
- Wire chunk-streaming (Phase 2c): each wire decomposes Transactions with payload into `Flit` objects of `flit_bytes` (default = HBM `burst_bytes` = 256B). The wire emits each flit individually after `prop_ns + flit_nbytes/bw_gbs` so the link's bandwidth throttles flit arrival rate per real-HW wormhole semantics.
- Separate Stores per directed edge (Phase 2c key fix): the wire is the only conduit between `src.out_ports[dst]` and `dst.in_ports[src]`. Earlier the two were aliased to the same `simpy.Store`; when the wire put a chunkified flit back, the destination's `fan_in` could pull it before the wire applied bandwidth delay, leaving half the flits bypassing the bottleneck.
- Flit-aware pass-through (`TransitComponent`, `HbmCtrlComponent`): forward each flit serially with per-transaction overhead applied ONCE on the first-flit arrival (header decode model). Subsequent flits pipeline through with no extra delay. Wormhole emerges naturally across multi-hop paths.
- HBM CTRL per-flit PC commit: each flit arriving at HBM CTRL schedules a PC commit at `max(env.now, pc_avail[pc]) + chunk_time`, with the `is_last` flit waiting for the last PC commit before signaling `txn.done`.
- Non-flit-aware components (default) reassemble flits at `_fan_in` before the legacy `_forward_txn` path runs. This preserves backward compatibility for components that have not yet been migrated to flit-aware processing (e.g., `MCpuComponent`, `IoCpuComponent` sub-txn generators). Such components reassemble once per leg boundary, NOT per hop, so multi-hop wormhole timing through a chain of flit-aware routers is preserved.
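The wire chunk-streaming and per-flit PC commit rules above can be sketched as plain arithmetic, without SimPy. This is an illustration under stated assumptions, not the simulator's code: the function names are hypothetical, propagation delay is assumed to overlap across flits (if the wire process charges `prop_ns` serially per flit, arrivals would be spaced further apart), and flits are assigned to pseudo-channels by a simple round-robin counter.

```python
import math

def flit_sizes(nbytes, flit_bytes=256):
    """Split a payload into flit sizes; the last flit may be short."""
    count = math.ceil(nbytes / flit_bytes)
    return [min(flit_bytes, nbytes - i * flit_bytes) for i in range(count)]

def wire_arrivals(start_ns, sizes, bw_gbs, prop_ns):
    """Arrival time of each flit at the far end of one wire.

    Assumption: the link is occupied for size / bw_gbs per flit
    (1 GB/s == 1 byte/ns) and prop_ns is a pipelined offset.
    """
    arrivals, link_free = [], start_ns
    for size in sizes:
        link_free += size / bw_gbs
        arrivals.append(link_free + prop_ns)
    return arrivals

def hbm_commit(arrivals, sizes, pc_avail, pc_bw_gbs, next_pc=0):
    """Commit each flit at max(now, pc_avail[pc]) + chunk_time.

    Returns the time of the last PC commit, which is when the
    is_last flit would be allowed to signal txn.done.
    """
    done = 0.0
    for arrival, size in zip(arrivals, sizes):
        pc = next_pc % len(pc_avail)          # round-robin PC choice
        pc_avail[pc] = max(arrival, pc_avail[pc]) + size / pc_bw_gbs
        done = max(done, pc_avail[pc])
        next_pc += 1
    return done

sizes = flit_sizes(1024)                      # four 256B flits
arrivals = wire_arrivals(0.0, sizes, bw_gbs=64.0, prop_ns=2.0)
done_ns = hbm_commit(arrivals, sizes, [0.0] * 4, pc_bw_gbs=16.0)
print(arrivals)   # [6.0, 10.0, 14.0, 18.0]
print(done_ns)    # 34.0
```

With these example numbers the wire spaces flits 4 ns apart, each PC takes 16 ns per flit, and the four commits overlap across PCs, so completion is set by the last flit's arrival plus one chunk time.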
D2. Approximated (with known directional error)
| Effect | Real HW | Our model | Error direction |
|---|---|---|---|
| Router output port arbitration | Round-robin / weighted | Wire edge FIFO + serial worker | Fair when one txn per cycle; multi-stream sharing not modeled at flit level |
| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Pessimistic for mixed R/W when alternations are dense; the default `switch_penalty_ns = 0` assumes an ideal scheduler amortizes the switching cost |
| Flit ↔ burst granularity | 32B flit < 256B burst | `flit_bytes = burst_bytes = 256B` | Sub-flit fine-grained timing noise; affects very small wire arbitration windows only |
| Wire-level RR fairness | Per-cycle multi-flow arbitration on shared link | Single serial wire process per edge | Fair only when one transaction is in flight on a given edge at a time. Multi-stream concurrent traffic on the same edge serializes by FIFO order |
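To make the wire-level fairness row concrete, the sketch below compares per-flow finish times when two flows share one edge. It is an illustration only: both helper names are hypothetical, all flits are assumed queued at time zero, and the round-robin variant is an idealized per-flit arbiter rather than any specific router's policy.

```python
def fifo_finish(flows, flit_ns):
    """Finish time per flow when whole transactions serialize in FIFO order."""
    t, out = 0.0, {}
    for name, nflits in flows:
        t += nflits * flit_ns
        out[name] = t
    return out

def rr_finish(flows, flit_ns):
    """Finish time per flow under idealized per-flit round-robin."""
    remaining = dict(flows)   # flits left per flow (each assumed >= 1)
    t, out = 0.0, {}
    while remaining:
        for name in list(remaining):
            t += flit_ns      # one flit of this flow crosses the link
            remaining[name] -= 1
            if remaining[name] == 0:
                out[name] = t
                del remaining[name]
    return out

flows = [("A", 4), ("B", 4)]          # two flows, four flits each
print(fifo_finish(flows, 4.0))        # {'A': 16.0, 'B': 32.0}
print(rr_finish(flows, 4.0))          # {'A': 28.0, 'B': 32.0}
```

Total link occupancy is identical in both cases; only the per-flow completion times differ, which is the effect the FIFO wire does not capture.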
D3. Ignored (out of scope)
- Bank-level row buffer conflict penalty (assume no conflicts — best case; round-robin chunk assignment is address-blind so we cannot detect same-bank reuse).
- HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state `burst_time = burst_bytes / pc_bw_gbs`).
- Refresh, ECC, thermal throttling, power gating.
- Clock domain crossings, PLL lock time.
- Upstream backpressure due to downstream buffer occupancy (input ports use unbounded `simpy.Store`).
- Sub-flit cycle-level arbitration at routers (flit granularity is our smallest unit).
D4. Workload sensitivity
Workloads where the above simplifications meaningfully affect results:
- Random scatter/gather: bank conflict ignored → model optimistic.
- Mixed R/W-intensive workloads (e.g., GEMM bias accumulation): HBM scheduler absent. With the default `switch_penalty_ns = 0` we assume ideal amortization; setting it non-zero models a pessimistic per-alternation cost.
- High concurrency (>10 active flows on one link): HoL blocking and VC limits not modeled → model optimistic.
- Very small (sub-flit) transactions: flit quantization noise.
- Concurrent multi-flow on a single wire: wire is serial FIFO at the flit level, so per-flow fairness within a single edge is not modeled. Pre-edge merging (multiple sources arriving at a router and being forwarded to the same downstream wire) is correctly modeled via the flit-aware router's serial worker.
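The mixed R/W sensitivity can be illustrated with a single pseudo-channel serving chunks in FIFO order. This is a minimal sketch, assuming the penalty is added once each time the direction differs from the previous chunk on that PC; the function name `pc_busy_ns` and the example values are hypothetical.

```python
def pc_busy_ns(directions, chunk_ns, switch_penalty_ns=0.0):
    """Total busy time of one pseudo-channel serving chunks in FIFO order."""
    busy, last = 0.0, None
    for direction in directions:
        if last is not None and direction != last:
            busy += switch_penalty_ns   # read<->write turnaround
        busy += chunk_ns
        last = direction
    return busy

dense = ["R", "W", "R", "W"]      # alternates on every chunk
grouped = ["R", "R", "W", "W"]    # what a reordering scheduler might produce

print(pc_busy_ns(dense, 16.0))          # 64.0 (default penalty of 0)
print(pc_busy_ns(dense, 16.0, 8.0))     # 88.0 (three switches)
print(pc_busy_ns(grouped, 16.0, 8.0))   # 72.0 (one switch)
```

Because the model is FIFO with no reordering, a non-zero penalty charges the dense pattern for every alternation, while the default of 0 behaves as if a scheduler had grouped the accesses perfectly.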
D5. Verification policy
For workloads in D4, cross-check against real HW or a cycle-accurate simulator before drawing absolute-magnitude conclusions. The model remains accurate for relative comparisons within the modeled regime.
D6. Future work
- Bank-level conflict modeling (opt-in via `track_banks: true`).
- HBM scheduler with write buffer + watermark drain (Tier 2 from the design discussion).
- Fluid wire model for multi-flow fairness on a single shared link (currently FIFO serial).
- Sub-flit (32B) granularity for cycle-accurate wire arbitration.
- Backpressure modeling for finite component buffers.
- `op_log` integration with chunk-streaming. Currently `op_log` fires on PE-internal command messages (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`), which are not chunkified; integration would require flit-aware components to also emit `op_log` start/end hooks per transaction.
Consequences
- Single review point for all model fidelity questions. Each future PR touching latency must update the relevant section here.
- Workload-specific magnitude error envelopes are explicit.
- Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs` enforces the ADR-0019 D9 invariant in code rather than relying on yaml manual consistency.
- Wire transfer time is charged once per bottleneck-link transit (Phase 2c per-flit timing) rather than via terminal `drain_ns` injection. Single transactions land at `drain + commit_time + small_overheads`; multi-hop preserves wormhole pipelining; multi-stream merge correctly serializes at the shared wire's FIFO.
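The builder-side derivation can be sketched as follows. The function name `derive_pc_bw_gbs`, the dict-based config, and the choice to reject a hand-written `pc_bw_gbs` that disagrees are assumptions for illustration; only the formula itself and the `burst_time` expression from D3 come from this ADR.

```python
def derive_pc_bw_gbs(cfg):
    """Derive per-PC bandwidth from the HBM-to-router link bandwidth.

    Hypothetical helper: a manually supplied pc_bw_gbs that disagrees
    with the derived value is treated as a configuration error.
    """
    num_pcs = cfg["num_pcs"]
    if num_pcs <= 0:
        raise ValueError("num_pcs must be positive")
    derived = cfg["hbm_to_router_bw_gbs"] / num_pcs
    manual = cfg.get("pc_bw_gbs")
    if manual is not None and manual != derived:
        raise ValueError(f"pc_bw_gbs={manual} conflicts with derived {derived}")
    return derived

cfg = {"hbm_to_router_bw_gbs": 256.0, "num_pcs": 8, "burst_bytes": 256}
pc_bw_gbs = derive_pc_bw_gbs(cfg)
burst_time_ns = cfg["burst_bytes"] / pc_bw_gbs   # 1 GB/s == 1 byte/ns
print(pc_bw_gbs)       # 32.0
print(burst_time_ns)   # 8.0
```

Deriving the value in one place means the aggregate PC bandwidth cannot drift from the link bandwidth when only one of the two is edited.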
Cross-references
- ADR-0015 — component / port / wire model.
- ADR-0019 — NoC and local HBM topology.
- ADR-0004 — memory semantics, local HBM.