kernbench2/docs/adr-ko/ADR-0033-lat-latency-model-assumptions.md
ywkang a796c1d2f7 ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/
Establish English as the canonical ADR language with Korean translations
held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror).
Promotion from adr-proposed/ to adr/ now writes English to adr/ and the
Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.

- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English,
  2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix
  dropped). ADR-0023 EN regenerated against KO source which had newer
  HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark
  docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline
  section covering bidirectional sync, conflict resolution (EN wins),
  and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair
  completeness, filename mirroring, ADR-ID match, Status byte-equality.
  Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF
  normalization, em-dash title separator, underscore-slug edge case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:38:44 -07:00

ADR-0033 — Latency Model: Assumptions and Known Simplifications

Status

Accepted

Context

The simulator is an analytical, event-driven performance model — not a cycle-accurate or RTL-level simulator. Many real-HW effects are approximated or omitted by design. To keep the model auditable and reviewable as a whole, this ADR consolidates the assumptions in one place. Individual component ADRs (ADR-0015, ADR-0017, ADR-0004) define the mechanisms; this document defines the limits of fidelity.

Decisions

D1. Modeled precisely

  • Per-directed-edge BW occupancy (FIFO serialization via available_at) — ADR-0015 D2.
  • Per-component switching/overhead latency (overhead_ns attr).
  • HBM per-pseudo-channel parallelism via stateless pc_avail[N] array with address-based PC selection (ADR-0034 D3). Burst granularity tunable (burst_bytes, default 256B). Read and write share each PC's available_at (real HW command bus is per-PC shared).
  • HBM direction switching penalty mechanism: per-PC last-direction tracking + configurable switch_penalty_ns. Default 0 — see D2.
  • Wire chunk-streaming (Phase 2c): each wire decomposes Transactions with payload into Flit objects of flit_bytes (default = HBM burst_bytes = 256B). The wire emits each flit individually after prop_ns + flit_nbytes/bw_gbs so the link's bandwidth throttles flit arrival rate per real-HW wormhole semantics.
  • Separate Stores per directed edge (Phase 2c key fix): the wire is the only conduit between src.out_ports[dst] and dst.in_ports[src]. Earlier the two were aliased to the same simpy.Store; when the wire put a chunkified flit back, the destination's fan_in could pull it before the wire applied bandwidth delay, leaving half the flits bypassing the bottleneck.
  • Flit-aware pass-through (TransitComponent, HbmCtrlComponent): forward each flit serially with per-transaction overhead applied ONCE on the first-flit arrival (header decode model). Subsequent flits pipeline through with no extra delay. Wormhole emerges naturally across multi-hop paths.
  • HBM CTRL per-flit PC commit: each flit arriving at HBM CTRL schedules a PC commit at max(env.now, pc_avail[pc]) + chunk_time, with the is_last flit waiting for the last PC commit before signaling txn.done.
  • Non-flit-aware components (default) reassemble flits at _fan_in before the legacy _forward_txn path runs. This preserves backward compatibility for components that have not yet been migrated to flit-aware processing (e.g., MCpuComponent, IoCpuComponent sub-txn generators). Such components reassemble once per leg boundary, NOT per hop — multi-hop wormhole timing through a chain of flit-aware routers is preserved.
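The HBM CTRL per-flit PC commit rule listed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the simulator's code: the function name `commit_flits`, the address-to-PC mapping `(addr // burst_bytes) % num_pcs`, and the example numbers are all assumptions. Only the `max(now, pc_avail[pc]) + chunk_time` rule and the "last commit gates txn.done" behavior come from the text. Time is in ns, relying on 1 GB/s being 1 byte/ns.

```python
from typing import List, Tuple


def commit_flits(
    arrivals: List[Tuple[float, int]],  # (arrival_ns, address) per flit, in arrival order
    num_pcs: int,
    pc_bw_gbs: float,
    burst_bytes: int = 256,
) -> Tuple[List[float], float]:
    """Return per-flit PC commit times and the time txn.done would fire."""
    pc_avail = [0.0] * num_pcs            # next free time of each pseudo-channel
    chunk_time = burst_bytes / pc_bw_gbs  # ns per burst (1 GB/s == 1 byte/ns)
    commits = []
    for now, addr in arrivals:
        # Hypothetical interleaving: consecutive bursts rotate across PCs.
        pc = (addr // burst_bytes) % num_pcs
        done = max(now, pc_avail[pc]) + chunk_time
        pc_avail[pc] = done
        commits.append(done)
    # The is_last flit waits for the latest outstanding PC commit.
    txn_done = max(commits)
    return commits, txn_done


# Four 256B flits arriving 8 ns apart, two PCs at 16 GB/s each (made-up values).
commits, txn_done = commit_flits(
    [(0.0, 0), (8.0, 256), (16.0, 512), (24.0, 768)],
    num_pcs=2,
    pc_bw_gbs=16.0,
)
print(commits, txn_done)  # [16.0, 24.0, 32.0, 40.0] 40.0
```

With `num_pcs=1` the same flits would all queue on one `pc_avail` slot and finish at 64 ns, which is the per-PC parallelism effect the model is meant to capture.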

D2. Approximated (with known directional error)

| Effect | Real HW | Our model | Error direction |
| --- | --- | --- | --- |
| Router output port arbitration | Round-robin / weighted | Wire edge FIFO + serial worker | Fair when one txn per cycle; multi-stream sharing not modeled at flit level |
| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Pessimistic for mixed R/W when alternations are dense; default switch_penalty_ns = 0 assumes ideal scheduler amortizes |
| Flit ↔ burst granularity | 32B flit < 256B burst | flit_bytes = burst_bytes = 256B | Sub-flit fine-grained timing noise; affects very small wire arbitration windows only |
| Wire-level RR fairness | Per-cycle multi-flow arbitration on shared link | Single serial wire process per edge | Fair only when one transaction is in flight on a given edge at a time. Multi-stream concurrent traffic on the same edge serializes by FIFO order |
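To make the HBM scheduler row concrete, the following sketch charges a direction-switch penalty on a single pseudo-channel served in strict FIFO order. It is a toy under stated assumptions: the function name, the string encoding of operations, and the 16 ns / 10 ns values are hypothetical. The grouped sequence stands in for what a reordering scheduler might issue; the model itself does not reorder.

```python
def fifo_pc_finish_ns(ops: str, burst_time_ns: float, switch_penalty_ns: float = 0.0) -> float:
    """Finish time of back-to-back bursts on one PC, served in FIFO order.

    ops is a string of 'R' / 'W' characters, one per burst.
    """
    t = 0.0
    last_dir = None  # per-PC last-direction tracking
    for direction in ops:
        if last_dir is not None and direction != last_dir:
            t += switch_penalty_ns  # pay the penalty on every R<->W alternation
        t += burst_time_ns
        last_dir = direction
    return t


dense = fifo_pc_finish_ns("RWRWRWRW", 16.0, 10.0)    # 7 switches
grouped = fifo_pc_finish_ns("RRRRWWWW", 16.0, 10.0)  # 1 switch
ideal = fifo_pc_finish_ns("RWRWRWRW", 16.0)          # default penalty of 0
print(dense, grouped, ideal)  # 198.0 138.0 128.0
```

The gap between the dense and grouped cases is the pessimism the table refers to when a non-zero penalty is combined with FIFO ordering; the default of 0 removes it entirely.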

D3. Ignored (out of scope)

  • Bank-level row buffer conflict penalty (assume no conflicts — best case; the model has no per-bank state within a PC, so same-bank reuse cannot be detected).
  • HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state burst_time = burst_bytes / pc_bw_gbs).
  • Refresh, ECC, thermal throttling, power gating.
  • Clock domain crossings, PLL lock time.
  • Upstream backpressure due to downstream buffer occupancy (input ports use unbounded simpy.Store).
  • Sub-flit cycle-level arbitration at routers (flit granularity is our smallest unit).

D4. Workload sensitivity

Workloads where the above simplifications meaningfully affect results:

  • Random scatter/gather: bank conflict ignored → model optimistic.
  • Mixed R/W-intensive workloads (e.g., GEMM bias accumulation): no HBM scheduler is modeled. With the default switch_penalty_ns = 0 we assume ideal amortization; setting it non-zero models a pessimistic per-alternation cost.
  • High concurrency (>10 active flows on one link): HoL blocking and VC limits not modeled → model optimistic.
  • Very small (sub-flit) transactions: flit quantization noise.
  • Concurrent multi-flow on a single wire: wire is serial FIFO at the flit level, so per-flow fairness within a single edge is not modeled. Pre-edge merging (multiple sources arriving at a router and being forwarded to the same downstream wire) is correctly modeled via the flit-aware router's serial worker.
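The single-wire FIFO behavior in the last bullet can be illustrated with a small stdlib-only sketch. The function name, the tuple layout, and the numbers are assumptions made for illustration, and propagation delay is left out to keep the serialization effect visible.

```python
from typing import Dict, List, Tuple


def serial_wire(flits: List[Tuple[float, str, int]], bw_gbs: float) -> Dict[str, float]:
    """Serve flits on one edge in enqueue order.

    flits: (enqueue_ns, flow, nbytes). Returns the departure time of the
    last flit of each flow. sorted() is stable, so ties keep list order.
    """
    t = 0.0
    done: Dict[str, float] = {}
    for enqueue_ns, flow, nbytes in sorted(flits, key=lambda f: f[0]):
        t = max(t, enqueue_ns) + nbytes / bw_gbs  # 1 GB/s == 1 byte/ns
        done[flow] = t
    return done


# Flow A queues four 256B flits at t=0; flow B queues one flit at t=1 ns.
flits = [(0.0, "A", 256)] * 4 + [(1.0, "B", 256)]
result = serial_wire(flits, bw_gbs=32.0)
print(result)  # {'A': 32.0, 'B': 40.0}
```

Flow B's single flit waits behind all of A's queued flits. A per-cycle round-robin arbiter on real HW would likely let B through earlier, which is the per-flow fairness the model does not capture on a single edge.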

D5. Verification policy

For workloads in D4, cross-check against real HW or a cycle-accurate simulator before drawing absolute-magnitude conclusions. The model remains accurate for relative comparisons within the modeled regime.

D6. Future work

Note: multi-stream merging at routers IS modeled correctly — each in_port has its own fan_in process, all push to a shared inbox, and the router worker forwards in inbox FIFO order. Flits from different upstream streams naturally interleave at flit granularity. The items below are different concerns, ordered by expected workload impact.
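A rough sketch of the merge behavior described in this note, using only the standard library: each upstream stream delivers flits paced by its own wire, the shared inbox is approximated by merging arrivals in time order, and a single worker forwards them serially. The helper names and the 8 ns flit time are hypothetical, and ties are broken by flow name here rather than by any documented rule.

```python
import heapq
from typing import List, Tuple


def upstream_arrivals(flow: str, n_flits: int, start_ns: float, flit_ns: float):
    """Arrival times at the router for flits paced by an upstream wire."""
    return [(start_ns + (i + 1) * flit_ns, flow, i) for i in range(n_flits)]


def router_forward(streams, out_flit_ns: float) -> List[Tuple[str, int, float]]:
    """Forward flits in shared-inbox (arrival) order through one serial worker."""
    t = 0.0
    out = []
    for arrive_ns, flow, idx in heapq.merge(*streams):
        t = max(t, arrive_ns) + out_flit_ns
        out.append((flow, idx, t))
    return out


a = upstream_arrivals("A", 3, start_ns=0.0, flit_ns=8.0)  # arrives at 8, 16, 24
b = upstream_arrivals("B", 3, start_ns=2.0, flit_ns=8.0)  # arrives at 10, 18, 26
forwarded = router_forward([a, b], out_flit_ns=8.0)
print([flow for flow, _, _ in forwarded])  # ['A', 'B', 'A', 'B', 'A', 'B']
print(forwarded[-1][2])                    # 56.0
```

Because upstream wires pace the flits, the two streams interleave at flit granularity even though the worker is a plain FIFO.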

Higher impact (workload accuracy gap):

  • Bank-level conflict modeling within a PC (opt-in via track_banks: true). Currently we assume no same-bank reuse; random scatter/gather workloads are optimistic here.
  • HBM scheduler with write buffer + watermark drain (Tier 2 from the design discussion). Default switch_penalty_ns=0 is the ideal-amortization stand-in; bursty mixed R/W workloads benefit from explicit modeling.
  • Backpressure modeling for finite component buffers. Matters at high concurrency / sustained saturation where buffer occupancy causes upstream stalls.
  • Op_log integration with chunk-streaming: currently op_log fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd, GemmCmd, MathCmd) which are not chunkified. Integration would require flit-aware components to also emit op_log start/end hooks per transaction (start on first flit, end on is_last).
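One possible shape for the per-transaction op_log hooks mentioned in the last bullet is sketched below. Nothing here reflects an existing interface: the class name `OpLogSketch`, the `on_flit` method, and the record layout are invented for illustration of "start on first flit, end on is_last".

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class OpLogSketch:
    """Hypothetical per-transaction log fed by a flit-aware component."""
    records: Dict[int, Dict[str, Optional[float]]] = field(default_factory=dict)

    def on_flit(self, txn_id: int, flit_index: int, is_last: bool, now_ns: float) -> None:
        rec = self.records.setdefault(txn_id, {"start_ns": None, "end_ns": None})
        if flit_index == 0:
            rec["start_ns"] = now_ns  # start hook: first flit seen
        if is_last:
            rec["end_ns"] = now_ns    # end hook: last flit seen


log = OpLogSketch()
for i, t in enumerate([10.0, 18.0, 26.0]):
    log.on_flit(txn_id=7, flit_index=i, is_last=(i == 2), now_ns=t)
print(log.records[7])  # {'start_ns': 10.0, 'end_ns': 26.0}
```

A single-flit transaction would hit both branches on the same call, so its start and end times coincide at the component where the hook runs.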

Lower impact (academic / specific use cases):

  • Cycle-accurate router arbitration policies (RR with priorities, age, iSLIP). The FIFO inbox is already approximately fair when flit arrival times differ slightly between streams (the common case for similar-rate workloads). True impact appears only for: (a) priority/QoS modeling, (b) per-stream tail latency analysis under sustained saturation. Not critical for makespan or average-latency studies.
  • Sub-flit (32B) granularity for finer wire arbitration cycles. Our flit_bytes equals burst (256B); real HW arbitrates per 32B flit. Effect is small for most workloads (sub-flit timing noise on small messages).

Consequences

  • Single review point for all model fidelity questions. Each future PR touching latency must update the relevant section here.
  • Workload-specific magnitude error envelopes are explicit.
  • Builder-side derivation of pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs enforces the ADR-0017 D8 invariant in code rather than relying on manually maintained consistency in the YAML config.
  • Wire transfer time is charged once per bottleneck-link transit (Phase 2c per-flit timing) rather than via terminal drain_ns injection. Single transactions land at drain + commit_time + small_overheads; multi-hop preserves wormhole pipelining; multi-stream merge correctly serializes at the shared wire's FIFO.
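The builder-side derivation of pc_bw_gbs mentioned above, together with the steady-state burst_time from D3, amounts to two divisions. The sketch below uses hypothetical function names and made-up configuration values; it only shows that deriving the per-PC bandwidth makes the per-PC sum match the link bandwidth by construction.

```python
def derive_pc_bw_gbs(hbm_to_router_bw_gbs: float, num_pcs: int) -> float:
    """Per-pseudo-channel bandwidth derived from the link bandwidth."""
    if num_pcs <= 0:
        raise ValueError("num_pcs must be positive")
    return hbm_to_router_bw_gbs / num_pcs


def burst_time_ns(burst_bytes: int, pc_bw_gbs: float) -> float:
    """Steady-state time for one burst on one PC (1 GB/s == 1 byte/ns)."""
    return burst_bytes / pc_bw_gbs


# Made-up example configuration, not taken from any real topology file.
cfg = {"hbm_to_router_bw_gbs": 256.0, "num_pcs": 16, "burst_bytes": 256}
pc_bw = derive_pc_bw_gbs(cfg["hbm_to_router_bw_gbs"], cfg["num_pcs"])
bt = burst_time_ns(cfg["burst_bytes"], pc_bw)
print(pc_bw, bt)  # 16.0 16.0
assert pc_bw * cfg["num_pcs"] == cfg["hbm_to_router_bw_gbs"]
```

Deriving the value instead of reading it from configuration means the two numbers cannot drift apart when one of them is edited.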

Cross-references

  • ADR-0015 — component / port / wire model.
  • ADR-0017 — Cube NOC architecture and HBM connectivity.
  • ADR-0004 — memory semantics, local HBM.
  • ADR-0034 — HBM controller internal design.