kernbench2/docs/adr/ADR-0033-latency-model-assumptions.md

# ADR-0033 — Latency Model: Assumptions and Known Simplifications

## Status

Accepted

## Context

The simulator is an analytical, event-driven performance model — not a
cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
or omitted by design. To keep the model auditable and reviewable as a whole,
this ADR consolidates the assumptions in one place. Individual component ADRs
(ADR-0015, ADR-0019, ADR-0004) define the *mechanisms*; this document defines
the *limits of fidelity*.

## Decisions

### D1. Modeled precisely

- **Per-directed-edge BW occupancy** (FIFO serialization via `available_at`) —
  ADR-0015 D2.
- **Per-component switching/overhead latency** (`overhead_ns` attr).
- **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array
  with global round-robin chunking. Burst granularity tunable
  (`burst_bytes`, default 256B). Reads and writes share each PC's
  `available_at`, since on real HW the command bus is shared per PC.
- **HBM direction switching penalty mechanism**: per-PC last-direction
  tracking + configurable `switch_penalty_ns`. Default 0 — see D2.
- **Wire chunk-streaming (Phase 2c)**: each wire decomposes Transactions
  with payload into `Flit` objects of `flit_bytes` (default = HBM
  `burst_bytes` = 256B). The wire emits each flit individually after
  `prop_ns + flit_nbytes/bw_gbs` so the link's bandwidth throttles
  flit arrival rate per real-HW wormhole semantics.
- **Separate Stores per directed edge** (Phase 2c key fix): the wire
  is the *only* conduit between `src.out_ports[dst]` and
  `dst.in_ports[src]`. Earlier the two were aliased to the same
  `simpy.Store`; when the wire put a chunkified flit back, the
  destination's `fan_in` could pull it before the wire applied
  bandwidth delay, leaving half the flits bypassing the bottleneck.
- **Flit-aware pass-through** (`TransitComponent`, `HbmCtrlComponent`):
  forward each flit serially with per-transaction overhead applied
  ONCE on the first-flit arrival (header decode model). Subsequent
  flits pipeline through with no extra delay. Wormhole emerges
  naturally across multi-hop paths.
- **HBM CTRL per-flit PC commit**: each flit arriving at HBM CTRL
  schedules a PC commit at `max(env.now, pc_avail[pc]) + chunk_time`,
  with the `is_last` flit waiting for the last PC commit before
  signaling `txn.done`.
- **Non-flit-aware components (default) reassemble flits at
  `_fan_in`** before the legacy `_forward_txn` path runs. This
  preserves backward compatibility for components that have not yet
  been migrated to flit-aware processing (e.g., `MCpuComponent`,
  `IoCpuComponent` sub-txn generators). Such components reassemble
  *once per leg boundary*, NOT per hop — multi-hop wormhole timing
  through a chain of flit-aware routers is preserved.
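
The per-flit PC commit rule above can be illustrated with a minimal sketch. It is not the simulator's code: the class and function names (`HbmPcModel`, `commit_flit`, `txn_done_time`) are hypothetical, SimPy is left out so the arithmetic stands alone, and bandwidth is assumed to be in GB/s with 1 GB = 1e9 bytes, so that `bytes / bw_gbs` yields nanoseconds.

```python
class HbmPcModel:
    """Toy per-pseudo-channel availability model (illustrative names only)."""

    def __init__(self, num_pcs: int, pc_bw_gbs: float):
        self.pc_avail = [0.0] * num_pcs  # next free time of each PC, in ns
        self.pc_bw_gbs = pc_bw_gbs
        self._next_pc = 0                # global round-robin cursor

    def commit_flit(self, now_ns: float, nbytes: int) -> float:
        """Assign one flit to the next PC and return its commit time."""
        pc = self._next_pc
        self._next_pc = (self._next_pc + 1) % len(self.pc_avail)
        chunk_time = nbytes / self.pc_bw_gbs  # bytes / (GB/s) == ns
        done = max(now_ns, self.pc_avail[pc]) + chunk_time
        self.pc_avail[pc] = done
        return done


def txn_done_time(model, flit_arrivals_ns, flit_bytes=256):
    """Time at which the transaction completes: the latest PC commit."""
    return max(model.commit_flit(t, flit_bytes) for t in flit_arrivals_ns)


# 4 PCs at 32 GB/s each: chunk_time = 256 / 32 = 8 ns.
# Five flits arriving every 2 ns commit at 8, 10, 12, 14 and 16 ns;
# the fifth wraps around to PC 0 and waits for its earlier commit.
model = HbmPcModel(num_pcs=4, pc_bw_gbs=32.0)
print(txn_done_time(model, [0.0, 2.0, 4.0, 6.0, 8.0]))  # 16.0
```

In this sketch the transaction finishes when the slowest PC commit finishes, which is the condition the `is_last` flit waits on before `txn.done` is signaled.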

### D2. Approximated (with known directional error)

| Effect | Real HW | Our model | Error direction |
|--------|---------|-----------|----------------|
| Router output port arbitration | Round-robin / weighted | Wire edge FIFO + serial worker | Fair when one txn per cycle; multi-stream sharing not modeled at flit level |
| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Default `switch_penalty_ns = 0` assumes an ideal scheduler fully amortizes direction switches; a non-zero value charges every alternation and is pessimistic for dense mixed R/W |
| Flit ↔ burst granularity | 32B flit < 256B burst | `flit_bytes = burst_bytes = 256B` | Sub-flit fine-grained timing noise; affects very small wire arbitration windows only |
| Wire-level RR fairness | Per-cycle multi-flow arbitration on shared link | Single serial wire process per edge | Fair only when one transaction is in flight on a given edge at a time. Multi-stream concurrent traffic on the same edge serializes by FIFO order |
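
The effect of `switch_penalty_ns` under FIFO ordering can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, and charging the penalty before the chunk that flips direction is one plausible reading of the mechanism, not a statement of where the simulator applies it.

```python
def commit_with_direction(pc_avail, last_dir, pc, now_ns, chunk_time_ns,
                          direction, switch_penalty_ns=0.0):
    """Commit one chunk on a PC, adding a penalty when direction flips.

    pc_avail: next-free time per PC in ns (mutated in place).
    last_dir: last direction per PC, "R", "W" or None (mutated in place).
    """
    penalty = 0.0
    if last_dir[pc] is not None and last_dir[pc] != direction:
        penalty = switch_penalty_ns
    done = max(now_ns, pc_avail[pc]) + penalty + chunk_time_ns
    pc_avail[pc] = done
    last_dir[pc] = direction
    return done


def run_sequence(directions, chunk_time_ns, switch_penalty_ns):
    """Finish time of a back-to-back FIFO sequence on a single PC."""
    pc_avail, last_dir = [0.0], [None]
    done = 0.0
    for d in directions:
        done = commit_with_direction(pc_avail, last_dir, 0, 0.0,
                                     chunk_time_ns, d, switch_penalty_ns)
    return done


print(run_sequence("RWRW", 8.0, 0.0))  # 32.0: default, switches are free
print(run_sequence("RWRW", 8.0, 5.0))  # 47.0: three switches, 5 ns each
print(run_sequence("RRWW", 8.0, 5.0))  # 37.0: same work, grouped by direction
```

The gap between the last two results is the cost that a reordering scheduler would recover and that a FIFO model with a non-zero penalty cannot.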

### D3. Ignored (out of scope)

- Bank-level row buffer conflict penalty (assume no conflicts — best case;
  round-robin chunk assignment is address-blind so we cannot detect same-bank
  reuse).
- HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
  `burst_time = burst_bytes / pc_bw_gbs`).
- Refresh, ECC, thermal throttling, power gating.
- Clock domain crossings, PLL lock time.
- Upstream backpressure due to downstream buffer occupancy (input ports use
  unbounded `simpy.Store`).
- Sub-flit cycle-level arbitration at routers (flit granularity is our
  smallest unit).

### D4. Workload sensitivity

Workloads where the above simplifications meaningfully affect results:

- **Random scatter/gather**: bank conflict ignored → model optimistic.
- **Heavy mixed R/W traffic** (e.g., GEMM bias accumulation): HBM scheduler
  absent. The default `switch_penalty_ns = 0` assumes ideal amortization;
  a non-zero value models a pessimistic per-alternation cost.
- **High concurrency (>10 active flows on one link)**: HoL blocking and VC
  limits not modeled → model optimistic.
- **Very small (sub-flit) transactions**: flit quantization noise.
- **Concurrent multi-flow on a single wire**: wire is serial FIFO at the
  flit level, so per-flow fairness within a single edge is not modeled.
  Pre-edge merging (multiple sources arriving at a router and being
  forwarded to the same downstream wire) is correctly modeled via the
  flit-aware router's serial worker.
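
The FIFO behavior of a single edge, and how it differs from per-flit fair arbitration, can be shown with a small sketch. All names here are hypothetical; `EdgeOccupancy` mirrors the `available_at` idea from D1, and the round-robin function is an idealized reference for comparison, not something the simulator implements.

```python
class EdgeOccupancy:
    """Per-directed-edge bandwidth occupancy with FIFO serialization."""

    def __init__(self, bw_gbs: float):
        self.bw_gbs = bw_gbs
        self.available_at = 0.0  # ns at which the edge is next free

    def send(self, now_ns: float, nbytes: int) -> float:
        start = max(now_ns, self.available_at)
        self.available_at = start + nbytes / self.bw_gbs  # bytes/(GB/s) == ns
        return self.available_at


def fifo_finish_times(bw_gbs, flows, flit_bytes=256):
    """flows: list of (name, n_flits), all enqueued at t=0 in FIFO order."""
    edge = EdgeOccupancy(bw_gbs)
    finish = {}
    for name, n_flits in flows:
        for _ in range(n_flits):
            finish[name] = edge.send(0.0, flit_bytes)
    return finish


def round_robin_finish_times(bw_gbs, flows, flit_bytes=256):
    """Idealized per-flit round-robin across flows, for comparison only."""
    remaining = {name: n for name, n in flows}
    finish, t = {}, 0.0
    flit_time = flit_bytes / bw_gbs
    while any(remaining.values()):
        for name, _ in flows:
            if remaining[name] > 0:
                t += flit_time
                remaining[name] -= 1
                if remaining[name] == 0:
                    finish[name] = t
    return finish


flows = [("A", 8), ("B", 1)]  # 64 GB/s link: 4 ns per 256B flit
print(fifo_finish_times(64.0, flows))         # {'A': 32.0, 'B': 36.0}
print(round_robin_finish_times(64.0, flows))  # {'B': 8.0, 'A': 36.0}
```

Total link occupancy is the same in both cases (36 ns); only the per-flow completion times differ, which is why aggregate throughput remains trustworthy while per-flow latency on a shared edge does not.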

### D5. Verification policy

For workloads in D4, cross-check against real HW or a cycle-accurate
simulator before drawing absolute-magnitude conclusions. The model remains
accurate for **relative comparisons** within the modeled regime.

### D6. Future work

- [ ] Bank-level conflict modeling (opt-in via `track_banks: true`).
- [ ] HBM scheduler with write buffer + watermark drain (Tier 2 from the
  design discussion).
- [ ] Fluid wire model for multi-flow fairness on a single shared link
  (currently FIFO serial).
- [ ] Sub-flit (32B) granularity for cycle-accurate wire arbitration.
- [ ] Backpressure modeling for finite component buffers.
- [ ] Op_log integration with chunk-streaming (currently op_log fires on
  PE-internal command messages — DmaReadCmd, DmaWriteCmd, GemmCmd,
  MathCmd — which are not chunkified; integration would require
  flit-aware components to also emit op_log start/end hooks per
  transaction).

## Consequences

- Single review point for all model fidelity questions. Each future PR
  touching latency must update the relevant section here.
- Workload-specific error directions are explicit (D2, D4).
- Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
  enforces the ADR-0019 D9 invariant in code rather than relying on
  manually consistent YAML values.
- Wire transfer time is charged once per bottleneck-link transit (Phase 2c
  per-flit timing) rather than via terminal `drain_ns` injection. Single
  transactions land at `drain + commit_time + small_overheads`; multi-hop
  preserves wormhole pipelining; multi-stream merge correctly serializes
  at the shared wire's FIFO.
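
A minimal sketch of the builder-side derivation, together with the steady-state `burst_time` from D3. The function names are hypothetical, and the check at the end assumes the invariant means that aggregate PC bandwidth equals the HBM-to-router link bandwidth; ADR-0019 D9 remains the authoritative statement.

```python
def derive_pc_bw_gbs(hbm_to_router_bw_gbs: float, num_pcs: int) -> float:
    """Per-PC bandwidth derived from the link bandwidth, never set by hand."""
    if num_pcs <= 0:
        raise ValueError("num_pcs must be positive")
    return hbm_to_router_bw_gbs / num_pcs


def burst_time_ns(burst_bytes: int, pc_bw_gbs: float) -> float:
    """Steady-state time for one burst on one PC; bytes / (GB/s) == ns."""
    return burst_bytes / pc_bw_gbs


link_bw = 256.0  # GB/s, example value only
num_pcs = 8
pc_bw = derive_pc_bw_gbs(link_bw, num_pcs)  # 32.0 GB/s per PC
t_burst = burst_time_ns(256, pc_bw)         # 8.0 ns per 256B burst

# With all PCs busy, num_pcs bursts complete every t_burst ns, so the
# aggregate rate in bytes/ns (== GB/s) matches the link bandwidth.
assert num_pcs * 256 / t_burst == link_bw
```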

## Cross-references

- ADR-0015 — component / port / wire model.
- ADR-0019 — NoC and local HBM topology.
- ADR-0004 — memory semantics, local HBM.