Files
kernbench2/docs/adr/ADR-0033-lat-latency-model-assumptions.md
T
ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00

163 lines
8.3 KiB
Markdown

# ADR-0033 — Latency Model: Assumptions and Known Simplifications
## Status
Accepted
## Context
The simulator is an analytical, event-driven performance model — not a
cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
or omitted by design. To keep the model auditable and reviewable as a whole,
this ADR consolidates the assumptions in one place. Individual component ADRs
(ADR-0015, ADR-0017, ADR-0004) define the *mechanisms*; this document defines
the *limits of fidelity*.
## Decisions
### D1. Modeled precisely
- **Per-directed-edge BW occupancy** (FIFO serialization via `available_at`) —
ADR-0015 D2.
- **Per-component switching/overhead latency** (`overhead_ns` attr).
- **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array
with address-based PC selection (ADR-0034 D3). Burst granularity tunable
(`burst_bytes`, default 256B). Read and write share each PC's
`available_at` (real HW command bus is per-PC shared).
- **HBM direction switching penalty mechanism**: per-PC last-direction
tracking + configurable `switch_penalty_ns`. Default 0 — see D2.
- **Wire chunk-streaming (Phase 2c)**: each wire decomposes Transactions
with payload into `Flit` objects of `flit_bytes` (default = HBM
`burst_bytes` = 256B). The wire emits each flit individually after
`prop_ns + flit_nbytes/bw_gbs` so the link's bandwidth throttles
flit arrival rate per real-HW wormhole semantics.
- **Separate Stores per directed edge** (Phase 2c key fix): the wire
is the *only* conduit between `src.out_ports[dst]` and
`dst.in_ports[src]`. Earlier the two were aliased to the same
`simpy.Store`; when the wire put a chunkified flit back, the
destination's `fan_in` could pull it before the wire applied
bandwidth delay, leaving half the flits bypassing the bottleneck.
- **Flit-aware pass-through** (`TransitComponent`, `HbmCtrlComponent`):
forward each flit serially with per-transaction overhead applied
ONCE on the first-flit arrival (header decode model). Subsequent
flits pipeline through with no extra delay. Wormhole emerges
naturally across multi-hop paths.
- **HBM CTRL per-flit PC commit**: each flit arriving at HBM CTRL
schedules a PC commit at `max(env.now, pc_avail[pc]) + chunk_time`,
with the `is_last` flit waiting for the last PC commit before
signaling `txn.done`.
- **Non-flit-aware components (default) reassemble flits at
``_fan_in``** before the legacy `_forward_txn` path runs. This
preserves backward compatibility for components that have not yet
been migrated to flit-aware processing (e.g., `MCpuComponent`,
`IoCpuComponent` sub-txn generators). Such components reassemble
*once per leg boundary*, NOT per hop — multi-hop wormhole timing
through a chain of flit-aware routers is preserved.
### D2. Approximated (with known directional error)
| Effect | Real HW | Our model | Error direction |
|--------|---------|-----------|----------------|
| Router output port arbitration | Round-robin / weighted | Wire edge FIFO + serial worker | Fair when one txn per cycle; multi-stream sharing not modeled at flit level |
| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Pessimistic for mixed R/W when alternations are dense — default `switch_penalty_ns = 0` assumes ideal scheduler amortizes |
| Flit ↔ burst granularity | 32B flit < 256B burst | `flit_bytes = burst_bytes = 256B` | Sub-flit fine-grained timing noise; affects very small wire arbitration windows only |
| Wire-level RR fairness | Per-cycle multi-flow arbitration on shared link | Single serial wire process per edge | Fair only when one transaction is in flight on a given edge at a time. Multi-stream concurrent traffic on the same edge serializes by FIFO order |
### D3. Ignored (out of scope)
- Bank-level row buffer conflict penalty (assume no conflicts — best case;
the model has no per-bank state within a PC, so same-bank reuse cannot be
detected).
- HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
`burst_time = burst_bytes / pc_bw_gbs`).
- Refresh, ECC, thermal throttling, power gating.
- Clock domain crossings, PLL lock time.
- Upstream backpressure due to downstream buffer occupancy (input ports use
unbounded `simpy.Store`).
- Sub-flit cycle-level arbitration at routers (flit granularity is our
smallest unit).
### D4. Workload sensitivity
Workloads where the above simplifications meaningfully affect results:
- **Random scatter/gather**: bank conflict ignored → model optimistic.
- **Heavy mixed R/W intensive** (e.g., GEMM bias accumulation): HBM scheduler
absent. With default `switch_penalty_ns = 0` we assume ideal amortization;
setting it non-zero models pessimistic per-alternation cost.
- **High concurrency (>10 active flows on one link)**: HoL blocking and VC
limits not modeled → model optimistic.
- **Very small (sub-flit) transactions**: flit quantization noise.
- **Concurrent multi-flow on a single wire**: wire is serial FIFO at the
flit level, so per-flow fairness within a single edge is not modeled.
Pre-edge merging (multiple sources arriving at a router and being
forwarded to the same downstream wire) is correctly modeled via the
flit-aware router's serial worker.
### D5. Verification policy
For workloads in D4, cross-check against real HW or a cycle-accurate
simulator before drawing absolute-magnitude conclusions. The model remains
accurate for **relative comparisons** within the modeled regime.
### D6. Future work
Note: multi-stream merging at routers IS modeled correctly — each
in_port has its own fan_in process, all push to a shared inbox, and
the router worker forwards in inbox FIFO order. Flits from different
upstream streams naturally interleave at flit granularity. The items
below are different concerns, ordered by expected workload impact.
**Higher impact (workload accuracy gap)**:
- [ ] **Bank-level conflict modeling** within a PC (opt-in via
`track_banks: true`). Currently we assume no same-bank reuse;
random scatter/gather workloads are optimistic here.
- [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
from the design discussion). Default `switch_penalty_ns=0` is the
ideal-amortization stand-in; bursty mixed R/W workloads benefit
from explicit modeling.
- [ ] **Backpressure** modeling for finite component buffers. Matters
at high concurrency / sustained saturation where buffer occupancy
causes upstream stalls.
- [ ] **Op_log integration with chunk-streaming**: currently op_log
fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
GemmCmd, MathCmd) which are not chunkified. Integration would
require flit-aware components to also emit op_log start/end hooks
per transaction (start on first flit, end on is_last).
**Lower impact (academic / specific use cases)**:
- [ ] **Cycle-accurate router arbitration policies** (RR with
priorities, age, iSLIP). The FIFO inbox is already approximately
fair when flit arrival times differ slightly between streams (the
common case for similar-rate workloads). True impact appears only
for: (a) priority/QoS modeling, (b) per-stream tail latency
analysis under sustained saturation. Not critical for makespan or
average-latency studies.
- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
per 32B flit. Effect is small for most workloads (sub-flit timing
noise on small messages).
## Consequences
- Single review point for all model fidelity questions. Each future PR
touching latency must update the relevant section here.
- Workload-specific magnitude error envelopes are explicit.
- Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
enforces the ADR-0017 D8 invariant in code rather than relying on yaml
manual consistency.
- Wire transfer time is charged once per bottleneck-link transit (Phase 2c
per-flit timing) rather than via terminal `drain_ns` injection. Single
transactions land at `drain + commit_time + small_overheads`; multi-hop
preserves wormhole pipelining; multi-stream merge correctly serializes
at the shared wire's FIFO.
## Cross-references
- ADR-0015 — component / port / wire model.
- ADR-0017 — Cube NOC architecture and HBM connectivity.
- ADR-0004 — memory semantics, local HBM.
- ADR-0034 — HBM controller internal design.