687c98086d
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
(dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
retroactive docs pending verification.
Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
deleted; ADR-0019/0021 moved to adr-history with one-line stub status
Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
selection, flit-aware per-flit commit, async finalize, command-only
fallback path)
Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
"Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
(now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)
Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py
Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.
Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
(ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
163 lines
8.3 KiB
Markdown
# ADR-0033 — Latency Model: Assumptions and Known Simplifications

## Status

Accepted

## Context

The simulator is an analytical, event-driven performance model, not a
cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
or omitted by design. To keep the model auditable and reviewable as a whole,
this ADR consolidates the assumptions in one place. Individual component ADRs
(ADR-0015, ADR-0017, ADR-0004) define the *mechanisms*; this document defines
the *limits of fidelity*.

## Decisions

### D1. Modeled precisely

- **Per-directed-edge BW occupancy** (FIFO serialization via `available_at`);
  see ADR-0015 D2.
- **Per-component switching/overhead latency** (`overhead_ns` attr).
- **HBM per-pseudo-channel (PC) parallelism** via stateless `pc_avail[N]` array
  with address-based PC selection (ADR-0034 D3). Burst granularity is tunable
  (`burst_bytes`, default 256B). Reads and writes share each PC's
  `available_at` (on real HW the command bus is shared per PC).
- **HBM direction switching penalty mechanism**: per-PC last-direction
  tracking plus a configurable `switch_penalty_ns`. Default 0 (see D2).
- **Wire chunk-streaming (Phase 2c)**: each wire decomposes Transactions
  with payload into `Flit` objects of `flit_bytes` (default = HBM
  `burst_bytes` = 256B). The wire emits each flit individually after
  `prop_ns + flit_nbytes/bw_gbs`, so the link's bandwidth throttles the
  flit arrival rate, matching real-HW wormhole semantics.
- **Separate Stores per directed edge** (Phase 2c key fix): the wire
  is the *only* conduit between `src.out_ports[dst]` and
  `dst.in_ports[src]`. Previously the two were aliased to the same
  `simpy.Store`; when the wire put a chunkified flit back, the
  destination's `fan_in` could pull it before the wire applied the
  bandwidth delay, so half the flits bypassed the bottleneck.
- **Flit-aware pass-through** (`TransitComponent`, `HbmCtrlComponent`):
  forward each flit serially, with the per-transaction overhead applied
  ONCE on first-flit arrival (header decode model). Subsequent
  flits pipeline through with no extra delay. Wormhole behavior emerges
  naturally across multi-hop paths.
- **HBM CTRL per-flit PC commit**: each flit arriving at HBM CTRL
  schedules a PC commit at `max(env.now, pc_avail[pc]) + chunk_time`,
  with the `is_last` flit waiting for the last PC commit before
  signaling `txn.done`.
- **Non-flit-aware components (default) reassemble flits at
  `_fan_in`** before the legacy `_forward_txn` path runs. This
  preserves backward compatibility for components that have not yet
  been migrated to flit-aware processing (e.g., `MCpuComponent`,
  `IoCpuComponent` sub-txn generators). Such components reassemble
  *once per leg boundary*, NOT per hop, so multi-hop wormhole timing
  through a chain of flit-aware routers is preserved.

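The wire chunk-streaming rule in D1 can be sketched in plain Python without SimPy. This is a minimal illustration, not the simulator's code: the `chunkify` and `emit_times_ns` helpers and the field layout of `Flit` are hypothetical, and the sketch assumes the serial wire process waits `prop_ns + flit_nbytes / bw_gbs` for each flit in turn, which is one reading of the description above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Flit:
    index: int
    nbytes: int
    is_last: bool


def chunkify(payload_bytes: int, flit_bytes: int = 256) -> list[Flit]:
    """Split a payload into flits; the final flit may be shorter."""
    if payload_bytes <= 0 or flit_bytes <= 0:
        raise ValueError("payload_bytes and flit_bytes must be positive")
    count = -(-payload_bytes // flit_bytes)  # ceiling division
    return [
        Flit(i, min(flit_bytes, payload_bytes - i * flit_bytes), i == count - 1)
        for i in range(count)
    ]


def emit_times_ns(flits: list[Flit], bw_gbs: float, prop_ns: float,
                  start_ns: float = 0.0) -> list[float]:
    """Emission time of each flit from a serial wire process.

    1 GB/s equals 1 byte/ns, so nbytes / bw_gbs is already in nanoseconds.
    Assumption: the wire waits prop_ns + nbytes / bw_gbs per flit, in order.
    """
    now = start_ns
    times = []
    for flit in flits:
        now += prop_ns + flit.nbytes / bw_gbs
        times.append(now)
    return times


flits = chunkify(600)  # 256 B + 256 B + 88 B
print([f.nbytes for f in flits], emit_times_ns(flits, bw_gbs=64.0, prop_ns=0.0))
```

With a hypothetical 64 GB/s link, each full 256B flit occupies the wire for 4 ns, so downstream components see flits arrive at the link rate instead of receiving the whole payload at once.
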
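The per-flit PC commit rule from D1, `max(env.now, pc_avail[pc]) + chunk_time`, can be traced with a short stand-alone sketch. The function name `commit_flits` is hypothetical, and the PC selection `(addr // burst_bytes) % num_pcs` is an assumed interleaving used only for illustration; the actual address-based selection is defined in ADR-0034 D3.

```python
def commit_flits(arrivals, num_pcs, pc_bw_gbs, burst_bytes=256):
    """Return (per-flit commit times, transaction done time) in ns.

    arrivals: list of (arrival_ns, addr, nbytes) in arrival order.
    """
    pc_avail = [0.0] * num_pcs  # next free time of each pseudo-channel
    commits = []
    for now, addr, nbytes in arrivals:
        pc = (addr // burst_bytes) % num_pcs  # assumed interleaving
        chunk_time = nbytes / pc_bw_gbs       # bytes / (bytes per ns)
        done = max(now, pc_avail[pc]) + chunk_time
        pc_avail[pc] = done
        commits.append(done)
    # The is_last flit signals txn.done only after the last PC commit.
    return commits, max(commits)


# Four 256B flits arriving every 4 ns, hypothetical 32 GB/s per PC.
arrivals = [(4.0, 0, 256), (8.0, 256, 256), (12.0, 512, 256), (16.0, 768, 256)]
print(commit_flits(arrivals, num_pcs=2, pc_bw_gbs=32.0))
print(commit_flits(arrivals, num_pcs=1, pc_bw_gbs=32.0))
```

In this toy trace, two PCs finish the transaction at 24 ns while a single PC needs 36 ns, which is the per-PC parallelism that D1 lists as modeled precisely.
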
### D2. Approximated (with known directional error)

| Effect | Real HW | Our model | Error direction |
|--------|---------|-----------|-----------------|
| Router output port arbitration | Round-robin / weighted | Wire edge FIFO + serial worker | Fair when there is one txn per cycle; multi-stream sharing is not modeled at flit level |
| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | With a non-zero `switch_penalty_ns`: pessimistic for mixed R/W when alternations are dense. The default `switch_penalty_ns = 0` assumes an ideal scheduler amortizes the switches |
| Flit ↔ burst granularity | 32B flit < 256B burst | `flit_bytes = burst_bytes = 256B` | Sub-flit fine-grained timing noise; affects very small wire arbitration windows only |
| Wire-level RR fairness | Per-cycle multi-flow arbitration on shared link | Single serial wire process per edge | Fair only when one transaction is in flight on a given edge at a time. Multi-stream concurrent traffic on the same edge serializes in FIFO order |

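The wire-level fairness row can be made concrete with a toy comparison. This is only an illustration: `completion_ns` is a hypothetical helper, flows are named by single letters, and every flit is assumed to occupy the edge for the same `flit_ns`.

```python
def completion_ns(service_order, flit_ns):
    """Time at which each flow's last flit leaves a serial edge.

    service_order: iterable of flow ids, one entry per flit, in the order
    the single wire process serves them.
    """
    done = {}
    now = 0.0
    for flow in service_order:
        now += flit_ns
        done[flow] = now  # overwritten until the flow's final flit
    return done


# Two flows with four flits each on one edge, 4 ns per flit.
print(completion_ns("AAAABBBB", 4.0))  # FIFO, flow A queued first
print(completion_ns("ABABABAB", 4.0))  # per-flit round-robin
```

Both orders finish the edge at 32 ns, but flow A completes at 16 ns under FIFO and at 28 ns under round-robin. The makespan is unaffected while per-flow latency differs, which is the kind of error this row describes.
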
### D3. Ignored (out of scope)

- Bank-level row buffer conflict penalty (assume no conflicts, i.e. best case;
  the model has no per-bank state within a PC, so same-bank reuse cannot be
  detected).
- HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
  `burst_time = burst_bytes / pc_bw_gbs`).
- Refresh, ECC, thermal throttling, power gating.
- Clock domain crossings, PLL lock time.
- Upstream backpressure due to downstream buffer occupancy (input ports use
  unbounded `simpy.Store`).
- Sub-flit cycle-level arbitration at routers (flit granularity is our
  smallest unit).

### D4. Workload sensitivity

Workloads where the above simplifications meaningfully affect results:

- **Random scatter/gather**: bank conflicts ignored → model optimistic.
- **Heavily mixed R/W traffic** (e.g., GEMM bias accumulation): the HBM
  scheduler is absent. With the default `switch_penalty_ns = 0` we assume
  ideal amortization; setting it non-zero models a pessimistic
  per-alternation cost.
- **High concurrency (>10 active flows on one link)**: HoL blocking and VC
  limits not modeled → model optimistic.
- **Very small (sub-flit) transactions**: flit quantization noise.
- **Concurrent multi-flow on a single wire**: the wire is a serial FIFO at
  the flit level, so per-flow fairness within a single edge is not modeled.
  Pre-edge merging (multiple sources arriving at a router and being
  forwarded to the same downstream wire) is correctly modeled via the
  flit-aware router's serial worker.

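The mixed R/W sensitivity can be explored with a small sketch of per-PC last-direction tracking. The helper `service_time_ns` is hypothetical and assumes the penalty is charged on every direction change within one PC, with no penalty on the first access; the real mechanism may differ in detail.

```python
def service_time_ns(ops, burst_ns, switch_penalty_ns=0.0):
    """Total time to serve a FIFO sequence of 'R'/'W' bursts on one PC."""
    now = 0.0
    last = None  # last direction seen on this PC
    for op in ops:
        if last is not None and op != last:
            now += switch_penalty_ns  # assumed: charged on each alternation
        now += burst_ns
        last = op
    return now


dense = "RWRWRWRW"    # alternates on every burst
grouped = "RRRRWWWW"  # what a reordering scheduler might achieve
for penalty in (0.0, 5.0):
    print(penalty,
          service_time_ns(dense, 8.0, penalty),
          service_time_ns(grouped, 8.0, penalty))
```

With a zero penalty both orders cost 64 ns. With a hypothetical 5 ns penalty the dense sequence costs 99 ns against 69 ns for the grouped one. Because the model serves requests in FIFO order, a non-zero penalty charges the dense pattern in full, which is why it is described as pessimistic.
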
### D5. Verification policy

For workloads in D4, cross-check against real HW or a cycle-accurate
simulator before drawing absolute-magnitude conclusions. The model remains
accurate for **relative comparisons** within the modeled regime.

### D6. Future work

Note: multi-stream merging at routers IS modeled correctly. Each
in_port has its own fan_in process, all of them push to a shared inbox, and
the router worker forwards in inbox FIFO order. Flits from different
upstream streams naturally interleave at flit granularity. The items
below are different concerns, ordered by expected workload impact.

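The merging behavior described in the note can be approximated without SimPy. In this sketch the shared inbox is modeled as a merge of per-stream arrival times, and the serial worker applies the per-transaction overhead once on the first flit, as in D1. The function `forward_times` and its arguments are hypothetical.

```python
import heapq


def forward_times(streams, overhead_ns):
    """Forwarding time of each flit through one flit-aware router.

    streams: dict mapping txn id -> sorted list of flit arrival times (ns).
    Returns a list of (txn id, forward_ns) in inbox FIFO order.
    """
    # Each in_port delivers its flits in order; the inbox is their merge.
    inbox = heapq.merge(*[[(t, txn) for t in times]
                          for txn, times in streams.items()])
    worker_free = 0.0
    seen = set()
    out = []
    for arrival, txn in inbox:
        start = max(arrival, worker_free)
        if txn not in seen:       # first flit: header decode overhead
            start += overhead_ns
            seen.add(txn)
        worker_free = start       # later flits pipeline with no extra delay
        out.append((txn, start))
    return out


print(forward_times({"a": [0.0, 4.0, 8.0], "b": [2.0, 6.0, 10.0]},
                    overhead_ns=3.0))
```

In this trace the two streams interleave flit by flit, and each pays the 3 ns overhead only on its first flit.
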

**Higher impact (workload accuracy gap)**:

- [ ] **Bank-level conflict modeling** within a PC (opt-in via
  `track_banks: true`). Currently we assume no same-bank reuse;
  random scatter/gather workloads are optimistic here.
- [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
  from the design discussion). Default `switch_penalty_ns=0` is the
  ideal-amortization stand-in; bursty mixed R/W workloads benefit
  from explicit modeling.
- [ ] **Backpressure** modeling for finite component buffers. Matters
  at high concurrency / sustained saturation where buffer occupancy
  causes upstream stalls.
- [ ] **Op_log integration with chunk-streaming**: currently op_log
  fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
  GemmCmd, MathCmd), which are not chunkified. Integration would
  require flit-aware components to also emit op_log start/end hooks
  per transaction (start on first flit, end on is_last).

**Lower impact (academic / specific use cases)**:

- [ ] **Cycle-accurate router arbitration policies** (RR with
  priorities, age, iSLIP). The FIFO inbox is already approximately
  fair when flit arrival times differ slightly between streams (the
  common case for similar-rate workloads). True impact appears only
  for: (a) priority/QoS modeling, (b) per-stream tail latency
  analysis under sustained saturation. Not critical for makespan or
  average-latency studies.
- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
  cycles. Our `flit_bytes` equals the burst size (256B); real HW arbitrates
  per 32B flit. Effect is small for most workloads (sub-flit timing
  noise on small messages).

## Consequences

- Single review point for all model fidelity questions. Each future PR
  touching latency must update the relevant section here.
- Workload-specific magnitude error envelopes are explicit.
- Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
  enforces the ADR-0017 D8 invariant in code rather than relying on
  manual consistency in the yaml.
- Wire transfer time is charged once per bottleneck-link transit (Phase 2c
  per-flit timing) rather than via terminal `drain_ns` injection. Single
  transactions land at `drain + commit_time + small_overheads`; multi-hop
  preserves wormhole pipelining; multi-stream merge correctly serializes
  at the shared wire's FIFO.

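The builder-side derivation and the steady-state `burst_time` formula from D3 reduce to two divisions. The sketch below uses hypothetical function names and hypothetical numbers (512 GB/s shared across 16 PCs) purely to show the arithmetic; it is not the builder's actual code.

```python
def derive_pc_bw_gbs(hbm_to_router_bw_gbs: float, num_pcs: int) -> float:
    """Per-PC bandwidth derived from the aggregate, never read from yaml."""
    if num_pcs <= 0:
        raise ValueError("num_pcs must be positive")
    return hbm_to_router_bw_gbs / num_pcs


def burst_time_ns(burst_bytes: int, pc_bw_gbs: float) -> float:
    """Steady-state burst time; 1 GB/s equals 1 byte/ns."""
    return burst_bytes / pc_bw_gbs


pc_bw = derive_pc_bw_gbs(512.0, 16)      # hypothetical aggregate and PC count
print(pc_bw, burst_time_ns(256, pc_bw))  # per-PC GB/s, ns per 256B burst
```

Deriving the per-PC value means the sum over all PCs always equals the configured aggregate, so the two numbers cannot drift apart in configuration.
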
## Cross-references
- ADR-0015 — component / port / wire model.
- ADR-0017 — Cube NOC architecture and HBM connectivity.
- ADR-0004 — memory semantics, local HBM.
- ADR-0034 — HBM controller internal design.