kernbench2/docs/onboarding/latency-model.md
ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
2026-05-20 01:15:55 -07:00

# Latency Model
## Overview
kernbench uses a discrete-event simulation (SimPy) to compute end-to-end latency.
Every request flows through a graph of **components** connected by **wires**.
The total latency reported is the **elapsed SimPy simulation time** (`env.now` delta),
not wall-clock time and not a static formula, so contention and queueing are captured automatically.
```text
total_ns (actual) = wire_prop + component_overhead + drain + queueing
                    └────────── deterministic ───────────┘   └── contention-dependent
```
## Three Deterministic Cost Components
### 1. Wire Propagation
```text
wire_ns = distance_mm × ns_per_mm (global: 0.01 = 10 ps/mm)
```
Every edge in the topology graph has a `distance_mm`. A SimPy wire process
delays each message by `wire_ns` before delivering it to the next component.
For on-chip silicon this is ~10 ps/mm; the same constant applies everywhere
since all links are on-die or interposer. Wire propagation is typically <1 ns
and negligible compared to other costs.
### 2. Component Overhead (`overhead_ns`)
```text
component_ns = node.attrs["overhead_ns"]
```
Each component on the path adds a fixed processing delay via `yield env.timeout(overhead_ns)`.
This models arbitration, protocol processing, pipeline stages, etc.
| Component | overhead_ns | Meaning |
|-----------|-------------|---------|
| pcie_ep | 5.0 | PCIe protocol processing |
| io_cpu | 10.0 | Command decode / dispatch |
| m_cpu | 5.0 | DMA scheduling |
| fabric switch | 5.0 | Packet arbitration |
| xbar | 2.0 | Crossbar arbitration |
| xbar bridge | 1.0 | Bridge traversal between xbar halves |
| ucie | 8.0 | UCIe protocol overhead per port (TX or RX; 16ns per crossing) |
| noc (2D mesh) | 0.0 | Hop delay modeled internally via manhattan distance |
| hbm_ctrl | 0.0 | Access time via drain_ns; efficiency=0.8 reduces edge BW (256→204.8) |
| pe_cpu | 2.0 | Command dispatch |
| pe_scheduler | 1.0 | PE-internal scheduling |
| pe_gemm/math | 0.0 | Placeholder; will use flops-based model |
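
As a rough illustration of how these fixed delays accumulate, the sketch below sums `overhead_ns` along a path. The values are copied from the table above; the dictionary keys, the helper name `path_overhead_ns`, and the two example paths are hypothetical and are not taken from the kernbench source.

```python
# Values copied from the overhead table; keys are illustrative labels only.
OVERHEAD_NS = {
    "pcie_ep": 5.0,
    "io_cpu": 10.0,
    "m_cpu": 5.0,
    "fabric_switch": 5.0,
    "xbar": 2.0,
    "xbar_bridge": 1.0,
    "ucie": 8.0,
    "noc": 0.0,
    "hbm_ctrl": 0.0,
    "pe_cpu": 2.0,
    "pe_scheduler": 1.0,
}


def path_overhead_ns(path):
    """Sum the fixed per-component overhead for every component on a path."""
    return sum(OVERHEAD_NS[kind] for kind in path)


# Hypothetical paths, chosen only to exercise the helper.
local_read = ["xbar", "hbm_ctrl"]
ucie_crossing = ["ucie", "ucie"]  # one TX port plus one RX port

print(path_overhead_ns(local_read))     # 2.0
print(path_overhead_ns(ucie_crossing))  # 16.0
```
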
### 3. Drain (Serialization Delay)
```text
drain_ns = nbytes / bottleneck_bw_gbs
```
**Wormhole (cut-through) model**: data flows through intermediate nodes as a
pipeline. Serialization cost is paid **once** at the terminal node, not at
every hop. The bottleneck is the minimum `bw_gbs` across all edges in the path.
Example: 4096 bytes through a path with bottleneck 128 GB/s → `4096 / 128 = 32.0 ns`.
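
To make the "paid once" point concrete, the sketch below compares the wormhole drain with a store-and-forward calculation in which every hop re-serializes the payload. The store-and-forward function is only a contrast case for illustration and is not part of the kernbench model; both function names and the edge list are assumptions.

```python
def drain_wormhole_ns(nbytes, edge_bw_gbs):
    """Wormhole/cut-through: pay serialization once, at the bottleneck rate.

    1 GB/s equals 1 byte/ns, so bytes / (GB/s) is already in nanoseconds.
    """
    return nbytes / min(edge_bw_gbs)


def drain_store_and_forward_ns(nbytes, edge_bw_gbs):
    """Contrast case (not the kernbench model): every hop re-serializes."""
    return sum(nbytes / bw for bw in edge_bw_gbs)


edges = [256.0, 128.0, 256.0]  # hypothetical per-edge bw_gbs along one path

print(drain_wormhole_ns(4096, edges))           # 32.0
print(drain_store_and_forward_ns(4096, edges))  # 64.0
```
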
### Formula (Theoretical Lower Bound)
```text
formula_ns = Σ(wire_prop) + Σ(overhead_ns) + drain_ns
```
This is the latency with **zero contention**—no other request competing for
any resource. The engine provides `_formula_latency()` for verification.
With no contention: `actual == formula`. With contention: `actual > formula`.
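
The three terms can be combined in a few lines. This is a minimal stand-in written for this document, not the engine's `_formula_latency()`; the `Edge` dataclass and the function name `formula_latency_ns` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    distance_mm: float
    bw_gbs: float


def formula_latency_ns(edges, overheads_ns, nbytes, ns_per_mm=0.01):
    """Zero-contention latency: sum(wire) + sum(overhead) + one drain."""
    wire = sum(e.distance_mm * ns_per_mm for e in edges)
    overhead = sum(overheads_ns)
    drain = nbytes / min(e.bw_gbs for e in edges)  # bottleneck edge only
    return wire + overhead + drain


# xbar.pe0 -> hbm_ctrl.slice0: one 2.5 mm edge at 256 GB/s,
# xbar overhead 2.0 ns, hbm_ctrl overhead 0.0 ns, 4096-byte read.
total = formula_latency_ns([Edge(2.5, 256.0)], [2.0, 0.0], 4096)
print(round(total, 3))  # 18.025
```
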
### Diagram: PE DMA Read (pe0 → local slice0, 4096 bytes)
```mermaid
sequenceDiagram
participant D as pe_dma
participant X as xbar.pe0
participant H as hbm_ctrl.slice0
D->>X: txn (4096B)
Note over X: overhead 2.0 ns
X->>H: txn (wire 0.025 ns)
Note over H: acquire Resource
Note over H: overhead 0 ns
Note over H: drain 4096/256 = 16.0 ns
Note over H: release Resource
H-->>D: done.succeed()
Note over D,H: formula = wire(0.025) + ovhd(2.0) + drain(16.0) = 18.025 ns<br/>probe actual = 18.09 ns (its Wire column sums every edge, 0.08 ns)<br/>actual ≈ formula (no contention)
```
### Diagram: Two Requests — No Contention vs HOL Blocking
#### Case 1: Different slices (parallel, no contention)
```mermaid
sequenceDiagram
participant A as Requests A and B
participant S0 as hbm_ctrl.slice0<br/>Resource(cap=1)
participant S1 as hbm_ctrl.slice1<br/>Resource(cap=1)
Note over A,S1: t=2 ns — both requests arrive at their own slice
A->>S0: A (4KB)
A->>S1: B (4KB)
Note over S0: acquire (immediate)
Note over S1: acquire (immediate)
Note over S0: drain 16.0 ns
Note over S1: drain 16.0 ns
Note over S0: t=18 release
Note over S1: t=18 release
Note over A,S1: A actual = 18 ns, B actual = 18 ns<br/>No waiting — separate Resources
```
#### Case 2: Same slice (HOL blocking)
```mermaid
sequenceDiagram
participant A as Request A (4KB)
participant Q as hbm_ctrl.slice0<br/>Resource(cap=1)
participant B as Request B (64B)
Note over A,B: t=0 — A arrives first
A->>Q: acquire (immediate)
Note over Q: drain A = 16.0 ns
Note over B,Q: t=5 — B arrives, yield req → BLOCKED
B--xQ: waiting...
Note over Q: t=16 — A drain done, release
Q->>B: B acquires resource
Note over Q: drain B = 0.25 ns
Note over Q: t=16.25 — B done, release
Note over A,B: A actual = 16.0 ns (== formula)<br/>B actual = 11.25 ns (formula 0.25 + queueing 11.0)<br/>HOL blocking: short request waits behind long drain
```
---
## How SimPy Tracks Latency
### Measurement
```python
start_ns = env.now
yield txn_done # wait for the transaction to complete
total_ns = env.now - start_ns # ← this is what probe reports
```
`env.now` is SimPy's simulation clock. It only advances when a process `yield`s
a timeout or waits on a resource/store. The delta between start and done captures
**everything**: wire delays, component overheads, drain, and any queueing.
### Component Pipeline
Each component is a SimPy process:
```text
_fan_in (per in_port) → _inbox (Store) → _worker → out_ports
```
1. **`_fan_in`**: relays messages from each `in_port` into a shared `_inbox` Store.
2. **`_worker`**: pulls from `_inbox`, spawns `_forward_txn` per message.
3. **`_forward_txn`**: calls `run()` (overhead), then puts to `out_ports[next_hop]`.
The worker uses `env.process()` (pipeline model), so multiple messages can be
in-flight through the same component concurrently. Contention happens when
they compete for shared resources (e.g., `simpy.Resource` in hbm_ctrl).
### Wire Process
```python
while True:
    msg = yield out_port.get()   # wait for sender
    yield env.timeout(prop_ns)   # propagation delay
    yield in_port.put(msg)       # deliver to receiver
```
Each directed edge has its own wire process. Messages are delayed by exactly
`distance_mm × ns_per_mm`.
---
## Contention and Queueing
Queueing delay is **not a separate formula term**—it emerges from SimPy's
event scheduling when multiple requests compete for the same resource.
### Where Contention Occurs
| Resource | SimPy Type | Capacity | Effect |
|----------|-----------|----------|--------|
| hbm_ctrl | `simpy.Resource` | 1 | Serializes HBM access |
| m_cpu DMA read engine | `simpy.Resource` | 1 | Serializes DMA reads |
| m_cpu DMA write engine | `simpy.Resource` | 1 | Serializes DMA writes |
| pe_dma channels | `simpy.Resource` | configurable | Serializes PE DMA ops |
| component inbox | `simpy.Store` | unbounded | No backpressure (FIFO) |
### How Queueing Works
```python
# hbm_ctrl._worker
with self._resource.request() as req:
    yield req                              # ← BLOCKS if resource is occupied
    yield from self.run(env, txn.nbytes)
    yield env.timeout(drain_ns)
```
If request A holds the resource and request B arrives:
- B's `yield req` blocks until A releases the resource
- B resumes only after that release, by which point the shared simulation clock (`env.now`) has advanced by A's remaining service time
- This "extra" time shows up in B's `total_ns` automatically
```text
No contention: actual_ns == formula_ns
Contention: actual_ns > formula_ns
queueing_delay = actual_ns - formula_ns
```
### Head-of-Line (HOL) Blocking at hbm_ctrl
The `simpy.Resource` is held for the **entire** `with` block—both overhead and
drain. The resource is NOT released between overhead and drain:
```python
with self._resource.request() as req:
    yield req                              # acquire (or wait)
    yield from self.run(env, txn.nbytes)   # overhead_ns ─┐
    yield env.timeout(drain_ns)            # drain_ns     │ resource held
# ← resource released here ───────────────────────────────┘
```
This means a short request arriving during a long request's drain must wait
for the full remaining drain time—classic head-of-line blocking:
```text
Request A: 4 KB, drain = 16.0 ns  (arrives at t=0)
Request B: 64 B, drain = 0.25 ns  (arrives at t=5)

Timeline:
  t=0.00   A acquires resource
  t=0.00   A: overhead (0 ns)
  t=0.00   A: drain starts (16.0 ns)
  t=5.00   B arrives → yield req → BLOCKED (A holds resource)
  t=16.00  A: drain done → resource released
  t=16.00  B acquires resource
  t=16.00  B: overhead (0 ns)
  t=16.25  B: drain done → resource released

B actual   = 11.25 ns  (waited 11.0 + own 0.25)
B formula  = 0.25 ns
B queueing = 11.0 ns   ← HOL blocking penalty
```
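
The timeline above can be reproduced without SimPy by replaying arrivals through a capacity-1 FIFO. This is a pure-Python sketch of the same arithmetic, not the `hbm_ctrl` implementation; the function name `fifo_single_server` and the tuple layout are assumptions.

```python
def fifo_single_server(requests):
    """Replay (name, arrival_ns, service_ns) tuples through a capacity-1 FIFO.

    Returns {name: (actual_ns, queueing_ns)}, where actual is completion minus
    arrival and queueing is the time spent waiting for the resource.
    """
    free_at = 0.0
    results = {}
    for name, arrival, service in sorted(requests, key=lambda r: r[1]):
        start = max(arrival, free_at)  # wait if the resource is still held
        free_at = start + service      # resource is held for the whole service
        results[name] = (free_at - arrival, start - arrival)
    return results


out = fifo_single_server([("A", 0.0, 16.0), ("B", 5.0, 0.25)])
print(out["A"])  # (16.0, 0.0)
print(out["B"])  # (11.25, 11.0)
```
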
**Why this is physically realistic**: An HBM channel processes one burst at a
time. While data is being serialized onto the channel (drain), no other request
can use that channel. The FIFO ordering (`simpy.Resource` default) reflects
the simplest controller scheduling policy.
**Alternative: priority scheduling**: If needed, `simpy.PriorityResource` can
prioritize shorter requests (Shortest Job First), but this is not currently
used since FIFO matches typical HBM controller behavior.
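
To see what the alternative policy would change, the sketch below serves the same three requests in FIFO order and in shortest-job-first order on one non-preemptive resource. It is a pure-Python stand-in for illustration only: kernbench does not do this, and the function name `simulate` and the request list are hypothetical. In SimPy the equivalent would be a `simpy.PriorityResource` with a priority passed to `request(priority=...)`.

```python
import heapq


def simulate(requests, shortest_first=False):
    """Serve (name, arrival_ns, service_ns) on one non-preemptive resource.

    Returns {name: completion_ns}. With shortest_first=True, the waiting
    request with the smallest service time is picked next; otherwise FIFO.
    """
    pending = sorted(requests, key=lambda r: r[1])
    waiting, done, now, i, seq = [], {}, 0.0, 0, 0
    while i < len(pending) or waiting:
        # Admit every request that has arrived by `now`.
        while i < len(pending) and pending[i][1] <= now:
            name, arrival, service = pending[i]
            key = service if shortest_first else arrival
            heapq.heappush(waiting, (key, seq, name, service))
            seq += 1
            i += 1
        if not waiting:
            now = pending[i][1]  # idle: jump to the next arrival
            continue
        _, _, name, service = heapq.heappop(waiting)
        now += service  # the resource is held for the full service time
        done[name] = now
    return done


reqs = [("A", 0.0, 16.0), ("B", 1.0, 16.0), ("C", 2.0, 0.25)]
print(simulate(reqs)["C"])                       # 32.25
print(simulate(reqs, shortest_first=True)["C"])  # 16.25
```

The short request C still waits for A, which already holds the resource, but no longer waits behind B.
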
---
## Worked Example: Two Concurrent PE DMA Reads
Setup: PE0 and PE1 in cube0 both read 4096 bytes from their local HBM slices
(slice0 and slice1), submitted to the **same engine** at the same time.
### Paths
```text
DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
```
### No Contention (different HBM slices)
Since slice0 and slice1 are **separate** hbm_ctrl instances, each with its own
`simpy.Resource(capacity=1)`, there is no resource competition.
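```text
DMA A timeline:
  t=0.00    pe_dma dequeues txn
  t=0.00    xbar.pe0: overhead_ns=2.0               → t=2.00
  t=2.00    wire prop (2.5 mm × 0.01 = 0.025)       → t=2.025
  t=2.025   hbm_ctrl.slice0: yield req → immediate (no contention)
  t=2.025   hbm_ctrl.slice0: overhead_ns=0          → t=2.025
  t=2.025   drain_ns = 4096/256 = 16.0              → t=18.025
  t=18.025  done
DMA B timeline: (identical, on its own slice)
  t=0.00 → ... → t=18.025 done
```
Both complete at t=18.025 ns in this simplified timeline, and `actual == formula` for both.
The probe output later in this document reports 18.09 ns for the same read: its Wire
column sums every edge on the path (0.08 ns in total), while the timeline here shows
only the xbar → hbm_ctrl wire.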
### With Contention (same HBM slice)
Now suppose both PE0 and PE1 read from **slice0**:
```text
DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → xbar.pe0 → hbm_ctrl.slice0
(chain traversal to reach slice0)
```
```text
DMA A timeline:
  t=0.00    xbar.pe0 (2.0) → wire (0.025) → hbm_ctrl.slice0
  t=2.025   yield req → immediate (first to arrive)
  t=18.025  drain 16.0 done → release resource → done
  actual_A  = 18.025 ns (== formula)

DMA B timeline:
  t=0.00    xbar.pe1 (2.0) → xbar.pe0 (2.0) → wire (0.035) → hbm_ctrl.slice0
  t=4.035   yield req → BLOCKED (A holds resource until t=18.025)
  t=18.025  acquire resource
  t=50.025  drain 32.0 done → release → done

  B's drain uses the bottleneck BW of B's own path (xbar_x_bw = 128 GB/s),
  not A's 256 GB/s: drain = 4096/128 = 32.0 ns

  formula_B  = wire(0.035) + overhead(4.0) + drain(32.0) = 36.035 ns
  actual_B   = 50.025 ns
  queueing_B = actual_B - formula_B = 13.99 ns
               (time waiting for A to release hbm_ctrl: 18.025 - 4.035)
```
The key insight: **queueing delay is not in the formula**. It only appears in
the actual SimPy simulation when resources are contested. The probe reports
`actual_ns`, which includes all queueing. To see pure queueing overhead,
compare `actual_ns` vs `formula_ns` (available in PE DMA traces).
---
## Probe Output Explained
```text
=== PE DMA Latency ===
Case               Target             Actual  Ovhd  Drain  Wire  Ovhd%  Drain%  Eff.BW  BN.BW  Util%
pe-local-hbm       c0.pe0->c0.slice0   18.09   2.0   16.0  0.08  11.1%   88.5%  226.49  256.0  88.5%
pe-cross-half-hbm  c0.pe0->c0.slice4   37.14   5.0   32.0  0.14  13.5%   86.1%  110.27  128.0  86.1%
```
| Column | Meaning |
|--------|---------|
| **Actual** | SimPy measured `env.now` delta (includes contention if any) |
| **Ovhd** | Sum of `overhead_ns` for all components on the forward path |
| **Drain** | `nbytes / bottleneck_bw` — serialization at terminal |
| **Wire** | Sum of `distance_mm × ns_per_mm` for all edges |
| **Ovhd%** | `Ovhd / Actual × 100` — fraction of time spent in component processing |
| **Drain%** | `Drain / Actual × 100` — fraction of time spent in data transfer |
| **Eff.BW** | `nbytes / Actual` — achieved bandwidth |
| **BN.BW** | Bottleneck bandwidth (min `bw_gbs` on path) |
| **Util%** | `Eff.BW / BN.BW × 100` — how close to theoretical max BW |
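
The derived columns follow directly from the base ones. The sketch below recomputes them for the `pe-cross-half-hbm` row; the function name `probe_columns` is hypothetical, and the 4096-byte transfer size is an assumption inferred from the 32.0 ns drain at 128 GB/s. Because the inputs are the rounded values printed by the probe, the results can differ from the table in the last digit.

```python
def probe_columns(nbytes, actual_ns, ovhd_ns, drain_ns, bn_bw_gbs):
    """Recompute the derived probe columns from the measured/base values."""
    eff_bw = nbytes / actual_ns  # bytes/ns is the same unit as GB/s
    return {
        "Ovhd%": 100.0 * ovhd_ns / actual_ns,
        "Drain%": 100.0 * drain_ns / actual_ns,
        "Eff.BW": eff_bw,
        "Util%": 100.0 * eff_bw / bn_bw_gbs,
    }


# pe-cross-half-hbm row (nbytes=4096 is an assumption, see above).
row = probe_columns(4096, 37.14, 5.0, 32.0, 128.0)
for name, value in row.items():
    print(f"{name}: {value:.2f}")
```
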
### Why Util% < 100%
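`Util% = Drain%` because `Eff.BW / BN.BW = (nbytes / actual_ns) / (nbytes / drain_ns) = drain_ns / actual_ns`.
The gap from 100% is the time spent outside drain: component overhead plus wire
delay, and queueing when there is contention. For small transfers (4 KB), overhead
is significant relative to drain. For large transfers, drain dominates and
utilization approaches 100%.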
```text
4 KB: Ovhd=2.0, Drain=16.0 → Util=88.5% (overhead is 11% of time)
64 KB: Ovhd=2.0, Drain=256.0 → Util=99.2% (overhead is <1% of time)
```
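
The same trend can be swept over transfer sizes. This is a zero-contention sketch that takes the 2.0 ns overhead, 0.08 ns wire delay, and 256 GB/s bottleneck from the `pe-local-hbm` probe row; the function name `utilization_pct` is hypothetical.

```python
def utilization_pct(nbytes, ovhd_ns, wire_ns, bn_bw_gbs):
    """Zero-contention utilization: drain time as a share of total time."""
    drain_ns = nbytes / bn_bw_gbs
    return 100.0 * drain_ns / (ovhd_ns + wire_ns + drain_ns)


# Overhead and wire values are taken from the pe-local-hbm probe row.
for size in (4 * 1024, 64 * 1024, 1024 * 1024):
    print(size, round(utilization_pct(size, 2.0, 0.08, 256.0), 1))
```
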
### H2D Path: Why Ovhd% is ~40%
H2D traverses many components (pcie_ep → io_cpu → ucie → noc → m_cpu → noc →
xbar → hbm_ctrl + response path). Total forward overhead is ~23 ns vs drain
of 32 ns for 4KB, so overhead is comparable to data transfer time—resulting
in ~55% utilization. This is expected for small command-path transfers.