Latency Model
Overview
kernbench uses a discrete-event simulation (SimPy) to compute end-to-end latency.
Every request flows through a graph of components connected by wires.
The total latency reported is the elapsed simulated time (env.now delta),
not a static formula, so contention and queueing are captured automatically.
total_ns (actual) = wire_prop + component_overhead + drain + queueing
                    └────────── deterministic ───────────┘   └──────┘ contention-dependent
Three Deterministic Cost Components
1. Wire Propagation
wire_ns = distance_mm × ns_per_mm (global: 0.01 = 10 ps/mm)
Every edge in the topology graph has a distance_mm. A SimPy wire process
delays each message by wire_ns before delivering it to the next component.
For on-chip silicon this is ~10 ps/mm; the same constant applies everywhere
since all links are on-die or interposer. Wire propagation is typically <1 ns
and negligible compared to other costs.
2. Component Overhead (overhead_ns)
component_ns = node.attrs["overhead_ns"]
Each component on the path adds a fixed processing delay via yield env.timeout(overhead_ns).
This models arbitration, protocol processing, pipeline stages, etc.
| Component | overhead_ns | Meaning |
|---|---|---|
| pcie_ep | 5.0 | PCIe protocol processing |
| io_cpu | 10.0 | Command decode / dispatch |
| m_cpu | 5.0 | DMA scheduling |
| fabric switch | 5.0 | Packet arbitration |
| xbar | 2.0 | Crossbar arbitration |
| xbar bridge | 1.0 | Bridge traversal between xbar halves |
| ucie | 8.0 | UCIe protocol overhead per port (TX or RX; 16ns per crossing) |
| noc (2D mesh) | 0.0 | Hop delay modeled internally via manhattan distance |
| hbm_ctrl | 0.0 | Access time via drain_ns; efficiency=0.8 reduces edge BW (256→204.8) |
| pe_cpu | 2.0 | Command dispatch |
| pe_scheduler | 1.0 | PE-internal scheduling |
| pe_gemm/math | 0.0 | Placeholder; will use flops-based model |
3. Drain (Serialization Delay)
drain_ns = nbytes / bottleneck_bw_gbs
Wormhole (cut-through) model: data flows through intermediate nodes as a
pipeline. Serialization cost is paid once at the terminal node, not at
every hop. The bottleneck is the minimum bw_gbs across all edges in the path.
Example: 4096 bytes through a path with bottleneck 128 GB/s → 4096 / 128 = 32.0 ns.
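The drain rule can be sketched in a few lines. This is a minimal illustration, not the engine's code: the function name and the list-of-bandwidths input are assumptions made for the example. It relies on 1 GB/s being 1 byte/ns, so bytes divided by GB/s gives nanoseconds directly.

```python
def drain_ns(nbytes: int, edge_bw_gbs: list[float]) -> float:
    """Serialization delay, paid once at the terminal node (wormhole model)."""
    bottleneck = min(edge_bw_gbs)   # the slowest edge limits the whole pipeline
    return nbytes / bottleneck      # bytes / (GB/s) == ns, since 1 GB/s = 1 byte/ns


# Path with three edges; the 128 GB/s edge is the bottleneck.
print(drain_ns(4096, [256.0, 128.0, 256.0]))  # 32.0
```

Adding more 256 GB/s hops to the list does not change the result, which is the point of the cut-through model: only the minimum matters.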
Formula (Theoretical Lower Bound)
formula_ns = Σ(wire_prop) + Σ(overhead_ns) + drain_ns
This is the latency with zero contention—no other request competing for
any resource. The engine provides _formula_latency() for verification.
With no contention: actual == formula. With contention: actual > formula.
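As a rough sketch of how the three deterministic terms combine, the snippet below sums them for the simplified PE DMA read path used in this document's diagrams. The Hop structure and the formula_ns helper are hypothetical and exist only for this example; the engine's own _formula_latency() may be organized differently.

```python
from dataclasses import dataclass

NS_PER_MM = 0.01  # global wire constant from the text (10 ps/mm)


@dataclass
class Hop:
    """One edge plus the component it leads into (hypothetical structure)."""
    distance_mm: float
    bw_gbs: float
    overhead_ns: float


def formula_ns(path: list[Hop], nbytes: int) -> float:
    wire = sum(h.distance_mm * NS_PER_MM for h in path)   # wire propagation
    overhead = sum(h.overhead_ns for h in path)           # fixed per-component cost
    drain = nbytes / min(h.bw_gbs for h in path)          # paid once, at bottleneck BW
    return wire + overhead + drain


# pe_dma -> xbar.pe0 -> hbm_ctrl.slice0, simplified to the single wire in the diagram.
path = [
    Hop(distance_mm=0.0, bw_gbs=256.0, overhead_ns=2.0),  # into xbar.pe0 (wire length assumed 0)
    Hop(distance_mm=2.5, bw_gbs=256.0, overhead_ns=0.0),  # into hbm_ctrl.slice0
]
print(round(formula_ns(path, 4096), 3))  # 18.025
```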
Diagram: PE DMA Read (pe0 → local slice0, 4096 bytes)
sequenceDiagram
participant D as pe_dma
participant X as xbar.pe0
participant H as hbm_ctrl.slice0
D->>X: txn (4096B)
Note over X: overhead 2.0 ns
X->>H: txn (wire 0.025 ns)
Note over H: acquire Resource
Note over H: overhead 0 ns
Note over H: drain 4096/256 = 16.0 ns
Note over H: release Resource
H-->>D: done.succeed()
Note over D,H: total_ns = 18.09 ns<br/>formula = wire(0.025) + ovhd(2.0) + drain(16.0) = 18.025 ns<br/>actual ≈ formula (no contention)
Diagram: Two Requests — No Contention vs HOL Blocking
Case 1: Different slices (parallel, no contention)
sequenceDiagram
participant A as Request A
participant S0 as hbm_ctrl.slice0<br/>Resource(cap=1)
participant S1 as hbm_ctrl.slice1<br/>Resource(cap=1)
Note over A,S1: t=2 ns — both requests arrive at their own slice
A->>S0: A (4KB)
A->>S1: B (4KB)
Note over S0: acquire (immediate)
Note over S1: acquire (immediate)
Note over S0: drain 16.0 ns
Note over S1: drain 16.0 ns
Note over S0: t=18 release
Note over S1: t=18 release
Note over A,S1: A actual = 18 ns, B actual = 18 ns<br/>No waiting — separate Resources
Case 2: Same slice (HOL blocking)
sequenceDiagram
participant A as Request A (4KB)
participant Q as hbm_ctrl.slice0<br/>Resource(cap=1)
participant B as Request B (64B)
Note over A,B: t=0 — A arrives first
A->>Q: acquire (immediate)
Note over Q: drain A = 16.0 ns
Note over B,Q: t=5 — B arrives, yield req → BLOCKED
B--xQ: waiting...
Note over Q: t=16 — A drain done, release
Q->>B: B acquires resource
Note over Q: drain B = 0.25 ns
Note over Q: t=16.25 — B done, release
Note over A,B: A actual = 16.0 ns (== formula)<br/>B actual = 11.25 ns (formula 0.25 + queueing 11.0)<br/>HOL blocking: short request waits behind long drain
How SimPy Tracks Latency
Measurement
start_ns = env.now
yield txn_done # wait for the transaction to complete
total_ns = env.now - start_ns # ← this is what probe reports
env.now is SimPy's simulation clock. It only advances when a process yields
a timeout or waits on a resource/store. The delta between start and done captures
everything: wire delays, component overheads, drain, and any queueing.
Component Pipeline
Each component is a SimPy process:
_fan_in (per in_port) → _inbox (Store) → _worker → out_ports
- _fan_in: relays messages from each in_port into a shared _inbox Store.
- _worker: pulls from _inbox, spawns _forward_txn per message.
- _forward_txn: calls run() (overhead), then puts to out_ports[next_hop].
The worker uses env.process() (pipeline model), so multiple messages can be
in-flight through the same component concurrently. Contention happens when
they compete for shared resources (e.g., simpy.Resource in hbm_ctrl).
Wire Process
while True:
    msg = yield out_port.get()    # wait for sender
    yield env.timeout(prop_ns)    # propagation delay
    yield in_port.put(msg)        # deliver to receiver
Each directed edge has its own wire process. Messages are delayed by exactly
distance_mm × ns_per_mm.
Contention and Queueing
Queueing delay is not a separate formula term—it emerges from SimPy's event scheduling when multiple requests compete for the same resource.
Where Contention Occurs
| Resource | SimPy Type | Capacity | Effect |
|---|---|---|---|
| hbm_ctrl | simpy.Resource | 1 | Serializes HBM access |
| m_cpu DMA read engine | simpy.Resource | 1 | Serializes DMA reads |
| m_cpu DMA write engine | simpy.Resource | 1 | Serializes DMA writes |
| pe_dma channels | simpy.Resource | configurable | Serializes PE DMA ops |
| component inbox | simpy.Store | unbounded | No backpressure (FIFO) |
How Queueing Works
# hbm_ctrl._worker
with self._resource.request() as req:
    yield req                              # ← BLOCKS if resource is occupied
    yield from self.run(env, txn.nbytes)
    yield env.timeout(drain_ns)
If request A holds the resource and request B arrives:
- B's yield req blocks until A releases the resource.
- The simulation clock (env.now) advances by A's remaining service time before B resumes.
- This "extra" time shows up in B's total_ns automatically.
No contention: actual_ns == formula_ns
Contention: actual_ns > formula_ns
queueing_delay = actual_ns - formula_ns
Head-of-Line (HOL) Blocking at hbm_ctrl
The simpy.Resource is held for the entire with block, covering both overhead and
drain. The resource is NOT released between overhead and drain:
with self._resource.request() as req:
    yield req                              # acquire (or wait)
    yield from self.run(env, txn.nbytes)   # overhead_ns ─┐
    yield env.timeout(drain_ns)            # drain_ns     │ resource held
# ← resource released here ───────────────────────────────┘
This means a short request arriving during a long request's drain must wait for the full remaining drain time—classic head-of-line blocking:
Request A: 4 KB, drain = 16.0 ns (arrives at t=0)
Request B: 64 B, drain = 0.25 ns (arrives at t=5)
Timeline:
t=0.00 A acquires resource
t=0.00 A: overhead (0 ns)
t=0.00 A: drain starts (16.0 ns)
t=5.00 B arrives → yield req → BLOCKED (A holds resource)
t=16.00 A: drain done → resource released
t=16.00 B acquires resource
t=16.00 B: overhead (0 ns)
t=16.25 B: drain done → resource released
B actual = 11.25 ns (waited 11.0 + own 0.25)
B formula = 0.25 ns
B queueing = 11.0 ns ← HOL blocking penalty
Why this is physically realistic: An HBM channel processes one burst at a
time. While data is being serialized onto the channel (drain), no other request
can use that channel. The FIFO ordering (simpy.Resource default) reflects
the simplest controller scheduling policy.
Alternative: priority scheduling: If needed, simpy.PriorityResource can
prioritize shorter requests (Shortest Job First), but this is not currently
used since FIFO matches typical HBM controller behavior.
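To make the trade-off concrete, here is a small pure-Python model of a capacity-1, non-preemptive server that compares FIFO with shortest-job-first ordering. It is an illustration only: it does not use simpy.PriorityResource, the simulate() helper is hypothetical, and the third request C is added for the example so that the two policies actually differ.

```python
import heapq


def simulate(requests, shortest_first):
    """Capacity-1, non-preemptive server. requests: (name, arrive_ns, service_ns)."""
    pending = sorted(requests, key=lambda r: r[1])
    waiting, latency, now, i = [], {}, 0.0, 0
    while i < len(pending) or waiting:
        if not waiting and pending[i][1] > now:
            now = pending[i][1]                       # idle until the next arrival
        while i < len(pending) and pending[i][1] <= now:
            name, arrive, service = pending[i]
            key = service if shortest_first else arrive
            heapq.heappush(waiting, (key, arrive, name, service))
            i += 1
        _, arrive, name, service = heapq.heappop(waiting)
        now += service                                # resource held for the whole service
        latency[name] = now - arrive
    return latency


reqs = [("A", 0.0, 16.0), ("C", 1.0, 16.0), ("B", 5.0, 0.25)]
print("FIFO:", simulate(reqs, shortest_first=False))
print("SJF: ", simulate(reqs, shortest_first=True))
```

Under FIFO the 64 B request B waits behind both 4 KB drains (27.25 ns); with shortest-first it overtakes C and finishes after 11.25 ns, at the cost of 0.25 ns for C. It still waits for A, because the drain in progress is never preempted.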
Worked Example: Two Concurrent PE DMA Reads
Setup: PE0 and PE1 in cube0 both read 4096 bytes from their local HBM slices (slice0 and slice1), submitted to the same engine at the same time.
Paths
DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
No Contention (different HBM slices)
Since slice0 and slice1 are separate hbm_ctrl instances, each with its own
simpy.Resource(capacity=1), there is no resource competition.
DMA A timeline:
t=0.000 pe_dma dequeues txn
t=0.000 xbar.pe0: overhead_ns=2.0 → t=2.000
t=2.000 wire prop (2.5mm × 0.01) → t=2.025
t=2.025 hbm_ctrl.slice0: yield req → immediate (no contention)
t=2.025 hbm_ctrl.slice0: overhead_ns=0 → t=2.025
t=2.025 drain_ns = 4096/256 = 16.0 → t=18.025
t=18.025 done
DMA B timeline: (identical, on its own slice)
t=0.000 → ... → t=18.025 done
Both complete at 18.025 ns in this simplified timeline, and actual == formula for both. The probe output reports 18.09 ns for this case: its Wire column is 0.08 ns, summed over all edges, rather than the single 0.025 ns segment shown here.
With Contention (same HBM slice)
Now suppose both PE0 and PE1 read from slice0:
DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → xbar.pe0 → hbm_ctrl.slice0
(chain traversal to reach slice0)
DMA A timeline:
t=0.00 xbar.pe0(2.0) → wire → hbm_ctrl.slice0
t=2.025 yield req → immediate (first to arrive)
t=18.025 drain 16.0 → release resource → done
actual_A = 18.025 ns (== formula)
DMA B timeline:
t=0.00 xbar.pe1(2.0) → xbar.pe0(2.0) → wire → hbm_ctrl.slice0
t=4.035 yield req → BLOCKED (A holds resource until t=18.025)
t=18.025 acquire resource
t=50.025 drain 32.0 → release → done
actual_B = 50.025 ns
B's drain is 32.0 ns rather than 16.0 ns because drain uses the bottleneck BW of B's own path (128 GB/s), while A's path runs at 256 GB/s:
B's bottleneck: xbar_x_bw = 128 GB/s → drain = 4096/128 = 32.0 ns
formula_B = wire(0.035) + overhead(4.0) + drain(32.0) = 36.035 ns
queueing = 18.025 − 4.035 = 13.99 ns (time waiting for A to release hbm_ctrl)
actual_B = formula_B + queueing = 36.035 + 13.99 = 50.025 ns
The key insight: queueing delay is not in the formula. It only appears in
the actual SimPy simulation when resources are contested. The probe reports
actual_ns, which includes all queueing. To see pure queueing overhead,
compare actual_ns vs formula_ns (available in PE DMA traces).
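The contention case can be checked without running the simulator by treating hbm_ctrl.slice0 as a FIFO server with capacity 1. The helper below is a hypothetical stand-in written for this example; the arrival times and bandwidths are the ones quoted in the worked example.

```python
def fifo_single_server(requests):
    """requests: (name, arrive_ns, service_ns), served in arrival order.

    Returns {name: (done_ns, queueing_ns)} for a capacity-1 resource.
    """
    free_at = 0.0
    out = {}
    for name, arrive, service in sorted(requests, key=lambda r: r[1]):
        start = max(arrive, free_at)    # wait if the resource is still held
        free_at = start + service       # held for the full service time
        out[name] = (free_at, start - arrive)
    return out


result = fifo_single_server([
    ("A", 2.025, 4096 / 256.0),   # after xbar.pe0 + wire; drains at 256 GB/s
    ("B", 4.035, 4096 / 128.0),   # after two xbars + wire; drains at 128 GB/s
])
for name, (done, queued) in result.items():
    print(f"{name}: done at {done:.3f} ns, queued {queued:.3f} ns")
```

Since both DMAs start at t=0, the completion time equals actual_ns. A finishes at 18.025 ns with no queueing, while B finishes at 50.025 ns, of which 13.99 ns is queueing on top of its 36.035 ns formula latency.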
Probe Output Explained
=== PE DMA Latency ===
Case Target Actual Ovhd Drain Wire Ovhd% Drain% Eff.BW BN.BW Util%
pe-local-hbm c0.pe0->c0.slice0 18.09 2.0 16.0 0.08 11.1% 88.5% 226.49 256.0 88.5%
pe-cross-half-hbm c0.pe0->c0.slice4 37.14 5.0 32.0 0.14 13.5% 86.1% 110.27 128.0 86.1%
| Column | Meaning |
|---|---|
| Actual | SimPy measured env.now delta (includes contention if any) |
| Ovhd | Sum of overhead_ns for all components on the forward path |
| Drain | nbytes / bottleneck_bw — serialization at terminal |
| Wire | Sum of distance_mm × ns_per_mm for all edges |
| Ovhd% | Ovhd / Actual × 100 — fraction of time spent in component processing |
| Drain% | Drain / Actual × 100 — fraction of time spent in data transfer |
| Eff.BW | nbytes / Actual — achieved bandwidth |
| BN.BW | Bottleneck bandwidth (min bw_gbs on path) |
| Util% | Eff.BW / BN.BW × 100 — how close to theoretical max BW |
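The derived columns follow directly from the measured ones. The sketch below recomputes them for the pe-local-hbm row. The function name is an assumption, and because it starts from the rounded Actual value (18.09), the results agree with the printed row only to within rounding.

```python
def probe_columns(nbytes, actual_ns, ovhd_ns, drain_ns, bn_bw_gbs):
    eff_bw = nbytes / actual_ns                  # GB/s, since bytes/ns == GB/s
    return {
        "Ovhd%": 100.0 * ovhd_ns / actual_ns,    # share of time in component processing
        "Drain%": 100.0 * drain_ns / actual_ns,  # share of time in data transfer
        "Eff.BW": eff_bw,                        # achieved bandwidth
        "Util%": 100.0 * eff_bw / bn_bw_gbs,     # achieved vs bottleneck bandwidth
    }


row = probe_columns(nbytes=4096, actual_ns=18.09, ovhd_ns=2.0,
                    drain_ns=16.0, bn_bw_gbs=256.0)
for key, value in row.items():
    print(f"{key:7s} {value:.2f}")
```

Util% and Drain% come out identical because drain_ns is defined as nbytes / BN.BW, so both reduce to drain_ns / actual_ns.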
Why Util% < 100%
Util% = Drain% = drain_ns / actual_ns. The gap from 100% is the overhead
fraction. For small transfers (4KB), overhead is significant relative to drain.
For large transfers, drain dominates and utilization approaches 100%.
4 KB: Ovhd=2.0, Drain=16.0 → Util=88.5% (overhead is 11% of time)
64 KB: Ovhd=2.0, Drain=256.0 → Util=99.2% (overhead is <1% of time)
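A quick way to see the trend is to sweep the transfer size while holding the fixed costs constant. The sketch assumes zero contention and reuses the overhead (2.0 ns), wire (0.08 ns), and bandwidth (256 GB/s) values from the pe-local-hbm probe row. The util_pct() name is hypothetical.

```python
def util_pct(nbytes, ovhd_ns=2.0, wire_ns=0.08, bw_gbs=256.0):
    drain = nbytes / bw_gbs                # serialization time in ns
    actual = ovhd_ns + wire_ns + drain     # zero-contention latency
    return 100.0 * drain / actual


for size in (4096, 65536, 1 << 20):
    print(f"{size:>8d} B  util = {util_pct(size):.1f}%")
# 4096 B -> 88.5%, 65536 B -> 99.2%, 1048576 B -> 99.9%
```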
H2D Path: Why Ovhd% is ~40%
H2D traverses many components (pcie_ep → io_cpu → ucie → noc → m_cpu → noc → xbar → hbm_ctrl + response path). Total forward overhead is ~23 ns vs drain of 32 ns for 4KB, so overhead is comparable to data transfer time and utilization is only ~55%. This is expected for small command-path transfers.