Latency Model

Overview

kernbench uses a discrete-event simulation (SimPy) to compute end-to-end latency. Every request flows through a graph of components connected by wires. The reported total latency is the elapsed simulated time (the env.now delta), not a static formula, so contention and queueing are captured automatically.

total_ns (actual) = wire_prop + component_overhead + drain + queueing
                    ├── deterministic ──────────────────┘       │
                    └── contention-dependent ────────────────────┘

Three Deterministic Cost Components

1. Wire Propagation

wire_ns = distance_mm × ns_per_mm       (global: 0.01 = 10 ps/mm)

Every edge in the topology graph has a distance_mm. A SimPy wire process delays each message by wire_ns before delivering it to the next component. For on-chip silicon this is ~10 ps/mm; the same constant applies everywhere since all links are on-die or interposer. Wire propagation is typically <1 ns and negligible compared to other costs.

2. Component Overhead (`overhead_ns`)

component_ns = node.attrs["overhead_ns"]

Each component on the path adds a fixed processing delay via yield env.timeout(overhead_ns). This models arbitration, protocol processing, pipeline stages, etc.

Component	overhead_ns	Meaning
pcie_ep	5.0	PCIe protocol processing
io_cpu	10.0	Command decode / dispatch
m_cpu	5.0	DMA scheduling
fabric switch	5.0	Packet arbitration
xbar	2.0	Crossbar arbitration
xbar bridge	1.0	Bridge traversal between xbar halves
ucie	8.0	UCIe protocol overhead per port (TX or RX; 16ns per crossing)
noc (2D mesh)	0.0	Hop delay modeled internally via manhattan distance
hbm_ctrl	0.0	Access time via drain_ns; efficiency=0.8 reduces edge BW (256→204.8)
pe_cpu	2.0	Command dispatch
pe_scheduler	1.0	PE-internal scheduling
pe_gemm/math	0.0	Placeholder; will use flops-based model

3. Drain (Serialization Delay)

drain_ns = nbytes / bottleneck_bw_gbs

Wormhole (cut-through) model: data flows through intermediate nodes as a pipeline. Serialization cost is paid once at the terminal node, not at every hop. The bottleneck is the minimum bw_gbs across all edges in the path.

Example: 4096 bytes through a path with bottleneck 128 GB/s → 4096 / 128 = 32.0 ns.
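
As a minimal sketch of the drain rule above (the function name and the list of edge bandwidths are illustrative, not kernbench's API), the bottleneck is simply the minimum bandwidth on the path. Because 1 GB/s equals 1 byte/ns, dividing bytes by GB/s gives nanoseconds directly:

```python
def drain_ns(nbytes: int, edge_bw_gbs: list[float]) -> float:
    """Serialization delay, paid once at the terminal node (wormhole model)."""
    bottleneck = min(edge_bw_gbs)  # the slowest edge limits the whole path
    return nbytes / bottleneck     # bytes / (GB/s) == ns


# Hypothetical path whose edges run at 256, 128 and 256 GB/s.
print(drain_ns(4096, [256.0, 128.0, 256.0]))  # 32.0
print(drain_ns(64, [256.0]))                  # 0.25
```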

Formula (Theoretical Lower Bound)

formula_ns = Σ(wire_prop) + Σ(overhead_ns) + drain_ns

This is the latency with zero contention—no other request competing for any resource. The engine provides _formula_latency() for verification. With no contention: actual == formula. With contention: actual > formula.
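
The three terms can be combined in a few lines of plain Python. This is only a sketch of the formula, not the engine's _formula_latency(): the Hop structure is hypothetical, and the 0 mm distance on the first hop is an assumption that mirrors the simplified PE-local read used elsewhere in this document (xbar overhead 2.0 ns, 2.5 mm wire, 256 GB/s).

```python
from dataclasses import dataclass

NS_PER_MM = 0.01  # global wire constant from this document (10 ps/mm)


@dataclass
class Hop:
    """One edge plus the component it leads into (illustrative structure)."""
    distance_mm: float
    bw_gbs: float
    overhead_ns: float


def formula_ns(hops: list[Hop], nbytes: int) -> float:
    wire = sum(h.distance_mm * NS_PER_MM for h in hops)  # per-edge propagation
    overhead = sum(h.overhead_ns for h in hops)          # per-component delay
    drain = nbytes / min(h.bw_gbs for h in hops)         # paid once, at the end
    return wire + overhead + drain


# pe_dma -> xbar.pe0 -> hbm_ctrl.slice0
path = [
    Hop(distance_mm=0.0, bw_gbs=256.0, overhead_ns=2.0),  # into xbar (distance assumed 0)
    Hop(distance_mm=2.5, bw_gbs=256.0, overhead_ns=0.0),  # into hbm_ctrl
]
print(round(formula_ns(path, 4096), 3))
```

With these inputs the result is 0.025 + 2.0 + 16.0 = 18.025 ns, the zero-contention lower bound for that path.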

Diagram: PE DMA Read (pe0 → local slice0, 4096 bytes)

sequenceDiagram
    participant D as pe_dma
    participant X as xbar.pe0
    participant H as hbm_ctrl.slice0

    D->>X: txn (4096B)
    Note over X: overhead 2.0 ns
    X->>H: txn (wire 0.025 ns)
    Note over H: acquire Resource
    Note over H: overhead 0 ns
    Note over H: drain 4096/256 = 16.0 ns
    Note over H: release Resource
    H-->>D: done.succeed()

    Note over D,H: total_ns = 18.09 ns<br/>formula = wire(0.025) + ovhd(2.0) + drain(16.0) = 18.025 ns<br/>actual ≈ formula (no contention)

Diagram: Two Requests — No Contention vs HOL Blocking

Case 1: Different slices (parallel, no contention)

sequenceDiagram
    participant A as Requests A, B
    participant S0 as hbm_ctrl.slice0<br/>Resource(cap=1)
    participant S1 as hbm_ctrl.slice1<br/>Resource(cap=1)

    Note over A,S1: t=2 ns — both requests arrive at their own slice
    A->>S0: A (4KB)
    A->>S1: B (4KB)
    Note over S0: acquire (immediate)
    Note over S1: acquire (immediate)
    Note over S0: drain 16.0 ns
    Note over S1: drain 16.0 ns
    Note over S0: t=18 release
    Note over S1: t=18 release

    Note over A,S1: A actual = 18 ns, B actual = 18 ns<br/>No waiting — separate Resources

Case 2: Same slice (HOL blocking)

sequenceDiagram
    participant A as Request A (4KB)
    participant Q as hbm_ctrl.slice0<br/>Resource(cap=1)
    participant B as Request B (64B)

    Note over A,B: t=0 — A arrives first
    A->>Q: acquire (immediate)
    Note over Q: drain A = 16.0 ns

    Note over B,Q: t=5 — B arrives, yield req → BLOCKED
    B--xQ: waiting...

    Note over Q: t=16 — A drain done, release
    Q->>B: B acquires resource
    Note over Q: drain B = 0.25 ns
    Note over Q: t=16.25 — B done, release

    Note over A,B: A actual = 16.0 ns (== formula)<br/>B actual = 11.25 ns (formula 0.25 + queueing 11.0)<br/>HOL blocking: short request waits behind long drain

How SimPy Tracks Latency

Measurement

start_ns = env.now
yield txn_done          # wait for the transaction to complete
total_ns = env.now - start_ns     # ← this is what probe reports

env.now is SimPy's simulation clock. It only advances when a process yields a timeout or waits on a resource/store. The delta between start and done captures everything: wire delays, component overheads, drain, and any queueing.

Component Pipeline

Each component is a SimPy process:

_fan_in (per in_port)  →  _inbox (Store)  →  _worker  →  out_ports

_fan_in: relays messages from each in_port into a shared _inbox Store.
_worker: pulls from _inbox, spawns _forward_txn per message.
_forward_txn: calls run() (overhead), then puts to out_ports[next_hop].

The worker uses env.process() (pipeline model), so multiple messages can be in-flight through the same component concurrently. Contention happens when they compete for shared resources (e.g., simpy.Resource in hbm_ctrl).

Wire Process

while True:
    msg = yield out_port.get()      # wait for sender
    yield env.timeout(prop_ns)      # propagation delay
    yield in_port.put(msg)          # deliver to receiver

Each directed edge has its own wire process. Messages are delayed by exactly distance_mm × ns_per_mm.

Contention and Queueing

Queueing delay is not a separate formula term—it emerges from SimPy's event scheduling when multiple requests compete for the same resource.

Where Contention Occurs

Resource	SimPy Type	Capacity	Effect
hbm_ctrl	`simpy.Resource`	1	Serializes HBM access
m_cpu DMA read engine	`simpy.Resource`	1	Serializes DMA reads
m_cpu DMA write engine	`simpy.Resource`	1	Serializes DMA writes
pe_dma channels	`simpy.Resource`	configurable	Serializes PE DMA ops
component inbox	`simpy.Store`	unbounded	No backpressure (FIFO)

How Queueing Works

# hbm_ctrl._worker
with self._resource.request() as req:
    yield req                     # ← BLOCKS if resource is occupied
    yield from self.run(env, txn.nbytes)
    yield env.timeout(drain_ns)

If request A holds the resource and request B arrives:

B's yield req blocks until A releases the resource
B resumes only after the shared simulation clock (env.now) has advanced by A's remaining service time
This "extra" time shows up in B's total_ns automatically

No contention:  actual_ns == formula_ns
Contention:     actual_ns  > formula_ns
                queueing_delay = actual_ns - formula_ns

Head-of-Line (HOL) Blocking at hbm_ctrl

The simpy.Resource is held for the entire with block—both overhead and drain. The resource is NOT released between overhead and drain:

with self._resource.request() as req:
    yield req                              # acquire (or wait)
    yield from self.run(env, txn.nbytes)   # overhead_ns  ─┐
    yield env.timeout(drain_ns)            # drain_ns      │ resource held
# ← resource released here ───────────────────────────────┘

This means a short request arriving during a long request's drain must wait for the full remaining drain time—classic head-of-line blocking:

Request A: 4 KB,  drain = 16.0 ns   (arrives at t=0)
Request B: 64 B,  drain = 0.25 ns   (arrives at t=5)

Timeline:
  t=0.00   A acquires resource
  t=0.00   A: overhead (0 ns)
  t=0.00   A: drain starts (16.0 ns)
  t=5.00   B arrives → yield req → BLOCKED (A holds resource)
  t=16.00  A: drain done → resource released
  t=16.00  B acquires resource
  t=16.00  B: overhead (0 ns)
  t=16.25  B: drain done → resource released

  B actual  = 11.25 ns (waited 11.0 + own 0.25)
  B formula = 0.25 ns
  B queueing = 11.0 ns  ← HOL blocking penalty
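
The timeline above can be reproduced without SimPy by modeling a capacity-1 FIFO resource as a single "free at" timestamp. This is a sketch for checking the arithmetic, not the simulator's code; the function name and tuple layout are hypothetical.

```python
def fifo_single_server(requests):
    """requests: (name, arrival_ns, service_ns) tuples sorted by arrival.

    Returns {name: (start_ns, finish_ns, queueing_ns)} for a capacity-1
    FIFO resource that is held for the whole service time.
    """
    free_at = 0.0
    result = {}
    for name, arrival, service in requests:
        start = max(arrival, free_at)  # wait while the resource is held
        finish = start + service
        result[name] = (start, finish, start - arrival)
        free_at = finish
    return result


timeline = fifo_single_server([("A", 0.0, 16.0), ("B", 5.0, 0.25)])
arrival_b = 5.0
start_b, finish_b, queueing_b = timeline["B"]
print("B actual  =", finish_b - arrival_b)  # 11.25
print("B queueing =", queueing_b)           # 11.0
```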

Why this is physically realistic: An HBM channel processes one burst at a time. While data is being serialized onto the channel (drain), no other request can use that channel. The FIFO ordering (simpy.Resource default) reflects the simplest controller scheduling policy.

Alternative (priority scheduling): If needed, simpy.PriorityResource can favor shorter requests (for example Shortest Job First, by passing the request size as the priority), but this is not currently used since FIFO matches typical HBM controller behavior.

Worked Example: Two Concurrent PE DMA Reads

Setup: PE0 and PE1 in cube0 both read 4096 bytes from their local HBM slices (slice0 and slice1), submitted to the same engine at the same time.

Paths

DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1

No Contention (different HBM slices)

Since slice0 and slice1 are separate hbm_ctrl instances, each with its own simpy.Resource(capacity=1), there is no resource competition.

DMA A timeline:
  t=0.00   pe_dma dequeues txn
  t=0.00   xbar.pe0: overhead_ns=2.0 → t=2.00
  t=2.00   wire prop (2.5mm × 0.01 = 0.025) → t=2.025
  t=2.025  hbm_ctrl.slice0: yield req → immediate (no contention)
  t=2.025  hbm_ctrl.slice0: overhead_ns=0 → t=2.025
  t=2.025  drain_ns = 4096/256 = 16.0 → t=18.025
  t=18.025 done

DMA B timeline: (identical, on its own slice)
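  t=0.00   → ... → t=18.025 done

Both complete at 18.025 ns in this simplified timeline (the probe output below reports 18.09 ns for this case, with 0.08 ns of total wire delay). actual == formula for both.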

With Contention (same HBM slice)

Now suppose both PE0 and PE1 read from slice0:

DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → xbar.pe0 → hbm_ctrl.slice0
                                (chain traversal to reach slice0)

DMA A timeline:
  t=0.00   xbar.pe0(2.0) → wire → hbm_ctrl.slice0
  t=2.025  yield req → immediate (first to arrive)
  t=18.025 drain 16.0 → release resource → done
  actual_A = 18.025 ns (== formula)

DMA B timeline:
  t=0.00   xbar.pe1(2.0) → xbar.pe0(2.0) → wire → hbm_ctrl.slice0
  t=4.035  yield req → BLOCKED (A holds resource until t=18.025)
  t=18.025 acquire resource
  t=50.025 drain 32.0 → release → done
  actual_B = 50.025 ns

  B's drain uses the bottleneck BW of B's own path (xbar_x_bw = 128 GB/s),
  not the 256 GB/s of A's path:

  drain_B   = 4096/128 = 32.0 ns
  formula_B = wire(0.035) + overhead(4.0) + drain(32.0) = 36.035 ns
  actual_B  = formula_B + queueing = 50.025 ns
  queueing  = 18.025 - 4.035 = 13.99 ns (time waiting for A to release hbm_ctrl)
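
A short plain-Python check of the same-slice case, taking B's drain as the recalculated 32.0 ns (128 GB/s bottleneck) and A's as 16.0 ns. The helper name is hypothetical, and both requests are assumed to be submitted at t=0, as in the setup.

```python
NBYTES = 4096


def arrival_and_drain(overhead_ns, wire_ns, bottleneck_bw_gbs):
    """Time the request reaches hbm_ctrl, and how long it then holds it."""
    return overhead_ns + wire_ns, NBYTES / bottleneck_bw_gbs


arr_a, drain_a = arrival_and_drain(2.0, 0.025, 256.0)  # A: one xbar hop
arr_b, drain_b = arrival_and_drain(4.0, 0.035, 128.0)  # B: two xbar hops

done_a = arr_a + drain_a       # A arrives first and is never blocked
start_b = max(arr_b, done_a)   # B waits until A releases the resource
done_b = start_b + drain_b

formula_b = arr_b + drain_b    # zero-contention lower bound for B
queueing_b = done_b - formula_b

print(f"actual_A  = {done_a:.3f} ns")
print(f"actual_B  = {done_b:.3f} ns")
print(f"formula_B = {formula_b:.3f} ns")
print(f"queueing  = {queueing_b:.3f} ns")
```

Under these assumptions B finishes at 50.025 ns, of which 13.99 ns is pure queueing behind A.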

The key insight: queueing delay is not in the formula. It only appears in the actual SimPy simulation when resources are contested. The probe reports actual_ns, which includes all queueing. To see pure queueing overhead, compare actual_ns vs formula_ns (available in PE DMA traces).

Probe Output Explained

=== PE DMA Latency ===
Case                Target              Actual  Ovhd  Drain  Wire  Ovhd% Drain%  Eff.BW   BN.BW   Util%
pe-local-hbm        c0.pe0->c0.slice0    18.09   2.0  16.0  0.08  11.1% 88.5%   226.49   256.0   88.5%
pe-cross-half-hbm   c0.pe0->c0.slice4    37.14   5.0  32.0  0.14  13.5% 86.1%   110.27   128.0   86.1%

Column	Meaning
Actual	SimPy measured `env.now` delta (includes contention if any)
Ovhd	Sum of `overhead_ns` for all components on the forward path
Drain	`nbytes / bottleneck_bw` — serialization at terminal
Wire	Sum of `distance_mm × ns_per_mm` for all edges
Ovhd%	`Ovhd / Actual × 100` — fraction of time spent in component processing
Drain%	`Drain / Actual × 100` — fraction of time spent in data transfer
Eff.BW	`nbytes / Actual` — achieved bandwidth
BN.BW	Bottleneck bandwidth (min `bw_gbs` on path)
Util%	`Eff.BW / BN.BW × 100` — how close to theoretical max BW
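
The derived columns follow directly from Actual, Ovhd, Drain and BN.BW. The sketch below recomputes them for the pe-local-hbm row; the function is hypothetical, not the probe's code. Because it starts from the rounded Actual value printed in the table (18.09), its results can differ from the probe's in the last displayed digit.

```python
def probe_columns(nbytes, actual_ns, ovhd_ns, drain_ns, bn_bw_gbs):
    """Recompute the percentage and bandwidth columns of one probe row."""
    eff_bw = nbytes / actual_ns  # bytes/ns == GB/s
    return {
        "Ovhd%": 100 * ovhd_ns / actual_ns,
        "Drain%": 100 * drain_ns / actual_ns,
        "Eff.BW": eff_bw,
        "Util%": 100 * eff_bw / bn_bw_gbs,
    }


row = probe_columns(nbytes=4096, actual_ns=18.09, ovhd_ns=2.0,
                    drain_ns=16.0, bn_bw_gbs=256.0)
for column, value in row.items():
    print(f"{column:7s} {value:8.2f}")
```

Util% and Drain% come out identical here because Drain is defined as nbytes / BN.BW.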

Why Util% < 100%

Util% = Drain% = drain_ns / actual_ns. The gap from 100% is the overhead fraction. For small transfers (4KB), overhead is significant relative to drain. For large transfers, drain dominates and utilization approaches 100%.

  4 KB:  Ovhd=2.0, Drain=16.0  → Util=88.5%   (overhead is 11% of time)
 64 KB:  Ovhd=2.0, Drain=256.0 → Util=99.2%   (overhead is <1% of time)
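
To see how utilization approaches 100% as transfers grow, the sketch below sweeps the transfer size for the PE-local path. It assumes no contention (actual == formula) and reuses the overhead (2.0 ns), wire (0.08 ns) and bandwidth (256 GB/s) figures from the pe-local-hbm probe row; the function name is illustrative.

```python
def util_pct(nbytes, ovhd_ns=2.0, wire_ns=0.08, bw_gbs=256.0):
    """Utilization = drain / actual, with actual taken as the formula value."""
    drain = nbytes / bw_gbs
    return 100 * drain / (ovhd_ns + wire_ns + drain)


for kib in (4, 64, 1024):
    print(f"{kib:5d} KB: Util = {util_pct(kib * 1024):.1f}%")
```

The 4 KB and 64 KB lines reproduce the 88.5% and 99.2% figures above; the fixed overhead and wire costs are amortized over a longer drain as the transfer grows.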

H2D Path: Why Ovhd% is ~40%

H2D traverses many components (pcie_ep → io_cpu → ucie → noc → m_cpu → noc → xbar → hbm_ctrl + response path). Total forward overhead is ~23 ns vs drain of 32 ns for 4KB, so overhead is comparable to data transfer time—resulting in ~55% utilization. This is expected for small command-path transfers.
