# Latency Model

## Overview

kernbench uses a discrete-event simulation (SimPy) to compute end-to-end latency. Every request flows through a graph of **components** connected by **wires**. The total latency reported is the **measured SimPy simulation time** (`env.now` delta), not a static formula and not host wall-clock time, so contention and queueing are captured automatically.

```
total_ns (actual) = wire_prop + component_overhead + drain + queueing
                    ├── deterministic ──────────────────┘        │
                    └── contention-dependent ────────────────────┘
```

## Three Deterministic Cost Components

### 1. Wire Propagation

```
wire_ns = distance_mm × ns_per_mm    (global: 0.01 = 10 ps/mm)
```

Every edge in the topology graph has a `distance_mm`. A SimPy wire process delays each message by `wire_ns` before delivering it to the next component. For on-chip silicon this is ~10 ps/mm; the same constant applies everywhere since all links are on-die or interposer. Wire propagation is typically <1 ns and negligible compared to other costs.

### 2. Component Overhead (`overhead_ns`)

```
component_ns = node.attrs["overhead_ns"]
```

Each component on the path adds a fixed processing delay via `yield env.timeout(overhead_ns)`. This models arbitration, protocol processing, pipeline stages, etc.

| Component | overhead_ns | Meaning |
|-----------|-------------|---------|
| pcie_ep | 5.0 | PCIe protocol processing |
| io_cpu | 10.0 | Command decode / dispatch |
| m_cpu | 5.0 | DMA scheduling |
| fabric switch | 5.0 | Packet arbitration |
| xbar | 2.0 | Crossbar arbitration |
| xbar bridge | 1.0 | Bridge traversal between xbar halves |
| ucie | 8.0 | UCIe protocol overhead per port (TX or RX; 16 ns per crossing) |
| noc (2D mesh) | 0.0 | Hop delay modeled internally via Manhattan distance |
| hbm_ctrl | 0.0 | Access time via drain_ns; efficiency=0.8 reduces edge BW (256→204.8) |
| pe_cpu | 2.0 | Command dispatch |
| pe_scheduler | 1.0 | PE-internal scheduling |
| pe_gemm/math | 0.0 | Placeholder; will use flops-based model |

### 3. Drain (Serialization Delay)

```
drain_ns = nbytes / bottleneck_bw_gbs
```

Because 1 GB/s equals 1 byte/ns, dividing bytes by GB/s yields nanoseconds directly.

**Wormhole (cut-through) model**: data flows through intermediate nodes as a pipeline. Serialization cost is paid **once** at the terminal node, not at every hop. The bottleneck is the minimum `bw_gbs` across all edges in the path.

Example: 4096 bytes through a path with bottleneck 128 GB/s → `4096 / 128 = 32.0 ns`.

### Formula (Theoretical Lower Bound)

```
formula_ns = Σ(wire_prop) + Σ(overhead_ns) + drain_ns
```

This is the latency with **zero contention**, that is, with no other request competing for any resource. The engine provides `_formula_latency()` for verification. With no contention: `actual == formula`. With contention: `actual > formula`.

### Diagram: PE DMA Read (pe0 → local slice0, 4096 bytes)

```mermaid
sequenceDiagram
    participant D as pe_dma
    participant X as xbar.pe0
    participant H as hbm_ctrl.slice0
    D->>X: txn (4096B)
    Note over X: overhead 2.0 ns
    X->>H: txn (wire 0.025 ns)
    Note over H: acquire Resource
    Note over H: overhead 0 ns
    Note over H: drain 4096/256 = 16.0 ns
    Note over H: release Resource
    H-->>D: done.succeed()
    Note over D,H: total_ns = 18.09 ns<br/>formula = wire(0.025) + ovhd(2.0) + drain(16.0) = 18.025 ns<br/>actual ≈ formula (no contention)
```

### Diagram: Two Requests, No Contention vs HOL Blocking

#### Case 1: Different slices (parallel, no contention)

```mermaid
sequenceDiagram
    participant A as Request A
    participant S0 as hbm_ctrl.slice0<br/>Resource(cap=1)
    participant S1 as hbm_ctrl.slice1<br/>Resource(cap=1)
    Note over A,S1: t=2 ns, both requests arrive at their own slice
    A->>S0: A (4KB)
    A->>S1: B (4KB)
    Note over S0: acquire (immediate)
    Note over S1: acquire (immediate)
    Note over S0: drain 16.0 ns
    Note over S1: drain 16.0 ns
    Note over S0: t=18 release
    Note over S1: t=18 release
    Note over A,S1: A actual = 18 ns, B actual = 18 ns<br/>No waiting, separate Resources
```

#### Case 2: Same slice (HOL blocking)

```mermaid
sequenceDiagram
    participant A as Request A (4KB)
    participant Q as hbm_ctrl.slice0<br/>Resource(cap=1)
    participant B as Request B (64B)
    Note over A,B: t=0, A arrives first
    A->>Q: acquire (immediate)
    Note over Q: drain A = 16.0 ns
    Note over B,Q: t=5, B arrives, yield req → BLOCKED
    B--xQ: waiting...
    Note over Q: t=16, A drain done, release
    Q->>B: B acquires resource
    Note over Q: drain B = 0.25 ns
    Note over Q: t=16.25, B done, release
    Note over A,B: A actual = 16.0 ns (== formula)<br/>B actual = 11.25 ns (formula 0.25 + queueing 11.0)<br/>HOL blocking: short request waits behind long drain
```

---

## How SimPy Tracks Latency

### Measurement

```python
start_ns = env.now
yield txn_done                    # wait for the transaction to complete
total_ns = env.now - start_ns     # ← this is what probe reports
```

`env.now` is SimPy's simulation clock. It advances only when the scheduler steps to the next pending event, so simulated time passes only while processes are suspended on a timeout or waiting on a resource/store. The delta between start and done captures **everything**: wire delays, component overheads, drain, and any queueing.

### Component Pipeline

Each component is built from SimPy processes:

```
_fan_in (per in_port) → _inbox (Store) → _worker → out_ports
```

1. **`_fan_in`**: relays messages from each `in_port` into a shared `_inbox` Store.
2. **`_worker`**: pulls from `_inbox`, spawns `_forward_txn` per message.
3. **`_forward_txn`**: calls `run()` (overhead), then puts to `out_ports[next_hop]`.

The worker uses `env.process()` (pipeline model), so multiple messages can be in-flight through the same component concurrently. Contention happens when they compete for shared resources (e.g., `simpy.Resource` in hbm_ctrl).

### Wire Process

```python
while True:
    msg = yield out_port.get()       # wait for sender
    yield env.timeout(prop_ns)       # propagation delay
    yield in_port.put(msg)           # deliver to receiver
```

Each directed edge has its own wire process. Messages are delayed by exactly `distance_mm × ns_per_mm`.

---

## Contention and Queueing

Queueing delay is **not a separate formula term**; it emerges from SimPy's event scheduling when multiple requests compete for the same resource.
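
To make that emergence concrete without running the full simulator, the sketch below replays the same-slice case (Case 2 in the diagram above) with a plain-Python recurrence for a capacity-1 FIFO server. It is not SimPy and not kernbench code: `fifo_finish_times` is a hypothetical helper, and it assumes requests are listed in arrival order with zero overhead at the resource.

```python
def fifo_finish_times(requests):
    """Replay a capacity-1 FIFO resource.

    requests: list of (arrival_ns, service_ns), sorted by arrival time.
    Returns a list of (actual_ns, queueing_ns), one entry per request.
    """
    results = []
    free_at = 0.0  # time at which the resource is next released
    for arrival_ns, service_ns in requests:
        start_ns = max(arrival_ns, free_at)   # wait if the resource is held
        free_at = start_ns + service_ns       # held for the whole service time
        actual_ns = free_at - arrival_ns      # what an env.now delta would show
        results.append((actual_ns, start_ns - arrival_ns))
    return results


# Request A: 4 KB at t=0 (drain 16.0 ns); request B: 64 B at t=5 (drain 0.25 ns)
results = fifo_finish_times([(0.0, 16.0), (5.0, 0.25)])
for name, (actual_ns, queueing_ns) in zip("AB", results):
    print(f"{name}: actual={actual_ns} ns, queueing={queueing_ns} ns")
```

No queueing term appears anywhere in the helper; B's 11.0 ns wait falls out of `max(arrival_ns, free_at)`, which mirrors how a blocked `yield req` resumes only once the holder releases.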

### Where Contention Occurs

| Resource | SimPy Type | Capacity | Effect |
|----------|-----------|----------|--------|
| hbm_ctrl | `simpy.Resource` | 1 | Serializes HBM access |
| m_cpu DMA read engine | `simpy.Resource` | 1 | Serializes DMA reads |
| m_cpu DMA write engine | `simpy.Resource` | 1 | Serializes DMA writes |
| pe_dma channels | `simpy.Resource` | configurable | Serializes PE DMA ops |
| component inbox | `simpy.Store` | unbounded | No backpressure (FIFO) |

### How Queueing Works

```python
# hbm_ctrl._worker
with self._resource.request() as req:
    yield req                          # ← BLOCKS if resource is occupied
    yield from self.run(env, txn.nbytes)
    yield env.timeout(drain_ns)
```

If request A holds the resource and request B arrives:

- B's `yield req` blocks until A releases the resource
- While B is suspended, the simulation clock (`env.now`) advances by A's remaining service time
- This "extra" time shows up in B's `total_ns` automatically

```
No contention:  actual_ns == formula_ns
Contention:     actual_ns >  formula_ns
                queueing_delay = actual_ns - formula_ns
```

### Head-of-Line (HOL) Blocking at hbm_ctrl

The `simpy.Resource` is held for the **entire** `with` block, covering both overhead and drain.
The resource is NOT released between overhead and drain:

```python
with self._resource.request() as req:
    yield req                              # acquire (or wait)
    yield from self.run(env, txn.nbytes)   # overhead_ns ─┐
    yield env.timeout(drain_ns)            # drain_ns     │ resource held
# ← resource released here ───────────────────────────────┘
```

This means a short request arriving during a long request's drain must wait for the full remaining drain time, which is classic head-of-line blocking:

```
Request A: 4 KB, drain = 16.0 ns  (arrives at t=0)
Request B: 64 B, drain = 0.25 ns  (arrives at t=5)

Timeline:
  t=0.00   A acquires resource
  t=0.00   A: overhead (0 ns)
  t=0.00   A: drain starts (16.0 ns)
  t=5.00   B arrives → yield req → BLOCKED (A holds resource)
  t=16.00  A: drain done → resource released
  t=16.00  B acquires resource
  t=16.00  B: overhead (0 ns)
  t=16.25  B: drain done → resource released

  B actual   = 11.25 ns  (waited 11.0 + own 0.25)
  B formula  = 0.25 ns
  B queueing = 11.0 ns   ← HOL blocking penalty
```

**Why this is physically realistic**: An HBM channel processes one burst at a time. While data is being serialized onto the channel (drain), no other request can use that channel. The FIFO ordering (`simpy.Resource` default) reflects the simplest controller scheduling policy.

**Alternative: priority scheduling**: If needed, `simpy.PriorityResource` can prioritize shorter requests (approximating Shortest Job First by using the request size as the priority), but this is not currently used since FIFO matches typical HBM controller behavior.

---

## Worked Example: Two Concurrent PE DMA Reads

Setup: PE0 and PE1 in cube0 both read 4096 bytes from their local HBM slices (slice0 and slice1), submitted to the **same engine** at the same time.

### Paths

```
DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
```

### No Contention (different HBM slices)

Since slice0 and slice1 are **separate** hbm_ctrl instances, each with its own `simpy.Resource(capacity=1)`, there is no resource competition.

```
DMA A timeline:
  t=0.00    pe_dma dequeues txn
  t=0.00    xbar.pe0: overhead_ns=2.0                        → t=2.00
  t=2.00    wire prop (2.5mm × 0.01 = 0.025)                 → t=2.025
  t=2.025   hbm_ctrl.slice0: yield req → immediate (no contention)
  t=2.025   hbm_ctrl.slice0: overhead_ns=0                   → t=2.025
  t=2.025   drain_ns = 4096/256 = 16.0                       → t=18.025
  t=18.025  done

DMA B timeline: (identical, on its own slice)
  t=0.00 → ... → t=18.025 done
```

Both complete at 18.025 ns in this timeline, which counts only the xbar → hbm_ctrl wire. The probe reports 18.09 ns for the same case, consistent with its Wire column summing about 0.08 ns over every edge on the path. `actual == formula` for both.

### With Contention (same HBM slice)

Now suppose both PE0 and PE1 read from **slice0**:

```
DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → xbar.pe0 → hbm_ctrl.slice0
       (chain traversal to reach slice0)
```

```
DMA A timeline:
  t=0.00    xbar.pe0(2.0) → wire → hbm_ctrl.slice0
  t=2.025   yield req → immediate (first to arrive)
  t=18.025  drain 16.0 → release resource → done
  actual_A = 18.025 ns (== formula)

DMA B timeline:
  t=0.00    xbar.pe1(2.0) → xbar.pe0(2.0) → wire → hbm_ctrl.slice0
  t=4.035   yield req → BLOCKED (A holds resource until t=18.025)
  t=18.025  acquire resource
  t=50.025  drain 32.0 → release → done

  B's drain uses the bottleneck BW of B's own path:
    xbar_x_bw = 128 GB/s → drain = 4096/128 = 32.0 ns
    (A's path bottleneck is 256 GB/s → drain = 16.0 ns)

  formula_B  = wire(0.035) + overhead(4.0) + drain(32.0) = 36.035 ns
  queueing_B = 18.025 - 4.035 = 13.99 ns   (waiting for A to release hbm_ctrl)
  actual_B   = formula_B + queueing_B ≈ 50.0 ns
```

The key insight: **queueing delay is not in the formula**. It only appears in the actual SimPy simulation when resources are contested. The probe reports `actual_ns`, which includes all queueing. To see pure queueing overhead, compare `actual_ns` vs `formula_ns` (available in PE DMA traces).
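
As a cross-check on the arithmetic in this example, here is a minimal sketch of the zero-contention formula plus the queueing difference. It is an illustration only: `formula_latency_ns` is a hypothetical stand-in (not the engine's `_formula_latency()`), the split of B's wire length across its two edges is assumed so that it totals the 0.035 ns quoted above, and the 128 GB/s edge is assumed to be the xbar-to-xbar link. Under those assumptions it gives formula_A = 18.025 ns, formula_B = 36.035 ns, and a queueing delay of about 13.99 ns for B.

```python
NS_PER_MM = 0.01  # global wire constant quoted in this document (10 ps/mm)


def formula_latency_ns(edges, overheads_ns, nbytes):
    """Zero-contention latency: sum(wire) + sum(overhead) + drain.

    edges: list of (distance_mm, bw_gbs) along the path.
    overheads_ns: overhead_ns of each component on the path.
    """
    wire_ns = sum(distance_mm * NS_PER_MM for distance_mm, _ in edges)
    bottleneck_bw_gbs = min(bw_gbs for _, bw_gbs in edges)
    drain_ns = nbytes / bottleneck_bw_gbs  # 1 GB/s == 1 byte/ns
    return wire_ns + sum(overheads_ns) + drain_ns


NBYTES = 4096

# DMA A: one 2.5 mm edge at 256 GB/s; xbar.pe0 (2.0) + hbm_ctrl (0.0)
formula_a = formula_latency_ns([(2.5, 256.0)], [2.0, 0.0], NBYTES)

# DMA B: edge lengths (1.0 mm + 2.5 mm) are assumed, chosen to total 0.035 ns
formula_b = formula_latency_ns(
    [(1.0, 128.0), (2.5, 256.0)], [2.0, 2.0, 0.0], NBYTES
)

# B reaches hbm_ctrl after its wire + overhead, then waits for A's release.
drain_b = NBYTES / 128.0
arrival_b = formula_b - drain_b             # time B issues `yield req`
actual_b = max(arrival_b, formula_a) + drain_b
queueing_b = actual_b - formula_b           # actual_ns - formula_ns

print(f"formula_A={formula_a:.3f} ns  formula_B={formula_b:.3f} ns")
print(f"actual_B={actual_b:.3f} ns  queueing_B={queueing_b:.3f} ns")
```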

---

## Probe Output Explained

```
=== PE DMA Latency ===
Case               Target              Actual  Ovhd  Drain  Wire  Ovhd%  Drain%  Eff.BW  BN.BW  Util%
pe-local-hbm       c0.pe0->c0.slice0   18.09   2.0   16.0   0.08  11.1%  88.5%   226.49  256.0  88.5%
pe-cross-half-hbm  c0.pe0->c0.slice4   37.14   5.0   32.0   0.14  13.5%  86.1%   110.27  128.0  86.1%
```

| Column | Meaning |
|--------|---------|
| **Actual** | SimPy measured `env.now` delta in ns (includes contention if any) |
| **Ovhd** | Sum of `overhead_ns` for all components on the forward path |
| **Drain** | `nbytes / bottleneck_bw`, the serialization time at the terminal node (ns) |
| **Wire** | Sum of `distance_mm × ns_per_mm` for all edges (ns) |
| **Ovhd%** | `Ovhd / Actual × 100`, the fraction of time spent in component processing |
| **Drain%** | `Drain / Actual × 100`, the fraction of time spent in data transfer |
| **Eff.BW** | `nbytes / Actual`, the achieved bandwidth (GB/s) |
| **BN.BW** | Bottleneck bandwidth (min `bw_gbs` on path, GB/s) |
| **Util%** | `Eff.BW / BN.BW × 100`, how close the transfer gets to the theoretical max BW |

### Why Util% < 100%

`Util% = Drain% = drain_ns / actual_ns`. The gap from 100% is the overhead fraction. For small transfers (4KB), overhead is significant relative to drain. For large transfers, drain dominates and utilization approaches 100%.

```
4 KB:   Ovhd=2.0, Drain=16.0   → Util=88.5%  (overhead is 11% of time)
64 KB:  Ovhd=2.0, Drain=256.0  → Util=99.2%  (overhead is <1% of time)
```

### H2D Path: Why Ovhd% is ~40%

H2D traverses many components (pcie_ep → io_cpu → ucie → noc → m_cpu → noc → xbar → hbm_ctrl + response path). Total forward overhead is ~23 ns vs drain of 32 ns for 4KB, so overhead is comparable to data transfer time, resulting in ~55% utilization. This is expected for small command-path transfers.
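
The derived probe columns follow directly from the raw timings, which the sketch below recomputes for the `pe-local-hbm` row and for the 4 KB versus 64 KB comparison. `probe_columns` is a hypothetical helper written for this illustration, not the probe's actual code, and the 0.08 ns wire term in the sweep is taken from the Wire column of that row.

```python
def probe_columns(nbytes, actual_ns, ovhd_ns, drain_ns):
    """Derive the percentage and bandwidth columns from the raw timings."""
    eff_bw_gbs = nbytes / actual_ns   # achieved bandwidth (bytes/ns == GB/s)
    bn_bw_gbs = nbytes / drain_ns     # bottleneck BW implied by the drain time
    return {
        "ovhd_pct": 100.0 * ovhd_ns / actual_ns,
        "drain_pct": 100.0 * drain_ns / actual_ns,
        "eff_bw_gbs": eff_bw_gbs,
        "bn_bw_gbs": bn_bw_gbs,
        "util_pct": 100.0 * eff_bw_gbs / bn_bw_gbs,
    }


# pe-local-hbm row: 4096 bytes, Actual 18.09, Ovhd 2.0, Drain 16.0
row = probe_columns(4096, 18.09, 2.0, 16.0)
print({name: round(value, 1) for name, value in row.items()})

# Util% as a function of transfer size on the same path (Ovhd 2.0, 256 GB/s)
sweep = {}
for nbytes in (4096, 65536):
    drain_ns = nbytes / 256.0
    actual_ns = 2.0 + drain_ns + 0.08   # overhead + drain + wire, no contention
    sweep[nbytes] = 100.0 * drain_ns / actual_ns
    print(f"{nbytes} B: util={sweep[nbytes]:.1f}%")
```

Because the inputs here are the rounded values printed in the table, the recomputed columns can differ from the probe output in the last digit. The sweep shows the same trend as the 4 KB and 64 KB figures above: roughly 88.5% rising to roughly 99.2% as drain comes to dominate.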