commit - release 1

2026-03-18 11:47:48 -07:00
commit 6f43807900
109 changed files with 14909 additions and 0 deletions
@@ -0,0 +1,381 @@
+# Latency Model
+
+## Overview
+
+kernbench uses a discrete-event simulation (SimPy) to compute end-to-end latency.
+Every request flows through a graph of **components** connected by **wires**.
+The total latency reported is the **measured SimPy simulation time** (`env.now` delta),
+not a static formula, so contention and queueing are captured automatically.
+
+```
+total_ns (actual) = wire_prop + component_overhead + drain + queueing
+                    ├── deterministic ──────────────────┘       │
+                    └── contention-dependent ────────────────────┘
+```
+
+## Three Deterministic Cost Components
+
+### 1. Wire Propagation
+
+```
+wire_ns = distance_mm × ns_per_mm       (global: 0.01 = 10 ps/mm)
+```
+
+Every edge in the topology graph has a `distance_mm`. A SimPy wire process
+delays each message by `wire_ns` before delivering it to the next component.
+For on-chip silicon this is ~10 ps/mm; the same constant applies everywhere
+since all links are on-die or interposer. Wire propagation is typically <1 ns
+and negligible compared to other costs.
+
+### 2. Component Overhead (`overhead_ns`)
+
+```
+component_ns = node.attrs["overhead_ns"]
+```
+
+Each component on the path adds a fixed processing delay via `yield env.timeout(overhead_ns)`.
+This models arbitration, protocol processing, pipeline stages, etc.
+
+| Component | overhead_ns | Meaning |
+|-----------|-------------|---------|
+| pcie_ep | 5.0 | PCIe protocol processing |
+| io_cpu | 10.0 | Command decode / dispatch |
+| m_cpu | 5.0 | DMA scheduling |
+| fabric switch | 5.0 | Packet arbitration |
+| xbar | 2.0 | Crossbar arbitration |
+| xbar bridge | 1.0 | Bridge traversal between xbar halves |
+| ucie | 1.0 | UCIe protocol overhead per port |
+| noc (2D mesh) | 0.0 | Hop delay modeled internally via Manhattan distance |
+| hbm_ctrl | 0.0 | Access time captured in drain_ns |
+| pe_cpu | 2.0 | Command dispatch |
+| pe_scheduler | 1.0 | PE-internal scheduling |
+| pe_gemm/math | 0.0 | Placeholder; will use flops-based model |
+
+### 3. Drain (Serialization Delay)
+
+```
+drain_ns = nbytes / bottleneck_bw_gbs   (bytes ÷ GB/s = ns, since 1 GB/s = 1 byte/ns)
+```
+
+**Wormhole (cut-through) model**: data flows through intermediate nodes as a
+pipeline. Serialization cost is paid **once** at the terminal node, not at
+every hop. The bottleneck is the minimum `bw_gbs` across all edges in the path.
+
+Example: 4096 bytes through a path with bottleneck 128 GB/s → `4096 / 128 = 32.0 ns`.
+
+### Formula (Theoretical Lower Bound)
+
+```
+formula_ns = Σ(wire_prop) + Σ(overhead_ns) + drain_ns
+```
+
+This is the latency with **zero contention**—no other request competing for
+any resource. The engine provides `_formula_latency()` for verification.
+With no contention: `actual == formula`. With contention: `actual > formula`.
+
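As a rough check of the formula above, the following sketch sums the three deterministic terms for a path. It is not the engine's `_formula_latency()`; the `formula_ns` name and its argument layout are hypothetical, and the sample values come from the PE-local read (one 2.5 mm edge at 256 GB/s, xbar overhead 2.0 ns, hbm_ctrl overhead 0.0 ns).

```python
NS_PER_MM = 0.01  # global wire constant from this document (10 ps/mm)


def formula_ns(edges, overheads_ns, nbytes):
    """Zero-contention latency (hypothetical helper, not the engine's code).

    edges:        list of (distance_mm, bw_gbs) tuples along the path
    overheads_ns: overhead_ns of each component on the path
    nbytes:       transfer size in bytes
    """
    wire = sum(distance_mm * NS_PER_MM for distance_mm, _ in edges)
    overhead = sum(overheads_ns)
    bottleneck_bw = min(bw_gbs for _, bw_gbs in edges)  # slowest edge
    drain = nbytes / bottleneck_bw                      # paid once (wormhole)
    return wire + overhead + drain


# pe_dma -> xbar.pe0 -> hbm_ctrl.slice0, 4096 bytes
local_ns = formula_ns(edges=[(2.5, 256.0)], overheads_ns=[2.0, 0.0], nbytes=4096)
```

For this path the result is about 18.025 ns, which matches the formula line in the PE DMA read diagram.
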
+### Diagram: PE DMA Read (pe0 → local slice0, 4096 bytes)
+
+```mermaid
+sequenceDiagram
+    participant D as pe_dma
+    participant X as xbar.pe0
+    participant H as hbm_ctrl.slice0
+
+    D->>X: txn (4096B)
+    Note over X: overhead 2.0 ns
+    X->>H: txn (wire 0.025 ns)
+    Note over H: acquire Resource
+    Note over H: overhead 0 ns
+    Note over H: drain 4096/256 = 16.0 ns
+    Note over H: release Resource
+    H-->>D: done.succeed()
+
+    Note over D,H: total_ns = 18.09 ns<br/>formula = wire(0.025) + ovhd(2.0) + drain(16.0) = 18.025 ns<br/>actual ≈ formula (no contention)
+```
+
+### Diagram: Two Requests — No Contention vs HOL Blocking
+
+#### Case 1: Different slices (parallel, no contention)
+
+```mermaid
+sequenceDiagram
+    participant A as Requests A and B
+    participant S0 as hbm_ctrl.slice0<br/>Resource(cap=1)
+    participant S1 as hbm_ctrl.slice1<br/>Resource(cap=1)
+
+    Note over A,S1: t=2 ns — both requests arrive at their own slice
+    A->>S0: A (4KB)
+    A->>S1: B (4KB)
+    Note over S0: acquire (immediate)
+    Note over S1: acquire (immediate)
+    Note over S0: drain 16.0 ns
+    Note over S1: drain 16.0 ns
+    Note over S0: t=18 release
+    Note over S1: t=18 release
+
+    Note over A,S1: A actual = 18 ns, B actual = 18 ns<br/>No waiting — separate Resources
+```
+
+#### Case 2: Same slice (HOL blocking)
+
+```mermaid
+sequenceDiagram
+    participant A as Request A (4KB)
+    participant Q as hbm_ctrl.slice0<br/>Resource(cap=1)
+    participant B as Request B (64B)
+
+    Note over A,B: t=0 — A arrives first
+    A->>Q: acquire (immediate)
+    Note over Q: drain A = 16.0 ns
+
+    Note over B,Q: t=5 — B arrives, yield req → BLOCKED
+    B--xQ: waiting...
+
+    Note over Q: t=16 — A drain done, release
+    Q->>B: B acquires resource
+    Note over Q: drain B = 0.25 ns
+    Note over Q: t=16.25 — B done, release
+
+    Note over A,B: A actual = 16.0 ns (== formula)<br/>B actual = 11.25 ns (formula 0.25 + queueing 11.0)<br/>HOL blocking: short request waits behind long drain
+```
+
+---
+
+## How SimPy Tracks Latency
+
+### Measurement
+
+```python
+start_ns = env.now
+yield txn_done          # wait for the transaction to complete
+total_ns = env.now - start_ns     # ← this is what probe reports
+```
+
+`env.now` is SimPy's simulation clock, not wall-clock time. It advances only when
+SimPy steps to the next scheduled event, such as a timeout expiring or a
+resource/store request being granted. The delta between start and done captures
+**everything**: wire delays, component overheads, drain, and any queueing.
+
+### Component Pipeline
+
+Each component runs as a small set of SimPy processes:
+
+```
+_fan_in (per in_port)  →  _inbox (Store)  →  _worker  →  out_ports
+```
+
+1. **`_fan_in`**: relays messages from each `in_port` into a shared `_inbox` Store.
+2. **`_worker`**: pulls from `_inbox`, spawns `_forward_txn` per message.
+3. **`_forward_txn`**: calls `run()` (overhead), then puts to `out_ports[next_hop]`.
+
+The worker uses `env.process()` (pipeline model), so multiple messages can be
+in-flight through the same component concurrently. Contention happens when
+they compete for shared resources (e.g., `simpy.Resource` in hbm_ctrl).
+
+### Wire Process
+
+```python
+while True:
+    msg = yield out_port.get()      # wait for sender
+    yield env.timeout(prop_ns)      # propagation delay
+    yield in_port.put(msg)          # deliver to receiver
+```
+
+Each directed edge has its own wire process. Messages are delayed by exactly
+`distance_mm × ns_per_mm`.
+
+---
+
+## Contention and Queueing
+
+Queueing delay is **not a separate formula term**—it emerges from SimPy's
+event scheduling when multiple requests compete for the same resource.
+
+### Where Contention Occurs
+
+| Resource | SimPy Type | Capacity | Effect |
+|----------|-----------|----------|--------|
+| hbm_ctrl | `simpy.Resource` | 1 | Serializes HBM access |
+| m_cpu DMA read engine | `simpy.Resource` | 1 | Serializes DMA reads |
+| m_cpu DMA write engine | `simpy.Resource` | 1 | Serializes DMA writes |
+| pe_dma channels | `simpy.Resource` | configurable | Serializes PE DMA ops |
+| component inbox | `simpy.Store` | unbounded | No backpressure (FIFO) |
+
+### How Queueing Works
+
+```python
+# hbm_ctrl._worker
+with self._resource.request() as req:
+    yield req                     # ← BLOCKS if resource is occupied
+    yield from self.run(env, txn.nbytes)
+    yield env.timeout(drain_ns)
+```
+
+If request A holds the resource and request B arrives:
+- B's `yield req` blocks until A releases the resource
+- Simulated time keeps advancing while B is suspended, so B resumes only after A's remaining service time has elapsed
+- This "extra" time shows up in B's `total_ns` automatically
+
+```
+No contention:  actual_ns == formula_ns
+Contention:     actual_ns  > formula_ns
+                queueing_delay = actual_ns - formula_ns
+```
+
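The blocking behavior can be reproduced without SimPy for a single capacity-1 resource, because FIFO service reduces to `start = max(arrival, previous_release)`. The sketch below is a plain-Python stand-in for illustration, not the engine code; `fifo_timeline` and its tuple layout are hypothetical, and `service_ns` stands for everything done while the resource is held (overhead plus drain).

```python
def fifo_timeline(requests):
    """requests: list of (name, arrival_ns, service_ns), sorted by arrival.

    Returns {name: (actual_ns, queueing_ns)} for one capacity-1 FIFO resource.
    """
    free_at = 0.0  # time at which the resource is next released
    result = {}
    for name, arrival, service in requests:
        start = max(arrival, free_at)  # wait if the resource is still held
        free_at = start + service      # held for the whole service time
        result[name] = (free_at - arrival, start - arrival)
    return result


# A: 4 KB (drain 16.0 ns) at t=0; B: 64 B (drain 0.25 ns) at t=5
timeline = fifo_timeline([("A", 0.0, 16.0), ("B", 5.0, 0.25)])
```

Here `timeline["B"]` is `(11.25, 11.0)`: 11.25 ns actual, of which 11.0 ns is queueing behind A.
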
+### Head-of-Line (HOL) Blocking at hbm_ctrl
+
+The `simpy.Resource` is held for the **entire** `with` block—both overhead and
+drain. The resource is NOT released between overhead and drain:
+
+```python
+with self._resource.request() as req:
+    yield req                              # acquire (or wait)
+    yield from self.run(env, txn.nbytes)   # overhead_ns  ─┐
+    yield env.timeout(drain_ns)            # drain_ns      │ resource held
+# ← resource released here ───────────────────────────────┘
+```
+
+This means a short request arriving during a long request's drain must wait
+for the full remaining drain time—classic head-of-line blocking:
+
+```
+Request A: 4 KB,  drain = 16.0 ns   (arrives at t=0)
+Request B: 64 B,  drain = 0.25 ns   (arrives at t=5)
+
+Timeline:
+  t=0.00   A acquires resource
+  t=0.00   A: overhead (0 ns)
+  t=0.00   A: drain starts (16.0 ns)
+  t=5.00   B arrives → yield req → BLOCKED (A holds resource)
+  t=16.00  A: drain done → resource released
+  t=16.00  B acquires resource
+  t=16.00  B: overhead (0 ns)
+  t=16.25  B: drain done → resource released
+
+  B actual  = 11.25 ns (waited 11.0 + own 0.25)
+  B formula = 0.25 ns
+  B queueing = 11.0 ns  ← HOL blocking penalty
+```
+
+**Why this is physically realistic**: An HBM channel processes one burst at a
+time. While data is being serialized onto the channel (drain), no other request
+can use that channel. The FIFO ordering (`simpy.Resource` default) reflects
+the simplest controller scheduling policy.
+
+**Alternative: priority scheduling**: If needed, `simpy.PriorityResource` can
+prioritize shorter requests (Shortest Job First), but this is not currently
+used since FIFO matches typical HBM controller behavior.
+
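To see what a shortest-job-first policy would change, the sketch below compares FIFO with non-preemptive SJF on one capacity-1 resource. It is a plain-Python stand-in, not SimPy and not part of the engine; the `serve` helper and the third request C are hypothetical additions (with only A and B, both policies behave the same because B is the only waiter). With `simpy.PriorityResource`, a comparable effect would come from passing the request size as `priority`, since lower values are served first.

```python
def serve(requests, key):
    """Non-preemptive single server. requests: (name, arrival_ns, service_ns).

    `key` chooses the next request among those already waiting.
    Returns {name: actual_ns}, measured from arrival to completion.
    """
    pending = sorted(requests, key=lambda r: r[1])  # order by arrival time
    now, latency = 0.0, {}
    while pending:
        waiting = [r for r in pending if r[1] <= now]
        if not waiting:          # server idle: jump to the next arrival
            now = pending[0][1]
            continue
        chosen = min(waiting, key=key)
        pending.remove(chosen)
        now += chosen[2]         # resource held until service completes
        latency[chosen[0]] = now - chosen[1]
    return latency


reqs = [("A", 0.0, 16.0), ("C", 1.0, 16.0), ("B", 5.0, 0.25)]
fifo = serve(reqs, key=lambda r: r[1])  # earliest arrival first
sjf = serve(reqs, key=lambda r: r[2])   # shortest service first
```

In this scenario B's latency drops from 27.25 ns under FIFO to 11.25 ns under SJF, while C's rises from 31.0 ns to 31.25 ns.
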
+---
+
+## Worked Example: Two Concurrent PE DMA Reads
+
+Setup: PE0 and PE1 in cube0 both read 4096 bytes from their local HBM slices
+(slice0 and slice1), submitted to the **same engine** at the same time.
+
+### Paths
+
+```
+DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
+DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
+```
+
+### No Contention (different HBM slices)
+
+Since slice0 and slice1 are **separate** hbm_ctrl instances, each with its own
+`simpy.Resource(capacity=1)`, there is no resource competition.
+
+```
+DMA A timeline:
+  t=0.00   pe_dma dequeues txn
+  t=0.00   xbar.pe0: overhead_ns=2.0 → t=2.00
+  t=2.00   wire prop (2.5mm × 0.01) → t=2.025
+  t=2.025  hbm_ctrl.slice0: yield req → immediate (no contention)
+  t=2.025  hbm_ctrl.slice0: overhead_ns=0 → t=2.025
+  t=2.025  drain_ns = 4096/256 = 16.0 → t=18.025
+  t=18.025 done
+
+DMA B timeline: (identical, on its own slice)
+  t=0.00   → ... → t=18.025 done
+```
+
+Both complete at 18.025 ns in this simplified timeline, which counts only the
+xbar-to-hbm_ctrl wire. `actual == formula` for both.
+
+### With Contention (same HBM slice)
+
+Now suppose both PE0 and PE1 read from **slice0**:
+
+```
+DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
+DMA B: pe1.pe_dma → xbar.pe1 → xbar.pe0 → hbm_ctrl.slice0
+                                (chain traversal to reach slice0)
+```
+
+```
+DMA A timeline:
+  t=0.00   xbar.pe0(2.0) → wire → hbm_ctrl.slice0
+  t=2.025  yield req → immediate (first to arrive)
+  t=18.025 drain 16.0 → release resource → done
+  actual_A = 18.025 ns (== formula)
+
+DMA B timeline:
+  t=0.00   xbar.pe1(2.0) → xbar.pe0(2.0) → wire → hbm_ctrl.slice0
+  t=4.035  yield req → BLOCKED (A holds resource until t=18.025)
+  t=18.025 acquire resource
+  t=50.025 drain 32.0 → release → done
+  actual_B = 50.025 ns
+
+  B's bottleneck: xbar_x_bw = 128 GB/s → drain = 4096/128 = 32.0 ns
+  (A's path bottleneck is 256 GB/s, so A drains in 16.0 ns)
+  formula_B = wire(0.035) + overhead(4.0) + drain(32.0) = 36.035 ns
+  queueing  = actual_B - formula_B = 50.025 - 36.035 = 13.99 ns
+            = time waiting for A to release hbm_ctrl (18.025 - 4.035)
+```
+
+The key insight: **queueing delay is not in the formula**. It only appears in
+the actual SimPy simulation when resources are contested. The probe reports
+`actual_ns`, which includes all queueing. To see pure queueing overhead,
+compare `actual_ns` vs `formula_ns` (available in PE DMA traces).
+
+---
+
+## Probe Output Explained
+
+```
+=== PE DMA Latency ===
+Case                Target              Actual  Ovhd  Drain  Wire  Ovhd% Drain%  Eff.BW   BN.BW   Util%
+pe-local-hbm        c0.pe0->c0.slice0    18.09   2.0  16.0  0.08  11.1% 88.5%   226.49   256.0   88.5%
+pe-cross-half-hbm   c0.pe0->c0.slice4    37.14   5.0  32.0  0.14  13.5% 86.1%   110.27   128.0   86.1%
+```
+
+| Column | Meaning |
+|--------|---------|
+| **Actual** | SimPy measured `env.now` delta in ns (includes contention if any) |
+| **Ovhd** | Sum of `overhead_ns` for all components on the forward path (ns) |
+| **Drain** | `nbytes / bottleneck_bw`: serialization at the terminal node (ns) |
+| **Wire** | Sum of `distance_mm × ns_per_mm` for all edges (ns) |
+| **Ovhd%** | `Ovhd / Actual × 100`: fraction of time spent in component processing |
+| **Drain%** | `Drain / Actual × 100`: fraction of time spent in data transfer |
+| **Eff.BW** | `nbytes / Actual`: achieved bandwidth in GB/s (bytes/ns) |
+| **BN.BW** | Bottleneck bandwidth in GB/s (min `bw_gbs` on path) |
+| **Util%** | `Eff.BW / BN.BW × 100`: how close to theoretical max BW |
+
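The derived columns follow mechanically from the raw ones. The sketch below recomputes them for one row; `probe_columns` is a hypothetical helper rather than the probe's actual code, and because it starts from the rounded values printed in the table, the last digits can differ slightly from the probe output.

```python
def probe_columns(nbytes, actual_ns, ovhd_ns, drain_ns):
    """Recompute the derived probe columns from the raw ones (illustrative)."""
    eff_bw = nbytes / actual_ns  # bytes/ns is numerically GB/s
    bn_bw = nbytes / drain_ns    # inverts drain_ns = nbytes / bottleneck_bw
    return {
        "Ovhd%": 100 * ovhd_ns / actual_ns,
        "Drain%": 100 * drain_ns / actual_ns,
        "Eff.BW": eff_bw,
        "BN.BW": bn_bw,
        "Util%": 100 * eff_bw / bn_bw,
    }


# Raw values from the pe-local-hbm row
row = probe_columns(nbytes=4096, actual_ns=18.09, ovhd_ns=2.0, drain_ns=16.0)
```
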
+### Why Util% < 100%
+
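+`Util% = Drain% = drain_ns / actual_ns`, since `Eff.BW / BN.BW` reduces to
+`(nbytes / actual_ns) / (nbytes / drain_ns)`. The gap from 100% is everything
+that is not drain: component overhead, wire delay, and any queueing. For small
+transfers (4KB), overhead is significant relative to drain. For large transfers,
+drain dominates and utilization approaches 100%.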
+
+```
+  4 KB:  Ovhd=2.0, Drain=16.0  → Util=88.5%   (overhead is 11% of time)
+ 64 KB:  Ovhd=2.0, Drain=256.0 → Util=99.2%   (overhead is <1% of time)
+```
+
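The trend can be sketched for other transfer sizes under zero contention. The `util_pct` helper is hypothetical; its defaults (256 GB/s, 2.0 ns overhead, 0.08 ns wire) are taken from the pe-local-hbm row of the probe output.

```python
def util_pct(nbytes, bw_gbs=256.0, ovhd_ns=2.0, wire_ns=0.08):
    """Zero-contention utilization: drain time as a share of total time."""
    drain_ns = nbytes / bw_gbs
    return 100 * drain_ns / (drain_ns + ovhd_ns + wire_ns)


for kib in (4, 64, 1024):
    print(f"{kib:5d} KiB: {util_pct(kib * 1024):.2f}%")
```
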
+### H2D Path: Why Ovhd% is ~40%
+
+H2D traverses many components (pcie_ep → io_cpu → ucie → noc → m_cpu → noc →
+xbar → hbm_ctrl + response path). Total forward overhead is ~23 ns vs drain
+of 32 ns for 4KB, so overhead is comparable to data transfer time—resulting
+in ~55% utilization. This is expected for small command-path transfers.
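
The ~23 ns figure can be checked against the component overhead table by summing `overhead_ns` along the forward path. This sketch assumes each listed component is visited once per entry (a single ucie port, noc contributing 0.0) and ignores wire delay and the response path; the `OVERHEAD_NS` dict only copies values from the table, and `H2D_PATH` follows the order given in the text.

```python
# overhead_ns values copied from the component overhead table
OVERHEAD_NS = {
    "pcie_ep": 5.0, "io_cpu": 10.0, "ucie": 1.0, "noc": 0.0,
    "m_cpu": 5.0, "xbar": 2.0, "hbm_ctrl": 0.0,
}

# Forward H2D path as listed in the text (assumption: one visit per entry)
H2D_PATH = ["pcie_ep", "io_cpu", "ucie", "noc", "m_cpu", "noc", "xbar", "hbm_ctrl"]

forward_ovhd_ns = sum(OVERHEAD_NS[name] for name in H2D_PATH)
drain_ns = 32.0  # 4KB drain quoted in the text
ovhd_share = forward_ovhd_ns / (forward_ovhd_ns + drain_ns)
print(forward_ovhd_ns, f"{100 * ovhd_share:.0f}%")  # 23.0 42%
```
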