# ADR-0049: `kernbench probe` Subcommand (Traffic-Pattern Verification Harness)

## Status

Accepted (2026-05-22).

This ADR pins down the traffic-pattern catalog, the formula-vs-actual comparison, and the invariant checks (monotonicity, D2H ≥ H2D, etc.) exposed by `probes/probe.py::run_probe(...)`. ADR-0010 (CLI surface) enumerates the `kernbench probe` subcommand, but **what probe actually measures** and **which invariants it judges PASS/FAIL** had no ADR-level coverage.

## First action

`run_probe(topology_path, case_filter=None)` performs four startup steps:

1. `Path(topology_path).expanduser().resolve()` → absolute path.
2. `load_topology(path)` → `TopologyGraph` (graph + spec).
3. `_build_edge_map(graph)` → a `{(src, dst): Edge}` lookup table.
4. Instantiate `AddressResolver(graph)` + `PathRouter(graph)`.

Then it sets `nbytes = 32768` (= 32 KiB, the summary-table reference size) and `show_all = (case_filter is None or case_filter == "all")`.

In short, **probe's first act is "load the topology once and prepare edge map / resolver / router, plus pin 32 KiB as the standard measurement size"**. After that, the H2D → D2H → PE DMA categories execute in separate `GraphEngine` instances (no cross-talk between cases).

## Context

`kernbench probe` was introduced as a verification tool for these purposes:

- **Manual ground truth**: when a real-simulation result (`kernbench run --bench ...`) shows abnormal latency, derive the answer for a simple traffic pattern in isolation and compare.
- **Formula vs actual**: check whether the analytical model (wire latency + overhead + drain) matches the simulator's `total_ns`. A mismatch points to which simplifying assumption in ADR-0033 is missing.
- **Monotonicity check**: latency should grow monotonically with hop count.
- **Utilization sweep**: a BW-utilization table across data sizes (4 KiB to 1 MiB).


Without an ADR for this tool:

- Adding a new traffic-pattern category (e.g., MCpuDma, IPCQ) is hard because the table format and measurement units of existing categories aren't documented at the ADR level.
- The basis for the monotonicity check (hop count? cube distance? wire length?) is ambiguous.
- The reference size 32 KiB and the sweep `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]` are only discoverable by reading source.

## Decision

### D1. Three case categories: H2D / D2H / PE DMA

Each category has a distinct data path in the topology and gets its own summary table + sweep table + route-detail block.

- **H2D (Host → Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube, dst_pe=0, pattern="zero")` flows along `pcie_ep → io_cpu → m_cpu → hbm_ctrl`. The cube index varies the hop count:
  - h2d-1hop: cube=0, hops=1
  - h2d-2hop: cube=4, hops=2
  - h2d-3hop: cube=8, hops=3
  - h2d-4hop: cube=12, hops=4
- **D2H (Device → Host Read)**: `MemoryReadMsg(src_sip=0, src_cube, src_pe=0)`. Total latency = forward command path + reverse data path. The same four hop counts as H2D apply.
- **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe, dst_pa)`. Five cases cover varying cube/PE positions:
  - pe-local-hbm: same cube, same PE
  - pe-same-half-hbm: same cube, different PE (PE 1)
  - pe-cross-half-hbm: same cube, far PE (PE 4)
  - pe-cross-cube-hbm-best: adjacent cube (cube 1)
  - pe-cross-cube-hbm-worst: diagonal far cube (cube 15)

The H2D cube indices 4/8/12 and the PE DMA indices (PE 1 and 4, cubes 1 and 15) are meaningful for a 4 × 4 cube mesh (`sip.cube_mesh.w=4, h=4`); changes to the mesh size require updating them in lockstep.

### D2. Standard measurement size: `nbytes = 32768` (32 KiB)

Every case in the summary table runs once with `nbytes=32768`. 32 KiB was chosen because:

- DMA overhead and BW drain are balanced, so neither dominates.
- It compares cleanly against the one-shot transfer size of several sub-units (TCM, register file).

Per-size utilization variations are shown in a separate sweep table (D3).
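
To make the "neither dominates" claim concrete, the analytical model (overhead + wire + drain) can be evaluated at a few sizes. This is a minimal sketch, not probe's code: the overhead, wire, and bandwidth values are hypothetical placeholders rather than numbers from any real topology, the `breakdown` helper is invented for illustration, and it assumes `bw_gbs` means GB/s (so 1 GB/s equals 1 byte/ns).

```python
# Hypothetical path parameters, chosen only for illustration.
OVERHEAD_NS = 500.0        # assumed sum of per-node overhead_ns along the path
WIRE_NS = 10.0             # assumed total wire latency
BOTTLENECK_BW_GBS = 64.0   # assumed minimum edge bandwidth; 1 GB/s == 1 byte/ns


def breakdown(nbytes, overhead_ns=OVERHEAD_NS, wire_ns=WIRE_NS,
              bn_bw_gbs=BOTTLENECK_BW_GBS):
    """Split the formula latency into overhead and drain shares."""
    drain_ns = nbytes / bn_bw_gbs              # bytes / (bytes per ns) -> ns
    total_ns = overhead_ns + wire_ns + drain_ns
    return {
        "drain_ns": drain_ns,
        "total_ns": total_ns,
        "ovhd_pct": 100.0 * overhead_ns / total_ns,
        "drain_pct": 100.0 * drain_ns / total_ns,
    }


if __name__ == "__main__":
    for size in (4096, 32768, 1048576):
        b = breakdown(size)
        print(f"{size:>8} B  ovhd={b['ovhd_pct']:5.1f}%  drain={b['drain_pct']:5.1f}%")
```

With these placeholder values, the overhead and drain shares at 32 KiB are both close to half of the total, while 4 KiB is dominated by overhead and 1 MiB by drain. A real topology will shift the crossover point.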
### D3. Utilization sweep: `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`

`SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`, `SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]`. Per size:

```
drain  = nbytes / bottleneck_bw
total  = overhead + wire + drain
eff_bw = nbytes / total
util%  = eff_bw / bottleneck_bw × 100
```

When `bn_bw` is `None` or `<= 0`, the column shows 0.0 %.

The intent: the table shows in one view how small transfers become overhead-bound and large transfers become drain-bound as hop count rises.

### D4. Measured columns: actual / formula / breakdown

Per-case columns:

- `Actual` (total_ns): the SimPy run's `trace["total_ns"]`.
- `Ovhd`: sum of `node.attrs["overhead_ns"]` along the path (formula).
- `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
- `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
- `Ovhd%` / `Drain%`: each portion as a percentage of Actual. Wire is usually too small to display.
- `Eff.BW`: `nbytes / total_ns` (measured BW).
- `BN.BW`: bottleneck bandwidth (formula). The minimum edge BW along the path. Missing edge BW shows "-".
- `Util%`: `Eff.BW / BN.BW × 100`. 100 % means the single-stream BW upper bound is reached.

A large gap between the formula sum (`wire + ovhd + drain`) and Actual signals a factor the simplified model misses (a place to inspect ADR-0033's assumptions).

### D5. Automatic invariant checks: PASS/FAIL

The following invariants are reported with `[v] PASS` / `[x] FAIL`:

- **H2D / D2H monotonic increase**: as hop count rises, actual latency must grow monotonically. `all(lats[i] < lats[i+1] for ...)`.
- **D2H ≥ H2D**: for the same hop index, D2H ≥ H2D (D2H has both forward command and reverse data legs). `all(d2h[i].total >= h2d[i].total)`.
- **PE DMA best < worst**: cross-cube best (adjacent) latency must be less than cross-cube worst (diagonal).
- **PE DMA local vs remote**: prints the local BN BW vs remote BN BW side-by-side (informational, not PASS/FAIL).
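
The three PASS/FAIL invariants above can be sketched as follows. This is an illustrative reconstruction, not probe's implementation: `CaseResult`, `verdict`, and `check_invariants` are hypothetical names, only the `[v] PASS` / `[x] FAIL` prefixes are taken from this ADR, and the rest of each output line is made up. The sketch assumes each list is ordered by hop count.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical per-case record; only total_ns is needed for the checks."""
    name: str
    total_ns: float


def verdict(ok, label):
    # The [v]/[x] prefixes come from the ADR; the label text is illustrative.
    return f"{'[v] PASS' if ok else '[x] FAIL'}: {label}"


def check_invariants(h2d, d2h, pe_best=None, pe_worst=None):
    """Return one PASS/FAIL line per applicable invariant."""
    lines = []
    # Strict monotonic increase within each category.
    for label, cases in (("H2D", h2d), ("D2H", d2h)):
        lats = [c.total_ns for c in cases]
        if len(lats) >= 2:  # a single data point has nothing to compare against
            ok = all(lats[i] < lats[i + 1] for i in range(len(lats) - 1))
            lines.append(verdict(ok, f"{label} latency increases with hop count"))
    # D2H >= H2D at the same hop index (assumes both lists are aligned).
    if h2d and d2h:
        ok = all(d.total_ns >= h.total_ns for h, d in zip(h2d, d2h))
        lines.append(verdict(ok, "D2H >= H2D per hop index"))
    # Cross-cube best (adjacent) must beat cross-cube worst (diagonal).
    if pe_best is not None and pe_worst is not None:
        ok = pe_best.total_ns < pe_worst.total_ns
        lines.append(verdict(ok, "PE DMA best < worst"))
    return lines
```

Skipping a comparison when fewer than two data points remain mirrors the filtering behaviour described in D7.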

When a check fails, a single clear line surfaces the regression for human review.

### D6. Route detail: per-hop timestamp trace

After the summary and sweep tables, each case's path and cumulative per-hop timestamps (`_hop_timestamps`) appear in a separate section:

- H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3 (`m_cpu → hbm_ctrl`) + per-hop trace.
- D2H: forward (cmd, no data) and reverse (data) traces shown separately.
- PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.

Each hop's timestamp is the cumulative `wire_ns + overhead_ns`. The terminal hop's annotation appends `drain:Xns`. Bottleneck edges are marked so they are visually identifiable.

### D7. Semantics of the `case_filter` argument

- `None` or `"all"`: run all cases (default).
- Other strings: run only the case whose name matches exactly. Example: `kernbench probe --case h2d-2hop`.

Within a category, cases with `name != case_filter` are skipped; if only one data point remains, the category's monotonicity / D2H ≥ H2D comparisons are naturally skipped.

The CLI parser's `--case` default is `"all"`, so omitting it runs everything.

### D8. Fresh GraphEngine per case

Each of the 4 H2D, 4 D2H, and 5 PE DMA cases runs in **its own GraphEngine** (`engine = GraphEngine(graph)`). Reasons:

- Isolate accumulated state (op_log, completion tracking, allocators) so cases do not cross-talk.
- Guarantee one case's traffic does not perturb another case's BW measurement.

This isolation lets probe results be interpreted as **single-flow** per-case latency. Multi-flow contention measurement is handled by separate tooling (e.g., the `pe2pe_overview` plot or ADR-0033's multi-flow merging model).

### D9. Output-format stability

probe's stdout is meant for humans; precise column widths, separators, and whitespace are **not** a machine-readable contract. Automated tools that wish to parse probe output should use a separate JSON-output mode (not yet implemented).
The `[v]` / `[x]` prefix on PASS/FAIL lines is a stable CI grep anchor.

## Alternatives Considered

### A1. Register probe as another bench (`@bench(name="probe")`)

Rejected. probe is a verification tool, not a bench: multi-engine execution for sweeps/analysis and PASS/FAIL invariant output are both essential, and neither fits ADR-0045's "single device + single RuntimeContext" bench model.

### A2. Exit code 1 on monotonicity violation

Rejected (currently). probe is positioned as a human inspection tool: PASS/FAIL is printed and the exit code is 0. A wrapper can `grep "\[x\]"` to decide. A future `--strict` flag could opt into non-zero exits.

### A3. Externalize the case catalog to YAML

Rejected (currently). The 13 cases (4 H2D + 4 D2H + 5 PE DMA) are hardcoded and their semantics are tightly bound to the mesh topology. Moving the meaning of the indices (4, 8, 12 / 1, 4, 15) into YAML would require separate documentation and lose cohesion. Externalize only when case additions become frequent.

### A4. Add multi-flow contention measurement

Rejected (out of probe scope). D8's single-flow isolation is probe's core intent. Multi-flow contention belongs in a different area of the ADR-0033 latency model, handled by either a separate tool or a new case category.

## Consequences

- probe's case catalog (D1) and measurement units (D2/D3) are pinned at ADR level, so new traffic categories know which table format to follow.
- The semantics of the formula-vs-actual columns (D4) are locked in, so questions like "why is Drain% 5 % or 70 %?" can quickly be linked to ADR-0033 assumption checks.
- Automatic invariant checks (D5) are pinned, so latency-model changes immediately catch monotonicity / D2H ≥ H2D regressions.
- D8's case isolation is explicit, so probe results are safe to read as single-flow measurements. If multi-flow is needed, a separate tool track is clearly required.
- A2's strict-mode flag is recorded as a follow-up so CI integration has a minimal change path when requested.
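
Until a `--strict` flag exists, the wrapper approach mentioned in A2 could look roughly like the sketch below. It is not part of kernbench: `has_failures` and `exit_code_for` are hypothetical helper names, and the sample lines in the usage comment are invented. It relies only on the `[x]` anchor that D9 declares stable, and it mirrors `grep "\[x\]"` by matching the marker anywhere in a line.

```python
def has_failures(probe_stdout):
    """Return True if any line of captured probe output carries the [x] marker."""
    return any("[x]" in line for line in probe_stdout.splitlines())


def exit_code_for(probe_stdout):
    """Map captured probe output to a CI-style exit code (0 = clean, 1 = failure)."""
    return 1 if has_failures(probe_stdout) else 0


# Usage with made-up output text (only the [v]/[x] prefixes are from the ADR):
#   exit_code_for("[v] PASS example line\n")   gives 0
#   exit_code_for("[x] FAIL example line\n")   gives 1
```

A CI job would capture probe's stdout, pass it to `exit_code_for`, and exit with the returned value; how the output is captured is left to the surrounding pipeline.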