kernbench2/docs/adr/ADR-0049-ver-probe-subcommand.md
ywkang 9a02955770 adr: add ADR-0046-0049 — close G4 coverage gaps from /report
Documents four cross-cutting surfaces that previously had no ADR backing,
each surfaced as a G4 candidate by /report:

- 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates
  all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...),
  the two execution modes (command-list vs greenlet runner), scratch
  allocator semantics, dispatch-overhead model, and the kernel registry.

- 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group
  (backend="ahbm") install path. world_size priority (algorithm >
  defaults > topology), the 4-step init sequence (load ccl.yaml, import
  algorithm module, derive world_size, install SFR + IPCQ), greenlet-
  local rank registry, all_reduce dispatch via _defer_wait, barrier
  no-op rationale, and the explicit list of unsupported dist.* APIs.

- 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator
  free-list semantics. Offset-keyed first-fit with coalescing, the
  no-validation trust model for free(), HBM/TCM channel separation,
  page-aligned VA allocation, the page_size dual-default
  (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and
  one-allocator-per-sub-unit rule.

- 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog.
  H2D / D2H / PE DMA categories with their exact cube-index choices,
  the 32 KiB reference size, the 5-point utilization sweep, the
  formula vs actual column meanings, automatic invariant checks
  (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine
  isolation, and the human-readable (not machine-parsable) output
  contract.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:25:04 -07:00

# ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
## Status
Accepted (2026-05-22).
Pins down the traffic-pattern catalog, formula-vs-actual comparison, and
invariant checks (monotonicity, D2H ≥ H2D, etc.) exposed by
`probes/probe.py::run_probe(...)`. ADR-0010 (CLI surface) enumerates the
`kernbench probe` subcommand, but **what probe actually measures** and
**which invariants it judges PASS/FAIL** had no ADR-level coverage.
## First action
`run_probe(topology_path, case_filter=None)` performs four startup steps:
1. `Path(topology_path).expanduser().resolve()` → absolute path.
2. `load_topology(path)` → `TopologyGraph` (graph + spec).
3. `_build_edge_map(graph)` → a `{(src, dst): Edge}` lookup table.
4. Instantiate `AddressResolver(graph)` + `PathRouter(graph)`.
Then it sets `nbytes = 32768` (= 32 KiB, the summary-table reference
size) and `show_all = (case_filter is None or case_filter == "all")`.
In short, **probe's first act is to load the topology once, prepare the
edge map / resolver / router, and pin 32 KiB as the standard measurement
size**. After that, the H2D, D2H, and PE DMA categories run in that
order, each case in its own `GraphEngine` instance (no cross-talk
between cases).
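
The startup sequence can be sketched as follows. Only the path normalization and the two derived values (`nbytes`, `show_all`) are taken from the text above; `ProbeSetup` and `prepare_probe` are hypothetical names, and steps 2 to 4 are left as a comment because their implementations live in the project.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

REFERENCE_NBYTES = 32768  # 32 KiB, the summary-table reference size


@dataclass(frozen=True)
class ProbeSetup:
    """Hypothetical container for the values fixed during startup."""
    topology_path: Path
    nbytes: int
    show_all: bool


def prepare_probe(topology_path: str, case_filter: Optional[str] = None) -> ProbeSetup:
    # Step 1: normalize the user-supplied path to an absolute path.
    path = Path(topology_path).expanduser().resolve()
    # Steps 2-4 (load_topology, _build_edge_map, AddressResolver, PathRouter)
    # would run here in the real tool; they are omitted from this sketch.
    show_all = case_filter is None or case_filter == "all"
    return ProbeSetup(topology_path=path, nbytes=REFERENCE_NBYTES, show_all=show_all)
```

Freezing the dataclass mirrors the idea that these values are decided once and then shared read-only by every case.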
## Context
`kernbench probe` was introduced as a verification tool for these
purposes:
- **Manual ground truth**: when a real-simulation result (`kernbench run
--bench ...`) shows abnormal latency, derive the answer for a simple
traffic pattern in isolation and compare.
- **Formula vs actual**: check whether the analytical model
(wire latency + overhead + drain) matches the simulator's
`total_ns`. A mismatch points to a factor that the simplifying
assumptions in ADR-0033 leave out.
- **Monotonicity check**: latency should grow monotonically with hop
count.
- **Utilization sweep**: a BW-utilization table across data sizes
(4 KiB to 1 MiB).
Without an ADR for this tool:
- Adding a new traffic-pattern category (e.g., MCpuDma, IPCQ) is hard
because the table format / measurement units of existing categories
aren't documented at the ADR level.
- The basis for the monotonicity check (hop count? cube distance? wire
length?) is ambiguous.
- The reference size 32 KiB and the sweep `[4 KiB, 16 KiB, 64 KiB, 256
KiB, 1 MiB]` are only discoverable by reading source.
## Decision
### D1. Three case categories — H2D / D2H / PE DMA
Each category has a distinct data path in the topology and gets its own
summary table + sweep table + route-detail block.
- **H2D (Host → Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube,
  dst_pe=0, pattern="zero")` flows along `pcie_ep → io_cpu → m_cpu →
  hbm_ctrl`. The cube index varies the hop count:
  - h2d-1hop: cube=0, hops=1
  - h2d-2hop: cube=4, hops=2
  - h2d-3hop: cube=8, hops=3
  - h2d-4hop: cube=12, hops=4
- **D2H (Device → Host Read)**: `MemoryReadMsg(src_sip=0, src_cube,
  src_pe=0)`. Total latency = forward command path + reverse data path.
  Uses the same four hop-count cases as H2D.
- **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe,
  dst_pa)`. Five cases cover varying cube/PE positions:
  - pe-local-hbm: same cube, same PE
  - pe-same-half-hbm: same cube, different PE (PE 1)
  - pe-cross-half-hbm: same cube, far PE (PE 4)
  - pe-cross-cube-hbm-best: adjacent cube (cube 1)
  - pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
The indices used above (cubes 4/8/12 for H2D; PE 1, PE 4, cube 1, and
cube 15 for PE DMA) are meaningful for a 4 × 4 cube mesh
(`sip.cube_mesh.w=4, h=4`); any change to the mesh size requires
updating them in lockstep.
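
The coupling between the hardcoded cube indices and the mesh size can be made explicit with a small check. This is a sketch only: `HopCase`, `cube_xy`, and `catalog_fits_mesh` are hypothetical names, and the row-major numbering (`cube = y * width + x`) is an assumption that happens to be consistent with cubes 0/4/8/12 forming one column and cube 15 being the far diagonal.

```python
from dataclasses import dataclass
from typing import List, Tuple

MESH_W, MESH_H = 4, 4  # sip.cube_mesh.w / sip.cube_mesh.h from the text


@dataclass(frozen=True)
class HopCase:
    """Hypothetical record for one H2D case."""
    name: str
    cube: int
    hops: int


H2D_CASES = [
    HopCase("h2d-1hop", cube=0, hops=1),
    HopCase("h2d-2hop", cube=4, hops=2),
    HopCase("h2d-3hop", cube=8, hops=3),
    HopCase("h2d-4hop", cube=12, hops=4),
]


def cube_xy(cube: int, width: int = MESH_W) -> Tuple[int, int]:
    # Assumption: row-major numbering, cube = y * width + x.
    return cube % width, cube // width


def catalog_fits_mesh(cases: List[HopCase], width: int = MESH_W,
                      height: int = MESH_H) -> bool:
    """True when every cube index exists in a width x height mesh."""
    return all(0 <= c.cube < width * height for c in cases)
```

A check like `catalog_fits_mesh` would fail loudly if the mesh shrank, which is the situation the lockstep rule is meant to catch.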
### D2. Standard measurement size — `nbytes = 32768` (32 KiB)
Every case in the summary table runs once with `nbytes=32768`. 32 KiB
was chosen because:
- DMA overhead and BW drain are balanced — neither dominates.
- It compares cleanly against the one-shot transfer size of several
sub-units (TCM, register file).
Per-size utilization variations are shown in a separate sweep table
(D3).
### D3. Utilization sweep — `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`
`SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`,
`SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]`. Per size:
```
drain = nbytes / bottleneck_bw
total = overhead + wire + drain
eff_bw = nbytes / total
util% = eff_bw / bottleneck_bw × 100
```
When `bn_bw` is `None` or `<= 0`, the column shows 0.0 %. The intent is
for one table to show how small transfers are overhead-bound and large
transfers drain-bound, and how that balance shifts as hop count rises.
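
The sweep formula can be written out as a short function. This is a minimal sketch, assuming bandwidth is expressed in decimal GB/s (numerically equal to bytes per nanosecond); `sweep_util_pct` is a hypothetical name and the overhead, wire, and bandwidth inputs in the loop are made-up values for illustration.

```python
from typing import Optional

SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]
SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]


def sweep_util_pct(nbytes: int, overhead_ns: float, wire_ns: float,
                   bn_bw: Optional[float]) -> float:
    """Utilization in percent; bn_bw is assumed to be in bytes/ns (GB/s)."""
    if bn_bw is None or bn_bw <= 0:
        return 0.0  # matches the documented 0.0 % fallback
    drain_ns = nbytes / bn_bw
    total_ns = overhead_ns + wire_ns + drain_ns
    eff_bw = nbytes / total_ns
    return eff_bw / bn_bw * 100.0


# Made-up inputs: 500 ns overhead, 2 ns wire, 64 GB/s bottleneck.
for label, size in zip(SWEEP_LABELS, SWEEP_SIZES):
    print(f"{label:>6}  {sweep_util_pct(size, 500.0, 2.0, 64.0):5.1f} %")
```

Algebraically the result equals `drain / total`, so utilization approaches 100 % only when drain time dwarfs the fixed overhead and wire terms.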
### D4. Measured columns — actual / formula / breakdown
Per-case columns:
- `Actual` (total_ns): the SimPy run's `trace["total_ns"]`.
- `Ovhd`: sum of `node.attrs["overhead_ns"]` along the path (formula).
- `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
- `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
- `Ovhd%` / `Drain%`: each portion as a percentage of Actual. Wire is
usually too small to display.
- `Eff.BW`: `nbytes / total_ns` (measured BW).
- `BN.BW`: bottleneck bandwidth (formula). The minimum edge BW along
the path. Missing edge BW shows "-".
- `Util%`: `Eff.BW / BN.BW × 100`. 100 % means the single-stream BW
upper bound is reached.
A large gap between the formula sum (`wire + ovhd + drain`) and Actual
signals a factor the simplified model misses (a place to inspect
ADR-0033's assumptions).
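
The formula columns can be reproduced from a path description roughly as follows. The `Edge` class, `formula_breakdown`, and `model_gap_ns` are hypothetical stand-ins rather than the project's types, and skipping edges without a bandwidth value when picking the bottleneck is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Edge:
    """Hypothetical stand-in for a topology edge."""
    distance_mm: float
    bw_gbs: Optional[float]  # None when the edge carries no BW attribute


def formula_breakdown(nbytes: int, node_overheads_ns: List[float],
                      edges: List[Edge], ns_per_mm: float) -> dict:
    ovhd = sum(node_overheads_ns)                           # Ovhd column
    wire = sum(e.distance_mm * ns_per_mm for e in edges)    # Wire column
    bws = [e.bw_gbs for e in edges if e.bw_gbs]             # assumption: skip missing BW
    bn_bw = min(bws) if bws else None                       # BN.BW column
    drain = nbytes / bn_bw if bn_bw else 0.0                # Drain column (GB/s == bytes/ns)
    return {"ovhd": ovhd, "wire": wire, "drain": drain,
            "bn_bw": bn_bw, "formula_total": ovhd + wire + drain}


def model_gap_ns(actual_total_ns: float, breakdown: dict) -> float:
    """Positive gap: the simulator saw latency the formula does not model."""
    return actual_total_ns - breakdown["formula_total"]
```

Comparing `model_gap_ns` across cases is one way to see whether the unexplained latency scales with hop count or stays constant.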
### D5. Automatic invariant checks — PASS/FAIL
The following invariants are reported with `[v] PASS` / `[x] FAIL`:
- **H2D / D2H monotonic increase**: as hop count rises, actual latency
must grow strictly: `all(lats[i] < lats[i+1] for ...)`.
- **D2H ≥ H2D**: for the same hop index, D2H ≥ H2D (D2H has both
forward command and reverse data legs): `all(d2h[i].total >=
h2d[i].total for ...)`.
- **PE DMA best < worst**: cross-cube best (adjacent) latency must be
less than cross-cube worst (diagonal).
- **PE DMA local vs remote**: prints the local BN BW vs remote BN BW
side-by-side (informational, not PASS/FAIL).
When a check fails, a single clear line surfaces the regression for
human review.
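
The two PASS/FAIL comparisons can be sketched as plain predicates over latency lists. The function names and the text after the `[v]` / `[x]` prefix are hypothetical; only the comparison operators and the prefixes come from the text above.

```python
from typing import Sequence


def strictly_increasing(lats: Sequence[float]) -> bool:
    # Latency must grow with each additional hop (strict '<').
    return all(lats[i] < lats[i + 1] for i in range(len(lats) - 1))


def d2h_not_faster(d2h: Sequence[float], h2d: Sequence[float]) -> bool:
    # Compared per hop index; D2H carries a command leg plus a data leg.
    return all(d >= h for d, h in zip(d2h, h2d))


def verdict(ok: bool, label: str) -> str:
    # Only the bracketed prefix is documented; the rest is illustrative.
    return f"[v] PASS {label}" if ok else f"[x] FAIL {label}"
```

Note that with fewer than two data points `strictly_increasing` is vacuously true, which is why a filtered run should skip the check rather than report PASS.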
### D6. Route detail — per-hop timestamp trace
After the summary and sweep tables, each case's path and cumulative
per-hop timestamps (`_hop_timestamps`) appear in a separate section:
- H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3
(`m_cpu → hbm_ctrl`) + per-hop trace.
- D2H: forward (cmd, no data) and reverse (data) traces shown
separately.
- PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.
Each hop's timestamp is cumulative `wire_ns + overhead_ns`. The
terminal hop's annotation appends `drain:Xns`. Bottleneck edges are
marked `<BN:XXGB/s>` so they are visually identifiable.
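
A per-hop trace of this shape could be produced along these lines. This is an illustrative sketch: `hop_trace`, the `Hop` tuple layout, and the exact line format are assumptions; only the cumulative `wire_ns + overhead_ns` rule, the terminal `drain:` annotation, and the `<BN:...GB/s>` marker come from the text.

```python
from typing import List, Optional, Tuple

# (node name, wire_ns into this hop, overhead_ns at this hop, edge BW in GB/s or None)
Hop = Tuple[str, float, float, Optional[float]]


def hop_trace(hops: List[Hop], nbytes: int) -> List[str]:
    bws = [bw for _, _, _, bw in hops if bw]
    bn_bw = min(bws) if bws else None
    lines: List[str] = []
    t = 0.0
    for i, (name, wire_ns, overhead_ns, bw) in enumerate(hops):
        t += wire_ns + overhead_ns  # cumulative timestamp
        line = f"{name} @ {t:.1f}ns"
        if bn_bw is not None and bw == bn_bw:
            line += f" <BN:{bw:g}GB/s>"  # mark the bottleneck edge
        if i == len(hops) - 1 and bn_bw is not None:
            line += f" drain:{nbytes / bn_bw:.1f}ns"  # terminal hop only
        lines.append(line)
    return lines
```

Keeping drain out of the running timestamp and attaching it only to the last hop matches the description that per-hop values are cumulative wire plus overhead.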
### D7. Semantics of the `case_filter` argument
- `None` or `"all"`: run all cases (default).
- Other strings: run only the case whose name matches exactly. Example:
`kernbench probe --case h2d-2hop`.
Within a category, cases with `name != case_filter` are skipped; if
only one data point remains, the category's monotonicity / D2H ≥ H2D
comparisons are naturally skipped.
The CLI parser's `--case` default is `"all"`, so omitting it runs
everything.
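
The filter semantics amount to an exact-match selection with two wildcard values. A minimal sketch, with `select_cases` and `can_compare` as hypothetical helper names:

```python
from typing import List, Optional


def select_cases(names: List[str], case_filter: Optional[str] = "all") -> List[str]:
    # None and "all" both mean: run every case.
    if case_filter is None or case_filter == "all":
        return list(names)
    # Anything else is an exact name match (no prefix or glob matching).
    return [n for n in names if n == case_filter]


def can_compare(selected: List[str]) -> bool:
    """Category-level comparisons need at least two data points."""
    return len(selected) >= 2
```

For example, `select_cases(h2d_names, "h2d-2hop")` leaves a single case, so `can_compare` returns `False` and the monotonicity check is skipped.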
### D8. Fresh GraphEngine per case
Each of the 4 H2D, 4 D2H, and 5 PE DMA cases runs in **its own
GraphEngine** (`engine = GraphEngine(graph)`). Reasons:
- Isolate accumulated state (op_log, completion tracking, allocators)
so cases do not cross-talk.
- Guarantee one case's traffic does not perturb another case's BW
measurement.
This isolation lets probe results be interpreted as **single-flow**
per-case latency. Multi-flow contention measurement is handled by
separate tooling (e.g., the `pe2pe_overview` plot or ADR-0033's
multi-flow merging model).
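
The isolation rule can be illustrated with a stand-in engine. `StubEngine` is not the project's `GraphEngine`; it models only an accumulating `op_log` so that the effect of constructing a fresh instance per case is visible.

```python
from typing import Dict, List


class StubEngine:
    """Stand-in for GraphEngine; only models the accumulated op_log."""

    def __init__(self, graph: object) -> None:
        self.graph = graph
        self.op_log: List[str] = []

    def run(self, case_name: str) -> int:
        self.op_log.append(case_name)
        return len(self.op_log)  # number of ops this engine has seen


def run_isolated(graph: object, case_names: List[str]) -> Dict[str, int]:
    results: Dict[str, int] = {}
    for name in case_names:
        engine = StubEngine(graph)  # fresh engine: no state carried over
        results[name] = engine.run(name)
    return results
```

With a shared engine the returned counts would climb from case to case; with one engine per case every count is 1, which is the single-flow property the text relies on.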
### D9. Output-format stability
probe's stdout is meant for humans; precise column widths, separators,
and whitespace are **not** a machine-readable contract. Automated tools
that wish to parse probe output should use a separate JSON-output mode
(not yet implemented).
The `[v]` / `[x]` prefix on PASS/FAIL lines is a stable CI grep anchor.
## Alternatives Considered
### A1. Register probe as another bench (`@bench(name="probe")`)
Rejected. probe is a verification tool, not a bench: it needs
multi-engine execution for sweeps/analysis and PASS/FAIL invariant
output, neither of which fits ADR-0045's "single device + single
RuntimeContext" bench model.
### A2. Exit code 1 on monotonicity violation
Rejected (currently). probe is positioned as a human inspection tool —
PASS/FAIL is printed and exit is 0. A wrapper can `grep "\[x\]"` to
decide. A future `--strict` flag could opt into non-zero exits.
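
A wrapper of the kind mentioned above might look like the following. It is a sketch under the assumption that the `[x]` prefix appears at the start of a line (optionally after whitespace); `exit_code_for` is a hypothetical helper, not part of kernbench.

```python
import re

# Matches a FAIL line by its documented "[x]" prefix.
FAIL_LINE = re.compile(r"^\s*\[x\]", re.MULTILINE)


def exit_code_for(probe_stdout: str) -> int:
    """Return 1 if any invariant line is marked [x], else 0."""
    return 1 if FAIL_LINE.search(probe_stdout) else 0
```

A CI job could capture the stdout of `kernbench probe`, pass it to `exit_code_for`, and exit with the result, leaving probe's own exit code at 0.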
### A3. Externalize the case catalog to YAML
Rejected (currently). The 13 cases (4 H2D + 4 D2H + 5 PE DMA)
are hardcoded and their semantics are tightly bound to the mesh
topology. Moving cube-index meaning (4, 8, 12 / 1, 4, 15) into YAML
would require separate documentation and lose cohesion. Externalize
only when case additions become frequent.
### A4. Add multi-flow contention measurement
Rejected (out of probe scope). D8's single-flow isolation is probe's
core intent. Multi-flow contention belongs in a different area of the
ADR-0033 latency model — either a separate tool or a new case
category.
## Consequences
- probe's case catalog (D1) and measurement units (D2/D3) are pinned at
ADR level, so new traffic categories know which table format to
follow.
- The semantics of the formula-vs-actual columns (D4) are locked in, so
questions like "why is Drain% 5 % or 70 %?" can quickly be linked to
ADR-0033 assumption checks.
- Automatic invariant checks (D5) are pinned, so latency-model changes
immediately catch monotonicity / D2H ≥ H2D regressions.
- D8's case-isolation is explicit, so probe results are safe to read as
single-flow measurements. If multi-flow is needed, a separate tool
track is clearly required.
- A2's strict-mode flag is recorded as a follow-up so CI integration
has a minimal change path when requested.