kernbench2/docs/adr/ADR-0049-ver-probe-subcommand.md
ywkang 9a02955770 adr: add ADR-0046-0049 — close G4 coverage gaps from /report
Documents four cross-cutting surfaces that previously had no ADR backing,
each surfaced as a G4 candidate by /report:

- 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates
  all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...),
  the two execution modes (command-list vs greenlet runner), scratch
  allocator semantics, dispatch-overhead model, and the kernel registry.

- 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group
  (backend="ahbm") install path. world_size priority (algorithm >
  defaults > topology), the 4-step init sequence (load ccl.yaml, import
  algorithm module, derive world_size, install SFR + IPCQ), greenlet-
  local rank registry, all_reduce dispatch via _defer_wait, barrier
  no-op rationale, and the explicit list of unsupported dist.* APIs.

- 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator
  free-list semantics. Offset-keyed first-fit with coalescing, the
  no-validation trust model for free(), HBM/TCM channel separation,
  page-aligned VA allocation, the page_size dual-default
  (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and
  one-allocator-per-sub-unit rule.

- 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog.
  H2D / D2H / PE DMA categories with their exact cube-index choices,
  the 32 KiB reference size, the 5-point utilization sweep, the
  formula vs actual column meanings, automatic invariant checks
  (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine
  isolation, and the human-readable (not machine-parsable) output
  contract.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:25:04 -07:00


ADR-0049: kernbench probe Subcommand — Traffic-Pattern Verification Harness

Status

Accepted (2026-05-22).

Pins down the traffic-pattern catalog, formula-vs-actual comparison, and invariant checks (monotonicity, D2H ≥ H2D, etc.) exposed by probes/probe.py::run_probe(...). ADR-0010 (CLI surface) enumerates the kernbench probe subcommand, but what probe actually measures and which invariants it judges PASS/FAIL had no ADR-level coverage.

First action

run_probe(topology_path, case_filter=None) performs four startup steps:

  1. Path(topology_path).expanduser().resolve() → absolute path.
  2. load_topology(path) → TopologyGraph (graph + spec).
  3. _build_edge_map(graph) → a {(src, dst): Edge} lookup table.
  4. Instantiate AddressResolver(graph) + PathRouter(graph).

Then it sets nbytes = 32768 (= 32 KiB, the summary-table reference size) and show_all = (case_filter is None or case_filter == "all").

In short, probe's first act is "load the topology once and prepare edge map / resolver / router, plus pin 32 KiB as the standard measurement size". After that, the H2D → D2H → PE DMA categories execute in separate GraphEngine instances (no cross-talk between cases).
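The first and third startup steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual probe.py code: the `Edge` dataclass, its fields, and the helper names `normalize_topology_path` and `build_edge_map` are hypothetical stand-ins. Only the path normalization expression, the `{(src, dst): Edge}` shape, and the 32 KiB constant come from the text above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple

NBYTES = 32768  # 32 KiB reference size used by the summary table


@dataclass(frozen=True)
class Edge:
    # Hypothetical edge record; the real Edge type lives in the topology code.
    src: str
    dst: str
    bw_gbs: float


def normalize_topology_path(topology_path: str) -> Path:
    # Step 1: expand "~" and resolve to an absolute path.
    return Path(topology_path).expanduser().resolve()


def build_edge_map(edges: List[Edge]) -> Dict[Tuple[str, str], Edge]:
    # Step 3 (assumed shape): a lookup table keyed by (src, dst).
    return {(e.src, e.dst): e for e in edges}
```

Keying the map by `(src, dst)` lets later formula code look up an edge's bandwidth for each consecutive pair of nodes on a route without rescanning the graph.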

Context

kernbench probe was introduced as a verification tool for these purposes:

  • Manual ground truth: when a real-simulation result (kernbench run --bench ...) shows abnormal latency, derive the answer for a simple traffic pattern in isolation and compare.
  • Formula vs actual: check whether the analytical model (wire latency + overhead + drain) matches the simulator's total_ns. A mismatch points to which simplifying assumption in ADR-0033 is missing.
  • Monotonicity check: latency should grow monotonically with hop count.
  • Utilization sweep: a BW-utilization table across data sizes (4 KiB to 1 MiB).

Without an ADR for this tool:

  • Adding a new traffic-pattern category (e.g., MCpuDma, IPCQ) is hard because the table format / measurement units of existing categories aren't documented at the ADR level.
  • The basis for the monotonicity check (hop count? cube distance? wire length?) is ambiguous.
  • The reference size 32 KiB and the sweep [4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB] are only discoverable by reading source.

Decision

D1. Three case categories — H2D / D2H / PE DMA

Each category has a distinct data path in the topology and gets its own summary table + sweep table + route-detail block.

  • H2D (Host → Device Write): MemoryWriteMsg(dst_sip=0, dst_cube, dst_pe=0, pattern="zero") flows along pcie_ep → io_cpu → m_cpu → hbm_ctrl. The cube index varies the hop count:
    • h2d-1hop: cube=0, hops=1
    • h2d-2hop: cube=4, hops=2
    • h2d-3hop: cube=8, hops=3
    • h2d-4hop: cube=12, hops=4
  • D2H (Device → Host Read): MemoryReadMsg(src_sip=0, src_cube, src_pe=0). Total latency = forward command path + reverse data path. Covers the same four hop cases as H2D.
  • PE DMA (PE-initiated): PeDmaMsg(src_sip, src_cube, src_pe, dst_pa). Five cases cover varying cube/PE positions:
    • pe-local-hbm: same cube, same PE
    • pe-same-half-hbm: same cube, different PE (PE 1)
    • pe-cross-half-hbm: same cube, far PE (PE 4)
    • pe-cross-cube-hbm-best: adjacent cube (cube 1)
    • pe-cross-cube-hbm-worst: diagonal far cube (cube 15)

The cube indices 4/8/12 (H2D) and 1/4/15 (PE DMA) are meaningful for a 4 × 4 cube mesh (sip.cube_mesh.w=4, h=4); changes to the mesh size require these to be updated in lockstep.
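To see why these indices are tied to the mesh size, the sketch below maps a cube index to mesh coordinates. It ASSUMES row-major numbering (cube = row * w + col) and that the reference point is cube 0; neither is stated in this ADR, and the `h2d_hops` rule is only a reading that happens to reproduce the catalog above, not the router's actual logic.

```python
from typing import Tuple

MESH_W, MESH_H = 4, 4  # sip.cube_mesh.w / h from the text


def cube_coords(cube: int, w: int = MESH_W) -> Tuple[int, int]:
    # ASSUMPTION: row-major numbering, cube = row * w + col.
    return divmod(cube, w)  # (row, col)


def h2d_hops(cube: int, w: int = MESH_W) -> int:
    # ASSUMPTION: hop count grows by one per mesh row (matches cubes 0/4/8/12).
    row, _col = cube_coords(cube, w)
    return row + 1


def manhattan_from_origin(cube: int, w: int = MESH_W) -> int:
    # Mesh distance from cube 0 under the same row-major assumption.
    row, col = cube_coords(cube, w)
    return row + col
```

Under these assumptions cubes 0, 4, 8, and 12 sit in column 0 of successive rows, cube 1 is one step from cube 0, and cube 15 is the far corner. With a different mesh width the same indices would land elsewhere, which is why the catalog has to change in lockstep with the mesh size.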

D2. Standard measurement size — nbytes = 32768 (32 KiB)

Every case in the summary table runs once with nbytes=32768. 32 KiB was chosen because:

  • DMA overhead and BW drain are balanced — neither dominates.
  • It compares cleanly against the one-shot transfer size of several sub-units (TCM, register file).

Per-size utilization variations are shown in a separate sweep table (D3).

D3. Utilization sweep — [4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]

SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576], SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]. Per size:

drain   = nbytes / bottleneck_bw
total   = overhead + wire + drain
eff_bw  = nbytes / total
util%   = eff_bw / bottleneck_bw × 100

When the bottleneck bandwidth (bn_bw) is None or <= 0, the column shows 0.0 %. The intent: the table shows at a glance how small transfers are overhead-bound and large transfers are drain-bound, and how that balance shifts as hop count rises.
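The per-size formulas can be written out as a small function. This is a sketch under assumptions: the function names and the example overhead, wire, and bandwidth values are hypothetical, and bandwidth is taken in GB/s with 1 GB = 10^9 bytes, so 1 GB/s equals 1 byte/ns and `nbytes / bn_bw_gbs` is already in nanoseconds.

```python
from typing import Dict, Optional

SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]
SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]


def util_percent(nbytes: int, overhead_ns: float, wire_ns: float,
                 bn_bw_gbs: Optional[float]) -> float:
    # Missing or non-positive bottleneck bandwidth is reported as 0.0 %.
    if bn_bw_gbs is None or bn_bw_gbs <= 0:
        return 0.0
    drain_ns = nbytes / bn_bw_gbs            # drain = nbytes / bottleneck_bw
    total_ns = overhead_ns + wire_ns + drain_ns
    eff_bw = nbytes / total_ns               # bytes/ns, i.e. GB/s
    return eff_bw / bn_bw_gbs * 100.0


def sweep_row(overhead_ns: float, wire_ns: float,
              bn_bw_gbs: Optional[float]) -> Dict[str, float]:
    # One table row: utilization per sweep size, rounded for display.
    return {label: round(util_percent(size, overhead_ns, wire_ns, bn_bw_gbs), 1)
            for label, size in zip(SWEEP_LABELS, SWEEP_SIZES)}
```

With made-up inputs of 200 ns overhead, 2 ns wire, and a 64 GB/s bottleneck, `sweep_row(200.0, 2.0, 64.0)` climbs from about 24 % at 4 KiB to about 99 % at 1 MiB, which is the overhead-bound to drain-bound transition the table is meant to expose.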

D4. Measured columns — actual / formula / breakdown

Per-case columns:

  • Actual (total_ns): the SimPy run's trace["total_ns"].
  • Ovhd: sum of node.attrs["overhead_ns"] along the path (formula).
  • Drain: nbytes / min(edge.bw_gbs over path) (formula).
  • Wire: Σ edge.distance_mm * (ns_per_mm from spec).
  • Ovhd% / Drain%: each portion as a percentage of Actual. Wire is usually too small to display.
  • Eff.BW: nbytes / total_ns (measured BW).
  • BN.BW: bottleneck bandwidth (formula). The minimum edge BW along the path. Missing edge BW shows "-".
  • Util%: Eff.BW / BN.BW × 100. 100 % means the single-stream BW upper bound is reached.

A large gap between the formula sum (wire + ovhd + drain) and Actual signals a factor the simplified model misses (a place to inspect ADR-0033's assumptions).
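The formula columns and the formula-vs-actual gap could be computed along these lines. This is a hedged sketch, not probe's implementation: the `Hop` record, the function names, and the choice to ignore hops with missing bandwidth when taking the minimum are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Hop:
    # Hypothetical per-hop record: the edge taken plus the overhead of the node reached.
    distance_mm: float
    bw_gbs: Optional[float]
    overhead_ns: float


def formula_breakdown(hops: List[Hop], nbytes: int, ns_per_mm: float) -> Dict[str, Optional[float]]:
    ovhd = sum(h.overhead_ns for h in hops)                 # Ovhd column
    wire = sum(h.distance_mm for h in hops) * ns_per_mm     # Wire column
    bws = [h.bw_gbs for h in hops if h.bw_gbs]              # ASSUMPTION: skip missing BW
    bn_bw = min(bws) if bws else None                       # BN.BW column
    drain = nbytes / bn_bw if bn_bw else 0.0                # Drain column (GB/s == bytes/ns)
    return {"ovhd": ovhd, "wire": wire, "drain": drain,
            "bn_bw": bn_bw, "formula_total": ovhd + wire + drain}


def model_gap_ns(actual_total_ns: float, breakdown: Dict[str, Optional[float]]) -> float:
    # Positive gap: the simulator saw latency the simplified model does not account for.
    return actual_total_ns - breakdown["formula_total"]
```

A gap near zero suggests the analytical model covers the case; a persistent gap on one category is the cue to revisit ADR-0033's assumptions for that path.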

D5. Automatic invariant checks — PASS/FAIL

The following invariants are reported with [v] PASS / [x] FAIL:

  • H2D / D2H monotonic increase: as hop count rises, actual latency must increase strictly. all(lats[i] < lats[i+1] for ...).
  • D2H ≥ H2D: for the same hop index, D2H ≥ H2D (D2H has both forward command and reverse data legs). all(d2h[i].total >= h2d[i].total).
  • PE DMA best < worst: cross-cube best (adjacent) latency must be less than cross-cube worst (diagonal).
  • PE DMA local vs remote: prints the local BN BW vs remote BN BW side-by-side (informational, not PASS/FAIL).

When a check fails, a single clear line surfaces the regression for human review.
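The three PASS/FAIL checks reduce to a few comparisons, sketched below. The function names and the exact text after the `[v]` / `[x]` prefix are hypothetical; only the comparison expressions and the prefix itself come from this ADR.

```python
from typing import Sequence


def strictly_increasing(lats: Sequence[float]) -> bool:
    # Monotonicity: every latency is lower than the next hop count's latency.
    return all(lats[i] < lats[i + 1] for i in range(len(lats) - 1))


def d2h_ge_h2d(d2h: Sequence[float], h2d: Sequence[float]) -> bool:
    # Same hop index: the read path must not be faster than the write path.
    return all(d >= h for d, h in zip(d2h, h2d))


def best_lt_worst(best_ns: float, worst_ns: float) -> bool:
    # PE DMA: adjacent-cube latency must be below diagonal-cube latency.
    return best_ns < worst_ns


def report(name: str, ok: bool) -> str:
    # Hypothetical line layout; only the [v] / [x] prefix is specified.
    return f"[{'v' if ok else 'x'}] {'PASS' if ok else 'FAIL'}: {name}"
```

Note that a sequence with fewer than two entries passes `strictly_increasing` vacuously, so a caller that wants to skip the check for a single data point has to do so explicitly.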

D6. Route detail — per-hop timestamp trace

After the summary and sweep tables, each case's path and cumulative per-hop timestamps (_hop_timestamps) appear in a separate section:

  • H2D: leg1 (pcie_ep → io_cpu) + leg2 (io_cpu → m_cpu) + leg3 (m_cpu → hbm_ctrl) + per-hop trace.
  • D2H: forward (cmd, no data) and reverse (data) traces shown separately.
  • PE DMA: pe_dma → router → hbm_ctrl path + per-hop trace.

Each hop's timestamp is cumulative wire_ns + overhead_ns. The terminal hop's annotation appends drain:Xns. Bottleneck edges are marked <BN:XXGB/s> so they are visually identifiable.
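The cumulative timestamps can be illustrated with a running sum. This is a sketch only: the tuple layout, the function names, and the rendered line format are assumptions, apart from the `drain:Xns` suffix on the terminal hop.

```python
from itertools import accumulate
from typing import List, Tuple

HopSpec = Tuple[str, float, float]  # (node_name, wire_ns, overhead_ns), in path order


def hop_timestamps(hops: List[HopSpec]) -> List[Tuple[str, float]]:
    # Each timestamp is the running total of wire_ns + overhead_ns up to that hop.
    per_hop = [wire + ovhd for _name, wire, ovhd in hops]
    return [(name, t) for (name, _w, _o), t in zip(hops, accumulate(per_hop))]


def render_trace(hops: List[HopSpec], drain_ns: float) -> List[str]:
    # Hypothetical rendering; only the terminal "drain:Xns" annotation is from the ADR.
    lines = [f"{name} @ {t:.1f}ns" for name, t in hop_timestamps(hops)]
    if lines:
        lines[-1] += f" drain:{drain_ns:.0f}ns"
    return lines
```

Because drain is appended only at the terminal hop, the last timestamp plus drain reproduces the formula total from D4 for a single-flow transfer.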

D7. Semantics of the case_filter argument

  • None or "all": run all cases (default).
  • Other strings: run only the case whose name matches exactly. Example: kernbench probe --case h2d-2hop.

Within a category, cases with name != case_filter are skipped; if only one data point remains, the category's monotonicity / D2H ≥ H2D comparisons are naturally skipped.

The CLI parser's --case default is "all", so omitting it runs everything.
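The filter semantics amount to an exact-name match with an "all" escape hatch. The sketch below uses hypothetical helper names; the `show_all` expression and the H2D case names are taken from this ADR.

```python
from typing import List, Optional

H2D_CASES = ["h2d-1hop", "h2d-2hop", "h2d-3hop", "h2d-4hop"]


def select_cases(names: List[str], case_filter: Optional[str]) -> List[str]:
    show_all = case_filter is None or case_filter == "all"
    if show_all:
        return list(names)
    # Exact match only: no prefix or glob matching.
    return [n for n in names if n == case_filter]


def can_compare(selected: List[str]) -> bool:
    # Hypothetical guard: cross-case comparisons need at least two data points.
    return len(selected) >= 2
```

For example, `select_cases(H2D_CASES, "h2d-2hop")` leaves a single case, so `can_compare` is false and the category's monotonicity comparison has nothing to compare.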

D8. Fresh GraphEngine per case

Each of the 4 H2D, 4 D2H, and 5 PE DMA cases runs in its own GraphEngine (engine = GraphEngine(graph)). Reasons:

  • Isolate accumulated state (op_log, completion tracking, allocators) so cases do not cross-talk.
  • Guarantee one case's traffic does not perturb another case's BW measurement.

This isolation lets probe results be interpreted as single-flow per-case latency. Multi-flow contention measurement is handled by separate tooling (e.g., the pe2pe_overview plot or ADR-0033's multi-flow merging model).

D9. Output-format stability

probe's stdout is meant for humans; precise column widths, separators, and whitespace are not a machine-readable contract. Automated tools that wish to parse probe output should use a separate JSON-output mode (not yet implemented).

The [v] / [x] prefix on PASS/FAIL lines is a stable CI grep anchor.

Alternatives Considered

A1. Register probe as another bench (@bench(name="probe"))

Rejected. probe is a verification tool, not a bench: it needs multi-engine execution for sweeps/analysis and PASS/FAIL invariant output, neither of which fits ADR-0045's "single device + single RuntimeContext" bench model.

A2. Exit code 1 on monotonicity violation

Rejected (currently). probe is positioned as a human inspection tool — PASS/FAIL is printed and exit is 0. A wrapper can grep "\[x\]" to decide. A future --strict flag could opt into non-zero exits.
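A wrapper of the kind mentioned above might look like the following. This is a hypothetical sketch, not part of kernbench: the function names are invented, the probe command line is left to the caller, and the only contract relied on is that FAIL lines start with `[x]`.

```python
import subprocess
import sys
from typing import List


def failed_checks(probe_stdout: str) -> List[str]:
    # Collect every line carrying the [x] FAIL prefix.
    return [ln for ln in probe_stdout.splitlines() if ln.lstrip().startswith("[x]")]


def run_and_gate(argv: List[str]) -> int:
    # argv is the full probe command line, supplied by the caller.
    proc = subprocess.run(argv, capture_output=True, text=True)
    sys.stdout.write(proc.stdout)  # keep the human-readable report visible
    return 1 if failed_checks(proc.stdout) else 0
```

A CI job could call `sys.exit(run_and_gate([...]))` with its own probe invocation, getting a non-zero exit on any failed invariant without changing probe's exit-0 behaviour.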

A3. Externalize the case catalog to YAML

Rejected (currently). The 13 cases (4 H2D + 4 D2H + 5 PE DMA) are hardcoded and their semantics are tightly bound to the mesh topology. Moving cube-index meaning (4, 8, 12 / 1, 4, 15) into YAML would require separate documentation and lose cohesion. Externalize only when case additions become frequent.

A4. Add multi-flow contention measurement

Rejected (out of probe scope). D8's single-flow isolation is probe's core intent. Multi-flow contention belongs in a different area of the ADR-0033 latency model — either a separate tool or a new case category.

Consequences

  • probe's case catalog (D1) and measurement units (D2/D3) are pinned at ADR level, so new traffic categories know which table format to follow.
  • The semantics of the formula-vs-actual columns (D4) are locked in, so questions like "why is Drain% 5 % or 70 %?" can quickly be linked to ADR-0033 assumption checks.
  • Automatic invariant checks (D5) are pinned, so latency-model changes immediately catch monotonicity / D2H ≥ H2D regressions.
  • D8's case-isolation is explicit, so probe results are safe to read as single-flow measurements. If multi-flow is needed, a separate tool track is clearly required.
  • A2's strict-mode flag is recorded as a follow-up so CI integration has a minimal change path when requested.