Documents four cross-cutting surfaces that previously had no ADR backing, each surfaced as a G4 candidate by /report:
- 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...), the two execution modes (command-list vs greenlet runner), scratch allocator semantics, dispatch-overhead model, and the kernel registry.
- 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group (backend="ahbm") install path. world_size priority (algorithm > defaults > topology), the 4-step init sequence (load ccl.yaml, import algorithm module, derive world_size, install SFR + IPCQ), greenlet-local rank registry, all_reduce dispatch via _defer_wait, barrier no-op rationale, and the explicit list of unsupported dist.* APIs.
- 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator free-list semantics. Offset-keyed first-fit with coalescing, the no-validation trust model for free(), HBM/TCM channel separation, page-aligned VA allocation, the page_size dual-default (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and one-allocator-per-sub-unit rule.
- 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog. H2D / D2H / PE DMA categories with their exact cube-index choices, the 32 KiB reference size, the 5-point utilization sweep, the formula vs actual column meanings, automatic invariant checks (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine isolation, and the human-readable (not machine-parsable) output contract.
Bilingual pair verifier passes for all four EN/KO pairs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-0049: kernbench probe Subcommand — Traffic-Pattern Verification Harness
Status
Accepted (2026-05-22).
Pins down the traffic-pattern catalog, formula-vs-actual comparison, and
invariant checks (monotonicity, D2H ≥ H2D, etc.) exposed by
probes/probe.py::run_probe(...). ADR-0010 (CLI surface) enumerates the
kernbench probe subcommand, but what probe actually measures and
which invariants it judges PASS/FAIL had no ADR-level coverage.
First action
run_probe(topology_path, case_filter=None) performs four startup steps:
- Path(topology_path).expanduser().resolve() → absolute path.
- load_topology(path) → TopologyGraph (graph + spec).
- _build_edge_map(graph) → a {(src, dst): Edge} lookup table.
- Instantiate AddressResolver(graph) + PathRouter(graph).
Then it sets nbytes = 32768 (= 32 KiB, the summary-table reference
size) and show_all = (case_filter is None or case_filter == "all").
In short, probe's first act is "load the topology once and prepare
edge map / resolver / router, plus pin 32 KiB as the standard measurement
size". After that, the H2D → D2H → PE DMA categories execute in
separate GraphEngine instances (no cross-talk between cases).
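The startup steps can be sketched in a few lines. This is a minimal illustration, not the code in probes/probe.py: the Edge dataclass, its fields, and the helper names build_edge_map, resolve_topology_path, and wants_all_cases are hypothetical stand-ins, and the sample edge values are made up.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Edge:
    """Hypothetical stand-in for the topology edge type."""
    src: str
    dst: str
    bw_gbs: float
    distance_mm: float


def resolve_topology_path(topology_path):
    """Expand '~' and return an absolute path (startup step 1)."""
    return Path(topology_path).expanduser().resolve()


def build_edge_map(edges):
    """Index edges by (src, dst) so later lookups are a single dict access."""
    return {(e.src, e.dst): e for e in edges}


def wants_all_cases(case_filter):
    """Same predicate as show_all in the text."""
    return case_filter is None or case_filter == "all"


NBYTES = 32768  # 32 KiB summary-table reference size

# Illustrative edges only; real values come from the loaded topology.
edges = [
    Edge("pcie_ep", "io_cpu", 64.0, 5.0),
    Edge("io_cpu", "m_cpu", 128.0, 2.0),
]
edge_map = build_edge_map(edges)
```

Keying the map by the (src, dst) pair means a route expressed as a node sequence can be turned into edges by looking up each consecutive pair.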
Context
kernbench probe was introduced as a verification tool for these
purposes:
- Manual ground truth: when a real-simulation result (kernbench run --bench ...) shows abnormal latency, derive the answer for a simple traffic pattern in isolation and compare.
- Formula vs actual: check whether the analytical model (wire latency + overhead + drain) matches the simulator's total_ns. A mismatch points to which simplifying assumption in ADR-0033 is missing.
- Monotonicity check: latency should grow monotonically with hop count.
- Utilization sweep: a BW-utilization table across data sizes (4 KiB ~ 1 MiB).
Without an ADR for this tool:
- Adding a new traffic-pattern category (e.g., MCpuDma, IPCQ) is hard because the table format / measurement units of existing categories aren't documented at the ADR level.
- The basis for the monotonicity check (hop count? cube distance? wire length?) is ambiguous.
- The reference size 32 KiB and the sweep [4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB] are only discoverable by reading source.
Decision
D1. Three case categories — H2D / D2H / PE DMA
Each category has a distinct data path in the topology and gets its own summary table + sweep table + route-detail block.
- H2D (Host → Device Write): MemoryWriteMsg(dst_sip=0, dst_cube, dst_pe=0, pattern="zero") flows along pcie_ep → io_cpu → m_cpu → hbm_ctrl. The cube index varies the hop count:
  - h2d-1hop: cube=0, hops=1
  - h2d-2hop: cube=4, hops=2
  - h2d-3hop: cube=8, hops=3
  - h2d-4hop: cube=12, hops=4
- D2H (Device → Host Read): MemoryReadMsg(src_sip=0, src_cube, src_pe=0). Total latency = forward command path + reverse data path. Uses the same four hop cases as H2D.
- PE DMA (PE-initiated): PeDmaMsg(src_sip, src_cube, src_pe, dst_pa). Five cases cover varying cube/PE positions:
  - pe-local-hbm: same cube, same PE
  - pe-same-half-hbm: same cube, different PE (PE 1)
  - pe-cross-half-hbm: same cube, far PE (PE 4)
  - pe-cross-cube-hbm-best: adjacent cube (cube 1)
  - pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
The cube indices 4/8/12 (H2D) and 1/4/15 (PE DMA) are meaningful for a
4 × 4 cube mesh (sip.cube_mesh.w=4, h=4); changes to the mesh size
require these to be updated in lockstep.
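To see why the indices are tied to the mesh size, the sketch below maps a flat cube index to mesh coordinates. It assumes row-major numbering (index = y * w + x), which the text does not state explicitly; cube_xy and manhattan are hypothetical helpers, not part of probe.

```python
MESH_W, MESH_H = 4, 4  # sip.cube_mesh.w / sip.cube_mesh.h from the text


def cube_xy(cube, w=MESH_W, h=MESH_H):
    """Map a flat cube index to (x, y), ASSUMING row-major numbering."""
    if not 0 <= cube < w * h:
        raise ValueError(f"cube {cube} is outside a {w}x{h} mesh")
    return cube % w, cube // w


def manhattan(a, b, w=MESH_W, h=MESH_H):
    """Mesh distance between two cubes, in cube-to-cube steps."""
    ax, ay = cube_xy(a, w, h)
    bx, by = cube_xy(b, w, h)
    return abs(ax - bx) + abs(ay - by)


# H2D cubes walk down column 0, one row per case.
h2d_coords = [cube_xy(c) for c in (0, 4, 8, 12)]

# PE DMA cross-cube cases measured from cube 0: adjacent vs diagonal corner.
best_distance = manhattan(0, 1)
worst_distance = manhattan(0, 15)
```

Under this assumption cubes 0/4/8/12 sit in rows 0 through 3 of the first column, cube 1 is one step from cube 0, and cube 15 is the opposite corner. On a smaller mesh, for example 2 × 2, cube_xy(12, 2, 2) raises ValueError, which is the lockstep dependency in concrete form.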
D2. Standard measurement size — nbytes = 32768 (32 KiB)
Every case in the summary table runs once with nbytes=32768. 32 KiB
was chosen because:
- DMA overhead and BW drain are balanced — neither dominates.
- It compares cleanly against the one-shot transfer size of several sub-units (TCM, register file).
Per-size utilization variations are shown in a separate sweep table (D3).
D3. Utilization sweep — [4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]
SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576],
SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]. Per size:
drain = nbytes / bottleneck_bw
total = overhead + wire + drain
eff_bw = nbytes / total
util% = eff_bw / bottleneck_bw × 100
When bn_bw is None or <= 0, the column shows 0.0 %. The intent: the
table shows in one view how small transfers become overhead-bound and
large transfers become drain-bound as hop count rises.
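The per-size formulas translate directly into code. The following is a sketch under stated assumptions: bandwidth is taken in GB/s with 1 GB/s = 1 byte/ns so that nbytes / bn_bw yields nanoseconds, and the overhead, wire, and bandwidth numbers in the example are invented for illustration. The function names utilization_pct and sweep are hypothetical.

```python
SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]
SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]


def utilization_pct(nbytes, overhead_ns, wire_ns, bn_bw):
    """Return util% for one transfer size; bn_bw is in GB/s (bytes per ns)."""
    if bn_bw is None or bn_bw <= 0:
        return 0.0  # matches the 0.0 % fallback described in the text
    drain_ns = nbytes / bn_bw
    total_ns = overhead_ns + wire_ns + drain_ns
    eff_bw = nbytes / total_ns
    return eff_bw / bn_bw * 100.0


def sweep(overhead_ns, wire_ns, bn_bw):
    """Build one sweep-table row as {label: util%}."""
    return {
        label: utilization_pct(size, overhead_ns, wire_ns, bn_bw)
        for label, size in zip(SWEEP_LABELS, SWEEP_SIZES)
    }


# Illustrative numbers only: 500 ns overhead, 12 ns wire, 64 GB/s bottleneck.
row = sweep(overhead_ns=500.0, wire_ns=12.0, bn_bw=64.0)
for label in SWEEP_LABELS:
    print(f"{label:>6}: {row[label]:5.1f} %")
```

Algebraically util% reduces to drain / total, so with fixed overhead the percentage climbs toward 100 as the transfer size grows.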
D4. Measured columns — actual / formula / breakdown
Per-case columns:
- Actual (total_ns): the SimPy run's trace["total_ns"].
- Ovhd: sum of node.attrs["overhead_ns"] along the path (formula).
- Drain: nbytes / min(edge.bw_gbs over path) (formula).
- Wire: Σ edge.distance_mm * (ns_per_mm from spec).
- Ovhd% / Drain%: each portion as a percentage of Actual. Wire is usually too small to display.
- Eff.BW: nbytes / total_ns (measured BW).
- BN.BW: bottleneck bandwidth (formula). The minimum edge BW along the path. Missing edge BW shows "-".
- Util%: Eff.BW / BN.BW × 100. 100 % means the single-stream BW upper bound is reached.
A large gap between the formula sum (wire + ovhd + drain) and Actual
signals a factor the simplified model misses (a place to inspect
ADR-0033's assumptions).
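The formula columns can be reproduced from a path description. The sketch below is an illustration only: the Hop dataclass (a node bundled with the edge leading into it), formula_breakdown, and gap_pct are hypothetical names, and the path values are invented. Bandwidth is assumed to be in GB/s, treated as bytes per nanosecond.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hop:
    """Hypothetical path element: a node plus the edge leading into it."""
    node: str
    overhead_ns: float
    bw_gbs: Optional[float]  # None models an edge with no BW attribute
    distance_mm: float


def formula_breakdown(path, nbytes, ns_per_mm):
    """Compute the Ovhd / Wire / Drain / BN.BW formula columns for one path."""
    ovhd = sum(h.overhead_ns for h in path)
    wire = sum(h.distance_mm for h in path) * ns_per_mm
    bws = [h.bw_gbs for h in path if h.bw_gbs is not None]
    bn_bw = min(bws) if bws else None  # rendered as "-" when missing
    drain = nbytes / bn_bw if bn_bw else 0.0
    return {
        "ovhd": ovhd,
        "wire": wire,
        "drain": drain,
        "bn_bw": bn_bw,
        "formula_total": ovhd + wire + drain,
    }


def gap_pct(formula_total, actual_ns):
    """Share of the measured latency that the formula does not explain."""
    return (actual_ns - formula_total) / actual_ns * 100.0


# Invented three-hop path; real values come from the topology spec.
path = [
    Hop("io_cpu", 100.0, 64.0, 5.0),
    Hop("m_cpu", 50.0, 128.0, 2.0),
    Hop("hbm_ctrl", 30.0, 32.0, 1.0),
]
cols = formula_breakdown(path, nbytes=32768, ns_per_mm=0.01)
```

A gap_pct near zero suggests the simplified model accounts for the measured latency; a large positive value means the simulator spent time the formula does not model.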
D5. Automatic invariant checks — PASS/FAIL
The following invariants are reported with [v] PASS / [x] FAIL:
- H2D / D2H monotonic increase: as hop count rises, actual latency must grow monotonically. all(lats[i] < lats[i+1] for ...).
- D2H ≥ H2D: for the same hop index, D2H ≥ H2D (D2H has both forward command and reverse data legs). all(d2h[i].total >= h2d[i].total).
- PE DMA best < worst: cross-cube best (adjacent) latency must be less than cross-cube worst (diagonal).
- PE DMA local vs remote: prints the local BN BW vs remote BN BW side-by-side (informational, not PASS/FAIL).
When a check fails, a single clear line surfaces the regression for human review.
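The three PASS/FAIL invariants are small predicates over lists of latencies. The sketch below is an assumption-laden illustration: the function names are hypothetical, the latencies are invented, and only the [v] / [x] prefixes come from the text (the rest of the line layout is a guess).

```python
def strictly_increasing(lats):
    """True when each latency is strictly greater than the previous one."""
    return all(a < b for a, b in zip(lats, lats[1:]))


def d2h_not_faster(h2d, d2h):
    """True when D2H >= H2D at every shared hop index."""
    return all(d >= h for h, d in zip(h2d, d2h))


def best_below_worst(best_ns, worst_ns):
    """True when the adjacent-cube case beats the diagonal-cube case."""
    return best_ns < worst_ns


def report(name, ok):
    """Format one result line; only the [v]/[x] prefix is taken from the ADR."""
    return f"[v] PASS {name}" if ok else f"[x] FAIL {name}"


# Invented latencies in ns, ordered by hop count 1..4.
h2d = [700.0, 780.0, 860.0, 940.0]
d2h = [900.0, 1010.0, 1120.0, 1230.0]

lines = [
    report("H2D monotonic", strictly_increasing(h2d)),
    report("D2H monotonic", strictly_increasing(d2h)),
    report("D2H >= H2D", d2h_not_faster(h2d, d2h)),
    report("PE DMA best < worst", best_below_worst(650.0, 1400.0)),
]
print("\n".join(lines))
```

Note that strictly_increasing is vacuously true for a list with fewer than two entries, so a caller that filters down to a single case should skip the check rather than report a PASS.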
D6. Route detail — per-hop timestamp trace
After the summary and sweep tables, each case's path and cumulative
per-hop timestamps (_hop_timestamps) appear in a separate section:
- H2D: leg1 (pcie_ep → io_cpu) + leg2 (io_cpu → m_cpu) + leg3 (m_cpu → hbm_ctrl) + per-hop trace.
- D2H: forward (cmd, no data) and reverse (data) traces shown separately.
- PE DMA: pe_dma → router → hbm_ctrl path + per-hop trace.
Each hop's timestamp is cumulative wire_ns + overhead_ns. The
terminal hop's annotation appends drain:Xns. Bottleneck edges are
marked <BN:XXGB/s> so they are visually identifiable.
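A cumulative per-hop trace of this kind can be sketched as follows. This is not the _hop_timestamps implementation: hop_timestamps and annotate are hypothetical helpers, the input tuples are invented, and the exact line layout is a guess. Only the drain:Xns suffix on the terminal hop and the <BN:XXGB/s> marker are taken from the text, and here the marker is attached to the hop reached over the bottleneck edge.

```python
def hop_timestamps(hops):
    """hops: list of (name, wire_ns, overhead_ns) -> list of (name, cumulative_ns)."""
    stamps, t = [], 0.0
    for name, wire_ns, overhead_ns in hops:
        t += wire_ns + overhead_ns
        stamps.append((name, t))
    return stamps


def annotate(stamps, drain_ns, bn_hop, bn_bw):
    """Render one line per hop, marking the bottleneck and the terminal drain."""
    lines = []
    last = len(stamps) - 1
    for i, (name, t) in enumerate(stamps):
        line = f"{name} @ {t:.1f}ns"
        if name == bn_hop:
            line += f" <BN:{bn_bw:g}GB/s>"
        if i == last:
            line += f" drain:{drain_ns:.1f}ns"
        lines.append(line)
    return lines


# Invented three-hop H2D path: (node, wire_ns, overhead_ns).
hops = [("io_cpu", 0.5, 100.0), ("m_cpu", 0.25, 50.0), ("hbm_ctrl", 0.25, 30.0)]
stamps = hop_timestamps(hops)
trace = annotate(stamps, drain_ns=1024.0, bn_hop="hbm_ctrl", bn_bw=32.0)
print("\n".join(trace))
```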
D7. Semantics of the case_filter argument
- None or "all": run all cases (default).
- Other strings: run only the case whose name matches exactly. Example: kernbench probe --case h2d-2hop.
Within a category, cases with name != case_filter are skipped; if
only one data point remains, the category's monotonicity / D2H ≥ H2D
comparisons are naturally skipped.
The CLI parser's --case default is "all", so omitting it runs
everything.
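The filter semantics amount to an exact-match selection plus a guard on cross-case comparisons. A minimal sketch, assuming hypothetical helper names select_cases and can_compare (the case names are the ones listed in D1):

```python
H2D_CASES = ["h2d-1hop", "h2d-2hop", "h2d-3hop", "h2d-4hop"]


def select_cases(names, case_filter=None):
    """Return every case for None/'all', otherwise only the exact name match."""
    if case_filter is None or case_filter == "all":
        return list(names)
    return [n for n in names if n == case_filter]


def can_compare(selected):
    """Cross-case invariants need at least two data points."""
    return len(selected) >= 2


everything = select_cases(H2D_CASES)
single = select_cases(H2D_CASES, "h2d-2hop")
unknown = select_cases(H2D_CASES, "h2d")  # prefix only, so nothing matches
```

Because matching is exact, a prefix such as "h2d" selects nothing rather than the whole category.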
D8. Fresh GraphEngine per case
Each of the 4 H2D, 4 D2H, and 5 PE DMA cases runs in its own
GraphEngine (engine = GraphEngine(graph)). Reasons:
- Isolate accumulated state (op_log, completion tracking, allocators) so cases do not cross-talk.
- Guarantee one case's traffic does not perturb another case's BW measurement.
This isolation lets probe results be interpreted as single-flow
per-case latency. Multi-flow contention measurement is handled by
separate tooling (e.g., the pe2pe_overview plot or ADR-0033's
multi-flow merging model).
D9. Output-format stability
probe's stdout is meant for humans; precise column widths, separators, and whitespace are not a machine-readable contract. Automated tools that wish to parse probe output should use a separate JSON-output mode (not yet implemented).
The [v] / [x] prefix on PASS/FAIL lines is a stable CI grep anchor.
Alternatives Considered
A1. Register probe as another bench (@bench(name="probe"))
Rejected. probe is a verification tool, not a bench — multi-engine execution for sweeps/analysis and PASS/FAIL invariant output are essential, none of which fits ADR-0045's "single device + single RuntimeContext" bench model.
A2. Exit code 1 on monotonicity violation
Rejected (currently). probe is positioned as a human inspection tool —
PASS/FAIL is printed and exit is 0. A wrapper can grep "\[x\]" to
decide. A future --strict flag could opt into non-zero exits.
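Such a wrapper could look like the sketch below. It assumes the [x] marker appears at the start of a FAIL line (leading whitespace tolerated) and takes already-captured stdout text as input; how kernbench probe is invoked and captured is left out because the required arguments are not specified here. The function name exit_code_for and the sample output lines are hypothetical.

```python
import re

# Matches a FAIL marker at the start of any line, allowing leading whitespace.
FAIL_LINE = re.compile(r"^\s*\[x\]", re.MULTILINE)


def exit_code_for(probe_stdout):
    """Return 1 if any invariant line is marked [x], otherwise 0."""
    return 1 if FAIL_LINE.search(probe_stdout) else 0


# Invented sample output; only the [v]/[x] prefixes follow the ADR.
passing = "[v] PASS H2D monotonic\n[v] PASS D2H >= H2D\n"
failing = "[v] PASS H2D monotonic\n[x] FAIL D2H >= H2D\n"

codes = (exit_code_for(passing), exit_code_for(failing))
```

A CI job can pass the result to sys.exit, which keeps probe itself exiting 0 while still failing the pipeline on a regression.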
A3. Externalize the case catalog to YAML
Rejected (currently). The 13 cases (4 H2D + 4 D2H + 5 PE DMA) are hardcoded and their semantics are tightly bound to the mesh topology. Moving cube-index meaning (4, 8, 12 / 1, 4, 15) into YAML would require separate documentation and lose cohesion. Externalize only when case additions become frequent.
A4. Add multi-flow contention measurement
Rejected (out of probe scope). D8's single-flow isolation is probe's core intent. Multi-flow contention belongs in a different area of the ADR-0033 latency model — either a separate tool or a new case category.
Consequences
- probe's case catalog (D1) and measurement units (D2/D3) are pinned at ADR level, so new traffic categories know which table format to follow.
- The semantics of the formula-vs-actual columns (D4) are locked in, so questions like "why is Drain% 5 % or 70 %?" can quickly be linked to ADR-0033 assumption checks.
- Automatic invariant checks (D5) are pinned, so latency-model changes immediately catch monotonicity / D2H ≥ H2D regressions.
- D8's case-isolation is explicit, so probe results are safe to read as single-flow measurements. If multi-flow is needed, a separate tool track is clearly required.
- A2's strict-mode flag is recorded as a follow-up so CI integration has a minimal change path when requested.