9a02955770
Documents four cross-cutting surfaces that previously had no ADR backing, each surfaced as a G4 candidate by /report: - 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...), the two execution modes (command-list vs greenlet runner), scratch allocator semantics, dispatch-overhead model, and the kernel registry. - 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group (backend="ahbm") install path. world_size priority (algorithm > defaults > topology), the 4-step init sequence (load ccl.yaml, import algorithm module, derive world_size, install SFR + IPCQ), greenlet- local rank registry, all_reduce dispatch via _defer_wait, barrier no-op rationale, and the explicit list of unsupported dist.* APIs. - 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator free-list semantics. Offset-keyed first-fit with coalescing, the no-validation trust model for free(), HBM/TCM channel separation, page-aligned VA allocation, the page_size dual-default (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and one-allocator-per-sub-unit rule. - 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog. H2D / D2H / PE DMA categories with their exact cube-index choices, the 32 KiB reference size, the 5-point utilization sweep, the formula vs actual column meanings, automatic invariant checks (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine isolation, and the human-readable (not machine-parsable) output contract. Bilingual pair verifier passes for all four EN/KO pairs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
248 lines
9.8 KiB
Markdown
248 lines
9.8 KiB
Markdown
# ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
|
||
|
||
## Status
|
||
|
||
Accepted (2026-05-22).
|
||
|
||
Pins down the traffic-pattern catalog, formula-vs-actual comparison, and
|
||
invariant checks (monotonicity, D2H ≥ H2D, etc.) exposed by
|
||
`probes/probe.py::run_probe(...)`. ADR-0010 (CLI surface) enumerates the
|
||
`kernbench probe` subcommand, but **what probe actually measures** and
|
||
**which invariants it judges PASS/FAIL** had no ADR-level coverage.
|
||
|
||
## First action
|
||
|
||
`run_probe(topology_path, case_filter=None)` performs four startup steps:
|
||
|
||
1. `Path(topology_path).expanduser().resolve()` → absolute path.
|
||
2. `load_topology(path)` → `TopologyGraph` (graph + spec).
|
||
3. `_build_edge_map(graph)` → a `{(src, dst): Edge}` lookup table.
|
||
4. Instantiate `AddressResolver(graph)` + `PathRouter(graph)`.
|
||
|
||
Then it sets `nbytes = 32768` (= 32 KiB, the summary-table reference
|
||
size) and `show_all = (case_filter is None or case_filter == "all")`.
|
||
|
||
In short, **probe's first act is "load the topology once and prepare
|
||
edge map / resolver / router, plus pin 32 KiB as the standard measurement
|
||
size"**. After that, the H2D → D2H → PE DMA categories execute in
|
||
separate `GraphEngine` instances (no cross-talk between cases).
|
||
|
||
## Context
|
||
|
||
`kernbench probe` was introduced as a verification tool for these
|
||
purposes:
|
||
|
||
- **Manual ground truth**: when a real-simulation result (`kernbench run
|
||
--bench ...`) shows abnormal latency, derive the answer for a simple
|
||
traffic pattern in isolation and compare.
|
||
- **Formula vs actual**: check whether the analytical model
|
||
(wire latency + overhead + drain) matches the simulator's
|
||
`total_ns`. A mismatch points to which simplifying assumption in
|
||
ADR-0033 is missing.
|
||
- **Monotonicity check**: latency should grow monotonically with hop
|
||
count.
|
||
- **Utilization sweep**: a BW-utilization table across data sizes
|
||
(4 KiB ~ 1 MiB).
|
||
|
||
Without an ADR for this tool:
|
||
|
||
- Adding a new traffic-pattern category (e.g., MCpuDma, IPCQ) is hard
|
||
because the table format / measurement units of existing categories
|
||
aren't documented at the ADR level.
|
||
- The basis for the monotonicity check (hop count? cube distance? wire
|
||
length?) is ambiguous.
|
||
- The reference size 32 KiB and the sweep `[4 KiB, 16 KiB, 64 KiB, 256
|
||
KiB, 1 MiB]` are only discoverable by reading source.
|
||
|
||
## Decision
|
||
|
||
### D1. Three case categories — H2D / D2H / PE DMA
|
||
|
||
Each category has a distinct data path in the topology and gets its own
|
||
summary table + sweep table + route-detail block.
|
||
|
||
- **H2D (Host → Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube,
|
||
dst_pe=0, pattern="zero")` flows along `pcie_ep → io_cpu → m_cpu →
|
||
hbm_ctrl`. The cube index varies the hop count:
|
||
- h2d-1hop: cube=0, hops=1
|
||
- h2d-2hop: cube=4, hops=2
|
||
- h2d-3hop: cube=8, hops=3
|
||
- h2d-4hop: cube=12, hops=4
|
||
- **D2H (Device → Host Read)**: `MemoryReadMsg(src_sip=0, src_cube,
|
||
src_pe=0)`. Total latency = forward command path + reverse data path.
|
||
Same 4-hops category as H2D.
|
||
- **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe,
|
||
dst_pa)`. Five cases cover varying cube/PE positions:
|
||
- pe-local-hbm: same cube, same PE
|
||
- pe-same-half-hbm: same cube, different PE (PE 1)
|
||
- pe-cross-half-hbm: same cube, far PE (PE 4)
|
||
- pe-cross-cube-hbm-best: adjacent cube (cube 1)
|
||
- pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
|
||
|
||
The cube indices 4/8/12 (H2D) and 1/4/15 (PE DMA) are meaningful for a
|
||
4 × 4 cube mesh (`sip.cube_mesh.w=4, h=4`); changes to the mesh size
|
||
require these to be updated in lockstep.
|
||
|
||
### D2. Standard measurement size — `nbytes = 32768` (32 KiB)
|
||
|
||
Every case in the summary table runs once with `nbytes=32768`. 32 KiB
|
||
was chosen because:
|
||
|
||
- DMA overhead and BW drain are balanced — neither dominates.
|
||
- It compares cleanly against the one-shot transfer size of several
|
||
sub-units (TCM, register file).
|
||
|
||
Per-size utilization variations are shown in a separate sweep table
|
||
(D3).
|
||
|
||
### D3. Utilization sweep — `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`
|
||
|
||
`SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`,
|
||
`SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]`. Per size:
|
||
|
||
```
|
||
drain = nbytes / bottleneck_bw
|
||
total = overhead + wire + drain
|
||
eff_bw = nbytes / total
|
||
util% = eff_bw / bottleneck_bw × 100
|
||
```
|
||
|
||
When `bn_bw is None or <= 0`, the column shows 0.0 %. The intent: the
|
||
table shows in one view how small transfers become overhead-bound and
|
||
large transfers become drain-bound as hop count rises.
|
||
|
||
### D4. Measured columns — actual / formula / breakdown
|
||
|
||
Per-case columns:
|
||
|
||
- `Actual` (total_ns): the SimPy run's `trace["total_ns"]`.
|
||
- `Ovhd`: sum of `node.attrs["overhead_ns"]` along the path (formula).
|
||
- `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
|
||
- `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
|
||
- `Ovhd%` / `Drain%`: each portion as a percentage of Actual. Wire is
|
||
usually too small to display.
|
||
- `Eff.BW`: `nbytes / total_ns` (measured BW).
|
||
- `BN.BW`: bottleneck bandwidth (formula). The minimum edge BW along
|
||
the path. Missing edge BW shows "-".
|
||
- `Util%`: `Eff.BW / BN.BW × 100`. 100 % means the single-stream BW
|
||
upper bound is reached.
|
||
|
||
A large gap between the formula sum (`wire + ovhd + drain`) and Actual
|
||
signals a factor the simplified model misses (a place to inspect
|
||
ADR-0033's assumptions).
|
||
|
||
### D5. Automatic invariant checks — PASS/FAIL
|
||
|
||
The following invariants are reported with `[v] PASS` / `[x] FAIL`:
|
||
|
||
- **H2D / D2H monotonic increase**: as hop count rises, actual latency
|
||
must grow monotonically. `all(lats[i] < lats[i+1] for ...)`.
|
||
- **D2H ≥ H2D**: for the same hop index, D2H ≥ H2D (D2H has both
|
||
forward command and reverse data legs). `all(d2h[i].total >=
|
||
h2d[i].total)`.
|
||
- **PE DMA best < worst**: cross-cube best (adjacent) latency must be
|
||
less than cross-cube worst (diagonal).
|
||
- **PE DMA local vs remote**: prints the local BN BW vs remote BN BW
|
||
side-by-side (informational, not PASS/FAIL).
|
||
|
||
When a check fails, a single clear line surfaces the regression for
|
||
human review.
|
||
|
||
### D6. Route detail — per-hop timestamp trace
|
||
|
||
After the summary and sweep tables, each case's path and cumulative
|
||
per-hop timestamps (`_hop_timestamps`) appear in a separate section:
|
||
|
||
- H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3
|
||
(`m_cpu → hbm_ctrl`) + per-hop trace.
|
||
- D2H: forward (cmd, no data) and reverse (data) traces shown
|
||
separately.
|
||
- PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.
|
||
|
||
Each hop's timestamp is cumulative `wire_ns + overhead_ns`. The
|
||
terminal hop's annotation appends `drain:Xns`. Bottleneck edges are
|
||
marked `<BN:XXGB/s>` so they are visually identifiable.
|
||
|
||
### D7. Semantics of the `case_filter` argument
|
||
|
||
- `None` or `"all"`: run all cases (default).
|
||
- Other strings: run only the case whose name matches exactly. Example:
|
||
`kernbench probe --case h2d-2hop`.
|
||
|
||
Within a category, cases with `name != case_filter` are skipped; if
|
||
only one data point remains, the category's monotonicity / D2H ≥ H2D
|
||
comparisons are naturally skipped.
|
||
|
||
The CLI parser's `--case` default is `"all"`, so omitting it runs
|
||
everything.
|
||
|
||
### D8. Fresh GraphEngine per case
|
||
|
||
Each of the 4 H2D, 4 D2H, and 5 PE DMA cases runs in **its own
|
||
GraphEngine** (`engine = GraphEngine(graph)`). Reasons:
|
||
|
||
- Isolate accumulated state (op_log, completion tracking, allocators)
|
||
so cases do not cross-talk.
|
||
- Guarantee one case's traffic does not perturb another case's BW
|
||
measurement.
|
||
|
||
This isolation lets probe results be interpreted as **single-flow**
|
||
per-case latency. Multi-flow contention measurement is handled by
|
||
separate tooling (e.g., the `pe2pe_overview` plot or ADR-0033's
|
||
multi-flow merging model).
|
||
|
||
### D9. Output-format stability
|
||
|
||
probe's stdout is meant for humans; precise column widths, separators,
|
||
and whitespace are **not** a machine-readable contract. Automated tools
|
||
that wish to parse probe output should use a separate JSON-output mode
|
||
(not yet implemented).
|
||
|
||
The `[v]` / `[x]` prefix on PASS/FAIL lines is a stable CI grep anchor.
|
||
|
||
## Alternatives Considered
|
||
|
||
### A1. Register probe as another bench (`@bench(name="probe")`)
|
||
|
||
Rejected. probe is a verification tool, not a bench — multi-engine
|
||
execution for sweeps/analysis and PASS/FAIL invariant output are
|
||
essential, none of which fits ADR-0045's "single device + single
|
||
RuntimeContext" bench model.
|
||
|
||
### A2. Exit code 1 on monotonicity violation
|
||
|
||
Rejected (currently). probe is positioned as a human inspection tool —
|
||
PASS/FAIL is printed and exit is 0. A wrapper can `grep "\[x\]"` to
|
||
decide. A future `--strict` flag could opt into non-zero exits.
|
||
|
||
### A3. Externalize the case catalog to YAML
|
||
|
||
Rejected (currently). The 8 cases (4 H2D + 4 D2H + 5 PE DMA = 13 total)
|
||
are hardcoded and their semantics are tightly bound to the mesh
|
||
topology. Moving cube-index meaning (4, 8, 12 / 1, 4, 15) into YAML
|
||
would require separate documentation and lose cohesion. Externalize
|
||
only when case additions become frequent.
|
||
|
||
### A4. Add multi-flow contention measurement
|
||
|
||
Rejected (out of probe scope). D8's single-flow isolation is probe's
|
||
core intent. Multi-flow contention belongs in a different area of the
|
||
ADR-0033 latency model — either a separate tool or a new case
|
||
category.
|
||
|
||
## Consequences
|
||
|
||
- probe's case catalog (D1) and measurement units (D2/D3) are pinned at
|
||
ADR level, so new traffic categories know which table format to
|
||
follow.
|
||
- The semantics of the formula-vs-actual columns (D4) are locked in, so
|
||
questions like "why is Drain% 5 % or 70 %?" can quickly be linked to
|
||
ADR-0033 assumption checks.
|
||
- Automatic invariant checks (D5) are pinned, so latency-model changes
|
||
immediately catch monotonicity / D2H ≥ H2D regressions.
|
||
- D8's case-isolation is explicit, so probe results are safe to read as
|
||
single-flow measurements. If multi-flow is needed, a separate tool
|
||
track is clearly required.
|
||
- A2's strict-mode flag is recorded as a follow-up so CI integration
|
||
has a minimal change path when requested.
|