ywkang 9a02955770 adr: add ADR-0046-0049 — close G4 coverage gaps from /report

Documents four cross-cutting surfaces that previously had no ADR backing,
each surfaced as a G4 candidate by /report:

- 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates
  all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...),
  the two execution modes (command-list vs greenlet runner), scratch
  allocator semantics, dispatch-overhead model, and the kernel registry.

- 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group
  (backend="ahbm") install path. world_size priority (algorithm >
  defaults > topology), the 4-step init sequence (load ccl.yaml, import
  algorithm module, derive world_size, install SFR + IPCQ), greenlet-
  local rank registry, all_reduce dispatch via _defer_wait, barrier
  no-op rationale, and the explicit list of unsupported dist.* APIs.

- 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator
  free-list semantics. Offset-keyed first-fit with coalescing, the
  no-validation trust model for free(), HBM/TCM channel separation,
  page-aligned VA allocation, the page_size dual-default
  (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and
  one-allocator-per-sub-unit rule.

- 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog.
  H2D / D2H / PE DMA categories with their exact cube-index choices,
  the 32 KiB reference size, the 5-point utilization sweep, the
  formula vs actual column meanings, automatic invariant checks
  (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine
  isolation, and the human-readable (not machine-parsable) output
  contract.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-22 10:25:04 -07:00


ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness

Status

Accepted (2026-05-22).

Pins down the traffic-pattern catalog, formula-vs-actual comparison, and invariant checks (monotonicity, D2H ≥ H2D, etc.) exposed by probes/probe.py::run_probe(...). ADR-0010 (CLI surface) enumerates the kernbench probe subcommand, but what probe actually measures and which invariants it judges PASS/FAIL had no ADR-level coverage.

First action

run_probe(topology_path, case_filter=None) performs four startup steps:

1. Path(topology_path).expanduser().resolve() → absolute path.
2. load_topology(path) → TopologyGraph (graph + spec).
3. _build_edge_map(graph) → a {(src, dst): Edge} lookup table.
4. Instantiate AddressResolver(graph) + PathRouter(graph).

Then it sets nbytes = 32768 (= 32 KiB, the summary-table reference size) and show_all = (case_filter is None or case_filter == "all").

In short, probe's first act is to load the topology once, prepare the edge map, resolver, and router, and pin 32 KiB as the standard measurement size. After that, the H2D, D2H, and PE DMA categories run in that order, with each case in its own GraphEngine instance (no cross-talk between cases).
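The argument handling in this first step can be sketched as follows. This is a minimal illustration, not the code in probes/probe.py: `normalize_probe_args` is a hypothetical helper name, and the topology loading, edge map, resolver, and router steps are left out because their signatures are not spelled out in this ADR.

```python
from pathlib import Path

NBYTES = 32768  # 32 KiB, the summary-table reference size


def normalize_probe_args(topology_path, case_filter=None):
    """Hypothetical helper mirroring the path and filter handling described above."""
    # Expand "~" and resolve to an absolute path before the topology is loaded.
    path = Path(topology_path).expanduser().resolve()
    # None and "all" both mean "run every case".
    show_all = case_filter is None or case_filter == "all"
    return path, show_all


path, show_all = normalize_probe_args("~/topology.yaml")
print(path.is_absolute(), show_all)
```

Resolving the path up front means every later step sees the same absolute location regardless of the caller's working directory.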

Context

kernbench probe was introduced as a verification tool for these purposes:

Manual ground truth: when a real-simulation result (kernbench run --bench ...) shows abnormal latency, derive the answer for a simple traffic pattern in isolation and compare.
Formula vs actual: check whether the analytical model (wire latency + overhead + drain) matches the simulator's total_ns. A mismatch indicates that one of ADR-0033's simplifying assumptions does not hold or that the model omits a factor.
Monotonicity check: latency should grow monotonically with hop count.
Utilization sweep: a BW-utilization table across data sizes (4 KiB to 1 MiB).

Without an ADR for this tool:

Adding a new traffic-pattern category (e.g., MCpuDma, IPCQ) is hard because the table format / measurement units of existing categories aren't documented at the ADR level.
The basis for the monotonicity check (hop count? cube distance? wire length?) is ambiguous.
The reference size 32 KiB and the sweep [4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB] are only discoverable by reading source.

Decision

D1. Three case categories — H2D / D2H / PE DMA

Each category has a distinct data path in the topology and gets its own summary table + sweep table + route-detail block.

H2D (Host → Device Write): MemoryWriteMsg(dst_sip=0, dst_cube, dst_pe=0, pattern="zero") flows along pcie_ep → io_cpu → m_cpu → hbm_ctrl. The cube index varies the hop count:
- h2d-1hop: cube=0, hops=1
- h2d-2hop: cube=4, hops=2
- h2d-3hop: cube=8, hops=3
- h2d-4hop: cube=12, hops=4
D2H (Device → Host Read): MemoryReadMsg(src_sip=0, src_cube, src_pe=0). Total latency = forward command path + reverse data path. It uses the same four hop-count cases as H2D.
PE DMA (PE-initiated): PeDmaMsg(src_sip, src_cube, src_pe, dst_pa). Five cases cover varying cube/PE positions:
- pe-local-hbm: same cube, same PE
- pe-same-half-hbm: same cube, different PE (PE 1)
- pe-cross-half-hbm: same cube, far PE (PE 4)
- pe-cross-cube-hbm-best: adjacent cube (cube 1)
- pe-cross-cube-hbm-worst: diagonal far cube (cube 15)

The cube indices 4/8/12 (H2D) and 1/15 (PE DMA) are meaningful only for a 4 × 4 cube mesh (sip.cube_mesh.w=4, h=4); a change to the mesh size requires updating them in lockstep. The 4 in pe-cross-half-hbm is a PE position, not a cube index.
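To see why these indices are tied to the 4 × 4 mesh, the sketch below maps a cube index to mesh coordinates. It assumes row-major cube numbering (index = row × w + col) and a source at cube 0. Neither is stated in this ADR, but both are consistent with the catalog: cubes 0/4/8/12 fall in column 0 of rows 0 to 3, cube 1 is adjacent to cube 0, and cube 15 is the diagonally opposite corner. `cube_coords` and `manhattan` are hypothetical helper names.

```python
MESH_W, MESH_H = 4, 4  # sip.cube_mesh.w / h from D1


def cube_coords(cube, w=MESH_W, h=MESH_H):
    """Return (row, col), ASSUMING row-major numbering (not confirmed by the ADR)."""
    if not 0 <= cube < w * h:
        raise ValueError(f"cube {cube} is outside a {w}x{h} mesh")
    return divmod(cube, w)


def manhattan(a, b):
    """Mesh distance between two cubes under the same numbering assumption."""
    (ra, ca), (rb, cb) = cube_coords(a), cube_coords(b)
    return abs(ra - rb) + abs(ca - cb)


# H2D catalog from D1: case name -> (cube, documented hop count).
H2D_CASES = {
    "h2d-1hop": (0, 1),
    "h2d-2hop": (4, 2),
    "h2d-3hop": (8, 3),
    "h2d-4hop": (12, 4),
}

for name, (cube, hops) in H2D_CASES.items():
    row, col = cube_coords(cube)
    print(f"{name}: cube={cube} row={row} col={col} hops={hops}")

# PE DMA cross-cube cases, assuming the source sits in cube 0.
print("best:", manhattan(0, 1), "worst:", manhattan(0, 15))
```

On a mesh of a different width the same indices land in different rows, which is why the catalog has to change together with the mesh size.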

D2. Standard measurement size — `nbytes = 32768` (32 KiB)

Every case in the summary table runs once with nbytes=32768. 32 KiB was chosen because:

DMA overhead and BW drain are balanced — neither dominates.
It compares cleanly against the one-shot transfer size of several sub-units (TCM, register file).

Per-size utilization variations are shown in a separate sweep table (D3).

D3. Utilization sweep — `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`

SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576], SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]. Per size:

drain   = nbytes / bottleneck_bw
total   = overhead + wire + drain
eff_bw  = nbytes / total
util%   = eff_bw / bottleneck_bw × 100

When the bottleneck bandwidth (bn_bw) is None or <= 0, the column shows 0.0 %. The intent is for the table to show in one view how small transfers are overhead-bound, how large transfers become drain-bound, and how that balance shifts as hop count rises.
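The per-size formulas can be exercised with a small sketch. The overhead, wire, and bandwidth figures below are made-up inputs for illustration, not values from any topology, and `sweep_row` is a hypothetical helper. Bandwidth is taken in GB/s with 1 GB/s treated as 1 byte/ns so that `nbytes / bn_bw` yields nanoseconds; that unit convention is an assumption.

```python
SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]
SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]


def sweep_row(nbytes, overhead_ns, wire_ns, bn_bw):
    """Return util% for one transfer size, following the D3 formulas."""
    if bn_bw is None or bn_bw <= 0:
        return 0.0  # no usable bottleneck bandwidth: the column shows 0.0 %
    drain = nbytes / bn_bw  # ns, assuming 1 GB/s == 1 byte/ns
    total = overhead_ns + wire_ns + drain
    eff_bw = nbytes / total
    return eff_bw / bn_bw * 100.0


# Illustrative inputs only: 200 ns overhead, 5 ns wire, 64 GB/s bottleneck.
utils = [sweep_row(n, 200.0, 5.0, 64.0) for n in SWEEP_SIZES]
for label, util in zip(SWEEP_LABELS, utils):
    print(f"{label:>6}: {util:5.1f} %")
```

With these inputs the small sizes are dominated by the fixed overhead, while the largest size approaches the bottleneck bandwidth.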

D4. Measured columns — actual / formula / breakdown

Per-case columns:

Actual (total_ns): the SimPy run's trace["total_ns"].
Ovhd: sum of node.attrs["overhead_ns"] along the path (formula).
Drain: nbytes / min(edge.bw_gbs over path) (formula).
Wire: Σ edge.distance_mm * (ns_per_mm from spec).
Ovhd% / Drain%: each portion as a percentage of Actual. Wire is usually too small to display.
Eff.BW: nbytes / total_ns (measured BW).
BN.BW: bottleneck bandwidth (formula). The minimum edge BW along the path. Missing edge BW shows "-".
Util%: Eff.BW / BN.BW × 100. 100 % means the single-stream BW upper bound is reached.

A large gap between the formula sum (wire + ovhd + drain) and Actual signals a factor the simplified model misses (a place to inspect ADR-0033's assumptions).
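A minimal sketch of how the formula columns combine, using plain dictionaries in place of the real graph objects. Only the attribute names `overhead_ns`, `bw_gbs`, and `distance_mm` come from the text above; the `ns_per_mm` value, the path data, and the `breakdown` helper are hypothetical, and skipping edges that lack a bandwidth is an assumption.

```python
def breakdown(nodes, edges, nbytes, ns_per_mm, actual_ns):
    """Compute the D4 formula columns for one path."""
    ovhd = sum(n["overhead_ns"] for n in nodes)
    wire = sum(e["distance_mm"] * ns_per_mm for e in edges)
    # Edges without a bandwidth are skipped here (an assumption).
    bws = [e["bw_gbs"] for e in edges if e.get("bw_gbs")]
    bn_bw = min(bws) if bws else None         # BN.BW, shown as "-" when missing
    drain = nbytes / bn_bw if bn_bw else 0.0  # assumes 1 GB/s == 1 byte/ns
    eff_bw = nbytes / actual_ns
    return {
        "ovhd": ovhd,
        "wire": wire,
        "drain": drain,
        "bn_bw": bn_bw,
        "ovhd_pct": ovhd / actual_ns * 100,
        "drain_pct": drain / actual_ns * 100,
        "eff_bw": eff_bw,
        "util_pct": eff_bw / bn_bw * 100 if bn_bw else 0.0,
        "gap_ns": actual_ns - (ovhd + wire + drain),  # formula vs Actual
    }


# Made-up path for illustration.
nodes = [{"overhead_ns": 50.0}, {"overhead_ns": 100.0}, {"overhead_ns": 30.0}]
edges = [
    {"distance_mm": 10.0, "bw_gbs": 128.0},
    {"distance_mm": 4.0, "bw_gbs": 64.0},
]
row = breakdown(nodes, edges, nbytes=32768, ns_per_mm=0.01, actual_ns=700.0)
print(row)
```

A `gap_ns` that stays small relative to Actual suggests the simplified model accounts for the path; a large one is the cue to revisit ADR-0033's assumptions.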

D5. Automatic invariant checks — PASS/FAIL

The following invariants are reported with [v] PASS / [x] FAIL:

H2D / D2H monotonic increase: as hop count rises, actual latency must grow monotonically. all(lats[i] < lats[i+1] for ...).
D2H ≥ H2D: for the same hop index, D2H ≥ H2D (D2H has both forward command and reverse data legs). all(d2h[i].total >= h2d[i].total).
PE DMA best < worst: cross-cube best (adjacent) latency must be less than cross-cube worst (diagonal).
PE DMA local vs remote: prints the local BN BW vs remote BN BW side-by-side (informational, not PASS/FAIL).

When a check fails, a single clear line surfaces the regression for human review.
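The three PASS/FAIL checks can be sketched as below. The latencies are made-up numbers, `strictly_increasing` and `report` are hypothetical helpers, and the text after the `[v]` / `[x]` prefix is illustrative since the exact line layout is not specified here.

```python
def strictly_increasing(lats):
    """True when each latency is smaller than the next one."""
    return all(lats[i] < lats[i + 1] for i in range(len(lats) - 1))


def report(label, ok):
    """Format one invariant line with the [v] / [x] prefix."""
    return f"{'[v] PASS' if ok else '[x] FAIL'}: {label}"


# Made-up latencies in ns, ordered by hop count (1 to 4).
h2d = [610.0, 655.0, 700.0, 745.0]
d2h = [640.0, 690.0, 740.0, 790.0]
pe_best, pe_worst = 580.0, 720.0

lines = [
    report("H2D monotonic", strictly_increasing(h2d)),
    report("D2H monotonic", strictly_increasing(d2h)),
    report("D2H >= H2D", all(d >= h for d, h in zip(d2h, h2d))),
    report("PE DMA best < worst", pe_best < pe_worst),
]
print("\n".join(lines))
```

Note that the monotonicity check is strict: two equal latencies at adjacent hop counts count as a failure.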

D6. Route detail — per-hop timestamp trace

After the summary and sweep tables, each case's path and cumulative per-hop timestamps (_hop_timestamps) appear in a separate section:

H2D: leg1 (pcie_ep → io_cpu) + leg2 (io_cpu → m_cpu) + leg3 (m_cpu → hbm_ctrl) + per-hop trace.
D2H: forward (cmd, no data) and reverse (data) traces shown separately.
PE DMA: pe_dma → router → hbm_ctrl path + per-hop trace.

Each hop's timestamp is cumulative wire_ns + overhead_ns. The terminal hop's annotation appends drain:Xns. Bottleneck edges are marked <BN:XXGB/s> so they are visually identifiable.
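The cumulative timestamps can be illustrated with a short sketch. The hop values are invented, `hop_timestamps` here is a hypothetical stand-in rather than the project's `_hop_timestamps`, and the annotation string only approximates the `drain:Xns` form mentioned above.

```python
from itertools import accumulate


def hop_timestamps(hops, drain_ns):
    """Return (name, cumulative_ns, annotation) for each hop.

    Each hop contributes wire_ns + overhead_ns; only the terminal hop
    carries the drain annotation.
    """
    cumulative = list(accumulate(h["wire_ns"] + h["overhead_ns"] for h in hops))
    rows = []
    for i, (hop, t) in enumerate(zip(hops, cumulative)):
        note = f"drain:{drain_ns:.0f}ns" if i == len(hops) - 1 else ""
        rows.append((hop["name"], t, note))
    return rows


# Made-up H2D path: pcie_ep -> io_cpu -> m_cpu -> hbm_ctrl.
hops = [
    {"name": "io_cpu", "wire_ns": 0.5, "overhead_ns": 50.0},
    {"name": "m_cpu", "wire_ns": 0.25, "overhead_ns": 100.0},
    {"name": "hbm_ctrl", "wire_ns": 0.25, "overhead_ns": 30.0},
]
rows = hop_timestamps(hops, drain_ns=512.0)
for name, t, note in rows:
    print(f"{name:>9} @ {t:7.2f} ns {note}")
```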

D7. Semantics of the `case_filter` argument

None or "all": run all cases (default).
Other strings: run only the case whose name matches exactly. Example: kernbench probe --case h2d-2hop.

Within a category, cases with name != case_filter are skipped; if only one data point remains, the category's monotonicity / D2H ≥ H2D comparisons are naturally skipped.

The CLI parser's --case default is "all", so omitting it runs everything.
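The filter semantics amount to an exact-name match with two "run everything" values. A minimal sketch, assuming hypothetical helpers `select_cases` and `can_compare` (the real code may structure the skipping differently):

```python
H2D_NAMES = ["h2d-1hop", "h2d-2hop", "h2d-3hop", "h2d-4hop"]


def select_cases(names, case_filter=None):
    """Return the cases to run for one category (exact-name match)."""
    if case_filter is None or case_filter == "all":
        return list(names)
    return [n for n in names if n == case_filter]


def can_compare(selected):
    """Monotonicity and D2H >= H2D need at least two data points."""
    return len(selected) >= 2


print(select_cases(H2D_NAMES, "h2d-2hop"), can_compare(["h2d-2hop"]))
```

Because matching is exact, a prefix such as "h2d" selects nothing rather than the whole H2D category.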

D8. Fresh GraphEngine per case

Each of the 4 H2D, 4 D2H, and 5 PE DMA cases runs in its own GraphEngine (engine = GraphEngine(graph)). Reasons:

Isolate accumulated state (op_log, completion tracking, allocators) so cases do not cross-talk.
Guarantee one case's traffic does not perturb another case's BW measurement.

This isolation lets probe results be interpreted as single-flow per-case latency. Multi-flow contention measurement is handled by separate tooling (e.g., the pe2pe_overview plot or ADR-0033's multi-flow merging model).

D9. Output-format stability

probe's stdout is meant for humans; precise column widths, separators, and whitespace are not a machine-readable contract. Automated tools that wish to parse probe output should use a separate JSON-output mode (not yet implemented).

The [v] / [x] prefix on PASS/FAIL lines is a stable CI grep anchor.

Alternatives Considered

A1. Register probe as another bench (`@bench(name="probe")`)

Rejected. probe is a verification tool, not a bench: it needs multi-engine execution for sweeps/analysis and PASS/FAIL invariant output, neither of which fits ADR-0045's "single device + single RuntimeContext" bench model.

A2. Exit code 1 on monotonicity violation

Rejected (currently). probe is positioned as a human inspection tool — PASS/FAIL is printed and exit is 0. A wrapper can grep "\[x\]" to decide. A future --strict flag could opt into non-zero exits.
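A wrapper along these lines could turn the printed result into an exit status. This is a sketch under assumptions: `exit_code_for` is a hypothetical function, the sample lines are invented, and the only thing relied on is the `[x]` prefix that D9 describes as a stable anchor.

```python
import re

# A FAIL line starts with "[x]", optionally after leading whitespace.
FAIL_MARK = re.compile(r"^\s*\[x\]", re.MULTILINE)


def exit_code_for(probe_stdout):
    """Return 1 if any line starts with the [x] FAIL prefix, else 0."""
    return 1 if FAIL_MARK.search(probe_stdout) else 0


sample_ok = "[v] PASS H2D monotonic\n[v] PASS D2H >= H2D\n"
sample_bad = "[v] PASS H2D monotonic\n[x] FAIL D2H >= H2D\n"
print(exit_code_for(sample_ok), exit_code_for(sample_bad))
```

In CI, the captured stdout of the probe run would be passed to this function and its return value handed to `sys.exit`.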

A3. Externalize the case catalog to YAML

Rejected (currently). The 13 cases (4 H2D + 4 D2H + 5 PE DMA) are hardcoded and their semantics are tightly bound to the mesh topology. Moving the meaning of the hardcoded indices (4, 8, 12 / 1, 4, 15) into YAML would require separate documentation and lose cohesion. Externalize only when case additions become frequent.

A4. Add multi-flow contention measurement

Rejected (out of probe scope). D8's single-flow isolation is probe's core intent. Multi-flow contention belongs in a different area of the ADR-0033 latency model — either a separate tool or a new case category.

Consequences

probe's case catalog (D1) and measurement units (D2/D3) are pinned at ADR level, so new traffic categories know which table format to follow.
The semantics of the formula-vs-actual columns (D4) are locked in, so questions like "why is Drain% 5 % or 70 %?" can quickly be linked to ADR-0033 assumption checks.
Automatic invariant checks (D5) are pinned, so latency-model changes immediately catch monotonicity / D2H ≥ H2D regressions.
D8's case-isolation is explicit, so probe results are safe to read as single-flow measurements. If multi-flow is needed, a separate tool track is clearly required.
A2's strict-mode flag is recorded as a follow-up so CI integration has a minimal change path when requested.
