kernbench2/docs/adr/ADR-0045-prog-bench-module-contract.md
ywkang 5f8dd688f5 adr: add ADR-0045 (bench module contract — registration, dispatch, authoring)
Documents src/kernbench/benches/: how @bench registration + audit work,
how the CLI dispatches via run_bench/RuntimeContext, and the contract a
new bench module must satisfy.

Nine decisions (D1-D9) cover:
- @bench name/description rules and duplicate detection
- Module-file convention (_-prefixed helpers vs bench modules)
- def run(torch) signature; torch = RuntimeContext
- Minimum-one-submit rule (else NO_REQUESTS)
- Single-device convention + multi-SIP CCL exception (ADR-0024/0027)
- resolve() name/index decision tree; indices are not a stable API
- Exact RuntimeContext surface exposed to benches
- Env-var parameterization (matmul_composite / gemm_sweep.py pattern)

Four alternatives rejected with documented reasons (manifest YAML,
decorator entry= arg, @multi_device_bench split, stable indices).

Verifier (tools/verify_adr_lang_pairs.py) passes for EN/KO pair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 16:29:45 -07:00


ADR-0045: Bench Module Contract — registration, dispatch, and authoring

Status

Accepted (2026-05-21).

Unifies the src/kernbench/benches/ registration mechanism (@bench), the CLI dispatch path (kernbench run/list), and the contract a new bench module must follow. ADR-0010 (CLI surface) specifies the kernbench list/run interface, but how benches are registered and what signature they must follow had no ADR-level coverage.

First action

When kernbench.benches is imported, __init__.py immediately calls _eager_import_and_audit(__path__, __name__). Its first action is to enumerate every sibling module in the package directory via pkgutil.iter_modules(__path__) and eagerly import each one via importlib.import_module(...), with two exceptions:

  • the module named registry (the infrastructure module itself), and
  • any module whose name starts with _ (helper modules).

At import time, each @bench(name=..., description=...) decorator inside the imported module runs, appending (name, description, fn) to _PENDING and adding fn.__module__ to _REGISTERED_MODULES.

Once imports finish, _audit_modules(imported, _REGISTERED_MODULES) runs; if any imported module did not invoke @bench at least once, it raises RuntimeError("Bench module(s) missing @bench decorator: ...") immediately. At this point indices are still unassigned — the first call to list_all() / resolve(...) triggers _finalize(), which sorts _PENDING alphabetically by name and assigns 1-based indices.

In short, the bench infrastructure's first act is "eagerly import every non-helper module in the package and audit that each one registered at least one bench".
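
The import filter and the audit described above can be sketched as follows. This is a minimal illustration, not the code in kernbench/benches/__init__.py: the helper names importable_names and audit_modules are hypothetical, the importlib.import_module(...) step is omitted, and the formatting after the colon in the error message is an assumption.

```python
import pkgutil


def importable_names(package_path):
    """Return the sibling module names the eager importer would import.

    package_path is a list of directories, like a package's __path__.
    """
    return sorted(
        info.name
        for info in pkgutil.iter_modules(package_path)
        # Skip the infrastructure module and _-prefixed helper modules.
        if info.name != "registry" and not info.name.startswith("_")
    )


def audit_modules(imported, registered):
    """Raise if any imported module never invoked @bench."""
    missing = sorted(set(imported) - set(registered))
    if missing:
        # The exact formatting after the colon is an assumption.
        raise RuntimeError(
            "Bench module(s) missing @bench decorator: " + ", ".join(missing)
        )
```

Note that pkgutil.iter_modules itself reports _-prefixed modules; in this sketch it is the filter around it that leaves them out.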

Context

src/kernbench/benches/ currently holds 8 bench modules (ccl_allreduce, gemm_single_pe, gpt3_qkv, ipcq_allreduce, matmul_composite, qkv_gemm, qkv_gemm_multi_pe, va_offset_verify). Every bench follows the same unified flow:

kernbench run --topology <T> --bench <N>
   ↓
cli/main.py::cmd_run
   ↓  resolve_topology(T)  + resolve(N)  + resolve_device(device_arg)
   ↓
runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
   ↓  engine_factory(topology, device) → GraphEngine
   ↓  RuntimeContext(engine, target_device, correlation_id, spec)
   ↓
bench_fn(ctx)        ← invokes the bench's run(torch)
   ↓  ctx.empty/zeros/from_numpy/launch/distributed.* etc. submit work
   ↓
ctx.wait_all()       ← drains any outstanding handles
   ↓
BenchResult(completion, correlation_id, trace, traces, engine)

ADR-0010 covers only the CLI surface (run/list/probe/web); ADR-0007 covers only the runtime API ↔ sim_engine boundary. The question "what shape must a new bench file take?" had to be answered by grepping the codebase. As a result:

  • The @bench decorator contract (kebab-case name, non-empty description) lived only in the source.
  • The bench function signature (def run(torch)) was a de-facto convention enforced by the CLI dispatcher calling spec.run.
  • New bench authors learned the "helpers must use _ prefix" rule only after seeing the audit's RuntimeError.
  • The single-device convention (CLAUDE.md Part 2 CLI Semantics) and its interaction with multi-SIP CCL benches was ambiguous for bench authors.

This ADR consolidates all of it in one place.

Decision

D1. @bench decorator contract

from kernbench.benches.registry import bench

@bench(name="my-bench", description="Short, complete-sentence description.")
def run(torch):
    ...
  • name: kebab-case string matching ^[a-z][a-z0-9]*(-[a-z0-9]+)*$. Lowercase letters, digits, and dashes only; underscores forbidden; must start with a letter.
  • description: non-empty string (stripped length > 0). Displayed verbatim by kernbench list.
  • The decorator returns the function unchanged, so direct invocation is fine. Its only side effects are appending to _PENDING and recording fn.__module__ in _REGISTERED_MODULES (see First action).

Violations of the first two rules raise ValueError at decoration time. Duplicate names are caught at _finalize() with RuntimeError("duplicate bench name: ...").

D2. Module-file convention

Every src/kernbench/benches/<slug>.py must be one of:

  • A bench module: at top-level import, @bench(...) runs at least once to register at least one bench.
  • A helper module: the filename starts with _ (e.g., _shared_helpers.py). The eager importer skips it.

The audit (_audit_modules) rejects any non-helper that fails to call @bench. Intended consequence: dropping a new file into benches/ automatically registers its benches, and helper modules are clearly flagged by their filename prefix alone.

D3. The bench function signature is def run(torch)

The decorator does not enforce a function name, but CLI dispatch calls spec_entry.run (the decorated callable). The convention is therefore:

  • Function name: run. Other names work, but always use run for readability and grep-ability.
  • Argument: a single positional torch. In practice this is a RuntimeContext instance exposing PyTorch-style namespaces (zeros/empty/launch/distributed/...) — see ADR-0024 D3.
  • Return value: anything (typed Any). run_bench ignores it and tracks completion via ctx.handles() / engine.get_completion().

The torch name imitates a PyTorch-compatible idiom; the actual PyTorch module is not passed in (aligned with ADR-0024's "rank = SIP" launcher convention).

D4. A bench must submit at least once

If ctx.handles() is empty after the bench returns, run_bench reports BenchResult.completion with ok=False and error_code="NO_REQUESTS". A meaningful bench must therefore invoke at least one of:

  • Tensor-creation APIs: torch.zeros(...), torch.empty(...). Both internally submit MmuMapMsg; zeros also submits MemoryWriteMsg.
  • Kernel-launch API: torch.launch(name, fn, *args), which submits a per-SIP KernelLaunchMsg.

Exception: empty placeholder benches, such as the print(...)-only stub in ipcq_allreduce.py, receive a NO_REQUESTS result. CI is expected to recognize placeholder benches and handle them specially.
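
The rule can be illustrated with a stand-in context. Everything below is hypothetical scaffolding: FakeContext, Completion, and run_bench_sketch are not kernbench types, and the real run_bench takes (topology, bench_fn, device, engine_factory). Only the check on ctx.handles() and the NO_REQUESTS code come from this ADR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Completion:
    ok: bool
    error_code: Optional[str] = None


class FakeContext:
    """Hypothetical stand-in for RuntimeContext that only records submissions."""

    def __init__(self):
        self._submitted = []

    def zeros(self, shape):
        # The real API submits MmuMapMsg and MemoryWriteMsg; here we record a handle.
        handle = ("zeros", tuple(shape))
        self._submitted.append(handle)
        return handle

    def handles(self):
        return list(self._submitted)

    def wait_all(self):
        pass  # nothing to drain in this stand-in


def run_bench_sketch(bench_fn):
    ctx = FakeContext()
    bench_fn(ctx)
    ctx.wait_all()
    if not ctx.handles():
        return Completion(ok=False, error_code="NO_REQUESTS")
    return Completion(ok=True)


def placeholder(torch):
    print("not implemented yet")  # submits nothing


def minimal(torch):
    torch.zeros((4, 4))  # one submission is enough
```

Here run_bench_sketch(placeholder) yields ok=False with error_code="NO_REQUESTS", while run_bench_sketch(minimal) yields ok=True.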

D5. Single-device convention + multi-SIP exception (ADR-0024/0027)

CLAUDE.md Part 2 CLI Semantics' "benchmarks MUST remain single-device" rule is interpreted as follows:

  • Standard bench (single-SIP use): define tensor placement with dp = DPPolicy(...) and launch with torch.launch(...). The SIP index is chosen by --device (CLI's responsibility).
  • CCL bench (multi-SIP use): as an exception, use torch.distributed.init_process_group(backend="ahbm") plus torch.multiprocessing.spawn(_worker, ..., nprocs=ws) for the rank = SIP pattern (ADR-0024 D3). --device is ignored (or treated as all); each spawned worker calls torch.ahbm.set_device(rank) to bind to its SIP.

Multi-device patterns outside these two (e.g., one bench function launching across multiple SIPs in the same process) are forbidden by this ADR. Even with --device all, the bench runs once; multi-SIP use inside that single run must follow the CCL bench pattern above.

D6. Name/index resolution (resolve)

resolve(identifier: str) returns a BenchSpec via:

  1. If identifier.isdigit(): convert to int and find the spec where index == that value. If none, ValueError("No bench with index ...").
  2. If identifier in _REGISTRY: direct lookup.
  3. Otherwise: ValueError("Unknown bench ...").

Empty or whitespace-only identifiers raise ValueError("bench identifier must be a non-empty string.").

The CLI passes --bench directly to resolve, so users can use either kernbench run --bench gemm-single-pe or kernbench run --bench 2.
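
The decision tree can be sketched in a few lines. This is an illustration, not registry.py: the BenchSpec fields other than index and run are assumptions, the registry is passed explicitly instead of living in module state, performing the empty-string check first is an assumption, and the text that replaces the "..." in each message is invented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BenchSpec:
    index: int
    name: str
    description: str
    run: Callable


def finalize(pending):
    """Sort by name, assign 1-based indices, and reject duplicate names."""
    registry = {}
    ordered = sorted(pending, key=lambda entry: entry[0])
    for index, (name, description, fn) in enumerate(ordered, start=1):
        if name in registry:
            raise RuntimeError(f"duplicate bench name: {name}")
        registry[name] = BenchSpec(index, name, description, fn)
    return registry


def resolve(identifier, registry):
    if not isinstance(identifier, str) or not identifier.strip():
        raise ValueError("bench identifier must be a non-empty string.")
    if identifier.isdigit():  # step 1: index lookup
        wanted = int(identifier)
        for spec in registry.values():
            if spec.index == wanted:
                return spec
        raise ValueError(f"No bench with index {wanted}")
    if identifier in registry:  # step 2: name lookup
        return registry[identifier]
    raise ValueError(f"Unknown bench {identifier!r}")  # step 3
```

Because a valid name must start with a letter (D1), an all-digit identifier can never collide with a bench name, so checking isdigit() first is unambiguous.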

D7. Indices are not a stable API

_finalize() sorts _PENDING alphabetically by name and assigns 1-based indices. Adding a new bench can shift existing benches' indices. Therefore:

  • Human-interactive use: indices are fine.
  • Scripts / CI automation: always use the name.

This caveat is documented in registry.py's module docstring.
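
A short example shows the shift. The bench names are hypothetical kebab-case forms chosen for illustration, and assign_indices is a stand-in for the sort-and-number step of _finalize().

```python
def assign_indices(names):
    """Alphabetical sort, then 1-based numbering, as D7 describes."""
    return {name: index for index, name in enumerate(sorted(names), start=1)}


before = assign_indices(["gemm-single-pe", "matmul-composite", "qkv-gemm"])
after = assign_indices(
    ["gemm-single-pe", "matmul-composite", "qkv-gemm", "ccl-allreduce"]
)

print(before["gemm-single-pe"])  # 1
print(after["gemm-single-pe"])   # 2: the new bench sorts first and shifts the rest
```

A script that had pinned --bench 1 would silently start running a different bench after the addition, which is why automation should use names.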

D8. Surface RuntimeContext exposes to benches

A bench's torch parameter may legitimately use:

  • Tensor creation: torch.empty(shape, dtype=..., dp=DPPolicy(...), name=...), torch.zeros(...), torch.from_numpy(arr). Each records host-side metadata and deploys the tensor to the device (MmuMapMsg, plus MemoryWriteMsg when contents are written, as D4 notes for zeros).
  • Kernel launch: torch.launch(kernel_name, kernel_fn, *args) — converts (Tensor, int, float) positional args to TensorArg / ScalarArg, submits per-SIP KernelLaunchMsg, and drains.
  • Synchronization: torch.wait(handle), torch.wait_all() (run_bench calls the latter automatically).
  • Distributed: torch.distributed.init_process_group(backend="ahbm"), torch.distributed.get_world_size(), torch.distributed.all_reduce(t, op=...) (ADR-0024/0027).
  • Multi-process (rank = SIP): torch.multiprocessing.spawn(_worker, ..., nprocs=ws) (ADR-0024 D3 / ADR-0027).
  • Device binding: torch.ahbm.set_device(rank) or torch.accelerator.set_device_index(rank) (both point to the same namespace).
  • IPCQ install: torch.install_ipcq(algorithm=..., ccl_yaml=...) (ADR-0023 D10).
  • Spec lookup: torch.spec — the dict produced by the topology builder (system / cube_mesh / HBM parameters etc.). Use it so the bench does not hardcode topology.yaml values.

Benches must not access RuntimeContext private members (_handles, _traces, _allocators, etc.) directly. This aligns with ADR-0007's layer-boundary spirit: bench → runtime API → sim_engine flows in one direction.

D9. Environment-variable parameterization is allowed

Benches may parameterize themselves via os.environ.get(...), as matmul_composite.py does for MATMUL_M, MATMUL_K, MATMUL_N, MATMUL_DTYPE, MATMUL_VARIANT. Rationale:

  • The bench function signature is fixed by D3 to def run(torch), so positional/keyword arguments cannot carry parameters.
  • The env-var pattern is a natural hook for operational sweeps (e.g., MATMUL_VARIANT).
  • External drivers such as scripts/gemm_sweep.py (ADR-0044) consume this hook (it sets MATMUL_M/K/N/VARIANT at scripts/gemm_sweep.py:115-118).

When environment variables alter bench behavior, the module docstring must list every variable used (matmul_composite.py is the canonical example).
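
The pattern can be sketched as below. The variable names come from D9, but read_matmul_params is a hypothetical helper and every default value is an assumption; matmul_composite.py may parse and default these differently.

```python
import os


def read_matmul_params(environ=os.environ):
    """Read the MATMUL_* variables named in D9.

    Every default below is hypothetical, chosen only for illustration.
    """
    return {
        "m": int(environ.get("MATMUL_M", "128")),
        "k": int(environ.get("MATMUL_K", "128")),
        "n": int(environ.get("MATMUL_N", "128")),
        "dtype": environ.get("MATMUL_DTYPE", "fp16"),
        "variant": environ.get("MATMUL_VARIANT", "default"),
    }
```

Accepting the mapping as a parameter keeps the parsing testable without mutating the process environment; inside run(torch) the bench would call it with no arguments, and a sweep driver would set the variables before invoking kernbench run.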

Alternatives Considered

A1. An explicit manifest file (YAML) listing benches

Rejected. The @bench + audit pattern guarantees "drop in file → auto-register", concentrating cognitive cost in one place (the file itself). A separate manifest is prone to drift, and helper separation is already clear via the _ prefix.

A2. Allowing the bench's entry-point name in the decorator (@bench(name=..., entry="run_xxx"))

Rejected. Breaks the simplicity of dispatch (spec.run is a single callable). The run convention is sufficient; variants can register multiple @bench-decorated functions in the same module.

A3. A separate @multi_device_bench decorator for CCL

Rejected. The two patterns named in D5 (single + ADR-0024 multi-SIP) cover all 8 current benches. A separate decorator would force dispatch to branch and add complexity; the multi-SIP intent is already obvious from the bench's init_process_group(...) call.

A4. Make indices a stable API (registration order or explicit index= argument)

Rejected. D7's trade-off favors user-friendliness — alphabetically sorted 1-based indices read naturally in the list output. Scripts can use names.

Consequences

  • "How to add a bench" is consolidated in one ADR — new authors only need to read D1-D3 and D8 without grepping source.
  • The _-prefixed helper-module pattern is legitimized at ADR level, so future benches/_*.py shared helpers are free to be added.
  • The CLI's single-device convention and CCL's multi-SIP exception are shown to be consistent (D5) — they are orthogonal.
  • The rationale for ADR-0044's GEMM eval harness using env-var hooks (D9) is now ADR-pinned.
  • Indices are explicitly unstable (D7), so any CI code calling kernbench run --bench 3 is flagged for review after this ADR is accepted.