# ADR-0045: Bench Module Contract — registration, dispatch, and authoring

## Status

Accepted (2026-05-21).

Unifies the `src/kernbench/benches/` registration mechanism (@bench), the CLI dispatch path (`kernbench run/list`), and the contract a new bench module must follow. ADR-0010 (CLI surface) specifies the `kernbench list/run` interface, but **how benches are registered and what signature they must follow** had no ADR-level coverage.

**Extended by ADR-0054**: D5's single-config rule gains a third pattern: the *eval bench* (e.g. `milestone-1h-*`) drives many configs, builds its own per-config engines, and submits a sentinel tensor to satisfy D4.

## First action

When `kernbench.benches` is imported, `__init__.py` immediately calls `_eager_import_and_audit(__path__, __name__)`. Its first action is to enumerate every sibling module in the package directory via `pkgutil.iter_modules(__path__)` and **eagerly import** each one via `importlib.import_module(...)`, except modules matching either:

- name `registry` (the infrastructure module itself), or
- name starting with `_` (helper modules).

At import time, each `@bench(name=..., description=...)` decorator inside the imported module runs, appending `(name, description, fn)` to `_PENDING` and adding `fn.__module__` to `_REGISTERED_MODULES`. Once imports finish, `_audit_modules(imported, _REGISTERED_MODULES)` runs; if any imported module did not invoke `@bench` at least once, it raises `RuntimeError("Bench module(s) missing @bench decorator: ...")` immediately. At this point indices are still unassigned; the first call to `list_all()` / `resolve(...)` triggers `_finalize()`, which sorts `_PENDING` alphabetically by name and assigns 1-based indices.

In short, **the bench infrastructure's first act is "eagerly import every non-helper module in the package and audit that each one registered at least one bench"**.
## Context

`src/kernbench/benches/` currently holds 8 bench modules (`ccl_allreduce`, `gemm_single_pe`, `gpt3_qkv`, `ipcq_allreduce`, `matmul_composite`, `qkv_gemm`, `qkv_gemm_multi_pe`, `va_offset_verify`). Every bench follows the same unified flow:

```
kernbench run --topology <T> --bench <N>
    ↓
cli/main.py::cmd_run
    ↓
resolve_topology(T) + resolve(N) + resolve_device(device_arg)
    ↓
runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
    ↓
engine_factory(topology, device) → GraphEngine
    ↓
RuntimeContext(engine, target_device, correlation_id, spec)
    ↓
bench_fn(ctx)        ← invokes the bench's run(torch)
    ↓
ctx.empty/zeros/from_numpy/launch/distributed.* etc. submit work
    ↓
ctx.wait_all()       ← drains any outstanding handles
    ↓
BenchResult(completion, correlation_id, trace, traces, engine)
```

ADR-0010 covers only the CLI surface (`run/list/probe/web`); ADR-0007 covers only the runtime API ↔ sim_engine boundary. The question "what shape must a new bench file take?" had to be answered by grepping the codebase. As a result:

- The @bench decorator contract (kebab-case name, non-empty description) lived only in the source.
- The bench function signature (`def run(torch)`) was a de-facto convention enforced by the CLI dispatcher calling `spec.run`.
- New bench authors learned the "helpers must use `_` prefix" rule only after seeing the audit's RuntimeError.
- The single-device convention (CLAUDE.md Part 2 CLI Semantics) and its interaction with multi-SIP CCL benches was ambiguous for bench authors.

This ADR consolidates all of it in one place.

## Decision

### D1. @bench decorator contract

```python
from kernbench.benches.registry import bench


@bench(name="my-bench", description="Short, complete-sentence description.")
def run(torch):
    ...
```

- `name`: kebab-case string matching `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`. Lowercase letters, digits, and dashes only; underscores forbidden; must start with a letter.
- `description`: non-empty string (stripped length > 0).
  Displayed verbatim by `kernbench list`.
- The decorator **returns the function unchanged**, so direct invocation is fine. Its only side effects are appending to `_PENDING` and recording `fn.__module__` in `_REGISTERED_MODULES`.

Violations of the first two rules raise `ValueError` at decoration time. Duplicate names are caught at `_finalize()` with `RuntimeError("duplicate bench name: ...")`.

### D2. Module-file convention

Every `src/kernbench/benches/*.py` (other than the infrastructure module `registry.py`) must be one of:

- **A bench module**: at top-level import, `@bench(...)` runs at least once to register at least one bench.
- **A helper module**: the filename starts with `_` (e.g., `_shared_helpers.py`). The eager-import pass skips it.

The audit (`_audit_modules`) rejects any non-helper that fails to call `@bench`. Intended consequence: dropping a new file into `benches/` automatically registers its benches, and helper modules are clearly flagged by their filename prefix alone.

### D3. The bench function signature is `def run(torch)`

The decorator does not enforce a function name, but **CLI dispatch calls `spec_entry.run`** (the decorated callable). The convention is therefore:

- Function name: `run`. Other names work, but always use `run` for readability and grep-ability.
- Argument: a single positional `torch`. In practice this is a `RuntimeContext` instance exposing PyTorch-style namespaces (zeros/empty/launch/distributed/...); see ADR-0024 D3.
- Return value: any (`Any`). `run_bench` ignores it and tracks completion via `ctx.handles()` / `engine.get_completion()`.

The `torch` name imitates a PyTorch-compatible idiom; the actual PyTorch module is not passed in (aligned with ADR-0024's "rank = SIP" launcher convention).

### D4. A bench must submit at least once

If `ctx.handles()` is empty after the bench returns, `run_bench` reports `BenchResult.completion = ok=False, error_code="NO_REQUESTS"`.
So a meaningful bench must invoke at least one of:

- Tensor-creation APIs: `torch.zeros(...)`, `torch.empty(...)`; these internally submit `MmuMapMsg` and (for zeros) `MemoryWriteMsg`.
- Kernel-launch API: `torch.launch(name, fn, *args)`, which submits per-SIP `KernelLaunchMsg`.
- (Exception) Empty placeholder benches: e.g., `ipcq_allreduce.py`'s `print(...)`-only stub will receive a NO_REQUESTS result. CI is expected to recognize and handle placeholder benches specially.

### D5. Single-device convention + multi-SIP exception (ADR-0024/0027)

CLAUDE.md Part 2 CLI Semantics' **"benchmarks MUST remain single-device"** rule is interpreted as follows:

- **Standard bench (single-SIP use)**: define tensor placement with `dp = DPPolicy(...)` and launch with `torch.launch(...)`. The SIP index is chosen by `--device` (CLI's responsibility).
- **CCL bench (multi-SIP use)**: as an exception, use `torch.distributed.init_process_group(backend="ahbm")` plus `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` for the rank = SIP pattern (ADR-0024 D3). `--device` is ignored (or treated as `all`); each spawned worker calls `torch.ahbm.set_device(rank)` to bind to its SIP.

Multi-device patterns outside these two (e.g., one bench function launching across multiple SIPs in the same process) are forbidden by this ADR. Even with `--device all`, the bench runs once; multi-SIP use inside that single run must follow D5's second pattern.

### D6. Name/index resolution (`resolve`)

`resolve(identifier: str)` returns a BenchSpec via:

1. If `identifier.isdigit()`: convert to int and find the spec where `index ==` that value. If none, `ValueError("No bench with index ...")`.
2. If `identifier in _REGISTRY`: direct lookup.
3. Otherwise: `ValueError("Unknown bench ...")`.

Empty or whitespace-only identifiers raise `ValueError("bench identifier must be a non-empty string.")`.
The CLI passes `--bench` directly to `resolve`, so users can use either `kernbench run --bench gemm-single-pe` or `kernbench run --bench 2`.

### D7. Indices are not a stable API

`_finalize()` sorts `_PENDING` alphabetically by name and assigns 1-based indices. Adding a new bench can shift existing benches' indices. Therefore:

- Human-interactive use: indices are fine.
- Scripts / CI automation: always use the name.

This caveat is documented in `registry.py`'s module docstring.

### D8. Surface RuntimeContext exposes to benches

A bench's `torch` parameter may legitimately use:

- **Tensor creation**: `torch.empty(shape, dtype=..., dp=DPPolicy(...), name=...)`, `torch.zeros(...)`, `torch.from_numpy(arr)`. All submit host-side metadata plus device deployment (`MmuMapMsg` + `MemoryWriteMsg`).
- **Kernel launch**: `torch.launch(kernel_name, kernel_fn, *args)`, which converts `(Tensor, int, float)` positional args to `TensorArg` / `ScalarArg`, submits per-SIP `KernelLaunchMsg`, and drains.
- **Synchronization**: `torch.wait(handle)`, `torch.wait_all()` (`run_bench` calls the latter automatically).
- **Distributed**: `torch.distributed.init_process_group(backend="ahbm")`, `torch.distributed.get_world_size()`, `torch.distributed.all_reduce(t, op=...)` (ADR-0024/0027).
- **Multi-process (rank = SIP)**: `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` (ADR-0024 D3 / ADR-0027).
- **Device binding**: `torch.ahbm.set_device(rank)` or `torch.accelerator.set_device_index(rank)` (both point to the same namespace).
- **IPCQ install**: `torch.install_ipcq(algorithm=..., ccl_yaml=...)` (ADR-0023 D10).
- **Spec lookup**: `torch.spec`, the dict produced by the topology builder (system / cube_mesh / HBM parameters etc.). Use it so the bench does not hardcode topology.yaml values.

Benches must not access RuntimeContext private members (`_handles`, `_traces`, `_allocators`, etc.) directly.
This aligns with ADR-0007's layer-boundary spirit: bench → runtime API → sim_engine flows in one direction.

### D9. Environment-variable parameterization is allowed

Benches may parameterize themselves via `os.environ.get(...)`, as `matmul_composite.py` does for `MATMUL_M`, `MATMUL_K`, `MATMUL_N`, `MATMUL_DTYPE`, `MATMUL_VARIANT`. Rationale:

- The bench function signature is fixed by D3 to `def run(torch)`, so positional/keyword arguments cannot carry parameters.
- The env-var pattern is a natural hook for operational sweeps (e.g., `MATMUL_VARIANT`).
- External drivers such as `scripts/gemm_sweep.py` (ADR-0044) consume this hook (it sets `MATMUL_M/K/N/VARIANT` at `scripts/gemm_sweep.py:115-118`).

When environment variables alter bench behavior, the module docstring must list every variable used (`matmul_composite.py` is the canonical example).

## Alternatives Considered

### A1. An explicit manifest file (YAML) listing benches

Rejected. The `@bench` + audit pattern guarantees "drop in file → auto-register", concentrating cognitive cost in one place (the file itself). A separate manifest is prone to drift, and helper separation is already clear via the `_` prefix.

### A2. Allowing the bench's entry-point name in the decorator (`@bench(name=..., entry="run_xxx")`)

Rejected. Breaks the simplicity of dispatch (`spec.run` is a single callable). The `run` convention is sufficient; variants can register multiple `@bench`-decorated functions in the same module.

### A3. A separate `@multi_device_bench` decorator for CCL

Rejected. The two patterns named in D5 (single + ADR-0024 multi-SIP) cover all 8 current benches. A separate decorator would force dispatch to branch and add complexity; the multi-SIP intent is already obvious from the bench's `init_process_group(...)` call.

### A4. Make indices a stable API (registration order or explicit `index=` argument)

Rejected.
D7's trade-off favors user-friendliness: alphabetically sorted 1-based indices read naturally in the `list` output. Scripts can use names.

## Consequences

- "How to add a bench" is consolidated in one ADR; new authors only need to read D1-D3 and D8 without grepping source.
- The `_`-prefixed helper-module pattern is legitimized at ADR level, so future `benches/_*.py` shared helpers are free to be added.
- The CLI's single-device convention and CCL's multi-SIP exception are shown to be consistent (D5); they are orthogonal.
- The rationale for ADR-0044's GEMM eval harness using env-var hooks (D9) is now ADR-pinned.
- Indices are explicitly unstable (D7), so any CI code calling `kernbench run --bench 3` is flagged for review after this ADR is accepted.
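
To make the last consequence concrete, the short sketch below shows how alphabetical 1-based indexing shifts when one bench is added. The bench names are hypothetical kebab-case forms chosen for illustration, and `assign_indices` is a stand-in for the sorting step of `_finalize()`, not the real function.

```python
def assign_indices(names):
    """Mimic the sort-then-number step: alphabetical order, 1-based."""
    return {name: index for index, name in enumerate(sorted(names), start=1)}


# Hypothetical bench names, before and after a new bench is dropped in.
before = assign_indices(["gemm-single-pe", "matmul-composite", "qkv-gemm"])
after = assign_indices(["gemm-single-pe", "matmul-composite", "qkv-gemm", "gpt3-qkv"])

# Benches whose index changed purely because another bench was added.
shifted = sorted(name for name in before if before[name] != after[name])


def name_at(indices, index):
    """Reverse lookup: which bench does a given index select?"""
    return next(name for name, i in indices.items() if i == index)


print(name_at(before, 3))  # qkv-gemm
print(name_at(after, 3))   # matmul-composite
print(shifted)             # ['matmul-composite', 'qkv-gemm']
```

In this example a script that hardcodes `--bench 3` silently switches from one bench to another, which is why D7 asks automation to pass names.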