# ADR-0045: Bench Module Contract — registration, dispatch, and authoring

## Status

Accepted (2026-05-21).

Unifies the `src/kernbench/benches/` registration mechanism (`@bench`), the
CLI dispatch path (`kernbench run/list`), and the contract a new bench
module must follow. ADR-0010 (CLI surface) specifies the `kernbench
list/run` interface, but **how benches are registered and what signature
they must follow** had no ADR-level coverage.

**Extended by ADR-0054**: D5's single-config rule gains a third pattern —
the *eval bench* (e.g. `milestone-1h-*`) drives many configs, builds its
own per-config engines, and submits a sentinel tensor to satisfy D4.

## First action

When `kernbench.benches` is imported, `__init__.py` immediately calls
`_eager_import_and_audit(__path__, __name__)`. Its first action is to
enumerate every sibling module in the package directory via
`pkgutil.iter_modules(__path__)` and **eagerly import** each one via
`importlib.import_module(...)` — except modules matching either:

- name `registry` (the infrastructure module itself), or
- name starting with `_` (helper modules).

At import time, each `@bench(name=..., description=...)` decorator inside
the imported module runs, appending `(name, description, fn)` to
`_PENDING` and adding `fn.__module__` to `_REGISTERED_MODULES`.

Once imports finish, `_audit_modules(imported, _REGISTERED_MODULES)`
runs; if any imported module did not invoke `@bench` at least once, it
raises `RuntimeError("Bench module(s) missing @bench decorator: ...")`
immediately. At this point indices are still unassigned — the first call
to `list_all()` / `resolve(...)` triggers `_finalize()`, which sorts
`_PENDING` alphabetically by name and assigns 1-based indices.

In short, **the bench infrastructure's first act is "eagerly import
every non-helper module in the package and audit that each one
registered at least one bench"**.
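
The selection and audit rules above can be sketched as two pure functions. This is a minimal illustration, not the real `_eager_import_and_audit`: it replaces `pkgutil.iter_modules` and `importlib.import_module` with a plain list of names, the helper names `select_bench_modules` and `audit_modules` are hypothetical, and the way missing module names are joined into the error message is an assumption.

```python
def select_bench_modules(package, sibling_names):
    """Apply the two skip rules and return fully qualified module names."""
    return [
        f"{package}.{name}"
        for name in sibling_names
        if name != "registry" and not name.startswith("_")
    ]


def audit_modules(imported, registered_modules):
    """Raise if any imported module never called @bench."""
    missing = sorted(set(imported) - set(registered_modules))
    if missing:
        # Message prefix follows the ADR; the join format is an assumption.
        raise RuntimeError(
            "Bench module(s) missing @bench decorator: " + ", ".join(missing)
        )


# Hypothetical directory listing; `scratch` is a non-helper with no @bench.
siblings = ["registry", "_shared_helpers", "gemm_single_pe", "scratch"]
imported = select_bench_modules("kernbench.benches", siblings)

# Pretend only gemm_single_pe registered a bench while being imported.
registered = {"kernbench.benches.gemm_single_pe"}
try:
    audit_modules(imported, registered)
    audit_error = None
except RuntimeError as exc:
    audit_error = str(exc)
```

In this sketch `registry` and `_shared_helpers` are never imported, and `scratch` trips the audit because it registered nothing.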

## Context

`src/kernbench/benches/` currently holds 8 bench modules (`ccl_allreduce`,
`gemm_single_pe`, `gpt3_qkv`, `ipcq_allreduce`, `matmul_composite`,
`qkv_gemm`, `qkv_gemm_multi_pe`, `va_offset_verify`). Every bench follows
the same unified flow:

```
kernbench run --topology <T> --bench <N>
   ↓
cli/main.py::cmd_run
   ↓  resolve_topology(T)  + resolve(N)  + resolve_device(device_arg)
   ↓
runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
   ↓  engine_factory(topology, device) → GraphEngine
   ↓  RuntimeContext(engine, target_device, correlation_id, spec)
   ↓
bench_fn(ctx)        ← invokes the bench's run(torch)
   ↓  ctx.empty/zeros/from_numpy/launch/distributed.* etc. submit work
   ↓
ctx.wait_all()       ← drains any outstanding handles
   ↓
BenchResult(completion, correlation_id, trace, traces, engine)
```

ADR-0010 covers only the CLI surface (`run/list/probe/web`); ADR-0007
covers only the runtime API ↔ sim_engine boundary. The question "what
shape must a new bench file take?" had to be answered by grepping the
codebase. As a result:

- The `@bench` decorator contract (kebab-case name, non-empty description)
  lived only in the source.
- The bench function signature (`def run(torch)`) was a de-facto
  convention enforced by the CLI dispatcher calling `spec.run`.
- New bench authors learned the "helpers must use `_` prefix" rule only
  after seeing the audit's RuntimeError.
- The single-device convention (CLAUDE.md Part 2 CLI Semantics) and its
  interaction with multi-SIP CCL benches was ambiguous for bench
  authors.

This ADR consolidates all of it in one place.

## Decision

### D1. @bench decorator contract

```python
from kernbench.benches.registry import bench

@bench(name="my-bench", description="Short, complete-sentence description.")
def run(torch):
    ...
```

- `name`: kebab-case string matching `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`.
  Lowercase letters, digits, and dashes only; underscores forbidden;
  must start with a letter.
- `description`: non-empty string (stripped length > 0). Displayed
  verbatim by `kernbench list`.
- The decorator **returns the function unchanged**, so direct invocation
  is fine. Its only side effects are appending to `_PENDING` and adding
  `fn.__module__` to `_REGISTERED_MODULES`.

Violations of the first two rules raise `ValueError` at decoration time.
Duplicate names are caught at `_finalize()` with
`RuntimeError("duplicate bench name: ...")`.

### D2. Module-file convention

Every `src/kernbench/benches/<slug>.py` must be one of:

- **A bench module**: at top-level import, `@bench(...)` runs at least
  once to register at least one bench.
- **A helper module**: the filename starts with `_` (e.g.,
  `_shared_helpers.py`). The eager-import loop skips it.

The audit (`_audit_modules`) rejects any non-helper that fails to call
`@bench`. Intended consequence: dropping a new file into `benches/`
automatically registers its benches, and helper modules are clearly
flagged by their filename prefix alone.

### D3. The bench function signature is `def run(torch)`

The decorator does not enforce a function name, but **CLI dispatch calls
`spec_entry.run`** (the decorated callable). The convention is therefore:

- Function name: `run`. Other names work, but always use `run` for
  readability and grep-ability.
- Argument: a single positional `torch`. In practice this is a
  `RuntimeContext` instance exposing PyTorch-style namespaces
  (zeros/empty/launch/distributed/...) — see ADR-0024 D3.
- Return value: any (`Any`). `run_bench` ignores it and tracks
  completion via `ctx.handles()` / `engine.get_completion()`.

The `torch` name imitates a PyTorch-compatible idiom; the actual PyTorch
module is not passed in (aligned with ADR-0024's "rank = SIP" launcher
convention).

### D4. A bench must submit at least once

If `ctx.handles()` is empty after the bench returns, `run_bench` reports
a `BenchResult.completion` with `ok=False` and
`error_code="NO_REQUESTS"`. A meaningful bench must therefore invoke at
least one of:

- Tensor-creation APIs: `torch.zeros(...)`, `torch.empty(...)`. These
  internally submit `MmuMapMsg` and (for zeros) `MemoryWriteMsg`.
- Kernel-launch API: `torch.launch(name, fn, *args)`, which submits
  per-SIP `KernelLaunchMsg`.

Exception: empty placeholder benches. For example, the
`print(...)`-only stub in `ipcq_allreduce.py` receives a NO_REQUESTS
result. CI is expected to recognize placeholder benches and handle them
specially.
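
The NO_REQUESTS rule can be illustrated with a stand-in context. Everything below is a hypothetical sketch: `StubContext`, `Completion`, and `run_bench_sketch` are invented for the example and only mimic the `ctx.handles()` check described above; the real `run_bench` also builds an engine, drains handles, and collects traces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Completion:
    ok: bool
    error_code: Optional[str] = None


class StubContext:
    """Stand-in for RuntimeContext that only records submissions."""

    def __init__(self):
        self._submitted = []

    def zeros(self, shape):
        # The real API submits device messages; here we just record the call.
        self._submitted.append(("zeros", shape))
        return shape

    def handles(self):
        return list(self._submitted)


def run_bench_sketch(bench_fn):
    ctx = StubContext()
    bench_fn(ctx)  # the return value is ignored, as in D3
    if not ctx.handles():
        return Completion(ok=False, error_code="NO_REQUESTS")
    return Completion(ok=True)


def placeholder(torch):
    print("not implemented yet")  # submits nothing


def real(torch):
    torch.zeros((4, 4))  # one submission is enough


placeholder_result = run_bench_sketch(placeholder)
real_result = run_bench_sketch(real)
```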

### D5. Single-device convention + multi-SIP exception (ADR-0024/0027)

CLAUDE.md Part 2 CLI Semantics' **"benchmarks MUST remain
single-device"** rule is interpreted as follows:

- **Standard bench (single-SIP use)**: define tensor placement with
  `dp = DPPolicy(...)` and launch with `torch.launch(...)`. The SIP
  index is chosen by `--device` (CLI's responsibility).
- **CCL bench (multi-SIP use)**: as an exception, use
  `torch.distributed.init_process_group(backend="ahbm")` plus
  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` for the
  rank = SIP pattern (ADR-0024 D3). `--device` is ignored (or treated
  as `all`); each spawned worker calls `torch.ahbm.set_device(rank)` to
  bind to its SIP.

Multi-device patterns outside these two (e.g., one bench function
launching across multiple SIPs in the same process) are forbidden by
this ADR. Even with `--device all`, the bench function runs once; any
multi-SIP work inside that single run must follow the CCL pattern above.

### D6. Name/index resolution (`resolve`)

`resolve(identifier: str)` returns a BenchSpec via:

1. If `identifier.isdigit()`: convert to int and return the spec whose
   `index` equals that value. If none exists, raise
   `ValueError("No bench with index ...")`.
2. If `identifier in _REGISTRY`: direct lookup.
3. Otherwise: `ValueError("Unknown bench ...")`.

Empty or whitespace-only identifiers raise `ValueError("bench
identifier must be a non-empty string.")`.

The CLI passes `--bench` directly to `resolve`, so users can use either
`kernbench run --bench gemm-single-pe` or `kernbench run --bench 2`.
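
The lookup order can be sketched as follows. This is an assumption-laden illustration: the `BenchSpec` fields other than `index` and `run`, the sample registry contents, the exact tail of each error message, and the placement of the empty-string check before the lookups are all guesses, not the real `registry.py`.

```python
from typing import Callable, NamedTuple


class BenchSpec(NamedTuple):
    index: int
    name: str
    description: str
    run: Callable


def _noop(torch):
    return None


# Hypothetical finalized registry keyed by bench name.
_REGISTRY = {
    "ccl-allreduce": BenchSpec(1, "ccl-allreduce", "Demo.", _noop),
    "gemm-single-pe": BenchSpec(2, "gemm-single-pe", "Demo.", _noop),
}


def resolve(identifier):
    if not isinstance(identifier, str) or not identifier.strip():
        raise ValueError("bench identifier must be a non-empty string.")
    if identifier.isdigit():
        wanted = int(identifier)
        for spec in _REGISTRY.values():
            if spec.index == wanted:
                return spec
        raise ValueError(f"No bench with index {wanted}")
    if identifier in _REGISTRY:
        return _REGISTRY[identifier]
    raise ValueError(f"Unknown bench {identifier!r}")


by_name = resolve("gemm-single-pe")
by_index = resolve("2")
```

Since D1 requires names to start with a letter, an all-digit identifier can never be a valid bench name, so the index branch and the name branch cannot collide.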

### D7. Indices are not a stable API

`_finalize()` sorts `_PENDING` alphabetically by name and assigns
1-based indices. Adding a new bench can shift existing benches'
indices. Therefore:

- Human-interactive use: indices are fine.
- Scripts / CI automation: always use the name.

This caveat is documented in `registry.py`'s module docstring.
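
The index shift is easy to see in a small sketch. The `finalize` function below is a hypothetical stand-in for `_finalize()` that returns a name-to-index map, and the bench names are illustrative only.

```python
def finalize(pending):
    """Sort (name, description, fn) entries by name; assign 1-based indices."""
    seen = set()
    for name, _description, _fn in pending:
        if name in seen:
            raise RuntimeError(f"duplicate bench name: {name}")
        seen.add(name)
    ordered = sorted(pending, key=lambda entry: entry[0])
    return {name: index for index, (name, _d, _f) in enumerate(ordered, start=1)}


pending = [("qkv-gemm", "Demo.", None), ("gemm-single-pe", "Demo.", None)]
before = finalize(pending)

# Adding a bench that sorts first shifts every existing index by one.
after = finalize(pending + [("ccl-allreduce", "Demo.", None)])
```

Here `gemm-single-pe` moves from index 1 to index 2 once `ccl-allreduce` is added, which is why automation should pass names.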

### D8. Surface RuntimeContext exposes to benches

A bench's `torch` parameter may legitimately use:

- **Tensor creation**: `torch.empty(shape, dtype=..., dp=DPPolicy(...),
  name=...)`, `torch.zeros(...)`, `torch.from_numpy(arr)`. All submit
  host-side metadata plus device deployment (`MmuMapMsg` +
  `MemoryWriteMsg`).
- **Kernel launch**: `torch.launch(kernel_name, kernel_fn, *args)` —
  converts `(Tensor, int, float)` positional args to `TensorArg` /
  `ScalarArg`, submits per-SIP `KernelLaunchMsg`, and drains.
- **Synchronization**: `torch.wait(handle)`, `torch.wait_all()`
  (`run_bench` calls the latter automatically).
- **Distributed**: `torch.distributed.init_process_group(backend="ahbm")`,
  `torch.distributed.get_world_size()`,
  `torch.distributed.all_reduce(t, op=...)` (ADR-0024/0027).
- **Multi-process (rank = SIP)**:
  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` (ADR-0024 D3 /
  ADR-0027).
- **Device binding**: `torch.ahbm.set_device(rank)` or
  `torch.accelerator.set_device_index(rank)` (both point to the same
  namespace).
- **IPCQ install**: `torch.install_ipcq(algorithm=..., ccl_yaml=...)`
  (ADR-0023 D10).
- **Spec lookup**: `torch.spec` — the dict produced by the topology
  builder (system / cube_mesh / HBM parameters etc.). Use it so the
  bench does not hardcode topology.yaml values.

Benches must not access RuntimeContext private members (`_handles`,
`_traces`, `_allocators`, etc.) directly. This aligns with ADR-0007's
layer-boundary spirit: bench → runtime API → sim_engine flows in one
direction.

### D9. Environment-variable parameterization is allowed

Benches may parameterize themselves via `os.environ.get(...)`, as
`matmul_composite.py` does for `MATMUL_M`, `MATMUL_K`, `MATMUL_N`,
`MATMUL_DTYPE`, `MATMUL_VARIANT`. Rationale:

- The bench function signature is fixed by D3 to `def run(torch)`, so
  positional/keyword arguments cannot carry parameters.
- The env-var pattern is a natural hook for operational sweeps (e.g.,
  `MATMUL_VARIANT`).
- External drivers such as `scripts/gemm_sweep.py` (ADR-0044) consume
  this hook (it sets `MATMUL_M/K/N/VARIANT` at
  `scripts/gemm_sweep.py:115-118`).

When environment variables alter bench behavior, the module docstring
must list every variable used (`matmul_composite.py` is the canonical
example).
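
A minimal sketch of the env-var pattern follows. The variable names come from this ADR, but the helper `read_matmul_params`, the default values, and the value formats are assumptions made for illustration; `matmul_composite.py` may parse them differently.

```python
import os


def read_matmul_params(environ=None):
    """Read the MATMUL_* variables, falling back to illustrative defaults."""
    env = os.environ if environ is None else environ
    return {
        "m": int(env.get("MATMUL_M", "128")),
        "k": int(env.get("MATMUL_K", "128")),
        "n": int(env.get("MATMUL_N", "128")),
        "dtype": env.get("MATMUL_DTYPE", "fp16"),
        "variant": env.get("MATMUL_VARIANT", "baseline"),
    }


# A sweep driver would set these in the child process environment;
# a plain dict stands in for os.environ here.
params = read_matmul_params({"MATMUL_M": "256", "MATMUL_VARIANT": "tiled"})
```

Accepting an optional mapping keeps the parsing testable without mutating the real process environment.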

## Alternatives Considered

### A1. An explicit manifest file (YAML) listing benches

Rejected. The `@bench` + audit pattern guarantees "drop in file →
auto-register", concentrating cognitive cost in one place (the file itself).
A separate manifest is prone to drift, and helper separation is already
clear via the `_` prefix.

### A2. Allowing the bench's entry-point name in the decorator (`@bench(name=..., entry="run_xxx")`)

Rejected. Breaks the simplicity of dispatch (`spec.run` is a single
callable). The `run` convention is sufficient; variants can register
multiple `@bench`-decorated functions in the same module.

### A3. A separate `@multi_device_bench` decorator for CCL

Rejected. The two patterns named in D5 (single + ADR-0024 multi-SIP)
cover all 8 current benches. A separate decorator would force dispatch
to branch and add complexity; the multi-SIP intent is already obvious
from the bench's `init_process_group(...)` call.

### A4. Make indices a stable API (registration order or explicit `index=` argument)

Rejected. D7's trade-off favors user-friendliness — alphabetically
sorted 1-based indices read naturally in the `list` output. Scripts can
use names.

## Consequences

- "How to add a bench" is consolidated in one ADR — new authors only
  need to read D1-D3 and D8 without grepping source.
- The `_`-prefixed helper-module pattern is legitimized at ADR level,
  so future `benches/_*.py` shared helpers are free to be added.
- The CLI's single-device convention and CCL's multi-SIP exception are
  shown to be consistent (D5) — they are orthogonal.
- The rationale for ADR-0044's GEMM eval harness using env-var hooks
  (D9) is now ADR-pinned.
- Indices are explicitly unstable (D7), so any CI code calling
  `kernbench run --bench 3` is flagged for review after this ADR is
  accepted.