adr: add ADR-0045 (bench module contract — registration, dispatch, authoring)

Documents src/kernbench/benches/: how @bench registration + audit work,
how the CLI dispatches via run_bench/RuntimeContext, and the contract a
new bench module must satisfy.

Nine decisions (D1-D9) cover:
- @bench name/description rules and duplicate detection
- Module-file convention (_-prefixed helpers vs bench modules)
- def run(torch) signature; torch = RuntimeContext
- Minimum-one-submit rule (else NO_REQUESTS)
- Single-device convention + multi-SIP CCL exception (ADR-0024/0027)
- resolve() name/index decision tree; indices are not a stable API
- Exact RuntimeContext surface exposed to benches
- Env-var parameterization (matmul_composite / gemm_sweep.py pattern)

Four alternatives rejected with documented reasons (manifest YAML,
decorator entry= arg, @multi_device_bench split, stable indices).

Verifier (tools/verify_adr_lang_pairs.py) passes for the EN/KO pair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 16:29:45 -07:00
parent fd56b6cacd
commit 5f8dd688f5
2 changed files with 552 additions and 0 deletions
@@ -0,0 +1,291 @@
# ADR-0045: Bench Module Contract — registration, dispatch, and authoring
## Status
Accepted (2026-05-21).
Unifies the `src/kernbench/benches/` registration mechanism (@bench), the
CLI dispatch path (`kernbench run/list`), and the contract a new bench
module must follow. ADR-0010 (CLI surface) specifies the `kernbench
list/run` interface, but **how benches are registered and what signature
they must follow** had no ADR-level coverage.
## First action
When `kernbench.benches` is imported, `__init__.py` immediately calls
`_eager_import_and_audit(__path__, __name__)`. Its first action is to
enumerate every sibling module in the package directory via
`pkgutil.iter_modules(__path__)` and **eagerly import** each one via
`importlib.import_module(...)` — except modules matching either:
- name `registry` (the infrastructure module itself), or
- name starting with `_` (helper modules).
At import time, each `@bench(name=..., description=...)` decorator inside
the imported module runs, appending `(name, description, fn)` to
`_PENDING` and adding `fn.__module__` to `_REGISTERED_MODULES`.
Once imports finish, `_audit_modules(imported, _REGISTERED_MODULES)`
runs; if any imported module did not invoke `@bench` at least once, it
raises `RuntimeError("Bench module(s) missing @bench decorator: ...")`
immediately. At this point indices are still unassigned — the first call
to `list_all()` / `resolve(...)` triggers `_finalize()`, which sorts
`_PENDING` alphabetically by name and assigns 1-based indices.
In short, **the bench infrastructure's first act is "eagerly import
every non-helper module in the package and audit that each one
registered at least one bench"**.
## Context
`src/kernbench/benches/` currently holds 8 bench modules (`ccl_allreduce`,
`gemm_single_pe`, `gpt3_qkv`, `ipcq_allreduce`, `matmul_composite`,
`qkv_gemm`, `qkv_gemm_multi_pe`, `va_offset_verify`). Every bench follows
the same unified flow:
```
kernbench run --topology <T> --bench <N>
cli/main.py::cmd_run
↓ resolve_topology(T) + resolve(N) + resolve_device(device_arg)
runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
↓ engine_factory(topology, device) → GraphEngine
↓ RuntimeContext(engine, target_device, correlation_id, spec)
bench_fn(ctx) ← invokes the bench's run(torch)
↓ ctx.empty/zeros/from_numpy/launch/distributed.* etc. submit work
ctx.wait_all() ← drains any outstanding handles
BenchResult(completion, correlation_id, trace, traces, engine)
```
ADR-0010 covers only the CLI surface (`run/list/probe/web`); ADR-0007
covers only the runtime API ↔ sim_engine boundary. The question "what
shape must a new bench file take?" had to be answered by grepping the
codebase. As a result:
- The @bench decorator contract (kebab-case name, non-empty description)
lived only in the source.
- The bench function signature (`def run(torch)`) was a de-facto
convention enforced by the CLI dispatcher calling `spec.run`.
- New bench authors learned the "helpers must use `_` prefix" rule only
after seeing the audit's RuntimeError.
- The single-device convention (CLAUDE.md Part 2 CLI Semantics) and its
interaction with multi-SIP CCL benches was ambiguous for bench
authors.
This ADR consolidates all of it in one place.
## Decision
### D1. @bench decorator contract
```python
from kernbench.benches.registry import bench


@bench(name="my-bench", description="Short, complete-sentence description.")
def run(torch):
    # Bench body: submit work through the `torch` context (see D4, D8).
    ...
```
- `name`: kebab-case string matching `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`.
Lowercase letters, digits, and dashes only; underscores forbidden;
must start with a letter.
- `description`: non-empty string (stripped length > 0). Displayed
verbatim by `kernbench list`.
- The decorator **returns the function unchanged**, so direct invocation
is fine. Its only side effects are appending to `_PENDING` and recording
`fn.__module__` in `_REGISTERED_MODULES`.
Violations of the first two rules raise `ValueError` at decoration time.
Duplicate names are caught at `_finalize()` with
`RuntimeError("duplicate bench name: ...")`.
### D2. Module-file convention
Every `src/kernbench/benches/<slug>.py` must be one of:
- **A bench module**: at top-level import, `@bench(...)` runs at least
once to register at least one bench.
- **A helper module**: the filename starts with `_` (e.g.,
`_shared_helpers.py`). The eager-import loop skips it, so it is neither
imported automatically nor audited.
The audit (`_audit_modules`) rejects any non-helper that fails to call
`@bench`. Intended consequence: dropping a new file into `benches/`
automatically registers its benches, and helper modules are clearly
flagged by their filename prefix alone.
### D3. The bench function signature is `def run(torch)`
The decorator does not enforce a function name, but **CLI dispatch calls
`spec_entry.run`** (the decorated callable). The convention is therefore:
- Function name: `run`. Other names work, but always use `run` for
readability and grep-ability.
- Argument: a single positional `torch`. In practice this is a
`RuntimeContext` instance exposing PyTorch-style namespaces
(zeros/empty/launch/distributed/...) — see ADR-0024 D3.
- Return value: any (`Any`). `run_bench` ignores it and tracks
completion via `ctx.handles()` / `engine.get_completion()`.
The `torch` name imitates a PyTorch-compatible idiom; the actual PyTorch
module is not passed in (aligned with ADR-0024's "rank = SIP" launcher
convention).
### D4. A bench must submit at least once
If `ctx.handles()` is empty after the bench returns, `run_bench` reports
a failed completion: `BenchResult.completion` has `ok=False` and
`error_code="NO_REQUESTS"`. A meaningful bench must therefore invoke at
least one of:
- Tensor-creation APIs: `torch.zeros(...)`, `torch.empty(...)` — these
internally submit `MmuMapMsg` and (for zeros) `MemoryWriteMsg`.
- Kernel-launch API: `torch.launch(name, fn, *args)` — submits per-SIP
`KernelLaunchMsg`.
- Exception: placeholder benches. For example, the `print(...)`-only
stub in `ipcq_allreduce.py` submits nothing and therefore receives a
NO_REQUESTS result. CI is expected to recognize placeholder benches and
handle them specially.
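The minimum-one-submit rule can be illustrated with a toy stand-in for the runner. `FakeContext`, `Completion`, and `run_bench_sketch` are hypothetical names invented for this sketch; the real `run_bench` and `RuntimeContext` carry far more state (engine, traces, correlation id).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Completion:
    """Hypothetical stand-in for the completion carried by BenchResult."""
    ok: bool
    error_code: Optional[str] = None


class FakeContext:
    """Hypothetical context that only records submitted handles."""

    def __init__(self):
        self._submitted = []

    def zeros(self, shape):
        handle = ("zeros", tuple(shape))
        self._submitted.append(handle)
        return handle

    def handles(self):
        return list(self._submitted)

    def wait_all(self):
        pass  # nothing to drain in this toy model


def run_bench_sketch(bench_fn):
    """Call the bench, drain, and report NO_REQUESTS if nothing was submitted."""
    ctx = FakeContext()
    bench_fn(ctx)  # the return value is ignored, as in D3
    ctx.wait_all()
    if not ctx.handles():
        return Completion(ok=False, error_code="NO_REQUESTS")
    return Completion(ok=True)
```

A `print(...)`-only stub passed to `run_bench_sketch` yields the `NO_REQUESTS` completion, while a bench that calls `torch.zeros((4, 4))` does not.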
### D5. Single-device convention + multi-SIP exception (ADR-0024/0027)
CLAUDE.md Part 2 CLI Semantics' **"benchmarks MUST remain
single-device"** rule is interpreted as follows:
- **Standard bench (single-SIP use)**: define tensor placement with
`dp = DPPolicy(...)` and launch with `torch.launch(...)`. The SIP
index is chosen by `--device` (CLI's responsibility).
- **CCL bench (multi-SIP use)**: as an exception, use
`torch.distributed.init_process_group(backend="ahbm")` plus
`torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` for the
rank = SIP pattern (ADR-0024 D3). `--device` is ignored (or treated
as `all`); each spawned worker calls `torch.ahbm.set_device(rank)` to
bind to its SIP.
Multi-device patterns outside these two (e.g., one bench function
launching across multiple SIPs in the same process) are forbidden by
this ADR. Even with `--device all`, the bench runs once; multi-SIP use
inside that single run must follow the CCL pattern above.
### D6. Name/index resolution (`resolve`)
`resolve(identifier: str)` returns a BenchSpec via:
1. If `identifier.isdigit()`: convert to int and find the spec where
`index ==` that value. If none, `ValueError("No bench with index
...")`.
2. If `identifier in _REGISTRY`: direct lookup.
3. Otherwise: `ValueError("Unknown bench ...")`.
Empty or whitespace-only identifiers raise `ValueError("bench
identifier must be a non-empty string.")`.
The CLI passes `--bench` directly to `resolve`, so users can use either
`kernbench run --bench gemm-single-pe` or `kernbench run --bench 2`.
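The decision tree above can be sketched as a standalone function. This is a simplified illustration: `build_registry` is a hypothetical helper, the registry is passed explicitly rather than held in a module-level `_REGISTRY`, and the text that follows each error-message prefix is an assumption.

```python
from collections import namedtuple

BenchSpec = namedtuple("BenchSpec", "index name description run")


def build_registry(names):
    """Hypothetical helper: alphabetical, 1-based indices (see D7)."""
    return {
        name: BenchSpec(index, name, f"{name} bench.", lambda torch: None)
        for index, name in enumerate(sorted(names), start=1)
    }


def resolve(identifier, registry):
    """Resolve a bench by numeric index or by name."""
    if not isinstance(identifier, str) or not identifier.strip():
        raise ValueError("bench identifier must be a non-empty string.")
    if identifier.isdigit():
        index = int(identifier)
        for spec in registry.values():
            if spec.index == index:
                return spec
        # Suffix after the prefix is an assumption made for this sketch.
        raise ValueError(f"No bench with index {index}")
    if identifier in registry:
        return registry[identifier]
    # Suffix after the prefix is an assumption made for this sketch.
    raise ValueError(f"Unknown bench {identifier!r}")
```

Because the digit check comes first, an all-digit string is always treated as an index. The D1 name rule requires a leading letter, so no registered name can collide with an index.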
### D7. Indices are not a stable API
`_finalize()` sorts `_PENDING` alphabetically by name and assigns
1-based indices. Adding a new bench can shift existing benches'
indices. Therefore:
- Human-interactive use: indices are fine.
- Scripts / CI automation: always use the name.
This caveat is documented in `registry.py`'s module docstring.
### D8. Surface RuntimeContext exposes to benches
A bench's `torch` parameter may legitimately use:
- **Tensor creation**: `torch.empty(shape, dtype=..., dp=DPPolicy(...),
name=...)`, `torch.zeros(...)`, `torch.from_numpy(arr)`. All create
host-side metadata and submit the device deployment messages
(`MmuMapMsg`, plus `MemoryWriteMsg` where contents are written, as
noted in D4).
- **Kernel launch**: `torch.launch(kernel_name, kernel_fn, *args)` —
converts `(Tensor, int, float)` positional args to `TensorArg` /
`ScalarArg`, submits per-SIP `KernelLaunchMsg`, and drains.
- **Synchronization**: `torch.wait(handle)`, `torch.wait_all()`
(`run_bench` calls the latter automatically).
- **Distributed**: `torch.distributed.init_process_group(backend="ahbm")`,
`torch.distributed.get_world_size()`,
`torch.distributed.all_reduce(t, op=...)` (ADR-0024/0027).
- **Multi-process (rank = SIP)**:
`torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` (ADR-0024 D3 /
ADR-0027).
- **Device binding**: `torch.ahbm.set_device(rank)` or
`torch.accelerator.set_device_index(rank)` (both point to the same
namespace).
- **IPCQ install**: `torch.install_ipcq(algorithm=..., ccl_yaml=...)`
(ADR-0023 D10).
- **Spec lookup**: `torch.spec` — the dict produced by the topology
builder (system / cube_mesh / HBM parameters etc.). Use it so the
bench does not hardcode topology.yaml values.
Benches must not access RuntimeContext private members (`_handles`,
`_traces`, `_allocators`, etc.) directly. This aligns with ADR-0007's
layer-boundary spirit: bench → runtime API → sim_engine flows in one
direction.
### D9. Environment-variable parameterization is allowed
Benches may parameterize themselves via `os.environ.get(...)`, as
`matmul_composite.py` does for `MATMUL_M`, `MATMUL_K`, `MATMUL_N`,
`MATMUL_DTYPE`, `MATMUL_VARIANT`. Rationale:
- The bench function signature is fixed by D3 to `def run(torch)`, so
positional/keyword arguments cannot carry parameters.
- The env-var pattern is a natural hook for operational sweeps (e.g.,
`MATMUL_VARIANT`).
- External drivers such as `scripts/gemm_sweep.py` (ADR-0044) consume
this hook (it sets `MATMUL_M/K/N/VARIANT` at
`scripts/gemm_sweep.py:115-118`).
When environment variables alter bench behavior, the module docstring
must list every variable used (`matmul_composite.py` is the canonical
example).
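A minimal sketch of the env-var pattern follows. The variable names come from this ADR; the helper name `read_matmul_params` and every default value are assumptions for illustration and do not reflect what `matmul_composite.py` actually uses.

```python
import os


def read_matmul_params(environ=None):
    """Read MATMUL_* parameters, falling back to illustrative defaults.

    Hypothetical helper; all defaults below are assumptions.
    """
    env = os.environ if environ is None else environ
    return {
        "m": int(env.get("MATMUL_M", "128")),
        "k": int(env.get("MATMUL_K", "128")),
        "n": int(env.get("MATMUL_N", "128")),
        "dtype": env.get("MATMUL_DTYPE", "fp16"),
        "variant": env.get("MATMUL_VARIANT", "default"),
    }


def run(torch):
    params = read_matmul_params()
    # A real bench would now create tensors and launch kernels via `torch`.
    return params
```

Accepting an optional mapping keeps the parsing testable without mutating the process environment; an external driver would instead set the variables before invoking `kernbench run`.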
## Alternatives Considered
### A1. An explicit manifest file (YAML) listing benches
Rejected. The `@bench` + audit pattern guarantees "drop in file → auto-
register", concentrating cognitive cost in one place (the file itself).
A separate manifest is prone to drift, and helper separation is already
clear via the `_` prefix.
### A2. Allowing the bench's entry-point name in the decorator (`@bench(name=..., entry="run_xxx")`)
Rejected. Breaks the simplicity of dispatch (`spec.run` is a single
callable). The `run` convention is sufficient; variants can register
multiple `@bench`-decorated functions in the same module.
### A3. A separate `@multi_device_bench` decorator for CCL
Rejected. The two patterns named in D5 (single + ADR-0024 multi-SIP)
cover all 8 current benches. A separate decorator would force dispatch
to branch and add complexity; the multi-SIP intent is already obvious
from the bench's `init_process_group(...)` call.
### A4. Make indices a stable API (registration order or explicit `index=` argument)
Rejected. D7's trade-off favors user-friendliness — alphabetically
sorted 1-based indices read naturally in the `list` output. Scripts can
use names.
## Consequences
- "How to add a bench" is consolidated in one ADR — new authors only
need to read D1-D3 and D8 without grepping source.
- The `_`-prefixed helper-module pattern is legitimized at ADR level,
so future `benches/_*.py` shared helpers are free to be added.
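- The CLI's single-device convention and CCL's multi-SIP exception are
shown to be compatible (D5): the bench function still runs once per
invocation, and multi-SIP work happens only inside that run via the
rank = SIP pattern.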
- The rationale for ADR-0044's GEMM eval harness using env-var hooks
(D9) is now ADR-pinned.
- Indices are explicitly unstable (D7), so any CI code calling
`kernbench run --bench 3` is flagged for review after this ADR is
accepted.