adr: add ADR-0045 (bench module contract — registration, dispatch, authoring)

Documents src/kernbench/benches/: how @bench registration + audit work,
how the CLI dispatches via run_bench/RuntimeContext, and the contract a
new bench module must satisfy.

Nine decisions (D1-D9) cover:
- @bench name/description rules and duplicate detection
- Module-file convention (_-prefixed helpers vs bench modules)
- def run(torch) signature; torch = RuntimeContext
- Minimum-one-submit rule (else NO_REQUESTS)
- Single-device convention + multi-SIP CCL exception (ADR-0024/0027)
- resolve() name/index decision tree; indices are not a stable API
- Exact RuntimeContext surface exposed to benches
- Env-var parameterization (matmul_composite / gemm_sweep.py pattern)

Four alternatives rejected with documented reasons (manifest YAML,
decorator entry= arg, @multi_device_bench split, stable indices).

Verifier (tools/verify_adr_lang_pairs.py) passes for EN/KO pair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 16:29:45 -07:00
parent fd56b6cacd
commit 5f8dd688f5
2 changed files with 552 additions and 0 deletions
@@ -0,0 +1,291 @@
+# ADR-0045: Bench Module Contract — registration, dispatch, and authoring
+
+## Status
+
+Accepted (2026-05-21).
+
+Unifies the `src/kernbench/benches/` registration mechanism (@bench), the
+CLI dispatch path (`kernbench run/list`), and the contract a new bench
+module must follow. ADR-0010 (CLI surface) specifies the `kernbench
+list/run` interface, but **how benches are registered and what signature
+they must follow** had no ADR-level coverage.
+
+## First action
+
+When `kernbench.benches` is imported, `__init__.py` immediately calls
+`_eager_import_and_audit(__path__, __name__)`. Its first action is to
+enumerate every sibling module in the package directory via
+`pkgutil.iter_modules(__path__)` and **eagerly import** each one via
+`importlib.import_module(...)` — except modules matching either:
+
+- name `registry` (the infrastructure module itself), or
+- name starting with `_` (helper modules).
+
+At import time, each `@bench(name=..., description=...)` decorator inside
+the imported module runs, appending `(name, description, fn)` to
+`_PENDING` and adding `fn.__module__` to `_REGISTERED_MODULES`.
+
+Once imports finish, `_audit_modules(imported, _REGISTERED_MODULES)`
+runs; if any imported module did not invoke `@bench` at least once, it
+raises `RuntimeError("Bench module(s) missing @bench decorator: ...")`
+immediately. At this point indices are still unassigned — the first call
+to `list_all()` / `resolve(...)` triggers `_finalize()`, which sorts
+`_PENDING` alphabetically by name and assigns 1-based indices.
+
+In short, **the bench infrastructure's first act is "eagerly import
+every non-helper module in the package and audit that each one
+registered at least one bench"**.
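Annotation: the selection and audit steps described above can be sketched as two small pure functions. This is a minimal illustration, not the code in `__init__.py`: the function names `is_bench_module` and `audit_modules`, the use of bare module names instead of fully qualified ones, and the sibling list are all assumptions.

```python
from typing import Iterable, List, Set


def is_bench_module(module_name: str) -> bool:
    """True if the eager importer should import this sibling module."""
    # Skip the infrastructure module and any `_`-prefixed helper.
    return module_name != "registry" and not module_name.startswith("_")


def audit_modules(imported: Iterable[str], registered: Set[str]) -> None:
    """Raise if any imported module never invoked @bench."""
    missing: List[str] = sorted(m for m in imported if m not in registered)
    if missing:
        raise RuntimeError(
            "Bench module(s) missing @bench decorator: " + ", ".join(missing)
        )


# Hypothetical sibling names, as pkgutil.iter_modules might report them.
siblings = ["registry", "_shared_helpers", "gemm_single_pe", "scratch"]
imported = [name for name in siblings if is_bench_module(name)]

try:
    # "scratch" was imported but never registered a bench.
    audit_modules(imported, registered={"gemm_single_pe"})
except RuntimeError as exc:
    audit_error = str(exc)
```

In this sketch `scratch` is the kind of file that trips the audit: it has no `_` prefix, so it is imported, but it registers nothing.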
+
+## Context
+
+`src/kernbench/benches/` currently holds 8 bench modules (`ccl_allreduce`,
+`gemm_single_pe`, `gpt3_qkv`, `ipcq_allreduce`, `matmul_composite`,
+`qkv_gemm`, `qkv_gemm_multi_pe`, `va_offset_verify`). Every bench follows
+the same unified flow:
+
+```
+kernbench run --topology <T> --bench <N>
+   ↓
+cli/main.py::cmd_run
+   ↓  resolve_topology(T)  + resolve(N)  + resolve_device(device_arg)
+   ↓
+runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
+   ↓  engine_factory(topology, device) → GraphEngine
+   ↓  RuntimeContext(engine, target_device, correlation_id, spec)
+   ↓
+bench_fn(ctx)        ← invokes the bench's run(torch)
+   ↓  ctx.empty/zeros/from_numpy/launch/distributed.* etc. submit work
+   ↓
+ctx.wait_all()       ← drains any outstanding handles
+   ↓
+BenchResult(completion, correlation_id, trace, traces, engine)
+```
+
+ADR-0010 covers only the CLI surface (`run/list/probe/web`); ADR-0007
+covers only the runtime API ↔ sim_engine boundary. The question "what
+shape must a new bench file take?" had to be answered by grepping the
+codebase. As a result:
+
+- The @bench decorator contract (kebab-case name, non-empty description)
+  lived only in the source.
+- The bench function signature (`def run(torch)`) was a de-facto
+  convention enforced by the CLI dispatcher calling `spec.run`.
+- New bench authors learned the "helpers must use `_` prefix" rule only
+  after seeing the audit's RuntimeError.
+- The single-device convention (CLAUDE.md Part 2 CLI Semantics) and its
+  interaction with multi-SIP CCL benches was ambiguous for bench
+  authors.
+
+This ADR consolidates all of it in one place.
+
+## Decision
+
+### D1. @bench decorator contract
+
+```python
+from kernbench.benches.registry import bench
+
+@bench(name="my-bench", description="Short, complete-sentence description.")
+def run(torch):
+    ...
+```
+
+- `name`: kebab-case string matching `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`.
+  Lowercase letters, digits, and dashes only; underscores forbidden;
+  must start with a letter.
+- `description`: non-empty string (stripped length > 0). Displayed
+  verbatim by `kernbench list`.
+- The decorator **returns the function unchanged** — direct invocation
+  is fine. Its only side effect is appending to `_PENDING`.
+
+Violations of the first two rules raise `ValueError` at decoration time.
+Duplicate names are caught at `_finalize()` with
+`RuntimeError("duplicate bench name: ...")`.
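Annotation: a minimal sketch of a decorator that enforces the two D1 rules at decoration time. The regular expression is quoted from D1; the `PENDING` list, the exact error messages, and the use of `re.fullmatch` are assumptions made for illustration and may differ from `registry.py`.

```python
import re
from typing import Callable, List, Tuple

# Pattern quoted from D1: kebab-case, starts with a letter, no underscores.
NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")

# Stand-in for the registry's pending list (illustrative only).
PENDING: List[Tuple[str, str, Callable]] = []


def bench(name: str, description: str) -> Callable[[Callable], Callable]:
    """Validate arguments at decoration time, then record the function."""
    # Error message wording below is hypothetical.
    if not isinstance(name, str) or not NAME_RE.fullmatch(name):
        raise ValueError(f"bench name must be kebab-case: {name!r}")
    if not isinstance(description, str) or not description.strip():
        raise ValueError("bench description must be a non-empty string")

    def decorator(fn: Callable) -> Callable:
        PENDING.append((name, description, fn))
        return fn  # returned unchanged, so direct calls still work

    return decorator


@bench(name="my-bench", description="Short, complete-sentence description.")
def run(torch):
    return "ran"
```

Names such as `my_bench`, `1bench`, or `My-Bench` would raise `ValueError` here before the function body is ever recorded.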
+
+### D2. Module-file convention
+
+Every `src/kernbench/benches/<slug>.py` must be one of:
+
+- **A bench module**: at top-level import, `@bench(...)` runs at least
+  once to register at least one bench.
+- **A helper module**: the filename starts with `_` (e.g.,
+  `_shared_helpers.py`). The eager importer skips it, so it is neither
+  imported eagerly nor audited.
+
+The audit (`_audit_modules`) rejects any non-helper that fails to call
+`@bench`. Intended consequence: dropping a new file into `benches/`
+automatically registers its benches, and helper modules are clearly
+flagged by their filename prefix alone.
+
+### D3. The bench function signature is `def run(torch)`
+
+The decorator does not enforce a function name, but **CLI dispatch calls
+`spec_entry.run`** (the decorated callable). The convention is therefore:
+
+- Function name: `run`. Other names work, but always use `run` for
+  readability and grep-ability.
+- Argument: a single positional `torch`. In practice this is a
+  `RuntimeContext` instance exposing PyTorch-style namespaces
+  (zeros/empty/launch/distributed/...) — see ADR-0024 D3.
+- Return value: any (`Any`). `run_bench` ignores it and tracks
+  completion via `ctx.handles()` / `engine.get_completion()`.
+
+The parameter is named `torch` so that bench code reads like PyTorch;
+the real PyTorch module is never passed in (consistent with ADR-0024's
+"rank = SIP" launcher convention).
+
+### D4. A bench must submit at least once
+
+If `ctx.handles()` is empty after the bench returns, `run_bench` reports
+a `BenchResult.completion` with `ok=False` and
+`error_code="NO_REQUESTS"`. A meaningful bench must therefore invoke at
+least one of:
+
+- Tensor-creation APIs: `torch.zeros(...)`, `torch.empty(...)` — these
+  internally submit `MmuMapMsg` and (for zeros) `MemoryWriteMsg`.
+- Kernel-launch API: `torch.launch(name, fn, *args)` — submits per-SIP
+  `KernelLaunchMsg`.
+
+Exception: an empty placeholder bench, such as `ipcq_allreduce.py`'s
+`print(...)`-only stub, submits nothing and therefore receives a
+NO_REQUESTS result. CI is expected to recognize placeholder benches and
+handle them specially.
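Annotation: the minimum-one-submit rule can be illustrated with a toy runner. Everything here is a hypothetical stand-in: `FakeContext`, `Completion`, and `run_and_check` only mimic the described behavior (an empty handle list yields `NO_REQUESTS`) and are not the real `RuntimeContext` or `run_bench`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Completion:
    ok: bool
    error_code: Optional[str] = None


@dataclass
class FakeContext:
    """Hypothetical stand-in for RuntimeContext; only tracks handles."""
    _submitted: List[Any] = field(default_factory=list)

    def zeros(self, shape):
        # The real API submits messages; here we only record a handle.
        self._submitted.append(("zeros", shape))
        return shape

    def handles(self) -> List[Any]:
        return list(self._submitted)


def run_and_check(bench_fn: Callable[[FakeContext], Any]) -> Completion:
    ctx = FakeContext()
    bench_fn(ctx)  # return value is ignored, as in D3
    if not ctx.handles():
        return Completion(ok=False, error_code="NO_REQUESTS")
    return Completion(ok=True)


def placeholder(torch):
    print("not implemented yet")  # submits nothing


def minimal(torch):
    torch.zeros((4, 4))  # one submit is enough


placeholder_result = run_and_check(placeholder)
minimal_result = run_and_check(minimal)
```

The placeholder function ends with a failed completion, while a single tensor creation is enough to pass the check.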
+
+### D5. Single-device convention + multi-SIP exception (ADR-0024/0027)
+
+CLAUDE.md Part 2 CLI Semantics' **"benchmarks MUST remain
+single-device"** rule is interpreted as follows:
+
+- **Standard bench (single-SIP use)**: define tensor placement with
+  `dp = DPPolicy(...)` and launch with `torch.launch(...)`. The SIP
+  index is chosen by `--device` (CLI's responsibility).
+- **CCL bench (multi-SIP use)**: as an exception, use
+  `torch.distributed.init_process_group(backend="ahbm")` plus
+  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` for the
+  rank = SIP pattern (ADR-0024 D3). `--device` is ignored (or treated
+  as `all`); each spawned worker calls `torch.ahbm.set_device(rank)` to
+  bind to its SIP.
+
+Multi-device patterns outside these two (e.g., one bench function
+launching across multiple SIPs in the same process) are forbidden by
+this ADR. Even with `--device all`, the bench runs once; any multi-SIP
+use inside that single run must follow the CCL pattern above.
+
+### D6. Name/index resolution (`resolve`)
+
+`resolve(identifier: str)` returns a BenchSpec via:
+
+1. If `identifier.isdigit()`: convert to int and find the spec where
+   `index ==` that value. If none, `ValueError("No bench with index
+   ...")`.
+2. If `identifier in _REGISTRY`: direct lookup.
+3. Otherwise: `ValueError("Unknown bench ...")`.
+
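+Empty or whitespace-only identifiers raise `ValueError("bench
+identifier must be a non-empty string.")`. Because D1 requires names to
+start with a letter, an all-digit identifier can never be a bench name,
+so steps 1 and 2 cannot conflict.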
+
+The CLI passes `--bench` directly to `resolve`, so users can use either
+`kernbench run --bench gemm-single-pe` or `kernbench run --bench 2`.
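Annotation: the decision tree in D6 might look roughly like the following. This is a sketch under assumptions: `BenchSpec` is reduced to two fields, the registry is passed in explicitly, the bench names are illustrative, and the tails of the error messages (elided in the ADR) are filled with placeholders.

```python
from typing import Dict, NamedTuple


class BenchSpec(NamedTuple):
    index: int
    name: str


def build_registry(names) -> Dict[str, BenchSpec]:
    # Alphabetical order, 1-based indices (as _finalize() is described).
    return {n: BenchSpec(i, n) for i, n in enumerate(sorted(names), start=1)}


def resolve(identifier: str, registry: Dict[str, BenchSpec]) -> BenchSpec:
    if not isinstance(identifier, str) or not identifier.strip():
        raise ValueError("bench identifier must be a non-empty string.")
    if identifier.isdigit():
        wanted = int(identifier)
        for spec in registry.values():
            if spec.index == wanted:
                return spec
        # Message tail is elided in the ADR; this completion is a guess.
        raise ValueError(f"No bench with index {wanted}")
    if identifier in registry:
        return registry[identifier]
    # Message tail is elided in the ADR; this completion is a guess.
    raise ValueError(f"Unknown bench {identifier!r}")


# Illustrative names only; the real registry contents may differ.
registry = build_registry(["gemm-single-pe", "ccl-allreduce", "gpt3-qkv"])
by_name = resolve("gemm-single-pe", registry)
by_index = resolve("2", registry)
```

With these three illustrative names, the name and the index `2` resolve to the same spec, which is the equivalence the CLI relies on.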
+
+### D7. Indices are not a stable API
+
+`_finalize()` sorts `_PENDING` alphabetically by name and assigns
+1-based indices. Adding a new bench can shift existing benches'
+indices. Therefore:
+
+- Human-interactive use: indices are fine.
+- Scripts / CI automation: always use the name.
+
+This caveat is documented in `registry.py`'s module docstring.
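Annotation: a short demonstration of why indices shift. The bench names are illustrative and `assign_indices` is a hypothetical helper that mirrors only the "sort by name, number from 1" rule stated in D7.

```python
def assign_indices(names):
    """Map each name to its 1-based position in alphabetical order."""
    return {name: i for i, name in enumerate(sorted(names), start=1)}


# Illustrative bench names, not the real registry contents.
before = assign_indices(["gemm-single-pe", "qkv-gemm"])
after = assign_indices(["gemm-single-pe", "qkv-gemm", "attention-fwd"])

# Registering "attention-fwd" pushes every existing bench down by one,
# so `--bench 1` now selects a different bench than it did before.
shifted = {name: (before[name], after[name]) for name in before}
```

A script that pinned `--bench 1` before the new registration would silently run a different bench afterwards, which is why D7 asks automation to use names.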
+
+### D8. RuntimeContext surface exposed to benches
+
+Through its `torch` parameter, a bench may legitimately use:
+
+- **Tensor creation**: `torch.empty(shape, dtype=..., dp=DPPolicy(...),
+  name=...)`, `torch.zeros(...)`, `torch.from_numpy(arr)`. Each records
+  host-side metadata and submits device deployment messages
+  (`MmuMapMsg`, plus `MemoryWriteMsg` when contents are written, as
+  with `zeros`; see D4).
+- **Kernel launch**: `torch.launch(kernel_name, kernel_fn, *args)` —
+  converts `(Tensor, int, float)` positional args to `TensorArg` /
+  `ScalarArg`, submits per-SIP `KernelLaunchMsg`, and drains.
+- **Synchronization**: `torch.wait(handle)`, `torch.wait_all()`
+  (`run_bench` calls the latter automatically).
+- **Distributed**: `torch.distributed.init_process_group(backend="ahbm")`,
+  `torch.distributed.get_world_size()`,
+  `torch.distributed.all_reduce(t, op=...)` (ADR-0024/0027).
+- **Multi-process (rank = SIP)**:
+  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` (ADR-0024 D3 /
+  ADR-0027).
+- **Device binding**: `torch.ahbm.set_device(rank)` or
+  `torch.accelerator.set_device_index(rank)` (both point to the same
+  namespace).
+- **IPCQ install**: `torch.install_ipcq(algorithm=..., ccl_yaml=...)`
+  (ADR-0023 D10).
+- **Spec lookup**: `torch.spec` — the dict produced by the topology
+  builder (system / cube_mesh / HBM parameters etc.). Use it so the
+  bench does not hardcode topology.yaml values.
+
+Benches must not access RuntimeContext private members (`_handles`,
+`_traces`, `_allocators`, etc.) directly. This aligns with ADR-0007's
+layer-boundary spirit: bench → runtime API → sim_engine flows in one
+direction.
+
+### D9. Environment-variable parameterization is allowed
+
+Benches may parameterize themselves via `os.environ.get(...)`, as
+`matmul_composite.py` does for `MATMUL_M`, `MATMUL_K`, `MATMUL_N`,
+`MATMUL_DTYPE`, `MATMUL_VARIANT`. Rationale:
+
+- The bench function signature is fixed by D3 to `def run(torch)`, so
+  positional/keyword arguments cannot carry parameters.
+- The env-var pattern is a natural hook for operational sweeps (e.g.,
+  `MATMUL_VARIANT`).
+- External drivers such as `scripts/gemm_sweep.py` (ADR-0044) consume
+  this hook (it sets `MATMUL_M/K/N/VARIANT` at
+  `scripts/gemm_sweep.py:115-118`).
+
+When environment variables alter bench behavior, the module docstring
+must list every variable used (`matmul_composite.py` is the canonical
+example).
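Annotation: one way a bench could read the `MATMUL_*` variables. The variable names come from D9; the `MatmulParams` type, the `read_params` helper, and every default value (`128`, `fp16`, `default`) are hypothetical and not taken from `matmul_composite.py`.

```python
import os
from typing import Mapping, NamedTuple


class MatmulParams(NamedTuple):
    m: int
    k: int
    n: int
    dtype: str
    variant: str


def read_params(env: Mapping[str, str] = os.environ) -> MatmulParams:
    """Read the MATMUL_* variables; every default here is hypothetical."""
    return MatmulParams(
        m=int(env.get("MATMUL_M", "128")),
        k=int(env.get("MATMUL_K", "128")),
        n=int(env.get("MATMUL_N", "128")),
        dtype=env.get("MATMUL_DTYPE", "fp16"),
        variant=env.get("MATMUL_VARIANT", "default"),
    )


# A sweep driver would set the variables before invoking the bench;
# a plain dict stands in for the process environment here.
params = read_params({"MATMUL_M": "256", "MATMUL_VARIANT": "tiled"})
```

Taking the environment as a mapping argument keeps the helper testable without mutating `os.environ`; inside `run(torch)` it would be called with no arguments.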
+
+## Alternatives Considered
+
+### A1. An explicit manifest file (YAML) listing benches
+
+Rejected. The `@bench` + audit pattern guarantees "drop in file →
+auto-register", concentrating cognitive cost in one place (the file
+itself). A separate manifest is prone to drift, and helper separation
+is already clear via the `_` prefix.
+
+### A2. Allowing the bench's entry-point name in the decorator (`@bench(name=..., entry="run_xxx")`)
+
+Rejected. Breaks the simplicity of dispatch (`spec.run` is a single
+callable). The `run` convention is sufficient; variants can register
+multiple `@bench`-decorated functions in the same module.
+
+### A3. A separate `@multi_device_bench` decorator for CCL
+
+Rejected. The two patterns named in D5 (single + ADR-0024 multi-SIP)
+cover all 8 current benches. A separate decorator would force dispatch
+to branch and add complexity; the multi-SIP intent is already obvious
+from the bench's `init_process_group(...)` call.
+
+### A4. Make indices a stable API (registration order or explicit `index=` argument)
+
+Rejected. D7's trade-off favors user-friendliness — alphabetically
+sorted 1-based indices read naturally in the `list` output. Scripts can
+use names.
+
+## Consequences
+
+- "How to add a bench" is consolidated in one ADR — new authors only
+  need to read D1-D3 and D8 without grepping source.
+- The `_`-prefixed helper-module pattern is legitimized at ADR level,
+  so future `benches/_*.py` shared helpers are free to be added.
+- The CLI's single-device convention and CCL's multi-SIP exception are
+  shown to be consistent (D5) — they are orthogonal.
+- The rationale for ADR-0044's GEMM eval harness using env-var hooks
+  (D9) is now ADR-pinned.
+- Indices are explicitly unstable (D7), so any CI code that calls
+  `kernbench run --bench 3` should be flagged for review and switched
+  to the bench name.