Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/
into two self-contained eval benches so a user can regenerate every
result + figure with one command:
    kernbench run --bench milestone-1h-gemm   (MILESTONE_FAST=1 reuses JSON)
    kernbench run --bench milestone-1h-ccl
- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
run(torch) entry drives the sweeps and writes figures into
benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
re-export/wrapper shims over the benches (single source preserved); the
pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).
ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated. Verified: milestone benches run
clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-0045: Bench Module Contract — registration, dispatch, and authoring
Status
Accepted (2026-05-21).
Unifies the src/kernbench/benches/ registration mechanism (@bench), the
CLI dispatch path (kernbench run/list), and the contract a new bench
module must follow. ADR-0010 (CLI surface) specifies the kernbench list/run interface, but how benches are registered and what signature
they must follow had no ADR-level coverage.
Extended by ADR-0054: D5's single-config rule gains a third pattern —
the eval bench (e.g. milestone-1h-*) drives many configs, builds its
own per-config engines, and submits a sentinel tensor to satisfy D4.
First action
When kernbench.benches is imported, __init__.py immediately calls
_eager_import_and_audit(__path__, __name__). Its first action is to
enumerate every sibling module in the package directory via
pkgutil.iter_modules(__path__) and eagerly import each one via
importlib.import_module(...) — except modules matching either:
- name registry (the infrastructure module itself), or
- name starting with _ (helper modules).
At import time, each @bench(name=..., description=...) decorator inside
the imported module runs, appending (name, description, fn) to
_PENDING and adding fn.__module__ to _REGISTERED_MODULES.
Once imports finish, _audit_modules(imported, _REGISTERED_MODULES)
runs; if any imported module did not invoke @bench at least once, it
raises RuntimeError("Bench module(s) missing @bench decorator: ...")
immediately. At this point indices are still unassigned — the first call
to list_all() / resolve(...) triggers _finalize(), which sorts
_PENDING alphabetically by name and assigns 1-based indices.
In short, the bench infrastructure's first act is "eagerly import every non-helper module in the package and audit that each one registered at least one bench".
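The import-then-audit sequence above can be sketched roughly as follows. This is a minimal illustration, not the actual registry code: the function names mirror the ones mentioned in this section, but their bodies, the `registered` parameter, and the exact tail of the error message are assumptions.

```python
import importlib
import pkgutil


def select_bench_modules(module_names):
    """Drop the infrastructure module and underscore-prefixed helpers."""
    return [n for n in module_names if n != "registry" and not n.startswith("_")]


def audit_modules(imported, registered):
    """Raise if any imported module never invoked @bench."""
    missing = sorted(set(imported) - set(registered))
    if missing:
        # Only the message prefix comes from the ADR; the suffix is assumed.
        raise RuntimeError(
            "Bench module(s) missing @bench decorator: " + ", ".join(missing)
        )


def eager_import_and_audit(pkg_path, pkg_name, registered):
    """Import every non-helper sibling module, then audit registration.

    `registered` stands in for _REGISTERED_MODULES, which the real
    decorator fills in as a side effect of each import.
    """
    short_names = [info.name for info in pkgutil.iter_modules(pkg_path)]
    imported = []
    for short in select_bench_modules(short_names):
        module = importlib.import_module(f"{pkg_name}.{short}")
        imported.append(module.__name__)
    audit_modules(imported, registered)
    return imported
```

Note that index assignment is deliberately absent here: per the text above, indices stay unassigned until the first list_all() / resolve(...) call.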
Context
src/kernbench/benches/ currently holds 8 bench modules (ccl_allreduce,
gemm_single_pe, gpt3_qkv, ipcq_allreduce, matmul_composite,
qkv_gemm, qkv_gemm_multi_pe, va_offset_verify). Every bench follows
the same unified flow:
kernbench run --topology <T> --bench <N>
↓
cli/main.py::cmd_run
↓ resolve_topology(T) + resolve(N) + resolve_device(device_arg)
↓
runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
↓ engine_factory(topology, device) → GraphEngine
↓ RuntimeContext(engine, target_device, correlation_id, spec)
↓
bench_fn(ctx) ← invokes the bench's run(torch)
↓ ctx.empty/zeros/from_numpy/launch/distributed.* etc. submit work
↓
ctx.wait_all() ← drains any outstanding handles
↓
BenchResult(completion, correlation_id, trace, traces, engine)
ADR-0010 covers only the CLI surface (run/list/probe/web); ADR-0007
covers only the runtime API ↔ sim_engine boundary. The question "what
shape must a new bench file take?" had to be answered by grepping the
codebase. As a result:
- The @bench decorator contract (kebab-case name, non-empty description) lived only in the source.
- The bench function signature (def run(torch)) was a de-facto convention enforced by the CLI dispatcher calling spec.run.
- New bench authors learned the "helpers must use _ prefix" rule only after seeing the audit's RuntimeError.
- The single-device convention (CLAUDE.md Part 2 CLI Semantics) and its interaction with multi-SIP CCL benches were ambiguous for bench authors.
This ADR consolidates all of it in one place.
Decision
D1. @bench decorator contract
from kernbench.benches.registry import bench
@bench(name="my-bench", description="Short, complete-sentence description.")
def run(torch):
...
- name: kebab-case string matching ^[a-z][a-z0-9]*(-[a-z0-9]+)*$. Lowercase letters, digits, and dashes only; underscores forbidden; must start with a letter.
- description: non-empty string (stripped length > 0). Displayed verbatim by kernbench list.
- The decorator returns the function unchanged, so direct invocation is fine. Its only side effects are appending to _PENDING and recording fn.__module__ in _REGISTERED_MODULES.
Violations of the first two rules raise ValueError at decoration time.
Duplicate names are caught at _finalize() with
RuntimeError("duplicate bench name: ...").
D2. Module-file convention
Every src/kernbench/benches/<slug>.py must be one of:
- A bench module: at top-level import, @bench(...) runs at least once to register at least one bench.
- A helper module: the filename starts with _ (e.g., _shared_helpers.py). The eager-import loop skips it.
The audit (_audit_modules) rejects any non-helper that fails to call
@bench. Intended consequence: dropping a new file into benches/
automatically registers its benches, and helper modules are clearly
flagged by their filename prefix alone.
D3. The bench function signature is def run(torch)
The decorator does not enforce a function name, but CLI dispatch calls
spec_entry.run (the decorated callable). The convention is therefore:
- Function name: run. Other names work, but always use run for readability and grep-ability.
- Argument: a single positional torch. In practice this is a RuntimeContext instance exposing PyTorch-style namespaces (zeros/empty/launch/distributed/...); see ADR-0024 D3.
- Return value: any (Any). run_bench ignores it and tracks completion via ctx.handles()/engine.get_completion().
The torch name imitates a PyTorch-compatible idiom; the actual PyTorch
module is not passed in (aligned with ADR-0024's "rank = SIP" launcher
convention).
D4. A bench must submit at least once
If ctx.handles() is empty after the bench returns, run_bench reports
BenchResult.completion = ok=False, error_code="NO_REQUESTS". So a
meaningful bench must invoke at least one of:
- Tensor-creation APIs: torch.zeros(...), torch.empty(...). These internally submit MmuMapMsg and (for zeros) MemoryWriteMsg.
- Kernel-launch API: torch.launch(name, fn, *args), which submits per-SIP KernelLaunchMsg.
- (Exception) Empty placeholder benches: e.g., ipcq_allreduce.py's print(...)-only stub will receive a NO_REQUESTS result. CI is expected to recognize and handle placeholder benches specially.
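The D4 rule can be pictured with a toy stand-in for the runtime. Everything below is hypothetical scaffolding (`FakeContext`, `Completion`, `run_bench_sketch`); only the "no handles means NO_REQUESTS" check reflects the text above, and the real run_bench derives success from the engine rather than from the handle count.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Completion:
    ok: bool
    error_code: Optional[str] = None


@dataclass
class FakeContext:
    """Toy context that only records submissions."""
    submitted: list = field(default_factory=list)

    def zeros(self, shape):
        self.submitted.append(("zeros", shape))
        return shape

    def handles(self):
        return list(self.submitted)


def run_bench_sketch(bench_fn):
    ctx = FakeContext()
    bench_fn(ctx)
    if not ctx.handles():
        return Completion(ok=False, error_code="NO_REQUESTS")
    return Completion(ok=True)  # simplification: real status comes from the engine


def placeholder_bench(torch):
    print("placeholder: nothing submitted")


def sentinel_bench(torch):
    # A bench that does its real work elsewhere can still satisfy D4
    # by submitting one small tensor.
    torch.zeros((1,))
```

Here `run_bench_sketch(placeholder_bench)` yields the NO_REQUESTS completion, while `run_bench_sketch(sentinel_bench)` does not.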
D5. Single-device convention + multi-SIP exception (ADR-0024/0027)
CLAUDE.md Part 2 CLI Semantics' "benchmarks MUST remain single-device" rule is interpreted as follows:
- Standard bench (single-SIP use): define tensor placement with dp = DPPolicy(...) and launch with torch.launch(...). The SIP index is chosen by --device (CLI's responsibility).
- CCL bench (multi-SIP use): as an exception, use torch.distributed.init_process_group(backend="ahbm") plus torch.multiprocessing.spawn(_worker, ..., nprocs=ws) for the rank = SIP pattern (ADR-0024 D3). --device is ignored (or treated as all); each spawned worker calls torch.ahbm.set_device(rank) to bind to its SIP.
Multi-device patterns outside these two (e.g., one bench function
launching across multiple SIPs in the same process) are forbidden by
this ADR. Even with --device all, the bench runs once; multi-SIP use
inside that single run must follow D5's second pattern.
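The control flow of the second (CCL) pattern might look like the skeleton below. To keep it runnable without kernbench, it is driven by a recording stub rather than a real RuntimeContext: `make_stub_torch`, the sequential `spawn`, the `args=(torch,)` argument, and the tensor shape are all assumptions for illustration, and the all_reduce `op` argument is omitted because the ADR elides it.

```python
from types import SimpleNamespace


def make_stub_torch(world_size):
    """Build a recording stand-in for the `torch` argument (hypothetical)."""
    calls = []

    def spawn(fn, args=(), nprocs=1):
        # The stub runs workers sequentially; it only mimics the call shape.
        for rank in range(nprocs):
            fn(rank, *args)

    torch = SimpleNamespace(
        distributed=SimpleNamespace(
            init_process_group=lambda backend: calls.append(("init", backend)),
            get_world_size=lambda: world_size,
            all_reduce=lambda t, op=None: calls.append(("all_reduce", t)),
        ),
        multiprocessing=SimpleNamespace(spawn=spawn),
        ahbm=SimpleNamespace(
            set_device=lambda rank: calls.append(("set_device", rank))
        ),
        zeros=lambda shape: ("tensor", shape),
    )
    return torch, calls


def _worker(rank, torch):
    torch.ahbm.set_device(rank)       # rank = SIP binding
    t = torch.zeros((4,))
    torch.distributed.all_reduce(t)


def run(torch):
    torch.distributed.init_process_group(backend="ahbm")
    ws = torch.distributed.get_world_size()
    torch.multiprocessing.spawn(_worker, args=(torch,), nprocs=ws)
```

The point of the skeleton is the ordering: one process-group init in run, then one worker per rank, each binding to its own SIP before issuing collective work.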
D6. Name/index resolution (resolve)
resolve(identifier: str) returns a BenchSpec via:
- If identifier.isdigit(): convert to int and find the spec where index == that value. If none, ValueError("No bench with index ...").
- If identifier in _REGISTRY: direct lookup.
- Otherwise: ValueError("Unknown bench ...").
Empty or whitespace-only identifiers raise ValueError("bench identifier must be a non-empty string.").
The CLI passes --bench directly to resolve, so users can use either
kernbench run --bench gemm-single-pe or kernbench run --bench 2.
D7. Indices are not a stable API
_finalize() sorts _PENDING alphabetically by name and assigns
1-based indices. Adding a new bench can shift existing benches'
indices. Therefore:
- Human-interactive use: indices are fine.
- Scripts / CI automation: always use the name.
This caveat is documented in registry.py's module docstring.
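The index shift is easy to reproduce with a small sketch of the sort-and-number step. The `finalize` helper below is an assumed simplification of _finalize(), and the bench names are illustrative rather than the registered ones.

```python
def finalize(pending):
    """Sort (name, description, fn) entries by name and assign 1-based indices."""
    names = [name for name, _desc, _fn in pending]
    dupes = sorted({n for n in names if names.count(n) > 1})
    if dupes:
        raise RuntimeError("duplicate bench name: " + ", ".join(dupes))
    ordered = sorted(pending, key=lambda entry: entry[0])
    return {name: idx for idx, (name, _d, _f) in enumerate(ordered, start=1)}


before = finalize([
    ("gemm-single-pe", "demo", None),
    ("qkv-gemm", "demo", None),
])
after = finalize([
    ("gemm-single-pe", "demo", None),
    ("qkv-gemm", "demo", None),
    ("matmul-composite", "demo", None),  # newly added bench
])

print(before)  # {'gemm-single-pe': 1, 'qkv-gemm': 2}
print(after)   # {'gemm-single-pe': 1, 'matmul-composite': 2, 'qkv-gemm': 3}
```

Adding one bench moved qkv-gemm from index 2 to index 3, which is why a script pinned to `--bench 2` would silently start running a different bench.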
D8. Surface RuntimeContext exposes to benches
A bench's torch parameter may legitimately use:
- Tensor creation: torch.empty(shape, dtype=..., dp=DPPolicy(...), name=...), torch.zeros(...), torch.from_numpy(arr). All submit host-side metadata plus device deployment (MmuMapMsg + MemoryWriteMsg).
- Kernel launch: torch.launch(kernel_name, kernel_fn, *args), which converts (Tensor, int, float) positional args to TensorArg/ScalarArg, submits per-SIP KernelLaunchMsg, and drains.
- Synchronization: torch.wait(handle), torch.wait_all() (run_bench calls the latter automatically).
- Distributed: torch.distributed.init_process_group(backend="ahbm"), torch.distributed.get_world_size(), torch.distributed.all_reduce(t, op=...) (ADR-0024/0027).
- Multi-process (rank = SIP): torch.multiprocessing.spawn(_worker, ..., nprocs=ws) (ADR-0024 D3 / ADR-0027).
- Device binding: torch.ahbm.set_device(rank) or torch.accelerator.set_device_index(rank) (both point to the same namespace).
- IPCQ install: torch.install_ipcq(algorithm=..., ccl_yaml=...) (ADR-0023 D10).
- Spec lookup: torch.spec, the dict produced by the topology builder (system / cube_mesh / HBM parameters etc.). Use it so the bench does not hardcode topology.yaml values.
Benches must not access RuntimeContext private members (_handles,
_traces, _allocators, etc.) directly. This aligns with ADR-0007's
layer-boundary spirit: bench → runtime API → sim_engine flows in one
direction.
D9. Environment-variable parameterization is allowed
Benches may parameterize themselves via os.environ.get(...), as
matmul_composite.py does for MATMUL_M, MATMUL_K, MATMUL_N,
MATMUL_DTYPE, MATMUL_VARIANT. Rationale:
- The bench function signature is fixed by D3 to def run(torch), so positional/keyword arguments cannot carry parameters.
- The env-var pattern is a natural hook for operational sweeps (e.g., MATMUL_VARIANT).
- External drivers such as scripts/gemm_sweep.py (ADR-0044) consume this hook (it sets MATMUL_M/K/N/VARIANT at scripts/gemm_sweep.py:115-118).
When environment variables alter bench behavior, the module docstring
must list every variable used (matmul_composite.py is the canonical
example).
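For illustration, reading the D9 variables could be done along these lines. The variable names come from the text above; the helper name, the `environ` parameter, and every default value are hypothetical and do not reflect what matmul_composite.py actually uses.

```python
"""Hypothetical bench-module docstring listing its variables, as D9 requires.

Environment variables:
    MATMUL_M, MATMUL_K, MATMUL_N   problem dimensions (integers)
    MATMUL_DTYPE                   element type name
    MATMUL_VARIANT                 kernel variant selector
"""
import os


def read_matmul_params(environ=None):
    """Collect sweep parameters from the environment (defaults are assumed)."""
    env = os.environ if environ is None else environ
    return {
        "M": int(env.get("MATMUL_M", "128")),
        "K": int(env.get("MATMUL_K", "128")),
        "N": int(env.get("MATMUL_N", "128")),
        "dtype": env.get("MATMUL_DTYPE", "fp16"),
        "variant": env.get("MATMUL_VARIANT", "default"),
    }
```

Accepting an optional mapping keeps the helper testable without mutating the process environment; an external sweep driver would instead set the variables before each `kernbench run` invocation.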
Alternatives Considered
A1. An explicit manifest file (YAML) listing benches
Rejected. The @bench + audit pattern guarantees "drop in file → auto-
register", concentrating cognitive cost in one place (the file itself).
A separate manifest is prone to drift, and helper separation is already
clear via the _ prefix.
A2. Allowing the bench's entry-point name in the decorator
(@bench(name=..., entry="run_xxx"))
Rejected. Breaks the simplicity of dispatch (spec.run is a single
callable). The run convention is sufficient; variants can register
multiple @bench-decorated functions in the same module.
A3. A separate @multi_device_bench decorator for CCL
Rejected. The two patterns named in D5 (single + ADR-0024 multi-SIP)
cover all 8 current benches. A separate decorator would force dispatch
to branch and add complexity; the multi-SIP intent is already obvious
from the bench's init_process_group(...) call.
A4. Make indices a stable API (registration order or explicit
index= argument)
Rejected. D7's trade-off favors user-friendliness — alphabetically
sorted 1-based indices read naturally in the list output. Scripts can
use names.
Consequences
- "How to add a bench" is consolidated in one ADR; new authors only need to read D1-D3 and D8 without grepping source.
- The _-prefixed helper-module pattern is legitimized at ADR level, so future benches/_*.py shared helpers are free to be added.
- The CLI's single-device convention and CCL's multi-SIP exception are shown to be consistent (D5): they are orthogonal.
- The rationale for ADR-0044's GEMM eval harness using env-var hooks (D9) is now ADR-pinned.
- Indices are explicitly unstable (D7), so any CI code calling kernbench run --bench 3 is flagged for review after this ADR is accepted.