cc1bbd0ab7
Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/
into two self-contained eval benches so a user can regenerate every
result + figure with one command:
kernbench run --bench milestone-1h-gemm (MILESTONE_FAST=1 reuses JSON)
kernbench run --bench milestone-1h-ccl
- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
run(torch) entry drives the sweeps and writes figures into
benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
re-export/wrapper shims over the benches (single source preserved); the
pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).
ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated. Verified: milestone benches run
clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# ADR-0045: Bench Module Contract — registration, dispatch, and authoring

## Status

Accepted (2026-05-21).

Unifies the `src/kernbench/benches/` registration mechanism (`@bench`), the
CLI dispatch path (`kernbench run/list`), and the contract a new bench
module must follow. ADR-0010 (CLI surface) specifies the
`kernbench list/run` interface, but **how benches are registered and what
signature they must follow** had no ADR-level coverage.

**Extended by ADR-0054**: D5's single-config rule gains a third pattern,
the *eval bench* (e.g. `milestone-1h-*`), which drives many configs, builds
its own per-config engines, and submits a sentinel tensor to satisfy D4.

## First action

When `kernbench.benches` is imported, `__init__.py` immediately calls
`_eager_import_and_audit(__path__, __name__)`. Its first action is to
enumerate every sibling module in the package directory via
`pkgutil.iter_modules(__path__)` and **eagerly import** each one via
`importlib.import_module(...)`, except modules matching either:

- name `registry` (the infrastructure module itself), or
- name starting with `_` (helper modules).

At import time, each `@bench(name=..., description=...)` decorator inside
the imported module runs, appending `(name, description, fn)` to
`_PENDING` and adding `fn.__module__` to `_REGISTERED_MODULES`.

Once imports finish, `_audit_modules(imported, _REGISTERED_MODULES)`
runs; if any imported module did not invoke `@bench` at least once, it
raises `RuntimeError("Bench module(s) missing @bench decorator: ...")`
immediately. At this point indices are still unassigned: the first call
to `list_all()` / `resolve(...)` triggers `_finalize()`, which sorts
`_PENDING` alphabetically by name and assigns 1-based indices.

In short, **the bench infrastructure's first act is "eagerly import
every non-helper module in the package and audit that each one
registered at least one bench"**.

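The import filter and the audit described above can be sketched in a few lines. This is a minimal illustration, not the code in `registry.py`: the helper names `should_import` and `audit_modules`, the short (unqualified) module names, and the sample sibling list are assumptions made so the sketch runs without a real package.

```python
"""Sketch of the eager-import filter and the post-import audit."""


def should_import(module_name):
    """Skip the infrastructure module and `_`-prefixed helper modules."""
    return module_name != "registry" and not module_name.startswith("_")


def audit_modules(imported, registered):
    """Raise if any imported module never invoked @bench."""
    missing = sorted(set(imported) - set(registered))
    if missing:
        raise RuntimeError(
            "Bench module(s) missing @bench decorator: " + ", ".join(missing)
        )


# Hypothetical sibling names, standing in for pkgutil.iter_modules(__path__).
siblings = ["registry", "_shared_helpers", "gemm_single_pe", "scratch_notes"]

# Only non-helper, non-infrastructure modules are candidates for import.
to_import = [name for name in siblings if should_import(name)]
```

Calling `audit_modules(to_import, {"gemm_single_pe"})` raises, because the hypothetical `scratch_notes` module was imported but never registered a bench; renaming it to `_scratch_notes` would take it out of `to_import` and out of the audit.
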
## Context

`src/kernbench/benches/` currently holds 8 bench modules (`ccl_allreduce`,
`gemm_single_pe`, `gpt3_qkv`, `ipcq_allreduce`, `matmul_composite`,
`qkv_gemm`, `qkv_gemm_multi_pe`, `va_offset_verify`). Every bench follows
the same unified flow:

```
kernbench run --topology <T> --bench <N>
    ↓
cli/main.py::cmd_run
    ↓ resolve_topology(T) + resolve(N) + resolve_device(device_arg)
    ↓
runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
    ↓ engine_factory(topology, device) → GraphEngine
    ↓ RuntimeContext(engine, target_device, correlation_id, spec)
    ↓
bench_fn(ctx)     ← invokes the bench's run(torch)
    ↓ ctx.empty/zeros/from_numpy/launch/distributed.* etc. submit work
    ↓
ctx.wait_all()    ← drains any outstanding handles
    ↓
BenchResult(completion, correlation_id, trace, traces, engine)
```

ADR-0010 covers only the CLI surface (`run/list/probe/web`); ADR-0007
covers only the runtime API ↔ sim_engine boundary. The question "what
shape must a new bench file take?" had to be answered by grepping the
codebase. As a result:

- The `@bench` decorator contract (kebab-case name, non-empty description)
  lived only in the source.
- The bench function signature (`def run(torch)`) was a de-facto
  convention enforced by the CLI dispatcher calling `spec.run`.
- New bench authors learned the "helpers must use `_` prefix" rule only
  after seeing the audit's `RuntimeError`.
- The single-device convention (CLAUDE.md Part 2 CLI Semantics) and its
  interaction with multi-SIP CCL benches were ambiguous for bench
  authors.

This ADR consolidates all of it in one place.

## Decision

### D1. @bench decorator contract

```python
from kernbench.benches.registry import bench

@bench(name="my-bench", description="Short, complete-sentence description.")
def run(torch):
    ...
```

- `name`: kebab-case string matching `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`.
  Lowercase letters, digits, and dashes only; underscores forbidden;
  must start with a letter.
- `description`: non-empty string (stripped length > 0). Displayed
  verbatim by `kernbench list`.
- The decorator **returns the function unchanged**, so direct invocation
  is fine. Its only side effects are appending to `_PENDING` and
  recording `fn.__module__` in `_REGISTERED_MODULES`.

Violations of the first two rules raise `ValueError` at decoration time.
Duplicate names are caught at `_finalize()` with
`RuntimeError("duplicate bench name: ...")`.

### D2. Module-file convention

Every `src/kernbench/benches/<slug>.py` must be one of:

- **A bench module**: at top-level import, `@bench(...)` runs at least
  once to register at least one bench.
- **A helper module**: the filename starts with `_` (e.g.,
  `_shared_helpers.py`). The eager-import loop skips it (see "First
  action").

The audit (`_audit_modules`) rejects any non-helper that fails to call
`@bench`. Intended consequence: dropping a new file into `benches/`
automatically registers its benches, and helper modules are clearly
flagged by their filename prefix alone.

### D3. The bench function signature is `def run(torch)`

The decorator does not enforce a function name, but **CLI dispatch calls
`spec_entry.run`** (the decorated callable). The convention is therefore:

- Function name: `run`. Other names work, but always use `run` for
  readability and grep-ability.
- Argument: a single positional `torch`. In practice this is a
  `RuntimeContext` instance exposing PyTorch-style namespaces
  (zeros/empty/launch/distributed/...); see ADR-0024 D3.
- Return value: any (`Any`). `run_bench` ignores it and tracks
  completion via `ctx.handles()` / `engine.get_completion()`.

The `torch` name imitates a PyTorch-compatible idiom; the actual PyTorch
module is not passed in (aligned with ADR-0024's "rank = SIP" launcher
convention).

### D4. A bench must submit at least once

If `ctx.handles()` is empty after the bench returns, `run_bench` reports
a `BenchResult.completion` with `ok=False` and
`error_code="NO_REQUESTS"`. So a meaningful bench must invoke at least
one of:

- Tensor-creation APIs: `torch.zeros(...)`, `torch.empty(...)`. These
  internally submit `MmuMapMsg` and (for zeros) `MemoryWriteMsg`.
- Kernel-launch API: `torch.launch(name, fn, *args)`, which submits
  per-SIP `KernelLaunchMsg`.

Exception: empty placeholder benches. For example, `ipcq_allreduce.py`'s
`print(...)`-only stub will receive a NO_REQUESTS result. CI is expected
to recognize and handle placeholder benches specially.

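The NO_REQUESTS rule can be illustrated with a stand-in context. Everything in this sketch is hypothetical scaffolding: `FakeContext`, `Completion`, and `check_submission` are not kernbench types, and the real `run_bench` derives `ok` from the engine's completion rather than from the handle count alone.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Completion:
    ok: bool
    error_code: Optional[str] = None


@dataclass
class FakeContext:
    """Hypothetical stand-in for RuntimeContext; it only records handles."""
    _submitted: List[str] = field(default_factory=list)

    def zeros(self, shape):
        handle = f"zeros{shape}"
        self._submitted.append(handle)
        return handle

    def handles(self):
        return list(self._submitted)


def check_submission(bench_fn):
    """Run the bench, ignore its return value, and apply the D4 rule."""
    ctx = FakeContext()
    bench_fn(ctx)
    if not ctx.handles():
        return Completion(ok=False, error_code="NO_REQUESTS")
    return Completion(ok=True)  # simplification: any submission counts as ok


def placeholder(torch):
    print("not implemented yet")  # submits nothing


def real_bench(torch):
    torch.zeros((4, 4))  # one submission is enough to pass the check
```

With these definitions, `check_submission(placeholder)` reports `NO_REQUESTS`, while `check_submission(real_bench)` passes.
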
### D5. Single-device convention + multi-SIP exception (ADR-0024/0027)

CLAUDE.md Part 2 CLI Semantics' **"benchmarks MUST remain
single-device"** rule is interpreted as follows:

- **Standard bench (single-SIP use)**: define tensor placement with
  `dp = DPPolicy(...)` and launch with `torch.launch(...)`. The SIP
  index is chosen by `--device` (CLI's responsibility).
- **CCL bench (multi-SIP use)**: as an exception, use
  `torch.distributed.init_process_group(backend="ahbm")` plus
  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` for the
  rank = SIP pattern (ADR-0024 D3). `--device` is ignored (or treated
  as `all`); each spawned worker calls `torch.ahbm.set_device(rank)` to
  bind to its SIP.

Multi-device patterns outside these two (e.g., one bench function
launching across multiple SIPs in the same process) are forbidden by
this ADR. Even with `--device all`, the bench runs once; multi-SIP use
inside that single run must follow D5's second pattern.

### D6. Name/index resolution (`resolve`)

`resolve(identifier: str)` returns a `BenchSpec` via:

1. If `identifier.isdigit()`: convert to int and find the spec where
   `index ==` that value. If none, `ValueError("No bench with index ...")`.
2. If `identifier in _REGISTRY`: direct lookup.
3. Otherwise: `ValueError("Unknown bench ...")`.

Empty or whitespace-only identifiers raise
`ValueError("bench identifier must be a non-empty string.")`.

The CLI passes `--bench` directly to `resolve`, so users can use either
`kernbench run --bench gemm-single-pe` or `kernbench run --bench 2`.

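The lookup order above can be sketched as follows. This is an illustration under assumptions rather than the real `resolve`: the `BenchSpec` field layout, the registry contents (`aaa-demo` and `zzz-demo` are invented names), and the exact text after each elided `...` in the error messages are all hypothetical.

```python
from typing import Callable, Dict, NamedTuple


class BenchSpec(NamedTuple):
    """Assumed shape of a registry entry."""
    index: int
    name: str
    description: str
    run: Callable


def _noop(torch):
    return None


# Hypothetical finalized registry, already sorted and 1-based.
_REGISTRY: Dict[str, BenchSpec] = {
    name: BenchSpec(i, name, f"{name} demo entry", _noop)
    for i, name in enumerate(["aaa-demo", "gemm-single-pe", "zzz-demo"], start=1)
}


def resolve(identifier: str) -> BenchSpec:
    if not isinstance(identifier, str) or not identifier.strip():
        raise ValueError("bench identifier must be a non-empty string.")
    if identifier.isdigit():  # step 1: numeric index
        wanted = int(identifier)
        for spec in _REGISTRY.values():
            if spec.index == wanted:
                return spec
        raise ValueError(f"No bench with index {wanted}")
    if identifier in _REGISTRY:  # step 2: exact name
        return _REGISTRY[identifier]
    raise ValueError(f"Unknown bench {identifier!r}")  # step 3
```

In this toy registry `resolve("2")` and `resolve("gemm-single-pe")` return the same entry, which is the equivalence the two CLI forms rely on.
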
### D7. Indices are not a stable API

`_finalize()` sorts `_PENDING` alphabetically by name and assigns
1-based indices. Adding a new bench can shift existing benches'
indices. Therefore:

- Human-interactive use: indices are fine.
- Scripts / CI automation: always use the name.

This caveat is documented in `registry.py`'s module docstring.

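A small sketch shows why an index cannot be treated as stable. The helper `assign_indices` and the bench names `qkv-gemm` and `ccl-demo` are assumptions for illustration; only the sort-then-number rule comes from D7.

```python
def assign_indices(names):
    """Sort names alphabetically and assign 1-based indices."""
    return {name: i for i, name in enumerate(sorted(names), start=1)}


# Registry before and after a new bench whose name sorts first is added.
before = assign_indices(["gemm-single-pe", "qkv-gemm"])
after = assign_indices(["gemm-single-pe", "qkv-gemm", "ccl-demo"])

# Every pre-existing bench whose index moved because of the addition.
shifted = {name: (before[name], after[name])
           for name in before if before[name] != after[name]}
```

Here both pre-existing benches move down by one position, so a script that pinned an index would silently start running a different bench.
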
### D8. Surface RuntimeContext exposes to benches

A bench's `torch` parameter may legitimately use:

- **Tensor creation**:
  `torch.empty(shape, dtype=..., dp=DPPolicy(...), name=...)`,
  `torch.zeros(...)`, `torch.from_numpy(arr)`. All submit
  host-side metadata plus device deployment (`MmuMapMsg` +
  `MemoryWriteMsg`).
- **Kernel launch**: `torch.launch(kernel_name, kernel_fn, *args)`, which
  converts `(Tensor, int, float)` positional args to `TensorArg` /
  `ScalarArg`, submits per-SIP `KernelLaunchMsg`, and drains.
- **Synchronization**: `torch.wait(handle)`, `torch.wait_all()`
  (`run_bench` calls the latter automatically).
- **Distributed**: `torch.distributed.init_process_group(backend="ahbm")`,
  `torch.distributed.get_world_size()`,
  `torch.distributed.all_reduce(t, op=...)` (ADR-0024/0027).
- **Multi-process (rank = SIP)**:
  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` (ADR-0024 D3 /
  ADR-0027).
- **Device binding**: `torch.ahbm.set_device(rank)` or
  `torch.accelerator.set_device_index(rank)` (both point to the same
  namespace).
- **IPCQ install**: `torch.install_ipcq(algorithm=..., ccl_yaml=...)`
  (ADR-0023 D10).
- **Spec lookup**: `torch.spec`, the dict produced by the topology
  builder (system / cube_mesh / HBM parameters etc.). Use it so the
  bench does not hardcode topology.yaml values.

Benches must not access RuntimeContext private members (`_handles`,
`_traces`, `_allocators`, etc.) directly. This aligns with ADR-0007's
layer-boundary spirit: bench → runtime API → sim_engine flows in one
direction.

### D9. Environment-variable parameterization is allowed

Benches may parameterize themselves via `os.environ.get(...)`, as
`matmul_composite.py` does for `MATMUL_M`, `MATMUL_K`, `MATMUL_N`,
`MATMUL_DTYPE`, `MATMUL_VARIANT`. Rationale:

- The bench function signature is fixed by D3 to `def run(torch)`, so
  positional/keyword arguments cannot carry parameters.
- The env-var pattern is a natural hook for operational sweeps (e.g.,
  `MATMUL_VARIANT`).
- External drivers such as `scripts/gemm_sweep.py` (ADR-0044) consume
  this hook (it sets `MATMUL_M/K/N/VARIANT` at
  `scripts/gemm_sweep.py:115-118`).

When environment variables alter bench behavior, the module docstring
must list every variable used (`matmul_composite.py` is the canonical
example).

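The pattern can be sketched as a small reader function with a docstring that lists every variable, as D9 requires. The function name `read_matmul_params`, the default values, and the injectable `env` mapping are assumptions for illustration; `matmul_composite.py` may structure this differently.

```python
"""Hypothetical bench-side parameter reader.

Environment variables:
  MATMUL_M, MATMUL_K, MATMUL_N   matrix dimensions (integers)
  MATMUL_DTYPE                   element type name
  MATMUL_VARIANT                 kernel variant label
"""
import os


def read_matmul_params(env=None):
    """Read sweep parameters from the environment, falling back to defaults."""
    env = os.environ if env is None else env
    return {
        # All defaults below are placeholders, not the real bench's values.
        "m": int(env.get("MATMUL_M", "128")),
        "k": int(env.get("MATMUL_K", "128")),
        "n": int(env.get("MATMUL_N", "128")),
        "dtype": env.get("MATMUL_DTYPE", "fp16"),
        "variant": env.get("MATMUL_VARIANT", "default"),
    }


# An external driver would set the variables before dispatching the bench;
# passing a dict here keeps the example free of global side effects.
params = read_matmul_params({"MATMUL_M": "256", "MATMUL_VARIANT": "tiled"})
```

Accepting an optional mapping keeps the reader testable without mutating `os.environ`, while a plain `read_matmul_params()` call inside `run(torch)` still picks up whatever a sweep driver exported.
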
## Alternatives Considered

### A1. An explicit manifest file (YAML) listing benches

Rejected. The `@bench` + audit pattern guarantees "drop in file →
auto-register", concentrating cognitive cost in one place (the file
itself). A separate manifest is prone to drift, and helper separation is
already clear via the `_` prefix.

### A2. Allowing the bench's entry-point name in the decorator (`@bench(name=..., entry="run_xxx")`)

Rejected. It would break the simplicity of dispatch (`spec.run` is a
single callable). The `run` convention is sufficient; variants can
register multiple `@bench`-decorated functions in the same module.

### A3. A separate `@multi_device_bench` decorator for CCL

Rejected. The two patterns named in D5 (single + ADR-0024 multi-SIP)
cover all 8 current benches. A separate decorator would force dispatch
to branch and add complexity; the multi-SIP intent is already obvious
from the bench's `init_process_group(...)` call.

### A4. Make indices a stable API (registration order or explicit `index=` argument)

Rejected. D7's trade-off favors user-friendliness: alphabetically
sorted 1-based indices read naturally in the `list` output. Scripts can
use names.

## Consequences

- "How to add a bench" is consolidated in one ADR; new authors only
  need to read D1-D3 and D8 without grepping source.
- The `_`-prefixed helper-module pattern is legitimized at ADR level,
  so future `benches/_*.py` shared helpers are free to be added.
- The CLI's single-device convention and CCL's multi-SIP exception are
  shown to be consistent (D5); they are orthogonal.
- The rationale for ADR-0044's GEMM eval harness using env-var hooks
  (D9) is now ADR-pinned.
- Indices are explicitly unstable (D7), so any CI code calling
  `kernbench run --bench 3` is flagged for review after this ADR is
  accepted.