mukesh cc1bbd0ab7 eval: fold GEMM/allreduce harnesses into self-contained milestone benches

Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/
into two self-contained eval benches so a user can regenerate every
result + figure with one command:

  kernbench run --bench milestone-1h-gemm   (MILESTONE_FAST=1 reuses JSON)
  kernbench run --bench milestone-1h-ccl

- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
  run(torch) entry drives the sweeps and writes figures into
  benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
  sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
  re-export/wrapper shims over the benches (single source preserved); the
  pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
  per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).

ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated. Verified: milestone benches run
clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-22 15:19:52 -07:00

ADR-0045: Bench Module Contract — registration, dispatch, and authoring

Status

Accepted (2026-05-21).

Unifies the src/kernbench/benches/ registration mechanism (@bench), the CLI dispatch path (kernbench run/list), and the contract a new bench module must follow. ADR-0010 (CLI surface) specifies the kernbench list/run interface, but how benches are registered and what signature they must follow had no ADR-level coverage.

Extended by ADR-0054: D5's single-config rule gains a third pattern — the eval bench (e.g. milestone-1h-*) drives many configs, builds its own per-config engines, and submits a sentinel tensor to satisfy D4.

First action

When kernbench.benches is imported, __init__.py immediately calls _eager_import_and_audit(__path__, __name__). Its first action is to enumerate every sibling module in the package directory via pkgutil.iter_modules(__path__) and eagerly import each one via importlib.import_module(...) — except modules matching either:

name registry (the infrastructure module itself), or
name starting with _ (helper modules).

At import time, each @bench(name=..., description=...) decorator inside the imported module runs, appending (name, description, fn) to _PENDING and adding fn.__module__ to _REGISTERED_MODULES.

Once imports finish, _audit_modules(imported, _REGISTERED_MODULES) runs; if any imported module did not invoke @bench at least once, it raises RuntimeError("Bench module(s) missing @bench decorator: ...") immediately. At this point indices are still unassigned — the first call to list_all() / resolve(...) triggers _finalize(), which sorts _PENDING alphabetically by name and assigns 1-based indices.

In short, the bench infrastructure's first act is "eagerly import every non-helper module in the package and audit that each one registered at least one bench".
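The import-then-audit sequence described above can be sketched roughly as follows. This is an illustration, not the code in registry.py: the function names drop the leading underscore, and the set of registered modules is passed in explicitly instead of being read from module-level state.

```python
import importlib
import pkgutil


def audit_modules(imported, registered_modules):
    """Raise if any imported module never called @bench."""
    missing = sorted(set(imported) - set(registered_modules))
    if missing:
        raise RuntimeError(
            "Bench module(s) missing @bench decorator: " + ", ".join(missing)
        )


def eager_import_and_audit(pkg_path, pkg_name, registered_modules):
    """Import every non-helper sibling module, then audit registration."""
    imported = []
    for info in pkgutil.iter_modules(pkg_path):
        # Skip the infrastructure module and underscore-prefixed helpers.
        if info.name == "registry" or info.name.startswith("_"):
            continue
        module = importlib.import_module(f"{pkg_name}.{info.name}")
        imported.append(module.__name__)
    # registered_modules is expected to have been filled as a side effect
    # of @bench decorators running during the imports above.
    audit_modules(imported, registered_modules)
    return imported
```

Note that the audit runs only after all imports finish, so a single RuntimeError can name every offending module at once.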

Context

src/kernbench/benches/ currently holds 8 bench modules (ccl_allreduce, gemm_single_pe, gpt3_qkv, ipcq_allreduce, matmul_composite, qkv_gemm, qkv_gemm_multi_pe, va_offset_verify). Every bench follows the same unified flow:

kernbench run --topology <T> --bench <N>
   ↓
cli/main.py::cmd_run
   ↓  resolve_topology(T)  + resolve(N)  + resolve_device(device_arg)
   ↓
runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
   ↓  engine_factory(topology, device) → GraphEngine
   ↓  RuntimeContext(engine, target_device, correlation_id, spec)
   ↓
bench_fn(ctx)        ← invokes the bench's run(torch)
   ↓  ctx.empty/zeros/from_numpy/launch/distributed.* etc. submit work
   ↓
ctx.wait_all()       ← drains any outstanding handles
   ↓
BenchResult(completion, correlation_id, trace, traces, engine)

ADR-0010 covers only the CLI surface (run/list/probe/web); ADR-0007 covers only the runtime API ↔ sim_engine boundary. The question "what shape must a new bench file take?" had to be answered by grepping the codebase. As a result:

The @bench decorator contract (kebab-case name, non-empty description) lived only in the source.
The bench function signature (def run(torch)) was a de-facto convention enforced by the CLI dispatcher calling spec.run.
New bench authors learned the "helpers must use _ prefix" rule only after seeing the audit's RuntimeError.
The single-device convention (CLAUDE.md Part 2 CLI Semantics) and its interaction with multi-SIP CCL benches was ambiguous for bench authors.

This ADR consolidates all of it in one place.

Decision

D1. @bench decorator contract

from kernbench.benches.registry import bench

@bench(name="my-bench", description="Short, complete-sentence description.")
def run(torch):
    ...

name: kebab-case string matching ^[a-z][a-z0-9]*(-[a-z0-9]+)*$. Lowercase letters, digits, and dashes only; underscores forbidden; must start with a letter.
description: non-empty string (stripped length > 0). Displayed verbatim by kernbench list.
The decorator returns the function unchanged — direct invocation is fine. Its only side effect is appending to _PENDING.

Violations of the first two rules raise ValueError at decoration time. Duplicate names are caught at _finalize() with RuntimeError("duplicate bench name: ...").

D2. Module-file convention

Every src/kernbench/benches/<slug>.py must be one of:

A bench module: at top-level import, @bench(...) runs at least once to register at least one bench.
A helper module: the filename starts with _ (e.g., _shared_helpers.py). iter_modules skips it.

The audit (_audit_modules) rejects any non-helper that fails to call @bench. Intended consequence: dropping a new file into benches/ automatically registers its benches, and helper modules are clearly flagged by their filename prefix alone.

D3. The bench function signature is `def run(torch)`

The decorator does not enforce a function name, but CLI dispatch calls spec_entry.run (the decorated callable). The convention is therefore:

Function name: run. Other names work, but always use run for readability and grep-ability.
Argument: a single positional torch. In practice this is a RuntimeContext instance exposing PyTorch-style namespaces (zeros/empty/launch/distributed/...) — see ADR-0024 D3.
Return value: unconstrained (Any). run_bench ignores it and tracks completion via ctx.handles() / engine.get_completion().

The torch name imitates a PyTorch-compatible idiom; the actual PyTorch module is not passed in (aligned with ADR-0024's "rank = SIP" launcher convention).

D4. A bench must submit at least once

If ctx.handles() is empty after the bench returns, run_bench reports a BenchResult whose completion has ok=False and error_code="NO_REQUESTS". A meaningful bench must therefore invoke at least one of:

Tensor-creation APIs: torch.zeros(...), torch.empty(...) — these internally submit MmuMapMsg and (for zeros) MemoryWriteMsg.
Kernel-launch API: torch.launch(name, fn, *args) — submits per-SIP KernelLaunchMsg.
Exception: empty placeholder benches, e.g., ipcq_allreduce.py's print(...)-only stub, submit nothing and therefore receive a NO_REQUESTS result. CI is expected to recognize placeholder benches and handle them specially.
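The submit-at-least-once rule can be illustrated with a toy runner. Everything here is a hypothetical stand-in: `FakeContext` is not RuntimeContext, `run_bench_sketch` is not run_bench, and the completion is modelled as a plain dict rather than the real BenchResult type.

```python
from dataclasses import dataclass, field


@dataclass
class FakeContext:
    """Hypothetical stand-in for RuntimeContext; records submitted work."""
    _submitted: list = field(default_factory=list)

    def zeros(self, shape):
        # The real API submits device messages; here we only record the call.
        self._submitted.append(("zeros", shape))
        return shape

    def handles(self):
        return list(self._submitted)


def run_bench_sketch(bench_fn):
    ctx = FakeContext()
    bench_fn(ctx)  # the return value is ignored, as in D3
    if not ctx.handles():
        return {"ok": False, "error_code": "NO_REQUESTS"}
    return {"ok": True, "error_code": None}


def placeholder(torch):
    print("not implemented yet")  # submits nothing


def real_bench(torch):
    torch.zeros((4, 4))  # one submission is enough to satisfy D4
```

Calling `run_bench_sketch(placeholder)` yields the NO_REQUESTS completion, while `run_bench_sketch(real_bench)` yields `ok=True`.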

D5. Single-device convention + multi-SIP exception (ADR-0024/0027)

CLAUDE.md Part 2 CLI Semantics' "benchmarks MUST remain single-device" rule is interpreted as follows:

Standard bench (single-SIP use): define tensor placement with dp = DPPolicy(...) and launch with torch.launch(...). The SIP index is chosen by --device (CLI's responsibility).
CCL bench (multi-SIP use): as an exception, use torch.distributed.init_process_group(backend="ahbm") plus torch.multiprocessing.spawn(_worker, ..., nprocs=ws) for the rank = SIP pattern (ADR-0024 D3). --device is ignored (or treated as all); each spawned worker calls torch.ahbm.set_device(rank) to bind to its SIP.

Multi-device patterns outside these two (e.g., one bench function launching across multiple SIPs in the same process) are forbidden by this ADR. Even with --device all, the bench runs once; multi-SIP use inside that single run must follow D5's second pattern.

D6. Name/index resolution (`resolve`)

resolve(identifier: str) returns a BenchSpec via:

If identifier.isdigit(): convert to int and find the spec where index == that value. If none, ValueError("No bench with index ...").
If identifier in _REGISTRY: direct lookup.
Otherwise: ValueError("Unknown bench ...").

Empty or whitespace-only identifiers raise ValueError("bench identifier must be a non-empty string.").
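The lookup order above can be sketched as follows. This is an assumption-laden illustration: the real resolve takes only the identifier and reads the module's own registry, whereas this version receives the registry as an argument, and `BenchSpec` is reduced to the fields needed for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchSpec:
    """Reduced, illustrative spec; the real one also carries the callable."""
    index: int
    name: str
    description: str


def resolve(identifier, registry):
    # 1. Reject empty or whitespace-only identifiers up front.
    if not isinstance(identifier, str) or not identifier.strip():
        raise ValueError("bench identifier must be a non-empty string.")
    # 2. All-digit identifiers are treated as 1-based indices.
    if identifier.isdigit():
        wanted = int(identifier)
        for spec in registry.values():
            if spec.index == wanted:
                return spec
        raise ValueError(f"No bench with index {wanted}")
    # 3. Otherwise try a direct name lookup.
    if identifier in registry:
        return registry[identifier]
    raise ValueError(f"Unknown bench {identifier!r}")


# Illustrative registry; the indices here are made up for the example.
demo = {
    "ccl-allreduce": BenchSpec(1, "ccl-allreduce", "Demo."),
    "gemm-single-pe": BenchSpec(2, "gemm-single-pe", "Demo."),
}
```

With this registry, `resolve("2", demo)` and `resolve("gemm-single-pe", demo)` return the same spec.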

The CLI passes --bench directly to resolve, so users can use either kernbench run --bench gemm-single-pe or kernbench run --bench 2.

D7. Indices are not a stable API

_finalize() sorts _PENDING alphabetically by name and assigns 1-based indices. Adding a new bench can shift existing benches' indices. Therefore:

Human-interactive use: indices are fine.
Scripts / CI automation: always use the name.

This caveat is documented in registry.py's module docstring.
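A small sketch makes the index shift concrete. The `finalize` function below is a simplified stand-in for _finalize() that works on (name, description) pairs, and the bench names are used only as illustrative inputs.

```python
def finalize(pending):
    """Sort (name, description) pairs by name and assign 1-based indices."""
    names = [name for name, _ in pending]
    dupes = sorted({n for n in names if names.count(n) > 1})
    if dupes:
        raise RuntimeError("duplicate bench name: " + ", ".join(dupes))
    ordered = sorted(pending, key=lambda item: item[0])
    return {name: index for index, (name, _) in enumerate(ordered, start=1)}


before = finalize([("qkv-gemm", "d"), ("gemm-single-pe", "d")])
after = finalize(
    [("qkv-gemm", "d"), ("gemm-single-pe", "d"), ("gpt3-qkv", "d")]
)
print(before["qkv-gemm"], after["qkv-gemm"])  # prints: 2 3
```

Adding one bench whose name sorts earlier moves `qkv-gemm` from index 2 to index 3, which is why automation should pass names rather than indices.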

D8. Surface RuntimeContext exposes to benches

A bench's torch parameter may legitimately use:

Tensor creation: torch.empty(shape, dtype=..., dp=DPPolicy(...), name=...), torch.zeros(...), torch.from_numpy(arr). All submit host-side metadata plus device deployment (MmuMapMsg + MemoryWriteMsg).
Kernel launch: torch.launch(kernel_name, kernel_fn, *args) — converts (Tensor, int, float) positional args to TensorArg / ScalarArg, submits per-SIP KernelLaunchMsg, and drains.
Synchronization: torch.wait(handle), torch.wait_all() (run_bench calls the latter automatically).
Distributed: torch.distributed.init_process_group(backend="ahbm"), torch.distributed.get_world_size(), torch.distributed.all_reduce(t, op=...) (ADR-0024/0027).
Multi-process (rank = SIP): torch.multiprocessing.spawn(_worker, ..., nprocs=ws) (ADR-0024 D3 / ADR-0027).
Device binding: torch.ahbm.set_device(rank) or torch.accelerator.set_device_index(rank) (both point to the same namespace).
IPCQ install: torch.install_ipcq(algorithm=..., ccl_yaml=...) (ADR-0023 D10).
Spec lookup: torch.spec — the dict produced by the topology builder (system / cube_mesh / HBM parameters etc.). Use it so the bench does not hardcode topology.yaml values.

Benches must not access RuntimeContext private members (_handles, _traces, _allocators, etc.) directly. This aligns with ADR-0007's layer-boundary spirit: bench → runtime API → sim_engine flows in one direction.

D9. Environment-variable parameterization is allowed

Benches may parameterize themselves via os.environ.get(...), as matmul_composite.py does for MATMUL_M, MATMUL_K, MATMUL_N, MATMUL_DTYPE, MATMUL_VARIANT. Rationale:

The bench function signature is fixed by D3 to def run(torch), so positional/keyword arguments cannot carry parameters.
The env-var pattern is a natural hook for operational sweeps (e.g., MATMUL_VARIANT).
External drivers such as scripts/gemm_sweep.py (ADR-0044) consume this hook (it sets MATMUL_M/K/N/VARIANT at scripts/gemm_sweep.py:115-118).

When environment variables alter bench behavior, the module docstring must list every variable used (matmul_composite.py is the canonical example).
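As a rough sketch of the pattern, a bench might read its parameters as shown below. The variable names come from D9, but the helper name `read_matmul_params` and every default value are assumptions made for illustration; matmul_composite.py's actual defaults are not stated here.

```python
import os


def read_matmul_params(env=None):
    """Read MATMUL_* settings from the environment.

    All default values below are hypothetical placeholders.
    """
    env = os.environ if env is None else env
    return {
        "m": int(env.get("MATMUL_M", "128")),
        "k": int(env.get("MATMUL_K", "128")),
        "n": int(env.get("MATMUL_N", "128")),
        "dtype": env.get("MATMUL_DTYPE", "fp16"),
        "variant": env.get("MATMUL_VARIANT", "baseline"),
    }
```

Accepting an optional mapping keeps the helper testable without mutating the process environment, while a bench's run(torch) would simply call it with no arguments.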

Alternatives Considered

A1. An explicit manifest file (YAML) listing benches

Rejected. The @bench + audit pattern guarantees "drop in file → auto-register", concentrating cognitive cost in one place (the file itself). A separate manifest is prone to drift, and helper separation is already clear via the _ prefix.

A2. Allowing the bench's entry-point name in the decorator (@bench(name=..., entry="run_xxx"))

Rejected. Breaks the simplicity of dispatch (spec.run is a single callable). The run convention is sufficient; variants can register multiple @bench-decorated functions in the same module.

A3. A separate `@multi_device_bench` decorator for CCL

Rejected. The two patterns named in D5 (single + ADR-0024 multi-SIP) cover all 8 current benches. A separate decorator would force dispatch to branch and add complexity; the multi-SIP intent is already obvious from the bench's init_process_group(...) call.

A4. Make indices a stable API (registration order or explicit index= argument)

Rejected. D7's trade-off favors user-friendliness — alphabetically sorted 1-based indices read naturally in the list output. Scripts can use names.

Consequences

"How to add a bench" is consolidated in one ADR — new authors only need to read D1-D3 and D8 without grepping source.
The _-prefixed helper-module pattern is legitimized at ADR level, so future benches/_*.py shared helpers are free to be added.
The CLI's single-device convention and CCL's multi-SIP exception are shown to be consistent (D5) — they are orthogonal.
The rationale for ADR-0044's GEMM eval harness using env-var hooks (D9) is now ADR-pinned.
Indices are explicitly unstable (D7), so any CI code calling kernbench run --bench 3 is flagged for review after this ADR is accepted.
