ywkang 049e3d8bb3 benches: package as kernbench.benches, add @bench registry + list subcommand

Move benches/ -> src/kernbench/benches/ and src/kernbench/cli/probe.py ->
src/kernbench/probes/probe.py. Each bench self-registers via
@bench(name=..., description=...); kernbench list enumerates benches
with auto-assigned indices, --bench accepts kebab-case name or numeric
index. Audit at package-import time fails if any non-underscore module
forgets the decorator. ADR-0010 (EN + KO) updated to reflect the new
resolver path, list subcommand, and probes package separation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 14:42:10 -07:00

ADR-0010: Command Line Interface and Execution Semantics

Status

Accepted

Context

The kernbench CLI is the user-facing entry point of the simulator. It exposes four subcommands:

run — execute a benchmark against a topology.
list — enumerate registered benches.
probe — diagnostic utility for latency / BW measurement.
web — interactive topology viewer.

Device enumeration is centralized in the CLI; neither the runtime API nor the simulation engine enumerates devices. Benchmarks remain single-device by design and accept a device identifier as input.

Decision

D1. Benchmark contract — single-device by design

A benchmark MUST define behavior for a single device only.
A benchmark MUST accept a device identifier as input.
Benchmarks MUST NOT enumerate or loop over multiple devices.

Multi-device execution is the CLI's concern (D3), not the benchmark's.
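A minimal sketch of the D1 contract, under stated assumptions: the ADR does not specify the benchmark signature or the device identifier type, so `DeviceId` and `run_gemm_single_pe` below are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceId:
    """Hypothetical device identifier; the real type is not given in this ADR."""
    kind: str   # e.g. "sip"
    index: int  # e.g. 0 for sip:0


def run_gemm_single_pe(device: DeviceId) -> dict:
    """Hypothetical benchmark body that describes work for exactly one device."""
    # The benchmark refers only to the device it was handed. It never
    # discovers or loops over other devices; that is the CLI's job (D3).
    return {"device": f"{device.kind}:{device.index}", "op": "gemm"}
```

Calling `run_gemm_single_pe(DeviceId("sip", 0))` describes work for `sip:0` only; running the same benchmark on `sip:1` is a second call made by the caller, not a loop inside the benchmark.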

D2. `kernbench run` — benchmark execution

Required arguments:

--topology <path>: topology YAML file path. Loaded via resolve_topology().
--bench <identifier>: benchmark identifier. Resolved via kernbench.benches.registry.resolve(), which accepts either the registered kebab-case name (e.g., gemm-single-pe) or a numeric index from kernbench list.

Optional arguments:

--device <selector> (default: all):
- all — run once per discovered SIP (see D3).
- sip:<N> — run only on SIP N.
- Parsed via resolve_device().
--verify-data (default: off) — enable Phase 2 data verification (see ADR-0020). When set, engine_factory constructs the engine with enable_data=True. After the benchmark runs, a diagnostic summary of recorded ops is printed.

Each invocation uses a single simulation instance. Within that instance, the benchmark runs once per selected device (see D3).
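The argument surface above could be sketched with `argparse` as follows. This is an illustration, not the kernbench implementation: `build_run_parser` and `parse_device_selector` are hypothetical names, and `parse_device_selector` is only a stand-in for the behaviour attributed to resolve_device(), whose real signature this ADR does not give.

```python
import argparse


def parse_device_selector(selector: str, sip_count: int) -> list[int]:
    """Hypothetical stand-in for resolve_device(): return SIP indices to run on."""
    if selector == "all":
        return list(range(sip_count))
    prefix, _, value = selector.partition(":")
    # Accept only "sip:<N>" where N is a plain ASCII non-negative integer.
    if prefix != "sip" or not value.isascii() or not value.isdigit():
        raise ValueError(f"invalid device selector: {selector!r}")
    index = int(value)
    if index >= sip_count:
        raise ValueError(f"sip:{index} is out of range for {sip_count} SIPs")
    return [index]


def build_run_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the arguments listed in D2."""
    parser = argparse.ArgumentParser(prog="kernbench run")
    parser.add_argument("--topology", required=True, help="topology YAML path")
    parser.add_argument("--bench", required=True, help="bench name or index")
    parser.add_argument("--device", default="all", help="all or sip:<N>")
    parser.add_argument("--verify-data", action="store_true",
                        help="enable Phase 2 data verification")
    return parser
```

With this sketch, omitting `--device` yields the selector `all`, which expands to every SIP index in the topology, while `sip:1` expands to the single index 1.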

D3. Multi-device execution is logically parallel

When --device is all (or is omitted) and the topology has multiple SIPs:

Benchmark executions are submitted to a single simulation engine instance.
Executions are logically parallel in simulation time.
Inter-device contention is naturally modeled (shared fabric bandwidth, cross-SIP traffic, etc.).

The CLI does NOT spawn multiple OS processes or independent simulation runs — parallelism is internal to one simulation instance.
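To illustrate why submitting every execution to one engine models contention, here is a toy sketch. It is not the kernbench engine: `ToyEngine` is a hypothetical class, and the equal-share fabric model is an assumption chosen for simplicity; the ADR does not describe the actual contention model.

```python
class ToyEngine:
    """Toy stand-in for a simulation engine with one shared fabric link."""

    def __init__(self, fabric_bytes_per_ns: float):
        self.fabric_bytes_per_ns = fabric_bytes_per_ns
        self.pending = {}  # sip_index -> bytes to move, all starting at time 0

    def submit(self, sip_index: int, nbytes: int) -> None:
        self.pending[sip_index] = nbytes

    def run(self) -> dict:
        """Return {sip_index: finish_time_ns}, sharing the fabric equally."""
        remaining = dict(self.pending)
        now = 0.0
        finish = {}
        while remaining:
            # Every active transfer gets an equal share of the bandwidth.
            share = self.fabric_bytes_per_ns / len(remaining)
            # Advance simulated time until the smallest transfer completes.
            step = min(remaining.values())
            now += step / share
            for sip in list(remaining):
                remaining[sip] -= step
                if remaining[sip] == 0:
                    finish[sip] = now
                    del remaining[sip]
        return finish


def simulate(sip_indices, nbytes, fabric_bytes_per_ns):
    """CLI-side fan-out: one engine instance, one submission per SIP."""
    engine = ToyEngine(fabric_bytes_per_ns)
    for sip in sip_indices:
        engine.submit(sip, nbytes)
    return engine.run()
```

In this toy model, a 100-byte transfer on a 10 bytes/ns fabric finishes at 10 ns when one SIP runs alone and at 20 ns when two SIPs run together. Two independent simulation runs would each report 10 ns and miss the contention.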

D4. `kernbench list` — enumerate registered benches

No arguments. Prints each registered bench's auto-assigned index, registered name, and one-line description.

Benches register themselves via the @bench(name=..., description=...) decorator (kernbench.benches.registry). Every module in the kernbench.benches package whose name does not start with an underscore MUST register at least one bench; a module that registers none causes a RuntimeError at package import time.

Indices are assigned alphabetically by name at import time. They are a CLI convenience (shorthand for --bench), not a stable API — a new bench inserted alphabetically will shift later indices.
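A registry with these properties could look roughly like the sketch below. It is a re-creation for illustration, not the contents of kernbench.benches.registry: the internal layout, the `listing` helper, the zero-based indices, and every bench name other than gemm-single-pe are assumptions.

```python
_REGISTRY = {}  # name -> (description, function); hypothetical layout


def bench(*, name: str, description: str):
    """Hypothetical re-creation of the @bench decorator described above."""
    def decorator(fn):
        if name in _REGISTRY:
            raise ValueError(f"duplicate bench name: {name}")
        _REGISTRY[name] = (description, fn)
        return fn
    return decorator


def listing():
    """Return (index, name, description) rows, indexed alphabetically by name."""
    return [(i, n, _REGISTRY[n][0]) for i, n in enumerate(sorted(_REGISTRY))]


def resolve(identifier: str):
    """Accept a registered name or a numeric index taken from listing()."""
    if identifier.isdigit():
        names = sorted(_REGISTRY)
        index = int(identifier)
        if index >= len(names):
            raise KeyError(f"no bench with index {index}")
        return _REGISTRY[names[index]][1]
    if identifier not in _REGISTRY:
        raise KeyError(f"no bench named {identifier!r}")
    return _REGISTRY[identifier][1]


@bench(name="gemm-single-pe", description="GEMM on a single PE")
def gemm_single_pe(device):
    return ("gemm", device)


@bench(name="vector-add", description="hypothetical example bench")
def vector_add(device):
    return ("vadd", device)
```

Because indices come from the sorted names, registering a bench called `copy-hbm` later would take index 0 and push gemm-single-pe from 0 to 1, which is why scripts should prefer names over indices.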

D5. `kernbench probe` — latency / BW diagnostic utility

Required argument:

--topology <path>: topology YAML file path.

Optional argument:

--case <name> (default: all) — run a predefined traffic pattern, or all to run every defined case.

Probe runs each pattern through the simulation engine and reports per case:

End-to-end latency (ns).
Effective bandwidth (nbytes / total_ns).
Bottleneck bandwidth (min edge BW along the chosen path).
Utilization (effective / bottleneck).

Probe additionally validates monotonicity invariants — for example that local-HBM access ≤ cross-PE-within-cube ≤ cross-cube ≤ cross-SIP — and reports violations. Probe is a developer tool for verifying the latency / BW model; it is not a benchmark.
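The reported quantities and the monotonicity check can be sketched as below. The function names and case names are hypothetical, and the sketch assumes the invariant is evaluated on end-to-end latency; with sizes in bytes and times in ns, bandwidth comes out in bytes/ns (numerically equal to GB/s).

```python
def probe_metrics(nbytes: int, total_ns: float, edge_bw_along_path: list) -> dict:
    """Compute the per-case figures listed above for one traffic pattern."""
    effective = nbytes / total_ns          # bytes per ns
    bottleneck = min(edge_bw_along_path)   # slowest edge on the chosen path
    return {
        "latency_ns": total_ns,
        "effective_bw": effective,
        "bottleneck_bw": bottleneck,
        "utilization": effective / bottleneck,
    }


def monotonicity_violations(latency_ns_by_case: dict, order: list) -> list:
    """Return adjacent (nearer, farther) pairs where the nearer case is slower."""
    violations = []
    for near, far in zip(order, order[1:]):
        if latency_ns_by_case[near] > latency_ns_by_case[far]:
            violations.append((near, far))
    return violations
```

For example, moving 4096 bytes in 256 ns over a path whose slowest edge offers 32 bytes/ns gives an effective bandwidth of 16 bytes/ns and a utilization of 0.5. An empty list from `monotonicity_violations` means the measured cases respect the expected ordering.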

D6. `kernbench web` — topology viewer

Optional arguments:

--port <N> (default: 8765) — HTTP port.
--no-open — do not auto-open the browser.

Launches a local HTTP server that renders the compiled topology in the browser. Distinct from the static docs/diagrams/ artifacts:

docs/diagrams/ files are derived at topology-compile time (ADR-0006).
kernbench web is interactive — pan/zoom, hover for component attributes, switch between SIP / CUBE / PE views.

D7. Runtime API and simulation engine remain device-scoped

Runtime API calls operate on one device per invocation.
The simulation engine schedules all requests deterministically.
Neither layer enumerates devices.

This invariant keeps each layer testable in isolation; device enumeration and multi-device fan-out live only in the CLI's run command (D3).

The probe implementation lives under kernbench.probes (separate from kernbench.benches), reflecting that probes are diagnostic utilities, not registered benches.

Consequences

Benchmark authors write single-device logic; multi-device behavior emerges from the CLI dispatching across SIPs.
Adding a new subcommand (e.g., trace export, replay) does not require benchmark or runtime-API changes — the CLI is the extension point.
probe and web are diagnostic / visualization tools, not benchmarks; they bypass the benchmark loader path.
