kernbench2/docs/adr/ADR-0013-verification_strategy.md
2026-03-18 11:47:48 -07:00


ADR-0013: Verification Strategy and Phase 1 Test Plan

Status

Accepted

Context

KernBench is a system-level simulator whose correctness is defined by:

  • adherence to SPEC-defined invariants,
  • determinism and debuggability,
  • explicit modeling of routing and latency.

Because the implementation is still evolving, we need a stable verification strategy that prevents architectural drift while still allowing incremental development.

This ADR defines the Phase 1 verification plan and what constitutes "correct behavior" for early implementations.


Decision

D1. Verification is contract-based

Verification MUST be derived from:

  • SPEC requirements,
  • accepted ADRs.

Tests MUST validate architectural contracts, not incidental implementation details.


D2. Phase 1 verification scope

Phase 1 verification focuses on:

  • message contract validity (ADR-0012),
  • routing and fan-out semantics at the IO_CPU boundary (ADR-0009),
  • PA-first memory addressing and shard tagging (ADR-0011),
  • core latency and trace invariants (SPEC 0.1, R2).

Microarchitectural accuracy, bandwidth contention, and cycle-level behavior are explicitly out of scope in Phase 1.


D3. Required Phase 1 verification cases

The following verification cases MUST be supported by the implementation:

V1. Message schema validation

  • A KernelLaunch request in which any tensor shard lacks its (sip, cube, pe) placement MUST be rejected.
  • MemoryWrite/MemoryRead requests missing destination/source placement tags MUST be rejected.
  • Completion results MUST follow the ok / error_code / error_message contract.
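
The first and third V1 rules can be illustrated with a minimal sketch. The request layout (a `tensor_shards` list of dicts), the `Completion` class, the `validate_kernel_launch` helper, and the `MISSING_PLACEMENT` code are all hypothetical names chosen for illustration; the authoritative schema is the one defined in ADR-0012.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

PLACEMENT_KEYS = ("sip", "cube", "pe")  # placement tuple named in V1


@dataclass(frozen=True)
class Completion:
    """Hypothetical record following the ok / error_code / error_message contract."""
    ok: bool
    error_code: Optional[str] = None
    error_message: Optional[str] = None


def validate_kernel_launch(request: Mapping[str, Any]) -> Completion:
    """Reject a KernelLaunch if any tensor shard lacks a full (sip, cube, pe) placement."""
    for index, shard in enumerate(request.get("tensor_shards", [])):
        missing = [key for key in PLACEMENT_KEYS if key not in shard]
        if missing:
            return Completion(
                ok=False,
                error_code="MISSING_PLACEMENT",  # assumed code, not taken from ADR-0012
                error_message=f"shard {index} is missing {missing}",
            )
    return Completion(ok=True)


# The second shard has no "pe" tag, so the whole request is rejected.
rejected = validate_kernel_launch(
    {"tensor_shards": [{"sip": 0, "cube": 0, "pe": 0}, {"sip": 0, "cube": 0}]}
)
```

A test written this way asserts only on the completion contract, so it stays valid if the internal validation order changes.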

V2. IO_CPU fan-out and aggregation

Given:

  • a topology with one SIP, one CUBE, and two PEs,
  • a KernelLaunch request containing two tensor shards targeting different PEs,

The system MUST:

  • submit a single KernelLaunch to IO_CPU,
  • fan out work internally to both PEs,
  • aggregate the per-PE completions and return a single, deterministic completion to the host.
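
The V2 scenario could be exercised along the following lines. This is a toy model, not the KernBench implementation: `run_on_pe`, `io_cpu_launch`, and the dict-based shard and completion shapes are assumptions made for the sketch. It shows one way to make the aggregate independent of shard order.

```python
from typing import Dict, List, Tuple

Placement = Tuple[int, int, int]  # (sip, cube, pe)


def run_on_pe(placement: Placement, shard_name: str) -> Dict[str, object]:
    """Stand-in for per-PE execution; always succeeds in this sketch."""
    return {"placement": placement, "shard": shard_name, "ok": True}


def io_cpu_launch(shards: List[Dict[str, object]]) -> Dict[str, object]:
    """Toy IO_CPU: fan out one KernelLaunch to each target PE, then aggregate."""
    per_pe = [run_on_pe(tuple(s["placement"]), str(s["name"])) for s in shards]
    # Sort by placement so the aggregate does not depend on shard order.
    per_pe.sort(key=lambda result: result["placement"])
    return {
        "ok": all(result["ok"] for result in per_pe),
        "pes": [result["placement"] for result in per_pe],
    }


# One SIP, one CUBE, two PEs; two shards targeting different PEs.
shards = [
    {"name": "t0", "placement": (0, 0, 1)},
    {"name": "t1", "placement": (0, 0, 0)},
]
completion = io_cpu_launch(shards)
```

The host sees one call and one completion; the per-PE results never leave the IO_CPU model.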

V3. Latency and trace invariants

For any valid request:

  • the hop-by-hop trace MUST be non-empty,
  • total latency MUST be greater than zero,
  • repeated runs with identical inputs MUST produce identical traces.
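
The three V3 invariants translate directly into a reusable check. The sketch below assumes a trace is a list of `(node, latency)` hops and that a run can be repeated by calling a function; both are illustrative assumptions, as are the names `Hop`, `check_trace_invariants`, and `fake_run`.

```python
from typing import Callable, List, Tuple

Hop = Tuple[str, int]  # (node name, latency added at this hop); hypothetical shape


def check_trace_invariants(run: Callable[[], List[Hop]]) -> None:
    """Run the same request twice and assert the three V3 invariants."""
    first = run()
    second = run()
    assert len(first) > 0, "trace must be non-empty"
    assert sum(latency for _, latency in first) > 0, "total latency must be > 0"
    assert first == second, "identical inputs must yield identical traces"


def fake_run() -> List[Hop]:
    """Stand-in for a simulator run with fixed inputs."""
    return [("host", 0), ("io_cpu", 5), ("pe0", 12)]


check_trace_invariants(fake_run)
```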

V4. Topology independence and cross-domain coverage

Verification cases MUST pass for multiple topology shapes, including:

  • minimal: (1 SIP, 1 CUBE, 1 PE)
  • multi-PE: (1 SIP, 1 CUBE, N PEs)
  • multi-CUBE within a SIP: (1 SIP, M CUBEs, ≥1 PE per CUBE)
  • multi-SIP tray: (K SIPs, ≥1 CUBE per SIP, ≥1 PE per CUBE)

For multi-CUBE and multi-SIP topologies, Phase 1 verification focuses on:

  • explicit connectivity (required links exist),
  • deterministic routing and control-path traversal,
  • non-empty traces and latency > 0 for representative cross-domain requests (inter-CUBE and inter-SIP paths).
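
A topology fixture covering the four shapes might look like the sketch below. It assumes a simple host / io_cpu / SIP / CUBE / PE tree, which is a simplification chosen for illustration and not the link model defined by the SPEC. The values standing in for N, M, and K are arbitrary, and `build_topology`, `route`, and `check_shape` are hypothetical helpers.

```python
from collections import deque
from typing import Dict, List, Optional, Set

# Arbitrary small values standing in for N, M, and K; not taken from the SPEC.
SHAPES = {
    "minimal": (1, 1, 1),
    "multi_pe": (1, 1, 4),
    "multi_cube": (1, 2, 1),
    "multi_sip": (2, 1, 1),
}


def build_topology(sips: int, cubes: int, pes: int) -> Dict[str, Set[str]]:
    """Build an adjacency map for an assumed host / io_cpu / SIP / CUBE / PE tree."""
    links: Dict[str, Set[str]] = {}

    def connect(a: str, b: str) -> None:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)

    connect("host", "io_cpu")
    for s in range(sips):
        connect("io_cpu", f"sip{s}")
        for c in range(cubes):
            connect(f"sip{s}", f"sip{s}.cube{c}")
            for p in range(pes):
                connect(f"sip{s}.cube{c}", f"sip{s}.cube{c}.pe{p}")
    return links


def route(links: Dict[str, Set[str]], src: str, dst: str) -> List[str]:
    """Breadth-first route; neighbours are visited in sorted order for determinism."""
    parents: Dict[str, Optional[str]] = {src: None}
    queue = deque([src])
    while queue:
        node: Optional[str] = queue.popleft()
        if node == dst:
            path: List[str] = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in sorted(links[node]):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    raise LookupError(f"no route from {src} to {dst}")


def check_shape(sips: int, cubes: int, pes: int) -> None:
    """Check that every PE is reachable from the host by a stable route."""
    links = build_topology(sips, cubes, pes)
    for node in sorted(links):
        if ".pe" in node:
            first = route(links, "host", node)
            assert len(first) > 1, "trace must be non-empty"
            assert first == route(links, "host", node), "routing must be deterministic"


for shape in SHAPES.values():
    check_shape(*shape)
```

A missing link surfaces as a `LookupError`, which keeps the connectivity check explicit instead of letting it fail later as an empty trace.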

D4. Phase 1 artifacts

Phase 1 MAY include:

  • verification-only test code,
  • topology fixtures,
  • trace inspection utilities.

Phase 1 MUST NOT require:

  • production code changes solely to satisfy tests,
  • weakening or removing tests to allow progress.

D5. Phase 2 enforcement

Phase 2 (Apply) MUST:

  • run the Phase 1 verification cases,
  • roll back all changes if any verification fails,
  • preserve tests as authoritative contracts.
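
The apply-then-verify rule can be sketched with an in-memory stand-in. Here the "changes" are mutations of a dict and the rollback restores a deep copy; in practice the rollback mechanism would be whatever the Apply phase uses, which this ADR does not specify. `apply_with_verification` and the example checks are hypothetical.

```python
import copy
from typing import Callable, Dict, Iterable

State = Dict[str, object]


def apply_with_verification(
    state: State,
    change: Callable[[State], None],
    checks: Iterable[Callable[[State], bool]],
) -> bool:
    """Apply `change`, run every check, and restore the snapshot on any failure."""
    snapshot = copy.deepcopy(state)
    change(state)
    results = [check(state) for check in checks]  # run all checks, no short-circuit
    if all(results):
        return True
    state.clear()
    state.update(snapshot)  # roll back all changes
    return False


def latency_positive(state: State) -> bool:
    """Example check standing in for a Phase 1 verification case."""
    return state["latency"] > 0


def break_latency(state: State) -> None:
    """Example change that violates the check."""
    state["latency"] = 0


model: State = {"latency": 5}
applied = apply_with_verification(model, break_latency, [latency_positive])
```

The checks are passed in unchanged, which mirrors the rule that tests are preserved as the contract and only the change is discarded.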

Consequences

  • Architectural correctness is enforced early.
  • Tests serve as executable documentation of system behavior.
  • Implementation remains flexible without losing rigor.

References

  • SPEC 0.1, R2, R6
  • ADR-0011 (PA-first memory addressing)
  • ADR-0012 (Host ↔ IO_CPU message schema)
  • ADR-0009 (Kernel execution semantics)