kernbench2/docs/adr/ADR-0013-ver-verification_strategy.md
ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
2026-05-20 01:15:55 -07:00

ADR-0013: Verification Strategy and Phase 1 Test Plan

Status

Accepted

Context

KernBench is a system-level simulator whose correctness is defined by:

  • adherence to SPEC-defined invariants,
  • determinism and debuggability,
  • explicit modeling of routing and latency.

Given the evolving implementation, we need a stable verification strategy that prevents architectural drift while allowing incremental development.

This ADR defines the Phase 1 verification plan and what constitutes "correct behavior" for early implementations.


Decision

D1. Verification is contract-based

Verification MUST be derived from:

  • SPEC requirements,
  • accepted ADRs.

Tests MUST validate architectural contracts, not incidental implementation details.


D2. Phase 1 verification scope

Phase 1 verification focuses on:

  • message contract validity (ADR-0012),
  • routing and fan-out semantics at the IO_CPU boundary (ADR-0009),
  • PA-first memory addressing and shard tagging (ADR-0011),
  • core latency and trace invariants (SPEC 0.1, R2).

Microarchitectural accuracy, bandwidth contention, and cycle-level behavior are explicitly out of scope in Phase 1.


D3. Required Phase 1 verification cases

The following verification cases MUST be supported by the implementation:

V1. Message schema validation

  • KernelLaunch requests in which any tensor shard lacks its (sip, cube, pe) placement tag MUST be rejected.
  • MemoryWrite/MemoryRead requests missing destination/source placement tags MUST be rejected.
  • Completion results MUST follow the ok / error_code / error_message contract.
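
A minimal sketch of the V1 rejection rule, assuming a dictionary-shaped request. The names Completion, validate_kernel_launch, the "shards" key, and the MISSING_PLACEMENT code are hypothetical and chosen only for illustration; the authoritative schema is the one defined in ADR-0012.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Completion:
    """Hypothetical completion record following ok / error_code / error_message."""
    ok: bool
    error_code: Optional[str] = None
    error_message: Optional[str] = None


PLACEMENT_KEYS = ("sip", "cube", "pe")


def validate_kernel_launch(request: dict) -> Completion:
    """Reject the launch if any tensor shard lacks a full (sip, cube, pe) tag."""
    for index, shard in enumerate(request.get("shards", [])):
        missing = [key for key in PLACEMENT_KEYS if shard.get(key) is None]
        if missing:
            return Completion(
                ok=False,
                error_code="MISSING_PLACEMENT",  # assumed code name, not from ADR-0012
                error_message=f"shard {index} is missing {', '.join(missing)}",
            )
    return Completion(ok=True)


# Two illustrative requests: the second shard of `bad` has no "pe" tag.
good = {"shards": [{"sip": 0, "cube": 0, "pe": 0}, {"sip": 0, "cube": 0, "pe": 1}]}
bad = {"shards": [{"sip": 0, "cube": 0, "pe": 0}, {"sip": 0, "cube": 0}]}

print(validate_kernel_launch(good))
print(validate_kernel_launch(bad))
```

A test written against this shape checks only the contract (rejection plus a structured completion), not how the validator is implemented, which is the intent of D1.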

V2. IO_CPU fan-out and aggregation

Given:

  • a topology with one SIP, one CUBE, and two PEs,
  • a KernelLaunch request containing two tensor shards targeting different PEs,

The system MUST:

  • submit a single KernelLaunch to IO_CPU,
  • fan out work internally to both PEs,
  • aggregate completion and return a single deterministic completion to the host.
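
The V2 scenario can be sketched with a toy stand-in for IO_CPU. ToyIoCpu, Shard, and the completion dictionary are hypothetical; the sketch only illustrates the observable contract (one submission, two internal dispatches, one aggregated completion), not the simulator's real interfaces.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Shard:
    """Hypothetical tensor shard carrying only its placement tag."""
    sip: int
    cube: int
    pe: int


class ToyIoCpu:
    """Toy stand-in for IO_CPU: one request in, one completion out."""

    def __init__(self) -> None:
        self.submissions = 0
        self.dispatched: List[Tuple[int, int, int]] = []

    def submit_kernel_launch(self, shards: List[Shard]) -> Dict:
        self.submissions += 1
        # Sort the targets so the result does not depend on shard order.
        targets = sorted({(s.sip, s.cube, s.pe) for s in shards})
        results = [self._run_on_pe(target) for target in targets]
        # Aggregate the per-PE results into a single completion.
        return {"ok": all(r["ok"] for r in results), "targets": targets}

    def _run_on_pe(self, target: Tuple[int, int, int]) -> Dict:
        self.dispatched.append(target)
        return {"ok": True}


io_cpu = ToyIoCpu()
completion = io_cpu.submit_kernel_launch([Shard(0, 0, 1), Shard(0, 0, 0)])
print(io_cpu.submissions, io_cpu.dispatched, completion)
```

Sorting the targets before dispatch is one simple way to make the aggregated completion deterministic; the real implementation may achieve determinism differently.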

V3. Latency and trace invariants

For any valid request:

  • the hop-by-hop trace MUST be non-empty,
  • total latency MUST be greater than zero,
  • repeated runs with identical inputs MUST produce identical traces.
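
The three V3 invariants can be expressed as a single reusable check. This is a sketch under the assumption that a trace is a list of (component, latency_ns) hops; the Hop alias, check_trace_invariants, and toy_run are illustrative names, not part of the SPEC.

```python
from typing import Callable, List, Tuple

# Assumed trace shape for illustration: a list of (component, latency_ns) hops.
Hop = Tuple[str, float]


def check_trace_invariants(run: Callable[[], List[Hop]], repeats: int = 3) -> List[Hop]:
    """Run the request several times and enforce the three V3 invariants."""
    baseline = run()
    if not baseline:
        raise AssertionError("hop-by-hop trace is empty")
    if sum(latency for _, latency in baseline) <= 0:
        raise AssertionError("total latency is not greater than zero")
    for attempt in range(1, repeats):
        if run() != baseline:
            raise AssertionError(f"run {attempt + 1} produced a different trace")
    return baseline


def toy_run() -> List[Hop]:
    """Hypothetical deterministic request, standing in for a simulator call."""
    return [("host", 0.0), ("io_cpu", 12.0), ("pe0", 30.0)]


trace = check_trace_invariants(toy_run)
print(len(trace), sum(latency for _, latency in trace))
```

Comparing whole traces for equality, rather than only total latency, catches runs that reach the same total through a different hop sequence.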

V4. Topology independence and cross-domain coverage

Verification cases MUST pass for multiple topology shapes, including:

  • minimal: (1 SIP, 1 CUBE, 1 PE)
  • multi-PE: (1 SIP, 1 CUBE, N PEs)
  • multi-CUBE within a SIP: (1 SIP, M CUBEs, ≥1 PE per CUBE)
  • multi-SIP tray: (K SIPs, ≥1 CUBE per SIP, ≥1 PE per CUBE)

For multi-CUBE and multi-SIP topologies, Phase 1 verification focuses on:

  • explicit connectivity (required links exist),
  • deterministic routing and control-path traversal,
  • non-empty traces and latency > 0 for representative cross-domain requests (inter-CUBE and inter-SIP paths).
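
A sketch of topology fixtures for the four V4 shapes, together with a connectivity and deterministic-routing check. Everything here is assumed for illustration: the node naming, the SHAPES table, and in particular the link layout, in which inter-SIP traffic passes through a single "host" node. The real link structure is defined by the topology model, not by this example.

```python
from collections import deque
from typing import Dict, List, Optional, Set

# Hypothetical fixture table: (SIPs, CUBEs per SIP, PEs per CUBE).
SHAPES = {
    "minimal": (1, 1, 1),
    "multi_pe": (1, 1, 4),
    "multi_cube": (1, 2, 1),
    "multi_sip": (2, 1, 1),
}


def build_topology(sips: int, cubes: int, pes: int) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map for a toy tray (assumed link layout)."""
    links: Dict[str, Set[str]] = {}

    def connect(a: str, b: str) -> None:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)

    for s in range(sips):
        io_cpu = f"sip{s}.io_cpu"
        connect("host", io_cpu)
        for c in range(cubes):
            cube = f"sip{s}.cube{c}"
            connect(io_cpu, cube)
            for p in range(pes):
                connect(cube, f"{cube}.pe{p}")
    return links


def route(links: Dict[str, Set[str]], src: str, dst: str) -> List[str]:
    """Breadth-first search; neighbors are visited in sorted order for determinism."""
    parents: Dict[str, Optional[str]] = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path: List[str] = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for neighbor in sorted(links.get(node, ())):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return []  # no path: a required link is missing


for name, (sips, cubes, pes) in SHAPES.items():
    links = build_topology(sips, cubes, pes)
    last_pe = f"sip{sips - 1}.cube{cubes - 1}.pe{pes - 1}"
    print(name, route(links, "sip0.cube0.pe0", last_pe))
```

Running the same case list over every entry in SHAPES is one way to keep the cases independent of topology shape: a case that passes only for the minimal shape is likely testing an incidental detail rather than a contract.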

D4. Phase 1 artifacts

Phase 1 MAY include:

  • verification-only test code,
  • topology fixtures,
  • trace inspection utilities.

Phase 1 MUST NOT require:

  • production code changes solely to satisfy tests,
  • weakening or removing tests to allow progress.

D5. Phase 2 enforcement

Phase 2 (Apply) MUST:

  • run the Phase 1 verification cases,
  • roll back all changes if any verification fails,
  • preserve tests as authoritative contracts.
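
The apply-then-verify rule can be sketched as a snapshot and restore around a change. This is an illustration under stated assumptions, not the project's Apply mechanism: apply_with_verification is a hypothetical helper, and an in-memory dictionary stands in for whatever state Phase 2 actually modifies.

```python
import copy
from typing import Callable, Dict, List


def apply_with_verification(
    state: Dict,
    change: Callable[[Dict], None],
    checks: List[Callable[[Dict], bool]],
) -> bool:
    """Apply `change`, run every check, and restore the snapshot on any failure."""
    snapshot = copy.deepcopy(state)
    change(state)
    if all(check(state) for check in checks):
        return True
    # A verification case failed: roll back everything the change touched.
    state.clear()
    state.update(snapshot)
    return False


def latency_positive(state: Dict) -> bool:
    """Stand-in for a Phase 1 verification case."""
    return state["latency_ns"] > 0


state = {"latency_ns": 10}
print(apply_with_verification(state, lambda s: s.update(latency_ns=0), [latency_positive]), state)
print(apply_with_verification(state, lambda s: s.update(latency_ns=20), [latency_positive]), state)
```

Note that the checks are passed in unchanged and the helper has no way to skip or relax them, which mirrors the requirement that tests stay authoritative.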

Consequences

  • Architectural correctness is enforced early.
  • Tests serve as executable documentation of system behavior.
  • Implementation remains flexible without losing rigor.

References

  • SPEC 0.1, R2, R6
  • ADR-0011 (Memory Addressing — PA / VA / LA)
  • ADR-0012 (Host ↔ IO_CPU message schema)
  • ADR-0009 (Kernel execution semantics)