kernbench2/docs/adr/ADR-0009-kernel-execution-messaging.md

# ADR-0009: Kernel Execution Messaging and Completion Semantics

## Status

Accepted

## Context

Kernel execution is initiated by the host and proceeds through
device control components:

Host → IO_CPU → M_CPU → PE_CPU → schedulers → engines

Completion propagates in reverse order.

To keep benchmarks simple and topology-agnostic, kernel execution must be
endpoint-driven: the host talks to a single endpoint, and the device components
aggregate completion deterministically on its behalf.

---

## Decision

### D1. Kernel launch is an endpoint request

A kernel launch is initiated by submitting a single KernelLaunch request
to the IO_CPU endpoint.

The runtime API MUST:

- construct the kernel launch request,
- submit it to IO_CPU,
- await a single completion result.

The runtime API MUST NOT orchestrate internal fan-out.
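
The three obligations above can be sketched as follows. This is a minimal illustration, not the kernbench2 runtime: `KernelLaunch` is modelled as a plain dataclass, and `FakeIoCpu`, `Completion`, `launch_kernel`, and all field names are hypothetical stand-ins introduced only for this example.

```python
import asyncio
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelLaunch:
    # Field names are assumptions made for this sketch.
    correlation_id: int
    kernel_name: str
    tensor_handles: tuple


@dataclass(frozen=True)
class Completion:
    correlation_id: int
    ok: bool


class FakeIoCpu:
    """Hypothetical stand-in for the IO_CPU endpoint."""

    async def submit(self, request: KernelLaunch) -> Completion:
        # A real endpoint would fan out internally; the caller never sees that.
        await asyncio.sleep(0)
        return Completion(request.correlation_id, ok=True)


_next_id = itertools.count(1)


async def launch_kernel(io_cpu, kernel_name, tensor_handles) -> Completion:
    # 1. Construct the kernel launch request.
    request = KernelLaunch(next(_next_id), kernel_name, tuple(tensor_handles))
    # 2. Submit it to IO_CPU and 3. await a single completion result.
    return await io_cpu.submit(request)


result = asyncio.run(launch_kernel(FakeIoCpu(), "gemm", ["t0", "t1"]))
```

Note that `launch_kernel` holds no reference to M_CPU or PE_CPU objects, which is what keeps the caller independent of device topology.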

---

### D2. Tensor arguments are passed by metadata

KernelLaunch requests MUST reference tensor arguments via:

- host-owned tensor handles, or
- resolved device address maps derived from those handles.

Bulk tensor data MUST NOT be embedded in kernel launch messages.
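
One way to enforce this rule at the message boundary is to let the request type accept only metadata records. The sketch below assumes a hypothetical `TensorRef` record; its fields (`handle`, `shape`, `dtype`, `device_addresses`) are illustrative and not taken from the kernbench2 code.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class TensorRef:
    # Metadata only: a host-owned handle plus an optional resolved address map.
    handle: int
    shape: Tuple[int, ...]
    dtype: str
    device_addresses: Optional[Mapping[str, int]] = None


@dataclass(frozen=True)
class KernelLaunch:
    kernel_name: str
    args: Tuple[TensorRef, ...]

    def __post_init__(self):
        # Reject anything that is not a metadata reference, e.g. raw bytes.
        for arg in self.args:
            if not isinstance(arg, TensorRef):
                raise TypeError(
                    f"tensor arguments must be TensorRef, got {type(arg).__name__}"
                )


launch = KernelLaunch(
    "gemm",
    (
        TensorRef(handle=1, shape=(1024, 1024), dtype="f16"),
        TensorRef(handle=2, shape=(1024, 1024), dtype="f16",
                  device_addresses={"pe0": 0x1000, "pe1": 0x2000}),
    ),
)
```

With this shape, the size of a launch message depends on the number of arguments, not on the number of tensor elements.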

---

### D3. Fan-out and aggregation are component responsibilities

- IO_CPU fans out work to M_CPUs.
- M_CPU fans out work to PE_CPUs.
- PE_CPU manages kernel execution and engine dispatch.

Completion semantics:

- M_CPU completes when all targeted PEs complete or a failure policy triggers.
- IO_CPU completes when all targeted CUBEs complete or a failure policy triggers.
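
The fan-out and aggregation rules can be illustrated with a small `asyncio` sketch. It rests on assumptions that this ADR does not state: each CUBE is served by one M_CPU, and the failure policy is "wait for every child, then report failure if any child failed". The function names and the `fail` parameter are hypothetical.

```python
import asyncio


async def pe_cpu(pe_id, fail=frozenset()):
    # Stand-in for kernel execution and engine dispatch on one PE.
    await asyncio.sleep(0)
    return pe_id not in fail


async def m_cpu(pe_ids, fail=frozenset()):
    # Fan out to the targeted PE_CPUs; gather returns results in input order,
    # so aggregation does not depend on which PE finishes first.
    results = await asyncio.gather(*(pe_cpu(pe, fail) for pe in pe_ids))
    return all(results)


async def io_cpu(cubes, fail=frozenset()):
    # Fan out to one M_CPU per targeted CUBE (an assumption of this sketch).
    results = await asyncio.gather(*(m_cpu(pes, fail) for pes in cubes.values()))
    return all(results)


cubes = {"cube0": ["pe0", "pe1"], "cube1": ["pe2", "pe3"]}
all_ok = asyncio.run(io_cpu(cubes))
one_failed = asyncio.run(io_cpu(cubes, fail=frozenset({"pe2"})))
```

A fail-fast policy would instead cancel outstanding children on the first failure; either policy fits the wording above as long as it is applied consistently at each level.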

---

### D4. Completion and failure propagation

- All messages MUST carry correlation identifiers, so that every completion or failure can be matched to the request that caused it.
- Completion and failure MUST propagate deterministically to the host.
- The simulation engine provides futures/handles to observe completion.
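
A minimal sketch of how correlation identifiers and futures might fit together, using the standard library's `concurrent.futures.Future`. `CompletionTracker` and its methods are hypothetical names for this example; the simulation engine's actual handle type is not specified here.

```python
from concurrent.futures import Future


class CompletionTracker:
    """Maps correlation identifiers to futures the host can observe."""

    def __init__(self):
        self._pending = {}
        self._next_id = 0

    def register(self):
        # Allocate a fresh correlation identifier and its future.
        self._next_id += 1
        future = Future()
        self._pending[self._next_id] = future
        return self._next_id, future

    def complete(self, correlation_id, result):
        # Resolve exactly the request that this completion belongs to.
        self._pending.pop(correlation_id).set_result(result)

    def fail(self, correlation_id, error):
        # Failures travel the same path as completions.
        self._pending.pop(correlation_id).set_exception(error)


tracker = CompletionTracker()
cid_a, fut_a = tracker.register()
cid_b, fut_b = tracker.register()

# Responses may arrive in any order; the identifier routes each one.
tracker.fail(cid_b, RuntimeError("PE fault"))
tracker.complete(cid_a, "done")
```

Because each identifier is removed from the pending table when it resolves, a duplicate or unknown response raises `KeyError` instead of silently completing the wrong request.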

---

## Links

- SPEC R1, R2, R7, R8
- ADR-0007 (Runtime API boundaries)
- ADR-0008 (Tensor deployment)