kernbench2/docs/adr/ADR-0020-prog-data-execution-two-pass.en.md

# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)

## Status

Accepted

## Context

The current simulation models **timing only**.
`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
but do not actually read tensor data or perform computations.

### Required Capabilities

1. Must be able to store and read actual data in HBM/TCM/SRAM
2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
3. Must minimize simulation performance degradation

### Constraints

- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
- Kernel functions must remain plain Python functions (no generator/async transformation)

### Design Exploration Results

| Option | Approach | Verdict |
|--------|----------|---------|
| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |

---

## Decision

### D1. 2-Pass Execution Model — Phase 0 Elimination

The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.

Before:
```
Phase 0: Kernel → PeCommand list (no data, no branching)
Phase 1: Replay PeCommand list via SimPy (timing only)
```

After:
```
Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
  - Memory read/write: SimPy timing + MemoryStore actual data
  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
  - Dynamic control flow possible (tl.load returns actual data)

Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
```

This ADR **extends Phase 1 to be data-aware for memory operations only**.
Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
Phase 2 handles GEMM/Math computation correctness verification.
Phase 2 is optional — if only timing is needed, run Phase 1 alone.

### D2. Op Log Recording — ComponentBase Hook

Op log recording is performed as a **hook in the component base class**.
Individual component implementations are not modified.

```python
class ComponentBase:
    def _on_process_start(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_start(env.now, self.node.id, msg)

    def _on_process_end(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_end(env.now, self.node.id, msg)
```

Hooks are called before and after `run()` within `_forward_txn()`.
`_op_logger` is optional — zero overhead when absent.

**Hook timing definitions**:

| Timing | Meaning |
|--------|---------|
| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |

Link traversal latency is not included in t_start/t_end.
Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.

### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination

The existing Phase 0 (kernel → PeCommand list) is eliminated,
and **greenlet** is used to cooperatively interleave kernel and SimPy execution.

#### Operating Principle

greenlet is a C extension that provides cooperative context switching.
When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
to perform timing simulation, and after completion, returns to the kernel with actual data.

```
SimPy loop (parent greenlet)           Kernel (child greenlet)
─────────────────────────              ──────────────────────
g.switch() ─────────────────────────→ Kernel starts
                                       a = tl.load(ptr, ...)
                                         internal: parent.switch(DmaReadCmd)
cmd = DmaReadCmd ←──────────────────  (kernel paused)
  yield DmaReadMsg(...)
  yield env.timeout(dma_latency)
  data = memory_store.read(...)
g.switch(data) ─────────────────────→ (kernel resumed)
                                       a = data  ← actual numpy array
                                       if a[0][0] > 0.5:  ← branching possible
                                         ...
```

The kernel is maintained as a **plain Python function**.
greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.

#### KernelRunner — Framework Layer

The greenlet loop resides not in the PE_CPU component but in the framework layer,
**KernelRunner**.

```python
# KernelRunner (framework — greenlet ↔ SimPy bridge)
class KernelRunner:
    def run(self, env, kernel_fn, args, store):
        g = greenlet(self._run_kernel)
        cmd = g.switch(kernel_fn, args)

        while cmd is not None:
            if isinstance(cmd, DmaReadCmd):
                yield from self._dispatch_dma(env, cmd)
                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
                cmd = g.switch(data)            # resume with actual data
            elif isinstance(cmd, GemmCmd):
                yield from self._dispatch_gemm(env, cmd)
                cmd = g.switch()                # resume (no data)
            elif isinstance(cmd, DmaWriteCmd):
                store.write(cmd.dst_addr, cmd.data)  # visibility = issue time
                yield from self._dispatch_dma(env, cmd)  # timing only
                cmd = g.switch()

# PE_CPU (component — kept simple, unaware of greenlet)
def _execute_kernel(self, env):
    runner = KernelRunner(self.ctx)
    yield from runner.run(env, kernel_fn, args, store)
```

**Op logging single source of truth**: KernelRunner does not record directly to op_log.
All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
the component base class hooks automatically record them.

**Layer separation**:
- **Kernel code**: plain function, unaware of greenlet
- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
- **ComponentBase hook**: the sole path for op_log recording
- **PE_CPU**: only calls KernelRunner, replaceable as a component

#### Handling Differences Between Memory Read/Write and Compute

| Operation | In Phase 1 | In Phase 2 |
|-----------|-----------|-----------|
| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |

Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
GEMM/Math operations are batch-executed in Phase 2 (performance separation).

#### Store Visibility Rule

`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
SimPy DMA timing is simulated separately afterward.

This is an intentional separation of timing and visibility:
- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
- **timing**: the point at which DMA latency completes in SimPy

This separation allows a load immediately after a store to see the latest data in dynamic control flow.

#### Result Handle Semantics

`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.

The key contract in Phase 1:

1. **All compute handles are always considered pending in Phase 1.**
2. `tl.wait(handle)` **expresses timing synchronization only**
   and does not make the handle ready.
3. Accessing the handle's actual result data (`handle.data`, element access,
   numpy conversion, etc.) is **only possible in Phase 2**.
4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
5. In contrast, `tl.load()` returns actual data in Phase 1, so
   **memory-read-based control flow is supported**.

| Handle state | Phase | Allowed operations |
|------------|-------|----------|
| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
| ready | Phase 2 | Actual numpy data access, verification |

This restriction is intentional. If computations were executed in Phase 1,
the SimPy single-thread would block, defeating the purpose of 2-pass separation.

#### Phase 1 Materialization — Future Extension

If Phase 1 eager execution becomes necessary for small operations
(scalar, small reduction) in the future, selective materialization can be supported
by adding a `materialized_in_phase1: bool` flag to the op record.
This is not implemented in the current scope.

### D4. data_op Flag — Message Self-Declaration

The logging target is determined by the `data_op` attribute on the message instance,
not by message type. The framework does not hardcode message types.

```python
class MsgBase:
    data_op: bool = False       # default: no logging

class DmaReadCmd(MsgBase):
    data_op = True              # memory transfer → logging

class GemmCmd(MsgBase):
    data_op = True              # compute → logging

class MathCmd(MsgBase):
    data_op = True              # compute → logging
```

When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
enables automatic logging without modifying framework code.

### D5. Op Log Structure

#### Op Classification Scheme

A two-level classification is used:

| Level | Field | Role |
|-------|-------|------|
| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |

#### OpRecord Definition

```python
@dataclass
class OpRecord:
    t_start: float              # SimPy time (ns) — service start
    t_end: float                # SimPy time (ns) — service completion
    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
    op_kind: str                # "memory" | "gemm" | "math"
    op_name: str                # specific operation name
    params: dict                # per-operation parameters (see below)
    dependency_ids: list[int]   # currently based on in-memory record index, may be replaced with stable op_id in the future
```

#### dependency_ids Generation Rules

`dependency_ids` is **optional**, and by default the executor performs
address-based dependency inference (see D6).

Explicit setting is only needed when precise execution ordering is required:
- **Default (address-based inference)**: the executor analyzes read/write sets to
  automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
- **Explicit setting**: set when logical dependencies cannot be expressed via addresses
  at the TLContext or command generation stage.
  Example: completion handle-based synchronization — handle dependencies depend on
  logical completion order rather than memory addresses, so they cannot be captured
  by address inference.

#### op_log Ordering

The op_log maintains **stable ordering** based on `t_start`.
Records with the same `t_start` preserve insertion order.

#### params Details

**memory (dma_read / dma_write)**:
```python
{
    "src_addr": int,            # source address (byte)
    "dst_addr": int,            # destination address (byte)
    "nbytes": int,              # transfer size
    "src_space": str,           # "hbm" | "tcm" | "sram"
    "dst_space": str,           # "hbm" | "tcm" | "sram"
}
```

**gemm**:
```python
{
    "src_a_addr": int,          # operand A address
    "src_b_addr": int,          # operand B address
    "dst_addr": int,            # output address
    "shape_a": tuple,           # e.g. (128, 256)
    "shape_b": tuple,           # e.g. (256, 128)
    "shape_out": tuple,         # e.g. (128, 128)
    "dtype_in": str,            # e.g. "f16"
    "dtype_acc": str,           # accumulation dtype, e.g. "f32"
    "dtype_out": str,           # output dtype, e.g. "f16"
    "transpose_a": bool,
    "transpose_b": bool,
    "layout_a": str,            # "row_major" | "col_major"
    "layout_b": str,
    "layout_out": str,
    "addr_space": str,          # "tcm" (GEMM operands are always in TCM)
}
```

**math**:
```python
{
    "op": str,                  # "exp" | "add" | "sum" | "where" | ...
    "input_addrs": list[int],   # list of operand addresses
    "input_shapes": list[tuple],
    "dst_addr": int,
    "shape_out": tuple,
    "dtype": str,
    "axis": int | None,         # reduction axis
    "addr_space": str,          # "tcm"
}
```

### D6. Phase 2 Executor

Phase 2 executes the op_log outside of SimPy.

```python
class DataExecutor:
    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
        self.store = initial_store  # Takes the Phase 1 MemoryStore snapshot as input

    def run(self):
        for t, ops in groupby(op_log, key=lambda o: o.t_start):
            batch = list(ops)
            independent, sequential = self._classify(batch)
            self._execute_parallel(independent)
            self._execute_sequential(sequential)
```

**Parallel execution determination**:

Ops with the same `t_start` are considered **parallel candidates**.
The executor determines actual parallel execution based on the following criteria:
- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
- Whether predecessor ops specified in `dependency_ids` have completed

Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.

**Batch optimization**: Only independent ops with the same op_name **and identical
shape, dtype, layout, and transpose flags** are eligible for batching.
Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
Improves BLAS efficiency on CPU, reduces launch overhead on GPU.

**Phase 2 execution order guarantee**:

Phase 2 does not consider data arrival timing,
and guarantees execution order solely through
dependencies (address-based inference + explicit dependency_ids).

### D7. Memory Store

`MemoryStore` logically follows byte-addressable semantics,
and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).

```python
class MemoryStore:
    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
```

**Internal storage format: numpy ndarray**

MemoryStore stores tensors as **numpy ndarrays**.

| Candidate | store/load speed | Phase 2 compute | Verdict |
|-----------|-----------------|-----------------|---------|
| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
| torch tensor | Immediate | torch operations available | Use only for GPU optimization |

- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
- read: **returns numpy array by reference** (no copy)
- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
- dtype uses numpy native (`np.float16`, `np.float32`, `np.bfloat16`, etc.)
- For byte-level access, convert via `.view(np.uint8)`
- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility

**read/write contract**:

- read/write operates on a **contiguous tensor** basis.
  If non-contiguous stride views are needed, express them as separate copy ops.
- In the normal benchmark path, producer/consumer dtype match is expected.
  Reinterpret cast is a permissive behavior for low-level memory validation
  or special test cases.
- addr is byte-aligned, with minimum alignment = dtype size.
- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
  Shape mismatch is verified based on nbytes, and raises an error on mismatch.
- Correctness criteria follow address-range-based read/write semantics.
- A tensor object cache may be used as an implementation optimization,
  but the canonical state is byte-addressable storage.
- At deploy time, the host injects initial tensor data.

### D8. Benchmark Kernel Code

The benchmark's **user code API is not changed**.
The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.

However, internal command/message schemas may be extended to include metadata
required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).

### D9. No Component Changes

Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
Op log recording is the responsibility of the ComponentBase hook.
When custom components are replaced, only the timing model changes,
and Phase 2 data execution is unaffected.

### D10. Phase 2 is Optional

```python
engine = GraphEngine(graph)
engine.run(benchmark)                       # Phase 1: timing only
result = engine.get_timing_result()

if verify_data:
    executor = DataExecutor(engine.op_log)  # Phase 2: data
    executor.run()
    executor.verify(expected_output)
```

If only timing analysis is needed, Phase 2 is skipped.
If the op_logger is deactivated, Phase 1 performance is identical to the original.

### D11. Verification Contract

Basic verification **compares the final output tensor** against a reference backend (numpy).

Per-dtype tolerance policy:

| dtype | Comparison method | Tolerance |
|-------|----------|-----------|
| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
| int types | `np.array_equal` | exact |

- Default mode: compare final output only (end-to-end correctness)
- Debug mode: can compare intermediate tensors on a per-op basis
  (MemoryStore snapshot at each op boundary)

---

## Non-goals

- **Compute-result-based control flow**: not supported.
  All compute handles are in pending state during Phase 1,
  `wait()` expresses timing synchronization only and does not imply data readiness.
  Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
  is **treated as an error**.
  Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
  Phase 1 materialization is a future extension (see D3).
- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
  the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
  and do not reproduce the actual hardware PE microarchitecture.

## Open Questions

- **Aliasing / slice view**: How to represent slice/views referencing the same
  backing storage in MemoryStore (stride-based view vs copy semantics)
- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
  communication as memory ops or introduce a separate op_kind
- **Op log streaming**: Managing op_log memory usage in large-scale simulations
  (in-memory list vs disk-backed streaming)
- **Fused operation**: Whether to record tl.composite's tiled pipeline
  (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
- **Math op schema generalization**: The current math params have a simple structure,
  but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
  scalar/immediate operands, where/mask expressions, etc.
- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
  replacement with stable op_id is needed when introducing streaming/disk-backed mode
- **Phase 1 materialization policy**: See Future Extension in D3.
  If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
  needs to be defined

---

## Consequences

### Positive

- Minimal impact on SimPy simulation performance (only op_log append added)
- Free to use multi-threading/GPU in Phase 2
- Component replaceability preserved (ADR-0015 design philosophy maintained)
- No changes needed to benchmark user code API
- When adding new message types, only set the data_op flag
- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
- `tl.load()` returns actual data, making kernel debugging easier

### Negative

- op_log memory usage (for large-scale simulations)
- Phase 2 execution time is proportional to tensor size (large GEMM)
- Dynamic branching based on pending handles (incomplete computations) not possible
  (computations execute in Phase 2, result values are undetermined in Phase 1).
  Memory-data-based branching is supported via greenlet.
- greenlet C extension dependency added (pip install greenlet)