# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation) ## Status Accepted ## Context The current simulation models **timing only**. `tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies, but do not actually read tensor data or perform computations. ### Required Capabilities 1. Must be able to store and read actual data in HBM/TCM/SRAM 2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results 3. Must minimize simulation performance degradation ### Constraints - SimPy is a single-thread event loop — running numpy matmul inside it blocks everything - Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations - Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused - Kernel functions must remain plain Python functions (no generator/async transformation) ### Design Exploration Results | Option | Approach | Verdict | |--------|----------|---------| | Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block | | SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests | | Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads | | **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact | --- ## Decision ### D1. 2-Pass Execution Model — Phase 0 Elimination The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**. Before: ``` Phase 0: Kernel → PeCommand list (no data, no branching) Phase 1: Replay PeCommand list via SimPy (timing only) ``` After: ``` Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based - Memory read/write: SimPy timing + MemoryStore actual data - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2) - Dynamic control flow possible (tl.load returns actual data) Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable ``` This ADR **extends Phase 1 to be data-aware for memory operations only**. Phase 1 handles latency/BW bottleneck analysis + memory data tracking, Phase 2 handles GEMM/Math computation correctness verification. Phase 2 is optional — if only timing is needed, run Phase 1 alone. ### D2. Op Log Recording — ComponentBase Hook Op log recording is performed as a **hook in the component base class**. Individual component implementations are not modified. ```python class ComponentBase: def _on_process_start(self, env, msg): if self._op_logger and getattr(msg, 'data_op', False): self._op_logger.record_start(env.now, self.node.id, msg) def _on_process_end(self, env, msg): if self._op_logger and getattr(msg, 'data_op', False): self._op_logger.record_end(env.now, self.node.id, msg) ``` Hooks are called before and after `run()` within `_forward_txn()`. `_op_logger` is optional — zero overhead when absent. **Hook timing definitions**: | Timing | Meaning | |--------|---------| | `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) | | `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) | Link traversal latency is not included in t_start/t_end. Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start. ### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination The existing Phase 0 (kernel → PeCommand list) is eliminated, and **greenlet** is used to cooperatively interleave kernel and SimPy execution. #### Operating Principle greenlet is a C extension that provides cooperative context switching. When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet) to perform timing simulation, and after completion, returns to the kernel with actual data. ``` SimPy loop (parent greenlet) Kernel (child greenlet) ───────────────────────── ────────────────────── g.switch() ─────────────────────────→ Kernel starts a = tl.load(ptr, ...) internal: parent.switch(DmaReadCmd) cmd = DmaReadCmd ←────────────────── (kernel paused) yield DmaReadMsg(...) yield env.timeout(dma_latency) data = memory_store.read(...) g.switch(data) ─────────────────────→ (kernel resumed) a = data ← actual numpy array if a[0][0] > 0.5: ← branching possible ... ``` The kernel is maintained as a **plain Python function**. greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc. #### KernelRunner — Framework Layer The greenlet loop resides not in the PE_CPU component but in the framework layer, **KernelRunner**. ```python # KernelRunner (framework — greenlet ↔ SimPy bridge) class KernelRunner: def run(self, env, kernel_fn, args, store): g = greenlet(self._run_kernel) cmd = g.switch(kernel_fn, args) while cmd is not None: if isinstance(cmd, DmaReadCmd): yield from self._dispatch_dma(env, cmd) data = store.read(cmd.src_addr, cmd.shape, cmd.dtype) cmd = g.switch(data) # resume with actual data elif isinstance(cmd, GemmCmd): yield from self._dispatch_gemm(env, cmd) cmd = g.switch() # resume (no data) elif isinstance(cmd, DmaWriteCmd): store.write(cmd.dst_addr, cmd.data) # visibility = issue time yield from self._dispatch_dma(env, cmd) # timing only cmd = g.switch() # PE_CPU (component — kept simple, unaware of greenlet) def _execute_kernel(self, env): runner = KernelRunner(self.ctx) yield from runner.run(env, kernel_fn, args, store) ``` **Op logging single source of truth**: KernelRunner does not record directly to op_log. All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**. When KernelRunner delivers messages to components via `_dispatch_gemm()` etc., the component base class hooks automatically record them. **Layer separation**: - **Kernel code**: plain function, unaware of greenlet - **TLContext**: calls `parent.switch(cmd)` inside `tl.load()` - **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**. - **ComponentBase hook**: the sole path for op_log recording - **PE_CPU**: only calls KernelRunner, replaceable as a component #### Handling Differences Between Memory Read/Write and Compute | Operation | In Phase 1 | In Phase 2 | |-----------|-----------|-----------| | `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — | | `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — | | `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation | | `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation | Memory read/write is processed immediately in Phase 1 (numpy slice, fast). GEMM/Math operations are batch-executed in Phase 2 (performance separation). #### Store Visibility Rule `tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue). SimPy DMA timing is simulated separately afterward. This is an intentional separation of timing and visibility: - **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called - **timing**: the point at which DMA latency completes in SimPy This separation allows a load immediately after a store to see the latest data in dynamic control flow. #### Result Handle Semantics `tl.composite()` (sync/async) returns a **handle** referencing the result tensor. The key contract in Phase 1: 1. **All compute handles are always considered pending in Phase 1.** 2. `tl.wait(handle)` **expresses timing synchronization only** and does not make the handle ready. 3. Accessing the handle's actual result data (`handle.data`, element access, numpy conversion, etc.) is **only possible in Phase 2**. 4. Therefore, **compute-result-based control flow is not supported in Phase 1.** 5. In contrast, `tl.load()` returns actual data in Phase 1, so **memory-read-based control flow is supported**. | Handle state | Phase | Allowed operations | |------------|-------|----------| | pending | Phase 1 | `tl.wait(handle)` — timing synchronization only | | pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) | | pending | Phase 1 | **Data access not allowed** — value-based branching not possible | | ready | Phase 2 | Actual numpy data access, verification | This restriction is intentional. If computations were executed in Phase 1, the SimPy single-thread would block, defeating the purpose of 2-pass separation. #### Phase 1 Materialization — Future Extension If Phase 1 eager execution becomes necessary for small operations (scalar, small reduction) in the future, selective materialization can be supported by adding a `materialized_in_phase1: bool` flag to the op record. This is not implemented in the current scope. ### D4. data_op Flag — Message Self-Declaration The logging target is determined by the `data_op` attribute on the message instance, not by message type. The framework does not hardcode message types. ```python class MsgBase: data_op: bool = False # default: no logging class DmaReadCmd(MsgBase): data_op = True # memory transfer → logging class GemmCmd(MsgBase): data_op = True # compute → logging class MathCmd(MsgBase): data_op = True # compute → logging ``` When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True` enables automatic logging without modifying framework code. ### D5. Op Log Structure #### Op Classification Scheme A two-level classification is used: | Level | Field | Role | |-------|-------|------| | `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion | | `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification | #### OpRecord Definition ```python @dataclass class OpRecord: t_start: float # SimPy time (ns) — service start t_end: float # SimPy time (ns) — service completion component_id: str # e.g. "sip0.cube0.pe0.pe_gemm" op_kind: str # "memory" | "gemm" | "math" op_name: str # specific operation name params: dict # per-operation parameters (see below) dependency_ids: list[int] # currently based on in-memory record index, may be replaced with stable op_id in the future ``` #### dependency_ids Generation Rules `dependency_ids` is **optional**, and by default the executor performs address-based dependency inference (see D6). Explicit setting is only needed when precise execution ordering is required: - **Default (address-based inference)**: the executor analyzes read/write sets to automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases. - **Explicit setting**: set when logical dependencies cannot be expressed via addresses at the TLContext or command generation stage. Example: completion handle-based synchronization — handle dependencies depend on logical completion order rather than memory addresses, so they cannot be captured by address inference. #### op_log Ordering The op_log maintains **stable ordering** based on `t_start`. Records with the same `t_start` preserve insertion order. #### params Details **memory (dma_read / dma_write)**: ```python { "src_addr": int, # source address (byte) "dst_addr": int, # destination address (byte) "nbytes": int, # transfer size "src_space": str, # "hbm" | "tcm" | "sram" "dst_space": str, # "hbm" | "tcm" | "sram" } ``` **gemm**: ```python { "src_a_addr": int, # operand A address "src_b_addr": int, # operand B address "dst_addr": int, # output address "shape_a": tuple, # e.g. (128, 256) "shape_b": tuple, # e.g. (256, 128) "shape_out": tuple, # e.g. (128, 128) "dtype_in": str, # e.g. "f16" "dtype_acc": str, # accumulation dtype, e.g. "f32" "dtype_out": str, # output dtype, e.g. "f16" "transpose_a": bool, "transpose_b": bool, "layout_a": str, # "row_major" | "col_major" "layout_b": str, "layout_out": str, "addr_space": str, # "tcm" (GEMM operands are always in TCM) } ``` **math**: ```python { "op": str, # "exp" | "add" | "sum" | "where" | ... "input_addrs": list[int], # list of operand addresses "input_shapes": list[tuple], "dst_addr": int, "shape_out": tuple, "dtype": str, "axis": int | None, # reduction axis "addr_space": str, # "tcm" } ``` ### D6. Phase 2 Executor Phase 2 executes the op_log outside of SimPy. ```python class DataExecutor: def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore): self.store = initial_store # Takes the Phase 1 MemoryStore snapshot as input def run(self): for t, ops in groupby(op_log, key=lambda o: o.t_start): batch = list(ops) independent, sequential = self._classify(batch) self._execute_parallel(independent) self._execute_sequential(sequential) ``` **Parallel execution determination**: Ops with the same `t_start` are considered **parallel candidates**. The executor determines actual parallel execution based on the following criteria: - Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks) - Whether predecessor ops specified in `dependency_ids` have completed Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel. **Batch optimization**: Only independent ops with the same op_name **and identical shape, dtype, layout, and transpose flags** are eligible for batching. Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call. Improves BLAS efficiency on CPU, reduces launch overhead on GPU. **Phase 2 execution order guarantee**: Phase 2 does not consider data arrival timing, and guarantees execution order solely through dependencies (address-based inference + explicit dependency_ids). ### D7. Memory Store `MemoryStore` logically follows byte-addressable semantics, and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping). ```python class MemoryStore: def write(self, space: str, addr: int, data: np.ndarray) -> None: ... def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ... ``` **Internal storage format: numpy ndarray** MemoryStore stores tensors as **numpy ndarrays**. | Candidate | store/load speed | Phase 2 compute | Verdict | |-----------|-----------------|-----------------|---------| | **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** | | bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected | | torch tensor | Immediate | torch operations available | Use only for GPU optimization | - write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup - read: **returns numpy array by reference** (no copy) - Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported) - dtype uses numpy native (`np.float16`, `np.float32`, `np.bfloat16`, etc.) - For byte-level access, convert via `.view(np.uint8)` - For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility **read/write contract**: - read/write operates on a **contiguous tensor** basis. If non-contiguous stride views are needed, express them as separate copy ops. - In the normal benchmark path, producer/consumer dtype match is expected. Reinterpret cast is a permissive behavior for low-level memory validation or special test cases. - addr is byte-aligned, with minimum alignment = dtype size. - dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast. Shape mismatch is verified based on nbytes, and raises an error on mismatch. - Correctness criteria follow address-range-based read/write semantics. - A tensor object cache may be used as an implementation optimization, but the canonical state is byte-addressable storage. - At deploy time, the host injects initial tensor data. ### D8. Benchmark Kernel Code The benchmark's **user code API is not changed**. The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained. However, internal command/message schemas may be extended to include metadata required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose). ### D9. No Component Changes Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified. Op log recording is the responsibility of the ComponentBase hook. When custom components are replaced, only the timing model changes, and Phase 2 data execution is unaffected. ### D10. Phase 2 is Optional ```python engine = GraphEngine(graph) engine.run(benchmark) # Phase 1: timing only result = engine.get_timing_result() if verify_data: executor = DataExecutor(engine.op_log) # Phase 2: data executor.run() executor.verify(expected_output) ``` If only timing analysis is needed, Phase 2 is skipped. If the op_logger is deactivated, Phase 1 performance is identical to the original. ### D11. Verification Contract Basic verification **compares the final output tensor** against a reference backend (numpy). Per-dtype tolerance policy: | dtype | Comparison method | Tolerance | |-------|----------|-----------| | f32 | `np.allclose` | rtol=1e-5, atol=1e-5 | | f16 | `np.allclose` | rtol=1e-3, atol=1e-3 | | bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 | | int types | `np.array_equal` | exact | - Default mode: compare final output only (end-to-end correctness) - Debug mode: can compare intermediate tensors on a per-op basis (MemoryStore snapshot at each op boundary) --- ## Non-goals - **Compute-result-based control flow**: not supported. All compute handles are in pending state during Phase 1, `wait()` expresses timing synchronization only and does not imply data readiness. Accessing `handle.data`, element access, or truth-value evaluation in Phase 1 is **treated as an error**. Memory-data-based branching (results of `tl.load()`) is supported via greenlet. Phase 1 materialization is a future extension (see D3). - **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce the execution time overlap from Phase 1. Phase 2 only verifies data correctness. - **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls and do not reproduce the actual hardware PE microarchitecture. ## Open Questions - **Aliasing / slice view**: How to represent slice/views referencing the same backing storage in MemoryStore (stride-based view vs copy semantics) - **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE communication as memory ops or introduce a separate op_kind - **Op log streaming**: Managing op_log memory usage in large-scale simulations (in-memory list vs disk-backed streaming) - **Fused operation**: Whether to record tl.composite's tiled pipeline (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops - **Math op schema generalization**: The current math params have a simple structure, but generalization may be needed for broadcasting rules, per-input dtype, keepdims, scalar/immediate operands, where/mask expressions, etc. - **Op record identifier**: Currently dependency_ids are based on in-memory list indices; replacement with stable op_id is needed when introducing streaming/disk-backed mode - **Phase 1 materialization policy**: See Future Extension in D3. If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops needs to be defined --- ## Consequences ### Positive - Minimal impact on SimPy simulation performance (only op_log append added) - Free to use multi-threading/GPU in Phase 2 - Component replaceability preserved (ADR-0015 design philosophy maintained) - No changes needed to benchmark user code API - When adding new message types, only set the data_op flag - Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported - `tl.load()` returns actual data, making kernel debugging easier ### Negative - op_log memory usage (for large-scale simulations) - Phase 2 execution time is proportional to tensor size (large GEMM) - Dynamic branching based on pending handles (incomplete computations) not possible (computations execute in Phase 2, result values are undetermined in Phase 1). Memory-data-based branching is supported via greenlet. - greenlet C extension dependency added (pip install greenlet)