# ADR-0052: OpLog + MemoryStore Schemas (sim_engine internals)

## Status

Accepted (2026-05-22).

Pins down the `OpRecord` schema and the `record_start` / `record_end` / `record_copy` behavior in `sim_engine/op_log.py`, plus the (space, addr) namespace and read/write semantics of `MemoryStore` in `sim_engine/memory_store.py`. ADR-0020 (2-pass data execution) declares that these two facilities exist, but **the precise record fields and semantics** had no ADR-level coverage, and several recent ADRs (ADR-0046 D3.2's `tl.store` visibility, ADR-0023 D9's IPCQ copy record) depend on these semantics.

## First action

### `OpLogger(memory_store=None)`

On construction, initialize three fields:

1. `self._records: list[OpRecord] = []`: accumulated records.
2. `self._pending: dict[int, dict] = {}`: partial records keyed by `id(msg)` (created at `record_start`, completed at `record_end`).
3. `self._memory_store = memory_store`: optional MemoryStore reference, used to capture math-op input snapshots and dma_write HBM-source snapshots.

Both `_records` and `_pending` start empty; the `record_*` calls accumulate data over time.

### `MemoryStore()`

On construction, initialize a single field:

- `self._storage: dict[str, dict[int, np.ndarray]] = {}`: a two-level dict (`space → addr → ndarray`). Inner dicts are created lazily as new spaces appear.

In short, **each facility's first act is to set up an empty container**: an accumulator list (plus a pending map) for `OpLogger`, and a sparse, per-space dict for `MemoryStore`. The first record or write fills it when it arrives.

## Context

ADR-0020 D2/D5/D7 (2-pass data execution) declares:

- During Phase 1 (timing), `ComponentBase._on_process_start/end` hooks call `OpLogger.record_start/end`, recording the time and metadata of every data op.
- Phase 2 (data) replays the op log in `t_start` order to compute real data.
- Data payloads live in `MemoryStore`, keyed by (space, addr).
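
As a rough illustration of the two-pass split listed above, the sketch below logs ops with timestamps first and then replays them in `t_start` order against a store keyed by (space, addr). `MiniRecord`, `replay`, and the `"copy"` / `"double"` op names are hypothetical stand-ins chosen for this example, not the `sim_engine` API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class MiniRecord:
    """Hypothetical, trimmed-down stand-in for an op-log record."""
    t_start: float
    t_end: float
    op_name: str
    params: dict[str, Any] = field(default_factory=dict)


def replay(records: list[MiniRecord],
           store: dict[tuple[str, int], list[float]]) -> None:
    """Pass 2: apply ops in t_start order to a (space, addr)-keyed store."""
    for rec in sorted(records, key=lambda r: r.t_start):
        p = rec.params
        src = store[(p["src_space"], p["src_addr"])]
        if rec.op_name == "copy":
            out = list(src)
        elif rec.op_name == "double":
            out = [2 * x for x in src]
        else:
            continue  # unrecognized ops are skipped
        store[(p["dst_space"], p["dst_addr"])] = out


# Pass 1 (timing only): records deliberately appended out of start order.
log = [
    MiniRecord(5.0, 6.0, "double",
               {"src_space": "tcm", "src_addr": 0,
                "dst_space": "tcm", "dst_addr": 64}),
    MiniRecord(0.0, 4.0, "copy",
               {"src_space": "hbm", "src_addr": 4096,
                "dst_space": "tcm", "dst_addr": 0}),
]

store = {("hbm", 4096): [1.0, 2.0, 3.0]}
replay(log, store)
print(store[("tcm", 64)])
```

Without the sort, the `"double"` op would run before its input exists at `("tcm", 0)`; ordering by `t_start` is what makes the replay well-defined in this toy setup.
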
Subsequent ADRs (ADR-0023 D9's IPCQ atomic write, ADR-0027's Megatron TP scratch-overwrite avoidance, ADR-0046 D3.2's `tl.store` visibility) depend on op_log and MemoryStore behavior, but **the exact record fields / space names / snapshot timing** are only discoverable via source grep. This ADR codifies them.

## Decision

### D1. `OpRecord` schema: seven fields

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str   # "memory" | "gemm" | "math" | "unknown"
    op_name: str   # e.g. "dma_read", "gemm_f16", "exp",
                   # "TileToken/DMA_READ", "composite_gemm",
                   # "ipcq_copy"
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)
```

- **`t_start` / `t_end`**: SimPy time (float ns). `t_start` is when the component begins the op; `t_end` is completion. Duration = `t_end - t_start`.
- **`component_id`**: the node id where the op occurred (e.g., `"sip0.cube0.pe0.pe_dma"`).
- **`op_kind`**: one of the four values above. Phase 2 DataExecutor branches on this.
- **`op_name`**: a debug/analysis-friendly name. For a TileToken, it expands to `"TileToken/{stage_type}"` (e.g., `"TileToken/DMA_READ"`) to disambiguate stages.
- **`params`**: op-specific metadata dict (see D3).
- **`dependency_ids`**: currently unused (default `[]`). Reserved for future cross-op dependency tracking.

### D2. `OpLogger.records`: guaranteed `t_start` sort

```python
@property
def records(self) -> list[OpRecord]:
    # list.sort is stable: equal t_start values keep insertion order.
    self._records.sort(key=lambda r: r.t_start)
    return self._records
```

A stable sort by `t_start` runs on each access. Records with the same `t_start` preserve insertion order. This aligns with ADR-0020 D5's "t_start stable ordering" requirement.

Phase 2 DataExecutor always accesses records via the `records` property, so even when `record_end` calls arrive out of `t_start` order (e.g., a short op started later but finished earlier), the sequence handed to Phase 2 is consistent.

### D3. `params` schema per `op_name` (matrix from `_extract_op_info`)

#### D3.1. `op_kind="memory", op_name="dma_read"` (DmaReadCmd)

```python
{"src_addr": int, "nbytes": int, "handle_id": str}
```

#### D3.2. `op_kind="memory", op_name="dma_write"` (DmaWriteCmd)

```python
{
    "src_space": str,    # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
    "src_addr": int,     # handle.addr
    "shape": tuple,
    "dtype": str,
    "dst_space": "hbm",  # DmaWrite always targets HBM
    "dst_addr": int,
    "nbytes": int,
    "handle_id": str,
    # When src_space == "hbm" at record_end, a snapshot is added (D4)
    "snapshot": np.ndarray | None,
}
```

#### D3.3. `op_kind="gemm", op_name=f"gemm_{dtype_a}"` (GemmCmd)

```python
{
    "src_a_addr": int, "src_b_addr": int, "dst_addr": int,
    "shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
    "dtype_in": str, "dtype_out": str,
    "m": int, "k": int, "n": int,
    # ADR-0027: per-operand + output spaces preserved
    "src_a_space": str, "src_b_space": str, "dst_space": str,
}
```

#### D3.4. `op_kind="math", op_name=msg.op` (MathCmd; op = "exp", "sum", "add", "where", …)

```python
{
    "input_addrs": list[int],    # addrs of input handles
    "input_shapes": list[tuple],
    "input_spaces": list[str],
    "input_dtypes": list[str],
    "dst_addr": int,
    "dst_space": str,
    "shape_out": tuple,
    "dtype": str,
    "axis": int | None,          # only meaningful for reductions
    # All inputs get snapshots at record_end (D4)
    "input_snapshots": list[np.ndarray | None],
}
```

#### D3.5. `op_kind="gemm" or "math", op_name=f"composite_{op}"` (CompositeCmd)

```python
{
    "op": str,           # "gemm" | "math"
    "out_addr": int,
    "out_nbytes": int,
    # If op == "gemm", same fields as GemmCmd are added:
    "src_a_addr": int, "src_b_addr": int,
    "shape_a": tuple, "shape_b": tuple,
    "dtype_in": str, "dtype_out": str,
    "src_a_space": str, "src_b_space": str,
    "dst_space": "hbm",
    "dst_addr": int,     # = out_addr
}
```

If `op == "gemm"`, `op_kind = "gemm"`; otherwise `"math"`. This aliasing lets Phase 2 replay composite-gemm on the same path as `GemmCmd`.

#### D3.6. `op_kind="memory", op_name="ipcq_copy"` (record_copy path)

```python
{
    "src_space": str, "src_addr": int,
    "dst_space": str, "dst_addr": int,
    "shape": tuple, "dtype": str, "nbytes": int,
    "snapshot": np.ndarray | None,  # passed by caller; if None, record_copy reads fresh
}
```

`PE_DMA._handle_ipcq_inbound` (ADR-0023 D9) emits this record so Phase 2 can replay the IPCQ slot's inbound copy. It bypasses `record_start` / `record_end` and pushes directly via `record_copy()`.

#### D3.7. `op_kind="unknown", op_name=type(msg).__name__`

Fallback for messages `_extract_op_info` doesn't recognize. `params = {}`. If DataExecutor encounters this kind, it skips the record, so Phase 2 replay is unaffected.

### D4. Snapshot capture timing

When `OpLogger._memory_store` is set, `record_end` performs:

- **Math op**: read every input (addr/shape/space/dtype) via `self._memory_store.read(...)` and attach an ndarray copy to `params["input_snapshots"]`. A failed read yields `None` for that input.
- **`dma_write` op**: snapshot the source **only if `src_space == "hbm"`** and attach it to `params["snapshot"]`. TCM (PE scratch) sources are **deliberately skipped**: TCM is repopulated by Phase 2 math/gemm replay, and a Phase-1-time snapshot would capture a previous kernel's stale value (ADR-0027 postmortem: TP gemm → all_reduce race).
- **`ipcq_copy`**: the caller passes the in-flight snapshot via `snapshot=token.data`. If absent, `record_copy` attempts a fresh read from MemoryStore.

Snapshots are taken with `.copy()` (fresh allocation), making them safe against later storage mutation. This is the foundation of ADR-0027's "cross-PE Phase 2 ordering" race-avoidance.

When `memory_store` is `None` (Phase 1 timing-only mode), all snapshot steps are skipped. Only the timing portion of the record is preserved; data replay is unavailable.

### D5. TileToken handling: `record_start` captures stage info

ADR-0014 D6's self-routing tile token (pipeline mode) may have already advanced its `stage_idx` by the time `record_end` runs (the TileToken caches the next stage's params as it moves to the next component). Therefore, `record_start` pre-saves the following in `pending[id(msg)]["snap"]`:

```python
snap["stage_type"] = stage.stage_type.name   # "DMA_READ", "GEMM", ...
snap["stage_params"] = dict(stage.params)    # copy of params at start time
```

`record_end` retrieves this snap and merges it into params:

- Adds `params["stage_type"]` to the final params.
- Merges `stage_params` keys (keeps existing values if any).
- If `op_name == "TileToken"`, rewrites it to `f"TileToken/{stage_type}"` (e.g., `"TileToken/DMA_READ"`), disambiguating different stages emitted by the same component.

As a result, stages coming from the same component (e.g., `pe_dma`), such as DMA_READ vs DMA_WRITE or FETCH vs STORE, are distinguishable in reports.

### D6. `MemoryStore`: two-level (space, addr) dict

```python
import numpy as np


class MemoryStore:
    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}

    def write(self, space, addr, data):
        # Create the inner dict lazily the first time a space is written (D6.1).
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
    def has(self, space, addr) -> bool: ...
    def snapshot(self) -> "MemoryStore": ...
```

#### D6.1. Space namespace

A string key. Standard values:

- `"hbm"`: HBM data (deploy_tensor + Phase 2 dma_write results).
- `"tcm"`: PE-local TCM (Phase 2 math/gemm output).
- `"sram"`: cube-level SRAM (ADR-0023 D9.7's IPCQ slot tier).

Other spaces (e.g., `"reg"`) are allowed: `_storage` is a lazy dict that creates a new space when `write` first touches it.

#### D6.2. Address keying

`addr` is an integer. It may be a **physical address (PA) or a virtual address (VA)**. `MemoryStore` itself doesn't know address-space semantics; it just uses them as keys.
Phase 1's `MemoryWriteMsg` writes both PA and VA (`_create_tensor` zero-inits at PA and at the VA base too); Phase 2 reads/writes via the addresses captured by op_log. The caller decides `addr`'s meaning; `MemoryStore` provides only lookup.

#### D6.3. read/write semantics: reference store (no copy)

`write(space, addr, data)`: stores the ndarray reference. **No copy.** If the caller later mutates the same ndarray, the stored value changes.

`read(space, addr, shape=None, dtype=None)`: returns the stored ndarray reference. If `shape`/`dtype` are provided:

- `dtype != stored.dtype`: `arr.view(np_dtype)` reinterprets as a view (no copy).
- `shape != stored.shape`: if `nbytes` matches, `arr.reshape(shape)` is a view.
- `nbytes` mismatch: raises `ValueError`.

To detach the data, the caller must call `arr.copy()`. ADR-0027's race-avoidance requires explicit `.copy()` in op_log snapshot steps for exactly this reason.

#### D6.4. `has(space, addr) -> bool`

Existence check; does not materialize data.

#### D6.5. `snapshot() -> MemoryStore`

Shallow copy. Creates new inner dicts but shares ndarray references. Used at Phase 2 init to fork from Phase 1's store, so Phase 2 writes that rebind an address don't affect Phase 1's remaining consumers. Because the ndarrays themselves are shared, an in-place mutation of a stored array would still be visible through both stores.

### D7. op_log assumes a single-threaded SimPy

`OpLogger`'s `_records` and `_pending` are lock-free. SimPy is single-threaded, so nothing else can intrude between `record_start` and `record_end` for the same message.

When multi-process kernbench (ADR-0047 D6) arrives, OpLogger must be split per process: one OpLogger instance cannot receive records from multiple processes.

## Alternatives Considered

### A1. Externalize op_log to SQLite / parquet

Rejected (currently). The in-memory list minimizes Phase 1 → Phase 2 hand-off latency. Externalization makes sense for long-running batch runs but adds overhead for the current single-run workload.

### A2. Capture snapshots at `record_start`

Rejected.
At `record_start`, inputs are often not yet populated (e.g., a math op's input is the output of a just-issued previous op). `record_end` is the correct point.

### A3. Per-component MemoryStore

Rejected. The (space, addr) key already disambiguates effectively, and splitting per component would complicate cross-PE IPCQ copy (ADR-0023 D9), which needs access to both source and destination stores.

### A4. Explicit dependency edges in op_log

Partially adopted. The `dependency_ids` field exists on `OpRecord` but is currently unused (D1). Phase 2 DataExecutor orders via `t_start` plus a secondary sort (memory ops before math at the same `t_start`). If an explicit dependency graph is ever required, this field is its home. The current ordering rules are sufficient, so it remains unused.

## Consequences

- ADR-0020's op_log / MemoryStore declarations are expanded into the concrete D1–D6 schemas, so writing or modifying Phase 2 DataExecutor no longer requires a source grep to learn field semantics.
- D3's per-`op_name` params matrix makes adding a new op (e.g., a new reduction type) a matter of adding a branch in `_extract_op_info`.
- D4's per-op snapshot policy (math = input snapshot, dma_write = HBM-only snapshot) is ADR-locked, so ADR-0027's race-avoidance decision won't silently regress in future refactors.
- D6.3's reference-store semantics are explicit, putting mutation safety on the caller. This justifies ADR-0027's explicit `.copy()` pattern.
- D7's single-thread assumption is recorded, so multi-process kernbench (ADR-0047 D6's supersession candidate) will need OpLogger separation when introduced.
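
The reference-store behavior from D6.3 and the shallow fork from D6.5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the `sim_engine` implementation: `MiniStore` is a hypothetical class, and plain Python lists stand in for `np.ndarray` payloads (both are mutable and both expose `.copy()`).

```python
from __future__ import annotations


class MiniStore:
    """Hypothetical stand-in for MemoryStore: space -> addr -> payload."""

    def __init__(self) -> None:
        self._storage: dict[str, dict[int, list[float]]] = {}

    def write(self, space: str, addr: int, data: list[float]) -> None:
        # Inner dict is created lazily; the reference is stored, not a copy.
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space: str, addr: int) -> list[float]:
        return self._storage[space][addr]

    def snapshot(self) -> MiniStore:
        # Shallow fork: fresh inner dicts, shared payload objects.
        fork = MiniStore()
        fork._storage = {s: dict(inner) for s, inner in self._storage.items()}
        return fork


buf = [1.0, 2.0]
store = MiniStore()
store.write("tcm", 0, buf)

frozen = store.read("tcm", 0).copy()  # detached, like an op_log snapshot
buf[0] = 99.0                         # caller mutates after write()

fork = store.snapshot()
fork.write("tcm", 0, [7.0])           # rebinding: only the fork sees it
fork.write("hbm", 64, buf)            # new space exists in the fork only

print(store.read("tcm", 0))   # reflects the caller's mutation
print(frozen)                 # unaffected by the mutation
print(fork.read("tcm", 0))    # the fork's rebound value
```

The detached `frozen` value is the reason the snapshot steps in D4 copy explicitly: anything read without `.copy()` keeps tracking later mutations of the stored object.
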