kernbench2/docs/adr/ADR-0052-dev-oplog-memory-store-schemas.md
ywkang bd49c93703 adr: add ADR-0050-0053 — close /report's second-pass G4 candidates
Documents four cross-cutting surfaces one layer deeper than the prior
G4 batch:

- 0050 par-ccl-algorithm-module-contract: how to author a new CCL
  algorithm in src/kernbench/ccl/algorithms/. Pairs with ADR-0045's
  bench-module contract. Pins the four required public symbols
  (kernel, kernel_args, TOPO_NAME_TO_KIND constants, kernel alias),
  the 9 + tl standardized kernel signature, the kernel_args tuple
  format, sip_topo_kind dispatch, and the ccl.yaml entry workflow.

- 0051 lat-routing-helper-api: every public method of AddressResolver
  (resolve, find_m_cpu, find_pcie_ep, find_io_cpu, find_all_pcie_eps)
  and PathRouter (find_path, find_path_with_distance,
  find_mcpu_dma_path, find_memory_path, find_node_path + 2 shims).
  Pins the four adjacency graphs (_adj_all / _adj / _adj_mcpu_dma /
  _adj_local) and the edge-kind exclusion sets they use, plus the
  single-owner naming convention.

- 0052 dev-oplog-memory-store-schemas: OpRecord's 7 fields, the
  per-op_name params matrix (dma_read, dma_write, gemm_*, math, math
  reduction, composite_gemm, ipcq_copy, unknown), snapshot timing
  rules (math = all inputs, dma_write = HBM-only — ADR-0027 race
  avoidance), TileToken stage_type capture, and MemoryStore's
  (space, addr) two-level dict with reference-store semantics.

- 0053 dev-topology-builder-algorithms: the 6-stage compile pipeline,
  cube_mesh.yaml's source_hash cache and its 5 input fields, the
  cube NoC auto-layout algorithm (row/col placement, HBM exclusion
  zone, PE/M_CPU/SRAM attachment via nearest-router, UCIe N/S/E/W
  distribution), the node naming convention (single-owner with
  router.py), the edge-kind catalog, the 4 view projections, and a
  table of spec-field changes vs mesh regeneration.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:52:42 -07:00


ADR-0052: OpLog + MemoryStore Schemas — sim_engine internals

Status

Accepted (2026-05-22).

Pins down the OpRecord schema and the record_start / record_end / record_copy behavior in sim_engine/op_log.py, plus the (space, addr) namespace and read/write semantics of MemoryStore in sim_engine/memory_store.py. ADR-0020 (2-pass data execution) declares that these two facilities exist, but the precise record fields and semantics had no ADR-level coverage, and several recent ADRs (ADR-0046 D3.2's tl.store visibility, ADR-0023 D9's IPCQ copy record) depend on these semantics.

First action

OpLogger(memory_store=None)

On construction, initialize three fields:

  1. self._records: list[OpRecord] = [] — accumulated records.
  2. self._pending: dict[int, dict] = {} — partial records keyed by id(msg) (created at record_start, completed at record_end).
  3. self._memory_store = memory_store — optional MemoryStore reference. Used to capture math-op input snapshots and dma_write HBM-source snapshots.

Records and pending are empty; the record_* calls accumulate data over time.

MemoryStore()

On construction, initialize a single field: self._storage: dict[str, dict[int, np.ndarray]] = {} — a two-level dict (space → addr → ndarray). Inner dicts are created lazily as new spaces appear.

In short, both facilities start empty: OpLogger with an empty record list and pending map, MemoryStore with an empty, sparse per-space dict. The first record or write populates them.
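The initial state described above can be sketched as follows. This is a reduced illustration, not the code in sim_engine: the class bodies contain only the constructors, and `Any` stands in for OpRecord and np.ndarray so the snippet stays self-contained. The variable names `timing_only` and `with_data` are hypothetical.

```python
from __future__ import annotations

from typing import Any, Optional


class MemoryStore:
    """Sketch: only the initial state described above."""

    def __init__(self) -> None:
        # space -> addr -> payload (an np.ndarray in the real store)
        self._storage: dict[str, dict[int, Any]] = {}


class OpLogger:
    """Sketch: only the initial state described above."""

    def __init__(self, memory_store: Optional[MemoryStore] = None) -> None:
        self._records: list[Any] = []        # completed OpRecord entries
        self._pending: dict[int, dict] = {}  # id(msg) -> partial record
        self._memory_store = memory_store    # None means no snapshots are taken


timing_only = OpLogger()                          # no data replay possible
with_data = OpLogger(memory_store=MemoryStore())  # snapshots can be captured
```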

Context

ADR-0020 D2/D5/D7 (2-pass data execution) declares:

  • During Phase 1 (timing), ComponentBase._on_process_start/end hooks call OpLogger.record_start/end, recording the time and metadata of every data op.
  • Phase 2 (data) replays the op log in t_start order to compute real data.
  • Data payloads live in MemoryStore, keyed by (space, addr).

Subsequent ADRs (ADR-0023 D9's IPCQ atomic write, ADR-0027's Megatron TP scratch-overwrite avoidance, ADR-0046 D3.2's tl.store visibility) depend on op_log and MemoryStore behavior, but the exact record fields / space names / snapshot timing are only discoverable via source grep. This ADR codifies them.

Decision

D1. OpRecord schema — seven fields

@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str               # "memory" | "gemm" | "math" | "unknown"
    op_name: str               # e.g. "dma_read", "gemm_f16", "exp",
                               #     "TileToken/DMA_READ", "composite_gemm",
                               #     "ipcq_copy"
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)

  • t_start / t_end: SimPy time (float ns). t_start is when the component begins the op; t_end is completion. Duration = t_end - t_start.
  • component_id: the node id where the op occurred (e.g., "sip0.cube0.pe0.pe_dma").
  • op_kind: one of four. Phase 2 DataExecutor branches on this.
  • op_name: a debug/analysis-friendly name. For a TileToken, expands to "TileToken/{stage_type}" (e.g., "TileToken/DMA_READ") to disambiguate stages.
  • params: op-specific metadata dict (see D3).
  • dependency_ids: currently unused (default []). Reserved for future cross-op dependency tracking.

D2. OpLogger.records — guaranteed t_start sort

@property
def records(self) -> list[OpRecord]:
    self._records.sort(key=lambda r: r.t_start)
    return self._records

A stable sort by t_start runs on each access, so records with the same t_start preserve insertion order. This aligns with ADR-0020 D5's "t_start stable ordering" requirement.

Phase 2 DataExecutor always accesses via the records property, so even when record_end calls arrive out of t_start order (e.g., a short op started later but finished earlier), the sequence handed to Phase 2 is consistent.
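A small sketch of the ordering guarantee, assuming records are appended in completion order as described. The OpLogger here is reduced to the `records` property, and the component ids and op names in the demo are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str
    op_name: str
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)


class OpLogger:
    def __init__(self) -> None:
        self._records: list[OpRecord] = []

    @property
    def records(self) -> list[OpRecord]:
        # list.sort is stable: ties on t_start keep insertion order.
        self._records.sort(key=lambda r: r.t_start)
        return self._records


log = OpLogger()
# Appended in completion order, which differs from start order.
log._records.append(OpRecord(5.0, 6.0, "pe1", "math", "exp", {}))
log._records.append(OpRecord(0.0, 9.0, "pe0", "memory", "dma_read", {}))
log._records.append(OpRecord(5.0, 8.0, "pe2", "math", "add", {}))  # ties with "exp"

order = [r.op_name for r in log.records]
```

The long-running dma_read moves to the front, while "exp" and "add" share a t_start and keep the order in which they were appended.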

D3. params schema per op_name (matrix from _extract_op_info)

D3.1. op_kind="memory", op_name="dma_read" (DmaReadCmd)

{"src_addr": int, "nbytes": int, "handle_id": str}

D3.2. op_kind="memory", op_name="dma_write" (DmaWriteCmd)

{
    "src_space": str,   # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
    "src_addr": int,    # handle.addr
    "shape": tuple, "dtype": str,
    "dst_space": "hbm", # DmaWrite always targets HBM
    "dst_addr": int,
    "nbytes": int,
    "handle_id": str,
    # When src_space == "hbm" at record_end, a snapshot is added (D4)
    "snapshot": np.ndarray | None,
}

D3.3. op_kind="gemm", op_name=f"gemm_{dtype_a}" (GemmCmd)

{
    "src_a_addr": int, "src_b_addr": int, "dst_addr": int,
    "shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
    "dtype_in": str, "dtype_out": str,
    "m": int, "k": int, "n": int,
    # ADR-0027: per-operand + output spaces preserved
    "src_a_space": str, "src_b_space": str, "dst_space": str,
}

D3.4. op_kind="math", op_name=msg.op (MathCmd; op = "exp", "sum", "add", "where", …)

{
    "input_addrs": list[int],   # addrs of input handles
    "input_shapes": list[tuple],
    "input_spaces": list[str],
    "input_dtypes": list[str],
    "dst_addr": int, "dst_space": str,
    "shape_out": tuple, "dtype": str,
    "axis": int | None,         # only meaningful for reductions
    # All inputs get snapshots at record_end (D4)
    "input_snapshots": list[np.ndarray | None],
}

D3.5. op_kind="gemm" or "math", op_name=f"composite_{op}" (CompositeCmd)

{
    "op": str,              # "gemm" | "math"
    "out_addr": int, "out_nbytes": int,
    # If op == "gemm", same fields as GemmCmd are added:
    "src_a_addr": int, "src_b_addr": int,
    "shape_a": tuple, "shape_b": tuple,
    "dtype_in": str, "dtype_out": str,
    "src_a_space": str, "src_b_space": str,
    "dst_space": "hbm", "dst_addr": int,  # = out_addr
}

If op == "gemm", op_kind is "gemm"; otherwise it is "math". This aliasing lets Phase 2 replay composite-gemm on the same path as GemmCmd.

D3.6. op_kind="memory", op_name="ipcq_copy" (record_copy path)

{
    "src_space": str, "src_addr": int,
    "dst_space": str, "dst_addr": int,
    "shape": tuple, "dtype": str, "nbytes": int,
    "snapshot": np.ndarray | None,   # passed by caller; if None, record_copy reads fresh
}

PE_DMA._handle_ipcq_inbound (ADR-0023 D9) emits this record so Phase 2 can replay the IPCQ slot's inbound copy. It bypasses record_start / record_end and pushes directly via record_copy().

D3.7. op_kind="unknown", op_name=type(msg).__name__

Fallback for messages that _extract_op_info doesn't recognize; params = {}. DataExecutor skips records of this kind, so Phase 2 replay is unaffected.
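The shape of the D3 matrix can be sketched as a type dispatch. Only three rows are shown (dma_read, composite, and the unknown fallback). The message classes below are hypothetical stand-ins with just the fields the sketch needs, and the function name and return shape are assumptions rather than the real _extract_op_info signature.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class DmaReadCmd:    # hypothetical stand-in for the real command
    src_addr: int
    nbytes: int
    handle_id: str


@dataclass
class CompositeCmd:  # hypothetical stand-in for the real command
    op: str
    out_addr: int
    out_nbytes: int


@dataclass
class Heartbeat:     # hypothetical message the extractor does not recognize
    seq: int


def extract_op_info(msg: Any) -> tuple[str, str, dict[str, Any]]:
    """Return (op_kind, op_name, params) for a message."""
    if isinstance(msg, DmaReadCmd):
        return "memory", "dma_read", {
            "src_addr": msg.src_addr,
            "nbytes": msg.nbytes,
            "handle_id": msg.handle_id,
        }
    if isinstance(msg, CompositeCmd):
        # Composite gemm is aliased to the "gemm" kind; anything else is "math".
        op_kind = "gemm" if msg.op == "gemm" else "math"
        return op_kind, f"composite_{msg.op}", {
            "op": msg.op,
            "out_addr": msg.out_addr,
            "out_nbytes": msg.out_nbytes,
        }
    # D3.7 fallback: class name as op_name, empty params.
    return "unknown", type(msg).__name__, {}


read_info = extract_op_info(DmaReadCmd(0x1000, 256, "h0"))
gemm_info = extract_op_info(CompositeCmd("gemm", 0x2000, 512))
unknown_info = extract_op_info(Heartbeat(seq=1))
```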

D4. Snapshot capture timing

When OpLogger._memory_store is set, record_end performs:

  • Math op: read every input via self._memory_store.read(...), using its recorded addr/shape/space/dtype, and attach an ndarray copy to params["input_snapshots"]. A failed read yields None for that input.
  • dma_write op: snapshot the source only if src_space == "hbm" and attach to params["snapshot"]. TCM (PE scratch) sources are deliberately skipped — TCM is repopulated by Phase 2 math/gemm replay, and a Phase-1-time snapshot would capture a previous kernel's stale value (ADR-0027 postmortem: TP gemm → all_reduce race).
  • ipcq_copy: the caller passes the in-flight snapshot via snapshot=token.data. If absent, record_copy attempts a fresh read from MemoryStore.

Snapshots are taken with .copy() (fresh allocation), making them safe against later storage mutation. This is the foundation of ADR-0027's "cross-PE Phase 2 ordering" race-avoidance.

When memory_store is None (Phase 1 timing-only mode), all snapshot steps are skipped. Only the timing portion of the record is preserved; data replay is unavailable.
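The per-op snapshot policy above can be sketched as a single helper. Several details are assumptions made for illustration: the helper names `attach_snapshots` and `_try_snapshot`, treating a missing key (KeyError) as the "read failure" case, and storing None for a skipped TCM source. The MemoryStore is a reduced stand-in.

```python
from __future__ import annotations

from typing import Any, Optional

import numpy as np


class MemoryStore:
    """Reduced stand-in: reference store keyed by (space, addr)."""

    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}

    def write(self, space: str, addr: int, data: np.ndarray) -> None:
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space: str, addr: int) -> np.ndarray:
        return self._storage[space][addr]  # reference, no copy


def _try_snapshot(store: MemoryStore, space: str, addr: int) -> Optional[np.ndarray]:
    try:
        return store.read(space, addr).copy()  # detach from live storage
    except KeyError:  # assumption: a missing key is the "read failure" case
        return None


def attach_snapshots(op_kind: str, op_name: str, params: dict[str, Any],
                     store: Optional[MemoryStore]) -> None:
    if store is None:  # timing-only mode: no snapshots at all
        return
    if op_kind == "math":
        params["input_snapshots"] = [
            _try_snapshot(store, space, addr)
            for space, addr in zip(params["input_spaces"], params["input_addrs"])
        ]
    elif op_name == "dma_write":
        # Only HBM sources are captured; TCM sources are left to Phase 2 replay.
        is_hbm = params.get("src_space", "tcm") == "hbm"
        params["snapshot"] = (
            _try_snapshot(store, "hbm", params["src_addr"]) if is_hbm else None
        )


store = MemoryStore()
hbm_tensor = np.arange(4, dtype=np.float32)
store.write("hbm", 0x1000, hbm_tensor)
store.write("tcm", 0x20, np.ones(4, dtype=np.float32))

math_params = {"input_spaces": ["hbm", "tcm"], "input_addrs": [0x1000, 0x999]}
hbm_write = {"src_space": "hbm", "src_addr": 0x1000}
tcm_write = {"src_space": "tcm", "src_addr": 0x20}
timing_only = {"src_space": "hbm", "src_addr": 0x1000}

attach_snapshots("math", "exp", math_params, store)
attach_snapshots("memory", "dma_write", hbm_write, store)
attach_snapshots("memory", "dma_write", tcm_write, store)
attach_snapshots("memory", "dma_write", timing_only, None)

hbm_tensor[:] = -1.0  # later mutation of the stored array
```

Because each snapshot is a copy, the final in-place overwrite of `hbm_tensor` does not change what was captured for `hbm_write` or for the math op's first input.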

D5. TileToken handling — record_start captures stage info

ADR-0014 D6's self-routing tile token (pipeline mode) may have already advanced its stage_idx by the time record_end runs (the TileToken caches the next stage's params as it moves to the next component). Therefore:

record_start pre-saves the following in pending[id(msg)]["snap"]:

snap["stage_type"] = stage.stage_type.name        # "DMA_READ", "GEMM", ...
snap["stage_params"] = dict(stage.params)         # copy of params at start time

record_end retrieves this snap and merges into params:

  • Adds params["stage_type"] to final params.
  • Merges stage_params keys (keeps existing values if any).
  • If op_name == "TileToken", rewrites it to f"TileToken/{stage_type}" (e.g., "TileToken/DMA_READ"), disambiguating different stages emitted by the same component.

As a result, stages emitted by the same component (e.g., pe_dma), such as DMA_READ vs DMA_WRITE or FETCH vs STORE, are distinguishable in reports.
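The merge step in record_end can be sketched as a pure function. The function name `merge_stage_snap` is hypothetical, and the snap layout follows the two keys listed above; the real record_end does this inline alongside its other bookkeeping.

```python
from __future__ import annotations

from typing import Any


def merge_stage_snap(op_name: str, params: dict[str, Any],
                     snap: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Merge the stage info saved at record_start into the final params."""
    merged = dict(params)
    merged["stage_type"] = snap["stage_type"]
    for key, value in snap["stage_params"].items():
        merged.setdefault(key, value)  # values already in params win
    if op_name == "TileToken":
        op_name = f"TileToken/{snap['stage_type']}"
    return op_name, merged


# Saved at record_start, before the token advances to its next stage.
snap = {"stage_type": "DMA_READ", "stage_params": {"nbytes": 128, "src_addr": 16}}

name, final_params = merge_stage_snap("TileToken", {"nbytes": 64}, snap)
```

In this example the pre-existing `nbytes` value of 64 is kept, `src_addr` is filled in from the saved stage params, and the op name becomes "TileToken/DMA_READ".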

D6. MemoryStore — two-level (space, addr) dict

class MemoryStore:
    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}

    def write(self, space, addr, data): self._storage.setdefault(space, {})[addr] = data
    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
    def has(self, space, addr) -> bool: ...
    def snapshot(self) -> MemoryStore: ...

D6.1. Space namespace

A string key. Standard values:

  • "hbm": HBM data (deploy_tensor + Phase 2 dma_write results).
  • "tcm": PE-local TCM (Phase 2 math/gemm output).
  • "sram": cube-level SRAM (ADR-0023 D9.7's IPCQ slot tier).

Other spaces (e.g., "reg") are allowed — _storage is a lazy dict that creates a new space when write first touches it.

D6.2. Address keying

addr is an integer. It may be a physical address (PA) or a virtual address (VA); MemoryStore itself doesn't know address-space semantics and just uses them as keys. Phase 1's MemoryWriteMsg writes both PA and VA (_create_tensor zero-inits at PA and at the VA base too); Phase 2 reads/writes via the addresses captured by op_log.

The caller decides addr's meaning — MemoryStore provides only lookup.

D6.3. read/write semantics — reference store (no copy)

write(space, addr, data): stores the ndarray reference. No copy. If the caller later mutates the same ndarray, the stored value changes.

read(space, addr, shape=None, dtype=None): returns the stored ndarray reference. If shape/dtype are provided:

  • dtype != stored.dtype: arr.view(np_dtype) reinterprets as a view (no copy).
  • shape != stored.shape: if nbytes matches, arr.reshape(shape) is a view.
  • nbytes mismatch → ValueError.

To detach the data, the caller must call arr.copy(). ADR-0027's race-avoidance requires explicit .copy() in op_log snapshot steps for exactly this reason.
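The reference-store behavior can be demonstrated with a reduced MemoryStore. This is a sketch under assumptions: dtype is given as something np.dtype() accepts (the real store may map its own dtype strings), and the error message text is invented.

```python
from __future__ import annotations

import numpy as np


class MemoryStore:
    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}

    def write(self, space, addr, data) -> None:
        self._storage.setdefault(space, {})[addr] = data  # reference, no copy

    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray:
        arr = self._storage[space][addr]
        if dtype is not None and np.dtype(dtype) != arr.dtype:
            arr = arr.view(np.dtype(dtype))  # reinterpret bytes, no copy
        if shape is not None and tuple(shape) != arr.shape:
            if int(np.prod(shape)) * arr.itemsize != arr.nbytes:
                raise ValueError(f"shape {shape} does not match {arr.nbytes} bytes")
            arr = arr.reshape(shape)  # a view for contiguous arrays
        return arr


store = MemoryStore()
data = np.zeros(4, dtype=np.float32)
store.write("tcm", 0x0, data)

grid = store.read("tcm", 0x0, shape=(2, 2))    # reshaped view
raw = store.read("tcm", 0x0, dtype=np.uint32)  # reinterpreted view
detached = store.read("tcm", 0x0).copy()       # explicit detach

data[0] = 1.0  # caller mutates the array after writing it
```

After the final assignment, `grid[0, 0]` reads 1.0 and `raw[0]` holds the float's bit pattern, because both are views of the stored array. `detached[0]` is still 0.0, which is why the op_log snapshot steps call .copy().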

D6.4. has(space, addr) -> bool

Existence check; does not materialize data.

D6.5. snapshot() -> MemoryStore

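Shallow copy. Creates new inner dicts but shares the ndarray references. Used at Phase 2 init to fork from Phase 1's store, so Phase 2 writes that rebind an address don't affect Phase 1's remaining consumers. Because the arrays themselves are shared, an in-place mutation of a stored ndarray remains visible through both stores.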
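A minimal sketch of the shallow fork, assuming snapshot() copies each inner dict with dict(). It shows which changes stay isolated (writes that bind a key) and which do not (in-place edits of a shared array). The addresses and variable names are illustrative.

```python
from __future__ import annotations

import numpy as np


class MemoryStore:
    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}

    def write(self, space, addr, data) -> None:
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space, addr) -> np.ndarray:
        return self._storage[space][addr]

    def has(self, space, addr) -> bool:
        return addr in self._storage.get(space, {})

    def snapshot(self) -> "MemoryStore":
        fork = MemoryStore()
        # New inner dicts, same ndarray objects.
        fork._storage = {space: dict(inner) for space, inner in self._storage.items()}
        return fork


phase1 = MemoryStore()
weights = np.zeros(2, dtype=np.float32)
shared = np.zeros(2, dtype=np.float32)
phase1.write("hbm", 0x0, weights)
phase1.write("hbm", 0x8, shared)

phase2 = phase1.snapshot()
phase2.write("hbm", 0x0, np.ones(2, dtype=np.float32))   # rebinds: isolated
phase2.write("tcm", 0x40, np.ones(2, dtype=np.float32))  # new key: isolated
phase2.read("hbm", 0x8)[0] = 5.0                          # in place: shared
```

Here phase1 still returns the original `weights` at ("hbm", 0x0) and has no "tcm" entry, but the in-place edit at ("hbm", 0x8) is visible from both stores.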

D7. op_log assumes a single-threaded SimPy

OpLogger's _records and _pending are lock-free. SimPy is single-threaded, so nothing else can intrude between record_start and record_end for the same message.

When multi-process kernbench (ADR-0047 D6) arrives, OpLogger must be split per process — one OpLogger instance cannot receive records from multiple processes.

Alternatives Considered

A1. Externalize op_log to SQLite / parquet

Rejected (currently). The in-memory list minimizes Phase 1 → Phase 2 hand-off latency. Externalization makes sense for long-running batch runs but adds overhead for the current single-run workload.

A2. Capture snapshots at record_start

Rejected. At record_start, inputs are often not yet populated (e.g., a math op's input is the output of a just-issued previous op). record_end is the correct point.

A3. Per-component MemoryStore

Rejected. The (space, addr) key already disambiguates effectively, and splitting per component would complicate cross-PE IPCQ copy (ADR-0023 D9), which needs access to both source and destination stores.

A4. Explicit dependency edges in op_log

Partially adopted. The dependency_ids field exists on OpRecord but is currently unused (D1). Phase 2 DataExecutor orders via t_start + a secondary sort (memory ops before math at the same t_start). When an explicit dependency graph is required, this field is the home. Current ordering rules are sufficient, so it remains unused.
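The current ordering rule can be sketched as a two-part sort key. The source only states that memory ops precede math ops at the same t_start, so ranking every non-memory kind equally is an assumption of this sketch, as are the `Op` tuple and the `replay_order` name.

```python
from __future__ import annotations

from typing import NamedTuple


class Op(NamedTuple):
    t_start: float
    op_kind: str
    op_name: str


def replay_order(records: list[Op]) -> list[Op]:
    # Primary key: t_start. Secondary key: memory ops first.
    # Assumption: all non-memory kinds share the same rank.
    return sorted(records, key=lambda r: (r.t_start, 0 if r.op_kind == "memory" else 1))


ops = [
    Op(10.0, "math", "add"),
    Op(10.0, "memory", "dma_read"),
    Op(5.0, "gemm", "gemm_f16"),
]
ordered = [op.op_name for op in replay_order(ops)]
```

The dma_read is replayed before the add that starts at the same time, so the math op sees populated inputs without any explicit dependency edge.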

Consequences

  • ADR-0020's op_log / MemoryStore declarations are expanded into the concrete D1-D6 schemas, so writing/modifying Phase 2 DataExecutor doesn't need source-grep to learn field semantics.
  • D3's per-op_name params matrix makes adding new ops (e.g., a new reduction type) a question of branching in _extract_op_info.
  • D4's per-op snapshot policy (math = input snapshot, dma_write = HBM-only snapshot) is ADR-locked, so ADR-0027's race-avoidance decision won't silently regress on future refactors.
  • D6.3's reference-store semantics are explicit, putting mutation safety on the caller. ADR-0027's explicit .copy() pattern is justified.
  • D7's single-thread assumption is recorded, so multi-process kernbench (ADR-0047 D6's supersession candidate) will need OpLogger separation when introduced.