Documents four cross-cutting surfaces one layer deeper than the prior G4 batch:
- 0050 par-ccl-algorithm-module-contract: how to author a new CCL algorithm in src/kernbench/ccl/algorithms/. Pairs with ADR-0045's bench-module contract. Pins the four required public symbols (kernel, kernel_args, TOPO_NAME_TO_KIND constants, kernel alias), the 9 + tl standardized kernel signature, the kernel_args tuple format, sip_topo_kind dispatch, and the ccl.yaml entry workflow.
- 0051 lat-routing-helper-api: every public method of AddressResolver (resolve, find_m_cpu, find_pcie_ep, find_io_cpu, find_all_pcie_eps) and PathRouter (find_path, find_path_with_distance, find_mcpu_dma_path, find_memory_path, find_node_path + 2 shims). Pins the four adjacency graphs (_adj_all / _adj / _adj_mcpu_dma / _adj_local) and the edge-kind exclusion sets they use, plus the single-owner naming convention.
- 0052 dev-oplog-memory-store-schemas: OpRecord's 7 fields, the per-op_name params matrix (dma_read, dma_write, gemm_*, math, math reduction, composite_gemm, ipcq_copy, unknown), snapshot timing rules (math = all inputs, dma_write = HBM-only, per ADR-0027 race avoidance), TileToken stage_type capture, and MemoryStore's (space, addr) two-level dict with reference-store semantics.
- 0053 dev-topology-builder-algorithms: the 6-stage compile pipeline, cube_mesh.yaml's source_hash cache and its 5 input fields, the cube NoC auto-layout algorithm (row/col placement, HBM exclusion zone, PE/M_CPU/SRAM attachment via nearest-router, UCIe N/S/E/W distribution), the node naming convention (single-owner with router.py), the edge-kind catalog, the 4 view projections, and a table of spec-field changes vs mesh regeneration.
Bilingual pair verifier passes for all four EN/KO pairs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-0052: OpLog + MemoryStore Schemas — sim_engine internals
Status
Accepted (2026-05-22).
Pins down the OpRecord schema and the record_start / record_end /
record_copy behavior in sim_engine/op_log.py, plus the
(space, addr) namespace and read/write semantics of MemoryStore in
sim_engine/memory_store.py. ADR-0020 (2-pass data execution) declares
that these two facilities exist, but the precise record fields and
semantics had no ADR-level coverage, and several recent ADRs
(ADR-0046 D3.2's tl.store visibility, ADR-0023 D9's IPCQ copy
record) depend on these semantics.
First action
OpLogger(memory_store=None)
On construction, initialize three fields:
- self._records: list[OpRecord] = [] (accumulated records).
- self._pending: dict[int, dict] = {} (partial records keyed by id(msg); created at record_start, completed at record_end).
- self._memory_store = memory_store (optional MemoryStore reference; used to capture math-op input snapshots and dma_write HBM-source snapshots).
Records and pending are empty; the record_* calls accumulate data
over time.
MemoryStore()
On construction, initialize a single field:
self._storage: dict[str, dict[int, np.ndarray]] = {} — a two-level
dict (space → addr → ndarray). Inner dicts are created lazily as new
spaces appear.
In short, each facility's first act is to set up an empty container: OpLogger an empty record list plus an empty pending dict, MemoryStore an empty, sparse per-space dict. The first record or write fills them when it arrives.
Context
ADR-0020 D2/D5/D7 (2-pass data execution) declares:
- During Phase 1 (timing), the ComponentBase._on_process_start/end hooks call OpLogger.record_start/end, recording the time and metadata of every data op.
- Phase 2 (data) replays the op log in t_start order to compute real data.
- Data payloads live in MemoryStore, keyed by (space, addr).
Subsequent ADRs (ADR-0023 D9's IPCQ atomic write, ADR-0027's Megatron
TP scratch-overwrite avoidance, ADR-0046 D3.2's tl.store visibility)
depend on op_log and MemoryStore behavior, but the exact record
fields / space names / snapshot timing are only discoverable via
source grep. This ADR codifies them.
Decision
D1. OpRecord schema — seven fields
from dataclasses import dataclass, field
from typing import Any

@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str                # "memory" | "gemm" | "math" | "unknown"
    op_name: str                # e.g. "dma_read", "gemm_f16", "exp",
                                # "TileToken/DMA_READ", "composite_gemm",
                                # "ipcq_copy"
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)
- t_start / t_end: SimPy time (float ns). t_start is when the component begins the op; t_end is completion. Duration = t_end - t_start.
- component_id: the node id where the op occurred (e.g., "sip0.cube0.pe0.pe_dma").
- op_kind: one of four. Phase 2 DataExecutor branches on this.
- op_name: a debug/analysis-friendly name. For a TileToken, expands to "TileToken/{stage_type}" (e.g., "TileToken/DMA_READ") to disambiguate stages.
- params: op-specific metadata dict (see D3).
- dependency_ids: currently unused (default []). Reserved for future cross-op dependency tracking.
D2. OpLogger.records — guaranteed t_start sort
@property
def records(self) -> list[OpRecord]:
    self._records.sort(key=lambda r: r.t_start)
    return self._records
A stable sort by t_start runs on each access, so records with the same
t_start preserve insertion order. This aligns with ADR-0020 D5's
"t_start stable ordering" requirement.
Phase 2 DataExecutor always accesses via the records property, so
even when record_end calls arrive out of t_start order (e.g., a
short op started later but finished earlier), the sequence handed to
Phase 2 is consistent.
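The ordering guarantee above can be sketched as follows. This is a minimal illustration, not the real OpLogger: MiniLogger and its append method are hypothetical names, and only the records property mirrors the code shown in D2. It appends records in completion order and shows that they come back in t_start order, with ties kept in insertion order.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str
    op_name: str
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)


class MiniLogger:
    """Hypothetical stand-in for OpLogger; only the records property is modeled."""

    def __init__(self) -> None:
        self._records: list[OpRecord] = []

    def append(self, rec: OpRecord) -> None:
        # Records arrive in completion order, not start order.
        self._records.append(rec)

    @property
    def records(self) -> list[OpRecord]:
        # list.sort is stable: equal t_start values keep insertion order.
        self._records.sort(key=lambda r: r.t_start)
        return self._records


log = MiniLogger()
# A short op that started later finishes (and is appended) first.
log.append(OpRecord(10.0, 20.0, "pe0.pe_math", "math", "exp", {}))
log.append(OpRecord(0.0, 100.0, "pe0.pe_dma", "memory", "dma_read", {}))
log.append(OpRecord(10.0, 30.0, "pe1.pe_math", "math", "add", {}))  # ties with "exp"

ordered_names = [r.op_name for r in log.records]
durations = [r.t_end - r.t_start for r in log.records]
```

Here ordered_names places dma_read first even though it was appended second, and exp stays ahead of add because both share t_start and exp was inserted first.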
D3. params schema per op_name (matrix from _extract_op_info)
D3.1. op_kind="memory", op_name="dma_read" (DmaReadCmd)
{"src_addr": int, "nbytes": int, "handle_id": str}
D3.2. op_kind="memory", op_name="dma_write" (DmaWriteCmd)
{
    "src_space": str,    # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
    "src_addr": int,     # handle.addr
    "shape": tuple, "dtype": str,
    "dst_space": "hbm",  # DmaWrite always targets HBM
    "dst_addr": int,
    "nbytes": int,
    "handle_id": str,
    # When src_space == "hbm" at record_end, a snapshot is added (D4)
    "snapshot": np.ndarray | None,
}
D3.3. op_kind="gemm", op_name=f"gemm_{dtype_a}" (GemmCmd)
{
    "src_a_addr": int, "src_b_addr": int, "dst_addr": int,
    "shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
    "dtype_in": str, "dtype_out": str,
    "m": int, "k": int, "n": int,
    # ADR-0027: per-operand + output spaces preserved
    "src_a_space": str, "src_b_space": str, "dst_space": str,
}
D3.4. op_kind="math", op_name=msg.op (MathCmd; op = "exp", "sum", "add", "where", …)
{
    "input_addrs": list[int],       # addrs of input handles
    "input_shapes": list[tuple],
    "input_spaces": list[str],
    "input_dtypes": list[str],
    "dst_addr": int, "dst_space": str,
    "shape_out": tuple, "dtype": str,
    "axis": int | None,             # only meaningful for reductions
    # All inputs get snapshots at record_end (D4)
    "input_snapshots": list[np.ndarray | None],
}
D3.5. op_kind="gemm" or "math", op_name=f"composite_{op}" (CompositeCmd)
{
    "op": str,                            # "gemm" | "math"
    "out_addr": int, "out_nbytes": int,
    # If op == "gemm", same fields as GemmCmd are added:
    "src_a_addr": int, "src_b_addr": int,
    "shape_a": tuple, "shape_b": tuple,
    "dtype_in": str, "dtype_out": str,
    "src_a_space": str, "src_b_space": str,
    "dst_space": "hbm", "dst_addr": int,  # = out_addr
}
If op == "gemm", op_kind = "gemm"; otherwise "math". The op_kind acts
as an alias so that Phase 2 replays composite-gemm on the same path as GemmCmd.
D3.6. op_kind="memory", op_name="ipcq_copy" (record_copy path)
{
    "src_space": str, "src_addr": int,
    "dst_space": str, "dst_addr": int,
    "shape": tuple, "dtype": str, "nbytes": int,
    "snapshot": np.ndarray | None,  # passed by caller; if None, record_copy reads fresh
}
PE_DMA._handle_ipcq_inbound (ADR-0023 D9) emits this record so Phase
2 can replay the IPCQ slot's inbound copy. It bypasses
record_start / record_end and pushes directly via record_copy().
D3.7. op_kind="unknown", op_name=type(msg).__name__
Fallback for messages that _extract_op_info doesn't recognize; params = {}.
If DataExecutor encounters this kind, it skips the record, so Phase 2 replay
is unaffected.
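How a Phase 2 consumer might branch on op_kind, given the D3 matrix, can be sketched as below. Everything here is illustrative: composite_kind, replay, and the dict-shaped records are hypothetical, and the real DataExecutor's structure is not described in this ADR. The sketch only shows the two properties stated above: a composite gemm lands on the gemm path, and "unknown" records are skipped.

```python
from typing import Any


def composite_kind(op: str) -> str:
    # D3.5: a composite op is aliased to "gemm" when op == "gemm", else "math".
    return "gemm" if op == "gemm" else "math"


def replay(records: list[dict[str, Any]]) -> list[tuple[str, str]]:
    """Return (path, op_name) pairs for every record that was dispatched."""
    dispatched: list[tuple[str, str]] = []
    known_paths = {"memory", "gemm", "math"}  # assumption: one path per op_kind
    for rec in records:
        if rec["op_kind"] not in known_paths:
            continue  # "unknown" records are skipped (D3.7)
        dispatched.append((rec["op_kind"], rec["op_name"]))
    return dispatched


records = [
    {"op_kind": "memory", "op_name": "dma_read", "params": {}},
    {"op_kind": composite_kind("gemm"), "op_name": "composite_gemm", "params": {}},
    {"op_kind": "unknown", "op_name": "SomeMsg", "params": {}},
    {"op_kind": composite_kind("math"), "op_name": "composite_math", "params": {}},
]
result = replay(records)
```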
D4. Snapshot capture timing
When OpLogger._memory_store is set, record_end performs:
- Math op: read every input (addr/shape/space/dtype) via self._memory_store.read(...) and attach an ndarray copy to params["input_snapshots"]. Read failure → None.
- dma_write op: snapshot the source only if src_space == "hbm" and attach it to params["snapshot"]. TCM (PE scratch) sources are deliberately skipped: TCM is repopulated by Phase 2 math/gemm replay, and a Phase-1-time snapshot would capture a previous kernel's stale value (ADR-0027 postmortem: TP gemm → all_reduce race).
- ipcq_copy: the caller passes the in-flight snapshot via snapshot=token.data. If absent, record_copy attempts a fresh read from MemoryStore.
Snapshots are taken with .copy() (fresh allocation), making them
safe against later storage mutation. This is the foundation of
ADR-0027's "cross-PE Phase 2 ordering" race-avoidance.
When memory_store is None (Phase 1 timing-only mode), all
snapshot steps are skipped. Only the timing portion of the record is
preserved; data replay is unavailable.
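The math and dma_write rules above can be sketched as a single policy function. This is a rough illustration under stated assumptions: ToyStore, _copy_or_none, and capture_snapshots are hypothetical names, the store raises KeyError on a missing key, and treating a failed dma_write read as None is an assumption (the ADR only states that rule for math inputs).

```python
import numpy as np


class ToyStore:
    """Hypothetical reference store keyed by (space, addr); not the real MemoryStore."""

    def __init__(self):
        self._storage = {}

    def write(self, space, addr, data):
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space, addr):
        return self._storage[space][addr]  # raises KeyError if missing


def _copy_or_none(store, space, addr):
    try:
        return store.read(space, addr).copy()  # fresh allocation
    except KeyError:
        return None  # read failure -> None


def capture_snapshots(op_kind, op_name, params, store):
    """Apply the D4 policy to params in place; store=None is timing-only mode."""
    if store is None:
        return params
    if op_kind == "math":
        params["input_snapshots"] = [
            _copy_or_none(store, space, addr)
            for space, addr in zip(params["input_spaces"], params["input_addrs"])
        ]
    elif op_name == "dma_write" and params.get("src_space", "tcm") == "hbm":
        # TCM sources are skipped on purpose; only HBM sources are snapshotted.
        params["snapshot"] = _copy_or_none(store, "hbm", params["src_addr"])
    return params


store = ToyStore()
tcm_buf = np.array([1.0, 2.0], dtype=np.float32)
hbm_buf = np.array([5.0, 6.0], dtype=np.float32)
store.write("tcm", 0x100, tcm_buf)
store.write("hbm", 0x2000, hbm_buf)

math_rec = capture_snapshots(
    "math", "exp",
    {"input_spaces": ["tcm", "tcm"], "input_addrs": [0x100, 0x999]},
    store,
)
tcm_write = capture_snapshots(
    "memory", "dma_write", {"src_space": "tcm", "src_addr": 0x100}, store)
hbm_write = capture_snapshots(
    "memory", "dma_write", {"src_space": "hbm", "src_addr": 0x2000}, store)
timing_only = capture_snapshots(
    "math", "exp", {"input_spaces": ["tcm"], "input_addrs": [0x100]}, None)

# Later mutation of live storage does not leak into the snapshots.
tcm_buf[0] = 99.0
hbm_buf[0] = 99.0
```

Because the snapshots are copies, the mutations at the end leave math_rec and hbm_write untouched, which is the property the race-avoidance argument relies on.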
D5. TileToken handling — record_start captures stage info
ADR-0014 D6's self-routing tile token (pipeline mode) may have already
advanced its stage_idx by the time record_end runs (the TileToken
caches the next stage's params as it moves to the next component).
Therefore:
record_start pre-saves the following in pending[id(msg)]["snap"]:
snap["stage_type"] = stage.stage_type.name # "DMA_READ", "GEMM", ...
snap["stage_params"] = dict(stage.params) # copy of params at start time
record_end retrieves this snap and merges into params:
- Adds params["stage_type"] to the final params.
- Merges the stage_params keys (keeps existing values if any).
- If op_name == "TileToken", rewrites it to f"TileToken/{stage_type}" (e.g., "TileToken/DMA_READ"), disambiguating different stages emitted by the same component.
Thanks to this, DMA_READ vs DMA_WRITE, FETCH vs STORE coming from the same component (e.g., pe_dma) are distinguishable in reports.
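The capture-then-merge flow described in D5 can be sketched as follows. The StageType, Stage, and TileToken classes here are simplified, hypothetical stand-ins (the real token has more state), and the module-level pending dict replaces OpLogger._pending for brevity. The point is that the stage info saved at record_start survives the token advancing before record_end.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class StageType(Enum):
    DMA_READ = 0
    GEMM = 1


@dataclass
class Stage:
    stage_type: StageType
    params: dict[str, Any]


@dataclass
class TileToken:
    stages: list[Stage]
    stage_idx: int = 0


pending: dict[int, dict[str, Any]] = {}


def record_start(msg: TileToken, t: float) -> None:
    stage = msg.stages[msg.stage_idx]
    pending[id(msg)] = {
        "t_start": t,
        "snap": {
            "stage_type": stage.stage_type.name,
            "stage_params": dict(stage.params),  # copy of params at start time
        },
    }


def record_end(msg: TileToken, t: float, params: dict[str, Any]) -> dict[str, Any]:
    entry = pending.pop(id(msg))
    snap = entry["snap"]
    merged = dict(params)
    merged["stage_type"] = snap["stage_type"]
    for key, value in snap["stage_params"].items():
        merged.setdefault(key, value)  # keep existing values if any
    op_name = type(msg).__name__
    if op_name == "TileToken":
        op_name = f"TileToken/{snap['stage_type']}"
    return {"t_start": entry["t_start"], "t_end": t,
            "op_name": op_name, "params": merged}


token = TileToken([
    Stage(StageType.DMA_READ, {"nbytes": 256, "src_addr": 0x1000}),
    Stage(StageType.GEMM, {"m": 16}),
])
record_start(token, 0.0)
token.stage_idx += 1  # the token advances before record_end runs
rec = record_end(token, 12.0, {"nbytes": 128})
```

Even though stage_idx already points at the GEMM stage when record_end runs, the record is named after the DMA_READ stage, and the nbytes value extracted at record_end wins over the one cached in stage_params.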
D6. MemoryStore — two-level (space, addr) dict
import numpy as np

class MemoryStore:
    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}

    def write(self, space, addr, data):
        # Inner dict is created lazily on first write to a space; stores the reference (D6.3).
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
    def has(self, space, addr) -> bool: ...
    def snapshot(self) -> "MemoryStore": ...
D6.1. Space namespace
A string key. Standard values:
- "hbm": HBM data (deploy_tensor + Phase 2 dma_write results).
- "tcm": PE-local TCM (Phase 2 math/gemm output).
- "sram": cube-level SRAM (ADR-0023 D9.7's IPCQ slot tier).
Other spaces (e.g., "reg") are allowed — _storage is a lazy dict
that creates a new space when write first touches it.
D6.2. Address keying
addr is an integer. It may be a physical address (PA) or a virtual
address (VA) — MemoryStore itself doesn't know address-space
semantics; it just uses them as keys. Phase 1's MemoryWriteMsg
writes both PA and VA
(_create_tensor zero-inits at PA and at the VA base too); Phase 2
reads/writes via the addresses captured by op_log.
The caller decides addr's meaning — MemoryStore provides only
lookup.
D6.3. read/write semantics — reference store (no copy)
write(space, addr, data): stores the ndarray reference. No copy.
If the caller later mutates the same ndarray, the stored value
changes.
read(space, addr, shape=None, dtype=None): returns the stored
ndarray reference. If shape/dtype are provided:
- dtype != stored.dtype: arr.view(np_dtype) reinterprets as a view (no copy).
- shape != stored.shape: if nbytes matches, arr.reshape(shape) is a view. nbytes mismatch → ValueError.
To detach the data, the caller must call arr.copy(). ADR-0027's
race-avoidance requires explicit .copy() in op_log snapshot steps
for exactly this reason.
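The reference-store behavior can be demonstrated with a small stand-in. RefStore below is hypothetical and only follows the semantics listed in D6.3; in particular, converting the dtype argument with np.dtype(...) is an assumption about how the real code maps its dtype names to NumPy dtypes.

```python
import numpy as np


class RefStore:
    """Hypothetical minimal store following the D6.3 semantics."""

    def __init__(self):
        self._storage = {}

    def write(self, space, addr, data):
        self._storage.setdefault(space, {})[addr] = data  # reference, no copy

    def read(self, space, addr, shape=None, dtype=None):
        arr = self._storage[space][addr]
        if dtype is not None and np.dtype(dtype) != arr.dtype:
            arr = arr.view(np.dtype(dtype))  # reinterpret bytes, no copy
        if shape is not None and tuple(shape) != arr.shape:
            if int(np.prod(shape)) * arr.dtype.itemsize != arr.nbytes:
                raise ValueError("nbytes mismatch")
            arr = arr.reshape(shape)  # view for contiguous arrays
        return arr


store = RefStore()
buf = np.zeros(4, dtype=np.float32)
store.write("tcm", 0x0, buf)

same_object = store.read("tcm", 0x0) is buf
buf[0] = 7.0                                   # caller mutates after write
seen_through_store = float(store.read("tcm", 0x0)[0])

as_2x2 = store.read("tcm", 0x0, shape=(2, 2))
as_bytes = store.read("tcm", 0x0, dtype=np.uint8)

detached = store.read("tcm", 0x0).copy()       # explicit copy detaches the data
buf[1] = 3.0

try:
    store.read("tcm", 0x0, shape=(3,))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

The mutation of buf[0] is visible through the store, while detached keeps the value it had at copy time. That difference is why snapshot steps need an explicit .copy().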
D6.4. has(space, addr) -> bool
Existence check; does not materialize data.
D6.5. snapshot() -> MemoryStore
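Shallow copy. Creates new inner dicts but shares the ndarray references. Used at Phase 2 init to fork from Phase 1's store, so Phase 2 writes (which rebind keys in the forked dicts) don't affect Phase 1's remaining consumers. Because the ndarrays themselves are shared, an in-place mutation of a stored array is still visible through both stores.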
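A shallow fork of this kind can be sketched as below. ForkableStore is a hypothetical stand-in, and copying each inner dict with dict(inner) is one plausible way to get "new inner dicts, shared ndarrays"; the real snapshot() may be implemented differently.

```python
import numpy as np


class ForkableStore:
    """Hypothetical two-level (space, addr) store with a shallow snapshot()."""

    def __init__(self):
        self._storage = {}

    def write(self, space, addr, data):
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space, addr):
        return self._storage[space][addr]

    def has(self, space, addr):
        return addr in self._storage.get(space, {})

    def snapshot(self):
        fork = ForkableStore()
        # New outer and inner dicts; the ndarray values are shared, not copied.
        fork._storage = {space: dict(inner) for space, inner in self._storage.items()}
        return fork


phase1 = ForkableStore()
original = np.array([1, 2, 3])
phase1.write("hbm", 0x10, original)

phase2 = phase1.snapshot()
phase2.write("hbm", 0x10, np.array([9, 9, 9]))  # rebinds the key in the fork only
phase2.write("tcm", 0x0, np.array([4]))         # new space exists in the fork only

phase1_value = phase1.read("hbm", 0x10).tolist()
phase1_has_tcm = phase1.has("tcm", 0x0)

# In-place mutation through a second fork is visible to phase1: arrays are shared.
other_fork = phase1.snapshot()
other_fork.read("hbm", 0x10)[0] = 42
shared_value = int(phase1.read("hbm", 0x10)[0])
```

Writes into the fork leave the original store's keys alone, but the last three lines show the limit of a shallow copy: mutating a shared array in place changes what both stores see.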
D7. op_log assumes a single-threaded SimPy
OpLogger's _records and _pending are lock-free. SimPy is
single-threaded, so nothing else can intrude between record_start
and record_end for the same message.
When multi-process kernbench (ADR-0047 D6) arrives, OpLogger must be split per process — one OpLogger instance cannot receive records from multiple processes.
Alternatives Considered
A1. Externalize op_log to SQLite / parquet
Rejected (currently). The in-memory list minimizes Phase 1 → Phase 2 hand-off latency. Externalization makes sense for long-running batch runs but adds overhead for the current single-run workload.
A2. Capture snapshots at record_start
Rejected. At record_start, inputs are often not yet populated (e.g.,
a math op's input is the output of a just-issued previous op).
record_end is the correct point.
A3. Per-component MemoryStore
Rejected. The (space, addr) key already disambiguates effectively, and splitting per component would complicate cross-PE IPCQ copy (ADR-0023 D9), which needs access to both source and destination stores.
A4. Explicit dependency edges in op_log
Partially adopted. The dependency_ids field exists on OpRecord but
is currently unused (D1). Phase 2 DataExecutor orders via t_start +
a secondary sort (memory ops before math at the same t_start). When
an explicit dependency graph is required, this field is the home.
Current ordering rules are sufficient, so it remains unused.
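The current ordering rule can be expressed as a two-part sort key. This is a sketch only: phase2_order and the dict-shaped records are hypothetical, and ranking every non-memory kind (gemm included) together after memory is an assumption, since the text above only states "memory ops before math".

```python
from typing import Any


def phase2_order(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Order by t_start, then memory ops before everything else at the same t_start."""
    # sorted() is stable, so records that tie on both keys keep their log order.
    return sorted(
        records,
        key=lambda r: (r["t_start"], 0 if r["op_kind"] == "memory" else 1),
    )


records = [
    {"t_start": 5.0, "op_kind": "math", "op_name": "exp"},
    {"t_start": 5.0, "op_kind": "memory", "op_name": "dma_read"},
    {"t_start": 1.0, "op_kind": "gemm", "op_name": "gemm_f16"},
    {"t_start": 5.0, "op_kind": "math", "op_name": "add"},
]
ordered = [r["op_name"] for r in phase2_order(records)]
```

If explicit dependency edges are ever needed, dependency_ids would replace or refine this key; until then the tuple key is the whole ordering policy in this sketch.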
Consequences
- ADR-0020's op_log / MemoryStore declarations are expanded into the concrete D1–D6 schemas, so writing/modifying Phase 2 DataExecutor doesn't need source-grep to learn field semantics.
- D3's per-op_name params matrix makes adding new ops (e.g., a new reduction type) a question of branching in _extract_op_info.
- D4's per-op snapshot policy (math = input snapshot, dma_write = HBM-only snapshot) is ADR-locked, so ADR-0027's race-avoidance decision won't silently regress on future refactors.
- D6.3's reference-store semantics are explicit, putting mutation safety on the caller. ADR-0027's explicit .copy() pattern is justified.
- D7's single-thread assumption is recorded, so multi-process kernbench (ADR-0047 D6's supersession candidate) will need OpLogger separation when introduced.