kernbench2/docs/adr/ADR-0052-dev-oplog-memory-store-schemas.md
ywkang bd49c93703 adr: add ADR-0050-0053 — close /report's second-pass G4 candidates
Documents four cross-cutting surfaces one layer deeper than the prior
G4 batch:

- 0050 par-ccl-algorithm-module-contract: how to author a new CCL
  algorithm in src/kernbench/ccl/algorithms/. Pairs with ADR-0045's
  bench-module contract. Pins the four required public symbols
  (kernel, kernel_args, TOPO_NAME_TO_KIND constants, kernel alias),
  the 9 + tl standardized kernel signature, the kernel_args tuple
  format, sip_topo_kind dispatch, and the ccl.yaml entry workflow.

- 0051 lat-routing-helper-api: every public method of AddressResolver
  (resolve, find_m_cpu, find_pcie_ep, find_io_cpu, find_all_pcie_eps)
  and PathRouter (find_path, find_path_with_distance,
  find_mcpu_dma_path, find_memory_path, find_node_path + 2 shims).
  Pins the four adjacency graphs (_adj_all / _adj / _adj_mcpu_dma /
  _adj_local) and the edge-kind exclusion sets they use, plus the
  single-owner naming convention.

- 0052 dev-oplog-memory-store-schemas: OpRecord's 7 fields, the
  per-op_name params matrix (dma_read, dma_write, gemm_*, math, math
  reduction, composite_gemm, ipcq_copy, unknown), snapshot timing
  rules (math = all inputs, dma_write = HBM-only — ADR-0027 race
  avoidance), TileToken stage_type capture, and MemoryStore's
  (space, addr) two-level dict with reference-store semantics.

- 0053 dev-topology-builder-algorithms: the 6-stage compile pipeline,
  cube_mesh.yaml's source_hash cache and its 5 input fields, the
  cube NoC auto-layout algorithm (row/col placement, HBM exclusion
  zone, PE/M_CPU/SRAM attachment via nearest-router, UCIe N/S/E/W
  distribution), the node naming convention (single-owner with
  router.py), the edge-kind catalog, the 4 view projections, and a
  table of spec-field changes vs mesh regeneration.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:52:42 -07:00

# ADR-0052: OpLog + MemoryStore Schemas — sim_engine internals
## Status
Accepted (2026-05-22).

This ADR pins down the `OpRecord` schema and the `record_start` /
`record_end` / `record_copy` behavior in `sim_engine/op_log.py`, plus the
(space, addr) namespace and read/write semantics of `MemoryStore` in
`sim_engine/memory_store.py`. ADR-0020 (2-pass data execution) declares
that these two facilities exist, but **the precise record fields and
semantics** had no ADR-level coverage, even though several recent ADRs
(ADR-0046 D3.2's `tl.store` visibility, ADR-0023 D9's IPCQ copy
record) depend on them.
## First action
### `OpLogger(memory_store=None)`
On construction, initialize three fields:

1. `self._records: list[OpRecord] = []`: accumulated records.
2. `self._pending: dict[int, dict] = {}`: partial records keyed by
   `id(msg)` (created at `record_start`, completed at `record_end`).
3. `self._memory_store = memory_store`: optional `MemoryStore`
   reference, used to capture math-op input snapshots and `dma_write`
   HBM-source snapshots.

Both `_records` and `_pending` start empty; the `record_*` calls
accumulate data over time.
### `MemoryStore()`
On construction, initialize a single field:

`self._storage: dict[str, dict[int, np.ndarray]] = {}`, a two-level
dict (`space → addr → ndarray`). Inner dicts are created lazily as new
spaces appear.

In short, **both facilities start as empty accumulators: a record list
plus a pending map for `OpLogger`, and a sparse, per-space dict for
`MemoryStore`**. The first record or write populates them.
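
The two constructors described above can be sketched as follows. The class names `OpLoggerSketch` and `MemoryStoreSketch` are hypothetical stand-ins for the real classes, and the `record_start(msg, now)` / `record_end(msg, now)` signatures are assumptions made only to show how `_pending` is keyed by `id(msg)`; the real methods record more metadata.

```python
from __future__ import annotations

import numpy as np


class MemoryStoreSketch:
    """Hypothetical stand-in for MemoryStore: one empty two-level dict."""

    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}


class OpLoggerSketch:
    """Hypothetical stand-in for OpLogger: three fields, all empty at start."""

    def __init__(self, memory_store: MemoryStoreSketch | None = None) -> None:
        self._records: list[dict] = []
        self._pending: dict[int, dict] = {}
        self._memory_store = memory_store

    def record_start(self, msg: object, now: float) -> None:
        # Assumed signature: open a partial record keyed by message identity.
        self._pending[id(msg)] = {"t_start": now}

    def record_end(self, msg: object, now: float) -> None:
        # Assumed signature: complete the partial record and append it.
        partial = self._pending.pop(id(msg))
        self._records.append({**partial, "t_end": now})


logger = OpLoggerSketch(memory_store=MemoryStoreSketch())
msg = object()
logger.record_start(msg, now=10.0)
pending_during_op = len(logger._pending)  # one partial record is open
logger.record_end(msg, now=25.0)
```

After `record_end`, the pending map is empty again and the completed record sits in `_records`.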
## Context
ADR-0020 D2/D5/D7 (2-pass data execution) declares:

- During Phase 1 (timing), `ComponentBase._on_process_start/end` hooks
  call `OpLogger.record_start/end`, recording the time and metadata of
  every data op.
- Phase 2 (data) replays the op log in `t_start` order to compute real
  data.
- Data payloads live in `MemoryStore`, keyed by (space, addr).

Subsequent ADRs (ADR-0023 D9's IPCQ atomic write, ADR-0027's Megatron
TP scratch-overwrite avoidance, ADR-0046 D3.2's `tl.store` visibility)
depend on op_log and MemoryStore behavior, but **the exact record
fields / space names / snapshot timing** are only discoverable via
source grep. This ADR codifies them.
## Decision
### D1. `OpRecord` schema — seven fields
```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    t_start: float            # SimPy time, ns
    t_end: float              # SimPy time, ns
    component_id: str
    op_kind: str              # "memory" | "gemm" | "math" | "unknown"
    op_name: str              # e.g. "dma_read", "gemm_f16", "exp",
                              # "TileToken/DMA_READ", "composite_gemm",
                              # "ipcq_copy"
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)
```
- **`t_start` / `t_end`**: SimPy time (float ns). `t_start` is when the
component begins the op; `t_end` is completion. Duration =
`t_end - t_start`.
- **`component_id`**: the node id where the op occurred (e.g.,
`"sip0.cube0.pe0.pe_dma"`).
- **`op_kind`**: one of four. Phase 2 DataExecutor branches on this.
- **`op_name`**: a debug/analysis-friendly name. For a TileToken,
expands to `"TileToken/{stage_type}"` (e.g.,
`"TileToken/DMA_READ"`) to disambiguate stages.
- **`params`**: op-specific metadata dict (see D3).
- **`dependency_ids`**: currently unused (default `[]`). Reserved for
future cross-op dependency tracking.
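
As a usage sketch of the schema, the snippet below builds one record for a `dma_read` and derives its duration. The dataclass is repeated so the example runs on its own; the concrete times, address, and `handle_id` value are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str
    op_name: str
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)


# Illustrative values only; the params keys follow the dma_read row of D3.
rec = OpRecord(
    t_start=100.0,
    t_end=140.0,
    component_id="sip0.cube0.pe0.pe_dma",
    op_kind="memory",
    op_name="dma_read",
    params={"src_addr": 0x1000, "nbytes": 256, "handle_id": "h0"},
)

duration_ns = rec.t_end - rec.t_start  # duration as defined in D1
```

Because `dependency_ids` uses `default_factory=list`, each record gets its own empty list rather than sharing one mutable default.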
### D2. `OpLogger.records` — guaranteed `t_start` sort
```python
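class OpLogger:
    # ... constructor and record_* methods omitted ...

    @property
    def records(self) -> list[OpRecord]:
        # list.sort is stable: equal t_start keeps insertion order.
        self._records.sort(key=lambda r: r.t_start)
        return self._records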
```
A stable sort by `t_start` runs on each access. Records with the same
`t_start` preserve insertion order. This aligns with ADR-0020 D5's
"t_start stable ordering" requirement.

Phase 2 DataExecutor always accesses via the `records` property, so
even when `record_end` calls arrive out of `t_start` order (e.g., a
short op started later but finished earlier), the sequence handed to
Phase 2 is consistent.
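
The ordering guarantee can be illustrated with a small, self-contained sketch. `Rec` is a hypothetical two-field stand-in for `OpRecord`; the list is filled in `record_end` arrival order, and Python's stable `list.sort` restores `t_start` order while keeping ties in insertion order.

```python
from dataclasses import dataclass


@dataclass
class Rec:
    t_start: float
    label: str


# Append order mimics record_end arrival order, not start order.
arrivals = [
    Rec(2.0, "short_op"),       # started at 2.0, finished first
    Rec(2.0, "same_start_op"),  # same t_start, recorded second
    Rec(0.0, "long_op"),        # started at 0.0, finished last
]

arrivals.sort(key=lambda r: r.t_start)
ordered_labels = [r.label for r in arrivals]
```

Here `long_op` moves to the front, and the two records that share `t_start == 2.0` keep their relative insertion order.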
### D3. `params` schema per `op_name` (matrix from `_extract_op_info`)
#### D3.1. `op_kind="memory", op_name="dma_read"` (DmaReadCmd)
```python
{"src_addr": int, "nbytes": int, "handle_id": str}
```
#### D3.2. `op_kind="memory", op_name="dma_write"` (DmaWriteCmd)
```python
{
    "src_space": str,    # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
    "src_addr": int,     # handle.addr
    "shape": tuple, "dtype": str,
    "dst_space": "hbm",  # DmaWrite always targets HBM
    "dst_addr": int,
    "nbytes": int,
    "handle_id": str,
    # When src_space == "hbm" at record_end, a snapshot is added (D4)
    "snapshot": np.ndarray | None,
}
```
#### D3.3. `op_kind="gemm", op_name=f"gemm_{dtype_a}"` (GemmCmd)
```python
{
    "src_a_addr": int, "src_b_addr": int, "dst_addr": int,
    "shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
    "dtype_in": str, "dtype_out": str,
    "m": int, "k": int, "n": int,
    # ADR-0027: per-operand + output spaces preserved
    "src_a_space": str, "src_b_space": str, "dst_space": str,
}
```
#### D3.4. `op_kind="math", op_name=msg.op` (MathCmd; op = "exp", "sum", "add", "where", …)
```python
{
    "input_addrs": list[int],   # addrs of input handles
    "input_shapes": list[tuple],
    "input_spaces": list[str],
    "input_dtypes": list[str],
    "dst_addr": int, "dst_space": str,
    "shape_out": tuple, "dtype": str,
    "axis": int | None,         # only meaningful for reductions
    # All inputs get snapshots at record_end (D4)
    "input_snapshots": list[np.ndarray | None],
}
```
#### D3.5. `op_kind="gemm" or "math", op_name=f"composite_{op}"` (CompositeCmd)
```python
{
    "op": str,                             # "gemm" | "math"
    "out_addr": int, "out_nbytes": int,
    # If op == "gemm", same fields as GemmCmd are added:
    "src_a_addr": int, "src_b_addr": int,
    "shape_a": tuple, "shape_b": tuple,
    "dtype_in": str, "dtype_out": str,
    "src_a_space": str, "src_b_space": str,
    "dst_space": "hbm", "dst_addr": int,   # = out_addr
}
```
If `op == "gemm"`, `op_kind` is `"gemm"`; otherwise it is `"math"`.
Sharing the kind lets Phase 2 replay a composite gemm on the same path
as `GemmCmd`.
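
The kind/name rule for composite commands is small enough to sketch directly. `classify_composite` is a hypothetical helper name, not a function from `op_log.py`; it only mirrors the rule stated above.

```python
def classify_composite(op: str) -> tuple[str, str]:
    """Return (op_kind, op_name) for a composite command's `op` string."""
    op_kind = "gemm" if op == "gemm" else "math"
    op_name = f"composite_{op}"
    return op_kind, op_name


gemm_kind, gemm_name = classify_composite("gemm")
math_kind, math_name = classify_composite("math")
```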
#### D3.6. `op_kind="memory", op_name="ipcq_copy"` (record_copy path)
```python
{
    "src_space": str, "src_addr": int,
    "dst_space": str, "dst_addr": int,
    "shape": tuple, "dtype": str, "nbytes": int,
    "snapshot": np.ndarray | None,  # passed by caller; if None, record_copy reads fresh
}
```
`PE_DMA._handle_ipcq_inbound` (ADR-0023 D9) emits this record so Phase
2 can replay the IPCQ slot's inbound copy. It bypasses
`record_start` / `record_end` and pushes directly via `record_copy()`.
#### D3.7. `op_kind="unknown", op_name=type(msg).__name__`
Fallback for messages that `_extract_op_info` doesn't recognize;
`params` is `{}`. If DataExecutor encounters this kind, it skips the
record, so Phase 2 replay is unaffected.
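
A minimal sketch of the skip behavior follows, assuming a replay loop that dispatches on `op_kind`. The `replay` function, the handler table, and the message name `SomeCtrlMsg` are hypothetical; the real DataExecutor's structure is not specified here.

```python
def replay(records, handlers):
    """Run the handler for each record's op_kind; skip kinds with no handler."""
    executed = []
    for rec in records:
        handler = handlers.get(rec["op_kind"])
        if handler is None:  # e.g. op_kind == "unknown"
            continue
        handler(rec)
        executed.append(rec["op_name"])
    return executed


# No-op handlers for the three replayable kinds (placeholders only).
handlers = {kind: (lambda rec: None) for kind in ("memory", "gemm", "math")}

records = [
    {"op_kind": "memory", "op_name": "dma_read", "params": {}},
    {"op_kind": "unknown", "op_name": "SomeCtrlMsg", "params": {}},
    {"op_kind": "math", "op_name": "exp", "params": {}},
]
executed = replay(records, handlers)
```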
### D4. Snapshot capture timing
When `OpLogger._memory_store` is set, `record_end` performs:

- **Math op**: read every input (addr/shape/space/dtype) via
  `self._memory_store.read(...)` and attach an ndarray copy to
  `params["input_snapshots"]`. A failed read yields `None`.
- **`dma_write` op**: snapshot the source **only if `src_space ==
  "hbm"`** and attach it to `params["snapshot"]`. TCM (PE scratch)
  sources are **deliberately skipped**: TCM is repopulated by Phase 2
  math/gemm replay, and a Phase-1-time snapshot would capture a
  previous kernel's stale value (ADR-0027 postmortem: TP gemm →
  all_reduce race).
- **`ipcq_copy`**: the caller passes the in-flight snapshot via
  `snapshot=token.data`. If absent, `record_copy` attempts a fresh
  read from MemoryStore.

Snapshots are taken with `.copy()` (fresh allocation), making them
safe against later storage mutation. This is the foundation of
ADR-0027's "cross-PE Phase 2 ordering" race-avoidance.

When `memory_store` is `None` (Phase 1 timing-only mode), all
snapshot steps are skipped. Only the timing portion of the record is
preserved; data replay is unavailable.
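
The per-op policy above can be sketched as one function. This is an illustration under stated assumptions, not the code in `op_log.py`: `read_copy` and `attach_snapshots` are hypothetical names, a plain nested dict stands in for `MemoryStore`, and setting `params["snapshot"]` to `None` for TCM sources is an assumption based on the `np.ndarray | None` type in D3.2.

```python
import numpy as np


def read_copy(storage, space, addr):
    """Return a detached copy of the stored array, or None if the read fails."""
    try:
        return storage[space][addr].copy()
    except KeyError:
        return None


def attach_snapshots(op_kind, op_name, params, storage):
    """Apply the per-op snapshot policy at record_end time (sketch)."""
    if storage is None:  # timing-only mode: no snapshots at all
        return params
    if op_kind == "math":
        params["input_snapshots"] = [
            read_copy(storage, space, addr)
            for space, addr in zip(params["input_spaces"], params["input_addrs"])
        ]
    elif op_name == "dma_write":
        # Only HBM sources are snapshotted; TCM sources are left as None.
        from_hbm = params["src_space"] == "hbm"
        params["snapshot"] = (
            read_copy(storage, "hbm", params["src_addr"]) if from_hbm else None
        )
    return params


storage = {"hbm": {0: np.arange(4.0)}, "tcm": {16: np.ones(4)}}
math_params = attach_snapshots(
    "math", "add",
    {"input_spaces": ["hbm", "tcm"], "input_addrs": [0, 99]},  # 99 is missing
    storage,
)
storage["hbm"][0][:] = -1.0  # later mutation must not reach the snapshot
tcm_write = attach_snapshots(
    "memory", "dma_write", {"src_space": "tcm", "src_addr": 16}, storage
)
```

The first math input keeps its pre-mutation values because the snapshot is a copy, the second is `None` because the read failed, and the TCM-sourced `dma_write` carries no data snapshot.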
### D5. TileToken handling — `record_start` captures stage info
ADR-0014 D6's self-routing tile token (pipeline mode) may have already
advanced its `stage_idx` by the time `record_end` runs, because the
TileToken caches the next stage's params as it moves to the next
component. `record_start` therefore pre-saves the following in
`pending[id(msg)]["snap"]`:
```python
snap["stage_type"] = stage.stage_type.name # "DMA_READ", "GEMM", ...
snap["stage_params"] = dict(stage.params) # copy of params at start time
```
`record_end` retrieves this snap and merges it into `params`:

- Adds `params["stage_type"]` to the final params.
- Merges `stage_params` keys (keeps existing values if any).
- If `op_name == "TileToken"`, rewrites it to
  `f"TileToken/{stage_type}"` (e.g., `"TileToken/DMA_READ"`),
  disambiguating different stages emitted by the same component.

As a result, DMA_READ vs DMA_WRITE and FETCH vs STORE coming from the
same component (e.g., pe_dma) are distinguishable in reports.
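
The merge step can be sketched as a pure function. `merge_stage_snap` is a hypothetical helper; it assumes the snap has the two keys shown above and that "keeps existing values" means keys already present in `params` win over `stage_params`.

```python
def merge_stage_snap(op_name, params, snap):
    """Merge the record_start snapshot into the final (op_name, params)."""
    merged = dict(params)
    merged["stage_type"] = snap["stage_type"]
    for key, value in snap["stage_params"].items():
        merged.setdefault(key, value)  # existing values are kept
    if op_name == "TileToken":
        op_name = f"TileToken/{snap['stage_type']}"
    return op_name, merged


final_name, final_params = merge_stage_snap(
    "TileToken",
    {"nbytes": 64},
    {"stage_type": "DMA_READ", "stage_params": {"nbytes": 128, "src_addr": 4096}},
)
```

In this example `nbytes` stays at 64 because it was already present, while `src_addr` is filled in from the start-time snapshot.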
### D6. `MemoryStore` — two-level (space, addr) dict
```python
import numpy as np

class MemoryStore:
    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}

    def write(self, space, addr, data):
        # The inner dict is created lazily on first write to a space (D6.1).
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
    def has(self, space, addr) -> bool: ...
    def snapshot(self) -> "MemoryStore": ...
```
#### D6.1. Space namespace
A string key. Standard values:

- `"hbm"`: HBM data (deploy_tensor + Phase 2 dma_write results).
- `"tcm"`: PE-local TCM (Phase 2 math/gemm output).
- `"sram"`: cube-level SRAM (ADR-0023 D9.7's IPCQ slot tier).

Other spaces (e.g., `"reg"`) are allowed: `_storage` is a lazy dict
that creates a new space when `write` first touches it.
#### D6.2. Address keying
`addr` is an integer. It may be a **physical address (PA) or a virtual
address (VA)**; `MemoryStore` itself doesn't know address-space
semantics and just uses the value as a key. Phase 1's `MemoryWriteMsg`
writes at both PA and VA (`_create_tensor` zero-inits at PA and at the
VA base too); Phase 2 reads/writes via the addresses captured by
op_log.

The caller decides what `addr` means; `MemoryStore` provides only
lookup.
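
A short sketch of the keying rules in D6.1 and D6.2, using a module-level dict in place of `MemoryStore._storage`. The `write` helper and the `PA` / `VA` values are hypothetical; the point is only that spaces appear on first write and that addresses are opaque integer keys.

```python
import numpy as np

storage: dict[str, dict[int, np.ndarray]] = {}


def write(space: str, addr: int, data: np.ndarray) -> None:
    # The inner dict for a space is created on first write.
    storage.setdefault(space, {})[addr] = data


PA, VA = 0x8000_0000, 0x1000  # hypothetical physical / virtual addresses

write("hbm", PA, np.zeros(4, dtype=np.float32))
write("hbm", VA, np.zeros(4, dtype=np.float32))
write("reg", 0, np.ones(2))   # a non-standard space is accepted as-is
```

The store holds two unrelated entries under `"hbm"`; nothing in it records that one key is a PA and the other a VA.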
#### D6.3. read/write semantics — reference store (no copy)
`write(space, addr, data)`: stores the ndarray reference. **No copy.**
If the caller later mutates the same ndarray, the stored value
changes.

`read(space, addr, shape=None, dtype=None)`: returns the stored
ndarray reference. If `shape`/`dtype` are provided:

- `dtype != stored.dtype`: `arr.view(np_dtype)` reinterprets as a
  view (no copy).
- `shape != stored.shape`: if `nbytes` matches, `arr.reshape(shape)`
  is a view.
- `nbytes` mismatch → `ValueError`.

To detach the data, the caller must call `arr.copy()`. ADR-0027's
race-avoidance requires explicit `.copy()` in op_log snapshot steps
for exactly this reason.
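
The read rules can be approximated with plain NumPy. `read_view` is a hypothetical free function that mirrors the three bullets above; it assumes dtype names that `np.dtype` accepts (such as `"float16"`), which may differ from the dtype strings the simulator actually uses.

```python
import numpy as np


def read_view(arr, shape=None, dtype=None):
    """Return `arr` reinterpreted by dtype and/or shape without copying."""
    if dtype is not None and np.dtype(dtype) != arr.dtype:
        arr = arr.view(np.dtype(dtype))  # reinterpret the same bytes
    if shape is not None and tuple(shape) != arr.shape:
        if int(np.prod(shape)) * arr.dtype.itemsize != arr.nbytes:
            raise ValueError("requested shape does not match stored nbytes")
        arr = arr.reshape(shape)         # view on a contiguous array
    return arr


stored = np.zeros(8, dtype=np.float32)                  # 32 bytes
view = read_view(stored, shape=(4, 4), dtype="float16") # 16 x 2 bytes
detached = stored.copy()                                # explicit detach

try:
    read_view(stored, shape=(3,))  # 12 bytes requested, 32 stored
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

`view` aliases `stored`, so a later write to either is visible through the other; only `detached` is safe against subsequent mutation.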
#### D6.4. `has(space, addr) -> bool`
Existence check; does not materialize data.
#### D6.5. `snapshot() -> MemoryStore`
Shallow copy. Creates a new instance of inner dicts but shares
ndarray references. Used at Phase 2 init to fork from Phase 1's
store, so Phase 2 mutations don't affect Phase 1's remaining
consumers.
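
A sketch of what a shallow fork does and does not isolate, using nested dicts in place of two `MemoryStore` instances. `fork_storage` is a hypothetical helper that follows the description above (new inner dicts, shared ndarray references).

```python
import numpy as np


def fork_storage(storage):
    """Shallow fork: new outer and inner dicts, same ndarray objects."""
    return {space: dict(inner) for space, inner in storage.items()}


phase1 = {"hbm": {0: np.zeros(2), 8: np.zeros(2)}}
phase2 = fork_storage(phase1)

phase2["hbm"][0] = np.ones(2)    # rebinding a key: isolated from phase1
phase2["tcm"] = {0: np.ones(2)}  # new space: isolated from phase1
phase2["hbm"][8][:] = 7.0        # in-place mutation: visible in phase1 too
```

Under these semantics, isolation holds for writes that store a new array at a key; an in-place mutation of an array inherited from Phase 1 still reaches both stores, which is consistent with the reference-store behavior in D6.3.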
### D7. op_log assumes a single-threaded SimPy
`OpLogger`'s `_records` and `_pending` are lock-free. SimPy is
single-threaded, so nothing else can intrude between `record_start`
and `record_end` for the same message.

When multi-process kernbench (ADR-0047 D6) arrives, OpLogger must be
split per process: one OpLogger instance cannot receive records from
multiple processes.
## Alternatives Considered
### A1. Externalize op_log to SQLite / parquet
Rejected (currently). The in-memory list minimizes Phase 1 → Phase 2
hand-off latency. Externalization makes sense for long-running batch
runs but adds overhead for the current single-run workload.
### A2. Capture snapshots at `record_start`
Rejected. At `record_start`, inputs are often not yet populated (e.g.,
a math op's input is the output of a just-issued previous op).
`record_end` is the correct point.
### A3. Per-component MemoryStore
Rejected. The (space, addr) key already disambiguates effectively, and
splitting per component would complicate cross-PE IPCQ copy (ADR-0023
D9), which needs access to both source and destination stores.
### A4. Explicit dependency edges in op_log
Partially adopted. The `dependency_ids` field exists on `OpRecord` but
is currently unused (D1). Phase 2 DataExecutor orders via `t_start` +
a secondary sort (memory ops before math at the same `t_start`). When
an explicit dependency graph is required, this field is the home.
Current ordering rules are sufficient, so it remains unused.
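
The ordering rule mentioned above can be expressed as a composite sort key. This is a sketch under an assumption: only the "memory before math at the same `t_start`" tie-break is stated, so the rank given to `"gemm"` here (same as math) is a guess, and `replay_order` is a hypothetical name.

```python
def replay_order(records):
    """Sort by t_start, then memory ops ahead of all other kinds on ties."""
    def key(rec):
        kind_rank = 0 if rec["op_kind"] == "memory" else 1  # assumed ranks
        return (rec["t_start"], kind_rank)
    return sorted(records, key=key)


records = [
    {"t_start": 5.0, "op_kind": "math", "op_name": "exp"},
    {"t_start": 5.0, "op_kind": "memory", "op_name": "dma_read"},
    {"t_start": 1.0, "op_kind": "gemm", "op_name": "gemm_f16"},
]
ordered = [r["op_name"] for r in replay_order(records)]
```

Because `sorted` is stable, records that tie on both `t_start` and kind rank keep the order they had in the op log.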
## Consequences
- ADR-0020's op_log / MemoryStore declarations are expanded into the
concrete D1–D6 schemas, so writing or modifying Phase 2 DataExecutor
no longer requires a source grep to learn field semantics.
- D3's per-`op_name` params matrix makes adding new ops (e.g., a new
reduction type) a question of branching in `_extract_op_info`.
- D4's per-op snapshot policy (math = input snapshot, dma_write =
HBM-only snapshot) is ADR-locked, so ADR-0027's race-avoidance
decision won't silently regress on future refactors.
- D6.3's reference-store semantics are explicit, putting mutation
safety on the caller. ADR-0027's explicit `.copy()` pattern is
justified.
- D7's single-thread assumption is recorded, so multi-process
kernbench (ADR-0047 D6's supersession candidate) will need OpLogger
separation when introduced.