adr: add ADR-0050-0053 — close /report's second-pass G4 candidates
Documents four cross-cutting surfaces one layer deeper than the prior G4 batch:

- 0050 par-ccl-algorithm-module-contract: how to author a new CCL algorithm in src/kernbench/ccl/algorithms/. Pairs with ADR-0045's bench-module contract. Pins the four required public symbols (kernel, kernel_args, TOPO_NAME_TO_KIND constants, kernel alias), the 9 + tl standardized kernel signature, the kernel_args tuple format, sip_topo_kind dispatch, and the ccl.yaml entry workflow.
- 0051 lat-routing-helper-api: every public method of AddressResolver (resolve, find_m_cpu, find_pcie_ep, find_io_cpu, find_all_pcie_eps) and PathRouter (find_path, find_path_with_distance, find_mcpu_dma_path, find_memory_path, find_node_path + 2 shims). Pins the four adjacency graphs (_adj_all / _adj / _adj_mcpu_dma / _adj_local) and the edge-kind exclusion sets they use, plus the single-owner naming convention.
- 0052 dev-oplog-memory-store-schemas: OpRecord's 7 fields, the per-op_name params matrix (dma_read, dma_write, gemm_*, math, math reduction, composite_gemm, ipcq_copy, unknown), snapshot timing rules (math = all inputs, dma_write = HBM-only, for ADR-0027 race avoidance), TileToken stage_type capture, and MemoryStore's (space, addr) two-level dict with reference-store semantics.
- 0053 dev-topology-builder-algorithms: the 6-stage compile pipeline, cube_mesh.yaml's source_hash cache and its 5 input fields, the cube NoC auto-layout algorithm (row/col placement, HBM exclusion zone, PE/M_CPU/SRAM attachment via nearest-router, UCIe N/S/E/W distribution), the node naming convention (single-owner with router.py), the edge-kind catalog, the 4 view projections, and a table of spec-field changes vs mesh regeneration.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,371 @@
# ADR-0052: OpLog + MemoryStore Schemas (sim_engine internals)

## Status

Accepted (2026-05-22).

This ADR pins down the `OpRecord` schema and the `record_start` /
`record_end` / `record_copy` behavior in `sim_engine/op_log.py`, plus the
(space, addr) namespace and read/write semantics of `MemoryStore` in
`sim_engine/memory_store.py`. ADR-0020 (2-pass data execution) declares
that these two facilities exist, but **the precise record fields and
semantics** had no ADR-level coverage, and several recent ADRs
(ADR-0046 D3.2's `tl.store` visibility, ADR-0023 D9's IPCQ copy
record) depend on these semantics.

## First action

### `OpLogger(memory_store=None)`

On construction, initialize three fields:

1. `self._records: list[OpRecord] = []`: accumulated records.
2. `self._pending: dict[int, dict] = {}`: partial records keyed by
   `id(msg)` (created at `record_start`, completed at `record_end`).
3. `self._memory_store = memory_store`: optional `MemoryStore`
   reference, used to capture math-op input snapshots and dma_write
   HBM-source snapshots.

Both `_records` and `_pending` start empty; the `record_*` calls
accumulate data over time.

### `MemoryStore()`

On construction, initialize a single field:
`self._storage: dict[str, dict[int, np.ndarray]] = {}`, a two-level
dict (`space → addr → ndarray`). Inner dicts are created lazily as new
spaces appear.

In short, **each facility's first act is to set up empty state**:
`OpLogger` an empty record accumulator, `MemoryStore` an empty, sparse
per-space dict. The first record / write fills them when it arrives.
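
The constructor state described above can be sketched as follows. This is a minimal illustration that models only the fields named in this section; the class bodies are assumptions, not the contents of `sim_engine/op_log.py` or `sim_engine/memory_store.py`.

```python
from typing import Any, Optional

import numpy as np


class MemoryStore:
    """Sketch: only the constructor state (space -> addr -> ndarray)."""

    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}


class OpLogger:
    """Sketch: only the three fields initialized on construction."""

    def __init__(self, memory_store: Optional[MemoryStore] = None) -> None:
        self._records: list[Any] = []        # completed OpRecord objects
        self._pending: dict[int, dict] = {}  # partial records keyed by id(msg)
        self._memory_store = memory_store    # None means timing-only mode


store = MemoryStore()
timing_only = OpLogger()                 # no snapshots can be captured
with_data = OpLogger(memory_store=store)
```

Passing the store at construction time is what later lets `record_end` capture snapshots; without it, only timing is recorded.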

## Context

ADR-0020 D2/D5/D7 (2-pass data execution) declares:

- During Phase 1 (timing), `ComponentBase._on_process_start/end` hooks
  call `OpLogger.record_start/end`, recording the time and metadata of
  every data op.
- Phase 2 (data) replays the op log in `t_start` order to compute real
  data.
- Data payloads live in `MemoryStore`, keyed by (space, addr).

Subsequent ADRs (ADR-0023 D9's IPCQ atomic write, ADR-0027's Megatron
TP scratch-overwrite avoidance, ADR-0046 D3.2's `tl.store` visibility)
depend on op_log and MemoryStore behavior, but **the exact record
fields / space names / snapshot timing** are only discoverable via
source grep. This ADR codifies them.

## Decision

### D1. `OpRecord` schema: seven fields

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str   # "memory" | "gemm" | "math" | "unknown"
    op_name: str   # e.g. "dma_read", "gemm_f16", "exp",
                   # "TileToken/DMA_READ", "composite_gemm",
                   # "ipcq_copy"
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)
```

- **`t_start` / `t_end`**: SimPy time (float ns). `t_start` is when the
  component begins the op; `t_end` is completion. Duration =
  `t_end - t_start`.
- **`component_id`**: the node id where the op occurred (e.g.,
  `"sip0.cube0.pe0.pe_dma"`).
- **`op_kind`**: one of the four values listed in the comment above.
  Phase 2 DataExecutor branches on this.
- **`op_name`**: a debug/analysis-friendly name. For a TileToken, it
  expands to `"TileToken/{stage_type}"` (e.g.,
  `"TileToken/DMA_READ"`) to disambiguate stages.
- **`params`**: op-specific metadata dict (see D3).
- **`dependency_ids`**: currently unused (default `[]`). Reserved for
  future cross-op dependency tracking.
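
As a quick illustration of the schema, the sketch below builds one `dma_read` record and derives its duration. The dataclass is repeated so the example runs on its own; the addresses, sizes, and handle id are made-up values.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str
    op_name: str
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)


rec = OpRecord(
    t_start=100.0,
    t_end=164.0,
    component_id="sip0.cube0.pe0.pe_dma",
    op_kind="memory",
    op_name="dma_read",
    # Illustrative values only; see D3.1 for the key set.
    params={"src_addr": 0x1000, "nbytes": 4096, "handle_id": "h0"},
)

duration_ns = rec.t_end - rec.t_start  # 64.0
```

`default_factory=list` gives every record its own `dependency_ids` list, so a future writer appending to one record cannot leak ids into another.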

### D2. `OpLogger.records`: guaranteed `t_start` sort

```python
@property
def records(self) -> list[OpRecord]:
    self._records.sort(key=lambda r: r.t_start)
    return self._records
```

A stable, in-place sort by `t_start` runs on each access. Records with
the same `t_start` preserve insertion order. This aligns with ADR-0020
D5's "t_start stable ordering" requirement.

Phase 2 DataExecutor always accesses via the `records` property, so
even when `record_end` calls arrive out of `t_start` order (e.g., a
short op started later but finished earlier), the sequence handed to
Phase 2 is consistent.
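
The ordering guarantee can be checked with a small sketch. `Rec` is a hypothetical stand-in carrying only the fields needed here, and the list is filled in completion order to mimic out-of-order `record_end` calls.

```python
from dataclasses import dataclass


@dataclass
class Rec:  # hypothetical stand-in for OpRecord
    t_start: float
    t_end: float
    op_name: str


# Appended in record_end (completion) order, not t_start order.
appended = [
    Rec(t_start=10.0, t_end=12.0, op_name="short_late_start"),
    Rec(t_start=5.0, t_end=20.0, op_name="long_early_start"),
    Rec(t_start=10.0, t_end=30.0, op_name="tie_second"),
]

# list.sort is stable: equal t_start keeps insertion order.
appended.sort(key=lambda r: r.t_start)
order = [r.op_name for r in appended]
# order == ["long_early_start", "short_late_start", "tie_second"]
```

The two records with `t_start == 10.0` keep their relative insertion order, which is the tie-breaking behavior D2 relies on.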

### D3. `params` schema per `op_name` (matrix from `_extract_op_info`)

#### D3.1. `op_kind="memory", op_name="dma_read"` (DmaReadCmd)

```python
{"src_addr": int, "nbytes": int, "handle_id": str}
```

#### D3.2. `op_kind="memory", op_name="dma_write"` (DmaWriteCmd)

```python
{
    "src_space": str,      # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
    "src_addr": int,       # handle.addr
    "shape": tuple, "dtype": str,
    "dst_space": "hbm",    # DmaWrite always targets HBM
    "dst_addr": int,
    "nbytes": int,
    "handle_id": str,
    # When src_space == "hbm" at record_end, a snapshot is added (D4)
    "snapshot": np.ndarray | None,
}
```

#### D3.3. `op_kind="gemm", op_name=f"gemm_{dtype_a}"` (GemmCmd)

```python
{
    "src_a_addr": int, "src_b_addr": int, "dst_addr": int,
    "shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
    "dtype_in": str, "dtype_out": str,
    "m": int, "k": int, "n": int,
    # ADR-0027: per-operand + output spaces preserved
    "src_a_space": str, "src_b_space": str, "dst_space": str,
}
```

#### D3.4. `op_kind="math", op_name=msg.op` (MathCmd; op = "exp", "sum", "add", "where", …)

```python
{
    "input_addrs": list[int],       # addrs of input handles
    "input_shapes": list[tuple],
    "input_spaces": list[str],
    "input_dtypes": list[str],
    "dst_addr": int, "dst_space": str,
    "shape_out": tuple, "dtype": str,
    "axis": int | None,             # only meaningful for reductions
    # All inputs get snapshots at record_end (D4)
    "input_snapshots": list[np.ndarray | None],
}
```

#### D3.5. `op_kind="gemm" or "math", op_name=f"composite_{op}"` (CompositeCmd)

```python
{
    "op": str,                              # "gemm" | "math"
    "out_addr": int, "out_nbytes": int,
    # If op == "gemm", same fields as GemmCmd are added:
    "src_a_addr": int, "src_b_addr": int,
    "shape_a": tuple, "shape_b": tuple,
    "dtype_in": str, "dtype_out": str,
    "src_a_space": str, "src_b_space": str,
    "dst_space": "hbm", "dst_addr": int,    # = out_addr
}
```

If `op == "gemm"`, `op_kind = "gemm"`; otherwise `"math"`. This
aliasing lets Phase 2 replay composite-gemm on the same path as
`GemmCmd`.

#### D3.6. `op_kind="memory", op_name="ipcq_copy"` (record_copy path)

```python
{
    "src_space": str, "src_addr": int,
    "dst_space": str, "dst_addr": int,
    "shape": tuple, "dtype": str, "nbytes": int,
    "snapshot": np.ndarray | None,  # passed by caller; if None, record_copy reads fresh
}
```

`PE_DMA._handle_ipcq_inbound` (ADR-0023 D9) emits this record so Phase
2 can replay the IPCQ slot's inbound copy. It bypasses
`record_start` / `record_end` and pushes directly via `record_copy()`.

#### D3.7. `op_kind="unknown", op_name=type(msg).__name__`

Fallback for messages `_extract_op_info` doesn't recognize. `params =
{}`. If DataExecutor encounters this kind, it skips the record, so
Phase 2 replay is unaffected.
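
To make the `op_kind` branching concrete, here is a minimal sketch of a replay loop that dispatches on `op_kind` and skips `"unknown"` records, together with the composite-kind rule from D3.5. The handler table, the dict-shaped records, and the function names are hypothetical; the real DataExecutor's structure is not described in this ADR.

```python
from typing import Any, Callable


def composite_kind(op: str) -> str:
    """D3.5 rule: a composite gemm is replayed as a gemm, anything else as math."""
    return "gemm" if op == "gemm" else "math"


def replay(records: list[dict[str, Any]]) -> list[str]:
    """Hypothetical dispatcher: returns a trace of what would be replayed."""
    handlers: dict[str, Callable[[dict[str, Any]], str]] = {
        "memory": lambda r: f"memory:{r['op_name']}",
        "gemm": lambda r: f"gemm:{r['op_name']}",
        "math": lambda r: f"math:{r['op_name']}",
    }
    trace = []
    for rec in records:
        handler = handlers.get(rec["op_kind"])
        if handler is None:  # "unknown" (D3.7): skip, replay is unaffected
            continue
        trace.append(handler(rec))
    return trace


trace = replay([
    {"op_kind": "memory", "op_name": "dma_read"},
    {"op_kind": composite_kind("gemm"), "op_name": "composite_gemm"},
    {"op_kind": "math", "op_name": "exp"},
    {"op_kind": "unknown", "op_name": "SomeMsg"},  # made-up message type
])
```

Under these assumptions the unknown record simply drops out of the trace, while the composite gemm lands on the same handler as a plain gemm.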

### D4. Snapshot capture timing

When `OpLogger._memory_store` is set, `record_end` performs:

- **Math op**: read every input
  (addr/shape/space/dtype) from `self._memory_store.read(...)` and
  attach an ndarray copy to `params["input_snapshots"]`. Read failure
  → `None`.
- **`dma_write` op**: snapshot the source **only if `src_space ==
  "hbm"`** and attach to `params["snapshot"]`. TCM (PE scratch)
  sources are **deliberately skipped**: TCM is repopulated by Phase 2
  math/gemm replay, and a Phase-1-time snapshot would capture a
  previous kernel's stale value (ADR-0027 postmortem: TP gemm →
  all_reduce race).
- **`ipcq_copy`**: the caller passes the in-flight snapshot via
  `snapshot=token.data`. If absent, `record_copy` attempts a fresh
  read from MemoryStore.

Snapshots are taken with `.copy()` (fresh allocation), making them
safe against later storage mutation. This is the foundation of
ADR-0027's "cross-PE Phase 2 ordering" race-avoidance.

When `memory_store` is `None` (Phase 1 timing-only mode), all
snapshot steps are skipped. Only the timing portion of the record is
preserved; data replay is unavailable.
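
The per-op policy above can be sketched as a single helper. This is an illustration under stated assumptions: `capture_snapshots` is a hypothetical name, and a plain dict keyed by `(space, addr)` stands in for `MemoryStore`; only the math and `dma_write` branches are modeled.

```python
import numpy as np


def capture_snapshots(op_kind, op_name, params, store):
    """Hypothetical helper mirroring the D4 policy.

    `store` is a plain dict {(space, addr): ndarray} standing in for
    MemoryStore, or None for timing-only mode.
    """
    if store is None:  # timing-only mode: skip all snapshot steps
        return params

    def read_copy(space, addr):
        arr = store.get((space, addr))
        return None if arr is None else arr.copy()  # read failure -> None

    if op_kind == "math":
        params["input_snapshots"] = [
            read_copy(space, addr)
            for space, addr in zip(params["input_spaces"], params["input_addrs"])
        ]
    elif op_name == "dma_write" and params.get("src_space", "tcm") == "hbm":
        # TCM sources are deliberately not snapshotted.
        params["snapshot"] = read_copy("hbm", params["src_addr"])
    return params


store = {("hbm", 0x100): np.arange(4, dtype=np.float32)}
hbm_params = capture_snapshots(
    "memory", "dma_write", {"src_space": "hbm", "src_addr": 0x100}, store)
tcm_params = capture_snapshots(
    "memory", "dma_write", {"src_space": "tcm", "src_addr": 0x0}, store)
math_params = capture_snapshots(
    "math", "exp", {"input_spaces": ["hbm", "tcm"], "input_addrs": [0x100, 0x0]}, store)

store[("hbm", 0x100)][0] = 99.0  # later mutation does not reach the copies
```

Because each snapshot is a `.copy()`, the later write of `99.0` is invisible to `hbm_params["snapshot"]`, while the TCM-sourced record carries no snapshot at all.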

### D5. TileToken handling: `record_start` captures stage info

ADR-0014 D6's self-routing tile token (pipeline mode) may have already
advanced its `stage_idx` by the time `record_end` runs (the TileToken
caches the next stage's params as it moves to the next component).
Therefore `record_start` pre-saves the following in
`_pending[id(msg)]["snap"]`:

```python
snap["stage_type"] = stage.stage_type.name    # "DMA_READ", "GEMM", ...
snap["stage_params"] = dict(stage.params)     # copy of params at start time
```

`record_end` retrieves this snap and merges it into params:

- Adds `params["stage_type"]` to the final params.
- Merges `stage_params` keys (keeps existing values if any).
- If `op_name == "TileToken"`, rewrites it to
  `f"TileToken/{stage_type}"` (e.g., `"TileToken/DMA_READ"`),
  disambiguating different stages emitted by the same component.

As a result, DMA_READ vs DMA_WRITE and FETCH vs STORE coming from the
same component (e.g., pe_dma) are distinguishable in reports.
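
A minimal sketch of the capture-then-merge flow follows. `StageType`, the module-level `pending` dict, and the simplified `record_start` / `record_end` signatures are assumptions made for illustration; the real methods live on `OpLogger` and take different arguments.

```python
from enum import Enum
from types import SimpleNamespace


class StageType(Enum):  # hypothetical; only the member names matter here
    DMA_READ = 1
    GEMM = 2


pending = {}  # stand-in for OpLogger._pending


def record_start(msg, stage):
    # Freeze stage info now, before the token can advance.
    pending[id(msg)] = {"snap": {
        "stage_type": stage.stage_type.name,
        "stage_params": dict(stage.params),
    }}


def record_end(msg, op_name, params):
    snap = pending.pop(id(msg))["snap"]
    params["stage_type"] = snap["stage_type"]
    for key, value in snap["stage_params"].items():
        params.setdefault(key, value)  # keep existing values if any
    if op_name == "TileToken":
        op_name = f"TileToken/{snap['stage_type']}"
    return op_name, params


token = SimpleNamespace(stage_idx=0)
stage = SimpleNamespace(stage_type=StageType.DMA_READ,
                        params={"nbytes": 256, "src_addr": 4096})
record_start(token, stage)

# The token advances before record_end runs.
token.stage_idx = 1
stage.params["src_addr"] = 0

name, params = record_end(token, "TileToken", {"nbytes": 128})
```

Here `name` ends up as `"TileToken/DMA_READ"`, `params["nbytes"]` stays at the pre-existing `128`, and `params["src_addr"]` is the start-time value `4096` despite the later mutation.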

### D6. `MemoryStore`: two-level (space, addr) dict

```python
import numpy as np


class MemoryStore:
    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}

    def write(self, space, addr, data):
        # Create the inner dict lazily on the first write to a new space.
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
    def has(self, space, addr) -> bool: ...
    def snapshot(self) -> "MemoryStore": ...
```

#### D6.1. Space namespace

A string key. Standard values:

- `"hbm"`: HBM data (deploy_tensor + Phase 2 dma_write results).
- `"tcm"`: PE-local TCM (Phase 2 math/gemm output).
- `"sram"`: cube-level SRAM (ADR-0023 D9.7's IPCQ slot tier).

Other spaces (e.g., `"reg"`) are allowed: `_storage` is a lazy dict
that creates a new space when `write` first touches it.

#### D6.2. Address keying

`addr` is an integer. It may be a **physical address (PA) or a virtual
address (VA)**; `MemoryStore` itself doesn't know address-space
semantics and just uses the values as keys. Phase 1's `MemoryWriteMsg`
writes both PA and VA
(`_create_tensor` zero-inits at PA and at the VA base too); Phase 2
reads/writes via the addresses captured by op_log.

The caller decides `addr`'s meaning; `MemoryStore` provides only
lookup.

#### D6.3. read/write semantics: reference store (no copy)

`write(space, addr, data)`: stores the ndarray reference. **No copy.**
If the caller later mutates the same ndarray, the stored value
changes.

`read(space, addr, shape=None, dtype=None)`: returns the stored
ndarray reference. If `shape`/`dtype` are provided:

- `dtype != stored.dtype`: `arr.view(np_dtype)` reinterprets as a
  view (no copy).
- `shape != stored.shape`: if `nbytes` matches, `arr.reshape(shape)`
  is a view.
- `nbytes` mismatch → `ValueError`.

To detach the data, the caller must call `arr.copy()`. ADR-0027's
race-avoidance requires explicit `.copy()` in op_log snapshot steps
for exactly this reason.
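
The reference-store behavior is easy to misread, so here is a small NumPy sketch of it. The free functions `write` / `read` and the module-level `storage` dict are stand-ins for the `MemoryStore` methods, and the order of the dtype and shape checks is an assumption.

```python
import numpy as np

storage = {}  # stand-in for MemoryStore._storage


def write(space, addr, data):
    storage.setdefault(space, {})[addr] = data  # stores the reference, no copy


def read(space, addr, shape=None, dtype=None):
    arr = storage[space][addr]
    if dtype is not None and np.dtype(dtype) != arr.dtype:
        arr = arr.view(np.dtype(dtype))  # reinterpret bytes, no copy
    if shape is not None and tuple(shape) != arr.shape:
        if int(np.prod(shape)) * arr.dtype.itemsize != arr.nbytes:
            raise ValueError("nbytes mismatch")
        arr = arr.reshape(shape)         # view for a contiguous array
    return arr


buf = np.zeros(8, dtype=np.float32)
write("tcm", 0x40, buf)

alias = read("tcm", 0x40, shape=(2, 4))  # view onto buf
raw = read("tcm", 0x40, dtype=np.uint8)  # same bytes, 32 uint8 values
detached = read("tcm", 0x40).copy()      # independent allocation

buf[0] = 7.0  # caller mutates after writing
# alias[0, 0] is now 7.0; detached[0] is still 0.0
```

Only `detached` is isolated from the caller's later mutation, which is why the op_log snapshot steps copy explicitly.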

#### D6.4. `has(space, addr) -> bool`

Existence check; does not materialize data.

#### D6.5. `snapshot() -> MemoryStore`

Shallow copy. Creates new inner dicts but shares the ndarray
references. Used at Phase 2 init to fork from Phase 1's store, so
Phase 2 writes (which rebind keys in the forked dicts) don't affect
Phase 1's remaining consumers. Because the arrays themselves are
shared (D6.3), an in-place mutation of a shared ndarray remains
visible through both stores.
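
The shallow-copy fork can be sketched with plain dicts. `shallow_snapshot` is a hypothetical helper operating on the `_storage` shape from D6, not the actual `snapshot()` implementation.

```python
import numpy as np


def shallow_snapshot(storage):
    """New outer and inner dicts, same ndarray objects (assumed behavior)."""
    return {space: dict(inner) for space, inner in storage.items()}


phase1 = {"hbm": {0x00: np.zeros(4, dtype=np.float32),
                  0x40: np.zeros(4, dtype=np.float32)}}
phase2 = shallow_snapshot(phase1)

# Rebinding a key or adding a space in the fork is isolated from phase1.
phase2["hbm"][0x00] = np.ones(4, dtype=np.float32)
phase2["tcm"] = {0x10: np.zeros(2, dtype=np.float32)}

# In-place mutation of a shared array is visible through both dicts.
phase2["hbm"][0x40][0] = 5.0
```

After this runs, `phase1["hbm"][0x00]` is still all zeros and `phase1` has no `"tcm"` space, but `phase1["hbm"][0x40][0]` reads `5.0` because that array object is shared.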

### D7. op_log assumes a single-threaded SimPy

`OpLogger`'s `_records` and `_pending` are lock-free. SimPy is
single-threaded, so nothing else can intrude between `record_start`
and `record_end` for the same message.

When multi-process kernbench (ADR-0047 D6) arrives, OpLogger must be
split per process: one OpLogger instance cannot receive records from
multiple processes.

## Alternatives Considered

### A1. Externalize op_log to SQLite / parquet

Rejected for now. The in-memory list minimizes Phase 1 → Phase 2
hand-off latency. Externalization makes sense for long-running batch
runs but adds overhead for the current single-run workload.

### A2. Capture snapshots at `record_start`

Rejected. At `record_start`, inputs are often not yet populated (e.g.,
a math op's input is the output of a just-issued previous op).
`record_end` is the correct point.

### A3. Per-component MemoryStore

Rejected. The (space, addr) key already disambiguates effectively, and
splitting per component would complicate cross-PE IPCQ copy (ADR-0023
D9), which needs access to both source and destination stores.

### A4. Explicit dependency edges in op_log

Partially adopted. The `dependency_ids` field exists on `OpRecord` but
is currently unused (D1). Phase 2 DataExecutor orders via `t_start` +
a secondary sort (memory ops before math at the same `t_start`). These
ordering rules are sufficient today, so the field remains unused; if
an explicit dependency graph is ever required, this field is its home.

## Consequences

- ADR-0020's op_log / MemoryStore declarations are expanded into the
  concrete D1–D6 schemas, so writing/modifying Phase 2 DataExecutor
  doesn't need source-grep to learn field semantics.
- D3's per-`op_name` params matrix makes adding new ops (e.g., a new
  reduction type) a question of branching in `_extract_op_info`.
- D4's per-op snapshot policy (math = input snapshot, dma_write =
  HBM-only snapshot) is ADR-locked, so ADR-0027's race-avoidance
  decision won't silently regress on future refactors.
- D6.3's reference-store semantics are explicit, putting mutation
  safety on the caller. ADR-0027's explicit `.copy()` pattern is
  justified.
- D7's single-thread assumption is recorded, so multi-process
  kernbench (ADR-0047 D6's supersession candidate) will need OpLogger
  separation when introduced.