Establish English as the canonical ADR language with Korean translations held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror). Promotion from adr-proposed/ to adr/ now writes English to adr/ and the Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md. - Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English, 2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix dropped). ADR-0023 EN regenerated against KO source which had newer HW Realization Notes (D16-D23) section. - docs/adr-history/ left frozen by design (transitional state). - CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline section covering bidirectional sync, conflict resolution (EN wins), and proposed-language freedom. - tools/verify_adr_lang_pairs.py: new verification tool checking pair completeness, filename mirroring, ADR-ID match, Status byte-equality. Pre-commit hook intentionally not added; run on demand or in CI. - tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF normalization, em-dash title separator, underscore-slug edge case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
21 KiB
ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
Status
Accepted
Context
The current simulation models timing only.
tl.load(), tl.composite(op="gemm"), etc. generate SimPy latencies,
but do not actually read tensor data or perform computations.
Required Capabilities
- Must be able to store and read actual data in HBM/TCM/SRAM
- PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
- Must minimize simulation performance degradation
Constraints
- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
- Kernel functions must remain plain Python functions (no generator/async transformation)
Design Exploration Results
| Option | Approach | Verdict |
|---|---|---|
| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
| 2-pass (adopted) | Phase 1: timing, Phase 2: data | Full separation, no performance impact |
Decision
D1. 2-Pass Execution Model — Phase 0 Elimination
The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are consolidated into 2 stages.
Before:
Phase 0: Kernel → PeCommand list (no data, no branching)
Phase 1: Replay PeCommand list via SimPy (timing only)
After:
Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
- Memory read/write: SimPy timing + MemoryStore actual data
- Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
- Dynamic control flow possible (tl.load returns actual data)
Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
This ADR extends Phase 1 to be data-aware for memory operations only. Phase 1 handles latency/BW bottleneck analysis + memory data tracking, Phase 2 handles GEMM/Math computation correctness verification. Phase 2 is optional — if only timing is needed, run Phase 1 alone.
D2. Op Log Recording — ComponentBase Hook
Op log recording is performed as a hook in the component base class. Individual component implementations are not modified.
class ComponentBase:
def _on_process_start(self, env, msg):
if self._op_logger and getattr(msg, 'data_op', False):
self._op_logger.record_start(env.now, self.node.id, msg)
def _on_process_end(self, env, msg):
if self._op_logger and getattr(msg, 'data_op', False):
self._op_logger.record_end(env.now, self.node.id, msg)
Hooks are called before and after run() within _forward_txn().
_op_logger is optional — zero overhead when absent.
Hook timing definitions:
| Timing | Meaning |
|---|---|
t_start |
The point at which the component begins servicing the msg (immediately before run() entry) |
t_end |
The point at which the component's internal service completes (immediately after run() returns) |
Link traversal latency is not included in t_start/t_end. Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
The existing Phase 0 (kernel → PeCommand list) is eliminated, and greenlet is used to cooperatively interleave kernel and SimPy execution.
Operating Principle
greenlet is a C extension that provides cooperative context switching.
When the kernel (child greenlet) calls tl.load() etc., it switches to the SimPy loop (parent greenlet)
to perform timing simulation, and after completion, returns to the kernel with actual data.
SimPy loop (parent greenlet) Kernel (child greenlet)
───────────────────────── ──────────────────────
g.switch() ─────────────────────────→ Kernel starts
a = tl.load(ptr, ...)
internal: parent.switch(DmaReadCmd)
cmd = DmaReadCmd ←────────────────── (kernel paused)
yield DmaReadMsg(...)
yield env.timeout(dma_latency)
data = memory_store.read(...)
g.switch(data) ─────────────────────→ (kernel resumed)
a = data ← actual numpy array
if a[0][0] > 0.5: ← branching possible
...
The kernel is maintained as a plain Python function.
greenlet switches exist only within the internal implementation of tl.load(), tl.store(), etc.
KernelRunner — Framework Layer
The greenlet loop resides not in the PE_CPU component but in the framework layer, KernelRunner.
# KernelRunner (framework — greenlet ↔ SimPy bridge)
class KernelRunner:
def run(self, env, kernel_fn, args, store):
g = greenlet(self._run_kernel)
cmd = g.switch(kernel_fn, args)
while cmd is not None:
if isinstance(cmd, DmaReadCmd):
yield from self._dispatch_dma(env, cmd)
data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
cmd = g.switch(data) # resume with actual data
elif isinstance(cmd, GemmCmd):
yield from self._dispatch_gemm(env, cmd)
cmd = g.switch() # resume (no data)
elif isinstance(cmd, DmaWriteCmd):
store.write(cmd.dst_addr, cmd.data) # visibility = issue time
yield from self._dispatch_dma(env, cmd) # timing only
cmd = g.switch()
# PE_CPU (component — kept simple, unaware of greenlet)
def _execute_kernel(self, env):
runner = KernelRunner(self.ctx)
yield from runner.run(env, kernel_fn, args, store)
Op logging single source of truth: KernelRunner does not record directly to op_log.
All op logging is handled solely by the ComponentBase hook (_on_process_start/end).
When KernelRunner delivers messages to components via _dispatch_gemm() etc.,
the component base class hooks automatically record them.
Layer separation:
- Kernel code: plain function, unaware of greenlet
- TLContext: calls
parent.switch(cmd)insidetl.load() - KernelRunner: greenlet ↔ SimPy bridge, handles MemoryStore read/write. Does not log.
- ComponentBase hook: the sole path for op_log recording
- PE_CPU: only calls KernelRunner, replaceable as a component
Handling Differences Between Memory Read/Write and Compute
| Operation | In Phase 1 | In Phase 2 |
|---|---|---|
tl.load() |
SimPy timing + MemoryStore read → actual data returned | — |
tl.store() |
SimPy timing + MemoryStore write → actual write | — |
tl.composite(gemm) |
SimPy timing + op_log recording only | numpy actual computation |
tl.dot() / math ops |
SimPy timing + op_log recording only | numpy actual computation |
Memory read/write is processed immediately in Phase 1 (numpy slice, fast). GEMM/Math operations are batch-executed in Phase 2 (performance separation).
Store Visibility Rule
tl.store() is immediately reflected in MemoryStore at issue time (visibility = issue).
SimPy DMA timing is simulated separately afterward.
This is an intentional separation of timing and visibility:
- visibility: the point at which it is reflected in MemoryStore = when
store.write()is called - timing: the point at which DMA latency completes in SimPy
This separation allows a load immediately after a store to see the latest data in dynamic control flow.
Result Handle Semantics
tl.composite() (sync/async) returns a handle referencing the result tensor.
The key contract in Phase 1:
- All compute handles are always considered pending in Phase 1.
tl.wait(handle)expresses timing synchronization only and does not make the handle ready.- Accessing the handle's actual result data (
handle.data, element access, numpy conversion, etc.) is only possible in Phase 2. - Therefore, compute-result-based control flow is not supported in Phase 1.
- In contrast,
tl.load()returns actual data in Phase 1, so memory-read-based control flow is supported.
| Handle state | Phase | Allowed operations |
|---|---|---|
| pending | Phase 1 | tl.wait(handle) — timing synchronization only |
| pending | Phase 1 | Pass handle as target of tl.store() (logical destination binding only, payload in Phase 2) |
| pending | Phase 1 | Data access not allowed — value-based branching not possible |
| ready | Phase 2 | Actual numpy data access, verification |
This restriction is intentional. If computations were executed in Phase 1, the SimPy single-thread would block, defeating the purpose of 2-pass separation.
Phase 1 Materialization — Future Extension
If Phase 1 eager execution becomes necessary for small operations
(scalar, small reduction) in the future, selective materialization can be supported
by adding a materialized_in_phase1: bool flag to the op record.
This is not implemented in the current scope.
D4. data_op Flag — Message Self-Declaration
The logging target is determined by the data_op attribute on the message instance,
not by message type. The framework does not hardcode message types.
class MsgBase:
data_op: bool = False # default: no logging
class DmaReadCmd(MsgBase):
data_op = True # memory transfer → logging
class GemmCmd(MsgBase):
data_op = True # compute → logging
class MathCmd(MsgBase):
data_op = True # compute → logging
When adding a new message type (e.g., IpcqMsg), simply setting data_op = True
enables automatic logging without modifying framework code.
D5. Op Log Structure
Op Classification Scheme
A two-level classification is used:
| Level | Field | Role |
|---|---|---|
op_kind |
memory | gemm | math |
executor dispatch criterion |
op_name |
dma_read | dma_write | gemm_f16 | exp | add | sum etc. |
specific operation identification |
OpRecord Definition
@dataclass
class OpRecord:
t_start: float # SimPy time (ns) — service start
t_end: float # SimPy time (ns) — service completion
component_id: str # e.g. "sip0.cube0.pe0.pe_gemm"
op_kind: str # "memory" | "gemm" | "math"
op_name: str # specific operation name
params: dict # per-operation parameters (see below)
dependency_ids: list[int] # currently based on in-memory record index, may be replaced with stable op_id in the future
dependency_ids Generation Rules
dependency_ids is optional, and by default the executor performs
address-based dependency inference (see D6).
Explicit setting is only needed when precise execution ordering is required:
- Default (address-based inference): the executor analyzes read/write sets to automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
- Explicit setting: set when logical dependencies cannot be expressed via addresses at the TLContext or command generation stage. Example: completion handle-based synchronization — handle dependencies depend on logical completion order rather than memory addresses, so they cannot be captured by address inference.
op_log Ordering
The op_log maintains stable ordering based on t_start.
Records with the same t_start preserve insertion order.
params Details
memory (dma_read / dma_write):
{
"src_addr": int, # source address (byte)
"dst_addr": int, # destination address (byte)
"nbytes": int, # transfer size
"src_space": str, # "hbm" | "tcm" | "sram"
"dst_space": str, # "hbm" | "tcm" | "sram"
}
gemm:
{
"src_a_addr": int, # operand A address
"src_b_addr": int, # operand B address
"dst_addr": int, # output address
"shape_a": tuple, # e.g. (128, 256)
"shape_b": tuple, # e.g. (256, 128)
"shape_out": tuple, # e.g. (128, 128)
"dtype_in": str, # e.g. "f16"
"dtype_acc": str, # accumulation dtype, e.g. "f32"
"dtype_out": str, # output dtype, e.g. "f16"
"transpose_a": bool,
"transpose_b": bool,
"layout_a": str, # "row_major" | "col_major"
"layout_b": str,
"layout_out": str,
"addr_space": str, # "tcm" (GEMM operands are always in TCM)
}
math:
{
"op": str, # "exp" | "add" | "sum" | "where" | ...
"input_addrs": list[int], # list of operand addresses
"input_shapes": list[tuple],
"dst_addr": int,
"shape_out": tuple,
"dtype": str,
"axis": int | None, # reduction axis
"addr_space": str, # "tcm"
}
D6. Phase 2 Executor
Phase 2 executes the op_log outside of SimPy.
class DataExecutor:
def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
self.store = initial_store # Takes the Phase 1 MemoryStore snapshot as input
def run(self):
for t, ops in groupby(op_log, key=lambda o: o.t_start):
batch = list(ops)
independent, sequential = self._classify(batch)
self._execute_parallel(independent)
self._execute_sequential(sequential)
Parallel execution determination:
Ops with the same t_start are considered parallel candidates.
The executor determines actual parallel execution based on the following criteria:
- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
- Whether predecessor ops specified in
dependency_idshave completed
Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
Batch optimization: Only independent ops with the same op_name and identical
shape, dtype, layout, and transpose flags are eligible for batching.
Example: identical shape GEMMs from multiple PEs → bundled into a single np.matmul(a_batch, b_batch) call.
Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
Phase 2 execution order guarantee:
Phase 2 does not consider data arrival timing, and guarantees execution order solely through dependencies (address-based inference + explicit dependency_ids).
D7. Memory Store
MemoryStore logically follows byte-addressable semantics,
and the current implementation uses tensor-granular storage (addr → numpy ndarray mapping).
class MemoryStore:
def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
Internal storage format: numpy ndarray
MemoryStore stores tensors as numpy ndarrays.
| Candidate | store/load speed | Phase 2 compute | Verdict |
|---|---|---|---|
| numpy ndarray | Immediate (reference passing, no copy) | np.matmul directly usable |
Adopted |
| bytearray | Requires memcpy | Requires np.frombuffer conversion |
Rejected |
| torch tensor | Immediate | torch operations available | Use only for GPU optimization |
- write: stores numpy array by reference (no copy) → Phase 1 overhead = 1 dict lookup
- read: returns numpy array by reference (no copy)
- Re-writing to the same addr overwrites at tensor granularity (partial overwrite not supported)
- dtype uses numpy native (
np.float16,np.float32,np.bfloat16, etc.) - For byte-level access, convert via
.view(np.uint8) - For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
read/write contract:
- read/write operates on a contiguous tensor basis. If non-contiguous stride views are needed, express them as separate copy ops.
- In the normal benchmark path, producer/consumer dtype match is expected. Reinterpret cast is a permissive behavior for low-level memory validation or special test cases.
- addr is byte-aligned, with minimum alignment = dtype size.
- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast. Shape mismatch is verified based on nbytes, and raises an error on mismatch.
- Correctness criteria follow address-range-based read/write semantics.
- A tensor object cache may be used as an implementation optimization, but the canonical state is byte-addressable storage.
- At deploy time, the host injects initial tensor data.
D8. Benchmark Kernel Code
The benchmark's user code API is not changed.
The call interfaces for tl.load(), tl.composite(), tl.store(), etc. are maintained.
However, internal command/message schemas may be extended to include metadata required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
D9. No Component Changes
Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified. Op log recording is the responsibility of the ComponentBase hook. When custom components are replaced, only the timing model changes, and Phase 2 data execution is unaffected.
D10. Phase 2 is Optional
engine = GraphEngine(graph)
engine.run(benchmark) # Phase 1: timing only
result = engine.get_timing_result()
if verify_data:
executor = DataExecutor(engine.op_log) # Phase 2: data
executor.run()
executor.verify(expected_output)
If only timing analysis is needed, Phase 2 is skipped. If the op_logger is deactivated, Phase 1 performance is identical to the original.
D11. Verification Contract
Basic verification compares the final output tensor against a reference backend (numpy).
Per-dtype tolerance policy:
| dtype | Comparison method | Tolerance |
|---|---|---|
| f32 | np.allclose |
rtol=1e-5, atol=1e-5 |
| f16 | np.allclose |
rtol=1e-3, atol=1e-3 |
| bf16 | np.allclose |
rtol=1e-2, atol=1e-2 |
| int types | np.array_equal |
exact |
- Default mode: compare final output only (end-to-end correctness)
- Debug mode: can compare intermediate tensors on a per-op basis (MemoryStore snapshot at each op boundary)
Non-goals
- Compute-result-based control flow: not supported.
All compute handles are in pending state during Phase 1,
wait()expresses timing synchronization only and does not imply data readiness. Accessinghandle.data, element access, or truth-value evaluation in Phase 1 is treated as an error. Memory-data-based branching (results oftl.load()) is supported via greenlet. Phase 1 materialization is a future extension (see D3). - Cycle-accurate overlap reconstruction: Phase 2 does not precisely reproduce the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
- GPU kernel compilation: GEMM/Math in Phase 2 are numpy/torch calls and do not reproduce the actual hardware PE microarchitecture.
Open Questions
- Aliasing / slice view: How to represent slice/views referencing the same backing storage in MemoryStore (stride-based view vs copy semantics)
- IPCQ/descriptor read generalization: Whether to fully generalize PE-to-PE communication as memory ops or introduce a separate op_kind
- Op log streaming: Managing op_log memory usage in large-scale simulations (in-memory list vs disk-backed streaming)
- Fused operation: Whether to record tl.composite's tiled pipeline (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
- Math op schema generalization: The current math params have a simple structure, but generalization may be needed for broadcasting rules, per-input dtype, keepdims, scalar/immediate operands, where/mask expressions, etc.
- Op record identifier: Currently dependency_ids are based on in-memory list indices; replacement with stable op_id is needed when introducing streaming/disk-backed mode
- Phase 1 materialization policy: See Future Extension in D3. If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops needs to be defined
Consequences
Positive
- Minimal impact on SimPy simulation performance (only op_log append added)
- Free to use multi-threading/GPU in Phase 2
- Component replaceability preserved (ADR-0015 design philosophy maintained)
- No changes needed to benchmark user code API
- When adding new message types, only set the data_op flag
- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
tl.load()returns actual data, making kernel debugging easier
Negative
- op_log memory usage (for large-scale simulations)
- Phase 2 execution time is proportional to tensor size (large GEMM)
- Dynamic branching based on pending handles (incomplete computations) not possible (computations execute in Phase 2, result values are undetermined in Phase 1). Memory-data-based branching is supported via greenlet.
- greenlet C extension dependency added (pip install greenlet)