kernbench2/docs/adr/ADR-0020-data-execution-two-pass.en.md
ywkang b2c52f0e34 Add English translations for ADR-0018, 0019, 0020, 0021
- ADR-0018: LA-based memory address abstraction + BAAW + HBM channel mapping
- ADR-0019: CUBE NOC per-channel and aggregated HBM connection model
- ADR-0020: 2-pass data execution model (timing/data separation, greenlet)
- ADR-0021: PE pipeline refactor (component separation + token self-routing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 16:31:32 -07:00


ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)

Status

Proposed

Context

The current simulation models timing only. tl.load(), tl.composite(op="gemm"), etc. generate SimPy latencies, but do not actually read tensor data or perform computations.

Required Capabilities

  1. Must be able to store and read actual data in HBM/TCM/SRAM
  2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
  3. Must minimize simulation performance degradation

Limitations of the Existing Kernel Execution Structure

The current kernel execution is separated into 2 stages:

Phase 0: Kernel function execution in TLContext → PeCommand list generation (outside SimPy, no data)
Phase 1: PE_CPU replays PeCommand list via SimPy (timing only)

Phase 0 requires the kernel to complete execution entirely before SimPy begins. tl.load() returns a TensorHandle (placeholder), so actual data cannot be accessed. Therefore, branching based on data values (dynamic control flow) is impossible.

This ADR resolves this limitation for memory operations only (see D1, D3).

Constraints

  • SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
  • Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
  • Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
  • Kernel functions must remain plain Python functions (no generator/async transformation)

Design Exploration Results

| Option | Approach | Verdict |
| --- | --- | --- |
| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
| 2-pass (adopted) | Phase 1: timing, Phase 2: data | Full separation, no performance impact |

Decision

D1. 2-Pass Execution Model — Phase 0 Elimination

Phase 0 is eliminated, so execution consists of 2 stages (Phase 1 → Phase 2) rather than 3 (Phase 0 → Phase 1 → Phase 2).

Before:

Phase 0: Kernel → PeCommand list (no data, no branching)
Phase 1: Replay PeCommand list via SimPy (timing only)

After:

Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
  - Memory read/write: SimPy timing + MemoryStore actual data
  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
  - Dynamic control flow possible (tl.load returns actual data)

Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable

This ADR extends Phase 1 to be data-aware for memory operations only. Phase 1 handles latency/BW bottleneck analysis and memory data tracking; Phase 2 handles GEMM/Math computation correctness verification. Phase 2 is optional: if only timing is needed, run Phase 1 alone.

D2. Op Log Recording — ComponentBase Hook

Op log recording is performed as a hook in the component base class. Individual component implementations are not modified.

class ComponentBase:
    def _on_process_start(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_start(env.now, self.node.id, msg)

    def _on_process_end(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_end(env.now, self.node.id, msg)

Hooks are called before and after run() within _forward_txn(). _op_logger is optional: when it is absent, each hook reduces to a single attribute check, so the overhead is negligible.

Hook timing definitions:

| Timing | Meaning |
| --- | --- |
| t_start | The point at which the component begins servicing the msg (immediately before run() entry) |
| t_end | The point at which the component's internal service completes (immediately after run() returns) |

Link traversal latency is not included in t_start/t_end. Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
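
As a small illustration of the timing definitions above, link latency can be derived from two hook records. The `HookRecord` layout and the `link_latency` helper are hypothetical names used only for this sketch; they are not part of the proposed API.

```python
from dataclasses import dataclass


@dataclass
class HookRecord:
    """Hypothetical per-component record holding the two hook timestamps."""
    component_id: str
    t_start: float  # ns, immediately before run() entry
    t_end: float    # ns, immediately after run() returns


def link_latency(sender: HookRecord, receiver: HookRecord) -> float:
    """Link latency = receiver's service start minus sender's service end."""
    return receiver.t_start - sender.t_end


# Example values are made up for illustration.
dma = HookRecord("pe_dma", t_start=100.0, t_end=140.0)
hbm = HookRecord("hbm_ctrl", t_start=152.0, t_end=200.0)
latency = link_latency(dma, hbm)
```

With these example values, the 12 ns gap between the two records is attributed to the link, not to either component.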

D3. Greenlet-Based Kernel Execution — Phase 0 Elimination

The existing Phase 0 (kernel → PeCommand list) is eliminated, and greenlet is used to cooperatively interleave kernel and SimPy execution.

Operating Principle

greenlet is a C extension that provides cooperative context switching. When the kernel (child greenlet) calls tl.load() etc., it switches to the SimPy loop (parent greenlet) to perform timing simulation, and after completion, returns to the kernel with actual data.

SimPy loop (parent greenlet)           Kernel (child greenlet)
─────────────────────────              ──────────────────────
g.switch() ─────────────────────────→ Kernel starts
                                       a = tl.load(ptr, ...)
                                         internal: parent.switch(DmaReadCmd)
cmd = DmaReadCmd ←──────────────────  (kernel paused)
  yield DmaReadMsg(...)
  yield env.timeout(dma_latency)
  data = memory_store.read(...)
g.switch(data) ─────────────────────→ (kernel resumed)
                                       a = data  ← actual numpy array
                                       if a[0][0] > 0.5:  ← branching possible
                                         ...

The kernel is maintained as a plain Python function. greenlet switches exist only within the internal implementation of tl.load(), tl.store(), etc.

KernelRunner — Framework Layer

The greenlet loop resides not in the PE_CPU component but in the framework layer, KernelRunner.

# KernelRunner (framework — greenlet ↔ SimPy bridge)
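from greenlet import greenlet

class KernelRunner:
    def run(self, env, kernel_fn, args, store):
        g = greenlet(self._run_kernel)
        cmd = g.switch(kernel_fn, args)         # run the kernel until its first command

        while cmd is not None:                  # None: the kernel function has returned
            if isinstance(cmd, DmaReadCmd):
                yield from self._dispatch_dma(env, cmd)
                # MemoryStore.read(space, addr, shape, dtype), see D7;
                # src_space/dst_space field names follow the D5 memory params
                data = store.read(cmd.src_space, cmd.src_addr, cmd.shape, cmd.dtype)
                cmd = g.switch(data)            # resume with actual data
            elif isinstance(cmd, GemmCmd):
                yield from self._dispatch_gemm(env, cmd)
                cmd = g.switch()                # resume (no data)
            elif isinstance(cmd, DmaWriteCmd):
                store.write(cmd.dst_space, cmd.dst_addr, cmd.data)  # visibility = issue time
                yield from self._dispatch_dma(env, cmd)  # timing only
                cmd = g.switch()
            else:
                # fail fast instead of looping forever on an unknown command
                raise TypeError(f"unsupported kernel command: {type(cmd).__name__}")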

# PE_CPU (component — kept simple, unaware of greenlet)
def _execute_kernel(self, env):
    runner = KernelRunner(self.ctx)
    yield from runner.run(env, kernel_fn, args, store)

Op logging single source of truth: KernelRunner does not record directly to op_log. All op logging is handled solely by the ComponentBase hook (_on_process_start/end). When KernelRunner delivers messages to components via _dispatch_gemm() etc., the component base class hooks automatically record them.

Layer separation:

  • Kernel code: plain function, unaware of greenlet
  • TLContext: calls parent.switch(cmd) inside tl.load()
  • KernelRunner: greenlet ↔ SimPy bridge, handles MemoryStore read/write. Does not log.
  • ComponentBase hook: the sole path for op_log recording
  • PE_CPU: only calls KernelRunner, replaceable as a component

Handling Differences Between Memory Read/Write and Compute

| Operation | In Phase 1 | In Phase 2 |
| --- | --- | --- |
| tl.load() | SimPy timing + MemoryStore read → actual data returned | - |
| tl.store() | SimPy timing + MemoryStore write → actual write | - |
| tl.composite(gemm) | SimPy timing + op_log recording only | numpy actual computation |
| tl.dot() / math ops | SimPy timing + op_log recording only | numpy actual computation |

Memory read/write is processed immediately in Phase 1 (numpy slice, fast). GEMM/Math operations are batch-executed in Phase 2 (performance separation).

Store Visibility Rule

tl.store() is immediately reflected in MemoryStore at issue time (visibility = issue). SimPy DMA timing is simulated separately afterward.

This is an intentional separation of timing and visibility:

  • visibility: the point at which it is reflected in MemoryStore = when store.write() is called
  • timing: the point at which DMA latency completes in SimPy

This separation allows a load immediately after a store to see the latest data in dynamic control flow.
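
The visibility rule above can be sketched with a toy store. `TinyStore` and `issue_store` are hypothetical stand-ins, not the proposed MemoryStore or tl.store() implementation; the sketch only shows that the write lands at issue time while the DMA completion time is tracked separately.

```python
class TinyStore:
    """Hypothetical stand-in for MemoryStore: maps addr -> value."""

    def __init__(self):
        self._data = {}

    def write(self, addr, value):
        self._data[addr] = value

    def read(self, addr):
        return self._data[addr]


def issue_store(store, now, addr, value, dma_latency):
    """Apply the write immediately (visibility) and return the DMA completion time (timing)."""
    store.write(addr, value)      # visible at issue time
    return now + dma_latency      # when the simulated DMA would finish


store = TinyStore()
done_at = issue_store(store, now=10.0, addr=0x1000, value=[1, 2, 3], dma_latency=50.0)

# A load issued at t=10.0, before done_at, already observes the stored data.
seen = store.read(0x1000)
```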

Result Handle Semantics

tl.composite() (sync/async) returns a handle referencing the result tensor.

The key contract in Phase 1:

  1. All compute handles are always considered pending in Phase 1.
  2. tl.wait(handle) expresses timing synchronization only and does not make the handle ready.
  3. Accessing the handle's actual result data (handle.data, element access, numpy conversion, etc.) is only possible in Phase 2.
  4. Therefore, compute-result-based control flow is not supported in Phase 1.
  5. In contrast, tl.load() returns actual data in Phase 1, so memory-read-based control flow is supported.

| Handle state | Phase | Allowed operations |
| --- | --- | --- |
| pending | Phase 1 | tl.wait(handle): timing synchronization only |
| pending | Phase 1 | Pass handle as target of tl.store() (logical destination binding only, payload in Phase 2) |
| pending | Phase 1 | Data access not allowed: value-based branching not possible |
| ready | Phase 2 | Actual numpy data access, verification |

This restriction is intentional. If computations were executed in Phase 1, the SimPy single-thread would block, defeating the purpose of 2-pass separation.
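
One way the handle contract above could be enforced is for a pending handle to raise on any data access. The sketch below is an assumption about the mechanism, not the actual handle class: `ComputeHandle`, `PendingHandleError`, and `materialize` are hypothetical names.

```python
class PendingHandleError(RuntimeError):
    """Raised when Phase 1 code touches the data of a pending compute handle."""


class ComputeHandle:
    """Hypothetical result handle: pending in Phase 1, ready after Phase 2."""

    def __init__(self, op_index):
        self.op_index = op_index
        self._data = None

    def materialize(self, data):
        """Called by the Phase 2 executor once the result has been computed."""
        self._data = data

    def _require_ready(self):
        if self._data is None:
            raise PendingHandleError(
                f"op {self.op_index}: result data is only available in Phase 2"
            )

    @property
    def data(self):
        self._require_ready()
        return self._data

    def __getitem__(self, idx):
        return self.data[idx]

    def __bool__(self):
        # Truth-value evaluation is value-based branching, so it is rejected while pending.
        self._require_ready()
        return True


handle = ComputeHandle(op_index=0)
try:
    if handle:            # value-based branching attempted in Phase 1
        pass
    branched = True
except PendingHandleError:
    branched = False

handle.materialize([[1.0, 2.0]])   # what Phase 2 would do
```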

Phase 1 Materialization — Future Extension

If Phase 1 eager execution becomes necessary for small operations (scalar, small reduction) in the future, selective materialization can be supported by adding a materialized_in_phase1: bool flag to the op record. This is not implemented in the current scope.

D4. data_op Flag — Message Self-Declaration

The logging target is determined by the data_op attribute on the message instance, not by message type. The framework does not hardcode message types.

class MsgBase:
    data_op: bool = False       # default: no logging

class DmaReadCmd(MsgBase):
    data_op = True              # memory transfer → logging

class GemmCmd(MsgBase):
    data_op = True              # compute → logging

class MathCmd(MsgBase):
    data_op = True              # compute → logging

When adding a new message type (e.g., IpcqMsg), simply setting data_op = True enables automatic logging without modifying framework code.
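
A minimal sketch of how the flag check behaves for different message types. `CreditMsg` and `should_log` are hypothetical names introduced for illustration; `IpcqMsg` is the example type mentioned above.

```python
class MsgBase:
    data_op: bool = False   # default: no logging


class CreditMsg(MsgBase):   # hypothetical control message, keeps the default
    pass


class IpcqMsg(MsgBase):     # new message type opting in to logging
    data_op = True


def should_log(msg) -> bool:
    """Same test as the ComponentBase hook: decide by attribute, not by type."""
    return bool(getattr(msg, "data_op", False))


# object() has no data_op attribute at all, so getattr falls back to False.
logged = [type(m).__name__ for m in (CreditMsg(), IpcqMsg(), object()) if should_log(m)]
```

Because the check uses getattr with a default, messages that do not derive from MsgBase are simply skipped instead of raising.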

D5. Op Log Structure

Op Classification Scheme

A two-level classification is used:

| Level | Values | Role |
| --- | --- | --- |
| op_kind | memory, gemm, math | executor dispatch criterion |
| op_name | dma_read, dma_write, gemm_f16, exp, add, sum, etc. | specific operation identification |

OpRecord Definition

@dataclass
class OpRecord:
    t_start: float              # SimPy time (ns) — service start
    t_end: float                # SimPy time (ns) — service completion
    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
    op_kind: str                # "memory" | "gemm" | "math"
    op_name: str                # specific operation name
    params: dict                # per-operation parameters (see below)
    dependency_ids: list[int]   # currently based on in-memory record index, may be replaced with stable op_id in the future

dependency_ids Generation Rules

dependency_ids is optional, and by default the executor performs address-based dependency inference (see D6).

Explicit setting is only needed when precise execution ordering is required:

  • Default (address-based inference): the executor analyzes read/write sets to automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
  • Explicit setting: set when logical dependencies cannot be expressed via addresses at the TLContext or command generation stage. Example: completion handle-based synchronization — handle dependencies depend on logical completion order rather than memory addresses, so they cannot be captured by address inference.

op_log Ordering

The op_log maintains stable ordering based on t_start. Records with the same t_start preserve insertion order.
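
The ordering rule can be sketched with the standard bisect module. `OrderedOpLog` is a hypothetical container, not the proposed OpLogger. It assumes records may be inserted out of t_start order (for example, when they are appended at completion time).

```python
import bisect


class OrderedOpLog:
    """Hypothetical log that stays sorted by t_start and is stable for ties."""

    def __init__(self):
        self._keys = []      # t_start values, kept sorted
        self._records = []   # records, in the same order as _keys

    def insert(self, t_start, record):
        # bisect_right places the new record after existing records with the
        # same t_start, which preserves insertion order among equal keys.
        i = bisect.bisect_right(self._keys, t_start)
        self._keys.insert(i, t_start)
        self._records.insert(i, record)

    def records(self):
        return list(self._records)


log = OrderedOpLog()
for t, name in [(20.0, "gemm_a"), (10.0, "dma_read"), (20.0, "gemm_b"), (10.0, "dma_write")]:
    log.insert(t, name)
ordered = log.records()
```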

params Details

memory (dma_read / dma_write):

{
    "src_addr": int,            # source address (byte)
    "dst_addr": int,            # destination address (byte)
    "nbytes": int,              # transfer size
    "src_space": str,           # "hbm" | "tcm" | "sram"
    "dst_space": str,           # "hbm" | "tcm" | "sram"
}

gemm:

{
    "src_a_addr": int,          # operand A address
    "src_b_addr": int,          # operand B address
    "dst_addr": int,            # output address
    "shape_a": tuple,           # e.g. (128, 256)
    "shape_b": tuple,           # e.g. (256, 128)
    "shape_out": tuple,         # e.g. (128, 128)
    "dtype_in": str,            # e.g. "f16"
    "dtype_acc": str,           # accumulation dtype, e.g. "f32"
    "dtype_out": str,           # output dtype, e.g. "f16"
    "transpose_a": bool,
    "transpose_b": bool,
    "layout_a": str,            # "row_major" | "col_major"
    "layout_b": str,
    "layout_out": str,
    "addr_space": str,          # "tcm" (GEMM operands are always in TCM)
}
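
As a sanity check on the gemm params, shape_out can be derived from shape_a, shape_b, and the transpose flags. The helper below is a hypothetical sketch; it assumes shape_a and shape_b describe the operands as stored, before any transpose is applied.

```python
def gemm_out_shape(shape_a, shape_b, transpose_a=False, transpose_b=False):
    """Return (M, N) for A @ B, applying the transpose flags to the stored shapes."""
    m, k_a = shape_a[::-1] if transpose_a else shape_a
    k_b, n = shape_b[::-1] if transpose_b else shape_b
    if k_a != k_b:
        raise ValueError(f"inner dimensions differ: {k_a} vs {k_b}")
    return (m, n)


# Values taken from the example comments in the gemm params above.
params = {
    "shape_a": (128, 256),
    "shape_b": (256, 128),
    "shape_out": (128, 128),
    "transpose_a": False,
    "transpose_b": False,
}
computed = gemm_out_shape(
    params["shape_a"], params["shape_b"], params["transpose_a"], params["transpose_b"]
)
```

A check of this kind could run when the op record is created, so that a malformed record fails in Phase 1 and not deep inside Phase 2.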

math:

{
    "op": str,                  # "exp" | "add" | "sum" | "where" | ...
    "input_addrs": list[int],   # list of operand addresses
    "input_shapes": list[tuple],
    "dst_addr": int,
    "shape_out": tuple,
    "dtype": str,
    "axis": int | None,         # reduction axis
    "addr_space": str,          # "tcm"
}

D6. Phase 2 Executor

Phase 2 executes the op_log outside of SimPy.

from itertools import groupby

class DataExecutor:
    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
        self.op_log = op_log        # ordered by t_start (see D5)
        self.store = initial_store  # Takes the Phase 1 MemoryStore snapshot as input

    def run(self):
        # groupby only merges adjacent records, so it relies on the t_start ordering of op_log
        for t, ops in groupby(self.op_log, key=lambda o: o.t_start):
            batch = list(ops)
            independent, sequential = self._classify(batch)
            self._execute_parallel(independent)
            self._execute_sequential(sequential)

Parallel execution determination:

Ops with the same t_start are considered parallel candidates. The executor determines actual parallel execution based on the following criteria:

  • Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
  • Whether predecessor ops specified in dependency_ids have completed

Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
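
The address-range check can be sketched as follows. `Access`, `overlaps`, and `hazards` are hypothetical names; the sketch assumes each op exposes its read and write sets as (start_byte, nbytes) ranges in a single address space.

```python
from dataclasses import dataclass, field


@dataclass
class Access:
    """Hypothetical read/write sets of one op, as (start_byte, nbytes) ranges."""
    reads: list = field(default_factory=list)
    writes: list = field(default_factory=list)


def overlaps(a, b) -> bool:
    """True if two half-open byte ranges [start, start + nbytes) intersect."""
    a_start, a_len = a
    b_start, b_len = b
    return a_start < b_start + b_len and b_start < a_start + a_len


def hazards(first: Access, second: Access) -> set:
    """Hazards that force `second` to run after `first`."""
    found = set()
    if any(overlaps(w, r) for w in first.writes for r in second.reads):
        found.add("RAW")
    if any(overlaps(w1, w2) for w1 in first.writes for w2 in second.writes):
        found.add("WAW")
    if any(overlaps(r, w) for r in first.reads for w in second.writes):
        found.add("WAR")
    return found


gemm0 = Access(reads=[(0x0000, 256), (0x1000, 256)], writes=[(0x2000, 256)])
gemm1 = Access(reads=[(0x3000, 256), (0x4000, 256)], writes=[(0x5000, 256)])
math0 = Access(reads=[(0x2080, 64)], writes=[(0x6000, 64)])

independent = hazards(gemm0, gemm1)   # disjoint ranges: eligible for parallel execution
dependent = hazards(gemm0, math0)     # math0 reads part of gemm0's output
```

An empty result only rules out address conflicts; explicit dependency_ids would still need to be checked separately.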

Batch optimization: Only independent ops with the same op_name and identical shape, dtype, layout, and transpose flags are eligible for batching. Example: identical shape GEMMs from multiple PEs → bundled into a single np.matmul(a_batch, b_batch) call. Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
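
A minimal sketch of the batching idea, assuming three independent GEMMs with identical shape and dtype. The operand values are random and exist only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three independent GEMMs with identical shapes, e.g. one per PE.
a_list = [rng.standard_normal((4, 8)).astype(np.float32) for _ in range(3)]
b_list = [rng.standard_normal((8, 5)).astype(np.float32) for _ in range(3)]

# Stack along a new leading batch axis; np.matmul broadcasts over it.
a_batch = np.stack(a_list)                 # shape (3, 4, 8)
b_batch = np.stack(b_list)                 # shape (3, 8, 5)
out_batch = np.matmul(a_batch, b_batch)    # shape (3, 4, 5)

# Reference: the same GEMMs executed one at a time.
per_op = [a @ b for a, b in zip(a_list, b_list)]
```

Note that np.stack copies the operands, so the benefit of batching depends on the copy cost being smaller than the per-call overhead it removes.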

Phase 2 execution order guarantee:

Phase 2 does not consider data arrival timing, and guarantees execution order solely through dependencies (address-based inference + explicit dependency_ids).

D7. Memory Store

MemoryStore logically follows byte-addressable semantics, and the current implementation uses tensor-granular storage (addr → numpy ndarray mapping).

class MemoryStore:
    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...

Internal storage format: numpy ndarray

MemoryStore stores tensors as numpy ndarrays.

| Candidate | store/load speed | Phase 2 compute | Verdict |
| --- | --- | --- | --- |
| numpy ndarray | Immediate (reference passing, no copy) | np.matmul directly usable | Adopted |
| bytearray | Requires memcpy | Requires np.frombuffer conversion | Rejected |
| torch tensor | Immediate | torch operations available | Use only for GPU optimization |

  • write: stores numpy array by reference (no copy) → Phase 1 overhead = 1 dict lookup
  • read: returns numpy array by reference (no copy)
  • Re-writing to the same addr overwrites at tensor granularity (partial overwrite not supported)
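  • dtype uses numpy native types (np.float16, np.float32, etc.). NumPy has no built-in bfloat16, so bf16 requires an extension dtype (for example, the one provided by the ml_dtypes package)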
  • For byte-level access, convert via .view(np.uint8)
  • For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility

read/write contract:

  • read/write operates on a contiguous tensor basis. If non-contiguous stride views are needed, express them as separate copy ops.
  • In the normal benchmark path, producer/consumer dtype match is expected. Reinterpret cast is a permissive behavior for low-level memory validation or special test cases.
  • addr is byte-aligned, with minimum alignment = dtype size.
  • dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast. Shape mismatch is verified based on nbytes, and raises an error on mismatch.
  • Correctness criteria follow address-range-based read/write semantics.
  • A tensor object cache may be used as an implementation optimization, but the canonical state is byte-addressable storage.
  • At deploy time, the host injects initial tensor data.
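
The read/write contract above can be sketched with a tensor-granular store. `SketchMemoryStore` is a hypothetical illustration, not the proposed implementation; it takes numpy dtype names such as "float32" and does not support partial overwrites or strided views.

```python
import numpy as np


class SketchMemoryStore:
    """Hypothetical tensor-granular store: (space, addr) -> ndarray."""

    def __init__(self):
        self._tensors = {}

    def write(self, space, addr, data):
        if not data.flags["C_CONTIGUOUS"]:
            raise ValueError("only contiguous tensors can be stored")
        if addr % data.dtype.itemsize != 0:
            raise ValueError("addr must be aligned to the dtype size")
        self._tensors[(space, addr)] = data          # by reference, no copy

    def read(self, space, addr, shape, dtype):
        stored = self._tensors[(space, addr)]
        dt = np.dtype(dtype)
        if int(np.prod(shape)) * dt.itemsize != stored.nbytes:
            raise ValueError("requested shape/dtype does not match stored nbytes")
        if stored.dtype == dt:
            return stored.reshape(shape)             # view on the stored tensor
        # dtype mismatch: reinterpret the same bytes instead of converting values
        return stored.reshape(-1).view(dt).reshape(shape)


store = SketchMemoryStore()
x = np.array([1.0, 2.0], dtype=np.float32)
store.write("tcm", 0x100, x)

same = store.read("tcm", 0x100, (2,), "float32")   # matching dtype
raw = store.read("tcm", 0x100, (8,), "uint8")      # reinterpret cast to bytes

try:
    store.read("tcm", 0x100, (3,), "float32")      # 12 bytes requested, 8 stored
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```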

D8. Benchmark Kernel Code

The benchmark's user code API is not changed. The call interfaces for tl.load(), tl.composite(), tl.store(), etc. are maintained.

However, internal command/message schemas may be extended to include metadata required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).

D9. No Component Changes

Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified. Op log recording is the responsibility of the ComponentBase hook. When custom components are replaced, only the timing model changes, and Phase 2 data execution is unaffected.

D10. Phase 2 is Optional

engine = GraphEngine(graph)
engine.run(benchmark)                       # Phase 1: timing only
result = engine.get_timing_result()

if verify_data:
    executor = DataExecutor(engine.op_log, engine.memory_store)  # Phase 2: data (memory_store is an assumed attribute name; D6 requires the Phase 1 store)
    executor.run()
    executor.verify(expected_output)

If only timing analysis is needed, Phase 2 is skipped. If the op_logger is deactivated, the hooks reduce to a single attribute check, so op logging adds no measurable cost to Phase 1.

D11. Verification Contract

Basic verification compares the final output tensor against a reference backend (numpy).

Per-dtype tolerance policy:

| dtype | Comparison method | Tolerance |
| --- | --- | --- |
| f32 | np.allclose | rtol=1e-5, atol=1e-5 |
| f16 | np.allclose | rtol=1e-3, atol=1e-3 |
| bf16 | np.allclose | rtol=1e-2, atol=1e-2 |
| int types | np.array_equal | exact |

  • Default mode: compare final output only (end-to-end correctness)
  • Debug mode: can compare intermediate tensors on a per-op basis (MemoryStore snapshot at each op boundary)
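
The tolerance policy can be sketched as a small helper. `TOLERANCES` and `verify` are hypothetical names; the sketch upcasts both tensors to float64 before comparing, which is an assumption and not something this ADR specifies.

```python
import numpy as np

# (rtol, atol) per floating dtype, taken from the table above.
TOLERANCES = {
    "f32": (1e-5, 1e-5),
    "f16": (1e-3, 1e-3),
    "bf16": (1e-2, 1e-2),
}


def verify(actual, expected, dtype: str) -> bool:
    """Compare with np.allclose for float dtypes, exactly for everything else."""
    if dtype in TOLERANCES:
        rtol, atol = TOLERANCES[dtype]
        return bool(
            np.allclose(
                np.asarray(actual, dtype=np.float64),
                np.asarray(expected, dtype=np.float64),
                rtol=rtol,
                atol=atol,
            )
        )
    return bool(np.array_equal(actual, expected))


reference = np.array([0.1, 0.2, 0.3], dtype=np.float32)
rounded = reference.astype(np.float16)        # simulates an f16 output

ok_f16 = verify(rounded, reference, "f16")    # within the f16 tolerance
ok_f32 = verify(rounded, reference, "f32")    # f16 rounding error exceeds the f32 tolerance
ok_int = verify(np.array([1, 2]), np.array([1, 2]), "int32")
```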

Non-goals

  • Compute-result-based control flow: not supported. All compute handles are in pending state during Phase 1, wait() expresses timing synchronization only and does not imply data readiness. Accessing handle.data, element access, or truth-value evaluation in Phase 1 is treated as an error. Memory-data-based branching (results of tl.load()) is supported via greenlet. Phase 1 materialization is a future extension (see D3).
  • Cycle-accurate overlap reconstruction: Phase 2 does not precisely reproduce the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
  • GPU kernel compilation: GEMM/Math in Phase 2 are numpy/torch calls and do not reproduce the actual hardware PE microarchitecture.

Open Questions

  • Aliasing / slice view: How to represent slice/views referencing the same backing storage in MemoryStore (stride-based view vs copy semantics)
  • IPCQ/descriptor read generalization: Whether to fully generalize PE-to-PE communication as memory ops or introduce a separate op_kind
  • Op log streaming: Managing op_log memory usage in large-scale simulations (in-memory list vs disk-backed streaming)
  • Fused operation: Whether to record tl.composite's tiled pipeline (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
  • Math op schema generalization: The current math params have a simple structure, but generalization may be needed for broadcasting rules, per-input dtype, keepdims, scalar/immediate operands, where/mask expressions, etc.
  • Op record identifier: Currently dependency_ids are based on in-memory list indices; replacement with stable op_id is needed when introducing streaming/disk-backed mode
  • Phase 1 materialization policy: See Future Extension in D3. If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops needs to be defined

Consequences

Positive

  • Minimal impact on SimPy simulation performance (only op_log append added)
  • Free to use multi-threading/GPU in Phase 2
  • Component replaceability preserved (ADR-0015 design philosophy maintained)
  • No changes needed to benchmark user code API
  • When adding new message types, only set the data_op flag
  • Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
  • tl.load() returns actual data, making kernel debugging easier

Negative

  • op_log memory usage (for large-scale simulations)
  • Phase 2 execution time is proportional to tensor size (large GEMM)
  • Dynamic branching based on pending handles (incomplete computations) not possible (computations execute in Phase 2, result values are undetermined in Phase 1). Memory-data-based branching is supported via greenlet.
  • greenlet C extension dependency added (pip install greenlet)

Affected Files

| File | Change |
| --- | --- |
| src/kernbench/components/base.py | Add _on_process_start/end hooks |
| src/kernbench/common/pe_commands.py | Add data_op = True, extend metadata fields |
| src/kernbench/sim_engine/op_log.py | New: OpRecord, OpLogger |
| src/kernbench/sim_engine/data_executor.py | New: DataExecutor, MemoryStore |
| src/kernbench/sim_engine/engine.py | op_logger injection (optional) |
| src/kernbench/triton_emu/tl_context.py | greenlet switch calls inside tl.load() etc. |
| src/kernbench/triton_emu/kernel_runner.py | New: KernelRunner (greenlet ↔ SimPy bridge) |
| src/kernbench/components/builtin/pe_cpu.py | Remove Phase 0, change to KernelRunner invocation |
| pyproject.toml | Add greenlet dependency |

Component implementation files (pe_gemm.py, pe_dma.py, hbm_ctrl.py, etc.): no changes

Benchmark kernels (benches/*.py): no user API changes