Files

T

ywkang 140b85436a ADR-0020: 2-Pass data execution model with greenlet kernel runner

Design for actual data storage/computation in HBM/TCM/SRAM components:
- Phase 1: SimPy timing + MemoryStore (memory ops data-aware via greenlet)
- Phase 2: op_log-based numpy execution for GEMM/Math verification
- Greenlet-based KernelRunner replaces Phase 0 command list generation
- tl.load() returns real data in Phase 1, enabling memory-based control flow
- ComponentBase hook for op logging (single source of truth)
- MemoryStore: numpy ndarray tensor-granular storage with reference semantics

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-07 23:53:49 -07:00

23 KiB

Raw Blame History

ADR-0020: 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리)

Status

Proposed

Context

현재 시뮬레이션은 타이밍만 모델링한다. tl.load(), tl.composite(op="gemm") 등은 SimPy latency를 생성하지만, 실제 텐서 데이터를 읽거나 연산하지 않는다.

필요한 기능

HBM/TCM/SRAM에 실제 데이터를 저장하고 읽을 수 있어야 한다
PE_GEMM, PE_MATH가 실제 행렬 연산을 수행하고 결과를 검증할 수 있어야 한다
시뮬레이션 성능 저하를 최소화해야 한다

기존 커널 실행 구조의 한계

현재 커널 실행은 3단계로 분리되어 있다:

Phase 0: TLContext에서 커널 함수 실행 → PeCommand 리스트 생성 (SimPy 밖, 데이터 없음)
Phase 1: PE_CPU가 PeCommand 리스트를 SimPy로 replay (타이밍만)

Phase 0에서 커널이 전부 실행 완료된 후에야 SimPy가 시작된다. tl.load()는 TensorHandle(placeholder)을 반환하므로 실제 데이터에 접근할 수 없다. 따라서 데이터 값에 따른 분기(dynamic control flow)가 불가능하다.

본 ADR은 이 한계를 메모리 연산에 한해 해소한다 (D1, D3 참조).

제약 조건

SimPy는 single-thread 이벤트 루프 — numpy matmul을 안에서 하면 전체가 block
컴포넌트는 교체 가능해야 한다 (ADR-0015) — 프레임워크 요구사항이 구현에 침투하면 안 됨
벤치마크 커널은 명령형 코드(tl.load → tl.composite → tl.wait) — 같은 코드를 재사용해야 함
커널 함수는 plain Python function으로 유지해야 한다 (generator/async 변환 불가)

설계 탐색 결과

Option	방식	판정
SimPy 내 직접 실행	GEMM을 SimPy 안에서 numpy 호출	탈락: single-thread block
SimPy + ThreadPool	future.submit → timeout → result()	탈락: back-to-back 요청 시 result()에서 block
Symbolic + lazy	메타데이터만 추적, 나중에 실행	탈락: control-flow dependent 읽기 처리 곤란
2-pass (채택)	Phase 1: 타이밍, Phase 2: 데이터	완전 분리, 성능 영향 없음

Decision

D1. 2-Pass 실행 모델 — Phase 0 제거

기존의 3단계(Phase 0 → Phase 1 → Phase 2)를 2단계로 통합한다.

기존:

Phase 0: 커널 → PeCommand 리스트 (데이터 없음, 분기 불가)
Phase 1: PeCommand 리스트를 SimPy replay (타이밍만)

변경:

Phase 1 (타이밍): 커널 + SimPy 통합 실행 — greenlet 기반
  - 메모리 읽기/쓰기: SimPy 타이밍 + MemoryStore 실제 데이터
  - 연산 (GEMM/Math): SimPy 타이밍 + op_log 기록 (실제 연산은 Phase 2)
  - dynamic control flow 가능 (tl.load가 실제 데이터 반환)

Phase 2 (데이터): op_log 기반 실제 연산 실행 — SimPy 외부, 병렬 가능

본 ADR은 메모리 연산에 한해 Phase 1을 data-aware로 확장한다. Phase 1은 latency/BW 병목 분석 + 메모리 데이터 추적, Phase 2는 GEMM/Math 연산 정합성 검증. Phase 2는 optional — 타이밍만 필요하면 Phase 1만 실행.

D2. Op Log 기록 — ComponentBase hook

op_log 기록은 컴포넌트 베이스 클래스의 hook으로 수행한다. 개별 컴포넌트 구현을 수정하지 않는다.

class ComponentBase:
    def _on_process_start(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_start(env.now, self.node.id, msg)

    def _on_process_end(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_end(env.now, self.node.id, msg)

_forward_txn() 에서 run() 전후로 hook을 호출한다. _op_logger는 optional — 없으면 오버헤드 제로.

hook 시점 정의:

시점	의미
`t_start`	컴포넌트가 해당 msg의 service를 시작한 시점 (`run()` 진입 직전)
`t_end`	컴포넌트의 내부 service가 완료된 시점 (`run()` 반환 직후)

link traversal latency는 t_start/t_end에 포함되지 않는다. link latency는 발신 컴포넌트의 t_end와 수신 컴포넌트의 t_start 차이로 관측된다.

D3. Greenlet 기반 커널 실행 — Phase 0 제거

기존 Phase 0 (커널 → PeCommand 리스트)를 제거하고, greenlet을 사용하여 커널과 SimPy를 협력적으로 interleave 실행한다.

동작 원리

greenlet은 협력적 context switch를 제공하는 C 확장이다. 커널(child greenlet)이 tl.load() 등을 호출하면 SimPy 루프(parent greenlet)로 switch하여 타이밍 시뮬레이션을 수행하고, 완료 후 실제 데이터와 함께 커널로 돌아온다.

SimPy 루프 (parent greenlet)          커널 (child greenlet)
─────────────────────────              ──────────────────────
g.switch() ─────────────────────────→ 커널 시작
                                       a = tl.load(ptr, ...)
                                         내부: parent.switch(DmaReadCmd)
cmd = DmaReadCmd ←──────────────────  (커널 일시정지)
  yield DmaReadMsg(...)
  yield env.timeout(dma_latency)
  data = memory_store.read(...)
g.switch(data) ─────────────────────→ (커널 재개)
                                       a = data  ← 실제 numpy array
                                       if a[0][0] > 0.5:  ← 분기 가능
                                         ...

커널은 plain Python function으로 유지된다. greenlet switch는 tl.load(), tl.store() 등의 내부 구현에만 존재한다.

KernelRunner — 프레임워크 레이어

greenlet 루프는 PE_CPU 컴포넌트가 아니라 프레임워크 레이어인 KernelRunner에 위치한다.

# KernelRunner (프레임워크 — greenlet ↔ SimPy 연결)
class KernelRunner:
    def run(self, env, kernel_fn, args, store):
        g = greenlet(self._run_kernel)
        cmd = g.switch(kernel_fn, args)

        while cmd is not None:
            if isinstance(cmd, DmaReadCmd):
                yield from self._dispatch_dma(env, cmd)
                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
                cmd = g.switch(data)            # 실제 데이터와 함께 재개
            elif isinstance(cmd, GemmCmd):
                yield from self._dispatch_gemm(env, cmd)
                cmd = g.switch()                # 재개 (데이터 없음)
            elif isinstance(cmd, DmaWriteCmd):
                store.write(cmd.dst_addr, cmd.data)  # visibility = issue 시점
                yield from self._dispatch_dma(env, cmd)  # timing만 반영
                cmd = g.switch()

# PE_CPU (컴포넌트 — 간단하게 유지, greenlet을 모름)
def _execute_kernel(self, env):
    runner = KernelRunner(self.ctx)
    yield from runner.run(env, kernel_fn, args, store)

Op logging single source of truth: KernelRunner는 op_log에 직접 기록하지 않는다. 모든 op logging은 ComponentBase hook (_on_process_start/end)만 담당한다. KernelRunner가 _dispatch_gemm() 등으로 컴포넌트에 메시지를 전달하면, 컴포넌트 베이스 클래스의 hook이 자동으로 기록한다.

레이어 분리:

커널 코드: plain function, greenlet 존재를 모름
TLContext: tl.load() 내부에서 parent.switch(cmd) 호출
KernelRunner: greenlet ↔ SimPy 연결, MemoryStore 읽기/쓰기 처리. logging 안 함.
ComponentBase hook: op_log 기록의 유일한 경로
PE_CPU: KernelRunner를 호출만 함, 컴포넌트로서 교체 가능

메모리 읽기/쓰기 vs 연산의 처리 차이

연산	Phase 1에서	Phase 2에서
`tl.load()`	SimPy 타이밍 + MemoryStore read → 실제 데이터 반환	—
`tl.store()`	SimPy 타이밍 + MemoryStore write → 실제 기록	—
`tl.composite(gemm)`	SimPy 타이밍 + op_log 기록만	numpy 실제 연산
`tl.dot()` / math ops	SimPy 타이밍 + op_log 기록만	numpy 실제 연산

메모리 읽기/쓰기는 Phase 1에서 즉시 처리 (numpy slice, 빠름). GEMM/Math 연산은 Phase 2에서 batch 실행 (성능 분리).

Store Visibility Rule

tl.store()는 issue 시점에 MemoryStore에 즉시 반영된다 (visibility = issue). SimPy DMA 타이밍은 이후 별도로 시뮬레이션된다.

이는 timing과 visibility를 의도적으로 분리한 것이다:

visibility: MemoryStore에 반영되는 시점 = store.write() 호출 시
timing: SimPy에서 DMA latency가 완료되는 시점

이 분리로 dynamic control flow에서 store 직후 load가 최신 데이터를 볼 수 있다.

Result Handle Semantics

tl.composite()(sync/async)는 결과 tensor를 참조하는 handle을 반환한다.

Phase 1에서의 핵심 계약:

모든 compute handle은 Phase 1에서 항상 pending 상태로 간주한다.
tl.wait(handle)은 timing synchronization만 표현하며, handle을 ready로 만들지 않는다.
handle의 실제 결과 데이터 접근(handle.data, element access, numpy conversion 등)은 Phase 2에서만 가능하다.
따라서 Phase 1에서 compute-result 기반 control flow는 지원하지 않는다.
반면 tl.load()는 Phase 1에서 실제 데이터를 반환하므로, memory-read 기반 control flow는 지원 가능하다.

handle 상태	Phase	허용 동작
pending	Phase 1	`tl.wait(handle)` — timing 동기화만
pending	Phase 1	handle을 `tl.store()`의 대상으로 전달 (logical destination 연결만, payload는 Phase 2)
pending	Phase 1	데이터 접근 불가 — 값 기반 분기 불가
ready	Phase 2	실제 numpy 데이터 접근, 검증

이 제약은 의도적이다. Phase 1에서 연산을 실행하면 SimPy single-thread가 block되어 2-pass 분리의 존재 이유가 사라진다.

Phase 1 Materialization — Future Extension

향후 소형 연산(scalar, 작은 reduction)에 대해 Phase 1 eager execution이 필요한 경우, materialized_in_phase1: bool 플래그를 op record에 추가하여 선택적 materialization을 지원할 수 있다. 현재 범위에서는 구현하지 않는다.

D4. data_op 플래그 — 메시지 자기 선언

로깅 대상은 메시지 타입이 아니라 메시지 인스턴스의 data_op 속성으로 결정한다. 프레임워크가 메시지 타입을 하드코딩하지 않는다.

class MsgBase:
    data_op: bool = False       # 기본: 로깅 안 함

class DmaReadCmd(MsgBase):
    data_op = True              # 메모리 이동 → 로깅

class GemmCmd(MsgBase):
    data_op = True              # 연산 → 로깅

class MathCmd(MsgBase):
    data_op = True              # 연산 → 로깅

새 메시지 타입(예: IpcqMsg) 추가 시 data_op = True만 설정하면 프레임워크 코드 수정 없이 자동 로깅된다.

D5. Op Log 구조

op 분류 체계

2단계로 분류한다:

레벨	필드	역할
`op_kind`	`memory` \| `gemm` \| `math`	executor dispatch 기준
`op_name`	`dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` 등	구체 연산 식별

OpRecord 정의

@dataclass
class OpRecord:
    t_start: float              # SimPy 시각 (ns) — service 시작
    t_end: float                # SimPy 시각 (ns) — service 완료
    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
    op_kind: str                # "memory" | "gemm" | "math"
    op_name: str                # 구체 연산명
    params: dict                # 연산별 파라미터 (아래 참조)
    dependency_ids: list[int]   # 현재는 in-memory record index 기반, 향후 stable op_id로 대체 가능

dependency_ids 생성 규칙

dependency_ids는 optional이며, 기본적으로 executor는 주소 기반 dependency 추론을 수행한다 (D6 참조).

정확한 실행 순서가 필요한 경우에만 명시적으로 설정한다:

기본 (address-based inference): executor가 read/write set을 분석하여 RAW/WAW/WAR 의존성을 자동 추론. 대부분의 경우 이것으로 충분.
명시적 설정: TLContext 또는 command 생성 단계에서 logical dependency가 주소로 표현되지 않는 경우에 설정. 예: completion handle 기반 동기화 — handle dependency는 메모리 주소가 아니라 논리적 완료 순서에 의존하므로 address inference로 잡히지 않는다.

op_log ordering

op_log는 t_start 기준으로 stable ordering을 유지한다. 동일 t_start의 record들은 insertion order를 보존한다.

params 상세

memory (dma_read / dma_write):

{
    "src_addr": int,            # source 주소 (byte)
    "dst_addr": int,            # destination 주소 (byte)
    "nbytes": int,              # 전송 크기
    "src_space": str,           # "hbm" | "tcm" | "sram"
    "dst_space": str,           # "hbm" | "tcm" | "sram"
}

gemm:

{
    "src_a_addr": int,          # operand A 주소
    "src_b_addr": int,          # operand B 주소
    "dst_addr": int,            # output 주소
    "shape_a": tuple,           # e.g. (128, 256)
    "shape_b": tuple,           # e.g. (256, 128)
    "shape_out": tuple,         # e.g. (128, 128)
    "dtype_in": str,            # e.g. "f16"
    "dtype_acc": str,           # accumulation dtype, e.g. "f32"
    "dtype_out": str,           # output dtype, e.g. "f16"
    "transpose_a": bool,
    "transpose_b": bool,
    "layout_a": str,            # "row_major" | "col_major"
    "layout_b": str,
    "layout_out": str,
    "addr_space": str,          # "tcm" (GEMM operand는 항상 TCM)
}

math:

{
    "op": str,                  # "exp" | "add" | "sum" | "where" | ...
    "input_addrs": list[int],   # operand 주소 목록
    "input_shapes": list[tuple],
    "dst_addr": int,
    "shape_out": tuple,
    "dtype": str,
    "axis": int | None,         # reduction axis
    "addr_space": str,          # "tcm"
}

D6. Phase 2 Executor

Phase 2는 SimPy 밖에서 op_log를 실행한다.

class DataExecutor:
    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
        self.store = initial_store  # Phase 1의 MemoryStore snapshot을 입력으로 받는다

    def run(self):
        for t, ops in groupby(op_log, key=lambda o: o.t_start):
            batch = list(ops)
            independent, sequential = self._classify(batch)
            self._execute_parallel(independent)
            self._execute_sequential(sequential)

병렬 실행 판정:

같은 t_start의 op들은 병렬 후보로 간주한다. 실제 병렬 실행 여부는 executor가 다음 기준으로 판정한다:

read/write 주소 범위 겹침 여부 (WAW, RAW, WAR 충돌 검사)
dependency_ids에 명시된 선행 op 완료 여부

주소 범위가 겹치지 않고 명시적 의존성이 없는 op들만 병렬 실행한다.

배치 최적화: 동일 op_name이며 shape, dtype, layout, transpose flag가 모두 동일한 독립 op들만 batching 대상이 된다. 예: 여러 PE의 동일 shape GEMM → np.matmul(a_batch, b_batch) 한 번으로 묶음. CPU에서도 BLAS 효율 향상, GPU에서는 launch overhead 절감.

Phase 2 실행 순서 보장:

Phase 2는 데이터 도착 시점을 고려하지 않으며, dependency (주소 기반 추론 + 명시적 dependency_ids)를 통해서만 실행 순서를 보장한다.

D7. Memory Store

MemoryStore는 논리적으로 byte-addressable semantics를 따르며, 현재 구현은 tensor-granular storage (addr → numpy ndarray 매핑)를 사용한다.

class MemoryStore:
    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...

내부 저장 포맷: numpy ndarray

MemoryStore는 텐서를 numpy ndarray로 저장한다.

후보	store/load 속도	Phase 2 연산	판정
numpy ndarray	즉시 (참조 전달, 복사 없음)	`np.matmul` 바로 사용	채택
bytearray	memcpy 필요	`np.frombuffer` 변환 필요	탈락
torch tensor	즉시	torch 연산 가능	GPU 최적화 시만 사용

write: numpy array를 참조 저장 (복사 없음) → Phase 1 오버헤드 = dict lookup 1회
read: numpy array를 참조 반환 (복사 없음)
동일 addr에 재 write 시 기존 array를 tensor 단위로 덮어쓴다 (partial overwrite 미지원)
dtype은 numpy native 사용 (np.float16, np.float32, np.bfloat16 등)
byte-level access가 필요한 경우 .view(np.uint8) 로 변환
Phase 2에서 GPU batch 최적화 시 numpy → torch tensor 변환은 executor가 담당

read/write contract:

read/write는 contiguous tensor 기준이다. non-contiguous stride view가 필요한 경우 별도 copy op으로 표현한다.
일반 benchmark path에서는 producer/consumer dtype 일치를 기대한다. reinterpret cast는 low-level memory validation 또는 특수 테스트 케이스를 위한 permissive behavior이다.
addr은 byte-aligned이며, 최소 alignment = dtype 크기.
dtype mismatch (write와 다른 dtype으로 read)는 reinterpret cast로 처리한다. shape 불일치 시 nbytes 기준으로 검증하고, 불일치하면 error.
정합성 기준은 주소 범위 기반 read/write semantics를 따른다.
구현 최적화로 tensor object cache를 둘 수 있지만, canonical state는 byte-addressable storage이다.
deploy 시점에 호스트가 초기 텐서 데이터를 주입한다.

D8. 벤치마크 커널 코드

벤치마크의 사용자 코드 API는 변경하지 않는다. tl.load(), tl.composite(), tl.store() 등의 호출 인터페이스는 유지.

단, 내부 command/message schema는 Phase 2 실행에 필요한 metadata를 포함하도록 확장될 수 있다 (예: dtype_acc, transpose 등 추가 필드).

D9. 컴포넌트 변경 없음

개별 컴포넌트 구현(PE_GEMM, PE_DMA, HBM_CTRL 등)은 수정하지 않는다. op_log 기록은 ComponentBase hook의 책임이다. 커스텀 컴포넌트 교체 시 타이밍 모델만 교체되며, Phase 2 데이터 실행은 영향받지 않는다.

D10. Phase 2는 Optional

engine = GraphEngine(graph)
engine.run(benchmark)                       # Phase 1: 타이밍만
result = engine.get_timing_result()

if verify_data:
    executor = DataExecutor(engine.op_log)  # Phase 2: 데이터
    executor.run()
    executor.verify(expected_output)

타이밍 분석만 필요하면 Phase 2를 건너뛴다. op_logger를 비활성화하면 Phase 1 성능도 기존과 동일.

D11. Verification Contract

기본 검증은 최종 output tensor를 reference backend(numpy)와 비교한다.

dtype별 tolerance 정책:

dtype	비교 방식	tolerance
f32	`np.allclose`	rtol=1e-5, atol=1e-5
f16	`np.allclose`	rtol=1e-3, atol=1e-3
bf16	`np.allclose`	rtol=1e-2, atol=1e-2
int 계열	`np.array_equal`	exact

기본 모드: 최종 output만 비교 (end-to-end correctness)
디버그 모드: intermediate tensor도 op 단위로 비교 가능 (MemoryStore snapshot at each op boundary)

Non-goals

Compute-result-based control flow: 지원하지 않는다. 모든 compute handle은 Phase 1에서 pending 상태이며, wait()는 timing synchronization만 표현하고 data readiness를 의미하지 않는다. Phase 1에서 handle.data 접근, element access, truth-value evaluation은 error로 처리한다. 메모리 데이터 기반 분기(tl.load() 결과)는 greenlet으로 지원된다. Phase 1 materialization은 future extension (D3 참조).
Cycle-accurate overlap reconstruction: Phase 2에서 Phase 1의 실행 시간 overlap을 정확히 재현하지 않는다. Phase 2는 데이터 정합성만 검증한다.
GPU kernel compilation: Phase 2의 GEMM/Math는 numpy/torch 호출이며, 실제 하드웨어 PE의 마이크로아키텍처를 재현하지 않는다.

Open Questions

Aliasing / slice view: 동일 backing storage를 참조하는 slice/view를 MemoryStore에서 어떻게 표현할지 (stride-based view vs copy semantics)
IPCQ/descriptor read 일반화: PE-to-PE 통신을 memory op으로 완전히 일반화할지, 별도 op_kind를 둘지
Op log streaming: 대규모 시뮬레이션에서 op_log 메모리 사용량 관리 (in-memory list vs disk-backed streaming)
Fused operation: tl.composite의 tiled pipeline (READ→COMPUTE→WRITE)을 하나의 fused op record로 기록할지, 개별 op으로 분리할지
Math op schema 일반화: 현재 math params는 단순 구조이나, broadcasting rule, input별 dtype, keepdims, scalar/immediate operand, where/mask 표현 등 일반화가 필요할 수 있음
Op record 식별자: 현재 dependency_ids는 in-memory list index 기반이며, streaming/disk-backed mode 도입 시 stable op_id로 대체 필요
Phase 1 materialization policy: D3의 Future Extension 참조. 허용 시 해당 op의 Phase 2 처리 방식 (skip / verify / recompute) 정의 필요

Consequences

긍정적

SimPy 시뮬레이션 성능 영향 최소 (op_log append만 추가)
Phase 2에서 멀티스레드/GPU 자유롭게 사용 가능
컴포넌트 교체 자유도 유지 (ADR-0015 설계 철학 보존)
벤치마크 사용자 코드 API 변경 불필요
새 메시지 타입 추가 시 data_op 플래그만 설정
greenlet으로 Phase 0 제거 — 메모리 데이터 기반 dynamic control flow 지원
tl.load()가 실제 데이터를 반환하므로 커널 디버깅 용이

부정적

op_log 메모리 사용량 (대규모 시뮬레이션 시)
Phase 2 실행 시간은 텐서 크기에 비례 (대형 GEMM)
pending handle (연산 미완료) 기반 동적 분기 불가 (연산은 Phase 2에서 실행, Phase 1에서 결과 값 미확정). 메모리 데이터 기반 분기는 greenlet으로 지원된다.
greenlet C 확장 의존성 추가 (pip install greenlet)

영향받는 파일

파일	변경
`src/kernbench/components/base.py`	`_on_process_start/end` hook 추가
`src/kernbench/common/pe_commands.py`	`data_op = True` 추가, metadata 필드 확장
`src/kernbench/sim_engine/op_log.py`	신규: OpRecord, OpLogger
`src/kernbench/sim_engine/data_executor.py`	신규: DataExecutor, MemoryStore
`src/kernbench/sim_engine/engine.py`	op_logger 주입 (optional)
`src/kernbench/triton_emu/tl_context.py`	`tl.load()` 등 내부에서 greenlet switch 호출
`src/kernbench/triton_emu/kernel_runner.py`	신규: KernelRunner (greenlet ↔ SimPy 연결)
`src/kernbench/components/builtin/pe_cpu.py`	Phase 0 제거, KernelRunner 호출로 변경
`pyproject.toml`	greenlet 의존성 추가

컴포넌트 구현 파일 (pe_gemm.py, pe_dma.py, hbm_ctrl.py 등): 변경 없음 벤치마크 커널 (benches/*.py): 사용자 API 변경 없음

23 KiB Raw Blame History