ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/

Establish English as the canonical ADR language with Korean translations held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror). Promotion from adr-proposed/ to adr/ now writes English to adr/ and the Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.

- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English, 2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix dropped). ADR-0023 EN regenerated against KO source which had newer HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline section covering bidirectional sync, conflict resolution (EN wins), and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair completeness, filename mirroring, ADR-ID match, Status byte-equality. Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF normalization, em-dash title separator, underscore-slug edge case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:38:44 -07:00
parent 687c98086d
commit a796c1d2f7
42 changed files with 10515 additions and 3422 deletions
@@ -1,4 +1,4 @@
-# ADR-0020: 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리)
+# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)

 ## Status

@@ -6,65 +6,65 @@ Accepted

 ## Context

-현재 시뮬레이션은 **타이밍만** 모델링한다.
-`tl.load()`, `tl.composite(op="gemm")` 등은 SimPy latency를 생성하지만,
-실제 텐서 데이터를 읽거나 연산하지 않는다.
+The current simulation models **timing only**.
+`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
+but do not actually read tensor data or perform computations.

-### 필요한 기능
+### Required Capabilities

-1. HBM/TCM/SRAM에 실제 데이터를 저장하고 읽을 수 있어야 한다
-2. PE_GEMM, PE_MATH가 실제 행렬 연산을 수행하고 결과를 검증할 수 있어야 한다
-3. 시뮬레이션 성능 저하를 최소화해야 한다
+1. Must be able to store and read actual data in HBM/TCM/SRAM
+2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
+3. Must minimize simulation performance degradation

-### 제약 조건
+### Constraints

- SimPy는 single-thread 이벤트 루프 — numpy matmul을 안에서 하면 전체가 block
- 컴포넌트는 교체 가능해야 한다 (ADR-0015) — 프레임워크 요구사항이 구현에 침투하면 안 됨
- 벤치마크 커널은 명령형 코드(tl.load → tl.composite → tl.wait) — 같은 코드를 재사용해야 함
- 커널 함수는 plain Python function으로 유지해야 한다 (generator/async 변환 불가)
+- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
+- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
+- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
+- Kernel functions must remain plain Python functions (no generator/async transformation)

-### 설계 탐색 결과
+### Design Exploration Results

-| Option | 방식 | 판정 |
-|--------|------|------|
-| SimPy 내 직접 실행 | GEMM을 SimPy 안에서 numpy 호출 | 탈락: single-thread block |
-| SimPy + ThreadPool | future.submit → timeout → result() | 탈락: back-to-back 요청 시 result()에서 block |
-| Symbolic + lazy | 메타데이터만 추적, 나중에 실행 | 탈락: control-flow dependent 읽기 처리 곤란 |
-| **2-pass (채택)** | Phase 1: 타이밍, Phase 2: 데이터 | 완전 분리, 성능 영향 없음 |
+| Option | Approach | Verdict |
+|--------|----------|---------|
+| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
+| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
+| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
+| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |

 ---

 ## Decision

-### D1. 2-Pass 실행 모델 — Phase 0 제거
+### D1. 2-Pass Execution Model — Phase 0 Elimination

-기존의 3단계(Phase 0 → Phase 1 → Phase 2)를 **2단계로 통합**한다.
+The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.

-기존:
+Before:
 ```
-Phase 0: 커널 → PeCommand 리스트 (데이터 없음, 분기 불가)
-Phase 1: PeCommand 리스트를 SimPy replay (타이밍만)
+Phase 0: Kernel → PeCommand list (no data, no branching)
+Phase 1: Replay PeCommand list via SimPy (timing only)
 ```

-변경:
+After:
 ```
-Phase 1 (타이밍): 커널 + SimPy 통합 실행 — greenlet 기반
-  - 메모리 읽기/쓰기: SimPy 타이밍 + MemoryStore 실제 데이터
-  - 연산 (GEMM/Math): SimPy 타이밍 + op_log 기록 (실제 연산은 Phase 2)
-  - dynamic control flow 가능 (tl.load가 실제 데이터 반환)
+Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
+  - Memory read/write: SimPy timing + MemoryStore actual data
+  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
+  - Dynamic control flow possible (tl.load returns actual data)

-Phase 2 (데이터): op_log 기반 실제 연산 실행 — SimPy 외부, 병렬 가능
+Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
 ```

-본 ADR은 **메모리 연산에 한해 Phase 1을 data-aware로 확장**한다.
-Phase 1은 latency/BW 병목 분석 + 메모리 데이터 추적,
-Phase 2는 GEMM/Math 연산 정합성 검증.
-Phase 2는 optional — 타이밍만 필요하면 Phase 1만 실행.
+This ADR **extends Phase 1 to be data-aware for memory operations only**.
+Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
+Phase 2 handles GEMM/Math computation correctness verification.
+Phase 2 is optional — if only timing is needed, run Phase 1 alone.

-### D2. Op Log 기록 — ComponentBase hook
+### D2. Op Log Recording — ComponentBase Hook

-op_log 기록은 **컴포넌트 베이스 클래스의 hook**으로 수행한다.
-개별 컴포넌트 구현을 수정하지 않는다.
+Op log recording is performed as a **hook in the component base class**.
+Individual component implementations are not modified.

 ```python
 class ComponentBase:
@@ -77,56 +77,56 @@ class ComponentBase:
            self._op_logger.record_end(env.now, self.node.id, msg)
 ```

-`_forward_txn()` 에서 `run()` 전후로 hook을 호출한다.
-`_op_logger`는 optional — 없으면 오버헤드 제로.
+Hooks are called before and after `run()` within `_forward_txn()`.
+`_op_logger` is optional — zero overhead when absent.

-**hook 시점 정의**:
+**Hook timing definitions**:

-| 시점 | 의미 |
-|------|------|
-| `t_start` | 컴포넌트가 해당 msg의 **service를 시작**한 시점 (`run()` 진입 직전) |
-| `t_end` | 컴포넌트의 **내부 service가 완료**된 시점 (`run()` 반환 직후) |
+| Timing | Meaning |
+|--------|---------|
+| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
+| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |

-link traversal latency는 t_start/t_end에 포함되지 않는다.
-link latency는 발신 컴포넌트의 t_end와 수신 컴포넌트의 t_start 차이로 관측된다.
+Link traversal latency is not included in t_start/t_end.
+Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
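
The hook timing above can be illustrated with a small standalone sketch. It does not use SimPy; `SketchOpLogger`, `SketchComponent`, `forward()`, and `LINK_NS` are hypothetical names standing in for the real logger, `ComponentBase`, `_forward_txn()`, and the link model, so treat it as one possible reading of the rule rather than the project's code.

```python
from dataclasses import dataclass, field


@dataclass
class SketchOpLogger:
    """Hypothetical logger: collects (event, time, component_id) tuples."""
    events: list = field(default_factory=list)

    def record_start(self, now, component_id, msg):
        self.events.append(("start", now, component_id))

    def record_end(self, now, component_id, msg):
        self.events.append(("end", now, component_id))


class SketchComponent:
    """Stand-in for ComponentBase: models service time only, no link."""

    def __init__(self, component_id, service_ns, op_logger=None):
        self.component_id = component_id
        self.service_ns = service_ns
        self._op_logger = op_logger  # None means the hooks are skipped

    def forward(self, now, msg):
        if self._op_logger is not None:
            self._op_logger.record_start(now, self.component_id, msg)
        now += self.service_ns  # stands in for the component's run()
        if self._op_logger is not None:
            self._op_logger.record_end(now, self.component_id, msg)
        return now


logger = SketchOpLogger()
sender = SketchComponent("pe_dma", service_ns=10.0, op_logger=logger)
receiver = SketchComponent("hbm_ctrl", service_ns=20.0, op_logger=logger)

LINK_NS = 5.0  # assumed link traversal latency between the two components
t = sender.forward(0.0, msg="read")
t = receiver.forward(t + LINK_NS, msg="read")

# Link latency is not logged; it falls out of the two records.
sender_t_end = logger.events[1][1]
receiver_t_start = logger.events[2][1]
link_latency = receiver_t_start - sender_t_end
```

In this sketch the link delay never appears in either component's record; it is recovered only by subtracting the sender's `t_end` from the receiver's `t_start`.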

-### D3. Greenlet 기반 커널 실행 — Phase 0 제거
+### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination

-기존 Phase 0 (커널 → PeCommand 리스트)를 제거하고,
-**greenlet**을 사용하여 커널과 SimPy를 협력적으로 interleave 실행한다.
+The existing Phase 0 (kernel → PeCommand list) is eliminated,
+and **greenlet** is used to cooperatively interleave kernel and SimPy execution.

-#### 동작 원리
+#### Operating Principle

-greenlet은 협력적 context switch를 제공하는 C 확장이다.
-커널(child greenlet)이 `tl.load()` 등을 호출하면 SimPy 루프(parent greenlet)로
-switch하여 타이밍 시뮬레이션을 수행하고, 완료 후 실제 데이터와 함께 커널로 돌아온다.
+greenlet is a C extension that provides cooperative context switching.
+When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
+to perform timing simulation, and after completion, returns to the kernel with actual data.

 ```
-SimPy 루프 (parent greenlet)          커널 (child greenlet)
+SimPy loop (parent greenlet)           Kernel (child greenlet)
 ─────────────────────────              ──────────────────────
-g.switch() ─────────────────────────→ 커널 시작
+g.switch() ─────────────────────────→ Kernel starts
                                       a = tl.load(ptr, ...)
-                                         내부: parent.switch(DmaReadCmd)
-cmd = DmaReadCmd ←──────────────────  (커널 일시정지)
+                                         internal: parent.switch(DmaReadCmd)
+cmd = DmaReadCmd ←──────────────────  (kernel paused)
  yield DmaReadMsg(...)
  yield env.timeout(dma_latency)
  data = memory_store.read(...)
-g.switch(data) ─────────────────────→ (커널 재개)
-                                       a = data  ← 실제 numpy array
-                                       if a[0][0] > 0.5:  ← 분기 가능
+g.switch(data) ─────────────────────→ (kernel resumed)
+                                       a = data  ← actual numpy array
+                                       if a[0][0] > 0.5:  ← branching possible
                                         ...
 ```

-커널은 **plain Python function**으로 유지된다.
-greenlet switch는 `tl.load()`, `tl.store()` 등의 **내부 구현에만** 존재한다.
+The kernel is maintained as a **plain Python function**.
+greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.

-#### KernelRunner — 프레임워크 레이어
+#### KernelRunner — Framework Layer

-greenlet 루프는 PE_CPU 컴포넌트가 아니라 프레임워크 레이어인
-**KernelRunner**에 위치한다.
+The greenlet loop resides not in the PE_CPU component but in the framework layer,
+**KernelRunner**.

 ```python
-# KernelRunner (프레임워크 — greenlet ↔ SimPy 연결)
+# KernelRunner (framework — greenlet ↔ SimPy bridge)
 class KernelRunner:
    def run(self, env, kernel_fn, args, store):
        g = greenlet(self._run_kernel)
@@ -136,160 +136,162 @@ class KernelRunner:
            if isinstance(cmd, DmaReadCmd):
                yield from self._dispatch_dma(env, cmd)
                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
-                cmd = g.switch(data)            # 실제 데이터와 함께 재개
+                cmd = g.switch(data)            # resume with actual data
            elif isinstance(cmd, GemmCmd):
                yield from self._dispatch_gemm(env, cmd)
-                cmd = g.switch()                # 재개 (데이터 없음)
+                cmd = g.switch()                # resume (no data)
            elif isinstance(cmd, DmaWriteCmd):
-                store.write(cmd.dst_addr, cmd.data)  # visibility = issue 시점
-                yield from self._dispatch_dma(env, cmd)  # timing만 반영
+                store.write(cmd.dst_addr, cmd.data)  # visibility = issue time
+                yield from self._dispatch_dma(env, cmd)  # timing only
                cmd = g.switch()

-# PE_CPU (컴포넌트 — 간단하게 유지, greenlet을 모름)
+# PE_CPU (component — kept simple, unaware of greenlet)
 def _execute_kernel(self, env):
    runner = KernelRunner(self.ctx)
    yield from runner.run(env, kernel_fn, args, store)
 ```

-**Op logging single source of truth**: KernelRunner는 op_log에 직접 기록하지 않는다.
-모든 op logging은 **ComponentBase hook (_on_process_start/end)만** 담당한다.
-KernelRunner가 `_dispatch_gemm()` 등으로 컴포넌트에 메시지를 전달하면,
-컴포넌트 베이스 클래스의 hook이 자동으로 기록한다.
+**Op logging single source of truth**: KernelRunner does not record directly to op_log.
+All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
+When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
+the component base class hooks automatically record them.

-**레이어 분리**:
- **커널 코드**: plain function, greenlet 존재를 모름
- **TLContext**: `tl.load()` 내부에서 `parent.switch(cmd)` 호출
- **KernelRunner**: greenlet ↔ SimPy 연결, MemoryStore 읽기/쓰기 처리. **logging 안 함**.
- **ComponentBase hook**: op_log 기록의 유일한 경로
- **PE_CPU**: KernelRunner를 호출만 함, 컴포넌트로서 교체 가능
+**Layer separation**:
+- **Kernel code**: plain function, unaware of greenlet
+- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
+- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
+- **ComponentBase hook**: the sole path for op_log recording
+- **PE_CPU**: only calls KernelRunner, replaceable as a component

-#### 메모리 읽기/쓰기 vs 연산의 처리 차이
+#### Handling Differences Between Memory Read/Write and Compute

-| 연산 | Phase 1에서 | Phase 2에서 |
-|------|------------|------------|
-| `tl.load()` | SimPy 타이밍 + MemoryStore read → **실제 데이터 반환** | — |
-| `tl.store()` | SimPy 타이밍 + MemoryStore write → **실제 기록** | — |
-| `tl.composite(gemm)` | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 |
-| `tl.dot()` / math ops | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 |
+| Operation | In Phase 1 | In Phase 2 |
+|-----------|-----------|-----------|
+| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
+| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
+| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
+| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |

-메모리 읽기/쓰기는 Phase 1에서 즉시 처리 (numpy slice, 빠름).
-GEMM/Math 연산은 Phase 2에서 batch 실행 (성능 분리).
+Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
+GEMM/Math operations are batch-executed in Phase 2 (performance separation).

 #### Store Visibility Rule

-`tl.store()`는 **issue 시점에 MemoryStore에 즉시 반영**된다 (visibility = issue).
-SimPy DMA 타이밍은 이후 별도로 시뮬레이션된다.
+`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
+SimPy DMA timing is simulated separately afterward.

-이는 timing과 visibility를 의도적으로 분리한 것이다:
- **visibility**: MemoryStore에 반영되는 시점 = `store.write()` 호출 시
- **timing**: SimPy에서 DMA latency가 완료되는 시점
+This is an intentional separation of timing and visibility:
+- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
+- **timing**: the point at which DMA latency completes in SimPy

-이 분리로 dynamic control flow에서 store 직후 load가 최신 데이터를 볼 수 있다.
+This separation allows a load immediately after a store to see the latest data in dynamic control flow.
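
A minimal sketch of the visibility rule, assuming a dict-backed store and a plain list of pending completion times in place of SimPy events. `SketchStore` and `issue_store` are hypothetical names, not part of the ADR.

```python
class SketchStore:
    """Hypothetical stand-in for MemoryStore, keyed by address only."""

    def __init__(self):
        self._data = {}

    def write(self, addr, value):
        self._data[addr] = value

    def read(self, addr):
        return self._data[addr]


def issue_store(store, pending, now, addr, value, dma_latency):
    """Make the data visible at issue time, then account for timing."""
    store.write(addr, value)           # visibility: immediately at issue
    pending.append(now + dma_latency)  # timing: DMA completes later


store = SketchStore()
pending_completions = []
issue_store(store, pending_completions, now=100.0, addr=0x1000,
            value=[1, 2, 3], dma_latency=40.0)

# A load issued right after the store sees the new data even though
# the simulated DMA has not completed yet.
seen = store.read(0x1000)
```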

 #### Result Handle Semantics

-`tl.composite()`(sync/async)는 결과 tensor를 참조하는 **handle**을 반환한다.
+`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.

-Phase 1에서의 핵심 계약:
+The key contract in Phase 1:

-1. **모든 compute handle은 Phase 1에서 항상 pending 상태로 간주한다.**
-2. `tl.wait(handle)`은 **timing synchronization만 표현**하며,
-   handle을 ready로 만들지 않는다.
-3. handle의 실제 결과 데이터 접근(`handle.data`, element access,
-   numpy conversion 등)은 **Phase 2에서만 가능**하다.
-4. 따라서 Phase 1에서 **compute-result 기반 control flow는 지원하지 않는다.**
-5. 반면 `tl.load()`는 Phase 1에서 실제 데이터를 반환하므로,
-   **memory-read 기반 control flow는 지원 가능**하다.
+1. **All compute handles are always considered pending in Phase 1.**
+2. `tl.wait(handle)` **expresses timing synchronization only**
+   and does not make the handle ready.
+3. Accessing the handle's actual result data (`handle.data`, element access,
+   numpy conversion, etc.) is **only possible in Phase 2**.
+4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
+5. In contrast, `tl.load()` returns actual data in Phase 1, so
+   **memory-read-based control flow is supported**.

-| handle 상태 | Phase | 허용 동작 |
+| Handle state | Phase | Allowed operations |
 |------------|-------|----------|
-| pending | Phase 1 | `tl.wait(handle)` — timing 동기화만 |
-| pending | Phase 1 | handle을 `tl.store()`의 대상으로 전달 (logical destination 연결만, payload는 Phase 2) |
-| pending | Phase 1 | **데이터 접근 불가** — 값 기반 분기 불가 |
-| ready | Phase 2 | 실제 numpy 데이터 접근, 검증 |
+| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
+| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
+| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
+| ready | Phase 2 | Actual numpy data access, verification |

-이 제약은 의도적이다. Phase 1에서 연산을 실행하면 SimPy single-thread가
-block되어 2-pass 분리의 존재 이유가 사라진다.
+This restriction is intentional. If computations were executed in Phase 1,
+the SimPy single-thread would block, defeating the purpose of 2-pass separation.
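
One way the pending-handle contract could be enforced is sketched below. `SketchHandle`, `PendingHandleError`, `materialize()`, and the free function `wait()` are illustrative assumptions; the ADR does not show the real handle class.

```python
class PendingHandleError(RuntimeError):
    """Raised when result data is touched while the handle is pending."""


class SketchHandle:
    """Hypothetical compute handle: pending in Phase 1, ready in Phase 2."""

    def __init__(self, op_index):
        self.op_index = op_index
        self._result = None
        self._ready = False

    def _require_ready(self):
        if not self._ready:
            raise PendingHandleError("compute result is only available in Phase 2")

    @property
    def data(self):
        self._require_ready()
        return self._result

    def __getitem__(self, index):
        return self.data[index]

    def __bool__(self):
        # Truth-value evaluation counts as data access.
        self._require_ready()
        return bool(self._result)

    def materialize(self, result):
        """Called by the Phase 2 executor once the op has actually run."""
        self._result = result
        self._ready = True


def wait(handle):
    """Timing synchronization only: the handle stays pending."""
    return handle


h = SketchHandle(op_index=0)
wait(h)
try:
    if h:  # value-based branching in Phase 1
        pass
    branched = True
except PendingHandleError:
    branched = False

h.materialize([[1.0, 2.0]])  # Phase 2
first = h[0][1]
```

Raising on `__bool__` as well as on `.data` is a design choice in this sketch: it makes an accidental `if handle:` in kernel code fail loudly instead of silently taking a branch.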

 #### Phase 1 Materialization — Future Extension

-향후 소형 연산(scalar, 작은 reduction)에 대해 Phase 1 eager execution이
-필요한 경우, `materialized_in_phase1: bool` 플래그를 op record에 추가하여
-선택적 materialization을 지원할 수 있다. 현재 범위에서는 구현하지 않는다.
+If Phase 1 eager execution becomes necessary for small operations
+(scalar, small reduction) in the future, selective materialization can be supported
+by adding a `materialized_in_phase1: bool` flag to the op record.
+This is not implemented in the current scope.

-### D4. data_op 플래그 — 메시지 자기 선언
+### D4. data_op Flag — Message Self-Declaration

-로깅 대상은 메시지 타입이 아니라 메시지 인스턴스의 `data_op` 속성으로 결정한다.
-프레임워크가 메시지 타입을 하드코딩하지 않는다.
+The logging target is determined by the `data_op` attribute on the message instance,
+not by message type. The framework does not hardcode message types.

 ```python
 class MsgBase:
-    data_op: bool = False       # 기본: 로깅 안 함
+    data_op: bool = False       # default: no logging

 class DmaReadCmd(MsgBase):
-    data_op = True              # 메모리 이동 → 로깅
+    data_op = True              # memory transfer → logging

 class GemmCmd(MsgBase):
-    data_op = True              # 연산 → 로깅
+    data_op = True              # compute → logging

 class MathCmd(MsgBase):
-    data_op = True              # 연산 → 로깅
+    data_op = True              # compute → logging
 ```

-새 메시지 타입(예: IpcqMsg) 추가 시 `data_op = True`만 설정하면
-프레임워크 코드 수정 없이 자동 로깅된다.
+When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
+enables automatic logging without modifying framework code.
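
The self-declaration idea can be sketched as a single attribute check on the instance. `should_log` and `CreditMsg` are hypothetical names used only for illustration; the message classes mirror the ones listed above.

```python
class MsgBase:
    data_op = False  # default: not logged


class DmaReadCmd(MsgBase):
    data_op = True


class CreditMsg(MsgBase):
    """Hypothetical control message that keeps the default."""


class IpcqMsg(MsgBase):
    data_op = True  # a new type opts in without framework changes


def should_log(msg):
    """Decide from the instance attribute, never from the message type."""
    return bool(getattr(msg, "data_op", False))


logged = [type(m).__name__
          for m in (DmaReadCmd(), CreditMsg(), IpcqMsg())
          if should_log(m)]
```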

-### D5. Op Log 구조
+### D5. Op Log Structure

-#### op 분류 체계
+#### Op Classification Scheme

-2단계로 분류한다:
+A two-level classification is used:

-| 레벨 | 필드 | 역할 |
-|------|------|------|
-| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch 기준 |
-| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` 등 | 구체 연산 식별 |
+| Level | Field | Role |
+|-------|-------|------|
+| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
+| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |

-#### OpRecord 정의
+#### OpRecord Definition

 ```python
 @dataclass
 class OpRecord:
-    t_start: float              # SimPy 시각 (ns) — service 시작
-    t_end: float                # SimPy 시각 (ns) — service 완료
+    t_start: float              # SimPy time (ns) — service start
+    t_end: float                # SimPy time (ns) — service completion
    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
    op_kind: str                # "memory" | "gemm" | "math"
-    op_name: str                # 구체 연산명
-    params: dict                # 연산별 파라미터 (아래 참조)
-    dependency_ids: list[int]   # 현재는 in-memory record index 기반, 향후 stable op_id로 대체 가능
+    op_name: str                # specific operation name
+    params: dict                # per-operation parameters (see below)
+    dependency_ids: list[int]   # currently based on in-memory record index, may be replaced with stable op_id in the future
 ```

-#### dependency_ids 생성 규칙
+#### dependency_ids Generation Rules

-`dependency_ids`는 **optional**이며, 기본적으로 executor는
-주소 기반 dependency 추론을 수행한다 (D6 참조).
+`dependency_ids` is **optional**, and by default the executor performs
+address-based dependency inference (see D6).

-정확한 실행 순서가 필요한 경우에만 명시적으로 설정한다:
- **기본 (address-based inference)**: executor가 read/write set을 분석하여
-  RAW/WAW/WAR 의존성을 자동 추론. 대부분의 경우 이것으로 충분.
- **명시적 설정**: TLContext 또는 command 생성 단계에서 logical dependency가
-  주소로 표현되지 않는 경우에 설정.
-  예: completion handle 기반 동기화 — handle dependency는 메모리 주소가 아니라
-  논리적 완료 순서에 의존하므로 address inference로 잡히지 않는다.
+Explicit setting is only needed when precise execution ordering is required:
+- **Default (address-based inference)**: the executor analyzes read/write sets to
+  automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
+- **Explicit setting**: set when logical dependencies cannot be expressed via addresses
+  at the TLContext or command generation stage.
+  Example: completion handle-based synchronization — handle dependencies depend on
+  logical completion order rather than memory addresses, so they cannot be captured
+  by address inference.

-#### op_log ordering
+#### op_log Ordering

-op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
-동일 `t_start`의 record들은 insertion order를 보존한다.
+The op_log maintains **stable ordering** based on `t_start`.
+Records with the same `t_start` preserve insertion order.
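
The ordering rule maps directly onto a stable sort. The sketch below uses plain dicts instead of `OpRecord` and relies on Python's `sorted()` being guaranteed stable; whether the real op_log sorts after the fact or inserts in order is not stated in the ADR.

```python
# Records in insertion order; two pairs share a t_start.
records = [
    {"t_start": 20.0, "op_name": "gemm_f16"},
    {"t_start": 10.0, "op_name": "dma_read"},
    {"t_start": 20.0, "op_name": "exp"},
    {"t_start": 10.0, "op_name": "dma_write"},
]

# sorted() is stable, so ties on t_start keep their insertion order.
ordered = sorted(records, key=lambda r: r["t_start"])
names = [r["op_name"] for r in ordered]
```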

-#### params 상세
+#### params Details

 **memory (dma_read / dma_write)**:
 ```python
 {
-    "src_addr": int,            # source 주소 (byte)
-    "dst_addr": int,            # destination 주소 (byte)
-    "nbytes": int,              # 전송 크기
+    "src_addr": int,            # source address (byte)
+    "dst_addr": int,            # destination address (byte)
+    "nbytes": int,              # transfer size
    "src_space": str,           # "hbm" | "tcm" | "sram"
    "dst_space": str,           # "hbm" | "tcm" | "sram"
 }
@@ -298,9 +300,9 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
 **gemm**:
 ```python
 {
-    "src_a_addr": int,          # operand A 주소
-    "src_b_addr": int,          # operand B 주소
-    "dst_addr": int,            # output 주소
+    "src_a_addr": int,          # operand A address
+    "src_b_addr": int,          # operand B address
+    "dst_addr": int,            # output address
    "shape_a": tuple,           # e.g. (128, 256)
    "shape_b": tuple,           # e.g. (256, 128)
    "shape_out": tuple,         # e.g. (128, 128)
@@ -312,7 +314,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
    "layout_a": str,            # "row_major" | "col_major"
    "layout_b": str,
    "layout_out": str,
-    "addr_space": str,          # "tcm" (GEMM operand는 항상 TCM)
+    "addr_space": str,          # "tcm" (GEMM operands are always in TCM)
 }
 ```

@@ -320,7 +322,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
 ```python
 {
    "op": str,                  # "exp" | "add" | "sum" | "where" | ...
-    "input_addrs": list[int],   # operand 주소 목록
+    "input_addrs": list[int],   # list of operand addresses
    "input_shapes": list[tuple],
    "dst_addr": int,
    "shape_out": tuple,
@@ -332,12 +334,12 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.

 ### D6. Phase 2 Executor

-Phase 2는 SimPy 밖에서 op_log를 실행한다.
+Phase 2 executes the op_log outside of SimPy.

 ```python
 class DataExecutor:
    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
-        self.store = initial_store  # Phase 1의 MemoryStore snapshot을 입력으로 받는다
+        self.store = initial_store  # Takes the Phase 1 MemoryStore snapshot as input

    def run(self):
        for t, ops in groupby(op_log, key=lambda o: o.t_start):
@@ -347,30 +349,30 @@ class DataExecutor:
            self._execute_sequential(sequential)
 ```

-**병렬 실행 판정**:
+**Parallel execution determination**:

-같은 `t_start`의 op들은 **병렬 후보**로 간주한다.
-실제 병렬 실행 여부는 executor가 다음 기준으로 판정한다:
- read/write 주소 범위 겹침 여부 (WAW, RAW, WAR 충돌 검사)
- `dependency_ids`에 명시된 선행 op 완료 여부
+Ops with the same `t_start` are considered **parallel candidates**.
+The executor determines actual parallel execution based on the following criteria:
+- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
+- Whether predecessor ops specified in `dependency_ids` have completed

-주소 범위가 겹치지 않고 명시적 의존성이 없는 op들만 병렬 실행한다.
+Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
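
The address-overlap criterion could look roughly like the following. The partitioning step of `DataExecutor.run()` is elided in the diff, so `AddrRange`, `SketchOp`, `conflicts`, and `split_parallel` are hypothetical, and the sketch ignores `dependency_ids` entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddrRange:
    space: str   # e.g. "hbm", "tcm", "sram"
    start: int   # byte address
    nbytes: int

    def overlaps(self, other):
        return (self.space == other.space
                and self.start < other.start + other.nbytes
                and other.start < self.start + self.nbytes)


@dataclass
class SketchOp:
    reads: list
    writes: list


def conflicts(a, b):
    """True if a and b have a RAW, WAR, or WAW hazard on any range."""
    def any_overlap(xs, ys):
        return any(x.overlaps(y) for x in xs for y in ys)
    return (any_overlap(a.writes, b.reads)
            or any_overlap(a.reads, b.writes)
            or any_overlap(a.writes, b.writes))


def split_parallel(ops):
    """Split ops sharing one t_start into conflict-free and conflicting."""
    parallel, sequential = [], []
    for op in ops:
        if any(conflicts(op, other) for other in ops if other is not op):
            sequential.append(op)
        else:
            parallel.append(op)
    return parallel, sequential


op1 = SketchOp(reads=[], writes=[AddrRange("tcm", 0, 64)])
op2 = SketchOp(reads=[AddrRange("tcm", 32, 64)], writes=[AddrRange("hbm", 0, 64)])
op3 = SketchOp(reads=[AddrRange("hbm", 4096, 64)], writes=[AddrRange("tcm", 256, 64)])
parallel, sequential = split_parallel([op1, op2, op3])
```

Here `op2` reads bytes that `op1` writes, so both fall back to sequential execution, while `op3` touches disjoint ranges and remains a parallel candidate.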

-**배치 최적화**: 동일 op_name이며 **shape, dtype, layout, transpose flag가
-모두 동일한** 독립 op들만 batching 대상이 된다.
-예: 여러 PE의 동일 shape GEMM → `np.matmul(a_batch, b_batch)` 한 번으로 묶음.
-CPU에서도 BLAS 효율 향상, GPU에서는 launch overhead 절감.
+**Batch optimization**: Only independent ops with the same op_name **and identical
+shape, dtype, layout, and transpose flags** are eligible for batching.
+Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
+Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
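
The batching eligibility rule amounts to grouping by a composite key. In the sketch below the `dtype` and `transpose` field names are assumptions, since the corresponding gemm params are elided in the diff; the stacking and `np.matmul` call itself is left out.

```python
from collections import defaultdict


def batch_key(op):
    """Ops are batchable only if every component of this key matches."""
    p = op["params"]
    return (op["op_name"], p["shape_a"], p["shape_b"], p["dtype"],
            p["layout_a"], p["layout_b"], p["transpose"])


def group_for_batching(independent_ops):
    """Group already-independent ops; each group could become one batched call."""
    groups = defaultdict(list)
    for op in independent_ops:
        groups[batch_key(op)].append(op)
    return list(groups.values())


def make_gemm(shape_a, shape_b):
    # Hypothetical helper that fills in the fields used by batch_key.
    return {"op_name": "gemm_f16",
            "params": {"shape_a": shape_a, "shape_b": shape_b, "dtype": "f16",
                       "layout_a": "row_major", "layout_b": "row_major",
                       "transpose": False}}


ops = [make_gemm((128, 256), (256, 128)),
       make_gemm((128, 256), (256, 128)),
       make_gemm((64, 256), (256, 128))]
group_sizes = [len(g) for g in group_for_batching(ops)]
```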

-**Phase 2 실행 순서 보장**:
+**Phase 2 execution order guarantee**:

-Phase 2는 데이터 도착 시점을 고려하지 않으며,
-dependency (주소 기반 추론 + 명시적 dependency_ids)를 통해서만
-실행 순서를 보장한다.
+Phase 2 does not consider data arrival timing,
+and guarantees execution order solely through
+dependencies (address-based inference + explicit dependency_ids).

 ### D7. Memory Store

-`MemoryStore`는 논리적으로 byte-addressable semantics를 따르며,
-현재 구현은 **tensor-granular storage** (addr → numpy ndarray 매핑)를 사용한다.
+`MemoryStore` logically follows byte-addressable semantics,
+and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).

 ```python
 class MemoryStore:
@@ -378,139 +380,140 @@ class MemoryStore:
    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
 ```

-**내부 저장 포맷: numpy ndarray**
+**Internal storage format: numpy ndarray**

-MemoryStore는 텐서를 **numpy ndarray**로 저장한다.
+MemoryStore stores tensors as **numpy ndarrays**.

-| 후보 | store/load 속도 | Phase 2 연산 | 판정 |
-|------|----------------|-------------|------|
-| **numpy ndarray** | 즉시 (참조 전달, 복사 없음) | `np.matmul` 바로 사용 | **채택** |
-| bytearray | memcpy 필요 | `np.frombuffer` 변환 필요 | 탈락 |
-| torch tensor | 즉시 | torch 연산 가능 | GPU 최적화 시만 사용 |
+| Candidate | store/load speed | Phase 2 compute | Verdict |
+|-----------|-----------------|-----------------|---------|
+| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
+| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
+| torch tensor | Immediate | torch operations available | Use only for GPU optimization |

- write: numpy array를 **참조 저장** (복사 없음) → Phase 1 오버헤드 = dict lookup 1회
- read: numpy array를 **참조 반환** (복사 없음)
- 동일 addr에 재 write 시 기존 array를 **tensor 단위로 덮어쓴다** (partial overwrite 미지원)
- dtype은 numpy native 사용 (`np.float16`, `np.float32`, `np.bfloat16` 등)
- byte-level access가 필요한 경우 `.view(np.uint8)` 로 변환
- Phase 2에서 GPU batch 최적화 시 numpy → torch tensor 변환은 executor가 담당
+- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
+- read: **returns numpy array by reference** (no copy)
+- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
+- dtype uses numpy dtypes (`np.float16`, `np.float32`, etc.); NumPy itself ships no `bfloat16`, so bf16 requires an extension dtype
+- For byte-level access, convert via `.view(np.uint8)`
+- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility

 **read/write contract**:

- read/write는 **contiguous tensor** 기준이다.
-  non-contiguous stride view가 필요한 경우 별도 copy op으로 표현한다.
- 일반 benchmark path에서는 producer/consumer dtype 일치를 기대한다.
-  reinterpret cast는 low-level memory validation 또는 특수 테스트 케이스를 위한
-  permissive behavior이다.
- addr은 byte-aligned이며, 최소 alignment = dtype 크기.
- dtype mismatch (write와 다른 dtype으로 read)는 reinterpret cast로 처리한다.
-  shape 불일치 시 nbytes 기준으로 검증하고, 불일치하면 error.
- 정합성 기준은 주소 범위 기반 read/write semantics를 따른다.
- 구현 최적화로 tensor object cache를 둘 수 있지만,
-  canonical state는 byte-addressable storage이다.
- deploy 시점에 호스트가 초기 텐서 데이터를 주입한다.
+- read/write operates on a **contiguous tensor** basis.
+  If non-contiguous stride views are needed, express them as separate copy ops.
+- In the normal benchmark path, producer/consumer dtype match is expected.
+  Reinterpret cast is a permissive behavior for low-level memory validation
+  or special test cases.
+- addr is byte-aligned, with minimum alignment = dtype size.
+- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
+  Shape mismatch is verified based on nbytes, and raises an error on mismatch.
+- Correctness criteria follow address-range-based read/write semantics.
+- A tensor object cache may be used as an implementation optimization,
+  but the canonical state is byte-addressable storage.
+- At deploy time, the host injects initial tensor data.
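
The alignment and nbytes checks in the contract can be sketched without NumPy. `ITEMSIZE`, `tensor_nbytes`, and `check_read` are hypothetical, and the dtype names follow the short forms used in the tolerance table of this ADR.

```python
from math import prod

# Assumed element sizes in bytes for the short dtype names.
ITEMSIZE = {"f16": 2, "bf16": 2, "f32": 4, "i8": 1, "i32": 4}


def tensor_nbytes(shape, dtype):
    return prod(shape) * ITEMSIZE[dtype]


def check_read(addr, written_shape, written_dtype, read_shape, read_dtype):
    """Validate a read against the last write at the same address.

    Returns True when the read is a reinterpret cast (dtype differs),
    False for a plain same-dtype read. Raises ValueError otherwise.
    """
    if addr % ITEMSIZE[read_dtype] != 0:
        raise ValueError("addr is not aligned to the dtype size")
    if tensor_nbytes(written_shape, written_dtype) != tensor_nbytes(read_shape, read_dtype):
        raise ValueError("nbytes mismatch between write and read")
    return written_dtype != read_dtype


# (128, 256) f16 and (128, 128) f32 both occupy 65536 bytes.
is_reinterpret = check_read(0x1000, (128, 256), "f16", (128, 128), "f32")
is_plain = check_read(16, (4, 4), "f32", (4, 4), "f32")
```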

-### D8. 벤치마크 커널 코드
+### D8. Benchmark Kernel Code

-벤치마크의 **사용자 코드 API는 변경하지 않는다**.
-`tl.load()`, `tl.composite()`, `tl.store()` 등의 호출 인터페이스는 유지.
+The benchmark's **user code API is not changed**.
+The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.

-단, 내부 command/message schema는 Phase 2 실행에 필요한 metadata를
-포함하도록 확장될 수 있다 (예: dtype_acc, transpose 등 추가 필드).
+However, internal command/message schemas may be extended to include metadata
+required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).

-### D9. 컴포넌트 변경 없음
+### D9. No Component Changes

-개별 컴포넌트 구현(PE_GEMM, PE_DMA, HBM_CTRL 등)은 수정하지 않는다.
-op_log 기록은 ComponentBase hook의 책임이다.
-커스텀 컴포넌트 교체 시 타이밍 모델만 교체되며,
-Phase 2 데이터 실행은 영향받지 않는다.
+Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
+Op log recording is the responsibility of the ComponentBase hook.
+When custom components are replaced, only the timing model changes,
+and Phase 2 data execution is unaffected.

-### D10. Phase 2는 Optional
+### D10. Phase 2 is Optional

 ```python
 engine = GraphEngine(graph)
-engine.run(benchmark)                       # Phase 1: 타이밍만
+engine.run(benchmark)                       # Phase 1: timing only
 result = engine.get_timing_result()

 if verify_data:
-    executor = DataExecutor(engine.op_log)  # Phase 2: 데이터
+    executor = DataExecutor(engine.op_log)  # Phase 2: data
    executor.run()
    executor.verify(expected_output)
 ```

-타이밍 분석만 필요하면 Phase 2를 건너뛴다.
-op_logger를 비활성화하면 Phase 1 성능도 기존과 동일.
+If only timing analysis is needed, Phase 2 is skipped.
+If the op_logger is deactivated, Phase 1 performance is identical to the original.

 ### D11. Verification Contract

-기본 검증은 **최종 output tensor**를 reference backend(numpy)와 비교한다.
+Basic verification **compares the final output tensor** against a reference backend (numpy).

-dtype별 tolerance 정책:
+Per-dtype tolerance policy:

-| dtype | 비교 방식 | tolerance |
+| dtype | Comparison method | Tolerance |
 |-------|----------|-----------|
 | f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
 | f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
 | bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
-| int 계열 | `np.array_equal` | exact |
+| int types | `np.array_equal` | exact |

- 기본 모드: 최종 output만 비교 (end-to-end correctness)
- 디버그 모드: intermediate tensor도 op 단위로 비교 가능
+- Default mode: compare final output only (end-to-end correctness)
+- Debug mode: can compare intermediate tensors on a per-op basis
  (MemoryStore snapshot at each op boundary)
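
The tolerance policy can be expressed as a lookup plus one comparison. To stay dependency-free, the sketch replaces `np.allclose` with a pure-Python check using the same formula, `|a - b| <= atol + rtol * |b|`, on flat sequences; `TOLERANCES`, `allclose`, and `verify` are hypothetical names.

```python
# (rtol, atol) per float dtype, taken from the table above.
TOLERANCES = {
    "f32": (1e-5, 1e-5),
    "f16": (1e-3, 1e-3),
    "bf16": (1e-2, 1e-2),
}


def allclose(actual, expected, rtol, atol):
    """Pure-Python stand-in for np.allclose on flat sequences."""
    return (len(actual) == len(expected)
            and all(abs(a - e) <= atol + rtol * abs(e)
                    for a, e in zip(actual, expected)))


def verify(actual, expected, dtype):
    """Float dtypes use a tolerance; everything else must match exactly."""
    if dtype in TOLERANCES:
        rtol, atol = TOLERANCES[dtype]
        return allclose(actual, expected, rtol, atol)
    return list(actual) == list(expected)


ok_f16 = verify([1.0005], [1.0], "f16")  # within the f16 tolerance
ok_f32 = verify([1.0005], [1.0], "f32")  # outside the tighter f32 tolerance
ok_int = verify([1, 2], [1, 2], "i32")
```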

 ---

 ## Non-goals

- **Compute-result-based control flow**: 지원하지 않는다.
-  모든 compute handle은 Phase 1에서 pending 상태이며,
-  `wait()`는 timing synchronization만 표현하고 data readiness를 의미하지 않는다.
-  Phase 1에서 `handle.data` 접근, element access, truth-value evaluation은
-  **error로 처리**한다.
-  메모리 데이터 기반 분기(`tl.load()` 결과)는 greenlet으로 지원된다.
-  Phase 1 materialization은 future extension (D3 참조).
- **Cycle-accurate overlap reconstruction**: Phase 2에서 Phase 1의 실행 시간
-  overlap을 정확히 재현하지 않는다. Phase 2는 데이터 정합성만 검증한다.
- **GPU kernel compilation**: Phase 2의 GEMM/Math는 numpy/torch 호출이며,
-  실제 하드웨어 PE의 마이크로아키텍처를 재현하지 않는다.
+- **Compute-result-based control flow**: not supported.
+  All compute handles are in pending state during Phase 1,
+  `wait()` expresses timing synchronization only and does not imply data readiness.
+  Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
+  is **treated as an error**.
+  Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
+  Phase 1 materialization is a future extension (see D3).
+- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
+  the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
+- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
+  and do not reproduce the actual hardware PE microarchitecture.

 ## Open Questions

- **Aliasing / slice view**: 동일 backing storage를 참조하는 slice/view를
-  MemoryStore에서 어떻게 표현할지 (stride-based view vs copy semantics)
- **IPCQ/descriptor read 일반화**: PE-to-PE 통신을 memory op으로 완전히
-  일반화할지, 별도 op_kind를 둘지
- **Op log streaming**: 대규모 시뮬레이션에서 op_log 메모리 사용량 관리
+- **Aliasing / slice view**: How to represent slice/views referencing the same
+  backing storage in MemoryStore (stride-based view vs copy semantics)
+- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
+  communication as memory ops or introduce a separate op_kind
+- **Op log streaming**: Managing op_log memory usage in large-scale simulations
  (in-memory list vs disk-backed streaming)
- **Fused operation**: tl.composite의 tiled pipeline (READ→COMPUTE→WRITE)을
-  하나의 fused op record로 기록할지, 개별 op으로 분리할지
- **Math op schema 일반화**: 현재 math params는 단순 구조이나,
-  broadcasting rule, input별 dtype, keepdims, scalar/immediate operand,
-  where/mask 표현 등 일반화가 필요할 수 있음
- **Op record 식별자**: 현재 dependency_ids는 in-memory list index 기반이며,
-  streaming/disk-backed mode 도입 시 stable op_id로 대체 필요
- **Phase 1 materialization policy**: D3의 Future Extension 참조.
-  허용 시 해당 op의 Phase 2 처리 방식 (skip / verify / recompute) 정의 필요
+- **Fused operation**: Whether to record tl.composite's tiled pipeline
+  (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
+- **Math op schema generalization**: The current math params have a simple structure,
+  but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
+  scalar/immediate operands, where/mask expressions, etc.
+- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
+  replacement with stable op_id is needed when introducing streaming/disk-backed mode
+- **Phase 1 materialization policy**: See Future Extension in D3.
+  If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
+  needs to be defined

 ---

 ## Consequences

-### 긍정적
+### Positive

- SimPy 시뮬레이션 성능 영향 최소 (op_log append만 추가)
- Phase 2에서 멀티스레드/GPU 자유롭게 사용 가능
- 컴포넌트 교체 자유도 유지 (ADR-0015 설계 철학 보존)
- 벤치마크 사용자 코드 API 변경 불필요
- 새 메시지 타입 추가 시 data_op 플래그만 설정
- greenlet으로 Phase 0 제거 — 메모리 데이터 기반 dynamic control flow 지원
- `tl.load()`가 실제 데이터를 반환하므로 커널 디버깅 용이
+- Minimal impact on SimPy simulation performance (only op_log append added)
+- Free to use multi-threading/GPU in Phase 2
+- Component replaceability preserved (ADR-0015 design philosophy maintained)
+- No changes needed to benchmark user code API
+- When adding new message types, only set the data_op flag
+- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
+- `tl.load()` returns actual data, making kernel debugging easier

-### 부정적
+### Negative

- op_log 메모리 사용량 (대규모 시뮬레이션 시)
- Phase 2 실행 시간은 텐서 크기에 비례 (대형 GEMM)
- pending handle (연산 미완료) 기반 동적 분기 불가
-  (연산은 Phase 2에서 실행, Phase 1에서 결과 값 미확정).
-  메모리 데이터 기반 분기는 greenlet으로 지원된다.
- greenlet C 확장 의존성 추가 (pip install greenlet)
+- op_log memory usage (for large-scale simulations)
+- Phase 2 execution time is proportional to tensor size (large GEMM)
+- Dynamic branching based on pending handles (incomplete computations) not possible
+  (computations execute in Phase 2, result values are undetermined in Phase 1).
+  Memory-data-based branching is supported via greenlet.
+- greenlet C extension dependency added (pip install greenlet)