ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/

Establish English as the canonical ADR language with Korean translations held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror). Promotion from adr-proposed/ to adr/ now writes English to adr/ and the Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.

- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English, 2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix dropped). ADR-0023 EN regenerated against KO source which had newer HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline section covering bidirectional sync, conflict resolution (EN wins), and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair completeness, filename mirroring, ADR-ID match, Status byte-equality. Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF normalization, em-dash title separator, underscore-slug edge case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:38:44 -07:00
parent 687c98086d
commit a796c1d2f7
42 changed files with 10515 additions and 3422 deletions
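The checks the commit message attributes to tools/verify_adr_lang_pairs.py (pair completeness, filename mirroring, ADR-ID match, Status equality) could look roughly like the sketch below. It is not the tool's actual code: the regexes, the function names `read_header` and `verify_pairs`, and the message strings are all assumptions made for illustration.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

# Hypothetical patterns; the real tool's parsing rules are not shown in this commit view.
ADR_ID_RE = re.compile(r"^# (ADR-\d{4})", re.MULTILINE)
STATUS_RE = re.compile(r"^## Status\s*\n+(.+?)\s*$", re.MULTILINE)


def read_header(path: Path) -> Tuple[Optional[str], Optional[str]]:
    """Return (ADR id, status line) from a markdown ADR, with CRLF normalized to LF."""
    text = path.read_bytes().decode("utf-8").replace("\r\n", "\n")
    adr_id = ADR_ID_RE.search(text)
    status = STATUS_RE.search(text)
    return (adr_id.group(1) if adr_id else None,
            status.group(1) if status else None)


def verify_pairs(en_dir: Path, ko_dir: Path) -> List[str]:
    """Collect problems as strings; an empty list means the two trees mirror each other."""
    problems = []
    en_names = {p.name for p in en_dir.glob("*.md")}
    ko_names = {p.name for p in ko_dir.glob("*.md")}
    # Pair completeness and filename mirroring: every file must exist on both sides.
    for name in sorted(en_names - ko_names):
        problems.append(f"missing KO mirror: {name}")
    for name in sorted(ko_names - en_names):
        problems.append(f"KO file without EN canonical: {name}")
    # ADR-ID and Status must agree within each pair.
    for name in sorted(en_names & ko_names):
        en_id, en_status = read_header(en_dir / name)
        ko_id, ko_status = read_header(ko_dir / name)
        if en_id != ko_id:
            problems.append(f"ADR-ID mismatch in {name}: {en_id} vs {ko_id}")
        if en_status != ko_status:
            problems.append(f"Status mismatch in {name}: {en_status!r} vs {ko_status!r}")
    return problems
```

Returning a list of problems rather than raising on the first one suits the "run on demand or in CI" usage mentioned above, since a single run can report every broken pair.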
@@ -18,7 +18,7 @@ We define stable, minimal message schemas for Host ↔ IO_CPU so that:
 - IO_CPU-internal fan-out/aggregation can evolve independently,
 - completion and failure propagation is deterministic.

-We also require PE-tagging (A 방식): each shard explicitly carries (sip,cube,pe)
+We also require PE-tagging (Scheme A): each shard explicitly carries (sip,cube,pe)
 so IO_CPU can deterministically route/fan-out without relying on PA decoding.

 ---
@@ -93,7 +93,7 @@ Rules:
 Mandatory fields:

 - common envelope fields (D3)
- destination placement tags (A 방식):
+- destination placement tags (Scheme A):
  - `dst_sip: int`
  - `dst_cube: int`
  - `dst_pe: int`
@@ -130,7 +130,7 @@ Notes:
 Mandatory fields:

 - common envelope fields (D3)
- source placement tags (A 방식):
+- source placement tags (Scheme A):
  - `src_sip: int`
  - `src_cube: int`
  - `src_pe: int`
@@ -183,7 +183,7 @@ Tensor arg (mandatory):

 - `shards: list[TensorShard]`

-`TensorShard` MUST have (A 방식 강제):
+`TensorShard` MUST have (Scheme A enforced):

 - `sip: int`
 - `cube: int`
@@ -1,519 +0,0 @@
-# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
-
-## Status
-
-Accepted
-
-## Context
-
-The current simulation models **timing only**.
-`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
-but do not actually read tensor data or perform computations.
-
-### Required Capabilities
-
-1. Must be able to store and read actual data in HBM/TCM/SRAM
-2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
-3. Must minimize simulation performance degradation
-
-### Constraints
-
-- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
-- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
-- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
-- Kernel functions must remain plain Python functions (no generator/async transformation)
-
-### Design Exploration Results
-
-| Option | Approach | Verdict |
-|--------|----------|---------|
-| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
-| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
-| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
-| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |
-
---
-
-## Decision
-
-### D1. 2-Pass Execution Model — Phase 0 Elimination
-
-The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.
-
-Before:
-```
-Phase 0: Kernel → PeCommand list (no data, no branching)
-Phase 1: Replay PeCommand list via SimPy (timing only)
-```
-
-After:
-```
-Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
-  - Memory read/write: SimPy timing + MemoryStore actual data
-  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
-  - Dynamic control flow possible (tl.load returns actual data)
-
-Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
-```
-
-This ADR **extends Phase 1 to be data-aware for memory operations only**.
-Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
-Phase 2 handles GEMM/Math computation correctness verification.
-Phase 2 is optional — if only timing is needed, run Phase 1 alone.
-
-### D2. Op Log Recording — ComponentBase Hook
-
-Op log recording is performed as a **hook in the component base class**.
-Individual component implementations are not modified.
-
-```python
-class ComponentBase:
-    def _on_process_start(self, env, msg):
-        if self._op_logger and getattr(msg, 'data_op', False):
-            self._op_logger.record_start(env.now, self.node.id, msg)
-
-    def _on_process_end(self, env, msg):
-        if self._op_logger and getattr(msg, 'data_op', False):
-            self._op_logger.record_end(env.now, self.node.id, msg)
-```
-
-Hooks are called before and after `run()` within `_forward_txn()`.
-`_op_logger` is optional — zero overhead when absent.
-
-**Hook timing definitions**:
-
-| Timing | Meaning |
-|--------|---------|
-| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
-| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |
-
-Link traversal latency is not included in t_start/t_end.
-Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
-
-### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
-
-The existing Phase 0 (kernel → PeCommand list) is eliminated,
-and **greenlet** is used to cooperatively interleave kernel and SimPy execution.
-
-#### Operating Principle
-
-greenlet is a C extension that provides cooperative context switching.
-When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
-to perform timing simulation, and after completion, returns to the kernel with actual data.
-
-```
-SimPy loop (parent greenlet)           Kernel (child greenlet)
-─────────────────────────              ──────────────────────
-g.switch() ─────────────────────────→ Kernel starts
-                                       a = tl.load(ptr, ...)
-                                         internal: parent.switch(DmaReadCmd)
-cmd = DmaReadCmd ←──────────────────  (kernel paused)
-  yield DmaReadMsg(...)
-  yield env.timeout(dma_latency)
-  data = memory_store.read(...)
-g.switch(data) ─────────────────────→ (kernel resumed)
-                                       a = data  ← actual numpy array
-                                       if a[0][0] > 0.5:  ← branching possible
-                                         ...
-```
-
-The kernel is maintained as a **plain Python function**.
-greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
-
-#### KernelRunner — Framework Layer
-
-The greenlet loop resides not in the PE_CPU component but in the framework layer,
-**KernelRunner**.
-
-```python
-# KernelRunner (framework — greenlet ↔ SimPy bridge)
-class KernelRunner:
-    def run(self, env, kernel_fn, args, store):
-        g = greenlet(self._run_kernel)
-        cmd = g.switch(kernel_fn, args)
-
-        while cmd is not None:
-            if isinstance(cmd, DmaReadCmd):
-                yield from self._dispatch_dma(env, cmd)
-                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
-                cmd = g.switch(data)            # resume with actual data
-            elif isinstance(cmd, GemmCmd):
-                yield from self._dispatch_gemm(env, cmd)
-                cmd = g.switch()                # resume (no data)
-            elif isinstance(cmd, DmaWriteCmd):
-                store.write(cmd.dst_addr, cmd.data)  # visibility = issue time
-                yield from self._dispatch_dma(env, cmd)  # timing only
-                cmd = g.switch()
-
-# PE_CPU (component — kept simple, unaware of greenlet)
-def _execute_kernel(self, env):
-    runner = KernelRunner(self.ctx)
-    yield from runner.run(env, kernel_fn, args, store)
-```
-
-**Op logging single source of truth**: KernelRunner does not record directly to op_log.
-All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
-When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
-the component base class hooks automatically record them.
-
-**Layer separation**:
-- **Kernel code**: plain function, unaware of greenlet
-- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
-- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
-- **ComponentBase hook**: the sole path for op_log recording
-- **PE_CPU**: only calls KernelRunner, replaceable as a component
-
-#### Handling Differences Between Memory Read/Write and Compute
-
-| Operation | In Phase 1 | In Phase 2 |
-|-----------|-----------|-----------|
-| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
-| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
-| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
-| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |
-
-Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
-GEMM/Math operations are batch-executed in Phase 2 (performance separation).
-
-#### Store Visibility Rule
-
-`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
-SimPy DMA timing is simulated separately afterward.
-
-This is an intentional separation of timing and visibility:
-- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
-- **timing**: the point at which DMA latency completes in SimPy
-
-This separation allows a load immediately after a store to see the latest data in dynamic control flow.
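The visibility/timing split can be illustrated with a tiny stand-alone sketch. `TinyStore`, `issue_store`, and the numeric values are hypothetical stand-ins, not the simulator's MemoryStore or DMA path; the only point is that the write lands before the modeled DMA completion time.

```python
class TinyStore:
    """Hypothetical stand-in for MemoryStore: maps an address to a value."""

    def __init__(self):
        self._data = {}

    def write(self, addr, value):
        self._data[addr] = value

    def read(self, addr):
        return self._data[addr]


def issue_store(store, pending, now, addr, value, dma_latency):
    """Make the value visible immediately and record when the DMA timing would finish."""
    store.write(addr, value)             # visibility = issue time
    done_at = now + dma_latency          # timing is modeled separately
    pending.append((done_at, addr))
    return done_at


store, pending = TinyStore(), []
done_at = issue_store(store, pending, now=100.0, addr=0x1000,
                      value=[1.0, 2.0], dma_latency=40.0)
# A load issued at t=100 already sees the data, although the DMA completes at t=140.
visible_now = store.read(0x1000)
```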
-
-#### Result Handle Semantics
-
-`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.
-
-The key contract in Phase 1:
-
-1. **All compute handles are always considered pending in Phase 1.**
-2. `tl.wait(handle)` **expresses timing synchronization only**
-   and does not make the handle ready.
-3. Accessing the handle's actual result data (`handle.data`, element access,
-   numpy conversion, etc.) is **only possible in Phase 2**.
-4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
-5. In contrast, `tl.load()` returns actual data in Phase 1, so
-   **memory-read-based control flow is supported**.
-
-| Handle state | Phase | Allowed operations |
-|------------|-------|----------|
-| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
-| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
-| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
-| ready | Phase 2 | Actual numpy data access, verification |
-
-This restriction is intentional. If computations were executed in Phase 1,
-the SimPy single-thread would block, defeating the purpose of 2-pass separation.
-
-#### Phase 1 Materialization — Future Extension
-
-If Phase 1 eager execution becomes necessary for small operations
-(scalar, small reduction) in the future, selective materialization can be supported
-by adding a `materialized_in_phase1: bool` flag to the op record.
-This is not implemented in the current scope.
-
-### D4. data_op Flag — Message Self-Declaration
-
-The logging target is determined by the `data_op` attribute on the message instance,
-not by message type. The framework does not hardcode message types.
-
-```python
-class MsgBase:
-    data_op: bool = False       # default: no logging
-
-class DmaReadCmd(MsgBase):
-    data_op = True              # memory transfer → logging
-
-class GemmCmd(MsgBase):
-    data_op = True              # compute → logging
-
-class MathCmd(MsgBase):
-    data_op = True              # compute → logging
-```
-
-When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
-enables automatic logging without modifying framework code.
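As a rough illustration of the self-declaration idea, the sketch below filters on the `data_op` attribute alone. `IpcqMsg` is the example name from the text; `CreditMsg`, `ListLogger`, and `on_process_start` are hypothetical names introduced only for this sketch.

```python
class MsgBase:
    data_op = False            # default: not logged


class GemmCmd(MsgBase):
    data_op = True


class IpcqMsg(MsgBase):        # new message type: opting in is one line
    data_op = True


class CreditMsg(MsgBase):      # hypothetical control message, keeps the default
    pass


class ListLogger:
    """Hypothetical logger that just appends tuples to a list."""

    def __init__(self):
        self.records = []

    def record_start(self, now, component_id, msg):
        self.records.append((now, component_id, type(msg).__name__))


def on_process_start(logger, now, component_id, msg):
    # The hook never inspects the message type, only the flag the message declares.
    if logger is not None and getattr(msg, "data_op", False):
        logger.record_start(now, component_id, msg)


logger = ListLogger()
on_process_start(logger, 0.0, "pe_gemm", GemmCmd())
on_process_start(logger, 1.0, "ipcq", IpcqMsg())
on_process_start(logger, 2.0, "noc", CreditMsg())   # skipped: data_op is False
on_process_start(None, 3.0, "pe_gemm", GemmCmd())   # skipped: no logger attached
```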
-
-### D5. Op Log Structure
-
-#### Op Classification Scheme
-
-A two-level classification is used:
-
-| Level | Field | Role |
-|-------|-------|------|
-| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
-| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |
-
-#### OpRecord Definition
-
-```python
-@dataclass
-class OpRecord:
-    t_start: float              # SimPy time (ns) — service start
-    t_end: float                # SimPy time (ns) — service completion
-    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
-    op_kind: str                # "memory" | "gemm" | "math"
-    op_name: str                # specific operation name
-    params: dict                # per-operation parameters (see below)
-    dependency_ids: list[int]   # currently based on in-memory record index, may be replaced with stable op_id in the future
-```
-
-#### dependency_ids Generation Rules
-
-`dependency_ids` is **optional**, and by default the executor performs
-address-based dependency inference (see D6).
-
-Explicit setting is only needed when precise execution ordering is required:
-- **Default (address-based inference)**: the executor analyzes read/write sets to
-  automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
-- **Explicit setting**: set when logical dependencies cannot be expressed via addresses
-  at the TLContext or command generation stage.
-  Example: completion handle-based synchronization — handle dependencies depend on
-  logical completion order rather than memory addresses, so they cannot be captured
-  by address inference.
-
-#### op_log Ordering
-
-The op_log maintains **stable ordering** based on `t_start`.
-Records with the same `t_start` preserve insertion order.
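One way to obtain this ordering, assuming records are collected in insertion order and sorted afterwards (the ADR does not say how the ordering is implemented), is to rely on Python's stable sort. `Rec` is a hypothetical, trimmed-down record type.

```python
from dataclasses import dataclass


@dataclass
class Rec:
    t_start: float
    name: str


# Insertion order deliberately interleaves two timestamps.
log = [
    Rec(20.0, "gemm_a"),
    Rec(10.0, "dma_read"),
    Rec(20.0, "gemm_b"),
    Rec(10.0, "dma_write"),
]

# sorted() is stable: records with equal t_start keep their insertion order.
ordered = sorted(log, key=lambda r: r.t_start)
names = [r.name for r in ordered]
```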
-
-#### params Details
-
-**memory (dma_read / dma_write)**:
-```python
-{
-    "src_addr": int,            # source address (byte)
-    "dst_addr": int,            # destination address (byte)
-    "nbytes": int,              # transfer size
-    "src_space": str,           # "hbm" | "tcm" | "sram"
-    "dst_space": str,           # "hbm" | "tcm" | "sram"
-}
-```
-
-**gemm**:
-```python
-{
-    "src_a_addr": int,          # operand A address
-    "src_b_addr": int,          # operand B address
-    "dst_addr": int,            # output address
-    "shape_a": tuple,           # e.g. (128, 256)
-    "shape_b": tuple,           # e.g. (256, 128)
-    "shape_out": tuple,         # e.g. (128, 128)
-    "dtype_in": str,            # e.g. "f16"
-    "dtype_acc": str,           # accumulation dtype, e.g. "f32"
-    "dtype_out": str,           # output dtype, e.g. "f16"
-    "transpose_a": bool,
-    "transpose_b": bool,
-    "layout_a": str,            # "row_major" | "col_major"
-    "layout_b": str,
-    "layout_out": str,
-    "addr_space": str,          # "tcm" (GEMM operands are always in TCM)
-}
-```
-
-**math**:
-```python
-{
-    "op": str,                  # "exp" | "add" | "sum" | "where" | ...
-    "input_addrs": list[int],   # list of operand addresses
-    "input_shapes": list[tuple],
-    "dst_addr": int,
-    "shape_out": tuple,
-    "dtype": str,
-    "axis": int | None,         # reduction axis
-    "addr_space": str,          # "tcm"
-}
-```
-
-### D6. Phase 2 Executor
-
-Phase 2 executes the op_log outside of SimPy.
-
-```python
-class DataExecutor:
-    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
-        self.store = initial_store  # Takes the Phase 1 MemoryStore snapshot as input
-
-    def run(self):
-        for t, ops in groupby(op_log, key=lambda o: o.t_start):
-            batch = list(ops)
-            independent, sequential = self._classify(batch)
-            self._execute_parallel(independent)
-            self._execute_sequential(sequential)
-```
-
-**Parallel execution determination**:
-
-Ops with the same `t_start` are considered **parallel candidates**.
-The executor determines actual parallel execution based on the following criteria:
-- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
-- Whether predecessor ops specified in `dependency_ids` have completed
-
-Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
-
-**Batch optimization**: Only independent ops with the same op_name **and identical
-shape, dtype, layout, and transpose flags** are eligible for batching.
-Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
-Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
-
-**Phase 2 execution order guarantee**:
-
-Phase 2 does not consider data arrival timing,
-and guarantees execution order solely through
-dependencies (address-based inference + explicit dependency_ids).
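The address-based part of this classification might be sketched as follows. `Op`, `overlaps`, `conflicts`, and `classify` are hypothetical names, ranges are `(start_byte, nbytes)` pairs, and explicit `dependency_ids` are left out to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    reads: list = field(default_factory=list)    # list of (start_byte, nbytes)
    writes: list = field(default_factory=list)   # list of (start_byte, nbytes)


def overlaps(a, b):
    """True if two half-open byte ranges [start, start + nbytes) intersect."""
    (a_start, a_len), (b_start, b_len) = a, b
    return a_start < b_start + b_len and b_start < a_start + a_len


def conflicts(a, b):
    """True if any RAW, WAW, or WAR hazard exists between the two ops.

    Which hazard it is depends on log order, but the existence of a conflict
    is symmetric, so the argument order does not matter here.
    """
    write_read = any(overlaps(w, r) for w in a.writes for r in b.reads)
    write_write = any(overlaps(w, w2) for w in a.writes for w2 in b.writes)
    read_write = any(overlaps(r, w) for r in a.reads for w in b.writes)
    return write_read or write_write or read_write


def classify(batch):
    """Split a same-t_start batch into parallel-safe ops and ops that must stay ordered."""
    independent, sequential = [], []
    for i, op in enumerate(batch):
        others = (other for j, other in enumerate(batch) if j != i)
        if any(conflicts(op, other) for other in others):
            sequential.append(op)
        else:
            independent.append(op)
    return independent, sequential


gemm0 = Op("gemm0", reads=[(0, 64), (64, 64)], writes=[(256, 64)])
gemm1 = Op("gemm1", reads=[(512, 64)], writes=[(1024, 64)])
exp0 = Op("exp0", reads=[(256, 64)], writes=[(2048, 64)])   # reads gemm0's output

independent, sequential = classify([gemm0, gemm1, exp0])
```

Here `gemm1` touches no range used by the other two and can run in parallel, while `gemm0` and `exp0` share the range at byte 256 and keep their log order.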
-
-### D7. Memory Store
-
-`MemoryStore` logically follows byte-addressable semantics,
-and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).
-
-```python
-class MemoryStore:
-    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
-    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
-```
-
-**Internal storage format: numpy ndarray**
-
-MemoryStore stores tensors as **numpy ndarrays**.
-
-| Candidate | store/load speed | Phase 2 compute | Verdict |
-|-----------|-----------------|-----------------|---------|
-| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
-| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
-| torch tensor | Immediate | torch operations available | Use only for GPU optimization |
-
-- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
-- read: **returns numpy array by reference** (no copy)
-- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
-- dtype uses numpy native (`np.float16`, `np.float32`, `np.bfloat16`, etc.)
-- For byte-level access, convert via `.view(np.uint8)`
-- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
-
-**read/write contract**:
-
-- read/write operates on a **contiguous tensor** basis.
-  If non-contiguous stride views are needed, express them as separate copy ops.
-- In the normal benchmark path, producer/consumer dtype match is expected.
-  Reinterpret cast is a permissive behavior for low-level memory validation
-  or special test cases.
-- addr is byte-aligned, with minimum alignment = dtype size.
-- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
-  Shape mismatch is verified based on nbytes, and raises an error on mismatch.
-- Correctness criteria follow address-range-based read/write semantics.
-- A tensor object cache may be used as an implementation optimization,
-  but the canonical state is byte-addressable storage.
-- At deploy time, the host injects initial tensor data.
-
-### D8. Benchmark Kernel Code
-
-The benchmark's **user code API is not changed**.
-The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.
-
-However, internal command/message schemas may be extended to include metadata
-required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
-
-### D9. No Component Changes
-
-Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
-Op log recording is the responsibility of the ComponentBase hook.
-When custom components are replaced, only the timing model changes,
-and Phase 2 data execution is unaffected.
-
-### D10. Phase 2 is Optional
-
-```python
-engine = GraphEngine(graph)
-engine.run(benchmark)                       # Phase 1: timing only
-result = engine.get_timing_result()
-
-if verify_data:
-    executor = DataExecutor(engine.op_log)  # Phase 2: data
-    executor.run()
-    executor.verify(expected_output)
-```
-
-If only timing analysis is needed, Phase 2 is skipped.
-If the op_logger is deactivated, Phase 1 performance is identical to the original.
-
-### D11. Verification Contract
-
-Basic verification **compares the final output tensor** against a reference backend (numpy).
-
-Per-dtype tolerance policy:
-
-| dtype | Comparison method | Tolerance |
-|-------|----------|-----------|
-| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
-| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
-| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
-| int types | `np.array_equal` | exact |
-
-- Default mode: compare final output only (end-to-end correctness)
-- Debug mode: can compare intermediate tensors on a per-op basis
-  (MemoryStore snapshot at each op boundary)
-
---
-
-## Non-goals
-
-- **Compute-result-based control flow**: not supported.
-  All compute handles are in pending state during Phase 1,
-  `wait()` expresses timing synchronization only and does not imply data readiness.
-  Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
-  is **treated as an error**.
-  Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
-  Phase 1 materialization is a future extension (see D3).
-- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
-  the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
-- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
-  and do not reproduce the actual hardware PE microarchitecture.
-
-## Open Questions
-
-- **Aliasing / slice view**: How to represent slice/views referencing the same
-  backing storage in MemoryStore (stride-based view vs copy semantics)
-- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
-  communication as memory ops or introduce a separate op_kind
-- **Op log streaming**: Managing op_log memory usage in large-scale simulations
-  (in-memory list vs disk-backed streaming)
-- **Fused operation**: Whether to record tl.composite's tiled pipeline
-  (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
-- **Math op schema generalization**: The current math params have a simple structure,
-  but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
-  scalar/immediate operands, where/mask expressions, etc.
-- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
-  replacement with stable op_id is needed when introducing streaming/disk-backed mode
-- **Phase 1 materialization policy**: See Future Extension in D3.
-  If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
-  needs to be defined
-
---
-
-## Consequences
-
-### Positive
-
-- Minimal impact on SimPy simulation performance (only op_log append added)
-- Free to use multi-threading/GPU in Phase 2
-- Component replaceability preserved (ADR-0015 design philosophy maintained)
-- No changes needed to benchmark user code API
-- When adding new message types, only set the data_op flag
-- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
-- `tl.load()` returns actual data, making kernel debugging easier
-
-### Negative
-
-- op_log memory usage (for large-scale simulations)
-- Phase 2 execution time is proportional to tensor size (large GEMM)
-- Dynamic branching based on pending handles (incomplete computations) not possible
-  (computations execute in Phase 2, result values are undetermined in Phase 1).
-  Memory-data-based branching is supported via greenlet.
-- greenlet C extension dependency added (pip install greenlet)
@@ -1,4 +1,4 @@
-# ADR-0020: 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리)
+# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)

 ## Status

@@ -6,65 +6,65 @@ Accepted

 ## Context

-현재 시뮬레이션은 **타이밍만** 모델링한다.
-`tl.load()`, `tl.composite(op="gemm")` 등은 SimPy latency를 생성하지만,
-실제 텐서 데이터를 읽거나 연산하지 않는다.
+The current simulation models **timing only**.
+`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
+but do not actually read tensor data or perform computations.

-### 필요한 기능
+### Required Capabilities

-1. HBM/TCM/SRAM에 실제 데이터를 저장하고 읽을 수 있어야 한다
-2. PE_GEMM, PE_MATH가 실제 행렬 연산을 수행하고 결과를 검증할 수 있어야 한다
-3. 시뮬레이션 성능 저하를 최소화해야 한다
+1. Must be able to store and read actual data in HBM/TCM/SRAM
+2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
+3. Must minimize simulation performance degradation

-### 제약 조건
+### Constraints

-- SimPy는 single-thread 이벤트 루프 — numpy matmul을 안에서 하면 전체가 block
-- 컴포넌트는 교체 가능해야 한다 (ADR-0015) — 프레임워크 요구사항이 구현에 침투하면 안 됨
-- 벤치마크 커널은 명령형 코드(tl.load → tl.composite → tl.wait) — 같은 코드를 재사용해야 함
-- 커널 함수는 plain Python function으로 유지해야 한다 (generator/async 변환 불가)
+- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
+- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
+- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
+- Kernel functions must remain plain Python functions (no generator/async transformation)

-### 설계 탐색 결과
+### Design Exploration Results

-| Option | 방식 | 판정 |
-|--------|------|------|
-| SimPy 내 직접 실행 | GEMM을 SimPy 안에서 numpy 호출 | 탈락: single-thread block |
-| SimPy + ThreadPool | future.submit → timeout → result() | 탈락: back-to-back 요청 시 result()에서 block |
-| Symbolic + lazy | 메타데이터만 추적, 나중에 실행 | 탈락: control-flow dependent 읽기 처리 곤란 |
-| **2-pass (채택)** | Phase 1: 타이밍, Phase 2: 데이터 | 완전 분리, 성능 영향 없음 |
+| Option | Approach | Verdict |
+|--------|----------|---------|
+| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
+| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
+| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
+| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |

 ---

 ## Decision

-### D1. 2-Pass 실행 모델 — Phase 0 제거
+### D1. 2-Pass Execution Model — Phase 0 Elimination

-기존의 3단계(Phase 0 → Phase 1 → Phase 2)를 **2단계로 통합**한다.
+The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.

-기존:
+Before:
 ```
-Phase 0: 커널 → PeCommand 리스트 (데이터 없음, 분기 불가)
-Phase 1: PeCommand 리스트를 SimPy replay (타이밍만)
+Phase 0: Kernel → PeCommand list (no data, no branching)
+Phase 1: Replay PeCommand list via SimPy (timing only)
 ```

-변경:
+After:
 ```
-Phase 1 (타이밍): 커널 + SimPy 통합 실행 — greenlet 기반
-  - 메모리 읽기/쓰기: SimPy 타이밍 + MemoryStore 실제 데이터
-  - 연산 (GEMM/Math): SimPy 타이밍 + op_log 기록 (실제 연산은 Phase 2)
-  - dynamic control flow 가능 (tl.load가 실제 데이터 반환)
+Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
+  - Memory read/write: SimPy timing + MemoryStore actual data
+  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
+  - Dynamic control flow possible (tl.load returns actual data)

-Phase 2 (데이터): op_log 기반 실제 연산 실행 — SimPy 외부, 병렬 가능
+Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
 ```

-본 ADR은 **메모리 연산에 한해 Phase 1을 data-aware로 확장**한다.
-Phase 1은 latency/BW 병목 분석 + 메모리 데이터 추적,
-Phase 2는 GEMM/Math 연산 정합성 검증.
-Phase 2는 optional — 타이밍만 필요하면 Phase 1만 실행.
+This ADR **extends Phase 1 to be data-aware for memory operations only**.
+Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
+Phase 2 handles GEMM/Math computation correctness verification.
+Phase 2 is optional — if only timing is needed, run Phase 1 alone.

-### D2. Op Log 기록 — ComponentBase hook
+### D2. Op Log Recording — ComponentBase Hook

-op_log 기록은 **컴포넌트 베이스 클래스의 hook**으로 수행한다.
-개별 컴포넌트 구현을 수정하지 않는다.
+Op log recording is performed as a **hook in the component base class**.
+Individual component implementations are not modified.

 ```python
 class ComponentBase:
@@ -77,56 +77,56 @@ class ComponentBase:
            self._op_logger.record_end(env.now, self.node.id, msg)
 ```

-`_forward_txn()` 에서 `run()` 전후로 hook을 호출한다.
-`_op_logger`는 optional — 없으면 오버헤드 제로.
+Hooks are called before and after `run()` within `_forward_txn()`.
+`_op_logger` is optional — zero overhead when absent.

-**hook 시점 정의**:
+**Hook timing definitions**:

-| 시점 | 의미 |
-|------|------|
-| `t_start` | 컴포넌트가 해당 msg의 **service를 시작**한 시점 (`run()` 진입 직전) |
-| `t_end` | 컴포넌트의 **내부 service가 완료**된 시점 (`run()` 반환 직후) |
+| Timing | Meaning |
+|--------|---------|
+| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
+| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |

-link traversal latency는 t_start/t_end에 포함되지 않는다.
-link latency는 발신 컴포넌트의 t_end와 수신 컴포넌트의 t_start 차이로 관측된다.
+Link traversal latency is not included in t_start/t_end.
+Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.

-### D3. Greenlet 기반 커널 실행 — Phase 0 제거
+### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination

-기존 Phase 0 (커널 → PeCommand 리스트)를 제거하고,
-**greenlet**을 사용하여 커널과 SimPy를 협력적으로 interleave 실행한다.
+The existing Phase 0 (kernel → PeCommand list) is eliminated,
+and **greenlet** is used to cooperatively interleave kernel and SimPy execution.

-#### 동작 원리
+#### Operating Principle

-greenlet은 협력적 context switch를 제공하는 C 확장이다.
-커널(child greenlet)이 `tl.load()` 등을 호출하면 SimPy 루프(parent greenlet)로
-switch하여 타이밍 시뮬레이션을 수행하고, 완료 후 실제 데이터와 함께 커널로 돌아온다.
+greenlet is a C extension that provides cooperative context switching.
+When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
+to perform timing simulation, and after completion, returns to the kernel with actual data.

 ```
-SimPy 루프 (parent greenlet)          커널 (child greenlet)
+SimPy loop (parent greenlet)           Kernel (child greenlet)
 ─────────────────────────              ──────────────────────
-g.switch() ─────────────────────────→ 커널 시작
+g.switch() ─────────────────────────→ Kernel starts
                                       a = tl.load(ptr, ...)
-                                         내부: parent.switch(DmaReadCmd)
-cmd = DmaReadCmd ←──────────────────  (커널 일시정지)
+                                         internal: parent.switch(DmaReadCmd)
+cmd = DmaReadCmd ←──────────────────  (kernel paused)
  yield DmaReadMsg(...)
  yield env.timeout(dma_latency)
  data = memory_store.read(...)
-g.switch(data) ─────────────────────→ (커널 재개)
-                                       a = data  ← 실제 numpy array
-                                       if a[0][0] > 0.5:  ← 분기 가능
+g.switch(data) ─────────────────────→ (kernel resumed)
+                                       a = data  ← actual numpy array
+                                       if a[0][0] > 0.5:  ← branching possible
                                         ...
 ```

-커널은 **plain Python function**으로 유지된다.
-greenlet switch는 `tl.load()`, `tl.store()` 등의 **내부 구현에만** 존재한다.
+The kernel remains a **plain Python function**.
+greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
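The switch sequence in the diagram above can be sketched with the `greenlet` package alone. This is a minimal illustration, not the ADR's implementation: `tl_load`, `run`, the dict-backed `memory`, and this stripped-down `DmaReadCmd` are hypothetical stand-ins, and the SimPy timing step is reduced to a comment.

```python
from greenlet import greenlet, getcurrent


class DmaReadCmd:
    """Hypothetical stand-in for the ADR's DmaReadCmd."""

    def __init__(self, src_addr):
        self.src_addr = src_addr


def tl_load(src_addr):
    # Runs inside the child greenlet: hand the command to the parent and pause.
    # Whatever the parent later passes to g.switch(data) is returned here.
    return getcurrent().parent.switch(DmaReadCmd(src_addr))


def kernel():
    # Plain function: it only calls tl_load() and never touches greenlet.
    a = tl_load(0x100)
    # Branching works because `a` is a real value, not a placeholder.
    return "big" if a > 0.5 else "small"


def run(memory):
    g = greenlet(kernel)
    cmd = g.switch()                 # start the kernel; returns at its first switch
    while not g.dead:
        # A real runner would yield SimPy timing events here.
        data = memory[cmd.src_addr]
        cmd = g.switch(data)         # resume the kernel with the loaded data
    return cmd                       # a finished greenlet hands back its return value
```

When the child function returns, `greenlet` delivers its return value to the parent's pending `switch()` call, which is why the loop can end on `g.dead` without a separate completion message.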

-#### KernelRunner — 프레임워크 레이어
+#### KernelRunner — Framework Layer

-greenlet 루프는 PE_CPU 컴포넌트가 아니라 프레임워크 레이어인
-**KernelRunner**에 위치한다.
+The greenlet loop lives in the framework-layer **KernelRunner**,
+not in the PE_CPU component.

 ```python
-# KernelRunner (프레임워크 — greenlet ↔ SimPy 연결)
+# KernelRunner (framework — greenlet ↔ SimPy bridge)
 class KernelRunner:
    def run(self, env, kernel_fn, args, store):
        g = greenlet(self._run_kernel)
@@ -136,160 +136,162 @@ class KernelRunner:
            if isinstance(cmd, DmaReadCmd):
                yield from self._dispatch_dma(env, cmd)
                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
-                cmd = g.switch(data)            # 실제 데이터와 함께 재개
+                cmd = g.switch(data)            # resume with actual data
            elif isinstance(cmd, GemmCmd):
                yield from self._dispatch_gemm(env, cmd)
-                cmd = g.switch()                # 재개 (데이터 없음)
+                cmd = g.switch()                # resume (no data)
            elif isinstance(cmd, DmaWriteCmd):
-                store.write(cmd.dst_addr, cmd.data)  # visibility = issue 시점
-                yield from self._dispatch_dma(env, cmd)  # timing만 반영
+                store.write(cmd.dst_addr, cmd.data)  # visibility = issue time
+                yield from self._dispatch_dma(env, cmd)  # timing only
                cmd = g.switch()

-# PE_CPU (컴포넌트 — 간단하게 유지, greenlet을 모름)
+# PE_CPU (component — kept simple, unaware of greenlet)
 def _execute_kernel(self, env):
    runner = KernelRunner(self.ctx)
    yield from runner.run(env, kernel_fn, args, store)
 ```

-**Op logging single source of truth**: KernelRunner는 op_log에 직접 기록하지 않는다.
-모든 op logging은 **ComponentBase hook (_on_process_start/end)만** 담당한다.
-KernelRunner가 `_dispatch_gemm()` 등으로 컴포넌트에 메시지를 전달하면,
-컴포넌트 베이스 클래스의 hook이 자동으로 기록한다.
+**Op logging single source of truth**: KernelRunner does not record directly to op_log.
+All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
+When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
+the component base class hooks automatically record them.

-**레이어 분리**:
- **커널 코드**: plain function, greenlet 존재를 모름
- **TLContext**: `tl.load()` 내부에서 `parent.switch(cmd)` 호출
- **KernelRunner**: greenlet ↔ SimPy 연결, MemoryStore 읽기/쓰기 처리. **logging 안 함**.
- **ComponentBase hook**: op_log 기록의 유일한 경로
- **PE_CPU**: KernelRunner를 호출만 함, 컴포넌트로서 교체 가능
+**Layer separation**:
+- **Kernel code**: plain function, unaware of greenlet
+- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
+- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
+- **ComponentBase hook**: the sole path for op_log recording
+- **PE_CPU**: only calls KernelRunner, replaceable as a component

-#### 메모리 읽기/쓰기 vs 연산의 처리 차이
+#### Handling Differences Between Memory Read/Write and Compute

-| 연산 | Phase 1에서 | Phase 2에서 |
-|------|------------|------------|
-| `tl.load()` | SimPy 타이밍 + MemoryStore read → **실제 데이터 반환** | — |
-| `tl.store()` | SimPy 타이밍 + MemoryStore write → **실제 기록** | — |
-| `tl.composite(gemm)` | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 |
-| `tl.dot()` / math ops | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 |
+| Operation | In Phase 1 | In Phase 2 |
+|-----------|-----------|-----------|
+| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
+| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
+| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
+| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |

-메모리 읽기/쓰기는 Phase 1에서 즉시 처리 (numpy slice, 빠름).
-GEMM/Math 연산은 Phase 2에서 batch 실행 (성능 분리).
+Memory reads/writes are handled immediately in Phase 1 (numpy slice, fast).
+GEMM/Math operations are deferred and batch-executed in Phase 2, keeping heavy compute out of the SimPy loop.

 #### Store Visibility Rule

-`tl.store()`는 **issue 시점에 MemoryStore에 즉시 반영**된다 (visibility = issue).
-SimPy DMA 타이밍은 이후 별도로 시뮬레이션된다.
+`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
+SimPy DMA timing is simulated separately afterward.

-이는 timing과 visibility를 의도적으로 분리한 것이다:
- **visibility**: MemoryStore에 반영되는 시점 = `store.write()` 호출 시
- **timing**: SimPy에서 DMA latency가 완료되는 시점
+This is an intentional separation of timing and visibility:
+- **visibility**: the point at which the written data becomes visible in MemoryStore = when `store.write()` is called
+- **timing**: the point at which DMA latency completes in SimPy

-이 분리로 dynamic control flow에서 store 직후 load가 최신 데이터를 볼 수 있다.
+This separation allows a load immediately after a store to see the latest data in dynamic control flow.

 #### Result Handle Semantics

-`tl.composite()`(sync/async)는 결과 tensor를 참조하는 **handle**을 반환한다.
+`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.

-Phase 1에서의 핵심 계약:
+The key contract in Phase 1:

-1. **모든 compute handle은 Phase 1에서 항상 pending 상태로 간주한다.**
-2. `tl.wait(handle)`은 **timing synchronization만 표현**하며,
-   handle을 ready로 만들지 않는다.
-3. handle의 실제 결과 데이터 접근(`handle.data`, element access,
-   numpy conversion 등)은 **Phase 2에서만 가능**하다.
-4. 따라서 Phase 1에서 **compute-result 기반 control flow는 지원하지 않는다.**
-5. 반면 `tl.load()`는 Phase 1에서 실제 데이터를 반환하므로,
-   **memory-read 기반 control flow는 지원 가능**하다.
+1. **All compute handles are always considered pending in Phase 1.**
+2. `tl.wait(handle)` **expresses timing synchronization only**
+   and does not make the handle ready.
+3. Accessing the handle's actual result data (`handle.data`, element access,
+   numpy conversion, etc.) is **only possible in Phase 2**.
+4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
+5. In contrast, `tl.load()` returns actual data in Phase 1, so
+   **memory-read-based control flow is supported**.

-| handle 상태 | Phase | 허용 동작 |
+| Handle state | Phase | Allowed operations |
 |------------|-------|----------|
-| pending | Phase 1 | `tl.wait(handle)` — timing 동기화만 |
-| pending | Phase 1 | handle을 `tl.store()`의 대상으로 전달 (logical destination 연결만, payload는 Phase 2) |
-| pending | Phase 1 | **데이터 접근 불가** — 값 기반 분기 불가 |
-| ready | Phase 2 | 실제 numpy 데이터 접근, 검증 |
+| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
+| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
+| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
+| ready | Phase 2 | Actual numpy data access, verification |

-이 제약은 의도적이다. Phase 1에서 연산을 실행하면 SimPy single-thread가
-block되어 2-pass 분리의 존재 이유가 사라진다.
+This restriction is intentional. If computations were executed in Phase 1,
+the SimPy single-thread would block, defeating the purpose of 2-pass separation.
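The Phase 1 handle contract can be illustrated with a small sketch. The class and function names below (`ComputeHandle`, `PendingHandleError`, `materialize`, `wait`) are assumptions made for illustration; the ADR only specifies the behavior, not these names.

```python
class PendingHandleError(RuntimeError):
    """Raised when Phase 1 code tries to read a compute result."""


class ComputeHandle:
    def __init__(self, op_index):
        self.op_index = op_index     # index of the producing op record
        self._data = None
        self._ready = False          # always False during Phase 1

    @property
    def data(self):
        if not self._ready:
            raise PendingHandleError("compute results are only readable in Phase 2")
        return self._data

    def __bool__(self):
        # Truth-value evaluation counts as data access, so it fails while pending.
        return bool(self.data)

    def materialize(self, value):
        # Assumed to be called by the Phase 2 executor only.
        self._data = value
        self._ready = True


def wait(handle):
    # Timing synchronization only: the handle stays pending.
    return handle
```

Routing `__bool__` through `data` means `if handle:` fails as loudly as `handle.data` does, which matches the rule that value-based branching on compute results is an error in Phase 1.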

 #### Phase 1 Materialization — Future Extension

-향후 소형 연산(scalar, 작은 reduction)에 대해 Phase 1 eager execution이
-필요한 경우, `materialized_in_phase1: bool` 플래그를 op record에 추가하여
-선택적 materialization을 지원할 수 있다. 현재 범위에서는 구현하지 않는다.
+If Phase 1 eager execution becomes necessary for small operations
+(scalar, small reduction) in the future, selective materialization can be supported
+by adding a `materialized_in_phase1: bool` flag to the op record.
+This is not implemented in the current scope.

-### D4. data_op 플래그 — 메시지 자기 선언
+### D4. data_op Flag — Message Self-Declaration

-로깅 대상은 메시지 타입이 아니라 메시지 인스턴스의 `data_op` 속성으로 결정한다.
-프레임워크가 메시지 타입을 하드코딩하지 않는다.
+The logging target is determined by the `data_op` attribute on the message instance,
+not by message type. The framework does not hardcode message types.

 ```python
 class MsgBase:
-    data_op: bool = False       # 기본: 로깅 안 함
+    data_op: bool = False       # default: no logging

 class DmaReadCmd(MsgBase):
-    data_op = True              # 메모리 이동 → 로깅
+    data_op = True              # memory transfer → logging

 class GemmCmd(MsgBase):
-    data_op = True              # 연산 → 로깅
+    data_op = True              # compute → logging

 class MathCmd(MsgBase):
-    data_op = True              # 연산 → 로깅
+    data_op = True              # compute → logging
 ```

-새 메시지 타입(예: IpcqMsg) 추가 시 `data_op = True`만 설정하면
-프레임워크 코드 수정 없이 자동 로깅된다.
+When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
+enables automatic logging without modifying framework code.
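A rough sketch of how a base-class hook might honor the flag is shown here. The hook signature, the tuple-shaped log entry, and `CreditMsg` are assumptions for illustration; only `MsgBase`, `data_op`, `DmaReadCmd`, `IpcqMsg`, and the hook name come from the text.

```python
class MsgBase:
    data_op: bool = False            # default: not logged


class DmaReadCmd(MsgBase):
    data_op = True


class IpcqMsg(MsgBase):
    data_op = True                   # a new type opts in with one attribute


class CreditMsg(MsgBase):
    pass                             # hypothetical control message, keeps the default


class ComponentBase:
    def __init__(self, component_id, op_log):
        self.component_id = component_id
        self.op_log = op_log

    def _on_process_end(self, msg, t_start, t_end):
        # The framework inspects the instance attribute, never the concrete type.
        if getattr(msg, "data_op", False):
            self.op_log.append((t_start, t_end, self.component_id, type(msg).__name__))
```

Because the check uses `getattr` on the instance, a message type can also flip the flag per instance if only some of its uses move data.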

-### D5. Op Log 구조
+### D5. Op Log Structure

-#### op 분류 체계
+#### Op Classification Scheme

-2단계로 분류한다:
+A two-level classification is used:

-| 레벨 | 필드 | 역할 |
-|------|------|------|
-| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch 기준 |
-| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` 등 | 구체 연산 식별 |
+| Level | Field | Role |
+|-------|-------|------|
+| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
+| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |

-#### OpRecord 정의
+#### OpRecord Definition

 ```python
@dataclass
 class OpRecord:
-    t_start: float              # SimPy 시각 (ns) — service 시작
-    t_end: float                # SimPy 시각 (ns) — service 완료
+    t_start: float              # SimPy time (ns) — service start
+    t_end: float                # SimPy time (ns) — service completion
    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
    op_kind: str                # "memory" | "gemm" | "math"
-    op_name: str                # 구체 연산명
-    params: dict                # 연산별 파라미터 (아래 참조)
-    dependency_ids: list[int]   # 현재는 in-memory record index 기반, 향후 stable op_id로 대체 가능
+    op_name: str                # specific operation name
+    params: dict                # per-operation parameters (see below)
+    dependency_ids: list[int]   # currently based on in-memory record index, may be replaced with stable op_id in the future
 ```

-#### dependency_ids 생성 규칙
+#### dependency_ids Generation Rules

-`dependency_ids`는 **optional**이며, 기본적으로 executor는
-주소 기반 dependency 추론을 수행한다 (D6 참조).
+`dependency_ids` is **optional**, and by default the executor performs
+address-based dependency inference (see D6).

-정확한 실행 순서가 필요한 경우에만 명시적으로 설정한다:
- **기본 (address-based inference)**: executor가 read/write set을 분석하여
-  RAW/WAW/WAR 의존성을 자동 추론. 대부분의 경우 이것으로 충분.
- **명시적 설정**: TLContext 또는 command 생성 단계에서 logical dependency가
-  주소로 표현되지 않는 경우에 설정.
-  예: completion handle 기반 동기화 — handle dependency는 메모리 주소가 아니라
-  논리적 완료 순서에 의존하므로 address inference로 잡히지 않는다.
+Explicit setting is only needed when precise execution ordering is required:
+- **Default (address-based inference)**: the executor analyzes read/write sets to
+  automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
+- **Explicit setting**: set at the TLContext or command generation stage when a logical
+  dependency cannot be expressed via addresses.
+  Example: completion handle-based synchronization — handle dependencies depend on
+  logical completion order rather than memory addresses, so they cannot be captured
+  by address inference.

-#### op_log ordering
+#### op_log Ordering

-op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
-동일 `t_start`의 record들은 insertion order를 보존한다.
+The op_log maintains **stable ordering** based on `t_start`.
+Records with the same `t_start` preserve insertion order.
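One simple way to obtain this ordering, assuming records are collected in insertion order and sorted afterwards, is to rely on Python's stable sort. `Rec` is a trimmed, hypothetical stand-in for `OpRecord`.

```python
from dataclasses import dataclass


@dataclass
class Rec:
    t_start: float
    name: str


def stable_by_start(records):
    # sorted() is stable: records with equal t_start keep their insertion order.
    return sorted(records, key=lambda r: r.t_start)
```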

-#### params 상세
+#### params Details

 **memory (dma_read / dma_write)**:
 ```python
 {
-    "src_addr": int,            # source 주소 (byte)
-    "dst_addr": int,            # destination 주소 (byte)
-    "nbytes": int,              # 전송 크기
+    "src_addr": int,            # source address (byte)
+    "dst_addr": int,            # destination address (byte)
+    "nbytes": int,              # transfer size
    "src_space": str,           # "hbm" | "tcm" | "sram"
    "dst_space": str,           # "hbm" | "tcm" | "sram"
 }
@@ -298,9 +300,9 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
 **gemm**:
 ```python
 {
-    "src_a_addr": int,          # operand A 주소
-    "src_b_addr": int,          # operand B 주소
-    "dst_addr": int,            # output 주소
+    "src_a_addr": int,          # operand A address
+    "src_b_addr": int,          # operand B address
+    "dst_addr": int,            # output address
    "shape_a": tuple,           # e.g. (128, 256)
    "shape_b": tuple,           # e.g. (256, 128)
    "shape_out": tuple,         # e.g. (128, 128)
@@ -312,7 +314,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
    "layout_a": str,            # "row_major" | "col_major"
    "layout_b": str,
    "layout_out": str,
-    "addr_space": str,          # "tcm" (GEMM operand는 항상 TCM)
+    "addr_space": str,          # "tcm" (GEMM operands are always in TCM)
 }
 ```

@@ -320,7 +322,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
 ```python
 {
    "op": str,                  # "exp" | "add" | "sum" | "where" | ...
-    "input_addrs": list[int],   # operand 주소 목록
+    "input_addrs": list[int],   # list of operand addresses
    "input_shapes": list[tuple],
    "dst_addr": int,
    "shape_out": tuple,
@@ -332,12 +334,12 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.

 ### D6. Phase 2 Executor

-Phase 2는 SimPy 밖에서 op_log를 실행한다.
+Phase 2 executes the op_log outside of SimPy.

 ```python
 class DataExecutor:
    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
-        self.store = initial_store  # Phase 1의 MemoryStore snapshot을 입력으로 받는다
+        self.store = initial_store  # Takes the Phase 1 MemoryStore snapshot as input

    def run(self):
        for t, ops in groupby(op_log, key=lambda o: o.t_start):
@@ -347,30 +349,30 @@ class DataExecutor:
            self._execute_sequential(sequential)
 ```

-**병렬 실행 판정**:
+**Parallel execution determination**:

-같은 `t_start`의 op들은 **병렬 후보**로 간주한다.
-실제 병렬 실행 여부는 executor가 다음 기준으로 판정한다:
- read/write 주소 범위 겹침 여부 (WAW, RAW, WAR 충돌 검사)
- `dependency_ids`에 명시된 선행 op 완료 여부
+Ops with the same `t_start` are considered **parallel candidates**.
+The executor determines actual parallel execution based on the following criteria:
+- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
+- Whether predecessor ops specified in `dependency_ids` have completed

-주소 범위가 겹치지 않고 명시적 의존성이 없는 op들만 병렬 실행한다.
+Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
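The address-range part of this check can be sketched as follows. `Range`, `OpAccess`, and the function names are hypothetical, and the `dependency_ids` check is left out to keep the example short.

```python
from itertools import combinations
from typing import NamedTuple


class Range(NamedTuple):
    space: str      # "hbm" | "tcm" | "sram"
    start: int      # byte address
    nbytes: int


class OpAccess(NamedTuple):
    reads: tuple    # tuple of Range
    writes: tuple   # tuple of Range


def overlaps(a, b):
    # Half-open intervals in the same address space.
    return (a.space == b.space
            and a.start < b.start + b.nbytes
            and b.start < a.start + a.nbytes)


def conflicts(x, y):
    # RAW / WAR: one op writes a range the other reads.
    raw_war = (any(overlaps(w, r) for w in x.writes for r in y.reads)
               or any(overlaps(w, r) for w in y.writes for r in x.reads))
    # WAW: both ops write an overlapping range.
    waw = any(overlaps(a, b) for a in x.writes for b in y.writes)
    return raw_war or waw


def can_run_in_parallel(ops):
    # Every pair must be conflict-free.
    return not any(conflicts(x, y) for x, y in combinations(ops, 2))
```

Read-read overlap is deliberately not a conflict, so several ops may share an input operand and still be grouped together.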

-**배치 최적화**: 동일 op_name이며 **shape, dtype, layout, transpose flag가
-모두 동일한** 독립 op들만 batching 대상이 된다.
-예: 여러 PE의 동일 shape GEMM → `np.matmul(a_batch, b_batch)` 한 번으로 묶음.
-CPU에서도 BLAS 효율 향상, GPU에서는 launch overhead 절감.
+**Batch optimization**: Only independent ops with the same op_name **and identical
+shape, dtype, layout, and transpose flags** are eligible for batching.
+Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
+This improves BLAS efficiency on CPU and reduces launch overhead on GPU.
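As a small illustration of the batching idea, independent GEMMs with identical shapes and dtypes can be stacked along a leading axis and multiplied in one call. `batched_gemm` is a hypothetical helper name, and the sketch assumes the eligibility checks have already passed.

```python
import numpy as np


def batched_gemm(a_list, b_list):
    # All operands are assumed to share shape, dtype, and layout.
    a_batch = np.stack(a_list)            # (n, M, K)
    b_batch = np.stack(b_list)            # (n, K, N)
    # np.matmul treats the leading axis as a batch dimension.
    return np.matmul(a_batch, b_batch)    # (n, M, N)
```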

-**Phase 2 실행 순서 보장**:
+**Phase 2 execution order guarantee**:

-Phase 2는 데이터 도착 시점을 고려하지 않으며,
-dependency (주소 기반 추론 + 명시적 dependency_ids)를 통해서만
-실행 순서를 보장한다.
+Phase 2 does not consider data arrival timing,
+and guarantees execution order solely through
+dependencies (address-based inference + explicit dependency_ids).

 ### D7. Memory Store

-`MemoryStore`는 논리적으로 byte-addressable semantics를 따르며,
-현재 구현은 **tensor-granular storage** (addr → numpy ndarray 매핑)를 사용한다.
+`MemoryStore` logically follows byte-addressable semantics,
+and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).

 ```python
 class MemoryStore:
@@ -378,139 +380,140 @@ class MemoryStore:
    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
 ```

-**내부 저장 포맷: numpy ndarray**
+**Internal storage format: numpy ndarray**

-MemoryStore는 텐서를 **numpy ndarray**로 저장한다.
+MemoryStore stores tensors as **numpy ndarrays**.

-| 후보 | store/load 속도 | Phase 2 연산 | 판정 |
-|------|----------------|-------------|------|
-| **numpy ndarray** | 즉시 (참조 전달, 복사 없음) | `np.matmul` 바로 사용 | **채택** |
-| bytearray | memcpy 필요 | `np.frombuffer` 변환 필요 | 탈락 |
-| torch tensor | 즉시 | torch 연산 가능 | GPU 최적화 시만 사용 |
+| Candidate | store/load speed | Phase 2 compute | Verdict |
+|-----------|-----------------|-----------------|---------|
+| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
+| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
+| torch tensor | Immediate | torch operations available | Use only for GPU optimization |

- write: numpy array를 **참조 저장** (복사 없음) → Phase 1 오버헤드 = dict lookup 1회
- read: numpy array를 **참조 반환** (복사 없음)
- 동일 addr에 재 write 시 기존 array를 **tensor 단위로 덮어쓴다** (partial overwrite 미지원)
- dtype은 numpy native 사용 (`np.float16`, `np.float32`, `np.bfloat16` 등)
- byte-level access가 필요한 경우 `.view(np.uint8)` 로 변환
- Phase 2에서 GPU batch 최적화 시 numpy → torch tensor 변환은 executor가 담당
+- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
+- read: **returns numpy array by reference** (no copy)
+- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
+- dtype uses NumPy dtypes (`np.float16`, `np.float32`, etc.); NumPy has no built-in `np.bfloat16`, so bf16 needs an extension dtype (e.g. from `ml_dtypes`)
+- For byte-level access, convert via `.view(np.uint8)`
+- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility

 **read/write contract**:

- read/write는 **contiguous tensor** 기준이다.
-  non-contiguous stride view가 필요한 경우 별도 copy op으로 표현한다.
- 일반 benchmark path에서는 producer/consumer dtype 일치를 기대한다.
-  reinterpret cast는 low-level memory validation 또는 특수 테스트 케이스를 위한
-  permissive behavior이다.
- addr은 byte-aligned이며, 최소 alignment = dtype 크기.
- dtype mismatch (write와 다른 dtype으로 read)는 reinterpret cast로 처리한다.
-  shape 불일치 시 nbytes 기준으로 검증하고, 불일치하면 error.
- 정합성 기준은 주소 범위 기반 read/write semantics를 따른다.
- 구현 최적화로 tensor object cache를 둘 수 있지만,
-  canonical state는 byte-addressable storage이다.
- deploy 시점에 호스트가 초기 텐서 데이터를 주입한다.
+- read/write operates on a **contiguous tensor** basis.
+  If non-contiguous stride views are needed, express them as separate copy ops.
+- In the normal benchmark path, producer/consumer dtype match is expected.
+  Reinterpret cast is a permissive behavior for low-level memory validation
+  or special test cases.
+- addr is byte-aligned, with minimum alignment = dtype size.
+- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
+  When the requested shape differs from the stored one, the total nbytes is compared; differing nbytes raises an error.
+- Correctness criteria follow address-range-based read/write semantics.
+- A tensor object cache may be used as an implementation optimization,
+  but the canonical state is byte-addressable storage.
+- At deploy time, the host injects initial tensor data.
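The storage rules and the read/write contract can be sketched together. This is an illustrative toy, not the real class: the `write` signature is an assumption (only `read` appears in the ADR), and dtype names are plain NumPy strings such as `"float32"`.

```python
import numpy as np


class MemoryStore:
    def __init__(self):
        self._tensors = {}                      # (space, addr) -> ndarray

    def write(self, space, addr, data):
        # Stored by reference; a re-write replaces the whole tensor.
        self._tensors[(space, addr)] = data

    def read(self, space, addr, shape, dtype):
        arr = self._tensors[(space, addr)]
        want = np.dtype(dtype)
        nbytes = int(np.prod(shape)) * want.itemsize
        if nbytes != arr.nbytes:
            raise ValueError("requested nbytes does not match the stored tensor")
        if arr.dtype == want and arr.shape == tuple(shape):
            return arr                          # returned by reference, no copy
        # dtype or shape differs but nbytes matches: reinterpret the same bytes.
        return arr.reshape(-1).view(want).reshape(shape)
```

Flattening before `view` keeps the reinterpret cast valid for contiguous tensors of any rank, which matches the contiguous-tensor assumption in the contract.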

-### D8. 벤치마크 커널 코드
+### D8. Benchmark Kernel Code

-벤치마크의 **사용자 코드 API는 변경하지 않는다**.
-`tl.load()`, `tl.composite()`, `tl.store()` 등의 호출 인터페이스는 유지.
+The benchmark's **user code API is not changed**.
+The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.

-단, 내부 command/message schema는 Phase 2 실행에 필요한 metadata를
-포함하도록 확장될 수 있다 (예: dtype_acc, transpose 등 추가 필드).
+However, internal command/message schemas may be extended to include metadata
+required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).

-### D9. 컴포넌트 변경 없음
+### D9. No Component Changes

-개별 컴포넌트 구현(PE_GEMM, PE_DMA, HBM_CTRL 등)은 수정하지 않는다.
-op_log 기록은 ComponentBase hook의 책임이다.
-커스텀 컴포넌트 교체 시 타이밍 모델만 교체되며,
-Phase 2 데이터 실행은 영향받지 않는다.
+Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
+Op log recording is the responsibility of the ComponentBase hook.
+When custom components are replaced, only the timing model changes,
+and Phase 2 data execution is unaffected.

-### D10. Phase 2는 Optional
+### D10. Phase 2 is Optional

 ```python
 engine = GraphEngine(graph)
-engine.run(benchmark)                       # Phase 1: 타이밍만
+engine.run(benchmark)                       # Phase 1: timing only
 result = engine.get_timing_result()

 if verify_data:
-    executor = DataExecutor(engine.op_log)  # Phase 2: 데이터
+    executor = DataExecutor(engine.op_log)  # Phase 2: data
    executor.run()
    executor.verify(expected_output)
 ```

-타이밍 분석만 필요하면 Phase 2를 건너뛴다.
-op_logger를 비활성화하면 Phase 1 성능도 기존과 동일.
+If only timing analysis is needed, Phase 2 is skipped.
+If the op_logger is disabled, Phase 1 performance matches the pre-change baseline.

 ### D11. Verification Contract

-기본 검증은 **최종 output tensor**를 reference backend(numpy)와 비교한다.
+Basic verification **compares the final output tensor** against a reference backend (numpy).

-dtype별 tolerance 정책:
+Per-dtype tolerance policy:

-| dtype | 비교 방식 | tolerance |
+| dtype | Comparison method | Tolerance |
 |-------|----------|-----------|
 | f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
 | f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
 | bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
-| int 계열 | `np.array_equal` | exact |
+| int types | `np.array_equal` | exact |

- 기본 모드: 최종 output만 비교 (end-to-end correctness)
- 디버그 모드: intermediate tensor도 op 단위로 비교 가능
+- Default mode: compare final output only (end-to-end correctness)
+- Debug mode: can compare intermediate tensors on a per-op basis
  (MemoryStore snapshot at each op boundary)
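The tolerance table can be encoded directly as a lookup. The `verify` helper and the string dtype keys are assumptions for illustration; the tolerance values are the ones listed in the table.

```python
import numpy as np

# (rtol, atol) per floating-point dtype, taken from the table above.
TOLERANCES = {
    "f32": (1e-5, 1e-5),
    "f16": (1e-3, 1e-3),
    "bf16": (1e-2, 1e-2),
}


def verify(actual, expected, dtype):
    if dtype in TOLERANCES:
        rtol, atol = TOLERANCES[dtype]
        return bool(np.allclose(actual, expected, rtol=rtol, atol=atol))
    # Integer dtypes must match exactly.
    return bool(np.array_equal(actual, expected))
```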

 ---

 ## Non-goals

- **Compute-result-based control flow**: 지원하지 않는다.
-  모든 compute handle은 Phase 1에서 pending 상태이며,
-  `wait()`는 timing synchronization만 표현하고 data readiness를 의미하지 않는다.
-  Phase 1에서 `handle.data` 접근, element access, truth-value evaluation은
-  **error로 처리**한다.
-  메모리 데이터 기반 분기(`tl.load()` 결과)는 greenlet으로 지원된다.
-  Phase 1 materialization은 future extension (D3 참조).
- **Cycle-accurate overlap reconstruction**: Phase 2에서 Phase 1의 실행 시간
-  overlap을 정확히 재현하지 않는다. Phase 2는 데이터 정합성만 검증한다.
- **GPU kernel compilation**: Phase 2의 GEMM/Math는 numpy/torch 호출이며,
-  실제 하드웨어 PE의 마이크로아키텍처를 재현하지 않는다.
+- **Compute-result-based control flow**: not supported.
+  All compute handles are in pending state during Phase 1, and
+  `wait()` expresses timing synchronization only; it does not imply data readiness.
+  Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
+  is **treated as an error**.
+  Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
+  Phase 1 materialization is a future extension (see D3).
+- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
+  the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
+- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
+  and do not reproduce the actual hardware PE microarchitecture.

 ## Open Questions

- **Aliasing / slice view**: 동일 backing storage를 참조하는 slice/view를
-  MemoryStore에서 어떻게 표현할지 (stride-based view vs copy semantics)
- **IPCQ/descriptor read 일반화**: PE-to-PE 통신을 memory op으로 완전히
-  일반화할지, 별도 op_kind를 둘지
- **Op log streaming**: 대규모 시뮬레이션에서 op_log 메모리 사용량 관리
+- **Aliasing / slice view**: How to represent slice/views referencing the same
+  backing storage in MemoryStore (stride-based view vs copy semantics)
+- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
+  communication as memory ops or introduce a separate op_kind
+- **Op log streaming**: Managing op_log memory usage in large-scale simulations
  (in-memory list vs disk-backed streaming)
- **Fused operation**: tl.composite의 tiled pipeline (READ→COMPUTE→WRITE)을
-  하나의 fused op record로 기록할지, 개별 op으로 분리할지
- **Math op schema 일반화**: 현재 math params는 단순 구조이나,
-  broadcasting rule, input별 dtype, keepdims, scalar/immediate operand,
-  where/mask 표현 등 일반화가 필요할 수 있음
- **Op record 식별자**: 현재 dependency_ids는 in-memory list index 기반이며,
-  streaming/disk-backed mode 도입 시 stable op_id로 대체 필요
- **Phase 1 materialization policy**: D3의 Future Extension 참조.
-  허용 시 해당 op의 Phase 2 처리 방식 (skip / verify / recompute) 정의 필요
+- **Fused operation**: Whether to record tl.composite's tiled pipeline
+  (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
+- **Math op schema generalization**: The current math params have a simple structure,
+  but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
+  scalar/immediate operands, where/mask expressions, etc.
+- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
+  replacement with stable op_id is needed when introducing streaming/disk-backed mode
+- **Phase 1 materialization policy**: See Future Extension in D3.
+  If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
+  needs to be defined

 ---

 ## Consequences

-### 긍정적
+### Positive

- SimPy 시뮬레이션 성능 영향 최소 (op_log append만 추가)
- Phase 2에서 멀티스레드/GPU 자유롭게 사용 가능
- 컴포넌트 교체 자유도 유지 (ADR-0015 설계 철학 보존)
- 벤치마크 사용자 코드 API 변경 불필요
- 새 메시지 타입 추가 시 data_op 플래그만 설정
- greenlet으로 Phase 0 제거 — 메모리 데이터 기반 dynamic control flow 지원
- `tl.load()`가 실제 데이터를 반환하므로 커널 디버깅 용이
+- Minimal impact on SimPy simulation performance (only op_log append added)
+- Phase 2 is free to use multi-threading and GPU
+- Component replaceability preserved (ADR-0015 design philosophy maintained)
+- No changes needed to benchmark user code API
+- Adding a new message type requires only setting the data_op flag
+- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
+- `tl.load()` returns actual data, making kernel debugging easier

-### 부정적
+### Negative

- op_log 메모리 사용량 (대규모 시뮬레이션 시)
- Phase 2 실행 시간은 텐서 크기에 비례 (대형 GEMM)
- pending handle (연산 미완료) 기반 동적 분기 불가
-  (연산은 Phase 2에서 실행, Phase 1에서 결과 값 미확정).
-  메모리 데이터 기반 분기는 greenlet으로 지원된다.
- greenlet C 확장 의존성 추가 (pip install greenlet)
+- op_log memory usage (for large-scale simulations)
+- Phase 2 execution time is proportional to tensor size (large GEMM)
+- Dynamic branching based on pending handles (incomplete computations) not possible
+  (computations execute in Phase 2, result values are undetermined in Phase 1).
+  Memory-data-based branching is supported via greenlet.
+- greenlet C extension dependency added (pip install greenlet)
@@ -1,882 +0,0 @@
-# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication
-
-## Status
-
-Accepted
-
-## Context
-
-### Goal
-
-Add the infrastructure that lets CCL (Collective Communication Library)
-kernels run **inside** a PE. The host just launches a kernel on each
-SIP; the actual synchronization and data movement happen **inside the
-PE kernel via an IPCQ (Inter-Process Communication Queue)**.
-
-This mirrors how NCCL performs NVLink communication inside a GPU
-kernel, or how Cerebras / Tenstorrent expose core-local communication
-queues. Host-level collectives (`dist.all_reduce`) are deferred to
-**future work**; this ADR focuses solely on the kernel-side collective
-infrastructure.
-
-### Problems to solve
-
-1. PE-to-PE direct data movement (writing into a peer's memory).
-2. Synchronization — the sender must check that the receiver has space
-   in its buffer (backpressure).
-3. Resource contention between compute traffic and communication
-   traffic (Head-of-Line blocking).
-4. The host must be able to construct logical neighbor topologies
-   (ring / mesh / tree) per algorithm.
-
---
-
-## Decision
-
-### D1. Add a new `PE_IPCQ` component
-
-A new component `PE_IPCQ` is added inside each PE. It follows the same
-pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a
-distinct component.
-
-```
-PE
-├── PE_CPU
-├── PE_SCHEDULER
-├── PE_DMA
-├── PE_IPCQ          ← new
-├── PE_FETCH_STORE
-├── PE_GEMM
-├── PE_MATH
-├── PE_TCM
-├── PE_MMU
-```
-
-**Role separation** (control plane vs. data plane):
-
- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head /
-  tail pointer management, peer pointer caches, backpressure, 4-direction
-  neighbor mapping.
- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe
-  / PCIE into the peer's memory.
-
-PE_IPCQ does **not** move data itself — it delegates to PE_DMA.
-
-### D2. Ring buffer model
-
-Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.
-
-```python
-@dataclass
-class IpcqQueuePair:
-    direction: Direction          # N/S/E/W
-    peer: IpcqEndpoint            # set by host at init time (D2.5)
-    tx_buffer_base: int           # outgoing data base addr (in our memory)
-    rx_buffer_base: int           # incoming data base addr (in our memory)
-    slot_size: int                # 1 tile per slot
-    n_slots: int                  # ring depth
-    my_head: int                  # next slot we will write/send into
-    my_tail: int                  # next slot we will read/recv from
-    peer_head_cache: int          # peer's last-seen head (updated via D9 piggyback)
-    peer_tail_cache: int          # peer's last-seen tail (updated via D9 fast-path credit)
-```
-
-**Canonical field names**: throughout this ADR the four names above
-(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used
-consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`,
-etc.) are not used.
-
-| Field | Owner | Updated when |
-|-------|-------|--------------|
-| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
-| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
-| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
-| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |
-
-**Slot unit**: fixed-size, one slot holds one full tile (no descriptor
-indirection). Full data embedded in the slot. See D5.
-
-### D2.5. `IpcqEndpoint` schema
-
-`IpcqQueuePair.peer` carries everything the sender needs to compute the
-peer's rx slot address:
-
-```python
-@dataclass(frozen=True)
-class IpcqEndpoint:
-    sip: int
-    cube: int
-    pe: int
-    buffer_kind: str             # "tcm" | "hbm" | "sram"
-    rx_base_pa: int              # peer rx_buffer base PA (PhysAddr.encode())
-    rx_base_va: int              # peer rx_buffer base VA (optional, MMU mode)
-    n_slots: int                 # peer ring depth (for wrap-around)
-    slot_size: int               # peer slot size (for offset)
-```
-
-Address computation:
-
-```python
-slot_idx = self.my_head % peer.n_slots
-dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
-```
-
-PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA
-(vc_comm) routes the data to `dst_pa` through the fabric.
-
-**Endpoint construction order**: at backend init (D10), the IPCQ
-buffers for **every PE** are allocated first (so each rank knows the
-others' PA), then the per-rank neighbor tables are built and pushed to
-PE_IPCQ via `IpcqInitMsg`.
-
-### D3. Four-direction mapping ≡ logical ProcessGroup
-
-The PE views four directions (N/S/E/W) as logical ports. Real peer
-addresses are configured by the host CCL init, per the chosen
-algorithm. The PE kernel never knows the topology, only directions.
-
-```python
-# 1D ring
-for rank in range(world_size):
-    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
-    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])
-
-# 2D mesh
-for r in range(R):
-    for c in range(C):
-        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
-        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
-        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
-        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
-```
-
-The PE code does not need to know where `tl.send(dir="E", ...)` actually
-ends up.
-
-### D4. PE kernel API
-
-```python
-# Send (blocking; may stall on backpressure)
-tl.send(dir: str, src=TensorHandle)
-tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
-
-# Recv (blocking)
-recv = tl.recv(dir: str, shape=..., dtype=...)
-recv = tl.recv(shape=..., dtype=...)        # round-robin across 4 directions
-
-# Recv (non-blocking)
-fut  = tl.recv_async(dir: str, shape=..., dtype=...)
-recv = tl.wait(fut)
-```
-
-`tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each
-call rotates through directions, returning the first available slot.
-If all four directions are empty, it waits.
-
-**Fairness is weak**: the rotating start mitigates simple bias, but if
-one direction always wins the race the others can starve. Algorithms
-that need strict fairness must call `tl.recv(dir=...)` explicitly.
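
The rotating-cursor selection can be sketched as below. The scan order (`N, S, E, W`) and the choice to start one position after the cursor are assumptions made for illustration; `pick_direction` is a hypothetical helper, not the PE_IPCQ implementation.

```python
DIRECTIONS = ("N", "S", "E", "W")


def pick_direction(last_polled, ready):
    """Return the first ready direction after `last_polled`, or None if all are empty.

    `ready` is the set of directions whose rx ring currently holds data.
    """
    start = (DIRECTIONS.index(last_polled) + 1) % len(DIRECTIONS)
    for offset in range(len(DIRECTIONS)):
        candidate = DIRECTIONS[(start + offset) % len(DIRECTIONS)]
        if candidate in ready:
            return candidate
    return None  # caller waits (poll or sleep, per D7)


# E and W are both ready; the cursor decides which one is served.
after_n = pick_direction("N", {"E", "W"})   # scan order S, E, W, N
after_e = pick_direction("E", {"E", "W"})   # scan order W, N, S, E
```

The rotation only changes where the scan starts. A direction that is ready on every call can still win repeatedly, which is why the fairness guarantee is described as weak.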
-
-### D5. Single-hop DMA write + full-data slot model
-
-Data moves from sender memory into the receiver's ring slot in **one
-DMA transfer**. Key properties:
-
- **Single-hop**: the sender already knows the peer rx slot address and
-  fires one fabric DMA into it.
- **No CPU memcpy**: the CPU never copies data.
- **No intermediate staging**: neither side keeps a separate staging
-  buffer (sender uses the source addr directly; receiver gets the data
-  in its ring slot directly).
-
-(Strictly speaking the fabric DMA write does happen, so this is not
-literally "no data movement" — it's the same property NCCL labels
-"zero-copy", meaning no CPU memcpy and no staging copy.)
-
-```
-PE A: tl.send(E, src_addr, nbytes)
-  1. IPCQ computes the peer rx slot address:
-       dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
-  2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
-                   (full → sleep / poll)
-  3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
-  4. my_head += 1
-
-PE B: data = tl.recv(W)
-  1. Look at rx_buffer[my_tail % n_slots]
-  2. Wait for the data to arrive (D7 backpressure mode)
-  3. Return the slot address to the kernel (or fetch into register file)
-  4. my_tail += 1
-  5. Issue a credit return on the fast path (D9): after the credit-path
-     latency, peer A's peer_tail_cache is updated.
-```
-
-The slot holds the full tile. The receiver only reads its own
-rx_buffer; it never reads back into A's memory. The sender knows the
-peer rx slot address and DMAs directly into it (single-hop).
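
The pointer bookkeeping in the flow above, stripped of all timing, can be modeled as follows. This is a toy sketch: `RingPointers` is a hypothetical class that folds both ends of one link into a single object, and its `recv` does not check that data has arrived.

```python
class RingPointers:
    """Toy model of one sender/receiver pair's ring pointers (no timing)."""

    def __init__(self, n_slots):
        self.n_slots = n_slots
        self.my_head = 0          # sender: number of sends issued
        self.peer_tail_cache = 0  # sender: last tail learned via credit return
        self.my_tail = 0          # receiver: number of slots consumed

    def can_send(self):
        # Free slots remain while in-flight sends are fewer than ring depth.
        return self.my_head - self.peer_tail_cache < self.n_slots

    def send(self):
        if not self.can_send():
            return False          # backpressure: caller must poll or sleep
        self.my_head += 1
        return True

    def recv(self):
        self.my_tail += 1         # receiver frees one slot

    def credit_arrives(self):
        self.peer_tail_cache = self.my_tail  # sender learns about freed slots


ring = RingPointers(n_slots=2)
results = [ring.send(), ring.send(), ring.send()]  # third send hits a full ring
ring.recv()
blocked_before_credit = ring.send()   # slot is free, but the sender does not know yet
ring.credit_arrives()
allowed_after_credit = ring.send()
```

The gap between `recv` and `credit_arrives` is the window that the credit-return latency in D9 models: the slot is physically free, yet the sender stays blocked until its cached tail is refreshed.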
-
-The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local
-to the PE).
-
-### D6. Buffer placement — three-way benchmark
-
-The host CCL init picks the IPCQ ring-buffer location:
-
-```python
-ipcq_init(
-    backend="ahbm",
-    buffer_kind="tcm" | "hbm" | "sram",
-    n_slots=8,
-    slot_size=4096,
-)
-```
-
-| Location | Trait | Trade-off |
-|----------|-------|-----------|
-| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
-| **PE-local HBM** | Large; via DMA | Higher latency |
-| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |
-
-All three locations run the same kernel code; only the init differs.
-
-### D7. Backpressure — two-mode benchmark
-
-How the sender or receiver waits when peer slots are full / data not
-yet arrived:
-
-| Mode | Behavior | Model |
-|------|----------|-------|
-| **poll** | Periodically re-check the cached peer pointer | Spin loop |
-| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |
-
-```python
-ipcq_init(backpressure="poll" | "sleep", ...)
-```
-
-Both modes are implemented so latency / throughput trade-offs can be
-benchmarked.
-
-### D8. PE_DMA virtual channels
-
-Extend PE_DMA from a single queue into a **two-channel virtual-channel**
-model.
-
-```
-PE_DMA
-├── vc_compute: tile load / store / writeback for GEMM and Math
-└── vc_comm:    IPCQ send data
-```
-
-Each VC has an independent state machine:
-
- One channel stalling does not block the other.
- The same physical link (cube_noc, UCIe, …) is shared, but link BW is
-  split between channels.
-
-**Chunk-level interleave**:
-
- Large GEMM tile DMAs do not lock the link end-to-end.
- Progress happens in chunks (e.g. 256 B); each chunk shares link BW
-  with the other VC's pending chunks.
- Chunk size is an init parameter (smaller = fairer, larger = more
-  efficient).
-
-Net effect:
-
- HoL blocking is eliminated (an IPCQ send can interleave with a long
-  compute DMA).
- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM
-  pattern).
- Matches the NoC-virtual-channel pattern used in real HW.
-
-**First-implementation accuracy limit (intentional)**: this ADR's
-first cut uses **deterministic chunk-level interleave + weighted
-round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`).
-This is a first-order approximation and is simpler than real HW
-dynamic-contention / credit-based arbiters. Functional correctness is
-unaffected, but heavy-contention scenarios may report slightly
-optimistic latency vs. real HW. A separate ADR can add a NoC arbiter
-component later if more precision is needed.
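
A deterministic weighted round-robin over chunks might look like the sketch below. It is an illustration of the arbitration idea only: the `interleave` function, its grant order (`vc_compute` first), and the integer weights are assumptions, and no link timing is modeled.

```python
def interleave(compute_bytes, comm_bytes, chunk=256, weights=(1, 1)):
    """Return the order in which chunks from two VCs occupy the link.

    Each round grants `weights[0]` chunks to vc_compute, then `weights[1]`
    chunks to vc_comm; a VC with nothing pending is skipped.
    """
    if min(weights) <= 0:
        raise ValueError("weights must be positive")
    pending = {"vc_compute": compute_bytes, "vc_comm": comm_bytes}
    grants = {"vc_compute": weights[0], "vc_comm": weights[1]}
    schedule = []
    while any(pending.values()):
        for vc in ("vc_compute", "vc_comm"):
            for _ in range(grants[vc]):
                if pending[vc] == 0:
                    break
                sent = min(chunk, pending[vc])
                pending[vc] -= sent
                schedule.append((vc, sent))
    return schedule


# A 1 KiB compute tile and a 512 B IPCQ send sharing the link 50/50.
order = [vc for vc, _ in interleave(1024, 512)]
```

With equal weights the 512 B send completes after four chunk slots instead of waiting behind the whole 1 KiB tile, which is the head-of-line relief described above.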
-
-#### Token routing
-
- Compute tokens (`TileToken`) — go through the existing
-  PE_FETCH_STORE → PE_DMA chain.
- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA
-  self-routing.
- PE_DMA picks the channel by token type.
-
-```python
-class PeDmaComponent:
-    def _process(self, env, token):
-        if isinstance(token, IpcqDmaToken):
-            yield from self._vc_comm_process(env, token)
-        else:
-            yield from self._vc_compute_process(env, token)
-```
-
-### D9. Pointer synchronization — DMA payload piggyback
-
-Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so
-pointers update along with the data. This simulation adopts the same
-model: **no separate control channel** — metadata travels with the
-data.
-
-The big benefits:
-
- **Automatic ordering**: data and metadata move on the same token, so
-  data is visible no later than the `peer_head_cache` update. No race.
- **HW fidelity**: matches NVLink / UCIe piggybacked headers.
- **Component simplification**: no separate `IpcqPtrUpdate` event type.
-
-#### Send flow (head update via piggyback)
-
-```
-PE A: tl.send(E, src_addr, nbytes)
-  1. PE_IPCQ checks backpressure (using peer_tail_cache)
-  2. PE_IPCQ creates an IpcqDmaToken:
-       - data body (src_addr → peer dst_addr)
-       - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
-  3. Hand the token to PE_DMA(vc_comm)
-  4. PE A increments my_head (send tracking)
-
-[fabric DMA: latency elapses]
-
-PE B's PE_DMA receives the token
-  5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
-  6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)
-
-PE B's PE_IPCQ receives the metadata
-  7. Updates peer_head_cache (= A's head)
-  8. Wakes any pending recv on that direction
-```
-
-**Steps 5 and 6 must execute in the same SimPy step** — DMA completion
-makes data and metadata atomically visible.
-
-#### Recv flow (credit return — fast path with bottleneck-BW latency)
-
-When the receiver frees a slot, the sender must learn about it
-(backpressure release). Unlike data, the credit return does **not**
-travel through general vc_comm fabric — it uses a **separate fast
-path**, an abstraction of the NVLink / UCIe credit-return wire.
-
-**Latency** is computed from the **full path latency** (per-node
-overhead + edge propagation + drain), not a magic constant:
-
-```
-credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
-path = router.find_path(self_pe, peer_pe.pe_dma)
-latency = compute_path_latency_ns(path, credit_size_bytes)
-        = sum(edge.distance_mm * ns_per_mm)
-        + sum(node_overhead_ns[n] for n in path)
-        + credit_size_bytes / bottleneck_bw_on_path
-```
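
As a worked instance of the formula above, the sketch below plugs in made-up numbers. `credit_latency_ns` is a hypothetical free function with flattened arguments (it is not the signature of `compute_path_latency_ns`), and every distance, overhead, and bandwidth value is illustrative only.

```python
def credit_latency_ns(edges_mm, node_overheads_ns, bottleneck_bw_gbps,
                      payload_bytes, ns_per_mm):
    """Sum propagation, per-node overhead, and drain time for one payload.

    1 GB/s equals 1 byte/ns, so bytes / (GB/s) is already in nanoseconds.
    """
    propagation = sum(d * ns_per_mm for d in edges_mm)
    overhead = sum(node_overheads_ns)
    drain = payload_bytes / bottleneck_bw_gbps
    return propagation + overhead + drain


# Illustrative numbers only: two 5 mm hops, three nodes at 2 ns each,
# a 32 GB/s bottleneck link, and the default 16-byte credit.
latency = credit_latency_ns(
    edges_mm=[5.0, 5.0],
    node_overheads_ns=[2.0, 2.0, 2.0],
    bottleneck_bw_gbps=32.0,
    payload_bytes=16,
    ns_per_mm=0.01,
)
```

With these inputs the per-node overhead (6 ns) dominates the drain term (0.5 ns), which suggests that for a 16-byte credit the hop count matters far more than link bandwidth.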
-
-The router auto-appends `.pe_dma` to the source only, so the
-destination MUST be spelled with the explicit `.pe_dma` suffix.
-Otherwise `find_path` raises, and the credit silently teleports at
-zero cost (latent bug fixed alongside this update).
-
-`tl.recv` blocks on the credit-emit completion (recv yields-from
-`_delayed_credit_send` rather than spawning it as a fork). This puts
-the credit-return cost on the receiver's `pe_exec_ns`, modeling the
-IPCQ control-plane completing the consume-acknowledgement before
-recv returns to the kernel — the protocol equivalent of a non-posted
-`tl.store` waiting for an HBM ack on the raw DMA path.
-
-That gives us:
-
- **Topology-proportional approximation**: an in-cube credit return is
-  automatically faster than a cross-SIP credit return.
- **No magic constants**: every nanosecond comes from
-  `compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
-  as data traffic.
- **No deadlock risk**: unlike piggyback, B can issue credit even when
-  it has no data to send back. `peer_credit_store.put` is unbounded.
- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
-  cost on recv balances the HBM ack-trip cost RAW pays on the sender.
-
-#### Component coupling — SimPy Store channel
-
-PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init
-time, **a SimPy Store is wired between the two** (a per-direction
-fast-path channel) and credit metadata is `put` into that store.
-
-```python
-class PeIpcqComponent:
-    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
-        yield env.timeout(latency_ns)
-        yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...))
-```
-
-Backend init wires both directions of the fast-path channel as part of
-fan-out (see `IpcqInitMsg` in D12).
-
-#### Credit-return fast path limitations
-
- `credit_size_bytes` is an estimate (typically 16–64 bytes).
- The fast path is **excluded from vc_comm BW contention** (separate
-  wire). Real HW credit-return wires are very lightweight, so this is a
-  reasonable first approximation.
- A follow-up ADR can: model the credit fast path as a separate link
-  (BW limit + contention), or switch to piggyback (`credit_return_mode:
-  piggyback`).
-
-#### PE_DMA's added responsibility
-
-When `vc_comm` receives a token, PE_DMA processes it as the following
-sequence: pay the Transaction's terminal BW drain, then atomically
-write data and forward metadata. **No SimPy yield is allowed between
-the data write and the metadata forward** (invariant I6). The drain
-yield must sit before the atomic block, not inside it:
-
-```python
-def _on_vc_comm_recv(self, env, txn):
-    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
-    # sender PE_DMA). MUST happen before the atomic block so recv only
-    # wakes after the bytes have "landed".
-    drain = getattr(txn, "drain_ns", 0.0)
-    if drain > 0:
-        yield env.timeout(drain)
-
-    token = txn.request
-    # ── ATOMIC: no yield between these two operations ──
-    # 1. Copy the payload from the source into the receiver's rx slot
-    data = self._memory_store.read(token.src_space, token.src_addr,
-                                   shape=..., dtype=...)
-    self._memory_store.write(token.dst_endpoint.buffer_kind,
-                             token.dst_addr, data)
-    # 2. Forward metadata to the local PE_IPCQ
-    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
-    # ───────────────────────────────────────────────────
-```
-
-The final `put` is a yield point, but it targets an unbounded internal
-store, so it completes in the same step. That `put` is the closing call
-of the atomic block; no yielding statement may be inserted between the
-data write and it.
-
-#### Drain-at-inbound semantics (D9 timing model)
-
-The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
-stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
-is paid at each forwarding component via `run()`, and the remaining
-BW drain is paid once at the Transaction's terminal. Every non-IPCQ
-Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
-`ComponentBase._forward_txn` at the terminal node. For IPCQ the
-destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
-(so IPCQ-specific data write + metadata forward can happen), so **the
-drain MUST be paid explicitly at the top of that handler** to keep
-IPCQ's timing model on par with every other fabric Transaction.
-
-Side-effects of paying drain here:
-
- **SRC `tl.send`** is unchanged — fire-and-forget semantics are
-  preserved because the sender PE_DMA does not `yield sub_done`. The
-  `sub_done.succeed()` call (made after metadata forward below) is an
-  event with no listener on the sender side.
- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
-  when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
-  forward now happens after the drain, recv observes the full fabric
-  transfer time including bandwidth cost.
-
-This matches the physical picture: send dispatches and leaves; recv
-waits until the bytes have actually been drained into its inbox.
-
-### D9.5. ADR-0020 (2-pass) integration
-
-`tl.send` / `tl.recv` integrate with ADR-0020's two-pass model. Phase
-1 simulates timing **and** moves data via MemoryStore; Phase 2 enables
-op-log-based correctness verification.
-
-#### Phase 1 (timing + data)
-
-D9 models head and tail updates with two different mechanisms:
-
- **Send-side (head update)** — DMA payload piggyback. Data write and
-  metadata forward happen in the same SimPy step → automatic atomic
-  visibility.
- **Recv-side (tail credit return)** — fast-path SimPy Store channel
-  with bottleneck-BW latency, then `peer_tail_cache` update.
-
-Together they preserve ring-buffer pointer consistency.
-
-The op-log records `op_kind="ipcq"` entries for sends (with
-`src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with
-`recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).
-Two recv modes:
-
- **`return_slot`** (default): the slot address is returned to the
-  kernel. Zero-copy.
- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`,
-  PE_IPCQ copies the slot data into the user dst.
-
-#### Phase 2 (op_log replay)
-
-When `DataExecutor` encounters an `op_kind="ipcq"` record:
-
- **send**: idempotent `src → dst` ndarray write.
- **recv (`return_slot`)**: no-op (the slot already holds the data).
- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.
-
-IPCQ ops are pure data movement — Phase 2 has nothing extra to compute.
-The downstream GEMM / Math ops in `DataExecutor` will consume the data
-and naturally validate correctness.
-
-### D10. Host CCL init keeps the PyTorch shape
-
-The host code looks just like real PyTorch DDP. `init_process_group`
-creates the backend object; it does **not** receive IPCQ knobs
-(neighbor topology, buffer_kind, backpressure …).
-
-```python
-# benches/ccl_allreduce.py — same shape as real PyTorch
-def worker(rank, world_size, torch):
-    dist = torch.distributed
-    dist.init_process_group(backend="ahbm")  # reads ccl.yaml + topology
-    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
-    tensor.copy_(torch.from_numpy(init))
-    dist.all_reduce(tensor, op="sum")
-```
-
-The IPCQ configuration is decided by the backend at
-`init_process_group` time: it loads `ccl.yaml`, picks the algorithm,
-and pushes IPCQ neighbor tables to every participating PE_IPCQ. The
-host code never has to know about IPCQ.
-
-A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`.
-Switching algorithms is purely a `ccl.yaml` change — no host edits
-required.
-
-#### Init flow (eager)
-
-1. `init_process_group(backend="ahbm")` is called.
-2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
-3. Pulls topology + buffer_kind + backpressure + slot config from
-   `algorithms[<algo>]`.
-4. **Immediately** installs neighbor tables on every PE_IPCQ
-   (sideband or fabric `IpcqInitMsg`).
-5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally —
-   PE_IPCQ is already prepared whether the kernel is a CCL kernel or
-   not.
-
-### D11. CCL config file (`ccl.yaml`)
-
-IPCQ config and algorithm metadata live in a separate YAML file,
-following the same pattern as `components.yaml` and `topology.yaml`.
-
-A single benchmark execution runs one algorithm
-(`defaults.algorithm`). Switching algorithms means editing
-`defaults.algorithm` only.
-
-```yaml
-defaults:
-  algorithm: ring_allreduce_tcm
-  buffer_kind: tcm                # tcm | hbm | sram
-  backpressure: sleep             # poll | sleep
-  n_slots: 8
-  slot_size: 4096
-  vc_chunk_size: 256
-  ipcq_credit_size_bytes: 16
-
-algorithms:
-  ring_allreduce_tcm:
-    module: kernbench.ccl.algorithms.ring_allreduce
-    topology: ring_1d             # builtin name or "custom"
-    buffer_kind: tcm
-    n_elem: 8                     # optional, per-algorithm tile width
-
-  tree_allreduce_7:
-    module: kernbench.ccl.algorithms.tree_allreduce
-    topology: tree_binary
-    buffer_kind: tcm
-    world_size: 7                 # algorithm-level override
-    n_elem: 16
-
-  custom_mesh:
-    module: kernbench.ccl.algorithms.custom_mesh
-    topology: custom              # the module supplies its own neighbors()
-```
-
-`world_size` is **not set in `defaults`** in the file above. The backend
-resolves it in this precedence order:
-`algorithm-level override > defaults override > topology spec`. The
-last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP,
-where `WORLD_SIZE` comes from env vars rather than config files.
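
The precedence chain can be sketched as a small resolver. This is an assumption-laden illustration: `resolve_world_size` is a hypothetical free function, the config is passed as an already-parsed dict, and the topology keys (`sips`, `cubes_per_sip`, `pes_per_cube`) are taken from the fallback expression rather than from a real schema.

```python
def resolve_world_size(ccl_cfg, topology):
    """Apply: algorithm-level override > defaults override > topology spec."""
    defaults = ccl_cfg.get("defaults", {})
    algo = ccl_cfg.get("algorithms", {}).get(defaults.get("algorithm"), {})
    if "world_size" in algo:
        return int(algo["world_size"])
    if "world_size" in defaults:
        return int(defaults["world_size"])
    # Last fallback: one rank per PE in the topology.
    return topology["sips"] * topology["cubes_per_sip"] * topology["pes_per_cube"]


topology = {"sips": 2, "cubes_per_sip": 2, "pes_per_cube": 4}
algorithms = {
    "ring_allreduce_tcm": {"topology": "ring_1d"},
    "tree_allreduce_7": {"topology": "tree_binary", "world_size": 7},
}

from_override = resolve_world_size(
    {"defaults": {"algorithm": "tree_allreduce_7"}, "algorithms": algorithms},
    topology,
)
from_topology = resolve_world_size(
    {"defaults": {"algorithm": "ring_allreduce_tcm"}, "algorithms": algorithms},
    topology,
)
```

Here `tree_allreduce_7` pins the group to seven ranks, while `ring_allreduce_tcm` falls through to the topology product.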
-
-#### Algorithm module structure
-
-Each algorithm module exports two hooks — `kernel` (required) and
-`neighbors` (optional) — plus a `kernel_args` helper that the
-backend uses to populate positional kernel arguments at `all_reduce`
-time:
-
-```python
-# src/kernbench/ccl/algorithms/ring_allreduce.py
-
-def kernel_args(world_size: int, n_elem: int) -> tuple:
-    return (n_elem, world_size)
-
-
-def kernel(t_ptr, n_elem, world_size, tl):
-    """Required — the PE kernel.
-
-    IPCQ is already installed by the backend before this is called.
-    The kernel only uses the four-direction send / recv API.
-    """
-    ...
-
-
-def neighbors(rank, world_size, neighbor_map):
-    """Optional — override the builtin topology's neighbor map.
-
-    Returns a new dict, the modified-in-place dict, or None to keep the
-    builtin map.
-    """
-    return None
-```
-
-#### `neighbors` override patterns
-
- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
- **Pattern B — replace entirely**: ignore `neighbor_map` and return a
-  brand-new dict.
- **Pattern C — keep builtin**: omit `neighbors` or return None.
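
Pattern A might look like the sketch below, which turns the builtin bidirectional ring into an open chain. It assumes `neighbor_map` is the calling rank's own `direction -> peer` dict; the exact shape the backend passes in is not specified here, so treat that as an assumption.

```python
def neighbors(rank, world_size, neighbor_map):
    """Pattern A: turn the builtin bidirectional ring into an open chain."""
    tweaked = dict(neighbor_map)          # copy so the builtin map is untouched
    if rank == 0:
        tweaked.pop("W", None)            # first rank has no west peer
    if rank == world_size - 1:
        tweaked.pop("E", None)            # last rank has no east peer
    return tweaked


builtin = {"E": 1, "W": 3}                # rank 0's ports on a 4-rank ring_1d
chain_ports = neighbors(0, 4, builtin)
```

Returning a copy rather than mutating in place keeps the builtin map reusable, although the hook contract above also permits in-place modification.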
-
-#### Builtin topologies
-
-| topology | direction set |
-|----------|---------------|
-| `ring_1d` | E, W |
-| `ring_1d_unidir` | E only |
-| `mesh_2d` | N, S, E, W |
-| `tree_binary` | parent, child_left, child_right |
-| `none` | (empty) — algorithm must supply `neighbors()` |
-
-#### Adding a new algorithm
-
-1. Write `kernel` and `kernel_args` in
-   `src/kernbench/ccl/algorithms/<algo>.py`.
-2. Add an entry in `ccl.yaml`'s `algorithms` section.
-3. (Optional) provide `neighbors()` for custom topology.
-4. Set `defaults.algorithm` to the new algorithm.
-
-The host bench (`benches/ccl_allreduce.py`) does not change.
-
-### D12. Message / token schema
-
-The new message types added by this ADR. They live in
-`src/kernbench/common/pe_commands.py` and
-`src/kernbench/runtime_api/kernel.py`.
-
-#### `IpcqInitMsg` (sideband, fan-out at init)
-
-The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors
-`MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`).
-Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`,
-`my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store`
-field: a `simpy.Store` instance pre-wired so the PE_IPCQ that frees a
-slot (the data receiver) can push `IpcqCreditMetadata` directly into
-the data sender's credit input queue.
-
-#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)
-
-Carries `direction`, source addr/space, nbytes, shape, dtype, and a
-handle id. `data_op=True` so it lands in the op_log.
-
-#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)
-
-Carries `direction` (or None for round-robin), `recv_mode`
-(`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape,
-dtype, blocking flag.
-
-#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)
-
-Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`)
-plus the head metadata (`sender_seq`, `src_sip/cube/pe`,
-`src_direction`). PE_DMA picks the channel by token type
-(`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`).
-
-The receiver's PE_DMA, on token arrival, performs the I6 atomic
-sequence: write data into MemoryStore, then forward `IpcqMetaArrival`
-to the local PE_IPCQ.
-
-#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)
-
-Carries `consumer_seq` (= my_tail), source PE coords, and source
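-direction. Travels through the dedicated SimPy Store channel rather
-than `vc_comm`. Latency is the full path latency from D9
-(`compute_path_latency_ns(path, credit_size_bytes)`), not the drain
-term `credit_size_bytes / bottleneck_bw_on_path` alone.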
-
-There is **no `IpcqPtrUpdate` event** — head updates flow via D9
-piggyback, tail updates via the D9 fast-path channel.
-
-### D13. Test strategy
-
-Test plan:
-
-#### T1. Unit tests (component-level)
-
- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure
-  immediately forwards a token; full peer slot triggers backpressure
-  (poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`;
-  round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`.
- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute`
-  / `vc_comm` independent progress, chunk interleave, BW split.
- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d /
-  mesh_2d / tree_binary correctness, mesh_2d non-square →
-  `ValueError`, custom resolver returns the module's `neighbors`.
-
-#### T2. Integration tests (E2E send/recv)
-
- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional
-  no-deadlock), 4×4 mesh.
- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode
-  records `ipcq` ops in op_log; DataExecutor produces correct
-  `out.data`.
-
-#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)
-
-`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA
-consistency, per-`buffer_kind` allocation.
-
-#### T4. Regression
-
-All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for
-non-CCL benches.
-
-#### T5. Performance / overhead
-
-Single send/recv pair latency = (DMA latency) + (IPCQ overhead). It
-should be close to a regular PE_DMA write of the same nbytes (IPCQ
-overhead < 100 ns).
-
-### D14. Invariants and failure modes
-
-#### Invariants
-
-I1. **Slot lifecycle exactly-once**: one send → exactly one recv.
-I2. **Pointer monotonicity**: `my_head` / `my_tail` are monotonically
-   non-decreasing; `sender_seq` is strictly increasing.
-I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank
-   B, then rank B's reverse-direction peer must be rank A. Verified at
-   init.
-I4. **`buffer_kind` consistency**: all PEs in a process group share
-   the same `buffer_kind` (no mixed mode in the first cut).
-I5. **op_log ordering**: send → DMA complete → recv possible. The
-   t_start order in op_log respects this causality.
-I6. **Atomic data + metadata visibility (MUST)**: at the receiver
-   side, data write (`MemoryStore.write`) and metadata forward
-   (`peer_head_cache` update) **must execute in the same SimPy step**.
-   No yield is allowed between the two operations in PE_DMA's vc_comm
-   handler. Code review must reject any inserted `yield` (or `yield
-   from`): it would create a race where `peer_head_cache` becomes
-   visible in a different step from the data.
-I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6,
-   the step in which `peer_head_cache > my_tail` becomes true is the
-   same step in which the slot data is observable.
-
-#### Failure modes (runtime errors)
-
-F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction
-   → `IpcqInvalidDirection`, simulation aborts.
-F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched
-   send and recv. Not validated by default; opt-in strict mode catches
-   it (`strict_validation: true` on a PE_IPCQ node attrs).
-F3. **Deadlock detection (timeout-based)**: the simulator empties its
-   schedule while a send/recv is still pending → engine raises
-   `IpcqDeadlock` and embeds a pointer dump.
-F4. **Backend init failure**: missing `defaults.algorithm`, missing
-   `algorithms[name]`, module import failure, topology validation
-   failure (I3, I4) — all raised at `init_process_group` time.
-F5. **Slot full + infinite backpressure**: the peer never recvs.
-   Surfaces as F3 timeout.
-
-#### Diagnostics
-
- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as
-  `(rank, t, dir, nbytes)`.
- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)`
-  prints every PE_IPCQ ring buffer's `my_head`, `my_tail`,
-  `peer_head_cache`, `peer_tail_cache`.
- **Deadlock dump**: on hang the engine includes the pointer dump in
-  the `IpcqDeadlock` exception message.
-
-### D15. Algorithm-author cheat sheet
-
-Full step-by-step lives in
-[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
-shortest version:
-
-| Things you touch | Things you don't |
-|------------------|-------------------|
-| `src/kernbench/ccl/algorithms/<your_algo>.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
-| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
-| (Optional) `tests/test_<your_algo>.py` mock test | PE_IPCQ component, AhbmCCLBackend |
-
-5-step flow: write the kernel → register in `ccl.yaml` → optional
-`neighbors` override → optional mock unit test → SimPy validation via
-`kernbench run --bench ccl_allreduce --verify-data`.
-
-Common mistakes: using a direction that wasn't installed, sends
-without matching recvs (deadlock), dtype/shape disagreement, assuming
-fairness from `tl.recv()` round-robin, confusing
-`tl.num_programs(axis)` with the CCL group size.
-
---
-
-## Non-goals
-
- **Host collective**: a model where `dist.all_reduce` itself moves
-  data on the host side is out of scope. This ADR only covers
-  communication that happens inside the PE kernel.
- **All-reduce algorithms**: ring / tree / etc. live in algorithm
-  modules and can be added without amending this ADR.
- **Reliability / error handling**: link faults, send/recv failure
-  recovery, etc. are out of scope.
- **NoC arbiter precision**: dynamic VC contention is left for a future
-  ADR (see D8).
-
---
-
-## Open questions
-
- **VC arbitration accuracy** — the first cut uses deterministic
-  chunk interleave + weighted round-robin; heavy contention may report
-  optimistic latency. A NoC arbiter component can be added later.
- **Credit return BW model** — the fast path is currently outside the
-  fabric BW contention model. Can be modeled as a separate link or
-  switched to piggyback (`credit_return_mode: piggyback`).
- **Ring buffer slot allocation metadata** — whether the host pushes
-  IPCQ buffer metadata via sideband or via a fabric message similar to
-  `MmuMapMsg` is open.
- **VC BW split default** — 50/50 vs. weighted (e.g. 80/20). Exposed in
-  `ccl.yaml`; default value TBD.
- **Direction count** — 4 (N/S/E/W) is fixed in the first cut; 6
-  (with Up/Down for 3D) or N (variable) is future work.
- **Multi-tile aggregation primitives** — whether
-  `tl.recv_all` or similar is needed for fan-in.
- **Round-robin recv fairness** — current weak fairness can starve;
-  strict fairness counter is future work.
- **Deadlock detection precision** — currently timeout-based; a
-  realtime wait-for graph would enable deterministic detection.
-
---
-
-## Consequences
-
-### Positive
-
- PE-to-PE direct communication enables CCL kernels to be written.
- Host stays minimal (just `launch`); synchronization happens inside
-  the PE → strong compute / comm overlap.
- VCs eliminate HoL blocking → collective latency is not blocked by
-  compute traffic.
- Buffer placement and backpressure mode are init-time parameters →
-  easy to benchmark.
- Four-direction logical neighbors → host is free to map
-  ring/mesh/tree algorithms.
-
-### Negative
-
- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE
-  (8 × 4096 B × 8 = 256 KiB with the D11 defaults).
- VC arbitration is a first-order approximation; heavy contention
-  scenarios may report slightly optimistic latency vs real HW (D8).
- Chunk-level interleave makes PE_DMA implementation more complex.
@@ -6,43 +6,46 @@ Accepted

 ## Context

-### 목표
+### Goal

-`torch.distributed` collective 호출의 참여 단위(rank)를 **SIP**(device)
-경계에 맞춘다. 실제 PyTorch DDP/TP 스크립트와 **호스트 레벨에서 구분 없이**
-읽히는 bench 코드를 목표로 한다.
+Align the participation unit (rank) of `torch.distributed` collective calls
+to the **SIP** (device) boundary. The aim is bench code that, at the host
+level, reads **indistinguishably** from real PyTorch DDP/TP scripts.

-real PyTorch와 비교:
+Comparison with real PyTorch:

-| 차원 | real PyTorch | KernBench |
+| Dimension | real PyTorch | KernBench |
 | --- | --- | --- |
-| 프로세스 모델 | N개 프로세스, 각 1 GPU | 1 프로세스, N greenlet, 각 1 SIP |
-| `get_rank()` | `RANK` env var | greenlet-local 레지스트리 |
-| `get_world_size()` | `WORLD_SIZE` env var | topology의 SIP 수 |
+| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
+| `get_rank()` | `RANK` env var | greenlet-local registry |
+| `get_world_size()` | `WORLD_SIZE` env var | SIP count from topology |
 | `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
-| `mp.spawn` | OS 프로세스 fork | greenlet fan-out |
+| `mp.spawn` | OS process fork | greenlet fan-out |

-### 풀어야 할 문제
+### Problems to solve

-1. **공개 API에서 rank = SIP** — bench worker가 PE 개념을 알지 않도록.
-2. **Greenlet-local rank/device tracking** — 1-프로세스 모델 안에서 각
-   worker greenlet이 자기 rank / 자기 SIP를 정확히 식별.
-3. **Tensor placement = structural (sip, cube, pe)** — rank가 SIP이면
-   기본 텐서 배치도 구조적 좌표로 표현되어야 함.
+1. **Public API where rank = SIP** — so bench workers do not have to know
+   about the PE concept.
+2. **Greenlet-local rank/device tracking** — within the 1-process model,
+   each worker greenlet must correctly identify its own rank / its own SIP.
+3. **Tensor placement = structural (sip, cube, pe)** — if rank is SIP,
+   the default tensor placement should also be expressed in structural
+   coordinates.

-### Non-problem (이 ADR 밖)
+### Non-problem (outside this ADR)

 - IPCQ direction addressing → ADR-0025
- `DPPolicy.sip`/`num_sips` 제거 → ADR-0026
+- Removing `DPPolicy.sip`/`num_sips` → ADR-0026
 - Megatron-style TP → ADR-0027
 - DTensor → ADR-0028 (future)
 - Worker scheduling / `mp.spawn` / collective drain / exception cleanup
  → ADR-0027 D0/D1
- Collective algorithm 구현 (intercube_allreduce, SFR config) → ADR-0032
+- Collective algorithm implementation (intercube_allreduce, SFR config)
+  → ADR-0032

 ## Decision

-### D1. rank = SIP (world_size 해석)
+### D1. rank = SIP (world_size resolution)

 ```python
 def _resolve_world_size(self) -> int:
@@ -55,8 +58,8 @@ def _resolve_world_size(self) -> int:
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
 ```

-우선순위: 알고리즘 override > defaults override > SIP count. `ccl.yaml`
-override는 legacy "rank = PE" 테스트 경로로 유지.
+Priority order: algorithm override > defaults override > SIP count. The
+`ccl.yaml` override is retained as the legacy "rank = PE" test path.

 ### D2. Greenlet-local rank registry (+ debug warning)

@@ -83,11 +86,11 @@ class DistributedContext:
        return int(self._rank_by_greenlet[g])
 ```

-### D3. `torch.ahbm.set_device(rank)` — SIP 바인딩
+### D3. `torch.ahbm.set_device(rank)` — SIP binding

-KernBench 백엔드 이름은 `ahbm` (ADR-0023). Real PyTorch는
-`torch.cuda.set_device(r)`이지만 우리는 CUDA가 아니므로 honestly-named
-namespace를 사용한다.
+The KernBench backend name is `ahbm` (ADR-0023). Real PyTorch uses
+`torch.cuda.set_device(r)`, but since we are not CUDA we use an
+honestly-named namespace.

 ```python
 class _AhbmNamespace:
@@ -113,10 +116,12 @@ class _AhbmNamespace:
 # Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
 ```

-**PyTorch 2.x style 병행 지원**: 최신 PyTorch는 device-agnostic한
-`torch.accelerator` 네임스페이스를 지향 (`torch.accelerator.set_device_index(r)`,
-`torch.accelerator.current_device_index()`). Device vendor에 종속되지 않는
-코드를 쓰려는 사용자를 위해 KernBench도 이 표면을 병행 지원한다.
+**PyTorch 2.x style also supported**: Recent PyTorch is moving toward a
+device-agnostic `torch.accelerator` namespace
+(`torch.accelerator.set_device_index(r)`,
+`torch.accelerator.current_device_index()`). For users who want to write
+code that is not tied to a specific device vendor, KernBench exposes this
+surface alongside `torch.ahbm`.

 ```python
 class _AcceleratorNamespace:
@@ -141,23 +146,23 @@ self.ahbm = _AhbmNamespace()
 self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias
 ```

-Bench 작성자는 다음 중 하나를 선택 — 둘 다 내부적으로 같은 레지스트리를 보유:
+Bench authors may choose either — both share the same registry internally:

 ```python
 torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
 torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
 ```

-### D4. Tensor placement = structural (sip, cube, pe) 좌표
+### D4. Tensor placement = structural (sip, cube, pe) coordinates

-`resolve_dp_policy`가 `target_sip`을 직접 받아 구조적 좌표로 placement 생성.
-세부는 ADR-0026.
+`resolve_dp_policy` takes `target_sip` directly and produces placement in
+structural coordinates. Details in ADR-0026.

 ```python
 # RuntimeContext._create_tensor
 current_sip = self.ahbm.current_device()          # (D3 naming)
 if current_sip is None:
-    current_sip = 0  # single-driver fallback (D2와 일관)
+    current_sip = 0  # single-driver fallback (consistent with D2)
 placement = resolve_dp_policy(
    dp, shape=shape_2d, itemsize=itemsize,
    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
@@ -165,29 +170,29 @@ placement = resolve_dp_policy(
 )
 ```

-Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적
-좌표를 직접 보유. ShardSpec 상세는 ADR-0026.
+No post-hoc `pe_index` shifting — ShardSpec carries the `(sip, cube, pe)`
+structural coordinates directly. ShardSpec details in ADR-0026.

 ---

 ## Dependencies

- **ADR-0023** (IPCQ): backend `ahbm` namespace의 기원.
- **ADR-0026** (DPPolicy intra-device): D4의 `resolve_dp_policy` 시그니처와
-  ShardSpec의 구조적 좌표 표현.
- **ADR-0027** (Megatron TP + scheduler): worker scheduling, `mp.spawn`,
-  collective drain, exception cleanup의 구현 기준.
+- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
+- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature
+  used by D4 and the structural-coordinate representation of ShardSpec.
+- **ADR-0027** (Megatron TP + scheduler): the implementation baseline for
+  worker scheduling, `mp.spawn`, collective drain, and exception cleanup.

 ---

 ## Non-goals

- **IPCQ protocol 수정**: ADR-0023 유지.
- **DPPolicy 필드 정리**: ADR-0026.
+- **Modifying the IPCQ protocol**: ADR-0023 remains as-is.
+- **Cleaning up DPPolicy fields**: ADR-0026.
 - **Megatron-style TP**: ADR-0027.
 - **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
- **Collective algorithm 구현**: ADR-0032.
- **Multi-node (프로세스 간)**: 단일 프로세스.
+- **Collective algorithm implementation**: ADR-0032.
+- **Multi-node (cross-process)**: single process only.

 ---

@@ -195,12 +200,14 @@ Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적

 ### Positive

- **Bench = real PyTorch DDP** (공개 API 관점).
- **Greenlet-local rank**: 1-프로세스 모델에서 cross-rank correctness 가능.
- **Structural placement 좌표**: ADR-0026 / ADR-0027 / ADR-0032의 다른 ADR이
-  `(sip, cube, pe)` 3튜플 위에서 일관되게 동작.
+- **Bench = real PyTorch DDP** (from the public-API point of view).
+- **Greenlet-local rank**: enables cross-rank correctness within the
+  1-process model.
+- **Structural placement coordinates**: lets the other ADRs (ADR-0026 /
+  ADR-0027 / ADR-0032) operate consistently on top of the `(sip, cube, pe)`
+  3-tuple.

 ### Neutral

- IPCQ PE-level protocol (ADR-0023) 불변.
- IO_CPU 역할 불변 (기존 transit 그대로).
+- IPCQ PE-level protocol (ADR-0023) is unchanged.
+- IO_CPU role is unchanged (existing transit behavior preserved).
@@ -6,51 +6,58 @@ Accepted (Revision 2 — Address-based matching; peer_direction field dropped)

 ## Context

-### 목표
+### Goal

-ADR-0023의 IPCQ protocol에서 **"어느 direction pair를 통한 전송인가"의 식별**을
-topology / dict-order에 의존하지 않고 **주소 기반**으로 일관되게 한다.
-2-rank bidirectional ring (또는 여러 direction이 동일 peer를 가리키는
-topology 일반)에서 정확히 동작하도록 한다.
+In the IPCQ protocol of ADR-0023, make the **identification of "which
+direction pair this transfer belongs to"** consistent and **address-based**,
+without depending on topology / dict-order. It must work correctly in a
+2-rank bidirectional ring (and more generally in any topology where
+multiple directions point to the same peer).

-### 드러난 버그 — 2-rank bidirectional ring
+### The bug surfaced — 2-rank bidirectional ring

-`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). 양쪽 방향이 같은 peer.
+`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions
+point to the same peer.

-**버그 1 (install)**:
- `reverse_direction(0, 1)` → dict order로 "E" 반환 (틀림, "W"가 맞음 — opposite
-  direction convention)
- rank 0의 E entry가 `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`로 설정
- tl.send(E) → data가 sip1의 E-rx buffer로 landing (should be W-rx)
+**Bug 1 (install)**:
+- `reverse_direction(0, 1)` → returns "E" by dict order (wrong; "W" is the
+  correct answer — opposite-direction convention)
+- rank 0's E entry is set with `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`
+- tl.send(E) → data lands in sip1's E-rx buffer (should be W-rx)

-**버그 2 (runtime)**:
- 설령 install이 올바른 주소로 설정해도, receiver의 `_handle_meta_arrival`이
-  sender 좌표만으로 direction 매칭 → 첫 direction (E) 승
- peer_head_cache[E] 증가, peer_head_cache[W]는 불변
- Kernel의 tl.recv(W)는 peer_head_cache[W] 대기 → 영원히 블록 → IpcqDeadlock
+**Bug 2 (runtime)**:
+- Even if install sets up the correct address, the receiver's
+  `_handle_meta_arrival` matches direction by sender coordinates only → the
+  first direction (E) wins
+- peer_head_cache[E] is incremented; peer_head_cache[W] is unchanged
+- The kernel's tl.recv(W) waits on peer_head_cache[W] → blocks forever →
+  IpcqDeadlock
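The install-time half of the bug can be reproduced in a few lines. This is a sketch under assumptions: `ring_1d` here is a stand-in written to return `{"E": 1, "W": 1}` for rank 0 of a 2-rank ring, and both helper functions (including the extra `world_size` parameter) are hypothetical, not the project's `reverse_direction`.

```python
OPPOSITE = {"E": "W", "W": "E"}


def ring_1d(rank: int, world_size: int) -> dict:
    """Stand-in topology: east/west neighbours on a 1-D ring."""
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def reverse_by_dict_order(my_rank: int, peer_rank: int, world_size: int):
    """Fragile pairing: first peer direction that points back at me."""
    for d, r in ring_1d(peer_rank, world_size).items():
        if r == my_rank:
            return d
    return None


def reverse_opposite_first(my_rank: int, peer_rank: int, my_dir: str, world_size: int):
    """Prefer the opposite direction, then fall back to dict order."""
    peer_nbrs = ring_1d(peer_rank, world_size)
    opp = OPPOSITE.get(my_dir)
    if opp is not None and peer_nbrs.get(opp) == my_rank:
        return opp
    for d, r in peer_nbrs.items():
        if r == my_rank:
            return d
    return None


nbrs = ring_1d(0, 2)
naive = {d: reverse_by_dict_order(0, p, 2) for d, p in nbrs.items()}
fixed = {d: reverse_opposite_first(0, p, d, 2) for d, p in nbrs.items()}
```

In this sketch the dict-order lookup pairs both of rank 0's directions with the peer's "E" slot, while the opposite-first variant pairs E with W and W with E.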

-### 근본 원인
+### Root cause

-두 축에서 동일 문제:
-1. **Install-time pairing**: "내 direction과 peer의 어느 direction이 짝인가"
-   결정이 dict-iteration-order에 의존 → 여러 direction이 같은 peer를 가리킬 때
-   fragile
-2. **Runtime identification**: "어느 qp를 업데이트해야 하는가" 결정이 sender
-   좌표만으로 이루어짐 → direction 중복 시 ambiguous
+The same issue along two axes:
+1. **Install-time pairing**: deciding "which of my directions pairs with
+   which direction of the peer" depends on dict-iteration-order → fragile
+   when multiple directions point to the same peer
+2. **Runtime identification**: deciding "which qp should be updated" is
+   based on sender coordinates alone → ambiguous when directions are
+   duplicated

-### 해결 방향 — address-based matching
+### Solution direction — address-based matching

-각 PE의 rx buffer는 **direction별로 고유한 주소 range**에 위치 (rx_base_pa +
-direction_idx × bytes_per_direction). 따라서:
+Each PE's rx buffer sits at a **unique address range per direction**
+(rx_base_pa + direction_idx × bytes_per_direction). Therefore:

- **Runtime**: sender coord 대신 **dst_addr 범위**로 매칭 → unambiguous
- **Install**: opposite-direction 우선 선택 heuristic (ring / mesh의 자연스러운
-  대칭성)
- `peer_direction` 같은 이중 메타데이터 불필요 — **주소가 single source of
-  truth**
+- **Runtime**: match by **dst_addr range** instead of sender coord →
+  unambiguous
+- **Install**: prefer the opposite direction as a heuristic (the natural
+  symmetry of ring / mesh)
+- No need for redundant metadata like `peer_direction` — **address is the
+  single source of truth**

-이 설계는 **PhysAddr 전환 (ADR-0030)과 독립적**으로 작동. 현재 synthetic
-주소든 PhysAddr든 direction별 range 유일성만 지켜지면 동일하게 적용 가능.
+This design works **independently of the PhysAddr transition (ADR-0030)**.
+Whether the current addresses are synthetic or PhysAddr, the same approach
+applies as long as the per-direction range uniqueness is preserved.
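A minimal sketch of the address arithmetic behind this, assuming a fixed slot order and slot size. `DIRECTIONS`, `BYTES_PER_DIRECTION`, and both helper names are illustrative assumptions; the real layout lives in the install plan.

```python
DIRECTIONS = ("N", "E", "S", "W")  # assumed slot order
BYTES_PER_DIRECTION = 0x1000       # assumed slot size in bytes


def rx_slot_base(rx_base_pa: int, direction: str) -> int:
    """Base address of one direction's rx slot."""
    return rx_base_pa + DIRECTIONS.index(direction) * BYTES_PER_DIRECTION


def direction_for_dst_addr(rx_base_pa: int, dst_addr: int):
    """Recover the direction from the destination address alone."""
    offset = dst_addr - rx_base_pa
    if not 0 <= offset < len(DIRECTIONS) * BYTES_PER_DIRECTION:
        return None  # address is outside this PE's rx region
    return DIRECTIONS[offset // BYTES_PER_DIRECTION]


base = 0x8000_0000
w_addr = rx_slot_base(base, "W") + 0x40  # some byte inside the W slot
```

Because each slot covers a distinct range, the lookup needs no sender coordinates: two directions that share a peer still resolve to different slots.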

 ---

@@ -91,17 +98,17 @@ def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
    return None
 ```

-호출부:
+Call site:

 ```python
 for d, peer_rank in nbrs.items():
-    peer_dir = reverse_direction(r, peer_rank, d)  # my_dir 전달
+    peer_dir = reverse_direction(r, peer_rank, d)  # pass my_dir
    if peer_dir is None:
        continue
    ...
 ```

-### D2. Runtime — `_handle_meta_arrival` dst_addr 매칭
+### D2. Runtime — `_handle_meta_arrival` dst_addr matching

 `src/kernbench/components/builtin/pe_ipcq.py`:

@@ -138,9 +145,10 @@ def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
    # Unknown dst_addr — diagnostic log (should not happen under correct install)
 ```

-Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정.
+The sender-coordinate check is **removed**. `dst_addr` already determines
+the direction.

-### D3. Credit — `dst_rx_base_pa` 필드 추가
+### D3. Credit — add `dst_rx_base_pa` field

 `src/kernbench/common/ipcq_types.py`:

@@ -148,25 +156,26 @@ Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정.
@dataclass(frozen=True)
 class IpcqCreditMetadata:
    consumer_seq: int
-    dst_rx_base_pa: int       # NEW: 원 sender의 peer.rx_base_pa와 매칭용
-    # 기존 필드 (diagnostic / log 용도로 유지)
+    dst_rx_base_pa: int       # NEW: matches the original sender's peer.rx_base_pa
+    # Existing fields (kept for diagnostic / logging purposes)
    src_sip: int
    src_cube: int
    src_pe: int
    src_direction: str
 ```

-Credit 생성 시 (`_delayed_credit_send`): 자기 direction의 `my_rx_base_pa`를
-`dst_rx_base_pa`로 실어 보냄 (이게 상대방이 sender 당시 썼던 `peer.rx_base_pa`).
+When the credit is generated (`_delayed_credit_send`): it carries this
+direction's `my_rx_base_pa` as `dst_rx_base_pa` (this is the
+`peer.rx_base_pa` the other side used when it was the sender).

-수신 측 (`_credit_worker`):
+Receiver side (`_credit_worker`):

 ```python
 def _credit_worker(self, env):
    while True:
        credit = yield self._credit_inbox.get()
        for d, qp in self._queue_pairs.items():
-            # peer의 rx_base_pa와 credit의 dst_rx_base_pa가 일치하는 qp 찾기
+            # Find the qp whose peer rx_base_pa matches the credit's dst_rx_base_pa
            if qp["peer"].rx_base_pa == credit.dst_rx_base_pa:
                qp["peer_tail_cache"] = max(qp["peer_tail_cache"],
                                              credit.consumer_seq)
@@ -178,41 +187,45 @@ def _credit_worker(self, env):
                break
 ```

-Sender 좌표 검사 제거. `dst_rx_base_pa` 매칭으로 unambiguous.
+Sender-coordinate check removed. Matching by `dst_rx_base_pa` is
+unambiguous.

-### D4. `IpcqInitEntry`에 `peer_direction` 필드를 **추가하지 않음**
+### D4. Do **not** add a `peer_direction` field to `IpcqInitEntry`

-ADR-0025 rev 1에서 제안했던 `IpcqInitEntry.peer_direction`은 **불필요**.
-이유:
- Meta arrival은 dst_addr로 매칭 (D2)
- Credit은 dst_rx_base_pa로 매칭 (D3)
- qp에 peer_direction 저장 필요 없음
- Install은 rx_base_pa 계산 시 내부적으로만 peer_dir 사용 (`reverse_direction`)
+The `IpcqInitEntry.peer_direction` proposed in ADR-0025 rev 1 is
+**unnecessary**. Reasons:
+- Meta arrivals are matched by dst_addr (D2)
+- Credits are matched by dst_rx_base_pa (D3)
+- No need to store peer_direction on qp
+- Install only uses peer_dir internally when computing rx_base_pa
+  (`reverse_direction`)

-IpcqInitEntry schema 변경 없음. Rev 1 대비 **단순화**.
+No change to the IpcqInitEntry schema. **Simpler** than rev 1.

-### D5. `IpcqDmaToken.src_direction` 유지 (diagnostic only)
+### D5. Keep `IpcqDmaToken.src_direction` (diagnostic only)

-기존 `src_direction` 필드는 제거하지 않는다. 다음 용도로 유지:
- Logging / trace: `KERNBENCH_CCL_TRACE=1` 출력의 `(rank, t, dir, nbytes)`
- Diagnostics: pointer_dump 등에서 direction 표시
- 미래 확장 여지
+The existing `src_direction` field is not removed. It is retained for:
+- Logging / trace: the `(rank, t, dir, nbytes)` output of
+  `KERNBENCH_CCL_TRACE=1`
+- Diagnostics: showing direction in pointer_dump, etc.
+- Room for future extension

-Runtime matching은 `dst_addr`만 사용.
+Runtime matching uses only `dst_addr`.

-### D6. Invariants (ADR-0023 I3 강화)
+### D6. Invariants (strengthens ADR-0023 I3)

-**I3 (엄격)**: 각 방향 pair `(my_direction, peer_direction)`에 대해 my
-rx_base와 peer rx_base는 **별개의 direction slot**을 가리켜야 함. Install은
-이를 보장해야 한다 (reverse_direction opposite-preference).
+**I3 (strict)**: For each direction pair `(my_direction, peer_direction)`,
+my rx_base and peer rx_base must point to **distinct direction slots**.
+Install must guarantee this (reverse_direction opposite-preference).

-**I3.1 (신규)**: 모든 qp에 대해 `qp["my_rx_base_pa"]`와 `qp["peer"].rx_base_pa`는
-서로 disjoint한 주소 range를 점유한다 (다른 direction의 buffer는 절대 겹치지
-않음). 이것이 D2/D3의 주소-기반 매칭의 전제.
+**I3.1 (new)**: For every qp, `qp["my_rx_base_pa"]` and
+`qp["peer"].rx_base_pa` occupy mutually disjoint address ranges (buffers
+of different directions never overlap). This is the prerequisite for the
+address-based matching of D2/D3.

-Install time에 검증 가능:
+Verifiable at install time:
 ```python
-# ccl/install_plan.py: build_install_plans 끝에 assertion
+# ccl/install_plan.py: assertion at the end of build_install_plans
 all_rx_ranges = set()
 for plan in plans:
    for pe_install in plan.pe_installs:
@@ -228,36 +241,42 @@ for plan in plans:

 ## Dependencies

- **ADR-0023** (IPCQ protocol): 본 ADR은 ADR-0023의 runtime 매칭 로직 수정
-  (D2, D3) + install heuristic 개선 (D1). IPCQ 프로토콜의 semantic layer
-  변경은 없음.
- **ADR-0024** (launcher): 2-rank bidirectional ring이 실제 쓰이는 경우가
-  ADR-0024의 ws=SIP_count 모델. 본 ADR이 그 케이스를 작동시킴.
- **ADR-0030** (PhysAddr transition, stub): **독립적** — ADR-0025의
-  주소-기반 매칭은 현재 synthetic 주소든 PhysAddr이든 동일하게 작동.
+- **ADR-0023** (IPCQ protocol): this ADR modifies ADR-0023's runtime
+  matching logic (D2, D3) and improves the install heuristic (D1). No
+  change to the IPCQ protocol's semantic layer.
+- **ADR-0024** (launcher): a 2-rank bidirectional ring actually arises
+  under the ws=SIP_count model of ADR-0024. This ADR makes that case
+  work.
+- **ADR-0030** (PhysAddr transition, stub): **independent** — ADR-0025's
+  address-based matching works identically whether the current addresses
+  are synthetic or PhysAddr.

 ---

 ## Non-goals

- **IPCQ 주소 체계를 PhysAddr로 전환**: ADR-0030 scope. 본 ADR은 주소가 어떻게
-  인코딩되는가와 무관.
- **Multi-hop routing**: ADR-0023 D5의 single-hop DMA write 전제 유지.
- **Unidir ring 특수화**: `ring_1d_unidir`는 direction 하나만 있으므로 본 버그
-  무관.
+- **Migrating IPCQ addressing to PhysAddr**: ADR-0030 scope. This ADR is
+  agnostic to how addresses are encoded.
+- **Multi-hop routing**: the single-hop DMA write assumption of ADR-0023
+  D5 still holds.
+- **Unidir ring specialization**: `ring_1d_unidir` only has a single
+  direction, so the bug does not apply.

 ---

 ## Open questions

- **주소 매칭 성능**: `_handle_meta_arrival`과 `_credit_worker`가 qp를 선형
-  순회 (max 4 direction). 성능 영향 무시 가능 수준. 문제 시 dict lookup으로
-  전환 가능 (`_qp_by_rx_base`).
- **`IpcqDmaToken.src_direction` 필요성 재평가**: diagnostic 용도로만 남긴
-  필드를 계속 유지할지, 또는 logging 외부로 분리할지. 현재는 유지.
- **Install-time invariant 검증 cost**: D6의 I3.1 검증은 O(N_PE × N_direction)^2.
-  대형 topology에서 느려질 수 있음 → interval tree 등 자료구조로 개선 가능.
-  단순 구현 먼저.
+- **Address-matching performance**: `_handle_meta_arrival` and
+  `_credit_worker` iterate qp linearly (max 4 directions). The performance
+  impact is negligible. If it becomes an issue, this can be switched to a
+  dict lookup (`_qp_by_rx_base`).
+- **Re-evaluating the need for `IpcqDmaToken.src_direction`**: whether to
+  keep this diagnostics-only field or to split it out of logging.
+  Currently retained.
+- **Cost of install-time invariant verification**: the I3.1 verification
+  of D6 is O((N_PE × N_direction)^2). It could be slow on large topologies
+  → improvable via data structures such as interval trees. Simple
+  implementation first.
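For the invariant-verification cost, a sort-and-scan check is one simpler alternative to an interval tree. It is offered here only as a sketch, not as the ADR's chosen implementation; `find_overlap` and the half-open `(start, end)` range representation are assumptions.

```python
def find_overlap(ranges):
    """Return the first overlapping pair of half-open (start, end) ranges, or None.

    Sorting by start address means any overlap shows up between two
    neighbours in the sorted order, so one linear scan is enough:
    O(n log n) overall instead of comparing every pair.
    Ranges are assumed to be non-empty (start < end).
    """
    ordered = sorted(ranges)
    for (s0, e0), (s1, e1) in zip(ordered, ordered[1:]):
        if s1 < e0:
            return (s0, e0), (s1, e1)
    return None


disjoint = [(0, 16), (16, 32), (32, 48)]
clashing = [(0, 16), (32, 48), (8, 24)]
```

A non-`None` result would identify the two rx ranges that violate I3.1, which is also the information a diagnostic message would need.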

 ---

@@ -265,19 +284,26 @@ for plan in plans:

 ### Positive

- **단순함**: `peer_direction` 이중 메타데이터 제거. 주소가 single source of truth.
- **Unambiguous matching**: 모든 topology (direction 중복 포함)에서 동작.
- **Schema 변경 최소**: `IpcqInitEntry` 불변, `IpcqCreditMetadata`에 1 필드 추가.
- **PhysAddr 전환 (ADR-0030) 독립**: 주소-기반 매칭은 주소 인코딩 방식과 무관.
- **Diagnostic 유지**: `IpcqDmaToken.src_direction`은 로깅 용도로 존치.
+- **Simplicity**: redundant `peer_direction` metadata removed. Address is
+  the single source of truth.
+- **Unambiguous matching**: works on every topology (including duplicate
+  directions).
+- **Minimal schema changes**: `IpcqInitEntry` unchanged, one field added
+  to `IpcqCreditMetadata`.
+- **Independent of PhysAddr transition (ADR-0030)**: address-based matching
+  is agnostic to the address encoding.
+- **Diagnostics retained**: `IpcqDmaToken.src_direction` is kept for
+  logging.

 ### Negative

- Runtime 매칭이 주소 비교로 바뀌어서 디버깅 시 "왜 peer_head_cache[E]가 아닌
-  W가 업데이트됐나" 같은 질문에 address range를 추적해야 함 (기존엔 direction
-  이름으로 충분). 해결: pointer_dump에 "direction ↔ rx_base_pa" 매핑 포함.
+- Runtime matching is now by address comparison, so answering debugging
+  questions like "why did peer_head_cache[W] update rather than [E]"
+  requires following the address range (previously the direction name
+  was enough). Mitigation: include a "direction ↔ rx_base_pa" mapping in
+  pointer_dump.

 ### Neutral

- IPCQ protocol의 semantic layer (sender가 dst_addr 계산, receiver가 수신)는
-  불변.
+- The semantic layer of the IPCQ protocol (sender computes dst_addr,
+  receiver receives) is unchanged.
@@ -1,4 +1,4 @@
-# ADR-0026: DPPolicy = Intra-Device Only — sip/num_sips 필드 제거
+# ADR-0026: DPPolicy = Intra-Device Only — remove sip/num_sips fields

 ## Status

@@ -6,16 +6,17 @@ Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)

 ## Context

-### 목표
+### Goal

-`DPPolicy`를 **한 device(SIP) 내부의 cube × PE 분산**만 표현하는 순수한
-intra-device 추상화로 명확화한다. SIP 간 분산(TP)은 별도 레이어로 분리
-(ADR-0024의 `torch.ahbm.set_device(rank)` 또는 ADR-0027의 Megatron parallel
-layers가 담당).
+Clarify `DPPolicy` as a pure intra-device abstraction that only expresses
+**cube × PE distribution within a single device (SIP)**. Inter-SIP
+distribution (TP) is split into a separate layer (handled by ADR-0024's
+`torch.ahbm.set_device(rank)` or by ADR-0027's Megatron-style parallel
+layers).

 ## Decision

-### D1. `DPPolicy`에서 `sip` + `num_sips` 필드 제거
+### D1. Remove `sip` + `num_sips` fields from `DPPolicy`

 ```python
@dataclass(frozen=True)
@@ -32,15 +33,16 @@ class DPPolicy:
    num_cubes: int | None = None
 ```

-제거되는 필드: `sip`, `num_sips`.
+Removed fields: `sip`, `num_sips`.

-### D2. `ShardSpec` — structural (sip, cube, pe) 좌표, `pe_index` 완전 제거
+### D2. `ShardSpec` — structural (sip, cube, pe) coordinates, `pe_index` fully removed

-현재 `ShardSpec.pe_index`는 **global flat index** (`sip × cubes × pes + cube ×
-pes + pe`). 이는 ADR-0024 D4이 "abstraction leakage"로 지적한 형태.
+The current `ShardSpec.pe_index` is a **global flat index**
+(`sip × cubes × pes + cube × pes + pe`). This is the form ADR-0024 D4
+flagged as "abstraction leakage".

-본 ADR에서 ShardSpec을 **structural 좌표로 재정의**하고, `pe_index`는
-property로도 **남기지 않는다**:
+This ADR **redefines ShardSpec in structural coordinates** and **does
+not even leave `pe_index` as a property**:

 ```python
 # src/kernbench/policy/placement/dp.py (after)
@@ -59,28 +61,32 @@ class ShardSpec:
    nbytes: int
 ```

-**핵심 원칙**:
- ShardSpec의 정체성은 `(sip, cube, pe)` 3튜플.
- **`pe_index` property도 없음** — silent semantics drift 차단.
- Global flat을 기대한 기존 호출자는 `.pe_index` 접근 시 **즉시
-  `AttributeError`** → 반드시 구조적 좌표로 migration.
- Flat integer key가 필요한 국소 문맥 (예: 내부 dict lookup)은 호출자가
-  명시적으로 `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe`를 계산.
+**Core principle**:
+- The identity of ShardSpec is the `(sip, cube, pe)` 3-tuple.
+- **No `pe_index` property either** — blocks silent semantics drift.
+- Existing callers expecting global-flat get an **immediate
+  `AttributeError`** on `.pe_index` access → forced migration to
+  structural coordinates.
+- Local contexts that genuinely need a flat integer key (e.g. internal
+  dict lookup) explicitly compute
+  `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe` at the call
+  site.

-**Property 제거 정당화**: KernBench는 사내 프로젝트로 call site가 한정되어
-있음. Silent drift 위험 (의미만 바뀌고 타입은 같은 int) 대비 explicit breakage
-(AttributeError)가 훨씬 안전.
+**Justification for removing the property**: KernBench is an internal
+project with a limited number of call sites. Explicit breakage
+(AttributeError) is much safer than the risk of silent drift (semantics
+change while the type stays int).
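The effect of dropping the property can be illustrated with a trimmed-down stand-in for `ShardSpec`. The field list below is partial and assumed, and `N_CUBES`, `N_PE`, and `flat_key` are hypothetical names for this sketch only.

```python
from dataclasses import dataclass

N_CUBES = 4  # assumed cubes per SIP
N_PE = 8     # assumed PEs per cube


@dataclass(frozen=True)
class ShardSpec:
    """Stand-in: identity is the (sip, cube, pe) 3-tuple, no pe_index."""
    sip: int
    cube: int
    pe: int
    offset_bytes: int
    nbytes: int


def flat_key(spec: ShardSpec) -> int:
    """Explicit, call-site-local flat index for code that really needs one."""
    return spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe


spec = ShardSpec(sip=1, cube=2, pe=3, offset_bytes=0, nbytes=256)
```

Any leftover `spec.pe_index` access on such an object raises `AttributeError` instead of returning an integer with changed meaning.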

-### D3. `resolve_dp_policy`가 `target_sip`을 받아 structural 좌표 생성
+### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates

-ADR-0024 D4의 계약 구현. Post-hoc shifting 없음.
+Implements the contract of ADR-0024 D4. No post-hoc shifting.

 ```python
 # src/kernbench/policy/placement/dp.py (after)

@dataclass(frozen=True)
 class _LocalPeShard:
-    """Internal — PE resolver의 반환. Cube 내 local PE 식별자 + payload."""
+    """Internal — return value of the PE resolver. Cube-local PE id + payload."""
    local_pe: int                  # cube-local PE index (0..num_pe-1)
    offset_bytes: int
    nbytes: int
@@ -93,7 +99,7 @@ def resolve_dp_policy(
    itemsize: int,
    num_pe: int,
    num_cubes: int = 1,
-    target_sip: int,       # NEW — 어느 SIP에 배치할지 명시
+    target_sip: int,       # NEW — explicitly state which SIP to place on
 ) -> list[ShardSpec]:
    """2-level resolution (cube × PE) on a specified SIP.

@@ -123,28 +129,30 @@ def resolve_dp_policy(
    return all_shards
 ```

-**내부 resolver** (`column_wise`, `row_wise`, `replicate`)는 `_LocalPeShard`
-리스트 반환 — `local_pe` 필드명으로 **"cube-local PE identifier"임이 명시적**.
-과거 `ShardSpec.pe_index`와 이름이 혼동되던 문제 해소.
+**Internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a
+list of `_LocalPeShard` — the `local_pe` field name makes it **explicit
+that this is a "cube-local PE identifier"**. This resolves the previous
+confusion with the name `ShardSpec.pe_index`.

-**이름 규약 정리** (전체 ADR):
- `ShardSpec.pe`: 최종 외부 API — cube-local PE (structural coord)
- `_LocalPeShard.local_pe`: 내부 resolver 단계의 동일 의미
- `pe_index`: **제거**. 외부/내부 어디에도 남기지 않는다 (silent drift 차단의
-  부가 효과: 이름 재등장 없음).
+**Naming convention summary** (whole ADR):
+- `ShardSpec.pe`: the final external API — cube-local PE (structural coord)
+- `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage
+- `pe_index`: **removed**. Not retained anywhere, internal or external
+  (additional benefit of preventing silent drift: the name does not
+  reappear).

-### D4. `_create_tensor` — 구조적 좌표로 직접 placement
+### D4. `_create_tensor` — placement directly in structural coordinates

-ADR-0024 D4 연속선. Post-hoc shifting 제거, 구조적 좌표를 `resolve_dp_policy`
-호출 시점에 직접 지정.
+Continuation of ADR-0024 D4. Post-hoc shifting removed; structural
+coordinates are specified directly at the `resolve_dp_policy` call site.

 ```python
 # context.py _create_tensor (after)
 current_sip = self.ahbm.current_device()
 if current_sip is None:
-    # Single-driver fallback (ADR-0024 D2와 일관).
-    # Launcher 기반 코드가 set_device()를 빼먹으면 조용히 SIP 0에 박히는
-    # 문제가 있음 → debug mode에서 경고.
+    # Single-driver fallback (consistent with ADR-0024 D2).
+    # In launcher-based code, forgetting set_device() silently sticks the
+    # tensor on SIP 0 — emit a warning in debug mode.
    if os.environ.get("KERNBENCH_DEBUG"):
        import warnings
        warnings.warn(
@@ -161,38 +169,39 @@ placement = resolve_dp_policy(
    itemsize=itemsize,
    num_pe=eff_num_pe,
    num_cubes=eff_num_cubes,
-    target_sip=current_sip,          # ← 구조적 좌표 일차 지정
+    target_sip=current_sip,          # ← structural coord specified up front
 )

-# placement의 각 ShardSpec은 이미 (sip=current_sip, cube=local, pe=local) 포함.
-# 과거의 post-hoc shifting 블록은 완전히 제거.
+# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
+# The old post-hoc shifting block is removed entirely.
 ```

-**모든** 텐서가 current device SIP에 배치됨. Multi-SIP 텐서를 만들고 싶으면
-ADR-0027의 TP primitive 사용.
+**Every** tensor is placed on the current device's SIP. If you need a
+multi-SIP tensor, use the TP primitive of ADR-0027.

-**Single-driver fallback의 trade-off**: set_device 없는 호출에서 SIP 0으로
-default는 기존 single-driver 테스트 호환을 위해 유지. `KERNBENCH_DEBUG=1`
-환경에서는 launcher 컨텍스트의 실수로 set_device 누락 시 조용히 잘못된 SIP에
-배치되는 것을 감지할 수 있도록 warning.
+**Trade-off of the single-driver fallback**: When set_device is not
+called, defaulting to SIP 0 is kept for compatibility with existing
+single-driver tests. With `KERNBENCH_DEBUG=1`, a warning is emitted so
+that accidentally omitting set_device in a launcher context — which would
+silently place the tensor on the wrong SIP — can be detected.

-### D5. Downstream — allocator lookup은 구조적 tuple key로
+### D5. Downstream — allocator lookup by structural tuple key

-기존 `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
+Existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):

 ```python
 for spec in placement:
-    alloc = allocators[spec.pe_index]       # ← AttributeError (property 제거됨)
+    alloc = allocators[spec.pe_index]       # ← AttributeError (property removed)
 ```

-`pe_index`가 없어졌으므로 구조적 좌표로 **강제** migration:
+With `pe_index` gone, migration to structural coordinates is **forced**:

 ```python
 for spec in placement:
    alloc = allocators[(spec.sip, spec.cube, spec.pe)]
 ```

-`_ensure_allocators`의 dict population도 tuple key로:
+The dict population in `_ensure_allocators` is also tuple-keyed:

 ```python
 # context.py _ensure_allocators (after)
@@ -204,59 +213,71 @@ for sip_id in sip_range:
            )
 ```

-`_free_tensor`도 동일: 기존 `flat_idx = sip * ... + cube * ... + pe` 계산
-블록 제거, `(shard.sip, shard.cube, shard.pe)` 직접 사용.
+`_free_tensor` is the same: the old
+`flat_idx = sip * ... + cube * ... + pe` computation block is removed,
+and `(shard.sip, shard.cube, shard.pe)` is used directly.

-**Tuple vs dataclass `PEIdentity`**: Tuple이 단순하고 hashable로 바로 써서
-권고. `PEIdentity` 값객체는 명시적 타입 장점은 있지만 boilerplate가 크고 현재
-allocator dict의 유일한 key라 오버엔지니어링. Tuple 유지.
+**Tuple vs dataclass `PEIdentity`**: Recommend the tuple — it is simple
+and hashable out of the box. A `PEIdentity` value object has the upside
+of an explicit type, but the boilerplate is large and it is currently
+the only key of the allocator dict, so it would be over-engineering.
+Keep the tuple.

-### D7. 하위 호환 — 불가 (cleanup ADR)
+### D7. Backward compatibility — none (cleanup ADR)

-이 ADR은 **breaking change**.
+This ADR is a **breaking change**.

-1. `DPPolicy(sip=...)` 또는 `DPPolicy(num_sips=...)` 호출 → `TypeError`
-2. `ShardSpec.pe_index` 접근 → `AttributeError`
+1. `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)` → `TypeError`
+2. `ShardSpec.pe_index` access → `AttributeError`

-모두 **즉시 명시적 breakage**. Deprecation warning / fallback 경로 없음.
-KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에 migration.
+Both are **immediate, explicit breakage**. No deprecation warning /
+fallback path. KernBench is an internal project with a bounded set of
+call sites, so migration happens in one pass.

-**Silent drift 차단**이 property 완전 제거의 주된 이점: global flat을 기대한
-코드가 SIP-local 결과를 받아 조용히 잘못된 인덱싱을 할 가능성 제거.
+**Blocking silent drift** is the main upside of fully removing the
+property: code that expected a global flat could otherwise silently
+receive a SIP-local result and index incorrectly — that possibility is
+eliminated.
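The constructor-side breakage follows directly from how dataclasses generate `__init__`. The sketch below uses a stand-in `DPPolicy`; the `cube` and `pe` defaults and the `try_build` helper are assumptions, not the project's definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DPPolicy:
    """Stand-in with intra-device fields only; sip/num_sips are gone."""
    cube: str = "replicate"
    pe: str = "replicate"
    num_pe: Optional[int] = None
    num_cubes: Optional[int] = None


def try_build(**kwargs) -> str:
    """Report whether construction succeeds or which error it raises."""
    try:
        DPPolicy(**kwargs)
    except TypeError as exc:
        return type(exc).__name__
    return "ok"
```

Because the generated `__init__` rejects unknown keyword arguments, callers still passing `sip=` or `num_sips=` fail at construction time rather than being silently ignored.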

 ## Dependencies

- **ADR-0024** (launcher): `set_device(rank)` 및 current-device scoping이
-  SIP 배치 메커니즘 제공. 본 ADR은 그 위에 서서 DPPolicy를 순수 intra-device로
-  좁힘.
- **ADR-0027** (Megatron TP): 다중 SIP에 걸친 텐서가 필요한 경우의 대안 경로.
-  이 ADR 적용 후 multi-SIP use case는 ADR-0027로 이관.
+- **ADR-0024** (launcher): `set_device(rank)` and current-device scoping
+  provide the SIP placement mechanism. This ADR sits on top and narrows
+  DPPolicy to pure intra-device.
+- **ADR-0027** (Megatron TP): the alternative path when a tensor spans
+  multiple SIPs. After this ADR is applied, multi-SIP use cases move to
+  ADR-0027.

 ---

 ## Non-goals

- **`DPPolicy.cube` / `pe` 재설계**: 기존 replicate/column_wise/row_wise 의미
-  유지.
- **Tiling 정책 통합**: `tiled_column_major` / `tiled_row_major`는 그대로.
- **Multi-device 텐서 추상화 신규**: DTensor-like는 ADR-0028.
+- **Redesign of `DPPolicy.cube` / `pe`**: existing
+  replicate/column_wise/row_wise semantics are kept.
+- **Tiling policy consolidation**: `tiled_column_major` /
+  `tiled_row_major` stay as they are.
+- **New multi-device tensor abstraction**: a DTensor-like abstraction is
+  covered by ADR-0028.

 ---

 ## Open questions

- **`_create_tensor`의 current_sip 기본값**: set_device 없는 호출에서 rank=0
-  (SIP 0)로 fallback할지, 아니면 error 낼지. 권고는 fallback (기존 single-driver
-  테스트와의 호환).
- **`test_sip_parallel.py` 재작성 범위**: 기존 단위 테스트의 의도를 유지하며
-  launcher 기반으로 옮기려면 추가 fixture 필요. 별도 작업으로 scope.
- **`DPPolicy`의 `num_sips=None` 의미**: 필드가 없어지면 `num_sips` 개념 자체가
-  사라짐. Multi-SIP을 표현하고 싶으면 ADR-0027의 TP primitive를 쓰라는 것이
-  명시적 답.
+- **Default value of current_sip in `_create_tensor`**: for calls without
+  set_device, whether to fall back to rank=0 (SIP 0) or to raise an
+  error. The recommendation is fallback (compatibility with existing
+  single-driver tests).
+- **Scope of `test_sip_parallel.py` rewrite**: porting the existing unit
+  tests to a launcher-based setup while preserving their intent requires
+  additional fixtures. Scoped as separate work.
+- **Meaning of `num_sips=None` on `DPPolicy`**: once the field is gone,
+  the concept of `num_sips` disappears entirely. The explicit answer for
+  expressing multi-SIP is to use the TP primitive of ADR-0027.

-**Resolved (이전 rev에서 open이었던 것들)**:
- ~~`ShardSpec.pe_index` property 존치 여부~~ → **완전 제거** (D2)
- ~~`_ensure_allocators` dict key 형식~~ → **tuple `(sip, cube, pe)`** (D5)
+**Resolved (items that were open in earlier revs)**:
+- ~~Whether to keep the `ShardSpec.pe_index` property~~ → **fully
+  removed** (D2)
+- ~~Form of `_ensure_allocators` dict key~~ → **tuple `(sip, cube, pe)`**
+  (D5)

 ---

@@ -264,25 +285,31 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에

 ### Positive

- **개념 분리 명확**: DPPolicy = intra-device, TP = inter-device.
- **API 단순화**: DPPolicy 생성자 필드 ~33% 축소.
- **Structural 좌표 일관성**: ShardSpec이 `(sip, cube, pe)` 튜플로 표현 →
-  abstraction leakage 해소 (ADR-0024 D4 계약 충족).
- **`pe_index` 의미 명확**: SIP-local이 단일 해석. Global flat이 필요하면 명시.
- **Launcher 모델 일관성**: ADR-0024의 "1 worker per SIP" 모델이 유일한 SIP
-  경계 제어 메커니즘.
+- **Clean conceptual separation**: DPPolicy = intra-device, TP =
+  inter-device.
+- **API simplification**: about a 33% reduction in DPPolicy constructor
+  fields.
+- **Structural-coordinate consistency**: ShardSpec is expressed as a
+  `(sip, cube, pe)` tuple → abstraction leakage resolved (the ADR-0024
+  D4 contract is satisfied).
+- **Clear meaning of `pe_index`**: the single interpretation is
+  SIP-local. If global-flat is needed, it must be made explicit.
+- **Launcher-model consistency**: ADR-0024's "1 worker per SIP" model is
+  the sole SIP-boundary control mechanism.

 ### Negative

 - **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`,
-  `spec.pe_index` → `AttributeError`. 모든 호출자 한 번에 수정 필요.
- **ShardSpec schema 변경**: `pe_index` 단일 필드 → `sip`/`cube`/`pe` 세 필드.
-  Downstream (`deploy_tensor`, `_free_tensor`, `_ensure_allocators`,
-  `allocators` dict key 등) 연쇄 수정.
- **Silent drift 없음**: property 완전 제거로 runtime에서 즉시 실패 →
-  migration leakage 원천 차단. (Negative가 아니라 explicit tradeoff)
- `test_sip_parallel.py` 재작성 비용.
+  `spec.pe_index` → `AttributeError`. All callers need to be fixed at
+  once.
+- **ShardSpec schema change**: a single `pe_index` field becomes three
+  fields `sip`/`cube`/`pe`. Cascading edits downstream (`deploy_tensor`,
+  `_free_tensor`, `_ensure_allocators`, `allocators` dict key, etc.).
+- **No silent drift**: with the property fully removed, runtime failure
+  is immediate → migration leakage is blocked at the source. (Not a
+  negative but an explicit tradeoff.)
+- The cost of rewriting `test_sip_parallel.py`.

 ### Neutral

- 기존 `cube` / `pe` 필드 의미 불변.
+- The meaning of the existing `cube` / `pe` fields is unchanged.