ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/

Establish English as the canonical ADR language with Korean translations
held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror).
Promotion from adr-proposed/ to adr/ now writes English to adr/ and the
Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.

- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English,
  2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix
  dropped). ADR-0023 EN regenerated against KO source which had newer
  HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark
  docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline
  section covering bidirectional sync, conflict resolution (EN wins),
  and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair
  completeness, filename mirroring, ADR-ID match, Status byte-equality.
  Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF
  normalization, em-dash title separator, underscore-slug edge case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-20 01:38:44 -07:00
parent 687c98086d
commit a796c1d2f7
42 changed files with 10515 additions and 3422 deletions
@@ -18,7 +18,7 @@ We define stable, minimal message schemas for Host ↔ IO_CPU so that:
- IO_CPU-internal fan-out/aggregation can evolve independently,
- completion and failure propagation is deterministic.
We also require PE-tagging (A 방식): each shard explicitly carries (sip,cube,pe)
We also require PE-tagging (Scheme A): each shard explicitly carries (sip,cube,pe)
so IO_CPU can deterministically route/fan-out without relying on PA decoding.
---
@@ -93,7 +93,7 @@ Rules:
Mandatory fields:
- common envelope fields (D3)
- destination placement tags (A 방식):
- destination placement tags (Scheme A):
- `dst_sip: int`
- `dst_cube: int`
- `dst_pe: int`
@@ -130,7 +130,7 @@ Notes:
Mandatory fields:
- common envelope fields (D3)
- source placement tags (A 방식):
- source placement tags (Scheme A):
- `src_sip: int`
- `src_cube: int`
- `src_pe: int`
@@ -183,7 +183,7 @@ Tensor arg (mandatory):
- `shards: list[TensorShard]`
`TensorShard` MUST have (A 방식 강제):
`TensorShard` MUST have (Scheme A enforced):
- `sip: int`
- `cube: int`
@@ -1,519 +0,0 @@
# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
## Status
Accepted
## Context
The current simulation models **timing only**.
`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
but do not actually read tensor data or perform computations.
### Required Capabilities
1. Must be able to store and read actual data in HBM/TCM/SRAM
2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
3. Must minimize simulation performance degradation
### Constraints
- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
- Kernel functions must remain plain Python functions (no generator/async transformation)
### Design Exploration Results
| Option | Approach | Verdict |
|--------|----------|---------|
| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |
---
## Decision
### D1. 2-Pass Execution Model — Phase 0 Elimination
The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.
Before:
```
Phase 0: Kernel → PeCommand list (no data, no branching)
Phase 1: Replay PeCommand list via SimPy (timing only)
```
After:
```
Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
- Memory read/write: SimPy timing + MemoryStore actual data
- Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
- Dynamic control flow possible (tl.load returns actual data)
Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
```
This ADR **extends Phase 1 to be data-aware for memory operations only**.
Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
Phase 2 handles GEMM/Math computation correctness verification.
Phase 2 is optional — if only timing is needed, run Phase 1 alone.
### D2. Op Log Recording — ComponentBase Hook
Op log recording is performed as a **hook in the component base class**.
Individual component implementations are not modified.
```python
class ComponentBase:
def _on_process_start(self, env, msg):
if self._op_logger and getattr(msg, 'data_op', False):
self._op_logger.record_start(env.now, self.node.id, msg)
def _on_process_end(self, env, msg):
if self._op_logger and getattr(msg, 'data_op', False):
self._op_logger.record_end(env.now, self.node.id, msg)
```
Hooks are called before and after `run()` within `_forward_txn()`.
`_op_logger` is optional — zero overhead when absent.
**Hook timing definitions**:
| Timing | Meaning |
|--------|---------|
| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |
Link traversal latency is not included in t_start/t_end.
Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
The existing Phase 0 (kernel → PeCommand list) is eliminated,
and **greenlet** is used to cooperatively interleave kernel and SimPy execution.
#### Operating Principle
greenlet is a C extension that provides cooperative context switching.
When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
to perform timing simulation, and after completion, returns to the kernel with actual data.
```
SimPy loop (parent greenlet) Kernel (child greenlet)
───────────────────────── ──────────────────────
g.switch() ─────────────────────────→ Kernel starts
a = tl.load(ptr, ...)
internal: parent.switch(DmaReadCmd)
cmd = DmaReadCmd ←────────────────── (kernel paused)
yield DmaReadMsg(...)
yield env.timeout(dma_latency)
data = memory_store.read(...)
g.switch(data) ─────────────────────→ (kernel resumed)
a = data ← actual numpy array
if a[0][0] > 0.5: ← branching possible
...
```
The kernel is maintained as a **plain Python function**.
greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
#### KernelRunner — Framework Layer
The greenlet loop resides not in the PE_CPU component but in the framework layer,
**KernelRunner**.
```python
# KernelRunner (framework — greenlet ↔ SimPy bridge)
class KernelRunner:
def run(self, env, kernel_fn, args, store):
g = greenlet(self._run_kernel)
cmd = g.switch(kernel_fn, args)
while cmd is not None:
if isinstance(cmd, DmaReadCmd):
yield from self._dispatch_dma(env, cmd)
data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
cmd = g.switch(data) # resume with actual data
elif isinstance(cmd, GemmCmd):
yield from self._dispatch_gemm(env, cmd)
cmd = g.switch() # resume (no data)
elif isinstance(cmd, DmaWriteCmd):
store.write(cmd.dst_addr, cmd.data) # visibility = issue time
yield from self._dispatch_dma(env, cmd) # timing only
cmd = g.switch()
# PE_CPU (component — kept simple, unaware of greenlet)
def _execute_kernel(self, env):
runner = KernelRunner(self.ctx)
yield from runner.run(env, kernel_fn, args, store)
```
**Op logging single source of truth**: KernelRunner does not record directly to op_log.
All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
the component base class hooks automatically record them.
**Layer separation**:
- **Kernel code**: plain function, unaware of greenlet
- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
- **ComponentBase hook**: the sole path for op_log recording
- **PE_CPU**: only calls KernelRunner, replaceable as a component
#### Handling Differences Between Memory Read/Write and Compute
| Operation | In Phase 1 | In Phase 2 |
|-----------|-----------|-----------|
| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |
Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
GEMM/Math operations are batch-executed in Phase 2 (performance separation).
#### Store Visibility Rule
`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
SimPy DMA timing is simulated separately afterward.
This is an intentional separation of timing and visibility:
- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
- **timing**: the point at which DMA latency completes in SimPy
This separation allows a load immediately after a store to see the latest data in dynamic control flow.
#### Result Handle Semantics
`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.
The key contract in Phase 1:
1. **All compute handles are always considered pending in Phase 1.**
2. `tl.wait(handle)` **expresses timing synchronization only**
and does not make the handle ready.
3. Accessing the handle's actual result data (`handle.data`, element access,
numpy conversion, etc.) is **only possible in Phase 2**.
4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
5. In contrast, `tl.load()` returns actual data in Phase 1, so
**memory-read-based control flow is supported**.
| Handle state | Phase | Allowed operations |
|------------|-------|----------|
| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
| ready | Phase 2 | Actual numpy data access, verification |
This restriction is intentional. If computations were executed in Phase 1,
the SimPy single-thread would block, defeating the purpose of 2-pass separation.
#### Phase 1 Materialization — Future Extension
If Phase 1 eager execution becomes necessary for small operations
(scalar, small reduction) in the future, selective materialization can be supported
by adding a `materialized_in_phase1: bool` flag to the op record.
This is not implemented in the current scope.
### D4. data_op Flag — Message Self-Declaration
The logging target is determined by the `data_op` attribute on the message instance,
not by message type. The framework does not hardcode message types.
```python
class MsgBase:
data_op: bool = False # default: no logging
class DmaReadCmd(MsgBase):
data_op = True # memory transfer → logging
class GemmCmd(MsgBase):
data_op = True # compute → logging
class MathCmd(MsgBase):
data_op = True # compute → logging
```
When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
enables automatic logging without modifying framework code.
### D5. Op Log Structure
#### Op Classification Scheme
A two-level classification is used:
| Level | Field | Role |
|-------|-------|------|
| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |
#### OpRecord Definition
```python
@dataclass
class OpRecord:
t_start: float # SimPy time (ns) — service start
t_end: float # SimPy time (ns) — service completion
component_id: str # e.g. "sip0.cube0.pe0.pe_gemm"
op_kind: str # "memory" | "gemm" | "math"
op_name: str # specific operation name
params: dict # per-operation parameters (see below)
dependency_ids: list[int] # currently based on in-memory record index, may be replaced with stable op_id in the future
```
#### dependency_ids Generation Rules
`dependency_ids` is **optional**, and by default the executor performs
address-based dependency inference (see D6).
Explicit setting is only needed when precise execution ordering is required:
- **Default (address-based inference)**: the executor analyzes read/write sets to
automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
- **Explicit setting**: set when logical dependencies cannot be expressed via addresses
at the TLContext or command generation stage.
Example: completion handle-based synchronization — handle dependencies depend on
logical completion order rather than memory addresses, so they cannot be captured
by address inference.
#### op_log Ordering
The op_log maintains **stable ordering** based on `t_start`.
Records with the same `t_start` preserve insertion order.
#### params Details
**memory (dma_read / dma_write)**:
```python
{
"src_addr": int, # source address (byte)
"dst_addr": int, # destination address (byte)
"nbytes": int, # transfer size
"src_space": str, # "hbm" | "tcm" | "sram"
"dst_space": str, # "hbm" | "tcm" | "sram"
}
```
**gemm**:
```python
{
"src_a_addr": int, # operand A address
"src_b_addr": int, # operand B address
"dst_addr": int, # output address
"shape_a": tuple, # e.g. (128, 256)
"shape_b": tuple, # e.g. (256, 128)
"shape_out": tuple, # e.g. (128, 128)
"dtype_in": str, # e.g. "f16"
"dtype_acc": str, # accumulation dtype, e.g. "f32"
"dtype_out": str, # output dtype, e.g. "f16"
"transpose_a": bool,
"transpose_b": bool,
"layout_a": str, # "row_major" | "col_major"
"layout_b": str,
"layout_out": str,
"addr_space": str, # "tcm" (GEMM operands are always in TCM)
}
```
**math**:
```python
{
"op": str, # "exp" | "add" | "sum" | "where" | ...
"input_addrs": list[int], # list of operand addresses
"input_shapes": list[tuple],
"dst_addr": int,
"shape_out": tuple,
"dtype": str,
"axis": int | None, # reduction axis
"addr_space": str, # "tcm"
}
```
### D6. Phase 2 Executor
Phase 2 executes the op_log outside of SimPy.
```python
class DataExecutor:
def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
self.store = initial_store # Takes the Phase 1 MemoryStore snapshot as input
def run(self):
for t, ops in groupby(op_log, key=lambda o: o.t_start):
batch = list(ops)
independent, sequential = self._classify(batch)
self._execute_parallel(independent)
self._execute_sequential(sequential)
```
**Parallel execution determination**:
Ops with the same `t_start` are considered **parallel candidates**.
The executor determines actual parallel execution based on the following criteria:
- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
- Whether predecessor ops specified in `dependency_ids` have completed
Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
**Batch optimization**: Only independent ops with the same op_name **and identical
shape, dtype, layout, and transpose flags** are eligible for batching.
Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
**Phase 2 execution order guarantee**:
Phase 2 does not consider data arrival timing,
and guarantees execution order solely through
dependencies (address-based inference + explicit dependency_ids).
### D7. Memory Store
`MemoryStore` logically follows byte-addressable semantics,
and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).
```python
class MemoryStore:
def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
```
**Internal storage format: numpy ndarray**
MemoryStore stores tensors as **numpy ndarrays**.
| Candidate | store/load speed | Phase 2 compute | Verdict |
|-----------|-----------------|-----------------|---------|
| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
| torch tensor | Immediate | torch operations available | Use only for GPU optimization |
- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
- read: **returns numpy array by reference** (no copy)
- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
- dtype uses numpy native (`np.float16`, `np.float32`, `np.bfloat16`, etc.)
- For byte-level access, convert via `.view(np.uint8)`
- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
**read/write contract**:
- read/write operates on a **contiguous tensor** basis.
If non-contiguous stride views are needed, express them as separate copy ops.
- In the normal benchmark path, producer/consumer dtype match is expected.
Reinterpret cast is a permissive behavior for low-level memory validation
or special test cases.
- addr is byte-aligned, with minimum alignment = dtype size.
- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
Shape mismatch is verified based on nbytes, and raises an error on mismatch.
- Correctness criteria follow address-range-based read/write semantics.
- A tensor object cache may be used as an implementation optimization,
but the canonical state is byte-addressable storage.
- At deploy time, the host injects initial tensor data.
### D8. Benchmark Kernel Code
The benchmark's **user code API is not changed**.
The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.
However, internal command/message schemas may be extended to include metadata
required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
### D9. No Component Changes
Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
Op log recording is the responsibility of the ComponentBase hook.
When custom components are replaced, only the timing model changes,
and Phase 2 data execution is unaffected.
### D10. Phase 2 is Optional
```python
engine = GraphEngine(graph)
engine.run(benchmark) # Phase 1: timing only
result = engine.get_timing_result()
if verify_data:
executor = DataExecutor(engine.op_log) # Phase 2: data
executor.run()
executor.verify(expected_output)
```
If only timing analysis is needed, Phase 2 is skipped.
If the op_logger is deactivated, Phase 1 performance is identical to the original.
### D11. Verification Contract
Basic verification **compares the final output tensor** against a reference backend (numpy).
Per-dtype tolerance policy:
| dtype | Comparison method | Tolerance |
|-------|----------|-----------|
| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
| int types | `np.array_equal` | exact |
- Default mode: compare final output only (end-to-end correctness)
- Debug mode: can compare intermediate tensors on a per-op basis
(MemoryStore snapshot at each op boundary)
---
## Non-goals
- **Compute-result-based control flow**: not supported.
All compute handles are in pending state during Phase 1,
`wait()` expresses timing synchronization only and does not imply data readiness.
Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
is **treated as an error**.
Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
Phase 1 materialization is a future extension (see D3).
- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
and do not reproduce the actual hardware PE microarchitecture.
## Open Questions
- **Aliasing / slice view**: How to represent slice/views referencing the same
backing storage in MemoryStore (stride-based view vs copy semantics)
- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
communication as memory ops or introduce a separate op_kind
- **Op log streaming**: Managing op_log memory usage in large-scale simulations
(in-memory list vs disk-backed streaming)
- **Fused operation**: Whether to record tl.composite's tiled pipeline
(READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
- **Math op schema generalization**: The current math params have a simple structure,
but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
scalar/immediate operands, where/mask expressions, etc.
- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
replacement with stable op_id is needed when introducing streaming/disk-backed mode
- **Phase 1 materialization policy**: See Future Extension in D3.
If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
needs to be defined
---
## Consequences
### Positive
- Minimal impact on SimPy simulation performance (only op_log append added)
- Free to use multi-threading/GPU in Phase 2
- Component replaceability preserved (ADR-0015 design philosophy maintained)
- No changes needed to benchmark user code API
- When adding new message types, only set the data_op flag
- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
- `tl.load()` returns actual data, making kernel debugging easier
### Negative
- op_log memory usage (for large-scale simulations)
- Phase 2 execution time is proportional to tensor size (large GEMM)
- Dynamic branching based on pending handles (incomplete computations) not possible
(computations execute in Phase 2, result values are undetermined in Phase 1).
Memory-data-based branching is supported via greenlet.
- greenlet C extension dependency added (pip install greenlet)
+268 -265
View File
@@ -1,4 +1,4 @@
# ADR-0020: 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리)
# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
## Status
@@ -6,65 +6,65 @@ Accepted
## Context
현재 시뮬레이션은 **타이밍만** 모델링한다.
`tl.load()`, `tl.composite(op="gemm")` 등은 SimPy latency를 생성하지만,
실제 텐서 데이터를 읽거나 연산하지 않는다.
The current simulation models **timing only**.
`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
but do not actually read tensor data or perform computations.
### 필요한 기능
### Required Capabilities
1. HBM/TCM/SRAM에 실제 데이터를 저장하고 읽을 수 있어야 한다
2. PE_GEMM, PE_MATH가 실제 행렬 연산을 수행하고 결과를 검증할 수 있어야 한다
3. 시뮬레이션 성능 저하를 최소화해야 한다
1. Must be able to store and read actual data in HBM/TCM/SRAM
2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
3. Must minimize simulation performance degradation
### 제약 조건
### Constraints
- SimPy single-thread 이벤트 루프 — numpy matmul을 안에서 하면 전체가 block
- 컴포넌트는 교체 가능해야 한다 (ADR-0015) — 프레임워크 요구사항이 구현에 침투하면 안 됨
- 벤치마크 커널은 명령형 코드(tl.load → tl.composite → tl.wait) — 같은 코드를 재사용해야 함
- 커널 함수는 plain Python function으로 유지해야 한다 (generator/async 변환 불가)
- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
- Kernel functions must remain plain Python functions (no generator/async transformation)
### 설계 탐색 결과
### Design Exploration Results
| Option | 방식 | 판정 |
|--------|------|------|
| SimPy 내 직접 실행 | GEMM을 SimPy 안에서 numpy 호출 | 탈락: single-thread block |
| SimPy + ThreadPool | future.submit → timeout → result() | 탈락: back-to-back 요청 시 result()에서 block |
| Symbolic + lazy | 메타데이터만 추적, 나중에 실행 | 탈락: control-flow dependent 읽기 처리 곤란 |
| **2-pass (채택)** | Phase 1: 타이밍, Phase 2: 데이터 | 완전 분리, 성능 영향 없음 |
| Option | Approach | Verdict |
|--------|----------|---------|
| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |
---
## Decision
### D1. 2-Pass 실행 모델 — Phase 0 제거
### D1. 2-Pass Execution Model — Phase 0 Elimination
기존의 3단계(Phase 0 → Phase 1 → Phase 2)를 **2단계로 통합**한다.
The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.
기존:
Before:
```
Phase 0: 커널 → PeCommand 리스트 (데이터 없음, 분기 불가)
Phase 1: PeCommand 리스트를 SimPy replay (타이밍만)
Phase 0: Kernel → PeCommand list (no data, no branching)
Phase 1: Replay PeCommand list via SimPy (timing only)
```
변경:
After:
```
Phase 1 (타이밍): 커널 + SimPy 통합 실행 — greenlet 기반
- 메모리 읽기/쓰기: SimPy 타이밍 + MemoryStore 실제 데이터
- 연산 (GEMM/Math): SimPy 타이밍 + op_log 기록 (실제 연산은 Phase 2)
- dynamic control flow 가능 (tl.load가 실제 데이터 반환)
Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
- Memory read/write: SimPy timing + MemoryStore actual data
- Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
- Dynamic control flow possible (tl.load returns actual data)
Phase 2 (데이터): op_log 기반 실제 연산 실행 — SimPy 외부, 병렬 가능
Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
```
본 ADR은 **메모리 연산에 한해 Phase 1 data-aware로 확장**한다.
Phase 1 latency/BW 병목 분석 + 메모리 데이터 추적,
Phase 2 GEMM/Math 연산 정합성 검증.
Phase 2 optional — 타이밍만 필요하면 Phase 1만 실행.
This ADR **extends Phase 1 to be data-aware for memory operations only**.
Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
Phase 2 handles GEMM/Math computation correctness verification.
Phase 2 is optional — if only timing is needed, run Phase 1 alone.
### D2. Op Log 기록 — ComponentBase hook
### D2. Op Log Recording — ComponentBase Hook
op_log 기록은 **컴포넌트 베이스 클래스의 hook**으로 수행한다.
개별 컴포넌트 구현을 수정하지 않는다.
Op log recording is performed as a **hook in the component base class**.
Individual component implementations are not modified.
```python
class ComponentBase:
@@ -77,56 +77,56 @@ class ComponentBase:
self._op_logger.record_end(env.now, self.node.id, msg)
```
`_forward_txn()` 에서 `run()` 전후로 hook을 호출한다.
`_op_logger` optional — 없으면 오버헤드 제로.
Hooks are called before and after `run()` within `_forward_txn()`.
`_op_logger` is optional — zero overhead when absent.
**hook 시점 정의**:
**Hook timing definitions**:
| 시점 | 의미 |
|------|------|
| `t_start` | 컴포넌트가 해당 msg의 **service를 시작**한 시점 (`run()` 진입 직전) |
| `t_end` | 컴포넌트의 **내부 service가 완료**된 시점 (`run()` 반환 직후) |
| Timing | Meaning |
|--------|---------|
| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |
link traversal latency t_start/t_end에 포함되지 않는다.
link latency는 발신 컴포넌트의 t_end와 수신 컴포넌트의 t_start 차이로 관측된다.
Link traversal latency is not included in t_start/t_end.
Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
### D3. Greenlet 기반 커널 실행 — Phase 0 제거
### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
기존 Phase 0 (커널 → PeCommand 리스트)를 제거하고,
**greenlet**을 사용하여 커널과 SimPy를 협력적으로 interleave 실행한다.
The existing Phase 0 (kernel → PeCommand list) is eliminated,
and **greenlet** is used to cooperatively interleave kernel and SimPy execution.
#### 동작 원리
#### Operating Principle
greenlet은 협력적 context switch를 제공하는 C 확장이다.
커널(child greenlet) `tl.load()` 등을 호출하면 SimPy 루프(parent greenlet)
switch하여 타이밍 시뮬레이션을 수행하고, 완료 후 실제 데이터와 함께 커널로 돌아온다.
greenlet is a C extension that provides cooperative context switching.
When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
to perform timing simulation, and after completion, returns to the kernel with actual data.
```
SimPy 루프 (parent greenlet) 커널 (child greenlet)
SimPy loop (parent greenlet) Kernel (child greenlet)
───────────────────────── ──────────────────────
g.switch() ─────────────────────────→ 커널 시작
g.switch() ─────────────────────────→ Kernel starts
a = tl.load(ptr, ...)
내부: parent.switch(DmaReadCmd)
cmd = DmaReadCmd ←────────────────── (커널 일시정지)
internal: parent.switch(DmaReadCmd)
cmd = DmaReadCmd ←────────────────── (kernel paused)
yield DmaReadMsg(...)
yield env.timeout(dma_latency)
data = memory_store.read(...)
g.switch(data) ─────────────────────→ (커널 재개)
a = data ← 실제 numpy array
if a[0][0] > 0.5: ← 분기 가능
g.switch(data) ─────────────────────→ (kernel resumed)
a = data ← actual numpy array
if a[0][0] > 0.5: ← branching possible
...
```
커널은 **plain Python function**으로 유지된다.
greenlet switch `tl.load()`, `tl.store()` 등의 **내부 구현에만** 존재한다.
The kernel is maintained as a **plain Python function**.
greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
#### KernelRunner — 프레임워크 레이어
#### KernelRunner — Framework Layer
greenlet 루프는 PE_CPU 컴포넌트가 아니라 프레임워크 레이어인
**KernelRunner**에 위치한다.
The greenlet loop resides not in the PE_CPU component but in the framework layer,
**KernelRunner**.
```python
# KernelRunner (프레임워크 — greenlet ↔ SimPy 연결)
# KernelRunner (framework — greenlet ↔ SimPy bridge)
class KernelRunner:
def run(self, env, kernel_fn, args, store):
g = greenlet(self._run_kernel)
@@ -136,160 +136,162 @@ class KernelRunner:
if isinstance(cmd, DmaReadCmd):
yield from self._dispatch_dma(env, cmd)
data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
cmd = g.switch(data) # 실제 데이터와 함께 재개
cmd = g.switch(data) # resume with actual data
elif isinstance(cmd, GemmCmd):
yield from self._dispatch_gemm(env, cmd)
cmd = g.switch() # 재개 (데이터 없음)
cmd = g.switch() # resume (no data)
elif isinstance(cmd, DmaWriteCmd):
store.write(cmd.dst_addr, cmd.data) # visibility = issue 시점
yield from self._dispatch_dma(env, cmd) # timing만 반영
store.write(cmd.dst_addr, cmd.data) # visibility = issue time
yield from self._dispatch_dma(env, cmd) # timing only
cmd = g.switch()
# PE_CPU (컴포넌트 — 간단하게 유지, greenlet을 모름)
# PE_CPU (component — kept simple, unaware of greenlet)
def _execute_kernel(self, env):
runner = KernelRunner(self.ctx)
yield from runner.run(env, kernel_fn, args, store)
```
**Op logging single source of truth**: KernelRunner는 op_log에 직접 기록하지 않는다.
모든 op logging**ComponentBase hook (_on_process_start/end)** 담당한다.
KernelRunner `_dispatch_gemm()` 등으로 컴포넌트에 메시지를 전달하면,
컴포넌트 베이스 클래스의 hook이 자동으로 기록한다.
**Op logging single source of truth**: KernelRunner does not record directly to op_log.
All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
the component base class hooks automatically record them.
**레이어 분리**:
- **커널 코드**: plain function, greenlet 존재를 모름
- **TLContext**: `tl.load()` 내부에서 `parent.switch(cmd)` 호출
- **KernelRunner**: greenlet ↔ SimPy 연결, MemoryStore 읽기/쓰기 처리. **logging 안 함**.
- **ComponentBase hook**: op_log 기록의 유일한 경로
- **PE_CPU**: KernelRunner를 호출만 함, 컴포넌트로서 교체 가능
**Layer separation**:
- **Kernel code**: plain function, unaware of greenlet
- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
- **ComponentBase hook**: the sole path for op_log recording
- **PE_CPU**: only calls KernelRunner, replaceable as a component
#### 메모리 읽기/쓰기 vs 연산의 처리 차이
#### Handling Differences Between Memory Read/Write and Compute
| 연산 | Phase 1에서 | Phase 2에서 |
|------|------------|------------|
| `tl.load()` | SimPy 타이밍 + MemoryStore read → **실제 데이터 반환** | — |
| `tl.store()` | SimPy 타이밍 + MemoryStore write → **실제 기록** | — |
| `tl.composite(gemm)` | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 |
| `tl.dot()` / math ops | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 |
| Operation | In Phase 1 | In Phase 2 |
|-----------|-----------|-----------|
| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |
메모리 읽기/쓰기는 Phase 1에서 즉시 처리 (numpy slice, 빠름).
GEMM/Math 연산은 Phase 2에서 batch 실행 (성능 분리).
Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
GEMM/Math operations are batch-executed in Phase 2 (performance separation).
#### Store Visibility Rule
`tl.store()`는 **issue 시점에 MemoryStore에 즉시 반영**된다 (visibility = issue).
SimPy DMA 타이밍은 이후 별도로 시뮬레이션된다.
`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
SimPy DMA timing is simulated separately afterward.
이는 timing visibility를 의도적으로 분리한 것이다:
- **visibility**: MemoryStore에 반영되는 시점 = `store.write()` 호출 시
- **timing**: SimPy에서 DMA latency가 완료되는 시점
This is an intentional separation of timing and visibility:
- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
- **timing**: the point at which DMA latency completes in SimPy
이 분리로 dynamic control flow에서 store 직후 load가 최신 데이터를 볼 수 있다.
This separation allows a load immediately after a store to see the latest data in dynamic control flow.
#### Result Handle Semantics
`tl.composite()`(sync/async)는 결과 tensor를 참조하는 **handle**을 반환한다.
`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.
Phase 1에서의 핵심 계약:
The key contract in Phase 1:
1. **모든 compute handle은 Phase 1에서 항상 pending 상태로 간주한다.**
2. `tl.wait(handle)`은 **timing synchronization만 표현**하며,
handle ready로 만들지 않는다.
3. handle의 실제 결과 데이터 접근(`handle.data`, element access,
numpy conversion 등)은 **Phase 2에서만 가능**하다.
4. 따라서 Phase 1에서 **compute-result 기반 control flow는 지원하지 않는다.**
5. 반면 `tl.load()`는 Phase 1에서 실제 데이터를 반환하므로,
**memory-read 기반 control flow는 지원 가능**하다.
1. **All compute handles are always considered pending in Phase 1.**
2. `tl.wait(handle)` **expresses timing synchronization only**
and does not make the handle ready.
3. Accessing the handle's actual result data (`handle.data`, element access,
numpy conversion, etc.) is **only possible in Phase 2**.
4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
5. In contrast, `tl.load()` returns actual data in Phase 1, so
**memory-read-based control flow is supported**.
| handle 상태 | Phase | 허용 동작 |
| Handle state | Phase | Allowed operations |
|------------|-------|----------|
| pending | Phase 1 | `tl.wait(handle)` — timing 동기화만 |
| pending | Phase 1 | handle `tl.store()`의 대상으로 전달 (logical destination 연결만, payload Phase 2) |
| pending | Phase 1 | **데이터 접근 불가** — 값 기반 분기 불가 |
| ready | Phase 2 | 실제 numpy 데이터 접근, 검증 |
| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
| ready | Phase 2 | Actual numpy data access, verification |
이 제약은 의도적이다. Phase 1에서 연산을 실행하면 SimPy single-thread가
block되어 2-pass 분리의 존재 이유가 사라진다.
This restriction is intentional. If computations were executed in Phase 1,
the SimPy single-thread would block, defeating the purpose of 2-pass separation.
#### Phase 1 Materialization — Future Extension
향후 소형 연산(scalar, 작은 reduction)에 대해 Phase 1 eager execution이
필요한 경우, `materialized_in_phase1: bool` 플래그를 op record에 추가하여
선택적 materialization을 지원할 수 있다. 현재 범위에서는 구현하지 않는다.
If Phase 1 eager execution becomes necessary for small operations
(scalar, small reduction) in the future, selective materialization can be supported
by adding a `materialized_in_phase1: bool` flag to the op record.
This is not implemented in the current scope.
### D4. data_op 플래그 — 메시지 자기 선언
### D4. data_op Flag — Message Self-Declaration
로깅 대상은 메시지 타입이 아니라 메시지 인스턴스의 `data_op` 속성으로 결정한다.
프레임워크가 메시지 타입을 하드코딩하지 않는다.
The logging target is determined by the `data_op` attribute on the message instance,
not by message type. The framework does not hardcode message types.
```python
class MsgBase:
data_op: bool = False # 기본: 로깅 안 함
data_op: bool = False # default: no logging
class DmaReadCmd(MsgBase):
data_op = True # 메모리 이동 → 로깅
data_op = True # memory transfer → logging
class GemmCmd(MsgBase):
data_op = True # 연산 → 로깅
data_op = True # compute → logging
class MathCmd(MsgBase):
data_op = True # 연산 → 로깅
data_op = True # compute → logging
```
새 메시지 타입(예: IpcqMsg) 추가 시 `data_op = True`만 설정하면
프레임워크 코드 수정 없이 자동 로깅된다.
When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
enables automatic logging without modifying framework code.
### D5. Op Log 구조
### D5. Op Log Structure
#### op 분류 체계
#### Op Classification Scheme
2단계로 분류한다:
A two-level classification is used:
| 레벨 | 필드 | 역할 |
|------|------|------|
| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch 기준 |
| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` 등 | 구체 연산 식별 |
| Level | Field | Role |
|-------|-------|------|
| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |
#### OpRecord 정의
#### OpRecord Definition
```python
@dataclass
class OpRecord:
t_start: float # SimPy 시각 (ns) — service 시작
t_end: float # SimPy 시각 (ns) — service 완료
t_start: float # SimPy time (ns) — service start
t_end: float # SimPy time (ns) — service completion
component_id: str # e.g. "sip0.cube0.pe0.pe_gemm"
op_kind: str # "memory" | "gemm" | "math"
op_name: str # 구체 연산명
params: dict # 연산별 파라미터 (아래 참조)
dependency_ids: list[int] # 현재는 in-memory record index 기반, 향후 stable op_id로 대체 가능
op_name: str # specific operation name
params: dict # per-operation parameters (see below)
dependency_ids: list[int] # currently based on in-memory record index, may be replaced with stable op_id in the future
```
#### dependency_ids 생성 규칙
#### dependency_ids Generation Rules
`dependency_ids` **optional**이며, 기본적으로 executor는
주소 기반 dependency 추론을 수행한다 (D6 참조).
`dependency_ids` is **optional**, and by default the executor performs
address-based dependency inference (see D6).
정확한 실행 순서가 필요한 경우에만 명시적으로 설정한다:
- **기본 (address-based inference)**: executor read/write set을 분석하여
RAW/WAW/WAR 의존성을 자동 추론. 대부분의 경우 이것으로 충분.
- **명시적 설정**: TLContext 또는 command 생성 단계에서 logical dependency가
주소로 표현되지 않는 경우에 설정.
: completion handle 기반 동기화 — handle dependency는 메모리 주소가 아니라
논리적 완료 순서에 의존하므로 address inference로 잡히지 않는다.
Explicit setting is only needed when precise execution ordering is required:
- **Default (address-based inference)**: the executor analyzes read/write sets to
automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
- **Explicit setting**: set when logical dependencies cannot be expressed via addresses
at the TLContext or command generation stage.
Example: completion handle-based synchronization — handle dependencies depend on
logical completion order rather than memory addresses, so they cannot be captured
by address inference.
#### op_log ordering
#### op_log Ordering
op_log`t_start` 기준으로 **stable ordering**을 유지한다.
동일 `t_start`의 record들은 insertion order를 보존한다.
The op_log maintains **stable ordering** based on `t_start`.
Records with the same `t_start` preserve insertion order.
#### params 상세
#### params Details
**memory (dma_read / dma_write)**:
```python
{
"src_addr": int, # source 주소 (byte)
"dst_addr": int, # destination 주소 (byte)
"nbytes": int, # 전송 크기
"src_addr": int, # source address (byte)
"dst_addr": int, # destination address (byte)
"nbytes": int, # transfer size
"src_space": str, # "hbm" | "tcm" | "sram"
"dst_space": str, # "hbm" | "tcm" | "sram"
}
@@ -298,9 +300,9 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
**gemm**:
```python
{
"src_a_addr": int, # operand A 주소
"src_b_addr": int, # operand B 주소
"dst_addr": int, # output 주소
"src_a_addr": int, # operand A address
"src_b_addr": int, # operand B address
"dst_addr": int, # output address
"shape_a": tuple, # e.g. (128, 256)
"shape_b": tuple, # e.g. (256, 128)
"shape_out": tuple, # e.g. (128, 128)
@@ -312,7 +314,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
"layout_a": str, # "row_major" | "col_major"
"layout_b": str,
"layout_out": str,
"addr_space": str, # "tcm" (GEMM operand는 항상 TCM)
"addr_space": str, # "tcm" (GEMM operands are always in TCM)
}
```
@@ -320,7 +322,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
```python
{
"op": str, # "exp" | "add" | "sum" | "where" | ...
"input_addrs": list[int], # operand 주소 목록
"input_addrs": list[int], # list of operand addresses
"input_shapes": list[tuple],
"dst_addr": int,
"shape_out": tuple,
@@ -332,12 +334,12 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
### D6. Phase 2 Executor
Phase 2는 SimPy 밖에서 op_log를 실행한다.
Phase 2 executes the op_log outside of SimPy.
```python
class DataExecutor:
def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
self.store = initial_store # Phase 1 MemoryStore snapshot을 입력으로 받는다
self.store = initial_store # Takes the Phase 1 MemoryStore snapshot as input
def run(self):
for t, ops in groupby(op_log, key=lambda o: o.t_start):
@@ -347,30 +349,30 @@ class DataExecutor:
self._execute_sequential(sequential)
```
**병렬 실행 판정**:
**Parallel execution determination**:
같은 `t_start`의 op들은 **병렬 후보**로 간주한다.
실제 병렬 실행 여부는 executor가 다음 기준으로 판정한다:
- read/write 주소 범위 겹침 여부 (WAW, RAW, WAR 충돌 검사)
- `dependency_ids`에 명시된 선행 op 완료 여부
Ops with the same `t_start` are considered **parallel candidates**.
The executor determines actual parallel execution based on the following criteria:
- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
- Whether predecessor ops specified in `dependency_ids` have completed
주소 범위가 겹치지 않고 명시적 의존성이 없는 op들만 병렬 실행한다.
Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
**배치 최적화**: 동일 op_name이며 **shape, dtype, layout, transpose flag가
모두 동일한** 독립 op들만 batching 대상이 된다.
예: 여러 PE의 동일 shape GEMM → `np.matmul(a_batch, b_batch)` 한 번으로 묶음.
CPU에서도 BLAS 효율 향상, GPU에서는 launch overhead 절감.
**Batch optimization**: Only independent ops with the same op_name **and identical
shape, dtype, layout, and transpose flags** are eligible for batching.
Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
**Phase 2 실행 순서 보장**:
**Phase 2 execution order guarantee**:
Phase 2는 데이터 도착 시점을 고려하지 않으며,
dependency (주소 기반 추론 + 명시적 dependency_ids)를 통해서만
실행 순서를 보장한다.
Phase 2 does not consider data arrival timing,
and guarantees execution order solely through
dependencies (address-based inference + explicit dependency_ids).
### D7. Memory Store
`MemoryStore`는 논리적으로 byte-addressable semantics를 따르며,
현재 구현은 **tensor-granular storage** (addr → numpy ndarray 매핑)를 사용한다.
`MemoryStore` logically follows byte-addressable semantics,
and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).
```python
class MemoryStore:
@@ -378,139 +380,140 @@ class MemoryStore:
def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
```
**내부 저장 포맷: numpy ndarray**
**Internal storage format: numpy ndarray**
MemoryStore는 텐서를 **numpy ndarray**로 저장한다.
MemoryStore stores tensors as **numpy ndarrays**.
| 후보 | store/load 속도 | Phase 2 연산 | 판정 |
|------|----------------|-------------|------|
| **numpy ndarray** | 즉시 (참조 전달, 복사 없음) | `np.matmul` 바로 사용 | **채택** |
| bytearray | memcpy 필요 | `np.frombuffer` 변환 필요 | 탈락 |
| torch tensor | 즉시 | torch 연산 가능 | GPU 최적화 시만 사용 |
| Candidate | store/load speed | Phase 2 compute | Verdict |
|-----------|-----------------|-----------------|---------|
| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
| torch tensor | Immediate | torch operations available | Use only for GPU optimization |
- write: numpy array**참조 저장** (복사 없음) → Phase 1 오버헤드 = dict lookup 1회
- read: numpy array를 **참조 반환** (복사 없음)
- 동일 addr에 재 write 시 기존 array를 **tensor 단위로 덮어쓴다** (partial overwrite 미지원)
- dtype numpy native 사용 (`np.float16`, `np.float32`, `np.bfloat16`)
- byte-level access가 필요한 경우 `.view(np.uint8)` 로 변환
- Phase 2에서 GPU batch 최적화 시 numpy → torch tensor 변환은 executor가 담당
- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
- read: **returns numpy array by reference** (no copy)
- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
- dtype uses numpy native (`np.float16`, `np.float32`, `np.bfloat16`, etc.)
- For byte-level access, convert via `.view(np.uint8)`
- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
**read/write contract**:
- read/write **contiguous tensor** 기준이다.
non-contiguous stride view가 필요한 경우 별도 copy op으로 표현한다.
- 일반 benchmark path에서는 producer/consumer dtype 일치를 기대한다.
reinterpret cast low-level memory validation 또는 특수 테스트 케이스를 위한
permissive behavior이다.
- addr byte-aligned이며, 최소 alignment = dtype 크기.
- dtype mismatch (write와 다른 dtype으로 read)는 reinterpret cast로 처리한다.
shape 불일치 시 nbytes 기준으로 검증하고, 불일치하면 error.
- 정합성 기준은 주소 범위 기반 read/write semantics를 따른다.
- 구현 최적화로 tensor object cache를 둘 수 있지만,
canonical state byte-addressable storage이다.
- deploy 시점에 호스트가 초기 텐서 데이터를 주입한다.
- read/write operates on a **contiguous tensor** basis.
If non-contiguous stride views are needed, express them as separate copy ops.
- In the normal benchmark path, producer/consumer dtype match is expected.
Reinterpret cast is a permissive behavior for low-level memory validation
or special test cases.
- addr is byte-aligned, with minimum alignment = dtype size.
- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
Shape mismatch is verified based on nbytes, and raises an error on mismatch.
- Correctness criteria follow address-range-based read/write semantics.
- A tensor object cache may be used as an implementation optimization,
but the canonical state is byte-addressable storage.
- At deploy time, the host injects initial tensor data.
### D8. 벤치마크 커널 코드
### D8. Benchmark Kernel Code
벤치마크의 **사용자 코드 API는 변경하지 않는다**.
`tl.load()`, `tl.composite()`, `tl.store()` 등의 호출 인터페이스는 유지.
The benchmark's **user code API is not changed**.
The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.
단, 내부 command/message schema는 Phase 2 실행에 필요한 metadata
포함하도록 확장될 수 있다 (예: dtype_acc, transpose 등 추가 필드).
However, internal command/message schemas may be extended to include metadata
required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
### D9. 컴포넌트 변경 없음
### D9. No Component Changes
개별 컴포넌트 구현(PE_GEMM, PE_DMA, HBM_CTRL 등)은 수정하지 않는다.
op_log 기록은 ComponentBase hook의 책임이다.
커스텀 컴포넌트 교체 시 타이밍 모델만 교체되며,
Phase 2 데이터 실행은 영향받지 않는다.
Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
Op log recording is the responsibility of the ComponentBase hook.
When custom components are replaced, only the timing model changes,
and Phase 2 data execution is unaffected.
### D10. Phase 2 Optional
### D10. Phase 2 is Optional
```python
engine = GraphEngine(graph)
engine.run(benchmark) # Phase 1: 타이밍만
engine.run(benchmark) # Phase 1: timing only
result = engine.get_timing_result()
if verify_data:
executor = DataExecutor(engine.op_log) # Phase 2: 데이터
executor = DataExecutor(engine.op_log) # Phase 2: data
executor.run()
executor.verify(expected_output)
```
타이밍 분석만 필요하면 Phase 2를 건너뛴다.
op_logger를 비활성화하면 Phase 1 성능도 기존과 동일.
If only timing analysis is needed, Phase 2 is skipped.
If the op_logger is deactivated, Phase 1 performance is identical to the original.
### D11. Verification Contract
기본 검증은 **최종 output tensor**를 reference backend(numpy)와 비교한다.
Basic verification **compares the final output tensor** against a reference backend (numpy).
dtype tolerance 정책:
Per-dtype tolerance policy:
| dtype | 비교 방식 | tolerance |
| dtype | Comparison method | Tolerance |
|-------|----------|-----------|
| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
| int 계열 | `np.array_equal` | exact |
| int types | `np.array_equal` | exact |
- 기본 모드: 최종 output만 비교 (end-to-end correctness)
- 디버그 모드: intermediate tensor op 단위로 비교 가능
- Default mode: compare final output only (end-to-end correctness)
- Debug mode: can compare intermediate tensors on a per-op basis
(MemoryStore snapshot at each op boundary)
---
## Non-goals
- **Compute-result-based control flow**: 지원하지 않는다.
모든 compute handle은 Phase 1에서 pending 상태이며,
`wait()` timing synchronization만 표현하고 data readiness를 의미하지 않는다.
Phase 1에서 `handle.data` 접근, element access, truth-value evaluation
**error로 처리**한다.
메모리 데이터 기반 분기(`tl.load()` 결과)는 greenlet으로 지원된다.
Phase 1 materialization future extension (D3 참조).
- **Cycle-accurate overlap reconstruction**: Phase 2에서 Phase 1의 실행 시간
overlap을 정확히 재현하지 않는다. Phase 2는 데이터 정합성만 검증한다.
- **GPU kernel compilation**: Phase 2의 GEMM/Math는 numpy/torch 호출이며,
실제 하드웨어 PE의 마이크로아키텍처를 재현하지 않는다.
- **Compute-result-based control flow**: not supported.
All compute handles are in pending state during Phase 1,
`wait()` expresses timing synchronization only and does not imply data readiness.
Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
is **treated as an error**.
Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
Phase 1 materialization is a future extension (see D3).
- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
and do not reproduce the actual hardware PE microarchitecture.
## Open Questions
- **Aliasing / slice view**: 동일 backing storage를 참조하는 slice/view를
MemoryStore에서 어떻게 표현할지 (stride-based view vs copy semantics)
- **IPCQ/descriptor read 일반화**: PE-to-PE 통신을 memory op으로 완전히
일반화할지, 별도 op_kind를 둘지
- **Op log streaming**: 대규모 시뮬레이션에서 op_log 메모리 사용량 관리
- **Aliasing / slice view**: How to represent slice/views referencing the same
backing storage in MemoryStore (stride-based view vs copy semantics)
- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
communication as memory ops or introduce a separate op_kind
- **Op log streaming**: Managing op_log memory usage in large-scale simulations
(in-memory list vs disk-backed streaming)
- **Fused operation**: tl.composite tiled pipeline (READ→COMPUTE→WRITE)을
하나의 fused op record로 기록할지, 개별 op으로 분리할지
- **Math op schema 일반화**: 현재 math params는 단순 구조이나,
broadcasting rule, input dtype, keepdims, scalar/immediate operand,
where/mask 표현 등 일반화가 필요할 수 있음
- **Op record 식별자**: 현재 dependency_ids in-memory list index 기반이며,
streaming/disk-backed mode 도입 시 stable op_id로 대체 필요
- **Phase 1 materialization policy**: D3의 Future Extension 참조.
허용 시 해당 op의 Phase 2 처리 방식 (skip / verify / recompute) 정의 필요
- **Fused operation**: Whether to record tl.composite's tiled pipeline
(READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
- **Math op schema generalization**: The current math params have a simple structure,
but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
scalar/immediate operands, where/mask expressions, etc.
- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
replacement with stable op_id is needed when introducing streaming/disk-backed mode
- **Phase 1 materialization policy**: See Future Extension in D3.
If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
needs to be defined
---
## Consequences
### 긍정적
### Positive
- SimPy 시뮬레이션 성능 영향 최소 (op_log append만 추가)
- Phase 2에서 멀티스레드/GPU 자유롭게 사용 가능
- 컴포넌트 교체 자유도 유지 (ADR-0015 설계 철학 보존)
- 벤치마크 사용자 코드 API 변경 불필요
- 새 메시지 타입 추가 시 data_op 플래그만 설정
- greenlet으로 Phase 0 제거 — 메모리 데이터 기반 dynamic control flow 지원
- `tl.load()`가 실제 데이터를 반환하므로 커널 디버깅 용이
- Minimal impact on SimPy simulation performance (only op_log append added)
- Free to use multi-threading/GPU in Phase 2
- Component replaceability preserved (ADR-0015 design philosophy maintained)
- No changes needed to benchmark user code API
- When adding new message types, only set the data_op flag
- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
- `tl.load()` returns actual data, making kernel debugging easier
### 부정적
### Negative
- op_log 메모리 사용량 (대규모 시뮬레이션 시)
- Phase 2 실행 시간은 텐서 크기에 비례 (대형 GEMM)
- pending handle (연산 미완료) 기반 동적 분기 불가
(연산은 Phase 2에서 실행, Phase 1에서 결과 값 미확정).
메모리 데이터 기반 분기는 greenlet으로 지원된다.
- greenlet C 확장 의존성 추가 (pip install greenlet)
- op_log memory usage (for large-scale simulations)
- Phase 2 execution time is proportional to tensor size (large GEMM)
- Dynamic branching based on pending handles (incomplete computations) not possible
(computations execute in Phase 2, result values are undetermined in Phase 1).
Memory-data-based branching is supported via greenlet.
- greenlet C extension dependency added (pip install greenlet)
@@ -1,882 +0,0 @@
# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication
## Status
Accepted
## Context
### Goal
Add the infrastructure that lets CCL (Collective Communication Library)
kernels run **inside** a PE. The host just launches a kernel on each
SIP; the actual synchronization and data movement happen **inside the
PE kernel via an IPCQ (Inter-Process Communication Queue)**.
This mirrors how NCCL performs NVLink communication inside a GPU
kernel, or how Cerebras / Tenstorrent expose core-local communication
queues. Host-level collectives (`dist.all_reduce`) are deferred to
**future work**; this ADR focuses solely on the kernel-side collective
infrastructure.
### Problems to solve
1. PE-to-PE direct data movement (writing into a peer's memory).
2. Synchronization — the sender must check that the receiver has space
in its buffer (backpressure).
3. Resource contention between compute traffic and communication
traffic (Head-of-Line blocking).
4. The host must be able to construct logical neighbor topologies
(ring / mesh / tree) per algorithm.
---
## Decision
### D1. Add a new `PE_IPCQ` component
A new component `PE_IPCQ` is added inside each PE. It follows the same
pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a
distinct component.
```
PE
├── PE_CPU
├── PE_SCHEDULER
├── PE_DMA
├── PE_IPCQ ← new
├── PE_FETCH_STORE
├── PE_GEMM
├── PE_MATH
├── PE_TCM
├── PE_MMU
```
**Role separation** (control plane vs. data plane):
- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head /
tail pointer management, peer pointer caches, backpressure, 4-direction
neighbor mapping.
- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe
/ PCIE into the peer's memory.
PE_IPCQ does **not** move data itself — it delegates to PE_DMA.
### D2. Ring buffer model
Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.
```python
@dataclass
class IpcqQueuePair:
direction: Direction # N/S/E/W
peer: IpcqEndpoint # set by host at init time (D2.5)
tx_buffer_base: int # outgoing data base addr (in our memory)
rx_buffer_base: int # incoming data base addr (in our memory)
slot_size: int # 1 tile per slot
n_slots: int # ring depth
my_head: int # next slot we will write/send into
my_tail: int # next slot we will read/recv from
peer_head_cache: int # peer's last-seen head (updated via D9 piggyback)
peer_tail_cache: int # peer's last-seen tail (updated via D9 fast-path credit)
```
**Canonical field names**: throughout this ADR the four names above
(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used
consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`,
etc.) are not used.
| Field | Owner | Updated when |
|-------|-------|--------------|
| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |
**Slot unit**: fixed-size, one slot holds one full tile (no descriptor
indirection). Full data embedded in the slot. See D5.
### D2.5. `IpcqEndpoint` schema
`IpcqQueuePair.peer` carries everything the sender needs to compute the
peer's rx slot address:
```python
@dataclass(frozen=True)
class IpcqEndpoint:
sip: int
cube: int
pe: int
buffer_kind: str # "tcm" | "hbm" | "sram"
rx_base_pa: int # peer rx_buffer base PA (PhysAddr.encode())
rx_base_va: int # peer rx_buffer base VA (optional, MMU mode)
n_slots: int # peer ring depth (for wrap-around)
slot_size: int # peer slot size (for offset)
```
Address computation:
```python
slot_idx = self.my_head % peer.n_slots
dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
```
PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA
(vc_comm) routes the data to `dst_pa` through the fabric.
**Endpoint construction order**: at backend init (D10), the IPCQ
buffers for **every PE** are allocated first (so each rank knows the
others' PA), then the per-rank neighbor tables are built and pushed to
PE_IPCQ via `IpcqInitMsg`.
### D3. Four-direction mapping ≡ logical ProcessGroup
The PE views four directions (N/S/E/W) as logical ports. Real peer
addresses are configured by the host CCL init, per the chosen
algorithm. The PE kernel never knows the topology, only directions.
```python
# 1D ring
for rank in range(world_size):
ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])
# 2D mesh
for r in range(R):
for c in range(C):
ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
```
The PE code does not need to know where `tl.send(dir="E", ...)` actually
ends up.
### D4. PE kernel API
```python
# Send (blocking; may stall on backpressure)
tl.send(dir: str, src=TensorHandle)
tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
# Recv (blocking)
recv = tl.recv(dir: str, shape=..., dtype=...)
recv = tl.recv(shape=..., dtype=...) # round-robin across 4 directions
# Recv (non-blocking)
fut = tl.recv_async(dir: str, shape=..., dtype=...)
recv = tl.wait(fut)
```
`tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each
call rotates through directions, returning the first available slot.
Empty in all 4 directions → wait.
**Fairness is weak**: the rotating start mitigates simple bias, but if
one direction always wins the race the others can starve. Algorithms
that need strict fairness must call `tl.recv(dir=...)` explicitly.
### D5. Single-hop DMA write + full-data slot model
Data moves from sender memory into the receiver's ring slot in **one
DMA transfer**. Key properties:
- **Single-hop**: the sender already knows the peer rx slot address and
fires one fabric DMA into it.
- **No CPU memcpy**: the CPU never copies data.
- **No intermediate staging**: neither side keeps a separate staging
buffer (sender uses the source addr directly; receiver gets the data
in its ring slot directly).
(Strictly speaking the fabric DMA write does happen, so this is not
literally "no data movement" — it's the same property NCCL labels
"zero-copy", meaning no CPU memcpy and no staging copy.)
```
PE A: tl.send(E, src_addr, nbytes)
1. IPCQ computes the peer rx slot address:
dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
(full → sleep / poll)
3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
4. my_head += 1
PE B: data = tl.recv(W)
1. Look at rx_buffer[my_tail % n_slots]
2. Wait for the data to arrive (D7 backpressure mode)
3. Return the slot address to the kernel (or fetch into register file)
4. my_tail += 1
5. Issue a credit-return fast path (D9): after the bottleneck-BW
latency the peer A's peer_tail_cache is updated.
```
The slot holds the full tile. The receiver only reads its own
rx_buffer; it never reads back into A's memory. The sender knows the
peer rx slot address and DMAs directly into it (single-hop).
The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local
to the PE).
### D6. Buffer placement — three-way benchmark
The host CCL init picks the IPCQ ring-buffer location:
```python
ipcq_init(
backend="ahbm",
buffer_kind="tcm" | "hbm" | "sram",
n_slots=8,
slot_size=4096,
)
```
| Location | Trait | Trade-off |
|----------|-------|-----------|
| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
| **PE-local HBM** | Large; via DMA | Higher latency |
| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |
All three locations run the same kernel code; only the init differs.
### D7. Backpressure — two-mode benchmark
How the sender or receiver waits when peer slots are full / data not
yet arrived:
| Mode | Behavior | Model |
|------|----------|-------|
| **poll** | Periodically re-check the cached peer pointer | Spin loop |
| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |
```python
ipcq_init(backpressure="poll" | "sleep", ...)
```
Both modes are implemented so latency / throughput trade-offs can be
benchmarked.
### D8. PE_DMA virtual channels
Extend PE_DMA from a single queue into a **two-channel virtual-channel**
model.
```
PE_DMA
├── vc_compute: tile load / store / writeback for GEMM and Math
└── vc_comm: IPCQ send data
```
Each VC has an independent state machine:
- One channel stalling does not block the other.
- The same physical link (cube_noc, UCIe, …) is shared, but link BW is
split between channels.
**Chunk-level interleave**:
- Large GEMM tile DMAs do not lock the link end-to-end.
- Progress happens in chunks (e.g. 256 B); each chunk shares link BW
with the other VC's pending chunks.
- Chunk size is an init parameter (smaller = fairer, larger = more
efficient).
Net effect:
- HoL blocking is eliminated (an IPCQ send can interleave with a long
compute DMA).
- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM
pattern).
- Matches the NoC-virtual-channel pattern used in real HW.
**First-implementation accuracy limit (intentional)**: this ADR's
first cut uses **deterministic chunk-level interleave + weighted
round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`).
This is a first-order approximation and is simpler than real HW
dynamic-contention / credit-based arbiters. Functional correctness is
unaffected, but heavy-contention scenarios may report slightly
optimistic latency vs. real HW. A separate ADR can add a NoC arbiter
component later if more precision is needed.
#### Token routing
- Compute tokens (`TileToken`) — go through the existing
PE_FETCH_STORE → PE_DMA chain.
- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA
self-routing.
- PE_DMA picks the channel by token type.
```python
class PeDmaComponent:
def _process(self, env, token):
if isinstance(token, IpcqDmaToken):
yield from self._vc_comm_process(env, token)
else:
yield from self._vc_compute_process(env, token)
```
### D9. Pointer synchronization — DMA payload piggyback
Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so
pointers update along with the data. This simulation adopts the same
model: **no separate control channel** — metadata travels with the
data.
The big benefits:
- **Automatic ordering**: data and metadata move on the same token, so
data is visible **before** the head_cache update. No race.
- **HW fidelity**: matches NVLink / UCIe piggybacked headers.
- **Component simplification**: no separate `IpcqPtrUpdate` event type.
#### Send flow (head update via piggyback)
```
PE A: tl.send(E, src_addr, nbytes)
1. PE_IPCQ checks backpressure (using peer_tail_cache)
2. PE_IPCQ creates an IpcqDmaToken:
- data body (src_addr → peer dst_addr)
- piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
3. Hand the token to PE_DMA(vc_comm)
4. PE A increments my_head (send tracking)
[fabric DMA: latency elapses]
PE B's PE_DMA receives the token
5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)
PE B's PE_IPCQ receives the metadata
7. Updates peer_head_cache (= A's head)
8. Wakes any pending recv on that direction
```
**Steps 5 and 6 must execute in the same SimPy step** — DMA completion
makes data and metadata atomically visible.
#### Recv flow (credit return — fast path with bottleneck-BW latency)
When the receiver frees a slot, the sender must learn about it
(backpressure release). Unlike data, the credit return does **not**
travel through general vc_comm fabric — it uses a **separate fast
path**, an abstraction of the NVLink / UCIe credit-return wire.
**Latency** is computed from the **full path latency** (per-node
overhead + edge propagation + drain), not a magic constant:
```
credit_size_bytes = 16 (ccl.yaml: ipcq_credit_size_bytes)
path = router.find_path(self_pe, peer_pe.pe_dma)
latency = compute_path_latency_ns(path, credit_size_bytes)
= sum(edge.distance_mm * ns_per_mm)
+ sum(node_overhead_ns[n] for n in path)
+ credit_size_bytes / bottleneck_bw_on_path
```
The router auto-appends `.pe_dma` to the source only, so the
destination MUST be spelled with the explicit `.pe_dma` suffix or
`find_path` raises and the credit silently teleports at zero cost
(latent bug fixed alongside this update).
`tl.recv` blocks on the credit-emit completion (recv yields-from
`_delayed_credit_send` rather than spawning it as a fork). This puts
the credit-return cost on the receiver's `pe_exec_ns`, modeling the
IPCQ control-plane completing the consume-acknowledgement before
recv returns to the kernel — the protocol equivalent of a non-posted
`tl.store` waiting for an HBM ack on the raw DMA path.
That gives us:
- **Topology-proportional approximation**: an in-cube credit return is
automatically faster than a cross-SIP credit return.
- **No magic constants**: every nanosecond comes from
`compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
as data traffic.
- **No deadlock risk**: unlike piggyback, B can issue credit even when
it has no data to send back. `peer_credit_store.put` is unbounded.
- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
cost on recv balances the HBM ack-trip cost RAW pays on the sender.
#### Component coupling — SimPy Store channel
PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init
time, **a SimPy Store is wired between the two** (a per-direction
fast-path channel) and credit metadata is `put` into that store.
```python
class PeIpcqComponent:
def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
yield env.timeout(latency_ns)
yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...))
```
Backend init wires both directions of the fast-path channel as part of
fan-out (see `IpcqInitMsg` in D12).
#### Credit-return fast path limitations
- `credit_size_bytes` is an estimate (typically 1664 bytes).
- The fast path is **excluded from vc_comm BW contention** (separate
wire). Real HW credit-return wires are very lightweight, so this is a
reasonable first approximation.
- A follow-up ADR can: model the credit fast path as a separate link
(BW limit + contention), or switch to piggyback (`credit_return_mode:
piggyback`).
#### PE_DMA's added responsibility
When `vc_comm` receives a token, PE_DMA processes it as the following
sequence: pay the Transaction's terminal BW drain, then atomically
write data and forward metadata. **No SimPy yield is allowed between
the data write and the metadata forward** (invariant I6). The drain
yield must sit before the atomic block, not inside it:
```python
def _on_vc_comm_recv(self, env, txn):
# Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
# sender PE_DMA). MUST happen before the atomic block so recv only
# wakes after the bytes have "landed".
drain = getattr(txn, "drain_ns", 0.0)
if drain > 0:
yield env.timeout(drain)
token = txn.request
# ── ATOMIC: no yield between these two operations ──
data = self._memory_store.read(token.src_space, token.src_addr,
shape=..., dtype=...)
self._memory_store.write(token.dst_endpoint.buffer_kind,
token.dst_addr, data)
# 2. Forward metadata to the local PE_IPCQ
yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
# ───────────────────────────────────────────────────
```
The final `put` is yieldable but uses an unbounded internal store, so
it completes in a single step. That `put` is the closing call of the
atomic block; nothing may be inserted before it.
#### Drain-at-inbound semantics (D9 timing model)
The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
is paid at each forwarding component via `run()`, and the remaining
BW drain is paid once at the Transaction's terminal. Every non-IPCQ
Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
`ComponentBase._forward_txn` at the terminal node. For IPCQ the
destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
(so IPCQ-specific data write + metadata forward can happen), so **the
drain MUST be paid explicitly at the top of that handler** to keep
IPCQ's timing model on par with every other fabric Transaction.
Side-effects of paying drain here:
- **SRC `tl.send`** is unchanged — fire-and-forget semantics are
preserved because the sender PE_DMA does not `yield sub_done`. The
`sub_done.succeed()` call (made after metadata forward below) is an
event with no listener on the sender side.
- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
forward now happens after the drain, recv observes the full fabric
transfer time including bandwidth cost.
Matches the physical picture: send dispatches and leaves; recv waits
until the bytes have actually been drained into its inbox.
### D9.5. ADR-0020 (2-pass) integration
`tl.send` / `tl.recv` integrates with ADR-0020's two-pass model. Phase
1 simulates timing **and** moves data via MemoryStore; Phase 2 enables
op-log-based correctness verification.
#### Phase 1 (timing + data)
D9 models head and tail updates with two different mechanisms:
- **Send-side (head update)** — DMA payload piggyback. Data write and
metadata forward happen in the same SimPy step → automatic atomic
visibility.
- **Recv-side (tail credit return)** — fast-path SimPy Store channel
with bottleneck-BW latency, then `peer_tail_cache` update.
Together they preserve ring-buffer pointer consistency.
The op-log records `op_kind="ipcq"` entries for sends (with
`src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with
`recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).
Two recv modes:
- **`return_slot`** (default): the slot address is returned to the
kernel. Zero-copy.
- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`,
PE_IPCQ copies the slot data into the user dst.
#### Phase 2 (op_log replay)
When `DataExecutor` encounters an `op_kind="ipcq"` record:
- **send**: idempotent `src → dst` ndarray write.
- **recv (`return_slot`)**: no-op (the slot already holds the data).
- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.
IPCQ ops are pure data movement — Phase 2 has nothing extra to compute.
The downstream GEMM / Math ops in `DataExecutor` will consume the data
and naturally validate correctness.
### D10. Host CCL init keeps the PyTorch shape
The host code looks just like real PyTorch DDP. `init_process_group`
creates the backend object; it does **not** receive IPCQ knobs
(neighbor topology, buffer_kind, backpressure …).
```python
# benches/ccl_allreduce.py — same shape as real PyTorch
def worker(rank, world_size, torch):
dist = torch.distributed
dist.init_process_group(backend="ahbm") # reads ccl.yaml + topology
tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
tensor.copy_(torch.from_numpy(init))
dist.all_reduce(tensor, op="sum")
```
The IPCQ configuration is decided by the backend at
`init_process_group` time: it loads `ccl.yaml`, picks the algorithm,
and pushes IPCQ neighbor tables to every participating PE_IPCQ. The
host code never has to know about IPCQ.
A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`.
Switching algorithms is purely a `ccl.yaml` change — no host edits
required.
#### Init flow (eager)
1. `init_process_group(backend="ahbm")` is called.
2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
3. Pulls topology + buffer_kind + backpressure + slot config from
`algorithms[<algo>]`.
4. **Immediately** installs neighbor tables on every PE_IPCQ
(sideband or fabric `IpcqInitMsg`).
5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally —
PE_IPCQ is already prepared whether the kernel is a CCL kernel or
not.
### D11. CCL config file (`ccl.yaml`)
IPCQ config and algorithm metadata live in a separate YAML file,
following the same pattern as `components.yaml` and `topology.yaml`.
A single benchmark execution runs one algorithm
(`defaults.algorithm`). Switching algorithms means editing
`defaults.algorithm` only.
```yaml
defaults:
algorithm: ring_allreduce_tcm
buffer_kind: tcm # tcm | hbm | sram
backpressure: sleep # poll | sleep
n_slots: 8
slot_size: 4096
vc_chunk_size: 256
ipcq_credit_size_bytes: 16
algorithms:
ring_allreduce_tcm:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d # builtin name or "custom"
buffer_kind: tcm
n_elem: 8 # optional, per-algorithm tile width
tree_allreduce_7:
module: kernbench.ccl.algorithms.tree_allreduce
topology: tree_binary
buffer_kind: tcm
world_size: 7 # algorithm-level override
n_elem: 16
custom_mesh:
module: kernbench.ccl.algorithms.custom_mesh
topology: custom # the module supplies its own neighbors()
```
`world_size` is **not set in `defaults`**. The backend resolves it via:
`algorithm-level override > defaults override > topology spec`. The
last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP
where `WORLD_SIZE` comes from env vars rather than config files.
#### Algorithm module structure
Each algorithm module exports two hooks — `kernel` (required) and
`neighbors` (optional) — plus a `kernel_args` helper that the
backend uses to populate positional kernel arguments at `all_reduce`
time:
```python
# src/kernbench/ccl/algorithms/ring_allreduce.py
def kernel_args(world_size: int, n_elem: int) -> tuple:
return (n_elem, world_size)
def kernel(t_ptr, n_elem, world_size, tl):
"""Required — the PE kernel.
IPCQ is already installed by the backend before this is called.
The kernel only uses the four-direction send / recv API.
"""
...
def neighbors(rank, world_size, neighbor_map):
"""Optional — override the builtin topology's neighbor map.
Returns a new dict, the modified-in-place dict, or None to keep the
builtin map.
"""
return None
```
#### `neighbors` override patterns
- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
- **Pattern B — replace entirely**: ignore `neighbor_map` and return a
brand-new dict.
- **Pattern C — keep builtin**: omit `neighbors` or return None.
#### Builtin topologies
| topology | direction set |
|----------|---------------|
| `ring_1d` | E, W |
| `ring_1d_unidir` | E only |
| `mesh_2d` | N, S, E, W |
| `tree_binary` | parent, child_left, child_right |
| `none` | (empty) — algorithm must supply `neighbors()` |
#### Adding a new algorithm
1. Write `kernel` and `kernel_args` in
`src/kernbench/ccl/algorithms/<algo>.py`.
2. Add an entry in `ccl.yaml`'s `algorithms` section.
3. (Optional) provide `neighbors()` for custom topology.
4. Set `defaults.algorithm` to the new algorithm.
The host bench (`benches/ccl_allreduce.py`) does not change.
### D12. Message / token schema
The new message types added by this ADR. They live in
`src/kernbench/common/pe_commands.py` and
`src/kernbench/runtime_api/kernel.py`.
#### `IpcqInitMsg` (sideband, fan-out at init)
The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors
`MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`).
Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`,
`my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store`
field — a `simpy.Store` instance pre-wired so the sender PE_IPCQ can
push `IpcqCreditMetadata` directly into the receiver's input queue.
#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)
Carries `direction`, source addr/space, nbytes, shape, dtype, and a
handle id. `data_op=True` so it lands in the op_log.
#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)
Carries `direction` (or None for round-robin), `recv_mode`
(`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape,
dtype, blocking flag.
#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)
Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`)
plus the head metadata (`sender_seq`, `src_sip/cube/pe`,
`src_direction`). PE_DMA picks the channel by token type
(`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`).
The receiver's PE_DMA, on token arrival, performs the I6 atomic
sequence: write data into MemoryStore, then forward `IpcqMetaArrival`
to the local PE_IPCQ.
#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)
Carries `consumer_seq` (= my_tail), source PE coords, and source
direction. Travels through the dedicated SimPy Store channel rather
than `vc_comm`. Latency = `credit_size_bytes / bottleneck_bw_on_path`.
There is **no `IpcqPtrUpdate` event** — head updates flow via D9
piggyback, tail updates via the D9 fast-path channel.
### D13. Test strategy
Test plan:
#### T1. Unit tests (component-level)
- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure
immediately forwards a token; full peer slot triggers backpressure
(poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`;
round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`.
- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute`
/ `vc_comm` independent progress, chunk interleave, BW split.
- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d /
mesh_2d / tree_binary correctness, mesh_2d non-square →
`ValueError`, custom resolver returns the module's `neighbors`.
#### T2. Integration tests (E2E send/recv)
- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional
no-deadlock), 4×4 mesh.
- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode
records `ipcq` ops in op_log; DataExecutor produces correct
`out.data`.
#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)
`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA
consistency, per-`buffer_kind` allocation.
#### T4. Regression
All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for
non-CCL benches.
#### T5. Performance / overhead
Single send/recv pair latency = (DMA latency) + (IPCQ overhead).
Should be close to a regular PE_DMA write of the same nbytes (IPCQ
overhead < 100 ns).
### D14. Invariants and failure modes
#### Invariants
I1. **Slot lifecycle exactly-once**: one send → exactly one recv.
I2. **Pointer monotonicity**: `my_head` / `my_tail` strictly
non-decreasing; `sender_seq` strictly increasing.
I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank
B, then rank B's reverse-direction peer must be rank A. Verified at
init.
I4. **`buffer_kind` consistency**: all PEs in a process group share
the same `buffer_kind` (no mixed mode in the first cut).
I5. **op_log ordering**: send → DMA complete → recv possible. The
t_start order in op_log respects this causality.
I6. **Atomic data + metadata visibility (MUST)**: at the receiver
side, data write (`MemoryStore.write`) and metadata forward
(`peer_head_cache` update) **must execute in the same SimPy step**.
No yield is allowed between the two operations in PE_DMA's vc_comm
handler. Code review must reject any inserted `yield` (or `yield
from`) — it would create a race where head_cache becomes visible
before or after the data.
I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6,
the step in which `peer_head_cache > my_tail` becomes truthy is the
same step in which the slot data is observable.
#### Failure modes (runtime errors)
F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction
`IpcqInvalidDirection`, simulation aborts.
F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched
send and recv. Not validated by default; opt-in strict mode catches
it (`strict_validation: true` on a PE_IPCQ node attrs).
F3. **Deadlock detection (timeout-based)**: the simulator empties its
schedule while a send/recv is still pending → engine raises
`IpcqDeadlock` and embeds a pointer dump.
F4. **Backend init failure**: missing `defaults.algorithm`, missing
`algorithms[name]`, module import failure, topology validation
failure (I3, I4) — all raised at `init_process_group` time.
F5. **Slot full + infinite backpressure**: the peer never recvs.
Surfaces as F3 timeout.
#### Diagnostics
- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as
`(rank, t, dir, nbytes)`.
- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)`
prints every PE_IPCQ ring buffer's `my_head`, `my_tail`,
`peer_head_cache`, `peer_tail_cache`.
- **Deadlock dump**: on hang the engine includes the pointer dump in
the `IpcqDeadlock` exception message.
### D15. Algorithm-author cheat sheet
Full step-by-step lives in
[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
shortest version:
| Things you touch | Things you don't |
|------------------|-------------------|
| `src/kernbench/ccl/algorithms/<your_algo>.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
| (Optional) `tests/test_<your_algo>.py` mock test | PE_IPCQ component, AhbmCCLBackend |
5-step flow: write the kernel → register in `ccl.yaml` → optional
`neighbors` override → optional mock unit test → SimPy validation via
`kernbench run --bench ccl_allreduce --verify-data`.
Common mistakes: using a direction that wasn't installed, sends
without matching recvs (deadlock), dtype/shape disagreement, assuming
fairness from `tl.recv()` round-robin, confusing
`tl.num_programs(axis)` with the CCL group size.
---
## Non-goals
- **Host collective**: a model where `dist.all_reduce` itself moves
data on the host side is out of scope. This ADR only covers
communication that happens inside the PE kernel.
- **All-reduce algorithms**: ring / tree / etc. live in algorithm
modules and can be added without amending this ADR.
- **Reliability / error handling**: link faults, send/recv failure
recovery, etc. are out of scope.
- **NoC arbiter precision**: dynamic VC contention is left for a future
ADR (see D8).
---
## Open questions
- **VC arbitration accuracy** — the first cut uses deterministic
chunk interleave + weighted round-robin; heavy contention may report
optimistic latency. A NoC arbiter component can be added later.
- **Credit return BW model** — the fast path is currently outside the
fabric BW contention model. Can be modeled as a separate link or
switched to piggyback (`credit_return_mode: piggyback`).
- **Ring buffer slot allocation metadata** — whether the host pushes
IPCQ buffer metadata via sideband or via a fabric message similar to
`MmuMapMsg` is open.
- **VC BW split default** — 50/50 vs. weighted (e.g. 80/20). Exposed in
`ccl.yaml`; default value TBD.
- **Direction count** — 4 (N/S/E/W) is fixed in the first cut; 6
(with Up/Down for 3D) or N (variable) is future work.
- **Multi-tile aggregation primitives** — whether
`tl.recv_all` or similar is needed for fan-in.
- **Round-robin recv fairness** — current weak fairness can starve;
strict fairness counter is future work.
- **Deadlock detection precision** — currently timeout-based; a
realtime wait-for graph would enable deterministic detection.
---
## Consequences
### Positive
- PE-to-PE direct communication enables CCL kernels to be written.
- Host stays minimal (just `launch`), synchronization happens inside
the PE → strong compute / comm overlap.
- VCs eliminate HoL blocking → collective latency is not blocked by
compute traffic.
- Buffer placement and backpressure mode are init-time parameters →
easy to benchmark.
- Four-direction logical neighbors → host is free to map
ring/mesh/tree algorithms.
### Negative
- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE.
- VC arbitration is a first-order approximation; heavy contention
scenarios may report slightly optimistic latency vs real HW (D8).
- Chunk-level interleave makes PE_DMA implementation more complex.
File diff suppressed because it is too large Load Diff
+59 -52
View File
@@ -6,43 +6,46 @@ Accepted
## Context
### 목표
### Goal
`torch.distributed` collective 호출의 참여 단위(rank)를 **SIP**(device)
경계에 맞춘다. 실제 PyTorch DDP/TP 스크립트와 **호스트 레벨에서 구분 없이**
읽히는 bench 코드를 목표로 한다.
Align the participation unit (rank) of `torch.distributed` collective calls
to the **SIP** (device) boundary. The aim is bench code that, at the host
level, reads **indistinguishably** from real PyTorch DDP/TP scripts.
real PyTorch와 비교:
Comparison with real PyTorch:
| 차원 | real PyTorch | KernBench |
| Dimension | real PyTorch | KernBench |
| --- | --- | --- |
| 프로세스 모델 | N개 프로세스, 각 1 GPU | 1 프로세스, N greenlet, 1 SIP |
| `get_rank()` | `RANK` env var | greenlet-local 레지스트리 |
| `get_world_size()` | `WORLD_SIZE` env var | topology의 SIP 수 |
| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
| `get_rank()` | `RANK` env var | greenlet-local registry |
| `get_world_size()` | `WORLD_SIZE` env var | SIP count from topology |
| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
| `mp.spawn` | OS 프로세스 fork | greenlet fan-out |
| `mp.spawn` | OS process fork | greenlet fan-out |
### 풀어야 할 문제
### Problems to solve
1. **공개 API에서 rank = SIP** — bench worker가 PE 개념을 알지 않도록.
2. **Greenlet-local rank/device tracking** — 1-프로세스 모델 안에서 각
worker greenlet이 자기 rank / 자기 SIP를 정확히 식별.
3. **Tensor placement = structural (sip, cube, pe)** — rank가 SIP이면
기본 텐서 배치도 구조적 좌표로 표현되어야 함.
1. **Public API where rank = SIP** so bench workers do not have to know
about the PE concept.
2. **Greenlet-local rank/device tracking** — within the 1-process model,
each worker greenlet must correctly identify its own rank / its own SIP.
3. **Tensor placement = structural (sip, cube, pe)** — if rank is SIP,
the default tensor placement should also be expressed in structural
coordinates.
### Non-problem ( ADR)
### Non-problem (outside this ADR)
- IPCQ direction addressing → ADR-0025
- `DPPolicy.sip`/`num_sips` 제거 → ADR-0026
- Removing `DPPolicy.sip`/`num_sips` → ADR-0026
- Megatron-style TP → ADR-0027
- DTensor → ADR-0028 (future)
- Worker scheduling / `mp.spawn` / collective drain / exception cleanup
→ ADR-0027 D0/D1
- Collective algorithm 구현 (intercube_allreduce, SFR config) → ADR-0032
- Collective algorithm implementation (intercube_allreduce, SFR config)
→ ADR-0032
## Decision
### D1. rank = SIP (world_size 해석)
### D1. rank = SIP (world_size resolution)
```python
def _resolve_world_size(self) -> int:
@@ -55,8 +58,8 @@ def _resolve_world_size(self) -> int:
return int(spec.get("system", {}).get("sips", {}).get("count", 1))
```
우선순위: 알고리즘 override > defaults override > SIP count. `ccl.yaml`
override는 legacy "rank = PE" 테스트 경로로 유지.
Priority order: algorithm override > defaults override > SIP count. The
`ccl.yaml` override is retained as the legacy "rank = PE" test path.
### D2. Greenlet-local rank registry (+ debug warning)
@@ -83,11 +86,11 @@ class DistributedContext:
return int(self._rank_by_greenlet[g])
```
### D3. `torch.ahbm.set_device(rank)` — SIP 바인딩
### D3. `torch.ahbm.set_device(rank)` — SIP binding
KernBench 백엔드 이름은 `ahbm` (ADR-0023). Real PyTorch
`torch.cuda.set_device(r)`이지만 우리는 CUDA가 아니므로 honestly-named
namespace를 사용한다.
The KernBench backend name is `ahbm` (ADR-0023). Real PyTorch uses
`torch.cuda.set_device(r)`, but since we are not CUDA we use an
honestly-named namespace.
```python
class _AhbmNamespace:
@@ -113,10 +116,12 @@ class _AhbmNamespace:
# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
```
**PyTorch 2.x style 병행 지원**: 최신 PyTorch는 device-agnostic한
`torch.accelerator` 네임스페이스를 지향 (`torch.accelerator.set_device_index(r)`,
`torch.accelerator.current_device_index()`). Device vendor에 종속되지 않는
코드를 쓰려는 사용자를 위해 KernBench도 이 표면을 병행 지원한다.
**PyTorch 2.x style parallel support**: Recent PyTorch is moving toward a
device-agnostic `torch.accelerator` namespace
(`torch.accelerator.set_device_index(r)`,
`torch.accelerator.current_device_index()`). To support users who want to
write code that is not tied to a specific device vendor, KernBench also
exposes this surface in parallel.
```python
class _AcceleratorNamespace:
@@ -141,23 +146,23 @@ self.ahbm = _AhbmNamespace()
self.accelerator = _AcceleratorNamespace(self.ahbm) # alias
```
Bench 작성자는 다음 중 하나를 선택 — 둘 다 내부적으로 같은 레지스트리를 보유:
Bench authors may choose either — both share the same registry internally:
```python
torch.ahbm.set_device(rank) # KernBench-native, explicit backend
torch.accelerator.set_device_index(rank) # PyTorch 2.x device-agnostic
```
### D4. Tensor placement = structural (sip, cube, pe) 좌표
### D4. Tensor placement = structural (sip, cube, pe) coordinates
`resolve_dp_policy` `target_sip`을 직접 받아 구조적 좌표로 placement 생성.
세부는 ADR-0026.
`resolve_dp_policy` takes `target_sip` directly and produces placement in
structural coordinates. Details in ADR-0026.
```python
# RuntimeContext._create_tensor
current_sip = self.ahbm.current_device() # (D3 naming)
if current_sip is None:
current_sip = 0 # single-driver fallback (D2와 일관)
current_sip = 0 # single-driver fallback (consistent with D2)
placement = resolve_dp_policy(
dp, shape=shape_2d, itemsize=itemsize,
num_pe=eff_num_pe, num_cubes=eff_num_cubes,
@@ -165,29 +170,29 @@ placement = resolve_dp_policy(
)
```
Post-hoc `pe_index` shifting 없음 — ShardSpec `(sip, cube, pe)` 구조적
좌표를 직접 보유. ShardSpec 상세는 ADR-0026.
No post-hoc `pe_index` shifting — ShardSpec carries the `(sip, cube, pe)`
structural coordinates directly. ShardSpec details in ADR-0026.
---
## Dependencies
- **ADR-0023** (IPCQ): backend `ahbm` namespace의 기원.
- **ADR-0026** (DPPolicy intra-device): D4의 `resolve_dp_policy` 시그니처와
ShardSpec의 구조적 좌표 표현.
- **ADR-0027** (Megatron TP + scheduler): worker scheduling, `mp.spawn`,
collective drain, exception cleanup의 구현 기준.
- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature
used by D4 and the structural-coordinate representation of ShardSpec.
- **ADR-0027** (Megatron TP + scheduler): the implementation baseline for
worker scheduling, `mp.spawn`, collective drain, and exception cleanup.
---
## Non-goals
- **IPCQ protocol 수정**: ADR-0023 유지.
- **DPPolicy 필드 정리**: ADR-0026.
- **Modifying the IPCQ protocol**: ADR-0023 remains as-is.
- **Cleaning up DPPolicy fields**: ADR-0026.
- **Megatron-style TP**: ADR-0027.
- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
- **Collective algorithm 구현**: ADR-0032.
- **Multi-node (프로세스 간)**: 단일 프로세스.
- **Collective algorithm implementation**: ADR-0032.
- **Multi-node (cross-process)**: single process only.
---
@@ -195,12 +200,14 @@ Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적
### Positive
- **Bench = real PyTorch DDP** (공개 API 관점).
- **Greenlet-local rank**: 1-프로세스 모델에서 cross-rank correctness 가능.
- **Structural placement 좌표**: ADR-0026 / ADR-0027 / ADR-0032의 다른 ADR이
`(sip, cube, pe)` 3튜플 위에서 일관되게 동작.
- **Bench = real PyTorch DDP** (from the public-API point of view).
- **Greenlet-local rank**: enables cross-rank correctness within the
1-process model.
- **Structural placement coordinates**: lets the other ADRs (ADR-0026 /
ADR-0027 / ADR-0032) operate consistently on top of the `(sip, cube, pe)`
3-tuple.
### Neutral
- IPCQ PE-level protocol (ADR-0023) 불변.
- IO_CPU 역할 불변 (기존 transit 그대로).
- IPCQ PE-level protocol (ADR-0023) is unchanged.
- IO_CPU role is unchanged (existing transit behavior preserved).
@@ -6,51 +6,58 @@ Accepted (Revision 2 — Address-based matching; peer_direction field dropped)
## Context
### 목표
### Goal
ADR-0023의 IPCQ protocol에서 **"어느 direction pair를 통한 전송인가"의 식별**을
topology / dict-order에 의존하지 않고 **주소 기반**으로 일관되게 한다.
2-rank bidirectional ring (또는 여러 direction이 동일 peer를 가리키는
topology 일반)에서 정확히 동작하도록 한다.
In the IPCQ protocol of ADR-0023, make the **identification of "which
direction pair this transfer belongs to"** consistent and **address-based**,
without depending on topology / dict-order. It must work correctly in a
2-rank bidirectional ring (and more generally in any topology where
multiple directions point to the same peer).
### 드러난 버그 — 2-rank bidirectional ring
### The bug surfaced — 2-rank bidirectional ring
`ring_1d(rank, world_size=2)``{"E": 1, "W": 1}` (rank 0). 양쪽 방향이 같은 peer.
`ring_1d(rank, world_size=2)``{"E": 1, "W": 1}` (rank 0). Both directions
point to the same peer.
**버그 1 (install)**:
- `reverse_direction(0, 1)`dict order로 "E" 반환 (틀림, "W"가 맞음 — opposite
direction convention)
- rank 0 E entry `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`로 설정
- tl.send(E) → data sip1 E-rx buffer로 landing (should be W-rx)
**Bug 1 (install)**:
- `reverse_direction(0, 1)`returns "E" by dict order (wrong; "W" is the
correct answer — opposite-direction convention)
- rank 0's E entry is set with `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`
- tl.send(E) → data lands in sip1's E-rx buffer (should be W-rx)
**버그 2 (runtime)**:
- 설령 install이 올바른 주소로 설정해도, receiver의 `_handle_meta_arrival`
sender 좌표만으로 direction 매칭 → 첫 direction (E) 승
- peer_head_cache[E] 증가, peer_head_cache[W]는 불변
- Kernel의 tl.recv(W)는 peer_head_cache[W] 대기 → 영원히 블록 → IpcqDeadlock
**Bug 2 (runtime)**:
- Even if install set up the correct address, the receiver's
`_handle_meta_arrival` matches direction by sender coordinates only → the
first direction (E) wins
- peer_head_cache[E] is incremented; peer_head_cache[W] is unchanged
- The kernel's tl.recv(W) waits on peer_head_cache[W] → blocks forever →
IpcqDeadlock
### 근본 원인
### Root cause
두 축에서 동일 문제:
1. **Install-time pairing**: "내 direction과 peer의 어느 direction이 짝인가"
결정이 dict-iteration-order에 의존 → 여러 direction이 같은 peer를 가리킬 때
fragile
2. **Runtime identification**: "어느 qp를 업데이트해야 하는가" 결정이 sender
좌표만으로 이루어짐 → direction 중복 시 ambiguous
The same issue along two axes:
1. **Install-time pairing**: deciding "which of my directions pairs with
which direction of the peer" depends on dict-iteration-order → fragile
when multiple directions point to the same peer
2. **Runtime identification**: deciding "which qp should be updated" is
based on sender coordinates alone → ambiguous when directions are
duplicated
### 해결 방향 — address-based matching
### Solution direction — address-based matching
PE rx buffer는 **direction별로 고유한 주소 range**에 위치 (rx_base_pa +
direction_idx × bytes_per_direction). 따라서:
Each PE's rx buffer sits at a **unique address range per direction**
(rx_base_pa + direction_idx × bytes_per_direction). Therefore:
- **Runtime**: sender coord 대신 **dst_addr 범위**로 매칭 → unambiguous
- **Install**: opposite-direction 우선 선택 heuristic (ring / mesh의 자연스러운
대칭성)
- `peer_direction` 같은 이중 메타데이터 불필요 — **주소가 single source of
truth**
- **Runtime**: match by **dst_addr range** instead of sender coord →
unambiguous
- **Install**: prefer the opposite direction as a heuristic (the natural
symmetry of ring / mesh)
- No need for redundant metadata like `peer_direction` — **address is the
single source of truth**
이 설계는 **PhysAddr 전환 (ADR-0030)과 독립적**으로 작동. 현재 synthetic
주소든 PhysAddr든 direction별 range 유일성만 지켜지면 동일하게 적용 가능.
This design works **independently of the PhysAddr transition (ADR-0030)**.
Whether the current addresses are synthetic or PhysAddr, the same approach
applies as long as the per-direction range uniqueness is preserved.
---
@@ -91,17 +98,17 @@ def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
return None
```
호출부:
Call site:
```python
for d, peer_rank in nbrs.items():
peer_dir = reverse_direction(r, peer_rank, d) # my_dir 전달
peer_dir = reverse_direction(r, peer_rank, d) # pass my_dir
if peer_dir is None:
continue
...
```
### D2. Runtime — `_handle_meta_arrival` dst_addr 매칭
### D2. Runtime — `_handle_meta_arrival` dst_addr matching
`src/kernbench/components/builtin/pe_ipcq.py`:
@@ -138,9 +145,10 @@ def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
# Unknown dst_addr — diagnostic log (should not happen under correct install)
```
Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정.
The sender-coordinate check is **removed**. `dst_addr` already determines
the direction.
### D3. Credit — `dst_rx_base_pa` 필드 추가
### D3. Credit — add `dst_rx_base_pa` field
`src/kernbench/common/ipcq_types.py`:
@@ -148,25 +156,26 @@ Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정.
@dataclass(frozen=True)
class IpcqCreditMetadata:
consumer_seq: int
dst_rx_base_pa: int # NEW: sender peer.rx_base_pa와 매칭용
# 기존 필드 (diagnostic / log 용도로 유지)
dst_rx_base_pa: int # NEW: matches the original sender's peer.rx_base_pa
# Existing fields (kept for diagnostic / logging purposes)
src_sip: int
src_cube: int
src_pe: int
src_direction: str
```
Credit 생성 시 (`_delayed_credit_send`): 자기 direction의 `my_rx_base_pa`
`dst_rx_base_pa`로 실어 보냄 (이게 상대방이 sender 당시 썼던 `peer.rx_base_pa`).
When the credit is generated (`_delayed_credit_send`): it carries this
direction's `my_rx_base_pa` as `dst_rx_base_pa` (this is the
`peer.rx_base_pa` the other side used when it was the sender).
수신 측 (`_credit_worker`):
Receiver side (`_credit_worker`):
```python
def _credit_worker(self, env):
while True:
credit = yield self._credit_inbox.get()
for d, qp in self._queue_pairs.items():
# peer rx_base_pa credit dst_rx_base_pa가 일치하는 qp 찾기
# Find the qp whose peer rx_base_pa matches the credit's dst_rx_base_pa
if qp["peer"].rx_base_pa == credit.dst_rx_base_pa:
qp["peer_tail_cache"] = max(qp["peer_tail_cache"],
credit.consumer_seq)
@@ -178,41 +187,45 @@ def _credit_worker(self, env):
break
```
Sender 좌표 검사 제거. `dst_rx_base_pa` 매칭으로 unambiguous.
Sender-coordinate check removed. Matching by `dst_rx_base_pa` is
unambiguous.
### D4. `IpcqInitEntry`에 `peer_direction` 필드를 **추가하지 않음**
### D4. Do **not** add a `peer_direction` field to `IpcqInitEntry`
ADR-0025 rev 1에서 제안했던 `IpcqInitEntry.peer_direction`**불필요**.
이유:
- Meta arrival dst_addr로 매칭 (D2)
- Credit dst_rx_base_pa로 매칭 (D3)
- qp에 peer_direction 저장 필요 없음
- Install은 rx_base_pa 계산 시 내부적으로만 peer_dir 사용 (`reverse_direction`)
The `IpcqInitEntry.peer_direction` proposed in ADR-0025 rev 1 is
**unnecessary**. Reasons:
- Meta arrivals are matched by dst_addr (D2)
- Credits are matched by dst_rx_base_pa (D3)
- No need to store peer_direction on qp
- Install only uses peer_dir internally when computing rx_base_pa
(`reverse_direction`)
IpcqInitEntry schema 변경 없음. Rev 1 대비 **단순화**.
No change to the IpcqInitEntry schema. **Simpler** than rev 1.
### D5. `IpcqDmaToken.src_direction` 유지 (diagnostic only)
### D5. Keep `IpcqDmaToken.src_direction` (diagnostic only)
기존 `src_direction` 필드는 제거하지 않는다. 다음 용도로 유지:
- Logging / trace: `KERNBENCH_CCL_TRACE=1` 출력의 `(rank, t, dir, nbytes)`
- Diagnostics: pointer_dump 등에서 direction 표시
- 미래 확장 여지
The existing `src_direction` field is not removed. It is retained for:
- Logging / trace: the `(rank, t, dir, nbytes)` output of
`KERNBENCH_CCL_TRACE=1`
- Diagnostics: showing direction in pointer_dump, etc.
- Room for future extension
Runtime matching `dst_addr`만 사용.
Runtime matching uses only `dst_addr`.
### D6. Invariants (ADR-0023 I3 강화)
### D6. Invariants (strengthens ADR-0023 I3)
**I3 (엄격)**: 각 방향 pair `(my_direction, peer_direction)`에 대해 my
rx_base peer rx_base는 **별개의 direction slot**을 가리켜야 함. Install은
이를 보장해야 한다 (reverse_direction opposite-preference).
**I3 (strict)**: For each direction pair `(my_direction, peer_direction)`,
my rx_base and peer rx_base must point to **distinct direction slots**.
Install must guarantee this (reverse_direction opposite-preference).
**I3.1 (신규)**: 모든 qp에 대해 `qp["my_rx_base_pa"]``qp["peer"].rx_base_pa`
서로 disjoint한 주소 range를 점유한다 (다른 direction의 buffer는 절대 겹치지
않음). 이것이 D2/D3의 주소-기반 매칭의 전제.
**I3.1 (new)**: For every qp, `qp["my_rx_base_pa"]` and
`qp["peer"].rx_base_pa` occupy mutually disjoint address ranges (buffers
of different directions never overlap). This is the prerequisite for the
address-based matching of D2/D3.
Install time에 검증 가능:
Verifiable at install time:
```python
# ccl/install_plan.py: build_install_plans 끝에 assertion
# ccl/install_plan.py: assertion at the end of build_install_plans
all_rx_ranges = set()
for plan in plans:
for pe_install in plan.pe_installs:
@@ -228,36 +241,42 @@ for plan in plans:
## Dependencies
- **ADR-0023** (IPCQ protocol): 본 ADR은 ADR-0023 runtime 매칭 로직 수정
(D2, D3) + install heuristic 개선 (D1). IPCQ 프로토콜의 semantic layer
변경은 없음.
- **ADR-0024** (launcher): 2-rank bidirectional ring이 실제 쓰이는 경우가
ADR-0024의 ws=SIP_count 모델. 본 ADR이 그 케이스를 작동시킴.
- **ADR-0030** (PhysAddr transition, stub): **독립적** — ADR-0025의
주소-기반 매칭은 현재 synthetic 주소든 PhysAddr이든 동일하게 작동.
- **ADR-0023** (IPCQ protocol): this ADR modifies ADR-0023's runtime
matching logic (D2, D3) and improves the install heuristic (D1). No
change to the IPCQ protocol's semantic layer.
- **ADR-0024** (launcher): the case where a 2-rank bidirectional ring is
actually used is the ws=SIP_count model of ADR-0024. This ADR makes that
case work.
- **ADR-0030** (PhysAddr transition, stub): **independent** — ADR-0025's
address-based matching works identically whether the current addresses
are synthetic or PhysAddr.
---
## Non-goals
- **IPCQ 주소 체계를 PhysAddr로 전환**: ADR-0030 scope. 본 ADR은 주소가 어떻게
인코딩되는가와 무관.
- **Multi-hop routing**: ADR-0023 D5의 single-hop DMA write 전제 유지.
- **Unidir ring 특수화**: `ring_1d_unidir`는 direction 하나만 있으므로 본 버그
무관.
- **Migrating IPCQ addressing to PhysAddr**: ADR-0030 scope. This ADR is
agnostic to how addresses are encoded.
- **Multi-hop routing**: the single-hop DMA write assumption of ADR-0023
D5 still holds.
- **Unidir ring specialization**: `ring_1d_unidir` only has a single
direction, so the bug does not apply.
---
## Open questions
- **주소 매칭 성능**: `_handle_meta_arrival``_credit_worker`가 qp를 선형
순회 (max 4 direction). 성능 영향 무시 가능 수준. 문제 시 dict lookup으로
전환 가능 (`_qp_by_rx_base`).
- **`IpcqDmaToken.src_direction` 필요성 재평가**: diagnostic 용도로만 남긴
필드를 계속 유지할지, 또는 logging 외부로 분리할지. 현재는 유지.
- **Install-time invariant 검증 cost**: D6의 I3.1 검증은 O(N_PE × N_direction)^2.
대형 topology에서 느려질 수 있음 → interval tree 등 자료구조로 개선 가능.
단순 구현 먼저.
- **Address-matching performance**: `_handle_meta_arrival` and
`_credit_worker` iterate qp linearly (max 4 directions). The performance
impact is negligible. If it becomes an issue, this can be switched to a
dict lookup (`_qp_by_rx_base`).
- **Re-evaluating the need for `IpcqDmaToken.src_direction`**: whether to
keep this field, which is only kept for diagnostics, or to split it out
of logging. Currently retained.
- **Cost of install-time invariant verification**: the I3.1 verification
of D6 is O(N_PE × N_direction)^2. It could be slow on large topologies
→ improvable via data structures such as interval trees. Simple
implementation first.
---
@@ -265,19 +284,26 @@ for plan in plans:
### Positive
- **단순함**: `peer_direction` 이중 메타데이터 제거. 주소가 single source of truth.
- **Unambiguous matching**: 모든 topology (direction 중복 포함)에서 동작.
- **Schema 변경 최소**: `IpcqInitEntry` 불변, `IpcqCreditMetadata`에 1 필드 추가.
- **PhysAddr 전환 (ADR-0030) 독립**: 주소-기반 매칭은 주소 인코딩 방식과 무관.
- **Diagnostic 유지**: `IpcqDmaToken.src_direction`은 로깅 용도로 존치.
- **Simplicity**: redundant `peer_direction` metadata removed. Address is
the single source of truth.
- **Unambiguous matching**: works on every topology (including duplicate
directions).
- **Minimal schema changes**: `IpcqInitEntry` unchanged, one field added
to `IpcqCreditMetadata`.
- **Independent of PhysAddr transition (ADR-0030)**: address-based matching
is agnostic to the address encoding.
- **Diagnostics retained**: `IpcqDmaToken.src_direction` is kept for
logging.
### Negative
- Runtime 매칭이 주소 비교로 바뀌어서 디버깅 시 "왜 peer_head_cache[E]가 아닌
W가 업데이트됐나" 같은 질문에 address range를 추적해야 함 (기존엔 direction
이름으로 충분). 해결: pointer_dump에 "direction ↔ rx_base_pa" 매핑 포함.
- Runtime matching is now by address comparison, so when debugging
questions like "why did peer_head_cache[W] update rather than [E]" one
has to follow the address range (previously the direction name was
enough). Mitigation: include a "direction ↔ rx_base_pa" mapping in
pointer_dump.
### Neutral
- IPCQ protocol의 semantic layer (sender가 dst_addr 계산, receiver가 수신)는
불변.
- The semantic layer of the IPCQ protocol (sender computes dst_addr,
receiver receives) is unchanged.
+130 -103
View File
@@ -1,4 +1,4 @@
# ADR-0026: DPPolicy = Intra-Device Only — sip/num_sips 필드 제거
# ADR-0026: DPPolicy = Intra-Device Only — remove sip/num_sips fields
## Status
@@ -6,16 +6,17 @@ Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)
## Context
### 목표
### Goal
`DPPolicy`를 **한 device(SIP) 내부의 cube × PE 분산**만 표현하는 순수한
intra-device 추상화로 명확화한다. SIP 간 분산(TP)은 별도 레이어로 분리
(ADR-0024의 `torch.ahbm.set_device(rank)` 또는 ADR-0027의 Megatron parallel
layers가 담당).
Clarify `DPPolicy` as a pure intra-device abstraction that only expresses
**cube × PE distribution within a single device (SIP)**. Inter-SIP
distribution (TP) is split into a separate layer (handled by ADR-0024's
`torch.ahbm.set_device(rank)` or by ADR-0027's Megatron-style parallel
layers).
## Decision
### D1. `DPPolicy`에서 `sip` + `num_sips` 필드 제거
### D1. Remove `sip` + `num_sips` fields from `DPPolicy`
```python
@dataclass(frozen=True)
@@ -32,15 +33,16 @@ class DPPolicy:
num_cubes: int | None = None
```
제거되는 필드: `sip`, `num_sips`.
Removed fields: `sip`, `num_sips`.
### D2. `ShardSpec` — structural (sip, cube, pe) 좌표, `pe_index` 완전 제거
### D2. `ShardSpec` — structural (sip, cube, pe) coordinates, `pe_index` fully removed
현재 `ShardSpec.pe_index` **global flat index** (`sip × cubes × pes + cube ×
pes + pe`). 이는 ADR-0024 D4이 "abstraction leakage"로 지적한 형태.
The current `ShardSpec.pe_index` is a **global flat index**
(`sip × cubes × pes + cube × pes + pe`). This is the form ADR-0024 D4
flagged as "abstraction leakage".
본 ADR에서 ShardSpec을 **structural 좌표로 재정의**하고, `pe_index`
property로도 **남기지 않는다**:
This ADR **redefines ShardSpec in structural coordinates** and **does
not even leave `pe_index` as a property**:
```python
# src/kernbench/policy/placement/dp.py (after)
@@ -59,28 +61,32 @@ class ShardSpec:
nbytes: int
```
**핵심 원칙**:
- ShardSpec의 정체성은 `(sip, cube, pe)` 3튜플.
- **`pe_index` property도 없음** — silent semantics drift 차단.
- Global flat을 기대한 기존 호출자는 `.pe_index` 접근 시 **즉시
`AttributeError`** → 반드시 구조적 좌표로 migration.
- Flat integer key가 필요한 국소 문맥 (예: 내부 dict lookup)은 호출자가
명시적으로 `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe`를 계산.
**Core principle**:
- The identity of ShardSpec is the `(sip, cube, pe)` 3-tuple.
- **No `pe_index` property either** — blocks silent semantics drift.
- Existing callers expecting global-flat get an **immediate
`AttributeError`** on `.pe_index` access → forced migration to
structural coordinates.
- Local contexts that genuinely need a flat integer key (e.g. internal
dict lookup) explicitly compute
`spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe` at the call
site.
**Property 제거 정당화**: KernBench는 사내 프로젝트로 call site가 한정되어
있음. Silent drift 위험 (의미만 바뀌고 타입은 같은 int) 대비 explicit breakage
(AttributeError)가 훨씬 안전.
**Justification for removing the property**: KernBench is an internal
project with a limited number of call sites. Explicit breakage
(AttributeError) is much safer than the risk of silent drift (semantics
change while the type stays int).
### D3. `resolve_dp_policy` `target_sip`을 받아 structural 좌표 생성
### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates
ADR-0024 D4의 계약 구현. Post-hoc shifting 없음.
Implements the contract of ADR-0024 D4. No post-hoc shifting.
```python
# src/kernbench/policy/placement/dp.py (after)
@dataclass(frozen=True)
class _LocalPeShard:
"""Internal — PE resolver의 반환. Cubelocal PE 식별자 + payload."""
"""Internal — return value of the PE resolver. Cube-local PE id + payload."""
local_pe: int # cube-local PE index (0..num_pe-1)
offset_bytes: int
nbytes: int
@@ -93,7 +99,7 @@ def resolve_dp_policy(
itemsize: int,
num_pe: int,
num_cubes: int = 1,
target_sip: int, # NEW — 어느 SIP에 배치할지 명시
target_sip: int, # NEW — explicitly state which SIP to place on
) -> list[ShardSpec]:
"""2-level resolution (cube × PE) on a specified SIP.
@@ -123,28 +129,30 @@ def resolve_dp_policy(
return all_shards
```
**내부 resolver** (`column_wise`, `row_wise`, `replicate`)`_LocalPeShard`
리스트 반환 — `local_pe` 필드명으로 **"cube-local PE identifier"임이 명시적**.
과거 `ShardSpec.pe_index`와 이름이 혼동되던 문제 해소.
**Internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a
list of `_LocalPeShard` — the `local_pe` field name makes it **explicit
that this is a "cube-local PE identifier"**. This resolves the previous
confusion with the name `ShardSpec.pe_index`.
**이름 규약 정리** (전체 ADR):
- `ShardSpec.pe`: 최종 외부 API — cube-local PE (structural coord)
- `_LocalPeShard.local_pe`: 내부 resolver 단계의 동일 의미
- `pe_index`: **제거**. 외부/내부 어디에도 남기지 않는다 (silent drift 차단의
부가 효과: 이름 재등장 없음).
**Naming convention summary** (whole ADR):
- `ShardSpec.pe`: the final external API — cube-local PE (structural coord)
- `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage
- `pe_index`: **removed**. Not retained anywhere, internal or external
(additional benefit of preventing silent drift: the name does not
reappear).
### D4. `_create_tensor` — 구조적 좌표로 직접 placement
### D4. `_create_tensor` — placement directly in structural coordinates
ADR-0024 D4 연속선. Post-hoc shifting 제거, 구조적 좌표를 `resolve_dp_policy`
호출 시점에 직접 지정.
Continuation of ADR-0024 D4. Post-hoc shifting removed; structural
coordinates are specified directly at the `resolve_dp_policy` call site.
```python
# context.py _create_tensor (after)
current_sip = self.ahbm.current_device()
if current_sip is None:
# Single-driver fallback (ADR-0024 D2와 일관).
# Launcher 기반 코드가 set_device()를 빼먹으면 조용히 SIP 0에 박히는
# 문제가 있음 → debug mode에서 경고.
# Single-driver fallback (consistent with ADR-0024 D2).
# In launcher-based code, forgetting set_device() silently sticks the
# tensor on SIP 0 — emit a warning in debug mode.
if os.environ.get("KERNBENCH_DEBUG"):
import warnings
warnings.warn(
@@ -161,38 +169,39 @@ placement = resolve_dp_policy(
itemsize=itemsize,
num_pe=eff_num_pe,
num_cubes=eff_num_cubes,
target_sip=current_sip, # ← 구조적 좌표 일차 지정
target_sip=current_sip, # ← structural coord specified up front
)
# placement의 각 ShardSpec은 이미 (sip=current_sip, cube=local, pe=local) 포함.
# 과거의 post-hoc shifting 블록은 완전히 제거.
# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
# The old post-hoc shifting block is removed entirely.
```
**모든** 텐서가 current device SIP에 배치됨. Multi-SIP 텐서를 만들고 싶으면
ADR-0027의 TP primitive 사용.
**Every** tensor is placed on the current device's SIP. If you need a
multi-SIP tensor, use the TP primitive of ADR-0027.
**Single-driver fallback의 trade-off**: set_device 없는 호출에서 SIP 0으로
default는 기존 single-driver 테스트 호환을 위해 유지. `KERNBENCH_DEBUG=1`
환경에서는 launcher 컨텍스트의 실수로 set_device 누락 시 조용히 잘못된 SIP에
배치되는 것을 감지할 수 있도록 warning.
**Trade-off of the single-driver fallback**: When set_device is not
called, defaulting to SIP 0 is kept for compatibility with existing
single-driver tests. With `KERNBENCH_DEBUG=1`, a warning is emitted so
that accidentally omitting set_device in a launcher context — which would
silently place the tensor on the wrong SIP — can be detected.
### D5. Downstream — allocator lookup은 구조적 tuple key
### D5. Downstream — allocator lookup by structural tuple key
기존 `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
Existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
```python
for spec in placement:
alloc = allocators[spec.pe_index] # ← AttributeError (property 제거됨)
alloc = allocators[spec.pe_index] # ← AttributeError (property removed)
```
`pe_index`가 없어졌으므로 구조적 좌표로 **강제** migration:
With `pe_index` gone, migration to structural coordinates is **forced**:
```python
for spec in placement:
alloc = allocators[(spec.sip, spec.cube, spec.pe)]
```
`_ensure_allocators`의 dict population도 tuple key:
The dict population in `_ensure_allocators` is also tuple-keyed:
```python
# context.py _ensure_allocators (after)
@@ -204,59 +213,71 @@ for sip_id in sip_range:
)
```
`_free_tensor`도 동일: 기존 `flat_idx = sip * ... + cube * ... + pe` 계산
블록 제거, `(shard.sip, shard.cube, shard.pe)` 직접 사용.
`_free_tensor` is the same: the old
`flat_idx = sip * ... + cube * ... + pe` computation block is removed,
and `(shard.sip, shard.cube, shard.pe)` is used directly.
**Tuple vs dataclass `PEIdentity`**: Tuple이 단순하고 hashable로 바로 써서
권고. `PEIdentity` 값객체는 명시적 타입 장점은 있지만 boilerplate가 크고 현재
allocator dict의 유일한 key라 오버엔지니어링. Tuple 유지.
**Tuple vs dataclass `PEIdentity`**: Recommend the tuple — it is simple
and hashable out of the box. A `PEIdentity` value object has the upside
of an explicit type, but the boilerplate is large and it is currently
the only key of the allocator dict, so it would be over-engineering.
Keep the tuple.
### D7. 하위 호환 — 불가 (cleanup ADR)
### D7. Backward compatibility — none (cleanup ADR)
이 ADR은 **breaking change**.
This ADR is a **breaking change**.
1. `DPPolicy(sip=...)` 또는 `DPPolicy(num_sips=...)` 호출`TypeError`
2. `ShardSpec.pe_index` 접근`AttributeError`
1. `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)``TypeError`
2. `ShardSpec.pe_index` access`AttributeError`
모두 **즉시 명시적 breakage**. Deprecation warning / fallback 경로 없음.
KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에 migration.
Both are **immediate, explicit breakage**. No deprecation warning /
fallback path. KernBench is an internal project with a bounded set of
call sites, so migration happens in one pass.
**Silent drift 차단**이 property 완전 제거의 주된 이점: global flat을 기대한
코드가 SIP-local 결과를 받아 조용히 잘못된 인덱싱을 할 가능성 제거.
**Blocking silent drift** is the main upside of fully removing the
property: code that expected a global flat could otherwise silently
receive a SIP-local result and index incorrectly — that possibility is
eliminated.
## Dependencies
- **ADR-0024** (launcher): `set_device(rank)` current-device scoping
SIP 배치 메커니즘 제공. 본 ADR은 그 위에 서서 DPPolicy를 순수 intra-device로
좁힘.
- **ADR-0027** (Megatron TP): 다중 SIP에 걸친 텐서가 필요한 경우의 대안 경로.
이 ADR 적용 후 multi-SIP use case는 ADR-0027로 이관.
- **ADR-0024** (launcher): `set_device(rank)` and current-device scoping
provide the SIP placement mechanism. This ADR sits on top and narrows
DPPolicy to pure intra-device.
- **ADR-0027** (Megatron TP): the alternative path when a tensor spans
multiple SIPs. After this ADR is applied, multi-SIP use cases move to
ADR-0027.
---
## Non-goals
- **`DPPolicy.cube` / `pe` 재설계**: 기존 replicate/column_wise/row_wise 의미
유지.
- **Tiling 정책 통합**: `tiled_column_major` / `tiled_row_major`는 그대로.
- **Multi-device 텐서 추상화 신규**: DTensor-like는 ADR-0028.
- **Redesign of `DPPolicy.cube` / `pe`**: existing
replicate/column_wise/row_wise semantics are kept.
- **Tiling policy consolidation**: `tiled_column_major` /
`tiled_row_major` stay as they are.
- **New multi-device tensor abstraction**: a DTensor-like is ADR-0028.
---
## Open questions
- **`_create_tensor`의 current_sip 기본값**: set_device 없는 호출에서 rank=0
(SIP 0)로 fallback할지, 아니면 error 낼지. 권고는 fallback (기존 single-driver
테스트와의 호환).
- **`test_sip_parallel.py` 재작성 범위**: 기존 단위 테스트의 의도를 유지하며
launcher 기반으로 옮기려면 추가 fixture 필요. 별도 작업으로 scope.
- **`DPPolicy``num_sips=None` 의미**: 필드가 없어지면 `num_sips` 개념 자체가
사라짐. Multi-SIP을 표현하고 싶으면 ADR-0027의 TP primitive를 쓰라는 것이
명시적 답.
- **Default value of current_sip in `_create_tensor`**: for calls without
set_device, whether to fall back to rank=0 (SIP 0) or to raise an
error. The recommendation is fallback (compatibility with existing
single-driver tests).
- **Scope of `test_sip_parallel.py` rewrite**: porting the existing unit
tests to the launcher base while preserving their intent requires
additional fixtures. Scoped as separate work.
- **Meaning of `num_sips=None` on `DPPolicy`**: once the field is gone,
the concept of `num_sips` disappears entirely. The explicit answer for
expressing multi-SIP is to use the TP primitive of ADR-0027.
**Resolved (이전 rev에서 open이었던 것들)**:
- ~~`ShardSpec.pe_index` property 존치 여부~~ → **완전 제거** (D2)
- ~~`_ensure_allocators` dict key 형식~~ → **tuple `(sip, cube, pe)`** (D5)
**Resolved (items that were open in earlier revs)**:
- ~~Whether to keep the `ShardSpec.pe_index` property~~ → **fully
removed** (D2)
- ~~Form of `_ensure_allocators` dict key~~ → **tuple `(sip, cube, pe)`**
(D5)
---
@@ -264,25 +285,31 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에
### Positive
- **개념 분리 명확**: DPPolicy = intra-device, TP = inter-device.
- **API 단순화**: DPPolicy 생성자 필드 ~33% 축소.
- **Structural 좌표 일관성**: ShardSpec이 `(sip, cube, pe)` 튜플로 표현 →
abstraction leakage 해소 (ADR-0024 D4 계약 충족).
- **`pe_index` 의미 명확**: SIP-local이 단일 해석. Global flat이 필요하면 명시.
- **Launcher 모델 일관성**: ADR-0024의 "1 worker per SIP" 모델이 유일한 SIP
경계 제어 메커니즘.
- **Clean conceptual separation**: DPPolicy = intra-device, TP =
inter-device.
- **API simplification**: about a 33% reduction in DPPolicy constructor
fields.
- **Structural-coordinate consistency**: ShardSpec is expressed as a
`(sip, cube, pe)` tuple → abstraction leakage resolved (the ADR-0024
D4 contract is satisfied).
- **Clear meaning of `pe_index`**: the single interpretation is
SIP-local. If global-flat is needed, it must be made explicit.
- **Launcher-model consistency**: ADR-0024's "1 worker per SIP" model is
the sole SIP-boundary control mechanism.
### Negative
- **Breaking change (explicit)**: `DPPolicy(sip=...)``TypeError`,
`spec.pe_index``AttributeError`. 모든 호출자 한 번에 수정 필요.
- **ShardSpec schema 변경**: `pe_index` 단일 필드 → `sip`/`cube`/`pe` 세 필드.
Downstream (`deploy_tensor`, `_free_tensor`, `_ensure_allocators`,
`allocators` dict key 등) 연쇄 수정.
- **Silent drift 없음**: property 완전 제거로 runtime에서 즉시 실패 →
migration leakage 원천 차단. (Negative가 아니라 explicit tradeoff)
- `test_sip_parallel.py` 재작성 비용.
`spec.pe_index``AttributeError`. All callers need to be fixed at
once.
- **ShardSpec schema change**: a single `pe_index` field becomes three
fields `sip`/`cube`/`pe`. Cascading edits downstream (`deploy_tensor`,
`_free_tensor`, `_ensure_allocators`, `allocators` dict key, etc.).
- **No silent drift**: with the property fully removed, runtime failure
is immediate → migration leakage is blocked at the source. (Not a
negative but an explicit tradeoff.)
- The cost of rewriting `test_sip_parallel.py`.
### Neutral
- 기존 `cube` / `pe` 필드 의미 불변.
- The meaning of the existing `cube` / `pe` fields is unchanged.
File diff suppressed because it is too large Load Diff