ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/

Establish English as the canonical ADR language, with Korean translations held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror). Promotion from adr-proposed/ to adr/ now writes the English text to adr/ and the Korean text to adr-ko/; the bidirectional sync rule is documented in CLAUDE.md.

- Migrate 30 ADRs in docs/adr/: 28 Korean-only ADRs translated to English, 2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix dropped). ADR-0023 EN regenerated against the KO source, which had a newer HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for the 4-folder layout, mark docs/adr-ko/ as a Derived Artifact, add an ADR Translation Discipline section covering bidirectional sync, conflict resolution (EN wins), and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair completeness, filename mirroring, ADR-ID match, and Status byte-equality. Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF normalization, em-dash title separator, and underscore-slug edge case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -18,7 +18,7 @@ We define stable, minimal message schemas for Host ↔ IO_CPU so that:
- IO_CPU-internal fan-out/aggregation can evolve independently,
- completion and failure propagation is deterministic.

We also require PE-tagging (A 방식): each shard explicitly carries (sip,cube,pe)
We also require PE-tagging (Scheme A): each shard explicitly carries (sip,cube,pe)
so IO_CPU can deterministically route/fan-out without relying on PA decoding.

---
@@ -93,7 +93,7 @@ Rules:
Mandatory fields:

- common envelope fields (D3)
- destination placement tags (A 방식):
- destination placement tags (Scheme A):
  - `dst_sip: int`
  - `dst_cube: int`
  - `dst_pe: int`
@@ -130,7 +130,7 @@ Notes:
Mandatory fields:

- common envelope fields (D3)
- source placement tags (A 방식):
- source placement tags (Scheme A):
  - `src_sip: int`
  - `src_cube: int`
  - `src_pe: int`
@@ -183,7 +183,7 @@ Tensor arg (mandatory):

- `shards: list[TensorShard]`

`TensorShard` MUST have (A 방식 강제):
`TensorShard` MUST have (Scheme A enforced):

- `sip: int`
- `cube: int`
@@ -1,519 +0,0 @@
# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)

## Status

Accepted

## Context

The current simulation models **timing only**.
`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
but do not actually read tensor data or perform computations.

### Required Capabilities

1. Must be able to store and read actual data in HBM/TCM/SRAM
2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
3. Must minimize simulation performance degradation

### Constraints

- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
- Kernel functions must remain plain Python functions (no generator/async transformation)

### Design Exploration Results

| Option | Approach | Verdict |
|--------|----------|---------|
| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |

---

## Decision

### D1. 2-Pass Execution Model — Phase 0 Elimination

The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.

Before:
```
Phase 0: Kernel → PeCommand list (no data, no branching)
Phase 1: Replay PeCommand list via SimPy (timing only)
```

After:
```
Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
  - Memory read/write: SimPy timing + MemoryStore actual data
  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
  - Dynamic control flow possible (tl.load returns actual data)

Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
```

This ADR **extends Phase 1 to be data-aware for memory operations only**.
Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
Phase 2 handles GEMM/Math computation correctness verification.
Phase 2 is optional — if only timing is needed, run Phase 1 alone.

### D2. Op Log Recording — ComponentBase Hook

Op log recording is performed as a **hook in the component base class**.
Individual component implementations are not modified.

```python
class ComponentBase:
    def _on_process_start(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_start(env.now, self.node.id, msg)

    def _on_process_end(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_end(env.now, self.node.id, msg)
```

Hooks are called before and after `run()` within `_forward_txn()`.
`_op_logger` is optional — zero overhead when absent.

**Hook timing definitions**:

| Timing | Meaning |
|--------|---------|
| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |

Link traversal latency is not included in t_start/t_end.
Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
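
As a hedged illustration of the rule above, link latency can be read off two op records. The `Rec` tuple, the component names, and the timestamps below are hypothetical stand-ins, not values from the simulator.

```python
from typing import NamedTuple


class Rec(NamedTuple):
    """Hypothetical, trimmed-down stand-in for an op record."""
    component_id: str
    t_start: float  # ns, service start
    t_end: float    # ns, service completion


def link_latency(sender: Rec, receiver: Rec) -> float:
    """Gap between the sender finishing service and the receiver starting it."""
    return receiver.t_start - sender.t_end


dma = Rec("sip0.cube0.pe0.pe_dma", t_start=100.0, t_end=140.0)
gemm = Rec("sip0.cube0.pe0.pe_gemm", t_start=152.0, t_end=300.0)
print(link_latency(dma, gemm))
```

With these sample values the gap is 152.0 - 140.0 = 12.0 ns.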

### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination

The existing Phase 0 (kernel → PeCommand list) is eliminated,
and **greenlet** is used to cooperatively interleave kernel and SimPy execution.

#### Operating Principle

greenlet is a C extension that provides cooperative context switching.
When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
to perform timing simulation, and after completion, returns to the kernel with actual data.

```
SimPy loop (parent greenlet)           Kernel (child greenlet)
─────────────────────────              ──────────────────────
g.switch() ─────────────────────────→  Kernel starts
                                       a = tl.load(ptr, ...)
                                       internal: parent.switch(DmaReadCmd)
cmd = DmaReadCmd ←──────────────────   (kernel paused)
yield DmaReadMsg(...)
yield env.timeout(dma_latency)
data = memory_store.read(...)
g.switch(data) ─────────────────────→  (kernel resumed)
                                       a = data ← actual numpy array
                                       if a[0][0] > 0.5: ← branching possible
                                           ...
```

The kernel is maintained as a **plain Python function**.
greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.

#### KernelRunner — Framework Layer

The greenlet loop resides not in the PE_CPU component but in the framework layer,
**KernelRunner**.

```python
# KernelRunner (framework — greenlet ↔ SimPy bridge)
class KernelRunner:
    def run(self, env, kernel_fn, args, store):
        g = greenlet(self._run_kernel)
        cmd = g.switch(kernel_fn, args)

        while cmd is not None:
            if isinstance(cmd, DmaReadCmd):
                yield from self._dispatch_dma(env, cmd)
                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
                cmd = g.switch(data)  # resume with actual data
            elif isinstance(cmd, GemmCmd):
                yield from self._dispatch_gemm(env, cmd)
                cmd = g.switch()  # resume (no data)
            elif isinstance(cmd, DmaWriteCmd):
                store.write(cmd.dst_addr, cmd.data)  # visibility = issue time
                yield from self._dispatch_dma(env, cmd)  # timing only
                cmd = g.switch()

# PE_CPU (component — kept simple, unaware of greenlet)
def _execute_kernel(self, env):
    runner = KernelRunner(self.ctx)
    yield from runner.run(env, kernel_fn, args, store)
```

**Op logging single source of truth**: KernelRunner does not record directly to op_log.
All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
the component base class hooks automatically record them.

**Layer separation**:
- **Kernel code**: plain function, unaware of greenlet
- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
- **ComponentBase hook**: the sole path for op_log recording
- **PE_CPU**: only calls KernelRunner, replaceable as a component

#### Handling Differences Between Memory Read/Write and Compute

| Operation | In Phase 1 | In Phase 2 |
|-----------|-----------|-----------|
| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |

Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
GEMM/Math operations are batch-executed in Phase 2 (performance separation).

#### Store Visibility Rule

`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
SimPy DMA timing is simulated separately afterward.

This is an intentional separation of timing and visibility:
- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
- **timing**: the point at which DMA latency completes in SimPy

This separation allows a load immediately after a store to see the latest data in dynamic control flow.

#### Result Handle Semantics

`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.

The key contract in Phase 1:

1. **All compute handles are always considered pending in Phase 1.**
2. `tl.wait(handle)` **expresses timing synchronization only**
   and does not make the handle ready.
3. Accessing the handle's actual result data (`handle.data`, element access,
   numpy conversion, etc.) is **only possible in Phase 2**.
4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
5. In contrast, `tl.load()` returns actual data in Phase 1, so
   **memory-read-based control flow is supported**.

| Handle state | Phase | Allowed operations |
|------------|-------|----------|
| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
| ready | Phase 2 | Actual numpy data access, verification |

This restriction is intentional. If computations were executed in Phase 1,
the SimPy single-thread would block, defeating the purpose of 2-pass separation.

#### Phase 1 Materialization — Future Extension

If Phase 1 eager execution becomes necessary for small operations
(scalar, small reduction) in the future, selective materialization can be supported
by adding a `materialized_in_phase1: bool` flag to the op record.
This is not implemented in the current scope.

### D4. data_op Flag — Message Self-Declaration

The logging target is determined by the `data_op` attribute on the message instance,
not by message type. The framework does not hardcode message types.

```python
class MsgBase:
    data_op: bool = False  # default: no logging

class DmaReadCmd(MsgBase):
    data_op = True  # memory transfer → logging

class GemmCmd(MsgBase):
    data_op = True  # compute → logging

class MathCmd(MsgBase):
    data_op = True  # compute → logging
```

When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
enables automatic logging without modifying framework code.
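
A minimal sketch of how this self-declaration behaves, assuming the hook uses the same `getattr(msg, 'data_op', False)` test shown in D2. `IpcqMsg` is only named as an example in the ADR, and `CreditMsg` and `should_log` are hypothetical names introduced for illustration.

```python
class MsgBase:
    data_op: bool = False  # default: not logged


class IpcqMsg(MsgBase):
    """Hypothetical new message type that opts in to logging."""
    data_op = True


class CreditMsg(MsgBase):
    """Hypothetical control message; inherits data_op = False."""


def should_log(msg: object) -> bool:
    """Same attribute test as the ComponentBase hook; no isinstance checks."""
    return getattr(msg, "data_op", False)


print(should_log(IpcqMsg()), should_log(CreditMsg()), should_log(object()))
```

Because the check is attribute-based, an object that does not derive from `MsgBase` at all is simply treated as not logged.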

### D5. Op Log Structure

#### Op Classification Scheme

A two-level classification is used:

| Field | Values | Role |
|-------|--------|------|
| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |

#### OpRecord Definition

```python
@dataclass
class OpRecord:
    t_start: float             # SimPy time (ns) — service start
    t_end: float               # SimPy time (ns) — service completion
    component_id: str          # e.g. "sip0.cube0.pe0.pe_gemm"
    op_kind: str               # "memory" | "gemm" | "math"
    op_name: str               # specific operation name
    params: dict               # per-operation parameters (see below)
    dependency_ids: list[int]  # currently based on in-memory record index, may be replaced with stable op_id in the future
```

#### dependency_ids Generation Rules

`dependency_ids` is **optional**, and by default the executor performs
address-based dependency inference (see D6).

Explicit setting is only needed when precise execution ordering is required:
- **Default (address-based inference)**: the executor analyzes read/write sets to
  automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
- **Explicit setting**: set when logical dependencies cannot be expressed via addresses
  at the TLContext or command generation stage.
  Example: completion handle-based synchronization — handle dependencies depend on
  logical completion order rather than memory addresses, so they cannot be captured
  by address inference.

#### op_log Ordering

The op_log maintains **stable ordering** based on `t_start`.
Records with the same `t_start` preserve insertion order.
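
One way to obtain this ordering, sketched under the assumption that records are appended in issue order and sorted afterwards: Python's built-in `sorted()` is stable, so ties on `t_start` keep insertion order. The trimmed `Rec` class and sample values are hypothetical.

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass
class Rec:
    """Hypothetical, trimmed-down op record."""
    t_start: float
    op_name: str


def ordered(op_log: list[Rec]) -> list[Rec]:
    # sorted() is guaranteed stable: equal t_start keeps insertion order.
    return sorted(op_log, key=attrgetter("t_start"))


log = [
    Rec(20.0, "gemm_f16"),
    Rec(10.0, "dma_read"),
    Rec(20.0, "exp"),
    Rec(10.0, "dma_write"),
]
print([r.op_name for r in ordered(log)])
# ['dma_read', 'dma_write', 'gemm_f16', 'exp']
```

This ordering also matters for grouping by `t_start` with `itertools.groupby`, which only merges consecutive records with equal keys.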

#### params Details

**memory (dma_read / dma_write)**:
```python
{
    "src_addr": int,   # source address (byte)
    "dst_addr": int,   # destination address (byte)
    "nbytes": int,     # transfer size
    "src_space": str,  # "hbm" | "tcm" | "sram"
    "dst_space": str,  # "hbm" | "tcm" | "sram"
}
```

**gemm**:
```python
{
    "src_a_addr": int,  # operand A address
    "src_b_addr": int,  # operand B address
    "dst_addr": int,    # output address
    "shape_a": tuple,   # e.g. (128, 256)
    "shape_b": tuple,   # e.g. (256, 128)
    "shape_out": tuple, # e.g. (128, 128)
    "dtype_in": str,    # e.g. "f16"
    "dtype_acc": str,   # accumulation dtype, e.g. "f32"
    "dtype_out": str,   # output dtype, e.g. "f16"
    "transpose_a": bool,
    "transpose_b": bool,
    "layout_a": str,    # "row_major" | "col_major"
    "layout_b": str,
    "layout_out": str,
    "addr_space": str,  # "tcm" (GEMM operands are always in TCM)
}
```

**math**:
```python
{
    "op": str,                 # "exp" | "add" | "sum" | "where" | ...
    "input_addrs": list[int],  # list of operand addresses
    "input_shapes": list[tuple],
    "dst_addr": int,
    "shape_out": tuple,
    "dtype": str,
    "axis": int | None,        # reduction axis
    "addr_space": str,         # "tcm"
}
```

### D6. Phase 2 Executor

Phase 2 executes the op_log outside of SimPy.

```python
from itertools import groupby

class DataExecutor:
    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
        self.op_log = op_log        # must already be ordered by t_start (see D5)
        self.store = initial_store  # Takes the Phase 1 MemoryStore snapshot as input

    def run(self):
        for t, ops in groupby(self.op_log, key=lambda o: o.t_start):
            batch = list(ops)
            independent, sequential = self._classify(batch)
            self._execute_parallel(independent)
            self._execute_sequential(sequential)
```

**Parallel execution determination**:

Ops with the same `t_start` are considered **parallel candidates**.
The executor determines actual parallel execution based on the following criteria:
- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
- Whether predecessor ops specified in `dependency_ids` have completed

Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
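
The address-range criterion can be sketched as follows. This is an illustration under stated assumptions, not the executor's actual `_classify` logic: `Access`, `overlaps`, and `conflicts` are hypothetical names, ranges are `(start_byte, nbytes)` pairs, and all ranges are assumed to live in the same address space.

```python
from dataclasses import dataclass, field

Range = tuple[int, int]  # (start_byte, nbytes)


@dataclass
class Access:
    """Hypothetical read/write sets derived from an op record's params."""
    reads: list[Range] = field(default_factory=list)
    writes: list[Range] = field(default_factory=list)


def overlaps(a: Range, b: Range) -> bool:
    """Half-open byte intervals [start, start + nbytes) intersect."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def conflicts(first: Access, second: Access) -> bool:
    """True if running the two ops in parallel would hit a RAW, WAR, or WAW hazard."""
    raw = any(overlaps(w, r) for w in first.writes for r in second.reads)
    war = any(overlaps(r, w) for r in first.reads for w in second.writes)
    waw = any(overlaps(w1, w2) for w1 in first.writes for w2 in second.writes)
    return raw or war or waw


gemm0 = Access(reads=[(0x0000, 256), (0x1000, 256)], writes=[(0x2000, 128)])
gemm1 = Access(reads=[(0x0000, 256), (0x3000, 256)], writes=[(0x4000, 128)])
exp0 = Access(reads=[(0x2040, 64)], writes=[(0x5000, 64)])

print(conflicts(gemm0, gemm1))  # shared read-only operand: no hazard
print(conflicts(gemm0, exp0))   # exp0 reads gemm0's output range: RAW
```

Two ops that only share a read operand stay parallel candidates, while any write overlapping another op's read or write forces sequential execution.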

**Batch optimization**: Only independent ops with the same op_name **and identical
shape, dtype, layout, and transpose flags** are eligible for batching.
Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
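
A small sketch of the batching idea, assuming four PEs issued the same `(128, 256) @ (256, 128)` f16 GEMM with f32 accumulation. The PE count, shapes, and random inputs are made up for illustration; only `np.stack` and `np.matmul` broadcasting over a leading batch axis are relied on.

```python
import numpy as np

rng = np.random.default_rng(0)
num_pes = 4  # hypothetical number of PEs that issued the same GEMM

a_list = [rng.standard_normal((128, 256)).astype(np.float16) for _ in range(num_pes)]
b_list = [rng.standard_normal((256, 128)).astype(np.float16) for _ in range(num_pes)]

# Stack along a new leading batch axis and accumulate in f32 (dtype_acc).
a_batch = np.stack(a_list).astype(np.float32)  # (4, 128, 256)
b_batch = np.stack(b_list).astype(np.float32)  # (4, 256, 128)
acc = np.matmul(a_batch, b_batch)              # (4, 128, 128), one call
out_batch = acc.astype(np.float16)             # cast to dtype_out

# Reference: one matmul per PE, for comparison.
per_pe = [a.astype(np.float32) @ b.astype(np.float32) for a, b in zip(a_list, b_list)]
print(out_batch.shape)
```

Each slice `acc[i]` corresponds to the GEMM of PE `i`, so results can be written back to the per-PE destination addresses after the single call.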

**Phase 2 execution order guarantee**:

Phase 2 does not consider data arrival timing,
and guarantees execution order solely through
dependencies (address-based inference + explicit dependency_ids).

### D7. Memory Store

`MemoryStore` logically follows byte-addressable semantics,
and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).

```python
class MemoryStore:
    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
```

**Internal storage format: numpy ndarray**

MemoryStore stores tensors as **numpy ndarrays**.

| Candidate | store/load speed | Phase 2 compute | Verdict |
|-----------|-----------------|-----------------|---------|
| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
| torch tensor | Immediate | torch operations available | Use only for GPU optimization |

- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
- read: **returns numpy array by reference** (no copy)
- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
- dtype uses numpy dtypes (`np.float16`, `np.float32`, etc.); numpy has no native `np.bfloat16`, so bf16 needs an extension dtype (for example from the `ml_dtypes` package)
- For byte-level access, convert via `.view(np.uint8)`
- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility

**read/write contract**:

- read/write operates on a **contiguous tensor** basis.
  If non-contiguous stride views are needed, express them as separate copy ops.
- In the normal benchmark path, producer/consumer dtype match is expected.
  Reinterpret cast is a permissive behavior for low-level memory validation
  or special test cases.
- addr is byte-aligned, with minimum alignment = dtype size.
- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
  Shape mismatch is verified based on nbytes, and raises an error on mismatch.
- Correctness criteria follow address-range-based read/write semantics.
- A tensor object cache may be used as an implementation optimization,
  but the canonical state is byte-addressable storage.
- At deploy time, the host injects initial tensor data.
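
The contract above could be sketched roughly as below. This is an assumption-laden illustration of the tensor-granular variant, not the project's `MemoryStore`: the `_tensors` dict, the error messages, and the use of numpy dtype names (`"float32"`, `"uint8"`) instead of the ADR's `"f16"`-style strings are all choices made for the example.

```python
import numpy as np


class MemoryStore:
    """Tensor-granular sketch: (space, addr) -> ndarray, stored by reference."""

    def __init__(self) -> None:
        self._tensors: dict[tuple[str, int], np.ndarray] = {}

    def write(self, space: str, addr: int, data: np.ndarray) -> None:
        if not data.flags["C_CONTIGUOUS"]:
            raise ValueError("write expects a contiguous tensor")
        if addr % data.dtype.itemsize != 0:
            raise ValueError("addr must be aligned to the dtype size")
        self._tensors[(space, addr)] = data  # no copy; overwrites any previous tensor

    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray:
        data = self._tensors[(space, addr)]
        want = np.dtype(dtype)
        if int(np.prod(shape)) * want.itemsize != data.nbytes:
            raise ValueError("requested shape/dtype does not match stored nbytes")
        # Flatten first so the reinterpret cast works for any stored shape;
        # both reshape and view return views of the same buffer (no copy).
        return data.reshape(-1).view(want).reshape(shape)


store = MemoryStore()
x = np.arange(4, dtype=np.float32)
store.write("tcm", 0x100, x)
same = store.read("tcm", 0x100, (2, 2), "float32")  # same bytes, new shape
raw = store.read("tcm", 0x100, (16,), "uint8")      # reinterpret cast
```

Reading with a shape whose byte count differs from the stored tensor raises `ValueError`, which mirrors the nbytes-based shape check.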

### D8. Benchmark Kernel Code

The benchmark's **user code API is not changed**.
The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.

However, internal command/message schemas may be extended to include metadata
required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).

### D9. No Component Changes

Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
Op log recording is the responsibility of the ComponentBase hook.
When custom components are replaced, only the timing model changes,
and Phase 2 data execution is unaffected.

### D10. Phase 2 is Optional

```python
engine = GraphEngine(graph)
engine.run(benchmark)  # Phase 1: timing only
result = engine.get_timing_result()

if verify_data:
    executor = DataExecutor(engine.op_log)  # Phase 2: data
    executor.run()
    executor.verify(expected_output)
```

If only timing analysis is needed, Phase 2 is skipped.
If the op_logger is deactivated, Phase 1 performance is identical to the original.

### D11. Verification Contract

Basic verification **compares the final output tensor** against a reference backend (numpy).

Per-dtype tolerance policy:

| dtype | Comparison method | Tolerance |
|-------|----------|-----------|
| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
| int types | `np.array_equal` | exact |

- Default mode: compare final output only (end-to-end correctness)
- Debug mode: can compare intermediate tensors on a per-op basis
(MemoryStore snapshot at each op boundary)
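
The tolerance table can be applied with a small helper like the one below. `TOLERANCE` and `verify` are hypothetical names for illustration; the dtype keys follow the table, and anything not listed is treated as an integer type and compared exactly.

```python
import numpy as np

# (rtol, atol) per dtype name, copied from the table above.
TOLERANCE = {
    "f32": (1e-5, 1e-5),
    "f16": (1e-3, 1e-3),
    "bf16": (1e-2, 1e-2),
}


def verify(actual: np.ndarray, expected: np.ndarray, dtype: str) -> bool:
    """Compare a Phase 2 output against the reference for the given dtype."""
    if dtype in TOLERANCE:
        rtol, atol = TOLERANCE[dtype]
        # np.allclose checks |actual - expected| <= atol + rtol * |expected|.
        return bool(np.allclose(actual, expected, rtol=rtol, atol=atol))
    return bool(np.array_equal(actual, expected))  # int types: exact match


ref = np.array([1.0, 2.0], dtype=np.float32)
print(verify(ref + 5e-4, ref, "f16"))  # within the f16 tolerance
print(verify(ref + 5e-4, ref, "f32"))  # outside the f32 tolerance
```

Note that `np.allclose` is not symmetric: the relative term scales with its second argument, so the reference tensor should be passed as `expected`.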

---

## Non-goals

- **Compute-result-based control flow**: not supported.
  All compute handles are in pending state during Phase 1,
  `wait()` expresses timing synchronization only and does not imply data readiness.
  Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
  is **treated as an error**.
  Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
  Phase 1 materialization is a future extension (see D3).
- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
  the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
  and do not reproduce the actual hardware PE microarchitecture.

## Open Questions

- **Aliasing / slice view**: How to represent slice/views referencing the same
  backing storage in MemoryStore (stride-based view vs copy semantics)
- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
  communication as memory ops or introduce a separate op_kind
- **Op log streaming**: Managing op_log memory usage in large-scale simulations
  (in-memory list vs disk-backed streaming)
- **Fused operation**: Whether to record tl.composite's tiled pipeline
  (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
- **Math op schema generalization**: The current math params have a simple structure,
  but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
  scalar/immediate operands, where/mask expressions, etc.
- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
  replacement with stable op_id is needed when introducing streaming/disk-backed mode
- **Phase 1 materialization policy**: See Future Extension in D3.
  If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
  needs to be defined

---

## Consequences

### Positive

- Minimal impact on SimPy simulation performance (only op_log append added)
- Free to use multi-threading/GPU in Phase 2
- Component replaceability preserved (ADR-0015 design philosophy maintained)
- No changes needed to benchmark user code API
- When adding new message types, only set the data_op flag
- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
- `tl.load()` returns actual data, making kernel debugging easier

### Negative

- op_log memory usage (for large-scale simulations)
- Phase 2 execution time is proportional to tensor size (large GEMM)
- Dynamic branching based on pending handles (incomplete computations) not possible
  (computations execute in Phase 2, result values are undetermined in Phase 1).
  Memory-data-based branching is supported via greenlet.
- greenlet C extension dependency added (pip install greenlet)
@@ -1,4 +1,4 @@
# ADR-0020: 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리)
# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)

## Status

@@ -6,65 +6,65 @@ Accepted

## Context

현재 시뮬레이션은 **타이밍만** 모델링한다.
`tl.load()`, `tl.composite(op="gemm")` 등은 SimPy latency를 생성하지만,
실제 텐서 데이터를 읽거나 연산하지 않는다.
The current simulation models **timing only**.
`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
but do not actually read tensor data or perform computations.

### 필요한 기능
### Required Capabilities

1. HBM/TCM/SRAM에 실제 데이터를 저장하고 읽을 수 있어야 한다
2. PE_GEMM, PE_MATH가 실제 행렬 연산을 수행하고 결과를 검증할 수 있어야 한다
3. 시뮬레이션 성능 저하를 최소화해야 한다
1. Must be able to store and read actual data in HBM/TCM/SRAM
2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
3. Must minimize simulation performance degradation

### 제약 조건
### Constraints

- SimPy는 single-thread 이벤트 루프 — numpy matmul을 안에서 하면 전체가 block
- 컴포넌트는 교체 가능해야 한다 (ADR-0015) — 프레임워크 요구사항이 구현에 침투하면 안 됨
- 벤치마크 커널은 명령형 코드(tl.load → tl.composite → tl.wait) — 같은 코드를 재사용해야 함
- 커널 함수는 plain Python function으로 유지해야 한다 (generator/async 변환 불가)
- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
- Kernel functions must remain plain Python functions (no generator/async transformation)

### 설계 탐색 결과
### Design Exploration Results

| Option | 방식 | 판정 |
|--------|------|------|
| SimPy 내 직접 실행 | GEMM을 SimPy 안에서 numpy 호출 | 탈락: single-thread block |
| SimPy + ThreadPool | future.submit → timeout → result() | 탈락: back-to-back 요청 시 result()에서 block |
| Symbolic + lazy | 메타데이터만 추적, 나중에 실행 | 탈락: control-flow dependent 읽기 처리 곤란 |
| **2-pass (채택)** | Phase 1: 타이밍, Phase 2: 데이터 | 완전 분리, 성능 영향 없음 |
| Option | Approach | Verdict |
|--------|----------|---------|
| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |

---

## Decision

### D1. 2-Pass 실행 모델 — Phase 0 제거
### D1. 2-Pass Execution Model — Phase 0 Elimination

기존의 3단계(Phase 0 → Phase 1 → Phase 2)를 **2단계로 통합**한다.
The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.

기존:
Before:
```
Phase 0: 커널 → PeCommand 리스트 (데이터 없음, 분기 불가)
Phase 1: PeCommand 리스트를 SimPy replay (타이밍만)
Phase 0: Kernel → PeCommand list (no data, no branching)
Phase 1: Replay PeCommand list via SimPy (timing only)
```

변경:
After:
```
Phase 1 (타이밍): 커널 + SimPy 통합 실행 — greenlet 기반
  - 메모리 읽기/쓰기: SimPy 타이밍 + MemoryStore 실제 데이터
  - 연산 (GEMM/Math): SimPy 타이밍 + op_log 기록 (실제 연산은 Phase 2)
  - dynamic control flow 가능 (tl.load가 실제 데이터 반환)
Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
  - Memory read/write: SimPy timing + MemoryStore actual data
  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
  - Dynamic control flow possible (tl.load returns actual data)

Phase 2 (데이터): op_log 기반 실제 연산 실행 — SimPy 외부, 병렬 가능
Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
```

본 ADR은 **메모리 연산에 한해 Phase 1을 data-aware로 확장**한다.
Phase 1은 latency/BW 병목 분석 + 메모리 데이터 추적,
Phase 2는 GEMM/Math 연산 정합성 검증.
Phase 2는 optional — 타이밍만 필요하면 Phase 1만 실행.
This ADR **extends Phase 1 to be data-aware for memory operations only**.
Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
Phase 2 handles GEMM/Math computation correctness verification.
Phase 2 is optional — if only timing is needed, run Phase 1 alone.

### D2. Op Log 기록 — ComponentBase hook
### D2. Op Log Recording — ComponentBase Hook

op_log 기록은 **컴포넌트 베이스 클래스의 hook**으로 수행한다.
개별 컴포넌트 구현을 수정하지 않는다.
Op log recording is performed as a **hook in the component base class**.
Individual component implementations are not modified.

```python
class ComponentBase:
@@ -77,56 +77,56 @@ class ComponentBase:
|
||||
self._op_logger.record_end(env.now, self.node.id, msg)
|
||||
```
|
||||
|
||||
`_forward_txn()` 에서 `run()` 전후로 hook을 호출한다.
|
||||
`_op_logger`는 optional — 없으면 오버헤드 제로.
|
||||
Hooks are called before and after `run()` within `_forward_txn()`.
|
||||
`_op_logger` is optional — zero overhead when absent.
|
||||
|
||||
**hook 시점 정의**:
|
||||
**Hook timing definitions**:
|
||||
|
||||
| Timing | Meaning |
|--------|---------|
| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |
|
||||
|
||||
link traversal latency는 t_start/t_end에 포함되지 않는다.
|
||||
link latency는 발신 컴포넌트의 t_end와 수신 컴포넌트의 t_start 차이로 관측된다.
|
||||
Link traversal latency is not included in t_start/t_end.
|
||||
Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
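
As a small worked example of that definition, link latency can be derived from two adjacent records. The timestamps below are made up for illustration, and `link_latency` is a hypothetical helper, not part of the codebase.

```python
def link_latency(sender_t_end, receiver_t_start):
    """Link traversal time = receiver's t_start minus sender's t_end."""
    return receiver_t_start - sender_t_end


# Hypothetical records for one message passing from a DMA engine to an HBM controller.
pe_dma = {"t_start": 100.0, "t_end": 140.0}
hbm_ctrl = {"t_start": 152.0, "t_end": 200.0}

latency_ns = link_latency(pe_dma["t_end"], hbm_ctrl["t_start"])
```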
|
||||
|
||||
### D3. Greenlet 기반 커널 실행 — Phase 0 제거
|
||||
### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
|
||||
|
||||
기존 Phase 0 (커널 → PeCommand 리스트)를 제거하고,
|
||||
**greenlet**을 사용하여 커널과 SimPy를 협력적으로 interleave 실행한다.
|
||||
The existing Phase 0 (kernel → PeCommand list) is eliminated,
|
||||
and **greenlet** is used to cooperatively interleave kernel and SimPy execution.
|
||||
|
||||
#### 동작 원리
|
||||
#### Operating Principle
|
||||
|
||||
greenlet은 협력적 context switch를 제공하는 C 확장이다.
|
||||
커널(child greenlet)이 `tl.load()` 등을 호출하면 SimPy 루프(parent greenlet)로
|
||||
switch하여 타이밍 시뮬레이션을 수행하고, 완료 후 실제 데이터와 함께 커널로 돌아온다.
|
||||
greenlet is a C extension that provides cooperative context switching.
|
||||
When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
|
||||
to perform timing simulation, and after completion, returns to the kernel with actual data.
|
||||
|
||||
```
|
||||
SimPy 루프 (parent greenlet) 커널 (child greenlet)
|
||||
SimPy loop (parent greenlet) Kernel (child greenlet)
|
||||
───────────────────────── ──────────────────────
|
||||
g.switch() ─────────────────────────→ 커널 시작
|
||||
g.switch() ─────────────────────────→ Kernel starts
|
||||
a = tl.load(ptr, ...)
|
||||
내부: parent.switch(DmaReadCmd)
|
||||
cmd = DmaReadCmd ←────────────────── (커널 일시정지)
|
||||
internal: parent.switch(DmaReadCmd)
|
||||
cmd = DmaReadCmd ←────────────────── (kernel paused)
|
||||
yield DmaReadMsg(...)
|
||||
yield env.timeout(dma_latency)
|
||||
data = memory_store.read(...)
|
||||
g.switch(data) ─────────────────────→ (커널 재개)
|
||||
a = data ← 실제 numpy array
|
||||
if a[0][0] > 0.5: ← 분기 가능
|
||||
g.switch(data) ─────────────────────→ (kernel resumed)
|
||||
a = data ← actual numpy array
|
||||
if a[0][0] > 0.5: ← branching possible
|
||||
...
|
||||
```
|
||||
|
||||
커널은 **plain Python function**으로 유지된다.
|
||||
greenlet switch는 `tl.load()`, `tl.store()` 등의 **내부 구현에만** 존재한다.
|
||||
The kernel remains a **plain Python function**.
|
||||
greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
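
The switch pattern above can be sketched with the `greenlet` package alone. This is a simplified illustration under stated assumptions: SimPy timing is omitted, memory is a plain dict, and `tl_load`, `run`, and this stripped-down `DmaReadCmd` are hypothetical stand-ins rather than the real TLContext or KernelRunner.

```python
from greenlet import greenlet, getcurrent


class DmaReadCmd:
    def __init__(self, addr):
        self.addr = addr


def tl_load(addr):
    # Hand the command to the parent (the runner loop) and pause here.
    # The value passed to g.switch(data) by the parent becomes the return value.
    return getcurrent().parent.switch(DmaReadCmd(addr))


def kernel(out):
    # Plain Python function: no greenlet calls appear in kernel code.
    a = tl_load(0x1000)
    if a[0][0] > 0.5:  # branch on data that was actually loaded
        out.append("hot")
    else:
        out.append("cold")


def run(kernel_fn, memory):
    out = []
    g = greenlet(kernel_fn)
    cmd = g.switch(out)  # start the kernel; returns at its first tl_load()
    while not g.dead:
        data = memory[cmd.addr]  # a real runner would simulate DMA timing first
        cmd = g.switch(data)  # resume the kernel with the loaded data
    return out
```

Calling `run(kernel, {0x1000: [[0.9]]})` drives the kernel down the first branch; the kernel body itself never references greenlet.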
|
||||
|
||||
#### KernelRunner — 프레임워크 레이어
|
||||
#### KernelRunner — Framework Layer
|
||||
|
||||
greenlet 루프는 PE_CPU 컴포넌트가 아니라 프레임워크 레이어인
|
||||
**KernelRunner**에 위치한다.
|
||||
The greenlet loop resides not in the PE_CPU component but in the framework layer,
|
||||
**KernelRunner**.
|
||||
|
||||
```python
|
||||
# KernelRunner (프레임워크 — greenlet ↔ SimPy 연결)
|
||||
# KernelRunner (framework — greenlet ↔ SimPy bridge)
|
||||
class KernelRunner:
|
||||
def run(self, env, kernel_fn, args, store):
|
||||
g = greenlet(self._run_kernel)
|
||||
@@ -136,160 +136,162 @@ class KernelRunner:
|
||||
if isinstance(cmd, DmaReadCmd):
|
||||
yield from self._dispatch_dma(env, cmd)
|
||||
data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
|
||||
cmd = g.switch(data) # 실제 데이터와 함께 재개
|
||||
cmd = g.switch(data) # resume with actual data
|
||||
elif isinstance(cmd, GemmCmd):
|
||||
yield from self._dispatch_gemm(env, cmd)
|
||||
cmd = g.switch() # 재개 (데이터 없음)
|
||||
cmd = g.switch() # resume (no data)
|
||||
elif isinstance(cmd, DmaWriteCmd):
|
||||
store.write(cmd.dst_addr, cmd.data) # visibility = issue 시점
|
||||
yield from self._dispatch_dma(env, cmd) # timing만 반영
|
||||
store.write(cmd.dst_addr, cmd.data) # visibility = issue time
|
||||
yield from self._dispatch_dma(env, cmd) # timing only
|
||||
cmd = g.switch()
|
||||
|
||||
# PE_CPU (컴포넌트 — 간단하게 유지, greenlet을 모름)
|
||||
# PE_CPU (component — kept simple, unaware of greenlet)
|
||||
def _execute_kernel(self, env):
|
||||
runner = KernelRunner(self.ctx)
|
||||
yield from runner.run(env, kernel_fn, args, store)
|
||||
```
|
||||
|
||||
**Op logging single source of truth**: KernelRunner는 op_log에 직접 기록하지 않는다.
|
||||
모든 op logging은 **ComponentBase hook (_on_process_start/end)만** 담당한다.
|
||||
KernelRunner가 `_dispatch_gemm()` 등으로 컴포넌트에 메시지를 전달하면,
|
||||
컴포넌트 베이스 클래스의 hook이 자동으로 기록한다.
|
||||
**Op logging single source of truth**: KernelRunner does not record directly to op_log.
|
||||
All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
|
||||
When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
|
||||
the component base class hooks automatically record them.
|
||||
|
||||
**Layer separation**:
- **Kernel code**: plain function, unaware of greenlet
- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
- **ComponentBase hook**: the sole path for op_log recording
- **PE_CPU**: only calls KernelRunner, replaceable as a component
|
||||
|
||||
#### 메모리 읽기/쓰기 vs 연산의 처리 차이
|
||||
#### Handling Differences Between Memory Read/Write and Compute
|
||||
|
||||
| Operation | In Phase 1 | In Phase 2 |
|-----------|-----------|-----------|
| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | n/a |
| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | n/a |
| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | actual computation in numpy |
| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | actual computation in numpy |
|
||||
|
||||
메모리 읽기/쓰기는 Phase 1에서 즉시 처리 (numpy slice, 빠름).
|
||||
GEMM/Math 연산은 Phase 2에서 batch 실행 (성능 분리).
|
||||
Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
|
||||
GEMM/Math operations are batch-executed in Phase 2 (performance separation).
|
||||
|
||||
#### Store Visibility Rule
|
||||
|
||||
`tl.store()`는 **issue 시점에 MemoryStore에 즉시 반영**된다 (visibility = issue).
|
||||
SimPy DMA 타이밍은 이후 별도로 시뮬레이션된다.
|
||||
`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
|
||||
SimPy DMA timing is simulated separately afterward.
|
||||
|
||||
이는 timing과 visibility를 의도적으로 분리한 것이다:
|
||||
- **visibility**: MemoryStore에 반영되는 시점 = `store.write()` 호출 시
|
||||
- **timing**: SimPy에서 DMA latency가 완료되는 시점
|
||||
This is an intentional separation of timing and visibility:
|
||||
- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
|
||||
- **timing**: the point at which DMA latency completes in SimPy
|
||||
|
||||
이 분리로 dynamic control flow에서 store 직후 load가 최신 데이터를 볼 수 있다.
|
||||
This separation allows a load immediately after a store to see the latest data in dynamic control flow.
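
A minimal sketch of the visibility rule, assuming a dict-backed store and a list of pending completion times in place of SimPy events. `TinyStore` and `issue_store` are hypothetical names used only here.

```python
class TinyStore:
    """Hypothetical stand-in for MemoryStore: addr -> value."""

    def __init__(self):
        self._data = {}

    def write(self, addr, value):
        self._data[addr] = value

    def read(self, addr):
        return self._data[addr]


def issue_store(store, pending_dma, addr, value, now, dma_latency):
    store.write(addr, value)  # visibility: reflected at issue time
    pending_dma.append(now + dma_latency)  # timing: DMA completes later


store = TinyStore()
pending = []
issue_store(store, pending, 0x40, [1, 2, 3], now=0.0, dma_latency=25.0)

# A load issued right away already sees the new data,
# even though the modeled DMA has not completed yet.
seen = store.read(0x40)
```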
|
||||
|
||||
#### Result Handle Semantics
|
||||
|
||||
`tl.composite()`(sync/async)는 결과 tensor를 참조하는 **handle**을 반환한다.
|
||||
`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.
|
||||
|
||||
Phase 1에서의 핵심 계약:
|
||||
The key contract in Phase 1:
|
||||
|
||||
1. **All compute handles are always considered pending in Phase 1.**
2. `tl.wait(handle)` **expresses timing synchronization only** and does not make the handle ready.
3. Accessing the handle's actual result data (`handle.data`, element access, numpy conversion, etc.) is **only possible in Phase 2**.
4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
5. In contrast, `tl.load()` returns actual data in Phase 1, so **memory-read-based control flow is supported**.
|
||||
|
||||
| Handle state | Phase | Allowed operations |
|--------------|-------|--------------------|
| pending | Phase 1 | `tl.wait(handle)`: timing synchronization only |
| pending | Phase 1 | Pass the handle as the target of `tl.store()` (logical destination binding only; the payload is resolved in Phase 2) |
| pending | Phase 1 | **Data access not allowed**: value-based branching is not possible |
| ready | Phase 2 | Actual numpy data access, verification |
|
||||
|
||||
이 제약은 의도적이다. Phase 1에서 연산을 실행하면 SimPy single-thread가
|
||||
block되어 2-pass 분리의 존재 이유가 사라진다.
|
||||
This restriction is intentional. If computations were executed in Phase 1,
|
||||
the SimPy single-thread would block, defeating the purpose of 2-pass separation.
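
One way to enforce the pending-handle contract is to make every data access fail loudly. The sketch below is an assumption about how such a handle could look, not the project's implementation; `PendingHandle`, `Phase1AccessError`, and `op_index` are hypothetical names.

```python
class Phase1AccessError(RuntimeError):
    """Raised when Phase 1 code touches a compute result that does not exist yet."""


class PendingHandle:
    def __init__(self, op_index):
        self.op_index = op_index  # which op record will produce the result

    def _fail(self, what):
        raise Phase1AccessError(
            f"{what} on a pending handle (op {self.op_index}) is only valid in Phase 2"
        )

    @property
    def data(self):
        self._fail("handle.data")

    def __getitem__(self, key):
        self._fail("element access")

    def __bool__(self):
        self._fail("truth-value evaluation")

    def __array__(self, *args, **kwargs):
        self._fail("numpy conversion")
```

With this shape, `if handle: ...` in kernel code raises immediately instead of silently branching on a value that has not been computed.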
|
||||
|
||||
#### Phase 1 Materialization — Future Extension
|
||||
|
||||
향후 소형 연산(scalar, 작은 reduction)에 대해 Phase 1 eager execution이
|
||||
필요한 경우, `materialized_in_phase1: bool` 플래그를 op record에 추가하여
|
||||
선택적 materialization을 지원할 수 있다. 현재 범위에서는 구현하지 않는다.
|
||||
If Phase 1 eager execution becomes necessary for small operations
|
||||
(scalar, small reduction) in the future, selective materialization can be supported
|
||||
by adding a `materialized_in_phase1: bool` flag to the op record.
|
||||
This is not implemented in the current scope.
|
||||
|
||||
### D4. data_op 플래그 — 메시지 자기 선언
|
||||
### D4. data_op Flag — Message Self-Declaration
|
||||
|
||||
로깅 대상은 메시지 타입이 아니라 메시지 인스턴스의 `data_op` 속성으로 결정한다.
|
||||
프레임워크가 메시지 타입을 하드코딩하지 않는다.
|
||||
The logging target is determined by the `data_op` attribute on the message instance,
|
||||
not by message type. The framework does not hardcode message types.
|
||||
|
||||
```python
|
||||
class MsgBase:
    data_op: bool = False  # default: no logging


class DmaReadCmd(MsgBase):
    data_op = True  # memory transfer → logging


class GemmCmd(MsgBase):
    data_op = True  # compute → logging


class MathCmd(MsgBase):
    data_op = True  # compute → logging
|
||||
```
|
||||
|
||||
새 메시지 타입(예: IpcqMsg) 추가 시 `data_op = True`만 설정하면
|
||||
프레임워크 코드 수정 없이 자동 로깅된다.
|
||||
When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
|
||||
enables automatic logging without modifying framework code.
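
The self-declaration rule can be sketched as a single attribute check on the framework side. This is an illustrative assumption about the filter, not the actual hook code; `should_log` and `CreditMsg` are hypothetical names.

```python
class MsgBase:
    data_op = False  # default: not logged


class IpcqMsg(MsgBase):
    data_op = True  # new message type opts in by itself


class CreditMsg(MsgBase):
    """Hypothetical control-only message; inherits data_op = False."""


def should_log(msg):
    # The framework inspects the instance attribute only; it never names message types.
    return bool(getattr(msg, "data_op", False))


logged = [type(m).__name__ for m in (IpcqMsg(), CreditMsg()) if should_log(m)]
```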
|
||||
|
||||
### D5. Op Log 구조
|
||||
### D5. Op Log Structure
|
||||
|
||||
#### op 분류 체계
|
||||
#### Op Classification Scheme
|
||||
|
||||
2단계로 분류한다:
|
||||
A two-level classification is used:
|
||||
|
||||
| Field | Values | Role |
|-------|--------|------|
| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion (level 1) |
| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum`, etc. | identifies the specific operation (level 2) |
|
||||
|
||||
#### OpRecord 정의
|
||||
#### OpRecord Definition
|
||||
|
||||
```python
|
||||
from dataclasses import dataclass


@dataclass
class OpRecord:
    t_start: float             # SimPy time (ns), service start
    t_end: float               # SimPy time (ns), service completion
    component_id: str          # e.g. "sip0.cube0.pe0.pe_gemm"
    op_kind: str               # "memory" | "gemm" | "math"
    op_name: str               # specific operation name
    params: dict               # per-operation parameters (see below)
    dependency_ids: list[int]  # currently in-memory record indices; may be replaced with a stable op_id later
|
||||
```
|
||||
|
||||
#### dependency_ids 생성 규칙
|
||||
#### dependency_ids Generation Rules
|
||||
|
||||
`dependency_ids`는 **optional**이며, 기본적으로 executor는
|
||||
주소 기반 dependency 추론을 수행한다 (D6 참조).
|
||||
`dependency_ids` is **optional**, and by default the executor performs
|
||||
address-based dependency inference (see D6).
|
||||
|
||||
정확한 실행 순서가 필요한 경우에만 명시적으로 설정한다:
|
||||
- **기본 (address-based inference)**: executor가 read/write set을 분석하여
|
||||
RAW/WAW/WAR 의존성을 자동 추론. 대부분의 경우 이것으로 충분.
|
||||
- **명시적 설정**: TLContext 또는 command 생성 단계에서 logical dependency가
|
||||
주소로 표현되지 않는 경우에 설정.
|
||||
예: completion handle 기반 동기화 — handle dependency는 메모리 주소가 아니라
|
||||
논리적 완료 순서에 의존하므로 address inference로 잡히지 않는다.
|
||||
Explicit setting is only needed when precise execution ordering is required:
|
||||
- **Default (address-based inference)**: the executor analyzes read/write sets to
|
||||
automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
|
||||
- **Explicit setting**: set when logical dependencies cannot be expressed via addresses
|
||||
at the TLContext or command generation stage.
|
||||
Example: completion handle-based synchronization — handle dependencies depend on
|
||||
logical completion order rather than memory addresses, so they cannot be captured
|
||||
by address inference.
|
||||
|
||||
#### op_log ordering
|
||||
#### op_log Ordering
|
||||
|
||||
op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
|
||||
동일 `t_start`의 record들은 insertion order를 보존한다.
|
||||
The op_log maintains **stable ordering** based on `t_start`.
|
||||
Records with the same `t_start` preserve insertion order.
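
Python's built-in sort is stable, so sorting by `t_start` alone already gives the ordering described. The records below are simplified `(op_name, t_start)` tuples with made-up values, used only to illustrate the tie-breaking behavior.

```python
# Records in insertion order; two pairs share the same t_start.
records = [
    ("gemm", 20.0),
    ("dma_read_a", 10.0),
    ("dma_read_b", 10.0),
    ("exp", 20.0),
]

# sorted() is stable: entries with equal keys keep their insertion order.
ordered = sorted(records, key=lambda r: r[1])
```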
|
||||
|
||||
#### params 상세
|
||||
#### params Details
|
||||
|
||||
**memory (dma_read / dma_write)**:
|
||||
```python
|
||||
{
    "src_addr": int,   # source address (byte)
    "dst_addr": int,   # destination address (byte)
    "nbytes": int,     # transfer size
    "src_space": str,  # "hbm" | "tcm" | "sram"
    "dst_space": str,  # "hbm" | "tcm" | "sram"
}
|
||||
@@ -298,9 +300,9 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
|
||||
**gemm**:
|
||||
```python
|
||||
{
|
||||
"src_a_addr": int, # operand A 주소
|
||||
"src_b_addr": int, # operand B 주소
|
||||
"dst_addr": int, # output 주소
|
||||
"src_a_addr": int, # operand A address
|
||||
"src_b_addr": int, # operand B address
|
||||
"dst_addr": int, # output address
|
||||
"shape_a": tuple, # e.g. (128, 256)
|
||||
"shape_b": tuple, # e.g. (256, 128)
|
||||
"shape_out": tuple, # e.g. (128, 128)
|
||||
@@ -312,7 +314,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
|
||||
"layout_a": str, # "row_major" | "col_major"
|
||||
"layout_b": str,
|
||||
"layout_out": str,
|
||||
"addr_space": str, # "tcm" (GEMM operand는 항상 TCM)
|
||||
"addr_space": str, # "tcm" (GEMM operands are always in TCM)
|
||||
}
|
||||
```
|
||||
|
||||
@@ -320,7 +322,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
|
||||
```python
|
||||
{
|
||||
"op": str, # "exp" | "add" | "sum" | "where" | ...
|
||||
"input_addrs": list[int], # operand 주소 목록
|
||||
"input_addrs": list[int], # list of operand addresses
|
||||
"input_shapes": list[tuple],
|
||||
"dst_addr": int,
|
||||
"shape_out": tuple,
|
||||
@@ -332,12 +334,12 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
|
||||
|
||||
### D6. Phase 2 Executor
|
||||
|
||||
Phase 2는 SimPy 밖에서 op_log를 실행한다.
|
||||
Phase 2 executes the op_log outside of SimPy.
|
||||
|
||||
```python
|
||||
class DataExecutor:
|
||||
def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
|
||||
self.store = initial_store # Phase 1의 MemoryStore snapshot을 입력으로 받는다
|
||||
self.store = initial_store # Takes the Phase 1 MemoryStore snapshot as input
|
||||
|
||||
def run(self):
|
||||
for t, ops in groupby(op_log, key=lambda o: o.t_start):
|
||||
@@ -347,30 +349,30 @@ class DataExecutor:
|
||||
self._execute_sequential(sequential)
|
||||
```
|
||||
|
||||
**병렬 실행 판정**:
|
||||
**Parallel execution determination**:
|
||||
|
||||
같은 `t_start`의 op들은 **병렬 후보**로 간주한다.
|
||||
실제 병렬 실행 여부는 executor가 다음 기준으로 판정한다:
|
||||
- read/write 주소 범위 겹침 여부 (WAW, RAW, WAR 충돌 검사)
|
||||
- `dependency_ids`에 명시된 선행 op 완료 여부
|
||||
Ops with the same `t_start` are considered **parallel candidates**.
|
||||
The executor determines actual parallel execution based on the following criteria:
|
||||
- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
|
||||
- Whether predecessor ops specified in `dependency_ids` have completed
|
||||
|
||||
주소 범위가 겹치지 않고 명시적 의존성이 없는 op들만 병렬 실행한다.
|
||||
Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
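
The two criteria can be sketched as an interval-overlap test plus a `dependency_ids` lookup. This is a minimal illustration assuming each op exposes its read and write sets as `(start, nbytes)` ranges; the `Op` class and the `conflicts` / `split_parallel` helpers are hypothetical and much simpler than `OpRecord`.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    idx: int
    reads: list   # list of (start_addr, nbytes)
    writes: list  # list of (start_addr, nbytes)
    dependency_ids: list = field(default_factory=list)


def overlaps(a, b):
    """True if byte ranges a and b (start, nbytes) intersect."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def conflicts(first, second):
    raw = any(overlaps(w, r) for w in first.writes for r in second.reads)
    war = any(overlaps(r, w) for r in first.reads for w in second.writes)
    waw = any(overlaps(w1, w2) for w1 in first.writes for w2 in second.writes)
    return raw or war or waw or first.idx in second.dependency_ids


def split_parallel(ops):
    """Split ops sharing one t_start into a parallel set and a sequential tail."""
    parallel, sequential = [], []
    for op in ops:
        if any(conflicts(prev, op) for prev in parallel + sequential):
            sequential.append(op)  # must run after its predecessors
        else:
            parallel.append(op)
    return parallel, sequential
```

An op lands in `parallel` only if it conflicts with nothing that came before it, so running the parallel set first and the sequential tail in order preserves every detected dependency.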
|
||||
|
||||
**배치 최적화**: 동일 op_name이며 **shape, dtype, layout, transpose flag가
|
||||
모두 동일한** 독립 op들만 batching 대상이 된다.
|
||||
예: 여러 PE의 동일 shape GEMM → `np.matmul(a_batch, b_batch)` 한 번으로 묶음.
|
||||
CPU에서도 BLAS 효율 향상, GPU에서는 launch overhead 절감.
|
||||
**Batch optimization**: Only independent ops with the same op_name **and identical
|
||||
shape, dtype, layout, and transpose flags** are eligible for batching.
|
||||
Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
|
||||
Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
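
As an illustration of the batching idea, independent GEMMs with identical shape and dtype can be stacked along a leading axis and multiplied in one call. The shapes follow the `(128, 256) x (256, 128)` example from the params section; the random inputs and the count of four PEs are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# One (A, B) operand pair per PE, all with the same shape and dtype.
a_list = [rng.standard_normal((128, 256)).astype(np.float32) for _ in range(4)]
b_list = [rng.standard_normal((256, 128)).astype(np.float32) for _ in range(4)]

a_batch = np.stack(a_list)  # shape (4, 128, 256)
b_batch = np.stack(b_list)  # shape (4, 256, 128)

# np.matmul treats the leading axis as a batch dimension.
out_batch = np.matmul(a_batch, b_batch)  # shape (4, 128, 128)
```

`np.stack` copies the operands, so whether batching pays off depends on tensor size; the sketch does not measure that.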
|
||||
|
||||
**Phase 2 실행 순서 보장**:
|
||||
**Phase 2 execution order guarantee**:
|
||||
|
||||
Phase 2는 데이터 도착 시점을 고려하지 않으며,
|
||||
dependency (주소 기반 추론 + 명시적 dependency_ids)를 통해서만
|
||||
실행 순서를 보장한다.
|
||||
Phase 2 does not consider data arrival timing,
|
||||
and guarantees execution order solely through
|
||||
dependencies (address-based inference + explicit dependency_ids).
|
||||
|
||||
### D7. Memory Store
|
||||
|
||||
`MemoryStore`는 논리적으로 byte-addressable semantics를 따르며,
|
||||
현재 구현은 **tensor-granular storage** (addr → numpy ndarray 매핑)를 사용한다.
|
||||
`MemoryStore` logically follows byte-addressable semantics,
|
||||
and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).
|
||||
|
||||
```python
|
||||
class MemoryStore:
|
||||
@@ -378,139 +380,140 @@ class MemoryStore:
|
||||
def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
|
||||
```
|
||||
|
||||
**내부 저장 포맷: numpy ndarray**
|
||||
**Internal storage format: numpy ndarray**
|
||||
|
||||
MemoryStore는 텐서를 **numpy ndarray**로 저장한다.
|
||||
MemoryStore stores tensors as **numpy ndarrays**.
|
||||
|
||||
| Candidate | store/load speed | Phase 2 compute | Verdict |
|-----------|------------------|-----------------|---------|
| **numpy ndarray** | Immediate (passed by reference, no copy) | `np.matmul` directly usable | **Adopted** |
| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
| torch tensor | Immediate | torch operations available | Used only for GPU optimization |
|
||||
|
||||
- write: numpy array를 **참조 저장** (복사 없음) → Phase 1 오버헤드 = dict lookup 1회
|
||||
- read: numpy array를 **참조 반환** (복사 없음)
|
||||
- 동일 addr에 재 write 시 기존 array를 **tensor 단위로 덮어쓴다** (partial overwrite 미지원)
|
||||
- dtype은 numpy native 사용 (`np.float16`, `np.float32`, `np.bfloat16` 등)
|
||||
- byte-level access가 필요한 경우 `.view(np.uint8)` 로 변환
|
||||
- Phase 2에서 GPU batch 최적화 시 numpy → torch tensor 변환은 executor가 담당
|
||||
- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
|
||||
- read: **returns numpy array by reference** (no copy)
|
||||
- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
|
||||
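- dtype uses numpy native types (`np.float16`, `np.float32`, etc.); note that NumPy has no built-in `bfloat16`, so bf16 tensors need an extension dtype such as the one from the `ml_dtypes` package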
|
||||
- For byte-level access, convert via `.view(np.uint8)`
|
||||
- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
|
||||
|
||||
**read/write contract**:
|
||||
|
||||
- read/write는 **contiguous tensor** 기준이다.
|
||||
non-contiguous stride view가 필요한 경우 별도 copy op으로 표현한다.
|
||||
- 일반 benchmark path에서는 producer/consumer dtype 일치를 기대한다.
|
||||
reinterpret cast는 low-level memory validation 또는 특수 테스트 케이스를 위한
|
||||
permissive behavior이다.
|
||||
- addr은 byte-aligned이며, 최소 alignment = dtype 크기.
|
||||
- dtype mismatch (write와 다른 dtype으로 read)는 reinterpret cast로 처리한다.
|
||||
shape 불일치 시 nbytes 기준으로 검증하고, 불일치하면 error.
|
||||
- 정합성 기준은 주소 범위 기반 read/write semantics를 따른다.
|
||||
- 구현 최적화로 tensor object cache를 둘 수 있지만,
|
||||
canonical state는 byte-addressable storage이다.
|
||||
- deploy 시점에 호스트가 초기 텐서 데이터를 주입한다.
|
||||
- read/write operates on a **contiguous tensor** basis.
|
||||
If non-contiguous stride views are needed, express them as separate copy ops.
|
||||
- In the normal benchmark path, producer/consumer dtype match is expected.
|
||||
Reinterpret cast is a permissive behavior for low-level memory validation
|
||||
or special test cases.
|
||||
- addr is byte-aligned, with minimum alignment = dtype size.
|
||||
- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
|
||||
Shape mismatch is verified based on nbytes, and raises an error on mismatch.
|
||||
- Correctness criteria follow address-range-based read/write semantics.
|
||||
- A tensor object cache may be used as an implementation optimization,
|
||||
but the canonical state is byte-addressable storage.
|
||||
- At deploy time, the host injects initial tensor data.
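
The contract above can be sketched as a small dict-backed store. The `read()` signature follows the one shown in this ADR; the `write()` signature, the `(space, addr)` key, and the choice of `ValueError` are assumptions made for this example.

```python
import numpy as np


class MemoryStore:
    def __init__(self):
        self._tensors = {}  # (space, addr) -> ndarray, tensor-granular

    def write(self, space, addr, data):
        # Stored by reference; a re-write replaces the whole tensor.
        self._tensors[(space, addr)] = data

    def read(self, space, addr, shape, dtype):
        stored = self._tensors[(space, addr)]
        want = np.dtype(dtype)
        nbytes = int(np.prod(shape)) * want.itemsize
        if nbytes != stored.nbytes:
            raise ValueError(f"read of {nbytes} bytes does not match stored {stored.nbytes}")
        if stored.dtype == want:
            return stored.reshape(shape)  # no copy for contiguous arrays
        # dtype mismatch: reinterpret the same bytes instead of converting values.
        return stored.reshape(-1).view(want).reshape(shape)
```

For example, a `(4,)` float32 tensor can be read back as `(16,)` uint8 because both describe 16 bytes, while a `(3,)` float32 read is rejected.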
|
||||
|
||||
### D8. 벤치마크 커널 코드
|
||||
### D8. Benchmark Kernel Code
|
||||
|
||||
벤치마크의 **사용자 코드 API는 변경하지 않는다**.
|
||||
`tl.load()`, `tl.composite()`, `tl.store()` 등의 호출 인터페이스는 유지.
|
||||
The benchmark's **user code API is not changed**.
|
||||
The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.
|
||||
|
||||
단, 내부 command/message schema는 Phase 2 실행에 필요한 metadata를
|
||||
포함하도록 확장될 수 있다 (예: dtype_acc, transpose 등 추가 필드).
|
||||
However, internal command/message schemas may be extended to include metadata
|
||||
required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
|
||||
|
||||
### D9. 컴포넌트 변경 없음
|
||||
### D9. No Component Changes
|
||||
|
||||
개별 컴포넌트 구현(PE_GEMM, PE_DMA, HBM_CTRL 등)은 수정하지 않는다.
|
||||
op_log 기록은 ComponentBase hook의 책임이다.
|
||||
커스텀 컴포넌트 교체 시 타이밍 모델만 교체되며,
|
||||
Phase 2 데이터 실행은 영향받지 않는다.
|
||||
Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
|
||||
Op log recording is the responsibility of the ComponentBase hook.
|
||||
When custom components are replaced, only the timing model changes,
|
||||
and Phase 2 data execution is unaffected.
|
||||
|
||||
### D10. Phase 2는 Optional
|
||||
### D10. Phase 2 is Optional
|
||||
|
||||
```python
|
||||
engine = GraphEngine(graph)
|
||||
engine.run(benchmark) # Phase 1: 타이밍만
|
||||
engine.run(benchmark) # Phase 1: timing only
|
||||
result = engine.get_timing_result()
|
||||
|
||||
if verify_data:
|
||||
executor = DataExecutor(engine.op_log) # Phase 2: 데이터
|
||||
executor = DataExecutor(engine.op_log) # Phase 2: data
|
||||
executor.run()
|
||||
executor.verify(expected_output)
|
||||
```
|
||||
|
||||
타이밍 분석만 필요하면 Phase 2를 건너뛴다.
|
||||
op_logger를 비활성화하면 Phase 1 성능도 기존과 동일.
|
||||
If only timing analysis is needed, Phase 2 is skipped.
|
||||
If the op_logger is disabled, Phase 1 performance is the same as before this change.
|
||||
|
||||
### D11. Verification Contract
|
||||
|
||||
기본 검증은 **최종 output tensor**를 reference backend(numpy)와 비교한다.
|
||||
Basic verification **compares the final output tensor** against a reference backend (numpy).
|
||||
|
||||
dtype별 tolerance 정책:
|
||||
Per-dtype tolerance policy:
|
||||
|
||||
| dtype | Comparison method | Tolerance |
|-------|-------------------|-----------|
| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
| int types | `np.array_equal` | exact |
|
||||
|
||||
- 기본 모드: 최종 output만 비교 (end-to-end correctness)
|
||||
- 디버그 모드: intermediate tensor도 op 단위로 비교 가능
|
||||
- Default mode: compare final output only (end-to-end correctness)
|
||||
- Debug mode: can compare intermediate tensors on a per-op basis
|
||||
(MemoryStore snapshot at each op boundary)
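
The tolerance policy can be sketched as a lookup keyed by dtype name. The tolerance values come from the table in this section; the `outputs_match` helper and the string keys are hypothetical, and anything not listed falls back to exact comparison as the table specifies for integer types.

```python
import numpy as np

# (rtol, atol) per floating-point dtype name, taken from the policy table.
TOLERANCES = {
    "f32": (1e-5, 1e-5),
    "f16": (1e-3, 1e-3),
    "bf16": (1e-2, 1e-2),
}


def outputs_match(actual, expected, dtype_name):
    if dtype_name in TOLERANCES:
        rtol, atol = TOLERANCES[dtype_name]
        return bool(np.allclose(actual, expected, rtol=rtol, atol=atol))
    # Integer types: exact comparison.
    return bool(np.array_equal(actual, expected))
```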
|
||||
|
||||
---
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **Compute-result-based control flow**: not supported.
  All compute handles are pending during Phase 1, and `wait()` expresses
  timing synchronization only; it does not imply data readiness.
  Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
  is **treated as an error**.
  Memory-data-based branching (on the results of `tl.load()`) is supported via greenlet.
  Phase 1 materialization is a future extension (see D3).
- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
  the execution-time overlap from Phase 1. Phase 2 only verifies data correctness.
- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
  and do not reproduce the actual hardware PE microarchitecture.
|
||||
|
||||
## Open Questions
|
||||
|
||||
- **Aliasing / slice view**: 동일 backing storage를 참조하는 slice/view를
|
||||
MemoryStore에서 어떻게 표현할지 (stride-based view vs copy semantics)
|
||||
- **IPCQ/descriptor read 일반화**: PE-to-PE 통신을 memory op으로 완전히
|
||||
일반화할지, 별도 op_kind를 둘지
|
||||
- **Op log streaming**: 대규모 시뮬레이션에서 op_log 메모리 사용량 관리
|
||||
- **Aliasing / slice view**: How to represent slice/views referencing the same
|
||||
backing storage in MemoryStore (stride-based view vs copy semantics)
|
||||
- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
|
||||
communication as memory ops or introduce a separate op_kind
|
||||
- **Op log streaming**: Managing op_log memory usage in large-scale simulations
|
||||
(in-memory list vs disk-backed streaming)
|
||||
- **Fused operation**: tl.composite의 tiled pipeline (READ→COMPUTE→WRITE)을
|
||||
하나의 fused op record로 기록할지, 개별 op으로 분리할지
|
||||
- **Math op schema 일반화**: 현재 math params는 단순 구조이나,
|
||||
broadcasting rule, input별 dtype, keepdims, scalar/immediate operand,
|
||||
where/mask 표현 등 일반화가 필요할 수 있음
|
||||
- **Op record 식별자**: 현재 dependency_ids는 in-memory list index 기반이며,
|
||||
streaming/disk-backed mode 도입 시 stable op_id로 대체 필요
|
||||
- **Phase 1 materialization policy**: D3의 Future Extension 참조.
|
||||
허용 시 해당 op의 Phase 2 처리 방식 (skip / verify / recompute) 정의 필요
|
||||
- **Fused operation**: Whether to record tl.composite's tiled pipeline
|
||||
(READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
|
||||
- **Math op schema generalization**: The current math params have a simple structure,
|
||||
but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
|
||||
scalar/immediate operands, where/mask expressions, etc.
|
||||
- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
|
||||
replacement with stable op_id is needed when introducing streaming/disk-backed mode
|
||||
- **Phase 1 materialization policy**: See Future Extension in D3.
|
||||
If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
|
||||
needs to be defined
|
||||
|
||||
---
|
||||
|
||||
## Consequences
|
||||
|
||||
### 긍정적
|
||||
### Positive
|
||||
|
||||
- SimPy 시뮬레이션 성능 영향 최소 (op_log append만 추가)
|
||||
- Phase 2에서 멀티스레드/GPU 자유롭게 사용 가능
|
||||
- 컴포넌트 교체 자유도 유지 (ADR-0015 설계 철학 보존)
|
||||
- 벤치마크 사용자 코드 API 변경 불필요
|
||||
- 새 메시지 타입 추가 시 data_op 플래그만 설정
|
||||
- greenlet으로 Phase 0 제거 — 메모리 데이터 기반 dynamic control flow 지원
|
||||
- `tl.load()`가 실제 데이터를 반환하므로 커널 디버깅 용이
|
||||
- Minimal impact on SimPy simulation performance (only op_log append added)
|
||||
- Free to use multi-threading/GPU in Phase 2
|
||||
- Component replaceability preserved (ADR-0015 design philosophy maintained)
|
||||
- No changes needed to benchmark user code API
|
||||
- When adding new message types, only set the data_op flag
|
||||
- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
|
||||
- `tl.load()` returns actual data, making kernel debugging easier
|
||||
|
||||
### 부정적
|
||||
### Negative
|
||||
|
||||
- op_log 메모리 사용량 (대규모 시뮬레이션 시)
|
||||
- Phase 2 실행 시간은 텐서 크기에 비례 (대형 GEMM)
|
||||
- pending handle (연산 미완료) 기반 동적 분기 불가
|
||||
(연산은 Phase 2에서 실행, Phase 1에서 결과 값 미확정).
|
||||
메모리 데이터 기반 분기는 greenlet으로 지원된다.
|
||||
- greenlet C 확장 의존성 추가 (pip install greenlet)
|
||||
- op_log memory usage (for large-scale simulations)
|
||||
- Phase 2 execution time grows with tensor size (notably for large GEMMs)
|
||||
- Dynamic branching based on pending handles (incomplete computations) not possible
|
||||
(computations execute in Phase 2, result values are undetermined in Phase 1).
|
||||
Memory-data-based branching is supported via greenlet.
|
||||
- greenlet C extension dependency added (pip install greenlet)
|
||||
|
||||
@@ -1,882 +0,0 @@
|
||||
# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
### Goal
|
||||
|
||||
Add the infrastructure that lets CCL (Collective Communication Library)
|
||||
kernels run **inside** a PE. The host just launches a kernel on each
|
||||
SIP; the actual synchronization and data movement happen **inside the
|
||||
PE kernel via an IPCQ (Inter-Process Communication Queue)**.
|
||||
|
||||
This mirrors how NCCL performs NVLink communication inside a GPU
|
||||
kernel, or how Cerebras / Tenstorrent expose core-local communication
|
||||
queues. Host-level collectives (`dist.all_reduce`) are deferred to
|
||||
**future work**; this ADR focuses solely on the kernel-side collective
|
||||
infrastructure.
|
||||
|
||||
### Problems to solve
|
||||
|
||||
1. PE-to-PE direct data movement (writing into a peer's memory).
|
||||
2. Synchronization — the sender must check that the receiver has space
|
||||
in its buffer (backpressure).
|
||||
3. Resource contention between compute traffic and communication
|
||||
traffic (Head-of-Line blocking).
|
||||
4. The host must be able to construct logical neighbor topologies
|
||||
(ring / mesh / tree) per algorithm.
|
||||
|
||||
---
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Add a new `PE_IPCQ` component
|
||||
|
||||
A new component `PE_IPCQ` is added inside each PE. It follows the same
|
||||
pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a
|
||||
distinct component.
|
||||
|
||||
```
PE
├── PE_CPU
├── PE_SCHEDULER
├── PE_DMA
├── PE_IPCQ          ← new
├── PE_FETCH_STORE
├── PE_GEMM
├── PE_MATH
├── PE_TCM
└── PE_MMU
```
|
||||
|
||||
**Role separation** (control plane vs. data plane):
|
||||
|
||||
- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head /
|
||||
tail pointer management, peer pointer caches, backpressure, 4-direction
|
||||
neighbor mapping.
|
||||
- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe
|
||||
/ PCIE into the peer's memory.
|
||||
|
||||
PE_IPCQ does **not** move data itself — it delegates to PE_DMA.
|
||||
|
||||
### D2. Ring buffer model
|
||||
|
||||
Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.
|
||||
|
||||
```python
from dataclasses import dataclass

# Direction and IpcqEndpoint (D2.5) are defined elsewhere in the codebase.


@dataclass
class IpcqQueuePair:
    direction: Direction      # N/S/E/W
    peer: IpcqEndpoint        # set by host at init time (D2.5)
    tx_buffer_base: int       # outgoing data base addr (in our memory)
    rx_buffer_base: int       # incoming data base addr (in our memory)
    slot_size: int            # 1 tile per slot
    n_slots: int              # ring depth
    my_head: int              # next slot we will write/send into
    my_tail: int              # next slot we will read/recv from
    peer_head_cache: int      # peer's last-seen head (updated via D9 piggyback)
    peer_tail_cache: int      # peer's last-seen tail (updated via D9 fast-path credit)
```
|
||||
|
||||
**Canonical field names**: throughout this ADR the four names above
|
||||
(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used
|
||||
consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`,
|
||||
etc.) are not used.
|
||||
|
||||
| Field | Owner | Updated when |
|-------|-------|--------------|
| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |
|
||||
|
||||
**Slot unit**: fixed-size, one slot holds one full tile (no descriptor
|
||||
indirection). Full data embedded in the slot. See D5.
|
||||
|
||||
### D2.5. `IpcqEndpoint` schema
|
||||
|
||||
`IpcqQueuePair.peer` carries everything the sender needs to compute the
|
||||
peer's rx slot address:
|
||||
|
||||
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IpcqEndpoint:
    sip: int
    cube: int
    pe: int
    buffer_kind: str      # "tcm" | "hbm" | "sram"
    rx_base_pa: int       # peer rx_buffer base PA (PhysAddr.encode())
    rx_base_va: int       # peer rx_buffer base VA (optional, MMU mode)
    n_slots: int          # peer ring depth (for wrap-around)
    slot_size: int        # peer slot size (for offset)
```
|
||||
|
||||
Address computation:
|
||||
|
||||
```python
|
||||
slot_idx = self.my_head % peer.n_slots
|
||||
dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
|
||||
```
|
||||
|
||||
PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA
|
||||
(vc_comm) routes the data to `dst_pa` through the fabric.
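
As a quick check of the wrap-around arithmetic above, here is a minimal sketch. `Endpoint` is a hypothetical stand-in that holds only the three `IpcqEndpoint` fields the formula uses, and the base address and sizes are made-up illustration values rather than anything taken from the simulator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical subset of IpcqEndpoint: only the fields the formula needs."""
    rx_base_pa: int
    n_slots: int
    slot_size: int


def peer_slot_pa(my_head: int, peer: Endpoint) -> int:
    """Physical address of the peer rx slot that the next send targets."""
    slot_idx = my_head % peer.n_slots            # wrap around the ring
    return peer.rx_base_pa + slot_idx * peer.slot_size


# Illustration values only (assumed, not from ccl.yaml).
peer = Endpoint(rx_base_pa=0x1000, n_slots=8, slot_size=4096)
addrs = [peer_slot_pa(head, peer) for head in (0, 1, 7, 8)]
```

Because `my_head` is a monotonically growing counter, head 8 maps back to the same slot as head 0 on an 8-slot ring.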
|
||||
|
||||
**Endpoint construction order**: at backend init (D10), the IPCQ
|
||||
buffers for **every PE** are allocated first (so each rank knows the
|
||||
others' PA), then the per-rank neighbor tables are built and pushed to
|
||||
PE_IPCQ via `IpcqInitMsg`.
|
||||
|
||||
### D3. Four-direction mapping ≡ logical ProcessGroup
|
||||
|
||||
The PE views four directions (N/S/E/W) as logical ports. Real peer
|
||||
addresses are configured by the host CCL init, per the chosen
|
||||
algorithm. The PE kernel never knows the topology, only directions.
|
||||
|
||||
```python
# 1D ring
for rank in range(world_size):
    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])

# 2D mesh
for r in range(R):
    for c in range(C):
        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
```
|
||||
|
||||
The PE code does not need to know where `tl.send(dir="E", ...)` actually
|
||||
ends up.
|
||||
|
||||
### D4. PE kernel API
|
||||
|
||||
```python
|
||||
# Send (blocking; may stall on backpressure)
|
||||
tl.send(dir: str, src=TensorHandle)
|
||||
tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
|
||||
|
||||
# Recv (blocking)
|
||||
recv = tl.recv(dir: str, shape=..., dtype=...)
|
||||
recv = tl.recv(shape=..., dtype=...) # round-robin across 4 directions
|
||||
|
||||
# Recv (non-blocking)
|
||||
fut = tl.recv_async(dir: str, shape=..., dtype=...)
|
||||
recv = tl.wait(fut)
|
||||
```
|
||||
|
||||
`tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each
|
||||
call rotates through directions, returning the first available slot.
|
||||
Empty in all 4 directions → wait.
|
||||
|
||||
**Fairness is weak**: the rotating start mitigates simple bias, but if
|
||||
one direction always wins the race the others can starve. Algorithms
|
||||
that need strict fairness must call `tl.recv(dir=...)` explicitly.
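
The rotating cursor can be sketched in a few lines. This is an illustration under assumptions, not the PE_IPCQ implementation: `RoundRobinPoller` is a hypothetical class, the rotation order N, E, S, W is assumed, and the `ready` set stands in for the real check of whether a direction has a slot available.

```python
DIRECTIONS = ("N", "E", "S", "W")  # assumed rotation order


class RoundRobinPoller:
    """Hypothetical model of the direction-less tl.recv() scan."""

    def __init__(self):
        # Start so that the very first scan begins at DIRECTIONS[0].
        self.last_polled_dir = len(DIRECTIONS) - 1

    def poll(self, ready):
        """Return the first ready direction after the cursor, or None."""
        n = len(DIRECTIONS)
        for step in range(1, n + 1):
            idx = (self.last_polled_dir + step) % n
            if DIRECTIONS[idx] in ready:
                self.last_polled_dir = idx   # next scan starts after this one
                return DIRECTIONS[idx]
        return None                          # empty in all 4 directions: caller waits


poller = RoundRobinPoller()
picks = [poller.poll({"E", "W"}) for _ in range(3)]
idle = poller.poll(set())
```

With E and W both ready on every call, the picks alternate between them. The rotation only evens out the scan order; it does not stop a direction from starving if its data always arrives just after the cursor has passed.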
|
||||
|
||||
### D5. Single-hop DMA write + full-data slot model
|
||||
|
||||
Data moves from sender memory into the receiver's ring slot in **one
|
||||
DMA transfer**. Key properties:
|
||||
|
||||
- **Single-hop**: the sender already knows the peer rx slot address and
|
||||
fires one fabric DMA into it.
|
||||
- **No CPU memcpy**: the CPU never copies data.
|
||||
- **No intermediate staging**: neither side keeps a separate staging
|
||||
buffer (sender uses the source addr directly; receiver gets the data
|
||||
in its ring slot directly).
|
||||
|
||||
(Strictly speaking the fabric DMA write does happen, so this is not
|
||||
literally "no data movement" — it's the same property NCCL labels
|
||||
"zero-copy", meaning no CPU memcpy and no staging copy.)
|
||||
|
||||
```
PE A: tl.send(E, src_addr, nbytes)
  1. IPCQ computes the peer rx slot address:
       dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
  2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
       (full → sleep / poll)
  3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
  4. my_head += 1

PE B: data = tl.recv(W)
  1. Look at rx_buffer[my_tail % n_slots]
  2. Wait for the data to arrive (D7 backpressure mode)
  3. Return the slot address to the kernel (or fetch into register file)
  4. my_tail += 1
  5. Issue a credit return on the fast path (D9): once the fast-path
     latency has elapsed, PE A's peer_tail_cache is updated.
```
|
||||
|
||||
The slot holds the full tile. The receiver only reads its own
|
||||
rx_buffer; it never reads back into A's memory. The sender knows the
|
||||
peer rx slot address and DMAs directly into it (single-hop).
|
||||
|
||||
The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local
|
||||
to the PE).
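
The pointer bookkeeping in the send / recv flow can be modeled without any timing. The sketch below is a simplification under stated assumptions: `RingModel` is a hypothetical class that holds both ends of one direction in a single object, data arrival is instantaneous, and `deliver_credit()` stands in for the D9 fast-path credit arriving at the sender.

```python
class RingModel:
    """Hypothetical zero-latency model of one sender/receiver ring."""

    def __init__(self, n_slots):
        self.n_slots = n_slots
        self.slots = [None] * n_slots
        self.my_head = 0           # sender: next slot to write
        self.peer_tail_cache = 0   # sender: last tail value it has heard of
        self.my_tail = 0           # receiver: next slot to read
        self.peer_head_cache = 0   # receiver: last head value it has heard of

    def try_send(self, tile):
        if self.my_head - self.peer_tail_cache >= self.n_slots:
            return False                          # ring looks full: backpressure
        self.slots[self.my_head % self.n_slots] = tile
        self.my_head += 1
        self.peer_head_cache = self.my_head       # piggybacked head update
        return True

    def try_recv(self):
        if self.peer_head_cache <= self.my_tail:
            return None                           # nothing has arrived yet
        tile = self.slots[self.my_tail % self.n_slots]
        self.my_tail += 1
        return tile

    def deliver_credit(self):
        self.peer_tail_cache = self.my_tail       # credit reaches the sender


ring = RingModel(n_slots=2)
sent = [ring.try_send(t) for t in ("a", "b", "c")]   # third send hits a full ring
first = ring.try_recv()
blocked_before_credit = not ring.try_send("c")       # slot is free, sender does not know yet
ring.deliver_credit()
sent_after_credit = ring.try_send("c")
rest = [ring.try_recv(), ring.try_recv()]
```

The sender stays blocked after the receiver has consumed a slot until the credit is delivered, which is why the credit-return latency matters for throughput on shallow rings.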
|
||||
|
||||
### D6. Buffer placement — three-way benchmark
|
||||
|
||||
The host CCL init picks the IPCQ ring-buffer location:
|
||||
|
||||
```python
ipcq_init(
    backend="ahbm",
    buffer_kind="tcm",   # one of "tcm" | "hbm" | "sram"
    n_slots=8,
    slot_size=4096,
)
```
|
||||
|
||||
| Location | Trait | Trade-off |
|----------|-------|-----------|
| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
| **PE-local HBM** | Large; via DMA | Higher latency |
| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |
|
||||
|
||||
All three locations run the same kernel code; only the init differs.
|
||||
|
||||
### D7. Backpressure — two-mode benchmark
|
||||
|
||||
How the sender or receiver waits when peer slots are full / data not
|
||||
yet arrived:
|
||||
|
||||
| Mode | Behavior | Model |
|------|----------|-------|
| **poll** | Periodically re-check the cached peer pointer | Spin loop |
| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |
|
||||
|
||||
```python
|
||||
ipcq_init(backpressure="poll" | "sleep", ...)
|
||||
```
|
||||
|
||||
Both modes are implemented so latency / throughput trade-offs can be
|
||||
benchmarked.
|
||||
|
||||
### D8. PE_DMA virtual channels
|
||||
|
||||
Extend PE_DMA from a single queue into a **two-channel virtual-channel**
|
||||
model.
|
||||
|
||||
```
|
||||
PE_DMA
|
||||
├── vc_compute: tile load / store / writeback for GEMM and Math
|
||||
└── vc_comm: IPCQ send data
|
||||
```
|
||||
|
||||
Each VC has an independent state machine:
|
||||
|
||||
- One channel stalling does not block the other.
|
||||
- The same physical link (cube_noc, UCIe, …) is shared, but link BW is
|
||||
split between channels.
|
||||
|
||||
**Chunk-level interleave**:
|
||||
|
||||
- Large GEMM tile DMAs do not lock the link end-to-end.
|
||||
- Progress happens in chunks (e.g. 256 B); each chunk shares link BW
|
||||
with the other VC's pending chunks.
|
||||
- Chunk size is an init parameter (smaller = fairer, larger = more
|
||||
efficient).
|
||||
|
||||
Net effect:
|
||||
|
||||
- HoL blocking is eliminated (an IPCQ send can interleave with a long
|
||||
compute DMA).
|
||||
- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM
|
||||
pattern).
|
||||
- Matches the NoC-virtual-channel pattern used in real HW.
|
||||
|
||||
**First-implementation accuracy limit (intentional)**: this ADR's
|
||||
first cut uses **deterministic chunk-level interleave + weighted
|
||||
round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`).
|
||||
This is a first-order approximation and is simpler than real HW
|
||||
dynamic-contention / credit-based arbiters. Functional correctness is
|
||||
unaffected, but heavy-contention scenarios may report slightly
|
||||
optimistic latency vs. real HW. A separate ADR can add a NoC arbiter
|
||||
component later if more precision is needed.
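
To make the chunk-level interleave concrete, here is a minimal sketch of weighted round-robin over two virtual channels. It is an assumption-laden illustration: `interleave` is a hypothetical function, each VC is reduced to a FIFO of transfer sizes in bytes, and link timing is ignored so that only the chunk order is visible.

```python
from collections import deque


def interleave(compute_bytes, comm_bytes, chunk=256, weights=(1, 1)):
    """Return the order in which chunks cross the shared link.

    Per round, each VC may send up to its weight in chunks
    (weights=(1, 1) corresponds to the 50 / 50 default).
    """
    queues = {"vc_compute": deque(compute_bytes), "vc_comm": deque(comm_bytes)}
    remaining = {name: 0 for name in queues}   # bytes left in the active transfer
    order = []
    while any(queues.values()) or any(remaining.values()):
        for name, weight in zip(queues, weights):
            for _ in range(weight):
                if remaining[name] == 0:
                    if not queues[name]:
                        break                  # this VC has nothing pending
                    remaining[name] = queues[name].popleft()
                sent = min(chunk, remaining[name])
                remaining[name] -= sent
                order.append((name, sent))
    return order


# One 1 KiB compute tile in flight, one 256 B IPCQ send queued behind it.
order = interleave(compute_bytes=[1024], comm_bytes=[256])
```

In this run the IPCQ chunk crosses the link second, after a single compute chunk, instead of waiting for all four compute chunks. That is the head-of-line effect the two-channel model is meant to remove.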
|
||||
|
||||
#### Token routing
|
||||
|
||||
- Compute tokens (`TileToken`) — go through the existing
|
||||
PE_FETCH_STORE → PE_DMA chain.
|
||||
- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA
|
||||
self-routing.
|
||||
- PE_DMA picks the channel by token type.
|
||||
|
||||
```python
class PeDmaComponent:
    def _process(self, env, token):
        if isinstance(token, IpcqDmaToken):
            yield from self._vc_comm_process(env, token)
        else:
            yield from self._vc_compute_process(env, token)
```
|
||||
|
||||
### D9. Pointer synchronization — DMA payload piggyback
|
||||
|
||||
Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so
|
||||
pointers update along with the data. This simulation adopts the same
|
||||
model: **no separate control channel** — metadata travels with the
|
||||
data.
|
||||
|
||||
The big benefits:
|
||||
|
||||
- **Automatic ordering**: data and metadata move on the same token, so
|
||||
data is visible **before** the head_cache update. No race.
|
||||
- **HW fidelity**: matches NVLink / UCIe piggybacked headers.
|
||||
- **Component simplification**: no separate `IpcqPtrUpdate` event type.
|
||||
|
||||
#### Send flow (head update via piggyback)
|
||||
|
||||
```
PE A: tl.send(E, src_addr, nbytes)
  1. PE_IPCQ checks backpressure (using peer_tail_cache)
  2. PE_IPCQ creates an IpcqDmaToken:
       - data body (src_addr → peer dst_addr)
       - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
  3. Hand the token to PE_DMA(vc_comm)
  4. PE A increments my_head (send tracking)

[fabric DMA: latency elapses]

PE B's PE_DMA receives the token
  5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
  6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)

PE B's PE_IPCQ receives the metadata
  7. Updates peer_head_cache (= A's head)
  8. Wakes any pending recv on that direction
```
|
||||
|
||||
**Steps 5 and 6 must execute in the same SimPy step** — DMA completion
|
||||
makes data and metadata atomically visible.
|
||||
|
||||
#### Recv flow (credit return — fast path with bottleneck-BW latency)
|
||||
|
||||
When the receiver frees a slot, the sender must learn about it
|
||||
(backpressure release). Unlike data, the credit return does **not**
|
||||
travel through general vc_comm fabric — it uses a **separate fast
|
||||
path**, an abstraction of the NVLink / UCIe credit-return wire.
|
||||
|
||||
**Latency** is computed from the **full path latency** (per-node
|
||||
overhead + edge propagation + drain), not a magic constant:
|
||||
|
||||
```
credit_size_bytes = 16          (ccl.yaml: ipcq_credit_size_bytes)
path    = router.find_path(self_pe, peer_pe.pe_dma)
latency = compute_path_latency_ns(path, credit_size_bytes)
        = sum(edge.distance_mm * ns_per_mm)
        + sum(node_overhead_ns[n] for n in path)
        + credit_size_bytes / bottleneck_bw_on_path
```
|
||||
|
||||
The router auto-appends `.pe_dma` to the source only, so the
destination MUST be spelled with the explicit `.pe_dma` suffix.
Without it `find_path` raises, and the credit was then delivered at
zero cost with no visible error (latent bug fixed alongside this update).
|
||||
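
The three terms of the latency formula can be sketched as a standalone function. This is not the simulator's `compute_path_latency_ns`: `Edge`, `path_latency_ns`, the parameter names, and every number below are hypothetical, chosen only to show how propagation, per-node overhead, and drain add up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical link description: length and bandwidth."""
    distance_mm: float
    bw_bytes_per_ns: float


def path_latency_ns(edges, node_overheads_ns, nbytes, ns_per_mm):
    """Propagation + per-node overhead + drain at the slowest link."""
    propagation = sum(e.distance_mm * ns_per_mm for e in edges)
    overhead = sum(node_overheads_ns)
    bottleneck_bw = min(e.bw_bytes_per_ns for e in edges)
    drain = nbytes / bottleneck_bw
    return propagation + overhead + drain


CREDIT_BYTES = 16  # matches the ipcq_credit_size_bytes default shown above

# Made-up paths: a short in-cube hop vs. a longer path with a slower link.
in_cube = path_latency_ns([Edge(2.0, 64.0)], [1.0, 1.0], CREDIT_BYTES, ns_per_mm=0.5)
cross_sip = path_latency_ns(
    [Edge(2.0, 64.0), Edge(10.0, 16.0)], [1.0, 2.0, 1.0], CREDIT_BYTES, ns_per_mm=0.5
)
```

Since every term is derived from the path, the shorter in-cube path comes out cheaper without any topology-specific constant.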
|
||||
`tl.recv` blocks on the credit-emit completion (recv yields-from
|
||||
`_delayed_credit_send` rather than spawning it as a fork). This puts
|
||||
the credit-return cost on the receiver's `pe_exec_ns`, modeling the
|
||||
IPCQ control-plane completing the consume-acknowledgement before
|
||||
recv returns to the kernel — the protocol equivalent of a non-posted
|
||||
`tl.store` waiting for an HBM ack on the raw DMA path.
|
||||
|
||||
That gives us:
|
||||
|
||||
- **Topology-proportional approximation**: an in-cube credit return is
|
||||
automatically faster than a cross-SIP credit return.
|
||||
- **No magic constants**: every nanosecond comes from
|
||||
`compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
|
||||
as data traffic.
|
||||
- **No deadlock risk**: unlike piggyback, B can issue credit even when
|
||||
it has no data to send back. `peer_credit_store.put` is unbounded.
|
||||
- **`IPCQ ≥ raw DMA`** in latency for matched physical moves: the
  credit-emit cost on recv balances the HBM ack-trip cost that raw DMA
  pays on the sender.
|
||||
|
||||
#### Component coupling — SimPy Store channel
|
||||
|
||||
PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init
|
||||
time, **a SimPy Store is wired between the two** (a per-direction
|
||||
fast-path channel) and credit metadata is `put` into that store.
|
||||
|
||||
```python
class PeIpcqComponent:
    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
        yield env.timeout(latency_ns)
        yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...))
```
|
||||
|
||||
Backend init wires both directions of the fast-path channel as part of
|
||||
fan-out (see `IpcqInitMsg` in D12).
|
||||
|
||||
#### Credit-return fast path limitations
|
||||
|
||||
- `credit_size_bytes` is an estimate (typically 16–64 bytes).
|
||||
- The fast path is **excluded from vc_comm BW contention** (separate
|
||||
wire). Real HW credit-return wires are very lightweight, so this is a
|
||||
reasonable first approximation.
|
||||
- A follow-up ADR can: model the credit fast path as a separate link
|
||||
(BW limit + contention), or switch to piggyback (`credit_return_mode:
|
||||
piggyback`).
|
||||
|
||||
#### PE_DMA's added responsibility
|
||||
|
||||
When `vc_comm` receives a token, PE_DMA processes it as the following
|
||||
sequence: pay the Transaction's terminal BW drain, then atomically
|
||||
write data and forward metadata. **No SimPy yield is allowed between
|
||||
the data write and the metadata forward** (invariant I6). The drain
|
||||
yield must sit before the atomic block, not inside it:
|
||||
|
||||
```python
def _on_vc_comm_recv(self, env, txn):
    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
    # sender PE_DMA). MUST happen before the atomic block so recv only
    # wakes after the bytes have "landed".
    drain = getattr(txn, "drain_ns", 0.0)
    if drain > 0:
        yield env.timeout(drain)

    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    # 1. Copy the payload into the receiver's rx slot
    data = self._memory_store.read(token.src_space, token.src_addr,
                                   shape=..., dtype=...)
    self._memory_store.write(token.dst_endpoint.buffer_kind,
                             token.dst_addr, data)
    # 2. Forward metadata to the local PE_IPCQ
    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
    # ───────────────────────────────────────────────────
```
|
||||
|
||||
The final `put` is yieldable but targets an unbounded internal store, so
it completes in a single step. That `put` is the closing call of the
atomic block; no other yield may be inserted between the data write and it.
|
||||
|
||||
#### Drain-at-inbound semantics (D9 timing model)
|
||||
|
||||
The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
|
||||
stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
|
||||
is paid at each forwarding component via `run()`, and the remaining
|
||||
BW drain is paid once at the Transaction's terminal. Every non-IPCQ
|
||||
Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
|
||||
`ComponentBase._forward_txn` at the terminal node. For IPCQ the
|
||||
destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
|
||||
(so IPCQ-specific data write + metadata forward can happen), so **the
|
||||
drain MUST be paid explicitly at the top of that handler** to keep
|
||||
IPCQ's timing model on par with every other fabric Transaction.
|
||||
|
||||
Side-effects of paying drain here:
|
||||
|
||||
- **SRC `tl.send`** is unchanged — fire-and-forget semantics are
|
||||
preserved because the sender PE_DMA does not `yield sub_done`. The
|
||||
`sub_done.succeed()` call (made after metadata forward below) is an
|
||||
event with no listener on the sender side.
|
||||
- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
|
||||
when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
|
||||
forward now happens after the drain, recv observes the full fabric
|
||||
transfer time including bandwidth cost.
|
||||
|
||||
Matches the physical picture: send dispatches and leaves; recv waits
|
||||
until the bytes have actually been drained into its inbox.
|
||||
|
||||
### D9.5. ADR-0020 (2-pass) integration
|
||||
|
||||
`tl.send` / `tl.recv` integrate with ADR-0020's two-pass model. Phase
|
||||
1 simulates timing **and** moves data via MemoryStore; Phase 2 enables
|
||||
op-log-based correctness verification.
|
||||
|
||||
#### Phase 1 (timing + data)
|
||||
|
||||
D9 models head and tail updates with two different mechanisms:
|
||||
|
||||
- **Send-side (head update)** — DMA payload piggyback. Data write and
|
||||
metadata forward happen in the same SimPy step → automatic atomic
|
||||
visibility.
|
||||
- **Recv-side (tail credit return)** — fast-path SimPy Store channel
|
||||
with bottleneck-BW latency, then `peer_tail_cache` update.
|
||||
|
||||
Together they preserve ring-buffer pointer consistency.
|
||||
|
||||
The op-log records `op_kind="ipcq"` entries for sends (with
|
||||
`src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with
|
||||
`recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).
|
||||
Two recv modes:
|
||||
|
||||
- **`return_slot`** (default): the slot address is returned to the
|
||||
kernel. Zero-copy.
|
||||
- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`,
|
||||
PE_IPCQ copies the slot data into the user dst.
|
||||
|
||||
#### Phase 2 (op_log replay)
|
||||
|
||||
When `DataExecutor` encounters an `op_kind="ipcq"` record:
|
||||
|
||||
- **send**: idempotent `src → dst` ndarray write.
|
||||
- **recv (`return_slot`)**: no-op (the slot already holds the data).
|
||||
- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.
|
||||
|
||||
IPCQ ops are pure data movement — Phase 2 has nothing extra to compute.
|
||||
The downstream GEMM / Math ops in `DataExecutor` will consume the data
|
||||
and naturally validate correctness.
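
The replay rules can be sketched as follows. This is an illustration under assumptions, not `DataExecutor` itself: `replay_ipcq` is a hypothetical function, memory is reduced to a dict keyed by `(space, addr)`, and the record fields `kind`, `src`, `dst`, and `slot` are simplified stand-ins for the op-log fields listed earlier.

```python
def replay_ipcq(record, memory):
    """Apply one ipcq op-log record to memory (dict: (space, addr) -> payload)."""
    if record["kind"] == "send":
        memory[record["dst"]] = memory[record["src"]]     # src -> peer rx slot
    elif record["recv_mode"] == "copy_to_dst":
        memory[record["dst"]] = memory[record["slot"]]    # slot -> user dst
    # recv with recv_mode == "return_slot": no-op, the slot already holds the data


memory = {("tcm", 0x100): [1.0, 2.0]}
log = [
    {"kind": "send", "src": ("tcm", 0x100), "dst": ("rx", 0x0)},
    {"kind": "recv", "recv_mode": "return_slot", "slot": ("rx", 0x0)},
    {"kind": "recv", "recv_mode": "copy_to_dst", "slot": ("rx", 0x0), "dst": ("hbm", 0x40)},
]
for rec in log:
    replay_ipcq(rec, memory)
snapshot = dict(memory)
for rec in log:            # a second replay must leave memory unchanged
    replay_ipcq(rec, memory)
```

Replaying the same log twice yields the same memory contents, which is the idempotence property the send and `copy_to_dst` rules rely on.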
|
||||
|
||||
### D10. Host CCL init keeps the PyTorch shape
|
||||
|
||||
The host code looks just like real PyTorch DDP. `init_process_group`
|
||||
creates the backend object; it does **not** receive IPCQ knobs
|
||||
(neighbor topology, buffer_kind, backpressure …).
|
||||
|
||||
```python
# benches/ccl_allreduce.py — same shape as real PyTorch
def worker(rank, world_size, torch):
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")   # reads ccl.yaml + topology
    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
    tensor.copy_(torch.from_numpy(init))
    dist.all_reduce(tensor, op="sum")
```
|
||||
|
||||
The IPCQ configuration is decided by the backend at
|
||||
`init_process_group` time: it loads `ccl.yaml`, picks the algorithm,
|
||||
and pushes IPCQ neighbor tables to every participating PE_IPCQ. The
|
||||
host code never has to know about IPCQ.
|
||||
|
||||
A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`.
|
||||
Switching algorithms is purely a `ccl.yaml` change — no host edits
|
||||
required.
|
||||
|
||||
#### Init flow (eager)
|
||||
|
||||
1. `init_process_group(backend="ahbm")` is called.
|
||||
2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
|
||||
3. Pulls topology + buffer_kind + backpressure + slot config from
|
||||
`algorithms[<algo>]`.
|
||||
4. **Immediately** installs neighbor tables on every PE_IPCQ
|
||||
(sideband or fabric `IpcqInitMsg`).
|
||||
5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally —
|
||||
PE_IPCQ is already prepared whether the kernel is a CCL kernel or
|
||||
not.
|
||||
|
||||
### D11. CCL config file (`ccl.yaml`)
|
||||
|
||||
IPCQ config and algorithm metadata live in a separate YAML file,
|
||||
following the same pattern as `components.yaml` and `topology.yaml`.
|
||||
|
||||
A single benchmark execution runs one algorithm
|
||||
(`defaults.algorithm`). Switching algorithms means editing
|
||||
`defaults.algorithm` only.
|
||||
|
||||
```yaml
defaults:
  algorithm: ring_allreduce_tcm
  buffer_kind: tcm          # tcm | hbm | sram
  backpressure: sleep       # poll | sleep
  n_slots: 8
  slot_size: 4096
  vc_chunk_size: 256
  ipcq_credit_size_bytes: 16

algorithms:
  ring_allreduce_tcm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d       # builtin name or "custom"
    buffer_kind: tcm
    n_elem: 8               # optional, per-algorithm tile width

  tree_allreduce_7:
    module: kernbench.ccl.algorithms.tree_allreduce
    topology: tree_binary
    buffer_kind: tcm
    world_size: 7           # algorithm-level override
    n_elem: 16

  custom_mesh:
    module: kernbench.ccl.algorithms.custom_mesh
    topology: custom        # the module supplies its own neighbors()
```
|
||||
|
||||
`world_size` is **not set in `defaults`**. The backend resolves it via:
|
||||
`algorithm-level override > defaults override > topology spec`. The
|
||||
last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP
|
||||
where `WORLD_SIZE` comes from env vars rather than config files.
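
The precedence chain can be written down as a small resolver. This is a sketch under assumptions: `resolve_world_size` is a hypothetical helper, `cfg` is the parsed `ccl.yaml` as a plain dict, and the `topology` dict with its three keys is an assumed shape for the topology spec.

```python
def resolve_world_size(cfg, algo_name, topology):
    """algorithm-level override > defaults override > topology spec."""
    algo = cfg.get("algorithms", {}).get(algo_name, {})
    if "world_size" in algo:
        return algo["world_size"]
    defaults = cfg.get("defaults", {})
    if "world_size" in defaults:
        return defaults["world_size"]
    # Last fallback: every PE in the topology is a rank.
    return topology["sips"] * topology["cubes_per_sip"] * topology["pes_per_cube"]


cfg = {
    "defaults": {"algorithm": "ring_allreduce_tcm"},
    "algorithms": {
        "ring_allreduce_tcm": {"topology": "ring_1d"},
        "tree_allreduce_7": {"topology": "tree_binary", "world_size": 7},
    },
}
topology = {"sips": 2, "cubes_per_sip": 4, "pes_per_cube": 8}   # made-up sizes

ring_ws = resolve_world_size(cfg, "ring_allreduce_tcm", topology)
tree_ws = resolve_world_size(cfg, "tree_allreduce_7", topology)
```

The ring entry has no override, so it falls through to the topology product, while the tree entry keeps its explicit `world_size: 7`.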
|
||||
|
||||
#### Algorithm module structure
|
||||
|
||||
Each algorithm module exports two hooks — `kernel` (required) and
|
||||
`neighbors` (optional) — plus a `kernel_args` helper that the
|
||||
backend uses to populate positional kernel arguments at `all_reduce`
|
||||
time:
|
||||
|
||||
```python
# src/kernbench/ccl/algorithms/ring_allreduce.py

def kernel_args(world_size: int, n_elem: int) -> tuple:
    return (n_elem, world_size)


def kernel(t_ptr, n_elem, world_size, tl):
    """Required: the PE kernel.

    IPCQ is already installed by the backend before this is called.
    The kernel only uses the four-direction send / recv API.
    """
    ...


def neighbors(rank, world_size, neighbor_map):
    """Optional: override the builtin topology's neighbor map.

    Returns a new dict, the modified-in-place dict, or None to keep the
    builtin map.
    """
    return None
```
|
||||
|
||||
#### `neighbors` override patterns
|
||||
|
||||
- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
|
||||
- **Pattern B — replace entirely**: ignore `neighbor_map` and return a
|
||||
brand-new dict.
|
||||
- **Pattern C — keep builtin**: omit `neighbors` or return None.
|
||||
|
||||
#### Builtin topologies
|
||||
|
||||
| topology | direction set |
|----------|---------------|
| `ring_1d` | E, W |
| `ring_1d_unidir` | E only |
| `mesh_2d` | N, S, E, W |
| `tree_binary` | parent, child_left, child_right |
| `none` | (empty); algorithm must supply `neighbors()` |
|
||||
|
||||
#### Adding a new algorithm
|
||||
|
||||
1. Write `kernel` and `kernel_args` in
|
||||
`src/kernbench/ccl/algorithms/<algo>.py`.
|
||||
2. Add an entry in `ccl.yaml`'s `algorithms` section.
|
||||
3. (Optional) provide `neighbors()` for custom topology.
|
||||
4. Set `defaults.algorithm` to the new algorithm.
|
||||
|
||||
The host bench (`benches/ccl_allreduce.py`) does not change.
|
||||
|
||||
### D12. Message / token schema
|
||||
|
||||
The new message types added by this ADR. They live in
|
||||
`src/kernbench/common/pe_commands.py` and
|
||||
`src/kernbench/runtime_api/kernel.py`.
|
||||
|
||||
#### `IpcqInitMsg` (sideband, fan-out at init)
|
||||
|
||||
The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors
|
||||
`MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`).
|
||||
Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`,
|
||||
`my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store`
|
||||
field: a `simpy.Store` instance pre-wired so the PE_IPCQ that emits a
credit can push `IpcqCreditMetadata` directly into the peer PE_IPCQ's
input queue.
|
||||
|
||||
#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)
|
||||
|
||||
Carries `direction`, source addr/space, nbytes, shape, dtype, and a
|
||||
handle id. `data_op=True` so it lands in the op_log.
|
||||
|
||||
#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)
|
||||
|
||||
Carries `direction` (or None for round-robin), `recv_mode`
|
||||
(`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape,
|
||||
dtype, blocking flag.
|
||||
|
||||
#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)
|
||||
|
||||
Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`)
|
||||
plus the head metadata (`sender_seq`, `src_sip/cube/pe`,
|
||||
`src_direction`). PE_DMA picks the channel by token type
|
||||
(`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`).
|
||||
|
||||
The receiver's PE_DMA, on token arrival, performs the I6 atomic
|
||||
sequence: write data into MemoryStore, then forward `IpcqMetaArrival`
|
||||
to the local PE_IPCQ.
|
||||
|
||||
#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)
|
||||
|
||||
Carries `consumer_seq` (= my_tail), source PE coords, and source
|
||||
direction. Travels through the dedicated SimPy Store channel rather
|
||||
than `vc_comm`. Latency = `compute_path_latency_ns(path, credit_size_bytes)` (full path latency, see D9).
|
||||
|
||||
There is **no `IpcqPtrUpdate` event** — head updates flow via D9
|
||||
piggyback, tail updates via the D9 fast-path channel.
|
||||
|
||||
### D13. Test strategy
|
||||
|
||||
Test plan:
|
||||
|
||||
#### T1. Unit tests (component-level)
|
||||
|
||||
- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure
|
||||
immediately forwards a token; full peer slot triggers backpressure
|
||||
(poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`;
|
||||
round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`.
|
||||
- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute`
|
||||
/ `vc_comm` independent progress, chunk interleave, BW split.
|
||||
- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d /
|
||||
mesh_2d / tree_binary correctness, mesh_2d non-square →
|
||||
`ValueError`, custom resolver returns the module's `neighbors`.
|
||||
|
||||
#### T2. Integration tests (E2E send/recv)
|
||||
|
||||
- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional
|
||||
no-deadlock), 4×4 mesh.
|
||||
- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode
|
||||
records `ipcq` ops in op_log; DataExecutor produces correct
|
||||
`out.data`.
|
||||
|
||||
#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)
|
||||
|
||||
`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA
|
||||
consistency, per-`buffer_kind` allocation.
|
||||
|
||||
#### T4. Regression
|
||||
|
||||
All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for
|
||||
non-CCL benches.
|
||||
|
||||
#### T5. Performance / overhead
|
||||
|
||||
Single send/recv pair latency = (DMA latency) + (IPCQ overhead).
|
||||
Should be close to a regular PE_DMA write of the same nbytes (IPCQ
|
||||
overhead < 100 ns).
|
||||
|
||||
### D14. Invariants and failure modes
|
||||
|
||||
#### Invariants
|
||||
|
||||
I1. **Slot lifecycle exactly-once**: one send → exactly one recv.
|
||||
I2. **Pointer monotonicity**: `my_head` / `my_tail` are monotonically
non-decreasing; `sender_seq` is strictly increasing.
|
||||
I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank
|
||||
B, then rank B's reverse-direction peer must be rank A. Verified at
|
||||
init.
|
||||
I4. **`buffer_kind` consistency**: all PEs in a process group share
|
||||
the same `buffer_kind` (no mixed mode in the first cut).
|
||||
I5. **op_log ordering**: send → DMA complete → recv possible. The
|
||||
t_start order in op_log respects this causality.
|
||||
I6. **Atomic data + metadata visibility (MUST)**: at the receiver
|
||||
side, data write (`MemoryStore.write`) and metadata forward
|
||||
(`peer_head_cache` update) **must execute in the same SimPy step**.
|
||||
No yield is allowed between the two operations in PE_DMA's vc_comm
|
||||
handler. Code review must reject any inserted `yield` (or `yield
|
||||
from`) — it would create a race where head_cache becomes visible
|
||||
before or after the data.
|
||||
I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6,
|
||||
the step in which `peer_head_cache > my_tail` becomes true is the
|
||||
same step in which the slot data is observable.
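
Invariant I3 lends itself to a mechanical check at init time. The following is a minimal sketch, assuming each rank's neighbors are available as a `direction -> peer rank` dict and that the opposite-direction convention is E↔W and N↔S. `check_endpoint_consistency` and `OPPOSITE` are hypothetical names used for illustration, not the framework's own API.

```python
from __future__ import annotations

# Hypothetical helper illustrating invariant I3; not the framework's code.
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def check_endpoint_consistency(neighbors: dict[int, dict[str, int]]) -> list[str]:
    """Return a list of I3 violations (an empty list means consistent).

    `neighbors[rank][direction]` is the peer rank reached in that direction.
    """
    errors = []
    for rank, dirs in neighbors.items():
        for direction, peer in dirs.items():
            # The peer's reverse-direction entry must point back at `rank`.
            back = neighbors.get(peer, {}).get(OPPOSITE[direction])
            if back != rank:
                errors.append(
                    f"rank {rank} dir {direction} -> rank {peer}, "
                    f"but rank {peer} dir {OPPOSITE[direction]} -> {back}"
                )
    return errors


# 4-rank bidirectional ring: E goes to rank+1, W goes to rank-1.
ring4 = {r: {"E": (r + 1) % 4, "W": (r - 1) % 4} for r in range(4)}

# Deliberately inconsistent topology: neither link has a matching reverse entry.
broken = {0: {"E": 1}, 1: {"W": 2}, 2: {}}
```

Running the check on `ring4` yields no violations, while `broken` reports one message per mismatched link. A real init path would presumably raise instead of returning strings (see F4).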
|
||||
|
||||
#### Failure modes (runtime errors)
|
||||
|
||||
F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction
|
||||
→ `IpcqInvalidDirection`, simulation aborts.
|
||||
F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched
|
||||
send and recv. Not validated by default; opt-in strict mode catches
|
||||
it (`strict_validation: true` in a PE_IPCQ node's attrs).
|
||||
F3. **Deadlock detection (timeout-based)**: the simulator empties its
|
||||
schedule while a send/recv is still pending → engine raises
|
||||
`IpcqDeadlock` and embeds a pointer dump.
|
||||
F4. **Backend init failure**: missing `defaults.algorithm`, missing
|
||||
`algorithms[name]`, module import failure, topology validation
|
||||
failure (I3, I4) — all raised at `init_process_group` time.
|
||||
F5. **Slot full + infinite backpressure**: the peer never recvs.
|
||||
Surfaces as F3 timeout.
|
||||
|
||||
#### Diagnostics
|
||||
|
||||
- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as
|
||||
`(rank, t, dir, nbytes)`.
|
||||
- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)`
|
||||
prints every PE_IPCQ ring buffer's `my_head`, `my_tail`,
|
||||
`peer_head_cache`, `peer_tail_cache`.
|
||||
- **Deadlock dump**: on hang the engine includes the pointer dump in
|
||||
the `IpcqDeadlock` exception message.
|
||||
|
||||
### D15. Algorithm-author cheat sheet
|
||||
|
||||
Full step-by-step lives in
|
||||
[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
|
||||
shortest version:
|
||||
|
||||
| Things you touch | Things you don't |
|------------------|-------------------|
| `src/kernbench/ccl/algorithms/<your_algo>.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
| (Optional) `tests/test_<your_algo>.py` mock test | PE_IPCQ component, AhbmCCLBackend |
|
||||
|
||||
5-step flow: write the kernel → register in `ccl.yaml` → optional
|
||||
`neighbors` override → optional mock unit test → SimPy validation via
|
||||
`kernbench run --bench ccl_allreduce --verify-data`.
|
||||
|
||||
Common mistakes: using a direction that wasn't installed, sends
|
||||
without matching recvs (deadlock), dtype/shape disagreement, assuming
|
||||
fairness from `tl.recv()` round-robin, confusing
|
||||
`tl.num_programs(axis)` with the CCL group size.
|
||||
|
||||
---
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **Host collective**: a model where `dist.all_reduce` itself moves
|
||||
data on the host side is out of scope. This ADR only covers
|
||||
communication that happens inside the PE kernel.
|
||||
- **All-reduce algorithms**: ring / tree / etc. live in algorithm
|
||||
modules and can be added without amending this ADR.
|
||||
- **Reliability / error handling**: link faults, send/recv failure
|
||||
recovery, etc. are out of scope.
|
||||
- **NoC arbiter precision**: dynamic VC contention is left for a future
|
||||
ADR (see D8).
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
- **VC arbitration accuracy** — the first cut uses deterministic
|
||||
chunk interleave + weighted round-robin; heavy contention may report
|
||||
optimistic latency. A NoC arbiter component can be added later.
|
||||
- **Credit return BW model** — the fast path is currently outside the
|
||||
fabric BW contention model. Can be modeled as a separate link or
|
||||
switched to piggyback (`credit_return_mode: piggyback`).
|
||||
- **Ring buffer slot allocation metadata** — whether the host pushes
|
||||
IPCQ buffer metadata via sideband or via a fabric message similar to
|
||||
`MmuMapMsg` is open.
|
||||
- **VC BW split default** — 50/50 vs. weighted (e.g. 80/20). Exposed in
|
||||
`ccl.yaml`; default value TBD.
|
||||
- **Direction count** — 4 (N/S/E/W) is fixed in the first cut; 6
|
||||
(with Up/Down for 3D) or N (variable) is future work.
|
||||
- **Multi-tile aggregation primitives** — whether
|
||||
`tl.recv_all` or similar is needed for fan-in.
|
||||
- **Round-robin recv fairness** — current weak fairness can starve;
|
||||
strict fairness counter is future work.
|
||||
- **Deadlock detection precision** — currently timeout-based; a
|
||||
realtime wait-for graph would enable deterministic detection.
|
||||
|
||||
---
|
||||
|
||||
## Consequences
|
||||
|
||||
### Positive
|
||||
|
||||
- Direct PE-to-PE communication makes it possible to write CCL kernels.
|
||||
- Host stays minimal (just `launch`), synchronization happens inside
|
||||
the PE → strong compute / comm overlap.
|
||||
- VCs eliminate HoL blocking → collective latency is not blocked by
|
||||
compute traffic.
|
||||
- Buffer placement and backpressure mode are init-time parameters →
|
||||
easy to benchmark.
|
||||
- Four-direction logical neighbors → host is free to map
|
||||
ring/mesh/tree algorithms.
|
||||
|
||||
### Negative
|
||||
|
||||
- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
|
||||
- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE.
|
||||
- VC arbitration is a first-order approximation; heavy contention
|
||||
scenarios may report slightly optimistic latency vs real HW (D8).
|
||||
- Chunk-level interleave makes PE_DMA implementation more complex.
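
The memory-cost formula above is easy to turn into a quick estimate. A minimal sketch follows; the ring count of 8 is taken from the bullet above, while the helper name `ipcq_bytes_per_pe` and the example slot size and slot count are illustrative assumptions, not defaults from this ADR.

```python
def ipcq_bytes_per_pe(slot_size: int, n_slots: int, n_rings: int = 8) -> int:
    """IPCQ buffer footprint for one PE: rings x slot size x slots per ring."""
    return n_rings * slot_size * n_slots


# Illustrative values only: 4 KiB slots, 16 slots per ring.
example = ipcq_bytes_per_pe(slot_size=4096, n_slots=16)
print(example // 1024, "KiB per PE")  # prints: 512 KiB per PE
```

Because `slot_size` and `n_slots` are init-time parameters, the same helper can be used to compare buffer placements before running a benchmark.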
|
||||
File diff suppressed because it is too large
@@ -6,43 +6,46 @@ Accepted
|
||||
|
||||
## Context
|
||||
|
||||
### 목표
|
||||
### Goal
|
||||
|
||||
`torch.distributed` collective 호출의 참여 단위(rank)를 **SIP**(device)
|
||||
경계에 맞춘다. 실제 PyTorch DDP/TP 스크립트와 **호스트 레벨에서 구분 없이**
|
||||
읽히는 bench 코드를 목표로 한다.
|
||||
Align the participation unit (rank) of `torch.distributed` collective calls
|
||||
to the **SIP** (device) boundary. The aim is bench code that, at the host
|
||||
level, reads **indistinguishably** from real PyTorch DDP/TP scripts.
|
||||
|
||||
real PyTorch와 비교:
|
||||
Comparison with real PyTorch:
|
||||
|
||||
| 차원 | real PyTorch | KernBench |
|
||||
| Dimension | real PyTorch | KernBench |
|
||||
| --- | --- | --- |
|
||||
| 프로세스 모델 | N개 프로세스, 각 1 GPU | 1 프로세스, N greenlet, 각 1 SIP |
|
||||
| `get_rank()` | `RANK` env var | greenlet-local 레지스트리 |
|
||||
| `get_world_size()` | `WORLD_SIZE` env var | topology의 SIP 수 |
|
||||
| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
|
||||
| `get_rank()` | `RANK` env var | greenlet-local registry |
|
||||
| `get_world_size()` | `WORLD_SIZE` env var | SIP count from topology |
|
||||
| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
|
||||
| `mp.spawn` | OS 프로세스 fork | greenlet fan-out |
|
||||
| `mp.spawn` | OS process fork | greenlet fan-out |
|
||||
|
||||
### 풀어야 할 문제
|
||||
### Problems to solve
|
||||
|
||||
1. **공개 API에서 rank = SIP** — bench worker가 PE 개념을 알지 않도록.
|
||||
2. **Greenlet-local rank/device tracking** — 1-프로세스 모델 안에서 각
|
||||
worker greenlet이 자기 rank / 자기 SIP를 정확히 식별.
|
||||
3. **Tensor placement = structural (sip, cube, pe)** — rank가 SIP이면
|
||||
기본 텐서 배치도 구조적 좌표로 표현되어야 함.
|
||||
1. **Public API where rank = SIP** — so bench workers do not have to know
|
||||
about the PE concept.
|
||||
2. **Greenlet-local rank/device tracking** — within the 1-process model,
|
||||
each worker greenlet must correctly identify its own rank / its own SIP.
|
||||
3. **Tensor placement = structural (sip, cube, pe)** — if rank is SIP,
|
||||
the default tensor placement should also be expressed in structural
|
||||
coordinates.
|
||||
|
||||
### Non-problem (이 ADR 밖)
|
||||
### Non-problem (outside this ADR)
|
||||
|
||||
- IPCQ direction addressing → ADR-0025
|
||||
- `DPPolicy.sip`/`num_sips` 제거 → ADR-0026
|
||||
- Removing `DPPolicy.sip`/`num_sips` → ADR-0026
|
||||
- Megatron-style TP → ADR-0027
|
||||
- DTensor → ADR-0028 (future)
|
||||
- Worker scheduling / `mp.spawn` / collective drain / exception cleanup
|
||||
→ ADR-0027 D0/D1
|
||||
- Collective algorithm 구현 (intercube_allreduce, SFR config) → ADR-0032
|
||||
- Collective algorithm implementation (intercube_allreduce, SFR config)
|
||||
→ ADR-0032
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. rank = SIP (world_size 해석)
|
||||
### D1. rank = SIP (world_size resolution)
|
||||
|
||||
```python
|
||||
def _resolve_world_size(self) -> int:
|
||||
@@ -55,8 +58,8 @@ def _resolve_world_size(self) -> int:
|
||||
return int(spec.get("system", {}).get("sips", {}).get("count", 1))
|
||||
```
|
||||
|
||||
우선순위: 알고리즘 override > defaults override > SIP count. `ccl.yaml`
|
||||
override는 legacy "rank = PE" 테스트 경로로 유지.
|
||||
Priority order: algorithm override > defaults override > SIP count. The
|
||||
`ccl.yaml` override is retained as the legacy "rank = PE" test path.
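
The priority order can be sketched as a small standalone function. This is an assumption-laden illustration, not the body of `_resolve_world_size`: the override key name `world_size` and the split into `algo_cfg` / `defaults_cfg` dicts are hypothetical; only the topology lookup path (`system.sips.count`, default 1) comes from the snippet above.

```python
def resolve_world_size(algo_cfg: dict, defaults_cfg: dict, topology_spec: dict) -> int:
    """Resolve world size: algorithm override > defaults override > SIP count."""
    # `world_size` as the override key is an assumption made for this sketch.
    for cfg in (algo_cfg, defaults_cfg):
        override = cfg.get("world_size")
        if override is not None:
            return int(override)
    # Fall back to the SIP count declared in the topology (1 if absent).
    return int(topology_spec.get("system", {}).get("sips", {}).get("count", 1))


topology = {"system": {"sips": {"count": 2}}}
```

With this shape, a legacy "rank = PE" test would set the override in the algorithm entry, and every other bench would fall through to the SIP count.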
|
||||
|
||||
### D2. Greenlet-local rank registry (+ debug warning)
|
||||
|
||||
@@ -83,11 +86,11 @@ class DistributedContext:
|
||||
return int(self._rank_by_greenlet[g])
|
||||
```
|
||||
|
||||
### D3. `torch.ahbm.set_device(rank)` — SIP 바인딩
|
||||
### D3. `torch.ahbm.set_device(rank)` — SIP binding
|
||||
|
||||
KernBench 백엔드 이름은 `ahbm` (ADR-0023). Real PyTorch는
|
||||
`torch.cuda.set_device(r)`이지만 우리는 CUDA가 아니므로 honestly-named
|
||||
namespace를 사용한다.
|
||||
The KernBench backend name is `ahbm` (ADR-0023). Real PyTorch uses
|
||||
`torch.cuda.set_device(r)`, but since we are not CUDA we use an
|
||||
honestly-named namespace.
|
||||
|
||||
```python
|
||||
class _AhbmNamespace:
|
||||
@@ -113,10 +116,12 @@ class _AhbmNamespace:
|
||||
# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
|
||||
```
|
||||
|
||||
**PyTorch 2.x style 병행 지원**: 최신 PyTorch는 device-agnostic한
|
||||
`torch.accelerator` 네임스페이스를 지향 (`torch.accelerator.set_device_index(r)`,
|
||||
`torch.accelerator.current_device_index()`). Device vendor에 종속되지 않는
|
||||
코드를 쓰려는 사용자를 위해 KernBench도 이 표면을 병행 지원한다.
|
||||
**PyTorch 2.x-style API also supported**: recent PyTorch is moving toward a
|
||||
device-agnostic `torch.accelerator` namespace
|
||||
(`torch.accelerator.set_device_index(r)`,
|
||||
`torch.accelerator.current_device_index()`). To support users who want to
|
||||
write code that is not tied to a specific device vendor, KernBench also
|
||||
exposes this surface in parallel.
|
||||
|
||||
```python
|
||||
class _AcceleratorNamespace:
|
||||
@@ -141,23 +146,23 @@ self.ahbm = _AhbmNamespace()
|
||||
self.accelerator = _AcceleratorNamespace(self.ahbm) # alias
|
||||
```
|
||||
|
||||
Bench 작성자는 다음 중 하나를 선택 — 둘 다 내부적으로 같은 레지스트리를 보유:
|
||||
Bench authors may choose either — both share the same registry internally:
|
||||
|
||||
```python
|
||||
torch.ahbm.set_device(rank) # KernBench-native, explicit backend
|
||||
torch.accelerator.set_device_index(rank) # PyTorch 2.x device-agnostic
|
||||
```
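
To make the "same registry" claim concrete, here is a minimal sketch of an alias namespace that delegates to the native one. The class names, the `worker` argument, and the plain dict registry are assumptions for illustration; the real implementation keys the registry by greenlet, which this sketch replaces with a string key so that it runs without extra dependencies.

```python
class AhbmNamespace:
    """Stand-in for the native namespace; owns the rank -> SIP registry."""

    def __init__(self):
        self._device_by_worker = {}  # worker key -> SIP index

    def set_device(self, sip, worker="main"):
        self._device_by_worker[worker] = int(sip)

    def current_device(self, worker="main"):
        # None means "no device bound yet" for this worker.
        return self._device_by_worker.get(worker)


class AcceleratorNamespace:
    """Device-agnostic alias; holds no state of its own."""

    def __init__(self, ahbm):
        self._ahbm = ahbm

    def set_device_index(self, index, worker="main"):
        self._ahbm.set_device(index, worker)

    def current_device_index(self, worker="main"):
        return self._ahbm.current_device(worker)


ahbm = AhbmNamespace()
accelerator = AcceleratorNamespace(ahbm)
accelerator.set_device_index(3, worker="w0")
```

Since the alias stores nothing, a device set through either surface is visible through the other, and the two cannot drift apart.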
|
||||
|
||||
### D4. Tensor placement = structural (sip, cube, pe) 좌표
|
||||
### D4. Tensor placement = structural (sip, cube, pe) coordinates
|
||||
|
||||
`resolve_dp_policy`가 `target_sip`을 직접 받아 구조적 좌표로 placement 생성.
|
||||
세부는 ADR-0026.
|
||||
`resolve_dp_policy` takes `target_sip` directly and produces placement in
|
||||
structural coordinates. Details in ADR-0026.
|
||||
|
||||
```python
|
||||
# RuntimeContext._create_tensor
|
||||
current_sip = self.ahbm.current_device() # (D3 naming)
|
||||
if current_sip is None:
|
||||
current_sip = 0 # single-driver fallback (D2와 일관)
|
||||
current_sip = 0 # single-driver fallback (consistent with D2)
|
||||
placement = resolve_dp_policy(
|
||||
dp, shape=shape_2d, itemsize=itemsize,
|
||||
num_pe=eff_num_pe, num_cubes=eff_num_cubes,
|
||||
@@ -165,29 +170,29 @@ placement = resolve_dp_policy(
|
||||
)
|
||||
```
|
||||
|
||||
Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적
|
||||
좌표를 직접 보유. ShardSpec 상세는 ADR-0026.
|
||||
No post-hoc `pe_index` shifting — ShardSpec carries the `(sip, cube, pe)`
|
||||
structural coordinates directly. ShardSpec details in ADR-0026.
|
||||
|
||||
---
|
||||
|
||||
## Dependencies
|
||||
|
||||
- **ADR-0023** (IPCQ): backend `ahbm` namespace의 기원.
|
||||
- **ADR-0026** (DPPolicy intra-device): D4의 `resolve_dp_policy` 시그니처와
|
||||
ShardSpec의 구조적 좌표 표현.
|
||||
- **ADR-0027** (Megatron TP + scheduler): worker scheduling, `mp.spawn`,
|
||||
collective drain, exception cleanup의 구현 기준.
|
||||
- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
|
||||
- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature
|
||||
used by D4 and the structural-coordinate representation of ShardSpec.
|
||||
- **ADR-0027** (Megatron TP + scheduler): the implementation baseline for
|
||||
worker scheduling, `mp.spawn`, collective drain, and exception cleanup.
|
||||
|
||||
---
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **IPCQ protocol 수정**: ADR-0023 유지.
|
||||
- **DPPolicy 필드 정리**: ADR-0026.
|
||||
- **Modifying the IPCQ protocol**: ADR-0023 remains as-is.
|
||||
- **Cleaning up DPPolicy fields**: ADR-0026.
|
||||
- **Megatron-style TP**: ADR-0027.
|
||||
- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
|
||||
- **Collective algorithm 구현**: ADR-0032.
|
||||
- **Multi-node (프로세스 간)**: 단일 프로세스.
|
||||
- **Collective algorithm implementation**: ADR-0032.
|
||||
- **Multi-node (cross-process)**: single process only.
|
||||
|
||||
---
|
||||
|
||||
@@ -195,12 +200,14 @@ Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적
|
||||
|
||||
### Positive
|
||||
|
||||
- **Bench = real PyTorch DDP** (공개 API 관점).
|
||||
- **Greenlet-local rank**: 1-프로세스 모델에서 cross-rank correctness 가능.
|
||||
- **Structural placement 좌표**: ADR-0026 / ADR-0027 / ADR-0032의 다른 ADR이
|
||||
`(sip, cube, pe)` 3튜플 위에서 일관되게 동작.
|
||||
- **Bench = real PyTorch DDP** (from the public-API point of view).
|
||||
- **Greenlet-local rank**: enables cross-rank correctness within the
|
||||
1-process model.
|
||||
- **Structural placement coordinates**: lets the other ADRs (ADR-0026 /
|
||||
ADR-0027 / ADR-0032) operate consistently on top of the `(sip, cube, pe)`
|
||||
3-tuple.
|
||||
|
||||
### Neutral
|
||||
|
||||
- IPCQ PE-level protocol (ADR-0023) 불변.
|
||||
- IO_CPU 역할 불변 (기존 transit 그대로).
|
||||
- IPCQ PE-level protocol (ADR-0023) is unchanged.
|
||||
- IO_CPU role is unchanged (existing transit behavior preserved).
|
||||
|
||||
@@ -6,51 +6,58 @@ Accepted (Revision 2 — Address-based matching; peer_direction field dropped)
|
||||
|
||||
## Context
|
||||
|
||||
### 목표
|
||||
### Goal
|
||||
|
||||
ADR-0023의 IPCQ protocol에서 **"어느 direction pair를 통한 전송인가"의 식별**을
|
||||
topology / dict-order에 의존하지 않고 **주소 기반**으로 일관되게 한다.
|
||||
2-rank bidirectional ring (또는 여러 direction이 동일 peer를 가리키는
|
||||
topology 일반)에서 정확히 동작하도록 한다.
|
||||
In the IPCQ protocol of ADR-0023, make the **identification of "which
|
||||
direction pair this transfer belongs to"** consistent and **address-based**,
|
||||
without depending on topology / dict-order. It must work correctly in a
|
||||
2-rank bidirectional ring (and more generally in any topology where
|
||||
multiple directions point to the same peer).
|
||||
|
||||
### 드러난 버그 — 2-rank bidirectional ring
|
||||
### The bug surfaced — 2-rank bidirectional ring
|
||||
|
||||
`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). 양쪽 방향이 같은 peer.
|
||||
`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions
|
||||
point to the same peer.
|
||||
|
||||
**버그 1 (install)**:
|
||||
- `reverse_direction(0, 1)` → dict order로 "E" 반환 (틀림, "W"가 맞음 — opposite
|
||||
direction convention)
|
||||
- rank 0의 E entry가 `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`로 설정
|
||||
- tl.send(E) → data가 sip1의 E-rx buffer로 landing (should be W-rx)
|
||||
**Bug 1 (install)**:
|
||||
- `reverse_direction(0, 1)` → returns "E" by dict order (wrong; "W" is the
|
||||
correct answer — opposite-direction convention)
|
||||
- rank 0's E entry is set with `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`
|
||||
- tl.send(E) → data lands in sip1's E-rx buffer (should be W-rx)
|
||||
|
||||
**버그 2 (runtime)**:
|
||||
- 설령 install이 올바른 주소로 설정해도, receiver의 `_handle_meta_arrival`이
|
||||
sender 좌표만으로 direction 매칭 → 첫 direction (E) 승
|
||||
- peer_head_cache[E] 증가, peer_head_cache[W]는 불변
|
||||
- Kernel의 tl.recv(W)는 peer_head_cache[W] 대기 → 영원히 블록 → IpcqDeadlock
|
||||
**Bug 2 (runtime)**:
|
||||
- Even if install set up the correct address, the receiver's
|
||||
`_handle_meta_arrival` matches direction by sender coordinates only → the
|
||||
first direction (E) wins
|
||||
- peer_head_cache[E] is incremented; peer_head_cache[W] is unchanged
|
||||
- The kernel's tl.recv(W) waits on peer_head_cache[W] → blocks forever →
|
||||
IpcqDeadlock
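
Bug 1 can be reproduced in a few lines. The sketch below re-implements a bidirectional `ring_1d` and two reverse-direction lookups: one that takes the first match in dict order (the buggy behavior described above) and one that prefers the opposite direction (the convention this ADR adopts). All three functions are simplified stand-ins written for this example; in particular the extra `world_size` parameter is an assumption and is not part of the real `reverse_direction` signature.

```python
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def ring_1d(rank, world_size):
    """Bidirectional 1-D ring: E is rank+1, W is rank-1 (mod world_size)."""
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def reverse_by_dict_order(my_rank, peer_rank, world_size):
    """Buggy lookup: first peer direction that points back at my_rank."""
    for direction, rank in ring_1d(peer_rank, world_size).items():
        if rank == my_rank:
            return direction
    return None


def reverse_prefer_opposite(my_rank, peer_rank, my_dir, world_size):
    """Prefer the opposite of my_dir; fall back to any direction pointing back."""
    peer_nbrs = ring_1d(peer_rank, world_size)
    opposite = OPPOSITE[my_dir]
    if peer_nbrs.get(opposite) == my_rank:
        return opposite
    for direction, rank in peer_nbrs.items():
        if rank == my_rank:
            return direction
    return None
```

In a 2-rank ring the dict-order lookup returns "E" for rank 0's E link, while the opposite-preferring lookup returns "W". In a 4-rank ring both agree, which is why the bug only surfaces when several directions share a peer.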
|
||||
|
||||
### 근본 원인
|
||||
### Root cause
|
||||
|
||||
두 축에서 동일 문제:
|
||||
1. **Install-time pairing**: "내 direction과 peer의 어느 direction이 짝인가"
|
||||
결정이 dict-iteration-order에 의존 → 여러 direction이 같은 peer를 가리킬 때
|
||||
fragile
|
||||
2. **Runtime identification**: "어느 qp를 업데이트해야 하는가" 결정이 sender
|
||||
좌표만으로 이루어짐 → direction 중복 시 ambiguous
|
||||
The same issue along two axes:
|
||||
1. **Install-time pairing**: deciding "which of my directions pairs with
|
||||
which direction of the peer" depends on dict-iteration-order → fragile
|
||||
when multiple directions point to the same peer
|
||||
2. **Runtime identification**: deciding "which qp should be updated" is
|
||||
based on sender coordinates alone → ambiguous when directions are
|
||||
duplicated
|
||||
|
||||
### 해결 방향 — address-based matching
|
||||
### Solution direction — address-based matching
|
||||
|
||||
각 PE의 rx buffer는 **direction별로 고유한 주소 range**에 위치 (rx_base_pa +
|
||||
direction_idx × bytes_per_direction). 따라서:
|
||||
Each PE's rx buffer sits at a **unique address range per direction**
|
||||
(rx_base_pa + direction_idx × bytes_per_direction). Therefore:
|
||||
|
||||
- **Runtime**: sender coord 대신 **dst_addr 범위**로 매칭 → unambiguous
|
||||
- **Install**: opposite-direction 우선 선택 heuristic (ring / mesh의 자연스러운
|
||||
대칭성)
|
||||
- `peer_direction` 같은 이중 메타데이터 불필요 — **주소가 single source of
|
||||
truth**
|
||||
- **Runtime**: match by **dst_addr range** instead of sender coord →
|
||||
unambiguous
|
||||
- **Install**: prefer the opposite direction as a heuristic (the natural
|
||||
symmetry of ring / mesh)
|
||||
- No need for redundant metadata like `peer_direction` — **address is the
|
||||
single source of truth**
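
The address layout above makes the runtime match a pure arithmetic lookup. A minimal sketch, assuming a fixed direction order `("N", "S", "E", "W")` for `direction_idx` (the actual index order is not stated here) and half-open ranges; `rx_range` and `direction_for_addr` are hypothetical helper names.

```python
DIRECTIONS = ("N", "S", "E", "W")  # index order is an assumption for this sketch


def rx_range(rx_base_pa, direction, bytes_per_direction):
    """Half-open [start, end) range of one direction's rx buffer."""
    start = rx_base_pa + DIRECTIONS.index(direction) * bytes_per_direction
    return start, start + bytes_per_direction


def direction_for_addr(dst_addr, rx_base_pa, bytes_per_direction):
    """Map a destination address to the direction whose rx range contains it."""
    offset = dst_addr - rx_base_pa
    if offset < 0:
        return None  # below this PE's rx region
    idx = offset // bytes_per_direction
    return DIRECTIONS[idx] if idx < len(DIRECTIONS) else None
```

Because each direction owns a distinct range, two messages from the same peer that target different directions can never be confused, regardless of sender coordinates.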
|
||||
|
||||
이 설계는 **PhysAddr 전환 (ADR-0030)과 독립적**으로 작동. 현재 synthetic
|
||||
주소든 PhysAddr든 direction별 range 유일성만 지켜지면 동일하게 적용 가능.
|
||||
This design works **independently of the PhysAddr transition (ADR-0030)**.
|
||||
Whether the current addresses are synthetic or PhysAddr, the same approach
|
||||
applies as long as the per-direction range uniqueness is preserved.
|
||||
|
||||
---
|
||||
|
||||
@@ -91,17 +98,17 @@ def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
|
||||
return None
|
||||
```
|
||||
|
||||
호출부:
|
||||
Call site:
|
||||
|
||||
```python
|
||||
for d, peer_rank in nbrs.items():
|
||||
peer_dir = reverse_direction(r, peer_rank, d) # my_dir 전달
|
||||
peer_dir = reverse_direction(r, peer_rank, d) # pass my_dir
|
||||
if peer_dir is None:
|
||||
continue
|
||||
...
|
||||
```
|
||||
|
||||
### D2. Runtime — `_handle_meta_arrival` dst_addr 매칭
|
||||
### D2. Runtime — `_handle_meta_arrival` dst_addr matching
|
||||
|
||||
`src/kernbench/components/builtin/pe_ipcq.py`:
|
||||
|
||||
@@ -138,9 +145,10 @@ def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
|
||||
# Unknown dst_addr — diagnostic log (should not happen under correct install)
|
||||
```
|
||||
|
||||
Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정.
|
||||
The sender-coordinate check is **removed**. `dst_addr` already determines
|
||||
the direction.
|
||||
|
||||
### D3. Credit — `dst_rx_base_pa` 필드 추가
|
||||
### D3. Credit — add `dst_rx_base_pa` field
|
||||
|
||||
`src/kernbench/common/ipcq_types.py`:
|
||||
|
||||
@@ -148,25 +156,26 @@ Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정.
|
||||
@dataclass(frozen=True)
|
||||
class IpcqCreditMetadata:
|
||||
consumer_seq: int
|
||||
dst_rx_base_pa: int # NEW: 원 sender의 peer.rx_base_pa와 매칭용
|
||||
# 기존 필드 (diagnostic / log 용도로 유지)
|
||||
dst_rx_base_pa: int # NEW: matches the original sender's peer.rx_base_pa
|
||||
# Existing fields (kept for diagnostic / logging purposes)
|
||||
src_sip: int
|
||||
src_cube: int
|
||||
src_pe: int
|
||||
src_direction: str
|
||||
```
|
||||
|
||||
Credit 생성 시 (`_delayed_credit_send`): 자기 direction의 `my_rx_base_pa`를
|
||||
`dst_rx_base_pa`로 실어 보냄 (이게 상대방이 sender 당시 썼던 `peer.rx_base_pa`).
|
||||
When the credit is generated (`_delayed_credit_send`): it carries this
|
||||
direction's `my_rx_base_pa` as `dst_rx_base_pa` (this is the
|
||||
`peer.rx_base_pa` the other side used when it was the sender).
|
||||
|
||||
수신 측 (`_credit_worker`):
|
||||
Receiver side (`_credit_worker`):
|
||||
|
||||
```python
|
||||
def _credit_worker(self, env):
|
||||
while True:
|
||||
credit = yield self._credit_inbox.get()
|
||||
for d, qp in self._queue_pairs.items():
|
||||
# peer의 rx_base_pa와 credit의 dst_rx_base_pa가 일치하는 qp 찾기
|
||||
# Find the qp whose peer rx_base_pa matches the credit's dst_rx_base_pa
|
||||
if qp["peer"].rx_base_pa == credit.dst_rx_base_pa:
|
||||
qp["peer_tail_cache"] = max(qp["peer_tail_cache"],
|
||||
credit.consumer_seq)
|
||||
@@ -178,41 +187,45 @@ def _credit_worker(self, env):
|
||||
break
|
||||
```
|
||||
|
||||
Sender 좌표 검사 제거. `dst_rx_base_pa` 매칭으로 unambiguous.
|
||||
Sender-coordinate check removed. Matching by `dst_rx_base_pa` is
|
||||
unambiguous.
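
The matching rule can be exercised outside SimPy. The sketch below models only the lookup and the `max` update, using a dict index keyed by the peer's rx base instead of the linear scan (the ADR mentions such an index as a possible later optimization). The flattened `peer_rx_base_pa` key, the `Credit` class, and both helper functions are assumptions for illustration, not the real `IpcqCreditMetadata` or `_credit_worker`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Credit:
    """Trimmed-down stand-in for the credit metadata (two fields only)."""
    consumer_seq: int
    dst_rx_base_pa: int


def build_index(queue_pairs):
    """Map each peer rx base address to the local direction that targets it."""
    return {qp["peer_rx_base_pa"]: d for d, qp in queue_pairs.items()}


def apply_credit(queue_pairs, index, credit):
    """Update the matching qp's tail cache; return its direction or None."""
    direction = index.get(credit.dst_rx_base_pa)
    if direction is None:
        return None  # unknown address: nothing to update
    qp = queue_pairs[direction]
    # max() keeps the cache monotonic even if credits arrive out of order.
    qp["peer_tail_cache"] = max(qp["peer_tail_cache"], credit.consumer_seq)
    return direction


# 2-rank ring seen from rank 0: E and W both target rank 1, at different bases.
qps = {
    "E": {"peer_rx_base_pa": 0x2300, "peer_tail_cache": 0},
    "W": {"peer_rx_base_pa": 0x2200, "peer_tail_cache": 0},
}
index = build_index(qps)
```

A credit addressed to `0x2200` updates only the W queue pair even though both directions share the same peer, which is exactly the case sender coordinates could not disambiguate.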
|
||||
|
||||
### D4. `IpcqInitEntry`에 `peer_direction` 필드를 **추가하지 않음**
|
||||
### D4. Do **not** add a `peer_direction` field to `IpcqInitEntry`
|
||||
|
||||
ADR-0025 rev 1에서 제안했던 `IpcqInitEntry.peer_direction`은 **불필요**.
|
||||
이유:
|
||||
- Meta arrival은 dst_addr로 매칭 (D2)
|
||||
- Credit은 dst_rx_base_pa로 매칭 (D3)
|
||||
- qp에 peer_direction 저장 필요 없음
|
||||
- Install은 rx_base_pa 계산 시 내부적으로만 peer_dir 사용 (`reverse_direction`)
|
||||
The `IpcqInitEntry.peer_direction` proposed in ADR-0025 rev 1 is
|
||||
**unnecessary**. Reasons:
|
||||
- Meta arrivals are matched by dst_addr (D2)
|
||||
- Credits are matched by dst_rx_base_pa (D3)
|
||||
- No need to store peer_direction on qp
|
||||
- Install only uses peer_dir internally when computing rx_base_pa
|
||||
(`reverse_direction`)
|
||||
|
||||
IpcqInitEntry schema 변경 없음. Rev 1 대비 **단순화**.
|
||||
No change to the IpcqInitEntry schema. **Simpler** than rev 1.
|
||||
|
||||
### D5. `IpcqDmaToken.src_direction` 유지 (diagnostic only)
|
||||
### D5. Keep `IpcqDmaToken.src_direction` (diagnostic only)
|
||||
|
||||
기존 `src_direction` 필드는 제거하지 않는다. 다음 용도로 유지:
|
||||
- Logging / trace: `KERNBENCH_CCL_TRACE=1` 출력의 `(rank, t, dir, nbytes)`
|
||||
- Diagnostics: pointer_dump 등에서 direction 표시
|
||||
- 미래 확장 여지
|
||||
The existing `src_direction` field is not removed. It is retained for:
|
||||
- Logging / trace: the `(rank, t, dir, nbytes)` output of
|
||||
`KERNBENCH_CCL_TRACE=1`
|
||||
- Diagnostics: showing direction in pointer_dump, etc.
|
||||
- Room for future extension
|
||||
|
||||
Runtime matching은 `dst_addr`만 사용.
|
||||
Runtime matching uses only `dst_addr`.
|
||||
|
||||
### D6. Invariants (ADR-0023 I3 강화)
|
||||
### D6. Invariants (strengthens ADR-0023 I3)
|
||||
|
||||
**I3 (엄격)**: 각 방향 pair `(my_direction, peer_direction)`에 대해 my
|
||||
rx_base와 peer rx_base는 **별개의 direction slot**을 가리켜야 함. Install은
|
||||
이를 보장해야 한다 (reverse_direction opposite-preference).
|
||||
**I3 (strict)**: For each direction pair `(my_direction, peer_direction)`,
|
||||
my rx_base and peer rx_base must point to **distinct direction slots**.
|
||||
Install must guarantee this (reverse_direction opposite-preference).
|
||||
|
||||
**I3.1 (신규)**: 모든 qp에 대해 `qp["my_rx_base_pa"]`와 `qp["peer"].rx_base_pa`는
|
||||
서로 disjoint한 주소 range를 점유한다 (다른 direction의 buffer는 절대 겹치지
|
||||
않음). 이것이 D2/D3의 주소-기반 매칭의 전제.
|
||||
**I3.1 (new)**: For every qp, `qp["my_rx_base_pa"]` and
|
||||
`qp["peer"].rx_base_pa` occupy mutually disjoint address ranges (buffers
|
||||
of different directions never overlap). This is the prerequisite for the
|
||||
address-based matching of D2/D3.
|
||||
|
||||
Install time에 검증 가능:
|
||||
Verifiable at install time:
|
||||
```python
|
||||
# ccl/install_plan.py: build_install_plans 끝에 assertion
|
||||
# ccl/install_plan.py: assertion at the end of build_install_plans
|
||||
all_rx_ranges = set()
|
||||
for plan in plans:
|
||||
for pe_install in plan.pe_installs:
|
||||
@@ -228,36 +241,42 @@ for plan in plans:
|
||||
|
||||
## Dependencies
|
||||
|
||||
- **ADR-0023** (IPCQ protocol): 본 ADR은 ADR-0023의 runtime 매칭 로직 수정
|
||||
(D2, D3) + install heuristic 개선 (D1). IPCQ 프로토콜의 semantic layer
|
||||
변경은 없음.
|
||||
- **ADR-0024** (launcher): 2-rank bidirectional ring이 실제 쓰이는 경우가
|
||||
ADR-0024의 ws=SIP_count 모델. 본 ADR이 그 케이스를 작동시킴.
|
||||
- **ADR-0030** (PhysAddr transition, stub): **독립적** — ADR-0025의
|
||||
주소-기반 매칭은 현재 synthetic 주소든 PhysAddr이든 동일하게 작동.
|
||||
- **ADR-0023** (IPCQ protocol): this ADR modifies ADR-0023's runtime
|
||||
matching logic (D2, D3) and improves the install heuristic (D1). No
|
||||
change to the IPCQ protocol's semantic layer.
|
||||
- **ADR-0024** (launcher): the case where a 2-rank bidirectional ring is
|
||||
actually used is the ws=SIP_count model of ADR-0024. This ADR makes that
|
||||
case work.
|
||||
- **ADR-0030** (PhysAddr transition, stub): **independent** — ADR-0025's
|
||||
address-based matching works identically whether the current addresses
|
||||
are synthetic or PhysAddr.
|
||||
|
||||
---
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **IPCQ 주소 체계를 PhysAddr로 전환**: ADR-0030 scope. 본 ADR은 주소가 어떻게
|
||||
인코딩되는가와 무관.
|
||||
- **Multi-hop routing**: ADR-0023 D5의 single-hop DMA write 전제 유지.
|
||||
- **Unidir ring 특수화**: `ring_1d_unidir`는 direction 하나만 있으므로 본 버그
|
||||
무관.
|
||||
- **Migrating IPCQ addressing to PhysAddr**: ADR-0030 scope. This ADR is
|
||||
agnostic to how addresses are encoded.
|
||||
- **Multi-hop routing**: the single-hop DMA write assumption of ADR-0023
|
||||
D5 still holds.
|
||||
- **Unidir ring specialization**: `ring_1d_unidir` only has a single
|
||||
direction, so the bug does not apply.
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
- **주소 매칭 성능**: `_handle_meta_arrival`과 `_credit_worker`가 qp를 선형
|
||||
순회 (max 4 direction). 성능 영향 무시 가능 수준. 문제 시 dict lookup으로
|
||||
전환 가능 (`_qp_by_rx_base`).
|
||||
- **`IpcqDmaToken.src_direction` 필요성 재평가**: diagnostic 용도로만 남긴
|
||||
필드를 계속 유지할지, 또는 logging 외부로 분리할지. 현재는 유지.
|
||||
- **Install-time invariant 검증 cost**: D6의 I3.1 검증은 O(N_PE × N_direction)^2.
|
||||
대형 topology에서 느려질 수 있음 → interval tree 등 자료구조로 개선 가능.
|
||||
단순 구현 먼저.
|
||||
- **Address-matching performance**: `_handle_meta_arrival` and
|
||||
`_credit_worker` iterate qp linearly (max 4 directions). The performance
|
||||
impact is negligible. If it becomes an issue, this can be switched to a
|
||||
dict lookup (`_qp_by_rx_base`).
|
||||
- **Re-evaluating the need for `IpcqDmaToken.src_direction`**: whether to
|
||||
keep this field, which now serves diagnostics only, or to split it out
|
||||
of logging. Currently retained.
|
||||
- **Cost of install-time invariant verification**: the I3.1 verification
|
||||
of D6 is O((N_PE × N_direction)^2). It could be slow on large topologies
|
||||
→ improvable via data structures such as interval trees. Simple
|
||||
implementation first.
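
If the quadratic I3.1 check does become a bottleneck, a sort-and-sweep pass is a simpler alternative to an interval tree. Note that this is a different technique from the one named above: it sorts the ranges once and compares each against the furthest end seen so far, giving O(n log n). The function name and the `(start, end, label)` tuple shape are assumptions for this sketch.

```python
def find_overlaps(ranges):
    """Detect overlapping half-open [start, end) ranges.

    `ranges` is a list of (start, end, label) tuples. Returns (label, label)
    pairs; the list is empty exactly when all ranges are disjoint. It does not
    enumerate every overlapping pair, only one witness per offending range.
    """
    overlaps = []
    max_end, max_label = None, None
    for start, end, label in sorted(ranges):
        # Any range starting before the furthest end seen so far overlaps.
        if max_end is not None and start < max_end:
            overlaps.append((max_label, label))
        if max_end is None or end > max_end:
            max_end, max_label = end, label
    return overlaps


disjoint = [(0, 16, "pe0.E"), (32, 48, "pe1.E"), (16, 32, "pe0.W")]
clashing = [(0, 16, "a"), (8, 24, "b"), (24, 40, "c")]
```

For an install-time assertion only the emptiness of the result matters, so reporting one witness per offending range is sufficient.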
|
||||
|
||||
---
|
||||
|
||||
@@ -265,19 +284,26 @@ for plan in plans:
|
||||
|
||||
### Positive
|
||||
|
||||
- **단순함**: `peer_direction` 이중 메타데이터 제거. 주소가 single source of truth.
|
||||
- **Unambiguous matching**: 모든 topology (direction 중복 포함)에서 동작.
|
||||
- **Schema 변경 최소**: `IpcqInitEntry` 불변, `IpcqCreditMetadata`에 1 필드 추가.
|
||||
- **PhysAddr 전환 (ADR-0030) 독립**: 주소-기반 매칭은 주소 인코딩 방식과 무관.
|
||||
- **Diagnostic 유지**: `IpcqDmaToken.src_direction`은 로깅 용도로 존치.
|
||||
- **Simplicity**: redundant `peer_direction` metadata removed. Address is
|
||||
the single source of truth.
|
||||
- **Unambiguous matching**: works on every topology (including duplicate
|
||||
directions).
|
||||
- **Minimal schema changes**: `IpcqInitEntry` unchanged, one field added
|
||||
to `IpcqCreditMetadata`.
|
||||
- **Independent of PhysAddr transition (ADR-0030)**: address-based matching
|
||||
is agnostic to the address encoding.
|
||||
- **Diagnostics retained**: `IpcqDmaToken.src_direction` is kept for
|
||||
logging.
|
||||
|
||||
### Negative
|
||||
|
||||
- Runtime 매칭이 주소 비교로 바뀌어서 디버깅 시 "왜 peer_head_cache[E]가 아닌
|
||||
W가 업데이트됐나" 같은 질문에 address range를 추적해야 함 (기존엔 direction
|
||||
이름으로 충분). 해결: pointer_dump에 "direction ↔ rx_base_pa" 매핑 포함.
|
||||
- Runtime matching is now by address comparison, so when debugging
|
||||
questions like "why did peer_head_cache[W] update rather than [E]" one
|
||||
has to follow the address range (previously the direction name was
|
||||
enough). Mitigation: include a "direction ↔ rx_base_pa" mapping in
|
||||
pointer_dump.
|
||||
|
||||
### Neutral
|
||||
|
||||
- IPCQ protocol의 semantic layer (sender가 dst_addr 계산, receiver가 수신)는
|
||||
불변.
|
||||
- The semantic layer of the IPCQ protocol (sender computes dst_addr,
|
||||
receiver receives) is unchanged.
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# ADR-0026: DPPolicy = Intra-Device Only — sip/num_sips 필드 제거
|
||||
# ADR-0026: DPPolicy = Intra-Device Only — remove sip/num_sips fields
|
||||
|
||||
## Status
|
||||
|
||||
@@ -6,16 +6,17 @@ Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)
|
||||
|
||||
## Context
|
||||
|
||||
### 목표
|
||||
### Goal
|
||||
|
||||
`DPPolicy`를 **한 device(SIP) 내부의 cube × PE 분산**만 표현하는 순수한
|
||||
intra-device 추상화로 명확화한다. SIP 간 분산(TP)은 별도 레이어로 분리
|
||||
(ADR-0024의 `torch.ahbm.set_device(rank)` 또는 ADR-0027의 Megatron parallel
|
||||
layers가 담당).
|
||||
Clarify `DPPolicy` as a pure intra-device abstraction that only expresses
|
||||
**cube × PE distribution within a single device (SIP)**. Inter-SIP
|
||||
distribution (TP) is split into a separate layer (handled by ADR-0024's
|
||||
`torch.ahbm.set_device(rank)` or by ADR-0027's Megatron-style parallel
|
||||
layers).
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. `DPPolicy`에서 `sip` + `num_sips` 필드 제거
|
||||
### D1. Remove `sip` + `num_sips` fields from `DPPolicy`
|
||||
|
||||
```python
|
||||
@dataclass(frozen=True)
|
||||
@@ -32,15 +33,16 @@ class DPPolicy:
|
||||
num_cubes: int | None = None
|
||||
```
|
||||
|
||||
제거되는 필드: `sip`, `num_sips`.
|
||||
Removed fields: `sip`, `num_sips`.
|
||||
|
||||
### D2. `ShardSpec` — structural (sip, cube, pe) 좌표, `pe_index` 완전 제거
|
||||
### D2. `ShardSpec` — structural (sip, cube, pe) coordinates, `pe_index` fully removed
|
||||
|
||||
현재 `ShardSpec.pe_index`는 **global flat index** (`sip × cubes × pes + cube ×
|
||||
pes + pe`). 이는 ADR-0024 D4이 "abstraction leakage"로 지적한 형태.
|
||||
The current `ShardSpec.pe_index` is a **global flat index**
|
||||
(`sip × cubes × pes + cube × pes + pe`). This is the form ADR-0024 D4
|
||||
flagged as "abstraction leakage".
|
||||
|
||||
본 ADR에서 ShardSpec을 **structural 좌표로 재정의**하고, `pe_index`는
|
||||
property로도 **남기지 않는다**:
|
||||
This ADR **redefines ShardSpec in structural coordinates** and **does
|
||||
not even leave `pe_index` as a property**:
|
||||
|
||||
```python
|
||||
# src/kernbench/policy/placement/dp.py (after)
|
||||
@@ -59,28 +61,32 @@ class ShardSpec:
|
||||
nbytes: int
|
||||
```
|
||||
|
||||
**핵심 원칙**:
|
||||
- ShardSpec의 정체성은 `(sip, cube, pe)` 3튜플.
|
||||
- **`pe_index` property도 없음** — silent semantics drift 차단.
|
||||
- Global flat을 기대한 기존 호출자는 `.pe_index` 접근 시 **즉시
|
||||
`AttributeError`** → 반드시 구조적 좌표로 migration.
|
||||
- Flat integer key가 필요한 국소 문맥 (예: 내부 dict lookup)은 호출자가
|
||||
명시적으로 `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe`를 계산.
|
||||
**Core principle**:
|
||||
- The identity of ShardSpec is the `(sip, cube, pe)` 3-tuple.
|
||||
- **No `pe_index` property either** — blocks silent semantics drift.
|
||||
- Existing callers expecting global-flat get an **immediate
|
||||
`AttributeError`** on `.pe_index` access → forced migration to
|
||||
structural coordinates.
|
||||
- Local contexts that genuinely need a flat integer key (e.g. internal
|
||||
dict lookup) explicitly compute
|
||||
`spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe` at the call
|
||||
site.
|
||||
|
||||
**Property 제거 정당화**: KernBench는 사내 프로젝트로 call site가 한정되어
|
||||
있음. Silent drift 위험 (의미만 바뀌고 타입은 같은 int) 대비 explicit breakage
|
||||
(AttributeError)가 훨씬 안전.
|
||||
**Justification for removing the property**: KernBench is an internal
|
||||
project with a limited number of call sites. Explicit breakage
|
||||
(AttributeError) is much safer than the risk of silent drift (semantics
|
||||
change while the type stays int).
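
The D2 principle can be sketched as follows. This is a minimal illustration, not the actual `dp.py` code: `offset_bytes` is assumed by analogy with `_LocalPeShard`, and the `N_CUBES`/`N_PE` constants and the `flat_key` helper are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical topology constants, for illustration only.
N_CUBES = 4
N_PE = 8


@dataclass(frozen=True)
class ShardSpec:
    """Sketch of a structural shard descriptor; the real field set may differ."""
    sip: int
    cube: int
    pe: int            # cube-local PE index
    offset_bytes: int  # assumed field, mirrors _LocalPeShard
    nbytes: int


def flat_key(spec: ShardSpec) -> int:
    """Explicit call-site computation of a global flat key (hypothetical helper)."""
    return spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe


spec = ShardSpec(sip=1, cube=2, pe=3, offset_bytes=0, nbytes=1024)

try:
    spec.pe_index  # neither a field nor a property, so this fails loudly
except AttributeError:
    pe_index_missing = True
else:
    pe_index_missing = False
```

A caller that really needs a flat integer has to spell out the formula, which keeps the global-flat assumption visible at the call site.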
|
||||
|
||||
### D3. `resolve_dp_policy`가 `target_sip`을 받아 structural 좌표 생성
|
||||
### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates
|
||||
|
||||
ADR-0024 D4의 계약 구현. Post-hoc shifting 없음.
|
||||
Implements the contract of ADR-0024 D4. No post-hoc shifting.
|
||||
|
||||
```python
|
||||
# src/kernbench/policy/placement/dp.py (after)
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class _LocalPeShard:
|
||||
"""Internal — PE resolver의 반환. Cube 내 local PE 식별자 + payload."""
|
||||
"""Internal — return value of the PE resolver. Cube-local PE id + payload."""
|
||||
local_pe: int # cube-local PE index (0..num_pe-1)
|
||||
offset_bytes: int
|
||||
nbytes: int
|
||||
@@ -93,7 +99,7 @@ def resolve_dp_policy(
|
||||
itemsize: int,
|
||||
num_pe: int,
|
||||
num_cubes: int = 1,
|
||||
target_sip: int, # NEW — 어느 SIP에 배치할지 명시
|
||||
target_sip: int, # NEW — explicitly state which SIP to place on
|
||||
) -> list[ShardSpec]:
|
||||
"""2-level resolution (cube × PE) on a specified SIP.
|
||||
|
||||
@@ -123,28 +129,30 @@ def resolve_dp_policy(
|
||||
return all_shards
|
||||
```
|
||||
|
||||
**내부 resolver** (`column_wise`, `row_wise`, `replicate`)는 `_LocalPeShard`
|
||||
리스트 반환 — `local_pe` 필드명으로 **"cube-local PE identifier"임이 명시적**.
|
||||
과거 `ShardSpec.pe_index`와 이름이 혼동되던 문제 해소.
|
||||
**Internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a
|
||||
list of `_LocalPeShard` — the `local_pe` field name makes it **explicit
|
||||
that this is a "cube-local PE identifier"**. This resolves the previous
|
||||
confusion with the name `ShardSpec.pe_index`.
|
||||
|
||||
**이름 규약 정리** (전체 ADR):
|
||||
- `ShardSpec.pe`: 최종 외부 API — cube-local PE (structural coord)
|
||||
- `_LocalPeShard.local_pe`: 내부 resolver 단계의 동일 의미
|
||||
- `pe_index`: **제거**. 외부/내부 어디에도 남기지 않는다 (silent drift 차단의
|
||||
부가 효과: 이름 재등장 없음).
|
||||
**Naming convention summary** (whole ADR):
|
||||
- `ShardSpec.pe`: the final external API — cube-local PE (structural coord)
|
||||
- `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage
|
||||
- `pe_index`: **removed**. Not retained anywhere, internal or external
|
||||
(additional benefit of preventing silent drift: the name does not
|
||||
reappear).
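
One way to picture the two stages named above: an internal resolver yields cube-local `_LocalPeShard` items, and a lifting step attaches `(sip, cube)` to produce `ShardSpec`. The body of `resolve_dp_policy` is not shown in this diff, so `_even_split` and `lift_to_shard_specs` are hypothetical helpers, and the `ShardSpec` field set is assumed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class _LocalPeShard:
    local_pe: int  # cube-local PE index (0..num_pe-1)
    offset_bytes: int
    nbytes: int


@dataclass(frozen=True)
class ShardSpec:
    sip: int
    cube: int
    pe: int  # same meaning as _LocalPeShard.local_pe
    offset_bytes: int
    nbytes: int


def _even_split(total_bytes: int, num_pe: int) -> list[_LocalPeShard]:
    """Hypothetical resolver: equal byte ranges per PE (assumes divisibility)."""
    chunk = total_bytes // num_pe
    return [
        _LocalPeShard(local_pe=i, offset_bytes=i * chunk, nbytes=chunk)
        for i in range(num_pe)
    ]


def lift_to_shard_specs(
    local_shards: list[_LocalPeShard], *, target_sip: int, cube: int
) -> list[ShardSpec]:
    """Attach structural (sip, cube) coordinates up front; no post-hoc shifting."""
    return [
        ShardSpec(
            sip=target_sip,
            cube=cube,
            pe=s.local_pe,
            offset_bytes=s.offset_bytes,
            nbytes=s.nbytes,
        )
        for s in local_shards
    ]


shards = lift_to_shard_specs(_even_split(64, 4), target_sip=2, cube=1)
```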
|
||||
|
||||
### D4. `_create_tensor` — 구조적 좌표로 직접 placement
|
||||
### D4. `_create_tensor` — placement directly in structural coordinates
|
||||
|
||||
ADR-0024 D4 연속선. Post-hoc shifting 제거, 구조적 좌표를 `resolve_dp_policy`
|
||||
호출 시점에 직접 지정.
|
||||
Continuation of ADR-0024 D4. Post-hoc shifting removed; structural
|
||||
coordinates are specified directly at the `resolve_dp_policy` call site.
|
||||
|
||||
```python
|
||||
# context.py _create_tensor (after)
|
||||
current_sip = self.ahbm.current_device()
|
||||
if current_sip is None:
|
||||
# Single-driver fallback (ADR-0024 D2와 일관).
|
||||
# Launcher 기반 코드가 set_device()를 빼먹으면 조용히 SIP 0에 박히는
|
||||
# 문제가 있음 → debug mode에서 경고.
|
||||
# Single-driver fallback (consistent with ADR-0024 D2).
|
||||
# In launcher-based code, forgetting set_device() silently sticks the
|
||||
# tensor on SIP 0 — emit a warning in debug mode.
|
||||
if os.environ.get("KERNBENCH_DEBUG"):
|
||||
import warnings
|
||||
warnings.warn(
|
||||
@@ -161,38 +169,39 @@ placement = resolve_dp_policy(
|
||||
itemsize=itemsize,
|
||||
num_pe=eff_num_pe,
|
||||
num_cubes=eff_num_cubes,
|
||||
target_sip=current_sip, # ← 구조적 좌표 일차 지정
|
||||
target_sip=current_sip, # ← structural coord specified up front
|
||||
)
|
||||
|
||||
# placement의 각 ShardSpec은 이미 (sip=current_sip, cube=local, pe=local) 포함.
|
||||
# 과거의 post-hoc shifting 블록은 완전히 제거.
|
||||
# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
|
||||
# The old post-hoc shifting block is removed entirely.
|
||||
```
|
||||
|
||||
**모든** 텐서가 current device SIP에 배치됨. Multi-SIP 텐서를 만들고 싶으면
|
||||
ADR-0027의 TP primitive 사용.
|
||||
**Every** tensor is placed on the current device's SIP. If you need a
|
||||
multi-SIP tensor, use the TP primitive of ADR-0027.
|
||||
|
||||
**Single-driver fallback의 trade-off**: set_device 없는 호출에서 SIP 0으로
|
||||
default는 기존 single-driver 테스트 호환을 위해 유지. `KERNBENCH_DEBUG=1`
|
||||
환경에서는 launcher 컨텍스트의 실수로 set_device 누락 시 조용히 잘못된 SIP에
|
||||
배치되는 것을 감지할 수 있도록 warning.
|
||||
**Trade-off of the single-driver fallback**: When set_device is not
|
||||
called, defaulting to SIP 0 is kept for compatibility with existing
|
||||
single-driver tests. With `KERNBENCH_DEBUG=1`, a warning is emitted so
|
||||
that accidentally omitting set_device in a launcher context — which would
|
||||
silently place the tensor on the wrong SIP — can be detected.
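
The fallback behaviour described above could look roughly like this. The function name `effective_sip`, the warning text, and the warning category are assumptions; the original warning call is elided in this diff.

```python
from __future__ import annotations

import os
import warnings


def effective_sip(current_sip: int | None) -> int:
    """Return the SIP to place on, falling back to SIP 0 when none is set."""
    if current_sip is not None:
        return current_sip
    # Single-driver fallback: only warn when debugging is requested.
    if os.environ.get("KERNBENCH_DEBUG"):
        warnings.warn(
            "set_device() was not called; falling back to SIP 0",  # hypothetical text
            RuntimeWarning,
            stacklevel=2,
        )
    return 0
```

Keeping the warning behind `KERNBENCH_DEBUG` leaves existing single-driver tests quiet while still giving launcher-based code a way to surface a missing `set_device()` call.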
|
||||
|
||||
### D5. Downstream — allocator lookup은 구조적 tuple key로
|
||||
### D5. Downstream — allocator lookup by structural tuple key
|
||||
|
||||
기존 `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
|
||||
Existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
|
||||
|
||||
```python
|
||||
for spec in placement:
|
||||
alloc = allocators[spec.pe_index] # ← AttributeError (property 제거됨)
|
||||
alloc = allocators[spec.pe_index] # ← AttributeError (property removed)
|
||||
```
|
||||
|
||||
`pe_index`가 없어졌으므로 구조적 좌표로 **강제** migration:
|
||||
With `pe_index` gone, migration to structural coordinates is **forced**:
|
||||
|
||||
```python
|
||||
for spec in placement:
|
||||
alloc = allocators[(spec.sip, spec.cube, spec.pe)]
|
||||
```
|
||||
|
||||
`_ensure_allocators`의 dict population도 tuple key로:
|
||||
The dict population in `_ensure_allocators` is also tuple-keyed:
|
||||
|
||||
```python
|
||||
# context.py _ensure_allocators (after)
|
||||
@@ -204,59 +213,71 @@ for sip_id in sip_range:
|
||||
)
|
||||
```
|
||||
|
||||
`_free_tensor`도 동일: 기존 `flat_idx = sip * ... + cube * ... + pe` 계산
|
||||
블록 제거, `(shard.sip, shard.cube, shard.pe)` 직접 사용.
|
||||
`_free_tensor` is the same: the old
|
||||
`flat_idx = sip * ... + cube * ... + pe` computation block is removed,
|
||||
and `(shard.sip, shard.cube, shard.pe)` is used directly.
|
||||
|
||||
**Tuple vs dataclass `PEIdentity`**: Tuple이 단순하고 hashable로 바로 써서
|
||||
권고. `PEIdentity` 값객체는 명시적 타입 장점은 있지만 boilerplate가 크고 현재
|
||||
allocator dict의 유일한 key라 오버엔지니어링. Tuple 유지.
|
||||
**Tuple vs dataclass `PEIdentity`**: Recommend the tuple — it is simple
|
||||
and hashable out of the box. A `PEIdentity` value object has the upside
|
||||
of an explicit type, but the boilerplate is large and it is currently
|
||||
the only key of the allocator dict, so it would be over-engineering.
|
||||
Keep the tuple.
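
A minimal sketch of a tuple-keyed allocator table, assuming a toy `BumpAllocator` in place of the real per-PE allocator and a hypothetical `build_allocators` helper (the actual `_ensure_allocators` body is elided in this diff):

```python
from itertools import product


class BumpAllocator:
    """Toy stand-in for the real per-PE allocator."""

    def __init__(self) -> None:
        self.used = 0

    def alloc(self, nbytes: int) -> int:
        """Reserve nbytes and return the starting offset."""
        offset = self.used
        self.used += nbytes
        return offset


def build_allocators(num_sips: int, num_cubes: int, num_pe: int) -> dict:
    """Populate one allocator per structural (sip, cube, pe) coordinate."""
    keys = product(range(num_sips), range(num_cubes), range(num_pe))
    return {key: BumpAllocator() for key in keys}


allocators = build_allocators(num_sips=2, num_cubes=2, num_pe=4)
first = allocators[(1, 0, 3)].alloc(64)
second = allocators[(1, 0, 3)].alloc(32)
```

A plain tuple is hashable without any extra code, which is the property the dict lookup relies on.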
|
||||
|
||||
### D7. 하위 호환 — 불가 (cleanup ADR)
|
||||
### D7. Backward compatibility — none (cleanup ADR)
|
||||
|
||||
이 ADR은 **breaking change**.
|
||||
This ADR is a **breaking change**.
|
||||
|
||||
1. `DPPolicy(sip=...)` 또는 `DPPolicy(num_sips=...)` 호출 → `TypeError`
|
||||
2. `ShardSpec.pe_index` 접근 → `AttributeError`
|
||||
1. `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)` → `TypeError`
|
||||
2. `ShardSpec.pe_index` access → `AttributeError`
|
||||
|
||||
모두 **즉시 명시적 breakage**. Deprecation warning / fallback 경로 없음.
|
||||
KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에 migration.
|
||||
Both are **immediate, explicit breakage**. No deprecation warning /
|
||||
fallback path. KernBench is an internal project with a bounded set of
|
||||
call sites, so migration happens in one pass.
|
||||
|
||||
**Silent drift 차단**이 property 완전 제거의 주된 이점: global flat을 기대한
|
||||
코드가 SIP-local 결과를 받아 조용히 잘못된 인덱싱을 할 가능성 제거.
|
||||
**Blocking silent drift** is the main upside of fully removing the
|
||||
property: code that expected a global flat could otherwise silently
|
||||
receive a SIP-local result and index incorrectly — that possibility is
|
||||
eliminated.
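
To make the silent-drift risk concrete, here is a small worked example with hypothetical sizes (2 SIPs, 2 cubes, 4 PEs) and hypothetical helper names. Both indices are plain `int`s, so nothing fails if a caller indexes a global-flat table with a SIP-local value; it simply lands on the wrong SIP.

```python
N_SIPS, N_CUBES, N_PE = 2, 2, 4  # illustrative sizes only


def global_flat(sip: int, cube: int, pe: int) -> int:
    """Old-style global flat index across all SIPs."""
    return sip * N_CUBES * N_PE + cube * N_PE + pe


def sip_local_flat(cube: int, pe: int) -> int:
    """Flat index that is only unique within one SIP."""
    return cube * N_PE + pe


# A flat table laid out in global order, labelled by its true coordinates.
table = [
    (s, c, p)
    for s in range(N_SIPS)
    for c in range(N_CUBES)
    for p in range(N_PE)
]

wanted = table[global_flat(1, 0, 3)]   # intended target on SIP 1
drifted = table[sip_local_flat(0, 3)]  # same type, silently resolves to SIP 0
```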
|
||||
|
||||
## Dependencies
|
||||
|
||||
- **ADR-0024** (launcher): `set_device(rank)` 및 current-device scoping이
|
||||
SIP 배치 메커니즘 제공. 본 ADR은 그 위에 서서 DPPolicy를 순수 intra-device로
|
||||
좁힘.
|
||||
- **ADR-0027** (Megatron TP): 다중 SIP에 걸친 텐서가 필요한 경우의 대안 경로.
|
||||
이 ADR 적용 후 multi-SIP use case는 ADR-0027로 이관.
|
||||
- **ADR-0024** (launcher): `set_device(rank)` and current-device scoping
|
||||
provide the SIP placement mechanism. This ADR sits on top and narrows
|
||||
DPPolicy to pure intra-device.
|
||||
- **ADR-0027** (Megatron TP): the alternative path when a tensor spans
|
||||
multiple SIPs. After this ADR is applied, multi-SIP use cases move to
|
||||
ADR-0027.
|
||||
|
||||
---
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **`DPPolicy.cube` / `pe` 재설계**: 기존 replicate/column_wise/row_wise 의미
|
||||
유지.
|
||||
- **Tiling 정책 통합**: `tiled_column_major` / `tiled_row_major`는 그대로.
|
||||
- **Multi-device 텐서 추상화 신규**: DTensor-like는 ADR-0028.
|
||||
- **Redesign of `DPPolicy.cube` / `pe`**: existing
|
||||
replicate/column_wise/row_wise semantics are kept.
|
||||
- **Tiling policy consolidation**: `tiled_column_major` /
|
||||
`tiled_row_major` stay as they are.
|
||||
- **New multi-device tensor abstraction**: a DTensor-like abstraction belongs to ADR-0028.
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
- **`_create_tensor`의 current_sip 기본값**: set_device 없는 호출에서 rank=0
|
||||
(SIP 0)로 fallback할지, 아니면 error 낼지. 권고는 fallback (기존 single-driver
|
||||
테스트와의 호환).
|
||||
- **`test_sip_parallel.py` 재작성 범위**: 기존 단위 테스트의 의도를 유지하며
|
||||
launcher 기반으로 옮기려면 추가 fixture 필요. 별도 작업으로 scope.
|
||||
- **`DPPolicy`의 `num_sips=None` 의미**: 필드가 없어지면 `num_sips` 개념 자체가
|
||||
사라짐. Multi-SIP을 표현하고 싶으면 ADR-0027의 TP primitive를 쓰라는 것이
|
||||
명시적 답.
|
||||
- **Default value of current_sip in `_create_tensor`**: for calls without
|
||||
set_device, whether to fall back to rank=0 (SIP 0) or to raise an
|
||||
error. The recommendation is fallback (compatibility with existing
|
||||
single-driver tests).
|
||||
- **Scope of `test_sip_parallel.py` rewrite**: porting the existing unit
|
||||
  tests to a launcher-based setup while preserving their intent requires
|
||||
additional fixtures. Scoped as separate work.
|
||||
- **Meaning of `num_sips=None` on `DPPolicy`**: once the field is gone,
|
||||
the concept of `num_sips` disappears entirely. The explicit answer for
|
||||
expressing multi-SIP is to use the TP primitive of ADR-0027.
|
||||
|
||||
**Resolved (이전 rev에서 open이었던 것들)**:
|
||||
- ~~`ShardSpec.pe_index` property 존치 여부~~ → **완전 제거** (D2)
|
||||
- ~~`_ensure_allocators` dict key 형식~~ → **tuple `(sip, cube, pe)`** (D5)
|
||||
**Resolved (items that were open in earlier revs)**:
|
||||
- ~~Whether to keep the `ShardSpec.pe_index` property~~ → **fully
|
||||
removed** (D2)
|
||||
- ~~Form of `_ensure_allocators` dict key~~ → **tuple `(sip, cube, pe)`**
|
||||
(D5)
|
||||
|
||||
---
|
||||
|
||||
@@ -264,25 +285,31 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에
|
||||
|
||||
### Positive
|
||||
|
||||
- **개념 분리 명확**: DPPolicy = intra-device, TP = inter-device.
|
||||
- **API 단순화**: DPPolicy 생성자 필드 ~33% 축소.
|
||||
- **Structural 좌표 일관성**: ShardSpec이 `(sip, cube, pe)` 튜플로 표현 →
|
||||
abstraction leakage 해소 (ADR-0024 D4 계약 충족).
|
||||
- **`pe_index` 의미 명확**: SIP-local이 단일 해석. Global flat이 필요하면 명시.
|
||||
- **Launcher 모델 일관성**: ADR-0024의 "1 worker per SIP" 모델이 유일한 SIP
|
||||
경계 제어 메커니즘.
|
||||
- **Clean conceptual separation**: DPPolicy = intra-device, TP =
|
||||
inter-device.
|
||||
- **API simplification**: about a 33% reduction in DPPolicy constructor
|
||||
fields.
|
||||
- **Structural-coordinate consistency**: ShardSpec is expressed as a
|
||||
`(sip, cube, pe)` tuple → abstraction leakage resolved (the ADR-0024
|
||||
D4 contract is satisfied).
|
||||
- **Clear PE semantics**: `ShardSpec.pe` has a single interpretation, the
  cube-local PE index. If a global flat index is needed, it must be computed explicitly.
|
||||
- **Launcher-model consistency**: ADR-0024's "1 worker per SIP" model is
|
||||
the sole SIP-boundary control mechanism.
|
||||
|
||||
### Negative
|
||||
|
||||
- **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`,
|
||||
`spec.pe_index` → `AttributeError`. 모든 호출자 한 번에 수정 필요.
|
||||
- **ShardSpec schema 변경**: `pe_index` 단일 필드 → `sip`/`cube`/`pe` 세 필드.
|
||||
Downstream (`deploy_tensor`, `_free_tensor`, `_ensure_allocators`,
|
||||
`allocators` dict key 등) 연쇄 수정.
|
||||
- **Silent drift 없음**: property 완전 제거로 runtime에서 즉시 실패 →
|
||||
migration leakage 원천 차단. (Negative가 아니라 explicit tradeoff)
|
||||
- `test_sip_parallel.py` 재작성 비용.
|
||||
`spec.pe_index` → `AttributeError`. All callers need to be fixed at
|
||||
once.
|
||||
- **ShardSpec schema change**: a single `pe_index` field becomes three
|
||||
fields `sip`/`cube`/`pe`. Cascading edits downstream (`deploy_tensor`,
|
||||
`_free_tensor`, `_ensure_allocators`, `allocators` dict key, etc.).
|
||||
- **No silent drift**: with the property fully removed, runtime failure
|
||||
is immediate → migration leakage is blocked at the source. (Not a
|
||||
negative but an explicit tradeoff.)
|
||||
- The cost of rewriting `test_sip_parallel.py`.
|
||||
|
||||
### Neutral
|
||||
|
||||
- 기존 `cube` / `pe` 필드 의미 불변.
|
||||
- The meaning of the existing `cube` / `pe` fields is unchanged.
|
||||
|
||||