# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)

## Status

Accepted

## Context

The current simulation models **timing only**.
`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
but do not actually read tensor data or perform computations.

### Required Capabilities

1. Must be able to store and read actual data in HBM/TCM/SRAM
2. PE_GEMM and PE_MATH must be able to perform actual matrix operations and verify results
3. Must minimize simulation performance degradation

### Constraints

- SimPy is a single-threaded event loop: running a numpy matmul inside it blocks the entire simulation
- Components must be replaceable (ADR-0015): framework requirements must not leak into implementations
- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait): the same code must be reused
- Kernel functions must remain plain Python functions (no generator/async transformation)

### Design Exploration Results

| Option | Approach | Verdict |
|--------|----------|---------|
| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: blocks the single thread |
| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow-dependent reads |
| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Adopted: full separation, minimal impact on SimPy performance |

---

## Decision

### D1. 2-Pass Execution Model: Phase 0 Elimination

The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.

Before:
```
Phase 0: Kernel → PeCommand list (no data, no branching)
Phase 1: Replay PeCommand list via SimPy (timing only)
```

After:
```
Phase 1 (timing): Kernel + SimPy integrated execution (greenlet-based)
  - Memory read/write: SimPy timing + MemoryStore actual data
  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
  - Dynamic control flow possible (tl.load returns actual data)

Phase 2 (data): Actual computation execution based on op_log (outside SimPy, parallelizable)
```

This ADR **extends Phase 1 to be data-aware for memory operations only**.
Phase 1 handles latency/BW bottleneck analysis and memory data tracking;
Phase 2 handles GEMM/Math computation correctness verification.
Phase 2 is optional: if only timing is needed, run Phase 1 alone.

### D2. Op Log Recording: ComponentBase Hook

Op log recording is performed as a **hook in the component base class**.
Individual component implementations are not modified.

```python
class ComponentBase:
    def _on_process_start(self, env, msg):
        # Record only messages that declare themselves as data ops (see D4).
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_start(env.now, self.node.id, msg)

    def _on_process_end(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_end(env.now, self.node.id, msg)
```

Hooks are called before and after `run()` within `_forward_txn()`.
`_op_logger` is optional: when it is absent, each hook costs only a single attribute check.

**Hook timing definitions**:

| Timing | Meaning |
|--------|---------|
| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |

Link traversal latency is not included in t_start/t_end.
Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.

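The hook timing rules above can be illustrated with a small, self-contained sketch. It is not the project's logger: the `OpLogger` and `Span` classes, the use of a plain `msg_id` in place of the message object, and the component names are all assumptions made for illustration. It shows how start/end records pair up and how link latency falls out as the gap between the sender's `t_end` and the receiver's `t_start`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One service interval recorded by the base-class hooks (hypothetical)."""
    component_id: str
    msg_id: int
    t_start: float
    t_end: Optional[float] = None


class OpLogger:
    """Hypothetical logger: pairs record_start/record_end by (component, msg)."""

    def __init__(self) -> None:
        self.spans: list[Span] = []
        self._open: dict[tuple[str, int], Span] = {}

    def record_start(self, now: float, component_id: str, msg_id: int) -> None:
        span = Span(component_id, msg_id, t_start=now)
        self._open[(component_id, msg_id)] = span
        self.spans.append(span)

    def record_end(self, now: float, component_id: str, msg_id: int) -> None:
        # Close the matching open span; a missing start raises KeyError.
        self._open.pop((component_id, msg_id)).t_end = now


def link_latency(sender: Span, receiver: Span) -> float:
    """Link traversal time = receiver service start - sender service end."""
    return receiver.t_start - sender.t_end


log = OpLogger()
log.record_start(0.0, "pe_dma", msg_id=1)
log.record_end(12.0, "pe_dma", msg_id=1)
log.record_start(15.0, "hbm_ctrl", msg_id=1)   # 3 ns spent on the link
log.record_end(40.0, "hbm_ctrl", msg_id=1)

print(link_latency(log.spans[0], log.spans[1]))  # 3.0
```

Note that any queueing delay in front of the receiver would also appear in this gap, so the value is an upper bound on pure wire latency.
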
### D3. Greenlet-Based Kernel Execution: Phase 0 Elimination

The existing Phase 0 (kernel → PeCommand list) is eliminated,
and **greenlet** is used to cooperatively interleave kernel and SimPy execution.

#### Operating Principle

greenlet is a C extension that provides cooperative context switching.
When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
to perform timing simulation, and after completion, returns to the kernel with actual data.

```
SimPy loop (parent greenlet)               Kernel (child greenlet)
─────────────────────────                  ──────────────────────
g.switch() ──────────────────────────────→ Kernel starts
                                           a = tl.load(ptr, ...)
                                           internal: parent.switch(DmaReadCmd)
cmd = DmaReadCmd ←──────────────────────── (kernel paused)
yield DmaReadMsg(...)
yield env.timeout(dma_latency)
data = memory_store.read(...)
g.switch(data) ──────────────────────────→ (kernel resumed)
                                           a = data            ← actual numpy array
                                           if a[0][0] > 0.5:   ← branching possible
                                               ...
```

The kernel is maintained as a **plain Python function**.
greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.

#### KernelRunner: Framework Layer

The greenlet loop resides not in the PE_CPU component but in the framework layer,
**KernelRunner**.

```python
from greenlet import greenlet


# KernelRunner (framework: greenlet <-> SimPy bridge)
class KernelRunner:
    def __init__(self, ctx):
        self.ctx = ctx

    def run(self, env, kernel_fn, args, store):
        g = greenlet(self._run_kernel)
        cmd = g.switch(kernel_fn, args)   # run the kernel until its first command

        while cmd is not None:            # None: the kernel function has returned
            if isinstance(cmd, DmaReadCmd):
                yield from self._dispatch_dma(env, cmd)
                # src_space/dst_space are assumed fields, matching MemoryStore in D7
                data = store.read(cmd.src_space, cmd.src_addr, cmd.shape, cmd.dtype)
                cmd = g.switch(data)      # resume with actual data
            elif isinstance(cmd, GemmCmd):
                yield from self._dispatch_gemm(env, cmd)
                cmd = g.switch()          # resume (no data)
            elif isinstance(cmd, DmaWriteCmd):
                store.write(cmd.dst_space, cmd.dst_addr, cmd.data)  # visibility = issue time
                yield from self._dispatch_dma(env, cmd)             # timing only
                cmd = g.switch()
            else:
                raise TypeError(f"unhandled kernel command: {cmd!r}")

    def _run_kernel(self, kernel_fn, args):
        kernel_fn(*args)                  # returning None ends the loop in run()


# PE_CPU (component: kept simple, unaware of greenlet)
def _execute_kernel(self, env):
    runner = KernelRunner(self.ctx)
    # kernel_fn, args, and store are supplied by the launch context (not shown)
    yield from runner.run(env, kernel_fn, args, store)
```

**Op logging single source of truth**: KernelRunner does not record directly to op_log.
All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
the component base class hooks automatically record them.

**Layer separation**:
- **Kernel code**: plain function, unaware of greenlet
- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
- **ComponentBase hook**: the sole path for op_log recording
- **PE_CPU**: only calls KernelRunner, replaceable as a component

#### Handling Differences Between Memory Read/Write and Compute

| Operation | In Phase 1 | In Phase 2 |
|-----------|-----------|-----------|
| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | not applicable |
| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | not applicable |
| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |

Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
GEMM/Math operations are batch-executed in Phase 2 (performance separation).

#### Store Visibility Rule

`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
SimPy DMA timing is simulated separately afterward.

This is an intentional separation of timing and visibility:
- **visibility**: the point at which the data is reflected in MemoryStore, i.e. when `store.write()` is called
- **timing**: the point at which DMA latency completes in SimPy

This separation allows a load issued immediately after a store to see the latest data in dynamic control flow.

#### Result Handle Semantics

`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.

The key contract in Phase 1:

1. **All compute handles are always considered pending in Phase 1.**
2. `tl.wait(handle)` **expresses timing synchronization only**
   and does not make the handle ready.
3. Accessing the handle's actual result data (`handle.data`, element access,
   numpy conversion, etc.) is **only possible in Phase 2**.
4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
5. In contrast, `tl.load()` returns actual data in Phase 1, so
   **memory-read-based control flow is supported**.

| Handle state | Phase | Allowed operations |
|------------|-------|----------|
| pending | Phase 1 | `tl.wait(handle)`: timing synchronization only |
| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
| pending | Phase 1 | **Data access not allowed**: value-based branching not possible |
| ready | Phase 2 | Actual numpy data access, verification |

This restriction is intentional. If computations were executed in Phase 1,
the single SimPy thread would block, defeating the purpose of the 2-pass separation.

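One way to enforce this contract is to make the handle itself reject data access while it is pending. The sketch below is an assumption about how that could look, not the project's handle type: `ComputeHandle`, `PendingHandleError`, and `materialize()` are hypothetical names.

```python
class PendingHandleError(RuntimeError):
    """Raised when Phase 1 code tries to read a compute result."""


class ComputeHandle:
    """Hypothetical compute handle; result data exists only in Phase 2."""

    def __init__(self, op_index: int) -> None:
        self.op_index = op_index
        self._data = None              # filled in by the Phase 2 executor

    @property
    def ready(self) -> bool:
        return self._data is not None

    def _require_ready(self, what: str) -> None:
        if not self.ready:
            raise PendingHandleError(
                f"op {self.op_index}: {what} is not allowed on a pending handle")

    @property
    def data(self):
        self._require_ready("data access")
        return self._data

    def __getitem__(self, index):
        self._require_ready("element access")
        return self._data[index]

    def __bool__(self) -> bool:
        # Truth-value evaluation would be value-based branching.
        self._require_ready("truth-value evaluation")
        return True

    def materialize(self, value) -> None:
        """Called by the Phase 2 executor once the op has been computed."""
        self._data = value


h = ComputeHandle(op_index=7)
try:
    if h:                              # Phase 1: rejected
        pass
except PendingHandleError as exc:
    print("rejected:", exc)

h.materialize([[1.0, 2.0]])            # Phase 2
print(h.data)                          # [[1.0, 2.0]]
```

Failing loudly keeps a kernel that branches on a compute result from silently taking an arbitrary path during the timing pass.
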
#### Phase 1 Materialization: Future Extension

If Phase 1 eager execution becomes necessary for small operations
(scalar, small reduction) in the future, selective materialization can be supported
by adding a `materialized_in_phase1: bool` flag to the op record.
This is not implemented in the current scope.

### D4. data_op Flag: Message Self-Declaration

The logging target is determined by the `data_op` attribute on the message instance,
not by message type. The framework does not hardcode message types.

```python
class MsgBase:
    data_op: bool = False  # default: no logging


class DmaReadCmd(MsgBase):
    data_op = True  # memory transfer → logging


class GemmCmd(MsgBase):
    data_op = True  # compute → logging


class MathCmd(MsgBase):
    data_op = True  # compute → logging
```

When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
enables automatic logging without modifying framework code.

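As a minimal illustration of the opt-in behavior, the following sketch adds two message types and applies the same `getattr` test the base-class hooks use. `CreditMsg` and `should_log()` are hypothetical names introduced only for this example, and `IpcqMsg` is given no fields.

```python
class MsgBase:
    data_op: bool = False          # default: not logged


class IpcqMsg(MsgBase):            # new message type from the example above
    data_op = True                 # opt in; no framework change needed


class CreditMsg(MsgBase):          # hypothetical control message, not logged
    pass


def should_log(msg: object) -> bool:
    """Same test the ComponentBase hooks apply in D2."""
    return bool(getattr(msg, "data_op", False))


msgs = (IpcqMsg(), CreditMsg(), object())
print([should_log(m) for m in msgs])   # [True, False, False]

urgent = CreditMsg()
urgent.data_op = True                  # per-instance override
print(should_log(urgent))              # True
```

Because the check reads the attribute from the instance, objects that do not derive from `MsgBase` are simply skipped, and a single message can be opted in without touching its class.
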
### D5. Op Log Structure

#### Op Classification Scheme

A two-level classification is used:

| Field | Values | Role |
|-------|--------|------|
| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |

#### OpRecord Definition

```python
from dataclasses import dataclass, field


@dataclass
class OpRecord:
    t_start: float       # SimPy time (ns), service start
    t_end: float         # SimPy time (ns), service completion
    component_id: str    # e.g. "sip0.cube0.pe0.pe_gemm"
    op_kind: str         # "memory" | "gemm" | "math"
    op_name: str         # specific operation name
    params: dict         # per-operation parameters (see below)
    # Optional. Currently based on the in-memory record index;
    # may be replaced with a stable op_id in the future.
    dependency_ids: list[int] = field(default_factory=list)
```

#### dependency_ids Generation Rules

`dependency_ids` is **optional**, and by default the executor performs
address-based dependency inference (see D6).

Explicit setting is only needed when precise execution ordering is required:
- **Default (address-based inference)**: the executor analyzes read/write sets to
  automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
- **Explicit setting**: set at the TLContext or command generation stage when
  logical dependencies cannot be expressed via addresses.
  Example: completion-handle-based synchronization. Handle dependencies depend on
  logical completion order rather than memory addresses, so they cannot be captured
  by address inference.

#### op_log Ordering

The op_log maintains **stable ordering** based on `t_start`.
Records with the same `t_start` preserve insertion order.

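The ordering contract can be checked with Python's built-in stable sort. This is only a sketch under assumptions: `Rec` is a cut-down stand-in for `OpRecord`, and the out-of-order input imagines a logger that appends records on completion rather than on start.

```python
from dataclasses import dataclass


@dataclass
class Rec:
    """Cut-down stand-in for OpRecord (hypothetical)."""
    t_start: float
    name: str


def ordered(op_log: list[Rec]) -> list[Rec]:
    # sorted() is stable: ties on t_start keep their insertion order.
    return sorted(op_log, key=lambda r: r.t_start)


log = [Rec(5.0, "gemm_pe1"), Rec(0.0, "dma_read"), Rec(5.0, "gemm_pe0")]
print([r.name for r in ordered(log)])
# ['dma_read', 'gemm_pe1', 'gemm_pe0']
```

The two records at `t_start = 5.0` keep their relative order, which is what lets the Phase 2 executor group equal-`t_start` records deterministically.
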
#### params Details

**memory (dma_read / dma_write)**:
```python
{
    "src_addr": int,    # source address (byte)
    "dst_addr": int,    # destination address (byte)
    "nbytes": int,      # transfer size
    "src_space": str,   # "hbm" | "tcm" | "sram"
    "dst_space": str,   # "hbm" | "tcm" | "sram"
}
```

**gemm**:
```python
{
    "src_a_addr": int,    # operand A address
    "src_b_addr": int,    # operand B address
    "dst_addr": int,      # output address
    "shape_a": tuple,     # e.g. (128, 256)
    "shape_b": tuple,     # e.g. (256, 128)
    "shape_out": tuple,   # e.g. (128, 128)
    "dtype_in": str,      # e.g. "f16"
    "dtype_acc": str,     # accumulation dtype, e.g. "f32"
    "dtype_out": str,     # output dtype, e.g. "f16"
    "transpose_a": bool,
    "transpose_b": bool,
    "layout_a": str,      # "row_major" | "col_major"
    "layout_b": str,
    "layout_out": str,
    "addr_space": str,    # "tcm" (GEMM operands are always in TCM)
}
```

**math**:
```python
{
    "op": str,                    # "exp" | "add" | "sum" | "where" | ...
    "input_addrs": list[int],     # list of operand addresses
    "input_shapes": list[tuple],
    "dst_addr": int,
    "shape_out": tuple,
    "dtype": str,
    "axis": int | None,           # reduction axis
    "addr_space": str,            # "tcm"
}
```

### D6. Phase 2 Executor

Phase 2 executes the op_log outside of SimPy.

```python
from itertools import groupby


class DataExecutor:
    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
        self.op_log = op_log        # must already be ordered by t_start (see D5)
        self.store = initial_store  # takes the Phase 1 MemoryStore snapshot as input

    def run(self):
        # groupby merges only adjacent records, so it relies on the t_start ordering
        for t, ops in groupby(self.op_log, key=lambda o: o.t_start):
            batch = list(ops)
            independent, sequential = self._classify(batch)
            self._execute_parallel(independent)
            self._execute_sequential(sequential)
```

**Parallel execution determination**:

Ops with the same `t_start` are considered **parallel candidates**.
The executor determines actual parallel execution based on the following criteria:
- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
- Whether predecessor ops specified in `dependency_ids` have completed

Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.

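The address-based part of this check could be sketched as follows. This is an illustrative assumption rather than the actual `_classify` implementation: the `Op` record with explicit `reads`/`writes` ranges and the `classify()` helper are hypothetical, and explicit `dependency_ids` are left out for brevity.

```python
from dataclasses import dataclass, field

Range = tuple[str, int, int]       # (address space, start byte, end byte exclusive)


@dataclass
class Op:
    name: str
    reads: list[Range] = field(default_factory=list)
    writes: list[Range] = field(default_factory=list)


def overlaps(a: Range, b: Range) -> bool:
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def conflicts(earlier: Op, later: Op) -> bool:
    """True if `later` has a RAW, WAW, or WAR hazard against `earlier`."""
    raw = any(overlaps(w, r) for w in earlier.writes for r in later.reads)
    waw = any(overlaps(w, v) for w in earlier.writes for v in later.writes)
    war = any(overlaps(r, w) for r in earlier.reads for w in later.writes)
    return raw or waw or war


def classify(batch: list[Op]) -> tuple[list[Op], list[Op]]:
    """Split a same-t_start batch into (independent, sequential) ops."""
    independent, sequential = [], []
    for i, op in enumerate(batch):
        # An op is sequential if it conflicts with anything logged before it.
        if any(conflicts(prev, op) for prev in batch[:i]):
            sequential.append(op)
        else:
            independent.append(op)
    return independent, sequential


a = Op("gemm0", reads=[("tcm", 0, 64)], writes=[("tcm", 256, 320)])
b = Op("gemm1", reads=[("tcm", 64, 128)], writes=[("tcm", 320, 384)])
c = Op("exp0", reads=[("tcm", 256, 320)], writes=[("tcm", 512, 576)])  # RAW on gemm0
ind, seq = classify([a, b, c])
print([o.name for o in ind], [o.name for o in seq])
# ['gemm0', 'gemm1'] ['exp0']
```

Running the independent set first and the sequential set afterwards in log order is safe here because every op is compared against all earlier ops in the batch, not only against the independent ones.
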
**Batch optimization**: Only independent ops with the same op_name **and identical
shape, dtype, layout, and transpose flags** are eligible for batching.
Example: identical-shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
This improves BLAS efficiency on CPU and reduces launch overhead on GPU.

**Phase 2 execution order guarantee**:

Phase 2 does not consider data arrival timing,
and guarantees execution order solely through
dependencies (address-based inference + explicit dependency_ids).

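A rough sketch of the batching idea, using NumPy only: compatible GEMMs are grouped by a key and each group is executed with one stacked `np.matmul`. The helper names `batch_key` and `run_batched` are hypothetical, and the key covers only shape and dtype; layout and transpose flags would have to be added to it in a real executor.

```python
from collections import defaultdict

import numpy as np


def batch_key(a: np.ndarray, b: np.ndarray) -> tuple:
    # Only operands with identical shapes and dtypes may share a batch.
    return (a.shape, b.shape, a.dtype.str, b.dtype.str)


def run_batched(gemms: list[tuple[np.ndarray, np.ndarray]]) -> list[np.ndarray]:
    """Execute independent GEMMs, stacking compatible ones into one matmul."""
    groups: dict[tuple, list[int]] = defaultdict(list)
    for i, (a, b) in enumerate(gemms):
        groups[batch_key(a, b)].append(i)

    out: list = [None] * len(gemms)
    for idxs in groups.values():
        a_batch = np.stack([gemms[i][0] for i in idxs])   # (n, M, K)
        b_batch = np.stack([gemms[i][1] for i in idxs])   # (n, K, N)
        c_batch = np.matmul(a_batch, b_batch)             # (n, M, N)
        for j, i in enumerate(idxs):
            out[i] = c_batch[j]                           # restore original order
    return out


rng = np.random.default_rng(0)
gemms = [(rng.standard_normal((4, 8), dtype=np.float32),
          rng.standard_normal((8, 3), dtype=np.float32)) for _ in range(3)]
gemms.append((np.ones((2, 2), dtype=np.float32), np.ones((2, 2), dtype=np.float32)))
results = run_batched(gemms)
print([r.shape for r in results])
```

Results are written back by original index, so callers see the same order as the input list regardless of how the ops were grouped.
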
### D7. Memory Store

`MemoryStore` logically follows byte-addressable semantics;
the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).

```python
import numpy as np


class MemoryStore:
    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
```

**Internal storage format: numpy ndarray**

MemoryStore stores tensors as **numpy ndarrays**.

| Candidate | store/load speed | Phase 2 compute | Verdict |
|-----------|-----------------|-----------------|---------|
| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
| torch tensor | Immediate | torch operations available | Use only for GPU optimization |

- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
- read: **returns numpy array by reference** (no copy)
- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
- dtype uses native NumPy dtypes (`np.float16`, `np.float32`, etc.). NumPy has no built-in bfloat16, so bf16 tensors need an extension dtype (for example, the one provided by the `ml_dtypes` package)
- For byte-level access, convert via `.view(np.uint8)`
- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility

**read/write contract**:

- read/write operates on a **contiguous tensor** basis.
  If non-contiguous stride views are needed, express them as separate copy ops.
- addr is a byte address and must be aligned to at least the dtype size.
- In the normal benchmark path, producer/consumer dtype match is expected.
- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
  This is a permissive behavior intended for low-level memory validation or special test cases.
  Shape mismatch is verified based on nbytes, and raises an error on mismatch.
- Correctness criteria follow address-range-based read/write semantics.
- A tensor object cache may be used as an implementation optimization,
  but the canonical state is byte-addressable storage.
- At deploy time, the host injects initial tensor data.

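To make the contract concrete, here is a minimal sketch of a tensor-granular store. It is an assumption-laden illustration, not the project's `MemoryStore`: it keys tensors by `(space, addr)`, takes NumPy dtype names such as `"float32"` (mapping the ADR's `"f16"`-style names to NumPy dtypes is left out), and implements the reinterpret cast with `ndarray.view`.

```python
import numpy as np


class MemoryStore:
    """Tensor-granular store: (space, addr) -> ndarray, held by reference."""

    def __init__(self) -> None:
        self._tensors: dict[tuple[str, int], np.ndarray] = {}

    def write(self, space: str, addr: int, data: np.ndarray) -> None:
        if addr % data.dtype.itemsize:
            raise ValueError(f"addr {addr:#x} is not aligned to {data.dtype}")
        self._tensors[(space, addr)] = data            # no copy

    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray:
        stored = self._tensors[(space, addr)]
        want = np.dtype(dtype)
        nbytes = int(np.prod(shape)) * want.itemsize
        if nbytes != stored.nbytes:
            raise ValueError(
                f"read of {nbytes} B does not match stored {stored.nbytes} B")
        if want == stored.dtype and tuple(shape) == stored.shape:
            return stored                              # fast path: same object
        # Byte count matches but dtype or shape differs: reinterpret cast.
        flat = np.ascontiguousarray(stored).reshape(-1)
        return flat.view(want).reshape(shape)


store = MemoryStore()
x = np.arange(4, dtype=np.float32)
store.write("tcm", 0x100, x)
print(store.read("tcm", 0x100, (4,), "float32") is x)      # True
print(store.read("tcm", 0x100, (16,), "uint8").shape)      # (16,)
```

Returning the stored array by reference keeps Phase 1 cheap, at the cost that a caller mutating the returned array also mutates the store; this is the aliasing question listed under Open Questions.
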
### D8. Benchmark Kernel Code

The benchmark's **user code API is not changed**.
The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.

However, internal command/message schemas may be extended to include metadata
required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).

### D9. No Component Changes

Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
Op log recording is the responsibility of the ComponentBase hook.
When custom components are replaced, only the timing model changes,
and Phase 2 data execution is unaffected.

### D10. Phase 2 is Optional

```python
engine = GraphEngine(graph)
engine.run(benchmark)  # Phase 1: timing (plus memory data tracking)
result = engine.get_timing_result()

if verify_data:
    # Phase 2: data. DataExecutor also takes the Phase 1 MemoryStore (see D6);
    # the attribute name engine.memory_store is illustrative.
    executor = DataExecutor(engine.op_log, engine.memory_store)
    executor.run()
    executor.verify(expected_output)
```

If only timing analysis is needed, Phase 2 is skipped.
If the op_logger is deactivated, Phase 1 performance is identical to the original.

### D11. Verification Contract

Basic verification **compares the final output tensor** against a reference backend (numpy).

Per-dtype tolerance policy:

| dtype | Comparison method | Tolerance |
|-------|----------|-----------|
| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
| int types | `np.array_equal` | exact |

- Default mode: compare final output only (end-to-end correctness)
- Debug mode: can compare intermediate tensors on a per-op basis
  (MemoryStore snapshot at each op boundary)

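The tolerance policy translates directly into a small comparison helper. The sketch below is one possible reading of the table, not the project's `verify()` method: the function name, the `TOLERANCE` mapping, and the choice to compare in float64 are assumptions.

```python
import numpy as np

# Tolerances from the table above, keyed by the ADR's dtype names: (rtol, atol).
TOLERANCE = {
    "f32": (1e-5, 1e-5),
    "f16": (1e-3, 1e-3),
    "bf16": (1e-2, 1e-2),
}


def verify(actual: np.ndarray, expected: np.ndarray, dtype: str) -> bool:
    """Compare a final output tensor against the reference result."""
    if actual.shape != expected.shape:
        return False
    if dtype in TOLERANCE:
        rtol, atol = TOLERANCE[dtype]
        # Compare in float64 so the comparison itself adds no rounding.
        return bool(np.allclose(actual.astype(np.float64),
                                expected.astype(np.float64),
                                rtol=rtol, atol=atol))
    return bool(np.array_equal(actual, expected))      # integer types: exact


ref = np.array([1.0, 2.0, 3.0], dtype=np.float32)
print(verify(ref + 5e-4, ref, "f16"))   # True: within the f16 tolerance
print(verify(ref + 5e-4, ref, "f32"))   # False: outside the f32 tolerance
```

`np.allclose` tests `|actual - expected| <= atol + rtol * |expected|` element-wise, so the same absolute error can pass for f16 and fail for f32.
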
---

## Non-goals

- **Compute-result-based control flow**: not supported.
  All compute handles are in the pending state during Phase 1;
  `wait()` expresses timing synchronization only and does not imply data readiness.
  Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
  is **treated as an error**.
  Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
  Phase 1 materialization is a future extension (see D3).
- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
  the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
  and do not reproduce the actual hardware PE microarchitecture.

## Open Questions

- **Aliasing / slice view**: How to represent slices/views referencing the same
  backing storage in MemoryStore (stride-based view vs copy semantics)
- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
  communication as memory ops or introduce a separate op_kind
- **Op log streaming**: Managing op_log memory usage in large-scale simulations
  (in-memory list vs disk-backed streaming)
- **Fused operation**: Whether to record tl.composite's tiled pipeline
  (READ→COMPUTE→WRITE) as a single fused op record or as separate individual ops
- **Math op schema generalization**: The current math params have a simple structure,
  but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
  scalar/immediate operands, where/mask expressions, etc.
- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
  replacement with a stable op_id is needed when introducing streaming/disk-backed mode
- **Phase 1 materialization policy**: See Future Extension in D3.
  If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
  needs to be defined

---

## Consequences

### Positive

- Minimal impact on SimPy simulation performance (only op_log append added)
- Free to use multi-threading/GPU in Phase 2
- Component replaceability preserved (ADR-0015 design philosophy maintained)
- No changes needed to benchmark user code API
- When adding new message types, only the data_op flag needs to be set
- Phase 0 eliminated via greenlet: memory-data-based dynamic control flow supported
- `tl.load()` returns actual data, making kernel debugging easier

### Negative

- op_log memory usage (for large-scale simulations)
- Phase 2 execution time is proportional to tensor size (large GEMM)
- Dynamic branching based on pending handles (incomplete computations) not possible
  (computations execute in Phase 2, result values are undetermined in Phase 1).
  Memory-data-based branching is supported via greenlet.
- greenlet C extension dependency added (pip install greenlet)