Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023)
- KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier (max path latency across every target PE), M_CPU passes it through, and PE_CPU yields until it before recording pe_exec_start. Every PE in a launch begins kernel execution at the same env.now regardless of its dispatch path length, which eliminates the per-PE dispatch-offset artifact in cross-PE and cross-cube latency measurements.
- PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top, matching the terminal-drain behavior of ComponentBase._forward_txn for every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget (sender doesn't yield on sub_done); tl.recv now blocks until bytes have actually drained into its inbox.
- ComponentContext: new compute_path_latency_ns helper + node_overhead_ns field populated by GraphEngine.
- tests/test_kernel_launch_sync.py: asserts all PEs in one launch produce identical pe_exec_ns for a no-op kernel (zero spread).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -67,6 +67,51 @@ Completion semantics:

---

### D5. Launch timing is endpoint-synchronized

All PEs targeted by a single kernel launch MUST begin executing the kernel
body at the same simulated time, regardless of their dispatch path length
from the launch entry point.

Rationale. The dispatch tree Host → IO_CPU → M_CPU → PE_CPU has variable
latency at every level. PEs near their M_CPU receive the launch earlier
than PEs farther away; cubes near an IO_CPU receive it earlier than cubes
farther away. Without synchronization, each PE's kernel begins at a
different `env.now`, making per-PE metrics such as `pe_exec_ns` a function
of dispatch-path geometry rather than of the kernel's behavior. This
produces measurement artifacts in benchmarks that time kernel-internal
waits (for example `tl.recv` on cross-cube or cross-SIP hops).

Mechanism.

- `KernelLaunchMsg` carries an optional `target_start_ns: float | None`.
- **IO_CPU** is the canonical stamper. On fan-out to M_CPUs, it
  computes `target_start_ns = env.now + max_latency`, where `max_latency`
  is the maximum `ComponentContext.compute_path_latency_ns(path)` across
  every target (sip, cube, pe) tuple and `path = find_node_path(io_cpu,
  pe_cpu_id)`. The stamped value is placed on the request carried by
  every fanned-out sub-Transaction.
- **M_CPU** passes an already-stamped `target_start_ns` through
  unchanged. Only when the value is absent (e.g. a direct
  launch-to-M_CPU unit test) does M_CPU compute a per-cube barrier
  `env.now + max(local command-path latency)`.
- **PE_CPU** yields `env.timeout(target_start_ns - env.now)` at the top
  of `_execute_kernel`, before recording `pe_exec_start` and invoking
  the kernel body.
- When `target_start_ns is None`, PE_CPU falls through to the legacy
  unsynchronized behavior, preserving backward compatibility.
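
The barrier arithmetic in the mechanism above can be sketched in plain Python. This is a minimal illustration, not the simulator's code: `stamp_target_start_ns`, `mcpu_resolve`, and `pe_wait_ns` are hypothetical helper names, the latencies are made-up numbers standing in for `compute_path_latency_ns(path)`, and the clamp at zero is an assumption of the sketch.

```python
from typing import Mapping, Optional, Sequence


def stamp_target_start_ns(now_ns: float, path_latency_ns: Mapping[str, float]) -> float:
    """IO_CPU side: barrier = now + slowest dispatch path among all target PEs."""
    return now_ns + max(path_latency_ns.values())


def mcpu_resolve(now_ns: float, stamped: Optional[float],
                 local_latency_ns: Sequence[float]) -> float:
    """M_CPU side: pass a stamped barrier through; otherwise build a per-cube one."""
    if stamped is not None:
        return stamped
    return now_ns + max(local_latency_ns)


def pe_wait_ns(now_ns: float, target_start_ns: Optional[float]) -> float:
    """PE_CPU side: how long to wait before recording pe_exec_start."""
    if target_start_ns is None:
        return 0.0  # legacy unsynchronized behavior
    # Assumption of this sketch: clamp at zero in case the launch arrives late.
    return max(0.0, target_start_ns - now_ns)


# Made-up dispatch latencies (ns) from IO_CPU to three PE_CPUs.
latencies = {"pe0": 40.0, "pe1": 55.0, "pe2": 90.0}
launch_ns = 1000.0
target = stamp_target_start_ns(launch_ns, latencies)

# Each PE receives the launch at launch_ns + its own latency, then waits.
start_times = {
    pe: (launch_ns + lat) + pe_wait_ns(launch_ns + lat, target)
    for pe, lat in latencies.items()
}
```

With these numbers every entry in `start_times` equals `target`, so the spread across PEs is zero, which is the property the no-op kernel test in the commit message checks.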

IO_CPU-level stamping guarantees every PE across every targeted cube
uses the same barrier sim-time, eliminating both the within-cube
dispatch-offset artifact *and* the cross-cube offset artifact in
multi-cube launches. This models a real-hardware timed-broadcast launch
(latency-equalized dispatch tree).

The synchronization is internal to the engine / IO_CPU / M_CPU / PE_CPU
control plane; the runtime API and application kernels are unchanged.

---

## Links

- SPEC R1, R2, R7, R8

@@ -420,11 +420,21 @@ fan-out (see `IpcqInitMsg` in D12).
 #### PE_DMA's added responsibility

 When `vc_comm` receives a token, PE_DMA processes it as the following
-**atomic** sequence. **No SimPy yield is allowed between the two steps**
-(invariant I6):
+sequence: pay the Transaction's terminal BW drain, then atomically
+write data and forward metadata. **No SimPy yield is allowed between
+the data write and the metadata forward** (invariant I6). The drain
+yield must sit before the atomic block, not inside it:

 ```python
-def _on_vc_comm_recv(self, env, token):
+def _on_vc_comm_recv(self, env, txn):
+    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
+    # sender PE_DMA). MUST happen before the atomic block so recv only
+    # wakes after the bytes have "landed".
+    drain = getattr(txn, "drain_ns", 0.0)
+    if drain > 0:
+        yield env.timeout(drain)
+
+    token = txn.request
     # ── ATOMIC: no yield between these two operations ──
     data = self._memory_store.read(token.src_space, token.src_addr,
                                    shape=..., dtype=...)
|
||||
@@ -439,6 +449,33 @@ The final `put` is yieldable but uses an unbounded internal store, so
it completes in a single step. That `put` is the closing call of the
atomic block; nothing may be inserted before it.
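
One way to check the ordering rule is to treat the handler as a plain generator and record what it does between yields. The sketch below is illustrative only: `inbound_handler`, `run`, the `log` list, and the tuple-shaped "events" are hypothetical stand-ins rather than SimPy or simulator APIs, and it assumes the metadata forward is the closing `put`.

```python
from typing import Generator, List, Tuple

Event = Tuple[str, float]


def inbound_handler(drain_ns: float, log: List[str]) -> Generator[Event, None, None]:
    """Drain first, then the atomic data write + metadata forward."""
    if drain_ns > 0:
        log.append("drain")
        yield ("timeout", drain_ns)   # the only yield before the atomic block
    log.append("write_data")          # atomic step 1
    log.append("forward_meta_put")    # atomic step 2, no yield in between
    yield ("put", 0.0)                # closing call of the atomic block


def run(gen: Generator[Event, None, None]) -> float:
    """Tiny driver: advance simulated time by each timeout the handler yields."""
    now = 0.0
    for kind, delay in gen:
        if kind == "timeout":
            now += delay
    return now


log: List[str] = []
elapsed = run(inbound_handler(256.0, log))

# The two atomic steps must be adjacent in the log.
i = log.index("write_data")
assert log[i + 1] == "forward_meta_put"
```

Because the drain timeout is yielded before the first atomic step, `elapsed` includes the drain while the data write and metadata forward still run back to back.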

#### Drain-at-inbound semantics (D9 timing model)

The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`,
stamped at the send-side PE_DMA. In this simulator, per-hop `overhead_ns`
is paid at each forwarding component via `run()`, and the remaining
BW drain is paid once at the Transaction's terminal. Every non-IPCQ
Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
`ComponentBase._forward_txn` at the terminal node. For IPCQ, the
destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
so that the IPCQ-specific data write + metadata forward can happen.
Therefore **the drain MUST be paid explicitly at the top of that handler**
to keep IPCQ's timing model on par with every other fabric Transaction.
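
As a rough illustration of the timing model described above, the sketch below computes `drain_ns` from the slowest link on a path and adds it once at the terminal, on top of the per-hop overheads. The `Hop` type, the bytes-per-nanosecond unit, and the numbers are assumptions made for this example; they are not taken from the simulator.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Hop:
    overhead_ns: float       # fixed cost paid at this forwarding component
    bw_bytes_per_ns: float   # link bandwidth (assumed unit: bytes per ns)


def drain_ns(nbytes: int, path: Sequence[Hop]) -> float:
    """nbytes / bottleneck bandwidth on the path, stamped once by the sender."""
    bottleneck = min(h.bw_bytes_per_ns for h in path)
    return nbytes / bottleneck


def terminal_arrival_ns(send_ns: float, nbytes: int, path: Sequence[Hop]) -> float:
    """Per-hop overheads accumulate along the path; the drain is paid once at the end."""
    return send_ns + sum(h.overhead_ns for h in path) + drain_ns(nbytes, path)


path = [Hop(10.0, 64.0), Hop(20.0, 16.0), Hop(10.0, 32.0)]
print(drain_ns(4096, path))                  # 4096 / 16 = 256.0
print(terminal_arrival_ns(0.0, 4096, path))  # 40 + 256 = 296.0
```

Under these assumptions a handler that skips the drain would wake the receiver at 40 ns instead of 296 ns, which is the gap this change closes for IPCQ.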

Side effects of paying the drain here:

- **SRC `tl.send`** is unchanged. Fire-and-forget semantics are
  preserved because the sender PE_DMA does not `yield sub_done`. The
  `sub_done.succeed()` call (made after the metadata forward) is an
  event with no listener on the sender side.
- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
  when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
  forward now happens after the drain, recv observes the full fabric
  transfer time including bandwidth cost.

This matches the physical picture: send dispatches and leaves; recv waits
until the bytes have actually been drained into its inbox.

### D9.5. ADR-0020 (2-pass) integration

`tl.send` / `tl.recv` integrates with ADR-0020's two-pass model. Phase

@@ -426,11 +426,22 @@ at backend init, during IpcqInitMsg fan-out, bidirectional fast path channels are also

 #### PE_DMA's added responsibility

-On receiving a token, PE_DMA (vc_comm) processes it as the following atomic sequence.
-**No SimPy yield may be placed between the two operations** (see the I6 MUST rule):
+On receiving a token, PE_DMA (vc_comm) processes it as the following sequence:
+first pay the Transaction's terminal BW drain, then atomically perform the
+data write + metadata forward. **No SimPy yield may be placed between the
+data write and the metadata forward** (see the I6 MUST rule). The drain yield
+must sit before the atomic section, not inside it:

 ```python
-def _on_vc_comm_recv(self, env, token):
+def _on_vc_comm_recv(self, env, txn):
+    # Pay the drain_ns (= nbytes / bottleneck_bw) stamped by the sender
+    # PE_DMA here. It must come before the atomic section, because recv
+    # must wake only after the bytes have "arrived".
+    drain = getattr(txn, "drain_ns", 0.0)
+    if drain > 0:
+        yield env.timeout(drain)
+
+    token = txn.request
     # ── ATOMIC: no yield between the two operations ──
     # 1. write data to dst_addr (dst memory space is token.dst_endpoint.buffer_kind)
     data = self._memory_store.read(token.src_space, token.src_addr,
@@ -446,6 +457,32 @@ uses a store with unbounded capacity as the wire, so it completes immediately (
single-step). This final put is the end of the atomic section, and no other
yield may be inserted before it.

#### Drain-at-inbound semantics (D9 timing model)

The Transaction enters the fabric with `drain_ns = nbytes / bottleneck_bw_on_path`
already stamped by the sender PE_DMA. In this simulator, per-hop `overhead_ns`
is paid in each forwarding component's `run()`, and the remaining BW drain is
paid once at the Transaction's terminal node. For every non-IPCQ
Transaction (raw DMA, kernel-launch fanout, etc.),
`ComponentBase._forward_txn` pays this drain at the terminal. For IPCQ,
the destination PE_DMA intercepts the Transaction with the
`_handle_ipcq_inbound` handler (because it has to perform the IPCQ-specific
data write + metadata forward), so **the drain must be paid explicitly at
the top of this handler**. Only then is IPCQ's timing model on par with
every other fabric Transaction.

Side effects of paying the drain here:

- **SRC `tl.send`**: behavior unchanged. The sender PE_DMA does not `yield`
  on `sub_done`, so fire-and-forget semantics are preserved. The
  `sub_done.succeed()` called after the metadata forward is an event with
  no listener from the sender's point of view.
- **DST `tl.recv`**: wakes `drain_ns` later. recv is woken only when its
  local PE_IPCQ receives `IpcqMetaArrival`, and the metadata forward has
  moved to after the drain, so recv observes the full fabric transfer
  time including bandwidth.

This matches the physical picture: send dispatches and returns immediately;
recv waits until the bytes have actually drained into its inbox.

#### Backpressure latency accuracy

Time until backpressure is released:
