Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023)

- KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier
  (max path latency across every target PE), M_CPU passes it through,
  and PE_CPU yields until that time before recording pe_exec_start.
  Every PE in a launch begins kernel execution at the same env.now
  regardless of its dispatch path length, which eliminates the per-PE
  dispatch-offset artifact in cross-PE and cross-cube latency
  measurements.
- PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top,
  matching the terminal-drain behavior of ComponentBase._forward_txn
  for every non-IPCQ Transaction. SRC-side tl.send stays
  fire-and-forget (the sender doesn't yield on sub_done); tl.recv now
  blocks until the bytes have actually drained into its inbox.
- ComponentContext: new compute_path_latency_ns helper and
  node_overhead_ns field, populated by GraphEngine.
- tests/test_kernel_launch_sync.py: asserts that all PEs in one launch
  produce identical pe_exec_ns for a no-op kernel (zero spread).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 15:30:29 -07:00
parent 6918e6e906
commit 14d800b0ae
14 changed files with 409 additions and 17 deletions
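
The launch barrier described in the commit message can be illustrated with a minimal sketch in plain Python, with no SimPy and not the repository's code. `LaunchMsg`, `stamp_barrier`, `pe_exec_start`, and the latency values are hypothetical stand-ins; the only assumption carried over from the message is that the barrier is the current time plus the maximum dispatch-path latency across the target PEs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchMsg:
    """Hypothetical stand-in for KernelLaunchMsg; only the new field is modeled."""
    pe_id: str
    target_start_ns: float


def stamp_barrier(now_ns, path_latency_ns):
    """IO_CPU side: one shared start time = now + slowest dispatch path."""
    target = now_ns + max(path_latency_ns.values())
    return [LaunchMsg(pe, target) for pe in path_latency_ns]


def pe_exec_start(arrival_ns, msg):
    """PE_CPU side: wait out the remaining time; never start before the barrier."""
    wait_ns = max(0.0, msg.target_start_ns - arrival_ns)
    return arrival_ns + wait_ns


# Hypothetical dispatch latencies (ns) from IO_CPU to each target PE.
latencies = {"pe0": 120.0, "pe1": 340.0, "pe2": 515.0}
now = 1_000.0

msgs = stamp_barrier(now, latencies)
# Each message arrives after its own path latency, then the PE waits for the barrier.
starts = {m.pe_id: pe_exec_start(now + latencies[m.pe_id], m) for m in msgs}
spread_ns = max(starts.values()) - min(starts.values())
```

In this sketch every PE starts at the same timestamp, so `spread_ns` is zero. That is the property the new test is described as checking for a no-op kernel.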
@@ -420,11 +420,21 @@ fan-out (see `IpcqInitMsg` in D12).
 #### PE_DMA's added responsibility

 When `vc_comm` receives a token, PE_DMA processes it as the following
-**atomic** sequence. **No SimPy yield is allowed between the two steps**
-(invariant I6):
+sequence: pay the Transaction's terminal BW drain, then atomically
+write data and forward metadata. **No SimPy yield is allowed between
+the data write and the metadata forward** (invariant I6). The drain
+yield must sit before the atomic block, not inside it:

 ```python
-def _on_vc_comm_recv(self, env, token):
+def _on_vc_comm_recv(self, env, txn):
+    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
+    # sender PE_DMA). MUST happen before the atomic block so recv only
+    # wakes after the bytes have "landed".
+    drain = getattr(txn, "drain_ns", 0.0)
+    if drain > 0:
+        yield env.timeout(drain)
+
+    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    data = self._memory_store.read(token.src_space, token.src_addr,
                                   shape=..., dtype=...)
@@ -439,6 +449,33 @@ The final `put` is yieldable but uses an unbounded internal store, so
 it completes in a single step. That `put` is the closing call of the
 atomic block; nothing may be inserted before it.

+#### Drain-at-inbound semantics (D9 timing model)
+
+The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
+stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
+is paid at each forwarding component via `run()`, and the remaining
+BW drain is paid once at the Transaction's terminal. Every non-IPCQ
+Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
+`ComponentBase._forward_txn` at the terminal node. For IPCQ the
+destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
+(so IPCQ-specific data write + metadata forward can happen), so **the
+drain MUST be paid explicitly at the top of that handler** to keep
+IPCQ's timing model on par with every other fabric Transaction.
+
+Side-effects of paying drain here:
+
+- **SRC `tl.send`** is unchanged — fire-and-forget semantics are
+  preserved because the sender PE_DMA does not `yield sub_done`. The
+  `sub_done.succeed()` call (made after metadata forward below) is an
+  event with no listener on the sender side.
+- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
+  when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
+  forward now happens after the drain, recv observes the full fabric
+  transfer time including bandwidth cost.
+
+Matches the physical picture: send dispatches and leaves; recv waits
+until the bytes have actually been drained into its inbox.
+
 ### D9.5. ADR-0020 (2-pass) integration

 `tl.send` / `tl.recv` integrates with ADR-0020's two-pass model. Phase