ADR-0009 D5: chain-aware target_start_ns + zero-byte launch fanout

The single-walk predictor (find_node_path(io_cpu, pe_cpu) + compute_path_latency_ns) under-shot actual dispatch latency for far cubes -- the routing graph could pick a path bypassing M_CPU, and non-zero-nbytes launch sub-txns serialized on shared first hops. Far PEs arrived at _execute_kernel after target_start_ns, silently skipped the barrier yield, and started pe_exec_start late. Their reported pe_exec_ns under-counted by exactly the late_ns amount (63 ns observed at h4 cube4.pe0 in the IPCQ test, up to 113 ns worst case for cubes 9-11), producing the suspicious flat region in the h4 IPCQ curve at 8192/10240 bytes.

Fix:
- IO_CPU predictor uses the explicit two-leg chain (IO_CPU->M_CPU + M_CPU->PE_CPU - io.overhead - m.overhead), so every PE on every targeted cube has a barrier >= its real dispatch arrival.
- Kernel-launch fanout sub-txns carry nbytes=0 (control-plane, not data-plane), removing the per-cube fanout serialization that pushed far M_CPUs past the predictor.
- Legacy io_cpu mirror updated.

ADR-0009 D5 mechanism updated to specify the two-leg formula and the nbytes=0 requirement.

New tests/test_d5_barrier_invariant.py asserts (a) no PE enters _execute_kernel after target_start_ns and (b) every PE in a multi-cube launch has identical pe_exec_start -- both regressions silently pass on the existing tests/test_kernel_launch_sync.py because that test only inspects post-aggregation max(pe_exec_ns).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
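
The two-leg arithmetic described in this message can be illustrated with a toy model. This is a minimal sketch, not the repository's code: the `OVERHEAD_NS` and `EDGE_NS` tables, every number in them, and the helpers `path_latency_ns` and `two_leg_latency_ns` are hypothetical. It assumes a path's latency is the sum of its edge latencies plus the overhead of its two endpoints, and that the routing graph contains a direct IO_CPU->PE_CPU shortcut.

```python
# Hypothetical per-node overheads and per-edge wire latencies, in ns.
OVERHEAD_NS = {"io_cpu": 10.0, "m_cpu": 20.0, "pe_cpu": 5.0}
EDGE_NS = {
    ("io_cpu", "m_cpu"): 30.0,
    ("m_cpu", "pe_cpu"): 15.0,
    ("io_cpu", "pe_cpu"): 25.0,  # assumed shortcut that bypasses M_CPU
}


def path_latency_ns(path):
    """Edge latencies along the path plus the overhead of both endpoints."""
    wire = sum(EDGE_NS[(a, b)] for a, b in zip(path, path[1:]))
    return wire + OVERHEAD_NS[path[0]] + OVERHEAD_NS[path[-1]]


def two_leg_latency_ns(io, m, pe):
    """IO_CPU->M_CPU plus M_CPU->PE_CPU, minus overheads counted too often."""
    leg1 = path_latency_ns([io, m])
    leg2 = path_latency_ns([m, pe])
    # io overhead is already paid before stamping; m overhead shows up as the
    # end of leg1 and the start of leg2 but is paid only once at run time.
    return leg1 + leg2 - OVERHEAD_NS[io] - OVERHEAD_NS[m]


chained = two_leg_latency_ns("io_cpu", "m_cpu", "pe_cpu")
shortcut = path_latency_ns(["io_cpu", "pe_cpu"])
```

With these made-up numbers the chained prediction is 70 ns while the single-walk shortcut gives 40 ns, so a barrier stamped from the shortcut would expire before the launch actually reaches the PE.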
2026-04-27 15:12:58 -07:00
parent 90874abbfe
commit c1a5cf3a2a
4 changed files with 275 additions and 19 deletions
@@ -86,11 +86,36 @@ Mechanism.

 - `KernelLaunchMsg` carries an optional `target_start_ns: float | None`.
 - **IO_CPU** is the canonical stamper. On fan-out to M_CPUs, it
-  computes `target_start_ns = env.now + max_latency` where `max_latency`
-  is the maximum `ComponentContext.compute_path_latency_ns(path)` across
-  every target (sip, cube, pe) tuple — `path = find_node_path(io_cpu,
-  pe_cpu_id)`. The stamped value is placed on the request carried by
-  every fanned-out sub-Transaction.
+  computes `target_start_ns = env.now + max_latency` where
+  `max_latency` is the maximum, over every target (sip, cube, pe)
+  tuple, of the **two-leg dispatch chain**:
+
+  ```
+  max_latency(sip, cube, pe) =
+      compute_path_latency_ns(find_node_path(io_cpu, m_cpu(sip, cube)))
+    + compute_path_latency_ns(find_node_path(m_cpu(sip, cube), pe_cpu))
+    - io_cpu.overhead_ns
+    - m_cpu.overhead_ns
+  ```
+
+  This models the actual dispatch as **two sequential Transactions**
+  (IO_CPU → M_CPU, then M_CPU → PE_CPU). Each leg's
+  `compute_path_latency_ns` adds its endpoints' `overhead_ns`;
+  `io_cpu.overhead_ns` is subtracted because IO_CPU has already
+  paid it before this method runs, and `m_cpu.overhead_ns` is
+  subtracted once because it appears as endpoint of leg1 *and*
+  start of leg2 but is paid only once at run time. A single
+  `find_node_path(io_cpu, pe_cpu)` walk is **not** equivalent —
+  it can pick a graph path that bypasses M_CPU and silently
+  under-shoots the prediction for far cubes, breaking the D5
+  invariant.
+
+  The fanned-out sub-Transactions carry **`nbytes = 0`** for
+  `KernelLaunchMsg` (control message only). Without this,
+  large kernel-launch payloads would occupy fabric BW on the
+  shared first hop and serialize the per-cube dispatch, pushing
+  far M_CPUs past `target_start_ns` and re-introducing the
+  late-arrival violation.
 - **M_CPU** passes an already-stamped `target_start_ns` through
  unchanged. Only when the value is absent (e.g. a direct
  launch-to-M_CPU unit test) does M_CPU compute a per-cube barrier