The single-walk predictor (find_node_path(io_cpu, pe_cpu) +
compute_path_latency_ns) under-shot actual dispatch latency for far
cubes -- the routing graph could pick a path bypassing M_CPU, and
non-zero-nbytes launch sub-txns serialized on shared first hops.
Far PEs arrived at _execute_kernel after target_start_ns, silently
skipped the barrier yield, and started pe_exec_start late. Their
reported pe_exec_ns under-counted by exactly the late_ns amount
(63 ns observed at h4 cube4.pe0 in the IPCQ test, up to 113 ns
worst case for cubes 9-11), producing the suspicious flat region
in the h4 IPCQ curve at 8192/10240 bytes.
Fix:
- IO_CPU predictor uses the explicit two-leg chain
(IO_CPU->M_CPU + M_CPU->PE_CPU - io.overhead - m.overhead), so
every PE on every targeted cube has a barrier >= its real
dispatch arrival.
- Kernel-launch fanout sub-txns carry nbytes=0 (control-plane,
not data-plane), removing the per-cube fanout serialization
that pushed far M_CPUs past the predictor.
- Legacy io_cpu mirror updated.
ADR-0009 D5 mechanism updated to specify the two-leg formula and
the nbytes=0 requirement. New tests/test_d5_barrier_invariant.py
asserts (a) no PE enters _execute_kernel after target_start_ns and
(b) every PE in a multi-cube launch has identical pe_exec_start --
both regressions silently pass on the existing
tests/test_kernel_launch_sync.py because that test only inspects
post-aggregation max(pe_exec_ns).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier
(max path latency across every target PE), M_CPU passes it through,
PE_CPU yields until it before recording pe_exec_start. Every PE in a
launch begins kernel execution at the same env.now regardless of its
dispatch path length — eliminates per-PE dispatch-offset artifact in
cross-PE and cross-cube latency measurements.
- PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top,
matching the terminal-drain behavior of ComponentBase._forward_txn for
every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget
(sender doesn't yield on sub_done); tl.recv now blocks until bytes
have actually drained into its inbox.
- ComponentContext: new compute_path_latency_ns helper + node_overhead_ns
field populated by GraphEngine.
- tests/test_kernel_launch_sync.py: asserts all PEs in one launch
produce identical pe_exec_ns for a no-op kernel (zero spread).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>