c1a5cf3a2a
The single-walk predictor (find_node_path(io_cpu, pe_cpu) +
compute_path_latency_ns) undershot actual dispatch latency for far
cubes for two reasons: the routing graph could pick a path that
bypasses M_CPU, and launch sub-txns with non-zero nbytes serialized
on shared first hops. Far PEs therefore arrived at _execute_kernel
after target_start_ns, silently skipped the barrier yield, and
recorded pe_exec_start late. Their reported pe_exec_ns was
under-counted by exactly late_ns (63 ns observed at h4 cube4.pe0 in
the IPCQ test, up to 113 ns worst case for cubes 9-11), which
produced the suspicious flat region in the h4 IPCQ curve at
8192/10240 bytes.

Fix:
- IO_CPU predictor uses the explicit two-leg chain
  (IO_CPU->M_CPU + M_CPU->PE_CPU - io.overhead - m.overhead), so
  every PE on every targeted cube has a barrier >= its real
  dispatch arrival.
- Kernel-launch fanout sub-txns carry nbytes=0 (control-plane,
  not data-plane), removing the per-cube fanout serialization
  that pushed far M_CPUs past the predictor.
- Legacy io_cpu mirror updated.
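A minimal sketch of the two-leg formula, for illustration only. The
function names, the tuple layout, and the latency numbers are
hypothetical and are not taken from the predictor code; the use of
max() to derive one shared barrier is an assumption about the policy.

```python
def predict_dispatch_ns(io_to_m_ns, m_to_pe_ns, io_overhead_ns, m_overhead_ns):
    """Two-leg estimate: IO_CPU->M_CPU + M_CPU->PE_CPU - io.overhead - m.overhead."""
    return io_to_m_ns + m_to_pe_ns - io_overhead_ns - m_overhead_ns


def target_start_ns(launch_ns, legs):
    """Assumed policy: one shared barrier that covers the slowest targeted PE."""
    return launch_ns + max(predict_dispatch_ns(*leg) for leg in legs)


# Made-up per-leg latencies (ns) for a near cube and a far cube:
# (io_to_m_ns, m_to_pe_ns, io_overhead_ns, m_overhead_ns)
legs = [(40, 30, 5, 5), (120, 30, 5, 5)]
barrier = target_start_ns(1000, legs)
```

Because both legs are summed explicitly, the estimate cannot be
shortened by a route that skips M_CPU, which is the failure described
for the single-walk predictor.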

ADR-0009 D5 mechanism updated to specify the two-leg formula and
the nbytes=0 requirement. New tests/test_d5_barrier_invariant.py
asserts that (a) no PE enters _execute_kernel after target_start_ns
and (b) every PE in a multi-cube launch has an identical
pe_exec_start. Both regressions pass silently under the existing
tests/test_kernel_launch_sync.py, because that test only inspects
the post-aggregation max(pe_exec_ns).
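The two invariants can be sketched as below. This is not the content
of tests/test_d5_barrier_invariant.py; PeTrace, d5_violations, and all
timestamps are hypothetical, with only the 63 ns lateness borrowed
from the observation above. It also shows why a max(pe_exec_ns) check
alone does not notice a late PE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeTrace:
    name: str
    kernel_entry_ns: int    # when the PE entered _execute_kernel
    pe_exec_start_ns: int   # when the PE started executing
    pe_exec_ns: int         # reported execution time


def d5_violations(traces, target_start_ns):
    """Return a list of human-readable violations of invariants (a) and (b)."""
    problems = []
    # (a) no PE may enter _execute_kernel after target_start_ns
    for t in traces:
        late_ns = t.kernel_entry_ns - target_start_ns
        if late_ns > 0:
            problems.append(f"{t.name}: entered {late_ns} ns late")
    # (b) every PE must share one pe_exec_start
    starts = {t.pe_exec_start_ns for t in traces}
    if len(starts) > 1:
        problems.append(f"pe_exec_start differs: {sorted(starts)}")
    return problems


target = 500
traces = [
    PeTrace("cube0.pe0", 420, 500, 1000),  # on time
    PeTrace("cube4.pe0", 563, 563, 937),   # 63 ns late, under-counted
]

# The aggregate still looks healthy, so a max()-only test passes.
aggregate_ns = max(t.pe_exec_ns for t in traces)
problems = d5_violations(traces, target)
```

Here aggregate_ns is still 1000 while problems lists one violation of
each invariant, which is the gap the per-PE assertions close.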

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>