Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023)

- KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier (env.now plus the max path latency across every target PE), M_CPU passes it through, and PE_CPU yields until that time before recording pe_exec_start. Every PE in a launch begins kernel execution at the same env.now regardless of its dispatch path length, which eliminates the per-PE dispatch-offset artifact in cross-PE and cross-cube latency measurements.
- PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top, matching the terminal-drain behavior of ComponentBase._forward_txn for every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget (the sender doesn't yield on sub_done); tl.recv now blocks until the bytes have actually drained into its inbox.
- ComponentContext: new compute_path_latency_ns helper + node_overhead_ns field populated by GraphEngine.
- tests/test_kernel_launch_sync.py: asserts all PEs in one launch produce identical pe_exec_ns for a no-op kernel (zero spread).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 15:30:29 -07:00
parent 6918e6e906
commit 14d800b0ae
14 changed files with 409 additions and 17 deletions
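
The barrier described in the commit message can be illustrated with a small standalone sketch. It is not the kernbench implementation: LaunchMsg, stamp_barrier, and pe_exec_start are hypothetical names, the per-PE latencies are made-up numbers, and the SimPy wait on the PE side is modelled as a plain max() over timestamps. It only shows why stamping "now + worst-case path latency" gives every PE the same start time.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class LaunchMsg:
    """Hypothetical stand-in for KernelLaunchMsg (only the fields used here)."""
    request_id: int
    target_start_ns: Optional[float] = None


def stamp_barrier(msg: LaunchMsg, now_ns: float,
                  path_latency_ns: Dict[str, float]) -> LaunchMsg:
    """IO_CPU side: barrier = current time + slowest dispatch path."""
    worst = max(path_latency_ns.values(), default=0.0)
    return replace(msg, target_start_ns=now_ns + worst)


def pe_exec_start(msg: LaunchMsg, arrival_ns: float) -> float:
    """PE_CPU side: wait until the barrier if the message arrived early."""
    return max(arrival_ns, msg.target_start_ns)


# Assumed (illustrative) dispatch latencies from IO_CPU to each PE_CPU.
now_ns = 100.0
latencies = {"pe0": 12.0, "pe1": 30.0, "pe2": 21.5}

msg = stamp_barrier(LaunchMsg(request_id=1), now_ns, latencies)
starts = {pe: pe_exec_start(msg, now_ns + lat) for pe, lat in latencies.items()}
spread = max(starts.values()) - min(starts.values())
```

Because the barrier equals the arrival time of the slowest path, no PE can arrive after it, so the computed spread is zero. That is the same property the new test asserts for a no-op kernel.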
@@ -58,7 +58,13 @@ class IoCpuComponent(ComponentBase):
            self._pending[key] = (expected, received, parent_done)

    def _dispatch_to_m_cpus(self, env: simpy.Environment, txn: Any) -> Generator:
-        """Fan out sub-Transactions to target cube M_CPUs, wait for responses."""
+        """Fan out sub-Transactions to target cube M_CPUs, wait for responses.
+
+        ADR-0009 D5 (extended): stamp a global target_start_ns on
+        KernelLaunchMsg so every PE across every target cube starts at
+        the same env.now. See the non-legacy builtin for full rationale.
+        """
+        import dataclasses
        from kernbench.runtime_api.kernel import KernelLaunchMsg, MemoryReadMsg, MemoryWriteMsg

        request = txn.request
@@ -72,6 +78,34 @@ class IoCpuComponent(ComponentBase):
            txn.done.succeed()
            return

+        if isinstance(request, KernelLaunchMsg):
+            global_max_latency = 0.0
+            pe_ids = self._resolve_pe_ids(
+                getattr(request, "target_pe", "all")
+            )
+            for sip, cube in cube_targets:
+                for pe_id in pe_ids:
+                    pe_cpu_id = (
+                        f"sip{sip}.cube{cube}.pe{pe_id}.pe_cpu"
+                    )
+                    try:
+                        path = self.ctx.router.find_node_path(
+                            self.node.id, pe_cpu_id,
+                        )
+                    except Exception:
+                        continue
+                    if len(path) < 2:
+                        continue
+                    latency = self.ctx.compute_path_latency_ns(
+                        path, nbytes=0,
+                    )
+                    if latency > global_max_latency:
+                        global_max_latency = latency
+            request = dataclasses.replace(
+                request,
+                target_start_ns=float(env.now) + global_max_latency,
+            )
+
        # Setup aggregation
        self._pending[request.request_id] = (len(cube_targets), 0, txn.done)

@@ -91,6 +125,18 @@ class IoCpuComponent(ComponentBase):
            )
            yield self.out_ports[path[1]].put(sub_txn.advance())

+    def _resolve_pe_ids(self, target_pe: Any) -> list[int]:
+        """Resolve target_pe → list of PE indices (mirrors M_CPU logic)."""
+        if isinstance(target_pe, int):
+            return [target_pe]
+        if isinstance(target_pe, tuple):
+            return list(target_pe)
+        n_slices = 8
+        if self.ctx and self.ctx.spec:
+            mm = self.ctx.spec.get("cube", {}).get("memory_map", {})
+            n_slices = mm.get("hbm_slices_per_cube", 8)
+        return list(range(n_slices))
+
    def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
        """Return list of (sip, cube) pairs to fan out to."""
        from kernbench.runtime_api.kernel import (