Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023)

- KernelLaunchMsg gains target_start_ns: IO_CPU stamps a global barrier (max path latency across every target PE), M_CPU passes it through, PE_CPU yields until it before recording pe_exec_start. Every PE in a launch begins kernel execution at the same env.now regardless of its dispatch path length, which eliminates the per-PE dispatch-offset artifact in cross-PE and cross-cube latency measurements.
- PE_DMA._handle_ipcq_inbound now pays Transaction.drain_ns at the top, matching the terminal-drain behavior of ComponentBase._forward_txn for every non-IPCQ Transaction. SRC-side tl.send stays fire-and-forget (sender doesn't yield on sub_done); tl.recv now blocks until bytes have actually drained into its inbox.
- ComponentContext: new compute_path_latency_ns helper + node_overhead_ns field populated by GraphEngine.
- tests/test_kernel_launch_sync.py: asserts all PEs in one launch produce identical pe_exec_ns for a no-op kernel (zero spread).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
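
The barrier arithmetic described in the first bullet can be sketched in plain Python, without the simulator. This is only an illustration under stated assumptions: `LaunchMsg`, `stamp_barrier`, `exec_start_ns`, and the latency values are hypothetical stand-ins, not the repository's classes or measurements. Only the `target_start_ns` field name and the "stamp if not already stamped" rule are taken from the diff below.

```python
import dataclasses
from typing import Dict, Optional


@dataclasses.dataclass(frozen=True)
class LaunchMsg:
    """Hypothetical stand-in for KernelLaunchMsg (illustration only)."""
    target_pe: str = "all"
    target_start_ns: Optional[float] = None


def stamp_barrier(request: LaunchMsg, now_ns: float,
                  path_latency_ns: Dict[int, float]) -> LaunchMsg:
    """Set target_start_ns to now + the slowest dispatch path.

    A request that was already stamped upstream is passed through unchanged.
    """
    if request.target_start_ns is not None:
        return request
    barrier = now_ns + max(path_latency_ns.values(), default=0.0)
    return dataclasses.replace(request, target_start_ns=barrier)


def exec_start_ns(request: LaunchMsg, arrival_ns: float) -> float:
    """PE side: wait for the barrier, but never start before arrival."""
    return max(arrival_ns, request.target_start_ns)


# Made-up per-PE dispatch latencies in nanoseconds (assumed values).
latencies = {0: 12.0, 1: 30.0, 2: 18.5}
now = 100.0

msg = stamp_barrier(LaunchMsg(), now, latencies)
starts = {pe: exec_start_ns(msg, now + lat) for pe, lat in latencies.items()}
spread = max(starts.values()) - min(starts.values())

print(msg.target_start_ns)  # 130.0
print(spread)               # 0.0
```

Because the barrier is the latest possible arrival time, every PE waits out the difference between its own path latency and the slowest one, which is why the spread is zero in this sketch.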
2026-04-23 15:30:29 -07:00
parent 6918e6e906
commit 14d800b0ae
14 changed files with 409 additions and 17 deletions
@@ -162,7 +162,11 @@ class MCpuComponent(ComponentBase):
        Routes through find_node_path (M_CPU → NOC → PE_CPU command edges).
        PE_CPU sends ResponseMsg back via NOC → M_CPU on completion.
        Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
+
+        ADR-0009 D5: stamps target_start_ns so every PE in this fanout
+        starts executing at the same env.now regardless of dispatch path.
        """
+        import dataclasses
        request = txn.request
        target_pe = getattr(request, "target_pe", "all")
        cube_prefix = self.node.id.rsplit(".", 1)[0]  # e.g. "sip0.cube0"
@@ -172,9 +176,10 @@ class MCpuComponent(ComponentBase):
            txn.done.succeed()
            return

-        # Fan out to each PE_CPU, using response-based aggregation
-        sub_txns: list[Transaction] = []
-        n_dispatched = 0
+        # Resolve per-PE paths. If IO_CPU already stamped a global
+        # target_start_ns (ADR-0009 D5 extended), pass it through.
+        per_pe: list[tuple[int, list[str], float]] = []
+        max_latency = 0.0
        for pe_id in pe_ids:
            pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
            try:
@@ -183,8 +188,24 @@ class MCpuComponent(ComponentBase):
                continue
            if len(path) < 2:
                continue
+            latency = self.ctx.compute_path_latency_ns(path, nbytes=0)
+            per_pe.append((pe_id, path, latency))
+            if latency > max_latency:
+                max_latency = latency
+
+        if getattr(request, "target_start_ns", None) is not None:
+            stamped_request = request
+        else:
+            stamped_request = dataclasses.replace(
+                request, target_start_ns=float(env.now) + max_latency,
+            )
+
+        # Fan out to each PE_CPU, using response-based aggregation
+        sub_txns: list[Transaction] = []
+        n_dispatched = 0
+        for pe_id, path, _lat in per_pe:
            sub_txn = Transaction(
-                request=request, path=path, step=0,
+                request=stamped_request, path=path, step=0,
                nbytes=0, done=env.event(),
            )
            yield self.out_ports[path[1]].put(sub_txn.advance())