Fix Phase 1 slot-overwrite race + PE_MATH latency model (n_slots=4 safe)

Root cause:
In ring all-reduce, PE_IPCQ's recv handler advances my_tail and issues a credit return immediately. With tight credit latency (0.12ns intra-cube), the sender can refill the slot BEFORE the receiver's outbound PE_DMA reads from it for the next send. The outbound snapshot then captures stale data from a later round.

Fix:
Propagate TensorHandle.data (captured at recv-time, before credit return) through the entire send chain:

    tl.send(src=handle) → IpcqSendCmd.data → IpcqDmaToken.data

PE_DMA outbound already prefers token.data over MemoryStore read, so the recv-time snapshot is used for the in-flight data. This eliminates the race: the snapshot is captured before the slot can be overwritten.

Additional fixes:
- PE_MATH handle_command: compute SIMD latency from output tensor element count via _compute_ns(), using max(overhead_ns, compute_ns). Previously used overhead_ns=0.0 for all standalone MathCmd, making math ops take 0ns in SimPy.
- DataExecutor secondary sort: same-t_start ops sorted by op_kind (memory < gemm < math) so IPCQ slot writes execute before math reads.
- ipcq_copy recorded at INBOUND time (receiver PE_DMA arrival) instead of outbound. Inbound time is after fabric propagation, so it sorts correctly relative to the receiver's math.
- record_copy accepts explicit snapshot parameter (from token.data).

Result:
N_ELEM=32 + 256-rank + n_slots=4 + cross-SIP now passes. n_slots reverted to 4 (the deeper buffer was a workaround, not needed). 502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
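
Illustration of the race described above, as a minimal pure-Python sketch rather than the simulator's actual code. `SlotRing`, `Handle`, and `forward_after_refill` are hypothetical stand-ins (for the IPCQ slot buffer, TensorHandle, and the recv-then-forward path); no SimPy timing is modeled, only the ordering "recv, credit return, sender refill, late outbound read".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Handle:
    """Hypothetical stand-in for TensorHandle: slot index plus recv-time data."""
    slot: int
    data: Optional[List[float]]


class SlotRing:
    """Hypothetical stand-in for the receiver's IPCQ slot buffer."""

    def __init__(self, n_slots: int) -> None:
        self.slots: List[Optional[List[float]]] = [None] * n_slots

    def write(self, slot: int, payload: List[float]) -> None:
        # Sender-side write into a receiver slot.
        self.slots[slot] = list(payload)

    def recv(self, slot: int, snapshot: bool) -> Handle:
        # With snapshot=True the data is copied at recv-time, i.e. before
        # the credit return lets the sender reuse the slot.
        data = list(self.slots[slot]) if snapshot else None
        return Handle(slot, data)

    def outbound_payload(self, handle: Handle) -> List[float]:
        # Prefer the recv-time snapshot; otherwise do a late read of the slot.
        if handle.data is not None:
            return handle.data
        return list(self.slots[handle.slot])


def forward_after_refill(snapshot: bool) -> List[float]:
    ring = SlotRing(n_slots=4)
    ring.write(0, [1.0, 2.0])        # round r payload arrives
    handle = ring.recv(0, snapshot)  # recv + immediate credit return
    ring.write(0, [9.0, 9.0])        # sender refills the slot (round r+1)
    return ring.outbound_payload(handle)


racy = forward_after_refill(snapshot=False)   # late read sees the refill
fixed = forward_after_refill(snapshot=True)   # snapshot keeps round r data
```

In this sketch the late read returns the refilled payload, while the snapshot path returns the payload that was actually received, which is the behavior the commit relies on to keep n_slots=4 safe.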
2026-04-12 23:02:19 -07:00
parent 74f5f5cf08
commit 1c8ddc2d03
7 changed files with 93 additions and 36 deletions
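
The PE_MATH latency change from the commit message can be sketched as follows. Only the `max(overhead_ns, compute_ns)` rule comes from the commit; the body of `compute_ns` here (a lane-count and cycle-time model) and its parameter values are assumptions for illustration, not the real `_compute_ns()`.

```python
import math


def compute_ns(n_elem: int, lanes: int = 32, ns_per_cycle: float = 1.0) -> float:
    """Hypothetical SIMD cost model: one cycle per group of `lanes` elements."""
    return math.ceil(n_elem / lanes) * ns_per_cycle


def math_latency_ns(n_elem: int, overhead_ns: float = 0.0,
                    lanes: int = 32, ns_per_cycle: float = 1.0) -> float:
    """Latency charged for a math op: the larger of fixed overhead and compute."""
    return max(overhead_ns, compute_ns(n_elem, lanes, ns_per_cycle))
```

Under these assumed parameters, a standalone math op with overhead_ns=0.0 no longer completes in 0ns as long as its output tensor has at least one element.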
@@ -26,15 +26,27 @@ class DataExecutor:
        self._op_log = op_log
        self.store = store

+    # Ordering priority within the same t_start: memory copies must run
+    # before math/gemm so that slot data is populated before a consumer
+    # PE's math op reads it. With 0-ns PE_MATH overhead and tight SimPy
+    # scheduling, ipcq_copy and math ops from different PEs can collide
+    # at the exact same t_start.
+    _KIND_ORDER = {"memory": 0, "gemm": 1, "math": 2, "unknown": 3}
+
    def run(self) -> None:
        """Execute all ops in op_log order.

-        Ops are processed sequentially in t_start order. The previous
-        ThreadPoolExecutor-based parallel execution was removed because
-        same-t_start groups are almost always size 1 (each PE processes
-        one command at a time), so the thread-pool overhead dominated.
+        Primary sort: t_start (ascending).
+        Secondary sort: op_kind priority — memory (ipcq_copy/dma_write)
+        before gemm before math. This ensures IPCQ slot data arrives
+        before a consumer PE's math op tries to read it, even when both
+        share the same SimPy timestamp.
        """
-        for op in self._op_log:
+        ops = sorted(
+            self._op_log,
+            key=lambda r: (r.t_start, self._KIND_ORDER.get(r.op_kind, 3)),
+        )
+        for op in ops:
            if op.op_kind != "memory" or op.op_name != "dma_read":
                self._execute_op(op)
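
A standalone sketch of the two-level ordering used in `run()` above. The `Op` record and the sample log are hypothetical; the key function and the priority table mirror the diff.

```python
from collections import namedtuple

# Hypothetical minimal op record; the real op_log entries carry more fields.
Op = namedtuple("Op", ["t_start", "op_kind", "op_name"])

KIND_ORDER = {"memory": 0, "gemm": 1, "math": 2, "unknown": 3}


def replay_order(op_log):
    """Sort by t_start, then by op_kind priority (unlisted kinds sort last)."""
    return sorted(
        op_log,
        key=lambda r: (r.t_start, KIND_ORDER.get(r.op_kind, 3)),
    )


log = [
    Op(5.0, "math", "add"),          # consumer math at t=5
    Op(5.0, "memory", "ipcq_copy"),  # slot write at the same timestamp
    Op(1.0, "gemm", "matmul"),
    Op(5.0, "custom", "x"),          # kind not in KIND_ORDER
]
ordered = [op.op_name for op in replay_order(log)]
```

Because Python's `sorted` is stable, ops with the same t_start and the same kind keep their original log order.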

@@ -103,21 +103,17 @@ class OpLogger:
        src_space: str, src_addr: int,
        dst_space: str, dst_addr: int,
        shape: tuple[int, ...], dtype: str, nbytes: int,
+        snapshot: Any = None,
    ) -> None:
        """Record a memory copy op for Phase 2 replay (ADR-0023 + ADR-0020).

-        Used by PE_DMA at outbound (sender) time: the snapshot captures
-        the source data at the moment the send was issued, so Phase 2
-        replay does not see later mutations of the source addr (e.g. a
-        tl.store that runs after the recv at the sender).
-
-        For sources whose data is not yet materialized in Phase 1 (math
-        scratch outputs), the snapshot is None and Phase 2 falls back to
-        reading from MemoryStore — by which point the corresponding math
-        op has been replayed and the scratch addr is populated.
+        ``snapshot``: if provided (e.g. token.data from in-flight DMA),
+        used directly. Otherwise falls back to a fresh read from
+        MemoryStore[src_addr]. The snapshot is what Phase 2 writes into
+        dst_addr, avoiding stale-source races from cross-PE mutations.
        """
-        snap = None
-        if self._memory_store is not None:
+        snap = snapshot
+        if snap is None and self._memory_store is not None:
            try:
                arr = self._memory_store.read(
                    src_space, src_addr, shape=shape, dtype=dtype,