Fix Phase 1 slot-overwrite race + PE_MATH latency model (n_slots=4 safe)

Root cause: In ring all-reduce, PE_IPCQ's recv handler advances my_tail
and issues a credit return immediately. With tight credit latency
(0.12 ns intra-cube), the sender can refill the slot BEFORE the
receiver's outbound PE_DMA reads from it for the next send. The
outbound snapshot then captures the wrong data: the refilled contents
from a later round instead of the round being forwarded.

Fix: Propagate TensorHandle.data (captured at recv-time, before credit
return) through the entire send chain:
  tl.send(src=handle) → IpcqSendCmd.data → IpcqDmaToken.data
PE_DMA outbound already prefers token.data over MemoryStore read, so
the recv-time snapshot is used for the in-flight data. This eliminates
the race: the snapshot is captured before the slot can be overwritten.
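
A minimal sketch of the race and the fix, assuming a toy slot array and a
hypothetical Token class (neither is the simulator's real IpcqDmaToken or
PE_DMA code). It only illustrates why a payload captured at recv-time is
immune to a later slot refill, while a deferred slot read is not:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for illustration only; not the simulator's classes.


@dataclass
class Token:
    slot: int
    data: Optional[list] = None  # recv-time snapshot, if one was propagated


def outbound_payload(token: Token, slots: list) -> list:
    """Prefer the snapshot carried by the token; else read the slot now."""
    if token.data is not None:
        return token.data
    return list(slots[token.slot])


slots = [[0, 0] for _ in range(4)]  # n_slots = 4

# Round k: data lands in slot 0 and the receiver handles it.
slots[0] = [1, 1]
racy = Token(slot=0)                        # no snapshot: slot is read later
fixed = Token(slot=0, data=list(slots[0]))  # snapshot before credit return

# The credit returns quickly and the sender refills slot 0 for a later round.
slots[0] = [9, 9]

racy_out = outbound_payload(racy, slots)    # sees the refilled slot
fixed_out = outbound_payload(fixed, slots)  # still sees round k's data
print(racy_out, fixed_out)
```

The token without a snapshot forwards the refilled contents, while the token
carrying the recv-time copy forwards the data that was actually received.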

Additional fixes:
- PE_MATH handle_command: compute SIMD latency from output tensor
  element count via _compute_ns(), using max(overhead_ns, compute_ns).
  Previously used overhead_ns=0.0 for all standalone MathCmd, making
  math ops take 0ns in SimPy.
- DataExecutor secondary sort: same-t_start ops sorted by op_kind
  (memory < gemm < math) so IPCQ slot writes execute before math reads.
- ipcq_copy recorded at INBOUND time (receiver PE_DMA arrival) instead
  of outbound. Inbound time is after fabric propagation, so it sorts
  correctly relative to the receiver's math.
- record_copy accepts explicit snapshot parameter (from token.data).
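
The first two items above can be sketched as follows. This is an
illustration under assumptions: Op, replay_order, math_latency_ns, and the
ns_per_elem cost are hypothetical (the real _compute_ns formula is not shown
in this commit); only the kind ordering and the max(overhead_ns, compute_ns)
rule come from the change itself:

```python
from collections import namedtuple

# Hypothetical op record; the real op log entries carry more fields.
Op = namedtuple("Op", "t_start op_kind op_name")

# Same priority table as the commit: lower value replays first.
KIND_ORDER = {"memory": 0, "gemm": 1, "math": 2, "unknown": 3}


def replay_order(op_log):
    """Sort by t_start, then by op_kind priority for ties."""
    return sorted(
        op_log,
        key=lambda r: (r.t_start, KIND_ORDER.get(r.op_kind, 3)),
    )


def math_latency_ns(n_elems, overhead_ns, ns_per_elem=0.5):
    """Latency is the larger of fixed overhead and element-count cost.

    ns_per_elem is a placeholder value, not the simulator's SIMD model.
    """
    compute_ns = n_elems * ns_per_elem
    return max(overhead_ns, compute_ns)


log = [
    Op(5.0, "math", "add"),
    Op(5.0, "memory", "ipcq_copy"),
    Op(1.0, "gemm", "matmul"),
]
ordered = [op.op_name for op in replay_order(log)]
print(ordered)
print(math_latency_ns(32, overhead_ns=0.0))
```

With a zero overhead the latency is no longer 0 ns, and the ipcq_copy that
shares t_start with the math op is replayed ahead of it.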

Result: N_ELEM=32 + 256-rank + n_slots=4 + cross-SIP now passes.
n_slots reverted to 4 (the deeper buffer was a workaround, not needed).
502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 23:02:19 -07:00
parent 74f5f5cf08
commit 1c8ddc2d03
7 changed files with 93 additions and 36 deletions
+17 -5
@@ -26,15 +26,27 @@ class DataExecutor:
         self._op_log = op_log
         self.store = store

+    # Ordering priority within the same t_start: memory copies must run
+    # before math/gemm so that slot data is populated before a consumer
+    # PE's math op reads it. With 0-ns PE_MATH overhead and tight SimPy
+    # scheduling, ipcq_copy and math ops from different PEs can collide
+    # at the exact same t_start.
+    _KIND_ORDER = {"memory": 0, "gemm": 1, "math": 2, "unknown": 3}
+
     def run(self) -> None:
         """Execute all ops in op_log order.

-        Ops are processed sequentially in t_start order. The previous
-        ThreadPoolExecutor-based parallel execution was removed because
-        same-t_start groups are almost always size 1 (each PE processes
-        one command at a time), so the thread-pool overhead dominated.
+        Primary sort: t_start (ascending).
+        Secondary sort: op_kind priority — memory (ipcq_copy/dma_write)
+        before gemm before math. This ensures IPCQ slot data arrives
+        before a consumer PE's math op tries to read it, even when both
+        share the same SimPy timestamp.
         """
-        for op in self._op_log:
+        ops = sorted(
+            self._op_log,
+            key=lambda r: (r.t_start, self._KIND_ORDER.get(r.op_kind, 3)),
+        )
+        for op in ops:
             if op.op_kind != "memory" or op.op_name != "dma_read":
                 self._execute_op(op)
+7 -11
@@ -103,21 +103,17 @@ class OpLogger:
         src_space: str, src_addr: int,
         dst_space: str, dst_addr: int,
         shape: tuple[int, ...], dtype: str, nbytes: int,
+        snapshot: Any = None,
     ) -> None:
         """Record a memory copy op for Phase 2 replay (ADR-0023 + ADR-0020).

-        Used by PE_DMA at outbound (sender) time: the snapshot captures
-        the source data at the moment the send was issued, so Phase 2
-        replay does not see later mutations of the source addr (e.g. a
-        tl.store that runs after the recv at the sender).
-
-        For sources whose data is not yet materialized in Phase 1 (math
-        scratch outputs), the snapshot is None and Phase 2 falls back to
-        reading from MemoryStore — by which point the corresponding math
-        op has been replayed and the scratch addr is populated.
+        ``snapshot``: if provided (e.g. token.data from in-flight DMA),
+        used directly. Otherwise falls back to a fresh read from
+        MemoryStore[src_addr]. The snapshot is what Phase 2 writes into
+        dst_addr, avoiding stale-source races from cross-PE mutations.
         """
-        snap = None
-        if self._memory_store is not None:
+        snap = snapshot
+        if snap is None and self._memory_store is not None:
             try:
                 arr = self._memory_store.read(
                     src_space, src_addr, shape=shape, dtype=dtype,