Fix Phase 1 slot-overwrite race + PE_MATH latency model (n_slots=4 safe)

Root cause: In ring all-reduce, PE_IPCQ's recv handler advances my_tail and issues a credit return immediately. With tight credit latency (0.12ns intra-cube), the sender can refill the slot BEFORE the receiver's outbound PE_DMA reads from it for the next send. The outbound snapshot then captures stale data from a later round.

Fix: Propagate TensorHandle.data (captured at recv-time, before credit return) through the entire send chain:

    tl.send(src=handle) → IpcqSendCmd.data → IpcqDmaToken.data

PE_DMA outbound already prefers token.data over MemoryStore read, so the recv-time snapshot is used for the in-flight data. This eliminates the race: the snapshot is captured before the slot can be overwritten.

Additional fixes:
- PE_MATH handle_command: compute SIMD latency from output tensor element count via _compute_ns(), using max(overhead_ns, compute_ns). Previously used overhead_ns=0.0 for all standalone MathCmd, making math ops take 0ns in SimPy.
- DataExecutor secondary sort: same-t_start ops sorted by op_kind (memory < gemm < math) so IPCQ slot writes execute before math reads.
- ipcq_copy recorded at INBOUND time (receiver PE_DMA arrival) instead of outbound. Inbound time is after fabric propagation, so it sorts correctly relative to the receiver's math.
- record_copy accepts explicit snapshot parameter (from token.data).

Result: N_ELEM=32 + 256-rank + n_slots=4 + cross-SIP now passes. n_slots reverted to 4 (the deeper buffer was a workaround, not needed). 502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
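
The race described in the root cause can be illustrated with a minimal sketch. It does not use the simulator's real classes: the single-slot list, the `relay` function, and the `use_snapshot` flag are hypothetical stand-ins, and the event order (fill, recv, credit return, refill, late outbound read) is a simplification of the timing described above.

```python
def relay(payloads, use_snapshot):
    """Forward each received payload; the sender refills the slot as soon as
    the credit comes back, before the receiver's outbound read happens."""
    slot = [None]      # hypothetical single IPCQ slot
    forwarded = []
    for i, payload in enumerate(payloads):
        slot[0] = list(payload)       # sender fills the slot
        snapshot = list(slot[0])      # recv-time capture, before credit return
        # Credit is returned immediately; with tight credit latency the
        # sender refills the slot with the next round's data right away.
        if i + 1 < len(payloads):
            slot[0] = list(payloads[i + 1])
        # The outbound DMA runs only now, after the refill.
        out = snapshot if use_snapshot else list(slot[0])
        forwarded.append(out)
    return forwarded


rounds = [[1, 1], [2, 2], [3, 3]]
racy = relay(rounds, use_snapshot=False)   # re-reads the slot at outbound time
fixed = relay(rounds, use_snapshot=True)   # forwards the recv-time snapshot
```

In this sketch the racy variant forwards the next round's data for every round but the last, while the snapshot variant forwards each round's own payload.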
@@ -26,15 +26,27 @@ class DataExecutor:
         self._op_log = op_log
         self.store = store

+    # Ordering priority within the same t_start: memory copies must run
+    # before math/gemm so that slot data is populated before a consumer
+    # PE's math op reads it. With 0-ns PE_MATH overhead and tight SimPy
+    # scheduling, ipcq_copy and math ops from different PEs can collide
+    # at the exact same t_start.
+    _KIND_ORDER = {"memory": 0, "gemm": 1, "math": 2, "unknown": 3}
+
     def run(self) -> None:
         """Execute all ops in op_log order.

-        Ops are processed sequentially in t_start order. The previous
-        ThreadPoolExecutor-based parallel execution was removed because
-        same-t_start groups are almost always size 1 (each PE processes
-        one command at a time), so the thread-pool overhead dominated.
+        Primary sort: t_start (ascending).
+        Secondary sort: op_kind priority — memory (ipcq_copy/dma_write)
+        before gemm before math. This ensures IPCQ slot data arrives
+        before a consumer PE's math op tries to read it, even when both
+        share the same SimPy timestamp.
         """
-        for op in self._op_log:
+        ops = sorted(
+            self._op_log,
+            key=lambda r: (r.t_start, self._KIND_ORDER.get(r.op_kind, 3)),
+        )
+        for op in ops:
             if op.op_kind != "memory" or op.op_name != "dma_read":
                 self._execute_op(op)

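
The two-level ordering in this hunk can be tried in isolation. The sketch below reuses the `(t_start, kind priority)` sort key from the diff; the `OpRecord` dataclass, the `replay_order` helper, and the sample log entries are assumptions made for illustration, not the project's actual record type.

```python
from dataclasses import dataclass


@dataclass
class OpRecord:
    """Hypothetical stand-in for an op_log entry."""
    t_start: float
    op_kind: str
    op_name: str


KIND_ORDER = {"memory": 0, "gemm": 1, "math": 2, "unknown": 3}


def replay_order(op_log):
    # Primary key: t_start. Secondary key: kind priority, so a memory copy
    # that shares a timestamp with a math op is replayed first.
    return sorted(op_log, key=lambda r: (r.t_start, KIND_ORDER.get(r.op_kind, 3)))


log = [
    OpRecord(5.0, "math", "add"),          # consumer PE's math op
    OpRecord(5.0, "memory", "ipcq_copy"),  # slot write at the same timestamp
    OpRecord(1.0, "gemm", "matmul"),
]
ordered_names = [r.op_name for r in replay_order(log)]
```

Here the `ipcq_copy` at t=5.0 is ordered ahead of the `add` at the same timestamp, even though it was logged later.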
@@ -103,21 +103,17 @@ class OpLogger:
         src_space: str, src_addr: int,
         dst_space: str, dst_addr: int,
         shape: tuple[int, ...], dtype: str, nbytes: int,
+        snapshot: Any = None,
     ) -> None:
         """Record a memory copy op for Phase 2 replay (ADR-0023 + ADR-0020).

-        Used by PE_DMA at outbound (sender) time: the snapshot captures
-        the source data at the moment the send was issued, so Phase 2
-        replay does not see later mutations of the source addr (e.g. a
-        tl.store that runs after the recv at the sender).
-
-        For sources whose data is not yet materialized in Phase 1 (math
-        scratch outputs), the snapshot is None and Phase 2 falls back to
-        reading from MemoryStore — by which point the corresponding math
-        op has been replayed and the scratch addr is populated.
+        ``snapshot``: if provided (e.g. token.data from in-flight DMA),
+        used directly. Otherwise falls back to a fresh read from
+        MemoryStore[src_addr]. The snapshot is what Phase 2 writes into
+        dst_addr, avoiding stale-source races from cross-PE mutations.
         """
-        snap = None
-        if self._memory_store is not None:
+        snap = snapshot
+        if snap is None and self._memory_store is not None:
             try:
                 arr = self._memory_store.read(
                     src_space, src_addr, shape=shape, dtype=dtype,
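
The selection logic in this hunk (prefer an explicit snapshot, otherwise read the store) can be sketched on its own. This is a simplified illustration only: `resolve_snapshot` is a hypothetical helper, and a plain dict keyed by address stands in for MemoryStore, whose real `read` signature is only partly visible in the diff.

```python
def resolve_snapshot(snapshot, memory_store, src_addr):
    """Return the data that replay should write into the destination.

    An explicit snapshot (e.g. captured with the in-flight token) wins.
    Only when it is None do we fall back to a fresh read of the store.
    """
    snap = snapshot
    if snap is None and memory_store is not None:
        value = memory_store.get(src_addr)   # dict stand-in for a store read
        snap = list(value) if value is not None else None
    return snap


store = {0x10: [9, 9]}                # source address already overwritten
from_token = resolve_snapshot([1, 1], store, 0x10)
from_store = resolve_snapshot(None, store, 0x10)
no_store = resolve_snapshot(None, None, 0x10)
```

With an explicit snapshot the later mutation of the source address is not observed; without one, the fallback returns whatever the store holds at that moment.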