Fix Phase 1 slot-overwrite race + PE_MATH latency model (n_slots=4 safe)
Root cause: In ring all-reduce, PE_IPCQ's recv handler advances my_tail and issues a credit return immediately. With tight credit latency (0.12 ns intra-cube), the sender can refill the slot before the receiver's outbound PE_DMA reads from it for the next send. The outbound snapshot then captures the wrong data: the contents written for a later round.

Fix: Propagate TensorHandle.data (captured at recv-time, before credit return) through the entire send chain:

  tl.send(src=handle) → IpcqSendCmd.data → IpcqDmaToken.data

PE_DMA outbound already prefers token.data over a MemoryStore read, so the recv-time snapshot is used for the in-flight data. This eliminates the race: the snapshot is captured before the slot can be overwritten.

Additional fixes:
- PE_MATH handle_command: compute SIMD latency from the output tensor element count via _compute_ns(), using max(overhead_ns, compute_ns). Previously, every standalone MathCmd used overhead_ns=0.0, so math ops took 0 ns in SimPy.
- DataExecutor secondary sort: ops with the same t_start are sorted by op_kind (memory < gemm < math) so that IPCQ slot writes execute before math reads.
- ipcq_copy is recorded at inbound time (receiver PE_DMA arrival) instead of outbound time. Inbound time is after fabric propagation, so it sorts correctly relative to the receiver's math.
- record_copy accepts an explicit snapshot parameter (from token.data).

Result: N_ELEM=32 + 256-rank + n_slots=4 + cross-SIP now passes. n_slots is reverted to 4 (the deeper buffer was a workaround and is no longer needed). 502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
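The two scheduling-related items under "Additional fixes" can be illustrated with a minimal sketch. It is not the project's code: `Op`, `OP_KIND_PRIORITY`, `replay_order`, `math_latency_ns`, and the `ns_per_element` parameter are hypothetical names, and the per-element cost is only a stand-in for whatever `_compute_ns()` actually computes. The only facts taken from the commit message are the tie-break order (memory < gemm < math) and the use of `max(overhead_ns, compute_ns)`.

```python
from dataclasses import dataclass

# Hypothetical priority table; the commit message only states the
# relative order memory < gemm < math.
OP_KIND_PRIORITY = {"memory": 0, "gemm": 1, "math": 2}


@dataclass
class Op:
    """Simplified stand-in for a recorded op (assumed fields)."""
    name: str
    t_start: float
    op_kind: str


def replay_order(ops):
    """Sort by start time; break ties so memory ops run before gemm, then math."""
    return sorted(ops, key=lambda op: (op.t_start, OP_KIND_PRIORITY[op.op_kind]))


def math_latency_ns(overhead_ns, n_elements, ns_per_element):
    """Return the larger of the fixed overhead and a per-element compute estimate."""
    compute_ns = n_elements * ns_per_element  # stand-in for _compute_ns()
    return max(overhead_ns, compute_ns)


ops = [
    Op("add", 10.0, "math"),
    Op("ipcq_copy", 10.0, "memory"),
    Op("matmul", 10.0, "gemm"),
    Op("load", 5.0, "memory"),
]
ordered = [op.name for op in replay_order(ops)]
latency = math_latency_ns(overhead_ns=0.0, n_elements=32, ns_per_element=0.5)
```

With the tie-break in place, the slot write (`ipcq_copy`) is replayed before the `add` that reads it even though both share the same `t_start`, and a math op with zero overhead no longer collapses to zero latency.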
@@ -93,6 +93,14 @@ class IpcqSendCmd:
     shape: tuple[int, ...] # data shape (op_log + MemoryStore use)
     dtype: str
     handle_id: str # completion tracking
+    # In-flight data snapshot captured at tl.send() time from the
+    # TensorHandle.data field. Carries the actual numpy array that was
+    # visible at recv-time (when handle.data was populated), avoiding a
+    # Phase 1 race where a later IPCQ inbound overwrites the sender's
+    # slot between recv and send. If None, PE_DMA outbound falls back to
+    # reading MemoryStore[src_addr] (correct for sources that are never
+    # overwritten, such as HBM tiles).
+    data: Any = None
     data_op: bool = True # ADR-0020 op_log recording flag
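The fallback rule described in the `data` field comment can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: `SendCmd` is a cut-down stand-in for `IpcqSendCmd`, `outbound_payload` is a hypothetical helper, and a plain dict of lists stands in for MemoryStore and the numpy arrays.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SendCmd:
    """Simplified stand-in for IpcqSendCmd (only the fields needed here)."""
    src_addr: int
    data: Optional[Any] = None  # recv-time snapshot, or None


def outbound_payload(cmd, memory_store):
    """Prefer the in-flight snapshot; otherwise read the source address."""
    # Compare against None explicitly: truth-testing an array is ambiguous.
    if cmd.data is not None:
        return cmd.data
    return memory_store[cmd.src_addr]


memory_store = {0: [1, 2, 3]}     # slot 0 holds the data received this round
snapshot = list(memory_store[0])  # copy taken at recv-time, before credit return
memory_store[0] = [9, 9, 9]       # sender refills the slot for a later round

with_snapshot = outbound_payload(SendCmd(src_addr=0, data=snapshot), memory_store)
without_snapshot = outbound_payload(SendCmd(src_addr=0), memory_store)
```

The command that carries the snapshot still sends the data from the round it received, while the command without one picks up whatever the slot holds at send time, which is the overwrite this commit guards against.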