Fix ADR-0025: IPCQ direction addressing via address-based matching

2-rank bidirectional ring deadlock: when E and W neighbors point to the
same peer, sender-coord matching in _handle_meta_arrival / _credit_worker
picked the first direction in dict order, landing data in the wrong rx
slot relative to what the kernel recv(W) was waiting on.

Fix (ADR-0025 D1/D2/D3):
- install.reverse_direction: prefer OPPOSITE direction (E↔W, N↔S) when
  peer has it pointing back to us; fallback to any matching for
  topologies without opposite convention (tree_binary parent/child).
- _handle_meta_arrival: match by token.dst_addr range against each qp's
  my_rx_base_pa + n_slots × slot_size window (unambiguous).
- _credit_worker: match by credit.dst_rx_base_pa == qp.peer.rx_base_pa.
- IpcqCreditMetadata: new dst_rx_base_pa field carrying receiver-side
  rx base; _delayed_credit_send fills it from the consuming qp.

Tests (Phase 1 → Phase 2):
- test_reverse_direction_opposite_preference_2rank_ring
- test_reverse_direction_opposite_preference_4rank_ring_sanity
- test_meta_arrival_matches_by_dst_addr_same_peer
- test_credit_matches_by_dst_rx_base_pa_same_peer
- Existing credit-return test updated with dst_rx_base_pa.

508 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
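
The address-based matching in D2 and D3 can be sketched as follows. This is a minimal illustration, not the project's code: the `QueuePair` dataclass, the flattened `peer_rx_base_pa` field (standing in for `qp.peer.rx_base_pa`), the helper names `match_by_dst_addr` and `match_credit`, and all addresses are assumptions made for the example. The window is taken to be the half-open range from `my_rx_base_pa` up to `my_rx_base_pa + n_slots * slot_size`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class QueuePair:
    """Hypothetical stand-in for a per-direction queue pair."""
    direction: str
    my_rx_base_pa: int    # base address of this qp's local rx ring
    n_slots: int
    slot_size: int
    peer_rx_base_pa: int  # rx base on the peer that this qp sends into


def match_by_dst_addr(qps: Sequence[QueuePair], dst_addr: int) -> Optional[QueuePair]:
    """D2 sketch: pick the qp whose rx window contains dst_addr."""
    for qp in qps:
        end = qp.my_rx_base_pa + qp.n_slots * qp.slot_size
        if qp.my_rx_base_pa <= dst_addr < end:
            return qp
    return None


def match_credit(qps: Sequence[QueuePair], dst_rx_base_pa: int) -> Optional[QueuePair]:
    """D3 sketch: pick the qp whose peer-side rx base equals the credit's field."""
    for qp in qps:
        if qp.peer_rx_base_pa == dst_rx_base_pa:
            return qp
    return None


# Rank 0 in a 2-rank ring: both E and W talk to rank 1, so the sender's
# coordinates alone cannot tell the two queue pairs apart. Addresses are made up.
qps = [
    QueuePair("E", my_rx_base_pa=0x1000, n_slots=4, slot_size=0x100, peer_rx_base_pa=0x9400),
    QueuePair("W", my_rx_base_pa=0x1400, n_slots=4, slot_size=0x100, peer_rx_base_pa=0x9000),
]

arrival = match_by_dst_addr(qps, 0x1500)  # falls inside the W window
credit = match_credit(qps, 0x9000)        # credit for the rx ring that W sends into
```

Because the rx windows of distinct queue pairs do not overlap, a destination address identifies at most one of them, which is what makes this matching unambiguous even when two directions share a peer.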
@@ -219,9 +219,24 @@ def install_ipcq(
         "neighbor_table": neighbor_table,
     }
 
-    def reverse_direction(my_rank: int, peer_rank: int) -> str | None:
-        """Find which direction in peer's neighbor table points back to my_rank."""
-        for d, target in neighbor_table[peer_rank].items():
+    _OPPOSITE_DIR = {"E": "W", "W": "E", "N": "S", "S": "N"}
+
+    def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
+        """Find peer's direction that reciprocates my_dir→peer_rank.
+
+        Prefer the OPPOSITE direction (E↔W, N↔S) when the peer has it
+        pointing back to us (ADR-0025 D1). This matters in 2-rank
+        bidirectional rings where both E and W on one side point to the
+        same peer — without the preference, dict-order first-match would
+        route data into the wrong rx slot. Falls back to any direction
+        pointing back for topologies without an opposite convention
+        (e.g. tree_binary's parent/child).
+        """
+        nt = neighbor_table[peer_rank]
+        opp = _OPPOSITE_DIR.get(my_dir)
+        if opp is not None and nt.get(opp) == my_rank:
+            return opp
+        for d, target in nt.items():
             if target == my_rank:
                 return d
         return None
@@ -234,7 +249,7 @@ def install_ipcq(
             if peer_rank is None:
                 continue
             peer_s, peer_c, peer_p = rank_pe[peer_rank]
-            peer_dir = reverse_direction(r, peer_rank)
+            peer_dir = reverse_direction(r, peer_rank, d)
             if peer_dir is None:
                 # Peer doesn't have a reverse entry — skip (asymmetric topology)
                 continue
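
To see the effect of the opposite-direction preference, the sketch below lifts both versions of the lookup out of `install_ipcq` into standalone functions. It rests on a few assumptions: `neighbor_table` is passed as an explicit argument instead of being captured from the enclosing scope, the names `first_match` and `ring2` are made up for the example, and the tree direction labels (`left`, `right`, `parent`) are hypothetical placeholders for whatever tree_binary actually uses.

```python
from typing import Dict, Optional

NeighborTable = Dict[int, Dict[str, int]]

_OPPOSITE_DIR = {"E": "W", "W": "E", "N": "S", "S": "N"}


def first_match(neighbor_table: NeighborTable, my_rank: int, peer_rank: int) -> Optional[str]:
    """Pre-fix behavior: first direction in dict order that points back."""
    for d, target in neighbor_table[peer_rank].items():
        if target == my_rank:
            return d
    return None


def reverse_direction(
    neighbor_table: NeighborTable, my_rank: int, peer_rank: int, my_dir: str
) -> Optional[str]:
    """Post-fix behavior: prefer the opposite direction, then fall back."""
    nt = neighbor_table[peer_rank]
    opp = _OPPOSITE_DIR.get(my_dir)
    if opp is not None and nt.get(opp) == my_rank:
        return opp
    for d, target in nt.items():
        if target == my_rank:
            return d
    return None


# 2-rank bidirectional ring: E and W on each rank point to the same peer.
ring2: NeighborTable = {0: {"E": 1, "W": 1}, 1: {"E": 0, "W": 0}}

old = {d: first_match(ring2, 0, 1) for d in ring2[0]}
new = {d: reverse_direction(ring2, 0, 1, d) for d in ring2[0]}

# Tree with no opposite convention (direction labels are hypothetical).
tree: NeighborTable = {0: {"left": 1, "right": 2}, 1: {"parent": 0}, 2: {"parent": 0}}
child_to_parent = reverse_direction(tree, 1, 0, "parent")
parent_to_child = reverse_direction(tree, 0, 1, "left")
```

In the ring, `old` maps both of rank 0's directions to the peer's `E`, so two local queue pairs would be wired to the same remote one; `new` pairs `E` with `W` and `W` with `E`. In the tree, `_OPPOSITE_DIR` has no entry for the labels, so the lookup takes the fallback path and still finds the direction pointing back.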