Fix ADR-0025: IPCQ direction addressing via address-based matching

2-rank bidirectional ring deadlock: when the E and W neighbors are the
same peer, sender-coord matching in _handle_meta_arrival / _credit_worker
picked whichever direction came first in dict order, so data landed in a
different direction's rx slot than the one the kernel's recv(W) was
waiting on.
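A toy model of the failure, assuming peers are identified by (sip, cube, pe) coordinates and each rank keeps a direction-to-peer dict. `Coord` and `match_by_sender_coord` are hypothetical names for illustration, not the simulator's code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coord:
    """Hypothetical peer coordinate (sip, cube, pe), for illustration only."""
    sip: int
    cube: int
    pe: int


def match_by_sender_coord(directions: dict, sender: Coord) -> str:
    """Pre-fix style lookup: first direction whose peer equals `sender`."""
    for name, peer in directions.items():  # dicts keep insertion order
        if peer == sender:
            return name
    raise KeyError(sender)


# 2-rank bidirectional ring, seen from rank 0: E and W both lead to rank 1.
rank1 = Coord(sip=1, cube=0, pe=0)
rank0_directions = {"E": rank1, "W": rank1}

# Rank 1 sends on its E link, which arrives on rank 0's W side, but the
# coordinate lookup cannot tell the two links apart and answers "E".
print(match_by_sender_coord(rank0_directions, rank1))  # E
```

Because both entries compare equal to the sender, the result depends only on dict order, never on which link the traffic actually used.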

Fix (ADR-0025 D1/D2/D3):
- install.reverse_direction: prefer the OPPOSITE direction (E↔W, N↔S)
  when the peer's opposite direction points back to us; fall back to any
  matching direction for topologies without an opposite convention
  (e.g. tree_binary parent/child).
- _handle_meta_arrival: match token.dst_addr against each qp's rx window
  [my_rx_base_pa, my_rx_base_pa + n_slots × slot_size); each direction
  has its own window, so the match is unambiguous.
- _credit_worker: match by credit.dst_rx_base_pa == qp.peer.rx_base_pa.
- IpcqCreditMetadata: new dst_rx_base_pa field carrying the receiver-side
  rx base; _delayed_credit_send fills it from the consuming qp.
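A minimal sketch of the D1 lookup, assuming directions are plain strings and each rank exposes a map from its direction names to neighbor coordinates. The real install.reverse_direction signature is not shown in this commit; the one below is hypothetical:

```python
from typing import Dict, Optional, Tuple

Coord = Tuple[int, ...]  # hypothetical: any hashable rank coordinate

OPPOSITE = {"E": "W", "W": "E", "N": "S", "S": "N"}


def reverse_direction(my_dir: str, me: Coord,
                      peer_dirs: Dict[str, Coord]) -> Optional[str]:
    """Find the peer's direction whose link points back at `me`.

    `peer_dirs` maps each of the peer's direction names to the coordinate
    that direction points at.
    """
    # D1: prefer the opposite direction when it points back to us.
    opp = OPPOSITE.get(my_dir)
    if opp is not None and peer_dirs.get(opp) == me:
        return opp
    # Fallback for topologies without an E/W or N/S convention
    # (e.g. parent/child links in a binary tree): first match wins.
    for name, target in peer_dirs.items():
        if target == me:
            return name
    return None


# 2-rank ring: rank 1's E and W both point at rank 0.
ring_peer = {"E": (0,), "W": (0,)}
print(reverse_direction("E", (0,), ring_peer))  # W
print(reverse_direction("W", (0,), ring_peer))  # E

# Binary tree: rank 0's child link maps back to the child's parent link.
tree_child = {"parent": (0,), "child0": (3,), "child1": (4,)}
print(reverse_direction("child0", (0,), tree_child))  # parent
```

With the opposite preference, E and W on a 2-rank ring pair up as distinct links instead of both resolving to the peer's first matching entry.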
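The D2 window match can be sketched as below, assuming each receive-side qp carries `my_rx_base_pa`, `n_slots`, and `slot_size` as named in the commit and that the window is half-open. `RxQp` and `match_meta_arrival` are hypothetical names, not the actual `_handle_meta_arrival` code:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RxQp:
    """Hypothetical receive-side qp state; field names follow the commit."""
    direction: str
    my_rx_base_pa: int
    n_slots: int
    slot_size: int

    def owns(self, addr: int) -> bool:
        # Half-open rx window [base, base + n_slots * slot_size).
        end = self.my_rx_base_pa + self.n_slots * self.slot_size
        return self.my_rx_base_pa <= addr < end


def match_meta_arrival(qps: List[RxQp], dst_addr: int) -> RxQp:
    """D2: pick the qp whose rx window contains the token's dst_addr."""
    for qp in qps:
        if qp.owns(dst_addr):
            return qp
    raise LookupError(f"dst_addr {dst_addr:#x} is outside every rx window")


# Same peer on E and W, but each direction has its own rx window.
qps = [RxQp("E", 0x1000, 4, 0x100), RxQp("W", 0x2000, 4, 0x100)]
qp = match_meta_arrival(qps, 0x2180)
slot = (0x2180 - qp.my_rx_base_pa) // qp.slot_size
print(qp.direction, slot)  # W 1
```

Since the windows do not overlap, the destination address alone identifies both the direction and the slot index, regardless of which peer sent the token.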

Tests (Phase 1 → Phase 2):
- test_reverse_direction_opposite_preference_2rank_ring
- test_reverse_direction_opposite_preference_4rank_ring_sanity
- test_meta_arrival_matches_by_dst_addr_same_peer
- test_credit_matches_by_dst_rx_base_pa_same_peer
- Existing credit-return test updated with dst_rx_base_pa.

508 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 00:38:41 -07:00
parent e1084800ab
commit 32536daf2e
6 changed files with 277 additions and 17 deletions
+8 -1
@@ -196,10 +196,17 @@ class IpcqCreditMetadata:
     Sent by ``PeIpcqComponent._delayed_credit_send`` after a
     bottleneck-BW based latency, putting the metadata directly into
     the peer's pre-wired credit store (no fabric routing).
+    ``dst_rx_base_pa`` is the receiver's ``my_rx_base_pa`` for the direction
+    whose slot was consumed. The original sender matches this against
+    ``qp.peer.rx_base_pa`` to find the correct direction (ADR-0025 D3) —
+    unambiguous even when multiple directions share the same peer (e.g.
+    2-rank bidirectional ring).
     """
     consumer_seq: int # my_tail at recv side (new tail value)
-    src_sip: int # which peer is sending the credit
+    dst_rx_base_pa: int # receiver-side my_rx_base_pa (ADR-0025 D3)
+    src_sip: int # which peer is sending the credit (diag)
     src_cube: int
     src_pe: int
     src_direction: str # sender-side direction (peer maps to its own)
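A sketch of the D3 credit path end to end, assuming a trimmed metadata record. `CreditMeta`, `Peer`, `TxQp`, `make_credit`, and `route_credit` are hypothetical stand-ins for `IpcqCreditMetadata`, the sender's qp, `_delayed_credit_send`, and `_credit_worker`. The receiver stamps the consumed direction's `my_rx_base_pa`; the sender compares it with each qp's `peer.rx_base_pa`, so two links to the same peer resolve to different qps:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CreditMeta:
    """Trimmed, hypothetical stand-in for IpcqCreditMetadata."""
    consumer_seq: int    # receiver's new tail
    dst_rx_base_pa: int  # receiver's my_rx_base_pa for the consumed slot
    src_sip: int         # diagnostic only; not used for matching


@dataclass
class Peer:
    rx_base_pa: int      # peer's rx window base for this link


@dataclass
class TxQp:
    direction: str
    peer: Peer
    peer_tail: int = 0   # hypothetical: last consumer_seq credited


def make_credit(consuming_rx_base_pa: int, new_tail: int, my_sip: int) -> CreditMeta:
    """Receiver side: fill dst_rx_base_pa from the qp whose slot was consumed."""
    return CreditMeta(new_tail, consuming_rx_base_pa, my_sip)


def route_credit(qps: List[TxQp], credit: CreditMeta) -> TxQp:
    """Sender side (D3): match credit.dst_rx_base_pa == qp.peer.rx_base_pa."""
    for qp in qps:
        if qp.peer.rx_base_pa == credit.dst_rx_base_pa:
            qp.peer_tail = credit.consumer_seq
            return qp
    raise LookupError(f"no qp for rx base {credit.dst_rx_base_pa:#x}")


# Rank 0 in a 2-rank ring: its E link feeds rank 1's W window (0x2000),
# its W link feeds rank 1's E window (0x1000). Same src_sip on both.
rank0_qps = [TxQp("E", Peer(0x2000)), TxQp("W", Peer(0x1000))]
credit = make_credit(consuming_rx_base_pa=0x2000, new_tail=3, my_sip=1)
hit = route_credit(rank0_qps, credit)
print(hit.direction, hit.peer_tail)  # E 3
```

Both credits carry `src_sip=1`, which is why the sketch keeps it only as a diagnostic field and routes on the rx base address instead.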