# ADR-0025: IPCQ Direction Addressing — address-based matching ## Status Accepted (Revision 2 — Address-based matching; peer_direction field dropped) ## Context ### Goal In the IPCQ protocol of ADR-0023, make the **identification of "which direction pair this transfer belongs to"** consistent and **address-based**, without depending on topology / dict-order. It must work correctly in a 2-rank bidirectional ring (and more generally in any topology where multiple directions point to the same peer). ### The bug surfaced — 2-rank bidirectional ring `ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions point to the same peer. **Bug 1 (install)**: - `reverse_direction(0, 1)` → returns "E" by dict order (wrong; "W" is the correct answer — opposite-direction convention) - rank 0's E entry is set with `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")` - tl.send(E) → data lands in sip1's E-rx buffer (should be W-rx) **Bug 2 (runtime)**: - Even if install set up the correct address, the receiver's `_handle_meta_arrival` matches direction by sender coordinates only → the first direction (E) wins - peer_head_cache[E] is incremented; peer_head_cache[W] is unchanged - The kernel's tl.recv(W) waits on peer_head_cache[W] → blocks forever → IpcqDeadlock ### Root cause The same issue along two axes: 1. **Install-time pairing**: deciding "which of my directions pairs with which direction of the peer" depends on dict-iteration-order → fragile when multiple directions point to the same peer 2. **Runtime identification**: deciding "which qp should be updated" is based on sender coordinates alone → ambiguous when directions are duplicated ### Solution direction — address-based matching Each PE's rx buffer sits at a **unique address range per direction** (rx_base_pa + direction_idx × bytes_per_direction). Therefore: - **Runtime**: match by **dst_addr range** instead of sender coord → unambiguous - **Install**: prefer the opposite direction as a heuristic (the natural symmetry of ring / mesh) - No need for redundant metadata like `peer_direction` — **address is the single source of truth** This design works **independently of the PhysAddr transition (ADR-0030)**. Whether the current addresses are synthetic or PhysAddr, the same approach applies as long as the per-direction range uniqueness is preserved. --- ## Decision ### D1. Install — `reverse_direction` opposite-preference `src/kernbench/ccl/install.py`: ```python # Extended in ADR-0032 with global_* pairs for inter-SIP directions, # which were introduced by configure_sfr_intercube_multisip to keep # intercube (N/S/E/W) and inter-SIP (global_N/S/E/W) namespaces disjoint. _OPPOSITE_DIR = { "E": "W", "W": "E", "N": "S", "S": "N", "global_E": "global_W", "global_W": "global_E", "global_N": "global_S", "global_S": "global_N", } def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None: """Find peer's direction that reciprocates my_dir→peer_rank. Prefer the OPPOSITE direction (E↔W, N↔S) when the peer has it pointing back to us. This matters in 2-rank bidirectional rings where both E and W on one side point to the same peer — without the preference, the first-match-wins iteration would route data into the wrong rx slot. Falls back to any direction pointing back for topologies without an opposite convention (tree_binary's parent/child). """ nt = neighbor_table[peer_rank] opp = _OPPOSITE_DIR.get(my_dir) if opp is not None and nt.get(opp) == my_rank: return opp for d, target in nt.items(): if target == my_rank: return d return None ``` Call site: ```python for d, peer_rank in nbrs.items(): peer_dir = reverse_direction(r, peer_rank, d) # pass my_dir if peer_dir is None: continue ... ``` ### D2. Runtime — `_handle_meta_arrival` dst_addr matching `src/kernbench/components/builtin/pe_ipcq.py`: ```python def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None: """Match incoming token to the receiver-side direction by dst_addr range. Each direction has a unique rx buffer address range (my_rx_base_pa + n_slots * slot_size). The token's dst_addr (set by the sender's IPCQ when computing peer's slot address) falls within exactly one such range. This address-based matching is unambiguous even when multiple directions have the same peer (2-rank ring). """ token = msg.token dst_addr = token.dst_addr for d, qp in self._queue_pairs.items(): base = qp["my_rx_base_pa"] size = qp["n_slots"] * qp["slot_size"] if base <= dst_addr < base + size: qp["peer_head_cache"] = max(qp["peer_head_cache"], token.sender_seq + 1) self._arrived_tokens.setdefault(d, []).append(token) waiters = self._recv_waiters.get(d, []) self._recv_waiters[d] = [] for ev in waiters: if not ev.triggered: ev.succeed() any_waiters = self._any_recv_waiters self._any_recv_waiters = [] for ev in any_waiters: if not ev.triggered: ev.succeed() return # Unknown dst_addr — diagnostic log (should not happen under correct install) ``` The sender-coordinate check is **removed**. `dst_addr` already determines the direction. ### D3. Credit — add `dst_rx_base_pa` field `src/kernbench/common/ipcq_types.py`: ```python @dataclass(frozen=True) class IpcqCreditMetadata: consumer_seq: int dst_rx_base_pa: int # NEW: matches the original sender's peer.rx_base_pa # Existing fields (kept for diagnostic / logging purposes) src_sip: int src_cube: int src_pe: int src_direction: str ``` When the credit is generated (`_delayed_credit_send`): it carries this direction's `my_rx_base_pa` as `dst_rx_base_pa` (this is the `peer.rx_base_pa` the other side used when it was the sender). Receiver side (`_credit_worker`): ```python def _credit_worker(self, env): while True: credit = yield self._credit_inbox.get() for d, qp in self._queue_pairs.items(): # Find the qp whose peer rx_base_pa matches the credit's dst_rx_base_pa if qp["peer"].rx_base_pa == credit.dst_rx_base_pa: qp["peer_tail_cache"] = max(qp["peer_tail_cache"], credit.consumer_seq) waiters = self._send_waiters.get(d, []) self._send_waiters[d] = [] for ev in waiters: if not ev.triggered: ev.succeed() break ``` Sender-coordinate check removed. Matching by `dst_rx_base_pa` is unambiguous. ### D4. Do **not** add a `peer_direction` field to `IpcqInitEntry` The `IpcqInitEntry.peer_direction` proposed in ADR-0025 rev 1 is **unnecessary**. Reasons: - Meta arrivals are matched by dst_addr (D2) - Credits are matched by dst_rx_base_pa (D3) - No need to store peer_direction on qp - Install only uses peer_dir internally when computing rx_base_pa (`reverse_direction`) No change to the IpcqInitEntry schema. **Simpler** than rev 1. ### D5. Keep `IpcqDmaToken.src_direction` (diagnostic only) The existing `src_direction` field is not removed. It is retained for: - Logging / trace: the `(rank, t, dir, nbytes)` output of `KERNBENCH_CCL_TRACE=1` - Diagnostics: showing direction in pointer_dump, etc. - Room for future extension Runtime matching uses only `dst_addr`. ### D6. Invariants (strengthens ADR-0023 I3) **I3 (strict)**: For each direction pair `(my_direction, peer_direction)`, my rx_base and peer rx_base must point to **distinct direction slots**. Install must guarantee this (reverse_direction opposite-preference). **I3.1 (new)**: For every qp, `qp["my_rx_base_pa"]` and `qp["peer"].rx_base_pa` occupy mutually disjoint address ranges (buffers of different directions never overlap). This is the prerequisite for the address-based matching of D2/D3. Verifiable at install time: ```python # ccl/install_plan.py: assertion at the end of build_install_plans all_rx_ranges = set() for plan in plans: for pe_install in plan.pe_installs: for entry in pe_install.neighbors: r = (entry.my_rx_base_pa, entry.my_rx_base_pa + plan.n_slots * plan.slot_size) overlap = any(_ranges_overlap(r, e) for e in all_rx_ranges) assert not overlap all_rx_ranges.add(r) ``` --- ## Dependencies - **ADR-0023** (IPCQ protocol): this ADR modifies ADR-0023's runtime matching logic (D2, D3) and improves the install heuristic (D1). No change to the IPCQ protocol's semantic layer. - **ADR-0024** (launcher): the case where a 2-rank bidirectional ring is actually used is the ws=SIP_count model of ADR-0024. This ADR makes that case work. - **ADR-0030** (PhysAddr transition, stub): **independent** — ADR-0025's address-based matching works identically whether the current addresses are synthetic or PhysAddr. --- ## Non-goals - **Migrating IPCQ addressing to PhysAddr**: ADR-0030 scope. This ADR is agnostic to how addresses are encoded. - **Multi-hop routing**: the single-hop DMA write assumption of ADR-0023 D5 still holds. - **Unidir ring specialization**: `ring_1d_unidir` only has a single direction, so the bug does not apply. --- ## Open questions - **Address-matching performance**: `_handle_meta_arrival` and `_credit_worker` iterate qp linearly (max 4 directions). The performance impact is negligible. If it becomes an issue, this can be switched to a dict lookup (`_qp_by_rx_base`). - **Re-evaluating the need for `IpcqDmaToken.src_direction`**: whether to keep this field, which is only kept for diagnostics, or to split it out of logging. Currently retained. - **Cost of install-time invariant verification**: the I3.1 verification of D6 is O(N_PE × N_direction)^2. It could be slow on large topologies → improvable via data structures such as interval trees. Simple implementation first. --- ## Consequences ### Positive - **Simplicity**: redundant `peer_direction` metadata removed. Address is the single source of truth. - **Unambiguous matching**: works on every topology (including duplicate directions). - **Minimal schema changes**: `IpcqInitEntry` unchanged, one field added to `IpcqCreditMetadata`. - **Independent of PhysAddr transition (ADR-0030)**: address-based matching is agnostic to the address encoding. - **Diagnostics retained**: `IpcqDmaToken.src_direction` is kept for logging. ### Negative - Runtime matching is now by address comparison, so when debugging questions like "why did peer_head_cache[W] update rather than [E]" one has to follow the address range (previously the direction name was enough). Mitigation: include a "direction ↔ rx_base_pa" mapping in pointer_dump. ### Neutral - The semantic layer of the IPCQ protocol (sender computes dst_addr, receiver receives) is unchanged.