ADR-0023 D9: blocking credit-emit with full-path latency

PE_IPCQ._handle_recv now yields-from _delayed_credit_send instead of
spawning it as a fork, so the receiver's pe_exec_ns includes the
credit-return cost. _credit_latency_ns switches from
compute_drain_ns(path, 16) to compute_path_latency_ns(path, 16) and
fixes a latent find_path bug where the destination lacked the
".pe_dma" suffix (silently returned 0 ns under the bare except).

Net effect on h3/h4 inter-cube pe-to-pe latency: IPCQ >= raw DMA at
every size, matching real-HW posted-write semantics. tl.send remains
fire-and-forget. ADR-0023 D9 amended; new diagnostic test
tests/test_pe_to_pe_diagnostic.py captures per-PE pe_exec_ns, paths,
drain, and meta-arrival timing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:12:38 -07:00
parent 19dfc86dc3
commit 90874abbfe
11 changed files with 901 additions and 25 deletions
+25 -8
@@ -372,24 +372,41 @@ When the receiver frees a slot, the sender must learn about it
travel through general vc_comm fabric — it uses a **separate fast
path**, an abstraction of the NVLink / UCIe credit-return wire.
**Latency** is computed from the **bottleneck BW on the path**, not a
magic constant:
**Latency** is computed from the **full path latency** (per-node
overhead + edge propagation + drain), not a magic constant:
```
credit_size_bytes = 16 (ccl.yaml: ipcq_credit_size_bytes)
path = router.find_path(self_pe, peer_pe)
latency = compute_drain_ns(path, credit_size_bytes)
= credit_size_bytes / bottleneck_bw_on_path
path = router.find_path(self_pe, peer_pe.pe_dma)
latency = compute_path_latency_ns(path, credit_size_bytes)
= sum(edge.distance_mm * ns_per_mm)
+ sum(node_overhead_ns[n] for n in path)
+ credit_size_bytes / bottleneck_bw_on_path
```
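
The full-path formula can be sketched in a few lines of Python. This is a minimal illustration, not the simulator's `compute_path_latency_ns`: the `Edge` dataclass, the `path_latency_ns` helper, the choice to charge overhead at every node including both endpoints, and all numbers are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge record; field names are illustrative."""
    src: str
    dst: str
    distance_mm: float
    bw_bytes_per_ns: float


def path_latency_ns(edges, node_overhead_ns, size_bytes, ns_per_mm):
    """Edge propagation + per-node overhead + drain at the bottleneck edge."""
    if not edges:
        raise ValueError("path must contain at least one edge")
    nodes = [edges[0].src] + [e.dst for e in edges]
    propagation = sum(e.distance_mm * ns_per_mm for e in edges)
    # Assumption: every node on the path, endpoints included, adds overhead.
    overhead = sum(node_overhead_ns.get(n, 0.0) for n in nodes)
    drain = size_bytes / min(e.bw_bytes_per_ns for e in edges)
    return propagation + overhead + drain


# Made-up topology numbers, chosen only to keep the arithmetic readable.
OVERHEAD = {"pe0.pe_dma": 1.0, "router": 2.0, "pe1.pe_dma": 1.0}
IN_CUBE = [
    Edge("pe0.pe_dma", "router", distance_mm=2.0, bw_bytes_per_ns=64.0),
    Edge("router", "pe1.pe_dma", distance_mm=2.0, bw_bytes_per_ns=32.0),
]

drain_only = 16 / min(e.bw_bytes_per_ns for e in IN_CUBE)          # 0.5 ns
full_path = path_latency_ns(IN_CUBE, OVERHEAD, 16, ns_per_mm=0.5)  # 6.5 ns
```

With these numbers the drain term alone is 0.5 ns, while the full path costs 2.0 ns of propagation, 4.0 ns of node overhead, and 0.5 ns of drain. For a 16-byte credit the drain term is the smallest contributor, which is why a drain-only estimate understates the credit-return cost.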
The router auto-appends `.pe_dma` to the source only, so the
destination MUST carry an explicit `.pe_dma` suffix. Without it,
`find_path` raises, the bare `except` swallows the error, and the
credit silently teleports at zero cost (latent bug fixed alongside
this update).
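
The failure mode is easy to reproduce in isolation. The sketch below is hypothetical: `find_path`, `PathNotFound`, the `ROUTES` table (which maps a node pair straight to a latency for brevity), and both wrapper functions are stand-ins invented for the example, not the simulator's code.

```python
class PathNotFound(KeyError):
    """Hypothetical error raised when no route matches."""


# Hypothetical route table: (source, destination) -> latency in ns.
ROUTES = {("pe0.pe_dma", "pe1.pe_dma"): 6.5}


def find_path(src, dst):
    # Mirrors the described behaviour: suffix added to the source only.
    key = (src + ".pe_dma", dst)
    if key not in ROUTES:
        raise PathNotFound(key)
    return ROUTES[key]


def credit_latency_swallowing(src, dst):
    """Bare except: a misspelled destination degrades to 0 ns."""
    try:
        return find_path(src, dst)
    except:  # noqa: E722 - deliberately reproduces the hazard
        return 0.0


def credit_latency_strict(src, dst):
    """No handler: a misspelled destination fails loudly."""
    return find_path(src, dst)


silent = credit_latency_swallowing("pe0", "pe1")          # 0.0, bug hidden
correct = credit_latency_swallowing("pe0", "pe1.pe_dma")  # 6.5
```

Letting the lookup error propagate, or catching only the specific exception and logging it, turns a silent zero-cost credit into a visible failure.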
`tl.recv` blocks on the credit-emit completion (recv yields-from
`_delayed_credit_send` rather than spawning it as a fork). This puts
the credit-return cost on the receiver's `pe_exec_ns`, modeling the
IPCQ control-plane completing the consume-acknowledgement before
recv returns to the kernel — the protocol equivalent of a non-posted
`tl.store` waiting for an HBM ack on the raw DMA path.
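
The difference between forking the credit send and yielding from it can be shown with plain Python generators. This is a toy scheduler written for illustration, not SimPy and not the simulator's code; `simulate`, `Fork`, and the two latency constants are assumptions. In SimPy the fork case would correspond to starting the generator with `env.process(...)` and not waiting on it.

```python
import heapq
from itertools import count

RECV_NS = 10.0    # assumed cost of the recv itself
CREDIT_NS = 6.5   # assumed credit-return latency


class Fork:
    """Request to run a generator in the background."""
    def __init__(self, gen):
        self.gen = gen


def simulate(root):
    """Run generator processes; return the time at which `root` finishes."""
    seq = count()  # tie-breaker so generators are never compared
    queue = [(0.0, next(seq), root)]
    root_done = None
    while queue:
        now, _, proc = heapq.heappop(queue)
        try:
            request = next(proc)
        except StopIteration:
            if proc is root:
                root_done = now
            continue
        if isinstance(request, Fork):
            heapq.heappush(queue, (now, next(seq), request.gen))
            heapq.heappush(queue, (now, next(seq), proc))
        else:  # a plain number means "wait this many ns"
            heapq.heappush(queue, (now + request, next(seq), proc))
    return root_done


def delayed_credit_send():
    yield CREDIT_NS


def recv_forked():
    yield RECV_NS
    yield Fork(delayed_credit_send())  # credit runs off the critical path


def recv_blocking():
    yield RECV_NS
    yield from delayed_credit_send()   # credit cost lands on the receiver


forked_ns = simulate(recv_forked())      # 10.0
blocking_ns = simulate(recv_blocking())  # 16.5
```

In the forked variant the receiver finishes at 10.0 ns and the credit completes later in the background. In the blocking variant the same credit delay is charged to the receiver, which is the effect the change has on `pe_exec_ns`.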
That gives us:
- **Topology-proportional approximation**: an in-cube credit return is
automatically faster than a cross-SIP credit return.
- **No magic constants**: no arbitrary `ipcq_ctrl_latency_ns`.
- **No magic constants**: every nanosecond comes from
`compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
as data traffic.
- **No deadlock risk**: unlike piggyback, B can issue credit even when
it has no data to send back.
- **Reuses existing utility**: `ComponentContext.compute_drain_ns`.
it has no data to send back. `peer_credit_store.put` is unbounded.
- **`IPCQ ≥ raw DMA`** for matched physical moves: the credit-emit
cost on recv balances the HBM ack-trip cost that raw DMA pays on the
sender.
#### Component coupling — SimPy Store channel