ADR-0023 D9: blocking credit-emit with full-path latency
PE_IPCQ._handle_recv now yields-from _delayed_credit_send instead of spawning it as a fork, so the receiver's pe_exec_ns includes the credit-return cost.

_credit_latency_ns switches from compute_drain_ns(path, 16) to compute_path_latency_ns(path, 16) and fixes a latent find_path bug where the destination lacked the ".pe_dma" suffix (silently returned 0 ns under the bare except).

Net effect on h3/h4 inter-cube pe-to-pe latency: IPCQ >= raw DMA at every size, matching real-HW posted-write semantics. tl.send remains fire-and-forget.

ADR-0023 D9 amended; new diagnostic test tests/test_pe_to_pe_diagnostic.py captures per-PE pe_exec_ns, paths, drain, and meta-arrival timing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -372,24 +372,41 @@ When the receiver frees a slot, the sender must learn about it
 travel through general vc_comm fabric — it uses a **separate fast
 path**, an abstraction of the NVLink / UCIe credit-return wire.
 
-**Latency** is computed from the **bottleneck BW on the path**, not a
-magic constant:
+**Latency** is computed from the **full path latency** (per-node
+overhead + edge propagation + drain), not a magic constant:
 
 ```
 credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
-path = router.find_path(self_pe, peer_pe)
-latency = compute_drain_ns(path, credit_size_bytes)
-        = credit_size_bytes / bottleneck_bw_on_path
+path = router.find_path(self_pe, peer_pe.pe_dma)
+latency = compute_path_latency_ns(path, credit_size_bytes)
+        = sum(edge.distance_mm * ns_per_mm)
+        + sum(node_overhead_ns[n] for n in path)
+        + credit_size_bytes / bottleneck_bw_on_path
 ```
 
+The router auto-appends `.pe_dma` to the source only, so the
+destination MUST be spelled with the explicit `.pe_dma` suffix or
+`find_path` raises and the credit silently teleports at zero cost
+(latent bug fixed alongside this update).
+
+`tl.recv` blocks on the credit-emit completion (recv yields-from
+`_delayed_credit_send` rather than spawning it as a fork). This puts
+the credit-return cost on the receiver's `pe_exec_ns`, modeling the
+IPCQ control-plane completing the consume-acknowledgement before
+recv returns to the kernel — the protocol equivalent of a non-posted
+`tl.store` waiting for an HBM ack on the raw DMA path.
+
 That gives us:
 
 - **Topology-proportional approximation**: an in-cube credit return is
 automatically faster than a cross-SIP credit return.
-- **No magic constants**: no arbitrary `ipcq_ctrl_latency_ns`.
+- **No magic constants**: every nanosecond comes from
+`compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
+as data traffic.
 - **No deadlock risk**: unlike piggyback, B can issue credit even when
-it has no data to send back.
-- **Reuses existing utility**: `ComponentContext.compute_drain_ns`.
+it has no data to send back. `peer_credit_store.put` is unbounded.
+- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
+cost on recv balances the HBM ack-trip cost RAW pays on the sender.
 
 #### Component coupling — SimPy Store channel
 
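To make the three-term formula in the hunk above concrete, here is a minimal Python sketch of the drain-only and full-path calculations. It is not the repository's `compute_path_latency_ns`: the `Edge` type, the function signatures, the node names, and every numeric value (including `ns_per_mm`) are assumptions chosen for illustration. Only the shape of the sum comes from the ADR text.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge record; field names are assumptions for this sketch."""
    distance_mm: float
    bw_bytes_per_ns: float  # 1 byte/ns is the same rate as 1 GB/s


def drain_ns(edges: Sequence[Edge], size_bytes: int) -> float:
    """Old model: payload size divided by the bottleneck bandwidth on the path."""
    if not edges:
        # Refuse an empty path instead of quietly reporting zero latency.
        raise ValueError("empty path")
    return size_bytes / min(e.bw_bytes_per_ns for e in edges)


def path_latency_ns(
    nodes: Sequence[str],
    edges: Sequence[Edge],
    node_overhead_ns: Mapping[str, float],
    size_bytes: int,
    ns_per_mm: float,
) -> float:
    """New model: edge propagation + per-node overhead + drain."""
    propagation = sum(e.distance_mm * ns_per_mm for e in edges)
    overhead = sum(node_overhead_ns[n] for n in nodes)
    return propagation + overhead + drain_ns(edges, size_bytes)


# Hypothetical three-node path with two edges; all values are made up.
nodes = ["pe0.pe_dma", "noc", "pe1.pe_dma"]
edges = [Edge(distance_mm=2.0, bw_bytes_per_ns=64.0),
         Edge(distance_mm=10.0, bw_bytes_per_ns=32.0)]
overheads = {"pe0.pe_dma": 2.0, "noc": 4.0, "pe1.pe_dma": 2.0}

old_latency = drain_ns(edges, 16)
new_latency = path_latency_ns(nodes, edges, overheads, 16, ns_per_mm=0.5)
print(old_latency, new_latency)
```

With these made-up numbers, a 16-byte credit costs 0.5 ns under the drain-only model and 14.5 ns under the full-path model (6.0 ns propagation, 8.0 ns node overhead, 0.5 ns drain). For a payload this small, the drain term is the smallest of the three in this example.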
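The difference between yielding from the credit send and forking it can be shown with a toy discrete-event loop. The sketch below is plain Python rather than SimPy, and `MiniSim`, `credit_send`, `recv`, and the 100 ns / 12 ns figures are all hypothetical stand-ins, not the simulator's code. It only illustrates why delegating with `yield from` puts the credit latency on the receiver's measured execution time while a fork does not.

```python
import heapq
from itertools import count


class MiniSim:
    """Toy event loop: a process is a generator that yields delays in ns."""

    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = count()  # tie-breaker so generators are never compared

    def spawn(self, gen):
        """Schedule a new process to start at the current time (a fork)."""
        heapq.heappush(self._queue, (self.now, next(self._seq), gen))

    def run(self):
        while self._queue:
            self.now, _, gen = heapq.heappop(self._queue)
            try:
                delay = next(gen)  # run the process up to its next yield
            except StopIteration:
                continue  # process finished
            heapq.heappush(self._queue, (self.now + delay, next(self._seq), gen))


def credit_send(sim, latency_ns, log):
    """Stand-in for the delayed credit send: wait, then record arrival."""
    yield latency_ns
    log["credit_arrival_ns"] = sim.now


def recv(sim, consume_ns, credit_ns, blocking, log):
    """Stand-in for the receive handler, in blocking or forked form."""
    start = sim.now
    yield consume_ns  # time spent consuming the slot
    if blocking:
        # Delegate: recv does not finish until the credit has been sent.
        yield from credit_send(sim, credit_ns, log)
    else:
        # Fork: the credit travels in the background and recv returns at once.
        sim.spawn(credit_send(sim, credit_ns, log))
    log["pe_exec_ns"] = sim.now - start


def run_case(blocking):
    sim, log = MiniSim(), {}
    sim.spawn(recv(sim, 100.0, 12.0, blocking, log))
    sim.run()
    return log


blocking_log = run_case(blocking=True)
forked_log = run_case(blocking=False)
print(blocking_log, forked_log)
```

In both cases the credit arrives at 112 ns, but the receiver's `pe_exec_ns` is 112 ns when blocking and 100 ns when forked. The sender-side view of the credit is unchanged; only the receiver's accounting moves.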