# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication

## Status

Accepted

## Context

### Goal

Add the infrastructure that lets CCL (Collective Communication Library) kernels run **inside** a PE. The host just launches a kernel on each SIP; the actual synchronization and data movement happen **inside the PE kernel via an IPCQ (Inter-Process Communication Queue)**. This mirrors how NCCL performs NVLink communication inside a GPU kernel, or how Cerebras / Tenstorrent expose core-local communication queues.

Host-level collectives (`dist.all_reduce`) are deferred to **future work**; this ADR focuses solely on the kernel-side collective infrastructure.

### Problems to solve

1. PE-to-PE direct data movement (writing into a peer's memory).
2. Synchronization — the sender must check that the receiver has space in its buffer (backpressure).
3. Resource contention between compute traffic and communication traffic (Head-of-Line blocking).
4. The host must be able to construct logical neighbor topologies (ring / mesh / tree) per algorithm.

---

## Decision

### D1. Add a new `PE_IPCQ` component

A new component `PE_IPCQ` is added inside each PE. It follows the same pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a distinct component.

```
PE
├── PE_CPU
├── PE_SCHEDULER
├── PE_DMA
├── PE_IPCQ          ← new
├── PE_FETCH_STORE
├── PE_GEMM
├── PE_MATH
├── PE_TCM
├── PE_MMU
```

**Role separation** (control plane vs. data plane):

- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head / tail pointer management, peer pointer caches, backpressure, 4-direction neighbor mapping.
- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe / PCIE into the peer's memory.

PE_IPCQ does **not** move data itself — it delegates to PE_DMA.

### D2. Ring buffer model

Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.

```python
from dataclasses import dataclass


@dataclass
class IpcqQueuePair:
    direction: Direction        # N/S/E/W
    peer: IpcqEndpoint          # set by host at init time (D2.5)
    tx_buffer_base: int         # outgoing data base addr (in our memory)
    rx_buffer_base: int         # incoming data base addr (in our memory)
    slot_size: int              # 1 tile per slot
    n_slots: int                # ring depth
    my_head: int                # next slot we will write/send into
    my_tail: int                # next slot we will read/recv from
    peer_head_cache: int        # peer's last-seen head (updated via D9 piggyback)
    peer_tail_cache: int        # peer's last-seen tail (updated via D9 fast-path credit)
```

**Canonical field names**: throughout this ADR the four names above (`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`, etc.) are not used.

| Field | Owner | Updated when |
|-------|-------|--------------|
| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |

**Slot unit**: fixed-size, one slot holds one full tile (no descriptor indirection). Full data embedded in the slot. See D5.

### D2.5. `IpcqEndpoint` schema

`IpcqQueuePair.peer` carries everything the sender needs to compute the peer's rx slot address:

```python
@dataclass(frozen=True)
class IpcqEndpoint:
    sip: int
    cube: int
    pe: int
    buffer_kind: str    # "tcm" | "hbm" | "sram"
    rx_base_pa: int     # peer rx_buffer base PA (PhysAddr.encode())
    rx_base_va: int     # peer rx_buffer base VA (optional, MMU mode)
    n_slots: int        # peer ring depth (for wrap-around)
    slot_size: int      # peer slot size (for offset)
```

Address computation:

```python
slot_idx = self.my_head % peer.n_slots
dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
```

PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`.
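
The address computation above, combined with the backpressure test that D5 describes, can be exercised in isolation. The following is a minimal sketch, not the component's code: `RingState` and its method names are hypothetical, and it keeps only the sender-side fields of `IpcqQueuePair` and `IpcqEndpoint`.

```python
from dataclasses import dataclass


@dataclass
class RingState:
    """Hypothetical stand-in for the sender-side view of one direction."""
    rx_base_pa: int            # peer rx buffer base (IpcqEndpoint.rx_base_pa)
    n_slots: int               # peer ring depth
    slot_size: int             # peer slot size in bytes
    my_head: int = 0           # monotonic send counter
    peer_tail_cache: int = 0   # last consumer position seen via credit

    def is_full(self) -> bool:
        # D5 backpressure test: slots in flight must stay below the ring depth.
        return self.my_head - self.peer_tail_cache >= self.n_slots

    def next_dst_pa(self) -> int:
        # D2.5 address computation.
        slot_idx = self.my_head % self.n_slots
        return self.rx_base_pa + slot_idx * self.slot_size


ring = RingState(rx_base_pa=0x1000, n_slots=4, slot_size=4096)

addrs = []
while not ring.is_full():          # send until backpressure kicks in
    addrs.append(ring.next_dst_pa())
    ring.my_head += 1

ring.peer_tail_cache = 1           # one credit arrives from the peer
wrapped = ring.next_dst_pa()       # the fifth send wraps back to slot 0
```

Because `my_head` and `peer_tail_cache` are monotonic counters rather than wrapped indices, the full test is a plain subtraction and the modulo is applied only when forming the address.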
PE_DMA (vc_comm) routes the data to `dst_pa` through the fabric.

**Endpoint construction order**: at backend init (D10), the IPCQ buffers for **every PE** are allocated first (so each rank knows the others' PA), then the per-rank neighbor tables are built and pushed to PE_IPCQ via `IpcqInitMsg`.

### D3. Four-direction mapping ≡ logical ProcessGroup

The PE views four directions (N/S/E/W) as logical ports. Real peer addresses are configured by the host CCL init, per the chosen algorithm. The PE kernel never knows the topology, only directions.

```python
# 1D ring
for rank in range(world_size):
    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])

# 2D mesh (edges wrap around, as the modulo shows)
for r in range(R):
    for c in range(C):
        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
```

The PE code does not need to know where `tl.send(dir="E", ...)` actually ends up.

### D4. PE kernel API

```python
# Send (blocking; may stall on backpressure)
tl.send(dir: str, src=TensorHandle)
tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)

# Recv (blocking)
recv = tl.recv(dir: str, shape=..., dtype=...)
recv = tl.recv(shape=..., dtype=...)   # round-robin across 4 directions

# Recv (non-blocking)
fut = tl.recv_async(dir: str, shape=..., dtype=...)
recv = tl.wait(fut)
```

`tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each call rotates through directions, returning the first available slot. Empty in all 4 directions → wait.

**Fairness is weak**: the rotating start mitigates simple bias, but if one direction always wins the race the others can starve. Algorithms that need strict fairness must call `tl.recv(dir=...)` explicitly.

### D5. Single-hop DMA write + full-data slot model

Data moves from sender memory into the receiver's ring slot in **one DMA transfer**. Key properties:

- **Single-hop**: the sender already knows the peer rx slot address and fires one fabric DMA into it.
- **No CPU memcpy**: the CPU never copies data.
- **No intermediate staging**: neither side keeps a separate staging buffer (sender uses the source addr directly; receiver gets the data in its ring slot directly).

(Strictly speaking the fabric DMA write does happen, so this is not literally "no data movement" — it's the same property NCCL labels "zero-copy", meaning no CPU memcpy and no staging copy.)

```
PE A: tl.send(E, src_addr, nbytes)
  1. IPCQ computes the peer rx slot address:
     dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
  2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
     (full → sleep / poll)
  3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
  4. my_head += 1

PE B: data = tl.recv(W)
  1. Look at rx_buffer[my_tail % n_slots]
  2. Wait for the data to arrive (D7 backpressure mode)
  3. Return the slot address to the kernel (or fetch into register file)
  4. my_tail += 1
  5. Issue a credit return on the fast path (D9): once the credit path
     latency has elapsed, peer A's peer_tail_cache is updated.
```

The slot holds the full tile. The receiver only reads its own rx_buffer; it never reads back into A's memory. The sender knows the peer rx slot address and DMAs directly into it (single-hop). The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local to the PE).

### D6. Buffer placement — three-way benchmark

The host CCL init picks the IPCQ ring-buffer location:

```python
ipcq_init(
    backend="ahbm",
    buffer_kind="tcm",   # one of "tcm" | "hbm" | "sram"
    n_slots=8,
    slot_size=4096,
)
```

| Location | Trait | Trade-off |
|----------|-------|-----------|
| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
| **PE-local HBM** | Large; via DMA | Higher latency |
| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |

All three locations run the same kernel code; only the init differs.

### D7. Backpressure — two-mode benchmark

How the sender or receiver waits when peer slots are full / data not yet arrived:

| Mode | Behavior | Model |
|------|----------|-------|
| **poll** | Periodically re-check the cached peer pointer | Spin loop |
| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |

```python
ipcq_init(backpressure="sleep")   # or "poll"; remaining arguments as in D6
```

Both modes are implemented so latency / throughput trade-offs can be benchmarked.

### D8. PE_DMA virtual channels

Extend PE_DMA from a single queue into a **two-channel virtual-channel** model.

```
PE_DMA
├── vc_compute: tile load / store / writeback for GEMM and Math
└── vc_comm:    IPCQ send data
```

Each VC has an independent state machine:

- One channel stalling does not block the other.
- The same physical link (cube_noc, UCIe, …) is shared, but link BW is split between channels.

**Chunk-level interleave**:

- Large GEMM tile DMAs do not lock the link end-to-end.
- Progress happens in chunks (e.g. 256 B); each chunk shares link BW with the other VC's pending chunks.
- Chunk size is an init parameter (smaller = fairer, larger = more efficient).

Net effect:

- HoL blocking is eliminated (an IPCQ send can interleave with a long compute DMA).
- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM pattern).
- Matches the NoC-virtual-channel pattern used in real HW.
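
To make the chunk-level interleave concrete, here is a minimal sketch of two virtual channels sharing one link. It assumes a deterministic weighted round-robin over fixed-size chunks with equal weights; the function name `interleave` and its parameters are hypothetical and do not come from the PE_DMA implementation.

```python
from collections import deque


def interleave(compute_bytes, comm_bytes, chunk=256, weights=(1, 1)):
    """Return the order in which chunks from the two VCs occupy the link.

    Hypothetical helper: each round grants every VC up to `weight` chunks.
    """
    def n_chunks(nbytes):
        return (nbytes + chunk - 1) // chunk   # round up to whole chunks

    queues = {
        "vc_compute": deque(range(n_chunks(compute_bytes))),
        "vc_comm": deque(range(n_chunks(comm_bytes))),
    }
    quota = dict(zip(queues, weights))

    order = []
    while any(queues.values()):                # an empty deque is falsy
        for vc, pending in queues.items():
            for _ in range(quota[vc]):
                if pending:
                    pending.popleft()
                    order.append(vc)
    return order


# A 1 KiB compute DMA competing with a 512 B IPCQ send.
order = interleave(compute_bytes=1024, comm_bytes=512)
comm_done_at = len(order) - order[::-1].index("vc_comm")   # chunks elapsed
```

With these numbers the IPCQ send finishes after four chunk times instead of waiting behind all four compute chunks, which is the Head-of-Line relief the section describes.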

**First-implementation accuracy limit (intentional)**: this ADR's first cut uses **deterministic chunk-level interleave + weighted round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`). This is a first-order approximation and is simpler than real HW dynamic-contention / credit-based arbiters. Functional correctness is unaffected, but heavy-contention scenarios may report slightly optimistic latency vs. real HW. A separate ADR can add a NoC arbiter component later if more precision is needed.

#### Token routing

- Compute tokens (`TileToken`) — go through the existing PE_FETCH_STORE → PE_DMA chain.
- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA self-routing.
- PE_DMA picks the channel by token type.

```python
class PeDmaComponent:
    def _process(self, env, token):
        # Route by token type: IPCQ traffic uses vc_comm, everything else vc_compute.
        if isinstance(token, IpcqDmaToken):
            yield from self._vc_comm_process(env, token)
        else:
            yield from self._vc_compute_process(env, token)
```

### D9. Pointer synchronization — DMA payload piggyback

Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so pointers update along with the data. This simulation adopts the same model: **no separate control channel** — metadata travels with the data.

The big benefits:

- **Automatic ordering**: data and metadata move on the same token, so data is visible **no later than** the head_cache update. No race.
- **HW fidelity**: matches NVLink / UCIe piggybacked headers.
- **Component simplification**: no separate `IpcqPtrUpdate` event type.

#### Send flow (head update via piggyback)

```
PE A: tl.send(E, src_addr, nbytes)
  1. PE_IPCQ checks backpressure (using peer_tail_cache)
  2. PE_IPCQ creates an IpcqDmaToken:
     - data body (src_addr → peer dst_addr)
     - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
  3. Hand the token to PE_DMA(vc_comm)
  4. PE A increments my_head (send tracking)

  [fabric DMA: latency elapses]

PE B's PE_DMA receives the token
  5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
  6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)

PE B's PE_IPCQ receives the metadata
  7. Updates peer_head_cache (= A's head)
  8. Wakes any pending recv on that direction
```

**Steps 5 and 6 must execute in the same SimPy step** — DMA completion makes data and metadata atomically visible.

#### Recv flow (credit return — fast path with bottleneck-BW latency)

When the receiver frees a slot, the sender must learn about it (backpressure release). Unlike data, the credit return does **not** travel through general vc_comm fabric — it uses a **separate fast path**, an abstraction of the NVLink / UCIe credit-return wire.

**Latency** is computed from the **full path latency** (per-node overhead + edge propagation + drain), not a magic constant:

```
credit_size_bytes = 16   (ccl.yaml: ipcq_credit_size_bytes)
path    = router.find_path(self_pe, peer_pe.pe_dma)
latency = compute_path_latency_ns(path, credit_size_bytes)
        = sum(edge.distance_mm * ns_per_mm)
        + sum(node_overhead_ns[n] for n in path)
        + credit_size_bytes / bottleneck_bw_on_path
```

The router auto-appends `.pe_dma` to the source only, so the destination MUST be spelled with the explicit `.pe_dma` suffix; otherwise `find_path` raises and the credit silently teleports at zero cost (a latent bug fixed alongside this update).

`tl.recv` blocks on the credit-emit completion (recv yields-from `_delayed_credit_send` rather than spawning it as a fork). This puts the credit-return cost on the receiver's `pe_exec_ns`, modeling the IPCQ control-plane completing the consume-acknowledgement before recv returns to the kernel — the protocol equivalent of a non-posted `tl.store` waiting for an HBM ack on the raw DMA path.

That gives us:

- **Topology-proportional approximation**: an in-cube credit return is automatically faster than a cross-SIP credit return.
- **No magic constants**: every nanosecond comes from `compute_path_latency_ns` on the same edge_map and `node_overhead_ns` as data traffic.
- **No deadlock risk**: unlike piggyback, B can issue credit even when it has no data to send back. `peer_credit_store.put` is unbounded.
- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit cost on recv balances the HBM ack-trip cost that raw DMA pays on the sender.

#### Component coupling — SimPy Store channel

PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init time, **a SimPy Store is wired between the two** (a per-direction fast-path channel) and credit metadata is `put` into that store.

```python
class PeIpcqComponent:
    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
        # Model the credit path latency, then deliver the credit to the peer.
        yield env.timeout(latency_ns)
        # Remaining fields (source PE coords, source direction) are listed in D12.
        yield peer_credit_store.put(IpcqCreditMetadata(consumer_seq=my_tail))
```

Backend init wires both directions of the fast-path channel as part of fan-out (see `IpcqInitMsg` in D12).

#### Credit-return fast path limitations

- `credit_size_bytes` is an estimate (typically 16–64 bytes).
- The fast path is **excluded from vc_comm BW contention** (separate wire). Real HW credit-return wires are very lightweight, so this is a reasonable first approximation.
- A follow-up ADR can: model the credit fast path as a separate link (BW limit + contention), or switch to piggyback (`credit_return_mode: piggyback`).

#### PE_DMA's added responsibility

When `vc_comm` receives a token, PE_DMA processes it as the following sequence: pay the Transaction's terminal BW drain, then atomically write data and forward metadata. **No SimPy yield is allowed between the data write and the metadata forward** (invariant I6). The drain yield must sit before the atomic block, not inside it:

```python
def _on_vc_comm_recv(self, env, txn):
    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
    # sender PE_DMA). MUST happen before the atomic block so recv only
    # wakes after the bytes have "landed".
    drain = getattr(txn, "drain_ns", 0.0)
    if drain > 0:
        yield env.timeout(drain)

    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    # 1. Snapshot the source data and write it into the rx slot
    data = self._memory_store.read(token.src_space, token.src_addr,
                                   shape=..., dtype=...)
    self._memory_store.write(token.dst_endpoint.buffer_kind,
                             token.dst_addr, data)
    # 2. Forward metadata to the local PE_IPCQ
    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
    # ───────────────────────────────────────────────────
```

The final `put` is yieldable but uses an unbounded internal store, so it completes in a single step. That `put` is the closing call of the atomic block; nothing may be inserted between the data write and it.

#### Drain-at-inbound semantics (D9 timing model)

The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path` stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns` is paid at each forwarding component via `run()`, and the remaining BW drain is paid once at the Transaction's terminal.

Every non-IPCQ Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via `ComponentBase._forward_txn` at the terminal node. For IPCQ the destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound` (so IPCQ-specific data write + metadata forward can happen), so **the drain MUST be paid explicitly at the top of that handler** to keep IPCQ's timing model on par with every other fabric Transaction.

Side-effects of paying drain here:

- **SRC `tl.send`** is unchanged — fire-and-forget semantics are preserved because the sender PE_DMA does not `yield sub_done`. The `sub_done.succeed()` call (made after metadata forward below) is an event with no listener on the sender side.
- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata forward now happens after the drain, recv observes the full fabric transfer time including bandwidth cost.

This matches the physical picture: send dispatches and leaves; recv waits until the bytes have actually been drained into its inbox.

### D9.5. ADR-0020 (2-pass) integration

`tl.send` / `tl.recv` integrate with ADR-0020's two-pass model. Phase 1 simulates timing **and** moves data via MemoryStore; Phase 2 enables op-log-based correctness verification.

#### Phase 1 (timing + data)

D9 models head and tail updates with two different mechanisms:

- **Send-side (head update)** — DMA payload piggyback. Data write and metadata forward happen in the same SimPy step → automatic atomic visibility.
- **Recv-side (tail credit return)** — fast-path SimPy Store channel with the credit path latency from D9, then `peer_tail_cache` update.

Together they preserve ring-buffer pointer consistency.

The op-log records `op_kind="ipcq"` entries for sends (with `src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with `recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).

Two recv modes:

- **`return_slot`** (default): the slot address is returned to the kernel. Zero-copy.
- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`, PE_IPCQ copies the slot data into the user dst.

#### Phase 2 (op_log replay)

When `DataExecutor` encounters an `op_kind="ipcq"` record:

- **send**: idempotent `src → dst` ndarray write.
- **recv (`return_slot`)**: no-op (the slot already holds the data).
- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.

IPCQ ops are pure data movement — Phase 2 has nothing extra to compute. The downstream GEMM / Math ops in `DataExecutor` will consume the data and naturally validate correctness.

### D10. Host CCL init keeps the PyTorch shape

The host code looks just like real PyTorch DDP. `init_process_group` creates the backend object; it does **not** receive IPCQ knobs (neighbor topology, buffer_kind, backpressure …).

```python
# benches/ccl_allreduce.py — same shape as real PyTorch
def worker(rank, world_size, torch):
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")  # reads ccl.yaml + topology

    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
    tensor.copy_(torch.from_numpy(init))
    dist.all_reduce(tensor, op="sum")
```

The IPCQ configuration is decided by the backend at `init_process_group` time: it loads `ccl.yaml`, picks the algorithm, and pushes IPCQ neighbor tables to every participating PE_IPCQ. The host code never has to know about IPCQ.

A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`. Switching algorithms is purely a `ccl.yaml` change — no host edits required.

#### Init flow (eager)

1. `init_process_group(backend="ahbm")` is called.
2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
3. Pulls topology + buffer_kind + backpressure + slot config from `algorithms[]`.
4. **Immediately** installs neighbor tables on every PE_IPCQ (sideband or fabric `IpcqInitMsg`).
5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally — PE_IPCQ is already prepared whether the kernel is a CCL kernel or not.

### D11. CCL config file (`ccl.yaml`)

IPCQ config and algorithm metadata live in a separate YAML file, following the same pattern as `components.yaml` and `topology.yaml`.

A single benchmark execution runs one algorithm (`defaults.algorithm`). Switching algorithms means editing `defaults.algorithm` only.

```yaml
defaults:
  algorithm: ring_allreduce_tcm
  buffer_kind: tcm              # tcm | hbm | sram
  backpressure: sleep           # poll | sleep
  n_slots: 8
  slot_size: 4096
  vc_chunk_size: 256
  ipcq_credit_size_bytes: 16

algorithms:
  ring_allreduce_tcm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d           # builtin name or "custom"
    buffer_kind: tcm
    n_elem: 8                   # optional, per-algorithm tile width
  tree_allreduce_7:
    module: kernbench.ccl.algorithms.tree_allreduce
    topology: tree_binary
    buffer_kind: tcm
    world_size: 7               # algorithm-level override
    n_elem: 16
  custom_mesh:
    module: kernbench.ccl.algorithms.custom_mesh
    topology: custom            # the module supplies its own neighbors()
```

`world_size` is **not set in `defaults`**. The backend resolves it via: `algorithm-level override > defaults override > topology spec`. The last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP where `WORLD_SIZE` comes from env vars rather than config files.

#### Algorithm module structure

Each algorithm module exports two hooks — `kernel` (required) and `neighbors` (optional) — plus a `kernel_args` helper that the backend uses to populate positional kernel arguments at `all_reduce` time:

```python
# src/kernbench/ccl/algorithms/ring_allreduce.py

def kernel_args(world_size: int, n_elem: int) -> tuple:
    return (n_elem, world_size)


def kernel(t_ptr, n_elem, world_size, tl):
    """Required — the PE kernel.

    IPCQ is already installed by the backend before this is called.
    The kernel only uses the four-direction send / recv API.
    """
    ...


def neighbors(rank, world_size, neighbor_map):
    """Optional — override the builtin topology's neighbor map.

    Returns a new dict, the modified-in-place dict, or None to keep
    the builtin map.
    """
    return None
```

#### `neighbors` override patterns

- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
- **Pattern B — replace entirely**: ignore `neighbor_map` and return a brand-new dict.
- **Pattern C — keep builtin**: omit `neighbors` or return None.
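
As an illustration of Pattern A, the sketch below turns the builtin bidirectional ring into an open chain by removing the wrap-around link. It assumes `neighbor_map` is a dict keyed by direction name whose values identify the peer; that layout is a guess made for this example and is not confirmed by the hook's docstring. Both ends of the removed link are dropped so the result still satisfies the endpoint-consistency invariant (I3).

```python
def neighbors(rank, world_size, neighbor_map):
    """Pattern A sketch: cut the ring between the last rank and rank 0.

    Assumes neighbor_map looks like {"E": peer, "W": peer} (hypothetical layout).
    """
    trimmed = dict(neighbor_map)       # copy; leave the builtin map untouched
    if rank == 0:
        trimmed.pop("W", None)         # rank 0 loses its westward peer
    if rank == world_size - 1:
        trimmed.pop("E", None)         # last rank loses its eastward peer
    return trimmed


# Builtin ring_1d maps for a 4-rank group, written out by hand for the example.
first = neighbors(0, 4, {"E": 1, "W": 3})
middle = neighbors(1, 4, {"E": 2, "W": 0})
last = neighbors(3, 4, {"E": 0, "W": 2})
```

Returning a copy rather than mutating the argument keeps the override free of side effects, although the hook's contract also allows in-place modification.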

#### Builtin topologies

| topology | direction set |
|----------|---------------|
| `ring_1d` | E, W |
| `ring_1d_unidir` | E only |
| `mesh_2d` | N, S, E, W |
| `tree_binary` | parent, child_left, child_right |
| `none` | (empty) — algorithm must supply `neighbors()` |

#### Adding a new algorithm

1. Write `kernel` and `kernel_args` in `src/kernbench/ccl/algorithms/.py`.
2. Add an entry in `ccl.yaml`'s `algorithms` section.
3. (Optional) provide `neighbors()` for custom topology.
4. Set `defaults.algorithm` to the new algorithm.

The host bench (`benches/ccl_allreduce.py`) does not change.

### D12. Message / token schema

The new message types added by this ADR. They live in `src/kernbench/common/pe_commands.py` and `src/kernbench/runtime_api/kernel.py`.

#### `IpcqInitMsg` (sideband, fan-out at init)

The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors `MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`). Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`, `my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store` field — a `simpy.Store` instance pre-wired so the PE_IPCQ that emits a credit (the data receiver) can push `IpcqCreditMetadata` directly into the peer PE_IPCQ's input queue.

#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)

Carries `direction`, source addr/space, nbytes, shape, dtype, and a handle id. `data_op=True` so it lands in the op_log.

#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)

Carries `direction` (or None for round-robin), `recv_mode` (`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape, dtype, blocking flag.

#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)

Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`) plus the head metadata (`sender_seq`, `src_sip/cube/pe`, `src_direction`). PE_DMA picks the channel by token type (`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`).
The receiver's PE_DMA, on token arrival, performs the I6 atomic sequence: write data into MemoryStore, then forward `IpcqMetaArrival` to the local PE_IPCQ.

#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)

Carries `consumer_seq` (= my_tail), source PE coords, and source direction. Travels through the dedicated SimPy Store channel rather than `vc_comm`. Latency is the full path latency defined in D9 (`compute_path_latency_ns(path, credit_size_bytes)`), of which `credit_size_bytes / bottleneck_bw_on_path` is only the drain term.

There is **no `IpcqPtrUpdate` event** — head updates flow via D9 piggyback, tail updates via the D9 fast-path channel.

### D13. Test strategy

Test plan:

#### T1. Unit tests (component-level)

- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure immediately forwards a token; full peer slot triggers backpressure (poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`; round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`.
- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute` / `vc_comm` independent progress, chunk interleave, BW split.
- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d / mesh_2d / tree_binary correctness, mesh_2d non-square → `ValueError`, custom resolver returns the module's `neighbors`.

#### T2. Integration tests (E2E send/recv)

- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional no-deadlock), 4×4 mesh.
- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode records `ipcq` ops in op_log; DataExecutor produces correct `out.data`.

#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)

`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA consistency, per-`buffer_kind` allocation.

#### T4. Regression

All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for non-CCL benches.

#### T5. Performance / overhead

Single send/recv pair latency = (DMA latency) + (IPCQ overhead). Should be close to a regular PE_DMA write of the same nbytes (IPCQ overhead < 100 ns).

### D14. Invariants and failure modes

#### Invariants

I1. **Slot lifecycle exactly-once**: one send → exactly one recv.

I2. **Pointer monotonicity**: `my_head` / `my_tail` are monotonically non-decreasing; `sender_seq` is strictly increasing.

I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank B, then rank B's reverse-direction peer must be rank A. Verified at init.

I4. **`buffer_kind` consistency**: all PEs in a process group share the same `buffer_kind` (no mixed mode in the first cut).

I5. **op_log ordering**: send → DMA complete → recv possible. The t_start order in op_log respects this causality.

I6. **Atomic data + metadata visibility (MUST)**: at the receiver side, data write (`MemoryStore.write`) and metadata forward (`peer_head_cache` update) **must execute in the same SimPy step**. No yield is allowed between the two operations in PE_DMA's vc_comm handler. Code review must reject any inserted `yield` (or `yield from`) — it would create a race in which the head_cache update and the data become visible in different steps.

I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6, the step in which `peer_head_cache > my_tail` becomes true is the same step in which the slot data is observable.

#### Failure modes (runtime errors)

F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction → `IpcqInvalidDirection`, simulation aborts.

F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched send and recv. Not validated by default; opt-in strict mode catches it (`strict_validation: true` on a PE_IPCQ node attrs).

F3. **Deadlock detection (timeout-based)**: the simulator empties its schedule while a send/recv is still pending → engine raises `IpcqDeadlock` and embeds a pointer dump.

F4. **Backend init failure**: missing `defaults.algorithm`, missing `algorithms[name]`, module import failure, topology validation failure (I3, I4) — all raised at `init_process_group` time.

F5. **Slot full + infinite backpressure**: the peer never recvs. Surfaces as F3 timeout.

#### Diagnostics

- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as `(rank, t, dir, nbytes)`.
- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)` prints every PE_IPCQ ring buffer's `my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`.
- **Deadlock dump**: on hang the engine includes the pointer dump in the `IpcqDeadlock` exception message.

### D15. Algorithm-author cheat sheet

Full step-by-step lives in [`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The shortest version:

| Things you touch | Things you don't |
|------------------|-------------------|
| `src/kernbench/ccl/algorithms/.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
| (Optional) `tests/test_.py` mock test | PE_IPCQ component, AhbmCCLBackend |

5-step flow: write the kernel → register in `ccl.yaml` → optional `neighbors` override → optional mock unit test → SimPy validation via `kernbench run --bench ccl_allreduce --verify-data`.

Common mistakes: using a direction that wasn't installed, sends without matching recvs (deadlock), dtype/shape disagreement, assuming fairness from `tl.recv()` round-robin, confusing `tl.num_programs(axis)` with the CCL group size.

---

## HW Realization Notes (Informative)

**Status of this section**: Forward-looking. Describes how the simulator contract (D1–D15) would map to silicon. Not currently implemented; subject to revision before tapeout. The simulator implements the contract via Python/SimPy equivalents in [pe_ipcq.py](../../src/kernbench/components/builtin/pe_ipcq.py) and [pe_dma.py](../../src/kernbench/components/builtin/pe_dma.py).

### D16. Proposed HW block diagram and end-to-end dataflow

![PE Baseline Architecture](../diagrams/pe_baseline.png)

> Source: [`../diagrams/pe_baseline.d2`](../diagrams/pe_baseline.d2) — `d2 --layout=elk --scale 1.5`.

![PE Proposed Architecture](../diagrams/pe_proposed.png)

> Source: [`../diagrams/pe_proposed.d2`](../diagrams/pe_proposed.d2) — `d2 --layout=elk`.

**Baseline → Proposed key changes**:

- Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW)
- PE_IPCQ (SimPy component) → **IPCQ Controller** (HW register + combinational logic)
- **IPCQ Slot Region reserved area** within TCM
- Credit Injector / Receiver connect directly to the NoC via the Fabric Port

#### End-to-end sequence (HW view)

```mermaid
sequenceDiagram
    participant CPU_A as PE_A: PE_CPU
    participant IPCQ_A as PE_A: IPCQ Ctrl
    participant DMA_A as PE_A: DMA
    participant NOC as NoC Fabric
    participant DMA_B as PE_B: DMA
    participant IPCQ_B as PE_B: IPCQ Ctrl
    participant TCM_B as PE_B: TCM
    participant CPU_B as PE_B: PE_CPU

    Note over CPU_A: tl.send(dir="E", src=0x1000)
    CPU_A->>IPCQ_A: MMIO: send request
    Note over IPCQ_A: Backpressure check:<br/>(head - peer_tail_cache) < n_slots → PASS<br/>Slot addr gen:<br/>dst = peer_rx_base + (head%n) × slot_size
    IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head}
    Note over IPCQ_A: my_head++
    IPCQ_A-->>CPU_A: send returns (fire-and-forget)

    Note over DMA_A: TCM read → snapshot in read buffer<br/>Flit pack: data + {sender_seq, dst_addr}
    DMA_A->>NOC: IPCQ data flit(s)
    Note over NOC: hop latency + BW drain
    NOC->>DMA_B: IPCQ data flit(s)

    Note over DMA_B: Terminal BW drain<br/>Slot write latency
    rect rgb(255, 240, 220)
        Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall
        DMA_B->>TCM_B: write data → slot address
        DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr}
    end
    Note over IPCQ_B: Range match dst_addr → direction "W"<br/>peer_head_cache["W"] = sender_seq + 1
    IPCQ_B-->>CPU_B: recv_wake signal

    Note over CPU_B: tl.recv(dir="W") wakes up
    CPU_B->>IPCQ_B: recv request
    Note over IPCQ_B: peer_head_cache > my_tail → YES<br/>slot_addr = rx_base + (tail%n) × slot_size
    IPCQ_B-->>CPU_B: return slot_addr
    CPU_B->>TCM_B: read data from slot
    Note over IPCQ_B: my_tail++

    IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa}
    Note over NOC: credit traversal (NoC latency)
    NOC->>IPCQ_A: Credit arrival
    Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled)
```

### D17. IPCQ Controller HW Module (NEW)

The hardware control block sitting between PE_CPU and the DMA Engine. Corresponds to the simulator's `PeIpcqComponent`.

#### QPair Register File

Per-direction queue-pair state held in flip-flops. The PE_CPU reads / writes them via MMIO (CSRs); software populates them at init time.

```
Per-direction registers (each 64-bit):
  my_head          — sender write position (monotonic)
  my_tail          — receiver read position (monotonic)
  peer_head_cache  — last known peer head (updated by Meta Extractor)
  peer_tail_cache  — last known peer tail (updated by Credit Receiver)
  rx_base_pa       — this PE's rx buffer base physical address
  peer_rx_base_pa  — peer's rx buffer base physical address
  n_slots          — ring depth (power-of-2 constraint, see D21)
  slot_size        — bytes per slot
  peer_credit_tgt  — peer PE's credit-receive address

Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare)
Total: 8 dirs × 9 regs × 8 B = 576 B of flip-flops
```

#### Slot Address Generator (combinational)

```
Input:  pointer (my_head or my_tail), n_slots, slot_size, base_pa
Output: slot_addr = base_pa + (pointer % n_slots) * slot_size

Implementation:
  n_slots power-of-2   → pointer & (n_slots - 1)  (AND mask, 1 gate)
  slot_size power-of-2 → barrel shift (1 cycle)
  64-bit add           → ripple / Kogge-Stone adder (1 cycle)

Latency: 1–2 combinational cycles
```

#### Backpressure Comparator (combinational)

```
full = (my_head - peer_tail_cache) >= n_slots

Implementation: 64-bit subtract + unsigned compare
Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold
Latency: 1 cycle
```

#### Meta Extractor (inbound datapath sideband)

Wired into the DMA Engine's inbound vc_comm path. Extracts metadata from arriving IPCQ flit headers and updates queue-pair state.
```
Trigger: DMA inbound write completion (same cycle)
Extract: {sender_seq, dst_addr} from flit header
Direction matching (ADR-0025 D2):
  for each dir:
    match = (base_pa[dir] <= dst_addr) &&
            (dst_addr < base_pa[dir] + n_slots[dir] * slot_size[dir])
  8× parallel range comparators + priority encoder
Update:  peer_head_cache[matched_dir] = max(peer_head_cache[matched_dir], sender_seq + 1)
Output:  recv_wake signal → PE_CPU interrupt / flag
Latency: 1 cycle (pipelined with the DMA write; I6 atomicity is intrinsic)
```

#### Credit Injector (outbound)

```
Trigger: recv completion (after my_tail increments)
Action:  pack a 16 B credit packet → DMA vc_comm (or a dedicated credit VC)
Packet:  {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa}
Latency: 1 cycle to generate; then NoC traversal
```

#### Credit Receiver (inbound sideband)

```
Trigger: 16 B credit packet arrival (from NoC)
Extract: {consumer_seq, dst_rx_base_pa}
Direction matching (ADR-0025 D3):
  for each dir:
    match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa)
Update:  peer_tail_cache[matched_dir] = max(peer_tail_cache[matched_dir], consumer_seq)
Output:  send_wake signal → deassert backpressure stall
Latency: 1 cycle
```

### D18. DMA Engine vc_comm IPCQ-aware mode

Add IPCQ-flit handling to the existing vc_comm channel (D8).

**Outbound**:

1. Receive a command from the IPCQ Controller: `{src_addr, dst_addr, nbytes, sender_seq}`.
2. Read `src_addr` from TCM → snapshot into the DMA read buffer (standard DMA behavior).
3. Pack flit: data + piggyback metadata (`sender_seq`, `dst_addr`).
4. Inject into the NoC fabric port.
5. Fire-and-forget (no completion wait).

**Inbound**:

1. Receive an IPCQ flit from the NoC.
2. Charge terminal BW drain (`drain_ns = nbytes / bottleneck_bw`).
3. Charge slot write latency (per backing memory tier).
4. **ATOMIC** (same pipeline stage, no stall insertion):
   - TCM write: data → slot address.
   - Meta Extractor trigger: `sender_seq` + `dst_addr` → IPCQ Controller.
5. Done.
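The D17 controller rules that the outbound and inbound steps above rely on can be sketched in Python. This is a minimal behavioral model, not the simulator's `PeIpcqComponent`: the names `QPairRegs`, `slot_addr`, `is_full`, `on_meta`, `on_credit`, and `deliver_flit` are hypothetical, only a subset of the registers is modeled, and both `n_slots` and `slot_size` are assumed to be powers of two.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class QPairRegs:
    """Subset of the per-direction registers listed in D17 (hypothetical model)."""
    rx_base_pa: int        # this PE's rx buffer base
    peer_rx_base_pa: int   # peer's rx buffer base
    n_slots: int           # ring depth, power of two (D21)
    slot_size: int         # bytes per slot, power of two assumed here
    my_head: int = 0
    my_tail: int = 0
    peer_head_cache: int = 0
    peer_tail_cache: int = 0


def slot_addr(base_pa: int, pointer: int, n_slots: int, slot_size: int) -> int:
    """base_pa + (pointer % n_slots) * slot_size, done with an AND mask and a shift."""
    shift = slot_size.bit_length() - 1          # log2(slot_size) for a power of two
    return base_pa + ((pointer & (n_slots - 1)) << shift)


def is_full(q: QPairRegs) -> bool:
    """Backpressure comparator: the sender stalls while this is True."""
    return (q.my_head - q.peer_tail_cache) >= q.n_slots


def on_meta(qps: Dict[str, QPairRegs], sender_seq: int, dst_addr: int) -> Optional[str]:
    """Meta Extractor: range-match dst_addr to a direction, then advance peer_head_cache."""
    for direction, q in qps.items():
        if q.rx_base_pa <= dst_addr < q.rx_base_pa + q.n_slots * q.slot_size:
            q.peer_head_cache = max(q.peer_head_cache, sender_seq + 1)
            return direction
    return None


def on_credit(qps: Dict[str, QPairRegs], consumer_seq: int, dst_rx_base_pa: int) -> Optional[str]:
    """Credit Receiver: equality-match the peer rx base, then advance peer_tail_cache."""
    for direction, q in qps.items():
        if q.peer_rx_base_pa == dst_rx_base_pa:
            q.peer_tail_cache = max(q.peer_tail_cache, consumer_seq)
            return direction
    return None


def deliver_flit(mem: Dict[int, bytes], qps: Dict[str, QPairRegs],
                 dst_addr: int, data: bytes, sender_seq: int) -> Optional[str]:
    """Inbound step 4: slot write and metadata update happen back to back, with nothing in between."""
    mem[dst_addr] = data
    return on_meta(qps, sender_seq, dst_addr)
```

Using `max` in both update paths means a stale or reordered metadata or credit arrival never moves a cached pointer backwards, which is what the `max(...)` in the D17 update rules expresses.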
**I6 atomicity guaranteed in hardware**: TCM write completion and Meta Extractor trigger occur in the same pipeline stage, so no separate synchronization is needed. The simulator's "no SimPy yield between `MemoryStore.write` and `IpcqMetaArrival` put" (D9, I6) is preserved naturally.

#### Data snapshot semantics

Data latched into the DMA read buffer is unaffected by subsequent writes to `src` memory. This is standard DMA read-then-write behavior; no extra HW is required.

#### Credit virtual channel (optional)

- **Option A**: multiplex credits onto vc_comm (distinguish via 16 B header-only flits).
- **Option B**: add a third dedicated credit VC (strict priority > data).

Option B is friendlier to deadlock prevention, but a 16 B credit's BW impact is negligible, so Option A suffices.

### D19. Fabric flit format extension

```
Generic data flit (e.g. 512-bit):
┌──────────────────────────────────────────┐
│ [511:480] routing header (32b)           │
│ [479:0]   payload (480b = 60 B)          │
└──────────────────────────────────────────┘

IPCQ data flit (only the first flit carries metadata):
┌──────────────────────────────────────────┐
│ [511:480] routing header (32b)           │
│   [511]     ipcq_flag (1b)               │ ← IPCQ vs. normal DMA
│   [510:509] vc_id (2b)                   │
│   [508:480] route + hop count            │
│ [479:416] ipcq_metadata (64b)            │ ← piggyback
│   [479:448] sender_seq (32b)             │
│   [447:416] dst_addr[31:0] (32b)         │ ← used for direction match
│ [415:0]   payload (416b = 52 B)          │
└──────────────────────────────────────────┘

Subsequent flits: full 60 B payload (no metadata).

Credit-only flit (128-bit, header-only):
┌──────────────────────────────────────────┐
│ [127:96] routing header (32b)            │
│   [127]    credit_flag (1b)              │
│ [95:64]  consumer_seq (32b)              │
│ [63:0]   dst_rx_base_pa (64b)            │
└──────────────────────────────────────────┘
```

First-flit payload shrinks from 60 B to 52 B (13 % overhead). For multi-flit transfers the subsequent flits carry full payloads, so overhead < 1 % on large transfers.

### D20. TCM IPCQ slot region layout

```
TCM Memory Map (16 MB):
┌─────────────────────────────┐ 0x000000
│ Kernel Working Memory       │
│ (compute tensors)           │
│ ~14 MB                      │
├─────────────────────────────┤ 0xE00000
│ IPCQ RX Buffers             │
│   Dir N: slots × slot_size  │
│   Dir S: slots × slot_size  │
│   Dir E: slots × slot_size  │
│   Dir W: slots × slot_size  │
│ ~1 MB                       │
├─────────────────────────────┤ 0xF00000
│ IPCQ Metadata / Scratch     │
│ ~1 MB                       │
└─────────────────────────────┘ 0xFFFFFF
```

Place the IPCQ region in the upper TCM bank to minimize bank conflict with compute accesses (see Risk D22).

### D21. 2 nm implementation analysis

#### Area estimate

| Module | Gate count | Area (2 nm est.) | Notes |
|---|---|---|---|
| QPair Register File | ~4.6 K FF | 0.002 mm² | 576 B of flip-flops |
| Slot Addr Gen + Backpressure | ~5 K gates | 0.001 mm² | Combinational |
| Meta Extractor + Credit Logic | ~3 K gates | 0.001 mm² | 8× parallel comparators |
| **IPCQ Controller subtotal** | **~12.6 K** | **~0.004 mm²** | **< 0.1 % of the PE area** |
| DMA vc_comm extension | ~2 K gates | 0.002 mm² | Flit pack / unpack |
| **Total delta** | **~14.6 K** | **~0.006 mm²** | |

#### Timing

| Path | Delay (2 nm est.) | Target clock | Margin |
|---|---|---|---|
| Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× |
| Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× |
| Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× |
| Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× |

All critical paths fit within one cycle. Timing closure is not a concern.

#### Power

- Active: ~1 mW (register R/W + comparators while sending / receiving).
- Idle: leakage only.
- Negligible vs. total PE power.

#### Constraints

| Item | Constraint | Rationale |
|---|---|---|
| `n_slots` | **must be power-of-2** | mod → AND mask (1 gate). Arbitrary values need a divider (~10 cycles). |
| `slot_size` | **power-of-2 recommended** | mul → barrel shift. Arbitrary values need a multiplier. |
| TCM IPCQ region | **dedicated bank** | Prevents bank conflict with compute accesses. |

### D22. Risk assessment

#### TCM bank conflict

- **Risk**: IPCQ slot write and compute read both target the same TCM bank → stall.
- **Mitigation**: place the IPCQ region in a dedicated upper-address bank (D20).
- **Cost**: a small loss of TCM banking flexibility.
- **Severity**: Medium (performance), Low (no correctness issue).

#### Credit return latency under congestion

- **Risk**: NoC congestion → credit-return delay → sender backpressure stall.
- **Mitigation**:
  - Put credits on a separate VC with strict priority (16 B → negligible BW impact).
  - Or pick `n_slots` generously (8+) so credit delay is absorbed by buffer depth.
- **Severity**: Low (16 B credits contribute almost nothing to congestion).

#### Inter-direction ordering

- **Risk**: simultaneous sends from one PE on multiple directions.
- **Mitigation**: per-direction monotonic `sender_seq` suffices. Inter-direction ordering is the kernel's (software's) responsibility, the same as in the simulator model (D2 + D4).
- **Severity**: Low (resolved by design).

### D23. HW alternatives considered

#### Doorbell + polling (traditional)

```
Send: DMA write data → DMA write a doorbell register at the peer → peer polls doorbell
Recv: polling loop on the doorbell, or interrupt-driven
```

| Pros | Cons |
|---|---|
| Simple HW (no IPCQ controller) | Two DMA transactions (data + doorbell) |
| Reuses existing DMA | Needs explicit fence between data and doorbell |
| | Polling burns power; interrupt adds latency |

**Verdict**: 2–3× latency vs. piggyback. **Rejected.**

#### Hardware message queue (NVIDIA NVLink style)

```
Send: CPU → push a descriptor onto HMQ → HW relays it to the peer HMQ
Recv: pop a descriptor from HMQ → use the data pointer
```

| Pros | Cons |
|---|---|
| CPU only writes descriptors | Needs a separate HMQ engine (~0.05 mm²) |
| Descriptor / data separation is flexible | Separate datapath from DMA → area / power overlap |
| | Large tensors still need DMA |

**Verdict**: With CCL's large-tensor pattern, DMA is still required, so HMQ + DMA is a duplicated datapath. **Rejected.**

#### RDMA-style completion queue (CQ)

```
Send: DMA write → CQE auto-posted at the peer
Recv: CQ poll / interrupt → read data location
```

| Pros | Cons |
|---|---|
| Mature InfiniBand / RoCE model | CQ management logic + CQE memory overhead |
| Good multi-tenant isolation | CQE / data ordering needs extra plumbing |
| | Over-engineered for PE-to-PE CCL |

**Verdict**: RDMA CQ is suited to host-facing NICs with multi-tenant isolation. For single-owner PE-to-PE this is needless complexity. **Rejected.**

#### Credit-in-data piggyback (v2 optimization candidate)

In the current design the credit return is a separate 16 B packet. For bidirectional traffic patterns, **the credit can be folded into a reverse-direction data flit**.

```
PE_A →E→ PE_B: data + sender_seq=3
PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4   ← credit folded into data
```

| Pros | Cons |
|---|---|
| Removes the dedicated credit packet → NoC BW savings | Needs fallback for unidirectional patterns |
| Bidirectional allreduce: credit latency → 0 | +8 B in the flit header (negligible) |
| | Slightly more logic complexity |

**Verdict**: A strong optimization. Eliminates the credit packet for bidirectional allreduce; the standalone credit fallback is retained. **Recommended for v2.**

### Open HW questions

- What fraction of TCM may the IPCQ slot region occupy? (Current assumption: ~1 MB / 16 MB = 6.25 %.)
- Dedicated credit VC vs. vc_comm multiplexing? (See D18.)
- Inter-SIP link flit-format compatibility verification.
- Maximum `n_slots`? (8 directions × 8 slots × 64 KB = 4 MB → 25 % of TCM.)

---

## Non-goals

- **Host collective**: a model where `dist.all_reduce` itself moves data on the host side is out of scope. This ADR only covers communication that happens inside the PE kernel.
- **All-reduce algorithms**: ring / tree / etc. live in algorithm modules and can be added without amending this ADR.
- **Reliability / error handling**: link faults, send/recv failure recovery, etc. are out of scope.
- **NoC arbiter precision**: dynamic VC contention is left for a future ADR (see D8).

---

## Open questions

- **VC arbitration accuracy**: the first cut uses deterministic chunk interleave + weighted round-robin; heavy contention may report optimistic latency. A NoC arbiter component can be added later.
- **Credit return BW model**: the fast path is currently outside the fabric BW contention model. It can be modeled as a separate link or switched to piggyback (`credit_return_mode: piggyback`).
- **Ring buffer slot allocation metadata**: whether the host pushes IPCQ buffer metadata via sideband or via a fabric message similar to `MmuMapMsg` is open.
- **VC BW split default**: 50/50 vs. weighted (e.g. 80/20). Exposed in `ccl.yaml`; default value TBD.
- **Direction count**: 4 (N/S/E/W) is fixed in the first cut; 6 (with Up/Down for 3D) or N (variable) is future work.
- **Multi-tile aggregation primitives**: whether `tl.recv_all` or similar is needed for fan-in.
- **Round-robin recv fairness**: the current weak fairness can cause starvation; a strict fairness counter is future work.
- **Deadlock detection precision**: currently timeout-based; a realtime wait-for graph would enable deterministic detection.

---

## Consequences

### Positive

- PE-to-PE direct communication enables CCL kernels to be written.
- Host stays minimal (just `launch`), synchronization happens inside the PE → strong compute / comm overlap.
- VCs eliminate HoL blocking → collective latency is not blocked by compute traffic.
- Buffer placement and backpressure mode are init-time parameters → easy to benchmark.
- Four-direction logical neighbors → host is free to map ring/mesh/tree algorithms.

### Negative

- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE.
- VC arbitration is a first-order approximation; heavy contention scenarios may report slightly optimistic latency vs. real HW (D8).
- Chunk-level interleave makes PE_DMA implementation more complex.
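
The memory-cost item above, and the sizing figures quoted under "Open HW questions", can be checked with a short calculation. This is a sketch only: the helper names `ipcq_footprint_bytes` and `tcm_fraction` are hypothetical, and the 16 MB TCM size is taken from the D20 memory map.

```python
KIB = 1024
MIB = 1024 * KIB
TCM_BYTES = 16 * MIB  # TCM size shown in the D20 memory map


def ipcq_footprint_bytes(n_rings: int, n_slots: int, slot_size: int) -> int:
    """IPCQ memory cost per PE: rings x slots per ring x bytes per slot."""
    return n_rings * n_slots * slot_size


def tcm_fraction(footprint_bytes: int, tcm_bytes: int = TCM_BYTES) -> float:
    """Share of TCM consumed by the IPCQ slot region."""
    return footprint_bytes / tcm_bytes


# Sizing example from "Open HW questions": 8 x 8 slots x 64 KB.
worst_case = ipcq_footprint_bytes(n_rings=8, n_slots=8, slot_size=64 * KIB)
print(worst_case // MIB, "MiB,", f"{tcm_fraction(worst_case):.2%} of TCM")

# Current assumption: a ~1 MB slot region.
print(f"{tcm_fraction(1 * MIB):.2%} of TCM")
```

With these inputs the first case comes to 4 MiB (25 % of TCM) and the second to 6.25 %, matching the figures stated in the open questions.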