# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication

## Status

Proposed

## Context

### Goal

Add the infrastructure that lets CCL (Collective Communication Library) kernels run **inside** a PE. The host just launches a kernel on each SIP; the actual synchronization and data movement happen **inside the PE kernel via an IPCQ (Inter-Process Communication Queue)**. This mirrors how NCCL performs NVLink communication inside a GPU kernel, or how Cerebras / Tenstorrent expose core-local communication queues.

Host-level collectives (`dist.all_reduce`) are deferred to **future work**; this ADR focuses solely on the kernel-side collective infrastructure.

### Current state

- ADR-0021 PE pipeline refactor: each PE is decomposed into components (PE_CPU, PE_SCHEDULER, PE_DMA, PE_FETCH_STORE, PE_GEMM, PE_MATH, PE_TCM, PE_MMU).
- No direct PE-to-PE channel exists today. All data movement goes through PE_DMA → cube_noc / UCIe / PCIE → HBM.
- A pre-ADR host CCL skeleton exists (`dist.init_process_group(backend="ahbm")`, `_run_ccl_bench` running per-rank greenlets concurrently). The collective itself is a stub.

### Problems to solve

1. PE-to-PE direct data movement (writing into a peer's memory).
2. Synchronization — the sender must check that the receiver has space in its buffer (backpressure).
3. Resource contention between compute traffic and communication traffic (Head-of-Line blocking).
4. The host must be able to construct logical neighbor topologies (ring / mesh / tree) per algorithm.

---

## Decision

### D1. Add a new `PE_IPCQ` component

A new component `PE_IPCQ` is added inside each PE. It follows the same pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a distinct component.

```
PE
├── PE_CPU
├── PE_SCHEDULER
├── PE_DMA
├── PE_IPCQ          ← new
├── PE_FETCH_STORE
├── PE_GEMM
├── PE_MATH
├── PE_TCM
└── PE_MMU
```

**Role separation** (control plane vs.
data plane):

- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head / tail pointer management, peer pointer caches, backpressure, 4-direction neighbor mapping.
- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe / PCIE into the peer's memory.

PE_IPCQ does **not** move data itself — it delegates to PE_DMA.

### D2. Ring buffer model

Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.

```python
from dataclasses import dataclass

@dataclass
class IpcqQueuePair:
    direction: Direction      # N/S/E/W
    peer: IpcqEndpoint        # set by host at init time (D2.5)
    tx_buffer_base: int       # outgoing data base addr (in our memory)
    rx_buffer_base: int       # incoming data base addr (in our memory)
    slot_size: int            # 1 tile per slot
    n_slots: int              # ring depth
    my_head: int              # next slot we will write/send into
    my_tail: int              # next slot we will read/recv from
    peer_head_cache: int      # peer's last-seen head (updated via D9 piggyback)
    peer_tail_cache: int      # peer's last-seen tail (updated via D9 fast-path credit)
```

**Canonical field names**: throughout this ADR the four names above (`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`, etc.) are not used.

| Field | Owner | Updated when |
|-------|-------|--------------|
| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |

**Slot unit**: fixed-size; one slot holds one full tile (no descriptor indirection). The full data is embedded in the slot. See D5.

### D2.5. `IpcqEndpoint` schema

`IpcqQueuePair.peer` carries everything the sender needs to compute the peer's rx slot address:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IpcqEndpoint:
    sip: int
    cube: int
    pe: int
    buffer_kind: str   # "tcm" | "hbm" | "sram"
    rx_base_pa: int    # peer rx_buffer base PA (PhysAddr.encode())
    rx_base_va: int    # peer rx_buffer base VA (optional, MMU mode)
    n_slots: int       # peer ring depth (for wrap-around)
    slot_size: int     # peer slot size (for offset)
```

Address computation:

```python
slot_idx = self.my_head % peer.n_slots
dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
```

PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA (vc_comm) routes the data to `dst_pa` through the fabric.

**Endpoint construction order**: at backend init (D10), the IPCQ buffers for **every PE** are allocated first (so each rank knows the others' PA), then the per-rank neighbor tables are built and pushed to PE_IPCQ via `IpcqInitMsg`.

### D3. Four-direction mapping ≡ logical ProcessGroup

The PE views four directions (N/S/E/W) as logical ports. Real peer addresses are configured by the host CCL init, per the chosen algorithm. The PE kernel never knows the topology, only directions.

```python
# 1D ring
for rank in range(world_size):
    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])

# 2D mesh
for r in range(R):
    for c in range(C):
        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
```

The PE code does not need to know where `tl.send(dir="E", ...)` actually ends up.

### D4. PE kernel API

```python
# Send (blocking; may stall on backpressure)
tl.send(dir: str, src=TensorHandle)
tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)

# Recv (blocking)
recv = tl.recv(dir: str, shape=..., dtype=...)
recv = tl.recv(shape=..., dtype=...)  # round-robin across 4 directions

# Recv (non-blocking)
fut = tl.recv_async(dir: str, shape=..., dtype=...)
recv = tl.wait(fut)
```

`tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each call rotates through directions, returning the first available slot. Empty in all 4 directions → wait.

**Fairness is weak**: the rotating start mitigates simple bias, but if one direction always wins the race the others can starve. Algorithms that need strict fairness must call `tl.recv(dir=...)` explicitly.

### D5. Single-hop DMA write + full-data slot model

Data moves from sender memory into the receiver's ring slot in **one DMA transfer**. Key properties:

- **Single-hop**: the sender already knows the peer rx slot address and fires one fabric DMA into it.
- **No CPU memcpy**: the CPU never copies data.
- **No intermediate staging**: neither side keeps a separate staging buffer (sender uses the source addr directly; receiver gets the data in its ring slot directly).

(Strictly speaking the fabric DMA write does happen, so this is not literally "no data movement" — it's the same property NCCL labels "zero-copy", meaning no CPU memcpy and no staging copy.)

```
PE A: tl.send(E, src_addr, nbytes)
  1. IPCQ computes the peer rx slot address:
     dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
  2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
     (full → sleep / poll)
  3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
  4. my_head += 1

PE B: data = tl.recv(W)
  1. Look at rx_buffer[my_tail % n_slots]
  2. Wait for the data to arrive (D7 backpressure mode)
  3. Return the slot address to the kernel (or fetch into register file)
  4. my_tail += 1
  5. Issue a credit-return fast path (D9): after the bottleneck-BW latency
     the peer A's peer_tail_cache is updated.
```

The slot holds the full tile. The receiver only reads its own rx_buffer; it never reads back into A's memory.
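
The send-side arithmetic in the flow above (slot address, backpressure test, head increment) can be sketched in plain Python. This is an illustration under stated assumptions, not the PE_IPCQ implementation: `RingState` and its methods are hypothetical names, the fields mirror D2, and the DMA submission is reduced to returning the destination address.

```python
from dataclasses import dataclass


@dataclass
class RingState:
    """Hypothetical sender-side view of one direction's ring (fields mirror D2)."""
    rx_base_pa: int            # peer rx_buffer base address
    n_slots: int               # peer ring depth
    slot_size: int             # bytes per slot
    my_head: int = 0           # number of sends issued so far
    peer_tail_cache: int = 0   # last tail value learned from the peer

    def can_send(self) -> bool:
        # The ring is full once n_slots sends are outstanding.
        return self.my_head - self.peer_tail_cache < self.n_slots

    def next_dst_addr(self) -> int:
        # The head wraps around the peer's ring depth.
        return self.rx_base_pa + (self.my_head % self.n_slots) * self.slot_size

    def send(self) -> int:
        """Return the destination address for one send, then advance the head."""
        if not self.can_send():
            raise RuntimeError("ring full: caller must poll or sleep (D7)")
        dst = self.next_dst_addr()
        self.my_head += 1
        return dst

    def on_credit(self, consumer_seq: int) -> None:
        # Never move the cached tail backwards.
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)


ring = RingState(rx_base_pa=0x1000, n_slots=2, slot_size=4096)
first = ring.send()             # lands in slot 0
second = ring.send()            # lands in slot 1
blocked = not ring.can_send()   # both slots are now outstanding
ring.on_credit(1)               # the peer consumed one slot
third = ring.send()             # wraps back to slot 0
```

Because `my_head` and `peer_tail_cache` only ever grow, their difference is the number of in-flight slots, which is why the full test needs no separate counter.
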
The sender knows the peer rx slot address and DMAs directly into it (single-hop). The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local to the PE).

### D6. Buffer placement — three-way benchmark

The host CCL init picks the IPCQ ring-buffer location:

```python
ipcq_init(
    backend="ahbm",
    buffer_kind="tcm",   # one of "tcm" | "hbm" | "sram"
    n_slots=8,
    slot_size=4096,
)
```

| Location | Trait | Trade-off |
|----------|-------|-----------|
| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
| **PE-local HBM** | Large; via DMA | Higher latency |
| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |

All three locations run the same kernel code; only the init differs.

### D7. Backpressure — two-mode benchmark

How the sender or receiver waits when peer slots are full or data has not yet arrived:

| Mode | Behavior | Model |
|------|----------|-------|
| **poll** | Periodically re-check the cached peer pointer | Spin loop |
| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |

```python
ipcq_init(..., backpressure="sleep")   # one of "poll" | "sleep"
```

Both modes are implemented so latency / throughput trade-offs can be benchmarked.

### D8. PE_DMA virtual channels

Extend PE_DMA from a single queue into a **two-channel virtual-channel** model.

```
PE_DMA
├── vc_compute: tile load / store / writeback for GEMM and Math
└── vc_comm:    IPCQ send data
```

Each VC has an independent state machine:

- One channel stalling does not block the other.
- The same physical link (cube_noc, UCIe, …) is shared, but link BW is split between channels.

**Chunk-level interleave**:

- Large GEMM tile DMAs do not lock the link end-to-end.
- Progress happens in chunks (e.g. 256 B); each chunk shares link BW with the other VC's pending chunks.
- Chunk size is an init parameter (smaller = fairer, larger = more efficient).

Net effect:

- HoL blocking is eliminated (an IPCQ send can interleave with a long compute DMA).
- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM pattern).
- Matches the NoC-virtual-channel pattern used in real HW.

**First-implementation accuracy limit (intentional)**: this ADR's first cut uses **deterministic chunk-level interleave + weighted round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`). This is a first-order approximation and is simpler than real HW dynamic-contention / credit-based arbiters. Functional correctness is unaffected, but heavy-contention scenarios may report slightly optimistic latency vs. real HW. A separate ADR can add a NoC arbiter component later if more precision is needed.

#### Token routing

- Compute tokens (`TileToken`) — go through the existing PE_FETCH_STORE → PE_DMA chain.
- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA self-routing.
- PE_DMA picks the channel by token type.

```python
class PeDmaComponent:
    def _process(self, env, token):
        if isinstance(token, IpcqDmaToken):
            yield from self._vc_comm_process(env, token)
        else:
            yield from self._vc_compute_process(env, token)
```

### D9. Pointer synchronization — DMA payload piggyback

Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so pointers update along with the data. This simulation adopts the same model: **no separate control channel** — metadata travels with the data.

The big benefits:

- **Automatic ordering**: data and metadata move on the same token, so data is visible **before** the head_cache update. No race.
- **HW fidelity**: matches NVLink / UCIe piggybacked headers.
- **Component simplification**: no separate `IpcqPtrUpdate` event type.

#### Send flow (head update via piggyback)

```
PE A: tl.send(E, src_addr, nbytes)
  1. PE_IPCQ checks backpressure (using peer_tail_cache)
  2. PE_IPCQ creates an IpcqDmaToken:
     - data body (src_addr → peer dst_addr)
     - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
  3. Hand the token to PE_DMA(vc_comm)
  4. PE A increments my_head (send tracking)

[fabric DMA: latency elapses]

PE B's PE_DMA receives the token
  5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
  6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)

PE B's PE_IPCQ receives the metadata
  7. Updates peer_head_cache (= A's head)
  8. Wakes any pending recv on that direction
```

**Steps 5 and 6 must execute in the same SimPy step** — DMA completion makes data and metadata atomically visible.

#### Recv flow (credit return — fast path with bottleneck-BW latency)

When the receiver frees a slot, the sender must learn about it (backpressure release). Unlike data, the credit return does **not** travel through general vc_comm fabric — it uses a **separate fast path**, an abstraction of the NVLink / UCIe credit-return wire.

**Latency** is computed from the **bottleneck BW on the path**, not a magic constant:

```
credit_size_bytes = 16   (ccl.yaml: ipcq_credit_size_bytes)
path = router.find_path(self_pe, peer_pe)
latency = compute_drain_ns(path, credit_size_bytes)
        = credit_size_bytes / bottleneck_bw_on_path
```

That gives us:

- **Topology-proportional approximation**: an in-cube credit return is automatically faster than a cross-SIP credit return.
- **No magic constants**: no arbitrary `ipcq_ctrl_latency_ns`.
- **No deadlock risk**: unlike piggyback, B can issue credit even when it has no data to send back.
- **Reuses existing utility**: `ComponentContext.compute_drain_ns`.

#### Component coupling — SimPy Store channel

PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init time, **a SimPy Store is wired between the two** (a per-direction fast-path channel) and credit metadata is `put` into that store.

```python
class PeIpcqComponent:
    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
        yield env.timeout(latency_ns)
        # The credit also carries source PE coords and direction (see D12).
        yield peer_credit_store.put(IpcqCreditMetadata(consumer_seq=my_tail))
```

Backend init wires both directions of the fast-path channel as part of fan-out (see `IpcqInitMsg` in D12).

#### Credit-return fast path limitations

- `credit_size_bytes` is an estimate (typically 16–64 bytes).
- The fast path is **excluded from vc_comm BW contention** (separate wire). Real HW credit-return wires are very lightweight, so this is a reasonable first approximation.
- A follow-up ADR can: model the credit fast path as a separate link (BW limit + contention), or switch to piggyback (`credit_return_mode: piggyback`).

#### PE_DMA's added responsibility

When `vc_comm` receives a token, PE_DMA processes it as the following **atomic** sequence. **No SimPy yield is allowed between the two steps** (invariant I6):

```python
def _on_vc_comm_recv(self, env, token):
    # ── ATOMIC: no yield between these two operations ──
    # 1. Copy the payload into the local rx slot
    data = self._memory_store.read(token.src_space, token.src_addr, shape=..., dtype=...)
    self._memory_store.write(token.dst_endpoint.buffer_kind, token.dst_addr, data)
    # 2. Forward metadata to the local PE_IPCQ
    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
    # ───────────────────────────────────────────────────
```

The final `put` is yieldable but uses an unbounded internal store, so it completes in a single step. That `put` is the closing call of the atomic block; nothing may be inserted between the `MemoryStore.write` and the `put`.

### D9.5. ADR-0020 (2-pass) integration

`tl.send` / `tl.recv` integrate with ADR-0020's two-pass model. Phase 1 simulates timing **and** moves data via MemoryStore; Phase 2 enables op-log-based correctness verification.

#### Phase 1 (timing + data)

D9 models head and tail updates with two different mechanisms:

- **Send-side (head update)** — DMA payload piggyback.
  Data write and metadata forward happen in the same SimPy step → automatic atomic visibility.
- **Recv-side (tail credit return)** — fast-path SimPy Store channel with bottleneck-BW latency, then `peer_tail_cache` update.

Together they preserve ring-buffer pointer consistency.

The op-log records `op_kind="ipcq"` entries for sends (with `src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with `recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).

Two recv modes:

- **`return_slot`** (default): the slot address is returned to the kernel. Zero-copy.
- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`, PE_IPCQ copies the slot data into the user dst.

#### Phase 2 (op_log replay)

When `DataExecutor` encounters an `op_kind="ipcq"` record:

- **send**: idempotent `src → dst` ndarray write.
- **recv (`return_slot`)**: no-op (the slot already holds the data).
- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.

IPCQ ops are pure data movement — Phase 2 has nothing extra to compute. The downstream GEMM / Math ops in `DataExecutor` will consume the data and naturally validate correctness.

### D10. Host CCL init keeps the PyTorch shape

The host code looks just like real PyTorch DDP. `init_process_group` creates the backend object; it does **not** receive IPCQ knobs (neighbor topology, buffer_kind, backpressure …).

```python
# benches/ccl_allreduce.py — same shape as real PyTorch
def worker(rank, world_size, torch):
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")  # reads ccl.yaml + topology
    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
    tensor.copy_(torch.from_numpy(init))
    dist.all_reduce(tensor, op="sum")
```

The IPCQ configuration is decided by the backend at `init_process_group` time: it loads `ccl.yaml`, picks the algorithm, and pushes IPCQ neighbor tables to every participating PE_IPCQ. The host code never has to know about IPCQ.

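
As an illustration of the neighbor tables the backend pushes at init, the sketch below builds a bidirectional 1D ring table and checks that every link has a matching reverse link (invariant I3 in D14). The helpers `build_ring_1d` and `check_reciprocal` and the `OPPOSITE` map are hypothetical and not part of kernbench; plain integer ranks stand in for full `IpcqEndpoint` records.

```python
# Hypothetical helpers; integer ranks stand in for IpcqEndpoint records.
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def build_ring_1d(world_size: int) -> dict[int, dict[str, int]]:
    """Map each rank to its E/W neighbors on a bidirectional ring."""
    return {
        rank: {
            "E": (rank + 1) % world_size,
            "W": (rank - 1) % world_size,
        }
        for rank in range(world_size)
    }


def check_reciprocal(table: dict[int, dict[str, int]]) -> None:
    """Raise ValueError if any link lacks a matching reverse-direction link."""
    for rank, neighbors in table.items():
        for direction, peer in neighbors.items():
            reverse = OPPOSITE[direction]
            back = table.get(peer, {}).get(reverse)
            if back != rank:
                raise ValueError(
                    f"rank {rank} dir {direction} -> {peer}, "
                    f"but rank {peer} dir {reverse} -> {back}"
                )


table = build_ring_1d(4)
check_reciprocal(table)  # a well-formed ring passes silently
```

Running the check once at init, before any table is pushed, keeps a mis-wired topology from surfacing later as a hang.
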
A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`. Switching algorithms is purely a `ccl.yaml` change — no host edits required.

#### Init flow (eager)

1. `init_process_group(backend="ahbm")` is called.
2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
3. Pulls topology + buffer_kind + backpressure + slot config from `algorithms[]`.
4. **Immediately** installs neighbor tables on every PE_IPCQ (sideband or fabric `IpcqInitMsg`).
5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally — PE_IPCQ is already prepared whether the kernel is a CCL kernel or not.

### D11. CCL config file (`ccl.yaml`)

IPCQ config and algorithm metadata live in a separate YAML file, following the same pattern as `components.yaml` and `topology.yaml`.

A single benchmark execution runs one algorithm (`defaults.algorithm`). Switching algorithms means editing `defaults.algorithm` only.

```yaml
defaults:
  algorithm: ring_allreduce_tcm
  buffer_kind: tcm            # tcm | hbm | sram
  backpressure: sleep         # poll | sleep
  n_slots: 8
  slot_size: 4096
  vc_chunk_size: 256
  ipcq_credit_size_bytes: 16

algorithms:
  ring_allreduce_tcm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d         # builtin name or "custom"
    buffer_kind: tcm
    n_elem: 8                 # optional, per-algorithm tile width
  tree_allreduce_7:
    module: kernbench.ccl.algorithms.tree_allreduce
    topology: tree_binary
    buffer_kind: tcm
    world_size: 7             # algorithm-level override
    n_elem: 16
  custom_mesh:
    module: kernbench.ccl.algorithms.custom_mesh
    topology: custom          # the module supplies its own neighbors()
```

`world_size` is **not set in `defaults`**. The backend resolves it via: `algorithm-level override > defaults override > topology spec`. The last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP where `WORLD_SIZE` comes from env vars rather than config files.

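
The precedence rule above can be sketched as a small resolver. This is a hedged illustration only: `resolve_world_size` is a hypothetical function, the `topology_spec` keys (`sips`, `cubes_per_sip`, `pes_per_cube`) and their values are assumed for the example, and the config is passed as an already-parsed dict instead of being loaded from `ccl.yaml`.

```python
def resolve_world_size(config: dict, topology_spec: dict) -> int:
    """Apply: algorithm-level override > defaults override > topology spec."""
    defaults = config.get("defaults", {})
    algo = config["algorithms"][defaults["algorithm"]]

    if "world_size" in algo:
        return int(algo["world_size"])
    if "world_size" in defaults:
        return int(defaults["world_size"])
    # Last fallback: every PE in the topology participates.
    return (
        topology_spec["sips"]
        * topology_spec["cubes_per_sip"]
        * topology_spec["pes_per_cube"]
    )


config = {
    "defaults": {"algorithm": "tree_allreduce_7"},
    "algorithms": {
        "ring_allreduce_tcm": {"topology": "ring_1d"},
        "tree_allreduce_7": {"topology": "tree_binary", "world_size": 7},
    },
}
# Made-up topology sizes, used only to exercise the fallback.
topology_spec = {"sips": 2, "cubes_per_sip": 2, "pes_per_cube": 4}

tree_ws = resolve_world_size(config, topology_spec)   # algorithm-level override
config["defaults"]["algorithm"] = "ring_allreduce_tcm"
ring_ws = resolve_world_size(config, topology_spec)   # topology fallback
```

Switching `defaults.algorithm` is the only change between the two calls, which matches the claim that host code stays untouched.
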
#### Algorithm module structure

Each algorithm module exports two hooks — `kernel` (required) and `neighbors` (optional) — plus a `kernel_args` helper that the backend uses to populate positional kernel arguments at `all_reduce` time:

```python
# src/kernbench/ccl/algorithms/ring_allreduce.py

def kernel_args(world_size: int, n_elem: int) -> tuple:
    return (n_elem, world_size)


def kernel(t_ptr, n_elem, world_size, tl):
    """Required — the PE kernel.

    IPCQ is already installed by the backend before this is called.
    The kernel only uses the four-direction send / recv API.
    """
    ...


def neighbors(rank, world_size, neighbor_map):
    """Optional — override the builtin topology's neighbor map.

    Returns a new dict, the modified-in-place dict, or None to keep
    the builtin map.
    """
    return None
```

#### `neighbors` override patterns

- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
- **Pattern B — replace entirely**: ignore `neighbor_map` and return a brand-new dict.
- **Pattern C — keep builtin**: omit `neighbors` or return None.

#### Builtin topologies

| topology | direction set |
|----------|---------------|
| `ring_1d` | E, W |
| `ring_1d_unidir` | E only |
| `mesh_2d` | N, S, E, W |
| `tree_binary` | parent, child_left, child_right |
| `none` | (empty) — algorithm must supply `neighbors()` |

#### Adding a new algorithm

1. Write `kernel` and `kernel_args` in `src/kernbench/ccl/algorithms/.py`.
2. Add an entry in `ccl.yaml`'s `algorithms` section.
3. (Optional) provide `neighbors()` for custom topology.
4. Set `defaults.algorithm` to the new algorithm.

The host bench (`benches/ccl_allreduce.py`) does not change.

### D12. Message / token schema

The new message types added by this ADR live in `src/kernbench/common/pe_commands.py` and `src/kernbench/runtime_api/kernel.py`.

#### `IpcqInitMsg` (sideband, fan-out at init)

The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors `MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`).
Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`, `my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store` field — a `simpy.Store` instance pre-wired so the PE_IPCQ that frees a slot (the credit sender) can push `IpcqCreditMetadata` directly into its peer's input queue.

#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)

Carries `direction`, source addr/space, nbytes, shape, dtype, and a handle id. `data_op=True` so it lands in the op_log.

#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)

Carries `direction` (or None for round-robin), `recv_mode` (`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape, dtype, blocking flag.

#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)

Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`) plus the head metadata (`sender_seq`, `src_sip/cube/pe`, `src_direction`). PE_DMA picks the channel by token type (`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`). The receiver's PE_DMA, on token arrival, performs the I6 atomic sequence: write data into MemoryStore, then forward `IpcqMetaArrival` to the local PE_IPCQ.

#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)

Carries `consumer_seq` (= my_tail), source PE coords, and source direction. Travels through the dedicated SimPy Store channel rather than `vc_comm`. Latency = `credit_size_bytes / bottleneck_bw_on_path`.

There is **no `IpcqPtrUpdate` event** — head updates flow via D9 piggyback, tail updates via the D9 fast-path channel.

### D13. Test strategy

Following the ADR-0021 D8 pattern.

#### T1. Unit tests (component-level)

- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure immediately forwards a token; full peer slot triggers backpressure (poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`; round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`.
- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute` / `vc_comm` independent progress, chunk interleave, BW split.
- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d / mesh_2d / tree_binary correctness, mesh_2d non-square → `ValueError`, custom resolver returns the module's `neighbors`.

#### T2. Integration tests (E2E send/recv)

- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional no-deadlock), 4×4 mesh.
- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode records `ipcq` ops in op_log; DataExecutor produces correct `out.data`.

#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)

`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA consistency, per-`buffer_kind` allocation.

#### T4. Regression

All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for non-CCL benches.

#### T5. Performance / overhead

Single send/recv pair latency = (DMA latency) + (IPCQ overhead). Should be close to a regular PE_DMA write of the same nbytes (IPCQ overhead < 100 ns).

### D14. Invariants and failure modes

#### Invariants

I1. **Slot lifecycle exactly-once**: one send → exactly one recv.

I2. **Pointer monotonicity**: `my_head` / `my_tail` are monotonically non-decreasing; `sender_seq` is strictly increasing.

I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank B, then rank B's reverse-direction peer must be rank A. Verified at init.

I4. **`buffer_kind` consistency**: all PEs in a process group share the same `buffer_kind` (no mixed mode in the first cut).

I5. **op_log ordering**: send → DMA complete → recv possible. The t_start order in op_log respects this causality.

I6. **Atomic data + metadata visibility (MUST)**: at the receiver side, data write (`MemoryStore.write`) and metadata forward (`peer_head_cache` update) **must execute in the same SimPy step**. No yield is allowed between the two operations in PE_DMA's vc_comm handler. Code review must reject any inserted `yield` (or `yield from`) — it would create a race where head_cache becomes visible before or after the data.

I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6, the step in which `peer_head_cache > my_tail` becomes true is the same step in which the slot data is observable.

#### Failure modes (runtime errors)

F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction → `IpcqInvalidDirection`, simulation aborts.

F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched send and recv. Not validated by default; opt-in strict mode catches it (`strict_validation: true` in a PE_IPCQ node's attrs).

F3. **Deadlock detection (timeout-based)**: the simulator empties its schedule while a send/recv is still pending → engine raises `IpcqDeadlock` and embeds a pointer dump.

F4. **Backend init failure**: missing `defaults.algorithm`, missing `algorithms[name]`, module import failure, topology validation failure (I3, I4) — all raised at `init_process_group` time.

F5. **Slot full + infinite backpressure**: the peer never recvs. Surfaces as F3 timeout.

#### Diagnostics

- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as `(rank, t, dir, nbytes)`.
- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)` prints every PE_IPCQ ring buffer's `my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`.
- **Deadlock dump**: on hang the engine includes the pointer dump in the `IpcqDeadlock` exception message.

### D15. Algorithm-author cheat sheet

Full step-by-step lives in [`docs/ccl-author-guide.en.md`](../ccl-author-guide.en.md).
The shortest version:

| Things you touch | Things you don't |
|------------------|-------------------|
| `src/kernbench/ccl/algorithms/.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
| (Optional) `tests/test_.py` mock test | PE_IPCQ component, AhbmCCLBackend |

5-step flow: write the kernel → register in `ccl.yaml` → optional `neighbors` override → optional mock unit test → SimPy validation via `kernbench run --bench ccl_allreduce --verify-data`.

Common mistakes: using a direction that wasn't installed, sends without matching recvs (deadlock), dtype/shape disagreement, assuming fairness from `tl.recv()` round-robin, confusing `tl.num_programs(axis)` with the CCL group size.

---

## Non-goals

- **Host collective**: a model where `dist.all_reduce` itself moves data on the host side is out of scope. This ADR only covers communication that happens inside the PE kernel.
- **All-reduce algorithms**: ring / tree / etc. live in algorithm modules and can be added without amending this ADR.
- **Reliability / error handling**: link faults, send/recv failure recovery, etc. are out of scope.
- **NoC arbiter precision**: dynamic VC contention is left for a future ADR (see D8).

---

## Open questions

- **VC arbitration accuracy** — the first cut uses deterministic chunk interleave + weighted round-robin; heavy contention may report optimistic latency. A NoC arbiter component can be added later.
- **Credit return BW model** — the fast path is currently outside the fabric BW contention model. Can be modeled as a separate link or switched to piggyback (`credit_return_mode: piggyback`).
- **Ring buffer slot allocation metadata** — whether the host pushes IPCQ buffer metadata via sideband or via a fabric message similar to `MmuMapMsg` is open.
- **VC BW split default** — 50/50 vs. weighted (e.g. 80/20). Exposed in `ccl.yaml`; default value TBD.
- **Direction count** — 4 (N/S/E/W) is fixed in the first cut; 6 (with Up/Down for 3D) or N (variable) is future work.
- **Multi-tile aggregation primitives** — whether `tl.recv_all` or similar is needed for fan-in.
- **Round-robin recv fairness** — current weak fairness can starve; strict fairness counter is future work.
- **Deadlock detection precision** — currently timeout-based; a realtime wait-for graph would enable deterministic detection.

---

## Consequences

### Positive

- PE-to-PE direct communication enables CCL kernels to be written.
- Host stays minimal (just `launch`), synchronization happens inside the PE → strong compute / comm overlap.
- VCs eliminate HoL blocking → collective latency is not blocked by compute traffic.
- Buffer placement and backpressure mode are init-time parameters → easy to benchmark.
- Four-direction logical neighbors → host is free to map ring/mesh/tree algorithms.

### Negative

- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE.
- VC arbitration is a first-order approximation; heavy contention scenarios may report slightly optimistic latency vs. real HW (D8).
- Chunk-level interleave makes PE_DMA implementation more complex.

---

## Affected files

| File | Change |
|------|--------|
| `topology.yaml` | Add `pe_ipcq` to `pe_template`, plus the IPCQ ↔ DMA / CPU / TCM edges. |
| `components.yaml` | Register `pe_ipcq_v1`. |
| `src/kernbench/topology/builder.py` | Wire the IPCQ chain into PE-internal edges. |
| `src/kernbench/components/builtin/pe_ipcq.py` | New. |
| `src/kernbench/components/builtin/pe_dma.py` | Add VCs, handle `IpcqDmaToken`. |
| `src/kernbench/common/pe_commands.py` | `IpcqSendCmd`, `IpcqRecvCmd`, `IpcqDmaToken`. |
| `src/kernbench/triton_emu/tl_context.py` | `tl.send` / `tl.recv` API. |
| `src/kernbench/runtime_api/distributed.py` | Eager IPCQ install in `AhbmCCLBackend.__init__`. |
| `src/kernbench/runtime_api/kernel.py` | `IpcqInitMsg` definition. |
| `src/kernbench/ccl/__init__.py` | New CCL package. |
| `src/kernbench/ccl/topologies.py` | Builtin topology generators + `resolve_topology()`. |
| `src/kernbench/ccl/helpers.py` | Algorithm-author helpers (`chunked`, `ring_step`, `tree_step`). |
| `src/kernbench/ccl/testing.py` | Mock CCL runtime (`run_kernel_in_mock`). |
| `src/kernbench/ccl/algorithms/*.py` | Algorithm modules (kernel + `kernel_args` + optional `neighbors`). |
| `ccl.yaml` | Algorithm metadata + IPCQ defaults. |
| `tests/test_pe_ipcq.py` | PE_IPCQ unit tests. |
| `tests/test_pe_dma_vc.py` | PE_DMA VC tests. |
| `tests/test_ipcq_e2e.py` | End-to-end send/recv tests. |
| `tests/test_ccl_topologies.py` | Builtin topology generator tests. |
| `tests/test_ccl_allreduce_matrix.py` | Unified bench × algorithm matrix. |