Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA, including in-flight data snapshot (D9) and op_log recording at outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together, atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to prevent stale data from corrupting the MemoryStore snapshot.

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype), tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch) split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size, get_rank, get_backend, all_reduce, barrier; only real PyTorch names.
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube) with optional algorithm-level override in ccl.yaml.

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases (ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 19:36:59 -07:00
parent ff2c677a9c
commit 998cc85762
60 changed files with 9196 additions and 80 deletions
@@ -0,0 +1,866 @@
+# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication
+
+## Status
+
+Proposed
+
+## Context
+
+### Goal
+
+Add the infrastructure that lets CCL (Collective Communication Library)
+kernels run **inside** a PE. The host just launches a kernel on each
+SIP; the actual synchronization and data movement happen **inside the
+PE kernel via an IPCQ (Inter-Process Communication Queue)**.
+
+This mirrors how NCCL performs NVLink communication inside a GPU
+kernel, or how Cerebras / Tenstorrent expose core-local communication
+queues. Host-level collectives, in which `dist.all_reduce` itself moves
+data on the host side, are deferred to **future work**; this ADR covers
+only the kernel-side collective infrastructure.
+
+### Current state
+
+- ADR-0021 PE pipeline refactor: each PE is decomposed into components
+  (PE_CPU, PE_SCHEDULER, PE_DMA, PE_FETCH_STORE, PE_GEMM, PE_MATH,
+  PE_TCM, PE_MMU).
+- No direct PE-to-PE channel exists today. All data movement goes
+  through PE_DMA → cube_noc / UCIe / PCIE → HBM.
+- A pre-ADR host CCL skeleton exists (`dist.init_process_group(backend="ahbm")`,
+  `_run_ccl_bench` running per-rank greenlets concurrently). The
+  collective itself is a stub.
+
+### Problems to solve
+
+1. PE-to-PE direct data movement (writing into a peer's memory).
+2. Synchronization — the sender must check that the receiver has space
+   in its buffer (backpressure).
+3. Resource contention between compute traffic and communication
+   traffic (Head-of-Line blocking).
+4. The host must be able to construct logical neighbor topologies
+   (ring / mesh / tree) per algorithm.
+
+---
+
+## Decision
+
+### D1. Add a new `PE_IPCQ` component
+
+A new component `PE_IPCQ` is added inside each PE. It follows the same
+pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a
+distinct component.
+
+```
+PE
+├── PE_CPU
+├── PE_SCHEDULER
+├── PE_DMA
+├── PE_IPCQ          ← new
+├── PE_FETCH_STORE
+├── PE_GEMM
+├── PE_MATH
+├── PE_TCM
+├── PE_MMU
+```
+
+**Role separation** (control plane vs. data plane):
+
+- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head /
+  tail pointer management, peer pointer caches, backpressure, 4-direction
+  neighbor mapping.
+- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe
+  / PCIE into the peer's memory.
+
+PE_IPCQ does **not** move data itself — it delegates to PE_DMA.
+
+### D2. Ring buffer model
+
+Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.
+
+```python
+@dataclass
+class IpcqQueuePair:
+    direction: Direction          # N/S/E/W
+    peer: IpcqEndpoint            # set by host at init time (D2.5)
+    tx_buffer_base: int           # outgoing data base addr (in our memory)
+    rx_buffer_base: int           # incoming data base addr (in our memory)
+    slot_size: int                # 1 tile per slot
+    n_slots: int                  # ring depth
+    my_head: int                  # next slot we will write/send into
+    my_tail: int                  # next slot we will read/recv from
+    peer_head_cache: int          # peer's last-seen head (updated via D9 piggyback)
+    peer_tail_cache: int          # peer's last-seen tail (updated via D9 fast-path credit)
+```
+
+**Canonical field names**: throughout this ADR the four names above
+(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used
+consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`,
+etc.) are not used.
+
+| Field | Owner | Updated when |
+|-------|-------|--------------|
+| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
+| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
+| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
+| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |
+
+**Slot unit**: slots are fixed-size, and each slot holds one full tile
+with the data embedded directly (no descriptor indirection). See D5.
+
+### D2.5. `IpcqEndpoint` schema
+
+`IpcqQueuePair.peer` carries everything the sender needs to compute the
+peer's rx slot address:
+
+```python
+@dataclass(frozen=True)
+class IpcqEndpoint:
+    sip: int
+    cube: int
+    pe: int
+    buffer_kind: str             # "tcm" | "hbm" | "sram"
+    rx_base_pa: int              # peer rx_buffer base PA (PhysAddr.encode())
+    rx_base_va: int              # peer rx_buffer base VA (optional, MMU mode)
+    n_slots: int                 # peer ring depth (for wrap-around)
+    slot_size: int               # peer slot size (for offset)
+```
+
+Address computation:
+
+```python
+slot_idx = self.my_head % peer.n_slots
+dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
+```
+
+PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA
+(vc_comm) routes the data to `dst_pa` through the fabric.
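The address arithmetic above can be checked with a small standalone sketch. `Endpoint` and `rx_slot_pa` are hypothetical names introduced for illustration; only the three fields used in the formula are modeled, and the base address is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical stand-in for the IpcqEndpoint fields used in addressing."""
    rx_base_pa: int
    n_slots: int
    slot_size: int


def rx_slot_pa(my_head: int, peer: Endpoint) -> int:
    """Return the physical address of the peer rx slot for send number my_head."""
    slot_idx = my_head % peer.n_slots          # wrap around the ring
    return peer.rx_base_pa + slot_idx * peer.slot_size


# Illustrative values: 8 slots of 4096 bytes at an arbitrary base address.
peer = Endpoint(rx_base_pa=0x1000_0000, n_slots=8, slot_size=4096)
addrs = [rx_slot_pa(h, peer) for h in range(10)]
```

With 8 slots, sends 8 and 9 land on the same addresses as sends 0 and 1, which is why the sender must first confirm through backpressure that those slots have been freed.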
+
+**Endpoint construction order**: at backend init (D10), the IPCQ
+buffers for **every PE** are allocated first (so each rank knows the
+others' PA), then the per-rank neighbor tables are built and pushed to
+PE_IPCQ via `IpcqInitMsg`.
+
+### D3. Four-direction mapping ≡ logical ProcessGroup
+
+The PE views four directions (N/S/E/W) as logical ports. Real peer
+addresses are configured by the host CCL init, per the chosen
+algorithm. The PE kernel never knows the topology, only directions.
+
+```python
+# 1D ring
+for rank in range(world_size):
+    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
+    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])
+
+# 2D mesh
+for r in range(R):
+    for c in range(C):
+        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
+        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
+        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
+        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
+```
+
+The PE code does not need to know where `tl.send(dir="E", ...)` actually
+ends up.
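As a rough illustration of what the host-side loops above produce, the sketch below builds the same neighbor tables as plain dictionaries. The functions `ring_1d` and `mesh_2d` are hypothetical helpers written for this example and are not the backend's actual API; ranks and coordinates stand in for full endpoints.

```python
def ring_1d(world_size: int) -> dict:
    """Map rank -> {direction: peer rank} for a bidirectional 1D ring."""
    return {
        rank: {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}
        for rank in range(world_size)
    }


def mesh_2d(rows: int, cols: int) -> dict:
    """Map (r, c) -> {direction: peer coordinate} for a wrap-around 2D mesh."""
    table = {}
    for r in range(rows):
        for c in range(cols):
            table[(r, c)] = {
                "N": ((r - 1) % rows, c),
                "S": ((r + 1) % rows, c),
                "E": (r, (c + 1) % cols),
                "W": (r, (c - 1) % cols),
            }
    return table


ring = ring_1d(4)
mesh = mesh_2d(2, 3)
```

In the ring, following `"E"` and then `"W"` from any rank returns to that rank, which is the property the endpoint-consistency invariant relies on.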
+
+### D4. PE kernel API
+
+```python
+# Send (blocking; may stall on backpressure)
+tl.send(dir, src=handle)          # dir: str, handle: TensorHandle
+tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
+
+# Recv (blocking)
+recv = tl.recv(dir, shape=..., dtype=...)
+recv = tl.recv(shape=..., dtype=...)        # round-robin across 4 directions
+
+# Recv (non-blocking)
+fut  = tl.recv_async(dir, shape=..., dtype=...)
+recv = tl.wait(fut)
+```
+
+`tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each
+call rotates through directions, returning the first available slot.
+Empty in all 4 directions → wait.
+
+**Fairness is weak**: the rotating start mitigates simple bias, but if
+one direction always wins the race the others can starve. Algorithms
+that need strict fairness must call `tl.recv(dir=...)` explicitly.
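One way to picture the rotating cursor is the minimal model below. It assumes a fixed rotation order of N, E, S, W and represents each direction's rx ring by a count of ready slots; both choices, and the `RoundRobinPoller` class itself, are assumptions made for illustration.

```python
DIRECTIONS = ("N", "E", "S", "W")  # rotation order is an assumption


class RoundRobinPoller:
    """Hypothetical model of the last_polled_dir cursor behind tl.recv()."""

    def __init__(self):
        # Start on the last entry so the first scan begins at "N".
        self.last_polled_dir = len(DIRECTIONS) - 1

    def pick(self, ready: dict):
        """Return the first direction with data, scanning from just after the cursor."""
        for step in range(1, len(DIRECTIONS) + 1):
            idx = (self.last_polled_dir + step) % len(DIRECTIONS)
            if ready.get(DIRECTIONS[idx], 0) > 0:
                self.last_polled_dir = idx
                return DIRECTIONS[idx]
        return None  # empty in all four directions: the caller has to wait


poller = RoundRobinPoller()
pending = {"N": 2, "E": 1, "S": 0, "W": 0}
order = []
while (d := poller.pick(pending)) is not None:
    pending[d] -= 1
    order.append(d)
```

The scan starts one position past the previous winner, so a busy direction does not win twice in a row while another direction has data. It still gives no guarantee about arrival order, which is the weak fairness described above.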
+
+### D5. Single-hop DMA write + full-data slot model
+
+Data moves from sender memory into the receiver's ring slot in **one
+DMA transfer**. Key properties:
+
+- **Single-hop**: the sender already knows the peer rx slot address and
+  fires one fabric DMA into it.
+- **No CPU memcpy**: the CPU never copies data.
+- **No intermediate staging**: neither side keeps a separate staging
+  buffer (sender uses the source addr directly; receiver gets the data
+  in its ring slot directly).
+
+(Strictly speaking the fabric DMA write does happen, so this is not
+literally "no data movement" — it's the same property NCCL labels
+"zero-copy", meaning no CPU memcpy and no staging copy.)
+
+```
+PE A: tl.send(E, src_addr, nbytes)
+  1. IPCQ computes the peer rx slot address:
+       dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
+  2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
+                   (full → sleep / poll)
+  3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
+  4. my_head += 1
+
+PE B: data = tl.recv(W)
+  1. Look at rx_buffer[my_tail % n_slots]
+  2. Wait for the data to arrive (D7 backpressure mode)
+  3. Return the slot address to the kernel (or fetch into register file)
+  4. my_tail += 1
+  5. Issue a credit return on the fast path (D9): after the bottleneck-BW
+     latency, peer A's peer_tail_cache is updated.
+```
+
+The slot holds the full tile. The receiver only reads its own
+rx_buffer; it never reads from A's memory. The sender knows the
+peer rx slot address and DMAs directly into it (single-hop).
+
+The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local
+to the PE).
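The sender-side bookkeeping in the flow above can be sketched without any simulator. `SenderView` is a hypothetical class for this example; it tracks only `my_head` and `peer_tail_cache`, and the rule that a credit never moves the cache backwards is an assumption rather than something the ADR states.

```python
class SenderView:
    """Hypothetical sender-side view of one ring: my_head plus the cached peer tail."""

    def __init__(self, n_slots: int):
        self.n_slots = n_slots
        self.my_head = 0
        self.peer_tail_cache = 0

    def can_send(self) -> bool:
        # Slots in flight = sends issued minus slots the peer has freed.
        return self.my_head - self.peer_tail_cache < self.n_slots

    def send(self) -> int:
        """Claim the next slot and return its index in the peer's ring."""
        if not self.can_send():
            raise RuntimeError("ring full: wait for a credit (D7)")
        slot = self.my_head % self.n_slots
        self.my_head += 1
        return slot

    def on_credit(self, consumer_seq: int) -> None:
        # Assumption: a stale credit must not move the cache backwards.
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)


tx = SenderView(n_slots=2)
slots = [tx.send(), tx.send()]      # fills both slots
blocked = not tx.can_send()         # the ring is now full
tx.on_credit(consumer_seq=1)        # the peer freed one slot
slots.append(tx.send())             # reuses slot 0
```

Because the pointers only ever increase and the slot index is taken modulo `n_slots`, the difference `my_head - peer_tail_cache` directly counts occupied slots.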
+
+### D6. Buffer placement — three-way benchmark
+
+The host CCL init picks the IPCQ ring-buffer location:
+
+```python
+ipcq_init(
+    backend="ahbm",
+    buffer_kind="tcm" | "hbm" | "sram",
+    n_slots=8,
+    slot_size=4096,
+)
+```
+
+| Location | Trait | Trade-off |
+|----------|-------|-----------|
+| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
+| **PE-local HBM** | Large; via DMA | Higher latency |
+| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |
+
+All three locations run the same kernel code; only the init differs.
+
+### D7. Backpressure — two-mode benchmark
+
+How the sender waits when the peer's slots are full, and how the
+receiver waits when data has not yet arrived:
+
+| Mode | Behavior | Model |
+|------|----------|-------|
+| **poll** | Periodically re-check the cached peer pointer | Spin loop |
+| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |
+
+```python
+ipcq_init(backpressure="poll" | "sleep", ...)
+```
+
+Both modes are implemented so latency / throughput trade-offs can be
+benchmarked.
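To make the latency side of that trade-off concrete, the sketch below compares when a blocked operation would notice that it can proceed. It is a simplified arithmetic model, not the simulator's implementation; `poll_interval_ns` is a hypothetical knob and not a documented `ccl.yaml` key.

```python
import math


def wake_time_ns(t_wait_start: float, t_ready: float, mode: str,
                 poll_interval_ns: float = 10.0) -> float:
    """Time at which a blocked send/recv observes that it may proceed."""
    if t_ready <= t_wait_start:
        return t_wait_start                      # nothing to wait for
    if mode == "sleep":
        return t_ready                           # woken by the peer-triggered event
    if mode == "poll":
        # Re-check every interval; take the first check at or after t_ready.
        k = math.ceil((t_ready - t_wait_start) / poll_interval_ns)
        return t_wait_start + k * poll_interval_ns
    raise ValueError(f"unknown backpressure mode: {mode!r}")


sleep_wake = wake_time_ns(0.0, 37.0, "sleep")
poll_wake = wake_time_ns(0.0, 37.0, "poll", poll_interval_ns=10.0)
```

Under these assumptions, polling adds up to one interval of latency, while sleeping wakes exactly when the peer event fires.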
+
+### D8. PE_DMA virtual channels
+
+Extend PE_DMA from a single queue into a **two-channel virtual-channel**
+model.
+
+```
+PE_DMA
+├── vc_compute: tile load / store / writeback for GEMM and Math
+└── vc_comm:    IPCQ send data
+```
+
+Each VC has an independent state machine:
+
+- One channel stalling does not block the other.
+- The same physical link (cube_noc, UCIe, …) is shared, but link BW is
+  split between channels.
+
+**Chunk-level interleave**:
+
+- Large GEMM tile DMAs do not lock the link end-to-end.
+- Progress happens in chunks (e.g. 256 B); each chunk shares link BW
+  with the other VC's pending chunks.
+- Chunk size is an init parameter (smaller = fairer, larger = more
+  efficient).
+
+Net effect:
+
+- HoL blocking is eliminated (an IPCQ send can interleave with a long
+  compute DMA).
+- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM
+  pattern).
+- Matches the NoC-virtual-channel pattern used in real HW.
+
+**First-implementation accuracy limit (intentional)**: this ADR's
+first cut uses **deterministic chunk-level interleave + weighted
+round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`).
+This is a first-order approximation and is simpler than real HW
+dynamic-contention / credit-based arbiters. Functional correctness is
+unaffected, but heavy-contention scenarios may report slightly
+optimistic latency vs. real HW. A separate ADR can add a NoC arbiter
+component later if more precision is needed.
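A minimal sketch of deterministic chunk-level interleave with weighted round-robin follows. It expresses each weight as a whole number of chunks served per round, which is an assumption: the ADR specifies only the arbitration style and the 50 / 50 default, and `interleave` is a hypothetical helper.

```python
def interleave(compute_bytes: int, comm_bytes: int, chunk: int = 256,
               weights=(1, 1)):
    """Return the order in which chunks are served, as a list of channel names."""
    if chunk <= 0 or min(weights) < 1:
        raise ValueError("chunk and weights must be positive")
    remaining = {"vc_compute": compute_bytes, "vc_comm": comm_bytes}
    quota = {"vc_compute": weights[0], "vc_comm": weights[1]}
    order = []
    while any(v > 0 for v in remaining.values()):
        for vc in ("vc_compute", "vc_comm"):
            # Each round, a channel may send up to its quota of chunks.
            for _ in range(quota[vc]):
                if remaining[vc] <= 0:
                    break
                remaining[vc] -= min(chunk, remaining[vc])
                order.append(vc)
    return order


# A 1 KiB compute DMA competing with a single 256 B IPCQ send.
schedule = interleave(compute_bytes=1024, comm_bytes=256)
```

With equal weights the IPCQ chunk is served second rather than after the whole compute transfer, which is how head-of-line blocking is avoided.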
+
+#### Token routing
+
+- Compute tokens (`TileToken`) — go through the existing
+  PE_FETCH_STORE → PE_DMA chain.
+- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA
+  self-routing.
+- PE_DMA picks the channel by token type.
+
+```python
+class PeDmaComponent:
+    def _process(self, env, token):
+        if isinstance(token, IpcqDmaToken):
+            yield from self._vc_comm_process(env, token)
+        else:
+            yield from self._vc_compute_process(env, token)
+```
+
+### D9. Pointer synchronization — DMA payload piggyback
+
+Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so
+pointers update along with the data. This simulation adopts the same
+model: **no separate control channel** — metadata travels with the
+data.
+
+The big benefits:
+
+- **Automatic ordering**: data and metadata move on the same token, so
+  the `peer_head_cache` update is never visible **before** the data. No race.
+- **HW fidelity**: matches NVLink / UCIe piggybacked headers.
+- **Component simplification**: no separate `IpcqPtrUpdate` event type.
+
+#### Send flow (head update via piggyback)
+
+```
+PE A: tl.send(E, src_addr, nbytes)
+  1. PE_IPCQ checks backpressure (using peer_tail_cache)
+  2. PE_IPCQ creates an IpcqDmaToken:
+       - data body (src_addr → peer dst_addr)
+       - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
+  3. Hand the token to PE_DMA(vc_comm)
+  4. PE A increments my_head (send tracking)
+
+[fabric DMA: latency elapses]
+
+PE B's PE_DMA receives the token
+  5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
+  6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)
+
+PE B's PE_IPCQ receives the metadata
+  7. Updates peer_head_cache (= A's head)
+  8. Wakes any pending recv on that direction
+```
+
+**Steps 5 and 6 must execute in the same SimPy step** — DMA completion
+makes data and metadata atomically visible.
+
+#### Recv flow (credit return — fast path with bottleneck-BW latency)
+
+When the receiver frees a slot, the sender must learn about it
+(backpressure release). Unlike data, the credit return does **not**
+travel through general vc_comm fabric — it uses a **separate fast
+path**, an abstraction of the NVLink / UCIe credit-return wire.
+
+**Latency** is computed from the **bottleneck BW on the path**, not a
+magic constant:
+
+```
+credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
+path = router.find_path(self_pe, peer_pe)
+latency = compute_drain_ns(path, credit_size_bytes)
+        = credit_size_bytes / bottleneck_bw_on_path
+```
+
+That gives us:
+
+- **Topology-proportional approximation**: an in-cube credit return is
+  automatically faster than a cross-SIP credit return.
+- **No magic constants**: no arbitrary `ipcq_ctrl_latency_ns`.
+- **No deadlock risk**: unlike piggyback, B can issue credit even when
+  it has no data to send back.
+- **Reuses existing utility**: `ComponentContext.compute_drain_ns`.
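The latency formula can be tried out with the standalone sketch below. It does not call `compute_drain_ns`; it takes a list of per-link bandwidths as input, and the bandwidth figures are hypothetical. Bandwidth is expressed in GB/s because 1 GB/s equals 1 byte/ns, so the division yields nanoseconds directly.

```python
def credit_latency_ns(link_bw_gb_per_s, credit_size_bytes: int = 16) -> float:
    """Drain time of one credit over the slowest link on the path."""
    if not link_bw_gb_per_s:
        raise ValueError("path must contain at least one link")
    bottleneck = min(link_bw_gb_per_s)           # slowest hop dominates
    return credit_size_bytes / bottleneck        # bytes / (bytes per ns) = ns


in_cube = credit_latency_ns([128.0, 128.0])          # hypothetical in-cube hops
cross_sip = credit_latency_ns([128.0, 32.0, 128.0])  # hypothetical slower off-chip hop
```

The cross-SIP path returns credits more slowly only because its bottleneck link is slower, so no per-path constant is needed.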
+
+#### Component coupling — SimPy Store channel
+
+PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init
+time, **a SimPy Store is wired between the two** (a per-direction
+fast-path channel) and credit metadata is `put` into that store.
+
+```python
+class PeIpcqComponent:
+    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
+        yield env.timeout(latency_ns)
+        yield peer_credit_store.put(IpcqCreditMetadata(consumer_seq=my_tail, ...))
+```
+
+Backend init wires both directions of the fast-path channel as part of
+fan-out (see `IpcqInitMsg` in D12).
+
+#### Credit-return fast path limitations
+
+- `credit_size_bytes` is an estimate (typically 16–64 bytes).
+- The fast path is **excluded from vc_comm BW contention** (separate
+  wire). Real HW credit-return wires are very lightweight, so this is a
+  reasonable first approximation.
+- A follow-up ADR can: model the credit fast path as a separate link
+  (BW limit + contention), or switch to piggyback (`credit_return_mode:
+  piggyback`).
+
+#### PE_DMA's added responsibility
+
+When `vc_comm` receives a token, PE_DMA processes it as the following
+**atomic** sequence. **No SimPy yield is allowed between the two steps**
+(invariant I6):
+
+```python
+def _on_vc_comm_recv(self, env, token):
+    # ── ATOMIC: no yield between steps 1 and 2 ──
+    # 1. Copy the payload into the local rx slot
+    data = self._memory_store.read(token.src_space, token.src_addr,
+                                   shape=..., dtype=...)
+    self._memory_store.write(token.dst_endpoint.buffer_kind,
+                             token.dst_addr, data)
+    # 2. Forward metadata to the local PE_IPCQ
+    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
+    # ───────────────────────────────────────────────────
+```
+
+The final `put` is yieldable but uses an unbounded internal store, so
+it completes in a single step. That `put` is the closing call of the
+atomic block; nothing may be inserted between the `MemoryStore.write`
+and that `put`.
+
+### D9.5. ADR-0020 (2-pass) integration
+
+`tl.send` / `tl.recv` integrate with ADR-0020's two-pass model. Phase
+1 simulates timing **and** moves data via MemoryStore; Phase 2 enables
+op-log-based correctness verification.
+
+#### Phase 1 (timing + data)
+
+D9 models head and tail updates with two different mechanisms:
+
+- **Send-side (head update)** — DMA payload piggyback. Data write and
+  metadata forward happen in the same SimPy step → automatic atomic
+  visibility.
+- **Recv-side (tail credit return)** — fast-path SimPy Store channel
+  with bottleneck-BW latency, then `peer_tail_cache` update.
+
+Together they preserve ring-buffer pointer consistency.
+
+The op-log records `op_kind="ipcq"` entries for sends (with
+`src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with
+`recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).
+Two recv modes:
+
+- **`return_slot`** (default): the slot address is returned to the
+  kernel. Zero-copy.
+- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`,
+  PE_IPCQ copies the slot data into the user dst.
+
+#### Phase 2 (op_log replay)
+
+When `DataExecutor` encounters an `op_kind="ipcq"` record:
+
+- **send**: idempotent `src → dst` ndarray write.
+- **recv (`return_slot`)**: no-op (the slot already holds the data).
+- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.
+
+IPCQ ops are pure data movement — Phase 2 has nothing extra to compute.
+The downstream GEMM / Math ops in `DataExecutor` will consume the data
+and naturally validate correctness.
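The replay rules above can be sketched against a dictionary standing in for MemoryStore. The record layout is a simplified, hypothetical subset of the op_log fields (a `kind` key distinguishes send from recv here), and `replay_ipcq` is not the real `DataExecutor`.

```python
def replay_ipcq(op_log, memory):
    """Apply ipcq records to `memory` in t_start order.

    memory maps (space, addr) -> payload list.
    """
    for rec in sorted(op_log, key=lambda r: r["t_start"]):
        if rec["op_kind"] != "ipcq":
            continue
        if rec["kind"] == "send":
            memory[rec["dst"]] = list(memory[rec["src"]])    # src -> peer rx slot
        elif rec["recv_mode"] == "copy_to_dst":
            memory[rec["dst"]] = list(memory[rec["src"]])    # rx slot -> user dst
        # recv with return_slot: the slot already holds the data, nothing to do


mem = {("hbm", 0x100): [1.0, 2.0]}
log = [
    {"op_kind": "ipcq", "kind": "recv", "recv_mode": "copy_to_dst", "t_start": 20,
     "src": ("tcm", 0x0), "dst": ("hbm", 0x200)},
    {"op_kind": "ipcq", "kind": "send", "t_start": 5,
     "src": ("hbm", 0x100), "dst": ("tcm", 0x0)},
]
replay_ipcq(log, mem)
replay_ipcq(log, mem)  # a second pass leaves memory unchanged
```

Sorting by `t_start` is what makes the recv copy see the data written by the earlier send, even though the records appear in the opposite order in the list.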
+
+### D10. Host CCL init keeps the PyTorch shape
+
+The host code looks just like real PyTorch DDP. `init_process_group`
+creates the backend object; it does **not** receive IPCQ knobs
+(neighbor topology, buffer_kind, backpressure …).
+
+```python
+# benches/ccl_allreduce.py — same shape as real PyTorch
+def worker(rank, world_size, torch):
+    dist = torch.distributed
+    dist.init_process_group(backend="ahbm")  # reads ccl.yaml + topology
+    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
+    tensor.copy_(torch.from_numpy(init))
+    dist.all_reduce(tensor, op="sum")
+```
+
+The IPCQ configuration is decided by the backend at
+`init_process_group` time: it loads `ccl.yaml`, picks the algorithm,
+and pushes IPCQ neighbor tables to every participating PE_IPCQ. The
+host code never has to know about IPCQ.
+
+A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`.
+Switching algorithms is purely a `ccl.yaml` change — no host edits
+required.
+
+#### Init flow (eager)
+
+1. `init_process_group(backend="ahbm")` is called.
+2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
+3. Pulls topology + buffer_kind + backpressure + slot config from
+   `algorithms[<algo>]`.
+4. **Immediately** installs neighbor tables on every PE_IPCQ
+   (sideband or fabric `IpcqInitMsg`).
+5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally —
+   PE_IPCQ is already prepared whether the kernel is a CCL kernel or
+   not.
+
+### D11. CCL config file (`ccl.yaml`)
+
+IPCQ config and algorithm metadata live in a separate YAML file,
+following the same pattern as `components.yaml` and `topology.yaml`.
+
+A single benchmark execution runs one algorithm
+(`defaults.algorithm`). Switching algorithms means editing
+`defaults.algorithm` only.
+
+```yaml
+defaults:
+  algorithm: ring_allreduce_tcm
+  buffer_kind: tcm                # tcm | hbm | sram
+  backpressure: sleep             # poll | sleep
+  n_slots: 8
+  slot_size: 4096
+  vc_chunk_size: 256
+  ipcq_credit_size_bytes: 16
+
+algorithms:
+  ring_allreduce_tcm:
+    module: kernbench.ccl.algorithms.ring_allreduce
+    topology: ring_1d             # builtin name or "custom"
+    buffer_kind: tcm
+    n_elem: 8                     # optional, per-algorithm tile width
+
+  tree_allreduce_7:
+    module: kernbench.ccl.algorithms.tree_allreduce
+    topology: tree_binary
+    buffer_kind: tcm
+    world_size: 7                 # algorithm-level override
+    n_elem: 16
+
+  custom_mesh:
+    module: kernbench.ccl.algorithms.custom_mesh
+    topology: custom              # the module supplies its own neighbors()
+```
+
+`world_size` is **normally left unset in `defaults`**. The backend resolves it via:
+`algorithm-level override > defaults override > topology spec`. The
+last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP
+where `WORLD_SIZE` comes from env vars rather than config files.
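The precedence rule can be written as a short function. This is a sketch under the assumption that the three sources are available as plain dictionaries; `resolve_world_size` and the topology key names are illustrative, taken from the fallback expression above rather than from the backend code.

```python
def resolve_world_size(algo_cfg: dict, defaults: dict, topology: dict) -> int:
    """Pick world_size: algorithm entry first, then defaults, then topology."""
    if "world_size" in algo_cfg:
        return int(algo_cfg["world_size"])
    if "world_size" in defaults:
        return int(defaults["world_size"])
    return topology["sips"] * topology["cubes_per_sip"] * topology["pes_per_cube"]


topo = {"sips": 2, "cubes_per_sip": 2, "pes_per_cube": 4}  # hypothetical sizes
from_topology = resolve_world_size({}, {}, topo)
from_override = resolve_world_size({"world_size": 7}, {}, topo)
```

The `tree_allreduce_7` entry in the YAML above corresponds to the second call: its algorithm-level `world_size: 7` wins over whatever the topology would give.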
+
+#### Algorithm module structure
+
+Each algorithm module exports two hooks — `kernel` (required) and
+`neighbors` (optional) — plus a `kernel_args` helper that the
+backend uses to populate positional kernel arguments at `all_reduce`
+time:
+
+```python
+# src/kernbench/ccl/algorithms/ring_allreduce.py
+
+def kernel_args(world_size: int, n_elem: int) -> tuple:
+    return (n_elem, world_size)
+
+
+def kernel(t_ptr, n_elem, world_size, tl):
+    """Required — the PE kernel.
+
+    IPCQ is already installed by the backend before this is called.
+    The kernel only uses the four-direction send / recv API.
+    """
+    ...
+
+
+def neighbors(rank, world_size, neighbor_map):
+    """Optional — override the builtin topology's neighbor map.
+
+    Returns a new dict, the modified-in-place dict, or None to keep the
+    builtin map.
+    """
+    return None
+```
+
+#### `neighbors` override patterns
+
+- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
+- **Pattern B — replace entirely**: ignore `neighbor_map` and return a
+  brand-new dict.
+- **Pattern C — keep builtin**: omit `neighbors` or return None.
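As an illustration of Pattern A, the hook below turns a builtin ring into an open line by cutting the wrap-around link. It assumes `neighbor_map` is the calling rank's own `{direction: peer}` dictionary; the ADR does not spell out that shape, so treat it as a guess.

```python
def neighbors(rank, world_size, neighbor_map):
    """Pattern A: open the ring into a line by cutting the wrap-around link."""
    tweaked = dict(neighbor_map)            # copy so the builtin map is untouched
    if rank == 0:
        tweaked.pop("W", None)              # rank 0 has no left neighbor
    if rank == world_size - 1:
        tweaked.pop("E", None)              # last rank has no right neighbor
    return tweaked


builtin = {"E": 1, "W": 3}                  # rank 0 in a 4-rank ring_1d
line_rank0 = neighbors(0, 4, builtin)
line_rank1 = neighbors(1, 4, {"E": 2, "W": 0})
```

Interior ranks keep both directions, so only the two end ranks lose a port. A kernel using this map must not send in the removed direction, or it hits failure mode F1.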
+
+#### Builtin topologies
+
+| topology | direction set |
+|----------|---------------|
+| `ring_1d` | E, W |
+| `ring_1d_unidir` | E only |
+| `mesh_2d` | N, S, E, W |
+| `tree_binary` | parent, child_left, child_right |
+| `none` | (empty) — algorithm must supply `neighbors()` |
+
+#### Adding a new algorithm
+
+1. Write `kernel` and `kernel_args` in
+   `src/kernbench/ccl/algorithms/<algo>.py`.
+2. Add an entry in `ccl.yaml`'s `algorithms` section.
+3. (Optional) provide `neighbors()` for custom topology.
+4. Set `defaults.algorithm` to the new algorithm.
+
+The host bench (`benches/ccl_allreduce.py`) does not change.
+
+### D12. Message / token schema
+
+The new message types added by this ADR. They live in
+`src/kernbench/common/pe_commands.py` and
+`src/kernbench/runtime_api/kernel.py`.
+
+#### `IpcqInitMsg` (sideband, fan-out at init)
+
+The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors
+`MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`).
+Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`,
+`my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store`
+field: a `simpy.Store` instance pre-wired so the PE_IPCQ returning a
+credit (the data receiver) can push `IpcqCreditMetadata` directly into
+the data sender's input queue.
+
+#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)
+
+Carries `direction`, source addr/space, nbytes, shape, dtype, and a
+handle id. `data_op=True` so it lands in the op_log.
+
+#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)
+
+Carries `direction` (or None for round-robin), `recv_mode`
+(`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape,
+dtype, blocking flag.
+
+#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)
+
+Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`)
+plus the head metadata (`sender_seq`, `src_sip/cube/pe`,
+`src_direction`). PE_DMA picks the channel by token type
+(`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`).
+
+The receiver's PE_DMA, on token arrival, performs the I6 atomic
+sequence: write data into MemoryStore, then forward `IpcqMetaArrival`
+to the local PE_IPCQ.
+
+#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)
+
+Carries `consumer_seq` (= my_tail), source PE coords, and source
+direction. Travels through the dedicated SimPy Store channel rather
+than `vc_comm`. Latency = `credit_size_bytes / bottleneck_bw_on_path`.
+
+There is **no `IpcqPtrUpdate` event** — head updates flow via D9
+piggyback, tail updates via the D9 fast-path channel.
+
+### D13. Test strategy
+
+Following the ADR-0021 D8 pattern.
+
+#### T1. Unit tests (component-level)
+
+- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure
+  immediately forwards a token; full peer slot triggers backpressure
+  (poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`;
+  round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`.
+- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute`
+  / `vc_comm` independent progress, chunk interleave, BW split.
+- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d /
+  mesh_2d / tree_binary correctness, mesh_2d non-square →
+  `ValueError`, custom resolver returns the module's `neighbors`.
+
+#### T2. Integration tests (E2E send/recv)
+
+- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional
+  no-deadlock), 4×4 mesh.
+- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode
+  records `ipcq` ops in op_log; DataExecutor produces correct
+  `out.data`.
+
+#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)
+
+`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA
+consistency, per-`buffer_kind` allocation.
+
+#### T4. Regression
+
+All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for
+non-CCL benches.
+
+#### T5. Performance / overhead
+
+Single send/recv pair latency = (DMA latency) + (IPCQ overhead).
+Should be close to a regular PE_DMA write of the same nbytes (IPCQ
+overhead < 100 ns).
+
+### D14. Invariants and failure modes
+
+#### Invariants
+
+I1. **Slot lifecycle exactly-once**: one send → exactly one recv.
+I2. **Pointer monotonicity**: `my_head` / `my_tail` are monotonically
+   non-decreasing; `sender_seq` is strictly increasing.
+I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank
+   B, then rank B's reverse-direction peer must be rank A. Verified at
+   init.
+I4. **`buffer_kind` consistency**: all PEs in a process group share
+   the same `buffer_kind` (no mixed mode in the first cut).
+I5. **op_log ordering**: send → DMA complete → recv possible. The
+   t_start order in op_log respects this causality.
+I6. **Atomic data + metadata visibility (MUST)**: at the receiver
+   side, data write (`MemoryStore.write`) and metadata forward
+   (`peer_head_cache` update) **must execute in the same SimPy step**.
+   No yield is allowed between the two operations in PE_DMA's vc_comm
+   handler. Code review must reject any inserted `yield` (or `yield
+   from`): it would open a window in which the data is written but
+   `peer_head_cache` has not yet been updated, breaking I7.
+I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6,
+   the step in which `peer_head_cache > my_tail` becomes true is the
+   same step in which the slot data is observable.
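Invariant I3 lends itself to a mechanical init-time check. The sketch below works on a plain `{rank: {direction: peer}}` table and covers only the N/S/E/W directions; the function name and table shape are assumptions, and tree-style directions would need their own reverse mapping.

```python
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def check_endpoint_consistency(table: dict) -> None:
    """Raise ValueError if any link lacks a matching reverse-direction link (I3)."""
    for rank, dirs in table.items():
        for direction, peer in dirs.items():
            back = table.get(peer, {}).get(OPPOSITE[direction])
            if back != rank:
                raise ValueError(
                    f"rank {rank} dir {direction} -> {peer}, "
                    f"but {peer} dir {OPPOSITE[direction]} -> {back}"
                )


good = {0: {"E": 1, "W": 1}, 1: {"E": 0, "W": 0}}   # 2-rank ring
bad = {0: {"E": 1}, 1: {"W": 2}, 2: {}}             # rank 1 points W at the wrong peer
check_endpoint_consistency(good)
try:
    check_endpoint_consistency(bad)
    bad_rejected = False
except ValueError:
    bad_rejected = True
```

Running a check of this kind during backend init is what lets F4 report a broken topology at `init_process_group` time instead of as a hang later.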
+
+#### Failure modes (runtime errors)
+
+F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction
+   → `IpcqInvalidDirection`, simulation aborts.
+F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched
+   send and recv. Not validated by default; opt-in strict mode catches
+   it (`strict_validation: true` in a PE_IPCQ node's attrs).
+F3. **Deadlock detection (timeout-based)**: the simulator empties its
+   schedule while a send/recv is still pending → engine raises
+   `IpcqDeadlock` and embeds a pointer dump.
+F4. **Backend init failure**: missing `defaults.algorithm`, missing
+   `algorithms[name]`, module import failure, topology validation
+   failure (I3, I4) — all raised at `init_process_group` time.
+F5. **Slot full + infinite backpressure**: the peer never recvs.
+   Surfaces as F3 timeout.
+
+#### Diagnostics
+
+- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as
+  `(rank, t, dir, nbytes)`.
+- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)`
+  prints every PE_IPCQ ring buffer's `my_head`, `my_tail`,
+  `peer_head_cache`, `peer_tail_cache`.
+- **Deadlock dump**: on hang the engine includes the pointer dump in
+  the `IpcqDeadlock` exception message.
+
+### D15. Algorithm-author cheat sheet
+
+The full step-by-step guide lives in
+[`docs/ccl-author-guide.en.md`](../ccl-author-guide.en.md). The
+shortest version:
+
+| Things you touch | Things you don't |
+|------------------|-------------------|
+| `src/kernbench/ccl/algorithms/<your_algo>.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
+| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
+| (Optional) `tests/test_<your_algo>.py` mock test | PE_IPCQ component, AhbmCCLBackend |
+
+5-step flow: write the kernel → register in `ccl.yaml` → optional
+`neighbors` override → optional mock unit test → SimPy validation via
+`kernbench run --bench ccl_allreduce --verify-data`.
+
+Common mistakes: using a direction that wasn't installed, sends
+without matching recvs (deadlock), dtype/shape disagreement, assuming
+fairness from `tl.recv()` round-robin, confusing
+`tl.num_programs(axis)` with the CCL group size.
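
Two of these mistakes (unmatched sends and dtype/shape disagreement) can be caught before running the simulator. The sketch below assumes a hypothetical trace of `(kind, src, dst, dtype, shape)` records and FIFO matching per link; `check_trace` is not part of `kernbench.ccl`.

```python
from collections import defaultdict, deque


def check_trace(trace):
    """Return a list of problems found in a toy send/recv trace.

    Each record is (kind, src, dst, dtype, shape), where kind is "send" or
    "recv" and (src, dst) names the link. Records on a link match in FIFO order.
    """
    sends = defaultdict(deque)
    recvs = defaultdict(deque)
    for kind, src, dst, dtype, shape in trace:
        queue = sends if kind == "send" else recvs
        queue[(src, dst)].append((dtype, tuple(shape)))

    problems = []
    for link in sorted(set(sends) | set(recvs)):
        s, r = sends[link], recvs[link]
        # Compare metadata of the i-th send with the i-th recv on this link.
        for i, (s_meta, r_meta) in enumerate(zip(s, r)):
            if s_meta != r_meta:
                problems.append(f"{link}: message {i} mismatch {s_meta} vs {r_meta}")
        # Leftover entries on either side would block forever.
        if len(s) > len(r):
            problems.append(f"{link}: {len(s) - len(r)} send(s) without recv")
        elif len(r) > len(s):
            problems.append(f"{link}: {len(r) - len(s)} recv(s) without send")
    return problems


trace = [
    ("send", 0, 1, "f32", (4,)),
    ("recv", 0, 1, "f16", (4,)),   # dtype disagreement
    ("send", 1, 0, "f32", (4,)),   # never received
]
for problem in check_trace(trace):
    print(problem)
```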
+
+---
+
+## Non-goals
+
+- **Host collective**: a model where `dist.all_reduce` itself moves
+  data on the host side is out of scope. This ADR only covers
+  communication that happens inside the PE kernel.
+- **All-reduce algorithms**: ring / tree / etc. live in algorithm
+  modules and can be added without amending this ADR.
+- **Reliability / error handling**: link faults, send/recv failure
+  recovery, etc. are out of scope.
+- **NoC arbiter precision**: dynamic VC contention is left for a future
+  ADR (see D8).
+
+---
+
+## Open questions
+
+- **VC arbitration accuracy** — the first cut uses deterministic
+  chunk interleave + weighted round-robin; heavy contention may report
+  optimistic latency. A NoC arbiter component can be added later.
+- **Credit return BW model** — the fast path is currently outside the
+  fabric BW contention model. Can be modeled as a separate link or
+  switched to piggyback (`credit_return_mode: piggyback`).
+- **Ring buffer slot allocation metadata** — whether the host pushes
+  IPCQ buffer metadata via sideband or via a fabric message similar to
+  `MmuMapMsg` is open.
+- **VC BW split default** — 50/50 vs. weighted (e.g. 80/20). Exposed in
+  `ccl.yaml`; default value TBD.
+- **Direction count** — 4 (N/S/E/W) is fixed in the first cut; 6
+  (with Up/Down for 3D) or N (variable) is future work.
+- **Multi-tile aggregation primitives** — whether
+  `tl.recv_all` or similar is needed for fan-in.
+- **Round-robin recv fairness** — current weak fairness can starve;
+  a strict fairness counter is future work.
+- **Deadlock detection precision** — currently timeout-based; a
+  realtime wait-for graph would enable deterministic detection.
+
+---
+
+## Consequences
+
+### Positive
+
+- Direct PE-to-PE communication makes it possible to write CCL kernels.
+- Host stays minimal (just `launch`) and synchronization happens inside
+  the PE → strong compute / comm overlap.
+- VCs eliminate HoL blocking → collective latency is not blocked by
+  compute traffic.
+- Buffer placement and backpressure mode are init-time parameters →
+  easy to benchmark.
+- Four-direction logical neighbors → host is free to map
+  ring/mesh/tree algorithms.
+
+### Negative
+
+- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
+- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE.
+- VC arbitration is a first-order approximation; heavy contention
+  scenarios may report slightly optimistic latency vs real HW (D8).
+- Chunk-level interleave makes PE_DMA implementation more complex.
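
The memory-cost bullet above is plain arithmetic. A worked instance follows, with `slot_size` and `n_slots` picked purely for illustration (they are not defaults taken from `ccl.yaml`), and with `ipcq_bytes_per_pe` as a hypothetical helper name.

```python
def ipcq_bytes_per_pe(slot_size: int, n_slots: int, n_rings: int = 8) -> int:
    """IPCQ buffer footprint per PE: rings x slot_size x n_slots."""
    return n_rings * slot_size * n_slots


# Illustrative values only.
slot_size = 4096   # bytes per slot
n_slots = 8        # slots per ring
total = ipcq_bytes_per_pe(slot_size, n_slots)
print(f"{total} bytes = {total // 1024} KiB per PE")  # 262144 bytes = 256 KiB per PE
```

Doubling either `slot_size` or `n_slots` doubles the footprint, which is the trade-off to weigh against backpressure stalls when tuning these init-time parameters.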
+
+---
+
+## Affected files
+
+| File | Change |
+|------|--------|
+| `topology.yaml` | Add `pe_ipcq` to `pe_template`, plus the IPCQ ↔ DMA / CPU / TCM edges. |
+| `components.yaml` | Register `pe_ipcq_v1`. |
+| `src/kernbench/topology/builder.py` | Wire the IPCQ chain into PE-internal edges. |
+| `src/kernbench/components/builtin/pe_ipcq.py` | New. |
+| `src/kernbench/components/builtin/pe_dma.py` | Add VCs, handle `IpcqDmaToken`. |
+| `src/kernbench/common/pe_commands.py` | `IpcqSendCmd`, `IpcqRecvCmd`, `IpcqDmaToken`. |
+| `src/kernbench/triton_emu/tl_context.py` | `tl.send` / `tl.recv` API. |
+| `src/kernbench/runtime_api/distributed.py` | Eager IPCQ install in `AhbmCCLBackend.__init__`. |
+| `src/kernbench/runtime_api/kernel.py` | `IpcqInitMsg` definition. |
+| `src/kernbench/ccl/__init__.py` | New CCL package. |
+| `src/kernbench/ccl/topologies.py` | Builtin topology generators + `resolve_topology()`. |
+| `src/kernbench/ccl/helpers.py` | Algorithm-author helpers (`chunked`, `ring_step`, `tree_step`). |
+| `src/kernbench/ccl/testing.py` | Mock CCL runtime (`run_kernel_in_mock`). |
+| `src/kernbench/ccl/algorithms/*.py` | Algorithm modules (kernel + `kernel_args` + optional `neighbors`). |
+| `ccl.yaml` | Algorithm metadata + IPCQ defaults. |
+| `tests/test_pe_ipcq.py` | PE_IPCQ unit tests. |
+| `tests/test_pe_dma_vc.py` | PE_DMA VC tests. |
+| `tests/test_ipcq_e2e.py` | End-to-end send/recv tests. |
+| `tests/test_ccl_topologies.py` | Builtin topology generator tests. |
+| `tests/test_ccl_allreduce_matrix.py` | Unified bench × algorithm matrix. |