kernbench2/docs/adr/ADR-0023-dev-ipcq-pe-collective.md

# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication

## Status

Accepted

## Context

### Goal

Add the infrastructure that lets CCL (Collective Communication Library)
kernels run **inside** a PE. The host just launches a kernel on each
SIP; the actual synchronization and data movement happen **inside the
PE kernel via an IPCQ (Inter-Process Communication Queue)**.

This mirrors how NCCL performs NVLink communication inside a GPU
kernel, or how Cerebras / Tenstorrent expose core-local communication
queues. Host-level collectives (`dist.all_reduce`) are deferred to
**future work**; this ADR focuses solely on the kernel-side collective
infrastructure.

### Problems to solve

1. PE-to-PE direct data movement (writing into a peer's memory).
2. Synchronization — the sender must check that the receiver has space
   in its buffer (backpressure).
3. Resource contention between compute traffic and communication
   traffic (Head-of-Line blocking).
4. The host must be able to construct logical neighbor topologies
   (ring / mesh / tree) per algorithm.

---

## Decision

### D1. Add a new `PE_IPCQ` component

A new component `PE_IPCQ` is added inside each PE. It follows the same
pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a
distinct component.

```
PE
├── PE_CPU
├── PE_SCHEDULER
├── PE_DMA
├── PE_IPCQ          ← new
├── PE_FETCH_STORE
├── PE_GEMM
├── PE_MATH
├── PE_TCM
└── PE_MMU
```

**Role separation** (control plane vs. data plane):

- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head /
  tail pointer management, peer pointer caches, backpressure, 4-direction
  neighbor mapping.
- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe
  / PCIe into the peer's memory.

PE_IPCQ does **not** move data itself — it delegates to PE_DMA.

### D2. Ring buffer model

Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.

```python
@dataclass
class IpcqQueuePair:
    direction: Direction          # N/S/E/W
    peer: IpcqEndpoint            # set by host at init time (D2.5)
    tx_buffer_base: int           # outgoing data base addr (in our memory)
    rx_buffer_base: int           # incoming data base addr (in our memory)
    slot_size: int                # 1 tile per slot
    n_slots: int                  # ring depth
    my_head: int                  # next slot we will write/send into
    my_tail: int                  # next slot we will read/recv from
    peer_head_cache: int          # peer's last-seen head (updated via D9 piggyback)
    peer_tail_cache: int          # peer's last-seen tail (updated via D9 fast-path credit)
```

**Canonical field names**: throughout this ADR the four names above
(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used
consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`,
etc.) are not used.

| Field | Owner | Updated when |
|-------|-------|--------------|
| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |

**Slot unit**: slots are fixed-size, and each slot holds one full tile.
The tile data is embedded in the slot itself (no descriptor
indirection). See D5.

### D2.5. `IpcqEndpoint` schema

`IpcqQueuePair.peer` carries everything the sender needs to compute the
peer's rx slot address:

```python
@dataclass(frozen=True)
class IpcqEndpoint:
    sip: int
    cube: int
    pe: int
    buffer_kind: str             # "tcm" | "hbm" | "sram"
    rx_base_pa: int              # peer rx_buffer base PA (PhysAddr.encode())
    rx_base_va: int              # peer rx_buffer base VA (optional, MMU mode)
    n_slots: int                 # peer ring depth (for wrap-around)
    slot_size: int               # peer slot size (for offset)
```

Address computation:

```python
slot_idx = self.my_head % peer.n_slots
dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
```

PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA
(vc_comm) routes the data to `dst_pa` through the fabric.
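
As a worked example of the address computation above, the following minimal sketch shows the wrap-around behaviour. `PeerRing` is a hypothetical stand-in that carries only the three `IpcqEndpoint` fields used here, and the base address is an illustrative value, not one taken from the simulator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerRing:
    """Hypothetical stand-in holding only the IpcqEndpoint fields used here."""
    rx_base_pa: int
    n_slots: int
    slot_size: int


def rx_slot_pa(peer: PeerRing, my_head: int) -> int:
    """Physical address of the peer rx slot that send number `my_head` targets."""
    slot_idx = my_head % peer.n_slots          # wrap around the ring
    return peer.rx_base_pa + slot_idx * peer.slot_size


peer = PeerRing(rx_base_pa=0x10000, n_slots=8, slot_size=4096)
first = rx_slot_pa(peer, my_head=0)    # slot 0
last = rx_slot_pa(peer, my_head=7)     # slot 7, the last slot in the ring
wrapped = rx_slot_pa(peer, my_head=8)  # wraps back to slot 0
```

Because `my_head` only ever increases, the modulo is what keeps the destination inside the peer's ring; the sender needs nothing beyond the endpoint fields to compute it.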

**Endpoint construction order**: at backend init (D10), the IPCQ
buffers for **every PE** are allocated first (so each rank knows the
others' PA), then the per-rank neighbor tables are built and pushed to
PE_IPCQ via `IpcqInitMsg`.

### D3. Four-direction mapping ≡ logical ProcessGroup

The PE views four directions (N/S/E/W) as logical ports. Real peer
addresses are configured by the host CCL init, per the chosen
algorithm. The PE kernel never knows the topology, only directions.

```python
# 1D ring
for rank in range(world_size):
    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])

# 2D mesh
for r in range(R):
    for c in range(C):
        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
```

The PE code does not need to know where `tl.send(dir="E", ...)` actually
ends up.
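
The two loops above can be sketched as plain neighbor tables, which also makes the reverse-direction rule (invariant I3 in D14) easy to check. This is a sketch under assumptions: the table layout (`rank -> {direction: peer}`) and the helper names `ring_1d`, `mesh_2d`, and `is_reciprocal` are illustrative and are not the backend's actual API.

```python
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def ring_1d(world_size):
    """Neighbor map for a bidirectional 1D ring: rank -> {direction: peer rank}."""
    return {
        rank: {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}
        for rank in range(world_size)
    }


def mesh_2d(rows, cols):
    """Neighbor map for a wrap-around 2D mesh keyed by (row, col)."""
    table = {}
    for r in range(rows):
        for c in range(cols):
            table[(r, c)] = {
                "N": ((r - 1) % rows, c),
                "S": ((r + 1) % rows, c),
                "E": (r, (c + 1) % cols),
                "W": (r, (c - 1) % cols),
            }
    return table


def is_reciprocal(table):
    """True if every A --dir--> B link has a matching B --opposite(dir)--> A link."""
    return all(
        table[peer].get(OPPOSITE[direction]) == rank
        for rank, dirs in table.items()
        for direction, peer in dirs.items()
    )


ring = ring_1d(4)
mesh = mesh_2d(3, 3)
```

Both tables satisfy `is_reciprocal`, which is the property the host init is expected to verify before pushing the tables to the PEs.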

### D4. PE kernel API

```python
# Send (blocking; may stall on backpressure)
tl.send(dir: str, src=TensorHandle)
tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)

# Recv (blocking)
recv = tl.recv(dir: str, shape=..., dtype=...)
recv = tl.recv(shape=..., dtype=...)        # round-robin across 4 directions

# Recv (non-blocking)
fut  = tl.recv_async(dir: str, shape=..., dtype=...)
recv = tl.wait(fut)
```

`tl.recv()` (no direction) keeps a `last_polled_dir` cursor. Each call
rotates through the directions starting after the cursor and returns the
first available slot. If all four directions are empty, the call waits.

**Fairness is weak**: the rotating start mitigates simple bias, but if
one direction always wins the race the others can starve. Algorithms
that need strict fairness must call `tl.recv(dir=...)` explicitly.
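
A minimal sketch of the rotating-start scan, to make the weak-fairness caveat concrete. The rotation order `N, E, S, W` and the `RoundRobinPoller` class are assumptions made for illustration; the ADR only specifies that a `last_polled_dir` cursor rotates the starting point.

```python
DIRECTIONS = ("N", "E", "S", "W")  # rotation order is an assumption


class RoundRobinPoller:
    """Rotating-start scan over the four directions (weak fairness only)."""

    def __init__(self):
        self.last_polled_dir = len(DIRECTIONS) - 1  # so the first scan starts at "N"

    def pick(self, ready):
        """Return the first direction with data, scanning from the one after the cursor.

        `ready` maps direction -> number of filled rx slots. Returns None when
        every direction is empty (the real recv would wait instead).
        """
        for step in range(1, len(DIRECTIONS) + 1):
            idx = (self.last_polled_dir + step) % len(DIRECTIONS)
            if ready.get(DIRECTIONS[idx], 0) > 0:
                self.last_polled_dir = idx
                return DIRECTIONS[idx]
        return None


poller = RoundRobinPoller()
ready = {"N": 2, "E": 0, "S": 1, "W": 0}
order = []
for _ in range(3):
    d = poller.pick(ready)
    order.append(d)
    ready[d] -= 1
```

With two tiles waiting on N and one on S, the scan alternates between them instead of draining N first. Nothing in the scan bounds how long a direction can wait, though, which is why strict fairness needs explicit `tl.recv(dir=...)` calls.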

### D5. Single-hop DMA write + full-data slot model

Data moves from sender memory into the receiver's ring slot in **one
DMA transfer**. Key properties:

- **Single-hop**: the sender already knows the peer rx slot address and
  fires one fabric DMA into it.
- **No CPU memcpy**: the CPU never copies data.
- **No intermediate staging**: neither side keeps a separate staging
  buffer (sender uses the source addr directly; receiver gets the data
  in its ring slot directly).

(Strictly speaking the fabric DMA write does happen, so this is not
literally "no data movement" — it's the same property NCCL labels
"zero-copy", meaning no CPU memcpy and no staging copy.)

```
PE A: tl.send(E, src_addr, nbytes)
  1. IPCQ computes the peer rx slot address:
       dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
  2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
                   (full → sleep / poll)
  3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
  4. my_head += 1

PE B: data = tl.recv(W)
  1. Look at rx_buffer[my_tail % n_slots]
  2. Wait for the data to arrive (D7 backpressure mode)
  3. Return the slot address to the kernel (or fetch into register file)
  4. my_tail += 1
  5. Issue a credit return over the fast path (D9): after the credit
     path latency, peer A's peer_tail_cache is updated.
```

The slot holds the full tile. The receiver only reads its own
rx_buffer; it never reads back into A's memory. The sender knows the
peer rx slot address and DMAs directly into it (single-hop).

The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local
to the PE).
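
The send / recv steps above can be sketched as a pointer-only model of one ring direction. This is a simplification under stated assumptions: `RingLink` is a hypothetical class, both ends live in one object, the DMA write and the piggyback are instantaneous, and the credit return is an explicit call rather than a delayed event.

```python
class RingLink:
    """One direction of a sender -> receiver ring, pointers only (no real data)."""

    def __init__(self, n_slots):
        self.n_slots = n_slots
        self.slots = [None] * n_slots
        self.my_head = 0          # sender side: next slot to send into
        self.peer_tail_cache = 0  # sender side: receiver tail as last seen
        self.my_tail = 0          # receiver side: next slot to read
        self.peer_head_cache = 0  # receiver side: sender head as last seen

    def try_send(self, tile):
        """Sender: returns False (would stall) when the peer ring looks full."""
        if self.my_head - self.peer_tail_cache >= self.n_slots:
            return False
        self.slots[self.my_head % self.n_slots] = tile  # stands in for the DMA write
        self.my_head += 1
        self.peer_head_cache = self.my_head             # stands in for the piggyback
        return True

    def try_recv(self):
        """Receiver: returns None when no data has arrived yet."""
        if self.peer_head_cache <= self.my_tail:
            return None
        tile = self.slots[self.my_tail % self.n_slots]
        self.my_tail += 1
        return tile

    def deliver_credit(self):
        """Credit return arrives at the sender (latency is ignored here)."""
        self.peer_tail_cache = self.my_tail


link = RingLink(n_slots=2)
sent = [link.try_send(t) for t in ("t0", "t1", "t2")]  # third send hits backpressure
got = link.try_recv()
stalled_before_credit = not link.try_send("t2")  # credit has not arrived yet
link.deliver_credit()
sent_after_credit = link.try_send("t2")
```

Note that the sender stays stalled after the receiver has consumed a slot and only resumes once the credit lands. That gap is exactly the credit-return latency that D9 models.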

### D6. Buffer placement — three-way benchmark

The host CCL init picks the IPCQ ring-buffer location:

```python
ipcq_init(
    backend="ahbm",
    buffer_kind="tcm" | "hbm" | "sram",
    n_slots=8,
    slot_size=4096,
)
```

| Location | Trait | Trade-off |
|----------|-------|-----------|
| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
| **PE-local HBM** | Large; via DMA | Higher latency |
| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |

All three locations run the same kernel code; only the init differs.

### D7. Backpressure — two-mode benchmark

How the sender waits when the peer's slots are full, or the receiver
waits when data has not yet arrived:

| Mode | Behavior | Model |
|------|----------|-------|
| **poll** | Periodically re-check the cached peer pointer | Spin loop |
| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |

```python
ipcq_init(backpressure="poll" | "sleep", ...)
```

Both modes are implemented so latency / throughput trade-offs can be
benchmarked.
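
To illustrate the kind of latency difference the benchmark is meant to expose, here is a first-order sketch of when a stalled operation resumes in each mode. It assumes that poll mode re-checks at a fixed interval; `poll_interval_ns` and the function name are hypothetical and do not come from `ccl.yaml`.

```python
import math


def wake_time_ns(mode, blocked_at, ready_at, poll_interval_ns=50.0):
    """Time at which a stalled send/recv resumes.

    blocked_at: when the operation started waiting.
    ready_at:   when the awaited condition (credit or data) became true.
    poll_interval_ns is an illustrative value, not a figure from the ADR.
    """
    if ready_at <= blocked_at:
        return blocked_at                      # nothing to wait for
    if mode == "sleep":
        return ready_at                        # woken by the peer trigger
    if mode == "poll":
        polls = math.ceil((ready_at - blocked_at) / poll_interval_ns)
        return blocked_at + polls * poll_interval_ns  # first re-check at or after ready_at
    raise ValueError(f"unknown backpressure mode: {mode!r}")


sleep_wake = wake_time_ns("sleep", blocked_at=100.0, ready_at=230.0)
poll_wake = wake_time_ns("poll", blocked_at=100.0, ready_at=230.0)
```

Under these assumptions poll mode resumes at the first re-check after the condition holds, so it can lag sleep mode by up to one poll interval.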

### D8. PE_DMA virtual channels

Extend PE_DMA from a single queue into a **two-channel virtual-channel**
model.

```
PE_DMA
├── vc_compute: tile load / store / writeback for GEMM and Math
└── vc_comm:    IPCQ send data
```

Each VC has an independent state machine:

- One channel stalling does not block the other.
- The same physical link (cube_noc, UCIe, …) is shared, but link BW is
  split between channels.

**Chunk-level interleave**:

- Large GEMM tile DMAs do not lock the link end-to-end.
- Progress happens in chunks (e.g. 256 B); each chunk shares link BW
  with the other VC's pending chunks.
- Chunk size is an init parameter (smaller = fairer, larger = more
  efficient).

Net effect:

- HoL blocking is eliminated (an IPCQ send can interleave with a long
  compute DMA).
- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM
  pattern).
- Matches the NoC-virtual-channel pattern used in real HW.

**First-implementation accuracy limit (intentional)**: this ADR's
first cut uses **deterministic chunk-level interleave + weighted
round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`).
This is a first-order approximation and is simpler than real HW
dynamic-contention / credit-based arbiters. Functional correctness is
unaffected, but heavy-contention scenarios may report slightly
optimistic latency vs. real HW. A separate ADR can add a NoC arbiter
component later if more precision is needed.
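
The deterministic chunk-level interleave with weighted round-robin can be sketched as follows. This is an illustrative model, not the PE_DMA implementation: the function names are hypothetical, weights are expressed as chunks granted per arbitration round, and an idle channel simply gives up its turn.

```python
def interleave(compute_bytes, comm_bytes, chunk=256, weights=(1, 1)):
    """Return the chunk schedule as a list of (channel, nbytes) link grants.

    weights = (compute, comm) chunks granted per arbitration round; (1, 1)
    corresponds to the default 50 / 50 split.
    """
    if min(weights) < 1:
        raise ValueError("each channel needs a weight of at least 1")
    remaining = {"vc_compute": compute_bytes, "vc_comm": comm_bytes}
    grants = dict(zip(("vc_compute", "vc_comm"), weights))
    schedule = []
    while any(remaining.values()):
        for channel, quota in grants.items():
            for _ in range(quota):
                if remaining[channel] == 0:
                    break            # idle channel gives up its turn
                sent = min(chunk, remaining[channel])
                remaining[channel] -= sent
                schedule.append((channel, sent))
    return schedule


def finish_index(schedule, channel):
    """1-based position of the last chunk granted to `channel`."""
    return max(i for i, (ch, _) in enumerate(schedule, start=1) if ch == channel)


# A 4 KiB compute tile DMA competing with a 512 B IPCQ send.
schedule = interleave(compute_bytes=4096, comm_bytes=512)
comm_done = finish_index(schedule, "vc_comm")
compute_done = finish_index(schedule, "vc_compute")
```

In this example the 512 B send completes after 4 link grants instead of waiting behind all 16 compute chunks, which is the HoL-blocking property D8 is after.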

#### Token routing

- Compute tokens (`TileToken`) — go through the existing
  PE_FETCH_STORE → PE_DMA chain.
- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA
  self-routing.
- PE_DMA picks the channel by token type.

```python
class PeDmaComponent:
    def _process(self, env, token):
        if isinstance(token, IpcqDmaToken):
            yield from self._vc_comm_process(env, token)
        else:
            yield from self._vc_compute_process(env, token)
```

### D9. Pointer synchronization — DMA payload piggyback

Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so
pointers update along with the data. This simulation adopts the same
model: **no separate control channel** — metadata travels with the
data.

The big benefits:

- **Automatic ordering**: data and metadata move on the same token, so
  the data is visible by the time `peer_head_cache` is updated. No race.
- **HW fidelity**: matches NVLink / UCIe piggybacked headers.
- **Component simplification**: no separate `IpcqPtrUpdate` event type.

#### Send flow (head update via piggyback)

```
PE A: tl.send(E, src_addr, nbytes)
  1. PE_IPCQ checks backpressure (using peer_tail_cache)
  2. PE_IPCQ creates an IpcqDmaToken:
       - data body (src_addr → peer dst_addr)
       - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
  3. Hand the token to PE_DMA(vc_comm)
  4. PE A increments my_head (send tracking)

[fabric DMA: latency elapses]

PE B's PE_DMA receives the token
  5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
  6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)

PE B's PE_IPCQ receives the metadata
  7. Updates peer_head_cache (= A's head)
  8. Wakes any pending recv on that direction
```

**Steps 5 and 6 must execute in the same SimPy step** — DMA completion
makes data and metadata atomically visible.

#### Recv flow (credit return — fast path with bottleneck-BW latency)

When the receiver frees a slot, the sender must learn about it
(backpressure release). Unlike data, the credit return does **not**
travel through general vc_comm fabric — it uses a **separate fast
path**, an abstraction of the NVLink / UCIe credit-return wire.

**Latency** is computed from the **full path latency** (per-node
overhead + edge propagation + drain), not a magic constant:

```
credit_size_bytes = 16  (ccl.yaml: ipcq_credit_size_bytes)
path = router.find_path(self_pe, peer_pe.pe_dma)
latency = compute_path_latency_ns(path, credit_size_bytes)
        = sum(edge.distance_mm * ns_per_mm)
        + sum(node_overhead_ns[n] for n in path)
        + credit_size_bytes / bottleneck_bw_on_path
```
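
A worked instance of the latency formula, as a sketch. The `Edge` type, the function name, and every numeric value (distances, overheads, bandwidths, `ns_per_mm`) are illustrative assumptions; in the simulator these come from the edge map and `node_overhead_ns`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    distance_mm: float
    bw_bytes_per_ns: float   # link bandwidth (1 byte/ns = 1 GB/s)


def credit_latency_ns(edges, node_overheads_ns, credit_size_bytes, ns_per_mm):
    """Propagation + per-node overhead + drain at the slowest link on the path."""
    propagation = sum(e.distance_mm * ns_per_mm for e in edges)
    overhead = sum(node_overheads_ns)
    drain = credit_size_bytes / min(e.bw_bytes_per_ns for e in edges)
    return propagation + overhead + drain


# In-cube path: one short, fast hop between two nodes.
in_cube = credit_latency_ns(
    [Edge(2.0, 64.0)], [1.0, 1.0], credit_size_bytes=16, ns_per_mm=0.5)

# Cross-SIP path: a long, slower middle link and two extra nodes.
cross_sip = credit_latency_ns(
    [Edge(2.0, 64.0), Edge(20.0, 16.0), Edge(2.0, 64.0)],
    [1.0, 4.0, 4.0, 1.0], credit_size_bytes=16, ns_per_mm=0.5)
```

With these made-up numbers the in-cube credit costs 3.25 ns and the cross-SIP credit 23.0 ns, so the credit cost scales with the topology rather than being a fixed constant.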

The router auto-appends `.pe_dma` to the source only, so the
destination MUST be spelled with the explicit `.pe_dma` suffix. Without
it, `find_path` raises and the credit silently teleports at zero cost
(a latent bug fixed alongside this update).

`tl.recv` blocks on the credit-emit completion (recv yields-from
`_delayed_credit_send` rather than spawning it as a fork). This puts
the credit-return cost on the receiver's `pe_exec_ns`, modeling the
IPCQ control-plane completing the consume-acknowledgement before
recv returns to the kernel — the protocol equivalent of a non-posted
`tl.store` waiting for an HBM ack on the raw DMA path.

That gives us:

- **Topology-proportional approximation**: an in-cube credit return is
  automatically faster than a cross-SIP credit return.
- **No magic constants**: every nanosecond comes from
  `compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
  as data traffic.
- **No deadlock risk**: unlike piggyback, B can issue credit even when
  it has no data to send back. `peer_credit_store.put` is unbounded.
- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
  cost on recv balances the HBM ack-trip cost RAW pays on the sender.

#### Component coupling — SimPy Store channel

PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init
time, **a SimPy Store is wired between the two** (a per-direction
fast-path channel) and credit metadata is `put` into that store.

```python
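class PeIpcqComponent:
    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
        # Model the fast-path traversal, then deliver the credit to the peer.
        yield env.timeout(latency_ns)
        # Source PE coords and source direction (D12) are elided here.
        yield peer_credit_store.put(IpcqCreditMetadata(consumer_seq=my_tail))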

Backend init wires both directions of the fast-path channel as part of
fan-out (see `IpcqInitMsg` in D12).

#### Credit-return fast path limitations

- `credit_size_bytes` is an estimate (typically 16–64 bytes).
- The fast path is **excluded from vc_comm BW contention** (separate
  wire). Real HW credit-return wires are very lightweight, so this is a
  reasonable first approximation.
- A follow-up ADR can: model the credit fast path as a separate link
  (BW limit + contention), or switch to piggyback (`credit_return_mode:
  piggyback`).

#### PE_DMA's added responsibility

When `vc_comm` receives a token, PE_DMA processes it as the following
sequence: pay the Transaction's terminal BW drain, then atomically
write data and forward metadata. **No SimPy yield is allowed between
the data write and the metadata forward** (invariant I6). The drain
yield must sit before the atomic block, not inside it:

```python
def _on_vc_comm_recv(self, env, txn):
    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
    # sender PE_DMA). MUST happen before the atomic block so recv only
    # wakes after the bytes have "landed".
    drain = getattr(txn, "drain_ns", 0.0)
    if drain > 0:
        yield env.timeout(drain)

    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    # 1. Write the payload into the local rx slot
    data = self._memory_store.read(token.src_space, token.src_addr,
                                   shape=..., dtype=...)
    self._memory_store.write(token.dst_endpoint.buffer_kind,
                             token.dst_addr, data)
    # 2. Forward metadata to the local PE_IPCQ
    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
    # ───────────────────────────────────────────────────
```

The final `put` is yieldable but uses an unbounded internal store, so
it completes in a single step. That `put` is the closing call of the
atomic block; nothing that yields may be inserted between the `write`
and that `put`.

#### Drain-at-inbound semantics (D9 timing model)

The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
is paid at each forwarding component via `run()`, and the remaining
BW drain is paid once at the Transaction's terminal. Every non-IPCQ
Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
`ComponentBase._forward_txn` at the terminal node. For IPCQ the
destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
(so IPCQ-specific data write + metadata forward can happen), so **the
drain MUST be paid explicitly at the top of that handler** to keep
IPCQ's timing model on par with every other fabric Transaction.

Side-effects of paying drain here:

- **SRC `tl.send`** is unchanged — fire-and-forget semantics are
  preserved because the sender PE_DMA does not `yield sub_done`. The
  `sub_done.succeed()` call (made after metadata forward below) is an
  event with no listener on the sender side.
- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
  when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
  forward now happens after the drain, recv observes the full fabric
  transfer time including bandwidth cost.

Matches the physical picture: send dispatches and leaves; recv waits
until the bytes have actually been drained into its inbox.
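
The timing consequence can be sketched with a small timeline helper. This is a first-order illustration under assumptions: the function name and the example overheads and bandwidth are hypothetical, and IPCQ-local overhead on the sender is ignored so that `tl.send` returns at dispatch time.

```python
def ipcq_timeline(t_send_ns, hop_overheads_ns, nbytes, bottleneck_bw_bytes_per_ns):
    """Key timestamps of one send/recv pair under the drain-at-inbound model."""
    drain_ns = nbytes / bottleneck_bw_bytes_per_ns   # stamped by the sender PE_DMA
    inbound_ns = t_send_ns + sum(hop_overheads_ns)   # per-hop overhead along the path
    return {
        "send_returns_ns": t_send_ns,                # fire-and-forget
        "inbound_ns": inbound_ns,                    # Transaction reaches the destination PE_DMA
        "recv_wakes_ns": inbound_ns + drain_ns,      # metadata is forwarded after the drain
    }


timeline = ipcq_timeline(
    0.0, [2.0, 5.0, 2.0], nbytes=4096, bottleneck_bw_bytes_per_ns=64.0)
```

For a 4 KiB slot over a 64 B/ns bottleneck, the receiver wakes 64 ns after the Transaction reaches its PE_DMA, while the sender's timeline is unaffected.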

### D9.5. ADR-0020 (2-pass) integration

`tl.send` / `tl.recv` integrate with ADR-0020's two-pass model. Phase
1 simulates timing **and** moves data via MemoryStore; Phase 2 enables
op-log-based correctness verification.

#### Phase 1 (timing + data)

D9 models head and tail updates with two different mechanisms:

- **Send-side (head update)**: DMA payload piggyback. Data write and
  metadata forward happen in the same SimPy step, which gives automatic
  atomic visibility.
- **Recv-side (tail credit return)**: fast-path SimPy Store channel
  with the full path latency from D9, then the `peer_tail_cache` update.

Together they preserve ring-buffer pointer consistency.

The op-log records `op_kind="ipcq"` entries for sends (with
`src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with
`recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).
Two recv modes:

- **`return_slot`** (default): the slot address is returned to the
  kernel. Zero-copy.
- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`,
  PE_IPCQ copies the slot data into the user dst.

#### Phase 2 (op_log replay)

When `DataExecutor` encounters an `op_kind="ipcq"` record:

- **send**: idempotent `src → dst` ndarray write.
- **recv (`return_slot`)**: no-op (the slot already holds the data).
- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.

IPCQ ops are pure data movement — Phase 2 has nothing extra to compute.
The downstream GEMM / Math ops in `DataExecutor` will consume the data
and naturally validate correctness.
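
The three replay rules can be sketched over a dictionary-backed memory. The record layout used here (`kind`, `recv_mode`, `src`, `dst` as `(space, addr)` pairs) is a simplified, hypothetical schema for illustration and is not the real op-log format consumed by `DataExecutor`.

```python
def replay_ipcq(memory, record):
    """Apply one op_kind="ipcq" record to `memory` (a dict of (space, addr) -> data)."""
    if record["kind"] == "send":
        memory[record["dst"]] = list(memory[record["src"]])   # src -> peer rx slot
    elif record["recv_mode"] == "copy_to_dst":
        memory[record["dst"]] = list(memory[record["src"]])   # slot -> user dst
    elif record["recv_mode"] != "return_slot":
        raise ValueError(f"unknown recv_mode: {record['recv_mode']!r}")
    # return_slot: nothing to do, the slot already holds the data


memory = {("tcm_a", 0x1000): [1.0, 2.0, 3.0]}
op_log = [
    {"kind": "send", "src": ("tcm_a", 0x1000), "dst": ("tcm_b", 0x8000)},
    {"kind": "recv", "recv_mode": "return_slot",
     "src": ("tcm_b", 0x8000), "dst": None},
    {"kind": "recv", "recv_mode": "copy_to_dst",
     "src": ("tcm_b", 0x8000), "dst": ("hbm_b", 0x40)},
]
for rec in op_log:
    replay_ipcq(memory, rec)
snapshot = {key: list(value) for key, value in memory.items()}
for rec in op_log:          # replaying again must not change anything
    replay_ipcq(memory, rec)
```

Replaying the same log twice leaves memory unchanged, which is the idempotence property the replay rules rely on.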

### D10. Host CCL init keeps the PyTorch shape

The host code looks just like real PyTorch DDP. `init_process_group`
creates the backend object; it does **not** receive IPCQ knobs
(neighbor topology, buffer_kind, backpressure …).

```python
# benches/ccl_allreduce.py — same shape as real PyTorch
def worker(rank, world_size, torch):
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")  # reads ccl.yaml + topology
    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
    tensor.copy_(torch.from_numpy(init))
    dist.all_reduce(tensor, op="sum")
```

The IPCQ configuration is decided by the backend at
`init_process_group` time: it loads `ccl.yaml`, picks the algorithm,
and pushes IPCQ neighbor tables to every participating PE_IPCQ. The
host code never has to know about IPCQ.

A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`.
Switching algorithms is purely a `ccl.yaml` change — no host edits
required.

#### Init flow (eager)

1. `init_process_group(backend="ahbm")` is called.
2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
3. Pulls topology + buffer_kind + backpressure + slot config from
   `algorithms[<algo>]`.
4. **Immediately** installs neighbor tables on every PE_IPCQ
   (sideband or fabric `IpcqInitMsg`).
5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally —
   PE_IPCQ is already prepared whether the kernel is a CCL kernel or
   not.
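
Steps 2 and 3 of the init flow can be sketched as a merge of `defaults` with the selected algorithm entry. This sketch assumes `ccl.yaml` has already been parsed into a plain dict and that per-algorithm keys take precedence over defaults; `resolve_algorithm` and `CclConfigError` are hypothetical names, and the `hbm` default below is chosen only to make the override visible.

```python
class CclConfigError(ValueError):
    """Hypothetical error type for init-time configuration failures."""


def resolve_algorithm(cfg):
    """Merge `defaults` with the selected `algorithms[<algo>]` entry.

    Per-algorithm keys win over defaults. Returns (name, merged settings).
    """
    defaults = dict(cfg.get("defaults", {}))
    name = defaults.pop("algorithm", None)
    if name is None:
        raise CclConfigError("ccl.yaml: defaults.algorithm is missing")
    try:
        entry = cfg.get("algorithms", {})[name]
    except KeyError:
        raise CclConfigError(f"ccl.yaml: algorithms[{name!r}] is missing") from None
    return name, {**defaults, **entry}


cfg = {
    "defaults": {"algorithm": "ring_allreduce_tcm", "buffer_kind": "hbm",
                 "backpressure": "sleep", "n_slots": 8, "slot_size": 4096},
    "algorithms": {
        "ring_allreduce_tcm": {"topology": "ring_1d", "buffer_kind": "tcm"},
    },
}
name, settings = resolve_algorithm(cfg)
```

Failing fast here keeps configuration errors at `init_process_group` time instead of surfacing later inside a kernel launch.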

### D11. CCL config file (`ccl.yaml`)

IPCQ config and algorithm metadata live in a separate YAML file,
following the same pattern as `components.yaml` and `topology.yaml`.

A single benchmark execution runs one algorithm
(`defaults.algorithm`). Switching algorithms means editing
`defaults.algorithm` only.

```yaml
defaults:
  algorithm: ring_allreduce_tcm
  buffer_kind: tcm                # tcm | hbm | sram
  backpressure: sleep             # poll | sleep
  n_slots: 8
  slot_size: 4096
  vc_chunk_size: 256
  ipcq_credit_size_bytes: 16

algorithms:
  ring_allreduce_tcm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d             # builtin name or "custom"
    buffer_kind: tcm
    n_elem: 8                     # optional, per-algorithm tile width

  tree_allreduce_7:
    module: kernbench.ccl.algorithms.tree_allreduce
    topology: tree_binary
    buffer_kind: tcm
    world_size: 7                 # algorithm-level override
    n_elem: 16

  custom_mesh:
    module: kernbench.ccl.algorithms.custom_mesh
    topology: custom              # the module supplies its own neighbors()
```

`world_size` is deliberately left out of `defaults` in the example
above. The backend resolves it in priority order:
`algorithm-level override > defaults override > topology spec`. The
last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP,
where `WORLD_SIZE` comes from env vars rather than config files.
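
The resolution order can be sketched as below. The function name, the dict-based inputs, and the topology sizes are assumptions made for illustration; only the priority order and the `sips × cubes_per_sip × pes_per_cube` fallback come from the text above.

```python
def resolve_world_size(algo_entry, defaults, topology_spec):
    """Pick world_size by priority: algorithm entry, then defaults, then topology."""
    for source in (algo_entry, defaults):
        if "world_size" in source:
            return source["world_size"]
    return (topology_spec["sips"]
            * topology_spec["cubes_per_sip"]
            * topology_spec["pes_per_cube"])


topology_spec = {"sips": 2, "cubes_per_sip": 4, "pes_per_cube": 8}  # illustrative sizes
from_topology = resolve_world_size({"topology": "ring_1d"}, {}, topology_spec)
from_algorithm = resolve_world_size({"world_size": 7}, {"world_size": 16}, topology_spec)
from_defaults = resolve_world_size({}, {"world_size": 16}, topology_spec)
```

With these inputs the three calls yield 64, 7, and 16 respectively, matching the `tree_allreduce_7` style of override shown in the YAML.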

#### Algorithm module structure

Each algorithm module exports two hooks — `kernel` (required) and
`neighbors` (optional) — plus a `kernel_args` helper that the
backend uses to populate positional kernel arguments at `all_reduce`
time:

```python
# src/kernbench/ccl/algorithms/ring_allreduce.py

def kernel_args(world_size: int, n_elem: int) -> tuple:
    return (n_elem, world_size)


def kernel(t_ptr, n_elem, world_size, tl):
    """Required — the PE kernel.

    IPCQ is already installed by the backend before this is called.
    The kernel only uses the four-direction send / recv API.
    """
    ...


def neighbors(rank, world_size, neighbor_map):
    """Optional — override the builtin topology's neighbor map.

    Returns a new dict, the modified-in-place dict, or None to keep the
    builtin map.
    """
    return None
```

#### `neighbors` override patterns

- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
- **Pattern B — replace entirely**: ignore `neighbor_map` and return a
  brand-new dict.
- **Pattern C — keep builtin**: omit `neighbors` or return None.
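
The three patterns might look like this in an algorithm module. This is a sketch under assumptions: `neighbor_map` is taken to be a `{direction: peer_rank}` dict, which the ADR does not spell out, and the function names are illustrative (a real module would export a single `neighbors`).

```python
def neighbors_open_chain(rank, world_size, neighbor_map):
    """Pattern A: start from the builtin ring and cut the wrap-around links."""
    trimmed = dict(neighbor_map)     # copy, so the builtin map is left untouched
    if rank == 0:
        trimmed.pop("W", None)       # first rank has no west peer
    if rank == world_size - 1:
        trimmed.pop("E", None)       # last rank has no east peer
    return trimmed


def neighbors_pairs(rank, world_size, neighbor_map):
    """Pattern B: ignore the builtin map and pair ranks (0,1), (2,3), ..."""
    if rank % 2 == 0:
        return {"E": rank + 1} if rank + 1 < world_size else {}
    return {"W": rank - 1}           # reverse direction of the even partner's "E"


def neighbors_keep(rank, world_size, neighbor_map):
    """Pattern C: returning None keeps the builtin map."""
    return None


builtin_rank0 = {"E": 1, "W": 3}     # builtin ring_1d entry for rank 0 of 4
chain_rank0 = neighbors_open_chain(0, 4, builtin_rank0)
chain_rank3 = neighbors_open_chain(3, 4, {"E": 0, "W": 2})
```

Pattern B pairs each even rank's `E` with the odd partner's `W` so that the reverse-direction check (I3 in D14) still holds for the custom map.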

#### Builtin topologies

| topology | direction set |
|----------|---------------|
| `ring_1d` | E, W |
| `ring_1d_unidir` | E only |
| `mesh_2d` | N, S, E, W |
| `tree_binary` | parent, child_left, child_right |
| `none` | (empty) — algorithm must supply `neighbors()` |

#### Adding a new algorithm

1. Write `kernel` and `kernel_args` in
   `src/kernbench/ccl/algorithms/<algo>.py`.
2. Add an entry in `ccl.yaml`'s `algorithms` section.
3. (Optional) provide `neighbors()` for custom topology.
4. Set `defaults.algorithm` to the new algorithm.

The host bench (`benches/ccl_allreduce.py`) does not change.

### D12. Message / token schema

This ADR adds the following message types. They live in
`src/kernbench/common/pe_commands.py` and
`src/kernbench/runtime_api/kernel.py`.

#### `IpcqInitMsg` (sideband, fan-out at init)

The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors
`MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`).
Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`,
`my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store`
field: a `simpy.Store` instance pre-wired so the PE_IPCQ that emits a
credit (the data receiver) can push `IpcqCreditMetadata` directly into
the input queue of the peer PE_IPCQ (the data sender).

#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)

Carries `direction`, source addr/space, nbytes, shape, dtype, and a
handle id. `data_op=True` so it lands in the op_log.

#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)

Carries `direction` (or None for round-robin), `recv_mode`
(`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape,
dtype, blocking flag.

#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)

Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`)
plus the head metadata (`sender_seq`, `src_sip/cube/pe`,
`src_direction`). PE_DMA picks the channel by token type
(`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`).

The receiver's PE_DMA, on token arrival, performs the I6 atomic
sequence: write data into MemoryStore, then forward `IpcqMetaArrival`
to the local PE_IPCQ.

#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)

Carries `consumer_seq` (= my_tail), source PE coords, and source
direction. Travels through the dedicated SimPy Store channel rather
than `vc_comm`. Latency = `compute_path_latency_ns(path,
credit_size_bytes)`, the full path latency defined in D9.

There is **no `IpcqPtrUpdate` event** — head updates flow via D9
piggyback, tail updates via the D9 fast-path channel.

### D13. Test strategy

Test plan:

#### T1. Unit tests (component-level)

- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure
  immediately forwards a token; full peer slot triggers backpressure
  (poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`;
  round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`.
- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute`
  / `vc_comm` independent progress, chunk interleave, BW split.
- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d /
  mesh_2d / tree_binary correctness, mesh_2d non-square →
  `ValueError`, custom resolver returns the module's `neighbors`.

#### T2. Integration tests (E2E send/recv)

- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional
  no-deadlock), 4×4 mesh.
- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode
  records `ipcq` ops in op_log; DataExecutor produces correct
  `out.data`.

#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)

`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA
consistency, per-`buffer_kind` allocation.

#### T4. Regression

All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for
non-CCL benches.

#### T5. Performance / overhead

Single send/recv pair latency = (DMA latency) + (IPCQ overhead).
Should be close to a regular PE_DMA write of the same nbytes (IPCQ
overhead < 100 ns).

### D14. Invariants and failure modes

#### Invariants

- I1. **Slot lifecycle exactly-once**: one send → exactly one recv.
- I2. **Pointer monotonicity**: `my_head` / `my_tail` are monotonically
  non-decreasing; `sender_seq` is strictly increasing.
- I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank
  B, then rank B's reverse-direction peer must be rank A. Verified at
  init.
- I4. **`buffer_kind` consistency**: all PEs in a process group share
  the same `buffer_kind` (no mixed mode in the first cut).
- I5. **op_log ordering**: send → DMA complete → recv possible. The
  t_start order in op_log respects this causality.
- I6. **Atomic data + metadata visibility (MUST)**: at the receiver
  side, data write (`MemoryStore.write`) and metadata forward
  (`peer_head_cache` update) **must execute in the same SimPy step**.
  No yield is allowed between the two operations in PE_DMA's vc_comm
  handler. Code review must reject any inserted `yield` (or `yield
  from`): it would create a race in which `peer_head_cache` and the
  slot data become visible in different steps.
- I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6,
  the step in which `peer_head_cache > my_tail` becomes true is the
  same step in which the slot data is observable.

#### Failure modes (runtime errors)

- F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction
  → `IpcqInvalidDirection`, simulation aborts.
- F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched
  send and recv. Not validated by default; opt-in strict mode catches
  it (`strict_validation: true` in a PE_IPCQ node's attrs).
- F3. **Deadlock detection (timeout-based)**: the simulator empties its
  schedule while a send/recv is still pending → engine raises
  `IpcqDeadlock` and embeds a pointer dump.
- F4. **Backend init failure**: missing `defaults.algorithm`, missing
  `algorithms[name]`, module import failure, or topology validation
  failure (I3, I4). All are raised at `init_process_group` time.
- F5. **Slot full + infinite backpressure**: the peer never recvs.
  Surfaces as F3 timeout.

#### Diagnostics

- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as
  `(rank, t, dir, nbytes)`.
- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)`
  prints every PE_IPCQ ring buffer's `my_head`, `my_tail`,
  `peer_head_cache`, `peer_tail_cache`.
- **Deadlock dump**: on hang the engine includes the pointer dump in
  the `IpcqDeadlock` exception message.

### D15. Algorithm-author cheat sheet

Full step-by-step lives in
[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
shortest version:

| Things you touch | Things you don't |
|------------------|-------------------|
| `src/kernbench/ccl/algorithms/<your_algo>.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
| (Optional) `tests/test_<your_algo>.py` mock test | PE_IPCQ component, AhbmCCLBackend |

5-step flow: write the kernel → register in `ccl.yaml` → optional
`neighbors` override → optional mock unit test → SimPy validation via
`kernbench run --bench ccl_allreduce --verify-data`.

Common mistakes: using a direction that wasn't installed, sends
without matching recvs (deadlock), dtype/shape disagreement, assuming
fairness from `tl.recv()` round-robin, confusing
`tl.num_programs(axis)` with the CCL group size.

---

## HW Realization Notes (Informative)

**Status of this section**: Forward-looking. Describes how the simulator
contract (D1–D15) would map to silicon. Not currently implemented;
subject to revision before tapeout. The simulator implements the
contract via Python/SimPy equivalents in
[pe_ipcq.py](../../src/kernbench/components/builtin/pe_ipcq.py) and
[pe_dma.py](../../src/kernbench/components/builtin/pe_dma.py).

### D16. Proposed HW block diagram and end-to-end dataflow

![PE Baseline Architecture](../diagrams/pe_baseline.png)

> Source: [`../diagrams/pe_baseline.d2`](../diagrams/pe_baseline.d2) — `d2 --layout=elk --scale 1.5`.

![PE Proposed Architecture](../diagrams/pe_proposed.png)

> Source: [`../diagrams/pe_proposed.d2`](../diagrams/pe_proposed.d2) — `d2 --layout=elk`.

**Baseline → Proposed key changes**:

- Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW)
- PE_IPCQ (SimPy component) → **IPCQ Controller** (HW register + combinational logic)
- **IPCQ Slot Region reserved area** within TCM
- Credit Injector / Receiver connect directly to the NoC via the Fabric Port

#### End-to-end sequence (HW view)

```mermaid
sequenceDiagram
    participant CPU_A as PE_A: PE_CPU
    participant IPCQ_A as PE_A: IPCQ Ctrl
    participant DMA_A as PE_A: DMA
    participant NOC as NoC Fabric
    participant DMA_B as PE_B: DMA
    participant IPCQ_B as PE_B: IPCQ Ctrl
    participant TCM_B as PE_B: TCM
    participant CPU_B as PE_B: PE_CPU

    Note over CPU_A: tl.send(dir="E", src=0x1000)

    CPU_A->>IPCQ_A: MMIO: send request
    Note over IPCQ_A: Backpressure check:<br/>(head - peer_tail_cache) < n_slots → PASS<br/>Slot addr gen:<br/>dst = peer_rx_base + (head%n) × slot_size
    IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head}
    Note over IPCQ_A: my_head++
    IPCQ_A-->>CPU_A: send returns (fire-and-forget)

    Note over DMA_A: TCM read → snapshot in read buffer<br/>Flit pack: data + {sender_seq, dst_addr}
    DMA_A->>NOC: IPCQ data flit(s)

    Note over NOC: hop latency + BW drain

    NOC->>DMA_B: IPCQ data flit(s)
    Note over DMA_B: Terminal BW drain<br/>Slot write latency

    rect rgb(255, 240, 220)
        Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall
        DMA_B->>TCM_B: write data → slot address
        DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr}
    end

    Note over IPCQ_B: Range match dst_addr → direction "W"<br/>peer_head_cache["W"] = sender_seq + 1
    IPCQ_B-->>CPU_B: recv_wake signal

    Note over CPU_B: tl.recv(dir="W") wakes up
    CPU_B->>IPCQ_B: recv request
    Note over IPCQ_B: peer_head_cache > my_tail → YES<br/>slot_addr = rx_base + (tail%n) × slot_size
    IPCQ_B-->>CPU_B: return slot_addr
    CPU_B->>TCM_B: read data from slot
    Note over IPCQ_B: my_tail++

    IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa}
    Note over NOC: credit traversal (NoC latency)
    NOC->>IPCQ_A: Credit arrival

    Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled)
```
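
The handshake above can be exercised with a small pure-Python model. This is a sketch under assumptions, not the simulator's `PeIpcqComponent`: the `QueueEnd` class and its method names are hypothetical, each object stands for one direction of one PE, and NoC latency is ignored so data and credit delivery become direct method calls.

```python
from dataclasses import dataclass, field


@dataclass
class QueueEnd:
    """One direction of one PE (hypothetical model, not the simulator class)."""
    n_slots: int
    my_head: int = 0          # sender write position (monotonic)
    my_tail: int = 0          # receiver read position (monotonic)
    peer_head_cache: int = 0  # raised when data arrives
    peer_tail_cache: int = 0  # raised when a credit arrives
    slots: dict = field(default_factory=dict)

    def send(self, peer: "QueueEnd", payload) -> bool:
        # Backpressure check from the diagram.
        if (self.my_head - self.peer_tail_cache) >= self.n_slots:
            return False  # sender would stall here
        seq = self.my_head
        self.my_head += 1
        peer.on_data(seq, payload)  # stands in for DMA + NoC
        return True

    def on_data(self, sender_seq: int, payload) -> None:
        # Slot write and metadata update happen together (I6).
        self.slots[sender_seq % self.n_slots] = payload
        self.peer_head_cache = max(self.peer_head_cache, sender_seq + 1)

    def recv(self, peer: "QueueEnd"):
        if self.peer_head_cache <= self.my_tail:
            return None  # nothing to read yet
        payload = self.slots[self.my_tail % self.n_slots]
        self.my_tail += 1
        peer.on_credit(self.my_tail)  # stands in for the 16 B credit
        return payload

    def on_credit(self, consumer_seq: int) -> None:
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)


a_east = QueueEnd(n_slots=2)
b_west = QueueEnd(n_slots=2)
sent = [a_east.send(b_west, f"chunk{i}") for i in range(3)]
first = b_west.recv(a_east)
retry = a_east.send(b_west, "chunk2")
```

With `n_slots=2` the third send is refused until the receiver consumes a slot and the credit raises `peer_tail_cache`, which corresponds to the backpressure-deassert step at the end of the diagram.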

### D17. IPCQ Controller HW Module (NEW)

The IPCQ Controller is the hardware control block that sits between
PE_CPU and the DMA Engine. It corresponds to the simulator's
`PeIpcqComponent`.

#### QPair Register File

Per-direction queue-pair state held in flip-flops. The PE_CPU reads /
writes them via MMIO (CSRs); software populates them at init time.

```
Per-direction registers (each 64-bit):
  my_head          — sender write position (monotonic)
  my_tail          — receiver read position (monotonic)
  peer_head_cache  — last known peer head (updated by Meta Extractor)
  peer_tail_cache  — last known peer tail (updated by Credit Receiver)
  rx_base_pa       — this PE's rx buffer base physical address
  peer_rx_base_pa  — peer's rx buffer base physical address
  n_slots          — ring depth (power-of-2 constraint, see D21)
  slot_size        — bytes per slot
  peer_credit_tgt  — peer PE's credit-receive address

Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare)
Total: 8 dirs × 9 regs × 8 B = 576 B of flip-flops
```

#### Slot Address Generator (combinational)

```
Input:  pointer (my_head or my_tail), n_slots, slot_size, base_pa
Output: slot_addr = base_pa + (pointer % n_slots) * slot_size

Implementation:
  n_slots power-of-2 → pointer & (n_slots - 1)   (AND mask, 1 gate level)
  slot_size power-of-2 → barrel shift
  64-bit add → ripple / Kogge-Stone adder

Latency: combinational; ≤ 1 cycle at 1 GHz per the D21 estimate (~0.5 ns)
```
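
The equivalence between the modulo / multiply form and the mask / shift form can be checked directly. A minimal sketch, assuming both `n_slots` and `slot_size` are powers of two; the function names and the example addresses are illustrative only.

```python
def slot_addr_modmul(pointer, n_slots, slot_size, base_pa):
    """Reference form: works for any positive n_slots and slot_size."""
    return base_pa + (pointer % n_slots) * slot_size


def slot_addr_maskshift(pointer, n_slots, slot_size, base_pa):
    """Hardware-friendly form: valid only for power-of-two parameters."""
    assert n_slots > 0 and (n_slots & (n_slots - 1)) == 0
    assert slot_size > 0 and (slot_size & (slot_size - 1)) == 0
    index = pointer & (n_slots - 1)        # modulo as an AND mask
    shift = slot_size.bit_length() - 1     # log2(slot_size)
    return base_pa + (index << shift)      # multiply as a shift


# Hypothetical ring: 8 slots of 4 KB at 0xE00000; pointer 11 wraps to slot 3.
example_addr = slot_addr_maskshift(11, 8, 0x1000, 0xE00000)
```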

#### Backpressure Comparator (combinational)

```
full = (my_head - peer_tail_cache) >= n_slots

Implementation: 64-bit subtract + unsigned compare
Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold
Latency: 1 cycle
```
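
One consequence of an unsigned subtract is that the comparison stays correct if the counters ever wrap. A minimal sketch, assuming 64-bit counters that wrap modulo 2^64 (the ADR calls them monotonic and does not specify wrap behavior); `is_full` is a hypothetical name.

```python
MASK64 = (1 << 64) - 1


def is_full(my_head, peer_tail_cache, n_slots):
    """Model of the comparator: 64-bit subtract, then unsigned compare."""
    in_flight = (my_head - peer_tail_cache) & MASK64
    return in_flight >= n_slots


# Head has wrapped past zero while the cached tail has not: 4 slots in flight.
wrapped_head = 2
old_tail = (1 << 64) - 2
full_at_4 = is_full(wrapped_head, old_tail, n_slots=4)
full_at_8 = is_full(wrapped_head, old_tail, n_slots=8)
```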

#### Meta Extractor (inbound datapath sideband)

Wired into the DMA Engine's inbound vc_comm path. Extracts metadata
from arriving IPCQ flit headers and updates queue-pair state.

```
Trigger: DMA inbound write completion (same cycle)
Extract: {sender_seq, dst_addr} from flit header

Direction matching (ADR-0025 D2):
  for each dir:
    match = (rx_base_pa[dir] <= dst_addr) && (dst_addr < rx_base_pa[dir] + n_slots[dir] * slot_size[dir])
  8× parallel range comparators + priority encoder

Update: peer_head_cache[matched_dir] = max(peer_head_cache[matched_dir], sender_seq + 1)
Output: recv_wake signal → PE_CPU interrupt / flag
Latency: 1 cycle (pipelined with the DMA write — I6 atomicity is intrinsic)
```
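
The range match and the monotonic update can be sketched in software as follows. This is an illustration only: the function names, the dict-based register model, and the ring addresses are assumptions, and dict insertion order stands in for the priority encoder.

```python
def match_direction(dst_addr, rings):
    """rings maps direction -> (rx_base_pa, n_slots, slot_size)."""
    for direction, (rx_base_pa, n_slots, slot_size) in rings.items():
        if rx_base_pa <= dst_addr < rx_base_pa + n_slots * slot_size:
            return direction
    return None  # address is outside every IPCQ rx ring


def on_ipcq_arrival(peer_head_cache, rings, sender_seq, dst_addr):
    """Update the matched direction's cached peer head; never move it back."""
    direction = match_direction(dst_addr, rings)
    if direction is not None:
        peer_head_cache[direction] = max(peer_head_cache[direction], sender_seq + 1)
    return direction


# Hypothetical layout: four rings of 8 slots x 4 KB, packed back-to-back.
rings = {
    "N": (0xE00000, 8, 0x1000),
    "S": (0xE08000, 8, 0x1000),
    "E": (0xE10000, 8, 0x1000),
    "W": (0xE18000, 8, 0x1000),
}
heads = {d: 0 for d in rings}
hit = on_ipcq_arrival(heads, rings, sender_seq=5, dst_addr=0xE1A000)
late = on_ipcq_arrival(heads, rings, sender_seq=3, dst_addr=0xE19000)
```

The `max` keeps `peer_head_cache` from moving backwards if an older sequence number is observed after a newer one.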

#### Credit Injector (outbound)

```
Trigger: recv completion (after my_tail increments)
Action:  pack a 16 B credit packet → DMA vc_comm (or a dedicated credit VC)

Packet: {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa}
Latency: 1 cycle to generate; then NoC traversal
```

#### Credit Receiver (inbound sideband)

```
Trigger: 16 B credit packet arrival (from NoC)
Extract: {consumer_seq, dst_rx_base_pa}

Direction matching (ADR-0025 D3):
  for each dir:
    match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa)

Update: peer_tail_cache[matched_dir] = max(peer_tail_cache[matched_dir], consumer_seq)
Output: send_wake signal → deassert backpressure stall
Latency: 1 cycle
```
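
The credit path can be sketched end to end using the 128-bit credit-only flit from D19. Assumptions are labeled in the code: the byte order, the use of header bits [126:96] for routing, and all function names and addresses are hypothetical.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def pack_credit(consumer_seq, dst_rx_base_pa, route=0):
    """Build a 16 B credit flit: [127] flag, [95:64] seq, [63:0] base PA."""
    # Assumption: the remaining 31 header bits carry routing information.
    header = (1 << 31) | (route & 0x7FFFFFFF)
    word = (header << 96) | ((consumer_seq & MASK32) << 64) | (dst_rx_base_pa & MASK64)
    return word.to_bytes(16, "big")  # big-endian is an assumption


def unpack_credit(raw):
    word = int.from_bytes(raw, "big")
    if ((word >> 127) & 1) == 0:
        raise ValueError("credit_flag not set")
    return (word >> 64) & MASK32, word & MASK64


def on_credit(peer_tail_cache, peer_rx_base_pa, raw):
    """Equality-match the base address, then raise the cached peer tail."""
    consumer_seq, dst_rx_base_pa = unpack_credit(raw)
    for direction, base in peer_rx_base_pa.items():
        if base == dst_rx_base_pa:
            peer_tail_cache[direction] = max(peer_tail_cache[direction], consumer_seq)
            return direction
    return None


# Hypothetical peer rx bases for two installed directions.
bases = {"E": 0x100E18000, "W": 0x200E00000}
tails = {"E": 0, "W": 0}
flit = pack_credit(consumer_seq=4, dst_rx_base_pa=bases["E"])
matched = on_credit(tails, bases, flit)
stale = on_credit(tails, bases, pack_credit(2, bases["E"]))
```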

### D18. DMA Engine vc_comm IPCQ-aware mode

Add IPCQ-flit handling to the existing vc_comm channel (D8).

**Outbound**:

1. Receive a command from the IPCQ Controller: `{src_addr, dst_addr, nbytes, sender_seq}`.
2. Read `src_addr` from TCM → snapshot into the DMA read buffer (standard DMA behavior).
3. Pack flit: data + piggyback metadata (`sender_seq`, `dst_addr`).
4. Inject into the NoC fabric port.
5. Fire-and-forget (no completion wait).

**Inbound**:

1. Receive an IPCQ flit from the NoC.
2. Charge terminal BW drain (`drain_ns = nbytes / bottleneck_bw`).
3. Charge slot write latency (per backing memory tier).
4. **ATOMIC** (same pipeline stage, no stall insertion):
   - TCM write: data → slot address.
   - Meta Extractor trigger: `sender_seq` + `dst_addr` → IPCQ Controller.
5. Done.

**I6 atomicity guaranteed in hardware**: TCM write completion and Meta
Extractor trigger occur in the same pipeline stage, so no separate
synchronization is needed. The simulator's "no SimPy yield between
`MemoryStore.write` and `IpcqMetaArrival` put" (D9, I6) is preserved
naturally.

#### Data snapshot semantics

Data latched into the DMA read buffer is unaffected by subsequent
writes to `src` memory. This is standard DMA read-then-write
behavior; no extra HW is required.

#### Credit virtual channel (optional)

- **Option A**: multiplex credits onto vc_comm (distinguish via 16 B
  header-only flits).
- **Option B**: add a third dedicated credit VC (strict priority > data).

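Option B is friendlier to deadlock prevention because credits never
queue behind data. Since a 16 B credit's BW impact is negligible,
Option A is assumed to suffice for now; the choice remains listed
under Open HW questions.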

### D19. Fabric flit format extension

```
Generic data flit (e.g. 512-bit):
┌──────────────────────────────────────────┐
│ [511:480] routing header (32b)           │
│ [479:0]   payload (480b = 60 B)          │
└──────────────────────────────────────────┘

IPCQ data flit (only the first flit carries metadata):
┌──────────────────────────────────────────┐
│ [511:480] routing header (32b)           │
│   [511]    ipcq_flag (1b)                │  ← IPCQ vs. normal DMA
│   [510:509] vc_id (2b)                   │
│   [508:480] route + hop count            │
│ [479:416] ipcq_metadata (64b)            │  ← piggyback
│   [479:448] sender_seq (32b)             │
│   [447:416] dst_addr[31:0] (32b)         │  ← used for direction match
│ [415:0]   payload (416b = 52 B)          │
└──────────────────────────────────────────┘
Subsequent flits: full 60 B payload (no metadata).

Credit-only flit (128-bit, header-only):
┌──────────────────────────────────────────┐
│ [127:96]  routing header (32b)           │
│   [127]   credit_flag (1b)               │
│ [95:64]   consumer_seq (32b)             │
│ [63:0]    dst_rx_base_pa (64b)           │
└──────────────────────────────────────────┘
```

The first flit's payload shrinks from 60 B to 52 B (13 % overhead).
Subsequent flits carry full payloads, so the fixed 8 B of metadata
falls below 1 % of the transfer once it exceeds roughly 800 B.
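
The overhead figures follow from the flit layout above and can be reproduced with a few lines of arithmetic. A sketch, assuming the 512-bit flit with a 60 B payload shown as the example format; the helper names are hypothetical.

```python
FLIT_PAYLOAD_BYTES = 60   # 480-bit payload of a generic flit
METADATA_BYTES = 8        # 64-bit ipcq_metadata in the first flit only


def flits_needed(nbytes, ipcq=True):
    """Number of flits for one transfer; IPCQ adds 8 B to the first flit."""
    total = nbytes + (METADATA_BYTES if ipcq else 0)
    return -(-total // FLIT_PAYLOAD_BYTES)  # integer ceiling division


def metadata_overhead(nbytes):
    """Metadata as a fraction of all payload-field bytes sent."""
    return METADATA_BYTES / (nbytes + METADATA_BYTES)


single_flit = metadata_overhead(52)   # one full IPCQ first flit
page_sized = metadata_overhead(4096)  # a 4 KB slot
```

A 52 B transfer fills exactly one IPCQ flit, while 53 B already spills into a second flit, which is the only case where the metadata costs an extra flit.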

### D20. TCM IPCQ slot region layout

```
TCM Memory Map (16 MB):
┌─────────────────────────────┐ 0x000000
│  Kernel Working Memory      │
│  (compute tensors)          │
│  ~14 MB                     │
├─────────────────────────────┤ 0xE00000
│  IPCQ RX Buffers            │
│  Dir N: slots × slot_size   │
│  Dir S: slots × slot_size   │
│  Dir E: slots × slot_size   │
│  Dir W: slots × slot_size   │
│  ~1 MB                      │
├─────────────────────────────┤ 0xF00000
│  IPCQ Metadata / Scratch    │
│  ~1 MB                      │
└─────────────────────────────┘ 0xFFFFFF
```

Place the IPCQ region in the upper TCM bank to minimize bank conflict
with compute accesses (see Risk D22).
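
A ring configuration can be sanity-checked against the map above by laying the rings out and seeing whether they fit. A sketch, assuming rings are packed back-to-back from `0xE00000` in N/S/E/W order and that the RX region ends at `0xF00000`; `layout_rx_rings` is a hypothetical helper.

```python
IPCQ_RX_BASE = 0xE00000
IPCQ_RX_END = 0xF00000  # start of the metadata / scratch region


def layout_rx_rings(directions, n_slots, slot_size):
    """Return {direction: rx_base_pa}; raise if the rings overflow the region."""
    ring_bytes = n_slots * slot_size
    bases = {}
    addr = IPCQ_RX_BASE
    for direction in directions:
        if addr + ring_bytes > IPCQ_RX_END:
            raise ValueError(
                f"ring {direction} ({ring_bytes} B) does not fit in the RX region"
            )
        bases[direction] = addr
        addr += ring_bytes
    return bases


# 4 directions x 8 slots x 32 KB fills the 1 MB region exactly.
fits = layout_rx_rings(("N", "S", "E", "W"), n_slots=8, slot_size=32 * 1024)
```

Doubling the slot size to 64 KB makes the third ring overflow the region, so the helper raises `ValueError` for that configuration.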

### D21. 2 nm implementation analysis

#### Area estimate

| Module | Gate count | Area (2 nm est.) | Notes |
|---|---|---|---|
| QPair Register File | ~4.6 K FF | 0.002 mm² | 576 B of flip-flops |
| Slot Addr Gen + Backpressure | ~5 K gates | 0.001 mm² | Combinational |
| Meta Extractor + Credit Logic | ~3 K gates | 0.001 mm² | 8× parallel comparators |
| **IPCQ Controller subtotal** | **~12.6 K** | **~0.004 mm²** | **< 0.1 % of the PE area** |
| DMA vc_comm extension | ~2 K gates | 0.002 mm² | Flit pack / unpack |
| **Total delta** | **~14.6 K** | **~0.006 mm²** | |

#### Timing

| Path | Delay (2 nm est.) | Target clock | Margin |
|---|---|---|---|
| Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× |
| Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× |
| Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× |
| Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× |

At these estimates every listed path fits within one 1 GHz cycle with
at least 2× margin, so timing closure is not expected to be a concern.

#### Power

- Active: ~1 mW (register R/W + comparators while sending / receiving).
- Idle: leakage only.
- Negligible vs. total PE power.

#### Constraints

| Item | Constraint | Rationale |
|---|---|---|
| `n_slots` | **must be power-of-2** | mod → AND mask (1 gate). Arbitrary values need a divider (~10 cycles). |
| `slot_size` | **power-of-2 recommended** | mul → barrel shift. Arbitrary values need a multiplier. |
| TCM IPCQ region | **dedicated bank** | Prevents bank conflict with compute accesses. |

### D22. Risk assessment

#### TCM bank conflict

- **Risk**: IPCQ slot write and compute read both target the same TCM
  bank → stall.
- **Mitigation**: place the IPCQ region in a dedicated upper-address
  bank (D20).
- **Cost**: a small loss of TCM banking flexibility.
- **Severity**: Medium for performance; no correctness impact.

#### Credit return latency under congestion

- **Risk**: NoC congestion → credit-return delay → sender backpressure
  stall.
- **Mitigation**:
  - Put credits on a separate VC with strict priority (16 B →
    negligible BW impact).
  - Or pick `n_slots` generously (8+) so credit delay is absorbed by
    buffer depth.
- **Severity**: Low (16 B credits contribute almost nothing to
  congestion).
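
The "absorbed by buffer depth" argument is the usual credit-window sizing rule: the ring must hold at least as many slots as can be sent during one credit loop. A rough sketch with made-up numbers; the link bandwidth, loop time, and the `min_slots` helper are all assumptions, not values from this ADR.

```python
import math


def min_slots(slot_size_bytes, link_bw_bytes_per_ns, credit_loop_ns):
    """Smallest power-of-two n_slots that keeps the sender from stalling.

    credit_loop_ns is the time from the start of a slot's transmission
    until the credit for that slot arrives back at the sender.
    """
    slot_time_ns = slot_size_bytes / link_bw_bytes_per_ns
    needed = max(1, math.ceil(credit_loop_ns / slot_time_ns))
    return 1 << (needed - 1).bit_length()  # round up to a power of two (D21)


# Hypothetical: 4 KB slots, 64 B/ns link, 400 ns credit loop under congestion.
example_slots = min_slots(slot_size_bytes=4096, link_bw_bytes_per_ns=64, credit_loop_ns=400)
```

With these illustrative inputs, seven slots are in flight during one credit loop, which rounds up to the power-of-two depth of 8.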

#### Inter-direction ordering

- **Risk**: one PE sends on several directions at once, and arrival
  order across directions is not guaranteed.
- **Mitigation**: per-direction monotonic `sender_seq` suffices.
  Inter-direction ordering is the kernel's (software's)
  responsibility — same as the simulator model (D2 + D4).
- **Severity**: Low (resolved by design).

### D23. HW alternatives considered

#### Doorbell + polling (traditional)

```
Send: DMA write data → DMA write a doorbell register at the peer → peer polls doorbell
Recv: polling loop on the doorbell, or interrupt-driven
```

| Pros | Cons |
|---|---|
| Simple HW (no IPCQ controller) | Two DMA transactions (data + doorbell) |
| Reuses existing DMA | Needs explicit fence between data and doorbell |
| | Polling burns power; interrupt adds latency |

**Verdict**: An estimated 2–3× latency vs. piggyback, from the second
DMA transaction and the fence. **Rejected.**

#### Hardware message queue (NVIDIA NVLink style)

```
Send: CPU → push a descriptor onto HMQ → HW relays it to the peer HMQ
Recv: pop a descriptor from HMQ → use the data pointer
```

| Pros | Cons |
|---|---|
| CPU only writes descriptors | Needs a separate HMQ engine (~0.05 mm²) |
| Descriptor / data separation is flexible | Separate datapath from DMA → area / power overlap |
| | Large tensors still need DMA |

**Verdict**: With CCL's large-tensor pattern, DMA is still required,
so HMQ + DMA is a duplicated datapath. **Rejected.**

#### RDMA-style completion queue (CQ)

```
Send: DMA write → CQE auto-posted at the peer
Recv: CQ poll / interrupt → read data location
```

| Pros | Cons |
|---|---|
| Mature InfiniBand / RoCE model | CQ management logic + CQE memory overhead |
| Good multi-tenant isolation | CQE / data ordering needs extra plumbing |
| | Over-engineered for PE-to-PE CCL |

**Verdict**: RDMA CQ is suited to host-facing NICs with multi-tenant
isolation. For single-owner PE-to-PE this is needless complexity.
**Rejected.**

#### Credit-in-data piggyback (v2 optimization candidate)

In the current design the credit return is a separate 16 B packet.
For bidirectional traffic patterns, **the credit can be folded into a
reverse-direction data flit**.

```
PE_A →E→ PE_B: data + sender_seq=3
PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4  ← credit folded into data
```

| Pros | Cons |
|---|---|
| Removes the dedicated credit packet → NoC BW savings | Needs fallback for unidirectional patterns |
| Bidirectional allreduce: credit latency → 0 | +8 B in the flit header (negligible) |
| | Slightly more logic complexity |

**Verdict**: A strong optimization. Eliminates the credit packet for
bidirectional allreduce; the standalone credit fallback is retained.
**Recommended for v2.**
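
One way to picture the fold-or-fallback decision is a per-direction holder for the pending credit. This is a sketch only: `CreditFolder`, `OutboundFlit`, and the `flush` trigger are hypothetical, and the ADR does not specify when the standalone fallback fires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutboundFlit:
    kind: str                         # "data" or "credit"
    sender_seq: Optional[int] = None
    credit_ack: Optional[int] = None


class CreditFolder:
    """Holds the latest unreturned credit for one direction."""

    def __init__(self):
        self.pending_credit = None

    def on_recv_complete(self, my_tail):
        # Later credits supersede earlier ones because my_tail is monotonic.
        self.pending_credit = my_tail

    def on_data_send(self, sender_seq):
        # Fold the pending credit (if any) into the reverse-direction data flit.
        flit = OutboundFlit("data", sender_seq=sender_seq, credit_ack=self.pending_credit)
        self.pending_credit = None
        return flit

    def flush(self):
        # Fallback for unidirectional traffic: emit a standalone credit.
        if self.pending_credit is None:
            return None
        flit = OutboundFlit("credit", credit_ack=self.pending_credit)
        self.pending_credit = None
        return flit


# PE_B consumed sender_seq=3 from PE_A (my_tail is now 4), then sends seq 5 back.
folder = CreditFolder()
folder.on_recv_complete(my_tail=4)
folded = folder.on_data_send(sender_seq=5)
nothing_left = folder.flush()
```

A timer or a low-credit threshold would be natural triggers for `flush`, but that choice is left open here.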

### Open HW questions

- What fraction of TCM may the IPCQ slot region occupy? (Current
  assumption: ~1 MB / 16 MB = 6.25 %.)
- Dedicated credit VC vs. vc_comm multiplexing? (See D18.)
- Inter-SIP link flit-format compatibility verification.
- Maximum `n_slots`? (8 directions × 8 slots × 64 KB = 4 MB → 25 % of
  TCM, well above the ~1 MB region assumed in D20.)

---

## Non-goals

- **Host collective**: a model where `dist.all_reduce` itself moves
  data on the host side is out of scope. This ADR only covers
  communication that happens inside the PE kernel.
- **All-reduce algorithms**: ring / tree / etc. live in algorithm
  modules and can be added without amending this ADR.
- **Reliability / error handling**: link faults, send/recv failure
  recovery, etc. are out of scope.
- **NoC arbiter precision**: dynamic VC contention is left for a future
  ADR (see D8).

---

## Open questions

- **VC arbitration accuracy** — the first cut uses deterministic
  chunk interleave + weighted round-robin; heavy contention may report
  optimistic latency. A NoC arbiter component can be added later.
- **Credit return BW model** — the fast path is currently outside the
  fabric BW contention model. Can be modeled as a separate link or
  switched to piggyback (`credit_return_mode: piggyback`).
- **Ring buffer slot allocation metadata** — whether the host pushes
  IPCQ buffer metadata via sideband or via a fabric message similar to
  `MmuMapMsg` is open.
- **VC BW split default** — 50/50 vs. weighted (e.g. 80/20). Exposed in
  `ccl.yaml`; default value TBD.
- **Direction count** — 4 (N/S/E/W) is fixed in the first cut; 6
  (with Up/Down for 3D) or N (variable) is future work.
- **Multi-tile aggregation primitives** — whether
  `tl.recv_all` or similar is needed for fan-in.
- **Round-robin recv fairness** — current weak fairness can starve;
  strict fairness counter is future work.
- **Deadlock detection precision** — currently timeout-based; a
  realtime wait-for graph would enable deterministic detection.
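
To illustrate the wait-for graph idea in the last item, deterministic detection amounts to finding a cycle among blocked PEs. A minimal sketch using depth-first search; the graph encoding (`pe -> set of PEs it is blocked on`) and the function name are assumptions, and nothing like this exists in the simulator today.

```python
def find_deadlock_cycle(wait_for):
    """Return one cycle as a list of PEs (first == last), or None."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    stack = []

    def dfs(node):
        color[node] = GREY
        stack.append(node)
        for nxt in wait_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # Back edge: nxt is on the current path, so we found a cycle.
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = dfs(nxt)
                if cycle is not None:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(wait_for):
        if color.get(node, WHITE) == WHITE:
            cycle = dfs(node)
            if cycle is not None:
                return cycle
    return None


# Three PEs in a ring, each blocked in recv on its predecessor with no sends issued.
ring_stuck = {"pe0": {"pe2"}, "pe1": {"pe0"}, "pe2": {"pe1"}}
cycle = find_deadlock_cycle(ring_stuck)
no_cycle = find_deadlock_cycle({"pe0": {"pe1"}, "pe1": set()})
```

Unlike a timeout, a cycle in this graph identifies exactly which PEs are mutually blocked.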

---

## Consequences

### Positive

- Direct PE-to-PE communication makes it possible to write CCL kernels.
- The host stays minimal (just `launch`) and synchronization happens
  inside the PE → strong compute / comm overlap.
- VCs eliminate HoL blocking → collective latency is not blocked by
  compute traffic.
- Buffer placement and backpressure mode are init-time parameters →
  easy to benchmark.
- Four-direction logical neighbors → host is free to map
  ring/mesh/tree algorithms.

### Negative

- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE.
- VC arbitration is a first-order approximation; heavy contention
  scenarios may report slightly optimistic latency vs real HW (D8).
- Chunk-level interleave makes PE_DMA implementation more complex.