kernbench2/docs/adr/ADR-0023-dev-ipcq-pe-collective.md
ywkang a796c1d2f7 ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/
Establish English as the canonical ADR language with Korean translations
held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror).
Promotion from adr-proposed/ to adr/ now writes English to adr/ and the
Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.

- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English,
  2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix
  dropped). ADR-0023 EN regenerated against KO source which had newer
  HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark
  docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline
  section covering bidirectional sync, conflict resolution (EN wins),
  and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair
  completeness, filename mirroring, ADR-ID match, Status byte-equality.
  Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF
  normalization, em-dash title separator, underscore-slug edge case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:38:44 -07:00

# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication
## Status
Accepted
## Context
### Goal
Add the infrastructure that lets CCL (Collective Communication Library)
kernels run **inside** a PE. The host just launches a kernel on each
SIP; the actual synchronization and data movement happen **inside the
PE kernel via an IPCQ (Inter-Process Communication Queue)**.
This mirrors how NCCL performs NVLink communication inside a GPU
kernel, or how Cerebras / Tenstorrent expose core-local communication
queues. Host-level collectives (`dist.all_reduce`) are deferred to
**future work**; this ADR focuses solely on the kernel-side collective
infrastructure.
### Problems to solve
1. PE-to-PE direct data movement (writing into a peer's memory).
2. Synchronization — the sender must check that the receiver has space
in its buffer (backpressure).
3. Resource contention between compute traffic and communication
traffic (Head-of-Line blocking).
4. The host must be able to construct logical neighbor topologies
(ring / mesh / tree) per algorithm.
---
## Decision
### D1. Add a new `PE_IPCQ` component
A new component `PE_IPCQ` is added inside each PE. It follows the same
pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a
distinct component.
```
PE
├── PE_CPU
├── PE_SCHEDULER
├── PE_DMA
├── PE_IPCQ ← new
├── PE_FETCH_STORE
├── PE_GEMM
├── PE_MATH
├── PE_TCM
├── PE_MMU
```
**Role separation** (control plane vs. data plane):
- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head /
tail pointer management, peer pointer caches, backpressure, 4-direction
neighbor mapping.
- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe
/ PCIE into the peer's memory.
PE_IPCQ does **not** move data itself — it delegates to PE_DMA.
### D2. Ring buffer model
Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.
```python
@dataclass
class IpcqQueuePair:
    direction: Direction      # N/S/E/W
    peer: IpcqEndpoint        # set by host at init time (D2.5)
    tx_buffer_base: int       # outgoing data base addr (in our memory)
    rx_buffer_base: int       # incoming data base addr (in our memory)
    slot_size: int            # 1 tile per slot
    n_slots: int              # ring depth
    my_head: int              # next slot we will write/send into
    my_tail: int              # next slot we will read/recv from
    peer_head_cache: int      # peer's last-seen head (updated via D9 piggyback)
    peer_tail_cache: int      # peer's last-seen tail (updated via D9 fast-path credit)
```
**Canonical field names**: throughout this ADR the four names above
(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used
consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`,
etc.) are not used.
| Field | Owner | Updated when |
|-------|-------|--------------|
| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |
**Slot unit**: fixed-size, one slot holds one full tile (no descriptor
indirection). Full data embedded in the slot. See D5.
### D2.5. `IpcqEndpoint` schema
`IpcqQueuePair.peer` carries everything the sender needs to compute the
peer's rx slot address:
```python
@dataclass(frozen=True)
class IpcqEndpoint:
    sip: int
    cube: int
    pe: int
    buffer_kind: str    # "tcm" | "hbm" | "sram"
    rx_base_pa: int     # peer rx_buffer base PA (PhysAddr.encode())
    rx_base_va: int     # peer rx_buffer base VA (optional, MMU mode)
    n_slots: int        # peer ring depth (for wrap-around)
    slot_size: int      # peer slot size (for offset)
```
Address computation:
```python
slot_idx = self.my_head % peer.n_slots
dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
```
PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA
(vc_comm) routes the data to `dst_pa` through the fabric.
**Endpoint construction order**: at backend init (D10), the IPCQ
buffers for **every PE** are allocated first (so each rank knows the
others' PA), then the per-rank neighbor tables are built and pushed to
PE_IPCQ via `IpcqInitMsg`.
### D3. Four-direction mapping ≡ logical ProcessGroup
The PE views four directions (N/S/E/W) as logical ports. Real peer
addresses are configured by the host CCL init, per the chosen
algorithm. The PE kernel never knows the topology, only directions.
```python
# 1D ring
for rank in range(world_size):
    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])

# 2D mesh
for r in range(R):
    for c in range(C):
        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
```
The PE code does not need to know where `tl.send(dir="E", ...)` actually
ends up.
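The host-side loops above can be made concrete as a small table builder. This is a minimal sketch, assuming ranks are plain integers for the ring and `(row, col)` tuples for the mesh, and that the neighbor table is a nested dict. `build_ring_1d` and `build_mesh_2d` are hypothetical names for illustration, not functions from the codebase.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]


def build_ring_1d(world_size: int) -> Dict[int, Dict[str, int]]:
    """Return {rank: {"E": peer, "W": peer}} for a bidirectional 1D ring."""
    return {
        rank: {
            "E": (rank + 1) % world_size,
            "W": (rank - 1) % world_size,
        }
        for rank in range(world_size)
    }


def build_mesh_2d(rows: int, cols: int) -> Dict[Coord, Dict[str, Coord]]:
    """Return a neighbor table keyed by (row, col).

    The modulo arithmetic wraps at the edges, exactly as in the loop above,
    so every PE has all four directions installed.
    """
    table: Dict[Coord, Dict[str, Coord]] = {}
    for r in range(rows):
        for c in range(cols):
            table[(r, c)] = {
                "N": ((r - 1) % rows, c),
                "S": ((r + 1) % rows, c),
                "E": (r, (c + 1) % cols),
                "W": (r, (c - 1) % cols),
            }
    return table


ring = build_ring_1d(4)      # rank 0 sees E=1, W=3
mesh = build_mesh_2d(2, 3)   # 2 rows x 3 columns
```

Because the kernel only names directions, swapping `build_ring_1d` for `build_mesh_2d` on the host side would not require any kernel change.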
### D4. PE kernel API
```python
# Send (blocking; may stall on backpressure)
tl.send(dir: str, src=TensorHandle)
tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
# Recv (blocking)
recv = tl.recv(dir: str, shape=..., dtype=...)
recv = tl.recv(shape=..., dtype=...) # round-robin across 4 directions
# Recv (non-blocking)
fut = tl.recv_async(dir: str, shape=..., dtype=...)
recv = tl.wait(fut)
```
`tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each
call rotates through directions, returning the first available slot.
Empty in all 4 directions → wait.
**Fairness is weak**: the rotating start mitigates simple bias, but if
one direction always wins the race the others can starve. Algorithms
that need strict fairness must call `tl.recv(dir=...)` explicitly.
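The rotating cursor can be sketched as follows. This is an illustration only: it assumes the fixed scan order N, S, E, W and assumes the cursor moves to the direction that was just served, neither of which is specified above. `pick_direction` is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple

DIRECTIONS = ("N", "S", "E", "W")  # assumed scan order


def pick_direction(
    ready: Dict[str, bool], last_polled_dir: Optional[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (chosen_dir, new_cursor).

    The scan starts one position after last_polled_dir and returns the
    first direction that has a slot available.
    """
    if last_polled_dir is None:
        start = 0
    else:
        start = (DIRECTIONS.index(last_polled_dir) + 1) % len(DIRECTIONS)
    for offset in range(len(DIRECTIONS)):
        d = DIRECTIONS[(start + offset) % len(DIRECTIONS)]
        if ready.get(d, False):
            return d, d
    # Nothing ready in any direction: the caller would wait (D7).
    return None, last_polled_dir


ready = {"N": True, "E": True}
first, cursor = pick_direction(ready, None)      # scan starts at N
second, cursor = pick_direction(ready, cursor)   # scan starts at S
third, cursor = pick_direction(ready, cursor)    # scan starts at W
```

With N and E permanently ready, the picks alternate between them. A direction whose data arrives and is consumed between scans is never guaranteed a turn, which is the weak-fairness caveat.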
### D5. Single-hop DMA write + full-data slot model
Data moves from sender memory into the receiver's ring slot in **one
DMA transfer**. Key properties:
- **Single-hop**: the sender already knows the peer rx slot address and
fires one fabric DMA into it.
- **No CPU memcpy**: the CPU never copies data.
- **No intermediate staging**: neither side keeps a separate staging
buffer (sender uses the source addr directly; receiver gets the data
in its ring slot directly).
(Strictly speaking the fabric DMA write does happen, so this is not
literally "no data movement" — it's the same property NCCL labels
"zero-copy", meaning no CPU memcpy and no staging copy.)
```
PE A: tl.send(E, src_addr, nbytes)
  1. IPCQ computes the peer rx slot address:
       dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
  2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
       (full → sleep / poll)
  3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
  4. my_head += 1

PE B: data = tl.recv(W)
  1. Look at rx_buffer[my_tail % n_slots]
  2. Wait for the data to arrive (D7 backpressure mode)
  3. Return the slot address to the kernel (or fetch into register file)
  4. my_tail += 1
  5. Issue a credit return on the fast path (D9): after the fast-path
     latency, PE A's peer_tail_cache is updated.
```
The slot holds the full tile. The receiver only reads its own
rx_buffer; it never reads back into A's memory. The sender knows the
peer rx slot address and DMAs directly into it (single-hop).
The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local
to the PE).
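The sender-side bookkeeping in the flow above (address generation, backpressure check, head increment, credit arrival) can be modeled without any simulator. This is a minimal sketch: `RingPointers` is a hypothetical class, and raising `BlockingIOError` stands in for the stall that the real component would implement with poll or sleep.

```python
from dataclasses import dataclass


@dataclass
class RingPointers:
    n_slots: int
    slot_size: int
    rx_base_pa: int
    my_head: int = 0          # next slot the sender will write into
    peer_tail_cache: int = 0  # sender's last-seen view of the peer's tail

    def can_send(self) -> bool:
        # Slots in flight = my_head - peer_tail_cache; must stay below depth.
        return self.my_head - self.peer_tail_cache < self.n_slots

    def next_dst_pa(self) -> int:
        return self.rx_base_pa + (self.my_head % self.n_slots) * self.slot_size

    def send(self) -> int:
        """Return the destination PA for this send and advance my_head."""
        if not self.can_send():
            raise BlockingIOError("ring full: sender would stall on backpressure")
        dst = self.next_dst_pa()
        self.my_head += 1
        return dst

    def on_credit(self, consumer_seq: int) -> None:
        # Assumption: credits never move the cached tail backwards.
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)


ring = RingPointers(n_slots=2, slot_size=4096, rx_base_pa=0x1000)
first = ring.send()              # slot 0
second = ring.send()             # slot 1
stalled = not ring.can_send()    # both slots in flight
ring.on_credit(consumer_seq=1)   # receiver freed one slot
third = ring.send()              # wraps back to slot 0
```

The pointers grow without bound and only the slot index is taken modulo `n_slots`, which is what lets `my_head - peer_tail_cache` distinguish a full ring from an empty one.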
### D6. Buffer placement — three-way benchmark
The host CCL init picks the IPCQ ring-buffer location:
```python
ipcq_init(
    backend="ahbm",
    buffer_kind="tcm" | "hbm" | "sram",
    n_slots=8,
    slot_size=4096,
)
```
| Location | Trait | Trade-off |
|----------|-------|-----------|
| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
| **PE-local HBM** | Large; via DMA | Higher latency |
| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |
All three locations run the same kernel code; only the init differs.
### D7. Backpressure — two-mode benchmark
How the sender or receiver waits when peer slots are full / data not
yet arrived:
| Mode | Behavior | Model |
|------|----------|-------|
| **poll** | Periodically re-check the cached peer pointer | Spin loop |
| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |
```python
ipcq_init(backpressure="poll" | "sleep", ...)
```
Both modes are implemented so latency / throughput trade-offs can be
benchmarked.
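The latency side of that trade-off can be illustrated with simple arithmetic. The sketch below assumes a fixed polling period (`poll_interval_ns`, a hypothetical parameter that is not part of the config shown in this ADR) and ignores any wake-up overhead in sleep mode.

```python
import math


def wake_time_ns(
    mode: str,
    wait_start_ns: float,
    ready_ns: float,
    poll_interval_ns: float = 10.0,
) -> float:
    """Return the time at which a waiter notices that the condition holds.

    mode="sleep": woken exactly when the peer trigger fires.
    mode="poll":  notices at the first poll boundary at or after ready_ns.
    """
    if ready_ns <= wait_start_ns:
        return wait_start_ns  # condition already true, no wait
    if mode == "sleep":
        return ready_ns
    if mode == "poll":
        polls = math.ceil((ready_ns - wait_start_ns) / poll_interval_ns)
        return wait_start_ns + polls * poll_interval_ns
    raise ValueError(f"unknown backpressure mode: {mode!r}")


sleep_wake = wake_time_ns("sleep", 0.0, 25.0)
poll_wake = wake_time_ns("poll", 0.0, 25.0, poll_interval_ns=10.0)
```

In this model a poller can notice up to one polling period late, while a sleeper is exact. A real comparison would also weigh the cost of the spin loop against the interrupt-like wake path.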
### D8. PE_DMA virtual channels
Extend PE_DMA from a single queue into a **two-channel virtual-channel**
model.
```
PE_DMA
├── vc_compute: tile load / store / writeback for GEMM and Math
└── vc_comm: IPCQ send data
```
Each VC has an independent state machine:
- One channel stalling does not block the other.
- The same physical link (cube_noc, UCIe, …) is shared, but link BW is
split between channels.
**Chunk-level interleave**:
- Large GEMM tile DMAs do not lock the link end-to-end.
- Progress happens in chunks (e.g. 256 B); each chunk shares link BW
with the other VC's pending chunks.
- Chunk size is an init parameter (smaller = fairer, larger = more
efficient).
Net effect:
- HoL blocking is eliminated (an IPCQ send can interleave with a long
compute DMA).
- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM
pattern).
- Matches the NoC-virtual-channel pattern used in real HW.
**First-implementation accuracy limit (intentional)**: this ADR's
first cut uses **deterministic chunk-level interleave + weighted
round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`).
This is a first-order approximation and is simpler than real HW
dynamic-contention / credit-based arbiters. Functional correctness is
unaffected, but heavy-contention scenarios may report slightly
optimistic latency vs. real HW. A separate ADR can add a NoC arbiter
component later if more precision is needed.
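A deterministic chunk-level interleave with weighted round-robin can be sketched as below. This is a toy model of the arbitration order only, with no timing. `interleave_chunks` and its argument shapes are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def interleave_chunks(
    pending: Dict[str, int], chunk_size: int, weights: Dict[str, int]
) -> List[Tuple[str, int]]:
    """Return the order in which (vc, chunk_bytes) pairs go onto the link.

    Each round grants every VC up to `weight` chunks, in the order the
    VCs appear in `weights`. A VC with nothing pending is skipped.
    """
    if chunk_size < 1 or any(w < 1 for w in weights.values()):
        raise ValueError("chunk_size and all weights must be >= 1")
    remaining = {vc: pending.get(vc, 0) for vc in weights}
    order: List[Tuple[str, int]] = []
    while any(v > 0 for v in remaining.values()):
        for vc, weight in weights.items():
            for _ in range(weight):
                if remaining[vc] <= 0:
                    break
                n = min(chunk_size, remaining[vc])
                order.append((vc, n))
                remaining[vc] -= n
    return order


# A 1 KiB compute DMA is already queued when a 256 B IPCQ send arrives.
order = interleave_chunks(
    pending={"vc_compute": 1024, "vc_comm": 256},
    chunk_size=256,
    weights={"vc_compute": 1, "vc_comm": 1},  # the 50 / 50 default
)
```

With equal weights the IPCQ chunk goes out second instead of waiting behind all four compute chunks, which is the head-of-line relief described above.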
#### Token routing
- Compute tokens (`TileToken`) — go through the existing
PE_FETCH_STORE → PE_DMA chain.
- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA
self-routing.
- PE_DMA picks the channel by token type.
```python
class PeDmaComponent:
    def _process(self, env, token):
        if isinstance(token, IpcqDmaToken):
            yield from self._vc_comm_process(env, token)
        else:
            yield from self._vc_compute_process(env, token)
```
### D9. Pointer synchronization — DMA payload piggyback
Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so
pointers update along with the data. This simulation adopts the same
model: **no separate control channel** — metadata travels with the
data.
The big benefits:
- **Automatic ordering**: data and metadata move on the same token, so
data is visible **before** the head_cache update. No race.
- **HW fidelity**: matches NVLink / UCIe piggybacked headers.
- **Component simplification**: no separate `IpcqPtrUpdate` event type.
#### Send flow (head update via piggyback)
```
PE A: tl.send(E, src_addr, nbytes)
  1. PE_IPCQ checks backpressure (using peer_tail_cache)
  2. PE_IPCQ creates an IpcqDmaToken:
       - data body (src_addr → peer dst_addr)
       - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
  3. Hand the token to PE_DMA(vc_comm)
  4. PE A increments my_head (send tracking)

[fabric DMA: latency elapses]

PE B's PE_DMA receives the token
  5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
  6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)

PE B's PE_IPCQ receives the metadata
  7. Updates peer_head_cache (= A's head)
  8. Wakes any pending recv on that direction
```
**Steps 5 and 6 must execute in the same SimPy step** — DMA completion
makes data and metadata atomically visible.
#### Recv flow (credit return — fast path with bottleneck-BW latency)
When the receiver frees a slot, the sender must learn about it
(backpressure release). Unlike data, the credit return does **not**
travel through general vc_comm fabric — it uses a **separate fast
path**, an abstraction of the NVLink / UCIe credit-return wire.
**Latency** is computed from the **full path latency** (per-node
overhead + edge propagation + drain), not a magic constant:
```
credit_size_bytes = 16    (ccl.yaml: ipcq_credit_size_bytes)
path    = router.find_path(self_pe, peer_pe.pe_dma)
latency = compute_path_latency_ns(path, credit_size_bytes)
        = sum(edge.distance_mm * ns_per_mm)
        + sum(node_overhead_ns[n] for n in path)
        + credit_size_bytes / bottleneck_bw_on_path
The router auto-appends `.pe_dma` to the source only, so the
destination MUST be spelled with the explicit `.pe_dma` suffix.
Without it `find_path` raises, and the credit used to teleport silently
at zero cost (a latent bug fixed alongside this update).
`tl.recv` blocks on the credit-emit completion (recv yields-from
`_delayed_credit_send` rather than spawning it as a fork). This puts
the credit-return cost on the receiver's `pe_exec_ns`, modeling the
IPCQ control-plane completing the consume-acknowledgement before
recv returns to the kernel — the protocol equivalent of a non-posted
`tl.store` waiting for an HBM ack on the raw DMA path.
That gives us:
- **Topology-proportional approximation**: an in-cube credit return is
automatically faster than a cross-SIP credit return.
- **No magic constants**: every nanosecond comes from
`compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
as data traffic.
- **No deadlock risk**: unlike piggyback, B can issue credit even when
it has no data to send back. `peer_credit_store.put` is unbounded.
- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
cost on recv balances the HBM ack-trip cost RAW pays on the sender.
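The three terms of the credit latency can be worked through with a toy path. The sketch below is not the simulator's `compute_path_latency_ns`. `Edge`, `sketch_path_latency_ns`, the node names and every number are made up for illustration, and bandwidth is assumed to be expressed in bytes per nanosecond.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Edge:
    distance_mm: float
    bw_bytes_per_ns: float


def sketch_path_latency_ns(
    nodes: List[str],
    edges: List[Edge],
    node_overhead_ns: Dict[str, float],
    ns_per_mm: float,
    nbytes: int,
) -> float:
    """Edge propagation + per-node overhead + drain at the bottleneck BW."""
    propagation = sum(e.distance_mm * ns_per_mm for e in edges)
    overhead = sum(node_overhead_ns[n] for n in nodes)
    drain = nbytes / min(e.bw_bytes_per_ns for e in edges)
    return propagation + overhead + drain


# Hypothetical two-hop path: PE B -> router -> PE A's PE_DMA.
latency = sketch_path_latency_ns(
    nodes=["pe_b", "router", "pe_a.pe_dma"],
    edges=[Edge(distance_mm=2.0, bw_bytes_per_ns=64.0),
           Edge(distance_mm=4.0, bw_bytes_per_ns=32.0)],
    node_overhead_ns={"pe_b": 1.0, "router": 2.0, "pe_a.pe_dma": 1.0},
    ns_per_mm=0.5,
    nbytes=16,  # credit_size_bytes
)
# propagation 3.0 + overhead 4.0 + drain 16 / 32 = 0.5
```

For a 16-byte credit the drain term is small next to propagation and node overhead, so the result tracks path length, which is why an in-cube credit comes out faster than a cross-SIP one.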
#### Component coupling — SimPy Store channel
PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init
time, **a SimPy Store is wired between the two** (a per-direction
fast-path channel) and credit metadata is `put` into that store.
```python
class PeIpcqComponent:
    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
        yield env.timeout(latency_ns)
        yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...))
```
Backend init wires both directions of the fast-path channel as part of
fan-out (see `IpcqInitMsg` in D12).
#### Credit-return fast path limitations
- `credit_size_bytes` is an estimate (typically 16 to 64 bytes).
- The fast path is **excluded from vc_comm BW contention** (separate
wire). Real HW credit-return wires are very lightweight, so this is a
reasonable first approximation.
- A follow-up ADR can: model the credit fast path as a separate link
(BW limit + contention), or switch to piggyback (`credit_return_mode:
piggyback`).
#### PE_DMA's added responsibility
When `vc_comm` receives a token, PE_DMA processes it as the following
sequence: pay the Transaction's terminal BW drain, then atomically
write data and forward metadata. **No SimPy yield is allowed between
the data write and the metadata forward** (invariant I6). The drain
yield must sit before the atomic block, not inside it:
```python
def _on_vc_comm_recv(self, env, txn):
    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
    # sender PE_DMA). MUST happen before the atomic block so recv only
    # wakes after the bytes have "landed".
    drain = getattr(txn, "drain_ns", 0.0)
    if drain > 0:
        yield env.timeout(drain)
    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    # 1. Copy the payload into the receiver's rx slot
    data = self._memory_store.read(token.src_space, token.src_addr,
                                   shape=..., dtype=...)
    self._memory_store.write(token.dst_endpoint.buffer_kind,
                             token.dst_addr, data)
    # 2. Forward metadata to the local PE_IPCQ
    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
    # ───────────────────────────────────────────────────
```
The final `put` is yieldable but uses an unbounded internal store, so
it completes in a single step. That `put` is the closing call of the
atomic block; nothing may be inserted between the data write and it.
#### Drain-at-inbound semantics (D9 timing model)
The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
is paid at each forwarding component via `run()`, and the remaining
BW drain is paid once at the Transaction's terminal. Every non-IPCQ
Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
`ComponentBase._forward_txn` at the terminal node. For IPCQ the
destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
(so IPCQ-specific data write + metadata forward can happen), so **the
drain MUST be paid explicitly at the top of that handler** to keep
IPCQ's timing model on par with every other fabric Transaction.
Side-effects of paying drain here:
- **SRC `tl.send`** is unchanged — fire-and-forget semantics are
preserved because the sender PE_DMA does not `yield sub_done`. The
`sub_done.succeed()` call (made after metadata forward below) is an
event with no listener on the sender side.
- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
forward now happens after the drain, recv observes the full fabric
transfer time including bandwidth cost.
Matches the physical picture: send dispatches and leaves; recv waits
until the bytes have actually been drained into its inbox.
### D9.5. ADR-0020 (2-pass) integration
`tl.send` / `tl.recv` integrate with ADR-0020's two-pass model. Phase
1 simulates timing **and** moves data via MemoryStore; Phase 2 enables
op-log-based correctness verification.
#### Phase 1 (timing + data)
D9 models head and tail updates with two different mechanisms:
- **Send-side (head update)** — DMA payload piggyback. Data write and
metadata forward happen in the same SimPy step → automatic atomic
visibility.
- **Recv-side (tail credit return)** — fast-path SimPy Store channel
with bottleneck-BW latency, then `peer_tail_cache` update.
Together they preserve ring-buffer pointer consistency.
The op-log records `op_kind="ipcq"` entries for sends (with
`src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with
`recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).
Two recv modes:
- **`return_slot`** (default): the slot address is returned to the
kernel. Zero-copy.
- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`,
PE_IPCQ copies the slot data into the user dst.
#### Phase 2 (op_log replay)
When `DataExecutor` encounters an `op_kind="ipcq"` record:
- **send**: idempotent `src → dst` ndarray write.
- **recv (`return_slot`)**: no-op (the slot already holds the data).
- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.
IPCQ ops are pure data movement — Phase 2 has nothing extra to compute.
The downstream GEMM / Math ops in `DataExecutor` will consume the data
and naturally validate correctness.
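The replay rules above can be sketched with a dictionary standing in for memory. This is an illustration, not `DataExecutor`: the record layout (a `kind` field, and `src` / `dst` as `(space, addr)` tuples) is a simplification assumed here, and payloads are plain `bytes` rather than ndarrays.

```python
from typing import Dict, List, Tuple

Memory = Dict[Tuple[str, int], bytes]  # (space, addr) -> payload


def replay_ipcq(op_log: List[dict], mem: Memory) -> None:
    """Apply IPCQ records to `mem`; running it twice gives the same state."""
    for rec in op_log:
        if rec["op_kind"] != "ipcq":
            continue  # GEMM / Math records are handled elsewhere
        if rec["kind"] == "send":
            mem[rec["dst"]] = mem[rec["src"]]        # src -> peer rx slot
        elif rec["recv_mode"] == "copy_to_dst":
            mem[rec["dst"]] = mem[rec["src"]]        # slot -> user dst
        # recv with return_slot: no-op, the slot already holds the data


mem: Memory = {("tcm_a", 0x100): b"tile-0"}
op_log = [
    {"op_kind": "ipcq", "kind": "send",
     "src": ("tcm_a", 0x100), "dst": ("tcm_b", 0x2000)},
    {"op_kind": "ipcq", "kind": "recv", "recv_mode": "return_slot",
     "src": ("tcm_b", 0x2000), "dst": None},
    {"op_kind": "ipcq", "kind": "recv", "recv_mode": "copy_to_dst",
     "src": ("tcm_b", 0x2000), "dst": ("hbm_b", 0x9000)},
]
replay_ipcq(op_log, mem)
after_first = dict(mem)
replay_ipcq(op_log, mem)  # second pass leaves the state unchanged
```

Each record is a plain overwrite from a location that the replay itself never modifies afterwards, which is what makes the pass idempotent in this toy setting.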
### D10. Host CCL init keeps the PyTorch shape
The host code looks just like real PyTorch DDP. `init_process_group`
creates the backend object; it does **not** receive IPCQ knobs
(neighbor topology, buffer_kind, backpressure …).
```python
# benches/ccl_allreduce.py — same shape as real PyTorch
def worker(rank, world_size, torch):
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")  # reads ccl.yaml + topology
    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
    tensor.copy_(torch.from_numpy(init))
    dist.all_reduce(tensor, op="sum")
```
The IPCQ configuration is decided by the backend at
`init_process_group` time: it loads `ccl.yaml`, picks the algorithm,
and pushes IPCQ neighbor tables to every participating PE_IPCQ. The
host code never has to know about IPCQ.
A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`.
Switching algorithms is purely a `ccl.yaml` change — no host edits
required.
#### Init flow (eager)
1. `init_process_group(backend="ahbm")` is called.
2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
3. Pulls topology + buffer_kind + backpressure + slot config from
`algorithms[<algo>]`.
4. **Immediately** installs neighbor tables on every PE_IPCQ
(sideband or fabric `IpcqInitMsg`).
5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally —
PE_IPCQ is already prepared whether the kernel is a CCL kernel or
not.
### D11. CCL config file (`ccl.yaml`)
IPCQ config and algorithm metadata live in a separate YAML file,
following the same pattern as `components.yaml` and `topology.yaml`.
A single benchmark execution runs one algorithm
(`defaults.algorithm`). Switching algorithms means editing
`defaults.algorithm` only.
```yaml
defaults:
  algorithm: ring_allreduce_tcm
  buffer_kind: tcm              # tcm | hbm | sram
  backpressure: sleep           # poll | sleep
  n_slots: 8
  slot_size: 4096
  vc_chunk_size: 256
  ipcq_credit_size_bytes: 16
algorithms:
  ring_allreduce_tcm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d           # builtin name or "custom"
    buffer_kind: tcm
    n_elem: 8                   # optional, per-algorithm tile width
  tree_allreduce_7:
    module: kernbench.ccl.algorithms.tree_allreduce
    topology: tree_binary
    buffer_kind: tcm
    world_size: 7               # algorithm-level override
    n_elem: 16
  custom_mesh:
    module: kernbench.ccl.algorithms.custom_mesh
    topology: custom            # the module supplies its own neighbors()
```
`world_size` is **not set in `defaults`**. The backend resolves it via:
`algorithm-level override > defaults override > topology spec`. The
last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP
where `WORLD_SIZE` comes from env vars rather than config files.
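The precedence chain can be written as a three-step lookup. This is a sketch under the assumption that the parsed YAML sections arrive as plain dicts. `resolve_world_size` is a hypothetical helper name.

```python
def resolve_world_size(
    algo_cfg: dict,
    defaults: dict,
    sips: int,
    cubes_per_sip: int,
    pes_per_cube: int,
) -> int:
    """Algorithm-level override > defaults override > topology spec."""
    if "world_size" in algo_cfg:
        return algo_cfg["world_size"]
    if "world_size" in defaults:
        return defaults["world_size"]
    # Last fallback: every PE in the topology participates.
    return sips * cubes_per_sip * pes_per_cube


tree_ws = resolve_world_size({"world_size": 7}, {}, 2, 2, 4)
default_ws = resolve_world_size({}, {"world_size": 4}, 2, 2, 4)
topo_ws = resolve_world_size({}, {}, 2, 2, 4)
```

With the `ccl.yaml` shown above, `tree_allreduce_7` resolves to 7 through the first branch, while `ring_allreduce_tcm` falls through to the topology product.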
#### Algorithm module structure
Each algorithm module exports two hooks — `kernel` (required) and
`neighbors` (optional) — plus a `kernel_args` helper that the
backend uses to populate positional kernel arguments at `all_reduce`
time:
```python
# src/kernbench/ccl/algorithms/ring_allreduce.py
def kernel_args(world_size: int, n_elem: int) -> tuple:
    return (n_elem, world_size)


def kernel(t_ptr, n_elem, world_size, tl):
    """Required — the PE kernel.

    IPCQ is already installed by the backend before this is called.
    The kernel only uses the four-direction send / recv API.
    """
    ...


def neighbors(rank, world_size, neighbor_map):
    """Optional — override the builtin topology's neighbor map.

    Returns a new dict, the modified-in-place dict, or None to keep the
    builtin map.
    """
    return None
```
#### `neighbors` override patterns
- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
- **Pattern B — replace entirely**: ignore `neighbor_map` and return a
brand-new dict.
- **Pattern C — keep builtin**: omit `neighbors` or return None.
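The three patterns might look like the hooks below. This sketch assumes `neighbor_map` is a dict from direction name to peer rank for the calling rank. That shape, and the `apply_override` helper that mimics how a backend could consume the hook's return value, are assumptions for illustration.

```python
def neighbors_chain(rank, world_size, neighbor_map):
    """Pattern A: open the builtin ring into a chain by trimming both ends."""
    if rank == 0:
        neighbor_map.pop("W", None)
    if rank == world_size - 1:
        neighbor_map.pop("E", None)
    return neighbor_map


def neighbors_unidirectional(rank, world_size, neighbor_map):
    """Pattern B: ignore the builtin map and return a brand-new dict."""
    return {"E": (rank + 1) % world_size}


def neighbors_keep_builtin(rank, world_size, neighbor_map):
    """Pattern C: returning None keeps the builtin map."""
    return None


def apply_override(hook, rank, world_size, builtin):
    """Hypothetical backend-side glue: None means 'keep builtin'."""
    result = hook(rank, world_size, dict(builtin))  # pass a copy
    return builtin if result is None else result


builtin_rank0 = {"E": 1, "W": 3}  # ring_1d, rank 0 of 4
chain = apply_override(neighbors_chain, 0, 4, builtin_rank0)
unidir = apply_override(neighbors_unidirectional, 3, 4, {"E": 0, "W": 2})
kept = apply_override(neighbors_keep_builtin, 0, 4, builtin_rank0)
```

Pattern A trims both ends of the ring together so that the reciprocity invariant (I3) still holds: rank 0 loses its W link and the last rank loses the matching E link.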
#### Builtin topologies
| topology | direction set |
|----------|---------------|
| `ring_1d` | E, W |
| `ring_1d_unidir` | E only |
| `mesh_2d` | N, S, E, W |
| `tree_binary` | parent, child_left, child_right |
| `none` | (empty) — algorithm must supply `neighbors()` |
#### Adding a new algorithm
1. Write `kernel` and `kernel_args` in
`src/kernbench/ccl/algorithms/<algo>.py`.
2. Add an entry in `ccl.yaml`'s `algorithms` section.
3. (Optional) provide `neighbors()` for custom topology.
4. Set `defaults.algorithm` to the new algorithm.
The host bench (`benches/ccl_allreduce.py`) does not change.
### D12. Message / token schema
The new message types added by this ADR. They live in
`src/kernbench/common/pe_commands.py` and
`src/kernbench/runtime_api/kernel.py`.
#### `IpcqInitMsg` (sideband, fan-out at init)
The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors
`MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`).
Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`,
`my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store`
field: a `simpy.Store` instance pre-wired so the PE_IPCQ that frees a
slot (the data receiver) can push `IpcqCreditMetadata` directly into
the data sender's input queue.
#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)
Carries `direction`, source addr/space, nbytes, shape, dtype, and a
handle id. `data_op=True` so it lands in the op_log.
#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)
Carries `direction` (or None for round-robin), `recv_mode`
(`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape,
dtype, blocking flag.
#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)
Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`)
plus the head metadata (`sender_seq`, `src_sip/cube/pe`,
`src_direction`). PE_DMA picks the channel by token type
(`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`).
The receiver's PE_DMA, on token arrival, performs the I6 atomic
sequence: write data into MemoryStore, then forward `IpcqMetaArrival`
to the local PE_IPCQ.
#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)
Carries `consumer_seq` (= my_tail), source PE coords, and source
direction. Travels through the dedicated SimPy Store channel rather
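than `vc_comm`. Latency is the full path latency from D9
(`compute_path_latency_ns(path, credit_size_bytes)`);
`credit_size_bytes / bottleneck_bw_on_path` is only its drain term.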
There is **no `IpcqPtrUpdate` event** — head updates flow via D9
piggyback, tail updates via the D9 fast-path channel.
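How the receiver consumes these two sequence numbers can be sketched as below. `ReceiverSide` is a hypothetical stand-in for the receive half of PE_IPCQ. It assumes, following the HW sequence diagram later in this ADR, that `sender_seq` is the sender's head before its increment and that `consumer_seq` is `my_tail` after the increment.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ReceiverSide:
    rx_base: int
    n_slots: int
    slot_size: int
    my_tail: int = 0          # next slot we will read from
    peer_head_cache: int = 0  # last-seen head of the sending peer

    def on_meta_arrival(self, sender_seq: int) -> None:
        # Piggybacked metadata: slot `sender_seq` is now filled.
        self.peer_head_cache = sender_seq + 1

    def try_recv(self) -> Optional[Tuple[int, int]]:
        """Return (slot_addr, consumer_seq), or None if no data is visible."""
        if self.peer_head_cache <= self.my_tail:
            return None  # a blocking recv would wait here
        slot_addr = self.rx_base + (self.my_tail % self.n_slots) * self.slot_size
        self.my_tail += 1
        # consumer_seq is what the credit carries back to the sender.
        return slot_addr, self.my_tail


rx = ReceiverSide(rx_base=0x8000, n_slots=4, slot_size=4096)
before = rx.try_recv()           # nothing has arrived yet
rx.on_meta_arrival(sender_seq=0)
got = rx.try_recv()              # first slot, credit carries consumer_seq=1
after = rx.try_recv()            # drained again
```

The sender would then set `peer_tail_cache` to the returned `consumer_seq`, which releases one slot of backpressure.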
### D13. Test strategy
Test plan:
#### T1. Unit tests (component-level)
- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure
immediately forwards a token; full peer slot triggers backpressure
(poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`;
round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`.
- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute`
/ `vc_comm` independent progress, chunk interleave, BW split.
- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d /
mesh_2d / tree_binary correctness, mesh_2d non-square →
`ValueError`, custom resolver returns the module's `neighbors`.
#### T2. Integration tests (E2E send/recv)
- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional
no-deadlock), 4×4 mesh.
- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode
records `ipcq` ops in op_log; DataExecutor produces correct
`out.data`.
#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)
`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA
consistency, per-`buffer_kind` allocation.
#### T4. Regression
All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for
non-CCL benches.
#### T5. Performance / overhead
Single send/recv pair latency = (DMA latency) + (IPCQ overhead).
Should be close to a regular PE_DMA write of the same nbytes (IPCQ
overhead < 100 ns).
### D14. Invariants and failure modes
#### Invariants
I1. **Slot lifecycle exactly-once**: one send → exactly one recv.
I2. **Pointer monotonicity**: `my_head` / `my_tail` are monotonically
non-decreasing; `sender_seq` is strictly increasing.
I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank
B, then rank B's reverse-direction peer must be rank A. Verified at
init.
I4. **`buffer_kind` consistency**: all PEs in a process group share
the same `buffer_kind` (no mixed mode in the first cut).
I5. **op_log ordering**: send → DMA complete → recv possible. The
t_start order in op_log respects this causality.
I6. **Atomic data + metadata visibility (MUST)**: at the receiver
side, data write (`MemoryStore.write`) and metadata forward
(`peer_head_cache` update) **must execute in the same SimPy step**.
No yield is allowed between the two operations in PE_DMA's vc_comm
handler. Code review must reject any inserted `yield` (or `yield
from`): it would create a race in which `peer_head_cache` becomes
visible in a different step than the data.
I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6,
the step in which `peer_head_cache > my_tail` becomes truthy is the
same step in which the slot data is observable.
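Of these invariants, I3 is the easiest to check mechanically at init. The sketch below assumes the neighbor table is a nested dict `{rank: {direction: peer}}` and that "reverse direction" means N↔S and E↔W. `check_endpoint_consistency` is a hypothetical name, not the backend's validator.

```python
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def check_endpoint_consistency(table):
    """Return the (rank, direction, peer) triples that violate I3."""
    bad = []
    for rank, dirs in table.items():
        for direction, peer in dirs.items():
            # The peer's reverse-direction entry must point back at us.
            back = table.get(peer, {}).get(OPPOSITE[direction])
            if back != rank:
                bad.append((rank, direction, peer))
    return bad


world_size = 4
good_ring = {
    r: {"E": (r + 1) % world_size, "W": (r - 1) % world_size}
    for r in range(world_size)
}
broken = {0: {"E": 1}, 1: {"W": 2}, 2: {}}

ring_violations = check_endpoint_consistency(good_ring)
broken_violations = check_endpoint_consistency(broken)
```

An empty result means every link is reciprocal. In the broken table, rank 0 points east at rank 1 while rank 1's west link points at rank 2, so both entries are reported.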
#### Failure modes (runtime errors)
F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction
raises `IpcqInvalidDirection`; the simulation aborts.
F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched
send and recv. Not validated by default; opt-in strict mode catches
it (`strict_validation: true` on a PE_IPCQ node attrs).
F3. **Deadlock detection (timeout-based)**: the simulator empties its
schedule while a send/recv is still pending → engine raises
`IpcqDeadlock` and embeds a pointer dump.
F4. **Backend init failure**: missing `defaults.algorithm`, missing
`algorithms[name]`, module import failure, topology validation
failure (I3, I4) — all raised at `init_process_group` time.
F5. **Slot full + infinite backpressure**: the peer never recvs.
Surfaces as F3 timeout.
#### Diagnostics
- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as
`(rank, t, dir, nbytes)`.
- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)`
prints every PE_IPCQ ring buffer's `my_head`, `my_tail`,
`peer_head_cache`, `peer_tail_cache`.
- **Deadlock dump**: on hang the engine includes the pointer dump in
the `IpcqDeadlock` exception message.
### D15. Algorithm-author cheat sheet
Full step-by-step lives in
[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
shortest version:
| Things you touch | Things you don't |
|------------------|-------------------|
| `src/kernbench/ccl/algorithms/<your_algo>.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
| (Optional) `tests/test_<your_algo>.py` mock test | PE_IPCQ component, AhbmCCLBackend |
5-step flow: write the kernel → register in `ccl.yaml` → optional
`neighbors` override → optional mock unit test → SimPy validation via
`kernbench run --bench ccl_allreduce --verify-data`.
Common mistakes: using a direction that wasn't installed, sends
without matching recvs (deadlock), dtype/shape disagreement, assuming
fairness from `tl.recv()` round-robin, confusing
`tl.num_programs(axis)` with the CCL group size.
---
## HW Realization Notes (Informative)
**Status of this section**: Forward-looking. Describes how the simulator
contract (D1 through D15) would map to silicon. Not currently implemented;
subject to revision before tapeout. The simulator implements the
contract via Python/SimPy equivalents in
[pe_ipcq.py](../../src/kernbench/components/builtin/pe_ipcq.py) and
[pe_dma.py](../../src/kernbench/components/builtin/pe_dma.py).
### D16. Proposed HW block diagram and end-to-end dataflow
![PE Baseline Architecture](../diagrams/pe_baseline.png)
> Source: [`../diagrams/pe_baseline.d2`](../diagrams/pe_baseline.d2) — `d2 --layout=elk --scale 1.5`.
![PE Proposed Architecture](../diagrams/pe_proposed.png)
> Source: [`../diagrams/pe_proposed.d2`](../diagrams/pe_proposed.d2) — `d2 --layout=elk`.
**Baseline → Proposed key changes**:
- Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW)
- PE_IPCQ (SimPy component) → **IPCQ Controller** (HW register + combinational logic)
- **IPCQ Slot Region reserved area** within TCM
- Credit Injector / Receiver connect directly to the NoC via the Fabric Port
#### End-to-end sequence (HW view)
```mermaid
sequenceDiagram
participant CPU_A as PE_A: PE_CPU
participant IPCQ_A as PE_A: IPCQ Ctrl
participant DMA_A as PE_A: DMA
participant NOC as NoC Fabric
participant DMA_B as PE_B: DMA
participant IPCQ_B as PE_B: IPCQ Ctrl
participant TCM_B as PE_B: TCM
participant CPU_B as PE_B: PE_CPU
Note over CPU_A: tl.send(dir="E", src=0x1000)
CPU_A->>IPCQ_A: MMIO: send request
Note over IPCQ_A: Backpressure check:<br/>(head - peer_tail_cache) < n_slots → PASS<br/>Slot addr gen:<br/>dst = peer_rx_base + (head%n) × slot_size
IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head}
Note over IPCQ_A: my_head++
IPCQ_A-->>CPU_A: send returns (fire-and-forget)
Note over DMA_A: TCM read → snapshot in read buffer<br/>Flit pack: data + {sender_seq, dst_addr}
DMA_A->>NOC: IPCQ data flit(s)
Note over NOC: hop latency + BW drain
NOC->>DMA_B: IPCQ data flit(s)
Note over DMA_B: Terminal BW drain<br/>Slot write latency
rect rgb(255, 240, 220)
Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall
DMA_B->>TCM_B: write data → slot address
DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr}
end
Note over IPCQ_B: Range match dst_addr → direction "W"<br/>peer_head_cache["W"] = sender_seq + 1
IPCQ_B-->>CPU_B: recv_wake signal
Note over CPU_B: tl.recv(dir="W") wakes up
CPU_B->>IPCQ_B: recv request
Note over IPCQ_B: peer_head_cache > my_tail → YES<br/>slot_addr = rx_base + (tail%n) × slot_size
IPCQ_B-->>CPU_B: return slot_addr
CPU_B->>TCM_B: read data from slot
Note over IPCQ_B: my_tail++
IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa}
Note over NOC: credit traversal (NoC latency)
NOC->>IPCQ_A: Credit arrival
Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled)
```
### D17. IPCQ Controller HW Module (NEW)
The hardware control block sitting between PE_CPU and the DMA Engine.
Corresponds to the simulator's `PeIpcqComponent`.
#### QPair Register File
Per-direction queue-pair state held in flip-flops. The PE_CPU reads /
writes them via MMIO (CSRs); software populates them at init time.
```
Per-direction registers (each 64-bit):
my_head — sender write position (monotonic)
my_tail — receiver read position (monotonic)
peer_head_cache — last known peer head (updated by Meta Extractor)
peer_tail_cache — last known peer tail (updated by Credit Receiver)
rx_base_pa — this PE's rx buffer base physical address
peer_rx_base_pa — peer's rx buffer base physical address
n_slots — ring depth (power-of-2 constraint, see D21)
slot_size — bytes per slot
peer_credit_tgt — peer PE's credit-receive address
Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare)
Total: 8 dirs × 9 regs × 8 B = 576 B of flip-flops
```
#### Slot Address Generator (combinational)
```
Input: pointer (my_head or my_tail), n_slots, slot_size, base_pa
Output: slot_addr = base_pa + (pointer % n_slots) * slot_size
Implementation:
n_slots power-of-2 → pointer & (n_slots - 1) (AND mask, 1 gate)
slot_size power-of-2 → barrel shift (1 cycle)
64-bit add → ripple / Kogge-Stone adder (1 cycle)
Latency: 1–2 combinational cycles
```
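The mask-and-shift shortcut above can be sketched in Python as a reference model. This is a minimal sketch, not the simulator's code: the function name `slot_addr` and the validation errors are hypothetical, and it assumes both `n_slots` and `slot_size` are powers of two.

```python
def slot_addr(pointer: int, n_slots: int, slot_size: int, base_pa: int) -> int:
    """Return base_pa + (pointer % n_slots) * slot_size via mask and shift."""
    # The mask/shift shortcut is only valid for power-of-two operands.
    if n_slots <= 0 or n_slots & (n_slots - 1):
        raise ValueError("n_slots must be a power of two")
    if slot_size <= 0 or slot_size & (slot_size - 1):
        raise ValueError("slot_size must be a power of two")
    index = pointer & (n_slots - 1)      # same result as pointer % n_slots
    shift = slot_size.bit_length() - 1   # log2(slot_size)
    return base_pa + (index << shift)    # same result as index * slot_size
```

For any valid inputs the result matches the plain modulo-and-multiply formula, which is what makes the AND mask and barrel shift a safe substitution.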
#### Backpressure Comparator (combinational)
```
full = (my_head - peer_tail_cache) >= n_slots
Implementation: 64-bit subtract + unsigned compare
Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold
Latency: 1 cycle
```
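As a reference model, the comparator can be written as a 64-bit unsigned subtract followed by a compare. This is a sketch under assumptions: the name `is_full` is hypothetical, and the modulo-2^64 wrap handling is an assumption about how a 64-bit hardware subtractor would behave, not something the simulator contract states.

```python
MASK64 = (1 << 64) - 1  # width of the pointer registers


def is_full(my_head: int, peer_tail_cache: int, n_slots: int) -> bool:
    """True when every slot of the peer's ring is still in flight."""
    # Unsigned subtract modulo 2**64, as a 64-bit subtractor would produce.
    in_flight = (my_head - peer_tail_cache) & MASK64
    return in_flight >= n_slots
```

Because the subtraction is taken modulo 2^64, the check would stay correct even if `my_head` wrapped past zero while `peer_tail_cache` had not yet done so.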
#### Meta Extractor (inbound datapath sideband)
Wired into the DMA Engine's inbound vc_comm path. Extracts metadata
from arriving IPCQ flit headers and updates queue-pair state.
```
Trigger: DMA inbound write completion (same cycle)
Extract: {sender_seq, dst_addr} from flit header
Direction matching (ADR-0025 D2):
for each dir:
match = (base_pa[dir] <= dst_addr) && (dst_addr < base_pa[dir] + n_slots[dir] * slot_size[dir])
8× parallel range comparators + priority encoder
Update: peer_head_cache[matched_dir] = max(peer_head_cache[matched_dir], sender_seq + 1)
Output: recv_wake signal → PE_CPU interrupt / flag
Latency: 1 cycle (pipelined with the DMA write — I6 atomicity is intrinsic)
```
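The range match and monotonic update can be modelled sequentially in Python. This is a minimal sketch: `RxRing` and `on_ipcq_arrival` are hypothetical names, and the loop stands in for the 8 parallel comparators plus priority encoder.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RxRing:
    base_pa: int            # this PE's rx buffer base for one direction
    n_slots: int
    slot_size: int
    peer_head_cache: int = 0


def on_ipcq_arrival(rings: Dict[str, RxRing], sender_seq: int,
                    dst_addr: int) -> Optional[str]:
    """Match dst_addr to a direction and advance that peer_head_cache."""
    for name, ring in rings.items():
        end = ring.base_pa + ring.n_slots * ring.slot_size
        if ring.base_pa <= dst_addr < end:
            # max() keeps the cache monotonic if an older flit arrives late.
            ring.peer_head_cache = max(ring.peer_head_cache, sender_seq + 1)
            return name
    return None  # address outside every IPCQ ring
```

The returned direction name is what a model would use to raise the matching `recv_wake` signal.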
#### Credit Injector (outbound)
```
Trigger: recv completion (after my_tail increments)
Action: pack a 16 B credit packet → DMA vc_comm (or a dedicated credit VC)
Packet: {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa}
Latency: 1 cycle to generate; then NoC traversal
```
#### Credit Receiver (inbound sideband)
```
Trigger: 16 B credit packet arrival (from NoC)
Extract: {consumer_seq, dst_rx_base_pa}
Direction matching (ADR-0025 D3):
for each dir:
match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa)
Update: peer_tail_cache[matched_dir] = max(peer_tail_cache[matched_dir], consumer_seq)
Output: send_wake signal → deassert backpressure stall
Latency: 1 cycle
```
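The credit round trip (injector on the consumer, receiver on the producer) can be sketched as two small functions. The names `make_credit` and `apply_credit` and the plain-dict packet representation are hypothetical; only the field names come from the packet description above.

```python
from typing import Dict, Optional


def make_credit(my_tail: int, my_rx_base_pa: int) -> dict:
    """Credit Injector: build the credit after my_tail has incremented."""
    return {"consumer_seq": my_tail, "dst_rx_base_pa": my_rx_base_pa}


def apply_credit(peer_rx_base_pa: Dict[str, int],
                 peer_tail_cache: Dict[str, int],
                 credit: dict) -> Optional[str]:
    """Credit Receiver: equality-match the base address, then update."""
    for direction, base in peer_rx_base_pa.items():
        if base == credit["dst_rx_base_pa"]:
            peer_tail_cache[direction] = max(peer_tail_cache[direction],
                                             credit["consumer_seq"])
            return direction  # direction whose backpressure may deassert
    return None
```

A stale credit (lower `consumer_seq` than already cached) leaves the cache unchanged, so reordered credits cannot move the sender's view of the peer tail backwards.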
### D18. DMA Engine vc_comm IPCQ-aware mode
Add IPCQ-flit handling to the existing vc_comm channel (D8).
**Outbound**:
1. Receive a command from the IPCQ Controller: `{src_addr, dst_addr, nbytes, sender_seq}`.
2. Read `src_addr` from TCM → snapshot into the DMA read buffer (standard DMA behavior).
3. Pack flit: data + piggyback metadata (`sender_seq`, `dst_addr`).
4. Inject into the NoC fabric port.
5. Fire-and-forget (no completion wait).
**Inbound**:
1. Receive an IPCQ flit from the NoC.
2. Charge terminal BW drain (`drain_ns = nbytes / bottleneck_bw`).
3. Charge slot write latency (per backing memory tier).
4. **ATOMIC** (same pipeline stage, no stall insertion):
- TCM write: data → slot address.
- Meta Extractor trigger: `sender_seq` + `dst_addr` → IPCQ Controller.
5. Done.
**I6 atomicity guaranteed in hardware**: TCM write completion and Meta
Extractor trigger occur in the same pipeline stage, so no separate
synchronization is needed. The simulator's "no SimPy yield between
`MemoryStore.write` and `IpcqMetaArrival` put" (D9, I6) is preserved
naturally.
#### Data snapshot semantics
Data latched into the DMA read buffer is unaffected by subsequent
writes to `src` memory. This is standard DMA read-then-write
behavior; no extra HW is required.
#### Credit virtual channel (optional)
- **Option A**: multiplex credits onto vc_comm (distinguish via 16 B
header-only flits).
- **Option B**: add a third dedicated credit VC (strict priority > data).
Option B is friendlier to deadlock prevention, but a 16 B credit's BW
impact is negligible, so Option A suffices.
### D19. Fabric flit format extension
```
Generic data flit (e.g. 512-bit):
┌──────────────────────────────────────────┐
│ [511:480] routing header (32b) │
│ [479:0] payload (480b = 60 B) │
└──────────────────────────────────────────┘
IPCQ data flit (only the first flit carries metadata):
┌──────────────────────────────────────────┐
│ [511:480] routing header (32b) │
│ [511] ipcq_flag (1b) │ ← IPCQ vs. normal DMA
│ [510:509] vc_id (2b) │
│ [508:480] route + hop count │
│ [479:416] ipcq_metadata (64b) │ ← piggyback
│ [479:448] sender_seq (32b) │
│ [447:416] dst_addr[31:0] (32b) │ ← used for direction match
│ [415:0] payload (416b = 52 B) │
└──────────────────────────────────────────┘
Subsequent flits: full 60 B payload (no metadata).
Credit-only flit (128-bit, header-only):
┌──────────────────────────────────────────┐
│ [127:96] routing header (32b) │
│ [127] credit_flag (1b) │
│ [95:64] consumer_seq (32b) │
│ [63:0] dst_rx_base_pa (64b) │
└──────────────────────────────────────────┘
```
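To make the credit-only layout concrete, the following sketch packs and unpacks the 128-bit flit as a Python integer. The function names are hypothetical, and placing the remaining routing bits in `[126:96]` is an assumption; the diagram only fixes `credit_flag`, `consumer_seq`, and `dst_rx_base_pa`.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def pack_credit(route: int, consumer_seq: int, dst_rx_base_pa: int) -> int:
    """Pack a credit-only flit into a 128-bit integer."""
    if not 0 <= route < (1 << 31):
        raise ValueError("route must fit in 31 bits (assumed [126:96])")
    flit = 1 << 127                        # [127]    credit_flag
    flit |= route << 96                    # [126:96] routing bits (assumed)
    flit |= (consumer_seq & MASK32) << 64  # [95:64]  consumer_seq, low 32 bits
    flit |= dst_rx_base_pa & MASK64        # [63:0]   dst_rx_base_pa
    return flit


def unpack_credit(flit: int) -> dict:
    """Recover the fields written by pack_credit."""
    return {
        "credit_flag": (flit >> 127) & 1,
        "route": (flit >> 96) & ((1 << 31) - 1),
        "consumer_seq": (flit >> 64) & MASK32,
        "dst_rx_base_pa": flit & MASK64,
    }
```

Note that the flit carries only 32 bits of `consumer_seq` while the pointer registers are 64-bit, so a receiver would have to reconstruct the upper bits; the sketch does not attempt that.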
First-flit payload shrinks from 60 B to 52 B (13 % overhead). For
multi-flit transfers the subsequent flits carry full payloads, so
overhead < 1 % on large transfers.
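The overhead figures can be checked with a short calculation. This sketch defines overhead as the 8 B of metadata divided by the total payload capacity of the flits used; that definition and the function names are assumptions made for illustration.

```python
import math

FLIT_PAYLOAD = 60         # bytes of payload in a generic flit
FIRST_FLIT_PAYLOAD = 52   # first IPCQ flit: 60 B minus 8 B of metadata
METADATA_BYTES = FLIT_PAYLOAD - FIRST_FLIT_PAYLOAD


def flits_needed(nbytes: int) -> int:
    """Number of flits for an IPCQ transfer of nbytes."""
    if nbytes <= FIRST_FLIT_PAYLOAD:
        return 1
    return 1 + math.ceil((nbytes - FIRST_FLIT_PAYLOAD) / FLIT_PAYLOAD)


def metadata_overhead(nbytes: int) -> float:
    """Fraction of payload capacity consumed by the piggybacked metadata."""
    return METADATA_BYTES / (flits_needed(nbytes) * FLIT_PAYLOAD)
```

With these definitions a single-flit transfer gives 8/60, about 13 %, and a 4 KiB transfer (69 flits) gives about 0.2 %, consistent with the figures quoted above.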
### D20. TCM IPCQ slot region layout
```
TCM Memory Map (16 MB):
┌─────────────────────────────┐ 0x000000
│ Kernel Working Memory │
│ (compute tensors) │
│ ~14 MB │
├─────────────────────────────┤ 0xE00000
│ IPCQ RX Buffers │
│ Dir N: slots × slot_size │
│ Dir S: slots × slot_size │
│ Dir E: slots × slot_size │
│ Dir W: slots × slot_size │
│ ~1 MB │
├─────────────────────────────┤ 0xF00000
│ IPCQ Metadata / Scratch │
│ ~1 MB │
└─────────────────────────────┘ 0xFFFFFF
```
Place the IPCQ region in the upper TCM bank to minimize bank conflict
with compute accesses (see Risk D22).
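A small helper can lay out the per-direction rings inside the RX region and reject configurations that overflow it. This is a sketch: the function name is hypothetical, and contiguous packing in the listed direction order is an assumption; only the region bounds come from the memory map.

```python
from typing import Dict, Iterable

IPCQ_RX_BASE = 0xE00000    # start of the IPCQ RX buffer region
IPCQ_RX_LIMIT = 0xF00000   # start of the metadata / scratch region


def layout_rx_rings(directions: Iterable[str], n_slots: int,
                    slot_size: int) -> Dict[str, int]:
    """Assign each direction a contiguous ring; return its base address."""
    bases = {}
    cursor = IPCQ_RX_BASE
    for direction in directions:
        bases[direction] = cursor
        cursor += n_slots * slot_size
    if cursor > IPCQ_RX_LIMIT:
        raise ValueError("rings exceed the 1 MB IPCQ RX region")
    return bases
```

For example, four directions with 8 slots of 32 KB each fill the 1 MB region exactly, while the same configuration with 64 KB slots would be rejected.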
### D21. 2 nm implementation analysis
#### Area estimate
| Module | Gate count | Area (2 nm est.) | Notes |
|---|---|---|---|
| QPair Register File | ~4.6 K FF | 0.002 mm² | 576 B of flip-flops |
| Slot Addr Gen + Backpressure | ~5 K gates | 0.001 mm² | Combinational |
| Meta Extractor + Credit Logic | ~3 K gates | 0.001 mm² | 8× parallel comparators |
| **IPCQ Controller subtotal** | **~12.6 K** | **~0.004 mm²** | **< 0.1 % of the PE area** |
| DMA vc_comm extension | ~2 K gates | 0.002 mm² | Flit pack / unpack |
| **Total delta** | **~14.6 K** | **~0.006 mm²** | |
#### Timing
| Path | Delay (2 nm est.) | Target clock | Margin |
|---|---|---|---|
| Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× |
| Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× |
| Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× |
| Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× |
All four estimated paths fit within one 1 GHz cycle with at least 2×
margin, so timing closure is not expected to be a concern.
#### Power
- Active: ~1 mW (register R/W + comparators while sending / receiving).
- Idle: leakage only.
- Negligible vs. total PE power.
#### Constraints
| Item | Constraint | Rationale |
|---|---|---|
| `n_slots` | **must be power-of-2** | mod → AND mask (1 gate). Arbitrary values need a divider (~10 cycles). |
| `slot_size` | **power-of-2 recommended** | mul → barrel shift. Arbitrary values need a multiplier. |
| TCM IPCQ region | **dedicated bank** | Prevents bank conflict with compute accesses. |
### D22. Risk assessment
#### TCM bank conflict
- **Risk**: IPCQ slot write and compute read both target the same TCM
bank → stall.
- **Mitigation**: place the IPCQ region in a dedicated upper-address
bank (D20).
- **Cost**: a small loss of TCM banking flexibility.
- **Severity**: Medium (performance), Low (no correctness issue).
#### Credit return latency under congestion
- **Risk**: NoC congestion → credit-return delay → sender backpressure
stall.
- **Mitigation**:
- Put credits on a separate VC with strict priority (16 B →
negligible BW impact).
- Or pick `n_slots` generously (8+) so credit delay is absorbed by
buffer depth.
- **Severity**: Low (16 B credits contribute almost nothing to
congestion).
#### Inter-direction ordering
- **Risk**: simultaneous sends from one PE on multiple directions.
- **Mitigation**: per-direction monotonic `sender_seq` suffices.
Inter-direction ordering is the kernel's (software's)
responsibility — same as the simulator model (D2 + D4).
- **Severity**: Low (resolved by design).
### D23. HW alternatives considered
#### Doorbell + polling (traditional)
```
Send: DMA write data → DMA write a doorbell register at the peer → peer polls doorbell
Recv: polling loop on the doorbell, or interrupt-driven
```
| Pros | Cons |
|---|---|
| Simple HW (no IPCQ controller) | Two DMA transactions (data + doorbell) |
| Reuses existing DMA | Needs explicit fence between data and doorbell |
| | Polling burns power; interrupt adds latency |
**Verdict**: 2–3× latency vs. piggyback. **Rejected.**
#### Hardware message queue (NVIDIA NVLink style)
```
Send: CPU → push a descriptor onto HMQ → HW relays it to the peer HMQ
Recv: pop a descriptor from HMQ → use the data pointer
```
| Pros | Cons |
|---|---|
| CPU only writes descriptors | Needs a separate HMQ engine (~0.05 mm²) |
| Descriptor / data separation is flexible | Separate datapath from DMA → area / power overlap |
| | Large tensors still need DMA |
**Verdict**: With CCL's large-tensor pattern, DMA is still required,
so HMQ + DMA is a duplicated datapath. **Rejected.**
#### RDMA-style completion queue (CQ)
```
Send: DMA write → CQE auto-posted at the peer
Recv: CQ poll / interrupt → read data location
```
| Pros | Cons |
|---|---|
| Mature InfiniBand / RoCE model | CQ management logic + CQE memory overhead |
| Good multi-tenant isolation | CQE / data ordering needs extra plumbing |
| | Over-engineered for PE-to-PE CCL |
**Verdict**: RDMA CQ is suited to host-facing NICs with multi-tenant
isolation. For single-owner PE-to-PE this is needless complexity.
**Rejected.**
#### Credit-in-data piggyback (v2 optimization candidate)
In the current design the credit return is a separate 16 B packet.
For bidirectional traffic patterns, **the credit can be folded into a
reverse-direction data flit**.
```
PE_A →E→ PE_B: data + sender_seq=3
PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4 ← credit folded into data
```
| Pros | Cons |
|---|---|
| Removes the dedicated credit packet → NoC BW savings | Needs fallback for unidirectional patterns |
| Bidirectional allreduce: credit latency → 0 | +8 B in the flit header (negligible) |
| | Slightly more logic complexity |
**Verdict**: A strong optimization. Eliminates the credit packet for
bidirectional allreduce; the standalone credit fallback is retained.
**Recommended for v2.**
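The piggyback-with-fallback decision could look roughly like the sketch below. It is illustrative only: the function name and the dict-based header are hypothetical, and `credit_ack` is taken from the example above rather than from any implemented format.

```python
from typing import Optional


def return_credit(my_tail: int,
                  reverse_header: Optional[dict]) -> Optional[dict]:
    """Fold the credit into a pending reverse-direction data header if one
    exists; otherwise fall back to a standalone credit packet."""
    if reverse_header is not None:
        reverse_header["credit_ack"] = my_tail  # rides on the data flit
        return None                             # no separate packet needed
    return {"consumer_seq": my_tail}            # standalone credit fallback
```

The fallback branch is what keeps unidirectional patterns from stalling when no reverse-direction data is queued.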
### Open HW questions
- What fraction of TCM may the IPCQ slot region occupy? (Current
assumption: ~1 MB / 16 MB = 6.25 %.)
- Dedicated credit VC vs. vc_comm multiplexing? (See D18.)
- Inter-SIP link flit-format compatibility verification.
- Maximum `n_slots`? (8 directions × 8 slots × 64 KB = 4 MB → 25 % of
TCM.)
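The sizing questions above reduce to simple arithmetic, sketched here to make the trade-off easy to explore. The function names are hypothetical; the 16 MB TCM size and the power-of-two constraint on `n_slots` come from D20 and D21.

```python
TCM_BYTES = 16 * 1024 * 1024  # TCM size from the D20 memory map


def ipcq_fraction(n_dirs: int, n_slots: int, slot_size: int) -> float:
    """Fraction of TCM consumed by the IPCQ rings."""
    return n_dirs * n_slots * slot_size / TCM_BYTES


def max_pow2_slots(budget_bytes: int, n_dirs: int, slot_size: int) -> int:
    """Largest power-of-two n_slots that fits the budget (0 if none fits)."""
    per_dir = budget_bytes // (n_dirs * slot_size)
    if per_dir < 1:
        return 0
    return 1 << (per_dir.bit_length() - 1)  # round down to a power of two
```

With 8 directions and 64 KB slots, a 1 MB budget allows only 2 slots per direction, while 8 slots per direction requires the 4 MB (25 %) figure quoted above.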
---
## Non-goals
- **Host collective**: a model where `dist.all_reduce` itself moves
data on the host side is out of scope. This ADR only covers
communication that happens inside the PE kernel.
- **All-reduce algorithms**: ring / tree / etc. live in algorithm
modules and can be added without amending this ADR.
- **Reliability / error handling**: link faults, send/recv failure
recovery, etc. are out of scope.
- **NoC arbiter precision**: dynamic VC contention is left for a future
ADR (see D8).
---
## Open questions
- **VC arbitration accuracy** — the first cut uses deterministic
chunk interleave + weighted round-robin; heavy contention may report
optimistic latency. A NoC arbiter component can be added later.
- **Credit return BW model** — the fast path is currently outside the
fabric BW contention model. Can be modeled as a separate link or
switched to piggyback (`credit_return_mode: piggyback`).
- **Ring buffer slot allocation metadata** — whether the host pushes
IPCQ buffer metadata via sideband or via a fabric message similar to
`MmuMapMsg` is open.
- **VC BW split default** — 50/50 vs. weighted (e.g. 80/20). Exposed in
`ccl.yaml`; default value TBD.
- **Direction count** — 4 (N/S/E/W) is fixed in the first cut; 6
(with Up/Down for 3D) or N (variable) is future work.
- **Multi-tile aggregation primitives** — whether
`tl.recv_all` or similar is needed for fan-in.
- **Round-robin recv fairness** — current weak fairness can starve;
strict fairness counter is future work.
- **Deadlock detection precision** — currently timeout-based; a
realtime wait-for graph would enable deterministic detection.
---
## Consequences
### Positive
- Direct PE-to-PE communication makes it possible to write CCL kernels.
- Host stays minimal (just `launch`); synchronization happens inside
  the PE → strong compute / comm overlap.
- VCs eliminate HoL blocking → collective latency is not blocked by
compute traffic.
- Buffer placement and backpressure mode are init-time parameters →
easy to benchmark.
- Four-direction logical neighbors → host is free to map
ring/mesh/tree algorithms.
### Negative
- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE.
- VC arbitration is a first-order approximation; heavy contention
scenarios may report slightly optimistic latency vs real HW (D8).
- Chunk-level interleave makes PE_DMA implementation more complex.