Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 short (3- or 4-letter) category prefixes
(dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
retroactive docs pending verification.
Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
deleted; ADR-0019/0021 moved to adr-history with one-line stub status
Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
selection, flit-aware per-flit commit, async finalize, command-only
fallback path)
Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
"Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
(now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)
Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py
Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.
Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
(ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-0023: PE-level IPCQ — Inter-PE Collective Communication
Status
Accepted
Context
Goal
Add the infrastructure that lets CCL (Collective Communication Library) kernels run inside a PE. The host just launches a kernel on each SIP; the actual synchronization and data movement happen inside the PE kernel via an IPCQ (Inter-Process Communication Queue).
This mirrors how NCCL performs NVLink communication inside a GPU
kernel, or how Cerebras / Tenstorrent expose core-local communication
queues. Host-level collectives (dist.all_reduce) are deferred to
future work; this ADR focuses solely on the kernel-side collective
infrastructure.
Problems to solve
- PE-to-PE direct data movement (writing into a peer's memory).
- Synchronization — the sender must check that the receiver has space in its buffer (backpressure).
- Resource contention between compute traffic and communication traffic (Head-of-Line blocking).
- The host must be able to construct logical neighbor topologies (ring / mesh / tree) per algorithm.
Decision
D1. Add a new PE_IPCQ component
A new component PE_IPCQ is added inside each PE. It follows the same
pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a
distinct component.
PE
├── PE_CPU
├── PE_SCHEDULER
├── PE_DMA
├── PE_IPCQ ← new
├── PE_FETCH_STORE
├── PE_GEMM
├── PE_MATH
├── PE_TCM
├── PE_MMU
Role separation (control plane vs. data plane):
- PE_IPCQ (control plane): ring-buffer address arithmetic, head / tail pointer management, peer pointer caches, backpressure, 4-direction neighbor mapping.
- PE_DMA (data plane): actually moves data through cube_noc / UCIe / PCIE into the peer's memory.
PE_IPCQ does not move data itself — it delegates to PE_DMA.
D2. Ring buffer model
Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.
@dataclass
class IpcqQueuePair:
    direction: Direction      # N/S/E/W
    peer: IpcqEndpoint        # set by host at init time (D2.5)
    tx_buffer_base: int       # outgoing data base addr (in our memory)
    rx_buffer_base: int       # incoming data base addr (in our memory)
    slot_size: int            # 1 tile per slot
    n_slots: int              # ring depth
    my_head: int              # next slot we will write/send into
    my_tail: int              # next slot we will read/recv from
    peer_head_cache: int      # peer's last-seen head (updated via D9 piggyback)
    peer_tail_cache: int      # peer's last-seen tail (updated via D9 fast-path credit)
Canonical field names: throughout this ADR the four names above
(my_head, my_tail, peer_head_cache, peer_tail_cache) are used
consistently. Synonyms (peer_head_local, peer_head, peer_tail,
etc.) are not used.
| Field | Owner | Updated when |
|---|---|---|
| my_head | local PE_IPCQ | immediately after tl.send (send tracking) |
| my_tail | local PE_IPCQ | immediately after tl.recv (recv tracking) |
| peer_head_cache | local PE_IPCQ | on IpcqMetaArrival (D9 piggyback) |
| peer_tail_cache | local PE_IPCQ | on IpcqCreditMetadata (D9 fast path) |
Slot unit: fixed-size, one slot holds one full tile (no descriptor indirection). Full data embedded in the slot. See D5.
D2.5. IpcqEndpoint schema
IpcqQueuePair.peer carries everything the sender needs to compute the
peer's rx slot address:
@dataclass(frozen=True)
class IpcqEndpoint:
    sip: int
    cube: int
    pe: int
    buffer_kind: str          # "tcm" | "hbm" | "sram"
    rx_base_pa: int           # peer rx_buffer base PA (PhysAddr.encode())
    rx_base_va: int           # peer rx_buffer base VA (optional, MMU mode)
    n_slots: int              # peer ring depth (for wrap-around)
    slot_size: int            # peer slot size (for offset)
Address computation:
slot_idx = self.my_head % peer.n_slots
dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
PE_IPCQ passes dst_pa to PE_DMA inside an IpcqDmaToken. PE_DMA
(vc_comm) routes the data to dst_pa through the fabric.
Endpoint construction order: at backend init (D10), the IPCQ
buffers for every PE are allocated first (so each rank knows the
others' PA), then the per-rank neighbor tables are built and pushed to
PE_IPCQ via IpcqInitMsg.
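The address computation above can be sketched as a small helper. This is a minimal sketch rather than the simulator's code: `EndpointSketch` and `peer_rx_slot_pa` are hypothetical names, and the endpoint is trimmed to the three fields the arithmetic needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointSketch:
    """Hypothetical, trimmed-down stand-in for IpcqEndpoint."""
    rx_base_pa: int  # peer rx_buffer base physical address
    n_slots: int     # peer ring depth
    slot_size: int   # bytes per slot


def peer_rx_slot_pa(my_head: int, peer: EndpointSketch) -> int:
    """Return the PA of the peer rx slot targeted by send number `my_head`."""
    slot_idx = my_head % peer.n_slots  # wrap around the ring
    return peer.rx_base_pa + slot_idx * peer.slot_size


peer = EndpointSketch(rx_base_pa=0x1000_0000, n_slots=8, slot_size=4096)
first = peer_rx_slot_pa(0, peer)    # slot 0
wrapped = peer_rx_slot_pa(9, peer)  # 9 % 8 == 1, so slot 1
```

Because `my_head` only ever grows, the modulo is what maps it back onto the fixed set of slots.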
D3. Four-direction mapping ≡ logical ProcessGroup
The PE views four directions (N/S/E/W) as logical ports. Real peer addresses are configured by the host CCL init, per the chosen algorithm. The PE kernel never knows the topology, only directions.
# 1D ring
for rank in range(world_size):
    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])
# 2D mesh
for r in range(R):
    for c in range(C):
        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
The PE code does not need to know where tl.send(dir="E", ...) actually
ends up.
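The same two mappings can be written as pure functions that return neighbor tables, which makes them easy to inspect without a backend. This is an illustrative sketch: the function names are hypothetical, and ranks are plain integers or (row, col) tuples rather than real endpoints.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]


def ring_1d_neighbors(world_size: int) -> Dict[int, Dict[str, int]]:
    """Map each rank to its E/W peers on a wrap-around 1D ring."""
    return {
        rank: {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}
        for rank in range(world_size)
    }


def mesh_2d_neighbors(rows: int, cols: int) -> Dict[Coord, Dict[str, Coord]]:
    """Map each (row, col) to its N/S/E/W peers on a wrap-around 2D mesh."""
    table: Dict[Coord, Dict[str, Coord]] = {}
    for r in range(rows):
        for c in range(cols):
            table[(r, c)] = {
                "N": ((r - 1) % rows, c),
                "S": ((r + 1) % rows, c),
                "E": (r, (c + 1) % cols),
                "W": (r, (c - 1) % cols),
            }
    return table


ring = ring_1d_neighbors(4)
mesh = mesh_2d_neighbors(2, 3)
```

Python's `%` returns a non-negative result for a positive modulus, so `(rank - 1) % world_size` wraps rank 0 to the last rank without a special case.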
D4. PE kernel API
# Send (blocking; may stall on backpressure)
tl.send(dir: str, src=TensorHandle)
tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
# Recv (blocking)
recv = tl.recv(dir: str, shape=..., dtype=...)
recv = tl.recv(shape=..., dtype=...) # round-robin across 4 directions
# Recv (non-blocking)
fut = tl.recv_async(dir: str, shape=..., dtype=...)
recv = tl.wait(fut)
tl.recv() (no direction) keeps a last_polled_dir cursor and on each
call rotates through directions, returning the first available slot.
Empty in all 4 directions → wait.
Fairness is weak: the rotating start mitigates simple bias, but if
one direction always wins the race the others can starve. Algorithms
that need strict fairness must call tl.recv(dir=...) explicitly.
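One way to picture the rotating cursor is the sketch below. It assumes the scan starts at the direction just after the last one served; the ADR only says the cursor rotates, so the exact start position and the name `pick_recv_direction` are assumptions.

```python
from typing import Optional, Set, Tuple

DIRECTIONS = ("N", "S", "E", "W")


def pick_recv_direction(last_polled: int,
                        ready: Set[str]) -> Tuple[Optional[str], int]:
    """Scan the four directions starting just after `last_polled`.

    Returns (direction, new_cursor). The direction is None when nothing is
    ready; the caller would then wait and the cursor stays where it was.
    """
    for step in range(1, len(DIRECTIONS) + 1):
        idx = (last_polled + step) % len(DIRECTIONS)
        if DIRECTIONS[idx] in ready:
            return DIRECTIONS[idx], idx
    return None, last_polled
```

With N and E both ready and the cursor on N, the call returns E; the next call (cursor on E) returns N. The rotation removes a fixed bias, but nothing here bounds how long a rarely ready direction waits, which is the weak fairness described above.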
D5. Single-hop DMA write + full-data slot model
Data moves from sender memory into the receiver's ring slot in one DMA transfer. Key properties:
- Single-hop: the sender already knows the peer rx slot address and fires one fabric DMA into it.
- No CPU memcpy: the CPU never copies data.
- No intermediate staging: neither side keeps a separate staging buffer (sender uses the source addr directly; receiver gets the data in its ring slot directly).
(Strictly speaking the fabric DMA write does happen, so this is not literally "no data movement" — it's the same property NCCL labels "zero-copy", meaning no CPU memcpy and no staging copy.)
PE A: tl.send(E, src_addr, nbytes)
1. IPCQ computes the peer rx slot address:
dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
(full → sleep / poll)
3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
4. my_head += 1
PE B: data = tl.recv(W)
1. Look at rx_buffer[my_tail % n_slots]
2. Wait for the data to arrive (D7 backpressure mode)
3. Return the slot address to the kernel (or fetch into register file)
4. my_tail += 1
5. Issue a credit-return fast path (D9): after the bottleneck-BW
latency the peer A's peer_tail_cache is updated.
The slot holds the full tile. The receiver only reads its own rx_buffer; it never reads back into A's memory. The sender knows the peer rx slot address and DMAs directly into it (single-hop).
The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local to the PE).
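The sender-side pointer arithmetic in the flow above can be modeled in a few lines. This sketch tracks pointers only and moves no data; the class name is hypothetical, and taking the maximum in `on_credit` is an added assumption so that a stale credit cannot move the cache backwards.

```python
class RingSenderSketch:
    """Sender-side pointer bookkeeping only; no data movement is modeled."""

    def __init__(self, n_slots: int) -> None:
        self.n_slots = n_slots
        self.my_head = 0          # sends issued so far
        self.peer_tail_cache = 0  # peer recvs we have been told about

    def can_send(self) -> bool:
        # Slots in flight = my_head - peer_tail_cache; must stay below depth.
        return self.my_head - self.peer_tail_cache < self.n_slots

    def send(self) -> int:
        """Claim the next slot index, or raise if the ring is full."""
        if not self.can_send():
            raise RuntimeError("ring full: would stall on backpressure")
        slot = self.my_head % self.n_slots
        self.my_head += 1
        return slot

    def on_credit(self, consumer_seq: int) -> None:
        # Assumption: ignore credits older than what we already know.
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)


tx = RingSenderSketch(n_slots=8)
slots = [tx.send() for _ in range(8)]  # fills the peer ring
full = not tx.can_send()
tx.on_credit(consumer_seq=1)           # peer consumed one slot
reused = tx.send()                     # wraps around to slot 0
```

After eight sends the ring is full; a single credit frees exactly one slot, and the next send reuses slot 0.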
D6. Buffer placement — three-way benchmark
The host CCL init picks the IPCQ ring-buffer location:
ipcq_init(
    backend="ahbm",
    buffer_kind="tcm" | "hbm" | "sram",
    n_slots=8,
    slot_size=4096,
)
| Location | Trait | Trade-off |
|---|---|---|
| PE_TCM | Attached to the PE; fast | Small; competes with PE-internal resources |
| PE-local HBM | Large; via DMA | Higher latency |
| Cube SRAM | Mid-size; cube-shared | Cube-internal contention |
All three locations run the same kernel code; only the init differs.
D7. Backpressure — two-mode benchmark
How the sender or receiver waits when peer slots are full / data not yet arrived:
| Mode | Behavior | Model |
|---|---|---|
| poll | Periodically re-check the cached peer pointer | Spin loop |
| sleep | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |
ipcq_init(backpressure="poll" | "sleep", ...)
Both modes are implemented so latency / throughput trade-offs can be benchmarked.
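The latency side of that trade-off can be illustrated with a small timing function. This is a sketch under stated assumptions: `poll_interval_ns` is a hypothetical parameter (the ADR does not name a poll period), and the function only computes when a stalled operation would resume, not the cost of spinning.

```python
import math


def wake_time_ns(mode: str, wait_start_ns: float, event_ns: float,
                 poll_interval_ns: float = 10.0) -> float:
    """Time at which a stalled send/recv resumes.

    sleep: resumes exactly when the peer trigger fires.
    poll:  resumes at the first poll tick at or after the trigger.
    """
    if event_ns <= wait_start_ns:
        return wait_start_ns  # condition already satisfied, no wait
    if mode == "sleep":
        return event_ns
    if mode == "poll":
        ticks = math.ceil((event_ns - wait_start_ns) / poll_interval_ns)
        return wait_start_ns + ticks * poll_interval_ns
    raise ValueError(f"unknown backpressure mode: {mode!r}")
```

For a trigger at 25 ns and a 10 ns poll period, sleep resumes at 25 ns while poll resumes at 30 ns; in this model, poll mode can add up to one poll period of wake-up latency.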
D8. PE_DMA virtual channels
Extend PE_DMA from a single queue into a two-channel virtual-channel model.
PE_DMA
├── vc_compute: tile load / store / writeback for GEMM and Math
└── vc_comm: IPCQ send data
Each VC has an independent state machine:
- One channel stalling does not block the other.
- The same physical link (cube_noc, UCIe, …) is shared, but link BW is split between channels.
Chunk-level interleave:
- Large GEMM tile DMAs do not lock the link end-to-end.
- Progress happens in chunks (e.g. 256 B); each chunk shares link BW with the other VC's pending chunks.
- Chunk size is an init parameter (smaller = fairer, larger = more efficient).
Net effect:
- HoL blocking is eliminated (an IPCQ send can interleave with a long compute DMA).
- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM pattern).
- Matches the NoC-virtual-channel pattern used in real HW.
First-implementation accuracy limit (intentional): this ADR's
first cut uses deterministic chunk-level interleave + weighted
round-robin arbitration (default 50 / 50, exposed in ccl.yaml).
This is a first-order approximation and is simpler than real HW
dynamic-contention / credit-based arbiters. Functional correctness is
unaffected, but heavy-contention scenarios may report slightly
optimistic latency vs. real HW. A separate ADR can add a NoC arbiter
component later if more precision is needed.
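A deterministic chunk interleave with weighted round-robin can be sketched as follows. The function name and the dict-based interface are hypothetical; the sketch only produces the order in which the two VCs are granted chunks and does not model link timing.

```python
from typing import Dict, List


def interleave_chunks(pending_bytes: Dict[str, int], chunk_size: int,
                      weights: Dict[str, int]) -> List[str]:
    """Return the order in which VCs are granted one chunk of link time.

    Each round, a VC with pending data gets up to `weights[vc]` grants.
    A VC with nothing pending is skipped, so the other VC keeps the link.
    """
    if chunk_size <= 0 or any(w <= 0 for w in weights.values()):
        raise ValueError("chunk_size and every weight must be positive")
    if not set(pending_bytes) <= set(weights):
        raise ValueError("every VC with pending bytes needs a weight")
    remaining = dict(pending_bytes)
    order: List[str] = []
    while any(v > 0 for v in remaining.values()):
        for vc, quota in weights.items():
            for _ in range(quota):
                if remaining.get(vc, 0) <= 0:
                    break
                remaining[vc] -= chunk_size
                order.append(vc)
    return order


# A 1 KiB compute DMA and a 512 B IPCQ send, 256 B chunks, 50/50 split.
order = interleave_chunks(
    {"vc_compute": 1024, "vc_comm": 512},
    chunk_size=256,
    weights={"vc_compute": 1, "vc_comm": 1},
)
```

The IPCQ send gets its first chunk after a single compute chunk instead of waiting for the whole 1 KiB transfer, which is the HoL-blocking relief the VC split is meant to provide.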
Token routing
- Compute tokens (TileToken): go through the existing PE_FETCH_STORE → PE_DMA chain.
- Communication tokens (IpcqDmaToken, new): PE_IPCQ → PE_DMA self-routing.
- PE_DMA picks the channel by token type.
class PeDmaComponent:
    def _process(self, env, token):
        if isinstance(token, IpcqDmaToken):
            yield from self._vc_comm_process(env, token)
        else:
            yield from self._vc_compute_process(env, token)
D9. Pointer synchronization — DMA payload piggyback
Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so pointers update along with the data. This simulation adopts the same model: no separate control channel — metadata travels with the data.
The big benefits:
- Automatic ordering: data and metadata move on the same token, so data is visible before the head_cache update. No race.
- HW fidelity: matches NVLink / UCIe piggybacked headers.
- Component simplification: no separate IpcqPtrUpdate event type.
Send flow (head update via piggyback)
PE A: tl.send(E, src_addr, nbytes)
1. PE_IPCQ checks backpressure (using peer_tail_cache)
2. PE_IPCQ creates an IpcqDmaToken:
- data body (src_addr → peer dst_addr)
- piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
3. Hand the token to PE_DMA(vc_comm)
4. PE A increments my_head (send tracking)
[fabric DMA: latency elapses]
PE B's PE_DMA receives the token
5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)
PE B's PE_IPCQ receives the metadata
7. Updates peer_head_cache (= A's head)
8. Wakes any pending recv on that direction
Steps 5 and 6 must execute in the same SimPy step — DMA completion makes data and metadata atomically visible.
Recv flow (credit return — fast path with bottleneck-BW latency)
When the receiver frees a slot, the sender must learn about it (backpressure release). Unlike data, the credit return does not travel through general vc_comm fabric — it uses a separate fast path, an abstraction of the NVLink / UCIe credit-return wire.
Latency is computed from the full path latency (per-node overhead + edge propagation + drain), not a magic constant:
credit_size_bytes = 16 (ccl.yaml: ipcq_credit_size_bytes)
path = router.find_path(self_pe, peer_pe.pe_dma)
latency = compute_path_latency_ns(path, credit_size_bytes)
= sum(edge.distance_mm * ns_per_mm)
+ sum(node_overhead_ns[n] for n in path)
+ credit_size_bytes / bottleneck_bw_on_path
The router auto-appends .pe_dma to the source only, so the
destination MUST be spelled with the explicit .pe_dma suffix or
find_path raises and the credit silently teleports at zero cost
(latent bug fixed alongside this update).
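The three terms of the latency formula can be checked with a standalone function. This is a sketch with hypothetical inputs: the real compute_path_latency_ns takes a path object, whereas here edges are plain (distance, bandwidth) tuples and the numbers are chosen only to make the arithmetic easy to follow.

```python
from typing import List, Tuple

Edge = Tuple[float, float]  # (distance_mm, bandwidth in bytes per ns)


def credit_path_latency_ns(edges: List[Edge], node_overheads_ns: List[float],
                           ns_per_mm: float, credit_size_bytes: int) -> float:
    """Propagation + per-node overhead + drain at the slowest edge."""
    propagation = sum(distance_mm * ns_per_mm for distance_mm, _ in edges)
    overhead = sum(node_overheads_ns)
    bottleneck_bw = min(bw for _, bw in edges)
    drain = credit_size_bytes / bottleneck_bw
    return propagation + overhead + drain


# Hypothetical in-cube path: one short, fast edge between two nodes.
in_cube = credit_path_latency_ns(
    edges=[(2.0, 64.0)], node_overheads_ns=[1.0, 1.0],
    ns_per_mm=0.5, credit_size_bytes=16)

# Hypothetical cross-SIP path: a long, slower edge in the middle.
cross_sip = credit_path_latency_ns(
    edges=[(2.0, 64.0), (20.0, 16.0), (2.0, 64.0)],
    node_overheads_ns=[1.0, 4.0, 4.0, 1.0],
    ns_per_mm=0.5, credit_size_bytes=16)
```

With these inputs the in-cube credit costs 3.25 ns and the cross-SIP credit 23.0 ns, so the longer path pays more in every term.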
tl.recv blocks on the credit-emit completion (recv yields-from
_delayed_credit_send rather than spawning it as a fork). This puts
the credit-return cost on the receiver's pe_exec_ns, modeling the
IPCQ control-plane completing the consume-acknowledgement before
recv returns to the kernel — the protocol equivalent of a non-posted
tl.store waiting for an HBM ack on the raw DMA path.
That gives us:
- Topology-proportional approximation: an in-cube credit return is automatically faster than a cross-SIP credit return.
- No magic constants: every nanosecond comes from compute_path_latency_ns on the same edge_map and node_overhead_ns as data traffic.
- No deadlock risk: unlike piggyback, B can issue credit even when it has no data to send back. peer_credit_store.put is unbounded.
- IPCQ ≥ raw DMA for matched physical moves: the credit-emit cost on recv balances the HBM ack-trip cost RAW pays on the sender.
Component coupling — SimPy Store channel
PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init
time, a SimPy Store is wired between the two (a per-direction
fast-path channel) and credit metadata is put into that store.
class PeIpcqComponent:
    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
        yield env.timeout(latency_ns)
        yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...))
Backend init wires both directions of the fast-path channel as part of
fan-out (see IpcqInitMsg in D12).
Credit-return fast path limitations
- credit_size_bytes is an estimate (typically 16–64 bytes).
- The fast path is excluded from vc_comm BW contention (separate wire). Real HW credit-return wires are very lightweight, so this is a reasonable first approximation.
- A follow-up ADR can model the credit fast path as a separate link (BW limit + contention), or switch to piggyback (credit_return_mode: piggyback).
PE_DMA's added responsibility
When vc_comm receives a token, PE_DMA processes it as the following
sequence: pay the Transaction's terminal BW drain, then atomically
write data and forward metadata. No SimPy yield is allowed between
the data write and the metadata forward (invariant I6). The drain
yield must sit before the atomic block, not inside it:
def _on_vc_comm_recv(self, env, txn):
    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
    # sender PE_DMA). MUST happen before the atomic block so recv only
    # wakes after the bytes have "landed".
    drain = getattr(txn, "drain_ns", 0.0)
    if drain > 0:
        yield env.timeout(drain)
    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    # 1. Copy the payload into the receiver's rx slot
    data = self._memory_store.read(token.src_space, token.src_addr,
                                   shape=..., dtype=...)
    self._memory_store.write(token.dst_endpoint.buffer_kind,
                             token.dst_addr, data)
    # 2. Forward metadata to the local PE_IPCQ
    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
    # ───────────────────────────────────────────────────
The final put is yieldable but uses an unbounded internal store, so
it completes in a single step. That put is the closing call of the
atomic block; nothing may be inserted before it.
Drain-at-inbound semantics (D9 timing model)
The Transaction carries drain_ns = nbytes / bottleneck_bw_on_path
stamped at send-side PE_DMA. In this simulator per-hop overhead_ns
is paid at each forwarding component via run(), and the remaining
BW drain is paid once at the Transaction's terminal. Every non-IPCQ
Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
ComponentBase._forward_txn at the terminal node. For IPCQ the
destination PE_DMA intercepts the Transaction with _handle_ipcq_inbound
(so IPCQ-specific data write + metadata forward can happen), so the
drain MUST be paid explicitly at the top of that handler to keep
IPCQ's timing model on par with every other fabric Transaction.
Side-effects of paying drain here:
- SRC tl.send is unchanged: fire-and-forget semantics are preserved because the sender PE_DMA does not yield sub_done. The sub_done.succeed() call (made after metadata forward below) is an event with no listener on the sender side.
- DST tl.recv unblocks drain_ns later. Since recv wakes only when IpcqMetaArrival reaches its local PE_IPCQ, and the metadata forward now happens after the drain, recv observes the full fabric transfer time including bandwidth cost.
Matches the physical picture: send dispatches and leaves; recv waits until the bytes have actually been drained into its inbox.
D9.5. ADR-0020 (2-pass) integration
tl.send / tl.recv integrates with ADR-0020's two-pass model. Phase
1 simulates timing and moves data via MemoryStore; Phase 2 enables
op-log-based correctness verification.
Phase 1 (timing + data)
D9 models head and tail updates with two different mechanisms:
- Send-side (head update): DMA payload piggyback. Data write and metadata forward happen in the same SimPy step → automatic atomic visibility.
- Recv-side (tail credit return): fast-path SimPy Store channel with bottleneck-BW latency, then peer_tail_cache update.
Together they preserve ring-buffer pointer consistency.
The op-log records op_kind="ipcq" entries for sends (with
src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq) and recvs (with
recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq).
Two recv modes:
- return_slot (default): the slot address is returned to the kernel. Zero-copy.
- copy_to_dst: when the kernel passes dst_addr + dst_space, PE_IPCQ copies the slot data into the user dst.
Phase 2 (op_log replay)
When DataExecutor encounters an op_kind="ipcq" record:
- send: idempotent src → dst ndarray write.
- recv (return_slot): no-op (the slot already holds the data).
- recv (copy_to_dst): idempotent slot → dst_addr copy.
IPCQ ops are pure data movement — Phase 2 has nothing extra to compute.
The downstream GEMM / Math ops in DataExecutor will consume the data
and naturally validate correctness.
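The replay rules can be sketched with a dictionary standing in for memory. The record layout below (the "kind", "src", and "dst" keys, and bytes payloads instead of ndarrays) is hypothetical and is not the real op_log schema; the point is only that each branch is a plain copy, so replaying twice leaves memory unchanged.

```python
from typing import Dict, List, Tuple

Memory = Dict[Tuple[str, int], bytes]  # (space, addr) -> payload


def replay_ipcq(op_log: List[dict], mem: Memory) -> None:
    """Apply op_kind == "ipcq" records; every branch is an idempotent copy."""
    for op in op_log:
        if op.get("op_kind") != "ipcq":
            continue  # other op kinds are handled elsewhere
        if op["kind"] == "send":
            mem[op["dst"]] = mem[op["src"]]
        elif op["kind"] == "recv" and op["recv_mode"] == "copy_to_dst":
            mem[op["dst"]] = mem[op["src"]]  # src is the rx slot here
        # recv with return_slot: nothing to do, the slot already holds data


mem: Memory = {("tcm", 0x100): b"tile"}
log = [
    {"op_kind": "ipcq", "kind": "send",
     "src": ("tcm", 0x100), "dst": ("tcm", 0x800)},
    {"op_kind": "ipcq", "kind": "recv", "recv_mode": "return_slot"},
    {"op_kind": "ipcq", "kind": "recv", "recv_mode": "copy_to_dst",
     "src": ("tcm", 0x800), "dst": ("hbm", 0x0)},
]
replay_ipcq(log, mem)
snapshot = dict(mem)
replay_ipcq(log, mem)  # replaying again changes nothing
```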
D10. Host CCL init keeps the PyTorch shape
The host code looks just like real PyTorch DDP. init_process_group
creates the backend object; it does not receive IPCQ knobs
(neighbor topology, buffer_kind, backpressure …).
# benches/ccl_allreduce.py — same shape as real PyTorch
def worker(rank, world_size, torch):
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")  # reads ccl.yaml + topology
    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
    tensor.copy_(torch.from_numpy(init))
    dist.all_reduce(tensor, op="sum")
The IPCQ configuration is decided by the backend at
init_process_group time: it loads ccl.yaml, picks the algorithm,
and pushes IPCQ neighbor tables to every participating PE_IPCQ. The
host code never has to know about IPCQ.
A bench runs one algorithm, chosen via ccl.yaml's defaults.algorithm.
Switching algorithms is purely a ccl.yaml change — no host edits
required.
Init flow (eager)
- init_process_group(backend="ahbm") is called.
- Backend loads ccl.yaml → resolves defaults.algorithm.
- Pulls topology + buffer_kind + backpressure + slot config from algorithms[<algo>].
- Immediately installs neighbor tables on every PE_IPCQ (sideband or fabric IpcqInitMsg).
- Subsequent torch.launch(kernel_name, ...) calls behave normally; PE_IPCQ is already prepared whether the kernel is a CCL kernel or not.
D11. CCL config file (ccl.yaml)
IPCQ config and algorithm metadata live in a separate YAML file,
following the same pattern as components.yaml and topology.yaml.
A single benchmark execution runs one algorithm
(defaults.algorithm). Switching algorithms means editing
defaults.algorithm only.
defaults:
  algorithm: ring_allreduce_tcm
  buffer_kind: tcm            # tcm | hbm | sram
  backpressure: sleep         # poll | sleep
  n_slots: 8
  slot_size: 4096
  vc_chunk_size: 256
  ipcq_credit_size_bytes: 16
algorithms:
  ring_allreduce_tcm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d         # builtin name or "custom"
    buffer_kind: tcm
    n_elem: 8                 # optional, per-algorithm tile width
  tree_allreduce_7:
    module: kernbench.ccl.algorithms.tree_allreduce
    topology: tree_binary
    buffer_kind: tcm
    world_size: 7             # algorithm-level override
    n_elem: 16
  custom_mesh:
    module: kernbench.ccl.algorithms.custom_mesh
    topology: custom          # the module supplies its own neighbors()
world_size is not set in defaults. The backend resolves it via:
algorithm-level override > defaults override > topology spec. The
last fallback (sips × cubes_per_sip × pes_per_cube) mirrors real DDP
where WORLD_SIZE comes from env vars rather than config files.
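The precedence order can be written as a short lookup. This is a sketch, not the backend's code: the function name and the flat argument list for the topology dimensions are assumptions.

```python
from typing import Optional


def resolve_world_size(algo_cfg: dict, defaults: dict,
                       sips: int, cubes_per_sip: int,
                       pes_per_cube: int) -> int:
    """Algorithm-level override > defaults override > topology product."""
    for cfg in (algo_cfg, defaults):
        value: Optional[int] = cfg.get("world_size")
        if value is not None:
            return value
    return sips * cubes_per_sip * pes_per_cube
```

For the tree_allreduce_7 entry above, the algorithm-level world_size of 7 wins; for ring_allreduce_tcm, which sets none, a topology of 2 SIPs × 2 cubes × 4 PEs would resolve to 16.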
Algorithm module structure
Each algorithm module exports two hooks — kernel (required) and
neighbors (optional) — plus a kernel_args helper that the
backend uses to populate positional kernel arguments at all_reduce
time:
# src/kernbench/ccl/algorithms/ring_allreduce.py
def kernel_args(world_size: int, n_elem: int) -> tuple:
return (n_elem, world_size)
def kernel(t_ptr, n_elem, world_size, tl):
"""Required — the PE kernel.
IPCQ is already installed by the backend before this is called.
The kernel only uses the four-direction send / recv API.
"""
...
def neighbors(rank, world_size, neighbor_map):
"""Optional — override the builtin topology's neighbor map.
Returns a new dict, the modified-in-place dict, or None to keep the
builtin map.
"""
return None
neighbors override patterns
- Pattern A (tweak a builtin): drop a direction for some ranks, etc.
- Pattern B (replace entirely): ignore neighbor_map and return a brand-new dict.
- Pattern C (keep builtin): omit neighbors or return None.
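The three patterns might look like this in practice. The resolver `apply_neighbors_hook` is a hypothetical stand-in for whatever the backend does with the hook's return value, and passing the hook a copy of the builtin map is an assumption made here so that in-place edits cannot leak into shared state.

```python
from typing import Callable, Dict, Optional

NeighborMap = Dict[str, int]  # direction -> peer rank
Hook = Callable[[int, int, NeighborMap], Optional[NeighborMap]]


def apply_neighbors_hook(hook: Optional[Hook], rank: int, world_size: int,
                         builtin: NeighborMap) -> NeighborMap:
    """Resolve the final map: the hook's result if any, else the builtin."""
    if hook is None:
        return builtin       # Pattern C: module defines no neighbors()
    working = dict(builtin)  # copy so in-place edits stay local
    result = hook(rank, world_size, working)
    if result is None:
        return builtin       # Pattern C: hook returned None
    return result


def drop_west_on_rank0(rank, world_size, neighbor_map):
    """Pattern A: tweak the builtin map in place and return it."""
    if rank == 0:
        neighbor_map.pop("W", None)
    return neighbor_map


def star_to_rank0(rank, world_size, neighbor_map):
    """Pattern B: ignore the builtin map and build a new one."""
    return {} if rank == 0 else {"E": 0}
```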
Builtin topologies
| topology | direction set |
|---|---|
| ring_1d | E, W |
| ring_1d_unidir | E only |
| mesh_2d | N, S, E, W |
| tree_binary | parent, child_left, child_right |
| none | (empty); algorithm must supply neighbors() |
Adding a new algorithm
- Write kernel and kernel_args in src/kernbench/ccl/algorithms/<algo>.py.
- Add an entry in ccl.yaml's algorithms section.
- (Optional) provide neighbors() for custom topology.
- Set defaults.algorithm to the new algorithm.
The host bench (benches/ccl_allreduce.py) does not change.
D12. Message / token schema
The new message types added by this ADR. They live in
src/kernbench/common/pe_commands.py and
src/kernbench/runtime_api/kernel.py.
IpcqInitMsg (sideband, fan-out at init)
The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors
MmuMapMsg (target_sips, target_cubes, target_pe, entries).
Each IpcqInitEntry has direction, peer: IpcqEndpoint,
my_rx_base_pa/va, n_slots, slot_size, plus a peer_credit_store
field: a simpy.Store instance pre-wired so the PE_IPCQ returning a
credit (the data receiver) can push IpcqCreditMetadata directly into
the data sender's input queue.
IpcqSendCmd (PE_CPU → PE_IPCQ)
Carries direction, source addr/space, nbytes, shape, dtype, and a
handle id. data_op=True so it lands in the op_log.
IpcqRecvCmd (PE_CPU → PE_IPCQ)
Carries direction (or None for round-robin), recv_mode
(return_slot / copy_to_dst), optional dst_addr/dst_space, shape,
dtype, blocking flag.
IpcqDmaToken (PE_IPCQ → PE_DMA, vc_comm channel)
Per D9 piggyback: the token carries the data (src/dst/space/nbytes)
plus the head metadata (sender_seq, src_sip/cube/pe,
src_direction). PE_DMA picks the channel by token type
(IpcqDmaToken → vc_comm, TileToken → vc_compute).
The receiver's PE_DMA, on token arrival, performs the I6 atomic
sequence: write data into MemoryStore, then forward IpcqMetaArrival
to the local PE_IPCQ.
IpcqCreditMetadata (PE_IPCQ → peer PE_IPCQ, fast path)
Carries consumer_seq (= my_tail), source PE coords, and source
direction. Travels through the dedicated SimPy Store channel rather
than vc_comm. Latency is the D9 path latency: edge propagation plus
per-node overhead plus credit_size_bytes / bottleneck_bw_on_path.
There is no IpcqPtrUpdate event — head updates flow via D9
piggyback, tail updates via the D9 fast-path channel.
D13. Test strategy
Test plan:
T1. Unit tests (component-level)
- PE_IPCQ (tests/test_pe_ipcq.py): send without backpressure immediately forwards a token; full peer slot triggers backpressure (poll / sleep modes); recv waits, wakes on IpcqMetaArrival; round-robin recv weak fairness; bad direction → IpcqInvalidDirection.
- PE_DMA virtual channels (tests/test_pe_dma_vc.py): vc_compute / vc_comm independent progress, chunk interleave, BW split.
- Builtin topology (tests/test_ccl_topologies.py): ring_1d / mesh_2d / tree_binary correctness, mesh_2d non-square → ValueError, custom resolver returns the module's neighbors.
T2. Integration tests (E2E send/recv)
- tests/test_ipcq_e2e.py: 2-rank ring, 4-rank ring (bidirectional no-deadlock), 4×4 mesh.
- CCL kernel + 2-pass (tests/test_ipcq_2pass.py): greenlet mode records ipcq ops in op_log; DataExecutor produces correct out.data.
T3. Backend init (tests/test_ccl_backend_ipcq.py)
ccl.yaml load, builtin topology → IpcqInitMsg fan-out, endpoint PA
consistency, per-buffer_kind allocation.
T4. Regression
All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for non-CCL benches.
T5. Performance / overhead
Single send/recv pair latency = (DMA latency) + (IPCQ overhead). Should be close to a regular PE_DMA write of the same nbytes (IPCQ overhead < 100 ns).
D14. Invariants and failure modes
Invariants
I1. Slot lifecycle exactly-once: one send → exactly one recv.
I2. Pointer monotonicity: my_head / my_tail are monotonically
non-decreasing; sender_seq is strictly increasing.
I3. Endpoint consistency: if rank A's direction=E peer is rank
B, then rank B's reverse-direction peer must be rank A. Verified at
init.
I4. buffer_kind consistency: all PEs in a process group share
the same buffer_kind (no mixed mode in the first cut).
I5. op_log ordering: send → DMA complete → recv possible. The
t_start order in op_log respects this causality.
I6. Atomic data + metadata visibility (MUST): at the receiver
side, data write (MemoryStore.write) and metadata forward
(peer_head_cache update) must execute in the same SimPy step.
No yield is allowed between the two operations in PE_DMA's vc_comm
handler. Code review must reject any inserted yield (or yield from) — it would create a race where head_cache becomes visible
before or after the data.
I7. MemoryStore slot existence ↔ pointer: as a consequence of I6,
the step in which peer_head_cache > my_tail becomes true is the
same step in which the slot data is observable.
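Of these invariants, I3 is the easiest to check mechanically at init. The sketch below is one possible check, not the backend's validator: the function name is hypothetical, ranks are plain integers, and only the N/S/E/W directions are covered (tree_binary's parent/child names would need their own reverse table).

```python
from typing import Dict, List

REVERSE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def check_endpoint_consistency(table: Dict[int, Dict[str, int]]) -> List[str]:
    """Return one message per I3 violation; an empty list means consistent."""
    errors = []
    for rank, neighbors in table.items():
        for direction, peer in neighbors.items():
            back = table.get(peer, {}).get(REVERSE[direction])
            if back != rank:
                errors.append(
                    f"rank {rank} {direction}->{peer}, but rank {peer} "
                    f"{REVERSE[direction]}->{back}"
                )
    return errors


# A 4-rank bidirectional ring satisfies I3.
good = {r: {"E": (r + 1) % 4, "W": (r - 1) % 4} for r in range(4)}

# Rank 0 points E at rank 1, but rank 1 points W at rank 2.
bad = {0: {"E": 1}, 1: {"W": 2}, 2: {}}
```

Collecting every violation instead of stopping at the first one makes the init-time error more useful when a custom neighbors() hook breaks several links at once.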
Failure modes (runtime errors)
F1. Bad direction: tl.send(dir="X") for an uninstalled direction
→ IpcqInvalidDirection, simulation aborts.
F2. Type mismatch: dtype/shape/nbytes disagreement between matched
send and recv. Not validated by default; opt-in strict mode catches
it (strict_validation: true on a PE_IPCQ node attrs).
F3. Deadlock detection (timeout-based): the simulator empties its
schedule while a send/recv is still pending → engine raises
IpcqDeadlock and embeds a pointer dump.
F4. Backend init failure: missing defaults.algorithm, missing
algorithms[name], module import failure, topology validation
failure (I3, I4) — all raised at init_process_group time.
F5. Slot full + infinite backpressure: the peer never recvs.
Surfaces as F3 timeout.
Diagnostics
- CCL trace: KERNBENCH_CCL_TRACE=1 logs each send/recv as (rank, t, dir, nbytes).
- Pointer dump: kernbench.ccl.diagnostics.pointer_dump(engine) prints every PE_IPCQ ring buffer's my_head, my_tail, peer_head_cache, peer_tail_cache.
- Deadlock dump: on hang the engine includes the pointer dump in the IpcqDeadlock exception message.
D15. Algorithm-author cheat sheet
Full step-by-step lives in
docs/onboarding/ccl-author-guide.en.md. The
shortest version:
| Things you touch | Things you don't |
|---|---|
| src/kernbench/ccl/algorithms/<your_algo>.py (kernel, kernel_args, optional neighbors) | benches/ccl_allreduce.py host code |
| One entry in ccl.yaml + optionally defaults.algorithm | src/kernbench/ccl/ framework |
| (Optional) tests/test_<your_algo>.py mock test | PE_IPCQ component, AhbmCCLBackend |
5-step flow: write the kernel → register in ccl.yaml → optional
neighbors override → optional mock unit test → SimPy validation via
kernbench run --bench ccl_allreduce --verify-data.
Common mistakes: using a direction that wasn't installed, sends
without matching recvs (deadlock), dtype/shape disagreement, assuming
fairness from tl.recv() round-robin, confusing
tl.num_programs(axis) with the CCL group size.
Non-goals
- Host collective: a model where dist.all_reduce itself moves data on the host side is out of scope. This ADR only covers communication that happens inside the PE kernel.
- All-reduce algorithms: ring / tree / etc. live in algorithm modules and can be added without amending this ADR.
- Reliability / error handling: link faults, send/recv failure recovery, etc. are out of scope.
- NoC arbiter precision: dynamic VC contention is left for a future ADR (see D8).
Open questions
- VC arbitration accuracy: the first cut uses deterministic chunk interleave + weighted round-robin; heavy contention may report optimistic latency. A NoC arbiter component can be added later.
- Credit return BW model: the fast path is currently outside the fabric BW contention model. Can be modeled as a separate link or switched to piggyback (credit_return_mode: piggyback).
- Ring buffer slot allocation metadata: whether the host pushes IPCQ buffer metadata via sideband or via a fabric message similar to MmuMapMsg is open.
- VC BW split default: 50/50 vs. weighted (e.g. 80/20). Exposed in ccl.yaml; default value TBD.
- Direction count: 4 (N/S/E/W) is fixed in the first cut; 6 (with Up/Down for 3D) or N (variable) is future work.
- Multi-tile aggregation primitives: whether tl.recv_all or similar is needed for fan-in.
- Round-robin recv fairness: current weak fairness can starve; strict fairness counter is future work.
- Deadlock detection precision: currently timeout-based; a realtime wait-for graph would enable deterministic detection.
Consequences
Positive
- PE-to-PE direct communication enables CCL kernels to be written.
- Host stays minimal (just launch), synchronization happens inside the PE → strong compute / comm overlap.
- VCs eliminate HoL blocking → collective latency is not blocked by compute traffic.
- Buffer placement and backpressure mode are init-time parameters → easy to benchmark.
- Four-direction logical neighbors → host is free to map ring/mesh/tree algorithms.
Negative
- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
- IPCQ memory cost = 8 rings × slot_size × n_slots per PE.
- VC arbitration is a first-order approximation; heavy contention scenarios may report slightly optimistic latency vs real HW (D8).
- Chunk-level interleave makes PE_DMA implementation more complex.
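To put a number on the memory cost listed above, the formula can be evaluated with the ccl.yaml defaults from D11 (slot_size 4096, n_slots 8). The helper name is hypothetical; it simply multiplies out the four directions × {tx, rx} rings.

```python
def ipcq_bytes_per_pe(slot_size: int, n_slots: int, directions: int = 4) -> int:
    """Ring-buffer footprint: directions x {tx, rx} x n_slots x slot_size."""
    rings = directions * 2
    return rings * n_slots * slot_size


default_cost = ipcq_bytes_per_pe(slot_size=4096, n_slots=8)  # 262144 bytes
```

That is 256 KiB per PE with the defaults, which helps explain why the PE_TCM placement in D6 is described as competing with PE-internal resources.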