# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication

## Status

Accepted

## Context

### Goal

Add the infrastructure that lets CCL (Collective Communication Library)
kernels run **inside** a PE. The host just launches a kernel on each
SIP; the actual synchronization and data movement happen **inside the
PE kernel via an IPCQ (Inter-Process Communication Queue)**.

This mirrors how NCCL performs NVLink communication inside a GPU
kernel, or how Cerebras / Tenstorrent expose core-local communication
queues. Host-level collectives (`dist.all_reduce`) are deferred to
**future work**; this ADR focuses solely on the kernel-side collective
infrastructure.

### Problems to solve

1. PE-to-PE direct data movement (writing into a peer's memory).
2. Synchronization — the sender must check that the receiver has space
   in its buffer (backpressure).
3. Resource contention between compute traffic and communication
   traffic (Head-of-Line blocking).
4. The host must be able to construct logical neighbor topologies
   (ring / mesh / tree) per algorithm.

---

## Decision

### D1. Add a new `PE_IPCQ` component

A new component `PE_IPCQ` is added inside each PE. It follows the same
pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a
distinct component.

```
PE
├── PE_CPU
├── PE_SCHEDULER
├── PE_DMA
├── PE_IPCQ          ← new
├── PE_FETCH_STORE
├── PE_GEMM
├── PE_MATH
├── PE_TCM
└── PE_MMU
```

**Role separation** (control plane vs. data plane):

- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head /
  tail pointer management, peer pointer caches, backpressure, 4-direction
  neighbor mapping.
- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe
  / PCIE into the peer's memory.

PE_IPCQ does **not** move data itself — it delegates to PE_DMA.

### D2. Ring buffer model

Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.

```python
@dataclass
class IpcqQueuePair:
    direction: Direction       # N/S/E/W
    peer: IpcqEndpoint         # set by host at init time (D2.5)
    tx_buffer_base: int        # outgoing data base addr (in our memory)
    rx_buffer_base: int        # incoming data base addr (in our memory)
    slot_size: int             # 1 tile per slot
    n_slots: int               # ring depth
    my_head: int               # next slot we will write/send into
    my_tail: int               # next slot we will read/recv from
    peer_head_cache: int       # peer's last-seen head (updated via D9 piggyback)
    peer_tail_cache: int       # peer's last-seen tail (updated via D9 fast-path credit)
```

**Canonical field names**: throughout this ADR the four names above
(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used
consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`,
etc.) are not used.

| Field | Owner | Updated when |
|-------|-------|--------------|
| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |

**Slot unit**: fixed-size; one slot holds one full tile (no descriptor
indirection). The full data is embedded in the slot. See D5.

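
The full / empty conditions implied by the four pointers can be sketched as follows. This is a minimal illustration, assuming the counters are monotonic (never wrapped) and that only the slot index is taken modulo `n_slots`. The class `RingPointers` and the helper `fill_until_full` are hypothetical names used for this example only.

```python
from dataclasses import dataclass


@dataclass
class RingPointers:
    """Hypothetical stand-in for the pointer fields of one IpcqQueuePair."""

    n_slots: int
    my_head: int = 0          # sends issued on the tx side
    my_tail: int = 0          # recvs completed on the rx side
    peer_head_cache: int = 0  # last peer head seen (data has arrived up to here)
    peer_tail_cache: int = 0  # last peer tail seen (peer has consumed up to here)

    def tx_has_space(self) -> bool:
        # Sender side: slots in flight must stay below the ring depth.
        return self.my_head - self.peer_tail_cache < self.n_slots

    def rx_has_data(self) -> bool:
        # Receiver side: data is available once the peer head passed our tail.
        return self.peer_head_cache > self.my_tail

    def tx_slot(self) -> int:
        # Slot index the next send will target.
        return self.my_head % self.n_slots


def fill_until_full(ptrs: RingPointers) -> list:
    """Issue sends until backpressure kicks in; return the slot indices used."""
    used = []
    while ptrs.tx_has_space():
        used.append(ptrs.tx_slot())
        ptrs.my_head += 1
    return used
```

With `n_slots=4`, the first call fills slots 0 to 3 and then stalls; a credit that moves `peer_tail_cache` to 2 frees exactly two more sends, which reuse slots 0 and 1.
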
### D2.5. `IpcqEndpoint` schema

`IpcqQueuePair.peer` carries everything the sender needs to compute the
peer's rx slot address:

```python
@dataclass(frozen=True)
class IpcqEndpoint:
    sip: int
    cube: int
    pe: int
    buffer_kind: str           # "tcm" | "hbm" | "sram"
    rx_base_pa: int            # peer rx_buffer base PA (PhysAddr.encode())
    rx_base_va: int            # peer rx_buffer base VA (optional, MMU mode)
    n_slots: int               # peer ring depth (for wrap-around)
    slot_size: int             # peer slot size (for offset)
```

Address computation:

```python
slot_idx = self.my_head % peer.n_slots
dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
```

PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA
(vc_comm) routes the data to `dst_pa` through the fabric.

**Endpoint construction order**: at backend init (D10), the IPCQ
buffers for **every PE** are allocated first (so each rank knows the
others' PA), then the per-rank neighbor tables are built and pushed to
PE_IPCQ via `IpcqInitMsg`.

### D3. Four-direction mapping ≡ logical ProcessGroup

The PE views four directions (N/S/E/W) as logical ports. Real peer
addresses are configured by the host CCL init, per the chosen
algorithm. The PE kernel never knows the topology, only directions.

```python
# 1D ring
for rank in range(world_size):
    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])

# 2D mesh
for r in range(R):
    for c in range(C):
        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
```

The PE code does not need to know where `tl.send(dir="E", ...)` actually
ends up.

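
As a self-contained variant of the loops above, the sketch below builds the neighbor tables as plain dicts and checks that every link has a matching reverse link (the property that invariant I3 asks for). The function names, the dict layout, and the row-major rank numbering `rank = r * cols + c` are assumptions made for illustration; the real backend may represent peers differently.

```python
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def ring_1d(world_size):
    """Neighbor table for a bidirectional 1D ring, keyed by rank."""
    return {
        rank: {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}
        for rank in range(world_size)
    }


def mesh_2d(rows, cols):
    """Neighbor table for a wrap-around 2D mesh; rank = r * cols + c (assumed)."""
    table = {}
    for r in range(rows):
        for c in range(cols):
            table[r * cols + c] = {
                "N": ((r - 1) % rows) * cols + c,
                "S": ((r + 1) % rows) * cols + c,
                "E": r * cols + (c + 1) % cols,
                "W": r * cols + (c - 1) % cols,
            }
    return table


def is_reverse_consistent(table):
    """True if every A --dir--> B link has a matching B --opposite--> A link."""
    return all(
        table[peer].get(OPPOSITE[direction]) == rank
        for rank, dirs in table.items()
        for direction, peer in dirs.items()
    )
```

For example, `ring_1d(4)[0]` is `{"E": 1, "W": 3}`, and both builders pass `is_reverse_consistent`; editing a single entry by hand breaks the check.
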
### D4. PE kernel API

```python
# Send (blocking; may stall on backpressure)
tl.send(dir: str, src=TensorHandle)
tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)

# Recv (blocking)
recv = tl.recv(dir: str, shape=..., dtype=...)
recv = tl.recv(shape=..., dtype=...)          # round-robin across 4 directions

# Recv (non-blocking)
fut = tl.recv_async(dir: str, shape=..., dtype=...)
recv = tl.wait(fut)
```

`tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each
call rotates through directions, returning the first available slot.
If all four directions are empty, the call waits.

**Fairness is weak**: the rotating start mitigates simple bias, but if
one direction always wins the race the others can starve. Algorithms
that need strict fairness must call `tl.recv(dir=...)` explicitly.

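
The rotating cursor can be illustrated with a small stand-alone poller. This is a sketch under assumptions: the scan order `N, E, S, W`, the class name `RoundRobinPoller`, and the `ready` set argument are all hypothetical, and the real component waits on events instead of returning `None`.

```python
DIRECTIONS = ("N", "E", "S", "W")  # scan order is an assumption


class RoundRobinPoller:
    """Picks the next ready direction, starting just after the last pick."""

    def __init__(self):
        # Start so that the very first scan begins at DIRECTIONS[0].
        self.last_polled_dir = len(DIRECTIONS) - 1

    def pick(self, ready):
        """Return the first ready direction after the cursor, or None."""
        n = len(DIRECTIONS)
        for step in range(1, n + 1):
            idx = (self.last_polled_dir + step) % n
            if DIRECTIONS[idx] in ready:
                self.last_polled_dir = idx
                return DIRECTIONS[idx]
        return None  # nothing ready: the real recv would wait here
```

With `E` and `W` both permanently ready, successive picks alternate `E`, `W`, `E`. The fairness is only as good as the scan: a direction whose data always arrives just after the cursor has passed it can still wait longer than the others.
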
### D5. Single-hop DMA write + full-data slot model

Data moves from sender memory into the receiver's ring slot in **one
DMA transfer**. Key properties:

- **Single-hop**: the sender already knows the peer rx slot address and
  fires one fabric DMA into it.
- **No CPU memcpy**: the CPU never copies data.
- **No intermediate staging**: neither side keeps a separate staging
  buffer (sender uses the source addr directly; receiver gets the data
  in its ring slot directly).

(Strictly speaking the fabric DMA write does happen, so this is not
literally "no data movement" — it's the same property NCCL labels
"zero-copy", meaning no CPU memcpy and no staging copy.)

```
PE A: tl.send(E, src_addr, nbytes)
  1. IPCQ computes the peer rx slot address:
       dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
  2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
       (full → sleep / poll)
  3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
  4. my_head += 1

PE B: data = tl.recv(W)
  1. Look at rx_buffer[my_tail % n_slots]
  2. Wait for the data to arrive (D7 backpressure mode)
  3. Return the slot address to the kernel (or fetch into register file)
  4. my_tail += 1
  5. Issue a credit-return fast path (D9): after the bottleneck-BW
     latency the peer A's peer_tail_cache is updated.
```

The slot holds the full tile. The receiver only reads its own
rx_buffer; it never reads back into A's memory. The sender knows the
peer rx slot address and DMAs directly into it (single-hop).

The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local
to the PE).

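
The send / recv steps above can be traced with a toy model in which the receiver's rx ring is a `bytearray` and the single fabric DMA is one slice assignment. Everything here is an illustrative assumption: the `Sender` / `Receiver` classes, the tiny `SLOT_SIZE` and `N_SLOTS` values, and the zero-latency pointer updates (the real model delays them as described in D9).

```python
SLOT_SIZE = 8   # bytes per slot (toy value)
N_SLOTS = 2     # ring depth (toy value)


class Sender:
    def __init__(self):
        self.my_head = 0
        self.peer_tail_cache = 0


class Receiver:
    def __init__(self):
        self.rx = bytearray(SLOT_SIZE * N_SLOTS)  # rx ring in receiver memory
        self.my_tail = 0
        self.peer_head_cache = 0


def send(sender, receiver, payload):
    """Write one slot directly into the peer rx ring; False means ring full."""
    assert len(payload) == SLOT_SIZE
    if sender.my_head - sender.peer_tail_cache >= N_SLOTS:
        return False  # backpressure: the real kernel would sleep or poll
    offset = (sender.my_head % N_SLOTS) * SLOT_SIZE
    receiver.rx[offset:offset + SLOT_SIZE] = payload  # the single "DMA" write
    sender.my_head += 1
    receiver.peer_head_cache = sender.my_head  # piggybacked head update
    return True


def recv(sender, receiver):
    """Read the next slot from the local rx ring; None means no data yet."""
    if receiver.peer_head_cache == receiver.my_tail:
        return None
    offset = (receiver.my_tail % N_SLOTS) * SLOT_SIZE
    data = bytes(receiver.rx[offset:offset + SLOT_SIZE])
    receiver.my_tail += 1
    sender.peer_tail_cache = receiver.my_tail  # credit return (instant here)
    return data
```

Note that `recv` only ever touches `receiver.rx`; the sender's source memory is never read back, which is the single-hop property described above.
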
### D6. Buffer placement — three-way benchmark

The host CCL init picks the IPCQ ring-buffer location:

```python
ipcq_init(
    backend="ahbm",
    buffer_kind="tcm",   # one of "tcm" | "hbm" | "sram"
    n_slots=8,
    slot_size=4096,
)
```

| Location | Trait | Trade-off |
|----------|-------|-----------|
| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
| **PE-local HBM** | Large; via DMA | Higher latency |
| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |

All three locations run the same kernel code; only the init differs.

### D7. Backpressure — two-mode benchmark

How the sender or receiver waits when the peer's slots are full or the
data has not yet arrived:

| Mode | Behavior | Model |
|------|----------|-------|
| **poll** | Periodically re-check the cached peer pointer | Spin loop |
| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |

```python
ipcq_init(backend="ahbm", backpressure="sleep")   # or backpressure="poll"
```

Both modes are implemented so latency / throughput trade-offs can be
benchmarked.

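
To make the latency side of that trade-off concrete, the sketch below computes when a waiter would resume in each mode. It is a rough model under stated assumptions, not the simulator's code: polls are assumed to fire at fixed multiples of a hypothetical `poll_period_ns` measured from the start of the wait, and `wake_overhead_ns` is a hypothetical fixed cost for the interrupt-like wake-up.

```python
import math


def poll_wake_ns(event_ns, poll_period_ns):
    """Time at which a polling waiter notices an event that occurred at event_ns."""
    # The waiter only looks at multiples of the poll period.
    return math.ceil(event_ns / poll_period_ns) * poll_period_ns


def sleep_wake_ns(event_ns, wake_overhead_ns=0.0):
    """Time at which a sleeping waiter resumes after a peer trigger at event_ns."""
    return event_ns + wake_overhead_ns
```

For an event at 130 ns, a 50 ns poll period resumes at 150 ns, while a sleeper with a 5 ns wake-up cost resumes at 135 ns. Polling additionally burns cycles during the wait, which this sketch does not model.
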
### D8. PE_DMA virtual channels

Extend PE_DMA from a single queue into a **two-channel virtual-channel**
model.

```
PE_DMA
├── vc_compute: tile load / store / writeback for GEMM and Math
└── vc_comm:    IPCQ send data
```

Each VC has an independent state machine:

- One channel stalling does not block the other.
- The same physical link (cube_noc, UCIe, …) is shared, but link BW is
  split between channels.

**Chunk-level interleave**:

- Large GEMM tile DMAs do not lock the link end-to-end.
- Progress happens in chunks (e.g. 256 B); each chunk shares link BW
  with the other VC's pending chunks.
- Chunk size is an init parameter (smaller = fairer, larger = more
  efficient).

Net effect:

- HoL blocking is eliminated (an IPCQ send can interleave with a long
  compute DMA).
- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM
  pattern).
- Matches the NoC-virtual-channel pattern used in real HW.

**First-implementation accuracy limit (intentional)**: this ADR's
first cut uses **deterministic chunk-level interleave + weighted
round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`).
This is a first-order approximation and is simpler than real HW
dynamic-contention / credit-based arbiters. Functional correctness is
unaffected, but heavy-contention scenarios may report slightly
optimistic latency vs. real HW. A separate ADR can add a NoC arbiter
component later if more precision is needed.

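
A deterministic chunk-level interleave with weighted round-robin arbitration can be sketched as below. This only illustrates the scheduling order; the function name `interleave`, the `weights` tuple, and the idea of expressing the 50 / 50 split as integer chunk quotas per round are assumptions, and no timing is modeled.

```python
def interleave(compute_bytes, comm_bytes, chunk=256, weights=(1, 1)):
    """Return the order in which chunks of the two VCs cross the shared link.

    weights gives the number of chunks each VC may send per arbitration
    round: (vc_compute, vc_comm). (1, 1) corresponds to a 50 / 50 split.
    """
    if min(weights) < 1:
        raise ValueError("each VC needs a quota of at least one chunk per round")
    pending = {"vc_compute": compute_bytes, "vc_comm": comm_bytes}
    quota = {"vc_compute": weights[0], "vc_comm": weights[1]}
    order = []
    while any(pending.values()):
        for vc in ("vc_compute", "vc_comm"):
            for _ in range(quota[vc]):
                if pending[vc] == 0:
                    break  # this VC is drained; its share goes unused
                nbytes = min(chunk, pending[vc])
                pending[vc] -= nbytes
                order.append((vc, nbytes))
    return order
```

With a 1024 B compute DMA and a 256 B IPCQ send queued together, the IPCQ chunk goes out second instead of fifth, which is the HoL-blocking relief described above.
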
#### Token routing

- Compute tokens (`TileToken`) — go through the existing
  PE_FETCH_STORE → PE_DMA chain.
- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA
  self-routing.
- PE_DMA picks the channel by token type.

```python
class PeDmaComponent:
    def _process(self, env, token):
        if isinstance(token, IpcqDmaToken):
            yield from self._vc_comm_process(env, token)
        else:
            yield from self._vc_compute_process(env, token)
```

### D9. Pointer synchronization — DMA payload piggyback

Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so
pointers update along with the data. This simulation adopts the same
model: **no separate control channel** — metadata travels with the
data.

The big benefits:

- **Automatic ordering**: data and metadata move on the same token, so
  data is visible **before** the head_cache update. No race.
- **HW fidelity**: matches NVLink / UCIe piggybacked headers.
- **Component simplification**: no separate `IpcqPtrUpdate` event type.

#### Send flow (head update via piggyback)

```
PE A: tl.send(E, src_addr, nbytes)
  1. PE_IPCQ checks backpressure (using peer_tail_cache)
  2. PE_IPCQ creates an IpcqDmaToken:
       - data body (src_addr → peer dst_addr)
       - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
  3. Hand the token to PE_DMA(vc_comm)
  4. PE A increments my_head (send tracking)

  [fabric DMA: latency elapses]

PE B's PE_DMA receives the token
  5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
  6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)

PE B's PE_IPCQ receives the metadata
  7. Updates peer_head_cache (= A's head)
  8. Wakes any pending recv on that direction
```

**Steps 5 and 6 must execute in the same SimPy step** — DMA completion
makes data and metadata atomically visible.

#### Recv flow (credit return — fast path with bottleneck-BW latency)

When the receiver frees a slot, the sender must learn about it
(backpressure release). Unlike data, the credit return does **not**
travel through general vc_comm fabric — it uses a **separate fast
path**, an abstraction of the NVLink / UCIe credit-return wire.

**Latency** is computed from the **full path latency** (per-node
overhead + edge propagation + drain), not a magic constant:

```
credit_size_bytes = 16          (ccl.yaml: ipcq_credit_size_bytes)
path    = router.find_path(self_pe, peer_pe.pe_dma)
latency = compute_path_latency_ns(path, credit_size_bytes)
        = sum(edge.distance_mm * ns_per_mm)
        + sum(node_overhead_ns[n] for n in path)
        + credit_size_bytes / bottleneck_bw_on_path
```

The router auto-appends `.pe_dma` to the source only, so the
destination MUST be spelled with the explicit `.pe_dma` suffix;
otherwise `find_path` raises and the credit silently teleports at zero
cost (a latent bug fixed alongside this update).

`tl.recv` blocks on the credit-emit completion (recv yields-from
`_delayed_credit_send` rather than spawning it as a fork). This puts
the credit-return cost on the receiver's `pe_exec_ns`, modeling the
IPCQ control-plane completing the consume-acknowledgement before
recv returns to the kernel — the protocol equivalent of a non-posted
`tl.store` waiting for an HBM ack on the raw DMA path.

That gives us:

- **Topology-proportional approximation**: an in-cube credit return is
  automatically faster than a cross-SIP credit return.
- **No magic constants**: every nanosecond comes from
  `compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
  as data traffic.
- **No deadlock risk**: unlike piggyback, B can issue credit even when
  it has no data to send back. `peer_credit_store.put` is unbounded.
- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
  cost on recv balances the HBM ack-trip cost that raw DMA pays on the
  sender.

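
The three-term latency formula above can be worked through with a small stand-alone function. This is a sketch, not the simulator's `compute_path_latency_ns`: the `Edge` dataclass, the argument layout, and every numeric value in the usage note are hypothetical, chosen only to show why an in-cube credit comes out cheaper than a cross-SIP one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    distance_mm: float
    bw_bytes_per_ns: float  # 1 byte/ns equals 1 GB/s


def path_latency_ns(edges, node_overhead_ns, nbytes, ns_per_mm):
    """Propagation + per-node overhead + drain at the bottleneck bandwidth."""
    propagation = sum(e.distance_mm * ns_per_mm for e in edges)
    overhead = sum(node_overhead_ns)
    drain = nbytes / min(e.bw_bytes_per_ns for e in edges)
    return propagation + overhead + drain


# Hypothetical paths for a 16-byte credit.
in_cube = path_latency_ns(
    [Edge(2.0, 64.0)], [1.0, 1.0], nbytes=16, ns_per_mm=0.01
)
cross_sip = path_latency_ns(
    [Edge(2.0, 64.0), Edge(20.0, 16.0), Edge(2.0, 64.0)],
    [1.0, 5.0, 5.0, 1.0],
    nbytes=16,
    ns_per_mm=0.01,
)
```

With these made-up numbers the in-cube credit costs about 2.27 ns (0.02 propagation + 2.0 overhead + 0.25 drain) and the cross-SIP credit about 13.24 ns (0.24 + 12.0 + 1.0), so the topology-proportional ordering falls out of the formula without any special casing.
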
#### Component coupling — SimPy Store channel

PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init
time, **a SimPy Store is wired between the two** (a per-direction
fast-path channel) and credit metadata is `put` into that store.

```python
class PeIpcqComponent:
    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
        yield env.timeout(latency_ns)
        yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...))
```

Backend init wires both directions of the fast-path channel as part of
fan-out (see `IpcqInitMsg` in D12).

#### Credit-return fast path limitations

- `credit_size_bytes` is an estimate (typically 16–64 bytes).
- The fast path is **excluded from vc_comm BW contention** (separate
  wire). Real HW credit-return wires are very lightweight, so this is a
  reasonable first approximation.
- A follow-up ADR can: model the credit fast path as a separate link
  (BW limit + contention), or switch to piggyback (`credit_return_mode:
  piggyback`).

#### PE_DMA's added responsibility

When `vc_comm` receives a token, PE_DMA processes it as the following
sequence: pay the Transaction's terminal BW drain, then atomically
write data and forward metadata. **No SimPy yield is allowed between
the data write and the metadata forward** (invariant I6). The drain
yield must sit before the atomic block, not inside it:

```python
def _on_vc_comm_recv(self, env, txn):
    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
    # sender PE_DMA). MUST happen before the atomic block so recv only
    # wakes after the bytes have "landed".
    drain = getattr(txn, "drain_ns", 0.0)
    if drain > 0:
        yield env.timeout(drain)

    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    # 1. Copy the payload into the destination slot
    data = self._memory_store.read(token.src_space, token.src_addr,
                                   shape=..., dtype=...)
    self._memory_store.write(token.dst_endpoint.buffer_kind,
                             token.dst_addr, data)
    # 2. Forward metadata to the local PE_IPCQ
    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
    # ───────────────────────────────────────────────────
```

The final `put` is yieldable but uses an unbounded internal store, so
it completes in a single step. That `put` is the closing call of the
atomic block; nothing that yields may be inserted between the data
write and it.

#### Drain-at-inbound semantics (D9 timing model)

The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
is paid at each forwarding component via `run()`, and the remaining
BW drain is paid once at the Transaction's terminal. Every non-IPCQ
Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
`ComponentBase._forward_txn` at the terminal node. For IPCQ the
destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
(so IPCQ-specific data write + metadata forward can happen), so **the
drain MUST be paid explicitly at the top of that handler** to keep
IPCQ's timing model on par with every other fabric Transaction.

Side-effects of paying drain here:

- **SRC `tl.send`** is unchanged — fire-and-forget semantics are
  preserved because the sender PE_DMA does not `yield sub_done`. The
  `sub_done.succeed()` call (made after metadata forward below) is an
  event with no listener on the sender side.
- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
  when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
  forward now happens after the drain, recv observes the full fabric
  transfer time including bandwidth cost.

Matches the physical picture: send dispatches and leaves; recv waits
until the bytes have actually been drained into its inbox.

### D9.5. ADR-0020 (2-pass) integration

`tl.send` / `tl.recv` integrate with ADR-0020's two-pass model. Phase
1 simulates timing **and** moves data via MemoryStore; Phase 2 enables
op-log-based correctness verification.

#### Phase 1 (timing + data)

D9 models head and tail updates with two different mechanisms:

- **Send-side (head update)** — DMA payload piggyback. Data write and
  metadata forward happen in the same SimPy step → automatic atomic
  visibility.
- **Recv-side (tail credit return)** — fast-path SimPy Store channel
  with bottleneck-BW latency, then `peer_tail_cache` update.

Together they preserve ring-buffer pointer consistency.

The op-log records `op_kind="ipcq"` entries for sends (with
`src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with
`recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).
Two recv modes:

- **`return_slot`** (default): the slot address is returned to the
  kernel. Zero-copy.
- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`,
  PE_IPCQ copies the slot data into the user dst.

#### Phase 2 (op_log replay)

When `DataExecutor` encounters an `op_kind="ipcq"` record:

- **send**: idempotent `src → dst` ndarray write.
- **recv (`return_slot`)**: no-op (the slot already holds the data).
- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.

IPCQ ops are pure data movement — Phase 2 has nothing extra to compute.
The downstream GEMM / Math ops in `DataExecutor` will consume the data
and naturally validate correctness.

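
The three replay cases can be sketched with a plain dict standing in for the memory store. This is an illustration only: the record layout (`kind`, `recv_mode`, `src`, `dst`, `slot` keys), the `(pe, space, addr)` address tuples, and the function name `replay_ipcq` are hypothetical and do not reflect the real op_log schema or `DataExecutor` API.

```python
def replay_ipcq(memory, record):
    """Apply one hypothetical ipcq op_log record to memory (a dict).

    memory maps (pe, space, addr) tuples to immutable payloads.
    """
    if record["kind"] == "send":
        # Idempotent src -> dst write: replaying it again changes nothing.
        memory[record["dst"]] = memory[record["src"]]
    elif record["kind"] == "recv":
        if record["recv_mode"] == "copy_to_dst":
            memory[record["dst"]] = memory[record["slot"]]
        elif record["recv_mode"] != "return_slot":
            raise ValueError("unknown recv_mode: %r" % record["recv_mode"])
        # return_slot: no-op, the slot already holds the data.
    else:
        raise ValueError("unknown ipcq op kind: %r" % record["kind"])


def replay_all(memory, op_log):
    for record in op_log:
        replay_ipcq(memory, record)
    return memory
```

Because every case is either a plain overwrite or a no-op, replaying the same log twice leaves `memory` in the same state as replaying it once.
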
### D10. Host CCL init keeps the PyTorch shape

The host code looks just like real PyTorch DDP. `init_process_group`
creates the backend object; it does **not** receive IPCQ knobs
(neighbor topology, buffer_kind, backpressure …).

```python
# benches/ccl_allreduce.py — same shape as real PyTorch
def worker(rank, world_size, torch):
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")   # reads ccl.yaml + topology
    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
    tensor.copy_(torch.from_numpy(init))
    dist.all_reduce(tensor, op="sum")
```

The IPCQ configuration is decided by the backend at
`init_process_group` time: it loads `ccl.yaml`, picks the algorithm,
and pushes IPCQ neighbor tables to every participating PE_IPCQ. The
host code never has to know about IPCQ.

A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`.
Switching algorithms is purely a `ccl.yaml` change — no host edits
required.

#### Init flow (eager)

1. `init_process_group(backend="ahbm")` is called.
2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
3. Pulls topology + buffer_kind + backpressure + slot config from
   `algorithms[<algo>]`.
4. **Immediately** installs neighbor tables on every PE_IPCQ
   (sideband or fabric `IpcqInitMsg`).
5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally —
   PE_IPCQ is already prepared whether the kernel is a CCL kernel or
   not.

### D11. CCL config file (`ccl.yaml`)

IPCQ config and algorithm metadata live in a separate YAML file,
following the same pattern as `components.yaml` and `topology.yaml`.

A single benchmark execution runs one algorithm
(`defaults.algorithm`). Switching algorithms means editing
`defaults.algorithm` only.

```yaml
defaults:
  algorithm: ring_allreduce_tcm
  buffer_kind: tcm             # tcm | hbm | sram
  backpressure: sleep          # poll | sleep
  n_slots: 8
  slot_size: 4096
  vc_chunk_size: 256
  ipcq_credit_size_bytes: 16

algorithms:
  ring_allreduce_tcm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d          # builtin name or "custom"
    buffer_kind: tcm
    n_elem: 8                  # optional, per-algorithm tile width

  tree_allreduce_7:
    module: kernbench.ccl.algorithms.tree_allreduce
    topology: tree_binary
    buffer_kind: tcm
    world_size: 7              # algorithm-level override
    n_elem: 16

  custom_mesh:
    module: kernbench.ccl.algorithms.custom_mesh
    topology: custom           # the module supplies its own neighbors()
```

`world_size` is **not set in `defaults`**. The backend resolves it via:
`algorithm-level override > defaults override > topology spec`. The
last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP
where `WORLD_SIZE` comes from env vars rather than config files.

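
The `world_size` precedence rule can be written out as a short function. This is a sketch that assumes the parsed `ccl.yaml` sections and the topology spec are available as plain dicts; the function name `resolve_world_size` and the topology key names (taken from the fallback expression above) are illustrative, not the backend's actual API.

```python
def resolve_world_size(algo_cfg, defaults, topology):
    """algorithm-level override > defaults override > topology spec."""
    for cfg in (algo_cfg, defaults):
        if "world_size" in cfg:
            return int(cfg["world_size"])
    # Last fallback: every PE in the topology participates.
    return topology["sips"] * topology["cubes_per_sip"] * topology["pes_per_cube"]
```

For the example file above, `tree_allreduce_7` resolves to 7 through its algorithm-level override, while `ring_allreduce_tcm` falls through to the topology product.
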
#### Algorithm module structure

Each algorithm module exports two hooks — `kernel` (required) and
`neighbors` (optional) — plus a `kernel_args` helper that the
backend uses to populate positional kernel arguments at `all_reduce`
time:

```python
# src/kernbench/ccl/algorithms/ring_allreduce.py

def kernel_args(world_size: int, n_elem: int) -> tuple:
    return (n_elem, world_size)


def kernel(t_ptr, n_elem, world_size, tl):
    """Required — the PE kernel.

    IPCQ is already installed by the backend before this is called.
    The kernel only uses the four-direction send / recv API.
    """
    ...


def neighbors(rank, world_size, neighbor_map):
    """Optional — override the builtin topology's neighbor map.

    Returns a new dict, the modified-in-place dict, or None to keep the
    builtin map.
    """
    return None
```

#### `neighbors` override patterns

- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
- **Pattern B — replace entirely**: ignore `neighbor_map` and return a
  brand-new dict.
- **Pattern C — keep builtin**: omit `neighbors` or return None.

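
As an illustration of Pattern A, the hook below opens a builtin `ring_1d` into a chain by dropping the wrap-around links at the two end ranks. It assumes `neighbor_map` is a per-rank dict mapping a direction string to a peer rank; that shape is an assumption made for this sketch, since this ADR does not spell out the dict layout.

```python
def neighbors(rank, world_size, neighbor_map):
    """Pattern A sketch: cut the ring between the last rank and rank 0."""
    trimmed = dict(neighbor_map)   # return a new dict, leave the input alone
    if rank == 0:
        trimmed.pop("W", None)     # rank 0 has no western neighbor in a chain
    if rank == world_size - 1:
        trimmed.pop("E", None)     # the last rank has no eastern neighbor
    return trimmed
```

Interior ranks get their builtin map back unchanged, so only the two end ranks lose a direction.
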
#### Builtin topologies

| topology | direction set |
|----------|---------------|
| `ring_1d` | E, W |
| `ring_1d_unidir` | E only |
| `mesh_2d` | N, S, E, W |
| `tree_binary` | parent, child_left, child_right |
| `none` | (empty) — algorithm must supply `neighbors()` |

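
For `tree_binary`, one plausible way to derive the three logical directions is heap-style numbering, where rank 0 is the root and rank `i` has children `2i + 1` and `2i + 2`. The numbering scheme and the function name below are assumptions for illustration; the builtin resolver may lay the tree out differently.

```python
def tree_binary_neighbors(rank, world_size):
    """Heap-numbered binary tree: returns only the links that exist."""
    links = {}
    if rank > 0:
        links["parent"] = (rank - 1) // 2
    left, right = 2 * rank + 1, 2 * rank + 2
    if left < world_size:
        links["child_left"] = left
    if right < world_size:
        links["child_right"] = right
    return links
```

With `world_size=7` (as in the `tree_allreduce_7` entry) the tree is complete: the root has two children and no parent, and ranks 3 to 6 are leaves with only a `parent` link.
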
#### Adding a new algorithm

1. Write `kernel` and `kernel_args` in
   `src/kernbench/ccl/algorithms/<algo>.py`.
2. Add an entry in `ccl.yaml`'s `algorithms` section.
3. (Optional) provide `neighbors()` for custom topology.
4. Set `defaults.algorithm` to the new algorithm.

The host bench (`benches/ccl_allreduce.py`) does not change.

### D12. Message / token schema

The new message types added by this ADR. They live in
`src/kernbench/common/pe_commands.py` and
`src/kernbench/runtime_api/kernel.py`.

#### `IpcqInitMsg` (sideband, fan-out at init)

The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors
`MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`).
Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`,
`my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store`
field — a `simpy.Store` instance pre-wired so the credit-emitting
PE_IPCQ can push `IpcqCreditMetadata` directly into its peer's input
queue.

#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)

Carries `direction`, source addr/space, nbytes, shape, dtype, and a
handle id. `data_op=True` so it lands in the op_log.

#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)

Carries `direction` (or None for round-robin), `recv_mode`
(`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape,
dtype, blocking flag.

#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)

Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`)
plus the head metadata (`sender_seq`, `src_sip/cube/pe`,
`src_direction`). PE_DMA picks the channel by token type
(`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`).

The receiver's PE_DMA, on token arrival, performs the I6 atomic
sequence: write data into MemoryStore, then forward `IpcqMetaArrival`
to the local PE_IPCQ.

#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)

Carries `consumer_seq` (= my_tail), source PE coords, and source
direction. Travels through the dedicated SimPy Store channel rather
than `vc_comm`. Latency =
`compute_path_latency_ns(path, credit_size_bytes)`, whose drain term is
`credit_size_bytes / bottleneck_bw_on_path` (see D9).

There is **no `IpcqPtrUpdate` event** — head updates flow via D9
piggyback, tail updates via the D9 fast-path channel.

### D13. Test strategy
|
||
|
||
Test plan:
|
||
|
||
#### T1. Unit tests (component-level)
|
||
|
||
- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure
|
||
immediately forwards a token; full peer slot triggers backpressure
|
||
(poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`;
|
||
round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`.
|
||
- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute`
|
||
/ `vc_comm` independent progress, chunk interleave, BW split.
|
||
- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d /
|
||
mesh_2d / tree_binary correctness, mesh_2d non-square →
|
||
`ValueError`, custom resolver returns the module's `neighbors`.
|
||
|
||
#### T2. Integration tests (E2E send/recv)

- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional
no-deadlock), 4×4 mesh.
- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode
records `ipcq` ops in op_log; DataExecutor produces correct
`out.data`.

#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)

`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA
consistency, per-`buffer_kind` allocation.

#### T4. Regression

All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for
non-CCL benches.

#### T5. Performance / overhead

Single send/recv pair latency = (DMA latency) + (IPCQ overhead).
The result should be close to a regular PE_DMA write of the same
`nbytes` (IPCQ overhead < 100 ns).

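The T5 criterion can be expressed as a small check. This is a sketch under assumptions: the helper names are hypothetical, the two latencies would come from simulator runs that are not shown here, and the example numbers are illustrative rather than measured.

```python
IPCQ_OVERHEAD_BUDGET_NS = 100.0  # budget stated in T5


def ipcq_overhead_ns(send_recv_latency_ns: float, dma_write_latency_ns: float) -> float:
    """IPCQ overhead = send/recv pair latency minus a plain PE_DMA write."""
    return send_recv_latency_ns - dma_write_latency_ns


def within_budget(send_recv_latency_ns: float, dma_write_latency_ns: float,
                  budget_ns: float = IPCQ_OVERHEAD_BUDGET_NS) -> bool:
    """True when the overhead is non-negative and strictly below the budget."""
    overhead = ipcq_overhead_ns(send_recv_latency_ns, dma_write_latency_ns)
    return 0.0 <= overhead < budget_ns


# Illustrative values, not measurements.
print(within_budget(540.0, 500.0))  # True  (40 ns overhead)
print(within_budget(650.0, 500.0))  # False (150 ns overhead)
```
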
### D14. Invariants and failure modes

#### Invariants

I1. **Slot lifecycle exactly-once**: one send → exactly one recv.
I2. **Pointer monotonicity**: `my_head` / `my_tail` are monotonically
non-decreasing; `sender_seq` is strictly increasing.
I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank
B, then rank B's reverse-direction peer must be rank A. Verified at
init.
I4. **`buffer_kind` consistency**: all PEs in a process group share
the same `buffer_kind` (no mixed mode in the first cut).
I5. **op_log ordering**: send → DMA complete → recv possible. The
`t_start` order in op_log respects this causality.
I6. **Atomic data + metadata visibility (MUST)**: at the receiver
side, data write (`MemoryStore.write`) and metadata forward
(`peer_head_cache` update) **must execute in the same SimPy step**.
No yield is allowed between the two operations in PE_DMA's vc_comm
handler. Code review must reject any inserted `yield` (or
`yield from`), because it would create a race where `peer_head_cache`
becomes visible before or after the data.
I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6,
the step in which `peer_head_cache > my_tail` first becomes true is the
same step in which the slot data is observable.

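I6 and I7 can be illustrated with a plain-Python model in which each `yield` marks a step boundary, as it does for a SimPy process. This is a sketch under assumptions, not the PE_DMA handler: the `Receiver` class, the dict standing in for MemoryStore, and the direct `peer_head_cache` assignment (which collapses the `IpcqMetaArrival` forward into one statement) are all hypothetical.

```python
class Receiver:
    """Stand-in for the receiver-side state (not the real PE_IPCQ class)."""

    def __init__(self):
        self.memory = {}          # stand-in for MemoryStore, keyed by slot seq
        self.peer_head_cache = 0  # advanced when metadata arrives
        self.my_tail = 0          # advanced when a slot is consumed

    def slot_ready(self):
        return self.peer_head_cache > self.my_tail


def vc_comm_handler(rx, tokens):
    """Generator process; every `yield` ends the current step."""
    for sender_seq, payload in tokens:
        yield "dma-transfer"                 # transfer time elapses here
        rx.memory[sender_seq] = payload      # 1) data write
        rx.peer_head_cache = sender_seq + 1  # 2) metadata, same step (I6)


def drain(rx, seen):
    """Observer: consume every slot that the pointers report as ready."""
    while rx.slot_ready():
        assert rx.my_tail in rx.memory, "I7 violated: pointer ahead of data"
        seen.append(rx.memory[rx.my_tail])
        rx.my_tail += 1


def run(tokens):
    rx, seen = Receiver(), []
    for _ in vc_comm_handler(rx, tokens):  # observer runs at each step boundary
        drain(rx, seen)
    drain(rx, seen)                        # final step after the handler ends
    return seen


print(run([(0, b"a"), (1, b"b")]))  # [b'a', b'b']
```

Swapping the two statements in `vc_comm_handler` and inserting a `yield` between them makes the assertion in `drain` fire, which is the race that I6 forbids.
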
#### Failure modes (runtime errors)

F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction
→ `IpcqInvalidDirection`, simulation aborts.
F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched
send and recv. Not validated by default; opt-in strict mode catches
it (`strict_validation: true` in a PE_IPCQ node's attrs).
F3. **Deadlock detection (timeout-based)**: the simulator empties its
schedule while a send/recv is still pending → engine raises
`IpcqDeadlock` and embeds a pointer dump.
F4. **Backend init failure**: missing `defaults.algorithm`, missing
`algorithms[name]`, module import failure, topology validation
failure (I3, I4). All are raised at `init_process_group` time.
F5. **Slot full + infinite backpressure**: the peer never calls recv.
This surfaces as an F3 timeout.

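The I3 check that F4 refers to can be sketched as a reciprocity test over a neighbor table. This is an illustration under assumptions: the `{rank: {direction: peer}}` table shape, the N/S and E/W reverse pairs, the E/W mapping used for `ring_1d`, and the use of `ValueError` are all hypothetical and may differ from the backend.

```python
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}  # assumed reverse pairs


def ring_1d_neighbors(rank: int, world_size: int) -> dict:
    """Hypothetical ring mapping: E is the next rank, W the previous one."""
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def check_endpoint_consistency(neighbors: dict) -> None:
    """Raise if A's peer in a direction does not point back at A (I3)."""
    for rank, dirs in neighbors.items():
        for direction, peer in dirs.items():
            back = neighbors.get(peer, {}).get(OPPOSITE[direction])
            if back != rank:
                raise ValueError(
                    f"rank {rank} dir {direction} -> rank {peer}, but rank "
                    f"{peer} dir {OPPOSITE[direction]} -> {back}"
                )


# A 4-rank ring passes the check.
ring = {r: ring_1d_neighbors(r, 4) for r in range(4)}
check_endpoint_consistency(ring)

# A table with a one-sided link is rejected.
broken = {0: {"E": 1}, 1: {"W": 2}, 2: {}}
try:
    check_endpoint_consistency(broken)
except ValueError as exc:
    print("rejected:", exc)
```
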
#### Diagnostics

- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as
`(rank, t, dir, nbytes)`.
- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)`
prints every PE_IPCQ ring buffer's `my_head`, `my_tail`,
`peer_head_cache`, `peer_tail_cache`.
- **Deadlock dump**: on hang the engine includes the pointer dump in
the `IpcqDeadlock` exception message.

### D15. Algorithm-author cheat sheet

The full step-by-step guide lives in
[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
shortest version:

| Things you touch | Things you don't |
|------------------|-------------------|
| `src/kernbench/ccl/algorithms/<your_algo>.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
| (Optional) `tests/test_<your_algo>.py` mock test | PE_IPCQ component, AhbmCCLBackend |

5-step flow: write the kernel → register in `ccl.yaml` → optional
`neighbors` override → optional mock unit test → SimPy validation via
`kernbench run --bench ccl_allreduce --verify-data`.

Common mistakes: using a direction that wasn't installed, sends
without matching recvs (deadlock), dtype/shape disagreement, assuming
fairness from `tl.recv()` round-robin, confusing
`tl.num_programs(axis)` with the CCL group size.

---

## Non-goals

- **Host collective**: a model where `dist.all_reduce` itself moves
data on the host side is out of scope. This ADR only covers
communication that happens inside the PE kernel.
- **All-reduce algorithms**: ring / tree / etc. live in algorithm
modules and can be added without amending this ADR.
- **Reliability / error handling**: link faults, send/recv failure
recovery, etc. are out of scope.
- **NoC arbiter precision**: dynamic VC contention is left for a future
ADR (see D8).

---

## Open questions

- **VC arbitration accuracy**: the first cut uses deterministic
chunk interleave + weighted round-robin; heavy contention may report
optimistic latency. A NoC arbiter component can be added later.
- **Credit return BW model**: the fast path is currently outside the
fabric BW contention model. Can be modeled as a separate link or
switched to piggyback (`credit_return_mode: piggyback`).
- **Ring buffer slot allocation metadata**: whether the host pushes
IPCQ buffer metadata via sideband or via a fabric message similar to
`MmuMapMsg` is open.
- **VC BW split default**: 50/50 vs. weighted (e.g. 80/20). Exposed in
`ccl.yaml`; default value TBD.
- **Direction count**: 4 (N/S/E/W) is fixed in the first cut; 6
(with Up/Down for 3D) or N (variable) is future work.
- **Multi-tile aggregation primitives**: whether
`tl.recv_all` or similar is needed for fan-in.
- **Round-robin recv fairness**: current weak fairness can starve;
strict fairness counter is future work.
- **Deadlock detection precision**: currently timeout-based; a
real-time wait-for graph would enable deterministic detection.

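The "deterministic chunk interleave + weighted round-robin" arbitration mentioned in the first open question can be sketched as below. This is one possible reading, not the PE_DMA scheduler: the function name, the queue and weight dictionaries, and the 2:1 weights are assumptions chosen for illustration (the default split is still TBD).

```python
from collections import deque


def interleave_chunks(queues: dict, weights: dict) -> list:
    """Serve up to `weights[vc]` chunks per VC per round, in dict order.

    `queues` maps a VC name to a deque of pending chunks and is drained
    in place. Returns the service order as (vc, chunk) pairs.
    """
    if set(queues) != set(weights):
        raise ValueError("every VC needs exactly one weight")
    if any(w < 1 for w in weights.values()):
        raise ValueError("weights must be >= 1 so every VC makes progress")
    order = []
    while any(queues.values()):  # an empty deque is falsy
        for vc, weight in weights.items():
            for _ in range(weight):
                if not queues[vc]:
                    break  # this VC is idle; move on to the next one
                order.append((vc, queues[vc].popleft()))
    return order


queues = {
    "vc_compute": deque(["c0", "c1", "c2", "c3"]),
    "vc_comm": deque(["m0", "m1"]),
}
for vc, chunk in interleave_chunks(queues, {"vc_compute": 2, "vc_comm": 1}):
    print(vc, chunk)
# Service order: c0, c1, m0, c2, c3, m1
```

Because the order depends only on the weights and the dictionary insertion order, repeated runs produce the same interleave, which is what makes the approximation deterministic.
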
---

## Consequences

### Positive

- Direct PE-to-PE communication makes it possible to write CCL kernels.
- Host stays minimal (just `launch`); synchronization happens inside
the PE → strong compute / comm overlap.
- VCs eliminate HoL blocking → collective latency is not blocked by
compute traffic.
- Buffer placement and backpressure mode are init-time parameters →
easy to benchmark.
- Four-direction logical neighbors → host is free to map
ring/mesh/tree algorithms.

### Negative

- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE.
- VC arbitration is a first-order approximation; heavy contention
scenarios may report slightly optimistic latency vs real HW (D8).
- Chunk-level interleave makes PE_DMA implementation more complex.
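
The memory-cost formula in the list above can be evaluated directly. The helper name is hypothetical, and the slot size and slot count below are illustrative values, not the project's defaults.

```python
def ipcq_bytes_per_pe(slot_size: int, n_slots: int, n_rings: int = 8) -> int:
    """IPCQ memory cost per PE: rings x slot_size x n_slots."""
    return n_rings * slot_size * n_slots


# Illustrative sizing: 4 KiB slots, 8 slots per ring.
cost = ipcq_bytes_per_pe(slot_size=4096, n_slots=8)
print(cost, "bytes =", cost // 1024, "KiB")  # 262144 bytes = 256 KiB
```
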