ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 01:15:55 -07:00


ADR-0023: PE-level IPCQ — Inter-PE Collective Communication

Status

Accepted

Context

Goal

Add the infrastructure that lets CCL (Collective Communication Library) kernels run inside a PE. The host just launches a kernel on each SIP; the actual synchronization and data movement happen inside the PE kernel via an IPCQ (Inter-Process Communication Queue).

This mirrors how NCCL performs NVLink communication inside a GPU kernel, or how Cerebras / Tenstorrent expose core-local communication queues. Host-level collectives (dist.all_reduce) are deferred to future work; this ADR focuses solely on the kernel-side collective infrastructure.

Problems to solve

PE-to-PE direct data movement (writing into a peer's memory).
Synchronization — the sender must check that the receiver has space in its buffer (backpressure).
Resource contention between compute traffic and communication traffic (Head-of-Line blocking).
The host must be able to construct logical neighbor topologies (ring / mesh / tree) per algorithm.

Decision

D1. Add a new `PE_IPCQ` component

A new component PE_IPCQ is added inside each PE. It follows the same pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a distinct component.

PE
├── PE_CPU
├── PE_SCHEDULER
├── PE_DMA
├── PE_IPCQ          ← new
├── PE_FETCH_STORE
├── PE_GEMM
├── PE_MATH
├── PE_TCM
└── PE_MMU

Role separation (control plane vs. data plane):

PE_IPCQ (control plane): ring-buffer address arithmetic, head / tail pointer management, peer pointer caches, backpressure, 4-direction neighbor mapping.
PE_DMA (data plane): actually moves data through cube_noc / UCIe / PCIE into the peer's memory.

PE_IPCQ does not move data itself — it delegates to PE_DMA.

D2. Ring buffer model

Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.

@dataclass
class IpcqQueuePair:
    direction: Direction          # N/S/E/W
    peer: IpcqEndpoint            # set by host at init time (D2.5)
    tx_buffer_base: int           # outgoing data base addr (in our memory)
    rx_buffer_base: int           # incoming data base addr (in our memory)
    slot_size: int                # 1 tile per slot
    n_slots: int                  # ring depth
    my_head: int                  # next slot we will write/send into
    my_tail: int                  # next slot we will read/recv from
    peer_head_cache: int          # peer's last-seen head (updated via D9 piggyback)
    peer_tail_cache: int          # peer's last-seen tail (updated via D9 fast-path credit)

Canonical field names: throughout this ADR the four names above (my_head, my_tail, peer_head_cache, peer_tail_cache) are used consistently. Synonyms (peer_head_local, peer_head, peer_tail, etc.) are not used.

Field	Owner	Updated when
`my_head`	local PE_IPCQ	immediately after `tl.send` (send tracking)
`my_tail`	local PE_IPCQ	immediately after `tl.recv` (recv tracking)
`peer_head_cache`	local PE_IPCQ	on `IpcqMetaArrival` (D9 piggyback)
`peer_tail_cache`	local PE_IPCQ	on `IpcqCreditMetadata` (D9 fast path)

Slot unit: slots are fixed-size and each one holds one full tile. The data is embedded in the slot itself (no descriptor indirection). See D5.

D2.5. `IpcqEndpoint` schema

IpcqQueuePair.peer carries everything the sender needs to compute the peer's rx slot address:

@dataclass(frozen=True)
class IpcqEndpoint:
    sip: int
    cube: int
    pe: int
    buffer_kind: str             # "tcm" | "hbm" | "sram"
    rx_base_pa: int              # peer rx_buffer base PA (PhysAddr.encode())
    rx_base_va: int              # peer rx_buffer base VA (optional, MMU mode)
    n_slots: int                 # peer ring depth (for wrap-around)
    slot_size: int               # peer slot size (for offset)

Address computation:

slot_idx = self.my_head % peer.n_slots
dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size

PE_IPCQ passes dst_pa to PE_DMA inside an IpcqDmaToken. PE_DMA (vc_comm) routes the data to dst_pa through the fabric.
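The address computation above can be sketched as a small pure function. This is only an illustration: `PeerRing` and `peer_slot_pa` are hypothetical names that carry just the three `IpcqEndpoint` fields the formula needs, and the base address is an arbitrary example value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerRing:
    """Hypothetical subset of IpcqEndpoint: only the fields used for addressing."""
    rx_base_pa: int
    n_slots: int
    slot_size: int


def peer_slot_pa(my_head: int, peer: PeerRing) -> int:
    """Return the physical address of the peer rx slot used by the next send."""
    slot_idx = my_head % peer.n_slots            # wrap around the peer's ring depth
    return peer.rx_base_pa + slot_idx * peer.slot_size


# Example values (assumed): 8 slots of 4096 bytes at an arbitrary base address.
peer = PeerRing(rx_base_pa=0x1000_0000, n_slots=8, slot_size=4096)
addrs = [peer_slot_pa(head, peer) for head in (0, 1, 7, 8, 9)]
```

Because `my_head` only ever grows, the modulo is what maps the ninth send (head 8) back onto slot 0.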

Endpoint construction order: at backend init (D10), the IPCQ buffers for every PE are allocated first (so each rank knows the others' PA), then the per-rank neighbor tables are built and pushed to PE_IPCQ via IpcqInitMsg.

D3. Four-direction mapping ≡ logical ProcessGroup

The PE views four directions (N/S/E/W) as logical ports. Real peer addresses are configured by the host CCL init, per the chosen algorithm. The PE kernel never knows the topology, only directions.

# 1D ring
for rank in range(world_size):
    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])

# 2D mesh
for r in range(R):
    for c in range(C):
        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))

The PE code does not need to know where tl.send(dir="E", ...) actually ends up.
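As a self-contained illustration of the two loops above, the sketch below builds the neighbor tables as plain dictionaries instead of calling `ipcq_set_neighbor`. The function names and the dictionary layout (rank to direction to peer) are assumptions made for the example, not the backend's actual data structures.

```python
def ring_1d_neighbors(world_size: int) -> dict:
    """Map each rank to its E/W peers on a wrap-around 1D ring."""
    return {
        rank: {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}
        for rank in range(world_size)
    }


def mesh_2d_neighbors(rows: int, cols: int) -> dict:
    """Map each (row, col) coordinate to its N/S/E/W peers with wrap-around."""
    table = {}
    for r in range(rows):
        for c in range(cols):
            table[(r, c)] = {
                "N": ((r - 1) % rows, c),
                "S": ((r + 1) % rows, c),
                "E": (r, (c + 1) % cols),
                "W": (r, (c - 1) % cols),
            }
    return table


ring = ring_1d_neighbors(4)
mesh = mesh_2d_neighbors(3, 3)
```

The kernel only ever names a direction; which rank sits behind `"E"` is decided entirely by tables like these.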

D4. PE kernel API

# Send (blocking; may stall on backpressure)
tl.send(dir: str, src=TensorHandle)
tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)

# Recv (blocking)
recv = tl.recv(dir: str, shape=..., dtype=...)
recv = tl.recv(shape=..., dtype=...)        # round-robin across 4 directions

# Recv (non-blocking)
fut  = tl.recv_async(dir: str, shape=..., dtype=...)
recv = tl.wait(fut)

tl.recv() (no direction) keeps a last_polled_dir cursor; each call rotates through the directions starting from that cursor and returns the first available slot. If all four directions are empty, the call waits.

Fairness is weak: the rotating start mitigates simple bias, but if one direction always wins the race the others can starve. Algorithms that need strict fairness must call tl.recv(dir=...) explicitly.
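The rotating cursor can be sketched as follows. This is a minimal model under stated assumptions: the class name `RoundRobinPoller` is hypothetical, and starting each scan one position after the last direction that returned data is one plausible reading of "rotating start", not a confirmed detail of PE_IPCQ.

```python
from typing import Optional

DIRECTIONS = ("N", "S", "E", "W")


class RoundRobinPoller:
    """Toy model of the direction-less tl.recv() scan order."""

    def __init__(self) -> None:
        # Start so that the very first scan begins at "N".
        self.last_polled_dir = len(DIRECTIONS) - 1

    def pick(self, ready: set) -> Optional[str]:
        """Return the first ready direction after the cursor, or None if all are empty."""
        start = (self.last_polled_dir + 1) % len(DIRECTIONS)
        for offset in range(len(DIRECTIONS)):
            idx = (start + offset) % len(DIRECTIONS)
            if DIRECTIONS[idx] in ready:
                self.last_polled_dir = idx
                return DIRECTIONS[idx]
        return None  # the real component would wait here


poller = RoundRobinPoller()
picks = [poller.pick({"N", "E"}) for _ in range(4)]
empty = poller.pick(set())
```

With data always ready on N and E the picks alternate. The sketch does not model arrival timing, which is where the starvation described above can still occur.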

D5. Single-hop DMA write + full-data slot model

Data moves from sender memory into the receiver's ring slot in one DMA transfer. Key properties:

Single-hop: the sender already knows the peer rx slot address and fires one fabric DMA into it.
No CPU memcpy: the CPU never copies data.
No intermediate staging: neither side keeps a separate staging buffer (sender uses the source addr directly; receiver gets the data in its ring slot directly).

(Strictly speaking the fabric DMA write does happen, so this is not literally "no data movement" — it's the same property NCCL labels "zero-copy", meaning no CPU memcpy and no staging copy.)

PE A: tl.send(E, src_addr, nbytes)
  1. IPCQ computes the peer rx slot address:
       dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
  2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
                   (full → sleep / poll)
  3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
  4. my_head += 1

PE B: data = tl.recv(W)
  1. Look at rx_buffer[my_tail % n_slots]
  2. Wait for the data to arrive (D7 backpressure mode)
  3. Return the slot address to the kernel (or fetch into register file)
  4. my_tail += 1
  5. Issue a credit return over the fast path (D9): once the path
     latency computed in D9 has elapsed, A's peer_tail_cache is updated.

The slot holds the full tile. The receiver only reads its own rx_buffer; it never reads back into A's memory. The sender knows the peer rx slot address and DMAs directly into it (single-hop).

The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local to the PE).
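The send / recv steps above can be condensed into a toy model of one sender and one receiver. It is a sketch only: `RingModel` is a hypothetical name, both ends live in one object, and fabric latency, credit latency, and the sleep / poll wait are all omitted (a blocked send simply returns False).

```python
class RingModel:
    """Toy model of one D5 sender/receiver pair; no fabric and no timing."""

    def __init__(self, n_slots: int) -> None:
        self.n_slots = n_slots
        self.slots = [None] * n_slots   # receiver's rx ring
        self.my_head = 0                # sender: next slot to send into
        self.peer_tail_cache = 0        # sender: last tail learned from credits
        self.my_tail = 0                # receiver: next slot to read
        self.peer_head_cache = 0        # receiver: last head seen via piggyback

    def send(self, tile) -> bool:
        # Backpressure check from step 2 of the send flow.
        if self.my_head - self.peer_tail_cache >= self.n_slots:
            return False                # the real model would sleep or poll
        self.slots[self.my_head % self.n_slots] = tile
        self.my_head += 1
        self.peer_head_cache = self.my_head   # metadata arrives with the data
        return True

    def recv(self):
        if self.peer_head_cache == self.my_tail:
            return None                 # nothing has arrived yet
        tile = self.slots[self.my_tail % self.n_slots]
        self.my_tail += 1
        self.peer_tail_cache = self.my_tail   # credit return, latency ignored
        return tile


ring = RingModel(n_slots=2)
sent = [ring.send(t) for t in ("a", "b", "c")]   # third send hits a full ring
first = ring.recv()                              # frees one slot
retry = ring.send("c")                           # now succeeds
rest = [ring.recv(), ring.recv(), ring.recv()]
```

The third send is refused until a recv returns a credit, which is the backpressure behaviour that D7 then models in time.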

D6. Buffer placement — three-way benchmark

The host CCL init picks the IPCQ ring-buffer location:

ipcq_init(
    backend="ahbm",
    buffer_kind="tcm",            # one of "tcm" | "hbm" | "sram"
    n_slots=8,
    slot_size=4096,
)

Location	Trait	Trade-off
PE_TCM	Attached to the PE; fast	Small; competes with PE-internal resources
PE-local HBM	Large; via DMA	Higher latency
Cube SRAM	Mid-size; cube-shared	Cube-internal contention

All three locations run the same kernel code; only the init differs.

D7. Backpressure — two-mode benchmark

How the sender or receiver waits when peer slots are full / data not yet arrived:

Mode	Behavior	Model
poll	Periodically re-check the cached peer pointer	Spin loop
sleep	Yield a SimPy event; wake on a peer-trigger	Interrupt-like

ipcq_init(backpressure="sleep", ...)  # one of "poll" | "sleep"

Both modes are implemented so latency / throughput trade-offs can be benchmarked.

D8. PE_DMA virtual channels

Extend PE_DMA from a single queue into a two-channel virtual-channel model.

PE_DMA
├── vc_compute: tile load / store / writeback for GEMM and Math
└── vc_comm:    IPCQ send data

Each VC has an independent state machine:

One channel stalling does not block the other.
The same physical link (cube_noc, UCIe, …) is shared, but link BW is split between channels.

Chunk-level interleave:

Large GEMM tile DMAs do not lock the link end-to-end.
Progress happens in chunks (e.g. 256 B); each chunk shares link BW with the other VC's pending chunks.
Chunk size is an init parameter (smaller = fairer, larger = more efficient).

Net effect:

HoL blocking is eliminated (an IPCQ send can interleave with a long compute DMA).
Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM pattern).
Matches the NoC-virtual-channel pattern used in real HW.

First-implementation accuracy limit (intentional): this ADR's first cut uses deterministic chunk-level interleave + weighted round-robin arbitration (default 50 / 50, exposed in ccl.yaml). This is a first-order approximation and is simpler than real HW dynamic-contention / credit-based arbiters. Functional correctness is unaffected, but heavy-contention scenarios may report slightly optimistic latency vs. real HW. A separate ADR can add a NoC arbiter component later if more precision is needed.
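A deterministic chunk-level interleave with weighted round-robin can be sketched as below. This is an illustration of the arbitration idea only: `interleave_chunks` is a hypothetical helper, weights are expressed as "chunks per round", and no link timing is modelled.

```python
def interleave_chunks(pending: dict, weights: dict, chunk_size: int = 256) -> list:
    """Return the order in which (vc_name, nbytes) chunks would be issued.

    pending: outstanding bytes per VC; weights: chunks each VC may issue per round.
    """
    if any(w < 1 for w in weights.values()):
        raise ValueError("every VC needs a weight of at least 1")
    remaining = {vc: pending.get(vc, 0) for vc in weights}
    schedule = []
    while any(remaining[vc] > 0 for vc in weights):
        for vc, weight in weights.items():
            for _ in range(weight):
                if remaining[vc] <= 0:
                    break                       # this VC is drained; skip its turn
                nbytes = min(chunk_size, remaining[vc])
                schedule.append((vc, nbytes))
                remaining[vc] -= nbytes
    return schedule


pending = {"vc_compute": 1024, "vc_comm": 512}
even = interleave_chunks(pending, {"vc_compute": 1, "vc_comm": 1})
skewed = interleave_chunks(pending, {"vc_compute": 3, "vc_comm": 1})
```

With the 1:1 weights the 512-byte IPCQ send finishes after four chunk slots instead of waiting behind the whole 1024-byte compute transfer, which is the HoL-blocking relief described above.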

Token routing

Compute tokens (TileToken) — go through the existing PE_FETCH_STORE → PE_DMA chain.
Communication tokens (IpcqDmaToken, new) — PE_IPCQ → PE_DMA self-routing.
PE_DMA picks the channel by token type.

class PeDmaComponent:
    def _process(self, env, token):
        if isinstance(token, IpcqDmaToken):
            yield from self._vc_comm_process(env, token)
        else:
            yield from self._vc_compute_process(env, token)

D9. Pointer synchronization — DMA payload piggyback

Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so pointers update along with the data. This simulation adopts the same model: no separate control channel — metadata travels with the data.

The big benefits:

Automatic ordering: data and metadata move on the same token, so data is visible before the head_cache update. No race.
HW fidelity: matches NVLink / UCIe piggybacked headers.
Component simplification: no separate IpcqPtrUpdate event type.

Send flow (head update via piggyback)

PE A: tl.send(E, src_addr, nbytes)
  1. PE_IPCQ checks backpressure (using peer_tail_cache)
  2. PE_IPCQ creates an IpcqDmaToken:
       - data body (src_addr → peer dst_addr)
       - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
  3. Hand the token to PE_DMA(vc_comm)
  4. PE A increments my_head (send tracking)

[fabric DMA: latency elapses]

PE B's PE_DMA receives the token
  5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
  6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)

PE B's PE_IPCQ receives the metadata
  7. Updates peer_head_cache (= A's head)
  8. Wakes any pending recv on that direction

Steps 5 and 6 must execute in the same SimPy step — DMA completion makes data and metadata atomically visible.

Recv flow (credit return — fast path with bottleneck-BW latency)

When the receiver frees a slot, the sender must learn about it (backpressure release). Unlike data, the credit return does not travel through general vc_comm fabric — it uses a separate fast path, an abstraction of the NVLink / UCIe credit-return wire.

Latency is computed from the full path latency (per-node overhead + edge propagation + drain), not a magic constant:

credit_size_bytes = 16  # ccl.yaml: ipcq_credit_size_bytes
path = router.find_path(self_pe, peer_pe.pe_dma)
latency = compute_path_latency_ns(path, credit_size_bytes)
# compute_path_latency_ns expands to:
#     sum(edge.distance_mm * ns_per_mm for each edge on the path)
#   + sum(node_overhead_ns[n] for n in path)
#   + credit_size_bytes / bottleneck_bw_on_path

The router auto-appends .pe_dma to the source only, so the destination must be spelled with the explicit .pe_dma suffix. Otherwise find_path raises and the credit silently teleports at zero cost (a latent bug, fixed alongside this update).

tl.recv blocks on the credit-emit completion (recv yields-from _delayed_credit_send rather than spawning it as a fork). This puts the credit-return cost on the receiver's pe_exec_ns, modeling the IPCQ control-plane completing the consume-acknowledgement before recv returns to the kernel — the protocol equivalent of a non-posted tl.store waiting for an HBM ack on the raw DMA path.

That gives us:

Topology-proportional approximation: an in-cube credit return is automatically faster than a cross-SIP credit return.
No magic constants: every nanosecond comes from compute_path_latency_ns on the same edge_map and node_overhead_ns as data traffic.
No deadlock risk: unlike piggyback, B can issue credit even when it has no data to send back. peer_credit_store.put is unbounded.
IPCQ ≥ raw DMA for matched physical moves: the credit-emit cost paid on recv balances the HBM ack round-trip cost that the raw DMA path pays on the sender.
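To make the three latency terms concrete, here is a minimal standalone sketch. It is not the simulator's `compute_path_latency_ns`: the function name, the `Edge` type, the node names, and every number below are assumptions chosen so the arithmetic is easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical link description: length and bandwidth."""
    distance_mm: float
    bw_bytes_per_ns: float


def path_latency_ns_sketch(nodes, edges, node_overhead_ns, nbytes, ns_per_mm):
    """Propagation + per-node overhead + drain at the slowest link on the path."""
    propagation = sum(e.distance_mm * ns_per_mm for e in edges)
    overhead = sum(node_overhead_ns[n] for n in nodes)
    bottleneck_bw = min(e.bw_bytes_per_ns for e in edges)
    return propagation + overhead + nbytes / bottleneck_bw


overheads = {"pe_a": 2.0, "noc_a": 4.0, "noc_b": 4.0, "pe_b": 2.0}

# In-cube credit: two short, fast hops through one NoC node.
in_cube = path_latency_ns_sketch(
    ["pe_a", "noc_a", "pe_b"],
    [Edge(1.0, 64.0), Edge(1.0, 64.0)],
    overheads, nbytes=16, ns_per_mm=0.5,
)

# Cross-SIP credit: one long, slower link in the middle.
cross_sip = path_latency_ns_sketch(
    ["pe_a", "noc_a", "noc_b", "pe_b"],
    [Edge(1.0, 64.0), Edge(20.0, 16.0), Edge(1.0, 64.0)],
    overheads, nbytes=16, ns_per_mm=0.5,
)
```

With these made-up numbers the in-cube credit costs 9.25 ns and the cross-SIP credit 24.0 ns, showing how the topology-proportional property falls out of the formula rather than from a constant.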

Component coupling — SimPy Store channel

PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init time, a SimPy Store is wired between the two (a per-direction fast-path channel) and credit metadata is put into that store.

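class PeIpcqComponent:
    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
        # Pay the fast-path latency before the credit becomes visible to the peer.
        yield env.timeout(latency_ns)
        # Per D12, IpcqCreditMetadata carries consumer_seq (= my_tail).
        yield peer_credit_store.put(IpcqCreditMetadata(consumer_seq=my_tail, ...))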

Backend init wires both directions of the fast-path channel as part of fan-out (see IpcqInitMsg in D12).

Credit-return fast path limitations

credit_size_bytes is an estimate (typically 16–64 bytes).
The fast path is excluded from vc_comm BW contention (separate wire). Real HW credit-return wires are very lightweight, so this is a reasonable first approximation.
A follow-up ADR can: model the credit fast path as a separate link (BW limit + contention), or switch to piggyback (credit_return_mode: piggyback).

PE_DMA's added responsibility

When vc_comm receives a token, PE_DMA processes it as the following sequence: pay the Transaction's terminal BW drain, then atomically write data and forward metadata. No SimPy yield is allowed between the data write and the metadata forward (invariant I6). The drain yield must sit before the atomic block, not inside it:

def _on_vc_comm_recv(self, env, txn):
    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
    # sender PE_DMA). MUST happen before the atomic block so recv only
    # wakes after the bytes have "landed".
    drain = getattr(txn, "drain_ns", 0.0)
    if drain > 0:
        yield env.timeout(drain)

    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    # 1. Copy the payload from the sender's source into the receiver's rx slot
    data = self._memory_store.read(token.src_space, token.src_addr,
                                   shape=..., dtype=...)
    self._memory_store.write(token.dst_endpoint.buffer_kind,
                             token.dst_addr, data)
    # 2. Forward metadata to the local PE_IPCQ
    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
    # ───────────────────────────────────────────────────

The final put is yieldable but uses an unbounded internal store, so it completes in a single step. That put is the closing call of the atomic block; no yielding call may be inserted between the data write and that put.

Drain-at-inbound semantics (D9 timing model)

The Transaction carries drain_ns = nbytes / bottleneck_bw_on_path stamped at send-side PE_DMA. In this simulator per-hop overhead_ns is paid at each forwarding component via run(), and the remaining BW drain is paid once at the Transaction's terminal. Every non-IPCQ Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via ComponentBase._forward_txn at the terminal node. For IPCQ the destination PE_DMA intercepts the Transaction with _handle_ipcq_inbound (so IPCQ-specific data write + metadata forward can happen), so the drain MUST be paid explicitly at the top of that handler to keep IPCQ's timing model on par with every other fabric Transaction.

Side-effects of paying drain here:

SRC tl.send is unchanged — fire-and-forget semantics are preserved because the sender PE_DMA does not yield sub_done. The sub_done.succeed() call (made after metadata forward below) is an event with no listener on the sender side.
DST tl.recv unblocks drain_ns later. Since recv wakes only when IpcqMetaArrival reaches its local PE_IPCQ, and the metadata forward now happens after the drain, recv observes the full fabric transfer time including bandwidth cost.

Matches the physical picture: send dispatches and leaves; recv waits until the bytes have actually been drained into its inbox.

D9.5. ADR-0020 (2-pass) integration

tl.send / tl.recv integrates with ADR-0020's two-pass model. Phase 1 simulates timing and moves data via MemoryStore; Phase 2 enables op-log-based correctness verification.

Phase 1 (timing + data)

D9 models head and tail updates with two different mechanisms:

Send-side (head update): DMA payload piggyback. Data write and metadata forward happen in the same SimPy step, so the data and the new head become visible atomically.
Recv-side (tail credit return): fast-path SimPy Store channel with the path-derived latency from D9, followed by the peer_tail_cache update.

Together they preserve ring-buffer pointer consistency.

The op-log records op_kind="ipcq" entries for sends (with src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq) and recvs (with recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq). Two recv modes:

return_slot (default): the slot address is returned to the kernel. Zero-copy.
copy_to_dst: when the kernel passes dst_addr + dst_space, PE_IPCQ copies the slot data into the user dst.

Phase 2 (op_log replay)

When DataExecutor encounters an op_kind="ipcq" record:

send: idempotent src → dst ndarray write.
recv (return_slot): no-op (the slot already holds the data).
recv (copy_to_dst): idempotent slot → dst_addr copy.

IPCQ ops are pure data movement — Phase 2 has nothing extra to compute. The downstream GEMM / Math ops in DataExecutor will consume the data and naturally validate correctness.
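The three replay rules can be sketched as a small function over a dictionary-backed memory. Everything here is illustrative: `replay_ipcq`, the `role` / `slot` keys, and the `(space, addr)` memory keys are a simplified, hypothetical record layout, not the real op-log schema or the DataExecutor API.

```python
def replay_ipcq(op_log: list, memory: dict) -> dict:
    """Apply the Phase 2 rules for op_kind == "ipcq" records to `memory` in place."""
    for op in op_log:
        if op.get("op_kind") != "ipcq":
            continue                                 # other op kinds are handled elsewhere
        if op["role"] == "send":
            memory[op["dst"]] = memory[op["src"]]    # idempotent src -> dst write
        elif op["recv_mode"] == "copy_to_dst":
            memory[op["dst"]] = memory[op["slot"]]   # idempotent slot -> dst copy
        # recv with return_slot: no-op, the slot already holds the data
    return memory


op_log = [
    {"op_kind": "ipcq", "role": "send",
     "src": ("tcm", 0x100), "dst": ("tcm", 0x2000)},
    {"op_kind": "ipcq", "role": "recv", "recv_mode": "return_slot",
     "slot": ("tcm", 0x2000)},
    {"op_kind": "ipcq", "role": "recv", "recv_mode": "copy_to_dst",
     "slot": ("tcm", 0x2000), "dst": ("hbm", 0x8000)},
]
initial = {("tcm", 0x100): b"\x01\x02"}

once = replay_ipcq(op_log, dict(initial))
twice = replay_ipcq(op_log, dict(once))      # replaying again changes nothing
```

Running the log a second time over its own result leaves memory unchanged, which is what "idempotent" means for these records.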

D10. Host CCL init keeps the PyTorch shape

The host code looks just like real PyTorch DDP. init_process_group creates the backend object; it does not receive IPCQ knobs (neighbor topology, buffer_kind, backpressure …).

# benches/ccl_allreduce.py — same shape as real PyTorch
def worker(rank, world_size, torch):
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")  # reads ccl.yaml + topology
    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
    tensor.copy_(torch.from_numpy(init))
    dist.all_reduce(tensor, op="sum")

The IPCQ configuration is decided by the backend at init_process_group time: it loads ccl.yaml, picks the algorithm, and pushes IPCQ neighbor tables to every participating PE_IPCQ. The host code never has to know about IPCQ.

A bench runs one algorithm, chosen via ccl.yaml's defaults.algorithm. Switching algorithms is purely a ccl.yaml change — no host edits required.

Init flow (eager)

init_process_group(backend="ahbm") is called.
Backend loads ccl.yaml → resolves defaults.algorithm.
Pulls topology + buffer_kind + backpressure + slot config from algorithms[<algo>].
Immediately installs neighbor tables on every PE_IPCQ (sideband or fabric IpcqInitMsg).
Subsequent torch.launch(kernel_name, ...) calls behave normally — PE_IPCQ is already prepared whether the kernel is a CCL kernel or not.

D11. CCL config file (`ccl.yaml`)

IPCQ config and algorithm metadata live in a separate YAML file, following the same pattern as components.yaml and topology.yaml.

A single benchmark execution runs one algorithm (defaults.algorithm). Switching algorithms means editing defaults.algorithm only.

defaults:
  algorithm: ring_allreduce_tcm
  buffer_kind: tcm                # tcm | hbm | sram
  backpressure: sleep             # poll | sleep
  n_slots: 8
  slot_size: 4096
  vc_chunk_size: 256
  ipcq_credit_size_bytes: 16

algorithms:
  ring_allreduce_tcm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d             # builtin name or "custom"
    buffer_kind: tcm
    n_elem: 8                     # optional, per-algorithm tile width

  tree_allreduce_7:
    module: kernbench.ccl.algorithms.tree_allreduce
    topology: tree_binary
    buffer_kind: tcm
    world_size: 7                 # algorithm-level override
    n_elem: 16

  custom_mesh:
    module: kernbench.ccl.algorithms.custom_mesh
    topology: custom              # the module supplies its own neighbors()

world_size is not set in defaults. The backend resolves it via: algorithm-level override > defaults override > topology spec. The last fallback (sips × cubes_per_sip × pes_per_cube) mirrors real DDP where WORLD_SIZE comes from env vars rather than config files.
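The precedence order can be written down as a short lookup. This is a sketch under assumptions: `resolve_world_size` is a hypothetical helper, and the topology is represented as a plain dictionary with the three counts named above.

```python
def resolve_world_size(algo_cfg: dict, defaults: dict, topology: dict) -> int:
    """Algorithm-level override, then defaults override, then the topology product."""
    if "world_size" in algo_cfg:
        return algo_cfg["world_size"]
    if "world_size" in defaults:
        return defaults["world_size"]
    return topology["sips"] * topology["cubes_per_sip"] * topology["pes_per_cube"]


# Example topology counts (assumed values).
topology = {"sips": 2, "cubes_per_sip": 4, "pes_per_cube": 8}

from_topology = resolve_world_size({"topology": "ring_1d"}, {}, topology)
from_algorithm = resolve_world_size({"world_size": 7}, {"world_size": 16}, topology)
from_defaults = resolve_world_size({}, {"world_size": 16}, topology)
```

The `tree_allreduce_7` entry in the YAML above is an instance of the first branch: its `world_size: 7` wins regardless of the topology size.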

Algorithm module structure

Each algorithm module exports two hooks — kernel (required) and neighbors (optional) — plus a kernel_args helper that the backend uses to populate positional kernel arguments at all_reduce time:

# src/kernbench/ccl/algorithms/ring_allreduce.py

def kernel_args(world_size: int, n_elem: int) -> tuple:
    return (n_elem, world_size)


def kernel(t_ptr, n_elem, world_size, tl):
    """Required — the PE kernel.

    IPCQ is already installed by the backend before this is called.
    The kernel only uses the four-direction send / recv API.
    """
    ...


def neighbors(rank, world_size, neighbor_map):
    """Optional — override the builtin topology's neighbor map.

    Returns a new dict, the modified-in-place dict, or None to keep the
    builtin map.
    """
    return None

`neighbors` override patterns

Pattern A — tweak a builtin: drop a direction for some ranks, etc.
Pattern B — replace entirely: ignore neighbor_map and return a brand-new dict.
Pattern C — keep builtin: omit neighbors or return None.
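The three patterns might look like the following. These are illustrative hooks only: the function names are hypothetical, and the sketch assumes `neighbor_map` is a dictionary from direction to peer rank, which the text above does not spell out.

```python
def neighbors_pattern_a(rank, world_size, neighbor_map):
    """Pattern A: open the builtin ring into a chain by dropping wrap-around links."""
    trimmed = dict(neighbor_map)            # copy so the builtin map is untouched
    if rank == 0:
        trimmed.pop("W", None)
    if rank == world_size - 1:
        trimmed.pop("E", None)
    return trimmed


def neighbors_pattern_b(rank, world_size, neighbor_map):
    """Pattern B: ignore the builtin map and pair ranks (0,1), (2,3), ...

    Assumes an even world_size. Even ranks talk E, odd ranks talk W, so each
    link has a matching reverse direction (invariant I3).
    """
    if rank % 2 == 0:
        return {"E": rank + 1}
    return {"W": rank - 1}


def neighbors_pattern_c(rank, world_size, neighbor_map):
    """Pattern C: keep the builtin map."""
    return None


builtin_rank0 = {"E": 1, "W": 3}            # ring_1d entry for rank 0 of 4
a = neighbors_pattern_a(0, 4, builtin_rank0)
b = neighbors_pattern_b(1, 4, {"E": 2, "W": 0})
c = neighbors_pattern_c(0, 4, builtin_rank0)
```

Pattern A returns a copy here, but the hook contract above also allows modifying the dictionary in place.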

Builtin topologies

topology	direction set
`ring_1d`	E, W
`ring_1d_unidir`	E only
`mesh_2d`	N, S, E, W
`tree_binary`	parent, child_left, child_right
`none`	(empty) — algorithm must supply `neighbors()`

Adding a new algorithm

Write kernel and kernel_args in src/kernbench/ccl/algorithms/<algo>.py.
Add an entry in ccl.yaml's algorithms section.
(Optional) provide neighbors() for custom topology.
Set defaults.algorithm to the new algorithm.

The host bench (benches/ccl_allreduce.py) does not change.

D12. Message / token schema

The new message types added by this ADR. They live in src/kernbench/common/pe_commands.py and src/kernbench/runtime_api/kernel.py.

`IpcqInitMsg` (sideband, fan-out at init)

The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors MmuMapMsg (target_sips, target_cubes, target_pe, entries). Each IpcqInitEntry has direction, peer: IpcqEndpoint, my_rx_base_pa/va, n_slots, slot_size, plus a peer_credit_store field: a simpy.Store instance pre-wired so that the credit-emitting PE_IPCQ (the data receiver) can push IpcqCreditMetadata directly into the input queue of the peer PE_IPCQ (the data sender).

`IpcqSendCmd` (PE_CPU → PE_IPCQ)

Carries direction, source addr/space, nbytes, shape, dtype, and a handle id. data_op=True so it lands in the op_log.

`IpcqRecvCmd` (PE_CPU → PE_IPCQ)

Carries direction (or None for round-robin), recv_mode (return_slot / copy_to_dst), optional dst_addr/dst_space, shape, dtype, blocking flag.

`IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)

Per D9 piggyback: the token carries the data (src/dst/space/nbytes) plus the head metadata (sender_seq, src_sip/cube/pe, src_direction). PE_DMA picks the channel by token type (IpcqDmaToken → vc_comm, TileToken → vc_compute).

The receiver's PE_DMA, on token arrival, performs the I6 atomic sequence: write data into MemoryStore, then forward IpcqMetaArrival to the local PE_IPCQ.

`IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)

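Carries consumer_seq (= my_tail), source PE coords, and source direction. Travels through the dedicated SimPy Store channel rather than vc_comm. Latency = compute_path_latency_ns(path, credit_size_bytes) as defined in D9 (edge propagation + per-node overhead + credit_size_bytes / bottleneck_bw_on_path).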

There is no IpcqPtrUpdate event — head updates flow via D9 piggyback, tail updates via the D9 fast-path channel.

D13. Test strategy

Test plan:

T1. Unit tests (component-level)

PE_IPCQ (tests/test_pe_ipcq.py): send without backpressure immediately forwards a token; full peer slot triggers backpressure (poll / sleep modes); recv waits, wakes on IpcqMetaArrival; round-robin recv weak fairness; bad direction → IpcqInvalidDirection.
PE_DMA virtual channels (tests/test_pe_dma_vc.py): vc_compute / vc_comm independent progress, chunk interleave, BW split.
Builtin topology (tests/test_ccl_topologies.py): ring_1d / mesh_2d / tree_binary correctness, mesh_2d non-square → ValueError, custom resolver returns the module's neighbors.

T2. Integration tests (E2E send/recv)

tests/test_ipcq_e2e.py: 2-rank ring, 4-rank ring (bidirectional no-deadlock), 4×4 mesh.
CCL kernel + 2-pass (tests/test_ipcq_2pass.py): greenlet mode records ipcq ops in op_log; DataExecutor produces correct out.data.

T3. Backend init (`tests/test_ccl_backend_ipcq.py`)

ccl.yaml load, builtin topology → IpcqInitMsg fan-out, endpoint PA consistency, per-buffer_kind allocation.

T4. Regression

All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for non-CCL benches.

T5. Performance / overhead

Single send/recv pair latency = (DMA latency) + (IPCQ overhead). Should be close to a regular PE_DMA write of the same nbytes (IPCQ overhead < 100 ns).

D14. Invariants and failure modes

Invariants

I1. Slot lifecycle exactly-once: one send → exactly one recv.
I2. Pointer monotonicity: my_head / my_tail are monotonically non-decreasing; sender_seq is strictly increasing.
I3. Endpoint consistency: if rank A's direction=E peer is rank B, then rank B's reverse-direction peer must be rank A. Verified at init.
I4. buffer_kind consistency: all PEs in a process group share the same buffer_kind (no mixed mode in the first cut).
I5. op_log ordering: send → DMA complete → recv possible. The t_start order in op_log respects this causality.
I6. Atomic data + metadata visibility (MUST): at the receiver side, the data write (MemoryStore.write) and the metadata forward (which drives the peer_head_cache update) must execute in the same SimPy step. No yield is allowed between the two operations in PE_DMA's vc_comm handler. Code review must reject any inserted yield (or yield from): it would create a race in which head_cache becomes visible before or after the data.
I7. MemoryStore slot existence ↔ pointer: as a consequence of I6, the step in which peer_head_cache > my_tail becomes true is the same step in which the slot data is observable.
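The I3 check lends itself to a short init-time validator. The sketch below is an assumption-laden illustration: `check_endpoint_consistency` is a hypothetical helper, the table layout (rank to direction to peer rank) is assumed, and only the N/S/E/W direction set is covered.

```python
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def check_endpoint_consistency(table: dict) -> list:
    """Return (rank, direction, peer) triples that violate I3; empty means consistent."""
    violations = []
    for rank, dirs in table.items():
        for direction, peer in dirs.items():
            # The peer's reverse direction must point back at this rank.
            back = table.get(peer, {}).get(OPPOSITE[direction])
            if back != rank:
                violations.append((rank, direction, peer))
    return violations


good = {r: {"E": (r + 1) % 4, "W": (r - 1) % 4} for r in range(4)}
ok = check_endpoint_consistency(good)

broken = {r: dict(dirs) for r, dirs in good.items()}
broken[2]["W"] = 0                      # rank 2 now claims rank 0 as its W peer
bad = check_endpoint_consistency(broken)
```

A single wrong entry produces two violations, one seen from each end of the mismatched links, which is useful when embedding the result in an init-time error message.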

Failure modes (runtime errors)

F1. Bad direction: tl.send(dir="X") for an uninstalled direction → IpcqInvalidDirection, simulation aborts.
F2. Type mismatch: dtype/shape/nbytes disagreement between matched send and recv. Not validated by default; opt-in strict mode catches it (strict_validation: true in a PE_IPCQ node's attrs).
F3. Deadlock detection (timeout-based): the simulator empties its schedule while a send/recv is still pending → engine raises IpcqDeadlock and embeds a pointer dump.
F4. Backend init failure: missing defaults.algorithm, missing algorithms[name], module import failure, topology validation failure (I3, I4). All are raised at init_process_group time.
F5. Slot full + infinite backpressure: the peer never recvs. Surfaces as F3 timeout.

Diagnostics

CCL trace: KERNBENCH_CCL_TRACE=1 logs each send/recv as (rank, t, dir, nbytes).
Pointer dump: kernbench.ccl.diagnostics.pointer_dump(engine) prints every PE_IPCQ ring buffer's my_head, my_tail, peer_head_cache, peer_tail_cache.
Deadlock dump: on hang the engine includes the pointer dump in the IpcqDeadlock exception message.

D15. Algorithm-author cheat sheet

Full step-by-step lives in docs/onboarding/ccl-author-guide.en.md. The shortest version:

Things you touch	Things you don't
`src/kernbench/ccl/algorithms/<your_algo>.py` (`kernel`, `kernel_args`, optional `neighbors`)	`benches/ccl_allreduce.py` host code
One entry in `ccl.yaml` + optionally `defaults.algorithm`	`src/kernbench/ccl/` framework
(Optional) `tests/test_<your_algo>.py` mock test	PE_IPCQ component, AhbmCCLBackend

5-step flow: write the kernel → register in ccl.yaml → optional neighbors override → optional mock unit test → SimPy validation via kernbench run --bench ccl_allreduce --verify-data.

Common mistakes: using a direction that wasn't installed, sends without matching recvs (deadlock), dtype/shape disagreement, assuming fairness from tl.recv() round-robin, confusing tl.num_programs(axis) with the CCL group size.

Non-goals

Host collective: a model where dist.all_reduce itself moves data on the host side is out of scope. This ADR only covers communication that happens inside the PE kernel.
All-reduce algorithms: ring / tree / etc. live in algorithm modules and can be added without amending this ADR.
Reliability / error handling: link faults, send/recv failure recovery, etc. are out of scope.
NoC arbiter precision: dynamic VC contention is left for a future ADR (see D8).

Open questions

VC arbitration accuracy — the first cut uses deterministic chunk interleave + weighted round-robin; heavy contention may report optimistic latency. A NoC arbiter component can be added later.
Credit return BW model — the fast path is currently outside the fabric BW contention model. Can be modeled as a separate link or switched to piggyback (credit_return_mode: piggyback).
Ring buffer slot allocation metadata — whether the host pushes IPCQ buffer metadata via sideband or via a fabric message similar to MmuMapMsg is open.
VC BW split default — 50/50 vs. weighted (e.g. 80/20). Exposed in ccl.yaml; default value TBD.
Direction count — 4 (N/S/E/W) is fixed in the first cut; 6 (with Up/Down for 3D) or N (variable) is future work.
Multi-tile aggregation primitives — whether tl.recv_all or similar is needed for fan-in.
Round-robin recv fairness — current weak fairness can starve; strict fairness counter is future work.
Deadlock detection precision — currently timeout-based; a realtime wait-for graph would enable deterministic detection.

Consequences

Positive

PE-to-PE direct communication enables CCL kernels to be written.
Host stays minimal (just launch), synchronization happens inside the PE → strong compute / comm overlap.
VCs eliminate HoL blocking → collective latency is not blocked by compute traffic.
Buffer placement and backpressure mode are init-time parameters → easy to benchmark.
Four-direction logical neighbors → host is free to map ring/mesh/tree algorithms.

Negative

One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
IPCQ memory cost = 8 rings × slot_size × n_slots per PE.
VC arbitration is a first-order approximation; heavy contention scenarios may report slightly optimistic latency vs real HW (D8).
Chunk-level interleave makes PE_DMA implementation more complex.
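As a worked instance of the memory-cost formula above, the sketch below plugs in the ccl.yaml defaults from D11 (slot_size 4096, n_slots 8). The helper name is hypothetical; the formula is the one stated in the list.

```python
def ipcq_ring_bytes_per_pe(slot_size: int, n_slots: int, n_directions: int = 4) -> int:
    """8 rings (4 directions x {tx, rx}) x slot_size x n_slots."""
    rings = n_directions * 2
    return rings * slot_size * n_slots


default_bytes = ipcq_ring_bytes_per_pe(slot_size=4096, n_slots=8)
default_kib = default_bytes // 1024
```

With the defaults this comes to 262144 bytes (256 KiB) per PE, which helps explain why the buffer_kind="tcm" placement in D6 competes with PE-internal resources.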
