Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after each
  engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to prevent
  stale data from corrupting the MemoryStore snapshot.

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier (only real PyTorch names).
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases (ring×3
  buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 19:36:59 -07:00
parent ff2c677a9c
commit 998cc85762
60 changed files with 9196 additions and 80 deletions
@@ -0,0 +1,9 @@
+"""CCL (Collective Communication Library) framework for kernbench (ADR-0023).
+
+This package provides:
+    - topologies: builtin neighbor topology generators (ring/mesh/tree)
+    - helpers:    utilities for algorithm authors (chunked, ring_step, ...)
+    - testing:    mock CCL runtime for fast unit tests of algorithm kernels
+
+See docs/adr/ADR-0023-ipcq-pe-collective.md and docs/ccl-author-guide.md.
+"""
@@ -0,0 +1,29 @@
+"""Hello-world CCL kernel for the docs/ccl-author-guide.md walkthrough.
+
+Each PE sends its tile to the E neighbor and receives one tile from W,
+then stores the received tile back into its own HBM slice. The simplest
+possible demonstration of ``tl.send`` / ``tl.recv``.
+"""
+from __future__ import annotations
+
+
+def kernel_args(world_size: int, n_elem: int) -> tuple:
+    """Return the positional kernel arguments for the ahbm backend."""
+    return (n_elem,)
+
+
+def kernel(t_ptr, n_elem, tl):
+    local_pe = tl.program_id(axis=0)
+    cube_id = tl.program_id(axis=1)
+    pes_per_cube = tl.num_programs(axis=0)
+    rank = cube_id * pes_per_cube + local_pe
+    nbytes = n_elem * 2
+    pe_addr = t_ptr + rank * nbytes
+
+    # Send our local HBM tile to the E neighbor.
+    src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
+    tl.send(dir="E", src=src)
+
+    # Receive a tile from W and store it into our slice (overwrite).
+    recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
+    tl.store(pe_addr, recv)
@@ -0,0 +1,73 @@
+"""2D-mesh all-reduce kernel (ADR-0023).
+
+Two-phase reduce on a square mesh of side ``S`` (world_size = S*S):
+  1. Row reduce: ring all-reduce along E/W within each row.
+  2. Column reduce: ring all-reduce along N/S within each column.
+
+After both phases, every rank holds the global sum.
+
+Uses TensorHandle math (PE_MATH) for accumulation. Op_log captures the
+data flow so Phase 2 produces correct final HBM contents. Math/recv
+handles are passed directly to the next send, avoiding store→reload
+which doesn't propagate correctly with timing-only Phase 1 math.
+"""
+from __future__ import annotations
+
+import math
+
+
+def kernel_args(world_size: int, n_elem: int) -> tuple:
+    """Return the positional kernel arguments for the ahbm backend.
+
+    Mesh all-reduce requires ``world_size`` to be a perfect square —
+    the mesh side length is ``sqrt(world_size)``.
+    """
+    side = int(round(math.sqrt(world_size)))
+    if side * side != world_size:
+        raise ValueError(
+            f"mesh_allreduce requires a square world_size; got {world_size}"
+        )
+    return (n_elem, side)
+
+
+def kernel(t_ptr, n_elem, side, tl):
+    """All-reduce on a square mesh.
+
+    Args:
+        t_ptr: HBM base address (column-sharded VA shared across ranks)
+        n_elem: number of f16 elements per tile
+        side: mesh side length (sqrt(world_size))
+        tl: TLContext (ADR-0022).
+    """
+    local_pe = tl.program_id(axis=0)
+    cube_id = tl.program_id(axis=1)
+    pes_per_cube = tl.num_programs(axis=0)
+    rank = cube_id * pes_per_cube + local_pe
+    nbytes = n_elem * 2
+
+    pe_addr = t_ptr + rank * nbytes
+    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
+    current = acc
+
+    # ── Phase 1: row ring (E direction) ──
+    # Ring forwards each received tile (not the cumulative acc) so every
+    # tile passes through every rank exactly once.
+    for _ in range(side - 1):
+        tl.send(dir="E", src=current)
+        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
+        acc = acc + recv
+        current = recv
+
+    # Phase 2 column ring starts from the row-phase accumulator. We do NOT
+    # store/reload here — the math handle's scratch addr is the source for
+    # the first column send and Phase 2 ipcq_copy replays from there.
+    current = acc
+
+    # ── Phase 2: column ring (S direction) ──
+    for _ in range(side - 1):
+        tl.send(dir="S", src=current)
+        recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
+        acc = acc + recv
+        current = recv
+
+    tl.store(pe_addr, acc)
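Reviewer note: the two-phase structure of the mesh kernel can be checked on the host with plain NumPy. The sketch below is not part of the commit. It assumes a row-major rank layout (`rank = row * side + col`) and an idealized lock-step ring in which every send is delivered before the next round; `ring_pass` and `simulate_mesh_allreduce` are hypothetical names used only for illustration.

```python
import numpy as np


def ring_pass(values):
    """Forwarding ring all-reduce over a list of per-rank arrays (lock-step model)."""
    n = len(values)
    acc = [v.copy() for v in values]      # per-rank accumulator
    current = [v.copy() for v in values]  # tile each rank sends next
    for _ in range(n - 1):
        # Rank i receives whatever rank (i - 1) % n sent this round.
        received = [current[(i - 1) % n] for i in range(n)]
        acc = [a + x for a, x in zip(acc, received)]
        current = received  # forward the received tile, not the accumulator
    return acc


def simulate_mesh_allreduce(tiles, side):
    # Assumption: row-major layout, rank = row * side + col.
    grid = [[tiles[r * side + c] for c in range(side)] for r in range(side)]
    # Phase 1: ring along each row.
    grid = [ring_pass(row) for row in grid]
    # Phase 2: ring along each column, starting from the row sums.
    cols = [ring_pass([grid[r][c] for r in range(side)]) for c in range(side)]
    # Flatten back to rank order.
    return [cols[c][r] for r in range(side) for c in range(side)]


tiles = [np.full(4, r + 1, dtype=np.float16) for r in range(4)]
result = simulate_mesh_allreduce(tiles, side=2)  # every rank ends with 10.0 per element
```

The column phase starts from the row sums, so each rank needs `2 * (side - 1)` send/recv rounds instead of the `side * side - 1` rounds a single ring over all ranks would take.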
@@ -0,0 +1,80 @@
+"""Ring all-reduce kernel for IPCQ-based PE collective (ADR-0023).
+
+Algorithm: 1D ring of N PEs, each PE starts with one tile of data.
+After ``world_size - 1`` rounds, every PE's accumulator holds the sum
+of all PE tiles.
+
+Strategy
+--------
+Each PE starts with its own tile in HBM. The kernel:
+1. Loads the local tile into a TensorHandle (the accumulator).
+2. In each of ``world_size - 1`` rounds:
+   - Sends the current tile to the E neighbor: its own tile in the
+     first round, then the tile received in the previous round.
+   - Receives a tile from the W neighbor — the recv handle points
+     into the per-direction TCM slot.
+   - Adds the received tile to the accumulator using the TensorHandle
+     operator overload, which dispatches to ``MathCmd`` (PE_MATH).
+3. Stores the final accumulator back to HBM via tl.store. The store is
+   recorded in op_log with both src and dst, so Phase 2 will copy the
+   replayed math result from PE-local scratch into HBM.
+
+ADR-0020 D3 split: Phase 1 simulates timing only — math results are
+not yet computed, so the accumulator data flowing through Phase 1 may
+be stale. Phase 2's DataExecutor replays math + IPCQ copies + dma_write
+in stable t_start order, producing correct final HBM contents.
+"""
+from __future__ import annotations
+
+
+def kernel_args(world_size: int, n_elem: int) -> tuple:
+    """Return the positional kernel arguments for the ahbm backend.
+
+    Ring all-reduce takes (n_elem, world_size) after the tensor pointer.
+    """
+    return (n_elem, world_size)
+
+
+def kernel(t_ptr, n_elem, world_size, tl):
+    """Ring all-reduce.
+
+    Args:
+        t_ptr: HBM base address of the column-sharded tensor — all PEs
+               share this base. The per-PE slice lives at
+               ``t_ptr + global_rank * n_elem * 2``.
+        n_elem: number of f16 elements per tile.
+        world_size: total number of participating ranks (passed by host).
+        tl: TLContext (auto-injected, ADR-0022). The kernel derives the
+            global rank from ``program_id(axis=0)`` (local PE) and
+            ``program_id(axis=1)`` (cube id):
+
+                rank = cube_id * pes_per_cube + local_pe
+    """
+    local_pe = tl.program_id(axis=0)
+    cube_id = tl.program_id(axis=1)
+    pes_per_cube = tl.num_programs(axis=0)
+    rank = cube_id * pes_per_cube + local_pe
+    nbytes = n_elem * 2  # f16
+
+    # Each PE reads from its own slice of the shared base address
+    pe_addr = t_ptr + rank * nbytes
+
+    # Load the local tile — handle points at HBM[pe_addr].
+    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
+    # The ring forwards each received tile to the next neighbor (NOT the
+    # cumulative accumulator), so every rank's tile passes through every
+    # rank exactly once. The accumulator sums the new arrival each round.
+    current = acc
+
+    for _step in range(world_size - 1):
+        tl.send(dir="E", src=current)
+        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
+        # TensorHandle add → MathCmd → PE_MATH (timing in Phase 1, real
+        # numpy in Phase 2 via DataExecutor). The result handle lives at
+        # an auto-allocated PE-local scratch addr.
+        acc = acc + recv
+        current = recv  # forward W's tile to E next round
+
+    # Final result back to this PE's HBM slice. Op_log captures the
+    # source (scratch addr) and dst (HBM slice) so Phase 2 copies the
+    # accumulated value into HBM for verification.
+    tl.store(pe_addr, acc)
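Reviewer note: the forward-the-received-tile scheme used by this kernel can be sanity-checked without the simulator. This is a minimal NumPy sketch, not part of the commit. It assumes an idealized lock-step ring where rank `r` receives from rank `r - 1` each round; `simulate_ring_allreduce` is a hypothetical helper name.

```python
import numpy as np


def simulate_ring_allreduce(tiles):
    """Lock-step model: each rank sends `current` to E (rank + 1), receives
    from W (rank - 1), adds the arrival, and forwards it next round."""
    n = len(tiles)
    acc = [t.copy() for t in tiles]      # per-rank accumulator
    current = [t.copy() for t in tiles]  # tile each rank sends next
    for _ in range(n - 1):
        # Rank r receives whatever rank (r - 1) % n sent this round.
        received = [current[(r - 1) % n] for r in range(n)]
        acc = [a + x for a, x in zip(acc, received)]
        current = received  # forward W's tile, not the cumulative sum
    return acc


tiles = [np.full(4, r + 1, dtype=np.float16) for r in range(4)]
result = simulate_ring_allreduce(tiles)  # every rank ends with 10.0 per element
```

Forwarding the accumulator instead of the received tile would count some tiles more than once, which is why the kernel keeps `current` and `acc` separate.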
@@ -0,0 +1,80 @@
+"""Tree all-reduce kernel for IPCQ-based PE collective (ADR-0023).
+
+Two-phase binary tree all-reduce:
+
+  Phase 1 (reduce up):
+    - leaf nodes send their value to ``parent``
+    - internal nodes recv from each child, sum, then send to ``parent``
+    - root accumulates child contributions; final acc holds global sum
+
+  Phase 2 (broadcast down):
+    - root sends acc to ``child_left`` and ``child_right`` (if present)
+    - internal nodes recv from ``parent``, then forward to children
+    - all ranks store the final acc to HBM
+
+Uses TensorHandle math (PE_MATH) for accumulation. Op_log captures the
+data flow so Phase 2 produces correct final HBM contents. The kernel
+deliberately avoids the store→reload→send pattern: math/recv handles
+are passed directly to the next send so PE_DMA snapshots a deterministic
+source addr that Phase 2 can replay.
+"""
+from __future__ import annotations
+
+
+def kernel_args(world_size: int, n_elem: int) -> tuple:
+    """Return the positional kernel arguments for the ahbm backend."""
+    return (n_elem, world_size)
+
+
+def kernel(t_ptr, n_elem, world_size, tl):
+    """Tree all-reduce.
+
+    Args:
+        t_ptr: HBM base address.
+        n_elem: number of f16 elements per tile.
+        world_size: total number of participating ranks (passed by host).
+        tl: TLContext (ADR-0022). Global rank from program_id(0/1).
+    """
+    local_pe = tl.program_id(axis=0)
+    cube_id = tl.program_id(axis=1)
+    pes_per_cube = tl.num_programs(axis=0)
+    rank = cube_id * pes_per_cube + local_pe
+    nbytes = n_elem * 2
+
+    pe_addr = t_ptr + rank * nbytes
+    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
+
+    # Compute children/parent existence (matches tree_binary topology generator)
+    has_parent = rank > 0
+    left = 2 * rank + 1
+    right = 2 * rank + 2
+    has_left = left < world_size
+    has_right = right < world_size
+
+    # ── Phase 1: reduce up ──
+    if has_left:
+        recv = tl.recv(dir="child_left", shape=(n_elem,), dtype="f16")
+        acc = acc + recv
+    if has_right:
+        recv = tl.recv(dir="child_right", shape=(n_elem,), dtype="f16")
+        acc = acc + recv
+
+    if has_parent:
+        # Send the math/load handle directly — its addr is either the
+        # original HBM tile (leaf) or the PE-local scratch where the
+        # accumulator lives. Phase 2 ipcq_copy replays from the same addr.
+        tl.send(dir="parent", src=acc)
+
+    # ── Phase 2: broadcast down ──
+    if has_parent:
+        # Replace acc with the value broadcast from the parent (the global
+        # sum). The recv handle points at the parent-direction TCM slot.
+        acc = tl.recv(dir="parent", shape=(n_elem,), dtype="f16")
+
+    if has_left:
+        tl.send(dir="child_left", src=acc)
+    if has_right:
+        tl.send(dir="child_right", src=acc)
+
+    # Final store to HBM for the bench's verification path.
+    tl.store(pe_addr, acc)
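Reviewer note: the reduce-up / broadcast-down order can be modeled in a few lines of plain Python. This sketch is not part of the commit. It uses the same heap-style indexing as the kernel (children at `2 * rank + 1` and `2 * rank + 2`, root at rank 0) and scalar values instead of tiles; `simulate_tree_allreduce` is a hypothetical name.

```python
def simulate_tree_allreduce(values):
    """Model a binary-tree all-reduce over scalar per-rank values."""
    n = len(values)
    acc = list(values)
    # Reduce up: visit ranks from highest to lowest so that every child has
    # already absorbed its own subtree before it is added into its parent.
    for rank in range(n - 1, 0, -1):
        parent = (rank - 1) // 2
        acc[parent] += acc[rank]
    # Broadcast down: a parent always has a lower rank than its children,
    # so ascending order guarantees the parent already holds the global sum.
    for rank in range(1, n):
        acc[rank] = acc[(rank - 1) // 2]
    return acc


result = simulate_tree_allreduce([1, 2, 3, 4, 5, 6, 7])  # 7 ranks, as in the "tree 7" test case
```

The sketch adds the right child before the left one, which does not matter for integers; the kernel itself receives `child_left` first.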
@@ -0,0 +1,127 @@
+"""CCL diagnostics: trace + pointer dump + deadlock (ADR-0023 D14).
+
+Trace
+-----
+Set ``KERNBENCH_CCL_TRACE=1`` (or any truthy value) to enable per-event
+logging of CCL send/recv to stdout. Off by default.
+
+Pointer dump
+------------
+``pointer_dump(engine)`` returns a multi-line string showing every PE_IPCQ's
+ring buffer state (my_head, my_tail, peer_head_cache, peer_tail_cache).
+Useful for diagnosing hangs.
+
+Deadlock
+--------
+``IpcqDeadlock`` is raised by the engine when SimPy's schedule empties
+while a request is still pending — typical of unmatched send/recv pairs.
+The exception message includes the pointer dump.
+"""
+from __future__ import annotations
+
+import os
+from typing import Any
+
+
+class IpcqDeadlock(RuntimeError):
+    """Raised when the simulation cannot make further progress while a
+    CCL request is still pending (D14 F3)."""
+
+
+# ── Trace toggle ─────────────────────────────────────────────────────
+
+
+_TRACE_ENABLED: bool = False
+
+
+def reload_trace_setting() -> None:
+    """Re-read the ``KERNBENCH_CCL_TRACE`` env var."""
+    global _TRACE_ENABLED
+    val = os.environ.get("KERNBENCH_CCL_TRACE", "")
+    _TRACE_ENABLED = val.strip().lower() in {"1", "true", "yes", "on"}
+
+
+def trace_enabled() -> bool:
+    return _TRACE_ENABLED
+
+
+# Initialise once at import time
+reload_trace_setting()
+
+
+# ── Trace event functions ────────────────────────────────────────────
+
+
+def log_send(
+    t_ns: float,
+    sender: str,
+    direction: str,
+    nbytes: int,
+    sender_seq: int,
+) -> None:
+    if not _TRACE_ENABLED:
+        return
+    print(
+        f"[ccl t={t_ns:.1f} send] {sender} dir={direction} nbytes={nbytes} seq={sender_seq}",
+        flush=True,
+    )
+
+
+def log_recv(
+    t_ns: float,
+    receiver: str,
+    direction: str,
+    nbytes: int,
+) -> None:
+    if not _TRACE_ENABLED:
+        return
+    print(
+        f"[ccl t={t_ns:.1f} recv] {receiver} dir={direction} nbytes={nbytes}",
+        flush=True,
+    )
+
+
+def log_credit_return(
+    t_ns: float,
+    sender: str,
+    direction: str,
+    consumer_seq: int,
+) -> None:
+    if not _TRACE_ENABLED:
+        return
+    print(
+        f"[ccl t={t_ns:.1f} credit] {sender} dir={direction} seq={consumer_seq}",
+        flush=True,
+    )
+
+
+# ── Pointer dump ─────────────────────────────────────────────────────
+
+
+def pointer_dump(engine: Any) -> str:
+    """Return a multi-line string of every PE_IPCQ's pointer state."""
+    lines: list[str] = []
+    components = getattr(engine, "_components", {})
+    for node_id in sorted(components):
+        if not node_id.endswith(".pe_ipcq"):
+            continue
+        comp = components[node_id]
+        qps = getattr(comp, "queue_pairs", {})
+        if not qps:
+            continue
+        lines.append(node_id)
+        for d in sorted(qps):
+            qp = qps[d]
+            peer = qp["peer"]
+            lines.append(
+                f"  {d}: peer=sip{peer.sip}.cube{peer.cube}.pe{peer.pe}  "
+                f"my_head={qp['my_head']} my_tail={qp['my_tail']}  "
+                f"peer_head_cache={qp['peer_head_cache']} "
+                f"peer_tail_cache={qp['peer_tail_cache']}"
+            )
+    return "\n".join(lines)
+
+
+def print_pointer_dump(engine: Any) -> None:
+    """Convenience: print pointer_dump(engine) to stdout."""
+    print(pointer_dump(engine), flush=True)
@@ -0,0 +1,118 @@
+"""Helpers for CCL algorithm authors (ADR-0023 D15).
+
+These are pure utility functions usable from any kernel module:
+
+    from kernbench.ccl.helpers import chunked, ring_step, tree_step
+
+They keep algorithm code short and free of off-by-one bugs.
+"""
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Any
+
+
+_DTYPE_BYTES = {
+    "f16": 2, "fp16": 2, "float16": 2, "bf16": 2,
+    "f32": 4, "fp32": 4, "float32": 4,
+    "i8": 1, "int8": 1,
+    "i16": 2, "int16": 2,
+    "i32": 4, "int32": 4,
+}
+
+
+def _itemsize(dtype: str) -> int:
+    if dtype not in _DTYPE_BYTES:
+        raise ValueError(f"Unsupported dtype: {dtype}")
+    return _DTYPE_BYTES[dtype]
+
+
+# ── chunked ──────────────────────────────────────────────────────────
+
+
+@dataclass(frozen=True)
+class Chunk:
+    """One chunk of a tensor used by collective algorithms."""
+
+    addr: int
+    n_elem: int
+    nbytes: int
+
+
+def chunked(
+    base_addr: int,
+    n_chunks: int,
+    n_elem: int,
+    dtype: str = "f16",
+) -> list[Chunk]:
+    """Slice a 1D buffer into ``n_chunks`` equal Chunks.
+
+    Args:
+        base_addr: starting address of the buffer.
+        n_chunks: number of equal chunks to produce.
+        n_elem: total number of elements (must be divisible by n_chunks).
+        dtype: element type for byte-size calculation.
+
+    Returns:
+        List of ``Chunk`` objects whose addresses are consecutive.
+
+    Raises:
+        ValueError: if n_elem is not divisible by n_chunks.
+    """
+    if n_elem % n_chunks != 0:
+        raise ValueError(
+            f"chunked: n_elem ({n_elem}) not divisible by n_chunks ({n_chunks})"
+        )
+    per_chunk_elem = n_elem // n_chunks
+    isize = _itemsize(dtype)
+    per_chunk_bytes = per_chunk_elem * isize
+    return [
+        Chunk(
+            addr=base_addr + i * per_chunk_bytes,
+            n_elem=per_chunk_elem,
+            nbytes=per_chunk_bytes,
+        )
+        for i in range(n_chunks)
+    ]
+
+
+# ── ring_step ────────────────────────────────────────────────────────
+
+
+def ring_step(rank: int, step: int, world_size: int) -> tuple[int, int]:
+    """Return ``(send_chunk_idx, recv_chunk_idx)`` for a ring algorithm step.
+
+    Standard reduce-scatter / all-gather ring schedule:
+        at step s, rank r sends chunk (r - s) and receives chunk (r - s - 1)
+        modulo world_size.
+
+    Used by ring all-reduce kernels:
+
+        for step in range(world_size - 1):
+            send_idx, recv_idx = ring_step(rank, step, world_size)
+            tl.send(dir="E", src=chunks[send_idx])
+            chunks[recv_idx] += tl.recv(dir="W").data
+    """
+    send_idx = (rank - step) % world_size
+    recv_idx = (rank - step - 1) % world_size
+    return send_idx, recv_idx
+
+
+# ── tree_step ────────────────────────────────────────────────────────
+
+
+def tree_step(rank: int, world_size: int) -> dict[str, Any]:
+    """Return parent/children for binary tree rooted at rank 0.
+
+    Returns:
+        ``{"parent": int|None, "children": list[int]}``
+    """
+    parent = (rank - 1) // 2 if rank > 0 else None
+    children: list[int] = []
+    left = 2 * rank + 1
+    right = 2 * rank + 2
+    if left < world_size:
+        children.append(left)
+    if right < world_size:
+        children.append(right)
+    return {"parent": parent, "children": children}
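Reviewer note: the `ring_step` schedule can be exercised on plain numbers to see which chunk ends up fully reduced on which rank. This sketch is not part of the commit. It repeats the `ring_step` arithmetic so it is self-contained, and assumes a lock-step ring where rank `r` receives from rank `r - 1`; `simulate_reduce_scatter` is a hypothetical name.

```python
def ring_step(rank, step, world_size):
    # Same arithmetic as the helper in this diff, repeated for self-containment.
    return (rank - step) % world_size, (rank - step - 1) % world_size


def simulate_reduce_scatter(data):
    """data[r][c] is rank r's contribution to chunk c (plain numbers here)."""
    n = len(data)
    chunks = [list(row) for row in data]
    for step in range(n - 1):
        # Snapshot what every rank sends before any receive is applied.
        sent = [chunks[r][ring_step(r, step, n)[0]] for r in range(n)]
        for r in range(n):
            _, recv_idx = ring_step(r, step, n)
            chunks[r][recv_idx] += sent[(r - 1) % n]  # arrives from the W neighbor
    return chunks


data = [[10 * r + c for c in range(4)] for r in range(4)]
out = simulate_reduce_scatter(data)
# After world_size - 1 steps, rank r holds the fully reduced chunk (r + 1) % world_size.
```

The send index of rank `r - 1` equals the receive index of rank `r` at every step, which is what lets the receiver accumulate into the matching chunk.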
@@ -0,0 +1,266 @@
+"""IPCQ install plan for AhbmCCLBackend (ADR-0023 D10/D11/D12).
+
+Given a ccl.yaml config, the topology, and the engine, this module:
+
+1. Loads ccl.yaml and resolves the chosen algorithm.
+2. Maps each rank to a (sip, cube, pe) PE address using a linear scheme.
+3. Allocates per-rank IPCQ ring buffer base addresses (synthetic but
+   unique-per-PE; see notes below).
+4. Builds neighbor tables via the algorithm's ``topology`` field plus the
+   optional ``neighbors()`` override hook from the algorithm module.
+5. Wires bidirectional credit-return SimPy Stores between every (PE, peer)
+   pair.
+6. Installs each PE_IPCQ component's neighbor table directly via its
+   ``_install_neighbors`` sideband call (equivalent to fan-out IpcqInitMsg
+   without going through fabric).
+
+Address scheme
+--------------
+For the first implementation we use a synthetic address scheme that
+guarantees uniqueness per (sip, cube, pe, direction) without going
+through ``PEMemAllocator``. The address is encoded as:
+
+    base = IPCQ_BASE | (sip << 40) | (cube << 32) | (pe << 24)
+    rx_base[direction_idx] = base + direction_idx * (n_slots * slot_size)
+
+The ``buffer_kind`` (tcm/hbm/sram) selects the *MemoryStore space* into
+which data is written. Within a space, addresses are unique per PE so
+the existing MemoryStore (``{space: {addr: ndarray}}``) handles them
+naturally.
+
+This bypasses the topology's address resolver / PhysAddr encoding and
+treats IPCQ buffers as a separate, parallel address namespace. Real PA
+encoding can be plugged in later without changing the rest of the design.
+"""
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any
+
+import simpy
+import yaml
+
+from kernbench.ccl.topologies import resolve_topology
+from kernbench.common.ipcq_types import (
+    IpcqEndpoint,
+    IpcqInitEntry,
+)
+from kernbench.runtime_api.kernel import IpcqInitMsg
+
+
+# IPCQ synthetic address space top bit
+_IPCQ_BASE = 1 << 60
+
+
+def _ipcq_base_for_pe(sip: int, cube: int, pe: int) -> int:
+    return _IPCQ_BASE | (sip << 40) | (cube << 32) | (pe << 24)
+
+
+# ── ccl.yaml loading ─────────────────────────────────────────────────
+
+
+def load_ccl_config(path: str | Path | None = None) -> dict:
+    """Load and validate ccl.yaml. Searches cwd and project root."""
+    if path is None:
+        candidates = [
+            Path.cwd() / "ccl.yaml",
+            Path(__file__).resolve().parents[3] / "ccl.yaml",
+        ]
+        for p in candidates:
+            if p.exists():
+                path = p
+                break
+    if path is None:
+        raise FileNotFoundError(
+            "ccl.yaml not found. Place it at project root or cwd."
+        )
+    with open(path) as f:
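+        # An empty file parses to None; fall back to {} so the section
+        # checks below raise a clear ValueError instead of a TypeError.
+        cfg = yaml.safe_load(f) or {}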
+    if "defaults" not in cfg:
+        raise ValueError("ccl.yaml missing 'defaults' section")
+    if "algorithms" not in cfg:
+        raise ValueError("ccl.yaml missing 'algorithms' section")
+    return cfg
+
+
+def resolve_algorithm_config(cfg: dict, name: str | None = None) -> dict:
+    """Merge defaults with the chosen algorithm's overrides.
+
+    Returns a flat dict with at minimum: module, topology, buffer_kind,
+    backpressure, n_slots, slot_size, ipcq_credit_size_bytes, world_size.
+    """
+    defaults = dict(cfg.get("defaults", {}))
+    algo_name = name or defaults.get("algorithm")
+    if algo_name is None:
+        raise ValueError("ccl.yaml: defaults.algorithm not set")
+    algos = cfg.get("algorithms", {})
+    if algo_name not in algos:
+        raise ValueError(
+            f"ccl.yaml: algorithm '{algo_name}' not in algorithms section"
+        )
+    merged = defaults.copy()
+    merged.update(algos[algo_name])
+    merged["algorithm"] = algo_name
+    return merged
+
+
+# ── rank → PE mapping ────────────────────────────────────────────────
+
+
+def linear_rank_to_pe(rank: int, spec: dict) -> tuple[int, int, int]:
+    """Map a rank to (sip, cube, pe) using linear topology order."""
+    sips = spec["system"]["sips"]["count"]
+    cubes_per_sip = spec["sip"]["cube_mesh"]["w"] * spec["sip"]["cube_mesh"]["h"]
+    pe_layout = spec["cube"]["pe_layout"]
+    pes_per_cube = pe_layout["pe_per_corner"] * len(pe_layout["corners"])
+
+    pes_per_sip = cubes_per_sip * pes_per_cube
+    if rank >= sips * pes_per_sip:
+        raise ValueError(
+            f"rank {rank} exceeds total PE count {sips * pes_per_sip}"
+        )
+    sip = rank // pes_per_sip
+    rem = rank % pes_per_sip
+    cube = rem // pes_per_cube
+    pe = rem % pes_per_cube
+    return sip, cube, pe
+
+
+# ── Install plan ─────────────────────────────────────────────────────
+
+
+def install_ipcq(
+    engine: Any,
+    spec: dict,
+    cfg: dict,
+    algo_module: Any | None = None,
+    rank_to_pe: list[tuple[int, int, int]] | None = None,
+) -> dict[str, Any]:
+    """Build neighbor tables and install them in every participating PE_IPCQ.
+
+    Args:
+        engine: GraphEngine with ``_components`` dict
+        spec: topology spec dict
+        cfg: merged algorithm config (from ``resolve_algorithm_config``)
+        algo_module: optional algorithm Python module (for neighbors override)
+        rank_to_pe: optional explicit rank → (sip, cube, pe) mapping. If
+                    None, the default linear mapping is used.
+
+    Returns:
+        A diagnostics dict with the install plan (rank → PE map, neighbor table).
+    """
+    if "world_size" in cfg:
+        world_size = int(cfg["world_size"])
+    else:
+        # Topology-derived fallback (mirrors AhbmCCLBackend / RuntimeContext).
+        sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
+        cm = spec.get("sip", {}).get("cube_mesh", {})
+        cubes_per_sip = int(cm.get("w", 1)) * int(cm.get("h", 1))
+        pl = spec.get("cube", {}).get("pe_layout", {})
+        corners = pl.get("corners", [])
+        pe_per_corner = int(pl.get("pe_per_corner", 1))
+        pes_per_cube = pe_per_corner * max(len(corners), 1)
+        world_size = sips * cubes_per_sip * pes_per_cube
+    buffer_kind = cfg["buffer_kind"]
+    n_slots = int(cfg["n_slots"])
+    slot_size = int(cfg["slot_size"])
+    backpressure = cfg["backpressure"]
+    credit_size_bytes = int(cfg.get("ipcq_credit_size_bytes", 16))
+
+    # Step 1: rank → (sip, cube, pe)
+    if rank_to_pe is not None:
+        if len(rank_to_pe) != world_size:
+            raise ValueError(
+                f"rank_to_pe has {len(rank_to_pe)} entries but world_size={world_size}"
+            )
+        rank_pe = list(rank_to_pe)
+    else:
+        rank_pe: list[tuple[int, int, int]] = [
+            linear_rank_to_pe(r, spec) for r in range(world_size)
+        ]
+    pe_to_rank = {(s, c, p): r for r, (s, c, p) in enumerate(rank_pe)}
+
+    # Step 2: resolve topology fn (with optional override)
+    topo_fn = resolve_topology(cfg["topology"], algo_module=algo_module)
+
+    # Build per-rank neighbor map
+    neighbor_table: dict[int, dict[str, int]] = {}
+    for r in range(world_size):
+        neighbor_table[r] = topo_fn(r, world_size)
+
+    # Step 3: pull the live engine reference for each PE_IPCQ
+    components = engine._components
+    def pe_ipcq_id(s: int, c: int, p: int) -> str:
+        return f"sip{s}.cube{c}.pe{p}.pe_ipcq"
+
+    # Step 4: per-PE rx_base address and per-PE credit_inbox
+    direction_keys = sorted({d for nt in neighbor_table.values() for d in nt})
+    direction_idx = {d: i for i, d in enumerate(direction_keys)}
+    bytes_per_direction = n_slots * slot_size
+
+    def rx_base(s: int, c: int, p: int, d: str) -> int:
+        return _ipcq_base_for_pe(s, c, p) + direction_idx[d] * bytes_per_direction
+
+    # Wire bidirectional credit stores: backend creates the SimPy Stores
+    # by reading each rank's PE_IPCQ.credit_inbox property.
+    rank_to_credit_inbox: dict[int, simpy.Store] = {}
+    for r, (s, c, p) in enumerate(rank_pe):
+        comp = components[pe_ipcq_id(s, c, p)]
+        # Trigger lazy creation of credit_inbox if not yet started.
+        # PE_IPCQ.start() creates it; we ensure it exists.
+        if comp._credit_inbox is None:
+            comp._credit_inbox = simpy.Store(engine._env)
+        rank_to_credit_inbox[r] = comp.credit_inbox
+
+    # Step 5: build IpcqInitMsg per rank and call _install_neighbors directly
+    plan: dict[str, Any] = {
+        "world_size": world_size,
+        "rank_to_pe": rank_pe,
+        "buffer_kind": buffer_kind,
+        "neighbor_table": neighbor_table,
+    }
+
+    def reverse_direction(my_rank: int, peer_rank: int) -> str | None:
+        """Find which direction in peer's neighbor table points back to my_rank."""
+        for d, target in neighbor_table[peer_rank].items():
+            if target == my_rank:
+                return d
+        return None
+
+    for r, (s, c, p) in enumerate(rank_pe):
+        my_pe_ipcq = components[pe_ipcq_id(s, c, p)]
+        nbrs = neighbor_table[r]
+        entries: list[IpcqInitEntry] = []
+        for d, peer_rank in nbrs.items():
+            if peer_rank is None:
+                continue
+            peer_s, peer_c, peer_p = rank_pe[peer_rank]
+            peer_dir = reverse_direction(r, peer_rank)
+            if peer_dir is None:
+                # Peer doesn't have a reverse entry — skip (asymmetric topology)
+                continue
+            peer_endpoint = IpcqEndpoint(
+                sip=peer_s, cube=peer_c, pe=peer_p,
+                buffer_kind=buffer_kind,
+                rx_base_pa=rx_base(peer_s, peer_c, peer_p, peer_dir),
+                rx_base_va=0,
+                n_slots=n_slots, slot_size=slot_size,
+            )
+            entries.append(IpcqInitEntry(
+                direction=d,
+                peer=peer_endpoint,
+                my_rx_base_pa=rx_base(s, c, p, d),
+                my_rx_base_va=0,
+                n_slots=n_slots, slot_size=slot_size,
+                peer_credit_store=rank_to_credit_inbox[peer_rank],
+            ))
+        msg = IpcqInitMsg(
+            correlation_id="ccl_init", request_id=f"init_r{r}",
+            target_sips=(s,), target_cubes=(c,), target_pe=p,
+            entries=tuple(entries),
+            backpressure_mode=backpressure,
+            buffer_kind=buffer_kind,
+            credit_size_bytes=credit_size_bytes,
+        )
+        my_pe_ipcq._install_neighbors(msg)
+
+    return plan
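Reviewer note: the synthetic address scheme described in the module docstring has an implicit size limit that is easy to check. This sketch is not part of the commit. It re-derives the encoding from the docstring; `ipcq_rx_base` and `fits_in_pe_window` are hypothetical names, and the slot counts and sizes are made-up example values.

```python
IPCQ_BASE = 1 << 60  # same top-bit tag as _IPCQ_BASE in the diff


def ipcq_rx_base(sip, cube, pe, direction_idx, n_slots, slot_size):
    """Receive-ring base address for one (sip, cube, pe, direction)."""
    base = IPCQ_BASE | (sip << 40) | (cube << 32) | (pe << 24)
    return base + direction_idx * (n_slots * slot_size)


def fits_in_pe_window(n_directions, n_slots, slot_size):
    # The pe field starts at bit 24, so consecutive PEs are 2**24 bytes apart.
    # All rings of one PE must fit in that window or they overlap the next PE.
    return n_directions * n_slots * slot_size <= 1 << 24


# Example values (assumptions): 2 SIPs, 4 cubes, 8 PEs, 4 directions.
addrs = {
    ipcq_rx_base(s, c, p, d, n_slots=8, slot_size=4096)
    for s in range(2) for c in range(4) for p in range(8) for d in range(4)
}
```

Because the `cube` field sits at bit 32 and the `pe` field at bit 24, the encoding also assumes fewer than 256 PEs per cube and fewer than 256 cubes per SIP.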
@@ -0,0 +1,465 @@
+"""Mock CCL runtime for fast unit tests of algorithm kernels (ADR-0023 D15).
+
+Runs a kernel function once per rank with a minimal ``tl`` shim — no SimPy,
+no PE_DMA, no fabric simulation. Just enough to verify *functional*
+correctness of an IPCQ-based collective algorithm.
+
+Cross-rank send/recv is implemented with greenlet cooperative scheduling
+plus per-(rank, direction) FIFO queues. Backpressure is not modeled —
+queues are unbounded.
+
+Typical usage in a test::
+
+    from kernbench.ccl.testing import run_kernel_in_mock
+    from kernbench.ccl.algorithms.ring_allreduce import kernel
+
+    inputs = [np.full(16, r + 1, dtype=np.float16) for r in range(4)]
+    outputs = run_kernel_in_mock(
+        kernel_fn=kernel, world_size=4, topology="ring_1d",
+        inputs=inputs, kernel_args=(16,),
+    )
+    for r in range(4):
+        assert np.allclose(outputs[r], sum(inputs))
+"""
+from __future__ import annotations
+
+from collections import deque
+from typing import Any, Callable
+
+import numpy as np
+from greenlet import greenlet
+
+from kernbench.ccl.topologies import resolve_topology
+from kernbench.common.ipcq_types import IpcqInvalidDirection
+from kernbench.common.pe_commands import TensorHandle
+
+
+# ── Per-rank fake state ──────────────────────────────────────────────
+
+
+class _MockRankState:
+    """Per-rank scratch holding HBM/recv slots and tl shim hooks."""
+
+    def __init__(
+        self,
+        rank: int,
+        world_size: int,
+        neighbors: dict[str, int],
+        input_arr: np.ndarray,
+    ) -> None:
+        self.rank = rank
+        self.world_size = world_size
+        self.neighbors = neighbors  # direction → peer rank
+        # HBM "memory": addr → ndarray. Per-rank, no cross-rank sharing.
+        self._hbm: dict[int, np.ndarray] = {}
+        self._tcm: dict[int, np.ndarray] = {}
+        # ``t_ptr`` is the address the kernel sees. Real benches use a
+        # column-sharded VA so each rank reads from ``t_ptr + rank*nbytes``.
+        # Mirror that here: each rank's slice lives at the rank-specific addr.
+        nbytes = int(input_arr.nbytes)
+        self.t_ptr = 0  # base; per-rank offset is rank * nbytes
+        self._slice_addr = rank * nbytes
+        self._hbm[self._slice_addr] = input_arr.copy()
+        # Inbound recv FIFOs: direction → deque[ndarray]
+        self.recv_q: dict[str, deque[np.ndarray]] = {d: deque() for d in neighbors}
+        # Output (set when kernel calls tl.store at slice address)
+        self.output: np.ndarray | None = None
+        # Greenlet for this rank — set later
+        self.g: greenlet | None = None
+
+
+# ── Mock TLContext ───────────────────────────────────────────────────
+
+
+class _MockTL:
+    """Drop-in tl shim for mock runtime.
+
+    Supports the subset of TLContext API that algorithm authors use:
+    program_id, num_programs, load, store, send, recv, recv_async, wait,
+    plus arithmetic operations on TensorHandle (eager numpy execution,
+    no SimPy involved).
+    """
+
+    def __init__(self, state: _MockRankState, scheduler: "_MockScheduler") -> None:
+        self._state = state
+        self._scheduler = scheduler
+        self._handle_counter = 0
+
+    def _next_id(self) -> str:
+        self._handle_counter += 1
+        return f"mt{self._handle_counter}"
+
+    @property
+    def rank(self) -> int:
+        return self._state.rank
+
+    @property
+    def world_size(self) -> int:
+        return self._state.world_size
+
+    # Only axis 0 maps to ranks; every other axis has a single program.
+    def program_id(self, axis: int = 0) -> int:
+        return self._state.rank if axis == 0 else 0
+
+    def num_programs(self, axis: int = 0) -> int:
+        return self._state.world_size if axis == 0 else 1
+
+    # ── arithmetic ops (called by TensorHandle.__add__ etc.) ──
+
+    def _binary_math(self, op: str, a: TensorHandle, b: TensorHandle) -> TensorHandle:
+        a_data = np.asarray(a.data) if a.data is not None else None
+        b_data = np.asarray(b.data) if b.data is not None else None
+        if a_data is None or b_data is None:
+            result = None
+        elif op == "add":
+            result = a_data + b_data
+        elif op == "sub":
+            result = a_data - b_data
+        elif op == "mul":
+            result = a_data * b_data
+        elif op == "div":
+            result = a_data / b_data
+        elif op == "maximum":
+            result = np.maximum(a_data, b_data)
+        elif op == "minimum":
+            result = np.minimum(a_data, b_data)
+        else:
+            raise NotImplementedError(f"mock _binary_math: op {op!r} not implemented")
+        return TensorHandle(
+            id=self._next_id(),
+            addr=0, shape=a.shape, dtype=a.dtype,
+            nbytes=int(np.prod(a.shape)) * 2 if a.shape else 0,
+            data=result, space="tcm",
+        )
+
+    def maximum(self, a: TensorHandle, b: TensorHandle) -> TensorHandle:
+        return self._binary_math("maximum", a, b)
+
+    def minimum(self, a: TensorHandle, b: TensorHandle) -> TensorHandle:
+        return self._binary_math("minimum", a, b)
+
+    def fma(
+        self, a: TensorHandle, b: TensorHandle, c: TensorHandle,
+    ) -> TensorHandle:
+        a_data = np.asarray(a.data) if a.data is not None else None
+        b_data = np.asarray(b.data) if b.data is not None else None
+        c_data = np.asarray(c.data) if c.data is not None else None
+        result = (
+            a_data * b_data + c_data
+            if (a_data is not None and b_data is not None and c_data is not None)
+            else None
+        )
+        return TensorHandle(
+            id=self._next_id(),
+            addr=0, shape=a.shape, dtype=a.dtype,
+            nbytes=int(np.prod(a.shape)) * 2 if a.shape else 0,
+            data=result, space="tcm",
+        )
+
+    def clamp(
+        self,
+        x: TensorHandle,
+        min: TensorHandle,
+        max: TensorHandle,
+    ) -> TensorHandle:
+        x_data = np.asarray(x.data) if x.data is not None else None
+        lo = np.asarray(min.data) if min.data is not None else None
+        hi = np.asarray(max.data) if max.data is not None else None
+        result = (
+            np.minimum(np.maximum(x_data, lo), hi)
+            if (x_data is not None and lo is not None and hi is not None)
+            else None
+        )
+        return TensorHandle(
+            id=self._next_id(),
+            addr=0, shape=x.shape, dtype=x.dtype,
+            nbytes=int(np.prod(x.shape)) * 2 if x.shape else 0,
+            data=result, space="tcm",
+        )
+
+    def softmax(self, x: TensorHandle, axis: int = -1) -> TensorHandle:
+        x_data = np.asarray(x.data) if x.data is not None else None
+        if x_data is None:
+            result = None
+        else:
+            x_max = np.max(x_data, axis=axis, keepdims=True)
+            e = np.exp(x_data - x_max)
+            s = np.sum(e, axis=axis, keepdims=True)
+            result = e / s
+        return TensorHandle(
+            id=self._next_id(),
+            addr=0, shape=x.shape, dtype=x.dtype,
+            nbytes=int(np.prod(x.shape)) * 2 if x.shape else 0,
+            data=result, space="tcm",
+        )
+
+    @staticmethod
+    def cdiv(a: int, b: int) -> int:
+        return -(-int(a) // int(b))
+
+    def _unary_math(self, op: str, x: TensorHandle) -> TensorHandle:
+        x_data = np.asarray(x.data) if x.data is not None else None
+        if x_data is None:
+            result = None
+        elif op == "exp":
+            result = np.exp(x_data)
+        elif op == "log":
+            result = np.log(x_data)
+        elif op == "sqrt":
+            result = np.sqrt(x_data)
+        elif op == "abs":
+            result = np.abs(x_data)
+        elif op == "sigmoid":
+            result = 1.0 / (1.0 + np.exp(-x_data))
+        elif op == "cos":
+            result = np.cos(x_data)
+        elif op == "sin":
+            result = np.sin(x_data)
+        else:
+            raise NotImplementedError(f"mock _unary_math: op {op!r} not implemented")
+        return TensorHandle(
+            id=self._next_id(),
+            addr=0, shape=x.shape, dtype=x.dtype,
+            nbytes=int(np.prod(x.shape)) * 2 if x.shape else 0,
+            data=result, space="tcm",
+        )
+
+    def load(self, ptr: int, shape: tuple[int, ...], dtype: str = "f16") -> TensorHandle:
+        data = self._state._hbm.get(ptr)
+        if data is None:
+            data = np.zeros(shape, dtype=np.float16)
+        return TensorHandle(
+            id=f"load_{ptr}", addr=ptr, shape=shape, dtype=dtype,
+            nbytes=int(np.prod(shape)) * 2, data=data, space="hbm",
+        )
+
+    def store(self, ptr: int, handle: TensorHandle) -> None:
+        if handle.data is not None:
+            self._state._hbm[ptr] = np.asarray(handle.data)
+            if ptr == self._state._slice_addr:
+                self._state.output = self._state._hbm[ptr]
+
+    # IPCQ
+    def send(
+        self,
+        dir: str,
+        src: TensorHandle | None = None,
+        *,
+        src_addr: int | None = None,
+        nbytes: int | None = None,
+        shape: tuple[int, ...] | None = None,
+        dtype: str = "f16",
+        space: str = "tcm",
+    ) -> None:
+        if dir not in self._state.neighbors:
+            raise IpcqInvalidDirection(
+                f"mock tl.send: direction {dir!r} not in neighbors {list(self._state.neighbors)}"
+            )
+        if src is not None:
+            if src.data is not None:
+                data = np.asarray(src.data)
+            else:
+                # Resolve from this rank's local memory at src.addr
+                space_dict = self._state._hbm if src.space == "hbm" else self._state._tcm
+                stored = space_dict.get(src.addr)
+                if stored is None:
+                    raise RuntimeError(
+                        f"mock tl.send: no data at {src.space}:0x{src.addr:x}"
+                    )
+                data = np.asarray(stored)
+        else:
+            data = None
+        if data is None:
+            raise RuntimeError("mock tl.send: src TensorHandle is required")
+        peer_rank = self._state.neighbors[dir]
+        # Find the reverse direction in peer's neighbors that points back to me
+        peer_state = self._scheduler.states[peer_rank]
+        reverse_dir = None
+        for d, target in peer_state.neighbors.items():
+            if target == self._state.rank:
+                reverse_dir = d
+                break
+        if reverse_dir is None:
+            raise RuntimeError(
+                f"mock tl.send: peer rank {peer_rank} has no reverse direction"
+            )
+        peer_state.recv_q[reverse_dir].append(data.copy())
+        # After delivering, hand control back to scheduler so the receiver
+        # can wake up.
+        self._scheduler.yield_()
+
+    def recv_async(
+        self,
+        dir: str,
+        shape: tuple[int, ...] = (),
+        dtype: str = "f16",
+    ) -> dict:
+        """Non-blocking recv. Returns a future dict to pass to tl.wait."""
+        if dir not in self._state.neighbors:
+            raise IpcqInvalidDirection(
+                f"mock tl.recv_async: direction {dir!r} not in neighbors"
+            )
+        return {"_kind": "recv_future", "dir": dir, "shape": shape, "dtype": dtype}
+
+    def wait(self, future: Any) -> TensorHandle:
+        """Block until the recv future has data."""
+        if not isinstance(future, dict) or future.get("_kind") != "recv_future":
+            raise TypeError("tl.wait: expected recv future from tl.recv_async")
+        d = future["dir"]
+        while not self._state.recv_q[d]:
+            self._scheduler.yield_()
+        data = self._state.recv_q[d].popleft()
+        return self._make_handle(data, d, future["dtype"])
+
+    def recv(
+        self,
+        dir: str | None = None,
+        shape: tuple[int, ...] = (),
+        dtype: str = "f16",
+    ) -> TensorHandle:
+        if dir is not None and dir not in self._state.neighbors:
+            raise IpcqInvalidDirection(
+                f"mock tl.recv: direction {dir!r} not in neighbors {list(self._state.neighbors)}"
+            )
+        # Wait for data
+        while True:
+            if dir is None:
+                # first non-empty queue in neighbor-map order (not round-robin)
+                for d in self._state.neighbors:
+                    if self._state.recv_q[d]:
+                        data = self._state.recv_q[d].popleft()
+                        return self._make_handle(data, d, dtype)
+            else:
+                if self._state.recv_q[dir]:
+                    data = self._state.recv_q[dir].popleft()
+                    return self._make_handle(data, dir, dtype)
+            # Yield to other ranks
+            self._scheduler.yield_()
+
+    def _make_handle(self, data: np.ndarray, direction: str, dtype: str) -> TensorHandle:
+        return TensorHandle(
+            id=f"recv_{direction}",
+            addr=0, shape=data.shape, dtype=dtype,
+            nbytes=int(data.nbytes), data=data, space="tcm",
+        )
+
+
+# ── Cooperative scheduler ────────────────────────────────────────────
+
+
+class _MockScheduler:
+    """Round-robin cooperative scheduler over rank greenlets."""
+
+    def __init__(self, states: list[_MockRankState]) -> None:
+        self.states = states
+        self._parent: greenlet | None = None
+        self._cur_idx = 0
+
+    def yield_(self) -> None:
+        """Called from inside a rank greenlet to give other ranks a turn."""
+        assert self._parent is not None
+        self._parent.switch()
+
+    def run(self, kernel_fn: Callable, kernel_args: tuple) -> list[np.ndarray]:
+        from kernbench.triton_emu.tl_context import TLContext
+
+        self._parent = greenlet.getcurrent()
+        n = len(self.states)
+
+        # Per-rank tl shim
+        tls: dict[int, _MockTL] = {}
+
+        def _spawn(rank_idx: int) -> greenlet:
+            state = self.states[rank_idx]
+            tl = _MockTL(state, self)
+            tls[rank_idx] = tl
+
+            def _entry():
+                # Activate this rank's tl for TensorHandle operator overloads
+                TLContext._set_active(tl)  # type: ignore[attr-defined]
+                try:
+                    kernel_fn(state.t_ptr, *kernel_args, tl=tl)
+                finally:
+                    TLContext._set_active(None)  # type: ignore[attr-defined]
+
+            return greenlet(_entry)
+
+        for state in self.states:
+            state.g = _spawn(state.rank)
+
+        # Drive each rank round-robin until all dead. Detect global deadlock.
+        max_rounds = 10_000
+        round_no = 0
+        while True:
+            alive = [s for s in self.states if s.g is not None and not s.g.dead]
+            if not alive:
+                break
+            progressed = False
+            for s in self.states:
+                if s.g is None or s.g.dead:
+                    continue
+                # Multi-rank greenlets share TLContext active state via the
+                # module-level thread-local; restore this rank's tl before
+                # resuming so TensorHandle operator overloads dispatch to
+                # the right _MockTL.
+                TLContext._set_active(tls[s.rank])  # type: ignore[attr-defined]
+                s.g.switch()
+                if s.g.dead:
+                    progressed = True
+            TLContext._set_active(None)  # type: ignore[attr-defined]
+            # Loose progress check: only a finished greenlet counts as progress
+            # (queue activity is not tracked); abort after too many idle rounds.
+            round_no = 0 if progressed else round_no + 1
+            if round_no > max_rounds:
+                raise RuntimeError(
+                    "mock CCL runtime: deadlock detected (no progress for "
+                    f"{max_rounds} rounds)"
+                )
+
+        return [
+            s.output if s.output is not None else s._hbm.get(s._slice_addr)
+            for s in self.states
+        ]
+
+
+# ── Public entry ────────────────────────────────────────────────────
+
+
+def run_kernel_in_mock(
+    kernel_fn: Callable,
+    world_size: int,
+    topology: str,
+    inputs: list[np.ndarray],
+    kernel_args: tuple = (),
+    algo_module: Any | None = None,
+) -> list[np.ndarray]:
+    """Run a CCL kernel under the mock runtime with no SimPy/fabric.
+
+    Args:
+        kernel_fn: ``kernel(t_ptr, *kernel_args, tl=...)``
+        world_size: number of ranks
+        topology: builtin topology name (e.g. "ring_1d")
+        inputs: per-rank input ndarrays. ``inputs[r]`` becomes rank r's
+                local tile at HBM address ``r * inputs[r].nbytes``.
+        kernel_args: extra positional args after t_ptr
+        algo_module: optional module providing ``neighbors()`` override
+
+    Returns:
+        Per-rank output ndarrays: whatever the kernel wrote via tl.store to
+        its own slice address (or the original input if it never did).
+    """
+    if len(inputs) != world_size:
+        raise ValueError(f"len(inputs)={len(inputs)} != world_size={world_size}")
+
+    topo_fn = resolve_topology(topology, algo_module=algo_module)
+    states = [
+        _MockRankState(
+            rank=r, world_size=world_size,
+            neighbors=topo_fn(r, world_size),
+            input_arr=inputs[r],
+        )
+        for r in range(world_size)
+    ]
+
+    sched = _MockScheduler(states)
+    return sched.run(kernel_fn, kernel_args)
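
The scheduling idea in the mock runtime above (one cooperative task per rank, FIFO inboxes, a driver loop that resumes each rank until all have finished) can be sketched without any dependencies. The sketch below is not the code in this diff: it swaps greenlets for plain Python generators, uses a single inbox per rank rather than per-(rank, direction) queues, and reduces a scalar instead of a tensor. The names `ring_allreduce_rank` and `run` are hypothetical.

```python
from collections import deque


def ring_allreduce_rank(rank, world_size, value, inboxes):
    """One rank of a naive ring all-reduce (sum), written as a generator."""
    acc = value
    carry = value
    east = (rank + 1) % world_size
    for _ in range(world_size - 1):
        # "send E": append to the east peer's inbox (unbounded, no backpressure)
        inboxes[east].append(carry)
        # "recv W": wait cooperatively until the west peer has delivered
        while not inboxes[rank]:
            yield  # hand control back to the driver loop
        carry = inboxes[rank].popleft()
        acc += carry
    return acc


def run(values):
    """Resume every live rank in turn until all generators have returned."""
    n = len(values)
    inboxes = [deque() for _ in range(n)]
    gens = [ring_allreduce_rank(r, n, values[r], inboxes) for r in range(n)]
    results = [None] * n
    alive = set(range(n))
    while alive:
        for r in sorted(alive):
            try:
                next(gens[r])
            except StopIteration as stop:
                results[r] = stop.value  # the generator's return value
                alive.discard(r)
    return results


totals = run([1, 2, 3, 4])
```

Generators only work here because the wait loop sits directly in the rank function. A real kernel blocks several calls deep inside `tl.recv`, which is presumably why the mock uses greenlets: they can switch out from any call depth without turning the kernel into a generator.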
@@ -0,0 +1,128 @@
+"""Builtin neighbor topology generators for CCL backend (ADR-0023 D11).
+
+Each generator takes ``(rank, world_size)`` and returns a
+``dict[direction, peer_rank]`` for that rank. ``direction`` is one of
+``"N" | "S" | "E" | "W"`` for ring/mesh, or
+``"parent" | "child_left" | "child_right"`` for tree topologies.
+
+Algorithm modules may override the generated map by defining a
+``neighbors(rank, world_size, neighbor_map) -> dict | None`` function at
+module level (see D11 / D15). ``resolve_topology`` wires these together.
+"""
+from __future__ import annotations
+
+from typing import Any, Callable
+
+NeighborMap = dict[str, int]
+TopologyFn = Callable[[int, int], NeighborMap]
+
+
+# ── Builtin generators ───────────────────────────────────────────────
+
+
+def ring_1d(rank: int, world_size: int) -> NeighborMap:
+    """1D bidirectional ring (E/W)."""
+    return {
+        "E": (rank + 1) % world_size,
+        "W": (rank - 1) % world_size,
+    }
+
+
+def ring_1d_unidir(rank: int, world_size: int) -> NeighborMap:
+    """1D unidirectional ring (E only)."""
+    return {"E": (rank + 1) % world_size}
+
+
+def mesh_2d(rank: int, world_size: int) -> NeighborMap:
+    """Square 2D mesh (N/S/E/W).
+
+    Layout: rank = row * side + col, with side = sqrt(world_size).
+    Wrap-around (torus) on all four edges.
+    """
+    side = int(round(world_size ** 0.5))
+    if side * side != world_size:
+        raise ValueError(
+            f"mesh_2d requires square world_size, got {world_size}"
+        )
+    r, c = divmod(rank, side)
+    return {
+        "N": ((r - 1) % side) * side + c,
+        "S": ((r + 1) % side) * side + c,
+        "W": r * side + (c - 1) % side,
+        "E": r * side + (c + 1) % side,
+    }
+
+
+def tree_binary(rank: int, world_size: int) -> NeighborMap:
+    """Binary tree rooted at rank 0.
+
+    Children of rank r are 2r+1 and 2r+2 (if within world_size).
+    Parent of rank r > 0 is (r-1)//2.
+    Returned keys (only those that exist):
+        "parent", "child_left", "child_right"
+    """
+    n: NeighborMap = {}
+    if rank > 0:
+        n["parent"] = (rank - 1) // 2
+    left = 2 * rank + 1
+    right = 2 * rank + 2
+    if left < world_size:
+        n["child_left"] = left
+    if right < world_size:
+        n["child_right"] = right
+    return n
+
+
+def none(rank: int, world_size: int) -> NeighborMap:
+    """Empty map — algorithm's neighbors() must build from scratch."""
+    return {}
+
+
+_BUILTIN: dict[str, TopologyFn] = {
+    "ring_1d": ring_1d,
+    "ring_1d_unidir": ring_1d_unidir,
+    "mesh_2d": mesh_2d,
+    "tree_binary": tree_binary,
+    "none": none,
+}
+
+
+# ── Resolution ───────────────────────────────────────────────────────
+
+
+def resolve_topology(
+    name: str, algo_module: Any | None = None,
+) -> TopologyFn:
+    """Return a callable ``(rank, world_size) -> NeighborMap``.
+
+    Args:
+        name: builtin topology name from ccl.yaml. Must be one of
+              ``ring_1d``, ``ring_1d_unidir``, ``mesh_2d``, ``tree_binary``,
+              or ``none``.
+        algo_module: optional algorithm module. If it defines
+              ``neighbors(rank, world_size, neighbor_map)``, that hook is
+              invoked after the builtin to override the result.
+              Returning None from neighbors() leaves the builtin map
+              unchanged; returning a dict replaces it.
+
+    Raises:
+        ValueError: if ``name`` is not a known builtin.
+    """
+    if name not in _BUILTIN:
+        raise ValueError(
+            f"Unknown topology '{name}'. "
+            f"Available builtins: {list(_BUILTIN)}"
+        )
+    builtin_fn = _BUILTIN[name]
+    override_fn = getattr(algo_module, "neighbors", None) if algo_module else None
+    if override_fn is None or not callable(override_fn):
+        return builtin_fn
+
+    def _wrapped(rank: int, world_size: int) -> NeighborMap:
+        base = builtin_fn(rank, world_size)
+        result = override_fn(rank, world_size, base)
+        if result is None:
+            return base
+        return result
+
+    return _wrapped
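
The mock `tl.send` finds the receiving queue by scanning the peer's neighbor map for the first direction that points back at the sender. The sketch below reproduces that first-match lookup against a copy of the `ring_1d` mapping to show how it behaves for small world sizes. `reverse_dir` is a hypothetical helper written for this illustration, not a function from the diff.

```python
def ring_1d(rank, world_size):
    """Bidirectional ring, same mapping as the builtin generator above."""
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def reverse_dir(topo_fn, rank, direction, world_size):
    """First direction in the peer's map that points back at `rank`.

    Hypothetical helper mirroring the first-match loop in the mock tl.send.
    """
    peer = topo_fn(rank, world_size)[direction]
    for d, target in topo_fn(peer, world_size).items():
        if target == rank:
            return d
    return None


# With 4 ranks the E and W peers are distinct, so an E send pairs with W.
four = [reverse_dir(ring_1d, r, "E", 4) for r in range(4)]

# With 2 ranks both directions name the same peer; first-match returns "E",
# so a send on "E" is queued under the peer's "E" rather than its "W".
two = reverse_dir(ring_1d, 0, "E", 2)
```

If this reading of the lookup is right, a kernel that sends on `"E"` and receives on `"W"` would wait on an empty queue under the mock runtime when `world_size == 2`. The same ambiguity would apply to `mesh_2d` with `side == 2`. A test at those sizes would confirm whether the lookup needs an explicit direction pairing.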