Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after each
  engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to prevent
  stale data from corrupting the MemoryStore snapshot.

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier (only real PyTorch names).
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases (ring×3
  buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
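
The commit message describes PE_IPCQ as a ring-buffer control plane with head/tail pointers and backpressure. The sketch below illustrates that general idea only. It is not the PE_IPCQ code from this commit: the class name `RingQueue`, its methods, and the fixed capacity are hypothetical and chosen purely for illustration.

```python
class RingQueue:
    """Fixed-capacity FIFO driven by head/tail counters.

    Hypothetical illustration only; not the PE_IPCQ component.
    """

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._capacity = capacity
        self._head = 0  # next position to read (consumer side)
        self._tail = 0  # next position to write (producer side)

    def __len__(self) -> int:
        # Counters only grow, so their difference is the occupancy.
        return self._tail - self._head

    def try_push(self, item) -> bool:
        """Refuse the item (return False) instead of overwriting when full."""
        if len(self) == self._capacity:
            return False
        self._slots[self._tail % self._capacity] = item
        self._tail += 1
        return True

    def try_pop(self):
        """Return (True, item), or (False, None) when the queue is empty."""
        if self._head == self._tail:
            return False, None
        idx = self._head % self._capacity
        item = self._slots[idx]
        self._slots[idx] = None
        self._head += 1
        return True, item


q = RingQueue(capacity=2)
accepted = [q.try_push(x) for x in ("a", "b", "c")]  # third push is refused
ok, first = q.try_pop()                               # frees one slot
retry = q.try_push("c")                               # now accepted
```

A producer that sees `try_push` return `False` would poll or sleep and then retry, which is one plausible reading of "backpressure (poll/sleep)" in the message above.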
2026-04-12 19:36:59 -07:00
parent ff2c677a9c
commit 998cc85762
60 changed files with 9196 additions and 80 deletions
@@ -0,0 +1,465 @@
+"""Mock CCL runtime for fast unit tests of algorithm kernels (ADR-0023 D15).
+
+Runs a kernel function once per rank with a minimal ``tl`` shim — no SimPy,
+no PE_DMA, no fabric simulation. Just enough to verify *functional*
+correctness of an IPCQ-based collective algorithm.
+
+Cross-rank send/recv is implemented with greenlet cooperative scheduling
+plus per-(rank, direction) FIFO queues. Backpressure is not modeled —
+queues are unbounded.
+
+Typical usage in a test::
+
+    from kernbench.ccl.testing import run_kernel_in_mock
+    from kernbench.ccl.algorithms.ring_allreduce import kernel
+
+    inputs = [np.full(16, r + 1, dtype=np.float16) for r in range(4)]
+    outputs = run_kernel_in_mock(
+        kernel_fn=kernel, world_size=4, topology="ring_1d",
+        inputs=inputs, kernel_args=(16,),
+    )
+    for r in range(4):
+        assert np.allclose(outputs[r], sum(inputs))
+"""
+from __future__ import annotations
+
+from collections import deque
+from typing import Any, Callable
+
+import numpy as np
+from greenlet import greenlet
+
+from kernbench.ccl.topologies import resolve_topology
+from kernbench.common.ipcq_types import IpcqInvalidDirection
+from kernbench.common.pe_commands import TensorHandle
+
+
+# ── Per-rank fake state ──────────────────────────────────────────────
+
+
+class _MockRankState:
+    """Per-rank scratch holding HBM/recv slots and tl shim hooks."""
+
+    def __init__(
+        self,
+        rank: int,
+        world_size: int,
+        neighbors: dict[str, int],
+        input_arr: np.ndarray,
+    ) -> None:
+        self.rank = rank
+        self.world_size = world_size
+        self.neighbors = neighbors  # direction → peer rank
+        # HBM "memory": addr → ndarray. Per-rank, no cross-rank sharing.
+        self._hbm: dict[int, np.ndarray] = {}
+        self._tcm: dict[int, np.ndarray] = {}
+        # ``t_ptr`` is the address the kernel sees. Real benches use a
+        # column-sharded VA so each rank reads from ``t_ptr + rank*nbytes``.
+        # Mirror that here: each rank's slice lives at the rank-specific addr.
+        nbytes = int(input_arr.nbytes)
+        self.t_ptr = 0  # base; per-rank offset is rank * nbytes
+        self._slice_addr = rank * nbytes
+        self._hbm[self._slice_addr] = input_arr.copy()
+        # Inbound recv FIFOs: direction → deque[ndarray]
+        self.recv_q: dict[str, deque[np.ndarray]] = {d: deque() for d in neighbors}
+        # Output (set when kernel calls tl.store at slice address)
+        self.output: np.ndarray | None = None
+        # Greenlet for this rank — set later
+        self.g: greenlet | None = None
+
+
+# ── Mock TLContext ───────────────────────────────────────────────────
+
+
+class _MockTL:
+    """Drop-in tl shim for mock runtime.
+
+    Supports the subset of TLContext API that algorithm authors use:
+    program_id, num_programs, load, store, send, recv, recv_async, wait,
+    plus arithmetic operations on TensorHandle (eager numpy execution,
+    no SimPy involved).
+    """
+
+    def __init__(self, state: _MockRankState, scheduler: "_MockScheduler") -> None:
+        self._state = state
+        self._scheduler = scheduler
+        self._handle_counter = 0
+
+    def _next_id(self) -> str:
+        self._handle_counter += 1
+        return f"mt{self._handle_counter}"
+
+    @property
+    def rank(self) -> int:
+        return self._state.rank
+
+    @property
+    def world_size(self) -> int:
+        return self._state.world_size
+
+    # axis-aware
+    def program_id(self, axis: int = 0) -> int:
+        return self._state.rank if axis == 0 else 0
+
+    def num_programs(self, axis: int = 0) -> int:
+        return self._state.world_size if axis == 0 else 1
+
+    # ── arithmetic ops (called by TensorHandle.__add__ etc.) ──
+
+    def _binary_math(self, op: str, a: TensorHandle, b: TensorHandle) -> TensorHandle:
+        a_data = np.asarray(a.data) if a.data is not None else None
+        b_data = np.asarray(b.data) if b.data is not None else None
+        if a_data is None or b_data is None:
+            result = None
+        elif op == "add":
+            result = a_data + b_data
+        elif op == "sub":
+            result = a_data - b_data
+        elif op == "mul":
+            result = a_data * b_data
+        elif op == "div":
+            result = a_data / b_data
+        elif op == "maximum":
+            result = np.maximum(a_data, b_data)
+        elif op == "minimum":
+            result = np.minimum(a_data, b_data)
+        else:
+            raise NotImplementedError(f"mock _binary_math: op {op!r} not implemented")
+        return TensorHandle(
+            id=self._next_id(),
+            addr=0, shape=a.shape, dtype=a.dtype,
+            nbytes=int(np.prod(a.shape)) * 2 if a.shape else 0,
+            data=result, space="tcm",
+        )
+
+    def maximum(self, a: TensorHandle, b: TensorHandle) -> TensorHandle:
+        return self._binary_math("maximum", a, b)
+
+    def minimum(self, a: TensorHandle, b: TensorHandle) -> TensorHandle:
+        return self._binary_math("minimum", a, b)
+
+    def fma(
+        self, a: TensorHandle, b: TensorHandle, c: TensorHandle,
+    ) -> TensorHandle:
+        a_data = np.asarray(a.data) if a.data is not None else None
+        b_data = np.asarray(b.data) if b.data is not None else None
+        c_data = np.asarray(c.data) if c.data is not None else None
+        result = (
+            a_data * b_data + c_data
+            if (a_data is not None and b_data is not None and c_data is not None)
+            else None
+        )
+        return TensorHandle(
+            id=self._next_id(),
+            addr=0, shape=a.shape, dtype=a.dtype,
+            nbytes=int(np.prod(a.shape)) * 2 if a.shape else 0,
+            data=result, space="tcm",
+        )
+
+    def clamp(
+        self,
+        x: TensorHandle,
+        min: TensorHandle,
+        max: TensorHandle,
+    ) -> TensorHandle:
+        x_data = np.asarray(x.data) if x.data is not None else None
+        lo = np.asarray(min.data) if min.data is not None else None
+        hi = np.asarray(max.data) if max.data is not None else None
+        result = (
+            np.minimum(np.maximum(x_data, lo), hi)
+            if (x_data is not None and lo is not None and hi is not None)
+            else None
+        )
+        return TensorHandle(
+            id=self._next_id(),
+            addr=0, shape=x.shape, dtype=x.dtype,
+            nbytes=int(np.prod(x.shape)) * 2 if x.shape else 0,
+            data=result, space="tcm",
+        )
+
+    def softmax(self, x: TensorHandle, axis: int = -1) -> TensorHandle:
+        x_data = np.asarray(x.data) if x.data is not None else None
+        if x_data is None:
+            result = None
+        else:
+            x_max = np.max(x_data, axis=axis, keepdims=True)
+            e = np.exp(x_data - x_max)
+            s = np.sum(e, axis=axis, keepdims=True)
+            result = e / s
+        return TensorHandle(
+            id=self._next_id(),
+            addr=0, shape=x.shape, dtype=x.dtype,
+            nbytes=int(np.prod(x.shape)) * 2 if x.shape else 0,
+            data=result, space="tcm",
+        )
+
+    @staticmethod
+    def cdiv(a: int, b: int) -> int:
+        return -(-int(a) // int(b))
+
+    def _unary_math(self, op: str, x: TensorHandle) -> TensorHandle:
+        x_data = np.asarray(x.data) if x.data is not None else None
+        if x_data is None:
+            result = None
+        elif op == "exp":
+            result = np.exp(x_data)
+        elif op == "log":
+            result = np.log(x_data)
+        elif op == "sqrt":
+            result = np.sqrt(x_data)
+        elif op == "abs":
+            result = np.abs(x_data)
+        elif op == "sigmoid":
+            result = 1.0 / (1.0 + np.exp(-x_data))
+        elif op == "cos":
+            result = np.cos(x_data)
+        elif op == "sin":
+            result = np.sin(x_data)
+        else:
+            raise NotImplementedError(f"mock _unary_math: op {op!r} not implemented")
+        return TensorHandle(
+            id=self._next_id(),
+            addr=0, shape=x.shape, dtype=x.dtype,
+            nbytes=int(np.prod(x.shape)) * 2 if x.shape else 0,
+            data=result, space="tcm",
+        )
+
+    def load(self, ptr: int, shape: tuple[int, ...], dtype: str = "f16") -> TensorHandle:
+        data = self._state._hbm.get(ptr)
+        if data is None:
+            data = np.zeros(shape, dtype=np.float16)
+        return TensorHandle(
+            id=f"load_{ptr}", addr=ptr, shape=shape, dtype=dtype,
+            nbytes=int(np.prod(shape)) * 2, data=data, space="hbm",
+        )
+
+    def store(self, ptr: int, handle: TensorHandle) -> None:
+        if handle.data is not None:
+            self._state._hbm[ptr] = np.asarray(handle.data)
+            if ptr == self._state._slice_addr:
+                self._state.output = self._state._hbm[ptr]
+
+    # IPCQ
+    def send(
+        self,
+        dir: str,
+        src: TensorHandle | None = None,
+        *,
+        src_addr: int | None = None,
+        nbytes: int | None = None,
+        shape: tuple[int, ...] | None = None,
+        dtype: str = "f16",
+        space: str = "tcm",
+    ) -> None:
+        if dir not in self._state.neighbors:
+            raise IpcqInvalidDirection(
+                f"mock tl.send: direction {dir!r} not in neighbors {list(self._state.neighbors)}"
+            )
+        if src is not None:
+            if src.data is not None:
+                data = np.asarray(src.data)
+            else:
+                # Resolve from this rank's local memory at src.addr
+                space_dict = self._state._hbm if src.space == "hbm" else self._state._tcm
+                stored = space_dict.get(src.addr)
+                if stored is None:
+                    raise RuntimeError(
+                        f"mock tl.send: no data at {src.space}:0x{src.addr:x}"
+                    )
+                data = np.asarray(stored)
+        else:
+            data = None
+        if data is None:
+            raise RuntimeError("mock tl.send: a src TensorHandle is required")
+        peer_rank = self._state.neighbors[dir]
+        # Find the reverse direction in peer's neighbors that points back to me
+        peer_state = self._scheduler.states[peer_rank]
+        reverse_dir = None
+        for d, target in peer_state.neighbors.items():
+            if target == self._state.rank:
+                reverse_dir = d
+                break
+        if reverse_dir is None:
+            raise RuntimeError(
+                f"mock tl.send: peer rank {peer_rank} has no reverse direction"
+            )
+        peer_state.recv_q[reverse_dir].append(data.copy())
+        # After delivering, hand control back to scheduler so the receiver
+        # can wake up.
+        self._scheduler.yield_()
+
+    def recv_async(
+        self,
+        dir: str,
+        shape: tuple[int, ...] = (),
+        dtype: str = "f16",
+    ) -> dict:
+        """Non-blocking recv. Returns a future dict to pass to tl.wait."""
+        if dir not in self._state.neighbors:
+            raise IpcqInvalidDirection(
+                f"mock tl.recv_async: direction {dir!r} not in neighbors"
+            )
+        return {"_kind": "recv_future", "dir": dir, "shape": shape, "dtype": dtype}
+
+    def wait(self, future: Any) -> TensorHandle:
+        """Block until the recv future has data."""
+        if not isinstance(future, dict) or future.get("_kind") != "recv_future":
+            raise TypeError("tl.wait: expected recv future from tl.recv_async")
+        d = future["dir"]
+        while not self._state.recv_q[d]:
+            self._scheduler.yield_()
+        data = self._state.recv_q[d].popleft()
+        return self._make_handle(data, d, future["dtype"])
+
+    def recv(
+        self,
+        dir: str | None = None,
+        shape: tuple[int, ...] = (),
+        dtype: str = "f16",
+    ) -> TensorHandle:
+        if dir is not None and dir not in self._state.neighbors:
+            raise IpcqInvalidDirection(
+                f"mock tl.recv: direction {dir!r} not in neighbors {list(self._state.neighbors)}"
+            )
+        # Wait for data
+        while True:
+            if dir is None:
+                # scan directions in neighbor order; first non-empty queue wins
+                for d in self._state.neighbors:
+                    if self._state.recv_q[d]:
+                        data = self._state.recv_q[d].popleft()
+                        return self._make_handle(data, d, dtype)
+            else:
+                if self._state.recv_q[dir]:
+                    data = self._state.recv_q[dir].popleft()
+                    return self._make_handle(data, dir, dtype)
+            # Yield to other ranks
+            self._scheduler.yield_()
+
+    def _make_handle(self, data: np.ndarray, direction: str, dtype: str) -> TensorHandle:
+        return TensorHandle(
+            id=f"recv_{direction}",
+            addr=0, shape=data.shape, dtype=dtype,
+            nbytes=int(data.nbytes), data=data, space="tcm",
+        )
+
+
+# ── Cooperative scheduler ────────────────────────────────────────────
+
+
+class _MockScheduler:
+    """Round-robin cooperative scheduler over rank greenlets."""
+
+    def __init__(self, states: list[_MockRankState]) -> None:
+        self.states = states
+        self._parent: greenlet | None = None
+        self._cur_idx = 0
+
+    def yield_(self) -> None:
+        """Called from inside a rank greenlet to give other ranks a turn."""
+        assert self._parent is not None
+        self._parent.switch()
+
+    def run(self, kernel_fn: Callable, kernel_args: tuple) -> list[np.ndarray]:
+        from kernbench.triton_emu.tl_context import TLContext
+
+        self._parent = greenlet.getcurrent()
+        n = len(self.states)
+
+        # Per-rank tl shim
+        tls: dict[int, _MockTL] = {}
+
+        def _spawn(rank_idx: int) -> greenlet:
+            state = self.states[rank_idx]
+            tl = _MockTL(state, self)
+            tls[rank_idx] = tl
+
+            def _entry():
+                # Activate this rank's tl for TensorHandle operator overloads
+                TLContext._set_active(tl)  # type: ignore[attr-defined]
+                try:
+                    kernel_fn(state.t_ptr, *kernel_args, tl=tl)
+                finally:
+                    TLContext._set_active(None)  # type: ignore[attr-defined]
+
+            return greenlet(_entry)
+
+        for state in self.states:
+            state.g = _spawn(state.rank)
+
+        # Drive each rank round-robin until all dead. Detect global deadlock.
+        max_rounds = 10_000
+        round_no = 0
+        while True:
+            alive = [s for s in self.states if s.g is not None and not s.g.dead]
+            if not alive:
+                break
+            progressed = False
+            for s in self.states:
+                if s.g is None or s.g.dead:
+                    continue
+                # Multi-rank greenlets share TLContext active state via the
+                # module-level thread-local; restore this rank's tl before
+                # resuming so TensorHandle operator overloads dispatch to
+                # the right _MockTL.
+                TLContext._set_active(tls[s.rank])  # type: ignore[attr-defined]
+                s.g.switch()
+                if s.g.dead:
+                    progressed = True
+            TLContext._set_active(None)  # type: ignore[attr-defined]
+            # Loose progress check: count consecutive rounds in which no
+            # greenlet finished; abort once that streak exceeds max_rounds.
+            round_no = 0 if progressed else round_no + 1
+            if round_no > max_rounds:
+                raise RuntimeError(
+                    "mock CCL runtime: deadlock detected (no progress for "
+                    f"{max_rounds} rounds)"
+                )
+
+        return [
+            s.output if s.output is not None else s._hbm.get(s._slice_addr)
+            for s in self.states
+        ]
+
+
+# ── Public entry ────────────────────────────────────────────────────
+
+
+def run_kernel_in_mock(
+    kernel_fn: Callable,
+    world_size: int,
+    topology: str,
+    inputs: list[np.ndarray],
+    kernel_args: tuple = (),
+    algo_module: Any | None = None,
+) -> list[np.ndarray]:
+    """Run a CCL kernel under the mock runtime with no SimPy/fabric.
+
+    Args:
+        kernel_fn: ``kernel(t_ptr, *kernel_args, tl=...)``
+        world_size: number of ranks
+        topology: builtin topology name (e.g. "ring_1d")
+        inputs: per-rank input ndarrays. ``inputs[r]`` becomes rank r's
+                local tile at HBM address ``r * inputs[r].nbytes``.
+        kernel_args: extra positional args after t_ptr
+        algo_module: optional module providing ``neighbors()`` override
+
+    Returns:
+        Per-rank output ndarrays: whatever the kernel stored at the rank's
+        slice address (or the original input if it never stored there).
+    """
+    if len(inputs) != world_size:
+        raise ValueError(f"len(inputs)={len(inputs)} != world_size={world_size}")
+
+    topo_fn = resolve_topology(topology, algo_module=algo_module)
+    states = [
+        _MockRankState(
+            rank=r, world_size=world_size,
+            neighbors=topo_fn(r, world_size),
+            input_arr=inputs[r],
+        )
+        for r in range(world_size)
+    ]
+
+    sched = _MockScheduler(states)
+    return sched.run(kernel_fn, kernel_args)
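
The module above combines cooperative scheduling with unbounded per-(rank, direction) FIFO queues. The standalone sketch below shows the same pattern using plain Python generators in place of greenlets, so it runs with the standard library alone. Everything in it is an assumption made for illustration: `rank_program`, `run_ring_allreduce`, the `stats` counter, and the naive "forward what you received" ring all-reduce are hypothetical and are not the `ring_allreduce` kernel or the `_MockScheduler` from this commit.

```python
from collections import deque


def rank_program(rank, world_size, local, queues, outputs, stats):
    """One rank of a naive ring all-reduce, written as a generator.

    A bare ``yield`` hands control back to the scheduler, the role that
    ``yield_()`` plays for greenlets in the module above.
    """
    right = (rank + 1) % world_size
    acc = list(local)  # running sum
    msg = list(local)  # value to forward on the next step
    for _ in range(world_size - 1):
        # The right neighbor sees this message arrive from its "left".
        queues[(right, "left")].append(list(msg))
        stats["ops"] += 1
        while not queues[(rank, "left")]:
            yield  # nothing to receive yet; let other ranks run
        msg = queues[(rank, "left")].popleft()
        stats["ops"] += 1
        acc = [a + m for a, m in zip(acc, msg)]
    outputs[rank] = acc


def run_ring_allreduce(inputs, max_idle_rounds=100):
    """Drive all ranks round-robin until every generator has finished."""
    world_size = len(inputs)
    queues = {(r, "left"): deque() for r in range(world_size)}
    outputs = [None] * world_size
    stats = {"ops": 0}  # counts sends, receives, and rank completions
    live = {
        r: rank_program(r, world_size, inputs[r], queues, outputs, stats)
        for r in range(world_size)
    }
    idle_rounds = 0
    while live:
        ops_before = stats["ops"]
        for r in list(live):  # copy keys: entries are deleted while looping
            try:
                next(live[r])
            except StopIteration:
                del live[r]
                stats["ops"] += 1  # a finished rank counts as progress
        # A round is idle only if no send, receive, or completion occurred.
        idle_rounds = 0 if stats["ops"] != ops_before else idle_rounds + 1
        if idle_rounds > max_idle_rounds:
            raise RuntimeError("deadlock: no rank made progress")
    return outputs


inputs = [[float(r + 1)] * 4 for r in range(4)]
outputs = run_ring_allreduce(inputs)
```

Counting sends and receives as progress, rather than only finished ranks, keeps a long-running kernel from being mistaken for a deadlock. That is a design choice of this sketch and not a description of the committed scheduler.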