Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after each
  engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to prevent
  stale data from corrupting the MemoryStore snapshot.

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax,
  cdiv.

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier (only real PyTorch names).
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases (ring×3
  buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
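The world_size rule stated in the message (topology product with an optional override from ccl.yaml) could look roughly like the sketch below. The helper name `derive_world_size`, its parameters, and the range check on the override are assumptions made for illustration; only the sips × cubes × pes_per_cube product and the existence of an override come from the commit message.

```python
from typing import Optional


def derive_world_size(
    sips: int,
    cubes: int,
    pes_per_cube: int,
    override: Optional[int] = None,
) -> int:
    """Hypothetical helper: topology product unless an override is given."""
    topology_size = sips * cubes * pes_per_cube
    if override is None:
        return topology_size
    # Assumption: an override can only select a subset of the PEs that the
    # topology actually contains, so it must lie in 1..topology_size.
    if not 1 <= override <= topology_size:
        raise ValueError(f"override {override} outside 1..{topology_size}")
    return override


print(derive_world_size(sips=1, cubes=2, pes_per_cube=4))              # 8
print(derive_world_size(sips=1, cubes=2, pes_per_cube=4, override=4))  # 4
```

Whether the real backend rejects an oversized override or handles it differently is not visible in this diff.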
2026-04-12 19:36:59 -07:00
parent ff2c677a9c
commit 998cc85762
60 changed files with 9196 additions and 80 deletions
@@ -15,6 +15,7 @@ from typing import TYPE_CHECKING, Any
 import simpy
 from greenlet import greenlet

+from kernbench.common.ipcq_types import IpcqRecvCmd, IpcqRequest, IpcqSendCmd, RecvFuture
 from kernbench.common.pe_commands import (
    CompletionHandle,
    CompositeCmd,
@@ -51,6 +52,9 @@ class KernelRunner:
        out_ports: dict[str, simpy.Store],
        store: MemoryStore | None = None,
        num_cubes: int = 1,
+        ipcq_id: str | None = None,
+        scratch_base: int = 0,
+        scratch_size: int = 1 << 20,
    ) -> None:
        self._pe_prefix = pe_prefix
        self._pe_idx = pe_idx
@@ -61,6 +65,13 @@ class KernelRunner:
        self._out_ports = out_ports
        self._store = store
        self._parent: greenlet | None = None
+        # IPCQ port id (ADR-0023). Defaults to "<pe_prefix>.pe_ipcq"; IPCQ commands fail if it is not wired.
+        self._ipcq_id = ipcq_id or f"{pe_prefix}.pe_ipcq"
+        # PE-local scratch for compute output TensorHandles (ADR-0020 D3
+        # extension). The TLContext allocates from this pool when math/dot
+        # ops produce a result that may later be used as a send/store source.
+        self._scratch_base = scratch_base
+        self._scratch_size = scratch_size

    def run(
        self,
@@ -89,7 +100,10 @@ class KernelRunner:
            num_cubes=self._num_cubes,
            dispatch_cycles=0,
            runner=self,
+            scratch_base=self._scratch_base,
+            scratch_size=self._scratch_size,
        )
+        self._tl = tl  # exposed so switch_to_simpy can re-set on restore

        def _kernel_entry():
            TLContext._set_active(tl)  # type: ignore[attr-defined]
@@ -103,13 +117,20 @@ class KernelRunner:
        pending: dict[str, simpy.Event] = {}
        composite_results: list[dict] = []

+        # Helper: set our tl as active just before resuming the kernel.
+        # Multiple PE kernel runners share the same thread-local; without
+        # this, another runner's kernel may have left a different context.
+        def _switch_kernel(*args):
+            TLContext._set_active(tl)  # type: ignore[attr-defined]
+            return g.switch(*args)
+
        # Start kernel — first switch returns first command (or None if kernel is done)
-        cmd = g.switch()
+        cmd = _switch_kernel()

        while cmd is not None:
            if isinstance(cmd, PeCpuOverheadCmd):
                yield env.timeout(cmd.cycles)
-                cmd = g.switch()
+                cmd = _switch_kernel()

            elif isinstance(cmd, WaitCmd):
                if cmd.handle is not None:
@@ -120,7 +141,7 @@ class KernelRunner:
                    for evt in pending.values():
                        yield evt
                    pending.clear()
-                cmd = g.switch()
+                cmd = _switch_kernel()

            elif isinstance(cmd, DmaReadCmd):
                # Dispatch DMA through SimPy components
@@ -141,10 +162,12 @@ class KernelRunner:
                        )
                    except KeyError:
                        pass
-                cmd = g.switch(data)
+                cmd = _switch_kernel(data)

            elif isinstance(cmd, DmaWriteCmd):
-                # Write to MemoryStore first (visibility = issue, ADR-0020 D3)
+                # Write to MemoryStore first (visibility = issue, ADR-0020 D3).
+                # When data is None (e.g. timing-only TensorHandle math result),
+                # this is a no-op; Phase 2 dma_write replay handles those.
                if self._store is not None and cmd.handle.data is not None:
                    self._store.write("hbm", cmd.dst_addr, cmd.handle.data)

@@ -154,7 +177,7 @@ class KernelRunner:
                )
                yield self._out_ports[self._scheduler_id].put(pe_txn)
                yield done_evt
-                cmd = g.switch()
+                cmd = _switch_kernel()

            elif isinstance(cmd, CompositeCmd):
                # Non-blocking composite
@@ -165,7 +188,7 @@ class KernelRunner:
                composite_results.append(pe_txn.result_data)
                yield self._out_ports[self._scheduler_id].put(pe_txn)
                pending[cmd.completion.id] = done_evt
-                cmd = g.switch()
+                cmd = _switch_kernel()

            elif isinstance(cmd, (GemmCmd, MathCmd)):
                # Blocking compute command
@@ -175,7 +198,90 @@ class KernelRunner:
                )
                yield self._out_ports[self._scheduler_id].put(pe_txn)
                yield done_evt
-                cmd = g.switch()
+                cmd = _switch_kernel()
+
+            elif isinstance(cmd, IpcqSendCmd):
+                # Forward IpcqRequest to PE_IPCQ, wait for done
+                if self._ipcq_id not in self._out_ports:
+                    raise RuntimeError(
+                        f"PE_IPCQ port {self._ipcq_id!r} not wired to runner"
+                    )
+                done_evt = env.event()
+                req = IpcqRequest(command=cmd, done=done_evt)
+                yield self._out_ports[self._ipcq_id].put(req)
+                yield done_evt
+                cmd = _switch_kernel()
+
+            elif isinstance(cmd, IpcqRecvCmd):
+                if self._ipcq_id not in self._out_ports:
+                    raise RuntimeError(
+                        f"PE_IPCQ port {self._ipcq_id!r} not wired to runner"
+                    )
+                done_evt = env.event()
+                req = IpcqRequest(command=cmd, done=done_evt)
+                yield self._out_ports[self._ipcq_id].put(req)
+                yield done_evt
+                # Read actual data from MemoryStore at the slot address
+                data = None
+                src_space = req.result_data.get("src_space", "tcm")
+                src_addr = req.result_data.get("src_addr", 0)
+                if self._store is not None:
+                    try:
+                        data = self._store.read(
+                            src_space, src_addr,
+                            shape=cmd.shape, dtype=cmd.dtype,
+                        )
+                    except KeyError:
+                        pass
+                # Build result dict for tl.recv to wrap in TensorHandle
+                result = {
+                    "data": data,
+                    "src_space": src_space,
+                    "src_addr": src_addr,
+                    "direction": req.result_data.get("direction", cmd.direction),
+                    "dtype": cmd.dtype,
+                    "shape": cmd.shape,
+                    "nbytes": req.result_data.get("nbytes", 0),
+                }
+                cmd = _switch_kernel(result)
+
+            elif isinstance(cmd, tuple) and len(cmd) == 2 and cmd[0] == "recv_async":
+                # Non-blocking recv: post the IpcqRequest now, store the
+                # event in the future, return None to kernel.
+                future: RecvFuture = cmd[1]
+                done_evt = env.event()
+                req = IpcqRequest(command=future.cmd, done=done_evt)
+                future.request = req
+                future.event = done_evt
+                yield self._out_ports[self._ipcq_id].put(req)
+                cmd = _switch_kernel(None)
+
+            elif isinstance(cmd, tuple) and len(cmd) == 2 and cmd[0] == "recv_wait":
+                future = cmd[1]
+                if not future.event.triggered:
+                    yield future.event
+                req = future.request
+                src_space = req.result_data.get("src_space", "tcm")
+                src_addr = req.result_data.get("src_addr", 0)
+                data = None
+                if self._store is not None:
+                    try:
+                        data = self._store.read(
+                            src_space, src_addr,
+                            shape=future.cmd.shape, dtype=future.cmd.dtype,
+                        )
+                    except KeyError:
+                        pass
+                result = {
+                    "data": data,
+                    "src_space": src_space,
+                    "src_addr": src_addr,
+                    "direction": req.result_data.get("direction", future.cmd.direction),
+                    "dtype": future.cmd.dtype,
+                    "shape": future.cmd.shape,
+                    "nbytes": req.result_data.get("nbytes", 0),
+                }
+                cmd = _switch_kernel(result)

            else:
                # Unknown command — pass through as blocking
@@ -185,7 +291,7 @@ class KernelRunner:
                )
                yield self._out_ports[self._scheduler_id].put(pe_txn)
                yield done_evt
-                cmd = g.switch()
+                cmd = _switch_kernel()

        # Wait remaining pending composites
        for evt in pending.values():
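The `recv_async` / `recv_wait` branches above follow a post-now, wait-later pattern: the kernel hands the runner a future, gets control back immediately, and blocks only when it asks for the result. The sketch below shows that control flow in isolation. It is not the project's code: a plain generator stands in for the greenlet, a dict named `inbox` stands in for PE_IPCQ and the SimPy event, and `Future`, `kernel`, and `run` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict, Generator, Tuple


@dataclass
class Future:
    """Stand-in for RecvFuture: remembers which direction was posted."""
    direction: str
    done: bool = False
    value: Any = None


Command = Tuple[str, Future]


def kernel() -> Generator[Command, Any, Any]:
    fut = Future(direction="W")
    # Post the receive; the runner answers None right away.
    ack = yield ("recv_async", fut)
    assert ack is None
    # Independent work could overlap with the transfer at this point.
    # Block until the posted receive has completed.
    data = yield ("recv_wait", fut)
    return data


def run(gen: Generator[Command, Any, Any], inbox: Dict[str, Any]) -> Any:
    """Drive the kernel: answer each yielded command, return its result."""
    try:
        cmd = next(gen)
        while True:
            kind, fut = cmd
            if kind == "recv_async":
                reply = None                 # nothing to hand back yet
            elif kind == "recv_wait":
                if not fut.done:             # resolve once, then reuse
                    fut.value = inbox[fut.direction]
                    fut.done = True
                reply = fut.value
            else:
                raise ValueError(f"unknown command {kind!r}")
            cmd = gen.send(reply)
    except StopIteration as stop:
        return stop.value


print(run(kernel(), {"W": [1, 2, 3]}))  # [1, 2, 3]
```

The real runner additionally re-activates the owning TLContext before every resume, which this single-kernel sketch does not need.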
@@ -17,6 +17,7 @@ from __future__ import annotations
 import math
 from typing import Literal

+from kernbench.common.ipcq_types import IpcqRecvCmd, IpcqSendCmd, RecvFuture
 from kernbench.common.pe_commands import (
    CompletionHandle,
    CompositeCmd,
@@ -55,6 +56,8 @@ class TLContext:
        runner: Any = None,
        cube_id: int = 0,
        num_cubes: int = 1,
+        scratch_base: int = 0,
+        scratch_size: int = 1 << 20,  # 1 MiB per kernel invocation
    ) -> None:
        self._pe_id = pe_id
        self._num_programs = num_programs
@@ -65,6 +68,33 @@ class TLContext:
        self._handle_counter = 0
        self._completion_counter = 0
        self._runner = runner  # KernelRunner for greenlet mode (ADR-0020 D3)
+        # PE-local scratch allocator for math/compute output handles.
+        # Each binary/unary/reduction op auto-allocates a unique addr from
+        # this pool so the resulting TensorHandle can be the source of a
+        # later tl.send / tl.store. Cursor resets on every kernel invocation.
+        self._scratch_base = scratch_base
+        self._scratch_size = scratch_size
+        self._scratch_cursor = 0
+
+    def _scratch_alloc(self, nbytes: int) -> int:
+        """Allocate a unique scratch address for an output TensorHandle.
+
+        Returns 0 if no scratch base was configured (e.g. command-list mode);
+        in that case the resulting handle has addr=0 and cannot be used as a
+        send/store source. Greenlet/runner mode always supplies a base.
+        """
+        if self._scratch_base == 0:
+            return 0
+        # 16-byte alignment
+        aligned = (nbytes + 15) & ~15
+        addr = self._scratch_base + self._scratch_cursor
+        self._scratch_cursor += aligned
+        if self._scratch_cursor > self._scratch_size:
+            raise RuntimeError(
+                f"TLContext scratch overflow: requested {nbytes}B, "
+                f"used {self._scratch_cursor}/{self._scratch_size}B"
+            )
+        return addr

    @property
    def commands(self) -> list[PeCommand]:
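The `_scratch_alloc` method added in this hunk is a bump allocator with 16-byte alignment. A stand-alone sketch of the same arithmetic follows; the class name `ScratchAllocator` and the `reset` method are hypothetical. One deliberate difference from the diff: this version checks capacity before advancing the cursor, so a failed request leaves the allocator state unchanged.

```python
class ScratchAllocator:
    """Hypothetical stand-alone bump allocator with 16-byte alignment."""

    ALIGN = 16

    def __init__(self, base: int, size: int) -> None:
        self.base = base
        self.size = size
        self.cursor = 0  # bytes handed out so far, always a multiple of ALIGN

    def alloc(self, nbytes: int) -> int:
        # Round up to the next multiple of 16: add 15, then clear the low 4 bits.
        aligned = (nbytes + self.ALIGN - 1) & ~(self.ALIGN - 1)
        # Check first so that an overflow does not consume any space.
        if self.cursor + aligned > self.size:
            raise RuntimeError(
                f"scratch overflow: requested {nbytes}B, "
                f"used {self.cursor}/{self.size}B"
            )
        addr = self.base + self.cursor
        self.cursor += aligned
        return addr

    def reset(self) -> None:
        """Start over, as the diff does on every kernel invocation."""
        self.cursor = 0


pool = ScratchAllocator(base=0x1000, size=64)
print(hex(pool.alloc(10)))  # 0x1000 (10 bytes occupy one 16-byte slot)
print(hex(pool.alloc(1)))   # 0x1010
```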
@@ -93,11 +123,30 @@ class TLContext:

    def _make_handle(
        self, addr: int, shape: tuple[int, ...], dtype: str,
+        space: str = "tcm",
    ) -> TensorHandle:
        return TensorHandle(
            id=self._next_handle_id(),
            addr=addr, shape=shape, dtype=dtype,
            nbytes=self._nbytes(shape, dtype),
+            space=space,
+        )
+
+    def _make_compute_out(
+        self, shape: tuple[int, ...], dtype: str,
+    ) -> TensorHandle:
+        """Allocate an output TensorHandle in PE-local scratch (TCM space).
+
+        Used by math/compute ops so the result has a real address that can
+        be the source of a later send/store. The data field stays None in
+        Phase 1 — Phase 2 DataExecutor fills the actual ndarray.
+        """
+        nbytes = self._nbytes(shape, dtype)
+        addr = self._scratch_alloc(nbytes)
+        return TensorHandle(
+            id=self._next_handle_id(),
+            addr=addr, shape=shape, dtype=dtype,
+            nbytes=nbytes, space="tcm",
        )

    # ── Reference (no DMA, metadata only) ────────────────────────
@@ -124,20 +173,26 @@ class TLContext:
    def load(
        self, ptr: int, shape: tuple[int, ...], dtype: str = "f16",
    ) -> TensorHandle:
-        """Load tensor from HBM to TCM. Returns TensorHandle.
+        """Load tensor from HBM. Returns TensorHandle pointing at HBM[ptr].

        In greenlet mode: returns TensorHandle with actual numpy data.
        In command-list mode: returns TensorHandle with data=None.
+
+        The returned handle's ``space`` is "hbm" so subsequent ops (math,
+        send, store) using this handle as a source resolve via MemoryStore
+        at ``(hbm, ptr)`` — which is where the load's underlying data
+        actually lives in Phase 2 storage.
        """
        self._emit_dispatch_overhead()
-        handle = self._make_handle(addr=ptr, shape=shape, dtype=dtype)
+        handle = self._make_handle(addr=ptr, shape=shape, dtype=dtype, space="hbm")
        cmd = DmaReadCmd(handle=handle, src_addr=ptr, nbytes=handle.nbytes)
        data = self._emit(cmd)
        if data is not None:
-            # Greenlet mode: attach real data to handle
+            # Greenlet mode: attach real data to handle (preserve space)
            return TensorHandle(
                id=handle.id, addr=handle.addr, shape=handle.shape,
                dtype=handle.dtype, nbytes=handle.nbytes, data=data,
+                space=handle.space,
            )
        return handle

@@ -162,7 +217,7 @@ class TLContext:
            raise ValueError(f"dot shape mismatch: a.K={k} != b.K={k2}")
        out_shape = (*a.shape[:-2], m, n)
        out_dtype = a.dtype
-        out = self._make_handle(addr=0, shape=out_shape, dtype=out_dtype)
+        out = self._make_compute_out(shape=out_shape, dtype=out_dtype)
        self._emit_dispatch_overhead()
        self._emit(GemmCmd(a=a, b=b, out=out, m=m, k=k, n=n))
        return out
@@ -170,7 +225,7 @@ class TLContext:
    # ── MATH Engine: unary (blocking) ─────────────────────────────

    def _unary_math(self, op: str, x: TensorHandle) -> TensorHandle:
-        out = self._make_handle(addr=0, shape=x.shape, dtype=x.dtype)
+        out = self._make_compute_out(shape=x.shape, dtype=x.dtype)
        self._emit_dispatch_overhead()
        self._emit(MathCmd(op=op, inputs=(x,), out=out))
        return out
@@ -203,7 +258,7 @@ class TLContext:
    ) -> TensorHandle:
        out_shape = list(x.shape)
        out_shape[axis] = 1
-        out = self._make_handle(addr=0, shape=tuple(out_shape), dtype=x.dtype)
+        out = self._make_compute_out(shape=tuple(out_shape), dtype=x.dtype)
        self._emit_dispatch_overhead()
        self._emit(MathCmd(op=op, inputs=(x,), out=out, axis=axis))
        return out
@@ -222,7 +277,7 @@ class TLContext:
    def _binary_math(
        self, op: str, a: TensorHandle, b: TensorHandle,
    ) -> TensorHandle:
-        out = self._make_handle(addr=0, shape=a.shape, dtype=a.dtype)
+        out = self._make_compute_out(shape=a.shape, dtype=a.dtype)
        self._emit_dispatch_overhead()
        self._emit(MathCmd(op=op, inputs=(a, b), out=out))
        return out
@@ -230,15 +285,67 @@ class TLContext:
    def where(
        self, cond: TensorHandle, a: TensorHandle, b: TensorHandle,
    ) -> TensorHandle:
-        out = self._make_handle(addr=0, shape=a.shape, dtype=a.dtype)
+        out = self._make_compute_out(shape=a.shape, dtype=a.dtype)
        self._emit_dispatch_overhead()
        self._emit(MathCmd(op="where", inputs=(cond, a, b), out=out))
        return out

+    def maximum(self, a: TensorHandle, b: TensorHandle) -> TensorHandle:
+        """Element-wise max of two tensors (real Triton: tl.maximum)."""
+        return self._binary_math("maximum", a, b)
+
+    def minimum(self, a: TensorHandle, b: TensorHandle) -> TensorHandle:
+        """Element-wise min of two tensors (real Triton: tl.minimum)."""
+        return self._binary_math("minimum", a, b)
+
+    def fma(
+        self, a: TensorHandle, b: TensorHandle, c: TensorHandle,
+    ) -> TensorHandle:
+        """Fused multiply-add: a * b + c (real Triton: tl.fma)."""
+        out = self._make_compute_out(shape=a.shape, dtype=a.dtype)
+        self._emit_dispatch_overhead()
+        self._emit(MathCmd(op="fma", inputs=(a, b, c), out=out))
+        return out
+
+    def clamp(
+        self,
+        x: TensorHandle,
+        min: TensorHandle,
+        max: TensorHandle,
+    ) -> TensorHandle:
+        """Clamp x to [min, max] (real Triton: tl.clamp)."""
+        out = self._make_compute_out(shape=x.shape, dtype=x.dtype)
+        self._emit_dispatch_overhead()
+        self._emit(MathCmd(op="clamp", inputs=(x, min, max), out=out))
+        return out
+
+    def softmax(self, x: TensorHandle, axis: int = -1) -> TensorHandle:
+        """Numerically-stable softmax along ``axis`` (real Triton: tl.softmax).
+
+        Implemented as a single MathCmd (op="softmax") so timing accounts
+        for one MATH dispatch; Phase 2 DataExecutor expands it to the
+        canonical (x - max) → exp → sum → div sequence.
+        """
+        out = self._make_compute_out(shape=x.shape, dtype=x.dtype)
+        self._emit_dispatch_overhead()
+        self._emit(MathCmd(op="softmax", inputs=(x,), out=out, axis=axis))
+        return out
+
+    # ── Scalar helpers (real Triton: tl.cdiv etc.) ────────────────
+
+    @staticmethod
+    def cdiv(a: int, b: int) -> int:
+        """Ceiling division: (a + b - 1) // b (real Triton: tl.cdiv).
+
+        Used by host/kernel grid math; not a tensor op, so no MathCmd
+        is emitted. Mirrors triton.cdiv.
+        """
+        return -(-int(a) // int(b))
+
    # ── Index / Scalar (PE_CPU, no engine) ────────────────────────

    def program_id(self, axis: int = 0) -> int:
-        """Return program instance index.
+        """Return program instance index (ADR-0022).

        axis=0: local PE id within cube.
        axis=1: cube id.
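The `cdiv` helper in this hunk documents the formula `(a + b - 1) // b` but computes `-(-a // b)`. The two agree whenever `b` is positive, which is the only case grid math needs. The check below is a stand-alone sketch with hypothetical function names, not part of the commit.

```python
def cdiv_negated(a: int, b: int) -> int:
    """Form used in the diff: negate, floor-divide, negate again."""
    return -(-a // b)


def cdiv_additive(a: int, b: int) -> int:
    """Form quoted in the docstring; valid for b > 0 only."""
    return (a + b - 1) // b


# The two forms agree for every non-negative a and positive b tried here.
for a in range(0, 200):
    for b in range(1, 33):
        assert cdiv_negated(a, b) == cdiv_additive(a, b)

# Typical grid computation: 1000 elements in blocks of 256 need 4 blocks.
print(cdiv_negated(1000, 256))  # 4

# With a negative divisor only the negated form is still a true ceiling.
print(cdiv_negated(7, -2), cdiv_additive(7, -2))  # -3 -2
```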
@@ -248,7 +355,7 @@ class TLContext:
        return self._pe_id

    def num_programs(self, axis: int = 0) -> int:
-        """Return total number of program instances.
+        """Return total number of program instances (ADR-0022).

        axis=0: num PEs per cube.
        axis=1: num cubes.
@@ -284,6 +391,119 @@ class TLContext:
            dtype=x.dtype, nbytes=x.nbytes, data=x.data,
        )

+    # ── IPCQ (CCL) collective primitives (ADR-0023 D4) ────────────
+
+    def send(
+        self,
+        dir: str,
+        src: TensorHandle | None = None,
+        *,
+        src_addr: int | None = None,
+        nbytes: int | None = None,
+        shape: tuple[int, ...] | None = None,
+        dtype: str = "f16",
+        space: str = "tcm",
+    ) -> None:
+        """Send tensor data to the peer in the given direction.
+
+        Two calling forms:
+            tl.send(dir, handle)                       # use handle's metadata
+            tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
+
+        Blocking: returns when PE_IPCQ has accepted the request and
+        forwarded the IpcqDmaToken to PE_DMA. Backpressure may apply.
+        """
+        if src is not None:
+            src_addr = src.addr
+            nbytes = src.nbytes
+            shape = src.shape
+            dtype = src.dtype
+            space = getattr(src, "space", space)
+        if src_addr is None or nbytes is None or shape is None:
+            raise ValueError("tl.send: provide either a TensorHandle or src_addr/nbytes/shape")
+        self._emit_dispatch_overhead()
+        cmd = IpcqSendCmd(
+            direction=dir,
+            src_addr=src_addr, src_space=space,
+            nbytes=nbytes, shape=shape, dtype=dtype,
+            handle_id=self._next_handle_id(),
+        )
+        self._emit(cmd)
+
+    def recv(
+        self,
+        dir: str | None = None,
+        shape: tuple[int, ...] = (),
+        dtype: str = "f16",
+        space: str = "tcm",
+        dst_addr: int | None = None,
+        dst_space: str | None = None,
+    ) -> TensorHandle:
+        """Receive tensor data from a peer.
+
+        Args:
+            dir: specific direction (e.g. "W"), or None for round-robin.
+            shape, dtype: expected tensor metadata.
+            dst_addr / dst_space: if both are provided, the slot data is
+                copied to (dst_space, dst_addr) before the handle is
+                returned ("copy_to_dst" mode). Otherwise the slot address
+                is returned directly ("return_slot" mode).
+
+        Returns:
+            TensorHandle pointing to the slot (or dst) where the data has
+            arrived. In greenlet/runner mode, ``handle.data`` carries the
+            actual ndarray; in command-list mode the handle is a placeholder.
+        """
+        self._emit_dispatch_overhead()
+        if dst_addr is not None and dst_space is not None:
+            cmd = IpcqRecvCmd(
+                direction=dir,
+                shape=shape, dtype=dtype,
+                handle_id=self._next_handle_id(),
+                recv_mode="copy_to_dst",
+                dst_addr=dst_addr, dst_space=dst_space,
+            )
+        else:
+            cmd = IpcqRecvCmd(
+                direction=dir,
+                shape=shape, dtype=dtype,
+                handle_id=self._next_handle_id(),
+            )
+        result = self._emit(cmd)
+        if isinstance(result, dict):
+            slot_addr = int(result.get("src_addr", 0))
+            slot_space = str(result.get("src_space", "tcm"))
+            data = result.get("data")
+            return TensorHandle(
+                id=self._next_handle_id(),
+                addr=slot_addr,
+                shape=shape,
+                dtype=dtype,
+                nbytes=self._nbytes(shape, dtype),
+                data=data,
+                space=slot_space,
+            )
+        return self._make_handle(addr=0, shape=shape, dtype=dtype)
+
+    def recv_async(
+        self,
+        dir: str,
+        shape: tuple[int, ...] = (),
+        dtype: str = "f16",
+    ) -> "RecvFuture":
+        """Non-blocking recv. Returns a future to pass into ``tl.wait``."""
+        self._emit_dispatch_overhead()
+        cmd = IpcqRecvCmd(
+            direction=dir,
+            shape=shape, dtype=dtype,
+            handle_id=self._next_handle_id(),
+            blocking=False,
+        )
+        future = RecvFuture(cmd=cmd)
+        if self._runner is not None:
+            self._runner.switch_to_simpy(("recv_async", future))
+        return future
+
    # ── Composite + Control ───────────────────────────────────────

    def composite(
@@ -316,9 +536,40 @@ class TLContext:
        ))
        return completion

-    def wait(self, handle: CompletionHandle | None = None) -> None:
-        """Wait for a specific composite or all pending composites."""
+    def wait(self, handle: "CompletionHandle | RecvFuture | None" = None) -> Any:
+        """Wait for a composite, a recv future, or all pending composites.
+
+        - ``CompletionHandle`` (or None): wait for composite completion.
+        - ``RecvFuture``: wait for a non-blocking ``recv_async`` to finish.
+          Returns the resolved ``TensorHandle``.
+        """
+        if isinstance(handle, RecvFuture):
+            if handle.resolved:
+                return handle.result
+            if self._runner is None:
+                raise RuntimeError(
+                    "tl.wait(RecvFuture) requires runner mode (greenlet)"
+                )
+            result_dict = self._runner.switch_to_simpy(("recv_wait", handle))
+            slot_addr = int(result_dict.get("src_addr", 0))
+            slot_space = str(result_dict.get("src_space", "tcm"))
+            data = result_dict.get("data")
+            th = TensorHandle(
+                id=self._next_handle_id(),
+                addr=slot_addr,
+                shape=handle.cmd.shape,
+                dtype=handle.cmd.dtype,
+                nbytes=self._nbytes(handle.cmd.shape, handle.cmd.dtype),
+                data=data,
+                space=slot_space,
+            )
+            handle.resolved = True
+            handle.result = th
+            return th
+
+        # Composite path (existing behaviour)
        self._emit(WaitCmd(handle=handle))
+        return None

    def cycles(self, n: int) -> None:
        """Declare PE_CPU scalar execution overhead (cycles)."""