Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: latency modeled at bottleneck bandwidth,
  bypassing the fabric vc_comm channel.
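
The head/tail ring-buffer control plane described above can be illustrated with a minimal sketch. This is not the PE_IPCQ implementation: the class name `RingQueue`, its methods, and the slot depth are hypothetical, and the poll/sleep loop a producer would run on a full queue is left to the caller.

```python
class RingQueue:
    """Hypothetical fixed-depth ring buffer with head/tail pointers."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.slots = [None] * depth
        self.head = 0  # index of the next slot to consume (monotonic)
        self.tail = 0  # index of the next slot to fill (monotonic)

    def try_push(self, item) -> bool:
        # Full when the producer is a whole buffer ahead of the consumer.
        # Returning False is the backpressure signal: the caller polls or sleeps.
        if self.tail - self.head == self.depth:
            return False
        self.slots[self.tail % self.depth] = item
        self.tail += 1
        return True

    def try_pop(self):
        # Empty when head has caught up with tail. None marks "nothing to
        # pop", so this sketch assumes None is never a stored item.
        if self.head == self.tail:
            return None
        item = self.slots[self.head % self.depth]
        self.head += 1  # advancing head is what frees a slot (a "credit")
        return item
```

Keeping the pointers monotonic and wrapping only when indexing makes the full and empty conditions unambiguous without a separate count field.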

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after
  each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to
  prevent stale data from corrupting the MemoryStore snapshot.
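
As a rough illustration of cursor-based replay in t_start order, the sketch below replays only the ops recorded since the previous flush. The names `ReplayLog`, `record`, and `flush`, and the dict standing in for MemoryStore, are assumptions made for this example and do not come from the commit.

```python
class ReplayLog:
    """Hypothetical op log with an incremental replay cursor."""

    def __init__(self) -> None:
        self.ops = []    # entries: (t_start, kind, src_key, dst_key)
        self.cursor = 0  # number of ops already replayed

    def record(self, t_start, kind, src_key, dst_key) -> None:
        self.ops.append((t_start, kind, src_key, dst_key))

    def flush(self, memory: dict) -> int:
        # Replay only the ops appended since the last flush, ordered by
        # their start time rather than by the order they were recorded.
        pending = sorted(self.ops[self.cursor:], key=lambda op: op[0])
        for _t_start, _kind, src_key, dst_key in pending:
            memory[dst_key] = memory[src_key]
        self.cursor = len(self.ops)
        return len(pending)


memory = {"a": 1, "b": 0, "c": 0}
log = ReplayLog()
log.record(5, "ipcq_copy", "b", "c")  # recorded first, starts later
log.record(2, "dma_write", "a", "b")  # recorded second, starts earlier
replayed = log.flush(memory)
```

Because the dma_write starts earlier, it is replayed before the ipcq_copy, so the copy observes the written value. A second flush with no new ops replays nothing.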

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.
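
The operator-overloading bullet can be pictured with a small sketch of dispatch through a thread-local active context. The classes `Ctx` and `Handle` are hypothetical stand-ins for TLContext and TensorHandle, and the emitted tuples stand in for MathCmd objects.

```python
import threading

_active = threading.local()  # holds the context of the kernel now running


class Ctx:
    """Hypothetical stand-in for TLContext: records emitted math ops."""

    def __init__(self) -> None:
        self.emitted = []

    def binary(self, op, a, b):
        self.emitted.append((op, a.name, b.name))
        return Handle(f"t{len(self.emitted)}")


class Handle:
    """Hypothetical stand-in for TensorHandle."""

    def __init__(self, name: str) -> None:
        self.name = name

    def _dispatch(self, op, other):
        ctx = getattr(_active, "ctx", None)
        if ctx is None:
            raise RuntimeError("no active context")
        return ctx.binary(op, self, other)

    def __add__(self, other):
        return self._dispatch("add", other)

    def __sub__(self, other):
        return self._dispatch("sub", other)

    def __mul__(self, other):
        return self._dispatch("mul", other)

    def __truediv__(self, other):
        return self._dispatch("div", other)


ctx = Ctx()
_active.ctx = ctx  # must be set before each kernel resumes
a, b = Handle("a"), Handle("b")
c = (a + b) * a
```

Since several kernels may share one thread, the active context has to be re-set every time a kernel is resumed; otherwise an operator would dispatch to whichever context ran last.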

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier — only real PyTorch names.
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.
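
A minimal sketch of the world_size derivation, assuming the override may only shrink the world and never exceed the topology. The function name `derive_world_size` and that bound check are assumptions for illustration; the commit does not state how an out-of-range override is handled.

```python
def derive_world_size(sips, cubes, pes_per_cube, override=None):
    """Hypothetical helper: topology product with an optional override."""
    total = sips * cubes * pes_per_cube
    if override is None:
        return total
    # Assumption: an override must select between 1 and `total` ranks.
    if not 1 <= override <= total:
        raise ValueError(f"override {override} outside 1..{total}")
    return override
```

For example, a topology of 2 SIPs, 2 cubes and 4 PEs per cube gives 16 ranks, and an algorithm that needs only 7 of them would pass `override=7`.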

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.
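
For readers unfamiliar with the ring variant, here is a host-side simulation of the textbook ring allreduce (reduce-scatter followed by all-gather). It is not the code in ring_allreduce; the function name and the use of plain Python lists are choices made for this sketch, and the buffer length is assumed to divide evenly by the number of ranks.

```python
def ring_allreduce(buffers):
    """Simulate a sum allreduce over a ring; returns one list per rank."""
    n = len(buffers)
    length = len(buffers[0])
    assert length % n == 0, "sketch assumes evenly divisible chunks"
    chunk = length // n
    data = [list(b) for b in buffers]

    def span(idx):
        start = (idx % n) * chunk
        return range(start, start + chunk)

    # Reduce-scatter: at step s, rank r sends chunk (r - s) to rank r + 1,
    # which accumulates it. After n - 1 steps rank r owns the full sum of
    # chunk (r + 1) mod n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step are simultaneous.
        outgoing = [[data[r][i] for i in span(r - step)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for k, i in enumerate(span(r - step)):
                data[dst][i] += outgoing[r][k]

    # All-gather: at step s, rank r forwards chunk (r + 1 - s) to rank
    # r + 1, which overwrites its copy with the reduced values.
    for step in range(n - 1):
        outgoing = [[data[r][i] for i in span(r + 1 - step)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for k, i in enumerate(span(r + 1 - step)):
                data[dst][i] = outgoing[r][k]
    return data


result = ring_allreduce([[r + 1] * 8 for r in range(4)])
```

Each rank sends and receives 2 × (n − 1) chunks in total, which is why the ring form keeps per-link traffic independent of the number of ranks.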

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases
  (ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 19:36:59 -07:00
parent ff2c677a9c
commit 998cc85762
60 changed files with 9196 additions and 80 deletions
+115 -9
@@ -15,6 +15,7 @@ from typing import TYPE_CHECKING, Any
import simpy
from greenlet import greenlet
from kernbench.common.ipcq_types import IpcqRecvCmd, IpcqRequest, IpcqSendCmd, RecvFuture
from kernbench.common.pe_commands import (
CompletionHandle,
CompositeCmd,
@@ -51,6 +52,9 @@ class KernelRunner:
out_ports: dict[str, simpy.Store],
store: MemoryStore | None = None,
num_cubes: int = 1,
ipcq_id: str | None = None,
scratch_base: int = 0,
scratch_size: int = 1 << 20,
) -> None:
self._pe_prefix = pe_prefix
self._pe_idx = pe_idx
@@ -61,6 +65,13 @@ class KernelRunner:
self._out_ports = out_ports
self._store = store
self._parent: greenlet | None = None
# IPCQ port id (ADR-0023). Defaults to "<pe_prefix>.pe_ipcq" when not
# given; IPCQ send/recv commands raise if that port is not wired.
self._ipcq_id = ipcq_id or f"{pe_prefix}.pe_ipcq"
# PE-local scratch for compute output TensorHandles (ADR-0020 D3
# extension). The TLContext allocates from this pool when math/dot
# ops produce a result that may later be used as a send/store source.
self._scratch_base = scratch_base
self._scratch_size = scratch_size
def run(
self,
@@ -89,7 +100,10 @@ class KernelRunner:
num_cubes=self._num_cubes,
dispatch_cycles=0,
runner=self,
scratch_base=self._scratch_base,
scratch_size=self._scratch_size,
)
self._tl = tl # exposed so switch_to_simpy can re-set on restore
def _kernel_entry():
TLContext._set_active(tl) # type: ignore[attr-defined]
@@ -103,13 +117,20 @@ class KernelRunner:
pending: dict[str, simpy.Event] = {}
composite_results: list[dict] = []
# Helper: set our tl as active just before resuming the kernel.
# Multiple PE kernel runners share the same thread-local; without
# this, another runner's kernel may have left a different context.
def _switch_kernel(*args):
TLContext._set_active(tl) # type: ignore[attr-defined]
return g.switch(*args)
# Start kernel — first switch returns first command (or None if kernel is done)
cmd = g.switch()
cmd = _switch_kernel()
while cmd is not None:
if isinstance(cmd, PeCpuOverheadCmd):
yield env.timeout(cmd.cycles)
cmd = g.switch()
cmd = _switch_kernel()
elif isinstance(cmd, WaitCmd):
if cmd.handle is not None:
@@ -120,7 +141,7 @@ class KernelRunner:
for evt in pending.values():
yield evt
pending.clear()
cmd = g.switch()
cmd = _switch_kernel()
elif isinstance(cmd, DmaReadCmd):
# Dispatch DMA through SimPy components
@@ -141,10 +162,12 @@ class KernelRunner:
)
except KeyError:
pass
cmd = g.switch(data)
cmd = _switch_kernel(data)
elif isinstance(cmd, DmaWriteCmd):
# Write to MemoryStore first (visibility = issue, ADR-0020 D3)
# Write to MemoryStore first (visibility = issue, ADR-0020 D3).
# When data is None (e.g. timing-only TensorHandle math result),
# this is a no-op; Phase 2 dma_write replay handles those.
if self._store is not None and cmd.handle.data is not None:
self._store.write("hbm", cmd.dst_addr, cmd.handle.data)
@@ -154,7 +177,7 @@ class KernelRunner:
)
yield self._out_ports[self._scheduler_id].put(pe_txn)
yield done_evt
cmd = g.switch()
cmd = _switch_kernel()
elif isinstance(cmd, CompositeCmd):
# Non-blocking composite
@@ -165,7 +188,7 @@ class KernelRunner:
composite_results.append(pe_txn.result_data)
yield self._out_ports[self._scheduler_id].put(pe_txn)
pending[cmd.completion.id] = done_evt
cmd = g.switch()
cmd = _switch_kernel()
elif isinstance(cmd, (GemmCmd, MathCmd)):
# Blocking compute command
@@ -175,7 +198,90 @@ class KernelRunner:
)
yield self._out_ports[self._scheduler_id].put(pe_txn)
yield done_evt
cmd = g.switch()
cmd = _switch_kernel()
elif isinstance(cmd, IpcqSendCmd):
# Forward IpcqRequest to PE_IPCQ, wait for done
if self._ipcq_id not in self._out_ports:
raise RuntimeError(
f"PE_IPCQ port {self._ipcq_id!r} not wired to runner"
)
done_evt = env.event()
req = IpcqRequest(command=cmd, done=done_evt)
yield self._out_ports[self._ipcq_id].put(req)
yield done_evt
cmd = _switch_kernel()
elif isinstance(cmd, IpcqRecvCmd):
if self._ipcq_id not in self._out_ports:
raise RuntimeError(
f"PE_IPCQ port {self._ipcq_id!r} not wired to runner"
)
done_evt = env.event()
req = IpcqRequest(command=cmd, done=done_evt)
yield self._out_ports[self._ipcq_id].put(req)
yield done_evt
# Read actual data from MemoryStore at the slot address
data = None
src_space = req.result_data.get("src_space", "tcm")
src_addr = req.result_data.get("src_addr", 0)
if self._store is not None:
try:
data = self._store.read(
src_space, src_addr,
shape=cmd.shape, dtype=cmd.dtype,
)
except KeyError:
pass
# Build result dict for tl.recv to wrap in TensorHandle
result = {
"data": data,
"src_space": src_space,
"src_addr": src_addr,
"direction": req.result_data.get("direction", cmd.direction),
"dtype": cmd.dtype,
"shape": cmd.shape,
"nbytes": req.result_data.get("nbytes", 0),
}
cmd = _switch_kernel(result)
elif isinstance(cmd, tuple) and len(cmd) == 2 and cmd[0] == "recv_async":
# Non-blocking recv: post the IpcqRequest now, store the
# event in the future, return None to kernel.
future: RecvFuture = cmd[1]
done_evt = env.event()
req = IpcqRequest(command=future.cmd, done=done_evt)
future.request = req
future.event = done_evt
yield self._out_ports[self._ipcq_id].put(req)
cmd = _switch_kernel(None)
elif isinstance(cmd, tuple) and len(cmd) == 2 and cmd[0] == "recv_wait":
future = cmd[1]
if not future.event.triggered:
yield future.event
req = future.request
src_space = req.result_data.get("src_space", "tcm")
src_addr = req.result_data.get("src_addr", 0)
data = None
if self._store is not None:
try:
data = self._store.read(
src_space, src_addr,
shape=future.cmd.shape, dtype=future.cmd.dtype,
)
except KeyError:
pass
result = {
"data": data,
"src_space": src_space,
"src_addr": src_addr,
"direction": req.result_data.get("direction", future.cmd.direction),
"dtype": future.cmd.dtype,
"shape": future.cmd.shape,
"nbytes": req.result_data.get("nbytes", 0),
}
cmd = _switch_kernel(result)
else:
# Unknown command — pass through as blocking
@@ -185,7 +291,7 @@ class KernelRunner:
)
yield self._out_ports[self._scheduler_id].put(pe_txn)
yield done_evt
cmd = g.switch()
cmd = _switch_kernel()
# Wait remaining pending composites
for evt in pending.values():
+263 -12
@@ -17,6 +17,7 @@ from __future__ import annotations
import math
from typing import Literal
from kernbench.common.ipcq_types import IpcqRecvCmd, IpcqSendCmd, RecvFuture
from kernbench.common.pe_commands import (
CompletionHandle,
CompositeCmd,
@@ -55,6 +56,8 @@ class TLContext:
runner: Any = None,
cube_id: int = 0,
num_cubes: int = 1,
scratch_base: int = 0,
scratch_size: int = 1 << 20, # 1 MiB per kernel invocation
) -> None:
self._pe_id = pe_id
self._num_programs = num_programs
@@ -65,6 +68,33 @@ class TLContext:
self._handle_counter = 0
self._completion_counter = 0
self._runner = runner # KernelRunner for greenlet mode (ADR-0020 D3)
# PE-local scratch allocator for math/compute output handles.
# Each binary/unary/reduction op auto-allocates a unique addr from
# this pool so the resulting TensorHandle can be the source of a
# later tl.send / tl.store. Cursor resets on every kernel invocation.
self._scratch_base = scratch_base
self._scratch_size = scratch_size
self._scratch_cursor = 0
def _scratch_alloc(self, nbytes: int) -> int:
"""Allocate a unique scratch address for an output TensorHandle.
Returns 0 if no scratch base was configured (e.g. command-list mode);
in that case the resulting handle has addr=0 and cannot be used as a
send/store source. Greenlet/runner mode always supplies a base.
"""
if self._scratch_base == 0:
return 0
# 16-byte alignment
aligned = (nbytes + 15) & ~15
addr = self._scratch_base + self._scratch_cursor
self._scratch_cursor += aligned
if self._scratch_cursor > self._scratch_size:
raise RuntimeError(
f"TLContext scratch overflow: requested {nbytes}B, "
f"used {self._scratch_cursor}/{self._scratch_size}B"
)
return addr
@property
def commands(self) -> list[PeCommand]:
@@ -93,11 +123,30 @@ class TLContext:
def _make_handle(
self, addr: int, shape: tuple[int, ...], dtype: str,
space: str = "tcm",
) -> TensorHandle:
return TensorHandle(
id=self._next_handle_id(),
addr=addr, shape=shape, dtype=dtype,
nbytes=self._nbytes(shape, dtype),
space=space,
)
def _make_compute_out(
self, shape: tuple[int, ...], dtype: str,
) -> TensorHandle:
"""Allocate an output TensorHandle in PE-local scratch (TCM space).
Used by math/compute ops so the result has a real address that can
be the source of a later send/store. The data field stays None in
Phase 1 — Phase 2 DataExecutor fills the actual ndarray.
"""
nbytes = self._nbytes(shape, dtype)
addr = self._scratch_alloc(nbytes)
return TensorHandle(
id=self._next_handle_id(),
addr=addr, shape=shape, dtype=dtype,
nbytes=nbytes, space="tcm",
)
# ── Reference (no DMA, metadata only) ────────────────────────
@@ -124,20 +173,26 @@ class TLContext:
def load(
self, ptr: int, shape: tuple[int, ...], dtype: str = "f16",
) -> TensorHandle:
"""Load tensor from HBM to TCM. Returns TensorHandle.
"""Load tensor from HBM. Returns TensorHandle pointing at HBM[ptr].
In greenlet mode: returns TensorHandle with actual numpy data.
In command-list mode: returns TensorHandle with data=None.
The returned handle's ``space`` is "hbm" so subsequent ops (math,
send, store) using this handle as a source resolve via MemoryStore
at ``(hbm, ptr)`` — which is where the load's underlying data
actually lives in Phase 2 storage.
"""
self._emit_dispatch_overhead()
handle = self._make_handle(addr=ptr, shape=shape, dtype=dtype)
handle = self._make_handle(addr=ptr, shape=shape, dtype=dtype, space="hbm")
cmd = DmaReadCmd(handle=handle, src_addr=ptr, nbytes=handle.nbytes)
data = self._emit(cmd)
if data is not None:
# Greenlet mode: attach real data to handle
# Greenlet mode: attach real data to handle (preserve space)
return TensorHandle(
id=handle.id, addr=handle.addr, shape=handle.shape,
dtype=handle.dtype, nbytes=handle.nbytes, data=data,
space=handle.space,
)
return handle
@@ -162,7 +217,7 @@ class TLContext:
raise ValueError(f"dot shape mismatch: a.K={k} != b.K={k2}")
out_shape = (*a.shape[:-2], m, n)
out_dtype = a.dtype
out = self._make_handle(addr=0, shape=out_shape, dtype=out_dtype)
out = self._make_compute_out(shape=out_shape, dtype=out_dtype)
self._emit_dispatch_overhead()
self._emit(GemmCmd(a=a, b=b, out=out, m=m, k=k, n=n))
return out
@@ -170,7 +225,7 @@ class TLContext:
# ── MATH Engine: unary (blocking) ─────────────────────────────
def _unary_math(self, op: str, x: TensorHandle) -> TensorHandle:
out = self._make_handle(addr=0, shape=x.shape, dtype=x.dtype)
out = self._make_compute_out(shape=x.shape, dtype=x.dtype)
self._emit_dispatch_overhead()
self._emit(MathCmd(op=op, inputs=(x,), out=out))
return out
@@ -203,7 +258,7 @@ class TLContext:
) -> TensorHandle:
out_shape = list(x.shape)
out_shape[axis] = 1
out = self._make_handle(addr=0, shape=tuple(out_shape), dtype=x.dtype)
out = self._make_compute_out(shape=tuple(out_shape), dtype=x.dtype)
self._emit_dispatch_overhead()
self._emit(MathCmd(op=op, inputs=(x,), out=out, axis=axis))
return out
@@ -222,7 +277,7 @@ class TLContext:
def _binary_math(
self, op: str, a: TensorHandle, b: TensorHandle,
) -> TensorHandle:
out = self._make_handle(addr=0, shape=a.shape, dtype=a.dtype)
out = self._make_compute_out(shape=a.shape, dtype=a.dtype)
self._emit_dispatch_overhead()
self._emit(MathCmd(op=op, inputs=(a, b), out=out))
return out
@@ -230,15 +285,67 @@ class TLContext:
def where(
self, cond: TensorHandle, a: TensorHandle, b: TensorHandle,
) -> TensorHandle:
out = self._make_handle(addr=0, shape=a.shape, dtype=a.dtype)
out = self._make_compute_out(shape=a.shape, dtype=a.dtype)
self._emit_dispatch_overhead()
self._emit(MathCmd(op="where", inputs=(cond, a, b), out=out))
return out
def maximum(self, a: TensorHandle, b: TensorHandle) -> TensorHandle:
"""Element-wise max of two tensors (real Triton: tl.maximum)."""
return self._binary_math("maximum", a, b)
def minimum(self, a: TensorHandle, b: TensorHandle) -> TensorHandle:
"""Element-wise min of two tensors (real Triton: tl.minimum)."""
return self._binary_math("minimum", a, b)
def fma(
self, a: TensorHandle, b: TensorHandle, c: TensorHandle,
) -> TensorHandle:
"""Fused multiply-add: a * b + c (real Triton: tl.fma)."""
out = self._make_compute_out(shape=a.shape, dtype=a.dtype)
self._emit_dispatch_overhead()
self._emit(MathCmd(op="fma", inputs=(a, b, c), out=out))
return out
def clamp(
self,
x: TensorHandle,
min: TensorHandle,
max: TensorHandle,
) -> TensorHandle:
"""Clamp x to [min, max] (real Triton: tl.clamp)."""
out = self._make_compute_out(shape=x.shape, dtype=x.dtype)
self._emit_dispatch_overhead()
self._emit(MathCmd(op="clamp", inputs=(x, min, max), out=out))
return out
def softmax(self, x: TensorHandle, axis: int = -1) -> TensorHandle:
"""Numerically-stable softmax along ``axis`` (real Triton: tl.softmax).
Implemented as a single MathCmd (op="softmax") so timing accounts
for one MATH dispatch; Phase 2 DataExecutor expands it to the
canonical (x - max) → exp → sum → div sequence.
"""
out = self._make_compute_out(shape=x.shape, dtype=x.dtype)
self._emit_dispatch_overhead()
self._emit(MathCmd(op="softmax", inputs=(x,), out=out, axis=axis))
return out
# ── Scalar helpers (real Triton: tl.cdiv etc.) ────────────────
@staticmethod
def cdiv(a: int, b: int) -> int:
"""Ceiling division, equal to (a + b - 1) // b for b > 0 (real Triton: tl.cdiv).
Used by host/kernel grid math; not a tensor op, so no MathCmd
is emitted. Mirrors triton.cdiv.
"""
return -(-int(a) // int(b))
# ── Index / Scalar (PE_CPU, no engine) ────────────────────────
def program_id(self, axis: int = 0) -> int:
"""Return program instance index.
"""Return program instance index (ADR-0022).
axis=0: local PE id within cube.
axis=1: cube id.
@@ -248,7 +355,7 @@ class TLContext:
return self._pe_id
def num_programs(self, axis: int = 0) -> int:
"""Return total number of program instances.
"""Return total number of program instances (ADR-0022).
axis=0: num PEs per cube.
axis=1: num cubes.
@@ -284,6 +391,119 @@ class TLContext:
dtype=x.dtype, nbytes=x.nbytes, data=x.data,
)
# ── IPCQ (CCL) collective primitives (ADR-0023 D4) ────────────
def send(
self,
dir: str,
src: TensorHandle | None = None,
*,
src_addr: int | None = None,
nbytes: int | None = None,
shape: tuple[int, ...] | None = None,
dtype: str = "f16",
space: str = "tcm",
) -> None:
"""Send tensor data to the peer in the given direction.
Two calling forms:
tl.send(dir, handle) # use handle's metadata
tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
Blocking: returns when PE_IPCQ has accepted the request and
forwarded the IpcqDmaToken to PE_DMA. Backpressure may apply.
"""
if src is not None:
src_addr = src.addr
nbytes = src.nbytes
shape = src.shape
dtype = src.dtype
space = getattr(src, "space", space)
if src_addr is None or nbytes is None or shape is None:
raise ValueError("tl.send: provide either a TensorHandle or src_addr/nbytes/shape")
self._emit_dispatch_overhead()
cmd = IpcqSendCmd(
direction=dir,
src_addr=src_addr, src_space=space,
nbytes=nbytes, shape=shape, dtype=dtype,
handle_id=self._next_handle_id(),
)
self._emit(cmd)
def recv(
self,
dir: str | None = None,
shape: tuple[int, ...] = (),
dtype: str = "f16",
space: str = "tcm",
dst_addr: int | None = None,
dst_space: str | None = None,
) -> TensorHandle:
"""Receive tensor data from a peer.
Args:
dir: specific direction (e.g. "W"), or None for round-robin.
shape, dtype: expected tensor metadata.
dst_addr / dst_space: if both are provided, the slot data is
copied to (dst_space, dst_addr) before the handle is
returned ("copy_to_dst" mode). Otherwise the slot address
is returned directly ("return_slot" mode).
Returns:
TensorHandle pointing to the slot (or dst) where the data has
arrived. In greenlet/runner mode, ``handle.data`` carries the
actual ndarray; in command-list mode the handle is a placeholder.
"""
self._emit_dispatch_overhead()
if dst_addr is not None and dst_space is not None:
cmd = IpcqRecvCmd(
direction=dir,
shape=shape, dtype=dtype,
handle_id=self._next_handle_id(),
recv_mode="copy_to_dst",
dst_addr=dst_addr, dst_space=dst_space,
)
else:
cmd = IpcqRecvCmd(
direction=dir,
shape=shape, dtype=dtype,
handle_id=self._next_handle_id(),
)
result = self._emit(cmd)
if isinstance(result, dict):
slot_addr = int(result.get("src_addr", 0))
slot_space = str(result.get("src_space", "tcm"))
data = result.get("data")
return TensorHandle(
id=self._next_handle_id(),
addr=slot_addr,
shape=shape,
dtype=dtype,
nbytes=self._nbytes(shape, dtype),
data=data,
space=slot_space,
)
return self._make_handle(addr=0, shape=shape, dtype=dtype)
def recv_async(
self,
dir: str,
shape: tuple[int, ...] = (),
dtype: str = "f16",
) -> "RecvFuture":
"""Non-blocking recv. Returns a future to pass into ``tl.wait``."""
self._emit_dispatch_overhead()
cmd = IpcqRecvCmd(
direction=dir,
shape=shape, dtype=dtype,
handle_id=self._next_handle_id(),
blocking=False,
)
future = RecvFuture(cmd=cmd)
if self._runner is not None:
self._runner.switch_to_simpy(("recv_async", future))
return future
# ── Composite + Control ───────────────────────────────────────
def composite(
@@ -316,9 +536,40 @@ class TLContext:
))
return completion
def wait(self, handle: CompletionHandle | None = None) -> None:
"""Wait for a specific composite or all pending composites."""
def wait(self, handle: "CompletionHandle | RecvFuture | None" = None) -> Any:
"""Wait for a composite, a recv future, or all pending composites.
- ``CompletionHandle`` (or None): wait for composite completion.
- ``RecvFuture``: wait for a non-blocking ``recv_async`` to finish.
Returns the resolved ``TensorHandle``.
"""
if isinstance(handle, RecvFuture):
if handle.resolved:
return handle.result
if self._runner is None:
raise RuntimeError(
"tl.wait(RecvFuture) requires runner mode (greenlet)"
)
result_dict = self._runner.switch_to_simpy(("recv_wait", handle))
slot_addr = int(result_dict.get("src_addr", 0))
slot_space = str(result_dict.get("src_space", "tcm"))
data = result_dict.get("data")
th = TensorHandle(
id=self._next_handle_id(),
addr=slot_addr,
shape=handle.cmd.shape,
dtype=handle.cmd.dtype,
nbytes=self._nbytes(handle.cmd.shape, handle.cmd.dtype),
data=data,
space=slot_space,
)
handle.resolved = True
handle.result = th
return th
# Composite path (existing behaviour)
self._emit(WaitCmd(handle=handle))
return None
def cycles(self, n: int) -> None:
"""Declare PE_CPU scalar execution overhead (cycles)."""