Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)
Major changes:
PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
including in-flight data snapshot (D9) and op_log recording at
outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.
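As an illustration of the head/tail and backpressure items above, here is a minimal sketch of a fixed-capacity ring queue. The class and method names (`RingQueue`, `try_push`, `pop`) are hypothetical and are not taken from PE_IPCQ; the sketch assumes a single sender and a single receiver per direction.

```python
class RingQueue:
    """Fixed-capacity ring tracked by monotonically increasing head/tail counters."""

    def __init__(self, n_slots: int) -> None:
        self.n_slots = n_slots
        self.slots = [None] * n_slots
        self.head = 0  # count of items consumed by the receiver
        self.tail = 0  # count of items produced by the sender

    def is_full(self) -> bool:
        return self.tail - self.head == self.n_slots

    def is_empty(self) -> bool:
        return self.tail == self.head

    def try_push(self, item) -> bool:
        """Return False when full so the sender can poll again or sleep."""
        if self.is_full():
            return False
        self.slots[self.tail % self.n_slots] = item
        self.tail += 1
        return True

    def pop(self):
        """Consume the oldest item; advancing head frees one slot (a credit)."""
        if self.is_empty():
            raise IndexError("pop from empty ring")
        item = self.slots[self.head % self.n_slots]
        self.head += 1
        return item


q = RingQueue(n_slots=2)
accepted = [q.try_push(x) for x in ("a", "b", "c")]  # third push hits backpressure
first = q.pop()                                      # frees one slot
retry = q.try_push("c")                              # now succeeds
```

The sender never overwrites an unread slot because it only writes while `tail - head < n_slots`; the receiver advancing `head` is what the credit return communicates back.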
Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after
each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to
prevent stale data from corrupting the MemoryStore snapshot.
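The cursor-based replay described above can be sketched as follows. This is an assumption-laden toy: `Op`, `Replayer`, and the dict-backed memory are hypothetical stand-ins, not the DataExecutor or MemoryStore interfaces, and it assumes ordering by `t_start` only needs to hold within each flush window.

```python
from dataclasses import dataclass


@dataclass
class Op:
    t_start: int
    kind: str  # "dma_write" or "ipcq_copy"
    src: str
    dst: str


@dataclass
class Replayer:
    memory: dict
    cursor: int = 0  # index of the first op_log entry not yet replayed

    def flush(self, op_log: list) -> int:
        """Replay entries appended since the last flush, ordered by t_start."""
        pending = sorted(op_log[self.cursor:], key=lambda op: op.t_start)
        for op in pending:
            self.memory[op.dst] = self.memory[op.src]
        self.cursor = len(op_log)
        return len(pending)


mem = {"a": 7, "b": 0, "c": 0}
log = [Op(20, "ipcq_copy", "b", "c"), Op(10, "dma_write", "a", "b")]
replayer = Replayer(mem)
n_first = replayer.flush(log)   # replays t=10 first, then t=20
log.append(Op(30, "dma_write", "c", "a"))
n_second = replayer.flush(log)  # replays only the newly appended entry
```

Replaying in recording order instead of `t_start` order would copy the stale value of `b` into `c`, which is the kind of error the ordering rule is meant to avoid.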
TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.
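A minimal sketch of the scratch allocation and `cdiv` arithmetic listed above. `ScratchPool` and `align_up` are hypothetical names; unlike `TLContext._scratch_alloc` in this commit, the sketch checks for overflow before moving the cursor and does not treat a base of 0 as "unconfigured".

```python
def align_up(nbytes: int, alignment: int = 16) -> int:
    """Round nbytes up to the next multiple of a power-of-two alignment."""
    return (nbytes + alignment - 1) & ~(alignment - 1)


def cdiv(a: int, b: int) -> int:
    """Ceiling division, equivalent to (a + b - 1) // b for positive b."""
    return -(-a // b)


class ScratchPool:
    """Bump allocator: addresses only grow until the pool is reset."""

    def __init__(self, base: int, size: int) -> None:
        self.base = base
        self.size = size
        self.cursor = 0

    def alloc(self, nbytes: int) -> int:
        aligned = align_up(nbytes)
        if self.cursor + aligned > self.size:
            raise RuntimeError(f"scratch overflow: need {aligned}B at offset {self.cursor}")
        addr = self.base + self.cursor
        self.cursor += aligned
        return addr


pool = ScratchPool(base=0x1000, size=64)
first = pool.alloc(10)   # occupies 16 bytes after alignment
second = pool.alloc(20)  # occupies 32 bytes after alignment
```

Because every math result gets a distinct address, a later `tl.send` or `tl.store` can name it as a source without aliasing another intermediate.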
Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
get_rank, get_backend, all_reduce, barrier — only real PyTorch names.
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
with optional algorithm-level override in ccl.yaml.
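The world_size derivation above amounts to a product with an optional override. The helper below is a hypothetical sketch; the function name and the bounds check on the override are assumptions, not behaviour confirmed by this commit.

```python
def derive_world_size(sips, cubes, pes_per_cube, override=None):
    """Return sips * cubes * pes_per_cube unless an override is supplied."""
    topology_size = sips * cubes * pes_per_cube
    if override is None:
        return topology_size
    # Assumption: an override may shrink the world but not exceed the topology.
    if not 1 <= override <= topology_size:
        raise ValueError(f"override {override} outside 1..{topology_size}")
    return override


full = derive_world_size(sips=2, cubes=2, pes_per_cube=4)
capped = derive_world_size(sips=2, cubes=2, pes_per_cube=4, override=8)
```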
Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).
Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.
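For readers unfamiliar with the ring variant, the following host-side simulation shows the textbook reduce-scatter plus all-gather schedule on plain Python lists. It is a sketch only: `ring_allreduce_sim` is a hypothetical name, it does not use `tl.send`/`tl.recv`, and it is not the kernel shipped in the `ring_allreduce` module. It assumes `n_elem` is divisible by the number of ranks.

```python
def ring_allreduce_sim(buffers):
    """Sum-allreduce across ranks using N-1 reduce-scatter and N-1 all-gather steps."""
    n = len(buffers)
    n_elem = len(buffers[0])
    assert n_elem % n == 0, "sketch assumes n_elem divisible by world size"
    chunk = n_elem // n
    data = [list(b) for b in buffers]

    def span(idx):
        start = (idx % n) * chunk
        return range(start, start + chunk)

    def exchange(chunk_of, combine):
        # Snapshot every payload first so all ranks send "simultaneously".
        sends = [(r, chunk_of(r), [data[r][i] for i in span(chunk_of(r))])
                 for r in range(n)]
        for r, idx, payload in sends:
            dst = (r + 1) % n
            for i, v in zip(span(idx), payload):
                data[dst][i] = combine(data[dst][i], v)

    # Reduce-scatter: afterwards rank r holds the full sum of chunk (r + 1) % n.
    for step in range(n - 1):
        exchange(lambda r: r - step, lambda old, new: old + new)
    # All-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        exchange(lambda r: r + 1 - step, lambda old, new: new)
    return data


result = ring_allreduce_sim([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Each rank sends one chunk per step to its ring successor, so the per-rank traffic is 2 * (N - 1) chunks regardless of world size.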
Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.
Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases
(ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.
Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.
502 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -15,6 +15,7 @@ from typing import TYPE_CHECKING, Any
 import simpy
 from greenlet import greenlet
 
+from kernbench.common.ipcq_types import IpcqRecvCmd, IpcqRequest, IpcqSendCmd, RecvFuture
 from kernbench.common.pe_commands import (
     CompletionHandle,
     CompositeCmd,
@@ -51,6 +52,9 @@ class KernelRunner:
         out_ports: dict[str, simpy.Store],
         store: MemoryStore | None = None,
         num_cubes: int = 1,
+        ipcq_id: str | None = None,
+        scratch_base: int = 0,
+        scratch_size: int = 1 << 20,
     ) -> None:
         self._pe_prefix = pe_prefix
         self._pe_idx = pe_idx
@@ -61,6 +65,13 @@ class KernelRunner:
         self._out_ports = out_ports
         self._store = store
         self._parent: greenlet | None = None
+        # Optional IPCQ port (ADR-0023). If None, IPCQ commands raise.
+        self._ipcq_id = ipcq_id or f"{pe_prefix}.pe_ipcq"
+        # PE-local scratch for compute output TensorHandles (ADR-0020 D3
+        # extension). The TLContext allocates from this pool when math/dot
+        # ops produce a result that may later be used as a send/store source.
+        self._scratch_base = scratch_base
+        self._scratch_size = scratch_size
 
     def run(
         self,
@@ -89,7 +100,10 @@ class KernelRunner:
             num_cubes=self._num_cubes,
             dispatch_cycles=0,
             runner=self,
+            scratch_base=self._scratch_base,
+            scratch_size=self._scratch_size,
         )
+        self._tl = tl  # exposed so switch_to_simpy can re-set on restore
 
         def _kernel_entry():
             TLContext._set_active(tl)  # type: ignore[attr-defined]
@@ -103,13 +117,20 @@ class KernelRunner:
         pending: dict[str, simpy.Event] = {}
         composite_results: list[dict] = []
 
+        # Helper: set our tl as active just before resuming the kernel.
+        # Multiple PE kernel runners share the same thread-local; without
+        # this, another runner's kernel may have left a different context.
+        def _switch_kernel(*args):
+            TLContext._set_active(tl)  # type: ignore[attr-defined]
+            return g.switch(*args)
+
         # Start kernel — first switch returns first command (or None if kernel is done)
-        cmd = g.switch()
+        cmd = _switch_kernel()
 
         while cmd is not None:
             if isinstance(cmd, PeCpuOverheadCmd):
                 yield env.timeout(cmd.cycles)
-                cmd = g.switch()
+                cmd = _switch_kernel()
 
             elif isinstance(cmd, WaitCmd):
                 if cmd.handle is not None:
@@ -120,7 +141,7 @@ class KernelRunner:
                     for evt in pending.values():
                         yield evt
                     pending.clear()
-                cmd = g.switch()
+                cmd = _switch_kernel()
 
             elif isinstance(cmd, DmaReadCmd):
                 # Dispatch DMA through SimPy components
@@ -141,10 +162,12 @@ class KernelRunner:
                         )
                     except KeyError:
                         pass
-                cmd = g.switch(data)
+                cmd = _switch_kernel(data)
 
             elif isinstance(cmd, DmaWriteCmd):
-                # Write to MemoryStore first (visibility = issue, ADR-0020 D3)
+                # Write to MemoryStore first (visibility = issue, ADR-0020 D3).
+                # When data is None (e.g. timing-only TensorHandle math result),
+                # this is a no-op; Phase 2 dma_write replay handles those.
                 if self._store is not None and cmd.handle.data is not None:
                     self._store.write("hbm", cmd.dst_addr, cmd.handle.data)
 
@@ -154,7 +177,7 @@ class KernelRunner:
                 )
                 yield self._out_ports[self._scheduler_id].put(pe_txn)
                 yield done_evt
-                cmd = g.switch()
+                cmd = _switch_kernel()
 
             elif isinstance(cmd, CompositeCmd):
                 # Non-blocking composite
@@ -165,7 +188,7 @@ class KernelRunner:
                 composite_results.append(pe_txn.result_data)
                 yield self._out_ports[self._scheduler_id].put(pe_txn)
                 pending[cmd.completion.id] = done_evt
-                cmd = g.switch()
+                cmd = _switch_kernel()
 
             elif isinstance(cmd, (GemmCmd, MathCmd)):
                 # Blocking compute command
@@ -175,7 +198,90 @@ class KernelRunner:
                 )
                 yield self._out_ports[self._scheduler_id].put(pe_txn)
                 yield done_evt
-                cmd = g.switch()
+                cmd = _switch_kernel()
 
+            elif isinstance(cmd, IpcqSendCmd):
+                # Forward IpcqRequest to PE_IPCQ, wait for done
+                if self._ipcq_id not in self._out_ports:
+                    raise RuntimeError(
+                        f"PE_IPCQ port {self._ipcq_id!r} not wired to runner"
+                    )
+                done_evt = env.event()
+                req = IpcqRequest(command=cmd, done=done_evt)
+                yield self._out_ports[self._ipcq_id].put(req)
+                yield done_evt
+                cmd = _switch_kernel()
+
+            elif isinstance(cmd, IpcqRecvCmd):
+                if self._ipcq_id not in self._out_ports:
+                    raise RuntimeError(
+                        f"PE_IPCQ port {self._ipcq_id!r} not wired to runner"
+                    )
+                done_evt = env.event()
+                req = IpcqRequest(command=cmd, done=done_evt)
+                yield self._out_ports[self._ipcq_id].put(req)
+                yield done_evt
+                # Read actual data from MemoryStore at the slot address
+                data = None
+                src_space = req.result_data.get("src_space", "tcm")
+                src_addr = req.result_data.get("src_addr", 0)
+                if self._store is not None:
+                    try:
+                        data = self._store.read(
+                            src_space, src_addr,
+                            shape=cmd.shape, dtype=cmd.dtype,
+                        )
+                    except KeyError:
+                        pass
+                # Build result dict for tl.recv to wrap in TensorHandle
+                result = {
+                    "data": data,
+                    "src_space": src_space,
+                    "src_addr": src_addr,
+                    "direction": req.result_data.get("direction", cmd.direction),
+                    "dtype": cmd.dtype,
+                    "shape": cmd.shape,
+                    "nbytes": req.result_data.get("nbytes", 0),
+                }
+                cmd = _switch_kernel(result)
+
+            elif isinstance(cmd, tuple) and len(cmd) == 2 and cmd[0] == "recv_async":
+                # Non-blocking recv: post the IpcqRequest now, store the
+                # event in the future, return None to kernel.
+                future: RecvFuture = cmd[1]
+                done_evt = env.event()
+                req = IpcqRequest(command=future.cmd, done=done_evt)
+                future.request = req
+                future.event = done_evt
+                yield self._out_ports[self._ipcq_id].put(req)
+                cmd = _switch_kernel(None)
+
+            elif isinstance(cmd, tuple) and len(cmd) == 2 and cmd[0] == "recv_wait":
+                future = cmd[1]
+                if not future.event.triggered:
+                    yield future.event
+                req = future.request
+                src_space = req.result_data.get("src_space", "tcm")
+                src_addr = req.result_data.get("src_addr", 0)
+                data = None
+                if self._store is not None:
+                    try:
+                        data = self._store.read(
+                            src_space, src_addr,
+                            shape=future.cmd.shape, dtype=future.cmd.dtype,
+                        )
+                    except KeyError:
+                        pass
+                result = {
+                    "data": data,
+                    "src_space": src_space,
+                    "src_addr": src_addr,
+                    "direction": req.result_data.get("direction", future.cmd.direction),
+                    "dtype": future.cmd.dtype,
+                    "shape": future.cmd.shape,
+                    "nbytes": req.result_data.get("nbytes", 0),
+                }
+                cmd = _switch_kernel(result)
+
             else:
                 # Unknown command — pass through as blocking
@@ -185,7 +291,7 @@ class KernelRunner:
                 )
                 yield self._out_ports[self._scheduler_id].put(pe_txn)
                 yield done_evt
-                cmd = g.switch()
+                cmd = _switch_kernel()
 
         # Wait remaining pending composites
         for evt in pending.values():
@@ -17,6 +17,7 @@ from __future__ import annotations
 import math
 from typing import Literal
 
+from kernbench.common.ipcq_types import IpcqRecvCmd, IpcqSendCmd, RecvFuture
 from kernbench.common.pe_commands import (
     CompletionHandle,
     CompositeCmd,
@@ -55,6 +56,8 @@ class TLContext:
         runner: Any = None,
         cube_id: int = 0,
         num_cubes: int = 1,
+        scratch_base: int = 0,
+        scratch_size: int = 1 << 20,  # 1 MiB per kernel invocation
     ) -> None:
         self._pe_id = pe_id
         self._num_programs = num_programs
@@ -65,6 +68,33 @@ class TLContext:
         self._handle_counter = 0
         self._completion_counter = 0
         self._runner = runner  # KernelRunner for greenlet mode (ADR-0020 D3)
+        # PE-local scratch allocator for math/compute output handles.
+        # Each binary/unary/reduction op auto-allocates a unique addr from
+        # this pool so the resulting TensorHandle can be the source of a
+        # later tl.send / tl.store. Cursor resets on every kernel invocation.
+        self._scratch_base = scratch_base
+        self._scratch_size = scratch_size
+        self._scratch_cursor = 0
+
+    def _scratch_alloc(self, nbytes: int) -> int:
+        """Allocate a unique scratch address for an output TensorHandle.
+
+        Returns 0 if no scratch base was configured (e.g. command-list mode);
+        in that case the resulting handle has addr=0 and cannot be used as a
+        send/store source. Greenlet/runner mode always supplies a base.
+        """
+        if self._scratch_base == 0:
+            return 0
+        # 16-byte alignment
+        aligned = (nbytes + 15) & ~15
+        addr = self._scratch_base + self._scratch_cursor
+        self._scratch_cursor += aligned
+        if self._scratch_cursor > self._scratch_size:
+            raise RuntimeError(
+                f"TLContext scratch overflow: requested {nbytes}B, "
+                f"used {self._scratch_cursor}/{self._scratch_size}B"
+            )
+        return addr
 
     @property
     def commands(self) -> list[PeCommand]:
@@ -93,11 +123,30 @@ class TLContext:
 
     def _make_handle(
         self, addr: int, shape: tuple[int, ...], dtype: str,
+        space: str = "tcm",
     ) -> TensorHandle:
         return TensorHandle(
             id=self._next_handle_id(),
             addr=addr, shape=shape, dtype=dtype,
             nbytes=self._nbytes(shape, dtype),
+            space=space,
         )
 
+    def _make_compute_out(
+        self, shape: tuple[int, ...], dtype: str,
+    ) -> TensorHandle:
+        """Allocate an output TensorHandle in PE-local scratch (TCM space).
+
+        Used by math/compute ops so the result has a real address that can
+        be the source of a later send/store. The data field stays None in
+        Phase 1 — Phase 2 DataExecutor fills the actual ndarray.
+        """
+        nbytes = self._nbytes(shape, dtype)
+        addr = self._scratch_alloc(nbytes)
+        return TensorHandle(
+            id=self._next_handle_id(),
+            addr=addr, shape=shape, dtype=dtype,
+            nbytes=nbytes, space="tcm",
+        )
+
     # ── Reference (no DMA, metadata only) ────────────────────────
@@ -124,20 +173,26 @@ class TLContext:
     def load(
         self, ptr: int, shape: tuple[int, ...], dtype: str = "f16",
     ) -> TensorHandle:
-        """Load tensor from HBM to TCM. Returns TensorHandle.
+        """Load tensor from HBM. Returns TensorHandle pointing at HBM[ptr].
 
         In greenlet mode: returns TensorHandle with actual numpy data.
         In command-list mode: returns TensorHandle with data=None.
+
+        The returned handle's ``space`` is "hbm" so subsequent ops (math,
+        send, store) using this handle as a source resolve via MemoryStore
+        at ``(hbm, ptr)`` — which is where the load's underlying data
+        actually lives in Phase 2 storage.
         """
         self._emit_dispatch_overhead()
-        handle = self._make_handle(addr=ptr, shape=shape, dtype=dtype)
+        handle = self._make_handle(addr=ptr, shape=shape, dtype=dtype, space="hbm")
         cmd = DmaReadCmd(handle=handle, src_addr=ptr, nbytes=handle.nbytes)
         data = self._emit(cmd)
         if data is not None:
-            # Greenlet mode: attach real data to handle
+            # Greenlet mode: attach real data to handle (preserve space)
             return TensorHandle(
                 id=handle.id, addr=handle.addr, shape=handle.shape,
                 dtype=handle.dtype, nbytes=handle.nbytes, data=data,
+                space=handle.space,
             )
         return handle
 
@@ -162,7 +217,7 @@ class TLContext:
             raise ValueError(f"dot shape mismatch: a.K={k} != b.K={k2}")
         out_shape = (*a.shape[:-2], m, n)
         out_dtype = a.dtype
-        out = self._make_handle(addr=0, shape=out_shape, dtype=out_dtype)
+        out = self._make_compute_out(shape=out_shape, dtype=out_dtype)
         self._emit_dispatch_overhead()
         self._emit(GemmCmd(a=a, b=b, out=out, m=m, k=k, n=n))
         return out
@@ -170,7 +225,7 @@ class TLContext:
     # ── MATH Engine: unary (blocking) ─────────────────────────────
 
     def _unary_math(self, op: str, x: TensorHandle) -> TensorHandle:
-        out = self._make_handle(addr=0, shape=x.shape, dtype=x.dtype)
+        out = self._make_compute_out(shape=x.shape, dtype=x.dtype)
         self._emit_dispatch_overhead()
         self._emit(MathCmd(op=op, inputs=(x,), out=out))
         return out
@@ -203,7 +258,7 @@ class TLContext:
     ) -> TensorHandle:
         out_shape = list(x.shape)
         out_shape[axis] = 1
-        out = self._make_handle(addr=0, shape=tuple(out_shape), dtype=x.dtype)
+        out = self._make_compute_out(shape=tuple(out_shape), dtype=x.dtype)
         self._emit_dispatch_overhead()
         self._emit(MathCmd(op=op, inputs=(x,), out=out, axis=axis))
         return out
@@ -222,7 +277,7 @@ class TLContext:
     def _binary_math(
         self, op: str, a: TensorHandle, b: TensorHandle,
     ) -> TensorHandle:
-        out = self._make_handle(addr=0, shape=a.shape, dtype=a.dtype)
+        out = self._make_compute_out(shape=a.shape, dtype=a.dtype)
         self._emit_dispatch_overhead()
         self._emit(MathCmd(op=op, inputs=(a, b), out=out))
         return out
@@ -230,15 +285,67 @@ class TLContext:
     def where(
         self, cond: TensorHandle, a: TensorHandle, b: TensorHandle,
     ) -> TensorHandle:
-        out = self._make_handle(addr=0, shape=a.shape, dtype=a.dtype)
+        out = self._make_compute_out(shape=a.shape, dtype=a.dtype)
         self._emit_dispatch_overhead()
         self._emit(MathCmd(op="where", inputs=(cond, a, b), out=out))
         return out
 
+    def maximum(self, a: TensorHandle, b: TensorHandle) -> TensorHandle:
+        """Element-wise max of two tensors (real Triton: tl.maximum)."""
+        return self._binary_math("maximum", a, b)
+
+    def minimum(self, a: TensorHandle, b: TensorHandle) -> TensorHandle:
+        """Element-wise min of two tensors (real Triton: tl.minimum)."""
+        return self._binary_math("minimum", a, b)
+
+    def fma(
+        self, a: TensorHandle, b: TensorHandle, c: TensorHandle,
+    ) -> TensorHandle:
+        """Fused multiply-add: a * b + c (real Triton: tl.fma)."""
+        out = self._make_compute_out(shape=a.shape, dtype=a.dtype)
+        self._emit_dispatch_overhead()
+        self._emit(MathCmd(op="fma", inputs=(a, b, c), out=out))
+        return out
+
+    def clamp(
+        self,
+        x: TensorHandle,
+        min: TensorHandle,
+        max: TensorHandle,
+    ) -> TensorHandle:
+        """Clamp x to [min, max] (real Triton: tl.clamp)."""
+        out = self._make_compute_out(shape=x.shape, dtype=x.dtype)
+        self._emit_dispatch_overhead()
+        self._emit(MathCmd(op="clamp", inputs=(x, min, max), out=out))
+        return out
+
+    def softmax(self, x: TensorHandle, axis: int = -1) -> TensorHandle:
+        """Numerically-stable softmax along ``axis`` (real Triton: tl.softmax).
+
+        Implemented as a single MathCmd (op="softmax") so timing accounts
+        for one MATH dispatch; Phase 2 DataExecutor expands it to the
+        canonical (x - max) → exp → sum → div sequence.
+        """
+        out = self._make_compute_out(shape=x.shape, dtype=x.dtype)
+        self._emit_dispatch_overhead()
+        self._emit(MathCmd(op="softmax", inputs=(x,), out=out, axis=axis))
+        return out
+
+    # ── Scalar helpers (real Triton: tl.cdiv etc.) ────────────────
+
+    @staticmethod
+    def cdiv(a: int, b: int) -> int:
+        """Ceiling division: (a + b - 1) // b (real Triton: tl.cdiv).
+
+        Used by host/kernel grid math; not a tensor op, so no MathCmd
+        is emitted. Mirrors triton.cdiv.
+        """
+        return -(-int(a) // int(b))
+
     # ── Index / Scalar (PE_CPU, no engine) ────────────────────────
 
     def program_id(self, axis: int = 0) -> int:
-        """Return program instance index.
+        """Return program instance index (ADR-0022).
 
         axis=0: local PE id within cube.
         axis=1: cube id.
@@ -248,7 +355,7 @@ class TLContext:
         return self._pe_id
 
     def num_programs(self, axis: int = 0) -> int:
-        """Return total number of program instances.
+        """Return total number of program instances (ADR-0022).
 
         axis=0: num PEs per cube.
         axis=1: num cubes.
@@ -284,6 +391,119 @@ class TLContext:
             dtype=x.dtype, nbytes=x.nbytes, data=x.data,
         )
 
+    # ── IPCQ (CCL) collective primitives (ADR-0023 D4) ────────────
+
+    def send(
+        self,
+        dir: str,
+        src: TensorHandle | None = None,
+        *,
+        src_addr: int | None = None,
+        nbytes: int | None = None,
+        shape: tuple[int, ...] | None = None,
+        dtype: str = "f16",
+        space: str = "tcm",
+    ) -> None:
+        """Send tensor data to the peer in the given direction.
+
+        Two calling forms:
+            tl.send(dir, handle)  # use handle's metadata
+            tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
+
+        Blocking: returns when PE_IPCQ has accepted the request and
+        forwarded the IpcqDmaToken to PE_DMA. Backpressure may apply.
+        """
+        if src is not None:
+            src_addr = src.addr
+            nbytes = src.nbytes
+            shape = src.shape
+            dtype = src.dtype
+            space = getattr(src, "space", space)
+        if src_addr is None or nbytes is None or shape is None:
+            raise ValueError("tl.send: provide either a TensorHandle or src_addr/nbytes/shape")
+        self._emit_dispatch_overhead()
+        cmd = IpcqSendCmd(
+            direction=dir,
+            src_addr=src_addr, src_space=space,
+            nbytes=nbytes, shape=shape, dtype=dtype,
+            handle_id=self._next_handle_id(),
+        )
+        self._emit(cmd)
+
+    def recv(
+        self,
+        dir: str | None = None,
+        shape: tuple[int, ...] = (),
+        dtype: str = "f16",
+        space: str = "tcm",
+        dst_addr: int | None = None,
+        dst_space: str | None = None,
+    ) -> TensorHandle:
+        """Receive tensor data from a peer.
+
+        Args:
+            dir: specific direction (e.g. "W"), or None for round-robin.
+            shape, dtype: expected tensor metadata.
+            dst_addr / dst_space: if both are provided, the slot data is
+                copied to (dst_space, dst_addr) before the handle is
+                returned ("copy_to_dst" mode). Otherwise the slot address
+                is returned directly ("return_slot" mode).
+
+        Returns:
+            TensorHandle pointing to the slot (or dst) where the data has
+            arrived. In greenlet/runner mode, ``handle.data`` carries the
+            actual ndarray; in command-list mode the handle is a placeholder.
+        """
+        self._emit_dispatch_overhead()
+        if dst_addr is not None and dst_space is not None:
+            cmd = IpcqRecvCmd(
+                direction=dir,
+                shape=shape, dtype=dtype,
+                handle_id=self._next_handle_id(),
+                recv_mode="copy_to_dst",
+                dst_addr=dst_addr, dst_space=dst_space,
+            )
+        else:
+            cmd = IpcqRecvCmd(
+                direction=dir,
+                shape=shape, dtype=dtype,
+                handle_id=self._next_handle_id(),
+            )
+        result = self._emit(cmd)
+        if isinstance(result, dict):
+            slot_addr = int(result.get("src_addr", 0))
+            slot_space = str(result.get("src_space", "tcm"))
+            data = result.get("data")
+            return TensorHandle(
+                id=self._next_handle_id(),
+                addr=slot_addr,
+                shape=shape,
+                dtype=dtype,
+                nbytes=self._nbytes(shape, dtype),
+                data=data,
+                space=slot_space,
+            )
+        return self._make_handle(addr=0, shape=shape, dtype=dtype)
+
+    def recv_async(
+        self,
+        dir: str,
+        shape: tuple[int, ...] = (),
+        dtype: str = "f16",
+    ) -> "RecvFuture":
+        """Non-blocking recv. Returns a future to pass into ``tl.wait``."""
+        self._emit_dispatch_overhead()
+        cmd = IpcqRecvCmd(
+            direction=dir,
+            shape=shape, dtype=dtype,
+            handle_id=self._next_handle_id(),
+            blocking=False,
+        )
+        future = RecvFuture(cmd=cmd)
+        if self._runner is not None:
+            self._runner.switch_to_simpy(("recv_async", future))
+        return future
+
     # ── Composite + Control ───────────────────────────────────────
 
     def composite(
@@ -316,9 +536,40 @@ class TLContext:
         ))
         return completion
 
-    def wait(self, handle: CompletionHandle | None = None) -> None:
-        """Wait for a specific composite or all pending composites."""
+    def wait(self, handle: "CompletionHandle | RecvFuture | None" = None) -> Any:
+        """Wait for a composite, a recv future, or all pending composites.
+
+        - ``CompletionHandle`` (or None): wait for composite completion.
+        - ``RecvFuture``: wait for a non-blocking ``recv_async`` to finish.
+          Returns the resolved ``TensorHandle``.
+        """
+        if isinstance(handle, RecvFuture):
+            if handle.resolved:
+                return handle.result
+            if self._runner is None:
+                raise RuntimeError(
+                    "tl.wait(RecvFuture) requires runner mode (greenlet)"
+                )
+            result_dict = self._runner.switch_to_simpy(("recv_wait", handle))
+            slot_addr = int(result_dict.get("src_addr", 0))
+            slot_space = str(result_dict.get("src_space", "tcm"))
+            data = result_dict.get("data")
+            th = TensorHandle(
+                id=self._next_handle_id(),
+                addr=slot_addr,
+                shape=handle.cmd.shape,
+                dtype=handle.cmd.dtype,
+                nbytes=self._nbytes(handle.cmd.shape, handle.cmd.dtype),
+                data=data,
+                space=slot_space,
+            )
+            handle.resolved = True
+            handle.result = th
+            return th
+
+        # Composite path (existing behaviour)
         self._emit(WaitCmd(handle=handle))
+        return None
 
     def cycles(self, n: int) -> None:
         """Declare PE_CPU scalar execution overhead (cycles)."""