Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after
  each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to
  prevent stale data from corrupting the MemoryStore snapshot.

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier; only real PyTorch names.
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases
  (ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,9 @@
"""CCL (Collective Communication Library) framework for kernbench (ADR-0023).

This package provides:
- topologies: builtin neighbor topology generators (ring/mesh/tree)
- helpers: utilities for algorithm authors (chunked, ring_step, ...)
- testing: mock CCL runtime for fast unit tests of algorithm kernels

See docs/adr/ADR-0023-ipcq-pe-collective.md and docs/ccl-author-guide.md.
"""
@@ -0,0 +1,29 @@
"""Hello-world CCL kernel for the docs/ccl-author-guide.md walkthrough.

Each PE sends its tile to the E neighbor and receives one tile from W,
then stores the received tile back into its own HBM slice. The simplest
possible demonstration of ``tl.send`` / ``tl.recv``.
"""
from __future__ import annotations


def kernel_args(world_size: int, n_elem: int) -> tuple:
    """Return the positional kernel arguments for the ahbm backend."""
    return (n_elem,)


def kernel(t_ptr, n_elem, tl):
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe
    nbytes = n_elem * 2
    pe_addr = t_ptr + rank * nbytes

    # Send our local HBM tile to the E neighbor.
    src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    tl.send(dir="E", src=src)

    # Receive a tile from W and store it into our slice (overwrite).
    recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
    tl.store(pe_addr, recv)
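Note on hello_send: the data movement of the kernel above can be modelled without the simulator. The sketch below is only an illustration; `simulate_hello_send` is a hypothetical helper, and it assumes the ranks form a 1D ring in which rank r's E neighbor is `(r + 1) % world_size`, which the configured topology may or may not match.

```python
def simulate_hello_send(tiles):
    """Model one send-E / recv-W exchange on a 1D ring.

    tiles[r] is rank r's tile. Assumption (not taken from kernbench):
    rank r's E neighbor is (r + 1) % world_size.
    """
    world_size = len(tiles)
    # Each rank pushes a copy of its tile into the E neighbor's W-side inbox.
    inbox_w = [None] * world_size
    for rank, tile in enumerate(tiles):
        east = (rank + 1) % world_size
        inbox_w[east] = list(tile)
    # Each rank then overwrites its own slice with what arrived from W.
    return inbox_w


tiles = [[r] * 4 for r in range(4)]
result = simulate_hello_send(tiles)
```

Under that assumption every rank ends up holding its W neighbor's tile, so the contents rotate by one position around the ring.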
@@ -0,0 +1,73 @@
"""2D-mesh all-reduce kernel (ADR-0023).

Two-phase reduce on a square mesh of side ``S`` (world_size = S*S):
  1. Row reduce: ring all-reduce along E/W within each row.
  2. Column reduce: ring all-reduce along N/S within each column.

After both phases, every rank holds the global sum.

Uses TensorHandle math (PE_MATH) for accumulation. Op_log captures the
data flow so Phase 2 produces correct final HBM contents. Math/recv
handles are passed directly to the next send, avoiding store→reload
which doesn't propagate correctly with timing-only Phase 1 math.
"""
from __future__ import annotations

import math


def kernel_args(world_size: int, n_elem: int) -> tuple:
    """Return the positional kernel arguments for the ahbm backend.

    Mesh all-reduce requires ``world_size`` to be a perfect square;
    the mesh side length is ``sqrt(world_size)``.
    """
    side = int(round(math.sqrt(world_size)))
    if side * side != world_size:
        raise ValueError(
            f"mesh_allreduce requires a square world_size; got {world_size}"
        )
    return (n_elem, side)


def kernel(t_ptr, n_elem, side, tl):
    """All-reduce on a square mesh.

    Args:
        t_ptr: HBM base address (column-sharded VA shared across ranks)
        n_elem: number of f16 elements per tile
        side: mesh side length (sqrt(world_size))
        tl: TLContext (ADR-0022).
    """
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe
    nbytes = n_elem * 2

    pe_addr = t_ptr + rank * nbytes
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    current = acc

    # ── Phase 1: row ring (E direction) ──
    # Ring forwards each received tile (not the cumulative acc) so every
    # tile passes through every rank exactly once.
    for _ in range(side - 1):
        tl.send(dir="E", src=current)
        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
        acc = acc + recv
        current = recv

    # The column ring starts from the row-phase accumulator. We do NOT
    # store/reload here: the math handle's scratch addr is the source for
    # the first column send, and the data-execution pass (ADR-0020 Phase 2)
    # replays its ipcq_copy from that same addr.
    current = acc

    # ── Phase 2: column ring (S direction) ──
    for _ in range(side - 1):
        tl.send(dir="S", src=current)
        recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
        acc = acc + recv
        current = recv

    tl.store(pe_addr, acc)
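Note on mesh_allreduce: the row-then-column schedule above can be checked with plain Python lists. This is a rough model rather than the simulator's behaviour; `add`, `ring_pass` and `simulate_mesh_allreduce` are hypothetical names, and the sketch assumes a row-major rank layout (`rank = row * side + col`) with wrap-around neighbors in both directions.

```python
def add(a, b):
    """Element-wise sum of two equal-length tiles."""
    return [x + y for x, y in zip(a, b)]


def ring_pass(acc, groups):
    """Run one ring all-reduce inside every group of ranks, updating acc."""
    for group in groups:
        n = len(group)
        current = {r: acc[r] for r in group}
        for _ in range(n - 1):
            # Member i sends `current` to member i + 1 (wrap-around).
            received = {group[(i + 1) % n]: current[group[i]] for i in range(n)}
            for r in group:
                acc[r] = add(acc[r], received[r])
            current = received  # forward the received tile, not the sum


def simulate_mesh_allreduce(tiles, side):
    """Row rings first, then column rings seeded with the row sums."""
    acc = [list(t) for t in tiles]
    rows = [[row * side + col for col in range(side)] for row in range(side)]
    cols = [[row * side + col for row in range(side)] for col in range(side)]
    ring_pass(acc, rows)  # every rank now holds its row sum
    ring_pass(acc, cols)  # row sums are combined along each column
    return acc


tiles = [[float(r + 1)] * 2 for r in range(9)]
mesh_result = simulate_mesh_allreduce(tiles, side=3)
```

With nine ranks holding the values 1 through 9, every rank should end with 45 in each element, which matches the "every rank holds the global sum" claim in the module docstring.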
@@ -0,0 +1,80 @@
"""Ring all-reduce kernel for IPCQ-based PE collective (ADR-0023).

Algorithm: 1D ring of N PEs, each PE starts with one tile of data.
After ``world_size - 1`` rounds, every PE's accumulator holds the sum
of all PE tiles.

Strategy
--------
Each PE starts with its own tile in HBM. The kernel:
  1. Loads the local tile into a TensorHandle (the accumulator).
  2. In each of ``world_size - 1`` rounds:
       - Sends the current accumulator/recv slot to the E neighbor.
       - Receives a tile from the W neighbor; the recv handle points
         into the per-direction TCM slot.
       - Adds the received tile to the accumulator using the TensorHandle
         operator overload, which dispatches to ``MathCmd`` (PE_MATH).
  3. Stores the final accumulator back to HBM via tl.store. The store is
     recorded in op_log with both src and dst, so Phase 2 will copy the
     replayed math result from PE-local scratch into HBM.

ADR-0020 D3 split: Phase 1 simulates timing only: math results are
not yet computed, so the accumulator data flowing through Phase 1 may
be stale. Phase 2's DataExecutor replays math + IPCQ copies + dma_write
in stable t_start order, producing correct final HBM contents.
"""
from __future__ import annotations


def kernel_args(world_size: int, n_elem: int) -> tuple:
    """Return the positional kernel arguments for the ahbm backend.

    Ring all-reduce takes (n_elem, world_size) after the tensor pointer.
    """
    return (n_elem, world_size)


def kernel(t_ptr, n_elem, world_size, tl):
    """Ring all-reduce.

    Args:
        t_ptr: HBM base address of the column-sharded tensor; all PEs
            share this base. The per-PE slice lives at
            ``t_ptr + global_rank * n_elem * 2``.
        n_elem: number of f16 elements per tile.
        world_size: total number of participating ranks (passed by host).
        tl: TLContext (auto-injected, ADR-0022). The kernel derives the
            global rank from ``program_id(axis=0)`` (local PE) and
            ``program_id(axis=1)`` (cube id):

                rank = cube_id * pes_per_cube + local_pe
    """
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe
    nbytes = n_elem * 2  # f16

    # Each PE reads from its own slice of the shared base address
    pe_addr = t_ptr + rank * nbytes

    # Load the local tile; the handle points at HBM[pe_addr].
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    # The ring forwards each received tile to the next neighbor (NOT the
    # cumulative accumulator), so every rank's tile passes through every
    # rank exactly once. The accumulator sums the new arrival each round.
    current = acc

    for _step in range(world_size - 1):
        tl.send(dir="E", src=current)
        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
        # TensorHandle add → MathCmd → PE_MATH (timing in Phase 1, real
        # numpy in Phase 2 via DataExecutor). The result handle lives at
        # an auto-allocated PE-local scratch addr.
        acc = acc + recv
        current = recv  # forward W's tile to E next round

    # Final result back to this PE's HBM slice. Op_log captures the
    # source (scratch addr) and dst (HBM slice) so Phase 2 copies the
    # accumulated value into HBM for verification.
    tl.store(pe_addr, acc)
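Note on ring_allreduce: the kernel comments stress that each round forwards the tile just received, not the running accumulator. The scalar model below is a sketch of why that matters; `ring_allreduce` and its `forward_accumulator` flag are hypothetical, one number stands in for each tile, and rank r is assumed to receive from rank `(r - 1) % n`.

```python
def ring_allreduce(values, forward_accumulator=False):
    """Scalar model of the ring loop; one number stands in for each tile."""
    n = len(values)
    acc = list(values)
    current = list(values)
    for _ in range(n - 1):
        # Rank r receives what rank r - 1 sent this round.
        received = [current[(r - 1) % n] for r in range(n)]
        acc = [a + x for a, x in zip(acc, received)]
        # The kernel forwards `received`; forwarding `acc` instead would
        # send contributions that downstream ranks have already counted.
        current = acc[:] if forward_accumulator else received
    return acc


correct = ring_allreduce([1, 2, 3, 4])
double_counted = ring_allreduce([1, 2, 3, 4], forward_accumulator=True)
```

In this model `correct` holds the sum 10 on every rank, while the accumulator-forwarding variant produces different, inflated values on each rank.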
@@ -0,0 +1,80 @@
"""Tree all-reduce kernel for IPCQ-based PE collective (ADR-0023).

Two-phase binary tree all-reduce:

  Phase 1 (reduce up):
    - leaf nodes send their value to ``parent``
    - internal nodes recv from each child, sum, then send to ``parent``
    - root accumulates child contributions; final acc holds global sum

  Phase 2 (broadcast down):
    - root sends acc to ``child_left`` and ``child_right`` (if present)
    - internal nodes recv from ``parent``, then forward to children
    - all ranks store the final acc to HBM

Uses TensorHandle math (PE_MATH) for accumulation. Op_log captures the
data flow so Phase 2 produces correct final HBM contents. The kernel
deliberately avoids the store→reload→send pattern: math/recv handles
are passed directly to the next send so PE_DMA snapshots a deterministic
source addr that Phase 2 can replay.
"""
from __future__ import annotations


def kernel_args(world_size: int, n_elem: int) -> tuple:
    """Return the positional kernel arguments for the ahbm backend."""
    return (n_elem, world_size)


def kernel(t_ptr, n_elem, world_size, tl):
    """Tree all-reduce.

    Args:
        t_ptr: HBM base address.
        n_elem: number of f16 elements per tile.
        world_size: total number of participating ranks (passed by host).
        tl: TLContext (ADR-0022). Global rank from program_id(0/1).
    """
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe
    nbytes = n_elem * 2

    pe_addr = t_ptr + rank * nbytes
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")

    # Compute children/parent existence (matches tree_binary topology generator)
    has_parent = rank > 0
    left = 2 * rank + 1
    right = 2 * rank + 2
    has_left = left < world_size
    has_right = right < world_size

    # ── Phase 1: reduce up ──
    if has_left:
        recv = tl.recv(dir="child_left", shape=(n_elem,), dtype="f16")
        acc = acc + recv
    if has_right:
        recv = tl.recv(dir="child_right", shape=(n_elem,), dtype="f16")
        acc = acc + recv

    if has_parent:
        # Send the math/load handle directly: its addr is either the
        # original HBM tile (leaf) or the PE-local scratch where the
        # accumulator lives. Phase 2 ipcq_copy replays from the same addr.
        tl.send(dir="parent", src=acc)

    # ── Phase 2: broadcast down ──
    if has_parent:
        # Replace acc with the value broadcast from the parent (the global
        # sum). The recv handle points at the parent-direction TCM slot.
        acc = tl.recv(dir="parent", shape=(n_elem,), dtype="f16")

    if has_left:
        tl.send(dir="child_left", src=acc)
    if has_right:
        tl.send(dir="child_right", src=acc)

    # Final store to HBM for the bench's verification path.
    tl.store(pe_addr, acc)
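Note on tree_allreduce: the reduce-up / broadcast-down flow can be sketched sequentially. `tree_allreduce` below is a hypothetical scalar model, not the kernel: one number stands in for each tile, and the concurrent send/recv pairs are replaced by loops ordered so that every dependency is already satisfied. It uses the same heap-style numbering as the kernel (children of rank r are 2r + 1 and 2r + 2).

```python
def tree_allreduce(values):
    """Scalar model of reduce-up / broadcast-down on a heap-ordered tree."""
    n = len(values)
    acc = list(values)
    # Reduce up: visiting ranks from highest to lowest guarantees that a
    # node's children are fully summed before the node sends to its parent.
    for rank in range(n - 1, 0, -1):
        parent = (rank - 1) // 2
        acc[parent] += acc[rank]
    # Broadcast down: visiting ranks from lowest to highest guarantees the
    # parent already holds the global sum when a child reads it.
    for rank in range(1, n):
        acc[rank] = acc[(rank - 1) // 2]
    return acc


tree_result = tree_allreduce([1, 2, 3, 4, 5, 6, 7])
```

For the seven-rank case used in the test matrix, every entry of `tree_result` equals 28, the sum of the inputs.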
@@ -0,0 +1,127 @@
"""CCL diagnostics: trace + pointer dump + deadlock (ADR-0023 D14).

Trace
-----
Set ``KERNBENCH_CCL_TRACE`` to ``1``, ``true``, ``yes`` or ``on``
(case-insensitive) to enable per-event logging of CCL send/recv and
credit-return events to stdout. Off by default. The variable is read once
at import time; call ``reload_trace_setting()`` after changing it.

Pointer dump
------------
``pointer_dump(engine)`` returns a multi-line string showing every PE_IPCQ's
ring buffer state (my_head, my_tail, peer_head_cache, peer_tail_cache).
Useful for diagnosing hangs.

Deadlock
--------
``IpcqDeadlock`` is raised by the engine when SimPy's schedule empties
while a request is still pending, which is typical of unmatched send/recv
pairs. The exception message includes the pointer dump.
"""
from __future__ import annotations

import os
from typing import Any


class IpcqDeadlock(RuntimeError):
    """Raised when the simulation cannot make further progress while a
    CCL request is still pending (D14 F3)."""


# ── Trace toggle ─────────────────────────────────────────────────────


_TRACE_ENABLED: bool = False


def reload_trace_setting() -> None:
    """Re-read the ``KERNBENCH_CCL_TRACE`` env var."""
    global _TRACE_ENABLED
    val = os.environ.get("KERNBENCH_CCL_TRACE", "")
    _TRACE_ENABLED = val.strip().lower() in {"1", "true", "yes", "on"}


def trace_enabled() -> bool:
    return _TRACE_ENABLED


# Initialise once at import time
reload_trace_setting()


# ── Trace event functions ────────────────────────────────────────────


def log_send(
    t_ns: float,
    sender: str,
    direction: str,
    nbytes: int,
    sender_seq: int,
) -> None:
    if not _TRACE_ENABLED:
        return
    print(
        f"[ccl t={t_ns:.1f} send] {sender} dir={direction} nbytes={nbytes} seq={sender_seq}",
        flush=True,
    )


def log_recv(
    t_ns: float,
    receiver: str,
    direction: str,
    nbytes: int,
) -> None:
    if not _TRACE_ENABLED:
        return
    print(
        f"[ccl t={t_ns:.1f} recv] {receiver} dir={direction} nbytes={nbytes}",
        flush=True,
    )


def log_credit_return(
    t_ns: float,
    sender: str,
    direction: str,
    consumer_seq: int,
) -> None:
    if not _TRACE_ENABLED:
        return
    print(
        f"[ccl t={t_ns:.1f} credit] {sender} dir={direction} seq={consumer_seq}",
        flush=True,
    )


# ── Pointer dump ─────────────────────────────────────────────────────


def pointer_dump(engine: Any) -> str:
    """Return a multi-line string of every PE_IPCQ's pointer state."""
    lines: list[str] = []
    components = getattr(engine, "_components", {})
    for node_id in sorted(components):
        if not node_id.endswith(".pe_ipcq"):
            continue
        comp = components[node_id]
        qps = getattr(comp, "queue_pairs", {})
        if not qps:
            continue
        lines.append(node_id)
        for d in sorted(qps):
            qp = qps[d]
            peer = qp["peer"]
            lines.append(
                f"  {d}: peer=sip{peer.sip}.cube{peer.cube}.pe{peer.pe} "
                f"my_head={qp['my_head']} my_tail={qp['my_tail']} "
                f"peer_head_cache={qp['peer_head_cache']} "
                f"peer_tail_cache={qp['peer_tail_cache']}"
            )
    return "\n".join(lines)


def print_pointer_dump(engine: Any) -> None:
    """Convenience: print pointer_dump(engine) to stdout."""
    print(pointer_dump(engine), flush=True)
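Note on the trace toggle: `reload_trace_setting()` accepts only a fixed set of strings, so values such as `2` or `enabled` leave tracing off. The standalone sketch below mirrors that parsing for illustration; `trace_requested` and `_TRUTHY` are hypothetical names and are not part of the diagnostics module.

```python
import os

# Same accepted spellings as the set literal in reload_trace_setting().
_TRUTHY = {"1", "true", "yes", "on"}


def trace_requested(environ=os.environ):
    """Return True when KERNBENCH_CCL_TRACE holds one of the accepted values.

    `environ` is any mapping with a .get method, so tests can pass a dict.
    """
    value = environ.get("KERNBENCH_CCL_TRACE", "")
    return value.strip().lower() in _TRUTHY


enabled_upper = trace_requested({"KERNBENCH_CCL_TRACE": " ON "})
enabled_two = trace_requested({"KERNBENCH_CCL_TRACE": "2"})
enabled_unset = trace_requested({})
```

Surrounding whitespace and letter case are ignored, so `" ON "` enables tracing while `"2"` and an unset variable do not.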
@@ -0,0 +1,118 @@
"""Helpers for CCL algorithm authors (ADR-0023 D15).

These are pure utility functions usable from any kernel module:

    from kernbench.ccl.helpers import chunked, ring_step, tree_step

They keep algorithm code short and free of off-by-one bugs.
"""
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


_DTYPE_BYTES = {
    "f16": 2, "fp16": 2, "float16": 2, "bf16": 2,
    "f32": 4, "fp32": 4, "float32": 4,
    "i8": 1, "int8": 1,
    "i16": 2, "int16": 2,
    "i32": 4, "int32": 4,
}


def _itemsize(dtype: str) -> int:
    if dtype not in _DTYPE_BYTES:
        raise ValueError(f"Unsupported dtype: {dtype}")
    return _DTYPE_BYTES[dtype]


# ── chunked ──────────────────────────────────────────────────────────


@dataclass(frozen=True)
class Chunk:
    """One chunk of a tensor used by collective algorithms."""

    addr: int
    n_elem: int
    nbytes: int


def chunked(
    base_addr: int,
    n_chunks: int,
    n_elem: int,
    dtype: str = "f16",
) -> list[Chunk]:
    """Slice a 1D buffer into ``n_chunks`` equal Chunks.

    Args:
        base_addr: starting address of the buffer.
        n_chunks: number of equal chunks to produce.
        n_elem: total number of elements (must be divisible by n_chunks).
        dtype: element type for byte-size calculation.

    Returns:
        List of ``Chunk`` objects whose addresses are consecutive.

    Raises:
        ValueError: if n_elem is not divisible by n_chunks.
    """
    if n_elem % n_chunks != 0:
        raise ValueError(
            f"chunked: n_elem ({n_elem}) not divisible by n_chunks ({n_chunks})"
        )
    per_chunk_elem = n_elem // n_chunks
    isize = _itemsize(dtype)
    per_chunk_bytes = per_chunk_elem * isize
    return [
        Chunk(
            addr=base_addr + i * per_chunk_bytes,
            n_elem=per_chunk_elem,
            nbytes=per_chunk_bytes,
        )
        for i in range(n_chunks)
    ]


# ── ring_step ────────────────────────────────────────────────────────


def ring_step(rank: int, step: int, world_size: int) -> tuple[int, int]:
    """Return ``(send_chunk_idx, recv_chunk_idx)`` for a ring algorithm step.

    Standard reduce-scatter / all-gather ring schedule:
    at step s, rank r sends chunk (r - s) and receives chunk (r - s - 1)
    modulo world_size.

    Used by ring all-reduce kernels:

        for step in range(world_size - 1):
            send_idx, recv_idx = ring_step(rank, step, world_size)
            tl.send(dir="E", src=chunks[send_idx])
            chunks[recv_idx] += tl.recv(dir="W").data
    """
    send_idx = (rank - step) % world_size
    recv_idx = (rank - step - 1) % world_size
    return send_idx, recv_idx


# ── tree_step ────────────────────────────────────────────────────────


def tree_step(rank: int, world_size: int) -> dict[str, Any]:
    """Return parent/children for binary tree rooted at rank 0.

    Returns:
        ``{"parent": int|None, "children": list[int]}``
    """
    parent = (rank - 1) // 2 if rank > 0 else None
    children: list[int] = []
    left = 2 * rank + 1
    right = 2 * rank + 2
    if left < world_size:
        children.append(left)
    if right < world_size:
        children.append(right)
    return {"parent": parent, "children": children}
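Note on ring_step: the index schedule can be exercised as a reduce-scatter without any kernbench imports. In the sketch below the index arithmetic of `ring_step` is repeated so the example runs on its own; `reduce_scatter` is a hypothetical helper, one scalar stands in for each chunk, and rank r is assumed to receive from rank `(r - 1) % n`.

```python
def ring_step(rank, step, world_size):
    """Same index arithmetic as the helper, repeated so the sketch runs alone."""
    return (rank - step) % world_size, (rank - step - 1) % world_size


def reduce_scatter(chunks_per_rank):
    """chunks_per_rank[r][c] is rank r's scalar value for chunk c."""
    n = len(chunks_per_rank)
    state = [list(chunks) for chunks in chunks_per_rank]
    for step in range(n - 1):
        # Snapshot what every rank sends before anyone applies a receive.
        sent = [state[r][ring_step(r, step, n)[0]] for r in range(n)]
        for r in range(n):
            _, recv_idx = ring_step(r, step, n)
            state[r][recv_idx] += sent[(r - 1) % n]
    return state


state = reduce_scatter([[r + 1] * 4 for r in range(4)])
# After world_size - 1 steps, rank r owns the fully reduced chunk (r + 1) % n.
owned = [state[r][(r + 1) % 4] for r in range(4)]
```

The sender's `send_idx` at a given step equals the receiver's `recv_idx`, which is what lets each partial sum keep travelling around the ring until it has visited every rank.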
@@ -0,0 +1,266 @@
"""IPCQ install plan for AhbmCCLBackend (ADR-0023 D10/D11/D12).

Given a ccl.yaml config, the topology, and the engine, this module:

  1. Loads ccl.yaml and resolves the chosen algorithm.
  2. Maps each rank to a (sip, cube, pe) PE address using a linear scheme.
  3. Allocates per-rank IPCQ ring buffer base addresses (synthetic but
     unique-per-PE; see notes below).
  4. Builds neighbor tables via the algorithm's ``topology`` field plus the
     optional ``neighbors()`` override hook from the algorithm module.
  5. Wires bidirectional credit-return SimPy Stores between every (PE, peer)
     pair.
  6. Installs each PE_IPCQ component's neighbor table directly via its
     ``_install_neighbors`` sideband call (equivalent to fan-out IpcqInitMsg
     without going through fabric).

Address scheme
--------------
For the first implementation we use a synthetic address scheme that
guarantees uniqueness per (sip, cube, pe, direction) without going
through ``PEMemAllocator``. The address is encoded as:

    base = IPCQ_BASE | (sip << 40) | (cube << 32) | (pe << 24)
    rx_base[direction_idx] = base + direction_idx * (n_slots * slot_size)

The ``buffer_kind`` (tcm/hbm/sram) selects the *MemoryStore space* into
which data is written. Within a space, addresses are unique per PE so
the existing MemoryStore (``{space: {addr: ndarray}}``) handles them
naturally.

This bypasses the topology's address resolver / PhysAddr encoding and
treats IPCQ buffers as a separate, parallel address namespace. Real PA
encoding can be plugged in later without changing the rest of the design.
"""
from __future__ import annotations

from pathlib import Path
from typing import Any

import simpy
import yaml

from kernbench.ccl.topologies import resolve_topology
from kernbench.common.ipcq_types import (
    IpcqEndpoint,
    IpcqInitEntry,
)
from kernbench.runtime_api.kernel import IpcqInitMsg


# IPCQ synthetic address space top bit
_IPCQ_BASE = 1 << 60


def _ipcq_base_for_pe(sip: int, cube: int, pe: int) -> int:
    return _IPCQ_BASE | (sip << 40) | (cube << 32) | (pe << 24)


# ── ccl.yaml loading ─────────────────────────────────────────────────


def load_ccl_config(path: str | Path | None = None) -> dict:
    """Load and validate ccl.yaml. Searches cwd and project root."""
    if path is None:
        candidates = [
            Path.cwd() / "ccl.yaml",
            Path(__file__).resolve().parents[3] / "ccl.yaml",
        ]
        for p in candidates:
            if p.exists():
                path = p
                break
    if path is None:
        raise FileNotFoundError(
            "ccl.yaml not found. Place it at project root or cwd."
        )
    with open(path) as f:
        # An empty file loads as None; treat it as an empty mapping so the
        # section checks below raise ValueError instead of TypeError.
        cfg = yaml.safe_load(f) or {}
    if "defaults" not in cfg:
        raise ValueError("ccl.yaml missing 'defaults' section")
    if "algorithms" not in cfg:
        raise ValueError("ccl.yaml missing 'algorithms' section")
    return cfg


def resolve_algorithm_config(cfg: dict, name: str | None = None) -> dict:
    """Merge defaults with the chosen algorithm's overrides.

    Returns a flat dict that is expected to carry: algorithm, module,
    topology, buffer_kind, backpressure, n_slots, slot_size, plus the
    optional ipcq_credit_size_bytes and world_size entries. The keys are
    merged as-is; their presence is not validated here.
    """
    defaults = dict(cfg.get("defaults", {}))
    algo_name = name or defaults.get("algorithm")
    if algo_name is None:
        raise ValueError("ccl.yaml: defaults.algorithm not set")
    algos = cfg.get("algorithms", {})
    if algo_name not in algos:
        raise ValueError(
            f"ccl.yaml: algorithm '{algo_name}' not in algorithms section"
        )
    merged = defaults.copy()
    merged.update(algos[algo_name])
    merged["algorithm"] = algo_name
    return merged


# ── rank → PE mapping ────────────────────────────────────────────────


def linear_rank_to_pe(rank: int, spec: dict) -> tuple[int, int, int]:
    """Map a rank to (sip, cube, pe) using linear topology order."""
    sips = spec["system"]["sips"]["count"]
    cubes_per_sip = spec["sip"]["cube_mesh"]["w"] * spec["sip"]["cube_mesh"]["h"]
    pe_layout = spec["cube"]["pe_layout"]
    pes_per_cube = pe_layout["pe_per_corner"] * len(pe_layout["corners"])

    pes_per_sip = cubes_per_sip * pes_per_cube
    if rank >= sips * pes_per_sip:
        raise ValueError(
            f"rank {rank} exceeds total PE count {sips * pes_per_sip}"
        )
    sip = rank // pes_per_sip
    rem = rank % pes_per_sip
    cube = rem // pes_per_cube
    pe = rem % pes_per_cube
    return sip, cube, pe


# ── Install plan ─────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def install_ipcq(
|
||||
engine: Any,
|
||||
spec: dict,
|
||||
cfg: dict,
|
||||
algo_module: Any | None = None,
|
||||
rank_to_pe: list[tuple[int, int, int]] | None = None,
|
||||
) -> dict[str, Any]:
|
||||
"""Build neighbor tables and install them in every participating PE_IPCQ.
|
||||
|
||||
Args:
|
||||
engine: GraphEngine with ``_components`` dict
|
||||
spec: topology spec dict
|
||||
cfg: merged algorithm config (from ``resolve_algorithm_config``)
|
||||
algo_module: optional algorithm Python module (for neighbors override)
|
||||
rank_to_pe: optional explicit rank → (sip, cube, pe) mapping. If
|
||||
None, the default linear mapping is used.
|
||||
|
||||
Returns:
|
||||
A diagnostics dict with the install plan (rank → PE map, neighbor table).
|
||||
"""
|
||||
if "world_size" in cfg:
|
||||
world_size = int(cfg["world_size"])
|
||||
else:
|
||||
# Topology-derived fallback (mirrors AhbmCCLBackend / RuntimeContext).
|
||||
sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
|
||||
cm = spec.get("sip", {}).get("cube_mesh", {})
|
||||
cubes_per_sip = int(cm.get("w", 1)) * int(cm.get("h", 1))
|
||||
pl = spec.get("cube", {}).get("pe_layout", {})
|
||||
corners = pl.get("corners", [])
|
||||
pe_per_corner = int(pl.get("pe_per_corner", 1))
|
||||
pes_per_cube = pe_per_corner * max(len(corners), 1)
|
||||
world_size = sips * cubes_per_sip * pes_per_cube
|
||||
buffer_kind = cfg["buffer_kind"]
|
||||
n_slots = int(cfg["n_slots"])
|
||||
slot_size = int(cfg["slot_size"])
|
||||
backpressure = cfg["backpressure"]
|
||||
credit_size_bytes = int(cfg.get("ipcq_credit_size_bytes", 16))
|
||||
|
||||
# Step 1: rank → (sip, cube, pe)
|
||||
if rank_to_pe is not None:
|
||||
if len(rank_to_pe) != world_size:
|
||||
raise ValueError(
|
||||
f"rank_to_pe has {len(rank_to_pe)} entries but world_size={world_size}"
|
||||
)
|
||||
rank_pe = list(rank_to_pe)
|
||||
else:
|
||||
rank_pe: list[tuple[int, int, int]] = [
|
||||
linear_rank_to_pe(r, spec) for r in range(world_size)
|
||||
]
|
||||
pe_to_rank = {(s, c, p): r for r, (s, c, p) in enumerate(rank_pe)}
|
||||
|
||||
# Step 2: resolve topology fn (with optional override)
|
||||
topo_fn = resolve_topology(cfg["topology"], algo_module=algo_module)
|
||||
|
||||
# Build per-rank neighbor map
|
||||
neighbor_table: dict[int, dict[str, int]] = {}
|
||||
for r in range(world_size):
|
||||
neighbor_table[r] = topo_fn(r, world_size)
|
||||
|
||||
# Step 3: pull the live engine reference for each PE_IPCQ
|
||||
components = engine._components
|
||||
pe_ipcq_id = lambda s, c, p: f"sip{s}.cube{c}.pe{p}.pe_ipcq"
|
||||
|
||||
# Step 4: per-PE rx_base address and per-PE credit_inbox
|
||||
direction_keys = sorted({d for nt in neighbor_table.values() for d in nt})
|
||||
    direction_idx = {d: i for i, d in enumerate(direction_keys)}
    bytes_per_direction = n_slots * slot_size

    def rx_base(s: int, c: int, p: int, d: str) -> int:
        return _ipcq_base_for_pe(s, c, p) + direction_idx[d] * bytes_per_direction

    # Wire bidirectional credit stores: backend creates the SimPy Stores
    # by reading each rank's PE_IPCQ.credit_inbox property.
    rank_to_credit_inbox: dict[int, simpy.Store] = {}
    for r, (s, c, p) in enumerate(rank_pe):
        comp = components[pe_ipcq_id(s, c, p)]
        # Trigger lazy creation of credit_inbox if not yet started.
        # PE_IPCQ.start() creates it; we ensure it exists.
        if comp._credit_inbox is None:
            comp._credit_inbox = simpy.Store(engine._env)
        rank_to_credit_inbox[r] = comp.credit_inbox

    # Step 5: build IpcqInitMsg per rank and call _install_neighbors directly
    plan: dict[str, Any] = {
        "world_size": world_size,
        "rank_to_pe": rank_pe,
        "buffer_kind": buffer_kind,
        "neighbor_table": neighbor_table,
    }

    def reverse_direction(my_rank: int, peer_rank: int) -> str | None:
        """Find which direction in peer's neighbor table points back to my_rank."""
        for d, target in neighbor_table[peer_rank].items():
            if target == my_rank:
                return d
        return None

    for r, (s, c, p) in enumerate(rank_pe):
        my_pe_ipcq = components[pe_ipcq_id(s, c, p)]
        nbrs = neighbor_table[r]
        entries: list[IpcqInitEntry] = []
        for d, peer_rank in nbrs.items():
            if peer_rank is None:
                continue
            peer_s, peer_c, peer_p = rank_pe[peer_rank]
            peer_dir = reverse_direction(r, peer_rank)
            if peer_dir is None:
                # Peer has no reverse entry: skip (asymmetric topology)
                continue
            peer_endpoint = IpcqEndpoint(
                sip=peer_s, cube=peer_c, pe=peer_p,
                buffer_kind=buffer_kind,
                rx_base_pa=rx_base(peer_s, peer_c, peer_p, peer_dir),
                rx_base_va=0,
                n_slots=n_slots, slot_size=slot_size,
            )
            entries.append(IpcqInitEntry(
                direction=d,
                peer=peer_endpoint,
                my_rx_base_pa=rx_base(s, c, p, d),
                my_rx_base_va=0,
                n_slots=n_slots, slot_size=slot_size,
                peer_credit_store=rank_to_credit_inbox[peer_rank],
            ))
        msg = IpcqInitMsg(
            correlation_id="ccl_init", request_id=f"init_r{r}",
            target_sips=(s,), target_cubes=(c,), target_pe=p,
            entries=tuple(entries),
            backpressure_mode=backpressure,
            buffer_kind=buffer_kind,
            credit_size_bytes=credit_size_bytes,
        )
        my_pe_ipcq._install_neighbors(msg)

    return plan
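The receive-buffer addressing used by rx_base above can be illustrated in isolation. This is only a sketch under stated assumptions: the direction order, the slot counts, and ipcq_base_for_pe (a stand-in for _ipcq_base_for_pe that packs PE regions back to back) are all hypothetical, and the modulo wrap in slot_addr assumes the ring-buffer behaviour described in the commit message rather than anything shown in this hunk.

```python
# Hypothetical constants; the real values come from the backend configuration.
DIRECTION_KEYS = ("N", "S", "E", "W")
N_SLOTS = 4
SLOT_SIZE = 256
BYTES_PER_DIRECTION = N_SLOTS * SLOT_SIZE
PE_REGION_BYTES = len(DIRECTION_KEYS) * BYTES_PER_DIRECTION

DIRECTION_IDX = {d: i for i, d in enumerate(DIRECTION_KEYS)}


def ipcq_base_for_pe(pe_index: int) -> int:
    """Assumed layout: one contiguous region per PE, starting at address 0."""
    return pe_index * PE_REGION_BYTES


def rx_base(pe_index: int, direction: str) -> int:
    """Base address of the receive ring for one (PE, direction) pair."""
    return ipcq_base_for_pe(pe_index) + DIRECTION_IDX[direction] * BYTES_PER_DIRECTION


def slot_addr(pe_index: int, direction: str, slot: int) -> int:
    """Address of a slot inside the ring; the slot index wraps (assumption)."""
    return rx_base(pe_index, direction) + (slot % N_SLOTS) * SLOT_SIZE


if __name__ == "__main__":
    for d in DIRECTION_KEYS:
        print(d, hex(rx_base(1, d)))
```

Each direction gets its own fixed-size ring, so a sender only needs the peer's base address plus a slot index to compute where a payload lands.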
@@ -0,0 +1,465 @@
"""Mock CCL runtime for fast unit tests of algorithm kernels (ADR-0023 D15).

Runs a kernel function once per rank with a minimal ``tl`` shim: no SimPy,
no PE_DMA, no fabric simulation. Just enough to verify *functional*
correctness of an IPCQ-based collective algorithm.

Cross-rank send/recv is implemented with greenlet cooperative scheduling
plus per-(rank, direction) FIFO queues. Backpressure is not modeled:
queues are unbounded.

Typical usage in a test::

    import numpy as np

    from kernbench.ccl.testing import run_kernel_in_mock
    from kernbench.ccl.algorithms.ring_allreduce import kernel

    # np.float16, not dtype="f16": NumPy reads "f16" as a 16-byte float.
    inputs = [np.full(16, r + 1, dtype=np.float16) for r in range(4)]
    outputs = run_kernel_in_mock(
        kernel_fn=kernel, world_size=4, topology="ring_1d",
        inputs=inputs, kernel_args=(16,),
    )
    for r in range(4):
        assert np.allclose(outputs[r], sum(inputs))
"""
from __future__ import annotations

from collections import deque
from typing import Any, Callable

import numpy as np
from greenlet import greenlet

from kernbench.ccl.topologies import resolve_topology
from kernbench.common.ipcq_types import IpcqInvalidDirection
from kernbench.common.pe_commands import TensorHandle


# ── Per-rank fake state ──────────────────────────────────────────────


class _MockRankState:
    """Per-rank scratch holding HBM/recv slots and tl shim hooks."""

    def __init__(
        self,
        rank: int,
        world_size: int,
        neighbors: dict[str, int],
        input_arr: np.ndarray,
    ) -> None:
        self.rank = rank
        self.world_size = world_size
        self.neighbors = neighbors  # direction → peer rank
        # HBM "memory": addr → ndarray. Per-rank, no cross-rank sharing.
        self._hbm: dict[int, np.ndarray] = {}
        self._tcm: dict[int, np.ndarray] = {}
        # ``t_ptr`` is the address the kernel sees. Real benches use a
        # column-sharded VA so each rank reads from ``t_ptr + rank*nbytes``.
        # Mirror that here: each rank's slice lives at the rank-specific addr.
        nbytes = int(input_arr.nbytes)
        self.t_ptr = 0  # base; per-rank offset is rank * nbytes
        self._slice_addr = rank * nbytes
        self._hbm[self._slice_addr] = input_arr.copy()
        # Inbound recv FIFOs: direction → deque[ndarray]
        self.recv_q: dict[str, deque[np.ndarray]] = {d: deque() for d in neighbors}
        # Output (set when kernel calls tl.store at slice address)
        self.output: np.ndarray | None = None
        # Greenlet for this rank (set later)
        self.g: greenlet | None = None


# ── Mock TLContext ───────────────────────────────────────────────────


class _MockTL:
    """Drop-in tl shim for mock runtime.

    Supports the subset of TLContext API that algorithm authors use:
    program_id, num_programs, load, store, send, recv, recv_async, wait,
    plus arithmetic operations on TensorHandle (eager numpy execution,
    no SimPy involved).
    """

    def __init__(self, state: _MockRankState, scheduler: "_MockScheduler") -> None:
        self._state = state
        self._scheduler = scheduler
        self._handle_counter = 0

    def _next_id(self) -> str:
        self._handle_counter += 1
        return f"mt{self._handle_counter}"

    @property
    def rank(self) -> int:
        return self._state.rank

    @property
    def world_size(self) -> int:
        return self._state.world_size

    # axis-aware
    def program_id(self, axis: int = 0) -> int:
        return self._state.rank if axis == 0 else 0

    def num_programs(self, axis: int = 0) -> int:
        return self._state.world_size if axis == 0 else 1

    # ── arithmetic ops (called by TensorHandle.__add__ etc.) ──

    def _binary_math(self, op: str, a: TensorHandle, b: TensorHandle) -> TensorHandle:
        a_data = np.asarray(a.data) if a.data is not None else None
        b_data = np.asarray(b.data) if b.data is not None else None
        if a_data is None or b_data is None:
            result = None
        elif op == "add":
            result = a_data + b_data
        elif op == "sub":
            result = a_data - b_data
        elif op == "mul":
            result = a_data * b_data
        elif op == "div":
            result = a_data / b_data
        elif op == "maximum":
            result = np.maximum(a_data, b_data)
        elif op == "minimum":
            result = np.minimum(a_data, b_data)
        else:
            raise NotImplementedError(f"mock _binary_math: op {op!r} not implemented")
        return TensorHandle(
            id=self._next_id(),
            addr=0, shape=a.shape, dtype=a.dtype,
            nbytes=int(np.prod(a.shape)) * 2 if a.shape else 0,
            data=result, space="tcm",
        )

    def maximum(self, a: TensorHandle, b: TensorHandle) -> TensorHandle:
        return self._binary_math("maximum", a, b)

    def minimum(self, a: TensorHandle, b: TensorHandle) -> TensorHandle:
        return self._binary_math("minimum", a, b)

    def fma(
        self, a: TensorHandle, b: TensorHandle, c: TensorHandle,
    ) -> TensorHandle:
        a_data = np.asarray(a.data) if a.data is not None else None
        b_data = np.asarray(b.data) if b.data is not None else None
        c_data = np.asarray(c.data) if c.data is not None else None
        result = (
            a_data * b_data + c_data
            if (a_data is not None and b_data is not None and c_data is not None)
            else None
        )
        return TensorHandle(
            id=self._next_id(),
            addr=0, shape=a.shape, dtype=a.dtype,
            nbytes=int(np.prod(a.shape)) * 2 if a.shape else 0,
            data=result, space="tcm",
        )

    def clamp(
        self,
        x: TensorHandle,
        min: TensorHandle,
        max: TensorHandle,
    ) -> TensorHandle:
        x_data = np.asarray(x.data) if x.data is not None else None
        lo = np.asarray(min.data) if min.data is not None else None
        hi = np.asarray(max.data) if max.data is not None else None
        result = (
            np.minimum(np.maximum(x_data, lo), hi)
            if (x_data is not None and lo is not None and hi is not None)
            else None
        )
        return TensorHandle(
            id=self._next_id(),
            addr=0, shape=x.shape, dtype=x.dtype,
            nbytes=int(np.prod(x.shape)) * 2 if x.shape else 0,
            data=result, space="tcm",
        )

    def softmax(self, x: TensorHandle, axis: int = -1) -> TensorHandle:
        x_data = np.asarray(x.data) if x.data is not None else None
        if x_data is None:
            result = None
        else:
            x_max = np.max(x_data, axis=axis, keepdims=True)
            e = np.exp(x_data - x_max)
            s = np.sum(e, axis=axis, keepdims=True)
            result = e / s
        return TensorHandle(
            id=self._next_id(),
            addr=0, shape=x.shape, dtype=x.dtype,
            nbytes=int(np.prod(x.shape)) * 2 if x.shape else 0,
            data=result, space="tcm",
        )

    @staticmethod
    def cdiv(a: int, b: int) -> int:
        return -(-int(a) // int(b))

    def _unary_math(self, op: str, x: TensorHandle) -> TensorHandle:
        x_data = np.asarray(x.data) if x.data is not None else None
        if x_data is None:
            result = None
        elif op == "exp":
            result = np.exp(x_data)
        elif op == "log":
            result = np.log(x_data)
        elif op == "sqrt":
            result = np.sqrt(x_data)
        elif op == "abs":
            result = np.abs(x_data)
        elif op == "sigmoid":
            result = 1.0 / (1.0 + np.exp(-x_data))
        elif op == "cos":
            result = np.cos(x_data)
        elif op == "sin":
            result = np.sin(x_data)
        else:
            raise NotImplementedError(f"mock _unary_math: op {op!r} not implemented")
        return TensorHandle(
            id=self._next_id(),
            addr=0, shape=x.shape, dtype=x.dtype,
            nbytes=int(np.prod(x.shape)) * 2 if x.shape else 0,
            data=result, space="tcm",
        )

    def load(self, ptr: int, shape: tuple[int, ...], dtype: str = "f16") -> TensorHandle:
        data = self._state._hbm.get(ptr)
        if data is None:
            data = np.zeros(shape, dtype=np.float16)
        return TensorHandle(
            id=f"load_{ptr}", addr=ptr, shape=shape, dtype=dtype,
            nbytes=int(np.prod(shape)) * 2, data=data, space="hbm",
        )

    def store(self, ptr: int, handle: TensorHandle) -> None:
        if handle.data is not None:
            self._state._hbm[ptr] = np.asarray(handle.data)
            if ptr == self._state._slice_addr:
                self._state.output = self._state._hbm[ptr]

    # IPCQ
    def send(
        self,
        dir: str,
        src: TensorHandle | None = None,
        *,
        src_addr: int | None = None,
        nbytes: int | None = None,
        shape: tuple[int, ...] | None = None,
        dtype: str = "f16",
        space: str = "tcm",
    ) -> None:
        if dir not in self._state.neighbors:
            raise IpcqInvalidDirection(
                f"mock tl.send: direction {dir!r} not in neighbors {list(self._state.neighbors)}"
            )
        if src is not None:
            if src.data is not None:
                data = np.asarray(src.data)
            else:
                # Resolve from this rank's local memory at src.addr
                space_dict = self._state._hbm if src.space == "hbm" else self._state._tcm
                stored = space_dict.get(src.addr)
                if stored is None:
                    raise RuntimeError(
                        f"mock tl.send: no data at {src.space}:0x{src.addr:x}"
                    )
                data = np.asarray(stored)
        else:
            data = None
        if data is None:
            raise RuntimeError("mock tl.send: src is None")
        peer_rank = self._state.neighbors[dir]
        # Find the reverse direction in peer's neighbors that points back to me
        peer_state = self._scheduler.states[peer_rank]
        reverse_dir = None
        for d, target in peer_state.neighbors.items():
            if target == self._state.rank:
                reverse_dir = d
                break
        if reverse_dir is None:
            raise RuntimeError(
                f"mock tl.send: peer rank {peer_rank} has no reverse direction"
            )
        peer_state.recv_q[reverse_dir].append(data.copy())
        # After delivering, hand control back to scheduler so the receiver
        # can wake up.
        self._scheduler.yield_()

    def recv_async(
        self,
        dir: str,
        shape: tuple[int, ...] = (),
        dtype: str = "f16",
    ) -> dict:
        """Non-blocking recv. Returns a future dict to pass to tl.wait."""
        if dir not in self._state.neighbors:
            raise IpcqInvalidDirection(
                f"mock tl.recv_async: direction {dir!r} not in neighbors"
            )
        return {"_kind": "recv_future", "dir": dir, "shape": shape, "dtype": dtype}

    def wait(self, future: Any) -> TensorHandle:
        """Block until the recv future has data."""
        if not isinstance(future, dict) or future.get("_kind") != "recv_future":
            raise TypeError("tl.wait: expected recv future from tl.recv_async")
        d = future["dir"]
        while not self._state.recv_q[d]:
            self._scheduler.yield_()
        data = self._state.recv_q[d].popleft()
        return self._make_handle(data, d, future["dtype"])

    def recv(
        self,
        dir: str | None = None,
        shape: tuple[int, ...] = (),
        dtype: str = "f16",
    ) -> TensorHandle:
        if dir is not None and dir not in self._state.neighbors:
            raise IpcqInvalidDirection(
                f"mock tl.recv: direction {dir!r} not in neighbors {list(self._state.neighbors)}"
            )
        # Wait for data
        while True:
            if dir is None:
                # Scan directions in neighbor-table order and take the
                # first non-empty queue (fixed priority, not a rotating one).
                for d in self._state.neighbors:
                    if self._state.recv_q[d]:
                        data = self._state.recv_q[d].popleft()
                        return self._make_handle(data, d, dtype)
            else:
                if self._state.recv_q[dir]:
                    data = self._state.recv_q[dir].popleft()
                    return self._make_handle(data, dir, dtype)
            # Yield to other ranks
            self._scheduler.yield_()

    def _make_handle(self, data: np.ndarray, direction: str, dtype: str) -> TensorHandle:
        return TensorHandle(
            id=f"recv_{direction}",
            addr=0, shape=data.shape, dtype=dtype,
            nbytes=int(data.nbytes), data=data, space="tcm",
        )


# ── Cooperative scheduler ────────────────────────────────────────────


class _MockScheduler:
    """Round-robin cooperative scheduler over rank greenlets."""

    def __init__(self, states: list[_MockRankState]) -> None:
        self.states = states
        self._parent: greenlet | None = None
        self._cur_idx = 0

    def yield_(self) -> None:
        """Called from inside a rank greenlet to give other ranks a turn."""
        assert self._parent is not None
        self._parent.switch()

    def run(self, kernel_fn: Callable, kernel_args: tuple) -> list[np.ndarray]:
        from kernbench.triton_emu.tl_context import TLContext

        self._parent = greenlet.getcurrent()
        n = len(self.states)

        # Per-rank tl shim
        tls: dict[int, _MockTL] = {}

        def _spawn(rank_idx: int) -> greenlet:
            state = self.states[rank_idx]
            tl = _MockTL(state, self)
            tls[rank_idx] = tl

            def _entry():
                # Activate this rank's tl for TensorHandle operator overloads
                TLContext._set_active(tl)  # type: ignore[attr-defined]
                try:
                    kernel_fn(state.t_ptr, *kernel_args, tl=tl)
                finally:
                    TLContext._set_active(None)  # type: ignore[attr-defined]

            return greenlet(_entry)

        for state in self.states:
            state.g = _spawn(state.rank)

        # Drive each rank round-robin until all dead. Detect global deadlock.
        max_rounds = 10_000
        round_no = 0
        while True:
            alive = [s for s in self.states if s.g is not None and not s.g.dead]
            if not alive:
                break
            progressed = False
            for s in self.states:
                if s.g is None or s.g.dead:
                    continue
                # Multi-rank greenlets share TLContext active state via the
                # module-level thread-local; restore this rank's tl before
                # resuming so TensorHandle operator overloads dispatch to
                # the right _MockTL.
                TLContext._set_active(tls[s.rank])  # type: ignore[attr-defined]
                s.g.switch()
                if s.g.dead:
                    progressed = True
            TLContext._set_active(None)  # type: ignore[attr-defined]
            # Loose progress check: every pass over the ranks counts as a
            # round. Abort once more than max_rounds rounds have run and no
            # greenlet finished during the latest one.
            round_no += 1
            if round_no > max_rounds and not progressed:
                raise RuntimeError(
                    "mock CCL runtime: deadlock detected (no progress for "
                    f"{max_rounds} rounds)"
                )

        return [
            s.output if s.output is not None else s._hbm.get(s._slice_addr)
            for s in self.states
        ]


# ── Public entry ────────────────────────────────────────────────────


def run_kernel_in_mock(
    kernel_fn: Callable,
    world_size: int,
    topology: str,
    inputs: list[np.ndarray],
    kernel_args: tuple = (),
    algo_module: Any | None = None,
) -> list[np.ndarray]:
    """Run a CCL kernel under the mock runtime with no SimPy/fabric.

    Args:
        kernel_fn: ``kernel(t_ptr, *kernel_args, tl=...)``
        world_size: number of ranks
        topology: builtin topology name (e.g. "ring_1d")
        inputs: per-rank input ndarrays. ``inputs[r]`` becomes rank r's
            local tile at HBM address ``r * inputs[r].nbytes``; every
            kernel is called with ``t_ptr = 0``.
        kernel_args: extra positional args after t_ptr
        algo_module: optional module providing ``neighbors()`` override

    Returns:
        Per-rank output ndarrays: whatever the kernel wrote via tl.store
        at its slice address (or the original input if the kernel didn't
        store there).
    """
    if len(inputs) != world_size:
        raise ValueError(f"len(inputs)={len(inputs)} != world_size={world_size}")

    topo_fn = resolve_topology(topology, algo_module=algo_module)
    states = [
        _MockRankState(
            rank=r, world_size=world_size,
            neighbors=topo_fn(r, world_size),
            input_arr=inputs[r],
        )
        for r in range(world_size)
    ]

    sched = _MockScheduler(states)
    return sched.run(kernel_fn, kernel_args)
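The scheduling idea behind the mock runtime (each rank runs until it has to wait, then hands control back so the other ranks can fill its queue) can be shown without any kernbench or greenlet dependency. The sketch below swaps greenlets for plain Python generators and uses a naive unidirectional ring all-reduce on scalars; ring_allreduce_rank and run_mock are hypothetical names and are not part of the module above.

```python
from collections import deque


def ring_allreduce_rank(rank, world_size, value, queues):
    """One rank's kernel as a generator; each `yield` returns to the scheduler."""
    total = value
    carry = value
    for _ in range(world_size - 1):
        # "send east": append to the next rank's unbounded inbox.
        queues[(rank + 1) % world_size].append(carry)
        yield
        # "recv west": wait cooperatively until our own inbox has data.
        while not queues[rank]:
            yield
        carry = queues[rank].popleft()
        total += carry
    return total


def run_mock(values):
    """Drive all ranks round-robin until every generator has finished."""
    n = len(values)
    queues = [deque() for _ in range(n)]
    ranks = {r: ring_allreduce_rank(r, n, values[r], queues) for r in range(n)}
    results = {}
    while ranks:
        for r in list(ranks):
            try:
                next(ranks[r])
            except StopIteration as stop:
                results[r] = stop.value
                del ranks[r]
    return [results[r] for r in range(n)]


if __name__ == "__main__":
    print(run_mock([1, 2, 3, 4]))
```

Because each inbox has a single sender and is a FIFO, messages arrive in step order, which is the same property the per-(rank, direction) queues give the mock runtime.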
@@ -0,0 +1,128 @@
"""Builtin neighbor topology generators for CCL backend (ADR-0023 D11).

Each generator takes ``(rank, world_size)`` and returns a
``dict[direction, peer_rank]`` for that rank. ``direction`` is one of
``"N" | "S" | "E" | "W"`` for ring/mesh, or
``"parent" | "child_left" | "child_right"`` for tree topologies.

Algorithm modules may override the generated map by defining a
``neighbors(rank, world_size, neighbor_map) -> dict | None`` function in
the same module (see D11 / D15). ``resolve_topology`` wires these together.
"""
from __future__ import annotations

from typing import Any, Callable

NeighborMap = dict[str, int]
TopologyFn = Callable[[int, int], NeighborMap]


# ── Builtin generators ───────────────────────────────────────────────


def ring_1d(rank: int, world_size: int) -> NeighborMap:
    """1D bidirectional ring (E/W)."""
    return {
        "E": (rank + 1) % world_size,
        "W": (rank - 1) % world_size,
    }


def ring_1d_unidir(rank: int, world_size: int) -> NeighborMap:
    """1D unidirectional ring (E only)."""
    return {"E": (rank + 1) % world_size}


def mesh_2d(rank: int, world_size: int) -> NeighborMap:
    """Square 2D mesh (N/S/E/W).

    Layout: rank = row * side + col, with side = sqrt(world_size).
    Wrap-around (torus) on all four edges.
    """
    side = int(round(world_size ** 0.5))
    if side * side != world_size:
        raise ValueError(
            f"mesh_2d requires square world_size, got {world_size}"
        )
    r, c = divmod(rank, side)
    return {
        "N": ((r - 1) % side) * side + c,
        "S": ((r + 1) % side) * side + c,
        "W": r * side + (c - 1) % side,
        "E": r * side + (c + 1) % side,
    }


def tree_binary(rank: int, world_size: int) -> NeighborMap:
    """Binary tree rooted at rank 0.

    Children of rank r are 2r+1 and 2r+2 (if within world_size).
    Parent of rank r > 0 is (r-1)//2.
    Returned keys (only those that exist):
      "parent", "child_left", "child_right"
    """
    n: NeighborMap = {}
    if rank > 0:
        n["parent"] = (rank - 1) // 2
    left = 2 * rank + 1
    right = 2 * rank + 2
    if left < world_size:
        n["child_left"] = left
    if right < world_size:
        n["child_right"] = right
    return n


def none(rank: int, world_size: int) -> NeighborMap:
    """Empty map: the algorithm's neighbors() must build from scratch."""
    return {}


_BUILTIN: dict[str, TopologyFn] = {
    "ring_1d": ring_1d,
    "ring_1d_unidir": ring_1d_unidir,
    "mesh_2d": mesh_2d,
    "tree_binary": tree_binary,
    "none": none,
}


# ── Resolution ───────────────────────────────────────────────────────


def resolve_topology(
    name: str, algo_module: Any | None = None,
) -> TopologyFn:
    """Return a callable ``(rank, world_size) -> NeighborMap``.

    Args:
        name: builtin topology name from ccl.yaml. Must be one of
            ``ring_1d``, ``ring_1d_unidir``, ``mesh_2d``, ``tree_binary``,
            or ``none``.
        algo_module: optional algorithm module. If it defines
            ``neighbors(rank, world_size, neighbor_map)``, that hook is
            invoked after the builtin to override the result.
            Returning None from neighbors() leaves the builtin map
            unchanged; returning a dict replaces it.

    Raises:
        ValueError: if ``name`` is not a known builtin.
    """
    if name not in _BUILTIN:
        raise ValueError(
            f"Unknown topology '{name}'. "
            f"Available builtins: {list(_BUILTIN)}"
        )
    builtin_fn = _BUILTIN[name]
    override_fn = getattr(algo_module, "neighbors", None) if algo_module else None
    if override_fn is None or not callable(override_fn):
        return builtin_fn

    def _wrapped(rank: int, world_size: int) -> NeighborMap:
        base = builtin_fn(rank, world_size)
        result = override_fn(rank, world_size, base)
        if result is None:
            return base
        return result

    return _wrapped
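How the override hook and the reverse-direction lookup interact can be sketched in a few standalone lines. ring_1d is copied from the module above; with_override mirrors the wrapping done in resolve_topology; drop_wraparound is a hypothetical neighbors() hook that turns the ring into an open chain; reverse_direction follows the first-match lookup used by the backend and the mock. None of this is presented as the project's own test code.

```python
def ring_1d(rank, world_size):
    """1D bidirectional ring (E/W), as defined in the module above."""
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def with_override(builtin_fn, override_fn):
    """Apply a neighbors() hook after the builtin; None keeps the builtin map."""
    def wrapped(rank, world_size):
        base = builtin_fn(rank, world_size)
        result = override_fn(rank, world_size, base)
        return base if result is None else result
    return wrapped


def drop_wraparound(rank, world_size, neighbor_map):
    """Hypothetical hook: remove the wrap-around links at both ends."""
    trimmed = dict(neighbor_map)
    if rank == world_size - 1:
        trimmed.pop("E", None)
    if rank == 0:
        trimmed.pop("W", None)
    return trimmed


def reverse_direction(table, my_rank, peer_rank):
    """First direction in the peer's map that points back to my_rank."""
    for d, target in table[peer_rank].items():
        if target == my_rank:
            return d
    return None


chain = with_override(ring_1d, drop_wraparound)
table = [chain(r, 4) for r in range(4)]

if __name__ == "__main__":
    for r, nbrs in enumerate(table):
        print(r, nbrs)
    print(reverse_direction(table, 0, 1))
```

One property worth keeping in mind when writing kernels: for ring_1d with world_size 2, both "E" and "W" point at the same peer, so a first-match reverse lookup always resolves to "E" and the two directions share one receive queue.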