Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after
  each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to
  prevent stale data from corrupting the MemoryStore snapshot.

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier — only real PyTorch names.
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases
  (ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 19:36:59 -07:00
parent ff2c677a9c
commit 998cc85762
60 changed files with 9196 additions and 80 deletions
+129
@@ -0,0 +1,129 @@
"""CCL all-reduce bench — single unified entry point.
Driven entirely by ``ccl.yaml`` + ``topology.yaml``:
- ``defaults.algorithm`` in ``ccl.yaml`` picks which kernel to run
(``ring_allreduce_{tcm,hbm,sram}`` / ``mesh_allreduce_4`` /
``tree_allreduce_7``).
- ``world_size`` is derived from the algorithm entry's override or from
the topology spec (``sips × cubes_per_sip × pes_per_cube``).
- The host code uses only real PyTorch ``torch.distributed`` names:
``init_process_group``, ``get_world_size``, ``get_rank``, ``all_reduce``.
The bench is split into ``worker(rank, world_size, torch)`` — the
per-rank business logic, designed to look like a real PyTorch DDP
training worker so future model benches can reuse the same skeleton —
and ``run(torch)`` — the kernbench-specific launcher that initializes
the process group and invokes the worker.
"""
from __future__ import annotations
import numpy as np
from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
from kernbench.policy.placement.dp import DPPolicy
# Default per-rank tile size if ccl.yaml doesn't override it. Real
# pytorch benches hardcode batch/feature dims similarly.
DEFAULT_N_ELEM = 32
def _derive_dp(spec: dict, world_size: int) -> DPPolicy:
    """Pick a DPPolicy that fans the tensor across exactly ``world_size`` PEs.

    Mirrors what a real PyTorch DDP user does manually with
    ``tensor.to(f"cuda:{rank}")``: the host code chooses the placement so
    that the collective sees the right number of participating ranks.
    """
    sips = int(spec["system"]["sips"]["count"])
    cm = spec["sip"]["cube_mesh"]
    pl = spec["cube"]["pe_layout"]
    pes_per_cube = int(pl["pe_per_corner"]) * len(pl["corners"])
    cubes_per_sip = int(cm["w"]) * int(cm["h"])
    total = sips * cubes_per_sip * pes_per_cube
    if world_size == total:
        return DPPolicy(sip="column_wise", cube="column_wise", pe="column_wise")
    if world_size <= pes_per_cube:
        return DPPolicy(
            sip="replicate", cube="replicate", pe="column_wise",
            num_sips=1, num_cubes=1, num_pes=world_size,
        )
    if world_size <= cubes_per_sip * pes_per_cube:
        return DPPolicy(
            sip="replicate", cube="column_wise", pe="column_wise",
            num_sips=1, num_cubes=world_size // pes_per_cube,
        )
    return DPPolicy(sip="column_wise", cube="column_wise", pe="column_wise")
def worker(rank: int, world_size: int, torch) -> None:
    """Per-rank business logic. Mirrors a real PyTorch DDP worker.

    In real PyTorch DDP, this function runs in N separate processes,
    each with its own ``rank``. In kernbench (single-process multi-device)
    it is invoked once with ``rank=0`` on the single host driver; the
    actual per-PE parallelism is handled by ``torch.launch`` fanning out
    the kernel across all participating PEs via the tensor's DPPolicy.
    The ``rank`` parameter is therefore always 0 today, and is kept as
    an explicit argument for parity with real DDP workers (``if rank ==
    0`` logging guards, future multi-host extensions).
    """
    cfg = resolve_algorithm_config(load_ccl_config())
    algo_name = cfg["algorithm"]
    n_elem = int(cfg.get("n_elem", DEFAULT_N_ELEM))
    # Pick a DP that produces exactly ``world_size`` shards on this topology.
    dp = _derive_dp(torch.spec, world_size)
    tensor = torch.zeros(
        (1, world_size * n_elem), dtype="f16", dp=dp, name="ccl_in",
    )
    # Initialize: CCL rank r's slice gets value (r + 1). Real PyTorch idiom:
    #   target.copy_(torch.from_numpy(source))
    init = np.zeros((1, world_size * n_elem), dtype=np.float16)
    for r in range(world_size):
        init[0, r * n_elem : (r + 1) * n_elem] = float(r + 1)
    tensor.copy_(torch.from_numpy(init))
    # The main act: one all_reduce call — the backend installs IPCQ at
    # init_process_group time and here only dispatches the kernel.
    torch.distributed.all_reduce(tensor, op="sum")
    # Verify: each shard should hold sum(1..world_size) after all-reduce.
    result = tensor.numpy()
    expected = float(sum(range(1, world_size + 1)))
    all_ok = bool(np.allclose(result, expected, rtol=1e-1, atol=1e-1))
    # Print only on rank 0 — real PyTorch DDP idiom for single-source logs.
    if rank == 0:
        if all_ok:
            print(f" {algo_name} (ws={world_size}): {world_size} OK")
        else:
            flat = result.reshape(-1)
            n_fail = 0
            for r in range(world_size):
                slice_r = flat[r * n_elem : (r + 1) * n_elem]
                if not np.allclose(slice_r, expected, rtol=1e-1, atol=1e-1):
                    n_fail += 1
                    if n_fail <= 5:
                        print(
                            f" [FAIL] rank {r} "
                            f"(ws={world_size}, algo={algo_name}): "
                            f"got mean={float(slice_r.mean()):.3f}, "
                            f"expected={expected:.3f}"
                        )
            print(
                f" {algo_name} (ws={world_size}): "
                f"{world_size - n_fail} OK / {n_fail} FAIL"
            )
def run(torch) -> None:
    """CLI entry point: initialize the process group, invoke worker."""
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")
    worker(
        rank=dist.get_rank(),
        world_size=dist.get_world_size(),
        torch=torch,
    )
+19 -16
@@ -9,29 +9,32 @@ from kernbench.runtime_api.context import RuntimeContext
 BenchFn = Callable[[RuntimeContext], Any]
+def _load_module(bench_id: str):
+    bench_id = bench_id.strip()
+    if not bench_id:
+        raise ValueError("Bench id is empty.")
+    module_path = f"benches.{bench_id}"
+    try:
+        return importlib.import_module(module_path)
+    except ModuleNotFoundError as e:
+        raise ValueError(
+            f"Unknown bench '{bench_id}'. Expected module {module_path}.py"
+        ) from e
 def resolve_bench(bench_id: str) -> BenchFn:
-    """
-    Resolve a bench id into a callable bench function.
+    """Resolve a bench id into its ``run(torch)`` callable.
     Expected layout (repo root):
       benches/<bench_id>.py
         def run(torch: RuntimeContext) -> Any
     """
-    bench_id = bench_id.strip()
-    if not bench_id:
-        raise ValueError("Bench id is empty.")
-    module_path = f"benches.{bench_id}"
-    try:
-        mod = importlib.import_module(module_path)
-    except ModuleNotFoundError as e:
-        raise ValueError(f"Unknown bench '{bench_id}'. Expected module {module_path}.py") from e
+    mod = _load_module(bench_id)
     run_fn = getattr(mod, "run", None)
     if run_fn is None:
-        raise ValueError(f"Bench module {module_path} must define a 'run(torch)' function.")
+        raise ValueError(
+            f"Bench module benches.{bench_id} must define 'run(torch)'."
+        )
     if not callable(run_fn):
-        raise ValueError(f"'run' in {module_path} is not callable.")
+        raise ValueError(f"'run' in benches.{bench_id} is not callable.")
     return run_fn
+80
@@ -0,0 +1,80 @@
# ccl.yaml — CCL backend (ahbm) configuration (ADR-0023 D11)
#
# Loaded by AhbmCCLBackend at init_process_group time.
# defaults.algorithm chooses which kernel + topology is installed
# into PE_IPCQ neighbor tables. Host code is unaware of these settings.
defaults:
  # Algorithm to run for this benchmark execution.
  algorithm: ring_allreduce_tcm
  # NOTE: world_size is not set here by default. AhbmCCLBackend derives it
  # from the chosen algorithm's entry (if it sets ``world_size``) or from
  # topology.yaml (``sips × cubes_per_sip × pes_per_cube``). This mirrors
  # real PyTorch DDP where ranks/world_size come from env vars, not code.
  # IPCQ ring buffer location.
  #   tcm  — PE-local TCM (fast, small, conflicts with compute TCM access)
  #   hbm  — PE-local HBM (large, slower DMA latency)
  #   sram — Cube-shared SRAM (medium, cube-internal contention)
  buffer_kind: tcm
  # Backpressure mode.
  #   poll  — spin-loop polling of cached peer pointers
  #   sleep — yield SimPy event, wake on credit return
  backpressure: sleep
  # Ring depth: number of slots per (direction, tx|rx) buffer.
  n_slots: 4
  # Slot size in bytes (must hold one tile worth of data).
  slot_size: 4096
  # PE_DMA virtual channel chunk size (D8). First implementation does not
  # use chunk-level interleave; this is reserved for future precision.
  vc_chunk_size: 256
  # Credit return fast path message size (D9). Used by bottleneck-BW
  # latency calculation. 16-64 bytes typical.
  ipcq_credit_size_bytes: 16
algorithms:
  # ── ring all-reduce, buffer in PE_TCM ──
  # Defaults to topology-derived world_size (full system, 256 ranks).
  # Use a smaller tile size at high rank counts so f16 sums stay within
  # the verification tolerance and op_log replay scales.
  ring_allreduce_tcm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d
    buffer_kind: tcm
    n_elem: 8
  # ── ring all-reduce, buffer in PE-local HBM ──
  ring_allreduce_hbm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d
    buffer_kind: hbm
    n_elem: 8
  # ── ring all-reduce, buffer in cube SRAM ──
  ring_allreduce_sram:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d
    buffer_kind: sram
    n_elem: 8
  # ── 2D mesh all-reduce: perfect square only (2×2 = 4 PEs) ──
  mesh_allreduce_4:
    module: kernbench.ccl.algorithms.mesh_allreduce
    topology: mesh_2d
    buffer_kind: tcm
    world_size: 4
    n_elem: 16
  # ── tree all-reduce (binary, 7 PEs) ──
  tree_allreduce_7:
    module: kernbench.ccl.algorithms.tree_allreduce
    topology: tree_binary
    buffer_kind: tcm
    world_size: 7
    n_elem: 16
+1
@@ -51,5 +51,6 @@ components:
   builtin.pe_fetch_store: kernbench.components.builtin.pe_fetch_store:PeFetchStoreComponent
   builtin.pe_mmu: kernbench.components.builtin.pe_mmu:PeMmuComponent
   builtin.pe_tcm: kernbench.components.builtin.pe_tcm:PeTcmComponent
+  builtin.pe_ipcq: kernbench.components.builtin.pe_ipcq:PeIpcqComponent
   # Custom — add your implementations here
+866
@@ -0,0 +1,866 @@
# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication
## Status
Proposed
## Context
### Goal
Add the infrastructure that lets CCL (Collective Communication Library)
kernels run **inside** a PE. The host just launches a kernel on each
SIP; the actual synchronization and data movement happen **inside the
PE kernel via an IPCQ (Inter-Process Communication Queue)**.
This mirrors how NCCL performs NVLink communication inside a GPU
kernel, or how Cerebras / Tenstorrent expose core-local communication
queues. Host-level collectives (`dist.all_reduce`) are deferred to
**future work**; this ADR focuses solely on the kernel-side collective
infrastructure.
### Current state
- ADR-0021 PE pipeline refactor: each PE is decomposed into components
(PE_CPU, PE_SCHEDULER, PE_DMA, PE_FETCH_STORE, PE_GEMM, PE_MATH,
PE_TCM, PE_MMU).
- No direct PE-to-PE channel exists today. All data movement goes
through PE_DMA → cube_noc / UCIe / PCIE → HBM.
- A pre-ADR host CCL skeleton exists (`dist.init_process_group(backend="ahbm")`,
`_run_ccl_bench` running per-rank greenlets concurrently). The
collective itself is a stub.
### Problems to solve
1. PE-to-PE direct data movement (writing into a peer's memory).
2. Synchronization — the sender must check that the receiver has space
in its buffer (backpressure).
3. Resource contention between compute traffic and communication
traffic (Head-of-Line blocking).
4. The host must be able to construct logical neighbor topologies
(ring / mesh / tree) per algorithm.
---
## Decision
### D1. Add a new `PE_IPCQ` component
A new component `PE_IPCQ` is added inside each PE. It follows the same
pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a
distinct component.
```
PE
├── PE_CPU
├── PE_SCHEDULER
├── PE_DMA
├── PE_IPCQ ← new
├── PE_FETCH_STORE
├── PE_GEMM
├── PE_MATH
├── PE_TCM
├── PE_MMU
```
**Role separation** (control plane vs. data plane):
- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head /
tail pointer management, peer pointer caches, backpressure, 4-direction
neighbor mapping.
- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe
/ PCIE into the peer's memory.
PE_IPCQ does **not** move data itself — it delegates to PE_DMA.
### D2. Ring buffer model
Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.
```python
@dataclass
class IpcqQueuePair:
    direction: Direction      # N/S/E/W
    peer: IpcqEndpoint        # set by host at init time (D2.5)
    tx_buffer_base: int       # outgoing data base addr (in our memory)
    rx_buffer_base: int       # incoming data base addr (in our memory)
    slot_size: int            # 1 tile per slot
    n_slots: int              # ring depth
    my_head: int              # next slot we will write/send into
    my_tail: int              # next slot we will read/recv from
    peer_head_cache: int      # peer's last-seen head (updated via D9 piggyback)
    peer_tail_cache: int      # peer's last-seen tail (updated via D9 fast-path credit)
```
**Canonical field names**: throughout this ADR the four names above
(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used
consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`,
etc.) are not used.
| Field | Owner | Updated when |
|-------|-------|--------------|
| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |
**Slot unit**: fixed-size, one slot holds one full tile (no descriptor
indirection). Full data embedded in the slot. See D5.
### D2.5. `IpcqEndpoint` schema
`IpcqQueuePair.peer` carries everything the sender needs to compute the
peer's rx slot address:
```python
@dataclass(frozen=True)
class IpcqEndpoint:
    sip: int
    cube: int
    pe: int
    buffer_kind: str          # "tcm" | "hbm" | "sram"
    rx_base_pa: int           # peer rx_buffer base PA (PhysAddr.encode())
    rx_base_va: int           # peer rx_buffer base VA (optional, MMU mode)
    n_slots: int              # peer ring depth (for wrap-around)
    slot_size: int            # peer slot size (for offset)
```
Address computation:
```python
slot_idx = self.my_head % peer.n_slots
dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
```
PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA
(vc_comm) routes the data to `dst_pa` through the fabric.
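The wrap-around behaviour of this address arithmetic can be checked with a minimal sketch. It assumes a stripped-down endpoint holding only the three fields the formula uses; `PeerRx` and `rx_slot_pa` are hypothetical names for illustration, not part of the ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerRx:
    """Hypothetical subset of IpcqEndpoint: only what the formula needs."""
    rx_base_pa: int
    n_slots: int
    slot_size: int


def rx_slot_pa(peer: PeerRx, my_head: int) -> int:
    """Destination PA of the slot that the next send will fill."""
    slot_idx = my_head % peer.n_slots          # wrap around the ring
    return peer.rx_base_pa + slot_idx * peer.slot_size


# Illustrative values only (base address chosen arbitrarily).
peer = PeerRx(rx_base_pa=0x1000, n_slots=4, slot_size=4096)
addrs = [rx_slot_pa(peer, head) for head in range(6)]
```

With four slots, the fifth send (`my_head == 4`) lands on the same address as the first, which is why the backpressure check in D5 must pass before the DMA is issued.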
**Endpoint construction order**: at backend init (D10), the IPCQ
buffers for **every PE** are allocated first (so each rank knows the
others' PA), then the per-rank neighbor tables are built and pushed to
PE_IPCQ via `IpcqInitMsg`.
### D3. Four-direction mapping ≡ logical ProcessGroup
The PE views four directions (N/S/E/W) as logical ports. Real peer
addresses are configured by the host CCL init, per the chosen
algorithm. The PE kernel never knows the topology, only directions.
```python
# 1D ring
for rank in range(world_size):
    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])

# 2D mesh
for r in range(R):
    for c in range(C):
        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
```
The PE code does not need to know where `tl.send(dir="E", ...)` actually
ends up.
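To make the mapping concrete, the same loops can be written as pure functions that return neighbor tables instead of calling `ipcq_set_neighbor`. This is a sketch only: `ring_neighbors` and `mesh_neighbors` are hypothetical helpers, and peers are plain rank ids or `(row, col)` tuples rather than `IpcqEndpoint` objects.

```python
def ring_neighbors(world_size):
    """Direction -> peer rank for every rank of a bidirectional 1D ring."""
    return {
        rank: {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}
        for rank in range(world_size)
    }


def mesh_neighbors(rows, cols):
    """Direction -> peer (row, col) for a 2D mesh with wrap-around edges."""
    table = {}
    for r in range(rows):
        for c in range(cols):
            table[(r, c)] = {
                "N": ((r - 1) % rows, c),
                "S": ((r + 1) % rows, c),
                "E": (r, (c + 1) % cols),
                "W": (r, (c - 1) % cols),
            }
    return table


ring = ring_neighbors(4)
mesh = mesh_neighbors(2, 2)
```

A useful sanity check on any such table is symmetry: following `E` and then `W` should return to the starting rank, otherwise a send on one side has no matching recv direction on the other.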
### D4. PE kernel API
```python
# Send (blocking; may stall on backpressure)
tl.send(dir: str, src=TensorHandle)
tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
# Recv (blocking)
recv = tl.recv(dir: str, shape=..., dtype=...)
recv = tl.recv(shape=..., dtype=...) # round-robin across 4 directions
# Recv (non-blocking)
fut = tl.recv_async(dir: str, shape=..., dtype=...)
recv = tl.wait(fut)
```
`tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each
call rotates through directions, returning the first available slot.
Empty in all 4 directions → wait.
**Fairness is weak**: the rotating start mitigates simple bias, but if
one direction always wins the race the others can starve. Algorithms
that need strict fairness must call `tl.recv(dir=...)` explicitly.
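One possible reading of the rotating cursor is sketched below. It is an interpretation under stated assumptions, not the ADR's implementation: `RecvCursor` is a hypothetical helper, the direction order `N, S, E, W` is assumed, and the set of directions with data available is passed in by the caller.

```python
DIRECTIONS = ("N", "S", "E", "W")   # assumed scan order


class RecvCursor:
    """Hypothetical helper: scan starts just after the last served direction."""

    def __init__(self):
        self.last_polled = -1       # index into DIRECTIONS

    def pick(self, ready):
        """Return the first ready direction after the cursor, or None."""
        for step in range(1, len(DIRECTIONS) + 1):
            idx = (self.last_polled + step) % len(DIRECTIONS)
            if DIRECTIONS[idx] in ready:
                self.last_polled = idx
                return DIRECTIONS[idx]
        return None                 # nothing ready: caller waits (D7)


cursor = RecvCursor()
picks = [cursor.pick({"E", "W"}) for _ in range(4)]
```

In this variant two always-ready directions alternate. Whether the real cursor advances to the served direction or by a fixed step is not specified here, so the fairness caveat above still applies.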
### D5. Single-hop DMA write + full-data slot model
Data moves from sender memory into the receiver's ring slot in **one
DMA transfer**. Key properties:
- **Single-hop**: the sender already knows the peer rx slot address and
fires one fabric DMA into it.
- **No CPU memcpy**: the CPU never copies data.
- **No intermediate staging**: neither side keeps a separate staging
buffer (sender uses the source addr directly; receiver gets the data
in its ring slot directly).
(Strictly speaking the fabric DMA write does happen, so this is not
literally "no data movement" — it's the same property NCCL labels
"zero-copy", meaning no CPU memcpy and no staging copy.)
```
PE A: tl.send(E, src_addr, nbytes)
1. IPCQ computes the peer rx slot address:
dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
(full → sleep / poll)
3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
4. my_head += 1
PE B: data = tl.recv(W)
1. Look at rx_buffer[my_tail % n_slots]
2. Wait for the data to arrive (D7 backpressure mode)
3. Return the slot address to the kernel (or fetch into register file)
4. my_tail += 1
5. Issue a credit-return fast path (D9): after the bottleneck-BW
latency the peer A's peer_tail_cache is updated.
```
The slot holds the full tile. The receiver only reads its own
rx_buffer; it never reads back into A's memory. The sender knows the
peer rx slot address and DMAs directly into it (single-hop).
The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local
to the PE).
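The pointer bookkeeping in the flow above can be exercised without any DMA or timing model. The sketch below tracks only the sender-side counters; `TxSide` is a hypothetical class, and taking the maximum in `on_credit` (so a stale credit never moves the cache backwards) is an assumption rather than something the ADR states.

```python
class TxSide:
    """Sender-side view of one direction (pointers only, no data)."""

    def __init__(self, n_slots):
        self.n_slots = n_slots
        self.my_head = 0            # total sends issued
        self.peer_tail_cache = 0    # total slots the peer is known to have freed

    def can_send(self):
        # The ring is full once n_slots sends are still unacknowledged.
        return self.my_head - self.peer_tail_cache < self.n_slots

    def send(self):
        if not self.can_send():
            return False            # caller would poll or sleep (D7)
        self.my_head += 1
        return True

    def on_credit(self, peer_tail):
        # Assumption: credits are monotonic, so keep the largest value seen.
        self.peer_tail_cache = max(self.peer_tail_cache, peer_tail)


tx = TxSide(n_slots=4)
sent = [tx.send() for _ in range(5)]    # fifth send hits backpressure
tx.on_credit(peer_tail=1)               # receiver consumed one slot
sent_after_credit = tx.send()
```

Because `my_head` and `peer_tail_cache` are free-running counters rather than wrapped indices, the full condition is a plain subtraction and the modulo is applied only when computing the slot address.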
### D6. Buffer placement — three-way benchmark
The host CCL init picks the IPCQ ring-buffer location:
```python
ipcq_init(
backend="ahbm",
buffer_kind="tcm" | "hbm" | "sram",
n_slots=8,
slot_size=4096,
)
```
| Location | Trait | Trade-off |
|----------|-------|-----------|
| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
| **PE-local HBM** | Large; via DMA | Higher latency |
| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |
All three locations run the same kernel code; only the init differs.
### D7. Backpressure — two-mode benchmark
How the sender or receiver waits when peer slots are full / data not
yet arrived:
| Mode | Behavior | Model |
|------|----------|-------|
| **poll** | Periodically re-check the cached peer pointer | Spin loop |
| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |
```python
ipcq_init(backpressure="poll" | "sleep", ...)
```
Both modes are implemented so latency / throughput trade-offs can be
benchmarked.
### D8. PE_DMA virtual channels
Extend PE_DMA from a single queue into a **two-channel virtual-channel**
model.
```
PE_DMA
├── vc_compute: tile load / store / writeback for GEMM and Math
└── vc_comm: IPCQ send data
```
Each VC has an independent state machine:
- One channel stalling does not block the other.
- The same physical link (cube_noc, UCIe, …) is shared, but link BW is
split between channels.
**Chunk-level interleave**:
- Large GEMM tile DMAs do not lock the link end-to-end.
- Progress happens in chunks (e.g. 256 B); each chunk shares link BW
with the other VC's pending chunks.
- Chunk size is an init parameter (smaller = fairer, larger = more
efficient).
Net effect:
- HoL blocking is eliminated (an IPCQ send can interleave with a long
compute DMA).
- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM
pattern).
- Matches the NoC-virtual-channel pattern used in real HW.
**First-implementation accuracy limit (intentional)**: this ADR's
first cut uses **deterministic chunk-level interleave + weighted
round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`).
This is a first-order approximation and is simpler than real HW
dynamic-contention / credit-based arbiters. Functional correctness is
unaffected, but heavy-contention scenarios may report slightly
optimistic latency vs. real HW. A separate ADR can add a NoC arbiter
component later if more precision is needed.
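A deterministic weighted round-robin over chunks can be sketched in a few lines. This illustrates the arbitration idea only: `interleave_chunks` is a hypothetical function, the weights are interpreted as chunks granted per round, and timing is reduced to chunk order.

```python
def interleave_chunks(compute_bytes, comm_bytes, chunk=256, weights=(1, 1)):
    """Return the order in which chunks of each VC are granted the link.

    Each round grants up to weights[0] compute chunks, then up to
    weights[1] comm chunks (assumed interpretation of the 50 / 50 default).
    """
    if min(weights) < 1:
        raise ValueError("weights must be positive integers")
    remaining = {
        "compute": -(-compute_bytes // chunk),   # ceil division
        "comm": -(-comm_bytes // chunk),
    }
    order = []
    while remaining["compute"] or remaining["comm"]:
        for vc, weight in zip(("compute", "comm"), weights):
            grant = min(weight, remaining[vc])
            order.extend([vc] * grant)
            remaining[vc] -= grant
    return order


schedule = interleave_chunks(compute_bytes=2048, comm_bytes=512)
# Number of chunk slots elapsed when the last comm chunk has been granted.
comm_done = len(schedule) - schedule[::-1].index("comm")
```

In this example the 512-byte IPCQ send completes after 4 chunk slots instead of waiting behind all 8 compute chunks, which is the head-of-line effect the two-channel model is meant to remove.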
#### Token routing
- Compute tokens (`TileToken`) — go through the existing
PE_FETCH_STORE → PE_DMA chain.
- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA
self-routing.
- PE_DMA picks the channel by token type.
```python
class PeDmaComponent:
    def _process(self, env, token):
        if isinstance(token, IpcqDmaToken):
            yield from self._vc_comm_process(env, token)
        else:
            yield from self._vc_compute_process(env, token)
```
### D9. Pointer synchronization — DMA payload piggyback
Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so
pointers update along with the data. This simulation adopts the same
model: **no separate control channel** — metadata travels with the
data.
The big benefits:
- **Automatic ordering**: data and metadata move on the same token, so
data is visible **before** the head_cache update. No race.
- **HW fidelity**: matches NVLink / UCIe piggybacked headers.
- **Component simplification**: no separate `IpcqPtrUpdate` event type.
#### Send flow (head update via piggyback)
```
PE A: tl.send(E, src_addr, nbytes)
1. PE_IPCQ checks backpressure (using peer_tail_cache)
2. PE_IPCQ creates an IpcqDmaToken:
- data body (src_addr → peer dst_addr)
- piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
3. Hand the token to PE_DMA(vc_comm)
4. PE A increments my_head (send tracking)
[fabric DMA: latency elapses]
PE B's PE_DMA receives the token
5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)
PE B's PE_IPCQ receives the metadata
7. Updates peer_head_cache (= A's head)
8. Wakes any pending recv on that direction
```
**Steps 5 and 6 must execute in the same SimPy step** — DMA completion
makes data and metadata atomically visible.
#### Recv flow (credit return — fast path with bottleneck-BW latency)
When the receiver frees a slot, the sender must learn about it
(backpressure release). Unlike data, the credit return does **not**
travel through general vc_comm fabric — it uses a **separate fast
path**, an abstraction of the NVLink / UCIe credit-return wire.
**Latency** is computed from the **bottleneck BW on the path**, not a
magic constant:
```
credit_size_bytes = 16 (ccl.yaml: ipcq_credit_size_bytes)
path = router.find_path(self_pe, peer_pe)
latency = compute_drain_ns(path, credit_size_bytes)
= credit_size_bytes / bottleneck_bw_on_path
```
That gives us:
- **Topology-proportional approximation**: an in-cube credit return is
automatically faster than a cross-SIP credit return.
- **No magic constants**: no arbitrary `ipcq_ctrl_latency_ns`.
- **No deadlock risk**: unlike piggyback, B can issue credit even when
it has no data to send back.
- **Reuses existing utility**: `ComponentContext.compute_drain_ns`.
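A worked example of the formula, with made-up link bandwidths: the sketch assumes bandwidths expressed in GB/s (so 1 GB/s equals 1 byte/ns) and a path given as a list of per-link bandwidths. `credit_latency_ns` is a hypothetical stand-in; the units and signature of the real `compute_drain_ns` are not described in this section.

```python
def credit_latency_ns(credit_size_bytes, link_bw_gbps_on_path):
    """Drain time of a credit message over the slowest link on the path.

    Assumes GB/s bandwidths, i.e. bytes per nanosecond.
    """
    bottleneck = min(link_bw_gbps_on_path)
    return credit_size_bytes / bottleneck


# Illustrative numbers only; they are not taken from topology.yaml.
in_cube = credit_latency_ns(16, [64.0, 64.0])            # fast links only
cross_sip = credit_latency_ns(16, [64.0, 16.0, 64.0])    # one slow hop
```

Under these assumed numbers the in-cube credit takes 0.25 ns and the cross-SIP credit 1.0 ns, so the latency tracks the slowest hop without any per-topology constant.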
#### Component coupling — SimPy Store channel
PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init
time, **a SimPy Store is wired between the two** (a per-direction
fast-path channel) and credit metadata is `put` into that store.
```python
class PeIpcqComponent:
    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
        yield env.timeout(latency_ns)
        yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...))
```
Backend init wires both directions of the fast-path channel as part of
fan-out (see `IpcqInitMsg` in D12).
#### Credit-return fast path limitations
- `credit_size_bytes` is an estimate (typically 16 to 64 bytes).
- The fast path is **excluded from vc_comm BW contention** (separate
wire). Real HW credit-return wires are very lightweight, so this is a
reasonable first approximation.
- A follow-up ADR can: model the credit fast path as a separate link
(BW limit + contention), or switch to piggyback (`credit_return_mode:
piggyback`).
#### PE_DMA's added responsibility
When `vc_comm` receives a token, PE_DMA processes it as the following
**atomic** sequence. **No SimPy yield is allowed between the two steps**
(invariant I6):
```python
def _on_vc_comm_recv(self, env, token):
    # ── ATOMIC: no yield between steps 1 and 2 ──
    # 1. Copy the payload into the receiver's rx slot
    data = self._memory_store.read(token.src_space, token.src_addr,
                                   shape=..., dtype=...)
    self._memory_store.write(token.dst_endpoint.buffer_kind,
                             token.dst_addr, data)
    # 2. Forward metadata to the local PE_IPCQ
    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
    # ───────────────────────────────────────────────────
```
The final `put` is yieldable but uses an unbounded internal store, so
it completes in a single step. That `put` is the closing call of the
atomic block; nothing may be inserted between the memory write and it.
### D9.5. ADR-0020 (2-pass) integration
`tl.send` / `tl.recv` integrate with ADR-0020's two-pass model. Phase
1 simulates timing **and** moves data via MemoryStore; Phase 2 enables
op-log-based correctness verification.
#### Phase 1 (timing + data)
D9 models head and tail updates with two different mechanisms:
- **Send-side (head update)** — DMA payload piggyback. Data write and
metadata forward happen in the same SimPy step → automatic atomic
visibility.
- **Recv-side (tail credit return)** — fast-path SimPy Store channel
with bottleneck-BW latency, then `peer_tail_cache` update.
Together they preserve ring-buffer pointer consistency.
The op-log records `op_kind="ipcq"` entries for sends (with
`src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with
`recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).
Two recv modes:
- **`return_slot`** (default): the slot address is returned to the
kernel. Zero-copy.
- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`,
PE_IPCQ copies the slot data into the user dst.
#### Phase 2 (op_log replay)
When `DataExecutor` encounters an `op_kind="ipcq"` record:
- **send**: idempotent `src → dst` ndarray write.
- **recv (`return_slot`)**: no-op (the slot already holds the data).
- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.
IPCQ ops are pure data movement — Phase 2 has nothing extra to compute.
The downstream GEMM / Math ops in `DataExecutor` will consume the data
and naturally validate correctness.
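The three replay rules can be illustrated with a toy log. This is a sketch under loud assumptions: the record layout (`t_start`, `kind`, `recv_mode`, `src`, `dst` as dictionary keys), the `replay` function, and the use of named buffers instead of `(space, addr)` pairs are all hypothetical and far simpler than the real `DataExecutor`.

```python
def replay(op_log, memory):
    """Apply recorded IPCQ copies in t_start order (toy record layout)."""
    for op in sorted(op_log, key=lambda o: o["t_start"]):
        if op["kind"] == "recv" and op["recv_mode"] == "return_slot":
            continue                          # slot already holds the data
        # send: src -> peer slot; recv(copy_to_dst): slot -> user dst
        memory[op["dst"]] = list(memory[op["src"]])
    return memory


memory = {"a_src": [1.0, 2.0], "b_slot": [0.0, 0.0], "b_dst": [0.0, 0.0]}
op_log = [
    # Deliberately out of order to show that t_start drives the replay.
    {"t_start": 20, "kind": "recv", "recv_mode": "copy_to_dst",
     "src": "b_slot", "dst": "b_dst"},
    {"t_start": 10, "kind": "send", "src": "a_src", "dst": "b_slot"},
]
once = replay(op_log, dict(memory))
twice = replay(op_log, dict(once))            # replaying again changes nothing
```

Sorting by `t_start` ensures the send fills the slot before the `copy_to_dst` recv reads it, and running the log a second time leaves memory unchanged, which is the idempotence property the rules above rely on.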
### D10. Host CCL init keeps the PyTorch shape
The host code looks just like real PyTorch DDP. `init_process_group`
creates the backend object; it does **not** receive IPCQ knobs
(neighbor topology, buffer_kind, backpressure …).
```python
# benches/ccl_allreduce.py — same shape as real PyTorch
def worker(rank, world_size, torch):
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")  # reads ccl.yaml + topology
    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
    tensor.copy_(torch.from_numpy(init))
    dist.all_reduce(tensor, op="sum")
```
The IPCQ configuration is decided by the backend at
`init_process_group` time: it loads `ccl.yaml`, picks the algorithm,
and pushes IPCQ neighbor tables to every participating PE_IPCQ. The
host code never has to know about IPCQ.
A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`.
Switching algorithms is purely a `ccl.yaml` change — no host edits
required.
#### Init flow (eager)
1. `init_process_group(backend="ahbm")` is called.
2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
3. Pulls topology + buffer_kind + backpressure + slot config from
`algorithms[<algo>]`.
4. **Immediately** installs neighbor tables on every PE_IPCQ
(sideband or fabric `IpcqInitMsg`).
5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally —
PE_IPCQ is already prepared whether the kernel is a CCL kernel or
not.
### D11. CCL config file (`ccl.yaml`)
IPCQ config and algorithm metadata live in a separate YAML file,
following the same pattern as `components.yaml` and `topology.yaml`.
A single benchmark execution runs one algorithm
(`defaults.algorithm`). Switching algorithms means editing
`defaults.algorithm` only.
```yaml
defaults:
  algorithm: ring_allreduce_tcm
  buffer_kind: tcm        # tcm | hbm | sram
  backpressure: sleep     # poll | sleep
  n_slots: 8
  slot_size: 4096
  vc_chunk_size: 256
  ipcq_credit_size_bytes: 16
algorithms:
  ring_allreduce_tcm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d     # builtin name or "custom"
    buffer_kind: tcm
    n_elem: 8             # optional, per-algorithm tile width
  tree_allreduce_7:
    module: kernbench.ccl.algorithms.tree_allreduce
    topology: tree_binary
    buffer_kind: tcm
    world_size: 7         # algorithm-level override
    n_elem: 16
  custom_mesh:
    module: kernbench.ccl.algorithms.custom_mesh
    topology: custom      # the module supplies its own neighbors()
```
`world_size` is deliberately **left out of `defaults`** in the example
above. The backend resolves it in this order:
`algorithm-level override > defaults override > topology spec`. The
last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP,
where `WORLD_SIZE` comes from env vars rather than config files.
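The precedence chain can be written as a short function. This is a sketch under assumptions: `resolve_world_size` is a hypothetical name, and the topology spec is assumed to be a dict with `sips`, `cubes_per_sip`, and `pes_per_cube` keys taken from the fallback formula above.

```python
def resolve_world_size(algo_entry: dict, defaults: dict, topo: dict) -> int:
    """Apply: algorithm-level override > defaults override > topology spec."""
    if "world_size" in algo_entry:
        return int(algo_entry["world_size"])
    if "world_size" in defaults:
        return int(defaults["world_size"])
    # Last fallback: every PE in the topology participates.
    return topo["sips"] * topo["cubes_per_sip"] * topo["pes_per_cube"]


topo = {"sips": 1, "cubes_per_sip": 2, "pes_per_cube": 8}
ws_tree = resolve_world_size({"world_size": 7}, {}, topo)   # algorithm override
ws_ring = resolve_world_size({}, {}, topo)                  # topology fallback
```

With the example `ccl.yaml`, `tree_allreduce_7` would resolve to 7 ranks, while `ring_allreduce_tcm` would fall through to the topology product.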
#### Algorithm module structure
Each algorithm module exports two hooks — `kernel` (required) and
`neighbors` (optional) — plus a `kernel_args` helper that the
backend uses to populate positional kernel arguments at `all_reduce`
time:
```python
# src/kernbench/ccl/algorithms/ring_allreduce.py
def kernel_args(world_size: int, n_elem: int) -> tuple:
    return (n_elem, world_size)


def kernel(t_ptr, n_elem, world_size, tl):
    """Required: the PE kernel.

    IPCQ is already installed by the backend before this is called.
    The kernel only uses the four-direction send / recv API.
    """
    ...


def neighbors(rank, world_size, neighbor_map):
    """Optional: override the builtin topology's neighbor map.

    Returns a new dict, the modified-in-place dict, or None to keep the
    builtin map.
    """
    return None
```
#### `neighbors` override patterns
- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
- **Pattern B — replace entirely**: ignore `neighbor_map` and return a
brand-new dict.
- **Pattern C — keep builtin**: omit `neighbors` or return None.
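One way the three patterns could be handled on the backend side is sketched below. `apply_neighbors_override` is a hypothetical helper, not the framework's API. Passing a copy of the builtin map to the hook is a design choice of this sketch, so that an in-place tweak never leaks back into the builtin map.

```python
from types import SimpleNamespace


def apply_neighbors_override(module, rank, world_size, builtin_map):
    """Return the effective neighbor map for one rank (hypothetical helper)."""
    hook = getattr(module, "neighbors", None)
    if hook is None:                       # Pattern C: hook omitted
        return dict(builtin_map)
    result = hook(rank, world_size, dict(builtin_map))
    if result is None:                     # Pattern C: explicit None
        return dict(builtin_map)
    return result                          # Pattern A or Pattern B


def drop_w_on_odd(rank, world_size, neighbor_map):
    """Pattern A: tweak the builtin map in place."""
    if rank % 2 == 1:
        neighbor_map.pop("W", None)
    return neighbor_map


def skip_ring(rank, world_size, neighbor_map):
    """Pattern B: ignore the builtin map entirely."""
    return {"E": (rank + 2) % world_size}


builtin = {"E": 2, "W": 0}  # ring_1d map for rank 1 of 4
map_a = apply_neighbors_override(SimpleNamespace(neighbors=drop_w_on_odd), 1, 4, builtin)
map_b = apply_neighbors_override(SimpleNamespace(neighbors=skip_ring), 1, 4, builtin)
map_c = apply_neighbors_override(SimpleNamespace(), 1, 4, builtin)
```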
#### Builtin topologies
| topology | direction set |
|----------|---------------|
| `ring_1d` | E, W |
| `ring_1d_unidir` | E only |
| `mesh_2d` | N, S, E, W |
| `tree_binary` | parent, child_left, child_right |
| `none` | (empty) — algorithm must supply `neighbors()` |
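For the two ring variants, the per-rank neighbor maps can be sketched as follows. The `ring_1d` formula matches the one quoted in the author guide (`{"E": (rank+1)%ws, "W": (rank-1)%ws}`). The function names are illustrative and are not necessarily those used in `src/kernbench/ccl/topologies.py`.

```python
def ring_1d(rank: int, world_size: int) -> dict:
    """Bidirectional ring: E is the next rank, W is the previous rank."""
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def ring_1d_unidir(rank: int, world_size: int) -> dict:
    """Unidirectional ring: only the E direction is installed."""
    return {"E": (rank + 1) % world_size}


first = ring_1d(0, 4)          # rank 0 wraps around to rank 3 on its W side
last = ring_1d_unidir(3, 4)    # rank 3 wraps around to rank 0 on its E side
```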
#### Adding a new algorithm
1. Write `kernel` and `kernel_args` in
`src/kernbench/ccl/algorithms/<algo>.py`.
2. Add an entry in `ccl.yaml`'s `algorithms` section.
3. (Optional) provide `neighbors()` for custom topology.
4. Set `defaults.algorithm` to the new algorithm.
The host bench (`benches/ccl_allreduce.py`) does not change.
### D12. Message / token schema
This ADR adds the following message types. They live in
`src/kernbench/common/pe_commands.py` and
`src/kernbench/runtime_api/kernel.py`.
#### `IpcqInitMsg` (sideband, fan-out at init)
The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors
`MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`).
Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`,
`my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store`
field — a `simpy.Store` instance pre-wired so the sender PE_IPCQ can
push `IpcqCreditMetadata` directly into the receiver's input queue.
#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)
Carries `direction`, source addr/space, nbytes, shape, dtype, and a
handle id. `data_op=True` so it lands in the op_log.
#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)
Carries `direction` (or None for round-robin), `recv_mode`
(`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape,
dtype, blocking flag.
#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)
Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`)
plus the head metadata (`sender_seq`, `src_sip/cube/pe`,
`src_direction`). PE_DMA picks the channel by token type
(`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`).
The receiver's PE_DMA, on token arrival, performs the I6 atomic
sequence: write data into MemoryStore, then forward `IpcqMetaArrival`
to the local PE_IPCQ.
#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)
Carries `consumer_seq` (= my_tail), source PE coords, and source
direction. Travels through the dedicated SimPy Store channel rather
than `vc_comm`. Latency = `credit_size_bytes / bottleneck_bw_on_path`.
There is **no `IpcqPtrUpdate` event** — head updates flow via D9
piggyback, tail updates via the D9 fast-path channel.
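As a worked example of the credit latency formula, the sketch below takes the bottleneck as the minimum bandwidth over the links on the path. The bandwidth values and the bytes-per-nanosecond unit are illustrative assumptions. Only the 16-byte credit size comes from the `ccl.yaml` example.

```python
def credit_latency_ns(credit_size_bytes: int, link_bw_bytes_per_ns: list) -> float:
    """Latency = credit_size_bytes / bottleneck_bw_on_path."""
    bottleneck = min(link_bw_bytes_per_ns)   # slowest link dominates
    return credit_size_bytes / bottleneck


# Hypothetical three-hop path; the 32 B/ns hop is the bottleneck.
latency = credit_latency_ns(16, [64.0, 32.0, 128.0])
```

Here the result is 16 / 32 = 0.5 ns, independent of the faster hops.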
### D13. Test strategy
Following the ADR-0021 D8 pattern.
#### T1. Unit tests (component-level)
- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure
immediately forwards a token; full peer slot triggers backpressure
(poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`;
round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`.
- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute`
/ `vc_comm` independent progress, chunk interleave, BW split.
- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d /
mesh_2d / tree_binary correctness, mesh_2d non-square →
`ValueError`, custom resolver returns the module's `neighbors`.
#### T2. Integration tests (E2E send/recv)
- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional
no-deadlock), 4×4 mesh.
- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode
records `ipcq` ops in op_log; DataExecutor produces correct
`out.data`.
#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)
`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA
consistency, per-`buffer_kind` allocation.
#### T4. Regression
All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for
non-CCL benches.
#### T5. Performance / overhead
Single send/recv pair latency = (DMA latency) + (IPCQ overhead).
Should be close to a regular PE_DMA write of the same nbytes (IPCQ
overhead < 100 ns).
### D14. Invariants and failure modes
#### Invariants
I1. **Slot lifecycle exactly-once**: one send → exactly one recv.
I2. **Pointer monotonicity**: `my_head` / `my_tail` are monotonically
non-decreasing; `sender_seq` is strictly increasing.
I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank
B, then rank B's reverse-direction peer must be rank A. Verified at
init.
I4. **`buffer_kind` consistency**: all PEs in a process group share
the same `buffer_kind` (no mixed mode in the first cut).
I5. **op_log ordering**: send → DMA complete → recv possible. The
t_start order in op_log respects this causality.
I6. **Atomic data + metadata visibility (MUST)**: at the receiver
side, data write (`MemoryStore.write`) and metadata forward
(`peer_head_cache` update) **must execute in the same SimPy step**.
No yield is allowed between the two operations in PE_DMA's vc_comm
handler. Code review must reject any inserted `yield` (or `yield
from`) — it would create a race where head_cache becomes visible
before or after the data.
I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6,
the step in which `peer_head_cache > my_tail` first becomes true is the
same step in which the slot data is observable.
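The I3 check can be sketched as a pass over the per-rank neighbor maps. `check_endpoint_consistency` is a hypothetical name, and the sketch only covers the four compass directions. It also assumes that a missing reverse entry counts as a violation, which may be stricter than what the backend does for unidirectional topologies.

```python
OPPOSITE = {"E": "W", "W": "E", "N": "S", "S": "N"}


def check_endpoint_consistency(maps: dict) -> list:
    """Return a list of I3 violations; an empty list means consistent.

    `maps` is rank -> {direction: peer_rank}.
    """
    errors = []
    for rank, nmap in maps.items():
        for direction, peer in nmap.items():
            back = OPPOSITE[direction]
            # Assumption: the peer must install the reverse direction.
            if maps.get(peer, {}).get(back) != rank:
                errors.append(
                    f"rank {rank} dir={direction} -> {peer}, "
                    f"but rank {peer} dir={back} is not {rank}"
                )
    return errors


ring = {r: {"E": (r + 1) % 4, "W": (r - 1) % 4} for r in range(4)}
ok = check_endpoint_consistency(ring)

broken = {r: dict(m) for r, m in ring.items()}
broken[0]["E"] = 2                      # rank 0 now points east at rank 2
violations = check_endpoint_consistency(broken)
```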
#### Failure modes (runtime errors)
F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction
raises `IpcqInvalidDirection` and the simulation aborts.
F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched
send and recv. Not validated by default; opt-in strict mode catches
it (`strict_validation: true` on a PE_IPCQ node attrs).
F3. **Deadlock detection (timeout-based)**: the simulator empties its
schedule while a send/recv is still pending → engine raises
`IpcqDeadlock` and embeds a pointer dump.
F4. **Backend init failure**: missing `defaults.algorithm`, missing
`algorithms[name]`, module import failure, topology validation
failure (I3, I4) — all raised at `init_process_group` time.
F5. **Slot full + infinite backpressure**: the peer never recvs.
Surfaces as F3 timeout.
#### Diagnostics
- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as
`(rank, t, dir, nbytes)`.
- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)`
prints every PE_IPCQ ring buffer's `my_head`, `my_tail`,
`peer_head_cache`, `peer_tail_cache`.
- **Deadlock dump**: on hang the engine includes the pointer dump in
the `IpcqDeadlock` exception message.
### D15. Algorithm-author cheat sheet
Full step-by-step lives in
[`docs/ccl-author-guide.en.md`](../ccl-author-guide.en.md). The
shortest version:
| Things you touch | Things you don't |
|------------------|-------------------|
| `src/kernbench/ccl/algorithms/<your_algo>.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
| (Optional) `tests/test_<your_algo>.py` mock test | PE_IPCQ component, AhbmCCLBackend |
5-step flow: write the kernel → register in `ccl.yaml` → optional
`neighbors` override → optional mock unit test → SimPy validation via
`kernbench run --bench ccl_allreduce --verify-data`.
Common mistakes: using a direction that wasn't installed, sends
without matching recvs (deadlock), dtype/shape disagreement, assuming
fairness from `tl.recv()` round-robin, confusing
`tl.num_programs(axis)` with the CCL group size.
---
## Non-goals
- **Host collective**: a model where `dist.all_reduce` itself moves
data on the host side is out of scope. This ADR only covers
communication that happens inside the PE kernel.
- **All-reduce algorithms**: ring / tree / etc. live in algorithm
modules and can be added without amending this ADR.
- **Reliability / error handling**: link faults, send/recv failure
recovery, etc. are out of scope.
- **NoC arbiter precision**: dynamic VC contention is left for a future
ADR (see D8).
---
## Open questions
- **VC arbitration accuracy** — the first cut uses deterministic
chunk interleave + weighted round-robin; heavy contention may report
optimistic latency. A NoC arbiter component can be added later.
- **Credit return BW model** — the fast path is currently outside the
fabric BW contention model. Can be modeled as a separate link or
switched to piggyback (`credit_return_mode: piggyback`).
- **Ring buffer slot allocation metadata** — whether the host pushes
IPCQ buffer metadata via sideband or via a fabric message similar to
`MmuMapMsg` is open.
- **VC BW split default** — 50/50 vs. weighted (e.g. 80/20). Exposed in
`ccl.yaml`; default value TBD.
- **Direction count** — 4 (N/S/E/W) is fixed in the first cut; 6
(with Up/Down for 3D) or N (variable) is future work.
- **Multi-tile aggregation primitives** — whether
`tl.recv_all` or similar is needed for fan-in.
- **Round-robin recv fairness** — current weak fairness can starve;
strict fairness counter is future work.
- **Deadlock detection precision** — currently timeout-based; a
realtime wait-for graph would enable deterministic detection.
---
## Consequences
### Positive
- Direct PE-to-PE communication makes it possible to write CCL kernels.
- Host stays minimal (just `launch`), synchronization happens inside
the PE → strong compute / comm overlap.
- VCs eliminate HoL blocking → collective latency is not blocked by
compute traffic.
- Buffer placement and backpressure mode are init-time parameters →
easy to benchmark.
- Four-direction logical neighbors → host is free to map
ring/mesh/tree algorithms.
### Negative
- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE.
- VC arbitration is a first-order approximation; heavy contention
scenarios may report slightly optimistic latency vs real HW (D8).
- Chunk-level interleave makes PE_DMA implementation more complex.
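To put a number on the IPCQ memory cost listed above, the sketch below evaluates `8 rings × slot_size × n_slots` with the `defaults` from the `ccl.yaml` example (`n_slots: 8`, `slot_size: 4096`). The function name is illustrative.

```python
def ipcq_bytes_per_pe(n_slots: int, slot_size: int, n_rings: int = 8) -> int:
    """IPCQ ring-buffer footprint per PE, in bytes."""
    return n_rings * slot_size * n_slots


cost = ipcq_bytes_per_pe(n_slots=8, slot_size=4096)
cost_kib = cost // 1024
```

With these defaults the footprint is 262144 bytes (256 KiB) per PE, which is worth comparing against the capacity of the chosen `buffer_kind`.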
---
## Affected files
| File | Change |
|------|--------|
| `topology.yaml` | Add `pe_ipcq` to `pe_template`, plus the IPCQ ↔ DMA / CPU / TCM edges. |
| `components.yaml` | Register `pe_ipcq_v1`. |
| `src/kernbench/topology/builder.py` | Wire the IPCQ chain into PE-internal edges. |
| `src/kernbench/components/builtin/pe_ipcq.py` | New. |
| `src/kernbench/components/builtin/pe_dma.py` | Add VCs, handle `IpcqDmaToken`. |
| `src/kernbench/common/pe_commands.py` | `IpcqSendCmd`, `IpcqRecvCmd`, `IpcqDmaToken`. |
| `src/kernbench/triton_emu/tl_context.py` | `tl.send` / `tl.recv` API. |
| `src/kernbench/runtime_api/distributed.py` | Eager IPCQ install in `AhbmCCLBackend.__init__`. |
| `src/kernbench/runtime_api/kernel.py` | `IpcqInitMsg` definition. |
| `src/kernbench/ccl/__init__.py` | New CCL package. |
| `src/kernbench/ccl/topologies.py` | Builtin topology generators + `resolve_topology()`. |
| `src/kernbench/ccl/helpers.py` | Algorithm-author helpers (`chunked`, `ring_step`, `tree_step`). |
| `src/kernbench/ccl/testing.py` | Mock CCL runtime (`run_kernel_in_mock`). |
| `src/kernbench/ccl/algorithms/*.py` | Algorithm modules (kernel + `kernel_args` + optional `neighbors`). |
| `ccl.yaml` | Algorithm metadata + IPCQ defaults. |
| `tests/test_pe_ipcq.py` | PE_IPCQ unit tests. |
| `tests/test_pe_dma_vc.py` | PE_DMA VC tests. |
| `tests/test_ipcq_e2e.py` | end-to-end send/recv tests. |
| `tests/test_ccl_topologies.py` | Builtin topology generator tests. |
| `tests/test_ccl_allreduce_matrix.py` | Unified bench × algorithm matrix. |
# CCL Algorithm Author Guide (English)
This document is a step-by-step guide for engineers writing CCL
(Collective Communication Library) algorithms in kernbench. The
internal system design and component structure live in
[ADR-0023](adr/ADR-0023-ipcq-pe-collective.md).
The goal here is to clearly separate **what an algorithm author has to
touch** from **what they can leave alone**, and to get a first
algorithm running through the shortest possible path.
---
## 0. Five-minute tour
| Things you touch | Location |
|------------------|----------|
| Algorithm module (kernel + optional `neighbors()`) | `src/kernbench/ccl/algorithms/<algo>.py` |
| Algorithm registration | `ccl.yaml` |
| Host bench (rank count, init, launch, verify) | `benches/<your_bench>.py` |
| (Optional) unit test | `tests/test_<algo>.py` |
| Things you do NOT touch | Location |
|--------------------------|----------|
| TLContext API | `src/kernbench/triton_emu/tl_context.py` (ADR-0022 spec) |
| Framework (topology generators, helpers, mock testing) | `src/kernbench/ccl/` |
| PE_IPCQ / PE_DMA components | `src/kernbench/components/builtin/` |
| Backend implementation (`install_ipcq`) | `src/kernbench/runtime_api/distributed.py` and `kernbench/ccl/install.py` |
Workflow:
1. Write a `kernel` function in the algorithm module.
2. Register an entry in `ccl.yaml`.
3. Write a host bench using `torch.distributed.init_process_group` /
`torch.distributed.all_reduce` (the unified `benches/ccl_allreduce.py`
handles the common case).
4. (Optional) Run the mock runtime for fast unit tests (a few ms).
5. `kernbench run --bench <name> --verify-data` for full SimPy verification.
---
## 1. Hello World — the simplest send/recv
Each PE sends its tile to its E neighbor once and receives a tile from
its W neighbor once. The reference code lives in
[`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py).
### Step 1: write the kernel
New file `src/kernbench/ccl/algorithms/hello_send.py`:
```python
"""Hello world: send your tile to the next rank, receive from the previous one."""
def kernel(t_ptr, n_elem, tl):
    # Global rank is computed from program_id(0/1) (ADR-0022).
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe
    nbytes = n_elem * 2  # f16
    pe_addr = t_ptr + rank * nbytes
    # Load our slice and send it east.
    src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    tl.send(dir="E", src=src)
    # Receive from west and store directly back into our slice.
    recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
    tl.store(pe_addr, recv)


def kernel_args(world_size: int, n_elem: int) -> tuple:
    """Positional kernel args used by the ahbm backend (after t_ptr)."""
    return (n_elem,)
```
Key points:
- **Global rank is computed from `program_id(axis=0)` + `program_id(axis=1)`.**
TL has no contractually-supported `tl.rank` / `tl.world_size`. If the
host needs to pass `world_size` or anything else as an algorithm
parameter, it goes through ordinary `torch.launch` arguments.
- **`tl.send` takes a `TensorHandle`.** PE_IPCQ reads
`addr`/`space`/`shape`/`dtype`/`nbytes` from the handle to issue an
`IpcqDmaToken` to PE_DMA.
- **`tl.recv` requires `shape` and `dtype`.** The returned TensorHandle
points at the IPCQ ring slot and can be used directly as a `dst`
handle (e.g. `tl.store(pe_addr, recv)`). Phase 2's `dma_write` replay
handles the (slot → hbm) copy, so user code never has to touch
`recv.data`.
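The rank and address arithmetic used by the kernel can be checked in isolation. The helper names below are illustrative, and the 2-byte element size assumes `f16` as in the kernel above.

```python
def global_rank(local_pe: int, cube_id: int, pes_per_cube: int) -> int:
    """rank = cube_id * pes_per_cube + local_pe, as in the kernel."""
    return cube_id * pes_per_cube + local_pe


def slice_addr(t_ptr: int, rank: int, n_elem: int, elem_bytes: int = 2) -> int:
    """Start address of this rank's slice inside the shared tensor."""
    return t_ptr + rank * n_elem * elem_bytes


rank = global_rank(local_pe=3, cube_id=1, pes_per_cube=8)   # second cube, fourth PE
addr = slice_addr(t_ptr=0x1000, rank=rank, n_elem=8)        # 16 bytes per slice
```

With 8 PEs per cube, PE 3 of cube 1 is rank 11, and its slice starts 11 × 16 = 176 bytes past `t_ptr`.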
### Step 2: register in `ccl.yaml`
```yaml
algorithms:
  hello_send:
    module: kernbench.ccl.algorithms.hello_send
    topology: ring_1d
    buffer_kind: tcm
    world_size: 8
```
`world_size` here is optional. If absent, `AhbmCCLBackend` derives it
from the topology spec (`sips × cubes_per_sip × pes_per_cube`).
### Step 3: write a host bench (optional — the unified bench may suffice)
For most CCL benchmarks the existing `benches/ccl_allreduce.py` is
sufficient: it reads `ccl.yaml`, picks the algorithm, sets up the
process group, and runs the collective. If your algorithm needs custom
host logic, write a new bench file along the same lines.
The host code looks like a real PyTorch DDP worker:
```python
"""benches/ccl_hello.py"""
from __future__ import annotations

import numpy as np

from kernbench.policy.placement.dp import DPPolicy

N_ELEM = 8


def worker(rank: int, world_size: int, torch) -> None:
    """Per-rank business logic, mirroring a real PyTorch DDP worker."""
    dp = DPPolicy(
        sip="replicate", cube="replicate", pe="column_wise",
        num_sips=1, num_cubes=1, num_pes=world_size,
    )
    tensor = torch.zeros(
        (1, world_size * N_ELEM), dtype="f16", dp=dp, name="hello_in",
    )
    # Per-rank initialization via the real PyTorch idiom.
    init = np.zeros((1, world_size * N_ELEM), dtype=np.float16)
    for r in range(world_size):
        init[0, r * N_ELEM : (r + 1) * N_ELEM] = float(r + 1)
    tensor.copy_(torch.from_numpy(init))
    # The collective itself.
    torch.distributed.all_reduce(tensor, op="sum")
    # Verify on rank 0 (real PyTorch DDP idiom).
    if rank == 0:
        result = tensor.numpy()
        for r in range(world_size):
            expected = float(((r - 1) % world_size) + 1)
            slice_r = result[0, r * N_ELEM : (r + 1) * N_ELEM]
            print(
                f" rank {r}: got {float(slice_r.mean()):.1f}, "
                f"expected {expected:.1f}"
            )


def run(torch) -> None:
    """CLI entry point. Initializes dist, dispatches to worker."""
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")
    worker(
        rank=dist.get_rank(),
        world_size=dist.get_world_size(),
        torch=torch,
    )
```
### Step 4: unit test (optional but strongly recommended)
`tests/test_hello_send.py`:
```python
import numpy as np

from kernbench.ccl.algorithms.hello_send import kernel
from kernbench.ccl.testing import run_kernel_in_mock


def test_hello_send_4_ranks():
    n_elem = 8
    inputs = [
        np.full((n_elem,), float(r + 1), dtype=np.float16)
        for r in range(4)
    ]
    outputs = run_kernel_in_mock(
        kernel_fn=kernel,
        world_size=4,
        topology="ring_1d",
        inputs=inputs,
        kernel_args=(n_elem,),
    )
    # rank r should now hold rank (r-1) % 4's data.
    for r in range(4):
        assert np.array_equal(outputs[r], inputs[(r - 1) % 4])
```
`run_kernel_in_mock` runs every rank concurrently in pure Python (no
SimPy), so a unit test like this finishes in **milliseconds**. It only
verifies algorithmic correctness — no latency, no DMA, no fabric.
### Step 5: SimPy validation
```bash
kernbench run --topology topology.yaml --bench ccl_hello --verify-data
```
Phase 1 runs the SimPy simulation + MemoryStore data movement, Phase 2
replays the op_log for correctness. The bench's `print` lines should
show matching `got` and `expected` values for every rank.
---
## 2. Ring all-reduce — the second algorithm
Slightly more complex. Each PE runs `world_size - 1` rounds, sending
its current tile east and accumulating the tile received from the west.
After all rounds, every PE holds the global sum.
The reference implementation lives in
[`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py).
The core flow:
```python
"""Ring all-reduce."""
def kernel(t_ptr, n_elem, world_size, tl):
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe
    nbytes = n_elem * 2
    pe_addr = t_ptr + rank * nbytes
    # The handle points at HBM[pe_addr]. In greenlet mode .data is
    # populated, but the kernel never has to touch .data directly.
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    current = acc  # source for the first send
    for _step in range(world_size - 1):
        tl.send(dir="E", src=current)
        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
        # TensorHandle operator overload → MathCmd → PE_MATH dispatch.
        # Phase 1 only models timing; Phase 2 DataExecutor replays the
        # actual numpy accumulation.
        acc = acc + recv
        current = recv  # forward the received slot to the next round
    # Store the final accumulator back to HBM. Source is acc (a PE-local
    # scratch addr); dst is HBM. The op_log dma_write entry records both
    # ends so Phase 2 copies the math result into HBM at verify time.
    tl.store(pe_addr, acc)


def kernel_args(world_size: int, n_elem: int) -> tuple:
    return (n_elem, world_size)
```
Four key points:
1. **Accumulation goes through TensorHandle operators.** `acc + recv`
emits a `MathCmd` and dispatches it through PE_MATH — i.e. the
real hardware path, so the latency model stays accurate. Per
ADR-0020 D3, Phase 1 only simulates timing; Phase 2's `DataExecutor`
replays the op_log and runs the actual numpy accumulation.
2. **Use `current = recv` to forward.** Each round must update the send
source to the just-received slot handle so the same data circulates
exactly once around the ring. Setting `current = acc` would resend
the cumulative sum, inflating the result.
3. **`tl.store(pe_addr, acc)` exactly once at the end.** Do not use a
store→reload pattern in the middle. `acc` lives in PE-local scratch;
the op_log records `(src=scratch, dst=hbm)` and Phase 2 first runs
math (filling scratch) then copies via the dma_write snapshot.
4. **`world_size` is passed by the host explicitly.** TL only knows the
topology slot count (e.g. `num_programs(axis=0)` is "PEs per cube"),
not the participating CCL group size. The host bench knows
`world_size` and forwards it as an explicit kernel argument.
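Key point 2 is easy to verify with a tiny host-side model of the ring, independent of kernbench. The sketch below uses one integer per rank instead of a tile, and `ring_allreduce_model` is an illustrative name. It runs the same `world_size - 1` rounds and lets you switch between forwarding the received value (`current = recv`) and forwarding the accumulator (`current = acc`).

```python
def ring_allreduce_model(values, forward_received=True):
    """Simulate the ring all-reduce rounds on plain Python numbers."""
    ws = len(values)
    acc = list(values)
    current = list(values)          # source for the first send
    for _ in range(ws - 1):
        sent = list(current)        # every rank sends east in the same round
        recv = [sent[(r - 1) % ws] for r in range(ws)]   # and receives from west
        acc = [a + x for a, x in zip(acc, recv)]
        # Correct: forward what was just received. Wrong: forward the sum.
        current = recv if forward_received else list(acc)
    return acc


good = ring_allreduce_model([1, 2, 3, 4])                          # every rank holds 10
bad = ring_allreduce_model([1, 2, 3, 4], forward_received=False)   # inflated sums
```

In the correct variant each original value visits every rank exactly once. In the wrong variant partial sums are re-added on later rounds, so the per-rank results exceed the true total.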
For registration in `ccl.yaml` and wiring through the unified bench,
look at the existing `ring_allreduce_tcm/_hbm/_sram` entries plus
[`benches/ccl_allreduce.py`](../benches/ccl_allreduce.py). Mock unit
tests live in
[`tests/test_ccl_mock_runtime.py`](../tests/test_ccl_mock_runtime.py)
and follow the `kernel_args=(n_elem, world_size)` convention.
---
## 3. `neighbors()` override — custom topology
Most algorithms are happy with the builtin topologies (`ring_1d`,
`mesh_2d`, `tree_binary`, `ring_1d_unidir`, `none`). If you want to
modify a builtin or define a brand-new connectivity pattern, define a
`neighbors()` function in your algorithm module.
### Signature
```python
def neighbors(
    rank: int, world_size: int, neighbor_map: dict[str, int],
) -> dict[str, int] | None:
    """Override the neighbor map produced by the builtin topology.

    Args:
        neighbor_map: the mapping the ccl.yaml ``topology`` field built.
            For ring_1d this is {"E": (rank+1)%ws, "W": (rank-1)%ws}.
            The dict is mutable; modify in place if you want.

    Returns:
        dict: the new neighbor map (or the modified-in-place dict).
        None: do not override; use neighbor_map as-is.
    """
    return None
```
### Pattern A: tweak a builtin
```python
def neighbors(rank, world_size, neighbor_map):
    # Only even ranks use W; remove W from odd ranks.
    if rank % 2 == 1:
        neighbor_map.pop("W", None)
    return neighbor_map
```
### Pattern B: replace entirely (skip-connection ring)
```python
def neighbors(rank, world_size, neighbor_map):
    return {"E": (rank + 2) % world_size}
```
### Pattern C: keep builtin
Either omit `neighbors` entirely or return None:
```python
def neighbors(rank, world_size, neighbor_map):
    return None  # explicit "use the builtin"
```
---
## 4. PE kernel API reference (ADR-0023 D4)
### IPCQ API
| API | Description | Blocking? |
|-----|-------------|-----------|
| `tl.send(dir, src=TensorHandle)` | Send to a peer in the given direction. | Yes (waits if peer slots are full) |
| `tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)` | Same, keyword form. | Yes |
| `tl.recv(dir, shape=..., dtype=...)` | Blocking recv from one direction. | Yes |
| `tl.recv(shape=..., dtype=...)` | Round-robin recv across all four directions. | Yes |
| `tl.recv_async(dir, shape=..., dtype=...) → RecvFuture` | Non-blocking recv. | No |
| `tl.wait(future)` | Wait for a non-blocking recv future → returns the resolved TensorHandle. | Yes |
### Existing TL API (ADR-0020/0022, unchanged)
| API | Description |
|-----|-------------|
| `tl.load(addr, shape, dtype) → TensorHandle` | DMA read; in greenlet mode `.data` carries the ndarray. |
| `tl.store(addr, handle)` | DMA write — when `handle.data` is set the runner propagates it to MemoryStore. |
| `tl.composite(op, ...)` | Submit a GEMM/Math composite (non-blocking). |
| `tl.program_id(axis=0)` | Local PE id within the cube. |
| `tl.program_id(axis=1)` | Cube id (ADR-0022). |
| `tl.num_programs(axis=0/1)` | Topology slot counts (NOT the participating-rank count). |
### Two recv modes
The default is `return_slot` (zero-copy): the IPCQ slot address is
returned in `handle.addr`. To force a copy into a custom destination,
pass `dst_addr` + `dst_space`:
```python
recv = tl.recv(
    dir="W", shape=(8,), dtype="f16",
    dst_addr=my_scratch_addr,
    dst_space="hbm",
)
# After this call recv.addr == my_scratch_addr (copy_to_dst mode).
```
---
## 5. Helpers (`kernbench.ccl.helpers`)
Convenience helpers to keep algorithm code short:
```python
from kernbench.ccl.helpers import chunked, ring_step, tree_step
```
### `chunked(base_addr, n_chunks, n_elem, dtype="f16") → list[Chunk]`
Split a tile of `n_elem` elements into `n_chunks` equal-size views.
Each `Chunk` has `addr`, `n_elem`, `nbytes` fields.
```python
chunks = chunked(t_ptr, n_chunks=4, n_elem=64, dtype="f16")
# chunks[0..3] are 16-element views with consecutive addresses.
```
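To make the `Chunk` layout concrete, here is a stand-alone sketch that mirrors the description above. It is not the helper's actual source: `chunked_sketch`, the `DTYPE_BYTES` table, and the divisibility check are assumptions made for illustration.

```python
from dataclasses import dataclass

DTYPE_BYTES = {"f16": 2, "f32": 4}  # assumed element sizes


@dataclass(frozen=True)
class Chunk:
    addr: int
    n_elem: int
    nbytes: int


def chunked_sketch(base_addr, n_chunks, n_elem, dtype="f16"):
    """Split `n_elem` elements into `n_chunks` equal, consecutive views."""
    if n_elem % n_chunks != 0:
        raise ValueError("n_elem must be divisible by n_chunks")
    per_chunk = n_elem // n_chunks
    nbytes = per_chunk * DTYPE_BYTES[dtype]
    return [
        Chunk(addr=base_addr + i * nbytes, n_elem=per_chunk, nbytes=nbytes)
        for i in range(n_chunks)
    ]


chunks = chunked_sketch(0x2000, n_chunks=4, n_elem=64, dtype="f16")
```

Each of the four chunks covers 16 `f16` elements (32 bytes), and consecutive chunks start 32 bytes apart.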
### `ring_step(rank, step, world_size) → (send_idx, recv_idx)`
Per-step chunk indices for a ring algorithm (reduce-scatter / all-gather):
```python
for step in range(world_size - 1):
    send_idx, recv_idx = ring_step(rank, step, world_size)
    tl.send(
        dir="E", src_addr=chunks[send_idx].addr,
        nbytes=chunks[send_idx].nbytes,
        shape=(chunks[send_idx].n_elem,), dtype="f16",
    )
    recv = tl.recv(
        dir="W", shape=(chunks[recv_idx].n_elem,), dtype="f16",
    )
    # accumulate ...
```
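The index pattern behind a ring reduce-scatter can be sketched as below. This uses one common textbook convention, and the real `ring_step` may number chunks differently, so treat `ring_step_sketch` as an assumption rather than the helper's definition. The property that matters is that the chunk a rank receives at a given step is the chunk its W neighbor sends at that step.

```python
def ring_step_sketch(rank, step, world_size):
    """Chunk indices for one step of a ring reduce-scatter (assumed convention)."""
    send_idx = (rank - step) % world_size
    recv_idx = (rank - step - 1) % world_size
    return send_idx, recv_idx


ws = 4
schedule = {
    (rank, step): ring_step_sketch(rank, step, ws)
    for rank in range(ws)
    for step in range(ws - 1)
}
```

Under this convention the chunk received at step `s` is also the chunk the same rank sends at step `s + 1`, which is what lets partial sums travel around the ring.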
### `tree_step(rank, world_size) → {"parent": int|None, "children": list[int]}`
Parent / children rank ids for a binary tree:
```python
info = tree_step(rank, world_size)
if info["parent"] is None:
    print(f"rank {rank} is the root")
for child in info["children"]:
    ...
```
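A binary tree over rank ids is often laid out like a binary heap, with rank 0 as the root. The sketch below assumes that layout. The real `tree_step` may use a different numbering, so `tree_step_sketch` is illustrative only.

```python
def tree_step_sketch(rank, world_size):
    """Parent / children under a heap-style layout (assumed, rank 0 is root)."""
    parent = None if rank == 0 else (rank - 1) // 2
    children = [c for c in (2 * rank + 1, 2 * rank + 2) if c < world_size]
    return {"parent": parent, "children": children}


# world_size=7 gives a complete three-level tree.
root = tree_step_sketch(0, 7)
inner = tree_step_sketch(2, 7)
leaf = tree_step_sketch(6, 7)
```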
---
## 6. Unit testing — Mock runtime
`kernbench.ccl.testing.run_kernel_in_mock` runs an algorithm without
SimPy for fast feedback.
### Basic usage
```python
import numpy as np

from kernbench.ccl.testing import run_kernel_in_mock
from kernbench.ccl.algorithms.my_algo import kernel


def test_my_algo():
    n_elem = 16
    # Use np.float16 explicitly: the NumPy string code "f16" means a
    # 16-byte float, not half precision.
    inputs = [np.arange(n_elem, dtype=np.float16) + r for r in range(4)]
    expected = sum(inputs)
    outputs = run_kernel_in_mock(
        kernel_fn=kernel,
        world_size=4,
        topology="ring_1d",
        inputs=inputs,
        kernel_args=(n_elem, 4),  # positional args after t_ptr
    )
    for r in range(4):
        assert np.allclose(outputs[r], expected, rtol=1e-3)
```
### Behavior
- All ranks run their kernels concurrently as cooperative greenlets.
- `tl.send` / `tl.recv` are serviced by in-memory FIFOs (no DMA, no
latency).
- Each rank's last `store` is what the helper returns as a numpy array.
### Limitations
- No latency or performance numbers (it is not a simulation).
- No PE_DMA, fabric, or BW model.
- Correctness only.
- One cube assumed: `program_id(axis=1)` is always 0.
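The "in-memory FIFO" idea can be illustrated with a few lines of standard-library Python. This is not `run_kernel_in_mock`: it swaps cooperative greenlets for threads, uses `queue.Queue` as the FIFO, and simplifies the kernel signature so the kernel returns its result instead of calling `tl.store`. All names (`MockTL`, `run_in_mock`, `hello_kernel`) are hypothetical.

```python
import queue
import threading

OPPOSITE = {"E": "W", "W": "E"}


class MockTL:
    """Per-rank stand-in for the send/recv part of the TL API."""

    def __init__(self, rank, neighbors, inboxes):
        self.rank = rank
        self.neighbors = neighbors      # direction -> peer rank
        self.inboxes = inboxes          # (rank, direction) -> queue.Queue

    def send(self, dir, src):
        peer = self.neighbors[dir]
        # A tile sent east arrives on the peer's west-facing FIFO.
        self.inboxes[(peer, OPPOSITE[dir])].put(list(src))

    def recv(self, dir):
        return self.inboxes[(self.rank, dir)].get(timeout=5)


def run_in_mock(kernel_fn, inputs):
    """Run one kernel instance per rank on a bidirectional ring."""
    ws = len(inputs)
    inboxes = {(r, d): queue.Queue() for r in range(ws) for d in OPPOSITE}
    outputs = [None] * ws

    def body(rank):
        nmap = {"E": (rank + 1) % ws, "W": (rank - 1) % ws}
        outputs[rank] = kernel_fn(inputs[rank], MockTL(rank, nmap, inboxes))

    threads = [threading.Thread(target=body, args=(r,)) for r in range(ws)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs


def hello_kernel(tile, tl):
    tl.send(dir="E", src=tile)
    return tl.recv(dir="W")


inputs = [[float(r + 1)] * 4 for r in range(4)]
outputs = run_in_mock(hello_kernel, inputs)
```

Because the queues are unbounded, `send` never blocks in this sketch, so it cannot reproduce backpressure or the deadlocks described in section 8.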
---
## 7. Debugging
### CCL trace
```bash
KERNBENCH_CCL_TRACE=1 kernbench run --topology topology.yaml \
--bench ccl_allreduce --verify-data
```
Per-rank send/recv events appear on stdout:
```
[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0
[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64
```
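For longer runs it can help to post-process the trace. The parser below is a sketch written against the two sample lines above only. The regular expression and the `parse_trace_line` name are assumptions, and the real trace may contain fields not covered here.

```python
import re

TRACE_RE = re.compile(
    r"\[ccl t=(?P<t>[\d.]+) (?P<kind>send|recv)\] "
    r"(?P<pe>\S+) dir=(?P<dir>\w+) nbytes=(?P<nbytes>\d+)"
    r"(?: seq=(?P<seq>\d+))?"
)


def parse_trace_line(line):
    """Parse one CCL trace line into a dict, or return None if it does not match."""
    m = TRACE_RE.match(line.strip())
    if m is None:
        return None
    event = m.groupdict()
    event["t"] = float(event["t"])
    event["nbytes"] = int(event["nbytes"])
    event["seq"] = int(event["seq"]) if event["seq"] is not None else None
    return event


sample = """\
[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0
[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64
"""
events = [e for e in map(parse_trace_line, sample.splitlines()) if e]
```

Pairing each `send` with the matching `recv` on the neighboring PE then gives a per-hop latency estimate straight from the log.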
### Pointer dump
`kernbench.ccl.diagnostics.pointer_dump(engine)` returns a multi-line
dump of every PE_IPCQ ring buffer's `my_head`, `my_tail`,
`peer_head_cache`, `peer_tail_cache`. When something hangs, this shows
which rank is stuck and on what.
### Deadlock detection
When the SimPy schedule empties because of unmatched send/recv pairs,
the engine raises `IpcqDeadlock` and embeds the pointer dump in the
message (ADR-0023 D14 F3). Wait-for-graph visualization is future
work.
---
## 8. Common mistakes
### 1. Using a direction that wasn't installed
`topology: ring_1d` only installs E and W. Trying:
```python
tl.send(dir="N", ...) # → IpcqInvalidDirection
```
Fix: switch to `topology: mesh_2d`, or add N/S in a `neighbors()` override.
### 2. `send` without a matching `recv`
```python
def kernel(..., tl):
    for _ in range(100):
        tl.send(dir="E", ...)
    # The peer never recvs → ring buffer fills → backpressure → deadlock.
```
Fix: every `send` needs a matching `recv` on the receiver side.
Otherwise `IpcqDeadlock` is raised.
### 3. dtype/shape mismatch
By default mismatches are not validated. The author is responsible for
consistency. Set `strict_validation: true` on a PE_IPCQ node's attrs to
enable D14 F2 strict mode and catch them immediately.
### 4. Assuming round-robin recv fairness
`tl.recv()` (no direction) returns the first slot to arrive in
round-robin order, but **arrival order is not predictable**. If your
algorithm depends on a particular direction, name it explicitly:
`tl.recv(dir="N", ...)`.
### 5. Confusing `num_programs` with the CCL group size
`tl.num_programs(axis=0/1)` reports topology slot counts, not the
number of ranks participating in the collective. The host bench knows
`world_size` and must pass it through as a kernel argument.
### 6. Overwriting the send source before it's actually sent
PE_DMA snapshots the source data into the IpcqDmaToken at send time,
preserving in-flight semantics. The snapshot only protects data that
has already reached the PE_DMA queue, so the safe pattern is to call
`tl.send` first and overwrite the source buffer only afterwards. If the
buffer is overwritten before `tl.send` has been queued in PE_DMA, the
snapshot captures the new contents instead of the data you meant to
send.
---
## 9. Next steps
- Try other topologies (`mesh_2d`, `tree_binary`).
- Faster algorithms (recursive halving / doubling).
- Compare `buffer_kind` (tcm/hbm/sram) and `backpressure` (poll/sleep)
modes for latency.
- Larger-scale validation through the unified `ccl_allreduce` bench
with different `ccl.yaml` overlays.
If you add a new algorithm or pattern, please send a PR.
---
## References
- [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective design.
- [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1).
- [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution.
- [ADR-0021](adr/ADR-0021-pe-pipeline-refactor.md): PE pipeline refactor.
Existing algorithm examples:
- [`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py) — simplest send/recv
- [`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py) — ring all-reduce
- [`src/kernbench/ccl/algorithms/mesh_allreduce.py`](../src/kernbench/ccl/algorithms/mesh_allreduce.py) — 2D mesh all-reduce
- [`src/kernbench/ccl/algorithms/tree_allreduce.py`](../src/kernbench/ccl/algorithms/tree_allreduce.py) — binary tree all-reduce
# CCL Algorithm Author Guide
This document is a step-by-step guide for people who write CCL (Collective
Communication Library) algorithms for kernbench. The system's internal design and
component structure are described in [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md).
This guide aims to separate clearly **what an algorithm author has to touch** from
**what they can leave alone**, and to get a first algorithm running by the shortest path.

---

## 0. Five-minute summary

| What you touch | Location |
|----------|------|
| Algorithm module (kernel + optional neighbors) | `src/kernbench/ccl/algorithms/<algo>.py` |
| Algorithm registration | `ccl.yaml` |
| Host bench (PE count, memory init, launch, verification) | `benches/<your_bench>.py` |
| (Optional) unit test | `tests/test_<algo>.py` |

| What you do not touch | Location |
|---------------|------|
| TLContext API | `src/kernbench/triton_emu/tl_context.py` (ADR-0022 spec) |
| Framework (topology generators, helpers, mock testing) | `src/kernbench/ccl/` |
| PE_IPCQ / PE_DMA components | `src/kernbench/components/builtin/` |
| Backend implementation (install_ipcq) | `src/kernbench/runtime_api/distributed.py`, `kernbench/ccl/install.py` |

Workflow:

1. Write `kernel` in the algorithm module
2. Register an entry in `ccl.yaml`
3. Call `install_ipcq` + `launch` from the host bench
4. (Optional) unit-test with the mock runtime (a few ms)
5. Verify under SimPy with `kernbench run --bench <name> --verify-data`

---
## 1. Hello World: the simplest send/recv

The simplest possible algorithm: each PE sends its data once to its E neighbor and
receives once from the W direction. The working code is in
[`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py).

### Step 1: Write the kernel

New file `src/kernbench/ccl/algorithms/hello_send.py`:

```python
"""Hello world: send our data to the next rank and receive from the previous rank."""


def kernel(t_ptr, n_elem, tl):
    # The global rank is derived from program_id(0/1) (ADR-0022).
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe

    nbytes = n_elem * 2  # f16
    pe_addr = t_ptr + rank * nbytes

    # Load our own slice and send it to E.
    src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    tl.send(dir="E", src=src)

    # Receive from W and store the tile into our own slice unchanged.
    recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
    tl.store(pe_addr, recv)
```

Key points:

- **The global rank is computed from `program_id(axis=0)` + `program_id(axis=1)`.** TL has
  no ad hoc extensions such as `tl.rank` / `tl.world_size`. If the kernel needs an
  algorithm parameter such as `world_size`, the host passes it as an ordinary
  `torch.launch` argument.
- **`tl.send` takes a `TensorHandle`.** PE_IPCQ reads the handle's
  `addr`/`space`/`shape`/`dtype`/`nbytes` and issues an IpcqDmaToken to PE_DMA.
- **`tl.recv` requires `shape` and `dtype`.** The returned TensorHandle points at an IPCQ
  ring slot and can be passed straight to a store, as in `tl.store(pe_addr, recv)`.
  The Phase 2 dma_write replay performs the (slot, hbm) copy, so there is no need to
  touch the numpy `.data` directly.
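
The rank and slice-address arithmetic used by the kernel can be checked in isolation. The following is a minimal sketch; `PES_PER_CUBE`, `T_PTR` and `N_ELEM` are illustrative values chosen here, not values taken from a real topology.

```python
F16_BYTES = 2  # the hello_send kernel hard-codes f16 tiles


def global_rank(cube_id: int, local_pe: int, pes_per_cube: int) -> int:
    """Same formula as the kernel: cube-major, PE-minor."""
    return cube_id * pes_per_cube + local_pe


def slice_addr(t_ptr: int, rank: int, n_elem: int) -> int:
    """Byte address of one rank's slice inside the shared tensor."""
    return t_ptr + rank * n_elem * F16_BYTES


# Hypothetical values, for illustration only.
PES_PER_CUBE = 8
T_PTR = 0x1000
N_ELEM = 8

rank = global_rank(cube_id=1, local_pe=3, pes_per_cube=PES_PER_CUBE)
addr = slice_addr(T_PTR, rank, N_ELEM)
print(rank, hex(addr))
```

Because slices are laid out back to back, two ranks never overlap as long as every rank uses the same `n_elem`.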
### Step 2: Register in ccl.yaml

Add an entry to the `algorithms` section of `ccl.yaml`. (You do not have to change
`defaults.algorithm`, because the host bench can pass the algorithm explicitly via
`install_ipcq(algorithm=...)`.)

```yaml
algorithms:
  hello_send:
    module: kernbench.ccl.algorithms.hello_send
    topology: ring_1d
    buffer_kind: tcm
```
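
An algorithm entry only needs the fields that differ from `defaults`, since the two are merged at install time. The sketch below mirrors the merge order of `resolve_algorithm_config` in `kernbench/ccl/install.py` (defaults first, then the algorithm entry on top). The concrete default values (`n_slots: 4`, `buffer_kind: hbm`) are assumptions made up for this example.

```python
def merge_algorithm_config(cfg: dict, name: str) -> dict:
    """Defaults first, then the algorithm entry overrides them."""
    merged = dict(cfg.get("defaults", {}))
    merged.update(cfg["algorithms"][name])
    merged["algorithm"] = name
    return merged


# Hypothetical parsed ccl.yaml content.
cfg = {
    "defaults": {"algorithm": "ring_allreduce", "n_slots": 4, "buffer_kind": "hbm"},
    "algorithms": {
        "hello_send": {
            "module": "kernbench.ccl.algorithms.hello_send",
            "topology": "ring_1d",
            "buffer_kind": "tcm",
        },
    },
}

merged = merge_algorithm_config(cfg, "hello_send")
print(merged["buffer_kind"], merged["n_slots"])
```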
### Step 3: Write the host bench

New file `benches/ccl_hello.py`:

```python
"""Hello-world ring rotation bench (each PE receives its W neighbor's data once)."""
import numpy as np

from kernbench.ccl.algorithms import hello_send
from kernbench.policy.placement.dp import DPPolicy

ALGORITHM = "hello_send"
N_ELEM = 8
WORLD_SIZE = 8


def run(torch):
    plan = torch.install_ipcq(algorithm=ALGORITHM)

    a = torch.zeros(
        (1, WORLD_SIZE * N_ELEM), dtype="f16",
        dp=DPPolicy(
            sip="replicate", cube="replicate", pe="column_wise",
            num_sips=1, num_cubes=1,
        ),
        name="hello_in",
    )

    # Seed rank r's slice with the constant r + 1.
    store = torch.engine.memory_store
    base = a._handle.va_base or a._handle.shards[0].pa
    nbytes = N_ELEM * 2
    for r in range(WORLD_SIZE):
        store.write("hbm", base + r * nbytes,
                    np.full((N_ELEM,), float(r + 1), dtype=np.float16))

    torch.launch(ALGORITHM, hello_send.kernel, a, N_ELEM)

    # Rank r must now hold the data of rank (r - 1) % ws.
    for r, (sip, cube, pe) in enumerate(plan["rank_to_pe"]):
        result = store.read("hbm", base + r * nbytes, shape=(N_ELEM,), dtype="f16")
        prev = float(((r - 1) % WORLD_SIZE) + 1)
        ok = np.allclose(result, prev)
        print(f" [{'OK ' if ok else 'FAIL'}] rank {r} got {float(result.mean()):.1f}, "
              f"expected {prev:.1f}")
```
### Step 4: Unit test (optional, strongly recommended)

`tests/test_hello_send.py`:

```python
import numpy as np

from kernbench.ccl.algorithms.hello_send import kernel
from kernbench.ccl.testing import run_kernel_in_mock


def test_hello_send_4_ranks():
    n_elem = 8
    inputs = [np.full((n_elem,), float(r + 1), dtype=np.float16) for r in range(4)]
    outputs = run_kernel_in_mock(
        kernel_fn=kernel,
        world_size=4,
        topology="ring_1d",
        inputs=inputs,
        kernel_args=(n_elem,),
    )
    # Rank r must receive the data of rank (r - 1) % 4.
    for r in range(4):
        assert np.array_equal(outputs[r], inputs[(r - 1) % 4])
```

`run_kernel_in_mock` runs every rank concurrently in pure Python, without SimPy, so
it **finishes in milliseconds**. It verifies only the correctness of the algorithm logic.
### Step 5: Verify in simulation

```bash
kernbench run --topology topology.yaml --bench ccl_hello --verify-data
```

Phase 1 runs the SimPy simulation and moves data through the MemoryStore; Phase 2
replays the op_log to check data correctness. The `print` check in the host bench must
report OK for every rank.

---
## 2. Ring All-Reduce: a second algorithm

A slightly more complex example. In ring all-reduce each PE sends to E and receives
from W for N-1 rounds: it sends its own tile in the first round, forwards the tile it
just received in later rounds, and accumulates every arrival. At the end every PE holds
the global sum.

For the working code see [`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py).
The core flow:

```python
"""Ring all-reduce."""


def kernel(t_ptr, n_elem, world_size, tl):
    # rank
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe

    nbytes = n_elem * 2
    pe_addr = t_ptr + rank * nbytes

    # TensorHandle pointing at our own slice in HBM. In greenlet mode .data is
    # populated, but the kernel never needs to touch .data directly.
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    current = acc  # source of the first-round send

    for _step in range(world_size - 1):
        tl.send(dir="E", src=current)
        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
        # TensorHandle operator overload → MathCmd → dispatched to PE_MATH.
        # Phase 1 models timing only; the Phase 2 DataExecutor does the real numpy accumulation.
        acc = acc + recv
        current = recv  # the next round forwards the slot we just received

    # Store the final accumulated value into our own slice. The source is acc
    # (a PE-local scratch addr) and the dst is HBM. The op_log dma_write records the
    # (scratch, hbm) copy, so Phase 2 fills HBM[pe_addr] with the correct result
    # by verification time.
    tl.store(pe_addr, acc)
```
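Four points:

1. **Accumulation uses TensorHandle operators**: `acc + recv` emits a `MathCmd` that is
   dispatched to PE_MATH, so the modelled latency follows the real hardware path.
   Per ADR-0020 D3, Phase 1 simulates timing only, and the Phase 2 `DataExecutor`
   performs the numpy accumulation while replaying the op_log.
2. **Forward with `current = recv`**: each round must update the send source to the slot
   handle received in the previous round. Only then does every tile travel around the
   ring and get accumulated exactly once per rank. With `current = acc` the accumulated
   value is sent again and the result is inflated.
3. **A single `tl.store(pe_addr, acc)` is enough**: an intermediate store→reload pattern
   is forbidden. acc lives in PE-local scratch, and the op_log records the
   (src=scratch, dst=hbm) metadata. Phase 2 runs the math first to fill the scratch and
   then copies it to HBM through the dma_write snapshot.
4. **The host passes `world_size` explicitly**: TL only knows the number of topology slots
   (for example, `num_programs(axis=0)` is the number of PEs per cube). The bench knows the
   size of the CCL group that actually participates and passes it as a host→kernel argument.

For the `ccl.yaml` registration and the host bench, see [`benches/ccl_allreduce_tcm.py`](../benches/ccl_allreduce_tcm.py).
For a mock unit test, follow [`tests/test_ccl_mock_runtime.py`](../tests/test_ccl_mock_runtime.py)
as is (the arguments take the form `kernel_args=(n_elem, world_size)`).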
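
The difference between forwarding `recv` and forwarding `acc` (point 2) can be reproduced without the simulator. The following is a plain NumPy sketch of the data flow only; `ring_allreduce_sim` and its `forward_received` flag are hypothetical names introduced here, and all ranks are stepped in lockstep rather than through IPCQ.

```python
import numpy as np


def ring_allreduce_sim(tiles, forward_received=True):
    """Simulate the ring data flow for all ranks in lockstep."""
    ws = len(tiles)
    acc = [t.astype(np.float32) for t in tiles]
    current = list(acc)  # first round sends each rank's own tile
    for _ in range(ws - 1):
        # Every rank sends `current` to E, so rank r receives from rank r - 1.
        recv = [current[(r - 1) % ws] for r in range(ws)]
        acc = [acc[r] + recv[r] for r in range(ws)]
        # Correct: forward what was just received. Wrong: forward the running sum.
        current = recv if forward_received else list(acc)
    return acc


tiles = [np.full(4, float(r + 1)) for r in range(4)]  # ranks hold 1, 2, 3, 4
good = ring_allreduce_sim(tiles, forward_received=True)
bad = ring_allreduce_sim(tiles, forward_received=False)
print(good[0][0], bad[0][0])
```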
---
## 3. neighbors() override: custom topology

For most algorithms the builtin topologies (`ring_1d`, `mesh_2d`, `tree_binary`,
`ring_1d_unidir`, `none`) are enough. To modify a builtin or create a new topology,
define `neighbors()` in the algorithm module.

### Signature

```python
def neighbors(rank: int, world_size: int, neighbor_map: dict[str, int]) -> dict[str, int] | None:
    """Override the neighbor_map produced by the builtin topology.

    Args:
        neighbor_map: builtin mapping produced from the topology field in ccl.yaml.
            Example: ring_1d → {"E": (rank+1)%ws, "W": (rank-1)%ws}
            Mutable dict; it may be modified in place.

    Returns:
        dict: the overriding neighbor_map (or the same dict after modification)
        None: no override; neighbor_map is used as is
    """
    return None
```

### Pattern A: start from the builtin and change part of it

```python
def neighbors(rank, world_size, neighbor_map):
    # Only even ranks use the W direction (odd ranks drop W).
    if rank % 2 == 1:
        neighbor_map.pop("W", None)
    return neighbor_map
```

### Pattern B: write it from scratch (skip-connection ring)

```python
def neighbors(rank, world_size, neighbor_map):
    # Ignore neighbor_map and build a new mapping.
    return {"E": (rank + 2) % world_size}
```

### Pattern C: use the builtin, no override

Either do not define `neighbors()` at all, or return None:

```python
def neighbors(rank, world_size, neighbor_map):
    return None  # explicitly use the builtin
```
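
To see what each pattern produces per rank, the hook can be exercised outside the framework. In the sketch below, `ring_1d` follows the builtin mapping quoted in the docstring above, while `resolve_neighbors` is a hypothetical stand-in for the install logic; the real installer may apply the hook differently.

```python
def ring_1d(rank, world_size):
    """Builtin ring mapping as described in the neighbors() docstring."""
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def resolve_neighbors(rank, world_size, override=None):
    """Hypothetical helper: apply an optional neighbors() hook to the builtin map."""
    base = ring_1d(rank, world_size)
    if override is None:
        return base
    result = override(rank, world_size, base)
    return base if result is None else result


def drop_w_on_odd(rank, world_size, neighbor_map):  # Pattern A
    if rank % 2 == 1:
        neighbor_map.pop("W", None)
    return neighbor_map


def skip_ring(rank, world_size, neighbor_map):  # Pattern B
    return {"E": (rank + 2) % world_size}


def no_override(rank, world_size, neighbor_map):  # Pattern C
    return None


for fn in (drop_w_on_odd, skip_ring, no_override):
    print(fn.__name__, [resolve_neighbors(r, 4, fn) for r in range(4)])
```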
---
## 4. PE kernel API reference (ADR-0023 D4)

### IPCQ API

| API | Description | Blocking? |
|-----|------|-----------|
| `tl.send(dir, src=TensorHandle)` | Send data in the given direction | Yes (waits while the peer slot is full) |
| `tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)` | Same, keyword form | Yes |
| `tl.recv(dir, shape=..., dtype=...)` | Blocking recv from a specific direction | Yes |
| `tl.recv(shape=..., dtype=...)` | Round-robin recv over the 4 directions (no direction given) | Yes |
| `tl.recv_async(dir, shape=..., dtype=...) → RecvFuture` | Non-blocking recv | No |
| `tl.wait(future)` | Wait for a non-blocking future to complete → TensorHandle | Yes |

### Existing TL API (ADR-0020/0022, usable as is)

| API | Description |
|-----|------|
| `tl.load(addr, shape, dtype) → TensorHandle` | DMA read; in greenlet mode `.data` holds an ndarray |
| `tl.store(addr, handle)` | DMA write; if handle.data is set, it is propagated to the MemoryStore |
| `tl.composite(op, ...)` | Asynchronous submit of GEMM/Math compute |
| `tl.program_id(axis=0)` | Local PE id within the cube |
| `tl.program_id(axis=1)` | Cube id (ADR-0022) |
| `tl.num_programs(axis=0/1)` | Number of topology slots (not the number of participating ranks) |

### Two `recv` modes

The default is `return_slot` (zero-copy): the IPCQ slot address is returned directly in
handle.addr. To copy the slot data to a separate location, specify `dst_addr` + `dst_space`:

```python
recv = tl.recv(
    dir="W", shape=(8,), dtype="f16",
    dst_addr=my_scratch_addr,
    dst_space="hbm",
)
# recv.addr == my_scratch_addr from here on (copy_to_dst mode)
```
---
## 5. Helpers (`kernbench.ccl.helpers`)

Helpers that keep algorithm code short:

```python
from kernbench.ccl.helpers import chunked, ring_step, tree_step
```

### `chunked(base_addr, n_chunks, n_elem, dtype="f16") → list[Chunk]`

Returns a list of views that split `n_elem` elements into `n_chunks` equal parts. Each
`Chunk` has `addr`, `n_elem`, and `nbytes` fields. A `ValueError` is raised if `n_elem`
is not divisible by `n_chunks`.

```python
chunks = chunked(t_ptr, n_chunks=4, n_elem=64, dtype="f16")
# chunks[0..3] are 16-element views with consecutive addresses
```
### `ring_step(rank, step, world_size) → (send_idx, recv_idx)`

Per-step chunk indices for a ring algorithm (reduce-scatter / all-gather):

```python
for step in range(world_size - 1):
    send_idx, recv_idx = ring_step(rank, step, world_size)
    tl.send(dir="E", src_addr=chunks[send_idx].addr,
            nbytes=chunks[send_idx].nbytes,
            shape=(chunks[send_idx].n_elem,), dtype="f16")
    recv = tl.recv(dir="W", shape=(chunks[recv_idx].n_elem,), dtype="f16")
    # accumulate ...
```
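
The schedule can be sanity-checked with a small lockstep simulation. The sketch below copies the `ring_step` formula from `kernbench.ccl.helpers` so it runs standalone; `reduce_scatter_sim` is a hypothetical helper that models only the data movement, with plain NumPy arrays in place of IPCQ slots.

```python
import numpy as np


def ring_step(rank, step, world_size):
    """Same formula as kernbench.ccl.helpers.ring_step."""
    send_idx = (rank - step) % world_size
    recv_idx = (rank - step - 1) % world_size
    return send_idx, recv_idx


def reduce_scatter_sim(data):
    """data[r][c] is rank r's local copy of chunk c."""
    ws = len(data)
    bufs = [[chunk.copy() for chunk in row] for row in data]
    for step in range(ws - 1):
        # Snapshot outgoing chunks first: all ranks send concurrently.
        outgoing = [bufs[r][ring_step(r, step, ws)[0]].copy() for r in range(ws)]
        for r in range(ws):
            _, recv_idx = ring_step(r, step, ws)
            bufs[r][recv_idx] += outgoing[(r - 1) % ws]  # arrives from the W neighbor
    return bufs


WS = 4
# Rank r's copy of chunk c holds the constant (r + 1) + 10 * c.
data = [[np.full(2, float(r + 1 + 10 * c)) for c in range(WS)] for r in range(WS)]
bufs = reduce_scatter_sim(data)
for r in range(WS):
    owned = (r + 1) % WS
    print(f"rank {r} owns reduced chunk {owned}: {bufs[r][owned]}")
```

After `world_size - 1` steps, rank `r` holds the fully reduced chunk `(r + 1) % world_size`; an all-gather pass over the same ring would then distribute the reduced chunks.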
### `tree_step(rank, world_size) → {"parent": int|None, "children": list[int]}`

Parent/children ranks of a binary tree rooted at rank 0:

```python
info = tree_step(rank, world_size)
if info["parent"] is None:
    print(f"rank {rank} is the root")
for child in info["children"]:
    ...
```
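
A quick way to convince yourself of the tree layout is to run the reduce-up phase on plain numbers. This sketch copies the `tree_step` logic from `kernbench.ccl.helpers`; `tree_reduce_sim` is a hypothetical helper that models only the arithmetic, not send/recv.

```python
def tree_step(rank, world_size):
    """Same logic as kernbench.ccl.helpers.tree_step."""
    parent = (rank - 1) // 2 if rank > 0 else None
    children = [c for c in (2 * rank + 1, 2 * rank + 2) if c < world_size]
    return {"parent": parent, "children": children}


def tree_reduce_sim(values):
    """Reduce-up phase: every non-root rank adds its subtotal into its parent."""
    ws = len(values)
    acc = list(values)
    # Children always have a larger rank than their parent, so walking ranks in
    # descending order guarantees a subtree is complete before it is sent up.
    for rank in range(ws - 1, 0, -1):
        acc[tree_step(rank, ws)["parent"]] += acc[rank]
    return acc[0]


total = tree_reduce_sim([1, 2, 3, 4, 5])
print(total, tree_step(0, 5), tree_step(4, 5))
```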
---
## 6. Unit testing: mock runtime

`kernbench.ccl.testing.run_kernel_in_mock` verifies an algorithm quickly without going
through SimPy.

### Basic usage

```python
import numpy as np

from kernbench.ccl.algorithms.my_algo import kernel
from kernbench.ccl.testing import run_kernel_in_mock


def test_my_algo():
    n_elem = 16
    # Use np.float16 here: the NumPy type string "f16" means a 16-byte float, not fp16.
    inputs = [np.arange(n_elem, dtype=np.float16) + r for r in range(4)]
    expected = sum(inputs)
    outputs = run_kernel_in_mock(
        kernel_fn=kernel,
        world_size=4,
        topology="ring_1d",
        inputs=inputs,
        kernel_args=(n_elem, 4),  # extra positional kernel arguments (after t_ptr)
    )
    for r in range(4):
        assert np.allclose(outputs[r], expected, rtol=1e-3)
```

### Behavior

- Runs the kernels of all ranks (4 in the example above) concurrently as greenlets
- Services `tl.send/recv` immediately through in-memory FIFOs (DMA and latency are ignored)
- Returns the data each rank stored last, as an ndarray

### Limitations

- Cannot measure latency or performance (it is not a simulation)
- Does not model PE_DMA, the fabric, or bandwidth
- Only correctness can be verified
- Assumes everything runs inside a single cube: `program_id(axis=1)` is always 0
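
To make the "in-memory FIFO" idea concrete, here is a toy mock written with the standard library only. It is not how `run_kernel_in_mock` is implemented (that uses greenlets and a real TLContext); threads, `queue.Queue`, the `run_ring_mock` name and the simplified kernel signature are all assumptions made for this sketch.

```python
import queue
import threading


def run_ring_mock(kernel, inputs):
    """Run one kernel per rank on a 1D ring connected by unbounded FIFOs."""
    ws = len(inputs)
    # links[r] carries tiles travelling from rank r to rank (r + 1) % ws.
    links = [queue.Queue() for _ in range(ws)]
    outputs = [None] * ws

    def worker(rank):
        def send_e(tile):
            links[rank].put(tile)

        def recv_w():
            # The timeout turns an unmatched recv into an error instead of a hang.
            return links[(rank - 1) % ws].get(timeout=5)

        outputs[rank] = kernel(inputs[rank], send_e, recv_w)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(ws)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs


def hello_kernel(tile, send_e, recv_w):
    """Toy counterpart of hello_send: send own tile to E, return what arrives from W."""
    send_e(tile)
    return recv_w()


outputs = run_ring_mock(hello_kernel, [1, 2, 3, 4])
print(outputs)
```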
---
## 7. Debugging

### CCL trace

```bash
KERNBENCH_CCL_TRACE=1 kernbench run --topology topology.yaml \
    --bench ccl_allreduce_tcm --verify-data
```

The send/recv time of each rank is printed to stdout:
```
[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0
[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64
...
```
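
When a run hangs, counting send and recv events in the trace is a quick first check for an unmatched pair. The sketch below parses the line format shown above with a regular expression; `count_events` is a hypothetical helper, not part of `kernbench.ccl.diagnostics`, and the third sample line is made up for the example.

```python
import re
from collections import Counter

# Matches the send/recv trace lines; other lines are ignored.
_TRACE_RE = re.compile(
    r"\[ccl t=(?P<t>[\d.]+) (?P<kind>send|recv)\] "
    r"(?P<pe>\S+) dir=(?P<dir>\w+) nbytes=(?P<nbytes>\d+)"
)


def count_events(lines):
    """Count send and recv events found in CCL trace lines."""
    counts = Counter()
    for line in lines:
        m = _TRACE_RE.match(line)
        if m:
            counts[m["kind"]] += 1
    return counts


sample = [
    "[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0",
    "[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64",
    "[ccl t=361.0 send] sip0.cube0.pe2 dir=E nbytes=64 seq=0",
]
counts = count_events(sample)
unmatched = counts["send"] - counts["recv"]
print(counts, unmatched)
```

A non-zero difference at the end of a run suggests that some send never found its recv, or that the run stopped early.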
### Pointer dump

`kernbench.ccl.diagnostics.pointer_dump(engine)` returns the ring buffer state
(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) of every PE_IPCQ as a
multi-line string. When a hang occurs, it shows at a glance which rank is stuck in
which state.

### Deadlock detection

If the SimPy schedule empties while a request is still pending, for example because of
an unmatched send/recv, the engine raises `IpcqDeadlock` and includes the pointer dump
in the message (ADR-0023 D14 F3). A separate wait-for graph visualization is future work.

---
## 8. Common mistakes

### 1. Using a direction that was not installed

`topology: ring_1d` in ccl.yaml installs only E/W. If N/S is used:

```python
tl.send(dir="N", ...)  # → raises IpcqInvalidDirection
```

Fix: switch to `topology: mesh_2d`, or add N/S through a `neighbors()` override.

### 2. Calling send without a matching recv

```python
def kernel(..., tl):
    for _ in range(100):
        tl.send(dir="E", ...)
    # No recv on the peer side → the ring buffer fills up → backpressure → deadlock
```

Fix: every send needs a matching recv. Otherwise `IpcqDeadlock` is raised.
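
Why the loop above stalls can be seen with a bounded queue standing in for the peer's ring buffer. This is only an analogy: `n_slots=4` is an illustrative value, and the real PE_IPCQ waits (poll/sleep) when the ring is full, whereas this sketch reports the first send that would have to wait.

```python
import queue


def sends_until_full(n_slots, attempts):
    """Return how many sends succeed before the ring has no free slot."""
    ring = queue.Queue(maxsize=n_slots)  # nobody ever calls get(): no recv
    for i in range(attempts):
        try:
            ring.put_nowait(i)
        except queue.Full:
            return i  # this send would block forever without a recv
    return attempts


print(sends_until_full(n_slots=4, attempts=100))
```

With a consumer draining the queue, the same 100 sends would complete, which is the situation a matching `tl.recv` creates.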
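### 3. dtype/shape mismatch

In the default mode, dtype/shape mismatches are not validated. Either the author
guarantees consistency, or `strict_validation: true` is set in the PE_IPCQ node attrs
to enable the D14 F2 strict mode, which catches mismatches immediately.

### 4. Assuming fairness in round-robin recv

`tl.recv()` (no direction) polls the directions round-robin but returns the first slot
that has arrived. **The arrival order is not known in advance**, so an algorithm must
not depend on the arrival direction. When the direction matters, state it explicitly,
as in `tl.recv(dir="N", ...)`.

### 5. Assuming the CCL group size

`tl.num_programs(axis=0/1)` is the number of topology slots, not the size of the CCL
group. The host bench knows the number of participating ranks (`world_size`) and must
pass it explicitly as a kernel argument.

### 6. Overwriting send-source memory before the data arrives

PE_DMA snapshots the src data into the token at send time, so in-flight data keeps its
meaning. Within one PE it is therefore safe to update the same address across several
steps: after a send, a store to that address in a later step does not affect the data
already in flight. However, if the address is overwritten before `tl.send` has been
enqueued on the PE_DMA queue, the wrong data is snapshotted. The recommendation is to
call `tl.send` first and change the memory afterwards.

---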
## 9. Next steps

- Try other topologies such as `mesh_2d` / `tree_binary`
- Implement faster algorithms such as recursive halving/doubling
- Compare latency across `buffer_kind` (tcm/hbm/sram) and `backpressure` (poll/sleep) modes
- Verify rings at larger scale, as in `ccl_ring_allreduce_multicube.py` and
  `ccl_ring_allreduce_multisip.py`

If you add a new algorithm or pattern, please send a PR.

---

## References

- [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective design.
- [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1).
- [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution.
- [ADR-0021](adr/ADR-0021-pe-pipeline-refactor.md): PE pipeline refactor.

Existing algorithm examples:

- [`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py) — simplest send/recv
- [`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py) — ring all-reduce
- [`src/kernbench/ccl/algorithms/mesh_allreduce.py`](../src/kernbench/ccl/algorithms/mesh_allreduce.py) — 2D mesh all-reduce
- [`src/kernbench/ccl/algorithms/tree_allreduce.py`](../src/kernbench/ccl/algorithms/tree_allreduce.py) — binary tree all-reduce
@@ -0,0 +1,9 @@
"""CCL (Collective Communication Library) framework for kernbench (ADR-0023).
This package provides:
- topologies: builtin neighbor topology generators (ring/mesh/tree)
- helpers: utilities for algorithm authors (chunked, ring_step, ...)
- testing: mock CCL runtime for fast unit tests of algorithm kernels
See docs/adr/ADR-0023-ipcq-pe-collective.md and docs/ccl-author-guide.md.
"""
@@ -0,0 +1,29 @@
"""Hello-world CCL kernel for the docs/ccl-author-guide.md walkthrough.
Each PE sends its tile to the E neighbor and receives one tile from W,
then stores the received tile back into its own HBM slice. The simplest
possible demonstration of ``tl.send`` / ``tl.recv``.
"""
from __future__ import annotations
def kernel_args(world_size: int, n_elem: int) -> tuple:
    """Return the positional kernel arguments for the ahbm backend."""
    return (n_elem,)


def kernel(t_ptr, n_elem, tl):
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe

    nbytes = n_elem * 2
    pe_addr = t_ptr + rank * nbytes

    # Send our local HBM tile to the E neighbor.
    src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    tl.send(dir="E", src=src)

    # Receive a tile from W and store it into our slice (overwrite).
    recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
    tl.store(pe_addr, recv)
@@ -0,0 +1,73 @@
"""2D-mesh all-reduce kernel (ADR-0023).
Two-phase reduce on a square mesh of side ``S`` (world_size = S*S):
1. Row reduce: ring all-reduce along E/W within each row.
2. Column reduce: ring all-reduce along N/S within each column.
After both phases, every rank holds the global sum.
Uses TensorHandle math (PE_MATH) for accumulation. Op_log captures the
data flow so Phase 2 produces correct final HBM contents. Math/recv
handles are passed directly to the next send, avoiding store→reload
which doesn't propagate correctly with timing-only Phase 1 math.
"""
from __future__ import annotations
import math
def kernel_args(world_size: int, n_elem: int) -> tuple:
    """Return the positional kernel arguments for the ahbm backend.

    Mesh all-reduce requires ``world_size`` to be a perfect square —
    the mesh side length is ``sqrt(world_size)``.
    """
    side = int(round(math.sqrt(world_size)))
    if side * side != world_size:
        raise ValueError(
            f"mesh_allreduce requires a square world_size; got {world_size}"
        )
    return (n_elem, side)


def kernel(t_ptr, n_elem, side, tl):
    """All-reduce on a square mesh.

    Args:
        t_ptr: HBM base address (column-sharded VA shared across ranks)
        n_elem: number of f16 elements per tile
        side: mesh side length (sqrt(world_size))
        tl: TLContext (ADR-0022).
    """
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe

    nbytes = n_elem * 2
    pe_addr = t_ptr + rank * nbytes

    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    current = acc

    # ── Phase 1: row ring (E direction) ──
    # Ring forwards each received tile (not the cumulative acc) so every
    # tile passes through every rank exactly once.
    for _ in range(side - 1):
        tl.send(dir="E", src=current)
        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
        acc = acc + recv
        current = recv

    # Phase 2 column ring starts from the row-phase accumulator. We do NOT
    # store/reload here — the math handle's scratch addr is the source for
    # the first column send and Phase 2 ipcq_copy replays from there.
    current = acc

    # ── Phase 2: column ring (S direction) ──
    for _ in range(side - 1):
        tl.send(dir="S", src=current)
        recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
        acc = acc + recv
        current = recv

    tl.store(pe_addr, acc)
@@ -0,0 +1,80 @@
"""Ring all-reduce kernel for IPCQ-based PE collective (ADR-0023).
Algorithm: 1D ring of N PEs, each PE starts with one tile of data.
After ``world_size - 1`` rounds, every PE's accumulator holds the sum
of all PE tiles.
Strategy
--------
Each PE starts with its own tile in HBM. The kernel:
1. Loads the local tile into a TensorHandle (the accumulator).
2. In each of ``world_size - 1`` rounds:
- Sends the current accumulator/recv slot to the E neighbor.
- Receives a tile from the W neighbor — the recv handle points
into the per-direction TCM slot.
- Adds the received tile to the accumulator using the TensorHandle
operator overload, which dispatches to ``MathCmd`` (PE_MATH).
3. Stores the final accumulator back to HBM via tl.store. The store is
recorded in op_log with both src and dst, so Phase 2 will copy the
replayed math result from PE-local scratch into HBM.
ADR-0020 D3 split: Phase 1 simulates timing only — math results are
not yet computed, so the accumulator data flowing through Phase 1 may
be stale. Phase 2's DataExecutor replays math + IPCQ copies + dma_write
in stable t_start order, producing correct final HBM contents.
"""
from __future__ import annotations
def kernel_args(world_size: int, n_elem: int) -> tuple:
    """Return the positional kernel arguments for the ahbm backend.

    Ring all-reduce takes (n_elem, world_size) after the tensor pointer.
    """
    return (n_elem, world_size)


def kernel(t_ptr, n_elem, world_size, tl):
    """Ring all-reduce.

    Args:
        t_ptr: HBM base address of the column-sharded tensor — all PEs
            share this base. The per-PE slice lives at
            ``t_ptr + global_rank * n_elem * 2``.
        n_elem: number of f16 elements per tile.
        world_size: total number of participating ranks (passed by host).
        tl: TLContext (auto-injected, ADR-0022). The kernel derives the
            global rank from ``program_id(axis=0)`` (local PE) and
            ``program_id(axis=1)`` (cube id):
                rank = cube_id * pes_per_cube + local_pe
    """
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe

    nbytes = n_elem * 2  # f16

    # Each PE reads from its own slice of the shared base address
    pe_addr = t_ptr + rank * nbytes

    # Load the local tile — handle points at HBM[pe_addr].
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")

    # The ring forwards each received tile to the next neighbor (NOT the
    # cumulative accumulator), so every rank's tile passes through every
    # rank exactly once. The accumulator sums the new arrival each round.
    current = acc
    for _step in range(world_size - 1):
        tl.send(dir="E", src=current)
        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
        # TensorHandle add → MathCmd → PE_MATH (timing in Phase 1, real
        # numpy in Phase 2 via DataExecutor). The result handle lives at
        # an auto-allocated PE-local scratch addr.
        acc = acc + recv
        current = recv  # forward W's tile to E next round

    # Final result back to this PE's HBM slice. Op_log captures the
    # source (scratch addr) and dst (HBM slice) so Phase 2 copies the
    # accumulated value into HBM for verification.
    tl.store(pe_addr, acc)
@@ -0,0 +1,80 @@
"""Tree all-reduce kernel for IPCQ-based PE collective (ADR-0023).
Two-phase binary tree all-reduce:
Phase 1 (reduce up):
- leaf nodes send their value to ``parent``
- internal nodes recv from each child, sum, then send to ``parent``
- root accumulates child contributions; final acc holds global sum
Phase 2 (broadcast down):
- root sends acc to ``child_left`` and ``child_right`` (if present)
- internal nodes recv from ``parent``, then forward to children
- all ranks store the final acc to HBM
Uses TensorHandle math (PE_MATH) for accumulation. Op_log captures the
data flow so Phase 2 produces correct final HBM contents. The kernel
deliberately avoids the store→reload→send pattern: math/recv handles
are passed directly to the next send so PE_DMA snapshots a deterministic
source addr that Phase 2 can replay.
"""
from __future__ import annotations
def kernel_args(world_size: int, n_elem: int) -> tuple:
    """Return the positional kernel arguments for the ahbm backend."""
    return (n_elem, world_size)


def kernel(t_ptr, n_elem, world_size, tl):
    """Tree all-reduce.

    Args:
        t_ptr: HBM base address.
        n_elem: number of f16 elements per tile.
        world_size: total number of participating ranks (passed by host).
        tl: TLContext (ADR-0022). Global rank from program_id(0/1).
    """
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe

    nbytes = n_elem * 2
    pe_addr = t_ptr + rank * nbytes

    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")

    # Compute children/parent existence (matches tree_binary topology generator)
    has_parent = rank > 0
    left = 2 * rank + 1
    right = 2 * rank + 2
    has_left = left < world_size
    has_right = right < world_size

    # ── Phase 1: reduce up ──
    if has_left:
        recv = tl.recv(dir="child_left", shape=(n_elem,), dtype="f16")
        acc = acc + recv
    if has_right:
        recv = tl.recv(dir="child_right", shape=(n_elem,), dtype="f16")
        acc = acc + recv
    if has_parent:
        # Send the math/load handle directly — its addr is either the
        # original HBM tile (leaf) or the PE-local scratch where the
        # accumulator lives. Phase 2 ipcq_copy replays from the same addr.
        tl.send(dir="parent", src=acc)

    # ── Phase 2: broadcast down ──
    if has_parent:
        # Replace acc with the value broadcast from the parent (the global
        # sum). The recv handle points at the parent-direction TCM slot.
        acc = tl.recv(dir="parent", shape=(n_elem,), dtype="f16")
    if has_left:
        tl.send(dir="child_left", src=acc)
    if has_right:
        tl.send(dir="child_right", src=acc)

    # Final store to HBM for the bench's verification path.
    tl.store(pe_addr, acc)
@@ -0,0 +1,127 @@
"""CCL diagnostics: trace + pointer dump + deadlock (ADR-0023 D14).
Trace
-----
Set ``KERNBENCH_CCL_TRACE=1`` (or any truthy value) to enable per-event
logging of CCL send/recv to stdout. Off by default.
Pointer dump
------------
``pointer_dump(engine)`` returns a multi-line string showing every PE_IPCQ's
ring buffer state (my_head, my_tail, peer_head_cache, peer_tail_cache).
Useful for diagnosing hangs.
Deadlock
--------
``IpcqDeadlock`` is raised by the engine when SimPy's schedule empties
while a request is still pending — typical of unmatched send/recv pairs.
The exception message includes the pointer dump.
"""
from __future__ import annotations
import os
from typing import Any
class IpcqDeadlock(RuntimeError):
    """Raised when the simulation cannot make further progress while a
    CCL request is still pending (D14 F3)."""


# ── Trace toggle ─────────────────────────────────────────────────────
_TRACE_ENABLED: bool = False


def reload_trace_setting() -> None:
    """Re-read the ``KERNBENCH_CCL_TRACE`` env var."""
    global _TRACE_ENABLED
    val = os.environ.get("KERNBENCH_CCL_TRACE", "")
    _TRACE_ENABLED = val.strip().lower() in {"1", "true", "yes", "on"}


def trace_enabled() -> bool:
    return _TRACE_ENABLED


# Initialise once at import time
reload_trace_setting()


# ── Trace event functions ────────────────────────────────────────────
def log_send(
    t_ns: float,
    sender: str,
    direction: str,
    nbytes: int,
    sender_seq: int,
) -> None:
    if not _TRACE_ENABLED:
        return
    print(
        f"[ccl t={t_ns:.1f} send] {sender} dir={direction} nbytes={nbytes} seq={sender_seq}",
        flush=True,
    )


def log_recv(
    t_ns: float,
    receiver: str,
    direction: str,
    nbytes: int,
) -> None:
    if not _TRACE_ENABLED:
        return
    print(
        f"[ccl t={t_ns:.1f} recv] {receiver} dir={direction} nbytes={nbytes}",
        flush=True,
    )


def log_credit_return(
    t_ns: float,
    sender: str,
    direction: str,
    consumer_seq: int,
) -> None:
    if not _TRACE_ENABLED:
        return
    print(
        f"[ccl t={t_ns:.1f} credit] {sender} dir={direction} seq={consumer_seq}",
        flush=True,
    )


# ── Pointer dump ─────────────────────────────────────────────────────
def pointer_dump(engine: Any) -> str:
    """Return a multi-line string of every PE_IPCQ's pointer state."""
    lines: list[str] = []
    components = getattr(engine, "_components", {})
    for node_id in sorted(components):
        if not node_id.endswith(".pe_ipcq"):
            continue
        comp = components[node_id]
        qps = getattr(comp, "queue_pairs", {})
        if not qps:
            continue
        lines.append(node_id)
        for d in sorted(qps):
            qp = qps[d]
            peer = qp["peer"]
            lines.append(
                f" {d}: peer=sip{peer.sip}.cube{peer.cube}.pe{peer.pe} "
                f"my_head={qp['my_head']} my_tail={qp['my_tail']} "
                f"peer_head_cache={qp['peer_head_cache']} "
                f"peer_tail_cache={qp['peer_tail_cache']}"
            )
    return "\n".join(lines)


def print_pointer_dump(engine: Any) -> None:
    """Convenience: print pointer_dump(engine) to stdout."""
    print(pointer_dump(engine), flush=True)
@@ -0,0 +1,118 @@
"""Helpers for CCL algorithm authors (ADR-0023 D15).
These are pure utility functions usable from any kernel module:
from kernbench.ccl.helpers import chunked, ring_step, tree_step
They keep algorithm code short and free of off-by-one bugs.
"""
from __future__ import annotations
from dataclasses import dataclass
from typing import Any
_DTYPE_BYTES = {
    "f16": 2, "fp16": 2, "float16": 2, "bf16": 2,
    "f32": 4, "fp32": 4, "float32": 4,
    "i8": 1, "int8": 1,
    "i16": 2, "int16": 2,
    "i32": 4, "int32": 4,
}


def _itemsize(dtype: str) -> int:
    if dtype not in _DTYPE_BYTES:
        raise ValueError(f"Unsupported dtype: {dtype}")
    return _DTYPE_BYTES[dtype]


# ── chunked ──────────────────────────────────────────────────────────
@dataclass(frozen=True)
class Chunk:
    """One chunk of a tensor used by collective algorithms."""
    addr: int
    n_elem: int
    nbytes: int


def chunked(
    base_addr: int,
    n_chunks: int,
    n_elem: int,
    dtype: str = "f16",
) -> list[Chunk]:
    """Slice a 1D buffer into ``n_chunks`` equal Chunks.

    Args:
        base_addr: starting address of the buffer.
        n_chunks: number of equal chunks to produce.
        n_elem: total number of elements (must be divisible by n_chunks).
        dtype: element type for byte-size calculation.

    Returns:
        List of ``Chunk`` objects whose addresses are consecutive.

    Raises:
        ValueError: if n_elem is not divisible by n_chunks.
    """
    if n_elem % n_chunks != 0:
        raise ValueError(
            f"chunked: n_elem ({n_elem}) not divisible by n_chunks ({n_chunks})"
        )
    per_chunk_elem = n_elem // n_chunks
    isize = _itemsize(dtype)
    per_chunk_bytes = per_chunk_elem * isize
    return [
        Chunk(
            addr=base_addr + i * per_chunk_bytes,
            n_elem=per_chunk_elem,
            nbytes=per_chunk_bytes,
        )
        for i in range(n_chunks)
    ]


# ── ring_step ────────────────────────────────────────────────────────
def ring_step(rank: int, step: int, world_size: int) -> tuple[int, int]:
    """Return ``(send_chunk_idx, recv_chunk_idx)`` for a ring algorithm step.

    Standard reduce-scatter / all-gather ring schedule:
        at step s, rank r sends chunk (r - s) and receives chunk (r - s - 1)
        modulo world_size.

    Used by ring all-reduce kernels:
        for step in range(world_size - 1):
            send_idx, recv_idx = ring_step(rank, step, world_size)
            tl.send(dir="E", src=chunks[send_idx])
            chunks[recv_idx] += tl.recv(dir="W").data
    """
    send_idx = (rank - step) % world_size
    recv_idx = (rank - step - 1) % world_size
    return send_idx, recv_idx


# ── tree_step ────────────────────────────────────────────────────────
def tree_step(rank: int, world_size: int) -> dict[str, Any]:
    """Return parent/children for binary tree rooted at rank 0.

    Returns:
        ``{"parent": int|None, "children": list[int]}``
    """
    parent = (rank - 1) // 2 if rank > 0 else None
    children: list[int] = []
    left = 2 * rank + 1
    right = 2 * rank + 2
    if left < world_size:
        children.append(left)
    if right < world_size:
        children.append(right)
    return {"parent": parent, "children": children}
@@ -0,0 +1,266 @@
"""IPCQ install plan for AhbmCCLBackend (ADR-0023 D10/D11/D12).
Given a ccl.yaml config, the topology, and the engine, this module:
1. Loads ccl.yaml and resolves the chosen algorithm.
2. Maps each rank to a (sip, cube, pe) PE address using a linear scheme.
3. Allocates per-rank IPCQ ring buffer base addresses (synthetic but
unique-per-PE; see notes below).
4. Builds neighbor tables via the algorithm's ``topology`` field plus the
optional ``neighbors()`` override hook from the algorithm module.
5. Wires bidirectional credit-return SimPy Stores between every (PE, peer)
pair.
6. Installs each PE_IPCQ component's neighbor table directly via its
``_install_neighbors`` sideband call (equivalent to fan-out IpcqInitMsg
without going through fabric).
Address scheme
--------------
For the first implementation we use a synthetic address scheme that
guarantees uniqueness per (sip, cube, pe, direction) without going
through ``PEMemAllocator``. The address is encoded as:
base = IPCQ_BASE | (sip << 40) | (cube << 32) | (pe << 24)
rx_base[direction_idx] = base + direction_idx * (n_slots * slot_size)
The ``buffer_kind`` (tcm/hbm/sram) selects the *MemoryStore space* into
which data is written. Within a space, addresses are unique per PE so
the existing MemoryStore (``{space: {addr: ndarray}}``) handles them
naturally.
This bypasses the topology's address resolver / PhysAddr encoding and
treats IPCQ buffers as a separate, parallel address namespace. Real PA
encoding can be plugged in later without changing the rest of the design.
"""
from __future__ import annotations
from pathlib import Path
from typing import Any
import simpy
import yaml
from kernbench.ccl.topologies import resolve_topology
from kernbench.common.ipcq_types import (
IpcqEndpoint,
IpcqInitEntry,
)
from kernbench.runtime_api.kernel import IpcqInitMsg
# IPCQ synthetic address space top bit
_IPCQ_BASE = 1 << 60
def _ipcq_base_for_pe(sip: int, cube: int, pe: int) -> int:
return _IPCQ_BASE | (sip << 40) | (cube << 32) | (pe << 24)
# ── ccl.yaml loading ─────────────────────────────────────────────────
def load_ccl_config(path: str | Path | None = None) -> dict:
"""Load and validate ccl.yaml. Searches cwd and project root."""
if path is None:
candidates = [
Path.cwd() / "ccl.yaml",
Path(__file__).resolve().parents[3] / "ccl.yaml",
]
for p in candidates:
if p.exists():
path = p
break
if path is None:
raise FileNotFoundError(
"ccl.yaml not found. Place it at project root or cwd."
)
with open(path) as f:
cfg = yaml.safe_load(f)
if "defaults" not in cfg:
raise ValueError("ccl.yaml missing 'defaults' section")
if "algorithms" not in cfg:
raise ValueError("ccl.yaml missing 'algorithms' section")
return cfg
def resolve_algorithm_config(cfg: dict, name: str | None = None) -> dict:
"""Merge defaults with the chosen algorithm's overrides.
Returns a flat dict with at minimum: module, topology, buffer_kind,
backpressure, n_slots, slot_size, ipcq_credit_size_bytes, world_size.
"""
defaults = dict(cfg.get("defaults", {}))
algo_name = name or defaults.get("algorithm")
if algo_name is None:
raise ValueError("ccl.yaml: defaults.algorithm not set")
algos = cfg.get("algorithms", {})
if algo_name not in algos:
raise ValueError(
f"ccl.yaml: algorithm '{algo_name}' not in algorithms section"
)
merged = defaults.copy()
merged.update(algos[algo_name])
merged["algorithm"] = algo_name
return merged
# ── rank → PE mapping ────────────────────────────────────────────────
def linear_rank_to_pe(rank: int, spec: dict) -> tuple[int, int, int]:
"""Map a rank to (sip, cube, pe) using linear topology order."""
sips = spec["system"]["sips"]["count"]
cubes_per_sip = spec["sip"]["cube_mesh"]["w"] * spec["sip"]["cube_mesh"]["h"]
pe_layout = spec["cube"]["pe_layout"]
pes_per_cube = pe_layout["pe_per_corner"] * len(pe_layout["corners"])
pes_per_sip = cubes_per_sip * pes_per_cube
if not 0 <= rank < sips * pes_per_sip:
raise ValueError(
f"rank {rank} out of range for total PE count {sips * pes_per_sip}"
)
sip = rank // pes_per_sip
rem = rank % pes_per_sip
cube = rem // pes_per_cube
pe = rem % pes_per_cube
return sip, cube, pe
# ── Install plan ─────────────────────────────────────────────────────
def install_ipcq(
engine: Any,
spec: dict,
cfg: dict,
algo_module: Any | None = None,
rank_to_pe: list[tuple[int, int, int]] | None = None,
) -> dict[str, Any]:
"""Build neighbor tables and install them in every participating PE_IPCQ.
Args:
engine: GraphEngine with ``_components`` dict
spec: topology spec dict
cfg: merged algorithm config (from ``resolve_algorithm_config``)
algo_module: optional algorithm Python module (for neighbors override)
rank_to_pe: optional explicit rank → (sip, cube, pe) mapping. If
None, the default linear mapping is used.
Returns:
A diagnostics dict with the install plan (rank → PE map, neighbor table).
"""
if "world_size" in cfg:
world_size = int(cfg["world_size"])
else:
# Topology-derived fallback (mirrors AhbmCCLBackend / RuntimeContext).
sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
cm = spec.get("sip", {}).get("cube_mesh", {})
cubes_per_sip = int(cm.get("w", 1)) * int(cm.get("h", 1))
pl = spec.get("cube", {}).get("pe_layout", {})
corners = pl.get("corners", [])
pe_per_corner = int(pl.get("pe_per_corner", 1))
pes_per_cube = pe_per_corner * max(len(corners), 1)
world_size = sips * cubes_per_sip * pes_per_cube
buffer_kind = cfg["buffer_kind"]
n_slots = int(cfg["n_slots"])
slot_size = int(cfg["slot_size"])
backpressure = cfg["backpressure"]
credit_size_bytes = int(cfg.get("ipcq_credit_size_bytes", 16))
# Step 1: rank → (sip, cube, pe)
if rank_to_pe is not None:
if len(rank_to_pe) != world_size:
raise ValueError(
f"rank_to_pe has {len(rank_to_pe)} entries but world_size={world_size}"
)
rank_pe = list(rank_to_pe)
else:
rank_pe: list[tuple[int, int, int]] = [
linear_rank_to_pe(r, spec) for r in range(world_size)
]
pe_to_rank = {(s, c, p): r for r, (s, c, p) in enumerate(rank_pe)}
# Step 2: resolve topology fn (with optional override)
topo_fn = resolve_topology(cfg["topology"], algo_module=algo_module)
# Build per-rank neighbor map
neighbor_table: dict[int, dict[str, int]] = {}
for r in range(world_size):
neighbor_table[r] = topo_fn(r, world_size)
# Step 3: pull the live engine reference for each PE_IPCQ
components = engine._components
def pe_ipcq_id(s: int, c: int, p: int) -> str:
return f"sip{s}.cube{c}.pe{p}.pe_ipcq"
# Step 4: per-PE rx_base address and per-PE credit_inbox
direction_keys = sorted({d for nt in neighbor_table.values() for d in nt})
direction_idx = {d: i for i, d in enumerate(direction_keys)}
bytes_per_direction = n_slots * slot_size
def rx_base(s: int, c: int, p: int, d: str) -> int:
return _ipcq_base_for_pe(s, c, p) + direction_idx[d] * bytes_per_direction
# Wire bidirectional credit stores: backend creates the SimPy Stores
# by reading each rank's PE_IPCQ.credit_inbox property.
rank_to_credit_inbox: dict[int, simpy.Store] = {}
for r, (s, c, p) in enumerate(rank_pe):
comp = components[pe_ipcq_id(s, c, p)]
# Trigger lazy creation of credit_inbox if not yet started.
# PE_IPCQ.start() creates it; we ensure it exists.
if comp._credit_inbox is None:
comp._credit_inbox = simpy.Store(engine._env)
rank_to_credit_inbox[r] = comp.credit_inbox
# Step 5: build IpcqInitMsg per rank and call _install_neighbors directly
plan: dict[str, Any] = {
"world_size": world_size,
"rank_to_pe": rank_pe,
"buffer_kind": buffer_kind,
"neighbor_table": neighbor_table,
}
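def reverse_direction(my_rank: int, peer_rank: int) -> str | None:
"""Find which direction in peer's neighbor table points back to my_rank.
Returns the first match, so the result is ambiguous when the peer reaches
my_rank through more than one direction (e.g. ring_1d with world_size=2).
"""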
for d, target in neighbor_table[peer_rank].items():
if target == my_rank:
return d
return None
for r, (s, c, p) in enumerate(rank_pe):
my_pe_ipcq = components[pe_ipcq_id(s, c, p)]
nbrs = neighbor_table[r]
entries: list[IpcqInitEntry] = []
for d, peer_rank in nbrs.items():
if peer_rank is None:
continue
peer_s, peer_c, peer_p = rank_pe[peer_rank]
peer_dir = reverse_direction(r, peer_rank)
if peer_dir is None:
# Peer doesn't have a reverse entry — skip (asymmetric topology)
continue
peer_endpoint = IpcqEndpoint(
sip=peer_s, cube=peer_c, pe=peer_p,
buffer_kind=buffer_kind,
rx_base_pa=rx_base(peer_s, peer_c, peer_p, peer_dir),
rx_base_va=0,
n_slots=n_slots, slot_size=slot_size,
)
entries.append(IpcqInitEntry(
direction=d,
peer=peer_endpoint,
my_rx_base_pa=rx_base(s, c, p, d),
my_rx_base_va=0,
n_slots=n_slots, slot_size=slot_size,
peer_credit_store=rank_to_credit_inbox[peer_rank],
))
msg = IpcqInitMsg(
correlation_id="ccl_init", request_id=f"init_r{r}",
target_sips=(s,), target_cubes=(c,), target_pe=p,
entries=tuple(entries),
backpressure_mode=backpressure,
buffer_kind=buffer_kind,
credit_size_bytes=credit_size_bytes,
)
my_pe_ipcq._install_neighbors(msg)
return plan
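The rank-to-PE mapping and the synthetic address layout above reduce to a few lines of arithmetic. The sketch below is a standalone re-implementation for illustration, not the module itself: the topology sizes, the direction set, and the helper `fits_in_pe_window` are assumptions chosen for the example and do not appear in the diff.

```python
IPCQ_BASE = 1 << 60  # same tag bit as _IPCQ_BASE in the diff


def linear_rank_to_pe(rank, sips, cubes_per_sip, pes_per_cube):
    """Linear order: PEs within a cube, then cubes within a SIP, then SIPs."""
    pes_per_sip = cubes_per_sip * pes_per_cube
    if not 0 <= rank < sips * pes_per_sip:
        raise ValueError(f"rank {rank} out of range")
    sip, rem = divmod(rank, pes_per_sip)
    cube, pe = divmod(rem, pes_per_cube)
    return sip, cube, pe


def ipcq_base_for_pe(sip, cube, pe):
    """Per-PE base: tag bit plus SIP/cube/PE fields at bits 40/32/24."""
    return IPCQ_BASE | (sip << 40) | (cube << 32) | (pe << 24)


def rx_base(sip, cube, pe, direction, directions, n_slots, slot_size):
    """One ring of n_slots * slot_size bytes per direction, in sorted order."""
    idx = sorted(directions).index(direction)
    return ipcq_base_for_pe(sip, cube, pe) + idx * n_slots * slot_size


def fits_in_pe_window(n_directions, n_slots, slot_size):
    """Hypothetical guard (not in the diff): rings must stay below bit 24."""
    return n_directions * n_slots * slot_size <= (1 << 24)


# Assumed topology: 2 SIPs, 2x2 cubes per SIP, 4 corners x 2 PEs per cube.
sip, cube, pe = linear_rank_to_pe(37, sips=2, cubes_per_sip=4, pes_per_cube=8)
west_ring = rx_base(sip, cube, pe, "W", {"E", "W"}, n_slots=4, slot_size=4096)
print((sip, cube, pe), hex(west_ring))
```

Because the PE field starts at bit 24, a configuration whose rings exceed 16 MiB per PE would spill into the next PE's window; a check along the lines of `fits_in_pe_window` is one way to catch that at install time.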
+465
@@ -0,0 +1,465 @@
"""Mock CCL runtime for fast unit tests of algorithm kernels (ADR-0023 D15).
Runs a kernel function once per rank with a minimal ``tl`` shim — no SimPy,
no PE_DMA, no fabric simulation. Just enough to verify *functional*
correctness of an IPCQ-based collective algorithm.
Cross-rank send/recv is implemented with greenlet cooperative scheduling
plus per-(rank, direction) FIFO queues. Backpressure is not modeled —
queues are unbounded.
Typical usage in a test::
import numpy as np
from kernbench.ccl.testing import run_kernel_in_mock
from kernbench.ccl.algorithms.ring_allreduce import kernel
inputs = [np.full(16, r + 1, dtype="float16") for r in range(4)]
outputs = run_kernel_in_mock(
kernel_fn=kernel, world_size=4, topology="ring_1d",
inputs=inputs, kernel_args=(16,),
)
for r in range(4):
assert np.allclose(outputs[r], sum(inputs))
"""
from __future__ import annotations
from collections import deque
from typing import Any, Callable
import numpy as np
from greenlet import greenlet
from kernbench.ccl.topologies import resolve_topology
from kernbench.common.ipcq_types import IpcqInvalidDirection
from kernbench.common.pe_commands import TensorHandle
# ── Per-rank fake state ──────────────────────────────────────────────
class _MockRankState:
"""Per-rank scratch holding HBM/recv slots and tl shim hooks."""
def __init__(
self,
rank: int,
world_size: int,
neighbors: dict[str, int],
input_arr: np.ndarray,
) -> None:
self.rank = rank
self.world_size = world_size
self.neighbors = neighbors # direction → peer rank
# HBM "memory": addr → ndarray. Per-rank, no cross-rank sharing.
self._hbm: dict[int, np.ndarray] = {}
self._tcm: dict[int, np.ndarray] = {}
# ``t_ptr`` is the address the kernel sees. Real benches use a
# column-sharded VA so each rank reads from ``t_ptr + rank*nbytes``.
# Mirror that here: each rank's slice lives at the rank-specific addr.
nbytes = int(input_arr.nbytes)
self.t_ptr = 0 # base; per-rank offset is rank * nbytes
self._slice_addr = rank * nbytes
self._hbm[self._slice_addr] = input_arr.copy()
# Inbound recv FIFOs: direction → deque[ndarray]
self.recv_q: dict[str, deque[np.ndarray]] = {d: deque() for d in neighbors}
# Output (set when kernel calls tl.store at slice address)
self.output: np.ndarray | None = None
# Greenlet for this rank — set later
self.g: greenlet | None = None
# ── Mock TLContext ───────────────────────────────────────────────────
class _MockTL:
"""Drop-in tl shim for mock runtime.
Supports the subset of TLContext API that algorithm authors use:
program_id, num_programs, load, store, send, recv, recv_async, wait,
plus arithmetic operations on TensorHandle (eager numpy execution,
no SimPy involved).
"""
def __init__(self, state: _MockRankState, scheduler: "_MockScheduler") -> None:
self._state = state
self._scheduler = scheduler
self._handle_counter = 0
def _next_id(self) -> str:
self._handle_counter += 1
return f"mt{self._handle_counter}"
@property
def rank(self) -> int:
return self._state.rank
@property
def world_size(self) -> int:
return self._state.world_size
# axis-aware
def program_id(self, axis: int = 0) -> int:
return self._state.rank if axis == 0 else 0
def num_programs(self, axis: int = 0) -> int:
return self._state.world_size if axis == 0 else 1
# ── arithmetic ops (called by TensorHandle.__add__ etc.) ──
def _binary_math(self, op: str, a: TensorHandle, b: TensorHandle) -> TensorHandle:
a_data = np.asarray(a.data) if a.data is not None else None
b_data = np.asarray(b.data) if b.data is not None else None
if a_data is None or b_data is None:
result = None
elif op == "add":
result = a_data + b_data
elif op == "sub":
result = a_data - b_data
elif op == "mul":
result = a_data * b_data
elif op == "div":
result = a_data / b_data
elif op == "maximum":
result = np.maximum(a_data, b_data)
elif op == "minimum":
result = np.minimum(a_data, b_data)
else:
raise NotImplementedError(f"mock _binary_math: op {op!r} not implemented")
return TensorHandle(
id=self._next_id(),
addr=0, shape=a.shape, dtype=a.dtype,
nbytes=int(np.prod(a.shape)) * 2 if a.shape else 0,
data=result, space="tcm",
)
def maximum(self, a: TensorHandle, b: TensorHandle) -> TensorHandle:
return self._binary_math("maximum", a, b)
def minimum(self, a: TensorHandle, b: TensorHandle) -> TensorHandle:
return self._binary_math("minimum", a, b)
def fma(
self, a: TensorHandle, b: TensorHandle, c: TensorHandle,
) -> TensorHandle:
a_data = np.asarray(a.data) if a.data is not None else None
b_data = np.asarray(b.data) if b.data is not None else None
c_data = np.asarray(c.data) if c.data is not None else None
result = (
a_data * b_data + c_data
if (a_data is not None and b_data is not None and c_data is not None)
else None
)
return TensorHandle(
id=self._next_id(),
addr=0, shape=a.shape, dtype=a.dtype,
nbytes=int(np.prod(a.shape)) * 2 if a.shape else 0,
data=result, space="tcm",
)
def clamp(
self,
x: TensorHandle,
min: TensorHandle,
max: TensorHandle,
) -> TensorHandle:
x_data = np.asarray(x.data) if x.data is not None else None
lo = np.asarray(min.data) if min.data is not None else None
hi = np.asarray(max.data) if max.data is not None else None
result = (
np.minimum(np.maximum(x_data, lo), hi)
if (x_data is not None and lo is not None and hi is not None)
else None
)
return TensorHandle(
id=self._next_id(),
addr=0, shape=x.shape, dtype=x.dtype,
nbytes=int(np.prod(x.shape)) * 2 if x.shape else 0,
data=result, space="tcm",
)
def softmax(self, x: TensorHandle, axis: int = -1) -> TensorHandle:
x_data = np.asarray(x.data) if x.data is not None else None
if x_data is None:
result = None
else:
x_max = np.max(x_data, axis=axis, keepdims=True)
e = np.exp(x_data - x_max)
s = np.sum(e, axis=axis, keepdims=True)
result = e / s
return TensorHandle(
id=self._next_id(),
addr=0, shape=x.shape, dtype=x.dtype,
nbytes=int(np.prod(x.shape)) * 2 if x.shape else 0,
data=result, space="tcm",
)
@staticmethod
def cdiv(a: int, b: int) -> int:
return -(-int(a) // int(b))
def _unary_math(self, op: str, x: TensorHandle) -> TensorHandle:
x_data = np.asarray(x.data) if x.data is not None else None
if x_data is None:
result = None
elif op == "exp":
result = np.exp(x_data)
elif op == "log":
result = np.log(x_data)
elif op == "sqrt":
result = np.sqrt(x_data)
elif op == "abs":
result = np.abs(x_data)
elif op == "sigmoid":
result = 1.0 / (1.0 + np.exp(-x_data))
elif op == "cos":
result = np.cos(x_data)
elif op == "sin":
result = np.sin(x_data)
else:
raise NotImplementedError(f"mock _unary_math: op {op!r} not implemented")
return TensorHandle(
id=self._next_id(),
addr=0, shape=x.shape, dtype=x.dtype,
nbytes=int(np.prod(x.shape)) * 2 if x.shape else 0,
data=result, space="tcm",
)
def load(self, ptr: int, shape: tuple[int, ...], dtype: str = "f16") -> TensorHandle:
data = self._state._hbm.get(ptr)
if data is None:
data = np.zeros(shape, dtype=np.float16)
return TensorHandle(
id=f"load_{ptr}", addr=ptr, shape=shape, dtype=dtype,
nbytes=int(np.prod(shape)) * 2, data=data, space="hbm",
)
def store(self, ptr: int, handle: TensorHandle) -> None:
if handle.data is not None:
self._state._hbm[ptr] = np.asarray(handle.data)
if ptr == self._state._slice_addr:
self._state.output = self._state._hbm[ptr]
# IPCQ
def send(
self,
dir: str,
src: TensorHandle | None = None,
*,
src_addr: int | None = None,
nbytes: int | None = None,
shape: tuple[int, ...] | None = None,
dtype: str = "f16",
space: str = "tcm",
) -> None:
if dir not in self._state.neighbors:
raise IpcqInvalidDirection(
f"mock tl.send: direction {dir!r} not in neighbors {list(self._state.neighbors)}"
)
if src is not None:
if src.data is not None:
data = np.asarray(src.data)
else:
# Resolve from this rank's local memory at src.addr
space_dict = self._state._hbm if src.space == "hbm" else self._state._tcm
stored = space_dict.get(src.addr)
if stored is None:
raise RuntimeError(
f"mock tl.send: no data at {src.space}:0x{src.addr:x}"
)
data = np.asarray(stored)
else:
data = None
if data is None:
raise RuntimeError("mock tl.send: src is None")
peer_rank = self._state.neighbors[dir]
# Find the reverse direction in peer's neighbors that points back to me
peer_state = self._scheduler.states[peer_rank]
reverse_dir = None
for d, target in peer_state.neighbors.items():
if target == self._state.rank:
reverse_dir = d
break
if reverse_dir is None:
raise RuntimeError(
f"mock tl.send: peer rank {peer_rank} has no reverse direction"
)
peer_state.recv_q[reverse_dir].append(data.copy())
# After delivering, hand control back to scheduler so the receiver
# can wake up.
self._scheduler.yield_()
def recv_async(
self,
dir: str,
shape: tuple[int, ...] = (),
dtype: str = "f16",
) -> dict:
"""Non-blocking recv. Returns a future dict to pass to tl.wait."""
if dir not in self._state.neighbors:
raise IpcqInvalidDirection(
f"mock tl.recv_async: direction {dir!r} not in neighbors"
)
return {"_kind": "recv_future", "dir": dir, "shape": shape, "dtype": dtype}
def wait(self, future: Any) -> TensorHandle:
"""Block until the recv future has data."""
if not isinstance(future, dict) or future.get("_kind") != "recv_future":
raise TypeError("tl.wait: expected recv future from tl.recv_async")
d = future["dir"]
while not self._state.recv_q[d]:
self._scheduler.yield_()
data = self._state.recv_q[d].popleft()
return self._make_handle(data, d, future["dtype"])
def recv(
self,
dir: str | None = None,
shape: tuple[int, ...] = (),
dtype: str = "f16",
) -> TensorHandle:
if dir is not None and dir not in self._state.neighbors:
raise IpcqInvalidDirection(
f"mock tl.recv: direction {dir!r} not in neighbors {list(self._state.neighbors)}"
)
# Wait for data
while True:
if dir is None:
# first-ready scan in neighbor-table order (no fairness rotation)
for d in self._state.neighbors:
if self._state.recv_q[d]:
data = self._state.recv_q[d].popleft()
return self._make_handle(data, d, dtype)
else:
if self._state.recv_q[dir]:
data = self._state.recv_q[dir].popleft()
return self._make_handle(data, dir, dtype)
# Yield to other ranks
self._scheduler.yield_()
def _make_handle(self, data: np.ndarray, direction: str, dtype: str) -> TensorHandle:
return TensorHandle(
id=f"recv_{direction}",
addr=0, shape=data.shape, dtype=dtype,
nbytes=int(data.nbytes), data=data, space="tcm",
)
# ── Cooperative scheduler ────────────────────────────────────────────
class _MockScheduler:
"""Round-robin cooperative scheduler over rank greenlets."""
def __init__(self, states: list[_MockRankState]) -> None:
self.states = states
self._parent: greenlet | None = None
self._cur_idx = 0
def yield_(self) -> None:
"""Called from inside a rank greenlet to give other ranks a turn."""
assert self._parent is not None
self._parent.switch()
def run(self, kernel_fn: Callable, kernel_args: tuple) -> list[np.ndarray]:
from kernbench.triton_emu.tl_context import TLContext
self._parent = greenlet.getcurrent()
n = len(self.states)
# Per-rank tl shim
tls: dict[int, _MockTL] = {}
def _spawn(rank_idx: int) -> greenlet:
state = self.states[rank_idx]
tl = _MockTL(state, self)
tls[rank_idx] = tl
def _entry():
# Activate this rank's tl for TensorHandle operator overloads
TLContext._set_active(tl) # type: ignore[attr-defined]
try:
kernel_fn(state.t_ptr, *kernel_args, tl=tl)
finally:
TLContext._set_active(None) # type: ignore[attr-defined]
return greenlet(_entry)
for state in self.states:
state.g = _spawn(state.rank)
# Drive each rank round-robin until all dead. Detect global deadlock.
max_rounds = 10_000
round_no = 0
while True:
alive = [s for s in self.states if s.g is not None and not s.g.dead]
if not alive:
break
progressed = False
for s in self.states:
if s.g is None or s.g.dead:
continue
# Multi-rank greenlets share TLContext active state via the
# module-level thread-local; restore this rank's tl before
# resuming so TensorHandle operator overloads dispatch to
# the right _MockTL.
TLContext._set_active(tls[s.rank]) # type: ignore[attr-defined]
s.g.switch()
if s.g.dead:
progressed = True
TLContext._set_active(None) # type: ignore[attr-defined]
# Loose progress check: only a finished greenlet counts as progress
# (queue activity is not tracked). Count consecutive idle rounds and
# abort once they exceed max_rounds.
round_no = 0 if progressed else round_no + 1
if round_no > max_rounds:
raise RuntimeError(
"mock CCL runtime: deadlock detected (no progress for "
f"{max_rounds} rounds)"
)
return [
s.output if s.output is not None else s._hbm.get(s._slice_addr)
for s in self.states
]
# ── Public entry ────────────────────────────────────────────────────
def run_kernel_in_mock(
kernel_fn: Callable,
world_size: int,
topology: str,
inputs: list[np.ndarray],
kernel_args: tuple = (),
algo_module: Any | None = None,
) -> list[np.ndarray]:
"""Run a CCL kernel under the mock runtime with no SimPy/fabric.
Args:
kernel_fn: ``kernel(t_ptr, *kernel_args, tl=...)``
world_size: number of ranks
topology: builtin topology name (e.g. "ring_1d")
inputs: per-rank input ndarrays. ``inputs[r]`` becomes rank r's
local tile at HBM address ``r * inputs[r].nbytes`` (``t_ptr`` is 0).
kernel_args: extra positional args after t_ptr
algo_module: optional module providing ``neighbors()`` override
Returns:
Per-rank output ndarrays — whatever the kernel wrote via tl.store
(or the original input if the kernel didn't store).
"""
if len(inputs) != world_size:
raise ValueError(f"len(inputs)={len(inputs)} != world_size={world_size}")
topo_fn = resolve_topology(topology, algo_module=algo_module)
states = [
_MockRankState(
rank=r, world_size=world_size,
neighbors=topo_fn(r, world_size),
input_arr=inputs[r],
)
for r in range(world_size)
]
sched = _MockScheduler(states)
return sched.run(kernel_fn, kernel_args)
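The scheduling idea of the mock runtime (one cooperative task per rank, per-(rank, direction) FIFO queues, and a reverse-direction lookup on send) can be illustrated without any dependencies. The sketch below swaps in Python generators for greenlets and plain integers for ndarrays; the `kernel` here is a naive ring all-reduce written for this example, not the repository's `ring_allreduce` kernel, and `run_ring_allreduce` is a hypothetical name.

```python
from collections import deque


def ring_neighbors(rank, world_size):
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def kernel(rank, world_size, value, recv_q, send):
    """Naive ring all-reduce: forward east, accumulate what arrives from west."""
    acc = msg = value
    for _ in range(world_size - 1):
        send(rank, "E", msg)
        while not recv_q[rank]["W"]:
            yield  # nothing to read yet: let the other ranks run
        msg = recv_q[rank]["W"].popleft()
        acc += msg
    return acc


def run_ring_allreduce(inputs, max_rounds=10_000):
    n = len(inputs)
    nbrs = [ring_neighbors(r, n) for r in range(n)]
    recv_q = [{d: deque() for d in nbrs[r]} for r in range(n)]

    def send(rank, direction, payload):
        peer = nbrs[rank][direction]
        # Deliver into the peer's queue for the direction that points back.
        reverse = next(d for d, t in nbrs[peer].items() if t == rank)
        recv_q[peer][reverse].append(payload)

    gens = {r: kernel(r, n, inputs[r], recv_q, send) for r in range(n)}
    results = {}
    rounds = 0
    while gens:
        rounds += 1
        if rounds > max_rounds:
            raise RuntimeError("ranks did not finish; possible deadlock")
        for r in list(gens):
            try:
                next(gens[r])  # resume rank r until it yields or returns
            except StopIteration as stop:
                results[r] = stop.value
                del gens[r]
    return [results[r] for r in range(n)]


outputs = run_ring_allreduce([1, 2, 3, 4])
print(outputs)
```

In this sketch a two-rank ring never completes: both of the peer's directions point back at the sender, the first-match lookup picks "E", and the kernel waits on "W". That is a property of the first-match reverse lookup as written here; whether the real mock runtime is exercised with two ranks is not stated in the diff.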
+128
@@ -0,0 +1,128 @@
"""Builtin neighbor topology generators for CCL backend (ADR-0023 D11).
Each generator takes ``(rank, world_size)`` and returns a
``dict[direction, peer_rank]`` for that rank. ``direction`` is one of
``"N" | "S" | "E" | "W"`` for ring/mesh, or
``"parent" | "child_left" | "child_right"`` for tree topologies.
Algorithm modules may override the generated map by defining a
``neighbors(rank, world_size, neighbor_map) -> dict | None`` function in
the same module (see D11 / D15). ``resolve_topology`` wires these together.
"""
from __future__ import annotations
from typing import Any, Callable
NeighborMap = dict[str, int]
TopologyFn = Callable[[int, int], NeighborMap]
# ── Builtin generators ───────────────────────────────────────────────
def ring_1d(rank: int, world_size: int) -> NeighborMap:
"""1D bidirectional ring (E/W)."""
return {
"E": (rank + 1) % world_size,
"W": (rank - 1) % world_size,
}
def ring_1d_unidir(rank: int, world_size: int) -> NeighborMap:
"""1D unidirectional ring (E only)."""
return {"E": (rank + 1) % world_size}
def mesh_2d(rank: int, world_size: int) -> NeighborMap:
"""Square 2D mesh (N/S/E/W).
Layout: rank = row * side + col, with side = sqrt(world_size).
Wrap-around (torus) on all four edges.
"""
side = int(round(world_size ** 0.5))
if side * side != world_size:
raise ValueError(
f"mesh_2d requires square world_size, got {world_size}"
)
r, c = divmod(rank, side)
return {
"N": ((r - 1) % side) * side + c,
"S": ((r + 1) % side) * side + c,
"W": r * side + (c - 1) % side,
"E": r * side + (c + 1) % side,
}
def tree_binary(rank: int, world_size: int) -> NeighborMap:
"""Binary tree rooted at rank 0.
Children of rank r are 2r+1 and 2r+2 (if within world_size).
Parent of rank r > 0 is (r-1)//2.
Returned keys (only those that exist):
"parent", "child_left", "child_right"
"""
n: NeighborMap = {}
if rank > 0:
n["parent"] = (rank - 1) // 2
left = 2 * rank + 1
right = 2 * rank + 2
if left < world_size:
n["child_left"] = left
if right < world_size:
n["child_right"] = right
return n
def none(rank: int, world_size: int) -> NeighborMap:
"""Empty map — algorithm's neighbors() must build from scratch."""
return {}
_BUILTIN: dict[str, TopologyFn] = {
"ring_1d": ring_1d,
"ring_1d_unidir": ring_1d_unidir,
"mesh_2d": mesh_2d,
"tree_binary": tree_binary,
"none": none,
}
# ── Resolution ───────────────────────────────────────────────────────
def resolve_topology(
name: str, algo_module: Any | None = None,
) -> TopologyFn:
"""Return a callable ``(rank, world_size) -> NeighborMap``.
Args:
name: builtin topology name from ccl.yaml. Must be one of
``ring_1d``, ``ring_1d_unidir``, ``mesh_2d``, ``tree_binary``,
or ``none``.
algo_module: optional algorithm module. If it defines
``neighbors(rank, world_size, neighbor_map)``, that hook is
invoked after the builtin to override the result.
Returning None from neighbors() leaves the builtin map
unchanged; returning a dict replaces it.
Raises:
ValueError: if ``name`` is not a known builtin.
"""
if name not in _BUILTIN:
raise ValueError(
f"Unknown topology '{name}'. "
f"Available builtins: {list(_BUILTIN)}"
)
builtin_fn = _BUILTIN[name]
override_fn = getattr(algo_module, "neighbors", None) if algo_module else None
if override_fn is None or not callable(override_fn):
return builtin_fn
def _wrapped(rank: int, world_size: int) -> NeighborMap:
base = builtin_fn(rank, world_size)
result = override_fn(rank, world_size, base)
if result is None:
return base
return result
return _wrapped
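The override contract described in `resolve_topology` (the hook receives the builtin map and may return `None` to keep it or a dict to replace it) can be exercised in isolation. This is a minimal standalone sketch, not the module itself: `resolve`, `east_only_on_even`, and the `SimpleNamespace` stand-in for an algorithm module are hypothetical names introduced for the example.

```python
from types import SimpleNamespace


def ring_1d(rank, world_size):
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def mesh_2d(rank, world_size):
    """Square torus: rank = row * side + col, wrap-around on every edge."""
    side = int(round(world_size ** 0.5))
    if side * side != world_size:
        raise ValueError(f"mesh_2d requires square world_size, got {world_size}")
    r, c = divmod(rank, side)
    return {
        "N": ((r - 1) % side) * side + c,
        "S": ((r + 1) % side) * side + c,
        "W": r * side + (c - 1) % side,
        "E": r * side + (c + 1) % side,
    }


def resolve(builtin_fn, algo_module=None):
    """Wrap builtin_fn with the module's neighbors() hook, if it has one."""
    override = getattr(algo_module, "neighbors", None)
    if not callable(override):
        return builtin_fn

    def wrapped(rank, world_size):
        base = builtin_fn(rank, world_size)
        result = override(rank, world_size, base)
        return base if result is None else result

    return wrapped


# Hypothetical hook: even ranks keep only their eastbound link.
def east_only_on_even(rank, world_size, neighbor_map):
    if rank % 2 == 0:
        return {"E": neighbor_map["E"]}
    return None  # odd ranks keep the builtin map


algo = SimpleNamespace(neighbors=east_only_on_even)
topo = resolve(ring_1d, algo)
print(topo(0, 4), topo(1, 4), mesh_2d(5, 16))
```

An override like this one produces an asymmetric table (rank 1 still lists rank 0 to its west, but rank 0 no longer lists rank 1 as reachable from the west side), which is the case `install_ipcq` handles by skipping entries with no reverse direction.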
+234
@@ -0,0 +1,234 @@
"""IPCQ schemas and exceptions (ADR-0023 D2.5, D12, D14 F1).
This module contains the data structures and exceptions used by the
PE-level IPCQ collective communication infrastructure. The host-facing
sideband fan-out message ``IpcqInitMsg`` lives in
``kernbench.runtime_api.kernel`` (alongside other fabric messages),
while all internal token / metadata / command schemas are kept here.
Layering:
PE_CPU --IpcqRequest(IpcqSendCmd|IpcqRecvCmd)--> PE_IPCQ
PE_IPCQ --IpcqDmaToken--> PE_DMA (vc_comm)
PE_DMA --IpcqMetaArrival--> PE_IPCQ (atomic, D9)
PE_IPCQ --IpcqCreditMetadata--> peer PE_IPCQ (fast path, D9)
See ADR-0023 for the full design.
"""
from __future__ import annotations
from dataclasses import dataclass, field
from typing import TYPE_CHECKING, Any, Union
if TYPE_CHECKING:
import simpy
# ── D14 F1: invalid direction exception ──────────────────────────────
class IpcqInvalidDirection(ValueError):
"""Raised when a kernel calls tl.send/recv with a direction that
has no neighbor installed for this PE."""
# ── D2.5: IpcqEndpoint ───────────────────────────────────────────────
@dataclass(frozen=True)
class IpcqEndpoint:
"""Everything the sender needs to compute the peer's rx_buffer address (D2.5).
Sender PE_IPCQ uses this to compute the destination PA for its DMA
write into the peer's rx ring buffer slot:
slot_idx = sender.my_head % peer.n_slots
dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
"""
sip: int # destination SIP
cube: int # destination cube
pe: int # destination PE (cube-local index)
buffer_kind: str # "tcm" | "hbm" | "sram"
rx_base_pa: int # peer rx_buffer base PA (PhysAddr.encode())
rx_base_va: int # peer rx_buffer base VA (optional, MMU)
n_slots: int # peer ring depth (wrap-around modulo)
slot_size: int # peer slot size (offset multiplier)
# ── D12: IpcqInitEntry (used by IpcqInitMsg in kernel.py) ────────────
@dataclass(frozen=True)
class IpcqInitEntry:
"""One direction's neighbor entry that backend installs into a PE_IPCQ
via IpcqInitMsg (kernbench.runtime_api.kernel.IpcqInitMsg, D12).
"""
direction: str # "N" | "S" | "E" | "W"
peer: IpcqEndpoint # see D2.5
my_rx_base_pa: int # this PE's own rx_buffer base
my_rx_base_va: int # this PE's own rx_buffer base VA (optional)
n_slots: int # this PE's ring depth
slot_size: int # this PE's slot size
# Credit fast path channel (D9).
# Contract: must be a simpy.Store instance dedicated to receiving
# IpcqCreditMetadata objects only. Backend wires it once at init time
# and the receiving PE_IPCQ owns its consumer side; the sender (peer's
# PE_IPCQ) puts IpcqCreditMetadata directly into this store via
# _delayed_credit_send. Do not put any other object type.
peer_credit_store: "simpy.Store"
# ── D12: IpcqSendCmd (PE_CPU → PE_IPCQ) ──────────────────────────────
@dataclass(frozen=True)
class IpcqSendCmd:
"""tl.send command issued by the kernel to PE_IPCQ."""
direction: str # "N" | "S" | "E" | "W"
src_addr: int # source data address (TCM/HBM/SRAM)
src_space: str # "tcm" | "hbm" | "sram"
nbytes: int
shape: tuple[int, ...] # data shape (op_log + MemoryStore use)
dtype: str
handle_id: str # completion tracking
data_op: bool = True # ADR-0020 op_log recording flag
# ── D12: IpcqRecvCmd (PE_CPU → PE_IPCQ) ──────────────────────────────
@dataclass(frozen=True)
class IpcqRecvCmd:
"""tl.recv command issued by the kernel to PE_IPCQ.
Two modes (recv_mode):
"return_slot" — return slot address as-is (default, zero-copy).
Kernel uses the slot memory directly.
"copy_to_dst" — copy slot data to dst_addr, then return.
"""
direction: str | None # None → round-robin (weak fairness, D4)
shape: tuple[int, ...]
dtype: str
handle_id: str
recv_mode: str = "return_slot"
dst_addr: int = 0 # used only when recv_mode == "copy_to_dst"
dst_space: str = "" # used only when recv_mode == "copy_to_dst"
blocking: bool = True
data_op: bool = True
# ── D12: IpcqDmaToken (PE_IPCQ → PE_DMA, vc_comm) ───────────────────
@dataclass
class IpcqDmaToken:
"""Token sent from PE_IPCQ to PE_DMA (vc_comm channel) carrying both
the data move request and the piggyback metadata (ADR-0023 D9).
Receiving PE_DMA processes this atomically (I6 MUST):
1. MemoryStore.write(dst_endpoint.buffer_kind, dst_addr, data)
2. Forward IpcqMetaArrival(token=self) to peer PE_IPCQ
No yield is allowed between the two steps.
The ``data`` field is a snapshot taken by the sender's PE_DMA at the
moment the send is issued. This preserves "in-flight data" semantics:
if the sender mutates its source memory after issuing the send but
before arrival, the receiver still gets the snapshot. The snapshot is
None for control-only tokens (e.g. credit-only updates).
"""
# ── Data movement (single-hop DMA write) ──
src_addr: int
src_space: str
dst_addr: int # already-computed peer rx slot PA
dst_endpoint: IpcqEndpoint # routing target (sip/cube/pe) + buffer_kind
nbytes: int
handle_id: str # completion notify back to sender PE_IPCQ
# Optional shape/dtype carried for op_log + MemoryStore convenience.
shape: tuple[int, ...] = ()
dtype: str = "f16"
# In-flight data snapshot (sender PE_DMA captures this at send time).
data: Any = None
# ── Piggyback metadata (D9) ──
sender_seq: int = 0 # monotonic; receiver updates peer_head_cache
src_sip: int = 0
src_cube: int = 0
src_pe: int = 0
src_direction: str = "E" # sender-side direction; receiver maps to its own
data_op: bool = True
# ── D12: IpcqMetaArrival (PE_DMA → PE_IPCQ, intra-PE wire) ──────────
@dataclass
class IpcqMetaArrival:
"""Posted by receiving PE_DMA into the destination PE's PE_IPCQ inbox
in the same SimPy step as the MemoryStore.write (D9, I6 MUST).
The receiver PE_IPCQ uses ``token.sender_seq`` to update its
peer_head_cache for the corresponding direction.
"""
token: IpcqDmaToken
# ── D12: IpcqCreditMetadata (PE_IPCQ → peer PE_IPCQ, fast path) ─────
@dataclass(frozen=True)
class IpcqCreditMetadata:
"""Credit return — recv-side → send-side fast path (D9).
Sent by ``PeIpcqComponent._delayed_credit_send`` after a
bottleneck-BW based latency, putting the metadata directly into
the peer's pre-wired credit store (no fabric routing).
"""
consumer_seq: int # my_tail at recv side (new tail value)
src_sip: int # which peer is sending the credit
src_cube: int
src_pe: int
src_direction: str # sender-side direction (peer maps to its own)
# ── Request wrapper (PE_CPU → PE_IPCQ) ───────────────────────────────
@dataclass
class IpcqRequest:
"""Wrapper carrying an IpcqSendCmd or IpcqRecvCmd plus a SimPy completion
event. Posted by PE_CPU into PE_IPCQ's inbox; PE_IPCQ calls
``done.succeed()`` when the request is fully processed.
For recv requests, the result (slot address, direction, dtype, shape)
is written into ``result_data`` so the caller can read it after wait.
"""
command: "IpcqSendCmd | IpcqRecvCmd"
done: "simpy.Event"
result_data: dict[str, Any] = field(default_factory=dict)
# ── RecvFuture (kernel ↔ runner handshake for tl.recv_async / tl.wait) ─
@dataclass
class RecvFuture:
"""Opaque future returned by ``tl.recv_async``.
The KernelRunner attaches a SimPy event and the IpcqRequest in the
background; ``tl.wait(future)`` switches back to the runner which
yields on the event and resolves the result into a TensorHandle.
"""
cmd: "IpcqRecvCmd"
request: Any = None # IpcqRequest (set by runner)
event: Any = None # simpy.Event (set by runner)
resolved: bool = False
result: Any = None # cached TensorHandle after wait()
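The slot-address formula in the `IpcqEndpoint` docstring and the credit return carried by `IpcqCreditMetadata.consumer_seq` can be put together in a small sender-side model. This is a sketch under stated assumptions: `Endpoint` is a reduced stand-in for `IpcqEndpoint`, `SenderRing` and its method names are hypothetical, and the "ring is full when head minus cached tail equals n_slots" rule is the usual ring-buffer condition rather than something the diff spells out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Reduced stand-in for IpcqEndpoint: only the fields used for addressing."""
    rx_base_pa: int
    n_slots: int
    slot_size: int


@dataclass
class SenderRing:
    peer: Endpoint
    my_head: int = 0          # number of slots produced so far
    peer_tail_cache: int = 0  # last consumer_seq received via a credit

    def can_send(self) -> bool:
        # Assumed backpressure rule: at most n_slots unconsumed slots in flight.
        return self.my_head - self.peer_tail_cache < self.peer.n_slots

    def send(self) -> int:
        """Return the destination PA for the next slot and advance the head."""
        if not self.can_send():
            raise RuntimeError("ring full; wait for a credit")
        slot_idx = self.my_head % self.peer.n_slots
        dst_pa = self.peer.rx_base_pa + slot_idx * self.peer.slot_size
        self.my_head += 1
        return dst_pa

    def on_credit(self, consumer_seq: int) -> None:
        # Credits carry the receiver's new tail; never move the cache backwards.
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)


ring = SenderRing(Endpoint(rx_base_pa=0x1000, n_slots=2, slot_size=256))
first, second = ring.send(), ring.send()
blocked = not ring.can_send()   # both slots are in flight
ring.on_credit(1)               # receiver consumed one slot
third = ring.send()             # wraps back to slot 0
print(hex(first), hex(second), blocked, hex(third))
```

In the real component the blocked case is handled by the configured backpressure mode (poll or sleep) rather than by raising.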
+1
@@ -33,6 +33,7 @@ class TensorHandle:
 dtype: str
 nbytes: int # total byte size
 data: object = None # reserved for validate mode
+space: str = "tcm" # MemoryStore space ("tcm" | "hbm" | "sram")
 @dataclass(frozen=True)
+25 -2
@@ -42,9 +42,30 @@ class PeCpuComponent(ComponentBase):
self._cube_idx = int(parts[1].replace("cube", "")) self._cube_idx = int(parts[1].replace("cube", ""))
except (IndexError, ValueError): except (IndexError, ValueError):
self._cube_idx = 0 self._cube_idx = 0
# num_cubes from spec (for tl.program_id(axis=1)) # num_cubes from spec (for tl.program_id(axis=1) — ADR-0022)
spec = ctx.spec if ctx else {} spec = ctx.spec if ctx else {}
self._num_cubes = spec.get("system", {}).get("sips", {}).get("cubes_per_sip", 1) cube_mesh = spec.get("sip", {}).get("cube_mesh", {})
if cube_mesh:
self._num_cubes = int(cube_mesh.get("w", 1)) * int(cube_mesh.get("h", 1))
else:
self._num_cubes = (
spec.get("system", {}).get("sips", {}).get("cubes_per_sip", 1)
)
# PE-local scratch for kernel math output handles (ADR-0020 D3
# extension; reserved portion of TCM addressed via a synthetic
# MemoryStore key, not the real PA encoder).
pe_template = spec.get("cube", {}).get("pe_template", {})
tcm_attrs = pe_template.get("components", {}).get("pe_tcm", {}).get("attrs", {})
scratch_mb = float(tcm_attrs.get("kernel_scratch_mb", 1))
self._tl_scratch_size = int(scratch_mb * (1 << 20))
# PE-unique base address — high bit pattern to avoid collision with
# IPCQ ring buffers (which use bit 60).
self._tl_scratch_base = (
(1 << 61)
| (self._sip_idx << 40)
| (self._cube_idx << 32)
| (self._pe_idx << 24)
)
def _find_shard(self, shards: tuple) -> Any: def _find_shard(self, shards: tuple) -> Any:
"""Find shard matching this PE's (sip, cube, pe). Fallback to positional index.""" """Find shard matching this PE's (sip, cube, pe). Fallback to positional index."""
@@ -146,6 +167,8 @@ class PeCpuComponent(ComponentBase):
  scheduler_id=scheduler_id,
  out_ports=self.out_ports,
  store=store,
+ scratch_base=self._tl_scratch_base,
+ scratch_size=self._tl_scratch_size,
  )
  yield from runner.run(env, kernel_fn, kernel_args, num_programs)
  return getattr(runner, "_composite_results", [])
+115 -2
@@ -106,19 +106,132 @@ class PeDmaComponent(PeEngineBase):
  pe_txn.done.succeed()
  def _worker(self, env: simpy.Environment) -> Generator:
- """Handle TileToken (pipeline), PeInternalTxn (legacy), and Transaction (fabric)."""
+ """Handle TileToken (pipeline), PeInternalTxn (legacy), IpcqDmaToken,
+ and Transaction (fabric)."""
+ from kernbench.common.ipcq_types import IpcqDmaToken
  from kernbench.common.pe_commands import PeInternalTxn
  from kernbench.components.builtin.pe_types import TileToken
  while True:
  msg: Any = yield self._inbox.get()
- if isinstance(msg, TileToken):
+ if isinstance(msg, IpcqDmaToken):
+ # Outbound: IPCQ token from local PE_IPCQ → forward via fabric
+ env.process(self._handle_ipcq_outbound(env, msg))
+ elif isinstance(msg, TileToken):
  env.process(self._pipeline_process(env, msg))
  elif isinstance(msg, PeInternalTxn):
  env.process(self._handle_with_hooks(env, msg))
+ else:
+ # Transaction (or unknown). May carry IpcqDmaToken inbound.
+ req = getattr(msg, "request", None)
+ if isinstance(req, IpcqDmaToken):
+ env.process(self._handle_ipcq_inbound(env, msg))
  else:
  env.process(self._forward_txn(env, msg))
+ # ── IPCQ outbound (PE_IPCQ → PE_DMA → fabric) ───────────────────
+ def _handle_ipcq_outbound(self, env: simpy.Environment, token: Any) -> Generator:
+ """Forward IpcqDmaToken from local PE_IPCQ through the fabric to peer
+ PE_DMA. ADR-0023 D8 (vc_comm channel)."""
+ if self.ctx is None:
+ return # nothing to do
+ peer = token.dst_endpoint
+ peer_pe_dma = f"sip{peer.sip}.cube{peer.cube}.pe{peer.pe}.pe_dma"
+ # Snapshot the source data at send time (D9 in-flight semantics).
+ # Without this, the receiver could read stale or future data if the
+ # sender mutates src_addr between send issue and DMA arrival.
+ store = getattr(self.ctx, "memory_store", None)
+ if store is not None and token.data is None:
+ try:
+ snap = store.read(
+ token.src_space, token.src_addr,
+ shape=token.shape, dtype=token.dtype,
+ )
+ # Copy so later mutations to src_addr don't affect the snapshot.
+ token.data = snap.copy() if hasattr(snap, "copy") else snap
+ except Exception:
+ token.data = None
+ # Record the IPCQ copy in op_log at OUTBOUND time. ADR-0020 D6:
+ # Phase 2 replays the copy in t_start order; using outbound time
+ # (rather than inbound) ensures the copy executes before any later
+ # local op at the sender that might overwrite token.src_addr (e.g.
+ # a tl.store after a recv).
+ if self._op_logger is not None:
+ try:
+ self._op_logger.record_copy(
+ t_start=float(env.now), t_end=float(env.now),
+ component_id=self.node.id,
+ src_space=token.src_space, src_addr=token.src_addr,
+ dst_space=peer.buffer_kind,
+ dst_addr=token.dst_addr,
+ shape=token.shape, dtype=token.dtype, nbytes=token.nbytes,
+ )
+ except Exception:
+ pass
+ try:
+ path = self.ctx.router.find_path(self._pe_prefix, peer_pe_dma)
+ except Exception:
+ return
+ drain_ns = self.ctx.compute_drain_ns(path, token.nbytes)
+ sub_done = env.event()
+ sub_txn = Transaction(
+ request=token, path=path, step=0,
+ nbytes=token.nbytes, done=sub_done, drain_ns=drain_ns,
+ )
+ if len(path) > 1:
+ next_hop = path[1]
+ if next_hop in self.out_ports:
+ yield self.out_ports[next_hop].put(sub_txn.advance())
+ else:
+ return
+ # Note: don't wait on sub_done here — fire-and-forget for vc_comm.
+ # IPCQ slot bookkeeping (peer_head) was already updated by PE_IPCQ;
+ # backpressure is via credit return, not via this DMA's completion.
+ # ── IPCQ inbound (fabric → PE_DMA → MemoryStore + PE_IPCQ) ──────
+ def _handle_ipcq_inbound(self, env: simpy.Environment, txn: Any) -> Generator:
+ """At destination PE_DMA: atomically write data and forward metadata.
+ I6 (MUST): no SimPy yield between MemoryStore.write and the
+ IpcqMetaArrival put into PE_IPCQ.
+ """
+ from kernbench.common.ipcq_types import IpcqMetaArrival
+ token = txn.request
+ # ── ATOMIC: do not introduce yield between these two operations ──
+ # 1. Move data via MemoryStore (single-hop DMA write).
+ # Prefer the in-flight snapshot stashed by the sender PE_DMA;
+ # fall back to a fresh read of src_addr if no snapshot is present
+ # (e.g. control-only token).
+ store = getattr(self.ctx, "memory_store", None) if self.ctx else None
+ if store is not None:
+ try:
+ data = token.data
+ if data is None:
+ data = store.read(
+ token.src_space, token.src_addr,
+ shape=token.shape, dtype=token.dtype,
+ )
+ store.write(token.dst_endpoint.buffer_kind, token.dst_addr, data)
+ except Exception:
+ pass
+ # 2. Forward IpcqMetaArrival to local PE_IPCQ
+ ipcq_id = f"{self._pe_prefix}.pe_ipcq"
+ if ipcq_id in self.out_ports:
+ yield self.out_ports[ipcq_id].put(IpcqMetaArrival(token=token))
+ # ─────────────────────────────────────────────────────────────────
+ if not txn.done.triggered:
+ txn.done.succeed()
  def _pipeline_process(self, env: simpy.Environment, token: Any) -> Generator:
  """Pipeline mode: DMA read/write via fabric, then self-route."""
  self._on_process_start(env, token)
+455
@@ -0,0 +1,455 @@
"""PE_IPCQ component (ADR-0023): per-PE IPCQ control plane.
Responsibilities:
- Hold per-direction queue pair state (my_head, my_tail,
peer_head_cache, peer_tail_cache, ring buffer addresses)
- Process IpcqInitMsg from backend to install neighbor table
- Handle IpcqRequest(IpcqSendCmd) from PE_CPU:
compute peer slot address, check backpressure, forward
IpcqDmaToken to PE_DMA (vc_comm)
- Handle IpcqRequest(IpcqRecvCmd) from PE_CPU:
wait for data arrival, return slot address (or copy to dst),
send fast-path credit return
- Handle IpcqMetaArrival from PE_DMA: update peer_head_cache, wake recv
- Handle IpcqCreditMetadata via own credit_inbox: update peer_tail_cache,
wake send
PE_IPCQ does NOT move data — it forwards IpcqDmaToken to PE_DMA which
performs the actual fabric DMA.
Credit return uses a fast path: PE_IPCQ creates a SimPy process with a
bottleneck-BW based latency, then puts IpcqCreditMetadata directly into
the peer's pre-wired credit_store.
"""
from __future__ import annotations
from collections.abc import Generator
from typing import TYPE_CHECKING, Any
import simpy
from kernbench.common.ipcq_types import (
IpcqCreditMetadata,
IpcqDmaToken,
IpcqInvalidDirection,
IpcqMetaArrival,
IpcqRecvCmd,
IpcqRequest,
IpcqSendCmd,
)
from kernbench.components.base import ComponentBase
if TYPE_CHECKING:
from kernbench.components.context import ComponentContext
from kernbench.runtime_api.kernel import IpcqInitMsg
from kernbench.topology.types import Node
_DIR_ORDER: tuple[str, ...] = ("N", "S", "E", "W", "parent", "child_left", "child_right")
class PeIpcqComponent(ComponentBase):
"""PE_IPCQ: ring buffer pointer + neighbor management for CCL.
Owned by one PE; talks to PE_DMA via out_ports[<pe_dma_id>] and
receives credit return metadata via the public ``credit_inbox``
SimPy Store (wired by backend at IpcqInitMsg installation time).
"""
def __init__(self, node: Node, ctx: ComponentContext | None = None) -> None:
super().__init__(node, ctx)
# Strict shape/dtype validation (D14 F2). Off by default.
self._strict: bool = bool(node.attrs.get("strict_validation", False))
# direction → list of received tokens (for strict-mode peek of next slot)
self._arrived_tokens: dict[str, list] = {}
# Parse self (sip, cube, pe) from node id, e.g. "sip0.cube0.pe0.pe_ipcq"
self._pe_prefix: str = node.id.rsplit(".", 1)[0]
parts = self._pe_prefix.split(".")
try:
self._self_sip = int(parts[0].replace("sip", ""))
except (IndexError, ValueError):
self._self_sip = 0
try:
self._self_cube = int(parts[1].replace("cube", ""))
except (IndexError, ValueError):
self._self_cube = 0
try:
self._self_pe = int(parts[2].replace("pe", ""))
except (IndexError, ValueError):
self._self_pe = 0
self._dma_node_id = f"{self._pe_prefix}.pe_dma"
# direction → state dict (see _install_neighbors for shape)
self._queue_pairs: dict[str, dict[str, Any]] = {}
self._installed = False
self._buffer_kind: str = "tcm"
self._backpressure_mode: str = "sleep"
self._credit_size_bytes: int = 16
# waiters for recv (per direction) and any-direction (for round-robin)
self._recv_waiters: dict[str, list[simpy.Event]] = {}
self._any_recv_waiters: list[simpy.Event] = []
# waiters for send backpressure (per direction)
self._send_waiters: dict[str, list[simpy.Event]] = {}
# round-robin cursor over installed directions
self._rr_dirs: list[str] = []
self._rr_cursor: int = 0
# credit_inbox is created in start() once env is available
self._credit_inbox: simpy.Store | None = None
# ── Public ──
@property
def credit_inbox(self) -> simpy.Store:
"""SimPy Store that backend wires as ``peer_credit_store`` on
every remote sender targeting this PE. Used by D9 fast path."""
assert self._credit_inbox is not None, "PE_IPCQ not started yet"
return self._credit_inbox
@property
def queue_pairs(self) -> dict[str, dict[str, Any]]:
"""Test/debug accessor."""
return self._queue_pairs
# ── Lifecycle ──
def run(self, env: simpy.Environment, nbytes: int) -> Generator:
yield env.timeout(0)
def start(self, env: simpy.Environment) -> None:
# Create credit_inbox even if there are no in_ports yet
if self._credit_inbox is None:
self._credit_inbox = simpy.Store(env)
# If no in_ports were wired (e.g. unit test), still spin up workers
if not self.in_ports:
self._inbox = simpy.Store(env)
super().start(env)
env.process(self._credit_worker(env))
# ── Worker (override of ComponentBase._worker) ──
def _worker(self, env: simpy.Environment) -> Generator:
from kernbench.runtime_api.kernel import IpcqInitMsg
while True:
msg: Any = yield self._inbox.get()
# IpcqInitMsg may arrive wrapped in a transaction (with .request)
# or directly.
request_obj = getattr(msg, "request", None)
if isinstance(request_obj, IpcqInitMsg):
self._install_neighbors(request_obj)
done = getattr(msg, "done", None)
if done is not None and not done.triggered:
done.succeed()
continue
if isinstance(msg, IpcqInitMsg):
self._install_neighbors(msg)
continue
if isinstance(msg, IpcqMetaArrival):
self._handle_meta_arrival(msg)
continue
if isinstance(msg, IpcqRequest):
env.process(self._handle_request(env, msg))
continue
# Unknown message — drop or forward via base class fallback
env.process(self._forward_txn(env, msg))
# ── Init ──
def _install_neighbors(self, msg: IpcqInitMsg) -> None:
self._installed = True
self._buffer_kind = msg.buffer_kind
self._backpressure_mode = msg.backpressure_mode
self._credit_size_bytes = msg.credit_size_bytes
for entry in msg.entries:
self._queue_pairs[entry.direction] = {
"peer": entry.peer,
"my_rx_base_pa": entry.my_rx_base_pa,
"my_rx_base_va": entry.my_rx_base_va,
"n_slots": entry.n_slots,
"slot_size": entry.slot_size,
"peer_credit_store": entry.peer_credit_store,
"my_head": 0,
"my_tail": 0,
"peer_head_cache": 0,
"peer_tail_cache": 0,
}
self._recv_waiters.setdefault(entry.direction, [])
self._send_waiters.setdefault(entry.direction, [])
# Reset round-robin order to a stable canonical sequence
self._rr_dirs = [d for d in _DIR_ORDER if d in self._queue_pairs]
self._rr_cursor = 0
# ── Send ──
def _handle_request(self, env: simpy.Environment, req: IpcqRequest) -> Generator:
cmd = req.command
if isinstance(cmd, IpcqSendCmd):
yield from self._handle_send(env, req, cmd)
elif isinstance(cmd, IpcqRecvCmd):
yield from self._handle_recv(env, req, cmd)
def _handle_send(
self, env: simpy.Environment, req: IpcqRequest, cmd: IpcqSendCmd,
) -> Generator:
if cmd.direction not in self._queue_pairs:
raise IpcqInvalidDirection(
f"PE {self._pe_prefix}: direction {cmd.direction!r} not installed"
)
qp = self._queue_pairs[cmd.direction]
peer = qp["peer"]
# Backpressure: wait while ring full
while (qp["my_head"] - qp["peer_tail_cache"]) >= peer.n_slots:
wait_event = env.event()
self._send_waiters[cmd.direction].append(wait_event)
yield wait_event
# Compute peer slot address
slot_idx = qp["my_head"] % peer.n_slots
dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
token = IpcqDmaToken(
src_addr=cmd.src_addr,
src_space=cmd.src_space,
dst_addr=dst_pa,
dst_endpoint=peer,
nbytes=cmd.nbytes,
handle_id=cmd.handle_id,
shape=cmd.shape,
dtype=cmd.dtype,
sender_seq=qp["my_head"],
src_sip=self._self_sip,
src_cube=self._self_cube,
src_pe=self._self_pe,
src_direction=cmd.direction,
)
# Forward to PE_DMA (vc_comm)
yield self.out_ports[self._dma_node_id].put(token)
qp["my_head"] += 1
# Diagnostics trace (D14)
from kernbench.ccl import diagnostics
if diagnostics.trace_enabled():
diagnostics.log_send(
t_ns=float(env.now), sender=self._pe_prefix,
direction=cmd.direction, nbytes=cmd.nbytes,
sender_seq=qp["my_head"] - 1,
)
if not req.done.triggered:
req.done.succeed()
# ── Recv ──
def _handle_recv(
self, env: simpy.Environment, req: IpcqRequest, cmd: IpcqRecvCmd,
) -> Generator:
if cmd.direction is None:
direction = yield from self._wait_any_direction(env)
else:
if cmd.direction not in self._queue_pairs:
raise IpcqInvalidDirection(
f"PE {self._pe_prefix}: direction {cmd.direction!r} not installed"
)
direction = cmd.direction
qp = self._queue_pairs[direction]
while qp["peer_head_cache"] <= qp["my_tail"]:
wait_event = env.event()
self._recv_waiters[direction].append(wait_event)
yield wait_event
qp = self._queue_pairs[direction]
slot_idx = qp["my_tail"] % qp["n_slots"]
slot_addr = qp["my_rx_base_pa"] + slot_idx * qp["slot_size"]
# Strict validation (D14 F2): peek the next-arrived token's metadata
# against the recv command's expected shape/dtype/nbytes.
arrived = self._arrived_tokens.get(direction, [])
if arrived:
front = arrived.pop(0)
if self._strict:
expected_nbytes = self._nbytes_for(cmd.shape, cmd.dtype)
if front.dtype != cmd.dtype:
raise ValueError(
f"PE_IPCQ {self._pe_prefix} recv strict: dtype mismatch — "
f"sender={front.dtype} recv={cmd.dtype}"
)
if front.shape != cmd.shape:
raise ValueError(
f"PE_IPCQ {self._pe_prefix} recv strict: shape mismatch — "
f"sender={front.shape} recv={cmd.shape}"
)
if front.nbytes != expected_nbytes:
raise ValueError(
f"PE_IPCQ {self._pe_prefix} recv strict: nbytes mismatch — "
f"sender={front.nbytes} recv={expected_nbytes}"
)
req.result_data["src_space"] = self._buffer_kind
req.result_data["src_addr"] = slot_addr
req.result_data["direction"] = direction
req.result_data["dtype"] = cmd.dtype
req.result_data["shape"] = cmd.shape
req.result_data["nbytes"] = self._nbytes_for(cmd.shape, cmd.dtype)
# copy_to_dst mode: rebind the result handle to (dst_space, dst_addr).
# When op_log is disabled, we also do the actual data move now;
# when op_log is enabled, Phase 2 replays the slot→dst copy from
# the op_log entry below so we don't pollute the slot in Phase 1.
if cmd.recv_mode == "copy_to_dst" and self.ctx is not None:
req.result_data["src_space"] = cmd.dst_space
req.result_data["src_addr"] = cmd.dst_addr
store = getattr(self.ctx, "memory_store", None)
if store is not None and self._op_logger is None:
try:
data = store.read(self._buffer_kind, slot_addr, shape=cmd.shape, dtype=cmd.dtype)
store.write(cmd.dst_space, cmd.dst_addr, data)
except Exception:
pass
if self._op_logger is not None:
# Record slot → dst copy for Phase 2 replay (ADR-0023 D9.5).
try:
self._op_logger.record_copy(
t_start=float(env.now), t_end=float(env.now),
component_id=self.node.id,
src_space=self._buffer_kind, src_addr=slot_addr,
dst_space=cmd.dst_space, dst_addr=cmd.dst_addr,
shape=cmd.shape, dtype=cmd.dtype,
nbytes=self._nbytes_for(cmd.shape, cmd.dtype),
)
except Exception:
pass
qp["my_tail"] += 1
# Diagnostics trace (D14)
from kernbench.ccl import diagnostics
if diagnostics.trace_enabled():
diagnostics.log_recv(
t_ns=float(env.now), receiver=self._pe_prefix,
direction=direction,
nbytes=req.result_data.get("nbytes", 0),
)
# Fast path credit return — bottleneck BW based latency
env.process(
self._delayed_credit_send(env, direction, qp["peer_credit_store"], qp["my_tail"])
)
if not req.done.triggered:
req.done.succeed()
def _wait_any_direction(self, env: simpy.Environment) -> Generator:
"""Round-robin scan over installed directions; wait until at least one
has data. Returns the chosen direction (str)."""
if not self._rr_dirs:
raise IpcqInvalidDirection(
f"PE {self._pe_prefix}: no neighbors installed"
)
while True:
n = len(self._rr_dirs)
for i in range(n):
idx = (self._rr_cursor + i) % n
d = self._rr_dirs[idx]
qp = self._queue_pairs[d]
if qp["peer_head_cache"] > qp["my_tail"]:
self._rr_cursor = (idx + 1) % n
return d
# Nothing available — wait until any arrival
wait_event = env.event()
self._any_recv_waiters.append(wait_event)
yield wait_event
# ── Metadata arrival from PE_DMA (D9) ──
def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
token = msg.token
sender_key = (token.src_sip, token.src_cube, token.src_pe)
for d, qp in self._queue_pairs.items():
p = qp["peer"]
if (p.sip, p.cube, p.pe) == sender_key:
qp["peer_head_cache"] = max(qp["peer_head_cache"], token.sender_seq + 1)
# Track arrived token for strict-mode peek
self._arrived_tokens.setdefault(d, []).append(token)
# Wake any blocked recv on this direction
waiters = self._recv_waiters.get(d, [])
self._recv_waiters[d] = []
for ev in waiters:
if not ev.triggered:
ev.succeed()
# Wake any-direction waiters
any_waiters = self._any_recv_waiters
self._any_recv_waiters = []
for ev in any_waiters:
if not ev.triggered:
ev.succeed()
return
# Unknown sender — silently drop (could log)
# ── Credit return (fast path) ──
def _credit_worker(self, env: simpy.Environment) -> Generator:
"""Process IpcqCreditMetadata from credit_inbox."""
assert self._credit_inbox is not None
while True:
credit: IpcqCreditMetadata = yield self._credit_inbox.get()
sender_key = (credit.src_sip, credit.src_cube, credit.src_pe)
for d, qp in self._queue_pairs.items():
p = qp["peer"]
if (p.sip, p.cube, p.pe) == sender_key:
qp["peer_tail_cache"] = max(qp["peer_tail_cache"], credit.consumer_seq)
# Wake any blocked send on this direction
waiters = self._send_waiters.get(d, [])
self._send_waiters[d] = []
for ev in waiters:
if not ev.triggered:
ev.succeed()
break
def _delayed_credit_send(
self,
env: simpy.Environment,
direction: str,
peer_credit_store: simpy.Store,
new_tail: int,
) -> Generator:
"""Wait bottleneck-BW latency, then put IpcqCreditMetadata into peer
credit store (D9 fast path)."""
latency_ns = self._credit_latency_ns(direction)
if latency_ns > 0:
yield env.timeout(latency_ns)
meta = IpcqCreditMetadata(
consumer_seq=new_tail,
src_sip=self._self_sip,
src_cube=self._self_cube,
src_pe=self._self_pe,
src_direction=direction,
)
yield peer_credit_store.put(meta)
def _credit_latency_ns(self, direction: str) -> float:
"""Compute credit fast path latency = credit_size / bottleneck_bw.
Falls back to 0 when ctx/router is unavailable (unit-test mode).
"""
if self.ctx is None:
return 0.0
qp = self._queue_pairs[direction]
peer = qp["peer"]
peer_pe_prefix = f"sip{peer.sip}.cube{peer.cube}.pe{peer.pe}"
try:
path = self.ctx.router.find_path(self._pe_prefix, peer_pe_prefix)
return self.ctx.compute_drain_ns(path, self._credit_size_bytes)
except Exception:
return 0.0
# ── Helpers ──
@staticmethod
def _nbytes_for(shape: tuple[int, ...], dtype: str) -> int:
from math import prod
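# Unsigned types are listed because RuntimeContext._numpy_to_dtype_str can
# emit "u8"/"u16"/"u32"; without them "u8" and "u32" would silently fall
# back to the 16-bit default.
bits = {"f16": 16, "bf16": 16, "f32": 32, "i8": 8, "i16": 16, "i32": 32, "u8": 8, "u16": 16, "u32": 32}.get(dtype, 16)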
return prod(shape) * (bits // 8) if shape else 0
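Reviewer sketch of the sender-side pointer arithmetic in `_handle_send` and `_credit_worker`: a send is allowed only while `my_head - peer_tail_cache < n_slots`, the slot index wraps modulo `n_slots`, and a credit advances `peer_tail_cache` monotonically. `RingSender` is a hypothetical toy with no SimPy; where the real component blocks on an event, this one returns `None`.

```python
class RingSender:
    """Toy model of one direction's sender state (names follow the diff)."""

    def __init__(self, n_slots: int, rx_base_pa: int, slot_size: int):
        self.n_slots = n_slots
        self.rx_base_pa = rx_base_pa
        self.slot_size = slot_size
        self.my_head = 0          # next sequence number to send
        self.peer_tail_cache = 0  # last consumer_seq seen via credit

    def can_send(self) -> bool:
        # The ring is full when n_slots entries are unacknowledged.
        return (self.my_head - self.peer_tail_cache) < self.n_slots

    def send(self):
        """Return the peer slot address, or None if backpressured."""
        if not self.can_send():
            return None
        slot_idx = self.my_head % self.n_slots
        dst_pa = self.rx_base_pa + slot_idx * self.slot_size
        self.my_head += 1
        return dst_pa

    def on_credit(self, consumer_seq: int) -> None:
        # max() keeps the cache monotonic if credits arrive out of order.
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)


tx = RingSender(n_slots=2, rx_base_pa=0x1000, slot_size=0x100)
first_three = [tx.send(), tx.send(), tx.send()]  # third send hits a full ring
tx.on_credit(1)                                  # receiver consumed one slot
after_credit = tx.send()                         # wraps back to slot 0
tx.on_credit(0)                                  # stale credit is ignored
```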
+2 -4
@@ -29,11 +29,10 @@ def run_bench(
  correlation_id: str = "bench0",
  completion_policy: CompletionPolicy = CompletionPolicy.LAST_SUBMITTED,
  ) -> BenchResult:
- """
- Minimal bench runner.
+ """Minimal bench runner.
  - topology: compiled topology object (opaque to runtime here)
- - bench_fn: callable that receives RuntimeContext and submits requests
+ - bench_fn: callable ``run(torch)`` receiving a RuntimeContext
  - device: DeviceSelector ("all" or "sip:<N>")
  - engine_factory: builds sim_engine for given topology & device
  - completion_policy: how to determine overall completion/result
@@ -48,7 +47,6 @@ def run_bench(
  )
  bench_fn(ctx)
  ctx.wait_all()
  collected_traces = ctx._traces or None
+125 -7
@@ -9,6 +9,39 @@ from kernbench.common.types import Completion, RequestHandle, SimEngine
  from .types import DeviceSelector
+ def _world_size_from_spec(spec: dict | None) -> int:
+ """Derive world_size from topology spec: sips × cubes × pes_per_cube."""
+ spec = spec or {}
+ sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
+ cm = spec.get("sip", {}).get("cube_mesh", {})
+ cubes_per_sip = int(cm.get("w", 1)) * int(cm.get("h", 1))
+ pl = spec.get("cube", {}).get("pe_layout", {})
+ corners = pl.get("corners", [])
+ pe_per_corner = int(pl.get("pe_per_corner", 1))
+ pes_per_cube = pe_per_corner * max(len(corners), 1)
+ return sips * cubes_per_sip * pes_per_cube
+ def _numpy_to_dtype_str(np_dtype) -> str:
+ """Map numpy dtype → kernbench dtype string used by Tensor."""
+ import numpy as np
+ kind_map = {
+ np.float16: "f16",
+ np.float32: "f32",
+ np.int8: "i8",
+ np.int16: "i16",
+ np.int32: "i32",
+ np.uint8: "u8",
+ np.uint16: "u16",
+ np.uint32: "u32",
+ }
+ for np_type, s in kind_map.items():
+ if np.dtype(np_dtype) == np.dtype(np_type):
+ return s
+ raise ValueError(f"unsupported numpy dtype: {np_dtype!r}")
  @dataclass
  class RuntimeContext:
  engine: SimEngine
@@ -23,6 +56,66 @@ class RuntimeContext:
  _tensor_counter: int = field(default=0, init=False)
  _traces: list[dict] = field(default_factory=list, init=False)
  _tensors: list[Any] = field(default_factory=list, init=False)
+ distributed: Any = field(default=None, init=False) # DistributedContext for CCL benches
+ _ipcq_plan: dict = field(default_factory=dict, init=False) # ADR-0023 install plan
+ def __post_init__(self) -> None:
+ # Eagerly attach a DistributedContext so bench code can do
+ # ``dist = torch.distributed`` + ``dist.init_process_group(...)``
+ # without needing a separate launcher to install it.
+ from kernbench.runtime_api.distributed import DistributedContext
+ dc = DistributedContext()
+ dc._ctx_ref = self # back-reference for AhbmCCLBackend to reach ctx.launch etc.
+ self.distributed = dc
+ def install_ipcq(
+ self,
+ algorithm: str | None = None,
+ ccl_yaml: str | None = None,
+ world_size_override: int | None = None,
+ rank_to_pe: list[tuple[int, int, int]] | None = None,
+ ) -> dict:
+ """Install IPCQ neighbor tables on all participating PEs (ADR-0023 D10).
+ Loads ``ccl.yaml`` (or the path provided), resolves the chosen
+ algorithm (or ``defaults.algorithm`` if None), and pushes per-PE
+ IpcqInitMsg into every PE_IPCQ component via the engine.
+ Args:
+ algorithm: name of the algorithm in ccl.yaml (or use defaults).
+ ccl_yaml: optional path to ccl.yaml.
+ world_size_override: if set, replace the algorithm's world_size;
+ if unset and ccl.yaml carries no world_size, it is derived
+ from the topology spec.
+ rank_to_pe: optional explicit rank → (sip, cube, pe) mapping,
+ forwarded unchanged to the installer.
+ Returns the install plan dict (rank → (sip,cube,pe), neighbor table).
+ """
+ import importlib
+ from kernbench.ccl.install import (
+ install_ipcq as _install,
+ load_ccl_config,
+ resolve_algorithm_config,
+ )
+ cfg = load_ccl_config(ccl_yaml)
+ merged = resolve_algorithm_config(cfg, algorithm)
+ if world_size_override is not None:
+ merged["world_size"] = world_size_override
+ elif "world_size" not in merged:
+ # Derive from topology.yaml when neither the algorithm entry
+ # nor ``defaults`` carries ``world_size`` (matches PyTorch DDP
+ # where env vars determine ranks, not the ccl config file).
+ merged["world_size"] = _world_size_from_spec(self.spec)
+ algo_module = None
+ try:
+ algo_module = importlib.import_module(merged["module"])
+ except ModuleNotFoundError:
+ pass
+ plan = _install(
+ self.engine, self.spec, merged,
+ algo_module=algo_module, rank_to_pe=rank_to_pe,
+ )
+ self._ipcq_plan = plan
+ self._ipcq_config = merged
+ return plan
  def __enter__(self):
  return self
@@ -258,6 +351,24 @@ class RuntimeContext:
  """Allocate a tensor in HBM without initialization (like torch.empty)."""
  return self._create_tensor(shape, dtype, name, pattern=None, dp=dp)
+ def from_numpy(self, arr: Any):
+ """Create a host-side tensor wrapping a numpy array.
+ Mirrors ``torch.from_numpy``. The returned tensor is NOT deployed
+ to any PE — it lives in an in-memory host staging buffer. Use
+ ``target.copy_(host_tensor)`` to scatter its contents into a
+ sharded, deployed tensor.
+ """
+ import numpy as np
+ from kernbench.runtime_api.tensor import Tensor
+ arr_c = np.ascontiguousarray(arr)
+ dtype_str = _numpy_to_dtype_str(arr_c.dtype)
+ t = Tensor(shape=tuple(arr_c.shape), dtype=dtype_str, name="host")
+ t._host_buffer = arr_c
+ t._memory_store = getattr(self.engine, "_memory_store", None)
+ return t
  def _create_tensor(
  self,
  shape: tuple[int, ...],
@@ -418,13 +529,12 @@ class RuntimeContext:
  TensorArgShard,
  )
  from kernbench.runtime_api.tensor import Tensor
- from kernbench.triton_emu.registry import register_kernel
+ from kernbench.triton_emu.registry import _kernels, register_kernel
- # Register kernel (idempotent)
- try:
- register_kernel(kernel_name, kernel_fn)
- except ValueError:
- pass
+ # Register kernel (idempotent overwrite — last call wins).
+ # Tests can re-register the same kernel_name with a different
+ # function; the user's most recent launch must use the latest fn.
+ _kernels[kernel_name] = kernel_fn
  # Collect tensors and scalars
  tensor_args: list[Tensor] = []
@@ -506,6 +616,7 @@ class RuntimeContext:
  # Per-SIP kernel launch: each SIP gets TensorArgs with local va_base
  last_handle = None
+ _pending_handles: list[tuple[Any, int]] = []
  for sip_id in sorted(sip_set):
  sip_kernel_args: list = []
  sip_cube_set: set[int] = set()
@@ -566,10 +677,17 @@ class RuntimeContext:
  target_cubes=target_cubes,
  target_pe=target_pe,
  ))
+ # Defer wait until all SIPs are submitted (multi-SIP CCL needs
+ # all participating PEs to be live concurrently — waiting
+ # per-SIP would deadlock when ranks span SIP boundaries).
+ _pending_handles.append((h, sip_id))
+ last_handle = h
+ # Drain pending handles now that every SIP has a launch posted.
+ for h, sip_id in _pending_handles:
  self.wait(h, _meta={
  "phase": "kernel", "name": kernel_name,
  "sip": sip_id, "target_pe": target_pe,
  })
- last_handle = h
  return last_handle
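Reviewer sketch of the submit-all-then-wait ordering this hunk introduces. It uses threads and a `threading.Barrier` as a stand-in for ranks that need each other to make progress; the function name and the barrier are assumptions of the sketch, not kernbench API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def launch_all_then_wait(n_ranks: int, timeout_s: float = 5.0) -> list:
    """Submit every rank first, then drain the handles in order."""
    barrier = threading.Barrier(n_ranks)

    def rank_body(rank: int) -> int:
        # Stands in for a cross-SIP send/recv: no rank gets past this
        # point until all ranks are running.
        barrier.wait(timeout=timeout_s)
        return rank

    with ThreadPoolExecutor(max_workers=n_ranks) as pool:
        # Phase 1: post every launch without waiting on any of them.
        pending = [pool.submit(rank_body, r) for r in range(n_ranks)]
        # Phase 2: drain. Every rank is already live, so each completes.
        return [f.result(timeout=timeout_s) for f in pending]


results = launch_all_then_wait(2)
```

If the loop instead called `f.result()` right after each `submit`, rank 0 would sit at the barrier waiting for a rank 1 that has not been submitted yet. That is the same shape as the per-SIP wait deadlock described in the comment above.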
+179
@@ -0,0 +1,179 @@
"""PyTorch-compatible distributed communication shim (ADR-0023 D11).
Provides a ``torch.distributed``-like API whose public surface matches
real PyTorch so that bench code looks identical to a DDP training script.
Only the ``ahbm`` backend is implemented. It:
1. Reads ``ccl.yaml`` to decide which collective algorithm to run.
2. Derives world_size from the algorithm entry, the defaults section, or
from the topology spec (``system.sips.count × sip.cube_mesh × pe_layout``).
3. At ``init_process_group`` time, eagerly installs the IPCQ neighbor
table once (one-time comm setup — mirrors NCCL communicator creation).
4. On each ``all_reduce(tensor)`` call, reads per-shard metadata from the
tensor handle and dispatches ``torch.launch`` with the registered
kernel. The kernel performs inter-PE ring/tree/mesh CCL via IPCQ,
and Phase 2 DataExecutor replays math + copies from op_log so
MemoryStore is correct when ``all_reduce`` returns.
Host bench code uses only real-PyTorch names:
dist.init_process_group, dist.is_initialized, dist.get_world_size,
dist.get_rank, dist.get_backend, dist.all_reduce, dist.barrier
"""
from __future__ import annotations
import importlib
from typing import Any
class AhbmCCLBackend:
"""Ahbm CCL backend — drives kernel-level collectives via IPCQ."""
def __init__(self, torch_ctx: Any) -> None:
from kernbench.ccl.install import (
load_ccl_config,
resolve_algorithm_config,
)
self.ctx = torch_ctx
self._cfg_all = load_ccl_config()
self._merged = resolve_algorithm_config(self._cfg_all)
self._algo_module = importlib.import_module(self._merged["module"])
self._world_size = self._resolve_world_size()
# Eager IPCQ install — ``init_process_group`` time. Mirrors NCCL
# communicator creation: done once, reused across every subsequent
# collective call on the same process group.
self.ctx.install_ipcq(
algorithm=self._merged["algorithm"],
world_size_override=self._world_size,
)
def _resolve_world_size(self) -> int:
"""Derive world_size (priority: algorithm override > defaults > topology).
Topology derivation:
sips × cubes_per_sip × pes_per_cube
"""
if "world_size" in self._merged:
return int(self._merged["world_size"])
defaults = self._cfg_all.get("defaults", {})
if "world_size" in defaults:
return int(defaults["world_size"])
spec = self.ctx.spec or {}
sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
cm = spec.get("sip", {}).get("cube_mesh", {})
cubes_per_sip = int(cm.get("w", 1)) * int(cm.get("h", 1))
pl = spec.get("cube", {}).get("pe_layout", {})
corners = pl.get("corners", [])
pe_per_corner = int(pl.get("pe_per_corner", 1))
pes_per_cube = pe_per_corner * max(len(corners), 1)
return sips * cubes_per_sip * pes_per_cube
@property
def world_size(self) -> int:
return self._world_size
def all_reduce(self, tensor: Any, op: str = "sum") -> None:
"""Dispatch the configured CCL algorithm as a single kernel launch.
Raises if ``op != "sum"`` (current kernels only implement add
reduction) or if the tensor's shard count disagrees with the
world_size that was installed into PE_IPCQ.
"""
if op != "sum":
raise NotImplementedError(f"all_reduce op={op!r} not supported")
if tensor._handle is None:
raise RuntimeError(
f"Tensor '{tensor.name}' is not deployed (call torch.zeros "
"with a DPPolicy first)"
)
shards = tensor._handle.shards
if len(shards) != self._world_size:
raise RuntimeError(
f"all_reduce tensor has {len(shards)} shards but the "
f"ahbm backend was installed with world_size="
f"{self._world_size}; adjust the tensor's DPPolicy or "
"restart the process group"
)
n_elem = shards[0].nbytes // tensor.itemsize
kernel_fn = self._algo_module.kernel
kernel_args = self._algo_module.kernel_args(self._world_size, n_elem)
self.ctx.launch(
self._merged["algorithm"], kernel_fn, tensor, *kernel_args,
)
def barrier(self) -> None:
# Single-driver model → no cross-process sync needed. Keeping the
# method so ``dist.barrier()`` is callable (pytorch-compat surface).
return None
class DistributedContext:
"""torch.distributed-compat facade.
Public surface matches real PyTorch so bench code reads identically
to a DDP training script. Single-driver semantics: ``get_rank()``
always returns 0 because kernbench runs as one Python process;
``get_world_size()`` returns the CCL group size (number of PEs
participating in the collective).
"""
def __init__(self) -> None:
self._backend: AhbmCCLBackend | None = None
def init_process_group(
self,
backend: str = "ahbm",
world_size: int | None = None,
rank: int | None = None,
**kwargs: Any,
) -> None:
"""Create the default process group.
``world_size`` and ``rank`` are accepted for API parity with
``torch.distributed.init_process_group`` but ignored — the ahbm
backend derives both from ``ccl.yaml`` + topology automatically
(like reading ``RANK``/``WORLD_SIZE`` env vars in real DDP).
"""
if backend != "ahbm":
raise ValueError(
f"Unsupported backend '{backend}'. Only 'ahbm' is supported."
)
ctx = getattr(self, "_ctx_ref", None)
if ctx is None:
raise RuntimeError(
"DistributedContext not bound to a RuntimeContext"
)
self._backend = AhbmCCLBackend(torch_ctx=ctx)
def is_initialized(self) -> bool:
return self._backend is not None
def get_world_size(self) -> int:
self._ensure_initialized()
return self._backend.world_size
def get_rank(self) -> int:
# Single-driver kernbench: there is only one host rank.
self._ensure_initialized()
return 0
def get_backend(self) -> str:
self._ensure_initialized()
return "ahbm"
def all_reduce(self, tensor: Any, op: str = "sum") -> None:
self._ensure_initialized()
self._backend.all_reduce(tensor, op=op)
def barrier(self) -> None:
self._ensure_initialized()
self._backend.barrier()
def _ensure_initialized(self) -> None:
if self._backend is None:
raise RuntimeError(
"Default process group has not been initialized. "
"Call init_process_group(backend='ahbm') first."
)
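The shard-count check and the per-PE element count in `all_reduce` can be sketched in isolation. This is a minimal sketch, not the backend's code: `Shard` and `elements_per_shard` are hypothetical names, and it assumes every shard has the same size as the first one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """Hypothetical stand-in for a tensor shard; only nbytes is used here."""
    nbytes: int


def elements_per_shard(shards: list[Shard], world_size: int, itemsize: int) -> int:
    """Validate the shard count, then derive the per-PE element count."""
    if len(shards) != world_size:
        raise RuntimeError(
            f"tensor has {len(shards)} shards but world_size={world_size}"
        )
    # Assumption: all shards are the same size, so shards[0] is representative.
    return shards[0].nbytes // itemsize


# Four shards of 1024 bytes each, 4-byte elements.
shards = [Shard(nbytes=1024) for _ in range(4)]
n_elem = elements_per_shard(shards, world_size=4, itemsize=4)
```

Deriving `n_elem` from shard metadata keeps the host call signature identical to `dist.all_reduce(tensor)`; the kernel arguments never appear in bench code.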
+27
@@ -152,3 +152,30 @@ class MmuUnmapMsg:
target_cubes: tuple[int, ...] | Literal["all"] = "all" target_cubes: tuple[int, ...] | Literal["all"] = "all"
target_pe: int | Literal["all"] = "all" target_pe: int | Literal["all"] = "all"
msg_type: Literal["mmu_unmap"] = "mmu_unmap" msg_type: Literal["mmu_unmap"] = "mmu_unmap"
@dataclass(frozen=True)
class IpcqInitMsg:
"""IPCQ neighbor table install (sideband fan-out, ADR-0023 D10/D12).
Backend issues this at ``init_process_group`` time to install per-PE
IPCQ neighbor tables. Each entry covers one direction (N/S/E/W) and
carries the peer's IpcqEndpoint plus this PE's own rx_buffer base
and a pre-wired SimPy Store for credit return fast path (D9).
Routing is similar to MmuMapMsg.
"""
correlation_id: str
request_id: str
target_sips: tuple[int, ...] | Literal["all"] = "all"
target_cubes: tuple[int, ...] | Literal["all"] = "all"
target_pe: int | tuple[int, ...] | Literal["all"] = "all"
# entries: tuple[IpcqInitEntry, ...] — kept as tuple of plain objects to
# avoid a runtime import cycle (IpcqInitEntry lives in
# kernbench.common.ipcq_types).
entries: tuple = ()
backpressure_mode: str = "sleep" # "poll" | "sleep"
buffer_kind: str = "tcm" # "tcm" | "hbm" | "sram"
credit_size_bytes: int = 16
msg_type: Literal["ipcq_init"] = "ipcq_init"
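The message shape above (frozen dataclass, fan-out defaults of `"all"`, a `Literal` discriminator) can be exercised with a reduced stand-in. `InitMsgSketch` is a hypothetical name with only a subset of the fields, and the string entries are placeholders rather than real `IpcqInitEntry` objects.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass
from typing import Literal


@dataclass(frozen=True)
class InitMsgSketch:
    """Reduced, hypothetical stand-in for an IPCQ init message."""
    correlation_id: str
    request_id: str
    target_pe: int | tuple[int, ...] | Literal["all"] = "all"
    entries: tuple = ()  # placeholder payload, one item per direction
    backpressure_mode: str = "sleep"
    msg_type: Literal["ipcq_init"] = "ipcq_init"


msg = InitMsgSketch(
    correlation_id="c0",
    request_id="r0",
    entries=("north-entry", "east-entry"),
)

# frozen=True rejects attribute assignment after construction.
try:
    msg.backpressure_mode = "poll"  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Keeping `entries` as an immutable tuple matches the frozen dataclass: the message can be fanned out to many PEs without any receiver being able to change what the others see.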
+82 -7
@@ -146,6 +146,11 @@ class Tensor:
self._handle: TensorHandle | None = None self._handle: TensorHandle | None = None
self._ctx_ref: weakref.ref | None = None # set by RuntimeContext self._ctx_ref: weakref.ref | None = None # set by RuntimeContext
self._memory_store = None # set by RuntimeContext when enable_data=True self._memory_store = None # set by RuntimeContext when enable_data=True
# Host-side staging buffer for torch.from_numpy() results. A tensor
# with a non-None _host_buffer is NOT deployed to any PE — it lives
# only on the host. Use `target.copy_(host_tensor)` to scatter the
# data into a deployed, sharded target tensor.
self._host_buffer: np.ndarray | None = None
def __del__(self) -> None: def __del__(self) -> None:
if self._ctx_ref is None or self._handle is None: if self._ctx_ref is None or self._handle is None:
@@ -166,15 +171,85 @@ class Tensor:
@property @property
def data(self) -> np.ndarray: def data(self) -> np.ndarray:
"""Tensor data as numpy array. Returns actual values when enable_data=True, """Tensor data as numpy array.
zeros placeholder otherwise (like an uninitialized tensor)."""
if self._memory_store is not None and self._handle is not None: Gathers all shards into a single full-shape array. Returns actual
shard = self._handle.shards[0] values when enable_data=True, zeros placeholder otherwise (like an
uninitialized tensor). Alias of ``numpy()``.
"""
return self.numpy()
def _shard_store_addr(self, shard: TensorShard) -> int:
"""MemoryStore key for a shard.
Kernels read tensors via VA (translated to PA by PE_DMA's MMU when
a mapping exists, otherwise the addr is treated as a PA-equivalent
key). Tensor I/O therefore writes/reads at ``va_base + offset_bytes``
when ``va_base`` is set, falling back to ``shard.pa`` for the
VA-less mode used by some legacy paths.
"""
if self._handle and self._handle.va_base:
return self._handle.va_base + shard.offset_bytes
return shard.pa
def numpy(self) -> np.ndarray:
"""Return a single numpy array gathered from all shards.
Mirrors ``torch.Tensor.numpy()``. In kernbench, sharded tensors are
gathered into a single full-shape ndarray according to each shard's
``offset_bytes`` / ``nbytes`` range.
"""
np_dtype = _numpy_dtype(self.dtype)
# Host-side tensor (created via torch.from_numpy) has no shards.
if self._host_buffer is not None:
return self._host_buffer.copy()
if self._handle is None or self._memory_store is None:
return np.zeros(self.shape, dtype=np_dtype)
flat = np.zeros(math.prod(self.shape), dtype=np_dtype)
for shard in self._handle.shards:
start = shard.offset_bytes // self.itemsize
count = shard.nbytes // self.itemsize
try: try:
return self._memory_store.read("hbm", shard.pa, shape=self.shape, dtype=self.dtype) piece = self._memory_store.read(
"hbm", self._shard_store_addr(shard),
)
except KeyError: except KeyError:
pass continue
return np.zeros(self.shape, dtype=_numpy_dtype(self.dtype)) flat[start : start + count] = (
np.asarray(piece, dtype=np_dtype).reshape(-1)[:count]
)
return flat.reshape(self.shape)
def copy_(self, source: "Tensor") -> "Tensor":
"""In-place copy from another tensor into self.
Mirrors ``torch.Tensor.copy_()``. If ``source`` is a host tensor
(from ``torch.from_numpy``), its ndarray is split across self's
shards using each shard's byte range. If ``source`` is a deployed
(sharded) tensor, its contents are gathered first and then
re-scattered into self's shard layout.
Shapes must match. Returns self.
"""
if self._handle is None or self._memory_store is None:
raise RuntimeError(
f"Tensor '{self.name}' must be deployed before copy_()"
)
if source.shape != self.shape:
raise ValueError(
f"copy_ shape mismatch: self={self.shape} source={source.shape}"
)
np_dtype = _numpy_dtype(self.dtype)
arr = source.numpy().astype(np_dtype, copy=False)
flat = np.ascontiguousarray(arr).reshape(-1)
for shard in self._handle.shards:
start = shard.offset_bytes // self.itemsize
count = shard.nbytes // self.itemsize
piece = flat[start : start + count].copy()
self._memory_store.write(
"hbm", self._shard_store_addr(shard), piece,
)
return self
@property @property
def itemsize(self) -> int: def itemsize(self) -> int:
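The scatter in `copy_()` and the gather in `numpy()` both slice a flat buffer by each shard's byte range. The sketch below illustrates that round trip with plain lists and a dict in place of NumPy arrays and MemoryStore; `ShardSketch`, `scatter`, and `gather` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardSketch:
    """Hypothetical shard descriptor: store key plus byte range in the tensor."""
    addr: int
    offset_bytes: int
    nbytes: int


def scatter(store: dict, shards: list, flat: list, itemsize: int) -> None:
    """Write each shard's slice of the flat buffer under the shard's key."""
    for s in shards:
        start = s.offset_bytes // itemsize
        count = s.nbytes // itemsize
        store[s.addr] = list(flat[start:start + count])


def gather(store: dict, shards: list, n_total: int, itemsize: int) -> list:
    """Rebuild the flat buffer; shards missing from the store stay zero."""
    flat = [0.0] * n_total
    for s in shards:
        start = s.offset_bytes // itemsize
        count = s.nbytes // itemsize
        piece = store.get(s.addr)
        if piece is None:
            continue
        flat[start:start + count] = piece[:count]
    return flat


shards = [ShardSketch(0x100, 0, 16), ShardSketch(0x200, 16, 16)]
store: dict = {}
source = [float(i) for i in range(8)]
scatter(store, shards, source, itemsize=4)
round_trip = gather(store, shards, n_total=8, itemsize=4)
```

A shard that was never written simply leaves zeros in its range, which mirrors the `except KeyError: continue` branch in the gather loop.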
+75 -6
@@ -51,7 +51,42 @@ class DataExecutor:
self._execute_math(op) self._execute_math(op)
def _execute_memory(self, op: OpRecord) -> None: def _execute_memory(self, op: OpRecord) -> None:
"""Memory ops are already handled by Phase 1 MemoryStore. Skip.""" """Replay memory copy ops in Phase 2 (ADR-0020 + ADR-0023).
- dma_read: no-op (handle already references HBM source).
- dma_write: copy (src_space, src_addr) → (dst_space, dst_addr).
Required because Phase 2 may have just produced new data at the
source addr (e.g. PE_MATH scratch output).
- ipcq_copy: copy across PEs — sender's source → receiver's slot.
Required because the source may be a Phase 2 math output, and
a downstream math op on the receiver reads from the slot.
Legacy entries without src/dst metadata are silently skipped.
"""
p = op.params
if op.op_name in ("dma_write", "ipcq_copy"):
src_space = p.get("src_space")
src_addr = p.get("src_addr")
dst_space = p.get("dst_space")
dst_addr = p.get("dst_addr")
if (src_space is None or src_addr is None
or dst_space is None or dst_addr is None):
return
# Prefer the Phase-1-time snapshot (captured at record_end /
# outbound) so we don't read from a source that has since been
# mutated by another op. Fall back to MemoryStore for sources
# that had no Phase 1 data (e.g. math scratch outputs that
# only get populated by Phase 2's math replay).
data = p.get("snapshot")
if data is None:
try:
data = self.store.read(
src_space, src_addr,
shape=p.get("shape"), dtype=p.get("dtype"),
)
except KeyError:
return
self.store.write(dst_space, dst_addr, data)
def _execute_gemm(self, op: OpRecord) -> None: def _execute_gemm(self, op: OpRecord) -> None:
"""Execute GEMM: out = a @ b.""" """Execute GEMM: out = a @ b."""
@@ -77,18 +112,35 @@ class DataExecutor:
"""Execute math op: unary, binary, or reduction.""" """Execute math op: unary, binary, or reduction."""
p = op.params p = op.params
math_op = p.get("op", op.op_name) math_op = p.get("op", op.op_name)
space = p.get("addr_space", "tcm")
dtype = p.get("dtype", "f32") dtype = p.get("dtype", "f32")
input_addrs = p.get("input_addrs", []) input_addrs = p.get("input_addrs", [])
input_shapes = p.get("input_shapes", []) input_shapes = p.get("input_shapes", [])
# Per-input space/dtype (ADR-0023 CCL accumulation): math ops can
# mix inputs from different MemoryStore spaces (e.g. acc in "hbm",
# recv slot in "tcm"). Fall back to legacy single-space mode when
# the per-input lists are absent.
input_spaces = p.get("input_spaces") or [p.get("addr_space", "tcm")] * len(input_addrs)
input_dtypes = p.get("input_dtypes") or [dtype] * len(input_addrs)
# Per-input data snapshots (ADR-0020 D6): captured at op_log
# record time. Phase 1 has correct values for slot/HBM addrs at
# that moment, which lets Phase 2 sidestep the slot-wraparound
# races where a later round overwrites a slot before this op
# runs in t_start order.
snapshots = p.get("input_snapshots") or [None] * len(input_addrs)
dst_space = p.get("dst_space", p.get("addr_space", "tcm"))
inputs = [] inputs = []
for addr, shape in zip(input_addrs, input_shapes): for addr, shape, space, idtype, snap in zip(
inputs.append(self.store.read(space, addr, shape=shape, dtype=dtype)) input_addrs, input_shapes, input_spaces, input_dtypes, snapshots
):
if snap is not None:
inputs.append(snap)
else:
inputs.append(self.store.read(space, addr, shape=shape, dtype=idtype))
result = _compute_math(math_op, inputs, p.get("axis")) result = _compute_math(math_op, inputs, p.get("axis"))
if result is not None: if result is not None:
self.store.write(space, p["dst_addr"], result) self.store.write(dst_space, p["dst_addr"], result)
def verify(self, expected: dict[tuple[str, int], np.ndarray], def verify(self, expected: dict[tuple[str, int], np.ndarray],
rtol: float = 1e-3, atol: float = 1e-3) -> dict[str, bool]: rtol: float = 1e-3, atol: float = 1e-3) -> dict[str, bool]:
@@ -146,6 +198,14 @@ def _compute_math(op: str, inputs: list[np.ndarray], axis: int | None) -> np.nda
if op == "min": if op == "min":
return np.min(x, axis=axis, keepdims=True) return np.min(x, axis=axis, keepdims=True)
# Softmax (numerically stable)
if op == "softmax":
ax = axis if axis is not None else -1
x_max = np.max(x, axis=ax, keepdims=True)
e = np.exp(x - x_max)
s = np.sum(e, axis=ax, keepdims=True)
return e / s
# Binary # Binary
if len(inputs) >= 2: if len(inputs) >= 2:
y = inputs[1] y = inputs[1]
@@ -157,9 +217,18 @@ def _compute_math(op: str, inputs: list[np.ndarray], axis: int | None) -> np.nda
return x * y return x * y
if op == "div": if op == "div":
return x / y return x / y
if op == "maximum":
return np.maximum(x, y)
if op == "minimum":
return np.minimum(x, y)
# Ternary # Ternary
if op == "where" and len(inputs) >= 3: if len(inputs) >= 3:
if op == "where":
return np.where(inputs[0], inputs[1], inputs[2]) return np.where(inputs[0], inputs[1], inputs[2])
if op == "fma":
return inputs[0] * inputs[1] + inputs[2]
if op == "clamp":
return np.minimum(np.maximum(inputs[0], inputs[1]), inputs[2])
return None return None
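The three ops added to `_compute_math` are easy to check by hand on small inputs. The following is a sketch on 1-D Python lists, used here instead of NumPy arrays to keep it dependency-free; the function names are illustrative and are not the executor's API.

```python
import math


def softmax(xs: list) -> list:
    """Numerically stable softmax: subtract the max before exponentiating."""
    x_max = max(xs)
    exps = [math.exp(x - x_max) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fma(a: list, b: list, c: list) -> list:
    """Element-wise a * b + c."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def clamp(xs: list, lo: list, hi: list) -> list:
    """Element-wise min(max(x, lo), hi), the same order as the diff."""
    return [min(max(x, l), h) for x, l, h in zip(xs, lo, hi)]


# Without the max subtraction, exp(1000.0) would overflow a float.
probs = softmax([1000.0, 1000.0])
fused = fma([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])
bounded = clamp([-1.0, 0.5, 2.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

Applying the lower bound first and the upper bound second means that if a caller passes `lo > hi`, the upper bound wins.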
+54 -1
@@ -51,8 +51,12 @@ class GraphEngine:
if enable_data: if enable_data:
from kernbench.sim_engine.memory_store import MemoryStore from kernbench.sim_engine.memory_store import MemoryStore
from kernbench.sim_engine.op_log import OpLogger from kernbench.sim_engine.op_log import OpLogger
self._op_logger = OpLogger()
self._memory_store = MemoryStore() self._memory_store = MemoryStore()
self._op_logger = OpLogger(memory_store=self._memory_store)
# Cursor for incremental Phase 2 replay (ADR-0020 D6).
# SimPy env.now is monotonic so newly logged records always sort
# to the tail; the cursor remains valid across waits.
self._data_cursor = 0
ctx = ComponentContext( ctx = ComponentContext(
router=self._router, router=self._router,
@@ -147,11 +151,60 @@ class GraphEngine:
self._env.process(self._process(str(handle), request, event)) self._env.process(self._process(str(handle), request, event))
return handle return handle
def _flush_data_phase(self) -> None:
"""Replay newly recorded op_log entries through DataExecutor.
ADR-0020 D6 Phase 2: when data tracking is enabled, run DataExecutor
on records added since the last flush so that callers reading
MemoryStore between launches observe correct (compute-replayed)
tensor data.
Cursor-based incremental replay is necessary because Phase 2 is
NOT idempotent across full re-runs: a math op writes a TCM scratch
addr, a later dma_write copies that scratch into HBM[X], and an
even-later math op may then read HBM[X]. Re-running everything
from scratch would let the second pass's first math op read the
already-overwritten HBM[X] instead of the original input.
"""
if self._op_logger is None or self._memory_store is None:
return
records = self._op_logger.records # sorted by t_start (stable)
if self._data_cursor >= len(records):
return
new_records = records[self._data_cursor:]
from kernbench.sim_engine.data_executor import DataExecutor
DataExecutor(new_records, self._memory_store).run()
self._data_cursor = len(records)
def wait(self, handle: RequestHandle) -> None: def wait(self, handle: RequestHandle) -> None:
key = str(handle) key = str(handle)
event = self._events[key] event = self._events[key]
if not event.triggered: if not event.triggered:
try:
self._env.run(until=event) self._env.run(until=event)
except (simpy.core.EmptySchedule, RuntimeError) as exc:
# SimPy raises EmptySchedule directly OR (in newer simpy)
# wraps it as a RuntimeError("No scheduled events left ...").
# Either case while our event is still pending → IPCQ deadlock.
msg = str(exc)
is_deadlock = (
isinstance(exc, simpy.core.EmptySchedule)
or "No scheduled events left" in msg
)
if not is_deadlock:
raise
from kernbench.ccl.diagnostics import IpcqDeadlock, pointer_dump
dump = pointer_dump(self)
if dump.strip():
raise IpcqDeadlock(
"IPCQ deadlock: simulation schedule empty while "
f"request {handle!r} is still pending.\n"
f"Pointer state:\n{dump}"
) from None
raise
# ADR-0020: replay newly logged ops so the caller observes
# post-Phase-2 tensor state from MemoryStore.
self._flush_data_phase()
def get_completion(self, handle: RequestHandle) -> tuple[Completion, Trace | None]: def get_completion(self, handle: RequestHandle) -> tuple[Completion, Trace | None]:
return self._results[str(handle)] return self._results[str(handle)]
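The reason `_flush_data_phase` keeps a cursor instead of replaying the whole log can be shown with a toy log whose records read and write the same store. `ReplaySketch` is a hypothetical stand-in; each record here means `store[dst] = store[src] + delta`.

```python
class ReplaySketch:
    """Toy op log with a replay cursor (illustrative only)."""

    def __init__(self, store: dict) -> None:
        self.store = store
        self.records: list = []
        self.cursor = 0

    def log(self, dst: str, src: str, delta: int) -> None:
        self.records.append((dst, src, delta))

    def flush(self) -> None:
        """Replay only the records added since the previous flush."""
        for dst, src, delta in self.records[self.cursor:]:
            self.store[dst] = self.store[src] + delta
        self.cursor = len(self.records)


engine = ReplaySketch({"x": 1})
engine.log("x", "x", 1)    # in-place update: reads and overwrites x
engine.flush()
engine.log("y", "x", 10)   # a later op that consumes the updated x
engine.flush()
incremental = dict(engine.store)

# Resetting the cursor replays the in-place update a second time.
engine.cursor = 0
engine.flush()
full_rerun = dict(engine.store)
```

The incremental result is the intended one; the full rerun applies the in-place update to an already-updated value, which is the non-idempotence the docstring describes.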
+84 -1
@@ -29,9 +29,13 @@ class OpLogger:
Records are maintained in t_start stable ordering (insertion order). Records are maintained in t_start stable ordering (insertion order).
""" """
def __init__(self) -> None: def __init__(self, memory_store: Any | None = None) -> None:
self._records: list[OpRecord] = [] self._records: list[OpRecord] = []
self._pending: dict[int, dict[str, Any]] = {} # msg id → partial record self._pending: dict[int, dict[str, Any]] = {} # msg id → partial record
# Optional MemoryStore reference. When set, math op records capture
# input data snapshots at record_end time so Phase 2 replay does
# not depend on slot/scratch addrs surviving until math runs.
self._memory_store = memory_store
@property @property
def records(self) -> list[OpRecord]: def records(self) -> list[OpRecord]:
@@ -53,6 +57,38 @@ class OpLogger:
if pending is None: if pending is None:
return return
op_kind, op_name, params = _extract_op_info(msg) op_kind, op_name, params = _extract_op_info(msg)
# Snapshot data at record time so Phase 2 replay sidesteps
# downstream mutations of source addrs (e.g. a tl.store that
# overwrites HBM after a load handle was sent, or a slot that
# gets reused on the next ring round).
if self._memory_store is not None:
if op_kind == "math":
snaps: list[Any] = []
for addr, shape, space, idtype in zip(
params.get("input_addrs", []),
params.get("input_shapes", []),
params.get("input_spaces", []),
params.get("input_dtypes", []),
):
try:
arr = self._memory_store.read(
space, addr, shape=shape, dtype=idtype,
)
snaps.append(arr.copy() if hasattr(arr, "copy") else arr)
except Exception:
snaps.append(None)
params["input_snapshots"] = snaps
elif op_name == "dma_write":
try:
arr = self._memory_store.read(
params["src_space"], params["src_addr"],
shape=params.get("shape"), dtype=params.get("dtype"),
)
params["snapshot"] = (
arr.copy() if hasattr(arr, "copy") else arr
)
except Exception:
params["snapshot"] = None
self._records.append(OpRecord( self._records.append(OpRecord(
t_start=pending["t_start"], t_start=pending["t_start"],
t_end=t, t_end=t,
@@ -62,6 +98,45 @@ class OpLogger:
params=params, params=params,
)) ))
def record_copy(
self, t_start: float, t_end: float, component_id: str,
src_space: str, src_addr: int,
dst_space: str, dst_addr: int,
shape: tuple[int, ...], dtype: str, nbytes: int,
) -> None:
"""Record a memory copy op for Phase 2 replay (ADR-0023 + ADR-0020).
Used by PE_DMA at outbound (sender) time: the snapshot captures
the source data at the moment the send was issued, so Phase 2
replay does not see later mutations of the source addr (e.g. a
tl.store that runs after the recv at the sender).
For sources whose data is not yet materialized in Phase 1 (math
scratch outputs), the snapshot is None and Phase 2 falls back to
reading from MemoryStore — by which point the corresponding math
op has been replayed and the scratch addr is populated.
"""
snap = None
if self._memory_store is not None:
try:
arr = self._memory_store.read(
src_space, src_addr, shape=shape, dtype=dtype,
)
snap = arr.copy() if hasattr(arr, "copy") else arr
except Exception:
snap = None
self._records.append(OpRecord(
t_start=t_start, t_end=t_end,
component_id=component_id,
op_kind="memory", op_name="ipcq_copy",
params={
"src_space": src_space, "src_addr": src_addr,
"dst_space": dst_space, "dst_addr": dst_addr,
"shape": shape, "dtype": dtype, "nbytes": nbytes,
"snapshot": snap,
},
))
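The snapshot-or-fallback rule used by `record_copy` and the Phase 2 replay can be sketched with a dict as the store. `SnapshotLogSketch` is a hypothetical name, and lists stand in for ndarrays.

```python
class SnapshotLogSketch:
    """Copy log that snapshots the source at record time when it exists."""

    def __init__(self, store: dict) -> None:
        self.store = store
        self.records: list = []

    def record_copy(self, src: str, dst: str) -> None:
        data = self.store.get(src)
        # Source not materialized yet (e.g. a scratch output): no snapshot.
        snap = list(data) if data is not None else None
        self.records.append({"src": src, "dst": dst, "snapshot": snap})

    def replay(self) -> None:
        for rec in self.records:
            data = rec["snapshot"]
            if data is None:
                # Fall back to whatever the store holds at replay time.
                data = self.store.get(rec["src"])
                if data is None:
                    continue
            self.store[rec["dst"]] = list(data)


store = {"slot": [1, 2]}
log = SnapshotLogSketch(store)
log.record_copy("slot", "out")       # snapshot taken: [1, 2]
store["slot"] = [9, 9]               # slot reused by a later round
log.record_copy("scratch", "out2")   # source missing: snapshot is None
store["scratch"] = [5]               # produced later, before replay
log.replay()
```

The first copy is immune to the later slot overwrite, while the second picks up data that only exists by replay time. These are the two cases the docstring distinguishes.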
def _extract_op_info(msg: Any) -> tuple[str, str, dict[str, Any]]: def _extract_op_info(msg: Any) -> tuple[str, str, dict[str, Any]]:
"""Extract op_kind, op_name, params from a data_op message.""" """Extract op_kind, op_name, params from a data_op message."""
@@ -76,6 +151,11 @@ def _extract_op_info(msg: Any) -> tuple[str, str, dict[str, Any]]:
} }
if isinstance(msg, DmaWriteCmd): if isinstance(msg, DmaWriteCmd):
return "memory", "dma_write", { return "memory", "dma_write", {
"src_space": getattr(msg.handle, "space", "tcm"),
"src_addr": msg.handle.addr,
"shape": msg.handle.shape,
"dtype": msg.handle.dtype,
"dst_space": "hbm",
"dst_addr": msg.dst_addr, "dst_addr": msg.dst_addr,
"nbytes": msg.nbytes, "nbytes": msg.nbytes,
"handle_id": msg.handle.id, "handle_id": msg.handle.id,
@@ -96,7 +176,10 @@ def _extract_op_info(msg: Any) -> tuple[str, str, dict[str, Any]]:
return "math", msg.op, { return "math", msg.op, {
"input_addrs": [h.addr for h in msg.inputs], "input_addrs": [h.addr for h in msg.inputs],
"input_shapes": [h.shape for h in msg.inputs], "input_shapes": [h.shape for h in msg.inputs],
"input_spaces": [getattr(h, "space", "tcm") for h in msg.inputs],
"input_dtypes": [h.dtype for h in msg.inputs],
"dst_addr": msg.out.addr, "dst_addr": msg.out.addr,
"dst_space": getattr(msg.out, "space", "tcm"),
"shape_out": msg.out.shape, "shape_out": msg.out.shape,
"dtype": msg.out.dtype, "dtype": msg.out.dtype,
"axis": msg.axis, "axis": msg.axis,
+30 -2
@@ -25,6 +25,7 @@ _PE_COMP_OFFSETS = {
"pe_math": (0.0, 0.15), "pe_math": (0.0, 0.15),
"pe_mmu": (0.15, -0.15), "pe_mmu": (0.15, -0.15),
"pe_tcm": (0.3, 0.0), "pe_tcm": (0.3, 0.0),
"pe_ipcq": (-0.15, 0.15),
} }
@@ -698,6 +699,20 @@ def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
kind="pe_internal", kind="pe_internal",
)) ))
# PE_IPCQ edges (ADR-0023 D1, D9, D10)
ipcq_edges = [
("pe_cpu", "pe_ipcq", "cpu_to_ipcq_mm"), # IpcqRequest
("pe_ipcq", "pe_dma", "ipcq_to_dma_mm"), # IpcqDmaToken outbound
("pe_dma", "pe_ipcq", "dma_to_ipcq_mm"), # IpcqMetaArrival inbound
]
for src_c, dst_c, mm_key in ipcq_edges:
if mm_key in pe_links:
edges.append(Edge(
src=f"{pp}.{src_c}", dst=f"{pp}.{dst_c}",
distance_mm=pe_links[mm_key],
kind="pe_internal",
))
# ── Inter-cube / IO / system edges ────────────────────────────────── # ── Inter-cube / IO / system edges ──────────────────────────────────
@@ -765,7 +780,13 @@ def _add_io_to_cube_edges(
def _add_system_to_io_edges( def _add_system_to_io_edges(
edges: list[Edge], sp: str, sip_spec: dict, system: dict, edges: list[Edge], sp: str, sip_spec: dict, system: dict,
) -> None: ) -> None:
"""Add fabric switch IO chiplet PCIe edges.""" """Add bidirectional fabric switch IO chiplet PCIe edges.
Both directions are needed:
switch → pcie_ep for host→device traffic (memory writes, kernel launch)
pcie_ep → switch for device-side outbound traffic (cross-SIP IPCQ
send between PE_DMAs through the system switch).
"""
sw_id = "fabric.switch0" sw_id = "fabric.switch0"
sys_link = system["links"]["io_ep_to_switch"] sys_link = system["links"]["io_ep_to_switch"]
for inst in sip_spec["iochiplet"]["instances"]: for inst in sip_spec["iochiplet"]["instances"]:
@@ -776,6 +797,12 @@ def _add_system_to_io_edges(
bw_gbs=sys_link["bw_gbs_per_ep"], bw_gbs=sys_link["bw_gbs_per_ep"],
kind="pcie", kind="pcie",
)) ))
edges.append(Edge(
src=pcie_ep_id, dst=sw_id,
distance_mm=sys_link["distance_mm"],
bw_gbs=sys_link["bw_gbs_per_ep"],
kind="pcie",
))
# ── View builders ──────────────────────────────────────────────────── # ── View builders ────────────────────────────────────────────────────
@@ -1113,13 +1140,14 @@ def _build_pe_view(spec: dict) -> ViewGraph:
"pe_math": (7.0, 6.5), "pe_math": (7.0, 6.5),
"pe_mmu": (4.0, 1.5), "pe_mmu": (4.0, 1.5),
"pe_tcm": (10.0, 4.0), "pe_tcm": (10.0, 4.0),
"pe_ipcq": (4.0, 6.5),
} }
nodes: dict[str, Node] = {} nodes: dict[str, Node] = {}
view_edges: list[Edge] = [] view_edges: list[Edge] = []
for comp_name, comp_spec in pe_tmpl["components"].items(): for comp_name, comp_spec in pe_tmpl["components"].items():
px, py = positions[comp_name] px, py = positions.get(comp_name, (1.0, 1.0))
nodes[comp_name] = Node( nodes[comp_name] = Node(
id=comp_name, kind=comp_spec["kind"], impl=comp_spec["impl"], id=comp_name, kind=comp_spec["kind"], impl=comp_spec["impl"],
attrs=comp_spec["attrs"], pos_mm=(px, py), attrs=comp_spec["attrs"], pos_mm=(px, py),
+115 -9
@@ -15,6 +15,7 @@ from typing import TYPE_CHECKING, Any
import simpy import simpy
from greenlet import greenlet from greenlet import greenlet
from kernbench.common.ipcq_types import IpcqRecvCmd, IpcqRequest, IpcqSendCmd, RecvFuture
from kernbench.common.pe_commands import ( from kernbench.common.pe_commands import (
CompletionHandle, CompletionHandle,
CompositeCmd, CompositeCmd,
@@ -51,6 +52,9 @@ class KernelRunner:
out_ports: dict[str, simpy.Store], out_ports: dict[str, simpy.Store],
store: MemoryStore | None = None, store: MemoryStore | None = None,
num_cubes: int = 1, num_cubes: int = 1,
ipcq_id: str | None = None,
scratch_base: int = 0,
scratch_size: int = 1 << 20,
) -> None: ) -> None:
self._pe_prefix = pe_prefix self._pe_prefix = pe_prefix
self._pe_idx = pe_idx self._pe_idx = pe_idx
@@ -61,6 +65,13 @@ class KernelRunner:
self._out_ports = out_ports self._out_ports = out_ports
self._store = store self._store = store
self._parent: greenlet | None = None self._parent: greenlet | None = None
# IPCQ port id (ADR-0023). Defaults to "<pe_prefix>.pe_ipcq"; send/recv
# commands raise if that port is not wired into out_ports.
self._ipcq_id = ipcq_id or f"{pe_prefix}.pe_ipcq"
# PE-local scratch for compute output TensorHandles (ADR-0020 D3
# extension). The TLContext allocates from this pool when math/dot
# ops produce a result that may later be used as a send/store source.
self._scratch_base = scratch_base
self._scratch_size = scratch_size
def run( def run(
self, self,
@@ -89,7 +100,10 @@ class KernelRunner:
num_cubes=self._num_cubes, num_cubes=self._num_cubes,
dispatch_cycles=0, dispatch_cycles=0,
runner=self, runner=self,
scratch_base=self._scratch_base,
scratch_size=self._scratch_size,
) )
self._tl = tl # exposed so switch_to_simpy can re-set on restore
def _kernel_entry(): def _kernel_entry():
TLContext._set_active(tl) # type: ignore[attr-defined] TLContext._set_active(tl) # type: ignore[attr-defined]
@@ -103,13 +117,20 @@ class KernelRunner:
pending: dict[str, simpy.Event] = {} pending: dict[str, simpy.Event] = {}
composite_results: list[dict] = [] composite_results: list[dict] = []
# Helper: set our tl as active just before resuming the kernel.
# Multiple PE kernel runners share the same thread-local; without
# this, another runner's kernel may have left a different context.
def _switch_kernel(*args):
TLContext._set_active(tl) # type: ignore[attr-defined]
return g.switch(*args)
# Start kernel — first switch returns first command (or None if kernel is done) # Start kernel — first switch returns first command (or None if kernel is done)
cmd = g.switch() cmd = _switch_kernel()
while cmd is not None: while cmd is not None:
if isinstance(cmd, PeCpuOverheadCmd): if isinstance(cmd, PeCpuOverheadCmd):
yield env.timeout(cmd.cycles) yield env.timeout(cmd.cycles)
cmd = g.switch() cmd = _switch_kernel()
elif isinstance(cmd, WaitCmd): elif isinstance(cmd, WaitCmd):
if cmd.handle is not None: if cmd.handle is not None:
@@ -120,7 +141,7 @@ class KernelRunner:
for evt in pending.values(): for evt in pending.values():
yield evt yield evt
pending.clear() pending.clear()
cmd = g.switch() cmd = _switch_kernel()
elif isinstance(cmd, DmaReadCmd): elif isinstance(cmd, DmaReadCmd):
# Dispatch DMA through SimPy components # Dispatch DMA through SimPy components
@@ -141,10 +162,12 @@ class KernelRunner:
) )
except KeyError: except KeyError:
pass pass
cmd = g.switch(data) cmd = _switch_kernel(data)
elif isinstance(cmd, DmaWriteCmd): elif isinstance(cmd, DmaWriteCmd):
# Write to MemoryStore first (visibility = issue, ADR-0020 D3) # Write to MemoryStore first (visibility = issue, ADR-0020 D3).
# When data is None (e.g. timing-only TensorHandle math result),
# this is a no-op; Phase 2 dma_write replay handles those.
if self._store is not None and cmd.handle.data is not None: if self._store is not None and cmd.handle.data is not None:
self._store.write("hbm", cmd.dst_addr, cmd.handle.data) self._store.write("hbm", cmd.dst_addr, cmd.handle.data)
@@ -154,7 +177,7 @@ class KernelRunner:
) )
yield self._out_ports[self._scheduler_id].put(pe_txn) yield self._out_ports[self._scheduler_id].put(pe_txn)
yield done_evt yield done_evt
cmd = g.switch() cmd = _switch_kernel()
elif isinstance(cmd, CompositeCmd): elif isinstance(cmd, CompositeCmd):
# Non-blocking composite # Non-blocking composite
@@ -165,7 +188,7 @@ class KernelRunner:
composite_results.append(pe_txn.result_data) composite_results.append(pe_txn.result_data)
yield self._out_ports[self._scheduler_id].put(pe_txn) yield self._out_ports[self._scheduler_id].put(pe_txn)
pending[cmd.completion.id] = done_evt pending[cmd.completion.id] = done_evt
cmd = g.switch() cmd = _switch_kernel()
elif isinstance(cmd, (GemmCmd, MathCmd)): elif isinstance(cmd, (GemmCmd, MathCmd)):
# Blocking compute command # Blocking compute command
@@ -175,7 +198,90 @@ class KernelRunner:
) )
yield self._out_ports[self._scheduler_id].put(pe_txn) yield self._out_ports[self._scheduler_id].put(pe_txn)
yield done_evt yield done_evt
cmd = g.switch() cmd = _switch_kernel()
elif isinstance(cmd, IpcqSendCmd):
# Forward IpcqRequest to PE_IPCQ, wait for done
if self._ipcq_id not in self._out_ports:
raise RuntimeError(
f"PE_IPCQ port {self._ipcq_id!r} not wired to runner"
)
done_evt = env.event()
req = IpcqRequest(command=cmd, done=done_evt)
yield self._out_ports[self._ipcq_id].put(req)
yield done_evt
cmd = _switch_kernel()
elif isinstance(cmd, IpcqRecvCmd):
if self._ipcq_id not in self._out_ports:
raise RuntimeError(
f"PE_IPCQ port {self._ipcq_id!r} not wired to runner"
)
done_evt = env.event()
req = IpcqRequest(command=cmd, done=done_evt)
yield self._out_ports[self._ipcq_id].put(req)
yield done_evt
# Read actual data from MemoryStore at the slot address
data = None
src_space = req.result_data.get("src_space", "tcm")
src_addr = req.result_data.get("src_addr", 0)
if self._store is not None:
try:
data = self._store.read(
src_space, src_addr,
shape=cmd.shape, dtype=cmd.dtype,
)
except KeyError:
pass
# Build result dict for tl.recv to wrap in TensorHandle
result = {
"data": data,
"src_space": src_space,
"src_addr": src_addr,
"direction": req.result_data.get("direction", cmd.direction),
"dtype": cmd.dtype,
"shape": cmd.shape,
"nbytes": req.result_data.get("nbytes", 0),
}
cmd = _switch_kernel(result)
elif isinstance(cmd, tuple) and len(cmd) == 2 and cmd[0] == "recv_async":
# Non-blocking recv: post the IpcqRequest now, store the
# event in the future, return None to kernel.
future: RecvFuture = cmd[1]
done_evt = env.event()
req = IpcqRequest(command=future.cmd, done=done_evt)
future.request = req
future.event = done_evt
yield self._out_ports[self._ipcq_id].put(req)
cmd = _switch_kernel(None)
elif isinstance(cmd, tuple) and len(cmd) == 2 and cmd[0] == "recv_wait":
future = cmd[1]
if not future.event.triggered:
yield future.event
req = future.request
src_space = req.result_data.get("src_space", "tcm")
src_addr = req.result_data.get("src_addr", 0)
data = None
if self._store is not None:
try:
data = self._store.read(
src_space, src_addr,
shape=future.cmd.shape, dtype=future.cmd.dtype,
)
except KeyError:
pass
result = {
"data": data,
"src_space": src_space,
"src_addr": src_addr,
"direction": req.result_data.get("direction", future.cmd.direction),
"dtype": future.cmd.dtype,
"shape": future.cmd.shape,
"nbytes": req.result_data.get("nbytes", 0),
}
cmd = _switch_kernel(result)
else: else:
# Unknown command — pass through as blocking # Unknown command — pass through as blocking
@@ -185,7 +291,7 @@ class KernelRunner:
) )
yield self._out_ports[self._scheduler_id].put(pe_txn) yield self._out_ports[self._scheduler_id].put(pe_txn)
yield done_evt yield done_evt
cmd = g.switch() cmd = _switch_kernel()
# Wait remaining pending composites # Wait remaining pending composites
for evt in pending.values(): for evt in pending.values():
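The `_switch_kernel` helper exists because interleaved kernels share one "active context" slot. The sketch below reproduces the pattern with plain generators in place of greenlets and a module-level variable in place of the thread-local; `kernel`, `resume`, and `_active` are hypothetical names.

```python
_active = None  # stand-in for the thread-local active TLContext


def kernel(name: str, log: list):
    """Toy kernel: records which context is active each time it runs."""
    for step in range(2):
        log.append((name, step, _active))
        yield  # hand control back to the scheduler


def resume(ctx: str, gen) -> None:
    """Re-set the active context immediately before resuming a kernel."""
    global _active
    _active = ctx
    next(gen, None)


log: list = []
kernels = {"pe0": kernel("pe0", log), "pe1": kernel("pe1", log)}

# Round-robin interleaving, like several PE runners in one SimPy thread.
for _ in range(2):
    for ctx, gen in kernels.items():
        resume(ctx, gen)
```

If `resume` skipped the assignment, `pe0` would observe `pe1`'s context on its second step, since `pe1` was the last kernel to run.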
+263 -12
@@ -17,6 +17,7 @@ from __future__ import annotations
import math import math
from typing import Literal from typing import Literal
from kernbench.common.ipcq_types import IpcqRecvCmd, IpcqSendCmd, RecvFuture
from kernbench.common.pe_commands import ( from kernbench.common.pe_commands import (
CompletionHandle, CompletionHandle,
CompositeCmd, CompositeCmd,
@@ -55,6 +56,8 @@ class TLContext:
runner: Any = None, runner: Any = None,
cube_id: int = 0, cube_id: int = 0,
num_cubes: int = 1, num_cubes: int = 1,
scratch_base: int = 0,
scratch_size: int = 1 << 20, # 1 MiB per kernel invocation
) -> None: ) -> None:
self._pe_id = pe_id self._pe_id = pe_id
self._num_programs = num_programs self._num_programs = num_programs
@@ -65,6 +68,33 @@ class TLContext:
self._handle_counter = 0 self._handle_counter = 0
self._completion_counter = 0 self._completion_counter = 0
self._runner = runner # KernelRunner for greenlet mode (ADR-0020 D3) self._runner = runner # KernelRunner for greenlet mode (ADR-0020 D3)
# PE-local scratch allocator for math/compute output handles.
# Each binary/unary/reduction op auto-allocates a unique addr from
# this pool so the resulting TensorHandle can be the source of a
# later tl.send / tl.store. Cursor resets on every kernel invocation.
self._scratch_base = scratch_base
self._scratch_size = scratch_size
self._scratch_cursor = 0
def _scratch_alloc(self, nbytes: int) -> int:
"""Allocate a unique scratch address for an output TensorHandle.
Returns 0 if no scratch base was configured (e.g. command-list mode);
in that case the resulting handle has addr=0 and cannot be used as a
send/store source. Greenlet/runner mode always supplies a base.
"""
if self._scratch_base == 0:
return 0
# 16-byte alignment
aligned = (nbytes + 15) & ~15
addr = self._scratch_base + self._scratch_cursor
self._scratch_cursor += aligned
if self._scratch_cursor > self._scratch_size:
raise RuntimeError(
f"TLContext scratch overflow: requested {nbytes}B, "
f"used {self._scratch_cursor}/{self._scratch_size}B"
)
return addr
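A minimal standalone sketch of the bump-allocation scheme described in `_scratch_alloc`, assuming a hypothetical `ScratchAllocator` class that is not part of the diff. It differs from the code above in one respect, labeled in the comments: the overflow check runs before the cursor is advanced, so a failed request leaves the allocator state unchanged.

```python
class ScratchAllocator:
    """Hypothetical stand-in for TLContext's scratch bookkeeping (illustration only)."""

    def __init__(self, base: int, size: int = 1 << 20, align: int = 16) -> None:
        # align is assumed to be a power of two so the mask trick below is valid.
        self.base = base
        self.size = size
        self.align = align
        self.cursor = 0

    def alloc(self, nbytes: int) -> int:
        # Mirror of the diff's convention: base == 0 means "no scratch configured".
        if self.base == 0:
            return 0
        # Round the request up to the next multiple of `align`.
        aligned = (nbytes + self.align - 1) & ~(self.align - 1)
        # Variation from the diff: check before committing the cursor.
        if self.cursor + aligned > self.size:
            raise RuntimeError(
                f"scratch overflow: requested {nbytes}B, "
                f"used {self.cursor}/{self.size}B"
            )
        addr = self.base + self.cursor
        self.cursor += aligned
        return addr


alloc = ScratchAllocator(base=0x1000, size=64)
first = alloc.alloc(10)   # rounded up to 16 bytes
second = alloc.alloc(32)  # placed directly after the first block
```

Because every request is rounded up to 16 bytes, consecutive handles never share an aligned slot, which is what lets a math result be used later as a `tl.send` or `tl.store` source.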
@property
def commands(self) -> list[PeCommand]:
@@ -93,11 +123,30 @@ class TLContext:
def _make_handle(
self, addr: int, shape: tuple[int, ...], dtype: str,
space: str = "tcm",
) -> TensorHandle:
return TensorHandle(
id=self._next_handle_id(),
addr=addr, shape=shape, dtype=dtype,
nbytes=self._nbytes(shape, dtype),
space=space,
)
def _make_compute_out(
self, shape: tuple[int, ...], dtype: str,
) -> TensorHandle:
"""Allocate an output TensorHandle in PE-local scratch (TCM space).
Used by math/compute ops so the result has a real address that can
be the source of a later send/store. The data field stays None in
Phase 1 — Phase 2 DataExecutor fills the actual ndarray.
"""
nbytes = self._nbytes(shape, dtype)
addr = self._scratch_alloc(nbytes)
return TensorHandle(
id=self._next_handle_id(),
addr=addr, shape=shape, dtype=dtype,
nbytes=nbytes, space="tcm",
)
# ── Reference (no DMA, metadata only) ────────────────────────
@@ -124,20 +173,26 @@ class TLContext:
def load(
self, ptr: int, shape: tuple[int, ...], dtype: str = "f16",
) -> TensorHandle:
-"""Load tensor from HBM to TCM. Returns TensorHandle.
+"""Load tensor from HBM. Returns TensorHandle pointing at HBM[ptr].
In greenlet mode: returns TensorHandle with actual numpy data.
In command-list mode: returns TensorHandle with data=None.
The returned handle's ``space`` is "hbm" so subsequent ops (math,
send, store) using this handle as a source resolve via MemoryStore
at ``(hbm, ptr)`` — which is where the load's underlying data
actually lives in Phase 2 storage.
"""
self._emit_dispatch_overhead()
-handle = self._make_handle(addr=ptr, shape=shape, dtype=dtype)
+handle = self._make_handle(addr=ptr, shape=shape, dtype=dtype, space="hbm")
cmd = DmaReadCmd(handle=handle, src_addr=ptr, nbytes=handle.nbytes)
data = self._emit(cmd)
if data is not None:
-# Greenlet mode: attach real data to handle
+# Greenlet mode: attach real data to handle (preserve space)
return TensorHandle(
id=handle.id, addr=handle.addr, shape=handle.shape,
dtype=handle.dtype, nbytes=handle.nbytes, data=data,
space=handle.space,
)
return handle
@@ -162,7 +217,7 @@ class TLContext:
raise ValueError(f"dot shape mismatch: a.K={k} != b.K={k2}")
out_shape = (*a.shape[:-2], m, n)
out_dtype = a.dtype
-out = self._make_handle(addr=0, shape=out_shape, dtype=out_dtype)
+out = self._make_compute_out(shape=out_shape, dtype=out_dtype)
self._emit_dispatch_overhead()
self._emit(GemmCmd(a=a, b=b, out=out, m=m, k=k, n=n))
return out
@@ -170,7 +225,7 @@ class TLContext:
# ── MATH Engine: unary (blocking) ─────────────────────────────
def _unary_math(self, op: str, x: TensorHandle) -> TensorHandle:
-out = self._make_handle(addr=0, shape=x.shape, dtype=x.dtype)
+out = self._make_compute_out(shape=x.shape, dtype=x.dtype)
self._emit_dispatch_overhead()
self._emit(MathCmd(op=op, inputs=(x,), out=out))
return out
@@ -203,7 +258,7 @@ class TLContext:
) -> TensorHandle:
out_shape = list(x.shape)
out_shape[axis] = 1
-out = self._make_handle(addr=0, shape=tuple(out_shape), dtype=x.dtype)
+out = self._make_compute_out(shape=tuple(out_shape), dtype=x.dtype)
self._emit_dispatch_overhead()
self._emit(MathCmd(op=op, inputs=(x,), out=out, axis=axis))
return out
@@ -222,7 +277,7 @@ class TLContext:
def _binary_math(
self, op: str, a: TensorHandle, b: TensorHandle,
) -> TensorHandle:
-out = self._make_handle(addr=0, shape=a.shape, dtype=a.dtype)
+out = self._make_compute_out(shape=a.shape, dtype=a.dtype)
self._emit_dispatch_overhead()
self._emit(MathCmd(op=op, inputs=(a, b), out=out))
return out
@@ -230,15 +285,67 @@ class TLContext:
def where(
self, cond: TensorHandle, a: TensorHandle, b: TensorHandle,
) -> TensorHandle:
-out = self._make_handle(addr=0, shape=a.shape, dtype=a.dtype)
+out = self._make_compute_out(shape=a.shape, dtype=a.dtype)
self._emit_dispatch_overhead()
self._emit(MathCmd(op="where", inputs=(cond, a, b), out=out))
return out
def maximum(self, a: TensorHandle, b: TensorHandle) -> TensorHandle:
"""Element-wise max of two tensors (real Triton: tl.maximum)."""
return self._binary_math("maximum", a, b)
def minimum(self, a: TensorHandle, b: TensorHandle) -> TensorHandle:
"""Element-wise min of two tensors (real Triton: tl.minimum)."""
return self._binary_math("minimum", a, b)
def fma(
self, a: TensorHandle, b: TensorHandle, c: TensorHandle,
) -> TensorHandle:
"""Fused multiply-add: a * b + c (real Triton: tl.fma)."""
out = self._make_compute_out(shape=a.shape, dtype=a.dtype)
self._emit_dispatch_overhead()
self._emit(MathCmd(op="fma", inputs=(a, b, c), out=out))
return out
def clamp(
self,
x: TensorHandle,
min: TensorHandle,
max: TensorHandle,
) -> TensorHandle:
"""Clamp x to [min, max] (real Triton: tl.clamp)."""
out = self._make_compute_out(shape=x.shape, dtype=x.dtype)
self._emit_dispatch_overhead()
self._emit(MathCmd(op="clamp", inputs=(x, min, max), out=out))
return out
def softmax(self, x: TensorHandle, axis: int = -1) -> TensorHandle:
"""Numerically-stable softmax along ``axis`` (real Triton: tl.softmax).
Implemented as a single MathCmd (op="softmax") so timing accounts
for one MATH dispatch; Phase 2 DataExecutor expands it to the
canonical (x - max) → exp → sum → div sequence.
"""
out = self._make_compute_out(shape=x.shape, dtype=x.dtype)
self._emit_dispatch_overhead()
self._emit(MathCmd(op="softmax", inputs=(x,), out=out, axis=axis))
return out
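The `softmax` docstring names a canonical (x - max), exp, sum, div sequence for the Phase 2 expansion. A NumPy sketch of that sequence follows, assuming a hypothetical `softmax_reference` helper; it illustrates the arithmetic only and is not the DataExecutor code from this PR.

```python
import numpy as np


def softmax_reference(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Hypothetical reference for the four-step expansion (illustration only)."""
    # Step 1: subtract the per-axis max so exp() never sees large positives.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    # Step 2: exponentiate the shifted values.
    e = np.exp(shifted)
    # Steps 3 and 4: sum along the axis and divide.
    return e / np.sum(e, axis=axis, keepdims=True)


# The second row would overflow a naive exp(x) / sum(exp(x)).
probs = softmax_reference(np.array([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]]))
```

The max subtraction leaves the result mathematically unchanged while keeping every exponent at or below zero, which is the usual reason the sequence is called numerically stable.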
# ── Scalar helpers (real Triton: tl.cdiv etc.) ────────────────
@staticmethod
def cdiv(a: int, b: int) -> int:
"""Ceiling division: (a + b - 1) // b (real Triton: tl.cdiv).
Used by host/kernel grid math; not a tensor op, so no MathCmd
is emitted. Mirrors triton.cdiv.
"""
return -(-int(a) // int(b))
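The `cdiv` docstring quotes `(a + b - 1) // b` while the body uses the negated floor-division form. A short sketch, with `cdiv_textbook` as a hypothetical name for the docstring formula, shows the two agree for non-negative `a` and positive `b`, and how the helper would typically size a grid.

```python
def cdiv(a: int, b: int) -> int:
    # Same expression as the method in the diff: floor-divide the negation,
    # then negate again, which rounds toward positive infinity.
    return -(-int(a) // int(b))


def cdiv_textbook(a: int, b: int) -> int:
    # Formula quoted in the docstring; hypothetical helper name.
    return (a + b - 1) // b


# Typical grid math: how many blocks of 256 elements cover 1000 elements?
n_elem, block = 1000, 256
n_blocks = cdiv(n_elem, block)
```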
# ── Index / Scalar (PE_CPU, no engine) ────────────────────────
def program_id(self, axis: int = 0) -> int:
-"""Return program instance index.
+"""Return program instance index (ADR-0022).
axis=0: local PE id within cube.
axis=1: cube id.
@@ -248,7 +355,7 @@ class TLContext:
return self._pe_id
def num_programs(self, axis: int = 0) -> int:
-"""Return total number of program instances.
+"""Return total number of program instances (ADR-0022).
axis=0: num PEs per cube.
axis=1: num cubes.
@@ -284,6 +391,119 @@ class TLContext:
dtype=x.dtype, nbytes=x.nbytes, data=x.data,
)
# ── IPCQ (CCL) collective primitives (ADR-0023 D4) ────────────
def send(
self,
dir: str,
src: TensorHandle | None = None,
*,
src_addr: int | None = None,
nbytes: int | None = None,
shape: tuple[int, ...] | None = None,
dtype: str = "f16",
space: str = "tcm",
) -> None:
"""Send tensor data to the peer in the given direction.
Two calling forms:
tl.send(dir, handle) # use handle's metadata
tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
Blocking: returns when PE_IPCQ has accepted the request and
forwarded the IpcqDmaToken to PE_DMA. Backpressure may apply.
"""
if src is not None:
src_addr = src.addr
nbytes = src.nbytes
shape = src.shape
dtype = src.dtype
space = getattr(src, "space", space)
if src_addr is None or nbytes is None or shape is None:
raise ValueError("tl.send: provide either a TensorHandle or src_addr/nbytes/shape")
self._emit_dispatch_overhead()
cmd = IpcqSendCmd(
direction=dir,
src_addr=src_addr, src_space=space,
nbytes=nbytes, shape=shape, dtype=dtype,
handle_id=self._next_handle_id(),
)
self._emit(cmd)
def recv(
self,
dir: str | None = None,
shape: tuple[int, ...] = (),
dtype: str = "f16",
space: str = "tcm",
dst_addr: int | None = None,
dst_space: str | None = None,
) -> TensorHandle:
"""Receive tensor data from a peer.
Args:
dir: specific direction (e.g. "W"), or None for round-robin.
shape, dtype: expected tensor metadata.
dst_addr / dst_space: if both are provided, the slot data is
copied to (dst_space, dst_addr) before the handle is
returned ("copy_to_dst" mode). Otherwise the slot address
is returned directly ("return_slot" mode).
Returns:
TensorHandle pointing to the slot (or dst) where the data has
arrived. In greenlet/runner mode, ``handle.data`` carries the
actual ndarray; in command-list mode the handle is a placeholder.
"""
self._emit_dispatch_overhead()
if dst_addr is not None and dst_space is not None:
cmd = IpcqRecvCmd(
direction=dir,
shape=shape, dtype=dtype,
handle_id=self._next_handle_id(),
recv_mode="copy_to_dst",
dst_addr=dst_addr, dst_space=dst_space,
)
else:
cmd = IpcqRecvCmd(
direction=dir,
shape=shape, dtype=dtype,
handle_id=self._next_handle_id(),
)
result = self._emit(cmd)
if isinstance(result, dict):
slot_addr = int(result.get("src_addr", 0))
slot_space = str(result.get("src_space", "tcm"))
data = result.get("data")
return TensorHandle(
id=self._next_handle_id(),
addr=slot_addr,
shape=shape,
dtype=dtype,
nbytes=self._nbytes(shape, dtype),
data=data,
space=slot_space,
)
return self._make_handle(addr=0, shape=shape, dtype=dtype)
def recv_async(
self,
dir: str,
shape: tuple[int, ...] = (),
dtype: str = "f16",
) -> "RecvFuture":
"""Non-blocking recv. Returns a future to pass into ``tl.wait``."""
self._emit_dispatch_overhead()
cmd = IpcqRecvCmd(
direction=dir,
shape=shape, dtype=dtype,
handle_id=self._next_handle_id(),
blocking=False,
)
future = RecvFuture(cmd=cmd)
if self._runner is not None:
self._runner.switch_to_simpy(("recv_async", future))
return future
# ── Composite + Control ───────────────────────────────────────
def composite(
@@ -316,9 +536,40 @@ class TLContext:
))
return completion
-def wait(self, handle: CompletionHandle | None = None) -> None:
-"""Wait for a specific composite or all pending composites."""
+def wait(self, handle: "CompletionHandle | RecvFuture | None" = None) -> Any:
+"""Wait for a composite, a recv future, or all pending composites.
- ``CompletionHandle`` (or None): wait for composite completion.
- ``RecvFuture``: wait for a non-blocking ``recv_async`` to finish.
Returns the resolved ``TensorHandle``.
"""
if isinstance(handle, RecvFuture):
if handle.resolved:
return handle.result
if self._runner is None:
raise RuntimeError(
"tl.wait(RecvFuture) requires runner mode (greenlet)"
)
result_dict = self._runner.switch_to_simpy(("recv_wait", handle))
slot_addr = int(result_dict.get("src_addr", 0))
slot_space = str(result_dict.get("src_space", "tcm"))
data = result_dict.get("data")
th = TensorHandle(
id=self._next_handle_id(),
addr=slot_addr,
shape=handle.cmd.shape,
dtype=handle.cmd.dtype,
nbytes=self._nbytes(handle.cmd.shape, handle.cmd.dtype),
data=data,
space=slot_space,
)
handle.resolved = True
handle.result = th
return th
# Composite path (existing behaviour)
self._emit(WaitCmd(handle=handle))
return None
def cycles(self, n: int) -> None:
"""Declare PE_CPU scalar execution overhead (cycles)."""
+142
@@ -0,0 +1,142 @@
"""End-to-end matrix tests for the unified ``ccl_allreduce`` bench.
Each parametrized case writes a tmp ``ccl.yaml`` overlay that selects a
specific (algorithm, world_size, buffer_kind, n_elem) combination, then
runs the bench via the CLI and asserts the printed line reports all
ranks OK.
This single test file replaces the per-variant bench tests
(test_ccl_allreduce_e2e, test_ccl_mesh_allreduce, test_ccl_tree_allreduce,
test_ccl_multicube, test_ccl_multisip).
"""
from __future__ import annotations
import os
import textwrap
import pytest
import kernbench.cli.main as cli_main
CCL_YAML_TEMPLATE = textwrap.dedent("""\
defaults:
algorithm: {algorithm}
buffer_kind: {buffer_kind}
backpressure: sleep
n_slots: 4
slot_size: 4096
vc_chunk_size: 256
ipcq_credit_size_bytes: 16
algorithms:
{algorithm}:
module: {module}
topology: {topology}
buffer_kind: {buffer_kind}
{world_size_line}{n_elem_line}
""")
def _write_ccl_yaml(
tmp_path,
*,
algorithm: str,
module: str,
topology: str,
buffer_kind: str = "tcm",
world_size: int | None = None,
n_elem: int | None = None,
) -> str:
"""Write a tmp ccl.yaml in tmp_path and return its directory."""
ws_line = f" world_size: {world_size}\n" if world_size is not None else ""
nel_line = f" n_elem: {n_elem}\n" if n_elem is not None else ""
body = CCL_YAML_TEMPLATE.format(
algorithm=algorithm,
module=module,
topology=topology,
buffer_kind=buffer_kind,
world_size_line=ws_line,
n_elem_line=nel_line,
)
yaml_path = tmp_path / "ccl.yaml"
yaml_path.write_text(body)
return str(tmp_path)
CASES = [
# algorithm, module, topology, buffer_kind, world_size, n_elem, expected_ws
pytest.param(
"ring_allreduce_tcm", "kernbench.ccl.algorithms.ring_allreduce",
"ring_1d", "tcm", None, 8, 256,
id="ring_full_system_tcm",
),
pytest.param(
"ring_allreduce_hbm", "kernbench.ccl.algorithms.ring_allreduce",
"ring_1d", "hbm", None, 8, 256,
id="ring_full_system_hbm",
),
pytest.param(
"ring_allreduce_sram", "kernbench.ccl.algorithms.ring_allreduce",
"ring_1d", "sram", None, 8, 256,
id="ring_full_system_sram",
),
pytest.param(
"ring_allreduce_8", "kernbench.ccl.algorithms.ring_allreduce",
"ring_1d", "tcm", 8, 32, 8,
id="ring_single_cube",
),
pytest.param(
"ring_allreduce_16", "kernbench.ccl.algorithms.ring_allreduce",
"ring_1d", "tcm", 16, 16, 16,
id="ring_multi_cube",
),
pytest.param(
"mesh_allreduce_4", "kernbench.ccl.algorithms.mesh_allreduce",
"mesh_2d", "tcm", 4, 16, 4,
id="mesh_2x2",
),
pytest.param(
"tree_allreduce_7", "kernbench.ccl.algorithms.tree_allreduce",
"tree_binary", "tcm", 7, 16, 7,
id="tree_binary_7",
),
]
@pytest.mark.parametrize(
"algorithm,module,topology,buffer_kind,world_size,n_elem,expected_ws",
CASES,
)
def test_ccl_allreduce_matrix(
tmp_path, capsys, monkeypatch,
algorithm, module, topology, buffer_kind, world_size, n_elem, expected_ws,
):
"""Each (algorithm × buffer × world_size) combo passes through the
unified bench and yields all ranks OK."""
project_root = os.path.abspath(
os.path.join(os.path.dirname(__file__), "..")
)
yaml_dir = _write_ccl_yaml(
tmp_path,
algorithm=algorithm,
module=module,
topology=topology,
buffer_kind=buffer_kind,
world_size=world_size,
n_elem=n_elem,
)
monkeypatch.chdir(yaml_dir)
rc = cli_main.main([
"run",
"--topology", os.path.join(project_root, "topology.yaml"),
"--bench", "ccl_allreduce",
"--verify-data",
])
assert rc == 0
out = capsys.readouterr().out
assert "FAIL" not in out, f"unexpected FAIL in output:\n{out}"
assert f"{algorithm} (ws={expected_ws}): {expected_ws} OK" in out, (
f"expected '{algorithm} (ws={expected_ws}): {expected_ws} OK' "
f"in output:\n{out}"
)
+125
@@ -0,0 +1,125 @@
"""Tests for IPCQ deadlock detection (ADR-0023 D14 F3)."""
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Any
import pytest
import simpy
from kernbench.ccl import diagnostics
from kernbench.common.ipcq_types import (
IpcqEndpoint,
IpcqInitEntry,
IpcqRecvCmd,
IpcqRequest,
)
from kernbench.components.builtin.pe_ipcq import PeIpcqComponent
from kernbench.runtime_api.kernel import IpcqInitMsg
from kernbench.topology.types import Node
@dataclass
class _FakeTxn:
request: Any
done: simpy.Event
result_data: dict[str, Any] = field(default_factory=dict)
def _make_isolated_pe_ipcq(env):
node = Node(
id="sip0.cube0.pe0.pe_ipcq", kind="pe_ipcq",
impl="builtin.pe_ipcq", attrs={}, pos_mm=None,
)
comp = PeIpcqComponent(node, ctx=None)
comp.in_ports["host"] = simpy.Store(env)
comp.out_ports["sip0.cube0.pe0.pe_dma"] = simpy.Store(env)
comp.start(env)
peer_credit = simpy.Store(env)
ep = IpcqEndpoint(
sip=0, cube=0, pe=1, buffer_kind="tcm",
rx_base_pa=0x10_000, rx_base_va=0,
n_slots=4, slot_size=4096,
)
init_msg = IpcqInitMsg(
correlation_id="t", request_id="t",
target_sips=(0,), target_cubes=(0,), target_pe=0,
entries=(IpcqInitEntry(
direction="W", peer=ep,
my_rx_base_pa=0x40_000, my_rx_base_va=0,
n_slots=4, slot_size=4096,
peer_credit_store=peer_credit,
),),
backpressure_mode="sleep",
buffer_kind="tcm",
credit_size_bytes=16,
)
done = env.event()
comp.in_ports["host"].put(_FakeTxn(request=init_msg, done=done))
env.run(until=done)
return comp
def test_pointer_dump_includes_blocked_state():
"""A blocked recv should still be visible in the pointer dump."""
env = simpy.Environment()
comp = _make_isolated_pe_ipcq(env)
# Issue a recv that will block (no data has arrived)
recv_cmd = IpcqRecvCmd(direction="W", shape=(8,), dtype="f16", handle_id="r1")
req = IpcqRequest(command=recv_cmd, done=env.event())
comp.in_ports["host"].put(req)
env.run(until=10)
assert not req.done.triggered
# Pointer dump should show my_tail=0 and peer_head_cache=0
# pointer_dump takes an engine; for an isolated component, a stand-in
# object exposing _components is enough.
class FakeEngine:
_components = {"sip0.cube0.pe0.pe_ipcq": comp}
dump = diagnostics.pointer_dump(FakeEngine())
assert "my_tail=0" in dump
assert "peer_head_cache=0" in dump
def test_deadlock_detection_recv_without_send():
"""A recv with no matching sender → SimPy schedule empties → engine
raises ``IpcqDeadlock`` with a pointer dump.
"""
from kernbench.ccl.diagnostics import IpcqDeadlock
from kernbench.policy.placement.dp import DPPolicy
from kernbench.runtime_api.bench_runner import run_bench
from kernbench.runtime_api.types import resolve_device
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
def deadlock_kernel(t_ptr, n_elem, tl):
# Every PE just receives, no sends → no one delivers → deadlock
tl.recv(dir="W", shape=(n_elem,), dtype="f16")
topo = resolve_topology("topology.yaml")
def run(torch):
torch.install_ipcq(
algorithm="ring_allreduce_tcm", world_size_override=8,
)
a = torch.zeros(
(1, 8 * 8),
dtype="f16",
dp=DPPolicy(
sip="replicate", cube="replicate", pe="column_wise",
num_sips=1, num_cubes=1,
),
name="dl_in",
)
torch.launch("dl", deadlock_kernel, a, 8)
with pytest.raises(IpcqDeadlock):
run_bench(
topology=topo, bench_fn=run,
device=resolve_device("all"),
engine_factory=lambda t, d: GraphEngine(
getattr(t, "topology_obj", t), enable_data=True
),
)
+70
@@ -0,0 +1,70 @@
"""Tests for CCL diagnostics: trace + pointer dump (ADR-0023 D14)."""
from __future__ import annotations
import os
from kernbench.ccl import diagnostics
# ── trace toggle ─────────────────────────────────────────────────────
def test_trace_disabled_by_default(monkeypatch):
monkeypatch.delenv("KERNBENCH_CCL_TRACE", raising=False)
diagnostics.reload_trace_setting()
assert diagnostics.trace_enabled() is False
def test_trace_enabled_via_env(monkeypatch):
monkeypatch.setenv("KERNBENCH_CCL_TRACE", "1")
diagnostics.reload_trace_setting()
assert diagnostics.trace_enabled() is True
def test_trace_record_send(monkeypatch, capsys):
monkeypatch.setenv("KERNBENCH_CCL_TRACE", "1")
diagnostics.reload_trace_setting()
diagnostics.log_send(t_ns=100.0, sender="sip0.cube0.pe0",
direction="E", nbytes=64, sender_seq=0)
out = capsys.readouterr().out
assert "send" in out
assert "sip0.cube0.pe0" in out
assert "dir=E" in out
monkeypatch.delenv("KERNBENCH_CCL_TRACE")
diagnostics.reload_trace_setting()
def test_trace_record_recv(monkeypatch, capsys):
monkeypatch.setenv("KERNBENCH_CCL_TRACE", "1")
diagnostics.reload_trace_setting()
diagnostics.log_recv(t_ns=200.0, receiver="sip0.cube0.pe1",
direction="W", nbytes=64)
out = capsys.readouterr().out
assert "recv" in out
assert "sip0.cube0.pe1" in out
monkeypatch.delenv("KERNBENCH_CCL_TRACE")
diagnostics.reload_trace_setting()
# ── pointer dump ────────────────────────────────────────────────────
def test_pointer_dump_format():
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
from kernbench.ccl.install import (
install_ipcq, load_ccl_config, resolve_algorithm_config,
)
topo = resolve_topology("topology.yaml").topology_obj
engine = GraphEngine(topo, enable_data=True)
cfg = resolve_algorithm_config(load_ccl_config(), name="ring_allreduce_tcm")
install_ipcq(engine, topo.spec, cfg)
dump = diagnostics.pointer_dump(engine)
# 8 ranks × 2 directions = 16 lines (plus 8 PE headers)
assert "sip0.cube0.pe0" in dump
assert "E:" in dump
assert "W:" in dump
assert "my_head=" in dump
assert "peer_tail_cache=" in dump
+62
@@ -0,0 +1,62 @@
"""Tests for the torch.distributed-compat facade (ADR-0023 D11).
These tests verify the public API surface of ``DistributedContext`` +
``AhbmCCLBackend``. End-to-end correctness of the allreduce itself is
covered by tests/test_ccl_allreduce_matrix.py.
"""
from __future__ import annotations
from kernbench.runtime_api.distributed import AhbmCCLBackend, DistributedContext
def test_init_process_group_requires_ctx_ref():
"""Using DistributedContext without RuntimeContext binding should fail."""
dist = DistributedContext()
# Not bound to a RuntimeContext → init should raise.
try:
dist.init_process_group(backend="ahbm")
assert False, "expected RuntimeError"
except RuntimeError:
pass
def test_init_process_group_rejects_unknown_backend():
"""Unknown backend raises ValueError (matches pytorch behavior)."""
dist = DistributedContext()
dist._ctx_ref = object() # dummy; won't be reached before the check
try:
dist.init_process_group(backend="nccl")
assert False, "expected ValueError"
except ValueError:
pass
def test_distributed_pytorch_compat_surface():
"""DistributedContext only exposes real torch.distributed API names."""
# Every public attribute should either be a real pytorch name or private.
allowed = {
"init_process_group",
"is_initialized",
"get_world_size",
"get_rank",
"get_backend",
"all_reduce",
"barrier",
}
dc = DistributedContext()
for attr in dir(dc):
if attr.startswith("_"):
continue
assert attr in allowed, (
f"DistributedContext exposes non-pytorch API: {attr!r}"
)
def test_backend_class_surface():
"""AhbmCCLBackend exposes at least all_reduce, barrier, and world_size."""
# Only the minimum surface is asserted; extra public names are not rejected here.
public = {m for m in dir(AhbmCCLBackend) if not m.startswith("_")}
# Class must at minimum expose these.
assert "all_reduce" in public
assert "barrier" in public
assert "world_size" in public
+81
@@ -0,0 +1,81 @@
"""Validate the hello-world example from docs/ccl-author-guide.md.
This is the simplest possible CCL kernel: each PE sends its tile E
and receives a tile from W. After running, each rank's slice should
contain the data of the previous rank.
"""
from __future__ import annotations
import numpy as np
from kernbench.ccl.algorithms import hello_send
from kernbench.ccl.testing import run_kernel_in_mock
def test_hello_send_4_ranks_mock():
n_elem = 8
inputs = [np.full((n_elem,), float(r + 1), dtype=np.float16) for r in range(4)]
outputs = run_kernel_in_mock(
kernel_fn=hello_send.kernel,
world_size=4,
topology="ring_1d",
inputs=inputs,
kernel_args=(n_elem,),
)
# rank r should have rank (r-1) % 4's data
for r in range(4):
prev = inputs[(r - 1) % 4]
assert np.array_equal(outputs[r], prev), f"rank {r}: got {outputs[r]}"
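The expectation checked above (rank r ends up with the data of rank (r - 1) mod world_size) is a cyclic shift by one along the rank axis. A small NumPy sketch of that expected result, independent of `hello_send.kernel` and `run_kernel_in_mock`:

```python
import numpy as np

world_size, n_elem = 4, 8
# Same input pattern as the test: rank r holds the constant r + 1.
inputs = [np.full((n_elem,), float(r + 1), dtype=np.float16) for r in range(world_size)]

# Stack to shape (world_size, n_elem); rolling by +1 along axis 0 moves
# row r - 1 into row r, which models "send E, receive from W" on a ring.
expected = np.roll(np.stack(inputs), shift=1, axis=0)
```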
def test_hello_send_via_simpy_runner():
"""Same but through real SimPy + IPCQ."""
from kernbench.policy.placement.dp import DPPolicy
from kernbench.runtime_api.bench_runner import run_bench
from kernbench.runtime_api.types import resolve_device
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
topo = resolve_topology("topology.yaml")
n_elem = 8
world_size = 8
def run(torch):
# World size for this hello test is 8 (one cube). ccl.yaml no
# longer carries a default world_size — pass it explicitly.
plan = torch.install_ipcq(
algorithm="ring_allreduce_tcm", world_size_override=world_size,
)
a = torch.zeros(
(1, world_size * n_elem), dtype="f16",
dp=DPPolicy(
sip="replicate", cube="replicate", pe="column_wise",
num_sips=1, num_cubes=1,
),
name="hello_in",
)
store = torch.engine.memory_store
base = a._handle.va_base or a._handle.shards[0].pa
nbytes = n_elem * 2
for r in range(world_size):
store.write("hbm", base + r * nbytes,
np.full((n_elem,), float(r + 1), dtype=np.float16))
torch.launch("hello_send", hello_send.kernel, a, n_elem)
# Each rank should hold the previous rank's data after the round
for r in range(world_size):
arr = store.read("hbm", base + r * nbytes, shape=(n_elem,), dtype="f16")
prev_value = float(((r - 1) % world_size) + 1)
assert np.allclose(arr, prev_value), f"rank {r}: got {arr}, expected {prev_value}"
result = run_bench(
topology=topo, bench_fn=run,
device=resolve_device("all"),
engine_factory=lambda t, d: GraphEngine(
getattr(t, "topology_obj", t), enable_data=True
),
)
assert result.completion.ok
+68
@@ -0,0 +1,68 @@
"""Tests for CCL algorithm-author helpers (ADR-0023 D15)."""
from __future__ import annotations
import pytest
from kernbench.ccl.helpers import (
Chunk,
chunked,
ring_step,
tree_step,
)
# ── chunked ──────────────────────────────────────────────────────────
def test_chunked_basic():
chunks = chunked(base_addr=0x1000, n_chunks=4, n_elem=64, dtype="f16")
assert len(chunks) == 4
# Each chunk has 16 elements (64 / 4)
assert chunks[0] == Chunk(addr=0x1000, n_elem=16, nbytes=32)
assert chunks[1] == Chunk(addr=0x1020, n_elem=16, nbytes=32)
assert chunks[2] == Chunk(addr=0x1040, n_elem=16, nbytes=32)
assert chunks[3] == Chunk(addr=0x1060, n_elem=16, nbytes=32)
def test_chunked_f32():
chunks = chunked(base_addr=0x100, n_chunks=2, n_elem=8, dtype="f32")
assert chunks[0].nbytes == 16 # 4 elem × 4 bytes
assert chunks[1].addr == 0x100 + 16
def test_chunked_uneven_raises():
with pytest.raises(ValueError):
chunked(base_addr=0x100, n_chunks=3, n_elem=10, dtype="f16")
# ── ring_step ────────────────────────────────────────────────────────
def test_ring_step_4_ranks():
# Standard reduce-scatter ring step:
# at step s, rank r sends chunk (r-s) and receives chunk (r-s-1) (mod ws)
assert ring_step(rank=0, step=0, world_size=4) == (0, 3)
assert ring_step(rank=0, step=1, world_size=4) == (3, 2)
assert ring_step(rank=1, step=0, world_size=4) == (1, 0)
assert ring_step(rank=2, step=0, world_size=4) == (2, 1)
# ── tree_step ────────────────────────────────────────────────────────
def test_tree_step_root():
info = tree_step(rank=0, world_size=7)
assert info["parent"] is None
assert info["children"] == [1, 2]
def test_tree_step_internal():
info = tree_step(rank=1, world_size=7)
assert info["parent"] == 0
assert info["children"] == [3, 4]
def test_tree_step_leaf():
info = tree_step(rank=4, world_size=7)
assert info["parent"] == 1
assert info["children"] == []
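The tests above pin down the behaviour of `chunked`, `ring_step`, and `tree_step` without showing their bodies. The sketch below is one possible implementation that satisfies those assertions; it is an assumption reconstructed from the tests, not the code in `kernbench.ccl.helpers`, and the `_DTYPE_BYTES` table is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical dtype table; only the two dtypes used by the tests are listed.
_DTYPE_BYTES = {"f16": 2, "f32": 4}


@dataclass(frozen=True)
class Chunk:
    addr: int
    n_elem: int
    nbytes: int


def chunked(base_addr: int, n_chunks: int, n_elem: int, dtype: str) -> list[Chunk]:
    # Uneven splits are rejected, matching test_chunked_uneven_raises.
    if n_elem % n_chunks != 0:
        raise ValueError(f"n_elem={n_elem} not divisible by n_chunks={n_chunks}")
    per_chunk = n_elem // n_chunks
    nbytes = per_chunk * _DTYPE_BYTES[dtype]
    return [
        Chunk(addr=base_addr + i * nbytes, n_elem=per_chunk, nbytes=nbytes)
        for i in range(n_chunks)
    ]


def ring_step(rank: int, step: int, world_size: int) -> tuple[int, int]:
    # Reduce-scatter indices: (chunk to send, chunk to receive), both mod world_size.
    return ((rank - step) % world_size, (rank - step - 1) % world_size)


def tree_step(rank: int, world_size: int) -> dict:
    # Implicit binary tree stored in rank order (heap layout).
    parent = None if rank == 0 else (rank - 1) // 2
    children = [c for c in (2 * rank + 1, 2 * rank + 2) if c < world_size]
    return {"parent": parent, "children": children}
```

The heap layout keeps the tree implicit: no neighbor table is needed to find a parent or child, only the rank and the world size.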
+100
@@ -0,0 +1,100 @@
"""Tests for CCL backend install (ADR-0023 D10/D11)."""
from __future__ import annotations
from kernbench.ccl.install import (
install_ipcq,
linear_rank_to_pe,
load_ccl_config,
resolve_algorithm_config,
)
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
def _engine():
topo = resolve_topology("topology.yaml").topology_obj
return GraphEngine(topo, enable_data=True), topo
def test_load_ccl_config():
cfg = load_ccl_config()
assert "defaults" in cfg
assert "algorithms" in cfg
def test_resolve_algorithm_config_default():
cfg = load_ccl_config()
merged = resolve_algorithm_config(cfg)
assert merged["algorithm"] == cfg["defaults"]["algorithm"]
# ccl.yaml no longer carries defaults.world_size; the backend derives
# it from topology.yaml at install time. An algorithm entry may still
# set it, so accept either an absent field or a positive value.
assert "world_size" not in merged or merged["world_size"] >= 1
def test_resolve_algorithm_config_override():
cfg = load_ccl_config()
merged = resolve_algorithm_config(cfg, name="ring_allreduce_hbm")
assert merged["algorithm"] == "ring_allreduce_hbm"
assert merged["buffer_kind"] == "hbm" # algo override
# defaults still apply
assert merged["n_slots"] == cfg["defaults"]["n_slots"]
def test_linear_rank_to_pe():
engine, topo = _engine()
spec = topo.spec
# Cube 0 of SIP 0
assert linear_rank_to_pe(0, spec) == (0, 0, 0)
assert linear_rank_to_pe(7, spec) == (0, 0, 7)
# Should not exceed total PE count
pes_per_sip = (
spec["sip"]["cube_mesh"]["w"] * spec["sip"]["cube_mesh"]["h"]
* spec["cube"]["pe_layout"]["pe_per_corner"]
* len(spec["cube"]["pe_layout"]["corners"])
)
sips = spec["system"]["sips"]["count"]
total = sips * pes_per_sip
assert total >= 8
def test_install_ipcq_neighbors_correct():
engine, topo = _engine()
cfg = load_ccl_config()
merged = resolve_algorithm_config(cfg, name="ring_allreduce_tcm")
# Force a single-cube 8-rank install for the assertions below.
merged["world_size"] = 8
plan = install_ipcq(engine, topo.spec, merged)
assert plan["world_size"] == 8
assert plan["buffer_kind"] == "tcm"
# Each rank should have E and W entries
for r, nbrs in plan["neighbor_table"].items():
assert "E" in nbrs
assert "W" in nbrs
# Inspect installed PE_IPCQ for rank 0
ipcq = engine._components["sip0.cube0.pe0.pe_ipcq"]
qp_e = ipcq.queue_pairs["E"]
qp_w = ipcq.queue_pairs["W"]
assert qp_e["peer"].pe == 1 # rank 0's E neighbor is rank 1
assert qp_w["peer"].pe == 7 # rank 0's W neighbor is rank 7
# rx_base addresses should be unique
assert qp_e["my_rx_base_pa"] != qp_w["my_rx_base_pa"]
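The neighbor assertions above (rank 0 has E = 1 and W = 7 in an 8-rank ring) follow from a simple modular mapping. A sketch with a hypothetical `ring_neighbors` function, assuming the `ring_1d` topology wires E to the next rank and W to the previous one, as the test comments state for rank 0:

```python
def ring_neighbors(rank: int, world_size: int) -> dict[str, int]:
    # E is the next rank around the ring, W the previous one (assumed convention).
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


# Shaped like plan["neighbor_table"] in the test: rank -> {direction: peer rank}.
table = {r: ring_neighbors(r, 8) for r in range(8)}
```

A useful consistency check on any such table is that going E and then W returns to the starting rank.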
def test_install_ipcq_credit_stores_wired():
engine, topo = _engine()
cfg = load_ccl_config()
merged = resolve_algorithm_config(cfg, name="ring_allreduce_tcm")
merged["world_size"] = 8
install_ipcq(engine, topo.spec, merged)
# rank 0 (pe0) sending E goes to rank 1 (pe1)
# rank 0's peer_credit_store on E direction should equal rank 1's credit_inbox
pe0 = engine._components["sip0.cube0.pe0.pe_ipcq"]
pe1 = engine._components["sip0.cube0.pe1.pe_ipcq"]
qp_e = pe0.queue_pairs["E"]
assert qp_e["peer_credit_store"] is pe1.credit_inbox
@@ -0,0 +1,83 @@
"""Tests for the mock CCL runtime (ADR-0023 D15)."""
from __future__ import annotations
import numpy as np
from kernbench.ccl.algorithms import ring_allreduce
from kernbench.ccl.testing import run_kernel_in_mock
def test_ring_allreduce_4_ranks():
"""Run the ring all-reduce kernel under the mock runtime, no SimPy."""
n_elem = 8
inputs = [
np.full((n_elem,), float(r + 1), dtype=np.float16)
for r in range(4)
]
expected = sum(inputs) # [10, 10, ..., 10]
outputs = run_kernel_in_mock(
kernel_fn=ring_allreduce.kernel,
world_size=4,
topology="ring_1d",
inputs=inputs,
kernel_args=(n_elem, 4),
)
assert len(outputs) == 4
for r in range(4):
assert np.allclose(outputs[r], expected)
def test_ring_allreduce_8_ranks():
n_elem = 16
inputs = [
np.full((n_elem,), float(r + 1), dtype=np.float16)
for r in range(8)
]
expected = sum(inputs) # [36, 36, ...]
outputs = run_kernel_in_mock(
kernel_fn=ring_allreduce.kernel,
world_size=8,
topology="ring_1d",
inputs=inputs,
kernel_args=(n_elem, 8),
)
for r in range(8):
assert np.allclose(outputs[r], expected)
def test_ring_allreduce_random_data():
n_elem = 32
rng = np.random.default_rng(42)
inputs = [rng.standard_normal(n_elem).astype(np.float16) for _ in range(4)]
expected = sum(inputs)
outputs = run_kernel_in_mock(
kernel_fn=ring_allreduce.kernel,
world_size=4,
topology="ring_1d",
inputs=inputs,
kernel_args=(n_elem, 4),
)
for r in range(4):
assert np.allclose(outputs[r], expected, rtol=1e-2, atol=1e-2)
def test_mock_runtime_invalid_direction_raises():
    """A kernel that uses an unsupported direction should raise."""
    import pytest

    def bad_kernel(t_ptr, n_elem, tl):
        tl.send(dir="N", src_addr=0, nbytes=2, shape=(1,), dtype="f16", space="hbm")

    inputs = [np.array([1.0], dtype=np.float16) for _ in range(2)]
    with pytest.raises(Exception):
        run_kernel_in_mock(
            kernel_fn=bad_kernel,
            world_size=2,
            topology="ring_1d",
            inputs=inputs,
            kernel_args=(1,),
        )
@@ -0,0 +1,134 @@
"""CCL performance validation tests (ADR-0023 D13 T5).
Sanity-checks the simulated latency of the unified ``ccl_allreduce`` bench
under different ``ccl.yaml`` algorithm choices:
- All buffer kinds finish in non-zero simulated time.
- Latency stays below a loose 100 ms ceiling (small tiles are expected to finish well under 1 ms).
These are sanity checks on the model itself, not on absolute numbers.
"""
from __future__ import annotations
import importlib
import os
from contextlib import contextmanager
import pytest
from kernbench.runtime_api.bench_runner import run_bench
from kernbench.runtime_api.types import resolve_device
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
def _engine_factory(topology, device):
return GraphEngine(getattr(topology, "topology_obj", topology), enable_data=True)
@contextmanager
def _ccl_yaml_override(algorithm: str, world_size: int | None = None):
"""Write a tmp ccl.yaml that forces a specific algorithm + world_size."""
import tempfile
entry_extra = f"\n world_size: {world_size}" if world_size is not None else ""
body = f"""
defaults:
algorithm: {algorithm}
buffer_kind: tcm
backpressure: sleep
n_slots: 4
slot_size: 4096
vc_chunk_size: 256
ipcq_credit_size_bytes: 16
algorithms:
ring_allreduce_tcm:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d
buffer_kind: tcm
ring_allreduce_hbm:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d
buffer_kind: hbm
ring_allreduce_sram:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d
buffer_kind: sram{entry_extra if algorithm.startswith("ring") else ""}
{algorithm}:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d
buffer_kind: tcm{entry_extra}
""" if world_size is not None else f"""
defaults:
algorithm: {algorithm}
buffer_kind: tcm
backpressure: sleep
n_slots: 4
slot_size: 4096
vc_chunk_size: 256
ipcq_credit_size_bytes: 16
algorithms:
ring_allreduce_tcm:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d
buffer_kind: tcm
ring_allreduce_hbm:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d
buffer_kind: hbm
ring_allreduce_sram:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d
buffer_kind: sram
"""
with tempfile.TemporaryDirectory() as tmp:
path = os.path.join(tmp, "ccl.yaml")
with open(path, "w") as f:
f.write(body)
old_cwd = os.getcwd()
os.chdir(tmp)
try:
yield path
finally:
os.chdir(old_cwd)
def _run_unified(algorithm: str, world_size: int | None = None) -> float:
"""Run the unified ccl_allreduce bench under a ccl.yaml override,
return simulated kernel total_ns."""
with _ccl_yaml_override(algorithm, world_size):
topo = resolve_topology(
os.path.join(os.path.dirname(__file__), "..", "topology.yaml")
)
bench_mod = importlib.import_module("benches.ccl_allreduce")
result = run_bench(
topology=topo, bench_fn=bench_mod.run,
device=resolve_device("all"),
engine_factory=_engine_factory,
)
assert result.completion.ok, f"{algorithm} did not complete"
last_kernel = None
for tr in (result.traces or []):
if tr.get("phase") == "kernel":
last_kernel = tr
assert last_kernel is not None, f"{algorithm} produced no kernel trace"
return float(last_kernel.get("total_ns", 0.0))
@pytest.mark.parametrize("algorithm", [
"ring_allreduce_tcm",
"ring_allreduce_hbm",
"ring_allreduce_sram",
])
def test_ccl_latency_positive(algorithm):
"""Every buffer kind must produce a positive simulated latency."""
ns = _run_unified(algorithm)
assert ns > 0
def test_ccl_latency_under_reasonable_bound():
    """Sanity bound: ring all-reduce (tile=32 f16) is expected to finish in
    well under 1 ms simulated, because small tiles are overhead-dominated.
    The assertion itself only enforces a much looser 100 ms ceiling."""
    ns = _run_unified("ring_allreduce_tcm")
    assert ns < 100_000_000  # < 100 ms simulated; very loose bound
@@ -0,0 +1,48 @@
"""Test that tl.recv() (no direction) works under the mock runtime
and the SimPy PE_IPCQ component (ADR-0023 D4 weak fairness)."""
from __future__ import annotations
import numpy as np
from kernbench.ccl.testing import run_kernel_in_mock
def kernel_round_robin(t_ptr, n_elem, tl):
    """Each PE sends one tile E per step and receives one tile via
    round-robin recv, for N-1 steps.

    Uses TensorHandle math (PE_MATH) so Phase 2 produces correct HBM
    contents under SimPy + op_log replay."""
    rank = tl.program_id(axis=0)
    world_size = tl.num_programs(axis=0)
    nbytes = n_elem * 2
    pe_addr = t_ptr + rank * nbytes
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    current = acc
    for _step in range(world_size - 1):
        tl.send(dir="E", src=current)
        # No direction → round-robin
        recv = tl.recv(shape=(n_elem,), dtype="f16")
        acc = acc + recv
        current = recv  # forward W's tile to E next round
    tl.store(pe_addr, acc)
def test_round_robin_recv_mock_runtime():
n_elem = 8
inputs = [
np.full((n_elem,), float(r + 1), dtype=np.float16)
for r in range(4)
]
expected = sum(inputs) # [10,...]
outputs = run_kernel_in_mock(
kernel_fn=kernel_round_robin,
world_size=4,
topology="ring_1d",
inputs=inputs,
kernel_args=(n_elem,),
)
for r in range(4):
assert np.allclose(outputs[r], expected)
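Review note: the data flow that `kernel_round_robin` relies on can be checked without any runtime. The sketch below is a plain-Python model, not the mock runtime itself; `simulate_forwarding_ring` is a hypothetical helper introduced only for illustration, and it assumes every rank sends E and therefore only ever receives from its W neighbour.

```python
def simulate_forwarding_ring(inputs):
    """Model the send-E / recv loop: each rank forwards the tile it last received."""
    world_size = len(inputs)
    acc = [list(x) for x in inputs]      # running sum held by each rank
    current = [list(x) for x in inputs]  # tile each rank will send next
    for _ in range(world_size - 1):
        # Rank r receives what its W neighbour (r - 1) sent E this step.
        received = [current[(r - 1) % world_size] for r in range(world_size)]
        acc = [[a + t for a, t in zip(acc[r], received[r])]
               for r in range(world_size)]
        current = received  # forward the received tile on the next step
    return acc


# Same inputs as the test above: rank r holds a tile filled with r + 1.
inputs = [[float(r + 1)] * 8 for r in range(4)]
outputs = simulate_forwarding_ring(inputs)
```

After `world_size - 1` steps each rank has seen every other rank's original tile exactly once, which is why the test can compare every output against `sum(inputs)`.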
@@ -0,0 +1,140 @@
"""Tests for IPCQ strict shape/dtype validation (ADR-0023 D14 F2)."""
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Any
import pytest
import simpy
from kernbench.common.ipcq_types import (
IpcqDmaToken,
IpcqEndpoint,
IpcqInitEntry,
IpcqInvalidDirection,
IpcqMetaArrival,
IpcqRecvCmd,
IpcqRequest,
IpcqSendCmd,
)
from kernbench.components.builtin.pe_ipcq import PeIpcqComponent
from kernbench.runtime_api.kernel import IpcqInitMsg
from kernbench.topology.types import Node
# ── helpers (smaller copy of test_pe_ipcq fixtures) ────────────────
@dataclass
class _FakeTxn:
request: Any
done: simpy.Event
result_data: dict[str, Any] = field(default_factory=dict)
def _make(env, strict: bool = True):
node = Node(
id="sip0.cube0.pe0.pe_ipcq", kind="pe_ipcq",
impl="builtin.pe_ipcq",
attrs={"strict_validation": strict},
pos_mm=None,
)
comp = PeIpcqComponent(node, ctx=None)
comp.in_ports["host"] = simpy.Store(env)
comp.out_ports["sip0.cube0.pe0.pe_dma"] = simpy.Store(env)
comp.start(env)
peer_credit = simpy.Store(env)
ep = IpcqEndpoint(
sip=0, cube=0, pe=1, buffer_kind="tcm",
rx_base_pa=0x10_000, rx_base_va=0,
n_slots=4, slot_size=4096,
)
init_msg = IpcqInitMsg(
correlation_id="t", request_id="t",
target_sips=(0,), target_cubes=(0,), target_pe=0,
entries=(IpcqInitEntry(
direction="W", peer=ep,
my_rx_base_pa=0x40_000, my_rx_base_va=0,
n_slots=4, slot_size=4096,
peer_credit_store=peer_credit,
),),
backpressure_mode="sleep",
buffer_kind="tcm",
credit_size_bytes=16,
)
done = env.event()
comp.in_ports["host"].put(_FakeTxn(request=init_msg, done=done))
env.run(until=done)
return comp
# ── F2 tests ─────────────────────────────────────────────────────────
def test_strict_mode_dtype_mismatch_raises():
env = simpy.Environment()
comp = _make(env, strict=True)
# Pre-arrive metadata with f32 dtype
fake_token = IpcqDmaToken(
src_addr=0, src_space="tcm",
dst_addr=0x40_000, dst_endpoint=comp._queue_pairs["W"]["peer"],
nbytes=64, handle_id="x",
shape=(8,), dtype="f32", # mismatched
sender_seq=0,
src_sip=0, src_cube=0, src_pe=1, src_direction="E",
)
comp.in_ports["host"].put(IpcqMetaArrival(token=fake_token))
env.run(until=5)
# recv expecting f16 → should raise on strict
recv_cmd = IpcqRecvCmd(direction="W", shape=(8,), dtype="f16", handle_id="r")
req = IpcqRequest(command=recv_cmd, done=env.event())
comp.in_ports["host"].put(req)
with pytest.raises(ValueError, match="dtype"):
env.run(until=req.done)
def test_strict_mode_shape_mismatch_raises():
env = simpy.Environment()
comp = _make(env, strict=True)
fake_token = IpcqDmaToken(
src_addr=0, src_space="tcm",
dst_addr=0x40_000, dst_endpoint=comp._queue_pairs["W"]["peer"],
nbytes=64, handle_id="x",
shape=(16,), dtype="f16", # wrong shape
sender_seq=0,
src_sip=0, src_cube=0, src_pe=1, src_direction="E",
)
comp.in_ports["host"].put(IpcqMetaArrival(token=fake_token))
env.run(until=5)
recv_cmd = IpcqRecvCmd(direction="W", shape=(8,), dtype="f16", handle_id="r")
req = IpcqRequest(command=recv_cmd, done=env.event())
comp.in_ports["host"].put(req)
with pytest.raises(ValueError, match="shape"):
env.run(until=req.done)
def test_non_strict_mode_silently_accepts():
env = simpy.Environment()
comp = _make(env, strict=False)
fake_token = IpcqDmaToken(
src_addr=0, src_space="tcm",
dst_addr=0x40_000, dst_endpoint=comp._queue_pairs["W"]["peer"],
nbytes=64, handle_id="x",
shape=(16,), dtype="f32", # both wrong
sender_seq=0,
src_sip=0, src_cube=0, src_pe=1, src_direction="E",
)
comp.in_ports["host"].put(IpcqMetaArrival(token=fake_token))
env.run(until=5)
recv_cmd = IpcqRecvCmd(direction="W", shape=(8,), dtype="f16", handle_id="r")
req = IpcqRequest(command=recv_cmd, done=env.event())
comp.in_ports["host"].put(req)
env.run(until=req.done)
assert req.done.triggered # no exception
@@ -0,0 +1,164 @@
"""Tests for CCL builtin topology generators (ADR-0023 D11)."""
import pytest
from kernbench.ccl.topologies import (
mesh_2d,
none,
resolve_topology,
ring_1d,
ring_1d_unidir,
tree_binary,
)
# ── ring_1d ──────────────────────────────────────────────────────────
def test_ring_1d_4_ranks():
assert ring_1d(0, 4) == {"E": 1, "W": 3}
assert ring_1d(1, 4) == {"E": 2, "W": 0}
assert ring_1d(2, 4) == {"E": 3, "W": 1}
assert ring_1d(3, 4) == {"E": 0, "W": 2}
def test_ring_1d_2_ranks():
assert ring_1d(0, 2) == {"E": 1, "W": 1}
assert ring_1d(1, 2) == {"E": 0, "W": 0}
# ── ring_1d_unidir ───────────────────────────────────────────────────
def test_ring_1d_unidir():
assert ring_1d_unidir(0, 4) == {"E": 1}
assert ring_1d_unidir(3, 4) == {"E": 0}
# ── mesh_2d ──────────────────────────────────────────────────────────
def test_mesh_2d_2x2():
# 2x2 mesh:
# 0 1
# 2 3
assert mesh_2d(0, 4) == {"N": 2, "S": 2, "E": 1, "W": 1}
assert mesh_2d(1, 4) == {"N": 3, "S": 3, "E": 0, "W": 0}
assert mesh_2d(2, 4) == {"N": 0, "S": 0, "E": 3, "W": 3}
assert mesh_2d(3, 4) == {"N": 1, "S": 1, "E": 2, "W": 2}
def test_mesh_2d_4x4():
# 4x4 mesh: rank = r*4 + c
n = mesh_2d(5, 16) # r=1, c=1
assert n["N"] == 1 # ((1-1)%4)*4 + 1
assert n["S"] == 9 # ((1+1)%4)*4 + 1
assert n["W"] == 4 # 1*4 + (1-1)%4
assert n["E"] == 6 # 1*4 + (1+1)%4
def test_mesh_2d_non_square_raises():
with pytest.raises(ValueError):
mesh_2d(0, 5)
# ── tree_binary ──────────────────────────────────────────────────────
def test_tree_binary_root():
n = tree_binary(0, 7)
assert "parent" not in n
assert n["child_left"] == 1
assert n["child_right"] == 2
def test_tree_binary_internal():
n = tree_binary(1, 7)
assert n["parent"] == 0
assert n["child_left"] == 3
assert n["child_right"] == 4
def test_tree_binary_leaf():
n = tree_binary(6, 7)
assert n["parent"] == 2
assert "child_left" not in n
assert "child_right" not in n
# ── none ─────────────────────────────────────────────────────────────
def test_none_returns_empty():
assert none(0, 4) == {}
assert none(3, 7) == {}
# ── resolve_topology ─────────────────────────────────────────────────
def test_resolve_topology_builtin():
fn = resolve_topology("ring_1d")
assert fn(0, 4) == {"E": 1, "W": 3}
def test_resolve_topology_unknown_raises():
with pytest.raises(ValueError):
resolve_topology("nonsense")
def test_resolve_topology_with_neighbors_override_pattern_a():
    """Algorithm module with neighbors() that mutates builtin map."""
    class FakeModule:
        @staticmethod
        def neighbors(rank, world_size, neighbor_map):
            if rank % 2 == 1:
                neighbor_map.pop("W", None)
            return neighbor_map

    fn = resolve_topology("ring_1d", algo_module=FakeModule)
    assert fn(0, 4) == {"E": 1, "W": 3}
    assert fn(1, 4) == {"E": 2}  # W removed
def test_resolve_topology_with_neighbors_override_pattern_b():
"""Algorithm module with neighbors() that returns brand-new dict."""
class FakeModule:
@staticmethod
def neighbors(rank, world_size, neighbor_map):
return {"E": (rank + 2) % world_size}
fn = resolve_topology("ring_1d", algo_module=FakeModule)
assert fn(0, 4) == {"E": 2}
assert fn(3, 4) == {"E": 1}
def test_resolve_topology_with_neighbors_override_pattern_c_none():
"""Algorithm module's neighbors() returns None → builtin used as-is."""
class FakeModule:
@staticmethod
def neighbors(rank, world_size, neighbor_map):
return None
fn = resolve_topology("ring_1d", algo_module=FakeModule)
assert fn(0, 4) == {"E": 1, "W": 3}
def test_resolve_topology_none_with_neighbors_override():
"""topology=none + custom neighbors() builds from scratch."""
class FakeModule:
@staticmethod
def neighbors(rank, world_size, neighbor_map):
assert neighbor_map == {} # builtin returned empty
return {"E": (rank + 1) % world_size}
fn = resolve_topology("none", algo_module=FakeModule)
assert fn(0, 4) == {"E": 1}
def test_resolve_topology_module_without_neighbors():
"""Algorithm module without neighbors() function works normally."""
class FakeModule:
pass # no neighbors attribute
fn = resolve_topology("ring_1d", algo_module=FakeModule)
assert fn(0, 4) == {"E": 1, "W": 3}
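Review note: the neighbor maps asserted in this file follow simple modular arithmetic. The sketch below is one way to reproduce the expected values; it is not the code in `kernbench.ccl.topologies`, and the `*_sketch` names are hypothetical. It assumes row-major rank numbering with wrap-around for the mesh and heap-style indexing for the tree, which is what the test comments suggest.

```python
import math


def ring_1d_sketch(rank, world_size):
    # Bidirectional ring: E is the next rank, W the previous, both wrapping.
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def mesh_2d_sketch(rank, world_size):
    # Square wrap-around mesh with rank = r * side + c.
    side = math.isqrt(world_size)
    if side * side != world_size:
        raise ValueError(f"world_size {world_size} is not a perfect square")
    r, c = divmod(rank, side)
    return {
        "N": ((r - 1) % side) * side + c,
        "S": ((r + 1) % side) * side + c,
        "E": r * side + (c + 1) % side,
        "W": r * side + (c - 1) % side,
    }


def tree_binary_sketch(rank, world_size):
    # Heap layout: parent at (rank - 1) // 2, children at 2*rank + 1 and 2*rank + 2.
    nbrs = {}
    if rank > 0:
        nbrs["parent"] = (rank - 1) // 2
    left, right = 2 * rank + 1, 2 * rank + 2
    if left < world_size:
        nbrs["child_left"] = left
    if right < world_size:
        nbrs["child_right"] = right
    return nbrs
```

In a 2-rank ring or a 2x2 mesh, opposite directions resolve to the same peer, which matches the `{"E": 1, "W": 1}` style expectations above.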
@@ -0,0 +1,73 @@
"""Cross-SIP PE_DMA routing tests (ADR-0023, topology v2).
Verifies that PE_DMA in one SIP can route to PE_DMA in another SIP via
the bidirectional pcie_ep ↔ fabric.switch0 path. Required for IPCQ
multi-SIP collectives.
"""
from __future__ import annotations
import pytest
from kernbench.policy.routing.router import PathRouter, RoutingError
from kernbench.topology.builder import resolve_topology
def _topo():
return resolve_topology("topology.yaml").topology_obj
# ── New edge ────────────────────────────────────────────────────────
def test_pcie_ep_to_switch_edge_exists():
"""The reverse pcie_ep → switch edge must exist for outbound traffic."""
topo = _topo()
pairs = {(e.src, e.dst) for e in topo.edges}
assert ("sip0.io0.pcie_ep", "fabric.switch0") in pairs
assert ("sip1.io0.pcie_ep", "fabric.switch0") in pairs
def test_existing_switch_to_pcie_ep_still_present():
"""Host→device path must remain intact (regression)."""
topo = _topo()
pairs = {(e.src, e.dst) for e in topo.edges}
assert ("fabric.switch0", "sip0.io0.pcie_ep") in pairs
assert ("fabric.switch0", "sip1.io0.pcie_ep") in pairs
# ── Cross-SIP path ──────────────────────────────────────────────────
def test_router_finds_cross_sip_pe_dma_path():
topo = _topo()
r = PathRouter(topo)
path = r.find_path("sip0.cube0.pe0", "sip1.cube0.pe0.pe_dma")
assert len(path) > 0
assert path[0] == "sip0.cube0.pe0.pe_dma"
assert path[-1] == "sip1.cube0.pe0.pe_dma"
assert "fabric.switch0" in path
def test_router_finds_cross_sip_far_pe_path():
"""Last cube of sip0 → first cube of sip1."""
topo = _topo()
r = PathRouter(topo)
path = r.find_path("sip0.cube15.pe7", "sip1.cube0.pe0.pe_dma")
assert "fabric.switch0" in path
# ── Regression: intra-SIP routing unchanged ─────────────────────────
def test_router_intra_sip_path_unchanged():
topo = _topo()
r = PathRouter(topo)
path = r.find_path("sip0.cube0.pe0", "sip0.cube0.pe1.pe_dma")
assert "fabric.switch0" not in path # should not detour through switch
def test_router_intra_cube_path_unchanged():
topo = _topo()
r = PathRouter(topo)
path = r.find_path("sip0.cube0.pe0", "sip0.cube0.hbm_ctrl")
assert "fabric.switch0" not in path
@@ -58,6 +58,69 @@ def test_math_exp():
assert np.allclose(result, np.exp(x))
def test_math_extra_ops():
"""Phase 2 replay of tl.maximum/minimum/fma/clamp/softmax."""
store = MemoryStore()
a = np.array([1.0, 5.0, 3.0], dtype=np.float32)
b = np.array([4.0, 2.0, 6.0], dtype=np.float32)
c = np.array([0.5, 0.5, 0.5], dtype=np.float32)
store.write("tcm", 0x0, a)
store.write("tcm", 0x100, b)
store.write("tcm", 0x200, c)
def _math(name, op, dst, inputs, axis=None):
return OpRecord(
t_start=float(dst), t_end=float(dst) + 1.0,
component_id="pe_math", op_kind="math", op_name=name,
params={
"op": op,
"input_addrs": [a for a, _ in inputs],
"input_shapes": [s for _, s in inputs],
"input_spaces": ["tcm"] * len(inputs),
"input_dtypes": ["f32"] * len(inputs),
"dst_addr": dst, "dst_space": "tcm",
"shape_out": (3,), "dtype": "f32", "axis": axis,
},
)
ops = [
_math("maximum", "maximum", 0x300, [(0x0, (3,)), (0x100, (3,))]),
_math("minimum", "minimum", 0x400, [(0x0, (3,)), (0x100, (3,))]),
_math("fma", "fma", 0x500, [(0x0, (3,)), (0x100, (3,)), (0x200, (3,))]),
_math("clamp", "clamp", 0x600, [(0x0, (3,)), (0x200, (3,)), (0x100, (3,))]),
]
DataExecutor(ops, store).run()
assert np.array_equal(store.read("tcm", 0x300), np.maximum(a, b))
assert np.array_equal(store.read("tcm", 0x400), np.minimum(a, b))
assert np.array_equal(store.read("tcm", 0x500), a * b + c)
assert np.array_equal(
store.read("tcm", 0x600), np.minimum(np.maximum(a, c), b)
)
def test_math_softmax():
store = MemoryStore()
x = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]], dtype=np.float32)
store.write("tcm", 0x0, x)
op = OpRecord(
t_start=0.0, t_end=1.0,
component_id="pe_math", op_kind="math", op_name="softmax",
params={
"op": "softmax",
"input_addrs": [0x0], "input_shapes": [(2, 3)],
"input_spaces": ["tcm"], "input_dtypes": ["f32"],
"dst_addr": 0x100, "dst_space": "tcm",
"shape_out": (2, 3), "dtype": "f32", "axis": -1,
},
)
DataExecutor([op], store).run()
expected = np.exp(x - x.max(axis=-1, keepdims=True))
expected /= expected.sum(axis=-1, keepdims=True)
assert np.allclose(store.read("tcm", 0x100), expected)
def test_math_add():
store = MemoryStore()
a = np.array([1.0, 2.0], dtype=np.float32)
@@ -0,0 +1,169 @@
"""Tests for IPCQ type schemas (ADR-0023 D2.5, D12, D14 F1)."""
import pytest
from kernbench.common.ipcq_types import (
IpcqCreditMetadata,
IpcqDmaToken,
IpcqEndpoint,
IpcqInitEntry,
IpcqInvalidDirection,
IpcqMetaArrival,
IpcqRecvCmd,
IpcqSendCmd,
)
from kernbench.runtime_api.kernel import IpcqInitMsg
# ── IpcqEndpoint ─────────────────────────────────────────────────────
def test_ipcq_endpoint_basic():
ep = IpcqEndpoint(
sip=0, cube=0, pe=1,
buffer_kind="tcm",
rx_base_pa=0x1000, rx_base_va=0,
n_slots=8, slot_size=4096,
)
assert ep.sip == 0
assert ep.buffer_kind == "tcm"
assert ep.n_slots == 8
def test_ipcq_endpoint_frozen():
ep = IpcqEndpoint(
sip=0, cube=0, pe=1, buffer_kind="tcm",
rx_base_pa=0x1000, rx_base_va=0, n_slots=8, slot_size=4096,
)
with pytest.raises(Exception): # FrozenInstanceError
ep.sip = 99 # type: ignore
# ── IpcqDmaToken ─────────────────────────────────────────────────────
def test_ipcq_dma_token():
ep = IpcqEndpoint(
sip=0, cube=0, pe=1, buffer_kind="tcm",
rx_base_pa=0x1000, rx_base_va=0, n_slots=8, slot_size=4096,
)
tok = IpcqDmaToken(
src_addr=0x500, src_space="tcm",
dst_addr=0x1000, dst_endpoint=ep,
nbytes=128, handle_id="h1",
sender_seq=0,
src_sip=0, src_cube=0, src_pe=0, src_direction="E",
)
assert tok.nbytes == 128
assert tok.dst_endpoint.buffer_kind == "tcm"
assert tok.data_op is True
# ── IpcqCreditMetadata ───────────────────────────────────────────────
def test_ipcq_credit_metadata():
cm = IpcqCreditMetadata(
consumer_seq=3, src_sip=0, src_cube=0, src_pe=1, src_direction="W",
)
assert cm.consumer_seq == 3
assert cm.src_direction == "W"
def test_ipcq_credit_metadata_frozen():
cm = IpcqCreditMetadata(
consumer_seq=3, src_sip=0, src_cube=0, src_pe=1, src_direction="W",
)
with pytest.raises(Exception):
cm.consumer_seq = 99 # type: ignore
# ── IpcqMetaArrival ──────────────────────────────────────────────────
def test_ipcq_meta_arrival():
ep = IpcqEndpoint(
sip=0, cube=0, pe=1, buffer_kind="tcm",
rx_base_pa=0x1000, rx_base_va=0, n_slots=8, slot_size=4096,
)
tok = IpcqDmaToken(
src_addr=0x500, src_space="tcm",
dst_addr=0x1000, dst_endpoint=ep,
nbytes=128, handle_id="h1",
sender_seq=0,
src_sip=0, src_cube=0, src_pe=0, src_direction="E",
)
ma = IpcqMetaArrival(token=tok)
assert ma.token.sender_seq == 0
assert ma.token.src_direction == "E"
# ── IpcqSendCmd / IpcqRecvCmd ────────────────────────────────────────
def test_ipcq_send_cmd():
cmd = IpcqSendCmd(
direction="E", src_addr=0x100, src_space="tcm",
nbytes=64, shape=(8, 8), dtype="f16", handle_id="s1",
)
assert cmd.direction == "E"
assert cmd.data_op is True
def test_ipcq_recv_cmd_default_return_slot():
cmd = IpcqRecvCmd(direction="W", shape=(8, 8), dtype="f16", handle_id="r1")
assert cmd.recv_mode == "return_slot"
assert cmd.dst_addr == 0
def test_ipcq_recv_cmd_round_robin():
cmd = IpcqRecvCmd(direction=None, shape=(8, 8), dtype="f16", handle_id="r2")
assert cmd.direction is None
def test_ipcq_recv_cmd_copy_to_dst():
cmd = IpcqRecvCmd(
direction="W", recv_mode="copy_to_dst",
dst_addr=0x2000, dst_space="hbm",
shape=(8, 8), dtype="f16", handle_id="r3",
)
assert cmd.recv_mode == "copy_to_dst"
assert cmd.dst_addr == 0x2000
# ── IpcqInvalidDirection ─────────────────────────────────────────────
def test_ipcq_invalid_direction():
with pytest.raises(IpcqInvalidDirection):
raise IpcqInvalidDirection("direction 'X' not installed")
# ── IpcqInitEntry / IpcqInitMsg ──────────────────────────────────────
def test_ipcq_init_entry_and_msg():
import simpy
env = simpy.Environment()
credit_store = simpy.Store(env)
ep = IpcqEndpoint(
sip=0, cube=0, pe=1, buffer_kind="tcm",
rx_base_pa=0x1000, rx_base_va=0, n_slots=8, slot_size=4096,
)
entry = IpcqInitEntry(
direction="E", peer=ep,
my_rx_base_pa=0x2000, my_rx_base_va=0,
n_slots=8, slot_size=4096,
peer_credit_store=credit_store,
)
msg = IpcqInitMsg(
correlation_id="c1", request_id="r1",
target_sips=(0,), target_cubes=(0,), target_pe=0,
entries=(entry,),
backpressure_mode="sleep",
buffer_kind="tcm",
credit_size_bytes=16,
)
assert msg.entries[0].direction == "E"
assert msg.entries[0].peer.sip == 0
assert msg.credit_size_bytes == 16
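Review note: the two `*_frozen` tests catch a bare `Exception`, with a comment naming `FrozenInstanceError`. If `IpcqEndpoint` and `IpcqCreditMetadata` are frozen dataclasses (an assumption based on that comment, not confirmed by this diff), the narrower exception could be asserted directly. A minimal standalone sketch, using a hypothetical `EndpointSketch` stand-in rather than the real type:

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class EndpointSketch:
    """Hypothetical stand-in for a frozen endpoint record."""
    sip: int
    cube: int
    pe: int


ep = EndpointSketch(sip=0, cube=0, pe=1)
try:
    ep.sip = 99  # assignment on a frozen dataclass instance is rejected
except FrozenInstanceError as exc:
    caught = type(exc).__name__
else:
    caught = None
```

In the tests this would correspond to `pytest.raises(FrozenInstanceError)`, which fails loudly if the class ever stops being frozen for an unrelated reason.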
@@ -0,0 +1,206 @@
"""Tests for PE_DMA IPCQ handling (ADR-0023 D8 + D9 atomic).
PE_DMA gains two new behaviors:
1. Outbound: when it receives an IpcqDmaToken from local PE_IPCQ, it
forwards it through the fabric (next-hop port) toward the peer
PE_DMA.
2. Inbound: when it receives a Transaction wrapping an IpcqDmaToken,
it performs MemoryStore.write at dst_endpoint.buffer_kind/dst_addr
and forwards IpcqMetaArrival(token) to local PE_IPCQ, both in the
SAME SimPy step (I6 MUST).
"""
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Any
import numpy as np
import simpy
from kernbench.common.ipcq_types import (
IpcqDmaToken,
IpcqEndpoint,
IpcqMetaArrival,
)
from kernbench.components.builtin.pe_dma import PeDmaComponent
from kernbench.sim_engine.memory_store import MemoryStore
from kernbench.sim_engine.transaction import Transaction
from kernbench.topology.types import Node
# ── Mock context ─────────────────────────────────────────────────────
@dataclass
class _MockResolver:
    pass


@dataclass
class _MockRouter:
    """Returns a fixed two-hop path for any (src, dst)."""

    def find_path(self, src: str, dst: str) -> list[str]:
        return [src, "fake_router", dst]


@dataclass
class _MockCtx:
    router: Any = field(default_factory=_MockRouter)
    resolver: Any = field(default_factory=_MockResolver)
    memory_store: Any = None
    edge_map: dict = field(default_factory=dict)
    spec: dict = field(default_factory=dict)
    op_logger: Any = None

    def compute_drain_ns(self, path: list[str], nbytes: int) -> float:
        return 0.0

    def get_shared_resource(self, env, key, capacity=1):
        return simpy.Resource(env, capacity=capacity)
def _make_pe_dma(
env: simpy.Environment, pe_prefix: str, store: MemoryStore | None = None,
) -> PeDmaComponent:
node = Node(
id=f"{pe_prefix}.pe_dma",
kind="pe_dma",
impl="builtin.pe_dma",
attrs={},
pos_mm=None,
)
ctx = _MockCtx(memory_store=store)
comp = PeDmaComponent(node, ctx=ctx)
comp.in_ports["host"] = simpy.Store(env)
comp.out_ports["fake_router"] = simpy.Store(env)
comp.out_ports[f"{pe_prefix}.pe_ipcq"] = simpy.Store(env)
comp.start(env)
return comp
def _make_endpoint(sip=0, cube=0, pe=1, buffer_kind="tcm") -> IpcqEndpoint:
return IpcqEndpoint(
sip=sip, cube=cube, pe=pe,
buffer_kind=buffer_kind,
rx_base_pa=0x10_000, rx_base_va=0,
n_slots=4, slot_size=4096,
)
# ── Outbound: PE_IPCQ → PE_DMA → fabric ──────────────────────────────
def test_outbound_forwards_token_through_fabric():
env = simpy.Environment()
store = MemoryStore()
src_arr = np.arange(16, dtype=np.float16)
store.write("tcm", 0x500, src_arr)
src = _make_pe_dma(env, "sip0.cube0.pe0", store=store)
peer = _make_endpoint(pe=1)
token = IpcqDmaToken(
src_addr=0x500, src_space="tcm",
dst_addr=0x10_000, dst_endpoint=peer,
nbytes=32, handle_id="t1",
shape=(16,), dtype="f16",
sender_seq=0,
src_sip=0, src_cube=0, src_pe=0, src_direction="E",
)
src.in_ports["host"].put(token)
env.run(until=10)
# The token should be wrapped in a Transaction and forwarded to "fake_router"
fab = src.out_ports["fake_router"]
assert len(fab.items) == 1
txn = fab.items[0]
assert isinstance(txn, Transaction)
assert isinstance(txn.request, IpcqDmaToken)
assert txn.request.dst_addr == 0x10_000
# ── Inbound: PE_DMA → MemoryStore.write + IpcqMetaArrival forward ───
def test_inbound_writes_memory_and_forwards_metadata_atomically():
env = simpy.Environment()
store = MemoryStore()
# Sender wrote source data to MemoryStore
src_arr = np.arange(16, dtype=np.float16) + 100
store.write("tcm", 0x500, src_arr)
dst = _make_pe_dma(env, "sip0.cube0.pe1", store=store)
peer = _make_endpoint(sip=0, cube=0, pe=1, buffer_kind="tcm")
token = IpcqDmaToken(
src_addr=0x500, src_space="tcm",
dst_addr=0x10_000, dst_endpoint=peer,
nbytes=32, handle_id="t1",
shape=(16,), dtype="f16",
sender_seq=0,
src_sip=0, src_cube=0, src_pe=0, src_direction="E",
)
# Wrap in a Transaction with this PE_DMA as the terminal
done = env.event()
txn = Transaction(
request=token, path=["fake_router", "sip0.cube0.pe1.pe_dma"],
step=1, nbytes=32, done=done,
)
dst.in_ports["host"].put(txn)
env.run(until=done)
# 1. MemoryStore should have the data at dst_addr
arrived = store.read("tcm", 0x10_000, shape=(16,), dtype="f16")
assert np.array_equal(arrived, src_arr)
# 2. IpcqMetaArrival should be in PE_IPCQ port
ipcq_port = dst.out_ports["sip0.cube0.pe1.pe_ipcq"]
assert len(ipcq_port.items) == 1
arrival = ipcq_port.items[0]
assert isinstance(arrival, IpcqMetaArrival)
assert arrival.token.sender_seq == 0
assert arrival.token.src_pe == 0
def test_inbound_no_yield_between_write_and_metadata_forward():
"""Soft check: when multiple inbound IPCQ tokens arrive, the order of
MemoryStore writes and IpcqMetaArrival forwards is preserved (no
interleaving from extraneous yields).
"""
env = simpy.Environment()
store = MemoryStore()
for i in range(3):
store.write("tcm", 0x500 + i * 0x100, np.arange(8, dtype=np.float16) + i * 10)
dst = _make_pe_dma(env, "sip0.cube0.pe1", store=store)
peer = _make_endpoint(sip=0, cube=0, pe=1)
for i in range(3):
token = IpcqDmaToken(
src_addr=0x500 + i * 0x100, src_space="tcm",
dst_addr=0x10_000 + i * 0x100, dst_endpoint=peer,
nbytes=16, handle_id=f"t{i}",
shape=(8,), dtype="f16",
sender_seq=i,
src_sip=0, src_cube=0, src_pe=0, src_direction="E",
)
done = env.event()
txn = Transaction(
request=token, path=["fake_router", "sip0.cube0.pe1.pe_dma"],
step=1, nbytes=16, done=done,
)
dst.in_ports["host"].put(txn)
env.run(until=done)
# Check ordering of arrivals
ipcq_port = dst.out_ports["sip0.cube0.pe1.pe_ipcq"]
arrivals = list(ipcq_port.items)
assert [a.token.sender_seq for a in arrivals] == [0, 1, 2]
# Memory must be in order
for i in range(3):
arr = store.read("tcm", 0x10_000 + i * 0x100, shape=(8,), dtype="f16")
assert arr[0] == i * 10
@@ -0,0 +1,317 @@
"""Tests for PE_IPCQ component (ADR-0023 D1, D2, D9, D14).
These tests use a mock setup: PeIpcqComponent is instantiated directly,
its in_ports/out_ports are wired to plain SimPy Stores, and IpcqInitMsg
is delivered via a simple dummy transaction wrapper. PE_DMA is mocked
as a Store that we drain manually.
"""
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Any
import pytest
import simpy
from kernbench.common.ipcq_types import (
IpcqCreditMetadata,
IpcqDmaToken,
IpcqEndpoint,
IpcqInitEntry,
IpcqInvalidDirection,
IpcqMetaArrival,
IpcqRecvCmd,
IpcqRequest,
IpcqSendCmd,
)
from kernbench.components.builtin.pe_ipcq import PeIpcqComponent
from kernbench.runtime_api.kernel import IpcqInitMsg
from kernbench.topology.types import Node
# ── Fakes / fixtures ─────────────────────────────────────────────────
@dataclass
class _FakeTxn:
request: Any
done: simpy.Event
result_data: dict[str, Any] = field(default_factory=dict)
def _make_pe_ipcq(env: simpy.Environment, pe_prefix: str = "sip0.cube0.pe0") -> PeIpcqComponent:
"""Create a PeIpcqComponent with mocked ports.
Returns the component with:
- in_ports["host"] for posting IpcqInitMsg / IpcqRequest
- out_ports["<pe_prefix>.pe_dma"] for outgoing IpcqDmaToken (drain manually)
- The component is started.
"""
node = Node(
id=f"{pe_prefix}.pe_ipcq",
kind="pe_ipcq",
impl="builtin.pe_ipcq",
attrs={},
pos_mm=None,
)
comp = PeIpcqComponent(node, ctx=None)
comp.in_ports["host"] = simpy.Store(env)
comp.out_ports[f"{pe_prefix}.pe_dma"] = simpy.Store(env)
comp.start(env)
return comp


def _install_two_neighbors(env: simpy.Environment, comp: PeIpcqComponent) -> tuple[simpy.Store, simpy.Store]:
    """Install E and W neighbor entries with peer_credit_stores.

    Returns (peer_e_credit_store, peer_w_credit_store) i.e. the stores
    that the component will put credits into when it receives data.
    """
    peer_e_credit = simpy.Store(env)
    peer_w_credit = simpy.Store(env)
    ep_e = IpcqEndpoint(
        sip=0, cube=0, pe=1,
        buffer_kind="tcm",
        rx_base_pa=0x10_000, rx_base_va=0,
        n_slots=4, slot_size=4096,
    )
    ep_w = IpcqEndpoint(
        sip=0, cube=0, pe=2,
        buffer_kind="tcm",
        rx_base_pa=0x20_000, rx_base_va=0,
        n_slots=4, slot_size=4096,
    )
    init_msg = IpcqInitMsg(
        correlation_id="t", request_id="t",
        target_sips=(0,), target_cubes=(0,), target_pe=0,
        entries=(
            IpcqInitEntry(
                direction="E", peer=ep_e,
                my_rx_base_pa=0x30_000, my_rx_base_va=0,
                n_slots=4, slot_size=4096,
                peer_credit_store=peer_e_credit,
            ),
            IpcqInitEntry(
                direction="W", peer=ep_w,
                my_rx_base_pa=0x40_000, my_rx_base_va=0,
                n_slots=4, slot_size=4096,
                peer_credit_store=peer_w_credit,
            ),
        ),
        backpressure_mode="sleep",
        buffer_kind="tcm",
        credit_size_bytes=16,
    )
    done = env.event()
    comp.in_ports["host"].put(_FakeTxn(request=init_msg, done=done))
    env.run(until=done)
    return peer_e_credit, peer_w_credit
# ── send: forward token to PE_DMA ────────────────────────────────────
def test_send_forwards_token_to_pe_dma():
    env = simpy.Environment()
    comp = _make_pe_ipcq(env)
    _install_two_neighbors(env, comp)
    pe_dma = comp.out_ports["sip0.cube0.pe0.pe_dma"]
    cmd = IpcqSendCmd(
        direction="E", src_addr=0x500, src_space="tcm",
        nbytes=128, shape=(8, 8), dtype="f16", handle_id="s1",
    )
    done = env.event()
    comp.in_ports["host"].put(IpcqRequest(command=cmd, done=done))
    env.run(until=done)
    # Token should be in PE_DMA's mock store
    assert len(pe_dma.items) == 1
    token = pe_dma.items[0]
    assert isinstance(token, IpcqDmaToken)
    assert token.dst_addr == 0x10_000  # peer.rx_base_pa + 0
    assert token.nbytes == 128
    assert token.sender_seq == 0
    assert token.src_direction == "E"


def test_send_advances_my_head_and_slot_addresses():
    env = simpy.Environment()
    comp = _make_pe_ipcq(env)
    _install_two_neighbors(env, comp)
    pe_dma = comp.out_ports["sip0.cube0.pe0.pe_dma"]
    for i in range(3):
        cmd = IpcqSendCmd(
            direction="E", src_addr=0x500 + i,
            src_space="tcm", nbytes=64,
            shape=(8,), dtype="f16", handle_id=f"s{i}",
        )
        done = env.event()
        comp.in_ports["host"].put(IpcqRequest(command=cmd, done=done))
        env.run(until=done)
    tokens = pe_dma.items
    assert [t.sender_seq for t in tokens] == [0, 1, 2]
    # slot addresses: peer.rx_base_pa (0x10_000) + i * slot_size (4096)
    assert [t.dst_addr for t in tokens] == [0x10_000, 0x11_000, 0x12_000]


def test_send_invalid_direction_raises():
    env = simpy.Environment()
    comp = _make_pe_ipcq(env)
    _install_two_neighbors(env, comp)
    cmd = IpcqSendCmd(
        direction="N", src_addr=0x100, src_space="tcm",
        nbytes=64, shape=(8,), dtype="f16", handle_id="s_bad",
    )
    done = env.event()
    comp.in_ports["host"].put(IpcqRequest(command=cmd, done=done))
    with pytest.raises(IpcqInvalidDirection):
        env.run(until=done)
# ── recv: wait for data and return slot address ─────────────────────
def test_recv_waits_until_metadata_arrives():
    env = simpy.Environment()
    comp = _make_pe_ipcq(env)
    _install_two_neighbors(env, comp)
    recv_cmd = IpcqRecvCmd(
        direction="W", shape=(8,), dtype="f16", handle_id="r1",
    )
    recv_req = IpcqRequest(command=recv_cmd, done=env.event())
    comp.in_ports["host"].put(recv_req)
    # Run a bit; recv should not complete yet (no data)
    env.run(until=10)
    assert not recv_req.done.triggered
    # Simulate metadata arrival from peer (W direction = sender pe=2)
    fake_token = IpcqDmaToken(
        src_addr=0, src_space="tcm",
        dst_addr=0x40_000, dst_endpoint=comp._queue_pairs["W"]["peer"],
        nbytes=64, handle_id="x",
        shape=(8,), dtype="f16",
        sender_seq=0,
        src_sip=0, src_cube=0, src_pe=2, src_direction="E",
    )
    comp.in_ports["host"].put(IpcqMetaArrival(token=fake_token))
    env.run(until=recv_req.done)
    assert recv_req.result_data["src_addr"] == 0x40_000  # my_rx_base_pa for W
    assert recv_req.result_data["direction"] == "W"


def test_recv_returns_immediately_if_data_already_present():
    env = simpy.Environment()
    comp = _make_pe_ipcq(env)
    _install_two_neighbors(env, comp)
    # Pre-arrive metadata
    fake_token = IpcqDmaToken(
        src_addr=0, src_space="tcm",
        dst_addr=0x40_000, dst_endpoint=comp._queue_pairs["W"]["peer"],
        nbytes=64, handle_id="x",
        shape=(8,), dtype="f16",
        sender_seq=0,
        src_sip=0, src_cube=0, src_pe=2, src_direction="E",
    )
    comp.in_ports["host"].put(IpcqMetaArrival(token=fake_token))
    env.run(until=5)
    recv_cmd = IpcqRecvCmd(
        direction="W", shape=(8,), dtype="f16", handle_id="r1",
    )
    recv_req = IpcqRequest(command=recv_cmd, done=env.event())
    comp.in_ports["host"].put(recv_req)
    env.run(until=recv_req.done)
    assert recv_req.result_data["src_addr"] == 0x40_000


def test_recv_round_robin_picks_arrived_direction():
    env = simpy.Environment()
    comp = _make_pe_ipcq(env)
    _install_two_neighbors(env, comp)
    # Pre-arrive metadata only on W direction
    fake_token = IpcqDmaToken(
        src_addr=0, src_space="tcm",
        dst_addr=0x40_000, dst_endpoint=comp._queue_pairs["W"]["peer"],
        nbytes=64, handle_id="x",
        shape=(8,), dtype="f16",
        sender_seq=0,
        src_sip=0, src_cube=0, src_pe=2, src_direction="E",
    )
    comp.in_ports["host"].put(IpcqMetaArrival(token=fake_token))
    env.run(until=5)
    # recv() with no direction → round-robin
    recv_cmd = IpcqRecvCmd(
        direction=None, shape=(8,), dtype="f16", handle_id="r_rr",
    )
    recv_req = IpcqRequest(command=recv_cmd, done=env.event())
    comp.in_ports["host"].put(recv_req)
    env.run(until=recv_req.done)
    assert recv_req.result_data["direction"] == "W"
# ── backpressure: send blocks when full ──────────────────────────────
def test_send_blocks_when_peer_slot_full():
    env = simpy.Environment()
    comp = _make_pe_ipcq(env)
    _install_two_neighbors(env, comp)
    # n_slots = 4, so 4 sends should succeed; 5th blocks
    for i in range(4):
        cmd = IpcqSendCmd(
            direction="E", src_addr=0x500, src_space="tcm",
            nbytes=64, shape=(8,), dtype="f16", handle_id=f"s{i}",
        )
        done = env.event()
        comp.in_ports["host"].put(IpcqRequest(command=cmd, done=done))
        env.run(until=done)
    # 5th send: should not complete
    cmd5 = IpcqSendCmd(
        direction="E", src_addr=0x500, src_space="tcm",
        nbytes=64, shape=(8,), dtype="f16", handle_id="s5",
    )
    req5 = IpcqRequest(command=cmd5, done=env.event())
    comp.in_ports["host"].put(req5)
    env.run(until=20)
    assert not req5.done.triggered
    # Send a credit return: peer (E direction, pe=1) consumed slot 0
    credit = IpcqCreditMetadata(
        consumer_seq=1,  # peer consumed up to my_tail=1
        src_sip=0, src_cube=0, src_pe=1, src_direction="W",  # peer's view
    )
    comp.credit_inbox.put(credit)
    env.run(until=req5.done)
    assert req5.done.triggered
# ── Init test ────────────────────────────────────────────────────────
def test_init_installs_neighbors():
    env = simpy.Environment()
    comp = _make_pe_ipcq(env)
    _install_two_neighbors(env, comp)
    assert "E" in comp._queue_pairs
    assert "W" in comp._queue_pairs
    assert comp._queue_pairs["E"]["peer"].pe == 1
    assert comp._queue_pairs["W"]["peer"].pe == 2
    assert comp._queue_pairs["E"]["my_head"] == 0
    assert comp._queue_pairs["E"]["peer_tail_cache"] == 0
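
The slot-address and backpressure assertions in the tests above can be read against a small sender-side ring-buffer model. This is only a sketch and not the PeIpcqComponent implementation: `SenderQueue` and its methods are hypothetical names, and the modulo wrap of slot addresses is an assumption (the tests only check the first three slots). The fields `my_head` and `peer_tail_cache` borrow the key names asserted in test_init_installs_neighbors.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SenderQueue:
    """Toy sender-side view of one direction's ring buffer (hypothetical)."""

    peer_rx_base_pa: int
    n_slots: int
    slot_size: int
    my_head: int = 0          # next sequence number to send
    peer_tail_cache: int = 0  # highest consumer_seq learned from a credit

    def can_send(self) -> bool:
        # The ring is full when n_slots sends are still unconsumed by the peer.
        return self.my_head - self.peer_tail_cache < self.n_slots

    def send(self) -> tuple[int, int]:
        """Return (sender_seq, dst_addr) for the next slot, or raise if full."""
        if not self.can_send():
            raise BlockingIOError("ring full; wait for a credit return")
        seq = self.my_head
        # Assumption: slots wrap modulo n_slots.
        dst_addr = self.peer_rx_base_pa + (seq % self.n_slots) * self.slot_size
        self.my_head += 1
        return seq, dst_addr

    def on_credit(self, consumer_seq: int) -> None:
        # A credit tells the sender how far the peer has consumed.
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)
```

With `SenderQueue(0x10_000, 4, 4096)`, four sends fill the ring, a fifth is refused, and a credit with `consumer_seq=1` frees one slot, which mirrors the sequence in test_send_blocks_when_peer_slot_full.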
+80
@@ -0,0 +1,80 @@
"""Tests for recv_mode='copy_to_dst' (ADR-0023 D9.5)."""
from __future__ import annotations
import numpy as np
def test_recv_copy_to_dst_via_simpy_runner():
    """Run a kernel that uses tl.recv(..., dst_addr=..., dst_space=...).

    Verify the data is moved to the dst location after recv.
    """
    import importlib
    from kernbench.policy.placement.dp import DPPolicy
    from kernbench.runtime_api.bench_runner import run_bench
    from kernbench.runtime_api.types import resolve_device
    from kernbench.sim_engine.engine import GraphEngine
    from kernbench.topology.builder import resolve_topology
    from kernbench.common.pe_commands import TensorHandle

    def kernel(t_ptr, n_elem, dst_buf_addr, tl):
        rank = tl.program_id(axis=0)
        ws = tl.num_programs(axis=0)
        nbytes = n_elem * 2
        # Each PE sends own data, then recv into a custom dst slot
        current = TensorHandle(
            id="loc", addr=t_ptr + rank * nbytes,
            shape=(n_elem,), dtype="f16",
            nbytes=nbytes, data=None, space="hbm",
        )
        tl.send(dir="E", src=current)
        # copy_to_dst: move into a per-rank scratch HBM addr
        recv = tl.recv(
            dir="W", shape=(n_elem,), dtype="f16",
            dst_addr=dst_buf_addr + rank * nbytes,
            dst_space="hbm",
        )
        # Sanity: recv handle should now point to our dst addr
        assert recv.addr == dst_buf_addr + rank * nbytes
        assert recv.space == "hbm"

    topo = resolve_topology("topology.yaml")

    def run(torch):
        plan = torch.install_ipcq(
            algorithm="ring_allreduce_tcm", world_size_override=8,
        )
        a = torch.zeros(
            (1, 8 * 8),
            dtype="f16",
            dp=DPPolicy(
                sip="replicate", cube="replicate", pe="column_wise",
                num_sips=1, num_cubes=1,
            ),
            name="copy_in",
        )
        store = torch.engine.memory_store
        base = a._handle.va_base or a._handle.shards[0].pa
        nbytes = 8 * 2
        for r in range(8):
            store.write("hbm", base + r * nbytes,
                        np.full((8,), float(r + 1), dtype=np.float16))
        # Use a separate dst region (synthetic addresses)
        dst_buf = 0xC0FFEE_0000
        torch.launch("ring_allreduce_tcm", kernel, a, 8, dst_buf)
        # After the kernel, dst_buf + r*16 should contain rank (r-1)%8's data
        for r in range(8):
            arr = store.read("hbm", dst_buf + r * nbytes, shape=(8,), dtype="f16")
            expected = float(((r - 1) % 8) + 1)
            assert np.allclose(arr, expected), f"rank {r}: got {arr}, expected {expected}"

    result = run_bench(
        topology=topo, bench_fn=run,
        device=resolve_device("all"),
        engine_factory=lambda t, d: GraphEngine(
            getattr(t, "topology_obj", t), enable_data=True
        ),
    )
    assert result.completion.ok
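
The expected value in the test above, `((r - 1) % 8) + 1`, follows from a single ring step in which every rank sends east and receives from its west neighbor. A minimal pure-Python sketch of that one step is shown here; `ring_shift_east` is a hypothetical helper, not part of kernbench, and it assumes that rank r's "E" send lands at rank (r + 1) % n, which is what the test's expectation implies.

```python
def ring_shift_east(sent):
    """Given sent[r] = value rank r sends east, return what each rank receives.

    Assumption: rank r's west neighbor is rank (r - 1) % n.
    """
    n = len(sent)
    return [sent[(r - 1) % n] for r in range(n)]


# Same fill pattern as the test: rank r holds the value r + 1.
sent = [float(r + 1) for r in range(8)]
received = ring_shift_east(sent)
```

Rank 0 wraps around and receives rank 7's value, which is the only case where the modulo matters.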
+136
@@ -0,0 +1,136 @@
"""Tests for the pytorch-compat Tensor API extensions.
Covers the new ``torch.from_numpy`` factory and ``Tensor.numpy``,
``Tensor.copy_`` methods used by the unified ``ccl_allreduce`` bench.
"""
from __future__ import annotations
import numpy as np
import pytest
from kernbench.policy.placement.dp import DPPolicy
from kernbench.runtime_api.bench_runner import run_bench
from kernbench.runtime_api.types import resolve_device
from kernbench.sim_engine.engine import GraphEngine
from kernbench.topology.builder import resolve_topology
def _engine_factory(topology, device):
    return GraphEngine(getattr(topology, "topology_obj", topology), enable_data=True)


def _run_with(bench_body):
    topo = resolve_topology("topology.yaml")
    return run_bench(
        topology=topo,
        bench_fn=bench_body,
        device=resolve_device("all"),
        engine_factory=_engine_factory,
    )
# ── from_numpy ──────────────────────────────────────────────────────
def test_from_numpy_creates_host_tensor():
    """torch.from_numpy returns a kernbench Tensor with the array stored
    in its host buffer (not deployed to any PE)."""
    def body(torch):
        arr = np.arange(8, dtype=np.float16).reshape(1, 8)
        h = torch.from_numpy(arr)
        # Host tensor has shape/dtype matching the array.
        assert h.shape == (1, 8)
        assert h.dtype == "f16"
        # numpy() round-trips the host buffer.
        assert np.array_equal(h.numpy(), arr)
        # No deploy → no real shards.
        assert h._handle is None
        # Submit a no-op so run_bench has at least one handle.
        torch.zeros((1, 8), dtype="f16",
                    dp=DPPolicy(sip="replicate", cube="replicate", pe="replicate",
                                num_sips=1, num_cubes=1, num_pes=1),
                    name="dummy")

    _run_with(body)
# ── single-PE replicated tensor ─────────────────────────────────────
def test_copy_and_numpy_single_pe():
    """copy_ from a numpy array, then numpy() round-trips correctly on
    a single-PE (no real sharding) tensor."""
    def body(torch):
        dp = DPPolicy(sip="replicate", cube="replicate", pe="replicate",
                      num_sips=1, num_cubes=1, num_pes=1)
        t = torch.zeros((1, 16), dtype="f16", dp=dp, name="t")
        src = np.arange(16, dtype=np.float16).reshape(1, 16)
        t.copy_(torch.from_numpy(src))
        gathered = t.numpy()
        assert gathered.shape == (1, 16)
        assert np.array_equal(gathered, src)

    _run_with(body)
# ── multi-PE column-wise sharding (1 cube) ──────────────────────────
def test_copy_and_numpy_multi_pe_column_wise():
    """copy_ splits across 8 PEs in one cube, numpy() reassembles."""
    def body(torch):
        n_pe = 8
        dp = DPPolicy(sip="replicate", cube="replicate", pe="column_wise",
                      num_sips=1, num_cubes=1, num_pes=n_pe)
        t = torch.zeros((1, n_pe * 4), dtype="f16", dp=dp, name="t")
        src = np.arange(n_pe * 4, dtype=np.float16).reshape(1, n_pe * 4)
        t.copy_(torch.from_numpy(src))
        gathered = t.numpy()
        assert gathered.shape == (1, n_pe * 4)
        assert np.array_equal(gathered, src)
        # Sanity: there really were 8 shards.
        assert len(t._handle.shards) == n_pe

    _run_with(body)
# ── multi-cube sharding ─────────────────────────────────────────────
def test_copy_and_numpy_multi_cube():
    """copy_ across 2 cubes (16 PEs total), numpy() reassembles."""
    def body(torch):
        n_pe_per_cube = 8
        n_cubes = 2
        total = n_cubes * n_pe_per_cube  # 16
        dp = DPPolicy(sip="replicate", cube="column_wise", pe="column_wise",
                      num_sips=1, num_cubes=n_cubes)
        t = torch.zeros((1, total * 4), dtype="f16", dp=dp, name="t")
        src = np.arange(total * 4, dtype=np.float16).reshape(1, total * 4)
        t.copy_(torch.from_numpy(src))
        gathered = t.numpy()
        assert np.array_equal(gathered, src)
        assert len(t._handle.shards) == total

    _run_with(body)
# ── shape mismatch raises ───────────────────────────────────────────
def test_copy_shape_mismatch_raises():
    """copy_ with mismatched shapes raises ValueError."""
    def body(torch):
        dp = DPPolicy(sip="replicate", cube="replicate", pe="replicate",
                      num_sips=1, num_cubes=1, num_pes=1)
        t = torch.zeros((1, 8), dtype="f16", dp=dp, name="t")
        src = np.zeros((1, 16), dtype=np.float16)
        with pytest.raises(ValueError, match="copy_ shape mismatch"):
            t.copy_(torch.from_numpy(src))

    _run_with(body)
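
The column-wise tests above check that `copy_` followed by `numpy()` is a lossless split-and-gather. The idea can be illustrated without kernbench using plain lists; the two helpers below are hypothetical and assume equal-width contiguous shards, which matches the shapes used in the tests (`n_pe * 4` columns over `n_pe` shards) but is not confirmed as the library's actual layout.

```python
def split_column_wise(row, n_shards):
    """Split one row into n_shards contiguous, equal-width pieces."""
    if len(row) % n_shards != 0:
        raise ValueError("row length must be divisible by n_shards")
    width = len(row) // n_shards
    return [row[i * width:(i + 1) * width] for i in range(n_shards)]


def gather_column_wise(shards):
    """Concatenate shards back into a single row, in shard order."""
    return [x for shard in shards for x in shard]


row = list(range(8 * 4))             # same size as the 8-PE test: 32 columns
shards = split_column_wise(row, 8)   # 8 shards of 4 columns each
roundtrip = gather_column_wise(shards)
```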
+95
@@ -0,0 +1,95 @@
"""Tests for tl.send / tl.recv API (ADR-0023 D4 + D9.5)."""
from __future__ import annotations
from typing import Any
import simpy
from greenlet import greenlet
from kernbench.common.ipcq_types import (
IpcqRecvCmd,
IpcqRequest,
IpcqSendCmd,
)
from kernbench.triton_emu.tl_context import TLContext
# ── Command-list mode (no runner) ────────────────────────────────────
def test_tl_send_command_list_mode():
    tl = TLContext(pe_id=0, num_programs=4, dispatch_cycles=0)
    tl.send(dir="E", src_addr=0x500, nbytes=64, shape=(8,), dtype="f16")
    cmds = tl.commands
    sends = [c for c in cmds if isinstance(c, IpcqSendCmd)]
    assert len(sends) == 1
    assert sends[0].direction == "E"
    assert sends[0].src_addr == 0x500
    assert sends[0].nbytes == 64


def test_tl_recv_command_list_mode():
    tl = TLContext(pe_id=0, num_programs=4, dispatch_cycles=0)
    handle = tl.recv(dir="W", shape=(8,), dtype="f16")
    cmds = tl.commands
    recvs = [c for c in cmds if isinstance(c, IpcqRecvCmd)]
    assert len(recvs) == 1
    assert recvs[0].direction == "W"
    # In command-list mode (no runner), tl.recv returns a placeholder
    # TensorHandle (no actual data movement happens until SimPy)
    assert handle.shape == (8,)
    assert handle.dtype == "f16"


def test_tl_recv_round_robin_no_dir():
    tl = TLContext(pe_id=0, num_programs=4, dispatch_cycles=0)
    tl.recv(shape=(8,), dtype="f16")
    cmds = tl.commands
    recvs = [c for c in cmds if isinstance(c, IpcqRecvCmd)]
    assert recvs[0].direction is None
# ── Runner mode (greenlet) ──────────────────────────────────────────
class _StubRunner:
    """Minimal runner that auto-responds to IpcqSendCmd / IpcqRecvCmd."""

    def __init__(self) -> None:
        self.received: list[Any] = []

    def switch_to_simpy(self, cmd: Any) -> Any:
        self.received.append(cmd)
        if isinstance(cmd, IpcqSendCmd):
            return None
        if isinstance(cmd, IpcqRecvCmd):
            # Return a fake slot dict
            return {
                "data": None,
                "src_space": "tcm",
                "src_addr": 0xABCD,
                "direction": cmd.direction or "E",
                "dtype": cmd.dtype,
                "shape": cmd.shape,
                "nbytes": 16,
            }
        return None


def test_tl_send_runner_mode():
    runner = _StubRunner()
    tl = TLContext(pe_id=0, num_programs=4, dispatch_cycles=0, runner=runner)
    tl.send(dir="E", src_addr=0x500, nbytes=64, shape=(8,), dtype="f16")
    assert len(runner.received) == 1
    assert isinstance(runner.received[0], IpcqSendCmd)


def test_tl_recv_runner_mode_returns_handle_with_slot_addr():
    runner = _StubRunner()
    tl = TLContext(pe_id=0, num_programs=4, dispatch_cycles=0, runner=runner)
    h = tl.recv(dir="W", shape=(8,), dtype="f16")
    assert isinstance(runner.received[0], IpcqRecvCmd)
    # The returned TensorHandle's addr should reflect the slot
    assert h.addr == 0xABCD
    assert h.shape == (8,)
    assert h.dtype == "f16"
+106
@@ -0,0 +1,106 @@
"""Tests for tl.recv_async + tl.wait (ADR-0023 D4)."""
from __future__ import annotations
import numpy as np
from kernbench.ccl.testing import run_kernel_in_mock
def kernel_async_recv(t_ptr, n_elem, tl):
    """Each PE issues recv_async first, then send, then wait; this exercises
    the non-blocking path. Uses TensorHandle math (PE_MATH) for accumulation
    so Phase 2 produces correct final HBM contents."""
    rank = tl.program_id(axis=0)
    world_size = tl.num_programs(axis=0)
    nbytes = n_elem * 2
    pe_addr = t_ptr + rank * nbytes
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    current = acc
    for _step in range(world_size - 1):
        future = tl.recv_async(dir="W", shape=(n_elem,), dtype="f16")
        tl.send(dir="E", src=current)
        recv = tl.wait(future)
        acc = acc + recv
        current = recv  # forward W's tile to E next round
    tl.store(pe_addr, acc)


def test_recv_async_mock_runtime():
    n_elem = 8
    inputs = [
        np.full((n_elem,), float(r + 1), dtype=np.float16)
        for r in range(4)
    ]
    expected = sum(inputs)
    outputs = run_kernel_in_mock(
        kernel_fn=kernel_async_recv,
        world_size=4,
        topology="ring_1d",
        inputs=inputs,
        kernel_args=(n_elem,),
    )
    for r in range(4):
        assert np.allclose(outputs[r], expected)


def test_recv_async_simpy_runner():
    """Run the async kernel through the real SimPy stack via the
    install_ipcq + launch path.
    """
    import importlib
    from kernbench.runtime_api.bench_runner import run_bench
    from kernbench.runtime_api.types import resolve_device
    from kernbench.sim_engine.engine import GraphEngine
    from kernbench.topology.builder import resolve_topology
    # Re-use the standard 8-PE bench skeleton but swap in the async kernel.
    topo = resolve_topology("topology.yaml")
    # Build a tiny inline bench module
    import types
    mod = types.ModuleType("inline_bench_async")
    from kernbench.policy.placement.dp import DPPolicy

    def run(torch):
        plan = torch.install_ipcq(
            algorithm="ring_allreduce_tcm", world_size_override=8,
        )
        a = torch.zeros(
            (1, 8 * 8),
            dtype="f16",
            dp=DPPolicy(
                sip="replicate", cube="replicate", pe="column_wise",
                num_sips=1, num_cubes=1,
            ),
            name="async_in",
        )
        store = torch.engine.memory_store
        base = a._handle.va_base or a._handle.shards[0].pa
        nbytes = 8 * 2
        for r in range(8):
            store.write("hbm", base + r * nbytes,
                        np.full((8,), float(r + 1), dtype=np.float16))
        torch.launch("ring_allreduce_tcm", kernel_async_recv, a, 8)
        for r in range(8):
            result = store.read("hbm", base + r * nbytes, shape=(8,), dtype="f16")
            expected = float(sum(range(1, 9)))  # 36
            assert np.allclose(result, expected, rtol=1e-2, atol=1e-2), \
                f"rank {r}: got {result}, expected {expected}"

    mod.run = run
    result = run_bench(
        topology=topo, bench_fn=mod.run,
        device=resolve_device("all"),
        engine_factory=lambda t, d: GraphEngine(
            getattr(t, "topology_obj", t), enable_data=True
        ),
    )
    assert result.completion.ok
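
The loop in `kernel_async_recv` (receive from W, accumulate, forward what was received to E) leaves every rank holding the full sum after `world_size - 1` steps. The scalar simulation below is a sketch of that data flow only; `ring_allreduce` is a hypothetical helper that ignores timing, DMA, and credits, and assumes rank r's west neighbor is rank (r - 1) % n.

```python
def ring_allreduce(inputs):
    """Simulate the accumulate-and-forward ring used by kernel_async_recv.

    inputs[r] is rank r's initial value; returns each rank's final acc.
    """
    n = len(inputs)
    acc = list(inputs)      # per-rank running sum
    current = list(inputs)  # per-rank value to send east this step
    for _ in range(n - 1):
        # Every rank sends `current` east and receives from its west neighbor.
        recv = [current[(r - 1) % n] for r in range(n)]
        acc = [a + x for a, x in zip(acc, recv)]
        current = recv      # forward the received value on the next step
    return acc


result_8 = ring_allreduce([float(r + 1) for r in range(8)])
result_4 = ring_allreduce([float(r + 1) for r in range(4)])
```

For 8 ranks holding 1..8 every entry comes out as 36, and for 4 ranks holding 1..4 every entry is 10, matching the expectations in the two tests.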
+9 -6
@@ -19,16 +19,19 @@ def test_full_graph_node_count():
     # + 2 SIPs x (1 IO x 23 io_nodes
     #   + 16 cubes x (32 routers + 1 hbm_ctrl + 1 m_cpu + 1 sram
     #     + 20 ucie (4 ports x (1 port + 4 conn))
-    #     + 8 PEs x 8 pe_comps)) (ADR-0021: +pe_fetch_store)
+    #     + 8 PEs x 9 pe_comps)) (ADR-0023: +pe_ipcq)
     # IO: pcie_ep + io_cpu + noc + 4 io_ucie_ports + 4*4 io_ucie_conn = 23
-    # cube: 32 + 3 + 20 + 64 = 119
-    # = 1 + 2*(23 + 16*119) = 1 + 2*(23+1904) = 1 + 3854 = 3855
-    assert len(g.nodes) == 3855
+    # cube: 32 + 3 + 20 + 72 = 127
+    # = 1 + 2*(23 + 16*127) = 1 + 2*(23+2032) = 1 + 4110 = 4111
+    assert len(g.nodes) == 4111
 def test_full_graph_edge_count():
     g = _graph()
-    assert len(g.edges) == 12922  # ADR-0021: +pe_fetch_store + chaining edges
+    # ADR-0023: +3 IPCQ edges per PE (cpu→ipcq, ipcq→dma, dma→ipcq)
+    # 2 SIPs × 16 cubes × 8 PEs × 3 = 768 new edges
+    # Cross-SIP routing: +1 reverse pcie_ep→switch edge per SIP = +2
+    assert len(g.edges) == 13692
 # -- Full graph: specific nodes exist -----------------------------------------
@@ -287,7 +290,7 @@ def test_pe_view_has_all_components():
     v = _graph().pe_view
     assert set(v.nodes.keys()) == {
         "pe_cpu", "pe_scheduler", "pe_dma", "pe_fetch_store",
-        "pe_gemm", "pe_math", "pe_mmu", "pe_tcm",
+        "pe_gemm", "pe_math", "pe_mmu", "pe_tcm", "pe_ipcq",
     }
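
The updated node and edge totals in this file can be re-derived from the formula in the test comments. The sketch below only repeats that arithmetic; `full_graph_node_count` is a hypothetical helper, and the leading `1` is copied from the comment's formula without knowing which node it stands for, since that part of the comment falls outside the hunk.

```python
def full_graph_node_count(pe_comps, sips=2, cubes=16, pes=8):
    """Node count per the test comment: 1 + sips * (io + cubes * cube_nodes)."""
    io_nodes = 1 + 1 + 1 + 4 + 4 * 4    # pcie_ep + io_cpu + noc + ports + conns = 23
    ucie_nodes = 4 * (1 + 4)            # 4 ports x (1 port + 4 conn) = 20
    cube_nodes = 32 + 3 + ucie_nodes + pes * pe_comps
    return 1 + sips * (io_nodes + cubes * cube_nodes)


nodes_before = full_graph_node_count(pe_comps=8)   # before pe_ipcq was added
nodes_after = full_graph_node_count(pe_comps=9)    # with pe_ipcq

# Edges: 3 new IPCQ edges per PE, plus one reverse pcie_ep->switch edge per SIP.
new_ipcq_edges = 2 * 16 * 8 * 3
edges_after = 12922 + new_ipcq_edges + 2
```

The difference of 256 nodes is one `pe_ipcq` per PE across 2 SIPs × 16 cubes × 8 PEs.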
+1 -1
@@ -24,7 +24,7 @@ def test_pe_template_components():
     comps = spec["cube"]["pe_template"]["components"]
     assert set(comps.keys()) == {
         "pe_cpu", "pe_scheduler", "pe_dma", "pe_fetch_store",
-        "pe_gemm", "pe_math", "pe_mmu", "pe_tcm",
+        "pe_gemm", "pe_math", "pe_mmu", "pe_tcm", "pe_ipcq",
     }
+31
@@ -87,6 +87,37 @@ def test_tl_math_unary_ops():
     assert ops == ["exp", "log", "sqrt", "abs", "sigmoid", "cos", "sin"]
+def test_tl_math_extra_ops():
+    """tl.maximum/minimum/fma/clamp/softmax + tl.cdiv (real-Triton parity)."""
+    tl = _ctx()
+    a = tl.load(0x1000, shape=(8, 8), dtype="f16")
+    b = tl.load(0x2000, shape=(8, 8), dtype="f16")
+    c = tl.load(0x3000, shape=(8, 8), dtype="f16")
+    tl.maximum(a, b)
+    tl.minimum(a, b)
+    tl.fma(a, b, c)
+    tl.clamp(a, b, c)
+    tl.softmax(a, axis=1)
+    math_cmds = [cm for cm in tl.commands if isinstance(cm, MathCmd)]
+    ops = [cm.op for cm in math_cmds]
+    assert ops == ["maximum", "minimum", "fma", "clamp", "softmax"]
+    # ternary fma/clamp must record three inputs
+    fma_cmd = math_cmds[2]
+    assert len(fma_cmd.inputs) == 3
+    clamp_cmd = math_cmds[3]
+    assert len(clamp_cmd.inputs) == 3
+    # softmax records the axis
+    assert math_cmds[4].axis == 1
+    # cdiv is a scalar helper, not a tensor op
+    from kernbench.triton_emu.tl_context import TLContext
+    assert TLContext.cdiv(10, 3) == 4
+    assert TLContext.cdiv(9, 3) == 3
+    assert TLContext.cdiv(0, 4) == 0
 # ── 5. a + b, a * b → MathCmd ────────────────────────────────────
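
The three `TLContext.cdiv` cases asserted above are consistent with ordinary ceiling division on non-negative integers. A minimal standalone sketch follows; this `cdiv` is a free function written for illustration and is not claimed to be how `TLContext.cdiv` is implemented.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative a and positive b."""
    if b <= 0:
        raise ValueError("b must be positive")
    return (a + b - 1) // b


# The same cases the test checks, plus a typical use: blocks needed to cover n items.
cases = [cdiv(10, 3), cdiv(9, 3), cdiv(0, 4)]
n_blocks = cdiv(1000, 128)
```

A common use in kernels is sizing a launch grid, for example the number of 128-element blocks needed to cover 1000 elements.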
+5 -1
@@ -67,7 +67,8 @@ cube:
 pe_math: { kind: pe_math, impl: builtin.pe_math, attrs: { overhead_ns: 0.0, shared_resource: accel_slot } }
 pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { overhead_ns: 0.0 } }
 pe_mmu: { kind: pe_mmu, impl: builtin.pe_mmu, attrs: { tlb_overhead_ns: 0.5, page_size: 4096 } }
-pe_tcm: { kind: pe_tcm, impl: builtin.pe_tcm, attrs: { size_mb: 16, read_bw_gbs: 512.0, write_bw_gbs: 512.0 } }
+pe_tcm: { kind: pe_tcm, impl: builtin.pe_tcm, attrs: { size_mb: 16, read_bw_gbs: 512.0, write_bw_gbs: 512.0, kernel_scratch_mb: 1 } }
+pe_ipcq: { kind: pe_ipcq, impl: builtin.pe_ipcq, attrs: { overhead_ns: 0.0 } }
 links:
 pe_cpu_to_scheduler_mm: 0.5
 scheduler_to_dma_mm: 0.5
@@ -88,6 +89,9 @@ cube:
 gemm_to_tcm_mm: 0.5
 math_to_tcm_bw_gbs: 512.0
 math_to_tcm_mm: 0.5
+cpu_to_ipcq_mm: 0.5 # PE_CPU → PE_IPCQ (ADR-0023)
+ipcq_to_dma_mm: 0.0 # PE_IPCQ → PE_DMA token forwarding (ADR-0023)
+dma_to_ipcq_mm: 0.0 # PE_DMA → PE_IPCQ metadata arrival (ADR-0023)
 memory_map:
 hbm_total_gb_per_cube: 48