Files
kernbench2/docs/ccl-author-guide.en.md
T
ywkang 998cc85762 Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)
Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after
  each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to
  prevent stale data from corrupting the MemoryStore snapshot.

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier — only real PyTorch names.
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases
  (ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 19:36:59 -07:00

19 KiB
Raw Blame History

CCL Algorithm Author Guide (English)

This document is a step-by-step guide for engineers writing CCL (Collective Communication Library) algorithms in kernbench. The internal system design and component structure live in ADR-0023.

The goal here is to clearly separate what an algorithm author has to touch from what they can leave alone, and to get a first algorithm running through the shortest possible path.


0. Five-minute tour

Things you touch Location
Algorithm module (kernel + optional neighbors()) src/kernbench/ccl/algorithms/<algo>.py
Algorithm registration ccl.yaml
Host bench (rank count, init, launch, verify) benches/<your_bench>.py
(Optional) unit test tests/test_<algo>.py
Things you do NOT touch Location
TLContext API src/kernbench/triton_emu/tl_context.py (ADR-0022 spec)
Framework (topology generators, helpers, mock testing) src/kernbench/ccl/
PE_IPCQ / PE_DMA components src/kernbench/components/builtin/
Backend implementation (install_ipcq) src/kernbench/runtime_api/distributed.py and kernbench/ccl/install.py

Workflow:

  1. Write a kernel function in the algorithm module.
  2. Register an entry in ccl.yaml.
  3. Write a host bench using torch.distributed.init_process_group / torch.distributed.all_reduce (the unified benches/ccl_allreduce.py handles the common case).
  4. (Optional) Run the mock runtime for fast unit tests (a few ms).
  5. kernbench run --bench <name> --verify-data for full SimPy verification.

1. Hello World — the simplest send/recv

Each PE sends its tile to its E neighbor once and receives a tile from its W neighbor once. The reference code lives in src/kernbench/ccl/algorithms/hello_send.py.

Step 1: write the kernel

New file src/kernbench/ccl/algorithms/hello_send.py:

"""Hello world: send your tile to the next rank, receive from the previous one."""


def kernel(t_ptr, n_elem, tl):
    # Global rank is computed from program_id(0/1) (ADR-0022).
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe

    nbytes = n_elem * 2  # f16
    pe_addr = t_ptr + rank * nbytes

    # Load our slice and send it east.
    src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    tl.send(dir="E", src=src)

    # Receive from west and store directly back into our slice.
    recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
    tl.store(pe_addr, recv)


def kernel_args(world_size: int, n_elem: int) -> tuple:
    """Positional kernel args used by the ahbm backend (after t_ptr)."""
    return (n_elem,)

Key points:

  • Global rank is computed from program_id(axis=0) + program_id(axis=1). TL has no contractually-supported tl.rank / tl.world_size. If the host needs to pass world_size or anything else as an algorithm parameter, it goes through ordinary torch.launch arguments.
  • tl.send takes a TensorHandle. PE_IPCQ reads addr/space/shape/dtype/nbytes from the handle to issue an IpcqDmaToken to PE_DMA.
  • tl.recv requires shape and dtype. The returned TensorHandle points at the IPCQ ring slot and can be used directly as a dst handle (e.g. tl.store(pe_addr, recv)). Phase 2's dma_write replay handles the (slot → hbm) copy, so user code never has to touch recv.data.

Step 2: register in ccl.yaml

algorithms:
  hello_send:
    module: kernbench.ccl.algorithms.hello_send
    topology: ring_1d
    buffer_kind: tcm
    world_size: 8

world_size here is optional. If absent, AhbmCCLBackend derives it from the topology spec (sips × cubes_per_sip × pes_per_cube).

Step 3: write a host bench (optional — the unified bench may suffice)

For most CCL benchmarks the existing benches/ccl_allreduce.py is sufficient: it reads ccl.yaml, picks the algorithm, sets up the process group, and runs the collective. If your algorithm needs custom host logic, write a new bench file along the same lines.

The host code looks like a real PyTorch DDP worker:

"""benches/ccl_hello.py"""
from __future__ import annotations

import numpy as np

from kernbench.policy.placement.dp import DPPolicy


N_ELEM = 8


def worker(rank: int, world_size: int, torch) -> None:
    """Per-rank business logic — mirrors a real PyTorch DDP worker."""
    dp = DPPolicy(
        sip="replicate", cube="replicate", pe="column_wise",
        num_sips=1, num_cubes=1, num_pes=world_size,
    )
    tensor = torch.zeros(
        (1, world_size * N_ELEM), dtype="f16", dp=dp, name="hello_in",
    )

    # Per-rank initialization via the real PyTorch idiom.
    init = np.zeros((1, world_size * N_ELEM), dtype=np.float16)
    for r in range(world_size):
        init[0, r * N_ELEM : (r + 1) * N_ELEM] = float(r + 1)
    tensor.copy_(torch.from_numpy(init))

    # The collective itself.
    torch.distributed.all_reduce(tensor, op="sum")

    # Verify on rank 0 (real PyTorch DDP idiom).
    if rank == 0:
        result = tensor.numpy()
        for r in range(world_size):
            expected = float(((r - 1) % world_size) + 1)
            slice_r = result[0, r * N_ELEM : (r + 1) * N_ELEM]
            print(
                f"  rank {r}: got {float(slice_r.mean()):.1f}, "
                f"expected {expected:.1f}"
            )


def run(torch) -> None:
    """CLI entry point. Initializes dist, dispatches to worker."""
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")
    worker(
        rank=dist.get_rank(),
        world_size=dist.get_world_size(),
        torch=torch,
    )

tests/test_hello_send.py:

import numpy as np

from kernbench.ccl.algorithms.hello_send import kernel
from kernbench.ccl.testing import run_kernel_in_mock


def test_hello_send_4_ranks():
    n_elem = 8
    inputs = [
        np.full((n_elem,), float(r + 1), dtype=np.float16)
        for r in range(4)
    ]
    outputs = run_kernel_in_mock(
        kernel_fn=kernel,
        world_size=4,
        topology="ring_1d",
        inputs=inputs,
        kernel_args=(n_elem,),
    )
    # rank r should now hold rank (r-1) % 4's data.
    for r in range(4):
        assert np.array_equal(outputs[r], inputs[(r - 1) % 4])

run_kernel_in_mock runs every rank concurrently in pure Python (no SimPy), so a unit test like this finishes in milliseconds. It only verifies algorithmic correctness — no latency, no DMA, no fabric.

Step 5: SimPy validation

kernbench run --topology topology.yaml --bench ccl_hello --verify-data

Phase 1 runs the SimPy simulation + MemoryStore data movement, Phase 2 replays the op_log for correctness. The bench's print lines should show OK for every rank.


2. Ring all-reduce — the second algorithm

Slightly more complex. Each PE runs world_size - 1 rounds, sending its current tile east and accumulating the tile received from the west. After all rounds, every PE holds the global sum.

The reference implementation lives in src/kernbench/ccl/algorithms/ring_allreduce.py. The core flow:

"""Ring all-reduce."""


def kernel(t_ptr, n_elem, world_size, tl):
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe
    nbytes = n_elem * 2
    pe_addr = t_ptr + rank * nbytes

    # The handle points at HBM[pe_addr]. In greenlet mode .data is
    # populated, but the kernel never has to touch .data directly.
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    current = acc  # source for the first send

    for _step in range(world_size - 1):
        tl.send(dir="E", src=current)
        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
        # TensorHandle operator overload → MathCmd → PE_MATH dispatch.
        # Phase 1 only models timing; Phase 2 DataExecutor replays the
        # actual numpy accumulation.
        acc = acc + recv
        current = recv  # forward the received slot to the next round

    # Store the final accumulator back to HBM. Source is acc (a PE-local
    # scratch addr); dst is HBM. The op_log dma_write entry records both
    # ends so Phase 2 copies the math result into HBM at verify time.
    tl.store(pe_addr, acc)


def kernel_args(world_size: int, n_elem: int) -> tuple:
    return (n_elem, world_size)

Four key points:

  1. Accumulation goes through TensorHandle operators. acc + recv emits a MathCmd and dispatches it through PE_MATH — i.e. the real hardware path, so the latency model stays accurate. Per ADR-0020 D3, Phase 1 only simulates timing; Phase 2's DataExecutor replays the op_log and runs the actual numpy accumulation.
  2. Use current = recv to forward. Each round must update the send source to the just-received slot handle so the same data circulates exactly once around the ring. Setting current = acc would resend the cumulative sum, inflating the result.
  3. tl.store(pe_addr, acc) exactly once at the end. Do not use a store→reload pattern in the middle. acc lives in PE-local scratch; the op_log records (src=scratch, dst=hbm) and Phase 2 first runs math (filling scratch) then copies via the dma_write snapshot.
  4. world_size is passed by the host explicitly. TL only knows the topology slot count (e.g. num_programs(axis=0) is "PEs per cube"), not the participating CCL group size. The host bench knows world_size and forwards it as an explicit kernel argument.

For registration in ccl.yaml and wiring through the unified bench, look at the existing ring_allreduce_tcm/_hbm/_sram entries plus benches/ccl_allreduce.py. Mock unit tests live in tests/test_ccl_mock_runtime.py and follow the kernel_args=(n_elem, world_size) convention.


3. neighbors() override — custom topology

Most algorithms are happy with the builtin topologies (ring_1d, mesh_2d, tree_binary, ring_1d_unidir, none). If you want to modify a builtin or define a brand-new connectivity pattern, define a neighbors() function in your algorithm module.

Signature

def neighbors(
    rank: int, world_size: int, neighbor_map: dict[str, int],
) -> dict[str, int] | None:
    """Override the neighbor map produced by the builtin topology.

    Args:
        neighbor_map: the mapping the ccl.yaml ``topology`` field built.
                      For ring_1d this is {"E": (rank+1)%ws, "W": (rank-1)%ws}.
                      The dict is mutable — modify in place if you want.

    Returns:
        dict: the new neighbor map (or the modified-in-place dict).
        None: do not override; use neighbor_map as-is.
    """
    return None

Pattern A: tweak a builtin

def neighbors(rank, world_size, neighbor_map):
    # Only even ranks use W; remove W from odd ranks.
    if rank % 2 == 1:
        neighbor_map.pop("W", None)
    return neighbor_map

Pattern B: replace entirely (skip-connection ring)

def neighbors(rank, world_size, neighbor_map):
    return {"E": (rank + 2) % world_size}

Pattern C: keep builtin

Either omit neighbors entirely or return None:

def neighbors(rank, world_size, neighbor_map):
    return None  # explicit "use the builtin"

4. PE kernel API reference (ADR-0023 D4)

IPCQ API

API Description Blocking?
tl.send(dir, src=TensorHandle) Send to a peer in the given direction. Yes (waits if peer slots are full)
tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...) Same, keyword form. Yes
tl.recv(dir, shape=..., dtype=...) Blocking recv from one direction. Yes
tl.recv(shape=..., dtype=...) Round-robin recv across all four directions. Yes
tl.recv_async(dir, shape=..., dtype=...) → RecvFuture Non-blocking recv. No
tl.wait(future) Wait for a non-blocking recv future → returns the resolved TensorHandle. Yes

Existing TL API (ADR-0020/0022, unchanged)

API Description
tl.load(addr, shape, dtype) → TensorHandle DMA read; in greenlet mode .data carries the ndarray.
tl.store(addr, handle) DMA write — when handle.data is set the runner propagates it to MemoryStore.
tl.composite(op, ...) Submit a GEMM/Math composite (non-blocking).
tl.program_id(axis=0) Local PE id within the cube.
tl.program_id(axis=1) Cube id (ADR-0022).
tl.num_programs(axis=0/1) Topology slot counts (NOT the participating-rank count).

Two recv modes

The default is return_slot (zero-copy): the IPCQ slot address is returned in handle.addr. To force a copy into a custom destination, pass dst_addr + dst_space:

recv = tl.recv(
    dir="W", shape=(8,), dtype="f16",
    dst_addr=my_scratch_addr,
    dst_space="hbm",
)
# After this call recv.addr == my_scratch_addr (copy_to_dst mode).

5. Helpers (kernbench.ccl.helpers)

Convenience helpers to keep algorithm code short:

from kernbench.ccl.helpers import chunked, ring_step, tree_step

chunked(base_addr, n_chunks, n_elem, dtype="f16") → list[Chunk]

Split a tile of n_elem elements into n_chunks equal-size views. Each Chunk has addr, n_elem, nbytes fields.

chunks = chunked(t_ptr, n_chunks=4, n_elem=64, dtype="f16")
# chunks[0..3] are 16-element views with consecutive addresses.

ring_step(rank, step, world_size) → (send_idx, recv_idx)

Per-step chunk indices for a ring algorithm (reduce-scatter / all-gather):

for step in range(world_size - 1):
    send_idx, recv_idx = ring_step(rank, step, world_size)
    tl.send(
        dir="E", src_addr=chunks[send_idx].addr,
        nbytes=chunks[send_idx].nbytes,
        shape=(chunks[send_idx].n_elem,), dtype="f16",
    )
    recv = tl.recv(
        dir="W", shape=(chunks[recv_idx].n_elem,), dtype="f16",
    )
    # accumulate ...

tree_step(rank, world_size) → {"parent": int|None, "children": list[int]}

Parent / children rank ids for a binary tree:

info = tree_step(rank, world_size)
if info["parent"] is None:
    print(f"rank {rank} is the root")
for child in info["children"]:
    ...

6. Unit testing — Mock runtime

kernbench.ccl.testing.run_kernel_in_mock runs an algorithm without SimPy for fast feedback.

Basic usage

import numpy as np

from kernbench.ccl.testing import run_kernel_in_mock
from kernbench.ccl.algorithms.my_algo import kernel


def test_my_algo():
    n_elem = 16
    inputs = [np.arange(n_elem, dtype="f16") + r for r in range(4)]
    expected = sum(inputs)
    outputs = run_kernel_in_mock(
        kernel_fn=kernel,
        world_size=4,
        topology="ring_1d",
        inputs=inputs,
        kernel_args=(n_elem, 4),  # positional args after t_ptr
    )
    for r in range(4):
        assert np.allclose(outputs[r], expected, rtol=1e-3)

Behavior

  • All ranks run their kernels concurrently as cooperative greenlets.
  • tl.send / tl.recv are serviced by in-memory FIFOs (no DMA, no latency).
  • Each rank's last store is what the helper returns as a numpy array.

Limitations

  • No latency or performance numbers (it is not a simulation).
  • No PE_DMA, fabric, or BW model.
  • Correctness only.
  • One cube assumed: program_id(axis=1) is always 0.

7. Debugging

CCL trace

KERNBENCH_CCL_TRACE=1 kernbench run --topology topology.yaml \
    --bench ccl_allreduce --verify-data

Per-rank send/recv events appear on stdout:

[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0
[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64

Pointer dump

kernbench.ccl.diagnostics.pointer_dump(engine) returns a multi-line dump of every PE_IPCQ ring buffer's my_head, my_tail, peer_head_cache, peer_tail_cache. When something hangs, this shows which rank is stuck and on what.

Deadlock detection

When the SimPy schedule empties because of unmatched send/recv pairs, the engine raises IpcqDeadlock and embeds the pointer dump in the message (ADR-0023 D14 F3). Wait-for-graph visualization is future work.


8. Common mistakes

1. Using a direction that wasn't installed

topology: ring_1d only installs E and W. Trying:

tl.send(dir="N", ...)   # → IpcqInvalidDirection

Fix: switch to topology: mesh_2d, or add N/S in a neighbors() override.

2. send without a matching recv

def kernel(..., tl):
    for _ in range(100):
        tl.send(dir="E", ...)
    # The peer never recvs → ring buffer fills → backpressure → deadlock.

Fix: every send needs a matching recv on the receiver side. Otherwise IpcqDeadlock is raised.

3. dtype/shape mismatch

By default mismatches are not validated. The author is responsible for consistency. Set strict_validation: true on a PE_IPCQ node's attrs to enable D14 F2 strict mode and catch them immediately.

4. Assuming round-robin recv fairness

tl.recv() (no direction) returns the first slot to arrive in round-robin order, but arrival order is not predictable. If your algorithm depends on a particular direction, name it explicitly: tl.recv(dir="N", ...).

5. Confusing num_programs with the CCL group size

tl.num_programs(axis=0/1) reports topology slot counts, not the number of ranks participating in the collective. The host bench knows world_size and must pass it through as a kernel argument.

6. Overwriting the send source before it's actually sent

PE_DMA snapshots the source data into the IpcqDmaToken at send time, preserving in-flight semantics. Even so, the safest pattern is to call tl.send first and only mutate the source addr afterwards. If you mutate the addr before tl.send makes it into the PE_DMA queue, the snapshot will pick up the wrong data.


9. Next steps

  • Try other topologies (mesh_2d, tree_binary).
  • Faster algorithms (recursive halving / doubling).
  • Compare buffer_kind (tcm/hbm/sram) and backpressure (poll/sleep) modes for latency.
  • Larger-scale validation through the unified ccl_allreduce bench with different ccl.yaml overlays.

If you add a new algorithm or pattern, please send a PR.


References

Existing algorithm examples: