Files
kernbench2/docs/onboarding/ccl-author-guide.en.md
T
ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00

19 KiB
Raw Blame History

CCL Algorithm Author Guide (English)

This document is a step-by-step guide for engineers writing CCL (Collective Communication Library) algorithms in kernbench. The internal system design and component structure live in ADR-0023.

The goal here is to clearly separate what an algorithm author has to touch from what they can leave alone, and to get a first algorithm running through the shortest possible path.


0. Five-minute tour

Things you touch Location
Algorithm module (kernel + optional neighbors()) src/kernbench/ccl/algorithms/<algo>.py
Algorithm registration ccl.yaml
Host bench (rank count, init, launch, verify) benches/<your_bench>.py
(Optional) unit test tests/test_<algo>.py
Things you do NOT touch Location
TLContext API src/kernbench/triton_emu/tl_context.py (ADR-0022 spec)
Framework (topology generators, helpers, mock testing) src/kernbench/ccl/
PE_IPCQ / PE_DMA components src/kernbench/components/builtin/
Backend implementation (install_ipcq) src/kernbench/runtime_api/distributed.py and kernbench/ccl/install.py

Workflow:

  1. Write a kernel function in the algorithm module.
  2. Register an entry in ccl.yaml.
  3. Write a host bench using torch.distributed.init_process_group / torch.distributed.all_reduce (the unified benches/ccl_allreduce.py handles the common case).
  4. (Optional) Run the mock runtime for fast unit tests (a few ms).
  5. kernbench run --bench <name> --verify-data for full SimPy verification.

1. Hello World — the simplest send/recv

Each PE sends its tile to its E neighbor once and receives a tile from its W neighbor once. The reference code lives in src/kernbench/ccl/algorithms/hello_send.py.

Step 1: write the kernel

New file src/kernbench/ccl/algorithms/hello_send.py:

"""Hello world: send your tile to the next rank, receive from the previous one."""


def kernel(t_ptr, n_elem, tl):
    # Global rank is computed from program_id(0/1) (ADR-0022).
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe

    nbytes = n_elem * 2  # f16
    pe_addr = t_ptr + rank * nbytes

    # Load our slice and send it east.
    src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    tl.send(dir="E", src=src)

    # Receive from west and store directly back into our slice.
    recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
    tl.store(pe_addr, recv)


def kernel_args(world_size: int, n_elem: int) -> tuple:
    """Positional kernel args used by the ahbm backend (after t_ptr)."""
    return (n_elem,)

Key points:

  • Global rank is computed from program_id(axis=0) + program_id(axis=1). TL has no contractually-supported tl.rank / tl.world_size. If the host needs to pass world_size or anything else as an algorithm parameter, it goes through ordinary torch.launch arguments.
  • tl.send takes a TensorHandle. PE_IPCQ reads addr/space/shape/dtype/nbytes from the handle to issue an IpcqDmaToken to PE_DMA.
  • tl.recv requires shape and dtype. The returned TensorHandle points at the IPCQ ring slot and can be used directly as a dst handle (e.g. tl.store(pe_addr, recv)). Phase 2's dma_write replay handles the (slot → hbm) copy, so user code never has to touch recv.data.

Step 2: register in ccl.yaml

algorithms:
  hello_send:
    module: kernbench.ccl.algorithms.hello_send
    topology: ring_1d
    buffer_kind: tcm
    world_size: 8

world_size here is optional. If absent, AhbmCCLBackend derives it from the topology spec (sips × cubes_per_sip × pes_per_cube).

Step 3: write a host bench (optional — the unified bench may suffice)

For most CCL benchmarks the existing benches/ccl_allreduce.py is sufficient: it reads ccl.yaml, picks the algorithm, sets up the process group, and runs the collective. If your algorithm needs custom host logic, write a new bench file along the same lines.

The host code looks like a real PyTorch DDP worker:

"""benches/ccl_hello.py"""
from __future__ import annotations

import numpy as np

from kernbench.policy.placement.dp import DPPolicy


N_ELEM = 8


def worker(rank: int, world_size: int, torch) -> None:
    """Per-rank business logic — mirrors a real PyTorch DDP worker."""
    dp = DPPolicy(
        cube="replicate", pe="column_wise",
        num_cubes=1, num_pes=world_size,
    )
    tensor = torch.zeros(
        (1, world_size * N_ELEM), dtype="f16", dp=dp, name="hello_in",
    )

    # Per-rank initialization via the real PyTorch idiom.
    init = np.zeros((1, world_size * N_ELEM), dtype=np.float16)
    for r in range(world_size):
        init[0, r * N_ELEM : (r + 1) * N_ELEM] = float(r + 1)
    tensor.copy_(torch.from_numpy(init))

    # The collective itself.
    torch.distributed.all_reduce(tensor, op="sum")

    # Verify on rank 0 (real PyTorch DDP idiom).
    if rank == 0:
        result = tensor.numpy()
        for r in range(world_size):
            expected = float(((r - 1) % world_size) + 1)
            slice_r = result[0, r * N_ELEM : (r + 1) * N_ELEM]
            print(
                f"  rank {r}: got {float(slice_r.mean()):.1f}, "
                f"expected {expected:.1f}"
            )


def run(torch) -> None:
    """CLI entry point. Initializes dist, dispatches to worker."""
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")
    worker(
        rank=dist.get_rank(),
        world_size=dist.get_world_size(),
        torch=torch,
    )

tests/test_hello_send.py:

import numpy as np

from kernbench.ccl.algorithms.hello_send import kernel
from kernbench.ccl.testing import run_kernel_in_mock


def test_hello_send_4_ranks():
    n_elem = 8
    inputs = [
        np.full((n_elem,), float(r + 1), dtype=np.float16)
        for r in range(4)
    ]
    outputs = run_kernel_in_mock(
        kernel_fn=kernel,
        world_size=4,
        topology="ring_1d",
        inputs=inputs,
        kernel_args=(n_elem,),
    )
    # rank r should now hold rank (r-1) % 4's data.
    for r in range(4):
        assert np.array_equal(outputs[r], inputs[(r - 1) % 4])

run_kernel_in_mock runs every rank concurrently in pure Python (no SimPy), so a unit test like this finishes in milliseconds. It only verifies algorithmic correctness — no latency, no DMA, no fabric.

Step 5: SimPy validation

kernbench run --topology topology.yaml --bench ccl_hello --verify-data

Phase 1 runs the SimPy simulation + MemoryStore data movement, Phase 2 replays the op_log for correctness. The bench's print lines should show OK for every rank.


2. Ring all-reduce — the second algorithm

Slightly more complex. Each PE runs world_size - 1 rounds, sending its current tile east and accumulating the tile received from the west. After all rounds, every PE holds the global sum.

The reference implementation lives in src/kernbench/ccl/algorithms/ring_allreduce.py. The core flow:

"""Ring all-reduce."""


def kernel(t_ptr, n_elem, world_size, tl):
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe
    nbytes = n_elem * 2
    pe_addr = t_ptr + rank * nbytes

    # The handle points at HBM[pe_addr]. In greenlet mode .data is
    # populated, but the kernel never has to touch .data directly.
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    current = acc  # source for the first send

    for _step in range(world_size - 1):
        tl.send(dir="E", src=current)
        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
        # TensorHandle operator overload → MathCmd → PE_MATH dispatch.
        # Phase 1 only models timing; Phase 2 DataExecutor replays the
        # actual numpy accumulation.
        acc = acc + recv
        current = recv  # forward the received slot to the next round

    # Store the final accumulator back to HBM. Source is acc (a PE-local
    # scratch addr); dst is HBM. The op_log dma_write entry records both
    # ends so Phase 2 copies the math result into HBM at verify time.
    tl.store(pe_addr, acc)


def kernel_args(world_size: int, n_elem: int) -> tuple:
    return (n_elem, world_size)

Four key points:

  1. Accumulation goes through TensorHandle operators. acc + recv emits a MathCmd and dispatches it through PE_MATH — i.e. the real hardware path, so the latency model stays accurate. Per ADR-0020 D3, Phase 1 only simulates timing; Phase 2's DataExecutor replays the op_log and runs the actual numpy accumulation.
  2. Use current = recv to forward. Each round must update the send source to the just-received slot handle so the same data circulates exactly once around the ring. Setting current = acc would resend the cumulative sum, inflating the result.
  3. tl.store(pe_addr, acc) exactly once at the end. Do not use a store→reload pattern in the middle. acc lives in PE-local scratch; the op_log records (src=scratch, dst=hbm) and Phase 2 first runs math (filling scratch) then copies via the dma_write snapshot.
  4. world_size is passed by the host explicitly. TL only knows the topology slot count (e.g. num_programs(axis=0) is "PEs per cube"), not the participating CCL group size. The host bench knows world_size and forwards it as an explicit kernel argument.

For registration in ccl.yaml and wiring through the unified bench, look at the existing ring_allreduce_tcm/_hbm/_sram entries plus benches/ccl_allreduce.py. Mock unit tests live in tests/test_ccl_mock_runtime.py and follow the kernel_args=(n_elem, world_size) convention.


3. neighbors() override — custom topology

Most algorithms are happy with the builtin topologies (ring_1d, mesh_2d, tree_binary, ring_1d_unidir, none). If you want to modify a builtin or define a brand-new connectivity pattern, define a neighbors() function in your algorithm module.

Signature

def neighbors(
    rank: int, world_size: int, neighbor_map: dict[str, int],
) -> dict[str, int] | None:
    """Override the neighbor map produced by the builtin topology.

    Args:
        neighbor_map: the mapping the ccl.yaml ``topology`` field built.
                      For ring_1d this is {"E": (rank+1)%ws, "W": (rank-1)%ws}.
                      The dict is mutable — modify in place if you want.

    Returns:
        dict: the new neighbor map (or the modified-in-place dict).
        None: do not override; use neighbor_map as-is.
    """
    return None

Pattern A: tweak a builtin

def neighbors(rank, world_size, neighbor_map):
    # Only even ranks use W; remove W from odd ranks.
    if rank % 2 == 1:
        neighbor_map.pop("W", None)
    return neighbor_map

Pattern B: replace entirely (skip-connection ring)

def neighbors(rank, world_size, neighbor_map):
    return {"E": (rank + 2) % world_size}

Pattern C: keep builtin

Either omit neighbors entirely or return None:

def neighbors(rank, world_size, neighbor_map):
    return None  # explicit "use the builtin"

4. PE kernel API reference (ADR-0023 D4)

IPCQ API

API Description Blocking?
tl.send(dir, src=TensorHandle) Send to a peer in the given direction. Yes (waits if peer slots are full)
tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...) Same, keyword form. Yes
tl.recv(dir, shape=..., dtype=...) Blocking recv from one direction. Yes
tl.recv(shape=..., dtype=...) Round-robin recv across all four directions. Yes
tl.recv_async(dir, shape=..., dtype=...) → RecvFuture Non-blocking recv. No
tl.wait(future) Wait for a non-blocking recv future → returns the resolved TensorHandle. Yes

Existing TL API (ADR-0020/0022, unchanged)

API Description
tl.load(addr, shape, dtype) → TensorHandle DMA read; in greenlet mode .data carries the ndarray.
tl.store(addr, handle) DMA write — when handle.data is set the runner propagates it to MemoryStore.
tl.composite(op, ...) Submit a GEMM/Math composite (non-blocking).
tl.program_id(axis=0) Local PE id within the cube.
tl.program_id(axis=1) Cube id (ADR-0022).
tl.num_programs(axis=0/1) Topology slot counts (NOT the participating-rank count).

Two recv modes

The default is return_slot (zero-copy): the IPCQ slot address is returned in handle.addr. To force a copy into a custom destination, pass dst_addr + dst_space:

recv = tl.recv(
    dir="W", shape=(8,), dtype="f16",
    dst_addr=my_scratch_addr,
    dst_space="hbm",
)
# After this call recv.addr == my_scratch_addr (copy_to_dst mode).

5. Helpers (kernbench.ccl.helpers)

Convenience helpers to keep algorithm code short:

from kernbench.ccl.helpers import chunked, ring_step, tree_step

chunked(base_addr, n_chunks, n_elem, dtype="f16") → list[Chunk]

Split a tile of n_elem elements into n_chunks equal-size views. Each Chunk has addr, n_elem, nbytes fields.

chunks = chunked(t_ptr, n_chunks=4, n_elem=64, dtype="f16")
# chunks[0..3] are 16-element views with consecutive addresses.

ring_step(rank, step, world_size) → (send_idx, recv_idx)

Per-step chunk indices for a ring algorithm (reduce-scatter / all-gather):

for step in range(world_size - 1):
    send_idx, recv_idx = ring_step(rank, step, world_size)
    tl.send(
        dir="E", src_addr=chunks[send_idx].addr,
        nbytes=chunks[send_idx].nbytes,
        shape=(chunks[send_idx].n_elem,), dtype="f16",
    )
    recv = tl.recv(
        dir="W", shape=(chunks[recv_idx].n_elem,), dtype="f16",
    )
    # accumulate ...

tree_step(rank, world_size) → {"parent": int|None, "children": list[int]}

Parent / children rank ids for a binary tree:

info = tree_step(rank, world_size)
if info["parent"] is None:
    print(f"rank {rank} is the root")
for child in info["children"]:
    ...

6. Unit testing — Mock runtime

kernbench.ccl.testing.run_kernel_in_mock runs an algorithm without SimPy for fast feedback.

Basic usage

import numpy as np

from kernbench.ccl.testing import run_kernel_in_mock
from kernbench.ccl.algorithms.my_algo import kernel


def test_my_algo():
    n_elem = 16
    inputs = [np.arange(n_elem, dtype="f16") + r for r in range(4)]
    expected = sum(inputs)
    outputs = run_kernel_in_mock(
        kernel_fn=kernel,
        world_size=4,
        topology="ring_1d",
        inputs=inputs,
        kernel_args=(n_elem, 4),  # positional args after t_ptr
    )
    for r in range(4):
        assert np.allclose(outputs[r], expected, rtol=1e-3)

Behavior

  • All ranks run their kernels concurrently as cooperative greenlets.
  • tl.send / tl.recv are serviced by in-memory FIFOs (no DMA, no latency).
  • Each rank's last store is what the helper returns as a numpy array.

Limitations

  • No latency or performance numbers (it is not a simulation).
  • No PE_DMA, fabric, or BW model.
  • Correctness only.
  • One cube assumed: program_id(axis=1) is always 0.

7. Debugging

CCL trace

KERNBENCH_CCL_TRACE=1 kernbench run --topology topology.yaml \
    --bench ccl_allreduce --verify-data

Per-rank send/recv events appear on stdout:

[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0
[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64

Pointer dump

kernbench.ccl.diagnostics.pointer_dump(engine) returns a multi-line dump of every PE_IPCQ ring buffer's my_head, my_tail, peer_head_cache, peer_tail_cache. When something hangs, this shows which rank is stuck and on what.

Deadlock detection

When the SimPy schedule empties because of unmatched send/recv pairs, the engine raises IpcqDeadlock and embeds the pointer dump in the message (ADR-0023 D14 F3). Wait-for-graph visualization is future work.


8. Common mistakes

1. Using a direction that wasn't installed

topology: ring_1d only installs E and W. Trying:

tl.send(dir="N", ...)   # → IpcqInvalidDirection

Fix: switch to topology: mesh_2d, or add N/S in a neighbors() override.

2. send without a matching recv

def kernel(..., tl):
    for _ in range(100):
        tl.send(dir="E", ...)
    # The peer never recvs → ring buffer fills → backpressure → deadlock.

Fix: every send needs a matching recv on the receiver side. Otherwise IpcqDeadlock is raised.

3. dtype/shape mismatch

By default mismatches are not validated. The author is responsible for consistency. Set strict_validation: true on a PE_IPCQ node's attrs to enable D14 F2 strict mode and catch them immediately.

4. Assuming round-robin recv fairness

tl.recv() (no direction) returns the first slot to arrive in round-robin order, but arrival order is not predictable. If your algorithm depends on a particular direction, name it explicitly: tl.recv(dir="N", ...).

5. Confusing num_programs with the CCL group size

tl.num_programs(axis=0/1) reports topology slot counts, not the number of ranks participating in the collective. The host bench knows world_size and must pass it through as a kernel argument.

6. Overwriting the send source before it's actually sent

PE_DMA snapshots the source data into the IpcqDmaToken at send time, preserving in-flight semantics. Even so, the safest pattern is to call tl.send first and only mutate the source addr afterwards. If you mutate the addr before tl.send makes it into the PE_DMA queue, the snapshot will pick up the wrong data.


9. Next steps

  • Try other topologies (mesh_2d, tree_binary).
  • Faster algorithms (recursive halving / doubling).
  • Compare buffer_kind (tcm/hbm/sram) and backpressure (poll/sleep) modes for latency.
  • Larger-scale validation through the unified ccl_allreduce bench with different ccl.yaml overlays.

If you add a new algorithm or pattern, please send a PR.


References

Existing algorithm examples: