Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
(dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
retroactive docs pending verification.
Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
deleted; ADR-0019/0021 moved to adr-history with one-line stub status
Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
selection, flit-aware per-flit commit, async finalize, command-only
fallback path)
Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
"Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
(now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)
Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py
Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.
Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
(ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
19 KiB
CCL Algorithm Author Guide (English)
This document is a step-by-step guide for engineers writing CCL (Collective Communication Library) algorithms in kernbench. The internal system design and component structure live in ADR-0023.
The goal here is to clearly separate what an algorithm author has to touch from what they can leave alone, and to get a first algorithm running through the shortest possible path.
0. Five-minute tour
| Things you touch | Location |
|---|---|
Algorithm module (kernel + optional neighbors()) |
src/kernbench/ccl/algorithms/<algo>.py |
| Algorithm registration | ccl.yaml |
| Host bench (rank count, init, launch, verify) | benches/<your_bench>.py |
| (Optional) unit test | tests/test_<algo>.py |
| Things you do NOT touch | Location |
|---|---|
| TLContext API | src/kernbench/triton_emu/tl_context.py (ADR-0022 spec) |
| Framework (topology generators, helpers, mock testing) | src/kernbench/ccl/ |
| PE_IPCQ / PE_DMA components | src/kernbench/components/builtin/ |
Backend implementation (install_ipcq) |
src/kernbench/runtime_api/distributed.py and kernbench/ccl/install.py |
Workflow:
- Write a
kernelfunction in the algorithm module. - Register an entry in
ccl.yaml. - Write a host bench using
torch.distributed.init_process_group/torch.distributed.all_reduce(the unifiedbenches/ccl_allreduce.pyhandles the common case). - (Optional) Run the mock runtime for fast unit tests (a few ms).
kernbench run --bench <name> --verify-datafor full SimPy verification.
1. Hello World — the simplest send/recv
Each PE sends its tile to its E neighbor once and receives a tile from
its W neighbor once. The reference code lives in
src/kernbench/ccl/algorithms/hello_send.py.
Step 1: write the kernel
New file src/kernbench/ccl/algorithms/hello_send.py:
"""Hello world: send your tile to the next rank, receive from the previous one."""
def kernel(t_ptr, n_elem, tl):
# Global rank is computed from program_id(0/1) (ADR-0022).
local_pe = tl.program_id(axis=0)
cube_id = tl.program_id(axis=1)
pes_per_cube = tl.num_programs(axis=0)
rank = cube_id * pes_per_cube + local_pe
nbytes = n_elem * 2 # f16
pe_addr = t_ptr + rank * nbytes
# Load our slice and send it east.
src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
tl.send(dir="E", src=src)
# Receive from west and store directly back into our slice.
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
tl.store(pe_addr, recv)
def kernel_args(world_size: int, n_elem: int) -> tuple:
"""Positional kernel args used by the ahbm backend (after t_ptr)."""
return (n_elem,)
Key points:
- Global rank is computed from
program_id(axis=0)+program_id(axis=1). TL has no contractually-supportedtl.rank/tl.world_size. If the host needs to passworld_sizeor anything else as an algorithm parameter, it goes through ordinarytorch.launcharguments. tl.sendtakes aTensorHandle. PE_IPCQ readsaddr/space/shape/dtype/nbytesfrom the handle to issue anIpcqDmaTokento PE_DMA.tl.recvrequiresshapeanddtype. The returned TensorHandle points at the IPCQ ring slot and can be used directly as adsthandle (e.g.tl.store(pe_addr, recv)). Phase 2'sdma_writereplay handles the (slot → hbm) copy, so user code never has to touchrecv.data.
Step 2: register in ccl.yaml
algorithms:
hello_send:
module: kernbench.ccl.algorithms.hello_send
topology: ring_1d
buffer_kind: tcm
world_size: 8
world_size here is optional. If absent, AhbmCCLBackend derives it
from the topology spec (sips × cubes_per_sip × pes_per_cube).
Step 3: write a host bench (optional — the unified bench may suffice)
For most CCL benchmarks the existing benches/ccl_allreduce.py is
sufficient: it reads ccl.yaml, picks the algorithm, sets up the
process group, and runs the collective. If your algorithm needs custom
host logic, write a new bench file along the same lines.
The host code looks like a real PyTorch DDP worker:
"""benches/ccl_hello.py"""
from __future__ import annotations
import numpy as np
from kernbench.policy.placement.dp import DPPolicy
N_ELEM = 8
def worker(rank: int, world_size: int, torch) -> None:
"""Per-rank business logic — mirrors a real PyTorch DDP worker."""
dp = DPPolicy(
cube="replicate", pe="column_wise",
num_cubes=1, num_pes=world_size,
)
tensor = torch.zeros(
(1, world_size * N_ELEM), dtype="f16", dp=dp, name="hello_in",
)
# Per-rank initialization via the real PyTorch idiom.
init = np.zeros((1, world_size * N_ELEM), dtype=np.float16)
for r in range(world_size):
init[0, r * N_ELEM : (r + 1) * N_ELEM] = float(r + 1)
tensor.copy_(torch.from_numpy(init))
# The collective itself.
torch.distributed.all_reduce(tensor, op="sum")
# Verify on rank 0 (real PyTorch DDP idiom).
if rank == 0:
result = tensor.numpy()
for r in range(world_size):
expected = float(((r - 1) % world_size) + 1)
slice_r = result[0, r * N_ELEM : (r + 1) * N_ELEM]
print(
f" rank {r}: got {float(slice_r.mean()):.1f}, "
f"expected {expected:.1f}"
)
def run(torch) -> None:
"""CLI entry point. Initializes dist, dispatches to worker."""
dist = torch.distributed
dist.init_process_group(backend="ahbm")
worker(
rank=dist.get_rank(),
world_size=dist.get_world_size(),
torch=torch,
)
Step 4: unit test (optional but strongly recommended)
tests/test_hello_send.py:
import numpy as np
from kernbench.ccl.algorithms.hello_send import kernel
from kernbench.ccl.testing import run_kernel_in_mock
def test_hello_send_4_ranks():
n_elem = 8
inputs = [
np.full((n_elem,), float(r + 1), dtype=np.float16)
for r in range(4)
]
outputs = run_kernel_in_mock(
kernel_fn=kernel,
world_size=4,
topology="ring_1d",
inputs=inputs,
kernel_args=(n_elem,),
)
# rank r should now hold rank (r-1) % 4's data.
for r in range(4):
assert np.array_equal(outputs[r], inputs[(r - 1) % 4])
run_kernel_in_mock runs every rank concurrently in pure Python (no
SimPy), so a unit test like this finishes in milliseconds. It only
verifies algorithmic correctness — no latency, no DMA, no fabric.
Step 5: SimPy validation
kernbench run --topology topology.yaml --bench ccl_hello --verify-data
Phase 1 runs the SimPy simulation + MemoryStore data movement, Phase 2
replays the op_log for correctness. The bench's print lines should
show OK for every rank.
2. Ring all-reduce — the second algorithm
Slightly more complex. Each PE runs world_size - 1 rounds, sending
its current tile east and accumulating the tile received from the west.
After all rounds, every PE holds the global sum.
The reference implementation lives in
src/kernbench/ccl/algorithms/ring_allreduce.py.
The core flow:
"""Ring all-reduce."""
def kernel(t_ptr, n_elem, world_size, tl):
local_pe = tl.program_id(axis=0)
cube_id = tl.program_id(axis=1)
pes_per_cube = tl.num_programs(axis=0)
rank = cube_id * pes_per_cube + local_pe
nbytes = n_elem * 2
pe_addr = t_ptr + rank * nbytes
# The handle points at HBM[pe_addr]. In greenlet mode .data is
# populated, but the kernel never has to touch .data directly.
acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
current = acc # source for the first send
for _step in range(world_size - 1):
tl.send(dir="E", src=current)
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
# TensorHandle operator overload → MathCmd → PE_MATH dispatch.
# Phase 1 only models timing; Phase 2 DataExecutor replays the
# actual numpy accumulation.
acc = acc + recv
current = recv # forward the received slot to the next round
# Store the final accumulator back to HBM. Source is acc (a PE-local
# scratch addr); dst is HBM. The op_log dma_write entry records both
# ends so Phase 2 copies the math result into HBM at verify time.
tl.store(pe_addr, acc)
def kernel_args(world_size: int, n_elem: int) -> tuple:
return (n_elem, world_size)
Four key points:
- Accumulation goes through TensorHandle operators.
acc + recvemits aMathCmdand dispatches it through PE_MATH — i.e. the real hardware path, so the latency model stays accurate. Per ADR-0020 D3, Phase 1 only simulates timing; Phase 2'sDataExecutorreplays the op_log and runs the actual numpy accumulation. - Use
current = recvto forward. Each round must update the send source to the just-received slot handle so the same data circulates exactly once around the ring. Settingcurrent = accwould resend the cumulative sum, inflating the result. tl.store(pe_addr, acc)exactly once at the end. Do not use a store→reload pattern in the middle.acclives in PE-local scratch; the op_log records(src=scratch, dst=hbm)and Phase 2 first runs math (filling scratch) then copies via the dma_write snapshot.world_sizeis passed by the host explicitly. TL only knows the topology slot count (e.g.num_programs(axis=0)is "PEs per cube"), not the participating CCL group size. The host bench knowsworld_sizeand forwards it as an explicit kernel argument.
For registration in ccl.yaml and wiring through the unified bench,
look at the existing ring_allreduce_tcm/_hbm/_sram entries plus
benches/ccl_allreduce.py. Mock unit
tests live in
tests/test_ccl_mock_runtime.py
and follow the kernel_args=(n_elem, world_size) convention.
3. neighbors() override — custom topology
Most algorithms are happy with the builtin topologies (ring_1d,
mesh_2d, tree_binary, ring_1d_unidir, none). If you want to
modify a builtin or define a brand-new connectivity pattern, define a
neighbors() function in your algorithm module.
Signature
def neighbors(
rank: int, world_size: int, neighbor_map: dict[str, int],
) -> dict[str, int] | None:
"""Override the neighbor map produced by the builtin topology.
Args:
neighbor_map: the mapping the ccl.yaml ``topology`` field built.
For ring_1d this is {"E": (rank+1)%ws, "W": (rank-1)%ws}.
The dict is mutable — modify in place if you want.
Returns:
dict: the new neighbor map (or the modified-in-place dict).
None: do not override; use neighbor_map as-is.
"""
return None
Pattern A: tweak a builtin
def neighbors(rank, world_size, neighbor_map):
# Only even ranks use W; remove W from odd ranks.
if rank % 2 == 1:
neighbor_map.pop("W", None)
return neighbor_map
Pattern B: replace entirely (skip-connection ring)
def neighbors(rank, world_size, neighbor_map):
return {"E": (rank + 2) % world_size}
Pattern C: keep builtin
Either omit neighbors entirely or return None:
def neighbors(rank, world_size, neighbor_map):
return None # explicit "use the builtin"
4. PE kernel API reference (ADR-0023 D4)
IPCQ API
| API | Description | Blocking? |
|---|---|---|
tl.send(dir, src=TensorHandle) |
Send to a peer in the given direction. | Yes (waits if peer slots are full) |
tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...) |
Same, keyword form. | Yes |
tl.recv(dir, shape=..., dtype=...) |
Blocking recv from one direction. | Yes |
tl.recv(shape=..., dtype=...) |
Round-robin recv across all four directions. | Yes |
tl.recv_async(dir, shape=..., dtype=...) → RecvFuture |
Non-blocking recv. | No |
tl.wait(future) |
Wait for a non-blocking recv future → returns the resolved TensorHandle. | Yes |
Existing TL API (ADR-0020/0022, unchanged)
| API | Description |
|---|---|
tl.load(addr, shape, dtype) → TensorHandle |
DMA read; in greenlet mode .data carries the ndarray. |
tl.store(addr, handle) |
DMA write — when handle.data is set the runner propagates it to MemoryStore. |
tl.composite(op, ...) |
Submit a GEMM/Math composite (non-blocking). |
tl.program_id(axis=0) |
Local PE id within the cube. |
tl.program_id(axis=1) |
Cube id (ADR-0022). |
tl.num_programs(axis=0/1) |
Topology slot counts (NOT the participating-rank count). |
Two recv modes
The default is return_slot (zero-copy): the IPCQ slot address is
returned in handle.addr. To force a copy into a custom destination,
pass dst_addr + dst_space:
recv = tl.recv(
dir="W", shape=(8,), dtype="f16",
dst_addr=my_scratch_addr,
dst_space="hbm",
)
# After this call recv.addr == my_scratch_addr (copy_to_dst mode).
5. Helpers (kernbench.ccl.helpers)
Convenience helpers to keep algorithm code short:
from kernbench.ccl.helpers import chunked, ring_step, tree_step
chunked(base_addr, n_chunks, n_elem, dtype="f16") → list[Chunk]
Split a tile of n_elem elements into n_chunks equal-size views.
Each Chunk has addr, n_elem, nbytes fields.
chunks = chunked(t_ptr, n_chunks=4, n_elem=64, dtype="f16")
# chunks[0..3] are 16-element views with consecutive addresses.
ring_step(rank, step, world_size) → (send_idx, recv_idx)
Per-step chunk indices for a ring algorithm (reduce-scatter / all-gather):
for step in range(world_size - 1):
send_idx, recv_idx = ring_step(rank, step, world_size)
tl.send(
dir="E", src_addr=chunks[send_idx].addr,
nbytes=chunks[send_idx].nbytes,
shape=(chunks[send_idx].n_elem,), dtype="f16",
)
recv = tl.recv(
dir="W", shape=(chunks[recv_idx].n_elem,), dtype="f16",
)
# accumulate ...
tree_step(rank, world_size) → {"parent": int|None, "children": list[int]}
Parent / children rank ids for a binary tree:
info = tree_step(rank, world_size)
if info["parent"] is None:
print(f"rank {rank} is the root")
for child in info["children"]:
...
6. Unit testing — Mock runtime
kernbench.ccl.testing.run_kernel_in_mock runs an algorithm without
SimPy for fast feedback.
Basic usage
import numpy as np
from kernbench.ccl.testing import run_kernel_in_mock
from kernbench.ccl.algorithms.my_algo import kernel
def test_my_algo():
n_elem = 16
inputs = [np.arange(n_elem, dtype="f16") + r for r in range(4)]
expected = sum(inputs)
outputs = run_kernel_in_mock(
kernel_fn=kernel,
world_size=4,
topology="ring_1d",
inputs=inputs,
kernel_args=(n_elem, 4), # positional args after t_ptr
)
for r in range(4):
assert np.allclose(outputs[r], expected, rtol=1e-3)
Behavior
- All ranks run their kernels concurrently as cooperative greenlets.
tl.send/tl.recvare serviced by in-memory FIFOs (no DMA, no latency).- Each rank's last
storeis what the helper returns as a numpy array.
Limitations
- No latency or performance numbers (it is not a simulation).
- No PE_DMA, fabric, or BW model.
- Correctness only.
- One cube assumed:
program_id(axis=1)is always 0.
7. Debugging
CCL trace
KERNBENCH_CCL_TRACE=1 kernbench run --topology topology.yaml \
--bench ccl_allreduce --verify-data
Per-rank send/recv events appear on stdout:
[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0
[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64
Pointer dump
kernbench.ccl.diagnostics.pointer_dump(engine) returns a multi-line
dump of every PE_IPCQ ring buffer's my_head, my_tail,
peer_head_cache, peer_tail_cache. When something hangs, this shows
which rank is stuck and on what.
Deadlock detection
When the SimPy schedule empties because of unmatched send/recv pairs,
the engine raises IpcqDeadlock and embeds the pointer dump in the
message (ADR-0023 D14 F3). Wait-for-graph visualization is future
work.
8. Common mistakes
1. Using a direction that wasn't installed
topology: ring_1d only installs E and W. Trying:
tl.send(dir="N", ...) # → IpcqInvalidDirection
Fix: switch to topology: mesh_2d, or add N/S in a neighbors() override.
2. send without a matching recv
def kernel(..., tl):
for _ in range(100):
tl.send(dir="E", ...)
# The peer never recvs → ring buffer fills → backpressure → deadlock.
Fix: every send needs a matching recv on the receiver side.
Otherwise IpcqDeadlock is raised.
3. dtype/shape mismatch
By default mismatches are not validated. The author is responsible for
consistency. Set strict_validation: true on a PE_IPCQ node's attrs to
enable D14 F2 strict mode and catch them immediately.
4. Assuming round-robin recv fairness
tl.recv() (no direction) returns the first slot to arrive in
round-robin order, but arrival order is not predictable. If your
algorithm depends on a particular direction, name it explicitly:
tl.recv(dir="N", ...).
5. Confusing num_programs with the CCL group size
tl.num_programs(axis=0/1) reports topology slot counts, not the
number of ranks participating in the collective. The host bench knows
world_size and must pass it through as a kernel argument.
6. Overwriting the send source before it's actually sent
PE_DMA snapshots the source data into the IpcqDmaToken at send time,
preserving in-flight semantics. Even so, the safest pattern is to call
tl.send first and only mutate the source addr afterwards. If you
mutate the addr before tl.send makes it into the PE_DMA queue, the
snapshot will pick up the wrong data.
9. Next steps
- Try other topologies (
mesh_2d,tree_binary). - Faster algorithms (recursive halving / doubling).
- Compare
buffer_kind(tcm/hbm/sram) andbackpressure(poll/sleep) modes for latency. - Larger-scale validation through the unified
ccl_allreducebench with differentccl.yamloverlays.
If you add a new algorithm or pattern, please send a PR.
References
- ADR-0023: IPCQ + PE-level collective design.
- ADR-0022: 2D grid program_id (axis=0/1).
- ADR-0020: 2-pass data execution.
- ADR-0014: PE pipeline execution model.
Existing algorithm examples:
src/kernbench/ccl/algorithms/hello_send.py— simplest send/recvsrc/kernbench/ccl/algorithms/ring_allreduce.py— ring all-reducesrc/kernbench/ccl/algorithms/mesh_allreduce.py— 2D mesh all-reducesrc/kernbench/ccl/algorithms/tree_allreduce.py— binary tree all-reduce