Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

Major changes: PE-level IPCQ infrastructure: - New PE_IPCQ component: ring-buffer control plane with 4-direction neighbor mapping, head/tail pointers, backpressure (poll/sleep). - PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA, including in-flight data snapshot (D9) and op_log recording at outbound time for Phase 2 replay correctness. - IpcqDmaToken piggyback model: data + metadata travel together, atomic visibility at receiver (invariant I6). - Credit return fast path: bottleneck-BW latency, no fabric vc_comm. Phase 2 data execution (ADR-0020 integration): - op_log extended: DmaWriteCmd now captures src_space/src_addr for Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time. - DataExecutor replays dma_write + ipcq_copy in t_start order. - Engine._flush_data_phase: incremental cursor-based replay after each engine.wait() so host reads see post-Phase-2 data. - KernelRunner Phase 1 writes disabled when op_log is active to prevent stale data from corrupting the MemoryStore snapshot. TLContext / kernel API: - tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype), tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode. - TensorHandle operator overloading (add/sub/mul/div) via thread-local active TLContext → MathCmd dispatch through PE_MATH. - PE-local scratch allocator for math output handles. - tl.load returns space="hbm" handles for correct Phase 2 addressing. - Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv. Unified ccl_allreduce bench (PyTorch-compat host code): - Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch) split matching real PyTorch DDP worker pattern. - torch.distributed facade: init_process_group, get_world_size, get_rank, get_backend, all_reduce, barrier — only real PyTorch names. - AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches kernel via tensor shard metadata (n_elem from shards[0].nbytes). - world_size derived from topology spec (sips × cubes × pes_per_cube) with optional algorithm-level override in ccl.yaml. Tensor API (PyTorch-compat surface): - Tensor.numpy(): gather-aware (all shards via VA-based addressing). - Tensor.copy_(source): scatter from host tensor into sharded target. - RuntimeContext.from_numpy(arr): host-side staging tensor. - Tensor.data property fixed to use numpy() (was shards[0]-only). Algorithm modules moved to src/kernbench/ccl/algorithms/: - ring_allreduce, mesh_allreduce, tree_allreduce, hello_send. - Each module exports kernel_args(world_size, n_elem) helper. - ccl.yaml module paths updated to kernbench.ccl.algorithms.*. Dead code removed: - 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.). - _run_ccl_bench greenlet-per-SIP scheduler. - benches.loader.is_ccl_bench + run_rank detection. - benches/ccl/ directory. Tests: - New test_ccl_allreduce_matrix.py: 7 parametrized cases (ring×3 buffers, ring 8/16, mesh 4, tree 7). - New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests. - Existing tests updated for new import paths + world_size_override. Docs: - Korean ccl-author-guide.md and ADR-0023 paths updated. - New English versions: ccl-author-guide.en.md, ADR-0023.en.md. 502 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 19:36:59 -07:00
parent ff2c677a9c
commit 998cc85762
60 changed files with 9196 additions and 80 deletions
@@ -0,0 +1,592 @@
+# CCL Algorithm Author Guide (English)
+
+This document is a step-by-step guide for engineers writing CCL
+(Collective Communication Library) algorithms in kernbench. The
+internal system design and component structure live in
+[ADR-0023](adr/ADR-0023-ipcq-pe-collective.md).
+
+The goal here is to clearly separate **what an algorithm author has to
+touch** from **what they can leave alone**, and to get a first
+algorithm running through the shortest possible path.
+
+---
+
+## 0. Five-minute tour
+
+| Things you touch | Location |
+|------------------|----------|
+| Algorithm module (kernel + optional `neighbors()`) | `src/kernbench/ccl/algorithms/<algo>.py` |
+| Algorithm registration | `ccl.yaml` |
+| Host bench (rank count, init, launch, verify) | `benches/<your_bench>.py` |
+| (Optional) unit test | `tests/test_<algo>.py` |
+
+| Things you do NOT touch | Location |
+|--------------------------|----------|
+| TLContext API | `src/kernbench/triton_emu/tl_context.py` (ADR-0022 spec) |
+| Framework (topology generators, helpers, mock testing) | `src/kernbench/ccl/` |
+| PE_IPCQ / PE_DMA components | `src/kernbench/components/builtin/` |
+| Backend implementation (`install_ipcq`) | `src/kernbench/runtime_api/distributed.py` and `kernbench/ccl/install.py` |
+
+Workflow:
+1. Write a `kernel` function in the algorithm module.
+2. Register an entry in `ccl.yaml`.
+3. Write a host bench using `torch.distributed.init_process_group` /
+   `torch.distributed.all_reduce` (the unified `benches/ccl_allreduce.py`
+   handles the common case).
+4. (Optional) Run the mock runtime for fast unit tests (a few ms).
+5. `kernbench run --bench <name> --verify-data` for full SimPy verification.
+
+---
+
+## 1. Hello World — the simplest send/recv
+
+Each PE sends its tile to its E neighbor once and receives a tile from
+its W neighbor once. The reference code lives in
+[`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py).
+
+### Step 1: write the kernel
+
+New file `src/kernbench/ccl/algorithms/hello_send.py`:
+
+```python
+"""Hello world: send your tile to the next rank, receive from the previous one."""
+
+
+def kernel(t_ptr, n_elem, tl):
+    # Global rank is computed from program_id(0/1) (ADR-0022).
+    local_pe = tl.program_id(axis=0)
+    cube_id = tl.program_id(axis=1)
+    pes_per_cube = tl.num_programs(axis=0)
+    rank = cube_id * pes_per_cube + local_pe
+
+    nbytes = n_elem * 2  # f16
+    pe_addr = t_ptr + rank * nbytes
+
+    # Load our slice and send it east.
+    src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
+    tl.send(dir="E", src=src)
+
+    # Receive from west and store directly back into our slice.
+    recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
+    tl.store(pe_addr, recv)
+
+
+def kernel_args(world_size: int, n_elem: int) -> tuple:
+    """Positional kernel args used by the ahbm backend (after t_ptr)."""
+    return (n_elem,)
+```
+
+Key points:
+
+- **Global rank is computed from `program_id(axis=0)` + `program_id(axis=1)`.**
+  TL has no contractually-supported `tl.rank` / `tl.world_size`. If the
+  host needs to pass `world_size` or anything else as an algorithm
+  parameter, it goes through ordinary `torch.launch` arguments.
+- **`tl.send` takes a `TensorHandle`.** PE_IPCQ reads
+  `addr`/`space`/`shape`/`dtype`/`nbytes` from the handle to issue an
+  `IpcqDmaToken` to PE_DMA.
+- **`tl.recv` requires `shape` and `dtype`.** The returned TensorHandle
+  points at the IPCQ ring slot and can be used directly as a `dst`
+  handle (e.g. `tl.store(pe_addr, recv)`). Phase 2's `dma_write` replay
+  handles the (slot → hbm) copy, so user code never has to touch
+  `recv.data`.
+
+### Step 2: register in `ccl.yaml`
+
+```yaml
+algorithms:
+  hello_send:
+    module: kernbench.ccl.algorithms.hello_send
+    topology: ring_1d
+    buffer_kind: tcm
+    world_size: 8
+```
+
+`world_size` here is optional. If absent, `AhbmCCLBackend` derives it
+from the topology spec (`sips × cubes_per_sip × pes_per_cube`).
+
+### Step 3: write a host bench (optional — the unified bench may suffice)
+
+For most CCL benchmarks the existing `benches/ccl_allreduce.py` is
+sufficient: it reads `ccl.yaml`, picks the algorithm, sets up the
+process group, and runs the collective. If your algorithm needs custom
+host logic, write a new bench file along the same lines.
+
+The host code looks like a real PyTorch DDP worker:
+
+```python
+"""benches/ccl_hello.py"""
+from __future__ import annotations
+
+import numpy as np
+
+from kernbench.policy.placement.dp import DPPolicy
+
+
+N_ELEM = 8
+
+
+def worker(rank: int, world_size: int, torch) -> None:
+    """Per-rank business logic — mirrors a real PyTorch DDP worker."""
+    dp = DPPolicy(
+        sip="replicate", cube="replicate", pe="column_wise",
+        num_sips=1, num_cubes=1, num_pes=world_size,
+    )
+    tensor = torch.zeros(
+        (1, world_size * N_ELEM), dtype="f16", dp=dp, name="hello_in",
+    )
+
+    # Per-rank initialization via the real PyTorch idiom.
+    init = np.zeros((1, world_size * N_ELEM), dtype=np.float16)
+    for r in range(world_size):
+        init[0, r * N_ELEM : (r + 1) * N_ELEM] = float(r + 1)
+    tensor.copy_(torch.from_numpy(init))
+
+    # The collective itself.
+    torch.distributed.all_reduce(tensor, op="sum")
+
+    # Verify on rank 0 (real PyTorch DDP idiom).
+    if rank == 0:
+        result = tensor.numpy()
+        for r in range(world_size):
+            expected = float(((r - 1) % world_size) + 1)
+            slice_r = result[0, r * N_ELEM : (r + 1) * N_ELEM]
+            print(
+                f"  rank {r}: got {float(slice_r.mean()):.1f}, "
+                f"expected {expected:.1f}"
+            )
+
+
+def run(torch) -> None:
+    """CLI entry point. Initializes dist, dispatches to worker."""
+    dist = torch.distributed
+    dist.init_process_group(backend="ahbm")
+    worker(
+        rank=dist.get_rank(),
+        world_size=dist.get_world_size(),
+        torch=torch,
+    )
+```
+
+### Step 4: unit test (optional but strongly recommended)
+
+`tests/test_hello_send.py`:
+
+```python
+import numpy as np
+
+from kernbench.ccl.algorithms.hello_send import kernel
+from kernbench.ccl.testing import run_kernel_in_mock
+
+
+def test_hello_send_4_ranks():
+    n_elem = 8
+    inputs = [
+        np.full((n_elem,), float(r + 1), dtype=np.float16)
+        for r in range(4)
+    ]
+    outputs = run_kernel_in_mock(
+        kernel_fn=kernel,
+        world_size=4,
+        topology="ring_1d",
+        inputs=inputs,
+        kernel_args=(n_elem,),
+    )
+    # rank r should now hold rank (r-1) % 4's data.
+    for r in range(4):
+        assert np.array_equal(outputs[r], inputs[(r - 1) % 4])
+```
+
+`run_kernel_in_mock` runs every rank concurrently in pure Python (no
+SimPy), so a unit test like this finishes in **milliseconds**. It only
+verifies algorithmic correctness — no latency, no DMA, no fabric.
+
+### Step 5: SimPy validation
+
+```bash
+kernbench run --topology topology.yaml --bench ccl_hello --verify-data
+```
+
+Phase 1 runs the SimPy simulation + MemoryStore data movement, Phase 2
+replays the op_log for correctness. The bench's `print` lines should
+show OK for every rank.
+
+---
+
+## 2. Ring all-reduce — the second algorithm
+
+Slightly more complex. Each PE runs `world_size - 1` rounds, sending
+its current tile east and accumulating the tile received from the west.
+After all rounds, every PE holds the global sum.
+
+The reference implementation lives in
+[`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py).
+The core flow:
+
+```python
+"""Ring all-reduce."""
+
+
+def kernel(t_ptr, n_elem, world_size, tl):
+    local_pe = tl.program_id(axis=0)
+    cube_id = tl.program_id(axis=1)
+    pes_per_cube = tl.num_programs(axis=0)
+    rank = cube_id * pes_per_cube + local_pe
+    nbytes = n_elem * 2
+    pe_addr = t_ptr + rank * nbytes
+
+    # The handle points at HBM[pe_addr]. In greenlet mode .data is
+    # populated, but the kernel never has to touch .data directly.
+    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
+    current = acc  # source for the first send
+
+    for _step in range(world_size - 1):
+        tl.send(dir="E", src=current)
+        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
+        # TensorHandle operator overload → MathCmd → PE_MATH dispatch.
+        # Phase 1 only models timing; Phase 2 DataExecutor replays the
+        # actual numpy accumulation.
+        acc = acc + recv
+        current = recv  # forward the received slot to the next round
+
+    # Store the final accumulator back to HBM. Source is acc (a PE-local
+    # scratch addr); dst is HBM. The op_log dma_write entry records both
+    # ends so Phase 2 copies the math result into HBM at verify time.
+    tl.store(pe_addr, acc)
+
+
+def kernel_args(world_size: int, n_elem: int) -> tuple:
+    return (n_elem, world_size)
+```
+
+Four key points:
+
+1. **Accumulation goes through TensorHandle operators.** `acc + recv`
+   emits a `MathCmd` and dispatches it through PE_MATH — i.e. the
+   real hardware path, so the latency model stays accurate. Per
+   ADR-0020 D3, Phase 1 only simulates timing; Phase 2's `DataExecutor`
+   replays the op_log and runs the actual numpy accumulation.
+2. **Use `current = recv` to forward.** Each round must update the send
+   source to the just-received slot handle so the same data circulates
+   exactly once around the ring. Setting `current = acc` would resend
+   the cumulative sum, inflating the result.
+3. **`tl.store(pe_addr, acc)` exactly once at the end.** Do not use a
+   store→reload pattern in the middle. `acc` lives in PE-local scratch;
+   the op_log records `(src=scratch, dst=hbm)` and Phase 2 first runs
+   math (filling scratch) then copies via the dma_write snapshot.
+4. **`world_size` is passed by the host explicitly.** TL only knows the
+   topology slot count (e.g. `num_programs(axis=0)` is "PEs per cube"),
+   not the participating CCL group size. The host bench knows
+   `world_size` and forwards it as an explicit kernel argument.
+
+For registration in `ccl.yaml` and wiring through the unified bench,
+look at the existing `ring_allreduce_tcm/_hbm/_sram` entries plus
+[`benches/ccl_allreduce.py`](../benches/ccl_allreduce.py). Mock unit
+tests live in
+[`tests/test_ccl_mock_runtime.py`](../tests/test_ccl_mock_runtime.py)
+and follow the `kernel_args=(n_elem, world_size)` convention.
+
+---
+
+## 3. `neighbors()` override — custom topology
+
+Most algorithms are happy with the builtin topologies (`ring_1d`,
+`mesh_2d`, `tree_binary`, `ring_1d_unidir`, `none`). If you want to
+modify a builtin or define a brand-new connectivity pattern, define a
+`neighbors()` function in your algorithm module.
+
+### Signature
+
+```python
+def neighbors(
+    rank: int, world_size: int, neighbor_map: dict[str, int],
+) -> dict[str, int] | None:
+    """Override the neighbor map produced by the builtin topology.
+
+    Args:
+        neighbor_map: the mapping the ccl.yaml ``topology`` field built.
+                      For ring_1d this is {"E": (rank+1)%ws, "W": (rank-1)%ws}.
+                      The dict is mutable — modify in place if you want.
+
+    Returns:
+        dict: the new neighbor map (or the modified-in-place dict).
+        None: do not override; use neighbor_map as-is.
+    """
+    return None
+```
+
+### Pattern A: tweak a builtin
+
+```python
+def neighbors(rank, world_size, neighbor_map):
+    # Only even ranks use W; remove W from odd ranks.
+    if rank % 2 == 1:
+        neighbor_map.pop("W", None)
+    return neighbor_map
+```
+
+### Pattern B: replace entirely (skip-connection ring)
+
+```python
+def neighbors(rank, world_size, neighbor_map):
+    return {"E": (rank + 2) % world_size}
+```
+
+### Pattern C: keep builtin
+
+Either omit `neighbors` entirely or return None:
+
+```python
+def neighbors(rank, world_size, neighbor_map):
+    return None  # explicit "use the builtin"
+```
+
+---
+
+## 4. PE kernel API reference (ADR-0023 D4)
+
+### IPCQ API
+
+| API | Description | Blocking? |
+|-----|-------------|-----------|
+| `tl.send(dir, src=TensorHandle)` | Send to a peer in the given direction. | Yes (waits if peer slots are full) |
+| `tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)` | Same, keyword form. | Yes |
+| `tl.recv(dir, shape=..., dtype=...)` | Blocking recv from one direction. | Yes |
+| `tl.recv(shape=..., dtype=...)` | Round-robin recv across all four directions. | Yes |
+| `tl.recv_async(dir, shape=..., dtype=...) → RecvFuture` | Non-blocking recv. | No |
+| `tl.wait(future)` | Wait for a non-blocking recv future → returns the resolved TensorHandle. | Yes |
+
+### Existing TL API (ADR-0020/0022, unchanged)
+
+| API | Description |
+|-----|-------------|
+| `tl.load(addr, shape, dtype) → TensorHandle` | DMA read; in greenlet mode `.data` carries the ndarray. |
+| `tl.store(addr, handle)` | DMA write — when `handle.data` is set the runner propagates it to MemoryStore. |
+| `tl.composite(op, ...)` | Submit a GEMM/Math composite (non-blocking). |
+| `tl.program_id(axis=0)` | Local PE id within the cube. |
+| `tl.program_id(axis=1)` | Cube id (ADR-0022). |
+| `tl.num_programs(axis=0/1)` | Topology slot counts (NOT the participating-rank count). |
+
+### Two recv modes
+
+The default is `return_slot` (zero-copy): the IPCQ slot address is
+returned in `handle.addr`. To force a copy into a custom destination,
+pass `dst_addr` + `dst_space`:
+
+```python
+recv = tl.recv(
+    dir="W", shape=(8,), dtype="f16",
+    dst_addr=my_scratch_addr,
+    dst_space="hbm",
+)
+# After this call recv.addr == my_scratch_addr (copy_to_dst mode).
+```
+
+---
+
+## 5. Helpers (`kernbench.ccl.helpers`)
+
+Convenience helpers to keep algorithm code short:
+
+```python
+from kernbench.ccl.helpers import chunked, ring_step, tree_step
+```
+
+### `chunked(base_addr, n_chunks, n_elem, dtype="f16") → list[Chunk]`
+
+Split a tile of `n_elem` elements into `n_chunks` equal-size views.
+Each `Chunk` has `addr`, `n_elem`, `nbytes` fields.
+
+```python
+chunks = chunked(t_ptr, n_chunks=4, n_elem=64, dtype="f16")
+# chunks[0..3] are 16-element views with consecutive addresses.
+```
+
+### `ring_step(rank, step, world_size) → (send_idx, recv_idx)`
+
+Per-step chunk indices for a ring algorithm (reduce-scatter / all-gather):
+
+```python
+for step in range(world_size - 1):
+    send_idx, recv_idx = ring_step(rank, step, world_size)
+    tl.send(
+        dir="E", src_addr=chunks[send_idx].addr,
+        nbytes=chunks[send_idx].nbytes,
+        shape=(chunks[send_idx].n_elem,), dtype="f16",
+    )
+    recv = tl.recv(
+        dir="W", shape=(chunks[recv_idx].n_elem,), dtype="f16",
+    )
+    # accumulate ...
+```
+
+### `tree_step(rank, world_size) → {"parent": int|None, "children": list[int]}`
+
+Parent / children rank ids for a binary tree:
+
+```python
+info = tree_step(rank, world_size)
+if info["parent"] is None:
+    print(f"rank {rank} is the root")
+for child in info["children"]:
+    ...
+```
+
+---
+
+## 6. Unit testing — Mock runtime
+
+`kernbench.ccl.testing.run_kernel_in_mock` runs an algorithm without
+SimPy for fast feedback.
+
+### Basic usage
+
+```python
+import numpy as np
+
+from kernbench.ccl.testing import run_kernel_in_mock
+from kernbench.ccl.algorithms.my_algo import kernel
+
+
+def test_my_algo():
+    n_elem = 16
+    inputs = [np.arange(n_elem, dtype="f16") + r for r in range(4)]
+    expected = sum(inputs)
+    outputs = run_kernel_in_mock(
+        kernel_fn=kernel,
+        world_size=4,
+        topology="ring_1d",
+        inputs=inputs,
+        kernel_args=(n_elem, 4),  # positional args after t_ptr
+    )
+    for r in range(4):
+        assert np.allclose(outputs[r], expected, rtol=1e-3)
+```
+
+### Behavior
+
+- All ranks run their kernels concurrently as cooperative greenlets.
+- `tl.send` / `tl.recv` are serviced by in-memory FIFOs (no DMA, no
+  latency).
+- Each rank's last `store` is what the helper returns as a numpy array.
+
+### Limitations
+
+- No latency or performance numbers (it is not a simulation).
+- No PE_DMA, fabric, or BW model.
+- Correctness only.
+- One cube assumed: `program_id(axis=1)` is always 0.
+
+---
+
+## 7. Debugging
+
+### CCL trace
+
+```bash
+KERNBENCH_CCL_TRACE=1 kernbench run --topology topology.yaml \
+    --bench ccl_allreduce --verify-data
+```
+
+Per-rank send/recv events appear on stdout:
+
+```
+[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0
+[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64
+```
+
+### Pointer dump
+
+`kernbench.ccl.diagnostics.pointer_dump(engine)` returns a multi-line
+dump of every PE_IPCQ ring buffer's `my_head`, `my_tail`,
+`peer_head_cache`, `peer_tail_cache`. When something hangs, this shows
+which rank is stuck and on what.
+
+### Deadlock detection
+
+When the SimPy schedule empties because of unmatched send/recv pairs,
+the engine raises `IpcqDeadlock` and embeds the pointer dump in the
+message (ADR-0023 D14 F3). Wait-for-graph visualization is future
+work.
+
+---
+
+## 8. Common mistakes
+
+### 1. Using a direction that wasn't installed
+
+`topology: ring_1d` only installs E and W. Trying:
+
+```python
+tl.send(dir="N", ...)   # → IpcqInvalidDirection
+```
+
+Fix: switch to `topology: mesh_2d`, or add N/S in a `neighbors()` override.
+
+### 2. `send` without a matching `recv`
+
+```python
+def kernel(..., tl):
+    for _ in range(100):
+        tl.send(dir="E", ...)
+    # The peer never recvs → ring buffer fills → backpressure → deadlock.
+```
+
+Fix: every `send` needs a matching `recv` on the receiver side.
+Otherwise `IpcqDeadlock` is raised.
+
+### 3. dtype/shape mismatch
+
+By default mismatches are not validated. The author is responsible for
+consistency. Set `strict_validation: true` on a PE_IPCQ node's attrs to
+enable D14 F2 strict mode and catch them immediately.
+
+### 4. Assuming round-robin recv fairness
+
+`tl.recv()` (no direction) returns the first slot to arrive in
+round-robin order, but **arrival order is not predictable**. If your
+algorithm depends on a particular direction, name it explicitly:
+`tl.recv(dir="N", ...)`.
+
+### 5. Confusing `num_programs` with the CCL group size
+
+`tl.num_programs(axis=0/1)` reports topology slot counts, not the
+number of ranks participating in the collective. The host bench knows
+`world_size` and must pass it through as a kernel argument.
+
+### 6. Overwriting the send source before it's actually sent
+
+PE_DMA snapshots the source data into the IpcqDmaToken at send time,
+preserving in-flight semantics. Even so, the safest pattern is to call
+`tl.send` first and only mutate the source addr afterwards. If you
+mutate the addr before `tl.send` makes it into the PE_DMA queue, the
+snapshot will pick up the wrong data.
+
+---
+
+## 9. Next steps
+
+- Try other topologies (`mesh_2d`, `tree_binary`).
+- Faster algorithms (recursive halving / doubling).
+- Compare `buffer_kind` (tcm/hbm/sram) and `backpressure` (poll/sleep)
+  modes for latency.
+- Larger-scale validation through the unified `ccl_allreduce` bench
+  with different `ccl.yaml` overlays.
+
+If you add a new algorithm or pattern, please send a PR.
+
+---
+
+## References
+
+- [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective design.
+- [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1).
+- [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution.
+- [ADR-0021](adr/ADR-0021-pe-pipeline-refactor.md): PE pipeline refactor.
+
+Existing algorithm examples:
+
+- [`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py) — simplest send/recv
+- [`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py) — ring all-reduce
+- [`src/kernbench/ccl/algorithms/mesh_allreduce.py`](../src/kernbench/ccl/algorithms/mesh_allreduce.py) — 2D mesh all-reduce
+- [`src/kernbench/ccl/algorithms/tree_allreduce.py`](../src/kernbench/ccl/algorithms/tree_allreduce.py) — binary tree all-reduce