# ADR-0050: CCL Algorithm Module Contract (`ccl/algorithms/*.py`)

## Status

Accepted (2026-05-22).

Pins down the interface, kernel signature, and addition workflow that a module under `src/kernbench/ccl/algorithms/` must satisfy in order to be used as a collective algorithm by the AHBM CCL backend (ADR-0047). ADR-0047 D3 states only that "the algorithm module must expose `kernel`, `kernel_args`, optionally `TOPO_NAME_TO_KIND`"; **the contract an algorithm-module author needs to follow** has had no ADR-level coverage. This ADR pairs with ADR-0045's bench-module contract.

## First action

An algorithm module is imported at two moments:

1. **AHBM backend entry**: when user code calls `dist.init_process_group(backend="ahbm")`, `AhbmCCLBackend.__init__` runs `self._algo_module = importlib.import_module(self._merged["module"])`. At module level, the following occur first:
   - Topology-kind integer constants like `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, and `SIP_TOPO_MESH` are bound in the module namespace.
   - The `TOPO_NAME_TO_KIND` dict is bound; the backend reads it via `getattr(self._algo_module, "TOPO_NAME_TO_KIND", None)`.
   - The `kernel_args` function is defined for the caller.
   - The actual algorithm function (e.g., `allreduce_intercube_multidevice`) is defined.
   - At the bottom of the module, `kernel = allreduce_intercube_multidevice` publishes the alias.
2. **ccl.yaml install stage**: `kernbench.ccl.install.install_ipcq` imports the same algorithm module while pushing the IPCQ neighbor table.

In short, **the algorithm module's first act is "publish topology-kind constants, the `TOPO_NAME_TO_KIND` dict, the `kernel_args` function, and the `kernel` alias into the module namespace"**, all as import-time side effects with no separate initialization call.

## Context

`AhbmCCLBackend` (ADR-0047), at process-group creation, dynamically imports a module path obtained from `ccl.yaml`'s `defaults.algorithm` (or a user-specified algorithm).
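
The import-time publishing described above can be sketched with an in-memory stand-in module. This is a minimal illustration under stated assumptions: the module name `fake_algo`, the function `allreduce_demo`, and the helper `register_fake_module` are hypothetical and not part of kernbench; only `importlib.import_module` and the `getattr(..., "TOPO_NAME_TO_KIND", None)` lookup mirror what the text says the backend does.

```python
import importlib
import sys
import types

# Hypothetical stand-in for an algorithm module's source; not the real
# kernbench file. Everything is published as an import-time side effect.
_SOURCE = '''
SIP_TOPO_RING = 0
SIP_TOPO_TORUS = 1
SIP_TOPO_MESH = 2

TOPO_NAME_TO_KIND = {
    "ring_1d": SIP_TOPO_RING,
    "torus_2d": SIP_TOPO_TORUS,
    "mesh_2d_no_wrap": SIP_TOPO_MESH,
}

def kernel_args(world_size, n_elem, *, cube_w=4, cube_h=4):
    return (n_elem, cube_w, cube_h, world_size)

def allreduce_demo(t_ptr, n_elem, cube_w, cube_h, n_sips,
                   sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    return None

kernel = allreduce_demo
'''


def register_fake_module(name: str = "fake_algo") -> None:
    """Execute the stand-in source and register it so it can be imported by name."""
    module = types.ModuleType(name)
    exec(_SOURCE, module.__dict__)
    sys.modules[name] = module


register_fake_module()

# Same two steps the text attributes to the backend: import by string,
# then read the optional mapping with a None default.
algo = importlib.import_module("fake_algo")
topo_map = getattr(algo, "TOPO_NAME_TO_KIND", None)

# Names the backend relies on that this module actually published.
published = sorted(
    n for n in ("kernel", "kernel_args", "TOPO_NAME_TO_KIND") if hasattr(algo, n)
)
```

No initialization call appears anywhere in the sketch: once the import returns, `algo.kernel` and `algo.kernel_args` are already usable.
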
The backend expects four things from the module:

- `kernel`: the collective's entry function.
- `kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple`: a function returning a tuple that packs the kernel's positional arguments.
- `TOPO_NAME_TO_KIND` (optional): a dict mapping `topology.yaml`'s `sips.topology` string (e.g., `"ring_1d"`, `"torus_2d"`, `"mesh_2d_no_wrap"`) to the integer kind constants.
- (Indirectly) IPCQ neighbor-table install: `configure_sfr_intercube_multisip` reads the module's `TOPO_NAME_TO_KIND` plus cube dimensions to decide the SFR.

The current corpus has one algorithm module: `lrab_hierarchical_allreduce.py` (248 lines). The name expands to "**l**eft-**r**ight **a**lternating **b**roadcast hierarchical allreduce". When future modules like `ring_allreduce`, `tree_allreduce`, or `broadcast` are added, they must follow this contract for the backend's dispatch path to keep working.

Without an ADR-level contract:

- A new algorithm author has to infer the signature from ADR-0047 D3's one-liner.
- The kernel-function argument order (especially `t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl`) is unclear without grep.
- What `kernel_args` takes as inputs and what tuple it must return is a convention, but an undocumented one.

## Decision

### D1. The algorithm module exposes four public symbols

```python
# src/kernbench/ccl/algorithms/.py
from __future__ import annotations

# (required) topology-kind constants, referenced internally
SIP_TOPO_RING = 0
SIP_TOPO_TORUS = 1
SIP_TOPO_MESH = 2

# (optional) topology name → kind mapping. Used by the backend to
# translate ccl.yaml/topology's string SIP topology into an integer.
TOPO_NAME_TO_KIND = {
    "ring_1d": SIP_TOPO_RING,
    "torus_2d": SIP_TOPO_TORUS,
    "mesh_2d_no_wrap": SIP_TOPO_MESH,
}


# (required) kernel argument builder
def kernel_args(world_size: int, n_elem: int, *,
                cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)


# (required) kernel function (TLContext is injected via the `tl=...`
# keyword argument).
def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                        sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h,
                        *, tl):
    ...


# (required) kernel alias: the backend accesses `module.kernel`
kernel = my_allreduce_kernel
```

- The `kernel` alias is the entry point the backend invokes. Whatever the function name is (e.g., `allreduce_intercube_multidevice`), it must be bound to the module-level name `kernel` (`kernel = fn`).
- Without `kernel_args`, the backend has no way to build the algorithm's argument list. See D2 for the signature.
- If `TOPO_NAME_TO_KIND` is absent, the backend falls back to `sip_topo_kind = 0` (`SIP_TOPO_RING` under the numbering above). An algorithm supporting only a single topology may omit it.

### D2. `kernel_args` signature: `(world_size, n_elem, *, cube_w, cube_h)`

```python
def kernel_args(world_size: int, n_elem: int, *,
                cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)
```

- **Positional arguments**: `world_size` (= number of ranks), `n_elem` (= element count of a single shard, f16-based).
- **Keyword arguments**: `cube_w`, `cube_h` (= cube-mesh dimensions). Default 4×4, aligned with `topology.yaml`'s `sip.cube_mesh` default.
- **Return**: a tuple in the order the kernel's positional arguments expect.

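
The relationship between the `kernel_args` return tuple and the kernel signature can be checked mechanically. The sketch below is an illustration, not part of kernbench: `demo_kernel`, `positional_names`, and `check_arity` are hypothetical names, and the check assumes the layout described in this ADR, where the first positional parameter is the tensor pointer and the last four are the `sip_*` values supplied by the backend.

```python
import inspect

# Number of trailing positional arguments the backend supplies itself:
# sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h.
N_BACKEND_EXTRAS = 4


def kernel_args(world_size, n_elem, *, cube_w=4, cube_h=4):
    return (n_elem, cube_w, cube_h, world_size)


def demo_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    return None


def positional_names(fn):
    """Names of the parameters that can be passed positionally, in order."""
    positional_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    return [
        p.name
        for p in inspect.signature(fn).parameters.values()
        if p.kind in positional_kinds
    ]


def check_arity(kernel, args_tuple):
    """Return the kernel parameters the tuple fills, or raise on a length mismatch."""
    names = positional_names(kernel)
    # Everything except the leading tensor pointer and the backend's extras
    # has to come from kernel_args.
    expected = len(names) - 1 - N_BACKEND_EXTRAS
    if len(args_tuple) != expected:
        raise TypeError(
            f"kernel_args returned {len(args_tuple)} values, "
            f"kernel expects {expected}"
        )
    return names[1:1 + expected]


filled = check_arity(demo_kernel, kernel_args(8, 16))
```

A check like this could run in a unit test for each new algorithm module, so an ordering or length mistake surfaces before a kernel launch.
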
When the backend calls `all_reduce`:

```python
kernel_args_tuple = self._algo_module.kernel_args(
    self._world_size, n_elem,
    cube_w=eff_cube_w, cube_h=eff_cube_h,
)
extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
pending = self.ctx.launch(
    self._merged["algorithm"], kernel_fn, tensor,
    *kernel_args_tuple, *extra_args,
    _defer_wait=True,
)
```

So the kernel's full positional argument list becomes `(tensor_ptr, *kernel_args_tuple, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`, with `tl=...` injected as a keyword. The length and order of the tuple returned by `kernel_args` must **match, one-to-one, the kernel's positional parameters between `t_ptr` and the four trailing `sip_*` arguments**.

### D3. Kernel signature: standardized 9 + tl arguments

Recommended signature:

```python
def my_kernel(
    t_ptr: int,          # VA base of the row-wise-sharded tensor on this SIP
    n_elem: int,         # element count per cube tile (or per shard)
    cube_w: int,         # cube mesh width (from kernel_args)
    cube_h: int,         # cube mesh height (from kernel_args)
    n_sips: int,         # equal to world_size (rank = SIP, ADR-0024)
    sip_rank: int,       # this SIP's rank
    sip_topo_kind: int,  # result of TOPO_NAME_TO_KIND lookup
    sip_topo_w: int,     # SIP mesh width (0 for ring_1d)
    sip_topo_h: int,     # SIP mesh height (0 for ring_1d)
    *,
    tl,                  # TLContext (auto-injected)
) -> None:
    ...
```

Even if `kernel_args` chose a different positional argument order, the kernel's **last four positional arguments are always `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`**, because the backend appends them as `extra_args` (ADR-0047 D5). A custom algorithm must accept these four, but a single-SIP algorithm may simply ignore them.

`tl` is injected via keyword: `RuntimeContext.launch` adds `tl=tl_ctx` just before invoking the kernel. The signature therefore exposes `tl` as keyword-only (`*, tl`) or as the trailing keyword parameter.

### D4. Kernel body: freedom and constraints

Available inside the kernel: every `tl.*` primitive from ADR-0046 D3. Common patterns:

- `cube_id = tl.program_id(axis=1)`: this PE's cube index.
- `pe_addr = t_ptr + cube_id * nbytes`: per-cube VA of the tile.
- `acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")`: load local data.
- `tl.send(dir=...)` / `tl.recv(dir=..., shape=, dtype=)`: IPCQ collective.
- `acc = acc + recv`: TensorHandle arithmetic operators (ADR-0046 D4).
- `tl.store(pe_addr, acc)`: store the result.

The kernel body is plain Python, so branching and loops are fine. But:

- No SimPy `yield` or `async` (ADR-0046 D1).
- No direct access to TensorHandle `.data`: the Phase 1 timing model doesn't see data dependencies (ADR-0020's 2-pass separation).
- Kernel execution must be deterministic: the same input must produce the same op sequence. No random or external IO.

### D5. SIP topology semantics: meaning of `sip_topo_kind`

The backend looks up `topology.yaml`'s `system.sips.topology` string in the algorithm module's `TOPO_NAME_TO_KIND` and passes the integer as `sip_topo_kind`. The algorithm then branches:

```python
if sip_topo_kind == SIP_TOPO_RING:
    acc = _inter_sip_ring(...)
elif sip_topo_kind == SIP_TOPO_TORUS:
    acc = _inter_sip_torus_2d(...)
elif sip_topo_kind == SIP_TOPO_MESH:
    acc = _inter_sip_mesh_2d(...)
```

Each topology branch communicates with peers via IPCQ direction names (`"global_E"`, `"W"`, `"S"`, `"N"` …). Direction semantics are defined in ADR-0023/0025; `configure_sfr_intercube_multisip` installs the IPCQ neighbor table accordingly.

If a topology kind not supported by the algorithm appears, prefer an explicit `raise ValueError(f"unsupported topology kind {sip_topo_kind}")` over a silent no-op, so that misconfiguration fails fast.

### D6. The `ccl.yaml` algorithm entry

The algorithm module is paired with a `ccl.yaml` entry (ADR-0023 D10 + ADR-0047 D3):

```yaml
defaults:
  algorithm: lrab_hierarchical_allreduce
  n_elem: 8
algorithms:
  lrab_hierarchical_allreduce:
    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    # optional: world_size override
    # optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
```

- `module`: the full Python module path; `importlib.import_module` consumes this string as-is.
- `world_size` (optional): when set, overrides the topology fallback (ADR-0047 D2).
- Algorithm-specific parameters are consumed by `configure_sfr_intercube_multisip`.

Workflow to add a new algorithm:

1. Write `src/kernbench/ccl/algorithms/.py` following D1.
2. Add the entry under `algorithms` in `ccl.yaml`.
3. (If needed) extend `kernbench.ccl.sfr_config` with the SFR-install branch.
4. Add tests (e.g., `tests/sccl/test_.py`, extending the ADR-0043 eval harness).

### D7. Legacy "rank = flat PE index" mode

The `world_size` override in `ccl.yaml`, surfaced by ADR-0047 D2, is used by legacy "rank = flat PE index" tests. The algorithm module can assume `n_sips=world_size` ranks even in this mode, because the backend maintains the rank↔(SIP, cube, PE) mapping, so no modal branching is needed inside the algorithm body.

In single-cube workloads (where `cube_w=cube_h=1`), the algorithm must skip mesh-based phases; see the `single_cube = (cube_w == 1 and cube_h == 1)` pattern in `lrab_hierarchical_allreduce.py`.

## Alternatives Considered

### A1. Organize the algorithm module as a class (`class Allreduce: kernel(...)`)

Rejected. The Python module namespace already identifies an algorithm (see ADR-0047 D3's `importlib.import_module`). A class wrapper adds indirection without simplifying dispatch. Module-level free functions plus a `kernel` alias are clean and obvious.

### A2. Type `kernel_args` with an explicit dataclass

Rejected (currently).
Each algorithm normally has a different argument count; forcing one dataclass would hurt cross-algorithm interchange. The tuple return is simple and unpacks cleanly with the backend's `*kernel_args_tuple`. If an algorithm wants stronger internal typing, it may define its own NamedTuple.

### A3. Move SFR installation inside the algorithm module

Rejected. SFR installation (`configure_sfr_intercube_multisip`) is a cross-module decision combining topology + algorithm; `kernbench.ccl.sfr_config` is a more natural home than the algorithm module itself. D6's "extend sfr_config if needed" workflow keeps responsibility boundaries clear.

### A4. Auto-register algorithm names via a decorator (analogous to ADR-0045's `@bench`)

Rejected. Unlike benches, algorithms are already tied to `ccl.yaml` entries; an additional registry would be redundant. The string mapping in `module` is sufficient.

## Consequences

- ADR-0047 D3's one-line contract expands to a D1–D7 author-facing guide; new algorithm signatures no longer need to be grep-derived.
- D3's standardized 9 + tl signature couples naturally with the backend's `extra_args` append (ADR-0047 D5). It is explicit that even single-SIP-only algorithms must accept the four `sip_*` trailing arguments.
- D5's fail-loud recommendation means a `ccl.yaml` topology that the algorithm doesn't support will surface as an explicit `ValueError` rather than a silent wrong result.
- D6's step-by-step addition workflow makes clear how far a new algorithm has to reach into sfr_config / tests / ccl.yaml.
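
The second and third consequences can be exercised with a small stand-in for the dispatch path. This is a sketch under stated assumptions, not the backend's code: `call_like_backend`, `ring_only_kernel`, the placeholder pointer `0x1000`, and the fake `tl` object with a `trace` list are all hypothetical. The real `TLContext` API is not modelled, and how the real backend handles a topology name missing from the dict is not specified here, so the sketch simply indexes the dict.

```python
from types import SimpleNamespace

SIP_TOPO_RING, SIP_TOPO_TORUS, SIP_TOPO_MESH = 0, 1, 2


def kernel_args(world_size, n_elem, *, cube_w=4, cube_h=4):
    return (n_elem, cube_w, cube_h, world_size)


def ring_only_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                     sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    # Fail fast on any topology this algorithm does not implement (D5).
    if sip_topo_kind != SIP_TOPO_RING:
        raise ValueError(f"unsupported topology kind {sip_topo_kind}")
    # Record that the ring branch ran; a real kernel would issue tl.* ops.
    tl.trace.append(("ring", sip_rank, n_sips))


# Module stand-in that omits TOPO_NAME_TO_KIND (single-topology algorithm).
ring_algo = SimpleNamespace(kernel=ring_only_kernel, kernel_args=kernel_args)

# Module stand-in that publishes the mapping but still only implements ring.
mapped_algo = SimpleNamespace(
    kernel=ring_only_kernel,
    kernel_args=kernel_args,
    TOPO_NAME_TO_KIND={"ring_1d": SIP_TOPO_RING, "torus_2d": SIP_TOPO_TORUS},
)


def call_like_backend(module, topo_name, *, world_size, n_elem, sip_rank, tl):
    """Assemble arguments in the order this ADR describes and invoke module.kernel."""
    topo_map = getattr(module, "TOPO_NAME_TO_KIND", None)
    # Absent mapping falls back to kind 0, as stated in D1.
    kind = 0 if topo_map is None else topo_map[topo_name]
    args = module.kernel_args(world_size, n_elem)
    extra = (sip_rank, kind, 0, 0)  # SIP mesh width and height are 0 for a ring
    return module.kernel(0x1000, *args, *extra, tl=tl)


demo_tl = SimpleNamespace(trace=[])
call_like_backend(ring_algo, "ring_1d",
                  world_size=4, n_elem=8, sip_rank=1, tl=demo_tl)
```

Calling `call_like_backend(mapped_algo, "torus_2d", ...)` raises `ValueError` inside the kernel, which is the explicit failure the D5 recommendation aims for.
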