Files
kernbench2/docs/adr/ADR-0050-par-ccl-algorithm-module-contract.md
T
ywkang bd49c93703 adr: add ADR-0050-0053 — close /report's second-pass G4 candidates
Documents four cross-cutting surfaces one layer deeper than the prior
G4 batch:

- 0050 par-ccl-algorithm-module-contract: how to author a new CCL
  algorithm in src/kernbench/ccl/algorithms/. Pairs with ADR-0045's
  bench-module contract. Pins the four required public symbols
  (kernel, kernel_args, TOPO_NAME_TO_KIND constants, kernel alias),
  the 9 + tl standardized kernel signature, the kernel_args tuple
  format, sip_topo_kind dispatch, and the ccl.yaml entry workflow.

- 0051 lat-routing-helper-api: every public method of AddressResolver
  (resolve, find_m_cpu, find_pcie_ep, find_io_cpu, find_all_pcie_eps)
  and PathRouter (find_path, find_path_with_distance,
  find_mcpu_dma_path, find_memory_path, find_node_path + 2 shims).
  Pins the four adjacency graphs (_adj_all / _adj / _adj_mcpu_dma /
  _adj_local) and the edge-kind exclusion sets they use, plus the
  single-owner naming convention.

- 0052 dev-oplog-memory-store-schemas: OpRecord's 7 fields, the
  per-op_name params matrix (dma_read, dma_write, gemm_*, math, math
  reduction, composite_gemm, ipcq_copy, unknown), snapshot timing
  rules (math = all inputs, dma_write = HBM-only — ADR-0027 race
  avoidance), TileToken stage_type capture, and MemoryStore's
  (space, addr) two-level dict with reference-store semantics.

- 0053 dev-topology-builder-algorithms: the 6-stage compile pipeline,
  cube_mesh.yaml's source_hash cache and its 5 input fields, the
  cube NoC auto-layout algorithm (row/col placement, HBM exclusion
  zone, PE/M_CPU/SRAM attachment via nearest-router, UCIe N/S/E/W
  distribution), the node naming convention (single-owner with
  router.py), the edge-kind catalog, the 4 view projections, and a
  table of spec-field changes vs mesh regeneration.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:52:42 -07:00

12 KiB
Raw Blame History

ADR-0050: CCL Algorithm Module Contract — ccl/algorithms/*.py

Status

Accepted (2026-05-22).

Pins down the interface, kernel signature, and addition workflow that a module under src/kernbench/ccl/algorithms/ must satisfy in order to be used as a collective algorithm by the AHBM CCL backend (ADR-0047). ADR-0047 D3 states only that "the algorithm module must expose kernel, kernel_args, optionally TOPO_NAME_TO_KIND"; the contract an algorithm-module author needs to follow has had no ADR-level coverage. This ADR pairs with ADR-0045's bench-module contract.

First action

An algorithm module is imported at two moments:

  1. AHBM backend entry: when user code calls dist.init_process_group(backend="ahbm"), AhbmCCLBackend.__init__ runs self._algo_module = importlib.import_module(self._merged["module"]). At module level, the following occur first:

    • Topology-kind integer constants like SIP_TOPO_RING/TORUS/MESH are bound in the module namespace.
    • The TOPO_NAME_TO_KIND dict is bound; the backend reads it via getattr(self._algo_module, "TOPO_NAME_TO_KIND", None).
    • kernel_args function is defined for the caller.
    • The actual algorithm function (e.g., allreduce_intercube_multidevice) is defined.
    • At the bottom of the module, kernel = allreduce_intercube_multidevice publishes the alias.
  2. ccl.yaml install stage: kernbench.ccl.install.install_ipcq imports the same algorithm module while pushing the IPCQ neighbor table.

In short, the algorithm module's first act is "publish topology-kind constants, the TOPO_NAME_TO_KIND dict, the kernel_args function, and the kernel alias into the module namespace" — all as import-time side effects, no separate initialization call.

Context

AhbmCCLBackend (ADR-0047), at process-group creation, dynamically imports a module path obtained from ccl.yaml's defaults.algorithm (or a user-specified algorithm). The backend expects four things from the module:

  • kernel: the collective's entry function.
  • kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple: a tuple packing the kernel's positional arguments.
  • TOPO_NAME_TO_KIND (optional): a dict mapping topology.yaml's sips.topology string (e.g., "ring_1d", "torus_2d", "mesh_2d_no_wrap") to the integer kind constants.
  • (Indirectly) IPCQ neighbor-table install: configure_sfr_intercube_multisip reads the module's TOPO_NAME_TO_KIND plus cube dimensions to decide the SFR.

The current corpus has one algorithm module: lrab_hierarchical_allreduce.py (248 lines). The name expands to "left-right alternating broadcast hierarchical allreduce". When future modules like ring_allreduce, tree_allreduce, or broadcast are added, they must follow this contract for the backend's dispatch path to keep working.

Without an ADR-level contract:

  • A new algorithm author has to infer the signature from ADR-0047 D3's one-liner.
  • The kernel-function argument order (especially t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl) is unclear without grep.
  • It is conventional, but not documented, what kernel_args takes as inputs and what tuple it must return.

Decision

D1. The algorithm module exposes four public symbols

# src/kernbench/ccl/algorithms/<name>.py

from __future__ import annotations

# (required) topology-kind constants — referenced internally
SIP_TOPO_RING = 0
SIP_TOPO_TORUS = 1
SIP_TOPO_MESH = 2

# (optional) topology name → kind mapping. Used by the backend to
# translate ccl.yaml/topology's string SIP topology into an integer.
TOPO_NAME_TO_KIND = {
    "ring_1d": SIP_TOPO_RING,
    "torus_2d": SIP_TOPO_TORUS,
    "mesh_2d_no_wrap": SIP_TOPO_MESH,
}

# (required) kernel argument builder
def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)

# (required) kernel function (TLContext is injected via the `tl=...`
# keyword argument).
def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                         sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    ...

# (required) kernel alias — the backend accesses `module.kernel`
kernel = my_allreduce_kernel
  • The kernel alias is the entry point the backend invokes. Whatever the function name is (e.g., allreduce_intercube_multidevice), it must be exposed via module.kernel = fn.
  • Without kernel_args, the backend has no way to build the algorithm's argument list. See D2 for the signature.
  • If TOPO_NAME_TO_KIND is absent, the backend falls back to sip_topo_kind = 0. An algorithm supporting only a single topology may omit it.

D2. kernel_args signature — (world_size, n_elem, *, cube_w, cube_h)

def kernel_args(world_size: int, n_elem: int, *,
                cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)
  • Positional arguments: world_size (= number of ranks), n_elem (= element count of a single shard, f16-based).
  • Keyword arguments: cube_w, cube_h (= cube-mesh dimensions). Default 4×4 — aligned with topology.yaml's sip.cube_mesh default.
  • Return: a tuple in the order the kernel's positional arguments expect.

When the backend calls all_reduce:

kernel_args_tuple = self._algo_module.kernel_args(
    self._world_size, n_elem, cube_w=eff_cube_w, cube_h=eff_cube_h,
)
extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
pending = self.ctx.launch(
    self._merged["algorithm"], kernel_fn, tensor,
    *kernel_args_tuple, *extra_args, _defer_wait=True,
)

So the kernel's full positional argument list becomes: (tensor_ptr, *kernel_args_tuple, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h), with tl=... injected as a keyword. The tuple length and order returned by kernel_args must match the kernel signature 1:1.

D3. Kernel signature — standardized 9 + tl arguments

Recommended signature:

def my_kernel(
    t_ptr: int,             # VA base of the row-wise-sharded tensor on this SIP
    n_elem: int,            # element count per cube tile (or per shard)
    cube_w: int,            # cube mesh width  (from kernel_args)
    cube_h: int,            # cube mesh height (from kernel_args)
    n_sips: int,            # equal to world_size (rank = SIP, ADR-0024)
    sip_rank: int,          # this SIP's rank
    sip_topo_kind: int,     # result of TOPO_NAME_TO_KIND lookup
    sip_topo_w: int,        # SIP mesh width (0 for ring_1d)
    sip_topo_h: int,        # SIP mesh height (0 for ring_1d)
    *, tl,                  # TLContext (auto-injected)
) -> None:

Even if kernel_args chose a different positional argument order, the kernel's last four positional arguments are always (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h) — the backend appends them as extra_args (ADR-0047 D5). A custom algorithm must accept these four, but a single-SIP algorithm may simply ignore them.

tl is injected via keyword — RuntimeContext.launch adds tl=tl_ctx just before invoking the kernel. The signature therefore exposes tl as keyword-only (*, tl) or as the trailing keyword parameter.

D4. Kernel body — freedom and constraints

Available inside the kernel: every tl.* primitive from ADR-0046 D3.

Common patterns:

  • cube_id = tl.program_id(axis=1) — this PE's cube index.
  • pe_addr = t_ptr + cube_id * nbytes — per-cube VA of the tile.
  • acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16") — load local data.
  • tl.send(dir=...) / tl.recv(dir=..., shape=, dtype=) — IPCQ collective.
  • acc = acc + recv — TensorHandle arithmetic operators (ADR-0046 D4).
  • tl.store(pe_addr, acc) — store the result.

The kernel body is plain Python — branching and loops are fine. But:

  • No SimPy yield or async (ADR-0046 D1).
  • No direct access to TensorHandle .data — the Phase 1 timing model doesn't see data dependencies (ADR-0020's 2-pass separation).
  • Kernel execution must be deterministic — the same input must produce the same op sequence. No random or external IO.

D5. SIP topology semantics — meaning of sip_topo_kind

The backend looks up topology.yaml's system.sips.topology string in the algorithm module's TOPO_NAME_TO_KIND and passes the integer as sip_topo_kind. The algorithm then branches:

if sip_topo_kind == SIP_TOPO_RING:
    acc = _inter_sip_ring(...)
elif sip_topo_kind == SIP_TOPO_TORUS:
    acc = _inter_sip_torus_2d(...)
elif sip_topo_kind == SIP_TOPO_MESH:
    acc = _inter_sip_mesh_2d(...)

Each topology branch communicates with peers via IPCQ direction names ("global_E", "W", "S", "N" …). Direction semantics are defined in ADR-0023/0025; configure_sfr_intercube_multisip installs the IPCQ neighbor table accordingly.

If a topology kind not supported by the algorithm appears, prefer an explicit raise ValueError(f"unsupported topology kind {sip_topo_kind}") over a silent no-op — fail fast on misconfiguration.

D6. The ccl.yaml algorithm entry

The algorithm module is paired with a ccl.yaml entry (ADR-0023 D10 + ADR-0047 D3):

defaults:
  algorithm: lrab_hierarchical_allreduce
  n_elem: 8

algorithms:
  lrab_hierarchical_allreduce:
    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    # optional: world_size override
    # optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
  • module: the full Python module path; importlib.import_module consumes this string as-is.
  • world_size (optional): when set, overrides the topology fallback (ADR-0047 D2).
  • Algorithm-specific parameters are consumed by configure_sfr_intercube_multisip.

Workflow to add a new algorithm:

  1. Write src/kernbench/ccl/algorithms/<name>.py following D1.
  2. Add the entry under algorithms in ccl.yaml.
  3. (If needed) extend kernbench.ccl.sfr_config with the SFR-install branch.
  4. Add tests (e.g., tests/sccl/test_<name>.py, extending the ADR-0043 eval harness).

D7. Legacy "rank = flat PE index" mode

The world_size override in ccl.yaml, surfaced by ADR-0047 D2, is used by legacy "rank = flat PE index" tests. The algorithm module can assume n_sips=world_size ranks even in this mode — the backend maintains the rank↔(SIP, cube, PE) mapping, so no modal branching is needed inside the algorithm body.

In single-cube workloads (where cube_w=cube_h=1), the algorithm must skip mesh-based phases — see the single_cube = (cube_w == 1 and cube_h == 1) pattern in lrab_hierarchical_allreduce.py.

Alternatives Considered

A1. Organize the algorithm module as a class (class Allreduce: kernel(...))

Rejected. The Python module namespace already identifies an algorithm (see ADR-0047 D3's importlib.import_module). A class wrapper adds indirection without simplifying dispatch. Module-level free functions plus a kernel alias are clean and obvious.

A2. Type kernel_args with an explicit dataclass

Rejected (currently). Each algorithm normally has a different argument count; forcing one dataclass would hurt cross-algorithm interchange. The tuple return is simple and unpacks cleanly with the backend's *kernel_args_tuple. If an algorithm wants stronger internal typing, it may define its own NamedTuple.

A3. Move SFR installation inside the algorithm module

Rejected. SFR installation (configure_sfr_intercube_multisip) is a cross-module decision combining topology + algorithm; kernbench.ccl.sfr_config is a more natural home than the algorithm module itself. D6's "extend sfr_config if needed" workflow keeps responsibility boundaries clear.

A4. Auto-register algorithm names via a decorator (analogous to ADR-0045's @bench)

Rejected. Unlike benches, algorithms are already tied to ccl.yaml entries; an additional registry would be redundant. The string mapping in module is sufficient.

Consequences

  • ADR-0047 D3's one-line contract expands to a D1D7 author-facing guide; new algorithm signatures no longer need to be grep-derived.
  • D3's standardized 9 + tl signature couples naturally with the backend's extra_args append (ADR-0047 D5). It is explicit that even single-SIP-only algorithms must accept the four sip_* trailing arguments.
  • D5's fail-loud recommendation means a ccl.yaml topology that the algorithm doesn't support will surface as an explicit ValueError rather than a silent wrong result.
  • D6's step-by-step addition workflow makes clear how far a new algorithm has to reach into sfr_config / tests / ccl.yaml.