Files
kernbench2/docs/adr/ADR-0050-par-ccl-algorithm-module-contract.md
ywkang bd49c93703 adr: add ADR-0050-0053 — close /report's second-pass G4 candidates
Documents four cross-cutting surfaces one layer deeper than the prior
G4 batch:

- 0050 par-ccl-algorithm-module-contract: how to author a new CCL
  algorithm in src/kernbench/ccl/algorithms/. Pairs with ADR-0045's
  bench-module contract. Pins the four required public symbols
  (kernel, kernel_args, TOPO_NAME_TO_KIND constants, kernel alias),
  the 9 + tl standardized kernel signature, the kernel_args tuple
  format, sip_topo_kind dispatch, and the ccl.yaml entry workflow.

- 0051 lat-routing-helper-api: every public method of AddressResolver
  (resolve, find_m_cpu, find_pcie_ep, find_io_cpu, find_all_pcie_eps)
  and PathRouter (find_path, find_path_with_distance,
  find_mcpu_dma_path, find_memory_path, find_node_path + 2 shims).
  Pins the four adjacency graphs (_adj_all / _adj / _adj_mcpu_dma /
  _adj_local) and the edge-kind exclusion sets they use, plus the
  single-owner naming convention.

- 0052 dev-oplog-memory-store-schemas: OpRecord's 7 fields, the
  per-op_name params matrix (dma_read, dma_write, gemm_*, math, math
  reduction, composite_gemm, ipcq_copy, unknown), snapshot timing
  rules (math = all inputs, dma_write = HBM-only — ADR-0027 race
  avoidance), TileToken stage_type capture, and MemoryStore's
  (space, addr) two-level dict with reference-store semantics.

- 0053 dev-topology-builder-algorithms: the 6-stage compile pipeline,
  cube_mesh.yaml's source_hash cache and its 5 input fields, the
  cube NoC auto-layout algorithm (row/col placement, HBM exclusion
  zone, PE/M_CPU/SRAM attachment via nearest-router, UCIe N/S/E/W
  distribution), the node naming convention (single-owner with
  router.py), the edge-kind catalog, the 4 view projections, and a
  table of spec-field changes vs mesh regeneration.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:52:42 -07:00

323 lines
12 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ADR-0050: CCL Algorithm Module Contract — `ccl/algorithms/*.py`
## Status
Accepted (2026-05-22).
Pins down the interface, kernel signature, and addition workflow that a
module under `src/kernbench/ccl/algorithms/` must satisfy in order to be
used as a collective algorithm by the AHBM CCL backend (ADR-0047).
ADR-0047 D3 states only that "the algorithm module must expose `kernel`,
`kernel_args`, optionally `TOPO_NAME_TO_KIND`"; **the contract an
algorithm-module author needs to follow** has had no ADR-level coverage.
This ADR pairs with ADR-0045's bench-module contract.
## First action
An algorithm module is imported at two moments:
1. **AHBM backend entry**: when user code calls
`dist.init_process_group(backend="ahbm")`,
`AhbmCCLBackend.__init__` runs
`self._algo_module = importlib.import_module(self._merged["module"])`.
At module level, the following occur first:
- Topology-kind integer constants like `SIP_TOPO_RING/TORUS/MESH`
are bound in the module namespace.
- The `TOPO_NAME_TO_KIND` dict is bound; the backend reads it via
`getattr(self._algo_module, "TOPO_NAME_TO_KIND", None)`.
- `kernel_args` function is defined for the caller.
- The actual algorithm function (e.g.,
`allreduce_intercube_multidevice`) is defined.
- At the bottom of the module, `kernel = allreduce_intercube_multidevice`
publishes the alias.
2. **ccl.yaml install stage**:
`kernbench.ccl.install.install_ipcq` imports the same algorithm
module while pushing the IPCQ neighbor table.
In short, **the algorithm module's first act is "publish topology-kind
constants, the `TOPO_NAME_TO_KIND` dict, the `kernel_args` function, and
the `kernel` alias into the module namespace"** — all as import-time
side effects, no separate initialization call.
## Context
`AhbmCCLBackend` (ADR-0047), at process-group creation, dynamically
imports a module path obtained from `ccl.yaml`'s `defaults.algorithm` (or
a user-specified algorithm). The backend expects four things from the
module:
- `kernel`: the collective's entry function.
- `kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple`: a tuple
packing the kernel's positional arguments.
- `TOPO_NAME_TO_KIND` (optional): a dict mapping `topology.yaml`'s
`sips.topology` string (e.g., `"ring_1d"`, `"torus_2d"`,
`"mesh_2d_no_wrap"`) to the integer kind constants.
- (Indirectly) IPCQ neighbor-table install:
`configure_sfr_intercube_multisip` reads
the module's `TOPO_NAME_TO_KIND` plus cube dimensions to decide the
SFR.
The current corpus has one algorithm module:
`lrab_hierarchical_allreduce.py` (248 lines). The name expands to
"**l**eft-**r**ight **a**lternating **b**roadcast hierarchical allreduce".
When future modules like `ring_allreduce`, `tree_allreduce`, or
`broadcast` are added, they must follow this contract for the backend's
dispatch path to keep working.
Without an ADR-level contract:
- A new algorithm author has to infer the signature from ADR-0047 D3's
one-liner.
- The kernel-function argument order (especially `t_ptr, n_elem,
cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w,
sip_topo_h, tl`) is unclear without grep.
- It is conventional, but not documented, what `kernel_args` takes as
inputs and what tuple it must return.
## Decision
### D1. The algorithm module exposes four public symbols
```python
# src/kernbench/ccl/algorithms/<name>.py
from __future__ import annotations
# (required) topology-kind constants — referenced internally
SIP_TOPO_RING = 0
SIP_TOPO_TORUS = 1
SIP_TOPO_MESH = 2
# (optional) topology name → kind mapping. Used by the backend to
# translate ccl.yaml/topology's string SIP topology into an integer.
TOPO_NAME_TO_KIND = {
"ring_1d": SIP_TOPO_RING,
"torus_2d": SIP_TOPO_TORUS,
"mesh_2d_no_wrap": SIP_TOPO_MESH,
}
# (required) kernel argument builder
def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
return (n_elem, cube_w, cube_h, world_size)
# (required) kernel function (TLContext is injected via the `tl=...`
# keyword argument).
def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
...
# (required) kernel alias — the backend accesses `module.kernel`
kernel = my_allreduce_kernel
```
- The `kernel` alias is the entry point the backend invokes. Whatever
the function name is (e.g., `allreduce_intercube_multidevice`), it
must be exposed via `module.kernel = fn`.
- Without `kernel_args`, the backend has no way to build the
algorithm's argument list. See D2 for the signature.
- If `TOPO_NAME_TO_KIND` is absent, the backend falls back to
`sip_topo_kind = 0`. An algorithm supporting only a single topology
may omit it.
### D2. `kernel_args` signature — `(world_size, n_elem, *, cube_w, cube_h)`
```python
def kernel_args(world_size: int, n_elem: int, *,
cube_w: int = 4, cube_h: int = 4) -> tuple:
return (n_elem, cube_w, cube_h, world_size)
```
- **Positional arguments**: `world_size` (= number of ranks), `n_elem`
(= element count of a single shard, f16-based).
- **Keyword arguments**: `cube_w`, `cube_h` (= cube-mesh dimensions).
Default 4×4 — aligned with `topology.yaml`'s `sip.cube_mesh` default.
- **Return**: a tuple in the order the kernel's positional arguments
expect.
When the backend calls `all_reduce`:
```python
kernel_args_tuple = self._algo_module.kernel_args(
self._world_size, n_elem, cube_w=eff_cube_w, cube_h=eff_cube_h,
)
extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
pending = self.ctx.launch(
self._merged["algorithm"], kernel_fn, tensor,
*kernel_args_tuple, *extra_args, _defer_wait=True,
)
```
So the kernel's full positional argument list becomes: `(tensor_ptr,
*kernel_args_tuple, sip_rank, sip_topo_kind, sip_topo_w,
sip_topo_h)`, with `tl=...` injected as a keyword. The tuple length
and order returned by `kernel_args` must **match the kernel signature
1:1**.
### D3. Kernel signature — standardized 9 + tl arguments
Recommended signature:
```python
def my_kernel(
t_ptr: int, # VA base of the row-wise-sharded tensor on this SIP
n_elem: int, # element count per cube tile (or per shard)
cube_w: int, # cube mesh width (from kernel_args)
cube_h: int, # cube mesh height (from kernel_args)
n_sips: int, # equal to world_size (rank = SIP, ADR-0024)
sip_rank: int, # this SIP's rank
sip_topo_kind: int, # result of TOPO_NAME_TO_KIND lookup
sip_topo_w: int, # SIP mesh width (0 for ring_1d)
sip_topo_h: int, # SIP mesh height (0 for ring_1d)
*, tl, # TLContext (auto-injected)
) -> None:
```
Even if `kernel_args` chose a different positional argument order, the
kernel's **last four positional arguments are always
`(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`** — the backend
appends them as `extra_args` (ADR-0047 D5). A custom algorithm must
accept these four, but a single-SIP algorithm may simply ignore them.
`tl` is injected via keyword — `RuntimeContext.launch` adds `tl=tl_ctx`
just before invoking the kernel. The signature therefore exposes `tl`
as keyword-only (`*, tl`) or as the trailing keyword parameter.
### D4. Kernel body — freedom and constraints
Available inside the kernel: every `tl.*` primitive from ADR-0046 D3.
Common patterns:
- `cube_id = tl.program_id(axis=1)` — this PE's cube index.
- `pe_addr = t_ptr + cube_id * nbytes` — per-cube VA of the tile.
- `acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")` — load local
data.
- `tl.send(dir=...)` / `tl.recv(dir=..., shape=, dtype=)` — IPCQ
collective.
- `acc = acc + recv` — TensorHandle arithmetic operators (ADR-0046 D4).
- `tl.store(pe_addr, acc)` — store the result.
The kernel body is plain Python — branching and loops are fine. But:
- No SimPy `yield` or `async` (ADR-0046 D1).
- No direct access to TensorHandle `.data` — the Phase 1 timing model
doesn't see data dependencies (ADR-0020's 2-pass separation).
- Kernel execution must be deterministic — the same input must produce
the same op sequence. No random or external IO.
### D5. SIP topology semantics — meaning of `sip_topo_kind`
The backend looks up `topology.yaml`'s `system.sips.topology` string
in the algorithm module's `TOPO_NAME_TO_KIND` and passes the integer
as `sip_topo_kind`. The algorithm then branches:
```python
if sip_topo_kind == SIP_TOPO_RING:
acc = _inter_sip_ring(...)
elif sip_topo_kind == SIP_TOPO_TORUS:
acc = _inter_sip_torus_2d(...)
elif sip_topo_kind == SIP_TOPO_MESH:
acc = _inter_sip_mesh_2d(...)
```
Each topology branch communicates with peers via IPCQ direction names
(`"global_E"`, `"W"`, `"S"`, `"N"` …). Direction semantics are defined
in ADR-0023/0025; `configure_sfr_intercube_multisip` installs the IPCQ
neighbor table accordingly.
If a topology kind not supported by the algorithm appears, prefer an
explicit `raise ValueError(f"unsupported topology kind
{sip_topo_kind}")` over a silent no-op — fail fast on misconfiguration.
### D6. The `ccl.yaml` algorithm entry
The algorithm module is paired with a `ccl.yaml` entry (ADR-0023 D10 +
ADR-0047 D3):
```yaml
defaults:
algorithm: lrab_hierarchical_allreduce
n_elem: 8
algorithms:
lrab_hierarchical_allreduce:
module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
# optional: world_size override
# optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
```
- `module`: the full Python module path; `importlib.import_module`
consumes this string as-is.
- `world_size` (optional): when set, overrides the topology fallback
(ADR-0047 D2).
- Algorithm-specific parameters are consumed by
`configure_sfr_intercube_multisip`.
Workflow to add a new algorithm:
1. Write `src/kernbench/ccl/algorithms/<name>.py` following D1.
2. Add the entry under `algorithms` in `ccl.yaml`.
3. (If needed) extend `kernbench.ccl.sfr_config` with the SFR-install
branch.
4. Add tests (e.g., `tests/sccl/test_<name>.py`, extending the
ADR-0043 eval harness).
### D7. Legacy "rank = flat PE index" mode
The `world_size` override in `ccl.yaml`, surfaced by ADR-0047 D2, is
used by legacy "rank = flat PE index" tests. The algorithm module can
assume `n_sips=world_size` ranks even in this mode — the backend
maintains the rank↔(SIP, cube, PE) mapping, so no modal branching is
needed inside the algorithm body.
In single-cube workloads (where `cube_w=cube_h=1`), the algorithm must
skip mesh-based phases — see the
`single_cube = (cube_w == 1 and cube_h == 1)` pattern in
`lrab_hierarchical_allreduce.py`.
## Alternatives Considered
### A1. Organize the algorithm module as a class (`class Allreduce: kernel(...)`)
Rejected. The Python module namespace already identifies an algorithm
(see ADR-0047 D3's `importlib.import_module`). A class wrapper adds
indirection without simplifying dispatch. Module-level free functions
plus a `kernel` alias are clean and obvious.
### A2. Type `kernel_args` with an explicit dataclass
Rejected (currently). Each algorithm normally has a different argument
count; forcing one dataclass would hurt cross-algorithm interchange.
The tuple return is simple and unpacks cleanly with the backend's
`*kernel_args_tuple`. If an algorithm wants stronger internal typing,
it may define its own NamedTuple.
### A3. Move SFR installation inside the algorithm module
Rejected. SFR installation
(`configure_sfr_intercube_multisip`) is a cross-module decision
combining topology + algorithm; `kernbench.ccl.sfr_config` is a more
natural home than the algorithm module itself. D6's "extend
sfr_config if needed" workflow keeps responsibility boundaries clear.
### A4. Auto-register algorithm names via a decorator (analogous to ADR-0045's `@bench`)
Rejected. Unlike benches, algorithms are already tied to `ccl.yaml`
entries; an additional registry would be redundant. The string mapping
in `module` is sufficient.
## Consequences
- ADR-0047 D3's one-line contract expands to a D1D7 author-facing
guide; new algorithm signatures no longer need to be grep-derived.
- D3's standardized 9 + tl signature couples naturally with the
backend's `extra_args` append (ADR-0047 D5). It is explicit that
even single-SIP-only algorithms must accept the four `sip_*` trailing
arguments.
- D5's fail-loud recommendation means a `ccl.yaml` topology that the
algorithm doesn't support will surface as an explicit `ValueError`
rather than a silent wrong result.
- D6's step-by-step addition workflow makes clear how far a new
algorithm has to reach into sfr_config / tests / ccl.yaml.