adr: add ADR-0050-0053 — close /report's second-pass G4 candidates
Documents four cross-cutting surfaces one layer deeper than the prior G4 batch: - 0050 par-ccl-algorithm-module-contract: how to author a new CCL algorithm in src/kernbench/ccl/algorithms/. Pairs with ADR-0045's bench-module contract. Pins the four required public symbols (kernel, kernel_args, TOPO_NAME_TO_KIND constants, kernel alias), the 9 + tl standardized kernel signature, the kernel_args tuple format, sip_topo_kind dispatch, and the ccl.yaml entry workflow. - 0051 lat-routing-helper-api: every public method of AddressResolver (resolve, find_m_cpu, find_pcie_ep, find_io_cpu, find_all_pcie_eps) and PathRouter (find_path, find_path_with_distance, find_mcpu_dma_path, find_memory_path, find_node_path + 2 shims). Pins the four adjacency graphs (_adj_all / _adj / _adj_mcpu_dma / _adj_local) and the edge-kind exclusion sets they use, plus the single-owner naming convention. - 0052 dev-oplog-memory-store-schemas: OpRecord's 7 fields, the per-op_name params matrix (dma_read, dma_write, gemm_*, math, math reduction, composite_gemm, ipcq_copy, unknown), snapshot timing rules (math = all inputs, dma_write = HBM-only — ADR-0027 race avoidance), TileToken stage_type capture, and MemoryStore's (space, addr) two-level dict with reference-store semantics. - 0053 dev-topology-builder-algorithms: the 6-stage compile pipeline, cube_mesh.yaml's source_hash cache and its 5 input fields, the cube NoC auto-layout algorithm (row/col placement, HBM exclusion zone, PE/M_CPU/SRAM attachment via nearest-router, UCIe N/S/E/W distribution), the node naming convention (single-owner with router.py), the edge-kind catalog, the 4 view projections, and a table of spec-field changes vs mesh regeneration. Bilingual pair verifier passes for all four EN/KO pairs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,322 @@
|
||||
# ADR-0050: CCL Algorithm Module Contract — `ccl/algorithms/*.py`
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (2026-05-22).
|
||||
|
||||
Pins down the interface, kernel signature, and addition workflow that a
|
||||
module under `src/kernbench/ccl/algorithms/` must satisfy in order to be
|
||||
used as a collective algorithm by the AHBM CCL backend (ADR-0047).
|
||||
ADR-0047 D3 states only that "the algorithm module must expose `kernel`,
|
||||
`kernel_args`, optionally `TOPO_NAME_TO_KIND`"; **the contract an
|
||||
algorithm-module author needs to follow** has had no ADR-level coverage.
|
||||
This ADR pairs with ADR-0045's bench-module contract.
|
||||
|
||||
## First action
|
||||
|
||||
An algorithm module is imported at two moments:
|
||||
|
||||
1. **AHBM backend entry**: when user code calls
|
||||
`dist.init_process_group(backend="ahbm")`,
|
||||
`AhbmCCLBackend.__init__` runs
|
||||
`self._algo_module = importlib.import_module(self._merged["module"])`.
|
||||
At module level, the following occur first:
|
||||
- Topology-kind integer constants like `SIP_TOPO_RING/TORUS/MESH`
|
||||
are bound in the module namespace.
|
||||
- The `TOPO_NAME_TO_KIND` dict is bound; the backend reads it via
|
||||
`getattr(self._algo_module, "TOPO_NAME_TO_KIND", None)`.
|
||||
- `kernel_args` function is defined for the caller.
|
||||
- The actual algorithm function (e.g.,
|
||||
`allreduce_intercube_multidevice`) is defined.
|
||||
- At the bottom of the module, `kernel = allreduce_intercube_multidevice`
|
||||
publishes the alias.
|
||||
|
||||
2. **ccl.yaml install stage**:
|
||||
`kernbench.ccl.install.install_ipcq` imports the same algorithm
|
||||
module while pushing the IPCQ neighbor table.
|
||||
|
||||
In short, **the algorithm module's first act is "publish topology-kind
|
||||
constants, the `TOPO_NAME_TO_KIND` dict, the `kernel_args` function, and
|
||||
the `kernel` alias into the module namespace"** — all as import-time
|
||||
side effects, no separate initialization call.
|
||||
|
||||
## Context
|
||||
|
||||
`AhbmCCLBackend` (ADR-0047), at process-group creation, dynamically
|
||||
imports a module path obtained from `ccl.yaml`'s `defaults.algorithm` (or
|
||||
a user-specified algorithm). The backend expects four things from the
|
||||
module:
|
||||
|
||||
- `kernel`: the collective's entry function.
|
||||
- `kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple`: a tuple
|
||||
packing the kernel's positional arguments.
|
||||
- `TOPO_NAME_TO_KIND` (optional): a dict mapping `topology.yaml`'s
|
||||
`sips.topology` string (e.g., `"ring_1d"`, `"torus_2d"`,
|
||||
`"mesh_2d_no_wrap"`) to the integer kind constants.
|
||||
- (Indirectly) IPCQ neighbor-table install:
|
||||
`configure_sfr_intercube_multisip` reads
|
||||
the module's `TOPO_NAME_TO_KIND` plus cube dimensions to decide the
|
||||
SFR.
|
||||
|
||||
The current corpus has one algorithm module:
|
||||
`lrab_hierarchical_allreduce.py` (248 lines). The name expands to
|
||||
"**l**eft-**r**ight **a**lternating **b**roadcast hierarchical allreduce".
|
||||
When future modules like `ring_allreduce`, `tree_allreduce`, or
|
||||
`broadcast` are added, they must follow this contract for the backend's
|
||||
dispatch path to keep working.
|
||||
|
||||
Without an ADR-level contract:
|
||||
|
||||
- A new algorithm author has to infer the signature from ADR-0047 D3's
|
||||
one-liner.
|
||||
- The kernel-function argument order (especially `t_ptr, n_elem,
|
||||
cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w,
|
||||
sip_topo_h, tl`) is unclear without grep.
|
||||
- It is conventional, but not documented, what `kernel_args` takes as
|
||||
inputs and what tuple it must return.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. The algorithm module exposes four public symbols
|
||||
|
||||
```python
|
||||
# src/kernbench/ccl/algorithms/<name>.py
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
# (required) topology-kind constants — referenced internally
|
||||
SIP_TOPO_RING = 0
|
||||
SIP_TOPO_TORUS = 1
|
||||
SIP_TOPO_MESH = 2
|
||||
|
||||
# (optional) topology name → kind mapping. Used by the backend to
|
||||
# translate ccl.yaml/topology's string SIP topology into an integer.
|
||||
TOPO_NAME_TO_KIND = {
|
||||
"ring_1d": SIP_TOPO_RING,
|
||||
"torus_2d": SIP_TOPO_TORUS,
|
||||
"mesh_2d_no_wrap": SIP_TOPO_MESH,
|
||||
}
|
||||
|
||||
# (required) kernel argument builder
|
||||
def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
|
||||
return (n_elem, cube_w, cube_h, world_size)
|
||||
|
||||
# (required) kernel function (TLContext is injected via the `tl=...`
|
||||
# keyword argument).
|
||||
def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
|
||||
sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
|
||||
...
|
||||
|
||||
# (required) kernel alias — the backend accesses `module.kernel`
|
||||
kernel = my_allreduce_kernel
|
||||
```
|
||||
|
||||
- The `kernel` alias is the entry point the backend invokes. Whatever
|
||||
the function name is (e.g., `allreduce_intercube_multidevice`), it
|
||||
must be exposed via `module.kernel = fn`.
|
||||
- Without `kernel_args`, the backend has no way to build the
|
||||
algorithm's argument list. See D2 for the signature.
|
||||
- If `TOPO_NAME_TO_KIND` is absent, the backend falls back to
|
||||
`sip_topo_kind = 0`. An algorithm supporting only a single topology
|
||||
may omit it.
|
||||
|
||||
### D2. `kernel_args` signature — `(world_size, n_elem, *, cube_w, cube_h)`
|
||||
|
||||
```python
|
||||
def kernel_args(world_size: int, n_elem: int, *,
|
||||
cube_w: int = 4, cube_h: int = 4) -> tuple:
|
||||
return (n_elem, cube_w, cube_h, world_size)
|
||||
```
|
||||
|
||||
- **Positional arguments**: `world_size` (= number of ranks), `n_elem`
|
||||
(= element count of a single shard, f16-based).
|
||||
- **Keyword arguments**: `cube_w`, `cube_h` (= cube-mesh dimensions).
|
||||
Default 4×4 — aligned with `topology.yaml`'s `sip.cube_mesh` default.
|
||||
- **Return**: a tuple in the order the kernel's positional arguments
|
||||
expect.
|
||||
|
||||
When the backend calls `all_reduce`:
|
||||
|
||||
```python
|
||||
kernel_args_tuple = self._algo_module.kernel_args(
|
||||
self._world_size, n_elem, cube_w=eff_cube_w, cube_h=eff_cube_h,
|
||||
)
|
||||
extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
|
||||
pending = self.ctx.launch(
|
||||
self._merged["algorithm"], kernel_fn, tensor,
|
||||
*kernel_args_tuple, *extra_args, _defer_wait=True,
|
||||
)
|
||||
```
|
||||
|
||||
So the kernel's full positional argument list becomes: `(tensor_ptr,
|
||||
*kernel_args_tuple, sip_rank, sip_topo_kind, sip_topo_w,
|
||||
sip_topo_h)`, with `tl=...` injected as a keyword. The tuple length
|
||||
and order returned by `kernel_args` must **match the kernel signature
|
||||
1:1**.
|
||||
|
||||
### D3. Kernel signature — standardized 9 + tl arguments
|
||||
|
||||
Recommended signature:
|
||||
|
||||
```python
|
||||
def my_kernel(
|
||||
t_ptr: int, # VA base of the row-wise-sharded tensor on this SIP
|
||||
n_elem: int, # element count per cube tile (or per shard)
|
||||
cube_w: int, # cube mesh width (from kernel_args)
|
||||
cube_h: int, # cube mesh height (from kernel_args)
|
||||
n_sips: int, # equal to world_size (rank = SIP, ADR-0024)
|
||||
sip_rank: int, # this SIP's rank
|
||||
sip_topo_kind: int, # result of TOPO_NAME_TO_KIND lookup
|
||||
sip_topo_w: int, # SIP mesh width (0 for ring_1d)
|
||||
sip_topo_h: int, # SIP mesh height (0 for ring_1d)
|
||||
*, tl, # TLContext (auto-injected)
|
||||
) -> None:
|
||||
```
|
||||
|
||||
Even if `kernel_args` chose a different positional argument order, the
|
||||
kernel's **last four positional arguments are always
|
||||
`(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`** — the backend
|
||||
appends them as `extra_args` (ADR-0047 D5). A custom algorithm must
|
||||
accept these four, but a single-SIP algorithm may simply ignore them.
|
||||
|
||||
`tl` is injected via keyword — `RuntimeContext.launch` adds `tl=tl_ctx`
|
||||
just before invoking the kernel. The signature therefore exposes `tl`
|
||||
as keyword-only (`*, tl`) or as the trailing keyword parameter.
|
||||
|
||||
### D4. Kernel body — freedom and constraints
|
||||
|
||||
Available inside the kernel: every `tl.*` primitive from ADR-0046 D3.
|
||||
|
||||
Common patterns:
|
||||
|
||||
- `cube_id = tl.program_id(axis=1)` — this PE's cube index.
|
||||
- `pe_addr = t_ptr + cube_id * nbytes` — per-cube VA of the tile.
|
||||
- `acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")` — load local
|
||||
data.
|
||||
- `tl.send(dir=...)` / `tl.recv(dir=..., shape=, dtype=)` — IPCQ
|
||||
collective.
|
||||
- `acc = acc + recv` — TensorHandle arithmetic operators (ADR-0046 D4).
|
||||
- `tl.store(pe_addr, acc)` — store the result.
|
||||
|
||||
The kernel body is plain Python — branching and loops are fine. But:
|
||||
|
||||
- No SimPy `yield` or `async` (ADR-0046 D1).
|
||||
- No direct access to TensorHandle `.data` — the Phase 1 timing model
|
||||
doesn't see data dependencies (ADR-0020's 2-pass separation).
|
||||
- Kernel execution must be deterministic — the same input must produce
|
||||
the same op sequence. No random or external IO.
|
||||
|
||||
### D5. SIP topology semantics — meaning of `sip_topo_kind`
|
||||
|
||||
The backend looks up `topology.yaml`'s `system.sips.topology` string
|
||||
in the algorithm module's `TOPO_NAME_TO_KIND` and passes the integer
|
||||
as `sip_topo_kind`. The algorithm then branches:
|
||||
|
||||
```python
|
||||
if sip_topo_kind == SIP_TOPO_RING:
|
||||
acc = _inter_sip_ring(...)
|
||||
elif sip_topo_kind == SIP_TOPO_TORUS:
|
||||
acc = _inter_sip_torus_2d(...)
|
||||
elif sip_topo_kind == SIP_TOPO_MESH:
|
||||
acc = _inter_sip_mesh_2d(...)
|
||||
```
|
||||
|
||||
Each topology branch communicates with peers via IPCQ direction names
|
||||
(`"global_E"`, `"W"`, `"S"`, `"N"` …). Direction semantics are defined
|
||||
in ADR-0023/0025; `configure_sfr_intercube_multisip` installs the IPCQ
|
||||
neighbor table accordingly.
|
||||
|
||||
If a topology kind not supported by the algorithm appears, prefer an
|
||||
explicit `raise ValueError(f"unsupported topology kind
|
||||
{sip_topo_kind}")` over a silent no-op — fail fast on misconfiguration.
|
||||
|
||||
### D6. The `ccl.yaml` algorithm entry
|
||||
|
||||
The algorithm module is paired with a `ccl.yaml` entry (ADR-0023 D10 +
|
||||
ADR-0047 D3):
|
||||
|
||||
```yaml
|
||||
defaults:
|
||||
algorithm: lrab_hierarchical_allreduce
|
||||
n_elem: 8
|
||||
|
||||
algorithms:
|
||||
lrab_hierarchical_allreduce:
|
||||
module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
|
||||
# optional: world_size override
|
||||
# optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
|
||||
```
|
||||
|
||||
- `module`: the full Python module path; `importlib.import_module`
|
||||
consumes this string as-is.
|
||||
- `world_size` (optional): when set, overrides the topology fallback
|
||||
(ADR-0047 D2).
|
||||
- Algorithm-specific parameters are consumed by
|
||||
`configure_sfr_intercube_multisip`.
|
||||
|
||||
Workflow to add a new algorithm:
|
||||
|
||||
1. Write `src/kernbench/ccl/algorithms/<name>.py` following D1.
|
||||
2. Add the entry under `algorithms` in `ccl.yaml`.
|
||||
3. (If needed) extend `kernbench.ccl.sfr_config` with the SFR-install
|
||||
branch.
|
||||
4. Add tests (e.g., `tests/sccl/test_<name>.py`, extending the
|
||||
ADR-0043 eval harness).
|
||||
|
||||
### D7. Legacy "rank = flat PE index" mode
|
||||
|
||||
The `world_size` override in `ccl.yaml`, surfaced by ADR-0047 D2, is
|
||||
used by legacy "rank = flat PE index" tests. The algorithm module can
|
||||
assume `n_sips=world_size` ranks even in this mode — the backend
|
||||
maintains the rank↔(SIP, cube, PE) mapping, so no modal branching is
|
||||
needed inside the algorithm body.
|
||||
|
||||
In single-cube workloads (where `cube_w=cube_h=1`), the algorithm must
|
||||
skip mesh-based phases — see the
|
||||
`single_cube = (cube_w == 1 and cube_h == 1)` pattern in
|
||||
`lrab_hierarchical_allreduce.py`.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### A1. Organize the algorithm module as a class (`class Allreduce: kernel(...)`)
|
||||
|
||||
Rejected. The Python module namespace already identifies an algorithm
|
||||
(see ADR-0047 D3's `importlib.import_module`). A class wrapper adds
|
||||
indirection without simplifying dispatch. Module-level free functions
|
||||
plus a `kernel` alias are clean and obvious.
|
||||
|
||||
### A2. Type `kernel_args` with an explicit dataclass
|
||||
|
||||
Rejected (currently). Each algorithm normally has a different argument
|
||||
count; forcing one dataclass would hurt cross-algorithm interchange.
|
||||
The tuple return is simple and unpacks cleanly with the backend's
|
||||
`*kernel_args_tuple`. If an algorithm wants stronger internal typing,
|
||||
it may define its own NamedTuple.
|
||||
|
||||
### A3. Move SFR installation inside the algorithm module
|
||||
|
||||
Rejected. SFR installation
|
||||
(`configure_sfr_intercube_multisip`) is a cross-module decision
|
||||
combining topology + algorithm; `kernbench.ccl.sfr_config` is a more
|
||||
natural home than the algorithm module itself. D6's "extend
|
||||
sfr_config if needed" workflow keeps responsibility boundaries clear.
|
||||
|
||||
### A4. Auto-register algorithm names via a decorator (analogous to ADR-0045's `@bench`)
|
||||
|
||||
Rejected. Unlike benches, algorithms are already tied to `ccl.yaml`
|
||||
entries; an additional registry would be redundant. The string mapping
|
||||
in `module` is sufficient.
|
||||
|
||||
## Consequences
|
||||
|
||||
- ADR-0047 D3's one-line contract expands to a D1–D7 author-facing
|
||||
guide; new algorithm signatures no longer need to be grep-derived.
|
||||
- D3's standardized 9 + tl signature couples naturally with the
|
||||
backend's `extra_args` append (ADR-0047 D5). It is explicit that
|
||||
even single-SIP-only algorithms must accept the four `sip_*` trailing
|
||||
arguments.
|
||||
- D5's fail-loud recommendation means a `ccl.yaml` topology that the
|
||||
algorithm doesn't support will surface as an explicit `ValueError`
|
||||
rather than a silent wrong result.
|
||||
- D6's step-by-step addition workflow makes clear how far a new
|
||||
algorithm has to reach into sfr_config / tests / ccl.yaml.
|
||||
Reference in New Issue
Block a user