kernbench2/docs/adr/ADR-0032-algo-intercube-allreduce.md
mukesh fd56b6cacd adr: add ADR-0043/0044 (eval harnesses); reconcile ADR-0024/0032 for SIP w/h
Document the allreduce + GEMM evaluation harnesses and bring the affected
allreduce ADRs in line with the refactored code.

New (Accepted, EN + KO):
- ADR-0043 — allreduce evaluation harness (tests/sccl/): distributed-driven
  correctness, latency/buffer-kind sweeps, sessionfinish plot aggregators,
  topology + FSIM-comparison figures. Verified against the implementation.
- ADR-0044 — GEMM evaluation harness (scripts/gemm_sweep.py + tests/gemm/):
  heavy-script data gen vs. fast test-rendered figures, slow regenerator,
  the 3-figure set. Records two limitations as open questions: the
  theoretical-model constants are inherited (not yet traced to ADR-0033/
  0014), and the *_measured figure is a naming misnomer.

Updated (EN + KO):
- ADR-0024 — add D5: SIP grid w/h resolution (explicit sips.w/h, square
  fallback, fail-loud), documenting the AhbmCCLBackend fix.
- ADR-0032 — D4/D5/Non-goals reconciled: rectangular SIP grids (e.g. 6 SIPs
  as 3x2) are supported via explicit w/h; the square requirement now
  applies only to the fallback. Affected-files repointed to tests/sccl/.

Verification: ADR-0023 and ADR-0042 confirmed still matching the code (no
change). verify_adr_lang_pairs.py passes (EN/KO Status blocks byte-equal).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 10:26:25 -07:00


ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange

Status

Accepted (supersedes ADR-0029).

Context

Goal

Define a single all-reduce algorithm that exploits the topology hierarchy: cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel, one SFR configuration path, driven by topology.yaml and ccl.yaml.

Why replace ADR-0029 (hierarchical 3-level)

ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm in which every PE in the system participates. In practice the intra-cube PE-to-PE stage (bidirectional reduce + chain broadcast) adds complexity without matching the common workload pattern, where the tensor is sharded per cube (not per PE within a cube).

Moreover, the hierarchical design required:

  • per-PE neighbor graph installation (_build_pe_installs multi-level)
  • multi-level topology schema (hierarchical_3level)
  • all_pes mapper + multi_pe_sip_local validator infrastructure

The intercube algorithm below removes all of that: pe0-only same-lane intercube reduce on the 4×4 cube mesh, then inter-SIP exchange on the root cube, then broadcast back. Simpler kernel, simpler wiring, same bandwidth characteristics for the common per-cube DP workload.

Current state

  • src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py: kernel
  • src/kernbench/ccl/sfr_config.py: configure_sfr_intercube_multisip
  • src/kernbench/runtime_api/distributed.py: AhbmCCLBackend wires this automatically at init_process_group time.
  • Old ring_allreduce, mesh_allreduce, tree_allreduce, hierarchical_allreduce modules and their tests are removed.

Decision

D1. Algorithm structure — 5 phases (center-root, bidirectional)

The root cube sits at the geometric center of the cube mesh:

root_col  = cube_w // 2
root_row  = cube_h // 2
root_cube = root_row * cube_w + root_col   # center; 10 on a 4×4 mesh

Each reduce/broadcast phase converges/diverges bidirectionally toward this center, which shortens the intra-SIP critical path versus a corner-root walk (4×4 mesh: 4 hops reduce + 4 hops broadcast vs 6+6 with an SE-corner root, a one-third reduction that approaches one half on larger meshes).
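The hop counts quoted above can be checked with a short sketch. It is illustrative only: `center_root` and `critical_path_hops` are hypothetical helper names, not functions from the kernbench code base, and the model simply counts one hop per mesh edge on the longest row walk plus the longest column walk.

```python
def center_root(cube_w: int, cube_h: int) -> tuple[int, int, int]:
    """Return (root_col, root_row, root_cube) for the geometric-center root (D1)."""
    root_col = cube_w // 2
    root_row = cube_h // 2
    return root_col, root_row, root_row * cube_w + root_col


def critical_path_hops(cube_w: int, cube_h: int, root_col: int, root_row: int) -> int:
    """Longest row walk to root_col plus longest column walk to root_row."""
    row_hops = max(root_col, cube_w - 1 - root_col)
    col_hops = max(root_row, cube_h - 1 - root_row)
    return row_hops + col_hops


col, row, cube = center_root(4, 4)
center_hops = critical_path_hops(4, 4, col, row)  # center root
corner_hops = critical_path_hops(4, 4, 3, 3)      # SE-corner root at (col=3, row=3)
print(cube, center_hops, corner_hops)             # 10 4 6
```

The same count applies to the broadcast phases, which retrace the reduce path in the opposite direction.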

For each SIP (launched concurrently by mp.spawn):

Phase 1 — Row reduce converging at col == root_col (cube mesh, pe0 only):
    left half (col < root_col) walks W→E; right half (col > root_col)
    walks E→W; the root_col cube merges both sides → holds row sum.

Phase 2 — Col reduce on col == root_col converging at row == root_row:
    above (row < root_row) walks N→S; below (row > root_row) walks S→N;
    the root cube merges both → holds the full SIP sum.

Phase 3 — Inter-SIP exchange on cube_id == root_cube (pe0 only):
    Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
    selected by sip_topo_kind (from topology.yaml sips.topology).

Phase 4 — Col broadcast on col == root_col, outward from root_row.

Phase 5 — Row broadcast outward from root_col across the cube mesh.

After all phases every cube's pe0 holds the global sum.
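As a sanity check on the phase ordering, the following toy model replays the five phases on plain Python lists. It is a sketch under stated assumptions, not the kernel: each cube holds a single scalar instead of a tile, Phase 3 is modelled as a plain sum over the root cubes (the result is the same for ring, torus, and mesh exchanges), and `simulate_intercube_allreduce` is a hypothetical name.

```python
def simulate_intercube_allreduce(values, cube_w, cube_h):
    """Toy model of the 5 phases. values[sip][cube_id] holds one scalar per cube."""
    rc, rr = cube_w // 2, cube_h // 2  # root_col, root_row
    # One mutable grid per SIP, row-major: cube_id = row * cube_w + col.
    grids = [[list(v[r * cube_w:(r + 1) * cube_w]) for r in range(cube_h)]
             for v in values]

    for g in grids:
        for row in g:                            # Phase 1: row reduce toward root_col
            for c in range(rc):                  # left half walks W->E
                row[c + 1] += row[c]
            for c in range(cube_w - 1, rc, -1):  # right half walks E->W
                row[c - 1] += row[c]
        for r in range(rr):                      # Phase 2: rows above walk N->S
            g[r + 1][rc] += g[r][rc]
        for r in range(cube_h - 1, rr, -1):      # Phase 2: rows below walk S->N
            g[r - 1][rc] += g[r][rc]

    # Phase 3: inter-SIP exchange on the root cube, modelled as a plain sum.
    total = sum(g[rr][rc] for g in grids)

    for g in grids:
        g[rr][rc] = total
        for r in range(rr, 0, -1):               # Phase 4: column broadcast northward
            g[r - 1][rc] = g[r][rc]
        for r in range(rr, cube_h - 1):          # Phase 4: column broadcast southward
            g[r + 1][rc] = g[r][rc]
        for row in g:                            # Phase 5: row broadcast from root_col
            for c in range(rc, 0, -1):
                row[c - 1] = row[c]
            for c in range(rc, cube_w - 1):
                row[c + 1] = row[c]

    return [[x for row in g for x in row] for g in grids]


inputs = [[sip * 16 + cube for cube in range(16)] for sip in range(2)]
result = simulate_intercube_allreduce(inputs, cube_w=4, cube_h=4)
print(result[0][0], result[1][15])  # 496 496
```

With `cube_w == cube_h == 1` every intra-SIP loop in the model is empty, so only the Phase 3 sum does any work.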

Single-cube fast-path: when cube_w == cube_h == 1 (one cube per rank, the common TP case), the intra-SIP reduce/broadcast phases are skipped and the kernel goes straight to the Phase 3 inter-SIP exchange.

The kernel is a single function parameterised by sip_topo_kind ∈ {0, 1, 2} (ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical across topologies; only phase 3 branches. Helper functions _inter_sip_ring, _inter_sip_torus_2d, _inter_sip_mesh_2d encode the three exchange patterns.

D2. Tensor layout (rank = SIP, per-worker)

Per ADR-0024 rank = SIP at the process-group level. Each worker allocates its own cube-mesh-spanning tensor:

dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)

Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses each cube's shard as pe_addr = t_ptr + cube_id * n_elem * 2, where the factor 2 is the size in bytes of one f16 element.
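A minimal sketch of that address arithmetic, assuming the factor 2 is the byte size of one f16 element. `shard_addr` and the base address are hypothetical and used only for illustration; `n_elem = 8` matches the value in ccl.yaml.

```python
F16_BYTES = 2  # bytes per f16 element, the "* 2" in the address formula


def shard_addr(t_ptr: int, cube_id: int, n_elem: int) -> int:
    """Byte address of the shard owned by pe0 of the given cube."""
    return t_ptr + cube_id * n_elem * F16_BYTES


base = 0x1000  # hypothetical tensor base address
offsets = [shard_addr(base, cube_id, 8) - base for cube_id in range(4)]
print(offsets)  # [0, 16, 32, 48]
```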

D3. SFR / IPCQ wiring — configure_sfr_intercube_multisip

Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor tables for every cube's pe0 across every SIP — regardless of which cube is the root or which SIP topology is selected. This lets the kernel elect the root cube at runtime and supports topology switches without re-wiring.

| Level | Direction labels | Scope |
|---|---|---|
| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per sips.topology |

Inter-SIP directions use the global_* prefix to keep the namespace disjoint from intercube directions. ADR-0025's _OPPOSITE_DIR is extended with global_E ↔ global_W and global_N ↔ global_S so the reverse-direction resolver handles 2-SIP bidirectional rings correctly.

Internally the function calls install_ipcq with:

  • world_size = n_sips × n_cubes
  • rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]
  • A closure-captured neighbors() function that builds the map above.

This world_size is internal to IPCQ wiring and does not leak to the process-group rank.
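The wiring described in D3 can be sketched as follows. This is not the body of configure_sfr_intercube_multisip: the function names, the `OPPOSITE_DIR` table, and the convention that global_E points at SIP `(sip + 1) % n_sips` are assumptions made for illustration. Row 0 is taken as north and column 0 as west, consistent with the N→S and W→E walks in D1.

```python
# Assumed shape of the reverse-direction table after the global_* extension.
OPPOSITE_DIR = {
    "N": "S", "S": "N", "E": "W", "W": "E",
    "global_E": "global_W", "global_W": "global_E",
    "global_N": "global_S", "global_S": "global_N",
}


def build_rank_to_pe(n_sips, n_cubes):
    """Internal IPCQ ranks: pe0 of every cube on every SIP."""
    return [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]


def intercube_neighbors(cube, cube_w, cube_h):
    """N/S/E/W neighbors of a cube on a mesh without wrap-around."""
    row, col = divmod(cube, cube_w)
    out = {}
    if row > 0:
        out["N"] = cube - cube_w
    if row < cube_h - 1:
        out["S"] = cube + cube_w
    if col < cube_w - 1:
        out["E"] = cube + 1
    if col > 0:
        out["W"] = cube - 1
    return out


def ring_sip_neighbors(sip, n_sips):
    """global_E / global_W peers on a 1D ring (assumed orientation)."""
    if n_sips < 2:
        return {}
    return {"global_E": (sip + 1) % n_sips, "global_W": (sip - 1) % n_sips}


rank_to_pe = build_rank_to_pe(n_sips=2, n_cubes=16)
print(len(rank_to_pe), intercube_neighbors(10, 4, 4), ring_sip_neighbors(0, 2))
```

On a 2-SIP ring both global_E and global_W resolve to the same peer, which is the case the extended reverse-direction table is meant to disambiguate.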

D4. SIP topology — from topology.yaml

system:
  sips:
    count: 2
    topology: ring_1d       # or torus_2d, mesh_2d_no_wrap

  • ring_1d: n_sips-1 rounds of send global_E / recv global_W.
  • torus_2d: w × h wrapping mesh. Row ring on global_E/W then col ring on global_S/N.
  • mesh_2d_no_wrap: w × h mesh without wrap-around. Chain reduce + broadcast per dimension.
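To see why n_sips-1 rounds suffice for ring_1d, here is a small model of the exchange. The source states only the round count and the directions; the forwarding scheme below (each SIP accumulates what it receives and forwards that same value in the next round) is an assumption, and `ring_allreduce` is a hypothetical name.

```python
def ring_allreduce(values):
    """values[s] is the root-cube partial sum of SIP s; returns the per-SIP results."""
    n = len(values)
    acc = list(values)        # running sum held by each SIP
    in_flight = list(values)  # value each SIP sends on global_E this round
    for _ in range(n - 1):
        # Each SIP receives on global_W, i.e. from SIP (s - 1) mod n.
        received = [in_flight[(s - 1) % n] for s in range(n)]
        acc = [a + r for a, r in zip(acc, received)]
        in_flight = received  # forward the value just received
    return acc


print(ring_allreduce([1, 2, 3, 4]))  # [10, 10, 10, 10]
```

After round k every SIP has seen the original values of its k nearest western neighbors, so after n_sips-1 rounds each SIP holds the full sum.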

2D grid dims (w, h) come from system.sips.w/h (ADR-0024 D5). A square fallback (round(sqrt(n_sips))²) applies only when w/h are omitted, so rectangular grids (e.g. 6 SIPs as 3×2) are supported by giving explicit w/h.
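The resolution rule can be sketched as below. This mirrors the description here and in ADR-0024 D5 but is not the AhbmCCLBackend code: `resolve_sip_grid` is a hypothetical name, and rejecting a config that gives only one of w/h is an assumption about what "fail-loud" covers.

```python
import math


def resolve_sip_grid(n_sips, w=None, h=None):
    """Return (w, h) for a 2D SIP topology, raising ValueError on a bad config."""
    if w is not None and h is not None:
        if w * h != n_sips:
            raise ValueError(f"sips.w * sips.h = {w * h} does not match count = {n_sips}")
        return w, h
    if w is not None or h is not None:
        # Assumption: a half-specified grid is treated as an error.
        raise ValueError("give both sips.w and sips.h, or neither")
    k = round(math.sqrt(n_sips))  # square fallback
    if k * k != n_sips:
        raise ValueError(f"n_sips = {n_sips} is not a perfect square; set sips.w/h")
    return k, k


print(resolve_sip_grid(6, w=3, h=2))  # (3, 2)
print(resolve_sip_grid(4))            # (2, 2)
```

Calling `resolve_sip_grid(6)` without w/h raises, which is the case that forces rectangular grids to be declared explicitly.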

D5. Process-group integration — AhbmCCLBackend

At init_process_group time the backend:

  1. Loads ccl.yaml + topology.yaml.
  2. Derives sip_topo_kind from system.sips.topology via the algorithm module's TOPO_NAME_TO_KIND, and sip_topo_w, sip_topo_h from system.sips.w/h with a square-only fallback (ADR-0024 D5).
  3. Calls configure_sfr_intercube_multisip(engine, spec, cfg) — one-time SFR wiring, mirrors NCCL communicator creation.

At each dist.all_reduce(tensor) call:

  1. Resolves kernel_fn from cfg["module"].
  2. Builds args: (n_elem, cube_w, cube_h, n_sips) from kernel_args(world_size, n_elem).
  3. Appends (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h) where sip_rank is the current greenlet's bound rank.
  4. Launches with _defer_wait=True; the main scheduler drains pending handles after all workers submit (per ADR-0027 D0.4).
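The argument assembly in steps 2 and 3 can be pictured with the sketch below. The names are stand-ins: this `kernel_args` hard-codes a 4×4 cube mesh through default parameters, which is an assumption about where cube_w/cube_h come from, and `build_launch_args` is a hypothetical helper. n_sips is taken equal to world_size because rank = SIP (ADR-0024).

```python
TOPO_NAME_TO_KIND = {"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2}


def kernel_args(world_size, n_elem, cube_w=4, cube_h=4):
    """Stand-in for the module's kernel_args: the 4 per-tensor scalars."""
    return (n_elem, cube_w, cube_h, world_size)


def build_launch_args(world_size, n_elem, sip_rank, topo_name, sip_topo_w, sip_topo_h):
    """Per-tensor args followed by the per-rank topology args."""
    per_tensor = kernel_args(world_size, n_elem)
    per_rank = (sip_rank, TOPO_NAME_TO_KIND[topo_name], sip_topo_w, sip_topo_h)
    return per_tensor + per_rank


args = build_launch_args(world_size=4, n_elem=8, sip_rank=3,
                         topo_name="torus_2d", sip_topo_w=2, sip_topo_h=2)
print(args)  # (8, 4, 4, 4, 3, 1, 2, 2)
```

The resulting tuple lines up with the kernel signature in D7 between the leading t_ptr and the trailing tl.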

D6. Config schema

ccl.yaml:

defaults:
  algorithm: lrab_hierarchical_allreduce
  buffer_kind: tcm
  ...

algorithms:
  lrab_hierarchical_allreduce:
    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
    root_cube: 15   # NOT read today — the kernel elects the root dynamically
                    # as the geometric center (see D1). Kept as a placeholder
                    # for a future explicit-root override / runtime election.

topology.yaml:

system:
  sips:
    count: 2
    topology: ring_1d
sip:
  cube_mesh: { w: 4, h: 4 }

D7. Algorithm module contract

Modules loaded via cfg["module"] must export:

| Name | Purpose |
|---|---|
| kernel | callable, signature (t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl) |
| kernel_args(world_size, n_elem) -> tuple | returns the first 4 scalar args (per-tensor) |
| TOPO_NAME_TO_KIND: dict[str, int] | maps system.sips.topology name to kernel branch code |
| SIP_TOPO_RING, SIP_TOPO_TORUS, SIP_TOPO_MESH | integer constants (0, 1, 2) |
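A loader could verify this contract before launching anything. The checker below is a hedged sketch: `check_contract` and the toy module are hypothetical and do not exist in kernbench; only the exported names are taken from the contract above.

```python
import types

REQUIRED = ("kernel", "kernel_args", "TOPO_NAME_TO_KIND",
            "SIP_TOPO_RING", "SIP_TOPO_TORUS", "SIP_TOPO_MESH")


def check_contract(mod):
    """Raise if an algorithm module does not export the D7 contract."""
    missing = [name for name in REQUIRED if not hasattr(mod, name)]
    if missing:
        raise AttributeError(f"{mod.__name__} is missing: {', '.join(missing)}")
    if not callable(mod.kernel) or not callable(mod.kernel_args):
        raise TypeError("kernel and kernel_args must be callable")
    kinds = {mod.SIP_TOPO_RING, mod.SIP_TOPO_TORUS, mod.SIP_TOPO_MESH}
    unknown = set(mod.TOPO_NAME_TO_KIND.values()) - kinds
    if unknown:
        raise ValueError(f"TOPO_NAME_TO_KIND maps to unknown kinds: {sorted(unknown)}")


# Toy module standing in for an import of cfg["module"].
toy = types.ModuleType("toy_allreduce")
toy.SIP_TOPO_RING, toy.SIP_TOPO_TORUS, toy.SIP_TOPO_MESH = 0, 1, 2
toy.TOPO_NAME_TO_KIND = {"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2}
toy.kernel = lambda *args: None
toy.kernel_args = lambda world_size, n_elem: (n_elem, 4, 4, world_size)

check_contract(toy)  # passes silently
```

In real use the module object would come from `importlib.import_module(cfg["module"])` rather than being built by hand.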

Dependencies

  • ADR-0023: IPCQ protocol (neighbor table, send/recv, credit return).
  • ADR-0024: rank = SIP launcher, mp.spawn, greenlet-local rank.
  • ADR-0025: Address-based IPCQ direction matching; extended _OPPOSITE_DIR with global_* pairs.
  • ADR-0027: Worker-wait / collective-pending drain in main scheduler.

Non-goals

  • Per-PE allreduce (intra-cube PE-to-PE reduce). Out of scope — the workload for this algorithm is per-cube DP.
  • Rectangular SIP grids without explicit w/h: rectangular (non-square) mesh/torus grids are supported only when system.sips.w/h are given explicitly (ADR-0024 D5). With w/h omitted, 2D topologies fall back to a square grid, which requires n_sips = k².
  • Pipelined chunks: single-tile per cube, no pipelining yet.
  • Root cube runtime election: the kernel currently uses root_cube = (cube_h // 2) * cube_w + (cube_w // 2), the geometric center from D1, chosen to minimize the intra-SIP critical path. SFR wiring covers all cubes, so electing a different root at runtime is a pure kernel change when needed.

Consequences

Positive

  • Single kernel, single install path for all-reduce — replaces four removed modules (ring, mesh, tree, hierarchical).
  • Topology-agnostic kernel: ring / torus / mesh selected via one integer param, no kernel duplication.
  • Automatic via dist.all_reduce: no bench-level or user-level algorithm selection needed; config-driven end-to-end.
  • Full SFR wiring: every cube on every SIP has inter-SIP links available — supports future dynamic root-cube election.

Negative

  • Not suitable for per-PE sharded tensors: TP-layer-style tensors that shard within one cube across 8 PEs are not addressable by this kernel. Such workloads would need a separate intra-cube all-reduce path (not yet implemented).
  • configure_sfr_intercube_multisip always wires all pe0s: even if a given run only needs a subset (e.g. 1 SIP, ring only). Install cost is small but not zero.

Affected files

| File | Change |
|---|---|
| src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py (new) | Kernel + _inter_sip_* helpers + TOPO_NAME_TO_KIND |
| src/kernbench/ccl/sfr_config.py (new) | configure_sfr_intercube_multisip |
| src/kernbench/ccl/topologies.py | Added torus_2d, mesh_2d_no_wrap |
| src/kernbench/ccl/install.py | Extended _OPPOSITE_DIR with global_* pairs |
| src/kernbench/runtime_api/distributed.py | AhbmCCLBackend uses configure_sfr_intercube_multisip + appends sip_rank/topo args |
| ccl.yaml | Single lrab_hierarchical_allreduce entry |
| topology.yaml | Added system.sips.topology |
| benches/ccl_allreduce.py | Row-wise cube-mesh tensor layout |
| tests/sccl/ (test package) | Config-driven ring/torus/mesh correctness + full dist.all_reduce path + latency/buffer-kind sweeps (evaluation harness, ADR-0043) |
| tests/test_intercube_sfr_config.py | SFR wiring verification |
| Removed | ring_allreduce.py, mesh_allreduce.py, tree_allreduce.py, hierarchical_allreduce.py, hello_send.py, testing.py and their tests |