Filename + lifecycle:
- ADRs renamed to ADR-NNNN-<cat>-title.md, with 8 short category prefixes
(dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
retroactive docs pending verification.
Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
deleted; ADR-0019/0021 moved to adr-history with one-line stub status
Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
selection, flit-aware per-flit commit, async finalize, command-only
fallback path)
Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
"Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
(now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)
Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py
Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.
Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
(ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
Status
Accepted (supersedes ADR-0029).
Context
Goal
Define a single all-reduce algorithm that exploits the topology hierarchy:
cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
one SFR configuration path, driven by topology.yaml and ccl.yaml.
Why replace ADR-0029 (hierarchical 3-level)
ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm where every PE in the system participates. In practice this adds the intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast) without matching the common workload pattern where the tensor is sharded per cube (not per PE within a cube).
Moreover, the hierarchical design required:
- per-PE neighbor graph installation (_build_pe_installs multi-level)
- multi-level topology schema (hierarchical_3level)
- all_pes mapper + multi_pe_sip_local validator infrastructure
The intercube algorithm below removes all of that: pe0-only same-lane intercube reduce on the 4×4 cube mesh, then inter-SIP exchange on the root cube, then broadcast back. Simpler kernel, simpler wiring, same bandwidth characteristics for the common per-cube DP workload.
Current state
- src/kernbench/ccl/algorithms/intercube_allreduce.py: kernel
- src/kernbench/ccl/sfr_config.py: configure_sfr_intercube_multisip
- src/kernbench/runtime_api/distributed.py: AhbmCCLBackend wires this automatically at init_process_group time.
- Old ring_allreduce, mesh_allreduce, tree_allreduce, hierarchical_allreduce modules and their tests are removed.
Decision
D1. Algorithm structure — 5 phases
For each SIP (launched concurrently by mp.spawn):
Phase 1 — Row reduce W → E (cube mesh, pe0 only):
col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
holds the full SIP sum.
Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
selected by sip_topo_kind (from topology.yaml sips.topology).
Phase 4 — Col broadcast S → N on rightmost column.
Phase 5 — Row broadcast E → W across the cube mesh.
After all phases every cube's pe0 holds the global sum.
The kernel is a single function parameterised by sip_topo_kind ∈ {0, 1, 2}
(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
across topologies; only phase 3 branches. Helper functions
_inter_sip_ring, _inter_sip_torus_2d, _inter_sip_mesh_2d encode the
three exchange patterns.
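The five phases can be checked with a small host-side simulation. The sketch below is not the kernel: it models each cube's pe0 shard as one scalar, assumes row-major cube ids (cube_id = row * mesh_w + col, consistent with root cube 15 on a 4×4 mesh), and implements only the ring_1d variant of phase 3. The function name intercube_allreduce_sim is hypothetical.

```python
def intercube_allreduce_sim(values, mesh_w=4, mesh_h=4):
    """Simulate the 5 phases; values[sip][cube] is one scalar per cube's pe0."""
    n_sips = len(values)
    buf = [list(v) for v in values]

    def cid(row, col):
        # Assumed row-major cube numbering.
        return row * mesh_w + col

    root = cid(mesh_h - 1, mesh_w - 1)  # SE corner, 15 on a 4x4 mesh
    east = mesh_w - 1

    for sip in range(n_sips):
        # Phase 1: row reduce W -> E; the rightmost column ends with row sums.
        for r in range(mesh_h):
            for c in range(1, mesh_w):
                buf[sip][cid(r, c)] += buf[sip][cid(r, c - 1)]
        # Phase 2: column reduce N -> S on the rightmost column.
        for r in range(1, mesh_h):
            buf[sip][cid(r, east)] += buf[sip][cid(r - 1, east)]

    # Phase 3 (ring_1d only): n_sips - 1 rounds of send east / receive from west.
    acc = [buf[s][root] for s in range(n_sips)]
    in_flight = list(acc)
    for _ in range(n_sips - 1):
        in_flight = [in_flight[(s - 1) % n_sips] for s in range(n_sips)]
        acc = [a + x for a, x in zip(acc, in_flight)]
    for s in range(n_sips):
        buf[s][root] = acc[s]

    for sip in range(n_sips):
        # Phase 4: column broadcast S -> N on the rightmost column.
        for r in range(mesh_h - 2, -1, -1):
            buf[sip][cid(r, east)] = buf[sip][cid(r + 1, east)]
        # Phase 5: row broadcast E -> W across the mesh.
        for r in range(mesh_h):
            for c in range(mesh_w - 2, -1, -1):
                buf[sip][cid(r, c)] = buf[sip][cid(r, c + 1)]
    return buf


if __name__ == "__main__":
    data = [[sip * 16 + cube for cube in range(16)] for sip in range(2)]
    out = intercube_allreduce_sim(data)
    print(out[0][0], out[1][15])
```

With two SIPs holding the values 0..31, every entry of the result equals sum(range(32)), matching the statement that every cube's pe0 ends with the global sum.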
D2. Tensor layout (rank = SIP, per-worker)
Per ADR-0024 rank = SIP at the process-group level. Each worker allocates its own cube-mesh-spanning tensor:
dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
each cube's shard as pe_addr = t_ptr + cube_id * n_elem * 2.
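As a worked instance of the address formula, the sketch below assumes the trailing factor 2 is the byte size of one f16 element (the dtype used above). The helper name shard_addr and the base address are illustrative only.

```python
F16_BYTES = 2  # assumption: the "* 2" in the formula is sizeof(f16)


def shard_addr(t_ptr: int, cube_id: int, n_elem: int,
               elem_bytes: int = F16_BYTES) -> int:
    """Byte address of cube_id's shard: t_ptr + cube_id * n_elem * elem_bytes."""
    return t_ptr + cube_id * n_elem * elem_bytes


if __name__ == "__main__":
    base = 0x1000  # illustrative base address
    for cube in (0, 1, 15):
        print(cube, hex(shard_addr(base, cube, n_elem=8)))
```

With n_elem = 8, consecutive shards are 16 bytes apart, so the 16 shards of one SIP span 256 bytes starting at t_ptr.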
D3. SFR / IPCQ wiring — configure_sfr_intercube_multisip
Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor tables for every cube's pe0 across every SIP — regardless of which cube is the root or which SIP topology is selected. This lets the kernel elect the root cube at runtime and supports topology switches without re-wiring.
| Level | Direction labels | Scope |
|---|---|---|
| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per sips.topology |
Inter-SIP directions use the global_* prefix to keep the namespace
disjoint from intercube directions. ADR-0025's _OPPOSITE_DIR is extended
with global_E ↔ global_W and global_N ↔ global_S so the reverse-
direction resolver handles 2-SIP bidirectional rings correctly.
Internally the function calls install_ipcq with:
- world_size = n_sips × n_cubes
- rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]
- A closure-captured neighbors() function that builds the map above.
This world_size is internal to IPCQ wiring and does not leak to the
process-group rank.
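The wiring described in D3 can be sketched as plain Python. This is a hedged illustration, not the body of configure_sfr_intercube_multisip: it does not call install_ipcq, it covers only the ring_1d inter-SIP case, and the helper names (build_rank_to_pe, intercube_neighbors, ring_neighbors, neighbors) plus the row-major cube numbering are assumptions.

```python
# Direction pairs, including the global_* extension described above.
OPPOSITE_DIR = {
    "N": "S", "S": "N", "E": "W", "W": "E",
    "global_E": "global_W", "global_W": "global_E",
    "global_N": "global_S", "global_S": "global_N",
}


def build_rank_to_pe(n_sips, n_cubes):
    """Internal IPCQ rank -> (sip, cube, pe); pe is always 0."""
    return [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]


def intercube_neighbors(cube, mesh_w, mesh_h):
    """N/S/E/W mesh neighbors of a cube, no wrap-around."""
    row, col = divmod(cube, mesh_w)
    out = {}
    if row > 0:
        out["N"] = cube - mesh_w
    if row < mesh_h - 1:
        out["S"] = cube + mesh_w
    if col > 0:
        out["W"] = cube - 1
    if col < mesh_w - 1:
        out["E"] = cube + 1
    return out


def ring_neighbors(sip, n_sips):
    """Peer SIPs on a 1D ring; empty when there is only one SIP."""
    if n_sips < 2:
        return {}
    return {"global_E": (sip + 1) % n_sips, "global_W": (sip - 1) % n_sips}


def neighbors(rank, n_sips, mesh_w, mesh_h):
    """Direction -> peer internal rank for one pe0."""
    n_cubes = mesh_w * mesh_h
    sip, cube = divmod(rank, n_cubes)
    table = {d: sip * n_cubes + c
             for d, c in intercube_neighbors(cube, mesh_w, mesh_h).items()}
    # Inter-SIP links connect cube c on this SIP to cube c on the peer SIP.
    table.update({d: peer * n_cubes + cube
                  for d, peer in ring_neighbors(sip, n_sips).items()})
    return table


if __name__ == "__main__":
    print(neighbors(0, n_sips=2, mesh_w=4, mesh_h=4))
```

On a 2-SIP ring both global_E and global_W of a cube point at the same peer, which is the case the extended _OPPOSITE_DIR pairs are meant to disambiguate.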
D4. SIP topology — from topology.yaml
system:
  sips:
    count: 2
    topology: ring_1d # or torus_2d, mesh_2d_no_wrap
- ring_1d: n_sips-1 rounds of send global_E / recv global_W.
- torus_2d: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on global_E/W then col ring on global_S/N.
- mesh_2d_no_wrap: square mesh without wrap-around. Chain reduce + broadcast per dimension.
2D variants require n_sips to be a perfect square.
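The perfect-square requirement and the peer layout of the 2D variants can be illustrated as follows. This is a sketch under assumptions: SIPs are numbered row-major on the k×k grid, global_E means column + 1 and global_S means row + 1 (mirroring the cube-mesh convention), and the helper names square_side and grid_peers are hypothetical.

```python
import math


def square_side(n_sips: int) -> int:
    """Return k with k * k == n_sips, or raise if n_sips is not a perfect square."""
    k = math.isqrt(n_sips)
    if k * k != n_sips:
        raise ValueError(f"2D SIP topology needs n_sips = k^2, got {n_sips}")
    return k


def grid_peers(sip: int, k: int, wrap: bool) -> dict:
    """global_* peers of one SIP on a k x k grid (wrap=True models torus_2d)."""
    row, col = divmod(sip, k)
    steps = {"global_E": (0, 1), "global_W": (0, -1),
             "global_S": (1, 0), "global_N": (-1, 0)}
    peers = {}
    for name, (dr, dc) in steps.items():
        r, c = row + dr, col + dc
        if wrap:
            if k < 2:
                continue  # a 1x1 torus has no distinct peers
            r, c = r % k, c % k
        elif not (0 <= r < k and 0 <= c < k):
            continue  # mesh_2d_no_wrap: edge SIPs have no link in this direction
        peers[name] = r * k + c
    return peers


if __name__ == "__main__":
    k = square_side(4)
    print(grid_peers(0, k, wrap=True))
    print(grid_peers(0, k, wrap=False))
```

For n_sips = 4 the torus gives every SIP four links (two of which coincide per dimension), while the no-wrap mesh gives corner SIPs only two.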
D5. Process-group integration — AhbmCCLBackend
At init_process_group time the backend:
- Loads ccl.yaml + topology.yaml.
- Derives sip_topo_kind, sip_topo_w, sip_topo_h from system.sips.topology using the algorithm module's TOPO_NAME_TO_KIND.
- Calls configure_sfr_intercube_multisip(engine, spec, cfg): one-time SFR wiring, mirrors NCCL communicator creation.
At each dist.all_reduce(tensor) call:
- Resolves kernel_fn from cfg["module"].
- Builds args: (n_elem, cube_w, cube_h, n_sips) from kernel_args(world_size, n_elem).
- Appends (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h) where sip_rank is the current greenlet's bound rank.
- Launches with _defer_wait=True; the main scheduler drains pending handles after all workers submit (per ADR-0027 D0.4).
D6. Config schema
ccl.yaml:
defaults:
  algorithm: intercube_allreduce
  buffer_kind: tcm
  ...
algorithms:
  intercube_allreduce:
    module: kernbench.ccl.algorithms.intercube_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
    root_cube: 15
topology.yaml:
system:
sips:
count: 2
topology: ring_1d
sip:
cube_mesh: { w: 4, h: 4 }
D7. Algorithm module contract
Modules loaded via cfg["module"] must export:
| Name | Purpose |
|---|---|
| kernel | callable, signature (t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl) |
| kernel_args(world_size, n_elem) -> tuple | returns the first 4 scalar args (per-tensor) |
| TOPO_NAME_TO_KIND: dict[str, int] | maps system.sips.topology name to kernel branch code |
| SIP_TOPO_RING, SIP_TOPO_TORUS, SIP_TOPO_MESH | integer constants (0, 1, 2) |
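A loader could check the contract and assemble the scalar launch arguments roughly as below. This is a sketch, not AhbmCCLBackend code: check_algorithm_module and build_launch_args are hypothetical helpers, the stand-in module is fabricated for the demo, and its kernel_args mapping (world_size used as n_sips, a fixed 4×4 cube mesh) is an assumption. t_ptr and tl are taken to be supplied by the runtime and are left out.

```python
from types import SimpleNamespace

REQUIRED_EXPORTS = (
    "kernel", "kernel_args", "TOPO_NAME_TO_KIND",
    "SIP_TOPO_RING", "SIP_TOPO_TORUS", "SIP_TOPO_MESH",
)


def check_algorithm_module(mod):
    """Raise if the module lacks any name required by the contract table."""
    missing = [name for name in REQUIRED_EXPORTS if not hasattr(mod, name)]
    if missing:
        raise AttributeError(f"algorithm module is missing: {missing}")
    for name in ("kernel", "kernel_args"):
        if not callable(getattr(mod, name)):
            raise TypeError(f"{name} must be callable")


def build_launch_args(mod, world_size, n_elem, sip_rank, topo_name, topo_w, topo_h):
    """Per-tensor scalars from kernel_args, then the per-rank topology scalars."""
    head = tuple(mod.kernel_args(world_size, n_elem))
    kind = mod.TOPO_NAME_TO_KIND[topo_name]
    return head + (sip_rank, kind, topo_w, topo_h)


# Stand-in for a loaded algorithm module (illustrative values only).
fake_module = SimpleNamespace(
    kernel=lambda *args: None,
    kernel_args=lambda world_size, n_elem: (n_elem, 4, 4, world_size),
    TOPO_NAME_TO_KIND={"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2},
    SIP_TOPO_RING=0, SIP_TOPO_TORUS=1, SIP_TOPO_MESH=2,
)

if __name__ == "__main__":
    check_algorithm_module(fake_module)
    print(build_launch_args(fake_module, 2, 8, 1, "ring_1d", 2, 1))
```

The resulting tuple lines up with the eight scalars between t_ptr and tl in the kernel signature.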
Dependencies
- ADR-0023: IPCQ protocol (neighbor table, send/recv, credit return).
- ADR-0024: rank = SIP launcher, mp.spawn, greenlet-local rank.
- ADR-0025: Address-based IPCQ direction matching; extended _OPPOSITE_DIR with global_* pairs.
- ADR-0027: Worker-wait / collective-pending drain in main scheduler.
Non-goals
- Per-PE allreduce (intra-cube PE-to-PE reduce). Out of scope: the workload for this algorithm is per-cube DP.
- Asymmetric SIP topologies (non-square mesh/torus). torus_2d and mesh_2d_no_wrap require n_sips = k².
- Pipelined chunks: single-tile per cube, no pipelining yet.
- Root cube runtime election: the kernel currently uses root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1) hardcoded to the SE corner. SFR wiring covers all cubes, so runtime election is a pure kernel change when needed.
Consequences
Positive
- Single kernel, single install path for all-reduce: replaces four removed modules (ring, mesh, tree, hierarchical).
- Topology-agnostic kernel: ring / torus / mesh selected via one integer param, no kernel duplication.
- Automatic via dist.all_reduce: no bench-level or user-level algorithm selection needed; config-driven end-to-end.
- Full SFR wiring: every cube on every SIP has inter-SIP links available, which supports future dynamic root-cube election.
Negative
- Not suitable for per-PE sharded tensors: TP-layer-style tensors that shard within one cube across 8 PEs are not addressable by this kernel. Such workloads would need a separate intra-cube all-reduce path (not yet implemented).
- configure_sfr_intercube_multisip always wires all pe0s: even if a given run only needs a subset (e.g. 1 SIP, ring only). Install cost is small but not zero.
Affected files
| File | Change |
|---|---|
| src/kernbench/ccl/algorithms/intercube_allreduce.py (new) | Kernel + _inter_sip_* helpers + TOPO_NAME_TO_KIND |
| src/kernbench/ccl/sfr_config.py (new) | configure_sfr_intercube_multisip |
| src/kernbench/ccl/topologies.py | Added torus_2d, mesh_2d_no_wrap |
| src/kernbench/ccl/install.py | Extended _OPPOSITE_DIR with global_* pairs |
| src/kernbench/runtime_api/distributed.py | AhbmCCLBackend uses configure_sfr_intercube_multisip + appends sip_rank/topo args |
| ccl.yaml | Single intercube_allreduce entry |
| topology.yaml | Added system.sips.topology |
| benches/ccl_allreduce.py | Row-wise cube-mesh tensor layout |
| tests/test_allreduce_multidevice.py (new) | Config-driven ring/torus/mesh |
| tests/test_distributed_intercube_allreduce.py (new) | Full dist.all_reduce path |
| tests/test_intercube_sfr_config.py (new) | SFR wiring verification |
| Removed | ring_allreduce.py, mesh_allreduce.py, tree_allreduce.py, hierarchical_allreduce.py, hello_send.py, testing.py and their tests |