ADR-0032 + intra_* opposite directions in IPCQ install
Add intra_N/S/E/W to install.py _OPPOSITE_DIR table so the intra-cube PE-to-PE namespace is symmetrical with intercube N/S/E/W.
ADR-0032 documents the intercube allreduce algorithm (supersedes ADR-0029).
Refresh ADR-0024/0025/0029 cross-refs and update test_intercube_sfr_config.py to cover the new intra_* mappings.
Drop the obsolete test_ccl_round_robin_recv.py (replaced by intercube tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,256 @@
# ADR-0032: Intercube All-Reduce (pe0 cube-mesh reduce + multi-SIP exchange)

## Status

Accepted (supersedes ADR-0029).

## Context

### Goal

Define a single all-reduce algorithm that exploits the topology hierarchy:
cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.

### Why replace ADR-0029 (hierarchical 3-level)

ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
in which every PE in the system participates. In practice this adds the
complexity of an intra-cube PE-to-PE stage (bidirectional reduce + chain
broadcast) that the common workload does not need, because the tensor is
sharded **per cube** (not per PE within a cube).

Moreover, the hierarchical design required:
- per-PE neighbor graph installation (`_build_pe_installs` multi-level)
- multi-level topology schema (`hierarchical_3level`)
- `all_pes` mapper + `multi_pe_sip_local` validator infrastructure

The intercube algorithm below removes all of that: **pe0-only same-lane
intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the
root cube, then broadcast back. The result is a simpler kernel and simpler
wiring, with the same bandwidth characteristics for the common per-cube DP
workload.

### Current state

- `src/kernbench/ccl/algorithms/intercube_allreduce.py`: kernel
- `src/kernbench/ccl/sfr_config.py`: `configure_sfr_intercube_multisip`
- `src/kernbench/runtime_api/distributed.py`: `AhbmCCLBackend` wires this
  automatically at `init_process_group` time.
- Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`,
  `hierarchical_allreduce` modules and their tests are **removed**.

---

## Decision

### D1. Algorithm structure: 5 phases

For each SIP (launched concurrently by `mp.spawn`):

```
Phase 1 - Row reduce W → E (cube mesh, pe0 only):
  col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.

Phase 2 - Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
  row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
  holds the full SIP sum.

Phase 3 - Inter-SIP exchange on root cube (pe0 of root cube only):
  Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast,
  selected by sip_topo_kind (from topology.yaml sips.topology).

Phase 4 - Col broadcast S → N on rightmost column.

Phase 5 - Row broadcast E → W across the cube mesh.
```

After all phases every cube's pe0 holds the global sum.

The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
across topologies; only phase 3 branches. Helper functions
`_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the
three exchange patterns.
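
To make the data flow of the five phases concrete, here is a minimal host-side sketch that replays them on plain Python lists. It is not the kernel: `intercube_allreduce_sim` is a hypothetical name, each pe0 shard is reduced to a single number, cubes are assumed to be numbered row-major, and phase 3 is modelled as a plain sum over root cubes rather than a ring, torus, or mesh exchange.

```python
def intercube_allreduce_sim(values, mesh_w=4, mesh_h=4):
    """Replay the five phases on values[sip][cube] (one number per pe0).

    Illustrative model only; returns a new nested list of the same shape.
    """
    n_sips = len(values)
    state = [list(per_sip) for per_sip in values]  # do not mutate the input
    last_col = mesh_w - 1
    root = (mesh_h - 1) * mesh_w + last_col  # SE corner, 15 on a 4x4 mesh

    for s in range(n_sips):
        # Phase 1: row reduce W -> E, the last column ends up with each row sum.
        for row in range(mesh_h):
            for col in range(1, mesh_w):
                state[s][row * mesh_w + col] += state[s][row * mesh_w + col - 1]
        # Phase 2: column reduce N -> S on the rightmost column.
        for row in range(1, mesh_h):
            state[s][row * mesh_w + last_col] += state[s][(row - 1) * mesh_w + last_col]

    # Phase 3: inter-SIP exchange between root cubes (modelled as a plain sum).
    total = sum(state[s][root] for s in range(n_sips))
    for s in range(n_sips):
        state[s][root] = total

    for s in range(n_sips):
        # Phase 4: column broadcast S -> N on the rightmost column.
        for row in range(mesh_h - 2, -1, -1):
            state[s][row * mesh_w + last_col] = state[s][(row + 1) * mesh_w + last_col]
        # Phase 5: row broadcast E -> W across the mesh.
        for row in range(mesh_h):
            for col in range(mesh_w - 2, -1, -1):
                state[s][row * mesh_w + col] = state[s][row * mesh_w + col + 1]
    return state


if __name__ == "__main__":
    inputs = [[sip * 16 + cube for cube in range(16)] for sip in range(2)]
    result = intercube_allreduce_sim(inputs)
    print(result[0][0], result[1][15])
```

Only the rightmost column takes part in phases 2 and 4, which is why a single root cube per SIP is enough for the inter-SIP step.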

### D2. Tensor layout (rank = SIP, per-worker)

Per ADR-0024, rank = SIP at the process-group level. Each worker allocates
its own cube-mesh-spanning tensor:

```python
# Shard row-wise across the 16 cubes of one SIP; one PE (pe0) per cube.
dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
# One row of n_elem f16 values per cube.
tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
```

Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`, where the
factor 2 is the byte size of one f16 element.
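
As a quick check of the addressing formula, the sketch below computes shard base addresses for a few cubes. `shard_addr` is a hypothetical helper, the base pointer `0x1000` is made up, and the sketch assumes the factor 2 in the formula is the f16 element size in bytes.

```python
F16_BYTES = 2  # assumption: the "* 2" in the formula is the f16 element size


def shard_addr(t_ptr: int, cube_id: int, n_elem: int, elem_bytes: int = F16_BYTES) -> int:
    """Byte address of the shard owned by pe0 of `cube_id`."""
    return t_ptr + cube_id * n_elem * elem_bytes


if __name__ == "__main__":
    base = 0x1000  # made-up base pointer, for illustration only
    for cube in (0, 1, 15):
        print(cube, hex(shard_addr(base, cube, n_elem=8)))
```

With `n_elem = 8`, consecutive shards are 16 bytes apart, so the 16 shards of one SIP span 256 bytes.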

### D3. SFR / IPCQ wiring: `configure_sfr_intercube_multisip`

Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor
tables for **every cube's pe0 across every SIP**, regardless of which
cube is the root or which SIP topology is selected. This leaves room for
the kernel to elect the root cube at runtime (not yet implemented, see
Non-goals) and supports topology switches without re-wiring.

| Level | Direction labels | Scope |
|---|---|---|
| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on SIP A → pe0 of cube c on peer SIP per `sips.topology` |

Inter-SIP directions use the `global_*` prefix to keep the namespace
disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended
with `global_E ↔ global_W` and `global_N ↔ global_S` so the
reverse-direction resolver handles 2-SIP bidirectional rings correctly.

Internally the function calls `install_ipcq` with:
- `world_size = n_sips × n_cubes`
- `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
- A closure-captured `neighbors()` function that builds the map above.

This `world_size` is internal to IPCQ wiring and does not leak into the
process group, where rank = SIP still holds.
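
The intercube half of the wiring can be sketched in a few lines. This is an illustration under assumptions, not the code in `sfr_config.py` or `install.py`: `OPPOSITE_DIR` stands in for the real `_OPPOSITE_DIR` table, `intercube_neighbors` is a hypothetical helper, cubes are assumed to be numbered row-major, and N is assumed to be the smaller row index (consistent with the root cube sitting in the SE corner).

```python
# Hypothetical stand-in for the _OPPOSITE_DIR table; the real layout may differ.
OPPOSITE_DIR = {
    "N": "S", "S": "N", "E": "W", "W": "E",
    "global_N": "global_S", "global_S": "global_N",
    "global_E": "global_W", "global_W": "global_E",
}


def rank_to_pe(n_sips, n_cubes):
    """Internal IPCQ rank -> (sip, cube, pe); only pe0 of every cube is wired."""
    return [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]


def intercube_neighbors(cube, mesh_w, mesh_h):
    """N/S/E/W neighbors of `cube` on a row-major mesh without wrap-around."""
    row, col = divmod(cube, mesh_w)
    neighbors = {}
    if row > 0:
        neighbors["N"] = cube - mesh_w
    if row < mesh_h - 1:
        neighbors["S"] = cube + mesh_w
    if col < mesh_w - 1:
        neighbors["E"] = cube + 1
    if col > 0:
        neighbors["W"] = cube - 1
    return neighbors


if __name__ == "__main__":
    print(len(rank_to_pe(2, 16)))          # internal world_size for 2 SIPs
    print(intercube_neighbors(15, 4, 4))   # SE corner has only N and W
```

Every link in this map is mirrored: if cube `a` reaches cube `b` through direction `d`, then `b` reaches `a` through `OPPOSITE_DIR[d]`, which is the property a reverse-direction lookup relies on.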

### D4. SIP topology from `topology.yaml`

```yaml
system:
  sips:
    count: 2
    topology: ring_1d     # or torus_2d, mesh_2d_no_wrap
```

- `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
- `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on
  `global_E/W` then col ring on `global_S/N`.
- `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce +
  broadcast per dimension.

2D variants require `n_sips` to be a perfect square.
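
The three topologies differ only in which peer SIP each `global_*` label points at. The sketch below shows one way to compute those peers; `sip_peers` is a hypothetical helper, and row-major SIP numbering on the 2D layouts is an assumption not stated in this ADR.

```python
import math


def sip_peers(rank, n_sips, topology):
    """Peer SIP per global_* direction for one SIP (illustrative sketch)."""
    if topology == "ring_1d":
        return {
            "global_E": (rank + 1) % n_sips,
            "global_W": (rank - 1) % n_sips,
        }

    side = math.isqrt(n_sips)
    if side * side != n_sips:
        raise ValueError(f"{topology} needs a perfect-square n_sips, got {n_sips}")
    row, col = divmod(rank, side)  # assumption: row-major SIP numbering

    if topology == "torus_2d":
        return {
            "global_E": row * side + (col + 1) % side,
            "global_W": row * side + (col - 1) % side,
            "global_S": ((row + 1) % side) * side + col,
            "global_N": ((row - 1) % side) * side + col,
        }
    if topology == "mesh_2d_no_wrap":
        peers = {}
        if col < side - 1:
            peers["global_E"] = rank + 1
        if col > 0:
            peers["global_W"] = rank - 1
        if row < side - 1:
            peers["global_S"] = rank + side
        if row > 0:
            peers["global_N"] = rank - side
        return peers
    raise ValueError(f"unknown SIP topology: {topology}")


if __name__ == "__main__":
    print(sip_peers(0, 2, "ring_1d"))
    print(sip_peers(0, 4, "mesh_2d_no_wrap"))
```

In a 2-SIP ring both `global_E` and `global_W` resolve to the same peer, which is the case the `_OPPOSITE_DIR` extension in D3 is meant to disambiguate.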

### D5. Process-group integration: `AhbmCCLBackend`

At `init_process_group` time the backend:

1. Loads `ccl.yaml` + `topology.yaml`.
2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
   `system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`.
3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)`: one-time
   SFR wiring, mirroring NCCL communicator creation.

At each `dist.all_reduce(tensor)` call the backend:

1. Resolves `kernel_fn` from `cfg["module"]`.
2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from
   `kernel_args(world_size, n_elem)`.
3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
   `sip_rank` is the current greenlet's bound rank.
4. Launches with `_defer_wait=True`; the main scheduler drains pending
   handles after all workers submit (per ADR-0024 D7 / ADR-0027 D0.4).
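
The argument assembly in steps 2 and 3 of the per-call list can be sketched as follows. Everything here is a stand-in: `kernel_args` mimics the module contract with a fixed 4×4 cube mesh, `derive_sip_topo` and `build_launch_args` are hypothetical names, `world_size` is taken to equal the number of SIPs (rank = SIP), and the choice of `(n_sips, 1)` as the ring's width and height is an assumption this ADR does not spell out.

```python
import math

# Mapping as listed in D1: ring_1d, torus_2d, mesh_2d_no_wrap -> 0, 1, 2.
TOPO_NAME_TO_KIND = {"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2}


def kernel_args(world_size, n_elem, cube_w=4, cube_h=4):
    """Stand-in for the module's kernel_args; world_size is the SIP count."""
    return (n_elem, cube_w, cube_h, world_size)


def derive_sip_topo(topology, n_sips):
    """Return (sip_topo_kind, sip_topo_w, sip_topo_h) for a topology name."""
    kind = TOPO_NAME_TO_KIND[topology]
    if kind == TOPO_NAME_TO_KIND["ring_1d"]:
        return kind, n_sips, 1  # assumption: a ring is treated as n_sips x 1
    side = math.isqrt(n_sips)
    if side * side != n_sips:
        raise ValueError("2D SIP topologies need a perfect-square n_sips")
    return kind, side, side


def build_launch_args(world_size, n_elem, sip_rank, topology):
    """Per-tensor scalars first, then the per-rank and topology scalars."""
    per_tensor = kernel_args(world_size, n_elem)
    return per_tensor + (sip_rank,) + derive_sip_topo(topology, world_size)


if __name__ == "__main__":
    print(build_launch_args(world_size=2, n_elem=8, sip_rank=1, topology="ring_1d"))
```

The resulting tuple lines up with the kernel signature in D7 once `t_ptr` is prepended and `tl` is appended by the launcher.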

### D6. Config schema

`ccl.yaml`:

```yaml
defaults:
  algorithm: intercube_allreduce
  buffer_kind: tcm
  # ...

algorithms:
  intercube_allreduce:
    module: kernbench.ccl.algorithms.intercube_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
    root_cube: 15
```

`topology.yaml`:

```yaml
system:
  sips:
    count: 2
    topology: ring_1d
  sip:
    cube_mesh: { w: 4, h: 4 }
```

### D7. Algorithm module contract

Modules loaded via `cfg["module"]` must export:

| Name | Purpose |
|---|---|
| `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
| `kernel_args(world_size, n_elem) -> tuple` | returns the 4 per-tensor scalar args `(n_elem, cube_w, cube_h, n_sips)` that follow `t_ptr` |
| `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code |
| `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |
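
A contract like this is easy to check before launch. The sketch below is one possible check, not part of the backend: `REQUIRED_EXPORTS` and `missing_exports` are hypothetical names, and a `SimpleNamespace` stands in for a module loaded from `cfg["module"]`.

```python
from types import SimpleNamespace

# Names taken from the contract table; the checker itself is illustrative.
REQUIRED_EXPORTS = (
    "kernel",
    "kernel_args",
    "TOPO_NAME_TO_KIND",
    "SIP_TOPO_RING",
    "SIP_TOPO_TORUS",
    "SIP_TOPO_MESH",
)


def missing_exports(module):
    """Return the contract names that `module` lacks or exports incorrectly."""
    problems = [name for name in REQUIRED_EXPORTS if not hasattr(module, name)]
    for name in ("kernel", "kernel_args"):
        if hasattr(module, name) and not callable(getattr(module, name)):
            problems.append(f"{name} (not callable)")
    return problems


if __name__ == "__main__":
    # A fake module object standing in for an imported algorithm module.
    fake = SimpleNamespace(
        kernel=lambda *args: None,
        kernel_args=lambda world_size, n_elem: (n_elem, 4, 4, world_size),
        TOPO_NAME_TO_KIND={"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2},
        SIP_TOPO_RING=0,
        SIP_TOPO_TORUS=1,
        SIP_TOPO_MESH=2,
    )
    print(missing_exports(fake))
```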

---

## Dependencies

- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
- **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
- **ADR-0025**: Address-based IPCQ direction matching; extended
  `_OPPOSITE_DIR` with `global_*` pairs.
- **ADR-0027**: Worker-wait / collective-pending drain in main scheduler.

## Non-goals

- **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope: the
  workload for this algorithm is per-cube DP.
- **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and
  `mesh_2d_no_wrap` require `n_sips = k²`.
- **Pipelined chunks**: single-tile per cube, no pipelining yet.
- **Root cube runtime election**: the kernel currently uses
  `root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
  corner. SFR wiring covers all cubes, so runtime election is a pure kernel
  change when needed.

---

## Consequences

### Positive

- **Single kernel, single install path** for all-reduce, replacing four
  removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
- **Topology-agnostic kernel**: ring / torus / mesh selected via one
  integer param, no kernel duplication.
- **Automatic via `dist.all_reduce`**: no bench-level or user-level
  algorithm selection needed; config-driven end-to-end.
- **Full SFR wiring**: every cube on every SIP has inter-SIP links
  available, which supports future dynamic root-cube election.

### Negative

- **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
  shard within one cube across 8 PEs are not addressable by this kernel.
  Such workloads would need a separate intra-cube all-reduce path (not
  yet implemented).
- **`configure_sfr_intercube_multisip` always wires all pe0s**, even if a
  given run only needs a subset (e.g. 1 SIP, ring only). Install cost is
  small but not zero.

---

## Affected files

| File | Change |
|---|---|
| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
| `ccl.yaml` | Single `intercube_allreduce` entry |
| `topology.yaml` | Added `system.sips.topology` |
| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
| `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests | Removed |