a796c1d2f7
Establish English as the canonical ADR language with Korean translations held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror). Promotion from adr-proposed/ to adr/ now writes English to adr/ and the Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md. - Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English, 2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix dropped). ADR-0023 EN regenerated against KO source which had newer HW Realization Notes (D16-D23) section. - docs/adr-history/ left frozen by design (transitional state). - CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline section covering bidirectional sync, conflict resolution (EN wins), and proposed-language freedom. - tools/verify_adr_lang_pairs.py: new verification tool checking pair completeness, filename mirroring, ADR-ID match, Status byte-equality. Pre-commit hook intentionally not added; run on demand or in CI. - tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF normalization, em-dash title separator, underscore-slug edge case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
257 lines
9.5 KiB
Markdown
257 lines
9.5 KiB
Markdown
# ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
|
||
|
||
## Status
|
||
|
||
Accepted (supersedes ADR-0029).
|
||
|
||
## Context
|
||
|
||
### Goal
|
||
|
||
Define a single all-reduce algorithm that exploits the topology hierarchy:
|
||
cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
|
||
one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.
|
||
|
||
### Why replace ADR-0029 (hierarchical 3-level)
|
||
|
||
ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
|
||
where every PE in the system participates. In practice this adds the
|
||
intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast)
|
||
without matching the common workload pattern where the tensor is sharded
|
||
**per cube** (not per PE within a cube).
|
||
|
||
Moreover, the hierarchical design required:
|
||
- per-PE neighbor graph installation (`_build_pe_installs` multi-level)
|
||
- multi-level topology schema (`hierarchical_3level`)
|
||
- `all_pes` mapper + `multi_pe_sip_local` validator infrastructure
|
||
|
||
The intercube algorithm below removes all of that: **pe0-only same-lane
|
||
intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the
|
||
root cube, then broadcast back. Simpler kernel, simpler wiring, same
|
||
bandwidth characteristics for the common per-cube DP workload.
|
||
|
||
### Current state
|
||
|
||
- `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
|
||
- `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip`
|
||
- `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this
|
||
automatically at `init_process_group` time.
|
||
- Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`,
|
||
`hierarchical_allreduce` modules and their tests are **removed**.
|
||
|
||
---
|
||
|
||
## Decision
|
||
|
||
### D1. Algorithm structure — 5 phases
|
||
|
||
For each SIP (launched concurrently by `mp.spawn`):
|
||
|
||
```
|
||
Phase 1 — Row reduce W → E (cube mesh, pe0 only):
|
||
col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
|
||
|
||
Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
|
||
row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
|
||
holds the full SIP sum.
|
||
|
||
Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
|
||
Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
|
||
selected by sip_topo_kind (from topology.yaml sips.topology).
|
||
|
||
Phase 4 — Col broadcast S → N on rightmost column.
|
||
|
||
Phase 5 — Row broadcast E → W across the cube mesh.
|
||
```
|
||
|
||
After all phases every cube's pe0 holds the global sum.
|
||
|
||
The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
|
||
(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
|
||
across topologies; only phase 3 branches. Helper functions
|
||
`_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the
|
||
three exchange patterns.
|
||
|
||
### D2. Tensor layout (rank = SIP, per-worker)
|
||
|
||
Per ADR-0024 rank = SIP at the process-group level. Each worker allocates
|
||
its own cube-mesh-spanning tensor:
|
||
|
||
```python
|
||
dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
|
||
tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
|
||
```
|
||
|
||
Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
|
||
each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`.
|
||
|
||
### D3. SFR / IPCQ wiring — `configure_sfr_intercube_multisip`
|
||
|
||
Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor
|
||
tables for **every cube's pe0 across every SIP** — regardless of which
|
||
cube is the root or which SIP topology is selected. This lets the kernel
|
||
elect the root cube at runtime and supports topology switches without
|
||
re-wiring.
|
||
|
||
| Level | Direction labels | Scope |
|
||
|---|---|---|
|
||
| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
|
||
| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per `sips.topology` |
|
||
|
||
Inter-SIP directions use the `global_*` prefix to keep the namespace
|
||
disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended
|
||
with `global_E ↔ global_W` and `global_N ↔ global_S` so the reverse-
|
||
direction resolver handles 2-SIP bidirectional rings correctly.
|
||
|
||
Internally the function calls `install_ipcq` with:
|
||
- `world_size = n_sips × n_cubes`
|
||
- `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
|
||
- A closure-captured `neighbors()` function that builds the map above.
|
||
|
||
This `world_size` is internal to IPCQ wiring and does not leak to the
|
||
process-group rank.
|
||
|
||
### D4. SIP topology — from `topology.yaml`
|
||
|
||
```yaml
|
||
system:
|
||
sips:
|
||
count: 2
|
||
topology: ring_1d # or torus_2d, mesh_2d_no_wrap
|
||
```
|
||
|
||
- `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
|
||
- `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on
|
||
`global_E/W` then col ring on `global_S/N`.
|
||
- `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce +
|
||
broadcast per dimension.
|
||
|
||
2D variants require `n_sips` to be a perfect square.
|
||
|
||
### D5. Process-group integration — `AhbmCCLBackend`
|
||
|
||
At `init_process_group` time the backend:
|
||
|
||
1. Loads `ccl.yaml` + `topology.yaml`.
|
||
2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
|
||
`system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`.
|
||
3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time
|
||
SFR wiring, mirrors NCCL communicator creation.
|
||
|
||
At each `dist.all_reduce(tensor)` call:
|
||
|
||
1. Resolves `kernel_fn` from `cfg["module"]`.
|
||
2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from
|
||
`kernel_args(world_size, n_elem)`.
|
||
3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
|
||
`sip_rank` is the current greenlet's bound rank.
|
||
4. Launches with `_defer_wait=True`; the main scheduler drains pending
|
||
handles after all workers submit (per ADR-0027 D0.4).
|
||
|
||
### D6. Config schema
|
||
|
||
`ccl.yaml`:
|
||
|
||
```yaml
|
||
defaults:
|
||
algorithm: intercube_allreduce
|
||
buffer_kind: tcm
|
||
...
|
||
|
||
algorithms:
|
||
intercube_allreduce:
|
||
module: kernbench.ccl.algorithms.intercube_allreduce
|
||
topology: none
|
||
buffer_kind: tcm
|
||
n_elem: 8
|
||
root_cube: 15
|
||
```
|
||
|
||
`topology.yaml`:
|
||
|
||
```yaml
|
||
system:
|
||
sips:
|
||
count: 2
|
||
topology: ring_1d
|
||
sip:
|
||
cube_mesh: { w: 4, h: 4 }
|
||
```
|
||
|
||
### D7. Algorithm module contract
|
||
|
||
Modules loaded via `cfg["module"]` must export:
|
||
|
||
| Name | Purpose |
|
||
|---|---|
|
||
| `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
|
||
| `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) |
|
||
| `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code |
|
||
| `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |
|
||
|
||
---
|
||
|
||
## Dependencies
|
||
|
||
- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
|
||
- **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
|
||
- **ADR-0025**: Address-based IPCQ direction matching; extended
|
||
`_OPPOSITE_DIR` with `global_*` pairs.
|
||
- **ADR-0027**: Worker-wait / collective-pending drain in main scheduler.
|
||
|
||
## Non-goals
|
||
|
||
- **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the
|
||
workload for this algorithm is per-cube DP.
|
||
- **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and
|
||
`mesh_2d_no_wrap` require `n_sips = k²`.
|
||
- **Pipelined chunks**: single-tile per cube, no pipelining yet.
|
||
- **Root cube runtime election**: the kernel currently uses
|
||
`root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
|
||
corner. SFR wiring covers all cubes, so runtime election is a pure kernel
|
||
change when needed.
|
||
|
||
---
|
||
|
||
## Consequences
|
||
|
||
### Positive
|
||
|
||
- **Single kernel, single install path** for all-reduce — replaces four
|
||
removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
|
||
- **Topology-agnostic kernel**: ring / torus / mesh selected via one
|
||
integer param, no kernel duplication.
|
||
- **Automatic via `dist.all_reduce`**: no bench-level or user-level
|
||
algorithm selection needed; config-driven end-to-end.
|
||
- **Full SFR wiring**: every cube on every SIP has inter-SIP links
|
||
available — supports future dynamic root-cube election.
|
||
|
||
### Negative
|
||
|
||
- **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
|
||
shard within one cube across 8 PEs are not addressable by this kernel.
|
||
Such workloads would need a separate intra-cube all-reduce path (not
|
||
yet implemented).
|
||
- **`configure_sfr_intercube_multisip` always wires all pe0s**: even if a
|
||
given run only needs a subset (e.g. 1 SIP, ring only). Install cost is
|
||
small but not zero.
|
||
|
||
---
|
||
|
||
## Affected files
|
||
|
||
| File | Change |
|
||
|---|---|
|
||
| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
|
||
| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
|
||
| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
|
||
| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
|
||
| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
|
||
| `ccl.yaml` | Single `intercube_allreduce` entry |
|
||
| `topology.yaml` | Added `system.sips.topology` |
|
||
| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
|
||
| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
|
||
| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
|
||
| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
|
||
| Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
|