Files
kernbench2/docs/adr/ADR-0032-algo-intercube-allreduce.md
T
ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00

257 lines
9.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
## Status
Accepted (supersedes ADR-0029).
## Context
### Goal
Define a single all-reduce algorithm that exploits the topology hierarchy:
cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.
### Why replace ADR-0029 (hierarchical 3-level)
ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
where every PE in the system participates. In practice this adds the
intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast)
without matching the common workload pattern where the tensor is sharded
**per cube** (not per PE within a cube).
Moreover, the hierarchical design required:
- per-PE neighbor graph installation (`_build_pe_installs` multi-level)
- multi-level topology schema (`hierarchical_3level`)
- `all_pes` mapper + `multi_pe_sip_local` validator infrastructure
The intercube algorithm below removes all of that: **pe0-only same-lane
intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the
root cube, then broadcast back. Simpler kernel, simpler wiring, same
bandwidth characteristics for the common per-cube DP workload.
### Current state
- `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
- `src/kernbench/ccl/sfr_config.py``configure_sfr_intercube_multisip`
- `src/kernbench/runtime_api/distributed.py``AhbmCCLBackend` wires this
automatically at `init_process_group` time.
- Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`,
`hierarchical_allreduce` modules and their tests are **removed**.
---
## Decision
### D1. Algorithm structure — 5 phases
For each SIP (launched concurrently by `mp.spawn`):
```
Phase 1 — Row reduce W → E (cube mesh, pe0 only):
col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
holds the full SIP sum.
Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
selected by sip_topo_kind (from topology.yaml sips.topology).
Phase 4 — Col broadcast S → N on rightmost column.
Phase 5 — Row broadcast E → W across the cube mesh.
```
After all phases every cube's pe0 holds the global sum.
The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
across topologies; only phase 3 branches. Helper functions
`_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the
three exchange patterns.
### D2. Tensor layout (rank = SIP, per-worker)
Per ADR-0024 rank = SIP at the process-group level. Each worker allocates
its own cube-mesh-spanning tensor:
```python
dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
```
Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`.
### D3. SFR / IPCQ wiring — `configure_sfr_intercube_multisip`
Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor
tables for **every cube's pe0 across every SIP** — regardless of which
cube is the root or which SIP topology is selected. This lets the kernel
elect the root cube at runtime and supports topology switches without
re-wiring.
| Level | Direction labels | Scope |
|---|---|---|
| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per `sips.topology` |
Inter-SIP directions use the `global_*` prefix to keep the namespace
disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended
with `global_E ↔ global_W` and `global_N ↔ global_S` so the reverse-
direction resolver handles 2-SIP bidirectional rings correctly.
Internally the function calls `install_ipcq` with:
- `world_size = n_sips × n_cubes`
- `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
- A closure-captured `neighbors()` function that builds the map above.
This `world_size` is internal to IPCQ wiring and does not leak to the
process-group rank.
### D4. SIP topology — from `topology.yaml`
```yaml
system:
sips:
count: 2
topology: ring_1d # or torus_2d, mesh_2d_no_wrap
```
- `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
- `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on
`global_E/W` then col ring on `global_S/N`.
- `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce +
broadcast per dimension.
2D variants require `n_sips` to be a perfect square.
### D5. Process-group integration — `AhbmCCLBackend`
At `init_process_group` time the backend:
1. Loads `ccl.yaml` + `topology.yaml`.
2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
`system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`.
3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time
SFR wiring, mirrors NCCL communicator creation.
At each `dist.all_reduce(tensor)` call:
1. Resolves `kernel_fn` from `cfg["module"]`.
2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from
`kernel_args(world_size, n_elem)`.
3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
`sip_rank` is the current greenlet's bound rank.
4. Launches with `_defer_wait=True`; the main scheduler drains pending
handles after all workers submit (per ADR-0027 D0.4).
### D6. Config schema
`ccl.yaml`:
```yaml
defaults:
algorithm: intercube_allreduce
buffer_kind: tcm
...
algorithms:
intercube_allreduce:
module: kernbench.ccl.algorithms.intercube_allreduce
topology: none
buffer_kind: tcm
n_elem: 8
root_cube: 15
```
`topology.yaml`:
```yaml
system:
sips:
count: 2
topology: ring_1d
sip:
cube_mesh: { w: 4, h: 4 }
```
### D7. Algorithm module contract
Modules loaded via `cfg["module"]` must export:
| Name | Purpose |
|---|---|
| `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
| `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) |
| `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code |
| `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |
---
## Dependencies
- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
- **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
- **ADR-0025**: Address-based IPCQ direction matching; extended
`_OPPOSITE_DIR` with `global_*` pairs.
- **ADR-0027**: Worker-wait / collective-pending drain in main scheduler.
## Non-goals
- **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the
workload for this algorithm is per-cube DP.
- **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and
`mesh_2d_no_wrap` require `n_sips = k²`.
- **Pipelined chunks**: single-tile per cube, no pipelining yet.
- **Root cube runtime election**: the kernel currently uses
`root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
corner. SFR wiring covers all cubes, so runtime election is a pure kernel
change when needed.
---
## Consequences
### Positive
- **Single kernel, single install path** for all-reduce — replaces four
removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
- **Topology-agnostic kernel**: ring / torus / mesh selected via one
integer param, no kernel duplication.
- **Automatic via `dist.all_reduce`**: no bench-level or user-level
algorithm selection needed; config-driven end-to-end.
- **Full SFR wiring**: every cube on every SIP has inter-SIP links
available — supports future dynamic root-cube election.
### Negative
- **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
shard within one cube across 8 PEs are not addressable by this kernel.
Such workloads would need a separate intra-cube all-reduce path (not
yet implemented).
- **`configure_sfr_intercube_multisip` always wires all pe0s**: even if a
given run only needs a subset (e.g. 1 SIP, ring only). Install cost is
small but not zero.
---
## Affected files
| File | Change |
|---|---|
| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
| `ccl.yaml` | Single `intercube_allreduce` entry |
| `topology.yaml` | Added `system.sips.topology` |
| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
| Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |