kernbench2/docs/adr/ADR-0043-eval-allreduce-harness.md
mukesh fd56b6cacd adr: add ADR-0043/0044 (eval harnesses); reconcile ADR-0024/0032 for SIP w/h
Document the allreduce + GEMM evaluation harnesses and bring the affected
allreduce ADRs in line with the refactored code.

New (Accepted, EN + KO):
- ADR-0043 — allreduce evaluation harness (tests/sccl/): distributed-driven
  correctness, latency/buffer-kind sweeps, sessionfinish plot aggregators,
  topology + FSIM-comparison figures. Verified against the implementation.
- ADR-0044 — GEMM evaluation harness (scripts/gemm_sweep.py + tests/gemm/):
  heavy-script data gen vs. fast test-rendered figures, slow regenerator,
  the 3-figure set. Records two limitations as open questions: the
  theoretical-model constants are inherited (not yet traced to ADR-0033/
  0014), and the *_measured figure is a naming misnomer.

Updated (EN + KO):
- ADR-0024 — add D5: SIP grid w/h resolution (explicit sips.w/h, square
  fallback, fail-loud), documenting the AhbmCCLBackend fix.
- ADR-0032 — D4/D5/Non-goals reconciled: rectangular SIP grids (e.g. 6 SIPs
  as 3x2) are supported via explicit w/h; the square requirement now
  applies only to the fallback. Affected-files repointed to tests/sccl/.

Verification: ADR-0023 and ADR-0042 confirmed still matching the code (no
change). verify_adr_lang_pairs.py passes (EN/KO Status blocks byte-equal).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 10:26:25 -07:00

# ADR-0043: Allreduce Evaluation Harness — `tests/sccl/`
## Status
Accepted

Documents the `tests/sccl/` evaluation harness; verified against the
implementation (constants, file set, and sweep dimensions cross-checked).
## Context
ADR-0032 defines the intercube all-reduce *algorithm*; ADR-0023/0024/0027
define the IPCQ backend, the rank=SIP launcher, and `mp.spawn`. None of
them describe **how the allreduce is exercised and characterized** — the
correctness tests, the latency/buffer-kind sweeps, and the derived plots.
ADR-0013 (verification strategy) is the general policy; this ADR pins the
concrete allreduce harness so the "evaluation" half of the work is
documented, not just the implementation.
The harness lives under `tests/sccl/` (the package created when the
allreduce tests were consolidated). It supersedes the earlier flat
`tests/test_allreduce_multidevice.py` + `tests/test_distributed_*` layout.
## Decision
### D1. Drive evaluation through the public `torch.distributed` path
Correctness and the sweeps run the collective through the real DDP-shaped
path — `init_process_group(backend="ahbm") → mp.spawn → dist.all_reduce`
(ADR-0024/0027) — not the lower-level `ctx.launch`. A shared helper
`_run_distributed(tmp_path, monkeypatch, topo_path, corr_id, n_elem)` in
`tests/sccl/_allreduce_helpers.py` builds the engine, runs the workers, and
returns `(engine, n_cubes)`. `monkeypatch.chdir` points the backend's
`load_ccl_config()` (cwd lookup) at a per-case temp `ccl.yaml`.

A direct-launch reference (`run_allreduce`) is retained in the same helper
module — not for the distributed tests, but because the IPCQ buffer-kind /
root-center micro-tests under `tests/` import it.
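
The cwd-based lookup that `monkeypatch.chdir` relies on can be illustrated with a small stand-in. This is a sketch, not the backend's code: `load_ccl_config_sketch`, the flat `key: value` file format, and the `config_seen_by_case` helper are assumptions made for illustration. Only the idea (config resolved relative to the current working directory, one temp `ccl.yaml` per case) comes from the text above.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


def load_ccl_config_sketch() -> dict:
    """Hypothetical stand-in for a loader that reads ccl.yaml from the cwd."""
    cfg = {}
    for line in Path("ccl.yaml").read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        cfg[key.strip()] = value.strip()
    return cfg


@contextmanager
def chdir(path):
    """Temporarily switch the cwd, similar in effect to monkeypatch.chdir."""
    prev = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(prev)


def config_seen_by_case(buffer_kind: str) -> dict:
    """Write a per-case ccl.yaml into a temp dir and load it through the cwd."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "ccl.yaml").write_text(f"buffer_kind: {buffer_kind}\n")
        with chdir(tmp):
            return load_ccl_config_sketch()
```

Because each case gets its own directory, two cases that want different settings never see each other's file, and the loader itself needs no test-only parameter.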
### D2. One file per evaluation concern
| File | Concern | `torch.distributed`? |
|---|---|---|
| `test_allreduce_ring_torus_mesh.py` | correctness across ring_1d / torus_2d (2×3) / mesh_2d_no_wrap (2×3) | yes |
| `test_distributed_default_topology.py` | full path on `topology.yaml` as-is | yes |
| `test_plot_latency_sweep.py` | latency sweep rows (n_elem × topology) | yes |
| `test_plot_buffer_kind_sweep.py` | TCM/SRAM/HBM sweep rows | yes |
| `test_plot_topology_diagram.py` | topology.png (pure matplotlib) | no |
| `test_plot_comparison_fsim.py` | broken-axis model-vs-FSIM comparison | no |
| `test_intercube_root_center.py` | ADR-0032 center-root latency guard (direct path) | no |

`_allreduce_helpers.py` holds the shared plumbing (driver, config writers,
sweep/buffer-kind constants, plot aggregators, topology-diagram + FSIM
comparison emitters). It is not collected (no `test_` prefix).
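
pytest's default `python_files` patterns are `test_*.py` and `*_test.py`, which is why a module named `_allreduce_helpers.py` is skipped. The filename check can be approximated as below. This is a simplified sketch using `fnmatch`, not pytest's actual collection code, and it assumes the repository does not override `python_files` (the ADR does not show the pytest configuration).

```python
from fnmatch import fnmatch

# pytest's default python_files patterns.
DEFAULT_PATTERNS = ("test_*.py", "*_test.py")


def is_collected(filename: str) -> bool:
    """Return True if the filename matches a default collection pattern."""
    return any(fnmatch(filename, pat) for pat in DEFAULT_PATTERNS)
```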
### D3. Latency metric — critical-path `pe_exec_ns`
The reported latency per config is `crit_ns = max(pe_exec_ns)` over
`engine._results` — the slowest rank's PE execution time. This is the
number plotted on every latency chart and recorded in `summary.csv`.
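
As a minimal sketch of this reduction: the shape of `engine._results` is not specified in this ADR, so the code below assumes an iterable of per-rank mappings that each carry a `pe_exec_ns` entry. The function name and the sample values are hypothetical.

```python
def critical_path_ns(results) -> float:
    """Return the slowest rank's PE execution time.

    `results` is assumed to be an iterable of mappings with a
    `pe_exec_ns` entry; the real shape of engine._results may differ.
    """
    times = [r["pe_exec_ns"] for r in results]
    if not times:
        raise ValueError("no per-rank results to reduce")
    return max(times)


# Hypothetical per-rank records for a 3-rank run (values are made up).
example = [
    {"pe_exec_ns": 1200.0},
    {"pe_exec_ns": 1850.0},
    {"pe_exec_ns": 1430.0},
]
print(critical_path_ns(example))  # 1850.0
```

Taking the maximum rather than the mean matches the critical-path reading: the collective is only finished when the slowest rank is.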
### D4. Sweep dimensions
- **Latency sweep**: `n_elem ∈ {8, 32, 64, 128, 512, 1024, 2048, 4096,
8192, 16384, 32768, 49152}` (16 excluded — collides with `n_cubes`) ×
topology ∈ {ring_1d (6), torus_2d 2×3 (6), mesh_2d_no_wrap 2×3 (6)}.
- **Buffer-kind sweep**: `buffer_kind ∈ {tcm, sram, hbm}` × a smaller
`n_elem` grid, on torus_2d 6-SIP (3×2). buffer_kind is set in the temp
`ccl.yaml` (read by the backend at `init_process_group`, ADR-0023 D6).

The 2×3 / 3×2 grids exercise the explicit-`w/h` SIP resolution
(ADR-0024 D5).
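
The latency-sweep matrix described above can be enumerated as a plain cross product. The `n_elem` values and topology names are taken from D4; the constant names, the tuple layout, and the case-id format are assumptions for illustration and are not the names used in `_allreduce_helpers.py`.

```python
from itertools import product

# Values listed in D4; 16 is deliberately absent.
N_ELEM_SWEEP = (8, 32, 64, 128, 512, 1024, 2048, 4096,
                8192, 16384, 32768, 49152)

# (topology name, grid label or None); all three use 6 SIPs per D4.
TOPOLOGIES = (
    ("ring_1d", None),
    ("torus_2d", "2x3"),
    ("mesh_2d_no_wrap", "2x3"),
)


def latency_sweep_cases():
    """Yield one (case_id, topology, n_elem) tuple per sweep cell."""
    for (topo, grid), n_elem in product(TOPOLOGIES, N_ELEM_SWEEP):
        suffix = f"_{grid}" if grid else ""
        yield f"{topo}{suffix}-n{n_elem}", topo, n_elem


cases = list(latency_sweep_cases())
print(len(cases))  # 36
```

A list like this can be handed to `pytest.mark.parametrize`, with the first tuple element used as the test id, so each cell becomes an independent case.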
### D5. Derived plots via `pytest_sessionfinish` aggregators
Sweep tests are xdist-friendly: each parametrized case writes one JSON row
to a staging dir. The conftest `pytest_sessionfinish` hook (controller node
only) calls the aggregators in `_allreduce_helpers.py`:
- `_aggregate_sweep_plots()` → per-topology PNGs + `summary.csv`
- `aggregate_buffer_kind_plot()` → the TCM/SRAM/HBM comparison PNG + csv

The topology-diagram and FSIM-comparison figures are emitted directly by
their own `test_plot_*` tests (no row staging — they are pure functions of
`topology.yaml` and `summary.csv` respectively). All outputs land in
`docs/diagrams/allreduce_latency_plots/` and are **derived artifacts** per
CLAUDE.md (consistent-with-ADRs, no Phase-2 gate).
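
The stage-then-aggregate pattern can be sketched as follows. The function names, the row fields, and the numeric values are hypothetical; the real aggregators also render PNGs, which this sketch omits. What it shows is the property that makes the sweep safe under xdist: every case writes its own file, and a single later step merges them.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_row(staging: Path, case_id: str, row: dict) -> Path:
    """Per-case step: one JSON file per case, so workers never share a file."""
    staging.mkdir(parents=True, exist_ok=True)
    path = staging / f"{case_id}.json"
    path.write_text(json.dumps(row))
    return path


def aggregate_rows(staging: Path, out_csv: Path) -> int:
    """Controller-side step: merge staged rows into one CSV; return row count."""
    rows = [json.loads(p.read_text()) for p in sorted(staging.glob("*.json"))]
    if not rows:
        return 0
    fields = sorted({key for row in rows for key in row})
    with out_csv.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def demo() -> list:
    """Stage two made-up rows, aggregate them, and read the CSV back."""
    with tempfile.TemporaryDirectory() as tmp:
        staging = Path(tmp, "rows")
        write_row(staging, "ring_1d-n8",
                  {"topology": "ring_1d", "n_elem": 8, "crit_ns": 100.0})
        write_row(staging, "ring_1d-n32",
                  {"topology": "ring_1d", "n_elem": 32, "crit_ns": 140.0})
        out = Path(tmp, "summary.csv")
        aggregate_rows(staging, out)
        with out.open(newline="") as fh:
            return list(csv.DictReader(fh))
```

In a `pytest_sessionfinish` hook, the aggregation step would typically be skipped on xdist workers; a common idiom is to check that the session config has no `workerinput` attribute before aggregating. Whether this repository's conftest uses that exact check is not stated here.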
### D6. The FSIM comparison reference is a hardcoded constant
`emit_comparison_fsim_plot()` plots the model curves against one external
FSIM single-device reference value (`366 µs`). The value is held as a
literal in the code; there is no external data file. The "measured" series
comes from the simulator (`op_log` GEMM count, `composite_window_ns`); the
"theoretical" series is a hand-derived analytical model (the same one
ADR-0044 D5 flags as ADR-unverified).
## Consequences
### Positive
- The allreduce is evaluated through the same API a real DDP script uses,
so the harness doubles as an integration test of ADR-0024/0027.
- Figures regenerate on every `pytest` run from committed data; no manual
plot step.
- Rectangular-grid sweeps gave the regression coverage that surfaced the
ADR-0024 D5 `w/h` fix.
### Negative / limitations
- The full latency sweep runs as part of the default `pytest` invocation and
takes on the order of minutes; it is not marked `slow`. (Contrast ADR-0044,
where the GEMM sweep is `slow`.)
- `test_intercube_root_center.py` carries a latency *threshold* assertion
(ADR-0032 center-root guard) — the only absolute-latency assertion in the
suite; it is sensitive to latency-model changes (ADR-0033).
## Dependencies
- **ADR-0013**: verification strategy (general policy this specializes).
- **ADR-0023 / ADR-0024 / ADR-0027**: IPCQ backend, rank=SIP launcher,
`mp.spawn` — the path D1 drives.
- **ADR-0032**: the algorithm under evaluation; D4 grids exercise its
topology branches.
- **ADR-0044**: the sibling GEMM evaluation harness.
## Open questions
- Should the latency sweep be marked `slow` for parity with the GEMM sweep?
- Should the FSIM reference move from a hardcoded constant to a versioned
data file?