Document the allreduce + GEMM evaluation harnesses and bring the affected
allreduce ADRs in line with the refactored code.
New (Accepted, EN + KO):
- ADR-0043 — allreduce evaluation harness (tests/sccl/): distributed-driven
correctness, latency/buffer-kind sweeps, sessionfinish plot aggregators,
topology + FSIM-comparison figures. Verified against the implementation.
- ADR-0044 — GEMM evaluation harness (scripts/gemm_sweep.py + tests/gemm/):
heavy-script data gen vs. fast test-rendered figures, slow regenerator,
the 3-figure set. Records two limitations as open questions: the
theoretical-model constants are inherited (not yet traced to ADR-0033/
0014), and the *_measured figure is a naming misnomer.
Updated (EN + KO):
- ADR-0024 — add D5: SIP grid w/h resolution (explicit sips.w/h, square
fallback, fail-loud), documenting the AhbmCCLBackend fix.
- ADR-0032 — D4/D5/Non-goals reconciled: rectangular SIP grids (e.g. 6 SIPs
as 3x2) are supported via explicit w/h; the square requirement now
applies only to the fallback. Affected-files repointed to tests/sccl/.
Verification: ADR-0023 and ADR-0042 confirmed still matching the code (no
change). verify_adr_lang_pairs.py passes (EN/KO Status blocks byte-equal).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rename the intercube all-reduce identity to lrab_hierarchical_allreduce
(module, config key, distributed test) so the name reflects both levels
it implements: LRAB intra-SIP (local reduce to center root + broadcast)
and the hierarchical inter-SIP topology exchange (ring/torus/mesh).
ADR-0032 slug kept as the stable decision id; pure rename, no logic change.
Also in this batch:
- ADR-0032 (EN+KO): document the shipped center-root bidirectional reduce
(doc was stale corner-root); annotate ccl.yaml root_cube as a placeholder.
- Rename allreduce + pe2pe latency plots to descriptive, title-matching
filenames and retitle the in-plot headings; drop overview/overview_log.
- Point the PPTX image refs at the new plot names.
Doc + derived-artifact + rename only; no simulation behavior changed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>