adr: add ADR-0043/0044 (eval harnesses); reconcile ADR-0024/0032 for SIP w/h

Document the allreduce + GEMM evaluation harnesses and bring the affected
allreduce ADRs in line with the refactored code.

New (Accepted, EN + KO):
- ADR-0043: allreduce evaluation harness (tests/sccl/): distributed-driven
  correctness, latency/buffer-kind sweeps, sessionfinish plot aggregators,
  topology + FSIM-comparison figures. Verified against the implementation.
- ADR-0044: GEMM evaluation harness (scripts/gemm_sweep.py + tests/gemm/):
  heavy-script data gen vs. fast test-rendered figures, slow regenerator,
  the 3-figure set. Records two limitations as open questions: the
  theoretical-model constants are inherited (not yet traced to
  ADR-0033/0014), and the *_measured figure is a naming misnomer.

Updated (EN + KO):
- ADR-0024: add D5, SIP grid w/h resolution (explicit sips.w/h, square
  fallback, fail-loud), documenting the AhbmCCLBackend fix.
- ADR-0032: D4/D5/Non-goals reconciled: rectangular SIP grids (e.g. 6 SIPs
  as 3x2) are supported via explicit w/h; the square requirement now
  applies only to the fallback. Affected-files repointed to tests/sccl/.

Verification: ADR-0023 and ADR-0042 confirmed still matching the code (no
change). verify_adr_lang_pairs.py passes (EN/KO Status blocks byte-equal).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -173,6 +173,37 @@ placement = resolve_dp_policy(
No post-hoc `pe_index` shifting: ShardSpec carries the `(sip, cube, pe)`
structural coordinates directly. ShardSpec details in ADR-0026.

### D5. SIP grid dimensions — explicit `sips.w/h` resolution

For 2D inter-SIP topologies (`torus_2d`, `mesh_2d_no_wrap`) the SIP grid
shape (width × height) is resolved from `system.sips.w` / `system.sips.h`,
mirroring how D1 resolves `world_size` from `sips.count`. Precedence:
explicit `w/h` (validated `w*h == count`) > square fallback
(`round(sqrt(count))²`, used only when no `w/h` is given) > error.

```python
sips = spec.get("system", {}).get("sips", {})
if sip_topo == "ring_1d":
    w, h = 0, 0   # 1D sentinel (no grid)
elif sips.get("w") is not None and sips.get("h") is not None:
    w, h = int(sips["w"]), int(sips["h"])
    if w * h != n_sips:
        raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
else:
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError("non-square sips.count requires explicit sips.w/h")
    w, h = side, side
```

This lifts the earlier assumption that 2D SIP grids must be perfect
squares: a 6-SIP `torus_2d` / `mesh_2d_no_wrap` is now expressible as
`w: 3, h: 2` (or `w: 2, h: 3`). The derived `(w, h)` pair feeds the
algorithm's inter-SIP exchange (consumed in ADR-0032 D5). The prior code
path silently used a square of side `round(sqrt(count))` for any non-ring
topology, which produced a wrong grid for non-square counts (e.g. 2×2 for
6 SIPs); the explicit-`w/h` path with a fail-loud fallback replaces that.
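
The precedence described in D5 can be exercised in isolation with a small sketch. The function name `resolve_sip_grid` and the choice to read `n_sips` from `system.sips.count` are assumptions made for illustration; they are not taken from `AhbmCCLBackend`, and the real code path may differ.

```python
import math


def resolve_sip_grid(spec, sip_topo):
    """Hypothetical helper: return (w, h) following the D5 precedence.

    Assumes n_sips comes from system.sips.count, as D1 does for world_size.
    """
    sips = spec.get("system", {}).get("sips", {})
    n_sips = int(sips["count"])

    if sip_topo == "ring_1d":
        return 0, 0  # 1D sentinel (no grid)

    if sips.get("w") is not None and sips.get("h") is not None:
        # Explicit w/h wins, but must agree with sips.count.
        w, h = int(sips["w"]), int(sips["h"])
        if w * h != n_sips:
            raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
        return w, h

    # Square fallback: only valid when count is a perfect square.
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError("non-square sips.count requires explicit sips.w/h")
    return side, side


rect = {"system": {"sips": {"count": 6, "w": 3, "h": 2}}}
square = {"system": {"sips": {"count": 4}}}
bare_six = {"system": {"sips": {"count": 6}}}

print(resolve_sip_grid(rect, "torus_2d"))           # explicit 3x2 grid
print(resolve_sip_grid(square, "mesh_2d_no_wrap"))  # square fallback
try:
    resolve_sip_grid(bare_six, "torus_2d")
except ValueError as exc:
    print("rejected:", exc)                         # fail-loud path
```

With `count: 6` and no `w/h`, `round(sqrt(6))` is 2, so the old behaviour would have built a 2×2 grid covering only 4 of the 6 SIPs; the sketch raises instead.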

---

## Dependencies
