mukesh cc1bbd0ab7 eval: fold GEMM/allreduce harnesses into self-contained milestone benches

Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/
into two self-contained eval benches so a user can regenerate every
result + figure with one command:

  kernbench run --bench milestone-1h-gemm   (MILESTONE_FAST=1 reuses JSON)
  kernbench run --bench milestone-1h-ccl

- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
  run(torch) entry drives the sweeps and writes figures into
  benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
  sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
  re-export/wrapper shims over the benches (single source preserved); the
  pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
  per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).

ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated. Verified: milestone benches run
clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-22 15:19:52 -07:00


ADR-0043: Allreduce evaluation harness: `tests/sccl/`

Status

Accepted

Documents the tests/sccl/ evaluation harness; verified against the implementation (constants, file set, and sweep dimensions were cross-checked).

Amended by ADR-0054: the driver core, sweep, and renderer moved to the milestone-1h-ccl bench (single home); tests/sccl/_allreduce_helpers.py now re-exports from there (the pytest-only param builders and the _run_distributed wrapper stay local). The figure tests are unchanged.

Context

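ADR-0032 defines the intercube all-reduce algorithm, and ADR-0023/0024/0027 define the IPCQ backend, the rank=SIP launcher, and mp.spawn. None of them, however, describes how allreduce is driven and characterized: the correctness tests, the latency and buffer-kind sweeps, and the derived plots. Where ADR-0013 (verification strategy) is the general policy, this ADR pins down the concrete allreduce harness so that the "evaluation" half of the work is documented alongside the implementation.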

The harness lives in tests/sccl/ (the package created when the allreduce tests were consolidated). It replaces the earlier flat tests/test_allreduce_multidevice.py + tests/test_distributed_* layout.

Decision

D1. Drive the evaluation through the public `torch.distributed` path

The correctness tests and sweeps run the collective through the real DDP-shaped path, init_process_group(backend="ahbm") → mp.spawn → dist.all_reduce (ADR-0024/0027), and do not use the lower-level ctx.launch. The shared helper _run_distributed(tmp_path, monkeypatch, topo_path, corr_id, n_elem) in tests/sccl/_allreduce_helpers.py builds the engine, runs the workers, and returns (engine, n_cubes). monkeypatch.chdir points the backend's load_ccl_config() (a cwd lookup) at the per-case temporary ccl.yaml.

The direct-launch reference (run_allreduce) stays in the same helper module, not for the distributed tests but because the IPCQ buffer-kind / root-center micro-tests under tests/ import it.
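The cwd redirection described in D1 can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration: `load_ccl_config_standin` is a hypothetical stand-in for the backend's `load_ccl_config()`, the `chdir` context manager mimics what `monkeypatch.chdir` does for the duration of a test, and the one-line `key: value` file is not the real `ccl.yaml` schema.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


def load_ccl_config_standin() -> dict:
    """Hypothetical stand-in for a cwd lookup: read ./ccl.yaml as flat key: value pairs."""
    config = {}
    for line in Path("ccl.yaml").read_text().splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            config[key.strip()] = value.strip()
    return config


@contextmanager
def chdir(path):
    """Temporarily switch the working directory, restoring it on exit."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


def run_case(buffer_kind: str) -> dict:
    """Write a per-case ccl.yaml into a temp dir and load it through the cwd lookup."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "ccl.yaml").write_text(f"buffer_kind: {buffer_kind}\n")
        with chdir(tmp):
            return load_ccl_config_standin()


print(run_case("sram"))
```

Because the lookup is relative to the working directory, each case sees only its own config file and no global state has to be patched.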

D2. One file per evaluation concern

| File | Concern | `torch.distributed`? |
| --- | --- | --- |
| `test_allreduce_ring_torus_mesh.py` | ring_1d / torus_2d (2×3) / mesh_2d_no_wrap (2×3) correctness | yes |
| `test_distributed_default_topology.py` | Full path with `topology.yaml` as-is | yes |
| `test_plot_latency_sweep.py` | Latency sweep rows (n_elem × topology) | yes |
| `test_plot_buffer_kind_sweep.py` | TCM/SRAM/HBM sweep rows | yes |
| `test_plot_topology_diagram.py` | topology.png (pure matplotlib) | no |
| `test_plot_comparison_fsim.py` | Broken-axis model vs FSIM comparison | no |
| `test_intercube_root_center.py` | ADR-0032 center-root latency guard (direct path) | no |

_allreduce_helpers.py holds the shared plumbing (driver, config writer, sweep / buffer-kind constants, plot aggregators, and the topology-diagram + FSIM-comparison emitters). Pytest does not collect it (no test_ prefix).

D3. Latency metric: critical-path `pe_exec_ns`

The latency reported per config is crit_ns = max(pe_exec_ns) over engine._results, that is, the PE execution time of the slowest rank. This is the value drawn on every latency chart and recorded in summary.csv.
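The metric in D3 is a plain maximum over per-rank results. The sketch below assumes, purely for illustration, that each entry of `engine._results` can be viewed as a mapping with a `pe_exec_ns` field; the actual record shape is not specified here, and the numbers are made up.

```python
def critical_path_ns(results):
    """Return the critical-path latency: the largest pe_exec_ns across all ranks."""
    return max(record["pe_exec_ns"] for record in results)


# Made-up per-rank records standing in for engine._results (shape is an assumption).
example_results = [
    {"rank": 0, "pe_exec_ns": 1200},
    {"rank": 1, "pe_exec_ns": 1850},
    {"rank": 2, "pe_exec_ns": 1430},
]

crit_ns = critical_path_ns(example_results)
print(crit_ns)  # 1850: rank 1 is the slowest, so it defines the reported latency
```

Taking the maximum rather than the mean reflects that a collective completes only when its slowest participant does.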

D4. Sweep dimensions

Latency sweep: n_elem ∈ {8, 32, 64, 128, 512, 1024, 2048, 4096, 8192, 16384, 32768, 49152} (16 is excluded because it collides with n_cubes) × topology ∈ {ring_1d (6), torus_2d 2×3 (6), mesh_2d_no_wrap 2×3 (6)}.
Buffer-kind sweep: buffer_kind ∈ {tcm, sram, hbm} × a smaller n_elem grid, on the 6-SIP torus_2d (3×2). buffer_kind is set in the temporary ccl.yaml (the backend reads it at init_process_group time, ADR-0023 D6) and takes effect from there.

The 2×3 / 3×2 grids exercise the explicit w/h SIP resolution (ADR-0024 D5).
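The latency sweep in D4 is the Cartesian product of the two dimensions, which can be sketched as follows. The constant and function names are hypothetical (the real ones live in the helper module and are not shown here); the values are taken from the grid above.

```python
from itertools import product

# n_elem grid from D4; 16 is deliberately absent.
N_ELEM_GRID = [8, 32, 64, 128, 512, 1024, 2048, 4096, 8192, 16384, 32768, 49152]
# The three 6-SIP topologies from D4.
TOPOLOGIES = ["ring_1d", "torus_2d", "mesh_2d_no_wrap"]


def latency_sweep_cases():
    """Enumerate every (topology, n_elem) pair in the latency sweep."""
    return list(product(TOPOLOGIES, N_ELEM_GRID))


cases = latency_sweep_cases()
print(len(cases))  # 12 sizes x 3 topologies = 36 cases
```

A list like this is the kind of input a `pytest.mark.parametrize` decorator would consume, one test case per pair.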

D5. Derived plots via the `pytest_sessionfinish` aggregator

The sweep tests are xdist-friendly: each parametrized case writes one JSON row to a staging directory. The conftest pytest_sessionfinish hook (controller node only) calls the aggregators in _allreduce_helpers.py:

_aggregate_sweep_plots() → per-topology PNGs + summary.csv
aggregate_buffer_kind_plot() → TCM/SRAM/HBM comparison PNG + csv

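The topology-diagram and FSIM-comparison figures are emitted directly by their own test_plot_* tests (no row staging: they are pure functions of topology.yaml and summary.csv, respectively). All outputs land in docs/diagrams/allreduce_latency_plots/ and, per CLAUDE.md, are derived artifacts (kept consistent with the ADR; no Phase-2 gate).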
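The stage-then-aggregate flow in D5 can be sketched without pytest. This is an illustration under stated assumptions, not the harness code: `stage_row` and `aggregate_rows` are hypothetical names, the row fields and file naming are invented, and the latency values are made up. In the harness the aggregation step is triggered from the `pytest_sessionfinish` hook on the controller node.

```python
import csv
import json
import tempfile
from pathlib import Path


def stage_row(staging_dir: Path, topology: str, n_elem: int, crit_ns: int) -> Path:
    """Write one JSON row per case; one file per case means workers never share a file."""
    path = staging_dir / f"{topology}_{n_elem}.json"
    path.write_text(json.dumps({"topology": topology, "n_elem": n_elem, "crit_ns": crit_ns}))
    return path


def aggregate_rows(staging_dir: Path, out_csv: Path) -> list:
    """Collect every staged row, sort deterministically, and write a summary CSV."""
    rows = [json.loads(p.read_text()) for p in staging_dir.glob("*.json")]
    rows.sort(key=lambda r: (r["topology"], r["n_elem"]))
    with out_csv.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["topology", "n_elem", "crit_ns"])
        writer.writeheader()
        writer.writerows(rows)
    return rows


with tempfile.TemporaryDirectory() as tmp:
    staging = Path(tmp, "staging")
    staging.mkdir()
    # Made-up rows, staged in arbitrary order as parallel workers would.
    stage_row(staging, "torus_2d", 64, 2100)
    stage_row(staging, "ring_1d", 64, 2500)
    stage_row(staging, "ring_1d", 8, 900)
    rows = aggregate_rows(staging, Path(tmp, "summary.csv"))
    csv_lines = Path(tmp, "summary.csv").read_text().splitlines()

print(csv_lines[0])
```

Sorting inside the aggregator keeps the output independent of the order in which workers finish.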

D6. The FSIM comparison reference is a hardcoded constant

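emit_comparison_fsim_plot() overlays the model curves with a single external FSIM single-device reference (366 µs), which is held as a literal; there is no external data file. The "measured" series comes from the simulator (op_log GEMM count, composite_window_ns), and the "theoretical" series comes from a hand-derived analytical model (the same model that ADR-0044 D5 marks as not ADR-verified).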

Consequences

Positive

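Allreduce is evaluated through the same API as a real DDP script, so the harness doubles as an integration test for ADR-0024/0027.
Figures are regenerated from committed data on every pytest run; there is no manual plotting step.
The rectangular-grid sweeps provided the regression coverage that surfaced the ADR-0024 D5 w/h fix.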

Negative / limitations

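The full latency sweep runs under default pytest (on the order of minutes); it is not marked slow. (Contrast with ADR-0044, which marks the GEMM sweep as slow.)
test_intercube_root_center.py holds a latency-threshold assertion (the ADR-0032 center-root guard). It is the only absolute-latency assertion in the suite and is sensitive to latency-model changes (ADR-0033).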

Dependencies

ADR-0013: verification strategy (the general policy that this ADR specializes).
ADR-0023 / ADR-0024 / ADR-0027: IPCQ backend, rank=SIP launcher, mp.spawn; the path that D1 drives.
ADR-0032: the algorithm under evaluation; the D4 grid exercises its topology branches.
ADR-0044: the sibling GEMM evaluation harness.

Open questions

Should the latency sweep be marked slow for consistency with the GEMM sweep?
Should the FSIM reference move from a hardcoded constant to a version-controlled data file?
