CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots

Rename the intercube all-reduce identity to lrab_hierarchical_allreduce (module, config key, distributed test) so the name reflects both levels it implements: LRAB intra-SIP (local reduce to center root + broadcast) and the hierarchical inter-SIP topology exchange (ring/torus/mesh). ADR-0032 slug kept as the stable decision id; pure rename, no logic change.

Also in this batch:
- ADR-0032 (EN+KO): document the shipped center-root bidirectional reduce (doc was stale corner-root); annotate ccl.yaml root_cube as a placeholder.
- Rename allreduce + pe2pe latency plots to descriptive, title-matching filenames and retitle the in-plot headings; drop overview/overview_log.
- Point the PPTX image refs at the new plot names.

Doc + derived-artifact + rename only; no simulation behavior changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 20:50:48 -07:00
parent e77e4a1703
commit ff7d727ddd
38 changed files with 259 additions and 272 deletions
@@ -31,7 +31,7 @@ pe0-only same-lane inter-cube reduce**, then at the root cube, inter-SIP

 ### Current state

-- `src/kernbench/ccl/algorithms/intercube_allreduce.py`: kernel
+- `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py`: kernel
 - `src/kernbench/ccl/sfr_config.py`: `configure_sfr_intercube_multisip`
 - `src/kernbench/runtime_api/distributed.py`: `AhbmCCLBackend` wires it up
  automatically at `init_process_group` time.
@@ -42,29 +42,46 @@ pe0-only same-lane inter-cube reduce**, then at the root cube, inter-SIP

 ## Decision

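-### D1. Algorithm structure: 5 phases
+### D1. Algorithm structure: 5 phases (center-root, bidirectional)
+
+The root cube sits at the geometric **center** of the cube mesh:
+
+```
+root_col  = cube_w // 2
+root_row  = cube_h // 2
+root_cube = root_row * cube_w + root_col   # center; 10 on a 4×4 mesh
+```
+
+Each reduce/broadcast phase converges on (or fans out from) this center **bidirectionally**,
+shortening the intra-SIP critical path versus a corner-root walk, toward half on large meshes
+(4×4 mesh: 4 reduce hops + 4 broadcast hops vs 6+6 for an SE-corner root).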

 For each SIP (launched concurrently via `mp.spawn`):

 ```
-Phase 1 - Row reduce W → E (cube mesh, pe0 only):
-    col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds the row sum.
+Phase 1 - Row reduce converging at col == root_col (cube mesh, pe0 only):
+    The left half (col < root_col) moves W→E and the right half (col > root_col)
+    moves E→W; the root_col cube merges both sides → holds the row sum.

-Phase 2 - Col reduce N → S on the rightmost column (pe0, col = mesh_w-1):
-    row=0 sends S → row=1 accumulates, sends S → ... → the root cube (15)
-    holds the full SIP sum.
+Phase 2 - Col reduce at col == root_col, converging at row == root_row:
+    The upper part (row < root_row) moves N→S and the lower part (row > root_row)
+    moves S→N; the root cube merges both sides → holds the full SIP sum.

-Phase 3 - Inter-SIP exchange at the root cube (root cube's pe0 only):
+Phase 3 - Inter-SIP exchange at cube_id == root_cube (pe0 only):
    Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast,
    selected by sip_topo_kind (sips.topology in topology.yaml).

-Phase 4 - Col broadcast S → N on the rightmost column.
+Phase 4 - Col broadcast at col == root_col, outward from root_row.

-Phase 5 - Row broadcast E → W across the cube mesh.
+Phase 5 - Row broadcast across the cube mesh, outward from root_col.
 ```

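 After all phases complete, pe0 of every cube holds the global sum.

+**Single-cube fast path**: when `cube_w == cube_h == 1` (one cube per rank, the
+common TP case), the intra-SIP reduce/broadcast phases are skipped and the kernel
+proceeds directly to the Phase 3 inter-SIP exchange.
+
 The kernel is a single function parameterized by `sip_topo_kind ∈ {0, 1, 2}`
 (ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical across
 topologies; only phase 3 branches. The helper functions `_inter_sip_ring`, `_inter_sip_torus_2d`,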
@@ -152,17 +169,19 @@ system:

 ```yaml
 defaults:
-  algorithm: intercube_allreduce
+  algorithm: lrab_hierarchical_allreduce
  buffer_kind: tcm
  ...

 algorithms:
-  intercube_allreduce:
-    module: kernbench.ccl.algorithms.intercube_allreduce
+  lrab_hierarchical_allreduce:
+    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
-    root_cube: 15
+    root_cube: 15   # Currently unused: the kernel elects the root dynamically
+                    # as the geometric center (see D1). Kept as a placeholder for
+                    # a future explicit root override / runtime election hook.
 ```

 `topology.yaml`:
@@ -205,10 +224,11 @@ sip:
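 - **Asymmetric SIP topologies** (non-square mesh/torus).
  `torus_2d` and `mesh_2d_no_wrap` require `n_sips = k²`.
 - **Pipelined chunks**: a single tile per cube, no pipelining yet.
-- **Runtime election of the root cube**: the kernel currently uses
-  `root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)`, hardcoded to the SE
-  corner. Since the SFR wiring covers every cube, runtime election is a
-  pure kernel change when it becomes necessary.
+- **Runtime election of the root cube**: to minimize the intra-SIP critical
+  path, the kernel currently uses the geometric center,
+  `root_cube = (mesh_h // 2) * mesh_w + (mesh_w // 2)`. Since the SFR
+  wiring covers every cube, electing a different root at runtime is a
+  pure kernel change when it becomes necessary.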

 ---

@@ -241,15 +261,15 @@ sip:

 | File | Change |
 |---|---|
-| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
+| `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
 | `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
 | `src/kernbench/ccl/topologies.py` | Add `torus_2d`, `mesh_2d_no_wrap` |
 | `src/kernbench/ccl/install.py` | Extend `_OPPOSITE_DIR` with the `global_*` pairs |
 | `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + adds sip_rank/topo arguments |
-| `ccl.yaml` | Single `intercube_allreduce` entry |
+| `ccl.yaml` | Single `lrab_hierarchical_allreduce` entry |
 | `topology.yaml` | Add `system.sips.topology` |
 | `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
 | `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
-| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
+| `tests/test_distributed_lrab_hierarchical_allreduce.py` (new) | Full `dist.all_reduce` path |
 | `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
 | Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
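
The hop counts quoted in D1 can be checked with a small standalone sketch. This is not the kernel code: the function names `center_root`, `corner_root`, and `reduce_hops` are hypothetical, and the model assumes the two sides of each reduce phase run concurrently, so a phase costs as many hops as its longer side.

```python
def center_root(cube_w: int, cube_h: int) -> int:
    """Cube id of the geometric center, using the D1 formula."""
    return (cube_h // 2) * cube_w + (cube_w // 2)


def corner_root(cube_w: int, cube_h: int) -> int:
    """Cube id of the SE corner, the root described before this change."""
    return (cube_h - 1) * cube_w + (cube_w - 1)


def reduce_hops(cube_w: int, cube_h: int, root: int) -> int:
    """Critical-path hops for the row reduce followed by the col reduce.

    Assumption: both sides of a phase proceed in parallel, so each
    phase is bounded by the side farther from the root.
    """
    root_row, root_col = divmod(root, cube_w)
    row_phase = max(root_col, cube_w - 1 - root_col)
    col_phase = max(root_row, cube_h - 1 - root_row)
    return row_phase + col_phase


if __name__ == "__main__":
    w = h = 4
    for name, root in (("center", center_root(w, h)),
                       ("corner", corner_root(w, h))):
        print(name, root, reduce_hops(w, h, root))
```

Under this model the broadcast phases mirror the reduce phases, so the broadcast hop count equals the reduce hop count for the same root. A 1×1 mesh gives zero intra-SIP hops, which matches the single-cube fast path.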