kernbench2/docs/adr-ko/ADR-0050-par-ccl-algorithm-module-contract.md

# ADR-0050: CCL Algorithm Module Contract — `ccl/algorithms/*.py`

## Status

Accepted (2026-05-22).

`src/kernbench/ccl/algorithms/` 디렉터리 안의 모듈이 AHBM CCL backend
(ADR-0047) 에서 collective algorithm 으로 사용되려면 갖춰야 할 인터페이스,
kernel 시그너처, 그리고 새 알고리즘 추가 절차를 명시한다. ADR-0047 D3 가
"algorithm 모듈은 `kernel`, `kernel_args`, optional `TOPO_NAME_TO_KIND` 를
expose 해야 한다" 라고만 한 줄로 언급하나, **algorithm 모듈 작성자가 따라야
할 contract** 는 ADR-level 에서 정리된 적이 없다. ADR-0045 가 bench 모듈
contract 를 다루는 것과 짝을 이룬다.

## First action (제일 처음에 하는 일)

알고리즘 모듈이 import 되는 시점은 두 가지다:

1. **AHBM backend 진입**: 사용자 코드가 `dist.init_process_group(backend="ahbm")`
   를 호출하면, `AhbmCCLBackend.__init__` 안에서 `self._algo_module =
   importlib.import_module(self._merged["module"])` 가 실행된다. 이때 모듈
   레벨에서 가장 먼저 일어나는 일:
   - `SIP_TOPO_RING/TORUS/MESH` 같은 정수 상수가 모듈 namespace 에 노출.
   - `TOPO_NAME_TO_KIND` 사전이 모듈 namespace 에 노출 — backend 가
     `topo_map = getattr(self._algo_module, "TOPO_NAME_TO_KIND", None)` 로
     조회.
   - `kernel_args` 함수 정의 — 호출 시 호출자가 사용.
   - `allreduce_intercube_multidevice` 같은 알고리즘 함수 정의.
   - 모듈 마지막 줄에서 `kernel = allreduce_intercube_multidevice` 로
     alias 가 노출.

2. **ccl.yaml install 단계**: `kernbench.ccl.install.install_ipcq` 가 호출
   되어 IPCQ neighbor table 을 푸시할 때 같은 알고리즘 모듈이 import 됨.

즉, **algorithm 모듈의 첫 일은 "topology-kind 상수, `TOPO_NAME_TO_KIND`
사전, `kernel_args` 함수, 그리고 `kernel` alias 를 모듈 namespace 에 노출
하는 것"** 이다. 모든 노출은 import-time 부수효과로 충분하며 별도 초기화
함수 호출이 필요하지 않다.

## Context

`AhbmCCLBackend` (ADR-0047) 는 process group 초기화 시점에 `ccl.yaml` 의
`defaults.algorithm` (또는 사용자가 지정한 알고리즘 이름) 으로부터 모듈
경로를 얻어 dynamic import 한다. backend 는 그 모듈로부터 다음 4 가지를
기대한다:

- `kernel`: collective 의 진입 함수.
- `kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple`: kernel 에
  넘길 위치 인자 묶음.
- `TOPO_NAME_TO_KIND` (optional): `topology.yaml` 의 `sips.topology`
  문자열 (예: `"ring_1d"`, `"torus_2d"`, `"mesh_2d_no_wrap"`) 을 정수
  상수로 매핑하는 dict.
- (간접) IPCQ neighbor table 설치: `configure_sfr_intercube_multisip` 가
  알고리즘 모듈의 `TOPO_NAME_TO_KIND` 와 `cube_w/h` 를 보고 SFR 을 결정.

현재 코퍼스의 유일한 algorithm 모듈은 `lrab_hierarchical_allreduce.py`
(248 줄) 이다. 이름은 "**l**eft-**r**ight **a**lternating **b**roadcast
**hierarchical allreduce**". 향후 `ring_allreduce`, `tree_allreduce`,
`broadcast` 같은 모듈이 추가될 때마다 이 contract 를 따라야 일관된
디스패치가 가능하다.

이 contract 가 ADR-level 에 없으면:

- 새 algorithm 작성자가 ADR-0047 D3 의 한 줄 만으로 시그너처를 추론해야.
- kernel 함수 인자 순서 (특히 `t_ptr, n_elem, cube_w, cube_h, n_sips,
  sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl`) 의 의미가 코드
  grep 없이는 명확하지 않다.
- `kernel_args` 가 어떤 인자를 받고 어떤 tuple 을 돌려줘야 하는지 관례
  로만 굳어진다.

## Decision

### D1. algorithm 모듈은 4 가지 public symbol 을 노출한다

```python
# src/kernbench/ccl/algorithms/<name>.py

from __future__ import annotations

# (필수) topology-kind 상수 — 알고리즘 내부에서 사용
SIP_TOPO_RING = 0
SIP_TOPO_TORUS = 1
SIP_TOPO_MESH = 2

# (선택) topology 이름 → kind 매핑. backend 가 ccl.yaml/topology 의
# 문자열 SIP topology 를 정수로 변환하는 데 사용.
TOPO_NAME_TO_KIND = {
    "ring_1d": SIP_TOPO_RING,
    "torus_2d": SIP_TOPO_TORUS,
    "mesh_2d_no_wrap": SIP_TOPO_MESH,
}

# (필수) kernel 인자 빌더
def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)

# (필수) kernel 함수 (`tl=...` 키워드를 통해 TLContext 가 주입됨)
def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                         sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    ...

# (필수) kernel alias — backend 가 `module.kernel` 로 접근
kernel = my_allreduce_kernel
```

- `kernel` alias 는 backend 가 직접 호출하는 entry point 다. 함수 이름이
  무엇이든 (`allreduce_intercube_multidevice` 처럼) `module.kernel = fn`
  으로 노출해야 한다.
- `kernel_args` 가 없으면 backend 가 알고리즘 인자를 만들 방법이 없다.
  signature 는 D2 참고.
- `TOPO_NAME_TO_KIND` 가 없으면 backend 는 `sip_topo_kind = 0` 으로
  fallback 한다. 단일 topology 만 지원하는 알고리즘이라면 생략 가능.

### D2. `kernel_args` 시그너처 — `(world_size, n_elem, *, cube_w, cube_h)`

```python
def kernel_args(world_size: int, n_elem: int, *,
                cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)
```

- **위치 인자**: `world_size` (= rank 수), `n_elem` (= 단일 shard 의
  element 수, f16 기준).
- **키워드 인자**: `cube_w`, `cube_h` (= cube mesh 크기). default 는
  4×4 — `topology.yaml` 의 `sip.cube_mesh` 기본값과 정합.
- **반환**: kernel 의 위치 인자 순서대로 묶은 tuple.

backend 의 `all_reduce` 가 호출 시:

```python
kernel_args_tuple = self._algo_module.kernel_args(
    self._world_size, n_elem, cube_w=eff_cube_w, cube_h=eff_cube_h,
)
extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
pending = self.ctx.launch(
    self._merged["algorithm"], kernel_fn, tensor,
    *kernel_args_tuple, *extra_args, _defer_wait=True,
)
```

즉 kernel 의 최종 위치 인자는: `(tensor_ptr, *kernel_args_tuple,
sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` 이며, 거기에 `tl=...` 가
키워드로 자동 주입된다. `kernel_args` 가 돌려주는 tuple 의 길이/순서는
**kernel signature 와 1:1 일치** 해야 한다.

### D3. `kernel` 함수 시그너처 — 정형화된 9 + tl 인자

권장 시그너처:

```python
def my_kernel(
    t_ptr: int,             # VA base of the row-wise-sharded tensor on this SIP
    n_elem: int,            # element count per cube tile (or per shard)
    cube_w: int,            # cube mesh width  (kernel_args 에서 옴)
    cube_h: int,            # cube mesh height (kernel_args 에서 옴)
    n_sips: int,            # world_size 와 동일 (rank = SIP, ADR-0024)
    sip_rank: int,          # 이 SIP 의 rank
    sip_topo_kind: int,     # TOPO_NAME_TO_KIND lookup 결과
    sip_topo_w: int,        # SIP mesh width (ring_1d 면 0)
    sip_topo_h: int,        # SIP mesh height (ring_1d 면 0)
    *, tl,                  # TLContext (auto-injected)
) -> None:
```

`kernel_args` 가 다른 위치 인자 순서를 채택하더라도, kernel 의 **마지막
4 개 위치 인자는 항상 `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`**
이며 backend 가 `extra_args` 로 append 한다 (ADR-0047 D5). 이 4 개 인자는
사용자 정의 algorithm 도 받아야 하지만, 알고리즘이 single-SIP 이라면
그냥 무시하면 된다.

`tl` 은 위치 인자가 아닌 키워드로 주입된다 — `RuntimeContext.launch` 가
kernel 호출 직전에 `tl=tl_ctx` 를 추가한다. 따라서 kernel signature 의
`tl` 은 keyword-only (`*, tl`) 또는 마지막 키워드 매개변수 형태여야
한다.

### D4. kernel body 의 자유도와 제약

kernel body 안에서 사용 가능한 표면: ADR-0046 D3 의 모든 `tl.*` primitive.

특히 자주 쓰이는 패턴:

- `cube_id = tl.program_id(axis=1)` — 이 PE 가 속한 cube 인덱스.
- `pe_addr = t_ptr + cube_id * nbytes` — cube-별 tile 의 VA 계산.
- `acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")` — local 데이터
  로드.
- `tl.send(dir=...)` / `tl.recv(dir=..., shape=, dtype=)` — IPCQ
  collective.
- `acc = acc + recv` — TensorHandle 산술 연산자 (ADR-0046 D4).
- `tl.store(pe_addr, acc)` — 결과 저장.

kernel body 는 일반 Python 함수이며, branching/looping 자유. 단:

- SimPy `yield` 또는 `async` 금지 (ADR-0046 D1).
- TensorHandle 의 `.data` 직접 접근 금지 — phase 1 timing 모델은
  데이터 의존을 모른다 (ADR-0020 의 2-pass 분리).
- kernel 실행은 deterministic 해야 한다 — 같은 입력으로 두 번 실행하면
  같은 op 시퀀스 발사. random / external IO 금지.

### D5. SIP topology semantics — `sip_topo_kind` 의 의미

backend 가 `topology.yaml` 의 `system.sips.topology` 문자열을 algorithm
모듈의 `TOPO_NAME_TO_KIND` 로 lookup 하여 `sip_topo_kind` 정수로 변환.
algorithm 은 이 정수를 보고 분기:

```python
if sip_topo_kind == SIP_TOPO_RING:
    acc = _inter_sip_ring(...)
elif sip_topo_kind == SIP_TOPO_TORUS:
    acc = _inter_sip_torus_2d(...)
elif sip_topo_kind == SIP_TOPO_MESH:
    acc = _inter_sip_mesh_2d(...)
```

각 topology branch 는 IPCQ direction 이름 (예: `"global_E"`, `"W"`, `"S"`,
`"N"`) 을 통해 peer 와 통신. direction 의 의미는 ADR-0023/0025 가 정의
하며, `configure_sfr_intercube_multisip` 가 IPCQ neighbor table 을 그에
맞춰 설치한다.

algorithm 모듈은 자기가 지원하지 않는 topology kind 가 들어오면 silent
no-op 으로 두기보다 명시적으로 `raise ValueError(f"unsupported topology
kind {sip_topo_kind}")` 하는 것을 권장 — 실수로 backend 에 잘못 dispatch
된 경우 빠르게 fail.

### D6. ccl.yaml 의 algorithm entry 구조

algorithm 모듈은 `ccl.yaml` 의 entry 와 짝을 이룬다 (ADR-0023 D10 +
ADR-0047 D3):

```yaml
defaults:
  algorithm: lrab_hierarchical_allreduce
  n_elem: 8

algorithms:
  lrab_hierarchical_allreduce:
    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    # optional: world_size override
    # optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
```

- `module`: full Python module path. backend 의 `importlib.import_module`
  가 이 문자열을 그대로 사용.
- `world_size` (optional): 명시되면 topology fallback 을 override
  (ADR-0047 D2).
- algorithm-specific parameters 는 `configure_sfr_intercube_multisip` 가
  소비.

새 algorithm 추가 시:

1. `src/kernbench/ccl/algorithms/<name>.py` 작성 (D1 컨벤션).
2. `ccl.yaml` 의 `algorithms` 섹션에 entry 추가.
3. (필요 시) `kernbench.ccl.sfr_config` 에 SFR 설치 분기 추가.
4. test 추가 (예: `tests/sccl/test_<name>.py`, ADR-0043 의 eval harness
   확장).

### D7. legacy "rank = flat PE index" 모드

ADR-0047 D2 가 명시한 `ccl.yaml` 의 `world_size` override 경로는 legacy
"rank = flat PE index" 테스트가 사용한다. algorithm 모듈은 이 모드 에서도
`n_sips=world_size` 만큼의 rank 가 들어옴을 가정하면 된다 — backend 가
rank↔(SIP, cube, PE) 매핑을 사전에 분리해 두므로 algorithm 본체에서는
modal 분기가 필요 없다.

단, single-cube workload 에서는 `cube_w=cube_h=1` 이 들어와 mesh-기반
phase 들이 skip 되도록 작성해야 한다 (`lrab_hierarchical_allreduce.py`
의 `single_cube = (cube_w == 1 and cube_h == 1)` 패턴 참고).

## Alternatives Considered

### A1. algorithm 모듈을 class 로 구조화 (`class Allreduce: kernel(...)` 등)

기각. Python 모듈 namespace 자체가 algorithm 의 identity 로 사용 중이며
(ADR-0047 D3 의 `importlib.import_module`), class 한 겹은 추가 indirection
만 늘리고 dispatch 측 코드를 두텁게 만든다. 모듈-레벨 free function
+ `kernel` alias 패턴이 충분히 명확.

### A2. `kernel_args` 를 명시적 dataclass 로 typing

기각 (현재). algorithm 마다 인자 갯수가 다른 것이 정상이며, dataclass 한
종류를 강제하면 다양한 algorithm 간 호환이 어려워진다. tuple 반환은 simple
하고 backend 측 `*kernel_args_tuple` unpacking 과 깨끗이 맞물린다.
algorithm 별 자체 타입 강도가 필요해지면 그 algorithm 모듈 안에서 NamedTuple
사용은 자유.

### A3. SFR 설치를 algorithm 모듈 안으로

기각. SFR 설치 (`configure_sfr_intercube_multisip`) 는 topology + algorithm
모두를 보고 IPCQ neighbor table 을 설치하는 cross-module 결정이라, algorithm
모듈 내부보다 `kernbench.ccl.sfr_config` 같은 전용 위치가 자연스럽다. D6 의
"필요 시 sfr_config 분기 추가" 워크플로우가 책임 분리 측면에서 더 명확.

### A4. algorithm name 을 모듈 namespace 에 자동 등록 (decorator)

기각. ADR-0045 (bench) 와 달리 algorithm 은 ccl.yaml entry 와 직접 묶여
있어 추가 등록 레지스트리가 중복이다. `module` 문자열 매핑 하나면 충분.

## Consequences

- ADR-0047 D3 의 한 줄 contract 가 D1–D7 의 작성자-친화적 가이드로 확장
  되어, 새 algorithm 추가 시 시그너처를 grep 으로 추론할 필요 없음.
- D3 의 9 + tl 인자 시그너처가 표준화되어, backend 의 `extra_args` append
  (ADR-0047 D5) 와 자연스럽게 맞물림. 향후 single-SIP-only algorithm 도
  4 개의 sip_* 인자를 받아야 함이 명시.
- D5 의 fail-loud 권장으로, ccl.yaml 의 topology 가 algorithm 미지원
  topology 로 잘못 설정되면 backend 가 silent wrong-result 가 아닌
  ValueError 로 fail.
- D6 의 단계별 추가 절차가 명시되어, 새 algorithm 추가가 sfr_config /
  test / ccl.yaml 어디까지 손대야 하는지 분명.