adr: add ADR-0046-0049 — close G4 coverage gaps from /report

Documents four cross-cutting surfaces that previously had no ADR backing,
each surfaced as a G4 candidate by /report:

- 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates all
  primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...), the two
  execution modes (command-list vs greenlet runner), scratch allocator
  semantics, dispatch-overhead model, and the kernel registry.
- 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group
  (backend="ahbm") install path. world_size priority (algorithm > defaults >
  topology), the 4-step init sequence (load ccl.yaml, import algorithm module,
  derive world_size, install SFR + IPCQ), greenlet-local rank registry,
  all_reduce dispatch via _defer_wait, barrier no-op rationale, and the
  explicit list of unsupported dist.* APIs.
- 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator free-list
  semantics. Offset-keyed first-fit with coalescing, the no-validation trust
  model for free(), HBM/TCM channel separation, page-aligned VA allocation,
  the page_size dual-default (VirtualAllocator 2 MiB / _ensure_allocators
  4 KiB fallback), and one-allocator-per-sub-unit rule.
- 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog. H2D /
  D2H / PE DMA categories with their exact cube-index choices, the 32 KiB
  reference size, the 5-point utilization sweep, the formula vs actual column
  meanings, automatic invariant checks (monotonicity, D2H >= H2D, best <
  worst), per-case GraphEngine isolation, and the human-readable (not
  machine-parsable) output contract.

Bilingual pair verifier passes for all four EN/KO pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:25:04 -07:00
parent 5f8dd688f5
commit 9a02955770
8 changed files with 2154 additions and 0 deletions
@@ -0,0 +1,243 @@
+# ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies what `AhbmCCLBackend` and `DistributedContext` in
+`runtime_api/distributed.py`, i.e. the
+`torch.distributed.init_process_group(backend="ahbm")` entry point, actually
+install, and with what semantics they implement `all_reduce`, `barrier`,
+`get_rank`, and related calls. ADR-0023 D11 mentions the intent of
+"torch.distributed compatibility", but **the behavioral model of the backend
+itself** was not captured at the ADR level.
+
+## First action
+
+`RuntimeContext.__post_init__` automatically creates a `DistributedContext()`
+instance and attaches it to `self.distributed`. The first things that happen
+at that point are:
+
+1. `self._backend: AhbmCCLBackend | None = None` is initialized (the
+   not-yet-initialized state).
+2. `self._rank_by_greenlet: dict = {}` initializes the greenlet-local rank
+   registry (ADR-0024 D2).
+3. The caller (RuntimeContext) sets the back-reference `dc._ctx_ref = self`,
+   so that a later `init_process_group` can reach `ctx.engine` / `ctx.spec` /
+   `ctx.launch`.
+
+In other words, **the first action of DistributedContext is to attach itself
+to the RuntimeContext, together with a back-reference, and to leave the
+backend slot empty**. The actual backend installation (IPCQ install,
+world_size derivation, algorithm module load) happens only when user code
+calls `torch.distributed.init_process_group(backend="ahbm")`.
+
+At that point, the first things `init_process_group` does are:
+
+1. If `backend != "ahbm"`, immediately raise
+   `ValueError("Unsupported backend ...")`.
+2. If `getattr(self, "_ctx_ref", None)` is None, raise
+   `RuntimeError("DistributedContext not bound to a RuntimeContext")`.
+3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)`: this constructor performs
+   the ccl.yaml load, the algorithm module import, the world_size derivation,
+   the SFR configuration, and the IPCQ install.
+4. `self._backend._dist_ctx = self`: lets the backend reach back into
+   `_rank_by_greenlet`.
+
+## Context
+
+The purpose of `AhbmCCLBackend` is to let bench code use PyTorch DDP
+collective calls (`init_process_group`, `all_reduce`, and so on) unchanged, so
+that it looks the same as a real DDP training script (aligned with the
+launcher model of ADR-0024 and ADR-0027).
+
+This backend is responsible for:
+
+- **Installing the IPCQ neighbor table once** at `init_process_group` time
+  (analogous to real NCCL communicator creation).
+- **Launching the configured algorithm's kernel function via
+  `ctx.launch(...)`** when `all_reduce(tensor, op="sum")` is called.
+- Answering `get_world_size` / `get_rank` consistently from the greenlet-local
+  rank registry and from ccl.yaml/topology.
+
+ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) cover parts of
+this, but **the scope of responsibility and the decision order of
+`AhbmCCLBackend` itself** are not stated anywhere. This ADR fills that gap.
+
+## Decision
+
+### D1. The backend is created only at `init_process_group(backend="ahbm")` time
+
+`DistributedContext` starts with `_backend = None` at `__init__` time. The
+backend object does not exist until the user calls
+`dist.init_process_group(backend="ahbm")`. If any other API (`is_initialized`,
+`get_world_size`, `all_reduce`, `barrier`) is called while the backend is
+still None, it raises
+`RuntimeError("Default process group has not been initialized...")` (via the
+`_ensure_initialized` helper).
+
+`backend != "ahbm"` raises `ValueError` immediately. Other backend names
+(nccl, gloo, and so on) are not recognized.
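
The lazy lifecycle in D1 can be sketched as follows. This is a minimal model and not the code in `runtime_api/distributed.py`: `FakeBackend`, `LazyDistributedContext`, and the demo world size of 4 are hypothetical stand-ins, and only the `None` slot, the `ValueError` on a foreign backend name, and the `_ensure_initialized` guard are taken from the text above.

```python
class FakeBackend:
    """Hypothetical stand-in for AhbmCCLBackend (the real constructor does the D3 work)."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size


class LazyDistributedContext:
    """Simplified model of D1: the backend slot stays empty until init_process_group."""

    def __init__(self) -> None:
        self._backend = None  # nothing is installed at construction time

    def _ensure_initialized(self) -> FakeBackend:
        # Guard shared by every API that needs an installed backend.
        if self._backend is None:
            raise RuntimeError("Default process group has not been initialized")
        return self._backend

    def init_process_group(self, backend: str = "ahbm", **_ignored) -> None:
        # Only the "ahbm" name is recognized; extra keyword arguments are ignored.
        if backend != "ahbm":
            raise ValueError(f"Unsupported backend {backend!r}")
        self._backend = FakeBackend(world_size=4)  # 4 is an arbitrary demo value

    def get_world_size(self) -> int:
        return self._ensure_initialized().world_size
```

Calling `get_world_size()` on a fresh `LazyDistributedContext` raises `RuntimeError`; after `init_process_group(backend="ahbm")` it returns the backend's value.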
+
+### D2. world_size resolution priority: algorithm > defaults > topology
+
+`AhbmCCLBackend._resolve_world_size` (ADR-0024 D1) decides in this order:
+
+1. If the algorithm entry in `ccl.yaml` has `world_size`, use that value.
+2. Otherwise, if `defaults.world_size` is present, use that value.
+3. If neither is present, use `spec.system.sips.count` (the number of SIPs in
+   the topology).
+
+The default meaning is **rank = SIP** (ADR-0024). Cube/PE-level parallelism is
+expressed inside each rank through DPPolicy and does not affect world_size. An
+explicit world_size override in `ccl.yaml` is honored as-is, for the sake of
+the legacy "rank = flat PE index" test path.
+
+User arguments to `init_process_group(world_size=..., rank=...)` are
+**accepted but ignored** (they carry the same meaning as the `RANK` /
+`WORLD_SIZE` env vars of real PyTorch).
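
The priority chain in D2 amounts to a first-match lookup. The sketch below assumes the algorithm entry and the `defaults` section are available as plain mappings. The function name `resolve_world_size` and its signature are illustrative and are not the actual `_resolve_world_size` implementation.

```python
from typing import Any, Mapping


def resolve_world_size(
    algo_cfg: Mapping[str, Any],
    defaults: Mapping[str, Any],
    sip_count: int,
) -> int:
    """Return world_size using the order: algorithm entry > defaults > topology."""
    for source in (algo_cfg, defaults):
        value = source.get("world_size")
        if value is not None:
            return int(value)
    # Neither config level sets it: fall back to the number of SIPs (rank = SIP).
    return sip_count
```

With `algo_cfg={"world_size": 8}` the result is 8 regardless of the other inputs. With both mappings empty it is `sip_count`.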
+
+### D3. The four installation tasks `init_process_group` performs immediately
+
+The following run sequentially inside `AhbmCCLBackend.__init__`:
+
+1. **ccl.yaml loading**: `kernbench.ccl.install.load_ccl_config()` followed by
+   `resolve_algorithm_config(_cfg_all)` produces the merged config for
+   `defaults.algorithm` (or the algorithm the user specified).
+2. **Algorithm module import**: `importlib.import_module(self._merged["module"])`.
+   The module must expose a `kernel` function, a
+   `kernel_args(world_size, n_elem, cube_w, cube_h)` helper, and an optional
+   `TOPO_NAME_TO_KIND` mapping.
+3. **world_size derivation** (D2).
+4. **Topology metadata collection**: from `spec`, gather `n_sips`, `sip_topo`
+   (default `ring_1d`), `cube_w`/`cube_h`, and `sips.w`/`sips.h`. If the SIP
+   topology is not ring_1d, derive `_sip_topo_w/h` from the explicit `w`/`h`
+   or from the square root (guaranteeing `w*h == n_sips`). A mismatch raises
+   `ValueError`.
+5. **SFR + IPCQ install**: call
+   `kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine, spec, self._merged)`.
+   This function pushes the IPCQ neighbor table to pe0 of every SIP/cube
+   (corresponding to the one-time setup of a real NCCL communicator).
+
+If this order changes (for example, if SFR were configured first and the
+algorithm module load then failed), a partially initialized state could
+result. D3 is therefore treated as an atomic 4-step sequence: on failure, the
+backend remains uninstalled.
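
The grid derivation in the topology metadata step can be illustrated as below, for the non-`ring_1d` case only. `derive_sip_grid` is a hypothetical helper. The text only says that explicit `w`/`h` or a square root is used and that a mismatch raises `ValueError`, so rejecting a half-specified pair is an assumption of this sketch.

```python
import math
from typing import Optional, Tuple


def derive_sip_grid(
    n_sips: int,
    w: Optional[int] = None,
    h: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (w, h) for a 2D SIP topology such that w * h == n_sips."""
    if w is None and h is None:
        # No explicit dims: assume a square grid.
        w = h = math.isqrt(n_sips)
    elif w is None or h is None:
        # Assumption of this sketch: a half-specified pair is rejected.
        raise ValueError("specify both w and h, or neither")
    if w * h != n_sips:
        raise ValueError(f"w*h = {w * h} does not match n_sips = {n_sips}")
    return w, h
```

For example, `derive_sip_grid(4)` yields a 2x2 grid, while `derive_sip_grid(6)` raises because 6 is not a perfect square and no explicit dims were given.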
+
+### D4. greenlet-local rank registration (ADR-0024 D2)
+
+`DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps each spawned
+worker greenlet to a rank. When the bench launcher (for example
+`torch.multiprocessing.spawn`) starts a worker, it calls
+`dc._bind_rank(g, rank)` to register it.
+
+`get_rank()` looks up the greenlet returned by `getcurrent()`. An unregistered
+greenlet falls back to 0, which preserves single-driver / test compatibility.
+
+During `all_reduce`, the backend obtains the current greenlet's rank through
+`_dist_ctx._rank_by_greenlet` (D5).
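
A minimal sketch of the registry described in D4 follows. To keep it runnable without the third-party `greenlet` package, the "current task" lookup is injected as a callable. In the real code that role is played by `getcurrent()`. The class name `RankRegistry` and the string task keys are illustrative assumptions.

```python
from typing import Callable, Dict, Hashable


class RankRegistry:
    """Maps the currently running task (a greenlet in kernbench) to its rank."""

    def __init__(self, getcurrent: Callable[[], Hashable]) -> None:
        self._getcurrent = getcurrent  # stand-in for greenlet's getcurrent
        self._rank_by_task: Dict[Hashable, int] = {}

    def bind_rank(self, task: Hashable, rank: int) -> None:
        # Called by the launcher when it spawns a worker.
        self._rank_by_task[task] = rank

    def get_rank(self) -> int:
        # Unregistered tasks (e.g. the main driver) fall back to rank 0.
        return self._rank_by_task.get(self._getcurrent(), 0)


# Demo: a mutable cell plays the role of "which task is running now".
current = {"task": "main"}
registry = RankRegistry(lambda: current["task"])
registry.bind_rank("worker-1", 1)
```

While `current["task"]` is `"main"`, `registry.get_rank()` falls back to 0. Once it is switched to `"worker-1"`, the bound rank 1 is returned.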
+
+### D5. Behavior of `all_reduce(tensor, op="sum")`
+
+Validation:
+
+- `op != "sum"` → `NotImplementedError`. The current kernels implement only
+  add reduction.
+- `tensor._handle is None` → `RuntimeError("not deployed")`.
+- Empty `tensor._handle.shards` → `RuntimeError("no shards")`.
+
+Preparation:
+
+- `n_elem = shards[0].nbytes // tensor.itemsize`: the element count of a
+  single shard.
+- `kernel_fn = self._algo_module.kernel`: the entry function of the algorithm
+  module imported in D3.
+- Effective cube dims: if the first SIP has exactly one cube, treat it as
+  scalar with (1,1); otherwise use the topology's `cube_w`/`cube_h`. This
+  naturally absorbs the case where TP uses only some of the cubes.
+- `kernel_args = self._algo_module.kernel_args(world_size, n_elem, cube_w, cube_h)`:
+  the algorithm decides the argument set passed to its own kernel.
+
+Dispatch:
+
+- Look up the current greenlet's rank with `_rank_by_greenlet.get(g, 0)`.
+- Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`.
+- `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor, *kernel_args, *extra_args, _defer_wait=True)`:
+  `_defer_wait=True` delegates the collective drain to the main scheduler
+  (ADR-0027 D0.4).
+
+Drain:
+
+- If a parent greenlet is alive (multi-greenlet mode), enqueue onto
+  `_pending_collective_handles` and switch to the parent. The main scheduler
+  drains everything at once after all ranks have launched.
+- In single-driver mode, drain inline immediately with
+  `for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
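
The validation and preparation steps of D5 can be sketched in isolation, without the launch and drain machinery. The `Shard`, `Handle`, and `FakeTensor` classes and the `prepare_all_reduce` function are hypothetical stand-ins for the real tensor handle types. Only the checks, the error types, and the `n_elem` and cube-dims rules come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Shard:
    nbytes: int


@dataclass
class Handle:
    shards: List[Shard] = field(default_factory=list)


@dataclass
class FakeTensor:
    itemsize: int
    _handle: Optional[Handle] = None


def prepare_all_reduce(
    tensor: FakeTensor,
    op: str,
    cubes_in_first_sip: int,
    cube_w: int,
    cube_h: int,
) -> Tuple[int, Tuple[int, int]]:
    """Validate the call, then return (n_elem, effective cube dims)."""
    if op != "sum":
        raise NotImplementedError(f"op={op!r}: only 'sum' is implemented")
    if tensor._handle is None:
        raise RuntimeError("not deployed")
    if not tensor._handle.shards:
        raise RuntimeError("no shards")
    # Element count of a single shard.
    n_elem = tensor._handle.shards[0].nbytes // tensor.itemsize
    # A single-cube SIP is treated as a (1, 1) grid.
    dims = (1, 1) if cubes_in_first_sip == 1 else (cube_w, cube_h)
    return n_elem, dims
```

A 64-byte shard of 2-byte elements gives `n_elem == 32`. The returned dims would then feed the algorithm module's `kernel_args(...)` helper.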
+
+### D6. `barrier()` is a no-op (single-driver model)
+
+kernbench handles every rank as a greenlet inside a single Python process.
+Since no situation requires cross-process synchronization, `barrier()` is
+callable but performs no synchronization at all. It is kept for API
+compatibility with real PyTorch DDP (so that callers do not receive
+NotImplementedError).
+
+If a multi-process kernbench is introduced in the future (for example, with an
+independent SimPy event loop per process), a new ADR that supersedes D6 will
+be required.
+
+### D7. Meaning of `get_rank` / `get_world_size` / `get_backend`
+
+- `get_rank()` (D4): the bound rank of the current greenlet. Unregistered
+  greenlets get 0.
+- `get_world_size()` (D2): the world_size the backend derived in D3.
+- `get_backend()`: always the string `"ahbm"`. If the backend object does not
+  exist, `_ensure_initialized` raises RuntimeError.
+
+Difference from real PyTorch:
+
+- In real PyTorch, `get_rank()` is a process-global value, whereas in
+  kernbench it is greenlet-local. Called inside a spawned worker it returns
+  that worker's rank; called from the main thread it returns 0. Bench authors
+  should expect a meaningful rank only inside worker functions.
+
+### D8. Supported API surface (final)
+
+`DistributedContext` exposes the following API:
+
+- `init_process_group(backend="ahbm", world_size=None, rank=None, **kwargs)`
+- `is_initialized() -> bool`
+- `get_world_size() -> int`
+- `get_rank() -> int`
+- `get_backend() -> str`
+- `all_reduce(tensor, op="sum") -> None`
+- `barrier() -> None`
+- (internal) `_bind_rank(g, rank)`
+
+Other PyTorch distributed APIs (broadcast, reduce, all_gather, gather,
+scatter, send/recv, and so on) are **not implemented yet**. At the kernel
+level they can be expressed directly with `tl.send`/`tl.recv` (ADR-0046
+D3.10), but they are not exposed on the dist.* surface. When an additional
+collective becomes necessary, D8 is extended by adding a separate algorithm
+module together with a matching `DistributedContext` method.
+
+## Alternatives Considered
+
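+### A1. Create the backend eagerly in `RuntimeContext.__init__`
+
+Rejected. If ccl.yaml is missing or the algorithm module cannot be imported,
+RuntimeContext construction itself would fail even for a bench that does not
+use distributed features. Installing only at call time (D1) is the correct
+lazy semantics.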
+
+### A2. Always derive world_size from the topology (no override)
+
+Rejected. The "explicit override" path of ADR-0024 D1 is in use by legacy
+tests. It is kept for diagnostic scenarios that need to define PE-level ranks
+separately within a single SIP.
+
+### A3. Treat `op != "sum"` as a silent fallback
+
+Rejected. If the user intended `op="prod"` / `"max"` / `"avg"` and sum ran
+silently instead, validating the results would be very difficult. An explicit
+`NotImplementedError` is safer.
+
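+### A4. Implement `barrier` as a SimPy event
+
+Rejected (for now). In the single-driver model cross-process synchronization
+has no meaning, so a no-op is semantically accurate. A SimPy fake barrier
+would add code complexity without adding meaning. To be re-evaluated when
+multi-process kernbench is introduced.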
+
+## Consequences
+
+- The 4-step installation (D3) behind
+  `torch.distributed.init_process_group(backend="ahbm")` is fixed at the ADR
+  level, making it clear where future collective algorithms should hook in.
+- The D2 priority (algorithm > defaults > topology) is stated explicitly, so
+  the impact of a ccl.yaml change can be assessed quickly.
+- The D6 barrier no-op decision is fixed at the ADR level, making it clear
+  that introducing multi-process kernbench requires a superseding ADR.
+- The D8 list of unsupported APIs is explicit, providing a clear basis for
+  rejection when a user tries to call `dist.broadcast(...)`.