adr: add ADR-0046-0049 — close G4 coverage gaps from /report

Documents four cross-cutting surfaces that previously had no ADR backing, each surfaced as a G4 candidate by /report: - 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...), the two execution modes (command-list vs greenlet runner), scratch allocator semantics, dispatch-overhead model, and the kernel registry. - 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group (backend="ahbm") install path. world_size priority (algorithm > defaults > topology), the 4-step init sequence (load ccl.yaml, import algorithm module, derive world_size, install SFR + IPCQ), greenlet- local rank registry, all_reduce dispatch via _defer_wait, barrier no-op rationale, and the explicit list of unsupported dist.* APIs. - 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator free-list semantics. Offset-keyed first-fit with coalescing, the no-validation trust model for free(), HBM/TCM channel separation, page-aligned VA allocation, the page_size dual-default (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and one-allocator-per-sub-unit rule. - 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog. H2D / D2H / PE DMA categories with their exact cube-index choices, the 32 KiB reference size, the 5-point utilization sweep, the formula vs actual column meanings, automatic invariant checks (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine isolation, and the human-readable (not machine-parsable) output contract. Bilingual pair verifier passes for all four EN/KO pairs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:25:04 -07:00
parent 5f8dd688f5
commit 9a02955770
8 changed files with 2154 additions and 0 deletions
@@ -0,0 +1,307 @@
+# ADR-0046: TLContext — Kernel-side `tl.*` API Contract
+
+## Status
+
+Accepted (2026-05-22).
+
+`src/kernbench/triton_emu/` 의 `TLContext` 가 노출하는 `tl.*` primitive
+집합과 그 의미, 그리고 두 실행 모드 (command-list / greenlet runner) 의
+계약을 명시한다. ADR-0014/0020 가 PE 파이프라인과 2-pass 실행 모델을
+정의하나, **bench 의 kernel 함수가 호출하는 `tl.*` 표면 자체**는 ADR-level
+에 정리되어 있지 않았다.
+
+## First action (제일 처음에 하는 일)
+
+`TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id, num_cubes,
+scratch_base, scratch_size)` 생성 시 가장 먼저 다음 6개 필드를 초기화한다:
+
+- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes` —
+  `tl.program_id` / `tl.num_programs` 가 반환할 값.
+- `self._dispatch_cycles` — 모든 `tl.*` API 호출 시작에서 자동으로 발행될
+  `PeCpuOverheadCmd(cycles)` 의 cycle 수.
+- `self._runner` — `KernelRunner` 인스턴스 (있으면 greenlet 모드, 없으면
+  command-list 모드).
+- `self._commands: list[PeCommand] = []` — command-list 모드에서 누적할
+  command 시퀀스.
+- `self._handle_counter = 0`, `self._completion_counter = 0` — 새 TensorHandle /
+  CompletionHandle id 생성용.
+- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0` —
+  PE-로컬 scratch 영역 (math/dot/composite 의 output handle 주소 할당용).
+
+즉, **TLContext 의 첫 일은 "이 kernel 인스턴스가 어디서 (sip/cube/pe) 어떤
+규모 (num_programs/num_cubes) 로 실행되며, 어느 모드 (runner 유무) 로
+명령을 발사할지 메타데이터를 채우는 것"** 이다. 이 시점에 SimPy event 는
+없으며 command 도 발사되지 않는다.
+
+런타임 첫 동작은 kernel 함수가 `tl.<api>()` 를 처음 호출할 때 발생한다.
+모든 `tl.*` API 의 표준 entry 동작은:
+
+1. `self._emit_dispatch_overhead()` 호출 — `dispatch_cycles > 0` 인 경우
+   `PeCpuOverheadCmd(dispatch_cycles)` 를 즉시 `_emit`.
+2. API 별 처리 (TensorHandle 생성, command 구성).
+3. `self._emit(cmd)` — runner 모드면 greenlet.switch 로 SimPy 측에 cmd 전달,
+   아니면 `self._commands` 에 append.
+
+## Context
+
+`tl.*` 표면은 `TLContext` 가 노출하는 메소드들로 구성되며, kernel 함수가
+받는 `tl` 매개변수가 이 객체다. 사용자(bench 작성자) 입장에서 보이는
+contract:
+
+- 어떤 primitive 가 있는가
+- 각 primitive 가 어떤 데이터 흐름을 발생시키는가 (DMA / compute / IPCQ /
+  metadata-only)
+- TensorHandle 의 `space` 와 `addr` 가 어떻게 결정되는가
+- command-list 모드와 greenlet 모드의 차이
+
+ADR-0014 (PE pipeline) 가 PE_SCHEDULER 가 받는 PeCommand 들을 정의하나,
+`tl.*` 가 이들을 어떻게 emit 하는지는 코드 컨벤션에만 존재한다. 또한
+ADR-0020 (2-pass data execution) 가 greenlet 모드의 존재를 D3 에서
+언급하나, runner / non-runner 두 경로의 시그너처 차이 (return value 처리)
+는 ADR-level 에 명시되어 있지 않다. 이 ADR 이 그 빈자리를 채운다.
+
+## Decision
+
+### D1. `tl` 매개변수는 `TLContext` 인스턴스다
+
+bench 의 kernel 함수는 다음 시그너처를 따른다:
+
+```python
+def _kernel(arg1, arg2, ..., tl, **kwargs):
+    ...
+```
+
+`tl` 의 정체는 `kernbench.triton_emu.tl_context.TLContext` 인스턴스이다.
+real Triton 의 `triton.language` 모듈을 흉내내기 위한 이름이며, real
+Triton 모듈이 들어오는 것은 아니다.
+
+kernel 함수는 일반 Python 함수이며 `yield` / `async` 가 없다. `tl.*`
+호출이 SimPy event 를 발생시키지만, 호출자(kernel) 쪽에서는 동기 호출처럼
+보인다 — greenlet 모드에서 KernelRunner 가 SimPy ↔ kernel 사이를 중계
+하기 때문 (ADR-0020 D3).
+
+### D2. 두 실행 모드 — command-list / greenlet runner
+
+- **command-list 모드 (`runner is None`)**: `tl.*` 호출이 `self._commands`
+  리스트에 PeCommand 를 누적. DMA / GEMM / Math 가 실제 SimPy 시간을
+  소비하지 않으며, return value 가 metadata-only TensorHandle (data=None) 다.
+  이후 PE_SCHEDULER / sim_engine 가 command 시퀀스를 시간상 재생.
+
+- **greenlet runner 모드 (`runner is not None`)**: `tl.*` 호출이
+  `self._emit(cmd)` 를 통해 `runner.switch_to_simpy(cmd)` 로 부모 greenlet
+  (SimPy) 으로 컨트롤을 넘김. 부모는 cmd 를 컴포넌트에 분배하여 SimPy 시간을
+  소비한 뒤, DMA read 의 경우 실제 numpy 데이터를 반환. kernel 은 그
+  결과를 받아 다음 line 으로 진행 (ADR-0020 D3 의 데이터 인지 실행 모델).
+
+mode 선택은 KernelRunner 인스턴스를 TLContext 에 주입하는지 여부로 결정
+되며, `tl.*` 메소드들은 이 차이를 인지하지 않고 `_emit()` 헬퍼를 통해
+일관되게 동작한다.
+
+### D3. Primitive 카테고리
+
+#### D3.1. Reference (no DMA, metadata only)
+
+- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: HBM 데이터를 참조하는
+  핸들만 만들고 DMA 는 발행하지 않음. composite scheduler 가 per-tile 로
+  스트리밍할 때 사용 (예: GEMM 의 b 피연산자).
+
+#### D3.2. Data movement (blocking, DMA engine)
+
+- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → 결과 핸들.
+  `DmaReadCmd` 발행. greenlet 모드에서는 결과 핸들의 `.data` 에 실제
+  numpy 배열 첨부; command-list 모드에서는 placeholder. 반환 핸들의
+  `space="hbm"`, `pinned=True`.
+- `tl.store(ptr, handle) -> None`: TCM → HBM. `DmaWriteCmd` 발행. greenlet
+  모드에서는 `handle.data` 가 있을 때만 `_store.write("hbm", ptr, data)` 를
+  먼저 호출 (visibility = issue time, ADR-0020 D3).
+
+#### D3.3. GEMM / compute (blocking)
+
+- `tl.dot(a, b) -> TensorHandle`: `a @ b`. 두 피연산자는 TCM 이어야 하며,
+  shape (M,K) × (K,N) → (M,N). `GemmCmd` 발행, output handle 은
+  `_make_compute_out(shape, dtype)` 로 PE-로컬 scratch 에 할당.
+- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
+  acc_dtype=None, tile_shape=None) -> CompletionHandle`: 비차단(non-blocking)
+  tiled pipeline. `CompositeCmd` 발행. `epilogue` 는 dict list, 각 dict 는
+  `"op"` 키 + op-specific 필드 + 옵션 `"scope"` (k_tile / output_tile);
+  unknown op 나 missing field 는 즉시 ValueError. 반환된 CompletionHandle 은
+  `tl.wait(h)` 로 동기화.
+
+#### D3.4. Math: unary (blocking)
+
+- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
+  `tl.cos(x)`, `tl.sin(x)` — 모두 `MathCmd(op=<name>, inputs=(x,), out=)`
+  발행. `out` 은 동일 shape/dtype 의 scratch 할당.
+
+#### D3.5. Math: binary (blocking)
+
+- `tl.maximum(a, b)`, `tl.minimum(a, b)` — `_binary_math`.
+- `tl.fma(a, b, c)` — `a*b + c`. inputs 3개.
+- `tl.clamp(x, min, max)` — `MathCmd(op="clamp", inputs=(x, min, max))`.
+- `tl.where(cond, a, b)` — `MathCmd(op="where", inputs=(cond, a, b))`.
+- `tl.softmax(x, axis=-1)` — 단일 MathCmd(op="softmax") 로 시간 회계는
+  한 번에. Phase 2 DataExecutor 가 canonical (x-max → exp → sum → div) 로
+  expand 한다.
+
+#### D3.6. Reduction (blocking)
+
+- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)` — 해당 axis 의
+  크기를 1 로 줄인 output handle 을 반환. `MathCmd(op=<name>, inputs=(x,),
+  out=, axis=axis)` 발행.
+
+#### D3.7. Index / scalar (PE_CPU, no engine)
+
+- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE 인덱스),
+  `axis==1` → cube_id (ADR-0022).
+- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (cube 당
+  PE 수), `axis==1` → num_cubes.
+- `tl.arange(start, end, dtype="i32") -> TensorHandle`: TCM 의 인덱스
+  range. command 발사 없이 metadata 만.
+- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape, value,
+  dtype="f16") -> TensorHandle`: TCM 에 placeholder. command 발사 없음.
+
+#### D3.8. Scalar helpers (no command, no engine)
+
+- `TLContext.cdiv(a, b) -> int` (static): ceiling division
+  `-(-a // b)`. real Triton 의 `tl.cdiv` 모방.
+
+#### D3.9. Metadata-only (no compute, no DMA)
+
+- `tl.trans(x) -> TensorHandle`: shape 의 마지막 두 dim 을 swap 한 새
+  핸들. 같은 addr/data 를 공유, command 발사 없음.
+
+#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
+
+- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
+  dtype="f16", space="tcm") -> None`: blocking send. handle 형태 또는
+  raw 주소 형태 둘 다 허용. `IpcqSendCmd` 발행. handle 의 `.data` 스냅샷이
+  명령에 실리는 경우, recv 측에서 받은 데이터의 race 회피.
+- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
+  dst_space=None) -> TensorHandle`: blocking recv. `dst_addr/dst_space`
+  둘 다 주면 "copy_to_dst" 모드, 아니면 "return_slot" 모드. greenlet
+  모드에서 핸들의 `.data` 에 실제 데이터 첨부.
+- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
+  **DIAGNOSTIC ONLY**. recv blocking 동기화는 그대로 적용되나 slot-read
+  latency (slot-IO + PE↔bank fabric drain) 는 건너뛴다. pe2pe overview
+  플롯에서 `tl.store` 와의 apples-to-apples 비교용. production kernel 은
+  사용 금지 — `consume=False` 라는 별도 명령 분기로 격리되어 있어 실수
+  flag 가 작동하지 않는다.
+- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
+  recv. `RecvFuture` 를 반환; 이후 `tl.wait(future)` 로 결과 수령.
+
+#### D3.11. Composite + control
+
+- `tl.composite(...)`: D3.3 에서 설명.
+- `tl.wait(handle=None)`: `CompletionHandle` (composite) 또는 `RecvFuture`
+  (async recv) 또는 `None` (모든 pending composite) 대기.
+- `tl.cycles(n)`: PE_CPU scalar 실행 overhead 를 명시적으로 선언.
+  `PeCpuOverheadCmd(cycles=n)` 발행.
+
+### D4. TensorHandle 산술 연산자 — thread-local TLContext
+
+`tl_context.py` 모듈 로드 시점에 `_enable_tensor_ops()` 가 호출되어
+`TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__` 를 patch한다.
+각 연산자는 thread-local `_ctx` (모듈 변수) 에 저장된 active TLContext 의
+`_binary_math` 를 호출한다.
+
+따라서 kernel 안에서 `c = a + b` 는 `MathCmd(op="add", inputs=(a,b),
+out=)` 발행 + new TensorHandle 반환 패턴과 동일하다.
+
+active TLContext 관리:
+
+- `TLContext._set_active(ctx)`: 현재 thread/greenlet 의 active ctx 설정.
+- `TLContext._get_active()`: 조회 (없으면 RuntimeError).
+- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. 진입 시
+  active 설정, kernel 실행, 종료 시 None 으로 복원.
+
+`KernelRunner` 는 매 cmd 분배 시 `_switch_kernel` 가 직접 `_set_active(tl)`
+를 호출하여, 같은 thread 안의 다른 PE runner 가 active 를 덮어쓴 경우에도
+복원되도록 한다.
+
+### D5. Scratch allocator — compute output handles
+
+`tl.dot`, `tl.exp`, `tl.add` (TensorHandle `__add__`) 등 결과를 만드는 op 는
+`_make_compute_out(shape, dtype)` 를 호출하여 16-byte aligned scratch
+주소를 할당한다. 이 주소는 `space="tcm"` 로 발행되며, 이후 `tl.send` /
+`tl.store` 가 이 handle 을 source 로 사용할 수 있다.
+
+`_scratch_base == 0` (command-list 모드 등) 이면 할당 주소가 0으로
+반환되어 handle 은 send/store 의 source 로 사용 불가 (이 경우 `tl.load`
+로 받은 핸들만 source 가 될 수 있다).
+
+cursor 가 `_scratch_size` (default 1 MiB) 를 초과하면 RuntimeError.
+cursor 는 매 kernel invocation 시작 시 0 으로 리셋되어야 하나 (현재 코드는
+KernelRunner 가 새 TLContext 를 매번 생성하여 자연스럽게 리셋됨).
+
+### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`
+
+모든 non-metadata `tl.*` 호출의 entry 에서 `_emit_dispatch_overhead()` 가
+호출되며 `dispatch_cycles > 0` 일 때 `PeCpuOverheadCmd(dispatch_cycles)`
+를 발행한다. PE_CPU 가 명령 dispatch 자체에 소비하는 cycle 비용을
+모델링하기 위함이다.
+
+기본값:
+
+- `TLContext.__init__` 의 `dispatch_cycles` 매개변수 기본값: 1 cycle.
+- `KernelRunner` 가 만드는 TLContext: 0 cycles (greenlet 모드는 cycle
+  회계가 별도, ADR-0020 D3 정신).
+
+### D7. Kernel registry (`triton_emu/registry.py`)
+
+별도의 `_kernels: dict[str, Callable]` 가 kernel 이름 → 함수 매핑을 보유:
+
+- `register_kernel(name, fn)`: duplicate 등록 시 ValueError.
+- `get_kernel(name)`: 미등록 시 KeyError.
+- `clear_registry()`: 테스트 전용.
+
+`RuntimeContext.launch(kernel_name, kernel_fn, *args)` 가 매 호출마다
+`_kernels[kernel_name] = kernel_fn` 으로 idempotent 덮어쓴다 (last call
+wins). 이는 ADR-0045 D8 의 launch 동작과 정합된다.
+
+PE_CPU 는 `KernelRef.name` 으로 registry 에서 kernel 함수를 lookup 한 뒤
+KernelRunner 로 실행한다.
+
+## Alternatives Considered
+
+### A1. tl.* 를 ADR-0014 / ADR-0020 안으로 통합
+
+기각. ADR-0014 는 PE pipeline (PeCommand 의 sim_engine 측 소비) 를, ADR-0020
+은 2-pass 실행 (Phase 1 timing / Phase 2 data) 을 다룬다. `tl.*` 는 kernel
+작성자가 만나는 API 표면이라 독립 분리하는 것이 검색성·온보딩 측면에서
+낫다.
+
+### A2. command-list 모드 deprecation
+
+기각 (현재). 단순한 unit test 와 kernel verification 에서 command-list
+모드가 가볍게 동작한다. greenlet 의존성 없이 PeCommand 시퀀스를 검사할 수
+있는 출입구로 유지한다. greenlet 모드만의 의미 (실데이터, Phase 2) 가
+필요하면 D2 의 mode 선택으로 명시적으로 들어간다.
+
+### A3. TensorHandle 산술 연산자 제거
+
+기각. real Triton 의 kernel 코드 가독성을 흉내내기 위함이며 (예: `c = a +
+b`), thread-local active ctx 패턴이 깔끔하게 작동 중. 명시적 `tl.add(a, b)`
+도 D3.5 에 노출되어 있어, 연산자가 헷갈리면 함수형 호출로 대체 가능.
+
+### A4. softmax 를 명시적 시퀀스 (max → exp → sum → div) 로 expand
+
+부분 채택. `tl.softmax` 는 단일 `MathCmd(op="softmax")` 로 timing 회계는
+한 번에 처리한다 (D3.5). 실 데이터 expansion 은 Phase 2 DataExecutor 가
+canonical 시퀀스로 풀어준다. 즉, 시간 모델은 atomic, 데이터 모델은
+expansion — 두 마리 토끼를 의도적으로 분리.
+
+## Consequences
+
+- bench 작성자가 만나는 모든 `tl.*` primitive 가 한 ADR 에 분류·정의됨.
+  ADR-0045 D8 의 host-side surface (torch.empty 등) 와 짝을 이루어 "kernel
+  안 / 밖" 양쪽 작성 가이드가 완성.
+- command-list / greenlet 두 모드의 차이가 D2 에 명시되어, 새로운 `tl.*`
+  primitive 추가 시 `_emit()` 패턴만 따르면 양쪽 자동 호환됨.
+- thread-local active ctx 패턴 (D4) 이 ADR-level 에서 정당화되어, 향후
+  multi-PE 동일-thread 실행 시 reset 책임이 어디인지 명확해짐
+  (`_switch_kernel` 가 cmd 분배 시 active 복원 — KernelRunner.run 의
+  contract).
+- `tl.recv_no_consume` 의 진단 전용 격리(D3.10) 가 ADR 에 굳어져, 실수로
+  production kernel 에서 사용되는 것을 막는 layer 가 명확.
+- registry (D7) 가 별도 D 항목으로 분리되어, kernel 이름 충돌 / 동적
+  재등록 동작의 사양이 명시.