Files

T

ywkang 998cc85762 Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after
  each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to
  prevent stale data from corrupting the MemoryStore snapshot.

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier — only real PyTorch names.
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases
  (ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-12 19:36:59 -07:00

19 KiB

Raw Blame History

CCL Algorithm Author Guide

이 문서는 kernbench에서 CCL (Collective Communication Library) 알고리즘을 직접 작성하는 사람을 위한 step-by-step 가이드이다. 시스템 내부 설계와 컴포넌트 구조는 ADR-0023에 있다.

본 가이드는 알고리즘 작성자가 자신이 만져야 할 곳과 만지지 않아도 될 곳을 명확히 분리하고, 가장 짧은 경로로 첫 알고리즘을 동작시키는 것을 목표로 한다.

0. 5분 요약

만지는 것	위치
알고리즘 모듈 (kernel + 선택적 neighbors)	`src/kernbench/ccl/algorithms/<algo>.py`
알고리즘 등록	`ccl.yaml`
호스트 bench (PE 수, 메모리 init, launch, 검증)	`benches/<your_bench>.py`
(선택) 단위 테스트	`tests/test_<algo>.py`

만지지 않는 것	위치
TLContext API	`src/kernbench/triton_emu/tl_context.py` (ADR-0022 spec)
프레임워크 (topology generators, helpers, mock testing)	`src/kernbench/ccl/`
PE_IPCQ / PE_DMA 컴포넌트	`src/kernbench/components/builtin/`
backend 구현 (install_ipcq)	`src/kernbench/runtime_api/distributed.py` 및 `kernbench/ccl/install.py`

흐름:

알고리즘 모듈에 kernel 작성
ccl.yaml에 entry 등록
호스트 bench에서 install_ipcq + launch
(선택) mock runtime으로 단위 테스트 (수 ms)
kernbench run --bench <name> --verify-data로 SimPy 검증

1. Hello World — 가장 단순한 send/recv

각 PE가 자기 데이터를 E 방향 이웃에 한 번 보내고, W 방향에서 한 번 받는 가장 단순한 알고리즘이다. 실제 동작 코드는 src/kernbench/ccl/algorithms/hello_send.py 에 있다.

Step 1: kernel 작성

새 파일 src/kernbench/ccl/algorithms/hello_send.py:

"""Hello world: 자기 데이터를 다음 rank에 보내고 이전 rank에서 받기."""
def kernel(t_ptr, n_elem, tl):
    # 글로벌 rank는 program_id(0/1)에서 계산 (ADR-0022)
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe

    nbytes = n_elem * 2  # f16
    pe_addr = t_ptr + rank * nbytes

    # 자기 슬라이스를 로드해서 E로 보낸다.
    src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    tl.send(dir="E", src=src)

    # W 방향에서 받아서 그대로 자기 슬라이스에 store한다.
    recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
    tl.store(pe_addr, recv)

핵심 포인트:

글로벌 rank는 program_id(axis=0) + program_id(axis=1)에서 계산. TL에는 tl.rank / tl.world_size 같은 약속되지 않은 확장이 없다. 호스트가 world_size 같은 알고리즘 파라미터가 필요하면 torch.launch의 일반 인자로 전달한다.
tl.send는 TensorHandle을 받는다. 핸들의 addr/space/shape/dtype/nbytes를 PE_IPCQ가 읽어 PE_DMA에 IpcqDmaToken을 발행한다.
tl.recv는 shape와 dtype이 필수. 반환된 TensorHandle은 IPCQ ring slot을 가리키며, tl.store(pe_addr, recv)처럼 dst 핸들로 그대로 사용할 수 있다. Phase 2 dma_write replay가 (slot, hbm) 복사를 수행하므로 numpy .data를 직접 만질 필요가 없다.

Step 2: ccl.yaml 등록

ccl.yaml의 algorithms 섹션에 entry를 추가한다. (defaults.algorithm은 호스트 bench가 install_ipcq(algorithm=...)로 명시 전달해도 되므로 꼭 바꿀 필요는 없다.)

algorithms:
  hello_send:
    module: kernbench.ccl.algorithms.hello_send
    topology: ring_1d
    buffer_kind: tcm

Step 3: 호스트 bench 작성

새 파일 benches/ccl_hello.py:

"""Hello-world ring rotation bench (각 PE가 W 이웃의 데이터를 1번 받음)."""
import numpy as np

from kernbench.ccl.algorithms import hello_send
from kernbench.policy.placement.dp import DPPolicy

ALGORITHM = "hello_send"
N_ELEM = 8
WORLD_SIZE = 8


def run(torch):
    plan = torch.install_ipcq(algorithm=ALGORITHM)

    a = torch.zeros(
        (1, WORLD_SIZE * N_ELEM), dtype="f16",
        dp=DPPolicy(
            sip="replicate", cube="replicate", pe="column_wise",
            num_sips=1, num_cubes=1,
        ),
        name="hello_in",
    )

    store = torch.engine.memory_store
    base = a._handle.va_base or a._handle.shards[0].pa
    nbytes = N_ELEM * 2
    for r in range(WORLD_SIZE):
        store.write("hbm", base + r * nbytes,
                    np.full((N_ELEM,), float(r + 1), dtype=np.float16))

    torch.launch(ALGORITHM, hello_send.kernel, a, N_ELEM)

    # rank r은 rank (r-1)%ws의 데이터를 가져야 한다.
    for r, (sip, cube, pe) in enumerate(plan["rank_to_pe"]):
        result = store.read("hbm", base + r * nbytes, shape=(N_ELEM,), dtype="f16")
        prev = float(((r - 1) % WORLD_SIZE) + 1)
        ok = np.allclose(result, prev)
        print(f"  [{'OK ' if ok else 'FAIL'}] rank {r} got {float(result.mean()):.1f}, "
              f"expected {prev:.1f}")

Step 4: 단위 테스트 (선택, 강력 추천)

tests/test_hello_send.py:

import numpy as np
from kernbench.ccl.algorithms.hello_send import kernel
from kernbench.ccl.testing import run_kernel_in_mock


def test_hello_send_4_ranks():
    n_elem = 8
    inputs = [np.full((n_elem,), float(r + 1), dtype=np.float16) for r in range(4)]

    outputs = run_kernel_in_mock(
        kernel_fn=kernel,
        world_size=4,
        topology="ring_1d",
        inputs=inputs,
        kernel_args=(n_elem,),
    )

    # rank r은 rank (r-1) % 4의 데이터를 받아야 함
    for r in range(4):
        assert np.array_equal(outputs[r], inputs[(r - 1) % 4])

run_kernel_in_mock는 SimPy 없이 순수 Python으로 모든 rank를 동시 실행하므로 ms 단위로 끝난다. 알고리즘 logic 정합성만 검증.

Step 5: 시뮬 검증

kernbench run --topology topology.yaml --bench ccl_hello --verify-data

Phase 1에서 SimPy 시뮬레이션 + MemoryStore 데이터 이동, Phase 2에서 op_log 정합성 replay. 호스트 bench의 print 검증이 모든 rank에 대해 OK여야 한다.

2. Ring All-Reduce — 두 번째 알고리즘

조금 더 복잡한 예제. Ring all-reduce는 N-1 라운드 동안 각 PE가 자기 데이터를 E로 보내고 W에서 받아 누적한다. 최종적으로 모든 PE가 글로벌 sum을 갖는다.

실제 동작 코드는 src/kernbench/ccl/algorithms/ring_allreduce.py 참조. 핵심 흐름:

"""Ring all-reduce."""


def kernel(t_ptr, n_elem, world_size, tl):
    # rank
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe
    nbytes = n_elem * 2
    pe_addr = t_ptr + rank * nbytes

    # HBM의 자기 슬라이스를 가리키는 TensorHandle. greenlet 모드에선 .data가
    # 채워지지만 커널은 .data를 직접 만질 필요가 없다.
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    current = acc  # 첫 라운드 send 출처

    for _step in range(world_size - 1):
        tl.send(dir="E", src=current)
        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
        # TensorHandle 연산자 오버로드 → MathCmd → PE_MATH 디스패치.
        # Phase 1은 타이밍만, Phase 2 DataExecutor가 실제 numpy 누적을 수행한다.
        acc = acc + recv
        current = recv  # 다음 라운드는 직전에 받은 슬롯을 다시 forward

    # 최종 누적값을 자기 슬라이스에 store. 출처는 acc(=PE-local scratch addr)
    # 이고 dst는 HBM. op_log dma_write가 (scratch, hbm) 복사 정보를 기록하므로
    # Phase 2가 검증 시점에 HBM[pe_addr]에 정답을 채워준다.
    tl.store(pe_addr, acc)

네 가지 포인트:

누적은 TensorHandle 연산자: acc + recv는 MathCmd를 emit하고 PE_MATH로 디스패치된다 — 실제 하드웨어 경로를 거치므로 latency 모델이 정확하다. ADR-0020 D3대로 Phase 1은 타이밍만 시뮬레이션하고, Phase 2 DataExecutor가 op_log를 재실행하면서 numpy 누적을 수행한다.
current = recv로 forward: 매 라운드의 send 출처를 직전에 받은 슬롯 핸들로 갱신해야 같은 데이터가 ring을 순회하면서 누적이 한 번씩 일어난다. current = acc로 두면 누적값이 다시 송출되어 결과가 부풀려진다.
tl.store(pe_addr, acc) 한 번이면 끝: 중간에 store→reload 패턴은 금지다. acc는 PE-local scratch에 살고, op_log가 (src=scratch, dst=hbm) 메타데이터를 기록한다. Phase 2가 math를 먼저 실행해 scratch를 채운 뒤 dma_write 스냅샷으로 HBM에 복사한다.
world_size는 호스트가 명시 전달: TL은 topology slot 수만 안다 (예: num_programs(axis=0)은 cube당 PE 수). 실제 참여하는 CCL group 크기는 bench가 알고 호스트→kernel 인자로 넘긴다.

ccl.yaml 등록 + 호스트 bench는 benches/ccl_allreduce_tcm.py 참조. mock 단위 테스트는 tests/test_ccl_mock_runtime.py 를 그대로 따라하면 된다 (kernel_args=(n_elem, world_size) 인자 형태).

3. neighbors() override — Custom topology

대부분의 알고리즘은 builtin topology(ring_1d, mesh_2d, tree_binary, ring_1d_unidir, none)로 충분하다. builtin을 변형하거나 새로 만들고 싶으면 알고리즘 모듈에 neighbors()를 정의한다.

시그니처

def neighbors(rank: int, world_size: int, neighbor_map: dict[str, int]) -> dict[str, int] | None:
    """builtin topology가 만든 neighbor_map을 override.

    Args:
        neighbor_map: ccl.yaml의 topology 필드가 만든 builtin 매핑.
                      예: ring_1d → {"E": (rank+1)%ws, "W": (rank-1)%ws}
                      mutable dict — 직접 수정 가능.

    Returns:
        dict: neighbor_map을 override한 결과 (또는 수정한 그 dict)
        None: override 안 함, neighbor_map 그대로 사용
    """
    return None

Pattern A: builtin을 base로 일부만 수정

def neighbors(rank, world_size, neighbor_map):
    # 짝수 rank만 W 방향 사용 (홀수 rank는 W 제거)
    if rank % 2 == 1:
        neighbor_map.pop("W", None)
    return neighbor_map

Pattern B: 완전히 새로 작성 (skip-connection ring)

def neighbors(rank, world_size, neighbor_map):
    # neighbor_map은 무시하고 새로 작성
    return {"E": (rank + 2) % world_size}

Pattern C: builtin 사용, override 없음

neighbors() 함수를 정의하지 않거나 None을 반환:

def neighbors(rank, world_size, neighbor_map):
    return None  # 명시적으로 builtin 사용

4. PE 커널 API 레퍼런스 (ADR-0023 D4)

IPCQ API

API	설명	Blocking?
`tl.send(dir, src=TensorHandle)`	direction으로 데이터 send	Yes (peer slot full 시 wait)
`tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)`	동일, keyword 형태	Yes
`tl.recv(dir, shape=..., dtype=...)`	특정 방향에서 blocking recv	Yes
`tl.recv(shape=..., dtype=...)`	4방향 round-robin recv (방향 미지정)	Yes
`tl.recv_async(dir, shape=..., dtype=...) → RecvFuture`	non-blocking recv	No
`tl.wait(future)`	non-blocking future 완료 대기 → TensorHandle	Yes

기존 TL API (ADR-0020/0022, 그대로 사용 가능)

API	설명
`tl.load(addr, shape, dtype) → TensorHandle`	DMA read; greenlet 모드에서 `.data`에 ndarray
`tl.store(addr, handle)`	DMA write — handle.data가 있으면 MemoryStore에 propagate
`tl.composite(op, ...)`	GEMM/Math compute 비동기 submit
`tl.program_id(axis=0)`	cube 내 local PE id
`tl.program_id(axis=1)`	cube id (ADR-0022)
`tl.num_programs(axis=0/1)`	topology 슬롯 수 (참여 ranks 수가 아님)

`recv` 두 가지 모드

기본은 return_slot (zero-copy): IPCQ slot 주소가 그대로 handle.addr에 들어온다. slot 데이터를 별도 위치로 복사하고 싶으면 dst_addr + dst_space를 명시:

recv = tl.recv(
    dir="W", shape=(8,), dtype="f16",
    dst_addr=my_scratch_addr,
    dst_space="hbm",
)
# 이제 recv.addr == my_scratch_addr (copy_to_dst 모드)

5. Helpers (`kernbench.ccl.helpers`)

알고리즘 코드를 짧게 유지하기 위한 헬퍼들:

from kernbench.ccl.helpers import chunked, ring_step, tree_step

`chunked(base_addr, n_chunks, n_elem, dtype="f16") → list[Chunk]`

총 n_elem 개의 element를 n_chunks 등분한 view 리스트를 반환. 각 Chunk는 addr, n_elem, nbytes 필드를 가진다.

chunks = chunked(t_ptr, n_chunks=4, n_elem=64, dtype="f16")
# chunks[0..3] 각각 16 element view, addr이 연속

`ring_step(rank, step, world_size) → (send_idx, recv_idx)`

Ring algorithm의 step별 chunk 인덱스 (reduce-scatter / all-gather):

for step in range(world_size - 1):
    send_idx, recv_idx = ring_step(rank, step, world_size)
    tl.send(dir="E", src_addr=chunks[send_idx].addr,
            nbytes=chunks[send_idx].nbytes,
            shape=(chunks[send_idx].n_elem,), dtype="f16")
    recv = tl.recv(dir="W", shape=(chunks[recv_idx].n_elem,), dtype="f16")
    # accumulate ...

`tree_step(rank, world_size) → {"parent": int|None, "children": list[int]}`

Binary tree의 parent/children rank:

info = tree_step(rank, world_size)
if info["parent"] is None:
    print(f"rank {rank} is the root")
for child in info["children"]:
    ...

6. 단위 테스트 — Mock Runtime

kernbench.ccl.testing.run_kernel_in_mock은 SimPy를 거치지 않고 알고리즘을 빠르게 검증할 수 있다.

기본 사용법

from kernbench.ccl.testing import run_kernel_in_mock
from kernbench.ccl.algorithms.my_algo import kernel
import numpy as np


def test_my_algo():
    n_elem = 16
    inputs = [np.arange(n_elem, dtype="f16") + r for r in range(4)]
    expected = sum(inputs)

    outputs = run_kernel_in_mock(
        kernel_fn=kernel,
        world_size=4,
        topology="ring_1d",
        inputs=inputs,
        kernel_args=(n_elem, 4),  # kernel의 (t_ptr 이후) 추가 positional 인자
    )

    for r in range(4):
        assert np.allclose(outputs[r], expected, rtol=1e-3)

동작

4개 rank의 kernel을 greenlet으로 동시 실행
tl.send/recv를 in-memory FIFO로 즉시 처리 (DMA, latency 무시)
각 rank가 마지막에 store한 데이터를 ndarray로 반환

한계

latency / 성능 측정 불가 (시뮬레이션이 아님)
PE_DMA, fabric, BW 모델 안 함
정합성 검증만 가능
한 cube 안에서 동작하는 가정 — program_id(axis=1)은 항상 0

7. 디버깅

CCL trace

KERNBENCH_CCL_TRACE=1 kernbench run --topology topology.yaml \
    --bench ccl_allreduce_tcm --verify-data

각 rank의 send/recv 시점이 stdout에 출력된다:

[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0
[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64
...

Pointer dump

kernbench.ccl.diagnostics.pointer_dump(engine)는 모든 PE_IPCQ의 ring buffer 상태(my_head, my_tail, peer_head_cache, peer_tail_cache)를 multi-line 문자열로 반환한다. hang이 발생하면 어느 rank가 어떤 상태에서 막혔는지 한눈에 보인다.

Deadlock detection

매칭되지 않는 send/recv 등으로 SimPy 스케줄이 비면 engine이 IpcqDeadlock을 던지며 pointer dump를 메시지에 포함시킨다 (ADR-0023 D14 F3). 별도 wait-for graph 시각화는 미래 작업.

8. 흔한 실수

1. install 안 된 direction 사용

ccl.yaml의 topology: ring_1d는 E/W만 install한다. N/S 사용 시:

tl.send(dir="N", ...)   # → IpcqInvalidDirection 예외

해결: topology: mesh_2d로 바꾸거나, neighbors() override로 N/S 추가.

2. send만 호출하고 recv 없음

def kernel(..., tl):
    for _ in range(100):
        tl.send(dir="E", ...)
    # peer 측 recv 없음 → ring buffer 가득 차면 backpressure → deadlock

해결: 모든 send에 짝이 되는 recv가 있어야 한다. 안 그러면 IpcqDeadlock이 발생한다.

3. dtype/shape 불일치

기본 모드에서는 dtype/shape mismatch를 검증하지 않는다. 작성자가 직접 보장하거나, PE_IPCQ 노드 attrs에 strict_validation: true를 설정해 D14 F2 strict 모드로 mismatch를 즉시 잡을 수 있다.

4. round-robin recv의 fairness 가정

tl.recv() (방향 미지정)는 round-robin으로 가져오지만, 도착한 첫 슬롯을 반환한다. 도착 순서를 알 수 없으므로 알고리즘이 도착 방향에 의존하면 안 된다. 필요하면 tl.recv(dir="N", ...)처럼 명시.

5. CCL 그룹 크기 가정

tl.num_programs(axis=0/1)은 토폴로지 슬롯 개수이지 CCL group 크기가 아니다. 참여하는 rank 수(world_size)는 호스트 bench가 알고 있고, kernel 인자로 명시 전달해야 한다.

6. 호스트가 send-source 메모리를 도착 전에 덮어씀

PE_DMA가 송신 시점에 src 데이터를 토큰에 스냅샷해서 in-flight 데이터의 의미가 보존된다. 그래도 하나의 PE 안에서 같은 주소를 여러 step에 걸쳐 갱신할 때는 direct send 후 다른 step에서 같은 주소를 store해도 안전하다 (token snapshot 덕분). 하지만 tl.send가 PE_DMA 큐에 enqueue되기 전에 주소를 덮어쓰면 잘못된 데이터가 스냅샷된다 — tl.send를 먼저, 메모리 변경을 나중에 하는 게 권장.

9. 다음 단계

mesh_2d / tree_binary 같은 다른 topology 활용
recursive halving/doubling 등 더 빠른 알고리즘
buffer_kind (tcm/hbm/sram) / backpressure (poll/sleep) 모드별 latency 비교
ccl_ring_allreduce_multicube.py, ccl_ring_allreduce_multisip.py처럼 큰 scale의 ring 검증

새 알고리즘이나 패턴을 추가했다면 PR로 기여해주세요.

참고

ADR-0023: IPCQ + PE-level collective 설계
ADR-0022: 2D grid program_id (axis=0/1)
ADR-0020: 2-pass data execution
ADR-0021: PE pipeline refactor

기존 알고리즘 예제:

src/kernbench/ccl/algorithms/hello_send.py — 가장 단순한 send/recv
src/kernbench/ccl/algorithms/ring_allreduce.py — ring all-reduce
src/kernbench/ccl/algorithms/mesh_allreduce.py — 2D mesh all-reduce
src/kernbench/ccl/algorithms/tree_allreduce.py — binary tree all-reduce

19 KiB Raw Blame History