ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

Filename + lifecycle: - ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable. - ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2: docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft), docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for retroactive docs pending verification. Merges (one ADR per topic, no change-history annotations): - ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items) - ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl. TileToken self-routing and multi-op composite epilogue scope) - ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md deleted; ADR-0019/0021 moved to adr-history with one-line stub status Retroactive documentation (G4 closures, code-verified): - ADR-0037 forwarding component (TransitComponent: first-flit overhead, serial worker, path-based routing, single impl/multiple names) - ADR-0036 IO_CPU component (target_start_ns global barrier stamping, per-cube fan-out, response aggregation) - ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources, target_start_ns passthrough) - ADR-0034 HBM controller internal design (per-PC state, address-based selection, flit-aware per-flit commit, async finalize, command-only fallback path) Content updates: - ADR-0010 expanded to full CLI surface (run/probe/web), retitled "Command Line Interface and Execution Semantics" - ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned - ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata block replaced with standard Status header - ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4); ADR-0027 cleaned of supersession history - ADR-0033 D6 cleanup: address-based PC selection moved out of future-work (now documented in ADR-0034 D3); related D1/D3 wording realigned - Cross-references back-filled in 5 ADRs (G3 gaps closed) Onboarding docs split: - docs/onboarding/ created - moved: hw-architecture-overview.md, latency-model.md, di-presentation.md, ccl-author-guide{,.en}.md - references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8). No behavior change. Tooling: - tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py (ADR EN/KO pair invariant checker) - .claude/commands/report.md tracked (/report slash command) - .gitignore: allow .claude/commands/*.md while keeping settings files ignored Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00
parent 22fd0d2b9d
commit 687c98086d
97 changed files with 3286 additions and 3766 deletions
@@ -0,0 +1,537 @@
+# CCL Algorithm Author Guide
+
+이 문서는 kernbench에서 CCL (Collective Communication Library) 알고리즘을
+직접 작성하는 사람을 위한 step-by-step 가이드이다. 시스템 내부 설계와
+컴포넌트 구조는 [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md)에 있다.
+
+본 가이드는 알고리즘 작성자가 **자신이 만져야 할 곳**과 **만지지 않아도 될 곳**을
+명확히 분리하고, 가장 짧은 경로로 첫 알고리즘을 동작시키는 것을 목표로 한다.
+
+---
+
+## 0. 5분 요약
+
+| 만지는 것 | 위치 |
+|----------|------|
+| 알고리즘 모듈 (kernel + 선택적 neighbors) | `src/kernbench/ccl/algorithms/<algo>.py` |
+| 알고리즘 등록 | `ccl.yaml` |
+| 호스트 bench (PE 수, 메모리 init, launch, 검증) | `benches/<your_bench>.py` |
+| (선택) 단위 테스트 | `tests/test_<algo>.py` |
+
+| 만지지 않는 것 | 위치 |
+|---------------|------|
+| TLContext API | `src/kernbench/triton_emu/tl_context.py` (ADR-0022 spec) |
+| 프레임워크 (topology generators, helpers, mock testing) | `src/kernbench/ccl/` |
+| PE_IPCQ / PE_DMA 컴포넌트 | `src/kernbench/components/builtin/` |
+| backend 구현 (install_ipcq) | `src/kernbench/runtime_api/distributed.py` 및 `kernbench/ccl/install.py` |
+
+흐름:
+1. 알고리즘 모듈에 `kernel` 작성
+2. `ccl.yaml`에 entry 등록
+3. 호스트 bench에서 `install_ipcq` + `launch`
+4. (선택) mock runtime으로 단위 테스트 (수 ms)
+5. `kernbench run --bench <name> --verify-data`로 SimPy 검증
+
+---
+
+## 1. Hello World — 가장 단순한 send/recv
+
+각 PE가 자기 데이터를 E 방향 이웃에 한 번 보내고, W 방향에서 한 번 받는
+가장 단순한 알고리즘이다. 실제 동작 코드는
+[`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py)
+에 있다.
+
+### Step 1: kernel 작성
+
+새 파일 `src/kernbench/ccl/algorithms/hello_send.py`:
+
+```python
+"""Hello world: 자기 데이터를 다음 rank에 보내고 이전 rank에서 받기."""
+def kernel(t_ptr, n_elem, tl):
+    # 글로벌 rank는 program_id(0/1)에서 계산 (ADR-0022)
+    local_pe = tl.program_id(axis=0)
+    cube_id = tl.program_id(axis=1)
+    pes_per_cube = tl.num_programs(axis=0)
+    rank = cube_id * pes_per_cube + local_pe
+
+    nbytes = n_elem * 2  # f16
+    pe_addr = t_ptr + rank * nbytes
+
+    # 자기 슬라이스를 로드해서 E로 보낸다.
+    src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
+    tl.send(dir="E", src=src)
+
+    # W 방향에서 받아서 그대로 자기 슬라이스에 store한다.
+    recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
+    tl.store(pe_addr, recv)
+```
+
+핵심 포인트:
+
+- **글로벌 rank는 `program_id(axis=0)` + `program_id(axis=1)`에서 계산.** TL에는
+  `tl.rank` / `tl.world_size` 같은 약속되지 않은 확장이 없다. 호스트가
+  `world_size` 같은 알고리즘 파라미터가 필요하면 `torch.launch`의 일반 인자로
+  전달한다.
+- **`tl.send`는 `TensorHandle`을 받는다.** 핸들의 `addr`/`space`/`shape`/`dtype`/`nbytes`를
+  PE_IPCQ가 읽어 PE_DMA에 IpcqDmaToken을 발행한다.
+- **`tl.recv`는 `shape`와 `dtype`이 필수.** 반환된 TensorHandle은 IPCQ ring slot을
+  가리키며, `tl.store(pe_addr, recv)`처럼 dst 핸들로 그대로 사용할 수 있다.
+  Phase 2 dma_write replay가 (slot, hbm) 복사를 수행하므로 numpy `.data`를
+  직접 만질 필요가 없다.
+
+### Step 2: ccl.yaml 등록
+
+`ccl.yaml`의 `algorithms` 섹션에 entry를 추가한다. (defaults.algorithm은 호스트
+bench가 `install_ipcq(algorithm=...)`로 명시 전달해도 되므로 꼭 바꿀 필요는 없다.)
+
+```yaml
+algorithms:
+  hello_send:
+    module: kernbench.ccl.algorithms.hello_send
+    topology: ring_1d
+    buffer_kind: tcm
+```
+
+### Step 3: 호스트 bench 작성
+
+새 파일 `benches/ccl_hello.py`:
+
+```python
+"""Hello-world ring rotation bench (각 PE가 W 이웃의 데이터를 1번 받음)."""
+import numpy as np
+
+from kernbench.ccl.algorithms import hello_send
+from kernbench.policy.placement.dp import DPPolicy
+
+ALGORITHM = "hello_send"
+N_ELEM = 8
+WORLD_SIZE = 8
+
+
+def run(torch):
+    plan = torch.install_ipcq(algorithm=ALGORITHM)
+
+    a = torch.zeros(
+        (1, WORLD_SIZE * N_ELEM), dtype="f16",
+        dp=DPPolicy(
+            cube="replicate", pe="column_wise",
+            num_cubes=1,
+        ),
+        name="hello_in",
+    )
+
+    store = torch.engine.memory_store
+    base = a._handle.va_base or a._handle.shards[0].pa
+    nbytes = N_ELEM * 2
+    for r in range(WORLD_SIZE):
+        store.write("hbm", base + r * nbytes,
+                    np.full((N_ELEM,), float(r + 1), dtype=np.float16))
+
+    torch.launch(ALGORITHM, hello_send.kernel, a, N_ELEM)
+
+    # rank r은 rank (r-1)%ws의 데이터를 가져야 한다.
+    for r, (sip, cube, pe) in enumerate(plan["rank_to_pe"]):
+        result = store.read("hbm", base + r * nbytes, shape=(N_ELEM,), dtype="f16")
+        prev = float(((r - 1) % WORLD_SIZE) + 1)
+        ok = np.allclose(result, prev)
+        print(f"  [{'OK ' if ok else 'FAIL'}] rank {r} got {float(result.mean()):.1f}, "
+              f"expected {prev:.1f}")
+```
+
+### Step 4: 단위 테스트 (선택, 강력 추천)
+
+`tests/test_hello_send.py`:
+
+```python
+import numpy as np
+from kernbench.ccl.algorithms.hello_send import kernel
+from kernbench.ccl.testing import run_kernel_in_mock
+
+
+def test_hello_send_4_ranks():
+    n_elem = 8
+    inputs = [np.full((n_elem,), float(r + 1), dtype=np.float16) for r in range(4)]
+
+    outputs = run_kernel_in_mock(
+        kernel_fn=kernel,
+        world_size=4,
+        topology="ring_1d",
+        inputs=inputs,
+        kernel_args=(n_elem,),
+    )
+
+    # rank r은 rank (r-1) % 4의 데이터를 받아야 함
+    for r in range(4):
+        assert np.array_equal(outputs[r], inputs[(r - 1) % 4])
+```
+
+`run_kernel_in_mock`는 SimPy 없이 순수 Python으로 모든 rank를 동시 실행하므로
+**ms 단위로 끝난다**. 알고리즘 logic 정합성만 검증.
+
+### Step 5: 시뮬 검증
+
+```bash
+kernbench run --topology topology.yaml --bench ccl_hello --verify-data
+```
+
+Phase 1에서 SimPy 시뮬레이션 + MemoryStore 데이터 이동, Phase 2에서 op_log
+정합성 replay. 호스트 bench의 `print` 검증이 모든 rank에 대해 OK여야 한다.
+
+---
+
+## 2. Ring All-Reduce — 두 번째 알고리즘
+
+조금 더 복잡한 예제. Ring all-reduce는 N-1 라운드 동안 각 PE가 자기 데이터를
+E로 보내고 W에서 받아 누적한다. 최종적으로 모든 PE가 글로벌 sum을 갖는다.
+
+실제 동작 코드는 [`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py)
+참조. 핵심 흐름:
+
+```python
+"""Ring all-reduce."""
+
+
+def kernel(t_ptr, n_elem, world_size, tl):
+    # rank
+    local_pe = tl.program_id(axis=0)
+    cube_id = tl.program_id(axis=1)
+    pes_per_cube = tl.num_programs(axis=0)
+    rank = cube_id * pes_per_cube + local_pe
+    nbytes = n_elem * 2
+    pe_addr = t_ptr + rank * nbytes
+
+    # HBM의 자기 슬라이스를 가리키는 TensorHandle. greenlet 모드에선 .data가
+    # 채워지지만 커널은 .data를 직접 만질 필요가 없다.
+    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
+    current = acc  # 첫 라운드 send 출처
+
+    for _step in range(world_size - 1):
+        tl.send(dir="E", src=current)
+        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
+        # TensorHandle 연산자 오버로드 → MathCmd → PE_MATH 디스패치.
+        # Phase 1은 타이밍만, Phase 2 DataExecutor가 실제 numpy 누적을 수행한다.
+        acc = acc + recv
+        current = recv  # 다음 라운드는 직전에 받은 슬롯을 다시 forward
+
+    # 최종 누적값을 자기 슬라이스에 store. 출처는 acc(=PE-local scratch addr)
+    # 이고 dst는 HBM. op_log dma_write가 (scratch, hbm) 복사 정보를 기록하므로
+    # Phase 2가 검증 시점에 HBM[pe_addr]에 정답을 채워준다.
+    tl.store(pe_addr, acc)
+```
+
+네 가지 포인트:
+
+1. **누적은 TensorHandle 연산자**: `acc + recv`는 `MathCmd`를 emit하고
+   PE_MATH로 디스패치된다 — 실제 하드웨어 경로를 거치므로 latency 모델이
+   정확하다. ADR-0020 D3대로 Phase 1은 타이밍만 시뮬레이션하고, Phase 2
+   `DataExecutor`가 op_log를 재실행하면서 numpy 누적을 수행한다.
+2. **`current = recv`로 forward**: 매 라운드의 send 출처를 직전에 받은 슬롯
+   핸들로 갱신해야 같은 데이터가 ring을 순회하면서 누적이 한 번씩 일어난다.
+   `current = acc`로 두면 누적값이 다시 송출되어 결과가 부풀려진다.
+3. **`tl.store(pe_addr, acc)` 한 번이면 끝**: 중간에 store→reload 패턴은
+   금지다. acc는 PE-local scratch에 살고, op_log가 (src=scratch, dst=hbm)
+   메타데이터를 기록한다. Phase 2가 math를 먼저 실행해 scratch를 채운 뒤
+   dma_write 스냅샷으로 HBM에 복사한다.
+4. **`world_size`는 호스트가 명시 전달**: TL은 topology slot 수만 안다 (예:
+   `num_programs(axis=0)`은 cube당 PE 수). 실제 참여하는 CCL group 크기는 bench가
+   알고 호스트→kernel 인자로 넘긴다.
+
+`ccl.yaml` 등록 + 호스트 bench는 [`benches/ccl_allreduce_tcm.py`](../benches/ccl_allreduce_tcm.py)
+참조. mock 단위 테스트는 [`tests/test_ccl_mock_runtime.py`](../tests/test_ccl_mock_runtime.py)
+를 그대로 따라하면 된다 (`kernel_args=(n_elem, world_size)` 인자 형태).
+
+---
+
+## 3. neighbors() override — Custom topology
+
+대부분의 알고리즘은 builtin topology(`ring_1d`, `mesh_2d`, `tree_binary`,
+`ring_1d_unidir`, `none`)로 충분하다. builtin을 변형하거나 새로 만들고 싶으면
+알고리즘 모듈에 `neighbors()`를 정의한다.
+
+### 시그니처
+
+```python
+def neighbors(rank: int, world_size: int, neighbor_map: dict[str, int]) -> dict[str, int] | None:
+    """builtin topology가 만든 neighbor_map을 override.
+
+    Args:
+        neighbor_map: ccl.yaml의 topology 필드가 만든 builtin 매핑.
+                      예: ring_1d → {"E": (rank+1)%ws, "W": (rank-1)%ws}
+                      mutable dict — 직접 수정 가능.
+
+    Returns:
+        dict: neighbor_map을 override한 결과 (또는 수정한 그 dict)
+        None: override 안 함, neighbor_map 그대로 사용
+    """
+    return None
+```
+
+### Pattern A: builtin을 base로 일부만 수정
+
+```python
+def neighbors(rank, world_size, neighbor_map):
+    # 짝수 rank만 W 방향 사용 (홀수 rank는 W 제거)
+    if rank % 2 == 1:
+        neighbor_map.pop("W", None)
+    return neighbor_map
+```
+
+### Pattern B: 완전히 새로 작성 (skip-connection ring)
+
+```python
+def neighbors(rank, world_size, neighbor_map):
+    # neighbor_map은 무시하고 새로 작성
+    return {"E": (rank + 2) % world_size}
+```
+
+### Pattern C: builtin 사용, override 없음
+
+`neighbors()` 함수를 정의하지 않거나 None을 반환:
+
+```python
+def neighbors(rank, world_size, neighbor_map):
+    return None  # 명시적으로 builtin 사용
+```
+
+---
+
+## 4. PE 커널 API 레퍼런스 (ADR-0023 D4)
+
+### IPCQ API
+
+| API | 설명 | Blocking? |
+|-----|------|-----------|
+| `tl.send(dir, src=TensorHandle)` | direction으로 데이터 send | Yes (peer slot full 시 wait) |
+| `tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)` | 동일, keyword 형태 | Yes |
+| `tl.recv(dir, shape=..., dtype=...)` | 특정 방향에서 blocking recv | Yes |
+| `tl.recv(shape=..., dtype=...)` | 4방향 round-robin recv (방향 미지정) | Yes |
+| `tl.recv_async(dir, shape=..., dtype=...) → RecvFuture` | non-blocking recv | No |
+| `tl.wait(future)` | non-blocking future 완료 대기 → TensorHandle | Yes |
+
+### 기존 TL API (ADR-0020/0022, 그대로 사용 가능)
+
+| API | 설명 |
+|-----|------|
+| `tl.load(addr, shape, dtype) → TensorHandle` | DMA read; greenlet 모드에서 `.data`에 ndarray |
+| `tl.store(addr, handle)` | DMA write — handle.data가 있으면 MemoryStore에 propagate |
+| `tl.composite(op, ...)` | GEMM/Math compute 비동기 submit |
+| `tl.program_id(axis=0)` | cube 내 local PE id |
+| `tl.program_id(axis=1)` | cube id (ADR-0022) |
+| `tl.num_programs(axis=0/1)` | topology 슬롯 수 (참여 ranks 수가 아님) |
+
+### `recv` 두 가지 모드
+
+기본은 `return_slot` (zero-copy): IPCQ slot 주소가 그대로 handle.addr에 들어온다.
+slot 데이터를 별도 위치로 복사하고 싶으면 `dst_addr` + `dst_space`를 명시:
+
+```python
+recv = tl.recv(
+    dir="W", shape=(8,), dtype="f16",
+    dst_addr=my_scratch_addr,
+    dst_space="hbm",
+)
+# 이제 recv.addr == my_scratch_addr (copy_to_dst 모드)
+```
+
+---
+
+## 5. Helpers (`kernbench.ccl.helpers`)
+
+알고리즘 코드를 짧게 유지하기 위한 헬퍼들:
+
+```python
+from kernbench.ccl.helpers import chunked, ring_step, tree_step
+```
+
+### `chunked(base_addr, n_chunks, n_elem, dtype="f16") → list[Chunk]`
+
+총 `n_elem` 개의 element를 `n_chunks` 등분한 view 리스트를 반환. 각 `Chunk`는
+`addr`, `n_elem`, `nbytes` 필드를 가진다.
+
+```python
+chunks = chunked(t_ptr, n_chunks=4, n_elem=64, dtype="f16")
+# chunks[0..3] 각각 16 element view, addr이 연속
+```
+
+### `ring_step(rank, step, world_size) → (send_idx, recv_idx)`
+
+Ring algorithm의 step별 chunk 인덱스 (reduce-scatter / all-gather):
+
+```python
+for step in range(world_size - 1):
+    send_idx, recv_idx = ring_step(rank, step, world_size)
+    tl.send(dir="E", src_addr=chunks[send_idx].addr,
+            nbytes=chunks[send_idx].nbytes,
+            shape=(chunks[send_idx].n_elem,), dtype="f16")
+    recv = tl.recv(dir="W", shape=(chunks[recv_idx].n_elem,), dtype="f16")
+    # accumulate ...
+```
+
+### `tree_step(rank, world_size) → {"parent": int|None, "children": list[int]}`
+
+Binary tree의 parent/children rank:
+
+```python
+info = tree_step(rank, world_size)
+if info["parent"] is None:
+    print(f"rank {rank} is the root")
+for child in info["children"]:
+    ...
+```
+
+---
+
+## 6. 단위 테스트 — Mock Runtime
+
+`kernbench.ccl.testing.run_kernel_in_mock`은 SimPy를 거치지 않고 알고리즘을
+빠르게 검증할 수 있다.
+
+### 기본 사용법
+
+```python
+from kernbench.ccl.testing import run_kernel_in_mock
+from kernbench.ccl.algorithms.my_algo import kernel
+import numpy as np
+
+
+def test_my_algo():
+    n_elem = 16
+    inputs = [np.arange(n_elem, dtype="f16") + r for r in range(4)]
+    expected = sum(inputs)
+
+    outputs = run_kernel_in_mock(
+        kernel_fn=kernel,
+        world_size=4,
+        topology="ring_1d",
+        inputs=inputs,
+        kernel_args=(n_elem, 4),  # kernel의 (t_ptr 이후) 추가 positional 인자
+    )
+
+    for r in range(4):
+        assert np.allclose(outputs[r], expected, rtol=1e-3)
+```
+
+### 동작
+
+- 4개 rank의 kernel을 greenlet으로 동시 실행
+- `tl.send/recv`를 in-memory FIFO로 즉시 처리 (DMA, latency 무시)
+- 각 rank가 마지막에 store한 데이터를 ndarray로 반환
+
+### 한계
+
+- latency / 성능 측정 불가 (시뮬레이션이 아님)
+- PE_DMA, fabric, BW 모델 안 함
+- 정합성 검증만 가능
+- 한 cube 안에서 동작하는 가정 — `program_id(axis=1)`은 항상 0
+
+---
+
+## 7. 디버깅
+
+### CCL trace
+
+```bash
+KERNBENCH_CCL_TRACE=1 kernbench run --topology topology.yaml \
+    --bench ccl_allreduce_tcm --verify-data
+```
+
+각 rank의 send/recv 시점이 stdout에 출력된다:
+
+```
+[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0
+[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64
+...
+```
+
+### Pointer dump
+
+`kernbench.ccl.diagnostics.pointer_dump(engine)`는 모든 PE_IPCQ의 ring buffer
+상태(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`)를 multi-line
+문자열로 반환한다. hang이 발생하면 어느 rank가 어떤 상태에서 막혔는지 한눈에
+보인다.
+
+### Deadlock detection
+
+매칭되지 않는 send/recv 등으로 SimPy 스케줄이 비면 engine이 `IpcqDeadlock`을
+던지며 pointer dump를 메시지에 포함시킨다 (ADR-0023 D14 F3). 별도 wait-for graph
+시각화는 미래 작업.
+
+---
+
+## 8. 흔한 실수
+
+### 1. install 안 된 direction 사용
+
+ccl.yaml의 `topology: ring_1d`는 E/W만 install한다. N/S 사용 시:
+
+```python
+tl.send(dir="N", ...)   # → IpcqInvalidDirection 예외
+```
+
+해결: `topology: mesh_2d`로 바꾸거나, `neighbors()` override로 N/S 추가.
+
+### 2. send만 호출하고 recv 없음
+
+```python
+def kernel(..., tl):
+    for _ in range(100):
+        tl.send(dir="E", ...)
+    # peer 측 recv 없음 → ring buffer 가득 차면 backpressure → deadlock
+```
+
+해결: 모든 send에 짝이 되는 recv가 있어야 한다. 안 그러면 `IpcqDeadlock`이
+발생한다.
+
+### 3. dtype/shape 불일치
+
+기본 모드에서는 dtype/shape mismatch를 검증하지 않는다. 작성자가 직접 보장하거나,
+PE_IPCQ 노드 attrs에 `strict_validation: true`를 설정해 D14 F2 strict 모드로
+mismatch를 즉시 잡을 수 있다.
+
+### 4. round-robin recv의 fairness 가정
+
+`tl.recv()` (방향 미지정)는 round-robin으로 가져오지만, 도착한 첫 슬롯을 반환한다.
+**도착 순서를 알 수 없으므로** 알고리즘이 도착 방향에 의존하면 안 된다.
+필요하면 `tl.recv(dir="N", ...)`처럼 명시.
+
+### 5. CCL 그룹 크기 가정
+
+`tl.num_programs(axis=0/1)`은 토폴로지 슬롯 개수이지 CCL group 크기가 아니다.
+참여하는 rank 수(`world_size`)는 호스트 bench가 알고 있고, kernel 인자로 명시
+전달해야 한다.
+
+### 6. 호스트가 send-source 메모리를 도착 전에 덮어씀
+
+PE_DMA가 송신 시점에 src 데이터를 토큰에 스냅샷해서 in-flight 데이터의 의미가
+보존된다. 그래도 하나의 PE 안에서 같은 주소를 여러 step에 걸쳐 갱신할 때는
+direct send 후 다른 step에서 같은 주소를 store해도 안전하다 (token snapshot 덕분).
+하지만 `tl.send`가 PE_DMA 큐에 enqueue되기 전에 주소를 덮어쓰면 잘못된 데이터가
+스냅샷된다 — `tl.send`를 먼저, 메모리 변경을 나중에 하는 게 권장.
+
+---
+
+## 9. 다음 단계
+
+- `mesh_2d` / `tree_binary` 같은 다른 topology 활용
+- recursive halving/doubling 등 더 빠른 알고리즘
+- `buffer_kind` (tcm/hbm/sram) / `backpressure` (poll/sleep) 모드별 latency 비교
+- `ccl_ring_allreduce_multicube.py`, `ccl_ring_allreduce_multisip.py`처럼 큰
+  scale의 ring 검증
+
+새 알고리즘이나 패턴을 추가했다면 PR로 기여해주세요.
+
+---
+
+## 참고
+
+- [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective 설계
+- [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1)
+- [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution
+- [ADR-0014](adr/ADR-0014-dev-pe-pipeline-execution-model.md): PE pipeline execution model
+
+기존 알고리즘 예제:
+
+- [`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py) — 가장 단순한 send/recv
+- [`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py) — ring all-reduce
+- [`src/kernbench/ccl/algorithms/mesh_allreduce.py`](../src/kernbench/ccl/algorithms/mesh_allreduce.py) — 2D mesh all-reduce
+- [`src/kernbench/ccl/algorithms/tree_allreduce.py`](../src/kernbench/ccl/algorithms/tree_allreduce.py) — binary tree all-reduce