kernbench2/tests/test_ccl_helpers.py
ywkang 998cc85762 Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)
Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.
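
A minimal sketch of the ring-buffer idea named above (head/tail pointers plus backpressure), assuming a single producer and a single consumer. `RingQueue`, `try_push`, and `try_pop` are hypothetical names for illustration only and are not the PE_IPCQ interface.

```python
from dataclasses import dataclass, field


@dataclass
class RingQueue:
    """Fixed-capacity ring buffer driven by head/tail counters (illustrative only)."""

    capacity: int
    head: int = 0  # number of entries consumed so far
    tail: int = 0  # number of entries produced so far
    slots: list = field(init=False)

    def __post_init__(self):
        self.slots = [None] * self.capacity

    def is_full(self) -> bool:
        return self.tail - self.head == self.capacity

    def is_empty(self) -> bool:
        return self.tail == self.head

    def try_push(self, item) -> bool:
        """Return False when full so the caller can back off (poll or sleep)."""
        if self.is_full():
            return False
        self.slots[self.tail % self.capacity] = item
        self.tail += 1
        return True

    def try_pop(self):
        """Return the oldest entry, or None when the queue is empty."""
        if self.is_empty():
            return None
        item = self.slots[self.head % self.capacity]
        self.head += 1
        return item
```

A sender that gets `False` from `try_push` would retry later; that retry loop is where a poll/sleep backpressure policy would live.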

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after
  each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to
  prevent stale data from corrupting the MemoryStore snapshot.

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.
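
The operator-overloading pattern described above can be sketched as follows, assuming a context object is registered per thread and each arithmetic operator records a command on it. `SketchHandle`, `SketchContext`, and `SketchMathCmd` are hypothetical stand-ins and do not reproduce the TLContext, TensorHandle, or MathCmd signatures.

```python
import threading
from dataclasses import dataclass

_active = threading.local()  # per-thread slot for the current context


@dataclass(frozen=True)
class SketchMathCmd:
    op: str
    lhs: "SketchHandle"
    rhs: "SketchHandle"
    out: "SketchHandle"


@dataclass(frozen=True)
class SketchHandle:
    name: str

    def _dispatch(self, op, other):
        # Look up the context registered for this thread, if any.
        ctx = getattr(_active, "ctx", None)
        if ctx is None:
            raise RuntimeError("no active context on this thread")
        return ctx.emit(op, self, other)

    def __add__(self, other):
        return self._dispatch("add", other)

    def __sub__(self, other):
        return self._dispatch("sub", other)

    def __mul__(self, other):
        return self._dispatch("mul", other)

    def __truediv__(self, other):
        return self._dispatch("div", other)


class SketchContext:
    def __init__(self):
        self.cmds = []

    def __enter__(self):
        self._prev = getattr(_active, "ctx", None)
        _active.ctx = self
        return self

    def __exit__(self, *exc):
        _active.ctx = self._prev  # restore whatever was active before
        return False

    def emit(self, op, lhs, rhs):
        # Allocate a fresh output handle and record the command.
        out = SketchHandle(f"tmp{len(self.cmds)}")
        self.cmds.append(SketchMathCmd(op, lhs, rhs, out))
        return out
```

Inside `with SketchContext() as ctx:`, an expression such as `a + b * a` records a `mul` followed by an `add` in `ctx.cmds`; outside any context the operators raise `RuntimeError`.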

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier — only real PyTorch names.
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.
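
As a worked example of the world_size rule above, here is a small sketch that multiplies the three topology counts and lets an explicit override win. The function name and the dictionary layout are assumptions; only the product sips × cubes × pes_per_cube and the existence of an override come from this commit message.

```python
from typing import Optional


def derive_world_size(topology: dict, override: Optional[int] = None) -> int:
    """Return sips * cubes * pes_per_cube unless an override is given.

    The dictionary keys are assumed for illustration; the real topology
    spec may be structured differently.
    """
    if override is not None:
        if override <= 0:
            raise ValueError("world_size override must be positive")
        return override
    return topology["sips"] * topology["cubes"] * topology["pes_per_cube"]
```

For a topology of 2 SIPs, 2 cubes, and 4 PEs per cube this gives 2 × 2 × 4 = 16 ranks; passing `override=7` would instead yield 7, which is the kind of size a tree test case might request.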

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases
  (ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 19:36:59 -07:00

"""Tests for CCL algorithm-author helpers (ADR-0023 D15)."""
from __future__ import annotations

import pytest

from kernbench.ccl.helpers import (
    Chunk,
    chunked,
    ring_step,
    tree_step,
)


# ── chunked ──────────────────────────────────────────────────────────


def test_chunked_basic():
    chunks = chunked(base_addr=0x1000, n_chunks=4, n_elem=64, dtype="f16")
    assert len(chunks) == 4
    # Each chunk has 16 elements (64 / 4)
    assert chunks[0] == Chunk(addr=0x1000, n_elem=16, nbytes=32)
    assert chunks[1] == Chunk(addr=0x1020, n_elem=16, nbytes=32)
    assert chunks[2] == Chunk(addr=0x1040, n_elem=16, nbytes=32)
    assert chunks[3] == Chunk(addr=0x1060, n_elem=16, nbytes=32)


def test_chunked_f32():
    chunks = chunked(base_addr=0x100, n_chunks=2, n_elem=8, dtype="f32")
    assert chunks[0].nbytes == 16  # 4 elem × 4 bytes
    assert chunks[1].addr == 0x100 + 16


def test_chunked_uneven_raises():
    with pytest.raises(ValueError):
        chunked(base_addr=0x100, n_chunks=3, n_elem=10, dtype="f16")


# ── ring_step ────────────────────────────────────────────────────────


def test_ring_step_4_ranks():
    # Standard reduce-scatter ring step:
    # at step s, rank r sends chunk (r-s) and receives chunk (r-s-1) (mod ws)
    assert ring_step(rank=0, step=0, world_size=4) == (0, 3)
    assert ring_step(rank=0, step=1, world_size=4) == (3, 2)
    assert ring_step(rank=1, step=0, world_size=4) == (1, 0)
    assert ring_step(rank=2, step=0, world_size=4) == (2, 1)


# ── tree_step ────────────────────────────────────────────────────────


def test_tree_step_root():
    info = tree_step(rank=0, world_size=7)
    assert info["parent"] is None
    assert info["children"] == [1, 2]


def test_tree_step_internal():
    info = tree_step(rank=1, world_size=7)
    assert info["parent"] == 0
    assert info["children"] == [3, 4]


def test_tree_step_leaf():
    info = tree_step(rank=4, world_size=7)
    assert info["parent"] == 1
    assert info["children"] == []
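
For readers without access to `kernbench.ccl.helpers`, the following is one possible implementation that satisfies the assertions above. It is a sketch inferred from the tests alone: the dtype size table, the binary-heap rank layout for the tree, and the error message are assumptions, and the actual helpers may differ.

```python
from dataclasses import dataclass

# Assumed element sizes in bytes; the real helper may support more dtypes.
_DTYPE_BYTES = {"f16": 2, "f32": 4}


@dataclass(frozen=True)
class Chunk:
    addr: int
    n_elem: int
    nbytes: int


def chunked(base_addr, n_chunks, n_elem, dtype):
    """Split n_elem elements into n_chunks equal, contiguous chunks."""
    if n_elem % n_chunks != 0:
        raise ValueError(f"{n_elem} elements do not divide into {n_chunks} chunks")
    per_chunk = n_elem // n_chunks
    nbytes = per_chunk * _DTYPE_BYTES[dtype]
    return [
        Chunk(addr=base_addr + i * nbytes, n_elem=per_chunk, nbytes=nbytes)
        for i in range(n_chunks)
    ]


def ring_step(rank, step, world_size):
    """Return (send_chunk, recv_chunk) indices for one reduce-scatter step."""
    send = (rank - step) % world_size
    recv = (rank - step - 1) % world_size
    return send, recv


def tree_step(rank, world_size):
    """Return parent and children for a rank, assuming a binary-heap layout."""
    parent = None if rank == 0 else (rank - 1) // 2
    children = [c for c in (2 * rank + 1, 2 * rank + 2) if c < world_size]
    return {"parent": parent, "children": children}
```

Python's `%` operator returns a non-negative result for a positive modulus, so `(rank - step) % world_size` wraps correctly when `step` exceeds `rank`; that is why rank 0 at step 1 sends chunk 3 in a 4-rank ring.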