Files
kernbench2/ccl.yaml
T
ywkang 998cc85762 Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)
Major changes:

PE-level IPCQ infrastructure:
- New PE_IPCQ component: ring-buffer control plane with 4-direction
  neighbor mapping, head/tail pointers, backpressure (poll/sleep).
- PE_DMA extended with vc_comm channel for IPCQ outbound/inbound DMA,
  including in-flight data snapshot (D9) and op_log recording at
  outbound time for Phase 2 replay correctness.
- IpcqDmaToken piggyback model: data + metadata travel together,
  atomic visibility at receiver (invariant I6).
- Credit return fast path: bottleneck-BW latency, no fabric vc_comm.

Phase 2 data execution (ADR-0020 integration):
- op_log extended: DmaWriteCmd now captures src_space/src_addr for
  Phase 2 dma_write copy; ipcq_copy ops recorded at outbound time.
- DataExecutor replays dma_write + ipcq_copy in t_start order.
- Engine._flush_data_phase: incremental cursor-based replay after
  each engine.wait() so host reads see post-Phase-2 data.
- KernelRunner Phase 1 writes disabled when op_log is active to
  prevent stale data from corrupting the MemoryStore snapshot.

TLContext / kernel API:
- tl.send(dir, src=TensorHandle), tl.recv(dir, shape, dtype),
  tl.recv_async, tl.wait(RecvFuture), copy_to_dst mode.
- TensorHandle operator overloading (add/sub/mul/div) via thread-local
  active TLContext → MathCmd dispatch through PE_MATH.
- PE-local scratch allocator for math output handles.
- tl.load returns space="hbm" handles for correct Phase 2 addressing.
- Additional math functions: maximum, minimum, fma, clamp, softmax, cdiv.

Unified ccl_allreduce bench (PyTorch-compat host code):
- Single benches/ccl_allreduce.py with run() + worker(rank, ws, torch)
  split matching real PyTorch DDP worker pattern.
- torch.distributed facade: init_process_group, get_world_size,
  get_rank, get_backend, all_reduce, barrier — only real PyTorch names.
- AhbmCCLBackend: eager install_ipcq at init, all_reduce dispatches
  kernel via tensor shard metadata (n_elem from shards[0].nbytes).
- world_size derived from topology spec (sips × cubes × pes_per_cube)
  with optional algorithm-level override in ccl.yaml.

Tensor API (PyTorch-compat surface):
- Tensor.numpy(): gather-aware (all shards via VA-based addressing).
- Tensor.copy_(source): scatter from host tensor into sharded target.
- RuntimeContext.from_numpy(arr): host-side staging tensor.
- Tensor.data property fixed to use numpy() (was shards[0]-only).

Algorithm modules moved to src/kernbench/ccl/algorithms/:
- ring_allreduce, mesh_allreduce, tree_allreduce, hello_send.
- Each module exports kernel_args(world_size, n_elem) helper.
- ccl.yaml module paths updated to kernbench.ccl.algorithms.*.

Dead code removed:
- 7 per-variant bench files (ccl_allreduce_{tcm,hbm,sram}, etc.).
- _run_ccl_bench greenlet-per-SIP scheduler.
- benches.loader.is_ccl_bench + run_rank detection.
- benches/ccl/ directory.

Tests:
- New test_ccl_allreduce_matrix.py: 7 parametrized cases
  (ring×3 buffers, ring 8/16, mesh 4, tree 7).
- New test_runtime_api_tensor.py: copy_/numpy/from_numpy unit tests.
- Existing tests updated for new import paths + world_size_override.

Docs:
- Korean ccl-author-guide.md and ADR-0023 paths updated.
- New English versions: ccl-author-guide.en.md, ADR-0023.en.md.

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 19:36:59 -07:00

81 lines
2.7 KiB
YAML
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ccl.yaml — CCL backend (ahbm) configuration (ADR-0023 D11)
#
# Loaded by AhbmCCLBackend at init_process_group time.
# defaults.algorithm chooses which kernel + topology is installed
# into PE_IPCQ neighbor tables. Host code is unaware of these settings.
defaults:
# Algorithm to run for this benchmark execution.
algorithm: ring_allreduce_tcm
# NOTE: world_size is not set here by default. AhbmCCLBackend derives it
# from the chosen algorithm's entry (if it sets ``world_size``) or from
# topology.yaml (``sips × cubes_per_sip × pes_per_cube``). This mirrors
# real PyTorch DDP where ranks/world_size come from env vars, not code.
# IPCQ ring buffer location.
# tcm — PE-local TCM (fast, small, conflicts with compute TCM access)
# hbm — PE-local HBM (large, slower DMA latency)
# sram — Cube-shared SRAM (medium, cube-internal contention)
buffer_kind: tcm
# Backpressure mode.
# poll — spin-loop polling of cached peer pointers
# sleep — yield SimPy event, wake on credit return
backpressure: sleep
# Ring depth: number of slots per (direction, tx|rx) buffer.
n_slots: 4
# Slot size in bytes (must hold one tile worth of data).
slot_size: 4096
# PE_DMA virtual channel chunk size (D8). First implementation does not
# use chunk-level interleave; this is reserved for future precision.
vc_chunk_size: 256
# Credit return fast path message size (D9). Used by bottleneck-BW
# latency calculation. 16-64 bytes typical.
ipcq_credit_size_bytes: 16
algorithms:
# ── ring all-reduce, buffer in PE_TCM ──
# Defaults to topology-derived world_size (full system, 256 ranks).
# Use a smaller tile size at high rank counts so f16 sums stay within
# the verification tolerance and op_log replay scales.
ring_allreduce_tcm:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d
buffer_kind: tcm
n_elem: 8
# ── ring all-reduce, buffer in PE-local HBM ──
ring_allreduce_hbm:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d
buffer_kind: hbm
n_elem: 8
# ── ring all-reduce, buffer in cube SRAM ──
ring_allreduce_sram:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d
buffer_kind: sram
n_elem: 8
# ── 2D mesh all-reduce: perfect square only (2×2 = 4 PEs) ──
mesh_allreduce_4:
module: kernbench.ccl.algorithms.mesh_allreduce
topology: mesh_2d
buffer_kind: tcm
world_size: 4
n_elem: 16
# ── tree all-reduce (binary, 7 PEs) ──
tree_allreduce_7:
module: kernbench.ccl.algorithms.tree_allreduce
topology: tree_binary
buffer_kind: tcm
world_size: 7
n_elem: 16