Files
kernbench2/ccl.yaml
T
ywkang cfc2d74ec4 Refactor ccl_allreduce bench: rank=SIP only, remove rank=PE legacy path
The unified ccl_allreduce bench previously carried two execution models
in one worker with ``if world_size == n_sips:`` branching:
  - TP mode (rank = SIP, ADR-0024/0027): proper ProcessGroup semantics.
  - Legacy rank = PE mode: single-driver worker allocating one big tensor
    distributed across all PEs via _derive_dp, with kernel-level SPMD via
    program_id.

The second model is unnecessary — intra-SIP PE-level collectives are
expressed inside the kernel (tl.send/tl.recv with program_id, IPCQ) and
do not need a host-side ProcessGroup. Removing it lets the bench be a
clean reference implementation of the TP launcher.

benches/ccl_allreduce.py:
- Config resolved once in run() via _resolve_cfg -> _BenchCfg dataclass.
- rank != n_sips now raises RuntimeError explicitly.
- _worker / _allocate_rank_tile / _init_with_rank_value / _report each
  have one concern; duplicated init + verification paths collapsed.
- _derive_dp and the second verify+print block deleted.
- 166 lines -> 91 lines.

ccl.yaml:
- mesh_allreduce_4 (world_size: 4) and tree_allreduce_7 (world_size: 7)
  algorithm entries removed (rank = PE only).
- Algorithm kernel files (kernbench.ccl.algorithms.mesh_allreduce,
  tree_allreduce) kept as-is for direct-dispatch future use.

tests/test_ccl_allreduce_matrix.py:
- Matrix shrinks from 7 cases to 3: ring × {tcm, hbm, sram} at ws =
  topology SIP count (= 2). mesh_2x2, tree_binary_7, ring_multi_cube,
  and the three ring_*_8 cases removed.

tests/test_ccl_performance.py:
- _run_8rank renamed to _run_ring; world_size: 8 override dropped; now
  exercises rank = SIP ring all-reduce.

tests/test_mp_spawn.py, tests/test_ccl_ddp_launcher.py:
- Monkeypatch target updated from bench.worker to bench._worker
  (signature now takes BenchCfg instead of (rank, world_size)).

555 passed, 1 intentional skip. Tests that directly call
install_ipcq(world_size_override=N) for kernel-level sanity
(test_ccl_hello_world_guide, test_recv_copy_to_dst, test_tl_recv_async,
test_ccl_deadlock_detection) are unchanged — they never went through
the bench and still exercise the kernel-only path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 16:45:27 -07:00

73 lines
2.6 KiB
YAML
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ccl.yaml — CCL backend (ahbm) configuration (ADR-0023 D11)
#
# Loaded by AhbmCCLBackend at init_process_group time.
# defaults.algorithm chooses which kernel + topology is installed
# into PE_IPCQ neighbor tables. Host code is unaware of these settings.
defaults:
# Algorithm to run for this benchmark execution.
algorithm: ring_allreduce_tcm
# NOTE: world_size is not set here by default. AhbmCCLBackend derives it
# from the chosen algorithm's entry (if it sets ``world_size``) or from
# topology.yaml (``sips × cubes_per_sip × pes_per_cube``). This mirrors
# real PyTorch DDP where ranks/world_size come from env vars, not code.
# IPCQ ring buffer location.
# tcm — PE-local TCM (fast, small, conflicts with compute TCM access)
# hbm — PE-local HBM (large, slower DMA latency)
# sram — Cube-shared SRAM (medium, cube-internal contention)
buffer_kind: tcm
# Backpressure mode.
# poll — spin-loop polling of cached peer pointers
# sleep — yield SimPy event, wake on credit return
backpressure: sleep
# Ring depth: number of slots per (direction, tx|rx) buffer.
n_slots: 4
# Slot size in bytes (must hold one tile worth of data).
slot_size: 4096
# PE_DMA virtual channel chunk size (D8). First implementation does not
# use chunk-level interleave; this is reserved for future precision.
vc_chunk_size: 256
# Credit return fast path message size (D9). Used by bottleneck-BW
# latency calculation. 16-64 bytes typical.
ipcq_credit_size_bytes: 16
algorithms:
# ── ring all-reduce, buffer in PE_TCM ──
# Defaults to topology-derived world_size (full system, 256 ranks).
# Use a smaller tile size at high rank counts so f16 sums stay within
# the verification tolerance and op_log replay scales.
ring_allreduce_tcm:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d
buffer_kind: tcm
n_elem: 8
# ── ring all-reduce, buffer in PE-local HBM ──
ring_allreduce_hbm:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d
buffer_kind: hbm
n_elem: 8
# ── ring all-reduce, buffer in cube SRAM ──
ring_allreduce_sram:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d
buffer_kind: sram
n_elem: 8
# ── hierarchical all-reduce (3-level: intra-cube → inter-cube → inter-SIP) ──
# Uses bidirectional ring reduce + chain broadcast. ~25 rounds vs 255 flat.
hierarchical_allreduce:
module: kernbench.ccl.algorithms.hierarchical_allreduce
topology: none
buffer_kind: tcm
n_elem: 16