kernbench2/ccl.yaml
ywkang 10b33b44ba Add Tensor indexing + hierarchical 3-level all-reduce kernel
Tensor.__setitem__ / __getitem__:
- Shard-aligned slice assignment and read on deployed tensors.
- Scalar broadcast and numpy array assignment supported.
- Cross-shard slices raise NotImplementedError (use copy_ for that).
- 3 new tests: single-PE, multi-PE, cross-shard error case.
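
The shard-alignment rule above can be illustrated with a small sketch. It assumes a 1-D tensor split evenly into shards of `shard_len` elements. The helper name `locate_in_shard` and its signature are hypothetical and are not taken from the kernbench source.

```python
def locate_in_shard(start: int, stop: int, shard_len: int) -> tuple[int, slice]:
    """Map a global half-open range [start, stop) to (shard index, local slice).

    Hypothetical helper: assumes even 1-D sharding with shard_len elements
    per shard. A range that touches more than one shard is rejected, which
    mirrors the NotImplementedError behavior described in the commit message.
    """
    if shard_len <= 0:
        raise ValueError("shard_len must be positive")
    if not 0 <= start < stop:
        raise ValueError("expected a non-empty range with 0 <= start < stop")
    first = start // shard_len          # shard holding the first element
    last = (stop - 1) // shard_len      # shard holding the last element
    if first != last:
        raise NotImplementedError("slice spans shards; use copy_ instead")
    base = first * shard_len            # global offset of that shard
    return first, slice(start - base, stop - base)
```

With `shard_len=8`, the range `[8, 12)` resolves to shard 1 with local slice `0:4`, while `[6, 10)` straddles shards 0 and 1 and is rejected.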

Hierarchical all-reduce kernel (src/kernbench/ccl/algorithms/):
- 3-level reduce: intra-cube (E/W) → inter-cube (N/S) → inter-SIP (parent).
- Bidirectional ring reduce at each level: ceil((N-1)/2) rounds.
  Left half sends via dir_dec, right half via dir_inc (wrap).
  Representative receives from both sides.
- Chain broadcast for reverse path: cube 0 PE 0 → all PE 0s → all PEs.
- Registered in ccl.yaml as "hierarchical_allreduce" with topology: none
  (neighbors() override builds the full 3-level neighbor map).
- kernel_args derives pes_per_cube/cubes_per_sip/num_sips from world_size.
- Mock-verified at 8/16/32/64/128 ranks.
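
As a rough check on the round count quoted above, the sketch below sums `ceil((N-1)/2)` over the three reduce levels. It covers only the reduce phase and leaves out the chain broadcast. The function names and the 8 x 8 x 4 shape in the usage note are illustrative assumptions, not values from the repository.

```python
import math


def bidir_ring_reduce_rounds(n: int) -> int:
    """Rounds for a bidirectional ring reduce over n members: ceil((n-1)/2)."""
    if n < 1:
        raise ValueError("a ring needs at least one member")
    return math.ceil((n - 1) / 2)


def hierarchical_reduce_rounds(pes_per_cube: int, cubes_per_sip: int, num_sips: int) -> int:
    """Total reduce-phase rounds across the three levels (broadcast excluded)."""
    levels = (pes_per_cube, cubes_per_sip, num_sips)
    return sum(bidir_ring_reduce_rounds(n) for n in levels)
```

For a hypothetical shape of 8 PEs per cube, 8 cubes per SIP, and 4 SIPs, `hierarchical_reduce_rounds(8, 8, 4)` gives 4 + 4 + 2 = 10 reduce rounds.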

Mock runtime fixes:
- Direction pairing: explicit N↔S, E↔W, parent↔parent instead of
  "first matching reverse". Fixes 2-element rings where N and S both
  point to the same peer.
- Deadlock detection: send-counter based (not just queue-depth-total)
  to catch chain reductions where send+recv pairs net to zero.
- Multi-cube program_id: pes_per_cube parameter enables
  program_id(axis=0) = PE within cube, program_id(axis=1) = cube id.
  Legacy single-cube tests unaffected (default = world_size).
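
The direction-pairing fix can be illustrated with a two-member ring in which both N and S point at the same peer. This is a minimal sketch with hypothetical names (`REVERSE`, `first_matching_reverse`, `explicit_reverse`); it is not the mock runtime's actual code.

```python
# Explicit pairing: a message sent on N arrives on the peer's S queue, and so on.
REVERSE = {"N": "S", "S": "N", "E": "W", "W": "E", "parent": "parent"}


def first_matching_reverse(neighbors: dict, src: int, tx_dir: str) -> str:
    """Old-style lookup: first direction on the peer that points back at src."""
    dst = neighbors[src][tx_dir]
    for direction, peer in neighbors[dst].items():
        if peer == src:
            return direction
    raise KeyError(f"rank {dst} has no link back to rank {src}")


def explicit_reverse(tx_dir: str) -> str:
    """New-style lookup: the receive direction depends only on the send direction."""
    return REVERSE[tx_dir]


# Two-member ring: N and S of each rank lead to the same peer.
ring2 = {0: {"N": 1, "S": 1}, 1: {"N": 0, "S": 0}}

# Both sends from rank 0 resolve to the same receive queue under the old rule.
old = {d: first_matching_reverse(ring2, 0, d) for d in ("N", "S")}
# The explicit table keeps the two directions on distinct queues.
new = {d: explicit_reverse(d) for d in ("N", "S")}
```

Here `old` maps both `N` and `S` to `"N"`, so the two directions collide on one receive queue, whereas `new` maps `N` to `"S"` and `S` to `"N"`.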

504 tests pass in 12s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 23:52:04 -07:00

89 lines
3.0 KiB
YAML
# ccl.yaml — CCL backend (ahbm) configuration (ADR-0023 D11)
#
# Loaded by AhbmCCLBackend at init_process_group time.
# defaults.algorithm chooses which kernel + topology is installed
# into PE_IPCQ neighbor tables. Host code is unaware of these settings.
defaults:
  # Algorithm to run for this benchmark execution.
  algorithm: ring_allreduce_tcm
  # NOTE: world_size is not set here by default. AhbmCCLBackend derives it
  # from the chosen algorithm's entry (if it sets ``world_size``) or from
  # topology.yaml (``sips × cubes_per_sip × pes_per_cube``). This mirrors
  # real PyTorch DDP where ranks/world_size come from env vars, not code.
  # IPCQ ring buffer location.
  #   tcm  — PE-local TCM (fast, small, conflicts with compute TCM access)
  #   hbm  — PE-local HBM (large, slower DMA latency)
  #   sram — Cube-shared SRAM (medium, cube-internal contention)
  buffer_kind: tcm
  # Backpressure mode.
  #   poll  — spin-loop polling of cached peer pointers
  #   sleep — yield SimPy event, wake on credit return
  backpressure: sleep
  # Ring depth: number of slots per (direction, tx|rx) buffer.
  n_slots: 4
  # Slot size in bytes (must hold one tile worth of data).
  slot_size: 4096
  # PE_DMA virtual channel chunk size (D8). First implementation does not
  # use chunk-level interleave; this is reserved for future precision.
  vc_chunk_size: 256
  # Credit return fast path message size (D9). Used by bottleneck-BW
  # latency calculation. 16-64 bytes typical.
  ipcq_credit_size_bytes: 16
algorithms:
  # ── ring all-reduce, buffer in PE_TCM ──
  # Defaults to topology-derived world_size (full system, 256 ranks).
  # Use a smaller tile size at high rank counts so f16 sums stay within
  # the verification tolerance and op_log replay scales.
  ring_allreduce_tcm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d
    buffer_kind: tcm
    n_elem: 8
  # ── ring all-reduce, buffer in PE-local HBM ──
  ring_allreduce_hbm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d
    buffer_kind: hbm
    n_elem: 8
  # ── ring all-reduce, buffer in cube SRAM ──
  ring_allreduce_sram:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d
    buffer_kind: sram
    n_elem: 8
  # ── 2D mesh all-reduce: perfect square only (2×2 = 4 PEs) ──
  mesh_allreduce_4:
    module: kernbench.ccl.algorithms.mesh_allreduce
    topology: mesh_2d
    buffer_kind: tcm
    world_size: 4
    n_elem: 16
  # ── tree all-reduce (binary, 7 PEs) ──
  tree_allreduce_7:
    module: kernbench.ccl.algorithms.tree_allreduce
    topology: tree_binary
    buffer_kind: tcm
    world_size: 7
    n_elem: 16
  # ── hierarchical all-reduce (3-level: intra-cube → inter-cube → inter-SIP) ──
  # Uses bidirectional ring reduce + chain broadcast. ~25 rounds vs 255 flat.
  hierarchical_allreduce:
    module: kernbench.ccl.algorithms.hierarchical_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 16
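
The world_size rule described in the comments of this file can be sketched in Python as follows. This is an illustration under stated assumptions, not AhbmCCLBackend's actual code. The function name `resolve_algorithm` is hypothetical, the per-algorithm keys are assumed to override `defaults`, and the 4 x 8 x 8 topology is just one shape that yields 256 ranks.

```python
from math import prod


def resolve_algorithm(ccl_cfg: dict, topology: dict) -> dict:
    """Merge defaults with the selected algorithm entry and fill in world_size.

    Assumption: keys in the algorithm entry override keys in defaults.
    world_size comes from the merged config if present, otherwise from
    sips * cubes_per_sip * pes_per_cube in the topology mapping.
    """
    name = ccl_cfg["defaults"]["algorithm"]
    entry = ccl_cfg["algorithms"][name]
    resolved = {**ccl_cfg["defaults"], **entry}
    if "world_size" not in resolved:
        dims = ("sips", "cubes_per_sip", "pes_per_cube")
        resolved["world_size"] = prod(topology[k] for k in dims)
    return resolved


# Already-parsed config (a subset of the file above) and a hypothetical topology.
ccl_cfg = {
    "defaults": {"algorithm": "ring_allreduce_tcm", "buffer_kind": "tcm", "n_slots": 4},
    "algorithms": {
        "ring_allreduce_tcm": {"topology": "ring_1d", "buffer_kind": "tcm", "n_elem": 8},
        "mesh_allreduce_4": {"topology": "mesh_2d", "world_size": 4, "n_elem": 16},
    },
}
topology = {"sips": 4, "cubes_per_sip": 8, "pes_per_cube": 8}  # hypothetical shape

ring_cfg = resolve_algorithm(ccl_cfg, topology)
```

With these inputs `ring_cfg["world_size"]` is 256, taken from the topology. Switching `defaults.algorithm` to `mesh_allreduce_4` would instead pick up the explicit `world_size: 4` from that entry.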