Intercube allreduce: pe0 cube-mesh reduce + multi-SIP ring/torus/mesh

New intercube allreduce kernel replacing the old flat ring algorithms.
Reduces across the 4x4 cube mesh within each SIP (pe0-only, same-lane),
then inter-SIP exchange on root cube, then broadcast back. Supports
ring_1d, torus_2d, and mesh_2d_no_wrap SIP topologies driven by
topology.yaml. Integrated with dist.init_process_group / dist.all_reduce.

New files:
- src/kernbench/ccl/algorithms/intercube_allreduce.py (kernel)
- src/kernbench/ccl/sfr_config.py (configure_sfr_intercube_multisip)
- tests/test_allreduce_multidevice.py (config-driven, 3 topologies)
- tests/test_distributed_intercube_allreduce.py (full distributed path)
- tests/test_intercube_sfr_config.py (SFR wiring verification)

Modified:
- distributed.py: AhbmCCLBackend uses configure_sfr_intercube_multisip
- topologies.py: added torus_2d, mesh_2d_no_wrap
- install.py: global_E/W/N/S in _OPPOSITE_DIR
- topology.yaml: added system.sips.topology
- ccl.yaml: single intercube_allreduce algorithm
- benches/ccl_allreduce.py: row_wise cube-mesh tensor layout

Removed old flat-ring algorithms and their tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-16 17:33:42 -07:00
parent cfc2d74ec4
commit 1d8b9401e5
30 changed files with 876 additions and 2892 deletions
+12 -39
View File
@@ -6,12 +6,7 @@
defaults:
# Algorithm to run for this benchmark execution.
algorithm: ring_allreduce_tcm
# NOTE: world_size is not set here by default. AhbmCCLBackend derives it
# from the chosen algorithm's entry (if it sets ``world_size``) or from
# topology.yaml (``sips × cubes_per_sip × pes_per_cube``). This mirrors
# real PyTorch DDP where ranks/world_size come from env vars, not code.
algorithm: intercube_allreduce
# IPCQ ring buffer location.
# tcm — PE-local TCM (fast, small, conflicts with compute TCM access)
@@ -30,43 +25,21 @@ defaults:
# Slot size in bytes (must hold one tile worth of data).
slot_size: 4096
# PE_DMA virtual channel chunk size (D8). First implementation does not
# use chunk-level interleave; this is reserved for future precision.
# PE_DMA virtual channel chunk size (D8).
vc_chunk_size: 256
# Credit return fast path message size (D9). Used by bottleneck-BW
# latency calculation. 16-64 bytes typical.
# Credit return fast path message size (D9).
ipcq_credit_size_bytes: 16
algorithms:
# ── ring all-reduce, buffer in PE_TCM ──
# Defaults to topology-derived world_size (full system, 256 ranks).
# Use a smaller tile size at high rank counts so f16 sums stay within
# the verification tolerance and op_log replay scales.
ring_allreduce_tcm:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d
buffer_kind: tcm
n_elem: 8
# ── ring all-reduce, buffer in PE-local HBM ──
ring_allreduce_hbm:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d
buffer_kind: hbm
n_elem: 8
# ── ring all-reduce, buffer in cube SRAM ──
ring_allreduce_sram:
module: kernbench.ccl.algorithms.ring_allreduce
topology: ring_1d
buffer_kind: sram
n_elem: 8
# ── hierarchical all-reduce (3-level: intra-cube → inter-cube → inter-SIP) ──
# Uses bidirectional ring reduce + chain broadcast. ~25 rounds vs 255 flat.
hierarchical_allreduce:
module: kernbench.ccl.algorithms.hierarchical_allreduce
# ── intercube all-reduce (pe0-only, cube mesh + inter-SIP) ──
# Reduces across the 4×4 cube mesh within each SIP, then inter-SIP
# exchange on root cube, then broadcast back. SIP topology is read
# from topology.yaml → system.sips.topology. Kernel auto-selects
# ring / torus / mesh inter-SIP exchange pattern.
intercube_allreduce:
module: kernbench.ccl.algorithms.intercube_allreduce
topology: none
buffer_kind: tcm
n_elem: 16
n_elem: 8
root_cube: 15