Intercube allreduce: pe0 cube-mesh reduce + multi-SIP ring/torus/mesh
New intercube allreduce kernel replacing the old flat ring algorithms.
Reduces across the 4x4 cube mesh within each SIP (pe0-only, same-lane),
then inter-SIP exchange on root cube, then broadcast back.

Supports ring_1d, torus_2d, and mesh_2d_no_wrap SIP topologies driven by
topology.yaml. Integrated with dist.init_process_group / dist.all_reduce.

New files:
- src/kernbench/ccl/algorithms/intercube_allreduce.py (kernel)
- src/kernbench/ccl/sfr_config.py (configure_sfr_intercube_multisip)
- tests/test_allreduce_multidevice.py (config-driven, 3 topologies)
- tests/test_distributed_intercube_allreduce.py (full distributed path)
- tests/test_intercube_sfr_config.py (SFR wiring verification)

Modified:
- distributed.py: AhbmCCLBackend uses configure_sfr_intercube_multisip
- topologies.py: added torus_2d, mesh_2d_no_wrap
- install.py: global_E/W/N/S in _OPPOSITE_DIR
- topology.yaml: added system.sips.topology
- ccl.yaml: single intercube_allreduce algorithm
- benches/ccl_allreduce.py: row_wise cube-mesh tensor layout

Removed old flat-ring algorithms and their tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
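
For orientation, the three phases named in the commit message (cube-mesh reduce, inter-SIP exchange on the root cube, broadcast back) can be simulated in plain Python. This is only a sketch, not the kernel in intercube_allreduce.py. The function names, the east-then-south reduction order, the row-major reading of root cube 15 as mesh position (3, 3), and the ring-only inter-SIP step are all assumptions made for illustration.

```python
# Illustrative simulation only; not the kernbench kernel.
MESH = 4         # 4x4 cube mesh per SIP (from the commit message)
ROOT_CUBE = 15   # root_cube in ccl.yaml; row-major (3, 3) is an assumption


def reduce_to_root(values):
    """Fold one SIP's 16 per-cube values (pe0 only) into the root cube."""
    v = list(values)
    # Assumed order: each row passes partial sums eastward, hop by hop.
    for r in range(MESH):
        for c in range(MESH - 1):
            v[r * MESH + c + 1] += v[r * MESH + c]
    # Then the last column passes its row sums southward to the root.
    for r in range(MESH - 1):
        v[(r + 1) * MESH + MESH - 1] += v[r * MESH + MESH - 1]
    return v[ROOT_CUBE]


def intercube_allreduce(sips):
    """Return the per-cube values of every SIP after the all-reduce."""
    # Phase 1: intra-SIP reduce across the cube mesh.
    roots = [reduce_to_root(s) for s in sips]

    # Phase 2: ring exchange among root cubes (ring_1d case only).
    # Each step, every root forwards the value it last received.
    n = len(roots)
    totals = list(roots)
    in_flight = list(roots)
    for _ in range(n - 1):
        in_flight = [in_flight[(i - 1) % n] for i in range(n)]
        totals = [t + x for t, x in zip(totals, in_flight)]

    # Phase 3: broadcast the global sum back to every cube in each SIP.
    return [[t] * (MESH * MESH) for t in totals]


if __name__ == "__main__":
    # Four SIPs, cube c of SIP s starts with the value s * 16 + c.
    sips = [[s * 16 + c for c in range(16)] for s in range(4)]
    result = intercube_allreduce(sips)
    print(result[0][0])
```

With four SIPs this takes three ring steps on the root cubes; the torus_2d and mesh_2d_no_wrap patterns mentioned above are not modelled here.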
@@ -6,12 +6,7 @@
 
 defaults:
   # Algorithm to run for this benchmark execution.
-  algorithm: ring_allreduce_tcm
-
-  # NOTE: world_size is not set here by default. AhbmCCLBackend derives it
-  # from the chosen algorithm's entry (if it sets ``world_size``) or from
-  # topology.yaml (``sips × cubes_per_sip × pes_per_cube``). This mirrors
-  # real PyTorch DDP where ranks/world_size come from env vars, not code.
+  algorithm: intercube_allreduce
 
   # IPCQ ring buffer location.
   # tcm — PE-local TCM (fast, small, conflicts with compute TCM access)
@@ -30,43 +25,21 @@ defaults:
   # Slot size in bytes (must hold one tile worth of data).
   slot_size: 4096
 
-  # PE_DMA virtual channel chunk size (D8). First implementation does not
-  # use chunk-level interleave; this is reserved for future precision.
+  # PE_DMA virtual channel chunk size (D8).
   vc_chunk_size: 256
 
-  # Credit return fast path message size (D9). Used by bottleneck-BW
-  # latency calculation. 16-64 bytes typical.
+  # Credit return fast path message size (D9).
   ipcq_credit_size_bytes: 16
 
 algorithms:
-  # ── ring all-reduce, buffer in PE_TCM ──
-  # Defaults to topology-derived world_size (full system, 256 ranks).
-  # Use a smaller tile size at high rank counts so f16 sums stay within
-  # the verification tolerance and op_log replay scales.
-  ring_allreduce_tcm:
-    module: kernbench.ccl.algorithms.ring_allreduce
-    topology: ring_1d
-    buffer_kind: tcm
-    n_elem: 8
-
-  # ── ring all-reduce, buffer in PE-local HBM ──
-  ring_allreduce_hbm:
-    module: kernbench.ccl.algorithms.ring_allreduce
-    topology: ring_1d
-    buffer_kind: hbm
-    n_elem: 8
-
-  # ── ring all-reduce, buffer in cube SRAM ──
-  ring_allreduce_sram:
-    module: kernbench.ccl.algorithms.ring_allreduce
-    topology: ring_1d
-    buffer_kind: sram
-    n_elem: 8
-
-  # ── hierarchical all-reduce (3-level: intra-cube → inter-cube → inter-SIP) ──
-  # Uses bidirectional ring reduce + chain broadcast. ~25 rounds vs 255 flat.
-  hierarchical_allreduce:
-    module: kernbench.ccl.algorithms.hierarchical_allreduce
+  # ── intercube all-reduce (pe0-only, cube mesh + inter-SIP) ──
+  # Reduces across the 4×4 cube mesh within each SIP, then inter-SIP
+  # exchange on root cube, then broadcast back. SIP topology is read
+  # from topology.yaml → system.sips.topology. Kernel auto-selects
+  # ring / torus / mesh inter-SIP exchange pattern.
+  intercube_allreduce:
+    module: kernbench.ccl.algorithms.intercube_allreduce
     topology: none
     buffer_kind: tcm
-    n_elem: 16
+    n_elem: 8
+    root_cube: 15