1d8b9401e5
New intercube allreduce kernel replacing the old flat ring algorithms.

Reduces across the 4x4 cube mesh within each SIP (pe0-only, same-lane), then inter-SIP exchange on root cube, then broadcast back. Supports ring_1d, torus_2d, and mesh_2d_no_wrap SIP topologies driven by topology.yaml. Integrated with dist.init_process_group / dist.all_reduce.

New files:
- src/kernbench/ccl/algorithms/intercube_allreduce.py (kernel)
- src/kernbench/ccl/sfr_config.py (configure_sfr_intercube_multisip)
- tests/test_allreduce_multidevice.py (config-driven, 3 topologies)
- tests/test_distributed_intercube_allreduce.py (full distributed path)
- tests/test_intercube_sfr_config.py (SFR wiring verification)

Modified:
- distributed.py: AhbmCCLBackend uses configure_sfr_intercube_multisip
- topologies.py: added torus_2d, mesh_2d_no_wrap
- install.py: global_E/W/N/S in _OPPOSITE_DIR
- topology.yaml: added system.sips.topology
- ccl.yaml: single intercube_allreduce algorithm
- benches/ccl_allreduce.py: row_wise cube-mesh tensor layout

Removed old flat-ring algorithms and their tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
46 lines
1.5 KiB
YAML
# ccl.yaml — CCL backend (ahbm) configuration (ADR-0023 D11)
#
# Loaded by AhbmCCLBackend at init_process_group time.
# defaults.algorithm selects which kernel and topology are installed
# into the PE_IPCQ neighbor tables. Host code is unaware of these settings.
defaults:
  # Algorithm to run for this benchmark execution.
  algorithm: intercube_allreduce

  # IPCQ ring buffer location.
  #   tcm  — PE-local TCM (fast, small, conflicts with compute TCM access)
  #   hbm  — PE-local HBM (large, slower DMA latency)
  #   sram — Cube-shared SRAM (medium, cube-internal contention)
  buffer_kind: tcm

  # Backpressure mode.
  #   poll  — spin-loop polling of cached peer pointers
  #   sleep — yield SimPy event, wake on credit return
  backpressure: sleep

  # Ring depth: number of slots per (direction, tx|rx) buffer.
  n_slots: 4

  # Slot size in bytes (must hold one tile's worth of data).
  slot_size: 4096

  # PE_DMA virtual channel chunk size (D8).
  vc_chunk_size: 256

  # Credit return fast path message size (D9).
  ipcq_credit_size_bytes: 16
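  # Worked sizing example (a sketch derived only from the values above; the
  # chunking behaviour is an assumption, not something this file defines):
  #   one ring buffer = n_slots * slot_size        = 4 * 4096 B = 16384 B (16 KiB)
  #   chunks per slot = slot_size / vc_chunk_size  = 4096 / 256 = 16
  # The first figure is per (direction, tx|rx) buffer. The per-PE total depends
  # on how many directions the installed topology uses, which is not stated here.
  # The second figure assumes a full slot is moved as equal vc_chunk_size pieces.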
algorithms:
  # ── intercube all-reduce (pe0-only, cube mesh + inter-SIP) ──
  # Reduces across the 4×4 cube mesh within each SIP, then inter-SIP
  # exchange on root cube, then broadcast back. SIP topology is read
  # from topology.yaml → system.sips.topology. Kernel auto-selects
  # ring / torus / mesh inter-SIP exchange pattern.
  intercube_allreduce:
    module: kernbench.ccl.algorithms.intercube_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
    root_cube: 15
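    # Illustration only: if cubes in the 4x4 mesh are numbered row-major from 0
    # (an assumption; the numbering scheme is not defined in this file), then
    #   row = 15 // 4 = 3, col = 15 % 4 = 3
    # so root_cube: 15 would be the corner cube at (row 3, col 3).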