Intercube allreduce: pe0 cube-mesh reduce + multi-SIP ring/torus/mesh

New intercube allreduce kernel replacing the old flat ring algorithms.
Reduces across the 4x4 cube mesh within each SIP (pe0-only, same-lane),
then inter-SIP exchange on root cube, then broadcast back. Supports
ring_1d, torus_2d, and mesh_2d_no_wrap SIP topologies driven by
topology.yaml. Integrated with dist.init_process_group / dist.all_reduce.

New files:
- src/kernbench/ccl/algorithms/intercube_allreduce.py (kernel)
- src/kernbench/ccl/sfr_config.py (configure_sfr_intercube_multisip)
- tests/test_allreduce_multidevice.py (config-driven, 3 topologies)
- tests/test_distributed_intercube_allreduce.py (full distributed path)
- tests/test_intercube_sfr_config.py (SFR wiring verification)

Modified:
- distributed.py: AhbmCCLBackend uses configure_sfr_intercube_multisip
- topologies.py: added torus_2d, mesh_2d_no_wrap
- install.py: global_E/W/N/S in _OPPOSITE_DIR
- topology.yaml: added system.sips.topology
- ccl.yaml: single intercube_allreduce algorithm
- benches/ccl_allreduce.py: row_wise cube-mesh tensor layout

Removed old flat-ring algorithms and their tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 17:33:42 -07:00
parent cfc2d74ec4
commit 1d8b9401e5
30 changed files with 876 additions and 2892 deletions
@@ -6,12 +6,7 @@

 defaults:
  # Algorithm to run for this benchmark execution.
-  algorithm: ring_allreduce_tcm
-
-  # NOTE: world_size is not set here by default. AhbmCCLBackend derives it
-  # from the chosen algorithm's entry (if it sets ``world_size``) or from
-  # topology.yaml (``sips × cubes_per_sip × pes_per_cube``). This mirrors
-  # real PyTorch DDP where ranks/world_size come from env vars, not code.
+  algorithm: intercube_allreduce

  # IPCQ ring buffer location.
  #   tcm  — PE-local TCM (fast, small, conflicts with compute TCM access)
@@ -30,43 +25,21 @@ defaults:
  # Slot size in bytes (must hold one tile worth of data).
  slot_size: 4096

-  # PE_DMA virtual channel chunk size (D8). First implementation does not
-  # use chunk-level interleave; this is reserved for future precision.
+  # PE_DMA virtual channel chunk size (D8).
  vc_chunk_size: 256

-  # Credit return fast path message size (D9). Used by bottleneck-BW
-  # latency calculation. 16-64 bytes typical.
+  # Credit return fast path message size (D9).
  ipcq_credit_size_bytes: 16

 algorithms:
-  # ── ring all-reduce, buffer in PE_TCM ──
-  # Defaults to topology-derived world_size (full system, 256 ranks).
-  # Use a smaller tile size at high rank counts so f16 sums stay within
-  # the verification tolerance and op_log replay scales.
-  ring_allreduce_tcm:
-    module: kernbench.ccl.algorithms.ring_allreduce
-    topology: ring_1d
-    buffer_kind: tcm
-    n_elem: 8
-
-  # ── ring all-reduce, buffer in PE-local HBM ──
-  ring_allreduce_hbm:
-    module: kernbench.ccl.algorithms.ring_allreduce
-    topology: ring_1d
-    buffer_kind: hbm
-    n_elem: 8
-
-  # ── ring all-reduce, buffer in cube SRAM ──
-  ring_allreduce_sram:
-    module: kernbench.ccl.algorithms.ring_allreduce
-    topology: ring_1d
-    buffer_kind: sram
-    n_elem: 8
-
-  # ── hierarchical all-reduce (3-level: intra-cube → inter-cube → inter-SIP) ──
-  # Uses bidirectional ring reduce + chain broadcast. ~25 rounds vs 255 flat.
-  hierarchical_allreduce:
-    module: kernbench.ccl.algorithms.hierarchical_allreduce
+  # ── intercube all-reduce (pe0-only, cube mesh + inter-SIP) ──
+  # Reduces across the 4×4 cube mesh within each SIP, then inter-SIP
+  # exchange on root cube, then broadcast back. SIP topology is read
+  # from topology.yaml → system.sips.topology. Kernel auto-selects
+  # ring / torus / mesh inter-SIP exchange pattern.
+  intercube_allreduce:
+    module: kernbench.ccl.algorithms.intercube_allreduce
    topology: none
    buffer_kind: tcm
-    n_elem: 16
+    n_elem: 8
+    root_cube: 15
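
The three SIP topologies and the three-phase reduction described in the commit message can be illustrated with a small host-side sketch. This is not the kernel from intercube_allreduce.py: the function names `sip_neighbors` and `reference_allreduce`, the row-major SIP numbering, and the N/S/W/E direction keys are all assumptions made for illustration. The sketch only models which SIPs are adjacent under each topology and the arithmetic result every cube should hold afterwards, not the actual message schedule or SFR wiring.

```python
# Hypothetical helpers for illustration only; not part of kernbench.

def sip_neighbors(topology, sip, rows, cols):
    """Return {direction: neighbor_sip} for one SIP.

    Assumes SIPs are numbered row-major on a rows x cols grid.
    For ring_1d the grid is flattened into a single ring.
    """
    n = rows * cols
    if topology == "ring_1d":
        return {"W": (sip - 1) % n, "E": (sip + 1) % n}
    if topology not in ("torus_2d", "mesh_2d_no_wrap"):
        raise ValueError(f"unknown SIP topology: {topology}")

    wrap = topology == "torus_2d"
    r, c = divmod(sip, cols)
    steps = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
    out = {}
    for name, (dr, dc) in steps.items():
        nr, nc = r + dr, c + dc
        if wrap:
            # Torus: edges wrap around to the opposite side.
            nr, nc = nr % rows, nc % cols
        elif not (0 <= nr < rows and 0 <= nc < cols):
            # Mesh without wrap: edge SIPs simply have fewer neighbors.
            continue
        out[name] = nr * cols + nc
    return out


def reference_allreduce(values):
    """Compute the expected allreduce result.

    values[sip][cube] is one tile (a list of numbers) contributed by pe0
    of that cube. The return value has the same shape, with every tile
    replaced by the global elementwise sum.
    """
    # Phase 1: reduce across the cubes of each SIP (partial sum that the
    # real kernel would leave on the root cube).
    per_sip = [[sum(col) for col in zip(*cubes)] for cubes in values]
    # Phase 2: combine the per-SIP partials (the inter-SIP exchange).
    total = [sum(col) for col in zip(*per_sip)]
    # Phase 3: broadcast the total back to every cube of every SIP.
    return [[list(total) for _ in cubes] for cubes in values]
```

A reference like this could serve as the expected value when checking a simulated run, for example with two SIPs of 16 cubes each and `n_elem` shortened to 2 for readability: `reference_allreduce([[[1.0, 2.0]] * 16 for _ in range(2)])`.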