Refactor ccl_allreduce bench: rank=SIP only, remove rank=PE legacy path

The unified ccl_allreduce bench previously carried two execution models in one worker with ``if world_size == n_sips:`` branching: - TP mode (rank = SIP, ADR-0024/0027): proper ProcessGroup semantics. - Legacy rank = PE mode: single-driver worker allocating one big tensor distributed across all PEs via _derive_dp, with kernel-level SPMD via program_id. The second model is unnecessary — intra-SIP PE-level collectives are expressed inside the kernel (tl.send/tl.recv with program_id, IPCQ) and do not need a host-side ProcessGroup. Removing it lets the bench be a clean reference implementation of the TP launcher. benches/ccl_allreduce.py: - Config resolved once in run() via _resolve_cfg -> _BenchCfg dataclass. - rank != n_sips now raises RuntimeError explicitly. - _worker / _allocate_rank_tile / _init_with_rank_value / _report each have one concern; duplicated init + verification paths collapsed. - _derive_dp and the second verify+print block deleted. - 166 lines -> 91 lines. ccl.yaml: - mesh_allreduce_4 (world_size: 4) and tree_allreduce_7 (world_size: 7) algorithm entries removed (rank = PE only). - Algorithm kernel files (kernbench.ccl.algorithms.mesh_allreduce, tree_allreduce) kept as-is for direct-dispatch future use. tests/test_ccl_allreduce_matrix.py: - Matrix shrinks from 7 cases to 3: ring × {tcm, hbm, sram} at ws = topology SIP count (= 2). mesh_2x2, tree_binary_7, ring_multi_cube, and the three ring_*_8 cases removed. tests/test_ccl_performance.py: - _run_8rank renamed to _run_ring; world_size: 8 override dropped; now exercises rank = SIP ring all-reduce. tests/test_mp_spawn.py, tests/test_ccl_ddp_launcher.py: - Monkeypatch target updated from bench.worker to bench._worker (signature now takes BenchCfg instead of (rank, world_size)). 555 passed, 1 intentional skip. Tests that directly call install_ipcq(world_size_override=N) for kernel-level sanity (test_ccl_hello_world_guide, test_recv_copy_to_dst, test_tl_recv_async, test_ccl_deadlock_detection) are unchanged — they never went through the bench and still exercise the kernel-only path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 16:45:27 -07:00
parent 105f1dc09e
commit cfc2d74ec4
6 changed files with 118 additions and 244 deletions
@@ -1,11 +1,10 @@
 """CCL performance validation tests (ADR-0023 D13 T5).

-Sanity-checks the simulated latency of the unified ``ccl_allreduce`` bench.
-
-Uses 8-rank (single cube) for all buffer variants — the latency model
-is topology-aware, so buffer_kind differences are visible even at small
-scale. Full-system (256-rank) cross-SIP latency is covered by the
-``test_ccl_allreduce_matrix[ring_full_system]`` slow test.
+Sanity-checks the simulated latency of the unified ``ccl_allreduce`` bench
+under the rank = SIP TP launcher model (ADR-0024 / ADR-0027). Uses the
+topology-derived world_size (= 2 in the shipped topology); the latency
+model is topology-aware, so buffer_kind differences remain visible even
+at this scale.
 """
 from __future__ import annotations

@@ -24,9 +23,9 @@ def _engine_factory(topology, device):
    return GraphEngine(getattr(topology, "topology_obj", topology), enable_data=True)


-def _run_8rank(algorithm: str, buffer_kind: str = "tcm") -> float:
-    """Run an 8-rank ring via the unified bench with a tmp ccl.yaml overlay.
-    Returns simulated kernel total_ns."""
+def _run_ring(algorithm: str, buffer_kind: str = "tcm") -> float:
+    """Run a rank = SIP ring all-reduce via the unified bench with a tmp
+    ccl.yaml overlay. Returns simulated kernel total_ns."""
    import tempfile

    body = f"""\
@@ -44,7 +43,6 @@ algorithms:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d
    buffer_kind: {buffer_kind}
-    world_size: 8
    n_elem: 32
 """
    project_root = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
@@ -77,11 +75,11 @@ algorithms:
 def test_ccl_latency_positive(buffer_kind):
    """Every buffer kind must produce a positive simulated latency."""
    algo = f"ring_allreduce_{buffer_kind}"
-    ns = _run_8rank(algo, buffer_kind)
+    ns = _run_ring(algo, buffer_kind)
    assert ns > 0


 def test_ccl_latency_under_reasonable_bound():
-    """8-rank ring all-reduce (tile=32 f16) should finish well under 1ms."""
-    ns = _run_8rank("ring_allreduce_tcm", "tcm")
+    """rank = SIP ring all-reduce (tile=32 f16) should finish well under 1ms."""
+    ns = _run_ring("ring_allreduce_tcm", "tcm")
    assert ns < 1_000_000  # < 1 ms simulated