Speed up regression: 25min → 6min (test matrix + DataExecutor cleanup)

Test matrix restructure:
- 256-rank full-system ring runs only ONCE (marked pytest.mark.slow)
  instead of 7× across matrix + perf tests. Cross-SIP routing is
  verified by the single run; buffer variants (tcm/hbm/sram) are
  tested at 8-rank where they finish in <0.5s.
- Performance tests use 8-rank instead of 256-rank.
- `pytest -m "not slow"` completes in ~2.5min (local dev).
- Full suite including slow: ~6min (CI).

DataExecutor optimization:
- Remove ThreadPoolExecutor from DataExecutor.run(). Same-t_start
  groups are almost always size 1, so the thread pool creation and
  dispatch overhead dominated. Simple sequential loop is faster.
- Skip dma_read ops at the loop level (they are always no-ops in
  Phase 2 but were dispatched through _execute_op → _execute_memory).
- Remove redundant CLI Phase 2 re-execution: engine._flush_data_phase
  already replays during engine.wait(); the CLI now only prints the
  diagnostic summary without re-running DataExecutor.

502 tests pass. Wall time: 25m30s → 5m43s (full), 2m28s (no slow).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
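
Illustrative sketch of the DataExecutor change described in the message: ops are replayed in t_start order on the calling thread, and dma_read ops are skipped before any dispatch. The diff for DataExecutor.run() is not shown here, so `Op`, `run_sequential`, and the `execute_op` callback are hypothetical stand-ins rather than the actual kernbench code.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical op record; the real DataExecutor op type is an assumption here.
Op = namedtuple("Op", ["t_start", "kind", "payload"])


def run_sequential(ops, execute_op):
    """Replay ops grouped by t_start without a thread pool.

    Returns the number of ops actually dispatched to execute_op.
    """
    executed = 0
    # sorted() is stable, so ops sharing a t_start keep their original order.
    ordered = sorted(ops, key=lambda op: op.t_start)
    for _, group in groupby(ordered, key=lambda op: op.t_start):
        # Groups are usually size 1, so a plain loop avoids pool setup cost.
        for op in group:
            if op.kind == "dma_read":
                # Assumed to be a no-op in the replay phase; skip before dispatch.
                continue
            execute_op(op)
            executed += 1
    return executed


ops = [
    Op(0, "dma_write", "a"),
    Op(0, "dma_read", "b"),
    Op(1, "compute", "c"),
]
log = []
count = run_sequential(ops, lambda op: log.append(op.payload))
```

With the three sample ops above, only the dma_write and compute ops reach the callback; the dma_read op never enters the dispatch path.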
2026-04-12 20:52:07 -07:00
parent 998cc85762
commit bcf941dcee
4 changed files with 60 additions and 106 deletions
@@ -66,31 +66,39 @@ def _write_ccl_yaml(

 CASES = [
    # algorithm, module, topology, buffer_kind, world_size, n_elem, expected_ws
+    #
+    # Full-system (256-rank, cross-SIP) — run only ONCE (tcm). Buffer
+    # variant differences are purely IPCQ slot placement; the compute path
+    # is identical. Cross-SIP routing is the real thing being verified here.
    pytest.param(
        "ring_allreduce_tcm", "kernbench.ccl.algorithms.ring_allreduce",
        "ring_1d", "tcm", None, 8, 256,
-        id="ring_full_system_tcm",
+        id="ring_full_system",
+        marks=pytest.mark.slow,
+    ),
+    # Buffer variants at 8-rank (fast — same kernel, different slot space).
+    pytest.param(
+        "ring_allreduce_tcm", "kernbench.ccl.algorithms.ring_allreduce",
+        "ring_1d", "tcm", 8, 32, 8,
+        id="ring_tcm_8",
    ),
    pytest.param(
        "ring_allreduce_hbm", "kernbench.ccl.algorithms.ring_allreduce",
-        "ring_1d", "hbm", None, 8, 256,
-        id="ring_full_system_hbm",
+        "ring_1d", "hbm", 8, 32, 8,
+        id="ring_hbm_8",
    ),
    pytest.param(
        "ring_allreduce_sram", "kernbench.ccl.algorithms.ring_allreduce",
-        "ring_1d", "sram", None, 8, 256,
-        id="ring_full_system_sram",
-    ),
-    pytest.param(
-        "ring_allreduce_8", "kernbench.ccl.algorithms.ring_allreduce",
-        "ring_1d", "tcm", 8, 32, 8,
-        id="ring_single_cube",
+        "ring_1d", "sram", 8, 32, 8,
+        id="ring_sram_8",
    ),
+    # Multi-cube (16-rank, cross-cube within 1 SIP).
    pytest.param(
        "ring_allreduce_16", "kernbench.ccl.algorithms.ring_allreduce",
        "ring_1d", "tcm", 16, 16, 16,
        id="ring_multi_cube",
    ),
+    # Mesh + tree algorithms.
    pytest.param(
        "mesh_allreduce_4", "kernbench.ccl.algorithms.mesh_allreduce",
        "mesh_2d", "tcm", 4, 16, 4,