Speed up regression: 25min → 6min (test matrix + DataExecutor cleanup)

Test matrix restructure:
- 256-rank full-system ring runs only ONCE (marked pytest.mark.slow)
  instead of 7× across matrix + perf tests. Cross-SIP routing is
  verified by the single run; buffer variants (tcm/hbm/sram) are
  tested at 8-rank where they finish in <0.5s.
- Performance tests use 8-rank instead of 256-rank.
- `pytest -m "not slow"` completes in ~2.5min (local dev).
- Full suite including slow: ~6min (CI).

DataExecutor optimization:
- Remove ThreadPoolExecutor from DataExecutor.run(). Same-t_start
  groups are almost always size 1, so the thread pool creation and
  dispatch overhead dominated. Simple sequential loop is faster.
- Skip dma_read ops at the loop level (they are always no-ops in
  Phase 2 but were dispatched through _execute_op → _execute_memory).
- Remove redundant CLI Phase 2 re-execution: engine._flush_data_phase
  already replays during engine.wait(); the CLI now only prints the
  diagnostic summary without re-running DataExecutor.

502 tests pass. Wall time: 25m30s → 5m43s (full), 2m28s (no slow).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
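
The DataExecutor change described in the message can be illustrated with a minimal sketch. This is not the actual DataExecutor code, which is not shown in this commit view: the `Op` dataclass, the op kind strings other than `dma_read`, and the `run_sequential` helper are hypothetical names chosen for illustration. It shows the two ideas from the message: walk the same-t_start groups in a plain loop on the calling thread, and drop `dma_read` ops before they reach the dispatcher.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Callable, Iterable


@dataclass
class Op:
    """Hypothetical stand-in for a recorded op; the real type is not shown here."""
    kind: str      # "dma_read" comes from the commit message; other kinds are made up
    t_start: int   # scheduled start time used for grouping
    name: str = ""


def run_sequential(ops: Iterable[Op], execute_op: Callable[[Op], None]) -> int:
    """Replay ops in t_start order without a thread pool.

    Returns the number of ops that were actually dispatched.
    """
    executed = 0
    # sorted() is stable, so ops sharing a t_start keep their recorded order.
    ordered = sorted(ops, key=lambda op: op.t_start)
    for _, group in groupby(ordered, key=lambda op: op.t_start):
        # Groups are usually a single op, so a plain loop avoids the cost
        # of creating a pool and submitting one task to it.
        for op in group:
            if op.kind == "dma_read":
                # Treated as a no-op in this phase: skip before dispatch.
                continue
            execute_op(op)
            executed += 1
    return executed
```

With ops recorded at t_start 0 (a `dma_read`), 1 (two compute ops) and 2 (a write), `run_sequential` dispatches the three non-read ops in time order and returns 3.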
@@ -66,31 +66,39 @@ def _write_ccl_yaml(
 
 CASES = [
     # algorithm, module, topology, buffer_kind, world_size, n_elem, expected_ws
+    #
+    # Full-system (256-rank, cross-SIP) — run only ONCE (tcm). Buffer
+    # variant differences are purely IPCQ slot placement; the compute path
+    # is identical. Cross-SIP routing is the real thing being verified here.
     pytest.param(
         "ring_allreduce_tcm", "kernbench.ccl.algorithms.ring_allreduce",
         "ring_1d", "tcm", None, 8, 256,
-        id="ring_full_system_tcm",
+        id="ring_full_system",
+        marks=pytest.mark.slow,
     ),
+    # Buffer variants at 8-rank (fast — same kernel, different slot space).
+    pytest.param(
+        "ring_allreduce_tcm", "kernbench.ccl.algorithms.ring_allreduce",
+        "ring_1d", "tcm", 8, 32, 8,
+        id="ring_tcm_8",
+    ),
     pytest.param(
         "ring_allreduce_hbm", "kernbench.ccl.algorithms.ring_allreduce",
-        "ring_1d", "hbm", None, 8, 256,
-        id="ring_full_system_hbm",
+        "ring_1d", "hbm", 8, 32, 8,
+        id="ring_hbm_8",
     ),
     pytest.param(
         "ring_allreduce_sram", "kernbench.ccl.algorithms.ring_allreduce",
-        "ring_1d", "sram", None, 8, 256,
-        id="ring_full_system_sram",
-    ),
-    pytest.param(
-        "ring_allreduce_8", "kernbench.ccl.algorithms.ring_allreduce",
-        "ring_1d", "tcm", 8, 32, 8,
-        id="ring_single_cube",
+        "ring_1d", "sram", 8, 32, 8,
+        id="ring_sram_8",
     ),
+    # Multi-cube (16-rank, cross-cube within 1 SIP).
     pytest.param(
         "ring_allreduce_16", "kernbench.ccl.algorithms.ring_allreduce",
         "ring_1d", "tcm", 16, 16, 16,
         id="ring_multi_cube",
     ),
+    # Mesh + tree algorithms.
     pytest.param(
         "mesh_allreduce_4", "kernbench.ccl.algorithms.mesh_allreduce",
         "mesh_2d", "tcm", 4, 16, 4,
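
The hunk above attaches `pytest.mark.slow` to the full-system case, and the commit message relies on `pytest -m "not slow"` to deselect it. Pytest warns about marks it does not know (and rejects them under `--strict-markers`), so the mark is normally registered somewhere. Whether this repository already does so is not visible in this diff; the sketch below is one possible way, assuming a `conftest.py` at the test root. The marker description text and the `SLOW_MARKER` name are illustrative, not taken from the project.

```python
# conftest.py (assumed location; the project may register the marker elsewhere)

# Hypothetical description string; only the "slow" name comes from the diff.
SLOW_MARKER = 'slow: 256-rank full-system runs; deselect with -m "not slow"'


def pytest_configure(config):
    """Register the 'slow' marker so pytest treats it as a known mark."""
    # addinivalue_line appends an entry to the "markers" ini option,
    # which is equivalent to listing it under [pytest] markers = ...
    config.addinivalue_line("markers", SLOW_MARKER)
```

With the marker registered, `pytest -m "not slow"` skips the 256-rank case for local runs, and a plain `pytest` invocation still runs the full matrix.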