Reduce test time to 12s: shrink GEMM dims + enable pytest-xdist

GEMM dimension reduction:
- qkv_gemm.py: M,K,N = 128,256,128 → 32,64,32 (64 tiles → 1 tile).
- qkv_gemm_multi_pe.py: same reduction.
- Tests verify pipeline correctness, not large-matrix throughput.
- Per-test time: 18s → 1.7s. 6 tests total: 108s → 10s.

pytest-xdist parallel execution:
- Add pytest-xdist to dev dependencies.
- pyproject.toml addopts: -n auto (use all CPU cores), -m "not slow".
- Default `pytest` runs 501 tests in ~12s (previously 148s).
- Full suite including slow: `pytest -m ""` → 3m24s (previously 5m43s).

pytest.mark.slow:
- Registered in pyproject.toml markers section.
- 256-rank full-system test is the only slow-marked test.
- Run with: `pytest -m ""` (CI) or `pytest` (local dev, skips slow).

502 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
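
The "64 tiles → 1 tile" figure in the message above can be sanity-checked with a short sketch. It assumes a tile covers 32 x 64 x 32 in (M, K, N), which is inferred from the statement that the new dimensions fit in one tile and is not confirmed by the diff. The names `TILE_M`, `TILE_K`, `TILE_N`, and `tile_count` are hypothetical and do not come from kernbench.

```python
from math import ceil

# Assumed tile shape (hypothetical): inferred from "32,64,32 = 1 tile".
TILE_M, TILE_K, TILE_N = 32, 64, 32


def tile_count(m: int, k: int, n: int) -> int:
    """Number of tiles needed to cover an (m, k) x (k, n) GEMM."""
    return ceil(m / TILE_M) * ceil(k / TILE_K) * ceil(n / TILE_N)


old_tiles = tile_count(128, 256, 128)  # dimensions before this commit
new_tiles = tile_count(32, 64, 32)     # dimensions after this commit
print(f"{old_tiles} tiles -> {new_tiles} tile")
```

Under that assumption each dimension shrinks by a factor of 4, which gives the 4 x 4 x 4 = 64x reduction in tile count quoted above.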
2026-04-12 21:06:41 -07:00
parent bcf941dcee
commit 372c987995
3 changed files with 9 additions and 3 deletions
@@ -10,7 +10,9 @@ Kernel: tl.load(a) + tl.ref(b) + tl.composite(gemm) + tl.wait()
 from kernbench.policy.placement.dp import DPPolicy

 # GEMM dimensions: (M, K) x (K, N) → (M, N)
-M, K, N = 128, 256, 128
+# Small dims (1 tile) for fast regression. The test verifies the full
+# host→PE pipeline, not large-matrix throughput.
+M, K, N = 32, 64, 32
 DTYPE = "f16"


@@ -10,7 +10,9 @@ Kernel: tl.load(a) + tl.ref(b) + tl.composite(gemm) + tl.wait()
 from kernbench.policy.placement.dp import DPPolicy

 # GEMM dimensions: (M, K) x (K, N) -> (M, N)
-M, K, N = 128, 256, 128
+# Small dims (1 tile) for fast regression. The test verifies the multi-PE
+# fan-out pipeline, not large-matrix throughput.
+M, K, N = 32, 64, 32
 DTYPE = "f16"