Speed up regression: 25min → 6min (test matrix + DataExecutor cleanup)

Test matrix restructure:
- 256-rank full-system ring runs only ONCE (marked pytest.mark.slow)
  instead of 7× across matrix + perf tests. Cross-SIP routing is
  verified by the single run; buffer variants (tcm/hbm/sram) are
  tested at 8-rank where they finish in <0.5s.
- Performance tests use 8-rank instead of 256-rank.
- `pytest -m "not slow"` completes in ~2.5min (local dev).
- Full suite including slow: ~6min (CI).
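
A minimal sketch of the restructured matrix, assuming a plain-data layout. `Case`, `build_matrix`, and `select` are hypothetical names, and the buffer chosen for the 256-rank run is an arbitrary placeholder. In the real suite the slow case would carry pytest.mark.slow rather than a boolean flag.

```python
from __future__ import annotations

from typing import NamedTuple


class Case(NamedTuple):
    """Hypothetical description of one ring-test configuration."""
    ranks: int
    buffer: str
    slow: bool


# Buffer variants named in the commit message.
BUFFERS = ("tcm", "hbm", "sram")


def build_matrix() -> list[Case]:
    # Every buffer variant runs at 8 ranks, where a run is cheap.
    cases = [Case(ranks=8, buffer=b, slow=False) for b in BUFFERS]
    # Exactly one 256-rank full-system run covers cross-SIP routing.
    # The buffer picked here is an assumption made for illustration.
    cases.append(Case(ranks=256, buffer="hbm", slow=True))
    return cases


def select(cases: list[Case], include_slow: bool) -> list[Case]:
    # Mirrors the effect of `pytest -m "not slow"` when include_slow is False.
    return [c for c in cases if include_slow or not c.slow]


matrix = build_matrix()
fast_only = select(matrix, include_slow=False)
```

With this layout the local-dev selection contains only 8-rank cases, and the CI selection adds the single 256-rank run.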

DataExecutor optimization:
- Remove ThreadPoolExecutor from DataExecutor.run(). Same-t_start
  groups are almost always size 1, so the thread pool creation and
  dispatch overhead dominated. Simple sequential loop is faster.
- Skip dma_read ops at the loop level (they are always no-ops in
  Phase 2 but were dispatched through _execute_op → _execute_memory).
- Remove redundant CLI Phase 2 re-execution: engine._flush_data_phase
  already replays during engine.wait(); the CLI now only prints the
  diagnostic summary without re-running DataExecutor.
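
The two loop-level changes above can be sketched in isolation. This is an illustration under stated assumptions, not the project's code: `OpRecord` here is a hypothetical three-field stand-in, `group_size_histogram` and `run_sequential` are made-up helper names, and the op names other than dma_read are invented sample data.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from itertools import groupby
from typing import Callable, Iterable


@dataclass
class OpRecord:
    """Hypothetical minimal stand-in for the real op-log record."""
    t_start: int
    op_kind: str
    op_name: str


def group_size_histogram(op_log: Iterable[OpRecord]) -> Counter:
    """Count same-t_start groups by size.

    Assumes op_log is already ordered by t_start, because groupby only
    merges adjacent records.
    """
    return Counter(
        len(list(group))
        for _t, group in groupby(op_log, key=lambda r: r.t_start)
    )


def run_sequential(op_log: Iterable[OpRecord],
                   execute: Callable[[OpRecord], None]) -> int:
    """Run ops in log order, skipping dma_read before dispatch."""
    executed = 0
    for op in op_log:
        if op.op_kind == "memory" and op.op_name == "dma_read":
            continue  # skipped at the loop level, never dispatched
        execute(op)
        executed += 1
    return executed


# Invented sample log: one dma_read, then three ops that must run.
log = [
    OpRecord(0, "memory", "dma_read"),
    OpRecord(1, "gemm", "matmul"),
    OpRecord(2, "math", "add"),
    OpRecord(2, "memory", "dma_write"),
]
seen: list[OpRecord] = []
count = run_sequential(log, seen.append)
sizes = group_size_histogram(log)
```

A histogram dominated by size-1 groups is the situation the commit describes, where a thread pool has little to parallelize and its setup cost is pure overhead.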

502 tests pass. Wall time: 25m30s → 5m43s (full), 2m28s (no slow).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 20:52:07 -07:00
parent 998cc85762
commit bcf941dcee
4 changed files with 60 additions and 106 deletions
+4 -7
@@ -72,15 +72,12 @@ def cmd_run(args) -> int:
     print(format_report(result.traces, title=args.bench, spec=spec))
     print(result.summary_text())
-    # Phase 2: data execution (ADR-0020)
+    # Phase 2 diagnostic summary (ADR-0020). The actual Phase 2 replay
+    # already runs inside engine.wait() → _flush_data_phase(). We only
+    # print the summary here; no redundant re-execution.
     if verify_data and result.engine is not None:
-        from kernbench.sim_engine.data_executor import DataExecutor
         op_log = result.engine.op_log
-        store = result.engine.memory_store
-        if op_log and store is not None:
-            executor = DataExecutor(op_log, store)
-            executor.run()
+        if op_log:
             n_gemm = sum(1 for r in op_log if r.op_kind == "gemm")
             n_math = sum(1 for r in op_log if r.op_kind == "math")
             print(f"[data] Phase 2 complete: {len(op_log)} ops ({n_gemm} gemm, {n_math} math)")
+8 -12
@@ -6,8 +6,6 @@ Same-timestamp independent ops can be batched for efficiency.
 """
 from __future__ import annotations
-from concurrent.futures import ThreadPoolExecutor
-from itertools import groupby
 from typing import Any
 import numpy as np
@@ -29,18 +27,16 @@ class DataExecutor:
         self.store = store
     def run(self) -> None:
-        """Execute all ops in op_log order, grouped by t_start.
+        """Execute all ops in op_log order.
-        Same-timestamp ops are independent and executed in parallel
-        via ThreadPoolExecutor (numpy releases the GIL for BLAS ops).
+        Ops are processed sequentially in t_start order. The previous
+        ThreadPoolExecutor-based parallel execution was removed because
+        same-t_start groups are almost always size 1 (each PE processes
+        one command at a time), so the thread-pool overhead dominated.
         """
-        with ThreadPoolExecutor() as pool:
-            for _t, ops_iter in groupby(self._op_log, key=lambda r: r.t_start):
-                ops = list(ops_iter)
-                if len(ops) == 1:
-                    self._execute_op(ops[0])
-                else:
-                    list(pool.map(self._execute_op, ops))
+        for op in self._op_log:
+            if op.op_kind != "memory" or op.op_name != "dma_read":
+                self._execute_op(op)
     def _execute_op(self, op: OpRecord) -> None:
         if op.op_kind == "memory":