Speed up regression: 25min → 6min (test matrix + DataExecutor cleanup)

Test matrix restructure:
- 256-rank full-system ring runs only ONCE (marked pytest.mark.slow)
  instead of 7× across matrix + perf tests. Cross-SIP routing is
  verified by the single run; buffer variants (tcm/hbm/sram) are
  tested at 8-rank, where they finish in <0.5s.
- Performance tests use 8-rank instead of 256-rank.
- `pytest -m "not slow"` completes in ~2.5min (local dev).
- Full suite including slow: ~6min (CI).

DataExecutor optimization:
- Remove ThreadPoolExecutor from DataExecutor.run(). Same-t_start
  groups are almost always size 1, so the thread pool creation and
  dispatch overhead dominated. A simple sequential loop is faster.
- Skip dma_read ops at the loop level (they are always no-ops in
  Phase 2 but were dispatched through _execute_op → _execute_memory).
- Remove redundant CLI Phase 2 re-execution: engine._flush_data_phase
  already replays during engine.wait(); the CLI now only prints the
  diagnostic summary without re-running DataExecutor.

502 tests pass. Wall time: 25m30s → 5m43s (full), 2m28s (no slow).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
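
The test-matrix split described in the message could look roughly like the sketch below. It is illustrative only: `run_ring`, the test names, and the buffer chosen for the 256-rank run are assumptions, since the actual test files are not part of this diff. Only `pytest.mark.slow`, the 8-rank/256-rank split, and the tcm/hbm/sram variants come from the commit message.

```python
import pytest


def run_ring(ranks: int, buffer: str) -> bool:
    """Hypothetical stand-in for the real ring-test driver (not in this diff)."""
    return ranks > 0 and buffer in {"tcm", "hbm", "sram"}


# Buffer variants run at 8 ranks, where each case is cheap.
@pytest.mark.parametrize("buffer", ["tcm", "hbm", "sram"])
def test_ring_buffer_variants_8_rank(buffer):
    assert run_ring(ranks=8, buffer=buffer)


# The full-system run happens once and is excluded by `pytest -m "not slow"`.
# The buffer used here is an arbitrary choice for the sketch.
@pytest.mark.slow
def test_ring_full_system_256_rank():
    assert run_ring(ranks=256, buffer="hbm")
```

With this layout, `pytest -m "not slow"` deselects the 256-rank test and a plain `pytest` run includes it. Registering `slow` under `markers` in the pytest configuration avoids an unknown-marker warning.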
@@ -6,8 +6,6 @@ Same-timestamp independent ops can be batched for efficiency.
 """
 from __future__ import annotations
 
-from concurrent.futures import ThreadPoolExecutor
-from itertools import groupby
 from typing import Any
 
 import numpy as np
@@ -29,18 +27,16 @@ class DataExecutor:
         self.store = store
 
     def run(self) -> None:
-        """Execute all ops in op_log order, grouped by t_start.
+        """Execute all ops in op_log order.
 
-        Same-timestamp ops are independent and executed in parallel
-        via ThreadPoolExecutor (numpy releases the GIL for BLAS ops).
+        Ops are processed sequentially in t_start order. The previous
+        ThreadPoolExecutor-based parallel execution was removed because
+        same-t_start groups are almost always size 1 (each PE processes
+        one command at a time), so the thread-pool overhead dominated.
         """
-        with ThreadPoolExecutor() as pool:
-            for _t, ops_iter in groupby(self._op_log, key=lambda r: r.t_start):
-                ops = list(ops_iter)
-                if len(ops) == 1:
-                    self._execute_op(ops[0])
-                else:
-                    list(pool.map(self._execute_op, ops))
+        for op in self._op_log:
+            if op.op_kind != "memory" or op.op_name != "dma_read":
+                self._execute_op(op)
 
     def _execute_op(self, op: OpRecord) -> None:
         if op.op_kind == "memory":
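
As a rough check on the change to `run()`, the sketch below places the old grouped dispatch next to the new sequential loop and confirms that both execute the same ops when every t_start group has size 1. It is a standalone sketch, not the project's code: this `OpRecord` is a hypothetical minimal dataclass, the op names other than `dma_read` are made up, and the recorder treats `dma_read` as a no-op to mirror what the commit message says about Phase 2.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import groupby


@dataclass
class OpRecord:
    # Hypothetical minimal record; the real OpRecord is not shown in this diff.
    t_start: int
    op_kind: str
    op_name: str


def run_grouped(op_log, execute):
    """Old shape: group by t_start and fan multi-op groups out to a pool."""
    with ThreadPoolExecutor() as pool:
        for _t, ops_iter in groupby(op_log, key=lambda r: r.t_start):
            ops = list(ops_iter)
            if len(ops) == 1:
                execute(ops[0])
            else:
                list(pool.map(execute, ops))


def run_sequential(op_log, execute):
    """New shape: plain loop that skips dma_read memory ops up front."""
    for op in op_log:
        if op.op_kind != "memory" or op.op_name != "dma_read":
            execute(op)


def make_recorder():
    """Return a list and an execute() callback that appends to it."""
    seen = []

    def execute(op):
        if op.op_kind == "memory" and op.op_name == "dma_read":
            return  # no-op in Phase 2, per the commit message
        seen.append((op.t_start, op.op_name))

    return seen, execute


# One op per timestamp, the common case the commit message describes.
op_log = [
    OpRecord(0, "memory", "dma_read"),
    OpRecord(1, "compute", "matmul"),
    OpRecord(2, "memory", "dma_write"),
    OpRecord(3, "compute", "add"),
]

old_seen, old_exec = make_recorder()
run_grouped(op_log, old_exec)
new_seen, new_exec = make_recorder()
run_sequential(op_log, new_exec)
assert old_seen == new_seen
```

The equivalence shown here only covers logs where each timestamp has a single op. For larger groups the old code ran ops concurrently, so the sequential version additionally fixes their order to op_log order.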