kernbench2/tests at 372c987995f428f2ceeb48ddc5ba9908df3087c7 - kernbench2 - YWGitServer

ywkang/kernbench2

ywkang bcf941dcee Speed up regression: 25min → 6min (test matrix + DataExecutor cleanup)

Test matrix restructure:
- 256-rank full-system ring runs only ONCE (marked pytest.mark.slow)
  instead of 7× across matrix + perf tests. Cross-SIP routing is
  verified by the single run; buffer variants (tcm/hbm/sram) are
  tested at 8-rank where they finish in <0.5s.
- Performance tests use 8-rank instead of 256-rank.
- `pytest -m "not slow"` completes in ~2.5min (local dev).
- Full suite including slow: ~6min (CI).
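
The matrix split described above could be expressed roughly as follows. This is a minimal sketch, not the repository's actual test code: `RING_CASES`, `run_ring_allreduce`, the test IDs, and the choice of `tcm` for the 256-rank run are all assumptions made for illustration. Only the `slow` marker name, the 8/256 rank counts, and the three buffer variants come from the commit message.

```python
import pytest

# Buffer variants named in the commit message; exercised only at 8 ranks.
BUFFERS = ["tcm", "hbm", "sram"]

# Fast cases: one 8-rank run per buffer variant.
RING_CASES = [pytest.param(8, buf, id=f"ring-8-{buf}") for buf in BUFFERS]

# Single full-system case, marked slow so `-m "not slow"` deselects it.
# Which buffer the real 256-rank run uses is an assumption here.
RING_CASES.append(
    pytest.param(256, "tcm", id="ring-256-full", marks=pytest.mark.slow)
)


def run_ring_allreduce(ranks, buffer):
    """Hypothetical stand-in for the benchmark: returns the ring step count."""
    # A ring all-reduce takes (ranks - 1) reduce-scatter steps
    # followed by (ranks - 1) all-gather steps.
    return 2 * (ranks - 1)


@pytest.mark.parametrize("ranks,buffer", RING_CASES)
def test_ring_allreduce(ranks, buffer):
    assert run_ring_allreduce(ranks, buffer) == 2 * (ranks - 1)
```

With a layout like this, `pytest -m "not slow"` runs only the three 8-rank cases. A custom marker such as `slow` should also be registered under `markers` in the pytest configuration so it does not trigger an unknown-marker warning.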

DataExecutor optimization:
- Remove ThreadPoolExecutor from DataExecutor.run(). Same-t_start
  groups are almost always size 1, so thread pool creation and
  dispatch overhead dominated the runtime. A simple sequential loop
  is faster.
- Skip dma_read ops at the loop level (they are always no-ops in
  Phase 2 but were dispatched through _execute_op → _execute_memory).
- Remove redundant CLI Phase 2 re-execution: engine._flush_data_phase
  already replays during engine.wait(); the CLI now only prints the
  diagnostic summary without re-running DataExecutor.
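
The first two DataExecutor changes above can be pictured with the sketch below. It is a simplified illustration and not the actual `DataExecutor.run()`: the `Op` dataclass, `execute_op`, and `run_sequential` are hypothetical names, and only the `t_start` grouping and the `dma_read` skip are taken from the commit message.

```python
from dataclasses import dataclass
from itertools import groupby
from operator import attrgetter


@dataclass
class Op:
    """Hypothetical op record; the real op type is not shown in the source."""
    t_start: int
    kind: str
    payload: int


def execute_op(op):
    """Hypothetical stand-in for per-op dispatch."""
    return (op.kind, op.payload)


def run_sequential(ops):
    """Replay ops in t_start order with a plain loop, no thread pool."""
    results = []
    ordered = sorted(ops, key=attrgetter("t_start"))  # stable sort
    for _, group in groupby(ordered, key=attrgetter("t_start")):
        # Groups sharing a t_start are usually size 1, so a pool per
        # group would cost more than the work it parallelizes.
        for op in group:
            if op.kind == "dma_read":
                continue  # no-op in Phase 2: skip before any dispatch
            results.append(execute_op(op))
    return results
```

Skipping `dma_read` inside the loop avoids the function-call chain entirely for ops that would do nothing, which is the same idea as the loop-level skip described in the commit.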

502 tests pass. Wall time: 25m30s → 5m43s (full), 2m28s (no slow).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-12 20:52:07 -07:00

..

Add SIP-level tensor parallelism, component registry YAML, VA offset verification

2026-03-26 01:13:17 -07:00

test_bw_occupancy.py

Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019)

2026-04-04 17:51:28 -07:00

test_ccl_allreduce_matrix.py

Speed up regression: 25min → 6min (test matrix + DataExecutor cleanup)

2026-04-12 20:52:07 -07:00

test_ccl_deadlock_detection.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_diagnostics.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_framework.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_hello_world_guide.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_helpers.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_install.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_mock_runtime.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_performance.py

Speed up regression: 25min → 6min (test matrix + DataExecutor cleanup)

2026-04-12 20:52:07 -07:00

test_ccl_round_robin_recv.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_strict_mode.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_topologies.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_cli_verify_data.py

Add --verify-data CLI flag, Tensor.data property, parallel DataExecutor

2026-04-09 09:34:01 -07:00

test_cli.py

Fix cross-SIP PE_TCM access by scoping deploy to target_device SIP

2026-04-04 18:03:11 -07:00

test_component_registry.py

Rename impl names: add builtin. prefix for clear provenance

2026-04-09 00:16:24 -07:00

test_cross_sip_routing.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_data_executor.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_e2e_data.py

Add Phase 1→Phase 2 e2e data tests + GraphEngine enable_data mode

2026-04-08 23:49:28 -07:00

test_e2e_pipeline.py

Add E2E pipeline tests: greenlet op_log, GEMM accuracy, latency regression

2026-04-09 00:28:03 -07:00

test_engine.py

Add probe CLI improvements, D2H read, UCIe/HBM tuning, BW sweep

2026-03-19 01:16:18 -07:00

test_iochiplet_noc_d2h.py

Add probe CLI improvements, D2H read, UCIe/HBM tuning, BW sweep

2026-03-19 01:16:18 -07:00

test_ipcq_types.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_kernel_runner.py

Implement ADR-0020: 2-pass data execution with greenlet kernel runner

2026-04-08 00:22:44 -07:00

test_memory_store.py

Implement ADR-0020: 2-pass data execution with greenlet kernel runner

2026-04-08 00:22:44 -07:00

test_mmu_component.py

Rename impl names: add builtin. prefix for clear provenance

2026-04-09 00:16:24 -07:00

test_mmu_fabric.py

Wire PE_MMU to router mesh for MmuMapMsg delivery

2026-04-04 18:10:42 -07:00

test_noc_mesh.py

Remove xbar/noc remnants, rule-based cube-view connectors

2026-04-06 23:59:12 -07:00

test_op_log.py

Implement ADR-0020: 2-pass data execution with greenlet kernel runner

2026-04-08 00:22:44 -07:00

test_pe_components.py

Rename impl names: add builtin. prefix for clear provenance

2026-04-09 00:16:24 -07:00

test_pe_dma_ipcq.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_pe_ipcq.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_pe_mmu.py

Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg

2026-03-26 00:01:47 -07:00

test_pe_pipeline.py

Add ADR-0021 pipeline tests: self-routing, tiling, overlap

2026-04-08 23:40:19 -07:00

test_phase_a_components.py

Rename impl names: add builtin. prefix for clear provenance

2026-04-09 00:16:24 -07:00

test_phyaddr.py

commit - release 1

2026-03-18 11:47:48 -07:00

test_probe.py

Remove xbar/noc remnants, rule-based cube-view connectors

2026-04-06 23:59:12 -07:00

test_recv_copy_to_dst.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_routing.py

Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019)

2026-04-04 17:51:28 -07:00

test_runtime_api_tensor.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_sip_parallel.py

Add SIP-level tensor parallelism, component registry YAML, VA offset verification

2026-03-26 01:13:17 -07:00

test_tensor_free.py

Wire PE_MMU to router mesh for MmuMapMsg delivery

2026-04-04 18:10:42 -07:00

test_tensor.py

commit - release 1

2026-03-18 11:47:48 -07:00

test_tl_ipcq_api.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_tl_recv_async.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_topology_compile.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_topology_load.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_topology_visualize.py

Cube-view SVG: detailed topology validation rendering

2026-04-04 22:03:38 -07:00

test_triton_emu.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_va_allocator.py

Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg

2026-03-26 00:01:47 -07:00

test_va_integration.py

Add SIP-level tensor parallelism, component registry YAML, VA offset verification

2026-03-26 01:13:17 -07:00

test_va_offset.py

Fix cross-SIP PE_TCM access by scoping deploy to target_device SIP

2026-04-04 18:03:11 -07:00