kernbench2/tests at 19dfc86dc39f9ff7eb8712ca838c98967638ac1f - kernbench2 - YWGitServer

ywkang/kernbench2

mukesh 19dfc86dc3 Allreduce latency sweep across topologies and data sizes

Adds test_allreduce_latency_sweep, which runs the existing intercube
allreduce kernel under three SIP topologies (ring_1d, torus_2d,
mesh_2d_no_wrap, all at n_sips=4) across 11 data sizes from 256 B/SIP
up to 1 MB/SIP. For each point, it captures max(pe_exec_ns), the
critical-path kernel time, and emits a CSV plus log-x and linear-x
plots (per-topology and a combined overview) with KB/MB-formatted
tick labels. Reuses run_allreduce and _write_temp_configs, and adds a
slot_size auto-bump when n_elem*2 exceeds the default IPCQ slot.
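
A minimal sketch of the three small pieces the message above describes: the critical-path reduction, the slot_size auto-bump, and KB/MB tick labels. The function names (critical_path_ns, bump_slot_size, format_bytes), the 4096-byte default slot, and the choice to bump exactly to n_elem*2 bytes are assumptions for illustration only; the actual test code may differ.

```python
from typing import Iterable


def critical_path_ns(pe_exec_ns: Iterable[int]) -> int:
    """Kernel time is bounded by the slowest PE, so take the max."""
    return max(pe_exec_ns)


def bump_slot_size(n_elem: int, default_slot_size: int, bytes_per_elem: int = 2) -> int:
    """Grow the slot only when the payload (n_elem * 2 bytes) would not fit.

    Assumption: the bump raises slot_size to exactly the payload size.
    """
    needed = n_elem * bytes_per_elem
    return max(default_slot_size, needed)


def format_bytes(n: int) -> str:
    """Format a byte count as a B/KB/MB tick label."""
    if n >= 1 << 20:
        return f"{n / (1 << 20):g} MB"
    if n >= 1 << 10:
        return f"{n / (1 << 10):g} KB"
    return f"{n} B"


if __name__ == "__main__":
    # Hypothetical per-PE execution times for one sweep point.
    print(critical_path_ns([1200, 1850, 1790, 1620]))
    # Hypothetical default slot of 4096 bytes.
    print(bump_slot_size(n_elem=8192, default_slot_size=4096))
    print([format_bytes(b) for b in (256, 4096, 1 << 20)])
```

Taking the maximum rather than the mean matches the "critical-path" wording: the collective is not complete until the last PE finishes.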

Sweep skips n_elem=16 because the runtime's dim_map scalar-arg
remapping (context.py:761) treats any int-valued kernel scalar that
equals a global tensor dim as that dim and replaces it with the local
shard size.
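
The collision can be illustrated with a toy remapper. This is not the code at context.py:761; remap_scalars, the dim_map contents, and the argument names are hypothetical and only show how a value-based remap can rewrite an unrelated scalar.

```python
def remap_scalars(scalars: dict, dim_map: dict) -> dict:
    """Toy value-based remap: any int scalar equal to a global dim
    is replaced by that dim's local shard size."""
    return {
        name: dim_map.get(value, value) if isinstance(value, int) else value
        for name, value in scalars.items()
    }


# Hypothetical: a global tensor dim of 16 sharded over 4 SIPs gives a local size of 4.
dim_map = {16: 4}

# n_elem is an element count, not a tensor dim, but its value matches one.
remapped = remap_scalars({"n_elem": 16, "scale": 0.5}, dim_map)
print(remapped)

# A value that matches no global dim passes through unchanged.
print(remap_scalars({"n_elem": 32}, dim_map))
```

Because the match is on the value alone, the remapper cannot tell an element count from a tensor dimension, which is why the sweep avoids the colliding size instead of passing it through.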

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-27 10:16:29 -07:00

..

Add SIP-level tensor parallelism, component registry YAML, VA offset verification

2026-03-26 01:13:17 -07:00

conftest.py

Add session-scoped topology fixture in tests/conftest.py

2026-04-12 21:13:25 -07:00

test_adr0026_dppolicy_intra_device.py

ADR-0026: DPPolicy intra-device only + ShardSpec structural coords

2026-04-14 13:02:19 -07:00

test_allreduce_multidevice.py

Allreduce latency sweep across topologies and data sizes

2026-04-27 10:16:29 -07:00

test_bw_occupancy.py

Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019)

2026-04-04 17:51:28 -07:00

test_ccl_framework.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_helpers.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_install.py

Intercube allreduce: pe0 cube-mesh reduce + multi-SIP ring/torus/mesh

2026-04-16 17:33:42 -07:00

test_ccl_round_robin_recv.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_strict_mode.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_topologies.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_cli_verify_data.py

Add --verify-data CLI flag, Tensor.data property, parallel DataExecutor

2026-04-09 09:34:01 -07:00

test_cli.py

Fix cross-SIP PE_TCM access by scoping deploy to target_device SIP

2026-04-04 18:03:11 -07:00

test_component_registry.py

Rename impl names: add builtin. prefix for clear provenance

2026-04-09 00:16:24 -07:00

test_cross_sip_routing.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_data_executor.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_distributed_intercube_allreduce.py

Intercube allreduce: pe0 cube-mesh reduce + multi-SIP ring/torus/mesh

2026-04-16 17:33:42 -07:00

test_e2e_data.py

Add Phase 1→Phase 2 e2e data tests + GraphEngine enable_data mode

2026-04-08 23:49:28 -07:00

test_e2e_pipeline.py

Add E2E pipeline tests: greenlet op_log, GEMM accuracy, latency regression

2026-04-09 00:28:03 -07:00

test_engine.py

Add probe CLI improvements, D2H read, UCIe/HBM tuning, BW sweep

2026-03-19 01:16:18 -07:00

test_intercube_sfr_config.py

Intercube allreduce: pe0 cube-mesh reduce + multi-SIP ring/torus/mesh

2026-04-16 17:33:42 -07:00

test_iochiplet_noc_d2h.py

Add probe CLI improvements, D2H read, UCIe/HBM tuning, BW sweep

2026-03-19 01:16:18 -07:00

test_ipcq_types.py

Fix ADR-0025: IPCQ direction addressing via address-based matching

2026-04-14 00:38:41 -07:00

test_kernel_launch_sync.py

Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023)

2026-04-23 15:30:29 -07:00

test_kernel_runner.py

Implement ADR-0020: 2-pass data execution with greenlet kernel runner

2026-04-08 00:22:44 -07:00

test_memory_store.py

Implement ADR-0020: 2-pass data execution with greenlet kernel runner

2026-04-08 00:22:44 -07:00

test_mmu_component.py

Rename impl names: add builtin. prefix for clear provenance

2026-04-09 00:16:24 -07:00

test_mmu_fabric.py

Wire PE_MMU to router mesh for MmuMapMsg delivery

2026-04-04 18:10:42 -07:00

test_noc_mesh.py

Remove xbar/noc remnants, rule-based cube-view connectors

2026-04-06 23:59:12 -07:00

test_op_log.py

Implement ADR-0020: 2-pass data execution with greenlet kernel runner

2026-04-08 00:22:44 -07:00

test_pe_components.py

Rename impl names: add builtin. prefix for clear provenance

2026-04-09 00:16:24 -07:00

test_pe_dma_ipcq.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_pe_ipcq.py

Fix ADR-0025: IPCQ direction addressing via address-based matching

2026-04-14 00:38:41 -07:00

test_pe_mmu.py

Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg

2026-03-26 00:01:47 -07:00

test_pe_pipeline.py

Add ADR-0021 pipeline tests: self-routing, tiling, overlap

2026-04-08 23:40:19 -07:00

test_pe_to_pe_latency.py

PE-to-PE latency test + supporting fixes

2026-04-22 21:04:31 -07:00

test_phase_a_components.py

Rename impl names: add builtin. prefix for clear provenance

2026-04-09 00:16:24 -07:00

test_phyaddr.py

commit - release 1

2026-03-18 11:47:48 -07:00

test_probe.py

Remove xbar/noc remnants, rule-based cube-view connectors

2026-04-06 23:59:12 -07:00

test_routing.py

Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019)

2026-04-04 17:51:28 -07:00

test_runtime_api_tensor.py

ADR-0026: DPPolicy intra-device only + ShardSpec structural coords

2026-04-14 13:02:19 -07:00

test_sip_parallel.py

ADR-0026: DPPolicy intra-device only + ShardSpec structural coords

2026-04-14 13:02:19 -07:00

test_tensor_free.py

Wire PE_MMU to router mesh for MmuMapMsg delivery

2026-04-04 18:10:42 -07:00

test_tensor.py

ADR-0026: DPPolicy intra-device only + ShardSpec structural coords

2026-04-14 13:02:19 -07:00

test_tl_ipcq_api.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_topology_compile.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_topology_load.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_topology_visualize.py

Cube-view SVG: detailed topology validation rendering

2026-04-04 22:03:38 -07:00

test_tp_layers.py

ADR-0027: Megatron TP API + worker-wait generalization + mp.spawn

2026-04-14 16:31:13 -07:00

test_tp_mlp.py

ADR-0027: Megatron TP API + worker-wait generalization + mp.spawn

2026-04-14 16:31:13 -07:00

test_tp_parallel_state.py

ADR-0027: Megatron TP API + worker-wait generalization + mp.spawn

2026-04-14 16:31:13 -07:00

test_triton_emu.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_va_allocator.py

Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg

2026-03-26 00:01:47 -07:00

test_va_integration.py

ADR-0026: DPPolicy intra-device only + ShardSpec structural coords

2026-04-14 13:02:19 -07:00

test_va_offset.py

ADR-0026: DPPolicy intra-device only + ShardSpec structural coords

2026-04-14 13:02:19 -07:00