kernbench2/tests at 10b33b44ba5a0cbca7fa3b0236a1b41f6f9477ae - kernbench2 - YWGitServer

ywkang/kernbench2

ywkang 10b33b44ba Add Tensor indexing + hierarchical 3-level all-reduce kernel

Tensor.__setitem__ / __getitem__:
- Shard-aligned slice assignment and read on deployed tensors.
- Scalar broadcast and numpy array assignment supported.
- Cross-shard slices raise NotImplementedError (use copy_ for that).
- 3 new tests: single-PE, multi-PE, cross-shard error case.
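
Sketch of the shard-boundary check described above. This is a minimal illustration, not the Tensor implementation: it assumes a 1-D tensor split evenly across PEs and reads "shard-aligned" as "the slice lies inside a single shard". The name shard_for_slice and its parameters are hypothetical.

```python
def shard_for_slice(start, stop, shard_len, num_shards):
    """Map a global [start, stop) slice to (shard_index, local_start, local_stop).

    Assumes an even 1-D split: shard i owns [i * shard_len, (i + 1) * shard_len).
    """
    total = shard_len * num_shards
    if not (0 <= start < stop <= total):
        raise IndexError(f"slice [{start}, {stop}) out of range for length {total}")

    first = start // shard_len        # shard holding the first element
    last = (stop - 1) // shard_len    # shard holding the last element
    if first != last:
        # Mirrors the behaviour named in the commit message.
        raise NotImplementedError(
            "slice crosses a shard boundary; use copy_ for that"
        )

    base = first * shard_len
    return first, start - base, stop - base
```

With 4 shards of 4 elements each, the slice 5:7 resolves to local 1:3 on shard 1, while 2:6 spans shards 0 and 1 and raises NotImplementedError.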

Hierarchical all-reduce kernel (src/kernbench/ccl/algorithms/):
- 3-level reduce: intra-cube (E/W) → inter-cube (N/S) → inter-SIP (parent).
- Bidirectional ring reduce at each level: ceil((N-1)/2) rounds.
  Left half sends via dir_dec, right half via dir_inc (wrap).
  Representative receives from both sides.
- Chain broadcast for reverse path: cube 0 PE 0 → all PE 0s → all PEs.
- Registered in ccl.yaml as "hierarchical_allreduce" with topology: none
  (neighbors() override builds the full 3-level neighbor map).
- kernel_args derives pes_per_cube/cubes_per_sip/num_sips from world_size.
- Mock-verified at 8/16/32/64/128 ranks.
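
Sketch of one level of the bidirectional ring reduce, to show where the ceil((N-1)/2) round count comes from. It is a plain-Python simulation under stated assumptions, not the kernel in src/kernbench/ccl/algorithms/: rank 0 is taken as the representative, the reduction is a sum, and the function name is hypothetical.

```python
def bidirectional_ring_reduce(values):
    """Reduce a ring of N values into rank 0; return (total, rounds).

    Ranks 1..ceil((N-1)/2) form the left half and forward toward rank 0 via
    dir_dec (rank - 1). The remaining ranks form the right half and forward
    via dir_inc ((rank + 1) % N), wrapping around to rank 0.
    """
    n = len(values)
    acc = list(values)
    left_len = n // 2          # equals ceil((N - 1) / 2) for N >= 1
    rounds = left_len          # the longer chain sets the round count

    for rnd in range(rounds):
        # Left chain: the farthest unsent rank passes its partial sum down.
        sender = left_len - rnd
        if sender >= 1:
            acc[sender - 1] += acc[sender]
        # Right chain: the farthest unsent rank passes its partial sum up.
        sender = left_len + 1 + rnd
        if sender <= n - 1:
            acc[(sender + 1) % n] += acc[sender]

    return acc[0], rounds
```

Both chains advance in the same round, so an 8-rank ring finishes in 4 rounds instead of the 7 a one-directional chain would need.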

Mock runtime fixes:
- Direction pairing: explicit N↔S, E↔W, parent↔parent instead of
  "first matching reverse". Fixes 2-element rings where N and S both
  point to the same peer.
- Deadlock detection: send-counter based (not just queue-depth-total)
  to catch chain reductions where send+recv pairs net to zero.
- Multi-cube program_id: pes_per_cube parameter enables
  program_id(axis=0) = PE within cube, program_id(axis=1) = cube id.
  Legacy single-cube tests unaffected (default = world_size).
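
Sketch of two of the fixes above, reconstructed from the description only. The "first matching reverse" function is a guess at the old behaviour, the neighbor-map shape (rank -> {direction: peer}) is an assumption, and program_ids assumes ranks are laid out cube by cube. All names here are hypothetical.

```python
# Explicit pairing table, using the direction names from the commit message.
REVERSE = {"N": "S", "S": "N", "E": "W", "W": "E", "parent": "parent"}


def first_matching_reverse(neighbors, src, direction):
    """Assumed old pairing: first direction on the peer that points back at src."""
    peer = neighbors[src][direction]
    for peer_dir, target in neighbors[peer].items():
        if target == src:
            return peer, peer_dir
    raise KeyError(f"rank {peer} has no link back to rank {src}")


def explicit_reverse(neighbors, src, direction):
    """Fixed pairing: the receiving direction is looked up, never searched."""
    return neighbors[src][direction], REVERSE[direction]


def program_ids(rank, world_size, pes_per_cube=None):
    """Return (axis 0 = PE within cube, axis 1 = cube id) for a flat rank."""
    if pes_per_cube is None:
        pes_per_cube = world_size  # legacy default: one cube holds every PE
    return rank % pes_per_cube, rank // pes_per_cube


# 2-element ring: N and S on each rank point to the same peer.
ring2 = {0: {"N": 1, "S": 1}, 1: {"N": 0, "S": 0}}
```

On ring2, the searched pairing resolves sends on both N and S to the peer's N side, so the two streams collide. The explicit table sends N to the peer's S and S to the peer's N.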

504 tests pass in 12s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-12 23:52:04 -07:00

..

Add SIP-level tensor parallelism, component registry YAML, VA offset verification

2026-03-26 01:13:17 -07:00

conftest.py

Add session-scoped topology fixture in tests/conftest.py

2026-04-12 21:13:25 -07:00

test_bw_occupancy.py

Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019)

2026-04-04 17:51:28 -07:00

test_ccl_allreduce_matrix.py

Speed up regression: 25min → 6min (test matrix + DataExecutor cleanup)

2026-04-12 20:52:07 -07:00

test_ccl_deadlock_detection.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_diagnostics.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_framework.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_hello_world_guide.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_helpers.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_install.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_mock_runtime.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_performance.py

Speed up regression: 25min → 6min (test matrix + DataExecutor cleanup)

2026-04-12 20:52:07 -07:00

test_ccl_round_robin_recv.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_strict_mode.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_topologies.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_cli_verify_data.py

Add --verify-data CLI flag, Tensor.data property, parallel DataExecutor

2026-04-09 09:34:01 -07:00

test_cli.py

Fix cross-SIP PE_TCM access by scoping deploy to target_device SIP

2026-04-04 18:03:11 -07:00

test_component_registry.py

Rename impl names: add builtin. prefix for clear provenance

2026-04-09 00:16:24 -07:00

test_cross_sip_routing.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_data_executor.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_e2e_data.py

Add Phase 1→Phase 2 e2e data tests + GraphEngine enable_data mode

2026-04-08 23:49:28 -07:00

test_e2e_pipeline.py

Add E2E pipeline tests: greenlet op_log, GEMM accuracy, latency regression

2026-04-09 00:28:03 -07:00

test_engine.py

Add probe CLI improvements, D2H read, UCIe/HBM tuning, BW sweep

2026-03-19 01:16:18 -07:00

test_iochiplet_noc_d2h.py

Add probe CLI improvements, D2H read, UCIe/HBM tuning, BW sweep

2026-03-19 01:16:18 -07:00

test_ipcq_types.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_kernel_runner.py

Implement ADR-0020: 2-pass data execution with greenlet kernel runner

2026-04-08 00:22:44 -07:00

test_memory_store.py

Implement ADR-0020: 2-pass data execution with greenlet kernel runner

2026-04-08 00:22:44 -07:00

test_mmu_component.py

Rename impl names: add builtin. prefix for clear provenance

2026-04-09 00:16:24 -07:00

test_mmu_fabric.py

Wire PE_MMU to router mesh for MmuMapMsg delivery

2026-04-04 18:10:42 -07:00

test_noc_mesh.py

Remove xbar/noc remnants, rule-based cube-view connectors

2026-04-06 23:59:12 -07:00

test_op_log.py

Implement ADR-0020: 2-pass data execution with greenlet kernel runner

2026-04-08 00:22:44 -07:00

test_pe_components.py

Rename impl names: add builtin. prefix for clear provenance

2026-04-09 00:16:24 -07:00

test_pe_dma_ipcq.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_pe_ipcq.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_pe_mmu.py

Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg

2026-03-26 00:01:47 -07:00

test_pe_pipeline.py

Add ADR-0021 pipeline tests: self-routing, tiling, overlap

2026-04-08 23:40:19 -07:00

test_phase_a_components.py

Rename impl names: add builtin. prefix for clear provenance

2026-04-09 00:16:24 -07:00

test_phyaddr.py

commit - release 1

2026-03-18 11:47:48 -07:00

test_probe.py

Remove xbar/noc remnants, rule-based cube-view connectors

2026-04-06 23:59:12 -07:00

test_recv_copy_to_dst.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_routing.py

Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019)

2026-04-04 17:51:28 -07:00

test_runtime_api_tensor.py

Add Tensor indexing + hierarchical 3-level all-reduce kernel

2026-04-12 23:52:04 -07:00

test_sip_parallel.py

Add SIP-level tensor parallelism, component registry YAML, VA offset verification

2026-03-26 01:13:17 -07:00

test_tensor_free.py

Wire PE_MMU to router mesh for MmuMapMsg delivery

2026-04-04 18:10:42 -07:00

test_tensor.py

commit - release 1

2026-03-18 11:47:48 -07:00

test_tl_ipcq_api.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_tl_recv_async.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_topology_compile.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_topology_load.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_topology_visualize.py

Cube-view SVG: detailed topology validation rendering

2026-04-04 22:03:38 -07:00

test_triton_emu.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_va_allocator.py

Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg

2026-03-26 00:01:47 -07:00

test_va_integration.py

Add SIP-level tensor parallelism, component registry YAML, VA offset verification

2026-03-26 01:13:17 -07:00

test_va_offset.py

Fix cross-SIP PE_TCM access by scoping deploy to target_device SIP

2026-04-04 18:03:11 -07:00