kernbench2/tests at e748a62264e3bf43db7180d27757151cc0c34492 - kernbench2 - YWGitServer

ywkang/kernbench2

mukesh e748a62264 attention: land milestone-gqa-llama70b 4-panel sweep bench (ADR-0057 v1)

Self-contained eval bench (ADR-0054) that drives the four GQA Llama-70B
panels through run_bench with enable_data=True at validation scale and
emits sweep.json with the v1 schema (ADR-0057 D7).

Panel dispatch table maps each panel to (kernel, SFR install, S_q,
n_ranks, rank_axis):
  single_user_prefill   mesh_kv_kernel,  intracube_pe_ring,  S_q=16, n=8, rank_axis=0
  multi_user_prefill    mesh_kv_kernel,  intercube_multisip, S_q=16, n=4, rank_axis=1
  single_user_decode    mesh_mlo_kernel, intracube_pe_ring,  S_q=1,  n=8, rank_axis=0
  multi_user_decode     mesh_mlo_kernel, intercube_multisip, S_q=1,  n=4, rank_axis=1

multi_user panels pass _auto_dim_remap=False (so that d_head=64 does
not collide with K's global M=64) and rank_axis=1 (cube-level ring,
which gates 7 of every 8 PEs to silence).
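
As an illustration only, the dispatch table and the multi_user overrides
described above could be encoded roughly as follows. The PanelConfig
class, the PANELS dict, and the launch_kwargs helper are hypothetical
names for this sketch, not the bench's actual code; only the panel
names, kernels, SFR installs, and numeric values come from the message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PanelConfig:
    """One row of the panel dispatch table (hypothetical container)."""
    kernel: str
    sfr_install: str
    s_q: int
    n_ranks: int
    rank_axis: int


# Values copied from the dispatch table in the commit message.
PANELS = {
    "single_user_prefill": PanelConfig("mesh_kv_kernel", "intracube_pe_ring", 16, 8, 0),
    "multi_user_prefill": PanelConfig("mesh_kv_kernel", "intercube_multisip", 16, 4, 1),
    "single_user_decode": PanelConfig("mesh_mlo_kernel", "intracube_pe_ring", 1, 8, 0),
    "multi_user_decode": PanelConfig("mesh_mlo_kernel", "intercube_multisip", 1, 4, 1),
}


def launch_kwargs(panel: str) -> dict:
    """Build per-panel launch keyword arguments (assumed shape)."""
    cfg = PANELS[panel]
    kwargs = {"rank_axis": cfg.rank_axis}
    if panel.startswith("multi_user"):
        # multi_user panels opt out of automatic dim remapping.
        kwargs["_auto_dim_remap"] = False
    return kwargs
```

Keying the table by panel name keeps the per-panel differences in one
place, so a sweep loop can iterate PANELS.items() without branching.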

Each panel runs on a fresh per-config GraphEngine, then op_log is
summarized into gemm/dma/ipcq counts. Both decode panels emit exactly
2*n_ranks GEMMs (one-shot partial attention per rank, ADR-0056 D3).
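
A minimal sketch of the summarization step and the decode invariant,
assuming each op_log entry is a dict with a "kind" field (the real
op_log record layout is not shown here, and both function names are
hypothetical):

```python
def summarize_op_log(op_log):
    """Count gemm/dma/ipcq entries; other kinds are ignored."""
    counts = {"gemm": 0, "dma": 0, "ipcq": 0}
    for entry in op_log:
        kind = entry["kind"]  # assumed field name
        if kind in counts:
            counts[kind] += 1
    return counts


def decode_gemm_count_ok(counts, n_ranks):
    """Decode panels are expected to emit exactly 2*n_ranks GEMMs."""
    return counts["gemm"] == 2 * n_ranks
```

For a decode panel with n_ranks=4, the check passes only when the
summary reports 8 GEMMs.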

v1 supports GQA_VALIDATION=1 only; headline mode and figures are
deferred to sub-cycles 4b/4c. A sentinel tensor satisfies the run_bench
"at least one request" contract (ADR-0045 D4 / ADR-0054 D2 carve-out).

Tests: tests/attention/test_milestone_gqa_llama70b.py — all 12 pass.
Includes committed sweep.json baseline at the bench's _OUTPUT_DIR so
subsequent test runs reuse it instead of re-simulating.
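
The baseline reuse can be pictured as a load-or-run helper. This is a
sketch under assumptions: load_or_run_sweep and its run_sweep callback
are hypothetical, and the bench's real lookup logic around _OUTPUT_DIR
may differ.

```python
import json
from pathlib import Path


def load_or_run_sweep(output_dir, run_sweep):
    """Return the cached sweep.json if present, else simulate and save."""
    path = Path(output_dir) / "sweep.json"
    if path.exists():
        # Reuse the committed baseline instead of re-simulating.
        return json.loads(path.read_text())
    result = run_sweep()  # hypothetical callback that runs the panels
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result, indent=2))
    return result
```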

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-06-01 21:57:12 -07:00

attention: land milestone-gqa-llama70b 4-panel sweep bench (ADR-0057 v1)

2026-06-01 21:57:12 -07:00

Add SIP-level tensor parallelism, component registry YAML, VA offset verification

2026-03-26 01:13:17 -07:00

eval: fold GEMM/allreduce harnesses into self-contained milestone benches

2026-05-22 15:19:52 -07:00

eval: fold GEMM/allreduce harnesses into self-contained milestone benches

2026-05-22 15:19:52 -07:00

conftest.py

sccl: drive allreduce tests via torch.distributed; reorganize into tests/sccl/

2026-05-20 22:24:43 -07:00

test_adr0026_dppolicy_intra_device.py

ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables

2026-04-27 15:52:29 -07:00

test_bench_registry.py

benches: package as kernbench.benches, add @bench registry + list subcommand

2026-05-20 14:42:10 -07:00

test_bw_occupancy.py

ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables

2026-04-27 15:52:29 -07:00

test_ccl_framework.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_helpers.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_install.py

Intercube allreduce: pe0 cube-mesh reduce + multi-SIP ring/torus/mesh

2026-04-16 17:33:42 -07:00

test_ccl_strict_mode.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_ccl_topologies.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_cli_list.py

benches: package as kernbench.benches, add @bench registry + list subcommand

2026-05-20 14:42:10 -07:00

test_cli_verify_data.py

benches: package as kernbench.benches, add @bench registry + list subcommand

2026-05-20 14:42:10 -07:00

test_cli.py

benches: package as kernbench.benches, add @bench registry + list subcommand

2026-05-20 14:42:10 -07:00

test_component_registry.py

Calibrate 3 tests for ADR-0033 Phase 2c per-flit wire timing

2026-05-14 23:06:33 -07:00

test_composite_epilogue.py

tl.composite: fused epilogue ops with per-op scope

2026-05-15 10:16:47 -07:00

test_cross_sip_routing.py

ADR-0019 D1/D4: per-PE HBM CTRL partitioning

2026-05-15 01:04:30 -07:00

test_d5_barrier_invariant.py

ADR-0009 D5: chain-aware target_start_ns + zero-byte launch fanout

2026-04-27 15:12:58 -07:00

test_data_executor.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_e2e_data.py

ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables

2026-04-27 15:52:29 -07:00

test_e2e_pipeline.py

ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

2026-05-20 01:15:55 -07:00

test_emit_ipcq_diagram.py

Add tests/test_emit_ipcq_diagram.py (missed from earlier commit)

2026-04-27 21:42:44 -07:00

test_engine.py

ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables

2026-04-27 15:52:29 -07:00

test_flit_streaming.py

ADR-0033 Phase 2c-3 finish: op_log test + ADR doc reflect chunk-streaming

2026-05-14 23:12:50 -07:00

test_hbm_address_based_pc.py

ADR-0019 D1/D4: per-PE HBM CTRL partitioning

2026-05-15 01:04:30 -07:00

test_hbm_pc_striping.py

ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

2026-05-20 01:15:55 -07:00

test_intercube_sfr_config.py

CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots

2026-05-20 20:50:48 -07:00

test_iochiplet_noc_d2h.py

ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables

2026-04-27 15:52:29 -07:00

test_ipcq_buffer_kind_latency.py

sccl: drive allreduce tests via torch.distributed; reorganize into tests/sccl/

2026-05-20 22:24:43 -07:00

test_ipcq_buffer_kind_locations.py

sccl: drive allreduce tests via torch.distributed; reorganize into tests/sccl/

2026-05-20 22:24:43 -07:00

test_ipcq_types.py

Fix ADR-0025: IPCQ direction addressing via address-based matching

2026-04-14 00:38:41 -07:00

test_kernel_launch_sync.py

Kernel-launch sync (ADR-0009 D5) and IPCQ drain at inbound (ADR-0023)

2026-04-23 15:30:29 -07:00

test_kernel_runner.py

Implement ADR-0020: 2-pass data execution with greenlet kernel runner

2026-04-08 00:22:44 -07:00

test_launch_dim_translation.py

runtime_api: ctx.launch honors DPPolicy.num_cubes + adds _auto_dim_remap opt-out

2026-06-01 19:33:40 -07:00

test_memory_store.py

Implement ADR-0020: 2-pass data execution with greenlet kernel runner

2026-04-08 00:22:44 -07:00

test_milestone_benches.py

eval: fold GEMM/allreduce harnesses into self-contained milestone benches

2026-05-22 15:19:52 -07:00

test_mmu_component.py

Rename impl names: add builtin. prefix for clear provenance

2026-04-09 00:16:24 -07:00

test_mmu_fabric.py

benches: package as kernbench.benches, add @bench registry + list subcommand

2026-05-20 14:42:10 -07:00

test_noc_mesh.py

ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

2026-05-20 01:15:55 -07:00

test_op_log_input_snapshot_race.py

sim_engine: fix IPCQ slot-wrap snapshot race in Phase 2 replay

2026-06-01 19:14:09 -07:00

test_op_log.py

Implement ADR-0020: 2-pass data execution with greenlet kernel runner

2026-04-08 00:22:44 -07:00

test_pe_components.py

benches: package as kernbench.benches, add @bench registry + list subcommand

2026-05-20 14:42:10 -07:00

test_pe_dma_ipcq.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_pe_ipcq.py

Fix ADR-0025: IPCQ direction addressing via address-based matching

2026-04-14 00:38:41 -07:00

test_pe_mmu.py

Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg

2026-03-26 00:01:47 -07:00

test_pe_pipeline.py

ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

2026-05-20 01:15:55 -07:00

test_pe_to_pe_diagnostic.py

CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots

2026-05-20 20:50:48 -07:00

test_pe_to_pe_latency.py

CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots

2026-05-20 20:50:48 -07:00

test_per_pe_hbm_partition.py

ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

2026-05-20 01:15:55 -07:00

test_phase_a_components.py

Latency model: HBM PC striping + chunk-loop drain (ADR-0033)

2026-05-14 21:59:07 -07:00

test_phyaddr.py

ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables

2026-04-27 15:52:29 -07:00

test_probe.py

benches: package as kernbench.benches, add @bench registry + list subcommand

2026-05-20 14:42:10 -07:00

test_routing.py

ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

2026-05-20 01:15:55 -07:00

test_runtime_api_tensor.py

ADR-0026: DPPolicy intra-device only + ShardSpec structural coords

2026-04-14 13:02:19 -07:00

test_sip_parallel.py

ADR-0026: DPPolicy intra-device only + ShardSpec structural coords

2026-04-14 13:02:19 -07:00

test_sip_topology_rectangular.py

Rectangular SIP topology + 6-device allreduce sweep

2026-04-27 15:13:14 -07:00

test_tensor_free.py

ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables

2026-04-27 15:52:29 -07:00

test_tensor.py

ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables

2026-04-27 15:52:29 -07:00

test_tl_ipcq_api.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_topology_compile.py

ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

2026-05-20 01:15:55 -07:00

test_topology_load.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_topology_visualize.py

Cube-view SVG: detailed topology validation rendering

2026-04-04 22:03:38 -07:00

test_tp_layers.py

ADR-0027: Megatron TP API + worker-wait generalization + mp.spawn

2026-04-14 16:31:13 -07:00

test_tp_mlp.py

ADR-0027: Megatron TP API + worker-wait generalization + mp.spawn

2026-04-14 16:31:13 -07:00

test_tp_parallel_state.py

ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

2026-05-20 01:15:55 -07:00

test_triton_emu.py

Add PE-level IPCQ collective infra + unified ccl_allreduce bench (ADR-0023)

2026-04-12 19:36:59 -07:00

test_va_allocator.py

Add virtual memory support: PE_MMU, VA allocator, fabric MmuMapMsg

2026-03-26 00:01:47 -07:00

test_va_integration.py

ADR-0001 Rev 2: 51-bit PhysAddr layout with concrete sub-unit tables

2026-04-27 15:52:29 -07:00

test_va_offset.py

benches: package as kernbench.benches, add @bench registry + list subcommand

2026-05-20 14:42:10 -07:00

test_verify_adr_lang_pairs.py

ADR: translate adr-ko/ to Korean, fix ADR-0013 slug, refine Status check

2026-05-20 08:17:56 -07:00

test_wire_cut_through.py

Latency model: HBM PC striping + chunk-loop drain (ADR-0033)

2026-05-14 21:59:07 -07:00