kernbench2

ywkang/kernbench2

Commit Graph

Author	SHA1	Message	Date

mukesh

e748a62264

attention: land milestone-gqa-llama70b 4-panel sweep bench (ADR-0057 v1)

Self-contained eval bench (ADR-0054) that drives the four GQA Llama-70B
panels through run_bench with enable_data=True at validation scale and
emits sweep.json with the v1 schema (ADR-0057 D7).

Panel dispatch table maps each panel to (kernel, SFR install, S_q,
n_ranks, rank_axis):
  single_user_prefill   mesh_kv_kernel,  intracube_pe_ring,  S_q=16, n=8, rank_axis=0
  multi_user_prefill    mesh_kv_kernel,  intercube_multisip, S_q=16, n=4, rank_axis=1
  single_user_decode    mesh_mlo_kernel, intracube_pe_ring,  S_q=1,  n=8, rank_axis=0
  multi_user_decode     mesh_mlo_kernel, intercube_multisip, S_q=1,  n=4, rank_axis=1
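The dispatch table above could be expressed as a plain mapping, roughly as sketched below. This is an illustration only: the names PanelConfig and PANELS are hypothetical and are not taken from the bench source; only the panel names and field values come from the table.

```python
from typing import NamedTuple


class PanelConfig(NamedTuple):
    """Hypothetical record for one row of the panel dispatch table."""
    kernel: str
    sfr_install: str
    s_q: int
    n_ranks: int
    rank_axis: int


# Values copied from the commit message; the container itself is an assumption.
PANELS = {
    "single_user_prefill": PanelConfig("mesh_kv_kernel", "intracube_pe_ring", 16, 8, 0),
    "multi_user_prefill": PanelConfig("mesh_kv_kernel", "intercube_multisip", 16, 4, 1),
    "single_user_decode": PanelConfig("mesh_mlo_kernel", "intracube_pe_ring", 1, 8, 0),
    "multi_user_decode": PanelConfig("mesh_mlo_kernel", "intercube_multisip", 1, 4, 1),
}

if __name__ == "__main__":
    for name, cfg in PANELS.items():
        print(f"{name}: {cfg}")
```

Keeping each row as a named tuple makes it easy to look up a single field, for example PANELS["multi_user_decode"].rank_axis.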

The multi_user panels pass _auto_dim_remap=False (so that d_head=64
does not collide with K's global M=64) and rank_axis=1 (cube-level
ring, which gates 7 of every 8 PEs to silence).

Each panel runs on a fresh per-config GraphEngine, then op_log is
summarized into gemm/dma/ipcq counts. Both decode panels emit exactly
2*n_ranks GEMMs (one-shot partial attention per rank, ADR-0056 D3).
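A minimal sketch of the summarize-and-check step, under stated assumptions: the real op_log entry format is not shown in this commit, so the sketch assumes each entry is a dict with a "kind" field, and the helper names summarize_op_log and decode_gemm_count_ok are hypothetical.

```python
from collections import Counter


def summarize_op_log(op_log):
    """Count gemm/dma/ipcq entries; assumes each entry is a dict with a 'kind' key."""
    counts = Counter(entry["kind"] for entry in op_log)
    return {kind: counts.get(kind, 0) for kind in ("gemm", "dma", "ipcq")}


def decode_gemm_count_ok(summary, n_ranks):
    """Decode panels are expected to emit exactly 2 * n_ranks GEMMs."""
    return summary["gemm"] == 2 * n_ranks


if __name__ == "__main__":
    # Synthetic log for illustration, not real bench output.
    fake_log = [{"kind": "gemm"}] * 16 + [{"kind": "dma"}] * 3
    summary = summarize_op_log(fake_log)
    print(summary, decode_gemm_count_ok(summary, n_ranks=8))
```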

v1 supports GQA_VALIDATION=1 only; headline mode + figures deferred to
sub-cycles 4b/4c. Sentinel tensor satisfies the run_bench
"at least one request" contract (ADR-0045 D4 / ADR-0054 D2 carve-out).

Tests: tests/attention/test_milestone_gqa_llama70b.py — all 12 pass.
Includes committed sweep.json baseline at the bench's _OUTPUT_DIR so
subsequent test runs reuse it instead of re-simulating.
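The baseline reuse described above can be sketched as a load-or-run helper. This is an assumption about the general pattern, not the bench's actual code: load_or_run_sweep and its run_sweep callback are hypothetical names, and only the sweep.json file name comes from the commit message.

```python
import json
from pathlib import Path


def load_or_run_sweep(output_dir, run_sweep):
    """Return the cached sweep.json if present; otherwise run the sweep and cache it."""
    path = Path(output_dir) / "sweep.json"
    if path.exists():
        # A committed baseline short-circuits the simulation.
        return json.loads(path.read_text())
    result = run_sweep()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result, indent=2))
    return result
```

With a helper like this, a test only triggers a simulation when no baseline file exists in the output directory.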

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-06-01 21:57:12 -07:00

1 Commit