Self-contained eval bench (ADR-0054) that drives the four GQA Llama-70B
panels through run_bench with enable_data=True at validation scale and
emits sweep.json with the v1 schema (ADR-0057 D7).
Panel dispatch table maps each panel to (kernel, SFR install, S_q,
n_ranks, rank_axis):
single_user_prefill mesh_kv_kernel, intracube_pe_ring, S_q=16, n=8, rank_axis=0
multi_user_prefill mesh_kv_kernel, intercube_multisip, S_q=16, n=4, rank_axis=1
single_user_decode mesh_mlo_kernel, intracube_pe_ring, S_q=1, n=8, rank_axis=0
multi_user_decode mesh_mlo_kernel, intercube_multisip, S_q=1, n=4, rank_axis=1
multi_user panels pass _auto_dim_remap=False (avoid d_head=64
colliding with K's global M=64) and rank_axis=1 (cube-level ring,
gates 7 of every 8 PEs to silence).
Each panel runs on a fresh per-config GraphEngine, then op_log is
summarized into gemm/dma/ipcq counts. Both decode panels emit exactly
2*n_ranks GEMMs (one-shot partial attention per rank, ADR-0056 D3).
v1 supports GQA_VALIDATION=1 only; headline mode + figures deferred to
sub-cycles 4b/4c. Sentinel tensor satisfies the run_bench
"at least one request" contract (ADR-0045 D4 / ADR-0054 D2 carve-out).
Tests: tests/attention/test_milestone_gqa_llama70b.py — all 12 pass.
Includes committed sweep.json baseline at the bench's _OUTPUT_DIR so
subsequent test runs reuse it instead of re-simulating.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>