ADR-0059 single_user_* panels run the ring across PEs in one cube
(rank == tl.program_id(axis=0)). multi_user_* panels run the ring
across cubes — rank should be cube_id (axis=1), and 7 of every 8 PEs
in each cube must stay silent because the cube-level SFR install only
gives the cube-coordinate PE 0 an E/W neighbor.
Add ``rank_axis: int = 0`` kwarg to both ``attention_mesh_mlo_kernel``
and ``attention_mesh_kv_kernel``:
- 0 (default): rank == tl.program_id(axis=0). Existing single_user
behavior, all spec tests unchanged.
- 1: gate ``if tl.program_id(axis=0) != 0: return`` at kernel start,
then ``rank = tl.program_id(axis=1)``. multi_user_* panels pass
this to the kernel via ctx.launch positional arg.
Also brings in _attention_mesh_kv.py and _attention_mesh_mlo.py as
the committed home of the ADR-0059 kernels (previously living
uncommitted in the working tree from sub-cycle 4b).
Tests: 7-test rank_axis spec file (default-path + rank_axis=1 gating
and cube-id semantics, both kernels); 4-panel diag harness now green
end-to-end (single_user_prefill/decode + multi_user_prefill/decode);
763-test wider sweep clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two compounding bugs in ctx.launch's dim-translation path surfaced
by multi_user_* panels of milestone-gqa-llama70b (sub-cycle 4c step 2):
Bug A: _compute_local_shape divided by self._num_cubes (the topology's
cube count, 16 in default topology.yaml) instead of the DPPolicy's
effective num_cubes (4 for validation-scale multi_user). The tensor
allocator at context.py:471-484 already honored dp.num_cubes; the
parallel computation inside launch was out of sync. Fix mirrors the
allocator's eff_num_cubes precedence pattern.
Bug B: dim_map was keyed by value, so any scalar whose value
coincidentally equaled a global tensor dim got rewritten to that dim's
local value — e.g. d_head=64 colliding with K's global M=64 in
multi_user mode. Legacy bench kernels (va_offset etc.) rely on this
remap, so the fix is opt-out: ctx.launch(..., _auto_dim_remap=False)
preserves scalars exactly as passed. Default remains True.
Tests: 3 new dim-translation tests + 4-panel diag harness covers
single_user_* (PASS) and multi_user_* (advances to new SFR/axis layer
failure, tracked separately). va_offset + full attention spec suite
unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>