Files
kernbench2/tests
ywkang 787409ced1 ADR-0024 Phase B: update xfail reason with architectural blocker details
Phase B Option A (freeze + defer to ADR-0027): the root cause of
ring_default_ws strict-xfail is that bench workers call torch.zeros /
copy_ which drive env.run in the WORKER-greenlet context. Any pending
KernelLaunchMsg gets stepped inside that worker, spawning kernel_runner
with parent = worker (not main). When the worker yields/finishes, the
kernel greenlet is orphaned and its next switch_to_simpy raises
GreenletExit mid-add — producing rank 0 mean=1 (expected 3).

This is a larger architectural redesign (lazy-deploy tensor API,
coroutine worker, or setup/verify split) and is parked until ADR-0027
(Megatron TP) starts, where the proper solution ships with TP use cases.

No production changes; xfail reason + inline comment only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 12:46:33 -07:00
..
2026-03-18 11:47:48 -07:00
2026-03-18 11:47:48 -07:00