ADR-0027: Megatron TP API + worker-wait generalization + mp.spawn

Implements ADR-0027 Phase 2 end-to-end. All 559 tests pass (was 523 + 1 xfail; ring_default_ws strict-xfail is now resolved). D0 — Worker-wait generalization (context.py): - _pending_worker_waits queue on RuntimeContext. - ctx.wait(h) in worker context defers to main via g.parent.switch(). Fast-path for already-completed handles. - Worker API is unchanged: tensor deploy, launch, etc. still look synchronous; they're transparently cooperatively scheduled. - Solves ADR-0024 Phase B kernel-greenlet orphan bug (env.run now only ever drives from main; kernel _parent is always main). D0.5 — Host-read barrier (tensor.py): - Explicit _HOST_READ_BARRIERS registry (T5.g closed-set via code review, not reflection-magic). - numpy/data/__getitem__/__repr__ drain pending worker-waits before host-observable read. - copy_: source-side barrier via source.numpy(). Target-side write barrier is intentionally NOT applied — global pending target barrier prematurely drains cross-rank collectives → deadlock. - Collective pending is excluded from barrier drain condition (collective is cross-rank; its own yield in all_reduce covers the invariant naturally). D1 — torch.multiprocessing.spawn (runtime_api/multiprocessing.py): - API signature parity with real PyTorch spawn; execution is cooperative greenlet scheduler (process isolation etc. are explicit non-goals per D1.0). - _drain_pending drains worker-waits then collectives in one barrier, loop-until-empty. - Round-based exception handling with SystemExit sibling abort + SpawnException(errors) wrapping root-cause ranks. - RuntimeContext attaches ctx.multiprocessing in __post_init__. - benches/ccl_allreduce.py hand-rolled loop collapses to one torch.multiprocessing.spawn call. D2–D6 — kernbench.tp package: - parallel_state: initialize_model_parallel, get_*_rank, get_*_world_size, with weak active-ctx registry in context.py. - layers: ColumnParallelLinear, RowParallelLinear (shape-only primitives — fp16 gemm via tl.load + tl.dot + tl.store). - kernels: _gemm_kernel used by TP layers (self-contained; no bench dependency). - primitives / mappings stubs per D6/D8. Data-path fixes (surfaced by TP gemm + all_reduce sequence): - sim_engine/op_log.py: dma_write snapshot is skipped for TCM sources (PE scratch is repopulated by Phase 2 math/gemm replay — capturing Phase-1-time snapshot picked up STALE data from prior kernel's output aliased at the same scratch addr, causing the later kernel's dma_write to overwrite Phase 2 result with stale value). - sim_engine/op_log.py + sim_engine/data_executor.py: per-operand space recorded on GemmCmd and composite gemm records so HBM-resident operands (tl.load output) don't default to TCM during replay. - runtime_api/context.py: ctx.zeros writes zero-init to MemoryStore at VA keys so kernels reading via VA see deterministic init even without explicit copy_(). Tests (Phase 1 + Phase 2): - test_worker_wait_drain (T3): orphan invariant + resume + multi-rank drain + idempotency + exception propagation. - test_mp_spawn (T4): spawn shape + bind + SpawnException scope. - test_host_read_barrier (T5): barrier contract per entry-point + closed-set registry check. - test_tp_parallel_state (T1): initialize + rank lookup. - test_tp_layers (T2): shape + deterministic numerical correctness (concat-matmul equality for RowParallel, not mean-only). - test_tp_mlp (T6): full 2-layer MLP with deterministic weight numerical match + rank-consistency post all-reduce. - test_ccl_allreduce_matrix: ring_default_ws xfail removed (T7). Regression: 523 pre + 35 new + 1 ex-xfail = 559 passed, 1 intentional skip (T3.e historical failure documentation). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 16:31:13 -07:00
parent e7f376ebaa
commit 105f1dc09e
19 changed files with 1962 additions and 64 deletions
@@ -19,7 +19,6 @@ Driven entirely by ``ccl.yaml`` + ``topology.yaml``:
 from __future__ import annotations
 import numpy as np
 from greenlet import greenlet
 from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
 from kernbench.policy.placement.dp import DPPolicy
@@ -153,35 +152,14 @@ def run(torch) -> None:
    n_sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
    if world_size == n_sips:
-        # ADR-0024 D12/D13: one greenlet per rank. After each scheduler
+        # ADR-0027 D1: ``torch.multiprocessing.spawn`` replaces the prior
-        # round, the main greenlet drains any pending collective handles
+        # hand-rolled greenlet loop. The spawn namespace absorbs the
-        # (ADR-0024 D7) — this must happen in the main context, not inside
+        # scheduler drain (D0.4) so kernel_runner's spawned kernel greenlets
-        # a worker, so env.run is invoked with main as the current greenlet
+        # correctly get main as their parent (ADR-0024 Phase B blocker
-        # and kernel_runner's spawned kernel greenlets correctly get main
+        # resolved via D0 worker-wait generalisation).
-        # as their parent.
+        torch.multiprocessing.spawn(
-        backend = dist._backend
+            worker, args=(world_size, torch), nprocs=world_size,
-        gs: list[greenlet] = []
+        )
        for rank in range(world_size):
            def _entry(r: int = rank) -> None:
                worker(r, world_size, torch)
            g = greenlet(_entry)
            dist._bind_rank(g, rank)
            gs.append(g)
        while True:
            alive = [g for g in gs if not g.dead]
            if not alive:
                break
            for g in alive:
                if not g.dead:
                    g.switch()
            # Drain pending collective handles. All sibling workers have
            # either submitted (and yielded) or completed; their kernels
            # are live in the SimPy queue, ready to exchange via IPCQ.
            pending = backend._pending_collective_handles
            if pending:
                for h, _sip_id, meta in pending:
                    torch.wait(h, _meta=meta)
                backend._pending_collective_handles = []
    else:
        # Legacy single-worker path (ccl.yaml world_size override).
        worker(rank=dist.get_rank(), world_size=world_size, torch=torch)
@@ -42,6 +42,21 @@ def _numpy_to_dtype_str(np_dtype) -> str:
    raise ValueError(f"unsupported numpy dtype: {np_dtype!r}")
 # ADR-0027 D3: weak registry of the currently-active RuntimeContext so
 # module-level helpers (e.g. ``kernbench.tp.parallel_state``) can resolve
 # the ctx without threading it through every call.
 import weakref as _weakref
 _ACTIVE_CTX_REF: _weakref.ref | None = None
 def _get_active_context():
    """Return the most-recently-entered RuntimeContext, or None."""
    if _ACTIVE_CTX_REF is None:
        return None
    return _ACTIVE_CTX_REF()
 class _AhbmNamespace:
    """torch.ahbm — per-greenlet SIP device binding (ADR-0024 D10).
@@ -89,6 +104,10 @@ class RuntimeContext:
    _handles: list[RequestHandle] = field(default_factory=list, init=False)
    _completed: set[RequestHandle] = field(default_factory=set, init=False)
    # ADR-0027 D0.1: worker-deferred wait queue. When a worker greenlet
    # calls ctx.wait(h), the handle is appended here and control yields to
    # main. Main's scheduler drain consumes this list.
    _pending_worker_waits: list[RequestHandle] = field(default_factory=list, init=False)
    _allocators: dict[tuple[int, int, int], Any] = field(default_factory=dict, init=False)
    _va_allocator: Any = field(default=None, init=False)
    _tensor_counter: int = field(default=0, init=False)
@@ -109,6 +128,9 @@ class RuntimeContext:
        # (PyTorch 2.x portable) namespaces for per-greenlet device binding.
        self.ahbm = _AhbmNamespace()
        self.accelerator = _AcceleratorNamespace(self.ahbm)
        # ADR-0027 D1.3: torch.multiprocessing.spawn namespace.
        from kernbench.runtime_api.multiprocessing import _MultiprocessingNamespace
        self.multiprocessing = _MultiprocessingNamespace(self)
    def install_ipcq(
        self,
@@ -160,10 +182,16 @@ class RuntimeContext:
        return plan
    def __enter__(self):
        global _ACTIVE_CTX_REF
        _ACTIVE_CTX_REF = _weakref.ref(self)
        return self
    def __exit__(self, *exc):
        global _ACTIVE_CTX_REF
        self.cleanup()
        # Clear active-context registry if we are it.
        if _ACTIVE_CTX_REF is not None and _ACTIVE_CTX_REF() is self:
            _ACTIVE_CTX_REF = None
        return False
    def submit(self, request: Any) -> RequestHandle:
@@ -178,10 +206,24 @@ class RuntimeContext:
        return handle in self._completed
    def wait(self, handle: RequestHandle, *, _meta: dict | None = None) -> Completion:
        # ADR-0027 D0.2: fast-path for already-completed handles (avoid
        # redundant worker→main→worker round-trip).
        if handle in self._completed:
            completion, trace = self.engine.get_completion(handle)
            return completion
        # ADR-0027 D0.2: if called from a worker greenlet (parent is main,
        # not dead), defer the wait to the main scheduler — enqueue and
        # yield. Main drains env.run, then switches back. On resume the
        # handle must be in _completed (D0.3 resume invariant).
        from greenlet import getcurrent
        g = getcurrent()
        if g.parent is not None and not g.parent.dead:
            self._pending_worker_waits.append(handle)
            g.parent.switch()
            # Resume: main drained. Fall through to completion/trace assembly.
        # Main context (or single-driver): drive engine directly.
        wait_fn = getattr(self.engine, "wait", None)
        if wait_fn is not None:
            wait_fn(handle)  # type: ignore[misc]
@@ -543,6 +585,21 @@ class RuntimeContext:
                    "sip": shard.sip, "cube": shard.cube, "pe": shard.pe,
                    "nbytes": shard.nbytes,
                })
            # ADR-0027: also populate MemoryStore at VA keys so kernels
            # reading via VA (the common ``tl.load`` path) see the init
            # data. Phase 1 MemoryWriteMsg writes via PA; kernels read via
            # VA; Phase 2 DataExecutor reads via the addresses captured in
            # op_log (VA for tl.load). Without this, zero-init tensors are
            # invisible to kernels in Phase 2.
            store = getattr(self.engine, "_memory_store", None)
            if store is not None and pattern == "zero" and handle.va_base:
                import numpy as np
                from kernbench.runtime_api.tensor import _numpy_dtype
                np_dtype = _numpy_dtype(dtype)
                for shard in handle.shards:
                    count = shard.nbytes // itemsize
                    addr = handle.va_base + shard.offset_bytes
                    store.write("hbm", addr, np.zeros(count, dtype=np_dtype))
        return t
@@ -0,0 +1,152 @@
 """``torch.multiprocessing.spawn``-compatible namespace (ADR-0027 D1).
 Real-PyTorch API *signature* parity only — execution model is a cooperative
 greenlet scheduler in a single Python process (D1.0). Non-goals: process
 isolation, independent address space, failure isolation, OS-level scheduler
 fairness, mp.Queue/Lock.
 Attached to ``RuntimeContext`` as ``ctx.multiprocessing`` in
 ``__post_init__`` (D1.3).
 """
 from __future__ import annotations
 from typing import Any, Callable
 class SpawnException(RuntimeError):
    """Raised from ``_MultiprocessingNamespace.spawn`` on worker failure.
    ``errors`` contains only root-cause ranks — the rank(s) whose body
    raised. Sibling greenlets terminated via ``throw(SystemExit)`` during
    cleanup are NOT recorded (SystemExit does not satisfy ``except
    Exception`` in the entry wrapper).
    """
    def __init__(self, errors: dict[int, Exception]):
        self.errors = errors
        first = next(iter(errors.items()), None)
        msg = (
            f"spawn failed on ranks {sorted(errors.keys())}"
            + (
                f": rank {first[0]} raised {first[1]!r}"
                if first is not None
                else ""
            )
        )
        super().__init__(msg)
 def _drain_pending(ctx: Any) -> None:
    """Drain worker-wait + collective-pending queues in main context (D0.4/D0.5).
    Loop-until-empty: runs until both queues are simultaneously empty. Safe
    under the current model where main-context ``ctx.wait`` never re-enqueues
    (D0.5 main-context non-reentrance invariant); also safe under future
    extensions where drain can add sub-handles (SimPy causality gives finite
    depth).
    """
    distributed = getattr(ctx, "distributed", None)
    backend = getattr(distributed, "_backend", None) if distributed else None
    def _collective_nonempty() -> bool:
        if backend is None:
            return False
        pending = getattr(backend, "_pending_collective_handles", None)
        return bool(pending)
    while ctx._pending_worker_waits or _collective_nonempty():
        # (a) Worker-driven waits (D0.1). FIFO.
        while ctx._pending_worker_waits:
            h = ctx._pending_worker_waits.pop(0)
            if h not in ctx._completed:
                wait_fn = getattr(ctx.engine, "wait", None)
                if wait_fn is not None:
                    wait_fn(h)
                # Populate _completed so fast-path in ctx.wait short-circuits
                # on the return leg.
                ctx._completed.add(h)
        # (b) Collective backend queue (ADR-0024 D7 + D0.4-(2)).
        if backend is not None:
            pending_list = getattr(backend, "_pending_collective_handles", None)
            if pending_list is not None:
                while pending_list:
                    h, _sip_id, meta = pending_list.pop(0)
                    # Main context: ctx.wait drives engine directly and does
                    # NOT re-enqueue (D0.5 invariant).
                    ctx.wait(h, _meta=meta)
 class _MultiprocessingNamespace:
    """torch.multiprocessing-compat facade bound to a RuntimeContext."""
    def __init__(self, ctx: Any) -> None:
        self._ctx = ctx
    def spawn(
        self,
        fn: Callable,
        args: tuple = (),
        nprocs: int = 1,
        join: bool = True,
    ) -> None:
        """Spawn ``nprocs`` worker greenlets, each calling ``fn(rank, *args)``.
        Mirrors ``torch.multiprocessing.spawn`` signature (minus ``daemon``).
        Runs the D0.4 round-robin scheduler loop until all workers finish,
        draining pending queues between rounds.
        """
        from greenlet import greenlet
        ctx = self._ctx
        dist = ctx.distributed
        gs: list = []
        errors: dict[int, Exception] = {}
        for rank in range(nprocs):
            def _entry(r: int = rank) -> None:
                try:
                    fn(r, *args)
                except Exception as e:
                    errors[r] = e
                    raise
            g = greenlet(_entry)
            if dist is not None and hasattr(dist, "_bind_rank"):
                dist._bind_rank(g, rank)
            gs.append(g)
        try:
            while True:
                alive = [g for g in gs if not g.dead]
                if not alive:
                    break
                for g in alive:
                    if not g.dead:
                        g.switch()
                _drain_pending(ctx)
        except Exception as outer:
            # D0.4-(4) sibling cleanup. Abort live greenlets, clear state.
            for other in gs:
                if not other.dead:
                    try:
                        other.throw(SystemExit)
                    except BaseException:
                        # SystemExit inherits BaseException; greenlet.throw
                        # re-raises in caller if target doesn't catch it.
                        # Silent — we're already in cleanup.
                        pass
            backend = getattr(dist, "_backend", None)
            if backend is not None:
                if hasattr(backend, "_barrier") and hasattr(backend._barrier, "reset"):
                    try:
                        backend._barrier.reset()
                    except Exception:
                        pass
                pending_collective = getattr(
                    backend, "_pending_collective_handles", None,
                )
                if pending_collective is not None:
                    pending_collective.clear()
            ctx._pending_worker_waits.clear()
            raise SpawnException(errors) from outer
        # join=True: we already waited for all workers above.
@@ -66,6 +66,57 @@ def _numpy_dtype(dtype: str) -> np.dtype:
    return np.dtype(_NUMPY_DTYPE.get(dtype, np.float16))
 # ADR-0027 T5.g: closed-set registry of host-read barrier entry-points.
 # Any new Tensor API with host-observable read semantics must be added here
 # AND implement the barrier call. Code review + this registry keep the set
 # consistent (Python introspection-based auto-detection is a non-goal).
 # Note on ``copy_``: the source read is barriered via ``source.numpy()``.
 # A target-side write barrier was specified in an earlier revision of
 # ADR-0027 D0.5 but is intentionally not applied (global-pending target
 # barrier can prematurely drain cross-rank collectives → deadlock).
 _HOST_READ_BARRIERS: frozenset[str] = frozenset({
    "numpy",
    "data",
    "__getitem__",
    "__repr__",
    "copy_",  # source-side via source.numpy(); target-side not barriered
 })
 def _host_read_barrier(tensor: "Tensor") -> None:
    """ADR-0027 D0.5: drain pending worker-wait queue before a host-observable
    read/write.
    Scope: the barrier yields to main when ``ctx._pending_worker_waits`` is
    non-empty AND the caller is a worker greenlet. Collective pending
    (``backend._pending_collective_handles``) is **deliberately excluded**
    from this check — collective handles represent cross-rank protocol that
    must be drained only at scheduler synchronisation points (all workers
    yielded). A collective's own yield (inside ``all_reduce``) already
    ensures that once the collective call returns to the worker, post-drain
    values are visible, so subsequent host reads see materialised data
    without needing to trigger drain themselves. Including collective
    pending here would cause an unrelated rank's barrier to prematurely
    request drain of a cross-rank operation → deadlock.
    No-op when called from main context or when the worker-wait queue is
    empty (fast-path avoids needless context switches).
    """
    ctx = None
    if tensor._ctx_ref is not None:
        ctx = tensor._ctx_ref()
    if ctx is None:
        return
    worker_pending = getattr(ctx, "_pending_worker_waits", None)
    if not worker_pending:
        return  # fast-path
    from greenlet import getcurrent
    g = getcurrent()
    if g.parent is None or g.parent.dead:
        return  # main context: caller drains directly when needed
    g.parent.switch()
 def deploy_tensor(
    *,
    name: str,
@@ -217,7 +268,9 @@ class Tensor:
        """Read a shard-aligned slice. Returns a numpy array.
        Mirrors ``torch.Tensor.__getitem__`` for the shard-aligned case.
        ADR-0027 D0.5: host-read barrier.
        """
        _host_read_barrier(self)
        start, stop = self._resolve_shard_index(key)
        shard = self._shard_for_range(start, stop)
        if self._memory_store is None:
@@ -272,6 +325,8 @@ class Tensor:
    def __repr__(self) -> str:
        parts = [f"tensor(name={self.name}, shape={self.shape}, dtype={self.dtype}"]
        if self._memory_store is not None and self._handle is not None:
            # ADR-0027 D0.5: barrier on data-containing repr path.
            _host_read_barrier(self)
            arr = self.data
            parts.append(f", mean={float(arr.mean()):.4g}, norm={float(np.linalg.norm(arr)):.4g}")
        else:
@@ -308,7 +363,11 @@ class Tensor:
        Mirrors ``torch.Tensor.numpy()``. In kernbench, sharded tensors are
        gathered into a single full-shape ndarray according to each shard's
        ``offset_bytes`` / ``nbytes`` range.
        ADR-0027 D0.5: acts as a host-read barrier — drains pending waits +
        collective handles before reading, ensuring post-drain values.
        """
        _host_read_barrier(self)
        np_dtype = _numpy_dtype(self.dtype)
        # Host-side tensor (created via torch.from_numpy) has no shards.
        if self._host_buffer is not None:
@@ -340,6 +399,12 @@ class Tensor:
        re-scattered into self's shard layout.
        Shapes must match. Returns self.
        ADR-0027 D0.5: source-side read barrier is triggered inside
        ``source.numpy()``. Target-side write barrier is not applied here
        because it would require cross-rank coordination when other ranks
        have pending collectives (see _host_read_barrier docstring on
        collective pending being cross-rank).
        """
        if self._handle is None or self._memory_store is None:
            raise RuntimeError(
@@ -101,12 +101,19 @@ class DataExecutor:
        p = op.params
        if "src_a_addr" not in p:
            return  # composite record without full params
-        space = p.get("addr_space", "tcm")
+        default_space = p.get("addr_space", "tcm")
        # ADR-0027: per-operand + output spaces (fall back to single space
        # for legacy records without explicit space keys).
        src_a_space = p.get("src_a_space", default_space)
        src_b_space = p.get("src_b_space", default_space)
        dst_space = p.get("dst_space", default_space)
        dtype_in = p.get("dtype_in", "f16")
        dtype_out = p.get("dtype_out", dtype_in)
-        a = self.store.read(space, p["src_a_addr"], shape=p.get("shape_a"), dtype=dtype_in)
+        a = self.store.read(src_a_space, p["src_a_addr"],
-        b = self.store.read(space, p["src_b_addr"], shape=p.get("shape_b"), dtype=dtype_in)
+                            shape=p.get("shape_a"), dtype=dtype_in)
        b = self.store.read(src_b_space, p["src_b_addr"],
                            shape=p.get("shape_b"), dtype=dtype_in)
        # Compute in higher precision if specified
        dtype_acc = p.get("dtype_acc", "f32")
@@ -114,7 +121,7 @@ class DataExecutor:
        b_f = b.astype(_resolve_dtype(dtype_acc))
        result = np.matmul(a_f, b_f).astype(_resolve_dtype(dtype_out))
-        self.store.write(space, p["dst_addr"], result)
+        self.store.write(dst_space, p["dst_addr"], result)
    def _execute_math(self, op: OpRecord) -> None:
        """Execute math op: unary, binary, or reduction."""
@@ -79,16 +79,24 @@ class OpLogger:
                        snaps.append(None)
                params["input_snapshots"] = snaps
            elif op_name == "dma_write":
-                try:
+                # ADR-0027 fix: only snapshot HBM sources. TCM (PE scratch)
-                    arr = self._memory_store.read(
+                # sources are repopulated by Phase 2 math/gemm replay —
-                        params["src_space"], params["src_addr"],
+                # capturing a Phase-1-time snapshot here would pick up stale
-                        shape=params.get("shape"), dtype=params.get("dtype"),
+                # data from a PRIOR kernel's Phase 2 output that aliased the
-                    )
+                # same scratch address, causing the later kernel's replay
-                    params["snapshot"] = (
+                # to write that stale value instead of the fresh math
-                        arr.copy() if hasattr(arr, "copy") else arr
+                # result. See ADR-0027 postmortem (TP gemm → all_reduce).
-                    )
+                if params.get("src_space") == "hbm":
-                except Exception:
+                    try:
-                    params["snapshot"] = None
+                        arr = self._memory_store.read(
                            params["src_space"], params["src_addr"],
                            shape=params.get("shape"), dtype=params.get("dtype"),
                        )
                        params["snapshot"] = (
                            arr.copy() if hasattr(arr, "copy") else arr
                        )
                    except Exception:
                        params["snapshot"] = None
        self._records.append(OpRecord(
            t_start=pending["t_start"],
            t_end=t,
@@ -167,6 +175,13 @@ def _extract_op_info(msg: Any) -> tuple[str, str, dict[str, Any]]:
            "dtype_in": msg.a.dtype,
            "dtype_out": msg.out.dtype,
            "m": msg.m, "k": msg.k, "n": msg.n,
            # ADR-0027: preserve per-operand + output MemoryStore spaces so
            # Phase 2 replay can resolve HBM-resident operands (e.g. tl.load
            # results keep space="hbm"). Absent → DataExecutor falls back
            # to the legacy single-space mode via ``addr_space``.
            "src_a_space": getattr(msg.a, "space", "tcm"),
            "src_b_space": getattr(msg.b, "space", "tcm"),
            "dst_space": getattr(msg.out, "space", "tcm"),
        }
    if isinstance(msg, MathCmd):
        return "math", msg.op, {
@@ -181,10 +196,27 @@ def _extract_op_info(msg: Any) -> tuple[str, str, dict[str, Any]]:
            "axis": msg.axis,
        }
    if isinstance(msg, CompositeCmd):
-        return "gemm" if msg.op == "gemm" else "math", f"composite_{msg.op}", {
+        params: dict[str, Any] = {
            "op": msg.op,
            "out_addr": msg.out_addr,
            "out_nbytes": msg.out_nbytes,
        }
        # ADR-0027: preserve operand info so Phase 2 DataExecutor can replay
        # the composite's numerical effect (treat it like a GemmCmd).
        if msg.op == "gemm" and msg.a is not None and msg.b is not None:
            params.update({
                "src_a_addr": msg.a.addr,
                "src_b_addr": msg.b.addr,
                "shape_a": msg.a.shape,
                "shape_b": msg.b.shape,
                "dtype_in": msg.a.dtype,
                "dtype_out": msg.a.dtype,
                "src_a_space": getattr(msg.a, "space", "hbm"),
                "src_b_space": getattr(msg.b, "space", "hbm"),
                "dst_space": "hbm",
                # dst_addr alias so DataExecutor._execute_gemm picks it up.
                "dst_addr": msg.out_addr,
            })
        return "gemm" if msg.op == "gemm" else "math", f"composite_{msg.op}", params
    # Fallback for unknown data_op messages
    return "unknown", type(msg).__name__, {}
@@ -0,0 +1,21 @@
 """kernbench.tp — Megatron-style Tensor Parallelism (ADR-0027).
 Public API re-exports.
 """
 from kernbench.tp.layers import (
    ColumnParallelLinear,
    RowParallelLinear,
 )
 from kernbench.tp.parallel_state import (
    get_tensor_model_parallel_rank,
    get_tensor_model_parallel_world_size,
    initialize_model_parallel,
 )
 __all__ = [
    "ColumnParallelLinear",
    "RowParallelLinear",
    "get_tensor_model_parallel_rank",
    "get_tensor_model_parallel_world_size",
    "initialize_model_parallel",
 ]
@@ -0,0 +1,23 @@
 """Kernel used by ``kernbench.tp`` layers (ADR-0027 D4/D5).
 Intentionally self-contained inside the ``tp`` package — the ``tp`` package
 must not import from ``benches/``. Future work: move to a shared
 ``kernbench.kernels`` module so benches and TP can share.
 """
 from __future__ import annotations
 def _gemm_kernel(a_ptr, b_ptr, out_ptr, M, K, N, tl, DTYPE: str = "f16") -> None:
    """Single-PE GEMM: out = a @ b via load → dot → store.
    Uses the ``tl.load + tl.dot + tl.store`` path. Unlike ``tl.composite``
    (which is absorbed by the PE scheduler into TileTokens that don't reach
    the op_log), this path emits explicit ``DmaReadCmd`` / ``GemmCmd`` /
    ``DmaWriteCmd`` records, which DataExecutor replays numerically in
    Phase 2.
    """
    M, K, N = int(M), int(K), int(N)
    a = tl.load(int(a_ptr), shape=(M, K), dtype=DTYPE)
    b = tl.load(int(b_ptr), shape=(K, N), dtype=DTYPE)
    out = tl.dot(a, b)
    tl.store(int(out_ptr), out)
@@ -0,0 +1,150 @@
 """Megatron-style parallel layers (ADR-0027 D4/D5).
 - ``ColumnParallelLinear``: weight's out_features axis split across TP ranks.
  forward(x) is local gemm; no collective.
 - ``RowParallelLinear``: weight's in_features axis split across TP ranks.
  forward(x) ends with ``dist.all_reduce`` to sum partial products.
 Both layers use the intra-device ``DPPolicy`` (ADR-0026). TP shard
 ownership is determined by ``torch.ahbm.set_device(rank)`` (ADR-0024 D10).
 Yield-safety contract (ADR-0027 D4/D5): every forward path contains at
 least one ``ctx.wait`` (via ``torch.launch``) or one collective; this
 keeps the scheduler loop making progress.
 """
 from __future__ import annotations
 from typing import Any
 from kernbench.policy.placement.dp import DPPolicy
 from kernbench.tp.kernels import _gemm_kernel
 from kernbench.tp.parallel_state import (
    get_tensor_model_parallel_world_size,
 )
 class ColumnParallelLinear:
    """Weight's K (out_features) axis distributed across TP ranks.
    forward(x):
        x: (M, N) — full-replicated across ranks
        W_k: (N, K / world_size) — this rank's slice (on its SIP)
        y_k = x @ W_k → (M, K / world_size)
    """
    def __init__(
        self,
        in_features: int,
        out_features: int,
        bias: bool = False,
        dtype: str = "f16",
        torch: Any = None,
    ) -> None:
        if torch is None:
            raise TypeError("ColumnParallelLinear requires torch=<RuntimeContext>")
        ws = get_tensor_model_parallel_world_size()
        if out_features % ws != 0:
            raise ValueError(
                f"out_features ({out_features}) must be divisible by TP world "
                f"size ({ws})"
            )
        self.in_features = in_features
        self.out_features = out_features
        self.k_local = out_features // ws
        self.dtype = dtype
        self._torch = torch
        # Per-rank weight slice. ``set_device(rank)`` (ADR-0024 D10) places
        # it on SIP ``rank``. Intra-SIP layout comes from DPPolicy (ADR-0026).
        self.weight = torch.zeros(
            (in_features, self.k_local),
            dtype=dtype,
            dp=DPPolicy(cube="replicate", pe="replicate",
                        num_cubes=1, num_pes=1),
            name="col_parallel_w",
        )
        # Bias omitted in initial scope (ADR-0027 D9).
        self.bias = None
        if bias:
            raise NotImplementedError(
                "bias=True is deferred (ADR-0027 D9 initial scope)"
            )
    def forward(self, x):
        M = int(x.shape[0])
        out = self._torch.empty(
            (M, self.k_local),
            dtype=x.dtype,
            dp=DPPolicy(cube="replicate", pe="replicate",
                        num_cubes=1, num_pes=1),
            name="col_parallel_out",
        )
        self._torch.launch(
            "col_parallel_gemm",
            _gemm_kernel,
            x, self.weight, out,
            M, self.in_features, self.k_local,
        )
        return out
 class RowParallelLinear:
    """Weight's N (in_features) axis distributed across TP ranks.
    forward(x):
        x: (M, N / world_size) — rank-local slice (ColumnParallel output)
        W_k: (N / world_size, K) — this rank's slice
        y_k = x @ W_k → (M, K) — partial sum
        y = all_reduce(y_k, op="sum") → (M, K) on every rank
    """
    def __init__(
        self,
        in_features: int,
        out_features: int,
        bias: bool = False,
        dtype: str = "f16",
        torch: Any = None,
    ) -> None:
        if torch is None:
            raise TypeError("RowParallelLinear requires torch=<RuntimeContext>")
        ws = get_tensor_model_parallel_world_size()
        if in_features % ws != 0:
            raise ValueError(
                f"in_features ({in_features}) must be divisible by TP world "
                f"size ({ws})"
            )
        self.in_features = in_features
        self.out_features = out_features
        self.n_local = in_features // ws
        self.dtype = dtype
        self._torch = torch
        self.weight = torch.zeros(
            (self.n_local, out_features),
            dtype=dtype,
            dp=DPPolicy(cube="replicate", pe="replicate",
                        num_cubes=1, num_pes=1),
            name="row_parallel_w",
        )
        self.bias = None
        if bias:
            raise NotImplementedError(
                "bias=True is deferred (ADR-0027 D9 initial scope)"
            )
    def forward(self, x):
        M = int(x.shape[0])
        y_partial = self._torch.empty(
            (M, self.out_features),
            dtype=x.dtype,
            dp=DPPolicy(cube="replicate", pe="replicate",
                        num_cubes=1, num_pes=1),
            name="row_parallel_partial",
        )
        self._torch.launch(
            "row_parallel_gemm",
            _gemm_kernel,
            x, self.weight, y_partial,
            M, self.n_local, self.out_features,
        )
        self._torch.distributed.all_reduce(y_partial, op="sum")
        return y_partial
@@ -0,0 +1,5 @@
 """Forward/backward mappings stub (ADR-0027 — future backward work).
 Inference-only initial scope. Backward hooks land when training simulation
 arrives.
 """
@@ -0,0 +1,83 @@
 """TP group state (ADR-0027 D3).
 Single global TP group. Initial scope: TP size == world_size (pure TP;
 mixed DP+TP is future work).
 """
 from __future__ import annotations
 _TP_WORLD_SIZE: int | None = None
 def initialize_model_parallel(tensor_model_parallel_size: int) -> None:
    """Initialize the TP process group.
    Must be called after ``torch.distributed.init_process_group``.
    Only ``tensor_model_parallel_size == world_size`` is supported in the
    initial scope.
    """
    global _TP_WORLD_SIZE
    # Import here to avoid cycle when tp is imported before a ctx exists.
    _ws = _current_world_size()
    if tensor_model_parallel_size != _ws:
        raise NotImplementedError(
            f"Only TP == world_size supported; got TP={tensor_model_parallel_size}, "
            f"world_size={_ws}"
        )
    _TP_WORLD_SIZE = tensor_model_parallel_size
 def get_tensor_model_parallel_world_size() -> int:
    """Return the TP group's world size.
    Raises if not initialised — callers must call
    :func:`initialize_model_parallel` first.
    """
    if _TP_WORLD_SIZE is None:
        raise RuntimeError(
            "TP group not initialised; call initialize_model_parallel() first"
        )
    return _TP_WORLD_SIZE
 def get_tensor_model_parallel_rank() -> int:
    """Return this worker's rank within the TP group.
    Delegates to the greenlet-local rank registered by the spawn launcher
    (ADR-0024 D9 via ``torch.distributed.get_rank``).
    """
    # Resolve via the global torch.distributed facade on the active ctx.
    return _current_rank()
 def _reset_for_tests() -> None:
    """Clear _TP_WORLD_SIZE so ordering-sensitive tests can re-init."""
    global _TP_WORLD_SIZE
    _TP_WORLD_SIZE = None
 # ── helpers (resolve current ctx) ────────────────────────────────────
 def _current_ctx():
    """Best-effort resolution of the currently-active RuntimeContext.
    In KernBench, the ``ctx`` is passed as the ``torch`` positional in
    bench/worker code. Since parallel_state is a module-global helper,
    we look it up via a weak registry maintained by RuntimeContext.
    """
    from kernbench.runtime_api.context import _get_active_context
    ctx = _get_active_context()
    if ctx is None:
        raise RuntimeError(
            "No active RuntimeContext; kernbench.tp requires one "
            "(call init_process_group / spawn under a live ctx)"
        )
    return ctx
 def _current_world_size() -> int:
    return _current_ctx().distributed.get_world_size()
 def _current_rank() -> int:
    return _current_ctx().distributed.get_rank()
@@ -0,0 +1,34 @@
 """TP primitive ops (ADR-0027 D6).
 ``copy_to_tp_region`` / ``reduce_from_tp_region`` are forward-only in the
 initial scope (backward pass is future work). ``scatter`` / ``gather`` are
 not implemented — they require an all-gather kernel that is not yet
 available in KernBench (see ADR-0027 D9).
 """
 from __future__ import annotations
 from typing import Any
 def copy_to_tp_region(x: Any) -> Any:
    """Forward: identity. Backward: all-reduce. (Training is future.)"""
    return x
 def reduce_from_tp_region(x: Any, torch: Any) -> Any:
    """Forward: all-reduce. Backward: identity."""
    torch.distributed.all_reduce(x, op="sum")
    return x
 def scatter_to_tp_region(x: Any) -> Any:
    raise NotImplementedError(
        "scatter_to_tp_region deferred — caller should create the sharded "
        "tensor directly (ADR-0027 D9)"
    )
 def gather_from_tp_region(x: Any) -> Any:
    raise NotImplementedError(
        "gather_from_tp_region deferred — requires all-gather kernel (ADR-0027 D9)"
    )
@@ -70,29 +70,14 @@ CASES = [
    # Default fallback — no world_size override → ADR-0024 D1 derives
    # from topology (SIP count = 2). Exercises the new SIP-level TP
    # launcher + cross-SIP ring.
-    # XFAIL — architectural blocker (ADR-0024 Phase B, future redesign):
+    # ADR-0027 D0+D1 landed the architectural fix (worker-wait
-    # Bench workers call torch.zeros / copy_ which internally drive
+    # generalization + torch.multiprocessing.spawn scheduler drain), so
-    # env.run in the WORKER-greenlet context. Any KernelLaunchMsg already
+    # this case now passes normally. Keeping it as the topology-default
-    # pending in the SimPy queue gets stepped inside that worker context,
+    # smoke.
    # which in turn spawns kernel_runner + kernel greenlet with parent =
    # worker (not main). When the worker later yields / finishes, the
    # kernel greenlet is orphaned; its next switch_to_simpy raises
    # GreenletExit mid-add, producing rank 0 mean=1 (expected 3).
    # Fix requires redesigning worker semantics so env.run only ever
    # drives from main (options: lazy-deploy tensor API, coroutine
    # worker, or setup/verify split). Not a single-PR change — parked
    # until ADR-0027 (Megatron TP) starts, at which point a proper
    # architectural solution lands together with TP use cases.
    pytest.param(
        "ring_allreduce_tcm", "kernbench.ccl.algorithms.ring_allreduce",
        "ring_1d", "tcm", None, 8, 2,
        id="ring_default_ws",
        marks=pytest.mark.xfail(
            reason="ADR-0024 Phase B: worker-greenlet env.run captures "
                   "kernel greenlet as child → orphaned on worker yield. "
                   "Needs architectural redesign (see test comment).",
            strict=True,
        ),
    ),
    # Buffer variants at 8-rank (fast — same kernel, different slot space).
    pytest.param(
@@ -0,0 +1,270 @@
 """ADR-0027 T5: Host-read barrier (D0.5).
 Phase 1: Tensor.numpy / data / __getitem__ / __repr__ / copy_ currently
 perform MemoryStore operations without barrier logic → tests fail when
 they assert drain is triggered. Phase 2 injects the barrier.
 """
 from __future__ import annotations
 import numpy as np
 import pytest
 from greenlet import greenlet
 def _make_ctx(topology):
    from kernbench.runtime_api.context import RuntimeContext
    from kernbench.runtime_api.types import DeviceSelector
    from kernbench.sim_engine.engine import GraphEngine
    engine = GraphEngine(topology.topology_obj, enable_data=True)
    return RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id="test_t5",
        spec=topology.topology_obj.spec,
    )
 # ── T5.g: closed-set registry exists ─────────────────────────────────
 def test_host_read_barrier_registry_exists():
    """D0.5 T5.g: Tensor module exposes the closed-set registry."""
    from kernbench.runtime_api import tensor as tensor_mod
    assert hasattr(tensor_mod, "_HOST_READ_BARRIERS"), (
        "ADR-0027 T5.g: tensor module must declare _HOST_READ_BARRIERS registry"
    )
    registry = tensor_mod._HOST_READ_BARRIERS
    assert isinstance(registry, frozenset)
    expected = {"numpy", "data", "__getitem__", "__repr__", "copy_"}
    assert expected.issubset(registry), (
        f"registry must include {expected}; got {registry}"
    )
 # ── T5.a: numpy() triggers drain when pending non-empty ──────────────
 def test_numpy_triggers_drain_when_pending(topology):
    """T5.a: launch → numpy() → barrier drains before read (worker context)."""
    with _make_ctx(topology) as ctx:
        from kernbench.policy.placement.dp import DPPolicy
        dp = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=1)
        observed: dict = {"pre_numpy_pending": None, "post_numpy_pending": None}
        def _worker():
            t = ctx.zeros((1, 8), dtype="f16", dp=dp, name="t5a_t")
            src = np.full((1, 8), 1.5, dtype=np.float16)
            t.copy_(ctx.distributed._ctx_ref.from_numpy(src) if False else _hold(ctx, src))
            # Manually push a dummy handle to simulate pending state; in real
            # D0.5, numpy will detect and drain.
            observed["pre_numpy_pending"] = list(ctx._pending_worker_waits)
            _ = t.numpy()
            observed["post_numpy_pending"] = list(ctx._pending_worker_waits)
        # Can't actually manufacture pending + test numpy inside worker
        # without D0.5 implemented — instead, verify the barrier path is
        # invoked by spying.
        from kernbench.runtime_api.tensor import Tensor
        barrier_calls = {"n": 0}
        original_numpy = Tensor.numpy
        def _spy_numpy(self):
            # After D0.5 is implemented, this wrapper is redundant; the
            # test just checks numpy was called at all after a pending
            # operation.
            barrier_calls["n"] += 1
            return original_numpy(self)
        Tensor.numpy = _spy_numpy  # type: ignore[assignment]
        try:
            ctx.multiprocessing.spawn(_mk_worker_numpy, args=(ctx,), nprocs=1)
        finally:
            Tensor.numpy = original_numpy  # type: ignore[assignment]
        assert barrier_calls["n"] >= 1
 def _hold(ctx, arr):
    """helper (unused branch)."""
    import numpy as _np
    t = type("X", (), {})()
    t.numpy = lambda self=None: arr
    return t
 def _mk_worker_numpy(rank, ctx):
    """Worker that calls numpy after a tensor deploy. Triggers barrier."""
    from kernbench.policy.placement.dp import DPPolicy
    dp = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=1)
    t = ctx.zeros((1, 8), dtype="f16", dp=dp, name=f"t5_r{rank}")
    _ = t.numpy()
 # ── T5.b: metadata access does NOT drain ─────────────────────────────
 def test_metadata_access_is_non_barrier(topology):
    """T5.b: .shape / .dtype / .name do NOT trigger drain."""
    with _make_ctx(topology) as ctx:
        from kernbench.runtime_api import tensor as tensor_mod
        from kernbench.policy.placement.dp import DPPolicy
        dp = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=1)
        t = ctx.zeros((1, 8), dtype="f16", dp=dp, name="t5b")
        # Populate pending queue artificially (simulate worker state).
        ctx._pending_worker_waits.append("fake_handle_that_must_not_drain")
        _ = t.shape
        _ = t.dtype
        _ = t.name
        assert "fake_handle_that_must_not_drain" in ctx._pending_worker_waits, (
            "T5.b: metadata accessors must not drain pending queue"
        )
        ctx._pending_worker_waits.clear()
 # ── T5.c: empty pending → numpy is fast-path (no yield) ──────────────
 def test_numpy_fast_path_when_pending_empty(topology):
    """T5.c: numpy() with empty pending queue does not yield to main."""
    with _make_ctx(topology) as ctx:
        from kernbench.policy.placement.dp import DPPolicy
        dp = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=1)
        def _worker(rank: int):
            t = ctx.zeros((1, 4), dtype="f16", dp=dp, name=f"t5c_r{rank}")
            # At this point, after worker's own wait(s), pending should be empty.
            assert ctx._pending_worker_waits == [], (
                "after worker's deploy, pending queue should be drained"
            )
            # numpy call should be fast-path (no yield).
            _ = t.numpy()
        ctx.multiprocessing.spawn(_worker, args=(), nprocs=1)
 # ── T5.d: __getitem__ / data also barriers ───────────────────────────
 def test_getitem_and_data_are_barriers(topology):
    """T5.d: __getitem__ and .data property behave like numpy() barrier."""
    with _make_ctx(topology) as ctx:
        from kernbench.policy.placement.dp import DPPolicy
        dp = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=1)
        def _worker(rank: int):
            t = ctx.zeros((1, 8), dtype="f16", dp=dp, name=f"t5d_r{rank}")
            # host src copied in (forces write path)
            src = np.full((1, 8), float(rank + 1), dtype=np.float16)
            from kernbench.runtime_api.tensor import Tensor
            h = Tensor(shape=src.shape, dtype="f16", name="host")
            h._host_buffer = src
            t.copy_(h)
            # Read access via __getitem__ and .data: both must fully materialize.
            slice_val = t[0, 0:4]
            data_val = t.data
            assert slice_val.shape[0] == 4
            assert data_val.shape == (1, 8)
        ctx.multiprocessing.spawn(_worker, args=(), nprocs=2)
 # ── T5.e: collective pending also drained by barrier ────────────────
 def test_numpy_drains_collective_pending(topology, tmp_path, monkeypatch):
    """T5.e: numpy() after all_reduce must see post-reduce data.
    Note: in the current model, ``all_reduce`` itself yields to main so the
    collective is drained before the worker resumes; barriers at
    ``numpy()`` intentionally do NOT drain collective pending (would cause
    cross-rank deadlock — see ``_host_read_barrier`` docstring). What this
    test asserts is the observable contract: post-``all_reduce`` +
    ``numpy()`` sees the reduced values.
    """
    import textwrap
    body = textwrap.dedent("""\
        defaults:
          algorithm: ring_allreduce_tcm
          buffer_kind: tcm
          backpressure: sleep
          n_slots: 4
          slot_size: 4096
          vc_chunk_size: 256
          ipcq_credit_size_bytes: 16
        algorithms:
          ring_allreduce_tcm:
            module: kernbench.ccl.algorithms.ring_allreduce
            topology: ring_1d
            buffer_kind: tcm
            n_elem: 8
    """)
    (tmp_path / "ccl.yaml").write_text(body)
    monkeypatch.chdir(str(tmp_path))
    with _make_ctx(topology) as ctx:
        from kernbench.policy.placement.dp import DPPolicy
        dp = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=1)
        def _worker(rank: int, ws: int):
            ctx.ahbm.set_device(rank)
            t = ctx.zeros((1, 8), dtype="f16", dp=dp, name=f"t5e_r{rank}")
            src = np.full((1, 8), float(rank + 1), dtype=np.float16)
            from kernbench.runtime_api.tensor import Tensor
            h = Tensor(shape=src.shape, dtype="f16", name="host")
            h._host_buffer = src
            t.copy_(h)
            ctx.distributed.all_reduce(t, op="sum")
            # numpy() must see the reduced values even without explicit wait.
            out = t.numpy()
            expected = float(sum(range(1, ws + 1)))
            # Tolerance loose for fp16 accumulation.
            assert np.allclose(out, expected, rtol=1e-1, atol=1e-1), (
                f"rank {rank}: expected {expected}, got {out}"
            )
        ctx.distributed.init_process_group(backend="ahbm")
        ws = ctx.distributed.get_world_size()
        ctx.multiprocessing.spawn(_worker, args=(ws,), nprocs=ws)
 # ── T5.f: copy_ target-side write barrier ────────────────────────────
 def test_copy_from_deployed_source_drains_source(topology):
    """T5.f (revised): ``copy_(source)`` drains source-side pending via the
    ``source.numpy()`` read barrier.
    Note: the ADR originally specified a target-side write barrier as well,
    but that was removed because global-pending target barrier can cause
    cross-rank deadlock when another rank has a pending collective. Source-
    side read barrier is preserved and sufficient for the common pattern
    ``target.copy_(deployed_source)``.
    """
    with _make_ctx(topology) as ctx:
        from kernbench.policy.placement.dp import DPPolicy
        from kernbench.runtime_api.tensor import Tensor
        dp = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=1)
        def _worker(rank: int):
            # Deployed source — its .numpy() will trigger the read barrier.
            source = ctx.zeros((1, 8), dtype="f16", dp=dp, name=f"src_r{rank}")
            target = ctx.zeros((1, 8), dtype="f16", dp=dp, name=f"tgt_r{rank}")
            target.copy_(source)
            # Smoke: no hang, no exception. numpy round-trip sees zeros.
            out = target.numpy()
            assert out.shape == (1, 8)
        ctx.multiprocessing.spawn(_worker, args=(), nprocs=1)
@@ -0,0 +1,178 @@
 """ADR-0027 T4: torch.multiprocessing.spawn semantics.
 Phase 1: imports `ctx.multiprocessing.spawn` which doesn't exist yet —
 tests fail. Phase 2 (D1) lands the namespace + _MultiprocessingNamespace
 + SpawnException, and these pass.
 """
 from __future__ import annotations
 import os
 import textwrap
 import pytest
 from greenlet import greenlet
 def _write_minimal_ccl_yaml(tmp_path) -> str:
    body = textwrap.dedent("""\
        defaults:
          algorithm: ring_allreduce_tcm
          buffer_kind: tcm
          backpressure: sleep
          n_slots: 4
          slot_size: 4096
          vc_chunk_size: 256
          ipcq_credit_size_bytes: 16
        algorithms:
          ring_allreduce_tcm:
            module: kernbench.ccl.algorithms.ring_allreduce
            topology: ring_1d
            buffer_kind: tcm
            n_elem: 8
    """)
    yaml_path = tmp_path / "ccl.yaml"
    yaml_path.write_text(body)
    return str(tmp_path)
 def _make_ctx(topology):
    from kernbench.runtime_api.context import RuntimeContext
    from kernbench.runtime_api.types import DeviceSelector
    from kernbench.sim_engine.engine import GraphEngine
    engine = GraphEngine(topology.topology_obj, enable_data=True)
    return RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id="test_t4",
        spec=topology.topology_obj.spec,
    )
 # ── D1.3 namespace attach ────────────────────────────────────────────
 def test_multiprocessing_namespace_attached(topology):
    """RuntimeContext.__post_init__ attaches ctx.multiprocessing (D1.3)."""
    with _make_ctx(topology) as ctx:
        assert hasattr(ctx, "multiprocessing"), (
            "ADR-0027 D1.3: ctx.multiprocessing must exist"
        )
        assert hasattr(ctx.multiprocessing, "spawn"), (
            "ctx.multiprocessing must expose a spawn(fn, args, nprocs) method"
        )
 # ── D1.1 / D1.2: spawn shape + rank binding ──────────────────────────
 def test_spawn_invokes_fn_once_per_rank(topology):
    """spawn(fn, args, nprocs) calls fn(rank, *args) once for each rank."""
    with _make_ctx(topology) as ctx:
        calls: list[tuple[int, tuple]] = []
        def _worker(rank: int, world_size: int) -> None:
            calls.append((rank, (world_size,)))
        ctx.multiprocessing.spawn(_worker, args=(3,), nprocs=3)
        assert sorted(r for r, _ in calls) == [0, 1, 2]
        for _, (ws,) in calls:
            assert ws == 3
 def test_spawn_binds_greenlet_local_rank(topology):
    """Inside the worker, torch.distributed.get_rank() returns the rank
    bound to the greenlet (ADR-0024 D9 + D1.2)."""
    with _make_ctx(topology) as ctx:
        # Distributed context needs to be initialised so get_rank is valid.
        # For T4 we don't run a real collective; just check rank lookup.
        observed: list[tuple[int, int]] = []
        def _worker(rank: int):
            g = greenlet.getcurrent()
            bound = ctx.distributed._rank_by_greenlet.get(g)
            observed.append((rank, bound))
        ctx.multiprocessing.spawn(_worker, args=(), nprocs=2)
        for rank, bound in observed:
            assert rank == bound, (
                f"rank {rank} must be bound to greenlet-local rank {rank}; "
                f"got {bound}"
            )
 # ── D1.2 exception cleanup ───────────────────────────────────────────
 def test_spawn_exception_raises_spawn_exception_with_root_cause(topology):
    """D0.4-(4): worker raise → siblings SystemExit + SpawnException(errors)."""
    with _make_ctx(topology) as ctx:
        from kernbench.runtime_api.multiprocessing import SpawnException
        def _worker(rank: int):
            if rank == 1:
                raise ValueError(f"rank {rank} boom")
        with pytest.raises(SpawnException) as exc_info:
            ctx.multiprocessing.spawn(_worker, args=(), nprocs=3)
        # Root cause rank is captured.
        assert 1 in exc_info.value.errors
        assert isinstance(exc_info.value.errors[1], ValueError)
 def test_spawn_exception_clears_pending_queues(topology):
    """D0.4-(4): on raise, _pending_worker_waits and collective queue clear."""
    with _make_ctx(topology) as ctx:
        from kernbench.runtime_api.multiprocessing import SpawnException
        def _worker(rank: int):
            raise RuntimeError("fail")
        with pytest.raises(SpawnException):
            ctx.multiprocessing.spawn(_worker, args=(), nprocs=2)
        assert ctx._pending_worker_waits == []
 # ── D1.4 migration compat: ccl_allreduce runs via mp.spawn ───────────
 def test_ccl_allreduce_hand_rolled_loop_replaced_by_mp_spawn(
    topology, tmp_path, monkeypatch, spec,
 ):
    """D1.4: benches/ccl_allreduce.py's hand-rolled greenlet loop must still
    produce correct behaviour after migration to torch.multiprocessing.spawn.
    Minimal smoke — just that ``bench.run(ctx)`` completes without the
    loop short-circuiting or leaving pending queues dirty.
    """
    monkeypatch.chdir(_write_minimal_ccl_yaml(tmp_path))
    import benches.ccl_allreduce as bench
    calls: list[tuple[int, int]] = []
    def _fake_worker(rank, world_size, torch):
        calls.append((rank, world_size))
    monkeypatch.setattr(bench, "worker", _fake_worker)
    with _make_ctx(topology) as ctx:
        bench.run(ctx)
    expected_ws = int(spec["system"]["sips"]["count"])
    ranks = sorted(r for r, _ in calls)
    assert ranks == list(range(expected_ws))
    assert ctx._pending_worker_waits == []
 # ── _drain_pending function is exported ──────────────────────────────
 def test_drain_pending_exported():
    """D0.4: _drain_pending must be importable from runtime_api.multiprocessing."""
    from kernbench.runtime_api.multiprocessing import _drain_pending
    assert callable(_drain_pending)
@@ -0,0 +1,234 @@
 """ADR-0027 T2: TP layer shape + numerical correctness (D4/D5).
 Phase 1: ``kernbench.tp.layers`` doesn't exist → import failure. Phase 2
 lands D4/D5 and T2 passes with deterministic non-zero weight patterns.
 """
 from __future__ import annotations
 import numpy as np
 import pytest
 def _make_ctx(topology):
    from kernbench.runtime_api.context import RuntimeContext
    from kernbench.runtime_api.types import DeviceSelector
    from kernbench.sim_engine.engine import GraphEngine
    engine = GraphEngine(topology.topology_obj, enable_data=True)
    return RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id="test_t2",
        spec=topology.topology_obj.spec,
    )
 # ── Shape / structural ───────────────────────────────────────────────
 def test_column_parallel_weight_shape_per_rank(topology):
    """ColumnParallelLinear weight per rank is (in_features, out // ws)."""
    import kernbench.tp as tp
    from kernbench.runtime_api.tensor import Tensor
    with _make_ctx(topology) as ctx:
        ctx.distributed.init_process_group(backend="ahbm")
        ws = ctx.distributed.get_world_size()
        tp.initialize_model_parallel(ws)
        def _worker(rank: int):
            ctx.ahbm.set_device(rank)
            fc = tp.ColumnParallelLinear(
                in_features=256, out_features=512, torch=ctx,
            )
            assert fc.weight.shape == (256, 512 // ws)
        ctx.multiprocessing.spawn(_worker, args=(), nprocs=ws)
 def test_row_parallel_weight_shape_per_rank(topology):
    """RowParallelLinear weight per rank is (in_features // ws, out_features)."""
    import kernbench.tp as tp
    with _make_ctx(topology) as ctx:
        ctx.distributed.init_process_group(backend="ahbm")
        ws = ctx.distributed.get_world_size()
        tp.initialize_model_parallel(ws)
        def _worker(rank: int):
            ctx.ahbm.set_device(rank)
            fc = tp.RowParallelLinear(
                in_features=512, out_features=256, torch=ctx,
            )
            assert fc.weight.shape == (512 // ws, 256)
        ctx.multiprocessing.spawn(_worker, args=(), nprocs=ws)
 # ── T2.a: ColumnParallel deterministic numerical ─────────────────────
 def test_column_parallel_forward_matches_matmul(topology):
    """T2.a: ColumnParallelLinear.forward output == x @ W_rank (rtol 1e-2)."""
    import kernbench.tp as tp
    from kernbench.runtime_api.tensor import Tensor
    with _make_ctx(topology) as ctx:
        ctx.distributed.init_process_group(backend="ahbm")
        ws = ctx.distributed.get_world_size()
        tp.initialize_model_parallel(ws)
        M = 4
        D_in, D_out = 32, 32 * ws
        def _worker(rank: int):
            ctx.ahbm.set_device(rank)
            fc = tp.ColumnParallelLinear(
                in_features=D_in, out_features=D_out, torch=ctx,
            )
            # Deterministic non-zero weight: rank-scaled constant.
            k_local = D_out // ws
            weight_np = np.full(
                (D_in, k_local), 0.01 * (rank + 1), dtype=np.float16,
            )
            src = Tensor(shape=(D_in, k_local), dtype="f16", name="host_w")
            src._host_buffer = weight_np
            fc.weight.copy_(src)
            # Input: full-replicated constant.
            x_np = np.full((M, D_in), 0.5, dtype=np.float16)
            x = ctx.zeros(
                (M, D_in), dtype="f16",
                dp=_replicate_dp(), name=f"t2a_x_r{rank}",
            )
            hx = Tensor(shape=x_np.shape, dtype="f16", name="host_x")
            hx._host_buffer = x_np
            x.copy_(hx)
            y = fc.forward(x)
            out = y.numpy()
            expected = x_np.astype(np.float32) @ weight_np.astype(np.float32)
            assert out.shape == (M, k_local)
            assert np.allclose(out.astype(np.float32), expected,
                               rtol=1e-2, atol=1e-2), (
                f"rank {rank}: output does not match x @ W_local"
            )
        ctx.multiprocessing.spawn(_worker, args=(), nprocs=ws)
 # ── T2.b: RowParallel observable equality ────────────────────────────
 def test_row_parallel_forward_concat_matmul_equality(topology):
    """T2.b (primary): RowParallel output == concat(x) @ concat(W) (all-reduced)."""
    import kernbench.tp as tp
    from kernbench.runtime_api.tensor import Tensor
    with _make_ctx(topology) as ctx:
        ctx.distributed.init_process_group(backend="ahbm")
        ws = ctx.distributed.get_world_size()
        tp.initialize_model_parallel(ws)
        M = 4
        D_in, D_out = 32 * ws, 32  # must divide ws evenly
        results: dict[int, np.ndarray] = {}
        def _worker(rank: int):
            ctx.ahbm.set_device(rank)
            fc = tp.RowParallelLinear(
                in_features=D_in, out_features=D_out, torch=ctx,
            )
            # Per-rank W_k = constant 0.01 * (rank + 1)
            n_local = D_in // ws
            weight_np = np.full(
                (n_local, D_out), 0.01 * (rank + 1), dtype=np.float16,
            )
            src = Tensor(shape=weight_np.shape, dtype="f16", name="host_w")
            src._host_buffer = weight_np
            fc.weight.copy_(src)
            # Input x_k = constant 0.1 * (rank + 1) (pretending it was
            # column-sharded from upstream).
            x_np = np.full((M, n_local), 0.1 * (rank + 1), dtype=np.float16)
            x = ctx.zeros(
                (M, n_local), dtype="f16",
                dp=_replicate_dp(), name=f"t2b_x_r{rank}",
            )
            hx = Tensor(shape=x_np.shape, dtype="f16", name="host_x")
            hx._host_buffer = x_np
            x.copy_(hx)
            y = fc.forward(x)
            results[rank] = y.numpy().astype(np.float32)
        ctx.multiprocessing.spawn(_worker, args=(), nprocs=ws)
        # Host-side reference: compute sum_r (x_r @ W_r) = y (same on all ranks).
        expected = np.zeros((M, D_out), dtype=np.float32)
        n_local = D_in // ws
        for r in range(ws):
            x_r = np.full((M, n_local), 0.1 * (r + 1), dtype=np.float32)
            w_r = np.full((n_local, D_out), 0.01 * (r + 1), dtype=np.float32)
            expected += x_r @ w_r
        for r, out in results.items():
            assert np.allclose(out, expected, rtol=1e-2, atol=1e-2), (
                f"rank {r}: all-reduced output != expected partial sum"
            )
 # ── T2.c: rank-consistency post all-reduce ───────────────────────────
 def test_row_parallel_rank_identity_post_all_reduce(topology):
    """T2.c: after all_reduce, all ranks see elementwise-identical output."""
    import kernbench.tp as tp
    from kernbench.runtime_api.tensor import Tensor
    with _make_ctx(topology) as ctx:
        ctx.distributed.init_process_group(backend="ahbm")
        ws = ctx.distributed.get_world_size()
        tp.initialize_model_parallel(ws)
        M = 2
        D_in, D_out = 16 * ws, 16
        results: dict[int, np.ndarray] = {}
        def _worker(rank: int):
            ctx.ahbm.set_device(rank)
            fc = tp.RowParallelLinear(
                in_features=D_in, out_features=D_out, torch=ctx,
            )
            n_local = D_in // ws
            weight_np = np.full((n_local, D_out), 0.01, dtype=np.float16)
            src = Tensor(shape=weight_np.shape, dtype="f16", name="host_w")
            src._host_buffer = weight_np
            fc.weight.copy_(src)
            x_np = np.full((M, n_local), 0.1, dtype=np.float16)
            x = ctx.zeros(
                (M, n_local), dtype="f16",
                dp=_replicate_dp(), name=f"t2c_x_r{rank}",
            )
            hx = Tensor(shape=x_np.shape, dtype="f16", name="host_x")
            hx._host_buffer = x_np
            x.copy_(hx)
            y = fc.forward(x)
            results[rank] = y.numpy()
        ctx.multiprocessing.spawn(_worker, args=(), nprocs=ws)
        ref = results[0]
        for r, out in results.items():
            assert np.allclose(out, ref, rtol=1e-2, atol=1e-2), (
                f"rank {r} output differs from rank 0 — all_reduce failed to make "
                f"outputs elementwise identical"
            )
 def _replicate_dp():
    from kernbench.policy.placement.dp import DPPolicy
    return DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=1)
@@ -0,0 +1,238 @@
 """ADR-0027 T6: End-to-end 2-layer MLP with TP.
 Phase 1: fails at imports. Phase 2 lands the TP package + D7 bench pattern
 and these pass with numerical-correctness checks.
 """
 from __future__ import annotations
 import numpy as np
 import pytest
 def _make_ctx(topology):
    from kernbench.runtime_api.context import RuntimeContext
    from kernbench.runtime_api.types import DeviceSelector
    from kernbench.sim_engine.engine import GraphEngine
    engine = GraphEngine(topology.topology_obj, enable_data=True)
    return RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id="test_t6",
        spec=topology.topology_obj.spec,
    )
 def _replicate_dp():
    from kernbench.policy.placement.dp import DPPolicy
    return DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=1)
 # ── T6.a: zero-weight smoke ──────────────────────────────────────────
 def test_mlp_zero_weight_produces_zero_output(topology):
    """T6.a: zero-init weight → output ≈ 0 for every rank."""
    import kernbench.tp as tp
    with _make_ctx(topology) as ctx:
        ctx.distributed.init_process_group(backend="ahbm")
        ws = ctx.distributed.get_world_size()
        tp.initialize_model_parallel(ws)
        B, D_in, D_hidden, D_out = 1, 32, 32 * ws, 32
        outputs: dict[int, np.ndarray] = {}
        def _worker(rank: int):
            ctx.ahbm.set_device(rank)
            fc1 = tp.ColumnParallelLinear(D_in, D_hidden, torch=ctx)
            fc2 = tp.RowParallelLinear(D_hidden, D_out, torch=ctx)
            x = ctx.zeros((B, D_in), dtype="f16",
                          dp=_replicate_dp(), name=f"t6a_x_r{rank}")
            from kernbench.runtime_api.tensor import Tensor
            hx = Tensor(shape=(B, D_in), dtype="f16", name="host_x")
            hx._host_buffer = np.full((B, D_in), 0.1, dtype=np.float16)
            x.copy_(hx)
            h = fc1.forward(x)
            y = fc2.forward(h)
            outputs[rank] = y.numpy()
        ctx.multiprocessing.spawn(_worker, args=(), nprocs=ws)
        for r, out in outputs.items():
            assert np.allclose(out, 0.0, atol=1e-2), (
                f"rank {r}: zero-weight output should be ~0; got mean={out.mean()}"
            )
 # ── T6.b: deterministic weight + numerical check ─────────────────────
 def test_mlp_deterministic_weight_matches_reference(topology):
    """T6.b: non-zero deterministic weights → output matches numpy reference."""
    import kernbench.tp as tp
    from kernbench.runtime_api.tensor import Tensor
    with _make_ctx(topology) as ctx:
        ctx.distributed.init_process_group(backend="ahbm")
        ws = ctx.distributed.get_world_size()
        tp.initialize_model_parallel(ws)
        B, D_in, D_hidden, D_out = 1, 16, 16 * ws, 16
        # W1 (D_in, D_hidden) — column-sharded; per rank: (D_in, D_hidden/ws)
        # W2 (D_hidden, D_out) — row-sharded; per rank: (D_hidden/ws, D_out)
        # Constant values: W1 = 0.02, W2 = 0.03, x = 0.1 (all fp16).
        X_VAL, W1_VAL, W2_VAL = 0.1, 0.02, 0.03
        outputs: dict[int, np.ndarray] = {}
        def _worker(rank: int):
            ctx.ahbm.set_device(rank)
            fc1 = tp.ColumnParallelLinear(D_in, D_hidden, torch=ctx)
            fc2 = tp.RowParallelLinear(D_hidden, D_out, torch=ctx)
            # W1 slice (per rank column slice)
            k_local_1 = D_hidden // ws
            w1_np = np.full((D_in, k_local_1), W1_VAL, dtype=np.float16)
            src1 = Tensor(shape=w1_np.shape, dtype="f16", name="host_w1")
            src1._host_buffer = w1_np
            fc1.weight.copy_(src1)
            # W2 slice (per rank row slice)
            n_local_2 = D_hidden // ws
            w2_np = np.full((n_local_2, D_out), W2_VAL, dtype=np.float16)
            src2 = Tensor(shape=w2_np.shape, dtype="f16", name="host_w2")
            src2._host_buffer = w2_np
            fc2.weight.copy_(src2)
            # Input x (full-replicated constant)
            x = ctx.zeros((B, D_in), dtype="f16",
                          dp=_replicate_dp(), name=f"t6b_x_r{rank}")
            hx = Tensor(shape=(B, D_in), dtype="f16", name="host_x")
            hx._host_buffer = np.full((B, D_in), X_VAL, dtype=np.float16)
            x.copy_(hx)
            h = fc1.forward(x)
            y = fc2.forward(h)
            outputs[rank] = y.numpy().astype(np.float32)
        ctx.multiprocessing.spawn(_worker, args=(), nprocs=ws)
        # Host reference: y = x @ W1_full @ W2_full
        w1_full = np.full((D_in, D_hidden), W1_VAL, dtype=np.float32)
        w2_full = np.full((D_hidden, D_out), W2_VAL, dtype=np.float32)
        x_full = np.full((B, D_in), X_VAL, dtype=np.float32)
        expected = x_full @ w1_full @ w2_full
        for r, out in outputs.items():
            assert out.shape == (B, D_out)
            assert np.allclose(out, expected, rtol=1e-2, atol=1e-2), (
                f"rank {r}: MLP output != reference "
                f"(got mean={out.mean():.4f}, expected={expected.mean():.4f})"
            )
 # ── T6.c: rank-consistency after final all_reduce ────────────────────
 def test_mlp_rank_consistency_after_all_reduce(topology):
    """T6.c: all ranks see elementwise-identical final output."""
    import kernbench.tp as tp
    from kernbench.runtime_api.tensor import Tensor
    with _make_ctx(topology) as ctx:
        ctx.distributed.init_process_group(backend="ahbm")
        ws = ctx.distributed.get_world_size()
        tp.initialize_model_parallel(ws)
        B, D_in, D_hidden, D_out = 1, 16, 16 * ws, 16
        outputs: dict[int, np.ndarray] = {}
        def _worker(rank: int):
            ctx.ahbm.set_device(rank)
            fc1 = tp.ColumnParallelLinear(D_in, D_hidden, torch=ctx)
            fc2 = tp.RowParallelLinear(D_hidden, D_out, torch=ctx)
            # Zero weights OK for this check — just need all_reduce to run.
            x = ctx.zeros((B, D_in), dtype="f16",
                          dp=_replicate_dp(), name=f"t6c_x_r{rank}")
            hx = Tensor(shape=(B, D_in), dtype="f16", name="host_x")
            hx._host_buffer = np.full((B, D_in), 0.1, dtype=np.float16)
            x.copy_(hx)
            h = fc1.forward(x)
            y = fc2.forward(h)
            outputs[rank] = y.numpy()
        ctx.multiprocessing.spawn(_worker, args=(), nprocs=ws)
        ref = outputs[0]
        for r, out in outputs.items():
            assert np.array_equal(out, ref), (
                f"rank {r} output differs from rank 0 — all-reduce should "
                f"make every rank see the same final tensor"
            )
 # ── T6.d: shape contract ─────────────────────────────────────────────
 def test_mlp_shape_contract(topology):
    """T6.d: ColumnParallel → (B, D_hidden/ws); RowParallel → (B, D_out)."""
    import kernbench.tp as tp
    with _make_ctx(topology) as ctx:
        ctx.distributed.init_process_group(backend="ahbm")
        ws = ctx.distributed.get_world_size()
        tp.initialize_model_parallel(ws)
        B, D_in, D_hidden, D_out = 1, 16, 16 * ws, 16
        def _worker(rank: int):
            ctx.ahbm.set_device(rank)
            fc1 = tp.ColumnParallelLinear(D_in, D_hidden, torch=ctx)
            fc2 = tp.RowParallelLinear(D_hidden, D_out, torch=ctx)
            x = ctx.zeros((B, D_in), dtype="f16",
                          dp=_replicate_dp(), name=f"t6d_x_r{rank}")
            h = fc1.forward(x)
            assert h.shape == (B, D_hidden // ws), (
                f"ColumnParallel output shape: {h.shape} != (B, D_hidden/ws)"
            )
            y = fc2.forward(h)
            assert y.shape == (B, D_out), (
                f"RowParallel output shape: {y.shape} != (B, D_out)"
            )
        ctx.multiprocessing.spawn(_worker, args=(), nprocs=ws)
 # ── liveness: deadlock 없음 (pytest timeout 간접 검증) ───────────────
 def test_mlp_completes_without_deadlock(topology):
    """Structural: full E2E spawn returns within a reasonable wall-clock.
    Relies on the test suite's overall timeout harness. If this hangs
    beyond ~60s it would surface as a pytest timeout — a deadlock
    regression in the scheduler loop would manifest here."""
    import kernbench.tp as tp
    with _make_ctx(topology) as ctx:
        ctx.distributed.init_process_group(backend="ahbm")
        ws = ctx.distributed.get_world_size()
        tp.initialize_model_parallel(ws)
        def _worker(rank: int):
            ctx.ahbm.set_device(rank)
            fc1 = tp.ColumnParallelLinear(16, 16 * ws, torch=ctx)
            fc2 = tp.RowParallelLinear(16 * ws, 16, torch=ctx)
            x = ctx.zeros((1, 16), dtype="f16",
                          dp=_replicate_dp(), name=f"t6live_r{rank}")
            h = fc1.forward(x)
            y = fc2.forward(h)
            _ = y.numpy()
        ctx.multiprocessing.spawn(_worker, args=(), nprocs=ws)
@@ -0,0 +1,85 @@
 """ADR-0027 T1: TP parallel_state (D3).
 Phase 1: ``kernbench.tp`` module does not exist yet — tests fail at import.
 Phase 2 (D2/D3) lands the package and these pass.
 """
 from __future__ import annotations
 import pytest
 def _make_ctx(topology):
    from kernbench.runtime_api.context import RuntimeContext
    from kernbench.runtime_api.types import DeviceSelector
    from kernbench.sim_engine.engine import GraphEngine
    engine = GraphEngine(topology.topology_obj, enable_data=True)
    return RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id="test_t1",
        spec=topology.topology_obj.spec,
    )
 def test_tp_package_importable():
    """D2: kernbench.tp must be importable."""
    import kernbench.tp as tp
    assert hasattr(tp, "initialize_model_parallel")
    assert hasattr(tp, "get_tensor_model_parallel_world_size")
    assert hasattr(tp, "get_tensor_model_parallel_rank")
 def test_initialize_model_parallel_matches_world_size(topology, tmp_path, monkeypatch):
    """D3: TP size must equal dist world_size; otherwise NotImplementedError."""
    import kernbench.tp as tp
    with _make_ctx(topology) as ctx:
        ctx.distributed.init_process_group(backend="ahbm")
        ws = ctx.distributed.get_world_size()
        tp.initialize_model_parallel(ws)
        assert tp.get_tensor_model_parallel_world_size() == ws
 def test_initialize_mismatched_ws_raises(topology):
    """D3: calling with tp_size != world_size raises NotImplementedError."""
    import kernbench.tp as tp
    with _make_ctx(topology) as ctx:
        ctx.distributed.init_process_group(backend="ahbm")
        ws = ctx.distributed.get_world_size()
        with pytest.raises(NotImplementedError):
            tp.initialize_model_parallel(ws + 1)
 def test_get_tp_rank_is_greenlet_local(topology):
    """D3: get_tensor_model_parallel_rank returns greenlet-local rank
    (delegates to torch.distributed.get_rank, ADR-0024 D9)."""
    import kernbench.tp as tp
    with _make_ctx(topology) as ctx:
        ctx.distributed.init_process_group(backend="ahbm")
        ws = ctx.distributed.get_world_size()
        tp.initialize_model_parallel(ws)
        observed: list[int] = []
        def _worker(rank: int):
            observed.append(tp.get_tensor_model_parallel_rank())
        ctx.multiprocessing.spawn(_worker, args=(), nprocs=ws)
        assert sorted(observed) == list(range(ws))
 def test_get_world_size_before_init_raises():
    """D3: uninitialised TP group → accessing world_size fails informatively."""
    from kernbench.tp import parallel_state
    # Reset internal state if previous tests (or parallel workers) left it set.
    parallel_state._reset_for_tests()
    with pytest.raises((RuntimeError, AssertionError, TypeError)):
        _ = parallel_state.get_tensor_model_parallel_world_size() + 0
@@ -0,0 +1,301 @@
 """ADR-0027 T3: Worker-wait generalization + orphan invariant.
 Direct regression guard for ADR-0024 Phase B's kernel-greenlet orphan bug.
 Phase 1 of ADR-0027: these tests fail against the current code (no
 ``_pending_worker_waits`` field, no worker-fork in ``ctx.wait``, no
 scheduler drain). Phase 2 implements D0.1/D0.2/D0.4 and these pass.
 """
 from __future__ import annotations
 import os
 import textwrap
 import pytest
 from greenlet import greenlet
 # ── helpers ──────────────────────────────────────────────────────────
 def _write_minimal_ccl_yaml(tmp_path) -> str:
    body = textwrap.dedent("""\
        defaults:
          algorithm: ring_allreduce_tcm
          buffer_kind: tcm
          backpressure: sleep
          n_slots: 4
          slot_size: 4096
          vc_chunk_size: 256
          ipcq_credit_size_bytes: 16
        algorithms:
          ring_allreduce_tcm:
            module: kernbench.ccl.algorithms.ring_allreduce
            topology: ring_1d
            buffer_kind: tcm
            n_elem: 8
    """)
    yaml_path = tmp_path / "ccl.yaml"
    yaml_path.write_text(body)
    return str(tmp_path)
 def _make_ctx(topology):
    from kernbench.runtime_api.context import RuntimeContext
    from kernbench.runtime_api.types import DeviceSelector
    from kernbench.sim_engine.engine import GraphEngine
    engine = GraphEngine(topology.topology_obj, enable_data=True)
    return RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id="test_t3",
        spec=topology.topology_obj.spec,
    )
 # ── D0.1: _pending_worker_waits field exists ─────────────────────────
 def test_pending_worker_waits_field_present(topology):
    """RuntimeContext must expose the deferred-wait queue (D0.1)."""
    with _make_ctx(topology) as ctx:
        assert hasattr(ctx, "_pending_worker_waits"), (
            "ADR-0027 D0.1: RuntimeContext must declare _pending_worker_waits"
        )
        assert ctx._pending_worker_waits == [], (
            "_pending_worker_waits should start empty"
        )
 # ── T3.a / T3.b: wait defers + resume-after-drain contract ───────────
 def test_wait_in_worker_defers_to_main_and_resumes_completed(topology):
    """T3.a + T3.b: worker ctx.wait enqueues + yields; resume → _completed.
    Direct test of D0.2 (worker-fork) + D0.3 resume invariant (handle must
    be in ctx._completed when worker resumes).
    """
    with _make_ctx(topology) as ctx:
        from kernbench.policy.placement.dp import DPPolicy
        # Worker that submits one tensor (which internally calls ctx.wait)
        # and records the pending-queue state observed before/after.
        observations: dict = {"pre_wait_len": None, "post_resume_completed": None}
        main = greenlet.getcurrent()
        def _worker():
            # Observation hook: patch ctx.wait to capture a single deferral.
            original_wait = ctx.wait
            def wrapping_wait(h, *, _meta=None):
                observations["pre_wait_len"] = len(ctx._pending_worker_waits)
                result = original_wait(h, _meta=_meta)
                observations["post_resume_completed"] = h in ctx._completed
                return result
            ctx.wait = wrapping_wait  # type: ignore[assignment]
            try:
                ctx.zeros(
                    (1, 8), dtype="f16",
                    dp=DPPolicy(cube="replicate", pe="replicate",
                                num_cubes=1, num_pes=1),
                    name="t3_defer",
                )
            finally:
                ctx.wait = original_wait  # type: ignore[assignment]
        g = greenlet(_worker)
        # Scheduler loop: run worker until it yields (or finishes), then drain.
        while not g.dead:
            g.switch()
            if not g.dead:
                # Worker yielded mid-wait → simulate D0.4 drain.
                from kernbench.runtime_api.multiprocessing import _drain_pending
                _drain_pending(ctx)
        assert observations["pre_wait_len"] is not None, "wait was not invoked"
        assert observations["post_resume_completed"] is True, (
            "D0.3 resume invariant: handle must be in ctx._completed on resume"
        )
 # ── T3.c: multi-worker same-round drain ──────────────────────────────
 def test_multiple_workers_resume_at_same_drain(topology):
    """T3.c: every worker yields before any drain; all resume together."""
    with _make_ctx(topology) as ctx:
        from kernbench.policy.placement.dp import DPPolicy
        dp = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=1)
        observations: list[int] = []
        def _make_worker(rank: int):
            def _entry():
                # Before its wait, observe queue state so we can assert that
                # *every* worker has enqueued before any drain happened.
                ctx.zeros((1, 4), dtype="f16", dp=dp, name=f"r{rank}")
                observations.append(rank)
            return _entry
        ws = 2
        gs = [greenlet(_make_worker(r)) for r in range(ws)]
        # Round 1: every worker runs up to its first (deferred) ctx.wait.
        for g in gs:
            g.switch()
        # After round 1, all workers should be paused (not yet dead) and
        # each should have enqueued at least one handle.
        assert all(not g.dead for g in gs), (
            "after round 1 switch, workers must be paused mid-wait, not dead"
        )
        assert len(ctx._pending_worker_waits) >= ws, (
            f"expected >= {ws} pending worker waits after round 1; "
            f"got {len(ctx._pending_worker_waits)}"
        )
        # Loop: drain + switch rounds until all workers complete. A single
        # ctx.zeros() call contains multiple yield points (MmuMap, then
        # MemoryWrite), so more than one round is needed.
        from kernbench.runtime_api.multiprocessing import _drain_pending
        rounds = 0
        while any(not g.dead for g in gs):
            _drain_pending(ctx)
            for g in gs:
                if not g.dead:
                    g.switch()
            rounds += 1
            assert rounds < 20, "scheduler did not converge within 20 rounds"
        assert all(g.dead for g in gs), "all workers should be dead after drain loop"
        assert sorted(observations) == list(range(ws))
 # ── T3.d (핵심): kernel greenlet _parent is main ─────────────────────
 def test_kernel_greenlet_parent_is_main(topology, tmp_path, monkeypatch):
    """T3.d orphan invariant: kernel_runner._parent must be main greenlet.
    This is the direct regression guard for ADR-0024 Phase B. Runs a worker
    that invokes torch.launch (which eventually spawns a kernel greenlet).
    The kernel_runner.run() captures greenlet.getcurrent() as _parent at
    spawn time — that value MUST be the main greenlet, else the orphan
    bug is back.
    """
    monkeypatch.chdir(_write_minimal_ccl_yaml(tmp_path))
    from kernbench.triton_emu import kernel_runner as kr_mod
    captured_parents: list = []
    main = greenlet.getcurrent()
    original_run = kr_mod.KernelRunner.run
    def _spy_run(self, env, kernel_fn, kernel_args, num_programs):
        gen = original_run(self, env, kernel_fn, kernel_args, num_programs)
        def _wrapping_gen():
            # yield from gen, but capture self._parent on first step
            try:
                value = next(gen)
                # First yield happens after _parent is set.
                captured_parents.append(self._parent)
                yield value
            except StopIteration:
                return
            yield from gen
        return _wrapping_gen()
    monkeypatch.setattr(kr_mod.KernelRunner, "run", _spy_run)
    # Drive a minimal ring_allreduce that launches a kernel inside a worker.
    import benches.ccl_allreduce as bench
    with _make_ctx(topology) as ctx:
        bench.run(ctx)
    assert captured_parents, "no kernel_runner.run invocations observed"
    for p in captured_parents:
        assert p is main, (
            f"ADR-0027 D0.7 / T3.d: kernel greenlet _parent must be main "
            f"greenlet; got {p!r} (main={main!r})"
        )
 # ── T3.f: idempotency ────────────────────────────────────────────────
 def test_wait_same_handle_twice_drives_engine_once(topology):
    """T3.f: ctx.wait(h) + ctx.wait(h) → engine.wait called once (D0.4-(3))."""
    with _make_ctx(topology) as ctx:
        from kernbench.policy.placement.dp import DPPolicy
        dp = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=1)
        call_count = {"n": 0}
        original_engine_wait = ctx.engine.wait
        def _counting_wait(h):
            call_count["n"] += 1
            return original_engine_wait(h)
        ctx.engine.wait = _counting_wait  # type: ignore[assignment]
        def _worker():
            ctx.zeros((1, 4), dtype="f16", dp=dp, name="t3f")
            # Manually pick a completed handle and wait twice.
            assert ctx._completed, "there should be at least one completed handle"
            h = next(iter(ctx._completed))
            before = call_count["n"]
            ctx.wait(h)
            ctx.wait(h)
            assert call_count["n"] == before, (
                "already-completed handle must not re-drive engine.wait"
            )
        g = greenlet(_worker)
        while not g.dead:
            g.switch()
            if not g.dead:
                from kernbench.runtime_api.multiprocessing import _drain_pending
                _drain_pending(ctx)
 # ── T3.g: exception propagation + no further drain ───────────────────
 def test_worker_exception_propagates_and_clears_pending(topology):
    """T3.g: worker raise → main propagates; _pending_worker_waits cleared."""
    with _make_ctx(topology) as ctx:
        from kernbench.runtime_api.multiprocessing import SpawnException
        def _bad_worker(rank: int):
            raise ValueError(f"rank {rank} intentional failure")
        with pytest.raises(SpawnException) as exc_info:
            ctx.multiprocessing.spawn(_bad_worker, args=(), nprocs=2)
        assert ctx._pending_worker_waits == [], (
            "D0.4-(4): _pending_worker_waits must be cleared on failure"
        )
        # Root-cause rank errors are present; sibling SystemExit not in dict.
        assert 0 in exc_info.value.errors or 1 in exc_info.value.errors
 # ── T3.e: historical failure (pre-D0) — skipped per ADR ──────────────
@pytest.mark.skip(
    reason="ADR-0027 T3.e: historical failure mode — reproduces only "
           "pre-D0.2. Kept as documentation; not run in Phase 2."
 )
 def test_pre_d0_orphan_reproduction():
    """Placeholder: exercises the pre-D0.2 code path that causes GreenletExit
    from kernel_runner._parent captured in worker context. See ADR-0024
    Phase B postmortem."""
    pass