kernbench2/docs/adr/ADR-0024-par-sip-tp-launcher.md

# ADR-0024: SIP-level Launcher — rank = SIP

## Status

Accepted

## Context

### Goal

Align the participation unit (rank) of `torch.distributed` collective calls
to the **SIP** (device) boundary. The aim is bench code that, at the host
level, reads **indistinguishably** from real PyTorch DDP/TP scripts.

Comparison with real PyTorch:

| Dimension | real PyTorch | KernBench |
| --- | --- | --- |
| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
| `get_rank()` | `RANK` env var | greenlet-local registry |
| `get_world_size()` | `WORLD_SIZE` env var | SIP count from topology |
| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
| `mp.spawn` | OS process fork | greenlet fan-out |

### Problems to solve

1. **Public API where rank = SIP** — so bench workers do not have to know
   about the PE concept.
2. **Greenlet-local rank/device tracking** — within the 1-process model,
   each worker greenlet must correctly identify its own rank / its own SIP.
3. **Tensor placement = structural (sip, cube, pe)** — if rank is SIP,
   the default tensor placement should also be expressed in structural
   coordinates.

### Out of scope (handled in other ADRs)

- IPCQ direction addressing → ADR-0025
- Removing `DPPolicy.sip`/`num_sips` → ADR-0026
- Megatron-style TP → ADR-0027
- DTensor → ADR-0028 (future)
- Worker scheduling / `mp.spawn` / collective drain / exception cleanup
  → ADR-0027 D0/D1
- Collective algorithm implementation (intercube_allreduce, SFR config)
  → ADR-0032

## Decision

### D1. rank = SIP (world_size resolution)

```python
def _resolve_world_size(self) -> int:
    if "world_size" in self._merged:
        return int(self._merged["world_size"])
    defaults = self._cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    spec = self.ctx.spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
```

Precedence: algorithm override > `defaults` override > SIP count from the
topology spec (1 when the spec has no `system.sips.count`). The `ccl.yaml`
override is retained as the legacy "rank = PE" test path.
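
The precedence can be exercised in isolation. The sketch below restates `_resolve_world_size` as a free function; the three dict arguments stand in for `self._merged`, `self._cfg_all`, and `self.ctx.spec`, and the function name and sample values are illustrative assumptions rather than KernBench API.

```python
from typing import Optional


def resolve_world_size(merged: dict, cfg_all: dict, spec: Optional[dict]) -> int:
    """Free-function restatement of the D1 precedence (illustrative only)."""
    # 1. An algorithm-level override wins.
    if "world_size" in merged:
        return int(merged["world_size"])
    # 2. Then the `defaults` section of the config.
    defaults = cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    # 3. Otherwise rank = SIP: use the topology's SIP count (1 if absent).
    sips = (spec or {}).get("system", {}).get("sips", {})
    return int(sips.get("count", 1))


spec = {"system": {"sips": {"count": 4}}}
cfg = {"defaults": {"world_size": 8}}

print(resolve_world_size({}, {}, spec))                   # SIP count
print(resolve_world_size({}, cfg, spec))                  # defaults override
print(resolve_world_size({"world_size": 16}, cfg, spec))  # algorithm override
print(resolve_world_size({}, {}, None))                   # no spec at all
```

With these inputs the four calls resolve to 4, 8, 16, and 1 respectively.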

### D2. Greenlet-local rank registry (+ debug warning)

```python
import os
import warnings


class DistributedContext:
    def __init__(self):
        self._backend = None
        self._rank_by_greenlet: dict = {}

    def _bind_rank(self, g, rank: int) -> None:
        self._rank_by_greenlet[g] = int(rank)

    def get_rank(self) -> int:
        self._ensure_initialized()
        from greenlet import getcurrent
        g = getcurrent()
        if g not in self._rank_by_greenlet:
            if os.environ.get("KERNBENCH_DEBUG"):
                warnings.warn(
                    "get_rank() called outside a bound greenlet — returning 0. "
                    "Likely a bug unless running single-driver."
                )
            return 0
        return int(self._rank_by_greenlet[g])
```
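
To illustrate the lookup pattern, here is a minimal sketch of a per-worker rank registry. It is not the KernBench implementation: threads stand in for greenlets so that it runs with only the standard library, and `RankRegistry`, `bind_rank`, and `worker` are hypothetical names. The shape is the same as D2: bind a rank to a worker handle before it runs, then let the worker discover its own rank without being passed it.

```python
import threading


class RankRegistry:
    """Per-worker rank lookup; threads stand in for greenlets in this sketch."""

    def __init__(self):
        self._rank_by_worker = {}

    def bind_rank(self, worker, rank: int) -> None:
        self._rank_by_worker[worker] = int(rank)

    def get_rank(self) -> int:
        # Unbound callers (e.g. a single driver) fall back to rank 0.
        return self._rank_by_worker.get(threading.current_thread(), 0)


registry = RankRegistry()
seen = {}


def worker(name: str) -> None:
    # Each worker asks "who am I?" instead of receiving its rank as an argument.
    seen[name] = registry.get_rank()


threads = []
for rank in range(4):
    t = threading.Thread(target=worker, args=(f"w{rank}",))
    registry.bind_rank(t, rank)  # bind before the worker starts running
    threads.append(t)
for t in threads:
    t.start()
for t in threads:
    t.join()

print(seen)                 # every worker observed its own rank
print(registry.get_rank())  # the main thread is unbound, so this is 0
```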

### D3. `torch.ahbm.set_device(rank)` — SIP binding

The KernBench backend name is `ahbm` (ADR-0023). Real PyTorch uses
`torch.cuda.set_device(r)`. Because KernBench is not a CUDA runtime, the
equivalent call lives under a namespace named after the actual backend.

```python
class _AhbmNamespace:
    """torch.ahbm — per-greenlet SIP device binding.

    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
    """

    def __init__(self):
        self._device_by_greenlet: dict = {}

    def set_device(self, device: int) -> None:
        from greenlet import getcurrent
        self._device_by_greenlet[getcurrent()] = int(device)

    def current_device(self) -> int | None:
        from greenlet import getcurrent
        return self._device_by_greenlet.get(getcurrent())

# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
```

**PyTorch 2.x-style device-agnostic alias**: Recent PyTorch is moving toward
a device-agnostic `torch.accelerator` namespace
(`torch.accelerator.set_device_index(r)`,
`torch.accelerator.current_device_index()`). For users who want bench code
that is not tied to a specific device vendor, KernBench exposes this surface
alongside `torch.ahbm`.

```python
class _AcceleratorNamespace:
    """torch.accelerator — device-agnostic API (PyTorch 2.x style).

    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
        torch.accelerator.set_device_index(rank)
        torch.accelerator.current_device_index()
    """

    def __init__(self, ahbm: _AhbmNamespace):
        self._ahbm = ahbm

    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)

    def current_device_index(self) -> int | None:
        return self._ahbm.current_device()

# In RuntimeContext.__init__:
self.ahbm = _AhbmNamespace()
self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias
```

Bench authors may choose either — both share the same registry internally:

```python
torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
```
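
The shared-registry claim can be checked with a small stand-alone sketch. As an assumption made for portability, threads stand in for greenlets, and the class names drop the leading underscore to mark them as illustrative copies rather than the KernBench classes. Even-ranked workers write through the `ahbm`-style API and read through the `accelerator`-style API; odd-ranked workers do the reverse.

```python
import threading
from typing import Optional


class AhbmNamespace:
    """Per-worker device registry; threads stand in for greenlets here."""

    def __init__(self):
        self._device_by_worker = {}

    def set_device(self, device: int) -> None:
        self._device_by_worker[threading.current_thread()] = int(device)

    def current_device(self) -> Optional[int]:
        return self._device_by_worker.get(threading.current_thread())


class AcceleratorNamespace:
    """Thin alias: forwards every call to the shared AhbmNamespace."""

    def __init__(self, ahbm: AhbmNamespace):
        self._ahbm = ahbm

    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)

    def current_device_index(self) -> Optional[int]:
        return self._ahbm.current_device()


ahbm = AhbmNamespace()
accelerator = AcceleratorNamespace(ahbm)
readback = {}


def worker(rank: int) -> None:
    if rank % 2 == 0:
        ahbm.set_device(rank)               # write via the ahbm-style API
        readback[rank] = accelerator.current_device_index()
    else:
        accelerator.set_device_index(rank)  # write via the accelerator-style API
        readback[rank] = ahbm.current_device()


threads = [threading.Thread(target=worker, args=(r,)) for r in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(readback)               # each worker reads back its own rank
print(ahbm.current_device())  # None: the main thread never set a device
```

The `None` result for an unbound caller is what D4 converts into the single-driver fallback of SIP 0.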

### D4. Tensor placement = structural (sip, cube, pe) coordinates

`resolve_dp_policy` takes `target_sip` directly and produces placement in
structural coordinates. Details in ADR-0026.

```python
# RuntimeContext._create_tensor
current_sip = self.ahbm.current_device()          # (D3 naming)
if current_sip is None:
    current_sip = 0  # single-driver fallback (consistent with D2)
placement = resolve_dp_policy(
    dp, shape=shape_2d, itemsize=itemsize,
    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
    target_sip=current_sip,
)
```

No post-hoc `pe_index` shifting — ShardSpec carries the `(sip, cube, pe)`
structural coordinates directly. ShardSpec details in ADR-0026.

### D5. SIP grid dimensions — explicit `sips.w/h` resolution

For 2D inter-SIP topologies (`torus_2d`, `mesh_2d_no_wrap`) the SIP grid
shape (width × height) is resolved from `system.sips.w` / `system.sips.h`,
mirroring how D1 resolves `world_size` from `sips.count`. Precedence:
explicit `w/h` (both set, validated `w*h == count`) > square fallback
(`side = round(sqrt(count))`, accepted only when `side*side == count`, used
when `w` and `h` are not both given) > error. `ring_1d` has no grid and uses
the `(0, 0)` sentinel.

```python
import math

sips = spec.get("system", {}).get("sips", {})
if sip_topo == "ring_1d":
    w, h = 0, 0                          # 1D sentinel (no grid)
elif sips.get("w") is not None and sips.get("h") is not None:
    w, h = int(sips["w"]), int(sips["h"])
    if w * h != n_sips:
        raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
else:
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError("non-square sips.count requires explicit sips.w/h")
    w, h = side, side
```

This lifts the earlier assumption that 2D SIP grids must be perfect
squares: a 6-SIP `torus_2d` / `mesh_2d_no_wrap` is now expressible as
`w: 3, h: 2` (or `2x3`). The derived `(w, h)` feeds the algorithm's
inter-SIP exchange (consumed in ADR-0032 D5). The prior code path silently
used `side = round(sqrt(count))` for both dimensions of any non-ring
topology, which produced a wrong grid for non-square counts (e.g. 2×2 for
6 SIPs). The explicit-`w/h` path with a fail-loud fallback replaces that.
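
The three outcomes (explicit grid, square fallback, fail-loud error) can be seen by wrapping the D5 logic in a function. This is a sketch under stated assumptions: `resolve_sip_grid` is a hypothetical name, and `n_sips` is assumed to come from `sips.count`, which the fragment above takes from its surrounding scope.

```python
import math
from typing import Tuple


def resolve_sip_grid(sip_topo: str, sips: dict) -> Tuple[int, int]:
    """Return (w, h) for the SIP grid, following the D5 precedence."""
    n_sips = int(sips.get("count", 1))  # assumption: count lives in sips.count
    if sip_topo == "ring_1d":
        return 0, 0  # 1D sentinel (no grid)
    if sips.get("w") is not None and sips.get("h") is not None:
        w, h = int(sips["w"]), int(sips["h"])
        if w * h != n_sips:
            raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
        return w, h
    # Square fallback: only valid when count is a perfect square.
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError("non-square sips.count requires explicit sips.w/h")
    return side, side


print(resolve_sip_grid("torus_2d", {"count": 6, "w": 3, "h": 2}))  # explicit 3x2
print(resolve_sip_grid("mesh_2d_no_wrap", {"count": 16}))          # square fallback 4x4
print(resolve_sip_grid("ring_1d", {"count": 6}))                   # sentinel (0, 0)

for bad in ({"count": 6}, {"count": 6, "w": 4, "h": 2}):
    try:
        resolve_sip_grid("torus_2d", bad)
    except ValueError as exc:
        print("rejected:", exc)
```

Both entries in the final loop are rejected: the first because 6 is not a perfect square and no `w/h` is given, the second because `4*2 != 6`.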

---

## Dependencies

- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature
  used by D4 and the structural-coordinate representation of ShardSpec.
- **ADR-0027** (Megatron TP + scheduler): the implementation baseline for
  worker scheduling, `mp.spawn`, collective drain, and exception cleanup.

---

## Non-goals

- **Modifying the IPCQ protocol**: ADR-0023 remains as-is.
- **Cleaning up DPPolicy fields**: ADR-0026.
- **Megatron-style TP**: ADR-0027.
- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
- **Collective algorithm implementation**: ADR-0032.
- **Multi-node (cross-process)**: single process only.

---

## Consequences

### Positive

- **Bench code matches real PyTorch DDP** at the public-API level.
- **Greenlet-local rank**: each worker greenlet resolves its own rank and
  SIP, so per-rank behavior stays correct within the 1-process model.
- **Structural placement coordinates**: lets the other ADRs (ADR-0026 /
  ADR-0027 / ADR-0032) operate consistently on top of the `(sip, cube, pe)`
  3-tuple.

### Neutral

- IPCQ PE-level protocol (ADR-0023) is unchanged.
- IO_CPU role is unchanged (existing transit behavior preserved).