# ADR-0024: SIP-level Launcher (rank = SIP)

## Status

Accepted

## Context

### Goal

Align the participation unit (rank) of `torch.distributed` collective calls to the **SIP** (device) boundary. The aim is bench code that, at the host level, reads **indistinguishably** from real PyTorch DDP/TP scripts.

Comparison with real PyTorch:

| Dimension | real PyTorch | KernBench |
| --- | --- | --- |
| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
| `get_rank()` | `RANK` env var | greenlet-local registry |
| `get_world_size()` | `WORLD_SIZE` env var | SIP count from topology |
| Device binding | `torch.cuda.set_device(r)`: rank → GPU | `torch.ahbm.set_device(r)`: rank → SIP |
| `mp.spawn` | OS process fork | greenlet fan-out |

### Problems to solve

1. **Public API where rank = SIP**, so bench workers do not have to know about the PE concept.
2. **Greenlet-local rank/device tracking**: within the 1-process model, each worker greenlet must correctly identify its own rank and its own SIP.
3. **Tensor placement = structural (sip, cube, pe)**: if rank is SIP, the default tensor placement should also be expressed in structural coordinates.

### Non-problems (outside this ADR)

- IPCQ direction addressing → ADR-0025
- Removing `DPPolicy.sip`/`num_sips` → ADR-0026
- Megatron-style TP → ADR-0027
- DTensor → ADR-0028 (future)
- Worker scheduling / `mp.spawn` / collective drain / exception cleanup → ADR-0027 D0/D1
- Collective algorithm implementation (intercube_allreduce, SFR config) → ADR-0032

## Decision

### D1. rank = SIP (world_size resolution)

```python
def _resolve_world_size(self) -> int:
    if "world_size" in self._merged:
        return int(self._merged["world_size"])
    defaults = self._cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    spec = self.ctx.spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
```

Priority order: algorithm override > defaults override > SIP count.
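
To make the priority order concrete, here is a minimal sketch that restates the lookup as a free function over plain dictionaries. The name `resolve_world_size`, its three parameters, and the sample values are assumptions for illustration only; the actual method reads `self._merged`, `self._cfg_all`, and `self.ctx.spec`.

```python
from typing import Any, Mapping, Optional


def resolve_world_size(
    merged: Mapping[str, Any],
    cfg_all: Mapping[str, Any],
    spec: Optional[Mapping[str, Any]],
) -> int:
    """Hypothetical standalone mirror of the D1 lookup order."""
    # 1. A per-algorithm override wins.
    if "world_size" in merged:
        return int(merged["world_size"])
    # 2. Next comes the `defaults` block of the config.
    defaults = cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    # 3. Otherwise use the SIP count from the topology spec (1 if absent).
    spec = spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))


# Sample topology with 4 SIPs (illustrative values).
topology = {"system": {"sips": {"count": 4}}}

from_algorithm = resolve_world_size({"world_size": 16}, {"defaults": {"world_size": 8}}, topology)
from_defaults = resolve_world_size({}, {"defaults": {"world_size": 8}}, topology)
from_topology = resolve_world_size({}, {}, topology)
from_fallback = resolve_world_size({}, {}, None)

print(from_algorithm, from_defaults, from_topology, from_fallback)  # 16 8 4 1
```

With no override at either level, the world size equals the SIP count, which is the "rank = SIP" default this ADR establishes.
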
The `ccl.yaml` override is retained as the legacy "rank = PE" test path.

### D2. Greenlet-local rank registry (+ debug warning)

```python
import os
import warnings


class DistributedContext:
    def __init__(self):
        self._backend = None
        self._rank_by_greenlet: dict = {}

    def _bind_rank(self, g, rank: int) -> None:
        self._rank_by_greenlet[g] = int(rank)

    def get_rank(self) -> int:
        self._ensure_initialized()  # defined elsewhere in the class (not shown)
        from greenlet import getcurrent
        g = getcurrent()
        if g not in self._rank_by_greenlet:
            # Unbound greenlet: fall back to rank 0, warning only in debug mode.
            if os.environ.get("KERNBENCH_DEBUG"):
                warnings.warn(
                    "get_rank() called outside a bound greenlet; returning 0. "
                    "Likely a bug unless running single-driver."
                )
            return 0
        return int(self._rank_by_greenlet[g])
```

### D3. `torch.ahbm.set_device(rank)`: SIP binding

The KernBench backend name is `ahbm` (ADR-0023). Real PyTorch uses `torch.cuda.set_device(r)`, but because KernBench is not a CUDA runtime, it exposes the same call under a namespace named after its actual backend.

```python
class _AhbmNamespace:
    """torch.ahbm: per-greenlet SIP device binding.

    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent API
    under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
    """

    def __init__(self):
        self._device_by_greenlet: dict = {}

    def set_device(self, device: int) -> None:
        from greenlet import getcurrent
        self._device_by_greenlet[getcurrent()] = int(device)

    def current_device(self) -> int | None:
        from greenlet import getcurrent
        return self._device_by_greenlet.get(getcurrent())

# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
```

**PyTorch 2.x style parallel support**: Recent PyTorch is moving toward a device-agnostic `torch.accelerator` namespace (`torch.accelerator.set_device_index(r)`, `torch.accelerator.current_device_index()`). To support users who want to write code that is not tied to a specific device vendor, KernBench also exposes this surface in parallel.

```python
class _AcceleratorNamespace:
    """torch.accelerator: device-agnostic API (PyTorch 2.x style).

    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
        torch.accelerator.set_device_index(rank)
        torch.accelerator.current_device_index()
    """

    def __init__(self, ahbm: _AhbmNamespace):
        self._ahbm = ahbm

    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)

    def current_device_index(self) -> int | None:
        return self._ahbm.current_device()

# RuntimeContext
self.ahbm = _AhbmNamespace()
self.accelerator = _AcceleratorNamespace(self.ahbm)  # alias
```

Bench authors may choose either; both share the same registry internally:

```python
torch.ahbm.set_device(rank)                # KernBench-native, explicit backend
torch.accelerator.set_device_index(rank)   # PyTorch 2.x device-agnostic
```

### D4. Tensor placement = structural (sip, cube, pe) coordinates

`resolve_dp_policy` takes `target_sip` directly and produces placement in structural coordinates. Details in ADR-0026.

```python
# RuntimeContext._create_tensor
current_sip = self.ahbm.current_device()  # (D3 naming)
if current_sip is None:
    current_sip = 0  # single-driver fallback (consistent with D2)
placement = resolve_dp_policy(
    dp,
    shape=shape_2d,
    itemsize=itemsize,
    num_pe=eff_num_pe,
    num_cubes=eff_num_cubes,
    target_sip=current_sip,
)
```

No post-hoc `pe_index` shifting is needed: ShardSpec carries the `(sip, cube, pe)` structural coordinates directly. ShardSpec details in ADR-0026.

---

## Dependencies

- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature used by D4 and the structural-coordinate representation of ShardSpec.
- **ADR-0027** (Megatron TP + scheduler): the implementation baseline for worker scheduling, `mp.spawn`, collective drain, and exception cleanup.

---

## Non-goals

- **Modifying the IPCQ protocol**: ADR-0023 remains as-is.
- **Cleaning up DPPolicy fields**: ADR-0026.
- **Megatron-style TP**: ADR-0027.
- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
- **Collective algorithm implementation**: ADR-0032.
- **Multi-node (cross-process)**: single process only.

---

## Consequences

### Positive

- **Bench = real PyTorch DDP** (from the public-API point of view).
- **Greenlet-local rank**: enables cross-rank correctness within the 1-process model.
- **Structural placement coordinates**: lets the other ADRs (ADR-0026 / ADR-0027 / ADR-0032) operate consistently on top of the `(sip, cube, pe)` 3-tuple.

### Neutral

- IPCQ PE-level protocol (ADR-0023) is unchanged.
- IO_CPU role is unchanged (existing transit behavior preserved).