kernbench2/docs/adr/ADR-0024-par-sip-tp-launcher.md
ywkang a796c1d2f7 ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/
Establish English as the canonical ADR language with Korean translations
held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror).
Promotion from adr-proposed/ to adr/ now writes English to adr/ and the
Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.

- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English,
  2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix
  dropped). ADR-0023 EN regenerated against KO source which had newer
  HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark
  docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline
  section covering bidirectional sync, conflict resolution (EN wins),
  and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair
  completeness, filename mirroring, ADR-ID match, Status byte-equality.
  Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF
  normalization, em-dash title separator, underscore-slug edge case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:38:44 -07:00


ADR-0024: SIP-level Launcher — rank = SIP

Status

Accepted

Context

Goal

Align the participation unit (rank) of torch.distributed collective calls to the SIP (device) boundary. The aim is bench code that, at the host level, reads indistinguishably from real PyTorch DDP/TP scripts.

Comparison with real PyTorch:

| Dimension | Real PyTorch | KernBench |
|---|---|---|
| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
| get_rank() | RANK env var | greenlet-local registry |
| get_world_size() | WORLD_SIZE env var | SIP count from topology |
| set_device(r) | torch.cuda.set_device(r): rank → GPU | torch.ahbm.set_device(r): rank → SIP |
| mp.spawn | OS process fork | greenlet fan-out |

Problems to solve

  1. Public API where rank = SIP — so bench workers do not have to know about the PE concept.
  2. Greenlet-local rank/device tracking — within the 1-process model, each worker greenlet must correctly identify its own rank / its own SIP.
  3. Tensor placement = structural (sip, cube, pe) — if rank is SIP, the default tensor placement should also be expressed in structural coordinates.

Non-problems (outside this ADR)

  • IPCQ direction addressing → ADR-0025
  • Removing DPPolicy.sip/num_sips → ADR-0026
  • Megatron-style TP → ADR-0027
  • DTensor → ADR-0028 (future)
  • Worker scheduling / mp.spawn / collective drain / exception cleanup → ADR-0027 D0/D1
  • Collective algorithm implementation (intercube_allreduce, SFR config) → ADR-0032

Decision

D1. rank = SIP (world_size resolution)

def _resolve_world_size(self) -> int:
    if "world_size" in self._merged:
        return int(self._merged["world_size"])
    defaults = self._cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    spec = self.ctx.spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))

Priority order: algorithm override > defaults override > SIP count. The ccl.yaml override is retained as the legacy "rank = PE" test path.
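
The priority order can be checked with a standalone version of the resolver. This is a minimal sketch: the free function `resolve_world_size` and its three plain-dict arguments are stand-ins for `self._merged`, `self._cfg_all`, and `self.ctx.spec`, and are assumptions made only to keep the example self-contained.

```python
from typing import Any, Mapping, Optional


def resolve_world_size(
    merged: Mapping[str, Any],
    cfg_all: Mapping[str, Any],
    spec: Optional[Mapping[str, Any]],
) -> int:
    """Standalone mirror of _resolve_world_size (hypothetical free function)."""
    # 1. Per-algorithm override wins.
    if "world_size" in merged:
        return int(merged["world_size"])
    # 2. Then the `defaults` section of the config.
    defaults = cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    # 3. Otherwise rank = SIP: use the SIP count from the topology spec.
    return int((spec or {}).get("system", {}).get("sips", {}).get("count", 1))


spec = {"system": {"sips": {"count": 4}}}
cfg_with_default = {"defaults": {"world_size": 8}}

from_topology = resolve_world_size({}, {}, spec)                                # 4
from_defaults = resolve_world_size({}, cfg_with_default, spec)                  # 8
from_algorithm = resolve_world_size({"world_size": 2}, cfg_with_default, spec)  # 2
without_spec = resolve_world_size({}, {}, None)                                 # 1
```

With no override present, the world size equals the SIP count, which is the "rank = SIP" default this ADR establishes. A missing spec degrades to a single rank.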

D2. Greenlet-local rank registry (+ debug warning)

import os
import warnings

from greenlet import getcurrent


class DistributedContext:
    def __init__(self):
        self._backend = None
        # Maps each worker greenlet to the rank bound by the launcher.
        self._rank_by_greenlet: dict = {}

    def _bind_rank(self, g, rank: int) -> None:
        self._rank_by_greenlet[g] = int(rank)

    def get_rank(self) -> int:
        self._ensure_initialized()
        g = getcurrent()
        if g not in self._rank_by_greenlet:
            # Unbound caller (e.g. the single driver): fall back to rank 0,
            # warning only when KERNBENCH_DEBUG is set.
            if os.environ.get("KERNBENCH_DEBUG"):
                warnings.warn(
                    "get_rank() called outside a bound greenlet; returning 0. "
                    "Likely a bug unless running single-driver.",
                    stacklevel=2,
                )
            return 0
        return int(self._rank_by_greenlet[g])
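
The lookup-with-fallback behavior can be exercised without the greenlet package. The sketch below is an illustration under assumptions, not the KernBench implementation: `RankRegistry` is a hypothetical class, and the injectable `current` callable stands in for `greenlet.getcurrent`, with plain strings playing the role of greenlet objects.

```python
import os
import warnings
from typing import Callable, Dict, Hashable


class RankRegistry:
    """Hypothetical stand-in for the rank half of DistributedContext."""

    def __init__(self, current: Callable[[], Hashable]):
        # In KernBench this callable would be greenlet.getcurrent (assumption).
        self._current = current
        self._rank_by_task: Dict[Hashable, int] = {}

    def bind_rank(self, task: Hashable, rank: int) -> None:
        self._rank_by_task[task] = int(rank)

    def get_rank(self) -> int:
        task = self._current()
        if task not in self._rank_by_task:
            if os.environ.get("KERNBENCH_DEBUG"):
                warnings.warn("get_rank() called outside a bound task; returning 0")
            return 0
        return self._rank_by_task[task]


# Simulate the launcher switching between workers and the driver.
active = {"task": "driver"}
registry = RankRegistry(current=lambda: active["task"])
registry.bind_rank("worker-0", 0)
registry.bind_rank("worker-1", 1)

active["task"] = "worker-1"
rank_in_worker = registry.get_rank()      # bound task: its own rank

active["task"] = "driver"
os.environ["KERNBENCH_DEBUG"] = "1"
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    rank_in_driver = registry.get_rank()  # unbound task: falls back to 0 and warns
os.environ.pop("KERNBENCH_DEBUG")
```

Returning 0 for an unbound caller keeps single-driver scripts working, while the opt-in warning surfaces the case where a worker was never bound.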

D3. torch.ahbm.set_device(rank) — SIP binding

The KernBench backend name is ahbm (ADR-0023). Real PyTorch uses torch.cuda.set_device(r); because KernBench is not a CUDA runtime, it exposes the equivalent API under a namespace named after its actual backend, torch.ahbm.

class _AhbmNamespace:
    """torch.ahbm — per-greenlet SIP device binding.

    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
    """

    def __init__(self):
        self._device_by_greenlet: dict = {}

    def set_device(self, device: int) -> None:
        from greenlet import getcurrent
        self._device_by_greenlet[getcurrent()] = int(device)

    def current_device(self) -> int | None:
        from greenlet import getcurrent
        return self._device_by_greenlet.get(getcurrent())

# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.

PyTorch 2.x device-agnostic surface: recent PyTorch releases are moving toward a device-agnostic torch.accelerator namespace (torch.accelerator.set_device_index(r), torch.accelerator.current_device_index()). For users who want bench code that is not tied to a specific device vendor, KernBench exposes this surface alongside torch.ahbm.

class _AcceleratorNamespace:
    """torch.accelerator — device-agnostic API (PyTorch 2.x style).

    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
        torch.accelerator.set_device_index(rank)
        torch.accelerator.current_device_index()
    """

    def __init__(self, ahbm: _AhbmNamespace):
        self._ahbm = ahbm

    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)

    def current_device_index(self) -> int | None:
        return self._ahbm.current_device()

# RuntimeContext
self.ahbm = _AhbmNamespace()
self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias

Bench authors may choose either — both share the same registry internally:

torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
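
That both surfaces share one registry can be shown with a reduced model of the two namespaces. This is a sketch under assumptions: the class names drop the leading underscore, and the injected `current` callable (with string task names) replaces `greenlet.getcurrent` so the example runs on the standard library alone.

```python
from typing import Callable, Dict, Hashable, Optional


class AhbmNamespace:
    """Reduced model of torch.ahbm with an injectable current-task lookup."""

    def __init__(self, current: Callable[[], Hashable]):
        self._current = current  # greenlet.getcurrent in KernBench (assumption)
        self._device_by_task: Dict[Hashable, int] = {}

    def set_device(self, device: int) -> None:
        self._device_by_task[self._current()] = int(device)

    def current_device(self) -> Optional[int]:
        return self._device_by_task.get(self._current())


class AcceleratorNamespace:
    """Thin alias: holds no state of its own, delegates to AhbmNamespace."""

    def __init__(self, ahbm: AhbmNamespace):
        self._ahbm = ahbm

    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)

    def current_device_index(self) -> Optional[int]:
        return self._ahbm.current_device()


active = {"task": "worker-0"}
ahbm = AhbmNamespace(current=lambda: active["task"])
accelerator = AcceleratorNamespace(ahbm)

accelerator.set_device_index(2)
seen_via_ahbm = ahbm.current_device()                 # set via alias, read via ahbm

active["task"] = "worker-1"
unbound = accelerator.current_device_index()          # nothing bound for this task yet
ahbm.set_device(3)
seen_via_accelerator = accelerator.current_device_index()

active["task"] = "worker-0"
worker0_unchanged = ahbm.current_device()             # worker-0 binding is untouched
```

Because the alias stores nothing itself, mixing the two idioms within one bench script cannot produce diverging device bindings.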

D4. Tensor placement = structural (sip, cube, pe) coordinates

resolve_dp_policy takes target_sip directly and produces placement in structural coordinates. Details in ADR-0026.

# RuntimeContext._create_tensor
current_sip = self.ahbm.current_device()          # (D3 naming)
if current_sip is None:
    current_sip = 0  # single-driver fallback (consistent with D2)
placement = resolve_dp_policy(
    dp, shape=shape_2d, itemsize=itemsize,
    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
    target_sip=current_sip,
)

No post-hoc pe_index shifting — ShardSpec carries the (sip, cube, pe) structural coordinates directly. ShardSpec details in ADR-0026.
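
To make "structural coordinates" concrete, the sketch below enumerates placements for one SIP. Everything in it is hypothetical: the real ShardSpec fields and the resolve_dp_policy signature are defined in ADR-0026, and `place_shards`, its parameters, and the assumption that `num_pe` counts PEs per cube are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ShardSpec:
    """Hypothetical shape; the real ShardSpec is specified in ADR-0026."""
    sip: int
    cube: int
    pe: int


def place_shards(
    num_cubes: int, num_pe: int, current_sip: Optional[int]
) -> List[ShardSpec]:
    """Toy placement: one shard per (cube, pe) pair on the caller's SIP."""
    # Single-driver fallback, consistent with D2: no bound device means SIP 0.
    target_sip = 0 if current_sip is None else current_sip
    # The SIP is part of each coordinate from the start, so no flat
    # pe_index has to be shifted afterwards.
    return [
        ShardSpec(sip=target_sip, cube=cube, pe=pe)
        for cube in range(num_cubes)
        for pe in range(num_pe)
    ]


shards_rank1 = place_shards(num_cubes=2, num_pe=2, current_sip=1)
shards_driver = place_shards(num_cubes=1, num_pe=2, current_sip=None)
```

Here every shard created by the rank-1 worker carries `sip=1` in its own coordinates, while the unbound driver lands on SIP 0.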


Dependencies

  • ADR-0023 (IPCQ): origin of the backend ahbm namespace.
  • ADR-0026 (DPPolicy intra-device): the resolve_dp_policy signature used by D4 and the structural-coordinate representation of ShardSpec.
  • ADR-0027 (Megatron TP + scheduler): the implementation baseline for worker scheduling, mp.spawn, collective drain, and exception cleanup.

Non-goals

  • Modifying the IPCQ protocol: ADR-0023 remains as-is.
  • Cleaning up DPPolicy fields: ADR-0026.
  • Megatron-style TP: ADR-0027.
  • Worker scheduling / spawn / drain / exception cleanup: ADR-0027 D0/D1.
  • Collective algorithm implementation: ADR-0032.
  • Multi-node (cross-process): single process only.

Consequences

Positive

  • Bench code reads like a real PyTorch DDP script at the public-API level.
  • Greenlet-local rank: each worker greenlet resolves its own rank and SIP, so per-rank behavior stays correct within the 1-process model.
  • Structural placement coordinates: lets the other ADRs (ADR-0026 / ADR-0027 / ADR-0032) operate consistently on top of the (sip, cube, pe) 3-tuple.

Neutral

  • IPCQ PE-level protocol (ADR-0023) is unchanged.
  • IO_CPU role is unchanged (existing transit behavior preserved).