kernbench2/docs/adr/ADR-0024-par-sip-tp-launcher.md
mukesh fd56b6cacd adr: add ADR-0043/0044 (eval harnesses); reconcile ADR-0024/0032 for SIP w/h
Document the allreduce + GEMM evaluation harnesses and bring the affected
allreduce ADRs in line with the refactored code.

New (Accepted, EN + KO):
- ADR-0043 — allreduce evaluation harness (tests/sccl/): distributed-driven
  correctness, latency/buffer-kind sweeps, sessionfinish plot aggregators,
  topology + FSIM-comparison figures. Verified against the implementation.
- ADR-0044 — GEMM evaluation harness (scripts/gemm_sweep.py + tests/gemm/):
  heavy-script data gen vs. fast test-rendered figures, slow regenerator,
  the 3-figure set. Records two limitations as open questions: the
  theoretical-model constants are inherited (not yet traced to ADR-0033/
  0014), and the *_measured figure is a naming misnomer.

Updated (EN + KO):
- ADR-0024 — add D5: SIP grid w/h resolution (explicit sips.w/h, square
  fallback, fail-loud), documenting the AhbmCCLBackend fix.
- ADR-0032 — D4/D5/Non-goals reconciled: rectangular SIP grids (e.g. 6 SIPs
  as 3x2) are supported via explicit w/h; the square requirement now
  applies only to the fallback. Affected-files repointed to tests/sccl/.

Verification: ADR-0023 and ADR-0042 confirmed still matching the code (no
change). verify_adr_lang_pairs.py passes (EN/KO Status blocks byte-equal).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 10:26:25 -07:00

ADR-0024: SIP-level Launcher — rank = SIP

Status

Accepted

Context

Goal

Align the participation unit (rank) of torch.distributed collective calls to the SIP (device) boundary. The aim is bench code that, at the host level, reads indistinguishably from real PyTorch DDP/TP scripts.

Comparison with real PyTorch:

| Dimension | real PyTorch | KernBench |
|---|---|---|
| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
| get_rank() | RANK env var | greenlet-local registry |
| get_world_size() | WORLD_SIZE env var | SIP count from topology |
| torch.cuda.set_device(r) (real) / torch.ahbm.set_device(r) (KernBench) | rank → GPU | rank → SIP |
| mp.spawn | OS process fork | greenlet fan-out |

Problems to solve

  1. Public API where rank = SIP — so bench workers do not have to know about the PE concept.
  2. Greenlet-local rank/device tracking — within the 1-process model, each worker greenlet must correctly identify its own rank / its own SIP.
  3. Tensor placement = structural (sip, cube, pe) — if rank is SIP, the default tensor placement should also be expressed in structural coordinates.

Non-problems (outside this ADR)

  • IPCQ direction addressing → ADR-0025
  • Removing DPPolicy.sip/num_sips → ADR-0026
  • Megatron-style TP → ADR-0027
  • DTensor → ADR-0028 (future)
  • Worker scheduling / mp.spawn / collective drain / exception cleanup → ADR-0027 D0/D1
  • Collective algorithm implementation (intercube_allreduce, SFR config) → ADR-0032

Decision

D1. rank = SIP (world_size resolution)

def _resolve_world_size(self) -> int:
    if "world_size" in self._merged:
        return int(self._merged["world_size"])
    defaults = self._cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    spec = self.ctx.spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))

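Priority order: algorithm-level world_size override > defaults-level world_size override > SIP count from the topology spec (system.sips.count, falling back to 1 when the spec does not give one). The ccl.yaml override is retained as the legacy "rank = PE" test path.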
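
The precedence can be exercised in isolation. The following is a minimal sketch, not the KernBench implementation: resolve_world_size is a hypothetical free-function version of _resolve_world_size, and its three dictionary arguments stand in for self._merged, self._cfg_all, and self.ctx.spec.

```python
from __future__ import annotations


def resolve_world_size(merged: dict, cfg_all: dict, spec: dict | None) -> int:
    """Hypothetical free-function mirror of the D1 precedence."""
    # 1. Algorithm-level override wins.
    if "world_size" in merged:
        return int(merged["world_size"])
    # 2. Defaults-level override comes next.
    defaults = cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    # 3. Otherwise use the SIP count from the topology spec (1 if absent).
    sips = (spec or {}).get("system", {}).get("sips", {})
    return int(sips.get("count", 1))


spec = {"system": {"sips": {"count": 4}}}
cfg_with_defaults = {"defaults": {"world_size": 8}}

from_topology = resolve_world_size({}, {}, spec)
from_defaults = resolve_world_size({}, cfg_with_defaults, spec)
from_algorithm = resolve_world_size({"world_size": 16}, cfg_with_defaults, spec)
no_spec = resolve_world_size({}, {}, None)
```

With no override present, the rank count tracks the topology (4 SIPs here), which is the "rank = SIP" default. Either override detaches world_size from the SIP count, which is why that path is kept only for legacy tests.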

D2. Greenlet-local rank registry (+ debug warning)

import os
import warnings


class DistributedContext:
    def __init__(self):
        self._backend = None
        self._rank_by_greenlet: dict = {}

    def _bind_rank(self, g, rank: int) -> None:
        self._rank_by_greenlet[g] = int(rank)

    def get_rank(self) -> int:
        self._ensure_initialized()
        from greenlet import getcurrent
        g = getcurrent()
        if g not in self._rank_by_greenlet:
            if os.environ.get("KERNBENCH_DEBUG"):
                warnings.warn(
                    "get_rank() called outside a bound greenlet — returning 0. "
                    "Likely a bug unless running single-driver."
                )
            return 0
        return int(self._rank_by_greenlet[g])

D3. torch.ahbm.set_device(rank) — SIP binding

The KernBench backend is named ahbm (ADR-0023). Real PyTorch uses torch.cuda.set_device(r); because KernBench is not a CUDA runtime, the equivalent call is exposed under torch.ahbm rather than under a namespace that pretends to be CUDA.

class _AhbmNamespace:
    """torch.ahbm — per-greenlet SIP device binding.

    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
    """

    def __init__(self):
        self._device_by_greenlet: dict = {}

    def set_device(self, device: int) -> None:
        from greenlet import getcurrent
        self._device_by_greenlet[getcurrent()] = int(device)

    def current_device(self) -> int | None:
        from greenlet import getcurrent
        return self._device_by_greenlet.get(getcurrent())

# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.

PyTorch 2.x-style device-agnostic support: recent PyTorch is moving toward a device-agnostic torch.accelerator namespace (torch.accelerator.set_device_index(r), torch.accelerator.current_device_index()). For users who want bench code that is not tied to a specific device vendor, KernBench exposes the same surface alongside torch.ahbm.

class _AcceleratorNamespace:
    """torch.accelerator — device-agnostic API (PyTorch 2.x style).

    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
        torch.accelerator.set_device_index(rank)
        torch.accelerator.current_device_index()
    """

    def __init__(self, ahbm: _AhbmNamespace):
        self._ahbm = ahbm

    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)

    def current_device_index(self) -> int | None:
        return self._ahbm.current_device()

# RuntimeContext
self.ahbm = _AhbmNamespace()
self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias

Bench authors may choose either — both share the same registry internally:

torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
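
The shared-registry behavior can be checked with a small stand-alone sketch. It is an illustration under stated assumptions, not the KernBench classes: the real namespaces key the registry on greenlet.getcurrent(), whereas here a hypothetical current callable (backed by the active dictionary) stands in for it so the example runs with only the standard library.

```python
from __future__ import annotations

from typing import Callable, Hashable


class AhbmNamespace:
    """Sketch of torch.ahbm with an injectable 'current worker' lookup."""

    def __init__(self, current: Callable[[], Hashable]):
        self._current = current  # stand-in for greenlet.getcurrent (assumption)
        self._device_by_worker: dict = {}

    def set_device(self, device: int) -> None:
        self._device_by_worker[self._current()] = int(device)

    def current_device(self) -> int | None:
        return self._device_by_worker.get(self._current())


class AcceleratorNamespace:
    """Sketch of torch.accelerator: delegates to AhbmNamespace, holds no state."""

    def __init__(self, ahbm: AhbmNamespace):
        self._ahbm = ahbm

    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)

    def current_device_index(self) -> int | None:
        return self._ahbm.current_device()


active = {"worker": "worker-0"}  # hypothetical "currently running worker" marker
ahbm = AhbmNamespace(current=lambda: active["worker"])
accelerator = AcceleratorNamespace(ahbm)

ahbm.set_device(0)               # worker-0 binds through the native namespace
active["worker"] = "worker-1"
accelerator.set_device_index(1)  # worker-1 binds through the alias

seen = {}
for worker in ("worker-0", "worker-1", "unbound"):
    active["worker"] = worker
    seen[worker] = (ahbm.current_device(), accelerator.current_device_index())
```

Both accessors report the same device for each worker regardless of which namespace performed the binding, and an unbound worker sees None from both, which is the case the single-driver fallback in D4 handles.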

D4. Tensor placement = structural (sip, cube, pe) coordinates

resolve_dp_policy takes target_sip directly and produces placement in structural coordinates. Details in ADR-0026.

# RuntimeContext._create_tensor
current_sip = self.ahbm.current_device()          # (D3 naming)
if current_sip is None:
    current_sip = 0  # single-driver fallback (consistent with D2)
placement = resolve_dp_policy(
    dp, shape=shape_2d, itemsize=itemsize,
    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
    target_sip=current_sip,
)

No post-hoc pe_index shifting — ShardSpec carries the (sip, cube, pe) structural coordinates directly. ShardSpec details in ADR-0026.

D5. SIP grid dimensions — explicit sips.w/h resolution

For 2D inter-SIP topologies (torus_2d, mesh_2d_no_wrap) the SIP grid shape (width × height) is resolved from system.sips.w / system.sips.h, mirroring how D1 resolves world_size from sips.count. Precedence: explicit w/h (validated against w*h == count) > square fallback (side = round(sqrt(count)), accepted only when side*side == count, and used only when w and h are not both given) > error. ring_1d has no grid and uses the sentinel (0, 0).

sips = spec.get("system", {}).get("sips", {})
if sip_topo == "ring_1d":
    w, h = 0, 0                          # 1D sentinel (no grid)
elif sips.get("w") is not None and sips.get("h") is not None:
    w, h = int(sips["w"]), int(sips["h"])
    if w * h != n_sips:
        raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
else:
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError("non-square sips.count requires explicit sips.w/h")
    w, h = side, side

This lifts the earlier assumption that 2D SIP grids must be perfect squares: a 6-SIP torus_2d / mesh_2d_no_wrap is now expressible as w: 3, h: 2 (or 2x3). The derived (w, h) feed the algorithm's inter-SIP exchange (consumed in ADR-0032 D5). The prior code path silently took round(sqrt(count))² for any non-ring topology, which produced a wrong grid (e.g. 2×2 for 6 SIPs); the explicit-w/h path with a fail-loud fallback replaces that.
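
The resolution rules can be tried on a few layouts with the sketch below. It wraps the logic shown above in a hypothetical resolve_sip_grid function. Reading n_sips from system.sips.count is an assumption made to keep the example self-contained, and make_spec is an illustrative helper, not part of KernBench.

```python
from __future__ import annotations

import math


def resolve_sip_grid(spec: dict, sip_topo: str) -> tuple[int, int]:
    """Hypothetical wrapper around the D5 logic; returns (w, h)."""
    sips = spec.get("system", {}).get("sips", {})
    n_sips = int(sips.get("count", 1))  # assumption: count comes from the spec
    if sip_topo == "ring_1d":
        return 0, 0  # 1D sentinel (no grid)
    if sips.get("w") is not None and sips.get("h") is not None:
        w, h = int(sips["w"]), int(sips["h"])
        if w * h != n_sips:
            raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
        return w, h
    # Square fallback: only valid when count is a perfect square.
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError("non-square sips.count requires explicit sips.w/h")
    return side, side


def make_spec(count: int, w: int | None = None, h: int | None = None) -> dict:
    """Build a minimal topology spec for the examples (illustrative helper)."""
    return {"system": {"sips": {"count": count, "w": w, "h": h}}}


rect = resolve_sip_grid(make_spec(6, w=3, h=2), "torus_2d")    # explicit w/h
square = resolve_sip_grid(make_spec(4), "mesh_2d_no_wrap")     # square fallback
ring = resolve_sip_grid(make_spec(6), "ring_1d")               # no grid

errors = []
for bad in (make_spec(6), make_spec(6, w=2, h=2)):
    try:
        resolve_sip_grid(bad, "torus_2d")
    except ValueError as exc:
        errors.append(str(exc))
```

The two failing specs are the cases the fail-loud path exists for: 6 SIPs with no w/h (previously a silent 2×2 grid) and an explicit layout whose product does not match the SIP count.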


Dependencies

  • ADR-0023 (IPCQ): origin of the backend ahbm namespace.
  • ADR-0026 (DPPolicy intra-device): the resolve_dp_policy signature used by D4 and the structural-coordinate representation of ShardSpec.
  • ADR-0027 (Megatron TP + scheduler): the implementation baseline for worker scheduling, mp.spawn, collective drain, and exception cleanup.

Non-goals

  • Modifying the IPCQ protocol: ADR-0023 remains as-is.
  • Cleaning up DPPolicy fields: ADR-0026.
  • Megatron-style TP: ADR-0027.
  • Worker scheduling / spawn / drain / exception cleanup: ADR-0027 D0/D1.
  • Collective algorithm implementation: ADR-0032.
  • Multi-node (cross-process): single process only.

Consequences

Positive

  • Bench code reads like a real PyTorch DDP script at the public-API level.
  • Greenlet-local rank: enables cross-rank correctness within the 1-process model.
  • Structural placement coordinates: lets the other ADRs (ADR-0026 / ADR-0027 / ADR-0032) operate consistently on top of the (sip, cube, pe) 3-tuple.

Neutral

  • IPCQ PE-level protocol (ADR-0023) is unchanged.
  • IO_CPU role is unchanged (existing transit behavior preserved).