ADR-0024: SIP-level Launcher — rank = SIP
Status
Accepted
Context
Goal
Align the participation unit (rank) of torch.distributed collective calls
to the SIP (device) boundary. The aim is bench code that, at the host
level, reads indistinguishably from real PyTorch DDP/TP scripts.
Comparison with real PyTorch:
| Dimension | real PyTorch | KernBench |
|---|---|---|
| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
| get_rank() | RANK env var | greenlet-local registry |
| get_world_size() | WORLD_SIZE env var | SIP count from topology |
| torch.cuda.set_device(r) (real) / torch.ahbm.set_device(r) (KernBench) | rank → GPU | rank → SIP |
| mp.spawn | OS process fork | greenlet fan-out |
Problems to solve
- Public API where rank = SIP — so bench workers do not have to know about the PE concept.
- Greenlet-local rank/device tracking — within the 1-process model, each worker greenlet must correctly identify its own rank / its own SIP.
- Tensor placement = structural (sip, cube, pe) — if rank is SIP, the default tensor placement should also be expressed in structural coordinates.
Non-problem (outside this ADR)
- IPCQ direction addressing → ADR-0025
- Removing DPPolicy.sip/num_sips → ADR-0026
- Megatron-style TP → ADR-0027
- DTensor → ADR-0028 (future)
- Worker scheduling / mp.spawn / collective drain / exception cleanup → ADR-0027 D0/D1
- Collective algorithm implementation (intercube_allreduce, SFR config) → ADR-0032
Decision
D1. rank = SIP (world_size resolution)
def _resolve_world_size(self) -> int:
    # 1. Algorithm-level override.
    if "world_size" in self._merged:
        return int(self._merged["world_size"])
    # 2. Override from the config's "defaults" section.
    defaults = self._cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    # 3. SIP count from the topology spec (1 if unspecified).
    spec = self.ctx.spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
Priority order: algorithm override > defaults override > SIP count. The
ccl.yaml override is retained as the legacy "rank = PE" test path.
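The priority order can be checked in isolation with plain dictionaries. The following is a minimal sketch, not the KernBench implementation: `resolve_world_size` is a hypothetical free-function version of the method above, and its three parameters stand in for `self._merged`, `self._cfg_all`, and `self.ctx.spec`.

```python
from typing import Optional


def resolve_world_size(merged: dict, cfg_all: dict, spec: Optional[dict]) -> int:
    """Hypothetical standalone mirror of _resolve_world_size (D1)."""
    # 1. Algorithm-level override wins.
    if "world_size" in merged:
        return int(merged["world_size"])
    # 2. Then the "defaults" section of the config.
    defaults = cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    # 3. Otherwise one rank per SIP in the topology (1 if unspecified).
    spec = spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))


topology = {"system": {"sips": {"count": 4}}}

from_topology = resolve_world_size({}, {}, topology)
from_defaults = resolve_world_size({}, {"defaults": {"world_size": 8}}, topology)
from_algorithm = resolve_world_size(
    {"world_size": 16}, {"defaults": {"world_size": 8}}, topology
)
no_spec = resolve_world_size({}, {}, None)
```

With no override present, the world size equals the SIP count, which is the rank = SIP default this decision establishes.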
D2. Greenlet-local rank registry (+ debug warning)
import os
import warnings


class DistributedContext:
    def __init__(self):
        self._backend = None
        self._rank_by_greenlet: dict = {}

    def _bind_rank(self, g, rank: int) -> None:
        # Associate worker greenlet `g` with its rank.
        self._rank_by_greenlet[g] = int(rank)

    def get_rank(self) -> int:
        self._ensure_initialized()
        from greenlet import getcurrent
        g = getcurrent()
        if g not in self._rank_by_greenlet:
            # Unbound greenlet: fall back to rank 0; warn only in debug mode.
            if os.environ.get("KERNBENCH_DEBUG"):
                warnings.warn(
                    "get_rank() called outside a bound greenlet; returning 0. "
                    "Likely a bug unless running single-driver."
                )
            return 0
        return int(self._rank_by_greenlet[g])
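The lookup pattern in D2 can be exercised without the greenlet package. The sketch below is an illustration under a loud assumption: it keys the registry on `threading.current_thread()` instead of `greenlet.getcurrent()`, so it shows the per-worker lookup idea rather than KernBench's actual scheduling. `RankRegistry` and `run_workers` are hypothetical names.

```python
import threading


class RankRegistry:
    """Hypothetical stand-in for the rank map in DistributedContext."""

    def __init__(self):
        self._rank_by_worker = {}

    def bind_rank(self, worker, rank: int) -> None:
        self._rank_by_worker[worker] = int(rank)

    def get_rank(self) -> int:
        # Unbound callers fall back to rank 0, as in D2.
        return self._rank_by_worker.get(threading.current_thread(), 0)


def run_workers(world_size: int):
    registry = RankRegistry()
    seen = {}

    def worker(slot: int) -> None:
        # Each worker asks the registry for its own rank.
        seen[slot] = registry.get_rank()

    threads = [
        threading.Thread(target=worker, args=(r,)) for r in range(world_size)
    ]
    # Bind every worker before it starts, so lookups never race with binding.
    for rank, t in enumerate(threads):
        registry.bind_rank(t, rank)
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return registry, seen


registry, seen = run_workers(4)
main_rank = registry.get_rank()  # the main thread was never bound
```

Each worker sees only its own rank even though all workers share one registry object, and the unbound main thread gets the rank 0 fallback.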
D3. torch.ahbm.set_device(rank) — SIP binding
The KernBench backend name is ahbm (ADR-0023). Real PyTorch uses
torch.cuda.set_device(r); because KernBench is not a CUDA runtime, the
equivalent call lives under a namespace named after the actual backend.
class _AhbmNamespace:
    """torch.ahbm: per-greenlet SIP device binding.

    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
    """

    def __init__(self):
        self._device_by_greenlet: dict = {}

    def set_device(self, device: int) -> None:
        from greenlet import getcurrent
        self._device_by_greenlet[getcurrent()] = int(device)

    def current_device(self) -> int | None:
        from greenlet import getcurrent
        return self._device_by_greenlet.get(getcurrent())

# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
PyTorch 2.x device-agnostic surface: Recent PyTorch releases are moving
toward a device-agnostic torch.accelerator namespace
(torch.accelerator.set_device_index(r),
torch.accelerator.current_device_index()). For bench authors who want
code that is not tied to a specific device vendor, KernBench exposes this
surface alongside torch.ahbm.
class _AcceleratorNamespace:
    """torch.accelerator: device-agnostic API (PyTorch 2.x style).

    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
        torch.accelerator.set_device_index(rank)
        torch.accelerator.current_device_index()
    """

    def __init__(self, ahbm: _AhbmNamespace):
        self._ahbm = ahbm

    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)

    def current_device_index(self) -> int | None:
        return self._ahbm.current_device()

# RuntimeContext
self.ahbm = _AhbmNamespace()
self.accelerator = _AcceleratorNamespace(self.ahbm)  # alias
Bench authors may choose either — both share the same registry internally:
torch.ahbm.set_device(rank) # KernBench-native, explicit backend
torch.accelerator.set_device_index(rank) # PyTorch 2.x device-agnostic
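That the two surfaces share one registry can be shown with a small dependency-free sketch. The classes below are simplified stand-ins for `_AhbmNamespace` and `_AcceleratorNamespace`; the `current` constructor parameter is a hypothetical injection point (defaulting to `threading.get_ident`) used here in place of `greenlet.getcurrent()`.

```python
import threading
from typing import Callable, Dict, Hashable, Optional


class AhbmNamespace:
    """Per-worker device registry (stand-in for _AhbmNamespace)."""

    def __init__(self, current: Callable[[], Hashable] = threading.get_ident):
        # Assumption: KernBench keys on greenlet.getcurrent() instead.
        self._current = current
        self._device_by_worker: Dict[Hashable, int] = {}

    def set_device(self, device: int) -> None:
        self._device_by_worker[self._current()] = int(device)

    def current_device(self) -> Optional[int]:
        return self._device_by_worker.get(self._current())


class AcceleratorNamespace:
    """Device-agnostic facade that forwards to the same registry."""

    def __init__(self, ahbm: AhbmNamespace):
        self._ahbm = ahbm

    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)

    def current_device_index(self) -> Optional[int]:
        return self._ahbm.current_device()


ahbm = AhbmNamespace()
accelerator = AcceleratorNamespace(ahbm)

before = accelerator.current_device_index()     # nothing bound yet
ahbm.set_device(2)                              # set through the native surface
via_alias = accelerator.current_device_index()  # read through the alias
accelerator.set_device_index(3)                 # set through the alias
via_native = ahbm.current_device()              # read through the native surface
```

Because the alias holds a reference to the native namespace rather than its own dictionary, a binding made through either surface is visible through the other.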
D4. Tensor placement = structural (sip, cube, pe) coordinates
resolve_dp_policy takes target_sip directly and produces placement in
structural coordinates. Details in ADR-0026.
# RuntimeContext._create_tensor
current_sip = self.ahbm.current_device()  # (D3 naming)
if current_sip is None:
    current_sip = 0  # single-driver fallback (consistent with D2)
placement = resolve_dp_policy(
    dp, shape=shape_2d, itemsize=itemsize,
    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
    target_sip=current_sip,
)
No post-hoc pe_index shifting — ShardSpec carries the (sip, cube, pe)
structural coordinates directly. ShardSpec details in ADR-0026.
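To illustrate what carrying structural coordinates directly means, here is a minimal sketch. It is not the ADR-0026 `ShardSpec` or `resolve_dp_policy`: the `Shard` tuple, the `place_row_shards` helper, the even row split, and the reading of `num_pe` as PEs per cube are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Shard(NamedTuple):
    """Hypothetical shard record with structural coordinates."""
    sip: int
    cube: int
    pe: int
    row_start: int
    row_end: int  # exclusive


def place_row_shards(
    num_rows: int, num_cubes: int, num_pe: int, target_sip: int
) -> List[Shard]:
    """Split rows across all PEs of one SIP (assumes num_pe is per cube)."""
    total = num_cubes * num_pe
    base, extra = divmod(num_rows, total)
    shards = []
    start = 0
    for idx in range(total):
        cube, pe = divmod(idx, num_pe)
        # The first `extra` shards absorb one leftover row each.
        size = base + (1 if idx < extra else 0)
        # The SIP is written into the shard at creation time; no flat
        # pe_index is computed first and shifted afterwards.
        shards.append(Shard(target_sip, cube, pe, start, start + size))
        start += size
    return shards


shards = place_row_shards(num_rows=8, num_cubes=2, num_pe=2, target_sip=1)
```

Every shard already names its SIP, so a worker bound to rank 1 produces placements on SIP 1 without any later index arithmetic.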
Dependencies
- ADR-0023 (IPCQ): origin of the backend ahbm namespace.
- ADR-0026 (DPPolicy intra-device): the resolve_dp_policy signature used by D4 and the structural-coordinate representation of ShardSpec.
- ADR-0027 (Megatron TP + scheduler): the implementation baseline for worker scheduling, mp.spawn, collective drain, and exception cleanup.
Non-goals
- Modifying the IPCQ protocol: ADR-0023 remains as-is.
- Cleaning up DPPolicy fields: ADR-0026.
- Megatron-style TP: ADR-0027.
- Worker scheduling / spawn / drain / exception cleanup: ADR-0027 D0/D1.
- Collective algorithm implementation: ADR-0032.
- Multi-node (cross-process): single process only.
Consequences
Positive
- Bench code matches real PyTorch DDP scripts at the public-API level.
- Greenlet-local rank: each worker greenlet resolves its own rank, so per-rank behavior stays correct within the 1-process model.
- Structural placement coordinates: lets the other ADRs (ADR-0026 /
ADR-0027 / ADR-0032) operate consistently on top of the
(sip, cube, pe) 3-tuple.
Neutral
- IPCQ PE-level protocol (ADR-0023) is unchanged.
- IO_CPU role is unchanged (existing transit behavior preserved).