Document the allreduce + GEMM evaluation harnesses and bring the affected allreduce ADRs in line with the refactored code.
New (Accepted, EN + KO):
- ADR-0043: allreduce evaluation harness (tests/sccl/): distributed-driven correctness, latency/buffer-kind sweeps, sessionfinish plot aggregators, topology + FSIM-comparison figures. Verified against the implementation.
- ADR-0044: GEMM evaluation harness (scripts/gemm_sweep.py + tests/gemm/): heavy-script data gen vs. fast test-rendered figures, slow regenerator, the 3-figure set. Records two limitations as open questions: the theoretical-model constants are inherited (not yet traced to ADR-0033/0014), and the *_measured figure is a naming misnomer.
Updated (EN + KO):
- ADR-0024: add D5 (SIP grid w/h resolution: explicit sips.w/h, square fallback, fail-loud), documenting the AhbmCCLBackend fix.
- ADR-0032: D4/D5/Non-goals reconciled: rectangular SIP grids (e.g. 6 SIPs as 3x2) are supported via explicit w/h; the square requirement now applies only to the fallback. Affected-files repointed to tests/sccl/.
Verification: ADR-0023 and ADR-0042 confirmed still matching the code (no change). verify_adr_lang_pairs.py passes (EN/KO Status blocks byte-equal).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-0024: SIP-level Launcher — rank = SIP
Status
Accepted
Context
Goal
Align the participation unit (rank) of torch.distributed collective calls
to the SIP (device) boundary. The aim is bench code that, at the host
level, reads indistinguishably from real PyTorch DDP/TP scripts.
Comparison with real PyTorch:
| Dimension | real PyTorch | KernBench |
|---|---|---|
| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
| get_rank() | RANK env var | greenlet-local registry |
| get_world_size() | WORLD_SIZE env var | SIP count from topology |
| torch.cuda.set_device(r) (real) / torch.ahbm.set_device(r) (KernBench) | rank → GPU | rank → SIP |
| mp.spawn | OS process fork | greenlet fan-out |
Problems to solve
- Public API where rank = SIP — so bench workers do not have to know about the PE concept.
- Greenlet-local rank/device tracking — within the 1-process model, each worker greenlet must correctly identify its own rank / its own SIP.
- Tensor placement = structural (sip, cube, pe) — if rank is SIP, the default tensor placement should also be expressed in structural coordinates.
Non-problems (outside this ADR)
- IPCQ direction addressing → ADR-0025
- Removing DPPolicy.sip / num_sips → ADR-0026
- Megatron-style TP → ADR-0027
- DTensor → ADR-0028 (future)
- Worker scheduling / mp.spawn / collective drain / exception cleanup → ADR-0027 D0/D1
- Collective algorithm implementation (intercube_allreduce, SFR config) → ADR-0032
Decision
D1. rank = SIP (world_size resolution)
    def _resolve_world_size(self) -> int:
        if "world_size" in self._merged:
            return int(self._merged["world_size"])
        defaults = self._cfg_all.get("defaults", {})
        if "world_size" in defaults:
            return int(defaults["world_size"])
        spec = self.ctx.spec or {}
        return int(spec.get("system", {}).get("sips", {}).get("count", 1))
Priority order: algorithm override > defaults override > SIP count. The
ccl.yaml override is retained as the legacy "rank = PE" test path.
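The precedence above can be exercised in isolation. The following is a minimal sketch, not the KernBench implementation: resolve_world_size is a hypothetical free function that takes the three inputs (the merged algorithm config, the full ccl.yaml config, and the topology spec) as plain dicts instead of reading them from self.

```python
from __future__ import annotations


def resolve_world_size(merged: dict, cfg_all: dict, spec: dict | None) -> int:
    """Hypothetical stand-alone version of the D1 lookup order."""
    # 1. A per-algorithm override wins.
    if "world_size" in merged:
        return int(merged["world_size"])
    # 2. Then the shared defaults section.
    defaults = cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    # 3. Otherwise rank = SIP: use the SIP count from the topology (1 if absent).
    sips = (spec or {}).get("system", {}).get("sips", {})
    return int(sips.get("count", 1))


topology = {"system": {"sips": {"count": 4}}}
with_defaults = {"defaults": {"world_size": 8}}

print(resolve_world_size({}, {}, topology))                            # 4
print(resolve_world_size({}, with_defaults, topology))                 # 8
print(resolve_world_size({"world_size": 2}, with_defaults, topology))  # 2
print(resolve_world_size({}, {}, None))                                # 1
```

With no override anywhere, the world size follows the topology, which is the rank = SIP default. Either override layer restores the legacy behaviour without touching the topology file.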
D2. Greenlet-local rank registry (+ debug warning)
    import os
    import warnings

    class DistributedContext:
        def __init__(self):
            self._backend = None
            self._rank_by_greenlet: dict = {}

        def _bind_rank(self, g, rank: int) -> None:
            self._rank_by_greenlet[g] = int(rank)

        def get_rank(self) -> int:
            self._ensure_initialized()
            from greenlet import getcurrent
            g = getcurrent()
            if g not in self._rank_by_greenlet:
                # Unbound greenlet: fall back to rank 0 (single-driver case).
                if os.environ.get("KERNBENCH_DEBUG"):
                    warnings.warn(
                        "get_rank() called outside a bound greenlet; returning 0. "
                        "Likely a bug unless running single-driver."
                    )
                return 0
            return int(self._rank_by_greenlet[g])
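To illustrate the registry pattern without depending on the greenlet package, the sketch below swaps greenlets for standard-library threads and keys the registry on threading.get_ident() instead of greenlet.getcurrent(). RankRegistry, bind_rank, and worker are hypothetical names for this example only. The lookup-with-fallback logic is the part that mirrors D2.

```python
import threading


class RankRegistry:
    """Per-worker rank lookup; threads stand in for greenlets here."""

    def __init__(self):
        self._rank_by_worker: dict = {}

    def bind_rank(self, rank: int) -> None:
        # Bind the calling worker (identified by its thread id) to a rank.
        self._rank_by_worker[threading.get_ident()] = int(rank)

    def get_rank(self) -> int:
        # Unbound callers fall back to rank 0, as in the single-driver case.
        return self._rank_by_worker.get(threading.get_ident(), 0)


registry = RankRegistry()
seen: dict = {}


def worker(rank: int) -> None:
    registry.bind_rank(rank)
    seen[rank] = registry.get_rank()  # each worker reads back its own rank


threads = [threading.Thread(target=worker, args=(r,)) for r in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(seen.items()))  # every rank maps to itself
print(registry.get_rank())   # the main thread was never bound, so this is 0
```

The same host-level call returns a different rank depending on which worker issues it, which is what lets one process present N ranks.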
D3. torch.ahbm.set_device(rank) — SIP binding
The KernBench backend name is ahbm (ADR-0023). Real PyTorch uses
torch.cuda.set_device(r); because KernBench is not a CUDA runtime, it
exposes the same call under a namespace named after its actual backend, torch.ahbm.
    class _AhbmNamespace:
        """torch.ahbm: per-greenlet SIP device binding.

        Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
        KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
        API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
        """

        def __init__(self):
            self._device_by_greenlet: dict = {}

        def set_device(self, device: int) -> None:
            from greenlet import getcurrent
            self._device_by_greenlet[getcurrent()] = int(device)

        def current_device(self) -> int | None:
            from greenlet import getcurrent
            return self._device_by_greenlet.get(getcurrent())

    # Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
    # Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
Device-agnostic alias (PyTorch 2.x style): recent PyTorch releases add a
device-agnostic torch.accelerator namespace
(torch.accelerator.set_device_index(r),
torch.accelerator.current_device_index()). For bench authors who want
code that is not tied to a specific device vendor, KernBench exposes the
same surface alongside torch.ahbm.
    class _AcceleratorNamespace:
        """torch.accelerator: device-agnostic API (PyTorch 2.x style).

        Aliases torch.ahbm for bench code that prefers device-neutral idiom:
            torch.accelerator.set_device_index(rank)
            torch.accelerator.current_device_index()
        """

        def __init__(self, ahbm: _AhbmNamespace):
            self._ahbm = ahbm

        def set_device_index(self, device: int) -> None:
            self._ahbm.set_device(device)

        def current_device_index(self) -> int | None:
            return self._ahbm.current_device()

    # RuntimeContext
    self.ahbm = _AhbmNamespace()
    self.accelerator = _AcceleratorNamespace(self.ahbm)  # alias
Bench authors may choose either — both share the same registry internally:
    torch.ahbm.set_device(rank)               # KernBench-native, explicit backend
    torch.accelerator.set_device_index(rank)  # PyTorch 2.x device-agnostic
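That the two spellings are interchangeable follows from the alias holding no state of its own. A simplified sketch of that relationship, assuming a single global slot in place of the per-greenlet dict (the class names below drop the leading underscore to mark them as illustrative stand-ins):

```python
class AhbmNamespace:
    """Simplified stand-in: one global slot instead of a per-greenlet dict."""

    def __init__(self):
        self._device = None

    def set_device(self, device: int) -> None:
        self._device = int(device)

    def current_device(self):
        return self._device


class AcceleratorNamespace:
    """Forwards every call to the wrapped AhbmNamespace; holds no state."""

    def __init__(self, ahbm: AhbmNamespace):
        self._ahbm = ahbm

    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)

    def current_device_index(self):
        return self._ahbm.current_device()


ahbm = AhbmNamespace()
accelerator = AcceleratorNamespace(ahbm)

accelerator.set_device_index(3)
print(ahbm.current_device())               # 3: set via the alias, read natively
ahbm.set_device(1)
print(accelerator.current_device_index())  # 1: set natively, read via the alias
```

Because the alias only delegates, mixing the two idioms in one bench script cannot produce two diverging notions of the current device.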
D4. Tensor placement = structural (sip, cube, pe) coordinates
resolve_dp_policy takes target_sip directly and produces placement in
structural coordinates. Details in ADR-0026.
    # RuntimeContext._create_tensor
    current_sip = self.ahbm.current_device()  # (D3 naming)
    if current_sip is None:
        current_sip = 0  # single-driver fallback (consistent with D2)
    placement = resolve_dp_policy(
        dp, shape=shape_2d, itemsize=itemsize,
        num_pe=eff_num_pe, num_cubes=eff_num_cubes,
        target_sip=current_sip,
    )
No post-hoc pe_index shifting — ShardSpec carries the (sip, cube, pe)
structural coordinates directly. ShardSpec details in ADR-0026.
D5. SIP grid dimensions — explicit sips.w/h resolution
For 2D inter-SIP topologies (torus_2d, mesh_2d_no_wrap) the SIP grid
shape (width × height) is resolved from system.sips.w / system.sips.h,
mirroring how D1 resolves world_size from sips.count. Precedence:
explicit w/h (validated as w*h == count) > square fallback
(side × side with side = round(sqrt(count)), taken only when w and h are
not both given and count is a perfect square) > error.
    import math

    sips = spec.get("system", {}).get("sips", {})
    if sip_topo == "ring_1d":
        w, h = 0, 0  # 1D sentinel (no grid)
    elif sips.get("w") is not None and sips.get("h") is not None:
        w, h = int(sips["w"]), int(sips["h"])
        if w * h != n_sips:
            raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
    else:
        side = int(round(math.sqrt(n_sips)))
        if side * side != n_sips:
            raise ValueError("non-square sips.count requires explicit sips.w/h")
        w, h = side, side
This lifts the earlier assumption that 2D SIP grids must be perfect
squares: a 6-SIP torus_2d / mesh_2d_no_wrap is now expressible as
w: 3, h: 2 (or 2x3). The derived (w, h) feed the algorithm's
inter-SIP exchange (consumed in ADR-0032 D5). The prior code path silently
used a side × side grid with side = round(sqrt(count)) for any non-ring
topology, which produced a wrong grid for non-square counts (e.g. 2×2 for
6 SIPs). The explicit-w/h path with a fail-loud fallback replaces that.
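The three outcomes (explicit grid, square fallback, error) can be checked with a stand-alone version of the D5 logic. This is a sketch under assumptions: resolve_sip_grid is a hypothetical helper that takes the topology name and the sips mapping directly and reads the SIP count from sips["count"]; the real code path lives inside the backend.

```python
from __future__ import annotations

import math


def resolve_sip_grid(sip_topo: str, sips: dict) -> tuple[int, int]:
    """Hypothetical helper: return (w, h) for the SIP grid or raise ValueError."""
    n_sips = int(sips.get("count", 1))
    if sip_topo == "ring_1d":
        return (0, 0)  # 1D sentinel: no grid
    if sips.get("w") is not None and sips.get("h") is not None:
        w, h = int(sips["w"]), int(sips["h"])
        if w * h != n_sips:
            raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
        return (w, h)
    # Square fallback: only valid when the count is a perfect square.
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError("non-square sips.count requires explicit sips.w/h")
    return (side, side)


print(resolve_sip_grid("torus_2d", {"count": 6, "w": 3, "h": 2}))  # (3, 2)
print(resolve_sip_grid("mesh_2d_no_wrap", {"count": 4}))           # (2, 2)
print(resolve_sip_grid("ring_1d", {"count": 6}))                   # (0, 0)

try:
    resolve_sip_grid("torus_2d", {"count": 6})  # 6 SIPs, no w/h given
except ValueError as exc:
    print(f"rejected: {exc}")
```

The last call is the case the old path got wrong: instead of quietly building a 2×2 grid for six SIPs, the resolver now refuses and asks for explicit dimensions.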
Dependencies
- ADR-0023 (IPCQ): origin of the backend ahbm namespace.
- ADR-0026 (DPPolicy intra-device): the resolve_dp_policy signature used by D4 and the structural-coordinate representation of ShardSpec.
- ADR-0027 (Megatron TP + scheduler): the implementation baseline for worker scheduling, mp.spawn, collective drain, and exception cleanup.
Non-goals
- Modifying the IPCQ protocol: ADR-0023 remains as-is.
- Cleaning up DPPolicy fields: ADR-0026.
- Megatron-style TP: ADR-0027.
- Worker scheduling / spawn / drain / exception cleanup: ADR-0027 D0/D1.
- Collective algorithm implementation: ADR-0032.
- Multi-node (cross-process): single process only.
Consequences
Positive
- Bench = real PyTorch DDP (from the public-API point of view).
- Greenlet-local rank: enables cross-rank correctness within the 1-process model.
- Structural placement coordinates: lets the other ADRs (ADR-0026 /
ADR-0027 / ADR-0032) operate consistently on top of the
(sip, cube, pe) 3-tuple.
Neutral
- IPCQ PE-level protocol (ADR-0023) is unchanged.
- IO_CPU role is unchanged (existing transit behavior preserved).