Files
kernbench2/docs/adr/ADR-0026-par-dppolicy-intra-device.md
ywkang a796c1d2f7 ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/
Establish English as the canonical ADR language with Korean translations
held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror).
Promotion from adr-proposed/ to adr/ now writes English to adr/ and the
Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.

- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English,
  2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix
  dropped). ADR-0023 EN regenerated against KO source which had newer
  HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark
  docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline
  section covering bidirectional sync, conflict resolution (EN wins),
  and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair
  completeness, filename mirroring, ADR-ID match, Status byte-equality.
  Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF
  normalization, em-dash title separator, underscore-slug edge case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:38:44 -07:00

11 KiB
Raw Permalink Blame History

ADR-0026: DPPolicy = Intra-Device Only — remove sip/num_sips fields

Status

Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)

Context

Goal

Clarify DPPolicy as a pure intra-device abstraction that only expresses cube × PE distribution within a single device (SIP). Inter-SIP distribution (TP) is split into a separate layer (handled by ADR-0024's torch.ahbm.set_device(rank) or by ADR-0027's Megatron-style parallel layers).

Decision

D1. Remove sip + num_sips fields from DPPolicy

@dataclass(frozen=True)
class DPPolicy:
    """Intra-device (cube × PE) data-parallel policy.

    SIP-level placement is controlled by ``torch.ahbm.set_device(rank)``
    (ADR-0024 D3) and, for model-level TP, by Megatron-style parallel
    layers (ADR-0027). DPPolicy does not cross SIP boundaries.
    """
    cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
    pe: Literal["replicate", "column_wise", "row_wise"] = "replicate"
    num_pes: int | None = None
    num_cubes: int | None = None

Removed fields: sip, num_sips.

D2. ShardSpec — structural (sip, cube, pe) coordinates, pe_index fully removed

The current ShardSpec.pe_index is a global flat index (sip × cubes × pes + cube × pes + pe). This is the form ADR-0024 D4 flagged as "abstraction leakage".

This ADR redefines ShardSpec in structural coordinates and does not even leave pe_index as a property:

# src/kernbench/policy/placement/dp.py (after)
@dataclass(frozen=True)
class ShardSpec:
    """Structural shard placement — intra-SIP (cube × PE) coord.

    Global-flat `pe_index` was removed in ADR-0026. Callers must use
    structural coords (sip, cube, pe) directly. If a flat integer key is
    needed (e.g. dict lookup), compute it explicitly at the call site.
    """
    sip: int              # structural — which SIP this shard lives on
    cube: int             # local within SIP
    pe: int               # local within cube
    offset_bytes: int
    nbytes: int

Core principle:

  • The identity of ShardSpec is the (sip, cube, pe) 3-tuple.
  • No pe_index property either — blocks silent semantics drift.
  • Existing callers expecting global-flat get an immediate AttributeError on .pe_index access → forced migration to structural coordinates.
  • Local contexts that genuinely need a flat integer key (e.g. internal dict lookup) explicitly compute spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe at the call site.

Justification for removing the property: KernBench is an internal project with a limited number of call sites. Explicit breakage (AttributeError) is much safer than the risk of silent drift (semantics change while the type stays int).

D3. resolve_dp_policy takes target_sip and produces structural coordinates

Implements the contract of ADR-0024 D4. No post-hoc shifting.

# src/kernbench/policy/placement/dp.py (after)

@dataclass(frozen=True)
class _LocalPeShard:
    """Internal — return value of the PE resolver. Cube-local PE id + payload."""
    local_pe: int                  # cube-local PE index (0..num_pe-1)
    offset_bytes: int
    nbytes: int


def resolve_dp_policy(
    policy: DPPolicy,
    *,
    shape: tuple[int, int],
    itemsize: int,
    num_pe: int,
    num_cubes: int = 1,
    target_sip: int,       # NEW — explicitly state which SIP to place on
) -> list[ShardSpec]:
    """2-level resolution (cube × PE) on a specified SIP.

    Returns ShardSpecs with structural coords (sip=target_sip, cube, pe).
    No SIP-level split — DPPolicy is intra-device only.
    """
    resolver = _PE_RESOLVERS[policy.pe]
    all_shards: list[ShardSpec] = []

    # Level 1: cube within SIP
    cube_splits = _split_shape(policy.cube, shape, num_cubes, itemsize)

    for cube_id, (cube_shape, cube_offset) in enumerate(cube_splits):
        # Level 2: PE within cube — resolver returns _LocalPeShard (local_pe)
        local_shards = resolver(shape=cube_shape, itemsize=itemsize,
                                 num_pe=num_pe)

        for ls in local_shards:
            all_shards.append(ShardSpec(
                sip=target_sip,                   # from caller (current_device)
                cube=cube_id,                     # local within SIP
                pe=ls.local_pe,                   # local within cube (explicit name)
                offset_bytes=cube_offset + ls.offset_bytes,
                nbytes=ls.nbytes,
            ))

    return all_shards

Internal resolvers (column_wise, row_wise, replicate) return a list of _LocalPeShard — the local_pe field name makes it explicit that this is a "cube-local PE identifier". This resolves the previous confusion with the name ShardSpec.pe_index.

Naming convention summary (whole ADR):

  • ShardSpec.pe: the final external API — cube-local PE (structural coord)
  • _LocalPeShard.local_pe: the same meaning at the internal resolver stage
  • pe_index: removed. Not retained anywhere, internal or external (additional benefit of preventing silent drift: the name does not reappear).

D4. _create_tensor — placement directly in structural coordinates

Continuation of ADR-0024 D4. Post-hoc shifting removed; structural coordinates are specified directly at the resolve_dp_policy call site.

# context.py _create_tensor (after)
current_sip = self.ahbm.current_device()
if current_sip is None:
    # Single-driver fallback (consistent with ADR-0024 D2).
    # In launcher-based code, forgetting set_device() silently sticks the
    # tensor on SIP 0 — emit a warning in debug mode.
    if os.environ.get("KERNBENCH_DEBUG"):
        import warnings
        warnings.warn(
            "torch.ahbm.current_device() is None; defaulting to SIP 0. "
            "If this is a multi-rank launcher context, you likely forgot "
            "torch.ahbm.set_device(rank) inside the worker.",
            stacklevel=2,
        )
    current_sip = 0

placement = resolve_dp_policy(
    dp,
    shape=shape_2d,
    itemsize=itemsize,
    num_pe=eff_num_pe,
    num_cubes=eff_num_cubes,
    target_sip=current_sip,          # ← structural coord specified up front
)

# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
# The old post-hoc shifting block is removed entirely.

Every tensor is placed on the current device's SIP. If you need a multi-SIP tensor, use the TP primitive of ADR-0027.

Trade-off of the single-driver fallback: When set_device is not called, defaulting to SIP 0 is kept for compatibility with existing single-driver tests. With KERNBENCH_DEBUG=1, a warning is emitted so that accidentally omitting set_device in a launcher context — which would silently place the tensor on the wrong SIP — can be detected.

D5. Downstream — allocator lookup by structural tuple key

Existing deploy_tensor (src/kernbench/runtime_api/tensor.py):

for spec in placement:
    alloc = allocators[spec.pe_index]       # ← AttributeError (property removed)

With pe_index gone, migration to structural coordinates is forced:

for spec in placement:
    alloc = allocators[(spec.sip, spec.cube, spec.pe)]

The dict population in _ensure_allocators is also tuple-keyed:

# context.py _ensure_allocators (after)
for sip_id in sip_range:
    for cube_id in range(cubes_per_sip):
        for pe_id in range(pes_per_cube):
            self._allocators[(sip_id, cube_id, pe_id)] = PEMemAllocator(
                rack_id=0, sip_id=sip_id, cube_id=cube_id, pe_id=pe_id, cfg=cfg,
            )

_free_tensor is the same: the old flat_idx = sip * ... + cube * ... + pe computation block is removed, and (shard.sip, shard.cube, shard.pe) is used directly.

Tuple vs dataclass PEIdentity: Recommend the tuple — it is simple and hashable out of the box. A PEIdentity value object has the upside of an explicit type, but the boilerplate is large and it is currently the only key of the allocator dict, so it would be over-engineering. Keep the tuple.

D7. Backward compatibility — none (cleanup ADR)

This ADR is a breaking change.

  1. DPPolicy(sip=...) or DPPolicy(num_sips=...)TypeError
  2. ShardSpec.pe_index access → AttributeError

Both are immediate, explicit breakage. No deprecation warning / fallback path. KernBench is an internal project with a bounded set of call sites, so migration happens in one pass.

Blocking silent drift is the main upside of fully removing the property: code that expected a global flat could otherwise silently receive a SIP-local result and index incorrectly — that possibility is eliminated.

Dependencies

  • ADR-0024 (launcher): set_device(rank) and current-device scoping provide the SIP placement mechanism. This ADR sits on top and narrows DPPolicy to pure intra-device.
  • ADR-0027 (Megatron TP): the alternative path when a tensor spans multiple SIPs. After this ADR is applied, multi-SIP use cases move to ADR-0027.

Non-goals

  • Redesign of DPPolicy.cube / pe: existing replicate/column_wise/row_wise semantics are kept.
  • Tiling policy consolidation: tiled_column_major / tiled_row_major stay as they are.
  • New multi-device tensor abstraction: a DTensor-like is ADR-0028.

Open questions

  • Default value of current_sip in _create_tensor: for calls without set_device, whether to fall back to rank=0 (SIP 0) or to raise an error. The recommendation is fallback (compatibility with existing single-driver tests).
  • Scope of test_sip_parallel.py rewrite: porting the existing unit tests to the launcher base while preserving their intent requires additional fixtures. Scoped as separate work.
  • Meaning of num_sips=None on DPPolicy: once the field is gone, the concept of num_sips disappears entirely. The explicit answer for expressing multi-SIP is to use the TP primitive of ADR-0027.

Resolved (items that were open in earlier revs):

  • Whether to keep the ShardSpec.pe_index propertyfully removed (D2)
  • Form of _ensure_allocators dict keytuple (sip, cube, pe) (D5)

Consequences

Positive

  • Clean conceptual separation: DPPolicy = intra-device, TP = inter-device.
  • API simplification: about a 33% reduction in DPPolicy constructor fields.
  • Structural-coordinate consistency: ShardSpec is expressed as a (sip, cube, pe) tuple → abstraction leakage resolved (the ADR-0024 D4 contract is satisfied).
  • Clear meaning of pe_index: the single interpretation is SIP-local. If global-flat is needed, it must be made explicit.
  • Launcher-model consistency: ADR-0024's "1 worker per SIP" model is the sole SIP-boundary control mechanism.

Negative

  • Breaking change (explicit): DPPolicy(sip=...)TypeError, spec.pe_indexAttributeError. All callers need to be fixed at once.
  • ShardSpec schema change: a single pe_index field becomes three fields sip/cube/pe. Cascading edits downstream (deploy_tensor, _free_tensor, _ensure_allocators, allocators dict key, etc.).
  • No silent drift: with the property fully removed, runtime failure is immediate → migration leakage is blocked at the source. (Not a negative but an explicit tradeoff.)
  • The cost of rewriting test_sip_parallel.py.

Neutral

  • The meaning of the existing cube / pe fields is unchanged.