# ADR-0026: DPPolicy = Intra-Device Only — remove sip/num_sips fields ## Status Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail) ## Context ### Goal Clarify `DPPolicy` as a pure intra-device abstraction that only expresses **cube × PE distribution within a single device (SIP)**. Inter-SIP distribution (TP) is split into a separate layer (handled by ADR-0024's `torch.ahbm.set_device(rank)` or by ADR-0027's Megatron-style parallel layers). ## Decision ### D1. Remove `sip` + `num_sips` fields from `DPPolicy` ```python @dataclass(frozen=True) class DPPolicy: """Intra-device (cube × PE) data-parallel policy. SIP-level placement is controlled by ``torch.ahbm.set_device(rank)`` (ADR-0024 D3) and, for model-level TP, by Megatron-style parallel layers (ADR-0027). DPPolicy does not cross SIP boundaries. """ cube: Literal["replicate", "column_wise", "row_wise"] = "replicate" pe: Literal["replicate", "column_wise", "row_wise"] = "replicate" num_pes: int | None = None num_cubes: int | None = None ``` Removed fields: `sip`, `num_sips`. ### D2. `ShardSpec` — structural (sip, cube, pe) coordinates, `pe_index` fully removed The current `ShardSpec.pe_index` is a **global flat index** (`sip × cubes × pes + cube × pes + pe`). This is the form ADR-0024 D4 flagged as "abstraction leakage". This ADR **redefines ShardSpec in structural coordinates** and **does not even leave `pe_index` as a property**: ```python # src/kernbench/policy/placement/dp.py (after) @dataclass(frozen=True) class ShardSpec: """Structural shard placement — intra-SIP (cube × PE) coord. Global-flat `pe_index` was removed in ADR-0026. Callers must use structural coords (sip, cube, pe) directly. If a flat integer key is needed (e.g. dict lookup), compute it explicitly at the call site. """ sip: int # structural — which SIP this shard lives on cube: int # local within SIP pe: int # local within cube offset_bytes: int nbytes: int ``` **Core principle**: - The identity of ShardSpec is the `(sip, cube, pe)` 3-tuple. - **No `pe_index` property either** — blocks silent semantics drift. - Existing callers expecting global-flat get an **immediate `AttributeError`** on `.pe_index` access → forced migration to structural coordinates. - Local contexts that genuinely need a flat integer key (e.g. internal dict lookup) explicitly compute `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe` at the call site. **Justification for removing the property**: KernBench is an internal project with a limited number of call sites. Explicit breakage (AttributeError) is much safer than the risk of silent drift (semantics change while the type stays int). ### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates Implements the contract of ADR-0024 D4. No post-hoc shifting. ```python # src/kernbench/policy/placement/dp.py (after) @dataclass(frozen=True) class _LocalPeShard: """Internal — return value of the PE resolver. Cube-local PE id + payload.""" local_pe: int # cube-local PE index (0..num_pe-1) offset_bytes: int nbytes: int def resolve_dp_policy( policy: DPPolicy, *, shape: tuple[int, int], itemsize: int, num_pe: int, num_cubes: int = 1, target_sip: int, # NEW — explicitly state which SIP to place on ) -> list[ShardSpec]: """2-level resolution (cube × PE) on a specified SIP. Returns ShardSpecs with structural coords (sip=target_sip, cube, pe). No SIP-level split — DPPolicy is intra-device only. """ resolver = _PE_RESOLVERS[policy.pe] all_shards: list[ShardSpec] = [] # Level 1: cube within SIP cube_splits = _split_shape(policy.cube, shape, num_cubes, itemsize) for cube_id, (cube_shape, cube_offset) in enumerate(cube_splits): # Level 2: PE within cube — resolver returns _LocalPeShard (local_pe) local_shards = resolver(shape=cube_shape, itemsize=itemsize, num_pe=num_pe) for ls in local_shards: all_shards.append(ShardSpec( sip=target_sip, # from caller (current_device) cube=cube_id, # local within SIP pe=ls.local_pe, # local within cube (explicit name) offset_bytes=cube_offset + ls.offset_bytes, nbytes=ls.nbytes, )) return all_shards ``` **Internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a list of `_LocalPeShard` — the `local_pe` field name makes it **explicit that this is a "cube-local PE identifier"**. This resolves the previous confusion with the name `ShardSpec.pe_index`. **Naming convention summary** (whole ADR): - `ShardSpec.pe`: the final external API — cube-local PE (structural coord) - `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage - `pe_index`: **removed**. Not retained anywhere, internal or external (additional benefit of preventing silent drift: the name does not reappear). ### D4. `_create_tensor` — placement directly in structural coordinates Continuation of ADR-0024 D4. Post-hoc shifting removed; structural coordinates are specified directly at the `resolve_dp_policy` call site. ```python # context.py _create_tensor (after) current_sip = self.ahbm.current_device() if current_sip is None: # Single-driver fallback (consistent with ADR-0024 D2). # In launcher-based code, forgetting set_device() silently sticks the # tensor on SIP 0 — emit a warning in debug mode. if os.environ.get("KERNBENCH_DEBUG"): import warnings warnings.warn( "torch.ahbm.current_device() is None; defaulting to SIP 0. " "If this is a multi-rank launcher context, you likely forgot " "torch.ahbm.set_device(rank) inside the worker.", stacklevel=2, ) current_sip = 0 placement = resolve_dp_policy( dp, shape=shape_2d, itemsize=itemsize, num_pe=eff_num_pe, num_cubes=eff_num_cubes, target_sip=current_sip, # ← structural coord specified up front ) # Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local). # The old post-hoc shifting block is removed entirely. ``` **Every** tensor is placed on the current device's SIP. If you need a multi-SIP tensor, use the TP primitive of ADR-0027. **Trade-off of the single-driver fallback**: When set_device is not called, defaulting to SIP 0 is kept for compatibility with existing single-driver tests. With `KERNBENCH_DEBUG=1`, a warning is emitted so that accidentally omitting set_device in a launcher context — which would silently place the tensor on the wrong SIP — can be detected. ### D5. Downstream — allocator lookup by structural tuple key Existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`): ```python for spec in placement: alloc = allocators[spec.pe_index] # ← AttributeError (property removed) ``` With `pe_index` gone, migration to structural coordinates is **forced**: ```python for spec in placement: alloc = allocators[(spec.sip, spec.cube, spec.pe)] ``` The dict population in `_ensure_allocators` is also tuple-keyed: ```python # context.py _ensure_allocators (after) for sip_id in sip_range: for cube_id in range(cubes_per_sip): for pe_id in range(pes_per_cube): self._allocators[(sip_id, cube_id, pe_id)] = PEMemAllocator( rack_id=0, sip_id=sip_id, cube_id=cube_id, pe_id=pe_id, cfg=cfg, ) ``` `_free_tensor` is the same: the old `flat_idx = sip * ... + cube * ... + pe` computation block is removed, and `(shard.sip, shard.cube, shard.pe)` is used directly. **Tuple vs dataclass `PEIdentity`**: Recommend the tuple — it is simple and hashable out of the box. A `PEIdentity` value object has the upside of an explicit type, but the boilerplate is large and it is currently the only key of the allocator dict, so it would be over-engineering. Keep the tuple. ### D7. Backward compatibility — none (cleanup ADR) This ADR is a **breaking change**. 1. `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)` → `TypeError` 2. `ShardSpec.pe_index` access → `AttributeError` Both are **immediate, explicit breakage**. No deprecation warning / fallback path. KernBench is an internal project with a bounded set of call sites, so migration happens in one pass. **Blocking silent drift** is the main upside of fully removing the property: code that expected a global flat could otherwise silently receive a SIP-local result and index incorrectly — that possibility is eliminated. ## Dependencies - **ADR-0024** (launcher): `set_device(rank)` and current-device scoping provide the SIP placement mechanism. This ADR sits on top and narrows DPPolicy to pure intra-device. - **ADR-0027** (Megatron TP): the alternative path when a tensor spans multiple SIPs. After this ADR is applied, multi-SIP use cases move to ADR-0027. --- ## Non-goals - **Redesign of `DPPolicy.cube` / `pe`**: existing replicate/column_wise/row_wise semantics are kept. - **Tiling policy consolidation**: `tiled_column_major` / `tiled_row_major` stay as they are. - **New multi-device tensor abstraction**: a DTensor-like is ADR-0028. --- ## Open questions - **Default value of current_sip in `_create_tensor`**: for calls without set_device, whether to fall back to rank=0 (SIP 0) or to raise an error. The recommendation is fallback (compatibility with existing single-driver tests). - **Scope of `test_sip_parallel.py` rewrite**: porting the existing unit tests to the launcher base while preserving their intent requires additional fixtures. Scoped as separate work. - **Meaning of `num_sips=None` on `DPPolicy`**: once the field is gone, the concept of `num_sips` disappears entirely. The explicit answer for expressing multi-SIP is to use the TP primitive of ADR-0027. **Resolved (items that were open in earlier revs)**: - ~~Whether to keep the `ShardSpec.pe_index` property~~ → **fully removed** (D2) - ~~Form of `_ensure_allocators` dict key~~ → **tuple `(sip, cube, pe)`** (D5) --- ## Consequences ### Positive - **Clean conceptual separation**: DPPolicy = intra-device, TP = inter-device. - **API simplification**: about a 33% reduction in DPPolicy constructor fields. - **Structural-coordinate consistency**: ShardSpec is expressed as a `(sip, cube, pe)` tuple → abstraction leakage resolved (the ADR-0024 D4 contract is satisfied). - **Clear meaning of `pe_index`**: the single interpretation is SIP-local. If global-flat is needed, it must be made explicit. - **Launcher-model consistency**: ADR-0024's "1 worker per SIP" model is the sole SIP-boundary control mechanism. ### Negative - **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`, `spec.pe_index` → `AttributeError`. All callers need to be fixed at once. - **ShardSpec schema change**: a single `pe_index` field becomes three fields `sip`/`cube`/`pe`. Cascading edits downstream (`deploy_tensor`, `_free_tensor`, `_ensure_allocators`, `allocators` dict key, etc.). - **No silent drift**: with the property fully removed, runtime failure is immediate → migration leakage is blocked at the source. (Not a negative but an explicit tradeoff.) - The cost of rewriting `test_sip_parallel.py`. ### Neutral - The meaning of the existing `cube` / `pe` fields is unchanged.