a796c1d2f7
Establish English as the canonical ADR language with Korean translations held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror). Promotion from adr-proposed/ to adr/ now writes English to adr/ and the Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md. - Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English, 2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix dropped). ADR-0023 EN regenerated against KO source which had newer HW Realization Notes (D16-D23) section. - docs/adr-history/ left frozen by design (transitional state). - CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline section covering bidirectional sync, conflict resolution (EN wins), and proposed-language freedom. - tools/verify_adr_lang_pairs.py: new verification tool checking pair completeness, filename mirroring, ADR-ID match, Status byte-equality. Pre-commit hook intentionally not added; run on demand or in CI. - tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF normalization, em-dash title separator, underscore-slug edge case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
316 lines
11 KiB
Markdown
316 lines
11 KiB
Markdown
# ADR-0026: DPPolicy = Intra-Device Only — remove sip/num_sips fields
|
||
|
||
## Status
|
||
|
||
Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)
|
||
|
||
## Context
|
||
|
||
### Goal
|
||
|
||
Clarify `DPPolicy` as a pure intra-device abstraction that only expresses
|
||
**cube × PE distribution within a single device (SIP)**. Inter-SIP
|
||
distribution (TP) is split into a separate layer (handled by ADR-0024's
|
||
`torch.ahbm.set_device(rank)` or by ADR-0027's Megatron-style parallel
|
||
layers).
|
||
|
||
## Decision
|
||
|
||
### D1. Remove `sip` + `num_sips` fields from `DPPolicy`
|
||
|
||
```python
|
||
@dataclass(frozen=True)
|
||
class DPPolicy:
|
||
"""Intra-device (cube × PE) data-parallel policy.
|
||
|
||
SIP-level placement is controlled by ``torch.ahbm.set_device(rank)``
|
||
(ADR-0024 D3) and, for model-level TP, by Megatron-style parallel
|
||
layers (ADR-0027). DPPolicy does not cross SIP boundaries.
|
||
"""
|
||
cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
|
||
pe: Literal["replicate", "column_wise", "row_wise"] = "replicate"
|
||
num_pes: int | None = None
|
||
num_cubes: int | None = None
|
||
```
|
||
|
||
Removed fields: `sip`, `num_sips`.
|
||
|
||
### D2. `ShardSpec` — structural (sip, cube, pe) coordinates, `pe_index` fully removed
|
||
|
||
The current `ShardSpec.pe_index` is a **global flat index**
|
||
(`sip × cubes × pes + cube × pes + pe`). This is the form ADR-0024 D4
|
||
flagged as "abstraction leakage".
|
||
|
||
This ADR **redefines ShardSpec in structural coordinates** and **does
|
||
not even leave `pe_index` as a property**:
|
||
|
||
```python
|
||
# src/kernbench/policy/placement/dp.py (after)
|
||
@dataclass(frozen=True)
|
||
class ShardSpec:
|
||
"""Structural shard placement — intra-SIP (cube × PE) coord.
|
||
|
||
Global-flat `pe_index` was removed in ADR-0026. Callers must use
|
||
structural coords (sip, cube, pe) directly. If a flat integer key is
|
||
needed (e.g. dict lookup), compute it explicitly at the call site.
|
||
"""
|
||
sip: int # structural — which SIP this shard lives on
|
||
cube: int # local within SIP
|
||
pe: int # local within cube
|
||
offset_bytes: int
|
||
nbytes: int
|
||
```
|
||
|
||
**Core principle**:
|
||
- The identity of ShardSpec is the `(sip, cube, pe)` 3-tuple.
|
||
- **No `pe_index` property either** — blocks silent semantics drift.
|
||
- Existing callers expecting global-flat get an **immediate
|
||
`AttributeError`** on `.pe_index` access → forced migration to
|
||
structural coordinates.
|
||
- Local contexts that genuinely need a flat integer key (e.g. internal
|
||
dict lookup) explicitly compute
|
||
`spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe` at the call
|
||
site.
|
||
|
||
**Justification for removing the property**: KernBench is an internal
|
||
project with a limited number of call sites. Explicit breakage
|
||
(AttributeError) is much safer than the risk of silent drift (semantics
|
||
change while the type stays int).
|
||
|
||
### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates
|
||
|
||
Implements the contract of ADR-0024 D4. No post-hoc shifting.
|
||
|
||
```python
|
||
# src/kernbench/policy/placement/dp.py (after)
|
||
|
||
@dataclass(frozen=True)
|
||
class _LocalPeShard:
|
||
"""Internal — return value of the PE resolver. Cube-local PE id + payload."""
|
||
local_pe: int # cube-local PE index (0..num_pe-1)
|
||
offset_bytes: int
|
||
nbytes: int
|
||
|
||
|
||
def resolve_dp_policy(
|
||
policy: DPPolicy,
|
||
*,
|
||
shape: tuple[int, int],
|
||
itemsize: int,
|
||
num_pe: int,
|
||
num_cubes: int = 1,
|
||
target_sip: int, # NEW — explicitly state which SIP to place on
|
||
) -> list[ShardSpec]:
|
||
"""2-level resolution (cube × PE) on a specified SIP.
|
||
|
||
Returns ShardSpecs with structural coords (sip=target_sip, cube, pe).
|
||
No SIP-level split — DPPolicy is intra-device only.
|
||
"""
|
||
resolver = _PE_RESOLVERS[policy.pe]
|
||
all_shards: list[ShardSpec] = []
|
||
|
||
# Level 1: cube within SIP
|
||
cube_splits = _split_shape(policy.cube, shape, num_cubes, itemsize)
|
||
|
||
for cube_id, (cube_shape, cube_offset) in enumerate(cube_splits):
|
||
# Level 2: PE within cube — resolver returns _LocalPeShard (local_pe)
|
||
local_shards = resolver(shape=cube_shape, itemsize=itemsize,
|
||
num_pe=num_pe)
|
||
|
||
for ls in local_shards:
|
||
all_shards.append(ShardSpec(
|
||
sip=target_sip, # from caller (current_device)
|
||
cube=cube_id, # local within SIP
|
||
pe=ls.local_pe, # local within cube (explicit name)
|
||
offset_bytes=cube_offset + ls.offset_bytes,
|
||
nbytes=ls.nbytes,
|
||
))
|
||
|
||
return all_shards
|
||
```
|
||
|
||
**Internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a
|
||
list of `_LocalPeShard` — the `local_pe` field name makes it **explicit
|
||
that this is a "cube-local PE identifier"**. This resolves the previous
|
||
confusion with the name `ShardSpec.pe_index`.
|
||
|
||
**Naming convention summary** (whole ADR):
|
||
- `ShardSpec.pe`: the final external API — cube-local PE (structural coord)
|
||
- `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage
|
||
- `pe_index`: **removed**. Not retained anywhere, internal or external
|
||
(additional benefit of preventing silent drift: the name does not
|
||
reappear).
|
||
|
||
### D4. `_create_tensor` — placement directly in structural coordinates
|
||
|
||
Continuation of ADR-0024 D4. Post-hoc shifting removed; structural
|
||
coordinates are specified directly at the `resolve_dp_policy` call site.
|
||
|
||
```python
|
||
# context.py _create_tensor (after)
|
||
current_sip = self.ahbm.current_device()
|
||
if current_sip is None:
|
||
# Single-driver fallback (consistent with ADR-0024 D2).
|
||
# In launcher-based code, forgetting set_device() silently sticks the
|
||
# tensor on SIP 0 — emit a warning in debug mode.
|
||
if os.environ.get("KERNBENCH_DEBUG"):
|
||
import warnings
|
||
warnings.warn(
|
||
"torch.ahbm.current_device() is None; defaulting to SIP 0. "
|
||
"If this is a multi-rank launcher context, you likely forgot "
|
||
"torch.ahbm.set_device(rank) inside the worker.",
|
||
stacklevel=2,
|
||
)
|
||
current_sip = 0
|
||
|
||
placement = resolve_dp_policy(
|
||
dp,
|
||
shape=shape_2d,
|
||
itemsize=itemsize,
|
||
num_pe=eff_num_pe,
|
||
num_cubes=eff_num_cubes,
|
||
target_sip=current_sip, # ← structural coord specified up front
|
||
)
|
||
|
||
# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
|
||
# The old post-hoc shifting block is removed entirely.
|
||
```
|
||
|
||
**Every** tensor is placed on the current device's SIP. If you need a
|
||
multi-SIP tensor, use the TP primitive of ADR-0027.
|
||
|
||
**Trade-off of the single-driver fallback**: When set_device is not
|
||
called, defaulting to SIP 0 is kept for compatibility with existing
|
||
single-driver tests. With `KERNBENCH_DEBUG=1`, a warning is emitted so
|
||
that accidentally omitting set_device in a launcher context — which would
|
||
silently place the tensor on the wrong SIP — can be detected.
|
||
|
||
### D5. Downstream — allocator lookup by structural tuple key
|
||
|
||
Existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
|
||
|
||
```python
|
||
for spec in placement:
|
||
alloc = allocators[spec.pe_index] # ← AttributeError (property removed)
|
||
```
|
||
|
||
With `pe_index` gone, migration to structural coordinates is **forced**:
|
||
|
||
```python
|
||
for spec in placement:
|
||
alloc = allocators[(spec.sip, spec.cube, spec.pe)]
|
||
```
|
||
|
||
The dict population in `_ensure_allocators` is also tuple-keyed:
|
||
|
||
```python
|
||
# context.py _ensure_allocators (after)
|
||
for sip_id in sip_range:
|
||
for cube_id in range(cubes_per_sip):
|
||
for pe_id in range(pes_per_cube):
|
||
self._allocators[(sip_id, cube_id, pe_id)] = PEMemAllocator(
|
||
rack_id=0, sip_id=sip_id, cube_id=cube_id, pe_id=pe_id, cfg=cfg,
|
||
)
|
||
```
|
||
|
||
`_free_tensor` is the same: the old
|
||
`flat_idx = sip * ... + cube * ... + pe` computation block is removed,
|
||
and `(shard.sip, shard.cube, shard.pe)` is used directly.
|
||
|
||
**Tuple vs dataclass `PEIdentity`**: Recommend the tuple — it is simple
|
||
and hashable out of the box. A `PEIdentity` value object has the upside
|
||
of an explicit type, but the boilerplate is large and it is currently
|
||
the only key of the allocator dict, so it would be over-engineering.
|
||
Keep the tuple.
|
||
|
||
### D7. Backward compatibility — none (cleanup ADR)
|
||
|
||
This ADR is a **breaking change**.
|
||
|
||
1. `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)` → `TypeError`
|
||
2. `ShardSpec.pe_index` access → `AttributeError`
|
||
|
||
Both are **immediate, explicit breakage**. No deprecation warning /
|
||
fallback path. KernBench is an internal project with a bounded set of
|
||
call sites, so migration happens in one pass.
|
||
|
||
**Blocking silent drift** is the main upside of fully removing the
|
||
property: code that expected a global flat could otherwise silently
|
||
receive a SIP-local result and index incorrectly — that possibility is
|
||
eliminated.
|
||
|
||
## Dependencies
|
||
|
||
- **ADR-0024** (launcher): `set_device(rank)` and current-device scoping
|
||
provide the SIP placement mechanism. This ADR sits on top and narrows
|
||
DPPolicy to pure intra-device.
|
||
- **ADR-0027** (Megatron TP): the alternative path when a tensor spans
|
||
multiple SIPs. After this ADR is applied, multi-SIP use cases move to
|
||
ADR-0027.
|
||
|
||
---
|
||
|
||
## Non-goals
|
||
|
||
- **Redesign of `DPPolicy.cube` / `pe`**: existing
|
||
replicate/column_wise/row_wise semantics are kept.
|
||
- **Tiling policy consolidation**: `tiled_column_major` /
|
||
`tiled_row_major` stay as they are.
|
||
- **New multi-device tensor abstraction**: a DTensor-like is ADR-0028.
|
||
|
||
---
|
||
|
||
## Open questions
|
||
|
||
- **Default value of current_sip in `_create_tensor`**: for calls without
|
||
set_device, whether to fall back to rank=0 (SIP 0) or to raise an
|
||
error. The recommendation is fallback (compatibility with existing
|
||
single-driver tests).
|
||
- **Scope of `test_sip_parallel.py` rewrite**: porting the existing unit
|
||
tests to the launcher base while preserving their intent requires
|
||
additional fixtures. Scoped as separate work.
|
||
- **Meaning of `num_sips=None` on `DPPolicy`**: once the field is gone,
|
||
the concept of `num_sips` disappears entirely. The explicit answer for
|
||
expressing multi-SIP is to use the TP primitive of ADR-0027.
|
||
|
||
**Resolved (items that were open in earlier revs)**:
|
||
- ~~Whether to keep the `ShardSpec.pe_index` property~~ → **fully
|
||
removed** (D2)
|
||
- ~~Form of `_ensure_allocators` dict key~~ → **tuple `(sip, cube, pe)`**
|
||
(D5)
|
||
|
||
---
|
||
|
||
## Consequences
|
||
|
||
### Positive
|
||
|
||
- **Clean conceptual separation**: DPPolicy = intra-device, TP =
|
||
inter-device.
|
||
- **API simplification**: about a 33% reduction in DPPolicy constructor
|
||
fields.
|
||
- **Structural-coordinate consistency**: ShardSpec is expressed as a
|
||
`(sip, cube, pe)` tuple → abstraction leakage resolved (the ADR-0024
|
||
D4 contract is satisfied).
|
||
- **Clear meaning of `pe_index`**: the single interpretation is
|
||
SIP-local. If global-flat is needed, it must be made explicit.
|
||
- **Launcher-model consistency**: ADR-0024's "1 worker per SIP" model is
|
||
the sole SIP-boundary control mechanism.
|
||
|
||
### Negative
|
||
|
||
- **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`,
|
||
`spec.pe_index` → `AttributeError`. All callers need to be fixed at
|
||
once.
|
||
- **ShardSpec schema change**: a single `pe_index` field becomes three
|
||
fields `sip`/`cube`/`pe`. Cascading edits downstream (`deploy_tensor`,
|
||
`_free_tensor`, `_ensure_allocators`, `allocators` dict key, etc.).
|
||
- **No silent drift**: with the property fully removed, runtime failure
|
||
is immediate → migration leakage is blocked at the source. (Not a
|
||
negative but an explicit tradeoff.)
|
||
- The cost of rewriting `test_sip_parallel.py`.
|
||
|
||
### Neutral
|
||
|
||
- The meaning of the existing `cube` / `pe` fields is unchanged.
|