Files
kernbench2/docs/adr/ADR-0026-par-dppolicy-intra-device.md
ywkang a796c1d2f7 ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/
Establish English as the canonical ADR language with Korean translations
held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror).
Promotion from adr-proposed/ to adr/ now writes English to adr/ and the
Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.

- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English,
  2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix
  dropped). ADR-0023 EN regenerated against KO source which had newer
  HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark
  docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline
  section covering bidirectional sync, conflict resolution (EN wins),
  and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair
  completeness, filename mirroring, ADR-ID match, Status byte-equality.
  Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF
  normalization, em-dash title separator, underscore-slug edge case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:38:44 -07:00

316 lines
11 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ADR-0026: DPPolicy = Intra-Device Only — remove sip/num_sips fields
## Status
Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)
## Context
### Goal
Clarify `DPPolicy` as a pure intra-device abstraction that only expresses
**cube × PE distribution within a single device (SIP)**. Inter-SIP
distribution (TP) is split into a separate layer (handled by ADR-0024's
`torch.ahbm.set_device(rank)` or by ADR-0027's Megatron-style parallel
layers).
## Decision
### D1. Remove `sip` + `num_sips` fields from `DPPolicy`
```python
@dataclass(frozen=True)
class DPPolicy:
"""Intra-device (cube × PE) data-parallel policy.
SIP-level placement is controlled by ``torch.ahbm.set_device(rank)``
(ADR-0024 D3) and, for model-level TP, by Megatron-style parallel
layers (ADR-0027). DPPolicy does not cross SIP boundaries.
"""
cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
pe: Literal["replicate", "column_wise", "row_wise"] = "replicate"
num_pes: int | None = None
num_cubes: int | None = None
```
Removed fields: `sip`, `num_sips`.
### D2. `ShardSpec` — structural (sip, cube, pe) coordinates, `pe_index` fully removed
The current `ShardSpec.pe_index` is a **global flat index**
(`sip × cubes × pes + cube × pes + pe`). This is the form ADR-0024 D4
flagged as "abstraction leakage".
This ADR **redefines ShardSpec in structural coordinates** and **does
not even leave `pe_index` as a property**:
```python
# src/kernbench/policy/placement/dp.py (after)
@dataclass(frozen=True)
class ShardSpec:
"""Structural shard placement — intra-SIP (cube × PE) coord.
Global-flat `pe_index` was removed in ADR-0026. Callers must use
structural coords (sip, cube, pe) directly. If a flat integer key is
needed (e.g. dict lookup), compute it explicitly at the call site.
"""
sip: int # structural — which SIP this shard lives on
cube: int # local within SIP
pe: int # local within cube
offset_bytes: int
nbytes: int
```
**Core principle**:
- The identity of ShardSpec is the `(sip, cube, pe)` 3-tuple.
- **No `pe_index` property either** — blocks silent semantics drift.
- Existing callers expecting global-flat get an **immediate
`AttributeError`** on `.pe_index` access → forced migration to
structural coordinates.
- Local contexts that genuinely need a flat integer key (e.g. internal
dict lookup) explicitly compute
`spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe` at the call
site.
**Justification for removing the property**: KernBench is an internal
project with a limited number of call sites. Explicit breakage
(AttributeError) is much safer than the risk of silent drift (semantics
change while the type stays int).
### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates
Implements the contract of ADR-0024 D4. No post-hoc shifting.
```python
# src/kernbench/policy/placement/dp.py (after)
@dataclass(frozen=True)
class _LocalPeShard:
"""Internal — return value of the PE resolver. Cube-local PE id + payload."""
local_pe: int # cube-local PE index (0..num_pe-1)
offset_bytes: int
nbytes: int
def resolve_dp_policy(
policy: DPPolicy,
*,
shape: tuple[int, int],
itemsize: int,
num_pe: int,
num_cubes: int = 1,
target_sip: int, # NEW — explicitly state which SIP to place on
) -> list[ShardSpec]:
"""2-level resolution (cube × PE) on a specified SIP.
Returns ShardSpecs with structural coords (sip=target_sip, cube, pe).
No SIP-level split — DPPolicy is intra-device only.
"""
resolver = _PE_RESOLVERS[policy.pe]
all_shards: list[ShardSpec] = []
# Level 1: cube within SIP
cube_splits = _split_shape(policy.cube, shape, num_cubes, itemsize)
for cube_id, (cube_shape, cube_offset) in enumerate(cube_splits):
# Level 2: PE within cube — resolver returns _LocalPeShard (local_pe)
local_shards = resolver(shape=cube_shape, itemsize=itemsize,
num_pe=num_pe)
for ls in local_shards:
all_shards.append(ShardSpec(
sip=target_sip, # from caller (current_device)
cube=cube_id, # local within SIP
pe=ls.local_pe, # local within cube (explicit name)
offset_bytes=cube_offset + ls.offset_bytes,
nbytes=ls.nbytes,
))
return all_shards
```
**Internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a
list of `_LocalPeShard` — the `local_pe` field name makes it **explicit
that this is a "cube-local PE identifier"**. This resolves the previous
confusion with the name `ShardSpec.pe_index`.
**Naming convention summary** (whole ADR):
- `ShardSpec.pe`: the final external API — cube-local PE (structural coord)
- `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage
- `pe_index`: **removed**. Not retained anywhere, internal or external
(additional benefit of preventing silent drift: the name does not
reappear).
### D4. `_create_tensor` — placement directly in structural coordinates
Continuation of ADR-0024 D4. Post-hoc shifting removed; structural
coordinates are specified directly at the `resolve_dp_policy` call site.
```python
# context.py _create_tensor (after)
current_sip = self.ahbm.current_device()
if current_sip is None:
# Single-driver fallback (consistent with ADR-0024 D2).
# In launcher-based code, forgetting set_device() silently sticks the
# tensor on SIP 0 — emit a warning in debug mode.
if os.environ.get("KERNBENCH_DEBUG"):
import warnings
warnings.warn(
"torch.ahbm.current_device() is None; defaulting to SIP 0. "
"If this is a multi-rank launcher context, you likely forgot "
"torch.ahbm.set_device(rank) inside the worker.",
stacklevel=2,
)
current_sip = 0
placement = resolve_dp_policy(
dp,
shape=shape_2d,
itemsize=itemsize,
num_pe=eff_num_pe,
num_cubes=eff_num_cubes,
target_sip=current_sip, # ← structural coord specified up front
)
# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
# The old post-hoc shifting block is removed entirely.
```
**Every** tensor is placed on the current device's SIP. If you need a
multi-SIP tensor, use the TP primitive of ADR-0027.
**Trade-off of the single-driver fallback**: When set_device is not
called, defaulting to SIP 0 is kept for compatibility with existing
single-driver tests. With `KERNBENCH_DEBUG=1`, a warning is emitted so
that accidentally omitting set_device in a launcher context — which would
silently place the tensor on the wrong SIP — can be detected.
### D5. Downstream — allocator lookup by structural tuple key
Existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
```python
for spec in placement:
alloc = allocators[spec.pe_index] # ← AttributeError (property removed)
```
With `pe_index` gone, migration to structural coordinates is **forced**:
```python
for spec in placement:
alloc = allocators[(spec.sip, spec.cube, spec.pe)]
```
The dict population in `_ensure_allocators` is also tuple-keyed:
```python
# context.py _ensure_allocators (after)
for sip_id in sip_range:
for cube_id in range(cubes_per_sip):
for pe_id in range(pes_per_cube):
self._allocators[(sip_id, cube_id, pe_id)] = PEMemAllocator(
rack_id=0, sip_id=sip_id, cube_id=cube_id, pe_id=pe_id, cfg=cfg,
)
```
`_free_tensor` is the same: the old
`flat_idx = sip * ... + cube * ... + pe` computation block is removed,
and `(shard.sip, shard.cube, shard.pe)` is used directly.
**Tuple vs dataclass `PEIdentity`**: Recommend the tuple — it is simple
and hashable out of the box. A `PEIdentity` value object has the upside
of an explicit type, but the boilerplate is large and it is currently
the only key of the allocator dict, so it would be over-engineering.
Keep the tuple.
### D7. Backward compatibility — none (cleanup ADR)
This ADR is a **breaking change**.
1. `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)``TypeError`
2. `ShardSpec.pe_index` access → `AttributeError`
Both are **immediate, explicit breakage**. No deprecation warning /
fallback path. KernBench is an internal project with a bounded set of
call sites, so migration happens in one pass.
**Blocking silent drift** is the main upside of fully removing the
property: code that expected a global flat could otherwise silently
receive a SIP-local result and index incorrectly — that possibility is
eliminated.
## Dependencies
- **ADR-0024** (launcher): `set_device(rank)` and current-device scoping
provide the SIP placement mechanism. This ADR sits on top and narrows
DPPolicy to pure intra-device.
- **ADR-0027** (Megatron TP): the alternative path when a tensor spans
multiple SIPs. After this ADR is applied, multi-SIP use cases move to
ADR-0027.
---
## Non-goals
- **Redesign of `DPPolicy.cube` / `pe`**: existing
replicate/column_wise/row_wise semantics are kept.
- **Tiling policy consolidation**: `tiled_column_major` /
`tiled_row_major` stay as they are.
- **New multi-device tensor abstraction**: a DTensor-like is ADR-0028.
---
## Open questions
- **Default value of current_sip in `_create_tensor`**: for calls without
set_device, whether to fall back to rank=0 (SIP 0) or to raise an
error. The recommendation is fallback (compatibility with existing
single-driver tests).
- **Scope of `test_sip_parallel.py` rewrite**: porting the existing unit
tests to the launcher base while preserving their intent requires
additional fixtures. Scoped as separate work.
- **Meaning of `num_sips=None` on `DPPolicy`**: once the field is gone,
the concept of `num_sips` disappears entirely. The explicit answer for
expressing multi-SIP is to use the TP primitive of ADR-0027.
**Resolved (items that were open in earlier revs)**:
- ~~Whether to keep the `ShardSpec.pe_index` property~~ → **fully
removed** (D2)
- ~~Form of `_ensure_allocators` dict key~~ → **tuple `(sip, cube, pe)`**
(D5)
---
## Consequences
### Positive
- **Clean conceptual separation**: DPPolicy = intra-device, TP =
inter-device.
- **API simplification**: about a 33% reduction in DPPolicy constructor
fields.
- **Structural-coordinate consistency**: ShardSpec is expressed as a
`(sip, cube, pe)` tuple → abstraction leakage resolved (the ADR-0024
D4 contract is satisfied).
- **Clear meaning of `pe_index`**: the single interpretation is
SIP-local. If global-flat is needed, it must be made explicit.
- **Launcher-model consistency**: ADR-0024's "1 worker per SIP" model is
the sole SIP-boundary control mechanism.
### Negative
- **Breaking change (explicit)**: `DPPolicy(sip=...)``TypeError`,
`spec.pe_index``AttributeError`. All callers need to be fixed at
once.
- **ShardSpec schema change**: a single `pe_index` field becomes three
fields `sip`/`cube`/`pe`. Cascading edits downstream (`deploy_tensor`,
`_free_tensor`, `_ensure_allocators`, `allocators` dict key, etc.).
- **No silent drift**: with the property fully removed, runtime failure
is immediate → migration leakage is blocked at the source. (Not a
negative but an explicit tradeoff.)
- The cost of rewriting `test_sip_parallel.py`.
### Neutral
- The meaning of the existing `cube` / `pe` fields is unchanged.