ADR: introduce docs/history/, merge 0011+0018, prune migration cruft

- CLAUDE.md: add ADR Lifecycle subsection (superseded → docs/history/, immutable numbering, no renumber)
- ADR-0011: merge ADR-0018 content as "Address Model: LA" section alongside PA / VA; status notes VA model is currently implemented
- ADR-0018 / 0029 / 0031: moved to docs/history/ with status updates (0018 merged into 0011, 0029 superseded by 0032, 0031 absorbed into 0001 rev 2)
- ADR-0019: rewrite Context as PE-HBM connectivity decision (self-contained, no LA model framing)
- ADR-0019/0020/0021/0023/0025/0027: Status Proposed → Accepted (code verified) and prune Implementation Notes / Affected files / Test strategy / "현재 상태" ("Current state") sub-sections describing pre-impl state
- ADR-0024/0026: same migration-flavor cleanup; 0026 also drops D6 Migration and D8 docs-update sub-decisions
- ADR-0030: status simplified (blocker ADR-0031 now superseded)
- SPEC.md: R10 + §0.2 reflect PA / VA / LA model names
- ADR-0008/0012/0013: refresh ADR-0011 subtitle in Links

21 files changed, 553 insertions(+), 1290 deletions(-).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 11:42:45 -07:00
parent ecc57d050d
commit 22fd0d2b9d
23 changed files with 553 additions and 1290 deletions
@@ -2,7 +2,7 @@

 ## Status

-Proposed (Revision 7: resume invariant / non-recursive main-context wait invariant /
+Accepted (Revision 7: resume invariant / non-recursive main-context wait invariant /
 global barrier over-serialization tradeoff / TP forward yield-safety made explicit,
 2026-04-14)

@@ -19,20 +19,6 @@ Reasons for choosing Megatron-style:
 - The industry standard established by NVIDIA Megatron / DeepSpeed.
 - DTensor is declarative, so its design space is larger → adopt in stages.

-### Current state
-
- KernBench has no TP. The existing `DPPolicy.sip="column_wise"` path was removed in
-  ADR-0026. The TP primitives sit on top of the rank = SIP launcher (ADR-0024).
- ADR-0024 Phase B exposed a **worker-greenlet env.run re-entry bug**:
-  when a worker calls `ctx.wait(h)` (e.g. MmuMapMsg during tensor creation), `env.run`
-  runs in the worker context, so the kernel greenlet spawned at that point gets the
-  worker as its `_parent` and is orphaned. This is the root cause of the `ring_default_ws` strict xfail.
- `dist.all_reduce` already avoids this problem with the `_defer_wait=True` + worker yield pattern
-  ([distributed.py:119-134](src/kernbench/runtime_api/distributed.py#L119-L134)).
- A TP layer's forward calls `torch.launch("gemm", ...)` every time, followed by
-  `dist.all_reduce`, and this pattern repeats. Unless the worker-wait problem is
-  solved **without exception**, the TP sample fails on its first run.
-
 ### TP primitive spec (following Megatron-LM)

 - **ColumnParallelLinear**: the weight's **column (out_features)** axis across the TP ranks
@@ -907,155 +893,6 @@ review the PR.

 ---

-## Test strategy
-
-### T1. Unit: `tests/test_tp_parallel_state.py` (new)
-
- `initialize_model_parallel(ws)` passes only when `ws` matches world_size.
- `get_tensor_model_parallel_rank()` returns the greenlet-local rank (ADR-0024 D9
-  regression).
- `get_tensor_model_parallel_world_size()` fails appropriately when uninitialized.
-
-### T2. Unit: `tests/test_tp_layers.py` (new)
-
-**Shape / structural checks**:
-
- Per-rank `ColumnParallelLinear(in=256, out=512).weight.shape` is `(256, 512/ws)`.
- Per-rank `RowParallelLinear(in=512, out=256).weight.shape` is `(512/ws, 256)`.
- The output tensor shape of `ColumnParallelLinear.forward(x)` is `(M, K/ws)`.
-
-**Numerical correctness (weight ≠ zero)**: a plain shape assert misses algebraic
-errors, so verify the actual computed result with deterministic non-zero inputs/weights:
-
- **T2.a (ColumnParallel, deterministic)**: initialize the weight to a per-rank identity
-  (or a deterministic pattern such as `(i, j) → i + rank * k_local + j`)
-  via `tensor.copy_`. Set the input `x` to a constant vector, then run forward. Each rank's
-  output must **match the expected `x @ W_rank_local` within rtol/atol 1e-2** (allowing for
-  the gemm kernel's fp16 round-off).
- **T2.b (RowParallel, reduced output equality, primary)**: verify that every rank's
-  forward result matches the same global matrix product `concat([x_0..x_{ws-1}]) @ concat([W_0..
-  W_{ws-1}])`. Comparing per-rank `y.numpy()`, assert **both** (i) elementwise equality after
-  all-reduce and (ii) agreement with the expected value (computed with host-side numpy).
-  Observable-only verification; no internal hook is needed.
-
-  *Optional implementation note*: to observe the partial-sum stage more closely, an
-  intercept hook just before the `_pending_collective_handles` enqueue can be used, but
-  that couples to an internal implementation detail, so the ADR-level test contract
-  requires only T2.b's observable equality.
- **T2.c (rank-identity after all_reduce)**: every rank's `y.numpy()` is elementwise
-  identical (full array equality, not just the mean, rtol 1e-2).
-
-**Existing weak assertions are prohibited**: aggregate-only checks such as `output mean is identical`
-break silently and easily, so **do not use them as the main assertion**; use them only
-as an auxiliary sanity check.
-
-### T3. Worker-wait generalization + orphan regression: `tests/test_worker_wait_drain.py` (new)
-
-The core purpose of this test is not queue behavior but **directly preventing the ADR-0024
-Phase B orphan regression**. Assert the following:
-
- **T3.a**: When a worker calls `ctx.wait(h)`, the handle is enqueued in `_pending_worker_waits`
-  and the worker is not resumed until main drains it.
- **T3.b**: Right after `_drain_pending`, the worker resumes and the handle is in the `_completed`
-  state.
- **T3.c**: With multiple workers, all workers resume at the same drain point.
- **T3.d (orphan invariant, core)**: After a worker function calls `torch.launch(...)`,
-  at the moment the SimPy engine actually starts running, **the kernel greenlet's
-  `_parent` is the main greenlet**. The test verifies this invariant directly by
-  monkey-patching `kernel_runner.run` or by attaching an assertion hook at the point
-  where `KernelRunner._parent` is captured.
- **T3.e (symptom regression)**: Without D0, a GreenletExit failure equivalent to T3.d must
-  reproduce (documents the historical failure mode; the actual test is skipped or marked
-  xfail once D0 is introduced).
- **T3.f (idempotency)**: Calling `ctx.wait(h)` twice on the same handle invokes
-  `engine.wait` only once (D0.4-(3)).
- **T3.g (exception propagation)**: If a worker raises after calling `wait`, the main
-  scheduler loop stops immediately and the exception propagates upward. The remaining
-  `_pending_worker_waits` are not drained (D0.4-(4)).
-
-### T4. `torch.multiprocessing.spawn`: `tests/test_mp_spawn.py` (new)
-
- `spawn(fn, args, nprocs)` creates nprocs greenlets and binds each one to a rank.
- Returns after all workers complete.
- Replacing the hand-rolled loop in the existing bench `ccl_allreduce.py` with `mp.spawn`
-  still passes the matrix regression.
-
-### T5. Host-read barrier: `tests/test_host_read_barrier.py` (new)
-
-Directly verifies the D0.5 contract:
-
- **T5.a**: When a worker calls `launch → tensor.numpy()` back to back, the barrier fires and
-  the numpy result is the value after kernel completion (post-drain).
- **T5.b**: `launch → tensor.shape` (metadata) does not trigger the barrier (the pending
-  queue is left as is).
- **T5.c**: A `numpy()` call while the pending queue is empty reads immediately without
-  yielding (avoids an unnecessary context switch).
- **T5.d**: `__getitem__` and `data` trigger the same barrier as T5.a.
- **T5.e**: A `numpy()` call while a collective (all_reduce) is pending waits for the
-  collective drain before reading.
- **T5.f (copy_ write barrier)**: When `target.copy_(source)` is called while the target tensor
-  has an incomplete pending handle, a drain is triggered before the write.
-  Check that the injected host source is written over the post-drain state (no
-  stale overwrite).
- **T5.g (closed-set via registry)**: The closed set of barrier entry points is kept as an
-  **explicit registry** (e.g. `_HOST_READ_BARRIERS = frozenset
-  ({"numpy", "data", "__getitem__", "__repr__", "copy_"})` at the top of `tensor.py`).
-  The test:
-  1. Observes that each entry point listed in the registry **actually has the barrier injected**
-     (on invocation it checks the pending queue and goes through the yield path).
-  2. Adding a new API with host-read semantics requires a registry update at code review
-     (enforced via CODEOWNERS / review checklist).
-
-  **Non-goal**: Automatically detecting barrier-less APIs through Python introspection (method
-  signatures, docstring analysis, etc.) is outside the ADR scope because of precision problems.
-  The registry + review approach is sufficient.
-
-### T6. E2E: `tests/test_tp_mlp.py` (new)
-
-2-layer MLP (ColumnParallel → RowParallel) forward:
-
-**Structural / liveness**:
-
- Runs to completion with a model of `ws = SIP count` (currently 2 per topology.yaml).
- **No deadlock**: the scheduler loop terminates in finite time (e.g. pytest-timeout).
- **Completion trace**: each `launch` and `all_reduce` leaves an entry in `ctx._traces`
-  (count = expected number of layers).
-
-**Numerical correctness (required)**:
-
- **T6.a (zero-weight sanity)**: all weights 0 → all outputs 0. A smoke test that the
-  pipeline runs at all. **Not sufficient on its own; adopt it together with
-  T6.b/T6.c**.
- **T6.b (deterministic pattern)**: `copy_` a deterministic non-zero pattern into every weight
-  (e.g. all 0.01, or values derived from a per-rank identity). The input is also
-  constant. Compute the expected output with host-side numpy, then compare it against each
-  rank's `y.numpy()` at rtol 1e-2.
- **T6.c (rank-consistency post all-reduce)**: after the RowParallel all-reduce,
-  **every rank's output is elementwise identical** (same criterion as T2.c):
-  full array equality, not just matching means.
- **T6.d (shape contract)**: ColumnParallel output is `(B, D_hidden / ws)`,
-  RowParallel output is `(B, D_out)`.
-
-### T7. Regression: remove the `ring_default_ws` xfail
-
- Remove `@pytest.mark.xfail(strict=True)` from
-  `tests/test_ccl_allreduce_matrix.py::test_ccl_allreduce_matrix[ring_default_ws]` → it must **PASS**.
- Acceptance criteria (observable):
-  - **No deadlock**: the bench terminates in finite time.
-  - **No GreenletExit**: no GreenletExit trace in stderr/log.
-  - **Rank 0 output**: the string `ring_allreduce_tcm (ws=2): 2 OK` is printed.
-  - **Completion trace**: an `all_reduce` trace entry exists.
-  - **Numerical**: with each rank's input `r+1`, the result sum(1..ws)=3 is reached within
-    tolerance 1e-1.
-
-### T8. Regression: entire existing test suite
-
- Every test that passed up through ADR-0026 still passes (523 passed + 1 xfail).
- Phase 2 completion criterion: 524 passed (including the removed xfail) + 0 xfail + all of
-  the new T1~T7 tests above pass.
-
---
-
 ## Consequences

 ### Positive
@@ -1080,29 +917,3 @@ Directly verifies the D0.5 contract:

 - Adds a pure upper layer on top of the ADR-0024/0026 foundation. No impact on the
   hardware simulation stack (except D0).
-
---
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/runtime_api/context.py` | D0.1/D0.2: `_pending_worker_waits` + worker fork in `ctx.wait`; D1.3: attach `self.multiprocessing` namespace |
-| `src/kernbench/runtime_api/multiprocessing.py` | new (D1): `_MultiprocessingNamespace.spawn` + `_drain_pending` + `SpawnException` |
-| `src/kernbench/runtime_api/distributed.py` | strengthen the `_pending_collective_handles` type annotation (`list[tuple[RequestHandle, int, dict]]`); expose the clear call site for spawn exception cleanup |
-| `src/kernbench/runtime_api/tensor.py` | D0.5 barrier injection: `numpy`, `__getitem__`, `data`, `__repr__`, `copy_` (source read + target write) |
-| `src/kernbench/tp/__init__.py` | new: public API re-export |
-| `src/kernbench/tp/parallel_state.py` | new: D3 |
-| `src/kernbench/tp/layers.py` | new: D4/D5 |
-| `src/kernbench/tp/primitives.py` | new: D6 |
-| `src/kernbench/tp/kernels.py` | new: `_gemm_kernel` for TP layers (copied from bench) |
-| `src/kernbench/tp/mappings.py` | new stub (backward TODO) |
-| `benches/tp_mlp.py` | new sample (D7) |
-| `benches/ccl_allreduce.py` | hand-rolled loop replaced with `torch.multiprocessing.spawn` (D1.4) |
-| `tests/test_tp_parallel_state.py` | new (T1) |
-| `tests/test_tp_layers.py` | new (T2) |
-| `tests/test_worker_wait_drain.py` | new (T3): includes direct verification of the orphan invariant |
-| `tests/test_mp_spawn.py` | new (T4) |
-| `tests/test_host_read_barrier.py` | new (T5): D0.5 host-read barrier contract |
-| `tests/test_tp_mlp.py` | new (T6) |
-| `tests/test_ccl_allreduce_matrix.py` | remove the `ring_default_ws` xfail (T7) |
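
Reviewer note: the pruned T2/T6 sections rest on one algebraic identity, namely that a ColumnParallel layer followed by a RowParallel layer and an all-reduce sum reproduces the unsharded product. The sketch below illustrates that identity in plain Python, with no dependence on KernBench. All helper names (`matmul`, `split_cols`, `split_rows`, `all_reduce_sum`, `tp_mlp_forward`) are hypothetical and are not part of the repository; integer inputs are used so that equality is exact, whereas the ADR's tests allow rtol 1e-2 for fp16 round-off.

```python
# Hypothetical illustration only: none of these helpers exist in KernBench.

def matmul(a, b):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_cols(m, ws):
    """Shard columns into ws equal parts (ColumnParallel weight layout)."""
    step = len(m[0]) // ws
    return [[row[r * step:(r + 1) * step] for row in m] for r in range(ws)]

def split_rows(m, ws):
    """Shard rows into ws equal parts (RowParallel weight layout)."""
    step = len(m) // ws
    return [m[r * step:(r + 1) * step] for r in range(ws)]

def all_reduce_sum(parts):
    """Elementwise sum of per-rank partial outputs; every rank would see this."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*parts)]

def tp_mlp_forward(x, w1, w2, ws):
    """Each rank computes (x @ W1_r) @ W2_r; the partial results are then summed."""
    w1_shards = split_cols(w1, ws)
    w2_shards = split_rows(w2, ws)
    partials = [matmul(matmul(x, w1_shards[r]), w2_shards[r]) for r in range(ws)]
    return all_reduce_sum(partials)

# Deterministic non-zero inputs, in the spirit of T6.b.
x = [[1, 2, 3, 4], [0, 1, 0, 1]]
w1 = [[(i + j) % 3 for j in range(4)] for i in range(4)]
w2 = [[(i * j) % 5 - 1 for j in range(2)] for i in range(4)]

reference = matmul(matmul(x, w1), w2)        # unsharded host-side result
tp_result = tp_mlp_forward(x, w1, w2, ws=2)  # sharded result after all-reduce
print(tp_result == reference)
```

The sketch omits any activation between the two layers, which is why the sharded and unsharded results agree exactly; a zero-weight input would also pass this check, which is the reason T6.a alone is described as insufficient.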